## 1 Introduction

When labeled data are scarce or expensive to obtain, we often resort to semi-supervised learning, which exploits the abundance of unlabeled data. For data concentrated on a lower-dimensional manifold, it is often reasonable to assume smoothness, i.e., that data points adjacent on the manifold are likely to have similar values of the target variable (the label). Then, learning the manifold structure from both labeled and unlabeled data can assist in label prediction [1, 2, 3, 11].

In machine learning, an online method updates the model incrementally as it receives training data sequentially. This approach contrasts with offline machine learning, which generates the best model by learning on the entire training data set at once. Online learning is used either because it is computationally infeasible to train over the entire dataset, or because the algorithm has to adapt dynamically to new patterns in the data, e.g., when the data are generated in real time. The latter situation is particularly relevant in the context of neuronal networks.

Our brains likely rely on online semi-supervised learning to generate behavior. As our sensory organs stream data about the world, these data are analyzed in real time to produce behaviorally relevant output. While most of the sensory data lack labels, some supervision is available from other sources, such as inter-personal communication.

To represent a data manifold, semi-supervised learning algorithms typically construct an adjacency graph whose vertices are labeled and unlabeled data points and whose edge weights represent their adjacency on the manifold. However, such a representation is impractical in the online setting, where the data are streamed sequentially and the labels are predicted on the fly. Furthermore, the online setting does not have the memory capacity to store the past data.

Thus, there is a need for online semi-supervised algorithms both for modeling neural computation and solving general machine learning tasks. Whereas existing online algorithms [4] can rely on a sparse representation, they still require memory quadratic in the dimensionality of data. In addition, these algorithms rely on the availability of an adjacency measure between new and stored data points.

In this paper, we propose a biologically plausible neural network for online semi-supervised learning (Figure 1, left). By avoiding explicit representation of the adjacency graph, our network can process datasets of unlimited size in the online setting. Moreover, as required by biology, the network relies only on local learning rules, meaning that each synaptic weight update depends on the activity of only the two neurons that the synapse connects.

The network has two layers. The first layer learns the manifold structure of the data by representing each datum as a sparse vector whose components represent overlapping localities on the manifold. The manifold structure is captured by the correlations between the components carried by the corresponding channels. Because most existing algorithms for sparse representations, such as [10], do not have natural neural implementations, we base our work on the recently developed manifold-tiling algorithm [8]. Inspiration for such a design comes from biological neural networks, such as place cells in the rodent hippocampus.

The second layer learns a classifier using both occasional supervision and the similarity of the manifold representation of the data provided by the first layer. In our neural network, the supervision signal is not fed back from downstream layers of the network, as in perceptron or back-propagation networks, but arrives synchronously with the data from the previous layer. To make the network semi- (rather than fully) supervised, the label signal is assumed to be “silent” most of the time. The output attempts to predict the correct label when that signal is not available; otherwise, it just reproduces the label.

We derive both the activity dynamics and the learning rules in each layer from the principle of similarity preservation [6] which was previously used in the unsupervised setting. Starting with a similarity preserving objective function allows us to analyze the output of the algorithm and obtain biologically plausible local learning rules.

We demonstrate experimentally the effectiveness of this semi-supervised network compared to fully supervised online learning. Moreover, we observe that online semi-supervised learning may be competitive with offline methods, especially on smaller samples. This is an important advantage allowing our network to adapt quickly when the manifold shape or the labels are changing with time.

## 2 Review of the Manifold-Tiling Network Derived from Non-negative Similarity Matching

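To introduce our notation, let the input to the network be a set of vectors, $\mathbf{x}_t \in \mathbb{R}^n$, $t = 1, \ldots, T$, with components coming from $n$ channels at time $t$. In response, the manifold learning network layer outputs an activity vector, $\mathbf{y}_t \in \mathbb{R}^m$, $m$ being the number of output channels, or hidden units in our two-layer network (Figure 1, left).

Manifold-tiling networks have been derived [8] from similarity-preserving objectives [5] with a non-negativity constraint. Similarity preservation postulates that similar input pairs, $\mathbf{x}_t$ and $\mathbf{x}_{t'}$, evoke similar output pairs, $\mathbf{y}_t$ and $\mathbf{y}_{t'}$. Similarity of a pair of vectors can be quantified by their scalar product. Nonlinear manifolds can be learned by constraining the sign of the output and introducing a similarity threshold $\alpha$. The authors of [8] propose an optimization problem:

$$
\min_{\substack{\forall t:\ \mathbf{y}_t \ge 0,\\ \|\mathbf{y}_t\|_2^2 \le 1}} -\frac{1}{T}\sum_{t,t'} \left(\mathbf{x}_t\cdot\mathbf{x}_{t'} - \alpha\right)\mathbf{y}_t\cdot\mathbf{y}_{t'}
\;=\; \min_{\substack{Y \ge 0,\\ \mathrm{diag}(Y^\top Y) \le I}} -\frac{1}{T}\,\mathrm{Tr}\!\left(\left(X^\top X - \alpha E\right) Y^\top Y\right) \tag{1}
$$

Here matrix notation was introduced: $X \equiv [\mathbf{x}_1, \ldots, \mathbf{x}_T] \in \mathbb{R}^{n \times T}$ and $Y \equiv [\mathbf{y}_1, \ldots, \mathbf{y}_T] \in \mathbb{R}^{m \times T}$, and $E$ is a $T \times T$ matrix of all ones.

Intuitively, (1) attempts to preserve similarity for similar pairs of input samples but orthogonalizes the outputs corresponding to dissimilar input pairs. Indeed, if the input similarity of a pair of samples is above a specified threshold, $\mathbf{x}_t\cdot\mathbf{x}_{t'} > \alpha$, then the output vectors $\mathbf{y}_t$ and $\mathbf{y}_{t'}$ would prefer to have $\mathbf{y}_t\cdot\mathbf{y}_{t'} \approx 1$, i.e., they would be similar. If, however, $\mathbf{x}_t\cdot\mathbf{x}_{t'} < \alpha$, then they would tend to be orthogonal, $\mathbf{y}_t\cdot\mathbf{y}_{t'} = 0$, since the lowest value of $\mathbf{y}_t\cdot\mathbf{y}_{t'}$ for $\mathbf{y}_t, \mathbf{y}_{t'} \ge 0$ is zero. As $\mathbf{y}_t$ and $\mathbf{y}_{t'}$ are nonnegative, to achieve orthogonality, the output activity patterns for dissimilar inputs would have non-overlapping sets of active output channels. In the context of manifold representation, (1) strives to preserve in the $\mathbf{y}$-representation the local geometry of the input data cloud in $\mathbf{x}$-space and lets the global geometry emerge out of the nonlinear optimization process.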
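To make this intuition concrete, the following minimal sketch evaluates a thresholded similarity-matching objective of the form $-\frac{1}{T}\sum_{t,t'}(\mathbf{x}_t\cdot\mathbf{x}_{t'}-\alpha)\,\mathbf{y}_t\cdot\mathbf{y}_{t'}$ for two candidate non-negative codes of unit norm. The function name `nsm_objective`, the toy vectors, and the threshold `alpha = 0.5` are assumptions chosen for illustration; they are not taken from [8].

```python
def dot(a, b):
    """Scalar product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def nsm_objective(xs, ys, alpha):
    """Thresholded similarity-matching objective (assumed form, see lead-in)."""
    T = len(xs)
    total = 0.0
    for t in range(T):
        for s in range(T):
            total -= (dot(xs[t], xs[s]) - alpha) * dot(ys[t], ys[s])
    return total / T


# Toy data: the first two inputs are similar, the third is dissimilar to both.
xs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

# Code A: the dissimilar input gets its own output channel.
ys_separate = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Code B: all inputs share one output channel.
ys_shared = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

obj_separate = nsm_objective(xs, ys_separate, alpha=0.5)
obj_shared = nsm_objective(xs, ys_shared, alpha=0.5)
print(obj_separate, obj_shared)
```

In this toy case, the code that assigns the dissimilar input to its own channel attains the lower objective value, in line with the orthogonalization argument.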

Fig. 1 illustrates manifold tiling on two spiral arcs in two dimensions, showing the receptive fields of output channels in the third dimension. Receptive fields tile the arcs with overlaps, but there is no overlap between separate arcs.

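To derive a neural network that optimizes (1), we express the norm constraint in the Lagrangian form:

$$
\min_{\forall t:\ \mathbf{y}_t \ge 0}\ \max_{\forall t:\ \mathbf{z}_t \ge 0}\ -\frac{1}{T}\sum_{t,t'} \left(\mathbf{x}_t\cdot\mathbf{x}_{t'} - \alpha\right)\mathbf{y}_t\cdot\mathbf{y}_{t'} + \sum_{t} \mathbf{z}_t\cdot\mathbf{z}_t \left(\mathbf{y}_t\cdot\mathbf{y}_t - 1\right) \tag{2}
$$

Here, unconventionally, the non-negative Lagrange multipliers that impose the inequality constraints are factorized into inner products of two non-negative vectors ($\mathbf{z}_t\cdot\mathbf{z}_t$). In the second step, we introduce auxiliary variables, $W$, $\mathbf{b}$, and $V_t$ [7]:

$$
\min_{\forall t:\ \mathbf{y}_t \ge 0}\ \max_{\forall t:\ \mathbf{z}_t \ge 0}\ \min_{W}\ \max_{\mathbf{b}}\ \max_{\forall t:\ V_t \ge 0}\ T\,\mathrm{Tr}(W^\top W) - T\,\|\mathbf{b}\|_2^2 + \sum_t \left(-2\,\mathbf{y}_t^\top W \mathbf{x}_t + 2\sqrt{\alpha}\,\mathbf{y}_t\cdot\mathbf{b} - \|\mathbf{z}_t\|_2^2 + 2\,\mathbf{z}_t^\top V_t\,\mathbf{y}_t - \mathrm{Tr}(V_t^\top V_t)\right) \tag{3}
$$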

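The equivalence of (3) to (2) can be seen by performing the $W$, $\mathbf{b}$, and $V_t$ optimizations explicitly and plugging back in the optimal values. Eq. (3) suggests a two-step online algorithm (see [8] for full derivation). For each input $\mathbf{x}_t$, in the first step, one solves for $\mathbf{y}_t$, $\mathbf{z}_t$, and $V_t$ by projected gradient descent-ascent,

$$
\begin{aligned}
\mathbf{y}_t &\leftarrow \left[\mathbf{y}_t + \gamma_y\left(W\mathbf{x}_t - V_t^\top \mathbf{z}_t - \sqrt{\alpha}\,\mathbf{b}\right)\right]_+,\\
\mathbf{z}_t &\leftarrow \left[\mathbf{z}_t + \gamma_z\left(V_t\,\mathbf{y}_t - \mathbf{z}_t\right)\right]_+,\\
V_t &\leftarrow \left[V_t + \gamma_V\left(\mathbf{z}_t\,\mathbf{y}_t^\top - V_t\right)\right]_+,
\end{aligned} \tag{4}
$$

where $\gamma_{y,z,V}$ are step sizes and $[\cdot]_+$ denotes component-wise rectification. This iteration can be interpreted as the dynamics of a biologically plausible neural circuit (Fig. 1, right, the upper layer), where components of $\mathbf{y}_t$ are activities of excitatory neurons, $\mathbf{b}$ is a bias term, components of $\mathbf{z}_t$ are activities of inhibitory neurons (shown in red), and $W$ is the feedforward connectivity matrix. $V_t$ is the synaptic weight matrix from excitatory to inhibitory neurons, which undergoes fast time-scale anti-Hebbian plasticity; in computer simulation, this means repeated updates within one step. In the second step, $W$ and $\mathbf{b}$ are updated by gradient descent-ascent:

$$
W \leftarrow W + \eta\left(\mathbf{y}_t\,\mathbf{x}_t^\top - W\right), \qquad \mathbf{b} \leftarrow \mathbf{b} + \eta\left(\sqrt{\alpha}\,\mathbf{y}_t - \mathbf{b}\right),
$$

where $W$ is going through slow time-scale Hebbian plasticity and $\mathbf{b}$ through homeostatic plasticity. The parameter $\eta$ is a learning rate.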
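The two-step procedure described above can be sketched in code. The sketch below is a hedged illustration, not the implementation of [8]: it assumes the fast update takes the rectified form $\mathbf{y} \leftarrow [\mathbf{y} + \gamma(W\mathbf{x} - V^\top\mathbf{z} - \sqrt{\alpha}\,\mathbf{b})]_+$, $\mathbf{z} \leftarrow [\mathbf{z} + \gamma(V\mathbf{y} - \mathbf{z})]_+$, $V \leftarrow [V + \gamma(\mathbf{z}\mathbf{y}^\top - V)]_+$, and the slow update the form $W \leftarrow W + \eta(\mathbf{y}\mathbf{x}^\top - W)$, $\mathbf{b} \leftarrow \mathbf{b} + \eta(\sqrt{\alpha}\,\mathbf{y} - \mathbf{b})$. The function names `fast_update` and `slow_update`, the single shared step size `gamma`, and all numerical values are hypothetical.

```python
import math


def fast_update(x, y, z, V, W, b, alpha, gamma=0.1):
    """One rectified gradient update of (y, z, V) for a fixed input x."""
    m, n, k = len(W), len(x), len(z)
    root_alpha = math.sqrt(alpha)
    # Feedforward drive minus bias for each excitatory unit.
    drive = [sum(W[i][j] * x[j] for j in range(n)) - root_alpha * b[i]
             for i in range(m)]
    # Inhibition fed back from the inhibitory units.
    inhibition = [sum(V[a][i] * z[a] for a in range(k)) for i in range(m)]
    y_new = [max(0.0, y[i] + gamma * (drive[i] - inhibition[i]))
             for i in range(m)]
    z_new = [max(0.0, z[a] + gamma * (sum(V[a][i] * y[i] for i in range(m)) - z[a]))
             for a in range(k)]
    V_new = [[max(0.0, V[a][i] + gamma * (z[a] * y[i] - V[a][i]))
              for i in range(m)] for a in range(k)]
    return y_new, z_new, V_new


def slow_update(x, y, W, b, alpha, eta=0.01):
    """Slow update of feedforward weights W and biases b."""
    root_alpha = math.sqrt(alpha)
    W_new = [[W[i][j] + eta * (y[i] * x[j] - W[i][j]) for j in range(len(x))]
             for i in range(len(W))]
    b_new = [b[i] + eta * (root_alpha * y[i] - b[i]) for i in range(len(b))]
    return W_new, b_new


# Toy state: 2 inputs, 2 excitatory units, 1 inhibitory unit (all values made up).
x = [1.0, 0.0]
W = [[0.5, 0.1], [0.1, 0.5]]
b = [0.2, 0.0]
y, z, V = [0.2, 0.0], [0.5], [[0.4, 0.2]]

y1, z1, V1 = fast_update(x, y, z, V, W, b, alpha=0.25, gamma=0.1)
W1, b1 = slow_update(x, y1, W, b, alpha=0.25, eta=0.01)
```

In a full simulation, `fast_update` would be iterated until the activities settle before `slow_update` is applied once per input; the number of iterations and the initialization of `z` and `V` are further choices that this sketch leaves open.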

## 3 A Neural Network for Semi-Supervised Learning

In this section, we propose a neural network architecture for semi-supervised learning. In our approach, contrary to the widely accepted schemes, the label signal is not fed back from downstream layers of the network but comes along and synchronously with the rest of the data. To make it semi- (rather than fully) supervised, the signal is assumed to be “silent” most of the time.

Consider a classification problem with the input stream of data, $\mathbf{y}_t$, where $\mathbf{y}_t \in \mathbb{R}^m$, and the corresponding class labels $l_t$, where in a binary case $l_t \in \{-1, 1\}$. The labels are occasionally signalled by a channel carrying values $s_t = a_t l_t$, where $a_t \in \{0, 1\}$ either masks or reveals the true label. The data from the previous layer and the label channel are combined in the semi-supervised learning neuron (Fig. 1, right, bottom layer).

Consider a time period of $T$ steps, where the inputs are organized into a matrix $Y = [\mathbf{y}_1, \ldots, \mathbf{y}_T]$ and a row vector of (partly hidden) labels $\mathbf{s} = [s_1, \ldots, s_T]$. The output $\mathbf{u} = [u_1, \ldots, u_T]$ needs to reproduce the label signal, so we employ a quadratic loss function $\|\mathbf{u} - \mathbf{s}\|_2^2$. We express the assumption of smoothness of the predicted label on the manifold using the similarity alignment [7] between the input and output Gramians: $-\mathrm{Tr}(Y^\top Y\,\mathbf{u}^\top\mathbf{u})$. Also, as the label only takes values $-1$ and $1$, we restrict the output to stay within those limits. This gives rise to the following optimization problem:

$$
\min_{-1 \le \mathbf{u} \le 1}\ \|\mathbf{u} - \mathbf{s}\|_2^2 - \frac{\lambda}{T}\,\mathrm{Tr}\!\left(Y^\top Y\,\mathbf{u}^\top\mathbf{u}\right), \tag{5}
$$

where we also introduced a regularization coefficient $\lambda$ controlling the relative importance of the two parts of the objective function.

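To derive an online algorithm, following [7], we introduce an auxiliary variable $\mathbf{w}$ and expand in time:

$$
\min_{\forall t:\ -1 \le u_t \le 1}\ \min_{\mathbf{w}}\ \sum_{t=1}^{T} \left[(u_t - s_t)^2 + \lambda\left(\|\mathbf{w}\|_2^2 - 2\,u_t\,\mathbf{w}\cdot\mathbf{y}_t\right)\right] \tag{6}
$$

Optimizing over $\mathbf{w}$, we obtain $\mathbf{w} = \frac{1}{T}\sum_t u_t\,\mathbf{y}_t$, which makes it clear that the new formulation is equivalent to Eq. (5). The advantage of this formulation is that it suggests a two-step online algorithm. For each input $\mathbf{y}_t$, in the first step, one solves for the instantaneous output $u_t$ under fixed $\mathbf{w}$, given the label-channel value $s_t$:

$$
u_t = \mathrm{clip}\!\left(s_t + \lambda\,\mathbf{w}\cdot\mathbf{y}_t,\ -1,\ 1\right) \tag{7}
$$

In the second step, $\mathbf{w}$ is updated with learning rate $\eta$ as:

$$
\mathbf{w} \leftarrow \mathbf{w} + \eta\left(u_t\,\mathbf{y}_t - \mathbf{w}\right) \tag{8}
$$

This also maps well onto a biologically plausible neural network, where components of $\mathbf{w}$ are interpreted as synaptic weights updated by a local Hebbian rule. We assume that the synaptic weight of the $s_t$ channel is not changing, thus differentiating it from the other input channels. We set this weight to be equal to 1 without loss of generality. The algorithm is initialized with $\mathbf{w} = 0$, assuming no prior information.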
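The two-step online procedure described above can be sketched as follows. The sketch assumes that the output is a clipped sum of the label-channel value and the weighted input, $u = \mathrm{clip}(s + \lambda\,\mathbf{w}\cdot\mathbf{y}, -1, 1)$, and that the weights follow the Hebbian rule $\mathbf{w} \leftarrow \mathbf{w} + \eta\,(u\,\mathbf{y} - \mathbf{w})$. The class name `SemiSupervisedNeuron` and the values of `lam` and `eta` are hypothetical and chosen only for illustration.

```python
def clip(value, low=-1.0, high=1.0):
    """Restrict a scalar to the interval [low, high]."""
    return max(low, min(high, value))


class SemiSupervisedNeuron:
    """Single output neuron with a fixed-weight label channel (illustrative)."""

    def __init__(self, m, lam=1.0, eta=0.1):
        self.w = [0.0] * m   # no prior information
        self.lam = lam       # regularization coefficient
        self.eta = eta       # learning rate

    def step(self, y, s=0.0):
        """Process one input y; s is the label channel (0 when silent)."""
        drive = s + self.lam * sum(wi * yi for wi, yi in zip(self.w, y))
        u = clip(drive)
        # Local Hebbian update: depends only on input y_i and output u.
        self.w = [wi + self.eta * (u * yi - wi) for wi, yi in zip(self.w, y)]
        return u


neuron = SemiSupervisedNeuron(m=2, lam=1.0, eta=0.1)
u0 = neuron.step([1.0, 1.0])          # unlabeled input before any label: output 0
u1 = neuron.step([1.0, 0.0], s=1.0)   # labeled input active on channel 0
u2 = neuron.step([1.0, 1.0])          # unlabeled input overlapping both channels
print(u0, u1, u2, neuron.w)
```

After the third input, the weight of the second channel becomes positive even though no labeled example ever activated it alone, which is the mechanism by which label information spreads across overlapping channels.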

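An alternative objective function can be obtained by expressing the loss as $-2\,\mathbf{s}\cdot\mathbf{u}$ and adding an entropy-like regularizer treating $p_t = (1 + u_t)/2$ as a probability estimate for $l_t = 1$:

$$
\min_{\mathbf{u}}\ -2\,\mathbf{s}\cdot\mathbf{u} - \frac{\lambda}{T}\,\mathrm{Tr}\!\left(Y^\top Y\,\mathbf{u}^\top\mathbf{u}\right) + 2\sum_t \left[p_t \ln p_t + (1 - p_t)\ln(1 - p_t)\right] \tag{9}
$$

The solution of this optimization problem is the familiar sigmoidal neuron rule:

$$
u_t = \tanh\!\left(s_t + \lambda\,\mathbf{w}\cdot\mathbf{y}_t\right) \tag{10}
$$

with the same update for $\mathbf{w}$ as in Eq. (8). The behavior of both algorithms is almost indistinguishable, so we only report the results from Eq. (7).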
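As a numerical sanity check on the derivation in this section, the sketch below compares a batch objective of the form $\|\mathbf{u}-\mathbf{s}\|^2 - \frac{\lambda}{T}\|\sum_t u_t \mathbf{y}_t\|^2$ with its time-expanded counterpart $\sum_t [(u_t - s_t)^2 + \lambda(\|\mathbf{w}\|^2 - 2u_t\,\mathbf{w}\cdot\mathbf{y}_t)]$ evaluated at $\mathbf{w} = \frac{1}{T}\sum_t u_t\mathbf{y}_t$. Both forms are assumptions stated here for the purpose of the example, and the function names and toy data are hypothetical.

```python
def batch_objective(Y, s, u, lam):
    """Quadratic loss minus similarity alignment; Y is a list of T input vectors."""
    T, m = len(Y), len(Y[0])
    loss = sum((ut - st) ** 2 for ut, st in zip(u, s))
    Yu = [sum(u[t] * Y[t][i] for t in range(T)) for i in range(m)]
    return loss - lam / T * sum(v * v for v in Yu)


def expanded_objective(Y, s, u, w, lam):
    """Time-expanded objective with auxiliary weight vector w."""
    w_norm2 = sum(wi * wi for wi in w)
    total = 0.0
    for yt, st, ut in zip(Y, s, u):
        wy = sum(wi * yi for wi, yi in zip(w, yt))
        total += (ut - st) ** 2 + lam * (w_norm2 - 2.0 * ut * wy)
    return total


def optimal_w(Y, u):
    """Minimizer of the expanded objective over w for fixed outputs u."""
    T, m = len(Y), len(Y[0])
    return [sum(u[t] * Y[t][i] for t in range(T)) / T for i in range(m)]


Y = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
s = [1.0, 0.0, -1.0]      # middle sample is unlabeled
u = [1.0, 0.2, -1.0]      # some candidate outputs
lam = 0.5

w_star = optimal_w(Y, u)
value_batch = batch_objective(Y, s, u, lam)
value_expanded = expanded_objective(Y, s, u, w_star, lam)
value_perturbed = expanded_objective(Y, s, u, [w_star[0] + 0.1, w_star[1]], lam)
print(value_batch, value_expanded, value_perturbed)
```

The first two printed values agree up to rounding, and any other choice of `w` gives a larger value of the expanded objective.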

## 4 Numerical Experiments

We apply our algorithm to the synthetic dataset designed as “two moons”: two classes are sets of points in 2D, each concentrated around a spiral arc, Fig. 2, top. Such a synthetic dataset is widely used as a test for semi-supervised learning algorithms (see, e.g., [2, 4]). Note that the classes are not linearly separable, and can be separated only when their manifold structure is discovered. Upon discovering the manifold structure, intuitively, the data can be classified using only one labeled example for each class, see red asterisks in Fig. 2, top.

Our network solves this highly non-linear classification problem. The first layer learns units that tile each “moon” with overlaps, while no unit is shared between the two moons. The second layer propagates labeling information along links formed by correlations in the tile responses. We generated data points randomly and uniformly, placing only two labeled points early in the data stream. We used a tiling layer with 40 neurons and a semi-supervised neuron with . Fig. 2, bottom row, illustrates the working of the semi-supervised neuron. As seen in Fig. 2, bottom left, the output is zero until labeled points arrive. Then there is a transition period, during which the label signals propagate along correlated tiles. Finally, the responses stabilize to the correct values: 1 for green, -1 for blue. Fig. 2, bottom right, illustrates the propagation of the labeling information. Initially, all weights are zero. When a labeled point arrives, the weights corresponding to the tiles overlapping this point increase in absolute value. That signal gradually propagates until all synapses corresponding to the “green moon” get positive weights and all “blue” ones get negative weights.

Figure 2 (caption fragment): every …th time point is shown. Arrows indicate tiles where labeled points fall.

Next, we apply our network to a larger dataset, a 3D chessboard on a Swiss roll (Fig. 3, left). All the data live on a 2D Swiss roll manifold, and the two classes are defined by the squares of the chessboard. We consider chessboards with varying square sizes, the most fine-grained chessboard being the most difficult to classify. Whereas linear classifiers per se cannot solve this problem, after learning the manifold, classification is linear.

We compare our semi-supervised algorithm with an online fully supervised classifier, logistic regression. Both algorithms get the same input stream of 2000 data points, of which 50, 100, or 200 randomly selected points are labeled and the rest are unlabeled. The input for both classification algorithms is the output of tiling with 200 neurons. Parameters for our neuron and the learning rate for logistic regression are selected for the best results of each algorithm. All runs are repeated 10 times to obtain error bars.

Both algorithms classify each input using their current weights. However, the fully supervised algorithm cannot update its weights when an unlabeled example arrives, unlike the semi-supervised algorithm. Indeed, experiments show that the semi-supervised neural network performs better than the supervised classifier (Fig. 3, center), demonstrating its ability to take advantage of unlabeled data.

Next, we compare our online algorithm with an offline semi-supervised learning algorithm. For the latter, we use a state-of-the-art linear SVM with Laplacian penalty following [2], but with a twist (in a separate experiment, we made sure that this twist only improves the results): for the linear case, we assume smoothness of weights rather than labels. This means that components of the separating vector $\mathbf{w}$ should have similar values when the corresponding tiling components have highly overlapping receptive fields. The degree of overlap between receptive fields can be measured by dot products between tiling components and can be calculated on both labeled and unlabeled data. Then the Gramian can be thought of as the adjacency matrix of a graph whose vertices are tiling components. The graph will be fairly sparse due to the nature of tiling. The graph Laplacian penalty is then:

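$$
\sum_{i,j} G_{ij}\,(w_i - w_j)^2 = 2\,\mathbf{w}^\top L\,\mathbf{w}, \tag{11}
$$

where $G = Y Y^\top$ is the Gramian of the tiling components, $L = D - G$ is the graph Laplacian, and $D$ is the diagonal matrix with $D_{ii} = \sum_j G_{ij}$,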

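and the objective function of the linear SVM with Laplacian penalty takes the form:

$$
\min_{\mathbf{w}}\ \sum_{t} \max\!\left(0,\ 1 - l_t\,\mathbf{w}\cdot\mathbf{y}_t\right) + \gamma\,\mathbf{w}^\top L\,\mathbf{w}, \tag{12}
$$

where the index $t$ runs through the labeled samples only, with $l_t$ being labels, $L$ is the graph Laplacian of the tiling components, and $\gamma$ sets the strength of the Laplacian penalty.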
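The construction of the weight-smoothness penalty can be illustrated with a short sketch. It assumes the unnormalized graph Laplacian $L = D - G$ built from the Gramian $G_{ij} = \sum_t y_{t,i}\,y_{t,j}$ of the tiling outputs, and checks the identity $\mathbf{w}^\top L\,\mathbf{w} = \frac{1}{2}\sum_{i,j} G_{ij}(w_i - w_j)^2$. The function names and the toy tiling outputs are hypothetical.

```python
def gramian(Y):
    """G[i][j] = sum over samples of y_i * y_j; Y is a list of tiling outputs."""
    m = len(Y[0])
    return [[sum(y[i] * y[j] for y in Y) for j in range(m)] for i in range(m)]


def laplacian_penalty(G, w):
    """Quadratic form w^T (D - G) w with D the diagonal matrix of row sums."""
    m = len(w)
    degree = [sum(G[i]) for i in range(m)]
    diagonal_part = sum(degree[i] * w[i] ** 2 for i in range(m))
    adjacency_part = sum(G[i][j] * w[i] * w[j]
                         for i in range(m) for j in range(m))
    return diagonal_part - adjacency_part


def pairwise_penalty(G, w):
    """Equivalent form: half the weighted sum of squared weight differences."""
    m = len(w)
    return 0.5 * sum(G[i][j] * (w[i] - w[j]) ** 2
                     for i in range(m) for j in range(m))


# Three channels; channels 0-1 and 1-2 overlap, channels 0-2 do not.
Y = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 0.0]]
G = gramian(Y)

smooth_w = [1.0, 1.0, 1.0]
rough_w = [1.0, 0.0, -1.0]
print(laplacian_penalty(G, smooth_w), laplacian_penalty(G, rough_w))
```

Weights that are constant across overlapping channels incur no penalty, whereas weights that differ between overlapping channels are penalized in proportion to the overlap.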

In this experiment, the online algorithm is fed a data stream with 0.05% of samples randomly labeled. Then, at every 500th step, the classification rule obtained up to this point is applied to a separate test set of 2000 samples. At the same step, the SVM with Laplacian regularization, Eq. (12), is trained on all data seen online so far and tested against the same test set. As before, the input for both algorithms is the output of tiling with 200 neurons. Parameters for our neuron and learning rate for logistic regression are selected for the best results of each algorithm. All runs are repeated 10 times to obtain error bars.

The offline algorithm has the advantage of considering all data samples before deciding on the labels, while the online algorithm has to assign a label estimate to each data sample as it appears. The results in Fig. 3, right, show, however, that with enough smoothness (i.e., coarser granularity in the “Chessboard” example), the online algorithm performs close to the offline one. Moreover, the online algorithm can perform better than the offline one while the number of presented data points is small (e.g., fewer than approximately 1200 with granularity 0.5). Small sample sizes are exactly the situation where semi-supervised learning is supposed to be helpful. The ability of the online algorithm to adapt quickly is also important when there is a drift in the manifold shape or the labels.

## 5 Relation to Graph Laplacian

Existing algorithms for semi-supervised learning on manifolds typically utilize the graph Laplacian for smoothness regularization [2, 3, 11]; see the last term in Eq. (12). This follows from the analysis of [9], which showed that graph Laplacian regularization results in a classifier corresponding to the normalized graph cut, which helps avoid heavily imbalanced classes. In contrast, our smoothness term, the last term in Eq. (6), lacks the diagonal normalization of the Laplacian. When optimized exactly, it should lead to the minimum cut of the graph, which is prone to generating classes of very different sizes [9]. Consider a simple example of a square, where two labeled points for two classes are close to diagonally opposite corners. Laplacian regularization would cut the square in half approximately along the other diagonal (Fig. 4, left), while the minimum graph cut would lead to a highly asymmetric solution: one predicted label concentrates closely around one of the labeled points, and the rest of the square is occupied by the other label.

However, in our experiments we very rarely observe this trend towards asymmetrical solutions. To develop an intuition for why this happens, consider a period in the learning process during which the label channel is silent ($s_t = 0$) and the output $u_t$ is not reaching the limits yet. This is the decisive period, where the label information propagates between the synaptic weights, see Fig. 2, bottom left. Then (7) becomes simply $u_t = \lambda\,\mathbf{w}\cdot\mathbf{y}_t$. Assume the input points arrive i.i.d.; then so do the $\mathbf{y}_t$ vectors. Then, substituting the expression for $u_t$ into (8), we can write an expectation for one component of $\mathbf{w}$:

$$
\mathbb{E}\left[\Delta w_i\right] = \eta\left(\lambda \sum_j C_{ij}\,w_j - w_i\right), \tag{13}
$$

where we defined $\Delta w_i$ as the change of $w_i$ in one update, and $C_{ij} = \mathbb{E}\left[y_{t,i}\,y_{t,j}\right]$. Now $C$ can be seen as the adjacency matrix of a weighted graph, where vertices are tiling channels, in a manner analogous to the Gramian matrix appearing in Eq. (11) in the previous section. The term $\sum_j C_{ij}\,w_j$, appearing on the right-hand side of Eq. (13), has the effect of “smoothing” out $\mathbf{w}$ over the tiling channels, in a manner analogous to the effect of the Laplacian penalty presented in (12). Essentially, the Laplacian penalty causes the components of $\mathbf{w}$ to diffuse over the graph [11]. So, in expectation, the evolution of $\mathbf{w}$ in our algorithm would share some features with the gradient descent of $\mathbf{w}$ to optimize the expression in (12). “Smoothness” of the resulting $\mathbf{w}$ over channels translates to smoothness of prediction over input space, thereby reducing the likelihood of extremely imbalanced solutions.
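The smoothing effect described above can be illustrated by iterating an expected update of the assumed form $w_i \leftarrow w_i + \eta\,(\lambda \sum_j C_{ij} w_j - w_i)$ on a small chain of channels. The matrix `C` below is a made-up second-moment matrix in which only neighboring channels overlap, and the function name `expected_update` and the values of `lam` and `eta` are hypothetical.

```python
def expected_update(w, C, lam, eta):
    """One step of the expected weight dynamics during a silent-label period."""
    m = len(w)
    return [w[i] + eta * (lam * sum(C[i][j] * w[j] for j in range(m)) - w[i])
            for i in range(m)]


# Chain of three tiling channels: 0-1 and 1-2 overlap, 0-2 do not.
C = [[0.50, 0.25, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.25, 0.50]]

w0 = [1.0, 0.0, 0.0]   # label information initially confined to channel 0
w1 = expected_update(w0, C, lam=2.0, eta=0.1)
w2 = expected_update(w1, C, lam=2.0, eta=0.1)
print(w1, w2)
```

After one step the weight of the neighboring channel becomes nonzero, and after two steps the information has reached the channel that has no direct overlap with the labeled one, mirroring diffusion over the graph. In the actual network, growth of the weights is limited by the clipping of the output, which this linearized sketch does not model.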

We illustrate this with a simulation experiment on the square in Fig. 4. Ideally, there should be an equal number of predicted labels for both classes. We therefore measure the imbalance by looking at the fraction of predictions associated with the majority class. This measure of imbalance ranges from 0.5 to 1.0. For each run, we generate 2000 unlabeled sample points uniformly on the square, plus 2 labeled points near the corners. These data were fed to our network with 50 tiling channels and in (6). For comparison, the output of the tiling layer is also used as input to a linear classifier with Laplacian regularization. The histogram of results after 100 such runs is presented in Fig. 4, right. While the results for our network indeed fluctuate more than those of the Laplacian regularization approach, extreme imbalances are rare in both approaches.
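The imbalance measure used here can be computed as in the following sketch. The function name `imbalance` is hypothetical, and the convention that a prediction counts as the positive class when it is strictly greater than zero is an assumption of this example.

```python
def imbalance(predictions):
    """Fraction of predictions that fall in the majority class (0.5 to 1.0)."""
    positive = sum(1 for p in predictions if p > 0)
    negative = len(predictions) - positive
    return max(positive, negative) / len(predictions)


balanced = imbalance([1, -1, 1, -1])
skewed = imbalance([1, 1, 1, -1])
degenerate = imbalance([1, 1, 1, 1])
print(balanced, skewed, degenerate)
```

A perfectly balanced split of the square gives 0.5, while a solution in which one label occupies the whole square gives 1.0.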

## 6 Conclusion

We presented a neural network that learns low-dimensional manifolds in the data stream and then learns a classifier in a semi-supervised setting, where only a small part of the inputs is labeled. The network operates in an online fashion, producing an output immediately after seeing every input. Weights are updated by a biologically plausible local Hebbian-type rule. We demonstrated the effectiveness of the network in simulations, comparing it with a fully supervised online algorithm and with a semi-supervised offline algorithm.

## Acknowledgements

The authors are grateful to Victor Minden and Mariano Tepper for their insightful comments. We thank Johannes Friedrich, Tiberiu Tesileanu and Charles Windolf for helpful discussions.

## References

- [1] Ando, R.K., Zhang, T.: Learning on graph with laplacian regularization. In: Advances in neural information processing systems, pp. 25–32 (2007). doi: 10.7551/mitpress/7503.003.0009
- [2] Belkin, M., Niyogi, P., Sindhwani, V.: Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of machine learning research 7(Nov), 2399–2434 (2006)
- [3] Bengio, Y., Delalleau, O., Le Roux, N.: Label propagation and quadratic criterion. In: Semi-Supervised Learning, chap. 11. MIT Press (2006). doi: 10.7551/mitpress/9780262033589.001.0001
- [4] Goldberg, A.B., Li, M., Zhu, X.: Online manifold regularization: A new learning setting and empirical study. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 393–407. Springer, Berlin, Heidelberg (2008). doi: 10.1007/978-3-540-87479-9_44
- [5] Pehlevan, C., Chklovskii, D.: A normative theory of adaptive dimensionality reduction in neural networks. In: C. Cortes, N.D. Lawrence, D.D. Lee, M. Sugiyama, R. Garnett (eds.) Advances in Neural Information Processing Systems 28, pp. 2269–2277. Curran Associates, Inc. (2015)
- [6] Pehlevan, C., Hu, T., Chklovskii, D.: A hebbian/anti-hebbian neural network for linear subspace learning: A derivation from multidimensional scaling of streaming data. Neural Comput 27, 1461–1495 (2015). doi: 10.1162/neco_a_00745
- [7] Pehlevan, C., Sengupta, A.M., Chklovskii, D.B.: Why do similarity matching objectives lead to hebbian/anti-hebbian networks? Neural computation 30(1), 84–124 (2018). doi: 10.1162/neco_a_01018
- [8] Sengupta, A., Pehlevan, C., Tepper, M., Genkin, A., Chklovskii, D.: Manifold-tiling localized receptive fields are optimal in similarity-preserving neural networks. In: S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett (eds.) Advances in Neural Information Processing Systems 31, pp. 7080–7090. Curran Associates, Inc. (2018)
- [9] Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8) (2000). doi: 10.1109/cvpr.1997.609407
- [10] Yu, K., Zhang, T., Gong, Y.: Nonlinear learning using local coordinate coding. In: Advances in neural information processing systems, pp. 2223–2231 (2009)
- [11] Zhu, X., Ghahramani, Z., Lafferty, J.D.: Semi-supervised learning using gaussian fields and harmonic functions. In: Proceedings of the 20th International conference on Machine learning (ICML-03), pp. 912–919 (2003)
