
Confusion-based rank similarity filters for computationally-efficient machine learning on high dimensional data

09/28/2021
by   Katharine A. Shapcott, et al.


Abstract

We introduce a novel type of computationally efficient artificial neural network (ANN) called the rank similarity filter (RSF). RSFs can be used to both transform and classify nonlinearly separable datasets with many data points and dimensions. The weights of RSF are set using the rank orders of features in a data point, or optionally the ‘confusion’ adjusted ranks between features (determined from their distributions in the dataset). The activation strength of a filter determines its similarity to other points in the dataset, a measure related to cosine similarity. The activation of many RSFs maps samples into a new nonlinear space suitable for linear classification (the rank similarity transform (RST)). We additionally used this method to create the nonlinear rank similarity classifier (RSC), which is a fast and accurate multiclass classifier, and the nonlinear rank similarity probabilistic classifier (RSPC), which is an extension to the multilabel case. We evaluated the classifiers on multiple datasets and RSC was competitive with existing classifiers but with superior computational efficiency. Open-source code for RST, RSC and RSPC was written in Python using the popular scikit-learn framework to make it easily accessible. In future extensions the algorithm can be applied to specialised hardware suitable for the parallelization of an ANN (GPU) and a Spiking Neural Network (neuromorphic computing) with corresponding performance gains. This makes RSF a promising solution to the problem of efficient analysis of nonlinearly separable data.

Keywords

Machine learning; vector quantization; rank; confusion; nonlinear

Introduction

Data that varies in a nonlinear manner is common in real-world datasets, requiring nonlinear classifiers to separate classes. However, nonlinear classifiers are comparatively either computationally inefficient on these large datasets or need multiple runs to find appropriate parameters, for example choosing the correct kernel for a support vector machine (SVM). This inefficiency wastes energy, resources and time, especially as datasets have increased in size and dimensionality in recent years [1]. The novel method we present here was motivated by the desire to create an efficient, nonlinear classifier that can scale to big data and high dimensions with little prior knowledge of the data.

To solve pattern recognition tasks like classification, vector quantization (VQ) methods have been widely used. These classifiers include SVM, learning vector quantization (LVQ), neural gas and self-organizing maps (SOM), and have been used to solve a wide range of machine learning tasks [2]. Those that use vector prototypes have the additional benefits that they are intuitive, have low complexity and are relatively computationally efficient. It has recently been shown that vector prototype layers can be added to deep neural networks in order to increase their interpretability while maintaining performance [3]. Due to the increasing use of machine learning in society, moving towards intuitive models is an important goal [4].

In this paper we present a new class of artificial neural network (ANN), rank similarity filters (RSFs). These are filters which can perform supervised VQ using a weight matrix based on the order statistics of the input features. The activation strength of many of these filters in response to an input gives a relative measure of similarity. We show that the order statistics of features are usually sufficient to approximate confusion values derived from the data. We created an open-source package which implements the algorithm in Python using the scikit-learn framework (found here: https://github.com/KatharineShapcott/rank-similarity) and validate the efficiency and accuracy of our method on real-world datasets. We demonstrate that RSFs are computationally efficient in use of CPU time (and energy) during both training and evaluation. We show their utility in both data transform and classification and discuss possible extensions.
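As a quick orientation, the snippet below sketches how scikit-learn-compatible estimators of this kind are used; the class name RankSimilarityClassifier and its import path are assumptions for illustration only and may differ from the actual API in the repository linked above.

```python
# Hedged usage sketch: the estimator name and import path below are
# assumed, not confirmed against the rank-similarity package. The
# fit/score pattern itself is standard scikit-learn.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from rank_similarity import RankSimilarityClassifier  # assumed import

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rsc = RankSimilarityClassifier()  # default parameters, as evaluated in the paper
rsc.fit(X_train, y_train)
print("test accuracy:", rsc.score(X_test, y_test))
```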

This paper is organised as follows: first we demonstrate the novelty of RSF in comparison to related methods; next we justify our approach theoretically; then we explain the rank similarity transform and classifier algorithms; and finally we demonstrate the algorithms' computational efficiency compared to other methods, as measured by elapsed CPU time.

Related work

Our method is a novel type of nonparametric vector quantization (VQ). VQ is used not only for pattern recognition but also for data compression or approximate nearest neighbor searches [5]. Since data are not spread evenly through the entire feature space, mapping the feature space into an appropriately spaced "codebook" results in a good approximation of the true data space at a much lower storage (and search) cost. K-means (or k-means++) is the most well-known VQ method, and is popular because it is a fast and simple algorithm with many uses [6]. K-means makes unsupervised partitions of the data into equal-variance clusters, and the centroids of those clusters make up the codebook. In VQ methods new data points can be assigned to part of the codebook according to their similarity.

Biologically inspired VQ methods also exist, of which SOM is a well-known example. SOM is a type of ANN with a topological and competitive learning algorithm that performs VQ [7]. It has been extensively used for nonlinear dimensionality reduction and text classification, and has even been modified to solve the travelling salesman problem [8]. Very briefly, when the SOM codebook is trained on a new data point, the neuron (node) with the most similar weight vector (usually measured either with Euclidean distance or with a dot product) updates its weights towards the new vector. All other neurons then update their weights depending on their topological distance from that neuron. In the present work we also use the dot product as similarity measure, due to its biological plausibility [9]. The neural gas [10] is a type of SOM which performs pure VQ. It does not update weight vectors according to a topography but instead according to their ranked similarity to the new training data point.

Supervised biologically inspired VQ methods can perform classification. SVM is a popular example and an extremely successful classifier across many datasets [11]. However, it has the drawback that it is not intuitively understandable, since the "support vectors" in the codebook are the extreme borders between classes. LVQ is an ANN method which instead finds prototypical vectors which are similar to members of the class [12]. The generalized LVQ (GLVQ) is a mathematically tractable version in which prototypes update their weights to minimize a cost function [13]. Like SVM, their performance can be improved with a kernel; however, this may not be desirable as it makes them no longer intuitive [14]. All these prototype VQ methods have the drawback of being computationally inefficient on modern high-dimensional datasets.

In order to make an efficient codebook, other VQ methods use e.g. sparseness or tree structures [5]. Here we instead created a novel nonparametric codebook from the rank transform of data points, resulting in ranked prototypical vectors (the rank similarity filters, RSFs). Rank-transformed data are robust to outliers, are agnostic to changes in scale and are nonparametric, sharing similar advantages to nonparametric statistical methods [15]. As the codebook is based on ranks it need only store integer values up to the number of features, which is memory efficient and creates a greatly reduced search space. Importantly, the prototypes in this codebook do not represent feature magnitude but instead reflect the relationship between their features, as detailed in the following section.

Discriminability of ranked features

Confusion and discriminability

Classes in a dataset can be represented by estimating the distribution of the values taken by each of their features (Figure 1A). If these features are ordered by their mean value (highlighted curves in Figure 1A), it is possible to estimate the confusion between features: the probability that features taking certain values are drawn from one of two neighbouring distributions and not the other. In particular, given two distributions $p$ and $q$, the probability that an observed value $x$ is drawn from $p$ is

$$P(p \mid x) = \frac{p(x)}{p(x) + q(x)} \qquad (1)$$

Taking the expectation over $x$ gives a symmetric measure of confusion

$$C(p, q) = \int_{S_p \cup S_q} \frac{p(x)\,q(x)}{p(x) + q(x)}\,dx \qquad (2)$$

where $S_p$ and $S_q$ are the supports of $p$ and $q$ respectively. If $p$ and $q$ are entirely disjoint the confusion will be $0$, and if $p = q$ the confusion will take its maximum value of $1/2$.

The discriminability of two distributions $p$ and $q$ can be defined as $D(p, q) = 1 - C(p, q)$, and is the expected probability that a given random sample can be assigned to the correct distribution based on its value. The minimum value of $D$ is $1/2$, which is obtained when $p = q$. In this case there is still an even chance of assigning any sample to its correct distribution. This means that $D$ is not a metric on probability distributions. Figure 1B plots the pairwise discriminability between each pair of features in the classes shown in Figure 1A. The greater the overlap between distributions the lower the discriminability (Figure 1C).

Fig 1: Separation of features using confusion. A. Examples of the distributions of features of two classes (top and bottom). The features are rank-ordered by mean for the top class (coloured lines) and the overlapping distributions show the potential overlap of different realisations. B. Pairwise discriminability between feature values in each of the classes in Panel A. Top right corresponds to the top class and bottom left to the bottom class. C. Discriminability as a function of feature value (black line) for pairs of distributions with different degrees of overlap (top to bottom). Distributions with more overlap lead to a slower increase in discriminability with feature value. D. Comparison of different distribution distance measures as a function of the mean separation of two gaussians (left) and two exponential distributions (right), each with three different standard deviations (solid, dashed, and dotted lines). Black shows the confusion measure, yellow the Wasserstein distance (same for all standard deviations), and green the Kullback-Leibler divergence. E. Example of the confusion measure (Eq 3, black solid line) applied cumulatively to features with means drawn from uniform (top), gaussian (middle), and exponential (bottom) distributions. The red shaded region shows the variability around the mean of each feature and the grey dashed line shows the cumulative rank value (without accounting for confusion).

Comparison of confusion with other probability distances

A number of other measures can be used to quantify the distance between two probability distributions. The most widely known are the Kullback-Leibler divergence [16] and the Wasserstein distance [17]. Briefly, the Kullback-Leibler divergence gives the information loss in representing samples from a distribution $p$ by a distribution $q$, and the Wasserstein distance gives the weighted difference in probability between $p$ and $q$. Both are unbounded, with the Kullback-Leibler divergence being infinite for non-overlapping distributions and the Wasserstein distance growing linearly with the distance between distributions. In contrast, the confusion measure is bounded, reaching its limiting value when $p$ and $q$ are perfectly discriminable; when presented with a value drawn from either distribution it is then possible to assign it to its source with 100% chance. Figure 1D shows this effect in action by comparing the distance measures for pairs of gaussian (left) and exponential (right) distributions with varying standard deviations as a function of the distance between their means. The Wasserstein distance (yellow) is independent of the standard deviations of the features around their means, and both the Kullback-Leibler divergence (green) and the Wasserstein distance grow unboundedly as a function of separation.
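As a numerical illustration of the confusion measure in Eq (2) (in our reconstructed notation; this is a sketch, not the authors' reference implementation), the following NumPy snippet integrates $C(p, q)$ for two gaussians and shows the bounded behaviour described above: roughly $1/2$ for identical distributions and near $0$ once they are well separated.

```python
import numpy as np
from scipy.stats import norm

def confusion(p_pdf, q_pdf, grid):
    """Numerically integrate C(p, q) = int p q / (p + q) dx
    (Eq 2 as reconstructed here) on a fixed grid."""
    p, q = p_pdf(grid), q_pdf(grid)
    integrand = np.where(p + q > 0, p * q / (p + q), 0.0)
    return np.trapz(integrand, grid)

grid = np.linspace(-20, 20, 20001)
for sep in [0.0, 1.0, 3.0, 10.0]:
    c = confusion(norm(0, 1).pdf, norm(sep, 1).pdf, grid)
    print(f"mean separation {sep:4.1f}: confusion = {c:.4f}")
# separation 0.0 gives ~0.5 (identical distributions); large
# separations give confusion near 0 (perfectly discriminable)
```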

Mapping confusion to filters

The order statistics of features convey information about the identity of an object. We propose that the relative weighting of order statistics should depend on the confusion between their values over different examples in the training set. In particular, the size of a gap between values assigned to features with consecutive order statistics should depend on their discriminability, both from each other and from the extreme values of the object. If features are sorted by their mean values over a representative training set, then their values are assumed to take distributions $p_1, \dots, p_d$, where the index represents the order of the sample mean. Then the map between consecutive values is given by

$$v_{i+1} = v_i + \left(1 - 2\,C(p_i, p_{i+1})\right), \qquad v_1 = 0 \qquad (3)$$

In Figure 1E, this process is applied to data generated with random feature means drawn from uniform (top), gaussian (middle), and exponential (bottom) distributions. In each case the distribution about the mean is gaussian (red shaded area shows two standard deviations around the mean). For data drawn from distributions without a high degree of skew (top two panels), the confusion measure is very well approximated by the mean rank (grey dashed lines). We find that this is the case in the majority of common datasets (with some exceptions, see below), and so propose the simpler rank-based filter as an efficient and powerful heuristic in most cases.
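To make the cumulative mapping concrete, here is a small sketch using the form of Eq (3) as reconstructed above (the exact formula is our assumption): confusion-adjusted values are built for a handful of gaussian features and compared to the plain ranks. Well-separated features get rank-like unit gaps; heavily overlapping features collapse towards the same value.

```python
import numpy as np
from scipy.stats import norm

def confusion(p, q, grid):
    fp, fq = p.pdf(grid), q.pdf(grid)
    return np.trapz(np.where(fp + fq > 0, fp * fq / (fp + fq), 0.0), grid)

rng = np.random.default_rng(0)
means = np.sort(rng.uniform(0, 10, size=9))   # sorted feature means
dists = [norm(m, 1.0) for m in means]
grid = np.linspace(-10, 30, 4001)

# Eq (3) as reconstructed: the gap between consecutive features is
# 1 - 2*C, so perfectly confusable features share a value and
# perfectly discriminable ones are a full rank apart.
v = np.zeros(len(dists))
for i in range(1, len(dists)):
    v[i] = v[i - 1] + 1.0 - 2.0 * confusion(dists[i - 1], dists[i], grid)

print("plain ranks:        ", np.arange(len(dists)))
print("confusion-adjusted: ", np.round(v, 2))
```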

Rank similarity filters

The algorithm

A short description of the training algorithm is as follows:

  1. OPTIONAL: A distribution for the filters is calculated.

  2. Filters are initialized with random data points from the training data.

  3. Filters are assigned to the data points that they are most strongly activated by.

  4. A new filter is created from the assigned data points. Repeat from step 3 until only $\epsilon$ points move to another filter.

  5. L1 normalize each filter using $f_i \leftarrow f_i / \lVert f_i \rVert_1$.

This is shown in pseudo-code in Algorithm 1. $X$ is the training data with $d$ features and $n$ samples. A single rank filter $f_i$ belongs to the set $F$ containing $m$ filters, where $m$ is less than $n$. Features are indexed $j = 1, \dots, d$ and filters $i = 1, \dots, m$.

1: Input: X (training data)
2: Output: F (filters)
3: function RankSimilarityFilter(X, n-filters, distribution)
4:     D ← CreateDistribution(X, distribution)
5:     F ← InitializeFilters(X, D, n-filters)
6:     F ← SpreadFilters(X, F, D)
7:     F ← L1norm(F)
8:     return F
Algorithm 1 Rank Similarity Filters(X): creates filters F from a dataset

The first step of creating a distribution is optional because for many datasets the numeric ranks are already discriminable enough for practical use. A distribution can be created by any function that produces a sorted vector with a length equal to the number of features. Creating the confusion distribution was performed as outlined in the Methods.

For the initialisation of each filter, a random data point is drawn from $X$ and its numeric rank calculated. This can either be used to draw from a distribution $D$ or returned directly, as shown in Algorithm 2.

1: Input: X, D, n-filters
2: Output: F
3: function InitializeFilters(X, D, n-filters)
4:     for i ← 1 to n-filters do
5:         x ← random(X) ▷ Without replacement
6:         r ← rank(x)
7:         if D is None then
8:             f_i ← r
9:         else
10:             f_i ← D[r]
11:     return F
Algorithm 2 InitializeFilters(X, D, n-filters): creates filters randomly from X with rank values, optionally drawn through distribution D

The initialized filters are then spread throughout the data using a greedy algorithm. The aim is to find the set of points in $X$ that are most similar to each filter $f_i$. To this end we find the maximum of a dot product across all filters for each $x$ in $X$:

$$X_i = \left\{ x \in X : i = \arg\max_k \, f_k \cdot x \right\} \qquad (4)$$

To update the weights of filter $f_i$ we then compute

$$r_i = \mathrm{rank}\Big( \sum_{x \in X_i} x \Big) \qquad (5)$$

which, as before, can either be used to draw from the distribution $D$ or assigned directly. This process is repeated until only $\epsilon$ points move to a different filter, in order to ensure the filters are spread evenly throughout the data, as shown in Algorithm 3.

1: Input: X, F, D, ε
2: Output: F
3: function SpreadFilters(X, F, D, ε)
4:     n-moved ← ∞
5:     while n-moved > ε do
6:         n-moved ← 0
7:         for all f_i in F do
8:             /* Update filters using most similar data points in X */
9:             X_i ← {x in X : i = argmax_k f_k · x}
10:             r ← rank(sum of x over X_i)
11:             if D is None then
12:                 f_i ← r
13:             else
14:                 f_i ← D[r]
15:         /* Check if filters have converged */
16:         if exists x in X whose filter assignment changed then
17:             n-moved ← number of such points
18:
19:     return F
Algorithm 3 SpreadFilters(X, F, D, ε): spread filters throughout X using distribution D until only ε points change membership
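The following NumPy sketch summarises the training loop (Algorithms 1-3) as we have reconstructed it for the plain rank variant, without the optional confusion distribution; variable names and the convergence check are our own choices, not the reference implementation.

```python
import numpy as np
from scipy.stats import rankdata

def fit_rank_similarity_filters(X, n_filters, eps=0, rng=None, max_iter=100):
    """Sketch of Algorithms 1-3 (rank variant, no confusion distribution).

    X: (n_samples, d) array. Returns L1-normalised filters (n_filters, d).
    """
    rng = np.random.default_rng(rng)
    # Initialise each filter with the feature ranks of a random training point
    idx = rng.choice(len(X), size=n_filters, replace=False)
    F = rankdata(X[idx], axis=1).astype(float)

    assignment = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assign every point to its most strongly activated filter (Eq 4)
        new_assignment = np.argmax(X @ F.T, axis=1)
        n_moved = np.sum(new_assignment != assignment)
        assignment = new_assignment
        if n_moved <= eps:          # converged: few points changed filter
            break
        # Recompute each filter from the ranks of its assigned points (Eq 5)
        for i in range(n_filters):
            members = X[assignment == i]
            if len(members):
                F[i] = rankdata(members.sum(axis=0))
    return F / np.abs(F).sum(axis=1, keepdims=True)   # L1 normalise
```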

After performing the L1 norm the filters have been trained. This is enough to transform data points into a new space in an unsupervised manner based on their similarity to the filters (the rank similarity transform). This can be done simply by performing a dot product of the dataset with the filters and then scaling:

$$a(x) = F x, \qquad \hat{a}_i(x) = \frac{a_i(x)}{\max_k a_k(x)} \qquad (6)$$

However, due to the "curse of dimensionality", only the largest values are informative about the filter similarity, while other smaller values are less so [18]. We therefore chose a new informative minimum, which is the activation of the $n_{best}$-th most similar filter (where $n_{best}$ is a hyperparameter of the model, see Algorithm 4), and set all values less than this to zero:

$$\hat{a}_i(x) = \mathrm{clip}\!\left( \frac{a_i(x) - a_{(n_{best})}(x)}{\max_k a_k(x) - a_{(n_{best})}(x)},\ 0,\ 1 \right) \qquad (7)$$

where $a_{(n_{best})}(x)$ denotes the $n_{best}$-th largest activation.
1: Input: X, F, n-best
2: Output: A
3: function RankSimilarityTransform(X, F, n-best)
4:     A ← X Fᵀ
5:     for all a in A do
6:         a-max ← max(a)
7:         a-min ← partition(a, n-best) ▷ n-best-th largest activation
8:         a ← (a − a-min) / (a-max − a-min)
9:         a ← clip(a, 0, 1)
10:     return A
Algorithm 4 RankSimilarityTransform(X, F): transform dataset X using filters F
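A compact NumPy version of this transform (again a sketch of our reconstruction, with np.partition standing in for the partition step):

```python
import numpy as np

def rank_similarity_transform(X, F, n_best=10):
    """Sketch of Algorithm 4: activations scaled so the top n_best
    filters fall in (0, 1] and all others are clipped to zero.
    Assumes 1 < n_best <= number of filters."""
    A = X @ F.T                                   # filter activations
    a_max = A.max(axis=1, keepdims=True)
    # n_best-th largest activation per sample, via a partial sort
    a_min = np.partition(A, -n_best, axis=1)[:, [-n_best]]
    return np.clip((A - a_min) / (a_max - a_min), 0.0, 1.0)
```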

Classification algorithm

In order to create a classifier from the rank similarity filters, a few additional steps in the algorithm are required. To perform classification, each filter is assigned a label depending on the class makeup of its data points. For simple multiclass classification problems (i.e. single labels with multiple classes), performed by the rank similarity classifier, it is sufficient to separate $X$ into subsets $X_c$ per label $c$ and create the filters $F_c$. This adds two additional steps (see Algorithm 5), one before the main algorithm and one after:

  1. Split the data according to class.

  2. Assign each filter a label based on which class of data it was trained on.

1: Input: X, y
2: Output: F, L
3: function RankSimilarityClassifier(X, y)
4:     C ← unique(y)
5:     for all c in C do
6:         F_c ← RankSimilarityFilter(X_c)
7:         L_c ← c
8:     return F, L
Algorithm 5 Rank Similarity Classifier(X, y): creates filters F with labels L

Then to predict the label of a data point $x$ we find the maximally similar filter $f_i$ using Eq (4), look up the class $c$ of the data that $f_i$ was trained on, and assign $\hat{y} = c$.
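In code, prediction is a single dot product followed by a label lookup; this sketch assumes filters and per-filter class labels produced as above.

```python
import numpy as np

def rsc_predict(X, F, filter_labels):
    """Sketch of RSC prediction: each point takes the label of its
    maximally activated filter (Eq 4). filter_labels[i] is the class
    the i-th filter was trained on."""
    best_filter = np.argmax(X @ F.T, axis=1)
    return np.asarray(filter_labels)[best_filter]
```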

For the probabilistic classifier the data does not need to be split, but after training the label must be calculated:

  1. Assign each filter a label based on the labels of its data points, weighted by activation.

1: Input: X, y
2: Output: F, L
3: function RSPClassifier(X, y)
4:     F ← RankSimilarityFilter(X)
5:     L ← SetLabel(X, y, F)
6:     return F, L
Algorithm 6 Rank Similarity Probabilistic Classifier(X, y): creates filters F with probabilistic labels L

To perform probabilistic classification (suitable for multiclass-multilabel problems), for each filter $f_i$ the probabilistic label $L_i$ is calculated based on the normalized sum of the class identities of its maximally similar data points, as calculated in Eq (4).

1: Input: X, y, F
2: Output: L
3: function SetLabel(X, y, F)
4:     for all f_i in F do
5:         /* Label filters using most similar data points in X */
6:         L_i ← L1norm(sum of labels y over X_i)
7:     return L
Algorithm 7 SetLabel(X, y, F): creates probabilistic label matrix L using the class identity of dataset X

Then to perform a probabilistic prediction of the label of a data point, we apply the RankSimilarityTransform (Algorithm 4) and use these values to assign probabilities to the classes. The maximum value is then the class label.

1: Input: X, F, L
2: Output: P
3: function Probabilities(X, F, L)
4:     A ← RankSimilarityTransform(X, F)
5:     for all a in A do
6:         for all classes c do
7:             p_c ← max over filters i of (a_i L_ic);  p ← L1norm(p)
8:     return P
Algorithm 8 Probabilities(X, F, L): creates class probabilities P for dataset X from filters F with labels L

This Probabilities algorithm (Algorithm 8) can similarly be used to calculate probabilities for the standard RSC, but there the label for each class can take only the values 0 or 1.
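Below is a sketch of the probabilistic labelling and prediction as reconstructed above (the exact aggregation in the reference code may differ); it reuses rank_similarity_transform from the earlier sketch, and Y is assumed to be a one-hot or multilabel indicator matrix.

```python
import numpy as np

def set_labels(X, Y, F):
    """Sketch of Algorithm 7: label each filter with the L1-normalised
    sum of the targets Y of its assigned data points."""
    assignment = np.argmax(X @ F.T, axis=1)
    L = np.zeros((len(F), Y.shape[1]))
    for i in range(len(F)):
        members = Y[assignment == i]
        if len(members):
            L[i] = members.sum(axis=0)
    return L / np.maximum(L.sum(axis=1, keepdims=True), 1e-12)

def predict_proba(X, F, L, n_best=10):
    """Sketch of Algorithm 8: per class, take the maximum of
    activation * filter label over filters, then L1-normalise."""
    A = rank_similarity_transform(X, F, n_best)   # from the earlier sketch
    P = np.max(A[:, :, None] * L[None, :, :], axis=1)
    return P / np.maximum(P.sum(axis=1, keepdims=True), 1e-12)
```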

Determining the number of filters

The number of filters can be set directly as a parameter or calculated from the number of training data points $n$ and two thresholds, $t_1$ and $t_2$, via a piecewise rule: with few training points the number of filters grows with $n$, and as $n$ passes each threshold the growth is progressively capped. The thresholds can be set to give the most efficient results based on the available computing resources.

Experimental Results

In order to compare the efficiency of our proposed method to other classifiers, we implemented the above algorithms in Python using the scikit-learn framework. The code is publicly available at https://github.com/KatharineShapcott/rank-similarity. We evaluated the code on real-world datasets using separate SLURM jobs via the ACME package [19]. Each job was allocated a single core of a CPU (Intel Xeon E5-2650 v2 or v3) and 8 GB RAM running Red Hat Enterprise Linux 8.1. Efficiency was measured as the CPU time used, recorded with the Python time module. Accuracy was evaluated using the unweighted mean of the F1 score.
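For reference, CPU time (as opposed to wall-clock time) can be recorded with the standard library as follows; this mirrors the measurement described above, though the exact benchmarking harness is not part of the paper.

```python
import time

def cpu_time(fn, *args, **kwargs):
    """Return (result, CPU seconds) for a call, using process time so
    that sleeping and IO do not count towards the measurement."""
    start = time.process_time()
    result = fn(*args, **kwargs)
    return result, time.process_time() - start

# Example: _, fit_seconds = cpu_time(clf.fit, X_train, y_train)
```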

Datasets

We used three multiclass real-world datasets to assess the classifiers. The first was Fashion-MNIST (https://github.com/zalandoresearch/fashion-mnist), a more difficult and nonlinear version of the digit MNIST dataset. Fashion-MNIST is an image dataset with 10 classes (c), a dimensionality (d) of 784, and consists of 60,000 training samples and 10,000 balanced test samples. Kuzushiji-49 (https://github.com/rois-codh/kmnist) [20] (d=784, c=49, 232,365 training samples and 38,547 test samples) and 20 Newsgroups (http://qwone.com/~jason/20Newsgroups/) (d=101,631, c=20, 11,314 training samples and 7,532 test samples) were also used. A fourth multiclass and multilabel dataset, Reuters Corpus Volume 1 (RCV1, https://jmlr.csail.mit.edu/papers/volume5/lewis04a/) [21] (d=47,236 sparse, c=101), was used to assess the multilabel performance of the RSPC. For this dataset all 23,149 training samples but only the final 20,000 test samples were used. No further preprocessing or normalization was performed on any of the datasets.

RST as an unsupervised preprocessing step

Linear SVMs are fast and commonly used classifiers that are unable to solve nonlinear problems. To demonstrate the efficiency and utility of RSF, we first examine the rank similarity transform (RST) when used as an unsupervised preprocessing step for a linear SVM. For this demonstration we used Fashion-MNIST, and chose parameters for the linear SVM based on the best performance reported on the Fashion-MNIST benchmark. Classification performance for the SVM on the dataset was calculated across different numbers of training samples (see the grey line in Figure 2A). As the number of training samples increased, the necessary CPU time also increased strongly (see Figure 2B). When we first transformed the data using RST and then trained a linear SVM with identical parameters, the efficiency of the SVM increased. As can be seen in Figure 2A, once the number of filters was 150 or greater (compared to 784 original dimensions), this also resulted in an increase in performance and reliability. In addition, it resulted in a dramatic speedup of the SVM classifier of almost two orders of magnitude (see Figure 2B), even though the additional transform process was included in the calculated time. This speedup is not only present when the number of dimensions was reduced but also when it is increased (compare the grey raw data line (d=784) to the red (d=1500) line in Figure 2B). This is because most dimensions are zeroed after RST, with only the values of the top responding filters preserved. This was not the case for PCA or KPCA (see Supplementary Figure S1). When dimensions were reduced with PCA there was a slight speedup (with fewer than 50 PCs) but a performance decrease (Figure S1A and B). Using KPCA there was neither a speedup nor a performance improvement compared to SVM alone (Figure S1C and D).
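This preprocessing pattern maps directly onto a scikit-learn Pipeline; the sketch below assumes a transformer named RankSimilarityTransform with an n_filters parameter, which may not match the actual names in the package, and reuses the X_train/y_train split from the earlier usage sketch.

```python
# Hedged sketch: RankSimilarityTransform and its parameters are assumed
# names for illustration, not confirmed against the package API.
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from rank_similarity import RankSimilarityTransform  # assumed import

# RST in front of a linear SVM: the transform produces a mostly zeroed,
# n_filters-dimensional representation that the SVM can separate linearly.
pipe = make_pipeline(
    RankSimilarityTransform(n_filters=1500),
    LinearSVC(),
)
pipe.fit(X_train, y_train)
print("test accuracy:", pipe.score(X_test, y_test))
```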

We next looked at the trade off between the number of filters necessary to successfully transform different numbers of training samples and the time taken for training. For low numbers of training samples (100 examples per class) increasing the number of filters did not result in much of a performance improvement (blue line in Figure 2C) but did result in a longer training time (blue line in Figure 2D). However, with higher numbers of training samples (6000 per class), adding more filters continues to improve performance (see red line in Figure 2C) while also increasing training time (Figure 2D).

Fig 2: RST as an unsupervised preprocessing step for an SVM A. Classification performance of SVM classifier on the Fashion-MNIST dataset against number of training examples. Lines are different numbers of filters used to transform the data. Note with 150 filters (150 dimensional input to SVM) performance is already increased compared to the original data (784 dimensions). B. As A but showing duration of transformation and classification. Note that the y axis is logarithmic. C. Classification performance of SVM classifier on the Fashion-MNIST dataset against number of filters. Lines are different numbers of training examples. D. As C but showing duration of transformation and classification. E. Visualization of RSF weights. One random test image was chosen from each class and the weights of the top 8 responding rank filters displayed next to it. In green above the filter weights is their activation in the RST of the image (this value was fed into the SVM as a preprocessing step in A-D).

As Fashion-MNIST is an image dataset, we were able to visualise the weights of some of the most strongly activated rank filters for 60,000 training samples (see Figure 2E). Although the RST was unsupervised, filters were created from similar images, resulting in weights that match the test image well. The background blur visible for some filters is due to the rank procedure, which highlights small changes in the background colour.

Rank similarity classifier

Here we evaluate the performance of the rank similarity classifier (RSC) on multiple classification tasks. On the same Fashion-MNIST dataset with increasing numbers of training samples, an RSC with default parameter values outperforms both the pure SVM and the SVM with RST-transformed input (compare Figure 2A with Figure 3A, blue line). Additionally it is more efficient (compare Figure 2B with Figure 3A, green line), although it allocates between 1000 and 10000 filters, while this was capped at 1500 for RST. In Figure 3B it can be seen that the F1 score and time taken both increase together as the number of samples increases.

Fig 3: RSC classification performance. A. Performance and duration of the RSC classifier depending on the number of training samples. Note the increased performance and speed compared to when RST was used as an unsupervised preprocessing step for an SVM in Figure 2. B. Same data as A displayed as a scatter plot. C. Comparison of classifiers' performance and speed on the Fashion-MNIST dataset. RSC is on the top left. The parameters of the other classifiers were chosen as the best performing or fastest after a previously performed parameter search. D. Accuracy of RSC probabilities. The log loss of the probabilities for three datasets with different values of n_best. Note that the dataset with the fewest classes (Fashion-MNIST) also has the lowest log loss. E. Confusion, mean and rank distributions for sorted features in the Fashion-MNIST, Kuzushiji-49 and 20 Newsgroups datasets. Each feature was averaged and then sorted from small to large according to its mean. The values of each distribution were normalized to sum to 1. Note that for the 20 Newsgroups dataset the rank distribution diverges strongly from the mean distribution.

Comparison with other classifiers

We next examined the performance and efficiency of RSC against the best performing classifiers on the full Fashion-MNIST dataset. We chose the classifiers with the parameters giving the best performance or efficiency reported on the Fashion-MNIST benchmark page. In Figure 3C it can be seen that RSC is consistently much faster than the other classifiers on this dataset while still producing a comparable classification performance. Note also that the other classifiers were selected after a parameter search (which would multiply the run time by the number of parameters checked) while RSC used default parameters.

As can be seen in Figure 3C, RSC was competitive with two commonly used classifiers (SVC and KNN) in terms of performance on the Fashion-MNIST dataset. Using 10-fold cross-validation (CV) on three datasets, Fashion-MNIST (d=784, c=10), Kuzushiji-49 (d=784, c=49) and 20 Newsgroups (d=10000, c=20), two of images and one of text, we show that this method performs well above chance and gives stable results across multiple datasets (see column 1 of Table 1).

Also included in Figure 3C is RSPC. Because it needs to calculate the probabilities of each class in the dataset, it took longer than RSC, and because the data are not split evenly between the classes it did not perform as well. Despite this, it was more efficient than the other classifiers on this dataset and performed better than the classical SVM or MLP.

Dataset (classes)      RSC (mean±std)    SVM (mean±std)    KNN (mean±std)
Fashion-MNIST (10)     0.8678 ±0.0042    0.7765 ±0.0274    0.8668 ±0.0034
20 Newsgroups (20)     0.6120 ±0.0199    0.6424 ±0.0150    0.2290 ±0.0105
Kuzushiji-49 (49)      0.9073 ±0.0151    0.4303 ±0.0333    0.8771 ±0.0211
Table 1: Average results from 10-fold CV of the full datasets (test and training data combined) for the three classifiers.

Probabilities

Probabilities for RSC for the three datasets from Table 1 were calculated as described in the Methods (Algorithm 8). We used log loss (cross-entropy loss) to evaluate the accuracy of the probabilities on test data in Figure 3D. This value was acceptably low for the Fashion-MNIST dataset, although the probabilities were not learned or adjusted to the data at all. For the Kuzushiji-49 and 20 Newsgroups datasets the log loss was higher, which may reflect the lower accuracy of the probabilities for those datasets but was also due to the increased number of classes (c=49 and c=20 respectively). As seen in Figure 3D, the accuracy of the probabilities depended on the value of the hyperparameter n_best (see Algorithm 8). This is because only the top n_best filters have any effect on the probabilities. When n_best was too low, relevant classes were not included in the calculation. When n_best was too high, irrelevant classes had non-zero probabilities, which added noise.
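Evaluating the probabilities this way only needs scikit-learn's log_loss; a minimal sketch, assuming predict_proba-style output as in the earlier reconstruction:

```python
from sklearn.metrics import log_loss

# P_test: (n_samples, n_classes) probabilities, e.g. from the earlier
# predict_proba sketch; y_test: integer class labels with matching
# class order. Lower log loss means better-calibrated probabilities;
# more classes tend to inflate the value, as noted above.
score = log_loss(y_test, P_test)
print(f"log loss: {score:.3f}")
```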

Filter distribution on datasets

Here we compare the calculated confusion distribution with the ranks of features on the three examined datasets. While ranks were a good approximation for both the Fashion-MNIST and Kuzushiji-49 datasets, for 20 Newsgroups they were more divergent (Figure 3E, compare to Figure 1E). This is due to the log-log relationship between a word's frequency of use and its frequency rank in natural language (as described by "Zipf's law" [22]). When the ranks were replaced by confusion distribution values for 20 Newsgroups, there was a 4.6% improvement in the F1 score (0.523±0.021 to 0.547±0.031).

Rank similarity probabilistic classifier

Since many large datasets are both multiclass and multilabel, it is useful to have a classifier that can handle this. While RSC is not natively suitable for the multilabel case, RSPC is able to handle this situation. Here we show that on a subset of the commonly used text classification dataset RCV1 (d=47,236 sparse, c=101), RSPC is able to perform well above chance (see Figure 4A and B). The subset used here was the full 23,149 training samples but only the final 20,000 test samples. When compared with other classifiers on this dataset it does not perform as well (see Figure 4C); however, it is still much faster than the alternatives and performs well above chance. When used on the more suitable Fashion-MNIST dataset it was able to perform competitively with the other classifiers (see Figure 3C).

Fig 4: Multilabel classification. A. Performance and efficiency of RSPC on the RCV1 dataset (d=47236, c=103) with variable numbers of training samples. B. Same data as A displayed as a scatter plot. C. Comparison of classifiers on the RCV1 dataset; RSPC is on the bottom left. Despite poor performance on this dataset, RSPC is still above chance and faster than the alternatives that were able to complete the classification task using only 8 GB of RAM.

Discussion

Here we demonstrated that RSFs are a quickly converging nonlinear ANN that offers competitive performance on benchmark datasets with a wide array of structures and statistics. With our insight of viewing features as distributions to be discriminated, the use of ranks can be supplemented by a confusion measure. Our software package allows the RSFs to be used as a drop-in replacement for scikit-learn transformers and classifiers. This showed RSC to be computationally efficient and almost an order of magnitude faster than other scikit-learn classifiers.

RSC has two other main advantages in comparison with other classifiers. Firstly, by using prototypical vectors the model is intuitively understandable (see for example Figure 2E). This is a desirable feature to protect against overfitting or flaws in uninterpretable "black box" models like DNNs [4]. Secondly, by using ranks of features we utilise many of the advantages of nonparametric statistics [15], namely that we do not need to make an assumption about the distribution of the underlying data, and that the filters are scale invariant. This means that data preprocessing, which is used extensively in machine learning, will often not be necessary for RSF, thereby increasing its efficiency and ease-of-use.

The RSFs are L1 normed, since ranked data with the same number of features are L1 equal. We kept this equal L1 distance (Manhattan distance) so that the values are a measure of how many ranks different each individual feature is; if all features are equal then the activation strengths of all RSFs are equal. Homeostatic plasticity appears to be an important feature of biological neurons, where balanced changes in synaptic weights [23] and membrane conductances [24], as well as the intrinsic properties of neuronal dendrites [25], help to maintain stable input-output relationships over time and potentially improve computational performance [26]. A recent paper on synaptic weights measured in cortical neurons in vivo shows that the average weight across synapses is constant across time [27], which is specifically equivalent to maintaining an equal L1 distance as here.

While there are many advantages to this method, one drawback is that it is unsuitable for very low dimensional data. As $d$ features have only $d!$ possible rank combinations (not accounting for ties), the number of data points should be much lower than $d!$ for the prototypes to represent the space. Therefore, for most practical applications at least 9 features are necessary, which allows for a minimum of $9! = 362{,}880$ unique filters. RSFs are also not suitable for data where specific features take a different form to others and are much larger or smaller, although within-feature normalization can be performed before application to take care of this issue. Using RSF is most suitable when features have the same units and are part of the same space, for example, pixels in the same image or word counts in a text document. If the features are unrelated to each other then the order statistics may no longer be meaningful, although separate features may be normalized to the same scale.

In future work, this algorithm could be sped up further by parallelization via GPU or neuromorphic computing. Neuromorphic chips are highly parallel and energy efficient, designed to imitate the brain [28]. They have been used to solve a wide range of problems [29, 30], and recently millions of neurons have been connected to implement an efficient KNN search [31]. Our algorithm could be similarly implemented on such a chip as it is also dot-product based, winner-takes-all, and has the added benefit of natively using integer weights. As a VQ method it could additionally be extended to problems like data compression or dimensionality reduction.

Overall, RSF performs as well as similar methods and (even without parallelisation) far more efficiently. This makes it a useful alternative when interpretable and efficient models are needed.

Acknowledgements

We would like to thank Prof. Dr. Wolf Singer and Dr. Felix Effenberger for their comments on this manuscript. We acknowledge funding through the grant of Prof. Dr. Wolf Singer from DFG Reinhart Koselleck (project number 325248489) and support from the Ernst Strüngmann Institute (ESI) for Neuroscience in Cooperation with Max Planck Society.

Supplementary Material

Code: https://github.com/KatharineShapcott/rank-similarity

Supplementary Figures: Fig S1 (below).

References

  • Ahmed et al. [2017] Ejaz Ahmed, Ibrar Yaqoob, Ibrahim Abaker Targio Hashem, Imran Khan, Abdelmuttlib Ibrahim Abdalla Ahmed, Muhammad Imran, and Athanasios V Vasilakos. The role of big data analytics in Internet of Things. Computer Networks, 129:459–471, December 2017.
  • Villmann et al. [2017] Thomas Villmann, Andrea Bohnsack, and Marika Kaden. Can learning vector quantization be an alternative to SVM and deep learning? - Recent trends and advanced variants of learning vector quantization for classification learning. J. Artif. Intell. Soft Comput. Res., 7(1):65–81, January 2017.
  • Li et al. [2018] Oscar Li, Hao Liu, Chaofan Chen, and Cynthia Rudin. Deep Learning for Case-Based Reasoning Through Prototypes: A Neural Network That Explains Its Predictions. AAAI, 32(1), April 2018.
  • Rudin [2019] Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, May 2019.
  • Wu and Yu [2019] Ze-Bin Wu and Jun-Qing Yu. Vector quantization: a review. Frontiers of Information Technology & Electronic Engineering, 20(4):507–524, April 2019.
  • Hastie et al. [2009] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, NY, 2009.
  • Kohonen [1982] Teuvo Kohonen. Self-organized formation of topologically correct feature maps. Biol. Cybern., 43(1):59–69, January 1982.
  • Kohonen [2013] Teuvo Kohonen. Essentials of the self-organizing map. Neural Netw., 37:52–65, January 2013.
  • Koch and Poggio [1992] Christof Koch and Tomaso Poggio. Multiplying with synapses and neurons. In Single neuron computation, pages 315–345. Elsevier, 1992.
  • Martinetz and Schulten [1991] Thomas Martinetz and Klaus Schulten. A "neural-gas" network learns topologies. Artificial Neural Networks, 1:397–402, 1991.
  • Fernández-Delgado et al. [2014] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res., 15(1):3133–3181, 2014.
  • Kohonen [1988] Teuvo Kohonen. An introduction to neural computing. Neural Netw., 1(1):3–16, January 1988.
  • Sato and Yamada [1995] Atsushi Sato and Keiji Yamada. Generalized learning vector quantization. In NIPS, volume 95, pages 423–429. proceedings.neurips.cc, 1995.
  • Nova and Estévez [2014] David Nova and Pablo A Estévez. A review of learning vector quantization classifiers. Neural Comput. Appl., 25(3):511–524, September 2014.
  • Conover and Iman [1981] W J Conover and Ronald L Iman. Rank Transformations as a Bridge Between Parametric and Nonparametric Statistics. Am. Stat., 35(3):124–129, August 1981.
  • Kullback and Leibler [1951] Solomon Kullback and Richard Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22(1):79–86, 1951.
  • Dobrushin [1970] Roland Dobrushin. Prescribing a system of random variables by conditional distributions. Theory of Probability and its Applications, 15(3):458–486, 1970.
  • Houle et al. [2010] Michael E Houle, Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. Can Shared-Neighbor Distances Defeat the Curse of Dimensionality? In Scientific and Statistical Database Management, pages 482–500. Springer Berlin Heidelberg, 2010.
  • Fuertinger et al. [2021] Stefan Fuertinger, Katharine Shapcott, and Joscha Schmiedt. ACME: Asynchronous Computing Made Easy, July 2021. URL https://github.com/esi-neuroscience/syncopy.
  • Clanuwat et al. [2018] Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep Learning for Classical Japanese Literature. arXiv preprint arXiv:1812.01718, December 2018.
  • Lewis et al. [2004] David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. RCV1: A New Benchmark Collection for Text Categorization Research. J. Mach. Learn. Res., 5:361–397, December 2004.
  • Powers [1998] David M W Powers. Applications and Explanations of Zipf’s Law. In New Methods in Language Processing and Computational Natural Language Learning, 1998.
  • Royer and Paré [2003] Sébastien Royer and Denis Paré. Conservation of total synaptic weight through balanced synaptic depression and potentiation. Nature, 422:518–522, 2003.
  • Turrigiano and Nelson [2004] Gina Turrigiano and Sacha Nelson. Homeostatic plasticity in the developing nervous system. Nature Reviews Neuroscience, 5:97–107, 2004.
  • Häusser [2001] Michael Häusser. Synaptic function: Dendritic democracy. Current Biology, 11:10–12, 2001.
  • Bird et al. [2021] Alex D Bird, Peter Jedlicka, and Hermann Cuntz. Dendritic normalisation improves learning in sparsely connected artificial neural networks. PLoS Comput. Biol., 17(8):e1009202, August 2021.
  • Melander et al. [2021] Joshua B Melander, Aran Nayebi, Bart C Jongbloets, Dale A Fortin, Maozhen Qin, Surya Ganguli, Tianyi Mao, and Haining Zhong. Distinct in vivo dynamics of excitatory synapses onto cortical pyramidal neurons and inhibitory interneurons. April 2021.
  • Schuman et al. [2017] Catherine D Schuman, Thomas E Potok, Robert M Patton, J Douglas Birdwell, Mark E Dean, Garrett S Rose, and James S Plank. A Survey of Neuromorphic Computing and Neural Networks in Hardware. arXiv preprint arXiv:1705.06963, May 2017.
  • Costas-Santos et al. [2007] Jesús Costas-Santos, Teresa Serrano-Gotarredona, Rafael Serrano-Gotarredona, and Bernabé Linares-Barranco. A Spatial Contrast Retina With On-Chip Calibration for Neuromorphic Spike-Based AER Vision Systems. IEEE Trans. Circuits Syst. I Regul. Pap., 54(7):1444–1458, July 2007.
  • Blum et al. [2017] Hermann Blum, Alexander Dietmüller, Moritz Milde, Jörg Conradt, Giacomo Indiveri, and Yulia Sandamirskaya. A neuromorphic controller for a robotic vehicle equipped with a dynamic vision sensor. In Robotics Science and Systems, RSS 2017, Robotics Science and Systems, RSS 2017, Berlin, Germany, July 2017. Proceedings of Robotics: Science and Systems 2017.
  • Paxon Frady et al. [2020] E Paxon Frady, Garrick Orchard, David Florey, Nabil Imam, Ruokun Liu, Joyesh Mishra, Jonathan Tse, Andreas Wild, Friedrich T Sommer, and Mike Davies. Neuromorphic Nearest-Neighbor Search Using Intel’s Pohoiki Springs. arXiv [cs.NE], April 2020.

Supplementary Figures

Fig S1: Comparison with dimensionality reduction methods. A. Classification performance of SVM classifier on the Fashion-MNIST dataset after PCA dimensionality reduction. Lines are numbers of samples. There is no improvement in performance. B. As A but showing duration of transformation and classification. Note that there is little speed improvement compared to using RST in Figure 2D. C. Classification performance of SVM classifier on the Fashion-MNIST dataset after nonlinear KPCA dimensionality reduction. Lines are numbers of samples. The performance is worse than RST and similar to PCA. Note that with 8 GB of RAM, KPCA was not able to transform the full training dataset of 60000 images. D. As C but showing duration of transformation and classification. Using KPCA takes even longer than PCA.