Sparse evolutionary Deep Learning with over one million artificial neurons on commodity hardware

01/26/2019 ∙ by Shiwei Liu, et al. ∙ 8

Microarray gene expression data have attracted wide attention as an efficient tool for cancer diagnosis and classification. However, the very high dimensionality and the small number of samples make it difficult for traditional machine learning algorithms to address this problem, due to the large amount of computation required and to overfitting. So far, the existing approaches for processing microarray datasets are still far from satisfactory, and they employ two phases: feature selection (or extraction) followed by a machine learning algorithm. In this paper, we show that MultiLayer Perceptrons (MLPs) with adaptive sparse connectivity can handle this problem directly, without feature selection. Tested on four datasets, our results demonstrate that deep learning methods can also be applied directly to high-dimensional non-grid-like data, while learning from a small number of labeled examples with imbalanced classes and achieving better accuracy than the traditional two-phase approach. Moreover, we have been able to create sparse MLP models with over one million neurons and to train them on a typical laptop without a GPU. This is two orders of magnitude more than the largest MLPs which can currently run on commodity hardware.






1 Introduction

In the past decades, data have become an indispensable factor of scientific progress, medical development and economic growth. Without the increase in the amount of available data, scientific development could not have advanced at such speed. In particular, gene expression data obtained from DNA microarrays have emerged as a powerful tool in the fight against cancer (Simon et al., 2003). However, most DNA microarray datasets are high-dimensional and redundant, which results in unnecessary calculations, huge memory requirements and even a decrease in generalization ability due to the “curse of dimensionality” (Destrero et al., 2009). Moreover, the hidden relationships and non-standard structures among different features make it very time-consuming to find the key features among tens of thousands. To tackle this problem, researchers have proposed various methods. Among them, feature selection is a “de facto” standard, as it is able not only to remove the redundant features but also to improve the classification performance (Destrero et al., 2009). Following the feature selection phase, standard classifiers can be used to perform classification based on the selected features. One of the most used classifiers is the MultiLayer Perceptron (MLP), e.g. it represents 61% of a typical Google TPU (Tensor Processing Unit) workload for production neural network applications, while convolutional neural networks represent just 5% (Jouppi et al., 2017).

Motivation. Yet, an MLP cannot be employed directly on high-dimensional data, because the number of parameters in its fully connected layers grows quadratically with the number of neurons. On commodity hardware, this limits the size of MLPs to several thousand neurons and a few thousand input features, and implicitly limits their representational power. To address this issue, small steps have been made very recently. Mocanu et al. (2018) propose Sparse Evolutionary Training (SET), a method to train scalable MLPs with adaptive sparse connectivity (SET-MLPs). However, due to the limitations of typical deep learning libraries (e.g. operations optimized just for fully-connected layers and dense matrices), the largest SET-MLP used in (Mocanu et al., 2018) has just 12,082 neurons, which is quite a low representational power. In practice, their SET-MLP implementation uses the typical approach from the literature for sparsely connected layers, i.e. fully connected layers with sparsity enforced by a binary mask over their weights. This approach, of course, is far from taking full advantage of sparsity.

The first contribution of this paper is an efficient implementation framework which can create and train SET-MLP models with over one million neurons on a typical laptop to handle data with tens of thousands of dimensions. This very high representational power is far beyond the capacity of state-of-the-art SET-MLPs and fully-connected MLPs. Secondly, we show that our proposed approach can be a good replacement for the current methods which employ both feature reduction and classifiers to perform classification on high-dimensional non-image datasets, such as microarray gene expression data with imbalanced classes. Thirdly, we show that our proposed solution is robust to the “curse of dimensionality”, avoiding overfitting and achieving very good classification accuracy on a dataset with over 20,000 dimensions (input features) and fewer than 100 samples.

The remainder of this paper is organized as follows. Section 2 introduces and discusses our proposed methods. Section 3 presents the experiments performed and analyses the results. Section 4 discusses two extreme SET-MLP models and shows that a SET-MLP model with over one million neurons can be trained on one CPU thread of a typical laptop. Section 5 discusses related work, while Section 6 concludes the paper and presents future research directions.

2 Methods

This section introduces our proposed methods. First, it discusses the Sparse Evolutionary Training procedure and its current limitations given by the state-of-the-art deep learning libraries and techniques. Secondly, it describes our novel proposed solution to address those limitations.

2.1 Sparse evolutionary training

Inspired by the fact that biological neural networks tend to be sparse rather than dense (Strogatz, 2001; Pessoa, 2014), and due to obvious limitations of computational resources, there is an increasing interest in conceiving neural networks with a sparse topology (Mocanu et al., 2016). In (Mocanu et al., 2018), the authors proposed a novel concept, Artificial Neural Networks (ANNs) with adaptive sparse connectivity. The basic idea is to replace the fully connected layers of any type of neural network with sparsely connected layers before training, and then, during training, to optimize the weight values together with the sparse network topology to fit the data distribution. They also proposed a scalable training method, Sparse Evolutionary Training, to train such networks. Unlike conventional methods such as weight pruning (Cun et al., 1990; Han et al., 2015), which create sparse topologies during or after the training process, the adaptive topology of ANNs trained with SET is designed to be sparse from the beginning. This reduces the number of connections quadratically.

The SET algorithm is given in Appendix A, Algorithm 1. For convenience, we briefly describe the structure of SET, using the same notation as the original paper. The initial sparse topology is an Erdős-Rényi random graph (Erdős and Rényi, 1959), in which a sparse matrix $W^k \in \mathbb{R}^{n^{k-1} \times n^k}$ holds the connections between two consecutive layers of neurons (i.e. $h^{k-1}$ and $h^k$). The probability that a connection exists is given by

$$p(W^k_{ij}) = \frac{\epsilon \, (n^k + n^{k-1})}{n^k \, n^{k-1}},$$

where $n^k$ and $n^{k-1}$ are the numbers of neurons of layers $h^k$ and $h^{k-1}$, respectively, and $\epsilon$ is a hyperparameter that controls the sparsity level.
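As an illustration of this initialization, the following minimal sketch builds one sparsely connected layer with SciPy, assuming the connection probability epsilon * (n_in + n_out) / (n_in * n_out) used by Mocanu et al. (2018). The function name `erdos_renyi_layer`, the normal initialization with standard deviation 0.1, and the layer sizes are assumptions made for the example and are not taken from the paper.

```python
import numpy as np
from scipy import sparse


def erdos_renyi_layer(n_in, n_out, epsilon, rng):
    """Return a CSR weight matrix with Erdős-Rényi connectivity (sketch)."""
    # Connection probability, capped at 1 for very small layers.
    prob = min(1.0, epsilon * (n_in + n_out) / (n_in * n_out))
    # Draw how many connections exist, then sample distinct positions.
    n_conn = int(rng.binomial(n_in * n_out, prob))
    flat = rng.choice(n_in * n_out, size=n_conn, replace=False)
    rows, cols = np.unravel_index(flat, (n_in, n_out))
    values = rng.normal(0.0, 0.1, size=n_conn)  # assumed initial scale
    coo = sparse.coo_matrix((values, (rows, cols)), shape=(n_in, n_out))
    return coo.tocsr()


rng = np.random.default_rng(0)
w = erdos_renyi_layer(1000, 500, epsilon=10, rng=rng)
expected = 10 * (1000 + 500)  # average number of connections
```

With this probability, the expected number of connections is epsilon * (n_in + n_out), so it grows linearly rather than quadratically with the layer sizes.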

Like a conventional fully-connected MLP (FC-MLP), a SET-MLP employs backpropagation with stochastic gradient descent to learn the best weights for each dataset. However, the initial sparse network may not be suitable for every type of dataset, since it is generated randomly with no information about the data distribution. To overcome this problem, magnitude-based weight pruning is used in each epoch to eliminate the non-informative connections. More exactly, a certain fraction $\zeta$ of the largest negative weights and of the smallest positive weights (i.e. the weights closest to zero) is removed. After the removal, an equal number of connections is randomly added to the bipartite layers. Roughly speaking, the removal of connections in SET represents natural selection, whereas the emergence of new connections corresponds to the mutation phase.
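The prune-and-regrow step described above can be sketched as follows. This is an illustrative version, not the authors' implementation: the function name `evolve`, the quantile-based cut-offs, the rejection sampling of new positions, and the initialization scale of the new weights are all assumptions.

```python
import numpy as np
from scipy import sparse


def evolve(w, zeta, rng):
    """One prune-and-regrow step on a sparse weight matrix (sketch)."""
    coo = w.tocoo()
    rows, cols, vals = coo.row, coo.col, coo.data
    pos, neg = vals[vals > 0], vals[vals < 0]
    # Cut-offs: the smallest positive and the largest negative weights,
    # i.e. the fraction zeta of each sign that lies closest to zero.
    pos_cut = np.quantile(pos, zeta) if pos.size else 0.0
    neg_cut = np.quantile(neg, 1.0 - zeta) if neg.size else 0.0
    keep = (vals > pos_cut) | (vals < neg_cut)
    n_removed = vals.size - int(keep.sum())
    rows, cols, vals = rows[keep], cols[keep], vals[keep]

    # Regrow the same number of connections at random free positions.
    occupied = set(zip(rows.tolist(), cols.tolist()))
    new_rows, new_cols = [], []
    while len(new_rows) < n_removed:
        r = int(rng.integers(w.shape[0]))
        c = int(rng.integers(w.shape[1]))
        if (r, c) not in occupied:
            occupied.add((r, c))
            new_rows.append(r)
            new_cols.append(c)
    new_vals = rng.normal(0.0, 0.1, size=n_removed)  # assumed initial scale

    rows = np.concatenate([rows, np.array(new_rows, dtype=rows.dtype)])
    cols = np.concatenate([cols, np.array(new_cols, dtype=cols.dtype)])
    vals = np.concatenate([vals, new_vals])
    return sparse.coo_matrix((vals, (rows, cols)), shape=w.shape).tocsr()


# Usage on a small random layer with 1000 connections out of 20,000.
rng = np.random.default_rng(0)
flat = rng.choice(200 * 100, size=1000, replace=False)
r, c = np.unravel_index(flat, (200, 100))
w = sparse.coo_matrix((rng.normal(size=1000), (r, c)), shape=(200, 100)).tocsr()
w_new = evolve(w, zeta=0.3, rng=rng)
```

The number of stored connections is the same before and after the step, so the sparsity level is preserved while the topology changes.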

Very recently, several papers proposed various techniques for adaptive sparse connectivity. Bellec et al. (2018) proposed DEEP R. Mostafa and Wang (2019) proposed dynamic sparse reparameterization for deep convolutional neural networks and showed that their method and SET are faster and achieve better accuracy than DEEP R. Zhu and Jin (2018) proposed a simplified SET variant (after weight removal, they do not add random connections) to minimize the size of the network as much as possible, as needed in federated learning and on low-resource devices. Considering the above, in this paper we focus on the original SET algorithm, because it was shown to reach very high accuracy (Zhu and Jin, 2018; Mocanu et al., 2018), many times even higher than its fully-connected counterparts (Mocanu et al., 2018), while being very versatile and suitable for many neural network models (e.g. restricted Boltzmann machines, multilayer perceptrons, and convolutional neural networks) and for non-grid-like data.

However, the authors of SET used Keras with the Theano back-end to implement their SET-MLP models. This implementation choice has the big advantage of offering wide flexibility in architectural choices (e.g. various activation functions, optimizers, GPUs, and so on), which is very welcome while conceiving new algorithms, but it does not offer proper support for sparse matrix operations. This considerably limits the maximum possible number of neurons of a SET-MLP in practice, and implicitly its representational power. For these reasons, the maximum size of the SET-MLPs from (Mocanu et al., 2018) is 12,082 neurons.

2.2 Proposed solution

In this paper, we address the above limitations of the original SET implementation and show how a vanilla SET-MLP can be implemented from scratch using just pure Python, SciPy, and Cython. Our approach enables the construction of SET-MLPs that are at least two orders of magnitude larger, i.e. with over 1,000,000 neurons. What is more, such SET-MLPs do not need GPUs and can run perfectly well on a standard laptop.

2.2.1 Sparse Matrices Operations

The key element of our very efficient implementation is the use of sparse data structures from SciPy. It is important to use the right representation of a sparse matrix for each operation, because different sparse matrix formats have different advantages and disadvantages (as briefly discussed in Appendix A), and no single format can handle all the operations on sparse weight matrices that are needed to implement a SET-MLP. Still, the conversions from one format to another are very fast and efficient. Thus, in our implementation, which was done in pure Python 3, we have used specific sparse matrix formats for specific SET-MLP operations, together with their fast conversion capabilities, as follows.

Initialize sparsely connected layers. The sparse matrices which store the sparsely connected layers are created using the List of Lists (LIL) format and are then transformed into the Compressed Sparse Row (CSR) format.

Feed-forward phase. During it, the sparse weights matrices are stored and used in the CSR format.

Backpropagation phase - computing gradients. The only operation which cannot be implemented with SciPy sparse matrix operations is computing the gradients for backpropagation (Rumelhart et al., 1986). The simple reason is that multiplying the vector of backpropagation errors from layer $h^k$ with the vector of neuron activations from layer $h^{k-1}$ performs a considerable number of unnecessary multiplications (for nonexistent connections) and creates a dense matrix of updates. This dense matrix, besides being very slow to process, has a number of entries that grows quadratically with the layer sizes and fills 16 GB of RAM very fast (in practice, for fewer than 10,000 neurons per layer, given all the other information which has to be stored in memory). To avoid this situation, we have implemented in Cython the computations necessary for the batch weight updates. In this way, we compute the gradient updates just for the existing connections, much faster than in pure Python. For this step, the sparse weight matrices are stored and used in the Coordinate list (COO) format.

Backpropagation phase - weights update. For this, the sparse weights matrices are used in the CSR format.
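The format choices listed above can be illustrated with a short SciPy sketch. It is not the authors' code: the Cython gradient routine is replaced here by a plain NumPy expression over the COO index arrays, the error signal `delta` is a random stand-in, and the layer sizes, the number of connections and the learning rate are hypothetical.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_in, n_out, batch = 6, 4, 3

# Initialization: LIL allows cheap assignment of individual entries.
w_lil = sparse.lil_matrix((n_in, n_out))
for _ in range(8):
    w_lil[int(rng.integers(n_in)), int(rng.integers(n_out))] = rng.normal()

# Feed-forward: CSR is efficient for matrix products.
w_csr = w_lil.tocsr()
x = rng.normal(size=(batch, n_in))        # dense input batch
z = (w_csr.T @ x.T).T                     # pre-activations, shape (batch, n_out)

# Gradients: COO exposes explicit (row, col) arrays, so only existing
# connections are visited and no dense n_in x n_out matrix is built.
delta = rng.normal(size=(batch, n_out))   # stand-in for backpropagated errors
w_coo = w_csr.tocoo()
grad = (x[:, w_coo.row] * delta[:, w_coo.col]).sum(axis=0)

# Weight update on the stored values only, then back to CSR.
lr = 0.01                                 # hypothetical learning rate
w_coo.data -= lr * grad
w_csr = w_coo.tocsr()
```

Each entry of `grad` equals the corresponding entry of the dense product `x.T @ delta`, but only the entries of existing connections are ever computed.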

2.2.2 Implementation of Weights Evolution

The key aspect of the SET method that sets it apart from conventional ANN training is the evolutionary scheme which modifies the connectivity of the layers at the end of every epoch. As the weight evolution routine is executed quite often, it needs to be implemented efficiently to ensure that SET-MLP training is as fast as possible. Furthermore, as the layer connections are extremely sparse in the SET scheme, the implementation should ensure that the sparsity level is maintained. In fact, it should exploit the sparsity while removing and adding weights. Two implementations of the weight evolution scheme were coded in native Python using NumPy sparse matrix routines. The first implementation is readable and intuitive, but does not exploit the full capabilities of the NumPy library in its various operations. The second implementation is not as readable, but vectorizes most of the operations using NumPy routines and performs the same operations in much less time. Both implementations are explained and compared in Appendix A.

3 Experiments and results

In this section, we evaluate and discuss the performance of our efficient SET-MLP implementation on four publicly available benchmark microarray datasets, detailed in Table 1. For a good understanding of SET-MLP performance, we compare it against another sparse MLP model (implemented by us in the same manner), in which the bipartite layers are initialized with an Erdős-Rényi topology but do not evolve over time and keep a fixed sparsity pattern. This model is dubbed MLP$_{FixProb}$ in (Mocanu et al., 2018) and is referred to simply as MLP in the remainder of this paper.

Dataset                                  No. of samples   No. of features   No. of classes   Data size
Leukemia (Haferlach et al., 2010)        2096             54,675            18               1.93 GB
CLL-SUB-111 (Haslinger et al., 2004)     111              11,340            3                5.9 MB
SMK-CAN-187 (Zhao et al., 2010)          187              19,993            2                11.9 MB
GLI-85 (Freije et al., 2004)             85               22,283            2                8.7 MB
Table 1: Microarray datasets used.

3.1 Evaluation metrics and experimental setup

To evaluate the performance of the proposed method we have used the accuracy metric and the confusion matrix to get detailed visual information. The rows of the confusion matrix represent the predicted classes and the columns correspond to the true classes. The diagonal cells represent the numbers of samples that are correctly classified, whereas the off-diagonal cells are the number of incorrectly classified samples. The row at the bottom of the confusion matrix gives the proportion of all examples belonging to each class that are correctly (green) and incorrectly (red) classified. The column on the far right of the confusion matrix represents the proportion of all the samples predicted to belong to each class that are correctly (green) and incorrectly (red) classified.
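The layout of the confusion matrix described above (rows are predicted classes, columns are true classes) can be reproduced with a few lines of NumPy. The labels below are made-up toy data, used only to show how the diagonal, the bottom row and the rightmost column are obtained.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the predicted class, columns the true class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_pred, y_true), 1)
    return cm


# Toy labels (hypothetical, three classes).
y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2])

cm = confusion_matrix(y_true, y_pred, 3)
accuracy = np.trace(cm) / cm.sum()            # correct predictions / all samples
recall = np.diag(cm) / cm.sum(axis=0)         # bottom row: per true class
precision = np.diag(cm) / cm.sum(axis=1)      # rightmost column: per predicted class
```

With this orientation, the bottom row of the plotted matrix corresponds to per-class recall and the rightmost column to per-class precision.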

All the experiments have been executed on a typical laptop using a single thread of the CPU. The laptop configuration is as follows: (1) Hardware configuration: CPU Intel Core i7-4700MQ, 2.40 GHz × 8, RAM 16 GB, Hard disk 500 GB; and (2) Software used: Ubuntu 16.04, Python 3.5.2, NumPy 1.14, SciPy 0.19.1, and Cython 0.27.3.

For both models, SET-MLP and MLP, we have used two hidden layers with ReLU activation functions, and backpropagation with stochastic gradient descent and momentum for the optimization of the connection weights. The number of neurons in each layer is given in Table 2.

Dataset        Input     Hidden 1   Hidden 2   Output
Leukemia       54,675    27,500     27,500     18
CLL-SUB-111    11,340    9,000      9,000      3
SMK-CAN-187    19,993    16,000     16,000     2
GLI-85         22,283    20,000     none       2
Table 2: Number of neurons per layer.
Leukemia class                                     Class label   No. of samples
Mature B-ALL with t(8;14)                          1             4
Pro-B-ALL with t(11q23)/MLL                        2             23
C-ALL/Pre-B-ALL with t(9;22)                       3             41
T-ALL                                              4             58
ALL with t(12;21)                                  5             19
ALL with t(1;19)                                   6             12
ALL with hyperdiploid karyotype                    7             14
C-ALL/Pre-B-ALL without t(9;22)                    8             79
AML with t(8;21)                                   9             14
AML with t(15;17)                                  10            12
AML with inv(16)/t(16;16)                          11            9
AML with t(11q23)/MLL                              12            13
AML with normal karyotype + other abnormalities    13            115
AML complex aberrant karyotype                     14            18
CLL                                                15            149
CML                                                16            25
MDS                                                17            68
Non-leukemia and healthy bone marrow               18            26
Table 3: Leukemia class labels and their number of test samples.

3.2 Results on the Leukemia dataset

The Leukemia dataset is obtained from the NCBI GEO repository under accession number GSE13159. It contains 2096 samples with 54,675 features each, divided into 18 classes. Of these 2096 samples, 1397 are used as training data and 699 as testing data. Table 3 shows the number of test samples in each class. It is worth highlighting that both the training and the testing sets are unbalanced. For this dataset, the number of hidden neurons in each layer was set to 27,500, a value far above the usual number of neurons in fully-connected MLP models. We performed a small random search and set the learning rate to 0.005, the batch size to 5, the momentum to 0.9 and the weight decay to 0.0002. As specific SET-MLP settings, we chose sparsity level and pruning rate .

The accuracy of SET-MLP and of an MLP with two hidden layers on the Leukemia test set is shown in Fig. 1 over 500 training epochs. We chose to discuss in detail the models with two hidden layers, as they offered the best performance. The x-axis shows the training epochs; the y-axis shows the test accuracy. The figure shows that the accuracy of SET-MLP tends to stabilize towards 90% as the training progresses, while MLP stabilizes around 80% accuracy. To better understand the performance of our approach, Fig. 2 shows the confusion matrix on the Leukemia dataset for the peak accuracy of SET-MLP (88.10%). We highlight that, to the best of our knowledge, this accuracy is higher than the best result (81.11%) reported in the literature for this dataset (Kumar and Rath, 2015). Therein, an ensemble classifier is proposed to deal with microarray data. This classifier connects several feature selection algorithms with a MapReduce-based proximal support vector machine (mrPSVM) to classify the microarray data. Their experimental results not only show that the ensemble of the mrPSVM classifier and feature selection approaches is a state-of-the-art method for microarray datasets, but also provide concrete information about the training and testing data. In our experiments, we employed the same training and testing data to guarantee the validity of the comparison.

Figure 1: Test accuracy of SET-MLP and MLP on the Leukemia dataset.

Figure 2: Confusion matrix for the peak accuracy of SET-MLP on the Leukemia dataset.

Figure 3: Test accuracy of SET-MLP and MLP on three of the benchmark datasets: (a) CLL-SUB-111, (b) SMK-CAN-187, (c) GLI-85.

3.3 Results on the CLL-SUB-111 dataset

CLL-SUB-111 is an unbalanced dataset that contains gene expression data from high-density oligonucleotide arrays, covering both genetically and clinically distinct subgroups of B-cell chronic lymphocytic leukemia (B-CLL). It has 11,340 features and 111 samples, of which 74 are used as the training set and 37 as the testing set. The number of neurons in each hidden layer is 9,000. All the hyperparameters were set as for Leukemia, except for the learning rate, which was set to 0.01.

The test accuracy of SET-MLP and MLP on the CLL-SUB-111 dataset over 500 epochs is depicted in Fig. 3(a). The x-axis shows the training epochs; the y-axis shows the test accuracy. After approximately 200 epochs, SET-MLP reaches an accuracy of 81.08% and MLP an accuracy of about 65%. Among the feature-selection-based methods applied to CLL-SUB-111, an accuracy of 78.38% was obtained by using Incremental Wrapper-based Attribute Selection (IWSS) (Bermejo et al., 2012). Although CLL-SUB-111 has an extremely small number of samples, SET-MLP still obtains strong performance without visible overfitting. Fig. 4(a) depicts the confusion matrix of SET-MLP on the CLL-SUB-111 dataset. We can observe that SET-MLP reaches 100.0% recall for class 1, even under extremely unfavorable conditions, i.e. very few training samples.

Figure 4: Confusion matrix for the peak accuracy of SET-MLP on three of the benchmark datasets: (a) CLL-SUB-111, (b) SMK-CAN-187, (c) GLI-85.

3.4 Results on the SMK-CAN-187 dataset

SMK-CAN-187 is an RNA dataset obtained from normal bronchial epithelium of smokers with and without lung cancer, which is publicly available from (Zhao et al., 2010). It has 19,993 features and 187 samples, of which 124 are chosen as training data and 63 as testing data. Of the 63 testing samples, 32 are labeled ‘1’ and 31 are labeled ‘2’. Like the Leukemia and CLL-SUB-111 datasets, SMK-CAN-187 is also unbalanced. The number of neurons in the hidden layers was set to 16,000, the learning rate to 0.005, and the batch size to 5. Moreover, the momentum was 0.9 and the weight decay was 0.0002. Additionally, the SET-MLP specific parameters $\epsilon$ and $\zeta$ were 10 and 0.3, respectively.

Fig. 3(b) shows the accuracy of SET-MLP and MLP on the SMK-CAN-187 dataset. The accuracy of SET-MLP is 79.4%, which is better than the accuracy (74.87 ± 2.32%) reported in (Wang et al., 2016), where feature selection was performed by preserving class correlation. It is noteworthy that on this dataset the accuracy of SET-MLP stabilizes around 75% after 350 epochs, while MLP seems to suffer from overfitting. The confusion matrix of SET-MLP on SMK-CAN-187 is given in Fig. 4(b). On class 2, SET-MLP achieves a recall of 90.3%.

3.5 Results on the GLI-85 dataset

The GLI-85 dataset has 22,283 features and 85 samples. We chose to analyze this dataset because it reflects an extreme case in which very little labeled data is available. Of these 85 samples, 56 are training data and 29 are testing data. Of the 29 testing samples, 21 are labeled ‘2’ and 8 are labeled ‘1’. The batch size was set to 1, while the remaining hyperparameters were the same as for Leukemia. With only 85 samples, any model trained on this dataset is clearly prone to overfitting. Due to the very small number of samples, the best results on this dataset were obtained using just one hidden layer, as discussed next.

Fig. 3(c) shows the accuracy of SET-MLP and MLP on the GLI-85 dataset. The peak accuracy of SET-MLP is 100.0%, which is much better than the accuracy (94%) reported in (Taheri and Nezamabadi-pour, 2014), where an ensemble of three filter methods with a meta-heuristic algorithm is used. It is noteworthy that on this dataset the accuracy of MLP is stable at 100%, while SET-MLP reaches 100% in just a few cases, as it suffers from some fluctuations. This suggests that MLP is even more stable than SET-MLP when a very small number of samples is available. The confusion matrix of SET-MLP on GLI-85 is given in Fig. 4(c) for the peak accuracy of 100%.

3.6 Results analysis

To better understand the reduction in connections achieved by the SET procedure in a SET-MLP model, compared with a fully-connected MLP (FC-MLP) with the same number of neurons, Fig. 5 and Table 4 provide the number of connections for the SET-MLP models discussed above and for their FC-MLP counterparts on all four datasets. To clarify, it is impossible to also report the accuracy of the FC-MLPs, as they cannot run on a typical laptop due to their very high memory and computational requirements. Most probably, they would also overfit, given that they have hundreds of millions to billions of connections. It is clear that SET dramatically reduces the number of connections in MLPs.

Dataset        FC-MLP connections   SET-MLP connections   Connections reduction
Leukemia       2,260,307,500        1,582,376             99.93%
CLL-SUB-111    183,087,000          409,033               99.78%
SMK-CAN-187    575,920,000          711,305               99.88%
GLI-85         490,270,000          486,350               99.90%
Table 4: Numbers of connections for SET-MLP and FC-MLP.
Dataset        Average training time per epoch (s)   Average testing time per epoch (s)
Leukemia       66.12                                 3.80
CLL-SUB-111    0.59                                  0.03
SMK-CAN-187    2.89                                  0.17
GLI-85         2.74                                  0.07
Table 5: Running time in seconds (s) per epoch for SET-MLP.

For instance, a traditional FC-MLP on the Leukemia dataset would have 2,260,307,500 connections, while the SET-MLP has just 1,582,376 connections. In practice, this means that SET achieves a 99.93% reduction in the number of connections. For CLL-SUB-111, the number of connections decreases from 183,087,000 to 409,033, a 99.78% reduction. On the SMK-CAN-187 dataset, the reduction given by SET is 99.88%, from 575,920,000 to 711,305. This quadratic reduction in the number of connections is what allows SET-MLP to run well on a standard laptop for datasets with tens (up to a few hundreds) of thousands of input features.
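The FC-MLP connection counts and the reduction percentages quoted here follow directly from the layer sizes in Table 2 and the SET-MLP counts in Table 4. The helper name `dense_connections` is introduced only for this check.

```python
def dense_connections(layers):
    """Number of weights in a fully-connected MLP with the given layer sizes."""
    return sum(a * b for a, b in zip(layers, layers[1:]))


# Layer sizes from Table 2, SET-MLP connection counts from Table 4.
models = {
    "Leukemia": ([54675, 27500, 27500, 18], 1_582_376),
    "CLL-SUB-111": ([11340, 9000, 9000, 3], 409_033),
    "SMK-CAN-187": ([19993, 16000, 16000, 2], 711_305),
}

reduction = {}
for name, (layers, set_conn) in models.items():
    fc = dense_connections(layers)
    reduction[name] = round(100 * (1 - set_conn / fc), 2)
    print(name, fc, reduction[name])
```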

Figure 5: The number of connections for the SET-MLP models with two hidden layers used on the Leukemia and CLL-SUB-111 datasets and with one hidden layer used on the SMK-CAN-187, plotted against their FC-MLP counterparts.

For a better understanding of the computational requirements of SET, Table 5 shows the average training and testing time per epoch of the SET-MLPs used on the datasets. As expected, the training time increases with the number of features and samples. Still, it is worth highlighting that, although the average training time per epoch on Leukemia is relatively long (66.12 s), it makes possible something that would otherwise be out of reach: running such a large model on a commodity laptop.

In the paper, we have discussed the performance of the SET-MLP models with two hidden layers on the Leukemia, CLL-SUB-111 and SMK-CAN-187 datasets and with one hidden layer on the GLI-85 dataset. In Appendix B, we explain our choices for the number of hidden layers by comparing the performance of SET-MLP models with one, two, and three hidden layers on all datasets and by discussing the beneficial effect of dropout (Hinton et al., 2012) on SET-MLP.

4 Discussion: extreme SET-MLP models

While in the previous section we analyzed the qualitative performance of our proposed approach, in this section we briefly discuss two extreme SET-MLP models on the largest dataset used in this paper, i.e. Leukemia. The goal is to assess how fast SET-MLP can achieve a good accuracy and to see how large a SET-MLP model can be while remaining trainable on a typical laptop. For each model, we used a SET-MLP with two hidden layers and a Softmax layer as output. For the first model, the number of hidden neurons per layer was set to 1,000, while for the second model it was set to 500,000. In both cases, we used a very aggressive learning rate (0.05) and trained the models for 5 epochs. On each hidden layer we applied a dropout rate of 0.4. The other hyperparameters were set as in the previous section for Leukemia, and we used the same split into training and testing data.

Model                                       Hardware         Density level (%)   Total time (s), training + testing   Accuracy (%)
SET-MLP (54,675; 1,000; 1,000; 18)          1 CPU thread     1.04                65                                   82.88 ± 1.18
SET-MLP (54,675; 500,000; 500,000; 18)      1 CPU thread     0.007               4914                                 81.83 ± 1.11
mrPSVM with ANOVA (Kumar and Rath, 2015)    conventional     n/a                 1265                                 81.1
mrPSVM with ANOVA (Kumar and Rath, 2015)    Hadoop cluster   n/a                 291                                  81.1
Table 6: Two extreme SET-MLP models on Leukemia against the state of the art (mrPSVM with ANOVA for feature selection (Kumar and Rath, 2015)). The numbers in brackets for SET-MLP give the number of neurons per layer from input to output. The accuracy of SET-MLP is reported as the mean and standard deviation of 5 runs. The density level is the number of existing connections in the SET-MLP model as a percentage of the total number of connections in its corresponding FC-MLP.

Table 6 presents the performance of SET-MLP in comparison with the best state-of-the-art results of mrPSVM from (Kumar and Rath, 2015). We clarify that the goal of this experiment is not to obtain the best possible accuracy with SET-MLP. Still, the small SET-MLP model, which has in total 56,693 neurons and 581,469 connections, has a total training and testing time of 65 seconds. It is about 20 times faster than mrPSVM running on conventional hardware and about 4.5 times faster than mrPSVM running on a Hadoop cluster, while reaching about 1.7 percentage points better accuracy. At the same time, its small standard deviation shows that the model is very stable. Furthermore, we highlight that the very large SET-MLP model, which has in total 1,054,693 neurons and about 19,383,046 connections, needs about 16 minutes per training epoch. In 5 epochs it reaches a good accuracy, better than the state of the art. All of this happens on 1 CPU thread of a typical laptop. We highlight that this is the first time in the literature that an MLP variant with over 1 million neurons is trained on a laptop, while the usual MLP models trained on a laptop have at most a few thousand neurons. In fact, although it is hard to quantify, according to (Goodfellow et al., 2016) the size of the largest neural networks which currently run in the cloud is about 10 to 20 million neurons. Therefore, our results emphasize even more the capabilities of SET-MLPs and open the path for new research directions.
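The figures quoted in this paragraph can be cross-checked from Table 6 and the layer sizes alone. The worked arithmetic below is only a consistency check; the per-epoch time assumes that the total time of the large model is spread evenly over its 5 epochs and, since that total also includes testing, it is a slight overestimate of the pure training time.

```latex
\begin{align*}
N_{\text{small}} &= 54{,}675 + 1{,}000 + 1{,}000 + 18 = 56{,}693 \\
N_{\text{large}} &= 54{,}675 + 500{,}000 + 500{,}000 + 18 = 1{,}054{,}693 \\
C^{\text{FC}}_{\text{small}} &= 54{,}675 \cdot 1{,}000 + 1{,}000 \cdot 1{,}000 + 1{,}000 \cdot 18 = 55{,}693{,}000 \\
\text{density}_{\text{small}} &= \frac{581{,}469}{55{,}693{,}000} \approx 1.04\% \\
\text{speed-up} &= \frac{1265}{65} \approx 19.5, \qquad \frac{291}{65} \approx 4.5 \\
t_{\text{epoch, large}} &\approx \frac{4914\ \text{s}}{5} \approx 983\ \text{s} \approx 16.4\ \text{min}
\end{align*}
```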

5 Related work

Traditional feature selection methods can be roughly divided into three categories: filter methods, wrapper methods and embedded methods. Independent of classifiers, filter methods employ a certain criterion to rank the features and use the features with the highest scores for classification. Depending on whether features are evaluated individually or in batches, filter methods can be divided into univariate filters (Liu and Setiono, 1995) and multivariate filters (Hall, 1999). Wrapper methods take the classifier into account: they repeatedly use classification performance as feedback to assess the selected features, which, in turn, helps to find the best feature subset. WrapperSubsetEval (Hall et al., 2009) is a general wrapper method which can be connected with various learning algorithms. Combining the advantages of filter and wrapper methods, embedded methods are capable of utilizing the biases of classifiers while reducing the computation cost. Lasso regularization (Tibshirani, 1996) was added to objective functions to eliminate the non-important features, i.e. those whose coefficients are close to zero. However, with the explosive growth of the internet, more and more datasets with ultrahigh dimensions are available, which are far beyond the capacity of commodity computers. To address this problem, distributed computing has been proposed. Ibrahim et al. (Ibrahim et al., 2014) generalized Principal Components Analysis (ensemblePCA) to Deep Belief Network (DBN) and to transfer high-dimensional data to low-dimensional nodes. A distributed decentralized algorithm for the k-Nearest Neighbor (kNN) graph (Plaku and Kavraki, 2007) was proposed to distribute the computation of kNN graphs for very large datasets by utilizing the sequential structure of kNN data. MapReduce (Dean and Ghemawat, 2008) is an efficient programming model used by Google to compute different types of data and to process large raw data. Chu et al. (2007) applied MapReduce to ten learning algorithms and obtained a progressive speed-up. Developed by Doug Cutting, Apache Hadoop is an open-source software framework based on MapReduce for distributed data storage and processing (White, 2012). It partitions big data into blocks, in a distributed fashion, and then employs parallel tools to process the data in the different blocks on one big cluster. Kumar and Rath (2015) proposed a classifier framework combining MapReduce and a proximal Support Vector Machine (mrPSVM). The experimental results on several high-dimensional, low-sample benchmark datasets demonstrated that the ensemble of the mrPSVM classifier and feature selection methods outperforms the classical approaches.

As a promising method that has been widely used in image recognition, speech recognition and language translation, deep learning has also been employed in classification tasks with high-dimensional data. In (Fakoor et al., 2013), an autoencoder approach was combined with a softmax regression classifier to address cancer data classification problems. Ibrahim et al. (Ibrahim et al., 2014) introduced a multi-level feature selection method using a deep learning and active learning approach. Their experiments verify the superiority of this approach in terms of classification accuracy. Additionally, a Stacked Denoising Autoencoder (SDAE) was employed to represent high-dimensional features with lower-dimensional and more meaningful features (Danaee et al., 2017).

Although the above-mentioned algorithms can perform well on some datasets, they share one common disadvantage: the systems are hierarchical, meaning that at least a dimensionality reduction technique and an efficient classifier are needed. The two subsystems depend strongly on each other. When one subsystem is not good enough, the overall performance cannot be guaranteed, due to the information loss caused by the feature selection phase. The aim of this paper is therefore to introduce a method capable of processing and classifying high-dimensional data with one unitary model, i.e. SET-MLP.

6 Conclusion

Classifying microarray data has been treated in the literature as a difficult task because of the very high number of features and the small number of examples. In addition, this type of data suffers from imbalance and data shift problems.

In this paper, an efficient implementation of SET-MLP, a sparse multilayer perceptron trained with the sparse evolutionary training procedure, is proposed to deal with high-dimensional microarray datasets. This implementation uses only Python 3, sparse data structures from SciPy, and Cython. With this implementation, we have created, for the first time in the literature, sparse MLP models with over one million neurons which can be trained on a standard laptop using a single CPU thread and without a GPU. This is two orders of magnitude more than state-of-the-art fully connected MLPs and SET-MLPs trained on commodity hardware.

Besides that, we demonstrated on four microarray datasets, with tens of thousands of input features and up to just two thousand samples, that our approach reduces the number of connections quadratically in large MLPs (a reduction of about 99.9% of the connections), while outperforming the state-of-the-art methods on these datasets for the classification task. Moreover, our proposed SET-MLP models proved to be robust to overfitting, imbalance and data shift problems, which is unusual for fully connected MLPs. Last but not least, the results suggest that our proposed approach can cope efficiently with the "curse of dimensionality": it is capable of learning from small amounts of labeled data, and it outperforms the state-of-the-art methods (ensembles of classifiers and feature selection methods) which are currently employed on high-dimensional non-grid-like data (or tabular data).

In the future, we intend to focus also on other types of neural layers, such as the convolutional layers of CNNs, which are widely used to deal with image data with grid-like topology. Furthermore, we intend to extend this work to address problems from other fields which suffer from the "curse of dimensionality" and which have ultra-high-dimensional data (e.g. social networks, financial networks, semantic networks). Last but not least, a further research direction is to parallelize our implementation so that it efficiently uses all CPU threads of a typical workstation, and to incorporate it into common Deep Learning frameworks, such as TensorFlow or PyTorch. This would probably allow us to scale the SET-MLP models by one more order of magnitude (up to a few tens of millions of neurons), while still using commodity hardware.

7 Acknowledgements

We thank Ritchie Vink for providing a vanilla fully connected MLP implementation, and Thomas Hagebols for analyzing the performance of SciPy sparse matrix operations.


  • Bellec et al. (2018) G. Bellec, D. Kappel, W. Maass, and R. Legenstein. Deep rewiring: Training very sparse deep networks. In International Conference on Learning Representations, 2018.
  • Bermejo et al. (2012) P. Bermejo, L. de la Ossa, J. A. Gámez, and J. M. Puerta. Fast wrapper feature subset selection in high-dimensional datasets by means of filter re-ranking. Knowledge-Based Systems, 25(1):35–44, 2012.
  • Chu et al. (2007) C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, K. Olukotun, and A. Y. Ng. Map-reduce for machine learning on multicore. In Advances in neural information processing systems, pages 281–288, 2007.
  • Cun et al. (1990) Y. L. Cun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605. Morgan Kaufmann, 1990.
  • Danaee et al. (2017) P. Danaee, R. Ghaeini, and D. A. Hendrix. A deep learning approach for cancer detection and relevant gene identification. In PACIFIC SYMPOSIUM ON BIOCOMPUTING 2017, pages 219–229. World Scientific, 2017.
  • Dean and Ghemawat (2008) J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
  • Destrero et al. (2009) A. Destrero, S. Mosci, C. De Mol, A. Verri, and F. Odone. Feature selection for high-dimensional data. Computational management science, 6(1):25–40, 2009.
  • Erdős and Rényi (1959) P. Erdős and A. Rényi. On random graphs i. Publicationes Mathematicae (Debrecen), 6:290–297, 1959.
  • Fakoor et al. (2013) R. Fakoor, F. Ladhak, A. Nazi, and M. Huber. Using deep learning to enhance cancer diagnosis and classification. In Proceedings of the International Conference on Machine Learning, volume 28, 2013.
  • Freije et al. (2004) W. A. Freije, F. E. Castro-Vargas, Z. Fang, S. Horvath, T. Cloughesy, L. M. Liau, P. S. Mischel, and S. F. Nelson. Gene expression profiling of gliomas strongly predicts survival. Cancer research, 64(18):6503–6510, 2004.
  • Goodfellow et al. (2016) I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning (Subsection 1.2.3). The MIT Press, 2016. ISBN 0262035618, 9780262035613.
  • Haferlach et al. (2010) T. Haferlach, A. Kohlmann, L. Wieczorek, G. Basso, G. T. Kronnie, M.-C. Béné, J. V. De, J. M. Hernández, W.-K. Hofmann, K. I. Mills, et al. Clinical utility of microarray-based gene expression profiling in the diagnosis and subclassification of leukemia: report from the international microarray innovations in leukemia study group. Journal of clinical oncology: official journal of the American Society of Clinical Oncology, 28(15):2529–2537, 2010.
  • Hall et al. (2009) M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The weka data mining software: an update. ACM SIGKDD explorations newsletter, 11(1):10–18, 2009.
  • Hall (1999) M. A. Hall. Correlation-based feature selection for machine learning. 1999.
  • Han et al. (2015) S. Han, J. Pool, J. Tran, and W. J. Dally. Learning both weights and connections for efficient neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 1135–1143, Cambridge, MA, USA, 2015. MIT Press.
  • Haslinger et al. (2004) C. Haslinger, N. Schweifer, S. Stilgenbauer, H. Döhner, P. Lichter, N. Kraut, C. Stratowa, and R. Abseher. Microarray gene expression profiling of b-cell chronic lymphocytic leukemia subgroups defined by genomic aberrations and vh mutation status. Journal of Clinical Oncology, 22(19):3937–3949, 2004.
  • Hinton et al. (2012) G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
  • Ibrahim et al. (2014) R. Ibrahim, N. A. Yousri, M. A. Ismail, and N. M. El-Makky. Multi-level gene/mirna feature selection using deep belief nets and active learning. In Engineering in Medicine and Biology Society (EMBC), 2014 36th annual international conference of the IEEE, pages 3957–3960. IEEE, 2014.
  • Jouppi et al. (2017) N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on, pages 1–12. IEEE, 2017.
  • Kumar and Rath (2015) M. Kumar and S. K. Rath. Classification of microarray using mapreduce based proximal support vector machine classifier. Knowledge-Based Systems, 89:584–602, 2015.
  • Liu and Setiono (1995) H. Liu and R. Setiono. Chi2: Feature selection and discretization of numeric attributes. In Tools with artificial intelligence, 1995. proceedings., seventh international conference on, pages 388–391. IEEE, 1995.
  • Mocanu et al. (2016) D. C. Mocanu, E. Mocanu, P. H. Nguyen, M. Gibescu, and A. Liotta. A topological insight into restricted boltzmann machines. Machine Learning, 104(2):243–270, Sep 2016. ISSN 1573-0565. doi: 10.1007/s10994-016-5570-z.
  • Mocanu et al. (2018) D. C. Mocanu, E. Mocanu, P. Stone, P. H. Nguyen, M. Gibescu, and A. Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9(1):2383, 2018.
  • Mostafa and Wang (2019) H. Mostafa and X. Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization, 2019.
  • Pessoa (2014) L. Pessoa. Understanding brain networks and brain organization. Physics of life reviews, 11(3):400–435, 2014.
  • Plaku and Kavraki (2007) E. Plaku and L. E. Kavraki. Distributed computation of the knn graph for large high-dimensional point sets. Journal of parallel and distributed computing, 67(3):346–359, 2007.
  • Rumelhart et al. (1986) D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. nature, 323(6088):533, 1986.
  • Simon et al. (2003) R. Simon, M. D. Radmacher, K. Dobbin, and L. M. McShane. Pitfalls in the use of dna microarray data for diagnostic and prognostic classification. Journal of the National Cancer Institute, 95(1):14–18, 2003.
  • Strogatz (2001) S. H. Strogatz. Exploring complex networks. nature, 410(6825):268, 2001.
  • Taheri and Nezamabadi-pour (2014) N. Taheri and H. Nezamabadi-pour. A hybrid feature selection method for high-dimensional data. In Computer and Knowledge Engineering (ICCKE), 2014 4th International eConference on, pages 141–145. IEEE, 2014.
  • Tibshirani (1996) R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
  • Wang et al. (2016) J. Wang, J. Wei, and Z. Yang. Supervised feature selection by preserving class correlation. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages 1613–1622. ACM, 2016.
  • White (2012) T. White. Hadoop: The definitive guide. O'Reilly Media, Inc., 2012.
  • Zhao et al. (2010) Z. Zhao, F. Morstatter, S. Sharma, S. Alelyani, A. Anand, and H. Liu. Advancing feature selection research. ASU feature selection repository, pages 1–28, 2010.
  • Zhu and Jin (2018) H. Zhu and Y. Jin. Multi-objective evolutionary federated learning. CoRR, abs/1812.07478, 2018.

Appendix A Implementation details

A.1 Sparse data structures

The SciPy sparse data structures used to implement SET-MLPs are briefly discussed below; the interested reader is referred to the SciPy documentation (last visited 3 June 2018) for detailed information.

  • Compressed Sparse Row (CSR) sparse matrix: The data is stored in three vectors. The first vector contains nonzero values, the second one stores the extents of rows, and the third one contains the column indices of the nonzero values. This format is very fast for many arithmetic operations, but slow for changes to the sparsity pattern.

  • Linked List (LIL) sparse matrix: This format saves nonzero values in row-based linked lists. Items in the rows are also sorted. The format is fast and flexible in changing the sparsity patterns, but inefficient for arithmetic matrix operations.

  • Coordinate list (COO) sparse matrix: This format saves the nonzero elements and their coordinates (i.e. row and column). It is very fast in constructing new sparse matrices, but it does not support arithmetic matrix operations and slicing.

  • Dictionary Of Keys (DOK) sparse matrix: This format has a dictionary that maps row and column pairs to the value of nonzero elements. It is very fast in incrementally constructing new sparse matrices, but cannot handle arithmetic matrix operations.
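As a rough illustration of how these four formats can be combined, the following minimal sketch builds a small matrix in LIL, converts it to CSR for arithmetic, and inspects it through COO and DOK. The matrix size and the values are arbitrary choices made for this example and are not taken from the SET-MLP code.

```python
import numpy as np
from scipy import sparse

# LIL: cheap to change the sparsity pattern, so use it to build the matrix.
w_lil = sparse.lil_matrix((4, 5), dtype=np.float64)
w_lil[0, 1] = 0.5
w_lil[2, 3] = -0.2
w_lil[3, 0] = 0.1

# CSR: fast arithmetic, e.g. a matrix-vector product.
w_csr = w_lil.tocsr()
y = w_csr @ np.ones(5)

# The three CSR vectors: nonzero values, column indices, row extents.
print(w_csr.data, w_csr.indices, w_csr.indptr)

# COO: explicit (row, column, value) triplets of the nonzero elements.
w_coo = w_csr.tocoo()
print(w_coo.row, w_coo.col, w_coo.data)

# DOK: dictionary-like incremental insertion of a new element.
w_dok = w_csr.todok()
w_dok[1, 4] = 0.3
```

Converting between formats costs time proportional to the number of stored elements, which is why it pays off to pick the format that matches the dominant operation of each training phase.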

A.2 Weights evolution - Implementation I

In this implementation, the sparse weight matrices in the CSR format are converted to three vectors holding the row indices, the column indices and the values of the non-zero elements (using either the COO or the LIL format). The values are then compared in a for-loop to the user-specified thresholds to decide whether each weight is kept or discarded. To ensure that the total number of non-zeros in the weight matrix remains the same, random connections between neurons need to be created. Again, a loop is used to create new random connections incrementally until the total number of non-zeros is equal to the original number of non-zeros.

Most of the processing time in the code is spent in the for-loops and the while-loops, which is confirmed by a Python code profiling tool (line_profiler by Robert Kern). Furthermore, as we are constantly accessing the weights by row and column index, this method does not exploit the sparsity of the weight matrix. The code profile showed that removing weights from the weight matrix takes about 15% of the total time of an epoch, and adding new random connections takes about 50% of the total time of an epoch (of course, these percentages also depend on the size of the dataset, and they become smaller when the dataset gets larger). The detailed algorithm is given in Algorithm 2.
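The loop-based procedure described above can be sketched as follows. This is only an illustrative reading of the text, not the authors' code: the function name `evolve_weights_loop`, the threshold arguments `t_neg` and `t_pos`, and the normal initialization of new weights are assumptions made for this example.

```python
import numpy as np
from scipy import sparse

def evolve_weights_loop(w_csr, t_neg, t_pos, rng):
    """Remove weights in (t_neg, t_pos), then add as many random ones."""
    coo = w_csr.tocoo()   # row indices, column indices, values
    w = w_csr.tolil()     # LIL copy: cheap to modify element by element
    removed = 0
    # Removal: one Python-level comparison per stored weight.
    for i, j, v in zip(coo.row, coo.col, coo.data):
        if t_neg < v < t_pos:
            w[i, j] = 0.0
            removed += 1
    # Addition: draw random positions until all removed weights are replaced.
    n_rows, n_cols = w.shape
    while removed > 0:
        i = rng.integers(n_rows)
        j = rng.integers(n_cols)
        if w[i, j] == 0:
            w[i, j] = rng.normal(scale=0.1)  # assumed initialization
            removed -= 1
    out = w.tocsr()
    out.eliminate_zeros()  # drop any explicitly stored zeros
    return out

rng = np.random.default_rng(0)
w0 = sparse.random(50, 40, density=0.1, format="csr", random_state=0)
w0.data -= 0.5            # shift the values to [-0.5, 0.5)
w1 = evolve_weights_loop(w0, t_neg=-0.1, t_pos=0.1, rng=rng)
print(w0.nnz, w1.nnz)     # the number of connections is preserved
```

Every iteration of both loops goes through the Python interpreter and indexes the matrix element by element, which is consistent with the profiling observation that these loops dominate the running time.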

A.3 Weights evolution - Implementation II

In order to make full use of the advantages of the different sparse matrix formats, we also propose Fast Weights Evolution (FWE). In FWE, the sparse weight matrices in the CSR format are also converted to three vectors holding the row indices, the column indices and the values of the non-zero elements, using the COO format. The value vector is compared a single time with the minimum and maximum threshold values using the vectorized operations in NumPy. This identifies the indices of the small weights so that they can be deleted quickly. Next, the remaining row and column indices are stored together in arrays, and a list of these arrays, one per non-zero element, is created. This list is used directly to determine the random row and column indices of the additional weights, which ensures that the number of connections between the neurons remains constant. As the weights are sparse, the list is much smaller than the full weight matrix, and performing all the computations on the list is faster. The detailed algorithm is given in Algorithm 3. The running times of the two implementations are compared in Table 7, which shows that Implementation II is more efficient than Implementation I.
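A minimal sketch of the FWE idea, under stated assumptions, is given below. The name `evolve_weights_fast`, the threshold arguments, the use of a Python set of (row, column) tuples in place of the list of arrays, and the normal initialization of new weights are choices made for this illustration; the authors' implementation may differ.

```python
import numpy as np
from scipy import sparse

def evolve_weights_fast(w_csr, t_neg, t_pos, rng):
    """Vectorized removal, then batched sampling of new positions."""
    coo = w_csr.tocoo()
    n_rows, n_cols = coo.shape
    # One vectorized comparison against both thresholds.
    keep = (coo.data <= t_neg) | (coo.data >= t_pos)
    rows, cols, vals = coo.row[keep], coo.col[keep], coo.data[keep]
    n_missing = coo.data.size - vals.size
    # Positions that are already occupied by a connection.
    occupied = set(zip(rows.tolist(), cols.tolist()))
    new_pos = []
    while n_missing > 0:
        i = rng.integers(n_rows, size=n_missing)
        j = rng.integers(n_cols, size=n_missing)
        # The set removes duplicates; the difference removes occupied cells.
        cand = set(zip(i.tolist(), j.tolist())) - occupied
        new_pos.extend(cand)
        occupied |= cand
        n_missing -= len(cand)
    if new_pos:
        new_rows, new_cols = (np.array(a) for a in zip(*new_pos))
        rows = np.concatenate([rows, new_rows])
        cols = np.concatenate([cols, new_cols])
        new_vals = rng.normal(scale=0.1, size=len(new_pos))  # assumed init
        vals = np.concatenate([vals, new_vals])
    # Rebuild the matrix once, through the COO constructor.
    w_new = sparse.coo_matrix((vals, (rows, cols)), shape=(n_rows, n_cols))
    return w_new.tocsr()

rng = np.random.default_rng(0)
w0 = sparse.random(200, 100, density=0.05, format="csr", random_state=0)
w0.data -= 0.5
w1 = evolve_weights_fast(w0, t_neg=-0.1, t_pos=0.1, rng=rng)
print(w0.nnz, w1.nnz)
```

The work here scales with the number of stored connections rather than with the full matrix size, and the matrix is touched only twice (one conversion to COO and one reconstruction).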

We are aware that it is hard to reproduce an efficient implementation of an algorithm from the above details alone, so we mention that our SET-MLP proof-of-concept implementation is available online.

Matrix size      Implementation I (s)   Implementation II (s)
500×500          0.58                   0.14
2000×2000        2.56                   0.71
8000×8000        11.13                  2.08
15000×15000      24.14                  3.75
Table 7: Mean running time of evolution Implementation I and Implementation II.

Figure 6: Experiments with SET-MLPs on all four datasets to understand the effect of the number of hidden layers. For each dataset, three cases for the number of hidden layers are considered, i.e. one, two, and three. Each row represents the test classification accuracy of SET-MLPs with one, two, or three hidden layers on the same dataset. Every model in a row has been trained with the same hyperparameters as in the paper, except for the number of hidden layers.

Figure 7: The performance of SET-MLP with dropout regularization against SET-MLP without dropout on GLI-85 dataset.

Appendix B Comparative study of SET-MLPs with one, two, and three hidden layers on the four benchmark datasets.

The number of neurons per hidden layer and the other hyperparameters are set to be the same as for the models from the paper. Fig. 6 summarizes these experiments. From the first row, it can be inferred that SET-MLP with two hidden layers reaches the highest peak accuracy (88.12%) and has the most robust performance on the Leukemia dataset. Similarly, SET-MLP with two hidden layers reaches an outstanding accuracy (81.11%) on the CLL-SUB-111 dataset, while the accuracy cannot reach 80% with one or three hidden layers.

The most interesting results, although expected, are obtained on GLI-85 (Fig. 6, third row). Due to the very small number of samples of this dataset, SET-MLP with one hidden layer avoids overfitting at the price of a rather oscillating behavior. At the same time, SET-MLPs with two or three hidden layers are also capable of reaching a perfect accuracy of 100%, but after about 200 epochs their accuracy drops dramatically to about 80%. We hypothesize that this happens due to overfitting, as the number of training samples is extremely small. If this is the case, adding dropout regularization to SET-MLP should solve the problem. We applied dropout with a dropout rate of 0.5 to both hidden layers. The performance is shown in Fig. 7. The accuracy of SET-MLP with dropout follows the same trend as before, but without any drop in accuracy after 200 epochs.
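For readers unfamiliar with dropout (Hinton et al., 2012), the following sketch shows one common variant, inverted dropout, applied to a toy matrix of hidden activations with the 0.5 rate mentioned above. The function name and the rescaling convention are assumptions made for illustration; the text does not specify how dropout is implemented in SET-MLP.

```python
import numpy as np

def dropout(h, rate, rng, training=True):
    """Inverted dropout: zero each activation with probability `rate`."""
    if not training:
        return h                        # no masking at test time
    mask = rng.random(h.shape) >= rate  # True for the units that are kept
    return h * mask / (1.0 - rate)      # rescale to keep the expected value

rng = np.random.default_rng(0)
h = np.ones((4, 6))                     # toy hidden-layer activations
h_train = dropout(h, rate=0.5, rng=rng)
h_test = dropout(h, rate=0.5, rng=rng, training=False)
```

With a rate of 0.5, each kept activation is doubled during training, so the expected input to the next layer matches what that layer receives at test time.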

1:  %Sparse Topology Initialization;
2:  initialize ANN model;
3:  set ε and ζ;
4:  for each bipartite fully-connected layer of the ANN do
5:     replace FC layer with Sparse Connected (SC) layer with an Erdős-Rényi topology given by ε and Eq.1;
6:  end for
7:  initialize training algorithm parameters;
8:  %Training;
9:  for each training epoch i do
10:     perform standard training procedure;
11:     perform weights update;
12:     for each bipartite SC layer of the ANN do
13:        remove a fraction ζ of the smallest positive weights;
14:        remove a fraction ζ of the largest negative weights;
15:        if i is not the last training epoch then
16:           add randomly new weights (connections) in the same amount as the ones removed previously;
17:        end if
18:     end for
19:  end for
Algorithm 1 SET pseudocode
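The sparse topology initialization step of Algorithm 1 can be sketched as below. The sketch assumes that Eq. 1 is the Erdős-Rényi connection probability used in SET (Mocanu et al., 2018), p = ε(n_in + n_out) / (n_in · n_out), where ε controls the sparsity level. The function name, the normal weight initialization, and the choice to sample exactly the expected number of connections (instead of one independent draw per possible connection) are simplifications made for this example.

```python
import numpy as np
from scipy import sparse

def erdos_renyi_layer(n_in, n_out, epsilon, rng):
    """Sparse (n_in x n_out) weight matrix with an Erdős-Rényi topology."""
    # Assumed form of Eq. 1: probability that a given connection exists.
    p = min(1.0, epsilon * (n_in + n_out) / (n_in * n_out))
    n_conn = int(round(p * n_in * n_out))  # expected number of connections
    # Sample n_conn distinct positions among all n_in * n_out candidates.
    flat = rng.choice(n_in * n_out, size=n_conn, replace=False)
    rows, cols = np.unravel_index(flat, (n_in, n_out))
    vals = rng.normal(scale=0.1, size=n_conn)  # assumed initialization
    return sparse.csr_matrix((vals, (rows, cols)), shape=(n_in, n_out))

rng = np.random.default_rng(0)
w = erdos_renyi_layer(n_in=1000, n_out=500, epsilon=10, rng=rng)
print(w.nnz, 1000 * 500)  # sparse versus dense number of connections
```

Under this assumption the number of connections grows as ε(n_in + n_out), i.e. linearly in the number of neurons, whereas a fully connected layer has n_in · n_out connections.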

INPUT: Sparse Weight Matrix (W); OUTPUT: Sparse Weight Matrix with random weights added;

1:  %Removal of small weights
2:  Extract values (V), row (R) and column indices (C) of the non-zeros from W;
3:  Find maximum negative value (t_neg) and minimum positive value (t_pos);
4:  Initialize n = 0;
5:  for i in R do
6:     for c in C do
7:        if t_neg < W[i, c] < t_pos then
8:           W[i, c] = 0;
9:           n = n + 1;
10:        end if
11:     end for
12:  end for
13:  %Addition of random weights
14:  while n > 0 do
15:     Choose i randomly from 1 to rows of W;
16:     Choose j randomly from 1 to columns of W;
17:     if W[i, j] = 0 then
18:        Add a random value to W[i, j];
19:        n = n - 1;
20:     end if
21:  end while
Algorithm 2 Weights evolution - Implementation I

INPUT: Sparse Weight Matrix (W); OUTPUT: Sparse Weight Matrix with random weights added;

1:  %Removal of small weights
2:  Extract values (V), row (R) and column indices (C) of the non-zeros from W;
3:  Find the maximum negative value (t_neg) and the minimum positive value (t_pos);
4:  Find the index of the values (idx) which are bigger than t_neg and smaller than t_pos;
5:  Delete the indices of V, R and C corresponding to the index idx;
6:  N = length(idx);
7:  %Addition of random weights
8:  Create a list of arrays (L) with the remaining elements after removing: L = [(R_1, C_1), (R_2, C_2), ...];
9:  while N > 0 do
10:     I = array of N randomly chosen from 1 to rows of W;
11:     J = array of N randomly chosen from 1 to columns of W;
12:     Create list (L_new) of arrays with k elements: L_new = [(I_1, J_1), ..., (I_k, J_k)];
13:     Remove duplicate elements from L_new;
14:     Remove elements from L_new in common with L;
15:     N = N - length(L_new);
16:     L = append(L, L_new)
17:     Clear L_new
18:  end while
19:  Append random values to V: V = append(V, random values);
20:  Unzip 1st and 2nd elements of L: R, C = unzip(L);
21:  Use COO format to update the W: W = COO(V, (R, C));
Algorithm 3 Fast Weights Evolution - Implementation II