1 Introduction
In the past decades, data have become an indispensable factor of scientific progress, medical development and economic growth. Without the increase in the amount of available data, scientific development could not have advanced at such an incredible speed. In particular, gene expression data obtained from DNA microarrays have emerged as a powerful tool in the fight against cancer (Simon et al., 2003). However, most DNA microarray datasets are high-dimensional and redundant, which results in unnecessary computation, huge memory requirements and even a decrease in generalization ability due to the "curse of dimensionality" (Destrero et al., 2009). Moreover, the hidden relationships and non-standard structures among different features make it very time-consuming to find the key features among tens of thousands of candidates. To tackle this problem, various methods have been proposed. Among them, feature selection is undoubtedly the de facto standard, as it not only removes redundant features but also improves classification performance (Destrero et al., 2009). Following the feature selection phase, standard classifiers can perform classification based on the selected features. One of the most used classifiers is the Multi-Layer Perceptron (MLP); for example, it represents 61% of a typical Google TPU (Tensor Processing Unit) workload for production neural network applications, while convolutional neural networks represent just 5%
(Jouppi et al., 2017).

Motivation. Yet, MLPs cannot be employed directly on high-dimensional data due to the quadratic number of parameters in their fully-connected layers with respect to the number of neurons. This limits MLPs to several thousand neurons and a few thousand input features on commodity hardware, and implicitly limits their representational power. To address this issue, small steps have been made very recently. (Mocanu et al., 2018) proposes Sparse Evolutionary Training (SET), a method to train scalable MLPs with adaptive sparse connectivity (SET-MLPs). However, due to the limitations of typical deep learning libraries (e.g. operations optimized just for fully-connected layers and dense matrices), the largest SET-MLP used in (Mocanu et al., 2018) has just 12,082 neurons, quite a low representational power. In practice, their SET-MLP implementation uses the typical approach from the literature for sparsely connected layers, i.e. fully-connected layers with sparsity enforced by a binary mask over their weights; this approach, of course, is far from exploiting the full advantage of sparsity.
The first contribution of this paper is an efficient implementation framework which can create and train SET-MLP models with over one million neurons on a typical laptop, to handle data with tens of thousands of dimensions. This very high representational power is way beyond the capacity of state-of-the-art SET-MLPs and fully-connected MLPs. Secondly, we show that our proposed approach can be a good replacement for the current methods which employ both feature reduction and classifiers to perform classification on high-dimensional non-image datasets, such as microarray gene expression data with imbalanced classes. Thirdly, we show that our proposed solution is robust to the "curse of dimensionality", avoiding overfitting and achieving very good classification accuracy on a dataset with over 20,000 dimensions (input features) and fewer than 100 samples.
The remainder of this paper is organized as follows. Section 2 introduces and discusses our proposed methods. Section 3 presents the experiments performed and analyzes the results. Section 4 discusses two extreme SET-MLP models and shows that a SET-MLP model with over one million neurons can be trained on one CPU thread of a typical laptop. Section 5 discusses related work, while Section 6 concludes the paper and presents future research directions.
2 Methods
This section introduces our proposed methods. First, it discusses the Sparse Evolutionary Training procedure and its current limitations, given the state-of-the-art deep learning libraries and techniques. Secondly, it describes our novel proposed solution to address those limitations.
2.1 Sparse evolutionary training
Inspired by the fact that biological neural networks tend to be sparse rather than dense (Strogatz, 2001; Pessoa, 2014), and due to obvious computational resource limitations, there is an increasing interest in conceiving neural networks with a sparse topology (Mocanu et al., 2016). In (Mocanu et al., 2018), the authors proposed a novel concept: Artificial Neural Networks (ANNs) with adaptive sparse connectivity. The basic idea is to replace the fully-connected layers of any type of neural network with sparsely connected layers before training, and then, during the training process, to optimize jointly the weight values and the sparse network topology to fit the data distribution. They also proposed a scalable training method, i.e. Sparse Evolutionary Training (SET), to train such networks. Different from conventional methods, e.g. weight pruning (Cun et al., 1990; Han et al., 2015), which create sparse topologies during or after the training process, the adaptive topology of ANNs trained with SET is designed to be sparse from the beginning. This reduces quadratically the number of connections.
The SET algorithm is given in Appendix A, Algorithm 1. For the sake of convenience, we briefly describe the structure of SET, using the same notations as the original paper. The initial sparse topology is an Erdős–Rényi random graph (Erdős and Rényi, 1959), in which each possible connection between two consecutive layers of neurons (i.e. layers h^{k-1} and h^k) exists with probability

p(W^k_{ij}) = ε (n^k + n^{k-1}) / (n^k n^{k-1}),    (1)

where n^k and n^{k-1} represent the numbers of neurons of hidden layers h^k and h^{k-1}, respectively, W^k is the resulting sparse weight matrix, and ε is a hyperparameter that controls the sparsity level.
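To make the initialization concrete, the following sketch (our own illustration, not the authors' code; the layer sizes and the weight scale are arbitrary assumptions) builds such an Erdős–Rényi sparse weight matrix with SciPy, using the LIL format for construction and CSR for later use:

```python
import numpy as np
from scipy import sparse

def init_sparse_layer(n_prev, n_cur, epsilon=10, rng=None):
    # Each connection exists with probability epsilon*(n_prev+n_cur)/(n_prev*n_cur),
    # so the expected number of weights grows linearly in the layer sizes,
    # not quadratically as in a fully-connected layer.
    rng = rng or np.random.default_rng(0)
    p = epsilon * (n_prev + n_cur) / (n_prev * n_cur)
    mask = rng.random((n_prev, n_cur)) < p
    w = sparse.lil_matrix((n_prev, n_cur))   # LIL: cheap incremental construction
    for i, j in zip(*np.nonzero(mask)):
        w[i, j] = rng.normal(0.0, 0.1)       # small random initial weight (assumed scale)
    return w.tocsr()                         # CSR: fast matrix products later

w = init_sparse_layer(200, 100)
print(w.nnz, "connections instead of", 200 * 100)
```

For realistic layer sizes the mask would itself be generated sparsely rather than as a dense boolean array; the dense mask here only keeps the sketch short.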
Like a conventional fully-connected MLP (FC-MLP), SET-MLP employs backpropagation with stochastic gradient descent to learn the best weights for different datasets. However, the initial sparse network may not be suitable for every type of dataset, since it is generated randomly with no information about the data distribution. To overcome this problem, at the end of each epoch, magnitude-based weight pruning is used to eliminate the non-informative connections. More exactly, a fraction ζ of the largest negative weights and of the smallest positive weights (i.e. the weights closest to zero) is removed. After weight removal, an equal number of connections is randomly added to the bipartite layers. Roughly speaking, connection removal in SET represents natural selection, whereas the emergence of new connections corresponds to the mutation phase.

Very recently, several papers proposed various techniques for adaptive sparse connectivity. (Bellec et al., 2018) proposed DEEP R. (Mostafa and Wang, 2019) proposed dynamic sparse reparameterization for deep convolutional neural networks and showed that their proposed method and SET are faster and achieve better accuracy than DEEP R. (Zhu and Jin, 2018) proposed a simplified SET variant (after weight removal, they do not add random connections) to minimize the size of the network as much as possible, as needed in federated learning and on low-resource devices. Considering the above, in this paper we focus on the original SET algorithm, because it was shown to reach very high accuracy (Zhu and Jin, 2018; Mocanu et al., 2018), many times even higher than its fully-connected counterparts (Mocanu et al., 2018), while being very versatile and suitable for many neural network models (e.g. restricted Boltzmann machines, multi-layer perceptrons, and convolutional neural networks) and non-grid-like data.
However, the authors of SET used Keras with a Theano backend to implement their SET-MLP models. This implementation choice, while having the big advantage of offering wide architectural flexibility (e.g. various activation functions, optimizers, GPUs, and so on), which is very welcome while conceiving new algorithms, does not offer proper support for sparse matrix operations. This considerably limits the practical reach of SET-MLP with respect to its maximum possible number of neurons and, implicitly, its representational power. For these reasons, the maximum size of the SET-MLPs from (Mocanu et al., 2018) is 12,082 neurons.

2.2 Proposed solution
In this paper, we address the above limitations of the original SET implementation and show how a vanilla SET-MLP can be implemented from scratch using just pure Python, SciPy, and Cython. Our approach enables the construction of SET-MLPs at least two orders of magnitude larger, i.e. with over 1,000,000 neurons. What is more, such SET-MLPs do not need GPUs and run perfectly fine on a standard laptop.
2.2.1 Sparse Matrices Operations
The key element of our very efficient implementation is the use of sparse data structures from SciPy. It is important to use the right representation of a sparse matrix for each operation, because the different sparse matrix formats have different advantages and disadvantages (as briefly discussed in Appendix A), and no single format can handle all the operations that the sparse weight matrices of a SET-MLP require. Still, the conversions from one format to another are very fast and efficient. Thus, in our implementation, done in pure Python 3, we used specific sparse matrix formats for specific SET-MLP operations, together with their fast conversion capabilities, as follows.
Initialize sparsely connected layers. The sparse matrices which store the sparsely connected layers are created using the List of Lists (LIL) format and then transformed into the Compressed Sparse Row (CSR) format.
Feedforward phase. During this phase, the sparse weight matrices are stored and used in the CSR format.
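As an illustration of why CSR fits the feedforward phase, a minimal sparse forward pass could look as follows (our own sketch; the ReLU activation matches the models used later in the paper, but the function names and shapes are assumptions):

```python
import numpy as np
from scipy import sparse

def forward(x, w_csr, b):
    # x: (batch, n_in) dense activations; w_csr: (n_in, n_out) sparse weights.
    # The sparse product costs O(batch * nnz) instead of O(batch * n_in * n_out).
    z = np.asarray(x @ w_csr) + b
    return np.maximum(z, 0.0)  # ReLU, as in the SET-MLP hidden layers

w = sparse.random(50, 30, density=0.05, format="csr", random_state=0)
out = forward(np.ones((4, 50)), w, np.zeros(30))
print(out.shape)
```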
Backpropagation phase - computing gradients. The only operation which cannot be implemented with SciPy sparse matrix operations is computing the gradients for backpropagation (Rumelhart et al., 1986), for the simple reason that multiplying the vector of backpropagation errors from one layer with the vector of neuron activations from the previous layer would perform a considerable amount of unnecessary multiplications (for non-existent connections) and would create a dense matrix of updates. This dense matrix, besides being very slow to process, has a number of parameters quadratic in its numbers of rows and columns and fills a 16 GB RAM very fast (in practice, for fewer than 10,000 neurons per layer, given all the other necessary information which has to be stored in memory). To avoid this situation, we implemented the computations necessary for the batch weight updates in Cython. In this way, we compute, much faster than in pure Python, the gradient updates just for the existing connections. For this step, the sparse weight matrices are stored and used in the Coordinate list (COO) format.

Backpropagation phase - weights update. For this, the sparse weight matrices are used in the CSR format.
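The idea behind the Cython kernel can be illustrated in pure NumPy: with the COO row and column indices at hand, the gradient is computed only for the existing connections, never materializing the dense outer product (a sketch under our own naming; the real implementation is in Cython):

```python
import numpy as np

def sparse_grad(a_prev, delta, rows, cols):
    # a_prev: (batch, n_in) activations; delta: (batch, n_out) backprop errors.
    # rows/cols: COO coordinates of the existing connections.
    # grad[k] = batch mean of a_prev[:, rows[k]] * delta[:, cols[k]].
    return np.einsum("bk,bk->k", a_prev[:, rows], delta[:, cols]) / a_prev.shape[0]

# sanity check against the (wasteful) dense computation
rng = np.random.default_rng(0)
a, d = rng.random((8, 5)), rng.random((8, 4))
rows, cols = np.array([0, 2, 4]), np.array([1, 1, 3])
dense = (a.T @ d) / 8
assert np.allclose(sparse_grad(a, d, rows, cols), dense[rows, cols])
```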
2.2.2 Implementation of Weights Evolution
The key aspect of the SET method that sets it apart from conventional ANN training is the evolutionary scheme which modifies the connectivity of the layers at the end of every epoch. As the weight evolution routine is executed quite often, it needs to be implemented efficiently to ensure that SET-MLP training can be done as fast as possible. Furthermore, as the layer connections are extremely sparse in the SET scheme, the implementation should ensure that the sparsity level is maintained. In fact, it should exploit the sparsity while removing and adding new weights. Two implementations of the weight evolution scheme were coded in native Python using Numpy routines on SciPy sparse matrices. The first implementation is readable and intuitive, but does not exploit the full capabilities of the Numpy library. The second implementation is less readable, but vectorizes most of the operations using Numpy routines and performs the same work in much less time. Both implementations are explained and compared in Appendix A.
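As an illustration of the vectorized variant, the prune-and-regrow step can be sketched as follows (our own simplification: removing the ζ fraction of weights closest to zero is equivalent to the "largest negative plus smallest positive" rule above; the names and the regrowth scale are assumptions):

```python
import numpy as np
from scipy import sparse

def evolve_weights(w_csr, zeta=0.3, rng=None):
    rng = rng or np.random.default_rng(0)
    w = w_csr.tocoo()
    k = int(zeta * w.nnz)
    keep = np.argsort(np.abs(w.data))[k:]        # drop the k weights nearest zero
    rows, cols, data = w.row[keep], w.col[keep], w.data[keep]
    taken = set(zip(rows.tolist(), cols.tolist()))
    new_r, new_c = [], []
    n_in, n_out = w.shape
    while len(new_r) < k:                        # regrow k connections at random,
        i, j = int(rng.integers(n_in)), int(rng.integers(n_out))
        if (i, j) not in taken:                  # avoiding occupied positions
            taken.add((i, j)); new_r.append(i); new_c.append(j)
    vals = np.concatenate([data, rng.normal(0.0, 0.1, size=k)])
    coords = (np.concatenate([rows, new_r]), np.concatenate([cols, new_c]))
    return sparse.coo_matrix((vals, coords), shape=w.shape).tocsr()
```

Because the number of regrown connections equals the number pruned, the sparsity level chosen at initialization is preserved across epochs.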
3 Experiments and results
In this section, we evaluate and discuss the performance of our efficient SET-MLP implementation on four publicly available benchmark microarray datasets, detailed in Table 1. For a good understanding of SET-MLP performance, we compare it against another sparse MLP model (implemented by us in the same manner) in which the bipartite layers are initialized with an Erdős–Rényi topology but do not evolve over time, having a fixed sparsity pattern; this baseline is dubbed MLP, as in (Mocanu et al., 2018).
Table 1: The benchmark microarray datasets.

Dataset | No. of Samples | No. of Features | No. of Classes | Data Size
Leukemia (Haferlach et al., 2010) | 2096 | 54,675 | 18 | 1.93 GB
CLL-SUB-111 (Haslinger et al., 2004) | 111 | 11,340 | 3 | 5.9 MB
SMK-CAN-187 (Zhao et al., 2010) | 187 | 19,993 | 2 | 11.9 MB
GLI-85 (Freije et al., 2004) | 85 | 22,283 | 2 | 8.7 MB
3.1 Evaluation metrics and experimental setup
To evaluate the performance of the proposed method, we used the accuracy metric and the confusion matrix, which gives detailed visual information. The rows of the confusion matrix represent the predicted classes and the columns correspond to the true classes. The diagonal cells give the numbers of samples that are correctly classified, whereas the off-diagonal cells give the numbers of incorrectly classified samples. The row at the bottom of the confusion matrix gives the proportion of all examples belonging to each class that are correctly (green) and incorrectly (red) classified. The column on the far right gives the proportion of all samples predicted to belong to each class that are correctly (green) and incorrectly (red) classified.
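The quantities described above can be computed directly; a minimal sketch (our own helper, assuming every class occurs at least once, with rows as predicted classes and columns as true classes, matching the figures):

```python
import numpy as np

def confusion_matrix(y_pred, y_true, n_classes):
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_pred, y_true), 1)          # rows: predicted, cols: true
    recall = cm.diagonal() / cm.sum(axis=0)     # bottom row of the figures
    precision = cm.diagonal() / cm.sum(axis=1)  # right-most column of the figures
    return cm, recall, precision
```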
All the experiments have been executed on a typical laptop using a single CPU thread. The laptop configuration is as follows: (1) Hardware: CPU Intel Core i7-4700MQ, 2.40 GHz × 8, RAM 16 GB, Hard disk 500 GB; and (2) Software: Ubuntu 16.04, Python 3.5.2, Numpy 1.14, SciPy 0.19.1, and Cython 0.27.3.
For both models, SET-MLP and MLP, we have used two hidden layers with ReLU activation functions, and backpropagation with stochastic gradient descent and momentum to optimize the connection weights. The number of neurons of each layer is given in Table 2.

Table 2: Number of neurons per layer.

Dataset | input | hidden | hidden | output
Leukemia | 54,675 | 27,500 | 27,500 | 18
CLL-SUB-111 | 11,340 | 9,000 | 9,000 | 3
SMK-CAN-187 | 19,993 | 16,000 | 16,000 | 2
GLI-85 | 22,283 | 20,000 | none | 2
Table 3: Number of test samples per class for the Leukemia dataset.

Leukemia | Class Label | No. of Samples
Mature B-ALL with t(8;14) | 1 | 4
Pro-B-ALL with t(11q23)/MLL | 2 | 23
c-ALL/Pre-B-ALL with t(9;22) | 3 | 41
T-ALL | 4 | 58
ALL with t(12;21) | 5 | 19
ALL with t(1;19) | 6 | 12
ALL with hyperdiploid karyotype | 7 | 14
c-ALL/Pre-B-ALL without t(9;22) | 8 | 79
AML with t(8;21) | 9 | 14
AML with t(15;17) | 10 | 12
AML with inv(16)/t(16;16) | 11 | 9
AML with t(11q23)/MLL | 12 | 13
AML with normal karyotype + other abnormalities | 13 | 115
AML with complex aberrant karyotype | 14 | 18
CLL | 15 | 149
CML | 16 | 25
MDS | 17 | 68
Non-leukemia and healthy bone marrow | 18 | 26
3.2 Results on the Leukemia dataset
The Leukemia dataset is obtained from the NCBI GEO repository with the accession number GSE13159. It contains 2096 samples with 54,675 features each, divided into 18 classes. Among these 2096 samples, 1397 are used as training data and 699 as testing data. Table 3 shows the number of test samples in each class. It is worth highlighting that both the training and testing sets are unbalanced. For this dataset, the number of hidden neurons in each layer was set to 27,500, a value far above the usual number of neurons in fully-connected MLP models. We performed a small random search and set the learning rate to 0.005, the batch size to 5, momentum to 0.9 and weight decay to 0.0002. As specific SET-MLP settings, we chose suitable values for the sparsity level ε and the pruning rate ζ.
The accuracy of SET-MLP and of an MLP with two hidden layers on the Leukemia test set over 500 training epochs is shown in Fig. 1. We mention that we chose to discuss in detail the models with two hidden layers as they offered the best performance. The x-axis shows the training epochs; the y-axis shows the test accuracy. The figure shows that the accuracy of SET-MLP tends to stabilize towards 90% as the training progresses, while MLP stabilizes around 80% accuracy. To better understand the performance of our approach, Fig. 2 shows the confusion matrix on the Leukemia dataset for the peak accuracy of SET-MLP (88.10%). We highlight that, to the best of our knowledge, this accuracy is higher than the best result (81.11%) reported in the literature (Kumar and Rath, 2015) for this dataset. Therein, an ensemble classifier is proposed to deal with microarray data. This classifier connects several feature selection algorithms with a MapReduce-based proximal support vector machine (mrPSVM) to classify the microarray data. Their experimental results not only show that the ensemble of the mrPSVM classifier and feature selection approaches is a state-of-the-art method for microarray datasets, but also provide concrete information about the training and testing data. In our experiments, we employed the same training and testing data to guarantee the validity of the comparison.
3.3 Results on the CLL-SUB-111 dataset
CLL-SUB-111 is an unbalanced dataset containing gene expressions from high-density oligonucleotide arrays, covering both genetically and clinically distinct subgroups of B-cell chronic lymphocytic leukemia (B-CLL). It has 11,340 features and 111 samples, out of which 74 are used as the training set and 37 as the testing set. The number of neurons in each hidden layer is 9,000. All the hyperparameters were set as for Leukemia, except for the learning rate, which was 0.01.
The test accuracy over 500 epochs of SET-MLP and MLP on the CLL-SUB-111 dataset is depicted in Fig. 2(a). The x-axis shows the training epochs; the y-axis shows the test accuracy. After approximately 200 epochs, SET-MLP reaches an accuracy of 81.08% and MLP an accuracy of about 65%. Among the feature selection based methods applied to CLL-SUB-111, an accuracy of 78.38% was obtained using Incremental Wrapper-based Subset Selection (IWSS) (Bermejo et al., 2012). Although CLL-SUB-111 suffers seriously from an extremely small number of samples, we are still able to obtain outstanding performance with SET-MLP without any overfitting. Fig. 3(a) depicts the confusion matrix of SET-MLP on the CLL-SUB-111 dataset. We can observe that SET-MLP has excellent recall for class 1 (100.0%), even under extremely unfavorable conditions, i.e. very few training samples.
3.4 Results on the SMK-CAN-187 dataset
SMK-CAN-187 is an RNA dataset obtained from the normal bronchial epithelium of smokers with and without lung cancer, publicly available from (Zhao et al., 2010). It has 19,993 features and 187 samples. Out of these 187 samples, 124 were chosen as training data and 63 as testing data. Among the testing samples, 32 are labeled as '1' and 31 as '2'. Like the Leukemia and CLL-SUB-111 datasets, SMK-CAN-187 is also unbalanced. The number of neurons in the hidden layers was set to 16,000, the learning rate to 0.005, and the batch size to 5. Moreover, momentum was 0.9 and weight decay was set to 0.0002. Additionally, the SET-MLP specific parameters ε and ζ were set to 10 and 0.3, respectively.
Fig. 2(b) shows the accuracy performance of SET-MLP and MLP on the SMK-CAN-187 dataset. The accuracy of SET-MLP is 79.4%, which is better than the accuracy reported in (Wang et al., 2016) (74.87±2.32%), in which feature selection was performed by preserving class correlation. It is noteworthy that on this dataset the accuracy of SET-MLP stabilizes around 75% after 350 epochs, while MLP seems to suffer from overfitting. The confusion matrix of SET-MLP on SMK-CAN-187 is given in Fig. 3(b). On class 2, SET-MLP achieves a remarkable recall of 90.3%.
3.5 Results on the GLI-85 dataset
The GLI-85 dataset has 22,283 features and 85 samples. We chose to analyze this dataset as it reflects an extreme case in which very little labeled data is available. Out of these 85 samples, 56 are training data and 29 are testing data. There are 21 samples labeled as '2' and 8 labeled as '1'. The batch size was set to 1, while the remaining hyperparameters were the same as for Leukemia. With only 85 samples, any model trained on this dataset is clearly prone to overfitting. Due to the very small number of samples, the best results on this dataset were obtained using just one hidden layer, as discussed next.
Fig. 2(c) shows the accuracy performance of SET-MLP and MLP on the GLI-85 dataset. We can observe that the peak accuracy of SET-MLP is 100.0%, which is much better than the accuracy (94%) reported in (Taheri and Nezamabadipour, 2014), in which an ensemble of three filter methods and a metaheuristic algorithm is used. It is noteworthy that on this dataset the accuracy of MLP is stable at 100%, while SET-MLP reaches 100% just in a few cases, as it suffers some fluctuations. This suggests that MLP is even more stable than SET-MLP when a very small number of samples is available. The confusion matrix of SET-MLP on GLI-85 is given in Fig. 3(c) for the peak accuracy of 100%.

3.6 Results analysis
To better understand the connection reduction made by the SET procedure in a SET-MLP model in comparison with a fully-connected MLP (FC-MLP) with the same number of neurons, Fig. 5 and Table 4 provide the number of connections for the SET-MLP models discussed above and their FC-MLP counterparts on all four datasets. To clarify, we mention that it is impossible to also report the accuracy of the FC-MLPs, as they cannot run on a typical laptop due to their very high memory and computational requirements. Most probably, however, they would overfit due to their roughly one billion connections. It is clear that SET dramatically reduces the number of connections in MLPs.
Table 4: Number of connections and connection reduction.

Dataset | FC-MLP connections | SET-MLP connections | Reduction
Leukemia | 2,260,307,500 | 1,582,376 | 99.93%
CLL-SUB-111 | 183,087,000 | 409,033 | 99.78%
SMK-CAN-187 | 575,920,000 | 711,305 | 99.88%
GLI-85 | 490,270,000 | 486,350 | 99.90%
Table 5: Average training and testing time per epoch.

Dataset | Average training time per epoch (s) | Average testing time per epoch (s)
Leukemia | 66.12 | 3.80
CLL-SUB-111 | 0.59 | 0.03
SMK-CAN-187 | 2.89 | 0.17
GLI-85 | 2.74 | 0.07
For instance, a traditional FC-MLP on the Leukemia dataset would have 2,260,307,500 connections, while SET-MLP has just 1,582,376 connections. This practically means that SET achieves a 99.93% reduction in the number of connections. For CLL-SUB-111, the number of connections decreases from 183,087,000 to 409,033, a 99.78% reduction. On the SMK-CAN-187 dataset, the connection reduction given by SET is 99.88%, from 575,920,000 to 711,305. This quadratic reduction in the number of connections is what guarantees that SET-MLP runs fine on a standard laptop for datasets with tens (up to a few hundreds) of thousands of input features.
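These counts can be verified directly from the layer sizes in Table 2 (the SET-MLP connection counts below are quoted from Table 4):

```python
def fc_connections(layers):
    # Total number of weights in a fully-connected MLP with the given layer sizes.
    return sum(a * b for a, b in zip(layers, layers[1:]))

leukemia_fc = fc_connections([54675, 27500, 27500, 18])
assert leukemia_fc == 2_260_307_500
reduction = 1 - 1_582_376 / leukemia_fc
print(f"Leukemia reduction: {reduction:.2%}")   # 99.93%

assert fc_connections([11340, 9000, 9000, 3]) == 183_087_000     # CLL-SUB-111
assert fc_connections([19993, 16000, 16000, 2]) == 575_920_000   # SMK-CAN-187
```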
For a better understanding of the computational requirements of SET, Table 5 shows the average training and testing time per epoch of the SET-MLPs used on the four datasets. We can observe, as expected, that the training time increases with the number of features and samples. Still, it is worth highlighting that, although the average training time per epoch on Leukemia is relatively long (66.12 s), it fulfills an almost impossible mission: running such a large model on a commodity laptop.
In this paper, we have discussed the performance of SET-MLP models with two hidden layers on the Leukemia, CLL-SUB-111 and SMK-CAN-187 datasets, and with one hidden layer on the GLI-85 dataset. In Appendix B, we explain our choices for the number of hidden layers by comparing the performance of SET-MLP models with one, two, and three hidden layers on all datasets, and by discussing the beneficial effect of dropout (Hinton et al., 2012) on SET-MLP.
4 Discussion: extreme SET-MLP models
While in the previous section we analyzed the qualitative performance of our proposed approach, in this section we briefly discuss two extreme SET-MLP models on the largest dataset used in this paper, i.e. Leukemia. The goal is to assess how fast SET-MLP can achieve good accuracy and how large a SET-MLP model trainable on a typical laptop can be. For each model, we used a SET-MLP with two hidden layers and a softmax output layer. For the first model, the number of hidden neurons per layer was set to 1,000, while for the second model it was set to 500,000. In both cases, we used a very aggressive learning rate (0.05) and trained the models for 5 epochs. On each hidden layer we applied a dropout rate of 0.4. The other hyperparameters were set as in the previous section for Leukemia, and we used the same training/testing data split.
Table 6: Two extreme SET-MLP models on the Leukemia dataset vs. mrPSVM (ANOVA is used for feature selection).

Model | Hardware | Density level (%) | Total time (s) (training + testing) | Accuracy (%)
SET-MLP (54,675; 1,000; 1,000; 18) | 1 CPU thread | 1.04 | 65 | 82.88±1.18
SET-MLP (54,675; 500,000; 500,000; 18) | 1 CPU thread | 0.007 | 4914 | 81.83±1.11
mrPSVM with ANOVA (Kumar and Rath, 2015) | conventional hardware | n/a | 1265 | 81.1
mrPSVM with ANOVA (Kumar and Rath, 2015) | Hadoop cluster | n/a | 291 | 81.1

The numbers in brackets for SET-MLP reflect the number of neurons per layer, from input to output. The accuracy of SET-MLP is reported as the mean and standard deviation of 5 runs. The density level represents the percentage of existing connections in the SET-MLP model out of the total number of connections in its corresponding FC-MLP.
Table 6 presents the SET-MLP performance in comparison with the best state-of-the-art results of mrPSVM from (Kumar and Rath, 2015). We clarify that the goal of this experiment is not to obtain the best possible accuracy with SET-MLP. Still, the small SET-MLP model, which has in total 56,693 neurons and 581,469 connections, has a total training and testing time of 65 seconds. It is about 20 times faster than mrPSVM running on conventional hardware and about 4.5 times faster than mrPSVM running on a Hadoop cluster, while reaching 1.7% better accuracy. At the same time, its small standard deviation shows that the model is very stable. Furthermore, we highlight that the very large SET-MLP model, which has in total 1,054,693 neurons and about 19,383,046 connections, needs about 16 minutes per training epoch. In 5 epochs it reaches a good accuracy, better than the state of the art. All of this happens on 1 CPU thread of a typical laptop. We highlight that this is the first time in the literature that an MLP variant with over 1 million neurons is trained on a laptop, while the usual MLP models trained on a laptop have at most a few thousand neurons. In fact, it is hard to quantify, but according to (Goodfellow et al., 2016), the size of the largest neural networks currently running in the cloud is about 10 to 20 million neurons. Therefore, our results emphasize even more the capabilities of SET-MLPs and open the path for new research directions.
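The density levels in Table 6 follow directly from the layer sizes and the connection counts quoted above; a quick check:

```python
def density_percent(layers, n_connections):
    # Fraction (in %) of existing connections relative to the FC-MLP counterpart.
    dense = sum(a * b for a, b in zip(layers, layers[1:]))
    return 100.0 * n_connections / dense

small = density_percent([54675, 1000, 1000, 18], 581_469)
large = density_percent([54675, 500_000, 500_000, 18], 19_383_046)
print(round(small, 2), round(large, 3))   # 1.04 0.007
```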
5 Related work
Traditional feature selection methods can be roughly divided into three categories: filter methods, wrapper methods and embedded methods. Independent of classifiers, filter methods employ a certain criterion to rank the features and use the features with the highest scores for the classification task. According to whether features are evaluated individually or in batches, filter methods can be divided into univariate filters (Liu and Setiono, 1995) and multivariate filters (Hall, 1999). Exploiting the evaluation role of classifiers, wrapper methods employ classification performance as feedback to repeatedly assess the selected features, which, in turn, helps to find the best feature subset. WrapperSubsetEval (Hall et al., 2009) is a general wrapper method which can be combined with various learning algorithms. Combining the advantages of filter methods and wrapper methods, embedded methods are capable of utilizing the biases of classifiers while reducing the computational cost. Lasso regularization (Tibshirani, 1996) has been added to objective functions to eliminate non-important features, by driving the coefficients of non-important features towards zero. However, with the explosive growth of the internet, more and more datasets with ultra-high dimensions have become available, which are far beyond the capacity of commodity computers. To address this problem, distributed computing has been employed. Ibrahim et al. (Ibrahim et al., 2014) generalized Principal Components Analysis (ensemble-PCA) to Deep Belief Networks (DBN) to transfer high-dimensional data to low-dimensional nodes. A distributed decentralized algorithm for the k-Nearest Neighbor (kNN) graph (Plaku and Kavraki, 2007) was proposed to distribute the computation of the kNN graph for ultra-big datasets by exploiting the sequential structure of kNN data. MapReduce (Dean and Ghemawat, 2008) is an efficient programming model used by Google to process different types of large raw data. Chu et al. (Chu et al., 2007) applied MapReduce to ten learning algorithms and obtained a promising speedup. Developed by Doug Cutting, Apache Hadoop is an open-source software framework based on MapReduce that accomplishes distributed data storage and processing (White, 2012). It partitions big data into blocks, in a distributed fashion, and then employs parallel tools to process the blocks on one big cluster. Kumar et al. (Kumar and Rath, 2015) proposed a classifier framework combining MapReduce and a proximal Support Vector Machine (mrPSVM). Their experimental results on several high-dimensional, low-sample benchmark datasets demonstrated that the ensemble of the mrPSVM classifier and feature selection methods outperforms the classical approaches.

As a promising method that has been widely used in image recognition, speech recognition and language translation, deep learning has also been employed in classification tasks with high-dimensional data. In (Fakoor et al., 2013), an autoencoder approach was combined with a softmax regression classifier to address cancer data classification problems. In (Ibrahim et al., 2014), a multilevel feature selection method using a deep learning and active learning approach was introduced; the experiments verified the superiority of this approach in terms of classification accuracy. Additionally, in (Danaee et al., 2017), a Stacked Denoising Autoencoder (SDAE) was employed to represent high-dimensional features with lower-dimensional and more meaningful features.

Although the above-mentioned algorithms can perform well on some datasets, one common disadvantage of all of them is that the systems are hierarchical, meaning that at least a dimensionality reduction technique and an efficient classifier are needed. There is a serious dependency between the two subsystems: when one subsystem is not good enough, the overall performance cannot be guaranteed, due to the information loss caused by the feature selection phase. That is why the aim of this paper was to introduce a method capable of processing and classifying high-dimensional data with one unitary model, i.e. SET-MLP.
6 Conclusion
Microarray data classification has been treated in the literature as a difficult task due to the very high number of features and the small number of examples. Besides that, this type of data suffers from imbalance and data-shift problems.
In this paper, an efficient implementation of SET-MLP, a sparse multi-layer perceptron trained with the sparse evolutionary training procedure, is proposed to deal with high-dimensional microarray datasets. This implementation uses just Python 3, sparse data structures from SciPy, and Cython. With it, we have created, for the first time in the literature, sparse MLP models with over one million neurons which can be trained on a standard laptop using a single CPU thread and no GPU. This is two orders of magnitude more than state-of-the-art fully-connected MLPs and SET-MLPs trained on commodity hardware.
Besides that, we demonstrated on four microarray datasets, with tens of thousands of input features and at most two thousand samples each, that our approach reduces the number of connections quadratically in large MLPs (about 99.9% fewer connections), while outperforming the state-of-the-art methods on these datasets for the classification task. Moreover, our proposed SET-MLP models proved robust to overfitting, imbalance, and data shift, which is unusual for fully connected MLPs. Last but not least, the results suggest that our approach copes efficiently with the "curse of dimensionality", being capable of learning from small amounts of labeled data and outperforming the state-of-the-art methods (ensembles of classifiers and feature selection methods) currently employed on high-dimensional non-grid-like (tabular) data.
In the future, we intend to focus also on other types of neural layers, such as the convolutional layers of CNNs, which are widely used for data with a grid-like topology. Furthermore, we intend to extend this work to problems from other fields which suffer from the "curse of dimensionality" and which have ultra-high-dimensional data (e.g. social networks, financial networks, semantic networks). A last but not least research direction would be to parallelize our implementation to use all CPU threads of a typical workstation efficiently, and to incorporate it into the usual deep learning frameworks, such as TensorFlow or PyTorch. This would probably allow us to scale SET-MLP models by another order of magnitude (up to a few tens of millions of neurons), while still using commodity hardware.
7 Acknowledgements
We thank Ritchie Vink (https://www.ritchievink.com/, last visit 25 Jan 2019) for providing on GitHub a vanilla fully connected MLP implementation, and Thomas Hagebols (https://github.com/ThomasHagebols, last visit 25 Jan 2019) for analyzing the performance of SciPy sparse matrix operations.
References
 Bellec et al. (2018) G. Bellec, D. Kappel, W. Maass, and R. Legenstein. Deep rewiring: Training very sparse deep networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=BJ_wN01C.
 Bermejo et al. (2012) P. Bermejo, L. de la Ossa, J. A. Gámez, and J. M. Puerta. Fast wrapper feature subset selection in high-dimensional datasets by means of filter re-ranking. Knowledge-Based Systems, 25(1):35–44, 2012.
 Chu et al. (2007) C.T. Chu, S. K. Kim, Y.A. Lin, Y. Yu, G. Bradski, K. Olukotun, and A. Y. Ng. Mapreduce for machine learning on multicore. In Advances in neural information processing systems, pages 281–288, 2007.
 Cun et al. (1990) Y. L. Cun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605. Morgan Kaufmann, 1990.
 Danaee et al. (2017) P. Danaee, R. Ghaeini, and D. A. Hendrix. A deep learning approach for cancer detection and relevant gene identification. In Pacific Symposium on Biocomputing 2017, pages 219–229. World Scientific, 2017.
 Dean and Ghemawat (2008) J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
 Destrero et al. (2009) A. Destrero, S. Mosci, C. De Mol, A. Verri, and F. Odone. Feature selection for high-dimensional data. Computational Management Science, 6(1):25–40, 2009.
 Erdős and Rényi (1959) P. Erdős and A. Rényi. On random graphs I. Publicationes Mathematicae (Debrecen), 6:290–297, 1959.
 Fakoor et al. (2013) R. Fakoor, F. Ladhak, A. Nazi, and M. Huber. Using deep learning to enhance cancer diagnosis and classification. In Proceedings of the International Conference on Machine Learning, volume 28, 2013.
 Freije et al. (2004) W. A. Freije, F. E. Castro-Vargas, Z. Fang, S. Horvath, T. Cloughesy, L. M. Liau, P. S. Mischel, and S. F. Nelson. Gene expression profiling of gliomas strongly predicts survival. Cancer Research, 64(18):6503–6510, 2004.
 Goodfellow et al. (2016) I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning (Subsection 1.2.3). The MIT Press, 2016. ISBN 0262035618, 9780262035613.
 Haferlach et al. (2010) T. Haferlach, A. Kohlmann, L. Wieczorek, G. Basso, G. T. Kronnie, M.-C. Béné, J. V. De, J. M. Hernández, W.-K. Hofmann, K. I. Mills, et al. Clinical utility of microarray-based gene expression profiling in the diagnosis and subclassification of leukemia: report from the international microarray innovations in leukemia study group. Journal of Clinical Oncology, 28(15):2529–2537, 2010.
 Hall et al. (2009) M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The weka data mining software: an update. ACM SIGKDD explorations newsletter, 11(1):10–18, 2009.
 Hall (1999) M. A. Hall. Correlation-based feature selection for machine learning. PhD thesis, The University of Waikato, 1999.
 Han et al. (2015) S. Han, J. Pool, J. Tran, and W. J. Dally. Learning both weights and connections for efficient neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems  Volume 1, NIPS’15, pages 1135–1143, Cambridge, MA, USA, 2015. MIT Press. URL http://dl.acm.org/citation.cfm?id=2969239.2969366.
 Haslinger et al. (2004) C. Haslinger, N. Schweifer, S. Stilgenbauer, H. Döhner, P. Lichter, N. Kraut, C. Stratowa, and R. Abseher. Microarray gene expression profiling of B-cell chronic lymphocytic leukemia subgroups defined by genomic aberrations and VH mutation status. Journal of Clinical Oncology, 22(19):3937–3949, 2004.
 Hinton et al. (2012) G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
 Ibrahim et al. (2014) R. Ibrahim, N. A. Yousri, M. A. Ismail, and N. M. El-Makky. Multi-level gene/miRNA feature selection using deep belief nets and active learning. In Engineering in Medicine and Biology Society (EMBC), 2014 36th Annual International Conference of the IEEE, pages 3957–3960. IEEE, 2014.
 Jouppi et al. (2017) N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on, pages 1–12. IEEE, 2017.
 Kumar and Rath (2015) M. Kumar and S. K. Rath. Classification of microarray using MapReduce based proximal support vector machine classifier. Knowledge-Based Systems, 89:584–602, 2015.
 Liu and Setiono (1995) H. Liu and R. Setiono. Chi2: Feature selection and discretization of numeric attributes. In Proceedings of the Seventh International Conference on Tools with Artificial Intelligence, pages 388–391. IEEE, 1995.
 Mocanu et al. (2016) D. C. Mocanu, E. Mocanu, P. H. Nguyen, M. Gibescu, and A. Liotta. A topological insight into restricted Boltzmann machines. Machine Learning, 104(2):243–270, Sep 2016. ISSN 1573-0565. doi: 10.1007/s10994-016-5570-z. URL https://doi.org/10.1007/s10994-016-5570-z.
 Mocanu et al. (2018) D. C. Mocanu, E. Mocanu, P. Stone, P. H. Nguyen, M. Gibescu, and A. Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9(1):2383, 2018.
 Mostafa and Wang (2019) H. Mostafa and X. Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization, 2019. URL https://openreview.net/forum?id=S1xBioR5KX.
 Pessoa (2014) L. Pessoa. Understanding brain networks and brain organization. Physics of life reviews, 11(3):400–435, 2014.
 Plaku and Kavraki (2007) E. Plaku and L. E. Kavraki. Distributed computation of the k-nn graph for large high-dimensional point sets. Journal of Parallel and Distributed Computing, 67(3):346–359, 2007.
 Rumelhart et al. (1986) D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
 Simon et al. (2003) R. Simon, M. D. Radmacher, K. Dobbin, and L. M. McShane. Pitfalls in the use of dna microarray data for diagnostic and prognostic classification. Journal of the National Cancer Institute, 95(1):14–18, 2003.
 Strogatz (2001) S. H. Strogatz. Exploring complex networks. Nature, 410(6825):268–276, 2001.
 Taheri and Nezamabadi-pour (2014) N. Taheri and H. Nezamabadi-pour. A hybrid feature selection method for high-dimensional data. In Computer and Knowledge Engineering (ICCKE), 2014 4th International eConference on, pages 141–145. IEEE, 2014.
 Tibshirani (1996) R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
 Wang et al. (2016) J. Wang, J. Wei, and Z. Yang. Supervised feature selection by preserving class correlation. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages 1613–1622. ACM, 2016.
 White (2012) T. White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.
 Zhao et al. (2010) Z. Zhao, F. Morstatter, S. Sharma, S. Alelyani, A. Anand, and H. Liu. Advancing feature selection research. ASU feature selection repository, pages 1–28, 2010.
 Zhu and Jin (2018) H. Zhu and Y. Jin. Multi-objective evolutionary federated learning. CoRR, abs/1812.07478, 2018. URL http://arxiv.org/abs/1812.07478.
Appendix A Implementation details
A.1 Sparse data structures
Below, the SciPy sparse data structures used to implement SET-MLPs are briefly discussed; the interested reader is referred to the SciPy documentation (https://docs.scipy.org/doc/scipy/reference/sparse.html, last visit 3rd June 2018) for detailed information.

Compressed Sparse Row (CSR) sparse matrix: The data is stored in three vectors. The first vector contains nonzero values, the second one stores the extents of rows, and the third one contains the column indices of the nonzero values. This format is very fast for many arithmetic operations, but slow for changes to the sparsity pattern.

Linked List (LIL) sparse matrix: This format saves nonzero values in rowbased linked lists. Items in the rows are also sorted. The format is fast and flexible in changing the sparsity patterns, but inefficient for arithmetic matrix operations.

Coordinate list (COO) sparse matrix: This format saves the nonzero elements and their coordinates (i.e. row and column). It is very fast in constructing new sparse matrices, but it does not support arithmetic matrix operations and slicing.

Dictionary Of Keys (DOK) sparse matrix: This format has a dictionary that maps row and column pairs to the value of nonzero elements. It is very fast in incrementally constructing new sparse matrices, but can not handle arithmetic matrix operations.
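As a toy illustration of the trade-offs listed above (this snippet is ours, not taken from the SET-MLP code), the four formats can be constructed and converted with SciPy as follows:

```python
import numpy as np
from scipy import sparse

# COO: fast construction from (value, (row, col)) triplets.
rows = np.array([0, 1, 3, 3])
cols = np.array([2, 0, 1, 3])
vals = np.array([0.5, -0.2, 0.9, 0.1])
w_coo = sparse.coo_matrix((vals, (rows, cols)), shape=(4, 4))

# CSR: fast arithmetic, e.g. the sparse matrix-vector products in a layer.
w_csr = w_coo.tocsr()
y = w_csr @ np.ones(4)        # dense result vector of shape (4,)

# LIL: cheap changes to the sparsity pattern (e.g. adding a connection);
# the same assignment on a CSR matrix would be slow and raise a warning.
w_lil = w_csr.tolil()
w_lil[0, 0] = 0.3

# DOK: dict mapping (row, col) -> value; fast incremental construction.
w_dok = sparse.dok_matrix((4, 4))
w_dok[2, 2] = 1.0

print(w_csr.nnz, w_lil.nnz, w_dok.nnz)   # number of stored nonzeros
```

Converting between formats is cheap relative to using the wrong format for an operation, which is why the implementations below switch representations depending on the step.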
A.2 Weights evolution - Implementation I
In this implementation, the sparse weight matrices in the CSR format are converted to three vectors holding the row indices, the column indices, and the values of the nonzero elements (using either the COO or the LIL format). The values are then compared in a for-loop against the threshold to decide whether each weight is kept or discarded, as per the user-specified values. To ensure that the total number of nonzeros in the weight matrix remains the same, random connections between neurons need to be created. Again, a for-loop is used to create new random connections incrementally until the total number of nonzeros equals the original number of nonzeros.
Most of the processing time is spent in the for-loops and while-loops, as confirmed by a Python code-profiling tool (line_profiler by Robert Kern, available online: https://github.com/rkern/line_profiler). Furthermore, as the weights are constantly accessed by row and column index, this method does not exploit the sparsity of the weight matrix. The code profile showed that removing weights from the weight matrix takes about 15% of the total time in an epoch, and adding new random connections takes about 50% (of course, these percentages also depend on the size of the dataset, and they become smaller as the dataset gets larger). The detailed algorithm is given in Algorithm 2.
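The loop-based scheme described above can be sketched as follows. This is an illustrative re-implementation, not the authors' code: the function name `evolve_weights_loop`, the `zeta` fraction of removed weights, and the magnitude-quantile threshold are our own assumptions.

```python
import numpy as np
from scipy import sparse

def evolve_weights_loop(w_csr, zeta=0.3, rng=None):
    # Sketch of Implementation I: drop roughly the fraction `zeta` of the
    # smallest-magnitude weights with a Python for-loop, then regrow random
    # connections one by one (the slow while-loop) until nnz is restored.
    if rng is None:
        rng = np.random.default_rng()
    coo = w_csr.tocoo()
    n_target = coo.nnz

    # Magnitude threshold below which weights are discarded.
    thresh = np.quantile(np.abs(coo.data), zeta)

    kept = []
    for r, c, v in zip(coo.row, coo.col, coo.data):   # slow per-element loop
        if abs(v) > thresh:
            kept.append((r, c, v))

    existing = {(r, c) for r, c, _ in kept}
    n_rows, n_cols = w_csr.shape
    while len(kept) < n_target:                       # slow incremental regrowth
        r, c = rng.integers(n_rows), rng.integers(n_cols)
        if (r, c) not in existing:
            existing.add((r, c))
            kept.append((r, c, rng.normal(scale=0.01)))

    rows, cols, vals = zip(*kept)
    return sparse.coo_matrix((vals, (rows, cols)), shape=w_csr.shape).tocsr()

# Example: evolve a random sparse weight matrix; nnz is preserved.
w = sparse.random(100, 80, density=0.1, format="csr", random_state=42)
w_new = evolve_weights_loop(w, zeta=0.3, rng=np.random.default_rng(0))
print(w.nnz, w_new.nnz)
```

The per-element Python loops are exactly where the profiled 15% (removal) and 50% (regrowth) of epoch time are spent, motivating the vectorized variant below.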
A.3 Weights evolution - Implementation II
In order to make full use of the advantages of the different sparse matrix formats, we also propose Fast Weights Evolution (FWE). In FWE, the sparse weight matrices in the CSR format are likewise converted, using the COO format, to three vectors holding the row indices, the column indices, and the values of the nonzero elements. The value vector is compared a single time against the minimum and maximum threshold values using vectorized NumPy operations. This enables the identification of the indices of small weights for their fast deletion. Next, the remaining row and column indices are stored together in an array, and a list of all the coordinate arrays of the nonzero elements is created. This list is used directly to determine the random row and column indices of the added weights, ensuring that the number of connections between the neurons stays constant. As the weights are sparse, the size of this list is much smaller than the full size of the weight matrix, so performing all the computations on the list is faster. The detailed algorithm is given in Algorithm 3. The running times of the two implementations are compared in Table 7, which shows that Implementation II is more efficient than Implementation I.
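A vectorized sketch of FWE, under the same assumptions as the previous snippet (our own function name and quantile-based threshold; the authors' actual code is in the linked repository):

```python
import numpy as np
from scipy import sparse

def evolve_weights_fast(w_csr, zeta=0.3, rng=None):
    # Sketch of Implementation II (FWE): one vectorized magnitude comparison
    # removes the smallest weights; new random coordinates are then drawn
    # against the compact set of surviving (row, col) pairs.
    if rng is None:
        rng = np.random.default_rng()
    coo = w_csr.tocoo()
    n_target = coo.nnz

    # Single vectorized comparison instead of a Python for-loop.
    thresh = np.quantile(np.abs(coo.data), zeta)
    keep = np.abs(coo.data) > thresh
    rows, cols, vals = coo.row[keep], coo.col[keep], coo.data[keep]

    # The surviving coordinates are far fewer than the dense matrix cells,
    # so membership checks against this set stay cheap.
    occupied = set(zip(rows.tolist(), cols.tolist()))
    n_rows, n_cols = w_csr.shape
    new_r, new_c = [], []
    while len(occupied) < n_target:
        r, c = int(rng.integers(n_rows)), int(rng.integers(n_cols))
        if (r, c) not in occupied:
            occupied.add((r, c))
            new_r.append(r)
            new_c.append(c)

    rows = np.concatenate([rows, np.asarray(new_r, dtype=rows.dtype)])
    cols = np.concatenate([cols, np.asarray(new_c, dtype=cols.dtype)])
    vals = np.concatenate([vals, rng.normal(scale=0.01, size=len(new_r))])
    return sparse.coo_matrix((vals, (rows, cols)), shape=w_csr.shape).tocsr()

# Same interface as the loop-based sketch, but without per-element loops
# over the existing weights.
w = sparse.random(100, 80, density=0.1, format="csr", random_state=42)
w_new = evolve_weights_fast(w, zeta=0.3, rng=np.random.default_rng(0))
print(w.nnz, w_new.nnz)   # nnz is preserved
```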
We know that it is hard to reproduce an efficient implementation of an algorithm given just the above details, so we mention that our SET-MLP proof-of-concept implementation is available online (Implementation 2 from https://github.com/dcmocanu/sparse-evolutionary-artificial-neural-networks).
Table 7: Running time of the two weight-evolution implementations for various sparse weight matrix sizes.

Matrix size     | Implementation I (s) | Implementation II (s)
500 × 500       | 0.58                 | 0.14
2000 × 2000     | 2.56                 | 0.71
8000 × 8000     | 11.13                | 2.08
15000 × 15000   | 24.14                | 3.75
Appendix B Comparative study of SET-MLPs with one, two, and three hidden layers on the four benchmark datasets
The number of neurons per hidden layer and the other hyperparameters are set to be the same as in the models from the paper. Fig. 6 summarizes these experiments. From the first row, it can be inferred that the SET-MLP with two hidden layers reaches the highest peak accuracy (88.12%) and has relatively the most robust performance on the Leukemia dataset. Similarly, the SET-MLP with two hidden layers reaches an outstanding accuracy (81.11%) on the CLL-SUB-111 dataset, while the accuracy cannot reach 80% with one or three hidden layers.
As expected, but at the same time giving the most interesting results, due to the very small number of samples of GLI-85 (Fig. 6, third row), the SET-MLP with one hidden layer avoids overfitting, in exchange for quite an oscillating behavior. At the same time, the SET-MLPs with two or three hidden layers, even though they are also capable of reaching a perfect accuracy of 100%, suffer a dramatic drop in accuracy to about 80% after about 200 epochs. We hypothesize that this happens due to overfitting, as the number of training samples is extremely small. If this is the case, adding dropout regularization to SET-MLP should alleviate the problem. We applied dropout with a 0.5 dropout rate to both hidden layers. The performance is shown in Fig. 7. It is clear that the accuracy of the SET-MLP with dropout keeps the same trend as before, without any drop in accuracy after 200 epochs.