1 Introduction
In spite of their good performance, deep neural networks suffer from generalization issues and overfitting, especially when the number of parameters is much higher than the number of training examples. While understanding this trade-off is still an open research question, various techniques have been proposed to handle the problem, including implicit norm regularization (Neyshabur et al., 2014), a two-stage training process (Zheng et al., 2018), dropout (Srivastava et al., 2014), batch normalization (Ioffe & Szegedy, 2015), etc. Recently, many complexity measures have emerged to explain what drives generalization in deep networks, such as sharpness (Keskar et al., 2016), PAC-Bayes bounds (Dziugaite & Roy, 2017) and margin-based measures (Neyshabur et al., 2017b). Neyshabur et al. (2017a) analyze different complexity measures and show that a combination of some of these measures seems to better capture the generalization behavior of neural networks.
On the other hand, the ability of sparse neural networks to reduce the number of parameters can dramatically shrink the model size and, therefore, relieve overfitting. However, traditional algorithms for training such networks start from an initial fully-connected network, which is trained first; the unimportant connections of this network are then pruned using various techniques, e.g. (LeCun et al., 1990; Hassibi & Stork, 1993; Han et al., 2017; Narang et al., 2017; Lee et al., 2018), to obtain a sparse topology. The initial fully-connected network is a critical bottleneck that hinders the scalability of neural networks, due to its quadratic number of (mostly unnecessary) parameters with respect to its number of neurons. To address this issue, Mocanu et al. (2018) proposed a new class of models, i.e. intrinsically sparse neural networks with adaptive sparse connectivity. These models have a linear number of parameters with respect to the number of neurons, do not require an initial fully-connected network, and can be trained with the Sparse Evolutionary Training (SET) procedure.
In this paper, we introduce an improvement to SET, dubbed SET with Neurons Pruning (NPSET), to further reduce the number of hidden neurons and parameters. Our approach identifies and eliminates a large number of non-informative hidden neurons, together with their connections, by adding a neurons pruning step to the SET procedure. Like SET, NPSET starts from a sparse topology, thus having a clear advantage over state-of-the-art methods that start from fully-connected topologies. The experimental results show that removing hidden neurons with very few outgoing connections allows NPSET to further reduce computational costs in both phases (training and inference). Moreover, we show that intrinsically sparse MLPs trained with either SET or NPSET have higher generalization ability than their fully-connected counterparts.
2 Related Work
Table 1: Characteristics of the 15 classification datasets used in the experiments.
Dataset  Samples (#)  Features (#)  Data Type  Classes (#)  Training Samples (#)  Test Samples (#)
Leukemia  72  7070  Discrete  2  48  24
PCMAC  1943  3289  Discrete  2  1295  648
Lung-discrete  73  325  Discrete  7  48  25
gisette  7000  5000  Continuous  2  4666  2334
lung  203  3312  Continuous  5  135  68
CLL-SUB-111  111  11340  Continuous  3  74  37
Carcinom  174  9183  Continuous  11  116  58
orlraws10P  100  10304  Continuous  10  66  34
TOX-171  171  5748  Continuous  4  114  57
Prostate-GE  102  5966  Continuous  2  68  34
arcene  200  10000  Continuous  2  133  67
madelon  2600  500  Continuous  2  1733  867
Yale  165  1024  Continuous  15  110  55
GLIOMA  50  4434  Continuous  4  33  17
RELATHE  1427  4322  Continuous  2  951  476
Inspired by Darwinian theory, Sparse Evolutionary Training (SET) (Mocanu et al., 2018) is a simple but efficient training method that enables an initially sparse topology of bipartite layers of neurons to evolve towards a scale-free topology while learning to fit the data characteristics. After each training epoch, the connections whose weights are closest to zero are removed (magnitude-based removal). After that, new connections (in the same amount as the removed ones) are randomly added to the network. This offers benefits in computational time (notably faster training and testing compared with fully-connected bipartite layers) and quadratically lower memory requirements. The interested reader is referred to (Mocanu et al., 2018) for a detailed discussion, and to (Mostafa & Wang, 2019; Zhu & Jin, 2018; Sohoni et al., 2019) for further developments and analyses.
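For concreteness, the sketch below illustrates the connection rewiring step described above on a single sparse layer. It is a minimal NumPy illustration, not the authors' implementation; the function name set_rewire, the parameter zeta (fraction of connections removed per epoch), and the regrowth weight scale are our own illustrative choices.

```python
import numpy as np

def set_rewire(W, mask, zeta=0.3, rng=None):
    """One SET evolution step on a sparse weight matrix.

    W    : dense array holding the weights of existing connections (0 elsewhere)
    mask : boolean array, True where a connection currently exists
    zeta : fraction of existing connections to remove (those closest to zero)
    """
    rng = rng if rng is not None else np.random.default_rng()
    magnitudes = np.abs(W[mask])
    n_remove = int(zeta * magnitudes.size)
    if n_remove == 0:
        return W, mask

    # 1) magnitude-based removal: drop the connections with weights closest to zero
    threshold = np.sort(magnitudes)[n_remove - 1]
    keep = mask & (np.abs(W) > threshold)

    # 2) random regrowth: add the same number of new connections at empty positions
    n_new = int(mask.sum() - keep.sum())
    empty = np.flatnonzero(~keep)
    new_idx = rng.choice(empty, size=n_new, replace=False)
    new_mask = keep.copy()
    new_mask.flat[new_idx] = True

    # removed connections are zeroed; new ones start from small random weights
    new_W = np.where(keep, W, 0.0)
    new_W.flat[new_idx] = rng.normal(0.0, 0.01, size=n_new)
    return new_W, new_mask
```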
Table 2: Maximum classification accuracy (%) of the six models on each dataset; values in parentheses give the difference with respect to the SET-MLP baseline.
Dataset  SET-MLP (%)  NPSET-MLP (%)  NPSET-MLP (%)  NPSET-MLP (%)  Direct SET-MLP (%)  Direct FC-MLP (%)
Leukemia  87.50  87.50 (+0.00)  87.50 (+0.00)  87.50 (+0.00)  87.50 (+0.00)  75.00 (-12.50)
PCMAC  87.35  88.43 (+1.08)  86.73 (-0.62)  88.43 (+1.08)  87.81 (+0.46)  85.19 (-2.16)
Lung-discrete  88.00  88.00 (+0.00)  88.00 (+0.00)  84.00 (-4.00)  88.00 (+0.00)  80.00 (-8.00)
gisette  97.43  97.52 (+0.09)  97.64 (+0.21)  97.52 (+0.09)  97.47 (+0.04)  97.60 (+0.17)
lung  92.65  94.12 (+1.47)  94.12 (+1.47)  92.65 (+0.00)  94.12 (+1.47)  92.65 (+0.00)
CLL-SUB-111  67.57  75.68 (+8.11)  62.16 (-5.41)  67.57 (+0.00)  70.27 (+2.70)  59.46 (-8.11)
Carcinom  79.31  81.03 (+1.72)  75.86 (-3.45)  75.86 (-3.45)  77.59 (-1.72)  68.97 (-10.34)
orlraws10P  88.24  88.24 (+0.00)  85.29 (-2.95)  88.24 (+0.00)  88.24 (+0.00)  79.41 (-8.77)
TOX-171  91.23  91.23 (+0.00)  85.97 (-5.26)  89.47 (-1.76)  91.23 (+0.00)  82.46 (-8.77)
Prostate-GE  88.24  88.24 (+0.00)  88.24 (+0.00)  88.24 (+0.00)  88.24 (+0.00)  79.41 (-8.83)
arcene  77.61  77.61 (+0.00)  82.09 (+4.48)  74.63 (-2.98)  79.10 (+1.49)  77.61 (+0.00)
madelon  71.16  71.28 (+0.12)  71.74 (+0.58)  70.13 (-1.03)  71.05 (-0.11)  56.40 (-14.76)
Yale  70.91  74.55 (+3.64)  69.09 (-1.82)  70.91 (+0.00)  69.09 (-1.82)  63.64 (-7.27)
GLIOMA  76.47  76.47 (+0.00)  76.47 (+0.00)  76.47 (+0.00)  76.47 (+0.00)  64.71 (-11.76)
RELATHE  89.71  90.55 (+0.84)  89.71 (+0.00)  89.92 (+0.21)  87.61 (-2.10)  90.76 (+1.05)
3 Methods
In this section, we detail our proposed method (NPSET).
3.1 Why Neurons Pruning.
The sparse topology allows SET to create Multi-Layer Perceptrons with hundreds of thousands of neurons (Liu et al., 2019), which guarantees their ability to represent all sorts of features and to approximate the functions needed to tackle different problems. However, such a large number of neurons is also a double-edged sword, as it can lead to significant redundancy. For example, on the CIFAR-10 dataset, which has 3072 input features, the first hidden layer has 4000 neurons in (Mocanu et al., 2018). Obviously, not all neurons provide important information to the outputs. To verify this, we tested whether removing the hidden neurons with the fewest outgoing connections decreases performance. Figure 1 shows the effect of removing neurons from the first hidden layer on the Lung-discrete dataset (due to space limitations, we show only one dataset). It can be observed that the model maintains or even improves its accuracy after removing these unimportant neurons. In order to remove these non-informative neurons, at the beginning of each training epoch we remove a certain fraction of the hidden neurons that have the smallest numbers of connections; with high probability, they will not have a notable impact on the model performance.
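As an illustration of this step, the snippet below removes the least-connected neurons of one hidden layer given boolean connectivity masks. It is a sketch under our own assumptions: the names prune_hidden_neurons and fraction are illustrative, and the pruning fraction shown is not the value used in the paper.

```python
import numpy as np

def prune_hidden_neurons(mask_in, mask_out, fraction=0.05):
    """Remove the hidden neurons with the fewest outgoing connections.

    mask_in  : boolean mask of shape (n_prev, n_hidden), incoming connections
    mask_out : boolean mask of shape (n_hidden, n_next), outgoing connections
    fraction : fraction of hidden neurons pruned at the start of the epoch
    """
    out_degree = mask_out.sum(axis=1)           # outgoing connections per hidden neuron
    n_prune = int(fraction * out_degree.size)
    if n_prune == 0:
        return mask_in, mask_out, np.arange(out_degree.size)

    keep = np.sort(np.argsort(out_degree)[n_prune:])   # drop the least-connected neurons
    # pruning a neuron removes both its incoming and its outgoing connections
    return mask_in[:, keep], mask_out[keep, :], keep
```

The same selection (the returned keep indices) can be applied to the corresponding weight matrices to shrink the layer.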
Table 3: Number of parameters, number of neurons, and compression rates of FC-MLP, SET-MLP, and NPSET-MLP on the 15 datasets.
Dataset  FC-MLP Params (#)  SET-MLP Params (#)  NPSET-MLP Params (#)  FC-MLP Neurons (#)  SET-MLP Neurons (#)  NPSET-MLP Neurons (#)  SET-MLP Compression (x)  NPSET-MLP Compression (x)
Leukemia  98,504,000  294,235  40,039  21,070  21,070  9,710  335  2,460
PCMAC  18,873,000  128,432  18,622  9,289  9,289  4,435  147  1,013
Lung-discrete  189,600  13,446  2,447  925  925  457  14  77
gisette  50,010,000  209,556  29,884  15,000  15,000  6,892  238  1,673
lung  18,951,000  135,689  19,776  9,312  9,312  4,458  140  958
CLL-SUB-111  245,773,000  474,738  65,421  33,340  33,340  15,488  518  3,757
Carcinom  163,746,000  420,592  67,726  27,182  27,182  12,580  389  2,418
orlraws10P  203,140,000  465,977  72,871  30,304  30,304  14,072  436  2,788
TOX-171  53,760,000  225,416  31,815  15,748  15,748  7,640  238  1,690
Prostate-GE  54,840,000  219,191  30,690  15,966  15,966  7,858  250  1,787
arcene  200,020,000  419,469  57,136  30,000  30,000  13,768  477  3,501
madelon  1,502,000  36,563  5,096  2,500  2,500  896  41  295
Yale  2,039,000  47,222  8,576  3,024  3,024  1,420  43  238
GLIOMA  33,752,000  178,678  25,228  12,434  12,434  5,956  189  1,338
RELATHE  33,296,000  170,804  24,280  12,322  12,322  5,844  195  1,371
3.2 Where to Start Pruning.
The initial network topology generated by SET is random and sparse, and does not yet encode any specific information. Thus, pruning neurons at the very beginning may permanently eliminate important neurons and seriously damage performance. It is therefore better to start applying neurons pruning after a certain number of epochs rather than from the start: after evolving during the first epochs, the network has already learned to identify and retain important connections, and the evolved neuron connectivity provides helpful guidance for identifying non-important neurons.
3.3 How Many Epochs to Prune.
If we pruned neurons in every epoch, the final number of neurons would be too small to maintain good accuracy. On the other hand, if we pruned neurons for only a few epochs, the number of removed neurons would be too small to noticeably reduce the computation. To preserve good performance, we therefore apply neurons pruning only during a limited number of epochs, after which the SET procedure continues normally.
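Putting Sections 3.1-3.3 together, the following schematic loop shows how the pruning window fits into SET training. It is only a sketch of the schedule implied by the text: the method names (prune_least_connected_neurons, train_one_epoch, rewire_connections) are hypothetical placeholders, and the epoch counts are illustrative rather than the values used in our experiments.

```python
def train_npset(model, data, n_epochs=500, prune_start=50, prune_window=100):
    """Schematic NPSET schedule (all names and epoch counts are illustrative).

    - epochs [0, prune_start): plain SET, letting the topology evolve so that
      unimportant neurons reveal themselves through low connectivity
    - epochs [prune_start, prune_start + prune_window): neurons pruning at the
      start of each epoch, followed by the usual SET connection rewiring
    - remaining epochs: plain SET on the reduced topology
    """
    for epoch in range(n_epochs):
        if prune_start <= epoch < prune_start + prune_window:
            model.prune_least_connected_neurons()   # e.g. prune_hidden_neurons above
        model.train_one_epoch(data)                 # standard SGD over the sparse layers
        model.rewire_connections()                  # SET step: remove weakest weights, regrow randomly
```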
4 Experiments and Results
We evaluated the proposed NPSET method by training sparse MLPs from scratch on 15 classification datasets with a limited number of samples and many input features, as detailed in Table 1 (the code of NPSET is built on top of the SET source code, https://github.com/dcmocanu/sparse-evolutionary-artificial-neural-networks). All datasets can be retrieved from the Arizona State University open-source repository (http://featureselection.asu.edu/index.php). In order to better understand the performance of NPSET, we compare it against five methods: (1) SET-MLP (Mocanu et al., 2018); (2) NPSET-MLP where only the neurons of the first hidden layer are pruned; (3) NPSET-MLP where only the neurons of the second hidden layer are pruned; (4) Direct SET-MLP, a directly trained SET-MLP having the same hidden layer sizes as NPSET-MLP after neurons pruning; (5) Direct FC-MLP, a directly trained FC-MLP having the same hidden layer sizes as NPSET-MLP after neurons pruning. All models used in this paper have two hidden layers and the ReLU activation function. We trained NPSET-MLP on a Python implementation of fully-connected MLPs (https://github.com/ritchie46/vanilla-machine-learning), on top of which the SET-MLP implementation was also built, guaranteeing the validity of the comparison in this paper. Since our new method is an improvement over SET, we used SET-MLP as the baseline for the experiments. All these methods are trained from scratch.
To find the most suitable hyperparameter values, we performed a small random search, which showed that the chosen settings are safe choices that not only remove the non-informative neurons, but also lead NPSET-MLP to better performance. The maximum accuracies of all six models on each dataset are shown in Table 2. We can observe that, compared with SET-MLP, NPSET-MLP improves the peak accuracy on 8 datasets, while both models reach better accuracy than their fully-connected counterpart.
Table 3 shows the compression rates and the numbers of parameters and neurons of SET-MLP and NPSET-MLP on the 15 datasets. It is worth noting that applying iterative neurons pruning to SET-MLP further increases the compression rate by 6 to 7 times.
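As an illustrative check (values taken from Table 3, assuming the compression rate is the ratio between the number of FC-MLP parameters and the number of parameters of the sparse model), for the Leukemia dataset we get 98,504,000 / 294,235 ≈ 335x for SET-MLP versus 98,504,000 / 40,039 ≈ 2,460x for NPSET-MLP, i.e. roughly a 7x further increase from neurons pruning.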
To start to better understand the generalization capabilities of SET-MLP and NPSET-MLP, we performed an extra experiment comparing them with a Dense-MLP (an FC-MLP having the same number of hidden neurons as SET-MLP). Figure 2 shows their learning curves and visualizes their generalization capabilities on 4 datasets (we limited the number of datasets due to space constraints). All three models are trained without any explicit regularization method, e.g. dropout, L1 or L2 regularization, etc. We can see that the gap between the training and test accuracies of NPSET-MLP and SET-MLP is smaller than that of Dense-MLP. Perhaps the most interesting behavior is on the Yale dataset, on which Dense-MLP overfits perfectly (zero training loss, 100% training classification accuracy). In contrast, the implicit regularization performed by the addition and removal of connections in SET-MLP and NPSET-MLP prevents these models from perfectly overfitting the training data and enables better generalization.
5 Conclusion
In this paper we propose a new method, NPSET, that enhances the Sparse Evolutionary Training procedure with Neurons Pruning. NPSET efficiently trains intrinsically sparse MLPs, in a number of cases achieving better classification accuracy than SET while using a smaller number of parameters. This is highly desirable for enhancing the scalability of neural networks. Moreover, the experimental results demonstrate that both methods, SET and NPSET, can train intrinsically sparse MLPs with adaptive sparse connectivity to have higher generalization capabilities than their fully-connected counterparts.
This study is limited in scope. For instance, we focus only on MLPs, which, even though they are among the most used models in real-world applications (they represent 61% of a typical Google TPU (Tensor Processing Unit) workload (Jouppi et al., 2017)), do not represent all neural network models. Consequently, there are many future research directions, e.g. analyzing the performance of the methods on much larger tabular datasets, or on other types of neural network models (e.g. convolutional neural networks). Among all of these, the most interesting research direction would be to understand why and when intrinsically sparse neural networks with adaptive sparse connectivity can generalize better than their fully-connected counterparts.
References
 Dziugaite & Roy (2017) Dziugaite, G. K. and Roy, D. M. Computing non-vacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.
 Han et al. (2017) Han, S., Kang, J., Mao, H., Hu, Y., Li, X., Li, Y., Xie, D., Luo, H., Yao, S., Wang, Y., et al. ESE: Efficient speech recognition engine with sparse LSTM on FPGA. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 75–84. ACM, 2017.
 Hassibi & Stork (1993) Hassibi, B. and Stork, D. G. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in neural information processing systems, pp. 164–171, 1993.
 Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 Jouppi et al. (2017) Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on, pp. 1–12. IEEE, 2017.
 Keskar et al. (2016) Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
 LeCun et al. (1990) LeCun, Y., Denker, J. S., and Solla, S. A. Optimal brain damage. In Advances in neural information processing systems, pp. 598–605, 1990.
 Lee et al. (2018) Lee, N., Ajanthan, T., and Torr, P. H. SNIP: Single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340, 2018.
 Liu et al. (2019) Liu, S., Mocanu, D. C., Matavalam, A. R. R., Pei, Y., and Pechenizkiy, M. Sparse evolutionary deep learning with over one million artificial neurons on commodity hardware. arXiv preprint arXiv:1901.09181, 2019.
 Mocanu et al. (2018) Mocanu, D. C., Mocanu, E., Stone, P., Nguyen, P. H., Gibescu, M., and Liotta, A. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9(1):2383, 2018.
 Mostafa & Wang (2019) Mostafa, H. and Wang, X. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. CoRR, abs/1902.05967, 2019. URL http://arxiv.org/abs/1902.05967.
 Narang et al. (2017) Narang, S., Elsen, E., Diamos, G., and Sengupta, S. Exploring sparsity in recurrent neural networks. arXiv preprint arXiv:1704.05119, 2017.
 Neyshabur et al. (2014) Neyshabur, B., Tomioka, R., and Srebro, N. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.
 Neyshabur et al. (2017a) Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pp. 5947–5956, 2017a.
 Neyshabur et al. (2017b) Neyshabur, B., Bhojanapalli, S., and Srebro, N. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017b.
 Sohoni et al. (2019) Sohoni, N. S., Aberger, C. R., Leszczynski, M., Zhang, J., and Ré, C. Low-memory neural network training: A technical report. CoRR, abs/1904.10631, 2019.
 Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 Zheng et al. (2018) Zheng, Q., Yang, M., Yang, J., Zhang, Q., and Zhang, X. Improvement of generalization ability of deep CNN via implicit regularization in two-stage training process. IEEE Access, 6:15844–15869, 2018.
 Zhu & Jin (2018) Zhu, H. and Jin, Y. Multi-objective evolutionary federated learning. CoRR, abs/1812.07478, 2018.