On improving deep learning generalization with adaptive sparse connectivity

06/27/2019, by Shiwei Liu et al.

Large neural networks are very successful in various tasks. However, with limited data, the generalization capabilities of deep neural networks are also very limited. In this paper, we start showing empirically that intrinsically sparse neural networks with adaptive sparse connectivity, which by design have a strict parameter budget during the training phase, have better generalization capabilities than their fully-connected counterparts. Besides this, we propose a new technique to train these sparse models by combining the Sparse Evolutionary Training (SET) procedure with neurons pruning. Operated on MultiLayer Perceptron (MLP) and tested on 15 datasets, our proposed technique zeros out around 50% of the hidden neurons during training, while having a linear number of parameters to optimize with respect to the number of neurons. The results show a competitive classification and generalization performance.


1 Introduction

In spite of the good performance of deep neural networks, they encounter generalization issues and overfitting problems, especially when the number of parameters is much higher than the number of training examples. While understanding this trade-off remains an open research question, various techniques have been proposed to address it, including implicit norm regularization (Neyshabur et al., 2014), a two-stage training process (Zheng et al., 2018), dropout (Srivastava et al., 2014), and batch normalization (Ioffe & Szegedy, 2015). Recently, many complexity measures have emerged to understand what drives generalization in deep networks, such as sharpness (Keskar et al., 2016), PAC-Bayes (Dziugaite & Roy, 2017) and margin-based measures (Neyshabur et al., 2017b). (Neyshabur et al., 2017a) analyze different complexity measures and demonstrate that the combination of some of these measures seems to capture neural network generalization behavior better.

On the other hand, the ability of sparse neural networks to reduce the number of parameters can dramatically shrink the model size and, therefore, relieve overfitting. However, traditional algorithms to train such networks make use of an initial fully-connected network which is trained first. Further on, the unimportant connections in this network are pruned using various techniques, e.g. (LeCun et al., 1990; Hassibi & Stork, 1993; Han et al., 2017; Narang et al., 2017; Lee et al., 2018), to obtain a sparse topology. The initial fully-connected network is a critical point hindering neural network scalability due to its quadratic number of (many unnecessary) parameters with respect to its number of neurons. To address this issue, (Mocanu et al., 2018) have proposed a new class of models, i.e. intrinsically sparse neural networks with adaptive sparse connectivity. These models have a linear number of parameters with respect to the number of neurons, do not require an initial fully-connected network, and can be trained with the Sparse Evolutionary Training (SET) procedure.
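
To make the quadratic-versus-linear scaling concrete, here is a minimal sketch; it assumes the SET-style connection budget n_W = eps * (n_in + n_out) reported in (Mocanu et al., 2018), and the value eps = 20 and the layer sizes used below are illustrative choices, not taken from the paper's experiments.

    # Parameter count of one bipartite layer connecting n_in to n_out neurons.
    # Fully connected: quadratic growth. SET-style sparse: linear growth, using
    # the assumed budget n_W = eps * (n_in + n_out); eps = 20 and the layer
    # sizes below are illustrative only.

    def fc_params(n_in, n_out):
        return n_in * n_out

    def set_params(n_in, n_out, eps=20):
        return eps * (n_in + n_out)

    print(fc_params(7070, 1000))   # 7,070,000 weights for a dense layer
    print(set_params(7070, 1000))  # 161,400 weights for the sparse layer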

In this paper, we introduce a new improvement to SET, dubbed SET with Neurons Pruning (NPSET), to further reduce the number of hidden neurons and parameters. Our approach is able to identify and eliminate a large number of non-informative hidden neurons and their accompanying connections by applying neurons pruning to the SET procedure. Same as SET, NPSET starts with a sparse topology, thus having a clear advantage over the state-of-the-art methods which start from fully connected topologies. The experimental results show that the removal of the hidden layer neurons with very few output connections allows NPSET to further reduce computational costs in both phases (training and inference). Moreover, we show that intrinsically sparse MLPs trained with both, SET or NPSET, have higher generalization ability than their fully-connected counterparts.

2 Related Work

Dataset          Samples   Features   Data Type    Classes   Training Samples   Test Samples
Leukemia 72 7070 Discrete 2 48 24
PCMAC 1943 3289 Discrete 2 1295 648
Lung-discrete 73 325 Discrete 7 48 25
gisette 7000 5000 Continuous 2 4666 2334
lung 203 3312 Continuous 5 135 68
CLL-SUB-111 111 11340 Continuous 3 74 37
Carcinom 174 9183 Continuous 11 116 58
orlraws10P 100 10304 Continuous 10 66 34
TOX-171 171 5748 Continuous 4 114 57
Prostate-GE 102 5966 Continuous 2 68 34
arcene 200 10000 Continuous 2 133 67
madelon 2600 500 Continuous 2 1733 867
Yale 165 1024 Continuous 15 110 55
GLIOMA 50 4434 Continuous 4 33 17
RELATHE 1427 4322 Continuous 2 951 476
Table 1: Datasets characteristics.

Inspired by Darwinian theory, Sparse Evolutionary Training (SET) (Mocanu et al., 2018) is a simple but efficient training method which enables an initially sparse topology of bipartite layers of neurons to evolve towards a scale-free topology, while learning to fit the data characteristics. After each training epoch, the connections having weights closest to zero are removed (magnitude-based removal). After that, new connections (in the same amount as the removed ones) are randomly added to the network. This offers benefits in both computational time (notably faster training and testing in comparison with fully-connected bipartite layers) and memory, whose requirements are quadratically lower. The interested reader is referred to (Mocanu et al., 2018) for a detailed discussion, and to (Mostafa & Wang, 2019; Zhu & Jin, 2018; Sohoni et al., 2019) for further developments and analyses.
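
As a rough illustration of this evolution step, the sketch below removes a fraction of the existing connections whose weights are closest to zero and then regrows the same number at random empty positions. It is a simplified dense-mask sketch, not the reference implementation; the fraction name zeta and the re-initialization scale are assumptions.

    import numpy as np

    def set_evolve(W, mask, zeta=0.3, rng=None):
        """One SET evolution step on a layer's weight matrix W with binary mask.
        Removes the fraction `zeta` of existing connections whose weights are
        closest to zero, then randomly regrows the same number of connections.
        Simplified dense-matrix sketch, not the reference implementation."""
        rng = rng or np.random.default_rng()
        active = np.argwhere(mask == 1)                 # row-major order
        n_change = int(zeta * len(active))
        # Magnitude-based removal: drop the active connections closest to zero.
        order = np.argsort(np.abs(W[mask == 1]))        # same row-major order
        for i, j in active[order[:n_change]]:
            mask[i, j] = 0
            W[i, j] = 0.0
        # Random regrowth: add the same number of connections at empty positions.
        empty = np.argwhere(mask == 0)
        for i, j in empty[rng.choice(len(empty), size=n_change, replace=False)]:
            mask[i, j] = 1
            W[i, j] = rng.normal(0.0, 0.01)             # small random init
        return W, mask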

Dataset          SET-MLP (%)   NPSET-MLP (%)   NPSET-MLP (%)   NPSET-MLP (%)   Direct SET-MLP (%)   Direct FC-MLP (%)
Leukemia 87.50 87.50 (+0.00) 87.50 (+0.00) 87.50 (+0.00) 87.50 (+0.00) 75.00 (-12.50)
PCMAC 87.35 88.43 (+1.08) 86.73 (-0.62) 88.43 (+1.08) 87.81 (+0.46) 85.19 (-2.16)
Lung-discrete 88.00 88.00 (+0.00) 88.00 (+0.00) 84.00 (-4.00) 88.00 (+0.00) 80.00 (-8.00)
gisette 97.43 97.52 (+0.09) 97.64 (+0.21) 97.52 (+0.09) 97.47 (+0.04) 97.60 (+0.17)
lung 92.65 94.12 (+1.47) 94.12 (+1.47) 92.65 (+0.00) 94.12 (+1.47) 92.65 (+0.00)
CLL-SUB-111 67.57 75.68 (+8.11) 62.16 (-5.41) 67.57 (+0.00) 70.27 (+2.70) 59.46 (-8.11)
Carcinom 79.31 81.03 (+1.72) 75.86 (-3.45) 75.86 (-3.45) 77.59 (-1.72) 68.97 (-10.34)
orlraws10P 88.24 88.24 (+0.00) 85.29 (-2.95) 88.24 (+0.00) 88.24 (+0.00) 79.41 (-8.77)
TOX-171 91.23 91.23 (+0.00) 85.97 (-5.26) 89.47 (-1.76) 91.23 (+0.00) 82.46 (-8.77)
Prostate-GE 88.24 88.24 (+0.00) 88.24 (+0.00) 88.24 (+0.00) 88.24 (+0.00) 79.41 (-8.83)
arcene 77.61 77.61 (+0.00) 82.09 (+4.48) 74.63 (-2.98) 79.10 (+1.49) 77.61 (+0.00)
madelon 71.16 71.28 (+0.12) 71.74 (+0.58) 70.13 (-1.03) 71.05 (-0.11) 56.40 (-14.76)
Yale 70.91 74.55 (+3.64) 69.09 (-1.82) 70.91 (+0.00) 69.09 (-1.82) 63.64 (-7.27)
GLIOMA 76.47 76.47 (+0.00) 76.47 (+0.00) 76.47 (+0.00) 76.47 (+0.00) 64.71 (-11.76)
RELATHE 89.71 90.55 (+0.84) 89.71 (+0.00) 89.92 (+0.21) 87.61 (-2.10) 90.76 (+1.05)

Table 2: The maximum accuracy of each method for each dataset. The entry with the highest accuracy for each dataset is made bold.

3 Methods

In this section, we detail our proposed method (NPSET).

Figure 1: Influence of hidden neurons removal (from the first hidden layer) on accuracy on the Lung-discrete dataset.

3.1 Why Neurons Pruning.

The sparse topology allows SET to create MultiLayer Perceptrons with hundreds of thousands of neurons (Liu et al., 2019), which guarantees their ability to represent all sorts of features and to approximate the functions needed to tackle different problems. However, such a large number of neurons is also a double-edged sword which can lead to significant redundancy. For example, in the case of the CIFAR-10 dataset, which has 3,072 input features, the first hidden layer has 4,000 neurons in (Mocanu et al., 2018). Obviously, not all neurons provide important information to the outputs. To verify this, we test whether removing the hidden neurons that have the fewest output connections decreases the performance. Figure 1 shows the influence of removing neurons from the first hidden layer on the Lung-discrete dataset (due to space limitations, we only show one dataset). It can be observed that the model maintains or even improves its accuracy after removing these unimportant neurons. In order to remove these non-informative neurons, at the beginning of each training epoch we remove a certain fraction of the hidden neurons that have the smallest numbers of connections. Thus, with a large probability, they will not have a notable impact on the model performance.
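
A minimal sketch of this neuron-pruning step might look as follows, assuming the connectivity is stored as binary masks around the hidden layer; the default fraction of 5% and the function name are illustrative, not the values used in the paper.

    import numpy as np

    def prune_hidden_neurons(mask_in, mask_out, fraction=0.05):
        """Remove the `fraction` of hidden neurons with the fewest connections.
        `mask_in` (n_prev x n_hidden) and `mask_out` (n_hidden x n_next) are the
        binary connectivity masks around one hidden layer. The default fraction
        is an illustrative choice, not the value used in the paper."""
        degree = mask_in.sum(axis=0) + mask_out.sum(axis=1)  # connections per neuron
        n_remove = int(fraction * mask_in.shape[1])
        doomed = np.argsort(degree)[:n_remove]               # least-connected neurons
        keep = np.setdiff1d(np.arange(mask_in.shape[1]), doomed)
        # Dropping a neuron removes all of its incoming and outgoing connections.
        return mask_in[:, keep], mask_out[keep, :], keep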

Dataset          Parameters (#)                        Neurons (#)                     Compression Rate (×)
                 FC-MLP        SET-MLP    NPSET-MLP    FC-MLP    SET-MLP   NPSET-MLP   SET-MLP   NPSET-MLP
Leukemia 98,504,000 294,235 40,039 21,070 21,070 9,710 335 2,460
PCMAC 18,873,000 128,432 18,622 9,289 9,289 4,435 147 1,013
Lung-discrete 189,600 13,446 2,447 925 925 457 14 77
gisette 50,010,000 209,556 29,884 15,000 15,000 6,892 238 1,673
lung 18,951,000 135,689 19,776 9,312 9,312 4,458 140 958
CLL-SUB-111 245,773,000 474,738 65,421 33,340 33,340 15,488 518 3,757
Carcinom 163,746,000 420,592 67,726 27,182 27,182 12,580 389 2,418
orlraws10P 203,140,000 465,977 72,871 30,304 30,304 14,072 436 2,788
TOX-171 53,760,000 225,416 31,815 15,748 15,748 7,640 238 1,690
Prostate-GE 54,840,000 219,191 30,690 15,966 15,966 7,858 250 1,787
arcene 200,020,000 419,469 57,136 30,000 30,000 13,768 477 3,501
madelon 1,502,000 36,563 5,096 2,500 2,500 896 41 295
Yale 2,039,000 47,222 8,576 3,024 3,024 1,420 43 238
GLIOMA 33,752,000 178,678 25,228 12,434 12,434 5,956 189 1,338
RELATHE 33,296,000 170,804 24,280 12,322 12,322 5,844 195 1,371

Table 3: Compression rates of SET-MLP and NPSET-MLP with respect to the FC-MLP detailed in the table.

3.2 Where to Start Pruning.

The initial network topology generated by SET is randomly sparse and does not provide any specific information. Thus, pruning neurons at the very beginning may permanently eliminate important neurons and seriously damage performance. It is therefore best to start applying neurons pruning after a certain number of epochs rather than removing neurons from the start. After evolving during the first epochs, the network has already learned to identify and retain important connections, and the evolved neuron connectivity provides helpful guidance for identifying non-important neurons.

3.3 How Many Epochs to Prune.

If we pruned neurons in every epoch, the final number of neurons would be too small to maintain good accuracy. On the other hand, if we pruned neurons for only a few epochs, the number of removed neurons would be too small to noticeably reduce the computation. To preserve good performance, we apply neurons pruning only for a fixed number of epochs, after which the SET procedure continues normally.
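
Putting Sections 3.1-3.3 together, the overall schedule could be sketched as below; the three callables and the window bounds (start epoch, number of pruning epochs) are hypothetical placeholders rather than the hyperparameter values used by the authors.

    def npset_training(model, train_one_epoch, set_evolve, prune_neurons,
                       num_epochs=500, prune_start=50, prune_epochs=20):
        """Sketch of the NPSET schedule (Secs. 3.2-3.3). The three callables are
        caller-supplied; the epoch counts and pruning window are illustrative
        placeholders, not the hyperparameter values chosen in the paper."""
        for epoch in range(num_epochs):
            # Prune the least-connected hidden neurons only inside the window,
            # i.e. after the sparse topology has had time to evolve (Sec. 3.2).
            if prune_start <= epoch < prune_start + prune_epochs:
                prune_neurons(model)
            train_one_epoch(model)
            # Standard SET step every epoch: magnitude-based weight removal
            # followed by random regrowth (Sec. 2).
            set_evolve(model)
        return model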

4 Experiments and Results

We evaluated the proposed NPSET method (the code of NPSET is built on top of the SET source code: https://github.com/dcmocanu/sparse-evolutionary-artificial-neural-networks) by training sparse MLPs from scratch on 15 classification datasets with a limited number of samples and many input features, as detailed in Table 1. All datasets can be retrieved from the Arizona State University open-source repository (http://featureselection.asu.edu/index.php). In order to better understand the NPSET performance, we compare it against five methods: (1) SET-MLP (Mocanu et al., 2018); (2) NPSET-MLP where only neurons of the first hidden layer are pruned; (3) NPSET-MLP where only neurons of the second hidden layer are pruned; (4) Direct SET-MLP, a directly trained SET-MLP having the same hidden layer sizes as NPSET-MLP after neurons pruning; (5) Direct FC-MLP, a directly trained FC-MLP having the same hidden layer sizes as NPSET-MLP after neurons pruning. All models used in this paper have two hidden layers and the ReLU activation function. We implemented NPSET-MLP on top of a Python implementation of fully-connected MLPs (https://github.com/ritchie46/vanilla-machine-learning), the same code base on which the SET-MLP implementation was built, guaranteeing the validity of the comparison in this paper. Since our new method is an improvement over SET, we used SET-MLP as the baseline for the experiments. All these methods are trained from scratch.

To find the most suitable hyperparameter values, we performed a small random search experiment. This showed that suitable settings of the pruning hyperparameters not only remove the non-informative neurons, but also lead NPSET-MLP to better performance.
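
For illustration, such a random search could be sketched as follows; the hyperparameter names and ranges are assumptions made here, not the grid actually used by the authors.

    import random

    def random_search(evaluate, n_trials=20, seed=0):
        """Minimal random-search sketch over plausible NPSET hyperparameters
        (pruning fraction, starting epoch, number of pruning epochs). The ranges
        are guesses for illustration; `evaluate` trains a model with the given
        settings and returns its validation accuracy."""
        rng = random.Random(seed)
        best = None
        for _ in range(n_trials):
            cfg = {"prune_fraction": rng.uniform(0.01, 0.10),
                   "prune_start": rng.randint(10, 100),
                   "prune_epochs": rng.randint(5, 50)}
            acc = evaluate(**cfg)
            if best is None or acc > best[0]:
                best = (acc, cfg)
        return best  # (best accuracy, best configuration)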

The maximum accuracies of all six models for each dataset are shown in Table 2. We can observe that, compared with SET-MLP, NPSET-MLP improves the peak accuracy on 8 datasets, while in most cases both models reach better accuracy than their fully-connected counterparts.

Figure 2: NPSET-MLP, SET-MLP and Dense-MLP generalization capabilities reflected by their learning curves.

Table 3 shows the compression rates and the numbers of parameters and neurons on the 15 datasets for SET-MLP and NPSET-MLP. It is worth noting that applying iterative neurons pruning to SET-MLP further increases the compression rate by roughly 6 to 7 times.
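
The compression rates in Table 3 follow directly as the ratio of FC-MLP parameters to sparse-model parameters; for instance, checking the Leukemia row:

    # Compression rate = (# FC-MLP parameters) / (# sparse-model parameters),
    # checked here against the Leukemia row of Table 3.
    fc, set_mlp, npset_mlp = 98_504_000, 294_235, 40_039
    print(round(fc / set_mlp))    # -> 335  (SET-MLP column)
    print(round(fc / npset_mlp))  # -> 2460 (NPSET-MLP column)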

To start understanding the generalization capabilities of SET-MLP and NPSET-MLP better, we performed an extra experiment comparing them with a Dense-MLP (an FC-MLP having the same number of hidden neurons as SET-MLP). Figure 2 shows their learning curves and visualizes their generalization capabilities on 4 datasets (we limited the number of datasets due to space constraints). All three models are trained without any explicit regularization methods, e.g. dropout, L1 or L2 regularization. We can see that the gap between the training and test accuracies of NPSET-MLP and SET-MLP is smaller than for Dense-MLP. Perhaps the most interesting behavior is on the Yale dataset, on which Dense-MLP exhibits perfect overfitting (i.e. zero training loss, 100% training classification accuracy). By contrast, the implicit regularization provided by connection addition and removal in SET-MLP and NPSET-MLP does not let these models perfectly overfit the training data and enables better generalization.

5 Conclusion

In this paper we propose a new method, NPSET, which enhances the Sparse Evolutionary Training (SET) procedure with neurons pruning. NPSET efficiently trains intrinsically sparse MLPs, in a number of cases achieving better classification accuracy than SET while using a smaller number of parameters. This is highly desirable for enhancing neural network scalability. Moreover, the experimental results demonstrate that both methods, SET and NPSET, can train intrinsically sparse MLPs with adaptive sparse connectivity to have higher generalization capabilities than their fully-connected counterparts.

This study is limited in its scope. For example, we focus only on MLPs, which, even though they are among the most used models in real-world applications (they represent 61% of a typical Google TPU (Tensor Processing Unit) workload (Jouppi et al., 2017)), do not represent all neural network models. Consequently, there are many future research directions, e.g. analyzing the performance of the methods on much larger tabular datasets, or on other types of neural network models (e.g. convolutional neural networks). Among all of these, the most interesting research direction would be to understand why and when intrinsically sparse neural networks with adaptive sparse connectivity can generalize better than their fully-connected counterparts.

References