SparseNet: A Sparse DenseNet for Image Classification

04/15/2018
by Wenqi Liu, et al.

Deep neural networks have made remarkable progress on various computer vision tasks. Recent work has shown that the depth, width and shortcut connections of a network are all vital to its performance. In this paper, we introduce a method to sparsify DenseNet that reduces the connections of an L-layer DenseNet from O(L^2) to O(L), so that we can simultaneously increase the depth, width and connections of neural networks in a more parameter-efficient and computation-efficient way. Moreover, an attention module is introduced to further boost our network's performance. We call our network SparseNet. We evaluate SparseNet on CIFAR (including CIFAR10 and CIFAR100) and SVHN. Experiments show that SparseNet obtains improvements over the state of the art on CIFAR10 and SVHN. Furthermore, while achieving performance comparable to DenseNet on these datasets, SparseNet is 2.6x smaller and 3.7x faster than the original DenseNet.

1 Introduction

Deep convolutional neural networks have achieved great success on many computer vision tasks, such as object classification, detection and segmentation [1][2][3]. Depth has played a significant role in these successes: from AlexNet[1] to VGGNet[4] and GoogLeNet[5], performance on various computer vision tasks improved as network depth increased.

Experiments[6] have shown that simply stacking more layers without changing the network's structure actually makes performance worse, because the gradients of the network's parameters vanish as depth increases. To address this problem, He[6] proposed ResNet, which introduces a residual learning framework by adding identity-mapping shortcuts. ResNet extended depth to over 100 layers and achieved state-of-the-art performance on many computer vision tasks. However, when ResNet becomes much deeper (e.g. over 1000 layers), it suffers from overfitting.

Huang[7] proposed a new training procedure, named stochastic depth, to solve this problem. Taking ResNet as an example, Huang[7] trained shallower subnetworks by randomly dropping residual modules (while retaining the shortcut connections). The vanishing-gradient problem is alleviated since only shallower networks are trained during the training phase. This procedure allows the depth of networks to be extended to over 1000 layers (e.g. 1202 layers), and image classification performance is further improved.

Zagoruyko[8] improved ResNet from another aspect by introducing a wider (more channels in the convolution layers) and shallower ResNet variant. A wide ResNet with only 16 layers outperforms the original ResNet with over 1000 layers. Another benefit of wide ResNet is that training is very fast, since wider layers take better advantage of GPU parallelism.

By gradually increasing the width of the network, Han[9] presented deep pyramidal residual networks. In the original ResNet[6], width is only doubled after downsampling: the network is organized into a few stages, and within each stage all residual units share the same width. In pyramidal residual networks, the width of each residual unit keeps increasing, whether or not the units belong to the same stage. Experiments showed that pyramidal residual networks have better generalization ability than the original residual networks. So besides increasing depth, properly increasing width is another way to boost a network's performance.

Besides increasing depth or width, increasing the number of shortcut connections is another effective way to improve a network's performance. It helps in two ways. 1) It shortens the distance between input and output and thus alleviates the vanishing-gradient problem through shorter forward paths; Highway networks[10] and ResNet[6] proposed different forms of shortcut connections, both of which make training easier. 2) Shortcut connections can take advantage of multi-scale feature maps, which improves performance on various computer vision tasks[11][3][12][13].

Huang[14] takes this idea to the extreme. He proposed DenseNet, in which the l-th layer takes the outputs of all previous layers as its input (so this layer has l-1 incoming connections). This network design not only alleviates the vanishing-gradient problem, but also achieves better feature reuse. DenseNet achieves superior performance on CIFAR-10, CIFAR-100 and SVHN. However, it has its own disadvantages. There are L(L+1)/2 connections in total for an L-layer DenseNet. The excessive connections not only decrease the network's computation efficiency and parameter efficiency, but also make the network more prone to overfitting. As the upper part of Fig. 1 shows, when we vary the number of connections of a 40-layer DenseNet, the test error rate on CIFAR10 first decreases and then increases as the number of connections grows, reaching its minimum at an intermediate number of connections. However, as the bottom part of Fig. 1 shows, the error rate on the training set keeps decreasing as the number of connections increases.

Figure 1: Test/train error rates on CIFAR10 for different numbers of paths (connections) in DenseNet

To address this problem, we propose a method to sparsify DenseNet. Zeiler[15] found that in a deep neural network, shallower layers learn concrete features whereas deeper layers learn abstract features. Based on this observation, we drop connections from the middle layers and reduce the number of connections per layer from O(L) to O(1), so the total number of connections of the sparsified DenseNet is O(L). In Fig. 2, the left is a small part of DenseNet and the right is a small part of SparseNet; the dotted lines are dropped connections. Our idea for sparsifying is thus simply to drop connections from the middle layers and retain only the nearest and farthest connections. We can then extend the network to be deeper or wider, which results in better performance. As shown in Fig. 3, while keeping the total number of parameters roughly unchanged, dropping some connections and then extending the network's width or depth improves performance.

Figure 2: Left is DenseNet, where the input to each layer comes from all previous layers; right is SparseNet, where dotted lines are dropped connections and the input to each layer comes from at most two previous layers.
Figure 3: Wider Sparse DenseNet and Deeper Sparse DenseNet are networks extended in width or depth after dropping some middle connections. Setup of DenseNet: k (growth rate) = 12, layers = 40; Wider Sparse DenseNet: k = 16, layers = 40, path (total connections) = 12; Deeper Sparse DenseNet: k = 12, path = 12, layers = 64.

Besides changing a network's depth, width or shortcut connections to boost performance, we can also borrow from our knowledge of the human visual system. Its most significant feature is the attention mechanism: when we skim images, we automatically focus on important regions and devote more attentional resources to them. Recently, some computer vision researchers have been inspired by this mechanism. They designed modules that first select the most significant regions of an image (e.g. foreground regions for object segmentation) and then pay more attention to those regions. Attention mechanisms have made progress on various computer vision tasks, such as image classification[16], image segmentation[17], human pose estimation[18] and so on. Recently, Hu[19] took advantage of the attention mechanism from another perspective, allocating different amounts of 'attentional resources' to different channels of the feature maps. Specifically, the weights of channels with informative features are increased and those of channels with less useful features are decreased. He proposed the SE module, which adaptively recalibrates feature responses across channels at the cost of slightly more computation and parameters. The SE module has proved effective for ResNet[6], Inception[5] and Inception-ResNet[20]. However, the improvement is negligible when it is applied to our SparseNet. To address this, we present a new attention mechanism, whose structure is shown in Fig. 4. It consists of one global average pooling layer and two convolution modules (each including a convolution layer, a ReLU layer and a batch normalization layer). Borrowing the idea of shortcut connections, the outputs of both the global average pooling layer and the first convolution module are taken as input to the second convolution module, and the output of the second convolution module is then used to calibrate the original network's output.

This paper makes two contributions:

1) We present an effective way to sparsify DenseNet, which improves the network's performance by simultaneously increasing its depth, width and shortcut connections.

2) We propose an attention mechanism that further boosts the network's performance.

Figure 4: SparseNet with attention module

2 Related work

2.1 Convolutional neural networks

Since 2012, neural networks, as a new way of constructing models, have made big strides in various areas of computer vision. AlexNet[1], which consists of 8 layers, won the ILSVRC-2012 image classification challenge and reduced the error rate on ImageNet substantially compared with the best result of 2011. In 2014, when VGGNet[4] and the Inception network[21] were introduced, the depth of networks was easily extended to around 20 layers and image classification accuracy improved considerably. As networks go deeper, simply stacking layers degrades performance. To solve this problem, He[6] introduced ResNet, which learns a residual function $F(x) = H(x) - x$ instead of the target function $H(x)$ directly. ResNet can be extended to over 100 layers and its performance further improved.

Much research has been done on ResNet variants. He[22] changed the conventional "post-activation" of the weight layers to "pre-activation"; specifically, the BN and ReLU layers are placed before the Conv layer. This identity-mapping change makes training easier and thus improves network performance. Han[9] introduced deep pyramidal residual networks, which increase width gradually layer by layer and rearrange the convolution module. Experiments showed that their architecture has better generalization ability than the original ResNet.

Targ[23] proposed ResNet in ResNet (RiR), which replaces the convolution module with a small deep dual-stream architecture. RiR generalizes between a residual stream, which is similar to a residual block, and a transient stream, which is a standard convolutional layer. Huang[7] constructs a very deep ResNet: by randomly dropping residual modules during training, different shallower subnetworks are trained; in the testing phase the whole deep network is used, with every residual module recalibrated by its survival probability. In this way, ResNet can be extended to over 1200 layers. Zagoruyko[8] introduced a wider and shallower ResNet variant, named WRN (Wide Residual Networks), which further improves ResNet's performance. Huang[14] presented DenseNet, in which each layer is connected to all previous layers. This design not only accomplishes feature reuse, but also alleviates the vanishing-gradient problem.

2.2 Attention mechanism

Attention mechanisms have made much progress in areas such as machine translation[24]. Recently, attention has been playing a significant role in various computer vision tasks. Harley[17] learned pixel-wise weights at multiple scales using an attention mechanism and computed the weighted average as the final segmentation result. Chu[18] improved human pose estimation using a multi-context attention module: a holistic attention model captures the global consistency of the human body, while a body-part attention module captures detailed information for each part. Wang[16] proposed a residual attention network for image classification, which achieved state-of-the-art performance on the CIFAR datasets; with attention residual learning, their networks can easily be extended to hundreds of layers. Hu[19] proposed SENet (Squeeze-and-Excitation networks), which recalibrates the weights of different channels by explicitly modeling channel interdependencies. SENet won the ILSVRC-2017 image classification challenge.

3 SparseNet

3.1 DenseNet

We denote the input image by $x_0$, the output of the $\ell$-th layer by $x_\ell$, and the $\ell$-th convolutional module by the function $H_\ell(\cdot)$. Since the input to the $\ell$-th layer consists of the outputs of all previous layers,

$x_\ell = H_\ell([x_0, x_1, \dots, x_{\ell-1}])$,

where $[x_0, x_1, \dots, x_{\ell-1}]$ is the concatenation of the outputs of all previous layers. DenseNet is composed of several dense blocks connected by transition layers. The feature-map size is normally halved from one block to the next; for the CIFAR networks, for example, the feature maps are $32 \times 32$ in the first block, $16 \times 16$ in the second and $8 \times 8$ in the third. In DenseNet, the number of output feature maps of each convolution module is always the same and is denoted by $k$, so the $\ell$-th layer receives $k_0 + k \times (\ell - 1)$ input feature maps, where $k_0$ is the number of channels fed into the first dense block. $k$ is referred to as the growth rate.
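As an illustration, the following is a minimal PyTorch-style sketch (our own code, not the authors' implementation) of the dense connectivity just described: each $H_\ell$ is modeled as a BN-ReLU-Conv(3x3) module producing $k$ feature maps, and its input is the concatenation of all previous outputs. The class and variable names are ours.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One H_l module: BN -> ReLU -> 3x3 Conv producing k (growth_rate) feature maps."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1, bias=False)

    def forward(self, prev_features):
        # prev_features: list [x_0, x_1, ..., x_{l-1}]; concatenate along the channel dim.
        x = torch.cat(prev_features, dim=1)
        return self.conv(torch.relu(self.bn(x)))

class DenseBlock(nn.Module):
    """A dense block: layer l sees k_0 + k*(l-1) input channels."""
    def __init__(self, num_layers, k0, growth_rate):
        super().__init__()
        self.layers = nn.ModuleList(
            [DenseLayer(k0 + i * growth_rate, growth_rate) for i in range(num_layers)])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(features))   # x_l = H_l([x_0, ..., x_{l-1}])
        return torch.cat(features, dim=1)
```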

As DenseNet goes deeper, the number of input feature maps grows quickly. To address this, the authors place a $1 \times 1$ convolution module (a bottleneck layer) before the $3 \times 3$ convolution module, so the convolutional module changes from BN+ReLU+Conv($3\times3$) to BN+ReLU+Conv($1\times1$)+BN+ReLU+Conv($3\times3$) (and the new convolution module is counted as two layers instead of one). The usual setting for the number of output feature maps of the bottleneck layer ($1\times1$ Conv) is $4k$; thus the input to every $3\times3$ Conv layer is fixed to $4k$ channels. To further improve model compactness, the number of feature maps can also be reduced in the transition layers; the usual setting is to halve it. This kind of DenseNet is called DenseNet-BC.
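A minimal sketch of the bottleneck unit and transition layer just described, assuming the common DenseNet-BC settings ($4k$ bottleneck channels, compression factor 0.5); again the names are illustrative, not the authors' code.

```python
import torch.nn as nn

class BottleneckLayer(nn.Module):
    """BN-ReLU-Conv(1x1, 4k) followed by BN-ReLU-Conv(3x3, k)."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        inter = 4 * growth_rate                      # bottleneck width = 4k
        self.net = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, inter, kernel_size=1, bias=False),
            nn.BatchNorm2d(inter), nn.ReLU(inplace=True),
            nn.Conv2d(inter, growth_rate, kernel_size=3, padding=1, bias=False))

    def forward(self, x):
        return self.net(x)

class Transition(nn.Module):
    """Halves both the number of feature maps (compression 0.5) and the spatial size."""
    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, in_channels // 2, kernel_size=1, bias=False),
            nn.AvgPool2d(kernel_size=2, stride=2))

    def forward(self, x):
        return self.net(x)
```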

3.2 SparseNet

We introduce a method to sparsify DenseNet. The basic idea is to drop connections from the middle layers and preserve only the farthest and nearest connections:

$x_\ell = H_\ell([x_0, \dots, x_{p/2-1}, x_{\ell-p/2}, \dots, x_{\ell-1}])$ for $\ell > p$,

where $p$ denotes the number of connections we preserve (we call it the 'path'); when $\ell \le p$, all previous outputs are used, exactly as in DenseNet. As in DenseNet, we also use bottleneck layers and compress the model in the transition layers; these hyperparameters are set the same as in DenseNet.
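The following sketch (our illustrative code, consistent with the formula above but not the authors' implementation) shows how the preserved inputs could be selected for a given path $p$: when a layer has more than $p$ predecessors, only the $p/2$ farthest and $p/2$ nearest outputs are concatenated. The function name `select_sparse_inputs` is ours.

```python
import torch

def select_sparse_inputs(prev_features, path):
    """prev_features = [x_0, ..., x_{l-1}]; keep at most `path` of them:
    the path/2 farthest (earliest) and path/2 nearest (most recent) outputs."""
    if len(prev_features) <= path:
        return torch.cat(prev_features, dim=1)        # same as DenseNet
    half = path // 2
    kept = prev_features[:half] + prev_features[-half:]
    return torch.cat(kept, dim=1)
```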

Moreover, we also make a structural optimization. In DenseNet, the number of layers is the same in all dense blocks; in our SparseNet, the number of layers increases from block to block. We discuss the advantages of this arrangement in Section 4.6.

3.3 Attention mechanism

We propose an attention mechanism to further boost the network's performance; its structure is shown in Figure 4. Suppose the input is $x$. The left part is a convolution module, whose function we denote by $H(x)$. The right part is the attention module, denoted by $A(x)$; it consists of one global average pooling layer and two convolution modules. The input to the second convolution module is the concatenation of the outputs of the global pooling layer and the first convolution module. The final output is then obtained by using $A(x)$ to calibrate (channel-wise rescale) $H(x)$.
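Below is a minimal sketch of how such an attention module $A(x)$ could be realized. The channel reduction ratio, the sigmoid gating and the channel-wise multiplication used for the calibration step are our assumptions, since the text does not fix them; the module name is ours as well.

```python
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """A(x): global average pooling plus two 1x1 conv modules with a shortcut-style
    concatenation; the output rescales H(x) channel-wise (our interpretation)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction                          # assumed reduction ratio
        self.pool = nn.AdaptiveAvgPool2d(1)                  # global average pooling
        self.conv1 = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1), nn.ReLU(inplace=True),
            nn.BatchNorm2d(mid))
        # the second module sees [pooled input, output of first module] concatenated
        self.conv2 = nn.Sequential(
            nn.Conv2d(channels + mid, channels, kernel_size=1), nn.ReLU(inplace=True),
            nn.BatchNorm2d(channels))

    def forward(self, h):
        s = self.pool(h)                                      # B x C x 1 x 1
        a = self.conv2(torch.cat([s, self.conv1(s)], dim=1))  # shortcut-style concatenation
        return h * torch.sigmoid(a)                           # calibrate H(x) channel-wise
```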

3.4 Framework

To summarize, as shown in Fig. 5, we consider four architectures: (a) the original DenseNet; (b) the basic SparseNet (path = 2, since each layer is connected to at most two previous layers); (c) SparseNet-bc, which adds bottleneck layers and reduces the number of feature maps in the transition layers; (d) SparseNet-abc, which adds the attention mechanism on top of SparseNet-bc. The whole framework is shown in Figure 6.

Figure 5: (a) DenseNet; (b) SparseNet (path = 2); (c) SparseNet-bc; (d) SparseNet-abc.
Figure 6: The framework of SparseNet for image classification. Sparse blocks are connected by transition layers.

3.5 Implementation details

All our models include three sparse blocks, and the number of layers increases from block to block. Apart from the bottleneck layers, all convolutional kernels are $3 \times 3$. Blocks are connected by transition layers, which halve the feature-map size and halve the number of feature maps (the number of feature maps remains unchanged for the basic SparseNet). After the last block, a global pooling layer and a softmax classifier are attached. For each network (SparseNet, SparseNet-bc, SparseNet-abc) we construct four model sizes, denoted V1, V2, V3 and V4. The numbers of layers in the three blocks are 8, 12, 16 for V1; 12, 18, 24 for V2; 16, 24, 32 for V3; and 20, 30, 40 for V4. Other parameters are listed in Table 1.
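For reference, the block configurations above and in Table 1, together with the classification head, can be summarized in code. This is only a sketch; the names `SPARSENET_CONFIGS` and `build_sparsenet_head` are ours, not the authors'.

```python
import torch.nn as nn

# Configurations used in our experiments (from the text above and Table 1);
# each entry is (layers per block, growth rate k, path p).
SPARSENET_CONFIGS = {
    "V1": ([8, 12, 16],  16, 14),
    "V2": ([12, 18, 24], 24, 21),
    "V3": ([16, 24, 32], 32, 28),
    "V4": ([20, 30, 40], 50, 35),
}

def build_sparsenet_head(final_channels, num_classes):
    """After the last sparse block: global average pooling + linear classifier
    (the softmax is usually applied implicitly by the cross-entropy loss)."""
    return nn.Sequential(
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(final_channels, num_classes))
```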

name #Params Depth Growth rate Path
SparseNet-V1 1.20M 40 16 14
SparseNet-V2 5.70M 68 24 21
SparseNet-V3 17.5M 76 32 28
SparseNet-V4 65.7M 96 50 35
SparseNet-bc-V1 0.83M 76 16 14
SparseNet-bc-V2 3.45M 132 24 21
SparseNet-bc-V3 9.69M 148 32 28
SparseNet-bc-V4 34.3M 184 50 35
SparseNet-abc-V1 0.86M 76 16 14
SparseNet-abc-V2 3.56M 132 24 21
SparseNet-abc-V3 9.92M 148 32 28
SparseNet-abc-V4 35.0M 184 50 35
Table 1: Setups of the networks.

4 Experiments

4.1 Datasets

4.1.1 Cifar

CIFAR[25] consists of colored images with three channels and size $32 \times 32$. CIFAR10 contains 10 classes and CIFAR100 contains 100 classes. Both are composed of 50,000 training images and 10,000 test images.

4.1.2 Svhn

The Street View House Numbers (SVHN) dataset[26] also consists of colored images with three channels and size $32 \times 32$. SVHN includes 73,257 training images, 531,131 additional training images and 26,032 test images. We train our models using all of the training images.

4.2 Training

All networks are trained with stochastic gradient descent. The weight decay is 0.0001 and the Nesterov momentum is 0.9 without dampening. We initialize parameters as He[27] does, and all datasets are augmented with the method introduced in Huang[14]. For CIFAR, we train for 280 epochs with an initial learning rate of 0.1, decreased to 0.01, 0.001 and 0.0002 at epochs 150, 200 and 250. For SVHN, we train for 40 epochs, decreasing the learning rate to 0.01 and 0.001 at epochs 20 and 30. The batch size for both datasets is 64.
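For concreteness, the training setup described above could be configured as follows in PyTorch (a sketch under the stated hyperparameters; `model` is assumed to be one of the SparseNet variants). Because the final CIFAR step (0.001 to 0.0002) is not a uniform decay factor, the schedule is written with LambdaLR rather than MultiStepLR.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def cifar_lr_factor(epoch):
    """Learning-rate factor relative to the initial lr of 0.1:
    0.1 until epoch 150, then 0.01, 0.001 and finally 0.0002."""
    if epoch < 150:
        return 1.0
    if epoch < 200:
        return 0.1
    if epoch < 250:
        return 0.01
    return 0.002          # 0.0002 / 0.1

def make_optimizer_and_scheduler(model):
    # SGD with Nesterov momentum 0.9 (no dampening) and weight decay 1e-4, as in the text.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                                weight_decay=1e-4, nesterov=True)
    scheduler = LambdaLR(optimizer, lr_lambda=cifar_lr_factor)  # step once per epoch
    return optimizer, scheduler
```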

4.3 Classification Results on CIFAR and SVHN

Results on CIFAR and SVHN are shown in Table 2. Compared to DenseNet, SparseNet achieves superior performance on all datasets. On CIFAR10, SparseNet decreases the error rate from 3.46% to 3.24%; on CIFAR100, SparseNet achieves an error rate of 16.98% while DenseNet achieves 17.18%; on SVHN, SparseNet also achieves a lower error rate (1.69% vs. 1.74%). Furthermore, SparseNet outperforms the existing state of the art on CIFAR10 and SVHN: its error rates are lower than those of PyramidNet on CIFAR10 (3.31%) and DenseNet on SVHN (1.74%).

Networks #Params Depth CIFAR10 CIFAR100 SVHN
ResNet[27] 1.70M 110 6.41 - -
ResNet with Stochastic Depth[7] 1.70M 110 5.23 24.58 1.75
10.2M 1202 4.91 - -
wide ResNet[8] 11.0M 16 4.81 22.07 -
36.5M 28 4.17 20.50 -
pre-activation ResNet[22] 1.7M 164 5.46 24.33 -
10.2M 1001 4.62 22.71 -
ResNeXt[28] 34.4M 29 3.65 17.77 -
68.1M 29 3.58 17.33 -
FractalNet with Dropout/Drop-path[29] 38.6M 21 5.22 23.30 2.01
38.6M 21 4.60 23.73 1.87
PyramidNet(α=48)[9] 1.7M 110 4.58 23.12 -
PyramidNet(α=270) 28.3M 110 3.73 18.25 -
PyramidNet(α=200, bottleneck) 26.0M 272 3.31 16.35 -
DenseNet-bc[14] 0.8M 100 4.51 22.27 1.76
15.3M 250 3.62 17.60 1.74
25.6M 190 3.46 17.18 -
CondenseNet-122[30] 0.95M 122 4.48 - -
CondenseNet-182 4.2M 182 3.76 18.47 -
SparseNet-bc-V1 0.83M 76 4.34 22.18 2.0
SparseNet-bc-V2 3.45M 132 3.93 19.27 1.85
SparseNet-bc-V3 9.69M 148 3.56 17.75 1.75
SparseNet-abc-V1 0.86M 76 4.25 21.59 1.98
SparseNet-abc-V2 3.56M 132 3.75 19.54 1.89
SparseNet-abc-V3 9.92M 148 3.40 17.53 1.69
SparseNet-abc-V4 35.0M 184 3.24 16.98 -
Table 2: Error rates (%) on CIFAR and SVHN.

4.4 Attention mechanism

As Table 2 shows, the attention mechanism boosts the network's performance for most model sizes (V1, V2 and V3) with only a small increase in parameters and inference time. We also compared our attention mechanism to the SE module[19] on SparseNet-V1; results are shown in Fig. 7. Throughout the training phase, our attention mechanism is consistently superior to the SE module. Moreover, the effect of the SE module on SparseNet-V1 is nearly negligible.

Figure 7: 'original' is SparseNet; 'attention' is SparseNet + attention module; 'SE' is SparseNet + SE module (epochs and learning rates are set as in DenseNet).

4.5 Parameter Efficiency and Computation Efficiency of SparseNet

The results in Fig. 8 indicate that SparseNet uses parameters more efficiently than alternative models. SparseNet-abc-V1 achieves a lower test error on CIFAR10 than the 1001-layer pre-activation ResNet, while the latter has roughly 12 times as many parameters (10.2M vs. 0.86M). The best DenseNet-BC model achieves 3.46%, while SparseNet achieves a lower test error (3.40%) with far fewer parameters (9.92M vs. 25.6M). Even compared with the recent CondenseNet[30], which is designed for mobile devices, our SparseNet is still more parameter-efficient.

Figure 8: Comparison of parameter efficiency on CIFAR10 across different models

To analyze SparseNet's computation, we compared the FLOPs (floating-point operations; Conv2D FLOPs computed with the TensorFlow framework) of pre-activation ResNet, DenseNet and SparseNet. Results are shown in Fig. 9: SparseNet is more computation-efficient than the other two models. Compared to the best DenseNet model, SparseNet is about 3.7 times faster.
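The reported numbers were obtained with the TensorFlow framework; purely as an illustration of the standard Conv2D counting rule (not the exact tool used), a hand-rolled estimate looks like the sketch below, where `conv2d_flops` is our own helper name.

```python
def conv2d_flops(h_out, w_out, c_in, c_out, k_h, k_w):
    """Approximate FLOPs of one Conv2D: every output element needs
    c_in * k_h * k_w multiply-adds, counted here as 2 operations each."""
    return 2 * h_out * w_out * c_out * c_in * k_h * k_w

# Example: a 3x3 convolution mapping 16 -> 12 channels on a 32x32 feature map.
print(conv2d_flops(32, 32, 16, 12, 3, 3))   # ~3.5 MFLOPs
```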

Figure 9: Comparison of SparseNet-abc and DenseNet error rate on CIFAR10 as a function of FLOPs.

4.6 Structure optimization

We also analyzed the effectiveness of our layer arrangement within the sparse blocks by comparing two block arrangements: the increasing arrangement 8-12-16 and the equal arrangement 12-12-12. Results are listed in Table 3: the increasing arrangement is superior not only in computation but also in accuracy.

block setup #FLOPs test error (%)
8-12-16 578M 4.95
12-12-12 849M 5.36

Table 3: Results of two block arrangements on CIFAR10.

5 Discussion

5.1 Where to drop connections

In this section, we experimented with different methods of reducing connections. Taking SparseNet-V1 (path = 14) as an example, we tried the following ways of dropping connections:

1) 14-0: preserving only the farthest 14 connections;

2) 10-4: preserving the farthest 10 connections and the nearest 4 connections;

3) 7-7 (ours): preserving the farthest 7 connections and the nearest 7 connections;

4) 4-10: preserving the farthest 4 connections and the nearest 10 connections;

5) 0-14: preserving only the nearest 14 connections.

As Fig. 10 shows, different connection-dropping methods result in different error rates, and our method (7-7) achieves the best performance. Besides 7-7, 0-14 also achieves performance comparable to ours. One possible explanation is that, for SparseNet-V1, preserving the nearest 14 connections contains roughly as much information as preserving the farthest 7 and nearest 7 connections.
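To make the five splits above concrete, the following small helper (our illustrative code; `preserved_indices` is our name) returns the indices of the preserved predecessor outputs for a given (farthest, nearest) split:

```python
def preserved_indices(num_prev, n_far, n_near):
    """Indices of preserved inputs among x_0..x_{num_prev-1}:
    the n_far earliest (farthest) and the n_near most recent (nearest) outputs."""
    if num_prev <= n_far + n_near:
        return list(range(num_prev))                 # keep everything, as in DenseNet
    return list(range(n_far)) + list(range(num_prev - n_near, num_prev))

# The five variants for path = 14, e.g. for a layer with 20 predecessors:
for far, near in [(14, 0), (10, 4), (7, 7), (4, 10), (0, 14)]:
    print(far, near, preserved_indices(20, far, near))
```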

Figure 10: different methods of reducing connections in SparseNet-V1.

5.2 How layers, growth rate and path influence network’s performance

We also analyzed how depth (layers), width (growth rate) and shortcut connections (path) influence the network's performance. In our experiments we use three depth settings (with a different number of layers per block for each) and set the range of the growth rate k to [6, 26]. All models have around 1M parameters, so once the depth and growth rate are fixed, the number of connections (path) is also determined. The results are shown in Fig. 11. For each depth setting, the test error first decreases and then increases, so the optimal test error always lies somewhere in the middle; across the different depth settings, the lowest test error is achieved at an intermediate depth. The results show that none of the three factors should be pushed to an extreme: only by increasing depth, growth rate and path together can SparseNet achieve better performance.

Figure 11: Error rates of different setups of layers, growth rate and path on CIFAR10.

6 Conclusion

In this work, we proposed a method to sparsify DenseNet. After reducing the shortcut connections, we can extend the network to be deeper and wider. Moreover, we introduced an attention module that further boosts the network's performance. Experiments showed that, compared to DenseNet, our model achieves comparable performance on CIFAR and SVHN with far fewer parameters and much less computation. In addition, we analyzed several ways of reducing connections and how depth, growth rate and shortcut connections influence performance. In future work, we will apply our models to other computer vision tasks, such as object detection, object segmentation and human pose estimation.

References