Towards More Efficient and Effective Inference: The Joint Decision of Multi-Participants

01/19/2020 ∙ by Hui Zhu, et al. ∙ 0

Existing approaches that improve the performance of convolutional neural networks by optimizing local architectures or deepening the networks tend to increase model size significantly. In order to deploy neural networks on the edge devices that are in great demand, reducing the scale of the networks is crucial. However, compressing a network easily degrades its image-processing performance. In this paper, we propose a method that is suitable for edge devices while improving the efficiency and effectiveness of inference. The joint decision of multi-participants, mainly containing multi-layers and multi-networks, achieves higher classification accuracy (by up to 0.26% on CIFAR-10 and 4.49% on CIFAR-100) with a similar number of parameters for classical convolutional neural networks.


1 Introduction

Deep learning, which relies heavily on neural networks, has achieved huge success in the fields of computer vision and image processing. Recently, multitudinous neural architectures have been proposed to improve the performance of image processing. Brand-new architectures mainly come from human design, such as VGG [13], ResNet [4] and DenseNet [6], or from automatic neural architecture search, such as NASNet [17], AmoebaNet [11] and EfficientNet [14]. As the performance of image processing improves, the scale of neural networks gradually increases.

Nevertheless, as an emerging field, edge computing tends to analyze and process data at the source rapidly in order to reduce costs and improve efficiency. Many studies attempt to reduce the scale of networks so that they can be deployed on edge devices such as wearable and IoT devices. The proposed methods are mainly limited to model compression [3, 16] and lightweight network design [12, 15]. These methods may impose special requirements on the devices or only theoretically reduce the number of parameters and the amount of computation. Furthermore, these reductions in network scale often come at the expense of image-processing performance.

In this paper, we propose a new approach to address these problems, achieving higher image-classification accuracy with fewer parameters. We break the large network into smaller parts and merge the information of more layers, as multi-participants, to make a joint decision. These small networks are better suited to edge devices, given their limited storage, and allow concurrent training. The experimental results also show that our approach makes inference more efficient and effective.

Figure 1: The joint decision of multi-layers. Different colors represent different scales of feature maps, and each layer carries a weight in the final joint decision.
Figure 2: The joint decision of multi-networks. Different colors represent different networks, and each network carries a weight in the final joint decision. After training a network, we make small adjustments to the weights of the data.

Our contributions are summarized as follows:

  • We propose the joint decision of multi-participants, which contains multi-layers and multi-networks. This approach improves the effectiveness and robustness of a single network, while the ensemble of multiple networks performs better overall.

  • We propose clear rules for the architecture design and the training methods of every participant.

  • Our method achieves better results for classification on CIFAR-10 and CIFAR-100 with a similar number of parameters and amount of computations.

2 Related Work

The methods to improve the performance of convolutional neural networks mainly focus on optimizing the convolutional neural architectures. Human-designed architectures such as short-cut connections [4] and squeeze-excitation blocks [5], as well as automatic search algorithms such as ENAS [10] and DARTS [8], have achieved significant breakthroughs. At the same time, scaling down models so that networks can run on edge devices has also received wide attention. To this end, many studies on model compression [3, 16] and lightweight network design [12, 15] have been proposed. Our approach is inspired by Adaboost [2], which trains several weak classifiers on the same training dataset and merges them to form the final classifier. It is more suitable for edge devices, achieving more efficient and effective inference, and is orthogonal to these prior works.

3 Methods

In this section, we propose our method, the joint decision of multi-participants which contain multi-layers and multi-networks. Furthermore, we propose the rules of the architecture design and the training methods for them.

3.1 The Joint Decision of Multi-Layers

Traditional convolutional neural networks extract image features in a forward pass and finally perform classification via the output of the last Softmax layer. Although feature extraction tends to become more abstract in deeper layers, we notice that front layers can also solve some problems that deep ones cannot. We therefore design the network so that multiple layers participate in the final decision together.

The overview of the joint decision of multi-layers is presented in Fig. 1. We divide the network into several parts according to the scales of the feature maps; they are represented by the different colors in the figure. Then we add a Global Average Pooling and a Softmax layer at the end of each part; a Batch-Norm layer and an activation function may be added as well. This method adds few parameters and FLOPs, but gives a voice to more layers in the network.

Suppose that the weight of the original output of the network and the factors for regulation are given. Then the weight of each output we added can be expressed as:

(1)

The loss function in the training can be expressed as:

(2)

And the final joint output can be expressed as:

(3)
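The multi-layer joint decision described above can be sketched in numpy. This is a minimal sketch under assumptions: the exact weight coefficients of Eqs. (1)–(3) are lost from the text, so the per-part weights below are illustrative placeholders; each "part" is represented only by the logits its GAP-plus-Softmax head would produce.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def joint_multilayer_output(part_logits, part_weights):
    """Combine the softmax outputs of several network parts.

    part_logits : list of (batch, classes) arrays, one per part
                  (each produced by a GAP + Softmax head).
    part_weights: per-part weights in the joint decision; the deepest
                  part would normally receive the largest weight.
    """
    w = np.asarray(part_weights, dtype=float)
    w = w / w.sum()  # normalise so the joint output is a distribution
    probs = [softmax(l) for l in part_logits]
    return sum(wi * p for wi, p in zip(w, probs))

# Toy example: 3 parts, batch of 2, 10 classes.
rng = np.random.default_rng(0)
logits = [rng.normal(size=(2, 10)) for _ in range(3)]
joint = joint_multilayer_output(logits, [0.2, 0.3, 0.5])
```

Because the per-part weights are normalised and each softmax output is a probability distribution, the joint output is itself a valid distribution over classes.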

3.2 The Joint Decision of Multi-Networks

A decision made by only one network may cause misjudgments due to noise and other errors. A joint decision made by multi-networks can effectively reduce these errors and thus improve the accuracy and robustness of the results. Based on this, we propose a method in which multiple networks participate in the final decision together.

The overview of the joint decision of multi-networks is presented in Fig. 2. Suppose that the total number of networks and the classification accuracy of each network on the training set are given. Then the weight of a network can be represented as:

(4)

where a factor controls the weight directly and further affects the data weights. Specifically, we train the first network normally on the original dataset; for the others, we make small adjustments to the weights of the data: we decrease the weights of correctly classified images and increase the weights of wrongly classified ones. In detail, consider one image extracted from the total dataset. It initially has the same weight as the other images, but its weight may change depending on the classification results after the adjustment. The new weight for the next network can be represented as:

(5)

When dealing with classification problems, the most common loss function is the cross entropy:

(6)

We notice that the weight adjustment is easy to achieve by modifying the values of the labels. The final joint output can be expressed as:

(7)
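The weight machinery of this subsection can be sketched as follows. Assumptions are flagged loudly: the paper's exact formulas in Eqs. (4)–(5) are not recoverable from the text, so `network_weight` mirrors Adaboost's log-odds form, and the `gamma` scaling in `update_sample_weights` is a hypothetical choice of adjustment factor.

```python
import numpy as np

def network_weight(accuracy, beta=1.0):
    # Adaboost-style weight of a network from its training accuracy.
    # beta plays the role of the control factor in Eq. (4); the exact
    # formula is an assumption mirroring Adaboost's log-odds form.
    return beta * 0.5 * np.log(accuracy / (1.0 - accuracy))

def update_sample_weights(weights, correct, gamma=0.5):
    # Decrease the weights of correctly classified images and
    # increase the wrongly classified ones (gamma is hypothetical),
    # then renormalise so the mean weight stays 1.
    w = np.where(correct, weights * gamma, weights / gamma)
    return w / w.mean()

def weighted_labels(one_hot, weights):
    # Scaling the one-hot labels scales each sample's cross-entropy
    # contribution, realising the weight adjustment via the labels.
    return one_hot * weights[:, None]

w = np.ones(6)
correct = np.array([True, True, False, True, False, True])
w2 = update_sample_weights(w, correct)  # wrong samples gain weight
```

Because cross entropy is linear in the label vector, multiplying a sample's one-hot label by its weight is equivalent to weighting that sample's loss, which is why the label modification mentioned above suffices.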

3.3 The Joint Decision of Multi-Participants

Similar to Adaboost, with the same number of parameters, we tend to use several small networks instead of one large network to make the final decision. We record the total number of networks selected and keep the total number of parameters similar to that of the original large network.

1: Input: Number of networks, list of networks, weight of each network, accuracy, weight of each data item, number of data items.
2: Output: Final decision.
3: for each network do
4:     Construct the multi-layers network;
5:     Calculate its joint accuracy;
6:     Calculate its weight from the accuracy;
7:     for each data item do
8:         Update the data item's weight;
9:     end for
10: end for
11: Calculate the joint output of all networks;
12: return the final decision
Algorithm 1: The Joint Decision of Multi-Participants

The joint decision of multi-participants, which contains multi-layers and multi-networks, is described in Algorithm 1.
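The loop of Algorithm 1 can be sketched end to end in Python. `train_network`, `evaluate` and `predict` are hypothetical stand-ins for the real training, joint-accuracy, and inference routines, and the weight formulas follow the Adaboost-style assumptions discussed above rather than the paper's lost equations.

```python
import numpy as np

def joint_decision(train_network, evaluate, predict,
                   n_networks, n_data, beta=1.0, gamma=0.5):
    """Sketch of Algorithm 1: train several multi-layers networks in
    sequence, reweighting the data between rounds, then combine their
    predictions with accuracy-derived network weights."""
    data_w = np.ones(n_data)
    nets, net_w = [], []
    for _ in range(n_networks):
        net = train_network(data_w)       # construct + train one multi-layers network
        acc, correct = evaluate(net)      # joint accuracy, per-sample correctness
        nets.append(net)
        net_w.append(beta * 0.5 * np.log(acc / (1 - acc)))
        # Decrease weights of correct samples, increase wrong ones.
        data_w = np.where(correct, data_w * gamma, data_w / gamma)
        data_w *= n_data / data_w.sum()   # renormalise
    net_w = np.asarray(net_w)
    net_w /= net_w.sum()
    return lambda x: sum(w * predict(net, x) for w, net in zip(net_w, nets))

# Toy stand-ins so the sketch runs end to end: a "network" is just a
# random score vector, and evaluation reports a fixed 80% accuracy.
rng = np.random.default_rng(1)
make = lambda w: rng.normal(size=(10,))
ev = lambda net: (0.8, rng.random(6) < 0.8)
pred = lambda net, x: net
f = joint_decision(make, ev, pred, n_networks=3, n_data=6)
out = f(None)
```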

Networks           Params (Mil.)   Baseline (%)   Ours, setting 1 (%)   Ours, setting 2 (%)
VGG-16 [13]        15.00           6.29±0.10      6.07±0.13             6.05±0.17
ResNet-18 [4]      11.18           4.00±0.11      3.87±0.14             3.91±0.12
DenseNet-BC [6]    0.79            4.38±0.12      4.35±0.16             4.26±0.14
Table 1: Comparison against the baselines of different networks on CIFAR-10 (test error). The joint decision of multi-layers improves on the baselines under both hyper-parameter settings.
Scaling Factor   Params (Mil.)   Number of Networks   Single Test Error (%)   Total Test Error (%)
Original         11.18           1                    4.00±0.11               4.00±0.11
                 5.78            2                    4.22±0.09               3.77±0.16
                 2.8             3                    4.90±0.24               3.95±0.08
                                 4                                            3.79±0.08
                 0.7             5                    6.61±0.21               5.24±0.06
                                 10                                           4.92±0.11
Table 2: Comparison against the original ResNet-18 with different scaling factors on CIFAR-10. The joint decision of multi-networks improves the results for several appropriate numbers of networks.

4 Experiments

In this section, we describe the implementation of our experiments and report the performance of our methods. The experiments follow the methods described in Sect. 3, and we compare the networks designed by our methods with the original classical ones to demonstrate their feasibility and effectiveness.

4.1 Datasets

We use CIFAR-10 and CIFAR-100 [7] as the basic datasets for image classification in our experiments. We normalize the images using channel means and standard deviations for preprocessing and apply a standard data augmentation scheme (zero-padded with 4 pixels on each side to obtain a 40×40 pixel image, from which a 32×32 crop is randomly extracted and the image is randomly flipped horizontally).
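The augmentation scheme above can be sketched with plain numpy; in practice it would normally be expressed as a library transform pipeline, so this is only an illustrative implementation of the pad/crop/flip steps.

```python
import numpy as np

def augment(img, pad=4, rng=None):
    """Standard CIFAR augmentation: zero-pad `pad` pixels on each
    side, take a random crop at the original size, and flip
    horizontally with probability 0.5.  `img` has shape (H, W, C)."""
    rng = rng or np.random.default_rng()
    h, w, _ = img.shape
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)))  # zero padding
    top = rng.integers(0, 2 * pad + 1)    # random crop offsets
    left = rng.integers(0, 2 * pad + 1)
    crop = padded[top:top + h, left:left + w]
    if rng.random() < 0.5:                # random horizontal flip
        crop = crop[:, ::-1]
    return crop

img = np.zeros((32, 32, 3), dtype=np.uint8)  # a CIFAR-sized image
out = augment(img)
```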

                   ---------- Original Networks ----------   --- Joint Decision of Multi-Participants ---
Networks           Params (Mil.)  C10 Err (%)  C100 Err (%)  Total Params (Mil.)  C10 Err (%)  C100 Err (%)
VGG-16 [13]        15.00          6.29±0.10    28.22±0.38    15.00                6.05±0.17    26.06±0.29
ResNet-18 [4]      11.18          4.00±0.11    24.58±0.23    11.22                3.74±0.12    20.09±0.11
DenseNet-BC [6]    0.79           4.38±0.12    21.96±0.19    0.73                 4.14±0.13    21.77±0.19
MobileNetV2 [12]   2.29           6.32±0.19    25.68±0.23    2.30                 6.16±0.18    23.45±0.21
Table 3: Comparison against classical convolutional neural networks on CIFAR-10 and CIFAR-100. The joint decision of multi-participants achieves lower classification error on the test sets while keeping the total number of parameters similar to the original networks.

4.2 Training Methods

Networks are trained on the full training dataset until convergence using Cutout [1]. For a fair comparison, we retrain the original networks with the same training method as the networks designed by our approach. That is, all networks, both the baselines and our modified ones, are trained with a batch size of 128 using SGDR [9] with Nesterov's momentum for 511 epochs. The remaining hyper-parameters are the cutout size for Cutout and the learning-rate settings for SGDR. We conducted every experiment three or four times and report the mean classification error with standard deviation on the test dataset.
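The SGDR schedule used above anneals the learning rate with a cosine and periodically restarts it. The exact learning-rate bounds and restart period are lost from the text, so the defaults below are placeholders; note that with an initial period of 1 epoch doubled at each restart, the cycles sum to 1+2+4+…+256 = 511, which may explain the 511-epoch budget.

```python
import math

def sgdr_lr(epoch, eta_max=0.1, eta_min=0.0, t0=1, t_mult=2):
    """Cosine-annealed learning rate with warm restarts (SGDR).
    t0 is the length of the first cycle; each subsequent cycle is
    t_mult times longer. Returns the learning rate at `epoch`."""
    t_i, t_cur = t0, epoch
    while t_cur >= t_i:          # locate the current restart cycle
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))
```

At each restart the rate jumps back to `eta_max` and then decays along a cosine toward `eta_min` over the cycle.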

4.3 Crucial Preparations Before the Main Experiment

In this subsection, we describe some crucial preparations for the main experiment. We conducted confirmatory experiments for each of the methods proposed above and roughly determined the hyper-parameters.

The experiments for the joint decision of multi-layers. We verified the performance of this method with several experiments on different classical networks on CIFAR-10. The networks are VGG-16 (with several FC layers removed), ResNet-18 and DenseNet-BC. Together with the original networks, we obtained results for two hyper-parameter settings. As shown in Table 1, the networks with the joint decision of multi-layers outperform the original ones, and the difference between the two settings is not obvious. The increase in the number of parameters caused by the multi-layers is negligible (about 0.01 Mil.), so we do not list it separately in the table. These results confirm the effectiveness of our method.

The experiments for the joint decision of multi-networks. We also verified the performance of this method with several experiments for ResNet-18 on CIFAR-10. We scaled down the number of parameters of ResNet-18 by reducing the number of channels (the resulting parameter counts are listed in Table 2). Together with the original network, we obtained results for different numbers of networks, chosen so that the total number of parameters of the multi-networks is smaller than or similar to that of the original network. As shown in Table 2, the joint decision of multi-networks outperforms the original network for several appropriate numbers of networks. This indicates that the number of networks should not be too large in the following main experiment.

4.4 The Experiments for the Joint Decision of Multi-Participants and Results

We conducted the experiments for the joint decision of multi-participants by combining the two parts above with the hyper-parameters determined in Sect. 4.3. We selected several classical convolutional neural networks to demonstrate the merits of our methods.

The comparison against the original classical convolutional neural networks on CIFAR-10 and CIFAR-100 is presented in Table 3. We show the mean test error with standard deviation while keeping the total number of parameters similar. The networks used on CIFAR-100 have slightly more parameters than those used on CIFAR-10, but since the counts are almost identical we only list the CIFAR-10 ones in the table.

We can clearly see that our method performs better: accuracy improves by up to 0.26% on CIFAR-10 and 4.49% on CIFAR-100 with a similar number of parameters (FLOPs are also similar because we only reduce the number of channels within the networks).

What's more, the smaller single networks designed by our methods are better suited to edge devices. We no longer need to load a large network at once for training or inference, nor are we plagued by limited storage. In addition, concurrent training and inference make image processing more efficient and effective, which we attribute to the joint decision of multi-participants.

5 Conclusion

We propose the joint decision of multi-participants, which mainly contains multi-layers and multi-networks. It is suitable for edge devices while improving the efficiency and effectiveness of inference. Our method achieves higher classification accuracy with a similar number of parameters for classical convolutional neural networks, and it is orthogonal to prior research.

References

  • [1] T. Devries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. CoRR abs/1708.04552. External Links: Link, 1708.04552 Cited by: §4.2.
  • [2] Y. Freund and R. E. Schapire (1995) A decision-theoretic generalization of on-line learning and an application to boosting. See DBLP:conf/eurocolt/1995, pp. 23–37. External Links: Link, Document Cited by: §2.
  • [3] S. Han, J. Pool, J. Tran, and W. J. Dally (2015) Learning both weights and connections for efficient neural network. See DBLP:conf/nips/2015, pp. 1135–1143. External Links: Link Cited by: §1, §2.
  • [4] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. See DBLP:conf/cvpr/2016, pp. 770–778. External Links: Link, Document Cited by: §1, §2, Table 1, Table 3.
  • [5] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. See DBLP:conf/cvpr/2018, pp. 7132–7141. External Links: Link, Document Cited by: §2.
  • [6] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. See DBLP:conf/cvpr/2017, pp. 2261–2269. External Links: Link, Document Cited by: §1, Table 1, Table 3.
  • [7] A. Krizhevsky and G. Hinton (2009-01) Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep 1, pp. . Cited by: §4.1.
  • [8] H. Liu, K. Simonyan, and Y. Yang (2018) DARTS: differentiable architecture search. CoRR abs/1806.09055. External Links: Link, 1806.09055 Cited by: §2.
  • [9] I. Loshchilov and F. Hutter (2017) SGDR: stochastic gradient descent with warm restarts. See DBLP:conf/iclr/2017, External Links: Link Cited by: §4.2.
  • [10] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. See DBLP:conf/icml/2018, pp. 4092–4101. External Links: Link Cited by: §2.
  • [11] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2018) Regularized evolution for image classifier architecture search. CoRR abs/1802.01548. External Links: Link, 1802.01548 Cited by: §1.
  • [12] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. See DBLP:conf/cvpr/2018, pp. 4510–4520. External Links: Link, Document Cited by: §1, §2, Table 3.
  • [13] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. See DBLP:conf/iclr/2015, External Links: Link Cited by: §1, Table 1, Table 3.
  • [14] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. See DBLP:conf/icml/2019, pp. 6105–6114. External Links: Link Cited by: §1.
  • [15] X. Zhang, X. Zhou, M. Lin, and J. Sun (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. See DBLP:conf/cvpr/2018, pp. 6848–6856. External Links: Link, Document Cited by: §1, §2.
  • [16] C. Zhao, B. Ni, J. Zhang, Q. Zhao, W. Zhang, and Q. Tian (2019) Variational convolutional neural network pruning. See DBLP:conf/cvpr/2019, pp. 2780–2789. External Links: Link Cited by: §1, §2.
  • [17] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. See DBLP:conf/cvpr/2018, pp. 8697–8710. External Links: Link, Document Cited by: §1.