1 Introduction
Deep learning, which relies heavily on neural networks, has achieved huge success in the fields of computer vision and image processing. Recently, a multitude of neural architectures have been proposed to improve the performance of image processing. Brand-new architectures mainly come from human design, such as VGG [13], ResNet [4] and DenseNet [6], or from automatic neural architecture search, such as NASNet [17], AmoebaNet [11] and EfficientNet [14]. Along with these improvements in image-processing performance, the scale of neural networks has been gradually increasing.
Nevertheless, edge computing, as an emerging field, tends to analyze and process data close to its source in order to reduce costs and improve efficiency. Many studies attempt to reduce the scale of networks so that they can be deployed on edge devices such as wearables and IoT devices. The proposed methods are mainly limited to model compression [3, 16] and lightweight network design [12, 15]. These methods may impose special requirements on the devices or only theoretically reduce the number of parameters and the amount of computation. Furthermore, such reductions in network scale often come at the expense of image-processing performance.
In this paper, we propose a new approach to address these problems, so that higher image-classification accuracy can be achieved with fewer parameters. We break a large network up into smaller parts and merge the information of more layers, treating them as multiple participants that make a joint decision. These small networks are better suited to edge devices because of their limited storage and the possibility of concurrent training. The experimental results also show that our approach is more efficient and effective for inference.
Our contributions are summarized as follows:

We propose the joint decision of multi-participants, which covers both multi-layers and multi-networks. This approach improves the effectiveness and robustness of a single network, while the ensemble of multiple networks achieves even better overall performance.

We propose clear rules for the architecture design and the training method of every participant.

Our method achieves better classification results on CIFAR-10 and CIFAR-100 with a similar number of parameters and amount of computation.
2 Related Work
The methods to improve the performance of convolutional neural networks mainly focus on optimizing the convolutional neural architectures. Human-designed architectures such as shortcut connections [4] and squeeze-and-excitation blocks [5], as well as automatic search algorithms such as ENAS [10] and DARTS [8], have achieved significant breakthroughs. At the same time, scaling down models has received wide attention so that networks can be applied on edge devices. To this end, many studies on model compression [3, 16] and lightweight network design [12, 15] have been proposed. Our approach is inspired by AdaBoost [2], which trains several weak classifiers on the same training dataset and merges them to form the final classifier. Our method is more suitable for edge devices, achieves more efficient and effective inference, and is orthogonal to these earlier works.
3 Methods
In this section, we present our method, the joint decision of multi-participants, which contains multi-layers and multi-networks. Furthermore, we propose the rules for the architecture design and the training methods for them.
3.1 The Joint Decision of Multi-Layers
Traditional convolutional neural networks extract image features in a forward pass and finally perform the classification via the output of the last Softmax layer. Although the feature extraction tends to become more abstract in the deeper layers of a network, we notice that the earlier layers can also solve some problems that the deep ones cannot. We therefore design the network so that multiple layers participate in the final decision together.
The overview of the joint decision of multi-layers is presented in Fig. 1. We divide the network into several parts according to the scales of the feature maps; they are represented by different colors in the figure. Then we add a Global Average Pooling and a Softmax layer at the end of each part. In particular, a BatchNorm layer and an activation function may be added at the same time. This method adds very few parameters and FLOPs, but gives a voice to more layers in the network.
Suppose that the original output of the network is assigned a certain weight, and that two factors are introduced for regulation. Then the weight of each output we have added can be expressed as:
(1) 
The loss function used during training can be expressed as:
(2) 
The final joint output can then be expressed as:
(3) 
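To make the idea concrete, the following is a minimal PyTorch-style sketch of a backbone whose stages each receive their own Global Average Pooling and classifier head, with the heads' softmax outputs combined by fixed weights. The head design and the per-head weights are illustrative assumptions rather than the exact formulation of Eqs. (1)-(3).

```python
# A minimal sketch of the multi-layer joint decision described above.
# The stage split, the head design (GAP + BatchNorm + linear + Softmax) and
# the per-head weights are our own illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLayerJointNet(nn.Module):
    def __init__(self, stages, stage_channels, num_classes, head_weights):
        super().__init__()
        self.stages = nn.ModuleList(stages)       # backbone split by feature-map scale
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(1),          # Global Average Pooling
                nn.Flatten(),
                nn.BatchNorm1d(c),                # optional BatchNorm before the classifier
                nn.Linear(c, num_classes),
            )
            for c in stage_channels
        ])
        self.head_weights = head_weights          # e.g. smaller weights for earlier parts

    def forward(self, x):
        joint = 0.0
        for stage, head, w in zip(self.stages, self.heads, self.head_weights):
            x = stage(x)
            joint = joint + w * F.softmax(head(x), dim=1)   # every part votes on the class
        return joint
```

During training, each head can additionally be supervised with its own weighted cross-entropy term, which loosely corresponds to the multi-term loss of Eq. (2).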
3.2 The Joint Decision of Multi-Networks
A decision made by only one network may suffer from occasional misjudgments due to noise and other errors. A joint decision made by multiple networks can effectively reduce these errors and thus improve the accuracy and robustness of the results. Based on this, we propose a method in which multiple networks participate in the final decision together.
The overview of the joint decision of multi-networks is presented in Fig. 2. Suppose that we are given the total number of networks and the classification accuracy of each network on the training set. Then, the weight of a network can be represented as:
(4) 
where an additional factor is introduced to control this weight directly and thereby further affect the joint output. Specifically, we train the first network normally on the original dataset; for the others, we make small adjustments to the weights of the data: we decrease the weights of correctly classified images and increase the weights of wrongly classified ones. In detail, consider a single image drawn from the total dataset together with its associated weight. It initially has the same weight as the other images, but this weight may change according to the classification results after the adjustment. The new weight used for the next network can be represented as:
(5) 
When dealing with classification problems, the most common loss function is the cross entropy:
$\mathcal{L}_{CE} = -\sum_{c=1}^{C} y_{c}\,\log p_{c}$,  (6)

where $y_{c}$ is the label value for class $c$ and $p_{c}$ is the predicted probability for that class.
We can see that the weight adjustment is easy to achieve by modifying the values of the labels. The final joint output can be expressed as:
(7) 
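The following sketch illustrates, under our own simplifying assumptions, how the data reweighting can be implemented by scaling the label values and how the outputs of several networks can be combined with accuracy-dependent weights. The scale factors and the weighting rule are placeholders rather than the exact Eqs. (4), (5) and (7).

```python
# A minimal sketch of the AdaBoost-style reweighting and the multi-network vote.
# The concrete update rule (`down` / `up`) and the accuracy-based network weights
# are illustrative assumptions, not the paper's exact formulas.
import torch
import torch.nn.functional as F

def reweight_labels(one_hot_labels, sample_weights):
    # Scaling each image's one-hot target by its sample weight is enough to
    # reweight the cross-entropy loss, since CE is linear in the label values.
    return one_hot_labels * sample_weights.unsqueeze(1)

def update_sample_weights(sample_weights, correct_mask, down=0.9, up=1.1):
    # Decrease the weight of correctly classified images and increase it for
    # wrongly classified ones before training the next network.
    return torch.where(correct_mask, sample_weights * down, sample_weights * up)

def joint_prediction(logits_list, train_accuracies):
    # Weight each network's softmax output by a value derived from its training
    # accuracy (more accurate networks get a louder voice), then sum the votes.
    weights = torch.tensor(train_accuracies)
    weights = weights / weights.sum()
    joint = sum(w * F.softmax(logits, dim=1)
                for w, logits in zip(weights, logits_list))
    return joint.argmax(dim=1)
```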
3.3 The Joint Decision of Multi-Participants
Similar to AdaBoost, and under the same parameter budget, we tend to use several small networks instead of one large network to make the final decision. We record the total number of selected networks and keep their total number of parameters similar to that of the original large network.
The joint decision of multi-participants, which contains multi-layers and multi-networks, is described in Algorithm 1; a simplified outline is sketched below.
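As a rough illustration only, and not the exact procedure of Algorithm 1, the combined training could proceed as in the outline below, which reuses the simplified helpers sketched above; `train_one_network` and the parameter-budget check are hypothetical stand-ins.

```python
# Hypothetical outline of the combined procedure: train several small
# multi-layer networks sequentially on reweighted data, then let all of
# them vote (see the sketches above for the vote itself).
import torch

def train_multi_participants(make_small_net, dataset, num_networks, param_budget):
    # Every training image starts with the same weight.
    sample_weights = torch.ones(len(dataset))
    networks, accuracies = [], []
    for _ in range(num_networks):
        net = make_small_net()   # one small network with multi-layer joint decision
        # `train_one_network` is a hypothetical helper: it trains on labels scaled
        # by `sample_weights` and returns the training accuracy plus a boolean
        # mask of correctly classified images.
        acc, correct_mask = train_one_network(net, dataset, sample_weights)
        networks.append(net)
        accuracies.append(acc)
        # Reweight the data before training the next participant.
        sample_weights = update_sample_weights(sample_weights, correct_mask)
    # Keep the ensemble within the parameter budget of the original large network.
    total_params = sum(p.numel() for n in networks for p in n.parameters())
    assert total_params <= param_budget
    return networks, accuracies
```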
Table 1: Test error (%) on CIFAR-10 for the joint decision of multi-layers under two settings of the hyperparameters.

| Networks | Params (Mil.) | Baseline | Ours (setting 1) | Ours (setting 2) |
|---|---|---|---|---|
| VGG-16 [13] | 15.00 | 6.29±0.10 | 6.07±0.13 | 6.05±0.17 |
| ResNet-18 [4] | 11.18 | 4.00±0.11 | 3.87±0.14 | 3.91±0.12 |
| DenseNet-BC [6] | 0.79 | 4.38±0.12 | 4.35±0.16 | 4.26±0.14 |
Table 2: Test error (%) on CIFAR-10 for the joint decision of multi-networks with scaled-down versions of ResNet-18.

| Scaling Factor | Params (Mil.) | Number of Networks | Single Test Error (%) | Total Test Error (%) |
|---|---|---|---|---|
| Original | 11.18 | 1 | 4.00±0.11 | 4.00±0.11 |
|  | 5.78 | 2 | 4.22±0.09 | 3.77±0.16 |
|  | 2.8 | 3 | 4.90±0.24 | 3.95±0.08 |
|  |  | 4 |  | 3.79±0.08 |
|  | 0.7 | 5 | 6.61±0.21 | 5.24±0.06 |
|  |  | 10 |  | 4.92±0.11 |
4 Experiments
In this section, we describe the implementation of the experiments and report the performance of our methods. The experiments follow the methods presented in Sect. 3, and we compare the networks designed by our methods with the original classical ones to demonstrate their feasibility and effectiveness.
4.1 Datasets
We use CIFAR-10 and CIFAR-100 [7] for image classification as the basic datasets in our experiments. We normalize the images using the channel means and standard deviations for preprocessing and apply a standard data augmentation scheme: each image is zero-padded with 4 pixels on each side to obtain a 40×40 pixel image, a 32×32 crop is then randomly extracted, and the image is randomly flipped horizontally.
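For reference, a torchvision-style version of this preprocessing pipeline might look as follows; the normalization statistics are the commonly used CIFAR-10 values and are an assumption, since only the use of channel means and standard deviations is stated here.

```python
# A sketch of the standard CIFAR preprocessing described above (torchvision).
# The mean/std values are the commonly used CIFAR-10 statistics (assumption).
from torchvision import transforms

CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2470, 0.2435, 0.2616)

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # zero-pad 4 px per side, random 32x32 crop
    transforms.RandomHorizontalFlip(),      # random horizontal flip
    transforms.ToTensor(),
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])
```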
Table 3: Comparison with the original networks on CIFAR-10 (C10) and CIFAR-100 (C100); mean test error (%) with standard deviation.

| Networks | Original: Params (Mil.) | Original: C10 Test Error (%) | Original: C100 Test Error (%) | Ours: Total Params (Mil.) | Ours: C10 Test Error (%) | Ours: C100 Test Error (%) |
|---|---|---|---|---|---|---|
| VGG-16 [13] | 15.00 | 6.29±0.10 | 28.22±0.38 | 15.00 | 6.05±0.17 | 26.06±0.29 |
| ResNet-18 [4] | 11.18 | 4.00±0.11 | 24.58±0.23 | 11.22 | 3.74±0.12 | 20.09±0.11 |
| DenseNet-BC [6] | 0.79 | 4.38±0.12 | 21.96±0.19 | 0.73 | 4.14±0.13 | 21.77±0.19 |
| MobileNetV2 [12] | 2.29 | 6.32±0.19 | 25.68±0.23 | 2.30 | 6.16±0.18 | 23.45±0.21 |
4.2 Training Methods
Networks are trained on the full training dataset until convergence, using Cutout [1]. For a fair comparison, we retrain the original networks with the same training method as the networks designed by our methods. That is to say, all the networks, including the baselines and those modified by us, are trained with a batch size of 128 using SGDR [9] with Nesterov's momentum for 511 epochs. The hyperparameters of these methods are the cutout size for Cutout and the scheduling parameters for SGDR. We conducted every experiment three or four times, and the mean classification error with standard deviation on the test dataset is reported.
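A sketch of such an optimizer and schedule using PyTorch's built-in warm-restart scheduler is shown below; all numeric values are illustrative placeholders rather than the settings actually used, and the Cutout augmentation of [1] is not shown.

```python
# A sketch of the training setup: SGD with Nesterov momentum plus SGDR-style
# warm restarts. All numeric values below are illustrative placeholders, not
# the exact hyperparameters of the experiments.
import torch

def make_optimizer_and_scheduler(model):
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=0.1,               # placeholder initial learning rate
        momentum=0.9,         # Nesterov momentum
        nesterov=True,
        weight_decay=5e-4,    # placeholder weight decay
    )
    # Cosine annealing with warm restarts (SGDR); with these placeholder values
    # T_0=1 and T_mult=2, the restart periods 1, 2, 4, ... sum to 511 epochs
    # after nine cycles.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer, T_0=1, T_mult=2
    )
    return optimizer, scheduler
```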
4.3 Crucial Preparations Before the Main Experiment
In this subsection, we describe some crucial preparations made before the main experiment. We conducted confirmatory experiments separately for each of the methods proposed above and used them to roughly determine the hyperparameters.
The experiments for the joint decision of multi-layers. We verified the performance of this method through several experiments with different classical networks on CIFAR-10. The networks include VGG-16 (with several FC layers removed), ResNet-18 and DenseNet-BC. Together with the original networks, we obtained results for two settings of the hyperparameters. As shown in Table 1, the networks with the joint decision of multi-layers perform better than the original ones, and the difference between the two settings is not obvious. Specifically, the increase in the number of parameters caused by the multi-layer heads is negligible (about 0.01 Mil.), so we did not list it in the table. These results clearly demonstrate the effectiveness of our method.
The experiments for the joint decision of multi-networks. We also verified the performance of this method through several experiments with ResNet-18 on CIFAR-10. We scaled the number of parameters of ResNet-18 down by reducing the number of channels; the resulting scaling factors and parameter counts are listed in Table 2. Together with the original network, we obtained results for different numbers of networks, chosen so that the total number of parameters of the multi-networks is smaller than or similar to that of the original network. As shown in Table 2, the joint decision of multi-networks outperforms the original network for several appropriate numbers of networks. This also indicates that the number of joint networks should not be too large in the following main experiment.
4.4 The Experiments for the Joint Decision of Multi-Participants and Results
We conducted the experiments for the joint decision of multi-participants by combining the two parts described above with the hyperparameters determined in Sect. 4.3. We selected several classical convolutional neural networks to demonstrate the merits of our methods.
The comparison against the original classical convolutional neural networks on CIFAR-10 and CIFAR-100 is presented in Table 3. We report the mean test error with standard deviation while keeping the total number of parameters similar. Specifically, the networks used for CIFAR-100 have slightly more parameters than those used for CIFAR-10 (because of the larger number of output classes), but we only list the CIFAR-10 values in the table since the parameter counts are almost the same.
We can clearly see that our method performs better. The accuracy is improved by up to 0.26% on CIFAR-10 and 4.49% on CIFAR-100 with a similar number of parameters (the FLOPs are also similar because we only reduce the number of channels within the networks).
Moreover, the smaller single networks designed by our methods are better suited to edge devices. We no longer need to load a large network all at once for training or inference, and are therefore not constrained by the limited storage. In addition, concurrent training and inference make image processing more efficient and effective, which we attribute to the joint decision of multi-participants.
5 Conclusion
We propose the joint decision of multi-participants, which mainly contains multi-layers and multi-networks. It is suitable for edge devices and improves the efficiency and effectiveness of inference. Our method achieves higher classification accuracy with a similar number of parameters for classical convolutional neural networks, and it is orthogonal to earlier research.
References
[1] Improved regularization of convolutional neural networks with cutout. CoRR abs/1708.04552, 2017.
[2] A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, pp. 23-37, 1995.
[3] Learning both weights and connections for efficient neural network. In NIPS, pp. 1135-1143, 2015.
[4] Deep residual learning for image recognition. In CVPR, pp. 770-778, 2016.
[5] Squeeze-and-excitation networks. In CVPR, pp. 7132-7141, 2018.
[6] Densely connected convolutional networks. In CVPR, pp. 2261-2269, 2017.
[7] Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[8] DARTS: differentiable architecture search. CoRR abs/1806.09055, 2018.
[9] SGDR: stochastic gradient descent with warm restarts. In ICLR, 2017.
[10] Efficient neural architecture search via parameter sharing. In ICML, pp. 4092-4101, 2018.
[11] Regularized evolution for image classifier architecture search. CoRR abs/1802.01548, 2018.
[12] MobileNetV2: inverted residuals and linear bottlenecks. In CVPR, pp. 4510-4520, 2018.
[13] Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[14] EfficientNet: rethinking model scaling for convolutional neural networks. In ICML, pp. 6105-6114, 2019.
[15] ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In CVPR, pp. 6848-6856, 2018.
[16] Variational convolutional neural network pruning. In CVPR, pp. 2780-2789, 2019.
[17] Learning transferable architectures for scalable image recognition. In CVPR, pp. 8697-8710, 2018.