Deep learning, which relies heavily on neural networks, has achieved huge success in the fields of computer vision and image processing. Recently, numerous neural architectures have been proposed to improve the performance of image processing. Brand-new architectures mainly come from human design, such as VGG, ResNet and DenseNet, or from automatic neural architecture search, such as NASNet, AmoebaNet and EfficientNet. As the performance of image processing improves, the scale of neural networks is gradually increasing.
Meanwhile, edge computing, as an emerging field, tends to analyze and process data rapidly at the source to reduce costs and improve efficiency. Many studies attempt to reduce the scale of networks in order to deploy them on edge devices such as wearable and IoT devices. The proposed methods are mainly limited to model compression [3, 16] and lightweight network design [12, 15]. These methods may have special requirements for the devices or only theoretically reduce the number of parameters and the amount of computation. Furthermore, these reductions in network scale often come at the expense of image processing performance.
In this paper, we propose a new approach to address these problems, achieving higher image classification accuracy with fewer parameters. We break a large network up into smaller parts and merge the information of more layers as multi-participants to make a joint decision. These small networks are better suited for deployment on edge devices, given the limited storage and the possibility of concurrent training. The experimental results also show that our approach is more efficient and effective for inference.
Our contributions are summarized as follows:
We propose the joint decision of multi-participants, which contains multi-layers and multi-networks. This approach improves the effectiveness and robustness of a single network, while the combination of multiple networks achieves better overall performance.
We propose clear rules for the architecture design and the training methods of every participant.
Our method achieves better results for classification on CIFAR-10 and CIFAR-100 with a similar number of parameters and amount of computations.
2 Related Work
The methods to improve the performance of convolutional neural networks mainly focus on optimizing the convolutional neural architectures. Human-designed components such as short-cut connections and squeeze-excitation blocks, as well as automatic search algorithms such as ENAS and DARTS, have achieved significant breakthroughs. At the same time, scaling models down for edge devices has also received wide attention, and many studies on model compression [3, 16] and lightweight network design [12, 15] have been proposed. Our approach is inspired by Adaboost, which trains several weak classifiers on the same training dataset and merges them to form the final classifier. Our approach is more suitable for edge devices, achieves more efficient and effective inference, and is orthogonal to these prior studies.
3 Method

In this section, we present our method: the joint decision of multi-participants, which contains multi-layers and multi-networks. Furthermore, we propose rules for the architecture design and training methods of each participant.
3.1 The Joint Decision of Multi-Layers
Traditional convolutional neural networks extract features of images in a forward pass and finally perform image classification via the output of the last Softmax layer. Although feature extraction tends to become more abstract in deeper layers of a network, we notice that the front layers can also solve some problems that the deep ones cannot. We therefore design the network so that multiple layers participate in the final decision together.
The overview of the joint decision of multi-layers is presented in Fig. 1. We divide the network into several parts according to the scales of the feature maps; the parts are represented by different colors in the figure. Then we add a Global Average Pooling and a Softmax layer at the end of each part. In particular, a Batch-Norm layer and an activation function may be added simultaneously. This method adds few parameters and FLOPs, but gives more layers in the network a voice in the final decision.
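The auxiliary head described above can be sketched as follows. This is a minimal numpy illustration, not the paper's exact implementation; the linear projection `weight`/`bias` and the omission of the optional Batch-Norm are our assumptions.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the class axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def aux_head(feature_map, weight, bias):
    """Auxiliary classification head: Global Average Pooling
    followed by a linear layer and Softmax.

    feature_map: (N, C, H, W) activations of one network part
    weight:      (C, num_classes) linear projection (assumed)
    bias:        (num_classes,) bias term (assumed)
    """
    pooled = feature_map.mean(axis=(2, 3))   # GAP -> (N, C)
    return softmax(pooled @ weight + bias)   # -> (N, num_classes)
```

Because the head only adds a pooling layer and one small linear layer per part, its parameter and FLOP overhead is tiny compared with the backbone, consistent with the claim above.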
Given the weight of the original output of the network and the factors for regulation, we define the weight of each added output accordingly. The loss function used in training combines the losses of all outputs with these weights, and the final joint output is the correspondingly weighted combination of the Softmax outputs of all parts.
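The weighted combination of the per-part outputs can be sketched as below. The exact weighting formula (with its regulation factors) did not survive extraction, so simple normalized weights are assumed here for illustration.

```python
import numpy as np

def joint_output(head_probs, weights):
    """Weighted joint decision over the outputs of all parts.

    head_probs: list of (N, num_classes) softmax outputs, one per part
    weights:    per-head weights (assumed non-negative; normalized here,
                whereas the paper derives them from its regulation factors)
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()              # normalize to sum to 1
    stacked = np.stack(head_probs)                 # (m, N, num_classes)
    return np.tensordot(weights, stacked, axes=1)  # (N, num_classes)
```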
3.2 The Joint Decision of Multi-Networks
A decision made by only one network may suffer misjudgments due to noise and other errors. The joint decision made by multi-networks can effectively reduce these errors and thus improve the accuracy and robustness of the results. Based on this, we propose a method in which multiple networks participate in the final decision together.
The overview of the joint decision of multi-networks is presented in Fig. 2. Each network's weight is determined by its classification accuracy on the training set, scaled by a controlling factor. Specifically, we train the first network normally on the original dataset; for the others, we make small adjustments to the weights of the data. We decrease the weights of correctly classified images and increase the weights of wrongly classified ones. In detail, consider one image extracted from the dataset: it initially has the same weight as the other images, but its weight may change depending on the classification results after each adjustment, and the new weight is used when training the next network.
When dealing with classification problems, the most common loss function is cross entropy, $L = -\sum_{i} y_i \log \hat{y}_i$, where $y_i$ denotes the label value and $\hat{y}_i$ the predicted probability of class $i$. We can notice that the weight adjustment is easy to achieve by modifying the values of the labels, since scaling a label scales that sample's contribution to the loss by the same factor. The final joint output is the weighted combination of the outputs of all networks.
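The label-based reweighting can be sketched as follows. The scaling factors `down` and `up` are hypothetical placeholders; the paper's exact adjustment rule did not survive extraction.

```python
import numpy as np

def reweighted_labels(one_hot, correct, down=0.8, up=1.2):
    """Adjust per-sample weights for the next network by scaling
    the one-hot label values (hypothetical factors `down`/`up`).

    Scaling a one-hot label by w scales that sample's cross-entropy
    term -w * log(p_true) by the same factor, so the reweighting can
    be done entirely through the labels.
    """
    scale = np.where(correct, down, up)            # (N,)
    return one_hot * scale[:, None]

def weighted_cross_entropy(probs, scaled_labels, eps=1e-12):
    # -sum_i w_i * y_i * log p_i, averaged over the batch
    return -np.mean(np.sum(scaled_labels * np.log(probs + eps), axis=1))
```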
3.3 The Joint Decision of Multi-Participants
Similar to Adaboost, with the same number of parameters, we tend to use several small networks instead of one large network to make the final decision. We record the total number of selected networks and keep their total number of parameters similar to that of the original large network.
The joint decision of multi-participants which contain multi-layers and multi-networks is described in Algorithm 1.
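A high-level sketch of our reading of the procedure is given below. `train_fn`/`eval_fn` are placeholders for training and evaluating one small network (each of which would itself use the multi-layer joint decision), and `beta` and the 0.9/1.1 adjustment factors are illustrative assumptions, not the algorithm's exact constants.

```python
import numpy as np

def train_joint_ensemble(train_fn, eval_fn, data, labels, T=3, beta=1.0):
    """Sketch of the multi-participant procedure: train T small networks
    sequentially, reweighting the data after each one, and weight each
    network by its training accuracy."""
    sample_w = np.ones(len(labels))
    models, net_w = [], []
    for _ in range(T):
        model = train_fn(data, labels, sample_w)   # train on weighted data
        pred = eval_fn(model, data)
        acc = np.mean(pred == labels)
        net_w.append(beta * acc)                   # network weight from accuracy
        correct = pred == labels                   # adjust per-sample weights
        sample_w = np.where(correct, sample_w * 0.9, sample_w * 1.1)
        models.append(model)
    net_w = np.asarray(net_w) / np.sum(net_w)      # normalized voting weights
    return models, net_w
```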
Table 1. Test error (%) on CIFAR-10 of the baseline networks and ours under two hyper-parameter settings; parameters in Mil.
Table 2. Single and total test error (%) for different scaling factors and numbers of networks; parameters in Mil.
4 Experiments

In this section, we introduce the implementation of the experiments and report the performance of our methods. The experiments are implemented in accordance with the methods described in Sect. 3, and we compare the networks designed by our methods with the original classical ones to demonstrate their feasibility and effectiveness.
4.1 Datasets

We use CIFAR-10 and CIFAR-100 for image classification as the basic datasets in our experiments. We normalize the images using channel means and standard deviations for preprocessing and apply a standard data augmentation scheme (images are zero-padded with 4 pixels on each side to obtain a 40×40 pixel image, from which a 32×32 crop is randomly extracted; the image is then randomly flipped horizontally).
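The pad-crop-flip scheme can be sketched in a few lines of numpy; the 4-pixel padding and 32×32 crop follow the description above, while the channel-last layout is an assumption.

```python
import numpy as np

def augment(img, rng):
    """Standard CIFAR augmentation: zero-pad 4 px per side to 40x40,
    take a random 32x32 crop, then flip horizontally with prob 0.5.

    img: (32, 32, 3) image array, channel-last (assumed layout)
    rng: numpy Generator for reproducible randomness
    """
    padded = np.pad(img, ((4, 4), (4, 4), (0, 0)))  # -> (40, 40, 3)
    y, x = rng.integers(0, 9, size=2)               # crop offsets in [0, 8]
    crop = padded[y:y + 32, x:x + 32]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]                        # horizontal flip
    return crop
```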
Table 3. Parameters and test error on CIFAR-10/CIFAR-100 for the original networks and the joint decision of multi-participants.
4.2 Training Methods
Networks are trained on the full training dataset until convergence using Cutout. For a fair comparison, we retrain the original networks with the same training method as the networks designed by our methods. That is to say, all the networks, including the baselines and those modified by us, are trained with a batch size of 128 using SGDR with Nesterov's momentum for 511 epochs, with the same cutout size for Cutout and the same restart settings for SGDR throughout. We conducted every experiment three to four times and report the mean classification error with standard deviation on the test dataset.
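The SGDR schedule can be sketched as below. With an initial period of 1 epoch doubling after each restart, the periods 1, 2, 4, ..., 256 sum to exactly the 511 epochs used above; the `lr_max`/`lr_min` values are illustrative, not the paper's exact settings.

```python
import math

def sgdr_lr(epoch, lr_max=0.1, lr_min=0.0, t0=1, t_mult=2):
    """Cosine-annealed learning rate with warm restarts (SGDR).

    t0: length of the first cycle in epochs; t_mult: period multiplier.
    """
    t_i, start = t0, 0
    while epoch >= start + t_i:   # advance to the cycle containing `epoch`
        start += t_i
        t_i *= t_mult
    t_cur = epoch - start         # position within the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / t_i))
```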
4.3 Crucial Preparations Before the Main Experiment
In this subsection, we describe some crucial preparations before the main experiment. We conducted confirmatory experiments separately for each of the methods proposed above and roughly determined the hyper-parameters.
The experiments for the joint decision of multi-layers. We verified the performance of this method through several experiments on different classical networks on CIFAR-10. The networks include VGG-16 (with several FC layers removed), ResNet-18 and DenseNet-BC. Together with the original networks, we obtained results under two hyper-parameter settings. As shown in Table 1, the networks with the joint decision of multi-layers outperform the original ones, and the difference between the two hyper-parameter settings is not obvious. Specifically, the increase in the number of parameters caused by multi-layers is negligible (about 0.01 Mil.), so we do not report it in the table. These results confirm the effectiveness of our method.
The experiments for the joint decision of multi-networks. We also verified the performance of this method through several experiments with ResNet-18 on CIFAR-10. We scaled the number of parameters of ResNet-18 down from the original by reducing the number of channels (the scaling factors are listed in Table 2). Together with the original network, we obtained results for different numbers of networks, chosen so that the total number of parameters of the multi-networks is smaller than or similar to that of the original network. As shown in Table 2, the joint decision of multi-networks outperforms the original network for several appropriate settings. This suggests that the number of networks should not be too large in the following main experiment.
4.4 The Experiments for the Joint Decision of Multi-Participants and Results
We conducted the experiments for the joint decision of multi-participants by combining the two methods described above with the hyper-parameters determined in the previous subsection. We selected several classical convolutional neural networks to demonstrate the effectiveness of our methods.
The comparison against the results of the original classical convolutional neural networks on CIFAR-10 and CIFAR-100 is presented in Table 3. We report the mean test error with standard deviation while controlling the total number of parameters to be similar. Specifically, the networks used for CIFAR-100 have slightly more parameters than those for CIFAR-10, but we only show the CIFAR-10 figures in the table because the numbers of parameters are almost the same.
We can clearly see that our method performs better. With a similar number of parameters, the accuracy is improved by up to 0.26% on CIFAR-10 and 4.49% on CIFAR-100 (the FLOPs are also similar because we only reduce the number of channels within the networks).
Moreover, the smaller single networks designed by our methods are more suitable for edge devices. We no longer need to load a large network all at once for training or inference on devices constrained by limited storage. In addition, concurrent training and inference make image processing more efficient and effective, which we attribute to the joint decision of multi-participants.
5 Conclusion

We propose the joint decision of multi-participants, which mainly consists of multi-layers and multi-networks. It is suitable for edge devices while improving the efficiency and effectiveness of inference. Our method achieves higher classification accuracy than classical convolutional neural networks with a similar number of parameters, and it is orthogonal to prior research.
- DeVries, T., Taylor, G.W. (2017) Improved regularization of convolutional neural networks with cutout. CoRR abs/1708.04552.
- Freund, Y., Schapire, R.E. (1995) A decision-theoretic generalization of on-line learning and an application to boosting. In: EuroCOLT, pp. 23–37.
- Han, S., Pool, J., Tran, J., Dally, W.J. (2015) Learning both weights and connections for efficient neural networks. In: NIPS, pp. 1135–1143.
- He, K., Zhang, X., Ren, S., Sun, J. (2016) Deep residual learning for image recognition. In: CVPR, pp. 770–778.
- Hu, J., Shen, L., Sun, G. (2018) Squeeze-and-excitation networks. In: CVPR, pp. 7132–7141.
- Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q. (2017) Densely connected convolutional networks. In: CVPR, pp. 2261–2269.
- Krizhevsky, A. (2009) Learning multiple layers of features from tiny images. Tech. Rep., University of Toronto.
- Liu, H., Simonyan, K., Yang, Y. (2018) DARTS: differentiable architecture search. CoRR abs/1806.09055.
- Loshchilov, I., Hutter, F. (2017) SGDR: stochastic gradient descent with warm restarts. In: ICLR.
- Pham, H., Guan, M.Y., Zoph, B., Le, Q.V., Dean, J. (2018) Efficient neural architecture search via parameter sharing. In: ICML, pp. 4092–4101.
- Real, E., Aggarwal, A., Huang, Y., Le, Q.V. (2018) Regularized evolution for image classifier architecture search. CoRR abs/1802.01548.
- Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L. (2018) MobileNetV2: inverted residuals and linear bottlenecks. In: CVPR, pp. 4510–4520.
- Simonyan, K., Zisserman, A. (2015) Very deep convolutional networks for large-scale image recognition. In: ICLR.
- Tan, M., Le, Q.V. (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In: ICML, pp. 6105–6114.
- Zhang, X., Zhou, X., Lin, M., Sun, J. (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In: CVPR, pp. 6848–6856.
- Zhao, C., Ni, B., Zhang, J., Zhao, Q., Zhang, W., Tian, Q. (2019) Variational convolutional neural network pruning. In: CVPR, pp. 2780–2789.
- Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V. (2018) Learning transferable architectures for scalable image recognition. In: CVPR, pp. 8697–8710.