1 Introduction
In recent years, deep learning has been widely used in areas such as computer vision, natural language processing, and audio recognition. To obtain state-of-the-art results, models have become much larger and deeper. However, large neural networks are so computationally intensive that the workload is too heavy for typical low-power edge devices. For example, NoisyStudent (Xie et al., 2020) achieves a state-of-the-art 88.4% top-1 accuracy on ImageNet, but its model, EfficientNet-L2, is extremely large with 480 million parameters, which makes it unsuitable for deployment on edge devices. To trade off accuracy against latency, many lightweight models such as MobileNet (Howard et al., 2017) and SqueezeNet (Iandola et al., 2016) have been proposed, and model compression techniques are also widely used (Han et al., 2015). Nevertheless, the performance of lightweight models is still far from that of state-of-the-art models, and model compression techniques suffer from a dramatic accuracy drop once they reach their limit.
Distributed inference is a possible way to address both accuracy and latency. With the development of the internet of things (IoT), many edge devices can join a network and form a cluster. How to make good use of the idle devices in such a cluster is still under discussion (Mao et al., 2017). If we can distribute the inference process properly over these devices, we can make the inference of a large model much faster without a performance drop. Two distributed methods are typically considered: data parallelism (Krizhevsky et al., 2012) and model parallelism (Dean et al., 2012). However, in many inference scenarios on edge devices, such as object detection, devices receive streaming image data and make inferences with a pretrained model. Such data arrive one sample at a time rather than batch by batch, so data parallelism does not apply. In this case, model parallelism is the better fit.
In this paper, we propose an architecture, the Separable Neural Network (SNN), with a new model parallelism method for efficient distributed inference on edge devices. With our approach, we can parallelize a class of very deep neural network models, such as ResNet and ResNeXt, across a cluster of edge devices while reducing the cost and overhead of transmission. The approach not only speeds up inference but also reduces the memory usage and computation per device. Fig. 1 shows the pipeline of our approach. We separate the original model and introduce an RL-based neural architecture search method to find the best communication policy from an extremely large search space. Finally, we fine-tune the best model according to the decisions made by the policy and deploy it on edge devices. Our method is best suited to models with residual connections and group convolutions.
2 Related Work
2.1 Multipath structures
Benchmark convolutional neural networks such as AlexNet (Krizhevsky et al., 2012) and VGG (Russakovsky et al., 2015) are generally designed as a sequence of convolution layers that extract features from low level to high level. GoogLeNet (Szegedy et al., 2015) and the Inception series (Szegedy et al., 2016; Szegedy et al., 2017) introduce a form of parallelism: they extract features with filters of different sizes at the same time and merge the results.
ResNet (He et al., 2016) introduced residual connections so that features and gradients propagate more freely, and many later works follow this idea. ResNeXt (Xie et al., 2017) is one of them. It splits the layers into multiple paths that can all be computed in parallel. The multi-path structure can be implemented by group convolution, which divides the input channels and output channels into several groups. The number of paths is controlled by the cardinality $C$, which is the number of groups. If each group has $M$ input channels and $N$ output channels, the cardinality is $C$, and the kernel size is $K \times K$, then the total number of input channels is $CM$ and the total number of output channels is $CN$. The total number of parameters is $C \cdot M \cdot N \cdot K^2$, which is $C$ times smaller than that of a normal convolution with the same numbers of input and output channels.
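The parameter saving of group convolution can be checked with a few lines of arithmetic. The following is a minimal sketch; the function name `conv_params` and the channel sizes are illustrative assumptions, and bias terms are ignored.

```python
def conv_params(in_channels: int, out_channels: int, kernel_size: int, groups: int = 1) -> int:
    """Weight count of a 2-D convolution without bias."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channels must be divisible by groups")
    # Each group maps in_channels/groups inputs to out_channels/groups outputs.
    per_group = (in_channels // groups) * (out_channels // groups) * kernel_size ** 2
    return groups * per_group


# Hypothetical sizes: 8 paths, 16 input and 16 output channels per path, 3x3 kernel.
C, M, N, K = 8, 16, 16, 3
grouped = conv_params(C * M, C * N, K, groups=C)   # group convolution
dense = conv_params(C * M, C * N, K)               # normal convolution
print(grouped, dense, dense // grouped)
```

With these sizes the grouped layer has 18,432 weights and the normal layer has 147,456, a ratio equal to the number of groups.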
2.2 Model parallelism
Model parallelism (Dean et al., 2012) splits the model into several parts and distributes them over different compute nodes. Since each node computes only part of the model, it does not need to synchronize weights and gradients with other nodes. However, this kind of parallelism increases the dependency between nodes (Ben-Nun and Hoefler, 2019): a node has to wait for the results of previous nodes before it can run its own computation. The extra data transmission between layers must also be taken into account.
2.3 Distributed inference for neural networks
Distributed inference is a way to improve inference on edge devices. Teerapittayanon et al. (2017) proposed a hierarchical model, the distributed deep neural network (DDNN), which distributes the components of the model across the cloud, the edge, and the end devices. The end-to-end DDNN can be jointly trained and minimizes transmission costs. DDNN can use the shallow portions at the edge to make inference faster, or send the data to the cloud if the local aggregator determines that the information is insufficient to classify accurately. Edgent (Li et al., 2018) is a framework that adaptively partitions a DNN between edge and end devices according to the bandwidth. After well-trained partitioning, Edgent performs co-inference on the hybrid resources. It also introduces an early-exit mechanism to balance latency and accuracy. DeepThings (Zhao et al., 2018) is a framework for adaptively distributing CNN models on edge devices. It divides convolutional layers into several parts for processing and minimizes memory usage to reduce communication costs. Furthermore, it can be used in dynamic application scenarios with an adaptive partition method to prevent synchronization overheads.
2.4 Neural architecture search
Since rule-based architecture design has reached its performance limit, neural architecture search, which automatically searches for the best model, has become increasingly popular. However, the search space of candidate architectures is huge, which makes random search impractical. Zoph and Le (2016) proposed Neural Architecture Search (NAS), which constructs an RNN controller to sample models from the search space and updates the controller with reinforcement learning (RL). This method found a state-of-the-art model but required a great deal of time and computation. Pham et al. (2018) proposed Efficient Neural Architecture Search (ENAS), which improves NAS by sharing weights among models, skipping the training of sampled models from scratch, and evaluating performance on mini-batches of data. As a result, ENAS finds a model with similar performance and achieves a 1000 times speedup compared to the original NAS.
3 Method
3.1 Separable Neural Network
Starting from a model with sequential layers, consider the feature of one layer. If we separate this feature into groups, it can be seen as the concatenation of its components:
(1) 
To get the feature in the next layer, each component performs the following computation:
(2) 
In this computation, each component is identified by its index, the transformation is an operation such as convolution, and the remaining term is the feature received from another component. After the new feature is obtained, a function that applies quantization and sparsification further reduces its size to make transmission more efficient. Another predefined function then decides which set of receivers the feature is sent to. We can define this process with the following equation:
(3) 
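The split, transform, compress, and send cycle described above can be sketched on toy data. This is only an illustration under stated assumptions: features are flat Python lists, the received feature is aggregated by adding it onto the leading entries, compression keeps the leading share of values and rounds them, and the helper names (`split`, `compress`, `aggregate`, `separable_step`) are hypothetical.

```python
from typing import Callable, List, Optional

Feature = List[float]


def split(feature: Feature, groups: int) -> List[Feature]:
    """Cut a flat feature vector into `groups` equal components."""
    if len(feature) % groups:
        raise ValueError("feature length must be divisible by groups")
    size = len(feature) // groups
    return [feature[i * size:(i + 1) * size] for i in range(groups)]


def compress(feature: Feature, keep_ratio: float, digits: int = 2) -> Feature:
    """Stand-in for sparsification (keep the leading part) and quantization (round)."""
    keep = int(len(feature) * keep_ratio)
    return [round(v, digits) for v in feature[:keep]]


def aggregate(local: Feature, received: Optional[Feature]) -> Feature:
    """Assumed aggregation: add received values onto the leading local entries."""
    merged = list(local)
    for k, v in enumerate(received or []):
        merged[k] += v
    return merged


def separable_step(components: List[Feature], inbox: List[Optional[Feature]],
                   transform: Callable[[Feature], Feature],
                   receivers: List[int], keep_ratio: float):
    """One layer: compute locally with received data, then compress and route."""
    outputs = [aggregate(transform(c), r) for c, r in zip(components, inbox)]
    next_inbox: List[Optional[Feature]] = [None] * len(components)
    for sender, receiver in enumerate(receivers):
        if receiver != sender:  # sending to oneself needs no transmission
            next_inbox[receiver] = compress(outputs[sender], keep_ratio)
    return outputs, next_inbox


def double(component: Feature) -> Feature:
    """Toy transformation standing in for a convolution."""
    return [2.0 * v for v in component]


parts = split([float(v) for v in range(8)], groups=4)
outs, inbox = separable_step(parts, [None] * 4, double, receivers=[1, 0, 3, 2], keep_ratio=0.5)
print(outs, inbox)
```

Each simulated node only touches its own component plus whatever arrived in its inbox, which is the property that lets the components run on separate devices.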
Table 1: ResNeXt-56 and SepResNeXt-56 architectures and their numbers of parameters.
stage  output  ResNeXt-56 (8×16d)  SepResNeXt-56 (8×16d)
conv0  32x32  3x3, 16, stride 1  3x3, 16, stride 1
32x32    concatenate 4 duplications  
conv1  32x32  
conv2  16x16  
conv3  8x8  
8x8    keep first 1/4 channels  
1x1  global average pool  global average pool  
100-d fc, softmax  100-d fc, softmax
# of params.  4.39M  4.54M 
Analysis
With a small modification in the training stage, the model becomes separable during inference. The original ResNet (He et al., 2016) bottleneck block has:
(4) 
parameters, where M is the number of input channels, N is the number of output channels, and K is the filter size. ResNeXt (Xie et al., 2017) further modifies the blocks into multi-path structures, and the number of parameters becomes:
(5) 
where the number of paths and the number of channels in the middle layer can be adjusted so that the number of parameters is as close as possible to that of the original ResNet for comparison. We can further rewrite Equation 5 as follows:
(6) 
Equation 6 makes explicit the numbers of input and output channels of the first layer of the block, and shows that the middle layer is a group convolutional layer whose number of convolutional groups equals the number of paths.
To construct Separable ResNeXt, we first decide the maximum number of compute nodes on which the distributed inference system will be deployed, then divide the paths into that many groups. Equation 5 becomes
(7) 
Again, we rewrite Equation 7 as follows:
(8) 
Compared with Equation 6, the first layer becomes a group convolutional layer, with the number of convolutional groups and the number of input channels given in Equation 8. The middle layer is exactly the same. As a result, the total number of parameters is theoretically the same as in ResNeXt. However, since every group has to maintain its own batch normalization and downsampling operations, the size of separable models may increase slightly (for example, from 4.39M to 4.54M parameters in Table 1).
Table 1 compares ResNeXt-56 and SepResNeXt-56 and their numbers of parameters, where SepResNeXt-56 denotes separable ResNeXt-56. The main difference is that in SepResNeXt we concatenate replicated copies of the output feature maps of the first layer during training (four duplications in Table 1), to simulate the synchronization of data across devices in the first transmission step at deployment. Additionally, to avoid a large transmission overhead from aggregating the results of all devices, we let each device run classification on its local feature maps. Therefore, we keep only the leading portion of the feature maps in the last convolutional layer (the first 1/4 of the channels in Table 1), and these feature maps become the input to the final classifier.
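The claim that separation leaves the convolutional parameter count unchanged can be checked numerically. The sketch below rests on assumptions that are not recovered from the source equations: the first and last 1x1 layers of the separable block are modeled as group convolutions over duplicated channels, batch normalization and downsampling are ignored, and the widths (256 block channels, bottleneck width 64, 32 paths of width 4, 4 groups) are illustrative.

```python
def conv_params(cin: int, cout: int, k: int, groups: int = 1) -> int:
    """Weight count of a bias-free convolution with a k x k kernel."""
    assert cin % groups == 0 and cout % groups == 0, "channels must divide evenly"
    return (cin // groups) * (cout // groups) * k * k * groups


def resnet_bottleneck(m: int, n: int, k: int = 3) -> int:
    """1x1 reduce, k x k, 1x1 expand."""
    return conv_params(m, n, 1) + conv_params(n, n, k) + conv_params(n, m, 1)


def resnext_bottleneck(m: int, paths: int, width: int, k: int = 3) -> int:
    """Same block with the middle layer split into `paths` groups."""
    mid = paths * width
    return conv_params(m, mid, 1) + conv_params(mid, mid, k, groups=paths) + conv_params(mid, m, 1)


def separable_bottleneck(m: int, paths: int, width: int, nodes: int, k: int = 3) -> int:
    """Assumed separable variant: outer 1x1 layers grouped over `nodes` duplicated inputs."""
    mid = paths * width
    first = conv_params(nodes * m, mid, 1, groups=nodes)
    middle = conv_params(mid, mid, k, groups=paths)
    last = conv_params(mid, nodes * m, 1, groups=nodes)
    return first + middle + last


resnet = resnet_bottleneck(256, 64)
resnext = resnext_bottleneck(256, paths=32, width=4)
separable = separable_bottleneck(256, paths=32, width=4, nodes=4)
print(resnet, resnext, separable)
```

Under these assumptions the ResNeXt and separable blocks have identical weight counts, so any growth in model size has to come from the per-group batch normalization and downsampling layers.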
The number of groups is tied to the number of compute nodes and gives some flexibility in deployment. For example, if we set it to 4, we expect the model to be used in a cluster of four devices. The model can also be deployed on fewer than four compute nodes: we can put all four parts on one node, or put two parts on each of two compute nodes. These deployment scenarios all achieve the same accuracy.
3.2 Neural Architecture Search for the best communication policy
After we separate the model, its parts are detached from each other. If every part transmitted its information to every other part, the amount of data transmission would be large and would cause heavy overhead. We use neural architecture search (NAS) to deal with this problem: with it, we keep only the transmissions that are critical to the model's performance.
The search method contains two components: a separable network and a controller network. We use a computational graph to represent a neural network model: the nodes are the convolutional layers of the model, and the edges denote the data flow. From this perspective, the separable network contains the nodes of the graph and the controller network controls the edges. Since the edges determine the transmissions, every decision sampled from the controller in a step determines a transmission scheme. The controller is used only in the training stage. After obtaining the best model from the controller, we no longer use it in the inference stage.
Since the models are encoded as sequences and a recurrent neural network (RNN) is well suited to sequence data, we construct the controller network as an RNN. In the first step, the RNN controller uses a default input and a default hidden state to generate an output that represents the probability of every possible communication. We sample a decision in this step according to these probabilities. The decision and the current hidden state then become the input and hidden state of the next step. We repeat the process until all decisions have been made. Furthermore, we expect every compute node to handle only one sending process and one receiving process, so the destinations of the data to be sent must not be duplicated. Under this premise, the choices at each step are the one-to-one assignments of senders to receivers, so their number is the factorial of the number of compute nodes (24 for four nodes, as listed in Table 3).
The training process can be divided into three stages: (1) training the separable network, (2) training the controller network, and (3) sampling the best model from the controller and fine-tuning it. Fig. 2 shows an overview of these three stages. We use different sets of data to train the different networks in the different stages. In the following, we describe the details in order.

Training the separable network
To train the separable network, we repeat the following steps. First, we fix the parameters of the controller network and sample decisions from the policy output by the controller to determine a model, which uses the parameters of the separable network. The expected loss can be estimated by the Monte Carlo method. According to the expected loss, the gradients can be computed as
(9)
Finally, the separable network is updated with the stochastic gradient descent algorithm.
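For orientation, the Monte Carlo estimate used at the corresponding step of ENAS (Pham et al., 2018) has the form below. The notation is an assumption borrowed from that work rather than recovered from Equation 9: $\omega$ denotes the separable network parameters, $\theta$ the controller parameters, $\pi(m;\theta)$ the policy, $m_i$ the sampled models, and $S$ the number of samples.

```latex
\nabla_{\omega}\, \mathbb{E}_{m \sim \pi(m;\theta)}\big[\mathcal{L}(m;\omega)\big]
\;\approx\; \frac{1}{S} \sum_{i=1}^{S} \nabla_{\omega}\, \mathcal{L}(m_i;\omega)
```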

Training the controller network
For the controller network, we fix the parameters of the separable network and sample decisions to determine a model. With the sampled model, we run inference on the validation data and take the accuracy as the reward. In this stage, we maximize the expected reward with reinforcement learning: we compute the gradient with the policy gradient method and update the controller parameters.
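The policy gradient update can be illustrated on a toy problem. The sketch below is not the RNN controller described above: it trains a single-step categorical policy with REINFORCE and a moving-average baseline, and the reward function is a hypothetical stand-in for validation accuracy.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce(reward_fn, num_choices, steps=2000, lr=0.1, seed=0):
    """Train a one-step categorical policy with REINFORCE and a moving-average baseline."""
    rng = random.Random(seed)
    logits = [0.0] * num_choices
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        choice = rng.choices(range(num_choices), weights=probs)[0]
        reward = reward_fn(choice)
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward
        # d log pi(choice) / d logit_j = 1[j == choice] - probs[j]
        for j in range(num_choices):
            grad = (1.0 if j == choice else 0.0) - probs[j]
            logits[j] += lr * advantage * grad
    return softmax(logits)


def toy_reward(choice):
    """Hypothetical reward: pretend choice 3 yields the best validation accuracy."""
    return 0.9 if choice == 3 else 0.5


final_probs = reinforce(toy_reward, num_choices=6)
print([round(p, 3) for p in final_probs])
```

The baseline reduces the variance of the update; once it settles between the two reward values, every sampled choice pushes probability mass toward the higher-reward decision.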
Sampling the best model from the controller network
After training the separable network and the controller network, we sample several models in the same way as in the training stage, measure their performance with both networks, and select the best models. In every epoch we keep the best weights and decisions of the separable network model, and we fine-tune with a small learning rate to obtain the final result.
3.3 Transmission overhead reduction
To let computation and transmission proceed at the same time, we use a staleness factor to control the tolerated delay between sending and receiving. The staleness factor is defined as the ratio of the number of block computations to the number of block transmissions. By default, the staleness factor is 1, which means one transmission per block computation. Under this scheme, compute nodes do not wait for the data to arrive during transmission but keep computing. After the transmission is done, the compute nodes aggregate their current result with the data they received. In the optimal scenario, the transmission time therefore fully overlaps with the computation time. If we set the staleness factor to 2, there is on average one transmission for every two computed blocks, which saves half of the transmissions. Fig. 3 illustrates the concept of the staleness factor: with a larger staleness factor, we can perform more block computations and save transmissions at the same time.
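The effect of the staleness factor on the number of transmissions can be sketched as a simple schedule. The function names and the 18-block network are illustrative assumptions; the sketch assumes a node transmits after every group of consecutive blocks whose size equals the staleness factor.

```python
def transmission_blocks(num_blocks: int, staleness: int) -> list:
    """Indices of the blocks after which a node transmits."""
    if staleness < 1:
        raise ValueError("staleness must be at least 1")
    return [b for b in range(num_blocks) if (b + 1) % staleness == 0]


def transmission_savings(num_blocks: int, staleness: int) -> float:
    """Share of transmissions saved relative to a staleness factor of 1."""
    return 1.0 - len(transmission_blocks(num_blocks, staleness)) / num_blocks


every_block = transmission_blocks(18, staleness=1)
every_other = transmission_blocks(18, staleness=2)
print(len(every_block), len(every_other), transmission_savings(18, 2))
```

For the 18-block example, a staleness factor of 2 halves the number of transmissions from 18 to 9.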
Moreover, we add a sparsification decision to control how much data is sent at every transmission. We set $K$ levels between 50% and 100% to control the percentage of feature maps to be sent. The amount of data to be sent is then given by the following equation:
(10) 
At the transmission step, the compute node only needs to send the leading channels of the features, up to the share selected by the sparsification level. Additionally, casting the data type from Float32 to Float16 before transmission also helps to reduce the data size.
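The sparsification levels and the resulting payload size can be sketched as follows. The even spacing between 50% and 100% is read off Table 4; whether Equation 10 has exactly this form is an assumption, and the function names and feature-map shape are hypothetical.

```python
def sparsity_levels(num_levels: int, low: float = 0.5, high: float = 1.0) -> list:
    """Evenly spaced shares of feature maps to send, from `low` to `high`."""
    if num_levels < 2:
        raise ValueError("need at least two levels")
    step = (high - low) / (num_levels - 1)
    return [low + k * step for k in range(num_levels)]


def payload_bytes(channels: int, height: int, width: int, level: float,
                  half_precision: bool = False) -> int:
    """Bytes sent when only the leading `level` share of channels is transmitted."""
    kept = int(channels * level)
    bytes_per_value = 2 if half_precision else 4  # Float16 versus Float32
    return kept * height * width * bytes_per_value


levels = sparsity_levels(9)
full = payload_bytes(64, 8, 8, level=1.0)
reduced = payload_bytes(64, 8, 8, level=0.75, half_precision=True)
print(levels, full, reduced)
```

With nine levels this reproduces the percentages of Table 4, and in the example shape, sending 75% of the channels in half precision shrinks the payload from 16,384 to 6,144 bytes.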
4 Experiment
4.1 Dataset
CIFAR-10 and CIFAR-100 are datasets that contain 10 and 100 classes of images, respectively. Each has 50,000 training images and 10,000 testing images. Every image has a size of 32×32 with three RGB channels. These datasets are more general than MNIST, so we run all experiments on them. Furthermore, in the training stage we preprocess the images with AutoAugment (Cubuk et al., 2018), a set of data augmentation policies found by reinforcement learning. The operators in AutoAugment include shearing, rotation, contrast, brightness, sharpness, etc., and the policies are combinations of operators. 10% of the training images are chosen randomly as validation data.
4.2 Implementation details
First, we use the stochastic gradient descent (SGD) optimizer to train all the original models from scratch for 200 epochs. The learning rate is initially set to 0.1 and is divided by 10 every 50 epochs. The batch size is fixed at 128. For SNN, we construct an RNN controller with 100 LSTM cells. In the first stage, we train the separable network on the entire training dataset. In the second stage, we train the controller network on the validation dataset for 500 steps. In the third stage, we fix all the parameters and sample 100 architectures. Every architecture runs inference on a mini-batch of the testing dataset, and the one with the highest accuracy is kept. These three stages are repeated iteratively 60 times. Finally, we fine-tune the best model with a learning rate of 0.0001 for 45 epochs.
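The learning rate schedule for the original models can be written as a small function. This is a minimal sketch with a hypothetical function name, assuming 0-indexed epochs.

```python
def step_decay_lr(epoch: int, base_lr: float = 0.1, drop_every: int = 50,
                  factor: float = 0.1) -> float:
    """Learning rate at a 0-indexed epoch: divided by 10 every 50 epochs."""
    return base_lr * factor ** (epoch // drop_every)


# Learning rate at the first epoch of each 50-epoch phase of the 200-epoch run.
schedule = [step_decay_lr(e) for e in (0, 50, 100, 150)]
print(schedule)
```

Over the 200 epochs, the learning rate steps through roughly 0.1, 0.01, 0.001, and 0.0001.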
Table 2: Test accuracy of the original and separable models.
Method  CIFAR-10  CIFAR-100
ResNeXt-56 (4×16d)  95.76%  79.30%
SepResNeXt-56 (4×16d, )  95.53%  79.10%
ResNeXt-110 (8×16d)  96.23%  81.11%
SepResNeXt-110 (8×16d, )  96.32%  81.98%
ResNeXt-56 (64×4d)  96.30%  81.35%
SepResNeXt-56 (64×4d, )  96.05%  81.68%
Table 3: Communication decisions and their corresponding choice IDs.
Choice ID  Decisions  Choice ID  Decisions
0  0, 1, 2, 3  12  2, 0, 1, 3 
1  0, 1, 3, 2  13  2, 0, 3, 1 
2  0, 2, 1, 3  14  2, 1, 0, 3 
3  0, 2, 3, 1  15  2, 1, 3, 0 
4  0, 3, 1, 2  16  2, 3, 0, 1 
5  0, 3, 2, 1  17  2, 3, 1, 0 
6  1, 0, 2, 3  18  3, 0, 1, 2 
7  1, 0, 3, 2  19  3, 0, 2, 1 
8  1, 2, 0, 3  20  3, 1, 0, 2 
9  1, 2, 3, 0  21  3, 1, 2, 0 
10  1, 3, 0, 2  22  3, 2, 0, 1 
11  1, 3, 2, 0  23  3, 2, 1, 0 
Table 4: Sparsification decisions and their corresponding choice IDs.
Choice ID  Sparsity
0  50.00% 
1  56.25% 
2  62.50% 
3  68.75% 
4  75.00% 
5  81.25% 
6  87.50% 
7  93.75% 
8  100.00% 
4.3 Searching for high-performance models
We construct an RNN controller with 100 LSTM cells to learn the policy for sampling communication decisions. With four compute nodes, there are 24 possible choices in every transmission step; Table 3 shows all decisions and their corresponding IDs. The searching spaces for separable ResNeXt-56 () and ResNeXt-110 () are and , respectively. Furthermore, Table 4 lists all the sparsification decisions for nine sparsity levels, which apply when we additionally reduce the amount of transmitted data. The best decisions learned by the controller under different settings are listed in Table 6, and the comparisons with the original models are shown in Table 2 and Table 5. We notice that a well-trained controller tends to sample communication-intensive decisions, which suggests that the connections between separated parts are important. However, some decisions that are not communication-intensive are still sampled. These decisions help reduce the transmission load because at least one node does not need to transmit data.
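The choice IDs of Table 3 coincide with the lexicographic ordering of the permutations of four nodes, which can be reproduced with the standard library. In the sketch below, reading position `i` of a decision as the receiver of node `i` is an assumption, as are the function names.

```python
from itertools import permutations


def communication_choices(num_nodes: int) -> list:
    """All one-to-one sender-to-receiver assignments, in lexicographic order."""
    return list(permutations(range(num_nodes)))


def transmitting_nodes(decision) -> int:
    """Number of nodes whose assigned receiver is a different node (assumed reading)."""
    return sum(1 for sender, receiver in enumerate(decision) if sender != receiver)


choices = communication_choices(4)
print(len(choices), choices[12], transmitting_nodes(choices[12]))
```

Under this reading, choice 0 involves no transmission at all, while a choice such as 9 makes every node send to another node.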
4.4 Reduction of transmission data
To further reduce the amount of transmitted data, we add the sparsification decisions described previously to control the sparsity of the data. The number of sparsity levels K is set to 9, so there are 9 levels between 50% and 100%, which we denote by 0 to 8. We also examine the impact of casting full-precision floating-point values to half precision for transmission. The performance is shown in Table 5. We find that these techniques cause only a small accuracy drop, while the total amount of transmitted data is reduced to only 14.43% of that of the original model.
Table 5: Accuracy and communication costs.
Method  Acc.  Comm. costs
ResNeXt-56 (4×16d) w/ ring all-reduce  79.30%  100.00%
79.10%  31.48%
78.70%  27.99%
78.56%  14.43%
Table 6: Best decisions learned by the controller.
Method  Decision
ResNeXt-56 (4×16d)
SepResNeXt-56  [12, 12, 7, 9, 10, 23, 9, 16, 18]
4.5 The benefit of the controller in neural architecture search
In this section, we examine whether the controller really learns to sample better policies. We repeat the experiment of the previous section with the RNN controller replaced by a random sampler. To reduce the influence of how well the separable network is trained, we also design a control group that initializes the separable network with the pretrained weights from the run with the controller. The training loss and average testing accuracy are shown in Fig. 4. The run with the controller clearly outperforms the runs without a controller, which shows that the controller helps to reach better decisions.
4.6 Deployment analysis
To deploy our model on edge clusters, we further specify the relation between our settings and the specifications of the devices. With an appropriate staleness setting, we can theoretically mitigate the transmission overhead. That is, for a given separable model, the computation and the amount of transmitted data are known, so the upper bound of the ratio can be determined. Fig. 5 shows the system requirements for deploying separable ResNeXt-56 (4×16d, ) under different staleness settings. Under the same computing power, the staleness factor must be set larger if the transmission is much slower. The area under the line indicates that the specifications of the devices can match the staleness setting. For example, we evaluate our target device, the Raspberry Pi 3B+, with its computing power and 300 Mbps transmission speed. The result indicates that this specification is sufficient to deploy our model. For the implementation in a real scenario, we use 4 devices with quad-core ARM A57 processors to deploy the separable ResNeXt-56 (4×16d, ) model. The time measurements of the different components are shown in Fig. 6. Although there is some overhead from memory copies, the first transmission, and the aggregation of feature maps, we speed up inference by about 3×.
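The relation between device specifications and the staleness setting can be sketched as a feasibility check. The model here is an assumption: one transmission is taken to be hidden if it finishes within the time needed to compute as many blocks as the staleness factor. The function name and all numbers except the 300 Mbps link speed are hypothetical.

```python
import math


def min_staleness(block_flops: float, device_flops: float,
                  payload_bits: float, bandwidth_bps: float) -> int:
    """Smallest integer staleness for which transmission fits inside computation."""
    if min(block_flops, device_flops, payload_bits, bandwidth_bps) <= 0:
        raise ValueError("all quantities must be positive")
    compute_time = block_flops / device_flops      # seconds per block
    transmit_time = payload_bits / bandwidth_bps   # seconds per transmission
    return max(1, math.ceil(transmit_time / compute_time))


# Hypothetical block cost, device speed, and payload; 300 Mbps link as in the text.
slow_link = min_staleness(block_flops=2e7, device_flops=1e9,
                          payload_bits=9e6, bandwidth_bps=300e6)
fast_link = min_staleness(block_flops=2e7, device_flops=1e9,
                          payload_bits=9e6, bandwidth_bps=1e9)
print(slow_link, fast_link)
```

In this toy setting, the slower link needs a staleness factor of 2 to hide the transmission, while the faster link can work with the default of 1.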
5 Conclusion
We propose a new approach to parallelizing machine learning models that enables deployment on edge clusters and efficient inference. Unlike traditional parallelism methods, we address transmission overhead by reducing transmission costs through the architecture design. With our approach, the inference latency and the model size on each device can be decreased greatly. The communication overhead can be balanced by setting the proposed staleness factor. We also apply neural architecture search to find the best-performing models and further reduce transmission. Overall, our work provides a way to aggregate the computing power of edge devices in a cluster, so that large models can be deployed without a significant performance drop. In future research, our work can be combined with model compression techniques to further decrease latency. Support for heterogeneous networks should also be considered to fit real-life scenarios more precisely.
Acknowledgement
This work is supported in part by the Ministry of Science and Technology, Taiwan, under grant no. 109-2221-E-002-145-MY2.
References
Ben-Nun and Hoefler (2019). Demystifying parallel and distributed deep learning: an in-depth concurrency analysis. ACM Comput. Surv. 52 (4).
Cubuk et al. (2018). AutoAugment: learning augmentation policies from data. arXiv preprint arXiv:1805.09501.
Dean et al. (2012). Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231.
Han et al. (2015). Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
Howard et al. (2017). MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
Iandola et al. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360.
Krizhevsky et al. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105.
Li et al. (2018). Edge intelligence: on-demand deep learning model co-inference with device-edge synergy. In Proceedings of the 2018 Workshop on Mobile Edge Communications, pp. 31–36.
Mao et al. (2017). A survey on mobile edge computing: the communication perspective. IEEE Communications Surveys & Tutorials 19 (4), pp. 2322–2358.
Pham et al. (2018). Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268.
Russakovsky et al. (2015). ImageNet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252.
Szegedy et al. (2017). Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, S. P. Singh and S. Markovitch (Eds.), pp. 4278–4284.
Szegedy et al. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9.
Szegedy et al. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826.
Teerapittayanon et al. (2017). Distributed deep neural networks over the cloud, the edge and end devices. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pp. 328–339.
Xie et al. (2020). Self-training with noisy student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10687–10698.
Xie et al. (2017). Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500.
Zhao et al. (2018). DeepThings: distributed adaptive deep learning inference on resource-constrained IoT edge clusters. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37 (11), pp. 2348–2359.
Zoph and Le (2016). Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.