Recent advances in deep neural networks (DNNs) have contributed to the state-of-the-art performance in various artificial intelligence (AI)-based applications such as image classification[1, 2], object detection , speech recognition 5], and so forth. Consequently, mobile and internet of things (IoT) devices are increasingly relying on theses DNNs to improve their performance in such AI-based applications. However, the storage and computation requirements of most of the state-of-the-art deep models limit the fully deployment of the inference network on mobile devices. Therefore, as the most common way for deployment of most of the DNN-based applications on mobile devices, the input data of DNN is sent to cloud servers, and the computations associated with the inference network are performed fully on the cloud side .
One of the downsides of the cloud-only approach is the fact that it requires the mobile edge devices to send considerable amounts of data, which can be images, audio, and video, over the wireless network to the cloud. This leads to notable latency and energy overheads on the mobile device. Furthermore, in a scenario where a large number of mobile devices send a vast amount of simultaneous bit streams to the cloud server, the imposed computation workload of simultaneously executing numerous deep models could become a bottleneck on the cloud server.
Recently, inspired by the progress in the computation power and energy efficiency of mobile devices, there has been a body of research studies investigating the strategy of pushing a portion of the workload from cloud servers to mobile edge devices, where both the mobile and cloud execute the inference network collaboratively. As a result, a concept named collaborative intelligence has been introduced [7, 8, 9, 10, 11, 12, 6]
. In collaborative intelligence, the deep network is split at an intermediate layer between the mobile and cloud. In other words, instead of sending the original raw data from the mobile device to the cloud and executing the inference network fully on the cloud side, the computations associated with the initial layers are performed on the mobile side. Then, the computed feature tensor of the last assigned layer on the mobile side could be sent to the cloud for executing the remained computation layers of the inference network. By allocating a portion of the inference network to the mobile side, the imposed workload on the cloud reduces, where this results in the increased throughput on the cloud. Furthermore, in some deep models which are based on convolutional neural networks (CNNs), e.g. AlexNet, the feature data volume generally shrinks as we go deeper in the model, and it might become even less than the model input size after a number of layers [9, 7, 6]. Therefore, by computing a few layers on the mobile, the amount of data needed to be sent to the cloud in the collaborative intelligence framework can decrease compared to the cloud-only approach. This can lead to reduced energy and latency overheads on the mobile side.
According to a recent study done in  for different hardware and wireless connectivity configurations, the optimal operating point for the inference network in terms of latency and/or energy consumption is associated with dividing the network between the mobile and cloud, and not the common cloud-only approach, or the mobile-only approach (in case the deep model is able to be executed fully on the mobile device). The optimal point of split depends on the computational and data characterization of DNN layers and is usually at a point deep in the inference network. The approach  has extended the work of  and included model training and additional network architectures. The network is again split between the mobile and cloud, but the data can flow in both ways in order to optimize the efficiency of both the inference and training. In summary, in the research works studying the collaborative intelligence framework, a given deep network is split between the mobile device and the cloud without any modification to the network architecture itself [7, 8, 9, 10, 11, 12, 6].
In this paper, we investigate the problem of altering a given deep network architecture before the partitioning of it between the mobile and cloud. For this purpose, on the mobile side, we introduce a reduction unit right before uploading the feature tensor to the cloud. The reduction unit is stacked to the end of the computation layers assigned to the mobile side. The computation associated with the reduction unit is also done on the mobile side. The purpose of this unit is reducing the feature data volume needed to be sent to the cloud via the wireless network to a greater extent, since the latency and energy overheads associated with the wireless upload link in state-of-the-art approaches for collaborative intelligence still contribute to the major portion of the energy consumption of the inference network on the mobile side and end-to-end latency . Accordingly, on the cloud side, we introduce a restoration unit which is stacked before the computation layers assigned to the cloud. Both the reduction and restoration units use a convolutional layer as their main component. The dimension of the convolution layers used in reduction and restoration units are set in a way so that the dimension of the input tensor of the reduction unit is equal to the dimension of the output tensor of the restoration unit. We refer to the combination of reduction and restoration units as the butterfly unit (see Fig. 1). The new network architecture, including the introduced butterfly unit after a selected layer of the underlying deep model, is trained end-to-end, while in other works which have considered compression for reducing the feature data volume needed to be sent to the cloud, they have added non-learnable compression techniques (e.g. JPEG) to an already trained model [12, 6, 11].
The rest of the paper is structured as follows: Section II elaborates more on the details of the proposed butterfly unit and the proposed DNN partitioning algorithm. Section III provides the obtained improvements in terms of end-to-end latency and the mobile energy consumption. It also discusses the flexibility of network partitioning point based on the load level of the cloud and mobile, and the wireless network conditions. Finally, Section IV concludes the paper.
Ii Proposed Method
Ii-a Butterfly Unit
The butterfly unit, as shown in Fig. 2, consists of two components: 1) the reduction unit, and 2) the restoration unit. The input to the reduction unit is a tensor of size (, , , ). A convolution filter of size (, , , ) is applied to the input, producing a tensor of size (, , , ) as the output of the reduction unit. The output tensor of the reduction unit is the shrunk representation of its input along the channel axis ( ), and it is the tensor which is uploaded to the cloud server. On the cloud side, in the restoration unit, by applying a convolution filter of size (, , , ), we restore the dimension of the original input of the butterfly unit to proceed the rest of the inference. The butterfly unit is placed after one of the layers in a DNN. The intuition behind decreasing the tensor dimension along the channel axis in the reduction unit is the fact that typically each channel preserves the visual structure of the input. Therefore, we can expect this non-expensive 11 convolution can keep enough information of the feature data. In addition, depending on the architecture of the underlying deep model, the feature tensor size varies layer by layer, typically increasing in channel sizes. Therefore, as we go deeper in the model, more channels would be required in the output of the reduction unit, , for maintaining the accuracy of the model.
From the perspective of the mobile device, the location of the butterfly unit is desired to be closer to the input layer so that the mobile device computes fewer layers. However, from the perspective of the cloud server, we want to push more computations towards the mobile device in order to reduce the data center workloads. Particularly, when the cloud server and/or the wireless network are congested, pushing computations towards the mobile device is advantageous. As a result, there is a trade-off in choosing the location of the butterfly unit in the inference network.
Ii-B Partitioning Algorithm
The proposed algorithm, for choosing the location of the butterfly unit and the proper value of , comprises three main steps: 1) Training, 2) Profiling, and 3) Selection. In the training phase, we train models, where each model is associated with placing the butterfly unit after a different arbitrary layer among the total layers of the inference network (). For each model, via linear search, we find and choose the minimum for the butterfly unit that reaches a pre-defined acceptable accuracy. During the profiling phase, for each of models, we measure the latency values corresponding to the computation of layers assigned to the mobile side, the reduction unit, the wireless up-link of the shrunk feature data to the cloud, the restoration unit, and the computations of layers assigned to the cloud side. Furthermore, for energy consumption, we measure the values associated with the computation of layers assigned to the mobile side, the reduction unit, and the wireless up-link of the shrunk feature data. These measurements vary for each of models, and the current load level of the mobile, cloud, or wireless network conditions. In the end, in the selection phase, depending on whether the target is minimizing the end-to-end latency or the mobile energy consumption, we select the best partitioning among available options.
The full procedure for choosing the location of the butterfly unit and the proper value of is shown in Algorithm 1.
Iii-a Experimental Setup
We evaluate our proposed method on NVIDIA Jetson TX2 board, which is equipped with NVIDIA Pascal™ GPU which fairly represents the computing power of mobile devices. Our server platform is equipped with NVIDIA Geforce® GTX 1080 Ti GPU, which has almost 30x more computing power compared to our mobile platform. The detailed specifications of our mobile and server platforms are presented in Table I and Table II, respectively. We measure the GPU power consumption on our mobile platform using INA226 power monitoring sensor with a sampling rate of 500 kHz. For our wireless network settings, the average upload speed of different wireless networks, 3G, 4G, and Wi-Fi, in the U.S. are used in our experiments [15, 16]. We use the transmission power models of 
for wireless networks, which have estimation errors less than 6%. The power level for up-link is, where is the up-link throughput, and and are regression coefficients of power models. The values for our power model parameters are presented in Table III.
We prototype the proposed method by implementing the inference networks, both for the mobile device and cloud server, using NVIDIA TensorRT™ , which is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and run-time that delivers low latency and high-throughput for deep learning inference applications. TensorRT is equipped with cuDNN, a GPU-accelerated library of primitives for DNNs. TensorRT supports three precision modes for creating the inference graph: FP32 (single precision), FP16 (half precision), and INT8. However, our mobile device does not support INT8 operations on its GPU. Therefore, we use FP16 mode for creating the inference graph from the trained model graph, where for training itself single precision mode (32-bit) is used. As demonstrated in 
, 8-bit quantization would be enough for even challenging tasks like ImageNet classification. Therefore, we quantize FP16 data types to 8 bits only for uploading the feature tensor to the cloud. We implement our client-server interface using Thrift , an open source flexible RPC interface for inter-process communication. To allow for flexibility in the dynamic selection of partition points, both the mobile and cloud host all possible partitioned models. For each of models, the mobile and cloud store only their assigned layers. At run-time, depending on the load of the mobile and cloud, wireless network conditions, and the optimization goal (minimizing for latency or energy), only one of partitioned models is selected. Given a partition decision, execution begins on the mobile device and cascades through the layers of the DNN leading up to that partition point. Upon completion of that layer and the reduction unit, the mobile sends the output of the reduction unit from the mobile device to the cloud. Cloud server then executes the computations associated with the restoration unit and remaining DNN layers. Upon the completion of the execution of the last DNN layer on the cloud, the final result is sent back to the mobile device.
We evaluate the proposed method on one of the promising and mostly used DNN architectures, ResNet . DNNs are hard to train because of the notorious vanishing/exploding gradient issue, which hampers the convergence of the model. As a result, as the network goes deeper, its performance gets saturated or even starts degrading rapidly . The core idea of ResNet is introducing a so-called “identity shortcut connection” that skips one or more layers. The output of a residual block (RB) with identity mapping will be instead of traditional . The argument behind ResNet’s good performance is that stacking layers should not degrade the network performance, because we could simply stack identity mappings (layers that do nothing, i.e.,
) upon the current model, and the resulting architecture would perform the same. It indicates that the deeper model should not produce a training error higher than its shallower counterparts. They hypothesize that letting the stacked layers fit a residual mapping is easier than letting them directly fit the desired underlying mapping. If the dimensions change, there are two cases: 1) increasing dimensions: The shortcut still performs identity mapping, with extra zero entries padded with the increased dimension. 2) decreasing dimensions: A projection shortcut is used to match the dimensions ofand using the formula of , as shown in Fig. 6.
ResNet architecture comes with flexible number of layers (e.g. 34, 50, 101, etc.). In our experiments, we use ResNet-50. There are 16 residual blocks in ResNet-50 . Using Algorithm 1, we obtain 16 models where each model is corresponding to placing the butterfly unit after one of 16 residual blocks. The detailed architecture, and the data size of layer outputs of ResNet-50 are demonstrated in Fig. 4, and Fig. 5, respectively. As indicated in Fig. 5, the size of intermediate feature tensors in ResNet-50 are larger than the input size up until RB14, which is relatively deep in the model. Therefore, merely splitting this network between the mobile and cloud for collaborative intelligence may not perform better than the cloud-only approach in terms of latency and mobile energy consumption, since a large portion of the workload is pushed toward the mobile.
We evaluate the proposed method on miniImageNet  dataset, a subset of ImageNet dataset, which includes 100 classes and 600 examples per each class. We use 85% of whole dataset examples as the training set and the rest as the test set. We randomly crop a 224
224 region out of each sample for data augmentation and train each of the models for 90 epochs.
|System||NVIDIA Jetson TX2 Developer Kit|
|GPU||NVIDIA Pascal™, 256 CUDA cores|
|CPU||HMP Dual Denver + Quad ARM® A57/2 MB L2|
|Memory||8 GB 128 bit LPDDR4 59.7 GB/s|
|GPU||NVIDIA Geforce® GTX 1080 Ti, 12GB GDDR5|
|CPU||Intel® Xeon® CPU E7- 8837 @ 2.67GHz|
|Memory||64 GB DDR4|
Iii-B Latency and Energy Improvements
The accuracy of ResNet-50 model for miniImageNet dataset without the butterfly unit is 76%. We refer to this accuracy as the target accuracy. The accuracy results of the proposed method by placing the butterfly unit after each of the 16 residual blocks are demonstrated in Fig. 7. According to Fig. 7, as we increase the number of channels in the reduction unit, accuracy improves but larger feature tensors are needed to be transferred to the cloud. By assuming an acceptable error of 2% compared to the target accuracy, placing the butterfly unit after residual blocks 1-3, 4-7, 8-13, and 14-16, requires the of 1, 2, 5, and 10, in order to maintain the accuracy of the proposed method at or above 74% (less than 2% accuracy loss), respectively.
|Offloaded Data (KB)||3.1||3.1||3.1||1.6||1.6||1.6||1.6||1.0||1.0||1.0||1.0||1.0||1.0||0.5||0.5||0.5|
|Latency 3G (ms)||23.7||24.7||25.6||15.0||15.9||16.8||17.7||14.3||15.4||16.2||17.1||17.9||18.8||16.1||17.1||17.9|
|Energy 3G (mJ)||21.6||22.4||23.3||13.7||14.4||15.4||16.2||13.1||13.9||14.7||15.5||16.4||17.2||14.8||15.7||16.6|
|Latency 4G (ms)||5.2||6.1||6.9||5.8||6.7||7.6||8.5||8.6||9.6||10.5||11.2||12.1||13.1||13.1||14.2||15.1|
|Energy 4G (mJ)||9.8||11.6||13.2||10.9||12.7||14.3||15.9||12.6||13.1||14.3||15.2||16.3||17.0||14.4||16.8||17.2|
|Latency Wi-Fi (ms)||2.4||3.3||4.1||4.3||5.2||6.1||7.0||7.7||8.6||9.4||10.7||11.1||12.2||12.9||13.8||14.7|
|Energy Wi-Fi (mJ)||4.8||6.8||8.7||9.1||11.2||13.1||14.9||12.1||12.7||13.9||14.8||15.5||16.3||14.1||16.1||16.6|
|Setup||Latency (ms)||Energy (mJ)||Butterfly Unit Location||Offloaded Data (B)||Accuracy|
Table IV presents the latency and mobile energy consumption associated with placing the butterfly unit with proper size (with the accuracy loss less than 2%) after each residual block, for different wireless networks when there is no congestion in the mobile, cloud, and wireless network. Table V shows the selected partition points by our algorithm for the goal of minimum end-to-end latency and mobile energy consumption, while the acceptable 2% accuracy loss is reached, across three different wireless configurations (3G, 4G, and Wi-Fi) and when there is no congestion on the mobile, cloud, and wireless network (These selected partitions are also highlighted in Table IV). Note that the best partitioning for the goal of minimum end-to-end latency is the same as the best partitioning for the goal of minimum mobile energy consumption in each wireless network settings. This is mainly due to the fact that end-to-end latency and mobile energy consumption are proportional to each other since the dominant portion of both of them are associated with the wireless transfer overheads of the intermediate feature tensor. Obtained results for cloud-only and mobile-only approaches are also provided in Table V.
Latency Improvement - As demonstrated in Table V, using our proposed method, the end-to-end latency achieves 77, 40, 41 improvements over the cloud-only approach in 3G, 4G, and Wi-Fi networks, respectively.
Energy Improvement - As demonstrated in Table V, using our proposed method, the mobile energy consumption achieves 80, 54, and 71 improvements over the cloud-only approach in 3G, 4G, and Wi-Fi networks, respectively.
In the case of 4G and Wi-Fi, the mobile device is only required to compute DNN layers upon the completion of the RB1 and the reduction unit. In the case of 3G, the mobile device should compute DNN layers upon the completion of the RB8 and the reduction unit.
Iii-C Server Load Variations
Data centers typically experience fluctuating load patterns. High server utilization leads to increased service times for DNN queries. Using our proposed method, by training multiple DNNs split on different layers and storing corresponding partitions in the mobile and cloud, the best model can be selected at run-time by the mobile, based on the current server load level, by periodically pinging the server during the mobile idle period. This leads to avoiding long latencies of DNN queries caused by high user demands. If the server is congested, we can move the partition point into deeper layers which this pushes more of the workload towards the mobile device. As a result, the computation load of the mobile device will increase. In summary, depending on the server load, the partition point can be changed while preserving the accuracy and still offloading less data than raw input.
Consequently, this new compute paradigm not only reduces the end-to-end latency and mobile energy consumption but also reduces the workload required on the data center, leading to the shorter query service time and higher query throughput.
Iii-D Comparison to Other Feature Compression Techniques
In the collaborative intelligence works which have considered the compression of intermediate features before uploading them to the cloud, the obtained compression ratios are significantly less compared to our work. For instance, as reported in , the maximum achieved compression ratio is reported as 3.3. However, with the proposed trainable butterfly unit, we achieve up to 256 compression ratio when the butterfly unit is placed after RB1 (in which the channel size is reduced from 256 to 1). This shows that in collaborative intelligence framework, the compression using the proposed learnable butterfly unit can significantly perform better than traditional compressors.
Iv Conclusion and Future Work
As the core component of today’s intelligent services, DNNs have been traditionally executed in the cloud. Recent studies have shown that the latency and energy consumption of DNNs in mobile applications can be considerably reduced using collaborative intelligence framework, where the inference network is divided between the mobile and cloud and intermediate features computed on the mobile device are offloaded to the cloud instead of the raw input data of the network, reducing the size of the data needed to be sent to the cloud. With these insights, in this work, we develop a new partitioning scheme that creates a bottleneck in a neural network using the proposed butterfly unit, which alleviates the communication costs of feature transfer between the mobile and cloud to a greater extent. It can adapt to any DNN architecture, hardware platform, wireless connections, and mobile and server load levels, and selects the best partition point for the minimum end-to-end latency and/or mobile energy consumption at run-time. The new network architecture, including the introduced butterfly unit after a selected layer of the underlying deep model, is trained end-to-end. Our proposed method, across different wireless networks, achieves on average 53 improvements for end-to-end latency and 68 improvements for mobile energy consumption compared to the status quo cloud-only approach for ResNet-50, while the accuracy loss is less than 2%.
As a future work, the extent of reduction in the feature data size which is transferred between the mobile and cloud can be explored. Furthermore, the efficacy of the proposed method can be investigated for different DNN architectures and mobile/server load variations. Additionally, collaborative intelligence frameworks by considering the advent of revolutionary 5G technology can be studied.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015. [Online]. Available: http://arxiv.org/abs/1512.03385
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based convolutional networks for accurate object detection and segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 1, pp. 142–158, 2016.
-  G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal processing magazine, vol. 29, no. 6, pp. 82–97, 2012.
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” inAdvances in neural information processing systems, 2013, pp. 3111–3119.
-  H. Choi and I. V. Bajic, “Deep feature compression for collaborative object detection,” arXiv preprint arXiv:1802.03931, 2018.
-  A. E. Eshratifar, M. S. Abrishami, and M. Pedram, “Jointdnn: an efficient training and inference engine for intelligent mobile cloud computing services,” arXiv preprint arXiv:1801.08618, 2018.
-  A. E. Eshratifar and M. Pedram, “Energy and performance efficient computation offloading for deep neural networks in a mobile cloud computing environment,” pp. 111–116, 2018. [Online]. Available: http://doi.acm.org/10.1145/3194554.3194565
-  Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” ACM SIGPLAN Notices, vol. 52, no. 4, pp. 615–629, 2017.
-  P. M. Grulich and F. Nawab, “Collaborative edge and cloud neural networks for real-time video processing,” Proceedings of the VLDB Endowment, vol. 11, no. 12, pp. 2046–2049, 2018.
-  Z. Chen, W. Lin, S. Wang, L. Duan, and A. C. Kot, “Intermediate deep feature compression: the next battlefield of intelligent sensing,” arXiv preprint arXiv:1809.06196, 2018.
-  H. Choi and I. V. Bajic, “Near-lossless deep feature compression for collaborative intelligence,” arXiv preprint arXiv:1804.09963, 2018.
-  “Jetson TX2 Module,” https://developer.nvidia.com/embedded/buy/jetson-tx2, 2018.
-  “INA Current/Power Monitor,” http://www.ti.com/product/INA226.
-  “State of Mobile Networks in USA,” https://opensignal.com/reports/2017/08/usa/state-of-the-mobile-network, 2017.
-  “United States Speedtest Market Report,” http://www.speedtest.net/reports/united-states/, 2017.
-  J. Huang, F. Qian, A. Gerber, Z. M. Mao, S. Sen, and O. Spatscheck, “A close examination of performance and power characteristics of 4g lte networks,” in Proceedings of the 10th International Conference on Mobile Systems, Applications, and Services, ser. MobiSys ’12. New York, NY, USA: ACM, 2012, pp. 225–238.
-  “NVIDIA TensorRT,” https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/index.html, 2018.
-  S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, “cudnn: Efficient primitives for deep learning,” CoRR, vol. abs/1410.0759, 2014. [Online]. Available: http://arxiv.org/abs/1410.0759
-  S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding,” CoRR, vol. abs/1510.00149, 2015. [Online]. Available: http://arxiv.org/abs/1510.00149
-  J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in
-  M. Slee, A. Agarwal, and M. Kwiatkowski, “Thrift: Scalable cross-language services implementation,” Facebook White Paper, vol. 5, 2007. [Online]. Available: http://thrift.apache.org/static/files/thrift-20070401.pdf
X. Glorot and Y. Bengio, “Understanding the difficulty of training deep
feedforward neural networks,” in Proceedings of the Thirteenth
International Conference on Artificial Intelligence and Statistics
, ser. Proceedings of Machine Learning Research, Y. W. Teh and M. Titterington, Eds., vol. 9. Chia Laguna Resort, Sardinia, Italy: PMLR, 13–15 May 2010, pp. 249–256. [Online]. Available:http://proceedings.mlr.press/v9/glorot10a.html
-  O. Vinyals, C. Blundell, T. P. Lillicrap, K. Kavukcuoglu, and D. Wierstra, “Matching networks for one shot learning,” CoRR, vol. abs/1606.04080, 2016. [Online]. Available: http://arxiv.org/abs/1606.04080