JointDNN: An Efficient Training and Inference Engine for Intelligent Mobile Cloud Computing Services

Deep neural networks are among the most influential architectures of deep learning algorithms and are deployed in many intelligent mobile applications. End-side services, such as intelligent personal assistants (IPAs), autonomous cars, and smart home services, often employ either simple local models or complex remote models on the cloud. Mobile-only and cloud-only computations are currently the status quo approaches. In this paper, we propose an efficient, adaptive, and practical engine, JointDNN, for collaborative computation between a mobile device and the cloud for DNNs in both the inference and training phases. JointDNN not only provides an energy- and performance-efficient method of querying DNNs for the mobile side, but also benefits the cloud server by reducing its workload and communication compared to the cloud-only approach. Given the DNN architecture, we investigate the efficiency of processing some layers on the mobile device and some layers on the cloud server. We provide optimization formulations at layer granularity for forward and backward propagation in DNNs, which can adapt to mobile battery limitations, cloud server load constraints, and quality of service. JointDNN achieves up to 18X and 32X reductions in the latency and mobile energy consumption of querying DNNs, respectively.


1. Introduction

Deep Neural Network (DNN) architectures are promising solutions in achieving remarkable results in a wide range of machine learning applications, including, but not limited to computer vision, speech recognition, language modeling and autonomous cars.

Currently, there is a major growing trend in introducing more advanced DNN architectures and employing them in end-user applications. The considerable improvements in DNNs are usually achieved by increasing complexity, which requires more computational resources for training and inference. Recent research directions to make this progress sustainable are: development of Graphics Processing Units (GPUs) as the vital hardware component of both servers and mobile devices (Oh and Jung, 2004), design of efficient algorithms for large-scale distributed training (Dean et al., 2012) and efficient inference (Razlighi et al., 2017), compression and approximation of models (Sze et al., 2017), and, most recently, collaborative computation of cloud and fog, also known as dew computing (Skala et al., 2015).

Using cloud servers for computation and storage is becoming increasingly favorable due to technical advancements and improved accessibility. Scalability, low cost, and satisfactory Quality of Service (QoS) have made offloading to the cloud the typical choice for computing-intensive tasks. On the other hand, mobile devices are being equipped with more powerful general-purpose CPUs and GPUs. Very recently, hardware companies have started to design dedicated chips to better tackle machine-learning tasks. For example, Apple’s A11 Bionic chip (Newsroom, 2017) used in iPhone X uses a neural engine in its GPU to speed up DNN queries of applications such as face identification and facial motion capture (Li et al., 2013).

There are currently two methods for DNN inference: mobile only and cloud only. In simple models, a mobile device is responsible for performing all of the computation. In case of complex models, the raw input data (image, video stream, voice, etc.) is uploaded and then computed on the cloud. The results of the task are later downloaded to the device.

Despite the improvements in mobile devices mentioned earlier, their computational power is still considered significantly weaker than that of cloud servers. Therefore, the mobile-only approach can cause large inference latency and failure to meet QoS. Moreover, embedded devices face major energy consumption constraints due to battery capacity limits. On the other hand, the cloud-only approach suffers from the communication overhead of uploading the raw data and downloading the outputs. Moreover, slowdowns caused by service congestion, subscription costs, and network dependency should be considered as downsides of this approach.

The superiority and persistent improvement of DNNs depend heavily on the availability of huge amounts of training data. Typically, this data is collected from different sources and later fed into the network for training. The final model can then be delivered to different devices for inference. However, a growing number of applications require adaptive learning in online environments, such as self-driving cars and security drones (Pan et al., 2017)(Nazemi et al., 2018). Model parameters in these smart devices are constantly changing based on their continuous interaction with their surroundings. The complexity of these architectures, with their large number of parameters, combined with current cloud-only methods for DNN training, implies a constant communication cost and the burden of increased power consumption for the mobile device.

Automatic partitioning of computationally intensive tasks over the cloud for optimization of performance and energy consumption has already been well studied (Chun et al., 2011). Most recently, scalable distributed hierarchy structures between end-user device, edge, and cloud have been suggested (Teerapittayanon et al., 2017) which are specialized for DNN applications. However, exploiting the layer granularity of DNN architectures for run-time partitioning has not been studied thoroughly yet.

Figure 1. Different computation partitioning methods. (a) Mobile only: computation is done entirely on the mobile device. (b) Cloud only: raw input data is sent to the cloud, computation is done on the cloud, and results are sent back to the mobile device. (c) JointDNN: the DNN architecture is partitioned at the granularity of layers; each layer can be computed either on the cloud or on the mobile.

In this work, we investigate inference and training of DNNs in a joint platform of mobile and cloud as an alternative to the current single-platform methods, as illustrated in Figure 1. Considering DNN architectures as an ordered sequence of layers, and the possibility of computing every layer either on the mobile or on the cloud, we can model the DNN structure as a directed acyclic graph (DAG). The parameters of our real-time adaptive model depend on the following factors: mobile/cloud hardware and software resources, battery capacity, network specifications, and QoS. Based on this modeling, we show that the problem of finding the optimal computation schedule for different scenarios, i.e., best performance or energy consumption, can be reduced to the shortest path problem, which is solvable in polynomial time.

To present realistic results, we ran experiments on real hardware for both the mobile device and the cloud. To model the communication between platforms, we used different network technologies and the most recent reports on their specifications in the U.S.

DNN architectures can be categorized based on functionality. These differences enforce a specific type and order of layers in the architecture, directly affecting the partitioning result in the collaborative method. For discriminative models, used in recognition applications, the layer size gradually decreases from input toward output (Figure 2). This sequence suggests computing the first few layers on the mobile device to avoid the excessive communication cost of uploading large raw input data. On the other hand, the growth of the layer size from input to output in generative models, used for synthesizing new data, implies the possibility of uploading a small input to the cloud and later downloading and computing the last layers on the mobile device for better efficiency. Interesting mobile applications like image-to-image translation are implemented with autoencoder architectures, which usually consist of middle layers with smaller sizes compared to input and output. Consequently, we expect the first and last layers to be computed on the mobile device in our collaborative approach. We examined eight well-known DNN benchmarks selected from these categories to illustrate their differences in the collaborative computation approach.

As we will see in Section 4, the communication between the mobile and cloud is the main bottleneck for both performance and energy in the collaborative approach. We investigated the specific characteristics of CNN layer outputs and introduced a lossless compression method to reduce the communication costs.

State-of-the-art work on collaborative computation of DNNs (Kang et al., 2017) only considers one offloading point, assigning the computation of the layers before it to the mobile platform and the layers after it to the cloud platform. We show that this approach is non-generic and fails to be optimal, and we introduce a new method that allows each layer to be computed on either platform independently of the other layers. Our evaluations show that JointDNN improves latency and energy by up to 18X and 32X, respectively, compared to the status-quo single-platform approaches without any compression. The main contributions of this paper can be listed as:

  • Introducing a novel model for collaborative computation between the mobile and cloud

  • Formulating the problem of optimal computation scheduling of DNNs at layer granularity in the mobile cloud computing environment as a shortest path problem and as integer linear programming (ILP)

  • Examining compressibility of DNN layers and developing a lossless compression method to improve communication costs

  • Demonstrating the significant improvements of performance, mobile energy consumption, and cloud workload achieved by using JointDNN

Figure 2. Typical layer size architecture of (a) Discriminative (b) Autoencoder (c) Generative models.

2. Problem definition and modeling

In this section, we explain the general architecture of DNN layers and our profiling method. Moreover, we elaborate on how the cost optimization can be reduced to a shortest path problem by introducing the JointDNN graph model. Finally, we show how the constrained problem is formulated by setting up ILP.

2.1. DNN Building Blocks

DNNs are networks composed of several layers stacked on top of each other. We briefly explain the functionality of each layer type used in the state-of-the-art architectures:

Convolution Layer (conv) consists of a set of filters with dimensions relatively smaller than their input. Each filter traverses the entire input with a predefined step size and computes the dot product between its parameters and the corresponding part of the input. This process creates different feature maps (referred to as channels) for different filters from the same input data. This aspect of preserving the locality of input features has made Convolutional Neural Network (CNN) architectures the workhorse of the state-of-the-art image classification models. Because conv is based on dot products, it can be formulated as General Matrix Multiplication (GEMM) and can therefore gain performance from parallel computing devices (e.g. GPUs).

Fully Connected Layer (fc) is the main component of most regular neural networks, in which every neuron is connected to all neurons of the previous layer. This fully pairwise connection architecture accounts for a large portion of the computation of the whole network. Like conv, the fc layer is also formulated as GEMM.

Pooling Layer (pool) performs a non-linear down-sampling function over non-overlapping, spatially local parts of the input. Max-pooling is the most common function used in this type of layer, alongside other functions such as average or L2-norm pooling.

Activation Layer increases the non-linearity of neural network architectures. This layer applies a non-linear activation function to each data point of the input to generate an output of the same size. Among various non-linear functions, such as sigmoid and hyperbolic tangent, the Rectified Linear Unit (ReLU) is currently the favored choice in DNN architectures, as it is simple and speeds up the tedious training process (Glorot et al., 2011).

Local Response Normalization (lrn) performs local normalization by imposing a local competition for big activities between adjacent features in a channel, and also between features at the same spatial location in different channels. lrn is inspired by inhibition schemes observed in the brain and is intended to help generalization. Several formulations of lrn have been suggested; as shown in (Krizhevsky et al., 2012; Jarrett et al., 2009), they may lead to slight improvements.

Dropout Layer (drop) As mentioned earlier, fc holds most of the parameters of DNN models and is thus vulnerable to overfitting. Typically, regularization methods are used to prevent overfitting by reducing the dependency of the network on individual neurons during training. In the dropout technique (Srivastava et al., 2014), at each training iteration every neuron can be removed (dropped out) from the network with a predetermined probability $1-p$ or kept with probability $p$, and the training is done on the remaining network. The dropped-out nodes keep their previous weights for the next training iteration.

Deconvolution Layer (deconv), also known as transposed convolution, is mostly used in generative and autoencoder models in applications such as building high-resolution pictures from low-resolution pictures and high-level descriptions. The goal in deconvolution is to find $f$ in a convolution equation of the form $f * g = h$. In the case of DNNs, $g$ is the filter and $f$ is the input of the convolution (Zeiler et al., 2010).

Long Short-Term Memory Layer (lstm) is a building unit for layers of a recurrent neural network (RNN) and is widely used due to its promising results in speech recognition applications. A typical LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate, which are responsible for remembering and forgetting specific values over arbitrary time intervals. The whole LSTM unit can be thought of as a typical artificial neuron, as in a feed-forward neural network.

Softmax (soft) is the last layer in multi-class architectures, usually connected in a one-to-one correspondence to an fc layer. Softmax establishes a probability distribution by representing each class probability with a single neuron.
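As a small illustration of how softmax turns the raw scores of the preceding fc layer into one probability per class, the following is a minimal sketch in plain Python. The function name and the example scores are hypothetical and are not taken from the paper's implementation.

```python
import math

def softmax(scores):
    """Map a list of raw class scores to a probability distribution."""
    # Subtracting the maximum score avoids overflow and does not change the result.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three hypothetical class scores produced by the last fc layer.
probs = softmax([2.0, 1.0, 0.1])
```

Each output neuron corresponds to one class, the outputs are non-negative, and they sum to one.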

2.2. Energy and Latency Profiling

There are three methods for measuring the latency and energy consumption of each layer in neural networks:

Statistical Modeling:

In this method, a regression model over the configurable parameters of operators (e.g. filter size in convolution) can be used to estimate the associated latency and energy. This method is prone to large error because of the inter-layer optimizations performed by DNN software packages. Therefore, it is necessary to consider execution of several consecutive operators grouped with each other during profiling. Many of these software packages are proprietary, making access to inter-layer optimization techniques impossible.

To illustrate this issue, we designed two experiments with 25 consecutive convolutions on an NVIDIA Pascal GPU using the cuDNN® library (Chetlur et al., 2014). In the first experiment, we measure the latency of each convolution operator separately and take the total latency to be their sum. In the second experiment, we group the convolutions together and measure the total latency. All parameters are located in the GPU’s memory in both experiments, avoiding any data transfer from the main memory, so that the results represent the actual computation latency.

As we see in Figure 3, there is a large gap between the separated and grouped execution experiments, which grows as the number of convolutions increases. This observation confirms that we need to profile grouped operators to obtain more accurate estimates. Considering the various consecutive combinations of operators and different input sizes, this method requires a very large number of measurements, not to mention the need for a complex regression model.

Figure 3. Latency of grouped and separated execution of convolution operator.

Analytical Modeling: Deriving an analytical estimate of the latency and energy consumption requires the exact hardware and software specifications. However, the state-of-the-art work in latency modeling of DNNs (Qi et al., 2017) fails to estimate layer-level delay within an acceptable error bound, for instance, underestimating the latency of a fully connected layer with 4096 neurons by around 900%. Industrial developers do not reveal the detailed hardware architecture specifications or the internals of proprietary parallel computing architectures such as CUDA®; therefore, the analytical approach can be quite challenging (Hong and Kim, 2010).

Application-specific Profiling: In this method, the DNN architecture of the application being used is profiled at run time. The number of applications on a mobile device that use neural networks is generally limited. Consequently, this method is more feasible and promises more accurate estimates. We have chosen this method for estimating energies and latencies in the experiments of this paper.
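The idea of profiling grouped layers rather than single operators can be sketched as follows. This is only an illustration under stated assumptions: `profile_grouped_layers` and the toy layers are hypothetical, plain Python callables stand in for real DNN layers, and only wall-clock latency is measured (energy would need a power sensor).

```python
import time

def profile_grouped_layers(layers, x, repeats=3):
    """Return {(i, j): seconds} for running layers i..j (1-indexed) back to back."""
    n = len(layers)
    # inputs[k] holds the output of layer k; inputs[0] is the raw input.
    inputs = [x]
    for layer in layers:
        inputs.append(layer(inputs[-1]))
    timings = {}
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            best = float("inf")
            for _ in range(repeats):
                data = inputs[i - 1]
                start = time.perf_counter()
                for layer in layers[i - 1:j]:
                    data = layer(data)
                best = min(best, time.perf_counter() - start)
            timings[(i, j)] = best  # keep the fastest of the repeated runs
    return timings

# Toy stand-ins for real layers (hypothetical).
toy_layers = [
    lambda v: [2 * a for a in v],
    lambda v: [a + 1 for a in v],
    lambda v: [max(a, 0) for a in v],
]
timings = profile_grouped_layers(toy_layers, list(range(1000)))
```

For three layers this produces six measurements, one per consecutive group, which is the kind of table a grouped-layer scheduler would consume.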

2.3. JointDNN Graph Model

First, we assume that a DNN is represented by a sequence of distinct layers with a linear topology, as depicted in Figure 4. Layers are executed sequentially, with the output data generated by one layer feeding into the input of the next one. We denote the input and output data sizes of the $k$-th layer as $\alpha_k$ and $\beta_k$, respectively. Denoting the latency (energy) of layer $k$ as $\omega_k$, where $k = 1, 2, \dots, n$, the total latency (energy) of querying the DNN is $\sum_{k=1}^{n} \omega_k$.

The mobile cloud computing optimal scheduling problem can be reduced to a shortest path problem, from node $S$ to node $F$, in the graph of Figure 5. The Mobile Execution cost of the $k$-th layer is the cost of executing the $k$-th layer on the mobile while the cloud server is idle. The Cloud Execution cost of the $k$-th layer is the cost of executing the $k$-th layer on the cloud server while the mobile is idle. The Uploading the Input Data cost of the $k$-th layer is the cost of uploading the output data of the $(k-1)$-th layer to the cloud server. The Downloading the Input Data cost of the $k$-th layer is the cost of downloading the output data of the $(k-1)$-th layer to the mobile. The costs can refer to either latency or energy. However, as we showed in Section 2.2, the assumption of linear topology in DNNs does not hold, and we need to consider all consecutive groupings of the layers in the network. This suggests replacing the linear topology with a tournament graph, as depicted in Figure 6. We define the parameters of this new graph, the JointDNN graph model, in Table 1.

Figure 4. Computation model in linear topology.
Figure 5. Graph representation of mobile cloud computing optimal scheduling problem for linear topology.
Param. Description of Cost
Executing layers $i$ to $j$ on the cloud
Executing layers $i$ to $j$ on the mobile
All the following edges:
All the following edges:
All the following edges:
All the following edges:
All the following edges:
All the following edges:
Uploading the input of the first layer
Table 1. Parameter Definition of Graph Model
Figure 6. JointDNN graph model.

In this graph, node $C_{i:j}$ represents that layers $i$ to $j$ are computed on the cloud server, while node $M_{i:j}$ represents that layers $i$ to $j$ are computed on the mobile device. An edge between two adjacent nodes in the JointDNN graph model is associated with one of four possible cases: 1) a transition from the mobile to the mobile, which only includes the mobile computation cost; 2) a transition from the cloud to the cloud, which only includes the cloud computation cost; 3) a transition from the mobile to the cloud, which includes the mobile computation cost and the cost of uploading the inputs of the next node; 4) a transition from the cloud to the mobile, which includes the cloud computation cost and the cost of downloading the inputs of the next node. Under this formulation, we can transform the computation scheduling problem into finding the shortest path from $S$ to $F$.
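The four edge cases above can be sketched as a shortest-path computation over grouped-layer nodes. The code below is a minimal illustration, not the authors' implementation: the function name, the dictionary-based cost tables, and all numeric values are hypothetical, and the graph is traversed by dynamic programming, which is equivalent to a shortest-path search on a DAG.

```python
def joint_schedule(n, mobile, cloud, upload, download):
    """Shortest path over grouped-layer nodes.

    mobile[(i, j)] / cloud[(i, j)]: cost of running layers i..j on that platform.
    upload[k] / download[k]: cost of moving the output of layer k (k = 0 is the raw input).
    Returns (total_cost, [(platform, i, j), ...]).
    """
    exec_cost = {"mobile": mobile, "cloud": cloud}
    # best[(p, j)] = (cost, path) after layers 1..j, with the last group on platform p.
    best = {("mobile", 0): (0.0, [])}  # start node: the input lives on the mobile
    for j in range(1, n + 1):
        for p in ("mobile", "cloud"):
            candidates = []
            for i in range(1, j + 1):
                for q in ("mobile", "cloud"):
                    if (q, i - 1) not in best:
                        continue
                    cost, path = best[(q, i - 1)]
                    if q == "mobile" and p == "cloud":
                        cost += upload[i - 1]      # case 3: mobile -> cloud
                    elif q == "cloud" and p == "mobile":
                        cost += download[i - 1]    # case 4: cloud -> mobile
                    candidates.append((cost + exec_cost[p][(i, j)], path + [(p, i, j)]))
            best[(p, j)] = min(candidates, key=lambda c: c[0])
    # Finish node: the result must end up on the mobile.
    on_mobile = best[("mobile", n)]
    cost, path = best[("cloud", n)]
    on_cloud = (cost + download[n], path)
    return min(on_mobile, on_cloud, key=lambda c: c[0])

def grouped(per_layer):
    # Hypothetical profile: running layers together is 10% cheaper than running them apart.
    table = {}
    n = len(per_layer)
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            total = sum(per_layer[i - 1:j])
            table[(i, j)] = total if i == j else 0.9 * total
    return table

transfer = {0: 10.0, 1: 1.0, 2: 1.0, 3: 1.0}  # large raw input, small intermediate outputs
cost, schedule = joint_schedule(
    3, grouped([2.0, 5.0, 5.0]), grouped([0.5, 0.5, 0.5]), transfer, transfer
)
```

With these made-up numbers the cheapest path runs the first layer on the mobile and the remaining two layers as one group on the cloud, which is cheaper than either single-platform schedule.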

Residual networks are a class of powerful and easy-to-train architectures of DNNs (He et al., 2015).

In residual networks, as depicted in Figure 7 (a), the output of one layer is fed into another layer with distance of at least two. Thus, we need to keep track of the source layer (node in Figure 7) so as to know that this layer is computed on the mobile or the cloud.

Our standard graph model has a memory of one which is the very previous layer. We provide a method to transform the computation graph of this type of network to our standard model, JointDNN graph.

In this regard, we add two additional chains of size , where is the number of nodes in the residual block ( in Figure 7). One chain represents the case of computing layer on the mobile and the other one represents the case of computing layer on the cloud. In Figure 7, we have only shown the weights that need to be modified, where and are the cost of downloading and uploading the output of layer , respectively.

By solving the shortest path problem in the JointDNN graph model, we can obtain the optimal scheduling of inference in DNNs. Online training consists of one inference step and one back-propagation step. The total number of layers is denoted by $n$ consistently throughout this paper, so there are $2n$ layers for modeling training, where the second $n$ layers are the mirrored version of the first $n$ layers, and their associated operations are the gradients of the error function with respect to the DNN’s weights. The main difference between the mobile cloud computing graph of inference and that of online training is the need to update the model by downloading the new weights from the cloud. We assume that the cloud server performs the whole back-propagation step separately, even if it is scheduled to be done on the mobile; therefore, to save mobile energy, the mobile device does not need to upload the weights that it updates itself. The modification to the JointDNN graph model is the addition of the cost of downloading the weights of the layers that are updated in the cloud.

The shortest path problem can be solved efficiently in polynomial time.

However, the shortest path problem subject to constraints has been shown to be NP-Complete (Wang and Crowcroft, 1996). For instance, assume our standard graph is constructed for energy and we need to find the shortest path subject to the constraint that the total latency of that path is less than a time deadline (QoS). Although there is an approximation algorithm for this problem, the “LARAC” algorithm (Juttner et al., 2001), our application does not need to solve this optimization problem frequently; therefore, we aim to obtain the optimal solution. We can build a small look-up table of optimization results for different sets of parameters (e.g. network bandwidth, cloud server load, etc.). We provide the ILP formulations of DNN partitioning in the following sections.
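To make the constrained problem concrete, the sketch below minimizes mobile energy subject to a latency deadline by exhaustive search over per-layer assignments. It is deliberately neither LARAC nor the ILP developed in the following sections; it assumes a linear topology without layer grouping, is only practical for small networks, and all names and numbers are hypothetical.

```python
from itertools import product

def evaluate(assignment, exec_cost, upload, download):
    """Total cost of a per-layer assignment; data starts and must end on the mobile."""
    total, location = 0.0, "mobile"
    for k, platform in enumerate(assignment, start=1):
        if location == "mobile" and platform == "cloud":
            total += upload[k - 1]      # send the output of layer k-1 to the cloud
        elif location == "cloud" and platform == "mobile":
            total += download[k - 1]    # fetch the output of layer k-1 from the cloud
        total += exec_cost[platform][k - 1]
        location = platform
    if location == "cloud":
        total += download[len(assignment)]
    return total

def min_energy_under_deadline(n, latency, energy, deadline):
    """Search all 2^n assignments; returns (energy, latency, assignment) or None."""
    best = None
    for assignment in product(("mobile", "cloud"), repeat=n):
        t = evaluate(assignment, *latency)
        e = evaluate(assignment, *energy)
        if t <= deadline and (best is None or e < best[0]):
            best = (e, t, assignment)
    return best

# Hypothetical profiles: (execution cost per platform, upload costs, download costs).
latency = ({"mobile": [4.0, 4.0, 4.0], "cloud": [1.0, 1.0, 1.0]},
           [8.0, 1.0, 1.0, 1.0], [8.0, 1.0, 1.0, 1.0])
energy = ({"mobile": [3.0, 3.0, 3.0], "cloud": [0.0, 0.0, 0.0]},
          [2.0, 2.0, 2.0, 2.0], [3.0, 1.0, 1.0, 1.0])

relaxed = min_energy_under_deadline(3, latency, energy, deadline=20.0)
tight = min_energy_under_deadline(3, latency, energy, deadline=10.0)
```

In this toy setting the relaxed deadline selects the cloud-only schedule, while the tighter deadline forces the first layer onto the mobile at a higher energy cost, which shows how a QoS constraint can change the optimal path.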

Figure 7. (a) A residual building block (b) Transformation of a residual building block into shortest path problem.

2.4. ILP Setup

2.4.1. Performance Efficient Computation Offloading ILP Setup for Inference

We formulate the scheduling of inference in DNNs as an ILP with a tractable number of variables. In our method, we first profile the delay and energy consumption of consecutive layers of size $k \in \{1, 2, \dots, n\}$. Thus, we will have $n + (n-1) + \dots + 1 = n(n+1)/2$ different profiling values for delay and energy. Considering layers $i$ through $j$ to be computed either on the mobile device or on the cloud server, we assign two binary variables, $m_{i,j}$ and $c_{i,j}$, respectively. Download and upload communication delays need to be added to the execution time when switching from cloud to mobile and from mobile to cloud, respectively.


$T^{\text{mobile}}_{i,j}$ and $T^{\text{cloud}}_{i,j}$ represent the execution time of the $i$-th through the $j$-th layers on the mobile and cloud, respectively. $T^{D}_{i}$ and $T^{U}_{i}$ represent the latency of downloading and uploading the output of the $i$-th layer, respectively. Considering each set of consecutive layers, whenever $m_{i,j}$ and one of $\{c_{j+1,k}\}_{k=j+1}^{n}$ are equal to one, the output of the $j$-th layer is uploaded to the cloud. The same argument applies to downloading. We also note that the last two terms in Eq. 3 represent the condition in which the last layer is computed on the cloud and we need to download the output to the mobile device, and the condition in which the first layer is computed on the cloud and we need to upload the input to the cloud, respectively. To support residual architectures, we need to add a pair of download and upload terms similar to the first two terms in Eq. 3 for the starting and ending layers of each residual block. In order to guarantee that all layers are computed exactly once, we need to add the following set of constraints:


Because of the non-linearity of multiplication, an additional step is needed to transform Eq. 3 to the standard form of ILP. We define two sets of new variables:


with the following constraints:


The first two constraints ensure that $u_{i,j}$ will be zero if either $m_{i,j}$ or $\sum_{l=j+1}^{n} c_{j+1,l}$ is zero. The third inequality guarantees that $u_{i,j}$ will take the value one if both binary quantities, $m_{i,j}$ and $\sum_{l=j+1}^{n} c_{j+1,l}$, are set to one. The same reasoning works for $d_{i,j}$. In summary, the total number of variables in our ILP formulation grows quadratically with $n$, the total number of layers in the network.
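One formulation that is consistent with the description in this subsection can be written as follows. It is a sketch under stated assumptions rather than the paper's exact equations: all symbol names ($m_{i,j}$, $c_{i,j}$, $u_{i,j}$, $d_{i,j}$, $T^{U}_{j}$, $T^{D}_{j}$, and $T^{U}_{0}$ for uploading the raw input) are chosen here for illustration, and the products of binary variables are replaced by auxiliary variables using the standard linearization of a product of two binary quantities.

```latex
\begin{aligned}
\min \quad
 & \sum_{i=1}^{n}\sum_{j=i}^{n}
   \left( m_{i,j}\,T^{\text{mobile}}_{i,j} + c_{i,j}\,T^{\text{cloud}}_{i,j} \right)
   + \sum_{i=1}^{n}\sum_{j=i}^{n-1}
   \left( u_{i,j}\,T^{U}_{j} + d_{i,j}\,T^{D}_{j} \right) \\
 & + \Big(\sum_{i=1}^{n} c_{i,n}\Big)\,T^{D}_{n}
   + \Big(\sum_{j=1}^{n} c_{1,j}\Big)\,T^{U}_{0} \\
\text{s.t.} \quad
 & \sum_{i=1}^{k}\sum_{j=k}^{n} \left( m_{i,j} + c_{i,j} \right) = 1
   \quad \forall k \in \{1,\dots,n\} \\
 & u_{i,j} \le m_{i,j}, \qquad
   u_{i,j} \le \sum_{l=j+1}^{n} c_{j+1,l}, \qquad
   m_{i,j} + \sum_{l=j+1}^{n} c_{j+1,l} - u_{i,j} \le 1 \\
 & d_{i,j} \le c_{i,j}, \qquad
   d_{i,j} \le \sum_{l=j+1}^{n} m_{j+1,l}, \qquad
   c_{i,j} + \sum_{l=j+1}^{n} m_{j+1,l} - d_{i,j} \le 1 \\
 & m_{i,j},\; c_{i,j},\; u_{i,j},\; d_{i,j} \in \{0, 1\}
\end{aligned}
```

The equality constraint states that every layer $k$ is covered by exactly one group on exactly one platform, which also makes each sum $\sum_{l} c_{j+1,l}$ and $\sum_{l} m_{j+1,l}$ at most one, so the three inequalities behave like the usual linearization of a product of two binary variables.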

2.4.2. Energy Efficient Computation Offloading ILP Setup for Inference

Because of the nature of the application, we only care about the energy consumption on the mobile side. We formulate ILP as follows:


$E^{\text{mobile}}_{i,j}$ and $E^{\text{cloud}}_{i,j}$ represent the amount of energy required to compute the $i$-th through the $j$-th layers on the mobile and cloud, respectively. $E^{D}_{i}$ and $E^{U}_{i}$ represent the energy required to download and upload the output of the $i$-th layer, respectively. Similar to the performance efficient ILP constraints, each layer should be executed exactly once:


The ILP problem can be solved for different sets of parameters (e.g. different upload and download speeds), and the scheduling results can then be stored as a look-up table on the mobile device. Moreover, because the number of variables in this setup is tractable, solving the ILP is quick. For instance, solving the ILP for AlexNet takes around 0.045 seconds on an Intel(R) Core(TM) i7-3770 CPU with MATLAB®’s intlinprog() function using the primal simplex algorithm.

2.4.3. Performance Efficient Computation Offloading ILP Setup for Training

The ILP formulation of the online training phase is very similar to that of inference. In online training we have $2n$ layers instead of $n$, obtained by mirroring the DNN, where the second $n$ layers perform backward propagation. Moreover, we need to download the weights that are updated in the cloud to the mobile. We assume that the cloud server always has the most up-to-date version of the weights and does not require the mobile device to upload the updated weights. The following terms need to be added for the ILP setup of training:


2.4.4. Energy Efficient Computation Offloading ILP Setup for Training


2.4.5. Scenarios

There can be different optimization scenarios defined for ILP as listed below:

  • Performance efficient computation: In this case, it is sufficient to solve the ILP formulation for performance efficient computation offloading.

  • Energy efficient computation: In this case, it is sufficient to solve the ILP formulation for energy efficient computation offloading.

  • Battery budget limitation: In this case, based on the available battery, the operating system can decide to dedicate a specific amount of energy consumption to each application. By adding the following constraint to the performance efficient ILP formulation, our framework would adapt to battery limitations:

  • Cloud limited resources: In the presence of cloud server congestion or limitations on user’s subscription, we can apply execution time constraints to each application to alleviate the server load:

  • QoS: In this scenario, we minimize the required energy consumption while meeting a specified deadline:


    This constraint could be applied to both energy and performance efficient ILP formulations.
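The three constrained scenarios in the list above can be sketched as additional inequalities. The notation is assumed for illustration and is not taken from the source: $E_{\text{total}}$ and $T_{\text{total}}$ denote the energy and latency objectives of the inference ILPs, $c_{i,j}$ and $T^{\text{cloud}}_{i,j}$ are the cloud assignment variable and cloud execution time of layers $i$ through $j$, and $E_{\text{budget}}$, $T_{\text{cloud-max}}$, and $T_{\text{QoS}}$ are hypothetical user- or system-supplied bounds.

```latex
\begin{aligned}
\text{Battery budget:} \quad & E_{\text{total}} \le E_{\text{budget}} \\
\text{Cloud limited resources:} \quad
  & \sum_{i=1}^{n}\sum_{j=i}^{n} c_{i,j}\,T^{\text{cloud}}_{i,j} \le T_{\text{cloud-max}} \\
\text{QoS:} \quad & T_{\text{total}} \le T_{\text{QoS}}
\end{aligned}
```

Because each bound is linear in the decision variables, adding it keeps the problem an ILP; only the objective being minimized changes between the performance-efficient and energy-efficient variants.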

function JointDNN(n, L, D, NB, NP)
Input:  n: number of layers in the DNN
        L_i, i = 1..n: layers in the DNN
        D_i, i = 1..n: data size at each layer
        NB: mobile network bandwidth
        NP: mobile network uplink and downlink power consumption
Output: optimal schedule of the DNN
for i = 1 to n do
      for j = i to n do
            Latency(i, j), Energy(i, j) = ProfileGroupedLayers(i, j)
      end for
end for
G, S, F = ConstructShortestPathGraph(n, L, D, NB, NP)  // S and F are the start and finish nodes and G is the JointDNN graph model
if no constraints then
      schedule = ShortestPath(G, S, F)
else
      if Battery Limited Constraint then
            schedule = PerformanceEfficientILP(n, L, D, NB, NP)
      end if
      if Cloud Server Constraint then
            schedule = PerformanceEfficientILP(n, L, D, NB, NP)
      end if
      if QoS then
            schedule = EnergyEfficientILP(n, L, D, NB, NP)
      end if
end if
return schedule
Algorithm 1 JointDNN engine optimal scheduling of DNNs
Figure 8. Latency and energy improvements for different batch sizes during inference.
Figure 9. Latency and energy improvements for different batch sizes during training.

3. Evaluation

3.1. Deep Architecture Benchmarks

Since the architecture of neural networks depends on the type of the application, we have chosen three common application types of DNNs:

  1. Discriminative neural networks are a class of models in machine learning for modeling the conditional probability distribution $P(y|x)$. This class is generally used in classification and regression tasks. AlexNet (Krizhevsky et al., 2012), OverFeat (Sermanet et al., 2013), VGG16 (Simonyan and Zisserman, 2014), Deep Speech (Hannun et al., 2014), ResNet (He et al., 2015), and NiN (Lin et al., 2013) are well-known discriminative models we use as benchmarks in this experiment. Except for Deep Speech, which is used for speech recognition, all benchmarks are used in image classification tasks.

  2. Generative neural networks model the joint probability distribution $P(x, y)$, allowing generation of new samples. These networks have applications in Computer Vision (Goodfellow et al., 2014) and Robotics (Finn and Levine, 2016), which can be deployed on a mobile device. Chair (Dosovitskiy et al., 2014) is a generative model we use as a benchmark in this work.

  3. Autoencoders are another class of neural networks used to learn a representation for a data set. Their applications include image reconstruction, image-to-image translation, and denoising, to name a few. Mobile robots can be equipped with autoencoders for their computer vision tasks. We use Pix2Pix (Isola et al., 2016) as a benchmark from this class.

Type            Model        Layers
Discriminative  AlexNet      21
                OverFeat     14
                Deep Speech  10
                ResNet       70
                VGG16        37
                NiN          29
Generative      Chair        10
Autoencoder     Pix2Pix      32

Table 2. Benchmark Specifications
Param. 3G 4G Wi-Fi
Download speed (Mbps) 2.0275 13.76 54.97
Upload speed (Mbps) 1.1 5.85 18.88
$\alpha_u$ (mW/Mbps) 868.98 438.39 283.17
$\alpha_d$ (mW/Mbps) 122.12 51.97 137.01
$\beta$ (mW) 817.88 1288.04 132.86
Table 3. Mobile networks specifications in the U.S.

3.2. Mobile and Server Setup

We used the Jetson TX2 module developed by NVIDIA® (Corporation, 2018a), a fair representative of mobile computation power, as our mobile device. This module enables efficient implementation of DNN applications used in products such as robots, drones, and smart cameras. It is equipped with an NVIDIA Pascal® GPU with 256 CUDA cores and 8 GB of 128-bit LPDDR4 memory shared between the GPU and CPU. To measure the power consumption of the mobile platform, we used an INA226 power sensor (Incorporated, 2018).

An NVIDIA® Tesla® K40C (Corporation, 2018b) with 12 GB of memory serves as our server GPU. The computation capability of this device is more than one order of magnitude greater than that of our mobile device.

3.3. Communication Parameters

To model the communication between platforms, we used the average download and upload speed of mobile Internet (, 2017a, b) for different networks (3G, 4G and Wi-Fi) as shown in Table 3.

The communication power for download (P_d) and upload (P_u) depends on the network throughput (t_d and t_u). Comprehensive examinations in (Huang et al., 2012) indicate that uplink and downlink power can be modeled fairly accurately with linear equations (Eq. 21), with an error rate of less than 6%. Table 3 shows the parameter values of this equation for different networks.
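The linear power model can be illustrated with a short Python sketch. It assumes the form power = slope × throughput + intercept, and it assumes that the first slope row of Table 3 is the uplink coefficient, the second the downlink coefficient, and the last row the intercept. The names `alpha_u`, `alpha_d`, `beta`, the helper functions, and the payload size are illustrative and do not come from the paper.

```python
# Table 3 values. The names alpha_u, alpha_d and beta are assumed labels
# for the two slope rows (mW/Mbps) and the intercept row (mW).
NETWORKS = {
    "3G":    {"down_mbps": 2.0275, "up_mbps": 1.1,
              "alpha_u": 868.98, "alpha_d": 122.12, "beta": 817.88},
    "4G":    {"down_mbps": 13.76, "up_mbps": 5.85,
              "alpha_u": 438.39, "alpha_d": 51.97, "beta": 1288.04},
    "Wi-Fi": {"down_mbps": 54.97, "up_mbps": 18.88,
              "alpha_u": 283.17, "alpha_d": 137.01, "beta": 132.86},
}


def link_power_mw(alpha, throughput_mbps, beta):
    """Assumed linear model: power (mW) = alpha * throughput + beta."""
    return alpha * throughput_mbps + beta


def transfer_seconds(size_bytes, throughput_mbps):
    """Time to move size_bytes over a link running at throughput_mbps."""
    return size_bytes * 8 / (throughput_mbps * 1e6)


def upload_energy_mj(network, size_bytes):
    """Mobile energy (mJ) spent uploading size_bytes; mW * s = mJ."""
    p = NETWORKS[network]
    power = link_power_mw(p["alpha_u"], p["up_mbps"], p["beta"])
    return power * transfer_seconds(size_bytes, p["up_mbps"])


# Hypothetical payload: one 224x224x3 tensor stored with 8 bits per value.
payload = 224 * 224 * 3
for name, p in NETWORKS.items():
    t = transfer_seconds(payload, p["up_mbps"])
    e = upload_energy_mj(name, payload)
    print(f"{name}: upload time {t:.3f} s, upload energy {e:.1f} mJ")
```

Under these assumptions, both the transfer time and the energy scale linearly with the payload size, which is why the size of the tensor crossing the link matters so much for the schedule.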


4. Results

The latency and energy improvements of inference and online training with our engine for 8 different benchmarks are shown in Figures 8 and 9, respectively. We considered the better of the mobile-only and cloud-only approaches as our baseline. JointDNN can achieve up to 66% and 86% improvements in latency and energy consumption, respectively, during inference. Communication cost increases linearly with batch size, whereas computation cost grows at a much lower rate, as depicted in Figure 10(b). Therefore, a key observation is that as we increase the batch size, the mobile-only approach becomes preferable.

Figure 10. (a) Latency of one epoch of online training using the JointDNN algorithm vs. percentage of updated weights. (b) Latency of mobile-only inference vs. batch size.

During online training, the large communication overhead of transmitting the updated weights is added to the total cost. Therefore, to avoid downloading this large amount of data, only a few back-propagation steps are computed on the cloud server. We performed a simulation by varying the percentage of updated weights. As the percentage of updated weights increases, the latency and energy consumption become constant, as shown in Figure 10(a). This is because all the back-propagation steps are then performed on the mobile device and the weights are not transferred from the cloud to the mobile. JointDNN can achieve improvements of up to 73% in latency and 56% in energy consumption during training.

Figure 11. Interesting schedules of execution for three types of DNN architectures.

Different scheduling patterns are demonstrated in Figure 11. They represent the optimal solution in the Wi-Fi network when optimizing for latency, and they show how the computations in a DNN are divided between the mobile and the cloud. As can be seen, for discriminative models (e.g. AlexNet), inference follows a mobile-cloud pattern and training follows a mobile-cloud-mobile pattern. The intuition is that the last layers (fc) are computationally intensive but have small data sizes, which incur a low communication cost; therefore, the last layers tend to be computed on the cloud. For generative models (e.g. Chair), the execution schedule of inference is the opposite of that of discriminative networks: the last layers are generally large, and in the optimal solution they are computed on the mobile. Lastly, for autoencoders, where both the input and output data sizes are large, the first and last layers are computed on the mobile.

JointDNN pushes some parts of the computation toward the mobile device, which reduces the workload on the cloud server. As we see in Table 4, we can reduce the cloud server's workload by up to 84%, and by 53% on average. This enables the cloud provider to serve more users while obtaining higher performance and lower energy consumption compared to single-platform approaches.

Optimization Target 3G (%) 4G (%) Wi-Fi (%)
Latency 84 49 12
Energy 73 49 51
Table 4. Workload reduction of the cloud server in different mobile networks
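The maximum and average figures quoted for Table 4 can be checked with a few lines of Python. The sketch assumes that the reported average is the plain mean over all six table entries; the variable names are illustrative.

```python
# Cloud workload reduction (%) copied from Table 4.
reduction = {
    "Latency": {"3G": 84, "4G": 49, "Wi-Fi": 12},
    "Energy":  {"3G": 73, "4G": 49, "Wi-Fi": 51},
}

# Flatten the table and compute the two summary statistics.
values = [v for row in reduction.values() for v in row.values()]
max_reduction = max(values)
mean_reduction = sum(values) / len(values)

print(f"max = {max_reduction}%, mean = {mean_reduction:.0f}%")
```

The table also suggests that the reduction is largest on the slowest network (3G), where sending data to the cloud is most expensive.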

4.1. Communication Dominance

The execution time and energy breakdown for AlexNet, which serves as a representative of the state-of-the-art architectures deployed in cloud servers, is depicted in Figure 12. The cloud-only approach is dominated by communication costs. As demonstrated in Figure 12, 99%, 93% and 81% of the total execution time is used for communication in the case of 3G, 4G, and Wi-Fi, respectively. This relative portion also applies to energy consumption. Comparing the latency and energy of the communication to those of the mobile-only approach, we notice that the mobile-only approach for AlexNet is better than the cloud-only approach in all the mobile networks. We apply lossless compression methods in order to reduce the effect of the communication, which will be covered in the next section.

Figure 12. (a) Execution time of AlexNet optimized for performance (b) Mobile energy consumption of AlexNet optimized for energy (c) Data size of the layers in AlexNet and the scheduled computation, where the first nine layers are computed on the mobile and the rest on the cloud, which is the optimal solution w.r.t. both performance and energy.

4.2. Layer Compression

The preliminary results of our experiments show that more than

of the total energy and delay cost in DNNs is caused by communication in the collaborative approach. This cost is directly proportional to the size of the layer being downloaded to or uploaded from the mobile device. Because of the complex feature extraction process of DNNs, some of the intermediate layers are even larger than the network's input data. For example, this ratio can go as high as

in VGG16. To address this bottleneck, we investigated compression of the data before any communication. This process can be applied to different DNN architecture types; however, we only considered CNNs due to their specific characteristics explained later in details.

Figure 13. Layer output after passing the input image through conv, relu and lrn. Channels are preserving the general structure of the input image and large ratio of the output data is black (zero) due to existence of relu. Tiling is used to put all 96 channels together.

CNN architectures are mostly used for image and video recognition applications. Because conv layers preserve spatial locality, we can assume that the outputs of the first convolution layers follow the same structure as the input image, as shown in Figure 13. Moreover, a large fraction of the layer outputs is expected to be zero due to the presence of the relu layer. Our observations show that the ratio of neurons equal to zero (ZR) varies from 50% to 90% after relu in CNNs. These two characteristics, layers being similar to the input image and a large proportion of their data being a single value, suggest that we can apply existing image compression techniques to their output.
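A minimal sketch of how the zero ratio (ZR) of a layer output could be measured is given below. The activations are synthetic Gaussian values, used here only as an assumption for illustration, so the resulting ZR is close to 50% and says nothing about any real network. The function names are hypothetical.

```python
import random


def relu(x):
    """Rectified linear unit: negative inputs become zero."""
    return max(0.0, x)


def zero_ratio(values):
    """Fraction of entries that are exactly zero (ZR)."""
    values = list(values)
    return sum(1 for v in values if v == 0) / len(values)


random.seed(0)
# Synthetic pre-activation values, assumed zero-mean for illustration only.
pre_activation = [random.gauss(0.0, 1.0) for _ in range(10_000)]
post_activation = [relu(v) for v in pre_activation]

zr = zero_ratio(post_activation)
print(f"ZR after relu: {zr:.1%}")
```

With zero-mean inputs about half of the values are clipped to zero; a negative bias or sparse features would push ZR higher.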

There are two general categories of compression techniques, lossy and lossless (Cover and Thomas, 2006). With lossless techniques, the original information can be reconstructed completely. In contrast, lossy techniques use approximations, and the original data cannot be reconstructed exactly. In our experiments, we examined the impact of compression using PNG, a lossless technique based on encoding frequent sequences in an image.

Even though DNN parameters in typical implementations are stored as 32-bit floating-point values, most image formats are based on 3-byte RGB color triples. Therefore, to compress a layer in the same way as a 2D picture, the floating-point data should be quantized into 8-bit fixed-point values. Recent studies show that representing the parameters of DNNs with only 4 bits affects the accuracy by no more than 1% (Sze et al., 2017). In this work, we implemented our architectures with 8-bit fixed-point values and presented our baseline without any compression. The layers of a CNN contain numerous channels of 2D matrices, each similar to an image. A simple method is to compress each channel separately. In addition to the extra overhead of a file header for each channel, this method does not take full advantage of the frequent-sequence encoding of PNG. An alternative is to place the channels side by side, referred to as tiling, to form one large 2D matrix representing the layer, as shown in Figure 13. It should be noted that 1D fc layers are very small, and we did not apply compression to them.
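The quantization and tiling steps can be sketched as follows. This is an illustrative sketch, not the implementation used in the paper: the uniform scaling by a known maximum value, the zero padding of incomplete tile rows, and the function names are all assumptions.

```python
def quantize_8bit(channel, max_value):
    """Map non-negative floats in [0, max_value] to integers in [0, 255].

    Values above max_value are clipped to 255, negatives to 0.
    """
    scale = 255.0 / max_value if max_value > 0 else 0.0
    return [[max(0, min(255, int(round(v * scale)))) for v in row]
            for row in channel]


def tile(channels, cols):
    """Place equally sized 2D channels side by side in a grid.

    `cols` channels are placed per tile row; a final incomplete row
    is padded with all-zero channels.
    """
    height, width = len(channels[0]), len(channels[0][0])
    blank = [[0] * width for _ in range(height)]
    tiled = []
    for start in range(0, len(channels), cols):
        group = channels[start:start + cols]
        group = group + [blank] * (cols - len(group))
        for r in range(height):
            # Concatenate row r of every channel in this tile row.
            tiled.append([px for ch in group for px in ch[r]])
    return tiled


# Four tiny 2x2 channels, each filled with its own index.
channels = [[[k, k], [k, k]] for k in range(4)]
for row in tile(channels, cols=2):
    print(row)
```

For a layer with 96 channels, as in Figure 13, one possible layout is `tile(channels, cols=12)`, which gives an 8 by 12 grid of channel images.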

The Compression Ratio (CR) is defined as the ratio of the size of the layer (8-bit) to the size of the compressed 2D matrix in PNG. Looking at the results of compression for two different CNN architectures in Figure 14, we can observe a high correlation between ratio of pixels being zero (ZR) and CR. PNG can compress the layer data up to and by average. These results confirm the effectiveness of the proposed compression method. By replacing the compressed layers output and adding the cost of compression process itself in JointDNN formulations, we achieve an extra and improvements in energy and latency on average, respectively.
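The relationship between ZR and CR can be explored with a small experiment. The sketch below uses DEFLATE from Python's `zlib` module as a stand-in for PNG, which applies DEFLATE after per-scanline filtering; real PNG file sizes will therefore differ. The layer data is synthetic, and the function names and sizes are assumptions.

```python
import random
import zlib


def compression_ratio(tiled):
    """CR = size of the 8-bit layer / size of its compressed form."""
    raw = bytes(px for row in tiled for px in row)
    return len(raw) / len(zlib.compress(raw, 9))


def synthetic_layer(zero_prob, size=64, seed=0):
    """Synthetic 8-bit tiled layer where each pixel is zero with zero_prob."""
    rng = random.Random(seed)
    return [[0 if rng.random() < zero_prob else rng.randint(1, 255)
             for _ in range(size)] for _ in range(size)]


cr_dense = compression_ratio(synthetic_layer(zero_prob=0.5))
cr_sparse = compression_ratio(synthetic_layer(zero_prob=0.9))
print(f"ZR ~50%: CR = {cr_dense:.2f}")
print(f"ZR ~90%: CR = {cr_sparse:.2f}")
```

In this synthetic setting the non-zero pixels are random and therefore hard to compress, so the gain comes almost entirely from the zeros. Real layer outputs also have spatial structure that a PNG encoder can exploit.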

Figure 14. Compression Ratio (CR) and ratio of zero valued neurons (ZR) for different layers of (a) AlexNet and (b) VGG16.

5. Related work and comparison

General Task Offloading Frameworks. Several prior works focus on offloading computation from the mobile to the cloud (Ra et al., 2011; Gordon et al., 2012; Chun et al., 2011; Cuervo et al., 2010; Wang et al., 2012; Zhang et al., 2012). However, all these frameworks share a limiting feature that makes them impractical for computation partitioning of DNN applications.

These frameworks depend on programmer annotations, as they make decisions about pre-specified functions, whereas JointDNN makes scheduling decisions at run-time based on the model topology and mobile network specifications. Offloading at the function level cannot lead to efficient partitioning decisions, because layers of a given type within one architecture can have significantly different computation and data characteristics. For instance, in the optimal solution, a specific convolution layer structure may be computed on the mobile in one model and on the cloud in another.

Neurosurgeon is the only prior work exploring a similar idea of offloading DNN computation between the mobile device and the cloud server at layer granularity. Neurosurgeon assumes that there is only one data transfer point and that the execution schedule of the efficient solution starts on the mobile and then switches to the cloud, which performs all the remaining computations. Our results show that this is not true, especially for online training, where the optimal execution schedule often follows the mobile-cloud-mobile pattern. Moreover, generative and autoencoder models follow a pattern with multiple data transfer points. The execution schedule can also start on the cloud, especially in the case of generative models where the input data size is large. Furthermore, inter-layer optimizations performed by DNN libraries are not considered in Neurosurgeon. In addition, Neurosurgeon only schedules for optimal latency and energy, while JointDNN adapts to different scenarios including battery limitation, cloud server congestion, and QoS. Lastly, Neurosurgeon only targets simple CNN and ANN models, while JointDNN uses a graph-based approach to handle more complex DNN architectures such as ResNet and RNNs.

6. Conclusions

In this paper, we demonstrated that the status-quo approaches, cloud-only or mobile-only, are not optimal with regard to latency and energy. We reduced the problem of partitioning the computations in a DNN to a shortest path problem in a graph. Adding constraints to the shortest path problem makes it NP-Complete; therefore, we also provided ILP formulations to cover different scenarios of limited mobile battery, cloud congestion, and QoS. One can solve this problem beforehand for different sets of parameters (e.g. network bandwidth, cloud server load, etc.) and use a look-up table to avoid the overhead of solving the optimization problem at run-time. The output data size in discriminative networks is typically smaller than that of the other layers in the network; therefore, the last layers are expected to be computed on the cloud, while the first layers are expected to be computed on the mobile. The reverse reasoning applies to generative models. Autoencoders have large input and output data sizes, which implies that the first and last layers are expected to be computed on the mobile. With these insights, the execution schedule of DNNs can have various patterns depending on the model architecture.
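The shortest path view of layer partitioning can be sketched with a small dynamic program. This is a simplified illustration, not the JointDNN graph construction itself: it assumes a linear chain of layers, two platforms, an input that starts on the mobile, and a result that must end on the mobile. All cost values and names are hypothetical.

```python
def schedule(mobile_cost, cloud_cost, upload_cost, download_cost):
    """Cheapest placement of a chain of layers on mobile ('M') or cloud ('C').

    mobile_cost[i], cloud_cost[i]: cost of running layer i on each platform.
    upload_cost[i], download_cost[i]: cost of moving the input of layer i
    across the link (index n refers to the final output).
    Returns (total_cost, placement), equivalent to a shortest path in a
    two-row layered graph.
    """
    n = len(mobile_cost)
    # best[p] = (cost, placement so far) with the current tensor on platform p.
    best = {"M": (0.0, []), "C": (upload_cost[0], [])}
    for i in range(n):
        run = {"M": mobile_cost[i], "C": cloud_cost[i]}
        done = {p: (best[p][0] + run[p], best[p][1] + [p]) for p in "MC"}
        # The output of layer i either stays where it is or crosses the link.
        to_mobile = (done["C"][0] + download_cost[i + 1], done["C"][1])
        to_cloud = (done["M"][0] + upload_cost[i + 1], done["M"][1])
        best = {"M": min(done["M"], to_mobile, key=lambda t: t[0]),
                "C": min(done["C"], to_cloud, key=lambda t: t[0])}
    return best["M"]


# Hypothetical discriminative-style network: data shrinks toward the output,
# while the last layers are the most expensive to compute on the mobile.
mobile = [1.0, 1.0, 8.0, 8.0]
cloud = [0.1, 0.1, 0.5, 0.5]
upload = [10.0, 6.0, 1.0, 1.0, 0.1]
download = [10.0, 6.0, 1.0, 1.0, 0.1]

cost, placement = schedule(mobile, cloud, upload, download)
print(cost, placement)
```

With these made-up costs the schedule keeps the first two layers on the mobile and runs the remaining two on the cloud, which is cheaper than either the mobile-only total (18.0) or the cloud-only total (11.3). A look-up table could store the resulting placement for each set of network parameters.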

This research was supported by grants from NSF SHF and DARPA MTO.