1. Introduction
Deep Neural Network (DNN) architectures have achieved remarkable results in a wide range of machine learning applications, including, but not limited to, computer vision, speech recognition, language modeling, and autonomous cars.
Currently, there is a growing trend of introducing more advanced DNN architectures and employing them in end-user applications. The considerable improvements in DNNs are usually achieved by increasing complexity, which requires more computational resources for training and inference. Recent research directions to make this progress sustainable are: development of Graphical Processing Units (GPUs) as the vital hardware component of both servers and mobile devices (Oh and Jung, 2004), design of efficient algorithms for large-scale distributed training (Dean et al., 2012) and efficient inference (Razlighi et al., 2017), compression and approximation of models (Sze et al., 2017), and, most recently, collaborative computation between cloud and fog, also known as dew computing (Skala et al., 2015).
Using cloud servers for computation and storage is becoming increasingly favorable due to technical advancements and improved accessibility. Scalability, low cost, and satisfactory Quality of Service (QoS) have made offloading to the cloud the typical choice for computing-intensive tasks. On the other hand, mobile devices are being equipped with more powerful general-purpose CPUs and GPUs. Very recently, hardware companies have started to design dedicated chips to better tackle machine-learning tasks. For example, Apple’s A11 Bionic chip (Newsroom, 2017) used in iPhone X uses a neural engine in its GPU to speed up DNN queries of applications such as face identification and facial motion capture (Li et al., 2013).
There are currently two methods for DNN inference: mobile-only and cloud-only. For simple models, the mobile device performs all of the computation. For complex models, the raw input data (image, video stream, voice, etc.) is uploaded to and processed on the cloud, and the results are then downloaded to the device.
Despite the improvements in mobile devices mentioned earlier, their computational power is still significantly weaker than that of cloud servers. Therefore, the mobile-only approach can cause large inference latency and fail to meet QoS requirements. Moreover, embedded devices face major energy consumption constraints due to limited battery capacity. On the other hand, the cloud-only approach suffers from the communication overhead of uploading the raw data and downloading the outputs. Slowdowns caused by service congestion, subscription costs, and network dependency are further downsides of this approach.
The superiority and persistent improvement of DNNs depend heavily on the availability of a huge amount of training data. Typically, this data is collected from different sources and later fed into the network for training. The final model can then be delivered to different devices for inference. However, a growing number of applications require adaptive learning in online environments, such as self-driving cars and security drones (Pan et al., 2017; Nazemi et al., 2018). Model parameters in these smart devices change constantly based on their continuous interaction with their surroundings. The complexity of these architectures, with their large number of parameters, combined with current cloud-only methods for DNN training, implies a constant communication cost and the burden of increased power consumption for the mobile device.
Automatic partitioning of computationally intensive tasks over the cloud for optimization of performance and energy consumption has already been well studied (Chun et al., 2011). Most recently, scalable distributed hierarchical structures between the end-user device, edge, and cloud, specialized for DNN applications, have been suggested (Teerapittayanon et al., 2017). However, exploiting the layer granularity of DNN architectures for run-time partitioning has not yet been studied thoroughly.
In this work, we investigate inference and training of DNNs on a joint platform of mobile and cloud as an alternative to the current single-platform methods, as illustrated in Figure 1. Considering DNN architectures as an ordered sequence of layers, and the possibility of computing every layer either on the mobile device or on the cloud, we can model the DNN structure as a directed acyclic graph (DAG). The parameters of our real-time adaptive model depend on the following factors: mobile/cloud hardware and software resources, battery capacity, network specifications, and QoS. Based on this modeling, we show that the problem of finding the optimal computation schedule for different scenarios, i.e. best performance or energy consumption, can be reduced to the polynomial-time shortest path problem.
To present realistic results, we conducted experiments on real hardware serving as the mobile device and the cloud. To model the communication between the platforms, we used different network technologies and the most recent reports on their specifications in the U.S.
DNN architectures can be categorized based on functionality. These differences dictate the type and order of layers in an architecture, directly affecting the partitioning result in the collaborative method. For discriminative models, used in recognition applications, the layer size gradually decreases from input toward output, as shown in Figure 2. This sequence suggests computation of the first few layers on the mobile device to avoid the excessive communication cost of uploading large raw input data. On the other hand, the growth of the layer size from input to output in generative models, used for synthesizing new data, implies the possibility of uploading a small input to the cloud and later downloading and computing the last layers on the mobile device for better efficiency. Interesting mobile applications like image-to-image translation are implemented with autoencoder architectures, which usually consist of middle layers that are smaller than the input and output. Consequently, we expect the first and last layers to be computed on the mobile device in our collaborative approach. We examined eight well-known DNN benchmarks selected from these categories to illustrate their differences in the collaborative computation approach.
As we will see in Section 4, the communication between the mobile device and the cloud is the main bottleneck for both performance and energy in the collaborative approach. We investigated the specific characteristics of CNN layer outputs and introduced a lossless compression method to reduce the communication costs.
State-of-the-art work on collaborative computation of DNNs (Kang et al., 2017) only considers one offloading point, assigning computation of the layers before and after it to the mobile and cloud platforms, respectively. We show that this approach is non-generic and fails to be optimal, and we introduce a new method that allows each layer to be computed on either platform independently of the other layers. Our evaluations show that JointDNN significantly improves the latency and energy up to and respectively compared to the status-quo single-platform approaches without any compression. The main contributions of this paper can be listed as:

Introducing a novel model for collaborative computation between the mobile and cloud

Formulating the problem of optimal computation scheduling of DNNs at layer granularity in a mobile cloud computing environment as a shortest path problem and as an integer linear program (ILP)

Examining compressibility of DNN layers and developing a lossless compression method to improve communication costs

Demonstrating the significant improvements of performance, mobile energy consumption, and cloud workload achieved by using JointDNN
2. Problem definition and modeling
In this section, we explain the general architecture of DNN layers and our profiling method. Moreover, we elaborate on how the cost optimization can be reduced to a shortest path problem by introducing the JointDNN graph model. Finally, we show how the constrained problem is formulated as an ILP.
2.1. DNN Building Blocks
DNNs are networks composed of several layers stacked on top of each other. We briefly explain the functionality of each layer used in state-of-the-art architectures:
Convolution Layer (conv) consists of a set of filters with dimensions relatively smaller than their input. Each filter traverses the entire input with a predefined step size and computes the dot product between its parameters and the corresponding part of the input. This process creates different feature maps (referred to as channels) for different filters from the same input data. This preservation of the locality of input features has made Convolutional Neural Network (CNN) architectures the workhorse of state-of-the-art image classification models. Because conv is based on dot products, it can be formulated as a General Matrix Multiplication (GEMM) and can therefore gain performance from parallel computing devices (e.g. GPUs).
Fully Connected Layer (fc) is the main component of most regular neural networks, in which every neuron is connected to all neurons of the previous layer. This fully pairwise connection architecture accounts for a large portion of the computation of the whole network. Like conv, the fc layer is also formulated as GEMM.
Pooling Layer (pool) performs a nonlinear down-sampling function over non-overlapping, spatially local parts of the input. Max-pooling is the most common function used in this type of layer, alongside other functions such as average or L2-norm pooling.
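As a small illustration of the pooling description above, the following sketch applies max-pooling over non-overlapping square windows of a 2D feature map. The function name `max_pool_2d`, the window size of 2, and the sample values are hypothetical choices for this example and are not taken from the paper.

```python
def max_pool_2d(feature_map, window=2):
    """Max-pool a 2D list over non-overlapping window x window tiles."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - window + 1, window):
        pooled_row = []
        for c in range(0, cols - window + 1, window):
            # Collect the values of one spatially local tile and keep the largest.
            tile = [feature_map[r + dr][c + dc]
                    for dr in range(window) for dc in range(window)]
            pooled_row.append(max(tile))
        pooled.append(pooled_row)
    return pooled


# Hypothetical 4x4 feature map; each 2x2 tile collapses to its maximum.
example = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled_example = max_pool_2d(example)  # [[4, 2], [2, 8]]
```

The output has a quarter of the input's elements, which is why pooling layers shrink the amount of data that would have to be transferred at a partition point.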
Activation Layer increases the nonlinearity of neural network architectures. This layer applies a nonlinear activation function to each individual data point of the input to generate an output of the same size. Among various nonlinear functions, such as sigmoid and hyperbolic tangent, the Rectified Linear Unit (relu) is currently the favored choice in DNN architectures, as it is simple and speeds up the tedious training process (Glorot et al., 2011).
Local Response Normalization (lrn) performs local normalization by imposing a local competition for large activities between adjacent features in a channel, and also between features at the same spatial location in different channels. lrn is inspired by inhibition schemes observed in the brain and is intended to aid generalization. There are different formulations suggested for lrn; as shown in (Krizhevsky et al., 2012; Jarrett et al., 2009), they may lead to slight improvements.
Dropout Layer (drop) As mentioned earlier, fc holds most of the parameters of DNN models and is thus vulnerable to overfitting. Typically, regularization methods are used to prevent overfitting by reducing the dependency of the network on individual neurons during training. In the dropout technique (Srivastava et al., 2014), at each training iteration every neuron can be removed (dropped out) from the network with a predetermined probability, or kept with the complementary probability, and the training is done on the remaining network. The dropped-out nodes keep their previous weights for the next training iteration.
Deconvolution Layer (deconv), also known as transposed convolution, is mostly used in generative and autoencoder models in applications such as building high-resolution pictures from low-resolution pictures and high-level descriptions. The goal in deconvolution is to find in the convolution equation of form . In case of DNNs, is the filter and is the input of the convolution (Zeiler et al., 2010).
Long Short-Term Memory Layer (lstm) is a building unit for layers of a recurrent neural network (RNN) and is widely used due to its promising results in speech recognition applications. A typical LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate, which together are responsible for remembering and forgetting specific values over arbitrary time intervals. The whole LSTM unit can be thought of as a typical artificial neuron, as in a feedforward neural network.
Softmax (soft) is the last layer in multiclass architectures, usually connected in one-to-one correspondence to an fc layer. Softmax establishes a probability distribution by representing each class probability with a single neuron.
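To make the softmax description concrete, here is a minimal sketch of the standard softmax function, which maps the raw scores of the preceding fc layer to one probability per class. The input scores are hypothetical, and subtracting the maximum score is a common numerical-stability step added for this example rather than something stated in the text.

```python
import math


def softmax(scores):
    """Turn a list of raw class scores into a probability distribution."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for a three-class problem, one neuron per class.
probs = softmax([2.0, 1.0, 0.1])
```

The output has the same length as the input, sums to one, and preserves the ordering of the scores.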
2.2. Energy and Latency Profiling
There are three methods for measuring the latency and energy consumption of each layer in neural networks:
Statistical Modeling: In this method, a regression model over the configurable parameters of operators (e.g. filter size in convolution) is used to estimate the associated latency and energy. This method is prone to large errors because of the inter-layer optimizations performed by DNN software packages. Therefore, it is necessary to profile several consecutive operators grouped together. Many of these software packages are proprietary, making access to their inter-layer optimization techniques impossible.
To illustrate this issue, we designed two experiments with 25 consecutive convolutions on an NVIDIA Pascal™ GPU using the cuDNN® library (Chetlur et al., 2014). In the first experiment, we measure the latency of each convolution operator separately and take the total latency to be their sum. In the second experiment, we group the convolutions together and measure the total latency. All parameters are located in the GPU’s memory in both experiments, avoiding any data transfer from the main memory, so that the results represent the actual computation latency.
As we see in Figure 3, there is a large error gap between the separated and grouped execution experiments, which grows as the number of convolutions increases. This observation confirms that we need to profile grouped operators to obtain more accurate estimates. Considering the various consecutive combinations of operators and the different input sizes, this method requires a very large number of measurements, not to mention the need for a complex regression model.
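To give a sense of how quickly the number of grouped measurements grows, the sketch below enumerates every run of consecutive operators in a chain. The helper name `consecutive_groups` is hypothetical; the count itself is the standard number of contiguous sub-sequences, n(n+1)/2, and it applies per input size.

```python
def consecutive_groups(num_layers):
    """List every (first, last) pair describing a run of consecutive layers."""
    return [(first, last)
            for first in range(1, num_layers + 1)
            for last in range(first, num_layers + 1)]


# For a chain of 25 operators, as in the experiment above,
# there are 25 * 26 / 2 = 325 distinct consecutive groupings.
groups_25 = consecutive_groups(25)
num_groups_25 = len(groups_25)
```

Each additional input size or operator configuration multiplies this count, which is the measurement burden the paragraph above refers to.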
Analytical Modeling: Deriving an analytical estimate of the latency and energy consumption requires the exact hardware and software specifications. However, the state-of-the-art work in latency modeling of DNNs (Qi et al., 2017) fails to estimate layer-level delay within an acceptable error bound, for instance, underestimating the latency of a fully connected layer with 4096 neurons by around 900%. Industrial developers do not reveal detailed hardware architecture specifications or the internals of proprietary parallel computing architectures such as CUDA®; therefore, the analytical approach can be quite challenging (Hong and Kim, 2010).
Application-specific Profiling: In this method, the DNN architecture of the application being used is profiled at run time. The number of applications on a mobile device that use neural networks is generally limited. Consequently, this method is more feasible and promises more accurate estimates. We have chosen this method for estimating energies and latencies in the experiments of this paper.
2.3. JointDNN Graph Model
First, we assume that a DNN is represented by a sequence of distinct layers with a linear topology, as depicted in Figure 4. Layers are executed sequentially, with the output data generated by one layer feeding into the input of the next one. We denote the input and output data sizes of the k-th layer as and , respectively. Denoting the latency (energy) of layer k as , where , the total latency (energy) of querying the DNN is .
The mobile cloud computing optimal scheduling problem can be reduced to a shortest path problem, from node to , in the graph of Figure 5. The Mobile Execution cost of the k-th layer () is the cost of executing the k-th layer on the mobile device while the cloud server is idle. The Cloud Execution cost of the k-th layer () is the cost of executing the k-th layer on the cloud server while the mobile device is idle. The Uploading the Input Data cost of the k-th layer is the cost of uploading the output data of the (k−1)-th layer to the cloud server . The Downloading the Input Data cost of the k-th layer is the cost of downloading the output data of the (k−1)-th layer to the mobile device . The costs can refer to either latency or energy. However, as we showed in Section 2.2, the assumption of linear topology in DNNs does not hold, and we need to consider all the consecutive groupings of the layers in the network. This suggests replacing the linear topology with a tournament graph, as depicted in Figure 6. We define the parameters of this new graph, the JointDNN graph model, in Table 1.
Param.  Description of Cost 

Executing layers to on the cloud  
Executing layers to on the mobile  
+  
+  
All the following edges:  
All the following edges:  
All the following edges:  
All the following edges:  
All the following edges:  
All the following edges:  
Uploading the input of the first layer 
In this graph, node represents that the layers to are computed on the cloud server, while node represents that the layers to are computed on the mobile device. An edge between two adjacent nodes in JointDNN graph model is associated with four possible cases: 1) A transition from the mobile to the mobile, which only includes the mobile computation cost () 2) A transition from the cloud to the cloud, which only includes the cloud computation cost () 3) A transition from the mobile to the cloud, which includes the mobile computation cost and uploading cost of the inputs of the next node () 4) A transition from the cloud to the mobile, which includes the cloud computation cost and downloading cost of the inputs of the next node (). Under this formulation, we can transform the computation scheduling problem to finding the shortest path from to .
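The four transition cases above can be sketched as a small shortest path computation. The sketch below is a simplification, not the paper's full model: it assumes the linear per-layer topology rather than the grouped tournament graph, assumes the input starts on the mobile device and the result must end there, and uses hypothetical cost values and a hypothetical function name `schedule`. Because the graph is acyclic with two nodes per layer, the shortest path can be found with one forward pass.

```python
def schedule(mobile_cost, cloud_cost, upload_cost, download_cost, download_result):
    """Cheapest mobile/cloud assignment for a chain of layers.

    mobile_cost[k], cloud_cost[k]: cost of executing layer k on each platform.
    upload_cost[k]: cost of sending the input of layer k to the cloud.
    download_cost[k]: cost of fetching the input of layer k to the mobile
        (entry 0 is unused because the raw input starts on the mobile).
    download_result: cost of fetching the final output from the cloud.
    """
    n = len(mobile_cost)
    # best[p] = (cost, plan) of the cheapest path whose last layer ran on p.
    best = {
        "mobile": (mobile_cost[0], ["mobile"]),
        "cloud": (upload_cost[0] + cloud_cost[0], ["cloud"]),
    }
    for k in range(1, n):
        exec_cost = {"mobile": mobile_cost[k], "cloud": cloud_cost[k]}
        arrivals = {
            # mobile -> mobile, or cloud -> mobile with a download
            "mobile": [(best["mobile"][0], "mobile"),
                       (best["cloud"][0] + download_cost[k], "cloud")],
            # cloud -> cloud, or mobile -> cloud with an upload
            "cloud": [(best["cloud"][0], "cloud"),
                      (best["mobile"][0] + upload_cost[k], "mobile")],
        }
        updated = {}
        for platform, candidates in arrivals.items():
            cost, previous = min(candidates)
            updated[platform] = (cost + exec_cost[platform],
                                 best[previous][1] + [platform])
        best = updated
    on_mobile = best["mobile"]
    on_cloud = (best["cloud"][0] + download_result, best["cloud"][1])
    return min(on_mobile, on_cloud, key=lambda entry: entry[0])


# Hypothetical costs for a 4-layer discriminative model: a large raw input
# (expensive upload) followed by progressively smaller layer outputs.
total, plan = schedule(
    mobile_cost=[4, 6, 9, 2],
    cloud_cost=[1, 1, 1, 1],
    upload_cost=[20, 8, 2, 1],
    download_cost=[0, 8, 2, 1],
    download_result=0.5,
)
```

With these numbers the mobile-only cost is 21 and the cloud-only cost is 24.5, while the mixed schedule found by the sketch costs 14.5, running the first two layers on the mobile device and the rest on the cloud.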
Residual networks are a class of powerful and easy-to-train DNN architectures (He et al., 2015).
In residual networks, as depicted in Figure 7 (a), the output of one layer is fed into another layer at a distance of at least two. Thus, we need to keep track of the source layer (node in Figure 7) to know whether this layer is computed on the mobile device or on the cloud.
Our standard graph model has a memory of one layer, namely the immediately preceding layer. We provide a method to transform the computation graph of this type of network into our standard model, the JointDNN graph.
In this regard, we add two additional chains of size , where is the number of nodes in the residual block ( in Figure 7). One chain represents the case of computing layer on the mobile and the other one represents the case of computing layer on the cloud. In Figure 7, we have only shown the weights that need to be modified, where and are the cost of downloading and uploading the output of layer , respectively.
By solving the shortest path problem in the JointDNN graph model, we can obtain the optimal scheduling of inference in DNNs. Online training consists of one inference step and one backpropagation step. The total number of layers is denoted by consistently throughout this paper, so there are layers for modeling training, where the second layers are the mirrored version of the first layers, and their associated operations are the gradients of the error function with respect to the DNN’s weights. The main difference between the mobile cloud computing graphs of inference and online training is the need to update the model by downloading the new weights from the cloud. We assume that the cloud server performs the whole backpropagation step separately, even if it is scheduled to be done on the mobile device; therefore, to save mobile energy, the mobile device does not need to upload the weights that it updates itself. The modification to the JointDNN graph model is adding the costs of downloading the weights of the layers that are updated in the cloud to .
The shortest path problem can be solved efficiently in polynomial time.
However, the shortest path problem subject to constraints has been shown to be NP-complete (Wang and Crowcroft, 1996). For instance, assume that our standard graph is constructed for energy and we need to find the shortest path subject to the constraint that the total latency of that path is less than a time deadline (QoS). Although there is an approximate solution to this problem, the LARAC algorithm (Juttner et al., 2001), the nature of our application does not require solving this optimization problem frequently; we therefore aim to obtain the optimal solution. We can build a small lookup table of optimization results for different sets of parameters (e.g. network bandwidth, cloud server load, etc.). We provide the ILP formulations of DNN partitioning in the following sections.
2.4. ILP Setup
2.4.1. Performance Efficient Computation Offloading ILP Setup for Inference
We formulated the scheduling of inference in DNNs as an ILP with a tractable number of variables. In our method, we first profile the delay and energy consumption of consecutive layers of size . Thus, we will have
(1) 
different profiling values for delay and energy. Considering layer to layer to be computed either on the mobile device or the cloud server, we assign two binary variables and , respectively. Download and upload communication delays need to be added to the execution time when switching from the cloud to the mobile device and from the mobile device to the cloud, respectively.
(2) 
(3) 
(4) 
and represent the execution time of the i-th layer to the j-th layer on the mobile and cloud, respectively. and represent the latency of downloading and uploading the output of the i-th layer, respectively. Considering each set of consecutive layers, whenever and one of are equal to one, the output of the j-th layer is uploaded to the cloud. The same argument applies to downloading. We also note that the last two terms in Eq. 3 represent, respectively, the condition in which the last layer is computed on the cloud and we need to download the output to the mobile device, and the condition in which the first layer is computed on the cloud and we need to upload the input to the cloud. To support residual architectures, we need to add a pair of download and upload terms, similar to the first two terms in Eq. 3, for the starting and ending layers of each residual block. To guarantee that all layers are computed exactly once, we need to add the following set of constraints:
(5) 
Because of the nonlinearity of multiplication, an additional step is needed to transform Eq. 3 to the standard form of ILP. We define two sets of new variables:
(6)  
with the following constraints:
(7)  
The first two constraints ensure that will be zero if either or is zero. The third inequality guarantees that will take the value one if both binary variables, and , are set to one. The same reasoning works for . In summary, the total number of variables in our ILP formulation will be , where is the total number of layers in the network.
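The linearization described above can be checked by brute force. The sketch below assumes the standard three-inequality linearization of a product of two binary variables, which matches the behavior described in the text (two constraints that force zero, one that forces one); the variable names `x`, `y`, and `z` are hypothetical placeholders for the paper's own symbols, which are not reproduced here.

```python
from itertools import product


def feasible_products(x, y):
    """Values of a binary z allowed by z <= x, z <= y, z >= x + y - 1."""
    return [z for z in (0, 1) if z <= x and z <= y and z >= x + y - 1]


# For every combination of the two binary inputs, the only feasible z is x * y.
truth_table = {(x, y): feasible_products(x, y)
               for x, y in product((0, 1), repeat=2)}
```

Since exactly one value of `z` survives in each case and it equals the product, the nonlinear term can be replaced by the new variable without changing the optimum.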
2.4.2. Energy Efficient Computation Offloading ILP Setup for Inference
Because of the nature of the application, we only care about the energy consumption on the mobile side. We formulate the ILP as follows:
(8) 
(9) 
(10) 
and represent the amount of energy required to compute the i-th layer to the j-th layer on the mobile and cloud, respectively. and represent the energy required to download and upload the output of the i-th layer, respectively. As with the performance-efficient ILP constraints, each layer should be executed exactly once:
(11) 
The ILP problem can be solved for different sets of parameters (e.g. different upload and download speeds), and the scheduling results can then be stored as a lookup table on the mobile device. Moreover, because the number of variables in this setup is tractable, solving the ILP is quick. For instance, solving the ILP for AlexNet takes around 0.045 seconds on an Intel® Core™ i7-3770 CPU with MATLAB®’s intlinprog() function using the primal simplex algorithm.
2.4.3. Performance Efficient Computation Offloading ILP Setup for Training
The ILP formulation of the online training phase is very similar to that of inference. In online training we have layers instead of obtained by mirroring the DNN, where the second layers correspond to backward propagation. Moreover, we need to download the weights that are updated in the cloud to the mobile device. We assume that the cloud server always has the most up-to-date version of the weights and does not require the mobile device to upload the updated weights. The following terms need to be added for the ILP setup of training:
(12) 
(13) 
(14) 
2.4.4. Energy Efficient Computation Offloading ILP Setup for Training
(15) 
(16) 
(17) 
2.4.5. Scenarios
There can be different optimization scenarios defined for ILP as listed below:

Performance efficient computation: In this case, it is sufficient to solve the ILP formulation for performance efficient computation offloading.

Energy efficient computation: In this case, it is sufficient to solve the ILP formulation for energy efficient computation offloading.

Battery budget limitation: In this case, based on the available battery, the operating system can decide to dedicate a specific amount of energy consumption to each application. By adding the following constraint to the performance efficient ILP formulation, our framework would adapt to battery limitations:
(18) 
Cloud limited resources: In the presence of cloud server congestion or limitations on user’s subscription, we can apply execution time constraints to each application to alleviate the server load:
(19) 
QoS: In this scenario, we minimize the required energy consumption while meeting a specified deadline:
(20) 
This constraint could be applied to both energy and performance efficient ILP formulations.
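The QoS scenario can be illustrated with a small exhaustive search that minimizes mobile energy subject to a latency deadline. This is only a sketch under stated assumptions: it enumerates all assignments instead of solving an ILP (so it is exponential in the number of layers and suitable only for tiny examples), it uses per-layer rather than grouped costs, and every number and name in it (`path_cost`, `min_energy_under_deadline`, the cost tables) is hypothetical.

```python
from itertools import product


def path_cost(plan, exec_cost, upload, download):
    """Total cost of a plan; data starts on and must return to the mobile."""
    total, location = 0.0, "mobile"
    for k, platform in enumerate(plan):
        if location == "mobile" and platform == "cloud":
            total += upload[k]      # send the input of layer k to the cloud
        elif location == "cloud" and platform == "mobile":
            total += download[k]    # fetch the input of layer k to the mobile
        total += exec_cost[platform][k]
        location = platform
    if location == "cloud":
        total += download[len(plan)]  # fetch the final result
    return total


def min_energy_under_deadline(latency, energy, deadline):
    """Return (energy, latency, plan) of the best feasible plan, or None."""
    n = len(latency["exec"]["mobile"])
    best = None
    for plan in product(("mobile", "cloud"), repeat=n):
        t = path_cost(plan, latency["exec"], latency["upload"], latency["download"])
        if t > deadline:
            continue  # violates the QoS deadline
        e = path_cost(plan, energy["exec"], energy["upload"], energy["download"])
        if best is None or e < best[0]:
            best = (e, t, plan)
    return best


# Hypothetical 3-layer example: a slow but low-power uplink.
latency = {"exec": {"mobile": [4, 6, 3], "cloud": [1, 1, 1]},
           "upload": [10, 8, 1], "download": [0, 3, 1, 0.5]}
energy = {"exec": {"mobile": [5, 8, 4], "cloud": [0, 0, 0]},
          "upload": [3, 2, 1], "download": [0, 2, 1, 1]}

relaxed = min_energy_under_deadline(latency, energy, deadline=100)
tight = min_energy_under_deadline(latency, energy, deadline=13)
```

In this example the unconstrained energy optimum is the cloud-only plan (energy 4, latency 13.5); tightening the deadline to 13 rules it out and the best feasible plan keeps the first two layers on the mobile device (energy 15, latency 12.5).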
3. Evaluation
3.1. Deep Architecture Benchmarks
Since the architecture of neural networks depends on the type of the application, we have chosen three common application types of DNNs:

Discriminative neural networks are a class of models in machine learning that model the conditional probability distribution of the output given the input. This class is generally used in classification and regression tasks. AlexNet (Krizhevsky et al., 2012), OverFeat (Sermanet et al., 2013), VGG16 (Simonyan and Zisserman, 2014), Deep Speech (Hannun et al., 2014), ResNet (He et al., 2015), and NiN (Lin et al., 2013) are well-known discriminative models that we use as benchmarks in this experiment. Except for Deep Speech, which is used for speech recognition, all benchmarks are used in image classification tasks.
Generative neural networks model the joint probability distribution of inputs and outputs, allowing generation of new samples. These networks have applications in computer vision (Goodfellow et al., 2014) and robotics (Finn and Levine, 2016), and can be deployed on a mobile device. Chair (Dosovitskiy et al., 2014) is the generative model we use as a benchmark in this work.

Autoencoders are another class of neural networks, used to learn a representation for a data set. Their applications include image reconstruction, image-to-image translation, and denoising, to name a few. Mobile robots can be equipped with autoencoders for their computer vision tasks. We use Pix2Pix (Isola et al., 2016) as a benchmark from this class.
Param.  3G  4G  WiFi 

Download speed (Mbps)  2.0275  13.76  54.97 
Upload speed (Mbps)  1.1  5.85  18.88 
(mW/Mbps)  868.98  438.39  283.17 
(mW/Mbps)  122.12  51.97  137.01 
(mW)  817.88  1288.04  132.86 
3.2. Mobile and Server Setup
We used the Jetson TX2 module developed by NVIDIA® (Corporation, 2018a), a fair representative of mobile computation power, as our mobile device. This module enables efficient implementation of DNN applications used in products such as robots, drones, and smart cameras. It is equipped with an NVIDIA Pascal® GPU with 256 CUDA cores and 8 GB of 128-bit LPDDR4 memory shared between the GPU and CPU. To measure the power consumption of the mobile platform, we used the INA226 power sensor (Incorporated, 2018).
An NVIDIA® Tesla® K40C (Corporation, 2018b) with 12 GB of memory serves as our server GPU. The computation capability of this device is more than one order of magnitude higher than that of our mobile device.
3.3. Communication Parameters
To model the communication between platforms, we used the average download and upload speed of mobile Internet (OpenSignal.com, 2017a, b) for different networks (3G, 4G and WiFi) as shown in Table 3.
The communication power for download () and upload () is dependent on the network throughput ( and ). Comprehensive examinations in (Huang et al., 2012) indicate that uplink and downlink power can be modeled fairly accurately with linear equations (Eq. 21), with an error rate of less than 6%. Table 3 shows the parameter values of this equation for different networks.
(21) 
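The linear power model can be turned into a rough estimate of transfer time and energy. The sketch below rests on assumptions that should be checked against the original table and equation: it assumes power is a throughput-proportional term plus a constant base term, and it assumes the three unlabeled rows of Table 3 are, in order, the uplink slope, the downlink slope (both in mW/Mbps), and the base power (mW). The names `alpha_up`, `alpha_down`, `beta`, and `transfer_cost`, and the 8-megabit payload, are hypothetical.

```python
# Values copied from Table 3; the meaning of the last three fields is an
# ASSUMPTION (uplink slope, downlink slope, base power), see the lead-in.
NETWORKS = {
    "3G":   {"down_mbps": 2.0275, "up_mbps": 1.1,
             "alpha_up": 868.98, "alpha_down": 122.12, "beta": 817.88},
    "4G":   {"down_mbps": 13.76, "up_mbps": 5.85,
             "alpha_up": 438.39, "alpha_down": 51.97, "beta": 1288.04},
    "WiFi": {"down_mbps": 54.97, "up_mbps": 18.88,
             "alpha_up": 283.17, "alpha_down": 137.01, "beta": 132.86},
}


def transfer_cost(megabits, network, direction):
    """Return (seconds, millijoules) for moving `megabits` 'up' or 'down'."""
    params = NETWORKS[network]
    throughput = params[direction + "_mbps"]
    # Assumed linear model: power (mW) = slope * throughput (Mbps) + base.
    power_mw = params["alpha_" + direction] * throughput + params["beta"]
    seconds = megabits / throughput
    return seconds, power_mw * seconds  # mW * s = mJ


# Hypothetical payload of 8 megabits (1 MB) uploaded over each network.
upload_estimates = {name: transfer_cost(8.0, name, "up") for name in NETWORKS}
```

Under these assumptions, both the upload time and the upload energy for the same payload decrease from 3G to 4G to WiFi, even though the instantaneous radio power is highest on WiFi, because the transfer finishes much sooner.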
4. Results
The latency and energy improvements of inference and online training with our engine for 8 different benchmarks are shown in Figures 8 and 9, respectively. We considered the better of the mobile-only and cloud-only approaches as our baseline. JointDNN can achieve up to 66% and 86% improvements in latency and energy consumption, respectively, during inference. Communication cost increases linearly with batch size, whereas computation cost grows at a much lower rate, as depicted in Figure 10(b). Therefore, a key observation is that as we increase the batch size, the mobile-only approach becomes preferable.
During online training, the huge communication overhead of transmitting the updated weights is added to the total cost. Therefore, to avoid downloading this large amount of data, only a few backpropagation steps are computed on the cloud server. We performed a simulation by varying the percentage of updated weights. As the percentage of updated weights increases, the latency and energy consumption become constant, as shown in Figure 10. This is because all the backpropagation steps are then performed on the mobile device and no weights are transferred from the cloud to the mobile device. JointDNN can achieve improvements of up to 73% in latency and 56% in energy consumption during online training.
Different patterns of scheduling are demonstrated in Figure 11. They represent the optimal solution in the WiFi network while optimizing for latency, and they show how the computations in a DNN are divided between the mobile device and the cloud. As can be seen, for discriminative models (e.g. AlexNet), inference follows a mobile-cloud pattern and training follows a mobile-cloud-mobile pattern. The intuition is that the last layers are computationally intensive (fc) but have small data sizes, which require a low communication cost; therefore, the last layers tend to be computed on the cloud. For generative models (e.g. Chair), the execution schedule of inference is the opposite of that of discriminative networks: the last layers are generally huge, and in the optimal solution they are computed on the mobile device. Lastly, for autoencoders, where both the input and output data sizes are large, the first and last layers are computed on the mobile device.
JointDNN pushes some parts of the computation toward the mobile device, which leads to less workload on the cloud server. As we see in Table 4, we can reduce the cloud server’s workload by up to 84%, and by 53% on average, which enables the cloud provider to serve more users while obtaining higher performance and lower energy consumption compared to single-platform approaches.
Optimization Target  3G (%)  4G (%)  WiFi (%) 

Latency  84  49  12 
Energy  73  49  51 
4.1. Communication Dominance
The execution time and energy breakdown for AlexNet, taken as a representative of the state-of-the-art architectures deployed on cloud servers, is depicted in Figure 12. The cloud-only approach is dominated by the communication costs. As demonstrated in Figure 12, 99%, 93%, and 81% of the total execution time is spent on communication in the case of 3G, 4G, and WiFi, respectively. These proportions also apply to energy consumption. Comparing the latency and energy of the communication to those of the mobile-only approach, we notice that the mobile-only approach for AlexNet is better than the cloud-only approach on all the mobile networks. We apply lossless compression methods to reduce the effect of the communication, as covered in the next section.
4.2. Layer Compression
The preliminary results of our experiments show that more than
of the total energy and delay cost in DNNs is caused by communication in the collaborative approach. This cost is directly proportional to the size of the layer being downloaded to or uploaded from the mobile device. Because of the complex feature extraction process of DNNs, some of the intermediate layers are even larger than the network's input data. For example, this ratio can go as high as
in VGG16. To address this bottleneck, we investigated compressing the data before any communication. This process can be applied to different DNN architecture types; however, we only considered CNNs due to their specific characteristics, explained below.
CNN architectures are mostly used for image and video recognition applications. Because conv layers preserve spatial locality, we can assume that the outputs of the first convolution layers follow the same structure as the input image, as shown in Figure 13. Moreover, a large fraction of layer outputs is expected to be zero due to the presence of the relu layer. Our observations show that the ratio of neurons equal to zero (ZR) varies from 50% to 90% after relu in CNNs. These two characteristics, layers being similar to the input image and a large proportion of their data being a single value, suggest that we can apply existing image compression techniques to their output.
There are two general categories of compression techniques, lossy and lossless (Cover and Thomas, 2006). In lossless techniques, the original information can be reconstructed completely. In contrast, lossy techniques use approximations, and the original data cannot be reconstructed exactly. In our experiments, we examined the impact of compression using PNG, a lossless technique based on encoding frequent sequences in an image.
Even though DNN parameters in typical implementations are 32-bit floating-point values, most image formats are based on 3-byte RGB color triples. Therefore, to compress a layer in the same way as a 2D picture, the floating-point data should be quantized into 8-bit fixed-point. Recent studies show that representing the parameters of DNNs with only 4 bits affects the accuracy by no more than 1% (Sze et al., 2017). In this work, we implemented our architectures with 8-bit fixed-point and presented our baseline without any compression. The layers of a CNN contain numerous channels of 2D matrices, each similar to an image. A simple method is to compress each channel separately. However, in addition to the extra overhead of a file header for each channel, this method does not take full advantage of PNG's frequent-sequence encoding. One alternative is to place the different channels side by side, referred to as tiling, to form a large 2D matrix representing one layer, as shown in Figure 13. It should be noted that 1D fc layers are very small, and we did not apply compression to them.
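The quantization and tiling steps can be sketched as follows. This is a minimal illustration under stated assumptions, not our implementation: the min-max linear quantization scheme, the near-square grid layout, and the helper names `quantize_uint8` and `tile_channels` are all hypothetical choices made for the example.

```python
import math


def quantize_uint8(values):
    """Linearly map floats onto 0..255 (min-max scaling, an assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    scale = 255.0 / (hi - lo)
    return [int(round((v - lo) * scale)) for v in values]


def tile_channels(channels, height, width):
    """Place channels side by side in a near-square grid.

    Each channel is a list of `height` rows, each row a list of `width` ints.
    Unused grid cells are padded with zeros.
    """
    count = len(channels)
    cols = math.ceil(math.sqrt(count))
    rows = math.ceil(count / cols)
    tiled = [[0] * (cols * width) for _ in range(rows * height)]
    for idx, channel in enumerate(channels):
        r0 = (idx // cols) * height
        c0 = (idx % cols) * width
        for r in range(height):
            tiled[r0 + r][c0:c0 + width] = channel[r]
    return tiled


# Three 2x2 channels become one 4x4 matrix that an image codec can consume.
example_channels = [
    [[1, 1], [1, 1]],
    [[2, 2], [2, 2]],
    [[3, 3], [3, 3]],
]
for row in tile_channels(example_channels, 2, 2):
    print(row)
```

With min-max scaling, values that are zero after relu stay at zero whenever the minimum of the layer is zero, so the long runs of zeros survive quantization and remain available to the encoder.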
The Compression Ratio (CR) is defined as the ratio of the size of the layer (8-bit) to the size of the compressed 2D matrix in PNG. Looking at the compression results for two different CNN architectures in Figure 14, we can observe a high correlation between the ratio of pixels being zero (ZR) and CR. PNG can compress the layer data up to and by average. These results confirm the effectiveness of the proposed compression method. By substituting the compressed layer outputs and adding the cost of the compression process itself to the JointDNN formulations, we achieve an extra and improvements in energy and latency on average, respectively.
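The relationship between ZR and CR can be explored with a small experiment. The sketch below does not use PNG: as a stand-in it calls `zlib`, which implements the DEFLATE algorithm that PNG also relies on, and it runs on synthetic random data rather than real layer outputs. The helper names and the layer size are assumptions made for the example, so the resulting ratios are not comparable to the figures reported here.

```python
import random
import zlib


def zero_ratio(pixels: bytes) -> float:
    """ZR: fraction of 8-bit values that are exactly zero."""
    return pixels.count(0) / len(pixels)


def compression_ratio(pixels: bytes) -> float:
    """CR: uncompressed size divided by DEFLATE-compressed size."""
    return len(pixels) / len(zlib.compress(pixels, 9))


def synthetic_layer(size: int, zr: float, seed: int = 0) -> bytes:
    """Random 8-bit data in which roughly a fraction `zr` of values is zero."""
    rng = random.Random(seed)
    return bytes(
        0 if rng.random() < zr else rng.randrange(1, 256) for _ in range(size)
    )


for target in (0.5, 0.7, 0.9):
    layer = synthetic_layer(4096, target)
    print(f"ZR={zero_ratio(layer):.2f}  CR={compression_ratio(layer):.2f}")
```

Because the non-zero values here are uniformly random, this synthetic data lacks the spatial structure of real feature maps, so it isolates only the effect of sparsity on the compressed size.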
5. Related work and comparison
General Task Offloading Frameworks. Prior work has focused on offloading computation from the mobile device to the cloud (Ra et al., 2011; Gordon et al., 2012; Chun et al., 2011; Cuervo et al., 2010; Wang et al., 2012; Zhang et al., 2012). However, all these frameworks share a limiting feature that makes them impractical for computation partitioning of DNN applications.
These frameworks depend on programmer annotations, as they make decisions about pre-specified functions, whereas JointDNN makes scheduling decisions at runtime based on the model topology and mobile network specifications. Offloading at the function level cannot lead to efficient partitioning decisions because layers of a given type within one architecture can have significantly different computation and data characteristics. For instance, in the optimal solution, a specific convolution layer structure can be computed on the mobile device in one model and on the cloud in another.
Neurosurgeon is the only prior work exploring a similar computation offloading idea in DNNs between the mobile device and the cloud server at layer granularity. Neurosurgeon assumes that there is only one data transfer point and that the execution schedule of the efficient solution starts on the mobile device and then switches to the cloud, which performs all the remaining computations. Our results show that this is not true, especially for online training, where the optimal schedule often follows the mobile-cloud-mobile pattern. Moreover, generative and autoencoder models follow a pattern with multiple data transfer points. The execution schedule can also start on the cloud, especially in the case of generative models where the input data size is large. Furthermore, inter-layer optimizations performed by DNN libraries are not considered in Neurosurgeon. In addition, Neurosurgeon only schedules for optimal latency and energy, while JointDNN adapts to different scenarios including battery limitation, cloud server congestion, and QoS. Lastly, Neurosurgeon only targets simple CNN and ANN models, while JointDNN uses a graph-based approach to handle more complex DNN architectures such as ResNet and RNNs.
6. Conclusions
In this paper, we demonstrated that the status-quo approaches, cloud-only or mobile-only, are not optimal with regard to latency and energy. We reduced the problem of partitioning the computations in a DNN to a shortest path problem in a graph. Adding constraints to the shortest path problem makes it NP-complete; therefore, we also provided ILP formulations to cover different possible scenarios of limited mobile battery, cloud congestion, and QoS. One can solve this problem beforehand for different sets of parameters (e.g., network bandwidth, cloud server load) and use a lookup table to avoid the overhead of solving the optimization problem. The output data size in discriminative networks is typically smaller than that of other layers in the network; therefore, the last layers are expected to be computed on the cloud, while the first layers are expected to be computed on the mobile device. The reverse reasoning applies to generative models. Autoencoders have large input and output data sizes, which implies that the first and last layers are expected to be computed on the mobile device. With these insights, the execution schedule of a DNN can follow various patterns depending on the model architecture.
Acknowledgements.
This research was supported by grants from NSF SHF and DARPA MTO.
References
 Chetlur et al. (2014) Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, and others. 2014. cuDNN: Efficient Primitives for Deep Learning. CoRR abs/1410.0759 (2014). arXiv:1410.0759 http://arxiv.org/abs/1410.0759
 Chun et al. (2011) ByungGon Chun, Sunghwan Ihm, Petros Maniatis, Mayur Naik, and Ashwin Patti. 2011. CloneCloud: Elastic Execution Between Mobile Device and Cloud. (2011), 301–314.
 Corporation (2018a) Nvidia Corporation. 2018a. Jetson TX2 Module. https://developer.nvidia.com/embedded/buy/jetsontx2. (2018). [Online; accessed 15January2018].
 Corporation (2018b) Nvidia Corporation. 2018b. TESLA DATA CENTER GPUS FOR SERVERS. http://www.nvidia.com/object/teslaservers.html. (2018). [Online; accessed 15January2018].
 Cover and Thomas (2006) Thomas M. Cover and Joy A. Thomas. 2006. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). WileyInterscience.
 Cuervo et al. (2010) Eduardo Cuervo, Aruna Balasubramanian, Daeki Cho, and others. 2010. MAUI: Making Smartphones Last Longer with Code Offload. (2010), 49–62. https://doi.org/10.1145/1814433.1814441
 Dean et al. (2012) Jeffrey Dean, Greg S. Corrado, Rajat Monga, and others. 2012. Large Scale Distributed Deep Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems  Volume 1 (NIPS’12). Curran Associates Inc., USA, 1223–1231. http://dl.acm.org/citation.cfm?id=2999134.2999271
 Dosovitskiy et al. (2014) Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas Brox. 2014. Learning to Generate Chairs with Convolutional Neural Networks. CoRR abs/1411.5928 (2014). arXiv:1411.5928 http://arxiv.org/abs/1411.5928
 Finn and Levine (2016) Chelsea Finn and Sergey Levine. 2016. Deep Visual Foresight for Planning Robot Motion. CoRR abs/1610.00696 (2016). arXiv:1610.00696

 Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep Sparse Rectifier Neural Networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research), Vol. 15. PMLR, Fort Lauderdale, FL, USA, 315–323. http://proceedings.mlr.press/v15/glorot11a.html
 Goodfellow et al. (2014) Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, and others. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 2672–2680. http://papers.nips.cc/paper/5423generativeadversarialnets.pdf
 Gordon et al. (2012) Mark S. Gordon, D. Anoushe Jamshidi, Scott Mahlke, Z. Morley Mao, and Xu Chen. 2012. COMET: Code Offload by Migrating Execution Transparently. (2012), 93–106. http://dl.acm.org/citation.cfm?id=2387880.2387890
 Hannun et al. (2014) Awni Y. Hannun, Carl Case, Jared Casper, and others. 2014. Deep Speech: Scaling up endtoend speech recognition. CoRR abs/1412.5567 (2014). arXiv:1412.5567 http://arxiv.org/abs/1412.5567
 He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. CoRR abs/1512.03385 (2015). arXiv:1512.03385 http://arxiv.org/abs/1512.03385
 Hong and Kim (2010) Sunpyo Hong and Hyesoon Kim. 2010. An Integrated GPU Power and Performance Model. SIGARCH Comput. Archit. News 38, 3 (June 2010), 280–289. https://doi.org/10.1145/1816038.1815998
 Huang et al. (2012) Junxian Huang, Feng Qian, Alexandre Gerber, and others. 2012. A Close Examination of Performance and Power Characteristics of 4G LTE Networks. In Proceedings of the 10th International Conference on Mobile Systems, Applications, and Services (MobiSys ’12). ACM, New York, NY, USA, 225–238.
 Incorporated (2018) Texas Instruments Incorporated. 2018. INA Current/Power Monitor. http://www.ti.com/product/INA226. (2018). [Online; accessed 15January2018].
 Isola et al. (2016) Phillip Isola, JunYan Zhu, Tinghui Zhou, and Alexei A. Efros. 2016. ImagetoImage Translation with Conditional Adversarial Networks. CoRR abs/1611.07004 (2016). arXiv:1611.07004 http://arxiv.org/abs/1611.07004
 Jarrett et al. (2009) K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. 2009. What is the best multistage architecture for object recognition?. In 2009 IEEE 12th International Conference on Computer Vision. 2146–2153.
 Juttner et al. (2001) A. Juttner, B. Szviatovski, I. Mecs, and Z. Rajko. 2001. Lagrange relaxation based method for the QoS routing problem. 2 (2001), 859–868 vol.2.
 Kang et al. (2017) Yiping Kang, Johann Hauswald, Cao Gao, and others. 2017. Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge. In Proceedings of the TwentySecond International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’17). ACM, New York, NY, USA, 615–629. https://doi.org/10.1145/3037697.3037698
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 1097–1105. http://papers.nips.cc/paper/4824imagenetclassificationwithdeepconvolutionalneuralnetworks.pdf
 Li et al. (2013) Hao Li, Jihun Yu, Yuting Ye, and Chris Bregler. 2013. Realtime Facial Animation with Onthefly Correctives. ACM Trans. Graph. 32, 4, Article 42 (July 2013), 10 pages.
 Lin et al. (2013) Min Lin, Qiang Chen, and Shuicheng Yan. 2013. Network In Network. CoRR abs/1312.4400 (2013). arXiv:1312.4400 http://arxiv.org/abs/1312.4400
 Nazemi et al. (2018) Mahdi Nazemi, Amir Erfan Eshratifar, and Massoud Pedram. 2018. A HardwareFriendly Algorithm for Scalable Training and Deployment of Dimensionality Reduction Models on FPGA. In Proceedings of the 19th IEEE International Symposium on Quality Electronic Design.
 Newsroom (2017) Apple Newsroom. 2017. The future is here: iPhone X. https://www.apple.com/newsroom/2017/09/thefutureishereiphonex/. (2017). [Online; accessed 15January2018].
 Oh and Jung (2004) KyoungSu Oh and Keechul Jung. 2004. GPU implementation of neural networks. 37 (06 2004), 1311–1314.
 OpenSignal.com (2017a) OpenSignal.com. 2017a. State of Mobile Networks: USA. https://opensignal.com/reports/2017/08/usa/stateofthemobilenetwork. (2017). [Online; accessed 15January2018].
 OpenSignal.com (2017b) OpenSignal.com. 2017b. United States Speedtest Market Report. http://www.speedtest.net/reports/unitedstates/. (2017). [Online; accessed 15January2018].
 Pan et al. (2017) Yunpeng Pan, ChingAn Cheng, Kamil Saigol, and others. 2017. Agile OffRoad Autonomous Driving Using EndtoEnd Deep Imitation Learning. CoRR abs/1709.07174 (2017). arXiv:1709.07174 http://arxiv.org/abs/1709.07174
 Qi et al. (2017) Hang Qi, Evan R. Sparks, and Ameet Talwalkar. 2017. Paleo: A Performance Model for Deep Neural Networks. (2017).
 Ra et al. (2011) MooRyong Ra, Anmol Sheth, Lily Mummert, and others. 2011. Odessa: Enabling Interactive Perception Applications on Mobile Devices. (2011), 43–56. https://doi.org/10.1145/1999995.2000000
 Razlighi et al. (2017) M. S. Razlighi, M. Imani, F. Koushanfar, and T. Rosing. 2017. LookNN: Neural network with no multiplication. In Design, Automation Test in Europe Conference Exhibition (DATE), 2017. 1775–1780. https://doi.org/10.23919/DATE.2017.7927280
 Sermanet et al. (2013) Pierre Sermanet, David Eigen, Xiang Zhang, and others. 2013. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. CoRR abs/1312.6229 (2013). arXiv:1312.6229 http://arxiv.org/abs/1312.6229
 Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for LargeScale Image Recognition. CoRR abs/1409.1556 (2014). arXiv:1409.1556 http://arxiv.org/abs/1409.1556
 Skala et al. (2015) Karolj Skala, Davor Davidovic, Enis Afgan, Ivan Sovic, and Zorislav Sojat. 2015. Scalable Distributed Computing Hierarchy: Cloud, Fog and Dew Computing. Open Journal of Cloud Computing (OJCC) 2, 1 (2015), 16–24. http://nbnresolving.de/urn:nbn:de:101:1201705194519
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 15, 1 (Jan. 2014), 1929–1958. http://dl.acm.org/citation.cfm?id=2627435.2670313
 Sze et al. (2017) Vivienne Sze, YuHsin Chen, TienJu Yang, and Joel S. Emer. 2017. Efficient Processing of Deep Neural Networks: A Tutorial and Survey. CoRR abs/1703.09039 (2017). arXiv:1703.09039 http://arxiv.org/abs/1703.09039
 Teerapittayanon et al. (2017) Surat Teerapittayanon, Bradley McDanel, and H. T. Kung. 2017. Distributed Deep Neural Networks over the Cloud, the Edge and End Devices. CoRR abs/1709.01921 (2017). arXiv:1709.01921 http://arxiv.org/abs/1709.01921
 Wang et al. (2012) Xudong Wang, Xuanzhe Liu, Ying Zhang, and Gang Huang. 2012. Migration and Execution of JavaScript Applications Between Mobile Devices and Cloud. (2012), 83–84. https://doi.org/10.1145/2384716.2384750
 Wang and Crowcroft (1996) Zheng Wang and J. Crowcroft. 1996. Qualityofservice routing for supporting multimedia applications. IEEE Journal on Selected Areas in Communications 14, 7 (Sep 1996), 1228–1234. https://doi.org/10.1109/49.536364

 Zeiler et al. (2010) M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. 2010. Deconvolutional networks. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2528–2535. https://doi.org/10.1109/CVPR.2010.5539957
 Zhang et al. (2012) Ying Zhang, Gang Huang, Xuanzhe Liu, and others. 2012. Refactoring Android Java Code for Ondemand Computation Offloading. SIGPLAN Not. 47, 10 (Oct. 2012), 233–248. https://doi.org/10.1145/2398857.2384634