1 Introduction and Motivation
Deep neural networks (DNNs) are extending our capabilities to solve traditionally challenging problems such as computer vision, natural language processing, neural machine translation, and video recognition. While academia and companies are developing specialized hardware, this hardware is still expensive and targets high-performance computing (HPC) datacenters. Furthermore, conventional consumer-level devices, such as Internet of Things (IoT) devices, lack the performance required to execute DNNs. As a result, to understand an environment, consumers need to offload these computations to cloud services. Such an approach, in addition to creating a constant dependency on cloud services and high-quality network availability Biscotti et al. (2014); Lee & Lee (2015); Khan et al. (2012), raises several new privacy concerns for users over their private data (e.g., 24/7 recordings of home security cameras) Li et al. (2015). At the same time, IoT devices are a perfect match for candidate DNN applications (e.g., temperature sensors and smart cameras) Li et al. (2015); Gubbi et al. (2013), because users’ data remains and is processed locally, in the same network in which it is produced. How can we move the computations close to the edge by using only IoT devices, while providing acceptable performance?

Our vision in this paper is to enable efficient, local, and distributed computation of DNNs close to the edge by using IoT devices (i.e., resource-constrained devices). Distribution is necessary because a single IoT device cannot entirely handle the computations of DNNs. Although some optimizations, such as weight pruning Yu et al. (2017); Han et al. (2016) and precision reduction Courbariaux et al. (2014); Gong et al. (2014); Vanhoucke et al. (2011), let us run limited versions of current models on IoT devices, the compute power demanded by DNNs is not expected to stop increasing as DNNs advance and generalized models emerge. Therefore, exploring the distribution of DNN computation is essential.
As discussed, IoT devices are great candidates for these DNN-based applications. By moving DNN computations closer to the edge, this paper helps achieve the following goals: (1) reducing the dependence on cloud resources and high-quality networks, (2) protecting consumers’ private data, (3) providing an alternative and cheaper solution for understanding raw data locally, (4) developing a unified framework that can distribute any DNN model on existing devices while providing real-time execution performance, (5) not being limited to a particular model or dataflow, and (6) decreasing the deployment time by providing general methods and avoiding model- and hardware-specific methods.
We target IoT devices, which already outnumber the world’s population Gartner (2015); Li et al. (2015); Satyanarayanan (2017), because these devices are generally idle most of the time. However, when the inference computations of DNNs are performed on IoT devices rather than in the cloud, some important assumptions change. First, since the requests are local, we might not have enough data to process in parallel (i.e., no immediate data-level parallelism or batching). This means we cannot immediately batch many requests to amortize the expensive costs of memory operations. Second, compared to HPC machines, IoT devices have not only less computation power but also significantly smaller memories. Therefore, if even a small computation task does not fit in the memory of these devices, the execution performance suffers considerably, because the device then uses off-chip storage as swap memory, which causes a huge slowdown.
In this paper, we propose a solution in which collaborative, low-power, and resource-constrained IoT devices perform distributed, real-time, and single-batch DNN-based recognition. By using these collaborative IoT devices, we generate and deploy a balanced data processing pipeline that processes a DNN’s computations efficiently. To address the challenge of limited memory space, we introduce several model-parallelism techniques for the common layers (convolution and fully-connected layers) of visual DNN models that reduce the memory footprint of their heavy computations. We discuss and analyze different DNN distribution methods. We also propose heuristics for distributing DNNs on IoT devices that consider the memory requirement and the amount of computation and communication to achieve near-optimal performance. Moreover, we study prevalent visual DNN models for image recognition (AlexNet Krizhevsky et al. (2012), VGG16 Simonyan & Zisserman (2015), ResNet He et al. (2016), and Xception Chollet (2016)) and video recognition (C3D Tran et al. (2015)). After examining various methods for model parallelism in fully-connected and convolution layers, along with their advantages and disadvantages, we use our heuristics and monitoring tools to create an evenly distributed data processing pipeline. For demonstration, we deploy our distributed system on an interconnected network of up to 11 Raspberry Pi 3s.

The remainder of this paper is organized as follows. In Section 2, we review prior work in this area. Section 3 provides background on convolution and fully-connected layers and introduces the models used in the paper. Next, Section 4 explains model and data parallelism and gives a detailed exploration of model-parallelism methods for fully-connected and convolution layers. Then, in Section 5, we discuss the processing pipeline and describe our heuristics for finding a near-optimal one. We evaluate our models in Section 6 and conclude the paper in Section 7.
2 Prior Work
Recently, with the spread of large DNN models, distributing a single model has gained the attention of researchers Mao et al. (2017); Teerapittayanon et al. (2017); Hadidi et al. (2018a); Kang et al. (2017); Hadidi et al. (2018c, b). Large models need more memory, and when the memory requirement of a DNN model exceeds the system’s memory, the performance of the model (both training and inference) suffers noticeably. More importantly, when DNNs are executed on IoT devices instead of company datacenters, two important criteria change. (1) A consumer, unlike a large company, cannot batch several requests to exploit more data parallelism. Therefore, all inferences are performed in single-batch mode, which increases memory consumption per inference considerably. (2) Consumers do not have access to machines with high memory capacities, so more models suffer in performance. This is why some companies have recently released tools to alleviate this performance loss, such as the ELL library Microsoft (2017) by Microsoft, and TensorFlow Lite Google (2017b) and MobileNets Google (2017a). These libraries target devices such as the Raspberry Pi, Arduino, and micro:bit. However, these tools are still in development, target single devices, and do not distribute computations across multiple devices. They aim for a smaller number of weights and for convolution layers with strides of two, which reduce the dimensions of the input. Interestingly, some of the tailored models in these implementations do not have any fully-connected layers. Although such an effort might alleviate the overhead of DNNs on resource-constrained devices, the lower accuracy of the models, together with the time needed to explore a specialized tailored model, hinders the implementation of other models and increases the deployment time.
From the academic community, the study by Hadidi et al. (2018a), in which several robots collaborate to perform distributed DNN computations, is the most similar to our work. The authors introduce only one method that uses model parallelism on fully-connected layers, and use data parallelism for convolution layers. They do not study how the processing pipeline or model-parallelism methods for convolution layers help the performance. Although they provide an algorithm to distribute the tasks, we find that our heuristics, combined with monitoring tools, significantly shorten the time to find a near-optimal distribution. This is because their algorithm needs access to the entire profiled data, which takes a long time to gather and does not always cover all cases. Using the same technique as in their work, we can also perform dynamic allocation during execution with similar concepts and tools. Another work, Neurosurgeon Kang et al. (2017), dynamically partitions a DNN model between a single edge device and the cloud, which incurs high network traffic and increases the risk of privacy loss. Furthermore, the partitioning is always between the cloud and only one edge device. DDNN Teerapittayanon et al. (2017) also tries to partition the model between edge devices and the cloud, but model retraining is necessary for each setting. In DDNN, sensor devices (edge devices) perform only the first few layers of the network, and the rest of the computation is offloaded to the cloud.
Another general direction is to reduce the overhead of DNNs by using methods such as weight pruning Yu et al. (2017); Han et al. (2016); Lin et al. (2017), resource partitioning Shen et al. (2017); Guo et al. (2017), quantization and low-precision inference Courbariaux et al. (2014); Gong et al. (2014); Vanhoucke et al. (2011); Köster et al. (2017), and binarizing weights Li et al. (2016); Courbariaux et al. (2016); Rastegari et al. (2016). Although these techniques reduce the overhead of DNNs, they require several additional steps that decrease the accuracy and force retraining of the model. In fact, our work could be applied on top of these techniques to further increase the final performance. In other words, our work is orthogonal to these techniques in the following respects, which were not covered by previous studies: (1) we study resource-constrained devices with limited memory space, (2) we increase the real-time performance of single-batch DNN inference, (3) we introduce many methods for model parallelism, and (4) we design a collaborative system.

3 Background
In this section, we provide an overview of convolution and dense layer computations to better understand how the model and data parallelism of the next section apply to these layers. Note that we introduce only these two layers because they are among the most compute- and data-intensive layers Venkataramani et al. (2017); Rhu et al. (2016) in visual models. Then, we introduce the models that are used in the evaluation section.
Dense Layer: In a dense or fully connected (fc) layer, the value of each output element is calculated from the weighted sum of all inputs. Figure 1 depicts a dense layer with size four and input size of two. Each activation is calculated from the sum of the products of inputs and weights as $a_j = \sum_i w_{ij} x_i + b_j$, in which $i$ is the input number, $j$ is the output number, inputs are denoted as $x_i$, weights as $w_{ij}$, biases as $b_j$, and activations as $a_j$. This formula may also be written in matrix notation as $a = Wx + b$. During the training phase, the parameters $W$ and $b$ are defined, and they remain constant during the inference phase.
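As a minimal sketch of the dense-layer computation just described, the following plain-Python function evaluates the weighted sum plus bias for every output. The function name, the row-per-output weight layout, and the numeric values are illustrative assumptions, not taken from the paper.

```python
def dense_forward(x, weights, biases):
    """Compute one dense layer: output j = sum_i weights[j][i] * x[i] + biases[j]."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


# Two inputs and four outputs, mirroring the layer size described above.
# The numeric values are made up for illustration.
x = [1.0, 2.0]
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0],
     [2.0, -1.0]]          # one row of weights per output
b = [0.0, 0.5, -1.0, 0.0]

a = dense_forward(x, W, b)
print(a)  # [1.0, 2.5, 2.0, 0.0]
```

Each output depends only on its own weight row and bias, which is the independence that the model-parallelism methods for dense layers rely on.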
Convolution Layer: DNN models are composed of several layers that process inputs. For visual models, usually all the layers except the last ones are convolution (conv) layers. A convolution layer applies a set of filters to a subset of the inputs by sweeping each filter (i.e., kernel) over them. Each filter creates one channel, or depth slice (i.e., along the z-axis), of the output (Figure 2). The spatial dimensions (i.e., x- and y-axis) of the output are defined by four parameters: the size of the input, the filter size, the stride, and the padding. In the simplest case of a convolution layer with just one filter, the filter slides across every position of the input, producing one element per position. Figure 3a illustrates applying a 3×3 filter to an input of size 4×4 with a unit stride, where the stride is the shifting amount of the kernel, and a padding of zero, where padding is defined as extra zeros appended to the input to control the output size. Figure 3b depicts the same example, but with a padding of one, which means the size of the new input is 6×6. Figure 3c shows formulas to calculate the output size in general cases. In this paper, we use same padding, which means the output size of a convolution layer is the same as its input size. For a unit stride and an odd filter size $K$, same padding is achieved by setting the padding to $(K-1)/2$. In other cases, one can simply replace the output dimensions with the formulas in Figure 3c. Note that Figure 3 shows 2D examples of convolution; similarly, we can extend such calculations to the 3D and 4D inputs and filters that are usually used for time series.

Other Layers: To introduce nonlinearity, an activation layer, such as Sigmoid or ReLU, is applied to the activations to create the input to the next layer. This allows a model to learn complex functions. In addition, sometimes a pooling layer, such as a max pooling (maxpool) or average pooling (avgpool) layer, downsamples the input and reduces the dimensions of the data. These layers, compared to fc and conv layers, are much less compute intensive, so we group them with their corresponding parent layer, which is the layer that produces their input.

3.1 Models Overview
We briefly overview the model architectures we use in the paper. We specifically target tasks in computer vision because of their heavy computations and fast-paced advancements. Moreover, we chose prevalent and state-of-the-art models.
AlexNet: In the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a challenge for the image recognition task, AlexNet Krizhevsky et al. (2012) significantly outperformed all prior competitors and won the challenge with a deeper CNN and more filters per layer. Figure 4 illustrates the single-stream AlexNet model, which consists of five convolution layers and three dense layers. This model has a total of 40M parameters.

VGG16: Figure 5 depicts the VGG16 model Simonyan & Zisserman (2015), which has 16 layers: 13 convolution layers and three dense layers. As seen, VGG16 has a structured model in which deeper convolution layers have more filters and smaller spatial dimensions. The number of parameters of VGG16 is around 140M, which is the highest among well-known image recognition models.

ResNet: The residual neural network (ResNet) He et al. (2016) introduced “skip connections” for training deeper networks in 2016. In this paper, we use ResNet50, which has 50 layers. Figure 6a illustrates the basic blocks of ResNet. This model is residual in the sense that a shortcut connection skips a block, which makes training easier for such a deep model. Although ResNet50 is a deep model with many layers, its total number of parameters is 25M.
Xception: The most recent and accurate image recognition model among the models we use is Xception Chollet (2016). This model is based on Inception V3 Szegedy et al. (2016). Xception extends the Inception module with the aim of processing cross-channel and spatial correlations independently. To this end, Xception introduces a special convolution layer, the separable convolution Chollet (2016), shown in Figure 6b, in which the mappings of cross-channel and spatial correlations are decoupled. A separable convolution first performs a cross-channel (i.e., pointwise, 1×1) convolution over the input channels, and then performs an independent spatial convolution on each of the outputs. Figure 8 shows the Xception model, which has 34 separable convolution layers. The total number of parameters for this model is 23M.

C3D: The Convolution 3D (C3D) Tran et al. (2015) model is designed to process videos and has been used in action recognition and scene classification tasks. To learn spatiotemporal features, the C3D model uses 3D convolutions, which produce an output volume instead of a 2D output per filter. Compared to a conventional convolution layer, an additional sweep along the z-axis creates a volume in the output. Figure 9 shows the C3D model, which consists of eight 3D convolution layers. The total number of parameters for this model is 80M.

4 Distributing and Parallelizing Inference
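| Name | #Nodes | Distributed Activation | Multiplications (per node) | Reductions (per node) | Weights (per node) | Communication (total per inference) | Merge Operation |
|---|---|---|---|---|---|---|---|
| No Splitting | 1 | N/A | | | | | N/A |
| Output Splitting | | ✓ | | | | | Concat |
| Input Splitting | | ✗ | | | | | Sum |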
In this section, we overview our methods for distributing and parallelizing inference computations for dense and convolution layers. We examine two general directions: data parallelism and model parallelism. In data parallelism, we rely on the presence of many data inputs to distribute and parallelize the computations. This direction enables us to increase the number of inferences per second while the time to process each input stays constant. In model parallelism, which is applicable to the computations required for a single input, the computations are distributed and parallelized over multiple compute nodes. By following this direction, we reduce the time to process an input.
Since a DNN has multiple layers, it has built-in pipeline parallelism. Hence, the first step is to divide a model across multiple devices by layers (or groups of layers) to utilize this pipeline parallelism. These layers process the input sequentially, and the output of each layer depends on the output of its previous layer(s). Thus, we must correctly maintain this dependency between layers. Using pipeline parallelism, we can increase the throughput of computation while the latency of each computation remains the same. We improve the performance further by applying data-level and model-level parallelism on top of pipeline parallelism.
Going Further than Data Parallelism: Data parallelism was already introduced by Hadidi et al. (2018a) for dense and convolution layers of real-world models. However, applying only data parallelism does not always work for resource-constrained devices and edge-device scenarios. Data parallelism duplicates a node that performs the same computation. Since the computation is the same, only on different input data, the memory footprint is not reduced. In fact, this is one of the main reasons why data parallelism alone is not enough for distributing and parallelizing DNN computations. The reasons are as follows. (1) For sufficiently large layers, merely duplicating devices does not give a good performance benefit, because the entire data does not fit in memory and a device pays a high cost for accessing off-chip storage (i.e., swap). (2) Data parallelism needs a stream of input data, but in some cases, such as single-image inference or sentence translation, we have only one data input or the input injection frequency is low. (3) To create a balanced and efficient data processing pipeline in our distributed system, we need a balanced pipeline design (i.e., the amount of computation per node should be almost the same). However, data parallelism is not flexible in adjusting the amount of computation on each node.
Model Parallelism: Model parallelism is based on the fact that, since the computations that a given layer performs to calculate its output are independent of each other, we can parallelize them. For instance, in a convolution layer, the computation of each element in the output is independent of all other elements. Therefore, in model parallelism, we can exploit such intra-layer independence of computations to increase parallelism. Employing such deeper-level parallelism, compared to data parallelism, requires knowledge of how each layer performs its computations and of how parallelism affects data communication, computation, and aggregation. In summary, in a DNN model, model parallelism distributes and parallelizes the computations of a single input. In the following, we introduce our model-parallelism methods for dense and convolution layers.
4.1 Model Parallelism for Dense Layers
In a dense layer, since the computation of each activation is independent of the other activations, we can parallelize the computations of the layer. We describe two model-parallelism methods specific to dense layers: output splitting and input splitting, shown in Figure 10a and b, respectively. In output splitting, we parallelize the computation of the activations across nodes, while we transmit all input data to all devices. Figure 10a highlights a node and the computations that derive its activation. As seen, we need to transmit all the inputs to each node. Moreover, each node holds the weights corresponding to its activations. Later, when each node is done with its computations, we merge the results. The merge consists of concatenating the values in the correct order. In addition, we can apply the activation function (e.g., ReLU or Sigmoid) either on each node or after the merge.
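Output splitting as described above can be sketched as follows. This is a single-process illustration under stated assumptions: the helper names (`dense_forward`, `output_split`), the contiguous row partitioning, and the numeric values are hypothetical, and network transfers are replaced by plain function calls.

```python
def dense_forward(x, weights, biases):
    """Dense layer: one weighted sum plus bias per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def output_split(weights, biases, num_nodes):
    """Give each node a contiguous block of output rows (weights and biases)."""
    chunk = -(-len(weights) // num_nodes)  # ceiling division
    return [(weights[s:s + chunk], biases[s:s + chunk])
            for s in range(0, len(weights), chunk)]


x = [1.0, 2.0, 3.0]
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0],
     [1.0, 1.0, 0.0],
     [2.0, -1.0, 1.0]]
b = [0.0, 0.5, -1.0, 0.0]

shards = output_split(W, b, num_nodes=2)
# Every node receives the full input x but holds only its own rows of W.
per_node = [dense_forward(x, w_k, b_k) for w_k, b_k in shards]
# Merge step: concatenate the per-node results in node order.
merged = [value for part in per_node for value in part]

print(merged)  # [7.0, 5.5, 2.0, 3.0]
```

Because every node produces complete activations, an element-wise activation function could be applied to each entry of `per_node` before the merge without changing the result.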
In input splitting, a node computes a part of every activation. Figure 10b illustrates an example in which a node computes half of the multiplications required for all the activations. In this method, we transmit a part of the input to each node. Furthermore, each node holds the weights corresponding to the input split that it processes. Later, when each node is done with its computations, we merge the results by adding all of the corresponding partial sums. Thus, the merge consists of a reduction operation (i.e., summation). However, contrary to the output splitting method, we cannot apply the activation function before the merge. The two methods may also be mixed, which creates a spectrum of methods; in this paper, however, we focus on the extreme cases of this spectrum.
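The following sketch illustrates input splitting in a single process and shows why the activation function has to wait until after the merge. The helper names and the numeric values are illustrative assumptions; ReLU stands in for any nonlinear activation.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def partial_sums(x, weights, num_nodes):
    """Node k sees only its slice of x and the matching columns of every weight row."""
    chunk = -(-len(x) // num_nodes)  # ceiling division
    return [[sum(w * v for w, v in zip(row[s:s + chunk], x[s:s + chunk]))
             for row in weights]
            for s in range(0, len(x), chunk)]


def merge_sum(partials, biases):
    """Merge step: add corresponding partial sums, then add the bias once."""
    return [sum(column) + b for column, b in zip(zip(*partials), biases)]


x = [1.0, 2.0, 3.0, 4.0]
W = [[1.0, -2.0, 1.0, 1.0],
     [0.0, 1.0, -1.0, 0.0]]
b = [0.0, 0.0]

partials = partial_sums(x, W, num_nodes=2)     # [[-3.0, 2.0], [7.0, -3.0]]
correct = relu(merge_sum(partials, b))         # activation after the merge
wrong = merge_sum([relu(p) for p in partials], b)  # activation before the merge

print(correct)  # [4.0, 0.0]
print(wrong)    # [7.0, 2.0]
```

The two results differ because a nonlinear function of a sum is generally not the sum of the function applied to each partial term.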
A more detailed summary of the mentioned methods is presented in Table 1. These methods trade communication for memory footprint: each node holds only a part of the weights, but more values need to be transmitted. The table expresses these costs in terms of the number of nodes and the input and output dimensions. As seen, both methods divide the memory footprint (i.e., saved weights) and the number of multiplications among the nodes. Input splitting slightly increases the number of reductions because each node must compute partial sums if it receives more than one input element. Furthermore, both output and input splitting incur a communication overhead, which is listed in Table 1.

We run a series of dense layers on a single device, and their distributed versions on two devices (in total four devices, with the initial sender and final receiver); Figure 11 shows the results. We cover output sizes from 512 to 16384 and two input sizes, 7680 (not a power of two) and 8192 (a power of two). As seen, for the input size of 7680 and large output sizes, we achieve superlinear speedups. This is because, in these cases, the single-device baseline uses slow off-chip storage (i.e., swap). On the other hand, for the input size of 8192, the baseline DNN framework can optimize accesses and avoid swap activity by tiling. Although the baseline DNN framework optimizes swap-space accesses, it cannot always hide their cost, for example for the input size of 7680, as shown, or for sizes larger than 8192. Thus, model parallelism helps us avoid such costs. Furthermore, the other speedup values in the figure are less than the ideal value of two because each distribution has a communication cost. Input splitting mostly has lower performance than output splitting since it cannot apply activations locally.
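The trade-off between memory footprint and communication can be made concrete with a small accounting sketch. The expressions below are derived from the textual description of the two methods (all inputs sent to every node for output splitting, all partial sums returned by every node for input splitting); they are an assumption of this sketch and are not copied from Table 1. The function name and dictionary keys are hypothetical, and the dimensions are assumed to divide evenly among the nodes.

```python
def dense_split_costs(n_in, n_out, num_nodes, method):
    """Per-node storage/work and total traffic for one inference (even split assumed)."""
    if method == "output":
        # Every node receives all inputs and returns its slice of the outputs.
        transferred = num_nodes * n_in + n_out
    elif method == "input":
        # Every node receives its slice of the inputs and returns all partial sums.
        transferred = n_in + num_nodes * n_out
    else:
        raise ValueError("method must be 'output' or 'input'")
    baseline = n_in + n_out  # single node: send the input once, receive the output once
    return {
        "weights_per_node": n_in * n_out // num_nodes,
        "multiplications_per_node": n_in * n_out // num_nodes,
        "elements_transferred": transferred,
        "overhead_vs_single_node": transferred - baseline,
    }


out_costs = dense_split_costs(8192, 4096, 2, "output")
in_costs = dense_split_costs(8192, 4096, 2, "input")
print(out_costs["weights_per_node"], out_costs["overhead_vs_single_node"])  # 16777216 8192
print(in_costs["weights_per_node"], in_costs["overhead_vs_single_node"])    # 16777216 4096
```

Under these assumptions, both methods halve the stored weights on two nodes, while the extra traffic grows with the input size for output splitting and with the output size for input splitting.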
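| Name | Division Factor | #Nodes | Distributed Activation | Weights (per node) | Input (per node) | Filters (per node) | Output (per node) | Communication (total per inference) | Merge Operation |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | N/A | 1 | N/A | | | | | | N/A |
| Channel | | | ✓ | | | | | | Concat |
| Spatial | | | ✓ | | Eq. 2 | | | | Concat |
| Filter | batches | | ✗ | | | | | | Sum |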
| | Channel Splitting | Spatial Splitting | Filter Splitting |
|---|---|---|---|
| Input | Entire input is copied | Input is divided spatially | Input is divided channel-wise |
| Filters | Some filters are saved | All filters are saved | Part of all filters are saved |
| Output | Each node calculates a channel | Each node calculates a spatial region | Each node calculates a partial output |
| Overhead | Input is copied across all nodes | Input overlapping elements | Output partial sums |
4.2 Model Parallelism for Convolutional Layers
As discussed, in a convolution layer, each filter creates a channel in the output data. As Figure 2 illustrates, let us assume the dimensions of the input, each filter, and the output are $W_i \times H_i \times C_i$, $K_w \times K_h \times C_f$, and $W_o \times H_o \times C_o$, respectively. The depth of the filters is defined by the depth of the input, or $C_f = C_i$. Here, without loss of generality, we assume square filters, $K_w = K_h = K$. The number of channels in the output is defined by the number of filters $F$, or $C_o = F$. Each filter contains $K^2 C_i$ weights that are set during training. Per output element, each filter performs $K^2 C_i$ multiplications of its weights and input values, and one reduction operation. So, for $F$ filters in a convolution layer, per output element, we perform $F K^2 C_i$ multiplications and $F$ reductions. Therefore, with the output size given by Figure 3, the total numbers of multiplications and reductions in a convolution layer for all elements are as below:

$$\text{multiplications} = W_o H_o F K^2 C_i, \qquad \text{reductions} = W_o H_o F \qquad (1)$$

For a single inference, the amount of communication is the sum of the numbers of input and output elements, or $W_i H_i C_i + W_o H_o C_o$. In the rest of this section, we describe our specific model-parallelism methods for convolution layers. Since each method has its own advantages and disadvantages depending on the target convolution layer, choosing the best method requires careful consideration. To summarize, Table 2 provides a detailed overview of the discussions in this section.
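The per-layer counts discussed above can be computed with a short helper. This sketch assumes the standard output-size formula, floor((input − kernel + 2·padding) / stride) + 1, which is taken to match Figure 3c, and it follows the counting convention of the text (one reduction per filter per output position). The function and key names are hypothetical.

```python
def conv_output_size(in_size, kernel, stride=1, padding=0):
    """Standard output-size formula for one spatial dimension."""
    return (in_size - kernel + 2 * padding) // stride + 1


def conv_layer_costs(in_w, in_h, in_c, kernel, num_filters, stride=1, padding=0):
    """Operation and communication counts for one single-batch inference."""
    out_w = conv_output_size(in_w, kernel, stride, padding)
    out_h = conv_output_size(in_h, kernel, stride, padding)
    positions = out_w * out_h  # spatial output positions per filter
    return {
        "output_shape": (out_w, out_h, num_filters),
        "multiplications": positions * num_filters * kernel * kernel * in_c,
        "reductions": positions * num_filters,
        "communication": in_w * in_h * in_c + positions * num_filters,
    }


print(conv_output_size(4, 3))             # 2  (4x4 input, 3x3 filter, no padding)
print(conv_output_size(4, 3, padding=1))  # 4  (same padding)
costs = conv_layer_costs(4, 4, 3, kernel=3, num_filters=2, padding=1)
print(costs["multiplications"], costs["reductions"], costs["communication"])  # 864 32 80
```

Such counts are the kind of input a distribution heuristic could use to compare how much work and traffic each layer contributes.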
Channel Splitting: In channel splitting, each node calculates a non-overlapping set of channels in the output. In other words, each node processes only a subset of the filters. Figure 12a shows an example output of this method with three nodes. The number of nodes needed is therefore the total number of filters divided by the number of filters processed per node. Each node needs only its own set of filters, but because each node takes the whole input, each needs a copy of the input data. Since the filters are divided, each node saves only the weights of its dedicated filters. The total number of multiplications and reductions remains the same, and each node handles its share of them. At the end, when every node is done with its computations, we concatenate their outputs depthwise. Because the nodes produce disjoint output channels, the total number of output elements to be transferred equals that of the baseline. We have the option to apply the activation function on each node or after the merge, since the activation function applies to every element independently. In total, based on Table 2, the communication overhead comes from transmitting a copy of the input to all nodes.
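A minimal single-process sketch of channel splitting is given below. It uses a naive stride-1 convolution on nested lists; the data layout (`[channel][row][column]` for the input, `[filter][channel][row][column]` for the filters), the helper names, and the test data are all assumptions made for illustration.

```python
def conv2d(inp, filters, pad):
    """Naive stride-1 convolution. inp: [C][H][W]; filters: [F][C][K][K]."""
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    K = len(filters[0][0])

    def at(c, y, x):  # zero padding outside the input
        return inp[c][y][x] if 0 <= y < H and 0 <= x < W else 0.0

    h_out, w_out = H - K + 2 * pad + 1, W - K + 2 * pad + 1
    return [[[sum(f[c][i][j] * at(c, y - pad + i, x - pad + j)
                  for c in range(C) for i in range(K) for j in range(K))
              for x in range(w_out)] for y in range(h_out)] for f in filters]


def channel_split_conv(inp, filters, num_nodes, pad):
    """Each node gets the full input and a disjoint subset of the filters."""
    per_node = -(-len(filters) // num_nodes)  # ceiling division
    merged = []
    for start in range(0, len(filters), per_node):
        node_out = conv2d(inp, filters[start:start + per_node], pad)
        merged.extend(node_out)  # depth-wise concatenation in node order
    return merged


# Small deterministic test data: 2 input channels of 4x4, four 3x3 filters.
inp = [[[float((c + 2 * y + x) % 5) for x in range(4)] for y in range(4)]
       for c in range(2)]
filters = [[[[float((f + c + i + j) % 3 - 1) for j in range(3)] for i in range(3)]
            for c in range(2)] for f in range(4)]

split_out = channel_split_conv(inp, filters, num_nodes=2, pad=1)
print(split_out == conv2d(inp, filters, 1))  # True
```

The merged result matches the single-node convolution because each output channel depends on exactly one filter.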
Spatial Splitting: In spatial splitting, instead of splitting the filters, we split the input spatially (along the x- and y-axis). Let us assume that we split each dimension into the same number of equal parts, so that the total number of parts is the square of that number, as shown in Figure 12b. (Dividing each dimension into equal parts, which we do here for simplicity, allows only square numbers of nodes; our implementation handles the general case with unequal parts, for which any number of nodes is possible.) We transmit each part of the input to a node. Furthermore, we need to extend each region with the elements that overlap with neighboring parts, so that we can perform the convolution on the borders. Therefore, the number of input data elements to be transmitted per node is:
(2) 
in which the first term represents the split input, and the second term represents the number of extra overlapping elements. Compared to channel splitting, in which we transmit a copy of the input to all nodes, here we pay the extra overhead only for the overlapping parts. Since each node processes all filters, each needs a copy of all weights. Hence, a full copy of the filter weights has to be transmitted to every node. However, note that this is a one-time cost for all inferences. The total number of multiplications and reductions stays the same, and each node processes only its share. When the computation of each node is done, we concatenate the outputs spatially. The cost of the concatenation is on the order of the total number of parts. Similar to the previous method, the total number of output elements to be transferred equals that of the baseline. We have the option to apply the activation function either on each node or after the merge. As discussed, the communication overhead of spatial splitting covers only the overlapping parts. Since the filter size is usually small, this overhead is not significant. Spatial splitting has another advantage: to generate a part of the output, we do not need to merge all the results. Therefore, when constructing a parallelized model, we can process a few convolution layers sequentially without merging their results after every layer, while maintaining correctness.
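Spatial splitting with overlapping border elements can be sketched as follows for the simple case of equal square tiles, stride one, same padding, and an odd filter size. The helper names and data layout are assumptions of this sketch. For brevity, the input is zero-padded once and each tile is cut from the padded input, so border tiles include padding zeros that a real deployment would not need to transmit.

```python
def conv2d(inp, filters, pad):
    """Naive stride-1 convolution. inp: [C][H][W]; filters: [F][C][K][K]."""
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    K = len(filters[0][0])

    def at(c, y, x):  # zero padding outside the input
        return inp[c][y][x] if 0 <= y < H and 0 <= x < W else 0.0

    h_out, w_out = H - K + 2 * pad + 1, W - K + 2 * pad + 1
    return [[[sum(f[c][i][j] * at(c, y - pad + i, x - pad + j)
                  for c in range(C) for i in range(K) for j in range(K))
              for x in range(w_out)] for y in range(h_out)] for f in filters]


def pad_input(inp, pad):
    """Surround every channel with `pad` rows and columns of zeros."""
    width = len(inp[0][0]) + 2 * pad
    return [[[0.0] * width for _ in range(pad)]
            + [[0.0] * pad + row + [0.0] * pad for row in ch]
            + [[0.0] * width for _ in range(pad)] for ch in inp]


def spatial_split_conv(inp, filters, parts):
    """Split H and W into `parts` equal pieces each; one tile per node."""
    K = len(filters[0][0])
    pad = (K - 1) // 2                      # same padding, odd K assumed
    H, W = len(inp[0]), len(inp[0][0])
    th, tw = H // parts, W // parts         # H and W assumed divisible by parts
    padded = pad_input(inp, pad)
    out = [[[0.0] * W for _ in range(H)] for _ in filters]
    for ty in range(parts):
        for tx in range(parts):
            y0, x0 = ty * th, tx * tw
            # Tile plus the overlapping elements needed at its borders.
            tile = [[row[x0:x0 + tw + 2 * pad] for row in ch[y0:y0 + th + 2 * pad]]
                    for ch in padded]
            tile_out = conv2d(tile, filters, 0)   # work done on one node
            for f, channel in enumerate(tile_out):
                for dy, row in enumerate(channel):
                    out[f][y0 + dy][x0:x0 + tw] = row  # spatial concatenation
    return out


inp = [[[float((c + 2 * y + x) % 5) for x in range(4)] for y in range(4)]
       for c in range(2)]
filters = [[[[float((f + c + i + j) % 3 - 1) for j in range(3)] for i in range(3)]
            for c in range(2)] for f in range(4)]

print(spatial_split_conv(inp, filters, parts=2) == conv2d(inp, filters, 1))  # True
```

Each node here holds all filters but only a small region of the input, which is why the extra traffic is limited to the overlapping border elements.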
Filter Splitting: In filter splitting, both the input and the filters are split channel-wise into equally sized batches of channels. Figure 13a illustrates the base case of the convolution of one filter, which produces a single channel in the output. Figure 13b illustrates the same filter under the filter splitting method. Both the input and the filter are divided into three parts, each of which is processed separately. Since there is a one-to-one correspondence between input and filter elements, each node computes a partial output. In the end, to create the final output, we need to sum all corresponding elements and then apply the activation function. We need one node per batch, that is, the number of input channels divided by the batch size. Since the input is split channel-wise, the input elements are transferred without overhead: each node receives only its own channels, and the total equals the input size. Similarly, each node saves only its dedicated channels of all filters, so the memory footprint is also divided. However, since each node sends a full-size partial output to the merging node, transmitting the output elements incurs an overhead compared to the baseline that grows with the number of nodes. In addition, to create the final output, we need to perform extra element-wise reductions over the partial outputs.
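Filter splitting can be sketched in the same single-process style. Each simulated node convolves its batch of input channels with the matching channels of every filter, and the merge adds the partial outputs element-wise. The helper names, data layout, and test data are assumptions; integer-valued data is used so that the comparison with the unsplit result is exact despite the different summation order.

```python
def conv2d(inp, filters, pad):
    """Naive stride-1 convolution. inp: [C][H][W]; filters: [F][C][K][K]."""
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    K = len(filters[0][0])

    def at(c, y, x):  # zero padding outside the input
        return inp[c][y][x] if 0 <= y < H and 0 <= x < W else 0.0

    h_out, w_out = H - K + 2 * pad + 1, W - K + 2 * pad + 1
    return [[[sum(f[c][i][j] * at(c, y - pad + i, x - pad + j)
                  for c in range(C) for i in range(K) for j in range(K))
              for x in range(w_out)] for y in range(h_out)] for f in filters]


def filter_split_conv(inp, filters, batch, pad):
    """Split input and filters channel-wise into batches; sum the partial outputs."""
    partials = []
    for c0 in range(0, len(inp), batch):
        inp_k = inp[c0:c0 + batch]                     # this node's input channels
        filt_k = [f[c0:c0 + batch] for f in filters]   # matching channels of all filters
        partials.append(conv2d(inp_k, filt_k, pad))    # full-size partial output
    n_f, h, w = len(partials[0]), len(partials[0][0]), len(partials[0][0][0])
    # Merge step: element-wise reduction over the nodes (activation would follow).
    return [[[sum(p[f][y][x] for p in partials) for x in range(w)]
             for y in range(h)] for f in range(n_f)]


# 4 input channels of 4x4 and three 3x3 filters, all integer-valued.
inp = [[[float((c + 2 * y + x) % 5) for x in range(4)] for y in range(4)]
       for c in range(4)]
filters = [[[[float((f + c + i + j) % 3 - 1) for j in range(3)] for i in range(3)]
            for c in range(4)] for f in range(3)]

print(filter_split_conv(inp, filters, batch=2, pad=1) == conv2d(inp, filters, 1))  # True
```

Every node returns an output of full size, so the output traffic scales with the number of nodes, while the input and the weights are divided without duplication.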
Methods Comparison: Now that we know different methods for model parallelism, how should we choose the best-performing one for a specific convolution layer? Table 3 presents a comparison of these methods. For instance, channel splitting has the overhead of copying the input, whereas filter splitting has to transmit partial sums. How strongly these differences affect the performance is defined by the properties of the convolution layer. As an illustration, in Figure 14, we run a convolution layer with several kernel sizes, filter depths of 128 and 512, and various input depths. We distribute the layer on three Raspberry Pis using the mentioned methods (in total five devices, with the initial sender and final receiver). Speedups are relative to single-device execution. We see that, for one of the kernel sizes with a filter depth of 128, smaller input depths yield no speedup. This is because the amount of computation per node after distribution is quite small. However, for larger input depths, since the amount of computation after distribution is more balanced, we see a speedup. We discuss this further in Section 5. Furthermore, we see that in most cases spatial splitting performs better. This is because spatial splitting, contrary to the other methods, does not have a significant communication overhead, and since we distribute on only three devices, the number of overlapping elements (i.e., the additional computation in spatial splitting) is not high. This is also why, for larger kernels, where the number of overlapping elements increases, the advantage of spatial splitting over the other methods is less noticeable.
5 Finding a NearOptimal Distribution
To understand why distributing and parallelizing DNN computations is necessary for resource-constrained devices, Figure 15 shows the memory usage and the time to process an input (i.e., latency) of some layers in C3D and VGG16. As we see, the first dense layers of both models have extremely long latencies (on the order of minutes, not shown) because of their high memory footprint and low compute intensity, which cause the use of swap space. Hence, model parallelism is necessary for them. Convolution layers have a much smaller memory footprint, but with a few layers on a device we will eventually exceed the available memory of the device and face the same issue as with dense layers. Moreover, for convolution layers, the latency of a single computation is long and not suitable for real-time processing, as shown in Figure 15b and c. Note that most DNN models have more than ten layers, and here we show the statistics for only one of them. The mentioned challenges are exacerbated with more layers. In summary, the total latency of executing the entire model on a single resource-constrained device is much longer because (1) limited memory causes extremely slow swap-space activity, and (2) the latencies of all layers accumulate because there is no parallelization opportunity. The model-parallelism methods of the previous section help us solve these challenges because they reduce the memory footprint and exploit more compute resources.
Now that we are able to distribute and parallelize DNN computations using model and data parallelism, the question is: How can we find a near-optimal distribution for a given number of nodes? The distributed system that we study is essentially a processing pipeline for a DNN model. Our goal is to increase the execution performance of DNN models when performing single-batch inference, in terms of the number of inferences per second (IPS) (higher is better) and latency (lower is better). In general, if we have $W$ amount of work and $n$ workers, our speedup would be
$$\text{Speedup} = \frac{W}{\frac{W}{n} + \text{overhead}} \qquad (3)$$
in which the overhead comprises the communication overhead (proportional to data size) and some fixed overhead, such as the network setup between devices. If the communication overhead dominates our distribution, then we will have a slowdown (as seen in the convolution-layer study in Section 4). To avoid such scenarios, we need to (1) avoid unnecessary splits to reduce the amount of communication overhead, and (2) assign enough work per node that the benefit of parallelizing exceeds the communication overhead. To do so, we merge less compute-intensive layers together on a single node. We also monitor idle nodes and, by combining layers, increase the utilization of each node, thereby achieving a balanced pipeline.
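The trade-off between work per node and overhead can be made concrete with a small numerical sketch. It assumes that the speedup takes the common form W / (W/n + overhead), with work and overhead expressed in the same time units; the function names and the example values are hypothetical and are not measurements from our system.

```python
def speedup(work, workers, overhead):
    """Speedup of splitting `work` across `workers` nodes when the split
    adds `overhead` (communication plus fixed costs), in the same time units.
    Assumes the model: speedup = W / (W / n + overhead)."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return work / (work / workers + overhead)


def max_overhead_for_gain(work, workers):
    """Largest overhead for which the assumed model still yields speedup > 1.
    Derived from W / (W/n + o) > 1, i.e., o < W * (1 - 1/n)."""
    return work * (1.0 - 1.0 / workers)


if __name__ == "__main__":
    # A heavy layer benefits from the split; a light layer is slowed down.
    print(speedup(work=0.9, workers=3, overhead=0.1))
    print(speedup(work=0.3, workers=3, overhead=0.4))
    print(max_overhead_for_gain(work=0.3, workers=3))
```

Under this model, a layer whose overhead exceeds W(1 - 1/n) is better left unsplit or merged with neighboring layers, which matches the two guidelines above.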
Generating a Balanced Pipeline: To achieve a near-optimal distribution in our pipeline of IoT nodes, each node’s latency should be similar to that of the other nodes. Thus, the amount of work per node should be the same. Model parallelism gives us access to smaller granularities of work during distribution and, therefore, shorter latencies. Data parallelism, on the other hand, does not change the amount of work per node but increases the throughput. In other words, since the throughput increases (depending on the number of nodes), the work on these data-parallel nodes can have a higher latency than that of other nodes in the pipeline. With these considerations, to generate a distribution, we first create a database with a mix of (i) regression models based on the amount of work and the type of layer, and (ii) profiled data from some layers and their split versions (as seen in Section 4). Then, we study the given DNN model layer by layer. If the memory footprint of a layer is large and causes swap activity, we first apply model parallelism to that layer. After that, we group less compute-intensive (sequential) layers to reduce the communication overhead mentioned before. The grouping is done so that the average latency for processing an input is similar on each device. After deploying this initial distribution, we monitor the queue occupancy and latency of each device. With the newly gathered data, we repeat the above steps and fine-tune the distribution. Procedure 1 summarizes these steps.
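The grouping step can be sketched as follows. This is an illustrative stand-in and not Procedure 1 itself: it covers only the grouping of sequential layers, ignores memory limits and communication cost, and uses a standard technique (binary search on the largest allowed group latency) to split per-layer latencies into at most a given number of contiguous groups. The function names and the latencies, given in integer milliseconds, are hypothetical.

```python
def can_partition(latencies_ms, nodes, limit):
    """True if the layers fit into at most `nodes` contiguous groups,
    each with a total latency of at most `limit`."""
    groups, current = 1, 0
    for lat in latencies_ms:
        if lat > limit:
            return False
        if current + lat > limit:
            groups += 1
            current = lat
        else:
            current += lat
    return groups <= nodes


def balanced_groups(latencies_ms, nodes):
    """Return (groups, bottleneck): contiguous groups of layer indices and
    the smallest achievable latency of the slowest group."""
    lo, hi = max(latencies_ms), sum(latencies_ms)
    while lo < hi:                      # binary search on the bottleneck latency
        mid = (lo + hi) // 2
        if can_partition(latencies_ms, nodes, mid):
            hi = mid
        else:
            lo = mid + 1
    groups, current, total = [], [], 0
    for i, lat in enumerate(latencies_ms):
        if current and total + lat > lo:
            groups.append(current)
            current, total = [], 0
        current.append(i)
        total += lat
    groups.append(current)
    return groups, lo


if __name__ == "__main__":
    layer_latencies = [120, 80, 60, 200, 40, 100]   # hypothetical, in ms
    print(balanced_groups(layer_latencies, nodes=3))
```

In a deployed system, the latencies would come from the profiling database and from the runtime measurements, and the grouping would be repeated as new data arrives.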
6 System Evaluation
We evaluate our method on a distributed system with Raspberry Pi 3s Raspberry Pi Foundation (2017) (Table 4). To show how our distribution heuristics provide better performance, we compare our results with a randomly assigned distributed system. The randomly assigned distributed system ensures correctness but does not design an optimal processing pipeline. In detail, random assignment can create an unbalanced amount of work among the nodes, or assign a memory-intensive computation to a single node. For all implementations, we use Keras 2.1 Chollet et al. (2015) with the TensorFlow backend (version 1.5) Abadi et al. (2015). For RPC calls and serialization, we use Apache Avro Apache Software Foundation (2017). We use a local network with a measured bandwidth of 94.1 Mbps and a measured client-to-client latency of 0.4 ms for 64 B. All trained weights are loaded to each Pi’s storage, so each Pi can be assigned to any task.
Table 4: Raspberry Pi 3 specification.
CPU  1.2 GHz Quad Core ARM Cortex-A53
Memory  900 MHz 1 GB RAM LPDDR2 
GPU  No GPGPU Capability 
Price  $35 (Board) + $5 (SD Card) 
To create the pipeline, after finding a distribution of computations, we create a single file containing a dictionary of the IP addresses and their assigned computations. We upload the file to all nodes, and each node, by reading the model description and its assigned computation, finds its position in the pipeline. After handshaking, which takes less than one minute, the system is ready for processing. At runtime, each node reports its latency and a histogram of its request-queue occupancy. By collecting these statistics, we are able to find bottleneck nodes in our pipeline and create a more balanced pipeline, as Procedure 1 describes.
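As an illustration of how a node could locate itself from such a file, the sketch below uses a JSON dictionary that maps IP addresses to lists of layer names. The file format, the layer names, the IP addresses, and the function names are all assumptions made for this example; the actual file layout is not specified here, and model-parallel splits of a single layer are not covered.

```python
import json

# Hypothetical model description: the ordered layer names of the DNN.
MODEL_LAYERS = ["conv1", "conv2", "conv3", "fc1", "fc2"]

# Hypothetical assignment file: IP address -> layers computed on that node.
ASSIGNMENT_JSON = """
{"192.168.1.11": ["conv1", "conv2"],
 "192.168.1.12": ["conv3"],
 "192.168.1.13": ["fc1", "fc2"]}
"""


def pipeline_order(assignment, model_layers):
    """Order node IPs by the position of their first assigned layer."""
    index = {name: i for i, name in enumerate(model_layers)}
    return sorted(assignment, key=lambda ip: index[assignment[ip][0]])


def find_position(assignment, model_layers, my_ip):
    """Return (position, previous_ip, next_ip) of `my_ip` in the pipeline."""
    order = pipeline_order(assignment, model_layers)
    pos = order.index(my_ip)            # raises ValueError for an unknown node
    prev_ip = order[pos - 1] if pos > 0 else None
    next_ip = order[pos + 1] if pos + 1 < len(order) else None
    return pos, prev_ip, next_ip


if __name__ == "__main__":
    assignment = json.loads(ASSIGNMENT_JSON)
    print(find_position(assignment, MODEL_LAYERS, "192.168.1.12"))
```

Because every node reads the same file, each node can derive its upstream and downstream neighbors independently before the handshake.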
In the remainder of this section, we analyze the application of the described model-parallelism methods to several models. Note that if a layer-by-layer analysis shows that a model does not require model parallelism, we do not describe it here. This is because data parallelism can help these models tremendously by grouping the layers together and replicating the work on several nodes. For instance, Resnet50 contains several low-latency layers. Therefore, to distribute it, we can easily create groups of layers with even latencies and use data parallelism to increase its performance.
AlexNet & VGG16: In this set of experiments, we deploy the entire AlexNet and VGG16 models, including the last dense layers, on distributed systems. AlexNet, as introduced in Section 3, has 8 layers in total. Since the first dense layer in AlexNet runs into the memory limit, all of our distributions perform output splitting for this layer. The remaining convolution layers are allocated to idle nodes. Our two example systems have 4 and 6 devices and achieve around 2× speedup compared to randomly distributed systems, as seen in Figure 17. Because all AlexNet layers have low compute requirements (per-layer latency of less than 0.2 s), we could not gain more benefit by distributing the computations. VGG16, in contrast, consists of more computationally intensive layers. Therefore, as Figure 17 shows, we use 8 and 11 devices for distribution to achieve up to 6× speedup. Note that, as with AlexNet, since we include the first dense layer, all of our distributions perform output splitting for this layer. For the other layers, to gain better insight, Figure 18 shows the measured layer-wise latency of VGG16 layers executed on a Raspberry Pi. Except for the first dense layers, we are able to run all other layers on a single Raspberry Pi. However, some layers have extremely long latencies, so our pipeline would be bounded by such layers (e.g., the second convolution layer). Our 8- and 11-device systems bypass this bottleneck by using the methods proposed in Section 4 to achieve these speedups.
C3D: To understand when applying model-parallelism methods is appropriate, we analyze the layer-by-layer latency of C3D on the Raspberry Pi in Figure 19a. As shown, C3D is quite heavy for resource-constrained devices because of its high-latency layers. This high latency is caused by a few convolution layers. The memory footprint of these convolution layers fits in our device memory; thus, the latency is caused by their heavy computations. In other words, because C3D has 3D convolutions, a key layer in understanding temporal content that has high computation demands, our model will experience a huge slowdown when being distributed. To see how model parallelism can help in these situations, we apply our three model-parallelism methods with three devices to the second convolution layer. As seen, we can get up to 2.6× speedup by using only three devices.
Resnet50 & Xception: We performed a similar latency analysis for Resnet50. However, as seen from the models in Section 3, since the latency of each layer is so short (less than 0.2 seconds per layer), there is not much opportunity for applying model parallelism to these models. Figure 16 provides a detailed latency overview of Xception. As seen, the highest latency is less than 0.2 seconds. In this case, data-level parallelism will provide a speedup that scales linearly with the number of nodes.
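The effect of data parallelism on a pipeline can be sketched with a simple throughput model. The sketch assumes an idealized steady state in which each stage has a fixed latency, replicas of a stage process different inputs independently, and communication overhead is ignored; the function name and the latencies are hypothetical.

```python
def pipeline_ips(stage_latencies_s, replicas=None):
    """Steady-state inferences per second (IPS) of a pipeline.
    Stage i takes stage_latencies_s[i] seconds per input and is served by
    replicas[i] data-parallel nodes; the slowest stage bounds the pipeline."""
    if replicas is None:
        replicas = [1] * len(stage_latencies_s)
    if len(replicas) != len(stage_latencies_s):
        raise ValueError("one replica count is required per stage")
    return min(r / lat for lat, r in zip(stage_latencies_s, replicas))


if __name__ == "__main__":
    balanced = [0.2, 0.2]                   # two groups of low-latency layers
    print(pipeline_ips(balanced))           # one node per stage
    print(pipeline_ips(balanced, [2, 2]))   # doubling the nodes per stage
    unbalanced = [0.2, 0.8]                 # one slow stage bounds the pipeline
    print(pipeline_ips(unbalanced))
    print(pipeline_ips(unbalanced, [1, 4])) # replicate only the slow stage
```

In this idealized model, replicating every stage by the same factor scales the throughput by that factor, while an unbalanced pipeline stays bounded by its slowest stage until that stage is replicated or split.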
7 Conclusion
In this work, we proposed a solution to help move the computations of DNNs closer to edge devices. Our target was resource-constrained devices, such as prevalent IoT devices, that have small memory and low computation power. We increased the real-time performance of single-batch inference by deploying a processing pipeline that exploits the collaboration between such devices. To overcome the limited memory and compute power of these devices, we introduced several model-parallelism methods and thoroughly analyzed their costs and benefits. Finally, we deployed processing pipelines for a few state-of-the-art visual models. For future work, we plan to extend our work beyond visual DNNs, covering areas such as translation and speech recognition. Furthermore, we are studying various methods for alleviating the communication overhead, such as bypassing merging in the next layer, compression, and using coded distribution.
References
 Abadi et al. (2015) Abadi, M. et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
 Apache Software Foundation (2017) Apache Software Foundation. Apache Avro. https://avro.apache.org, 2017. [Online; accessed 9/10/18].
 Biscotti et al. (2014) Biscotti, F., Skorupa, J., Contu, R., et al. The Impact of the Internet of Things on Data Centers. Gartner Research, 18, 2014.

 Chollet (2016) Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. arXiv preprint, 2016.
 Chollet et al. (2015) Chollet, F. et al. Keras. https://github.com/fchollet/keras, 2015.
 Courbariaux et al. (2014) Courbariaux, M., Bengio, Y., and David, J.-P. Training Deep Neural Networks with Low Precision Multiplication. arXiv preprint arXiv:1412.7024, 2014.
 Courbariaux et al. (2016) Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
 Gartner (2015) Gartner, I. Gartner Says 6.4 Billion Connected ”Things” Will Be in Use in 2016, Up 30 Percent From 2015. https://www.gartner.com/newsroom/id/3165317, 2015. [Online; accessed 9/10/18].
 Gong et al. (2014) Gong, Y., Liu, L., Yang, M., and Bourdev, L. Compressing Deep Convolutional Networks Using Vector Quantization. arXiv preprint arXiv:1412.6115, 2014.
 Google (2017a) Google. MobileNets: Open-Source Models for Efficient On-Device Vision. https://ai.googleblog.com/2017/06/mobilenetsopensourcemodelsfor.html, 2017a. [Online; accessed 9/10/18].
 Google (2017b) Google. Introduction to TensorFlow Lite. https://www.tensorflow.org/mobile/tflite/, 2017b. [Online; accessed 9/10/18].
 Gubbi et al. (2013) Gubbi, J., Buyya, R., Marusic, S., and Palaniswami, M. Internet of things (iot): A vision, architectural elements, and future directions. Future generation computer systems, 29(7):1645–1660, 2013.
 Guo et al. (2017) Guo, J., Yin, S., Ouyang, P., Liu, L., and Wei, S. Bit-Width Based Resource Partitioning for CNN Acceleration on FPGA. In FCCM’17, 2017.
 Hadidi et al. (2018a) Hadidi, R., Cao, J., Ryoo, M. S., and Kim, H. Distributed perception by collaborative robots. IEEE Robotics and Automation Letters (RAL), Invited to IEEE/RSJ International Conference on Intelligent Robots and Systems 2018 (IROS), 3(4):3709–3716, Oct 2018a. ISSN 23773766. doi: 10.1109/LRA.2018.2856261.
 Hadidi et al. (2018b) Hadidi, R., Cao, J., Woodward, M., Ryoo, M., and Kim, H. Musical chair: Efficient real-time recognition using collaborative iot devices. arXiv preprint arXiv:1802.02138, 2018b.
 Hadidi et al. (2018c) Hadidi, R., Cao, J., Woodward, M., Ryoo, M. S., and Kim, H. Real-time image recognition using collaborative iot devices. In Proceedings of the 1st on Reproducible Quality-Efficient Systems Tournament on Co-designing Pareto-efficient Deep Learning, ReQuEST ’18, New York, NY, USA, 2018c. ACM. ISBN 9781450359238. doi: 10.1145/3229762.3229765. URL http://doi.acm.org/10.1145/3229762.3229765.
 Han et al. (2016) Han, S., Mao, H., and Dally, W. J. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding. In ICLR’16. ACM, 2016.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. In CVPR’16, pp. 770–778. IEEE, 2016.
 Kang et al. (2017) Kang, Y., Hauswald, J., Gao, C., Rovinski, A., Mudge, T., Mars, J., and Tang, L. Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge. In ASPLOS’17, pp. 615–629. ACM, 2017.
 Khan et al. (2012) Khan, R., Khan, S. U., Zaheer, R., and Khan, S. Future Internet: The Internet of Things Architecture, Possible Applications and Key Challenges. In FIT’12, pp. 257–260. IEEE, 2012.
 Köster et al. (2017) Köster, U., Webb, T., Wang, X., Nassar, M., Bansal, A. K., Constable, W., Elibol, O., Gray, S., Hall, S., Hornof, L., et al. Flexpoint: An adaptive numerical format for efficient training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 1742–1752, 2017.

 Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet Classification With Deep Convolutional Neural Networks. In NIPS’12, pp. 1097–1105. ACM, 2012.
 Lee & Lee (2015) Lee, I. and Lee, K. The Internet of Things (IoT): Applications, Investments, and Challenges for Enterprises. Business Horizons, 58(4):431–440, 2015.
 Li et al. (2016) Li, F., Zhang, B., and Liu, B. Ternary Weight Networks. arXiv preprint arXiv:1605.04711, 2016.
 Li et al. (2015) Li, S., Da Xu, L., and Zhao, S. The internet of things: a survey. Information Systems Frontiers, 17(2):243–259, 2015.
 Lin et al. (2017) Lin, J., Rao, Y., Lu, J., and Zhou, J. Runtime neural pruning. In Advances in Neural Information Processing Systems, pp. 2181–2191, 2017.
 Mao et al. (2017) Mao, J., Chen, X., Nixon, K. W., Krieger, C., and Chen, Y. MoDNN: Local Distributed Mobile Computing System for Deep Neural Network. In DATE’17, pp. 1396–1401. IEEE, 2017.
 Microsoft (2017) Microsoft. Embedded Learning Library (ELL). https://microsoft.github.io/ELL/, 2017. [Online; accessed 9/10/18].
 Raspberry Pi Foundation (2017) Raspberry Pi Foundation. Raspberry Pi 3. https://www.raspberrypi.org/products/raspberrypi3modelb/, 2017. [Online; accessed 9/10/18].
 Rastegari et al. (2016) Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. XNOR-Net: Imagenet Classification Using Binary Convolutional Neural Networks. In ECCV’16, pp. 525–542. Springer, 2016.
 Rhu et al. (2016) Rhu, M., Gimelshein, N., Clemons, J., Zulfiqar, A., and Keckler, S. W. vdnn: Virtualized deep neural networks for scalable, memory-efficient neural network design. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 18. IEEE Press, 2016.
 Satyanarayanan (2017) Satyanarayanan, M. The emergence of edge computing. Computer, 50(1):30–39, 2017.
 Shen et al. (2017) Shen, Y., Ferdman, M., and Milder, P. Maximizing CNN Accelerator Efficiency Through Resource Partitioning. In ISCA’17. IEEE, 2017.
 Simonyan & Zisserman (2015) Simonyan, K. and Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR’15. ACM, 2015.

 Szegedy et al. (2016) Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.
 Teerapittayanon et al. (2017) Teerapittayanon, S., McDanel, B., and Kung, H. Distributed Deep Neural Networks Over the Cloud, the Edge and End Devices. In ICDCS’17, pp. 328–339, 2017.
 Tran et al. (2015) Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Computer Vision (ICCV), 2015 IEEE International Conference on, pp. 4489–4497. IEEE, 2015.
 Vanhoucke et al. (2011) Vanhoucke, V., Senior, A., and Mao, M. Z. Improving the Speed of Neural Networks on CPUs. In Proceeding Deep Learning and Unsupervised Feature Learning NIPS Workshop, volume 1, pp. 4. ACM, 2011.
 Venkataramani et al. (2017) Venkataramani, S., Ranjan, A., Banerjee, S., Das, D., Avancha, S., Jagannathan, A., Durg, A., Nagaraj, D., Kaul, B., Dubey, P., et al. Scaledeep: A scalable compute architecture for learning and evaluating deep networks. In ISCA’17, pp. 13–26. ACM, 2017.
 Yu et al. (2017) Yu, J., Lukefahr, A., Palframan, D., Dasika, G., Das, R., and Mahlke, S. Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism. In ISCA’17, pp. 548–560. IEEE, 2017.