Inference Time Optimization Using BranchyNet Partitioning

05/01/2020, by Roberto G. Pacheco, et al.

Deep Neural Network (DNN) applications with edge computing present a trade-off between responsiveness and computational resources. On one hand, edge computing can provide high responsiveness by deploying computational resources close to end devices, a level of responsiveness that most cloud computing services cannot offer. On the other hand, DNN inference requires computational power that may not be available on edge devices, but that a cloud server can provide. To address this trade-off, we partition a DNN between the edge device and the cloud server, which means the first DNN layers are processed at the edge and the remaining layers in the cloud. This paper proposes an optimal DNN partitioning that accounts for the network bandwidth, the computational resources of the edge and the cloud, and parameters inherent to the data. Our proposal aims to minimize the inference time, to enable highly responsive applications. To this end, we show the equivalence between the DNN partitioning problem and the shortest path problem, finding an optimal solution using Dijkstra's algorithm.


I Introduction

Deep Neural Networks (DNNs) are widely employed in machine learning applications, such as computer vision and speech recognition. A DNN is composed of layers of neurons, where each neuron receives inputs and generates a non-linear output. In summary, a DNN architecture is composed of an input layer, a sequence of middle layers, and an output layer. For image classification, DNN inference executes a feed-forward algorithm to label an input image with one of the predefined classes. In this algorithm, each layer receives the output data from the prior layer, executes a computation, and then propagates its output data to the next layer. DNN inference executes this algorithm from the input layer through the middle layers, until it reaches the output layer, which generates the probability of each predefined class [5].

Traditionally, DNNs can be deployed on end devices (e.g., smartphones and personal assistants) or on a cloud server [7, 4]. DNN inference generally requires high computational power, and its execution on resource-constrained end devices can result in a prohibitive processing delay. DNN inference can thus be executed in a cloud computing infrastructure, which is generally equipped with computational resources that accelerate processing, such as GPUs (Graphics Processing Units). In cloud-based solutions, end devices gather the data and transmit it to the cloud server, which executes the DNN inference. This adds a data communication delay, which is affected by the network behavior between the end device and the cloud, increasing the inference time. This is a severe problem since recent DNN applications, such as cognitive assistance and intelligent vehicles, require high responsiveness [1]. It is necessary to reduce communication and processing delays to achieve high responsiveness. The former is the time required to send data, through the Internet, from the end device to the cloud server. The latter is the time to perform the inference itself, which depends on the employed hardware. Edge computing emerges as an alternative to reduce the communication delay imposed by cloud computing [7]. This paradigm consists of deploying computational resources at the edge of the Internet (i.e., close to end devices), reducing the communication delay. Edge servers can be installed in locations such as cellular base stations and Wi-Fi access points. Nevertheless, edge devices provide a computational capacity significantly lower than that of the cloud, which adds processing delay. Therefore, edge computing reduces the communication delay but increases the processing delay compared to the cloud. Thus, considering responsiveness, there is a clear trade-off between the communication and processing delays.

In the literature, there are proposals to handle each of the delays related to DNN inference. To reduce the processing delay, BranchyNet proposes classifying an input sample at the middle layers if a certain confidence level is achieved. Regarding the communication delay, DNN partitioning proposes computing the first DNN layers at the edge device and the remaining ones at the cloud server. This proposal is based on the fact that the communication delay to send data from middle layers is significantly lower than the delay to send a raw image [3]. This paper combines BranchyNet and DNN partitioning to evaluate the trade-off between processing and communication delays. To address this trade-off, this paper formalizes an optimization problem whose objective is to find an optimal partition that minimizes the inference time of a BranchyNet. This optimization problem depends not only on the network bandwidth and the computational power of the edge and the cloud, but also on aspects inherent to the input data, such as image quality. To this end, we model the inference time of a BranchyNet. Then, to minimize the inference time, we show the equivalence between BranchyNet partitioning and the shortest path problem. Thus, we can derive a globally optimal solution in polynomial time.

This paper is organized as follows. Section II reviews related work on DNN partitioning. Section III presents basic concepts of BranchyNet. Section IV models the inference time for this DNN type. Then, Section V formalizes the BranchyNet partitioning problem. The experiments are shown in Section VI. Finally, Section VII concludes this paper and suggests future directions.

II Related Work

To accelerate DNN inference, several prior works study how to partition a DNN between edge devices and the cloud server. Neurosurgeon [4] constructs performance prediction models based on DNN architectures, which estimate the processing delay at the edge device and at the cloud server. These prediction models are then combined with the wireless network conditions to dynamically select the best partition. However, Neurosurgeon is limited to chain-topology DNNs. To address this limitation, DADS [3] (Dynamic Adaptive DNN Surgery) optimally partitions a DNN with a general DAG (Directed Acyclic Graph) topology. To this end, DADS treats the partitioning problem as a min-cut problem. These papers propose DNN partitioning methods for DNNs with no side branches. Regarding DNNs with side branches (i.e., BranchyNet), Li et al. [6] propose a partitioning method that, given a latency requirement, maximizes the inference accuracy using a brute-force search. This method may be infeasible for increasingly deeper DNNs. Unlike these previous works, our paper optimally partitions a BranchyNet to minimize the inference time. Moreover, this paper is the first work to model the inference time of a BranchyNet, considering the probability that a sample is classified at a side branch as a factor that impacts the inference time. We then convert the BranchyNet partitioning problem into a shortest path problem.

III BranchyNet

BranchyNet is a DNN architecture whose goal is to accelerate inference. This architecture is based on the idea that the features extracted by the first layers can correctly label a large number of samples in a dataset. To this end, BranchyNet modifies an original DNN architecture by inserting side branches at the middle layers. These side branches allow an input sample to be classified at the middle layers, instead of at the output layer as in regular DNNs. BranchyNet can use entropy as an uncertainty metric to compute the confidence level of the sample classification, in order to decide whether or not the inference can stop at the middle layers [8].

Figure 1 illustrates a generic BranchyNet with side branches. In this figure, the nodes $v_1, \dots, v_N$ represent the layers of the main branch, and $b_1, \dots, b_k$ refer to the side branches inserted at those middle layers. In summary, these layers can be of three types: convolutional (conv), max-pooling (max-pool), and fully-connected (fc). The convolutional layers consist of a set of filters, whose components are learnable parameters adjusted during the training process. Each filter is responsible for generating a set of output features using convolution operations. The max-pooling layers provide robustness to noise in the output features of a convolutional layer. To this end, max-pooling layers take the maximum value within a predefined window. The output fully-connected layer receives the features extracted by the previous convolutional layers and generates a probability vector, containing the probability that a sample belongs to each predefined class.

Once trained, a BranchyNet receives an image, which is processed layer by layer until a side branch is reached. At the side branch, it computes the entropy of the probability vector as a measure of classification uncertainty and verifies whether this value is below a threshold. If so, the inference finishes and the inferred class is the one with the highest probability. Hence, this sample is not processed by any subsequent layer, reducing the number of processed layers and thus the processing delay. Otherwise, the sample is processed by the next layers of the main branch until the next side branch is reached, where the whole procedure is repeated. If the sample is not classified at any side branch, the inference ends when the output layer is reached.
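For illustration, the sketch below implements this early-exit procedure for a toy two-exit CNN written in PyTorch. The layer sizes, the entropy threshold, and the class count are placeholders chosen for the example; this is not the B-AlexNet used later in the paper.

```python
# A minimal early-exit sketch (illustrative; not the paper's B-AlexNet).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBranchyNet(nn.Module):
    """Main branch: conv -> max-pool -> conv -> max-pool -> fc.
    One side branch (an extra classifier) after the first max-pool."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.side_fc = nn.Linear(16 * 16 * 16, num_classes)   # side-branch exit
        self.main_fc = nn.Linear(32 * 8 * 8, num_classes)     # main-branch exit

    def forward_with_exit(self, x, entropy_threshold=0.5):
        # Layers before the side branch (always executed).
        h = self.pool(F.relu(self.conv1(x)))                  # 16 x 16 x 16 for a 32x32 input
        probs = F.softmax(self.side_fc(h.flatten(1)), dim=1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
        if entropy.item() < entropy_threshold:                # confident enough: exit early
            return probs, "side_branch"
        # Otherwise, continue through the remaining main-branch layers.
        h = self.pool(F.relu(self.conv2(h)))                  # 32 x 8 x 8
        probs = F.softmax(self.main_fc(h.flatten(1)), dim=1)
        return probs, "output_layer"

# Usage: classify one 3x32x32 image and report where the inference stopped.
model = TinyBranchyNet().eval()
with torch.no_grad():
    probs, exit_point = model.forward_with_exit(torch.randn(1, 3, 32, 32))
print(exit_point, probs.argmax(dim=1).item())
```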

When the majority of samples in a dataset cannot be classified at the side branches, executing the whole BranchyNet inference at an edge device can introduce a high processing delay. To avoid this, it is necessary to determine at which layer the DNN partitioning must occur to minimize the inference time. This decision should take into account the probability of classifying samples at the side branches, the network conditions, and the processing capacity of the edge and cloud hardware. Therefore, this work formalizes a BranchyNet partitioning problem, based on the model defined next.

Fig. 1: An illustration of a general BranchyNet.

IV Partitioning Model

In this section, we model the inference time for BranchyNet partitioning. To this end, we represent BranchyNet as a graph and define the BranchyNet partitioning problem.

IV-A BranchyNet Graph

A DNN can be modeled as a DAG $\mathcal{G} = (\mathcal{V}, \mathcal{L})$. The set $\mathcal{V}$ contains the vertices of $\mathcal{G}$, and each vertex $v_i \in \mathcal{V}$ represents a layer of the DNN. For instance, the vertices $v_1$ and $v_N$ represent the input and output layers, respectively. The set $\mathcal{L}$ contains the links of the graph (in this paper, we use the word "link" to denote a graph edge, to avoid confusion with edge computing). The link $(v_i, v_j)$ exists if and only if the output data of layer $v_i$ feeds the input of layer $v_j$. Thus, layer $v_i$ is processed before $v_j$.

A BranchyNet can be modeled as a DAG since it is a DNN. According to Figure 1, a BranchyNet is characterized by inserting a side branch $b_k \in \mathcal{B}$ after a middle layer of the main branch, where $\mathcal{B}$ is the set of side branches. Therefore, we can summarize the BranchyNet architecture into two components: the main branch and the side branches. In this paper, the main branch is modeled as a chain graph, denoted by $\mathcal{G}_N$, where the sub-index $N$ indicates the number of vertices. In a chain graph, each vertex $v_i$ has only one outgoing link, representing its connection to $v_{i+1}$, for all $i < N$. Let $\mathcal{G}'$ be the DAG of a BranchyNet. To model $\mathcal{G}'$, we introduce the side branch vertices $b_k$ into the graph of the main branch. In addition, we replace the link $(v_i, v_{i+1})$ with the links $(v_i, b_k)$ and $(b_k, v_{i+1})$. In other words, we replace the outgoing link from vertex $v_i$ to its neighbor $v_{i+1}$ with a link from $v_i$ to the side branch vertex $b_k$, and we add a link from $b_k$ to $v_{i+1}$. Thus, at this stage, a BranchyNet can also be modeled by a chain graph, denoted by $\mathcal{G}'_{N+|\mathcal{B}|}$.
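This construction can be made concrete with a short sketch. The snippet below assumes networkx and uses made-up vertex labels (v1, ..., vN and b1, ...); it builds the chain graph of the main branch and then splices in a side branch vertex as described above.

```python
# Sketch of the BranchyNet graph construction (assumed labels v1..vN, b1..).
import networkx as nx

def main_branch_chain(num_layers):
    """Chain graph of the main branch: v1 -> v2 -> ... -> vN."""
    g = nx.DiGraph()
    g.add_edges_from((f"v{i}", f"v{i+1}") for i in range(1, num_layers))
    return g

def insert_side_branch(g, after_layer, branch_id):
    """Replace (v_i, v_{i+1}) with (v_i, b_k) and (b_k, v_{i+1})."""
    u, w = f"v{after_layer}", f"v{after_layer + 1}"
    b = f"b{branch_id}"
    g.remove_edge(u, w)
    g.add_edge(u, b)
    g.add_edge(b, w)
    return g

# Example: a 5-layer main branch with one side branch after layer v2.
g = insert_side_branch(main_branch_chain(5), after_layer=2, branch_id=1)
print(list(nx.topological_sort(g)))   # v1, v2, b1, v3, v4, v5
```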

(a) Edge-only processing.
(b) Cloud-only processing.
(c) Processing with partitioning.
Fig. 2: Possible scenarios for DNN processing.

IV-B BranchyNet Partitioning

The BranchyNet partitioning problem consists of choosing which layer sends its output data from the edge device to the cloud. In this section, we approach it as a graph partitioning problem. Given a graph $\mathcal{G} = (\mathcal{V}, \mathcal{L})$ and a positive integer $m$, graph partitioning finds vertex sets $\mathcal{V}_1, \dots, \mathcal{V}_m$ such that $\mathcal{V}_1 \cup \dots \cup \mathcal{V}_m = \mathcal{V}$ and $\mathcal{V}_i \cap \mathcal{V}_j = \emptyset$ for $i \neq j$, which means that each vertex belongs to exactly one of these subsets.

BranchyNet partitioning splits the vertex set $\mathcal{V}$ of a BranchyNet graph into two (i.e., $m = 2$) disjoint subsets $\mathcal{V}_e$ and $\mathcal{V}_c$. The vertices in $\mathcal{V}_e$ represent the BranchyNet layers processed at the edge, while the vertices in $\mathcal{V}_c$ are the layers processed in the cloud. As in graph partitioning, each vertex belongs to only one subset, thus a layer is processed either at the edge device or at the cloud server, which means that $\mathcal{V}_e \cap \mathcal{V}_c = \emptyset$. As explained in Section IV-A, we model the main branch of a BranchyNet as a chain graph. Formally, the partitioning task must therefore find a partitioning layer $v_s$ that splits $\mathcal{V}$ into two parts. The partitioning layer $v_s$ is the last one to be processed at the edge. Hence, the layers from $v_1$ to $v_s$ are processed at the edge. Then, the edge device sends the output data of $v_s$ to the cloud, which processes the next layers.

Once the partitioning layer is found, we can determine to which set each layer belongs. Formally, the set processed at the edge is $\mathcal{V}_e = \{v_1, \dots, v_s\}$, together with the side branch vertices placed up to $v_s$. Hence, the set processed in the cloud is $\mathcal{V}_c = \{v_{s+1}, \dots, v_N\}$. It is important to note that, in this paper, no side branch is processed in the cloud, hence no side branch vertex belongs to $\mathcal{V}_c$; all side branch vertices posterior to $v_s$ are discarded. This occurs because, in the cloud, the reduction in processing delay provided by classifying a sample at a side branch is negligible when compared with the time required to execute all layers of the main branch. Moreover, this reduction is significantly lower than the communication delay between the edge and the cloud. Figure 2 shows different partitioning scenarios for a BranchyNet composed of four layers.

In Figure 2, the gray vertices represent the layers processed at the edge device, the orange vertex corresponds to the partitioning layer, and the blue vertices refer to layers processed at the cloud server. Figure 2(a) illustrates the case where all layers are processed at the edge, so the partitioning layer is the output layer $v_4$; thus, we have $\mathcal{V}_e = \{v_1, \dots, v_4\}$ and $\mathcal{V}_c = \emptyset$. In this case, no data is sent to the cloud. On the other hand, Figure 2(b) shows cloud-only processing, where all layers are processed at the cloud, so $\mathcal{V}_e = \emptyset$ and $\mathcal{V}_c = \{v_1, \dots, v_4\}$. In this case, the data size sent to the cloud corresponds to the raw input data size. Finally, Figure 2(c) shows an example where the partitioning occurs at an intermediate layer $v_s$, so $\mathcal{V}_e = \{v_1, \dots, v_s\}$ and $\mathcal{V}_c = \{v_{s+1}, \dots, v_4\}$. In this case, the output data of layer $v_s$ is sent from the edge device to the cloud.
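As a small illustration of the rule above (no side branch runs in the cloud, and side branches after the partitioning layer are discarded), the sketch below splits a hypothetical layer list into the edge and cloud sets given a partitioning index; the layer names are placeholders.

```python
# Illustrative split of a BranchyNet into edge and cloud sets (hypothetical layer names).
def split_branchynet(main_layers, branch_positions, s):
    """main_layers: ordered main-branch layer names, e.g. ["v1", ..., "v4"].
    branch_positions: dict side-branch name -> index of the main layer it follows.
    s: 1-based index of the partitioning layer; s = 0 means cloud-only."""
    edge = [v for i, v in enumerate(main_layers, start=1) if i <= s]
    # Side branches are only kept if they sit at or before the partitioning layer.
    edge += [b for b, pos in branch_positions.items() if pos <= s]
    cloud = [v for i, v in enumerate(main_layers, start=1) if i > s]
    return edge, cloud

# Example: 4 main layers, one side branch after v1, partitioning at v2.
print(split_branchynet(["v1", "v2", "v3", "v4"], {"b1": 1}, s=2))
# (['v1', 'v2', 'b1'], ['v3', 'v4'])
```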

Generally, the partitioning splits a BranchyNet according to an objective, such as reducing the inference time, saving bandwidth, maximizing the inference accuracy, or even reducing energy consumption. In this paper, our goal is to minimize the inference time. We derive next a model to estimate it.

IV-C Estimation of Inference Time

In DNN applications with edge computing, end devices send input data to the edge device. When the DNN has no side branches, this data is processed by all layers placed at the edge. The edge device then sends the output data of the partitioning layer to the cloud, which is responsible for processing the remaining layers. Hence, in this DNN, the inference time depends on the processing delays at the edge and in the cloud, as well as on their communication delay.

Edge devices and cloud servers differ significantly regarding their computational power and thus have different processing delays. Hence, the processing time of a given layer depends on where it is computed. Let $t_i^e$ and $t_i^c$ be the processing times to compute layer $v_i$ at the edge and in the cloud, respectively. Given a partitioning layer $v_s$, the total processing delay at the edge, when the DNN has no side branches, is

$T_e(s) = \sum_{i=1}^{s} t_i^e$, (1)

and the processing time to compute all the remaining layers in the cloud is given by

$T_c(s) = \sum_{i=s+1}^{N} t_i^c$. (2)

The communication delay depends on the output data size of the partitioning layer and on the network bandwidth. The output data size generated by the DNN layers presents a non-monotonic behavior, which means that each layer produces output data of a different size. Hence, a layer closer to the input layer may even generate more data than a deeper layer (i.e., one closer to the output layer), resulting in a higher communication delay if the partitioning occurs there. For each layer $v_i$, we can define the communication time as $t_i^{comm} = \phi_i / B$, where $\phi_i$ is the output data size of layer $v_i$ and $B$ is the network bandwidth. At this stage, $t_i^e$ and $t_i^c$ are parameters related to hardware resources, while $t_i^{comm}$ is related to the network condition. Additionally, we can assign a 3-tuple $(t_i^e, t_i^c, t_i^{comm})$ to each vertex $v_i$ of a DNN with no side branches, or of the main branch of a BranchyNet.

In a DNN with no side branches, the inference time is the sum of the total processing delays (i.e., $T_e(s)$ and $T_c(s)$) and the communication delay (i.e., $t_s^{comm}$), as shown in Equation 3 [3].

$T(s) = \sum_{i=1}^{s} t_i^e + \dfrac{\phi_s}{B} + \sum_{i=s+1}^{N} t_i^c$ (3)
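The sketch below evaluates Equation 3 for every candidate partitioning layer of a DNN without side branches and returns the one with the lowest total time. The per-layer timings, output sizes, and bandwidth are made-up numbers, not measurements from the paper.

```python
# Sketch of Equation 3 for a DNN with no side branches (made-up timings and sizes).
def inference_time(t_edge, t_cloud, phi, bandwidth, s):
    """T(s) = sum_{i<=s} t_i^e + phi_s / B + sum_{i>s} t_i^c.
    s = 0 means cloud-only (phi[0] is the raw input size);
    s = N means edge-only (no transfer)."""
    processing_edge = sum(t_edge[:s])
    processing_cloud = sum(t_cloud[s:])
    communication = 0.0 if s == len(t_edge) else phi[s] / bandwidth
    return processing_edge + communication + processing_cloud

# Illustrative parameters for a 4-layer DNN (seconds, Mbits, Mbps).
t_edge = [0.08, 0.12, 0.30, 0.05]       # t_i^e
t_cloud = [0.008, 0.012, 0.030, 0.005]  # t_i^c
phi = [6.0, 4.0, 1.5, 0.4, 0.01]        # phi[0] = raw input, phi[i] = output of layer i
bandwidth = 5.85                         # e.g. an average 4G uplink in Mbps

best_s = min(range(len(t_edge) + 1),
             key=lambda s: inference_time(t_edge, t_cloud, phi, bandwidth, s))
print(best_s, inference_time(t_edge, t_cloud, phi, bandwidth, best_s))
```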

If we consider a DNN with side branches (i.e., a BranchyNet), the inference can stop at one of these branches if the classification achieves a certain confidence criterion. To model the inference time for a DNN with side branches, we divide our analysis into a particular case and a general one.

IV-C1 Particular Case

We first consider a particular BranchyNet with only one side branch $b_1$, placed at the output of an arbitrary middle layer $v_k$, where $1 < k < N$. Following the BranchyNet inference algorithm described in Section III, when a sample reaches this side branch $b_1$, it can either be classified and exit the BranchyNet at that middle layer, or it is processed by the next layers until it reaches the output layer. To model these two possible outcomes, we define a Bernoulli random variable $X_1$ that takes on the value 1 if the sample is classified at the side branch, with probability $p_1 = P(X_1 = 1)$, where $0 \leq p_1 \leq 1$. Otherwise, the sample is processed by the layers following side branch $b_1$, with probability $1 - p_1$.

IV-C2 General Case

We generalize the previous analysis, considering a BranchyNet with $|\mathcal{B}|$ side branches, as shown in Figure 1. As in the particular case, we define a Bernoulli random variable $X_k$ for each side branch $b_k$, resulting in a sequence of Bernoulli random variables $X_1, \dots, X_{|\mathcal{B}|}$, where $p_k = P(X_k = 1)$. For a sample to be classified at side branch $b_k$, the sample cannot meet the confidence criterion at any of the previous side branches. Thus, the Bernoulli random variable takes on the value 1 for the first time at side branch $b_k$, after $k-1$ unsuccessful trials. To model this, we define a random variable $Y$ that represents the number of side branches that have already been processed before the current side branch $b_k$ without classifying the sample. Then, the probability distribution of the random variable $Y$ is as follows:

$P(Y = k - 1) = p_k \prod_{i=1}^{k-1} (1 - p_i)$ (4)
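As a quick check of Equation 4, the helper below computes the probability that a sample exits at the k-th side branch given per-branch exit probabilities; the probability values are illustrative assumptions.

```python
# Probability that a sample exits at side branch k (Equation 4), illustrative values.
import math

def exit_at_branch(p, k):
    """p[i] = probability of classification at side branch i+1; k is 1-based."""
    fail_before = math.prod(1.0 - q for q in p[:k - 1])
    return p[k - 1] * fail_before

p = [0.4, 0.3, 0.5]                       # assumed exit probabilities per side branch
for k in range(1, len(p) + 1):
    print(k, round(exit_at_branch(p, k), 3))
print("no early exit:", round(math.prod(1 - q for q in p), 3))
```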

Considering a batch of input data samples, we can compute the expected value of the inference time. As the cloud executes no side branches, if the partitioning occurs earlier than the first side branch, the inference time is modeled by Equation 3, as in a DNN with no side branches. If the partitioning occurs after the side branch, the sample can be classified by this branch and thus is not processed by the remaining layers. In this case, for a BranchyNet with only one side branch, placed after layer $v_k$, the expected value of the inference time is

$E[T(s)] = \sum_{i=1}^{k} t_i^e + (1 - p_1)\left[\sum_{i=k+1}^{s} t_i^e + \dfrac{\phi_s}{B} + \sum_{i=s+1}^{N} t_i^c\right]$ (5)

Equation 5 shows that the edge device always processes the layers before the side branch $b_1$. However, the processing and communication delays of the remaining layers are weighted by the probability of classifying the input data at the side branch. In one extreme case, where the input samples are always classified at the side branch (i.e., $p_1 = 1$), Equation 5 considers neither the communication delay nor the processing delay of the remaining layers. At the other extreme, if the inference never stops at the side branch (i.e., $p_1 = 0$), Equation 5 is equal to Equation 3. At this stage, according to the position of the partitioning layer $v_s$, the expected value of the inference time can be modeled as follows:

$E[T(s)] = \begin{cases} \sum_{i=1}^{s} t_i^e + \dfrac{\phi_s}{B} + \sum_{i=s+1}^{N} t_i^c, & \text{if } s < k \\ \sum_{i=1}^{k} t_i^e + (1 - p_1)\left[\sum_{i=k+1}^{s} t_i^e + \dfrac{\phi_s}{B} + \sum_{i=s+1}^{N} t_i^c\right], & \text{if } s \geq k \end{cases}$ (6)
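A compact sketch of Equation 6 follows, restricted to the single-side-branch case for clarity; here k is the index of the layer that feeds the side branch, p is the probability of an early exit there, and all numbers are illustrative assumptions.

```python
# Sketch of Equation 6 (one side branch after layer k, early-exit probability p).
def expected_inference_time(t_edge, t_cloud, phi, bandwidth, s, k, p):
    """If the partition v_s lies before the side branch (s < k), Equation 3 applies.
    Otherwise, everything after the side branch is weighted by (1 - p)."""
    n = len(t_edge)
    comm = 0.0 if s == n else phi[s] / bandwidth
    if s < k:
        return sum(t_edge[:s]) + comm + sum(t_cloud[s:])
    remaining = sum(t_edge[k:s]) + comm + sum(t_cloud[s:])
    return sum(t_edge[:k]) + (1.0 - p) * remaining

t_edge = [0.08, 0.12, 0.30, 0.05]
t_cloud = [0.008, 0.012, 0.030, 0.005]
phi = [6.0, 4.0, 1.5, 0.4, 0.01]
for p in (0.0, 0.5, 1.0):   # never / half / always classified at the side branch
    best_s = min(range(len(t_edge) + 1),
                 key=lambda s: expected_inference_time(t_edge, t_cloud, phi, 5.85, s, k=1, p=p))
    print(p, best_s)
```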

V BranchyNet Partitioning Optimization

In this section, our goal is to determine the partitions $\mathcal{V}_e$ and $\mathcal{V}_c$ that minimize the inference time of Equation 6, given the input parameters $t_i^e$, $t_i^c$, $\phi_i$, $B$, and the probabilities $p_k$. As defined in Section IV-A, the main branch of a BranchyNet is modeled as a chain graph. Hence, if we find a partitioning layer $v_s$ that minimizes $E[T(s)]$, we can determine the partitions $\mathcal{V}_e$ and $\mathcal{V}_c$. To this end, we propose to construct a new graph $\mathcal{G}^*$ based on the BranchyNet graph. This graph allows us to associate one delay of the 3-tuple $(t_i^e, t_i^c, t_i^{comm})$ with each link, as described in Section IV-C. The delay associated with each link depends on where the corresponding vertex is processed. We then show that the partitioning problem can be treated as a shortest path problem in $\mathcal{G}^*$.

Fig. 3: Graph representation of a 3-layer BranchyNet.

To build $\mathcal{G}^*$, we create two disjoint chain graphs $\mathcal{G}_e$ and $\mathcal{G}_c$, such as those of Figures 2(a) and 2(b), respectively. The vertices of $\mathcal{G}_e$ and $\mathcal{G}_c$ represent the layers processed at the edge and in the cloud, respectively. Then, we assign a weight to each link of the graphs $\mathcal{G}_e$ and $\mathcal{G}_c$, representing the processing time to compute the corresponding layer at the edge (i.e., $t_i^e$) and in the cloud (i.e., $t_i^c$), respectively. Figure 3 shows the graph $\mathcal{G}^*$ constructed from a BranchyNet whose main branch has three layers and one side branch inserted after the first layer. In this figure, the gray and blue vertices represent layers processed at the edge and in the cloud, respectively. The dashed red and blue links correspond to the links of graphs $\mathcal{G}_e$ and $\mathcal{G}_c$.

The graph $\mathcal{G}^*$ must model three possibilities: edge-only processing, cloud-only processing, and processing with partitioning. To model cloud-only and edge-only processing, we introduce two virtual vertices, a source $v_{in}$ and a sink $v_{out}$, and add links from $v_{in}$ to the first vertex of each chain, illustrated by the blue and black links in Figure 3. Then, we assign to these links weights related to the communication delay in cloud-only and edge-only processing. In cloud-only processing, the weight corresponds to the communication time to upload a raw input sample to the cloud, given by $\phi_0 / B$. In edge-only processing, the weight is zero, since there is no communication delay. Figure 3 shows that the path between $v_{in}$ and $v_{out}$ using only the red dashed links or only the blue links computes the inference time for edge-only and cloud-only processing, respectively. To model processing with partitioning in graph $\mathcal{G}^*$, for each layer of $\mathcal{G}_e$ that can act as a partitioning layer, we introduce an auxiliary vertex (i.e., the orange vertices in Figure 3) and replace the corresponding red dashed link of $\mathcal{G}_e$ with black links passing through this auxiliary vertex.

To model the communication between the edge and the cloud, we add, from each auxiliary vertex, a link whose weight corresponds to the communication time to send the output data of the partitioning layer $v_i$, placed at the edge, to the next layer, placed in the cloud, given by $\phi_i / B$. In Figure 3, the orange links represent this communication between the edge side and the cloud. To avoid ambiguity in the choice of the shortest path when the probability $p_k = 1$, we add a virtual vertex as successor of the side branch vertex $b_k$ and predecessor of the next vertex, and assign the weight $\epsilon$ to its incoming link. The weight $\epsilon$ must be a very small value, so as not to interfere with the result of the shortest path problem. If the probability $p_k = 0$, which means that no sample is classified by side branch $b_k$, the graph represents a regular DNN. The weights are assigned to the links of $\mathcal{G}^*$ as follows:

$w(v_i, v_j) = \begin{cases} t_i^e, & \text{if } v_i \text{ and } v_j \text{ are processed at the edge} \\ t_i^c, & \text{if } v_i \text{ and } v_j \text{ are processed in the cloud} \\ \phi_i / B, & \text{if } v_i \text{ is processed at the edge and } v_j \text{ in the cloud} \end{cases}$ (7)

To model the expected value of the inference time in a BranchyNet, each weight assigned to a link of $\mathcal{G}^*$ placed after a side branch $b_k$ is further weighted by the probability that the sample is not classified at that side branch. Thereby, the higher the probability that a sample is classified at the side branch, the less significant are the weights of the links after the side branch. Thus, the weights assigned to the links of a BranchyNet are given by

$w'(v_i, v_j) = \left[ \prod_{k \,:\, b_k \text{ precedes } (v_i, v_j)} (1 - p_k) \right] w(v_i, v_j)$, (8)

where $w(v_i, v_j)$ is the weight defined in Equation 7.

Our goal is to determine the path with minimum cost that connects the virtual vertices $v_{in}$ and $v_{out}$. The cost of a path is defined as the sum of the weights associated with its links in graph $\mathcal{G}^*$. At this point, we show the equivalence between the BranchyNet partitioning problem and the shortest path problem. Given two vertices, the shortest path problem finds a forward path that connects these two vertices with minimum cost; in our problem, these two vertices are $v_{in}$ and $v_{out}$. In Figure 3, if only the layers of $\mathcal{G}_e$ are contained in the shortest path, the processing strategy is edge-only, and the total cost corresponds to the edge processing delay, as shown in Equation 1. However, if only the vertices of $\mathcal{G}_c$ are contained in the shortest path, it means cloud-only processing, and the cost of the shortest path is given by Equation 2. Otherwise, if the shortest path contains vertices of both $\mathcal{G}_e$ and $\mathcal{G}_c$, then partitioning occurs and the total cost is given by Equation 6. For instance, in Figure 3, if the partitioning layer is the second layer of the main branch, the layers up to it are processed at the edge and belong to the set $\mathcal{V}_e$; the edge then sends the output data of the partitioning layer to the cloud, which, in turn, processes the remaining layers belonging to the set $\mathcal{V}_c$.

The shortest path problem is a well-known problem that can be solved in polynomial time. In this work, Dijkstra's algorithm is used to find the shortest path, with computational complexity $O(|\mathcal{L}^*| + |\mathcal{V}^*| \log |\mathcal{V}^*|)$, where $|\mathcal{L}^*|$ and $|\mathcal{V}^*|$ are the number of links and vertices in $\mathcal{G}^*$, respectively.
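To make the reduction concrete, the sketch below builds a simplified variant of the graph in Figure 3 for a single side branch and solves it with networkx's Dijkstra routine. It reproduces the two cases of Equation 6 by using two cloud chains, one undamped (reached when the partition happens before the branch) and one damped by (1 - p); it does not reproduce the exact auxiliary-vertex construction of the paper, and all vertex names and numeric parameters are assumptions.

```python
# Simplified sketch of the shortest-path reduction (single side branch after layer k).
import networkx as nx

def build_graph(t_edge, t_cloud, phi, bw, k, p):
    n = len(t_edge)
    g = nx.DiGraph()
    g.add_edge("in", "e1", weight=0.0)               # start processing at the edge
    g.add_edge("in", "c1", weight=phi[0] / bw)       # cloud-only: upload the raw input
    for i in range(1, n + 1):
        w_e = (1.0 - p) * t_edge[i - 1] if i > k else t_edge[i - 1]
        g.add_edge(f"e{i}", f"e{i+1}" if i < n else "out", weight=w_e)
        # Undamped cloud chain: used when the partition happens before the side branch.
        g.add_edge(f"c{i}", f"c{i+1}" if i < n else "out", weight=t_cloud[i - 1])
        # Damped cloud chain: used when the partition happens at or after the side branch.
        g.add_edge(f"d{i}", f"d{i+1}" if i < n else "out", weight=(1.0 - p) * t_cloud[i - 1])
    for s in range(1, n):                            # transfer the output of layer s
        if s < k:
            g.add_edge(f"e{s+1}", f"c{s+1}", weight=phi[s] / bw)
        else:
            g.add_edge(f"e{s+1}", f"d{s+1}", weight=(1.0 - p) * phi[s] / bw)
    return g

t_edge, t_cloud = [0.08, 0.12, 0.30, 0.05], [0.008, 0.012, 0.030, 0.005]
phi = [6.0, 4.0, 1.5, 0.4, 0.01]
g = build_graph(t_edge, t_cloud, phi, bw=5.85, k=1, p=0.6)
print(nx.dijkstra_path(g, "in", "out", weight="weight"))
print(nx.dijkstra_path_length(g, "in", "out", weight="weight"))
```

In this toy instance the shortest path crosses from the edge chain to the damped cloud chain, i.e., it selects a partitioned execution rather than edge-only or cloud-only processing.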

VI Evaluation

(a) Processing factor of 10.
(b) Processing factor of 100.
(c) Processing factor of 1000.
Fig. 4: Inference time according to the probability of classifying a sample at the side branch, for different wireless technologies and processing factors.

The experiments present a sensitivity analysis evaluating the impact of the input parameters described in Section IV-C, such as the probability $p_1$, the processing times, and the output data sizes, on the inference time, under different network bandwidths. Then, we analyze the choice of partitioning layer under different processing capacities of the cloud server and the edge device. To this end, we implement a BranchyNet called B-AlexNet [8]. B-AlexNet is composed of a standard AlexNet architecture as the main branch, with one side branch inserted after the first middle layer of the main branch. The choice of only one side branch aims to simplify the experimental analysis, and the side branch position is chosen to avoid unnecessary processing at the edge. First, we have to obtain the input parameters. The output data size $\phi_i$ of each layer is obtained directly from the definition of the B-AlexNet architecture. To obtain the communication time $t_i^{comm}$, we consider that the edge device uses a wireless technology to send data to the cloud. Besides, we assume that the bottleneck is the access network. We thus use average uplink rates of 1.10, 5.85, and 18.80 Mbps, which correspond to 3G, 4G, and Wi-Fi, respectively. These average uplink rates are based on the values presented in [3]. To obtain $t_i^c$, we measure the processing time of each layer of B-AlexNet using Google Colaboratory (https://colab.research.google.com), a cloud computing service for machine learning equipped with a 2-core Intel Xeon @ 2.20 GHz processor, 12 GB of RAM, and an NVIDIA Tesla K80 GPU. We define the processing time at the edge for each layer as $t_i^e = \alpha \, t_i^c$, where $\alpha$ is a proportionality factor between the processing times at the edge and in the cloud. We use this factor to span different edge hardware types in our evaluation. For example, the Jetson TX2 module (https://developer.nvidia.com/embedded/jetson-tx2) developed by NVIDIA can represent an edge device equipped with high processing power and thus a low $\alpha$. On the other hand, a Raspberry Pi (www.raspberrypi.org) can represent a resource-constrained device, with a high $\alpha$.

Figure 4 shows the impact of the probability $p_1$ on the expected value of the inference time, given by Equation 5, for three processing factors: 10, 100, and 1000. These results are obtained from the solution of our optimization problem when varying the probability that a sample is classified at the side branch. For each processing factor, we present results for 3G, 4G, and Wi-Fi. Figure 4 shows that, when the edge has high processing power, the probability has a severe impact on the inference time. Also note that, for each processing factor, the y-axis has a different scale, meaning that a high edge processing power leads to an overall low inference time.

Considering a given processing factor, Figures 4(a) and 4(b) show that networks with lower bandwidth are more affected by the probability. For example, the 3G results in Figure 4(a) show that the inference time is reduced by 87.27% if we compare the case where the probability is zero with the case where it is one. This difference is 82.98% and 70% for 4G and Wi-Fi, respectively. Figure 4(a) also shows that, when the probability is one, all network technologies have the same inference time. This result is expected since all samples are classified at the side branch. Although the probability also affects the inference time in Figure 4(b), we can note that, for each technology, the inference time remains constant over a given probability range. This is explained by the fact that, compared to the case of Figure 4(a), the edge has lower processing power. Hence, for low probabilities, the optimization problem chooses cloud-only processing, since most of the samples are not classified at the side branch. As cloud-only processing executes no side branches, the inference time is not affected by the probability. After a given probability value, the problem begins to choose partitioning solutions, where the edge is involved, and thus the probability begins to affect the inference time. For example, for 3G, the inference time only starts to decrease when the probability is higher than 0.3. For 4G, this value is 0.8 since, compared to 3G, the bandwidth is higher, and thus it is more advantageous to send raw data to the cloud over a large probability range. Finally, the Wi-Fi results of Figure 4(b) show that it is always advantageous to perform cloud-only processing due to its high bandwidth. Figure 4(c) shows an extreme situation where the probability does not affect the inference time at all. This behavior happens because the edge has low processing power, and thus it is always advantageous to perform cloud-only processing.

Using the same scenario as in Figure 4, we vary the processing factor and analyze which layer the optimization problem chooses as the partitioning one. Figures 5(a) and 5(b) show the chosen layer for different processing factors when using 3G and 4G, respectively. Each curve represents a given probability of classifying the sample at the side branch. Figure 5(a) shows that, as the processing factor increases, the chosen partitioning layer moves toward the input layer. This behavior is expected since an edge with lower processing power makes it more advantageous to process the layers in the cloud. For example, assuming a given probability, when the processing factor changes from 500 to 600, Figure 5(a) shows that the partitioning layer moves to the input layer, which means cloud-only processing.

(a) Partitioning Layer according to factor with 3G.
(b) Partitioning Layer according to factor with 4G.
Fig. 5: Partitioning layer for different processing factors.

When comparing Figure 5(b) with Figure 5(a), we note that, for 4G, the problem starts to choose cloud-only processing at a lower processing factor (i.e., a higher edge processing power). This confirms the trend observed in Figure 4, where, for a higher average uplink rate, the problem tends to choose cloud-only processing. Finally, Figure 5 also confirms the behavior of Figure 4, where the probability influences the choice of the partitioning layer, thus impacting the inference time.

In practice, the probability of classifying an input image at a side branch is a parameter related to aspects inherent to the input data, which depend on numerous factors. One of these factors is image quality. To show that image quality affects this probability, and thus the partitioning decision, we train a B-AlexNet for image classification using a cat-and-dog dataset [2]. This dataset is composed of images of dogs and cats in different environments, without any distortion. Once the network is trained, we apply a batch of 48 samples with different levels of Gaussian blur. This experiment implements Gaussian filters with kernel dimensions of 5, 15, and 65 to represent images with low, intermediate, and high distortion, respectively. We use these images as samples to perform inferences that classify whether the animal is a dog or a cat. Figure 6 shows the probability of classifying an input image at the side branch according to the entropy threshold. This figure shows that, as the distortion level increases, the probability that a sample is classified at the side branch decreases. This is expected, since images with higher distortion levels tend to produce higher uncertainty in the inference, resulting in a lower probability of being classified at the side branch.

Fig. 6: Probability of side branch classification under different distortion levels in B-AlexNet.
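A sketch of this kind of measurement is given below: it blurs a batch of images, runs them through a side-branch classifier, and estimates the fraction of samples whose prediction entropy falls below the threshold. The classifier, image sizes, and threshold are placeholders standing in for the trained B-AlexNet's first exit, not the paper's actual setup.

```python
# Sketch: estimating the side-branch classification probability under Gaussian blur.
# `side_branch_logits` is a placeholder for the trained model's first exit.
import torch
import torch.nn.functional as F
from torchvision.transforms import GaussianBlur

def exit_probability(side_branch_logits, images, kernel_size, threshold):
    """Fraction of blurred images whose prediction entropy is below the threshold."""
    blurred = GaussianBlur(kernel_size, sigma=kernel_size / 6.0)(images)
    with torch.no_grad():
        probs = F.softmax(side_branch_logits(blurred), dim=1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    return (entropy < threshold).float().mean().item()

# Example with a random stand-in classifier and a random batch of 48 RGB images.
fake_exit = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 2))
batch = torch.rand(48, 3, 64, 64)
for k in (5, 15, 65):   # low, intermediate, and high distortion
    print(k, exit_probability(fake_exit, batch, kernel_size=k, threshold=0.5))
```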

VII Conclusion

In this paper, we accelerate DNN inference by partitioning a DNN between the edge device and the cloud server, so as to minimize the inference time. Unlike a traditional DNN, a BranchyNet has side branches that allow the inference to stop at the middle layers, which can reduce the inference time. Hence, to partition a BranchyNet, we have to take into account the probability that inferences stop at a side branch. To address this problem, we model the expected value of the inference time as a function of different factors, such as the processing and communication delays and the probability of classification at a side branch. We also model the BranchyNet as a graph, showing that the minimization of the inference time can be solved as a shortest path problem. Hence, this problem can be solved in polynomial time using Dijkstra's algorithm. We evaluate our proposal through a sensitivity analysis in which we vary the probability of side branch classification and the processing power of the edge. The results demonstrate that this probability affects the choice of the partitioning layer and hence impacts the inference time. Thus, estimating this probability improves the partitioning decision, in the same way as knowing the network conditions and the available computational resources. Moreover, the evaluation shows that the processing strategy changes according to the probability. Thus, this paper also introduces this probability as a factor to be considered in BranchyNet partitioning. As future work, our first goal is to extend our proposal to also handle DAG-topology DNNs. Moreover, we will investigate heuristics for side branch placement, to also meet accuracy requirements.

Acknowledgement

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. It was also supported by CNPq, FAPERJ Grants E-26/203.211/2017, E-26/201.833/2019, and E-26/010.002174/2019, and FAPESP Grant 15/24494-8.

References

  • [1] S. Biookaghazadeh, M. Zhao, and F. Ren (2018) Are FPGAs suitable for edge computing?. In USENIX Workshop on Hot Topics in Edge Computing (HotEdge). Cited by: §I.
  • [2] S. Dodge and L. Karam (2016) Understanding how image quality affects deep neural networks. In IEEE International Conference on Quality of Multimedia Experience (QoMEX), pp. 1–6. Cited by: §VI.
  • [3] C. Hu, W. Bao, D. Wang, and F. Liu (2019) Dynamic adaptive DNN surgery for inference acceleration on the edge. In IEEE Conference on Computer Communications (INFOCOM), pp. 1423–1431. Cited by: §I, §II, §IV-C, §VI.
  • [4] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang (2017) Neurosurgeon: collaborative intelligence between the cloud and mobile edge. In ACM SIGARCH Computer Architecture News, Vol. 45, pp. 615–629. Cited by: §I, §II.
  • [5] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §I.
  • [6] E. Li, Z. Zhou, and X. Chen (2018) Edge intelligence: on-demand deep learning model co-inference with device-edge synergy. In Proceedings of the Workshop on Mobile Edge Communications, pp. 31–36. Cited by: §II.
  • [7] M. Satyanarayanan (2017) The emergence of edge computing. Computer 50 (1), pp. 30–39. Cited by: §I.
  • [8] S. Teerapittayanon, B. McDanel, and H. Kung (2016) BranchyNet: fast inference via early exiting from deep neural networks. In IEEE International Conference on Pattern Recognition (ICPR), pp. 2464–2469. Cited by: §III, §VI.