Latency-Memory Optimized Splitting of Convolution Neural Networks for Resource Constrained Edge Devices

07/19/2021 ∙ by Tanmay Jain, et al. ∙ University of Cambridge, Indian Institute of Technology Delhi

With the increasing reliance of users on smart devices, bringing essential computation to the edge has become a crucial requirement for any type of business. Many such computations utilize Convolution Neural Networks (CNNs) to perform AI tasks, and their high resource and computation requirements make them infeasible for edge devices. Splitting the CNN architecture to perform part of the computation on the edge and the remainder on the cloud is an area of research that has seen increasing interest. In this paper, we assert that running CNNs between an edge device and the cloud is synonymous with solving a resource-constrained optimization problem that minimizes the latency and maximizes resource utilization at the edge. We formulate a multi-objective optimization problem and propose the LMOS algorithm to achieve a Pareto efficient solution. Experiments done on real-world edge devices show that LMOS ensures feasible execution of different CNN models at the edge and also improves upon existing state-of-the-art approaches.




1. Introduction

Recent times have seen an increasing trend towards bringing computation to the edge in order to increase the level of automation at the edge and obtain more realistic real-time solutions. As per a 2018 study, the percentage of data processed on the edge in one way or another is projected to rise sharply between 2018 and 2025 (gartner). AI/ML on the edge is linked to crucial future applications. Highly reliant on Convolution Neural Networks (CNNs) (shi; lecun1999object), these applications include quality control in industries, facial recognition, health care, smart retail, and autonomous vehicles, to name a few. Furthermore, AI is expected to feature in a significant share of edge computing applications by 2025 (idc).

Convolution Neural Networks (CNNs) (lecun1999object) are a class of deep neural networks that utilize convolution instead of matrix multiplication in at least one of the network layers. In addition to the Convolution layers, the CNN relies on the Pooling, Rectified Linear Unit (ReLU), and Fully connected layers to provide the output volume. Computation at all of these layers is usually resource-intensive, whereas a majority of edge devices are resource-constrained, with limited memory and computational capability. This makes running CNNs on edge devices quite challenging. Existing techniques have tried to address this challenge. One class of work tries to compress the network architecture to build compact CNN models that could run on an edge device (hu2020fast; gamanayake2020cluster). These models compromise on accuracy and are often model-specific. Another class of work splits the input into smaller inputs to minimize the memory requirements (jin2019split). These approaches are limited by the level of image splitting and would thus be model-specific. Instead of carrying out all the computations on the edge device, a third set of solutions splits the CNN architecture to perform part of the processing on the cloud (mehta2020deepsplit; tang2020joint). However, these works either split the CNN based on some model-specific empirical thresholds or rely on latency optimization, which in most cases would defer the splitting. When closely observed, the splitting of a CNN between the edge and the cloud could be formulated as a resource-constrained optimization problem that tries to minimize the latency, while also trying to maximize resource utilization on the edge device.

Figure 1. Total latency vs splitting point of CNN for different models.

In this paper, we formulate a multi-objective optimization problem that considers both latency and memory utilization when finding the optimal layer for splitting the CNN between the edge device and the cloud server (we use Cloud, Server and Cloud Server interchangeably throughout the paper). Following this, we develop the Latency-Memory Optimized Splitting (LMOS) algorithm, which calculates the Pareto optimal solution for the multi-objective optimization problem. We evaluate LMOS over a prototype edge environment set up using Raspberry Pi4 (rpi) modules. Experiments show that LMOS computes the optimal split point for splitting the CNN in order to minimize the latency and maximize the resource utilization without compromising on the accuracy. Further, the LMOS algorithm also improves upon existing solutions.

In the following section (Section 2), we give a brief overview of existing solutions and their limitations. Next, we define and formulate the problem (Section 3) before describing LMOS in detail (Section 4). We then evaluate LMOS in different scenarios (Section 5) and finally conclude with a discussion of future work (Section 6).

2. Related Work

Convolution Neural Networks (CNN) (lecun1999object) were introduced as image recognition neural networks, but have become crucial to several computer vision and other related applications. This is evident from the numerous CNN models developed in recent times (krizhevsky2012imagenet; simonyan2014very; sandler2018mobilenetv2; szegedy2017inception). However, when trying to run CNNs on an edge device, the high computation and memory requirements become a bottleneck (khalil2021deep).

With the increasing need to bring AI on the edge, many works have applied different approaches to overcome this bottleneck. The naive approach is to offload all computation to the server (huang2017deep), undermining the utility of the edge devices that could themselves perform some computation. One popular approach is to develop lightweight strategies that reduce the computation and memory requirements manifold (hu2020fast; gamanayake2020cluster; sandler2018mobilenetv2; iandola2016squeezenet; louis2019towards). These approaches achieve computational and memory efficiency by either compressing the CNN model (iandola2016squeezenet; sandler2018mobilenetv2; hu2020fast) or using lightweight libraries (louis2019towards). Network pruning is another approach that reduces the complexity by pruning redundant and non-informative weights (fan2019cscc; louis2019towards; hassibi1993second; han2015deep). However, both model compression and network pruning compromise accuracy and are specific for a certain model or certain class of models (khalil2021deep).

Another class of approaches splits the CNN architecture between the edge device and the server (mehta2020deepsplit; matsubara2019distilled; tang2020joint; zhou2019distributing; leroy2021optimal). The algorithm-based splitting methods decide the splitting point based on either a model-specific threshold (mehta2020deepsplit) or the input-output dimension size (matsubara2019distilled). However, these splitting methods are model-specific and cannot be generalized. Other approaches split the CNN by trying to optimize the computation latency (tang2020joint; zhou2019distributing), which limits the utility of the edge device.

In contrast to the existing state-of-the-art approaches, there is a need for a dynamic splitting approach, that not only optimizes latency but also optimizes utilization of the edge device without compromising accuracy.

3. Problem Definition

As discussed earlier, the problem of deciding where to split the CNN architecture could be seen as a resource-constrained optimization problem. Several system parameters have an impact on the optimization decision. In this section, we describe how and why the system parameters are considered in formulating the problem.

3.1. Pilot Study

When splitting a CNN between an edge device and the cloud, numerous system parameters need to be considered. We perform a set of experiments to identify these system parameters. The experiment is set up between a Raspberry Pi4 module (rpi), used as an edge device, and an Ubuntu 20.04 system (ubuntu) as the cloud server. The RPi4 module has 16 GB storage, 4 GB RAM, and a quad-core 1.5 GHz processor. The cloud server has 8 GB RAM and an octa-core 1.5 GHz processor. The RPi4 module and the cloud server are connected by a 10 Mbps network link. We run our experiments with four pre-trained CNN models: Alexnet (21 layers) (krizhevsky2012imagenet), VGG13 (32 layers), VGG16 (38 layers), and VGG19 (44 layers) (simonyan2014very). All models perform image classification based on an input image. Each model is split at different layers in different runs, and the computation and transmission times are logged for each run.

In Figure 1, we plot the total latency for each model when split at a particular layer. In addition to the total latency, we also plot the four contributory latency factors, viz., computation time at the edge device, transmission time from edge to the server, computation time at the server, and transmission time from the server to the edge device. As is evident from all the four plots in Figure 1, the computation time at the edge device and the transmission time from the edge to the cloud are the primary contributing factors for total latency. It can be observed that splitting at different layers affects the contributing latency factors differently. Moreover, the edge device can also perform low latency computation when trying to compute more layers. Hence, it is essential to allow the edge device to perform the maximal computation in addition to minimizing the latency. Computation time at the edge, which plays a crucial part in the total latency calculation, is impacted by the edge device’s computational capability. Further, the computational capability of the server also has a role in the total latency computation that impacts the splitting point. Finally, bandwidth is an important parameter to consider when computing the transmission time between edge and server.

3.2. Problem description

The above set of experiments shows that we not only need to minimize the latency but also maximize the computation at the edge, which could be linked to the memory usage at the edge device. Therefore, we need to define two objective functions for the optimization problem, one addressing latency and the other addressing memory usage at the edge.

We assume that there are $L$ layers in the CNN. After splitting the CNN, there are $x_1$ layers at the edge and $x_2$ layers at the server. The memory usage, denoted by $X_{edge}|x_1$, is computed as the memory required when the edge device is performing convolution over the $x_1$ layers. This forms the basis for the objective function that attempts to maximize the memory usage at the edge device.

As observed in Figure 1, there are four components involved when considering the overall latency of the computation. However, the transmission time from server to edge is constant and negligible, since the server sends a fixed, small classification output to the edge and the bandwidth is usually constant with minimal variance. This can be seen in Figure 1. Thus, we do not include the transmission time from server to edge in the objective function. The other three latency values are considered when defining the objective function. We describe the latency components in the following sub-sections.

3.2.1. Edge Convolution Latency ($T_{EC}$):

This is the time that the edge device takes to compute the convolution layers. This is represented as:

$$T_{EC} = \frac{C_{edge}|x_1}{n_{edge} \times s_{edge}} \qquad (1)$$

where $C_{edge}|x_1$ is the amount of local computation done at the edge device given it has to compute $x_1$ layers. This value is computed based on the network depth, the width and the height of the kernel at each layer (giro2016memory). The denominator defines the computational capacity of the edge device, which we take as the product of the number of CPU cores $n_{edge}$ and the processor speed $s_{edge}$.

3.2.2. Edge to Server Transmission Latency ($T_{ES}$):

This is the transmission time for sending the intermediate results from the edge to the cloud server. $T_{ES}$ is dependent on the size of the intermediate output, which, in turn, is a function of the kernel weights and the network depth at the last layer to be computed and the type of computation, i.e., Convolution, Regularization or Pooling (giro2016memory). Given that the intermediate output size is $O|x_1$ for $x_1$ layers, and the bandwidth between the edge and server is $B$, the transmission latency is computed as:

$$T_{ES} = \frac{O|x_1}{B} \qquad (2)$$
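3.2.3. Server Convolution Latency ($T_{SC}$):

The computation latency at the server is a function of the amount of local computation performed with $x_2$ layers ($C_{server}|x_2$) at the server, the number of CPU cores ($n_{server}$) and the processor speed ($s_{server}$). $T_{SC}$ is calculated as follows:

$$T_{SC} = \frac{C_{server}|x_2}{n_{server} \times s_{server}} \qquad (3)$$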
The above three latency components form the basis for the objective function that attempts to minimize the latency.
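The three components can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the units (operations, bits, Hz) and the toy numbers are hypothetical and are not taken from the paper's measurements. The capacity model (cores multiplied by clock speed) follows the description above.

```python
from typing import Sequence, Tuple


def compute_latency(ops: Sequence[float], cores: int, speed_hz: float) -> float:
    """Computation latency: total operations over (cores * clock speed)."""
    return sum(ops) / (cores * speed_hz)


def transmission_latency(size_bits: float, bandwidth_bps: float) -> float:
    """Transmission latency: payload size over the link bandwidth."""
    return size_bits / bandwidth_bps


def latency_components(
    x1: int,
    layer_ops: Sequence[float],
    layer_out_bits: Sequence[float],
    edge: Tuple[int, float],
    server: Tuple[int, float],
    bandwidth_bps: float,
) -> Tuple[float, float, float]:
    """Return (edge, transmission, server) latency when the edge runs x1 layers.

    Assumes 1 <= x1 < len(layer_ops), so both sides compute something and
    the output of layer x1 is what crosses the network.
    """
    t_edge = compute_latency(layer_ops[:x1], *edge)
    t_tx = transmission_latency(layer_out_bits[x1 - 1], bandwidth_bps)
    t_server = compute_latency(layer_ops[x1:], *server)
    return t_edge, t_tx, t_server


# Toy numbers (assumptions, not measurements from the paper).
layer_ops = [4e9, 2e9, 1e9, 1e9]       # operations per layer
layer_out_bits = [8e6, 4e6, 2e6, 8e3]  # output size of each layer, in bits
edge = (4, 1.5e9)                      # (cores, clock speed in Hz)
server = (8, 1.5e9)
components = latency_components(2, layer_ops, layer_out_bits, edge, server, 10e6)
total = sum(components)
```

With these toy values, splitting after the second layer gives an edge time of 1.0 s, a transmission time of 0.4 s and a server time of about 0.17 s, which mirrors the observation that edge computation and edge-to-server transmission dominate the total.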

3.3. Problem formulation

We formulate the objective functions that define the optimization problem along with the related constraints as follows:

$$f_1(x_1, x_2) = T_{EC} + T_{ES} + T_{SC} \qquad (4)$$

$$f_2(x_1) = X_{edge}|x_1 \qquad (5)$$

Equation 4 defines the latency objective function, that is, the end-to-end latency (the sum of the edge convolution, edge-to-server transmission and server convolution latencies) given that the edge computes $x_1$ layers and the server computes $x_2$ layers of the CNN. Equation 5 defines the memory objective function, which is computed given the $x_1$ edge layers. The optimization problem can thus be represented as follows:

$$\begin{aligned} \min \quad & F = (f_1, -f_2) \\ \text{s.t.} \quad & X_{edge}|x_1 \le M \\ & x_1 + x_2 = L \\ & 0 \le x_1 \le L \\ & 0 \le x_2 \le L \end{aligned} \qquad (6)$$

With Equations 4 and 5, we wish to minimize $f_1$ and maximize $f_2$. We formulate the problem as in Equation 6, where we minimize $f_1$ and $-f_2$. The multi-objective optimization problem must adhere to the four constraints. First, the local computation memory required on the edge device must not exceed the total available storage ($M$) at the edge device. Second, the number of layers at the edge and the number of layers at the server should always add up to the total number of layers of the CNN ($L$). Finally, the number of layers at the edge and the number at the server should each be non-negative and not exceed $L$.
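Because the decision variable is a layer index, the feasible set can be enumerated directly. The sketch below is a minimal illustration, assuming the latency and memory of each candidate split have already been computed and stored in lists indexed by the number of edge layers; the function name `feasible_evaluations` and the toy values are hypothetical.

```python
from typing import List, Sequence, Tuple


def feasible_evaluations(
    latency: Sequence[float],
    memory: Sequence[float],
    storage_limit: float,
) -> List[Tuple[int, int, float, float]]:
    """Return (x1, x2, f1, -f2) for every split that satisfies the constraints.

    latency[x1] and memory[x1] hold the two objective values when the edge
    computes x1 layers, for x1 = 0, ..., L.
    """
    total_layers = len(latency) - 1
    rows = []
    for x1 in range(total_layers + 1):   # 0 <= x1 <= L
        x2 = total_layers - x1           # x1 + x2 = L, hence 0 <= x2 <= L
        if memory[x1] > storage_limit:   # edge memory must not exceed M
            continue
        rows.append((x1, x2, latency[x1], -memory[x1]))
    return rows


# Toy values (assumptions): a 4-layer model and 100 MB of edge storage.
latency = [0.9, 1.2, 1.1, 1.6, 2.4]   # seconds
memory = [0, 10, 25, 60, 120]         # MB
rows = feasible_evaluations(latency, memory, storage_limit=100)
```

Here the split with all four layers on the edge is dropped because it needs 120 MB, and every remaining row carries the pair of values that both objectives try to minimize.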

4. The LMOS Optimization Algorithm

In this section, we design the Latency-Memory Optimized Splitting (LMOS) algorithm to obtain an optimal solution for the multi-objective optimization problem $F$. We first define a few terms.

Definition 1 (Solution Space).

$S$ is the set of feasible solutions that $F$ can have. Hence, any solution vector $x = (x_1, x_2) \in S$.

Definition 2 (Objective Space).

We represent any evaluation vector as $z = (f_1(x), -f_2(x))$ for a given solution vector $x$. Then the objective space is defined as $Z = \{ z : x \in S \}$.

An optimal solution vector $x^*$ is said to be the one which dominates all other $x \in S$. We define dominance as follows:

Definition 3 (Dominance).

Let $z^a$ and $z^b$ be two evaluation vectors in $Z$. We say $z^a$ dominates $z^b$ if and only if $z^a_1 \le z^b_1$ and $z^a_2 \le z^b_2$. Further, at least one inequality should be strict.
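The dominance test translates directly into code. The following is a small sketch, assuming evaluation vectors are tuples in which every component is to be minimized (so memory usage is stored with a negative sign); the function name `dominates` is an illustrative choice.

```python
from typing import Sequence


def dominates(za: Sequence[float], zb: Sequence[float]) -> bool:
    """True if za dominates zb when every component is minimized."""
    no_worse = all(a <= b for a, b in zip(za, zb))
    strictly_better = any(a < b for a, b in zip(za, zb))
    return no_worse and strictly_better


# Evaluation vectors are (latency, -memory).
faster_same_memory = dominates((1.0, -60), (1.2, -60))   # better in latency only
trade_off = dominates((1.0, -60), (1.2, -80))            # faster but uses less memory
identical = dominates((1.0, -60), (1.0, -60))            # no strict inequality
```

The second case is the situation discussed next: one point is faster while the other uses more edge memory, so neither dominates the other.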

We observe from Figure 1 that the total latency is generally lower when splitting is done at lower layers, while the computation latency at the edge is higher when splitting is done at higher layers. Higher computation latency at the edge implies that more computation is performed at the edge, leading to higher edge memory usage; memory usage is therefore higher when splitting is done at the higher layers. Hence, a dominant solution for the optimization problem cannot be obtained, and there is a need to compute a non-dominant or Pareto-efficient (debreu1954valuation) solution for $F$.

We develop an algorithm based on the $\epsilon$-constrained method (chankong2008multiobjective) to solve the multi-objective optimization problem. We find the optimal solution for one of the objective functions and represent the other as a constraint. Thus, we rewrite the optimization problem of Section 3.3 as follows:

$$\begin{aligned} \min \quad & F_i(\epsilon_j) = (f_i) \\ \text{s.t.} \quad & f_j \le \epsilon_j, \quad i, j \in [1, 2] \wedge i \ne j \\ & X_{edge}|x_1 \le M \\ & x_1 + x_2 = L \\ & 0 \le x_1 \le L \\ & 0 \le x_2 \le L \end{aligned} \qquad (7)$$

The solution of the $\epsilon$-constrained problem $F_i(\epsilon_j)$ is based on the following theorems (their proofs are out of the scope of this paper and are available in the literature (chankong2008multiobjective)):

Theorem 4.

$x^*$ is an efficient solution of $F$ if and only if $x^*$ solves $F_i(\epsilon_j^*)$ for every $i \in \{1, 2\}$, where $\epsilon_j^* = f_j(x^*)$ and $j \ne i$.

Theorem 5.

If $x^*$ solves $F_i(\epsilon_j^*)$ for some $i$ and if this is a unique solution, then $x^*$ is an efficient solution for $F$.

The two theorems summarize that an exact Pareto front can be found for $F$ by solving the $\epsilon$-constrained problems, given that we get a solution for every point in $P$, which is the Pareto front.

Definition 6 (Pareto Front).

$P = \{ z \in Z : z \text{ is Pareto efficient in } Z \}$
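For a finite set of candidate splits, the Pareto front can be obtained by discarding every evaluation vector that some other vector dominates. The sketch below assumes vectors of the form (latency, -memory), both minimized; the helper names and the toy points are illustrative.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, float]


def dominates(za: Sequence[float], zb: Sequence[float]) -> bool:
    """True if za dominates zb when every component is minimized."""
    return all(a <= b for a, b in zip(za, zb)) and any(
        a < b for a, b in zip(za, zb)
    )


def pareto_front(points: Sequence[Vector]) -> List[Vector]:
    """Keep the points that no other point dominates."""
    return [
        p for p in points
        if not any(dominates(q, p) for q in points if q != p)
    ]


# Toy evaluation vectors (assumptions): (latency in s, -memory in MB).
points = [(0.9, 0), (1.2, -10), (1.1, -25), (1.6, -60)]
front = pareto_front(points)
```

The point (1.2, -10) is removed because (1.1, -25) is both faster and uses more edge memory; the three remaining points are mutually non-dominated.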

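Result: The Pareto optimal solution for $F$
1  $i \leftarrow 1$ or $2$;
2  $j \leftarrow$ the remaining index ($j \ne i$);
3  Compute the ideal and nadir values of $f_1$ and $f_2$;
4  $\epsilon_j \leftarrow$ nadir (worst) value of $f_j$; $P \leftarrow \emptyset$;
5  while $\epsilon_j$ has not passed the ideal value of $f_j$ do
6      Solve $F_i(\epsilon_j)$ to get $x^*$;
7      $P \leftarrow P \cup \{x^*\}$;
8      $\epsilon_j \leftarrow$ next best value of $f_j$;
9  end while
10 Remove dominant points in $P$ if any;
11 Return the best-ranked $x^*$ from $P$
Algorithm 1 The LMOS Optimization Algorithm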

In Algorithm 1, we describe how the Pareto optimal solution is obtained for $F$ using the $\epsilon$-constrained method. The first task is to decide which objective function is to be optimized. Consequently, the other objective function will be constrained. This sets the values of $i$ and $j$. Subsequently, the ideal and nadir points are obtained for both the objective functions. As the aforementioned theorems state, solving multiple $\epsilon$-constrained problems gives the efficient solutions that make up the Pareto front. The algorithm achieves this by first setting $\epsilon_j$ to the worst possible value and then decreasing it until it reaches the ideal value. For every value of $\epsilon_j$, the algorithm solves $F_i(\epsilon_j)$, and the solution is added to the Pareto front. $\epsilon_j$ is then set to the next best value of the constrained objective. Once the loop is over, the algorithm checks if there are any dominant points in the front and removes these, since we are only interested in the non-dominating points. It should be noted that these points are dominating in only one objective and hence do not follow the dominance rule defined earlier. Finally, based on the ranking method, the Pareto optimal solution is chosen as the best-ranked solution in the front.
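The loop described above can be sketched for a discrete set of split points. This is not the authors' implementation; it is a minimal illustration under several assumptions: latency is the constrained objective and memory is the optimized one, the "next best value" is taken to be the largest feasible latency strictly below the latency of the solution just found, the clean-up step is a standard non-dominated filter, and the final ranking rule is passed in by the caller because the text does not specify it. All names and toy values are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[int, float, float]  # (edge layers x1, latency f1, memory f2)


def epsilon_constraint_split(
    latency: Sequence[float],
    memory: Sequence[float],
    storage_limit: float,
    rank: Callable[[Point], float],
) -> Tuple[Point, List[Point]]:
    """Return (chosen split, Pareto front) using an epsilon-constraint loop."""
    feasible = [
        (x1, latency[x1], memory[x1])
        for x1 in range(len(latency))
        if memory[x1] <= storage_limit
    ]
    if not feasible:
        raise ValueError("no split satisfies the storage constraint")

    ideal = min(p[1] for p in feasible)     # best achievable latency
    epsilon = max(p[1] for p in feasible)   # nadir (worst) latency
    front: List[Point] = []
    while epsilon >= ideal:
        # Constrained problem: maximize memory subject to latency <= epsilon.
        candidates = [p for p in feasible if p[1] <= epsilon]
        best = max(candidates, key=lambda p: (p[2], -p[1]))
        if best not in front:
            front.append(best)
        # Assumed update rule: tighten epsilon to the next lower latency.
        lower = [p[1] for p in feasible if p[1] < best[1]]
        if not lower:
            break
        epsilon = max(lower)

    # Assumed clean-up: drop any point that another front point dominates.
    front = [
        p for p in front
        if not any(
            q[1] <= p[1] and q[2] >= p[2] and (q[1] < p[1] or q[2] > p[2])
            for q in front
        )
    ]
    return min(front, key=rank), front


# Toy values (assumptions) and a hypothetical ranking: latency minus
# 0.01 * memory, lower is better.
latency = [0.9, 1.2, 1.1, 1.6, 2.4]
memory = [0, 10, 25, 60, 120]
chosen, front = epsilon_constraint_split(
    latency, memory, storage_limit=100, rank=lambda p: p[1] - 0.01 * p[2]
)
```

With these toy values the loop visits three constraint levels and collects three mutually non-dominated splits; the hypothetical ranking then selects the split with two layers on the edge. A different ranking rule would pick a different point from the same front.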

In this paper, we use the latency objective function ($f_1$) as a constraint and optimize the memory requirement function ($f_2$). That is, we solve the optimization problem $F_2(\epsilon_1)$. This decision is based on an empirical analysis, which reveals that when running LMOS for $F_1(\epsilon_2)$, not all values of $\epsilon_2$ give a solution for $F_1(\epsilon_2)$ (please refer to Line 6 of Algorithm 1).

5. Performance Evaluation

In this section, we first describe the experimental setup of a prototype edge environment. Subsequently, we perform sensitivity analysis of LMOS followed by the performance evaluation of LMOS.

5.1. Experiment Setup

We build a prototype edge-cloud setup using Raspberry Pi4 modules and an Ubuntu server to evaluate our work. We use four Raspberry Pi4 modules as edge devices and an Ubuntu 20 system as the cloud server. The RPi4 modules have storage sizes of 16 GB, 8 GB, 4 GB, and 2 GB. However, all have the same 4 GB RAM and a quad-core 1.5 GHz processor. The cloud server is the same as that used in Section 3. It runs Ubuntu 20.04, with 8 GB RAM and an octa-core 1.5 GHz processor. The RPi4 modules and the cloud server are connected to a Wi-Fi network providing a bandwidth of 10 Mbps. We utilize pre-trained CNN models of Alexnet (21 layers) (krizhevsky2012imagenet), VGG13 (32 layers), VGG16 (38 layers), VGG19 (44 layers) (simonyan2014very), and MobileNet v2 (21 layers) (sandler2018mobilenetv2) available from PyTorch Hub (torchhub) for evaluation.

5.2. Parameter Sensitivity of LMOS

Parameter | Range | Default Value
Bandwidth | 1 - 200 Mbps | 10 Mbps
Edge Cores | 1 - 8 | 2
Edge CPU Speed | 1 - 2 GHz | 1.5 GHz
Edge Storage Size | 256 - 16000 MB | 8000 MB
Server Cores | 1 - 8 | 8
Server CPU Speed | 1.5 - 3.2 GHz | 2.6 GHz
Table 1. Range and default value of all the parameters used in the sensitivity analysis of LMOS

We run LMOS for five scenarios, in each of which we fix the number of layers in the CNN model and observe how the splitting point changes when specific parameters are varied. Table 1 lists the parameters with their range and default values (the default value is the value a parameter takes when another parameter is varied).

Figure 2. Impact of varying Bandwidth and Edge Storage
Figure 3. Impact of varying Edge and Server CPU speed

We calculate the fraction of layers computed at the edge device when we vary a particular parameter in Table 1. Figure 2(a) shows the impact of varying bandwidth and Figure 2(b) shows the impact of varying edge storage size. It is observed that at low bandwidth, LMOS prefers to compute less on the edge device, as more computation on the edge increases the intermediate output size, thus increasing the transmission time. However, after a certain value, the bandwidth does not have any impact on the CNN splitting. Similarly, at low storage size, few layers are placed at the edge; the number increases considerably as the storage size increases and then becomes constant. Another interesting observation is that for the smaller CNN, the storage size does not have any impact. This is primarily because of the low memory requirement of the smaller CNN, so the primary driving factor is the latency computation.

Figure 3 shows the impact of the computational capability of the edge device and the server on the optimization results. Intuitively, increasing the computational capability by increasing the CPU speed at the edge increases the number of layers that the edge device can compute, while the reverse happens when the computational capability of the server is increased. We obtain similar results when varying the number of cores at the edge and the server.

5.3. Performance Evaluation of LMOS

We now evaluate how splitting the CNN impacts the computation of the CNN models between the edge device and the cloud server. We show the impact of splitting on the accuracy, latency and memory utilization when varying the bandwidth between the RPi4 modules and cloud server and varying the storage size of the RPi4 modules. We run the experiments for four different models, viz., AlexNet, VGG13, VGG16, and VGG19. We evaluate all models with input images and the reported results are averaged over the runs.

Figure 4. Impact of splitting the CNN on Model Accuracy

In our results, we refer to the classification accuracy (cnnaccuracy) as the accuracy of the model. The classification accuracy is defined as the ratio of correct predictions to the total number of input samples. In order to show that splitting has no impact on the accuracy of the model, we split the models at all layers and compute the result. As can be seen in Figure 4, splitting does not affect the accuracy of the models. This confirms that splitting the CNN models between the edge and the server does not impact the model output.
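The reason splitting leaves the output unchanged is that a layered model is a composition of functions, and evaluating the first part on one machine and the rest on another is the same composition. The sketch below illustrates this with plain Python functions standing in for layers; the toy layers, inputs and labels are assumptions made for illustration, and a real deployment would additionally need the intermediate tensor to be serialized without loss.

```python
from functools import reduce
from typing import Callable, List, Sequence

Layer = Callable[[int], int]


def run_layers(layers: Sequence[Layer], value: int) -> int:
    """Apply the layers in order to a single input."""
    return reduce(lambda acc, layer: layer(acc), layers, value)


def run_split(layers: Sequence[Layer], split: int, value: int) -> int:
    """Run the first `split` layers (edge), then the remaining ones (server)."""
    intermediate = run_layers(layers[:split], value)
    return run_layers(layers[split:], intermediate)


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Ratio of correct predictions to the total number of samples."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy stand-ins for network layers (assumptions, not real CNN layers).
layers: List[Layer] = [
    lambda v: v * 2,
    lambda v: v + 3,
    lambda v: max(v, 0),
    lambda v: v % 3,
]
inputs = [-4, -1, 0, 2, 5]
labels = [0, 1, 0, 1, 2]

baseline = accuracy([run_layers(layers, x) for x in inputs], labels)
per_split = [
    accuracy([run_split(layers, s, x) for x in inputs], labels)
    for s in range(len(layers) + 1)
]
```

Every entry of `per_split` equals `baseline` (0.8 for these toy values), regardless of where the pipeline is cut.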

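Model | Bandwidth 0.5 Mbps | Bandwidth 1 Mbps | Bandwidth 5 Mbps | Bandwidth 10 Mbps | Storage 2 GB | Storage 4 GB | Storage 8 GB | Storage 16 GB
AlexNet | 8 | 18 | 19 | 19 | 10 | 13 | 17 | 19
VGG13 | 10 | 18 | 23 | 30 | 10 | 13 | 17 | 21
VGG16 | 10 | 18 | 23 | 32 | 10 | 13 | 17 | 21
VGG19 | 10 | 18 | 23 | 32 | 11 | 13 | 17 | 21
Table 2. Number of CNN layers computed at edge when varying bandwidth and varying storage size of edge device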
Figure 5. Impact of bandwidth variation

We vary the bandwidth between the 16 GB RPi4 module and the cloud server to show the change in split points and the resulting memory and latency values for these scenarios. The number of layers computed at the edge device is shown in Table 2. In Figure 5, we show the impact of bandwidth variation on latency and memory usage. The key takeaway for the latency result is that an increase in bandwidth does not always reduce the latency. For example, the latency increases for VGG19 from 1 Mbps to 5 Mbps. Such an increase could be linked to the fact that an increase in bandwidth implies that the edge can send more data to the server and hence can compute more layers. This leads to more computation at the edge device resulting in higher convolution time at the edge, thus increasing the total latency. However, an increase in bandwidth always leads to an increase in memory requirement which is due to the increase in the number of layers being computed at the edge device. VGG13 and VGG16 show a higher increase since almost all the layers of the CNN are computed at the edge device thus requiring more memory.

Figure 6. Impact of RPi storage variation

We show a similar analysis by varying the storage size of the RPi4 modules. The corresponding layers at the edge are given in Table 2. Figure 6 shows the impact on latency and memory requirement by varying the storage size at the edge. Increasing the storage leads to more computation at the edge and thus leads to higher latency; however, this need not always be true. The decrease in latency could be linked to the type of computations the edge performs (Convolution, Regularization or Pooling) since Convolution takes more time than Regularization or Pooling. It is to be noted that the memory requirement always increases with increasing storage.

5.4. Competing Approaches

We compare LMOS with four competing approaches: one based on latency optimization, two boundary approaches, and one random approach.

5.4.1. Latency Optimized Approach (LOA):

Several works (kang2017neurosurgeon; li2018edge) split the layers to optimize the latency of the system. Hence, there isn’t any optimization on utilization of the edge device.

5.4.2. Edge Computation Only (ECO):

In this approach, CNN computation is done at the edge device only with no server interaction.

5.4.3. Server Computation Only (SCO):

The entire CNN computation is done at the server; the edge device is only responsible for sending the input to the server.

5.4.4. Random Splitting (RS):

A random number is generated for each trial and the CNN is split at that layer.
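The four baselines reduce to simple rules for choosing the number of edge layers. The following sketch assumes a list `latency` indexed by the number of layers computed at the edge (index 0 meaning server only, the last index meaning edge only); the function names and toy values are hypothetical.

```python
import random
from typing import Sequence


def loa_split(latency: Sequence[float]) -> int:
    """Latency Optimized Approach: the split with the lowest total latency."""
    return min(range(len(latency)), key=lambda x1: latency[x1])


def eco_split(latency: Sequence[float]) -> int:
    """Edge Computation Only: every layer runs on the edge."""
    return len(latency) - 1


def sco_split(latency: Sequence[float]) -> int:
    """Server Computation Only: no layer runs on the edge."""
    return 0


def random_split(latency: Sequence[float], rng: random.Random) -> int:
    """Random Splitting: a uniformly random split point for each trial."""
    return rng.randrange(len(latency))


# Toy latency profile (an assumption) for a 4-layer model, in seconds.
latency = [1.0, 0.9, 1.1, 1.6, 2.4]
rng = random.Random(0)
choices = {
    "LOA": loa_split(latency),
    "ECO": eco_split(latency),
    "SCO": sco_split(latency),
    "RS": random_split(latency, rng),
}
```

None of these rules looks at edge memory usage, which is the gap the second objective of LMOS is meant to fill.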

Approach | AlexNet | VGG13 | VGG16 | VGG19
LMOS | 19 | 21 | 21 | 21
LOA | 1 | 1 | 3 | 7
ECO | 21 | 32 | 38 | 44
Table 3. Number of layers at edge for competing approaches. SCO has zero layers on edge and RS generates random layers.
Figure 7. Comparing LMOS with other approaches

We run the above approaches for the four CNN models between the 16 GB RPi4 module and the server. We evaluate all CNN models with input images and the reported results are averaged over the runs. The bandwidth between the edge device and server is 10 Mbps. Table 3 shows the layers computed at the edge for all the models except SCO and RS. There are no layers at the edge for SCO and different split layers are generated for RS in each trial.

The average values are reported in Figure 7. We observe that, as expected, RS shows promising results in some scenarios but poor results in others, and therefore isn’t a stable approach. SCO has the minimum latency but also utilizes negligible edge memory, which is undesirable. On the other hand, ECO consumes maximum memory but has higher latency than the other approaches. Although LOA has lower latency than LMOS for all the models, its memory utilization is very low, almost the same as that of SCO. In contrast to these systems, LMOS has latency comparable to LOA but higher memory utilization, making it a better alternative than the other competing algorithms.

5.5. Comparison with Edge Optimized Model

We compare LMOS with MobileNetV2 (sandler2018mobilenetv2), which is optimized for edge devices. Instead of comparing MobileNetV2 with all four CNN models, in this experiment we use VGG19, which has the highest number of layers and the maximum memory requirements. Furthermore, VGG19 also has higher accuracy than the other models. Hence, establishing that VGG19 with LMOS gives better results than MobileNetV2 would help us show that splitting is a better alternative than compressing a CNN model. We run MobileNetV2 on the RPi4 module and run the VGG19 model both with and without LMOS. We evaluate both models with input images and the reported results are averaged over the runs.

Model | Accuracy | Latency | Memory Used
VGG19 No Split | 0.91 | 12.39 s | 662 MB
MobileNetV2 | 0.83 | 1.3 s | 29.77 MB
VGG19 with LMOS | 0.91 | 5.62 s | 27.03 MB
Table 4. Comparing with edge optimized model

Comparison results for the three models are given in Table 4. VGG19 gives better accuracy than MobileNetV2 and, due to CNN splitting, VGG19 with LMOS also uses less memory than MobileNetV2. Furthermore, the total latency when using LMOS is 4.32 seconds greater than that of MobileNetV2 (5.62 s versus 1.3 s), which is a small trade-off for improving accuracy. The above result strengthens our assertion that splitting is a superior alternative for running CNNs at the edge.

6. Conclusion

In this paper, we show that running resource-intensive CNNs at the edge could be formulated as a multi-objective optimization problem that minimizes the end-to-end latency and maximizes the memory utilization at the edge device. We have proposed LMOS, an $\epsilon$-constrained algorithm to solve the optimization problem. Our experiments performed on a prototype edge environment show that LMOS provides an optimal solution for splitting the CNN. The key takeaways from the paper are: (i) splitting the CNN does not impact the model accuracy, (ii) LMOS is a better alternative to the existing splitting-based approaches since it ensures that both total latency and memory utilization are optimized, and (iii) splitting a CNN with LMOS is a superior alternative for running CNNs at the edge as compared to other edge-optimized CNN models that only run on the edge device.

There exist key aspects that need further analysis in order to improve LMOS. For instance, energy consumption is a crucial parameter when considering edge devices. Running resource-hungry applications is likely to drain the device power. Thus, including the energy metric in the optimization problem could be an important extension. Further, it is important to investigate whether LMOS could be generalised to other neural networks. Yet another direction of investigation could be the use of smartphones as edge devices with additional constraints.