AI-oriented Medical Workload Allocation for Hierarchical Cloud/Edge/Device Computing

02/10/2020 ∙ by Tianshu Hao, et al. ∙ Institute of Computing Technology, Chinese Academy of Sciences ∙ The Chinese University of Hong Kong, Shenzhen

In a hierarchically-structured cloud/edge/device computing environment, workload allocation can greatly affect the overall system performance. This paper deals with AI-oriented medical workload generated in emergency rooms (ER) or intensive care units (ICU) in metropolitan areas. The goal is to optimize AI-workload allocation to cloud clusters, edge servers, and end devices so that minimum response time can be achieved in life-saving emergency applications. In particular, we developed a new workload allocation method for the AI workload in distributed cloud/edge/device computing systems. An efficient scheduling and allocation strategy is developed in order to reduce the overall response time to satisfy multi-patient demands. We apply several ICU AI workloads from a comprehensive edge computing benchmark Edge AIBench. The healthcare AI applications involved are short-of-breath alerts, patient phenotype classification, and life-death threats. Our experimental results demonstrate the high efficiency and effectiveness in real-life health-care and emergency applications.




I Introduction

With the rapid development of user-end Internet of Things (IoT) [30, 17], edge computing emphasizes real-time computing near the end devices [29]. A common hierarchical framework of distributed cloud and edge computing consists of three layers: cloud cluster, edge server, and end devices. The principal advantage of this distributed computing paradigm is reducing the latency by reducing the time of data transmission to users [28].

Artificial Intelligence (AI) technology is widely used in edge computing to support edge intelligence [32]. There are many edge AI scenarios close to daily life, such as smart home, smart health, and smart factory. Moreover, most of them are latency-sensitive applications. Thus, reducing the response time of these edge AI applications has become an important problem.

Many corporations put effort into edge AI accelerator chips, such as Google's edge TPU [3] and Intel's neural compute stick [7], to speed up processing on the edge device. Other work [19, 31] implements light-weight models to reduce the complexity of AI models so that they can be applied to edge servers. There is also benchmarking work that evaluates the inference performance of AI workloads on edge devices [21, 18].

However, common practice usually deploys real-time processing on edge servers and does not consider changing the deployment layer for different workloads. Different allocation strategies for the AI workload greatly influence the response time, and deploying the inference process of an AI workload on the edge server layer doesn't always achieve the minimum latency. Therefore, an AI workload allocation model for latency-sensitive edge computing applications is still lacking. Given the different computational abilities of the devices at each level and the peak network bandwidth, how to make the trade-off between them remains a problem in the hierarchical edge computing framework.

In this paper, we extract medical AI workloads from Edge AIBench [13] as the problem environment. Based on the hierarchically-structured framework, this paper proposes a workload allocation strategy for the medical edge AI workload and gives a numerical interpretation of the response time of a single workload. We also propose an efficient scheduling algorithm for multi-job AI scenarios to reduce the total response time of jobs.

We attempt to evaluate the application response time in the hierarchical computing system in the following two aspects.

1) Processing time: Since the AI workloads are usually compute-intensive applications, the processing time is decided by the computing ability function.

2) Transmission time: The data transmission time is represented by the network function.

The workload allocation aims to reduce the response time for the medical latency-sensitive applications. We give a latency reduction algorithm for single and multiple workloads. Finally, we use the real-world edge AI workloads and ICU datasets from Edge AIBench [13] to validate its efficiency and effectiveness. The experimental results show that our allocation algorithm achieves the lowest response time compared with the other strategies.

The rest of this paper is organized as follows. Section II introduces the problem environment of the hierarchically-structured framework and edge AI workload. Sections III and IV present the workload allocation problem and an optimal algorithm for a single job. In sections V and VI, we present an efficient allocation and scheduling algorithm for multiple jobs. Section VII gives the experimental setup and section VIII analyzes the experimental results. Section IX introduces recent and related work. Finally, we conclude in section X.

II The Problem Environment

II-A Hierarchically-structured Cloud/Edge/Device Computing

Edge computing is a distributed computing paradigm with the three-layer framework: cloud cluster, edge computing server and user-side end devices. Figure 1 shows the hierarchically-structured cloud/edge/device computing framework of edge computing.

Fig. 1: Hierarchical Framework of Edge Computing.

The cloud cluster is a centralized server cluster located far from the edge devices. Traditionally in cloud computing, most of the computing tasks are executed in the datacenter. But the transmission latency from the cloud server to user-side devices is long, sometimes even longer than the processing time. Therefore, in the edge computing framework, the cloud server usually executes offline tasks and central control.

The edge computing layer server is closer to users than the cloud cluster; it can be a personal computer, mobile phone, router, base station, and so on. The edge server has more computational resources than user-side end devices. Online tasks are usually executed on the edge computing layer.

There is a great diversity of user-side end devices, such as smartwatches, sensors, unmanned aerial vehicles, and smart vehicles. The end devices usually gather data and conduct some simple preprocessing tasks because of their restricted computing resources.

Generally in the three-layer framework, the higher the layer, the more computational resources the device has and the faster it processes, but the longer the data transmission time. Thus, how to trade off the computational ability and data location becomes a problem.

II-B Medical AI Workloads Benchmark

Edge AIBench [13] is a comprehensive end-to-end benchmark towards AI edge computing. By surveying the edge computing scenarios extensively, Edge AIBench finds that most of these workloads use AI technology. Therefore, we focus on edge AI workloads in this paper.

Figure 2 shows the main workflow of AI workloads, including offline and online processes [25].

Fig. 2: AI Online and Offline Workflow.

Considering the complexity of the AI workload, there are high requirements for computing machines [4]. Because the cloud cluster has better computational ability, the offline processes in figure 2 are always executed on the cloud cluster in the edge computing framework. The pre-trained model is sent to the online processing device after training completes on the cloud cluster. The most common edge AI workload allocation method is to put the training on the cloud server and the inference or other online processes on the edge computing layer [27, 16].

However, where to deploy the online process greatly influences the response time once the data transmission time is considered. Moreover, many edge AI workloads are latency-sensitive applications. Therefore, response time reduction is important in edge computing workload distribution.

From 4 typical edge AI scenarios and 8 application benchmarks in Edge AIBench, we extract the ICU patient monitor as the workload allocation problem background.

ICU patient monitor is a typical latency-sensitive edge AI scenario. Because the ICU is the treatment place for critical patients, the response latency of any task is very important for doctors' further action. As figure 3 shows, there are several patients in serious condition in an ICU room. Each patient has many end devices to measure vital signs such as temperature, respiration, pulse, and heartbeat. There is one end computing device per patient and one edge server per ICU room. In addition, there is a cloud server for heavy computation tasks and data storage.

Fig. 3: ICU Patient Monitor Edge Computing Scenario in Hospital.

In this scenario, an AI model LSTM [26] is usually used as the AI network model and real-world ICU patients’ data are usually text data.

There are many applications in the ICU scenario and we choose three typical AI workloads from Edge AIBench: short-of-breath alerts, patient phenotype classification, and life-death prediction to conduct the evaluation experiments. Based on the purpose of these workloads, we set different priorities for them.

III Latency Analysis for Single Workload

III-A Basic Assumptions

As figure 1 shows, the hierarchically-structured framework consists of three levels: CC indicates cloud clusters, ES indicates edge computing servers, and ED indicates end devices. The three major factors that determine the latency of the inference process are the computational ability of each level, the size and the complexity of the model and the dataset, and the network latency. In this paper we make the following assumptions:

(a). The data is gathered from the end devices, thus data transmission needn’t be considered if the process is on the end devices.

(b). The transmission time from the cloud cluster to the end device is equal to the sum of the transmission time from the cloud cluster to the edge server and the transmission time from the edge server to the device.

(c). Considering the AI workload is compute-intensive, the computational ability of each layer is expressed by the floating point operations per second (FLOPS)[11, 33] of devices.

(d). To simplify the problem, only one cloud server and one edge server are considered in this model.

(e). The response time of any workload depends on the workload model complexity, dataset size, the device processing the workload, and the network condition.

(f). The transmission time of the inference result is much lower than the transmission time of the inference data. Therefore we don't consider the result transmission time in this problem.

Based on the above assumptions, the problem can be expressed as follows: a cloud server, an edge server, and an end device are organized following the hierarchically structured framework, as figure 1 shows. The problem is to decide on which layer the given online AI workload (such as inference) should be allocated to get the minimum response time.

From the online tasks in figure 2, we mainly consider the inference allocation problem in this paper for the actual production environment. Figure 4 shows the three different inference locations: the cloud cluster, the edge server, and the end devices.

III-B Objective Function for Latency Reduction

In order to find the optimal solution for this problem, we express it as a numerical problem. Table I shows the notations in this problem.

Notation Interpretation
the transmission time of data from end device to the execution layer i
the time of making the inference on the layer i
the size of the workload data
the FLOPs of the AI model in workload i
the FLOPS of the layer i
the weight coefficient of the data transmission time
the weight coefficient of the inference processing time
TABLE I: Notations of Workload Allocation Problem.

For different layers, the probability of whether to make inference on this layer is expressed by:


The data transmission time is expressed by:


The computational ability of the device is expressed by .

The inference processing time is expressed by:


Where the inference processing time on layer i is expressed by .

Therefore, the objective function can be expressed by:
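One possible way to write such an objective, consistent with the descriptions in Table I, is sketched here. All symbol names are assumptions introduced for illustration only: $x_i$ is the indicator of running inference on layer $i$, $T^{trans}_i$ and $T^{inf}_i$ are the transmission and inference times for layer $i$, $F$ is the FLOPs of the model, $C_i$ is the FLOPS of layer $i$, and $\alpha$, $\beta$ are the two weight coefficients. The form $T^{inf}_i = F / C_i$ is likewise an assumption, as is setting the transmission time to zero on the end device (which follows assumption (a)).

```latex
\begin{aligned}
& x_i \in \{0, 1\}, \qquad \sum_{i \in \{CC, ES, ED\}} x_i = 1, \\
& T^{inf}_i = \frac{F}{C_i}, \qquad T^{trans}_{ED} = 0, \\
& \min_{x} \; \sum_{i \in \{CC, ES, ED\}} x_i \left( \alpha \, T^{trans}_i + \beta \, T^{inf}_i \right).
\end{aligned}
```

Because exactly one $x_i$ equals 1, minimizing this sum amounts to picking the layer with the smallest weighted response time.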


III-C Performance Metrics for AI Workload

We measure the computational ability of devices from each layer by FLOPS [11, 33]. The FLOPS is calculated as the number of cores × the operating frequency × the operations per cycle.

We also measure the complexity of the AI model by FLOPs. For convolution kernels, the FLOPs are computed by multiplying the number of parameters by the size of the feature map. For fully connected layers, the FLOPs equal the number of parameters [24]. The quantities involved are the height, the width, and the number of channels of the input feature map, the kernel width, and the number of output channels for convolution layers, and the input dimensionality and the output dimensionality for fully connected layers [24].
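The two counting rules described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a square kernel, ignores bias terms, and counts one operation per parameter per output position. The function and argument names are hypothetical.

```python
def conv_flops(height, width, c_in, kernel, c_out):
    """FLOPs of a convolution layer: parameters times feature-map size.

    Assumes a square kernel and no bias term (illustrative simplification).
    """
    parameters = kernel * kernel * c_in * c_out
    return parameters * height * width


def fc_flops(input_dim, output_dim):
    """FLOPs of a fully connected layer, taken as its parameter count (no bias)."""
    return input_dim * output_dim


if __name__ == "__main__":
    # Hypothetical layer shapes, chosen only to exercise the functions.
    print(conv_flops(height=32, width=32, c_in=3, kernel=3, c_out=16))
    print(fc_flops(input_dim=128, output_dim=10))
```

Other conventions count a multiply and an add separately or include biases, which changes the totals by a constant factor but not the ranking of models by complexity.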

IV Allocation Algorithm for Latency Reduction

This section gives an optimization strategy to get the minimum latency with a given workload and devices environment. In order to solve the problem we defined, we simplify the problem by assuming that the quantity of each layer’s device is one.

For a given edge AI workload and device environment, we can calculate the computational ability of three layers and the network condition firstly. Then we can calculate the minimum response time by putting the inference in different layers. Algorithm 1 shows the procedure of our optimization strategy.

Firstly, for each workload, we analyze their computation model parameters and calculate the FLOPs needed to execute.

Secondly, we calculate the unit network transmission latency for deploying on the cloud server and the edge server, using a unit-size sample extracted from the given dataset. We can then calculate the transmission time of the given workload by multiplying by the dataset size and the weight coefficient.

Thirdly, we calculate the computational ability of the device in each layer by FLOPS. Because most AI workloads are compute-intensive, we consider only FLOPS here.

The next step is to calculate the weight coefficients for processing time and transmission time. We run an experiment that measures both times on one relatively small dataset and obtain the two coefficients by comparing them.

Next, we get the estimated response time for deploying on each layer by summing up the processing time and the transmission time.

Finally, we choose the minimum response time layer as the deployment layer for this workload.

Input: , , ,
Output: the minimum response time

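1:Calculate the number of FLOPs of the AI workload model
2:Calculate the unit network latency
3:for i = CC, ES do
4:     unit latency of layer i = latency time of unit dataset transmission
5:Calculate the computational ability of each layer's device
6:for i = CC, ES, ED do
7:     calculate the FLOPS of the device on layer i
8:Normalize the transmission time and inference time by calculating the weight coefficients
9:Calculate the inference processing time
10:for i = CC, ES, ED do
11:     calculate the inference processing time on layer i
12:Calculate the data transmission time
13:for i = CC, ES do
14:     calculate the data transmission time to layer i
15:Calculate the minimum latency
16:Initialize the minimum response time
17:for i = CC, ES, ED do
18:     response time on layer i = processing time + transmission time
19:     if the response time on layer i is lower than the minimum response time then
20:          update the minimum response time and record layer i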
Algorithm 1 Algorithm of Latency Optimization
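The procedure can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the authors' code: processing time is modeled as model FLOPs times data size divided by device FLOPS, transmission time as unit latency times data size, and the names `choose_layer`, `alpha`, and `beta` are hypothetical.

```python
def choose_layer(model_flops, data_size, layer_flops, unit_latency,
                 alpha=1.0, beta=1.0):
    """Return (best_layer, estimates) for one workload.

    layer_flops:  FLOPS of the device on each layer, e.g. {"CC": ..., "ES": ..., "ED": ...}
    unit_latency: transmission latency per unit of data for layers that need
                  a transfer; the end device is absent because its data is local.
    alpha, beta:  weight coefficients for transmission and processing time.
    """
    estimates = {}
    for layer, flops in layer_flops.items():
        processing = beta * model_flops * data_size / flops
        transmission = alpha * unit_latency.get(layer, 0.0) * data_size
        estimates[layer] = processing + transmission
    best_layer = min(estimates, key=estimates.get)
    return best_layer, estimates


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the trade-off.
    best, estimates = choose_layer(
        model_flops=100,
        data_size=10,
        layer_flops={"CC": 1000, "ES": 500, "ED": 100},
        unit_latency={"CC": 5.0, "ES": 0.5},
    )
    print(best, estimates)
```

In this toy setting the cloud is fastest at processing but pays the largest transfer cost, so the edge server wins; changing the model size or the latencies can move the minimum to another layer.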

V Workload Allocation for Multiple Jobs

V-A Multi-job Workload

Considering the real-life ICU edge computing environment, every patient has an end device to conduct preprocessing in one emergency room. And these patients share one edge server in the emergency room and one cloud server remote from the ICU. These machines, including one cloud server, one edge server, and several patients’ end devices, can be considered as several unrelated parallel machines.

Each patient's end device may release an inference job at a random time. Assuming these jobs are released in a time sequence, our goal is to minimize the overall response time of all jobs. This problem can then be considered as an unrelated parallel machine scheduling problem [35, 1].

From the above section, we can calculate the response time for every single job deployed on different layers and get the best layer of the minimum latency. And we set the following constraints in this problem:

C1. Each device can only execute one job at a time to simplify the problem.

C2. The preemption of jobs is not allowed in our system.

C3. The release time and response time are normalized as the non-zero integer units of time.

C4. The job data can be transmitted to the execution layer first and wait for executing.

C5. The workload is prioritized by its importance in real-life. The workload with the higher priority needs to be considered firstly.

Thus, the scheduling problem can be summarized as follows: there are n patients' jobs (i = 1, 2, …, n) that need to be executed in one emergency room. Each job has its own priority weight; the bigger the weight, the higher the priority of job i. These jobs can be processed on one cloud server (j = c), one edge server (j = e), or an individual patient end device (j = d). Each job has a known release time in the time sequence, and the response latency of job i executed on machine j is known from the single-workload analysis. Table II shows the notation of this problem.

The main goal of this problem is to reduce the whole response time of multiple jobs.

Notation Interpretation
i the index of the job, i = 1, 2, … , n
j the index of the machine, j = c (cloud), e (edge), d (device)
T the unit of time, T = 1, 2, … ,
the priority weight of job i
equals 1 if the job i is processed on the machine j, otherwise equals 0
equals 1 if the job i is processed on the machine j at time T, otherwise equals 0
the response time of the job i
the response time of the job i considering the weight
the release time of job i
the start processing time of job i
the completion time of job i
the whole response time of all jobs
the completion time of the last job
TABLE II: Notations of Allocation Problem for Multiple Jobs.

V-B Objective Function to Minimize Response Time

To get the minimum whole response time of all the jobs, the objective function can be expressed by:


The whole response time is calculated by summing up the response time of all jobs. The response time of job i equals its completion time minus its release time. To account for the different priorities of these jobs, we multiply the response time by the priority weight to get the weighted response time.

The two assignment variables in Table II are binary, equal to 0 or 1. One constraint ensures that each job i is processed on exactly one machine, and another ensures that only one job is executing on a machine at a time. End devices are not subject to the latter constraint because we assume every job has its own end device.

From section III we know the execution time of job i consists of the transmission time and the processing time. A job can transmit its data to a machine while another job is running on the same machine.
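A compact way to write this scheduling problem, consistent with the descriptions in Table II, is sketched here. The symbol names are assumptions introduced for illustration: $w_i$ is the priority weight of job $i$, $R_i$ its release time, $E_i$ its completion time, $x_{ij}$ the indicator that job $i$ is processed on machine $j$, and $y_{ijT}$ the indicator that job $i$ is processed on machine $j$ at time $T$.

```latex
\begin{aligned}
\min \quad & \sum_{i=1}^{n} w_i \,(E_i - R_i) \\
\text{s.t.} \quad & \sum_{j \in \{c, e, d\}} x_{ij} = 1, && i = 1, \dots, n, \\
& \sum_{i=1}^{n} y_{ijT} \le 1, && j \in \{c, e\}, \; \forall T, \\
& x_{ij}, \; y_{ijT} \in \{0, 1\}.
\end{aligned}
```

The capacity constraint is written only for the cloud and edge servers because each job is assumed to have its own end device.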

VI Workload Allocation Heuristic Algorithm

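We can change the lower bound from 0 to the sum of the minimum execution time of all the jobs. Writing $t_{ij}$ for the execution time of job i on machine j, the lower bound can be expressed by:

$$LB = \sum_{i=1}^{n} \min_{j \in \{c, e, d\}} t_{ij}$$
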
Therefore, the problem can be simplified to selecting the minimum response time of the last released job.

Because the scheduling problem for n jobs executed on m machines is very complicated, we develop a heuristic greedy algorithm to solve the problem in this section. Since each patient has their own end device, the end device is not a shared machine, and jobs can be executed on different end devices at the same time.

We can obtain the initial feasible solutions by ensuring the earliest released job to have the shortest response time. And then we optimize the solution by a neighborhood search method [23]. Algorithm 2 shows the concrete steps. And algorithm 2 needs to follow the constraints we mentioned in the previous section.

Firstly, we use algorithm 1 to calculate the estimated execution time of each job deployed on each layer. And then we normalize the response time to the integer time units. We use a heuristic greedy method to get the initial deployment strategy. We find the optimal deployment machine for each job to have the minimum completion time by time sequence. Then we obtain the initial feasible solution.

Then we generate the neighborhood solutions from the current solution by swapping the current job to another machine, and recalculate the deployment machine of the other jobs using the above heuristic greedy method. If the whole response time decreases, we swap the deployment machine of job i.

And we set max number of iterations as the stopping condition for the algorithm.

Input: ,
Output: the minimum response time of the whole jobs

1:Calculate the execution time of job i deployed on machine j, and get the response time matrix.
2:Get the weighted response time matrix by multiplying the response time matrix with the job priority weight.
4:Normalize the matrix.
7:Initialize the tabu array; 0 means the job or machine can be switched and 1 means it cannot be switched.
9:Optimize the solution by swapping a job to another machine to reduce the whole response time.
10:Set maxCount = (large enough)
11:while maxCount do
12:     Set
13:     for i = 1, 2, … , n do
14:         Set
15:         from all the jobs which choose the earliest completion job k
16:         Set
17:         Initial the max response time improvement.
18:         Set
19:         for j = cc, es, ed do
20:              if  then
21:                  Calculate the reduction when swap job i to machine j
23:                  if  then
26:         if  then
27:              Swap the job to machine               
Algorithm 2 Multi-job Allocation Heuristic Algorithm
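The greedy construction of the initial feasible solution can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: jobs are visited in release order, each job takes the machine that gives it the earliest completion time, the cloud and edge servers are shared while every job has a private end device, and data may be transmitted while the target machine is still busy (constraint C4). The neighborhood search step is omitted, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    release: int
    # machine -> (transmission time, processing time); the device needs no transfer
    cost: dict


def greedy_schedule(jobs):
    """Assign each job, in release order, to the machine that finishes it earliest."""
    free_at = {"cloud": 0, "edge": 0}  # shared machines; devices are private
    plan = {}
    for job in sorted(jobs, key=lambda j: j.release):
        best = None
        for machine, (transmission, processing) in job.cost.items():
            ready = job.release + transmission          # data has arrived
            start = max(ready, free_at.get(machine, 0))  # wait if machine is busy
            done = start + processing
            if best is None or done < best[1]:
                best = (machine, done)
        machine, done = best
        if machine in free_at:
            free_at[machine] = done
        plan[job.name] = (machine, done)
    return plan


def whole_response_time(jobs, plan):
    """Unweighted sum of completion time minus release time."""
    return sum(plan[job.name][1] - job.release for job in jobs)


if __name__ == "__main__":
    # Two hypothetical jobs competing for the same edge server.
    jobs = [
        Job("A", 0, {"edge": (1, 2), "device": (0, 4)}),
        Job("B", 0, {"edge": (1, 2), "device": (0, 4)}),
    ]
    plan = greedy_schedule(jobs)
    print(plan, whole_response_time(jobs, plan))
```

In the toy example the second job is pushed to its own device because the edge server is already occupied, which is the effect that makes the per-job optimal layer a poor choice under contention.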

VII Experimental Setup

VII-A Experimental Environment

We set the experimental environment to simulate the realistic cloud/edge/device computing scenario. We consider that the cloud server has the highest computational ability, while the edge server’s performance is lower and the end device’s performance is lowest. Meanwhile, the cloud server has a longer distance to the end device than the edge server.

We conduct our experiment on three devices respectively representing the cloud server, edge server, and end device. The cloud server has 12 2.20-GHz Intel(R) Xeon(R) Gold 5220 CPU cores and 128GB of DDR4 RAM. The edge server has 4 2.20-GHz Intel(R) Xeon(R) Gold 5220 CPU cores and 32GB of DDR4 RAM. The end device is a Raspberry Pi 4B with a quad-core 1.5-GHz Broadcom CPU and 4GB of LPDDR4 RAM.

We refer to [36] to set the network latency between the cloud server and the end device to 42ms and the network bandwidth to 2.9MB/s. We measure the network latency between the edge server and the end device in our lab LAN environment as 0.239ms and the bandwidth as 10MB/s.
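To give a feel for what these link parameters imply, the sketch below estimates a one-way transfer time as latency plus size divided by bandwidth. This simple model is an assumption made for illustration (idle link, 1 MB = 1000 KB, no protocol overhead) and is not necessarily how the transmission time is computed in this paper; the function name is hypothetical.

```python
# Link parameters quoted in the text: (latency in ms, bandwidth in MB/s).
LINKS = {"cloud": (42.0, 2.9), "edge": (0.239, 10.0)}


def transfer_ms(size_kb, layer):
    """Estimated transfer time in milliseconds for size_kb kilobytes of data."""
    if layer == "device":
        return 0.0  # data is gathered on the device, so nothing is sent
    latency_ms, bandwidth_mb_s = LINKS[layer]
    # KB divided by MB/s gives milliseconds when 1 MB = 1000 KB.
    return latency_ms + size_kb / bandwidth_mb_s


if __name__ == "__main__":
    for layer in ("cloud", "edge", "device"):
        print(layer, round(transfer_ms(700, layer), 1), "ms")
```

Under this model the bandwidth term dominates for datasets of hundreds of kilobytes, so the gap between cloud and edge grows roughly linearly with the data size.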

Then we calculate the FLOPS of each device using the processor information. Table III shows the basic computational ability of devices of each layer.

Layer CPU Cores CPU Frequency FLOPS
Cloud Server 12 2.2GHz 422.4GFLOPS
Edge Server 4 2.2GHz 140.8GFLOPS
End Device 4 1.5GHz 96GFLOPS
TABLE III: Computational Ability of Device on Each Layer.
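The FLOPS column of Table III can be reproduced from the cores × frequency × operations-per-cycle rule. The value of 16 operations per cycle used below is an assumption inferred from the table itself (it is the factor that makes all three rows match) and is not stated in the text.

```python
OPS_PER_CYCLE = 16  # assumption inferred from Table III, not stated in the text


def peak_gflops(cores, frequency_ghz, ops_per_cycle=OPS_PER_CYCLE):
    """Peak GFLOPS as cores x frequency (GHz) x operations per cycle."""
    return cores * frequency_ghz * ops_per_cycle


if __name__ == "__main__":
    devices = [("Cloud Server", 12, 2.2), ("Edge Server", 4, 2.2), ("End Device", 4, 1.5)]
    for name, cores, ghz in devices:
        print(f"{name}: {peak_gflops(cores, ghz):.1f} GFLOPS")
```

These are theoretical peak values; the throughput actually achieved during inference is usually lower.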

VII-B AI Workload in ICU Scenario

For the medical dataset, we choose MIMIC-III [20] as the real-world dataset. MIMIC-III includes vital signs and other medical information for over 60000 ICU stays of 40000 unique ICU patients. And we preprocess the original data from the MIMIC-III website [22] to get the training data.

We choose three ICU AI applications from Edge AIBench[13] for the experiment: short-of-breath alerts, patient phenotype classification, and life-death prediction.

Short-of-breath alerts uses ICU vital signs including glucose, heart rate, height, mean blood pressure, and oxygen. The purpose of this application is to predict whether the patient will suffer shortness of breath later using the LSTM model. Therefore the priority of this application needs to be set high, and we set its priority weight to 2. Using the number of parameters of the LSTM model in this application, we get 105089 FLOPs for this model.

Life-death prediction uses ICU vital signs including heart rate, height, mean blood pressure, oxygen, and other physiological records. The purpose of this application is to predict whether the patient will die in the hospital. The priority of this application also needs to be set high, so we also set its priority weight to 2. Using the number of parameters of the LSTM model in this application, we get 7569 FLOPs for this model.

Patient phenotype classification uses the complete ICU physiological record available up to that point. The purpose of this application is to conduct 25 separate binary classification tasks using the LSTM model. Because it is less urgent than the above two tasks, we set a lower priority weight of 1 for this application. Using the number of parameters of the LSTM model in this application, we get 347417 FLOPs for this model.

We implement these three ICU AI applications in Python using TensorFlow and Keras [14, 15, 12]. We train these three models offline on our cloud server to get the pre-trained models for inference.

What’s more, we set 6 different inference data sizes for ICU AI applications to get 18 different workloads. Table IV shows the concrete information of each workload.

The data sizes in table IV are proportional to the number of record files. The real sizes of these workloads' datasets are [700, 1300, 2300, 5000, 10700, 21500, 479, 950, 1900, 3900, 7800, 15900, 836, 1700, 2900, 5300, 10800, 21600] KB.

Workload No. ICU Application Data Size Model FLOPs
WL1-1 Short-of-breath alerts 64 105089
WL1-2 Short-of-breath alerts 128 105089
WL1-3 Short-of-breath alerts 256 105089
WL1-4 Short-of-breath alerts 512 105089
WL1-5 Short-of-breath alerts 1024 105089
WL1-6 Short-of-breath alerts 2048 105089
WL2-1 Life-death prediction 64 7569
WL2-2 Life-death prediction 128 7569
WL2-3 Life-death prediction 256 7569
WL2-4 Life-death prediction 512 7569
WL2-5 Life-death prediction 1024 7569
WL2-6 Life-death prediction 2048 7569
WL3-1 Patient phenotype classification 64 347417
WL3-2 Patient phenotype classification 128 347417
WL3-3 Patient phenotype classification 256 347417
WL3-4 Patient phenotype classification 512 347417
WL3-5 Patient phenotype classification 1024 347417
WL3-6 Patient phenotype classification 2048 347417
TABLE IV: AI Workload Characteristics.

VIII Experiments and Performance Analysis

VIII-A Single Medical Workload Allocation

Firstly for each workload, we calculate the estimated response time of deploying on different layers by using algorithm 1. Then we choose the optimal deployment layer referring to the results. Table V shows our estimated computation results and the best deployment layer for each workload.

Workload No. Chosen Deployment Layer Estimated Response Time for Deploying on
Cloud Server Edge Server End Device
WL1-1 Edge Server 2091 1279 1394
WL1-2 Edge Server 4182 2558 2788
WL1-3 Edge Server 8364 5116 5576
WL1-4 Edge Server 16728 10232 11152
WL1-5 Edge Server 33456 20464 22304
WL1-6 Edge Server 66912 40928 44608
WL2-1 End Device 212 109 79
WL2-2 End Device 424 218 158
WL2-3 End Device 848 436 316
WL2-4 End Device 1696 872 632
WL2-5 End Device 3392 1744 1264
WL2-6 End Device 6784 3488 2528
WL3-1 Edge Server 3115 2931 3618
WL3-2 Edge Server 6230 5862 7236
WL3-3 Edge Server 12460 11724 14472
WL3-4 Edge Server 24920 23448 28944
WL3-5 Edge Server 49840 46896 57888
WL3-6 Edge Server 99680 93792 115776
TABLE V: Estimated Response Time Using Algorithm 1 at the Cloud, Edge, and Device Levels.
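Two regularities in Table V can be checked with a short script: the estimates for each application scale linearly with the data size, and the chosen layer is the one with the smallest estimate. The sketch below takes the three data-size-64 rows from the table as given and derives the rest; the helper names are hypothetical.

```python
# Estimated response times at data size 64, copied from Table V:
# (Cloud Server, Edge Server, End Device).
BASE = {
    "WL1": (2091, 1279, 1394),  # short-of-breath alerts
    "WL2": (212, 109, 79),      # life-death prediction
    "WL3": (3115, 2931, 3618),  # patient phenotype classification
}
LAYERS = ("Cloud Server", "Edge Server", "End Device")


def estimate(app, data_size):
    """Scale the size-64 estimates linearly to another data size."""
    scale = data_size // 64
    return tuple(t * scale for t in BASE[app])


def chosen_layer(app, data_size):
    """Layer with the smallest estimated response time."""
    times = estimate(app, data_size)
    return LAYERS[times.index(min(times))]


if __name__ == "__main__":
    for app in BASE:
        for size in (64, 128, 256, 512, 1024, 2048):
            print(app, size, chosen_layer(app, size), estimate(app, size))
```

Because all three columns scale by the same factor, the chosen layer depends only on the application and not on the data size in this estimate.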

VIII-B Measured Response Time for Single Workload

To validate the computational results, we deploy each workload on each layer in turn and run the inference experiment in the real experimental environment. This gives the real response time for each workload deployed on different layers. Figure 5 shows the experimental results.

Figure 5(a) shows the results for the short-of-breath application. Deploying the workload on the edge server gives the lowest response time, and deploying on the cloud server gives the highest response time. Figure 5(b) shows that for life-death prediction, the end device is the optimal deployment layer. Figure 5(c) shows that the edge layer is the optimal deployment layer for patient phenotype classification.

By comparing table V with figure 5, we find that most of the estimated deployment layers do give the lowest response time. For WL2-1 and WL2-2, it is better to deploy on the edge server, whereas we predicted the end device. However, the response times of deploying on the edge server and on the end device are very close for these two workloads.

In general, the experimental result demonstrates the effectiveness of our optimal strategy for the single healthcare workload deployment.

Furthermore, we explore the critical factors that influence the response time using the breakdown in figure 6. We choose workloads WL1-6, WL2-6, and WL3-6 to represent the three applications. Figure 6 shows the processing time and transmission time of the three workloads.

Fig. 6: Response time breakdown of each workload.

From figure 6, we have the following observations.

The model of patient phenotype classification is more complicated than those of the other two applications, so its processing time on the end device is relatively higher. Therefore, the transmission time has a smaller influence in this case.

For life-death prediction, the optimal deployment layer is the end device. Because the number of parameters of this model is small, the end device can satisfy the requirements of this model. In this situation, workloads don't need to be offloaded to the edge or cloud server.

We can conclude that the simpler the workload model, the greater the influence of the transmission time. Therefore, computing near the user may get the lowest response time.

On the contrary, for a heavy-weight workload, being processed on the higher layer may get the lowest response time.

So for the AI workload deployment on edge computing framework, we need to estimate the computation ability of the devices on different layers and the network condition firstly. Then we can decide which layer to offload the workload by trading off between the processing time and transmission time.

VIII-C Multi-job Scheduling Strategy

To validate the scheduling algorithm, we extract 10 jobs from the above experimental workload execution time results and normalize their response times. Meanwhile, we set a release time for each workload to simulate the real-world environment. Table VI shows the scheduling experiment setting.

Job No. Release Priority Weight Deployed on Cloud Server (Processing, Transmission) Deployed on Edge Server (Processing, Transmission) Deployed on End Device (Processing)
J1 1 2 6 56 9 11 14
J2 1 2 3 32 3 6 12
J3 3 1 4 12 6 2 49
J4 5 1 7 23 11 5 69
J5 10 2 4 27 5 5 11
J6 20 2 5 70 5 14 22
J7 21 2 5 70 5 14 22
J8 21 1 4 12 6 2 49
J9 22 1 4 12 6 2 49
J10 25 1 7 23 11 5 69
TABLE VI: Processing time, transmission time and release time of jobs deployed on different layers.
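The entries of Table VI can be used directly to compute, for each job taken alone, the layer with the smallest processing plus transmission time, and to evaluate the baseline in which every job runs on its own end device. The sketch below is an illustration with hypothetical helper names; it uses the unweighted sum of response times, which is an assumption about how the whole response time of a baseline is totaled.

```python
# Rows of Table VI: release, priority weight,
# (cloud processing, cloud transmission), (edge processing, edge transmission),
# device processing.
TABLE_VI = {
    "J1": (1, 2, (6, 56), (9, 11), 14),
    "J2": (1, 2, (3, 32), (3, 6), 12),
    "J3": (3, 1, (4, 12), (6, 2), 49),
    "J4": (5, 1, (7, 23), (11, 5), 69),
    "J5": (10, 2, (4, 27), (5, 5), 11),
    "J6": (20, 2, (5, 70), (5, 14), 22),
    "J7": (21, 2, (5, 70), (5, 14), 22),
    "J8": (21, 1, (4, 12), (6, 2), 49),
    "J9": (22, 1, (4, 12), (6, 2), 49),
    "J10": (25, 1, (7, 23), (11, 5), 69),
}


def standalone_times(row):
    """Response time of one job on each layer when no other job competes."""
    _, _, cloud, edge, device = row
    return {"cloud": sum(cloud), "edge": sum(edge), "device": device}


def best_layer(row):
    times = standalone_times(row)
    return min(times, key=times.get)


def all_on_device(table):
    """(whole response time, last completion time) with every job on its own device."""
    whole = sum(row[4] for row in table.values())
    last = max(row[0] + row[4] for row in table.values())
    return whole, last


if __name__ == "__main__":
    for name, row in TABLE_VI.items():
        print(name, best_layer(row), standalone_times(row))
    print(all_on_device(TABLE_VI))
```

Since end devices are never shared, the device-only baseline needs no queueing model, which makes it a convenient sanity check for any scheduler run on this table.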

We use algorithm 2 to calculate the efficient workload deployment strategy for these 10 jobs. Firstly we calculate the initial feasible deployment strategy. We choose the best deployment layer for each workload by time sequence. Then we adjust the deployment strategy using the heuristic method iteratively.

Figure 7 shows the schedule produced by our workload allocation algorithm (Algorithm 2). The horizontal axis represents time, and the figure shows the start time and completion time of each job on its deployment layer. Our workload allocation strategy achieves a whole response time of 150, and the last completion time is 43. Four workloads are deployed on end devices, four on the edge server, and two on the cloud server. Note that the deployment layer chosen for a job is not necessarily the layer that would be optimal for that job in isolation.

Fig. 7: Allocation Strategy Using Algorithm 2.

For comparison, we calculate the response time of four other deployment strategies: deploying all the workloads on the cloud server, on the edge server, or on the end devices, and deploying each job on its individually optimal layer.

Figure 8 shows the schedule obtained by choosing the optimal deployment layer for each job individually. For job 1, the optimal deployment layer is the cloud server; for the other jobs, it is the edge server. Figure 8 shows that many jobs must wait for the preceding job to complete. This leads to long delays and indicates the necessity of our strategy.

Fig. 8: Allocation Strategy Using the Optimal Layer for Each Job.

VIII-D Measuring the Total Response Time of Multiple Jobs

We calculate the whole response time (the sum over all jobs) and the last completion time for our allocation and scheduling strategy and for the other four strategies. Table VII shows the results.

| Strategy | Whole Response Time | Last Response Time |
|---|---|---|
| Our Allocation Strategy | 150 | 43 |
| Deployed on the Optimal Layer for Each Job | 227 | 67 |
| Deployed on Cloud Server | 291 | 74 |
| Deployed on Edge Server | 416 | 100 |
| Deployed on End Device | 366 | 94 |

TABLE VII: Response Time Using Different Algorithms.
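
As an illustration of the two metrics, the sketch below computes them for the simplest baseline, in which every job runs on an end device. It assumes that each job has its own device, starts at its release time, and incurs no transmission time, and it reads the last response time as the latest completion time. The function name `device_only_metrics` is hypothetical; the release and processing times are copied from Table VI.

```python
# (release time, end-device processing time) for J1..J10, from Table VI.
jobs = [(1, 14), (1, 12), (3, 49), (5, 69), (10, 11),
        (20, 22), (21, 22), (21, 49), (22, 49), (25, 69)]


def device_only_metrics(jobs):
    """Return (whole response time, last completion time) with no queueing."""
    # With a dedicated device, the response time equals the processing time.
    whole = sum(proc for _, proc in jobs)
    # Each job completes at its release time plus its processing time.
    last = max(release + proc for release, proc in jobs)
    return whole, last


print(device_only_metrics(jobs))  # (366, 94)
```

Under these assumptions the result coincides with the End Device row of Table VII.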

Our strategy has the lowest whole response time and the lowest last response time. In terms of whole response time, our deployment strategy is 33%, 48%, 63%, and 59% lower than the other four strategies in Table VII, respectively. These comparisons demonstrate that our strategy achieves the lowest whole response time and last response time among the strategies compared.
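
The percentages above follow directly from Table VII. The short sketch below recomputes them as (baseline − ours) / baseline. The function name `reductions` is hypothetical, and the values are the whole response times from the table.

```python
OURS = 150
BASELINES = {
    "optimal layer for each job": 227,
    "cloud server": 291,
    "edge server": 416,
    "end device": 366,
}


def reductions(ours, baselines):
    """Relative reduction in whole response time, in percent."""
    return {name: 100 * (value - ours) / value
            for name, value in baselines.items()}


for name, pct in reductions(OURS, BASELINES).items():
    print(f"{name}: {pct:.1f}% lower")
```

The computed reductions are about 33.9%, 48.5%, 63.9%, and 59.0%. The figures quoted in the text correspond to these values truncated to whole percentages.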

IX Related Work

In this section, we review recent related work on workload deployment and resource allocation in edge computing.

Xu et al. [34] proposed a resource allocation model for edge computing platforms. Cao et al. [2] proposed a task allocation strategy to reduce resource consumption. However, these studies focus on resource allocation among edge computing servers and assume that the workload is deployed on the edge computing layer.

Fan et al. [9] designed an application-aware workload allocation scheme to minimize the response delay. This scheme decides which cloudlet a workload is allocated to by considering where the computing resources are allocated and the type of the application.

Chen et al. [5] proposed an optimal caching strategy for mobile services. However, this work does not consider the deployment of multiple jobs.

Feng et al. [10] proposed an optimal offloading algorithm to maximize data utility and minimize energy consumption. It encourages offloading computing tasks to the MEC server and does not consider offloading them to other layers.

Dong et al. [8] proposed a graph-cut-based solution to reduce the transmission consumption and energy consumption during offloading.

Chen et al. [6] focused on task allocation that considers the importance of each task. They proposed an allocation approach to improve computational efficiency.

To the best of our knowledge, this paper is the first research effort focusing on AI workload allocation and time-sequenced multi-job scheduling for the three-layer cloud/edge/device hierarchical framework.

X Conclusions

In this work, we focus on an AI-oriented medical workload allocation algorithm for the hierarchically structured cloud/edge/device computing framework. Our work can be applied to reduce the response time of latency-sensitive workloads such as ICU patient monitoring applications.

First, we propose a method for deploying a single workload on its optimal layer. We analyze the complexity of the model, the computing capability of the devices on the three layers, and the network conditions. We then estimate the response time on each layer by summing the processing time and the transmission time, which lets us choose the optimal deployment layer.

Building on this method, we also propose an allocation and scheduling strategy for multiple jobs released over time. The goal of this strategy is to minimize the whole response time of all jobs while taking their different priorities into account. The strategy uses a heuristic algorithm.

Finally, we conducted several experiments with medical AI applications in a cloud/edge/device environment. We chose three latency-sensitive healthcare applications (short-of-breath alerts, life-death prediction, and patient phenotype classification) and used real-world medical datasets.

Experimental results show that the allocation strategy greatly influences the response time of the workload. Comparison with other strategies demonstrates the effectiveness and efficiency of our workload allocation strategy.

XI Acknowledgements

This research is partially supported by the Shenzhen Institute of Artificial Intelligence and Robotics for Society (AIRS) and the State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences.

References


  • [1] I. M. Al-harkan and A. A. Qamhan (2019) Optimize unrelated parallel machines scheduling problems with multiple limited additional resources, sequence-dependent setup times and release date constraints. IEEE Access 7, pp. 171533–171547. Cited by: §V-A.
  • [2] C. Cao, J. Wang, J. Wang, K. Lu, J. Zhou, A. Jukan, and W. Zhao (2019) Optimal task allocation and coding design for secure coded edge computing. In 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), pp. 1083–1093. Cited by: §IX.
  • [3] S. Cass (2019) Taking ai to the edge: google’s tpu now comes in a maker-friendly package. IEEE Spectrum 56 (5), pp. 16–17. Cited by: §I.
  • [4] M. Chen, F. Herrera, and K. Hwang (2018) Cognitive computing: architecture, technologies and intelligent applications. IEEE Access 6, pp. 19774–19783. Cited by: §II-B.
  • [5] M. Chen, Y. Qian, Y. Hao, Y. Li, and J. Song (2018) Data-driven computing and caching in 5g networks: architecture and delay analysis. IEEE Wireless Communications 25 (1), pp. 70–75. Cited by: §IX.
  • [6] Q. Chen, Z. Zheng, C. Hu, D. Wang, and F. Liu (2019) Data-driven task allocation for multi-task transfer learning on the edge. In 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), pp. 1040–1050. Cited by: §IX.
  • [7] E. Di Nardo, A. Petrosino, and V. Santopietro (2018) Embedded deep learning for face detection and emotion recognition with intel© movidius (tm) neural compute stick. Cited by: §I.
  • [8] L. Dong, M. N. Satpute, J. Shan, B. Liu, Y. Yu, and T. Yan (2019) Computation offloading for mobile-edge computing with multi-user. In 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), pp. 841–850. Cited by: §IX.
  • [9] Q. Fan and N. Ansari (2018) Application aware workload allocation for edge computing-based iot. IEEE Internet of Things Journal 5 (3), pp. 2146–2153. Cited by: §IX.
  • [10] J. Feng, Q. Pei, F. R. Yu, X. Chu, and B. Shang (2019) Computation offloading and resource allocation for wireless powered mobile edge computing with latency constraint. IEEE Wireless Communications Letters. Cited by: §IX.
  • [11] M. Fernandes Nodes, sockets, cores and flops, oh, my.. Note: Jan 6, 2020 Cited by: §III-A, §III-C.
  • [12] A. Gulli and S. Pal (2017) Deep learning with keras. Packt Publishing Ltd. Cited by: §VII-B.
  • [13] T. Hao, Y. Huang, X. Wen, W. Gao, F. Zhang, C. Zheng, L. Wang, H. Ye, K. Hwang, Z. Ren, et al. (2018) Edge aibench: towards comprehensive end-to-end edge computing benchmarking. In International Symposium on Benchmarking, Measuring and Optimization, pp. 23–30. Cited by: §I, §I, §II-B, §VII-B.
  • [14] H. Harutyunyan, H. Khachatrian, D. C. Kale, G. V. Steeg, and A. Galstyan (2017) Multitask learning and benchmarking with clinical time series data. arXiv preprint arXiv:1703.07771. Cited by: §VII-B.
  • [15] H. Harutyunyan, H. Khachatrian, D. C. Kale, G. Ver Steeg, and A. Galstyan (2019) Multitask learning and benchmarking with clinical time series data. Scientific Data 6 (1), pp. 96. External Links: ISSN 2052-4463, Document, Link Cited by: §VII-B.
  • [16] M. Hosseini, T. X. Tran, D. Pompili, K. Elisevich, and H. Soltanian-Zadeh (2017) Deep learning with edge computing for localization of epileptogenicity using multimodal rs-fmri and eeg big data. In 2017 IEEE International Conference on Autonomic Computing (ICAC), pp. 83–92. Cited by: §II-B.
  • [17] K. Hwang and M. Chen (2017) Big-data analytics for cloud, iot and cognitive computing. John Wiley & Sons. Cited by: §I.
  • [18] A. Ignatov, R. Timofte, W. Chou, K. Wang, M. Wu, T. Hartley, and L. Van Gool (2018) Ai benchmark: running deep neural networks on android smartphones. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 0–0. Cited by: §I.
  • [19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell (2014) Caffe: convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 675–678. Cited by: §I.
  • [20] A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark (2016) MIMIC-iii, a freely accessible critical care database. Scientific data 3, pp. 160035. Cited by: §VII-B.
  • [21] C. Luo, F. Zhang, C. Huang, X. Xiong, J. Chen, L. Wang, W. Gao, H. Ye, T. Wu, R. Zhou, et al. (2018) AIoT bench: towards comprehensive benchmarking mobile and embedded device intelligence. In International Symposium on Benchmarking, Measuring and Optimization, pp. 31–35. Cited by: §I.
  • [22] MIMIC-iii database. Note: Jan 6, 2020 Cited by: §VII-B.
  • [23] N. Mladenović and P. Hansen (1997) Variable neighborhood search. Computers & operations research 24 (11), pp. 1097–1100. Cited by: §VI.
  • [24] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz (2016) Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440. Cited by: §III-C.
  • [25] V. C. Müller and N. Bostrom (2016) Future progress in artificial intelligence: a survey of expert opinion. In Fundamental issues of artificial intelligence, pp. 555–572. Cited by: §II-B.
  • [26] C. Olah (2015) Understanding lstm networks. Cited by: §II-B.
  • [27] A. M. Rahmani, T. N. Gia, B. Negash, A. Anzanpour, I. Azimi, M. Jiang, and P. Liljeberg (2018) Exploiting smart e-health gateways at the edge of healthcare internet-of-things: a fog computing approach. Future Generation Computer Systems 78, pp. 641–658. Cited by: §II-B.
  • [28] M. Satyanarayanan (2017) The emergence of edge computing. Computer 50 (1), pp. 30–39. Cited by: §I.
  • [29] W. Shi and S. Dustdar (2016) The promise of edge computing. Computer 49 (5), pp. 78–81. Cited by: §I.
  • [30] S. Smith ‘INTERNET of things’ connected devices to almost triple to over 38 billion units by 2020. Note: Jan 10, 2020 Cited by: §I.
  • [31] Tensorflow lite. Note: Jan 10, 2020 Cited by: §I.
  • [32] X. Wang, Y. Han, C. Wang, Q. Zhao, X. Chen, and M. Chen (2019) In-edge ai: intelligentizing mobile edge computing, caching and communication by federated learning. IEEE Network 33 (5), pp. 156–165. Cited by: §I.
  • [33] S. Williams, A. Waterman, and D. Patterson (2009) Roofline: an insightful visual performance model for floating-point programs and multicore architectures. Technical report Lawrence Berkeley National Lab.(LBNL), Berkeley, CA (United States). Cited by: §III-A, §III-C.
  • [34] J. Xu, B. Palanisamy, H. Ludwig, and Q. Wang (2017) Zenith: utility-aware resource allocation for edge computing. In 2017 IEEE International Conference on Edge Computing (EDGE), pp. 47–54. Cited by: §IX.
  • [35] X. Zheng and L. Wang (2016) A two-stage adaptive fruit fly optimization algorithm for unrelated parallel machine scheduling problem with additional resource constraints. Expert Systems with Applications 65, pp. 28–39. Cited by: §V-A.
  • [36] A. C. Zhou, Y. Gong, B. He, and J. Zhai (2017) Efficient process mapping in geo-distributed cloud data centers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 16. Cited by: §VII-A.