Skedulix: Hybrid Cloud Scheduling for Cost-Efficient Execution of Serverless Applications

06/05/2020 · Anirban Das et al. · Rensselaer Polytechnic Institute

We present a framework for scheduling multifunction serverless applications over a hybrid public-private cloud. A set of serverless jobs is input as a batch, and the objective is to schedule function executions over the hybrid platform to minimize the cost of public cloud use, while completing all jobs by a specified deadline. As this scheduling problem is NP-hard, we propose a greedy algorithm that dynamically determines both the order and placement of each function execution using predictive models of function execution time and network latencies. We present a prototype implementation of our framework that uses AWS Lambda and OpenFaaS for the public and private cloud, respectively. We evaluate our prototype in live experiments using a mixture of compute and I/O heavy serverless applications. Our results show that our framework can achieve a speedup in batch processing of up to 1.92 times that of an approach that uses only the private cloud, at 40.5% the cost of an approach that uses only the public cloud.


I Introduction

Serverless computing is an emerging paradigm for the deployment of applications and services [14, 10, 6]. A serverless application consists of a collection of stateless functions that transfer information between them using platform services, such as databases and object stores. The serverless paradigm provides benefits including ease of development and deployment, as well as seamless elasticity, since an application can be scaled up by executing multiple copies of the same function in parallel.

In public cloud serverless platforms, such as AWS Lambda [3] and Azure Functions [5], the cloud provider maintains the platform and its underlying infrastructure. Each function is executed in its own sandboxed container, and the platform can scale up the application to meet demand by simply deploying more containers. The application owner pays for the time during which the functions execute, as well as for any services the functions use.

Serverless applications can also be deployed in a private cloud, i.e., a set of servers that is maintained by the application owner, using frameworks like OpenFaaS [20] and Apache OpenWhisk [2]. This private cloud may consist of privately owned hardware that is kept on the application owner's premises or of leased hardware at an offsite location. A private cloud offers several benefits to application owners. Sensitive data can be processed on protected servers without sending it to the public cloud, thus preserving user privacy [24, 23]. Further, by processing jobs on site, the end-to-end application latency may be reduced. Finally, a private cloud offers cost benefits, especially for long-running jobs and services. However, because a private cloud has limited resources, it may not be possible to meet service level agreement (SLA) requirements for all workloads using the private cloud alone.

To leverage the benefits from both the private and public clouds, an application owner can use a hybrid cloud approach [23]. In this model, an application owner maintains a private cloud that is resourced, for example, to process typical workloads with the desired performance guarantees. When the workload increases beyond the capacity of the private cloud, to maintain quality of service, some jobs, or parts of jobs, can be dispatched to the public cloud, with an incurred cost. This approach gives an application owner the privacy, performance, and cost benefits of a private cloud, while eliminating the need to maintain hardware for peak workloads that would otherwise go unused. The challenge is then to determine which jobs to run in the private cloud and which to run in the public cloud to achieve both high performance and low cost.

We propose Skedulix, a framework for scheduling multi-function serverless applications over a hybrid cloud. We consider a batch input scenario, where the workload consists of a set of serverless jobs, and the objective is to schedule function executions over the hybrid platform to minimize the cost of public cloud use while meeting a user-specified deadline. This scheduling and assignment problem can be realized as a Mixed Integer Linear Program (MILP) and can be proved to be NP-hard. Therefore, we propose a greedy approach, with two variations for cost minimization. Further, we implement and evaluate our approach in live experiments.

Specifically, our contributions are as follows. (i) We formalize a framework for scheduling multi-function applications in the hybrid cloud setting to minimize the cost of execution in the public cloud. (ii) We then propose a greedy algorithm that dynamically determines both the order and placement of each function in the hybrid cloud. (iii) For scheduling, our algorithm requires accurate models of function execution time, data transfer times, and intermediate data sizes. We demonstrate how to generate such application-specific models from training data obtained using AWS Lambda in the public cloud and OpenFaaS in our private cloud. (iv) We present a prototype implementation of our framework that runs within this hybrid cloud platform. (v) We then evaluate the performance of our algorithm on this prototype using three canonical examples with mixed compute and I/O heavy workloads. Our results show that our framework can achieve a speedup in batch processing of up to 1.92 times that of an approach that uses only the private cloud, at 40.5% the cost of an approach that uses only the public cloud.

The remainder of this paper is organized as follows. Sec. II describes the platform model, outlines the problem we address, and describes the necessary elements for scheduling. In Sec. III, we present the scheduler design and algorithm. In Sec. IV, we give the details of our framework implementation, including how we generate our performance models. Sec. V presents an experimental evaluation of our framework. Sec. VI summarizes related work, and we conclude in Sec. VII.

II System Overview

Fig. 1: An example DAG representing precedence constraints in a Video Processing application.

II-A System Model

We model applications as directed acyclic graphs (DAGs); each node is a function, or stage, and the DAG identifies the partial order in which the stages must execute. A sample DAG for a serverless Video Processing application is shown in Fig. 1, where the red arrows represent precedence constraints (but no conditionals) between the stages of the workflow. For example, the detectObject and rescaleImage stages start only after the extractFrames stage completes, and the merger stage starts only after both detectObject and rescaleImage complete.
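For illustration, the following minimal sketch encodes the precedence constraints of the Video Processing DAG of Fig. 1 as a predecessor map; the stage names follow Sec. V-A1, and the representation is illustrative rather than the framework's internal data structure.

```python
# Minimal sketch: the Video Processing DAG from Fig. 1 as a predecessor map
# (stage name -> set of stages that must finish first).
VIDEO_DAG = {
    "extractFrames": set(),                        # first stage, no predecessors
    "detectObject":  {"extractFrames"},
    "rescaleImage":  {"extractFrames"},
    "merger":        {"detectObject", "rescaleImage"},
}

def runnable(done):
    """Return the stages whose precedence constraints are satisfied."""
    return [s for s, preds in VIDEO_DAG.items()
            if s not in done and preds <= done]

# Example: after extractFrames completes, detectObject and rescaleImage
# can run in parallel; merger must wait for both.
print(runnable({"extractFrames"}))   # ['detectObject', 'rescaleImage']
```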

We consider a hybrid cloud setting, where there are serverless platforms in both the private and the public cloud. In the private cloud, we assume that there is a fixed number of replicas allocated to each stage. For a stage $i$, we denote the number of deployed replicas by $r_i$. Each replica can execute a single function at a time, so that up to $r_i$ functions of stage $i$ may execute in parallel in separate containers, where the containerization provides resource isolation. The private cloud contains a storage service to store inputs and results, as shown in Fig. 1. We assume that the public cloud has unlimited capacity and thus executes functions in parallel as needed with no performance degradation. The public cloud also provides a storage service, and data may be transferred between private and public cloud storage if necessary.

We assume the cost of executing a function in the private cloud to be zero. The public execution cost of a function (in USD) is determined from the compute latency of the public cloud function, using the AWS Lambda cost model:

$$\text{cost}_{i,j} = \left\lceil \frac{t^{pub}_{i,j}}{100} \right\rceil \cdot \frac{100}{1000} \cdot \frac{m_i}{1024} \cdot \gamma \qquad (1)$$

where $t^{pub}_{i,j}$ is the execution time of the function in milliseconds (rounded up to the nearest 100 ms by the ceiling), $m_i$ is the memory configuration in MB of the Lambda function, and $\gamma$ is the AWS Lambda price per GB-second. We note that our framework can be trivially extended to any cost model that is a deterministic function of the execution time.
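As a concrete illustration of Eqn. 1, the following sketch computes the public execution cost of a single function run; the per-GB-second price is an assumed placeholder and should be replaced with the rate in effect.

```python
import math

# Assumed placeholder for the AWS Lambda price per GB-second (USD);
# substitute the rate in effect for your account and region.
GB_SECOND_PRICE = 0.0000166667

def lambda_cost(exec_time_ms: float, memory_mb: int) -> float:
    """Public cloud cost of one function execution under Eqn. 1.

    exec_time_ms: measured (or predicted) execution time in milliseconds.
    memory_mb:    configured memory of the Lambda function in MB.
    """
    billed_seconds = math.ceil(exec_time_ms / 100) * 100 / 1000.0  # round up to 100 ms
    gb = memory_mb / 1024.0                                         # memory in GB
    return billed_seconds * gb * GB_SECOND_PRICE

# Example: a 230 ms run at 2048 MB is billed as 300 ms of 2 GB.
print(round(lambda_cost(230, 2048), 8))
```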

II-B Problem Formulation

We consider a scenario where there is a batch workload of jobs to be processed. Each job corresponds to a single execution of a serverless application, i.e., a DAG of stages. The input data for the workload is available in the private cloud at the beginning. A stage of a job can execute in either the private or the public cloud. The objective is to process all jobs with minimal cost and store the results in the private cloud by a given deadline $T$. This deadline is also referred to as the makespan deadline of the batch. Intuitively, private cloud resources should be used as much as possible to reduce cost. If there are not sufficient computing resources to meet the deadline, function executions can be offloaded to the public cloud. The challenge is to identify which function executions to offload so as to minimize cost while meeting the deadline.

To achieve this, we need a schedule. We define a schedule to be both an order of job executions at each stage and an assignment of each execution to either a private cloud replica or the public cloud.

This scheduling and assignment problem can be formalized as a MILP and fed into a standard solver such as Gurobi [12]. However, the problem is NP-hard, and hence it is not tractable to find an optimal solution for even moderately sized workloads. We therefore develop a greedy scheduling algorithm. A detailed formulation of the MILP and the proof of NP-hardness are provided in the Appendix.
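For intuition only, the following is a highly simplified sketch of how such an assignment model could be expressed with the Gurobi Python API; it captures only the offloading decision and a relaxed per-stage capacity constraint, not the full MILP with sequencing that the Appendix formalizes, and all variable and function names are ours.

```python
import gurobipy as gp
from gurobipy import GRB

def solve_assignment(jobs, stages, t_pvt, cost_pub, replicas, successors, T):
    """Sketch: choose x[j, i] = 1 to run stage i of job j in the public cloud,
    minimizing public cost while the private work per stage fits in r_i * T."""
    m = gp.Model("hybrid-assignment")
    x = m.addVars(jobs, stages, vtype=GRB.BINARY, name="x")

    # Objective: total public cloud cost of the offloaded stage executions.
    m.setObjective(gp.quicksum(cost_pub[j, i] * x[j, i]
                               for j in jobs for i in stages), GRB.MINIMIZE)

    # Relaxed capacity: private work kept at stage i must fit in r_i * T.
    for i in stages:
        m.addConstr(gp.quicksum(t_pvt[j, i] * (1 - x[j, i]) for j in jobs)
                    <= replicas[i] * T, name=f"cap_{i}")

    # Once a stage of a job is offloaded, its downstream stages stay public
    # (mirrors the rule in Sec. III-A).
    for j in jobs:
        for i in stages:
            for s in successors.get(i, []):
                m.addConstr(x[j, s] >= x[j, i])

    m.optimize()
    return {(j, i): x[j, i].X > 0.5 for j in jobs for i in stages}
```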

In the next section, we describe our scheduler design and scheduling algorithm. The scheduler requires accurate estimates of the application execution latencies in both the private and public clouds, as well as the data transfer time between them. We give the details of how we create models to generate these estimates in Sec. IV-B.

III Scheduler Design and Scheduling Algorithms

III-A Scheduler Overview

The scheduler is designed as a long-running service inside the private cloud. On receipt of a batch job execution request, the scheduler uses the algorithm described in the next section to decide where to run the stages of each job and in which order. The scheduler has one process for each stage in the application, as shown in Fig. 2. Each process maintains a queue of uncompleted jobs for its corresponding stage. When a private cloud replica for that stage is available, the scheduler process dequeues the job at the head of the queue and dispatches it to that replica. We explore several priority orders for the queue, designed to minimize the cost of public cloud utilization. We detail these orders in Sec. III-C.

The scheduler process also monitors the queue and offloads jobs to the public cloud when necessary to meet the makespan deadline. When a private cloud replica completes executing a job stage, the job is added to the queue(s) of the next stage(s) in the application, as specified by the application’s DAG. If a job stage is offloaded to the public cloud, the downstream stages for the job are also executed in the public cloud.

III-B Scheduling Algorithm

The pseudocode for our scheduling algorithm is given in Alg. 1. Let $S$ denote the set of all stages in the application, and let $J$ denote the input set of jobs. Let the batch start executing at time $t = 0$. We let $t^{pvt}_{i,j}$ denote the estimated latency of executing stage $i$ of job $j$ in the private cloud, and $t^{pub}_{i,j}$ the estimated latency of executing the same stage of job $j$ in the public cloud.

1: Initialization:
2:     C ← T · (sum over stages i in S of r_i)
3:     for each job j in J do
4:         t_j ← sum over stages i in S of t^{pvt}_{i,j}
5:     end for
6:     J ← J sorted in priority order
7:     O ← jobs removed from the tail of J until the sum of t_j over the remaining jobs ≤ C
8:     Dispatch jobs in O to public cloud
9:     J ← J \ O
10:    Put jobs in J in priority queue(s) Q_i for first stage(s)
11: At each stage i:
12:     on replica availability:
13:         Dequeue head of Q_i and dispatch to replica
14:     on add or remove from Q_i:
15:         Make a copy Q'_i of Q_i to loop over
16:         for each job j in Q'_i do
17:             if ACD_{i,j} < 0 then
18:                 dequeue j and dispatch to public cloud
19:         end for
20:         Sync Q_i with Q'_i
Algorithm 1 Scheduling algorithm

In the initialization phase, before executing any job, we first get a rough estimate of the number of jobs that can fit within the computing capacity of the private cloud, and we offload any jobs that cannot fit to the public cloud. We calculate this computing capacity, $C$, as

$$C = T \sum_{i \in S} r_i,$$

i.e., $C$ is the total computing time available in the private cloud if all replicas in all stages execute for the entire makespan duration $T$. We then find the total estimated runtime of each job $j$ in the private cloud as $t_j = \sum_{i \in S} t^{pvt}_{i,j}$. The scheduler immediately offloads a set of jobs $O$, in priority order, until the sum of the execution times of the remaining jobs is less than or equal to $C$. The jobs are offloaded from the tail of the priority queue during the initialization phase. This initialization phase is shown in Lines 2-10. We note that this initial offloading to the public cloud will not be sufficient to meet the deadline, since processing the remaining jobs $J \setminus O$ in the private cloud would require 100% utilization of all replicas for the entire duration until $T$. This is not possible due to both the framework overhead and the stage precedence constraints. Therefore, we must offload additional job stages throughout the batch execution. We do this adaptively, as described next.
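A minimal Python sketch of this initialization step (Lines 2-10), assuming the per-job private cloud runtime estimates and the priority ordering have already been computed, is given below; the names are illustrative.

```python
def initial_offload(jobs, runtime_pvt, replicas_per_stage, deadline):
    """Sketch of Alg. 1, Lines 2-10: pick the set O to offload up front.

    jobs:               job ids already sorted in priority order
                        (head = keep in the private cloud longest).
    runtime_pvt[j]:     estimated total private cloud runtime of job j,
                        i.e. the sum of its per-stage estimates.
    replicas_per_stage: iterable of replica counts r_i.
    deadline:           the makespan deadline T (seconds).
    """
    capacity = deadline * sum(replicas_per_stage)      # C = T * sum_i r_i
    remaining = sum(runtime_pvt[j] for j in jobs)

    offloaded = []
    kept = list(jobs)
    # Offload from the tail of the priority queue until the rest fits in C.
    while kept and remaining > capacity:
        j = kept.pop()                                  # tail of the queue
        offloaded.append(j)
        remaining -= runtime_pvt[j]
    return kept, offloaded
```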

For each stage $i$, its scheduling process monitors for jobs that it estimates may not complete by the deadline, and it offloads them to the public cloud. To determine whether a job will result in a violation, we use the apparent closeness to deadline, $ACD_{i,j}$. The scheduler computes the estimated latency along the longest-latency path from the current stage to the final stage(s) and checks whether there is sufficient time remaining to execute all stages on this path for the job. We denote the set of stages along this path by $P_i$. At time $t$, $ACD_{i,j}$ is computed as follows:

$$ACD_{i,j} = \tau \;-\; \frac{1}{r_i} \sum_{j' \in Q_i,\, j' \prec j} t^{pvt}_{i,j'} \;-\; \sum_{i' \in P_i} t^{pvt}_{i',j},$$

where $\tau = T - t$. Here, $\tau$ indicates the time remaining for execution before the deadline. The first summation is the estimated current queue delay: the sum of the private cloud execution latencies of the jobs ahead of job $j$ in $Q_i$, divided equally among the $r_i$ available replicas. This estimate holds under the assumption that the total work at stage $i$ is evenly distributed among all replicas. We also add $\sum_{i' \in P_i} t^{pvt}_{i',j}$ to get an optimistic estimate of the total time needed to complete job $j$ in the private cloud. If $ACD_{i,j}$ is negative, we estimate that job $j$ will not complete by the deadline, and thus it should be offloaded to the public cloud.

Whenever there is a change in the priority queue for a stage, i.e., a job is added or removed, the value of $ACD_{i,j}$ is computed for all the jobs in the queue by looping over a local copy ($Q'_i$) of $Q_i$. If $ACD_{i,j} < 0$, the job is dispatched to the public cloud. These steps are shown in Lines 14-20. Finally, we update $Q_i$ with the jobs in $Q'_i$ in the correct order. A job remains in the priority queue until it is dispatched to a replica in the private cloud or until its $ACD_{i,j}$ becomes negative and it is offloaded to the public cloud. Since these steps are executed at every stage, this approach allows us to offload enough jobs to the public cloud, in priority order, such that the remaining jobs can finish in the private cloud by the deadline.
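The following sketch, with illustrative names, shows how a stage's scheduler process might evaluate $ACD_{i,j}$ for each queued job and select the ones to offload, mirroring Lines 14-20.

```python
def acd(now, deadline, queue, job, replicas, t_pvt, path_stages, stage):
    """Apparent closeness to deadline for `job` at `stage` (sketch).

    queue:       jobs currently queued at this stage, in priority order.
    t_pvt[i][j]: estimated private cloud latency of stage i for job j.
    path_stages: stages on the longest-latency path from `stage` to the end.
    """
    remaining = deadline - now                                   # tau = T - t
    ahead = queue[:queue.index(job)]                             # jobs before `job`
    queue_delay = sum(t_pvt[stage][j] for j in ahead) / replicas
    path_time = sum(t_pvt[i][job] for i in path_stages)
    return remaining - queue_delay - path_time

def jobs_to_offload(now, deadline, queue, replicas, t_pvt, path_stages, stage):
    """Return the queued jobs whose ACD is negative (Alg. 1, Lines 14-20)."""
    return [j for j in list(queue)                               # loop over a copy
            if acd(now, deadline, queue, j, replicas, t_pvt,
                   path_stages, stage) < 0]
```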

III-C Priority Queue Sort Orders

The selection of which jobs to offload to the public cloud is made based on their order in the priority queue. We consider the following priority orders.

Highest Cost First order (HCF)

In this method, jobs are ordered by their cost of execution in the public cloud, with the most expensive jobs at the head of the queue. We can obtain the cost of executing stage $i$ of job $j$ in the public cloud from $t^{pub}_{i,j}$ using Eqn. 1. Here, the less expensive jobs are offloaded to the public cloud first.

Shortest Processing Time order (SPT)

Here, we order jobs in Shortest Processing Time (SPT) order, i.e., the jobs with lower processing times are always towards the head of the priority queue. When necessary, we offload the jobs with longer processing times, which sit towards the tail of the priority queue, to the public cloud. Since AWS rounds up the execution time of Lambda functions to the nearest 100 ms, we observe that if we offload jobs with longer durations to the cloud, the rounding penalty is a smaller fraction of the total cost, so relatively less budget is wasted on those jobs. Further, executing longer jobs in the public cloud exploits the cloud's parallelism without negatively affecting the makespan of the entire batch of jobs.
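A minimal sketch of the two priority orders follows, assuming per-job estimates of public cloud cost (Eqn. 1) and total private cloud processing time are available; in both cases, jobs are offloaded from the tail of the sorted queue.

```python
def hcf_order(jobs, cost_pub):
    """Highest Cost First: most expensive public cloud jobs at the head,
    so the cheapest jobs (at the tail) are offloaded first."""
    return sorted(jobs, key=lambda j: cost_pub[j], reverse=True)

def spt_order(jobs, runtime_pvt):
    """Shortest Processing Time: shortest jobs at the head, so the
    longest jobs (at the tail) are offloaded first."""
    return sorted(jobs, key=lambda j: runtime_pvt[j])
```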

IV Framework Implementation

In this section, we describe the architecture of our framework, Skedulix, and we provide details of our prototype implementation. We also summarize how we generate the performance models used in our scheduling algorithm.

IV-A Framework Architecture

The framework architecture is shown in Fig. 2. For the public cloud deployment, we use AWS Lambda to execute functions corresponding to the stages, and we use AWS S3 to store inputs and intermediate results. To implement function chaining in AWS, we create unique S3 buckets that trigger their corresponding functions; one function triggers another by storing its output in the next function's input S3 bucket. If, at any stage, a job is scheduled to be executed in the public cloud, the corresponding scheduler process uploads the appropriate raw input to the specific S3 bucket in the public cloud. This raw input upload subsequently triggers the corresponding Lambda function for that stage. For DAGs with parallel stages, we use AWS Step Functions [4].
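The sketch below illustrates the general shape of such a chained Lambda handler: it reads the object that triggered it, processes it, and writes the result to the next stage's input bucket. The bucket name, the process_stage helper, and the key layout are assumptions for illustration, not the framework's actual code.

```python
import boto3

s3 = boto3.client("s3")
NEXT_STAGE_BUCKET = "skedulix-next-stage-input"   # assumed bucket name

def process_stage(path):
    """Hypothetical placeholder for the stage's actual computation."""
    return path

def handler(event, context):
    # The S3 event that triggered this function identifies the input object.
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]

    local_in = "/tmp/" + key.rsplit("/", 1)[-1]
    s3.download_file(bucket, key, local_in)

    local_out = process_stage(local_in)

    # Writing the output to the next stage's input bucket triggers the
    # next Lambda function in the chain.
    s3.upload_file(local_out, NEXT_STAGE_BUCKET, key)
    return {"status": "ok", "output_key": key}
```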

For the private cloud platform, we use OpenFaaS as the serverless platform and Minio [19] for storage. We deploy OpenFaaS on top of the Kubernetes container orchestration system [16]. When a function is deployed in OpenFaaS, the platform creates a function instance that runs inside a Kubernetes pod and exposes an API address; invoking that address executes the deployed function. Since we need each replica of a function to be separately addressable by our custom scheduler, we configure our system such that each OpenFaaS function instance has exactly one pod at any time. We then create replicas of each function by deploying versions of the same function, each of which can be uniquely addressed via simple HTTP calls.
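The sketch below illustrates how a scheduler might address an individual replica over HTTP in this setup; the gateway address and the per-replica function naming (a stage name suffixed with a replica index) are assumptions reflecting the versioning scheme described above.

```python
import requests

GATEWAY = "http://127.0.0.1:8080"        # assumed OpenFaaS gateway address

def invoke_replica(stage, replica_idx, payload: bytes, timeout=300):
    """Synchronously invoke one addressable replica of a stage's function.

    Assumes replicas were deployed as separately named functions,
    e.g. 'detectobject-0', 'detectobject-1', per the scheme in the text.
    """
    url = f"{GATEWAY}/function/{stage}-{replica_idx}"
    resp = requests.post(url, data=payload, timeout=timeout)
    resp.raise_for_status()
    return resp.content

# Example (hypothetical names): dispatch a job's input to replica 1 of
# the detectObject stage.
# result = invoke_replica("detectobject", 1, b"s3://bucket/frames.zip")
```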

The input data for the batch workload is stored in a Minio bucket in the private cloud. The scheduler then uses Alg. 1 to execute jobs in the private and public clouds. All interactions between function replicas and the scheduler are done via HTTP requests. This includes a replica notifying the scheduler of its availability and notifying the scheduler to execute downstream stages when a job stage completes. If the last stage of the application executes in the public cloud, the last function notifies the scheduler when it completes. The scheduler then downloads the results from the S3 bucket to a Minio bucket in the private cloud.

Fig. 2: Skedulix framework architecture and the scheduler implementation on OpenFaaS.

IV-B Performance Modeling

To execute jobs in our framework using Alg. 1, we need estimates of the application execution latencies in the private cloud ($t^{pvt}_{i,j}$) and the public cloud ($t^{pub}_{i,j}$). To estimate these quantities, we train performance models using traces gathered from executing a substantial number of jobs on AWS and in our private cloud. We use Python and the scikit-learn library for training and subsequently tuning the models via cross-validation, as described next. We create separate models for each stage of an application, where each model makes a prediction for each job.

For the private cloud, for each stage $i$, we model the execution latency $t^{pvt}_{i,j}$ as the sum of the following components:

  1. Function compute time: This is parameterized by function input properties, e.g., the file size or data dimension.

  2. Framework overhead: This is modeled as the mean over the training dataset. Its magnitude for each application stage is on the order of 15-20 ms.

For the public cloud, for each function, we model $t^{pub}_{i,j}$ as a linear function of the input features (e.g., file size or data dimension).

For all stages except the first stage of an application, we must also predict the size or properties of the input data for that stage, i.e., the size of the output data from the preceding function(s), since the latency performance models are parameterized by these properties. For example, in the Video Processing application shown in Fig. 1, to model the latency of the detectObject function, we need the characteristics of the output files of the extractFrames function. Therefore, for each function, we create a model that predicts the size of its output as a function of the size of its input.

For all of the above quantities, except the framework overhead, we create our models using regularized ridge regression over the training data.
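As an illustration, the sketch below trains one such per-stage latency model with scikit-learn, using ridge regression with the regularization strength tuned by five-fold cross-validated grid search (as in Sec. V-A2); the feature and data here are placeholders.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# X: per-job input features for one stage (e.g. file size in bytes),
# y: measured execution latency of that stage (ms).  Placeholder data.
X = np.random.rand(200, 1) * 1e6
y = 50 + 2e-4 * X[:, 0] + np.random.randn(200) * 5

pipe = make_pipeline(StandardScaler(), Ridge())
search = GridSearchCV(
    pipe,
    param_grid={"ridge__alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)

# The fitted model is then used by the scheduler to estimate the latency
# of each incoming job from its input properties.
predicted_latency_ms = search.predict(np.array([[750_000.0]]))
```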

V Experimental Evaluation

In this section, we present an evaluation of our framework using three serverless applications. We first describe these applications, with the corresponding workloads and infrastructure setup in Sec. V-A. We describe the performance modeling and error in Sec. V-B and present the results from using our framework on these applications in Sec. V-C.

V-A Experimental Environment and Infrastructure

V-A1 Serverless Applications

We employ three different applications that exhibit different canonical behaviors in terms of resource utilization. The first, the Matrix Processing application, is compute-heavy, with minimal I/O. The second, Video Processing, consists of a mixture of compute-heavy and I/O-bound functions. The third is the Image Processing application, which is I/O-heavy, with smaller compute requirements.

Matrix Processing Application: The application consists of two stages. The first stage is a matrix multiplication stage (MM). This stage takes as input a matrix in a CSV-formatted text file and computes a product matrix by multiplying the matrix with its transpose. The resulting matrix is saved to storage in the same format. The second stage (LU) takes in this product matrix, computes an LU decomposition, and stores the results. The input matrices are random integer matrices of varying dimensions. This is a synthetic workload which we use as a characterization of a compute-heavy extract-transform-load (ETL) workload.

Video Processing Application: This is a Video Processing application with four stages, as shown in Fig. 1. Key frames are first extracted from the input video clips using the extractFrames (EF) function, and the resulting images are stored in blob storage as a zip file. The detectObject (DO) function then loads the images from storage, detects the objects in each image, and saves the resulting coordinates and inference results in a text file. The rescaleImage (RI) function rescales the images to a lower resolution and saves the resulting images to storage in a single zip file. Finally, the merger (ME) function combines the results of the DO and RI functions into another zip file and saves it to storage. This application is representative of a traffic surveillance application that detects objects, such as cars, buses, and bikes, in video frames.

For our evaluation, we rescale each image to half its original resolution in the RI stage. Further, in the EF stage, we extract one key frame per second from the video files. For the input dataset, we use videos from the BDD100K database [26]. All input videos in our experiments have a duration of 10 s.

Image Processing Application: This application has three stages. The input to this application is an image file of arbitrary size. The first stage is a rotate function, which rotates the image. The output is an image file of similar, but non-identical, size to the input. The second stage is a resize function, which scales the image to a specific, configurable size, fixed across our experiments. While the number of pixels in the image is uniform across all output files of the resize function, the output file size, in bytes, is not. Thus, our output size prediction models play a crucial role in the scheduling for this application. The final stage is a compress function, which reduces the quality of the image. For the input dataset, we use the open-source Images of Groups dataset [9].

V-A2 Experimental Setup

We set up OpenFaaS in a private cloud consisting of a single-node Kubernetes cluster on a machine with an Intel Xeon E5-1650 CPU with 12 cores and 64 GB RAM. The machine is connected to the internet via a wired LAN connection through a 1 Gbps network, and it is synced with our private GPS time source for accurate time-keeping. The private object storage, Minio, is set up on the same machine. Further, for the purpose of the experiments, we use only two replicas per function in our private cloud, to allow us to study the impact of limited resources. Accordingly, throughout all the experiments, we fix the CPU and RAM resources of all function replicas in the Matrix Processing application at 1.0 CPU and 512 MB RAM, and of all function replicas in the Image Processing application at 0.2 CPU and 512 MB RAM. For the Video Processing application, we configure each replica of EF at 0.5 CPU and 1024 MB RAM, DO at 1.0 CPU and 2048 MB RAM, and RI and ME at 0.2 CPU and 512 MB RAM each.

For the public cloud, we use the AWS US-East-1 (Northern Virginia) region. For Matrix Processing and Image Processing, we set the memory of all the AWS Lambda functions to 2048 MB. For Video Processing, we set the memory of the EF and RI functions to 1024 MB, the DO function to 3008 MB, and the ME function to 512 MB. For this study, we consider warm starts only, and hence we pre-warm a sufficient number of Lambda functions before running the experiments.

We train the performance models as described in Sec. IV-B for all applications. For the Matrix Processing application, we use 774 matrices for training and 150 as the test set for the live experiments. For Video Processing, we use 800 videos for training and 200 as the test set, and for Image Processing we use 800 images for training and 200 for the test set. The regression models are selected using the Grid Search method of scikit-learn on the model parameters, with five-fold cross validation.

V-B Performance Model Accuracy

To study the accuracy of our performance models, we present results on the test set for each application.

V-B1 Matrix Processing Models

We model the compute latency of the MM and LU stages as functions of the size of the input matrix, in bytes, and of the total number of entries in the input matrix, respectively. In the private cloud we obtain a Mean Absolute Percentage Error (MAPE) of 6.51% and 4.57%, respectively, for the MM and LU stages. The MAPE of the public cloud function execution latencies of MM and LU are 5.74% and 2.52%, respectively. The LU stage of the Matrix Processing application depends only on the dimensions of the input matrix, which can be determined from the input to the MM stage. Thus, we do not need a prediction model for the intermediate data in this stage.

V-B2 Video Processing Models

We model the compute latency of each stage of the Video Processing application as a function of the input file size in bytes and the duration of the original video file. For the private cloud, we model the compute latencies of the EF, DO, RI, and ME stages and obtain MAPEs of 4.42%, 1.44%, 8.48%, and 51.3%, respectively. For the public cloud function execution latency, we observe MAPEs of 5.28%, 1.52%, 7.69%, and 23.62% for the EF, DO, RI, and ME functions, respectively. The latency of the ME stage, which just merges the outputs of RI and DO, is very small in magnitude and does not show any clear pattern, resulting in a high MAPE. However, due to its small magnitude, this error has little impact on the performance prediction for the entire application. For the Video Processing application, we also need the output size prediction models for the EF, RI, and ME stages, for which the MAPEs are 38.6%, 5.24%, and 0.2%, respectively.

V-B3 Image Processing Models

For the private cloud, in modeling the compute latency of the different stages, we observe MAPEs of 13.71%, 12.24%, and 12.91%, respectively, for the Rotate, Resize, and Compress stages. We further observe MAPEs of 26.1%, 26.5%, and 29.5% for the public cloud function execution latencies of the Rotate, Resize, and Compress functions. The latencies observed in the Image Processing application have high variance and small magnitude; hence, the relative errors are larger overall. The MAPE of the output size prediction is 7.08% for Rotate, 11.69% for Resize, and 0.52% for Compress.

V-C Scheduling Framework Evaluation

In this section, we present an experimental evaluation of our hybrid cloud scheduling framework prototype. In each experiment, the input batch arrives at the private cloud at time $t = 0$. The makespan is measured from this starting time until the timestamp of the last saved file in the result bucket in Minio, after all jobs have completed. For each application, we explore a range of values of $T$. For the Matrix Processing application, we explore $T$ between 300 and 700 s; for Video Processing, we explore $T$ between 200 and 400 s; and for Image Processing, we use $T$ between 13 and 17 s.
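As an illustration of this measurement, the sketch below reads the achieved makespan back from the Minio result bucket using the minio Python client; the connection details and bucket name are assumed placeholders.

```python
from minio import Minio

# Assumed connection details for illustration only.
client = Minio("127.0.0.1:9000", access_key="minio",
               secret_key="minio123", secure=False)

def measured_makespan(result_bucket: str, batch_start):
    """Makespan = timestamp of the last object written to the result bucket
    minus the batch start time (both as timezone-aware datetimes)."""
    last = max(obj.last_modified
               for obj in client.list_objects(result_bucket, recursive=True))
    return (last - batch_start).total_seconds()

# Example (hypothetical bucket name):
# makespan_s = measured_makespan("skedulix-results", batch_start)
```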

As points of comparison, we also include results for executions that take place entirely in the public cloud and entirely in the private cloud. For the all-public cloud execution, the batch of input files is uploaded in parallel to the input bucket of the first stage in the public cloud. For the all-private cloud execution, we choose $T$ large enough and execute all jobs using SPT order. All experiment results are averaged over three runs, and error bars show one standard deviation.

For small-scale experiments, we also compare the performance of our scheduling approach to the performance obtained from an optimal schedule. To generate the optimal schedule, we need to model some additional latency constituents, namely the public cloud function start-up latency and the data transfer latencies between the public and the private cloud. We model the start-up latency by taking the mean over the 99th percentile of measurements in the training data. We model upload and download latencies between the private and public cloud as functions of the data size, in bytes. We note that we generate application-specific models for these quantities, based on the range of input and output sizes expected for each application, using regularized ridge regression. We then solve the scheduling MILP using the Gurobi solver. We let the solver run for 20 hours, until we observe convergence of the objective function.

(a) Total execution cost
(b) Makespan of execution
Fig. 3: Comparison of the optimal, SPT and HCF schedules, and the all-public cloud execution, with 30 input jobs for the Matrix Processing and Video Processing Application.

V-C1 Optimal vs. SPT and HCF

We first compare the performance of our scheduling algorithm, with the SPT and HCF priority orders, to the optimal schedule. We study both the Matrix Processing and Video Processing applications. For each application, the input batch consists of 30 jobs, selected at random from the respective test sets, and we fix the deadline $T$ for each application. We also show results for the all-public cloud execution.

Fig. 3(a) shows the total cost of execution for each application, and Fig. 3(b) shows the actual makespan. First, we observe that for the optimal, HCF, and SPT schedules, the makespan is very close to $T$; in fact, the HCF and SPT schedules complete before $T$. The all-public cloud execution is much faster because of the parallelism offered by the cloud, but at the same time, it is much more expensive than HCF and SPT, showing the benefit of hybrid cloud execution. The performance of the SPT and HCF heuristics is close, though HCF is slightly more expensive. Further, we observe that the SPT algorithm has 34% higher cost than the run with the optimal schedule for the Matrix Processing application and 28.2% higher cost than the optimal schedule for the Video Processing application. The greedy algorithm thus performs quite close to optimal, especially considering that obtaining the optimal schedule is infeasible for larger numbers of jobs.

V-C2 SPT vs. HCF with varying $T$

We next study the trend of execution cost and number of offloaded functions under the SPT and HCF priority orders in our greedy algorithm with varying values of $T$. For the Matrix Processing application, we use all 150 jobs in the test set, and for the Video Processing application we use all 200 jobs in the test set. We also use the Image Processing application to observe the performance of our heuristics on an application with smaller latencies. For this application, we use all 200 images in the test set.

(a) Matrix Processing (150 jobs, 300 function calls)
(b) Video Processing (200 jobs, 800 function calls)
(c) Image Processing (200 jobs, 600 function calls)
Fig. 4: Comparison of the number of functions offloaded to the public cloud and the total execution cost for the SPT and HCF scheduling policies, with varying deadlines $T$, for the Matrix, Video and Image Processing applications.

The results are shown in Fig. 4. The left Y-axis denotes the percentage of the total number of function executions offloaded to the public cloud. The right Y-axis denotes the total public cloud execution cost. We first observe that, for both heuristics, the total number of function stages offloaded to the public cloud is a decreasing function of the deadline: our scheduler offloads more job stages to the public cloud as the deadline decreases, which, in turn, increases the total cost of execution.

We observe that, for all applications, the total number of function stages offloaded is generally higher for HCF than for the SPT priority order. This is because the HCF order tries to execute the more expensive jobs, which roughly correspond to jobs with long durations, in the private cloud. Hence, it ends up offloading a substantially larger number of inexpensive jobs of short duration. This also results in higher overall cost, as seen in Fig. 4, for the Matrix and Video Processing applications. On average, across all values of $T$, the HCF ordering is 14.3% more expensive than SPT for the Matrix Processing application and 17.9% more expensive than SPT for the Video Processing application. However, the trend is the opposite for the Image Processing application in Fig. 4(c): the cost of the HCF heuristic is actually lower than that of SPT, even though SPT offloads fewer jobs. Here, the numbers of functions offloaded by SPT and HCF are very close across the different values of $T$. With the number of offloaded jobs nearly equal, and with SPT offloading the larger jobs, the total cost is higher for SPT than for HCF in this case.

In these applications, there can be bottleneck stage(s) whose private cloud execution latencies are generally larger than those of the other stages. To maintain the makespan while keeping the public cloud execution cost low, a good choice is to offload the bottleneck stage(s) rather than the entire job. Our scheduler correctly offloads these stages to the public cloud to meet the makespan. For the Matrix Processing application, this corresponds to the LU, or last, stage. For the Video Processing application, this is the DO stage, and the scheduler correctly offloads the DO and ME functions most frequently. Finally, for Image Processing, Rotate is the bottleneck stage; hence, all three functions get offloaded to the public cloud once the scheduler decides to offload a job at Rotate.

From these experiments, we observe that there is a clear trade-off between performance, in terms of latency, and cost. Our hybrid cloud scheduling framework provides a mechanism for an application owner to determine their own balance of cost and performance by selecting the value of $T$. Offloading longer jobs to the public cloud using the SPT priority order works very well in practice in this system model for moderately to highly compute-heavy workloads.

(a) Matrix Processing
(b) Video Processing
Fig. 5: Actual execution makespan for the SPT and HCF heuristics with varying $T$ for the Matrix and Video Processing applications.

In Fig. 5(a) and 5(b), we show the actual makespan obtained by our hybrid scheduling framework for the Matrix and Video Processing applications with varying $T$. In both applications, we find that the observed makespan is very close to the user-specified value of $T$, with approximately 3.5% and 1.5% absolute error for the Matrix and Video Processing applications, respectively. This validates the performance of our scheduler and heuristics. Prediction errors in the performance models contribute substantially to this error, as inaccurate latency predictions prevent the scheduler from utilizing the private cloud at the highest possible efficiency.

In our experiments, the all-private cloud execution has a makespan of 740 s for executing all 150 jobs in the Matrix Processing application and 407 s for all 200 jobs in the Video Processing application. Therefore, given reasonable deadlines for the Matrix Processing and Video Processing applications, our framework with SPT ordering can achieve speedups of 1.92 and 1.65, respectively, over an approach that uses only private cloud execution, at a cost which is, respectively, 40.5% and 39.5% of an approach that uses only the public cloud.

We further note that for applications like Image Processing, where the compute latencies are on the order of hundreds of milliseconds, the effects of scheduling error will be much larger, since communication and coordination latencies within and between the public and private clouds can introduce large variance. However, the absolute error in makespan with SPT ordering is 5%, and with HCF ordering it is 23%, which is comparable to the error in our prediction models. This suggests that the scheduling framework can perform with higher accuracy for moderate to heavy workloads.

VI Related Work

Various approaches for computational offloading and scheduling in the context of datacenters, microservices, and serverless architectures have been proposed in recent years. In the context of mobile cloud computing, computational offloading, or cyber-foraging, using virtual machines has been studied extensively [25], [18]. Here, tasks can be offloaded from resource-constrained mobile devices as needed to meet performance objectives such as minimizing execution cost, energy usage, or latency. The offloading problems are generally solved using methods such as integer linear programs, greedy heuristics, or dynamic programming. A work similar to ours is DEFT [13], which uses regression-based performance models to dynamically determine where to offload computation to optimize latency or energy. However, DEFT offloads entire single-stage applications, whereas our framework makes finer-grained, per-stage offloading decisions to optimize cost.

In the context of serverless computing, several systems have recently been proposed for scheduling single-stage serverless functions in a single cloud platform. Spock [11] minimizes service-level objective violations by distributing the execution of machine learning inference jobs over serverless functions and VMs in the public cloud. The goal of the FnSched scheduler [22] is to maximize resource utilization from a cloud platform provider's perspective while meeting latency guarantees. This is done by regulating the resources allocated to the function containers based on their resource consumption patterns. NOAH [21] is a framework that uses a game-theoretic approach for scheduling and resource allocation of single-stage functions in a private cloud to minimize response time.

Multi-stage serverless applications have been studied in [6], where the authors present an implementation of function chaining in Apache OpenWhisk. The authors in [1] propose a multi-stage serverless video processing framework. The authors in [17] study task placement of multi-stage applications in edge-cloud systems to minimize completion time; however, they do not consider actual cost, and they schedule on a per-input basis, whereas we consider a batch input. In [7], the authors propose Costless, a framework that optimizes the cost of serverless application execution by splitting a function chain and executing part on an edge platform and part in the cloud. Costless optimizes a single application execution at a time and does not consider sequencing and scheduling tasks concurrently on function replicas. The authors in [15] propose GrandSLAm, a framework for scheduling machine learning workloads to maximize data center resource utilization by dynamically adjusting batch sizes and reordering the execution of requests. In contrast to these works, we focus on task scheduling in a hybrid cloud setting. In addition, our approach optimizes performance and cost for the platform clients, rather than for the service provider.

VII Conclusion

We have presented a framework for scheduling serverless applications over a hybrid public-private cloud in a manner that minimizes the cost of public cloud use, while meeting a user-specified makespan constraint. We proved this problem to be NP-hard and proposed a greedy algorithm with two heuristics. We then presented the details of our framework, which relies on accurate predictive models of function compute time, intermediate data sizes, and network transfer latencies. Finally, we presented an evaluation of a prototype implementation of our framework, which uses AWS Lambda for the public cloud and OpenFaaS, running on an on-premises server, for the private cloud, using canonical examples of serverless applications. Our results showed that our framework can achieve a speedup of 1.92 times in the Matrix Processing application and 1.65 times in the Video Processing application over an approach that uses only the private cloud, at a cost that is, respectively, 40.5 percent and 39.5 percent of an approach that uses only the public cloud. Our framework essentially handles each application stage independently; hence, we believe it is possible to extend this decoupled approach to more complicated DAGs. In future work, we plan to extend our approach to introduce a dynamic tolerance for deadline violation and to minimize a dual cost/makespan objective.

Acknowledgment

This work is supported by the National Science Foundation under grants CNS 1553340 and CNS 1816307, Air Force Office of Scientific Research (AFOSR) under grant FA9550-19-1-0054, and an AWS Cloud Credits for Research grant.

References

  • [1] L. Ao, L. Izhikevich, G. M. Voelker, and G. Porter (2018) Sprocket: a serverless video processing framework. In Proc. ACM Symp. on Cloud Computing, pp. 263–274. Cited by: §VI.
  • [2] Apache OpenWhisk: Open Source Serverless Cloud Platform. Note: https://openwhisk.apache.org/. Accessed Feb 8, 2020. Cited by: §I.
  • [3] AWS Lambda. Note: https://aws.amazon.com/lambda/. Accessed Feb 8, 2020. Cited by: §I.
  • [4] AWS Step Functions. Note: https://aws.amazon.com/step-functions/. Accessed Feb 8, 2020. Cited by: §IV-A.
  • [5] Azure Functions. Note: https://azure.microsoft.com/en-us/services/functions/. Accessed Feb 8, 2020. Cited by: §I.
  • [6] I. Baldini, P. Cheng, S. J. Fink, N. Mitchell, V. Muthusamy, R. Rabbah, P. Suter, and O. Tardieu (2017) The serverless trilemma: function composition for serverless computing. In Proc. ACM SIGPLAN Int. Symp. New Ideas, New Paradigms, and Reflections on Programming and Software, pp. 89–103. Cited by: §I, §VI.
  • [7] T. Elgamal, A. Sandur, K. Nahrstedt, and G. Agha (2018) Costless: optimizing cost of serverless computing through function fusion and placement. In Proc. IEEE/ACM Symp. Edge Computing, pp. 300–312. Cited by: §VI.
  • [8] H. Emmons and G. Vairaktarakis (2012) Flow shop scheduling: theoretical results, algorithms, and applications. Vol. 182, Springer Science & Business Media. Cited by: §-A, §-C.
  • [9] A. Gallagher and T. Chen (2009) Understanding images of groups of people. In Proc. CVPR, Cited by: §V-A1.
  • [10] P. García-López, M. Sánchez-Artigas, S. Shillaker, P. Pietzuch, D. Breitgand, G. Vernik, P. Sutra, T. Tarrant, and A. J. Ferrer (2019) ServerMix: tradeoffs and challenges of serverless data analytics. arXiv preprint arXiv:1907.11465. Cited by: §I.
  • [11] J. R. Gunasekaran, P. Thinakaran, M. T. Kandemir, B. Urgaonkar, G. Kesidis, and C. Das (2019) Spock: exploiting serverless functions for slo and cost aware resource procurement in public cloud. In Proc. 12th IEEE Int. Conf. Cloud Computing, pp. 199–208. Cited by: §VI.
  • [12] Gurobi Optimization, LLC (2019) Gurobi optimizer reference manual. Note: http://www.gurobi.com. Accessed Feb 8, 2020. Cited by: §II-B.
  • [13] F. Jalali, T. Lynar, O. J. Smith, R. R. Kolluri, C. V. Hardgrove, N. Waywood, and F. Suits (2019) Dynamic edge fabric environment: seamless and automatic switching among resources at the edge of iot network and cloud. In Proc. IEEE Int. Conf. Edge Computing, pp. 77–86. Cited by: §VI.
  • [14] E. Jonas, Q. Pu, S. Venkataraman, I. Stoica, and B. Recht (2017) Occupy the cloud: distributed computing for the 99%. In Proc. Symp. Cloud Computing, pp. 445–451. Cited by: §I.
  • [15] R. S. Kannan, L. Subramanian, A. Raju, J. Ahn, J. Mars, and L. Tang (2019) GrandSLAm: guaranteeing slas for jobs in microservices execution frameworks. In Proc. 14th EuroSys Conf., pp. 34. Cited by: §VI.
  • [16] Kubernetes: Production-Grade Container Orchestration. Note: https://kubernetes.io/. Accessed Feb 8, 2020. Cited by: §IV-A.
  • [17] L. Liu, H. Tan, S. H. Jiang, Z. Han, X. Li, and H. Huang (2019) Dependent task placement and scheduling with function configuration in edge computing. In Proc. Int. Symp. on Quality of Service, pp. 1–10. Cited by: §VI.
  • [18] P. Mach and Z. Becvar (2017) Mobile edge computing: a survey on architecture and computation offloading. IEEE Commun. Surv. Tutor. 19 (3), pp. 1628–1656. Cited by: §VI.
  • [19] Minio: High Performance, Kubernetes-Friendly Object Storage. Note: https://min.io. Accessed Feb 8, 2020. Cited by: §IV-A.
  • [20] OpenFaaS - Serverless Functions Made Simple. Note: https://docs.openfaas.com/. Accessed Feb 8, 2020. Cited by: §I.
  • [21] M. Stein (2019) Adaptive event dispatching in serverless computing infrastructures. arXiv preprint arXiv:1901.03086. Cited by: §VI.
  • [22] A. Suresh and A. Gandhi (2019) FnSched: an efficient scheduler for serverless functions. In Proc. 5th Int. Workshop Serverless Computing, pp. 19–24. Cited by: §VI.
  • [23] B. Varghese and R. Buyya (2018) Next generation cloud computing: new trends and research directions. Future Gener. Comput. Syst. 79, pp. 849–861. Cited by: §I, §I.
  • [24] J. Wang, B. Cao, P. Yu, L. Sun, W. Bao, and X. Zhu (2018) Deep learning towards mobile applications. In IEEE 38th Int. Conf. on Distributed Computing Systems (ICDCS), pp. 1385–1393. Cited by: §I.
  • [25] D. Xu, Y. Li, X. Chen, J. Li, P. Hui, S. Chen, and J. Crowcroft (2018) A survey of opportunistic offloading. IEEE Commun. Surv. Tutor. 20 (3), pp. 2198–2236. Cited by: §VI.
  • [26] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell (2018) Bdd100k: a diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687. Cited by: §V-A1.