FaaSwap: SLO-Aware, GPU-Efficient Serverless Inference via Model Swapping

06/06/2023
by Minchen Yu, et al.

The dynamic request patterns of machine learning (ML) inference workloads have driven a growing trend toward serving ML models on serverless computing for its scalability. However, today's serverless platforms lack efficient support for GPUs: provisioning functions on GPUs incurs extremely high overhead, forcing functions to be kept long-running even when idle in order to avoid cold starts. This wastes substantial resources for ML inference and undermines pay-per-use billing for GPUs. In this paper, we present FaaSwap, a serverless platform that enables fine-grained, request-level GPU sharing for resource-efficient ML inference. FaaSwap leverages model swapping to support fast inference execution at low resource cost: it keeps models in host memory, which is plentiful and cheap, and quickly swaps them onto GPUs when requests arrive, reducing per-function keep-alive cost and allowing a GPU to be shared efficiently across many more functions. FaaSwap also supports swapping models between GPUs for load balancing and improved inference performance. In FaaSwap, we design request-scheduling and memory-management algorithms that exploit model swapping to reduce GPU cost while meeting latency service-level objectives (SLOs) for all inference functions. We have implemented FaaSwap and integrated it into Alibaba Cloud Function Compute (FC), one of the world's largest commercial serverless platforms. Evaluation results show that FaaSwap achieves low-latency model swapping, efficiently shares a GPU across hundreds of functions, and satisfies per-function latency SLOs at scale.
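
To make the model-swapping idea concrete, below is a minimal PyTorch-style sketch of request-level swapping: models stay resident in host memory and are moved onto the GPU only for the duration of a request. The `SwappingModelPool` class, its method names, and the use of pinned host memory are illustrative assumptions for this sketch, not FaaSwap's actual API or implementation.

```python
# Illustrative sketch of request-level model swapping (not FaaSwap's implementation).
# Assumes PyTorch, torchvision, and a CUDA-capable GPU.
import torch
import torchvision.models as models


class SwappingModelPool:
    """Keeps models resident in host memory and swaps them onto the GPU only
    while a request is being served, so many functions can share one GPU."""

    def __init__(self, device="cuda:0"):
        self.device = torch.device(device)
        self.host_models = {}  # function name -> model kept on the host (CPU)

    def register(self, name, model):
        """Keep the model on the host; pin its memory so the first
        host-to-GPU copy can be fast and asynchronous."""
        model = model.eval().cpu()
        for p in model.parameters():
            p.data = p.data.pin_memory()
        self.host_models[name] = model

    @torch.no_grad()
    def infer(self, name, inputs):
        model = self.host_models[name]
        # Swap in: copy weights host -> GPU for this request only.
        model.to(self.device, non_blocking=True)
        try:
            outputs = model(inputs.to(self.device, non_blocking=True))
        finally:
            # Swap out: release GPU memory. (A real system would keep a pinned
            # host copy instead of round-tripping the weights like this.)
            model.to("cpu")
        return outputs.cpu()


# Usage: many functions could register models, while only those actively
# serving a request occupy GPU memory.
pool = SwappingModelPool()
pool.register("resnet50-fn", models.resnet50(weights=None))
result = pool.infer("resnet50-fn", torch.randn(1, 3, 224, 224))
```

A production system would additionally need the scheduling and GPU memory management the abstract describes, e.g. deciding which models to keep resident on the GPU and when to swap between GPUs to meet per-function latency SLOs.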

