Serving MoE Models on Resource-constrained Edge Devices via Dynamic Expert Swapping

08/29/2023
by Rui Kong et al.

Mixture of experts (MoE) is a popular technique in deep learning that improves model capacity with conditionally activated parallel neural network modules (experts). However, serving MoE models in resource-constrained, latency-critical edge scenarios is challenging due to the significantly increased model size and complexity. In this paper, we first analyze the behavior of MoE models in continuous inference scenarios, which leads to three key observations about expert activations: temporal locality, exchangeability, and skippable computation. Based on these observations, we introduce PC-MoE, an inference framework for resource-constrained continuous MoE model serving. At the core of PC-MoE is a new data structure, the Parameter Committee, which intelligently maintains a subset of important experts in use to reduce resource consumption. The optimal configuration of the Parameter Committee is found offline by a profiling-guided committee planner, while expert swapping and request handling at runtime are managed by an adaptive committee scheduler. To evaluate the effectiveness of PC-MoE, we conduct experiments with state-of-the-art MoE models on common computer vision and natural language processing tasks. The results show that PC-MoE achieves favorable trade-offs between resource consumption and model accuracy. For instance, on object detection tasks with the Swin-MoE model, our approach reduces memory usage and latency by 42.34% while incurring only a 0.10% loss in accuracy.
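To make the Parameter Committee idea concrete, below is a minimal Python sketch of a fixed-capacity expert committee. It is an illustrative assumption, not the authors' implementation: the class name, the frequency-based eviction and substitution policy, and the swap_budget_exceeded flag are hypothetical stand-ins for the paper's profiling-guided planner and adaptive scheduler. The sketch only shows how the three observations could be exploited: temporal locality (keep frequently activated experts resident), exchangeability (reroute a request to a resident substitute), and skippable computation (avoid a blocking weight swap when the swap budget is exhausted).

```python
from collections import defaultdict


class ParameterCommittee:
    """Toy fixed-capacity expert cache (hypothetical, not the paper's API).

    Keeps at most `capacity` experts resident on the device and either swaps
    a requested expert in or substitutes a resident one for it."""

    def __init__(self, all_experts, capacity):
        self.all_experts = all_experts            # expert_id -> expert module / weights
        self.capacity = capacity
        self.resident = {}                        # experts currently loaded on the device
        self.activation_counts = defaultdict(int) # proxy for temporal locality

    def _evict_if_needed(self):
        # Evict the least frequently activated resident expert when full.
        while len(self.resident) >= self.capacity:
            victim = min(self.resident, key=lambda e: self.activation_counts[e])
            self.resident.pop(victim)             # offload back to host RAM / disk

    def _load(self, expert_id):
        self._evict_if_needed()
        self.resident[expert_id] = self.all_experts[expert_id]  # swap weights in

    def route(self, expert_id, swap_budget_exceeded=False):
        """Return a resident expert for this request.

        If the requested expert is not resident and swapping is currently too
        expensive, fall back to the most frequently used resident expert
        (exchangeability) instead of stalling on I/O."""
        self.activation_counts[expert_id] += 1
        if expert_id in self.resident:
            return self.resident[expert_id]
        if swap_budget_exceeded and self.resident:
            substitute = max(self.resident, key=lambda e: self.activation_counts[e])
            return self.resident[substitute]
        self._load(expert_id)
        return self.resident[expert_id]


if __name__ == "__main__":
    experts = {i: f"expert_{i}_weights" for i in range(16)}  # stand-ins for expert modules
    committee = ParameterCommittee(experts, capacity=4)
    for eid in [3, 3, 7, 3, 12, 7, 3, 9, 15]:                # skewed, temporally local activations
        print(eid, "->", committee.route(eid))
```

In the actual system, the committee's capacity and membership would be chosen offline from profiling data and adapted online by the scheduler; the toy policy above simply evicts the least frequently activated resident expert.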


Related research

10/06/2022 · Enabling Deep Learning on Edge Devices
Deep neural networks (DNNs) have succeeded in many different perception ...

02/29/2020 · Hazard Detection in Supermarkets using Deep Learning on the Edge
Supermarkets need to ensure clean and safe environments for both shopper...

03/10/2023 · Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference
Mixture-of-Experts (MoE) models have gained popularity in achieving stat...

09/26/2022 · Diversified Dynamic Routing for Vision Tasks
Deep learning models for vision tasks are trained on large datasets unde...

11/04/2020 · Understanding Capacity-Driven Scale-Out Neural Recommendation Inference
Deep learning recommendation models have grown to the terabyte scale. Tr...

01/28/2022 · Benchmarking Resource Usage for Efficient Distributed Deep Learning
Deep learning (DL) workflows demand an ever-increasing budget of compute...

08/28/2023 · EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models
Large Language Models (LLMs) such as GPTs and LLaMa have ushered in a re...
