M^3ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design

10/26/2022
by Hanxue Liang, et al.

Multi-task learning (MTL) encapsulates multiple learned tasks in a single model and often lets those tasks learn better jointly. However, when deploying MTL onto real-world systems that are resource-constrained or latency-sensitive, two prominent challenges arise: (i) during training, simultaneously optimizing all tasks is often difficult because of gradient conflicts across tasks; (ii) at inference, current MTL regimes have to activate nearly the entire model even to execute just a single task. Yet most real systems demand only one or two tasks at any given moment, switching between tasks as needed; such all-tasks-activated inference is therefore highly inefficient and non-scalable. In this paper, we present a model-accelerator co-design framework to enable efficient on-device MTL. Our framework, dubbed M^3ViT, customizes mixture-of-experts (MoE) layers into a vision transformer (ViT) backbone for MTL, and sparsely activates task-specific experts during training. At inference on any task of interest, the same design activates only the task-corresponding sparse expert pathway instead of the full model. The model design is further enhanced by hardware-level innovations, in particular a novel computation reordering scheme tailored for memory-constrained MTL that achieves zero-overhead switching between tasks and can scale to any number of experts. When executing single-task inference, M^3ViT achieves higher accuracies than encoder-focused MTL methods while significantly reducing inference FLOPs by 88%. On a hardware platform of one Xilinx ZCU104 FPGA, our co-design framework reduces the memory requirement by 2.4 times, while achieving energy efficiency up to 9.23 times higher than a comparable FPGA baseline. Code is available at: https://github.com/VITA-Group/M3ViT.
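The core architectural idea described in the abstract is to replace the feed-forward (MLP) blocks of the ViT backbone with task-conditioned mixture-of-experts layers, so each task routes its tokens to only a small subset of experts. The sketch below shows one way such a layer could look in PyTorch. It is a minimal illustration rather than the authors' released implementation; names such as TaskMoEFFN, num_experts, and top_k are assumptions made here for clarity.

# Minimal sketch (not the released M^3ViT code) of a task-conditioned
# mixture-of-experts layer that could replace the MLP block in a ViT
# encoder layer. TaskMoEFFN, num_experts, top_k are illustrative names.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskMoEFFN(nn.Module):
    def __init__(self, dim, hidden_dim, num_experts=16, top_k=4, num_tasks=2):
        super().__init__()
        self.top_k = top_k
        # One lightweight FFN expert per slot; only top_k are used per token.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(),
                          nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        ])
        # Task-conditioned routers: each task has its own gating weights,
        # so switching tasks only changes which experts are consulted.
        self.routers = nn.ModuleList([nn.Linear(dim, num_experts)
                                      for _ in range(num_tasks)])

    def forward(self, x, task_id):
        # x: (batch, tokens, dim)
        logits = self.routers[task_id](x)                 # (B, T, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)    # sparse expert choice
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Evaluate only the selected experts (a simple loop for clarity;
        # an efficient kernel would batch tokens by expert instead).
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = (idx[..., k] == e)                 # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) \
                                 * self.experts[e](x[mask])
        return out

Because each task consults only its own router and the few experts that router selects, a single-task forward pass touches a small, fixed fraction of the layer's parameters; this is the property the hardware co-design exploits to bound on-chip memory and enable zero-overhead task switching.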


Related research

05/30/2023 · Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts
Computer vision researchers are embracing two promising paradigms: Visio...

12/15/2022 · Mod-Squad: Designing Mixture of Experts As Modular Multi-Task Learners
Optimization in multi-task learning (MTL) is more challenging than singl...

07/28/2023 · TaskExpert: Dynamically Assembling Multi-Task Representations with Memorial Mixture-of-Experts
Learning discriminative task-specific features simultaneously for multip...

04/11/2022 · MIME: Adapting a Single Neural Network for Multi-task Inference with Memory-efficient Dynamic Pruning
Recent years have seen a paradigm shift towards multi-task learning. Thi...

05/25/2022 · Eliciting Transferability in Multi-task Learning with Task-level Mixture-of-Experts
Recent work suggests that transformer models are capable of multi-task l...

04/17/2023 · AdaMTL: Adaptive Input-dependent Inference for Efficient Multi-Task Learning
Modern Augmented reality applications require performing multiple tasks ...

03/24/2021 · FastMoE: A Fast Mixture-of-Expert Training System
Mixture-of-Expert (MoE) presents a strong potential in enlarging the siz...
