RT-LM: Uncertainty-Aware Resource Management for Real-Time Inference of Language Models

09/12/2023
by Yufei Li, et al.

Recent advancements in language models (LMs) have attracted substantial attention for their ability to generate human-like responses. Although they promise a wide range of applications such as conversational AI, these LMs face deployment challenges on various devices due to their extreme computational cost and unpredictable inference latency. Such latency variation, identified as a consequence of uncertainty intrinsic to the nature of language, can lead to computational inefficiency and degrade the overall performance of LMs, especially under high-traffic workloads. Unfortunately, the range of these uncertainty sources is broad, complicating the prediction of latency and of the effects such uncertainties produce. To understand and mitigate the impact of uncertainty on real-time, response-demanding systems, we take the first step toward comprehending, quantifying, and optimizing these uncertainty-induced latency variations in LMs. Specifically, we present RT-LM, an uncertainty-aware resource management ecosystem for real-time inference of LMs. RT-LM quantifies how specific input uncertainties adversely affect latency, often by increasing output length. Exploiting these insights, we devise a lightweight yet effective method to dynamically correlate input text uncertainties with output length at runtime. Using this quantification as a latency heuristic, we integrate the uncertainty information into a system-level scheduler that exploits several uncertainty-induced optimization opportunities, including uncertainty-aware prioritization, dynamic consolidation, and strategic CPU offloading. Quantitative experiments across five state-of-the-art LMs on two hardware platforms demonstrate that RT-LM significantly reduces average response time and improves throughput while incurring only a small runtime overhead.
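To make the scheduling idea concrete, the following is a minimal Python sketch of uncertainty-aware prioritization as the abstract describes it: an input-uncertainty score is mapped to a predicted output length, converted into a latency estimate, and used as the priority key for dispatching requests. All names and numbers here (predict_output_length, UncertaintyAwareScheduler, the linear heuristic, the 30 ms per-token cost) are illustrative assumptions, not the authors' implementation.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(order=True)
class Request:
    # Ordered by predicted latency so the scheduler pops the cheapest request first.
    predicted_latency: float
    prompt: str = field(compare=False)


def predict_output_length(uncertainty_score: float, prompt_tokens: int) -> int:
    # Hypothetical heuristic: higher input uncertainty -> longer expected output.
    return int(prompt_tokens * (1.0 + 2.0 * uncertainty_score))


def estimate_latency(prompt: str, uncertainty_score: float,
                     per_token_ms: float = 30.0) -> float:
    # Turn the predicted output length into a decode-time estimate (in ms).
    out_len = predict_output_length(uncertainty_score, len(prompt.split()))
    return out_len * per_token_ms


class UncertaintyAwareScheduler:
    """Priority queue keyed on uncertainty-derived latency estimates
    (shortest-predicted-job-first)."""

    def __init__(self) -> None:
        self._queue: List[Request] = []

    def submit(self, prompt: str, uncertainty_score: float) -> None:
        latency = estimate_latency(prompt, uncertainty_score)
        heapq.heappush(self._queue, Request(latency, prompt))

    def next_request(self) -> Optional[Request]:
        return heapq.heappop(self._queue) if self._queue else None


# Usage: the low-uncertainty request (shorter predicted output) is dispatched first.
sched = UncertaintyAwareScheduler()
sched.submit("Summarize this short note.", uncertainty_score=0.1)
sched.submit("Explain the proof of the spectral theorem.", uncertainty_score=0.8)
print(sched.next_request().prompt)
```

In the paper this latency heuristic also feeds dynamic consolidation and CPU-offloading decisions; the sketch covers only the prioritization step.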

Related research

08/03/2022 · Exploration with Model Uncertainty at Extreme Scale in Real-Time Bidding
In this work, we present a scalable and efficient system for exploring t...

09/14/2023 · Tree of Uncertain Thoughts Reasoning for Large Language Models
While the recently introduced Tree of Thoughts (ToT) has heralded advanc...

11/08/2020 · Towards Latency-aware DNN Optimization with GPU Runtime Analysis and Tail Effect Elimination
Despite the superb performance of State-Of-The-Art (SOTA) DNNs, the incr...

09/14/2018 · Leveraging Heteroscedastic Aleatoric Uncertainties for Robust Real-Time LiDAR 3D Object Detection
We present a robust real-time LiDAR 3D object detector that leverages he...

06/15/2023 · Audio Tagging on an Embedded Hardware Platform
Convolutional neural networks (CNNs) have exhibited state-of-the-art per...

02/11/2021 · The Benefit of the Doubt: Uncertainty Aware Sensing for Edge Computing Platforms
Neural networks (NNs) lack measures of "reliability" estimation that wou...

07/17/2023 · Harnessing Scalable Transactional Stream Processing for Managing Large Language Models [Vision]
Large Language Models (LLMs) have demonstrated extraordinary performance...