SMDP-Based Dynamic Batching for Efficient Inference on GPU-Based Platforms

01/30/2023
by Yaodan Xu, et al.

In modern machine learning (ML) applications on cloud or edge computing platforms, batching is an important technique for providing efficient and economical services at scale. In particular, parallel computing resources on these platforms, such as graphics processing units (GPUs), achieve higher computational and energy efficiency with larger batch sizes. However, larger batch sizes may also result in longer response times, so the batching policy requires judicious design. This paper aims to provide a dynamic batching policy that strikes a balance between efficiency and latency. The GPU-based inference service is modeled as a batch service queue with batch-size-dependent processing time. The design of dynamic batching is then a continuous-time average-cost problem, formulated as a semi-Markov decision process (SMDP) with the objective of minimizing the weighted sum of average response time and average power consumption. The optimal policy is obtained by solving an associated discrete-time Markov decision process (MDP) problem with finite state approximation and "discretization". By creatively introducing an abstract cost to reflect the impact of "tail" states, the space complexity and the time complexity of the procedure can decrease by 63.5% [...] results show that the optimal policies potentially possess a control limit structure. Numerical results also show that SMDP-based batching policies can adapt to different traffic intensities and outperform other benchmark policies. Furthermore, the proposed solution has notable flexibility in balancing power consumption and latency.
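To make the formulation above concrete, here is a minimal sketch of the general approach, not the authors' implementation. It assumes Poisson arrivals, a deterministic service time `tau(b)` and energy `energy(b)` that are linear in the batch size, zero idle power, and a plain truncation of the queue at `S_max` (the abstract-cost treatment of "tail" states is not reproduced). All names and parameter values are hypothetical. The SMDP is converted to a discrete-time MDP with the standard data transformation for average-cost SMDPs and then solved by relative value iteration.

```python
import math
import numpy as np

def poisson_pmf(mu, kmax):
    """Poisson(mu) probabilities for k = 0..kmax, computed recursively."""
    p = np.empty(kmax + 1)
    p[0] = math.exp(-mu)
    for k in range(1, kmax + 1):
        p[k] = p[k - 1] * mu / k
    return p

def solve_batching_smdp(lam=2.0, B=8, S_max=60, w1=1.0, w2=1.0,
                        tau=lambda b: 0.1 + 0.05 * b,    # assumed service time
                        energy=lambda b: 1.0 + 0.2 * b,  # assumed energy per batch
                        tol=1e-8, max_iter=200_000):
    """State s = requests in the queue; action 0 = wait, b >= 1 = serve b requests."""
    n = S_max + 1
    cost = np.full((n, B + 1), np.inf)  # expected cost per sojourn (inf = infeasible)
    soj = np.ones((n, B + 1))           # expected sojourn time
    P = np.zeros((n, B + 1, n))         # embedded transition probabilities
    for s in range(n):
        if s < S_max:  # wait for the next arrival (not allowed when the queue is full)
            soj[s, 0] = 1.0 / lam
            cost[s, 0] = w1 * (s / lam) / lam
            P[s, 0, s + 1] = 1.0
        for b in range(1, min(s, B) + 1):
            t = tau(b)
            soj[s, b] = t
            # Holding cost: s requests present for time t, plus arrivals during service.
            # Dividing by lam turns mean queue length into mean response time (Little's law).
            cost[s, b] = w1 * (s * t + lam * t * t / 2) / lam + w2 * energy(b)
            room = S_max - (s - b)
            pmf = poisson_pmf(lam * t, room)
            P[s, b, s - b:S_max] = pmf[:room]
            P[s, b, S_max] = max(0.0, 1.0 - pmf[:room].sum())  # truncated tail mass

    # Data transformation ("discretization") to an equivalent discrete-time MDP.
    feasible = np.isfinite(cost)
    eta = 0.5 * soj[feasible].min()
    c_t = cost / soj
    P_t = eta * P / soj[:, :, None]
    idx = np.arange(n)
    P_t[idx, :, idx] += 1.0 - eta / soj

    # Relative value iteration on the transformed MDP.
    h = np.zeros(n)
    for _ in range(max_iter):
        Q = c_t + P_t @ h
        h_new = Q.min(axis=1)
        diff = h_new - h
        h = h_new - h_new[0]
        if diff.max() - diff.min() < tol:
            break
    g = 0.5 * (diff.max() + diff.min())  # estimated average cost per unit time
    return g, Q.argmin(axis=1)

g, policy = solve_batching_smdp()
print(f"average cost per unit time ~ {g:.4f}")
print("batch size chosen in states 0..15:", policy[:16].tolist())
```

Printing `policy` for a few values of `lam` or of the weights `w1` and `w2` is one way to check, under these assumptions, whether the computed policy waits below some queue length and serves above it, which is the control limit structure mentioned in the abstract.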
