TinyStack: A Minimal GPU Stack for Client ML

05/04/2021
by Heejin Park, et al.

TinyStack is a novel way to deploy GPU-accelerated computation on mobile and embedded devices. It addresses the high complexity of the modern GPU stack. Without overhauling the stack, TinyStack gives an app a static, fast path for pushing its computation to the GPU: it records GPU executions on the full GPU stack ahead of time and, at run time, replays those executions on new input using only a small replayer. TinyStack addresses the challenges of capturing key CPU/GPU interactions and GPU states, working around proprietary GPU internals, and preventing replay divergence. The resulting replayer is a drop-in replacement for the original GPU stack. It is tiny (as small as a 50 KB executable), robust (replaying long executions without divergence), portable (running in a POSIX OS, in a TEE, or on bare metal), and quick to launch (speeding up startup by up to two orders of magnitude). We have implemented TinyStack and tested it with a variety of ML frameworks, GPU programming APIs, and integrated GPUs.
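The record-then-replay idea can be illustrated with a toy sketch. This is not the authors' implementation: `FakeGPU`, its register layout, the `Event` log format, and the `record`/`replay` helpers are all hypothetical names invented here to show the general shape of the technique (log CPU/GPU interactions once, then re-issue them with new input and flag divergence).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    """One logged CPU/GPU interaction (hypothetical log format)."""
    kind: str       # "input", "write", "check", or "output"
    addr: int
    value: int = 0  # written value, or expected value for a "check"


class FakeGPU:
    """Toy stand-in for a device: kicking a job doubles REG_IN into REG_OUT."""
    REG_IN, REG_OUT, REG_KICK = 0x0, 0x4, 0x8

    def __init__(self) -> None:
        self.regs = {self.REG_IN: 0, self.REG_OUT: 0, self.REG_KICK: 0}

    def write(self, addr: int, value: int) -> None:
        self.regs[addr] = value
        if addr == self.REG_KICK and value == 1:
            self.regs[self.REG_OUT] = 2 * self.regs[self.REG_IN]
            self.regs[self.REG_KICK] = 0  # job finished, kick bit clears

    def read(self, addr: int) -> int:
        return self.regs[addr]


class ReplayDivergence(RuntimeError):
    """Raised when the device state differs from what was recorded."""


def record(dev: FakeGPU, x: int) -> Tuple[List[Event], int]:
    """Stand-in for the 'full stack': drive the device and log each step."""
    log: List[Event] = []
    dev.write(dev.REG_IN, x)
    log.append(Event("input", dev.REG_IN))        # value is input-dependent
    dev.write(dev.REG_KICK, 1)
    log.append(Event("write", dev.REG_KICK, 1))   # value is fixed
    status = dev.read(dev.REG_KICK)
    log.append(Event("check", dev.REG_KICK, status))
    out = dev.read(dev.REG_OUT)
    log.append(Event("output", dev.REG_OUT))      # value is the result
    return log, out


def replay(dev: FakeGPU, log: List[Event], new_input: int) -> int:
    """Small replayer: re-issue the logged steps with a new input value."""
    output = 0
    for ev in log:
        if ev.kind == "input":
            dev.write(ev.addr, new_input)
        elif ev.kind == "write":
            dev.write(ev.addr, ev.value)
        elif ev.kind == "check":
            got = dev.read(ev.addr)
            if got != ev.value:
                raise ReplayDivergence(
                    f"reg {ev.addr:#x}: expected {ev.value}, got {got}")
        elif ev.kind == "output":
            output = dev.read(ev.addr)
    return output


if __name__ == "__main__":
    log, recorded_out = record(FakeGPU(), 3)
    print(recorded_out, replay(FakeGPU(), log, 10))
```

In this sketch only the replayer and the log are needed at run time; the recording code that produced the log is not. A real GPU would additionally require capturing memory contents and hardware state, which this toy omits.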


