TinyStack: A Minimal GPU Stack for Client ML
TinyStack is a new way to deploy GPU-accelerated computation on mobile and embedded devices. It addresses the high complexity of the modern GPU stack. Without overhauling that stack, TinyStack gives an app a static, fast path for pushing its computation to the GPU: it records GPU executions on the full GPU stack ahead of time and, at run time, replays them on new input using only a small replayer. TinyStack addresses three challenges: capturing the key CPU/GPU interactions and GPU states, working around proprietary GPU internals, and preventing replay divergence. The resulting replayer is a drop-in replacement for the original GPU stack. It is tiny (an executable as small as 50 KB), robust (replaying long executions without divergence), portable (running on a POSIX OS, in a TEE, or on bare metal), and quick to launch (speeding up startup by up to two orders of magnitude). We have implemented TinyStack and tested it with a variety of ML frameworks, GPU programming APIs, and integrated GPUs.
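The record-and-replay idea can be illustrated with a toy sketch. Everything below is hypothetical and is not TinyStack's actual design or API: `MockGPU` is a stand-in device with four made-up registers, `full_stack_run` plays the role of the full GPU stack, and the "job" simply doubles each input byte. The sketch only shows the general pattern of logging CPU-to-GPU register writes once and re-issuing them later on new input, assuming the input has the same size and location as in the recorded run.

```python
from typing import List, Tuple


class MockGPU:
    """Toy stand-in for a GPU: a small register file plus memory (illustrative only)."""
    IN_ADDR, OUT_ADDR, LEN, DOORBELL = 0, 1, 2, 3  # hypothetical register indices

    def __init__(self, mem_size: int = 64) -> None:
        self.regs = [0, 0, 0, 0]
        self.mem = bytearray(mem_size)

    def write_reg(self, reg: int, value: int) -> None:
        self.regs[reg] = value
        # Writing 1 to the doorbell register starts the (toy) job.
        if reg == self.DOORBELL and value == 1:
            self._run_job()

    def _run_job(self) -> None:
        src, dst, n = self.regs[self.IN_ADDR], self.regs[self.OUT_ADDR], self.regs[self.LEN]
        for i in range(n):
            self.mem[dst + i] = (self.mem[src + i] * 2) % 256


class Recorder:
    """Forwards register writes to the device and logs them in order."""

    def __init__(self, gpu: MockGPU) -> None:
        self.gpu = gpu
        self.log: List[Tuple[int, int]] = []

    def write_reg(self, reg: int, value: int) -> None:
        self.log.append((reg, value))
        self.gpu.write_reg(reg, value)


def full_stack_run(gpu: MockGPU, dev, data: bytes, in_addr: int = 0, out_addr: int = 32) -> bytes:
    """Stands in for the full GPU stack: places input, programs registers, kicks the job."""
    n = len(data)
    gpu.mem[in_addr:in_addr + n] = data
    dev.write_reg(MockGPU.IN_ADDR, in_addr)
    dev.write_reg(MockGPU.OUT_ADDR, out_addr)
    dev.write_reg(MockGPU.LEN, n)
    dev.write_reg(MockGPU.DOORBELL, 1)
    return bytes(gpu.mem[out_addr:out_addr + n])


def replay(gpu: MockGPU, log: List[Tuple[int, int]], data: bytes,
           in_addr: int = 0, out_addr: int = 32) -> bytes:
    """Tiny replayer: load new input, re-issue the logged writes, read the output."""
    n = len(data)
    gpu.mem[in_addr:in_addr + n] = data
    for reg, value in log:
        gpu.write_reg(reg, value)
    return bytes(gpu.mem[out_addr:out_addr + n])


if __name__ == "__main__":
    recorder = Recorder(MockGPU())
    full_stack_run(recorder.gpu, recorder, bytes([1, 2, 3]))      # record once
    print(list(replay(MockGPU(), recorder.log, bytes([10, 20, 30]))))  # replay on new input
```

In this toy, the replayer needs only the log and a way to write registers, which mirrors why a replayer can be much smaller than the stack that produced the recording. A real system must also handle what the sketch omits, such as GPU state, synchronization, and divergence.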