DeepFlow: A Cross-Stack Pathfinding Framework for Distributed AI Systems

11/07/2022
by Newsha Ardalani, et al.

Over the past decade, machine learning model complexity has grown at an extraordinary rate, as has the scale of the systems training such large models. However, there is an alarmingly low hardware utilization (5-20%) in AI systems. This low system utilization is the cumulative effect of minor losses across different layers of the stack, exacerbated by the disconnect between engineers designing different layers, who span different industries. We propose CrossFlow, a novel framework that enables cross-layer analysis all the way from the technology layer to the algorithmic layer. We also propose DeepFlow (built on top of CrossFlow using machine learning techniques) to automate design space exploration and co-optimization across the different layers of the stack. We have validated CrossFlow's accuracy with distributed training on real commercial hardware, and we showcase several DeepFlow case studies demonstrating the pitfalls of not optimizing across the technology-hardware-software stack for what is likely the most important workload driving large development investments in all aspects of the computing stack.

research
10/10/2020

Cross-Stack Workload Characterization of Deep Recommendation Systems

Deep learning based recommendation systems form the backbone of most per...
research
01/05/2022

CFU Playground: Full-Stack Open-Source Framework for Tiny Machine Learning (tinyML) Acceleration on FPGAs

We present CFU Playground, a full-stack open-source framework that enabl...
research
08/19/2019

Across-Stack Profiling and Characterization of Machine Learning Models on GPUs

The world sees a proliferation of machine learning/deep learning (ML) mo...
research
08/19/2019

XSP: Across-Stack Profiling and Analysis of Machine Learning Models on GPUs

There has been a rapid proliferation of machine learning/deep learning (...
research
06/21/2022

CoCoPIE XGen: A Full-Stack AI-Oriented Optimizing Framework

There is a growing demand for shifting the delivery of AI capability fro...
research
03/24/2023

ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale

As deep learning models and input data are scaling at an unprecedented r...
