ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale

03/24/2023
by William Won, et al.

As deep learning models and input data scale at an unprecedented rate, moving to distributed training platforms has become inevitable to fit the model and increase training throughput. State-of-the-art approaches and techniques, such as wafer-scale nodes, multi-dimensional network topologies, disaggregated memory systems, and parallelization strategies, are being actively adopted by emerging distributed training systems. This results in a complex SW/HW co-design stack for distributed training, necessitating a modeling/simulation infrastructure for design-space exploration. In this paper, we extend the open-source ASTRA-sim infrastructure with the capability to model state-of-the-art and emerging distributed training models and platforms. More specifically, (i) we enable ASTRA-sim to support arbitrary model parallelization strategies via a graph-based training-loop implementation, (ii) we implement a parameterizable multi-dimensional heterogeneous topology generation infrastructure with analytical performance estimates, enabling simulation of target systems at scale, and (iii) we enhance the memory system modeling to support accurate modeling of in-network collective communication and disaggregated memory systems. With these capabilities, we run comprehensive case studies targeting emerging distributed models and platforms. This infrastructure lets system designers swiftly traverse the complex co-design stack and gain meaningful insights when designing and deploying distributed training platforms at scale.
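To make the idea of analytical performance estimates for multi-dimensional topologies concrete, the Python sketch below computes a first-order latency estimate for a hierarchical all-reduce, assuming a ring-based reduce-scatter/all-gather schedule in each dimension and a simple alpha-beta cost per link. The function names, parameters, and example numbers are illustrative assumptions for this note and do not reflect ASTRA-sim's actual API or cost model.

# Minimal sketch (not ASTRA-sim code): first-order latency of a hierarchical
# all-reduce over a multi-dimensional topology, assuming a ring-based
# reduce-scatter + all-gather schedule per dimension and alpha-beta link costs.

def ring_allreduce_us(msg_bytes, npus, link_bw_GBps, link_latency_us):
    """Ring all-reduce of msg_bytes across npus: 2*(npus-1) hops of msg/npus bytes each."""
    if npus <= 1:
        return 0.0
    hops = 2 * (npus - 1)
    bytes_per_hop = msg_bytes / npus
    bytes_per_us = link_bw_GBps * 1e3          # GB/s -> bytes per microsecond
    return hops * (link_latency_us + bytes_per_hop / bytes_per_us)

def hierarchical_allreduce_us(msg_bytes, dims):
    """dims: list of (npus, link_bw_GBps, link_latency_us), innermost dimension first.
    After each dimension's reduce-scatter, the payload seen by outer
    dimensions shrinks by that dimension's size."""
    total_us = 0.0
    for npus, bw, lat in dims:
        total_us += ring_allreduce_us(msg_bytes, npus, bw, lat)
        msg_bytes /= npus
    return total_us

# Example: 1 GB of gradients over a 2D topology
# (8 NPUs on a 400 GB/s scale-up dimension, 16 nodes on a 50 GB/s scale-out dimension)
print(hierarchical_allreduce_us(1e9, [(8, 400, 0.5), (16, 50, 2.0)]))

Under this schedule, each outer (typically slower) dimension only carries the payload already reduced by the inner dimensions, which is one reason multi-dimensional topologies can pair high-bandwidth scale-up links with lower-bandwidth scale-out links; the actual simulator explores such trade-offs with far more detailed congestion, scheduling, and memory models than this sketch.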


