Towards a Better Theoretical Understanding of Independent Subnetwork Training

06/28/2023
by Egor Shulgin, et al.

Modern advancements in large-scale machine learning would be impossible without the paradigm of data-parallel distributed computing. Since distributed computing with large-scale models imparts excessive pressure on communication channels, significant recent research has been directed toward co-designing communication compression strategies and training algorithms with the goal of reducing communication costs. While pure data parallelism allows better data scaling, it suffers from poor model scaling properties. Indeed, compute nodes are severely limited by memory constraints, preventing further increases in model size. For this reason, the latest achievements in training giant neural network models also rely on some form of model parallelism. In this work, we take a closer theoretical look at Independent Subnetwork Training (IST), which is a recently proposed and highly effective technique for solving the aforementioned problems. We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication, and provide a precise analysis of its optimization performance on a quadratic model.
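For intuition, the sketch below illustrates the general flavor of IST on the kind of quadratic model mentioned in the abstract: in each communication round, the model's coordinates are partitioned into disjoint subnetworks, each worker trains only its own block locally while the remaining coordinates stay frozen, and the server reassembles the full parameter vector from the blocks. The partitioning scheme, step size, and number of local steps here are illustrative assumptions, not the paper's exact algorithm.

```python
# Hypothetical sketch of the IST idea on a quadratic objective
# f(x) = 0.5 * x^T A x - b^T x. All hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, n_workers, local_steps, lr = 8, 4, 5, 0.1

# Symmetric positive definite A and a vector b define the quadratic model.
M = rng.standard_normal((d, d))
A = M @ M.T / d + np.eye(d)
b = rng.standard_normal(d)

def grad(x):
    # Gradient of the quadratic objective.
    return A @ x - b

x = np.zeros(d)
for _ in range(20):  # communication rounds
    # Server partitions coordinates into disjoint subnetworks, one per worker.
    perm = rng.permutation(d)
    blocks = np.array_split(perm, n_workers)
    new_x = np.zeros(d)
    for block in blocks:
        # Each worker trains only its block independently; the other
        # coordinates remain frozen at the last synchronized model.
        local = x.copy()
        for _ in range(local_steps):
            g = grad(local)
            local[block] -= lr * g[block]
        new_x[block] = local[block]
    # Server reassembles the full model from the disjoint blocks.
    x = new_x

print("final objective:", 0.5 * x @ A @ x - b @ x)
```

Because the blocks are disjoint, each round communicates and updates every coordinate exactly once, which is the property that distinguishes this style of training from methods that compress a full gradient before communication.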

