Interlocking Backpropagation: Improving depthwise model-parallelism

10/08/2020
by Aidan N. Gomez, et al.

The number of parameters in state-of-the-art neural networks has increased drastically in recent years. This surge of interest in large-scale neural networks has motivated the development of new distributed training strategies enabling such models. One such strategy is model-parallel distributed training. Unfortunately, model-parallelism suffers from poor resource utilisation, leaving much of the available compute idle during training. In this work, we improve upon recent developments in an idealised model-parallel optimisation setting: local learning. Motivated by this inefficiency, we introduce a class of intermediary strategies between local and global learning referred to as interlocking backpropagation. These strategies preserve many of the compute-efficiency advantages of local optimisation while recovering much of the task performance achieved by global optimisation. We assess our strategies on image-classification ResNets and Transformer language models, finding that they consistently outperform local learning in terms of task performance and outperform global learning in training efficiency.
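The abstract contrasts local learning (no gradient flow between modules) with global end-to-end backpropagation, with interlocking backpropagation sitting in between. The sketch below is a toy, single-device PyTorch illustration of that spectrum rather than the paper's distributed implementation: a `window` of 0 recovers purely local updates, while larger windows let each module's auxiliary loss reach a few preceding modules. The `Stage` split, the auxiliary linear heads, and the recompute-based way of forming overlapping gradient windows are assumptions made for clarity, not details taken from the paper.

```python
# Minimal sketch (not the paper's implementation) of the spectrum between
# local and global learning on a depthwise-partitioned network.
# Stage splits, auxiliary heads, and the `window` parameter are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Stage(nn.Module):
    """One module of the partitioned network plus a local (auxiliary) head."""

    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.head = nn.Linear(hidden_dim, num_classes)


def train_step(stages, opts, x, y, window):
    """window=0: purely local learning (each loss updates only its own stage).
    window=w>0: the local loss at stage i also backpropagates through the w
    preceding stages, an interlocking-style middle ground between local and
    global optimisation."""
    # Canonical forward pass to cache the (detached) input of every stage.
    acts = [x]
    with torch.no_grad():
        h = x
        for stage in stages[:-1]:
            h = stage.body(h)
            acts.append(h)

    for opt in opts:
        opt.zero_grad()

    total = 0.0
    for i, stage in enumerate(stages):
        start = max(0, i - window)        # earliest stage this loss reaches
        h = acts[start]                   # detached cached activation
        for j in range(start, i + 1):     # re-run stages start..i with grad
            h = stages[j].body(h)
        loss = F.cross_entropy(stage.head(h), y)
        loss.backward()                   # grads accumulate across overlapping windows
        total += loss.item()

    for opt in opts:
        opt.step()
    return total


# Usage: four toy stages; window=1 gives overlapping two-stage gradient groups.
stages = nn.ModuleList([Stage(32, 64, 10)] + [Stage(64, 64, 10) for _ in range(3)])
opts = [torch.optim.SGD(s.parameters(), lr=0.1) for s in stages]
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
print(train_step(stages, opts, x, y, window=1))
```

Re-running the preceding stages is simply the easiest way to keep each loss's gradient window separate on one device; a real model-parallel setup would instead reuse the activations already resident on each device, which is where the compute-efficiency argument in the abstract comes from.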


