TF-Replicator: Distributed Machine Learning for Researchers

02/01/2019
by Peter Buchlovsky, et al.

We describe TF-Replicator, a framework for distributed machine learning designed for DeepMind researchers and implemented as an abstraction over TensorFlow. TF-Replicator simplifies writing data-parallel and model-parallel research code. The same models can be effortlessly deployed to different cluster architectures (i.e. one or many machines containing CPUs, GPUs or TPU accelerators) using synchronous or asynchronous training regimes. To demonstrate the generality and scalability of TF-Replicator, we implement and benchmark three very different models: (1) a ResNet-50 for ImageNet classification, (2) an SN-GAN for class-conditional ImageNet image generation, and (3) a D4PG reinforcement learning agent for continuous control. Our results show strong scaling without requiring any distributed systems expertise from the user. The TF-Replicator programming model will be open-sourced as part of TensorFlow 2.0 (see https://github.com/tensorflow/community/pull/25).
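To make the synchronous data-parallel regime mentioned in the abstract concrete, here is a minimal sketch in plain Python. It is not TF-Replicator's API: the function names (`local_gradient`, `split_batch`, `all_reduce_mean`, `sync_data_parallel_step`), the one-parameter linear model, and the toy data are all hypothetical and chosen only for illustration. Each simulated replica computes a gradient on its own shard of the batch, the gradients are averaged (standing in for an all-reduce), and every replica applies the same update.

```python
from typing import List, Sequence, Tuple

Example = Tuple[float, float]  # (input x, target y); toy data, assumed for illustration


def local_gradient(w: float, shard: Sequence[Example]) -> float:
    """Gradient of the mean of 0.5 * (w * x - y)**2 over one replica's shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def split_batch(batch: Sequence[Example], num_replicas: int) -> List[Sequence[Example]]:
    """Split a batch into equal contiguous shards, one per replica."""
    if len(batch) % num_replicas != 0:
        raise ValueError("batch size must be divisible by the number of replicas")
    size = len(batch) // num_replicas
    return [batch[i * size:(i + 1) * size] for i in range(num_replicas)]


def all_reduce_mean(values: Sequence[float]) -> float:
    """Stand-in for a cross-replica all-reduce that averages one value per replica."""
    return sum(values) / len(values)


def sync_data_parallel_step(w: float, batch: Sequence[Example],
                            num_replicas: int, lr: float) -> float:
    """One synchronous step: shard the batch, average gradients, update once."""
    shards = split_batch(batch, num_replicas)
    grads = [local_gradient(w, shard) for shard in shards]  # runs in parallel on real hardware
    return w - lr * all_reduce_mean(grads)


# Toy regression problem whose targets satisfy y = 2 * x.
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.0
for _ in range(200):
    w = sync_data_parallel_step(w, batch, num_replicas=2, lr=0.05)
print(w)
```

Because the shards are of equal size, the average of the per-replica gradients equals the gradient over the full batch, so a step on two replicas matches a step on one. That equivalence is what lets the same model code run unchanged across different replica counts in a synchronous setup.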


research
05/04/2018

Dynamic Control Flow in Large-Scale Machine Learning

Many recent machine learning models rely on fine-grained dynamic control...
research
11/12/2021

AlphaRotate: A Rotation Detection Benchmark using TensorFlow

AlphaRotate is an open-source Tensorflow benchmark for performing scalab...
research
11/05/2018

Mesh-TensorFlow: Deep Learning for Supercomputers

Batch-splitting (data-parallelism) is the dominant distributed Deep Neur...
research
10/18/2018

Private Machine Learning in TensorFlow using Secure Computation

We present a framework for experimenting with secure multi-party computa...
research
11/05/2018

Simple, Distributed, and Accelerated Probabilistic Programming

We describe a simple, low-level approach for embedding probabilistic pro...
research
11/25/2020

TLeague: A Framework for Competitive Self-Play based Distributed Multi-Agent Reinforcement Learning

Competitive Self-Play (CSP) based Multi-Agent Reinforcement Learning (MA...
research
03/24/2021

FastMoE: A Fast Mixture-of-Expert Training System

Mixture-of-Expert (MoE) presents a strong potential in enlarging the siz...
