Cloud Collectives: Towards Cloud-aware Collectives for ML Workloads with Rank Reordering

05/28/2021
by Liang Luo, et al.

ML workloads are becoming increasingly popular in the cloud. Good cloud training performance is contingent on efficient parameter exchange among VMs. We find that collectives, the widely used distributed communication algorithms, cannot perform optimally out of the box due to the hierarchical topology of datacenter networks and the multi-tenant nature of the cloud environment. In this paper, we present Cloud Collectives, a prototype that accelerates collectives by reordering the ranks of participating VMs such that the communication pattern dictated by the selected collective operation best exploits the locality in the network. Cloud Collectives is non-intrusive, requires neither code changes nor a rebuild of an existing application, and runs without support from cloud providers. Our preliminary application of Cloud Collectives to allreduce operations in public clouds results in a speedup of up to 3.7x in multiple microbenchmarks and 1.3x in real-world workloads of distributed training of deep neural networks and gradient boosted decision trees using state-of-the-art frameworks.
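To make the idea of rank reordering concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a ring-style allreduce in which each rank sends only to its successor, and a hypothetical pairwise cost matrix `COST` (for example, probed latency between VMs). The function names `ring_cost` and `greedy_ring_order` and the greedy nearest-neighbor heuristic are illustrative choices; the abstract does not say how the prototype measures locality or searches for an ordering.

```python
from typing import List, Sequence


def ring_cost(order: Sequence[int], cost: Sequence[Sequence[float]]) -> float:
    """Sum the cost of every hop in a ring where each rank sends to the next."""
    n = len(order)
    return sum(cost[order[i]][order[(i + 1) % n]] for i in range(n))


def greedy_ring_order(cost: Sequence[Sequence[float]], start: int = 0) -> List[int]:
    """Build a ring by repeatedly hopping to the cheapest unvisited VM.

    This is a simple heuristic chosen for illustration; it is not guaranteed
    to find the cheapest ring.
    """
    order = [start]
    remaining = set(range(len(cost))) - {start}
    while remaining:
        last = order[-1]
        # Break ties by VM index so the result is deterministic.
        nxt = min(remaining, key=lambda v: (cost[last][v], v))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Hypothetical costs for 4 VMs: VMs 0 and 2 share a rack, as do VMs 1 and 3.
# Intra-rack hops cost 1, cross-rack hops cost 10 (arbitrary units).
COST = [
    [0, 10, 1, 10],
    [10, 0, 10, 1],
    [1, 10, 0, 10],
    [10, 1, 10, 0],
]

default_order = [0, 1, 2, 3]          # ranks as assigned out of the box
reordered = greedy_ring_order(COST)   # ranks after reordering

print(default_order, ring_cost(default_order, COST))
print(reordered, ring_cost(reordered, COST))
```

In this toy setup the default ordering crosses racks on every hop, while the reordered ring keeps two of its four hops inside a rack. The application itself is unchanged; only the mapping from VMs to ranks differs.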

research
05/21/2018

Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training

Distributed deep neural network (DDNN) training constitutes an increasin...
research
11/28/2022

CWD: A Machine Learning based Approach to Detect Unknown Cloud Workloads

Workloads in modern cloud data centers are becoming increasingly complex...
research
03/04/2022

Benchmarking tunnel and encryption methodologies in cloud environments

The recent past has seen the adoption of multi-cloud deployments by ente...
research
09/19/2022

Supporting Multi-Cloud in Serverless Computing

Serverless computing is a widely adopted cloud execution model composed ...
research
10/13/2022

Skyplane: Optimizing Transfer Cost and Throughput Using Cloud-Aware Overlays

Cloud applications are increasingly distributing data across multiple re...
research
03/09/2023

Cloudless-Training: A Framework to Improve Efficiency of Geo-Distributed ML Training

Geo-distributed ML training can benefit many emerging ML scenarios (e.g....
research
10/10/2021

A Serverless Distributed Ledger for Enterprises

Enterprises have been attracted by the capability of blockchains to prov...
