Distributed Machine Learning through Heterogeneous Edge Systems

11/16/2019
by Hanpeng Hu et al.

Many emerging AI applications call for distributed machine learning (ML) among edge systems (e.g., IoT devices and PCs at the edge of the Internet), where data cannot be uploaded to a central venue for model training due to its large volume and/or security/privacy concerns. Edge devices are intrinsically heterogeneous in computing capacity, which poses significant challenges to parameter synchronization for parallel training with the parameter server (PS) architecture. This paper proposes ADSP, a parameter synchronization scheme for distributed ML with heterogeneous edge systems. The core idea of ADSP is to eliminate the significant waiting time incurred by existing parameter synchronization models: faster edge devices keep training continuously and commit their model updates at strategically decided intervals. We design algorithms that decide the time points at which each worker commits its model update, ensuring not only global model convergence but also faster convergence. Our testbed implementation and experiments show that ADSP significantly outperforms existing parameter synchronization models in terms of model convergence time, scalability, and adaptability to large heterogeneity.
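To make the commit-interval idea concrete, below is a minimal, self-contained Python sketch under stated assumptions: the class and function names (ParameterServer, Worker, choose_commit_intervals), the toy quadratic objective, and the uniform target commit rate are all illustrative placeholders, not the authors' implementation or the actual interval-selection algorithm from the paper. Asynchrony is simulated with a serialized loop so the sketch stays runnable in one process.

```python
import numpy as np

# Sketch of ADSP-style synchronization (hypothetical names, not the paper's code).
# Each worker trains continuously at its own speed and commits its accumulated
# update to the parameter server at its own interval, so slow workers never
# block fast ones.

class ParameterServer:
    def __init__(self, dim):
        self.params = np.zeros(dim)

    def commit(self, update):
        # Apply a worker's accumulated update to the global model.
        self.params += update

    def pull(self):
        return self.params.copy()


class Worker:
    def __init__(self, speed, commit_every, dim, lr=0.01):
        self.speed = speed                # local steps per time unit (heterogeneous)
        self.commit_every = commit_every  # commit interval in time units (scheduler-chosen)
        self.lr = lr
        self.accum = np.zeros(dim)

    def run_interval(self, ps, duration):
        # Pull the latest model, train for `duration` time units, then commit.
        local = ps.pull()
        for _ in range(int(self.speed * duration)):
            grad = local - 1.0            # toy gradient of 0.5 * ||w - 1||^2
            local -= self.lr * grad
            self.accum += -self.lr * grad
        ps.commit(self.accum)
        self.accum[:] = 0.0


def choose_commit_intervals(speeds, target_commit_rate):
    # Hypothetical stand-in for the paper's interval-selection algorithm:
    # every worker commits at the same target rate regardless of its speed.
    return [1.0 / target_commit_rate for _ in speeds]


if __name__ == "__main__":
    dim = 4
    speeds = [50, 200, 800]               # heterogeneous edge devices
    intervals = choose_commit_intervals(speeds, target_commit_rate=2.0)
    ps = ParameterServer(dim)
    workers = [Worker(s, c, dim) for s, c in zip(speeds, intervals)]
    for _ in range(10):                    # serialized stand-in for async execution
        for w in workers:
            w.run_interval(ps, w.commit_every)
    print("global model after training:", ps.params)
```

The property the sketch mirrors is that no worker ever waits for another: a faster device simply performs more local steps between two commits, while the commit schedule, rather than barrier synchronization, governs how updates reach the global model.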


