Architecting Peer-to-Peer Serverless Distributed Machine Learning Training for Improved Fault Tolerance

02/27/2023
by Amine Barrak, et al.

Distributed machine learning is the practice of training a model across multiple computers or devices, called nodes. By distributing the workload, it can speed up training and allow more complex models to be trained. Serverless computing is a newer cloud computing paradigm that uses functions as the computational unit. It can benefit distributed learning systems by enabling automated resource scaling, reducing manual intervention, and lowering cost. Several topologies for distributed machine learning have been established (centralized, parameter server, peer-to-peer). The parameter server architecture, however, may be limited in fault tolerance, with a single point of failure and complex recovery processes. Training in a peer-to-peer (P2P) architecture can improve fault tolerance by eliminating that single point of failure. In a P2P architecture, each node or worker can act as both a server and a client, which allows more decentralized decision making and removes the need for a central coordinator. In this position paper, we propose exploring the use of serverless computing in distributed machine learning training and comparing the performance of the P2P architecture with the parameter server architecture, focusing on cost reduction and fault tolerance.
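
To make the fault-tolerance contrast concrete, here is a minimal sketch in Python. It is an illustration only and not the authors' system: the function names, the worker names, and the use of a single scalar per worker in place of a real gradient tensor are all assumptions. It shows that a parameter-server round cannot proceed when the server is down, while a P2P round can still aggregate over whichever peers remain.

```python
from statistics import fmean


def parameter_server_step(gradients, server_alive):
    """Average all worker gradients at one central server (hypothetical helper).

    If the server is down, no worker can obtain an update: this is the
    single point of failure.
    """
    if not server_alive:
        raise RuntimeError("parameter server unavailable: no update possible")
    return fmean(gradients.values())


def p2p_step(gradients, alive):
    """Each live peer averages the gradients of all live peers, itself included.

    No coordinator is involved, so losing a peer only shrinks the set
    that is averaged over. Returns one aggregated value per live peer.
    """
    live = {name: g for name, g in gradients.items() if name in alive}
    return {name: fmean(live.values()) for name in live}


# Toy scalar "gradients" from three workers (assumed values).
gradients = {"w0": 0.2, "w1": 0.4, "w2": 0.6}

# Worker w1 has failed; the remaining peers still complete the round.
p2p_result = p2p_step(gradients, alive={"w0", "w2"})
```

In this sketch every live peer ends up with the same average because each one sees all other live peers. A real deployment would exchange tensors over the network and might average only with a subset of neighbours.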


research
01/04/2020

Search techniques in peer to peer networks

Peer to peer (P2P) networks are an overlay on IP network of the internet...
research
05/23/2018

Collective Online Learning via Decentralized Gaussian Processes in Massive Multi-Agent Systems

Distributed machine learning (ML) is a modern computation paradigm that ...
research
04/17/2023

Decentralized Learning Made Easy with DecentralizePy

Decentralized learning (DL) has gained prominence for its potential bene...
research
10/17/2018

Distributed Learning over Unreliable Networks

Most of today's distributed machine learning systems assume reliable ne...
research
05/04/2022

Babel: A Framework for Developing Performant and Dependable Distributed Protocols

Prototyping and implementing distributed algorithms, particularly those ...
research
05/22/2023

Efficient Exchange of Metadata Information in Geo-Distributed Fog Systems

Metadata information is crucial for efficient geo-distributed fog comput...
research
06/30/2019

Network-accelerated Distributed Machine Learning Using MLFabric

Existing distributed machine learning (DML) systems focus on improving t...
