Empirical Study of Straggler Problem in Parameter Server on Iterative Convergent Distributed Machine Learning

07/28/2023
by Benjamin Wong, et al.

The purpose of this study is to test the effectiveness of current straggler mitigation techniques across several important iterative-convergent machine learning (ML) algorithms, including Matrix Factorization (MF), Multinomial Logistic Regression (MLR), and Latent Dirichlet Allocation (LDA). The experiments were implemented on FlexPS, a recent system that employs the parameter server architecture, and used the Bulk Synchronous Parallel (BSP) computational model to examine the straggler problem in parameter-server-based iterative-convergent distributed ML. In addition, the study analyzes the experimental setup of the parameter server strategy for parallel learning problems by injecting common straggler patterns and executing the latest mitigation techniques. The findings are significant in that they provide a platform for further research into the problem and allow researchers to compare different mitigation methods across applications. The outcome is therefore expected to facilitate the development of new techniques and new perspectives for addressing this problem.
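To make concrete why stragglers are costly under BSP, the sketch below simulates a parameter-server-style iteration in which every worker must reach a synchronization barrier before the next iteration begins, so the slowest worker sets the pace of the whole cluster. This is a minimal, hypothetical single-machine simulation for illustration only; the worker model, the 5x slowdown factor, and all names are assumptions and do not reflect FlexPS APIs.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

# Illustrative simulation of Bulk Synchronous Parallel (BSP) training
# with an injected straggler. All names and constants are hypothetical
# and do not correspond to FlexPS APIs.

NUM_WORKERS = 4
NUM_ITERATIONS = 3

def worker_step(worker_id, straggler_id):
    """Simulate one worker's gradient computation for a single iteration."""
    compute_time = 0.1  # nominal per-iteration compute time (seconds)
    if worker_id == straggler_id:
        compute_time *= 5  # injected straggler: 5x slowdown this iteration
    time.sleep(compute_time)
    return compute_time

for it in range(NUM_ITERATIONS):
    straggler_id = random.randrange(NUM_WORKERS)  # random straggler pattern
    start = time.time()
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        # BSP barrier: the iteration finishes only when all workers do,
        # so wall time is gated by the slowest worker.
        times = list(pool.map(lambda w: worker_step(w, straggler_id),
                              range(NUM_WORKERS)))
    print(f"iter {it}: wall time {time.time() - start:.2f}s, "
          f"slowest worker took {max(times):.2f}s")
```

Running this shows each iteration's wall time tracking the straggler's compute time rather than the nominal one, which is the behavior the study's mitigation techniques aim to reduce.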
