Petuum: A New Platform for Distributed Machine Learning on Big Data

12/30/2013
by Eric P. Xing, et al.

What is a systematic way to efficiently apply a wide spectrum of advanced ML programs to industrial-scale problems, using Big Models (up to hundreds of billions of parameters) on Big Data (up to terabytes or petabytes)? Modern parallelization strategies employ fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm popularized by MapReduce, or even specialized graph-based execution that relies on graph representations of ML programs. This variety of approaches tends to pull system and algorithm design in different directions, and it remains difficult to find a universal platform applicable to a wide range of ML programs at scale. We propose a general-purpose framework that systematically addresses data- and model-parallel challenges in large-scale ML, based on the observation that many ML programs are fundamentally optimization-centric and admit error-tolerant, iterative-convergent algorithmic solutions. This presents unique opportunities for an integrative system design, such as bounded-error network synchronization and dynamic scheduling based on ML program structure. We demonstrate the efficacy of these system designs against well-known implementations of modern ML algorithms: they allow ML programs to run in much less time and at considerably larger model sizes, even on modestly sized compute clusters.
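
To make the "error-tolerant, iterative-convergent" observation concrete, here is a minimal toy sketch in Python. It is not Petuum's API or the authors' algorithm. The function name `stale_gradient_descent`, the quadratic objective, and all parameter values are assumptions chosen for illustration. Each update uses a parameter value that is a fixed number of iterations out of date, mimicking a worker that reads a stale copy of the model under a bounded staleness limit.

```python
from collections import deque


def stale_gradient_descent(target=3.0, lr=0.1, staleness=3, steps=200):
    """Minimize 0.5 * (x - target)**2 using gradients from a stale copy of x.

    Hypothetical illustration only: `staleness` is how many iterations old
    the parameter value used for the gradient is (0 means fully synchronous).
    """
    x = 0.0
    # The deque holds the last `staleness + 1` values of x; index 0 is the oldest.
    history = deque([x] * (staleness + 1), maxlen=staleness + 1)
    for _ in range(steps):
        stale_x = history[0]      # value of x from `staleness` iterations ago
        grad = stale_x - target   # gradient of the objective, evaluated at stale_x
        x -= lr * grad            # update the current value with the stale gradient
        history.append(x)         # the oldest entry is dropped automatically
    return x


if __name__ == "__main__":
    for s in (0, 3):
        print(f"staleness={s}: x = {stale_gradient_descent(staleness=s):.6f}")
```

With this small step size, both the synchronous run (`staleness=0`) and the stale run (`staleness=3`) approach the same minimizer. This is the kind of tolerance to bounded error that the abstract refers to. Larger step sizes or longer delays can make such an iteration diverge, so the bound matters.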

