Dynamic Parameter Allocation in Parameter Servers

by   Alexander Renz-Wieland, et al.

To keep up with increasing dataset sizes and model complexity, distributed training has become a necessity for large machine learning tasks. Parameter servers ease the implementation of distributed parameter management—a key concern in distributed training—, but can induce severe communication overhead. To reduce communication overhead, distributed machine learning algorithms use techniques to increase parameter access locality (PAL), achieving up to linear speed-ups. We found that existing parameter servers provide only limited support for PAL techniques, however, and therefore prevent efficient training. In this paper, we explore whether and to what extent PAL techniques can be supported, and whether such support is beneficial. We propose to integrate dynamic parameter allocation into parameter servers, describe an efficient implementation of such a parameter server called Lapse, and experimentally compare its performance to existing parameter servers across a number of machine learning tasks. We found that Lapse provides near linear scaling and can be orders of magnitude faster than existing parameter servers.



There are no comments yet.


page 1

page 2

page 3

page 4


Replicate or Relocate? Non-Uniform Access in Parameter Servers

Parameter servers (PSs) facilitate the implementation of distributed tra...

GADMM: Fast and Communication Efficient Framework for Distributed Machine Learning

When the data is distributed across multiple servers, efficient data exc...

Toward Efficient Online Scheduling for Distributed Machine Learning Systems

Recent years have witnessed a rapid growth of distributed machine learni...

Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training

Distributed deep neural network (DDNN) training constitutes an increasin...

Sub-logarithmic Distributed Oblivious RAM with Small Block Size

Oblivious RAM (ORAM) is a cryptographic primitive that allows a client t...

HyperNAT: Scaling Up Network AddressTranslation with SmartNICs for Clouds

Network address translation (NAT) is a basic functionality in cloud gate...

Distributed Learning over Unreliable Networks

Most of today's distributed machine learning systems assume reliable ne...

Code Repositories


A Parameter Server with Dynamic Parameter Allocation

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.