FfDL: A Flexible Multi-tenant Deep Learning Platform

09/14/2019
by K. R. Jayaram, et al.

Deep learning (DL) is becoming increasingly popular in several application domains and has made new application features feasible and accurate in areas such as computer vision, speech recognition and synthesis, self-driving automobiles, and drug design. As a result, large-scale on-premise and cloud-hosted deep learning platforms have become essential infrastructure in many organizations. These systems accept, schedule, manage, and execute DL training jobs at scale. This paper describes the design and implementation of, and our experiences with, FfDL, a DL platform used at IBM. We describe how our design balances dependability with scalability, elasticity, flexibility, and efficiency. We examine FfDL qualitatively, through a retrospective look at the lessons learned from building, operating, and supporting it, and quantitatively, through a detailed empirical evaluation covering the overheads introduced by the platform for various deep learning models, the load and performance observed in a real case study using FfDL within our organization, the frequency of various faults observed (including unanticipated faults), and experiments demonstrating the benefits of various scheduling policies. FfDL has been open-sourced.
