CPR: Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery

11/05/2020
by   Kiwan Maeng, et al.
0

The paper proposes and optimizes a partial recovery training system, CPR, for recommendation models. CPR relaxes the consistency requirement by enabling non-failed nodes to proceed without loading checkpoints when a node fails during training, improving failure-related overheads. The paper is the first to the extent of our knowledge to perform a data-driven, in-depth analysis of applying partial recovery to recommendation models and identified a trade-off between accuracy and performance. Motivated by the analysis, we present CPR, a partial recovery training system that can reduce the training time and maintain the desired level of model accuracy by (1) estimating the benefit of partial recovery, (2) selecting an appropriate checkpoint saving interval, and (3) prioritizing to save updates of more frequently accessed parameters. Two variants of CPR, CPR-MFU and CPR-SSU, reduce the checkpoint-related overhead from 8.2-8.5 emulating the failure pattern and overhead of a production-scale cluster. While reducing overhead significantly, CPR achieves model quality on par with the more expensive full recovery scheme, training the state-of-the-art recommendation model using Criteo's Ads CTR dataset. Our preliminary results also suggest that CPR can speed up training on a real production-scale cluster, without notably degrading the accuracy.

READ FULL TEXT

page 5

page 8

research
02/13/2023

SWIFT: Expedited Failure Recovery for Large-scale DNN Training

As the size of deep learning models gets larger and larger, training tak...
research
05/04/2021

Alternate Model Growth and Pruning for Efficient Training of Recommendation Systems

Deep learning recommendation systems at scale have provided remarkable g...
research
01/14/2023

Failure Tolerant Training with Persistent Memory Disaggregation over CXL

This paper proposes TRAININGCXL that can efficiently process large-scale...
research
04/05/2021

ECRM: Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding

Deep-learning-based recommendation models (DLRMs) are widely deployed to...
research
10/17/2020

Check-N-Run: A Checkpointing System for Training Recommendation Models

Checkpoints play an important role in training recommendation systems at...
research
10/07/2020

Machine learning for recovery factor estimation of an oil reservoir: a tool for de-risking at a hydrocarbon asset evaluation

Well known oil recovery factor estimation techniques such as analogy, vo...
research
07/08/2020

Algorithm-Based Checkpoint-Recovery for the Conjugate Gradient Method

As computers reach exascale and beyond, the incidence of faults will inc...

Please sign up or login with your details

Forgot password? Click here to reset