Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance Benchmark

by   Cody Coleman, et al.

The deep learning community has proposed optimizations spanning hardware, software, and learning theory to improve the computational performance of deep learning workloads. While some of these optimizations perform the same operations faster (e.g., switching from a NVIDIA K80 to P100), many modify the semantics of the training procedure (e.g., large minibatch training, reduced precision), which can impact a model's generalization ability. Due to a lack of standard evaluation criteria that considers these trade-offs, it has become increasingly difficult to compare these different advances. To address this shortcoming, DAWNBENCH and the upcoming MLPERF benchmarks use time-to-accuracy as the primary metric for evaluation, with the accuracy threshold set close to state-of-the-art and measured on a held-out dataset not used in training; the goal is to train to this accuracy threshold as fast as possible. In DAWNBENCH , the winning entries improved time-to-accuracy on ImageNet by two orders of magnitude over the seed entries. Despite this progress, it is unclear how sensitive time-to-accuracy is to the chosen threshold as well as the variance between independent training runs, and how well models optimized for time-to-accuracy generalize. In this paper, we provide evidence to suggest that time-to-accuracy has a low coefficient of variance and that the models tuned for it generalize nearly as well as pre-trained models. We additionally analyze the winning entries to understand the source of these speedups, and give recommendations for future benchmarking efforts.


page 1

page 2

page 3

page 4


MLPerf Training Benchmark

Machine learning is experiencing an explosion of software and hardware s...

Measuring and Reducing Gendered Correlations in Pre-trained Models

Pre-trained models have revolutionized natural language understanding. H...

Training EfficientNets at Supercomputer Scale: 83 Accuracy in One Hour

EfficientNets are a family of state-of-the-art image classification mode...

Ensemble of Averages: Improving Model Selection and Boosting Performance in Domain Generalization

In Domain Generalization (DG) settings, models trained on a given set of...

Image Classification at Supercomputer Scale

Deep learning is extremely computationally intensive, and hardware vendo...

Does Progress On Object Recognition Benchmarks Improve Real-World Generalization?

For more than a decade, researchers have measured progress in object rec...

Duet Benchmarking: Improving Measurement Accuracy in the Cloud

We investigate the duet measurement procedure, which helps improve the a...

Please sign up or login with your details

Forgot password? Click here to reset