Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance Benchmark

by Cody Coleman et al.

The deep learning community has proposed optimizations spanning hardware, software, and learning theory to improve the computational performance of deep learning workloads. While some of these optimizations perform the same operations faster (e.g., switching from an NVIDIA K80 to a P100), many modify the semantics of the training procedure (e.g., large minibatch training, reduced precision), which can impact a model's generalization ability. Due to a lack of standard evaluation criteria that consider these trade-offs, it has become increasingly difficult to compare these different advances. To address this shortcoming, DAWNBench and the upcoming MLPerf benchmarks use time-to-accuracy as the primary metric for evaluation, with the accuracy threshold set close to state-of-the-art and measured on a held-out dataset not used in training; the goal is to train to this accuracy threshold as fast as possible. In DAWNBench, the winning entries improved time-to-accuracy on ImageNet by two orders of magnitude over the seed entries. Despite this progress, it is unclear how sensitive time-to-accuracy is to the chosen threshold as well as to the variance between independent training runs, and how well models optimized for time-to-accuracy generalize. In this paper, we provide evidence to suggest that time-to-accuracy has a low coefficient of variation and that the models tuned for it generalize nearly as well as pre-trained models. We additionally analyze the winning entries to understand the sources of these speedups, and give recommendations for future benchmarking efforts.




1 Introduction

In recent years, researchers have proposed a range of hardware, software, and statistical optimizations for deep learning, ranging from new software systems abadi2016tensorflow ; chetlur2014cudnn ; chilimbi2014project ; dean2012large ; jia2014caffe and hardware burger2017microsoft ; han2016eie ; jouppi2017datacenter ; pena2017benchmarking to communication methods de2017understanding ; harlap2016addressing ; recht2011hogwild ; zhang2014dimmwitted and training algorithms glorot2011deep ; goyal2017accurate ; iandola2016squeezenet ; kingma2014adam ; sohl2014fast ; sun2017meprop ; sutskever2013importance . Many of these optimizations preserve the statistical efficiency (number of iterations) of the underlying algorithm and model; for example, replacing an NVIDIA K80 GPU with an NVIDIA P100 GPU can give up to a 4× speedup murphy2017deep . However, many optimizations sacrifice some statistical efficiency for hardware efficiency (time needed for each iteration) or other auxiliary metrics. On the hardware side, large minibatch training goyal2017accurate ; jouppi2017datacenter and reduced precision chilimbi2014project ; de2017understanding ; han2015deep can help better saturate hardware units and speed up "proxy" metrics such as time to process an epoch (especially in the distributed setting), but can prevent the model from converging. On the statistical side, optimizations such as Adam kingma2014adam can speed up proxy metrics such as "time-to-training-loss", but can also prevent the model from converging to a desired validation accuracy. Due to a lack of standard evaluation criteria, it has become increasingly difficult to compare the utility of these advances: prior benchmarks have tried by measuring proxy metrics for training time adolf2016fathom ; bahrampour2015comparative ; chetlur2014cudnn ; baidu2017deepbench ; chintala2017convnet ; google2017benchmarks ; shi2016benchmarking or accuracy (irrespective of time) krizhevsky2012imagenet ; lin2014microsoft ; rajpurkar2016squad , but these metrics are not sufficient coleman2017dawnbench .

To rectify this lack of standard evaluation criteria, DAWNBench and the upcoming MLPerf, two recent deep learning benchmarks, use the end-to-end training time to a specified validation accuracy level (called time-to-accuracy) as their main metric. DAWNBench was the first such benchmark, with the explicit goal of providing "an objective means of normalizing across differences in computation frameworks, hardware, optimization algorithms, hyperparameter settings, and other factors that affect real-world performance" coleman2017dawnbench . During the competition, Google, Intel, and others submitted optimized implementations that achieved two orders of magnitude of improvement in training time for ImageNet and an order of magnitude of improvement in cost over the initial seed entries. Even though DAWNBench does not explicitly require submissions to include code, the vast majority of submissions did include their implementations and command line instructions, demonstrating a strong interest from the community in reproducibility. Building on DAWNBench's success, MLPerf mlperf2018 has also adopted time-to-accuracy as its primary metric, while expanding the set of tasks being evaluated.
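As a concrete illustration, time-to-accuracy can be read directly off a per-epoch validation log as the elapsed time at which the threshold is first met. The sketch below is our own illustration with made-up log values, not DAWNBench's scoring code.

```python
def time_to_accuracy(log, threshold):
    """log: list of (elapsed_seconds, validation_accuracy), one per epoch.
    Returns the elapsed time at the first epoch meeting the threshold,
    or None if the run never reaches it."""
    for elapsed_seconds, val_accuracy in log:
        if val_accuracy >= threshold:
            return elapsed_seconds
    return None

# Hypothetical ImageNet-style log: (seconds elapsed, top-5 accuracy).
log = [(1800, 0.88), (3600, 0.915), (5400, 0.928), (7200, 0.931)]
```

A run that plateaus below the threshold simply never scores, which is one reason the threshold must sit near state-of-the-art for the metric to be meaningful.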

Despite the improvements seen in DAWNBench's submissions, time-to-accuracy has not been shown to have a low coefficient of variation, nor have the submissions been tested for generalization; only a single, possibly cherry-picked run was required for entry into the competition. In this paper, we analyze the DAWNBench entries to evaluate the time-to-accuracy metric. By running the top DAWNBench results using the provided open-source code, we were able to analyze the stability and generalizability of time-to-accuracy. Our data suggests that:

  • Time-to-accuracy is stable, i.e., has a low coefficient of variation (~5%) over multiple training runs, suggesting submissions reach the target accuracy in nearly the same time across runs.

  • Models optimized for time-to-accuracy generalize well on unseen data and fine-tuning tasks.

  • Independent runs that use the same hyperparameters and model architectures learn more similar functions than runs which use different hyperparameters and models.

  • Lowering the accuracy threshold allows for optimizations that reach the lower threshold more quickly, but that prevent models from converging to higher accuracy thresholds.

  • Submissions are generally reproducible by a third-party.

Beyond the metric, we analyze DAWNBench's entries to understand the optimizations that resulted in the speedups over the seed entries. Due to the open nature of DAWNBench, submissions used a variety of hardware, software, model, and training procedure optimizations. For example, one team used cyclic learning rates smith2017super and 8 NVIDIA V100 GPUs to train a custom Wide ResNet zagoruyko2016wide on CIFAR10 to 94% accuracy in less than 3 minutes (at a cost of $1.18). In addition, our experiments show that both hardware and statistical efficiency for deep learning workloads do not scale perfectly, with up to a 2× gap from ideal scaling, and that current software frameworks often severely underutilize modern hardware.

Finally, based on our experience reproducing these results, we suggest that future deep learning benchmarking efforts be open, reproducible, and agile. As software, hardware, and statistical methods are rapidly changing, we recommend that submissions include checkpoints, final predictions, and machine configuration to further improve reproducibility. Our experiences suggest that this information will help third-parties validate and compare results more easily, partially addressing the problems with reproducibility and reusability in machine learning joellepinaeu ; baker2016crisis .

2 Benchmark Summary

2.1 Overview

DAWNBench evaluates the time and cost of popular deep learning training and inference workloads. The initial release included two tasks: image classification on CIFAR10 and ImageNet, and question answering on SQuAD, and four metrics: training time to a specified validation accuracy, cost (in USD) of training to that accuracy for submissions that use hardware in the public cloud, average latency of performing inference on a single item (image or question), and average cost of inference for 10,000 items. While prior benchmarks, such as Baidu DeepBench baidu2017deepbench , Fathom adolf2016fathom , and TensorFlow's Performance Benchmark google2017benchmarks , evaluate proxy metrics such as time to compute a minibatch or other low-level operations, DAWNBench was the first to measure end-to-end performance as the time to a pre-specified (and high) level of accuracy.

Entries were required to submit a description of their submission and the validation accuracy after every epoch. While source code was optional, every submission for image classification included a link to all code needed to reproduce runs, assuming access to the appropriate hardware. However, some of the submissions for question answering on SQuAD did not include code; because of this and the general lack of submissions, we focus exclusively on CIFAR10 and ImageNet submissions in this paper.

2.2 Summary of Entries

The entries spanned a range of hardware platforms, software platforms, and models. We summarize the entries in Table 1. Notably, the entries spanned GPUs, TPUs, and CPUs on the hardware side, TensorFlow, PyTorch, and Caffe on the software side, and used both standard and learned model architectures.

ImageNet. DAWNBench used a top-5 accuracy target of 93% for ImageNet. The best seed entry was a ResNet-152 he2016deep model trained on 8 NVIDIA K80 GPUs in 10 days 10 hours and 42 minutes for $1112.64. The fastest entry overall trained ResNet-50 on half of a TPUv2 pod in 30 minutes, two orders-of-magnitude faster. The fastest entry on publicly available hardware was a ResNet-50, which trained in about 3 hours. The cheapest entry was an AmoebaNet real2018regularized model that trained for $49.30.

CIFAR10. DAWNBench used an accuracy target of 94% for CIFAR10. The best seed entry was a ResNet-164 he2016deep model trained in 2.5 hours on a NVIDIA P100. The fastest entry trained a custom Wide ResNet zagoruyko2016wide architecture in less than 3 minutes on 8 NVIDIA V100 GPUs, a 52× improvement. The cheapest entry trained the same custom Wide ResNet for $0.26 on 1 NVIDIA V100 GPU.

These large improvements for both ImageNet and CIFAR10 come from a range of optimizations, including mixed-precision training micikevicius2017mixed , large minibatch training goyal2017accurate , progressive resizing of images lim2017enhanced ; karras2017progressive , novel model architectures real2018regularized , and faster hardware murphy2017deep ; jouppi2017datacenter ; we investigate this further in Section 4.

Hardware     ImageNet entries   CIFAR10 entries
GPU          3                  6
TPU          6                  0
CPU          3                  0

Framework    ImageNet entries   CIFAR10 entries
TensorFlow   8                  2
PyTorch      1                  4
Caffe        3                  0
Table 1: Overview of hardware platform and software framework for each DAWNBench submission.

3 Metric Evaluation

Entry name Dataset Coefficient of variation Fraction of runs reaching target
Wide ResNet-34, 8xV100 CIFAR10 3.8% 50%
Wide ResNet-34, 1xV100 CIFAR10 2.9% 70%
ResNet-18, 1xV100 CIFAR10 1.4% 90%
ResNet-50, 8xV100 ImageNet 5.3% 80%
ResNet-50, 1xTPU ImageNet 4.5% 100%
AmoebaNet-D, 1xTPU ImageNet 2.3% 100%
ResNet-50, 1/2 TPU Pod ImageNet 2.5% 100%
Table 2: Coefficient of variation and fraction of runs that reached the desired target accuracy of the top single server blade entries for image classification on CIFAR10 (10 runs) and ImageNet (5 runs). We additionally include the coefficient over 4 runs of 1/2 a TPU Pod for ResNet-50.

In this section, we evaluate the time-to-accuracy metric along three axes, using publicly available code from DAWNBench’s top submissions. First, we demonstrate that time-to-accuracy is stable with a low coefficient of variation over several runs with fixed hyperparameters but different random seeds. Second, we show that setting a lower threshold allows for optimizations that prevent networks from converging, making the choice of a threshold near state-of-the-art critical. Third, we provide evidence that networks optimized for time-to-accuracy generalize as well as regular pre-trained networks.

3.1 Variance and stability of metric

(a) Custom wide ResNet-34, CIFAR10.
(b) ResNet-50, ImageNet.
Figure 1: Validation accuracy over time of the top single server blade entries for image classification. Left: 10 training runs of the custom Wide ResNet-34 (8 V100 GPUs) on CIFAR10 with an accuracy threshold of 94% top-1 accuracy (dashed line). Right: 10 training runs of the ResNet-50 with progressive resizing on ImageNet with an accuracy threshold of 93% top-5 accuracy (dashed line). Each trial is shown as a separate line. The variance across runs is small near the accuracy thresholds.

Variability of time-to-accuracy. We measured the stability of time-to-accuracy by running the top single server blade training time entries for CIFAR10 and ImageNet several times each. As shown in Table 2, the coefficient of variation (the ratio of the standard deviation to the mean) of time-to-accuracy is at most 5.3% for all entries, indicating that the metric is stable to the randomness inherent to training deep networks. However, several entries failed to consistently achieve the given accuracy threshold. In particular, the cyclic learning rates used by the top two CIFAR10 entries appear to make validation convergence less robust smith2017super . Additionally, cyclic learning rates prevent validation convergence on ImageNet and were thus not included in any top submissions howard2018training .
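For reference, the coefficient of variation reported in Table 2 is simply the (population) standard deviation of the measured times divided by their mean. The helper below, with invented run times, is our own illustration rather than the script used for the paper's measurements.

```python
from statistics import mean, pstdev

def coefficient_of_variation(times):
    """Ratio of the standard deviation to the mean of time-to-accuracy
    measurements (one value per independent training run)."""
    return pstdev(times) / mean(times)

# Hypothetical times (minutes) for five runs of a single entry.
run_times = [178, 182, 180, 185, 175]
```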

The validation convergence curves for all 10 trials of the top single server blade entries on CIFAR10 and ImageNet are shown in Figure 1; these entries were tuned specifically for time to 94% top-1 accuracy and time to 93% top-5 accuracy, respectively. Both reach the accuracy targets later in training, when the model converges. Earlier in training, validation accuracy is less stable, as both entries start training with large learning rates.

(a) CIFAR10.
(b) ImageNet.
Figure 2: Jaccard similarity for predictions of various CIFAR10 and ImageNet models. The Jaccard similarity is higher for instances of the same model compared to instances of different models.

Variability of models. To further understand how models optimized for time-to-accuracy behave, we compared the top-1 predictions of models over several trials with different random seeds using Jaccard similarity. As shown in Figure 2, the Jaccard similarity within multiple trials of a single entry is higher than between different entries, indicating that there is higher variance between different entries than from the randomness present in training.
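The Jaccard comparison in Figure 2 can be sketched as follows: treat each model's top-1 predictions as a set of (example index, predicted class) pairs and take the size of the intersection over the size of the union. This is our own minimal rendering of the idea, with toy predictions.

```python
def jaccard_similarity(preds_a, preds_b):
    """preds_a/preds_b: lists of top-1 predicted class ids, aligned by
    validation example index. Similarity is |A ∩ B| / |A ∪ B| computed
    over (index, prediction) pairs."""
    set_a = set(enumerate(preds_a))
    set_b = set(enumerate(preds_b))
    return len(set_a & set_b) / len(set_a | set_b)
```

Identical prediction vectors score 1.0; models that agree on half the examples score well below that, which is what separates same-entry trials from different entries in Figure 2.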

3.2 Sensitivity to Threshold

The choice of accuracy threshold is critical to the final ranking of submissions across model architecture, framework, and hardware. Using the official validation accuracy data included in the top DAWNBench entries, along with data collected from two other "synthetic" runs we ran, we show the time-to-accuracy for a number of different target accuracies in Figure 3. We demonstrate that aggressive learning rate schedules can reach 91% and 92% up to 2× faster, but fail to converge to 93%. This suggests that while rankings are stable if the models converge to a high validation accuracy, they are not if the accuracy threshold is lowered. This indicates to us that it is critical to pick a validation accuracy threshold close to state-of-the-art for time-to-accuracy to be a useful metric.

Figure 3: Time-to-accuracy of the DAWNBench ImageNet entries for various accuracy thresholds. The dotted lines are synthetic entries we ran with aggressive learning rate schedules. Thus, optimizations such as aggressive learning rates train to a lower accuracy faster, but prevent the model from reaching higher accuracy thresholds.

3.3 Generalization of Optimized Models

Evaluation on new data. We scraped and labeled a set of 2,864 images from Flickr. The images were scraped based on the WordNet keywords associated with each class in the ImageNet dataset. The top five images based on relevance were shown to a human labeler and labeled correct or incorrect. To ensure no overlap with ImageNet, only images posted after January 1st, 2014 were used. The images spanned 886 (out of 1000) classes. While these images are not entirely representative of ImageNet, we believe they reflect a reasonable distribution.

Top-5 accuracy on unseen data:
ResNet-18 (p)        89.5%
ResNet-50 (p)        92.2%
ResNet-152 (p)       93.2%
ResNet-50, 1xTPU     92.6%
ResNet-50, 8xV100    91.9%
AmoebaNet-D, 1xTPU   91.3%

Fine-tuning accuracy on three tasks:
Optimized model   98.4 ± 0.1%   99.1 ± 0.3%   35.7 ± 0.5%
ResNet-50 (p)     98.7 ± 0.1%   98.5 ± 0.5%   37.6 ± 0.7%
ResNet-18 (p)     98.5 ± 0.1%   97.7 ± 0.8%   97.6 ± 0.4%
ResNet-152 (p)    99.0 ± 0.1%   97.6 ± 0.4%   41.2 ± 0.4%
Table 3: Performance of pre-trained models and models optimized for time-to-accuracy on unseen data and three fine-tuning tasks. The models optimized for time-to-accuracy perform nearly as well as or better than the PyTorch pre-trained model. (p) refers to pretrained.

For both the pre-trained ResNet-50 weights provided by PyTorch and DAWNBench entries optimized for time-to-accuracy, we computed the top-5 accuracy on the images from Flickr. The results are summarized in Table 3. As shown, the models optimized for time-to-accuracy achieve nearly the same accuracy or higher than the pre-trained ResNet-50, indicating that optimizing for time-to-accuracy does not sacrifice generalization performance.

Fine-tuning on new tasks. We additionally fine-tuned a model optimized for time-to-accuracy and a pre-trained ResNet-50 provided by PyTorch on three image classification tasks: Birds-200 wah2011caltech , Kaggle cat/dog elson2007asirra (limited to 1000 training and 1000 validation examples), and Leeds butterfly wang2009learning .

We only trained the fully connected layer (i.e., not the convolutional layers) to test the representational capacity of the convolutional layers (which can be thought of as a featurizer). We used SGD with a momentum of 0.9, with 10 epochs of a learning rate of 0.001 followed by 10 epochs of a learning rate of 0.0001. We ran this procedure 10 times for each task and network.
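Training only the fully connected layer amounts to fitting a linear (softmax) classifier on the frozen convolutional features. The sketch below mirrors the recipe above (SGD with momentum 0.9, a learning rate of 1e-3 for the first half of training and 1e-4 for the second) in plain NumPy on pre-extracted features; it is an illustration of the procedure, not the exact training script.

```python
import numpy as np

def train_linear_head(features, labels, num_classes,
                      epochs=20, momentum=0.9):
    """Fit a softmax classifier on frozen features with full-batch
    SGD + momentum; lr = 1e-3 for the first half of the epochs,
    then 1e-4 (the schedule described in Section 3.3)."""
    n, d = features.shape
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(d, num_classes))
    b = np.zeros(num_classes)
    vW, vb = np.zeros_like(W), np.zeros_like(b)
    onehot = np.eye(num_classes)[labels]
    for epoch in range(epochs):
        lr = 1e-3 if epoch < epochs // 2 else 1e-4
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                   # d(cross-entropy)/d(logits)
        vW = momentum * vW - lr * (features.T @ grad)
        vb = momentum * vb - lr * grad.sum(axis=0)
        W, b = W + vW, b + vb
    return W, b
```

Because the convolutional "featurizer" is frozen, this optimization problem is convex, which is part of why fine-tuning in this style is cheap and stable.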

The results are summarized in Table 3. As shown, the performance of the optimized model is near the performance of the pre-trained model for all the tasks. The optimized model performs 1.87% worse on Birds-200, which is the hardest task, but nearly the same on Kaggle cat/dog and Leeds butterfly.

4 Entry Analysis

In this section, we analyze the top DAWNBench entries for image classification to better understand the impact of the various optimizations used. First, we report general trends in the submissions, showing that hardware, software, and statistical optimizations are necessary for high performance. Then, we perform a factor analysis and lesion study on a CIFAR10 entry to demonstrate the range of optimizations used to speed up deep learning workloads. Finally, we analyze how well submissions scale to multiple servers, and use a roofline analysis williams2009roofline to illustrate the fact that many entries severely underutilize the available hardware.

4.1 Overall Trends

To better understand the speedups seen in the DAWNBench entries, we plot the time to achieve the required accuracy for CIFAR10 and ImageNet in Figure 4. While many optimizations improved the time per epoch, several also reduced the number of epochs needed for convergence. We discuss some optimizations in more detail below.

Figure 4: Time per epoch vs. the total time to achieve the required accuracy for ImageNet (left) and CIFAR10 (right) entries. Most entries use software and hardware optimizations to reach the accuracy threshold faster; however, some use statistical optimizations to converge faster.

Hardware Optimizations. A large number of the DAWNBench entries took advantage of modern hardware for improved support of reduced precision and low-level operations commonly found in deep learning. For example, every ImageNet entry used mixed-precision training micikevicius2017mixed . Distributed training was also common because of advances in large minibatch training goyal2017accurate , where learning rate warm-ups and scaled learning rates enable large minibatch sizes to converge to similar accuracies as small minibatch sizes, but with a faster time-per-epoch. Both ResNet-50 and a learned architecture from Google, AmoebaNet, were able to use minibatch sizes an order of magnitude larger than the original work on ResNets he2016deep to leverage multiple TPUs. However, ResNet-50 was able to use a maximum minibatch size twice as large as AmoebaNet and consequently more TPUs while still converging to the target accuracy threshold, indicating that learned architectures could be more sensitive to the values of hyperparameters like the minibatch size.
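The linear scaling rule with learning rate warmup referenced above can be sketched as follows; the function and its default constants are illustrative, not taken from any submission's code.

```python
def large_batch_lr(epoch, batch_size, base_lr=0.1, base_batch=256,
                   warmup_epochs=5):
    """Linear-scaling rule (Goyal et al., 2017): scale the base learning
    rate by batch_size / base_batch, ramping up linearly over the warmup
    epochs to avoid early divergence at large minibatch sizes."""
    target = base_lr * batch_size / base_batch
    if epoch < warmup_epochs:
        return target * (epoch + 1) / warmup_epochs
    return target
```

For example, an 8× larger minibatch (2048 vs. 256) targets an 8× larger learning rate after warmup, keeping the per-example update scale roughly constant.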

Software Optimizations. Properly configuring and using hardware resources made a significant difference to performance. Simply changing the TensorFlow version from 1.7 to 1.8 gives a 1.4× reduction in training time for the same network architecture. In addition, most entries use optimized communication libraries (NCCLv2) and efficient image loading libraries (libjpeg-turbo, Pillow-SIMD) for significant performance improvements.

Statistical Optimizations. Optimizations like progressive image resizing lim2017enhanced ; karras2017progressive and cyclical learning rates smith2017super also dramatically helped improve performance by reducing the total number of epochs needed to reach the accuracy threshold.

Figure 5: Time vs. cost for the various entries submitted to DAWNBench, grouped by the device used. The plot on the right is a zoomed-in version of the boxed region in the plot on the left. Different hardware substrates have different Pareto frontiers. The costs of all entries that use TPU Pods were extrapolated using the cost of a single TPU in the cloud.

Trade-offs between Time and Cost. Most of the DAWNBench entries were run on the public cloud, making it easy to compute the cost of training a model to the desired accuracy level. Figure 5 shows the time versus cost for the different entries. We observe different Pareto frontiers between cost and time for the different device types; e.g., CPUs are expensive compared to TPUs and GPUs. TPU Pods are not publicly available, but we extrapolated their cost in the cloud by multiplying the price of a single TPU by the number of TPUs used.
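The Pareto frontier in Figure 5 contains exactly those entries for which no other entry is both faster and cheaper; a small sketch of that filter, with invented (hours, dollars) pairs, follows.

```python
def pareto_frontier(entries):
    """entries: list of (time, cost) pairs. Returns the entries that are
    not dominated, i.e., no other entry is at least as fast AND at least
    as cheap while differing in at least one dimension."""
    return [(t, c) for t, c in entries
            if not any(t2 <= t and c2 <= c and (t2, c2) != (t, c)
                       for t2, c2 in entries)]

# Hypothetical entries: (training hours, cost in USD).
entries = [(0.5, 50.0), (3.0, 40.0), (4.0, 45.0), (24.0, 10.0)]
```

Here the (4.0 h, $45) entry is dominated by the (3.0 h, $40) entry and drops off the frontier, mirroring the "objectively better" comparisons discussed below.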

Some entries are objectively better than others, e.g., the right-most TPU entry is more expensive and takes longer than the TPU entry just to the left of it. These are by-products of optimizations (e.g., improvements in TensorFlow and a different model architecture that converges to the target accuracy in fewer epochs) applied while keeping the hardware configuration the same, resulting in entries that achieve the accuracy target faster at the same cost-per-unit-time.

4.2 Factor Analysis

In this section, we perform a factor analysis of the optimizations for the winning entry for CIFAR10 on publicly available hardware. Using a custom Wide ResNet zagoruyko2016wide with mixed-precision training micikevicius2017mixed , large minibatch training dean2012large , and cyclic learning rates smith2017super , the model trained to 94% top-1 accuracy on CIFAR10 with an NVIDIA V100 GPU in 6 minutes and 45 seconds for $0.26. Applying all of these optimizations in conjunction gives a 3.14× speedup in time-to-94%-accuracy, as shown in Figure 6. In addition, turning off each of these optimizations individually gives as much as a 2.91× slowdown.
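Lesion numbers like those in Figure 6 are just ratios of the training time with one optimization disabled to the fully optimized time. The helper below shows the arithmetic; the ablation times are invented for illustration, chosen around the 6-minute-45-second (405 s) optimized run.

```python
def lesion_slowdowns(full_time_s, ablation_times_s):
    """full_time_s: time-to-accuracy with every optimization enabled.
    ablation_times_s: {optimization_name: time with only it disabled}.
    Returns the slowdown factor attributable to each optimization."""
    return {name: t / full_time_s for name, t in ablation_times_s.items()}

# Hypothetical ablation times (seconds) for the 405 s optimized entry.
slowdowns = lesion_slowdowns(405, {
    "mixed_precision": 700,
    "cyclic_lr": 1179,
    "large_minibatch": 560,
})
```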

Figure 6: Lesion study and factor analysis for the one V100 Wide ResNet-32 CIFAR10 entry. All optimizations give significant speedups, indicating the co-evolution of statistical and hardware improvements can dramatically improve performance. The baseline ran the Wide ResNet-32 on a P100.

4.3 Scaling and Roofline Analysis

(a) AmoebaNet on TPUs in a TPU pod.
(b) ResNet50 on V100 GPUs in a DGX-1.
Figure 7: Speedup with respect to a single worker vs. number of workers for two ImageNet models, one on a TPU pod and another on an NVIDIA DGX-1. As the number of workers increases, the scaling performance drops off (over a 2× gap from ideal scaling).

Scaling Analysis. Figure 7(a) shows the speedup relative to one worker for two different accuracy thresholds, as well as the average time to compute a single epoch, for different numbers of workers. We see that the "time-per-epoch" does not scale linearly with the number of workers for the AmoebaNet model in the TPU pod. The "time-to-accuracy" scales even worse, since a greater number of epochs is needed to converge to the same accuracy target. This is in line with past work finding that small minibatch sizes usually lead to more robust training masters2018revisiting .
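A convenient summary of Figure 7 is scaling efficiency: the measured speedup divided by the ideal linear speedup. The one-liner below, with made-up timings, makes the 2× gap concrete.

```python
def scaling_efficiency(time_one_worker, time_n_workers, n_workers):
    """Measured speedup over one worker, as a fraction of ideal linear
    scaling (1.0 = perfect scaling, 0.5 = a 2x gap from ideal)."""
    speedup = time_one_worker / time_n_workers
    return speedup / n_workers
```

For example, an 8-worker run that is only 4× faster than a single worker has a scaling efficiency of 0.5, i.e., the 2× gap from ideal scaling noted in the Figure 7 caption.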

Figure 7(b) shows the speedups when scaling the training of a ResNet-50 model to 1, 2, 4, and 8 GPUs in an NVIDIA DGX-1. We observe that both time-per-epoch and time-to-accuracy for the ResNet-50 model scale much better on the DGX-1, partially because of lower communication overhead.

Figure 8: Roofline models for the various entries submitted to DAWNBench. All of the entries underutilize the hardware resources, by up to 10×. The figure on the right is a zoomed-in version of the boxed region in the plot on the left.

Roofline Analysis. To further study the hardware performance of various entries, we use the roofline model williams2009roofline , which can highlight causes of performance bottlenecks. The roofline model plots computational throughput (in floating point operations per second) against the operational intensity of the application (number of floating-point operations performed per DRAM byte accessed). Applications with high operational intensity are “compute-bound” (the flat line in Figure 8) and bottlenecked on the device’s computation units, while applications with low operational intensity are “memory-bound” (the slanting line in Figure 8) and bottlenecked on memory accesses. Each point in Figure 8 represents a DAWNBench entry. For entries which use progressive image resizing lim2017enhanced ; karras2017progressive , we show different points for each image size used. Operational intensities and throughputs are approximated by instrumenting model code, and using off-the-shelf command-line utilities like nvprof.
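The roofline bound itself is a single min: attainable throughput is the lesser of the device's peak compute and the memory traffic limit (operational intensity × memory bandwidth). The sketch below uses illustrative V100-like constants (the advertised 125 TFLOP/s half-precision peak, and an assumed ~900 GB/s HBM2 bandwidth); these numbers are for the example only, not measurements from the entries.

```python
def roofline_attainable_flops(peak_flops, mem_bandwidth_bytes_per_s,
                              operational_intensity_flops_per_byte):
    """Attainable throughput under the roofline model: compute-bound
    above the ridge point, memory-bound below it."""
    return min(peak_flops,
               operational_intensity_flops_per_byte * mem_bandwidth_bytes_per_s)

PEAK = 125e12       # advertised half-precision peak, FLOP/s (assumed device)
BANDWIDTH = 900e9   # DRAM bandwidth, bytes/s (assumed device)
```

With these constants the ridge point sits near 139 FLOPs/byte; a kernel at intensity 10 is memory-bound at about 9 TFLOP/s, far below peak, which is the kind of underutilization Figure 8 shows.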

We observe that all entries severely underutilize the available compute resources – each plotted point achieves a throughput significantly lower than the peak device throughput. All analyzed CIFAR10 models operate in low operational intensity regimes, partially because of the small input size () and associated operations like matrix multiplications and convolutions. The ImageNet models do a better job of utilizing the underlying hardware, but still fall well short of peak device throughput on the GPUs. We suspect that the entries that use the V100s do not fully use the Tensor Cores, which are advertised to deliver 125 Teraflops of half-precision arithmetic markidis2018nvidia . The TPUs are much better utilized, for both the ResNet-50 model and the AmoebaNet model (not pictured in Figure 8).

5 Lessons

Openness and Community Involvement. DAWNBench would not have been successful without valuable contributions from the community. In addition to producing the majority of entries, the community found several bugs in the initial submissions that have been since rectified. We believe this open process magnifies any benchmarking efforts by distributing the work required to produce, review, and validate high quality submissions.

Reproducibility. From reproducing the submissions, we recommend that future benchmarking efforts include checkpoints, predictions, and machine configurations in addition to source code. While source code is necessary to reproduce or understand experiments, it is far from sufficient. The level of information provided with DAWNBench submissions was variable. An even higher reproducibility bar would expand what the community can validate and learn from the entries.

Agile Benchmarking. Given the pace of deep learning research, workloads are rapidly changing. Since DAWNBench launched, issues have been raised about question-answering models jia2017adversarial , and the performance of the best models on SQuAD has increased significantly. Additionally, state-of-the-art image classification models have improved to the point where top-1 accuracy is used as the metric mahajan2018exploring , rising from ~76% for ResNet-50 to ~81% for ResNeXt-101 xie2017aggregated . To stay relevant, benchmarks need to be agile with their choice of tasks and accuracy thresholds.

6 Conclusion

By analyzing the top DAWNBench entries, we evaluated both time-to-accuracy as a metric for measuring deep learning system performance and the many optimizations that led to considerable speedups for ImageNet and CIFAR10 training time. Reproducing and repeating runs of the top entries revealed time-to-accuracy as a stable metric with a low coefficient of variation despite the inherent randomness in training. Even though time-to-accuracy is sensitive to the threshold, the high accuracy target prevented optimizations that hurt final validation convergence and generalization, but provided room for many unique combinations of hardware, software, and statistical optimizations. All the entries, however, still underutilized the available compute resources, leaving opportunities for further improvements. Finally, none of this work would have been possible without the code and command line instructions being openly available. Future benchmarks should continue the trend of open, end-to-end evaluation of software and hardware to enable reproducible, reusable, and robust research.


Acknowledgments

We thank Jeremy Howard, the Google Cloud TPU team (including Sourabh Bajaj, Frank Chen, Brennan Saeta, and Chris Ying), and the many other teams that submitted to DAWNBench. We thank Juan Manuel Camacho, Shoumik Palkar, Kexin Rong, Keshav Santhanam, Sahaana Suri, Pratiksha Thaker, and James Thomas for their assistance in labeling. We also thank Amazon and Google for cloud credits. This research was supported in part by affiliate members and other supporters of the Stanford DAWN project—Google, Intel, Microsoft, Teradata, and VMware—as well as DARPA under No. FA8750-17-2-0095 (D3M), industrial gifts and support from Toyota Research Institute, Juniper Networks, Keysight Technologies, Hitachi, Facebook, Northrop Grumman, NetApp, and the NSF under grants DGE-1656518, DGE-114747, and CNS-1651570.