In recent years, researchers have proposed a range of hardware, software, and statistical optimizations for deep learning, from new software systems abadi2016tensorflow ; chetlur2014cudnn ; chilimbi2014project ; dean2012large ; jia2014caffe and hardware burger2017microsoft ; han2016eie ; jouppi2017datacenter ; pena2017benchmarking to communication methods de2017understanding ; harlap2016addressing ; recht2011hogwild ; zhang2014dimmwitted and training algorithms glorot2011deep ; goyal2017accurate ; iandola2016squeezenet ; kingma2014adam ; sohl2014fast ; sun2017meprop ; sutskever2013importance . Many of these optimizations preserve the statistical efficiency (number of iterations) of the underlying algorithm and model; for example, replacing an NVIDIA K80 GPU with an NVIDIA P100 GPU can give up to a 4× speedup murphy2017deep . However, many optimizations sacrifice some statistical efficiency for hardware efficiency (time needed for each iteration) or other auxiliary metrics. On the hardware side, large minibatch training goyal2017accurate ; jouppi2017datacenter and reduced precision chilimbi2014project ; de2017understanding ; han2015deep can help better saturate hardware units and speed up “proxy” metrics such as time to process an epoch (especially in the distributed setting), but can prevent the model from converging. On the statistical side, optimizations such as Adam kingma2014adam can speed up proxy metrics such as “time-to-training-loss”, but can also prevent the model from converging to a desired validation accuracy. Due to a lack of standard evaluation criteria, it has become increasingly difficult to compare the utility of these advances: prior benchmarks have measured either proxy metrics for training time adolf2016fathom ; bahrampour2015comparative ; chetlur2014cudnn ; baidu2017deepbench ; chintala2017convnet ; google2017benchmarks ; shi2016benchmarking or accuracy (irrespective of time) krizhevsky2012imagenet ; lin2014microsoft ; rajpurkar2016squad , but neither is sufficient coleman2017dawnbench .
To rectify this lack of standard evaluation criteria, DAWNBench and the upcoming MLPerf, two recent deep learning benchmarks, use the end-to-end training time to a specified validation accuracy level (called time-to-accuracy) as their main metric. DAWNBench was the first such benchmark, with the explicit goal of providing “an objective means of normalizing across differences in computation frameworks, hardware, optimization algorithms, hyperparameter settings, and other factors that affect real-world performance” coleman2017dawnbench . During the competition, Google, Intel, fast.ai, and more submitted optimized implementations that achieved two orders of magnitude improvements in training time for ImageNet and an order of magnitude improvement in cost over the initial seed entries. Even though DAWNBench does not explicitly require submissions to include code, the vast majority of submissions did include their implementations and command line instructions, demonstrating a strong interest from the community in reproducibility. Building on DAWNBench’s success, MLPerf mlperf2018 has also adopted time-to-accuracy as its primary metric, while expanding the set of tasks being evaluated.
Despite the improvements seen in DAWNBench’s submissions, time-to-accuracy has not been shown to have a low coefficient of variation, nor have the submissions been tested for generalization; only a single, possibly cherry-picked run was required for entry into the competition. In this paper, we analyze the DAWNBench entries to evaluate the time-to-accuracy metric. By running the top DAWNBench results using the provided open-source code, we were able to analyze the stability and generalizability of time-to-accuracy. Our data suggest the following:
- Time-to-accuracy is stable, i.e., has a low coefficient of variation (~5%) over multiple training runs, suggesting submissions reach the target accuracy in nearly the same time across runs.
- Models optimized for time-to-accuracy generalize well on unseen data and fine-tuning tasks.
- Independent runs that use the same hyperparameters and model architectures learn more similar functions than runs that use different hyperparameters and models.
- Lowering the accuracy threshold allows for optimizations that reach the lower threshold more quickly, but prevent models from converging to higher accuracy thresholds.
- Submissions are generally reproducible by a third party.
Beyond the metric, we analyze DAWNBench’s entries to understand the optimizations that resulted in the speedups over the seed entries. Due to the open nature of DAWNBench, submissions used a variety of hardware, software, model, and training procedure optimizations. For example, the fast.ai team used cyclic learning rates smith2017super and 8 NVIDIA V100 GPUs to train a custom Wide ResNet zagoruyko2016wide on CIFAR10 to 94% accuracy in less than 3 minutes (cost of $1.18). In addition, our experiments show that both hardware and statistical efficiency for deep learning workloads do not scale perfectly, with up to a 2× gap from ideal scaling, and that current software frameworks often severely underutilize modern hardware.
Finally, based on our experience reproducing these results, we suggest that future deep learning benchmarking efforts be open, reproducible, and agile. As software, hardware, and statistical methods are rapidly changing, we recommend that submissions include checkpoints, final predictions, and machine configuration to further improve reproducibility. Our experiences suggest that this information will help third parties validate and compare results more easily, partially addressing the problems with reproducibility and reusability in machine learning joellepinaeu ; baker2016crisis .
2 Benchmark Summary
DAWNBench evaluates the time and cost of popular deep learning training and inference workloads. The initial release included two tasks, image classification (on CIFAR10 and ImageNet) and question answering (on SQuAD), and four metrics: training time to a specified validation accuracy, cost (in USD) of training to that accuracy for submissions that use hardware in the public cloud, average latency of performing inference on a single item (image or question), and average cost of inference for 10,000 items. While prior benchmarks, such as Baidu DeepBench baidu2017deepbench , Fathom adolf2016fathom , and TensorFlow’s Performance Benchmark google2017benchmarks , evaluate proxy metrics such as time to compute a minibatch or other low-level operations, DAWNBench was the first to measure end-to-end performance as the time to a pre-specified (and high) level of accuracy.
Entries were required to submit a description of their submission and the validation accuracy after every epoch. While source code was optional, every submission for image classification included a link to all code needed to reproduce runs, assuming access to the appropriate hardware. However, some of the submissions for question answering on SQuAD did not include code; because of this and the general lack of submissions, we focus exclusively on CIFAR10 and ImageNet submissions in this paper.
2.2 Summary of Entries
The entries spanned a range of hardware platforms, software platforms, and models. We summarize the entries in Table 1.
ImageNet. DAWNBench used a top-5 accuracy target of 93% for ImageNet. The best seed entry was a ResNet-152 he2016deep model trained on 8 NVIDIA K80 GPUs in 10 days 10 hours and 42 minutes for $1112.64. The fastest entry overall trained ResNet-50 on half of a TPUv2 pod in 30 minutes, two orders-of-magnitude faster. The fastest entry on publicly available hardware was a ResNet-50, which trained in about 3 hours. The cheapest entry was an AmoebaNet real2018regularized model that trained for $49.30.
CIFAR10. DAWNBench used an accuracy target of 94% for CIFAR10. The best seed entry was a ResNet-164 he2016deep model trained in 2.5 hours on an NVIDIA P100. The fastest entry trained a custom Wide ResNet zagoruyko2016wide architecture in less than 3 minutes on 8 NVIDIA V100 GPUs, a 52× improvement. The cheapest entry trained the same custom Wide ResNet for $0.26 on 1 NVIDIA V100 GPU.
These large improvements for both ImageNet and CIFAR10 come from a range of optimizations, including mixed-precision training micikevicius2017mixed , large minibatch training goyal2017accurate , progressive resizing of images lim2017enhanced ; karras2017progressive , novel model architectures real2018regularized , and faster hardware murphy2017deep ; jouppi2017datacenter ; we investigate this further in Section 4.
3 Metric Evaluation
|Entry name||Dataset||Coefficient of variation||Fraction of runs reaching target accuracy|
|Wide ResNet-34, 8xV100||CIFAR10||3.8%||50%|
|Wide ResNet-34, 1xV100||CIFAR10||2.9%||70%|
|ResNet-50, 1/2 TPU Pod||ImageNet||2.5%||100%|
In this section, we evaluate the time-to-accuracy metric along three axes, using publicly available code from DAWNBench’s top submissions. First, we demonstrate that time-to-accuracy is stable with a low coefficient of variation over several runs with fixed hyperparameters but different random seeds. Second, we show that setting a lower threshold allows for optimizations that prevent networks from converging, making the choice of a threshold near state-of-the-art critical. Third, we provide evidence that networks optimized for time-to-accuracy generalize as well as regular pre-trained networks.
3.1 Variance and stability of metric
Variability of time-to-accuracy. We measured the stability of time-to-accuracy by running the top single server blade training time entries for CIFAR10 and ImageNet several times each. As shown in Table 2, the coefficient of variation (the ratio of the standard deviation to the mean) of time-to-accuracy is at most 5.3% for all entries, indicating that the metric is stable with respect to the randomness inherent in training deep networks. However, several entries failed to consistently achieve the given accuracy threshold. In particular, the cyclic learning rates used by the top two CIFAR10 entries appear to make validation convergence less robust smith2017super . Additionally, cyclic learning rates prevent validation convergence on ImageNet and were thus not included in any top submissions howard2018training .
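The two quantities reported in Table 2 can be computed from a list of per-run results. The following is a minimal sketch, not the analysis code used for the paper: the run times are hypothetical, `None` is our own convention for a run that never reached the target, and the use of the sample (rather than population) standard deviation is an assumption.

```python
import statistics


def coefficient_of_variation(times):
    """Ratio of the sample standard deviation to the mean."""
    return statistics.stdev(times) / statistics.mean(times)


def fraction_reaching_target(times):
    """Fraction of runs that reached the target; None marks a failed run."""
    return sum(t is not None for t in times) / len(times)


# Hypothetical time-to-accuracy results in seconds (not measured data).
runs = [402.0, 395.5, None, 410.2, 398.7]
successful = [t for t in runs if t is not None]

print(f"coefficient of variation: {coefficient_of_variation(successful):.1%}")
print(f"fraction of runs reaching target: {fraction_reaching_target(runs):.0%}")
```

Computing the coefficient of variation only over successful runs keeps the two columns independent: one describes how tightly the successful runs cluster, the other how often a run succeeds at all.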
Figure 1 shows the validation convergence curves for all 10 trials of the top single server blade entries for CIFAR10 and ImageNet, which were tuned specifically for time to 94% top-1 accuracy and time to 93% top-5 accuracy, respectively. Both reach the accuracy targets later in training when the model converges. Earlier in training, validation accuracy is less stable as both entries start training with large learning rates.
Variability of models. To further understand how models optimized for time-to-accuracy behave, we compared the top-1 predictions of models over several trials with different random seeds using Jaccard similarity. As shown in Figure 2, the Jaccard similarity within multiple trials of a single entry is higher than between different entries, indicating that there is higher variance between different entries than from the randomness present in training.
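As an illustration of this comparison, the sketch below computes a Jaccard similarity between the top-1 predictions of two runs. It assumes one plausible formulation, in which each run is represented as the set of (example index, predicted label) pairs; the exact set construction used for Figure 2 is not specified here, and the prediction lists are hypothetical.

```python
def jaccard_similarity(preds_a, preds_b):
    """Jaccard similarity |A & B| / |A | B| of two runs' top-1 predictions.

    Each run is treated as a set of (example_index, predicted_label) pairs,
    which is an assumption of this sketch.
    """
    if len(preds_a) != len(preds_b):
        raise ValueError("both runs must predict on the same examples")
    set_a = set(enumerate(preds_a))
    set_b = set(enumerate(preds_b))
    union = set_a | set_b
    if not union:
        return 1.0  # two empty prediction lists are trivially identical
    return len(set_a & set_b) / len(union)


# Hypothetical top-1 predictions from two runs on five validation examples.
run_1 = [3, 7, 7, 1, 0]
run_2 = [3, 7, 2, 1, 0]
print(jaccard_similarity(run_1, run_2))
```

Under this formulation, two runs that agree on k of n examples have similarity k / (2n - k), so the score falls faster than the raw agreement rate as the runs diverge.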
3.2 Sensitivity to Threshold
The choice of accuracy threshold is critical to the final ranking of submissions across model architecture, framework, and hardware. Using the official validation accuracy data included in the top DAWNBench entries along with data collected from two other “synthetic” runs we ran, we show the time-to-accuracy for a number of different target accuracies in Figure 3. We demonstrate that aggressive learning rate schedules can reach 91% and 92% up to 2× faster, but fail to converge to 93%. This suggests that while rankings are stable if the models converge to a high validation accuracy, they are not if the accuracy threshold is lowered. It is therefore critical to pick a validation accuracy threshold close to state-of-the-art for time-to-accuracy to be a useful metric.
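To make the threshold sensitivity concrete, the sketch below computes time-to-accuracy from a per-epoch validation log at several thresholds. The log format (elapsed seconds, validation accuracy) and both example runs are hypothetical and chosen only to mimic the qualitative behavior described above; they are not DAWNBench data.

```python
def time_to_accuracy(epoch_log, threshold):
    """Elapsed time at the first epoch whose validation accuracy meets the threshold.

    epoch_log is a list of (elapsed_seconds, validation_accuracy) pairs in
    training order. Returns None if the threshold is never reached.
    """
    for elapsed, accuracy in epoch_log:
        if accuracy >= threshold:
            return elapsed
    return None


# Hypothetical logs: an aggressive schedule that plateaus early, and a
# conservative one that is slower but converges to a higher accuracy.
aggressive = [(600, 0.880), (1200, 0.915), (1800, 0.924), (2400, 0.926)]
conservative = [(900, 0.850), (1800, 0.905), (2700, 0.921), (3600, 0.932)]

for threshold in (0.91, 0.92, 0.93):
    print(threshold,
          time_to_accuracy(aggressive, threshold),
          time_to_accuracy(conservative, threshold))
```

In this toy data the aggressive run wins at the two lower thresholds and never finishes at the highest one, so the ranking of the two runs depends entirely on where the threshold is set.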
3.3 Generalization of Optimized Models
Evaluation on new data. We scraped and labeled a set of 2,864 images from Flickr. The images were scraped based on the WordNet keywords associated with each class in the ImageNet dataset. The top five images based on relevance were shown to a human labeler and labeled correct or incorrect. To ensure no overlap with ImageNet, only images posted after January 1st, 2014 were used. The images spanned 886 (out of 1000) classes. While these images are not entirely representative of ImageNet, we believe they reflect a reasonable distribution.
|ResNet-50 (p)||98.7 ± 0.1%||98.5 ± 0.5%||37.6 ± 0.7%|
|ResNet-18 (p)||98.5 ± 0.1%||97.7 ± 0.8%||97.6 ± 0.4%|
|ResNet-152 (p)||99.0 ± 0.1%||97.6 ± 0.4%||41.2 ± 0.4%|
For both the pre-trained ResNet-50 weights provided by PyTorch and DAWNBench entries optimized for time-to-accuracy, we computed the top-5 accuracy on the images from Flickr. The results are summarized in Table 3. As shown, the models optimized for time-to-accuracy achieve nearly the same accuracy or higher than the pre-trained ResNet-50, indicating that optimizing for time-to-accuracy does not sacrifice generalization performance.
Fine-tuning on new tasks. We additionally fine-tuned a model optimized for time-to-accuracy and a pre-trained ResNet-50 provided by PyTorch on three image classification tasks: Birds-200 wah2011caltech , Kaggle cat/dog elson2007asirra (limited to 1000 training and 1000 validation examples), and Leeds butterfly wang2009learning .
We only trained the fully connected layer (i.e., not the convolutional layers) to test the representational capacity of the convolutional layers (which can be thought of as a featurizer). We used SGD with a momentum of 0.9, with 10 epochs of a learning rate of 0.001 followed by 10 epochs of a learning rate of 0.0001. We ran this procedure 10 times for each task and network.
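The training schedule described above can be written down compactly. The sketch below is illustrative only: it encodes the two-stage learning rate schedule and a single SGD-with-momentum update in plain Python, using the update convention found in PyTorch's SGD (velocity = momentum * velocity + gradient), which is an assumption about the implementation. The function names and the zero-based epoch numbering are ours.

```python
MOMENTUM = 0.9
TOTAL_EPOCHS = 20


def learning_rate(epoch):
    """0.001 for the first 10 epochs (0-9), then 0.0001 for the next 10 (10-19)."""
    if not 0 <= epoch < TOTAL_EPOCHS:
        raise ValueError("the schedule covers epochs 0 through 19")
    return 0.001 if epoch < 10 else 0.0001


def sgd_momentum_step(weights, grads, velocity, lr, momentum=MOMENTUM):
    """One SGD-with-momentum update; returns new (weights, velocity) lists."""
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Toy usage on a two-parameter "fully connected layer" with made-up gradients.
weights, velocity = [0.5, -0.2], [0.0, 0.0]
weights, velocity = sgd_momentum_step(weights, [0.1, -0.3], velocity, learning_rate(0))
print(weights, velocity)
```

Only the parameters passed to the update step are modified, which mirrors the setup in which the convolutional layers stay frozen and act as a fixed featurizer.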
The results are summarized in Table 3. As shown, the performance of the optimized model is near the performance of the pre-trained model for all the tasks. The optimized model performs 1.87% worse on Birds-200, which is the hardest task, but nearly the same on Kaggle cat/dog and Leeds butterfly.
4 Entry Analysis
In this section, we analyze the top DAWNBench entries for image classification to better understand the impact of the various optimizations used. First, we report general trends in the submissions, showing that hardware, software, and statistical optimizations are necessary for high performance. Then, we perform a factor analysis and lesion study on a CIFAR10 entry to demonstrate the range of optimizations used to speed up deep learning workloads. Finally, we analyze how well submissions scale to multiple servers, and use a roofline analysis williams2009roofline to illustrate the fact that many entries severely underutilize the available hardware.
4.1 Overall Trends
To better understand the speedups seen in the DAWNBench entries, we plot the time to achieve the required accuracy for CIFAR10 and ImageNet in Figure 4. While many optimizations improved the time per epoch, several also reduced the number of epochs needed for convergence. We discuss some optimizations in more detail below.
Hardware Optimizations. A large number of the DAWNBench entries took advantage of modern hardware for improved support of reduced precision and low-level operations commonly found in deep learning. For example, every ImageNet entry used mixed-precision training micikevicius2017mixed . Distributed training was also common because of advances in large minibatch training goyal2017accurate , where learning rate warm-ups and scaled learning rates enable large minibatch sizes to converge to similar accuracies as small minibatch sizes, but with a faster time-per-epoch. Both ResNet-50 and a learned architecture from Google, AmoebaNet, were able to use minibatch sizes an order of magnitude larger than the original work on ResNets he2016deep to leverage multiple TPUs. However, ResNet-50 was able to use a maximum minibatch size twice as large as AmoebaNet and consequently more TPUs while still converging to the target accuracy threshold, indicating that learned architectures could be more sensitive to the values of hyperparameters like the minibatch size.
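The two ingredients of large minibatch training mentioned above, scaled learning rates and learning rate warm-up, can be sketched as follows. This is a minimal illustration in the spirit of goyal2017accurate , not the code of any DAWNBench entry: the linear form of both the scaling and the warm-up ramp is an assumption of the sketch, and the reference values (0.1 at a minibatch size of 256) are used only as an example.

```python
def scaled_learning_rate(base_lr, base_batch_size, batch_size):
    """Scale the learning rate linearly with the minibatch size."""
    return base_lr * batch_size / base_batch_size


def warmup_learning_rate(step, warmup_steps, start_lr, target_lr):
    """Ramp linearly from start_lr to target_lr over warmup_steps, then hold."""
    if step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps


# Example: growing the minibatch from 256 to 8192 scales the rate by 32x.
target = scaled_learning_rate(base_lr=0.1, base_batch_size=256, batch_size=8192)
for step in (0, 250, 500, 1000):
    print(step, warmup_learning_rate(step, warmup_steps=500, start_lr=0.1, target_lr=target))
```

Starting from the small-minibatch rate and ramping up avoids applying the large scaled rate to the poorly conditioned weights at the very start of training.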
Software Optimizations. Properly configuring and using hardware resources made a significant difference to performance. Simply changing the TensorFlow version from 1.7 to 1.8 gives a 1.4× decrease in training time for the same network architecture. In addition, most entries use optimized communication libraries (NCCLv2) and efficient image loading libraries (libjpeg-turbo, Pillow-SIMD) to give significant performance improvements.
Statistical Optimizations. Optimizations like progressive image resizing lim2017enhanced ; karras2017progressive and cyclical learning rates smith2017super also dramatically helped improve performance by reducing the total number of epochs needed to reach the accuracy threshold.
Trade-offs between Time and Cost. Most of the DAWNBench entries were run on the public cloud, and it was easy to compute the cost of training a model to the desired accuracy level. Figure 5 shows the time versus cost for the different entries. We observe different pareto frontiers between cost and time for the different device types, e.g., CPUs are expensive compared to TPUs and GPUs. TPU Pods are not publicly available, but we extrapolated their cost in the cloud by multiplying the price of a single TPU by the number of TPUs used.
Some entries are objectively better than others, e.g., the right-most TPU entry is more expensive and takes longer than the TPU entry just to the left of it. These are by-products of optimizations (e.g., improvements in TensorFlow and a different model architecture that converges to the target accuracy in fewer epochs) applied while keeping the hardware configuration the same, resulting in entries that achieve the accuracy target faster at the same cost-per-unit-time.
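The cost extrapolation and the notion of one entry being objectively better than another can both be expressed in a few lines. The sketch below is illustrative: the price, device counts, and entries are hypothetical, and an entry is considered dominated when another entry is no worse in both time and cost and strictly better in at least one.

```python
def training_cost(price_per_device_hour, num_devices, hours):
    """Extrapolated cost: single-device hourly price x device count x hours."""
    return price_per_device_hour * num_devices * hours


def pareto_frontier(entries):
    """Names of entries not dominated in (hours, cost) by any other entry.

    entries is a list of (name, hours, cost) tuples.
    """
    frontier = []
    for name, hours, cost in entries:
        dominated = any(
            h <= hours and c <= cost and (h < hours or c < cost)
            for _, h, c in entries
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical entries: (name, training hours, cost in USD).
entries = [
    ("A", 1.0, 100.0),
    ("B", 2.0, 50.0),
    ("C", 2.5, 60.0),  # slower and more expensive than B
    ("D", 0.5, training_cost(price_per_device_hour=6.5, num_devices=64, hours=0.5)),
]
print(pareto_frontier(entries))
```

Entries off the frontier, like C here, correspond to the submissions that were later superseded by faster runs on the same hardware configuration.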
4.2 Factor Analysis
In this section, we perform a factor analysis on the optimizations for the winning entry for CIFAR10 on publicly available hardware. Using a custom Wide ResNet zagoruyko2016wide with mixed-precision training micikevicius2017mixed , large minibatch training dean2012large , and cyclic learning rates smith2017super , the model trained to 94% top-1 accuracy on CIFAR10 with an NVIDIA V100 GPU in 6 minutes and 45 seconds for $0.26. Applying all these optimizations in conjunction gives a 3.14× speedup in time-to-94% accuracy as shown in Figure 6. In addition, turning off each of these optimizations individually gives as much as a 2.91× slowdown.
4.3 Scaling and Roofline Analysis
Scaling Analysis. Figure 6(a) shows the speedup relative to one worker for two different accuracy thresholds, as well as the average time to compute a single epoch, for different numbers of workers. We see that the ‘time-per-epoch’ does not scale linearly with the number of workers for the AmoebaNet model in the TPU pod. The ‘time-to-accuracy’ scales even worse, since a greater number of epochs are needed to converge to the same accuracy target. This is in line with past work that says small minibatch sizes usually lead to more robust training masters2018revisiting .
Figure 6(b) shows the speedups when scaling the training of a ResNet-50 model to 1, 2, 4, and 8 GPUs in an NVIDIA DGX-1. We observe that both time-per-epoch and time-to-accuracy for the ResNet-50 model scale much better on the DGX-1, partially because of less communication overhead.
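The distinction between time-per-epoch scaling and time-to-accuracy scaling can be illustrated with a small calculation. In the sketch below, time-to-accuracy is modeled simply as epochs needed times time per epoch; the worker counts, epoch times, and epoch counts are hypothetical and are not measurements from the entries.

```python
def speedup(baseline_time, time):
    """Speedup of a configuration relative to the single-worker baseline."""
    return baseline_time / time


def scaling_efficiency(baseline_time, time, num_workers):
    """Fraction of ideal (linear) scaling achieved; 1.0 means perfect scaling."""
    return speedup(baseline_time, time) / num_workers


# Hypothetical measurements: (workers, seconds per epoch, epochs to reach target).
one_worker = (1, 100.0, 30)
eight_workers = (8, 16.0, 40)  # faster epochs, but more of them are needed

epoch_speedup = speedup(one_worker[1], eight_workers[1])
tta_speedup = speedup(one_worker[1] * one_worker[2],
                      eight_workers[1] * eight_workers[2])
print("time-per-epoch speedup:", epoch_speedup)
print("time-to-accuracy speedup:", tta_speedup)
print("time-to-accuracy efficiency:",
      scaling_efficiency(one_worker[1] * one_worker[2],
                         eight_workers[1] * eight_workers[2],
                         eight_workers[0]))
```

Because the larger configuration needs more epochs, the time-to-accuracy speedup is lower than the time-per-epoch speedup, which is the gap between hardware and statistical efficiency discussed above.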
Roofline Analysis. To further study the hardware performance of various entries, we use the roofline model williams2009roofline , which can highlight causes of performance bottlenecks. The roofline model plots computational throughput (in floating point operations per second) against the operational intensity of the application (number of floating-point operations performed per DRAM byte accessed). Applications with high operational intensity are “compute-bound” (the flat line in Figure 8) and bottlenecked on the device’s computation units, while applications with low operational intensity are “memory-bound” (the slanting line in Figure 8) and bottlenecked on memory accesses. Each point in Figure 8 represents a DAWNBench entry. For entries which use progressive image resizing lim2017enhanced ; karras2017progressive , we show different points for each image size used. Operational intensities and throughputs are approximated by instrumenting model code, and using off-the-shelf command-line utilities like nvprof.
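The roofline bound itself is a one-line formula: attainable throughput is the minimum of the peak compute throughput and the product of operational intensity and memory bandwidth. The sketch below shows this under illustrative device parameters (a peak of 100 TFLOPS and 1 TB/s of memory bandwidth) that are round numbers chosen for the example, not the specifications of any device in Figure 8.

```python
def attainable_throughput(operational_intensity, peak_flops, memory_bandwidth):
    """Roofline upper bound on throughput in FLOP/s.

    operational_intensity: FLOPs performed per DRAM byte accessed.
    peak_flops: peak device compute throughput in FLOP/s.
    memory_bandwidth: peak DRAM bandwidth in bytes/s.
    """
    return min(peak_flops, operational_intensity * memory_bandwidth)


def ridge_point(peak_flops, memory_bandwidth):
    """Operational intensity at which the slanted and flat lines meet."""
    return peak_flops / memory_bandwidth


# Illustrative device (hypothetical round numbers).
PEAK_FLOPS = 100e12
BANDWIDTH = 1e12

print("ridge point (FLOPs/byte):", ridge_point(PEAK_FLOPS, BANDWIDTH))
for intensity in (10, 100, 500):
    bound = attainable_throughput(intensity, PEAK_FLOPS, BANDWIDTH)
    regime = "memory-bound" if intensity < ridge_point(PEAK_FLOPS, BANDWIDTH) else "compute-bound"
    print(intensity, bound, regime)
```

An entry's utilization can then be read as its measured throughput divided by this bound at its operational intensity, so a point far below the roofline indicates headroom that is not explained by memory traffic alone.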
We observe that all entries severely underutilize the available compute resources: each plotted point achieves a throughput significantly lower than the peak device throughput. All analyzed CIFAR10 models operate in low operational intensity regimes, partially because of the small input size and associated operations like matrix multiplications and convolutions. The ImageNet models do a better job of utilizing the underlying hardware, but still are as much as a factor of away from peak device throughput for the GPUs. We suspect that the entries that use the V100s do not fully use the Tensor Cores, which are advertised to deliver 125 Teraflops of half-precision arithmetic markidis2018nvidia . The TPUs are much better utilized, with a utilization of for the ResNet-50 model, and for the AmoebaNet model (not pictured in Figure 8).
Openness and Community Involvement. DAWNBench would not have been successful without valuable contributions from the community. In addition to producing the majority of entries, the community found several bugs in the initial submissions that have been since rectified. We believe this open process magnifies any benchmarking efforts by distributing the work required to produce, review, and validate high quality submissions.
Reproducibility. From reproducing the submissions, we recommend that future benchmarking efforts include checkpoints, predictions, and machine configurations in addition to source code. While source code is necessary for efforts to reproduce or understand experiments, it is far from sufficient. The level of information provided with DAWNBench submissions was variable. An even higher reproducibility bar would accelerate what the community can validate and learn from the entries.
Agile Benchmarking. Given the pace of deep learning research, workloads are rapidly changing. Since DAWNBench launched, issues have been raised about question-answering models jia2017adversarial , and the best performing models in SQuAD have significantly increased in performance. Additionally, state-of-the-art image classification models have increased in performance to the point where top-1 accuracy is used as the metric mahajan2018exploring and has improved from ~76% for ResNet-50 to ~81% for ResNeXt-101 xie2017aggregated . To stay relevant, benchmarks need to be agile with their choice of benchmark and accuracy thresholds.
By analyzing the top DAWNBench entries, we evaluated both time-to-accuracy as a metric for measuring deep learning system performance and the many optimizations that led to considerable speedups for ImageNet and CIFAR10 training time. Reproducing and repeating runs of the top entries revealed time-to-accuracy as a stable metric with a low coefficient of variation despite the inherent randomness in training. Even though time-to-accuracy is sensitive to the threshold, the high accuracy target prevented optimizations that hurt final validation convergence and generalization, but provided room for many unique combinations of hardware, software, and statistical optimizations. All the entries, however, still underutilized the available compute resources, leaving opportunities for further improvements. Finally, none of this work would have been possible without the code and command line instructions being openly available. Future benchmarks should continue the trend of open, end-to-end evaluation of software and hardware to enable reproducible, reusable, and robust research.
We thank Jeremy Howard, the Google Cloud TPU team (including Sourabh Bajaj, Frank Chen, Brennan Saeta, and Chris Ying), and the many other teams that submitted to DAWNBench. We thank Juan Manuel Camacho, Shoumik Palkar, Kexin Rong, Keshav Santhanam, Sahaana Suri, Pratiksha Thaker, and James Thomas for their assistance in labeling. We also thank Amazon and Google for cloud credits. This research was supported in part by affiliate members and other supporters of the Stanford DAWN project—Google, Intel, Microsoft, Teradata, and VMware—as well as DARPA under No. FA8750-17-2-0095 (D3M), industrial gifts and support from Toyota Research Institute, Juniper Networks, Keysight Technologies, Hitachi, Facebook, Northrop Grumman, NetApp, and the NSF under grants DGE-1656518, DGE-114747, and CNS-1651570.
-  MLPerf. https://mlperf.org/, 2018.
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A System for Large-Scale Machine Learning. In OSDI, volume 16, pages 265–283, 2016.
-  R. Adolf, S. Rama, B. Reagen, G.-Y. Wei, and D. Brooks. Fathom: Reference Workloads for Modern Deep Learning Methods. In Workload Characterization (IISWC), 2016 IEEE International Symposium on, pages 1–10. IEEE, 2016.
-  S. Bahrampour, N. Ramakrishnan, L. Schott, and M. Shah. Comparative Study of Deep Learning Software Frameworks. arXiv preprint arXiv:1511.06435, 2015.
-  Baidu. DeepBench: Benchmarking Deep Learning Operations on Different Hardware. https://github.com/baidu-research/DeepBench, 2017.
-  M. Baker. 1,500 Scientists Lift the Lid on Reproducibility. https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970, May 2016.
-  D. Burger. Microsoft unveils Project Brainwave for Real-time AI. Microsoft Research, Microsoft, 22, 2017.
-  S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. cuDNN: Efficient Primitives for Deep Learning. arXiv preprint arXiv:1410.0759, 2014.
-  T. M. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. Project Adam: Building an Efficient and Scalable Deep Learning Training System. In OSDI, volume 14, pages 571–582, 2014.
-  S. Chintala. Convnet-Benchmarks: Easy Benchmarking of All Publicly Accessible Implementations of Convnets. https://github.com/soumith/convnet-benchmarks, Sept. 2017.
-  C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Ré, and M. Zaharia. DAWNBench: An End-to-End Deep Learning Benchmark and Competition. NIPS ML Systems Workshop, 2017.
-  C. De Sa, M. Feldman, C. Ré, and K. Olukotun. Understanding and Optimizing Asynchronous Low-precision Stochastic Gradient Descent. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 561–574. ACM, 2017.
-  J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large Scale Distributed Deep Networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.
-  J. Elson, J. J. Douceur, J. Howell, and J. Saul. Asirra: A CAPTCHA that Exploits Interest-aligned Manual Image Categorization. 2007.
-  X. Glorot, A. Bordes, and Y. Bengio. Deep Sparse Rectifier Neural Networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.
-  Google. TensorFlow Benchmarks. https://www.tensorflow.org/performance/benchmarks, 2017.
-  P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, Large Minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
-  S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: Efficient Inference Engine on Compressed Deep Neural Networks. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, pages 243–254. IEEE, 2016.
-  S. Han, H. Mao, and W. J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv preprint arXiv:1510.00149, 2015.
-  A. Harlap, H. Cui, W. Dai, J. Wei, G. R. Ganger, P. B. Gibbons, G. A. Gibson, and E. P. Xing. Addressing the Straggler Problem for Iterative Convergent Parallel ML. In Proceedings of the Seventh ACM Symposium on Cloud Computing, pages 98–111. ACM, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
-  J. Howard. Training ImageNet in 3 hours for $25; and CIFAR10 for $0.26. http://www.fast.ai/2018/04/30/dawnbench-fastai/, 2018.
-  F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level Accuracy with 50x Fewer Parameters and <0.5MB Model Size. arXiv preprint arXiv:1602.07360, 2016.
-  R. Jia and P. Liang. Adversarial Examples for Evaluating Reading Comprehension Systems. arXiv preprint arXiv:1707.07328, 2017.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
-  N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al. In-datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 1–12. ACM, 2017.
-  T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive Growing of GANs for Improved Quality, Stability, and Variation. arXiv preprint arXiv:1710.10196, 2017.
-  D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. ICLR, 2015.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
-  B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee. Enhanced Deep Residual Networks for Single Image Super-Resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, volume 1, page 3, 2017.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
-  D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten. Exploring the limits of weakly supervised pretraining. arXiv preprint arXiv:1805.00932, 2018.
-  S. Markidis, S. W. Der Chien, E. Laure, I. B. Peng, and J. S. Vetter. Nvidia tensor core programmability, performance & precision. arXiv preprint arXiv:1803.04014, 2018.
-  D. Masters and C. Luschi. Revisiting Small Batch Training for Deep Neural Networks. arXiv preprint arXiv:1804.07612, 2018.
-  P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaev, G. Venkatesh, et al. Mixed Precision Training. arXiv preprint arXiv:1710.03740, 2017.
-  J. Murphy. Deep Learning Benchmarks of NVIDIA Tesla P100 PCIe, Tesla K80, and Tesla M40 GPUs, 2017.
-  F. Niu, B. Recht, C. Re, and S. Wright. Hogwild: A Lock-free Approach to Parallelizing Stochastic Gradient Descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
-  D. Pena, A. Forembski, X. Xu, and D. Moloney. Benchmarking of CNNs for Low-Cost, Low-Power Robotics Applications. 2017.
-  J. Pineau. Reproducibility, Reusability, and Robustness in Deep Reinforcement Learning. https://www.youtube.com/watch?v=Vh4H0gOwdIg, 2018. ICLR.
-  P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv preprint arXiv:1606.05250, 2016.
-  E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized Evolution for Image Classifier Architecture Search. arXiv preprint arXiv:1802.01548, 2018.
-  S. Shi, Q. Wang, P. Xu, and X. Chu. Benchmarking State-of-the-Art Deep Learning Software Tools. In Cloud Computing and Big Data (CCBD), 2016 7th International Conference on, pages 99–104. IEEE, 2016.
-  L. N. Smith and N. Topin. Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates. arXiv preprint arXiv:1708.07120, 2017.
-  J. Sohl-Dickstein, B. Poole, and S. Ganguli. Fast Large-scale Optimization by Unifying Stochastic Gradient and Quasi-Newton Methods. In International Conference on Machine Learning, pages 604–612, 2014.
-  X. Sun, X. Ren, S. Ma, and H. Wang. meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting. arXiv preprint arXiv:1706.06197, 2017.
-  I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the Importance of Initialization and Momentum in Deep Learning. In International Conference on Machine Learning, pages 1139–1147, 2013.
-  C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The CalTech-UCSD Birds-200-2011 Dataset. 2011.
-  J. Wang, K. Markert, and M. Everingham. Learning Models for Object Recognition from Natural Language Descriptions. 2009.
-  S. Williams, A. Waterman, and D. Patterson. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM, 52(4):65–76, 2009.
-  S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
-  S. Zagoruyko and N. Komodakis. Wide Residual Networks. arXiv preprint arXiv:1605.07146, 2016.
-  C. Zhang and C. Ré. Dimmwitted: A Study of Main-memory Statistical Analytics. Proceedings of the VLDB Endowment, 7(12):1283–1294, 2014.