MLPerf Training Benchmark

10/02/2019 · Peter Mattson, et al. · Google

Machine learning is experiencing an explosion of software and hardware solutions, and needs industry-standard performance benchmarks to drive design and enable competitive evaluation. However, machine learning training presents a number of unique challenges to benchmarking that do not exist in other domains: (1) some optimizations that improve training throughput actually increase time to solution, (2) training is stochastic and time to solution has high variance, and (3) the software and hardware systems are so diverse that they cannot be fairly benchmarked with the same binary, code, or even hyperparameters. We present MLPerf, a machine learning benchmark that overcomes these challenges. We quantitatively evaluate the efficacy of MLPerf in driving community progress on performance and scalability across two rounds of results from multiple vendors.


1 Introduction

Machine learning (ML) has revolutionized a diverse set of application domains, including computer vision Krizhevsky et al. (2012), language processing Devlin et al. (2018); Radford et al. (2019), speech recognition Hinton et al. (2012), and game playing Silver et al. (2018); Mnih et al. (2013); Chan (2018). Much of this progress is driven by deep learning (DL) techniques that train neural networks on large datasets to perform various tasks. To keep pace with growing computational demand Amodei & Hernandez (2018), significant investments in software and hardware systems have been made to accelerate ML Metz (2018).

As the number of hardware and software system options for DL training increases Paszke et al. (2017); Abadi et al. (2016); Chen et al. (2015); Jia et al. (2014); Jouppi et al. (2017); Chen et al. (2018); Markidis et al. (2018); Intel (2019), so does the need for a comprehensive benchmark. History shows that benchmarks accelerate progress Hennessy & Patterson (2011). For example, microprocessor and relational database system breakthroughs in the 1980s inspired the Standard Performance Evaluation Corporation (SPEC) for Unix servers Dixit (1991) and the Transaction Processing Performance Council (TPC) for transaction processing and databases Council (2005) to develop and maintain benchmarks that were embraced by their communities.

Inspired by the successes of SPEC and TPC, MLPerf, a consortium of commercial and academic organizations, formed to design a comprehensive benchmark suite that addresses the specific challenges of DL. Unlike other computational workloads, DL is inherently stochastic and allows a range of statistical, hardware, and software optimizations that can change the mathematical semantics of the underlying operators. While these optimizations can improve performance (i.e., the speed of training), some change the learning dynamics and impact the final model quality (i.e., the accuracy of the model). Even accommodating system scale requires changing hyperparameters, which can affect the amount of computation needed to reach a particular quality target. This is in contrast to other compute benchmarks, which can evaluate systems with targeted microbenchmarks. DL is also intrinsically approximate and stochastic, allowing multiple different but equally valid solutions, unlike conventional computing, which tends to require a single correct answer. As a result, implementations can differ and training times can vary while the final quality remains the same. The approximate nature of DL therefore requires carefully defining a class of equivalently valid solutions and the appropriate degrees of freedom.

Prior work has spanned varying levels of granularity but has either not addressed the above challenges or lacked critical workloads for modern ML. Microbenchmarks such as DeepBench Baidu (2017) were affordable to run and enabled a fair comparison of competing systems by isolating hardware and software from statistical optimizations, but they did not reflect the complexity of real workloads and thus had limited utility. Throughput benchmarks Google (2017); Adolf et al. (2016); Zhu et al. (2018) evaluated full model architectures across a broad range of tasks to better reflect the diversity and complexity of real workloads, but they limited innovations in model architecture and training that improve the state-of-the-art of ML. DAWNBench Coleman et al. (2017) measured end-to-end training time subject to a quality threshold (i.e., time-to-train) to allow for innovative solutions (e.g., novel model architectures and training techniques such as progressive resizing and cyclic learning rates) and collected source code to promote reproducibility. However, DAWNBench's flexibility also made it challenging to draw fair comparisons between different hardware and software platforms.

MLPerf builds on the strengths of prior work to create a representative benchmark suite for the fair evaluation of system performance guided by five high-level goals:

  • Enable fair comparison of competing systems yet encourage innovation to improve the state-of-the-art of ML.

  • Accelerate progress in ML via fair and useful measurement.

  • Enforce replicability to ensure reliable results.

  • Serve both the commercial and research communities.

  • Keep benchmarking effort affordable so all can participate.

In this paper, we focus on the design and rationale for the MLPerf training benchmark (a related MLPerf inference benchmark is outside the scope). While prior ML benchmarking efforts Coleman et al. (2017); Adolf et al. (2016); Google (2017); Baidu (2017) each contributed uniquely to one or more of the five goals outlined above, MLPerf was created to address all of the goals, holistically, building on the lessons learned from prior work. To achieve these goals, MLPerf training:

  • Established a comprehensive suite of benchmarks that covers a diverse set of application domains, neural network models, and optimizers.

  • Created reference implementations of each benchmark, to precisely define models and training procedures.

  • Established rules to ensure that submissions are equivalent to the references and to constrain hyperparameter choices.

  • Established timing rules to minimize the effects of stochasticity when comparing results.

  • Open sourced submission code, so that results can be replicated and studied by the ML and systems communities.

  • Set up standing working groups, tasked with ensuring the benchmark suite stays up to date.

The rest of the paper is organized as follows. Section 2 discusses the main challenges of benchmarking DL training, as well as related prior work. Section 3 reviews the benchmarks in the suite, the time-to-train metric and how it is measured, and the choice of quality thresholds. The submission, review, and result-reporting processes and categories are described in Section 4. Progress between the first two rounds of MLPerf benchmarking and directions for future work are presented in Sections 5 and 6.

2 Background

In this section, we describe the mechanisms of training machine learning models in Section 2.1, detail the unique challenges of benchmarking machine learning compared to other compute benchmarks Dongarra (1988); Council (2005) in Section 2.2, and review prior efforts in machine learning benchmarking in Section 2.3.

2.1 Machine Learning Training

The goal of training in machine learning is to create a model that generalizes well to unseen data according to a given quality metric (e.g., accuracy). Recently, deep neural networks (DNNs) have become a dominant form of machine learning across a wide range of tasks (e.g., image classification, object detection, and machine translation). These models learn complex internal representations by processing the input data through many hidden layers, which are made up of weights and non-linear activation functions (e.g., ReLU Nair & Hinton (2010)). Through stochastic gradient descent on large labeled datasets, a randomly initialized neural network learns to match its predictions with the provided ground-truth labels. Training datasets range from thousands to millions of samples, depending on the task. A separate labeled dataset held out from training is used to quantify model generalization.

Training DNNs is an iterative process. Each iteration processes a minibatch (a small subset of the training dataset) to update the model weights. Typically, the training dataset is partitioned into a set of minibatches called an epoch, such that each example is seen only once, and many epochs are required for the model to reach peak quality. For example, training an image classification model on the ILSVRC 2012 dataset Deng et al. (2009) with a minibatch size of 256 samples breaks the dataset into approximately 5,000 minibatches per epoch and takes tens of epochs to maximize network accuracy. To avoid overfitting to the training examples, modern training procedures employ data augmentation, which modifies the input data to effectively increase the size and diversity of the training set. Popular data augmentations for image tasks include random cropping, reflection, and color jitter.
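
To make the epoch arithmetic above concrete, the following sketch computes the number of iterations implied by the example; the epoch count is a typical value, not an MLPerf requirement:

```python
import math

# ILSVRC 2012 training set and the minibatch size from the example above
num_train_images = 1_281_167   # ~1.28 million images
minibatch_size = 256

# One epoch partitions the dataset into this many minibatches
minibatches_per_epoch = math.ceil(num_train_images / minibatch_size)
print(minibatches_per_epoch)   # 5005, i.e., roughly 5,000 iterations per epoch

# Tens of epochs are needed to maximize accuracy; 90 is a common choice
total_iterations = 90 * minibatches_per_epoch
print(total_iterations)        # ~450,000 weight updates for one training session
```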

A single iteration consists of three phases: forward pass, backward pass, and weight update. The forward pass propagates the minibatch of samples through the DNN's layers to compute the output, which the loss function uses to compute an error metric. The backward pass computes the error derivatives with respect to each model weight. During the weight update, an optimizer uses the weight derivatives to update the corresponding weights. Several optimizer choices are used in practice, including stochastic gradient descent (SGD) with momentum, Adam Kingma & Ba (2015), and others.
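
A minimal PyTorch-style sketch of one such iteration is shown below; the model, data, and hyperparameter values are illustrative placeholders rather than any MLPerf reference:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
loss_fn = nn.CrossEntropyLoss()
# SGD with momentum; the learning rate and momentum values are arbitrary examples
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

def train_step(inputs, labels):
    # Forward pass: propagate the minibatch through the layers and compute the loss
    outputs = model(inputs)
    loss = loss_fn(outputs, labels)

    # Backward pass: compute error derivatives with respect to every model weight
    optimizer.zero_grad()
    loss.backward()

    # Weight update: the optimizer applies the gradients to the weights
    optimizer.step()
    return loss.item()

# One iteration on a random minibatch of 256 samples
inputs = torch.randn(256, 784)
labels = torch.randint(0, 10, (256,))
print(train_step(inputs, labels))
```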

Setting up a training session for a given network and dataset requires specifying values for many hyperparameters that control the training behavior. Some hyperparameters affect data preparation (e.g., minibatch size and the procedures for selecting and augmenting samples to create a minibatch), others control the optimizer (e.g., learning rate schedule, weight decay, and momentum), and some layer types also have hyperparameters (e.g., the moving-average decay for batch normalization). These hyperparameter choices affect both the final model quality and the number of epochs it takes to reach that quality. Thus, an ML training performance benchmark must treat hyperparameter choices carefully in order to ensure a fair comparison of different systems.

2.2 Unique Challenges of Benchmarking ML Training

Benchmarking machine learning has unique challenges compared to other compute benchmarks such as LINPACK Dongarra (1988) or SPEC Dixit (1991) that necessitate an end-to-end approach, as opposed to a lightweight microbenchmark suite. A machine learning practitioner selects a dataset, an optimizer, and a neural network model, and the system trains the network to a state-of-the-art quality level (e.g., accuracy for image classification). Provided this quality requirement is satisfied, practitioners accept different system choices for operations, implementations, and numerical representations that maximize system performance, i.e., how fast training executes. Thus, an ML performance benchmark must ensure that systems under test achieve state-of-the-art quality while providing sufficient flexibility in implementation choices. This trade-off between quality and performance presents a set of unique challenges not encountered in other compute benchmarks, because a number of factors affect both the final accuracy of a model and the time to achieve that accuracy.

2.2.1 Effect of Optimizations on Quality

While many optimizations immediately improve traditional performance metrics like throughput, some optimization choices can adversely affect the quality of the final model, which can only be observed by running an entire training session. For example, the accuracy difference between single-precision training and significantly lower-precision training can only be seen in later epochs (Figure 1). Across several representation and training choices, the validation error curves only begin to separate after tens of epochs, and some numerical representations never match the final error of the full-precision training (lower validation error directly corresponds to higher accuracy). Thus, while microbenchmarks Baidu (2017); Chetlur et al. (2014) can assess the impact of an optimization on performance, a complete training session is required to determine the impact on quality and whether the desired accuracy is achieved. With the introduction of systems with varying numerics Abadi et al. (2016); Banner et al. (2018); Köster et al. (2017); Micikevicius et al. (2018) and performance optimization techniques, ML benchmarks cannot afford to omit accuracy measures.

Figure 1: Training AlexNet on the ImageNet dataset with different weight representations (from Zhu et al. (2016)).

2.2.2 Effect of Scale on Time-to-Train

Scaling training to large distributed systems containing many processors typically employs data parallelism with large minibatch sizes to maximize system utilization and minimize training time. In turn, these large minibatches require adjusting optimizer parameters, such as the learning rate Krizhevsky (2014); Goyal et al. (2017). Together these changes affect the learning dynamics and can change the number of updates required to achieve the target accuracy. For example, MLPerf v0.5 ResNet-50 takes around 64 epochs to reach the target top-1 accuracy of 74.9% at a minibatch size of 4K (source: MLPerf v0.5 results, https://mlperf.org/training-results-0-5), while a minibatch size of 16K can require over 80 epochs to reach the same accuracy, resulting in a 30% increase in computation. However, larger minibatches allow for efficient scaling to distributed systems, resulting in faster time-to-train. The interaction among system size, minibatch size, and learning dynamics presents another trade-off and challenge for a performance benchmark.

2.2.3 Run to Run Variation

Training neural networks involves many sources of stochasticity and exhibits substantial run-to-run variation Choromanska et al. (2015); Gori & Tesi (1992); Auer et al. (1996); Coleman et al. (2019). Different training sessions with the same model and hyper-parameters can reach slightly different accuracies after a fixed number of epochs. Alternatively, different training sessions can take a different number of epochs to reach a given target accuracy. For example, time to reach target accuracy for two of MLPerf v0.5 benchmarks, using reference implementations and default batch sizes, is shown in Figure 2. Several factors contribute to this variation, such as application behavior (e.g., random weight initialization and random data traversal), and system characteristics (e.g., profile-driven algorithm selection and non-commutativity of floating point additions). Large distributed training can involve asynchronous updates leading to different gradient accumulation orders. These variations create a challenge for a performance benchmark that aims to reliably compare system performance.
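
Two of these sources are easy to demonstrate in isolation (a toy sketch, not part of any benchmark): different random seeds change the initial weights, and the order in which floating-point values are summed changes the result of the reduction.

```python
import numpy as np

# Source 1: different seeds give different weight initializations (and data order)
for seed in (0, 1, 2):
    rng = np.random.default_rng(seed)
    print(seed, rng.standard_normal(2))  # a different starting point for every run

# Source 2: floating-point summation depends on reduction order
values = np.random.default_rng(0).standard_normal(100_000).astype(np.float32)
forward_sum = np.float32(0.0)
for v in values:
    forward_sum += v
reverse_sum = np.float32(0.0)
for v in values[::-1]:
    reverse_sum += v
# The two orderings typically disagree in the low-order bits
print(forward_sum, reverse_sum, forward_sum == reverse_sum)
```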

Figure 2: Training epochs to reach the target quality for the MLPerf v0.5 NCF (a) and MiniGo (b) benchmarks. Each repetition uses identical hyperparameters except for the random seed. For MiniGo, we observed significant variability across runs even when fixing the random seed (colored groupings).

2.2.4 Diverse Software Environments

Multiple ML software frameworks have been developed, each of which executes similar but not identical computations due to various implementation choices and constraints Abadi et al. (2016); Paszke et al. (2017); Chen et al. (2015); Jia et al. (2014). Software frameworks and the underlying math libraries use different algorithms to implement the same neural network operation. For example, convolutional and fully-connected layers, two compute-intensive layer types prevalent in modern neural networks, are typically implemented using cache blocking to take advantage of processor memory hierarchies. Different block sizes and processing orders (e.g., to optimize for different hardware), while algebraically equivalent, result in slight differences in results. Convolution layers also have a number of algorithmic choices, including direct, GEMM-based, and transform-based (FFT, Winograd) variants. For example, the cuDNN v7.6 library provides on the order of 10 algorithm choices for the forward computation of convolution (see the cuDNN developer guide for the forward convolution function: https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html), some of which additionally have different tiling/blocking choices depending on the hardware. While mathematically equivalent, different implementations result in numerical differences due to the finite precision of floating-point representations.
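
Frameworks expose some of this algorithm selection directly to users. For example, PyTorch's cuDNN integration can either autotune for speed or be restricted to deterministic algorithms; a small illustration, assuming a CUDA-enabled PyTorch build:

```python
import torch

# Option A: profile-driven algorithm selection. cuDNN benchmarks its candidate
# convolution algorithms on the first call and caches the fastest one for the
# observed input shapes; results may differ slightly across algorithm choices.
torch.backends.cudnn.benchmark = True

# Option B: reproducibility over speed. Restrict cuDNN to deterministic
# algorithms so repeated runs on the same system produce bitwise-identical results.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
```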

Additionally, sometimes frameworks have implementations of the same function that are mathematically different. For example, modern training frameworks implement stochastic gradient descent with momentum in one of two different ways:

(1) $v_{t+1} = m \, v_t + \epsilon_t \, \nabla L(w_t), \qquad w_{t+1} = w_t - v_{t+1}$

(2) $v_{t+1} = m \, v_t + \nabla L(w_t), \qquad w_{t+1} = w_t - \epsilon_t \, v_{t+1}$

where $w_t$ denotes the model weights at step $t$, $\nabla L(w_t)$ the gradient of the loss, $v_t$ the momentum (velocity) accumulator, $m$ the momentum coefficient, and $\epsilon_t$ the learning rate.

The first approach is implemented in the Caffe framework Jia et al. (2014), whereas the second is implemented in PyTorch Paszke et al. (2017) and TensorFlow Abadi et al. (2016). The two approaches are not mathematically identical if the learning rate changes during training, which is a commonly used technique. While this difference is not substantial in many cases, it can affect training convergence at higher minibatch sizes.
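
The divergence is easy to reproduce numerically. The sketch below implements both update rules on a shared stream of synthetic gradients (not a real model) and decays the learning rate partway through; the two weight trajectories match while the learning rate is constant and separate once it changes.

```python
import numpy as np

def momentum_lr_inside(w, v, grad, lr, m=0.9):
    # Variant (1): the learning rate is folded into the velocity (Caffe-style)
    v = m * v + lr * grad
    return w - v, v

def momentum_lr_outside(w, v, grad, lr, m=0.9):
    # Variant (2): the velocity accumulates raw gradients and the learning rate
    # scales the update (PyTorch/TensorFlow-style)
    v = m * v + grad
    return w - lr * v, v

rng = np.random.default_rng(0)
w1 = w2 = 1.0
v1 = v2 = 0.0
for step in range(20):
    grad = rng.standard_normal()
    lr = 0.1 if step < 10 else 0.01   # learning rate decayed after step 10
    w1, v1 = momentum_lr_inside(w1, v1, grad, lr)
    w2, v2 = momentum_lr_outside(w2, v2, grad, lr)

print(w1, w2)   # identical under a constant learning rate, different once it decays
```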

Differences also arise in the frameworks' choices for their programming interfaces. For example, PyTorch and TensorFlow have different interpretations of asymmetric padding, creating difficulties in porting model weights between frameworks. Data augmentation pipelines can also differ across frameworks in the order in which augmentations (e.g., crop, zoom, rotation) are applied to images.

While tools such as ONNX Bai et al. (2019) and TVM Chen et al. (2018) are emerging to enable interoperability of model architectures, their support is still limited and ML systems involve a range of optimizations that extend beyond the model architecture, including preprocessing, precision, and communication methods. Without a standard way of specifying every part of training, benchmarks must find a way to accommodate the diversity of systems used in practice.

2.3 Prior Work

Prior machine learning benchmarks span varying levels of granularity and implementation difficulty. Microbenchmarks such as DeepBench Baidu (2017) measure kernel-level operations that are reflective of those commonly used in industry's deployed models. Benchmarking such low-level operations does not address the challenges described in the previous section associated with numerical precision, hyperparameter choices, and system scale. Furthermore, such an approach does not capture the end-to-end application, cannot account for memory and cache-hierarchy effects across layers and operations, and does not measure the data preprocessing common in deep learning.

Several benchmarks are defined at the granularity of entire neural network models. Fathom and the Google TF Benchmarks Adolf et al. (2016); Google (2017) provide a reference suite of neural network model implementations that span a wide application space, but they specifically measure model throughput and do not account for accuracy. Similarly, TBD (Training Benchmarks for DNNs) Zhu et al. (2018) profiled performance on GPUs (but not other architectures) across a diverse set of workloads, including metrics such as memory and hardware utilization. Our benchmark builds on the diversity of applications in these projects and captures trade-offs between quality and performance.

DAWNBench Coleman et al. (2017) was the first multi-entrant benchmark competition to use “time-to-train” (originally called time-to-accuracy) to measure the end-to-end performance of deep learning systems and allowed optimizations across model architectures, optimization procedures, software frameworks, and hardware platforms. Our benchmark follows a similar approach to DAWNBench, but with a greater diversity of tasks (Section 3.1) as well as important rules and mechanisms in the Closed division (Section 4.2.1) to enable fair comparisons between different hardware systems.

Several other benchmark efforts are in development. AI Matrix measures workloads at different granularities (microbenchmarks, layer-wise benchmarks, end-to-end models, and synthetic workloads). Deep500, while not a benchmark itself, provides a software framework for instrumenting performance from individual operations up to distributed training systems Ben-Nun et al. (2019).

3 MLPerf Training Benchmark

In this section, we present the MLPerf training benchmark, detailing the initial selection of workloads (Section 3.1), the timing runs for training (Section 3.2), choice of quality thresholds (Section 3.3), and the provided reference implementations and hyperparameters (Section 3.4).

3.1 Benchmark Suite

To create a fair and useful benchmark suite for modern ML workloads, we curated a representative set of tasks from several key ML areas, including vision, language, recommendation, and reinforcement learning. Benchmarks were selected primarily on the basis of commercial and research relevance and to represent a diverse set of compute motifs. To establish relevance, we relied on feedback from the tens of commercial and academic organizations supporting MLPerf. To keep the benchmark suite affordable to run, we selected a compact but representative set of seven benchmarks, described below and summarized in Table 1. While these benchmarks already cover a wide range of research and industrial tasks, we are continuously exploring additional benchmarks to keep the suite relevant to the machine learning community (Section 6).



Benchmark | Dataset | Model | Quality Threshold
Image classification | ImageNet Deng et al. (2009) | ResNet-50 v1.5 MLPerf (2019b) | 74.9% Top-1 accuracy
Object detection (lightweight) | COCO 2017 Lin et al. (2014) | SSD-ResNet-34 Liu et al. (2016) | 21.2 mAP
Instance segmentation and object detection (heavyweight) | COCO 2017 Lin et al. (2014) | Mask R-CNN He et al. (2017a) | 0.377 Box min AP, 0.339 Mask min AP
Translation (recurrent) | WMT16 EN-DE WMT (2016) | GNMT Wu et al. (2016) | 21.8 Sacre BLEU
Translation (non-recurrent) | WMT17 EN-DE WMT (2017) | Transformer Vaswani et al. (2017) | 25.0 BLEU
Recommendation | MovieLens-20M GroupLens (2016) | NCF He et al. (2017b) | 0.635 HR@10
Reinforcement learning | Go (9x9 board) | MiniGo MLPerf (2019a) | 40.0% pro move prediction
Table 1: MLPerf Training v0.5 benchmarks

3.1.1 Image Classification

Image classification is the most widely used task for evaluating ML system performance Coleman et al. (2017); Adolf et al. (2016); Zhu et al. (2018); Goyal et al. (2017); Jia et al. (2018); Mikami et al. (2018); Ying et al. (2018); Google (2017). A classifier selects a class that best describes the contents of a given image. Classification networks are also used as feature extractors for many other computer vision tasks, including object detection, captioning, style transfer, and others. We use the ILSVRC 2012 ImageNet classification dataset, consisting of 1.28 million training images and 50,000 validation images Deng et al. (2009). Model quality is measured as top-1 accuracy on the validation set.

ResNet-50. Residual networks He et al. (2016a, b) and their derivatives remain state-of-the-art in image classification and are commonly used for system studies Goyal et al. (2017); Jia et al. (2018); Mikami et al. (2018); Ying et al. (2018); Sun et al. (2019). A number of slightly different ResNet-50 implementations exist in training framework repositories, which made earlier system performance claims incomparable due to model differences. To ensure meaningful system comparison, MLPerf defines the ResNet-50 v1.5 network (addition after batch normalization, no 1x1 convolution in the skip connection of the first residual block, and downsampling applied by the 3x3 convolutions), as well as its parameter initialization, optimizer schedule, and data augmentation.

3.1.2 Object Detection and Segmentation

Object detection and segmentation tasks are key components of many industrial systems, including robotics, autonomous driving, video analytics, and social media. Object detection is a regression task, as opposed to a classification task, returning bounding-box coordinates for objects in a given image. Segmentation assigns an object class to each pixel of the input image. While pretrained image classification models are commonly used as the backbone (feature extractor) for DNN object detectors and segmenters, these tasks have significantly different compute characteristics from image classification: they require additional layer types (upscaling, ROIAlign, NMS, sorting) and operate on larger input resolutions. MLPerf uses the 2017 COCO dataset Lin et al. (2014), consisting of 118K training images and 5K validation images. Model quality is measured as mAP for both detection and segmentation.

Mask R-CNN He et al. (2017a) is a widely adopted network model for object detection and instance segmentation of images. It is a two-stage model, with the first stage proposing regions of interest, and the second stage processing those regions to compute bounding boxes and segmentation masks. Mask R-CNN provides high accuracy results for these tasks, but at the cost of higher latency as well as compute and memory requirements. The benchmark is trained using images resized to 800 pixels on the shorter side and uses ResNet-50 as the backbone.

Single Shot Detection (SSD) Liu et al. (2016) represents real-time applications that require low-latency solutions, trading some accuracy for speed compared to two-stage solutions like Mask R-CNN Huang et al. (2016). These applications include autonomous driving, robotics, and video analytics. Rather than using full images, training uses 300x300 crops. A ResNet-34 backbone was chosen to represent current real-time application use cases. Note that ResNet-34 uses a different residual-block structure than ResNet-50, providing additional diversity in the computational motifs covered by the benchmark suite.

3.1.3 Translation

Neural machine translation translates a sequence of words from a source to a target language and represents many applications in industry. As is common in translation research, we use the WMT English-to-German (EN-DE) dataset WMT (2017), which contains about 4.5 million sentence pairs. Model quality is measured by the Bilingual Evaluation Understudy (BLEU) score on the newstest2014 test set. We include two translation benchmarks to account for the two significantly different model architectures commonly used for translation as well as other sequence-data tasks.

Transformer Vaswani et al. (2017) is an attention-based model architecture that currently achieves state-of-the-art quality for the language translation task. It consists of an encoder and a decoder, each a stack of 6 blocks. Each block is composed of multi-head attention and point-wise, fully connected layers.

GNMT Wu et al. (2016) is a recurrent neural network (RNN) for language translation. While it achieves lower translation accuracy than the Transformer on the WMT English-to-German dataset, it is included in the suite as a representative of RNN use cases. While these use cases span a number of tasks, language translation has the most available datasets and publications, enabling clearer system comparison. GNMT is the only RNN in the suite and consists of an 8-layer encoder and an 8-layer decoder, each using 1024 LSTM cells with skip connections.

3.1.4 Reinforcement Learning

Reinforcement learning (RL) is responsible for the recent dramatic increase in compute demand Amodei & Hernandez (2018) and is used for control systems. RL algorithms can train agents (represented in part by neural networks) that rival human ability in video games, Go, and Chess, major milestones in machine learning Silver et al. (2018); Mnih et al. (2013); Chan (2018). RL has a computationally different profile from the other ML benchmarks, generating training data through exploration rather than relying on a predetermined dataset.

MiniGo MLPerf (2019a), inspired by AlphaGo Silver et al. (2016, 2017, 2018), trains a single network that represents both the value and policy functions for a 9x9 game board. Training uses self-play (simulated games) between agents to generate data, performing many forward passes through the model to generate actions rather than relying on an external simulator. MLPerf chose MiniGo to keep the benchmark ML-oriented, since a number of other RL problems use simulators (physics, video-game environments, etc.) to generate data and thus spend most of their time in computations not related to ML. To measure quality, we calculate the percentage of predicted moves that match human reference games.

3.1.5 Recommendation

Recommendation systems are a key commercial workload for internet companies Naumov et al. (2019); Zhou et al. (2018); Cheng et al. (2016). These workloads are characterized by large embedding tables, followed by linear layers.

Neural Collaborative Filtering (NCF) He et al. (2017b), an instance of the Wide and Deep family of models, was chosen as the benchmark. The benchmark is trained to predict user-item interactions. More so than for other tasks, recommender compute characteristics depend on the dataset: for example, the dataset defines the size of the embedding tables as well as the memory access patterns. Thus, a representative dataset is key to a representative benchmark. Unfortunately, public datasets tend to be orders of magnitude smaller than industrial datasets. While MLPerf v0.5 adopted the MovieLens-20M dataset GroupLens (2016) for the NCF benchmark, the dataset and benchmark are being updated for v0.7 with a synthetically enlarged dataset that retains the characteristics of the original data Belletti et al. (2019).

3.2 Time to Train Performance Metric

To address the ML benchmarking challenges of system optimization and scale outlined in Sections 2.2.1 and 2.2.2, MLPerf adopts time-to-train to a defined quality target as its performance metric. This metric incorporates both system speed and accuracy and is the most relevant quantity for machine learning practitioners. As an end-to-end metric, it also captures the breadth of operations needed to train a model, such as the data pipeline and accuracy calculations. The generality of the metric enables it to be applied to diverse training schemes such as those found in reinforcement learning, unsupervised learning, or generative adversarial networks. Together, these properties allow time-to-train to overcome the challenges in Sections 2.2.1 and 2.2.2 by preventing submissions from using optimizations that harm quality, while still allowing a great deal of flexibility in system scale and software environment.

3.2.1 Timing Rules

Timing requirements were chosen to ensure fair comparison of different systems and represent various training use cases. Timing begins when any training or validation data is touched, and stops when the defined quality target has been achieved on the validation dataset.

We exclude from timing several components that can introduce significant overhead that is not indicative of real-world system differences.

System initialization. Initialization, especially at larger scales, varies based on cluster administrator choices as well as system queue load. For example, initialization may include running diagnostics on each node prior to starting the training job. Such overheads are not indicative of a system's training capability, so they are excluded from timing.

Model creation and initialization. Some frameworks compile the model graph in order to optimize subsequent execution. This compilation time is not significant for the long training sessions that use industry-scale datasets. However, MLPerf uses public datasets, which are usually much smaller than industrial datasets. As a result, large distributed systems can train some MLPerf benchmarks in minutes, making compilation time a significant fraction of the total. To keep the benchmarks representative of training on the largest industrial datasets, we allow excluding up to 20 minutes of model creation time. This limit ensures that MLPerf also remains representative of smaller training jobs and discourages compilation approaches that would be too computationally or operationally expensive to use in practice.

Data reformatting. The raw input data is commonly reformatted once and then used for many subsequent training sessions. Examples of reformatting include changing image file formats and packing data into a database format (e.g., LMDB, TFRecords, RecordIO) for more efficient access. Since these operations are performed once and amortized over many training sessions, MLPerf does not include reformatting in timing. However, MLPerf requires that any data processing or augmentation that happens during training not be moved to the reformatting stage (e.g., different crops of each image cannot be created and saved outside of the timed portion of training).
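
A minimal sketch of how these rules partition a submission's run into untimed and timed phases is shown below; the helper callables are hypothetical, and this is not the official MLPerf logging harness:

```python
import time

def run_benchmark(build_model, load_data, train_until, target_quality):
    # Excluded: system initialization and (up to 20 minutes of) model creation,
    # e.g., graph compilation, happen before the clock starts.
    model = build_model()

    # Excluded: one-time data reformatting (e.g., packing into LMDB/TFRecords)
    # is assumed to have been done before this function is called.

    # Timing starts the moment training or validation data is first touched.
    start = time.monotonic()
    dataset = load_data()

    # Per-epoch augmentation (random crops, flips, ...) must happen inside the
    # timed region; it may not be precomputed during reformatting.
    quality = train_until(model, dataset, target_quality)

    # Timing stops when the quality target is reached on the validation set.
    elapsed = time.monotonic() - start
    return elapsed, quality
```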

3.2.2 Number of Timing Samples

To address the inherent stochasticity and resulting run-to-run variance of modern deep learning methods, described in Section 2.2.3, MLPerf requires submissions to provide several runs of each benchmark in order to stabilize timing. The number of runs varies among benchmarks and was determined by studying the behavior of the reference implementations: vision tasks require five runs, chosen so that 90% of results from the same system fall within 5% of each other, and all other tasks require ten runs, so that 90% of results fall within 10%. The fastest and slowest times are dropped, and the arithmetic mean of the remaining runs is the result reported by MLPerf.
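
Under these rules, the reported result could be computed from the individual run times roughly as follows (an illustrative sketch, not the official results script; the run times are made up):

```python
def mlperf_result(run_times_minutes):
    """Drop the fastest and slowest runs, then average the rest."""
    if len(run_times_minutes) < 3:
        raise ValueError("need at least three runs to drop the extremes")
    trimmed = sorted(run_times_minutes)[1:-1]
    return sum(trimmed) / len(trimmed)

# Example: five hypothetical vision-benchmark runs, in minutes
print(mlperf_result([88.1, 90.4, 86.9, 91.7, 89.3]))  # mean of the middle three
```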

3.3 Choice of Quality Thresholds

For each benchmark, we chose a quality threshold near the state-of-the-art for the corresponding model and dataset (Table 1), based on experiments with the reference implementations. Some of these thresholds are slightly lower than results reported in the literature, to enable benchmarking across software frameworks and to ensure that training sessions consistently achieve the quality metric. While choosing a lower threshold that is achievable early in a training session would reduce submission resources, we chose higher thresholds that require longer training sessions for two reasons. First, we need to ensure that optimizations do not adversely affect the final results (the challenges described in Sections 2.2.1 and 2.2.2). Second, we need to minimize run-to-run variation, which tends to be much higher early in training. For example, Figure 3 shows the accuracy achieved by five training sessions of the MLPerf v0.5 ResNet-50 v1.5 reference implementation, where the first 30 epochs exhibit significantly more noise.

Figure 3: Top-1 accuracy of MLPerf v0.5 ResNet-50 benchmark over epochs. 5 different runs (denoted by color) with identical hyper-parameters other than the random seed. Dotted line indicates quality target of 74.9% Top-1 Accuracy. The early phase of training is marked by significantly more variability.

3.4 References and Hyperparameters

MLPerf provides a reference implementation for each benchmark in the suite, using either the PyTorch or the TensorFlow framework. References also include scripts or directions to download and preprocess the public datasets. References are not optimized for performance (and thus should not be used for performance assessment or comparison); their primary purpose is to define a concrete implementation of a benchmark's network model and training procedure. All submitters must follow the reference: submissions may reimplement a benchmark in their framework of choice as long as the neural network and training operations are mathematically equivalent to the reference. Furthermore, the reference implementations are used to establish the required quality thresholds.

MLPerf rules specify the list of modifiable hyperparameters as well as restrictions on their modification. These restrictions are intended to ensure that result differences are due to system characteristics rather than to system-independent workload adjustments resulting from different hyperparameters. While hyperparameter searches are a common part of ML practice, MLPerf's focus is on system optimization rather than hyperparameter exploration. However, to accommodate a wide range of training system scales, submissions must be able to adjust the minibatch size in order to showcase maximum system efficiency (similar in concept to the Top500 LINPACK benchmark allowing systems to choose the size of the problem to solve). To ensure that training still converges to the required threshold, other hyperparameters, such as the learning rate and optimization schedule, may need to be adjusted to match the different minibatch sizes; for example, a common ResNet training practice is to increase the learning rate linearly with the minibatch size Goyal et al. (2017), as sketched below. The list of modifiable hyperparameters is defined in the MLPerf rules, both to ensure fair system comparison and to reduce the computational and operational cost of submission, which could otherwise be inflated by hyperparameter searches. MLPerf working groups review the hyperparameter requirements and choices for each benchmarking round to account for advances in ML training scale research and practice.
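
As referenced above, the linear scaling heuristic can be written in a few lines (a sketch; the base learning rate and base minibatch size are illustrative values, not MLPerf-prescribed defaults):

```python
def scale_hyperparameters(minibatch_size,
                          base_minibatch_size=256,
                          base_learning_rate=0.1):
    # Linear scaling rule (Goyal et al., 2017): grow the learning rate in
    # proportion to the minibatch size so each epoch applies a comparable
    # total update, then verify convergence to the quality target.
    scale = minibatch_size / base_minibatch_size
    return {"minibatch_size": minibatch_size,
            "learning_rate": base_learning_rate * scale}

print(scale_hyperparameters(4096))   # {'minibatch_size': 4096, 'learning_rate': 1.6}
```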

4 Benchmarking Process

In this section, we outline the submission and review process (Section 4.1) and how results are reported (Section 4.2) to account for innovative solutions, availability, and scale. To date, there have been two rounds of the MLPerf benchmark: v0.5 and v0.6. The cadence between rounds is on the order of a few months, and the benchmark suite may be updated from one round to the next. Each round has a submission and review period followed by result publication.

4.1 Submission and Review

An MLPerf submission consists of a system description, training session log files, and all code and libraries required to reproduce those training sessions. All of these are made publicly available on the MLPerf GitHub simultaneously with the publication of MLPerf results, allowing for reproducibility and enabling the community to improve on results in subsequent rounds. The system description includes both the hardware (number of nodes, processor and accelerator counts and types, storage per node, network interconnect) and the software (operating system, libraries and their versions). A training session log file contains a variety of structured information, including timestamps for important stages of the workload, the quality metric evaluated at prescribed intervals, hyperparameter choices, and more. These logs form the foundation for subsequent result analysis.

Prior to result publication, submissions are peer-reviewed for compliance with the MLPerf rules. Compliance issues, if any, are raised with submitters, and resubmission after addressing them is allowed. Additionally, some hyperparameter borrowing is allowed during the review period to ensure fair comparisons of systems under conditions as similar as possible. Specifically, if a submission uses hyperparameters that would also benefit other submissions, we want to ensure that those systems have an opportunity to adopt them.

4.2 Results Reporting

Each MLPerf submission has several labels: division (Open or Closed), category (Available, Preview, or Research), and system type (On-Premise or Cloud).

4.2.1 Submission Divisions

MLPerf has two submission divisions – Closed and Open. Both divisions require submissions to use the same dataset and quality metric as the corresponding reference implementation.

The Closed division is intended for direct system comparison; it therefore strives to ensure workload equivalence by requiring submissions to be equivalent to the reference implementations. Equivalence includes mathematically equivalent network implementations, parameter initialization, optimizer and training schedule, and data processing and traversal. To ensure fair system comparisons, the Closed division also restricts hyperparameter modification.

The Open division is intended to encourage innovative solutions to important practical tasks, and hardware-software co-design. It allows submissions to use model architectures, optimization procedures, and data augmentations different from the reference implementations.

4.2.2 System Categories

To allow for a broad range of research and industry submissions, we defined three system categories: Available, Preview, and Research. The three categories encourage submissions that use novel techniques and systems (e.g., from academic researchers) while distinguishing between shipping products and proofs-of-concept or early engineering samples.

The Available category imposes requirements on both hardware and software availability. Submitted hardware must be either available for rent on a cloud service by a third party, or available for purchase for on-premise submissions. Supply and lead times for renting or purchasing should be appropriate for both system scale and company size. To ensure that benchmark submissions can be widely consumed, and to discourage benchmark-specific engineering, we also require that software in the Available category be versioned and supported for general use.

Preview systems contain components that will meet the Available criteria within 60 days of the submission date or by the next submission cycle, whichever is later. Any system submitted in Preview must be submitted in the Available category by that time.

Research submissions contain components that are not intended for production. This includes research prototypes from the academic community, where the intent is a proof of concept rather than a robust product. The Research category also includes systems that are built from production hardware and software but at a larger scale than any currently offered Available system configuration.

4.2.3 Reporting Scale

Modern ML training systems span multiple orders of magnitude in both power draw and cost. Thus, system comparisons are more useful when scale is reported alongside the performance scores. No single scale metric, such as cost or power, can be defined across the full range of systems (cloud, on-premise, pre-production), so scale is reported differently depending on the system type.

In the first two MLPerf rounds, we include the system configuration (number of processors and/or accelerators) alongside the performance scores. For on-premise systems, future versions will include a specification for measuring power. For cloud systems, a 'cloud scale' metric was derived from: 1) the number of host processors, 2) the amount of host memory, and 3) the number and type of accelerators. We empirically verified that cloud scale correlates closely with cost across three major cloud providers. Reporting of these scale metrics was optional in MLPerf rounds v0.5 and v0.6.

4.2.4 Reporting Scores

The MLPerf results report provides the time-to-train metric for each benchmark in a given submission. While a single summary score spanning the entire suite might be desirable for system comparisons, such a score is not appropriate for MLPerf for two main reasons. First, a summary score implies some weighting of the individual benchmark scores; given the diverse set of system users and the wide range of application areas that the MLPerf suite covers, no universally representative weighting exists. Second, a summary score becomes less meaningful if a submitted system does not report results on all benchmarks in the suite. There are multiple reasons why a submission may omit some benchmarks: not all benchmarks are practical at every system scale (for example, some networks cannot currently be trained at the minibatch sizes required for data-parallel training on the largest systems), and some processors may target only specific application areas.

5 Results

MLPerf, like all benchmarks, aims to use constructive competition to encourage innovation; we measure progress toward this end by comparing results across submission rounds. To date there have been two submission rounds of the MLPerf Training benchmark, v0.5 and v0.6. The two rounds were six months apart, and the underlying hardware systems did not change. The results on the five benchmarks that were either unmodified or modified only in limited ways between rounds show that MLPerf is driving rapid improvements in both the performance and the scaling of implementations and software stacks. Figure 4 shows that, between the two submission rounds, the best performance results submitted on a 16-chip system improved by an average of 1.3x despite the higher quality targets. Figure 5 shows that, between the two submission rounds, the number of chips in the system used to produce the best overall performance result increased by an average of 5.5x. Some of this improvement is due to better benchmark implementations, and some is enabled by rule changes such as allowing the LARS optimizer You et al. (2017) for large ResNet minibatch sizes. However, we believe much of the performance and scaling improvement was incorporated into the underlying software infrastructure and passed on to users. We expect MLPerf to drive similar improvements through focused hardware innovation over time.

Figure 4: Speedup in fastest 16-chip entry from MLPerf version v0.5 to v0.6, along with increased quality targets.
Figure 5: Increase in number of chips used in the system that produced the fastest overall score from MLPerf version v0.5 to v0.6.

6 Conclusions

MLPerf Training defines a suite of benchmarks that is representative of both industrial and academic ML use cases. In addition to being the first and only widely used ML training benchmark suite with such coverage, MLPerf's contributions include:

  • Precise definition of network models and training procedures for each benchmark. This enables system comparisons on equivalent workloads, whereas previous results for systems would often be reported with substantially different variants of a given model architecture (for example, there are at least 5 variants of ResNet-50).

  • Reference implementations and rule definitions to address the challenges unique to benchmarking ML training that are not encountered by other compute workloads. These challenges include the stochastic nature of training processes, the need for training to completion to determine the impact of performance optimizations on result quality, and the need for workload variation at different system scales (Section 2.2).

We intend the MLPerf benchmark suites to become industry standards and to make a lasting contribution to the field, which requires careful technical and organizational design. MLPerf is a collaborative effort led by working groups. The training benchmark suite is developed and administered by a submitters working group, which makes rules decisions; a special topics working group, which considers in-depth technical topics; and a results working group, which handles the review and publication of results. We are working to evolve this organization by recruiting industry and academic advisors for each benchmark area to provide neutral, market- and research-need-driven guidance, and by creating a permanent non-profit to host MLPerf. Our approach has gathered substantial momentum: the MLPerf organization is now supported by over 50 companies and by researchers from 8 educational institutions. An organizational contribution of the MLPerf effort is the consensus, sometimes grudging, formed across these many organizations. We believe that the approach of starting from a small number of previous benchmarking groups, forming an initial consensus, and then adding more community members and incorporating their experience and perspectives has resulted in a better suite than any one group could have built.

Since machine learning is an evolving field, MLPerf has established a process to maintain and update the benchmark suite over time. For example, the MLPerf v0.6 round included a number of updates: the ResNet-50 benchmark added the LARS optimizer You et al. (2017), commonly used for training on large-scale systems; the GNMT model architecture was improved to increase the achieved translation quality; and the MiniGo reference was switched from Python to C++ to increase performance. As a result of these enhancements, the target quality thresholds were raised. Future work includes:

  • Expanding the benchmark suite to systematically cover the key machine learning areas of vision, speech, language, commerce (e.g. time series), and research (e.g. GANs).

  • Improving reference implementations as starting points for development.

  • Producing a table that maps system scale and precision to recommended hyperparameters for each benchmark.

  • Developing better large public datasets for benchmarking and other purposes.

  • Developing better software best practices for ML benchmarking and experimentation.

The MLPerf organization welcomes input and contributions from interested engineers and researchers.

References