Performance and Power Evaluation of AI Accelerators for Training Deep Learning Models

by Yuxin Wang, et al.
Hong Kong Baptist University

Deep neural networks (DNNs) have become widely used in many AI applications. Yet training a DNN requires a huge amount of computation, and it takes a long time and much energy to train a satisfactory model. Nowadays, many-core AI accelerators (e.g., GPUs and TPUs) play a key role in training DNNs. However, many-core processors from different vendors differ considerably in performance and power consumption. To investigate the differences among several popular off-the-shelf processors (i.e., Intel CPU, Nvidia GPU, AMD GPU and Google TPU) in training DNNs, we carry out a detailed performance and power evaluation on these processors by training multiple types of benchmark DNNs, including convolutional neural networks (CNNs), recurrent neural networks (LSTM), Deep Speech and transformers. Our evaluation results are valuable in two ways. For end-users, they provide a guide for selecting a proper accelerator for training DNN models. For vendors, the advantages and disadvantages revealed in our evaluation could be useful for future architecture design and software library optimization.



1 Introduction

Recent years have witnessed the fast development of deep neural networks (DNNs) [lecun2015deep], which have been widely used in many AI applications, such as image recognition [krizhevsky2012imagenet][he2015deep], object detection [girshick2015fast][redmon2016you], speech to text tasks [hinton2012deep], etc. However, training these DNN models requires a considerable amount of computational resources [lecun2015deep][dean2012large].

Graphics Processing Units (GPUs) [luebke2006gpgpu] are among the most popular hardware for accelerating DNN training. Unlike a conventional CPU, a typical GPU is equipped with thousands of cores and hundreds of gigabytes per second of memory bandwidth [luebke2006gpgpu], which significantly accelerates DNN training and inference compared to a traditional CPU. Since 2016, a new generation of computing device, the TPU [jouppi2017datacenter], has been launched by Google, followed by Cloud TPU v2 and Cloud TPU v3 [ying2018image]. The TPU generations differ mainly in performance and memory capacity. Benefiting from its extraordinary parallel computing capability, the TPU cloud service has greatly accelerated progress in artificial intelligence and its related applications. TPUs have achieved better performance than many other AI accelerators.


Meanwhile, the development of optimized software keeps pace with the hardware. On CPUs, there exist highly optimized high-performance libraries such as MKL and MKLDNN [wang2014intel][cyphers2018intel]. On Nvidia GPUs, researchers and industry have developed cuDNN [chetlur2014cudnn], cuBLAS and other CUDA-based libraries that achieve nearly peak performance of the GPU cards. On AMD GPUs, ROCm is also actively developed to support high-performance deep learning. For TPUs, TensorFlow [abadi2015tensorflow] is highly optimized and backed by a large development community.

However, AI accelerators designed by different vendors vary widely in performance, power and energy consumption. For example, training time can differ even between Nvidia and AMD GPUs of similar capability. In terms of performance, existing benchmarks for training DNNs include software comparisons [bahrampour2016comparative][shi2018performance], hardware comparisons [wei2019benchmarking] and combined software and hardware comparisons [shi2016benchmarking]. In addition, vendors provide their own benchmarks to demonstrate performance with their own highly optimized libraries or configurations, so these results may not be fairly comparable.

Power and energy consumption matter in server deployments for DNN training, because lower energy consumption directly reduces the long-term electricity bill. Considering performance and energy together, one can scale the hardware configuration with dynamic voltage and frequency scaling (DVFS) techniques [mei2017survey][wang2018gpgpu] to save energy. Tang et al. [tang2019impact] evaluated energy with DVFS on GPUs in training DNNs. Wang et al. [eppminer] propose a benchmark suite for both performance and energy, but they focus only on traditional algorithms and not on deep learning. Furthermore, performance and energy data together are critical to job scheduling algorithms [chau2017energy] that save energy while preserving the computing efficiency of tasks.

In summary, existing benchmarks consider either only performance or only energy, and only for particular accelerators or algorithms. Furthermore, there are few studies on AMD GPUs, even though AMD has developed the ROCm deep learning ecosystem for its users. To this end, in this paper, we benchmark many popular AI accelerators, including an Intel CPU, Nvidia GPUs, an AMD GPU and Google TPUs, on multiple metrics: performance, power and energy. On one hand, our evaluation results guide end-users in choosing a proper AI accelerator to train their own DNN models under different considerations. For example, end-users can compare the cost of cloud-based GPUs and TPUs for a specific model and choose the cheaper one to train it. On the other hand, the problems revealed by the evaluation results could help hardware and software designers with further optimization. For example, GPU library engineers can gain insight into why some operations do not utilize the computing resources well. The experimental numbers for performance, power and energy can also be used by job scheduling algorithms for energy conservation [chau2017energy][mei2017energy], in which a task should finish within the expected time (related to performance) without consuming too much power (related to energy).

To make the evaluation thorough, we first evaluate the performance of low-level operations on the aforementioned accelerators, and then we evaluate the performance, power and energy of end-to-end training of currently popular DNNs from different AI areas, including CNNs [krizhevsky2012imagenet][he2015deep], LSTM [gers1999learning], Deep Speech 2 [amodei2016deep] and Transformer [vaswani2017attention]. Our major findings are shown in Table 1.

| Section | Key Factor | Metric | Main Findings |
|---|---|---|---|
| 4.1.1 | Multi-threading on CPU | Performance | The number of CPU threads should be set equal to the number of physical cores. |
| | AVX on CPU | Performance | AVX is generally useful in matrix multiplication, while there is no obvious improvement in convolution. |
| 4.1.2 | Tensor size | Performance | Larger tensor sizes place higher workloads on accelerators and generally achieve higher throughput. |
| 4.1.2 | Software for TPU | Performance | TPU nearly fully utilizes the computing resource. |
| 4.1.2 | Nvidia vs AMD | Performance | Nvidia GPUs have better optimized software than AMD. |
| 4.1.2 | Software for GPUs | Performance | Matrix multiplication has been well optimized on GPUs, while convolution still has room for further optimization. |
| 4.2.1 | Multi-threading on CPU | Performance | Some CPU cores should be allocated to data pre-processing during training. |
| 4.2.2 | Mini-batch size | Performance | Mini-batch size should be large enough to fully utilize the computational resources of accelerators. |
| 4.2.2 | Low-bit precision | Performance | FP16 generally achieves higher throughput than FP32, especially on CNNs with Tesla V100 using Tensor Cores. However, on NLP models, FP16 has no obvious improvement compared to FP32. |
| 4.2.2 | GPU vendor | Performance | Nvidia GPUs achieve higher throughput and have wider software support than the AMD GPU. |
| 4.2.3 | Latest TPU | Performance | TPU V3-8 has about 1.5× higher throughput than TPU V2-8. |
| 4.2.4 | TPU vs GPU | Performance | TPU V3-8 achieves more than 3× higher throughput than Tesla V100 on CNNs, while it has only about 1.5× on Transformer. |
| 4.3.1 | Nvidia GPU Model | Power | Tesla V100 has the lowest power consumption on CNNs, while Titan X(Pascal) consumes the lowest power on NLP models. |
| 4.3.1 | GPU vendor | Power | AMD GPU has the lowest power consumption on NLP models, while it consumes much higher power on ResNet-50 and VGG-16 than Nvidia GPUs. |
| 4.3.2 | Mini-batch size | Energy | Higher mini-batch size consumes higher energy on CNNs. |
| 4.3.2 | GPU Model | Energy | Nvidia Tesla V100 has the lowest energy consumption among evaluated GPUs. |

Table 1: Summary of main findings

The rest of this paper is organized as follows. Section 2 introduces some background knowledge related to DNNs, AI accelerators and training algorithms. Section 3 describes our experimental designs and setups, including hardware configurations and DNNs. Section 4 demonstrates our experimental results and analysis of AI accelerators with different training tasks. Some related work is introduced in Section 5. We finally conclude the paper in Section 6.

2 Preliminaries

2.1 Deep Models

In different areas of AI applications, there exist various types of deep architectures achieving state-of-the-art results. In image classification and object detection tasks, convolutional neural networks (CNNs) are the main architectures for extracting image features automatically, among which the VGG [simonyan2014very], ResNet [he2015deep] and Inception [szegedy2016rethinking] architectures are widely used. These CNNs also achieved very good results in the popular ImageNet challenge on the tasks of image classification and object detection. In the area of natural language processing, the recurrent neural network (RNN) was one of the most successful models, especially the LSTM [press2016using]. In recent years, Deep Speech 2 [amodei2016deep] was proposed and achieved state-of-the-art results on speech recognition tasks, and attention-based models including Transformer [vaswani2017attention] and BERT [devlin2018bert] have achieved very good scores in many machine translation tasks.

(a) VGG16 [simonyan2014very].
(b) LSTM [gers1999learning].
(c) Deep Speech 2 [amodei2016deep].
(d) Transformer [vaswani2017attention].
Figure 1: Different Architecture of DNNs.

2.2 AI Accelerators

There are many newly developed AI accelerators. In this paper, we mainly focus on the widely available processors such as CPUs, GPUs and TPUs. We will investigate FPGAs in our future work.


CPUs are traditional processors used in every computer, but they are not good at highly parallel, compute-intensive tasks. In the era of deep learning, the main CPU vendor designs its many-core CPUs for these kinds of tasks. For example, the Intel Xeon Scalable processor was reported to outperform an Nvidia GPU in deep learning inference on the ResNet-50 model. The Intel Xeon processor [regnier2004eta] is a powerful CPU with high computing FLOPS among Intel CPUs.

Nvidia and AMD GPUs.

GPUs are designed for highly parallel applications in terms of the number of computing cores and the memory access speed, and their peak FLOPS has increased rapidly in the last ten years. In Table 2, we list the parameter details of four recent GPUs: three from Nvidia (Tesla V100, Tesla P100 and Titan X(Pascal)) and one from AMD (Radeon VII). Their peak FP32 throughput is roughly 10 TFLOPS or more, which is around 5× higher than that of CPUs.

| Product Name | Tesla V100 | Tesla P100 | Titan X(Pascal) | Radeon VII |
|---|---|---|---|---|
| GPU | GV100 | GP100 | GP102 | Vega 20 |
| GPU Cores | 5120 | 3584 | 3584 | 3840 |
| Tensor Cores | 640 | - | - | - |
| Core Clock | 1246 MHz | 1190 MHz | 1417 MHz | 1400 MHz |
| Boost Clock | 1455 MHz | 1329 MHz | 1531 MHz | 1750 MHz |
| Memory Clock | 876 MHz | 715 MHz | 1251 MHz | 1000 MHz |
| Memory Bus Width | 4096 bit | 4096 bit | 384 bit | 4096 bit |
| Memory Bandwidth | 897.0 GB/s | 732.2 GB/s | 480.4 GB/s | 1 TB/s |
| Memory Type | HBM2 | HBM2 | GDDR5X | HBM2 |
| FP16 Computing | 28.26 TFLOPS | 19.05 TFLOPS | - | 26.88 TFLOPS |
| FP32 Computing | 14.13 TFLOPS | 9.526 TFLOPS | 10.97 TFLOPS | 13.44 TFLOPS |
| TDP | 250 W | 250 W | 250 W | 295 W |

Table 2: The Parameter Details of GPUs
Google TPUs.

Tensor Processing Units (TPUs) are Google’s custom-designed machine learning application-specific integrated circuits (ASICs). Each TPU device has 4 chips and each chip consists of 2 cores, so a TPU device contains 8 cores. Each core has scalar, vector and matrix units (MXUs) and is connected to on-chip high bandwidth memory (HBM). There are two types of TPUs: TPU v2 and TPU v3. For TPU v2, each core has 8 GB of HBM and one MXU, while for TPU v3, each core has two MXUs and is connected to 16 GB of HBM. TPUs support the bfloat16 format, which has a wider range of values than float16 with the same 16-bit storage. TPU v2 with 8 cores (TPU v2-8) and TPU v3 with 8 cores (TPU v3-8) have peak bfloat16 computing capacities of 180 and 420 tera bfloat16 operations per second, respectively. Additionally, a TPU v2 Pod is assembled from 64 TPU v2 devices, containing 512 TPU v2 cores. A TPU v3 Pod provides a maximum of 256 TPU v3 devices and a total of 2048 TPU v3 cores.

Figure 2: The structures of TPU v2 (top) and TPU v3 (bottom).
bfloat16 and float16

The MXU in each TPU core is used to execute 16K multiply-accumulate operations in each cycle. Besides, MXU supports mixed precision training, i.e. its input and output are 32-bit floating point values and it can use bfloat16 for activation and gradients. Compared with IEEE half-precision floating point (fp16), bfloat16 has a wider range of values because it has one sign bit, eight exponent bits, and seven mantissa bits plus one implicit mantissa bit, as shown in Fig. 3. Using bfloat16 can help reduce the size of data in memory, making larger models available for the same size of memory, while ensuring no degradation of converged accuracy.

Figure 3: The comparison between bfloat16 and float16.
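The range difference between the two 16-bit formats can be checked directly from their bit layouts. The following is a minimal sketch, assuming the standard IEEE-style encoding (biased exponent, all-ones exponent reserved for infinity/NaN); the helper names `max_finite` and `machine_epsilon` are illustrative and not part of any library.

```python
def max_finite(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-style binary format with the given field widths."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent encodes inf/NaN, so the largest usable one is 2**e - 2.
    max_exponent = (2 ** exponent_bits - 2) - bias
    return (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent


def machine_epsilon(mantissa_bits: int) -> float:
    """Spacing between 1.0 and the next representable value."""
    return 2.0 ** -mantissa_bits


# (exponent bits, explicit mantissa bits); both formats also have one sign bit.
FLOAT16 = (5, 10)
BFLOAT16 = (8, 7)

fp16_max = max_finite(*FLOAT16)
bf16_max = max_finite(*BFLOAT16)

print(f"float16 : max={fp16_max:.4e}, eps={machine_epsilon(FLOAT16[1]):.4e}")
print(f"bfloat16: max={bf16_max:.4e}, eps={machine_epsilon(BFLOAT16[1]):.4e}")
```

The wider exponent field gives bfloat16 roughly the dynamic range of FP32, at the cost of a coarser mantissa than float16.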

2.3 Training Algorithms.

Figure 4: The training process of mini-batch SGD and mixed precision.
Mini-batch SGD.

The Stochastic Gradient Descent (SGD) [chen2016revisiting] algorithm and its variants are widely used in training deep models. Mini-batch SGD [li2014efficient] is a derivative of SGD that divides the entire data set into multiple subsets (mini-batches) and iteratively updates the model parameters according to the first-order gradients on the current mini-batch of data. As shown in Fig. 4, a single iteration starts with reading data from the computer’s disk into the CPU’s memory, and it ends with the update of parameters. Training repeats this iteration until some termination criterion is met. We generally use the average iteration time to measure the training performance of particular software and hardware.
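The iteration structure described above can be sketched on a toy problem. This is a minimal illustration, assuming a one-dimensional linear model and an in-memory data set (so the disk-to-memory loading step is not modeled); the function name `minibatch_sgd` and all hyperparameters are hypothetical and unrelated to the evaluated DNNs.

```python
import random
import time


def minibatch_sgd(data, batch_size, lr, epochs, seed=0):
    """Fit y = w*x + b with mini-batch SGD.

    Returns (w, b, average iteration time in seconds).
    """
    rng = random.Random(seed)
    data = list(data)  # copy so the caller's list is not shuffled
    w, b = 0.0, 0.0
    iteration_times = []
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            t0 = time.perf_counter()
            batch = data[start:start + batch_size]
            # Forward pass and first-order gradients of the mean squared error.
            grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
            # Parameter update ends the iteration.
            w -= lr * grad_w
            b -= lr * grad_b
            iteration_times.append(time.perf_counter() - t0)
    return w, b, sum(iteration_times) / len(iteration_times)


# Synthetic, noise-free samples of y = 3x + 1.
samples = [(i / 100.0, 3.0 * (i / 100.0) + 1.0) for i in range(100)]
w, b, avg_time = minibatch_sgd(samples, batch_size=10, lr=0.1, epochs=200)
print(f"w={w:.4f}, b={b:.4f}, average iteration time={avg_time:.2e} s")
```

The returned average iteration time plays the same role as the performance measure used in this paper, only at toy scale.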

Mixed Precision Training.

The mixed precision [micikevicius2017mixed][jia2018highly] training technique is a very successful training algorithm that uses only 16-bit floating-point numbers for the forward and backward computations during training, so that hardware resources are better utilized. (Mixed precision mainly uses FP16 for computation during the forward and backward passes, so its performance represents the FP16 performance of the accelerators.) Typically, in mixed precision, FP32 master weights and loss scaling are adopted to avoid the instability that FP16 precision might trigger. The training process is also shown in Fig. 4.
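To illustrate why loss scaling helps, the sketch below rounds a small gradient to FP16 with and without scaling. It is only an illustration under stated assumptions: the gradient value and the scale factor of 1024 are hypothetical, Python's double-precision floats stand in for the FP32 master copy, and the half-precision rounding uses the `e` format of the standard `struct` module.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


LOSS_SCALE = 1024.0   # hypothetical scale factor
true_grad = 1e-8      # hypothetical gradient, below the FP16 subnormal range

# Without scaling, the gradient underflows to zero in FP16.
naive_grad = to_fp16(true_grad)

# With loss scaling, the scaled gradient survives the FP16 backward pass
# and is unscaled in higher precision before the weight update.
scaled_grad = to_fp16(true_grad * LOSS_SCALE)
recovered_grad = scaled_grad / LOSS_SCALE

master_weight = 1.0   # kept in higher precision (FP32 in practice)
learning_rate = 0.1
master_weight -= learning_rate * recovered_grad

print(f"naive FP16 gradient : {naive_grad}")
print(f"recovered gradient  : {recovered_grad:.4e}")
```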

3 Methodology

In this section, we introduce the methodology used to compare performance, power and energy among multiple accelerators. We first present the selected hardware settings and DNNs from different AI areas. Then we describe our evaluation methods.

3.1 Hardware Setup

As we would like to evaluate the most commonly used accelerators for training DNNs, we select many-core processors from four vendors including Intel, Nvidia, AMD and Google. For each vendor, we select one to three processors for evaluation. The details of the selected accelerators are listed in Table 3, which presents the key parameters that are related to the performance of accelerators.

| Vendor | Accelerator Model | Memory | Theoretical FLOPS | Memory Bandwidth | Memory Type | CPU |
|---|---|---|---|---|---|---|
| Intel | Xeon Platinum 8163 | 48GB | 1.92 T (FP32) | 119.21 GB/s | DDR4 | - |
| Nvidia | Titan X(Pascal) | 12GB | 11 T (FP32) | 480.4 GB/s | GDDR5X | i7-7820X |
| Nvidia | Tesla P100 | 16GB | 9.5 T (FP32) | 732.2 GB/s | HBM2 | i7-6800K |
| Nvidia | Tesla V100 | 16GB | 125 T (Tensor Core) | 897.0 GB/s | HBM2 | i7-6800K |
| AMD | Radeon VII | 16GB | 13.44 T (FP32) | 1 TB/s | HBM2 | i7-4790 |
| Google | TPU v2-8 | 64GB | 180 T (bfloat16) | 600 GB/s | HBM | - |
| Google | TPU v3-8 | 128GB | 420 T (bfloat16) | 900 GB/s | HBM | - |

Table 3: Hardware setup

3.2 Evaluated Operations and DNNs


DNN models are mainly built by stacking many layers, which generally invoke two main resource-consuming operators (ops): matrix multiplication (Matmul) and two-dimensional convolution (Conv2d). DNNs also contain activation layers that are element-wise operators, but these operators are much faster than Matmul and Conv2d, so we mainly evaluate the performance of Matmul and Conv2d on the selected accelerators. To evaluate the performance of ops, the input data are synthetic tensors of different FLOPs. (In this paper, FLOPS denotes floating-point operations per second, while FLOPs denotes the number of floating-point operations.) To ensure the utilization of accelerators, we select small, medium and large sizes of input tensors, which are listed in Table 4. For the Matmul operator, tensor dimensions range from 256×256 to 8192×8192; for the Conv2d operator, inputs and filters are selected based on real-world models under different batch sizes, from 32 to 256.

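| Op | Field | Small | Medium | Large |
|---|---|---|---|---|
| Matmul | Shape | N(2048, 2048) | N(4096, 4096) | N(8192, 8192) |
| Matmul | FLOPs | 1.72E+10 | 1.37E+11 | 1.10E+12 |
| Conv2d | Shape | F(256, 224, 224, 3) | F(128, 112, 112, 64) | F(256, 56, 56, 128) |
| Conv2d | | K(7, 7, 3, 64) | K(3, 3, 128, 256) | K(3, 3, 128, 256) |
| Conv2d | | S(2, 2) | S(1, 1) | S(1, 1) |
| Conv2d | FLOPs | 5.72E+10 | 2.28E+11 | 4.40E+11 |

Table 4: Input tensor sizes for ops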
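The FLOPs figures in Table 4 can be approximately reproduced with a short calculation. The sketch below rests on assumptions that are not stated in the text: each multiply-accumulate is counted as 2 floating-point operations, F(...) is read as (batch, height, width, channels), K(...) as (kernel height, kernel width, input channels, output channels), and the convolution uses no padding. Under these assumptions the Matmul entries and the small and large Conv2d entries match the table; the function names are illustrative.

```python
def matmul_flops(n: int) -> int:
    """FLOPs of an (n x n) by (n x n) matrix product, counting 2 per multiply-accumulate."""
    return 2 * n ** 3


def conv2d_flops(batch, in_h, in_w, in_c, k_h, k_w, out_c, stride):
    """FLOPs of a 2D convolution without padding, counting 2 per multiply-accumulate."""
    out_h = (in_h - k_h) // stride + 1
    out_w = (in_w - k_w) // stride + 1
    macs = batch * out_h * out_w * k_h * k_w * in_c * out_c
    return 2 * macs


for n in (2048, 4096, 8192):
    print(f"Matmul N({n}, {n}): {matmul_flops(n):.2E}")

# Small Conv2d case: F(256, 224, 224, 3), K(7, 7, 3, 64), S(2, 2)
small_conv = conv2d_flops(256, 224, 224, 3, 7, 7, 64, 2)
# Large Conv2d case: F(256, 56, 56, 128), K(3, 3, 128, 256), S(1, 1)
large_conv = conv2d_flops(256, 56, 56, 128, 3, 3, 256, 1)
print(f"Conv2d small: {small_conv:.2E}")
print(f"Conv2d large: {large_conv:.2E}")
```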

To cover a comprehensive set of AI applications, we choose DNN models from image classification with CNNs, language modeling with LSTM, speech recognition with Deep Speech 2, and the recent state-of-the-art language model Transformer. For CNNs, we choose ResNet-50 [he2015deep], Inception V3 [szegedy2016rethinking] and VGG16 [simonyan2014very] on the ImageNet [deng2009imagenet] dataset. For LSTM, we select the typical 2-layer LSTM on the PTB [marcus1994penn] dataset. For the Deep Speech 2 architecture, we train the model on the AN4 dataset. For Transformer, we train the model on the WMT14 EN-DE dataset. The details of the DNN configurations are shown in Table 5. (The FLOPs for Deep Speech 2 are measured for a single time step.)

| Networks | # Params (million) | MACs (GFLOPs) | Datasets | # Samples | Input Size |
|---|---|---|---|---|---|
| VGG16 | 138 | 15.60 | ImageNet | 1.2M | 224×224×3 |
| Resnet50 | 26 | 4.14 | ImageNet | 1.2M | 224×224×3 |
| Inception V3 | 27 | 2.86 | ImageNet | 1.2M | 224×224×3 |
| 2-Layer LSTM | 66 | 0.102 | PTB | 42K | seq length: 20 |
| Deep Speech 2 | 27 | 0.122 | AN4 | 948 | / |
| Transformer | 65 | 6.2 | WMT14 EN-DE | 36M | seq length: 256 |

Table 5: Model and Dataset Setting

3.3 Evaluation Methods

Evaluation Metrics.

To give readers a comprehensive view of different tasks and AI accelerators, we use performance, power and energy as the evaluation metrics. For the performance measurements, we measure the average iteration time over 1000 iterations with a given input mini-batch size, and then we calculate the accelerator’s performance in training a particular DNN as the throughput in Samples/s. Note that the units of samples are images for CNNs, sentences/10 for LSTM, utterances for Deep Speech 2, and tokens/100 for Transformer, respectively. For the power measurement, we sample the system power every 50 ms during the training process using the built-in interfaces provided by the Nvidia Management Library [NVML] on Nvidia GPUs and the ROCm System Management Library [rocm-power] on the AMD GPU, and calculate the average power in watts as the metric. The energy is derived directly from the measured performance and power.

The metric details are defined in Table 6.

| Metric | Definition | Unit |
|---|---|---|
| Performance | Throughput in processing samples during the training process | Samples per second |
| Power | The electrical energy cost at a certain mini batch per second | Watt |
| Energy | The electrical energy cost of the computing device to process a sample | J per sample |

Table 6: Definition of metrics
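The three metrics relate to each other in a simple way, which the following sketch makes explicit. It assumes that throughput is the mini-batch size divided by the average iteration time and that energy per sample is the average power divided by the throughput, which is one natural reading of "directly derived" above; the function names and all numbers are hypothetical and are not measurements from this paper.

```python
def throughput(batch_size: int, avg_iteration_time_s: float) -> float:
    """Samples processed per second."""
    return batch_size / avg_iteration_time_s


def average_power(power_samples_w) -> float:
    """Mean of periodically sampled power readings, in watts."""
    return sum(power_samples_w) / len(power_samples_w)


def energy_per_sample(avg_power_w: float, samples_per_second: float) -> float:
    """Joules needed to process one sample (W / (samples/s) = J/sample)."""
    return avg_power_w / samples_per_second


# Hypothetical measurements: mini-batch of 64, 0.25 s per iteration,
# and three power readings taken 50 ms apart.
perf = throughput(batch_size=64, avg_iteration_time_s=0.25)
power = average_power([200.0, 210.0, 190.0])
energy = energy_per_sample(power, perf)

print(f"performance: {perf:.1f} samples/s")
print(f"power      : {power:.1f} W")
print(f"energy     : {energy:.5f} J/sample")
```

This also shows why a device with higher power can still have lower energy per sample if its throughput is proportionally higher.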
Measurement Software Tools.
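| Ops or DNNs | Accelerator | OS | Software | Libraries |
|---|---|---|---|---|
| Ops, Transformer | Intel CPU | Ubuntu | TensorFlow 1.13 | MKL-2019.4-243 |
| Ops, Transformer | Nvidia GPUs | Ubuntu | TensorFlow 1.13 | CUDA-10.0, cuDNN-v7.4 |
| Ops, Transformer | Google TPUs | - | TensorFlow 1.13 | - |
| CNNs, LSTM, Deep Speech 2 | Intel CPU | Ubuntu | PyTorch 1.1 | MKL-2019.4-243 |
| CNNs, LSTM, Deep Speech 2 | Nvidia GPUs | Ubuntu | PyTorch 1.1 | CUDA-10.0, cuDNN-v7.4 |
| CNNs, LSTM, Deep Speech 2 | Google TPUs | - | TensorFlow 1.13 | - |

Table 7: Software setup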

For the performance of ops, we use TensorFlow version 1.13 on all accelerators. We use the input sizes shown in Table 4 for the two ops and measure their average execution time on each accelerator. For DNN measurements, we evaluate the CPU and GPUs (both Nvidia and AMD) with PyTorch version 1.1; the CPU and GPU related libraries are shown in Table 7. As TPU mainly supports TensorFlow, we measure the TPU training performance with TensorFlow.

4 Experimental Results

In this section, we present the results of experiments on AI accelerators. The evaluation results include the performance of low-level mathematical operators (Subsection 4.1), performance on end-to-end training (Subsection 4.2), and power and energy consumption on end-to-end training (Subsection 4.3).

4.1 Low-level Operators Performance

We evaluate the AI accelerators at the operator level under different computational complexities. We test operators with small, medium and large FLOPs, representing different workloads for the accelerators, as shown in Table 4.

(a) Matmul.
(b) Conv2D.
Figure 5: The results of CPU’s Performance on two operators.

4.1.1 CPU Results

Multi-threading and AVX are two main techniques for exploiting the many-core capability of CPUs. We evaluate the performance of operators with AVX enabled and disabled, varying the number of threads from 1 to 24. The results are shown in Fig. 5. In terms of multi-threading, the computing performance increases with the number of threads. In particular, doubling the number of threads generally improves performance by about 15%. However, the number of threads should not be larger than the number of physical cores of the CPU; otherwise, performance may degrade. Regarding AVX, on the Matmul operator the improvement with AVX is significant compared to the non-AVX counterpart. On the Conv2D operator, however, there is no obvious improvement with AVX enabled. The different results on Matmul and Conv2D indicate that the software optimization of Matmul is better than that of Conv2D, and that there is room to improve the efficiency of Conv2D with AVX techniques.

We also plot the CPU utilization of the performance on these two operators as shown in Fig. 6. The maximum utilization is up to 73% and 62% on Matmul and Conv2D respectively.

(a) Matmul.
(b) Conv2D.
Figure 6: CPU’s utilization on two operators.

4.1.2 GPU and TPU Results

(a) Matmul.
(b) Conv2D.
Figure 7: The performance of Matmul and Conv2D on GPUs and TPUs
(a) Matmul.
(b) Conv2D.
Figure 8: GPU and TPU’s utilization on two operators.

On GPUs and TPUs, the performance of Matmul and Conv2D is shown in Fig. 7. For all accelerators, higher workloads better utilize the computing resource and achieve higher throughput. Among GPUs, Nvidia Tesla V100 has the highest performance on both operators. On the evaluated Tesla V100 and AMD Radeon VII, which both support FP16, 16-bit performance is better than the 32-bit counterpart. In particular, 16-bit performance on Tesla V100 is nearly 3× higher than its 32-bit version. The utilization of the different accelerators is shown in Fig. 8. TPU V2 achieves nearly optimal throughput on both operators. On the Matmul operator, Tesla V100 with FP16 also achieves nearly 97% of peak performance, while it reaches only about 40% on the Conv2D operator. Among GPUs, Nvidia GPUs generally have higher utilization than the AMD GPU on the Matmul operator, while all GPUs show similarly poor utilization on Conv2D, reaching only about 40% of peak FLOPS.

4.2 End-to-end Training Performance

4.2.1 CPU Results

The evaluated CPU (Intel Xeon Platinum 8163) contains 24 physical cores. Multi-threading is a key technique for using multiple cores for computation. We first evaluate how the number of threads affects training performance, which is shown in Fig. 9. It can be seen that on the 24-core CPU, 24 threads generally achieve the best performance on CNNs and LSTM. However, the best number of threads on Deep Speech 2 in Fig. 9 is 8, which indicates that the training computation should not occupy all the computing resources in some cases. Note that in the training process, besides the forward and backward computations, data loading and pre-processing can also be compute-intensive. If all CPU cores are occupied by the forward and backward computations, the data pre-processing thread may lack CPU resources, so that the overall performance becomes even worse.

Figure 9: Performance of end-to-end training on CPU.

Note that doubling the number of threads (up to the number of physical cores) generally yields an improvement of only around 10%-15%, which is much smaller than the expected doubling. The main reason is that the parallelism of end-to-end training on the CPU is mainly at the operator level, which means multi-threading takes effect within operators (e.g., Matmul and Conv2D). As we analyzed in Section 4.1, the performance improvement of the operators is also around 10%-15% with a doubled number of threads.

4.2.2 GPU Results

Performance vs Mini-batch Size.

As introduced in the description of the GPU architecture, current GPUs contain thousands of cores. When training DNNs, the GPU workload increases linearly with the mini-batch size. Therefore, we first select a representative GPU (i.e., Tesla V100) to demonstrate the performance with respect to the mini-batch size, which is displayed in Fig. 10. From the results, we can conclude that small mini-batch sizes may not occupy all the computing resources of an accelerator, so that its computational power is not fully utilized. One should set a proper mini-batch size for particular DNNs and accelerators to achieve maximum performance. For example, a mini-batch size of 4096 achieves about 30% higher throughput than 2048 with Transformer, whereas a mini-batch size of 32 performs very close to 16 with Deep Speech 2. In the following discussion, we always use a mini-batch size that maximizes the performance on the particular accelerator.

Figure 10: Performance on V100 GPU with different mini-batch sizes.
Figure 11: Performance comparison on GPUs.

The performance of end-to-end training with different DNNs on different GPUs (including Nvidia and AMD) is shown in Fig. 11. We compare their performance in two applications (i.e., CNNs and NLP models).

Figure 12: Performance comparison between GPUs with CNNs.

The results of training CNNs on GPUs are shown in Fig. 12. It can be seen that Tesla V100 has the best performance with both FP32 and FP16 training among the tested GPUs. With the same numerical precision, Nvidia V100 generally achieves 1.5-2× higher performance than the AMD Radeon VII GPU. Among Nvidia GPUs, Tesla P100 and Titan X(Pascal) have very close performance, which is reasonable as these two GPUs have similar peak FLOPS as shown in Table 3. Comparing the two desktop-level GPUs, Nvidia Titan X(Pascal) and AMD Radeon VII, we notice that Titan X(Pascal) achieves slightly higher performance than Radeon VII, even though the peak FLOPS of Radeon VII is about 22% higher than that of Titan X(Pascal). This phenomenon indicates the importance of software optimization for particular hardware, since highly optimized software can achieve nearly optimal FLOPS. Compared to the highly optimized cuDNN and CUDA libraries on Nvidia GPUs, the ROCm software ecosystem for AMD GPUs has been developed only recently.

NLP Models.

Note that Deep Speech 2 cannot be run successfully on the Radeon VII GPU because some operators are not supported, so we exclude Radeon VII from the Deep Speech 2 comparison. Similar to the results on CNNs, Nvidia GPUs achieve higher performance than the AMD GPU. In particular, Titan X(Pascal) is nearly 1.8× faster than Radeon VII on the 2-layer LSTM. Among Nvidia GPUs, Tesla V100 always has the best performance. However, unlike the results on CNNs, Tesla V100 with mixed precision shows only a slight improvement over the FP32 counterpart on the NLP models. This marginal improvement indicates that the software library should be further optimized for NLP models.

Figure 13: Performance comparison between GPUs with NLP models. The units of samples are sentences/10 for LSTM, utterances for Deep Speech 2, and tokens/100 for Transformer.

4.2.3 TPU Results.

Figure 14: Performance comparison between Tesla V100 GPU and TPUs

We first discuss the performance of the two versions of TPUs. The performance is shown in Fig. 14. On the three evaluated DNNs, TPU v3-8 is about 1.5-1.7× faster than TPU v2-8. However, the peak FLOPS of TPU v3-8 is around 2.3× that of TPU v2-8 as shown in Table 3, which indicates that the utilization of TPU v3-8 is much lower than that of TPU v2-8. The experimental results suggest that there is still room for software optimization to improve the performance of TPU v3-8.
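The utilization argument can be made concrete with the numbers quoted above. The following sketch assumes that utilization is proportional to achieved throughput divided by peak FLOPS, so the utilization of TPU v3-8 relative to TPU v2-8 is the measured speedup divided by the peak ratio; the variable names are illustrative.

```python
# Peak bfloat16 capacity from Table 3, in tera operations per second.
PEAK_TPU_V2_8 = 180.0
PEAK_TPU_V3_8 = 420.0

# Measured speedup range of TPU v3-8 over TPU v2-8 reported in the text.
SPEEDUP_LOW, SPEEDUP_HIGH = 1.5, 1.7

peak_ratio = PEAK_TPU_V3_8 / PEAK_TPU_V2_8

# Relative utilization: how much of the extra peak capacity is turned into speedup.
relative_util_low = SPEEDUP_LOW / peak_ratio
relative_util_high = SPEEDUP_HIGH / peak_ratio

print(f"peak ratio v3/v2        : {peak_ratio:.2f}x")
print(f"relative utilization v3 : {relative_util_low:.0%} to {relative_util_high:.0%} of v2")
```

Under this assumption, TPU v3-8 reaches roughly two thirds to three quarters of the utilization of TPU v2-8 on the evaluated models.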

4.2.4 Comparison Between GPU and TPU

Among all evaluated GPUs, Nvidia Tesla V100 outperforms the others, including the AMD GPU, and the previous subsection showed that TPUs also achieve very high throughput in training DNNs. We therefore compare the performance of Nvidia Tesla V100 and TPUs in training different models. The performance comparison is shown in Fig. 14. It can be seen that TPUs outperform the Tesla V100 GPU on the three evaluated models. On CNNs (ResNet-50 and Inception v3), TPU V2-8 and TPU V3-8 achieve more than 1.5× and 2× higher throughput than Tesla V100 with Tensor Cores, respectively. However, on the Transformer architecture, TPU V2-8 is very close to the Tesla V100 GPU, and TPU V3-8 is around 1.5× faster than Tesla V100.

4.3 End-to-end Training Power and Energy

Due to the limitation of power measurements on CPU and TPUs, we only discuss the power and energy consumption for training DNNs on GPUs (Nvidia GPUs and the AMD GPU). We will discuss the power and energy consumption separately in the following two subsections.

4.3.1 Power Consumption

Figure 15: The measured power on different GPUs

The measured power of the different GPUs is shown in Fig. 15. Among the three Nvidia GPUs (i.e., V100, P100 and Titan X(Pascal)), P100 has the lowest power consumption on CNNs, while Titan X(Pascal) consumes the lowest power on NLP architectures. In contrast to its performance ranking, the AMD GPU (Radeon VII) consumes the lowest power among all GPUs on NLP architectures, but it consumes much more power than the Nvidia GPUs on ResNet-50 and VGG16.

4.3.2 Energy Consumption

A larger mini-batch size generally yields higher throughput on the accelerators, but it does not necessarily mean lower energy consumption. We compare the energy consumption of GPUs on CNN and NLP models.

Figure 17: Comparison of energy consumption on GPUs (lower is better). Panels: (a) CNNs, (b) LSTM, (c) Deep Speech 2, (d) Transformer.

CNNs.

The energy consumption on CNNs with different GPUs is shown in Fig. 17(a). Even though Tesla V100 has higher power consumption than Tesla P100 or the AMD GPU, its performance is much higher than that of the other GPUs. Therefore, the energy consumption of Tesla V100 is the lowest among all evaluated GPUs. Among the Nvidia GPUs, Titan X(Pascal) consumes the most energy to train the same models, even though it is a desktop-level GPU.
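The relationship between power, throughput and energy can be sketched as follows. Energy per sample is taken to be average power (J/s) divided by throughput (samples/s), which is consistent with the Perf., Power and Energy rows of Table 8. The values are the ResNet-50 results at mini-batch size 64 from that table; the function name `energy_per_sample` is hypothetical.

```python
# (throughput in images/s, average power in W) for ResNet-50, batch size 64,
# copied from Table 8.
resnet50_b64 = {
    "Titan X": (172.97, 233.08),
    "Radeon VII": (206.45, 275.50),
    "P100": (202.47, 172.64),
    "V100": (311.25, 193.87),
    "V100 (MP)": (633.33, 188.22),
}


def energy_per_sample(throughput: float, power_watts: float) -> float:
    """Joules per sample: watts (J/s) divided by samples/s."""
    return power_watts / throughput


energy = {gpu: energy_per_sample(t, p) for gpu, (t, p) in resnet50_b64.items()}

# Print GPUs from lowest to highest energy per image.
for gpu, joules in sorted(energy.items(), key=lambda kv: kv[1]):
    print(f"{gpu}: {joules:.2f} J/image")
```

In this column, V100 draws more power than P100 but still uses less energy per image because its throughput is proportionally higher.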

2-Layer LSTM.

The energy comparison for the LSTM is shown in Fig. 17(b). The AMD GPU has much higher energy consumption than the Nvidia GPUs. However, as the mini-batch size increases, the AMD GPU consumes less energy, while the Nvidia GPUs stay nearly unchanged.
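The effect of mini-batch size on energy can be checked against the raw numbers. The sketch below uses the 2-layer LSTM rows of Table 8 for Titan X and Radeon VII; because LSTM throughput there is reported in units of sentences/10, it is multiplied by 10 before dividing. The variable and function names are hypothetical, and the results agree with the Energy rows of Table 8 only up to rounding.

```python
# 2-layer LSTM rows from Table 8. Throughput is reported in units of
# 10 sentences/s, so it is multiplied by 10 before dividing.
SENTENCES_PER_UNIT = 10
batch_sizes = [64, 128, 256]
lstm = {
    "Titan X": {"perf": [58.18, 64.00, 75.29], "power": [189.26, 195.24, 201.66]},
    "Radeon VII": {"perf": [14.55, 26.12, 41.97], "power": [134.00, 141.00, 169.50]},
}


def energy_per_sentence(perf_units, power_watts):
    """Joules per sentence for each batch size."""
    return [p / (t * SENTENCES_PER_UNIT) for t, p in zip(perf_units, power_watts)]


lstm_energy = {
    gpu: energy_per_sentence(rows["perf"], rows["power"]) for gpu, rows in lstm.items()
}
for gpu, values in lstm_energy.items():
    row = ", ".join(f"b{b}: {e:.2f}" for b, e in zip(batch_sizes, values))
    print(f"{gpu}: {row} (J/sentence)")
```

With these inputs, the energy per sentence on Radeon VII falls by more than half from batch size 64 to 256, while on Titan X it changes far less.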

Deep Speech 2.

Because the loss function of the Deep Speech model is not supported on the AMD GPU, we exclude the AMD result for Deep Speech 2. Among the three Nvidia GPUs, Tesla V100 again has the lowest energy consumption. However, as the mini-batch size increases, the energy consumption of Titan X(Pascal) comes very close to that of Tesla V100.


Transformer.

On the Transformer architecture, the AMD GPU consumes less energy than the Tesla P100 and Titan X(Pascal) GPUs, while Tesla V100 is again the best.

For reference, all raw numbers from end-to-end training are shown in Table 8.

5 Related Work

Benchmarks are key to driving hardware and software toward better targets (e.g., performance and/or energy). In the era of deep learning, training tasks are very computationally intensive and resource-consuming, which makes the performance and energy consumption of accelerators very important for deployment. Since 2016, deep learning frameworks have developed rapidly to fully exploit the performance of accelerators such as CPUs and GPUs. The latest TPUs represent a large step forward in performance for machine learning applications.

Researchers [harmonia2015, shi2016benchmarking] started to compare performance across different deep learning frameworks and different GPUs. However, these works mainly focus on software-level performance evaluation. Later, the Stanford DAWN deep learning benchmark [coleman2017dawnbench] and MLPerf [mlperf2019] were opened to researchers and practitioners for comparing training and inference speed across software and hardware platforms. These two open benchmark platforms have received many submissions, but they are mainly task-specific and focus on performance. Shams et al. [icdcs2017] evaluated deep learning software tools on several hardware platforms, including distributed environments. Recently, Wang et al. [wei2019benchmarking] proposed ParaDnn to measure the performance of various hardware, including Intel CPU, Nvidia GPU and Google TPU. Energy consumption matters on the server side during training, and many task scheduling algorithms have been studied, but few studies measure the power and energy consumption of DNN training tasks. One related work is by Tang et al. [tang2019impact], who studied the impact of GPU dynamic voltage and frequency scaling on training performance and energy.

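| Metric | Device | ResNet-50 (64) | ResNet-50 (128) | Inception V3 (64) | Inception V3 (128) | VGG16 (64) | VGG16 (128) | 2-Layer LSTM (64) | 2-Layer LSTM (128) | 2-Layer LSTM (256) | Deep Speech 2 (8) | Deep Speech 2 (16) | Deep Speech 2 (32) | Transformer (1024) | Transformer (2048) | Transformer (4096) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Perf. (samples/s) | Titan X | 172.97 | 170.67 | 114.29 | OOM | 118.05 | 126.47 | 58.18 | 64.00 | 75.29 | 25.81 | 32.00 | 35.96 | 56.89 | 70.62 | 85.33 |
| Perf. (samples/s) | Radeon VII | 206.45 | 166.23 | 133.33 | 110.34 | 125.49 | 126.73 | 14.55 | 26.12 | 41.97 | - | - | - | 55.21 | 76.58 | 87.22 |
| Perf. (samples/s) | P100 | 202.47 | 207.61 | 147.69 | 151.23 | 135.58 | 138.12 | 59.13 | 63.73 | 70.53 | 27.59 | 32.00 | 36.36 | 68.27 | 93.09 | 105.03 |
| Perf. (samples/s) | V100 | 311.25 | 334.58 | 230.53 | 243.59 | 213.32 | 216.25 | 82.63 | 99.01 | 103.04 | 38.10 | 43.24 | 44.44 | 105.57 | 158.41 | 193.21 |
| Perf. (samples/s) | V100(MP) | 633.33 | 669.52 | 436.84 | 440.88 | 416.47 | 411.25 | 100.13 | 118.46 | 126.67 | 42.11 | 47.06 | 46.38 | 107.79 | 157.54 | 195.05 |
| Perf. (samples/s) | TPU V2 | 1066.67 | 1163.64 | 800.00 | 914.29 | - | - | - | - | - | - | - | - | 160.00 | 190.33 | 213.33 |
| Perf. (samples/s) | TPU V3 | 1280.00 | 1882.35 | 1488.37 | 1523.81 | - | - | - | - | - | - | - | - | 256.00 | 293.83 | 357.73 |
| Perf. (samples/s) | CPU | 6.35 | 11.77 | 6.44 | 12.33 | 8.84 | 16.64 | 2.69 | 3.22 | 4.72 | 3.41 | 5.17 | 11.44 | - | - | - |
| Power (watt) | Titan X | 233.08 | 212.97 | 207.35 | OOM | 202.29 | 215.12 | 189.26 | 195.24 | 201.66 | 105.74 | 109.58 | 111.55 | 150.64 | 171.35 | 189.23 |
| Power (watt) | Radeon VII | 275.50 | 285.50 | 141.00 | 142.00 | 271.00 | 260.00 | 134.00 | 141.00 | 169.50 | - | - | - | 126.51 | 149.21 | 172.55 |
| Power (watt) | P100 | 172.64 | 176.51 | 162.38 | 178.03 | 166.76 | 163.45 | 195.32 | 197.35 | 201.47 | 109.65 | 119.52 | 127.87 | 188.05 | 205.21 | 210.33 |
| Power (watt) | V100 | 193.87 | 206.61 | 191.56 | 198.68 | 190.71 | 202.34 | 213.32 | 215.55 | 214.24 | 107.99 | 121.64 | 130.28 | 183.17 | 200.05 | 214.28 |
| Power (watt) | V100(MP) | 188.22 | 213.25 | 193.21 | 199.25 | 206.29 | 222.54 | 209.24 | 216.11 | 225.58 | 108.54 | 125.05 | 131.77 | 180.5 | 201.44 | 214.58 |
| Energy (J/sample) | Titan X | 1.35 | 1.25 | 1.81 | OOM | 2.59 | 2.81 | 0.33 | 0.31 | 0.27 | 4.10 | 3.42 | 3.10 | 0.03 | 0.03 | 0.02 |
| Energy (J/sample) | Radeon VII | 1.33 | 1.73 | 1.07 | 1.29 | 2.17 | 2.38 | 0.92 | 0.54 | 0.41 | - | - | - | 0.023 | 0.020 | 0.019 |
| Energy (J/sample) | V100 | 0.62 | 0.62 | 0.83 | 0.82 | 0.89 | 0.94 | 0.26 | 0.22 | 0.21 | 2.82 | 2.81 | 2.96 | 0.017 | 0.013 | 0.011 |
| Energy (J/sample) | P100 | 0.85 | 0.85 | 1.10 | 1.18 | 1.23 | 1.18 | 0.33 | 0.31 | 0.29 | 3.97 | 3.74 | 3.52 | 0.03 | 0.02 | 0.02 |
| Energy (J/sample) | V100(MP) | 0.30 | 0.32 | 0.44 | 0.45 | 0.50 | 0.54 | 0.21 | 0.18 | 0.18 | 2.58 | 2.66 | 2.84 | 0.017 | 0.013 | 0.011 |

Note: '-' means the item is currently unsupported, and OOM means out of memory. Column headers give the model followed by the batch size in parentheses. The sample units for Perf. are images for CNNs, sentences/10 for LSTM, utterances for Deep Speech 2, and tokens/100 for Transformer.
Table 8: Experimental results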

6 Conclusion

In this paper, we presented a detailed evaluation of training performance, power and energy consumption on various widely used accelerators, including Intel CPU, AMD GPU, Nvidia GPUs and Google TPUs, covering several types of deep neural networks (convolutional neural networks, recurrent neural networks, Deep Speech 2 and Transformer). Our evaluation results provide comparisons at several levels (including hardware performance, software utilization, diversity of deep models, power consumption and energy consumption) for end-users and hardware/software designers. In future work, we would like to benchmark the performance and power of deep learning inference tasks on both server devices and mobile devices.


The research was supported by Hong Kong RGC GRF grant HKBU 12200418.