I Introduction
GPUs have become popular for accelerating computing in industry in recent years. Traditionally, these devices were used primarily for 3D rendering. However, their use for general-purpose computing on GPUs (GPGPU) has expanded rapidly since then. Heterogeneous computing, using both CPUs and GPUs, is becoming more and more powerful, and this trend is likely to continue. These systems are also becoming increasingly easy to program due to the development of languages such as CUDA. Because of the high demand for heterogeneous systems, these tools are seeing a rise in popularity in the programming community.
Existing benchmark suites such as Rodinia and SHOC were designed to better understand the characteristics of heterogeneous systems. Their applications were curated so that each benchmark exhibits unique behaviors that stress a characteristic or component of GPUs. Doing so allows programmers to select the most suitable hardware and software for specific tasks. However, these benchmark suites haven’t kept up with the evolution of programming frameworks: they don’t utilize newer features, like HyperQ, that were introduced in recent versions of CUDA. Further, while hardware has evolved to provide more raw compute, the task of scaling the input to benchmarks has either been left to the user, as is the case with Rodinia, or is entirely untenable, as in SHOC. This makes the applications unlikely to stay relevant as problem sizes grow larger.
GPUs are the de facto platform for training deep neural networks. Frameworks like TensorFlow [3] and PyTorch [4] are widely adopted in the neural network community. However, they are not designed to study hardware behaviors and usually introduce excessive memory footprints, making them less desirable for hardware architects. SHOC does include a neuralnet benchmark, but it fails to capture the complexity of recent DNN models.
This paper makes several key contributions:

Present the areas in which existing benchmark suites Rodinia and SHOC are lacking, specifically in workload diversity, problem sizes, and programming features.

Demonstrate the improvements made in Mirovia and how these improvements characterize the capabilities of modern heterogeneous systems more comprehensively. Mirovia adds new features supported up to CUDA 9.0.

Understand the characteristics of neural networks as a new application domain and compare them to conventional GPU applications by adding commonly used DNN kernels to Mirovia.
II Related Work
II-A Rodinia
Rodinia is a suite of applications designed for heterogeneous systems, released in 2009. It consists of applications and kernels that represent different types of behavior from the Berkeley dwarfs [5]. The dwarfs represent 13 major categories of computation used to describe most types of problems. In addition, Rodinia covers a diverse set of applications with respect to communication, synchronization, and power consumption.
However, Rodinia doesn’t utilize new features like unified memory, HyperQ, Cooperative Groups, and Dynamic Parallelism. The benchmark suite still uses CUDA 4.0. Therefore, it does not utilize programming constructs or performance improvements introduced in newer CUDA versions, such as the CUDA event library. Moreover, Rodinia doesn’t support hardware capable of half-precision operations and tensor core computation.
To analyze how Rodinia stresses characteristics of GPUs, performance was profiled for each benchmark using nvprof. These metrics quantify the utilization rate on a scale of 0 to 10. The utilization rate for each component indicates how much time was spent on that component relative to the total execution time. A value of 0 means idle, while 10 means full utilization. Because many applications run multiple kernels, the maximum utilization rate from any component is selected for that kernel in the benchmark.
Figure 1 shows the average utilization rate and standard deviation of the different functional units and the memory hierarchy for each application in Rodinia. Since Rodinia does not come with default run parameters but rather a sample run command, the parameters in the sample command were used for each application. These sample files can be found in the directory of each benchmark.
One observation is that the utilization of many components is relatively low even though many applications in Rodinia consist of multiple kernels. For example, the average GPU component utilizations look very similar for benchmarks such as gaussian, huffman, nw, and myocyte. Many hardware components are also not stressed enough to reach maximum utilization. This is because Rodinia consists of higher-level applications, whereas suites such as SHOC consist of microbenchmarks targeting specific components. The majority of kernels in each benchmark are not designed to target a specific GPU component.
II-B SHOC
Developed in 2010, SHOC is a suite of benchmarks for heterogeneous computing platforms. Unlike Rodinia, SHOC is divided into two primary categories: stress tests and performance tests. The stress tests use computationally demanding kernels to identify devices with bad memory, insufficient cooling, and other device component related issues. Performance tests, on the other hand, concentrate on synthetic kernels and common parallel algorithms.
All SHOC applications run within a unified framework which accepts user-specified testing parameters such as the number of iterations to run. Detailed metrics, including floating point operations per second (FLOPS), can also be recorded. These features are very useful for evaluating performance.
However, many programs in SHOC are just basic parallel algorithms, which may only be a subset of the routines used in more common and more complex applications. Even though SHOC covers a variety of dwarfs, it doesn’t represent the complexity present in real-world applications. Similar to Rodinia, SHOC was written at a time when newer software and hardware features weren’t available.
Figure 2 shows GPU resource utilization and standard deviation for each application in SHOC. All metrics are collected using the largest preset data size available.
Unlike Rodinia, the utilization rate no longer exhibits a fixed pattern but varies over a diverse range. This is because SHOC consists of microbenchmarks that target specific hardware components. However, most components are not fully stressed to maximum capacity.
II-C Other Benchmarks
More recent benchmarks [6] [7] [8] have addressed irregular parallelism, rather than stressing overall heterogeneous performance. Further, not all new CUDA features have been taken into consideration. Sarita, Alsop, and Sinclair [9] focused on benchmarking the effects of different levels of synchronization (from coarse to fine). Their benchmark suite evaluates synchronization through data sharing (e.g., unified memory and coherent caches). The MAFIA framework [10] was designed to target multi-application execution on GPUs.
Due to the rapid growth in popularity of machine learning, there has been significant focus on characterizing DNN behavior on GPUs [11, 12, 13]. Popular frameworks such as TensorFlow include primitive tools for users to analyze the computational demands of their models. Tango [14] is a framework to study the behavior of specific neural network models. DNNmark [15] is a framework to study various kernels used in DNNs.
III Motivation
The previous section introduces Rodinia and SHOC, analyzing the pros and cons of each. Here we motivate the creation of Mirovia and outline the key areas in which existing benchmark suites can be improved.
III-A GPU Application Domain
GPUs have come to be used in many new domains in recent years that are not represented by workloads in Rodinia or SHOC, e.g., data analytics. Relational algebra and SQL statements are not represented in either Rodinia or SHOC. Similarly, while GPUs have become the standard for DNN model training, these benchmark suites do not include neural network-based kernels. Even though GPUs were designed to maximize throughput, recent introductions of specialized hardware like TPUs [16] suggest that there may be potential for improvement in GPUs. It is necessary to study the behavior of individual kernels to find potential performance improvements.
III-B Better Dataset Sizes
One of the most obvious aspects of existing benchmark suites to be updated is dataset sizes. In SHOC, there are 4 preset data sizes. This lack of flexibility makes it hard for SHOC to stay relevant in the future, as advancing technology will eventually cause even the largest data size to be too small to stress GPU resources. Rodinia has the exact opposite problem, where benchmarks have no preset data size and the user must specify their own problem sizes. Users have to run data generation scripts even if they do not know what input size may be appropriate for the system they are benchmarking. Benchmarking with improperly sized input data throws the validity of the benchmarks into question, particularly when it is quite well established that memory coalescing can be used to great effect on GPUs[17].
III-C Support for Recent CUDA Versions
In addition to general performance improvements, each new version of CUDA typically releases new programming constructs that can be used to write more efficient code. The official Rodinia benchmark suite relies only on features that were available in CUDA 4.0, and thus doesn’t take advantage of the newer CUDA constructs. It is essential to include these new features to understand their impact on performance.
IV The Mirovia Benchmark Suite







Table I: Benchmarks in Mirovia
Level | Benchmark | Dwarf | Application Domain | CUDA Feature
0 | BusSpeedDownload | - | - | -
0 | BusSpeedReadback | - | - | -
0 | DeviceMem | - | - | -
0 | MaxFlops | - | - | Half Precision
1 | GUPS | - | - | -
1 | BFS | Graph traversal | - | Unified Memory
1 | General Matrix Multiply | Dense linear algebra | - | -
1 | Pathfinder | Dynamic programming | - | HyperQ
1 | Sort | Sorting | - | -
2 | CFDSolver | Unstructured grid | Computational fluid dynamics | -
2 | DWT2D | Spectral method | Image processing | -
2 | Kmeans | Dense linear algebra | Data mining | -
2 | LavaMD | N-body | Computational chemistry | -
2 | Mandelbrot | - | Numerical analysis | Dynamic Parallelism
2 | Needleman-Wunsch | Dynamic programming | Bioinformatics | -
2 | ParticleFilter | Structured grid | Medical imaging | -
2 | SRAD | Structured grid | Computer vision | Cooperative Groups
2 | Where | Map Reduce | Data analytics | -
2 | Activation | Unstructured grid | Deep learning | -
2 | Pooling | Dense linear algebra | Deep learning | -
2 | Batchnorm | Unstructured grid | Deep learning | -
2 | Connected | Dense linear algebra | Deep learning | -
2 | Convolution | Dense linear algebra | Deep learning | -
2 | Dropout | Unstructured grid | Deep learning | -
2 | RNN | Dense linear algebra | Deep learning | -
2 | Softmax | Unstructured grid | Deep learning | -
2 | LRN | Unstructured grid | Deep learning | -
In Mirovia, like SHOC, benchmarks are divided into levels. The levels range from benchmarks characterizing low-level characteristics, such as memory bandwidth, to benchmarks measuring performance on real-world applications. While determining a set of benchmarks for Mirovia, consideration was given to both the Berkeley dwarfs and application domains. Table I shows the benchmarks included in Mirovia and their respective primitives and application domains. The Mirovia suite contains the following features:

A new set of benchmarks representing neural network layers commonly used in popular DNN models. This section consists of 15 types of layers and 1 real-world DNN model. They have been parallelized with CUDA APIs and are powered by libraries including cuBLAS and cuDNN (NVIDIA CUDA Deep Neural Network library). We decided to include a neural network model because individual layers are not sufficient to represent real-world application workflows. We use Darknet [18], an open source neural network framework, to construct neural networks. However, most of the kernels in Darknet don’t utilize the cuDNN library, which degrades performance compared to industrial standards like TensorFlow. Thus, we reimplemented the most commonly used kernels with the cuDNN library and removed extra memory operations to reduce memory footprints. We believe these neural network applications will enrich the benchmark diversity in Mirovia.

Mirovia aims to strike a balance between the predetermined input sizes available in SHOC and the customizable input sizes available in Rodinia. Benchmarks contain preset sizes optimized for systems with different compute capabilities, as well as a mechanism through which users can specify the size and other aspects of their input. This feature merges the favorable qualities of both Rodinia and SHOC.

Several benchmarks have been updated to utilize features from recent releases of CUDA. For each feature, one benchmark was chosen to test it. These features include:

HyperQ: it allows multiple independent CUDA kernels to execute in parallel on the same GPU if the resources are available. HyperQ uses 32 independent Work Distributor Queues to detect opportunities for parallelism, whereas older architectures use a single Work Distributor Queue. This feature is implemented in Pathfinder.

Dynamic Parallelism: it enables currently executing CUDA kernels to launch child CUDA kernels (nested parallelism). This feature is useful when running algorithms with hierarchical data structures and recursive algorithms with parallelism at each level. This feature is added to Mandelbrot.

Unified Memory: it is a programming construct that gives the programmer the illusion that the host and the device share an address space. It establishes a single memory address space visible to all processors in the system. When an application accesses data currently absent on the running device, the hardware automatically pages in the data needed by the processor. This feature is implemented in BFS.

Cooperative Groups (Grid Sync): this feature provides another granularity of synchronization for kernel threads running on a GPU. Grid Sync allows users to synchronize all threads in the entire grid before beginning the next section of computation. This feature is useful for programs with disjoint phases of computation running right after one another. It takes the synchronization granularity to the next level, as previous CUDA versions only support __syncthreads(), which synchronizes all threads in a single block. This feature is implemented in SRAD.

CUDA Event API: this feature allows accurate timing of CUDA functions and kernel calls. This is an improvement over Rodinia, which still uses system time.

IV-A Workloads
IV-A1 Level 0
Level 0 benchmarks are designed to measure low-level characteristics of the hardware. Each of these benchmarks does the simple task of measuring a single capability of the GPU and therefore doesn’t represent any dwarfs or application domains.
BusSpeedDownload measures the speed of the PCI bus by repeatedly transferring data of various sizes from the host to the device. The data sizes are varied from 1kb to 500kb.
BusSpeedReadback measures the speed of the PCI bus, except in the opposite direction. Here, data is transferred from the device to the host.
DeviceMemory measures the bandwidth of different components of the memory hierarchy on the device. This includes global, constant, and shared memory.
MaxFlops (Half Precision) measures the maximum achievable floating point operations per second on the device. In SHOC, this benchmark runs tests using single and double precision; Mirovia adds a half-precision test, which will only work on architectures that support half-precision floating point arithmetic.
IV-A2 Level 1
Level 1 benchmarks include basic parallel algorithms which are common tasks in parallel computing and are often used in kernels of real applications. While these applications represent a subset of the Berkeley dwarfs, they are complex enough to represent real application domains.
GUPS (Random Memory Access) stands for giga-updates per second. It measures how frequently a computer can issue updates to randomly generated RAM locations. This benchmark stresses the latency and bandwidth of the device. This test is important because random memory performance maps directly to application performance.
Breadth First Search (Unified Memory) runs and measures the performance of breadth-first search, a common graph traversal algorithm. This application was included because it is control-flow intensive. This benchmark was also chosen to test the unified memory feature in CUDA.
General Matrix Multiply is an application that measures the performance of different types of matrix multiplications. These include single and double precision tests with and without transposing the input matrices.
Pathfinder (HyperQ) is an application that runs a shortest-path algorithm which serves as a test of irregular parallelism. While most conventional parallel algorithms behave uniformly across threads, irregular algorithms are characterized by different threads following different execution paths. Depending on graph connectivity, different threads can experience unique behaviors. In addition, Pathfinder will experience much higher control flow unit utilization than algorithms with regular parallelism, as each thread needs to decide independently how to execute. Therefore, we decided to include HyperQ in this test.
Sort is an application that runs a fast radix sort [19] on an array of integers. It operates on key-value pairs of single-precision floating point data.
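As a rough illustration of the radix sort idea behind this benchmark, the sketch below performs a stable least-significant-digit radix sort on key-value pairs in plain Python. It is sequential and assumes non-negative integer keys; the function name `radix_sort_pairs` and the parameters `key_bits` and `radix_bits` are hypothetical and are not taken from Mirovia or from [19].

```python
def radix_sort_pairs(pairs, key_bits=32, radix_bits=4):
    """Stable LSD radix sort of (key, value) pairs.

    Assumes keys are non-negative integers smaller than 2**key_bits.
    """
    mask = (1 << radix_bits) - 1
    for shift in range(0, key_bits, radix_bits):
        # Histogram of the current digit across all keys.
        counts = [0] * (1 << radix_bits)
        for key, _ in pairs:
            counts[(key >> shift) & mask] += 1
        # Exclusive prefix sum: first output slot for each digit value.
        offsets, running = [], 0
        for count in counts:
            offsets.append(running)
            running += count
        # Stable scatter into the output array.
        out = [None] * len(pairs)
        for key, value in pairs:
            digit = (key >> shift) & mask
            out[offsets[digit]] = (key, value)
            offsets[digit] += 1
        pairs = out
    return pairs
```

On a GPU, the histogram, prefix sum, and scatter phases are the parts that are typically parallelized; here they are written as simple loops for readability.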
IV-A3 Level 2
Level 2 benchmarks are real-world application kernels. Benchmarks in this level are applications that can be found in industry, and therefore represent a variety of GPU application domains. These applications exhibit diverse performance characteristics.
CFD Solver is a computational fluid dynamics benchmark. This application solves the three-dimensional Euler equations for compressible flow. This workload optimizes effective GPU memory bandwidth by reducing total global memory accesses and overlapping computation.
GPUDWT implements the discrete wavelet transform, an image and video compression algorithm that is also a widely used digital signal processing technique. This benchmark implements both forward and reverse transforms, as well as the 9/7 and 5/3 transforms. The 9/7 transform uses floats while the 5/3 transform uses integers, so it’s important to measure the performance of both.
KMeans is a popular clustering algorithm used in data mining. This algorithm shows a high degree of data parallelism. At the beginning, K centers are chosen. In each iteration, each data point is assigned to its nearest center, and at the end of each iteration, each center is recomputed as the mean of all the data points in its cluster. This repeats until the assignments and the centers converge.
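The assign-then-recompute loop described above can be sketched in a few lines of Python. This is a minimal sequential sketch, not the benchmark's CUDA code; it assumes squared Euclidean distance, caller-supplied initial centers, and convergence defined as "assignments stop changing". The name `kmeans` and the parameter `max_iters` are illustrative.

```python
def kmeans(points, centers, max_iters=100):
    """Lloyd-style k-means on tuples of equal length.

    Returns (centers, assignment), where assignment[i] is the index of the
    center that points[i] belongs to.
    """
    centers = list(centers)  # do not mutate the caller's list
    assignment = None
    for _ in range(max_iters):
        # Assignment step: each point picks its nearest center.
        new_assignment = [
            min(range(len(centers)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centers[c])))
            for point in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: each center becomes the mean of its members.
        for c in range(len(centers)):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assignment
```

The assignment step is independent for every point, which is where the data parallelism mentioned above comes from.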
LavaMD calculates N-body particle interaction. The code calculates particle potential and relocation due to mutual forces between particles within a large 3D space. This space is divided into cubes, or large boxes, that are allocated to individual cluster nodes. The large box at each node is further divided into cubes, called boxes. 26 neighbor boxes surround each box (the home box). Home boxes at the boundaries of the particle space have fewer neighbors. Particles only interact with other particles that are within a cutoff radius, since ones at larger distances exert negligible forces. The box size is therefore chosen so that the cutoff radius does not span beyond any neighbor box for any particle in a home box, limiting the reference space to a finite number of boxes.
Mandelbrot (Dynamic Parallelism) computes an image of a Mandelbrot fractal, a self-repeating geometric pattern that loops back on itself at ever-decreasing sizes. A commonly used algorithm is the Escape Time algorithm, which calculates the value of each pixel independently. This benchmark was added specifically to test Dynamic Parallelism, a feature added to CUDA in version 5.0. With Dynamic Parallelism, the benchmark switches to the Mariani-Silver algorithm. Unlike Escape Time, this procedure starts out coarse-grained and only iterates at a finer resolution where necessary for certain subsections.
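To make the per-pixel Escape Time computation concrete, here is a small Python sketch. It is not the benchmark's kernel; the viewport bounds, the escape radius of 2, and the names `escape_time` and `render` are assumptions chosen for illustration.

```python
def escape_time(c, max_iter=256):
    """Number of iterations of z -> z*z + c before |z| exceeds 2."""
    z = 0j
    for i in range(max_iter):
        if abs(z) > 2.0:
            return i
        z = z * z + c
    return max_iter  # treated as "inside the set"


def render(width, height, max_iter=256):
    """Compute the escape time of every pixel independently."""
    rows = []
    for py in range(height):
        row = []
        for px in range(width):
            # Map the pixel to the region [-2, 1] x [-1.5, 1.5] (assumed viewport).
            c = complex(-2.0 + 3.0 * px / width, -1.5 + 3.0 * py / height)
            row.append(escape_time(c, max_iter))
        rows.append(row)
    return rows
```

Every pixel is evaluated here regardless of its neighbors. A Mariani-Silver style approach would instead evaluate only the border of a rectangle first and subdivide when the border values differ, which is the kind of recursion that nested kernel launches express naturally.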
Needleman-Wunsch is a nonlinear global optimization method for DNA sequence alignments. The potential pairs of sequences are organized in a 2D matrix. In the first step, the algorithm fills the matrix from top left to bottom right, step by step. The optimum alignment is the pathway through the array with the maximum score, where the score of a cell is the value of the maximum weighted path ending at that cell. Thus, the value of each data element depends on the values of its northwest-, north-, and west-adjacent elements. In the second step, the maximum path is traced backward to deduce the optimal alignment.
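The two steps above (matrix fill, then traceback) can be sketched as follows. This is a sequential Python illustration under an assumed scoring scheme (match +1, mismatch -1, gap -1); the benchmark's actual scoring and data layout are not specified here, and the function name is illustrative.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b.

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Step 1: fill from top left to bottom right. Each cell depends on its
    # northwest, north, and west neighbors.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Step 2: trace the maximum path backward from the bottom-right cell.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], ''.join(reversed(out_a)), ''.join(reversed(out_b))
```

Because each cell depends only on its three upper-left neighbors, all cells on the same anti-diagonal can be computed in parallel, which is the usual way this fill is mapped to a GPU.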
ParticleFilter is a statistical estimator of the location of a target object given noisy measurements of that target’s location and an idea of the object’s path in a Bayesian framework. The particle filter (PF) has a plethora of applications, ranging from video surveillance in the form of tracking vehicles, cells, and faces to video compression. This particular implementation is optimized for tracking cells, particularly leukocytes and myocardial cells.
SRAD (Cooperative Groups) is a computer vision application used for reducing noise, or “speckles”, in images without destroying important image features. This is done using partial differential equations. Since each stage of this application operates on the entire image, SRAD requires synchronization after each stage. This makes SRAD the ideal benchmark to test the performance of using cooperative groups in CUDA.
Where is a new relational algebra benchmark. GPUs are becoming increasingly popular for data analytics because relational algebra operations are easy to parallelize. This benchmark acts like a filter for a set of records, returning the subset of the input records that meet a set of conditions. It first maps each entry to a 1 or 0, before running a prefix sum and using both of these auxiliary data structures to reduce the input data to just the matching entries.
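The map, prefix sum, and compaction steps described above can be sketched in Python as follows. This is a minimal sequential sketch rather than Mirovia's CUDA implementation; the function name `where` and the `predicate` argument are illustrative assumptions.

```python
from itertools import accumulate


def where(records, predicate):
    """Return the records that satisfy predicate, in their original order."""
    # Map: 1 if the record matches, 0 otherwise.
    flags = [1 if predicate(r) else 0 for r in records]
    # Exclusive prefix sum: output index of each matching record.
    inclusive = list(accumulate(flags))
    positions = [s - f for s, f in zip(inclusive, flags)]
    total = inclusive[-1] if inclusive else 0
    # Compaction: scatter matching records to their computed positions.
    out = [None] * total
    for record, flag, pos in zip(records, flags, positions):
        if flag:
            out[pos] = record
    return out
```

For example, `where([3, 8, 1, 9, 4], lambda r: r > 3)` yields `[8, 9, 4]`. The prefix sum is what lets every matching record know its output slot without coordinating with other records, so the final scatter can run in parallel.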
IV-A4 DNN Kernels
All benchmarks in this section represent artificial neural network layers commonly seen in popular DNN models. All layers in this section include both forward and backward passes.
Activation layers decide whether a neuron should be activated, based on the weighted sum of its inputs plus a bias. They introduce nonlinearity into the output of a neuron. Some of the most commonly used activation functions include ReLU, sigmoid, tanh, and LeakyReLU. Here we only present ReLU since it is the simplest one to understand. The following describes the ReLU activation function. Note that $x$ represents the input to the neuron and $f(x)$ is the output.
$$f(x) = \max(0, x) \tag{1}$$
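Since all layers in this section include both forward and backward passes, a small Python sketch of the two passes for ReLU may help. It is an illustration only, not the cuDNN kernel used by the benchmark; the function names are hypothetical, and the gradient at exactly zero is taken to be zero by convention.

```python
def relu_forward(xs):
    """Element-wise ReLU: max(0, x)."""
    return [max(0.0, x) for x in xs]


def relu_backward(xs, grad_out):
    """Pass the upstream gradient through where the input was positive."""
    return [g if x > 0 else 0.0 for x, g in zip(xs, grad_out)]
```

Both passes touch each element exactly once with almost no arithmetic, which is why this kind of layer tends to be limited by memory traffic rather than compute.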
Pooling is commonly used between successive convolution layers in a ConvNet architecture. Its main function is to reduce the spatial dimensions of a convolutional neural network. For example, applying a max-pool kernel whose size matches that of the input matrix will yield the largest number in the matrix, while an average-pool kernel of the same size will produce the average of the values in the matrix. For simplicity, we include only the average pool layer in the report.
Batch normalization is a technique proposed to solve covariate shift [20] in DNNs. When parameters in the preceding layer change, the input to the current layer will change accordingly, causing the current layer to adjust to the new distribution. The main goal of batch normalization is to limit this shifting to a certain range in order to speed up the training process and produce reliable models.
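As an illustration of how batch normalization limits the shift in a layer's input distribution, the sketch below normalizes one feature over a mini-batch to zero mean and unit variance, then applies a learned scale and offset. This follows the commonly used formulation of the forward pass and is an assumption about the general technique, not a description of the cuDNN kernel profiled here; `gamma`, `beta`, and `eps` are conventional parameter names.

```python
import math


def batchnorm_forward(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a list of scalar activations for one feature.

    gamma and beta are the learned scale and offset; eps avoids division
    by zero when the batch variance is very small.
    """
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [gamma * (x - mean) * inv_std + beta for x in batch]
```

Computing the mean and variance requires two reductions over the batch before any output can be written, so the layer reads its input several times while doing little arithmetic per element.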
Connected layers are those whose neurons are connected to every neuron in the next layer. The connected layer can be seen as a feature vector that holds aggregated information from the previous layer. For example, a connected layer can come right after a convolution layer which provides a low-dimensional invariant feature space. The fully connected layer can then learn a function from that space to produce more useful or abstract knowledge.
Convolution layers are mostly used to extract important features from images by assigning learnable weights to various objects in those images. For example, given an RGB image with 3 channels, we can train a convolution kernel with a matching number of channels and a chosen stride to produce an output tensor with a single channel. The output tensor represents one feature in the image, such as the presence of curves in different parts of the input image.
Dropout is a regularization technique used to prevent neural networks from overfitting [21]. The key idea is to randomly drop units from the neural network during training. When training large neural networks on small data sets, overfitting can be a huge issue when the model is evaluated on a test data set. Dropout solves this problem by stochastically introducing noise to prevent units from co-adapting too much, thus making the model more robust.
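The random dropping of units can be sketched as follows. This Python sketch uses the "inverted dropout" convention, in which kept activations are scaled by 1 / (1 - drop_prob) during training; that convention, the function names, and the explicit `rng` argument are assumptions for illustration and are not taken from the benchmark.

```python
import random


def dropout_forward(xs, drop_prob, rng):
    """Zero each activation with probability drop_prob (must be < 1).

    Returns (outputs, mask); the mask is reused in the backward pass.
    """
    keep = 1.0 - drop_prob
    mask = [1.0 / keep if rng.random() < keep else 0.0 for _ in xs]
    return [x * m for x, m in zip(xs, mask)], mask


def dropout_backward(grad_out, mask):
    """Gradients flow only through the units that were kept."""
    return [g * m for g, m in zip(grad_out, mask)]
```

A call such as `dropout_forward(activations, 0.5, random.Random(0))` gives a reproducible mask, which is convenient when comparing runs.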
RNN stands for Recurrent Neural Network. It is widely adopted in learning tasks dealing with sequential data, such as speech recognition and text generation. RNNs have proven to be successful in capturing the dynamics of sequences by keeping internal states (memory) which track information from previous time steps. Among the most commonly used RNNs are GRU and LSTM. In our benchmark, we only show results for LSTM for simplicity.
Softmax layers are typically used as the final output layer in a neural network to perform multi-class classification. A softmax layer takes an input vector of score values $z_i$, each usually in $(-\infty, +\infty)$, and recomputes them as probabilities. Therefore, the outputs of the layer represent a true probability distribution, in which the individual outputs sum to 1. The calculation is shown below:
$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}} \tag{2}$$
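A short Python sketch of the softmax computation follows. Subtracting the maximum score before exponentiating is a standard numerical-stability step that does not change the result; it is an addition for illustration and is not stated in the text above.

```python
import math


def softmax(scores):
    """Convert a list of real-valued scores into probabilities."""
    m = max(scores)  # shift by the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The outputs are non-negative and sum to 1 (up to floating point rounding), and larger scores always map to larger probabilities.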
LRN (Local Response Normalization) is intended to simulate a form of lateral inhibition [22] inspired by the type found in real neurons. It diminishes responses that are uniformly large within a neighborhood and creates higher contrast in the activation map. This feature is especially useful with unbounded activation functions such as ReLU. The original formula is written as
$$b^{i}_{x,y} = a^{i}_{x,y} \Big/ \left(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left(a^{j}_{x,y}\right)^{2}\right)^{\beta} \tag{3}$$
where $b^{i}_{x,y}$ is the regularized output for kernel $i$ at position $(x, y)$, $a^{i}_{x,y}$ is the source output of kernel $i$ applied at position $(x, y)$, $N$ is the number of kernels, $n$ is the size of the normalization neighbourhood, and $k$, $\alpha$, and $\beta$ are hyperparameters of LRN.
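To make the cross-kernel normalization concrete, the sketch below applies it to the activations of all kernels at a single spatial position. It follows the widely cited formulation of LRN; the default hyperparameter values are the ones commonly quoted for that formulation and are assumptions here, not values reported for Mirovia.

```python
def lrn(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Local response normalization across kernels at one position.

    a[i] is the activation of kernel i; n is the neighbourhood size;
    k, alpha, and beta are hyperparameters.
    """
    num_kernels = len(a)
    out = []
    for i in range(num_kernels):
        lo = max(0, i - n // 2)
        hi = min(num_kernels - 1, i + n // 2)
        # Sum of squared activations over the neighbouring kernels.
        sq_sum = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (k + alpha * sq_sum) ** beta)
    return out
```

Each output depends on a small window of neighboring kernels, so an activation surrounded by other large activations is scaled down more than an isolated one.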
V Evaluation
In this section, we evaluate the applications in Mirovia in terms of runtime characteristics, diversity, and performance. Our tests were performed on a machine with the following specifications:
Table II: Most relevant kernel of each DNN benchmark
Benchmark | Forward Kernel | Backward Kernel
Activation | op_generic_tensor_kernel | op_generic_tensor_kernel
Pooling | pooling_fw_4d_kernel | pooling_bw_kernel_avg
Batchnorm | bn_fw_tr_1C11_kernel_NCHW | bn_bw_1C11_kernel_new
Connected | maxwell_sgemm_128x64_tn | sgemm_128x128x8_TN_vec
Convolution | maxwell_scudnn_128x128_relu_small_nn | wgrad_alg0_engine
Dropout | dropout_fp | dropout_bp
RNN | maxwell_sgemm_128x64_tn | maxwell_sgemm_128x64_nn
Softmax | softmax_fw_kernel_resident | softmax_bw_kernel
LRN | lrnForward_evenC | lrnBackward_evenC

Ubuntu 18.04.2 LTS

Linux 4.15.0-48-generic

CPU: 2x Intel(R) Xeon(R) CPU E5-2650 v4

12 Core, 24 Thread

2.20 GHz Base Freq, 2.90 GHz Turbo Freq

256K L2 Cache

30M L3 Cache


126 GB DDR4 Memory

GPU: Nvidia Tesla P100-SXM2

Driver: 418.40.04

16 GB HBM2 Memory

1328 MHz SM clock speed

715 MHz Memory clock

V-A Benchmark Performance
We use the nvprof profiling tool to collect metrics from running individual kernels. Note that a number of benchmarks involve multiple kernels, and some are redundant. We select the maximum utilization of each kernel and calculate their mean and corresponding standard deviation. The utilization rates of the memory and computational units are provided in Figure 3 and Figure 4.
The utilization rate of different GPU components shows a diverse set of behaviors for both forward and backward passes in the DNN section of the benchmark. We observe that the most utilized components are DRAM and the single-precision floating point function unit, with backward average pooling being an exception with high utilization of shared memory and the load/store function unit.
Table II shows the most relevant kernel of each DNN benchmark; metrics were collected for each of these kernels. The utilization of the single-precision floating point function unit is closely related to the IPC for most kernels. For example, both the forward and backward kernels for convolution result in high IPC, whereas low utilization of the single-precision floating point function unit results in low IPC for the batch normalization kernel. The number of eligible warps per cycle is also high for convolution and low for batch normalization. This can be explained by the fact that convolution has relatively good data locality and spends less time waiting for data dependencies to be met, whereas batch normalization requires more memory operations, which reduces the number of warps eligible to issue the next instruction. This is a sign that the convolution operation is compute bound and batch normalization is memory bound.
The utilization for the remaining benchmarks in Mirovia shows a diverse range of values. The utilization of each GPU component is increased compared to SHOC. This can be explained by the increase in input data size, which demonstrates the importance of having user-defined input problem sizes to stress hardware performance. These benchmarks also differ from the DNN kernels. DNN kernels tend to stress DRAM and single-precision function units heavily, while the conventional benchmarks exhibit a more diverse utilization of each component.
V-B CUDA Feature Analysis
In this section, we analyze benchmarks which implement new CUDA features to find out how each feature affects the performance of applications. To do this, we show the speedup of applications using the feature over the various preset problem sizes available in Mirovia.
Unified Memory: For this feature, the kernel time plus the transfer time of BFS without unified memory was compared to the kernel time of BFS with unified memory, since there is no explicit transfer time when using unified memory. Three different versions of BFS using unified memory were tested and each was compared to a version of BFS that doesn’t utilize any new features. The first version uses unified memory without cudaMemAdvise() or cudaMemPrefetchAsync(). The second version uses unified memory with only cudaMemAdvise(), and the last version uses unified memory with both cudaMemAdvise() and cudaMemPrefetchAsync().
We found that BFS with unified memory was able to run faster than the baseline version only when prefetching was introduced. Additionally, the speedup was fairly inconsistent and did not scale with the input size. This is because the execution path is highly dependent on the generated graph. Since data is randomly generated, this introduces randomness to the speedup over various problem sizes. The result is reasonable because constant demand paging introduces execution overheads.
HyperQ: HyperQ was added to the level 1 Pathfinder benchmark. Since this CUDA feature increases utilization when multiple independent kernels can execute concurrently, we ran multiple instances of Pathfinder on different streams. The graph shows the speedup as the number of concurrent Pathfinder kernels increases. The transfer time is not included because it would stay the same regardless.
Our result shows that the speedup gained from HyperQ increases as the number of parallel kernels scales up. This speedup levels out around 32 parallel instances. This makes sense as at this point the benchmark is saturating all 32 independent work queues. In addition to this we see speedup starting at a little under 1x for a single instance, and up to 4x thereafter. This follows as increasing the number of instances here makes use of more of the work queues that are just sitting idle.
Cooperative Groups: Similar to HyperQ, the kernel time for SRAD using a cooperative kernel was compared to the kernel time for the original SRAD implementation. In this case, the transfer time was not included because it would have been the same for both.
The biggest drawback of using cooperative groups is the limit on the number of blocks that can be launched, since all blocks of a cooperative kernel must be resident on the GPU at the same time. Because of this, SRAD using a cooperative kernel could not be run on image sizes greater than 256x256. Therefore, to get more data points, we varied the problem size by multiples of 16 instead of powers of 2.
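The block limit can be illustrated with a little arithmetic. The sketch below is written in Python rather than CUDA and rests on assumptions that are not stated in the text: each block is assumed to cover a 16x16 pixel tile, and the device figures (64 SMs, 4 resident blocks per SM) are hypothetical values chosen only so that the arithmetic lands on a limit of 256 blocks. In a real CUDA program these two figures would typically come from cudaDeviceProp::multiProcessorCount and cudaOccupancyMaxActiveBlocksPerMultiprocessor().

```python
def blocks_needed(image_side, tile_side=16):
    """Blocks in a square grid where each block covers a tile_side x tile_side tile.

    tile_side=16 is an assumption made for this sketch.
    """
    tiles_per_dim = (image_side + tile_side - 1) // tile_side  # ceiling division
    return tiles_per_dim * tiles_per_dim


def cooperative_block_limit(num_sms, resident_blocks_per_sm):
    """Upper bound on a cooperative grid: every block must be resident at once."""
    return num_sms * resident_blocks_per_sm


# Hypothetical device figures, not measurements from the paper.
limit = cooperative_block_limit(num_sms=64, resident_blocks_per_sm=4)

# Step the image side by 16 pixels, which adds one tile per dimension each step.
for side in range(192, 289, 16):
    needed = blocks_needed(side)
    status = "fits" if needed <= limit else "exceeds limit"
    print(f"{side}x{side}: {needed} blocks, {status}")
```

Stepping the side length by 16 rather than doubling it gives many more sizes below the cutoff: there are 16 multiples of 16 up to 256, but only 5 powers of 2 between 16 and 256.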
Dynamic Parallelism: For this feature, the speedup was measured using the kernel times for Mandelbrot with and without Dynamic Parallelism. As with most of the other features, transfer time was not included in the speedup because it was the same for both versions of the benchmark. This benchmark shows one of the cleanest increases in speedup as problem size grows. This primarily comes down to the efficiency of the two algorithms used and what Dynamic Parallelism allows. While the traditional Escape Time algorithm must calculate a value for every pixel, the Mariani-Silver algorithm can subdivide the image and skip ever larger regions of it. This is reflected in the increasing speedup as image size increases.
Although this benchmark uses Dynamic Parallelism for regular parallelism, the feature also supports more varied patterns. The Mandelbrot kernel recursively launches itself on smaller patches of the image, whereas other programs may use a parent kernel that launches several different child kernels. Mandelbrot is nonetheless sufficient to stress Dynamic Parallelism, which is why we chose to include it in Mirovia.
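The difference in work between the two algorithms can be seen in a small CPU-side sketch. This is not the Mirovia kernel: it is written in Python, plain recursion stands in for the child kernel launches that Dynamic Parallelism provides, and the view window, iteration cap, and minimum rectangle side are all assumptions chosen for illustration. It relies on the usual Mariani-Silver observation that a rectangle whose border pixels all share one dwell value can be filled without evaluating its interior.

```python
def dwell(cx, cy, max_iter=64):
    """Escape-time count: iterations of z = z*z + c before |z| exceeds 2."""
    z, c = 0j, complex(cx, cy)
    for i in range(max_iter):
        if abs(z) > 2.0:
            return i
        z = z * z + c
    return max_iter


def mariani_silver(size, max_iter=64, min_side=4):
    """Return (image, evaluated) for a size x size view of [-2, 1] x [-1.5, 1.5].

    `evaluated` counts the pixels whose dwell was actually computed; the plain
    Escape Time algorithm would compute all size * size of them.
    """
    image = [[None] * size for _ in range(size)]
    evaluated = 0
    step = 3.0 / size

    def pixel(px, py):
        # Compute a pixel's dwell once and cache it in the image.
        nonlocal evaluated
        if image[py][px] is None:
            evaluated += 1
            image[py][px] = dwell(-2.0 + (px + 0.5) * step,
                                  -1.5 + (py + 0.5) * step, max_iter)
        return image[py][px]

    def solve(x0, y0, w, h):
        # Evaluate the border of the rectangle.
        border = set()
        for x in range(x0, x0 + w):
            border.add(pixel(x, y0))
            border.add(pixel(x, y0 + h - 1))
        for y in range(y0, y0 + h):
            border.add(pixel(x0, y))
            border.add(pixel(x0 + w - 1, y))
        if len(border) == 1:
            # Uniform border: fill the interior without evaluating it.
            value = border.pop()
            for y in range(y0 + 1, y0 + h - 1):
                for x in range(x0 + 1, x0 + w - 1):
                    image[y][x] = value
        elif w <= min_side or h <= min_side:
            # Small rectangle: fall back to per-pixel evaluation.
            for y in range(y0 + 1, y0 + h - 1):
                for x in range(x0 + 1, x0 + w - 1):
                    pixel(x, y)
        else:
            # Mixed border: subdivide into four and recurse
            # (the step a GPU version would express as child kernel launches).
            hw, hh = w // 2, h // 2
            solve(x0, y0, hw, hh)
            solve(x0 + hw, y0, w - hw, hh)
            solve(x0, y0 + hh, hw, h - hh)
            solve(x0 + hw, y0 + hh, w - hw, h - hh)

    solve(0, 0, size, size)
    return image, evaluated
```

Calling mariani_silver(64) returns a fully populated image together with the number of pixels that were actually evaluated, which can be compared against the 64 * 64 evaluations that the Escape Time algorithm would need.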
VI Conclusion and Future Work
In creating Mirovia, we aimed to modernize aspects of popular existing suites such as Rodinia and SHOC. To do this, we broadened the spread of benchmarks included, bringing in new programs from different domains, while also adapting problem sizes to the capabilities of modern hardware. Most importantly, we added support for measuring the performance of many features introduced in recent years. Features like Cooperative Groups, HyperQ, and half precision arithmetic are all new enough that no current suite tests them at all. To capture the behavior of DNNs on GPUs, we included a set of popular neural network kernels. In this way, we present Mirovia as a more complete benchmark suite for the modern era. Plans for future work include:

Add benchmark support for GPUDirect RDMA. This feature allows direct data exchange between devices on a PCI Express bus. Using it greatly reduces the time required for data transfers by bypassing the usual memory copies across various data planes. This is especially useful for pipelined data processing workloads in which multiple disjoint transformations act on the same data. GPUDirect RDMA allows data to stay off the host the entire time, moving directly from stage to stage of the workload.

Extend our benchmark diversity analysis using the Principal Component Analysis (PCA) and hierarchical clustering analysis described in [23].

Incorporate newer CUDA features, such as CUDA Graphs, into our benchmarks to speed up the programs.

Add new benchmarks to test new hardware features such as Tensor Cores [24] in more recent architectures. Tensor Cores are specialized hardware units designed to perform the mixed-precision matrix computations commonly used in deep learning training and inference.
References
 [1] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. Lee, and K. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,” in 2009 IEEE International Symposium on Workload Characterization (IISWC), pp. 44–54, Oct 2009.
 [2] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter, “The scalable heterogeneous computing (SHOC) benchmark suite,” in Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU-3, (New York, NY, USA), pp. 63–74, ACM, 2010.
 [3] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015. Software available from tensorflow.org.
 [4] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” NIPSW, 2017.
 [5] K. Asanović, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick, “The landscape of parallel computing research: A view from berkeley,” Tech. Rep. UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.
 [6] M. Kulkarni, M. Burtscher, C. Cascaval, and K. Pingali, “Lonestar: A suite of parallel irregular programs,” in 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), (Los Alamitos, CA, USA), IEEE Computer Society, Apr 2009.
 [7] S. Che, B. M. Beckmann, S. K. Reinhardt, and K. Skadron, “Pannotia: Understanding irregular gpgpu graph applications,” in 2013 IEEE International Symposium on Workload Characterization (IISWC), pp. 185–195, Sep. 2013.
 [8] M. A. O’Neil and M. Burtscher, “Microarchitectural performance characterization of irregular gpu kernels,” in 2014 IEEE International Symposium on Workload Characterization (IISWC), pp. 130–139, Oct 2014.
 [9] M. D. Sinclair, J. Alsop, and S. V. Adve, “Heterosync: A benchmark suite for fine-grained synchronization on tightly coupled gpus,” in 2017 IEEE International Symposium on Workload Characterization (IISWC), pp. 239–249, Oct 2017.
 [10] A. Jog, O. Kayiran, T. Kesten, A. Pattnaik, E. Bolotin, N. Chatterjee, S. W. Keckler, M. T. Kandemir, and C. R. Das, “Anatomy of gpu memory system for multi-application execution,” in Proceedings of the 2015 International Symposium on Memory Systems, MEMSYS ’15, (New York, NY, USA), pp. 223–234, ACM, 2015.
 [11] H. Zhu, M. Akrout, B. Zheng, A. Pelegris, A. Jayarajan, A. Phanishayee, B. Schroeder, and G. Pekhimenko, “Benchmarking and analyzing deep neural network training,” in 2018 IEEE International Symposium on Workload Characterization (IISWC), pp. 88–100, Sep. 2018.
 [12] J. Lew, D. A. Shah, S. Pati, S. Cattell, M. Zhang, A. Sandhupatla, C. Ng, N. Goli, M. D. Sinclair, T. G. Rogers, and T. M. Aamodt, “Analyzing machine learning workloads using a detailed gpu simulator,” in 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 151–152, March 2019.
 [13] S. A. Mojumder, M. S. Louis, Y. Sun, A. K. Ziabari, J. L. Abellàn, J. Kim, D. Kaeli, and A. Joshi, “Profiling dnn workloads on a volta-based dgx-1 system,” in 2018 IEEE International Symposium on Workload Characterization (IISWC), pp. 122–133, Sep. 2018.
 [14] A. Karki, C. P. Keshava, S. M. Shivakumar, J. Skow, G. M. Hegde, and H. Jeon, “Tango: A deep neural network benchmark suite for various accelerators,” Jan 2019.
 [15] S. Dong and D. Kaeli, “Dnnmark: A deep neural network benchmark suite for gpus,” in Proceedings of the General Purpose GPUs, GPGPU-10, (New York, NY, USA), pp. 63–72, ACM, 2017.
 [16] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-datacenter performance analysis of a tensor processing unit,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1–12, June 2017.
 [17] B. Pham, V. Vaidyanathan, A. Jaleel, and A. Bhattacharjee, “Colt: Coalesced large-reach tlbs,” in 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 258–269, Dec 2012.
 [18] J. Redmon, “Darknet: Open source neural networks in c.” http://pjreddie.com/darknet/, 2013–2016.
 [19] N. Satish, M. Harris, and M. Garland, “Designing efficient sorting algorithms for manycore gpus,” in 2009 IEEE International Symposium on Parallel & Distributed Processing, pp. 1–10, May 2009.
 [20] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015.
 [21] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.

 [22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25 (F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, eds.), pp. 1097–1105, Curran Associates, Inc., 2012.
 [23] N. Goswami, R. Shankar, M. Joshi, and T. Li, “Exploring gpgpu workloads: Characterization methodology, analysis and microarchitecture evaluation implications,” in IEEE International Symposium on Workload Characterization (IISWC’10), pp. 1–10, Dec 2010.
 [24] S. Markidis, S. Wei Der Chien, E. Laure, I. Peng, and J. S. Vetter, “Nvidia tensor core programmability, performance & precision,” Mar 2018.