A Simple Model for Portable and Fast Prediction of Execution Time and Power Consumption of GPU Kernels

01/20/2020
by Lorenz Braun, et al.
University of Heidelberg

Characterizing compute kernel execution behavior on GPUs for efficient task scheduling is a non-trivial task. We address this with a simple model enabling portable and fast predictions among different GPUs using only hardware-independent features. This model is built using random forests and 189 individual compute kernels from benchmarks such as Parboil, Rodinia, Polybench-GPU and SHOC. Evaluation of the model performance using cross-validation yields a median Mean Absolute Percentage Error (MAPE) of 8.86-52.00% across different GPUs, while latency for a single prediction varies between 15 and 108 milliseconds.

1. Introduction

GPUs are massively parallel multi-processors, and offer a tremendous amount of performance in terms of operations per second, memory bandwidth and energy efficiency. As a result, they are being used pervasively in areas outside visual computing, including scientific and technical computing, machine learning and data analytics. Programs running on GPUs are expressed as compute kernels, which are code regions compiled separately for such co-processors, but called from the main host processor. As the GPU execution model demands a high amount of structured parallelism, such kernels are typically well-structured and behave regularly by avoiding fine-grained control flow.

GPU computing is a prime example for heterogeneous computing, which ultimately requires tools that reason about the most suitable processor for a given workload or kernel. Scheduling tasks can benefit tremendously from predictive modeling, as predictions of execution time allow selecting the fastest processor for a given workload. The same applies to other tasks, including system provisioning and procurement, sub-task scheduling for overlapping communication and computation, and settings in which power consumption replaces execution time as the key metric.

If such a predictive model (in the following: model) exists, based solely on hardware-independent features, it would allow reasoning about time and power for different GPU architectures and models, making it possible to identify the most effective ones in terms of performance per unit cost. Such features can include instruction counts (floating point operations, integer operations, memory operations on different address spaces, etc.), or the thread hierarchy of the kernel in execution (kernel launch configuration), but not hardware-dependent features like cache hit rates.

Various performance and power models already exist, also for GPUs (Koike and Sadakane, 2014; Madougou et al., 2016; Wu et al., 2015; Carroll and Wong, 2017; Zhang and Owens, 2011; Hong and Kim, 2009, 2010; Huang et al., 2014; Baghsorkhi et al., 2010; Song et al., 2013; Chen et al., 2011; Lim et al., 2014; Nagasaka et al., 2010; Guerreiro et al., 2018; Majumdar et al., 2017; Wang et al., 2019; Johnston et al., 2018; Amarís et al., 2016; Lehnert et al., 2016; Salaria et al., 2019; Spafford and Vetter, 2012; Reisert et al., 2017). They are usually based on: (1) executing the program under observation, with additional costs depending on the required execution statistics; (2) collecting execution statistics using the processor's performance counters; and (3) inferring execution time and power consumption based on these statistics. As a result, such models rely on a variety of input features, in particular also hardware-related features like cache hit rates. Notably, some models yield good prediction performance without the use of such hardware-related features (Hong and Kim, 2010; Baghsorkhi et al., 2010). Some previous work has used analytical models (e.g., Nagasaka et al., 2010; Hong and Kim, 2009, 2010), but machine learning-based methods such as Artificial Neural Networks (ANNs) have demonstrated substantially improved accuracy (e.g., Song et al., 2013; Wu et al., 2015). The same applies to vendor tools: Intel's RAPL and NVIDIA's NVML are power measurement tools that are sometimes based on modeling techniques and, in particular, require executing a program in order to obtain knowledge about power consumption. While various solutions have been proposed, two particular downsides are apparent: first, it is not documented how well those models fit other GPU architectures and models (lack of portability). Second, there is no publicly available performance and power model for GPUs (lack of availability).

Furthermore, according to our experience and review of publicly available GPU benchmark suites (Che et al., 2009; Stratton et al., ; Danalis et al., 2010; Grauer-Gray et al., 2012), GPU kernels are usually well-structured, sufficiently optimized for locality and latency-tolerant. Based on this, we form the following hypothesis: if GPU kernels are well-structured, locality-optimized and latency-tolerant, their behavior in terms of time and power consumption should be rather agnostic of hardware-related dynamic effects such as cache hit rates, coalescing or bank conflicts. On the contrary, GPU kernel behavior should be mainly determined by the kernel code, the kernel launch configuration, and static hardware parameters like frequency, number of processing elements, and general architecture. Thus, given a model trained on a particular GPU architecture, it should be possible to predict kernel behavior accurately based solely on static code features, including instruction counts and the kernel launch configuration.

We therefore propose a method and model for predicting kernel execution time and power consumption based on machine learning techniques, which is:

  • Simple: it is based on features that can be derived quickly and with minimal overhead in terms of additional execution time due to instrumentation.

  • Portable: it can be easily ported to other GPU architectures by simply retraining the model, based on the same feature selection and general methodology.

  • Fast: as it is based on a simple Random Forest model, no large amount of computation is required to infer a prediction.

As a result, (1) it requires only minimal overhead for profiling (model feature acquisition), (2) it is suitable for provisioning tasks as it can be easily ported to a variety of different GPU types, and (3) it is suitable for use in schedulers, which usually require that the time for scheduling decisions is orders of magnitude shorter than the execution of the program.

The detailed contributions are as follows:

  • A portable profiling infrastructure for acquisition of input and output features, used for training and possible re-training for portability reasons.

  • A model suitable for a small input feature set, which is fast and sufficiently accurate for run-time decisions on scheduling (heterogeneity) and orchestration of kernels and data movements (prefetching and overlap, respectively).

  • An evaluation of method and model demonstrating prediction performance, prediction speed, and prediction portability: for a variety of GPU kernels from various benchmark suites, predictions for five different GPU types are evaluated (NVIDIA K20, GTX1650, Titan Xp, P100 and V100).

  • Method, model, measurement infrastructure and training tool are made publicly available (preprint; please contact the authors for details).

2. Background

In the following, we briefly review GPUs and CUDA, random forests as the fundamental machine learning method used, and related work in the context of predictive modeling.

2.1. GPU architecture and programming

The following introduction of GPUs is based on CUDA nomenclature, even though OpenCL is very similar except for different naming.

GPUs are massively parallel processors, executing multiple thousands of light-weight threads formed into a hierarchy: multiple threads are grouped into thread blocks, with the possibility of fast barrier synchronization and data exchange using shared memory structures that are as fast as conventional caches. Multiple blocks form a thread grid, which is specified as part of the kernel launch configuration. Thus, a thread grid is a kernel in execution. Interactions among different thread blocks are not supported, as GPUs lack strong progress guarantees due to a lack of preemption. GPUs do not execute single threads individually; instead, multiple threads (typically 32) form a thread warp which is the main unit for scheduling. As a result, all threads of a warp share a single instruction stream, and non-coherent control flow in a warp results in serialization.

The memory hierarchy of a GPU is flat and thus very different from general-purpose processors like CPUs. Threads can operate on register space as private memory, while thread blocks can make use of shared memory as cache-like memory resource. The main memory resource of a GPU is on-card GDDR-based high-throughput memory, called global memory or device memory. Unlike registers and shared memory, the lifetime of global memory exceeds the lifetime of a single kernel. Also, global memory is the main resource for interactions between host and GPU.

There also exist caches on a GPU, but as a GPU relies on latency tolerance and not latency minimization, caches can be small. In particular, unlike a CPU, a GPU does not make use of caches to reduce average (global) memory access latency; instead, their main purpose is to reduce contention on lower levels of the memory hierarchy. For latency tolerance, GPUs are prime examples of the Bulk-Synchronous Parallel (BSP) execution model (Valiant, 1990), which requires a large amount of parallel slackness in the form of orders of magnitude more threads in execution than physical processing units present.

Still, GPUs consist of up to thousands of processing units, which are grouped into so-called Streaming Multi-Processors (SMs). A thread block can execute only on a single SM, and, as a result, there is no interaction among different SMs except for global memory. Hence, GPU architectures efficiently scale with the number of SMs, and kernels written once hopefully observe excellent performance portability on more recent GPUs.

With regard to the present work, notice in particular that common code optimization techniques for GPUs require that code is well-structured and behaves regularly with regard to coherent control flow and thread behavior. Otherwise, multiple performance penalties exist: unstructured access to shared memory might result in bank conflicts and access serialization. Similarly, unstructured access to global memory results in non-coalesced accesses to off-chip DRAM modules. Thread-individual control flow usually causes branch divergence penalties, as instructions are shared at warp level and non-coherent branching is handled by collectively executing all paths with appropriate masking of results.

2.2. Random Forests

Random forests are a machine learning method based on ensemble learning for either regression or classification tasks (Breiman, 2001). During training, multiple decision trees are constructed. Usually the outputs of all trees are summarized into a mean prediction (regression) or a majority class (classification). Predictions are made by traversing each tree: at each node an input feature is compared to a threshold, and the result determines the next node to be processed until a leaf with an output value is reached.

Construction of the trees is controlled by multiple parameters. In the case of the scikit-learn implementation (Pedregosa et al., 2011), the main parameters to adjust are the number of estimators (trees), n_estimators, and max_features, the number of features considered when splitting a node in a tree. More estimators typically lead to better results, but take more time to train and to predict. A low max_features value reduces variance but increases bias. Last, there are different variations of the split criterion, which measures the quality of a split.

Random forest algorithms allow computing relative feature importances by analyzing the relative rank, i.e., the depth at which a feature is used in a decision node of a tree. These importances can be used to check whether the trained model behaves as expected.
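As an illustration of this method, the following minimal sketch fits a random forest regressor on synthetic data and reads out the relative feature importances. The scikit-learn API shown is real, but the feature names and data are hypothetical and not taken from our dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical feature matrix: rows are kernel launches, columns are
# hardware-independent features (instruction counts, launch configuration).
rng = np.random.default_rng(0)
feature_names = ["total_insts", "num_ctas", "threads_per_cta", "global_mem_bytes"]
X = rng.integers(1, 10_000, size=(200, len(feature_names))).astype(float)
# Synthetic target: "execution time" loosely proportional to total work.
y = 1e-3 * X[:, 0] * X[:, 1] + rng.normal(0.0, 5.0, size=200)

model = RandomForestRegressor(n_estimators=256, max_features="sqrt", random_state=0)
model.fit(X, y)

# Relative feature importances, derived from the rank of each feature's splits.
for name, importance in zip(feature_names, model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```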

2.3. Related work

Source T P Model Accuracy Portability Input source Dataset
(Guerreiro et al., 2018) SR MAE: 7 (Pascal), 6 (Maxwell), 12 (Kepler) 3 NVIDIA GPUs (Pascal, Maxwell, Kepler) CUPTI, PTX, custom 83 apps
(Song et al., 2013) MLR, ANN RMSE: 6.7% (time), 2.1% (power) 2 NVIDIA GPUs (Fermi) CUPTI, custom 20 kernels
(Majumdar et al., 2017) RF MAPE: 25% (time), 12% (power) AMD CPU+APU AMD CodeXL 73 apps
(Wu et al., 2015) ANN average error: 15% (time), 10% (power) 6 AMD GPUs (GCN) AMD CodeXL 108 kernels
(Nagasaka et al., 2010) LR average square error: 54.9%, average error 4.7% 1 NVIDIA GPU (Fermi) CUDA Profiler 49 kernels
(Wang et al., 2019) HM MAPE: 17.04% 2 Kepler GPUs, 2 NVIDIA GPUs (Maxwell) LLVM, custom 20 kernels
(Hong and Kim, 2009) AM GMAE: 5.4-13.3% 4 NVIDIA GPUs (Fermi) PTX, custom 20 apps
(Hong and Kim, 2010) EM, AM GME: 2.7% (micro-benchs), 8.94% (merge) 1 NVIDIA GPU (Tesla) GPUOcelot 19 apps
(Johnston et al., 2018) RF MAE of R2: 1.2% 3 CPUs, 1 Xeon Phi, 5 NVIDIA GPUs (Kepler, Pascal), 6 AMD GPUs AIWC 37 kernels
(Amarís et al., 2016) AM, LR, SVM, RF MSE: 0.7-11.4% 9 NVIDIA GPUs (Kepler, Maxwell) nvprof, custom 9 kernels
(Lehnert et al., 2016) LR, KNN divergence: 10-80% 2 NVIDIA GPUs (Fermi, Kepler) custom 1 app
(Salaria et al., 2019) AM RMSE: 90.6% 7 NVIDIA GPUs (Kepler, Maxwell, Pascal, Volta) no information 30 apps
(Zhang and Owens, 2011) TM error: 5-15% 1 NVIDIA GPU (Fermi) Barra, cubin, nvcc 3 apps
(Spafford and Vetter, 2012) AM absolute relative error plots 1 NVIDIA GPU (Fermi) Aspen 4 kernels
(Reisert et al., 2017) EM SMAPE: 12.97% CPU-GPU (no info) Score-P 7 apps
(Barnes et al., 2008) RM median prediction error: 6.2-17.3% 1 CPU PAPI, custom 7 apps
(Kundu et al., 2010) RM, ANN median error: 1.16-6.65% 1 CPU Xen-specific 4 apps
(Carroll and Wong, 2017) AM predicted/observed: 1.5% (vector add), 0.76% (matrix multiplication), 5.49% (reduction) 1 NVIDIA GPU (Kepler) custom 3 apps
(Huang et al., 2014) AM Average error: 13.2% (round robin), 14.0% (greedy-then-oldest) Fermi-like architecture GPUOcelot 40 kernels
(Baghsorkhi et al., 2010) AM good agreement between predicted and observed 1 NVIDIA GPU (Tesla) PDG 4 apps
(Chen et al., 2011) RF, RT, LR APE:7.77%, 11.68%, 11.7% 1 NVIDIA GPU (Tesla) GPGPUSim 52 kernels
(Lim et al., 2014) EM GME: 7.7% (micro-bench), 12.8% (merge) 1 NVIDIA GPU (Fermi) MacSim, DRAMSIM 22 apps
Ours RF MAPE: [13.88% - 15.82%] (Time), [1.8%-2.9%] (Power) 5 NVIDIA GPUs (Kepler, Pascal, Volta, Turing) CUDA Flux 189 kernels (Time), 168 kernels (Power)
Table 1. An overview of related work, showing prediction target (time [T], power [P]), used model, accuracy, portability, input feature source, and dataset size.

In recent years, performance and power modeling of GPUs has attracted considerable interest. Several approaches have been proposed, exploring different types of models. A summary including prediction target (execution time, power consumption, or both), model, accuracy, portability, input source, and dataset size is given in Table 1.

The most common approach is using machine learning methods such as random forests (RF) (Majumdar et al., 2017; Johnston et al., 2018; Amarís et al., 2016), support vector machines (SVM) (Amarís et al., 2016), artificial neural networks (ANN) (Song et al., 2013; Wu et al., 2015; Kundu et al., 2010), k-nearest-neighbor (kNN) (Lehnert et al., 2016), and so forth. The other common approach is using regression-based models such as statistical regression (SR) (Guerreiro et al., 2018), regression models (RM) (Barnes et al., 2008), regression trees (RT) (Chen et al., 2011), linear regression (LR) (Nagasaka et al., 2010), and multiple linear regression (MLR) (Song et al., 2013). Machine learning and regression methods provide accurate predictions, although tedious feature-engineering effort is required for building the model. This fundamental issue can, however, be overcome by automatic methods evaluating the impact of features on the accuracy of the model. A typical drawback of this approach is that training these models typically requires large (labeled) datasets; otherwise the trained model generalizes poorly and is sensitive to previously unseen samples.

The other major approach is to use analytical models (AM). One example is Aspen, a domain specific language used for analytical performance modeling (Spafford and Vetter, 2012), which basically requires rewriting an application in this language. Another analytical model considers the number of running threads and memory bandwidth for predicting performance (Amarís et al., 2016). There is also an analytical model using a collaborative filtering based modeling technique to predict performance (Salaria et al., 2019). Furthermore, there are approaches combining the aforementioned methods, leading to hybrid models (HM) (Wang et al., 2019). Throughput models (TM) (Zhang and Owens, 2011) and empirical models (EM) (Reisert et al., 2017; Hong and Kim, 2010; Lim et al., 2014) exist as well.

Numerous metrics have been used for measuring the accuracy of models, such as Mean Absolute Error (MAE) (Guerreiro et al., 2018; Johnston et al., 2018), Mean Prediction Accuracy (MPA) (Amarís et al., 2016), Geometric Mean of Absolute Error (GMAE) (Hong and Kim, 2009), Geometric Mean of the Error (GME) (Hong and Kim, 2010; Lim et al., 2014), Root Mean Square Error (RMSE) (Song et al., 2013), Mean Squared Error (MSE) (Amarís et al., 2016), Mean Absolute Percentage Error (MAPE) (Majumdar et al., 2017; Wang et al., 2019), or Symmetric Mean Absolute Percentage Error (SMAPE) (Reisert et al., 2017). Different performance metrics are used as they serve different purposes (Botchkarev, 2018).

Besides accuracy, another important characteristic of a model is portability across different GPUs or other accelerators. Several studies (Guerreiro et al., 2018; Song et al., 2013; Majumdar et al., 2017; Wu et al., 2015; Hong and Kim, 2009; Johnston et al., 2018; Amarís et al., 2016; Lehnert et al., 2016; Salaria et al., 2019) have been conducted in this direction, while many other works focus on a single processor.

Any prediction relies on a set of input features, which describe the subject under prediction. Most often, performance counters are used as input features, with information acquired from tools including CUPTI, AMD CodeXL, nvprof, LLVM, Score-P, PAPI, nvcc, GPGPUSim, MacSim, DRAMSIM, and GPUOcelot, among others. However, most of the studies do not rely exclusively on those tools but also develop their own tooling using custom microbenchmarks, further code analysis, kernel compilation information, hardware specifications, analytical equations, program dependence graphs (PDG) and others.

As previously mentioned, a representative training dataset is important for the model’s generalization capability. The size of such a dataset varies highly among the studies, ranging from one single application to up to 108 different kernels. Note that it often remains unclear if an application consists of multiple kernels which are treated independently or not.

Our work is distinguished from related work by using only hardware-independent input features for training a machine learning model. We collect instruction counts generated by CUDA Flux (Braun and Fröning, 2019) as input features, and use random forests as the learning model. Our results demonstrate good portability across a variety of workloads (189 and 168 different kernels for time and power, respectively) and GPUs, including Kepler-, Pascal-, Volta- and Turing-class NVIDIA GPUs. Only few related works are also based on static input features (Hong and Kim, 2010; Baghsorkhi et al., 2010), while the vast majority requires a comprehensive application analysis prior to prediction. Last, note that we are not aware of any other work being publicly available.

3. Methodology

As machine learning methods have proven to be highly accurate for modeling and predicting the performance of processors (Song et al., 2013; Wu et al., 2015), we also rely on such techniques. Figure 1 provides a summary of the workflow. The left part mainly covers training, based on collecting metrics as input features and execution time and power consumption as ground truth. Thus, samples are formed of a space of input feature vectors $X$, each with a label $y$, all labels forming the output space $Y$; these labels serve as targets for training and as the values to predict at inference time. Generally speaking, the goal is to find a model or function $f: X \rightarrow Y$ for which a scoring function yields a score that is to be maximized. Thereby, the training procedure seeks to maximize the score or fitness of a prediction, or, most commonly, to minimize the error of the prediction.

Once a sufficiently accurate model is found, for any previously unseen input a prediction of, in this case, execution time and/or power consumption can be made. This inference is shown in the right part of Figure 1: for any given CUDA application, its metrics have to be determined one way or another, and are then used to infer execution time or power consumption.

Figure 1. Workflow for execution time and power prediction using CUDA Flux. Rectangular nodes represent data and oval nodes processes.

In the following, we use a collection of 4 benchmark suites in order to provide a broad dataset for the model creation. Based on this workload set, we create two sets of data records, one for the input feature vectors $X$, and one for the outputs $Y$, being either the corresponding execution time or power consumption. In particular, we distinguish true values $y$ from predicted values $\hat{y}$. To keep the model simple and easy to train, the input features are enhanced in a preprocessing step (see Section 3.2).

Commonly used metrics for the scoring function include Mean Absolute Error (MAE), Mean Squared Error (MSE) and R-squared Error (R2). Execution time measurements have shown that kernels last from a few microseconds to multiple minutes. This implies that, if short kernels contribute too little to the scoring function, they are inevitably treated as noise. Absolute-value-based errors, e.g. MAE and MSE, are not a good fit for our dataset, because the errors in long-running kernels are weighted more than those in short ones. Therefore a relative error measure should be applied instead. Again, considering the large differences in the magnitude of our data, we favor an L1 loss function over an L2 loss function, as it is more robust regarding outliers. Hence, we decided to use the Mean Absolute Percentage Error (MAPE, cf. Equation 1) as the scoring function.

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \qquad (1)$$
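For illustration, a minimal implementation of this scoring function could look as follows; it is a direct sketch of Equation 1 and not tied to any particular library.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error in percent, as in Equation 1."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# Relative errors weight a short and a long kernel equally:
print(mape([10.0, 10_000.0], [12.0, 11_000.0]))  # 20% and 10% -> 15.0
```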

3.1. Portable Code Features

Our approach for execution time and power prediction makes use of portable code features which are independent of the GPU platform. In other words, the code features can be reused for other GPUs once they are recorded. Therefore, creating a new prediction model for another GPU only requires recording the target values, which makes our approach lightweight and portable. A disadvantage is missing information such as cache hit rates or register spilling.

Thus, features must not depend on the GPU used. This mainly leaves instruction counts and the kernel launch configuration (e.g., grid and block size of a kernel) as possible features, since the kernel launch configuration has a significant impact on kernel execution. In addition, the size of the shared memory allocation as well as the thread hierarchy are also part of the feature set.

In summary, the set of input features is composed of kernel metrics (instruction counts) and the kernel launch configuration. As all these features are portable across different GPUs, there are the following advantages:

  • Features only depend on the kernel and its input.

  • Features can be reused for predicting time and power of different GPUs without the additional effort of re-recording features. Only the target values have to be measured again.

  • For kernels with predictable control flow the features can be computed ahead of kernel execution rather than being recorded by a profiler.

3.2. Feature Acquisition and Engineering

Instruction counts can be measured on different levels of abstraction: for NVIDIA GPUs, the SASS (Stephenson et al., 2015) and PTX instruction sets are viable candidates. Since this approach aims for portable features which do not depend on the hardware, PTX seems to be the better fit, as it is also portable across different GPU architectures. Usually, nvprof would be the natural choice to profile kernels, but as it profiles at SASS level, it does not provide the required portability.

Instead, we use the CUDA Flux profiler to gather features at PTX level (Braun and Fröning, 2019). This profiler analyses the code for PTX instruction statistics on a basic block basis (Grune et al., 2012), and uses code instrumentation to keep track of how often threads execute a specific basic block. Notably, each thread of a given kernel launch is instrumented. Besides the instruction counts, the kernel launch configuration is recorded, including grid and block size of the kernel and shared memory usage. To keep the instrumentation lightweight, only the basic block execution frequencies and PTX instruction counts for each basic block are recorded when an instrumented application is executed. The computation of the final instruction counts is done afterwards.

The CUDA Flux profiler allows gathering instruction counts for each possible PTX instruction including specializations, thus possibly hundreds of features. These many fine-grained features may contain a lot of information about the application, but more features also mean that the average importance of each feature in the model can be quite low. A high-dimensional feature space requires a large amount of training data to obtain good results. For the sake of a simple and comprehensible model, the different instruction types should be formed into more general groups of instructions. Our experiments showed that, out of the many possible grouping strategies, simple grouping by arithmetic, special, logic, and control flow instructions already yields reasonable performance. The groups were inspired by the classification of PTX instructions by Patterson and Hennessy (Patterson and Hennessy, 2012). Additionally, the bit width is ignored in order to reduce the number of features. In contrast, memory instructions are grouped differently, because we think that in this case the width of a memory access makes a significant difference.

For memory instructions the most important metric is the data volume which is read or written, as well as the memory type being used. For this reason the memory instructions are used to compute data transfer volumes for memory types including global memory, shared memory and parameter memory. Where parameters are stored depends on the implementation, but usually it is either register space or global memory. Note that register spilling cannot be accounted for as this behavior is device dependent.

In addition to the count of each instruction group, the ratio of arithmetic instructions to the data transfer volume of global and local memory is computed and used as an input feature.
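The following sketch illustrates the kind of grouping described above on a small, made-up set of PTX instruction counts. The group membership, instruction prefixes and the fixed 4-byte access width are simplifying assumptions for illustration and do not reproduce the exact CUDA Flux output format.

```python
# Illustrative sketch only: instruction names, group prefixes and the fixed
# 4-byte access width are assumptions, not the exact CUDA Flux output.
raw_counts = {
    "add.s32": 1_200_000, "mul.f32": 800_000, "setp.lt.s32": 50_000,
    "bra": 50_000, "ld.global.f32": 400_000, "st.global.f32": 100_000,
    "ld.shared.f32": 200_000, "ld.param.u64": 4_000,
}

GROUPS = {
    "arithmetic": ("add.", "sub.", "mul.", "mad.", "fma.", "div."),
    "logic":      ("and.", "or.", "xor.", "not.", "setp."),
    "control":    ("bra", "call", "ret"),
}

# Collapse fine-grained instruction counts into coarse groups (bit width ignored).
features = {group: 0 for group in GROUPS}
for inst, count in raw_counts.items():
    for group, prefixes in GROUPS.items():
        if inst.startswith(prefixes):
            features[group] += count

# Memory instructions become transfer volumes per address space; every access
# is assumed to be 4 bytes here for simplicity.
for space in ("global", "shared", "param"):
    features[f"{space}_mem_bytes"] = 4 * sum(
        count for inst, count in raw_counts.items() if f".{space}." in inst)

# Ratio of arithmetic work to global memory traffic as an additional feature.
features["arith_per_global_byte"] = features["arithmetic"] / max(features["global_mem_bytes"], 1)
print(features)
```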

3.3. Model Construction and Training Procedure

The model is constructed using the Extremely Randomized Trees regression method provided by the scikit-learn library (Pedregosa et al., 2011). Compared to the currently pervasive interest in neural networks, random forest methods require fewer samples and less training time. Extremely Randomized Trees are furthermore known to decrease variance and to avoid overfitting.

Training of the model includes the search for optimal hyperparameters. Cross-validation is commonly used to perform this task. Simple cross-validation is in general more biased than more advanced methods as proposed by Cawley and Talbot (Cawley and Talbot, 2010) and Tibshirani (Tibshirani and Tibshirani, 2009). For our problem, nested cross-validation as proposed by Cawley and Talbot offers better stability and was therefore chosen. Several iterations of nested cross-validation ensure good generalization. In each iteration a different random initializer is used for the splits of test and training data. First, the scores of each hyperparameter combination are computed on all splits. Then, the best parameter combination is used to compute scores on all splits again.

Since random forest learning algorithms only learn values in the range of the samples which the algorithm has seen, we employ our own custom split for time prediction, which always includes the five samples with the longest execution time in the training set in order to ensure sufficient coverage of the prediction interval. Furthermore, the custom split ensures that each split has about the same number of samples of short (t < 1,000 µs), medium (1,000 µs ≤ t < 100,000 µs) and long-running (t > 100,000 µs) kernels. Note that this methodology requires significantly more computational resources when using more samples for training. If training time is an issue, the Tibshirani method with only two cross-validations might be a suitable alternative.
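A possible realization of such a duration-stratified split is sketched below, assuming scikit-learn's StratifiedKFold as the underlying mechanism and synthetic execution times. The bucket boundaries follow the thresholds given above; the rest of the code is illustrative rather than the exact implementation used.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def custom_time_splits(t_us, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs stratified by kernel duration.

    Sketch of the split described above: folds are balanced over short,
    medium and long-running kernels, and the five longest samples always
    stay in the training set to cover the prediction interval.
    """
    t_us = np.asarray(t_us, dtype=float)
    buckets = np.digitize(t_us, [1_000, 100_000])   # 0: short, 1: medium, 2: long
    longest = set(np.argsort(t_us)[-5:])            # indices of the five longest kernels
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(np.zeros_like(t_us), buckets):
        test_idx = np.array([i for i in test_idx if i not in longest])
        train_idx = np.array(sorted(set(train_idx) | longest))
        yield train_idx, test_idx

# Usage with synthetic execution times in microseconds:
times = np.random.default_rng(0).lognormal(mean=9.0, sigma=2.0, size=200)
for train_idx, test_idx in custom_time_splits(times):
    print(len(train_idx), len(test_idx))
```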

The hyperparameter search for the Extremely Randomized Trees Regression needs to find parameters that avoid overfitting. Making the parameter space too large leads to very long execution times of the nested cross-validation. For this reason the parameter space was kept to a minimum:

  • Max features: max, log2 or sqrt

  • Split criterion: mse or mae

  • N estimators: 128, 256, 512 or 1024

The main parameter for this algorithm is the number of estimators. Preliminary experiments showed that more than 1024 estimators are more likely to lead to overfitting. The maximum features method, which determines how many features are considered when searching for the best split, was also added to the hyperparameters. In addition, the criterion for computing the quality of a split was varied between mean squared error and mean absolute error.
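The following sketch shows how such a search over the listed hyperparameter grid could be set up with scikit-learn, using GridSearchCV inside an outer cross-validation loop as a simplified stand-in for the full nested procedure. The data, fold counts and the plain KFold outer split are illustrative assumptions; 'max' is expressed as None, and recent scikit-learn versions name the criteria 'squared_error' and 'absolute_error'.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

def mape(y_true, y_pred):
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# Hyperparameter grid as listed above; None means "use all features".
param_grid = {
    "n_estimators": [128, 256, 512, 1024],
    "max_features": [None, "log2", "sqrt"],
    "criterion": ["squared_error", "absolute_error"],
}

scorer = make_scorer(mape, greater_is_better=False)
inner = GridSearchCV(ExtraTreesRegressor(random_state=0), param_grid, scoring=scorer,
                     cv=KFold(n_splits=3, shuffle=True, random_state=0))

# Outer loop of the nested cross-validation (simplified: plain KFold instead of
# the custom duration-stratified split, and synthetic data).
X = np.random.default_rng(0).random((100, 8))
y = X @ np.arange(1, 9) + 0.1
outer_scores = cross_val_score(inner, X, y, scoring=scorer,
                               cv=KFold(n_splits=3, shuffle=True, random_state=1))
print(-outer_scores)  # MAPE per outer fold
```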

4. Ground Truth

4.1. Benchmarks

To maximize the number of samples for training and evaluation of the model, we use as many workloads as possible. Furthermore, this gives us also a wide representation of kernels used in GPU computing. The used benchmark suites include: Rodinia 3.1 (Che et al., 2009), Parboil 2.5 (Stratton et al., ), SHOC (Danalis et al., 2010) and Polybench-gpu-1.0 (Grauer-Gray et al., 2012).

Adhinarayanan et al. (Adhinarayanan and Feng, 2016) characterized the benchmark suites SHOC, Parboil and Rodinia, and found that all benchmark suites have some unique benchmarks. Even though some benchmarks may be slightly over-represented, we decided to include all usable applications of the benchmark suites.

Due to limitations of the LLVM compiler framework that the CUDA Flux profiler is built upon, benchmarks using texture memory cannot be considered. Table 2 lists all benchmarks which were not used in this analysis, together with the reasons.

Suite Application Reason
Rodinia 3.1 hybridsort texture memory
mummergpu texture memory
leukocyte texture memory
kmeans texture memory
hotspot CUDA Flux compilation error
nn CUDA Flux compilation error
pathfinder CUDA Flux compilation error
gaussian kernel not loopable (in case of power pred.)
Parboil 2.5 bfs texture memory
sad texture memory
mri-gridding CUDA Flux execution time error
Polybench-GPU correlation kernel not loopable (in case of power pred.)
SHOC deviceMemory texture memory
MD texture memory
spmv texture memory
GEMM No instrumentation possible, uses cuBLAS
QTC texture memory
NeuralNet hardcoded datasets
Sort kernel not loopable (in case of power pred.)
FFT kernel not loopable (in case of power pred.)
MaxFlops kernel not loopable (in case of power pred.)
Table 2. List of excluded workloads

Moreover, as the Polybench-GPU benchmark suite has hard-coded problem sizes, we decided to modify the benchmarks to allow for larger problem sizes. A longer kernel execution time is especially helpful for more accurate power readings. As (Johnston et al., 2018) uses four problem sizes for generating the metrics, we followed this approach and also used four different problem sizes for our measurements. Further modifications were implemented when kernel and kernel call are not in the same compilation module, as this is not supported by the CUDA Flux profiler.

Power measurements are in particular sensitive to short-running kernels, as the sampling frequency is limited. To obtain representative power values for short kernels, we therefore inserted for-loops. However, as kernels might have data dependencies, repeated executions potentially can change execution behavior. Thus, we exclude kernels showing different output results before and after inserting for-loops. Table 2 shows all kernels additionally excluded from power experiments.

4.2. Data Acquisition

Statistical data for execution time and power consumption for the GPU kernels of the four benchmark suites are gathered on five different NVIDIA GPUs (see Table 3). Note that we limit our study to CUDA-compatible GPUs, as CUDA Flux operates on PTX which is specific to NVIDIA GPUs.

GPU K20 GTX 1650 Titan Xp P100 V100
Class Kepler Turing Pascal Pascal Volta
Single precision performance [TFLOP/s] 3.5 3 12 9.3 14
Memory throughput [GB/s] 208 128 547.7 732 900
Peak power consumption [W] 225 75 250 300 300
Power sampling frequency [Hz] 73.6 10.9 60.2 61.1 61.2
Table 3. Overview of used GPUs and their relevant hardware specifications.

4.2.1. Execution Time

Time measurements were repeated ten times to decrease the probability of outliers. For each combination of benchmark and dataset, all kernel executions are recorded. With the benchmark name, dataset and launch sequence, the time measurements can be joined with the features provided by the CUDA Flux profiler. Note that some workloads execute kernels multiple times with the same parameters, thus only the median of these time measurements is used to create a sample. Grouping identical kernel executions reduces the number of samples from over 900,000 to about 21,000. The vast majority of samples have an execution time of less than a few tenths of a second (Figure 2). Using GPUs with higher operating frequency or more processing units reduces the execution time even further. As one can see, kernels running longer than a few seconds are under-represented. Because kernel execution times cover a very large interval, we decided to apply the log function to the targets before training the model. Thus, the data is more equally distributed in the mapped space, and prediction quality improves accordingly.
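A minimal sketch of this preparation step is shown below: identical kernel launches are collapsed to their median execution time and the model is trained on log-transformed targets. The data frame columns and the placeholder features are assumptions for illustration, not the actual pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Hypothetical raw measurements: repeated launches of the same kernel with
# identical features are collapsed to their median execution time.
df = pd.DataFrame({
    "benchmark":  ["bfs", "bfs", "bfs", "gemm"],
    "kernel":     ["k1",  "k1",  "k1",  "k2"],
    "launch_seq": [0,     0,     0,     1],
    "time_us":    [120.0, 118.0, 125.0, 54_000.0],
})
samples = df.groupby(["benchmark", "kernel", "launch_seq"], as_index=False)["time_us"].median()

# Train on log(time) so the very wide range of execution times is compressed;
# predictions are mapped back with exp(). Features are placeholders here.
X = np.random.default_rng(0).random((len(samples), 4))
model = ExtraTreesRegressor(n_estimators=128, random_state=0)
model.fit(X, np.log(samples["time_us"]))
pred_us = np.exp(model.predict(X))
```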

Figure 2. Histogram of the kernel execution time in logarithmic time. Note that long-running kernels are statistically under-represented.
Figure 3. Visualization of the variance of execution time: the standard deviation plotted over the median of execution time (for identical kernel executions) shows that short-running kernels appear to have a larger variance compared to long-running kernels.

For very short-running kernels, for instance 1 ms and less, we expect the execution time to vary substantially. This potentially also has a negative impact on the prediction accuracy. Figure 3 shows the standard deviation over execution time and supports this expectation. Furthermore, one can see that for kernels running longer than 1 ms, the deviation is reasonably low. Still, since there is a number of measurements with a high standard deviation, more measurements could be beneficial for the statistical soundness of the data.

4.2.2. Power Consumption

Comprehensive power instrumentation and measurement are still a tedious task, mainly due to the lack of a complete monitoring environment for all possible power consumers within a given computing system. However, for certain components of such a system, some vendors, including NVIDIA and Intel, provide power measurement support. For instance, NVIDIA GPUs can be instrumented using nvidia-smi (29). Still, the details of its functionality are poorly documented, in particular how current and voltage are measured. Other alternatives usually require hardware access to the system and are based on interposers that can degrade the physical properties of other connections, including high-speed serial transmission.

Thus, power measurement based on vendor tools is typically accepted by the community. For the on-board power sensor of K20 GPUs, a detailed analysis has been performed by Burtscher et al. (Burtscher et al., 2014), who accept an error of 5% in the order of ten power samples, while the sampling frequency of the sensor is approximately 66.7 Hz. In our experiments, we were able to largely reproduce these results, while we also observed that different GPU architectures and drivers result in different behavior regarding sampling frequency, as shown in Table 3.

For power measurement, kernels are executed in a loop lasting at least one second, while a CPU thread records power consumption. The loop is necessary as most of the kernels have an execution time that is shorter than the measurement resolution (see also Figure 2 for execution times and Table 3 for power sampling frequencies). Multiple measurements are afterwards averaged for each kernel. A similar methodology can be found in (Nagasaka et al., 2010; Guerreiro et al., 2018). As for the time measurements, the common launch sequence was used for joining the power measurements with the profiling results.
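The general measurement pattern, a CPU thread polling the board power sensor while the kernel loop runs, could be sketched as follows. This uses the pynvml bindings to NVML as an assumed interface and a sleep as a stand-in for the kernel loop; it is an illustration, not our actual measurement infrastructure.

```python
import threading
import time
import pynvml  # NVML bindings, e.g. from the nvidia-ml-py package

def sample_power(stop, samples, device_index=0):
    """Poll the board power sensor until asked to stop (values in watts)."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    while not stop.is_set():
        samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)  # mW -> W
        time.sleep(0.01)  # poll faster than the sensor update rate
    pynvml.nvmlShutdown()

samples, stop = [], threading.Event()
sampler = threading.Thread(target=sample_power, args=(stop, samples))
sampler.start()

# Here the instrumented kernel would be launched in a loop lasting at least
# one second; a sleep stands in for that work in this sketch.
time.sleep(1.0)

stop.set()
sampler.join()
print(f"mean power: {sum(samples) / len(samples):.1f} W over {len(samples)} samples")
```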

For the sake of measurement reliability, the power measurements were repeated ten times in order to obtain representative data. In Figure 4, the coefficient of variation versus the mean value is reported. It shows that the coefficient of variation of the power measurements is less than about 5%, similar to results reported in (Burtscher et al., 2014).

Figure 4. Validation of power measurements by comparing the coefficient of variation against mean power consumption. Different dots represent different kernel executions.

4.2.3. Reduction of Over-Represented Kernels

Some kernels were executed in loops with slightly changed launch configurations or parameters. This leads to an over-representation of some kernels, which had thousands of samples. To address this, we decided to implement a threshold for the number of samples per combination of application, problem size and kernel during the random selection process. If the threshold is too large, the kernel over-representation is not resolved, while if the threshold is too small, too few samples remain for the training data. In our study, we decided to randomly select up to 100 samples for each combination, which is a good compromise between both arguments above.
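A compact sketch of this capping step, assuming the samples are held in a pandas data frame with the named key columns (an illustrative assumption), is given below.

```python
import pandas as pd

def cap_samples(df: pd.DataFrame, limit: int = 100, seed: int = 0) -> pd.DataFrame:
    """Keep at most `limit` randomly selected samples per combination of
    application, problem size and kernel (column names are illustrative)."""
    keys = ["application", "problem_size", "kernel"]
    return (df.groupby(keys, group_keys=False)
              .apply(lambda g: g.sample(n=min(limit, len(g)), random_state=seed)))
```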

5. K20 Case Study

This section will review the experimental results for execution time and power prediction for the K20 GPU. The experiments on other GPUs and the results regarding portability will be covered in the following section.

To ensure good predictions, the scores of multiple nested cross-validation iterations are evaluated. Furthermore, we employed the leave-one-out (LOO) technique to gather comparable predictions for each sample. LOO is a special case of K-fold cross-validation where the number of folds is equal to the number of samples. This allows spotting outliers which are not covered well by the model.
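A minimal sketch of this LOO evaluation with scikit-learn is shown below; the data, model parameters and error bucket are illustrative and do not reproduce the actual experiment.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# One prediction per sample, each produced by a model trained on all remaining
# samples; hyperparameters are assumed fixed to the best values found before.
X = np.random.default_rng(0).random((50, 6))
y = X.sum(axis=1) + 0.1

model = ExtraTreesRegressor(n_estimators=256, max_features=None, random_state=0)
loo_pred = cross_val_predict(model, X, y, cv=LeaveOneOut())

rel_err = np.abs(loo_pred - y) / y
print(f"{np.mean(rel_err <= 0.10):.0%} of samples within +/-10% of the true value")
```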

5.1. Execution Time Prediction

Figure 5 shows the performance of the nested cross-validation for time prediction. The cross-validation was repeated over 30 iterations with different random splits for each fold. Consistent and low scores indicate that the prediction generalizes well. The mean error (MAPE according to Equation 1) of each iteration is between approximately 11.6% and 21.8%. As different iterations show similar performance, we conclude that prediction for the K20 can perform well with only a subset of all the samples.

Figure 5. Nested cross-validation score for execution time (left) and power (right) prediction on the K20 GPU.

Random forests offer the possibility to give each training feature a weight based on its impact on the prediction. We expect the following features to be most important:

  • Total instructions: execution time should correlate with the number of total instructions when the size of the thread grid is fixed.

  • Number of CTAs: it indicates the amount of parallel work and increases execution time if not all CTAs can be scheduled simultaneously.

  • Threads per CTA: it also indicates the amount of parallel work and may change the number of CTAs which can be scheduled at once.

  • Arithmetic operations: indicating the amount of arithmetic work.

  • Memory volume read or written from global memory: indicating the amount of data movement.

Figure 6. Feature importance for time (left) and power (right) prediction; each feature is assigned a percentage of its contribution to the prediction result.

Figure 6 shows that these features are indeed among the most important ones. Surprisingly, the parameter memory volume also seems to be very important. This could be due to some computationally complex kernels using many kernel arguments, which increases the parameter memory volume.

Figure 7. Leave-One-Out results for time prediction on the K20 GPU. Left: scatter plot of true values versus predicted values (logarithmic scale). Right: distribution of prediction errors.

LOO is used to find and visualize samples which cannot be predicted well because they are possibly outliers. The best parameters from nested cross-validation are used to compute predictions for each sample using the LOO method. This method allows obtaining predictions for each sample while excluding it from training.

Figure 7 shows that most of the LOO predictions are quite close to the true value. The samples at the high end of the range are usually underestimated. This is because random forest algorithms cannot predict values outside the range of the training samples, and there are only very few samples with a long execution time. About 82% of the samples are within +/- 10% of the true value, and around 8% deviate between 10% and 25%. The next two groups each contain about 4% of the total samples. Only about 2% of the samples deviate by more than 100%. This shows that the majority of samples can be predicted very well, while there are still some outliers for which the predictions deviate by a large factor.

5.2. Power Prediction

In this section, we follow the same methodology as for time prediction, starting with the nested cross-validation score as reported in Figure 5: an error (MAPE according to Equation 1) of between 1.66% and 1.94% can be observed, considerably lower than for execution time prediction, which means that the prediction generalizes even better. This improvement is possibly due to the smaller range of power measurements, which cover only two orders of magnitude, while execution time measurements can cover up to eight orders of magnitude. Therefore, even a few high-magnitude errors can degrade the overall performance of the nested cross-validation for time prediction, which is less likely for power prediction.

Similarly, we can analyze the importance of features, as shown in Figure 6. Compared to execution time prediction, threads per CTA and the number of CTAs are most important, possibly because they contribute most to GPU occupancy. GPU power consumption is strongly related to GPU occupancy, which is a good indicator of the parallelism of the kernel, the resources requested by the kernel, and the GPU's available resources.

Last, the LOO method is again used to find possible outliers. Using this method, the true versus the predicted values and the distribution of the prediction error are plotted in the left and right parts of Figure 8, respectively. Most of the predictions are quite close to the true value, with 92% of the samples being within +/- 5% of the true value, around 4% between 5% and 10%, and 3% between 10% and 25%, while less than 1% is between 25% and 50%.

Figure 8. Leave-One-Out results for power prediction on the K20 GPU. Left: scatter plot of true values versus predicted values (note the linear scale). Right: distribution of prediction errors.

6. Portability

This section discusses the portability of the concept, by evaluating prediction quality for all five GPUs. As stated in the methodology, we collect application statistics (input features) only once, while for each GPU a separate output is measured (ground truth).

6.1. Time Prediction

Figure 9. Portability of time (left) and power (right) prediction across different GPUs: MAPE scores for all iterations of nested cross-validation with median, first and third quartile. Whiskers are limited to 1.5 times the interquartile range (Q3-Q1). Outliers are not shown.

The results of nested cross-validation across all five GPUs are summarized in Figure 9. We decided to use a boxplot of the individual scores of the folds rather than the mean score of each iteration. This avoids smoothing the scores of poorly performing folds by averaging them with possibly much better performing folds. The median MAPE score ranges from 13.88% to 15.82% for the K20, Titan Xp, P100 and V100, while for the GTX 1650 it is about 45%. In this regard, we observe that server-class GPUs seem to have better predictability compared to consumer-class GPUs.

Furthermore, we observe that the Titan Xp has a much higher variability compared to the other GPUs, including the consumer-class GTX1650: while the median MAPE score of 14.71% is quite low, the third quartile of 568.39% leads to a large interquartile range (IQR), indicating a high variability and therefore poor generalization of the model.

Figure 10. Histogram of the cross-validation error for each fold of the nested cross-validation.

As the results for some GPUs are very promising, but for other GPUs rather mediocre, we tried to understand the reasons in more detail. First, we analyze the distribution of errors for the different GPUs. Figure 10 shows a histogram of MAPE scores for all cross-validation iterations. K20 and V100 behave similarly well, but for the GTX 1650 the MAPE range is much larger. For the Titan Xp and P100 there are some folds where the error is unusually high. This suggests that the dataset may contain outliers, or at least very unique samples which are hard to predict if there are no similar samples in the training set.

Figure 11. Scatter plot of true time values versus predicted values using the leave-one-out method for GTX1650, Titan Xp, P100 and V100 GPUs.

As the CUDA driver can typically add about 1-50 µs of latency to a kernel execution, depending on configuration and iterativeness, measurements of short kernels can become unreliable. As a result, it is harder to fit such measurements into a model. We again use the leave-one-out method to accurately assess the prediction performance for every single sample: Figure 11 reports the corresponding scatter plots, in which one can see that for the GTX 1650 the share of samples with short execution time is much higher in comparison to the other GPUs. We also see evidence for the under-representation of long-running kernels, as the error increases substantially for the samples with long execution time.

The scatter plots of the Titan Xp and P100 GPUs show that there are some outliers. The application srad-v2 in Rodinia 3.1 is especially problematic. As Table 4 shows, the kernel execution time varies a lot even though there is no difference regarding the features, especially for the Titan Xp and the P100. We assume that caching could be one reason for the large differences in execution time. This extreme case of varying kernel execution time with an almost identical set of features cannot be handled well by the model.

GPU Execution 1 [us] Execution 2 [us] Reduction factor [%]
K20 1598.1 1369.7 85.7080
GTX1650 2520.6 1544.7 61.2830
TitanXp 6901.2 2.6 0.0377
P100 29064.0 2.6 0.0089
V100 32589.6 151.9 0.4660
Table 4. Comparison of the execution time of the srad_cuda_1 kernel on different GPUs. The last column gives the second execution time as a percentage of the first.

Last, we report the optimal hyperparameters resulting from the cross-validation runs in Table 5. Also, note that the corresponding average prediction latency for these hyperparameter settings is consistently low, but still varies substantially with the hyperparameter configuration. This suggests that the latency could be reduced by a more sophisticated hyperparameter search.

GPU Best hyperparameters Prediction latency
K20 MAE, max features, 512 estimators 109.32 ms
GTX 1650 MAE, max features, 1024 estimators 210.35 ms
Titan Xp MAE, max features, 1024 estimators 210.9 ms
P100 MAE, max features, 128 estimators 109.98 ms
V100 MAE, max features, 1024 estimators 210.41 ms
Table 5. Hyperparameters for the best model for time prediction, together with the corresponding average prediction latency.

6.2. Power Prediction

The results of the nested cross-validation are summarized in Figure 9. The median MAPE for the K20 GPU is below 2%, and for the other GPUs it is in a comparable range. This shows consistently good prediction, even though the peak power consumption of the GPUs varies substantially (Table 3).

Kernel power consumption was measured with a resolution in the order of milli-Watts. Performing leave-one-out, we are able to plot the model predictions and actual power values as shown in Figure 12, allowing us to observe and identify the extreme outliers. It turns out that those few outliers are kernel samples having identical kernel features but exhibiting different power consumption. Therefore, in those cases the model has limited knowledge to predict both kernels precisely. This can be attributed to limited features or potential statistical variance. The application srad-v2 from Rodinia 3.1 is also problematic for power predictions, representing the most extreme outliers.

Figure 12. Scatter plot of true power values versus predicted values using the leave-one-out method for GTX1650, Titan Xp, P100 and V100 GPUs.

Note that different GPUs use different hyperparameters for the best models of power and execution time prediction. Notably, both execution time and power prediction latency depend on the number of estimators (number of trees). The more trees are involved in the model, the more computational time is needed for making a prediction, although the prediction latency is still short, on the order of a hundred milliseconds (Table 6).

GPU Best hyperparameters Prediction latency
K20 MSE, max features, 1024 estimators 254.57 ms
GTX 1650 MAE, max features, 1024 estimators 252.54 ms
Titan Xp MAE, max features, 1024 estimators 259.47 ms
P100 MAE, max features, 1024 estimators 253.75 ms
V100 MSE, max features, 256 estimators 115.45 ms
Table 6. Hyperparameters for the best model for power prediction, together with the corresponding average prediction latency.
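As a simple illustration of how such prediction latencies can be measured, the following sketch times repeated single-sample predictions of a trained forest; the data and parameters are placeholders, and the measured value will of course depend on the machine.

```python
import time
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Train a forest on placeholder data, then time repeated single-sample predictions.
X = np.random.default_rng(0).random((1_000, 8))
y = X.sum(axis=1)
model = ExtraTreesRegressor(n_estimators=1024, random_state=0).fit(X, y)

x_single = X[:1]                       # feature vector of one kernel launch
runs = 100
start = time.perf_counter()
for _ in range(runs):
    model.predict(x_single)
latency_ms = (time.perf_counter() - start) / runs * 1_000
print(f"average prediction latency: {latency_ms:.1f} ms")
```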

7. Discussion and Limitations

The cross-validation shows that our models generalize well for predicting time and power. The time prediction has median MAPE results ranging from 13.88% to 15.82% for professional GPUs. The time for consumer-class GPUs GTX 1650 and Titan Xp could not be predicted as well. Possible reasons are much higher processor and memory clocks, leading to even shorter kernel execution times, such that the measurement accuracy is poor. We also suspect that some effects like caching between kernel launches are not well covered by the current feature selection, while the impact of these effects is hardware-dependent. The cross-validation for power prediction yields a median MAPE varying from 1.8% to 2.9% for all used GPUs.

In spite of relying only on static input features, our results still show good prediction accuracy for the five tested GPUs, both regarding time and power predictions. The same applies to portability, where experiments showed that a model trained specifically for a given GPU can accurately predict time and power for an application's static set of input features. Furthermore, an extensive use of cross-validation shows that the models generalize well, in spite of a rather limited dataset size. Last, predictions can be made fast, as experiments show that the prediction latency is typically in the range of 0.1-0.2 seconds. Thus, in particular the use for scheduling decisions, even across a variety of heterogeneous devices, seems feasible.

Still, we observed a couple of limitations which we will summarize in the following:

Training data: certainly a larger training data set would help to improve the prediction accuracy. Furthermore, the database of samples used to build the model mainly includes short-running kernels. As measurements for kernels with short execution time are less accurate, this also limits the accuracy of the predictions. With less data on long-running kernels, it is also harder to predict this class of samples. Also, 14.0% (GTX 1650) to 55.9% (V100) of all kernel launch configurations do not utilize all available streaming multiprocessors (register usage ignored), indicating that more samples with a high degree of parallel work would be helpful. Last, more research on the impact of the CUDA driver on execution time is necessary, in particular for short-running kernels. Regarding power prediction, short and data-dependent kernels show unexpected behavior when for-loops are inserted for obtaining adequate power measurements, leading us to exclude them from our analysis. A possible solution to the issues with the training data set may be synthetic workloads with configurable execution time and degree of parallelism, e.g. similar to the one used in (Choi et al., 2013).

Model features: to address the increasing interest in reduced-precision arithmetic, for instance 16-bit floating point or 8-bit integer, weighting the computational instructions by bit width is possible. In general, introducing features that reflect the degree of optimization would be helpful, for example by indicating performance bugs like bank conflicts, branch divergence, or memory coalescing issues. As pointed out previously, some kernels show strong variation between consecutive kernel launches. More research is required to understand this situation and how it can be covered by features.

Model training: in general, a larger hyperparameter space as well as regularization of hyperparameters could further improve prediction accuracy. Furthermore, trade-offs between prediction latency and model complexity are possible.

8. Summary

We hypothesized that GPU kernels are usually well-structured, sufficiently optimized for locality and latency-tolerant, and that therefore a prediction of execution time and power consumption based solely on hardware-independent features, which describe code and kernel launch configuration, is feasible. We validate our hypothesis by training models for five GPUs and evaluate their accuracy by comparing against monitored real executions of at least 184 unique kernels, using different problem sizes (and thus kernel launch configurations) when possible, as ground truth. The cross-validation shows that our models generalize well when predicting time and power. Median MAPE results for time prediction are 13.45%, 44.56%, 15.59%, 13.27% and 11.61%, and for power prediction 1.81%, 2.45%, 2.17%, 2.91% and 2.48%, for the K20, GTX 1650, Titan Xp, P100 and V100 GPUs, respectively.

We observed that the dataset, although based on a representative set of benchmark suites, tends towards short-running kernels, resulting in a rather poor representation of long-running kernels. Results suggest that for GPUs with high processor and memory clock frequencies, such as the consumer-grade GTX1650, this lack of representation is amplified, which is reflected in an increased median MAPE for time. In contrast, the median MAPE for power is similar for all GPUs.

In summary, we conclude that our hypothesis is supported, as GPU kernel execution time and power consumption can be accurately predicted using solely hardware-independent features. As a result, we propose a portable, fast and accurate model for predicting time and power consumption, which is publicly available (preprint; please contact the authors for details) and can easily be retrained for other GPU architectures. Note that portability is currently limited to CUDA, which, however, is a practical rather than a principal limitation.

Future work includes further feature engineering and more sophisticated features describing the degree of optimization in order to improve prediction accuracy. More effort on hyperparameter search and optimization could reduce prediction latency and enhance the generalization of the models.

Acknowledgements.
This work is supported in part by the Federal Ministry of Education and Research of Germany in the framework of the Mekong project (FKZ: 01IH16007). The authors would like to thank Ullrich Koethe at Heidelberg University and Kai Polsterer at the Heidelberg Institute for Theoretical Studies for their help on machine learning methods and models.

References

  • V. Adhinarayanan and W. Feng (2016) An automated framework for characterizing and subsetting GPGPU workloads. pp. 307–317.
  • M. Amarís, R. Y. de Camargo, M. Dyab, A. Goldman, and D. Trystram (2016) A comparison of GPU execution time prediction using machine learning and analytical modeling. In 2016 IEEE 15th International Symposium on Network Computing and Applications (NCA), pp. 326–333.
  • S. S. Baghsorkhi, M. Delahaye, S. J. Patel, W. D. Gropp, and W. W. Hwu (2010) An adaptive performance modeling tool for GPU architectures. pp. 10.
  • B. J. Barnes, B. Rountree, D. K. Lowenthal, J. Reeves, B. de Supinski, and M. Schulz (2008) A regression-based approach to scalability prediction. In Proceedings of the 22nd Annual International Conference on Supercomputing, ICS ’08, New York, NY, USA, pp. 368–377.
  • A. Botchkarev (2018) Performance metrics (error measures) in machine learning regression, forecasting and prognostics: properties and typology. ArXiv abs/1809.03006.
  • L. Braun and H. Fröning (2019) CUDA Flux: a lightweight instruction profiler for CUDA applications. In Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS) Workshop, collocated with the International Conference for High Performance Computing, Networking, Storage and Analysis (SC2019).
  • L. Breiman (2001) Random forests. Machine Learning 45 (1), pp. 5–32.
  • M. Burtscher, I. Zecena, and Z. Zong (2014) Measuring GPU power with the K20 built-in sensor. In GPGPU@ASPLOS.
  • T. C. Carroll and P. W. H. Wong (2017) An improved abstract GPU model with data transfer. pp. 113–120.
  • G. C. Cawley and N. L. C. Talbot (2010) On over-fitting in model selection and subsequent selection bias in performance evaluation. pp. 29.
  • S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. Lee, and K. Skadron (2009) Rodinia: a benchmark suite for heterogeneous computing. pp. 44–54.
  • J. Chen, B. Li, Y. Zhang, L. Peng, and J. Peir (2011) Statistical GPU power analysis using tree-based methods. In 2011 International Green Computing Conference and Workshops, pp. 1–6.
  • J. W. Choi, D. Bedard, R. Fowler, and R. Vuduc (2013) A roofline model of energy. In 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, pp. 661–672.
  • A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter (2010) The Scalable Heterogeneous Computing (SHOC) benchmark suite. pp. 63.
  • S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos (2012) Auto-tuning a high-level language targeted to GPU codes. In 2012 Innovative Parallel Computing (InPar), pp. 1–10.
  • D. Grune, K. van Reeuwijk, H. E. Bal, C. J. H. Jacobs, and K. Langendoen (2012) Modern Compiler Design. Springer New York, New York, NY.
  • J. Guerreiro, A. Ilic, N. Roma, and P. Tomas (2018) GPGPU power modeling for multi-domain voltage-frequency scaling. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 789–800.
  • S. Hong and H. Kim (2009) An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA ’09, New York, NY, USA, pp. 152–163.
  • S. Hong and H. Kim (2010) An integrated GPU power and performance model. pp. 10.
  • J. Huang, J. H. Lee, H. Kim, and H. S. Lee (2014) GPUMech: GPU performance modeling technique based on interval analysis. pp. 268–279.
  • B. Johnston, G. Falzon, and J. Milthorpe (2018) OpenCL performance prediction using architecture-independent features. In 2018 International Conference on High Performance Computing & Simulation (HPCS).
  • A. Koike and K. Sadakane (2014) A novel computational model for GPUs with application to I/O optimal sorting algorithms. In 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, pp. 614–623.
  • S. Kundu, R. Rangaswami, K. Dutta, and M. Zhao (2010) Application performance modeling in a virtualized environment. In HPCA-16 2010: The Sixteenth International Symposium on High-Performance Computer Architecture, pp. 1–10.
  • C. Lehnert, R. Berrendorf, J. P. Ecker, and F. Mannuss (2016) Performance prediction and ranking of SpMV kernels on GPU architectures. In Proceedings of the 22nd International Conference on Euro-Par 2016: Parallel Processing – Volume 9833, New York, NY, USA, pp. 90–102.
  • J. Lim, N. B. Lakshminarayana, H. Kim, W. Song, S. Yalamanchili, and W. Sung (2014) Power modeling for GPU architectures using McPAT. ACM Transactions on Design Automation of Electronic Systems 19 (3), pp. 26:1–26:24.
  • S. Madougou, A. Varbanescu, C. de Laat, and R. van Nieuwpoort (2016) The landscape of GPGPU performance modeling tools. Parallel Computing 56, pp. 18–33.
  • A. Majumdar, L. Piga, I. Paul, J. L. Greathouse, W. Huang, and D. H. Albonesi (2017) Dynamic GPGPU power management using adaptive model predictive control. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 613–624.
  • H. Nagasaka, N. Maruyama, A. Nukada, T. Endo, and S. Matsuoka (2010) Statistical power modeling of GPU kernels using performance counters. In International Conference on Green Computing, pp. 115–122.
  • NVIDIA (2012) NVIDIA System Management Interface.
  • D. A. Patterson and J. L. Hennessy (2012) Computer Organization and Design: The Hardware/Software Interface. Rev. 4th ed., Elsevier Morgan Kaufmann, Amsterdam; Heidelberg.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
  • P. Reisert, A. Calotoiu, S. Shudler, and F. Wolf (2017) Following the blind seer – creating better performance models using less information. pp. 106–118.
  • S. Salaria, A. Drozd, A. Podobas, and S. Matsuoka (2019) Learning neural representations for predicting GPU performance. In High Performance Computing – 34th International Conference, ISC High Performance 2019, Frankfurt/Main, Germany, June 16–20, 2019, Proceedings, pp. 40–58.
  • S. Song, C. Su, B. Rountree, and K. W. Cameron (2013) A simplified and accurate model of power-performance efficiency on emergent GPU architectures. pp. 673–686.
  • K. L. Spafford and J. S. Vetter (2012) Aspen: a domain specific language for performance modeling. In SC ’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–11.
  • M. Stephenson, S. K. Sastry Hari, Y. Lee, E. Ebrahimi, D. R. Johnson, D. Nellans, M. O’Connor, and S. W. Keckler (2015) Flexible software profiling of GPU architectures. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA ’15), Portland, Oregon, pp. 185–197.
  • J. A. Stratton, C. Rodrigues, I. Sung, N. Obeid, L. Chang, N. Anssari, G. D. Liu, and W. W. Hwu. Parboil: a revised benchmark suite for scientific and commercial throughput computing. pp. 12.
  • R. J. Tibshirani and R. Tibshirani (2009) A bias correction for the minimum error rate in cross-validation. The Annals of Applied Statistics 3 (2), pp. 822–829.
  • L. G. Valiant (1990) A bridging model for parallel computation. Communications of the ACM 33 (8).
  • X. Wang, K. Huang, A. Knoll, and X. Qian (2019) A hybrid framework for fast and accurate GPU performance estimation through source-level analysis and trace-based simulation. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 506–518.
  • G. Wu, J. L. Greathouse, A. Lyashevsky, N. Jayasena, and D. Chiou (2015) GPGPU performance and power estimation using machine learning. pp. 564–576.
  • Y. Zhang and J. D. Owens (2011) A quantitative performance analysis model for GPU architectures. In 2011 IEEE 17th International Symposium on High Performance Computer Architecture, pp. 382–393.