Charactering and Detecting CUDA Program Bugs

05/06/2019 ∙ by Mingyuan Wu, et al. ∙ The University of Texas at Dallas 0

While CUDA has become a major parallel computing platform and programming model for general-purpose GPU computing, CUDA-induced bug patterns have not yet been well explored. In this paper, we conduct the first empirical study to reveal important categories of CUDA program bug patterns based on 319 bugs identified within 5 popular CUDA projects in GiyHub. Our findings demonstrate that CUDA-specific characteristics may cause program bugs such as synchronization bugs that are rather difficult to detect. To efficiently detect such synchronization bugs, we establish the first lightweight general CUDA bug detection framework, namely Simulee, to simulate CUDA program execution by interpreting the corresponding llvm bytecode and collecting the memory-access information to automatically detect CUDA synchronization bugs. To evaluate the effectiveness and efficiency of simulee, we conduct a set of experiments and the experimental results suggest that simulee can detect 20 out of the 27 studied synchronization bugs and successfully detects 26 previously unknown synchronization bugs, 10 of which have been confirmed by the developers.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

CUDA (cud, 2019) is a major parallel computing platform and programming model that allows software developers to leverage general-purpose GPU (GPGPU) computing (gpg, 2019)

. CUDA is advanced in simplifying I/O streams to memories and dividing computations into sub-computations since it parallelizes programs in terms of grids and blocks. In addition, CUDA enables more flexible cache management that speeds up the floating point computation of CPUs. CUDA is thus considered rather powerful for accelerating deep-neural-network-related applications where the relevant matrix computations can be efficiently loaded.

Due to the essential differences between GPUs and CPUs, traditional bug detection approaches for CPUs render inapplicable for GPUs. In particular, since GPU programs use barriers rather than locks for synchronization and enable simple happens-before relations, the traditional lockset-based  (Choi et al., 2002; Savage et al., 1997) and happens-before-based bug detection approaches (Dinning and Schonberg, 1990; Netzer and Miller, 1991) for CPUs become obsolete in detecting the parallel-computing-related bugs for GPUs. On the other hand, it is argued that the lack of GPU parallel programming experience and the unawareness of implicit assumptions from the third-party kernel functions of developers are major reasons to cause GPU parallel-computing-related bugs. For instance, a developer might launch kernel functions with 512 threads when she is not aware that the optimal maximum number of threads in one block is only 256 (Zheng et al., 2011). However, although CUDA programing has been dominating the popular deep-neural-networks-related applications, the studies on its parallel-computation-related bug patterns are rather limited (Yang et al., 2012). Therefore, a full scan of CUDA bug patterns could help developers understand the bug patterns to improve programming efficiency, and help researchers get enlightened for future research.

In this paper, we conduct a comprehensive empirical study on real-world CUDA bug patterns, based on 319 bugs collected from five popular CUDA projects with a total of 15314 commits and 1.1 million LOC (by Jan, 2019) in GitHub according to a set of policies that emphasize the importance and impact of these projects. Through the study, we build better understandings of the CUDA program bug patterns. In particular, we depict these collected CUDA program bugs by two dimensions: runtime stage and root cause. In our study, we identify three runtime stages and five root causes and obtain the following findings: (1) the majority of the kernel function bugs are not SIMD-specific only and can take place in other platforms, and thus can be detected by traditional CPU-based approaches; (2) the majority of the memory-related bugs can also be solved by traditional approaches; and (3) detecting synchronization bugs is important, challenging, and out of the scope of traditional approaches.

Inspired by these findings, we further develop a systematic lightweight bug detection framework, namely Simulee (sim, 2019), to detect the synchronization bugs for CUDA kernel functions. Though existing techniques, e.g., GKLEE (Li et al., 2012b), CURD (Peng et al., 2018a), have been developed to detect CUDA program bugs, they are mostly either based on expensive static/dynamic analysis that results in non-negligible overhead, or fail to generate effective test cases to detect different types of synchronization bugs. Simulee, on the other hand, generates a Memory Model that depicts the information regarding thread-wise memory access including thread id, visit order, and action. Accordingly, Simulee is launched by building a virtual machine that takes the llvm bytecode of CUDA kernel functions for initializing the running environmental setups including arguments, dimensions, and global memory if necessary based on the Memory Model. Next, Simulee applies Evolutionary Programming (Fogel, 1999) to approach error-inducing inputs for executing llvm bytecode of CUDA programs and collects the corresponding memory-access information. At last, by combining CUDA specifics, such collected information can be analyzed to find whether they lead to synchronization bugs.

Unlike other CUDA synchronization bug detection approaches that are mostly not fully automated (Peng et al., 2018a) (rac, 2019) (Eizenberg et al., 2017) or limited in detecting certain bug types, e.g., data race (Zheng et al., 2011) (Zheng et al., 2014), Simulee can detect multiple bug types including data race, redundant barrier function, and barrier divergence fully automatically. Moreover, Simulee benefits from only simulating runtime CUDA programs without incurring overhead for extra processing (e.g., searching), such that it is more efficient than the static/dynamic-analysis-based approaches (Li et al., 2012b; Pereira et al., 2016) that usually demand large search space.

To evaluate the effectiveness and efficiency of Simulee on detecting synchronization bugs, we conduct a set of experiments based on a total of 9 projects, including the 5 projects for the empirical study and additional 4 projects with in total 2113 commits and 122K LOC. The experimental results suggest that Simulee can successfully detect 20 of 27 synchronization bugs derived from the empirical study. It can further detect 26 previously unknown bugs of all the 9 projects, 10 of which have already been confirmed by the corresponding developers. Moreover, the experimental results also demonstrate that Simulee can be much more effective than state-of-the-art GKLEE in detecting synchronization bugs, e.g., none of the 10 confirmed bugs can be detected by GKLEE. In summary, our paper makes the following contributions:

  • To the best of our knowledge, we conduct the first extensive study on the overall CUDA program bugs. Our findings can help understand the characteristics about CUDA bugs and guide the future relevant research .

  • To the best of our knowledge, we develop the first lightweight, fully automated, and general-purpose detection framework for CUDA synchronization bugs, namely Simulee, that can automatically detect a wide range of synchronization bugs in CUDA programs which are hard to be captured manually.

  • We evaluate Simulee under multiple experimental setups. The results suggest that Simulee is able to detect most of the synchronization bugs in the studied projects. In addition, it detected 26 new bugs of all the subject projects and outperforms state of the art.

2. Background

In this section, we give an overview on CUDA, the CUDA parallel computing mechanism, and typical CUDA synchronization bugs.

CUDA Overview. CUDA is a parallel computing platform and programming model, which enables developers to use GPU hardware for general-purpose computing. CUDA is composed of a runtime library and an extended version of C/C++. In particular, CUDA programs are executed on GPU cores, namely “device”, while they need to be allocated with resources on CPUs, namely “host”, prior to execution. As a result, developers need to retrieve allocated resources such as global memory after CUDA program execution. To conclude, a complete CUDA program contains 3 runtime stages: host resource preparation, kernel function execution, and host resource retrieve.

CUDA Parallel Computing Mechanism. Kernel function refers to the part of CUDA programs that runs on the device side. Specifically, thread is the kernel function’s basic execution unit. In the physical level, 32 threads are bundled as a thread warp wherein all the threads execute the same statement at any time except undergoing a branch divergence, while in the logic level, one or more threads are contained in a block, and one or more blocks are contained in a grid.

Figure 1. CUDA Hierarchy

Developers set dimensions of grids and blocks as inputs for executing their kernel functions. In particular, they divide computation into sub-computations and dispatch each sub-computation to different threads according to the grid and block dimensions. Eventually, the results of sub-computations can be merged as the final result of the overall computation through applying algorithms such as reduction. The hierarchy of the parallel computing mechanism of CUDA kernel functions is presented in Figure 1.

To synchronize threads, CUDA applies barriers at which all the threads in one block must wait before any can proceed. In CUDA kernel functions, the barrier function is ‘‘__syncthreads()’’ which synchronizes threads from the same block. When a thread reaches a barrier, it is expected to proceed to next statement if and only if all threads from the same block have reached the same barrier. Otherwise, the program would be exposed to undefined behaviors.

CUDA Synchronization Bugs. There are three major synchronization bugs in CUDA kernel functions: data race, barrier divergence (Collingbourne et al., 2013), and redundant barrier function based on our findings from our study in later sections. Specifically, data race indicates that for accessing global or shared memory, CUDA cannot guarantee the visit order of “read&write” actions or “write&write” actions from two or more threads. For example, Figure 2 demonstrates the bug-fixing Revision no.febf515a82” in the file “” of the project “thundersvm” (Xtra-Computing, [n.d.]), one of the highly-rated Github projects. It can be observed from Figure 2 that the “if” statement writes to the memory of “f_val2reduce”, while inside the device, the function “get_block_min” writes to the same memory. This “write&write” bug is fixed by adding “__syncthreads” which synchronizes actions among threads.

Figure 2. An Example of Data Race
Figure 3. An Example of Redundant Barrier

A barrier function is considered redundant when there is no data race after deleting it from source code. A redundant barrier function compromises the program performance in terms of time and memory usage. For instance, Figure 3 demonstrates the bug-fixing Revision no.31761d27f01” in the file “kernel/homography.hpp” from the project “arrayfire” (arrayfire, 2019). It can be observed that the block dimension is 1 since from Line 1, the value of “tid” is assigned only by “threadIdx.x”. That indicates that the “tid”s are identical among different threads from the same block. As a result, “s_median[tid]” and “s_idx[tid]” can only be accessed by one thread, leading to a redundant barrier function in Line 4 because there is no race in “s_median” or “s_idx” after deleting it.

Figure 4. An Example of Barrier Divergence

A barrier divergence takes place when some threads in a block complete their tasks and leave the barrier while the others have not reached the barrier yet. Figure 4 demonstrates the bug-fixing Revision no.0ed6cccc5ff” in the file “nearest_neighbour.hpp” from the project “arrayfire” caused by barrier divergence. It can be indicated from Figure 4 that developers make sure all the threads in the same block reach the same barrier in every execution of the kernel function by moving the statement of “__syncthreads()” outside the given branch. Otherwise they will have to handle undefined behaviors.

3. Empirical Study

To investigate CUDA bug patterns, we first conduct a large-scale real-world bug dataset from open-source CUDA projects and then empirically analyzes the runtime stages and root causes of the collected CUDA program bugs.

3.1. Data collection and filtering

To collect sufficient CUDA bugs for study, we first define policies to select open-source CUDA projects. In this paper, we aim to select important and influential projects covering as many project types as possible. The collection is initialized by searching the keyword “CUDA” and results in more than 12,000 projects from GitHub. Next, we sort these projects in terms of the star number and commit number. In particular, the chosen projects fall into two groups: more active projects which are defined as the ones with over 1000 commits and less active projects which are defined as the ones with 500 to 1000 commits. In each group, we collect one library-type project and one application-type project with the most stars. However, none of these selected projects are marked with CUDA as their main language. Therefore, we collect one additional project with the main language marked as CUDA and most stars. As a result, we collect “kaldi” (kaldi asr, [n.d.]) (more active applicatoin), “arrayfire” (arrayfire, 2019) (more active library), “thundersvm” (Xtra-Computing, [n.d.]) (less active application), “mshadow” (dmlc, [n.d.]) (less active library), and “cuda-convnet2” (akrizhevsky, 2019) (main laguage CUDA) as listed as in Table 1.

After collecting the projects, the bugs are delivered based on the commit messages and “git diff” results. The specific operations are listed as follows. We first filter the commits and only keep the commits with the messages that contain at least one keyword in the set {“fix”, “error”, “mem”}, following prior study on other types of bugs (Zhang et al., 2018). In this way, the previous versions of the selected commits might have a higher chance to contain bugs.

However, the commit messages only with these keywords might not be relevant with CUDA bugs. Therefore, next, among the filtered commit messages, we further filter them according to whether they have at least a keyword in the set {“__global__”, “__device__”} or match at least one regular expression in the set {“cuda\w+\s*[(]”, “[^][^]”} with its parent node’s “git diff” results. To illustrate, “__global__” is the modifier of kernel functions and “__device__” is the modifier of the device functions that can be called by kernel functions. “cuda\w+\s*[(]” is designed in accordance with the information that the resource is prepared/released in host side before/after executing kernel functions. For instance, “cudaMalloc((void **) &host, sizeof(int) *100)” allocates a global 400-byte memory for kernel functions before execution; “cudaFree(&host)” releases the allocated memory for kernel functions after execution. “[^][^]” is designed in accordance with the scenario that sets up the environment for kernel functions, e.g., “function_namegrid_size, block_size(arguments)”. All these regular expressions together deliver the complete life cycle of executing kernel functions such that all the bugs of the whole life cycle can be covered.

width=0.8 Projects Star Number Commit Number LoC kaldi 5143 8419 364K arrayfire 2499 5171 381K thundersvm 818 790 343K mshadow 966 894 16K cuda-convnet2 620 40 27K

Table 1. Subject Statistics

We further manually review all the remaining commits after two rounds of filtering to remove any potential false positives. Due to the tedious and time-consuming manual inspection, all the selected CUDA projects are analyzed within the most recent 1000 commits or all of them if there are fewer than 1000 commits. As a result, we collected a total of 319 real-world CUDA bugs. Note that since CUDA programs are numeral-computation-oriented, they and their bug patterns appear to be converged as stated in the following sections. To the best of our knowledge, we conduct the most extensive study for CUDA program bugs to date.

3.2. Bug Taxonomy

To understand the features of CUDA program bugs, we investigate CUDA program bugs in the following dimensions:

3.2.1. Runtime stage

Runtime stages refer to the life-cycle stages of running CUDA programs, including host resource preparation, kernel function execution, and host resource retrieve, as mentioned in Section 2. In particular, “kernel function execution” tends to be more vulnerable to bugs than other runtime stages by involving 217 bugs out of 319 in total (217/319 = 68%), while “host resource retrieve” takes up 11% by involving 34 bugs and “host resource preparation” takes up 21% by involving 68 bugs. In this paper, we focus on studying the runtime stage “kernel function execution” because programs in this stage are GPU-specific.

3.2.2. Bug root cause

The root causes of CUDA program bugs can be grouped into five categories as follows. The detailed statistics for the root causes and their corresponding bug symptoms can be found in Table 2.

a. Improper resource management. This root cause refers to the bugs triggered when utilizing and managing memory and GPUs improperly. Such root cause widely spreads in the life cycle of CUDA programs and is associated with all the bug symptoms except flaky test. In particular, it includes the buggy scenarios such as incorrect device resource allocation, memory leak, early device call reset, and unauthorized memory access. From Table 2, we can observe that improper resource management takes up 14% (31/217) among all the root causes. We can also notice that improper resource management is the major root cause for crash (i.e., 73%) because the memory issues incurred by improper resource management can result in possible fatal errors to crash programs.

b. Non-optimal implementation. This root cause refers to the implementation which accomplishes the functional requirements with lossy performance. It is often associated with test failure and inferior performance for various reasons, e.g., improper data type, outdated library functions, branch divergence in kernel functions. Non-optimal implementation takes up 8% (17/217) among all the root causes.

c. Generic error. This root cause refers to the ones that occur in any platform or any programming language, such as range-checking errors, inappropriate exception handling, scope errors, and other implementations that cannot accomplish given functional requirements. This root cause also widely spreads across all the life cycle of kernel functions. It can be observed that generic error is the major root cause of both test failure (94%) and all the bugs (63%).

d. Improper synchronization. This root cause is based on three CUDA-specific synchronization issues: data race, barrier divergence, and redundant barrier functions. Data race refers to when multiple threads “read&write” or “write&write” to the same memory address at the same time, the kernel functions may return different results in multiple executions even under the identical environmental setups. Barrier divergence leads to undefined behaviors while threads in the same block cannot reach the same barrier function. Redundant barrier function refers to that no data race exists by removing barrier functions. Figures 2, 3, and 4 can be referred to for better illustration. Note that other issues such as improper-implementation-caused synchronization are also included here. Overall, Improper synchronization takes up 12% (27/217) among all the root causes. Due to the nature of such bugs, inferior performance and flaky test (Luo et al., 2014) (i.e., the tests with non-deterministic outcomes) are the main bug symptoms.

e. Poor portability. This root cause refers to the issues that relate to certain platform specifics, such as operating systems or hardware platforms. For instance, on vs2013, bulding “mshadow” (dmlc, [n.d.]) needs one additional step before calling “__half2float”; otherwise the building would fail (Revision no.51a8a7e3e5” of the project “mshadow”). Poor portability is the most rare root cause for all the CUDA bugs (3%).

width= Root Causes Crash Test failure Inferior performance Flaky test Sum Improper resource management 22 2 7 0 31 Non-optimal implementation 0 7 10 0 17 Generic error 3 132 1 0 136 Improper synchronization 0 0 10 17 27 Poor portability 5 0 0 1 6 Sum 30 141 28 18 217

Table 2. Root Causes and Bug Symptoms for CUDA Kernel

3.3. Bug Impact and Detection Effort

Our study derives the following findings on bug impact and detection effort.

3.3.1. The majority of the kernel function bugs are not CUDA-specific or single-instruction-multiple-data (SIMD) only and can take place in other platforms and be detected by traditional approaches

As mentioned before, generic error is the major bug type among all. In particular, for the most of these bugs caused by generic error that do not exclusively arise only in CUDA, it is routine to design test cases similar for generic software programs. To conclude, it can be implied that to detect these bugs, we can use traditional approaches as for other program types without designing new approaches (Jones and Harrold, 2005; Goues et al., 2012).

3.3.2. The majority of the memory-related bugs can be solved by traditional CPU-based approaches

In this paper, the memory-related bugs mainly refer to crash bugs and the others caused by improper resource management. Among the total 36 bugs regarding global and shared memory, memory leak (13) and unauthorized memory access (20) are the major ones.

Most of the CUDA programs rely on the injected grid or block dimensions to determine the thread-wise memory-access range. If there are computing errors or improper memory-access ranges, it tends to cross borders and leads to incorrect programming outputs or even program crashes. To detect such bugs, when kernel functions are launched, it is applicable to determine whether a statement has cross-border access according to whether the thread-wise memory-access range of that statement are intertwined or violates the preset legitimate global memory-access range, e.g., Revision no.97cca6c0ff6” from the project “kaldi” and Revision no.ba19743bb6” from the project “arrayfire”. Some CPU-based approaches, such as (Gao et al., 2015), can be studied for resolving such problems.

3.3.3. Detecting synchronization bugs is important and time consuming

We believe synchronization bugs have significant impacts on CUDA programs according to the following four reasons: (1) the occurrence of the synchronization bugs is non-negligible (27/217 = 12.4%); (2) such bugs are hard to be reproduced and could increase the difficulties of testing and debugging (shown in Table 2, such bugs usually incur performance issues and flaky tests, both of which are hard to diagnose); (3) the synchronization bugs are tightly connected with CUDA specifics that are hard to be detected by the existing CPU-based approaches, while there exist traditional approaches can be adopted for detecting the majority of the other bugs, e.g., generic error; (4) some synchronization bugs, e.g., data race, can easily taint the computation results or even shelter other bug types to render them more challenging to be captured.

Figure 5. Synchronization bug impacts

In this paper, the impacts of synchronization bugs are measured by time effort which is defined as the time window between the commit where the buggy code was firstly introduced and the commit where the bug was fixed. From Figure 5, it can be observed that a number of “flaky test” synchronization bugs have been existing in the program for a long time (i.e., median 144 days, averagely 180 days, 11 bugs out of 17 over 100 days). Therefore, it can be concluded that the synchronization bugs are hard to be detect and fix.

Among all the root causes regarding CUDA kernel functions, improper synchronization is challenging and time-consuming to be detected and requires new detection approaches.

3.4. Bug Detection Motivation and Possibility

It can be inferred from the previous study that synchronization bugs have a significant impact on executing CUDA kernel functions. Recently, some compiler-based approaches have been proposed to detect CUDA synchronization bugs, such as CURD (Peng et al., 2018b), BARRACUDA (Eizenberg et al., 2017). Typically, these approaches link the detectors to the applications in the compiling stage and check the bugs in runtime process. However, they are limited by not being “fully automatic”—users have to provide error-inducing inputs manually. On the other hand, given inferior inputs, the synchronization bugs might not be triggered and detected. Moreover, it can be expensive since such runtime detection demands compiling process and GPU computing environment. Other automatic synchronization bug detection approaches, e.g., GKLEE (Li et al., 2012b), apply static/dynamic analysis and could lead to poor runtime performance on real-world projects. In addition, most of them (Li et al., 2014a) (rac, 2019) are designed only for certain bug types, e.g., data race, while they are limited in detecting other bug types (e.g., barrier divergence and redundant barrier function). Therefore, it is essential to develop a new approach to efficiently detect a wide range of synchronization bugs.

It can be observed from our study that traditional test cases are able to capture various bug types except CUDA synchronization bugs. On the other hand, it can be inferred that with ideal test cases, synchronization bugs are expected to be captured easily. Such test cases should differ from traditional test cases which only deliver runtime program output. Instead, they should deliver the information that can help capture the synchronization occurrence, e.g., the deterministic thread-wise memory visit order. On this purpose, considering that it is hard to capture such information by directly running CUDA programs, a possible idea is to design a virtual machine that can offload runtime GPU programming to offline CPU programming for better observing runtime information in kernel functions and set a mechanism to collect and analyze them. Accordingly, error-inducing grids and blocks based on their contexts can be approached in a fully automated manner to detect synchronization bugs.

This idea is applicable due to the following reasons: (1) the sophisticated parallel computing model of CUDA allows collecting various runtime information without reducing significant runtime performance, and (2) the entire process is essentially simulating runtime CUDA program with collecting runtime information which is expected to be as closely efficient as simply running the original programs.

Figure 6. An Example to Illustrate the Possibility to Automatically Detect Bugs

Figure 6 is an example from the commit of latest version of the project “kaldi”. For this bug, normally the developers have little clue about whether this piece of code has synchronization bugs because regular test cases may pass during many runs. Our intuition in designing specific test cases for synchronization bugs might detect synchronization bugs in this case by setting up initial environment for “_copy_from_mat”. If they set the grid dimension to be 1, the block dimension to be (3, 2), “


” to be 1, “d_out.stride” to be 1, “d_out.rows” to be 5, and “d_out.cols” to be 5, then thread (0 1 0) and thread (1 0 0) will report “write & write” data race at Line 10, indicating that when “d_out.stride” is smaller than “d_out.cols”, kernel functions should instantly raise an exception that “stride” should be larger than “cols” instead of executing all the code for a long time. Eventually, the problem of automatically detecting such synchronization bugs can be transformed to the problem of automatically generating the error-inducing grid and block dimensions.

It is possible to design a general automated framework to detect CUDA synchronization bugs as long as tests (e.g., error-inducing grid and block dimensions) can be automatically approached to collect and analyze the corresponding memory-access information.

4. Framework of Simulee

In this section, we introduce Simulee, an automatic tool to detect real-world CUDA synchronization bugs. Typically, Simulee takes llvm bytecode translated from CUDA kernel function programs, automatically generates the associated error-inducing inputs, and yields Memory Model to detect synchronization bugs. Specifically, Simulee is composed of two parts—“Automatic Input Generation” and “Memory Model-based Bug Detection”. “Automatic Input Generation” is initialized by inputting the llvm bytecode of CUDA kernel function programs. Next, it slices the memory-access statements (e.g., read and write statements) and inputs them for Evolutionary Programming (Fogel, 1999). Subsequently, Evolutionary Programming helps generate error-inducing environmental setups by iteratively mutating and sorting dimensions and arguments and passes the acceptable ones to “Memory Model-based Bug Detection”. At last, “Memory Model-based Bug Detection” simulates runtime environment by constructing Memory Model and using it to detect whether there are synchronization bugs. The details can be referred to in Figure 7.

Figure 7. Framework of Simulee

4.1. Automatic Input Generation

Generating error-inducing inputs is essentially equivalent to generating the inputs that can lead to the memory-access conflicts among threads to improve the possibility of CUDA synchronization bug occurrences. However, how to automatically generate such error-inducing inputs remains challenging. Some intuitive solutions, such as random generation or coverage-oriented generation might be limited in effectiveness and efficiency, since they are not specially designed for triggering memory-access conflicts. In this section, we introduce how Simulee automatically generates error-inducing inputs for detecting CUDA synchronization bugs in an effective and efficient manner.

4.1.1. Intuition

An effective and efficient automatic approach to generate error-inducing inputs for triggering CUDA synchronization bugs indicates to generate as many memory-access conflicts as possible within a short time limit. Given the ith memory address and the kernel function inputs, i.e., grid and block dimensions and arguments, is defined as the number of threads that access the ith memory address while g(i) is a function that returns 1 when the ith memory address is accessed by any thread and returns 0 otherwise. [start, end] denotes the memory-access range. An intuitive target function can be presented in Equation 1 which denotes the ratio of the total number of the accessed memory addresses to the total number of the memory-access threads:


It can be derived that the max value of is 1 which denotes that there is no memory-access conflict between any thread pair. On the other hand, the smaller is, the higher chance the memory-access conflict takes place. Therefore, can be used for optimization to obtain error-inducing inputs that trigger CUDA synchronization bugs. Note that since is discrete, we choose Evolutionary Programming (Fogel, 1999) as our optimization approach.

4.1.2. Framework

The framework of “Automatic Input Generation” is presented in Algorithm 1. First, Simulee randomly initializes arguments and dimensions to create and sort individual solutions for evolving (Lines 3 to 7). In each generation, each solution is mutated to generate two children, which are added to the whole population set (Lines 8 to 14). Next, the population winners survive for the subsequent iterations (Lines 15 to 16). The iterations can be terminated once it finds an acceptable solution. Otherwise, after completing the iterations, it returns the optimal solution by then.

Initial Solutions. The initial dimensions and arguments are randomly generated and passed to fitness functions as initial solutions for future evolution. Note that the dimensions can be extracted from kernel functions. For instance, if a kernel function has “threadIdx.x” and “threadIdx.y”, it means the block is two-dimensional.

Fitness Function. Equation 1 is chosen as the primary fitness function for Evolutionary Programming. Specifically, the output of is the fitness score for a solution of dimensions and arguments in Evolutionary Programming. However, it is difficult to derive an optimal solution of dimensions and arguments by only optimizing . In particular, since is non-differentiable when the gradient does not exist, it is hard to find an optimal solution given the set of inferior solutions, e.g., all the solutions of are “1”s. To tackle such inferior solutions, we design a secondary fitness function such that they are sorted according to their possibility to be optimal: . In particular, it indicates that a smaller memory-access range leads to a higher possibility of memory-access conflict. As a result, we define fitness score of the primary fitness function as primary score, and the fitness score of the secondary fitness function as secondary score. During the population evaluation, the primary score is sorted first; if and only if the top-ranked primary score is 1, the secondary score is sorted to decide which solution is more likely to converge to the minimum of .

Input : population, generation

Output: acceptable arguments and dimensions

2:     population_lst list()
3:     for i in population do
4:          single_solution InitialSolution()
5:          single_score fitness(single_solution)
6:          population_lst.append([single_solution, single_score])      
7:     sort_by_score(population_lst)
8:     for i in generation do
9:          child_lst list()
10:          for solution in population_lst do
11:               children_solutions mutation(solution)
12:               new_scores fitness(children_solutions)
13:               child_lst.append([children_solutions, new_scores])           
14:          population_lst.merge(child_lst)
15:          sort_by_score(population_lst)
16:          population_lst population_lst[:population]
17:          if population_lst[0] acceptable then
18:               return population_lst                
19:     return population_lst
Algorithm 1 Framework for Automatic Input Generation

Mutation. In Simulee, solutions are generated by mutation, where each solution generates two children in one generation. Specifically, arguments and dimensions are independent from each other during mutation with respective mutation strategies. The mutate strategy for dimensions is trivial: first, Simulee

randomly generates an integer vector ranging from -1 to 1 according to the dimension size; next, the child’s dimension is mutated by summing the parent’s dimension and the generated integer vector.

The details of the mutation strategy for arguments is presented in Algorithm 2. Since the memory-access-relevant arguments are numbers, Simulee considers them as float numbers and converts them back to the actual types when executing

. Accordingly, each generation generates two children: one adds a random number generated by standard Normal Distribution 

(nor, 2019)

to the arguments inherited from the parent solution, and the other adds a random number generated by standard Cauchy Distribution 

(cau, 2019) to the arguments inherited from the parent solution. We define the search step length of the arguments as the absolute value of the number generated from the two aforementioned distributions, with expected values shown in Equations 2 and 3.


We next explain why we apply the above two distributions. It can be observed from Equations 2 and 3 that, the step length generated from standard normal distribution is expected to be small. That indicates that if there is an optimal solution nearby, the generated child is likely to approach it. On the contrary, the step length generated from standard cauchy distribution is expected to be large. That indicates that if there is an inferior solution nearby, the generated child is likely to escape from it.

Acceptable Function. The acceptable function is used to terminate the whole process given an acceptable solution. In our work, the acceptable solution is defined as that primary score is smaller than 0.3.

To conclude, by applying Evolutionary Programming, Simulee is expected to deliver error-inducing grid and block dimensions and arguments that lead to memory-access conflicts and trigger CUDA synchronization bugs.

Input : parent

Output: normal_solution, cauchy_solution

2:     normal_solution copy(parent)
3:     cauchy_solution copy(parent)
4:     for  argument in parent do
5:          normal_solution[argument] parent[argument] + normal()
6:          cauchy_solution[argument] parent[argument] + cauchy()      
7:     return normal_solution, cauchy_solution
Algorithm 2 Mutating Arguments

4.2. Memory-based Synchronization Bug Detection

With the auto-generated error-inducing inputs, the synchronization bug detection of Simulee is established on building a Memory Model that depicts thread-wise memory-access instances. Based on the Memory Model, Simulee develops a set of criteria to detect synchronization bugs including data race, redundant barrier functions, and barrier divergence.

4.2.1. Memory Model

The Memory Model accessed by the kernel functions is defined to be composed of a set of Memory Units where each Memory Unit corresponds to a memory address and is composed of a set of Unit Tuples. A Unit Tuple is defined as a three-dimensional vector space visit_order, thread_id, action, where visit_order represents the visit order to the associated memory address from different threads, thread_id represents the indices of such threads, and action refers to the read or write action from those threads.

An example of Memory Unit is demonstrated in Figure 8 with four Unit Tuples 0, (1 0 0), read, 0, (2 0 0), read, 0, (3 0 0), read, and 1, (3 0 0), read where threads (1,0,0), (2,0,0), and (3,0,0) read the same memory address in the same visit_order since none of them have reached any barrier function before they read. Assume all the threads reach a barrier function later and thread (3 0 0) reads, the visit_order is then incremented from 0 to 1 for thread (3 0 0) and the other threads afterwards.

4.2.2. Memory Model Construction

Since Memory Model is only associated with barrier functions and memory-access statements, it is applicable to detect synchronization bugs by obtaining such statements and then extracting/analyzing the memory-access information instead of executing the complete CUDA programs on GPUs, i.e., simulating the execution of CUDA kernel function programs. This simulation process is initiated by inputting the auto-generated block and grid dimensions and arguments passed to the kernel functions. Next, it constructs the Memory Unit for each memory address.

Figure 8. Memory Unit Example

The overall Memory Model construction is demonstrated in Algorithm 3. In particular, the algorithm is launched to initialize the block and grid dimensions as well as the global and shared memory for each thread (Lines 2 to 5). Next, for each block, the shared memory (Line 7) and the thread-wise visit_order for each global and shared memory address (Lines 8 to 9) are initialized. If there are still some unterminated threads, for all of them, their corresponding Memory Units are derived based on the collected parameters, e.g., global_mem and visit_order_global (Lines 10 to 14). The construction of the thread-wise Memory Units for shared memory and global memory are completed if there is no running thread left (Lines 15 to 16).

Algorithm 4 illustrates the details of Memory Model construction for a single thread. Specifically, given a running thread and the parameters passed by Algorithm 3 (Lines 2 to 4), Algorithm 4 is initialized by detecting whether the current statement is the end of file. If so, the thread would be terminated. If there is any thread halting afterwards, we can confirm there is a “barrier divergence” bug because that indicates at least a thread has not reached the barrier function where the other threads of the same block all have completed their tasks and left (Lines 5 to 9).

If the current statement calls barrier function and all the other threads have reached the same barrier function, the visit_order for both global and shared memory would be incremented if they have been visited before (Lines 10 to 13), since it indicates that all the threads in one block have visited the current memory address and the subsequent visits would demand a new barrier function. On the other hand, if the current statement does not call barrier function, the corresponding visit_order and the action of the associated thread is recorded to construct the Memory Model (Lines 14 to 21).

Input : grid_dim, block_dim, arguments

Output: Memory Model

2:     BLOCKS, generate_from_dimension(grid_dim)
3:     THREADS generate_from_dimension(block_dim)
4:     global_mem [MemoryUnit() for i in range(global_size)]
5:     shared_mem_lst list()
6:     for blk in BLOCKS do
7:          shared_mem [MemoryUnit() for i in range(shared_size)]
8:          visit_order_global [0 for i in range(global_size)]
9:          visit_order_shared [0 for i in range(shared_size)]
10:          while has_unterminated_thread() do
11:               for t in THREADS do
12:                    env Environment(arguments)
13:                    PROCESS_THREAD(t, global_mem, shared_mem,
14:                    visit_order_global, visit_order_shared, env)                          
15:          shared_mem_lst.append(shared_mem)      
16:     return global_mem, shared_mem_lst
Algorithm 3 Memory Model construction

4.2.3. Memory-Model-based Detection Mechanism

The design of Memory Model can be used in Simulee to detect CUDA synchronization bugs, i.e., data race, redundant barrier function, and barrier divergence.

Data Race. In general parallel computing programs, a possible data race takes place when multiple threads access the identical memory address in the same visit order and at least one of them writes. Specifically in CUDA kernel functions, besides the generic circumstances, a data race also takes place when (1) the threads are from different thread warps, or (2) the threads from the same thread warp underwent branch divergence, or (3) the threads from the same thread warp without undergoing branch divergence write to the same memory address by the same statement. By combining the data race detection criteria above and the design of Memory Model, Simulee can detect data race in CUDA kernel functions as described in Theorem 4.1.

Theorem 4.1 ().

Given two Unit Tuples and from the identical Memory Unit, a data race between them takes place if the conditions below are met:

  • [visit_order] = [visit_order]

  • [thread_id] != [thread_id]

  • [action] = ‘write’ or [action] = ‘write’

when the threads of and are (1) from different thread warps or (2) executing the “write” action on the same statements in the same thread warp or (3) underwent branch divergence before the current “write” action.

Redundant Barrier Function. A redundant barrier function indicates that no data race can be detected by removing that barrier function. In CUDA kernel functions, the visit_order is incremented for one Unit Tuple when at least one thread reaches a barrier function. In other words, two Unit Tuples with adjacent visit_order in one Memory Unit indicates the presence of a barrier function, shown in Figure 8. Therefore, to detect whether a barrier function is redundant or not, it is essential to collect all the associated Unit Tuples and analyze whether they together would lead to data race. The barrier function is defined to be redundant if no data race can be detected among such Unit Tuples.

The details of how to detect data race and redundant barrier function based on Memory Model are presented in Algorithm 5. For each Memory Unit, to detect data race, Simulee first groups the Unit Tuples with the same visit_order. For all the Unit Tuples in one group, Simulee checks whether any Unit Tuple has data race with others according to Theorem 4.1 (Lines 4 to 16). To detect redundant barrier function of one Memory Unit, Simulee extracts its visit_order and groups all the Unit Tuples with adjacent visit_order to find out whether any data race can take place (Lines 18 to 22). If there is no data race, Simulee identifies the associated barrier function and increments its recorder by 1 (Lines 23 to 24). At last, it checks whether the total recorder number matches the total number of the changing visit_order caused by that barrier function which can be obtained after constructing the Memory Model. This barrier function is redundant if the two numbers are equivalent (Lines 25 to 28).

Barrier Divergence. As mentioned in Section 4.2.2, barrier divergence can be detected during constructing Memory Model when there is any halting thread after the current execution is terminated, because it indicates that there is at least one thread which has not reached the barrier function while the others have already left.

To conclude, Simulee first applies Evolutionary Programming to generate error-inducing grid and block dimensions and arguments. Next, Simulee inputs such dimensions and arguments to construct Memory Model that delivers thread-wise memory-access information. Eventually, such information, along with the CUDA synchronization bug detection mechanism, are used to detect whether there exists any CUDA synchronization bug.

Input : thread, global_mem, shared_mem, visit_order_global,                  visit_order_shared, env


2:     if should_halt() or is_finished() then
3:          return      
4:     cur_stmt env.get_next_IP()
5:     if cur_stmt.is_EOF() then
6:          thread.finish()
7:          if has_halt_threads() then
8:               return BARRIER_DIVERGENCE           
9:          return      
10:     if cur_stmt.is_syncthreads() then
11:          if all threads reach same barrier then
12:               update_current_visit_order(visit_order_shared)
13:               update_current_visit_order(visit_order_global)           
14:     else
15:          is_global, mem_index simulate_execute(cur_stmt, env)
16:          if is_global then
17:               index visit_order_global[mem_index]
18:               update_memory_model(global_mem, mem_index, index)
19:          else
20:               index visit_order_shared[mem_index]
21:               update_memory_model(shared_mem, mem_index, index)                
22:     return
Algorithm 4 Thread Processor

Input : memory_model, changing_visit_order_number


2:     DATA_RACE False
3:     REDUNDANT_BARRIERS = dict()
4:     for memory_unit in memory_model do
5:          for visit_order in memory_unit do
6:               tuples get_tuples_by_order(visit_order)
7:               for thread performing write in tuples do
8:                    other_ts get_different_threads(thread, tuples)
9:                    for t in other_ts do
10:                         if in_same_warp(t, thread) then
11:                              if using_same_stmt(t, thread) then
12:                                   DATA_RACE True                               
13:                              if has_branch_divergence(t, thread) then
14:                                   DATA_RACE True                               
15:                         else
16:                              DATA_RACE True                                                                       
17:          barrier_dict = dict()
18:          for visit_order in memory_unit do
19:               next_order visit_order + 1
20:               current get_tuples_by_order(visit_order)
21:               target get_tuples_by_order(next_order)
22:               if can_merge_without_race(target, current) then
23:                    barrier get_split_barrier(next_order, memory_unit)
24:                    barrier_dict[barrier] ++                               
25:     for barrier in barrier_dict do
26:          REDUNDANT_BARRIERS[barrier]
27:          is_redundant(barrier_dict[barrier],
28:          changing_visit_order_number[barrier])      
Algorithm 5 Memory Model-based Detection

5. Experiment

We have extensively evaluated the effectiveness and efficiency of Simulee in terms of detecting synchronization bugs:

  • We choose all the synchronization bugs found in the 5 studied projects and apply Simulee to detect them.

  • We use Simulee to detect new synchronization bugs of the 5 studied projects and 4 additional projects according to their history (i.e., still actively maintained) and recent popularity (i.e., ¿ 100 stars), i.e., CudaSift (Celebrandil, [n.d.]) (241 stars, 103 commits, 2.4K LoC), CUDA-CNN (zhxfl, 2019) (111 stars , 247 commits, 12K LoC), cudpp (cudpp, [n.d.]) (202 stars, 302 commits, 58K LoC), gunrock (gunrock, [n.d.]) (483 stars, 1467 commits, 7.8K LoC).

  • We compare Simulee against the open-source automatic CUDA bug detection tool GKLEE in terms of their effectiveness and efficiency of detecting previously unknown bugs for all the projects.

5.1. Experimental Environment and Setup

We performed our evaluation on a desktop machine, with Intel(R) Xeon(R) CPU E5-4610 and 320 GB memory. The operating system is Ubuntu 16.04. For Evolutionary Programming of “Automatic Input Generation” in Simulee, the population is set to be 50, and the generation is set to be 3. Note that the Simulee webpage (sim, 2019) includes more experimental results under different settings for Evolutionary Programming.

5.2. Result Analysis

5.2.1. Effectiveness

First, we applied Simulee to detect the synchronization bugs of the 5 studied projects. The experimental results suggest that Simulee can successfully detect 20 out of 27 synchronization bugs, including 12 data race bugs, 6 barrier divergence bugs, and 2 redundant barrier functions. The bugs that Simulee fails to detect are caused by lossy implementation logic which can be fixed only by completely refactoring the overall code structure, including abandoning the synchronization mechanisms, e.g., Revision no.b3e927edcb” from the project “arrayfire”.

Next, we further applied Simulee to detect previously unknown synchronization bugs on the total 9 projects. In addition to the five studied projects, we adopted another 4 CUDA projects, i.e., CudaSift, CUDA-CNN, cudpp and gunrock. The experimental results are shown in Table 3, where it can be observed that we successfully detected 26 bugs in total (TT), including 15 data race bugs (DR), 1 barrier divergence bug (DB), and 10 redundant barrier function bugs (RB). To date, 8 redundant barrier function bugs, 1 data race bug and 1 barrier divergence bug have been confirmed by the corresponding developers. To be specific, the developers of CudaSift and cudpp responded as follows:

“Yes, there is a bit of cleaning up to do there. Sometimes when I detect oddities in the output, I add an unnecessary synchronization just in case. In fact those things should be all run on the same thread, since it cannot be parallelized anyway. Thank you for pointing it out.” — CudaSift
“I think you’re right… There are considerably faster ways to do matrix multiply calls…” — cudpp

Since barrier divergence is a undefined behavior, it may not hang on every situation. The developers of gunrock responded as follows:

“I do see what @Stefanlyy / @eagleShanf mean for the divergence issue, and surprise the code didn’t hang.”

width= Projects Detected Confirmed Under Discussion Nonresponse TT DR RB BD TT DR RB BD TT DR RB BD TT DR RB BD kaldi 11 10 1 0 1 0 1 0 6 6 0 0 4 4 0 0 thundersvm 4 4 0 0 0 0 0 0 4 4 0 0 0 0 0 0 CudaSift 4 0 4 0 4 0 4 0 0 0 0 0 0 0 0 0 CUDA-CNN 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 cudpp 3 0 3 0 3 0 3 0 0 0 0 0 0 0 0 0 gunrock 3 1 1 1 2 1 0 1 1 0 1 0 0 0 0 0 Total 26 15 10 1 10 1 8 1 11 10 1 0 5 4 1 0

Table 3. Bugs Detected

In addition, 11 bugs, including 10 data race bugs and 1 redundant barrier function bug, are still being actively discussed by developers since some developers assumed users should acquire the prior knowledge to set the dimensions and arguments correct. We label such bugs “under discussion” as stated in Table 3. For example, the developers of kaldi responded as follows:

“You could do that. Normally, though, CUDA does not expect reduction operations to be called with more than one block. So something like assert(gridDim.x == 1) would be better.”

In order to further evaluate the effectiveness of Simulee, we compare its capability in finding previously unknown bugs out of the 9 projects against GKLEE (Li et al., 2012b) which is concolic-execution-based CUDA bug detector. Since GKLEE is only designed to detect data race bugs, Table 4 shows the results for all the kernel functions from all the 9 projects that both Simulee and GKLEE are applicable to and the GKLEEs test-sample kernel function. In particular, the top one entry in Table 4 is GKLEE’s test-sample kernel function, the bottom two are the kernel functions without synchronization bugs, and the rest are detected with bugs by Simulee111Please refer to Simulee webpage (sim, 2019) for the detailed bugs and kernel functions..

From Table 4, it can be observed that Simulee can correctly detect all the bugs while GKLEE can only correctly detect one bug which is actually from its own test sample. The reasons why Simulee performs better than GKLEE can be inferred as follows. Simulee is built with a robust mechanism for bug detection, e.g., automatic setup for running environment, sufficient collection of runtime information, and simple yet complete bug detection mechanism. GKLEE, on the other hand, suffers from severe path explosion problem such that it has to adopt pruning techniques for better efficiency while risking wrongly pruning buggy branches under some circumstances, e.g., complicated programs with loops. Also, GKLEE uses the generation strategy based on traditional code coverage. In cases that code coverage cannot relate to memory access conflicts, the test cases generated by GKLEE may not converge to a setting that can trigger synchronization bugs.

5.2.2. Efficiency

To evaluate the efficiency of Simulee, we also compare it against GKLEE on the 5 projects in terms of the runtime latency in detecting these bugs with the timeout set to be 6 hours. From Table 4, it can be observed that Simulee can detect all the bugs while GKLEE can only detect one bug that is from its own testing samples and timed out on 4 bugs. In other words, GKLEE cannot detect any of the real-world bugs collected in our study. Specifically, for the four bugs that GKLEE does not time out for, Simulee is slightly slower than GKLEE due to Evolutionary Programming, but is still able to finish within seconds. On the other bugs, GKLEE suffers from severe path explosion problems and takes long time before enumerating all the possible executions especially when it comes to complex programs with loops, while Simulee is still able to finish analysis within seconds. From the experimental results, we can conclude that Simulee is a lightweight and scalable detection framework that can efficiently detect various synchronization bugs.

width=0.45 Kernel function GKLEE time Simulee time GKLEE report Simulee report 1 343ms 627ms r&w sync r&w sync 2 243ms 1895ms no sync w&w sync 3 486ms 1593ms no sync w&w sync 4 507ms 1938ms no sync w&w sync 5 655ms 1472ms no sync w&w sync 6 Timeout 6253ms N/A r&w sync 7 Timeout 7897ms N/A w&w sync 8 Timeout 2028ms N/A w&w sync 9 Timeout 1586ms N/A w&w sync 10 Timeout 3753ms N/A no sync 11 Timeout 4380ms N/A no sync

Table 4. GKLEE vs Simulee

6. Threats to Validity

The threats to external validity mainly lie in the subjects and faults. Though the studied projects may not represent the overall project distributions, they are selected such that the overall covered features of the CUDA projects can be maximized. On the other hand, the way that the studied bugs are derived by analyzing the commit messages may cause some false positives. In order to reduce such threat, we collect a large number of real bugs (319), which is much more than the closely related work, such as 175 in (Zhang et al., 2018) and 70 in (Liu et al., 2014). To our best knowledge, this is so far the largest study on CUDA bugs.

The threats to construct and internal validity may lie in the different understanding towards the definition on the bug symptoms. Some symptoms might not appear to be problematic to somebody due to one’s different standpoint. For instance, some developers might feel tolerant towards inferior performance issues and ignore the effectiveness of Simulee when it can detect them. To reduce such threat, we design two rounds of filtering to try our best to make sure that the derived bugs indeed reflect the errors that programs undergo. Next we manually check both the description and the source code to clearly understand the causes. In the end, we compare the derived symptoms against the corresponding commit messages to make sure that each bug falls into the correct category. Moreover, the feedback from the developers on our bug report submissions also appear to be supportive on our bug understandings.

7. Related Work

As our work investigates the automatic bug detection techniques for CUDA programs through empirical studies, we summarize the related work into two parts: empirical studies on CUDA programs and techniques of CUDA bug detection.

Empirical Studies There are several existing work that study bugs and other features on CUDA programs. For instance, Yang et al. (Yang et al., 2012) delivered the empirical study on the features of the performance bugs on CUDA programs, Burtscher et al. (Burtscher et al., 2012) studied the control-flow irregularity and memory-access irregularity and found that both irregularities are mutually dependent and exist in most of kernels. Che et al.(Che et al., 2008) examined the effectiveness of CUDA to express with different sets of performance characteristics. Some researchers are keen on the comparisons between CUDA and OpenCL. For instance, Demidov et al. (Demidov et al., 2013) compared some C++ programs running on top of CUDA and OpenCL and found that they work equally well for problems of large size. Du et al. (Du et al., 2012), on the other side, studied the discrepancies in the OpenCL and CUDA compilers’ optimization that affect the associated GPU computing performance.

CUDA bug detection Several approaches that detect CUDA bugs are static/dynamic-analysis-based (Li et al., 2012b)(Pereira et al., 2016)(Eizenberg et al., 2017)(Li et al., 2014b)(Li and Gopalakrishnan, 2010)(Betts et al., 2012)(Chong et al., 2013)(Bardsley et al., 2014). Though they can be effective, they are also argued to be time costly (Peng et al., 2018a). A lot of the research concentrate on detecting the specific data race bugs. In addition to many aforementioned work, LDetector (Li et al., 2017) instrumented compiler to detect races by using diffs between memory snapshots. Boyer et al. (Boyer et al., 2008) detected data race on GPU emulators instead on real GPU hardware. Many tools have been developed to inspect CUDA programs. For instance, GKLEE (Li et al., 2012b)

employed concolic execution-based verification and test-case reduction heuristics for CUDA program detections. It was scaled as using the technique of Parameterized Flows 

(Li et al., 2012a).

One closely-related work with Simulee is a test-amplification-based bug detection approach (Leung et al., 2012) that amplified the result of a single running test to combine it with static analysis such that the set of all inputs and interleavings could be verified. Though the idea of injecting testing philosophy into CUDA programs is similar with Simulee, Simulee advances in (1) it is a general-purpose and fully automated bug detection framework that can detect various synchronization types while (Leung et al., 2012) only handles data race and requires manual inputs; (2) Simulee only needs to run the code relevant to kernel function execution, while (Leung et al., 2012) needs to run the whole program life cycle which leads to much larger overhead. (3) (Leung et al., 2012) suffers from the incorrect input regarding synchronization and loss of effectiveness while Simulee does not have these limitations.

8. Conclusions

In this paper we conduct an extensive study on CUDA program bugs. It can be concluded from the study results that the bugs occur mostly in kernel functions, where the cross-platform generic errors are the major bug symptoms. It can also be observed that the synchronization bugs can be extremely challenging to handle. Therefore, we develop a fully automated approach, namely Simulee, that can successfully detect synchronization bugs efficiently based on the auto-generated running environment. Specifically, Simulee can detect most of the synchronization bugs out of the studied projects, and even successfully detected 26 previously unknown bugs which have never been reported/detected before. In addition, Simulee can achieve better effectiveness and efficiency than GKLEE.


  • (1)
  • cau (2019) 2019. Cauchy distribution.
  • cud (2019) 2019. CUDA program introduction.
  • gpg (2019) 2019. GPGPU introduction.
  • nor (2019) 2019. Normal distribution.
  • rac (2019) 2019. Racecheck Tool.
  • sim (2019) 2019. The Simulee project.
  • akrizhevsky (2019) akrizhevsky. 2019. cuda-convnet2.
  • arrayfire (2019) arrayfire. 2019. ArrayFire.
  • Bardsley et al. (2014) Ethel Bardsley, Adam Betts, Nathan Chong, Peter Collingbourne, Pantazis Deligiannis, Alastair F. Donaldson, Jeroen Ketema, Daniel Liew, and Shaz Qadeer. 2014. Engineering a Static Verification Tool for GPU Kernels. In Proceedings of the 16th International Conference on Computer Aided Verification - Volume 8559. Springer-Verlag, Berlin, Heidelberg, 226–242.
  • Betts et al. (2012) Adam Betts, Nathan Chong, Alastair Donaldson, Shaz Qadeer, and Paul Thomson. 2012. GPUVerify: A Verifier for GPU Kernels. SIGPLAN Not. 47, 10 (Oct. 2012), 113–132.
  • Boyer et al. (2008) M. Boyer, K. Skadron, and W. Weimer. 2008. Automated Dynamic Analysis of CUDA Programs. In Third Workshop on Software Tools for MultiCore Systems.
  • Burtscher et al. (2012) M. Burtscher, R. Nasre, and K. Pingali. 2012. A quantitative study of irregular programs on GPUs. In 2012 IEEE International Symposium on Workload Characterization (IISWC). 141–151.
  • Celebrandil ([n.d.]) Celebrandil. [n.d.]. SIFT features with CUDA.
  • Che et al. (2008) Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, and Kevin Skadron. 2008. A performance study of general-purpose applications on graphics processors using CUDA. J. Parallel and Distrib. Comput. 68, 10 (2008), 1370 – 1380. General-Purpose Processing using Graphics Processing Units.
  • Choi et al. (2002) Jong-Deok Choi, Keunwoo Lee, Alexey Loginov, Robert O’Callahan, Vivek Sarkar, and Manu Sridharan. 2002. Efficient and Precise Datarace Detection for Multithreaded Object-oriented Programs. SIGPLAN Not. 37, 5 (May 2002), 258–269.
  • Chong et al. (2013) Nathan Chong, Alastair F. Donaldson, Paul H.J. Kelly, Jeroen Ketema, and Shaz Qadeer. 2013. Barrier Invariants: A Shared State Abstraction for the Analysis of Data-dependent GPU Kernels. SIGPLAN Not. 48, 10 (Oct. 2013), 605–622.
  • Collingbourne et al. (2013) Peter Collingbourne, Alastair F. Donaldson, Jeroen Ketema, and Shaz Qadeer. 2013. Interleaving and Lock-Step Semantics for Analysis and Verification of GPU Kernels. In Programming Languages and Systems, Matthias Felleisen and Philippa Gardner (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 270–289.
  • cudpp ([n.d.]) cudpp. [n.d.]. cudpp.
  • Demidov et al. (2013) D. Demidov, K. Ahnert, K. Rupp, and P. Gottschling. 2013. Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries. SIAM Journal on Scientific Computing 35, 5 (2013), C453–C472. arXiv:
  • Dinning and Schonberg (1990) A. Dinning and E. Schonberg. 1990.

    An Empirical Comparison of Monitoring Algorithms for Access Anomaly Detection. In

    Proceedings of the Second ACM SIGPLAN Symposium on Principles &Amp; Practice of Parallel Programming (PPOPP ’90). ACM, New York, NY, USA, 1–10.
  • dmlc ([n.d.]) dmlc. [n.d.]. Matrix Shadow.
  • Du et al. (2012) Peng Du, Rick Weber, Piotr Luszczek, Stanimire Tomov, Gregory Peterson, and Jack Dongarra. 2012. From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming. Parallel Comput. 38, 8 (2012), 391 – 407. APPLICATION ACCELERATORS IN HPC.
  • Eizenberg et al. (2017) Ariel Eizenberg, Yuanfeng Peng, Toma Pigli, William Mansky, and Joseph Devietti. 2017. BARRACUDA: Binary-level Analysis of Runtime RAces in CUDA Programs. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2017). ACM, New York, NY, USA, 126–140.
  • Fogel (1999) Lawrence J. Fogel. 1999. Intelligence Through Simulated Evolution: Forty Years of Evolutionary Programming. John Wiley & Sons, Inc., New York, NY, USA.
  • Gao et al. (2015) Q. Gao, Y. Xiong, Y. Mi, L. Zhang, W. Yang, Z. Zhou, B. Xie, and H. Mei. 2015. Safe Memory-Leak Fixing for C Programs. In ICSE, Vol. 1. 459–470.
  • Goues et al. (2012) C. Le Goues, M. Dewey-Vogt, S. Forrest, and W. Weimer. 2012. A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each. In 2012 34th International Conference on Software Engineering (ICSE). 3–13.
  • gunrock ([n.d.]) gunrock. [n.d.]. Gunrock.
  • Jones and Harrold (2005) James A. Jones and Mary Jean Harrold. 2005. Empirical Evaluation of the Tarantula Automatic Fault-localization Technique. In Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering (ASE ’05). ACM, New York, NY, USA, 273–282.
  • kaldi asr ([n.d.]) kaldi asr. [n.d.]. Kaldi.
  • Leung et al. (2012) Alan Leung, Manish Gupta, Yuvraj Agarwal, Rajesh Gupta, Ranjit Jhala, and Sorin Lerner. 2012. Verifying GPU Kernels by Test Amplification. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’12). ACM, New York, NY, USA, 383–394.
  • Li and Gopalakrishnan (2010) Guodong Li and Ganesh Gopalakrishnan. 2010. Scalable SMT-based Verification of GPU Kernel Functions. In Proceedings of the Eighteenth ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE ’10). ACM, New York, NY, USA, 187–196.
  • Li et al. (2012b) Guodong Li, Peng Li, Geof Sawaya, Ganesh Gopalakrishnan, Indradeep Ghosh, and Sreeranga P. Rajan. 2012b. GKLEE: Concolic Verification and Test Generation for GPUs. SIGPLAN Not. 47, 8 (Feb. 2012), 215–224.
  • Li et al. (2014a) Pengcheng Li, Chen Ding, and Tolga Soyata. 2014a. LDetector: A Low Overhead Race Detector For GPU Programs.
  • Li et al. (2017) Pengcheng Li, Xiaoyu Hu, Dong Chen, Jacob Brock, Hao Luo, Eddy Z. Zhang, and Chen Ding. 2017. LD: Low-Overhead GPU Race Detection Without Access Monitoring. ACM Trans. Archit. Code Optim. 14, 1, Article 9 (March 2017), 25 pages.
  • Li et al. (2012a) P. Li, G. Li, and G. Gopalakrishnan. 2012a. Parametric flows: Automated behavior equivalencing for symbolic analysis of races in CUDA programs. In SC ’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. 1–10.
  • Li et al. (2014b) Peng Li, Guodong Li, and Ganesh Gopalakrishnan. 2014b. Practical Symbolic Race Checking of GPU Programs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’14). IEEE Press, Piscataway, NJ, USA, 179–190.
  • Liu et al. (2014) Yepang Liu, Chang Xu, and Shing-Chi Cheung. 2014. Characterizing and Detecting Performance Bugs for Smartphone Applications. In Proceedings of the 36th International Conference on Software Engineering (ICSE 2014). ACM, New York, NY, USA, 1013–1024.
  • Luo et al. (2014) Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and Darko Marinov. 2014. An empirical analysis of flaky tests. In FSE. 643–653.
  • Netzer and Miller (1991) Robert H. B. Netzer and Barton P. Miller. 1991. Improving the Accuracy of Data Race Detection. SIGPLAN Not. 26, 7 (April 1991), 133–144.
  • Peng et al. (2018a) Yuanfeng Peng, Vinod Grover, and Joseph Devietti. 2018a. CURD: A Dynamic CUDA Race Detector. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2018). ACM, New York, NY, USA, 390–403.
  • Peng et al. (2018b) Yuanfeng Peng, Vinod Grover, and Joseph Devietti. 2018b. CURD: A Dynamic CUDA Race Detector. SIGPLAN Not. 53, 4 (June 2018), 390–403.
  • Pereira et al. (2016) Phillipe Pereira, Higo Albuquerque, Hendrio Marques, Isabela Silva, Celso Carvalho, Lucas Cordeiro, Vanessa Santos, and Ricardo Ferreira. 2016. Verifying CUDA Programs Using SMT-based Context-bounded Model Checking. In Proceedings of the 31st Annual ACM Symposium on Applied Computing (SAC ’16). ACM, New York, NY, USA, 1648–1653.
  • Savage et al. (1997) Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro, and Thomas Anderson. 1997. Eraser: A Dynamic Data Race Detector for Multithreaded Programs. ACM Trans. Comput. Syst. 15, 4 (Nov. 1997), 391–411.
  • Xtra-Computing ([n.d.]) Xtra-Computing. [n.d.]. THUNDERSVM.
  • Yang et al. (2012) Y. Yang, P. Xiang, M. Mantor, and H. Zhou. 2012. Fixing Performance Bugs: An Empirical Study of Open-Source GPGPU Programs. In 2012 41st International Conference on Parallel Processing. 329–339.
  • Zhang et al. (2018) Yuhao Zhang, Yifan Chen, Shing-Chi Cheung, Yingfei Xiong, and Lu Zhang. 2018.

    An Empirical Study on TensorFlow Program Bugs. In

    Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2018). ACM, New York, NY, USA, 129–140.
  • Zheng et al. (2011) Mai Zheng, Vignesh T. Ravi, Feng Qin, and Gagan Agrawal. 2011. GRace: A low-overhead mechanism for detecting data races in GPU programs. In PPoPP.
  • Zheng et al. (2014) M. Zheng, V. T. Ravi, F. Qin, and G. Agrawal. 2014. GMRace: Detecting Data Races in GPU Programs via a Low-Overhead Scheme. IEEE Transactions on Parallel and Distributed Systems 25, 1 (Jan 2014), 104–115.
  • zhxfl (2019) zhxfl. 2019. CUDA-CNN.