DARM: Control-Flow Melding for SIMT Thread Divergence Reduction – Extended Version

07/12/2021
by   Charitha Saumya, et al.
Purdue University

GPGPUs use the Single-Instruction-Multiple-Thread (SIMT) execution model, where a group of threads (a wavefront or warp) executes instructions in lockstep. When threads in a group encounter a branching instruction, they may not all take the same path, a phenomenon known as control-flow divergence. Control-flow divergence causes performance degradation because both paths of the branch must be executed one after the other. Prior research has primarily addressed this issue through architectural modifications. We observe that certain GPGPU kernels with control-flow divergence have similar control-flow structures with similar instructions on both sides of a branch. This structure can be exploited to reduce control-flow divergence by melding the two sides of the branch, allowing threads to reconverge early. In this work, we present DARM, a compiler analysis and transformation framework that can meld divergent control-flow structures with similar instruction sequences. We show that DARM can reduce the performance degradation from control-flow divergence.

I Introduction

General Purpose Graphics Processing Units (GPGPUs) are capable of executing thousands of threads in parallel, efficiently. Advancements in the programming models and compilers for GPUs have made it much easier to write data-parallel applications. Unfortunately, exploiting data parallelism does not immediately translate to better performance. One key reason for the lack of performance portability is that GPGPUs are not capable of executing all the threads independently. Instead, threads are grouped together into units called warps, and threads in a warp execute instructions in lockstep. This is commonly referred to as the Single Instruction Multiple Thread (SIMT) execution model.

The SIMT model suffers performance degradation when threads exhibit irregularity and can no longer execute in lockstep. Irregularity comes in two forms: irregularity in memory access patterns (i.e. memory divergence) and irregularity in the control-flow of the program (i.e. control-flow divergence). Memory divergence occurs when GPGPU threads need to access memory at non-uniform locations, which results in un-coalesced memory accesses. Un-coalesced memory accesses are bad for GPU performance because memory bandwidth cannot be fully utilized to do useful work.

Control-flow divergence occurs when threads in a warp diverge at branch instructions. At the diverging branch, lockstep execution cannot be maintained because threads in a warp may want to execute different basic blocks (i.e. diverge). Instead, when executing instructions along a diverged path, GPGPUs mask out the threads that do not want to take that path. The threads reconverge at the Immediate Post-DOMinator (IPDOM) of a divergent branch, that is, the instruction that all threads from both branches want to execute. This style of IPDOM-based reconvergence is implemented in hardware in most GPGPU architectures to maintain SIMT execution. Even though IPDOM-based reconvergence can handle arbitrary control-flow, it imposes a significant performance penalty if a program has a lot of divergent branches. In the IPDOM reconvergence model, instructions executed on divergent branches necessarily cannot utilize the full width of a SIMD unit. If the code has a lot of nested divergent branches or divergent branches inside loops, this style of execution causes significant under-utilization of SIMD resources.
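The under-utilization described above can be illustrated with a small Python model of a single warp. This is only a sketch under simplifying assumptions: the function name, the cycle counts, and the cost model (each taken side of the branch occupies all lanes for its full duration) are illustrative and do not describe any particular GPU.

```python
def simd_utilization(takes_true, then_cycles, else_cycles, common_cycles):
    """Fraction of SIMD lane-cycles doing useful work for one warp.

    takes_true: list of booleans, one per thread in the warp.
    then_cycles / else_cycles: cycles spent on each side of the branch.
    common_cycles: cycles executed by all threads outside the branch.
    """
    width = len(takes_true)
    n_true = sum(takes_true)
    n_false = width - n_true

    # Code outside the divergent branch keeps every lane busy.
    useful = common_cycles * width
    issued = common_cycles * width

    # Each side is issued only if at least one thread takes it. Masked-out
    # threads still occupy their lanes while the side executes.
    if n_true:
        useful += then_cycles * n_true
        issued += then_cycles * width
    if n_false:
        useful += else_cycles * n_false
        issued += else_cycles * width
    return useful / issued


if __name__ == "__main__":
    warp = [True] * 16 + [False] * 16
    print(simd_utilization(warp, then_cycles=10, else_cycles=10, common_cycles=0))
```

In this model a warp whose threads all take the same side reaches full utilization, while an evenly split warp spends half of its lane-cycles on masked-out threads inside the branch.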

For some GPGPU applications divergent branches are unavoidable, and there have been many techniques proposed to address this issue both in hardware and software. Proposals such as Dynamic warp formation [20], Thread block compaction [18] and Dual-path execution [37] focus on mitigating the problem at the hardware level by changing how threads are scheduled for execution and making sure that threads following the same path are grouped together. Unfortunately, such approaches are not useful on commodity GPGPUs.

TABLE I: Comparison of techniques for divergence reduction

Control-flow and instruction pattern                       | Tail Merging | Branch Fusion | darm
Diamond control-flow with identical instruction sequences  | ✓            | ✓             | ✓
Diamond control-flow with distinct instruction sequences   | ✗            | ✓             | ✓
Complex control-flow                                       | ✗            | ✗             | ✓

There have also been efforts to reduce divergence through compiler approaches that leverage the observation that different control-flow paths often contain similar instruction (sub)sequences. Tail merging [8] identifies branches that have identical sequences of code and introduces early jumps to merged basic blocks, with the effect of reducing divergence. Branch fusion generalizes tail merging to work with instruction sequences that may not be identical [10]. However, branch fusion cannot analyze complex control-flow and hence it is restricted to simple if-then-else branches where each path has a single basic block (i.e. diamond-shaped control-flow).

This paper introduces a more general, software-only approach of exploiting similarity in divergent paths, called control-flow melding. Control-flow melding is a general control-flow transformation which can meld similar control-flow subgraphs inside an if-then-else region (not just individual basic blocks). By working hierarchically, recursively melding divergent control-flow at the level of subgraphs of the CFG, control-flow melding can handle substantially more general control structures than prior work. This paper describes darm, a realization of control-flow melding for general GPGPU programs. Table I compares the capabilities of darm with branch fusion and tail merging.

darm works in several steps. First, it detects divergent if-then-else regions and splits the divergent regions into Single Entry Single Exit (SESE) control-flow subgraphs. Next it uses a hierarchical sequence alignment technique to meld profitable control-flow subgraphs, repeatedly finding subgraphs whose control-flow structures and constituent instructions can be aligned. Once a fixpoint is reached, darm uses this hierarchical alignment to generate code for the region with reduced control-flow divergence.

The main contributions of the paper are,

  • Divergence-Aware-Region-Melder (darm), a realization of control-flow melding that identifies profitable melding opportunities in divergent if-then-else regions of the control-flow using a hierarchical sequence alignment approach and then melds these regions to reduce control-flow divergence.

  • An implementation of darm in LLVM [29] that can be applied to GPGPU programs written in HIP [23] or CUDA [11]. Our implementation of darm is publicly available as an archival repository (https://doi.org/10.5281/zenodo.5784768) and an up-to-date version is available on GitHub (https://github.com/charitha22/cgo22ae-darm-code).

  • An evaluation of darm on a set of synthetic GPU programs and a set of real-world GPU applications showing its effectiveness.

II Background

II-A GPGPU Architecture

Modern GPGPUs have multiple processing cores, each of which contains multiple parallel lanes (i.e. SIMD units), a vector register file and a chunk of shared memory. The unit of execution is called a warp (or wavefront). A warp is a collection of threads executed in lock-step on a SIMD unit. Shared memory is shared among the warps executing on a core. A branch unit takes care of control-flow divergence by maintaining a SIMT stack to enforce IPDOM-based reconvergence, as discussed in Section I. GPGPU programming abstractions like CUDA [11] or HIP [23] give the illusion of data parallelism with independent threads. However, during real execution, a group of program instances (i.e. threads) is mapped to a warp and executed in lock-step. Therefore control-flow divergence in SPMD programs is detrimental to performance because of the limitations of SIMT execution.

II-B LLVM SSA form and GPU Divergence Analysis

LLVM [29] is a general framework for building compilers, optimizations and code generators. Most of the widely adopted GPGPU compilers [35, 40] are built on top of the LLVM infrastructure. LLVM uses a target-independent intermediate representation, LLVM-IR, that enables implementing portable compiler optimizations. LLVM-IR uses static single assignment (SSA) form [13], which requires that every program variable is assigned once and is defined before being used. SSA form uses ϕ nodes to resolve data-flow when branches are present, selecting which definition should be chosen at a confluence of different paths. In GPGPU compilers, a key step in identifying divergent control-flow regions is performing compiler analyses to identify divergent variables (or branches) [26, 10]. A branch is divergent if the branching condition evaluates to a non-uniform value for different threads in a warp. If the branching condition is divergent, threads in a warp will have to take different control-flow paths at this point. LLVM's divergence analysis tags a branch as divergent if the branching condition is either data-dependent or sync-dependent on a divergent variable (such as thread ID) [26], though more sophisticated divergence analyses have been proposed [42].
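The data-dependence half of such an analysis can be sketched as a fixpoint that propagates divergence from seed values such as the thread ID. The sketch below is a simplified Python model, not LLVM's implementation: the function name and the dictionary encoding of definitions are assumptions made for illustration, and sync dependence is deliberately left out.

```python
def divergent_values(defs, seeds):
    """Propagate divergence along data dependences until a fixpoint.

    defs: dict mapping a value name to the list of its operand names.
    seeds: names known to be divergent up front (e.g. the thread ID).
    Sync dependence is not modelled in this sketch.
    """
    divergent = set(seeds)
    changed = True
    while changed:
        changed = False
        for value, operands in defs.items():
            if value not in divergent and any(op in divergent for op in operands):
                divergent.add(value)
                changed = True
    return divergent


if __name__ == "__main__":
    # Hypothetical value names loosely modelled on a thread-indexed kernel.
    defs = {
        "ixj": ["tid", "j"],
        "outer_cond": ["ixj", "tid"],
        "inner_cond": ["tid", "k"],
        "loop_cond": ["j"],
    }
    print(sorted(divergent_values(defs, {"tid"})))
```

A branch would then be tagged divergent when its condition is in the returned set. In the example, the loop condition depends only on uniform values and stays uniform.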

III Motivating Example

1  __global__ static void bitonicSort(int *values) {
2    // copy data from global memory to shared memory
3    __syncthreads();
4    for (unsigned int k = 2; k <= NUM; k *= 2) {
5      for (unsigned int j = k / 2; j > 0; j /= 2) {
6        unsigned int ixj = tid ^ j;
7        if (ixj > tid) {
8          if ((tid & k) == 0) {
9            if (shared[ixj] < shared[tid])
10              swap(shared[tid], shared[ixj]);
11          }
12          else {
13            if ( shared[ixj] > shared[tid])
14              swap(shared[tid], shared[ixj]);
15          }
16        }
17        __syncthreads();
18      }
19    } // write data back to global memory
20  }
Fig. 1: Bitonic sort kernel

Bitonic sort is a kernel used in many parallel sorting algorithms such as bitonic merge sort and Cederman’s quicksort [4, 6]. Figure 1 shows a CUDA implementation of bitonic sort. This kernel is our running example for describing darm’s control-flow melding algorithm.

In this kernel, the branch condition at line 8 depends on the thread ID. Therefore it is divergent. Since the divergent branch is located inside a loop, the execution of the two sides of the branch needs to be serialized many times, resulting in high control-flow divergence. However, the code in the if (lines 9-10) and else (lines 13-14) sections of the divergent branch is similar in two ways. First, both code sections have the same control-flow structure (i.e. an if-then branch). Second, instructions along the two paths are also similar. Both conditions compare two elements in the shared array and perform a swap operation. Therefore the contents of the if and else sections can be melded to reduce control-flow divergence. Both code sections consist of shared memory load and store operations. In the unmelded version of the code these shared memory operations will have to be serialized due to thread divergence. However, if the two sections are melded, threads can issue the memory instructions in the same cycle, resulting in improved performance.
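To make the similarity concrete, the following Python sketch models one step of the kernel for a single thread, once with the original branch structure and once with the two sides melded by hand. It is an illustration only, not output of darm: the function names are hypothetical, Python conditional expressions stand in for select instructions, and a sequential loop stands in for the threads of a block.

```python
def step_original(shared, tid, j, k):
    """One inner-loop step for thread `tid`, mirroring lines 6-16 of Fig. 1."""
    ixj = tid ^ j
    if ixj > tid:
        if (tid & k) == 0:
            if shared[ixj] < shared[tid]:
                shared[tid], shared[ixj] = shared[ixj], shared[tid]
        else:
            if shared[ixj] > shared[tid]:
                shared[tid], shared[ixj] = shared[ixj], shared[tid]


def step_melded(shared, tid, j, k):
    """Hand-melded variant: one compare and one swap shared by both paths."""
    ixj = tid ^ j
    if ixj > tid:
        cond = (tid & k) == 0
        # Both sides load the same two elements. Only the operand order of the
        # comparison differs, so a select on the branch condition suffices.
        lhs = shared[ixj] if cond else shared[tid]
        rhs = shared[tid] if cond else shared[ixj]
        if lhs < rhs:
            shared[tid], shared[ixj] = shared[ixj], shared[tid]


def bitonic_sort(values, step):
    """Sequential stand-in for the kernel; len(values) must be a power of two."""
    shared = list(values)
    n = len(shared)
    k = 2
    while k <= n:
        j = k // 2
        while j > 0:
            for tid in range(n):  # one iteration per simulated thread
                step(shared, tid, j, k)
            j //= 2
        k *= 2
    return shared


if __name__ == "__main__":
    data = [7, 3, 9, 1, 8, 2, 6, 0]
    print(bitonic_sort(data, step_original))
    print(bitonic_sort(data, step_melded))
```

In the melded variant the loads, the comparison and the swap appear once, so all threads that pass the outer check can execute them together regardless of the value of the inner condition.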

Existing compiler optimizations such as tail merging and branch fusion cannot be applied to this case. Tail merging is applicable only if two basic blocks have a common destination and have identical instruction sequences at their tails. However, in bitonic sort, the if and else sections of the divergent branch have multiple basic blocks, and the compiler cannot apply tail merging. Similarly, branch fusion requires diamond-shaped control-flow and does not work if the if and else sections of the branch contain complex control-flow structures.

darm solves this problem in two phases. In the analysis phase (Section IV-C), darm analyzes the control-flow region dominated by a divergent branch to find isomorphic sub-regions that are in the true and false paths of the divergent branch. These isomorphic sub-region pairs are aligned based on their melding profitability using a sequence alignment strategy. Melding profitability is a compile-time approximation of the percentage of thread cycles that can be saved by melding two control-flow regions. Next, darm chooses profitable sub-region pairs in the alignment (using a threshold) and computes an instruction alignment for corresponding basic blocks in the two regions. In the code generation phase (Section IV-D), darm uses this instruction alignment to meld corresponding basic blocks in the sub-region pair. This melding is applied iteratively until no further profitable melding can be performed. darm's melding transformation is done in SSA form, therefore the resulting CFG can be optimized further using other compiler optimizations (Sections IV-E and IV-F).

IV Detailed Design

In this section we describe the algorithm used by darm to meld similar control-flow subgraphs. First we define the following terms used in our algorithm description.

IV-A Preliminaries and Definitions

Definition 1.

Simple Region : A simple region is a subgraph of a program’s CFG that is connected to the remaining CFG with only two edges, an entry edge and an exit edge.

Definition 2.

Region : A region of the CFG is characterized by two basic blocks, its entry and exit. All the basic blocks inside a region are dominated by its entry and post-dominated by its exit. A region with entry $E$ and exit $X$ is denoted by the tuple $(E, X)$. LLVM regions are defined similarly [32, 24].

Definition 3.

Single Entry Single Exit Subgraph : A single entry single exit (SESE) subgraph is either a simple region or a single basic block with a single predecessor and a single successor.

Note that a region with entry $E$ and exit $X$ can be transformed into a simple region by introducing new entry and exit blocks $E'$ and $X'$. All successors of $E$ are moved to $E'$ and $E'$ is made the single successor of $E$. Similarly, all predecessors of $X$ are moved to $X'$ and a single exit edge is added from $X'$ to $X$.
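The transformation just described can be sketched on a CFG stored as a successor map. This Python fragment is a minimal model under stated assumptions: the function name, the dictionary representation and the ".new" block names are hypothetical, and only predecessors inside the region are redirected to the new exit block.

```python
def make_simple_region(succ, entry, exit_, region):
    """Return a new successor map in which the region is a simple region.

    succ: dict mapping each block to the list of its successors.
    region: set of blocks inside the region (contains entry, not exit_).
    """
    new_entry, new_exit = entry + ".new", exit_ + ".new"
    out = {block: list(targets) for block, targets in succ.items()}

    # Move all successors of the entry to the new entry block and make the
    # new entry block the single successor of the entry.
    out[new_entry] = out[entry]
    out[entry] = [new_entry]

    # Redirect every edge from inside the region to the exit through the new
    # exit block, then add the single exit edge.
    for block in list(region) + [new_entry]:
        out[block] = [new_exit if t == exit_ else t for t in out[block]]
    out[new_exit] = [exit_]
    return out


if __name__ == "__main__":
    diamond = {"E": ["A", "B"], "A": ["X"], "B": ["X"], "X": []}
    print(make_simple_region(diamond, "E", "X", {"E", "A", "B"}))
```

After the call, the subgraph between the two new blocks is connected to the rest of the CFG by exactly one entry edge and one exit edge.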

Definition 4.

Simplified Region : A region with all its subregions transformed into simple regions is called a simplified region.

We now turn to the steps the darm compiler pass takes to reduce control-flow divergence.

IV-B Detecting Meldable Divergent Regions

First darm needs to detect divergent branches in the CFG. We use LLVM's built-in divergence analysis to decide if a branch is divergent or not (Section II). The smallest CFG region enclosing a divergent branch is called the divergent region corresponding to this branch. The melding transformation is applied only to divergent regions of the CFG. The next step is to decide if a divergent region contains control-flow subgraphs (Definition 3) that can be safely melded.

Definition 5.

Meldable Divergent Region: A simplified region $R$ with entry $E$ and exit $X$ is said to be meldable and divergent if the following conditions are met,

  1. The entry block $E$ of $R$ has a divergent branch

  2. Let $B_T$ and $B_F$ be the successor blocks of $E$. $B_T$ does not post-dominate $B_F$ and $B_F$ does not post-dominate $B_T$

According to Definition 5, a meldable divergent region has a divergent branch at its entry (condition 1). This makes sure that our melding transformation is only applied to divergent regions, and non-divergent parts of the control-flow are left untouched. Condition 2 ensures that the paths from the two successors of the entry to the exit (i.e. the true path and the false path) each consist of at least one SESE subgraph, and that these subgraphs from the two paths can potentially be melded to reduce control-flow divergence. Consider our running example in Figure 1. When this kernel is compiled with the ROCm HIPCC GPU compiler [23] at the -O3 optimization level into LLVM-IR, we get the CFG shown in Figure 4(a). Note that the compiler aggressively unrolls both loops (lines 4 and 5) in the kernel, and the resulting CFG consists of multiple repeated segments of the inner loop's body (lines 6-17). In Figure 4(a), only one unrolled instantiation of the loop body is shown. As explained in Section III, this kernel contains a divergent branch, which sits at the end of one basic block of this CFG. The two successors of that block do not post-dominate each other. Therefore the region entered at that block is a meldable divergent region.
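The two conditions can be checked mechanically once post-dominators are known. The Python sketch below is a simplified model and not darm's code: the function names and the successor-map encoding are assumptions, the CFG is assumed to have a single exit, and the requirement that the region is already simplified is not checked.

```python
def post_dominators(succ, exit_):
    """Iterative data-flow: pdom(b) = {b} | intersection of pdom(s) over successors s."""
    blocks = set(succ)
    pdom = {b: set(blocks) for b in blocks}
    pdom[exit_] = {exit_}
    changed = True
    while changed:
        changed = False
        for b in blocks - {exit_}:
            new = set(blocks)
            for s in succ[b]:
                new &= pdom[s]
            new |= {b}
            if new != pdom[b]:
                pdom[b] = new
                changed = True
    return pdom


def is_meldable_divergent(succ, entry, exit_, divergent_branches):
    """Check conditions 1 and 2 for the region (entry, exit_)."""
    # Condition 1: the entry block ends in a divergent two-way branch.
    if entry not in divergent_branches or len(succ[entry]) != 2:
        return False
    # Condition 2: neither successor post-dominates the other.
    true_succ, false_succ = succ[entry]
    pdom = post_dominators(succ, exit_)
    return true_succ not in pdom[false_succ] and false_succ not in pdom[true_succ]


if __name__ == "__main__":
    diamond = {"E": ["A", "B"], "A": ["X"], "B": ["X"], "X": []}
    triangle = {"E": ["A", "X"], "A": ["X"], "X": []}
    print(is_meldable_divergent(diamond, "E", "X", {"E"}))
    print(is_meldable_divergent(triangle, "E", "X", {"E"}))
```

The if-then shape in the second example is rejected because one successor of the entry post-dominates the other, so one of the two paths contains no subgraph to meld.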

IV-C Computing Melding Profitability

Definition 5 only allows us to detect regions that may contain meldable control-flow subgraphs. It does not tell us whether it is legal to meld them or whether melding them will improve performance. First we need to define what conditions need to be satisfied for two SESE subgraphs to be meldable.

Definition 6.

Meldable SESE Subgraphs: SESE subgraphs $S_1$ and $S_2$, where $S_1$ belongs to the true path and $S_2$ belongs to the false path, are meldable if any one of the following conditions is satisfied,

  1. Both $S_1$ and $S_2$ have more than one basic block and they are structurally similar, i.e. isomorphic.

  2. $S_1$ is a simple region and $S_2$ consists of a single basic block, or vice versa.

  3. Both $S_1$ and $S_2$ consist of a single basic block.

Fig. 2: Examples showing the 3 cases considered by darm to detect meldable subgraphs

Definition 6 ensures that any two SESE subgraphs that meet any one of these conditions can be melded without introducing additional divergence to the control-flow. Note that we do not consider subgraphs that contain warp-level intrinsics [45] for melding because melding such subgraphs can cause deadlock. Figure 2 shows three examples, one for each of the above conditions. Assume in each example that the two subgraphs are in a divergent region and that exactly one of them is executed on any program path from the region's entry to its exit (i.e. any thread in a warp that executes the entry must go through one subgraph or the other, but not both).

Region-Region Melding : In case ①, the two SESE subgraphs are isomorphic, therefore they can be melded to have the same control-flow structure (the melded subgraph in Figure 2-①). In the melded subgraph, the blocks that every thread must pass through are guaranteed to post-dominate the divergent branch, and threads can reconverge at these points, resulting in a reduction in control-flow divergence. Also, the structural similarity in case ① ensures that we do not introduce any additional branches into the melded subgraph.

Basic block-Region Melding : In case ②, the single basic block on one path can potentially be melded with any basic block in the CFG of the subgraph on the other path. Assume that the single block and one particular block of that subgraph have the highest melding profitability (melding profitability is described later). First we replicate the control-flow structure of the subgraph to create a new CFG. Then we place the single block in the new CFG such that it and its chosen partner are in similar positions in the two CFGs. We also ensure the correctness of the program by concretizing the branch conditions in the new CFG so that the placed block always executes, and we create ϕ nodes at the dominance frontiers of the placed block to make sure values defined inside it reach their users [13]. In this example the branch that precedes the placed block will always take the edge leading to it (bold arrow in Figure 2-②), and ϕ nodes will be added at its dominance frontier. Now the two subgraphs are isomorphic and can therefore be melded as in case ①. We refer to this process as Region Replication. The main benefit of region replication is that it allows us to meld the single block with any profitable basic block in the other subgraph, and the resultant subgraph has less divergence because threads can reconverge at basic blocks of the melded subgraph.

Basic block-Basic block Melding : Case ③ is the simplest form, where two SESE basic blocks are melded.

A meldable divergent region can potentially have multiple SESE subgraphs in its true and false paths. Therefore we need a strategy to figure out which subgraph pairs to meld. We formulate this as a sequence alignment problem as follows. First, we obtain an ordered sequence of subgraphs in the true path and in the false path of the divergent region. Subgraphs are ordered using the post-dominance relation of their entry and exit blocks. For example, if the entry node of subgraph $S_b$ post-dominates the exit node of subgraph $S_a$, then $S_b$ comes after $S_a$ in the order, which we denote as $S_a < S_b$. A subgraph alignment is defined as follows,

Definition 7.

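Subgraph Alignment: Assume a divergent region has ordered SESE subgraphs $S^T_1, \dots, S^T_m$ in its true path and ordered subgraphs $S^F_1, \dots, S^F_n$ in the false path. A subgraph alignment $A$ is an ordered sequence of tuples $(S^T_i, S^F_j)$ where,

  1. if $(S^T_i, S^F_j) \in A$ then $S^T_i$ and $S^F_j$ are meldable subgraphs

  2. if $(S^T_i, S^F_j)$ precedes $(S^T_k, S^F_l)$ in $A$ then $S^T_i < S^T_k$ and $S^F_j < S^F_l$, where $<$ is the subgraph order defined above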

According to Definition 7, only meldable subgraphs are allowed in an alignment tuple and, if the aligned subgraphs are melded, the resultant control-flow graph does not break the original dominance and post-dominance relations of the subgraphs.

Given a suitable alignment scoring function and gap penalty function, we can find an optimal subgraph alignment using a sequence alignment method such as the Smith-Waterman algorithm [44]. The scoring function measures the profitability of melding two meldable subgraphs. Prior techniques have employed instruction frequency to approximate the profit of merging two functions [38, 39]. We use a similar method to define subgraph melding profitability. First we define the melding profitability of two basic blocks $b_1$ and $b_2$ as follows,
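One formula consistent with the surrounding description is sketched below. It is a reconstruction under assumptions rather than a quotation: the symbol names $MP_B$, $\mathrm{freq}$, $\mathrm{lat}$, $w_i$ and $Q$ are chosen here for illustration, with $\mathrm{freq}(i, b)$ standing for the number of instructions of type $i$ in block $b$.

```latex
MP_B(b_1, b_2) =
  \frac{\sum_{i \in Q} \min\bigl(\mathrm{freq}(i, b_1),\, \mathrm{freq}(i, b_2)\bigr)\, w_i}
       {\mathrm{lat}(b_1) + \mathrm{lat}(b_2)},
\qquad
\mathrm{lat}(b) = \sum_{i \in Q} \mathrm{freq}(i, b)\, w_i
```

Under this form, two blocks with identical opcode frequencies save $\mathrm{lat}(b)$ cycles out of $2\,\mathrm{lat}(b)$, which gives the value 0.5 quoted for that case.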

Here $Q$ is the set of all possible instruction types available in the instruction set (i.e. LLVM-IR opcodes). $\mathrm{lat}(b)$ is the static latency of basic block $b$, which can be calculated by summing the latencies of all instructions in $b$. $w_i$ is the latency of instruction type $i$. The idea here is to approximate the percentage of instruction cycles that can be saved by melding the instructions in $b_1$ and $b_2$, assuming a best-case scenario (i.e. all common instructions in $b_1$ and $b_2$ are melded regardless of their order). For example, two basic blocks with identical opcode frequency profiles will have a profitability value of 0.5.

Because meldable subgraphs are isomorphic, there is a one-to-one mapping between their basic blocks (i.e. corresponding basic blocks). For example, in Figure 2 case ① each basic block of one subgraph maps to the block in the same position in the other subgraph. Assume the mapping of basic blocks in subgraphs $S_1$ and $S_2$ is denoted by $O$. The subgraph melding profitability of $S_1$ and $S_2$ is defined in terms of the melding profitabilities of their corresponding basic blocks.

Similar to the basic block melding profitability, the subgraph melding profitability measures the percentage of instruction cycles saved by melding two SESE subgraphs. This metric is an over-approximation; however, it provides a fast way to measure the melding profitability of two subgraphs that works well in practice. We use the subgraph melding profitability as the scoring function for subgraph alignment.
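Both profitability measures can be sketched in a few lines of Python. This is a hedged illustration, not darm's implementation: the function names and the latency table are hypothetical, the subgraph measure is assumed to be the latency-weighted combination of the per-block values over the block mapping, and blocks are assumed to be non-empty.

```python
from collections import Counter


def latency(block, lat):
    """Static latency of a block: sum of the latencies of its instructions."""
    return sum(lat[op] for op in block)


def block_profit(b1, b2, lat):
    """Best-case fraction of cycles saved by melding two basic blocks."""
    f1, f2 = Counter(b1), Counter(b2)
    # Every opcode common to both blocks is assumed to meld, regardless of order.
    saved = sum(min(f1[op], f2[op]) * lat[op] for op in f1.keys() & f2.keys())
    return saved / (latency(b1, lat) + latency(b2, lat))


def subgraph_profit(pairs, lat):
    """Latency-weighted profitability over corresponding basic block pairs."""
    total = sum(latency(a, lat) + latency(b, lat) for a, b in pairs)
    saved = sum(
        block_profit(a, b, lat) * (latency(a, lat) + latency(b, lat))
        for a, b in pairs
    )
    return saved / total


if __name__ == "__main__":
    # Hypothetical latencies, in cycles.
    lat = {"load": 4, "store": 4, "icmp": 1, "br": 1}
    block = ["load", "load", "icmp", "br"]
    print(block_profit(block, block, lat))
    print(subgraph_profit([(block, block), (["load"], ["store"])], lat))
```

A subgraph pair would then be melded only when the value returned by `subgraph_profit` exceeds the chosen threshold.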

Instruction Alignment: Notice that our subgraph melding profitability metric prioritizes subgraph pairs that have many similar instructions in their corresponding basic blocks. Therefore, when melding two corresponding basic blocks, we must ensure that the maximum number of similar instructions are melded together. This requires computing an alignment of two instruction sequences such that, if they are melded using this alignment, the number of instruction cycles saved will be maximal. We use the approach used in Branch Fusion [10] to compute an optimal alignment for two instruction sequences. In this approach compatible instructions are aligned together, and instructions with higher latency are prioritized to be aligned over lower latency instructions. Compatibility of two instructions for melding depends on a number of conditions, such as having the same opcode and the types of the operands being compatible. We used the criteria described by Rocha et al. [39] to determine this compatibility. This instruction alignment model uses a gap penalty for unaligned instructions because extra branches need to be generated to conditionally execute these unaligned instructions. Our melding algorithm does not depend on the sequence alignment algorithm used for instruction alignment computation. We use the Smith-Waterman algorithm [44] to compute the instruction alignment because prior work [10] has shown its effectiveness.
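A latency-weighted alignment with a gap penalty can be sketched as follows. Note the substitution: this sketch uses a global, Needleman-Wunsch-style dynamic program rather than Smith-Waterman, and it treats two instructions as compatible only when their opcode strings are equal. The function name, the latency table and the gap penalty value are assumptions made for illustration.

```python
def align_instructions(seq1, seq2, lat, gap_penalty=1):
    """Global alignment of two opcode lists.

    Returns a list of (i1, i2) pairs. None marks a gap, i.e. an I-G pair.
    A match scores the latency of the matched opcode, so higher-latency
    instructions are preferred when choosing what to align.
    """
    n, m = len(seq1), len(seq2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -gap_penalty * i
    for j in range(1, m + 1):
        score[0][j] = -gap_penalty * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(score[i - 1][j], score[i][j - 1]) - gap_penalty
            if seq1[i - 1] == seq2[j - 1]:
                best = max(best, score[i - 1][j - 1] + lat[seq1[i - 1]])
            score[i][j] = best

    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and seq1[i - 1] == seq2[j - 1]
                and score[i][j] == score[i - 1][j - 1] + lat[seq1[i - 1]]):
            pairs.append((seq1[i - 1], seq2[j - 1]))
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] - gap_penalty:
            pairs.append((seq1[i - 1], None))
            i -= 1
        else:
            pairs.append((None, seq2[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    lat = {"load": 4, "icmp_slt": 1, "icmp_sgt": 1, "br": 1}
    true_block = ["load", "load", "icmp_slt", "br"]
    false_block = ["load", "load", "icmp_sgt", "br"]
    for pair in align_instructions(true_block, false_block, lat):
        print(pair)
```

In the example, the loads and the branch align with each other while the two different compare instructions are each aligned with a gap.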

Fig. 3: (a) Instruction alignment result for two basic blocks, (b) Code generated by darm for aligned instructions ⓐ, ⓑ and ⓒ in Figure 3(a), (c) Unpredication applied to the unaligned instructions of the basic block in Figure 3(a)

Figure 3(a) shows the instruction alignment computed for two basic blocks. Aligned instructions are shown in green and instructions aligned with a gap are in red.

IV-D darm Code Generation

Algorithm 1 darm Algorithm
Input: SPMD function F
Output: Melded SPMD function
do
    changed ← false
    for BB in F do
        R, C ← GetRegionFor(BB)
        if IsMeldableDivergent(R) then
            SimplifyRegion(R)
            A ← ComputeSubgraphAlignment(R)
            for (S1, S2, profit) in A do
                if profit > threshold then
                    Meld(S1, S2, C)
                    changed ← true
                end if
            end for
        end if
        if changed then
            SimplifyFunction(F)
            RecomputeControlFlowAnalyses(F)
            break
        end if
    end for
while changed;

Algorithm 2 SESE Subgraph melding Algorithm
Input: SESE subgraphs S1, S2, Condition C
Output: Melded SESE subgraph
List blockPairs ← Linearize(S1, S2)
List A ← empty
for (B1, B2) in blockPairs do
    List instrPairs ← ComputeInstrAlignment(B1, B2)
    A.append(instrPairs)
end for
PreProcess()
Map operandMap ← empty
for pair in A do
    melded ← Clone(pair)
    Update(operandMap, pair, melded)
end for
for pair in A do
    SetOperands(pair, operandMap, C)
end for
RunUnpredication()
RunPostOptimizations()

darm's control-flow melding procedure is shown in Algorithm 1. This algorithm takes in an SPMD function $F$ and iterates over all basic blocks in $F$ to check if the basic block is an entry to a meldable divergent region ($R$) according to the conditions in Definition 5. We use SimplifyRegion to convert all subregions inside $R$ into simple regions.

We compute the optimal subgraph alignment for the two sequences of subgraphs in the true and false paths of $R$. We meld each subgraph pair in the alignment if the melding profitability is greater than some threshold. Subgraph melding changes the control-flow of $F$. Therefore we first simplify the control-flow (using LLVM's simplifycfg) and then recompute the control-flow analyses (e.g. dominator, post-dominator and region tree) required for the melding pass. We apply the melding procedure on $F$ again until no profitable melds can be performed.

Algorithm 2 shows the procedure for melding two subgraphs $S_1$ and $S_2$. $C$ is the branching condition of the meldable divergent region containing $S_1$ and $S_2$. First the two subgraphs are linearized in pre-order to form a list of corresponding basic block pairs. Processing the basic blocks in pre-order ensures that dominating definitions are melded before their uses. For each basic block pair in this list we compute an optimal alignment of instructions. Each pair in the alignment falls into one of two categories, I-I and I-G. I-I is a proper alignment with two instructions and I-G is an instruction aligned with a gap. Our alignment makes sure that in a match the two instructions are always meldable into one instruction (e.g. a load is not allowed to align with a store). First we traverse the alignment pair list and clone the aligned instructions. For I-I pairs, we clone a single instruction because they can be melded. During cloning, we also update the operandMap, which maintains a mapping between aligned and melded LLVM values. We perform a second pass over the instruction alignment to set the operands of cloned instructions (SetOperands). Assume we are processing an I-I pair with instructions $I_1$, $I_2$ and the cloned instruction is $I_m$. For each operand of $I_m$, the corresponding operands from $I_1$ and $I_2$ are looked up in operandMap because an operand might be an already melded instruction. If the resultant two operands from $I_1$ and $I_2$ are the same, we just use that value as the operand. If they are different, we generate a select instruction to pick the correct operand conditioned by $C$. For an I-G pair, operands are first looked up in operandMap and the result is copied to the cloned instruction. Consider the instruction alignment in Figure 3(a). Figure 3(b) shows the generated code for aligned instruction pairs ⓐ, ⓑ and ⓒ. In case ⓐ, two select instructions are needed because both operands map to different values. In case ⓑ, the first operand is the same for both instructions, therefore only one select is needed. In case ⓒ, both first and second operands are different for the two instructions. However, the second operands map to the same melded instruction, so only one select is needed. Note that $C$ is the branching condition for the divergent region, and we use it for selecting the operands.
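The operand-setting pass for an I-I pair can be sketched as follows. This Python fragment is a simplified model under assumptions: the function name, the string encoding of values and of select instructions, and the value names in the example are hypothetical and only mirror the three cases discussed above.

```python
def set_operands(ops_true, ops_false, operand_map, cond):
    """Compute the operands of a melded instruction for an I-I pair.

    ops_true / ops_false: operand names of the true-path and false-path
    instructions. operand_map: maps already melded values to their melded
    counterpart. Returns (melded operands, select instructions emitted).
    """
    melded, selects = [], []
    for a, b in zip(ops_true, ops_false):
        # An operand may itself be an instruction that was already melded.
        a = operand_map.get(a, a)
        b = operand_map.get(b, b)
        if a == b:
            melded.append(a)
        else:
            # The operands differ, so pick one based on the branch condition.
            sel = f"select({cond}, {a}, {b})"
            selects.append(sel)
            melded.append(sel)
    return melded, selects


if __name__ == "__main__":
    # Both operands differ: two selects.
    print(set_operands(["%a1", "%b1"], ["%a2", "%b2"], {}, "%c"))
    # First operand shared: one select.
    print(set_operands(["%p", "%b1"], ["%p", "%b2"], {}, "%c"))
    # Second operands map to the same melded value: one select.
    print(set_operands(["%x1", "%t1"], ["%x2", "%t2"], {"%t1": "%m", "%t2": "%m"}, "%c"))
```

Looking operands up before comparing them is what lets the third case avoid a second select, since both original values resolve to the same melded instruction.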

Melding Branch Instructions of Exit Blocks: Setting operands for branch instructions in subgraph exit blocks is slightly different from that for other instructions. Consider the exit blocks of $S_1$ and $S_2$. Their successors can contain ϕ nodes. Therefore we need to ensure that the successors of the exit blocks can distinguish values produced in the true path from those produced in the false path. To solve this we move the branch conditions of the two exit blocks into newly created blocks. Now we can conditionally branch to the new blocks depending on $C$. For example, in Figure 4(c) two such basic blocks are created when melding the exit branches of the two subgraphs in Figure 4(b). Any ϕ node in the successor block (Figure 4(c)) can distinguish the values produced in the true or false path using the two new blocks.

Melding ϕ Nodes : In LLVM SSA form ϕ nodes are always placed at the beginning of a basic block. Even if the instruction alignment result contains two aligned ϕ nodes, we cannot meld them into a single ϕ node because select instructions cannot be inserted before them. Therefore we copy all ϕ nodes into the melded basic block and set the operands for them using the operandMap. This can introduce redundant ϕ nodes, which we remove during post-processing.

IV-E Unpredication

Fig. 4: darm melding algorithm applied to bitonic sort (Figure 1) (a) Original control-flow graph, (b) Region simplification, (c) darm subgraph melding, (d) Unpredication, (e) Final optimized control-flow graph

In our code generation process, unaligned instructions are inserted into the same melded basic block regardless of whether they are from the true or false path (i.e. fully predicated). This can introduce overhead for several reasons. If the branching condition is biased towards the true or false path, it can result in redundant instruction execution. Also, full predication of unaligned store instructions requires adding extra loads to make sure the correct value is written back to memory. Unpredication splits the melded basic blocks at gap boundaries and moves the unaligned instructions into new blocks. Figure 3(c) shows unpredication applied to the unaligned instructions of the basic block in Figure 3(a). The original basic block is split into two parts and the unaligned instructions are moved to a new basic block. ϕ nodes are added to the second part to ensure that the values produced by the unaligned instructions dominate their uses. The unaligned instructions in this example are never executed in the true path, therefore the ϕ nodes' incoming values along the true path are undefined (LLVM undef). Note that in region replication (Section IV-C) we apply unpredication only to the melded basic blocks. Store instructions outside the melded blocks are fully predicated by inserting extra loads.
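Splitting at gap boundaries amounts to grouping consecutive alignment entries of the same kind. The Python sketch below models only that grouping step: the function name and the block kinds are hypothetical, and the ϕ-node insertion and branch generation described above are not modelled.

```python
def unpredicate(aligned):
    """Split a melded block at gap boundaries.

    aligned: list of (true_instr, false_instr) pairs, None marks a gap.
    Returns a list of (kind, instructions) where kind is "melded",
    "true_only" or "false_only". Blocks of the last two kinds would be
    executed conditionally on the divergent branch condition.
    """
    blocks = []
    for t, f in aligned:
        if t is not None and f is not None:
            kind, instr = "melded", (t, f)
        elif t is not None:
            kind, instr = "true_only", t
        else:
            kind, instr = "false_only", f
        # Extend the current block while the kind stays the same, otherwise
        # start a new block at the gap boundary.
        if blocks and blocks[-1][0] == kind:
            blocks[-1][1].append(instr)
        else:
            blocks.append((kind, [instr]))
    return blocks


if __name__ == "__main__":
    aligned = [
        ("load", "load"),
        ("load", "load"),
        (None, "icmp_sgt"),
        ("icmp_slt", None),
        ("br", "br"),
    ]
    for kind, instrs in unpredicate(aligned):
        print(kind, instrs)
```

Only the unaligned instructions end up in conditionally executed blocks, so threads on the other path skip them instead of executing them under full predication.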

IV-F Pre and Post Processing Steps

Fig. 5: DARM pre-processing example

In SSA form, any definition must dominate all its users. However, DARM’s subgraph melding can break this property. Consider the two meldable subgraphs , in Figure 5 (A). Definition dominates its use before the melding. However, if and are melded naively, then will no longer dominate . To fix this, we add a new basic block with a φ node . All uses of are replaced with (Figure 5 (B)). Notice that value is never meant to be used in the true path execution. Therefore it is undefined in the true path (undef). We apply this pre-processing step before the melding ( in Algorithm 2).

Subgraph melding can introduce branches with identical successors, φ nodes with identical operands, and redundant φ nodes. The post-processing step in Algorithm 2 removes these redundancies.
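Two of these clean-ups can be sketched in a few lines of Python. The tuple encodings of branches and φ nodes and the helper names `simplify_branch` and `simplify_phi` are hypothetical and chosen only for illustration; they do not mirror LLVM data structures, and special cases such as undef incoming values are not modelled.

```python
def simplify_branch(branch):
    """Turn a conditional branch with identical successors into a jump.

    A branch is ("condbr", cond, then_bb, else_bb) or ("br", bb).
    """
    if branch[0] == "condbr" and branch[2] == branch[3]:
        return ("br", branch[2])
    return branch


def simplify_phi(incoming):
    """Return the single incoming value if all of them are identical.

    incoming is a list of (value, predecessor_block) pairs. The result
    is None when the phi node is still needed.
    """
    values = {value for value, _pred in incoming}
    return values.pop() if len(values) == 1 else None


print(simplify_branch(("condbr", "%c", "bb3", "bb3")))
print(simplify_phi([("%x", "bb1"), ("%x", "bb2")]))
```

When `simplify_phi` returns a value, every use of the φ node can be replaced by that value and the node deleted.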

IV-G Putting It All Together

Figure 4 shows how each stage of the subgraph-melding pipeline transforms the CFG of the bitonicSort kernel. The original CFG is shown in Figure 4(a). Region (, ) is a meldable divergent region. Figure 4(b) shows the CFG after region simplification. Subgraphs and are profitable to meld according to our analysis. Figure 4(c) shows the CFG after subgraph melding. The result after applying unpredication is shown in Figure 4(d). Notice that unpredication splits the basic block (in Figure 4(c)) into 5 basic blocks (zoomed in the blue-dashed blocks in Figure 4(d)). Basic blocks and are the unaligned groups of instructions and they are executed conditionally. Figure 4(e) shows the final optimized CFG after applying post optimizations. Note that the ROCm HIPCC compiler applied if-conversion aggressively, so the effect of the unpredication step is nullified in this case.

Figure 4 only shows how the DARM transformation changes the CFG of our running example. It does not show how the instructions inside these basic blocks change. We use Figure 6 to explain the generation of melded instructions for the running example. Figure 6(a) shows the LLVM-IR of the meldable divergent region ( in Figure 4(b)) in our running example. During DARM code generation, basic blocks in subgraphs and are linearized to compute the instruction alignment. The computed instruction alignment is shown in Figure 6(b). Notice that are the corresponding basic block pairs. In this example all instructions align perfectly with each other except for the compare instructions in basic blocks and (shown in red in Figure 6(b)). Figure 6(c) shows the LLVM-IR after applying subgraph melding and unpredication (similar to Figure 4(d)). Note that instructions and (compare instructions) are unaligned. Therefore the unpredication step introduced basic blocks and to execute them conditionally based on the divergent condition . Extra instructions and are inserted to ensure def-use chains are not broken during the unpredication step. Of all the aligned instructions, only the branch instructions at the end of basic blocks and require select instructions during instruction melding. For example, the store instructions in basic blocks , use matching operands and can therefore be melded without adding selects. On the other hand, the conditional branch instructions use values and , so a select instruction is inserted (Figure 6(c)) to pick the branching condition. Note that the values and will flow to their users via the φ nodes and respectively. Therefore the select instruction (i.e. ) uses these φ nodes as its operands.
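The operand handling described above (no select when operands match, one select per differing operand) can be sketched as follows. This is a minimal model, assuming instructions are `(opcode, operands)` tuples and values are plain strings; `meld_pair`, `make_namer` and the `%sel` naming scheme are hypothetical helpers, not DARM or LLVM APIs.

```python
import itertools


def make_namer(prefix="%sel"):
    """Return a function that produces fresh value names."""
    counter = itertools.count()
    return lambda: f"{prefix}{next(counter)}"


def meld_pair(inst_true, inst_false, cond, fresh):
    """Meld two aligned instructions into one.

    Each instruction is (opcode, operands). Returns the emitted
    instructions as (result_name, opcode, operands) tuples; the melded
    instruction itself is last and has result_name None in this sketch.
    """
    op_t, args_t = inst_true
    op_f, args_f = inst_false
    if op_t != op_f or len(args_t) != len(args_f):
        raise ValueError("only instructions of the same shape can be melded")

    emitted, merged_args = [], []
    for a, b in zip(args_t, args_f):
        if a == b:
            merged_args.append(a)          # matching operand: reuse it
        else:
            sel = fresh()                  # differing operand: pick by path
            emitted.append((sel, "select", (cond, a, b)))
            merged_args.append(sel)
    emitted.append((None, op_t, tuple(merged_args)))
    return emitted


namer = make_namer()
# Stores with matching operands meld without any select.
print(meld_pair(("store", ("%v", "%p")), ("store", ("%v", "%p")), "%cond", namer))
# Branches on different conditions need one select.
print(meld_pair(("br", ("%c1", "bb1", "bb2")), ("br", ("%c2", "bb1", "bb2")), "%cond", namer))
```

Fewer differing operands mean fewer selects, which is one reason identical computations on both paths meld more cheaply than the -R style variants.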

Fig. 6: LLVM-IR before and after applying the DARM transformation to our running example: (a) meldable divergent region, (b) instruction alignment, (c) LLVM-IR generated after subgraph melding and unpredication.

V Implementation

We implemented the DARM algorithm described in Section IV as an LLVM-IR analysis and transformation pass on top of the ROCm HIPCC GPU compiler [40] (LLVM version 12.0.0, ROCm version 4.2.0). Both the analysis and the transformation are function passes that operate on GPGPU functions. The analysis pass first detects meldable divergent regions using LLVM’s divergence analysis. Then it finds all the profitable subgraph pairs that can be melded. We use a default melding profitability threshold of 0.2 (Algorithm 1). We also provide a sensitivity analysis on this threshold in Section VI-E. We use a modified version of the LLVM cost model [9] to obtain instruction latencies for the melding profitability and instruction alignment computations. The transformation uses the output of the analysis to perform DARM’s code generation procedure (Section IV-D). The transformation pass also performs the unpredication, pre-processing, and post-processing steps described in Sections IV-E and IV-F. The LLVM pass is implemented in lines of C++ code. To produce the program binary with our pass, we had to include it in the ROCm HIPCC compilation pipeline. Most GPGPU compilers (e.g. CUDA nvcc, ROCm HIPCC) use separate compilation for GPU device and CPU host code. The final executable contains the device binary embedded in the host binary. In the modified workflow, we first compile the device code into LLVM-IR and run DARM on it to produce a transformed IR module. Our pass runs only on device functions and avoids any modifications to host code. After that, we use the LLVM static compiler (llc) [30] to generate an object file for the transformed device code. The rest of the compilation flow is the same as the unmodified one.

VI Evaluation

VI-A Evaluation Setup and Benchmarks

We evaluate the performance of DARM on a machine with an AMD Radeon Pro Vega 20 GPU. This GPU has 16 GB of global memory, 64 kB of shared memory (i.e. Local Data Share (LDS)), and a maximum clock frequency of 1700 MHz. The machine has an AMD Ryzen Threadripper 3990X 64-core processor with a maximum clock frequency of 2900 MHz.

We use two different sets of benchmarks. First, to assess the generality of DARM, we create several synthetic programs that exhibit control divergence of varying complexity. While many real-world programs are hand-optimized to eliminate divergence, these synthetic programs both qualitatively demonstrate the generality of DARM over prior automated divergence-control techniques and show that DARM can automate the control-flow melding that would otherwise have to be done by hand.

Synthetic Benchmarks

Fig. 7: Control-flow patterns in synthetic benchmarks. Square: basic block and Circle: if-then region (shown on right)
Fig. 8: Micro-benchmark performance. GM is the geo-mean of DARM’s speedup over the baseline.
Fig. 9: Real-world benchmark performance. marks block size with best baseline runtime. GM is the geo-mean of DARM’s speedup on all benchmarks; GM-Best is DARM’s speedup on configurations.

Each synthetic kernel consists of two nested loops. The inner loop contains a divergent region with one of several control-flow structures (SB1, SB2, SB3 and SB4 in Figure 7). Every divergent path computes on different pieces of data from shared memory. SB1 has simple diamond-shaped control flow, with basic blocks A2 and A3 performing identical computations. In SB2 and SB3, the circled regions are if-then sections. The then-blocks in region pairs B2-B3 (in SB2), and C2-C3 and C6-C5 (in SB3), consist of identical computations. In the three-way divergent kernel SB4, basic blocks D2, D4, and D5 perform identical computations. Basic blocks and regions with identical computations have high melding profitability. Synthetic benchmarks SB1-R, SB2-R, SB3-R and SB4-R have the same control-flow structure as SB1-SB4 but contain non-identical computations in the basic blocks.
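The SB1 pattern, where both arms of a diamond run the same computation on different data, is the simplest case to picture. The sketch below is a scalar Python analogy only: `compute` is a made-up stand-in for the shared computation, and the conditional expression plays the role that a select instruction would play in the melded IR.

```python
def compute(x):
    """Hypothetical stand-in for the identical work in both arms."""
    return 3 * x + 1


def divergent(cond, a, b):
    """Diamond-shaped control flow: each arm repeats the computation."""
    if cond:
        return compute(a)   # arm corresponding to a block like A2
    else:
        return compute(b)   # arm corresponding to a block like A3


def melded(cond, a, b):
    """Melded form: choose the operand first, then compute once."""
    operand = a if cond else b   # acts like a select on the condition
    return compute(operand)


print(divergent(True, 2, 5), melded(True, 2, 5))
print(divergent(False, 2, 5), melded(False, 2, 5))
```

On a SIMT machine the divergent form issues `compute` twice per warp when lanes disagree on `cond`, whereas the melded form issues it once for all lanes.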

Prior control-flow melding techniques (tail merging [8] and branch fusion [10]) cannot meld the full set of synthetic benchmarks. Tail merging can combine the divergent if-then-else blocks in SB1 and SB4 but cannot fully merge divergent regions. It cannot merge the -R variants because of the different instructions in the divergent paths. Branch fusion subsumes tail merging and can fully merge the if-then-else blocks in SB1, SB4 and their -R variants. However, it cannot be applied to the more complex control flow of SB2 and SB3, or their -R variants. In SB4, iterative application of branch fusion can meld blocks D4, D5 and D2. However, its -R variant cannot be fully melded by branch fusion due to non-identical computations being unpredicated (cf. Section IV-E). In contrast, DARM melds it by using region replication (cf. Section IV-C).

Real-world Benchmarks: Second, to show DARM’s effectiveness on real-world programs, we consider 7 benchmarks written in HIP [23]. These benchmarks were taken from well-known, highly hand-optimized GPU benchmark suites or from optimized reference implementations of papers. We selected these benchmarks because they contain divergent if-then-else regions that present melding opportunities for DARM. We do not consider benchmarks without melding opportunities because DARM does not modify them in any way.

Bitonic Sort (BIT): Our running example is bitonic sort [4]. In this kernel, each thread block takes in a bucket and performs a parallel sort. We used an input of elements and varied the bucket (i.e. block) size.

Partition and Concurrent Merge (PCM): PCM is a parallel sorting algorithm based on Batcher’s odd-even merge sort [22]. PCM performs odd-even merging of buckets of sorted elements at every position of the array, leading to loops with nested data-dependent branches. We used an array of elements with different numbers of buckets.

Mergesort (MS): A parallel bottom-up merge sort implementation. The kernel has data-dependent control-flow divergence in the merging step. We used an input array with elements.

LU-Decomposition (LUD): The LUD implementation from the Rodinia benchmark suite [7]. We focus our evaluation on the lud_perimeter kernel in this benchmark. lud_perimeter contains multiple divergent branches that depend on thread ID and block size. We use a randomly generated matrix of size as the input.

N-Queens (NQU): The N-Queens solver uses backtracking to find all the different ways of placing N queens on an N×N chessboard such that no two queens attack each other. We used the kernel from the GPGPU-sim benchmark suite [3] with N = 15.

Speckle Reducing Anisotropic Diffusion (SRAD): SRAD is a diffusion-based noise removal method for imaging applications from the Rodinia benchmark suite [7]. We used an image of size as input.

DCT Quantization (DCT): An in-place quantization of a discrete cosine transformation (DCT) plane [12]. The quantization process is different for positive and negative values, resulting in data-dependent divergence. We use a randomly generated DCT plane of size as input.

Baseline and Branch Fusion: Our baseline implementations of these kernels have been hand-optimized (except for optimizations that manually remove control divergence by applying DARM-like transformations). These optimizations include using shared memory when needed to improve performance. The baseline implementations were compiled with -O3. Branch fusion [10] was implemented in Ocelot [27], an open-source CUDA compiler that is no longer maintained and does not support AMD GPUs. We therefore implemented branch fusion by modifying DARM to apply melding to diamond-shaped control flow (if-then-else), and we use this implementation for the comparison against branch fusion. Branch fusion cannot fully handle the control flow of BIT, PCM, and NQU. Loop unrolling enables successful branch fusion in LUD.

Block Size: Each of these kernels has a tunable block size, essentially a tile size that controls the granularity of work in the inner loops. Because the best block size can depend on many parameters (though for a given input and GPU configuration, one is likely the best), our evaluation treats block size as exogenous and hence considers behavior at different block sizes for each kernel. In other words, our evaluation asks: if a programmer has a kernel with a given block size, what will happen if DARM is applied?

Note that, of these kernels, only LUD exhibits divergence that depends on block size: for some block sizes the kernel is divergent, while for others it is convergent. All the other benchmarks experience divergence regardless of block size.
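To see how a thread-ID-dependent branch can be divergent at some block sizes and convergent at others, consider the following toy check. Everything specific here is an assumption made for illustration: the predicate `lower_half` is hypothetical and is not the actual condition in lud_perimeter, block size is treated as the number of threads per block, and wavefronts are taken to be 64 lanes wide.

```python
def wavefront_diverges(block_size, predicate, wavefront_size=64):
    """Return True if some wavefront has lanes on both sides of predicate."""
    for start in range(0, block_size, wavefront_size):
        lanes = range(start, min(start + wavefront_size, block_size))
        outcomes = {bool(predicate(tid, block_size)) for tid in lanes}
        if len(outcomes) == 2:   # both True and False seen in one wavefront
            return True
    return False


def lower_half(tid, block_size):
    # Hypothetical branch condition that depends on thread ID and block size.
    return tid < block_size // 2


for bs in (16, 32, 64, 128, 256):
    print(bs, wavefront_diverges(bs, lower_half))
```

In this toy setting the branch splits a wavefront only when the boundary `block_size // 2` falls strictly inside a wavefront; once the boundary lines up with a multiple of the wavefront size, every wavefront takes a single path.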

VI-B Performance

Figure 8 shows the speedups for the synthetic benchmarks with different block sizes. DARM successfully melds all 4 control-flow patterns we consider in the synthetic benchmarks and outperforms both the baseline and branch fusion (geo-mean speedups over the baseline of 1.36× for DARM and 1.10× for branch fusion). The performance of the random (-R) variants is slightly lower for each of the patterns. This is because the -R variants contain random instruction sequences whose instructions do not align perfectly, causing DARM to insert select instructions and branches to unpredicate unaligned instruction groups. The speedups observed for SB3 and SB3-R are better than those for SB1, SB2 and their -R variants because DARM melds multiple subgraph pairs in the SB3 control-flow pattern (Figure 7), so control-flow divergence is reduced more in this case. We observe the highest performance improvement for SB4 and SB4-R because DARM melds basic blocks D2, D4, and D5 (Figure 7) using region replication. SB4 and its -R variant have 3-way divergence because of the if-else-if-else branch. Applying region replication along with subsequent simplification passes greatly reduces this original three-way divergence.
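For readers who want to reproduce the GM summary statistic from their own measurements, a geometric mean of speedups can be computed as below. The function name `geomean` is ours and the sample values are made up; they are not the per-benchmark speedups behind Figure 8.

```python
import math


def geomean(values):
    """Geometric mean of positive numbers, computed in log space."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Made-up speedups (runtime_baseline / runtime_optimized) for illustration.
sample_speedups = [1.0, 2.0, 4.0]
print(geomean(sample_speedups))
```

The geometric mean is the usual choice for averaging ratios such as speedups, because it treats a 2× gain and a 2× loss symmetrically.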

Figure 9 shows the speedups for the real-world benchmarks. DARM improves performance on every benchmark except SRAD (see below), with a 1.15× geo-mean speedup over all benchmarks and a 1.16× geo-mean speedup over the best baseline variants. The highest relative improvement in performance can be seen in BIT and PCM for all block sizes. This is because both of these benchmarks are divergent regardless of the block size and have complex control-flow regions with shared memory instructions. DARM successfully melds these regions and reduces divergence significantly. Branch fusion improves performance in PCM by melding if-then-else blocks. In LUD, the divergence is block size dependent, and the kernel is divergent only at block sizes 16, 32 and 64, where we see a visible performance improvement from DARM. NQU contains a time-consuming loop with a divergent if-then-elseif-then section. DARM applies region replication to remove the divergence, achieving superior performance. The SRAD kernel has both block size-dependent and data-dependent divergent regions (say and respectively). Both and consist of if-then-else-if-then-else chains. contains no shared memory instructions and melding does not improve performance (for both DARM and branch fusion). However, contains a 3-way divergent branch with shared memory instructions, and the divergence is biased, i.e. execution only takes 2 of the 3 ways. In this case branch fusion has better performance at block size 16, because the blocks that get melded happen to be on the divergent paths. DARM, however, has more melding options than branch fusion, and it melds all 3 paths, adding extra overhead. At block size 32, the extra overhead introduced by melding becomes significant and both DARM and branch fusion exhibit a performance drop. The performance drop for DARM could be avoided by prioritizing the melding order (i.e. applying melding to divergent regions with the most profitable subgraphs first). However, prioritizing the melding order is not considered in this paper.

In most cases (except SRAD), the block size of the best-performing baseline is also the one that gives the best absolute performance for DARM. Interestingly, for BIT, PCM, MS, and DCT, this best baseline block size not only produces the best absolute DARM performance but also the best speedup relative to the baseline: the block size that makes the baseline perform best actually exposes more optimization opportunities to DARM.

We use rocprof [1] to collect ALU utilization and memory instruction counters to reason about performance. We focus on the block size for each benchmark where DARM has the highest improvement over the baseline.

VI-C ALU Utilization

DARM’s melding transformation enables the ALU instructions in divergent paths to be issued in the same cycle. This effectively improves SIMD resource utilization. Figure 10 shows the ALU utilization (). As expected, DARM improves the ALU utilization significantly for most benchmarks. In BIT, the divergent paths do not have common comparison operators ( and comparisons in lines 9 and 13 in Figure 1). Even though DARM unpredicates these instructions, later optimization passes decide to fully predicate them, resulting in lower ALU utilization.
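The intuition that melding raises lane utilization can be checked with a small back-of-the-envelope model. This is a toy model of our own, not the utilization metric reported in Figure 10 and not a rocprof counter: it assumes one issue slot per instruction, that divergent paths are issued one after the other with only their own lanes active, and that each melded pair is issued once with all lanes active. The function name and parameters are hypothetical.

```python
def alu_utilization(lanes_true, lanes_false, len_true, len_false, melded=0):
    """Fraction of lane slots doing useful work in one divergent region.

    lanes_true / lanes_false: lanes taking each path.
    len_true / len_false: instruction counts of the two paths.
    melded: number of aligned instruction pairs issued once for all lanes.
    """
    if melded > min(len_true, len_false):
        raise ValueError("cannot meld more pairs than the shorter path has")
    width = lanes_true + lanes_false
    only_true = len_true - melded
    only_false = len_false - melded
    issues = melded + only_true + only_false
    if width == 0 or issues == 0:
        raise ValueError("need at least one lane and one instruction")
    useful = melded * width + only_true * lanes_true + only_false * lanes_false
    return useful / (issues * width)


# Evenly split 64-lane wavefront, 10 instructions on each path.
print(alu_utilization(32, 32, 10, 10, melded=0))
print(alu_utilization(32, 32, 10, 10, melded=8))
```

In this model the unmelded region keeps half of the lanes idle at every issue, while melding 8 of the 10 instruction pairs cuts the number of issues from 20 to 12 for the same amount of useful work.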

Fig. 10: ALU Utilization.
Fig. 11: Normalized Memory Instruction Counters.

VI-D Melding of Memory Instructions

Figure 11 shows the normalized number of global and shared memory (i.e. local data share) instructions issued after applying DARM. In LUD, there are many common shared memory instructions in divergent paths. However, these instructions do not have the same memory alignment and therefore cannot be melded into a single instruction. Unpredicated shared memory instructions are predicated by other optimization passes in LLVM, resulting in a higher instruction count. Melding reduces the global memory instruction count in LUD. DCT does not have any memory instructions in the divergent region and does not use shared memory. In BIT and PCM, the melded regions contain a lot of shared memory instructions. Therefore the reduction in shared memory instructions is significant and correlates with the performance gain. We find that melding shared memory instructions is more beneficial than melding ALU instructions because shared memory instructions have higher latency than most ALU instructions, though lower latency than global memory instructions. Therefore there is a 2× improvement in cycles spent if two divergent shared memory instructions are issued in the same cycle. In contrast, melding global memory instructions does not always improve performance. This is because the data requested by divergent memory instructions might be on different cache lines, and these requests are serialized by the memory controller even if they are issued in the same cycle.
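The cache-line argument for global memory can be made concrete by counting how many distinct lines one wavefront-wide access touches. The 64-byte line size, the 64-lane wavefront and the helper name `cache_lines_touched` are assumptions of this sketch; actual line sizes and coalescing rules depend on the GPU.

```python
def cache_lines_touched(addresses, line_bytes=64):
    """Number of distinct cache lines covered by a set of byte addresses."""
    return len({addr // line_bytes for addr in addresses})


# 64 lanes reading consecutive 4-byte words: neighbouring lanes share lines.
coalesced = [4 * lane for lane in range(64)]
# 64 lanes reading words that are 4 KiB apart: every lane hits its own line.
scattered = [4096 * lane for lane in range(64)]

print(cache_lines_touched(coalesced))
print(cache_lines_touched(scattered))
```

If the two melded global loads behave like the scattered case, issuing them in the same cycle saves an instruction issue but not the underlying memory transactions.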

VI-E Melding Profitability Threshold

Fig. 12: Variation of melding profitability thresholds.

Figure 12 shows the performance of DARM for different melding profitability thresholds on the real-world benchmarks, considering DARM’s best-performing block sizes. For all benchmarks, we observe that DARM’s speedup decreases as we increase the threshold, due to lost melding opportunities. When we reduce the threshold below 0.2, the additional performance improvement becomes insignificant. However, we cannot reduce it to zero, because every possible pair would be melded and the subsequent CFG simplification passes would unpredicate them. As a result, DARM may become non-convergent.

VI-F Compile Time

Benchmark O3 DARM Normalized (DARM/O3)
BIT 0.4804 0.5018 1.0444
PCM 0.5690 0.5942 1.0443
MS 0.8037 0.8064 1.0035
LUD 0.5993 0.6294 1.0502
NQU 0.4687 0.4738 1.0109
SRAD 0.4999 0.5121 1.0244
DCT 0.4398 0.4439 1.0093
TABLE II: Average Compile Time (s)

Table II shows the device code compilation times for the baseline and DARM. We omit the time for compiling host code and linking because it is the same for both the baseline and DARM. Since we perform the analysis and the instruction alignment (the most costly parts) at the basic block level rather than at a higher level (i.e. function or region level), we incur negligible compilation overhead. The compilation time overhead introduced by DARM is a small fraction of the total compilation time (including host code) in all cases.
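The normalized column of Table II is the ratio of the DARM compile time to the O3 compile time, which can be recomputed directly from the two time columns. The short script below does this with the values copied from the table; the variable names are ours, and small differences in the last digit are expected because the table reports rounded averages.

```python
# (O3 seconds, DARM seconds) per benchmark, copied from Table II.
compile_times = {
    "BIT": (0.4804, 0.5018),
    "PCM": (0.5690, 0.5942),
    "MS": (0.8037, 0.8064),
    "LUD": (0.5993, 0.6294),
    "NQU": (0.4687, 0.4738),
    "SRAD": (0.4999, 0.5121),
    "DCT": (0.4398, 0.4439),
}

# Normalized compile time: DARM time divided by the O3 baseline time.
normalized = {name: darm / o3 for name, (o3, darm) in compile_times.items()}
worst = max(normalized, key=normalized.get)

for name, ratio in normalized.items():
    print(f"{name:5s} {ratio:.4f}")
print("largest overhead:", worst)
```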

DARM’s compile time depends on the size of the basic blocks that get melded and on the structure of the program, since the structure determines the types of melding opportunities. The slight compile-time overhead in LUD is caused by sequence alignment on large basic blocks (created by loop unrolling). PCM and BIT have divergent regions inside an unrolled loop, therefore DARM’s meldable subgraph detection incurs overhead. Only BIT and PCM have opportunities for Region-Region melding, and only PCM, NQU, and SRAD have opportunities for Basic block-Region melding. The presence of a Basic block-Region melding opportunity results in region replication.
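Why alignment cost grows with basic block size is easiest to see from a dynamic-programming alignment, which fills a table with one cell per pair of instruction positions. The sketch below is a Needleman-Wunsch-style global alignment over opcode names, used here as a stand-in: the scoring (`match=2`, `gap=-1`, no pairing of different opcodes) is a hypothetical choice, whereas the paper's alignment is driven by instruction latencies from a cost model, and the opcode names in the example are made up.

```python
def align(seq_a, seq_b, match=2, gap=-1):
    """Globally align two opcode sequences.

    Only identical opcodes may be paired; everything else becomes a gap.
    Returns a list of (a_item, b_item) pairs, with None marking a gap.
    """
    n, m = len(seq_a), len(seq_b)
    neg_inf = float("-inf")
    # (n + 1) x (m + 1) score table: this is the quadratic part.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = seq_a[i - 1] == seq_b[j - 1]
            diag = score[i - 1][j - 1] + match if same else neg_inf
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Walk back from the bottom-right corner to recover the pairing.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and seq_a[i - 1] == seq_b[j - 1]
                and score[i][j] == score[i - 1][j - 1] + match):
            pairs.append((seq_a[i - 1], seq_b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((seq_a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, seq_b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


true_path = ["load", "icmp_sgt", "store", "br"]
false_path = ["load", "icmp_slt", "store", "br"]
for pair in align(true_path, false_path):
    print(pair)
```

Pairs with a `None` entry correspond to the unaligned instructions that are later predicated or unpredicated. For blocks of n and m instructions the table has (n + 1)(m + 1) cells, so doubling both block lengths, as loop unrolling can do, roughly quadruples the work.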

VII Related Work

Divergence Analysis

The impact of control-flow divergence has been extensively studied in different contexts [43, 25, 34, 31]. Reducing control-flow divergence requires finding the source of divergence in a program. Coutinho et al. constructed a divergence analysis to statically identify variables with the same value for every SIMD unit and used this analysis to drive Branch Fusion [10]. A similar divergence analysis based on data and sync dependences has been integrated into the LLVM framework [26]. Recently, Rosemann et al. presented a precise divergence analysis based on abstract interpretation for reducible CFGs [42]. Using a precise divergence analysis improves the melding opportunities for DARM.

Code Compaction

Tail merging is a standard, but restrictive, compiler optimization used to reduce code size by merging identical sequences of instructions. Chen et al. used generalized tail merging to compact matching Single-Entry-Multiple-Exit regions [8]. Recently, Rocha et al. presented Function Merging, an advanced sequence-alignment-based technique for code size reduction [39, 38]. Even though parts of DARM have some similarities with function merging, function merging does not tackle divergence.

Compiler Techniques

In addition to branch fusion, Anantpur and Govindarajan proposed to structure unstructured CFGs and then linearize them with predication [2]. More recently, Fukuhara and Takimoto proposed Speculative Sparse Code Motion to reduce divergence in GPU programs [17]; it preserves the CFG and is orthogonal to DARM. Collaborative Context Collection copies registers of divergent warps to shared memory and restores them when those warps become non-divergent [28]. Iteration Delaying is a compiler optimization complementary to DARM that delays divergent loop iterations [21] and can be applied following DARM. Recently, Damani et al. presented a speculative reconvergence technique for GPUs similar to iteration delaying [14]. Common Subexpression Convergence (CSC) [15] works similarly to branch fusion but uses branch flattening (i.e. predication) to handle complex control flow. In contrast, DARM does not require predication to meld complex control flow and is thus more general than CSC.

Architectural Techniques

Thread Block Compaction [19] and Dynamic Warp Formation [20] involve repacking threads into non-divergent warps. Variable Warp Sizing [41] and Dynamic Warp Subdivision [33] depend on smaller warps to schedule divergent thread groups in parallel. Independent Thread Scheduling helps to hide the latency in divergent paths by allowing switching between divergent threads inside a warp [37, 16].

VIII Discussion and Future Work

Most GPGPU benchmarks are heavily hand-optimized by expert developers, and this often includes DARM-like transformations to remove control-flow divergence [10]. This is the main reason we evaluate DARM on a limited set of real-world benchmarks. However, we also emphasize that performing DARM-like transformations by hand is time-consuming and error-prone. For example, it took us several hours to manually apply control-flow melding to the LUD kernel. Therefore, offloading this to the compiler can save a lot of developer effort.

The benefits of DARM are not limited to reducing control-flow divergence in GPGPU programs. DARM can be used to reduce control-flow divergence in any hardware backend and programming model that employs SIMT execution (e.g. Intel/AMD processors with ISPC [36]). DARM can also be used to reduce the number of branches in a program, a property that can be exploited to accelerate software testing techniques such as symbolic execution [5]. DARM factors out common code segments within if-then-else regions of a program, so it can be used as an intra-function code size reduction optimization as well. These applications suggest that DARM is useful as a general compiler optimization technique. We plan to explore some of them in future work.

In Section VI, we have shown that using shared memory to improve the baseline does not take away DARM’s opportunity to meld, because melding shared memory instructions yields better performance than the improved baseline. Exploiting this opportunity requires maximizing the alignment of shared memory instructions, which can be achieved by using a refined instruction cost model.

IX Conclusion

Divergent control flow in GPGPU programs causes performance degradation due to serialization. We presented DARM, a new compiler analysis and transformation framework for GPGPU programs, implemented on LLVM, that can detect and meld similar control-flow regions in divergent paths to reduce control-flow divergence. DARM generalizes and subsumes prior efforts at reducing divergence, such as tail merging and branch fusion. We showed that DARM improves performance by improving ALU utilization and promoting coalesced shared memory accesses across several real-world benchmarks.

Acknowledgments

This work was supported in part by National Science Foundation awards CCF-1919197 and CCF-1908504. We would like to thank the anonymous reviewers for their helpful comments and feedback. We would like to thank Tim Rogers for his feedback during discussions of this work and for providing us with AMD GPUs for the experiments. Furthermore, we would like to thank Rodrigo Rocha for sharing the source code for Function Merging.

References

  • [1] ROCm-Developer-Tools / rocprofiler . Note: [Accessed 17-Dec-2021] External Links: Link Cited by: §VI-B.
  • [2] J. Anantpur and G. R. (2014) Taming control divergence in gpus through control flow linearization. Berlin, Heidelberg, pp. 133–153. External Links: ISBN 978-3-642-54807-9 Cited by: §VII.
  • [3] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt (2009) Analyzing cuda workloads using a detailed gpu simulator. pp. 163–174. External Links: Document Cited by: §VI-A.
  • [4] K. E. Batcher (1968) Sorting networks and their applications. In Proceedings of the April 30–May 2, 1968, spring joint computer conference (AFIPS ’68 (Spring)), Vol. , pp. 307–314. External Links: Document Cited by: §III, §VI-A.
  • [5] C. Cadar (2015) Targeted program transformations for symbolic execution. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, New York, NY, USA, pp. 906–909. External Links: ISBN 9781450336758, Link, Document Cited by: §VIII.
  • [6] D. Cederman and P. Tsigas (2010-01)

    GPU-quicksort: a practical quicksort algorithm for graphics processors

    .
    ACM J. Exp. Algorithmics 14. External Links: ISSN 1084-6654, Link, Document Cited by: §III.
  • [7] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. Lee, and K. Skadron (2009) Rodinia: a benchmark suite for heterogeneous computing. pp. 44–54. External Links: Document Cited by: §VI-A, §VI-A.
  • [8] W. Chen, B. Li, and R. Gupta (2003) Code compaction of matching single-entry multiple-exit regions. Berlin, Heidelberg, pp. 401–417. External Links: ISBN 3540403256 Cited by: §I, §VI-A, §VII.
  • [9] CostModel.cpp File Reference. Note: [Accessed 17-Dec-2021] External Links: Link Cited by: §V.
  • [10] B. Coutinho, D. Sampaio, F. M. Q. Pereira, and W. Meira Jr. (2011) Divergence analysis and optimizations. In 2011 International Conference on Parallel Architectures and Compilation Techniques, Vol. , pp. 320–329. External Links: Document Cited by: §I, §II-B, §IV-C, §VI-A, §VI-A, §VII, §VIII.
  • [11] CUDA C++ Programming Guide. Note: [Accessed 17-Dec-2021] External Links: Link Cited by: 2nd item, §II-A.
  • [12] CUDA Samples. Note: [Accessed 17-Dec-2021] External Links: Link Cited by: §VI-A.
  • [13] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck (1991-10) Efficiently computing static single assignment form and the control dependence graph. ACM Trans. Program. Lang. Syst. 13 (4), pp. 451–490. External Links: ISSN 0164-0925, Link, Document Cited by: §II-B, §IV-C.
  • [14] S. Damani, D. R. Johnson, M. Stephenson, S. W. Keckler, E. Yan, M. McKeown, and O. Giroux (2020) Speculative reconvergence for improved simt efficiency. New York, NY, USA, pp. 121–132. External Links: ISBN 9781450370479, Link, Document Cited by: §VII.
  • [15] S. Damani and V. Sarkar (2021) Common subexpression convergence: a new code optimization for simt processors. In Languages and Compilers for Parallel Computing, S. Pande and V. Sarkar (Eds.), Cham, pp. 64–73. External Links: ISBN 978-3-030-72789-5 Cited by: §VII.
  • [16] A. ElTantawy, J. W. Ma, M. O’Connor, and T. M. Aamodt (2014) A scalable multi-path microarchitecture for efficient gpu control flow. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), Vol. , pp. 248–259. External Links: Document Cited by: §VII.
  • [17] J. Fukuhara and M. Takimoto (2020) Branch divergence reduction based on code motion. Journal of Information Processing 28 (), pp. 302–309. External Links: Document Cited by: §VII.
  • [18] W. W. L. Fung and T. M. Aamodt (2011) Thread block compaction for efficient simt control flow. In 2011 IEEE 17th International Symposium on High Performance Computer Architecture, Vol. , pp. 25–36. External Links: Document Cited by: §I.
  • [19] W. W. L. Fung and T. M. Aamodt (2011) Thread block compaction for efficient simt control flow. pp. 25–36. External Links: Document Cited by: §VII.
  • [20] W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt (2007) Dynamic warp formation and scheduling for efficient gpu control flow. In 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007), Vol. , pp. 407–420. External Links: Document Cited by: §I, §VII.
  • [21] T. D. Han and T. S. Abdelrahman (2011) Reducing branch divergence in gpu programs. New York, NY, USA. External Links: ISBN 9781450305693, Link, Document Cited by: §VII.
  • [22] E. Herruzo, G. Ruiz, J. I. Benavides, and O. Plata (2007) A new parallel sorting algorithm based on odd-even mergesort. In 15th EUROMICRO International Conference on Parallel, Distributed and Network-Based Processing (PDP’07), Vol. , pp. 18–22. External Links: Document Cited by: §VI-A.
  • [23] HIP Programming Guide v4.1. Note: [Accessed 17-Dec-2021] External Links: Link Cited by: 2nd item, §II-A, §IV-B, §VI-A.
  • [24] R. Johnson, D. Pearson, and K. Pingali (1994-06) The program structure tree: computing control regions in linear time. SIGPLAN Not. 29 (6), pp. 171–185. External Links: ISSN 0362-1340, Link, Document Cited by: Definition 2.
  • [25] R. Karrenberg and S. Hack (2011) Whole Function Vectorization. External Links: Document, Link Cited by: §VII.
  • [26] R. Karrenberg and S. Hack (2012) Improving performance of opencl on cpus. In Compiler Construction, M. O’Boyle (Ed.), Berlin, Heidelberg, pp. 1–20. External Links: ISBN 978-3-642-28652-0 Cited by: §II-B, §VII.
  • [27] A. Kerr, G. Diamos, and S. Yalamanchili (2009) A characterization and analysis of ptx kernels. In 2009 IEEE International Symposium on Workload Characterization (IISWC), Vol. , pp. 3–12. External Links: Document Cited by: §VI-A.
  • [28] F. Khorasani, R. Gupta, and L. N. Bhuyan (2015) Efficient warp execution in presence of divergence with collaborative context collection. New York, NY, USA. External Links: ISBN 9781450340342, Link, Document Cited by: §VII.
  • [29] C. Lattner and V. Adve (2004) LLVM: a compilation framework for lifelong program analysis transformation. In International Symposium on Code Generation and Optimization, 2004. CGO 2004., Vol. , pp. 75–86. External Links: Document Cited by: 2nd item, §II-B.
  • [30] llc - LLVM static compiler. Note: [Accessed 17-Dec-2021] External Links: Link Cited by: §V.
  • [31] T. Lloyd, K. Ali, and J. N. Amaral (2019) GPUCheck: detecting CUDA thread divergence with static analysis. Technical report, Department of Computer Science, University of Alberta. External Links: Document, Link Cited by: §VII.
  • [32] llvm::RegionBase Class Template Reference. Note: [Accessed 17-Dec-2021] External Links: Link Cited by: Definition 2.
  • [33] J. Meng, D. Tarjan, and K. Skadron (2010-06) Dynamic warp subdivision for integrated branch and memory divergence tolerance. SIGARCH Comput. Archit. News 38 (3), pp. 235–246. External Links: ISSN 0163-5964, Link, Document Cited by: §VII.
  • [34] S. Moll and S. Hack (2018) Partial Control-flow Linearization. New York, NY, USA, pp. 543–556. External Links: ISBN 978-1-4503-5698-5, Link, Document Cited by: §VII.
  • [35] NVCC :: CUDA Toolkit Documentation. Note: [Accessed 17-Dec-2021] External Links: Link Cited by: §II-B.
  • [36] M. Pharr and W. R. Mark (2012) ispc: a SPMD compiler for high-performance CPU programming. In 2012 Innovative Parallel Computing (InPar), pp. 1–13. External Links: Document Cited by: §VIII.
  • [37] M. Rhu and M. Erez (2013) The dual-path execution model for efficient GPU control flow. In 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), pp. 591–602. External Links: Document Cited by: §I, §VII.
  • [38] R. C. O. Rocha, P. Petoumenos, Z. Wang, M. Cole, and H. Leather (2019) Function merging by sequence alignment. pp. 149–163. External Links: Document Cited by: §IV-C, §VII.
  • [39] R. C. O. Rocha, P. Petoumenos, Z. Wang, M. Cole, and H. Leather (2020) Effective function merging in the SSA form. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, A. Cohen, C. Ding, J. Criswell, and P. Wu (Eds.), PLDI 2020, New York, NY, USA. External Links: ISBN 9781450376136, Link, Document Cited by: §IV-C, §IV-C, §VII.
  • [40] ROCm Compiler SDK. Note: [Accessed 17-Dec-2021] External Links: Link Cited by: §II-B, §V.
  • [41] T. G. Rogers, D. R. Johnson, M. O’Connor, and S. W. Keckler (2015) A variable warp size architecture. New York, NY, USA. External Links: ISBN 9781450334020, Link, Document Cited by: §VII.
  • [42] J. Rosemann, S. Moll, and S. Hack (2021-01) An abstract interpretation for SPMD divergence on reducible control flow graphs. Proc. ACM Program. Lang. 5 (POPL). External Links: Link, Document Cited by: §II-B, §VII.
  • [43] T. Schaub, S. Moll, R. Karrenberg, and S. Hack (2015-01) The impact of the SIMD width on control-flow and memory divergence. ACM Trans. Archit. Code Optim. 11 (4). External Links: ISSN 1544-3566, Link, Document Cited by: §VII.
  • [44] T. F. Smith and M. S. Waterman (1981) Identification of common molecular subsequences. Journal of Molecular Biology 147 (1), pp. 195–197. External Links: ISSN 0022-2836, Document, Link Cited by: §IV-C, §IV-C.
  • [45] Using CUDA warp-level primitives. Note: [Accessed 17-Dec-2021] External Links: Link Cited by: §IV-C.