I Introduction
Pairwise alignment is one of the most commonly used workhorses of sequence analysis. It is used to correct raw sequencer reads, assemble them into more complete genomes, search databases for similar sequences, and for many other problems. The optimal solutions for this problem require quadratic time (i.e., they take O(nm) time to align a sequence A of length n and a sequence B of length m). Namely, Needleman–Wunsch (NW) [needleman1970general] is used to find the best global alignment by forcing the alignment to extend to the endpoints of both sequences. Alternatively, Smith–Waterman (SW) [smith1981identification] computes the best local alignment by finding the highest-scoring alignment between contiguous subsequences of the input sequences.
The popular X-drop [zhang2000greedy] algorithm avoids the full quadratic cost by searching only for high-quality alignments, and can be viewed as an approach to accelerate both NW and SW. Most applications of alignment will throw out low-quality alignments, which arise when the two strings are not similar. Instead of exploring the whole space, the X-drop algorithm searches only for alignments that result in a limited number of edits between the two sequences. X-drop keeps a running maximum score and does not explore cell neighborhoods whose score falls more than a user-specified parameter X below it. It gets its performance benefits from searching a limited space of solutions and stopping early when a good alignment is not possible.
Zhang et al. [zhang2000greedy] proved that, for certain scoring matrices, the X-drop algorithm is guaranteed to find the optimal alignment between relatively similar sequences. In practice, the algorithm eliminates searches between sequences that are clearly diverging. This feature is especially effective for many-to-many alignment problems, in which many sequences are aligned against many other possibly matching sequences, i.e., the cost is high, as is the possibility that some pairs will not align. With X-drop, any spurious candidate pair is readily eliminated because the optimal score quickly drops. Consequently, X-drop and its variants are the algorithms of choice in some of the most popular sequence mapping software, including BLAST [altschul1990basic], LAST [kielbasa2011adaptive], BLASTZ [schwartz2003human] with Y-drop, and minimap2 [li2018minimap2] with Z-drop.
Although X-drop is a heuristic for cutting the cost of alignment, it also produces good-quality results, which are sometimes better than those of a more complete search. Frith et al. [frith2010parameters] show that a large X does not necessarily produce better alignments. Without the X-drop feature, the alignment algorithm can incorrectly glue two independent local alignments into a large one. For example, consider two sequences, one of the form S = ABC and the other of the form R = ADC. Since the regions A and C produce high-scoring alignments, a high X would likely (and incorrectly) report S and R as a single end-to-end alignment, provided that the B and D regions are short enough.
Although there are numerous GPU implementations of the full SW and NW algorithms that often achieve impressive computational rates (measured in CUPS, or cell updates per second), they are rarely incorporated into high-impact genomics pipelines due to their quadratic complexity. By contrast, a GPU implementation of X-drop is notably missing from the literature despite its benefits and popularity. This is likely due to the increased complexity of implementing X-drop efficiently on a GPU, compared with the NW and SW methods, because of the dynamic nature of the computation, its adaptive band, and the need to check for termination.
Our main contributions are:

We present the first high-performance, multi-GPU implementation of the X-drop algorithm, named LOGAN, which achieves significant speedups over leading implementations on state-of-the-art processors.

We integrate LOGAN within BELLA, a long-read many-to-many overlapping and alignment software, and demonstrate significant performance improvements.

We adapt the Roofline model to the LOGAN implementation and the underlying hardware, and demonstrate that performance is near optimal on NVIDIA Tesla V100s.
A key aspect of our implementation is combining different levels of parallelism.
Specifically, we implement intra-sequence parallelism via dynamic thread scheduling and in-warp parallelism, while accomplishing inter-sequence parallelism by assigning each GPU block to a single alignment.
Finally, we achieve parallelism across multiple GPUs through a load balancer.
The remainder of the paper is organized as follows. Section II provides an overview of the related work, while Section III describes the original software algorithm we port to GPU. Section IV describes our implementation and optimizations. Section V presents LOGAN's integration within BELLA [guidi2018bella], a long-read overlapping and alignment software. Section VI illustrates our experimental results, while Section VII describes the Roofline model used to analyze our implementation. Finally, Section VIII summarizes our contributions and outlines future work.
II Related Work
The majority of hardware acceleration efforts for pairwise alignment have focused on the Smith–Waterman (SW) and Needleman–Wunsch (NW) algorithms. These find exact alignments and have quadratic complexity in the lengths of the reads. Along with some of the most successful NW and SW acceleration efforts, we review the few efforts to accelerate heuristics more similar to our own. Though they are more generally applicable, exploiting GPU parallelism in these heuristics is more challenging due to their adaptive nature. As a common success metric, we report Giga Cell Updates Per Second (GCUPS), as reported by the original work, throughout this section. It is important to keep in mind that the GCUPS rates presented in this section were collected by the respective authors using different architectures than the ones examined in our study. In Section VI, we collect comparative performance data with the ksw2 algorithm on equivalent platforms.
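For concreteness, GCUPS is simply the number of dynamic-programming cells evaluated divided by wall-clock time, in billions. A minimal helper (names are ours, not from any of the cited tools):

```python
def gcups(cells_updated: int, seconds: float) -> float:
    """Giga Cell Updates Per Second: DP cells evaluated per second, in billions."""
    return cells_updated / seconds / 1e9

# Example: a full 10,000 x 10,000 DP matrix computed in 0.5 s -> 0.2 GCUPS.
n, m = 10_000, 10_000
rate = gcups(n * m, 0.5)
```

Note that for a pruning heuristic like X-drop, `cells_updated` counts only the cells actually visited, which is why GCUPS comparisons across algorithms that do different amounts of work need care.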
The implementation of Michael Farrar [farrar2006striped], which is adopted in Bowtie2 [langmead2012fast], stands out among software implementations of the SW algorithm. It leverages SIMD instructions and reaches a performance of more than 20 GCUPS on an Intel Xeon Gold with 40 CPU threads. The same software implementation has been optimized for the PlayStation 3 processor and the IBM QS20 architecture [szalkowski2008swps3], achieving competitive GCUPS rates on both platforms.
CUDASW++3 [liu2013cudasw++] accelerates the SW algorithm by combining SIMD instructions and GPU parallelism. The implementation achieves its peak GCUPS rate when aligning short reads; however, performance drops significantly as the sequence length grows. Additionally, when running only on the GPU, its maximum attained performance is a fraction of its hybrid CPU–GPU peak.
Muhammadzadeh presented MRCUDASW++ [mrcudasw2014], which was inspired by CUDASW++3 but optimized for "medium length" reads. Muhammadzadeh compares MRCUDASW++ to other tools, with CUDASW++3 as its closest contender, across several sequence-length classes, reporting speedups over CUDASW++3. The results were below 85 GCUPS using an NVIDIA Tesla V100, which is the same GPU used for our benchmarks. Li et al. [li2007160] accelerate the SW algorithm on an Altera Nios II Field Programmable Gate Array (FPGA), achieving a substantial speedup over a software implementation. Nevertheless, the performance of their proposed solution is comparable to existing optimized software implementations. The SW implementation of Di Tucci et al. [di2017architectural], running on Xilinx Virtex 7 and Kintex UltraScale platforms, achieves a speedup over the previous state-of-the-art FPGA implementation. However, this work is limited to aligning short sequences whose number of characters does not exceed the number of processing elements in the architecture.
A recent work by Turakhia et al. [turakhia2018darwin], Darwin, exploits FPGAs to speed up the alignment process. Darwin uses a seed-and-extend heuristic (GACT) in which dynamic programming (DP) is performed around each seed hit to obtain local alignments similar to SW.
Feng et al. [FengIcpp2019] recently presented accelerator-based optimizations for minimap2 [Minimap2]. Leveraging the GPU architecture, they accelerate minimap2's seed-chain-extend pairwise alignment algorithm, which is quadratic in the length of the reads when computing traceback and linear otherwise. They reported a speedup over minimap2's SIMD software implementation. Our X-drop alignment algorithm computes a heuristic similar to minimap2's pairwise alignment. In Section VI, we compare our work with ksw2 [suzuki2018introducing] (i.e., minimap2's alignment kernel), showing that LOGAN achieves higher performance in terms of GCUPS than both ksw2 and the GPU-accelerated implementation reported by Feng et al.
Despite a large number of sophisticated implementations, the overwhelming majority of the proposed hardware acceleration studies implement the exact SW or NW algorithms. We believe that, although the X-drop algorithm is the most practical choice for targeting large-scale alignments, it requires a more challenging parallelization than the original SW or NW methods. Though relatively unexplored due to this challenge, we demonstrate that GPU optimization results in significant acceleration.
III Background
This section provides an overview of the X-drop implementation proposed by Zhang et al. [zhang2000greedy] and implemented in the SeqAn library [doring2008seqan], a C++ library for sequence analysis. First, we review the formal definition of alignment. A pairwise alignment of sequences s and t over an alphabet Σ is defined as a pair of sequences s′ and t′ over Σ ∪ {"−"} such that the following properties hold:

- |s′| = |t′|
- no position contains "−" in both s′ and t′
- deleting all "−" from s′ yields s, and deleting all "−" from t′ yields t.
A scoring scheme is used to distinguish high-quality alignments from the many (valid) alignments of a given pair of sequences. Scoring schemes generally reward matches and penalize mismatches, insertions, and deletions.
III-A The X-drop Algorithm
Given two DNA sequences s and t of lengths n and m, the goal of the X-drop algorithm is to find the highest-scoring semi-global alignment between prefixes of s and t, i.e., between s[1..i] and t[1..j] for some i and j that are chosen to maximize the score.
For given s and t, we define DP as the alignment matrix and DP[i][j] as the alignment score between the prefixes s[1..i] and t[1..j]. A positive match score is added to the diagonal predecessor for each pair of identical nucleotides. If the nucleotides do not match, the algorithm can either subtract a mismatch penalty and move diagonally, or subtract a gap penalty and move horizontally (a gap in the vertical sequence) or vertically (a gap in the horizontal sequence) in the dynamic programming grid. More formally, each cell of the alignment matrix is computed as follows:

DP[i][j] = max{ DP[i−1][j−1] + σ(s[i], t[j]), DP[i−1][j] − gap, DP[i][j−1] − gap }

where σ(s[i], t[j]) equals the match score when s[i] = t[j] and the negative mismatch score otherwise, and gap is the (positive) gap penalty.
Figure 1 shows the three dependencies of a cell during the computation: two dependencies on cells from the previous antidiagonal and one dependency on a cell from two antidiagonals prior. Note that SW and NW, as well as the majority of their heuristic implementations, exhibit these dependencies. SW and NW compute the entire matrix to find the optimal alignment. This quadratic algorithm is extremely inefficient in the case of either a misalignment or when aligning almost identical sequences. A misalignment could happen, for example, when two sequences have a tiny region in common due to a genomic repetition. The SW algorithm would spend significant computational resources calculating the entire dynamic programming (DP) matrix only to report a very poor alignment score between the two sequences. On the other hand, running SW or NW on two almost identical sequences would compute the whole DP matrix with no additional benefit, since the optimal alignment would remain close to the diagonal of the DP matrix.
The X-drop termination criterion consists of halting the computation if the alignment score drops more than X below the best alignment score seen so far for that pair of sequences (best in Algorithm 1). best is potentially updated at each antidiagonal iteration. If a cell's score falls below best − X, we set the cell to −∞ and no longer consider it in subsequent iterations. The cells set to −∞ are used to compute the lower and upper bounds for the next antidiagonal iteration. This approach limits the antidiagonal width, reducing the search space of the algorithm, and automatically provides a termination condition. The X-drop algorithm is particularly efficient when two sequences do not align. Pseudocode for the X-drop algorithm is shown in Algorithm 1.
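The antidiagonal loop and the pruning rule can be sketched compactly on the CPU. The following is an illustrative Python sketch, not SeqAn's or LOGAN's actual code: it assumes a simple match/mismatch/gap scheme, keeps only three antidiagonals (as dictionaries keyed by row index), prunes cells that fall more than x below the running best, and terminates when the band empties.

```python
NEG_INF = float("-inf")

def xdrop_extend(s, t, x, match=1, mismatch=-1, gap=-1):
    """Antidiagonal X-drop scan over prefixes of s and t.
    Returns the best semi-global (prefix) alignment score found."""
    best = 0
    prev2 = {0: 0}                       # antidiagonal k-2: DP[0][0]
    prev1 = {0: gap, 1: gap}             # antidiagonal k-1: DP[0][1], DP[1][0]
    for k in range(2, len(s) + len(t) + 1):
        curr = {}
        for i in range(max(0, k - len(t)), min(k, len(s)) + 1):
            j = k - i
            score = NEG_INF
            if i >= 1 and j >= 1 and (i - 1) in prev2:   # diagonal move
                sub = match if s[i - 1] == t[j - 1] else mismatch
                score = max(score, prev2[i - 1] + sub)
            if i >= 1 and (i - 1) in prev1:              # gap consuming s[i-1]
                score = max(score, prev1[i - 1] + gap)
            if j >= 1 and i in prev1:                    # gap consuming t[j-1]
                score = max(score, prev1[i] + gap)
            if score > NEG_INF and score >= best - x:    # X-drop pruning
                curr[i] = score
            # pruned cells stay -inf forever, narrowing the band
        if not curr:                                     # band emptied: terminate
            break
        best = max(best, max(curr.values()))             # update after full antidiagonal
        prev2, prev1 = prev1, curr
    return best
```

Note that, as in the text, best is updated only after an entire antidiagonal has been computed. On dissimilar sequences the band empties quickly and the loop exits long before the quadratic cost is paid.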
Note that X-drop should not be confused with the popular banded-SW method, which constrains the search space to a fixed band along the diagonal regardless of the drop in the score. The areas of the dynamic programming grid explored by these algorithms are characteristically different from X-drop's search space, which is reminiscent of a rugged band with changes in the length of each antidiagonal, as shown in Figure 2. To better understand the difference in practice, consider two sequences that differ heavily (over 50%) by substitutions but have no indel (insertion or deletion) differences. The optimal path would be along the diagonal, because both a mismatch and a match move the cursor in both sequences. X-drop will correctly terminate the search early due to a significant drop in the score, whereas banded-SW would explore the entire band.
IV Implementation
The key aspect of our implementation is exploiting as many levels of parallelism as possible on the Graphics Processing Unit (GPU). Intra-sequence parallelism is achieved by dynamically scheduling threads based on the value of X and through in-warp parallelization to find the maximum of the antidiagonals. To achieve inter-sequence parallelism, we implement the parallel execution of multiple alignments by assigning each GPU block to a single alignment. Multi-GPU parallelism is obtained by implementing a GPU load balancer that adapts the execution of LOGAN to leverage multiple GPUs.
In this section, we describe the design of our implementation. Section V then presents our optimized kernel integration within BELLA, a realworld application. Note that we refer to GPU threads and GPU blocks simply as threads and blocks.
IV-A Intra-Sequence Parallelism
We first consider intra-sequence parallelism, which is the parallelization of a single pairwise alignment and its antidiagonal computation. Given that the X-drop algorithm we port to GPU does not perform alignment traceback, we do not need to store the entire alignment matrix on the device for each alignment. Therefore, we can reduce the memory footprint of our kernel on the GPU by storing only three antidiagonals per alignment: the current one, the previous one, and the one from two iterations prior, as highlighted in Figure 1.
Similarly to the SW and NW algorithms, we compute the cell updates of each antidiagonal of the alignment matrix in parallel. Each antidiagonal cell has three dependencies on cells from antidiagonals of previous iterations. Nevertheless, we can leverage the independence between cells belonging to the same antidiagonal. To compute an antidiagonal update in parallel, we assign each cell to a GPU thread and compute the cells independently, where a GPU block can schedule up to 1024 threads. To overcome this limitation and support antidiagonals of any length, we split each antidiagonal into segments, as shown in Figure 3. Each segment's width equals the number of threads within a block. Once a segment of the antidiagonal is completed, the kernel initiates the computation of the subsequent segment. This process is repeated until the entire antidiagonal is computed.
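The segment decomposition can be pictured with a short sketch (illustrative only; the 1024-thread figure is the CUDA per-block limit mentioned above):

```python
def antidiagonal_segments(length: int, threads_per_block: int = 1024):
    """Split an antidiagonal of `length` cells into consecutive segments, each
    at most `threads_per_block` cells wide, so a single block of threads can
    sweep the whole antidiagonal in several passes."""
    return [(start, min(start + threads_per_block, length))
            for start in range(0, length, threads_per_block)]

# An antidiagonal of 2500 cells needs ceil(2500 / 1024) = 3 passes.
segs = antidiagonal_segments(2500)
```

Within a pass, thread tid handles cell `start + tid`; the block synchronizes between passes so all three-antidiagonal dependencies are satisfied.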
For each cell, the corresponding thread sets the score to −∞ if the cell score drops more than X below the global maximum of the alignment matrix, that is, best in Algorithm 1. The overall maximum score is a variable shared within the considered block. The global maximum of the scoring matrix is updated only after an antidiagonal has been completely calculated, since it must take the newly computed scores into account. Computing the global maximum naively would significantly slow down the execution, since it would require serially comparing each cell with all the others in a given antidiagonal. Thus, we speed up this process by computing the maximum antidiagonal score via a parallel reduction.
Once each thread is assigned to a cell of the antidiagonal, our algorithm leverages in-warp thread communication to compare the cell values and perform the reduction, where a warp is a set of 32 threads executing the same code in lockstep. All threads within a warp communicate using registers, which maximizes communication speed. Algorithm 2 illustrates the parallel computation of the antidiagonal. Finally, the size of the next antidiagonal to be computed is updated by checking for cells marked −∞ at the start or end of the current antidiagonal. LOGAN continues computing the alignment matrix until either it reaches the end of the shortest read or the size of the current antidiagonal becomes zero, meaning that the X-drop condition has been satisfied.
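The in-warp maximum is a tree reduction: at each step a lane combines its value with the value `offset` lanes away, halving `offset` until lane 0 holds the result. In CUDA this is expressed with `__shfl_down_sync`; the sketch below models the same halving loop in plain Python, with a list standing in for the 32 warp registers (our illustration, not LOGAN's kernel code).

```python
def warp_reduce_max(values):
    """Tree reduction to a maximum in log2(width) steps, mirroring a
    shuffle-down warp reduction: each step, lane i takes
    max(lane i, lane i + offset), leaving the answer in lane 0."""
    lanes = list(values)
    width = 1
    while width < len(lanes):            # pad to a power of two with -inf
        width *= 2
    lanes += [float("-inf")] * (width - len(lanes))
    offset = width // 2
    while offset > 0:                    # analogous to __shfl_down_sync(mask, v, offset)
        for lane in range(offset):
            lanes[lane] = max(lanes[lane], lanes[lane + offset])
        offset //= 2
    return lanes[0]
```

For a 32-lane warp this takes 5 steps instead of 31 serial comparisons, which is why the reduction is so much cheaper than the naive scan mentioned above.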
IV-B Inter-Sequence Parallelism
Intra-sequence parallelism optimizes the alignment computation for a single pair of sequences; however, it does not effectively leverage the large volume of GPU computational resources. We therefore exploit the GPU's computational potential by designing LOGAN to align multiple pairs of sequences in parallel, assigning each alignment to a GPU block and thus taking advantage of inter-sequence parallelism. LOGAN schedules the number of GPU blocks based on the number of alignments to be performed (Figure 4). On an NVIDIA V100, each GPU block can reserve up to 64 KB of shared memory and performs an independent alignment. Since only three antidiagonals need to be stored, we could ostensibly store them in GPU shared memory (the fastest memory available after the registers). However, despite the potential of reserving 64 KB of memory per block, this cannot be implemented in practice, as the device has only 96 KB of shared memory per streaming multiprocessor (SM). Given that each SM on the device can execute multiple blocks in parallel, if a single block reserves too much memory, the SM is forced to exchange data with the DRAM of the GPU every time it computes a block. Furthermore, given that only a single block could then fit on an SM, the execution would be limited to a single block per SM. Given our goal of achieving the best possible board-level utilization, we need to compute multiple blocks per SM in parallel. Consequently, to overcome these limitations and avoid shared memory contention, LOGAN stores the three antidiagonals of each alignment in the High Bandwidth Memory (HBM) of the GPU. Doing so removes the limitation on the number of blocks per SM and achieves significantly more effective parallelism.
To further improve our resource utilization, LOGAN schedules fewer threads per block than the maximum of 1024. In fact, if the number of threads exceeds the antidiagonal length, many of the threads stall, decreasing overall performance. Additionally, we need to store the intermediate results of the parallel reduction in shared memory to enable in-warp thread communication when computing the antidiagonal maximum score. Since the number of intermediate results equals the number of scheduled threads, reducing the number of scheduled threads per block also reduces the risk of shared memory contention. Given that the length of each antidiagonal is proportional to the value of X, our implementation schedules a number of threads proportional to X, significantly reducing the number of stalled threads. This scheduling increases our performance and improves our resource utilization. Table I shows the impact of the various degrees of parallelization implemented in the LOGAN kernel. The first two rows of the table show the impact of intra-sequence parallelism over single-threaded execution, while the last two show the impact of inter-sequence parallelism for 100K alignments of read pairs. Note that intra-sequence parallelism improves performance by a factor of about 9, while the introduction of inter-sequence parallelism improves performance by an impressive further factor with respect to intra-sequence parallelism alone. The intra-sequence parallelism has insufficient work to consume the available GPU resources, hence its impact on performance is limited compared to inter-sequence parallelism. Notably, inter- and intra-sequence parallelism are complementary to each other. LOGAN therefore implements both to better exploit the resources of the GPU and maximize our kernel performance.
| Parallelism               | Pairs | Threads | Blocks | Time   | Speed-up  |
| None                      | 1     | 1       | 1      | 1.50 s | --        |
| Intra-sequence            | 1     | 128     | 1      | 0.16 s | 9.3×      |
| Intra-sequence            | 100K  | 128     | 1      | 45 h   | --        |
| Intra- and inter-sequence | 100K  | 128     | 100K   | 7.35 s | 22,000.0× |
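The adaptive thread count described above can be sketched as follows. The rounding-to-a-whole-warp rule and the proportionality constant of 1 are our assumptions for illustration; the text only states that the count is proportional to X and below the 1024-thread cap.

```python
WARP = 32           # threads per warp
MAX_THREADS = 1024  # CUDA per-block thread limit

def threads_per_block(x: int) -> int:
    """Schedule roughly x threads (the antidiagonal width is proportional
    to X), rounded up to a whole warp and capped at the block limit."""
    warps = max(1, -(-x // WARP))        # ceil(x / WARP), at least one warp
    return min(MAX_THREADS, warps * WARP)
```

With X = 100 this schedules 128 threads (matching the configuration in Table I), while very large X values saturate at the 1024-thread cap.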
To make the LOGAN processing more efficient on the GPU, we additionally introduce host-side (CPU) optimizations. First, LOGAN loads the lengths of the sequences and the seed locations for each pair of sequences and stores them in two buffers. Each pair of sequences is split in two based on the seed's location, resulting in a left-extension pair and a right-extension pair, as shown in Figure 5. Left-extension and right-extension pairs are stored in two different vectors, and these alignments are computed independently by scheduling two different streams on the GPU. Traditionally, when aligning two sequences, one of them is accessed backward, degrading memory performance since characters are read in the direction opposite to the memory layout. To ensure coalesced data access on the GPU and exploit memory bursts, one of the two sequences (for each given pair) is reversed, as shown in Figure 6. This optimization linearizes the GPU memory accesses, thus increasing performance while preserving the correct solution. Finally, note that kernel execution is scheduled asynchronously from the host, enabling the retrieval of alignment results as soon as they are available instead of waiting for all the alignments to complete.
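The seed-splitting step can be sketched as below. This is an illustration of the idea only (LOGAN's actual buffers hold lengths and seed offsets): the halves to the left of the seed are reversed so that both the left and right extensions scan their strings in the forward direction.

```python
def split_at_seed(query: str, target: str, q_seed: int, t_seed: int):
    """Split a read pair at its seed position into a left-extension pair and
    a right-extension pair. Left halves are reversed so that both extensions
    can be computed scanning forward (coalesced access on the GPU)."""
    left = (query[:q_seed][::-1], target[:t_seed][::-1])
    right = (query[q_seed:], target[t_seed:])
    return left, right

left, right = split_at_seed("AACCGGTT", "TACCGGTA", 4, 4)
# left  == ("CCAA", "CCAT")  -- prefixes, reversed
# right == ("GGTT", "GGTA")  -- suffixes, as-is
```

The two resulting pair lists can then be dispatched on separate GPU streams, as described above, and the final score is the sum of the two extension scores plus the seed score.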
IV-C Implementation with Multiple GPU Devices
To effectively exploit multiple available GPUs, LOGAN leverages a load balancer, as shown in Figure 7. This optimization allows LOGAN to run on varying GPU configurations, since it adapts the load to the specific number of GPUs present in a given system.
The host application balances the computation by scheduling a number of alignments for each GPU. To simplify its task, the host switches context multiple times and replicates the operations for each GPU. The preprocessing of the sequences occurs as in the single-device implementation, and the load balancer divides the sequences into different groups that are then assigned to the GPUs. The HBM of the GPU represents a limiting resource for LOGAN, since the single-GPU implementation fully utilizes it. To ensure balance, we schedule the number of alignments per GPU considering both the number of available GPUs and the lengths of the sequences. Once the division is completed, the host allocates the necessary memory on the different GPUs, enabling each GPU to execute its set of alignments independently. The host then schedules each GPU kernel to be executed in parallel and collects alignment results asynchronously. Once the GPU devices have completed their execution, the load balancer collects and organizes the results.
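A balancer along these lines can be sketched as a greedy assignment: each alignment goes to the currently least-loaded GPU, using total sequence length as a cost proxy. Both the greedy strategy and the cost model are our assumptions for illustration; the text specifies only that the split considers the number of GPUs and the sequence lengths.

```python
import heapq

def balance(pairs, n_gpus):
    """Greedily assign alignment pairs to GPUs, largest first, keeping the
    summed sequence length per GPU roughly even. Returns, for each GPU,
    the list of pair indices it should process."""
    heap = [(0, g) for g in range(n_gpus)]   # (current load, gpu id)
    heapq.heapify(heap)
    groups = [[] for _ in range(n_gpus)]
    for idx, (s, t) in sorted(enumerate(pairs),
                              key=lambda e: -(len(e[1][0]) + len(e[1][1]))):
        load, g = heapq.heappop(heap)        # least-loaded GPU so far
        groups[g].append(idx)
        heapq.heappush(heap, (load + len(s) + len(t), g))
    return groups
```

Each group can then be copied to its device and launched independently, with results gathered asynchronously as described above.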
V BELLA Integration
To demonstrate the impact of this work, we integrate LOGAN into a long-read analysis software called BELLA [guidi2018bella]. BELLA is a recently released, publicly available software for long-read many-to-many overlap detection and alignment. Detecting overlaps is a crucial and computationally intense step in many long-read applications, such as de novo genome assembly and error correction. BELLA uses a seed-based approach to overlap detection, implemented as an efficient sparse matrix–matrix multiplication (SpGEMM) kernel. Before performing overlapping, BELLA provides a new algorithm for pruning k-mers, the substrings of fixed length used as seeds. K-mers are pruned when they are unlikely to be useful in overlap detection and their retention would cause unnecessary computational overhead and potential errors. Once overlaps are identified, a seed-and-extend pairwise alignment step is performed to filter out spurious overlaps. BELLA chooses the optimal k-mer to begin alignment extension, as illustrated in Figure 5, through a binning mechanism, where k-mer locations are used to estimate the overlap length and to "bin" k-mers to form a consensus. BELLA additionally uses the novel approach of separating true alignments from false positives using an adaptive threshold based on a combination of alignment techniques and probabilistic modeling.
BELLA relies on SeqAn's X-drop implementation [doring2008seqan] for pairwise alignment, which constitutes a substantial fraction of the overall runtime on real data sets. Once overlaps are computed via SpGEMM, BELLA performs the pairwise alignments and determines whether each aligned pair should be kept. The current approach is efficient for SeqAn, as the processor computes independent pairwise alignments in parallel using OpenMP [dagum1998openmp]. However, this approach is inefficient for the GPU architecture, since it limits the amount of parallelism between alignments. To better exploit inter-alignment parallelism, we modify BELLA to batch the entire set of alignments together and send them to the GPU devices. The host CPU then retrieves and post-processes the alignment results. Our optimized BELLA version with LOGAN integration produces results equivalent to the original version.
VI Discussion
| X-drop | SeqAn (168 CPU threads) | LOGAN (1 GPU) | LOGAN (6 GPUs) |
| 10     | 5.1                     | 2.2           | 1.9            |
| 20     | 12.7                    | 3.1           | 2.1            |
| 50     | 29.6                    | 5.0           | 2.2            |
| 100    | 45.7                    | 7.2           | 2.7            |
| 500    | 102.6                   | 14.9          | 4.0            |
| 1000   | 133.3                   | 20.2          | 4.9            |
| 2500   | 168.0                   | 25.3          | 5.6            |
| 5000   | 176.6                   | 26.7          | 5.8            |
| X-drop | ksw2 (80 CPU threads) | LOGAN (1 GPU) | LOGAN (8 GPUs) |
| 10     | 6.9                   | 2.5           | 1.7            |
| 20     | 7.0                   | 3.8           | 1.8            |
| 50     | 7.7                   | 5.8           | 2.1            |
| 100    | 10.4                  | 7.3           | 2.4            |
| 500    | 113.0                 | 15.2          | 3.4            |
| 1,000  | 209.5                 | 20.4          | 4.3            |
| 2,500  | 1235.8                | 25.9          | 5.2            |
| 5,000  | 3213.1                | 27.2          | 5.2            |
| X-drop | BELLA (168 CPU threads) | LOGAN (1 GPU) | LOGAN (6 GPUs) |
| 5      | 53.2                    | 110.4         | 114.3          |
| 10     | 108.6                   | 146.4         | 115.3          |
| 15     | 139.0                   | 152.9         | 114.8          |
| 20     | 226.7                   | 162.7         | 118.4          |
| 25     | 275.3                   | 173.5         | 125.3          |
| 30     | 558.0                   | 185.3         | 130.6          |
| 35     | 654.1                   | 198.4         | 136.8          |
| 40     | 750.1                   | 212.7         | 138.4          |
| 50     | 913.1                   | 248.5         | 141.4          |
| 80     | 1303.7                  | 295.8         | 142.4          |
| 100    | 1507.1                  | 336.3         | 144.5          |
In this section, we describe the experimental settings used to evaluate the LOGAN methodology and present our performance results.
VI-A Experimental Setting
We first compare LOGAN against the CPU-based X-drop algorithm as implemented in SeqAn [doring2008seqan]. Next, we evaluate LOGAN against two GPU-based algorithms: the current state-of-the-art implementation of full Smith–Waterman (SW), CUDASW++3 [liu2013cudasw++], and the closest heuristic to ours, manymap, proposed by Feng et al. [FengIcpp2019]. Finally, we integrate LOGAN into the BELLA long-read application to demonstrate its benefit in a real-world computation.
To compare LOGAN against SeqAn, we generate a set of 100K read pairs with read lengths between 2,500 and 7,500 characters and an error rate of 15% between the two reads of a given pair. The results were collected on a dual-socket server with two 22-core IBM POWER9 processors, 512 GB of DDR4 RAM, and 6 NVIDIA Tesla V100 (16 GB HBM2) GPUs. On each processor, 21 compute cores are available, with 4 threads per core. We also compare LOGAN to minimap2's [Minimap2] vectorized X-drop alignment algorithm, called ksw2 [suzuki2018introducing], using the same data set of 100K pairs described above. Note that we conducted these comparisons on a different hardware platform: a dual-socket computer with two 20-core Intel Xeon Gold 6148 CPUs, each running at 2.40 GHz, with 384 GB of DDR4 memory and 8 NVIDIA Tesla V100 (16 GB HBM2) GPUs. A different platform was required because the POWER9 processors are not compatible with ksw2's SSE2 SIMD instructions. On the Intel Xeon Gold platform with 8 NVIDIA Tesla V100s, we also perform the comparison between LOGAN, CUDASW++, and manymap using the same 100K pairs as above.
Additionally, we integrate LOGAN into the BELLA long-read application [guidi2018bella], described in Section V, evaluating the performance difference of replacing SeqAn with LOGAN. For this comparison, we used a real E. coli data set and a synthetic C. elegans data set, requiring 1.8M and 235M alignments, respectively. We ran these experiments on the same hardware platform as the SeqAn evaluation.
VI-B Results
| X-drop | BELLA (168 CPU threads) | LOGAN (1 GPU) | LOGAN (6 GPUs) |
| 5      | 131.7                   | 577.1         | 213.1          |
| 10     | 723.3                   | 750.2         | 579.7          |
| 15     | 1467.7                  | 865.6         | 749.8          |
| 20     | 1954.8                  | 908.9         | 777.0          |
| 25     | 2518.8                  | 1015.5        | 838.9          |
| 30     | 3047.1                  | 1125.0        | 888.0          |
| 35     | 3492.5                  | 1226.5        | 927.0          |
| 40     | 3887.0                  | 1329.0        | 955.9          |
| 50     | 4607.7                  | 1449.0        | 983.7          |
| 80     | 6367.7                  | 1593.9        | 1046.1         |
| 100    | 7385.3                  | 1753.3        | 1080.9         |
Figure 8 shows LOGAN's speedup using both one GPU and the entire set of six GPUs compared against SeqAn's implementation using 168 threads on two POWER9 processors. Details of the execution times are shown in Table II. Note that LOGAN's execution times remain roughly constant for large values of X. In these scenarios, we can exploit the full parallelism of the GPU architecture, resulting in similar execution times. LOGAN attains substantial speedups with a single GPU, and larger ones using all six GPUs. As expected, LOGAN achieves higher speedups as the value of X increases, since the alignment runs for a longer duration. We also note that LOGAN's multi-GPU implementation scales better for longer runs. This is due to amortizing the load-balancing overhead of dividing the sequences into different groups.
Figure 9 presents LOGAN's performance using both 1 GPU and the entire set of 8 GPUs when compared against ksw2's vectorized CPU implementation on the Skylake processor. Both algorithms are benchmarked using the same set of 100K alignments used to compare LOGAN and SeqAn. Results show that LOGAN attains significant speedups with a single GPU, and larger ones using eight GPUs. Additionally, we observe that ksw2 performs well when aligning sequences with a small value of X, and that its performance degrades drastically as the X-drop value increases, as shown in Table III. Given that LOGAN and ksw2 implement two slightly different heuristics, we also report a comparison based on the GCUPS metric: LOGAN's best single-GPU GCUPS rate, attained at a large X, substantially exceeds ksw2's best rate, which is attained at a small X. Importantly, LOGAN always outperforms ksw2 both in runtime and in GCUPS, independently of their respective peak performance at different values of X.
Figure 12 illustrates LOGAN's performance compared to two GPU-based algorithms, CUDASW++ and manymap. Notably, each of these three implementations performs a different amount of work, so we report the performance in terms of GCUPS. Furthermore, CUDASW++ uses hybrid GPU/SIMD computation by default; we report its performance with both hybrid and GPU-only computation. LOGAN consistently outperforms both CUDASW++ and manymap on a single GPU, and running with eight GPUs it computes substantially more GCUPS than GPU-only CUDASW++.
Finally, Figures 10 and 11 present BELLA's performance improvements when using LOGAN as the pairwise alignment kernel. Our results show that BELLA attains significant speedups up to 7× and 10× on one GPU and six GPUs, respectively. Tables IV and V show the runtime of the original software in the column named "BELLA" and the runtime of BELLA using LOGAN as the pairwise alignment kernel in the "LOGAN" columns. For large values of X, the results show that LOGAN's runtime does not drastically degrade with increasing X. Notably, BELLA operates in a context where sequences have a high error rate. In this scenario, small values of X can potentially lead to early dropouts, even when sequences are supposed to align to the endpoints. Up to a certain point, increasing the value of X increases the number of true alignments and makes it easier to differentiate true alignments from false positives. LOGAN's integration would allow BELLA to use larger X values, resulting in higher accuracy without a notable increase in runtime.
Computing time scales linearly; however, communication with multiple GPUs introduces an overhead that increases with the number of GPUs.
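This behavior can be summarized with a toy scaling model (purely illustrative; the fixed per-GPU overhead term is an assumption, not a measured fit):

```python
def multi_gpu_time(t_compute_1gpu: float, n_gpus: int,
                   t_overhead_per_gpu: float) -> float:
    """Compute time divides linearly across GPUs, while the
    communication overhead grows with the number of GPUs."""
    return t_compute_1gpu / n_gpus + t_overhead_per_gpu * n_gpus

# With a fixed per-GPU overhead, speedup flattens as GPUs are added:
t1 = multi_gpu_time(100.0, 1, 0.5)   # 100.5
t8 = multi_gpu_time(100.0, 8, 0.5)   # 16.5
print(t1 / t8)  # speedup well below the ideal 8x
```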
VII LOGAN Roofline Analysis
In this section, we provide a detailed analysis of the optimized LOGAN GPU performance by adapting the Roofline model [CACM09_Roofline, Ding2019instructionroofline] to fit our specific computational characteristics. The Roofline model is a visually intuitive method to understand the performance of a given kernel based on a bound-and-bottleneck analysis. The model outlines which factors affect the performance of computer systems, relating processor performance to off-chip memory traffic. Operational Intensity (OI, x-axis) is defined as instructions per byte of DRAM traffic, which measures traffic between the caches and memory. Given that our kernel performs only integer operations, we use billions of warp instructions per second (warp GIPS) as the y-axis. Thus, our Roofline analysis combines integer performance, operational intensity, and memory performance into a 2D log-log graph, as shown in Figure 13.
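The Roofline bound itself is simply the minimum of the compute ceiling and the bandwidth-scaled operational intensity. A minimal sketch (the 220.8 warp-GIPS INT32 ceiling is derived in the next paragraph; the 1.38 GHz sustained clock and the 900 GB/s nominal V100 HBM2 bandwidth are assumptions consistent with that figure):

```python
SMS = 80                 # streaming multiprocessors on a Tesla V100
SCHEDULERS_PER_SM = 4    # processing blocks (warp schedulers) per SM
CLOCK_GHZ = 1.38         # assumed sustained clock

# One warp instruction per scheduler per cycle:
peak_warp_gips = SMS * SCHEDULERS_PER_SM * CLOCK_GHZ          # 441.6
# 16 INT32 cores per block serve only half a 32-thread warp,
# so integer warp instructions issue at half rate:
int32_warp_gips = peak_warp_gips * 16 / 32                    # 220.8

def roofline_warp_gips(oi: float, peak_gips: float = 220.8,
                       bw_gb_s: float = 900.0) -> float:
    """Attainable warp GIPS for a kernel with operational intensity
    `oi` (warp instructions per byte of DRAM traffic)."""
    return min(peak_gips, oi * bw_gb_s)

# The ridge point separates the memory-bound region (left) from the
# compute-bound region (right):
ridge = int32_warp_gips / 900.0   # ~0.245 instructions/byte
print(roofline_warp_gips(0.1))    # memory-bound
print(roofline_warp_gips(1.0))    # compute-bound
```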
On one NVIDIA Tesla V100 GPU, 80 Streaming Multiprocessors (SMs) are available, where each SM consists of four processing blocks, called warp schedulers. Each warp scheduler can dispatch only one instruction per cycle; as such, the theoretical maximum (warp-based) instruction rate is 441.6 warp GIPS. Besides, each processing block contains 16 FP32 cores, 8 FP64 cores, and 16 INT32 cores. The maximum attainable integer performance is 220.8 integer warp GIPS, since the 16 INT32 cores can serve only 16 of the 32 threads in one warp. Peak performance is upper-bounded by both the theoretical INT32 peak rate and the peak memory bandwidth, which define the green line in the plot. The actual warp Giga Instructions Per Second (GIPS) depends on the operational intensity, and the ceiling line determines the limit of the attainable performance. A kernel is memory-bound if its warp GIPS are limited by the memory bandwidth (left of the red dotted line), and compute-bound if limited by the hardware performance limit (right of the red dotted line). This maximum attainable performance represents a ceiling in the Roofline plot for the considered GPU platform and is independent of the executed algorithm. To adapt this ceiling to the X-drop algorithm, we use the following formula:
\[
\text{Ceiling}_{\text{adapted}} \;=\; \frac{P}{I}\sum_{i=1}^{I}\frac{\min\left(W_i,\ B \cdot T,\ C\right)}{C} \tag{1}
\]
Equation 1 defines a new ceiling by averaging the number of cells that the GPU can compute in parallel. I indicates the total number of parallel iterations for a given alignment, P is the theoretical ceiling (220.8 warp GIPS), B is the number of scheduled blocks, W_i indicates the number of operations that need to be computed at iteration i, T is the number of scheduled threads per block, and, finally, C indicates the number of INT32 cores available. LOGAN’s overall performance behavior is shown in Figure 13. Results show the operational intensity of our kernel with respect to the HBM memory of the GPU, indicating that we are not memory-bound and are instead bound by the adapted theoretical ceiling. In other words, the operational intensity of our kernel is high enough to place it in the compute-bound area of the Roofline, so it is not limited by the HBM memory. Note that the optimized performance of our algorithm is very close to the adapted theoretical ceiling. Considering that the adapted ceiling does not take memory latency into account, the results of our implementation are extremely close to the maximum achievable performance. Therefore, LOGAN represents a near-optimal implementation of the X-drop algorithm, limited only by the compute capability of the GPU.
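Equation 1 can be evaluated directly from a per-iteration work trace. A minimal sketch (the anti-diagonal workload below is a toy example, not a measured trace; symbol names follow the equation, and C = 80 SMs × 64 INT32 cores is our reading of the V100 total):

```python
def adapted_ceiling(W, B: int, T: int, C: int, P: float = 220.8) -> float:
    """Average Equation 1 over the I iterations of the kernel: at each
    iteration, at most min(W_i, B*T, C) of the C INT32 cores stay busy."""
    I = len(W)
    return (P / I) * sum(min(w, B * T, C) / C for w in W)

# An X-drop search explores a band that grows and shrinks along the
# anti-diagonals, so per-iteration work ramps up and then down:
W = [64, 256, 1024, 4096, 4096, 1024, 256, 64]
print(adapted_ceiling(W, B=32, T=128, C=5120))  # well below the 220.8 ceiling
```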
VIII Conclusions
Our work presents LOGAN, the first high-performance multi-GPU implementation of the X-drop alignment algorithm. X-drop is employed in several important genomics applications; however, it is particularly challenging to parallelize on GPUs due to its adaptive banding and continual termination checks.
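The termination behavior that makes X-drop hard to parallelize can be reduced to a few lines. A greatly simplified sketch of the pruning rule (scoring and banding are stripped to their essentials; this is an illustration, not LOGAN’s CUDA kernel):

```python
def xdrop_prune(scores, best_so_far, x):
    """Mask out cells of the current anti-diagonal whose score has
    dropped more than `x` below the best score seen so far; when every
    cell is pruned, the whole alignment terminates early."""
    kept = [s if s >= best_so_far - x else None for s in scores]
    terminated = all(s is None for s in kept)
    return kept, terminated

# The cell scoring -3 falls below 12 - 8 = 4 and is pruned; the
# alignment continues because two cells survive:
kept, done = xdrop_prune([10, 4, -3], best_so_far=12, x=8)
print(kept, done)  # [10, 4, None] False
```

The data-dependent band width and the possibility of terminating at any anti-diagonal are exactly what frustrates a static GPU work partition.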
Detailed results and analyses show significant performance acceleration using our novel optimization approach. LOGAN demonstrated substantial runtime improvements using six GPUs compared with the original CPU algorithm. Additionally, results show significant speedups over the SIMD-vectorized ksw2 algorithm, which implements a similar heuristic. Finally, integrating LOGAN into BELLA, a real-world many-to-many long-read overlapper and aligner, resulted in performance improvements of up to 10×.
Finally, our work provided an adaptation of the Roofline model that captures the unique aspects of our computation in the context of the underlying GPU hardware configuration. Roofline analysis demonstrates that our X-drop design methodology results in near-optimal performance. Our overall results show that our optimized kernel is flexible, efficient, and can be easily integrated into long-read applications performing pairwise alignment.
Future work will focus on reducing LOGAN’s load-balancing overhead, to enable linear performance improvements with increasing GPU counts independent of the value of X. Given that our implementation can be easily adapted to solve other similar problems, we also plan to extend LOGAN to support protein alignment, and we expect the X-drop algorithm to be effective in protein homology searches.
Acknowledgments
We would like to thank Francesco Peverelli and Muaaz Awan for useful suggestions and valuable discussions.
This work is supported by the Advanced Scientific Computing Research (ASCR) program within the Office of Science of the DOE under contract number DE-AC02-05CH11231. This research was also supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration.
We used resources of NERSC, supported by the Office of Science of the DOE under Contract No. DE-AC02-05CH11231. This research also used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.