Pairwise alignment is one of the most commonly used workhorses of sequence analysis. It is used to correct raw sequencer reads, assemble them into more complete genomes, search databases for similar sequences, and many other problems. The optimal solutions for this problem require quadratic time (i.e. they take time for aligning a sequence A of length and a sequence B of length ). Namely, Needleman–Wunsch (NW) [needleman1970general] is used to find the best global alignment by forcing the alignment to extend to the endpoints of both sequences. Alternatively, Smith–Waterman (SW)[smith1981identification] computes the best local alignment by finding the highest scoring alignment between continuous subsequences of the input sequences.
The popular -drop [zhang2000greedy] algorithm avoids the full quadratic cost by searching only for high-quality alignments, and can be viewed as an approach to accelerate both NW and SW. Most applications of alignment will throw out low quality alignments, which arise when the two strings are not similar. Instead of exploring the whole space, the -drop algorithm searches only for alignments that results in a limited number of edits between the two sequences. -drop keeps a running maximum score and does not explore cell neighborhoods whose score decreases by a user-specified parameter . It gets its performance benefits from searching a limited space of solutions and stopping early when a good alignment is not possible.
Zhang et al. [zhang2000greedy] proved that, for certain scoring matrices, the -drop algorithm is guaranteed to find the optimal alignment between relatively similar sequences. In practice, the algorithm eliminates searches between sequences that are clearly diverging. This feature is especially effective for many-to-many alignment problems when there is an attempt to align many sequences to many other possibly matching sequences, i.e., the cost is high as is the possibility that some pairs will not align. With -drop, any spurious candidate pair is readily eliminated because the optimal score quickly drops. Consequently, -drop and its variants are the algorithm of choice in some of the most popular sequence mapping software including BLAST [altschul1990basic], LAST [kielbasa2011adaptive], BLASTZ [schwartz2003human] with -drop, and minimap2 [li2018minimap2] with -drop.
-drop is a heuristic for cutting the cost of alignment, it also produces good quality results, which are sometimes better than a more complete search. Frith et al.[frith2010parameters] show that a large does not necessarily produce better alignments. Without the -drop feature, the alignment algorithm can incorrectly glue two independent local alignments into a large one. For example, consider two sequences, one of the form S = A-B-C and other of the form R = A-D-C. Since the regions A and C produce high-scoring alignments, likely a high would incorrectly determine that provided that B and D regions are short enough.
Although there are numerous GPU implementations of the full SW and NW algorithms that often achieve impressive computational rates (measured in CUPS or cell updates per second), they are rarely incorporated into high-impact genomics pipelines due to their quadratic complexity. By contrast, a GPU implementation of -drop is notably missing from the literature despite its benefits and popularity. This is likely due to the increased complexity of implementing -drop efficiently on a GPU, compared with NW and SW methods, because of the dynamic nature of the computation, its adaptive band, and the need to check for completion.
Our main contributions are:
We present the first high-performance, multi-GPU implementation of the -drop algorithm, named LOGAN, which achieves significant speed-ups over leading versions on state-of-the-art processors.
We integrate LOGAN within BELLA, a long-read many-to-many overlapping and alignment software and demonstrate performance improvements up to .
We adapt the Roofline Model to LOGAN implementation and underlying hardware, and demonstrate that performance is near optimal on NVIDIA Tesla V100s.
A key aspect of our implementation is combining different levels of parallelism.
Specifically, we implement the intra-sequence parallelism via dynamic thread scheduling and in-warp parallelism, while accomplishing inter-sequence parallelism by assigning each GPU block to a single alignment.
Finally, we carry out parallelism across multiple GPUs through a load balancer.
The remainder of the paper is organized as follows. Section II provides an overview of the related work, while Section III describes the original software algorithm we port on GPU. Section IV describes our implementation and optimizations. Section V presents LOGAN integration within BELLA [guidi2018bella], a long-read overlapping and alignment software. Section VI illustrates our experimental results, while Section VII describes the Roolfline model used to analyze our implementation. Finally, Section VIII summarizes our contributions and outlines future work.
Ii Related Work
The majority of hardware acceleration efforts for pairwise alignment have focused on the Smith–Waterman (SW) and Needleman–Wunsch (NW) algorithms. These find exact alignments and have quadratic complexity in the lengths of the reads. Along with some of the most successful NW and SW acceleration efforts, we review the few efforts to accelerate heuristics more similar to our own. Though they are more generally applicable, exploiting GPU parallelism in these heuristics is more challenging due to their adaptive nature. As a common success metric, we report Giga Cell Updates Per Second (GCUPS), as reported by the original work, throughout this section. It is important to keep in mind that the GCUPS rates presented in this section were collected by the respective authors using different architectures than examined in our study. In Section VI, we collect comparative performance data with the ksw2 algorithm on equivalent platforms.
The implementation of Michael Farrar [farrar2006striped], which is adopted in Bowtie2 [langmead2012fast], stands out amongst software implementations of the SW algorithm. It leverages SIMD instructions and reaches performance of more than 20 GCUPS on an Intel Xeon Gold with 40 CPU threads. The same software implementation has been optimized for the PlayStation 3 processor and the IBM QS20 architecture [szalkowski2008swps3], achieving performance of and GCUPS, respectively.
CUDASW++3 [liu2013cudasw++] accelerates the SW algorithm combining SIMD instructions and GPU parallelism. The implementation achieves up to GCUPS when aligning reads with length less than characters. However, the performance drops significantly when the sequence length exceeds characters. Additionally, when running only using the GPU, their maximum attained performance is GCUPS (roughly of their peak performance).
Muhammadzadeh presented MR-CUDASW++ [mr-cudasw2014], which was inspired by CUDASW++3 but optimized for “medium length” reads. Muhammadzadeh compares MR-CUDASW++ to other tools, with CUDASW++3 as its closest contender, across sequences lengths of K, K, and K. MR-CUDASW++ achieved speedups of over CUDASW++3. The results were below 85 GCUPS using an NVIDIA Tesla V100, which is the same GPU used for our benchmarks. Li et al. [li2007160] accelerate the SW algorithm achieving a speed-up of over compared to software implementation using an Altera Nios II Field Programmable Gate Array (FPGA). Nevertheless, the performance of their proposed solution is comparable to existing optimized software implementations. The SW implementation of Di Tucci et al. [di2017architectural], running on a Xilinx Virtex 7 and a Kintex Ultrascale platforms, achieves up to GCUPS and a speed-up of over the state-of-the-art FPGA implementation. However, this work is limited to aligning short sequences that have a number of characters not exceeding the number of processing elements in the architecture.
A recent work by Turakhia et al. [turakhia2018darwin], Darwin, exploits FPGAs to speed up the alignment process achieving up to GCUPS. Darwin uses a seed-and-extend heuristic (GATC) that performs the extension stage in the seed-and-extend paradigm in which Dynamic Programming (DP) is used around the seed hit to obtain local alignments similar to SW.
Feng et al. [FengIcpp2019] recently presented accelerator-based optimizations for minimap2 [Minimap2]. Leveraging the GPU architecture, they accelerate minimap2’s seed-chain-extend pairwise alignment algorithm, which is quadratic in the length of the reads when computing traceback and linear otherwise. They reported performance of GCUPS (a speed-up over the minimap2’s SIMD software implementation). Our -drop alignment algorithm in this study computes a similar heuristic to minimap2’s pairwise alignment. In Section VI, we compare our work with ksw2 [suzuki2018introducing] (i.e., minimap2’s alignment kernel), showing that LOGAN achieves higher performance in terms of GCUPS than both ksw2 and the performance reported by Feng et al. for their GPU-accelerated implementation.
Despite a large number of sophisticated implementations, the overwhelming majority of the proposed hardware accelerations studies implement the exact SW or NW algorithms. We believe the although -drop algorithm is the most practical choice for targeting large-scale alignments, it requires more challenging parallelization than the original SW or NW methods. Though relatively unexplored due to this challenge, we demonstrate that GPU optimization results in significant acceleration.
This section provides an overview of the -drop implementation proposed by Zhang et al. [zhang2000greedy] and implemented in the SeqAn library [doring2008seqan], a C++ library for sequence analysis. First, we review the formal definition of alignment. A pairwise alignment of sequences s and t over an alphabet is defined as the pair such that and the following properties hold:
Deleting all “” from yields , and deleting all “” from yields .
A scoring scheme is used to distinguish high-quality alignments from the many (valid) alignments of a given pair of sequences. Scoring schemes generally reward matches and penalize mismatches, insertions, and deletions.
Iii-a The -drop Algorithm
Given two DNA sequences and of length and , the goal of the -drop algorithm is to find the highest-scoring semi-global alignment between and of the forms and , for some and that are chosen to maximize the score.
For a given and , we define as the alignment matrix and as the alignment score between and . A positive match score is added to for each pair of identical nucleotides. If nucleotides do not match, the algorithm can either subtract a mismatch score to and move diagonally or subtract a gap score and move horizontally (gap into the vertical sequence) or vertically (gap into the horizontal sequence) in the dynamic programming grid. More formally, each cell of the alignment matrix is computed as follows:
Figure 1 shows the three dependencies of a cell during the computation: two dependencies on cells from the previous iteration and one dependency on a cell at two iterations prior. Note that SW, NW, as well as the majority of their heuristic implementations show these dependencies. SW and NW compute the entire matrix to find the optimal alignment. This quadratic algorithm is extremely inefficient in the case of either misalignment or when aligning almost identical sequences. A misalignment could happen, for example, when two sequences have a tiny region in common due to a genomic repetition. The SW algorithm would spend significant computational resources calculating the entire dynamic programming (DP) matrix and report a very poor alignment score between the two sequences. On the other hand, SW or NW on two almost identical sequences would compute the whole DP matrix with no additional benefit, since the optimal alignment score would remain close to the diagonal of the DP matrix.
The concept of the -drop termination consists of halting the computation if the alignment score drops more than below the best alignment score seen so far for that pair of sequences. is potentially updated at each anti-diagonal iteration. If , we set the cell equal to and no longer consider that cell for the subsequent iterations of . The cells set to are used to compute the lower and upper bound for the next anti-diagonal iteration. This approach limits the anti-diagonal width, reducing the search space of the algorithm, and automatically provides a termination condition. The -drop algorithm is particularly efficient when two sequences do not align. A pseudo-code of the -drop algorithm is shown in Algorithm 1.
Note that -drop should not be confused with the popular banded-SW method. This approach constrains the search space to a fixed band along the diagonal, regardless of the drop in the score. The areas of the dynamic programming grid explored by these algorithms are characteristically different from -drop’s search space, which is reminiscent of a rugged band with changes in the length of each anti-diagonal, as shown in Figure 2. To better understand the difference in practice, consider two sequences that have very high (over 50) differences in terms of substitutions but have no indel (insertion or deletion) differences. The optimal path would be along the diagonal because both a mismatch and match move the cursor in both sequences. -drop will correctly terminate the search early due to a significant drop in the score whereas banded-SW would explore the entire band.
The key aspect of our implementation is exploiting as many levels of parallelism as possible on the Graphical Processing Unit (GPU). Intra-level parallelism is achieved by dynamically scheduling the threads based on the value of and through the use of in-warp parallelization to find the maximum of the anti-diagonals. To achieve inter-level parallelism, we implement the parallel execution of multiple alignments by assigning each GPU block to a single alignment. Multi-GPU parallelism is obtained by implementing a GPU load balancer that adapts the execution of LOGAN to leverage multiple GPUs.
In this section, we describe the design of our implementation. Section V then presents our optimized kernel integration within BELLA, a real-world application. Note that we refer to GPU threads and GPU blocks simply as threads and blocks.
Iv-a Intra-Sequence Parallelism
We first consider the intra-sequences parallelism, which is the parallelization of a single pairwise alignment and its anti-diagonals computation. Given that the -drop algorithm we decided to port to GPU does not perform alignment trace-back, we do not store the entire alignment matrix on the device for each alignment. Therefore, we can reduce the memory footprint of our kernel on the GPU by storing only three anti-diagonals per alignment: current, previous, and two iterations prior, as highlighted in Figure 1.
Similarly to SW and NW algorithms, we compute the cell updates of each anti-diagonal of the alignment matrix in parallel. Each anti-diagonal cell has three dependencies on cells from anti-diagonals at previous iterations. Nevertheless, we can leverage the independence between cells belonging to the same anti-diagonal. To compute an anti-diagonal update in parallel, we assign each cell to a GPU thread and compute them independently, where a GPU block can schedule up to 1024 threads. To overcome this limitation and ensure the computation of any anti-diagonal length, we split each anti-diagonal into segments, as shown in Figure 3. The anti-diagonal is split into segments whose width is equal to the number of threads within a block. Once a segment of the anti-diagonal is completed, the kernel initiates the computation of the subsequent segment. This process is repeated until the entire anti-diagonal is computed.
For each cell, the corresponding thread sets the score to if the cell score drops below the global maximum of the alignment matrix, that is best in Algorithm 1. The overall maximum score is a shared variable within the considered block. The global maximum of the scoring matrix is updated after any anti-diagonal has been completely calculated since it needs to consider the newly computed scores. Computing the global maximum naively would significantly slow down the execution since it would require serial comparison of each cell with the all others in a given anti-diagonal. Thus we speed up this process by computing the maximum anti-diagonal score via a parallel reduction.
Once each thread is assigned to a cell of the anti-diagonal, our algorithm leverages in-warp thread communication to compare the values inside cells and perform the reduction, where a warp is a set of 32 threads executing the same code and sharing the same data. All threads within a warp communicate using registers, which maximizes communication speed. Algorithm 2 illustrates the parallel computation of the anti-diagonal. Finally, the size of the next anti-diagonal to be computed is updated by checking if there are cells marked with at the end or the start of the current anti-diagonal. LOGAN continues computing the alignment matrix until either it reaches the end of the shortest read or the size of the current anti-diagonal is set to zero, meaning that it has satisfied the condition.
Iv-B Inter-Sequence Parallelism
Intra-sequence parallelism optimizes the alignment computation for a single pair of sequences, however, it does not effectively leverage the large volume of GPU computational resources. We therefore exploit the GPU computational potential by designing LOGAN to align multiple pairs of sequences in parallel by assigning each alignment to a GPU block — thus taking advantage of inter-sequences parallelism. LOGAN schedules the number of GPU blocks based on the number of alignments needed to be performed (Figure 4). Each NVIDIA V100s GPU block can store up to 64KB in shared-memory and performs an independent alignment. Since only three anti-diagonals need to be stored, we could ostensibly store them into GPU shared-memory (the fastest memory available after the registers). However, despite the potential of reserving 64KB of memory per block, this cannot be implemented in practice as the device has only 96KB of memory per streaming multiprocessor (SM). Given that each SM on the device can execute up to blocks in parallel, if a single block has reserved too much memory, a SM is forced to exchange data with the DRAM of the GPU every time it computes a block. Furthermore, given that only a single block could fit on a single SM, the execution would be limited to a single block per SM. Given our goal of achieving the best possible board-level utilization, we need to compute multiple blocks per SM in parallel. Consequently, to overcome these limitations and avoid shared memory contention, LOGAN stores the three anti-diagonals of each alignment on the High Bandwidth Memory (HBM) of the GPU. Doing so, removes the limitation on the number of blocks per SM, and achieves significantly more effective parallelism.
To further improve our resource utilization, LOGAN schedules a number of threads per block lower than the maximum of . In fact, if the number of threads exceeds the anti-diagonal length, many of the threads will stall, decreasing overall performance. Additionally, we need to store the intermediate results of the parallel reduction in shared memory to enable in-warp thread communication when computing the anti-diagonal maximum score. Since the number of intermediate results is equal to the number of scheduled threads, reducing the number of scheduled threads per block also reduces the risk of shared memory contention. Given that the length of each anti-diagonal is proportional to the value of , our implementation schedules a number of threads proportional to , significantly reducing the number of stalled threads. This scheduling increases our performance and improves our resource utilization. Table I shows the impact of the various degrees of parallelization implemented in the LOGAN kernel. The first two rows of the table show the impact of intra-level parallelism over a single-thread execution, while the second two show the impact over inter-level parallelism for K alignments of read pairs. Note that intra-level parallelism improves the performance by a factor of about , while the introduction of inter-level parallelism improves performance by an impressive factor of with respect to intra-parallelism alone. The intra-sequence parallelism has insufficient work to consume available GPU resources, hence its impact on performance is limited compared to inter-sequence parallelism. Notably, inter- and intra-sequence parallelisms are complementary to each other. LOGAN therefore implements both to better exploit the resources of the GPU and maximize our kernel performance.
|Intra- and inter-sequence||100K||128||100K||7.35s||22000.0|
To make the LOGAN processing more efficient on the GPU, we additionally introduce CPU host optimization. First, LOGAN loads the length of the sequences and the seed locations for each pair of sequences and stores them into two buffers. Each pair of sequences is split in two based on the seed’s location, resulting in a left-extension pair and a right-extension pair, as shown in Figure 5. Left-extension and right-extension pairs are stored into two different vectors, and these alignments are computed independently by scheduling two different streams on the GPU. Traditionally, when aligning two sequences, one of them is accessed backward, resulting in memory performance degradation since characters are read in the opposite direction of the memory. To ensure coalesced data access on the GPU and exploit memory burst, one of the two sequences (for each given pair) is reversed, as shown in Figure 6. This optimization linearizes the GPU memory data access, thus increasing performance while preserving the correct solution. Finally, note that kernel execution is scheduled asynchronously from the host, enabling the retrieval of alignment results as soon as they are available, instead of waiting for all the alignments to complete.
Iv-C Implementation with Multiple GPU Devices
To effectively exploit the available multiple GPU resources, LOGAN leverages a load balancer as shown in Figure 7. This optimization allows LOGAN to run on varying GPU configurations, since it can adapt the load to the specific number of GPUs present within a given system.
The host application balances the computation by scheduling the number of alignments for each GPU. The host switches context multiple times and then replicates the operations for each GPU to simplify its task. The pre-processing of the sequences occurs as in the single device implementation and the load balancer divides the sequences into different groups that are then assigned to the GPUs. The HBM memory of the GPU represents a limiting resource for LOGAN, since in the single GPU implementation it is fully utilized. To ensure balance, we schedule the number of alignments per GPU considering both the number of available GPUs and the length of the sequences. Once the division is completed, the host allocates the necessary memory on the different GPUs, enabling each GPU to execute its set of alignments independently. The host then schedules each GPU kernel to be executed in parallel and collects alignment results asynchronously. Once the GPU devices completed their execution, the load balancer collects and organizes the results.
V BELLA Integration
To demonstrate the impact of the work, we integrate LOGAN into a long-read analysis software, called BELLA [guidi2018bella]. BELLA is a recently released, publicly-available software for long-read many-to-many overlap detection and alignment. Detecting overlaps is a crucial and computationally intense step in many long-read applications, such as de novo genome assembly and error correction. BELLA uses a seed-based approach for overlap detection implemented as an efficient sparse matrix-matrix multiplication (SpGEMM) kernel. Before performing overlapping, BELLA provides a new algorithm for pruning the -mers, substrings of fixed length used as seeds. The -mers are pruned because unlikely to be useful in overlap detection and their retention would cause unnecessary computational overhead and potential errors. Once overlaps are identified, a seed-end-extend pairwise alignment step is performed to filter out spurious overlaps. BELLA chooses the optimal -mer to begin alignment extension, as illustrated in Figure 5, through a binning mechanism, where
-mer locations are used to estimate the overlap length and to “bin”-mers to form a consensus. BELLA additionally uses the novel approach of separating true alignments from false positives using an adaptive threshold based on a combination of alignment techniques and probabilistic modeling.
BELLA relies on SeqAn’s -drop implementation [doring2008seqan] for pairwise alignment, which constitutes about of the overall runtime when using real data sets. Once overlaps are computed via SpGEMM, BELLA performs the pairwise alignment and determines if the aligned pair should be kept. The current implementation is efficient for SeqAn as the processor computes independent pairwise alignments in parallel using OpenMP [dagum1998openmp]. However, this approach is inefficient for the GPU architecture, since it limits the amount of parallelism between alignments. To better exploit inter-alignment parallelism, we modify BELLA to batch the entire set of alignments together and send them to the GPU devices. The host CPU then retrieves and post-processes the alignment results. Our optimized BELLA version with LOGAN integration produces equivalent results as the original version.
|168 CPU Threads||1 GPU||6 GPU|
|80 CPU Threads||1 GPU||8 GPU|
|168 CPU Threads||1 GPU||6 GPU|
In this section, we describe the experimental settings used to evaluate the LOGAN methodology and present our performance results.
Vi-a Experimental Setting
We first compare LOGAN against the CPU-based -drop algorithm as implemented in SeqAn [doring2008seqan]. Next, we evaluate LOGAN against two GPU-based algorithms: the current state-of-the-art implementation of full Smith-Waterman (SW), CUDASW3++ [liu2013cudasw++], and the closest heuristics to ours proposed by Feng et al., manymap [FengIcpp2019]. Finally, we integrate LOGAN into the BELLA long-read application to demonstrate its benefit in a real-world computation.
To compare LOGAN against SeqAn’s, we generate a set of 100K read pairs with read length between 2,500 and 7,500 characters and an error rate of 15 between two reads of a given pair. The results were collected on a dual-socket server with two 22-core IBM POWER9 processors and 6 NVIDIA Tesla V100s (16 GB HBM2) with 512 GB DDR4 of RAM. Each processor has 21 compute cores with 4 threads per core. Also, we compare LOGAN to minimap2’s [Minimap2] vectorized -drop alignment algorithm, called ksw2 [suzuki2018introducing], using the same data set of 100K pairs described above. Note that we conducted these comparisons on a different hardware platform: a dual-socket computer with two 20-core Intel Xeon Gold 6148 CPU processors, each running at 2.40 GHz with 384 GB DDR4 MHz memory and 8 NVIDIA Tesla V100s (16 GB HBM2) GPUs. A different platform was required since the POWER9 processors are not compatible with ksw2’s SSE2 SIMD instructions. On the Intel Xeon Gold platform with 8 NVIDIA Tesla V100s, we also perform the comparison between LOGAN, CUDASW++, and manymap using the same 100K pairs as above.
Additionally, we integrate LOGAN into the BELLA long-read application [guidi2018bella], described in Section V, by evaluating the performance difference of replacing SeqAn with LOGAN. For this comparison, we used a real E. coli and a synthetic C. elegans data sets, requiring 1.8M and 235M alignments, respectively. We analyzed these experiments on the same hardware platform as SeqAn evaluation.
|168 CPU Threads||1 GPU||6 GPU|
Figure 8 shows LOGAN’s speed-up using both one GPU and the entire set of six GPUs compared against SeqAn’s implementation using 168 threads on two POWER9 processors. Details of the execution time are shown in Table II. Note that LOGAN’s execution times remain roughly constant for large values of . In these scenarios, we can exploit the full parallelism of the GPU architecture, resulting in similar execution times. Observe that LOGAN attains speed-ups ranging from to for a single GPU and from to using all six GPUs. As expected, LOGAN achieves higher speed-ups as the value of increases, since the alignment runs for a longer duration. We also note that LOGAN multiple GPU implementation scales better for longer execution runs. This is due to amortizing the load balancing overhead when dividing the sequences into different groups.
Figure 9 presents LOGAN’s performance using both 1 GPU and the entire set of 8 GPUs when compared against ksw2’s CPU vectorized implementation on the Skylake processor. Both algorithms are benchmarked using the same set of 100K alignments used to compare LOGAN and SeqAn. Results show that LOGAN attains significant speed-ups ranging from to with a single GPU and from to using eight GPUs. Additionally, we can observe that ksw2 performs better when aligning the sequences using a small value of and its performance degrades drastically when increasing the -drop value, as shown in Table III. Given LOGAN and ksw2 implement two slightly different heuristics, we also report a comparison based on the GCUPS metric. LOGAN achieves up to GCUPS with a single GPU for , while ksw2 best performance is only GCUPS for . Importantly, LOGAN always outperforms ksw2 both in terms of runtime and GCUPS, independently from their respective peak performance at different values of .
Figure 12 illustrates LOGAN’s performance compared to two GPU-based algorithms, CUDASW++ and manymap. Notably, each of these three implementations performs a different amount of work, therefore we report the performance in terms of GCUPS. Furthermore, CUDASW++ uses hybrid GPU/SIMD computation by default. We report its performance with both hybrid and GPU-only computation. LOGAN consistently outperforms both CUDASW++ and manymap with performance up to GCUPS on a single GPU, while CUDASW++ and manymap achieve at most and GCUPS, respectively. Running with eight GPUs, LOGAN computes more GCUPS than GPU-only CUDASW++.
Finally, Figures 10 and 11 present BELLA’s performance improvements when using LOGAN as pairwise alignment kernel. Our results show that BELLA attains significant speed-ups up to 7 and 10 on one GPU and six GPUs, respectively. Tables IV and III show the runtime of the original software in the column named “BELLA” and the runtime of BELLA using LOGAN as pairwise alignment kernel in the column “LOGAN”. For large values of , results show that LOGAN’s runtime does not drastically degrade with increasing . Notably, BELLA operates in a context where sequences have an error rate of about . In this scenario, small values of can potentially lead to early drop-outs, even when sequences are supposed to align until the endpoints. Up to a certain point, increasing the value of increases the number of true alignments and makes it easier to differentiate true alignments from false positives. LOGAN’s integration would allow BELLA to use larger values, resulting in higher accuracy without a notable increase in runtime.
Computing time scales linearly, however, the communication with multiple GPUs introduces an overhead that increases with the number of GPUs.
Vii LOGAN Roofline Analysis
In this section, we provide a detailed analysis of the optimized LOGAN GPU performance by adapting the Roofline model [CACM09_Roofline, Ding2019instructionroofline] to fit our specific computational characteristics. The Roofline model is a visually-intuitive method to understand the performance of a given kernel based on a bound and bottleneck analysis approach. The model outlines which factors affect the performance of computer systems, relating processor performance to off-chip memory traffic. The Roofline model characterizes a kernel’s performance in billions of instructions (GIPS, y-axis) as a function of its operational intensity (OI, x-axis). We use Operational Intensity as the x-axis and, given that our kernel performs only integer operations, use billions of warp instructions per second (Warp GIPS) as the y-axis. Operational Intensity is defined as instructions per byte of DRAM traffic, which measures traffic between the caches and memory. Thus, our Roofline analysis combines integer performance, operational intensity, and memory performance into a 2D log-log scale graph, as shown in Figure 13.
On one NVIDIA Tesla V100 GPU, 80 Streaming Multiprocessor (SM)s are available, where each SM consists of four processing blocks, called warp schedulers. Each warp scheduler can dispatch only one instruction per cycle. As such, the theoretical maximum (warp-based) of instruction/s is GIPS. Besides, each processing block contains 16 FP32 cores, 8 FP64 cores, and 16 INT32 cores. The maximum attainable integer performance is integer warp GIPS since 16 INT32 cores can only support 16 threads out of 32 threads in one warp. Peak Performance is upper bounded by both the theoretical INT32 peak rate and the peak memory bandwidth, which define the green line in the plot. The actual Warp Giga Instructions Per Second (GIPS) depends on the operational intensity and the ceiling line determines the limit of the actual performance. A kernel is memory-bound if the Warp GIPS are limited by the memory bandwidth (left of the red dotted line), and is compute-bound if limited by the hardware performance limit (right of the red dotted line). This maximum attainable performance represents a ceiling in the Roofline model plot for the considered GPU platform and is independent from the executed algorithm. To adapt this ceiling to the -drop algorithm, we use the following formula:
Equation 1 defines a new ceiling by averaging the number of cells that GPU can compute in parallel. indicates the total number of parallel iterations for a given algorithm, is the theoretical ceiling (220.8 warp GIPS), is the number of scheduled blocks, indicates the number of operations that need to be computed at each iteration, is the number of scheduled threads per block, and, finally, indicates the number of INT32 cores available. LOGAN’s overall performance behavior is shown in Figure 13. Result show the operational intensity of our kernel on the HBM memory of the GPU, indicating that we are not memory bound and that we are bound by the adapted theoretical ceiling. In other words, the operational intensity of our kernel is high enough to be in the compute-bound area of the Roofline, thus it is not limited by the HBM memory. Note that the optimized performance of our algorithm is very close to the adapted theoretical ceiling. Considering that the adapted ceiling does not take into account memory latency, the results of our implementation are extremely close to the maximum achievable performance. Therefore, LOGAN represents a near-optimal implementation of the -drop algorithm and it is only limited by the compute capability of the GPU.
Our work presents LOGAN, the first high-performance multi-GPU implementation of the -drop alignment algorithm. -drop is employed in several important genomics applications, however it is particularly challenging for GPU parallelization due to its adaptive banding and continual termination checks.
Detailed results and analyses show significant performance acceleration using our novel optimization approach. LOGAN demonstrated runtime improvements of up to using six GPUs, compared with the original CPU algorithm. Additionally, results show speed-ups up to using six GPUs compared with the SIMD vectorized ksw2 algorithm, which implements a similar heuristics. Finally, LOGAN integration resulted in performance improvement up to on BELLA, a real-world many-to-many long-read overlapper and aligner.
Finally, our work provided an adaptation of the Roofline model that captures the unique aspects of our computation in the context of the underlying GPU hardware configuration. Roofline analysis demonstrates that our -drop design methodology results in near-optimal performance. Our overall results show that our optimized kernel is flexible, efficient, and can be easily integrated into long-read application performing pairwise alignment.
Future work will focus on reducing LOGAN’s load balancing overhead, to enable linear performance improvements with increasing GPU counts independent of the value of . Given our implementation can be easily adapted to solve other similar problems, we also plan to extend LOGAN to support protein alignment and expect the -drop algorithm to be effective in protein homology searches.
We would like to thank Francesco Peverelli and Muaaz Awan for useful suggestions and valuable discussions.
This work is supported by the Advanced Scientific Computing Research (ASCR) program within the Office of Science of the DOE under contract number DE-AC02-05CH11231. This research was also supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration.
We used resources of the NERSC supported by the Office of Science of the DOE under Contract No. DEAC02-05CH11231. This research also used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.