Accelerating Genome Analysis: A Primer on an Ongoing Journey

Genome analysis fundamentally starts with a process known as read mapping, where sequenced fragments of an organism's genome are compared against a reference genome. Read mapping is currently a major bottleneck in the entire genome analysis pipeline, because state-of-the-art genome sequencing technologies are able to sequence a genome much faster than the computational techniques employed to analyze the genome. We describe the ongoing journey in significantly improving the performance of read mapping. We explain state-of-the-art algorithmic methods and hardware-based acceleration approaches. Algorithmic approaches exploit the structure of the genome as well as the structure of the underlying hardware. Hardware-based acceleration approaches exploit specialized microarchitectures or various execution paradigms (e.g., processing inside or near memory). We conclude with the challenges of adopting these hardware-accelerated read mappers.


Read Mapping

Figure 1: (a) The three steps of read mapping in genome analysis: (1) indexing, (2) pre-alignment filtering, and (3) sequence alignment. (b) Overview of the existing approaches to accelerating each step of read mapping.

The main goal of read mapping is to locate possible subsequences of the reference genome sequence that are similar to the read sequence while allowing at most E edits, where E is the edit distance threshold. Commonly allowed edits include deletion, insertion, and substitution of characters in one or both sequences. Mapping billions of reads to the reference genome is computationally expensive [turakhia2018darwin, alser2019shouji, senolcali.micro2020]. Therefore, most read mapping algorithms apply two key heuristic steps, indexing and filtering, to reduce the number of reference genome segments that need to be compared with each read.

The three steps of read mapping are shown in Figure 1a. First, a read mapper indexes the reference genome by using substrings (called seeds) from each read to quickly identify all potential mapping locations of each read in the reference genome. Second, the mapper uses filtering heuristics to examine the similarity of every sequence pair (a read sequence and one potential matching segment in the reference genome identified during indexing). These filtering heuristics aim to eliminate most of the dissimilar sequence pairs. Third, the mapper performs sequence alignment, using approximate string matching (ASM), to check whether the remaining sequence pairs that filtering identified as similar are actually similar. The alignment step examines all possible prefixes of two sequences and tracks the prefixes that provide the highest possible alignment score (known as optimal alignment). The alignment score is a quantitative representation of the quality of an alignment for a given user-defined scoring function (computed based on the number of edits and/or matches).

Alignment algorithms typically use DP-based approaches to avoid re-examining the same prefixes many times. These DP-based algorithms provide the most accurate alignment results compared to other non-DP algorithms, but they have quadratic time and space complexity (i.e., O(n²) for sequences of length n). Sequence alignment calculates information about the alignment, such as the alignment score, edit distance, and the type of each edit. Edit distance is defined as the minimum number of changes needed to convert one sequence into the other. Such information is typically output by read mapping into a sequence alignment/map (SAM) file. Given the time spent on read mapping, all three steps have been targeted for acceleration. Figure 1b summarizes the different acceleration approaches, and we discuss a set of such works in the following sections.
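
To make the quadratic cost concrete, the following is a minimal sketch of the classical DP recurrence for edit distance (unit-cost insertions, deletions, and substitutions, without traceback). The function name and test sequences are illustrative; production aligners use affine gap penalties and heavily vectorized implementations.

```python
# Minimal sketch: classical DP for edit distance, O(n*m) time and space.
def edit_distance(a: str, b: str) -> int:
    n, m = len(a), len(b)
    # dp[i][j] = minimum edits to convert a[:i] into b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                      # i deletions
    for j in range(m + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # (mis)match
                           dp[i - 1][j] + 1,                           # deletion
                           dp[i][j - 1] + 1)                           # insertion
    return dp[n][m]

assert edit_distance("ACGT", "AGT") == 1  # one deletion
```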

Accelerating Indexing

The indexing operation generates a table that is indexed by the contents of a seed, and identifies all locations where the seed exists in the reference genome. Indexing needs to be done only once for a reference genome, and eliminates the need to perform ASM across the entire genome. During read mapping, a seed from a read is looked up in the table, and only the corresponding locations are used for ASM (as only they can match the entire read). The major challenge with indexing is choosing the appropriate length and number of to-be-indexed seeds, as they can significantly impact the memory footprint and overall performance of read mapping [li2018minimap2]. Querying short seeds potentially leads to a large number of mapping locations that need to be checked for a string match. The use of long reads requires extracting from each read a large number of seeds, as the sequencing error rate is much higher in long reads. This affects (1) the number of times we query the index structure and (2) the number of retrieved mapping locations. Thus, there are two key approaches used for accelerating the indexing step (Figure 1b).
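
As a concrete (and deliberately simplified) illustration of this step, the sketch below builds a hash-table index from every fixed-length seed of the reference and uses it to collect candidate mapping locations for a read; the seed length k and the function names are our own illustrative choices, not the data structures of any particular mapper.

```python
from collections import defaultdict

def build_index(reference: str, k: int) -> dict:
    """Map every length-k seed of the reference to its occurrence positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def candidate_locations(index: dict, read: str, k: int) -> set:
    """Look up every seed of the read; return candidate mapping locations."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            if pos >= offset:
                candidates.add(pos - offset)  # align read start to reference
    return candidates
```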

Reducing the Number of Seeds

Read mapping algorithms (e.g., minimap2 [li2018minimap2]) typically reduce the number of seeds that are stored in the index structure by finding the minimum representative set of seeds (called minimizers) from a group of adjacent seeds within a genomic region. The representative set can be calculated by imposing an ordering (e.g., lexicographic or by hash value) on a group of adjacent seeds and storing only the seed with the smallest order. Read mappers also apply heuristics to avoid examining the mapping locations of a seed that occurs more times than a user-defined threshold value [li2018minimap2]. Various data structures have been proposed and implemented to both reduce the storage cost of the indexing data structure and improve the algorithmic runtime of identifying the mapping locations within the indexing data structure. One example of such data structures is the FM-index (implemented in [langarita2020compressed]), which provides a compressed representation of the full-text index while allowing the index to be queried without decompression. This approach has two main advantages: (1) we can query seeds of arbitrary lengths, which helps to reduce the number of queried seeds, and (2) it typically has a smaller memory footprint than the index built by the indexing step of minimap2 [li2018minimap2]. However, one major bottleneck of FM-indexes is that locating the exact matches by querying the FM-index is significantly slower than doing so with classical indexes [vasimuddin2019efficient, langarita2020compressed]. BWA-MEM2 [vasimuddin2019efficient] proposes an uncompressed version of the FM-index, which is considerably larger than the compressed FM-index, to speed up the querying step.
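
The sketch below illustrates the minimizer idea under simplified assumptions: it keeps, from every window of w consecutive k-mers, only the smallest one under a lexicographic order (real mappers such as minimap2 instead order seeds by a hash value, among other refinements).

```python
def minimizer_seeds(seq: str, k: int, w: int) -> set:
    """Return (position, seed) pairs: the smallest k-mer in each w-kmer window."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        best = min(range(w), key=lambda i: window[i])  # lexicographic order
        selected.add((start + best, window[best]))
    return selected  # adjacent windows often pick the same seed, hence the set
```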

Reducing Data Movement During Indexing

RADAR [huangfu2018radar] observes that the indexing step is memory intensive, because the large number of random memory accesses dominates computation. The authors propose a processing-in-memory (PIM) architecture that stores the entire index inside the memory and enables querying the same index concurrently using a large number of ASIC compute units. The amount of data movement is reduced from tens of gigabytes to a few bytes for a single query task, allowing RADAR to balance memory accesses with computation, and thus provide speedups and energy savings.

Accelerating Pre-Alignment Filtering

After finding one or more potential mapping locations of the read in the reference genome, the read mapper checks the similarity between each read and each segment extracted at these mapping locations in the reference genome. These segments can be similar or dissimilar to the read, though they share common seeds. To avoid examining dissimilar sequences using computationally-expensive sequence alignment algorithms, read mappers typically use filtering heuristics that are called pre-alignment filters.

The key idea of pre-alignment filtering is to quickly estimate the number of edits between two given sequences and use this estimation to decide whether or not the computationally-expensive DP-based alignment calculation is needed; if not, a significant amount of time is saved by avoiding DP-based alignment. If two genomic sequences differ by more than the edit distance threshold, then the two sequences are identified as dissimilar and hence the DP calculation is not needed. In practice, only genomic sequence pairs with an edit distance less than or equal to a user-defined threshold (i.e., E) provide useful data for most genomic studies [kim2018grim, alser2019shouji, alser2020technology]. Pre-alignment filters use one of four major approaches to quickly filter out the dissimilar sequence pairs: (1) the pigeonhole principle, (2) base counting, (3) q-gram filtering, or (4) sparse DP. Long read mappers typically use q-gram filtering or sparse DP, as their performance scales linearly with read length and independently of the edit distance.

Pigeonhole Principle

The pigeonhole principle states that if E items are put into E+1 boxes, then one or more boxes would be empty. This principle can be applied to detect dissimilar sequences and discard them from the candidate sequence pairs used for ASM. If two sequences differ by E edits, then they should share at least a single subsequence (free of edits) among any E+1 non-overlapping subsequences [alser2019shouji], where E is the edit distance threshold. For a read of length m, if there are no more than E edits between the read and the reference segment, then the read and the reference segment are considered similar if they share at most E+1 non-overlapping subsequences, with a total length of at least m − E. The problem of identifying these E+1 non-overlapping subsequences is highly parallelizable, as these subsequences are independent of each other. Shouji [alser2019shouji] exploits the pigeonhole principle to reduce the search space and provide a scalable architecture that can be implemented for any values of E and m, by examining common subsequences independently and rapidly with high parallelism. Shouji accelerates sequence alignment by 4.2–18.8× without affecting the alignment accuracy. We refer the reader to the sidebar for a brief discussion of several other related works.
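
As a minimal software illustration of this principle (not Shouji's actual hardware algorithm), the sketch below splits the read into E+1 non-overlapping pieces and passes the pair only if at least one piece occurs exactly in the candidate segment; a pair with at most E edits must leave at least one piece edit-free.

```python
def pigeonhole_pass(read: str, segment: str, e: int) -> bool:
    """Conservative filter: False means the pair has more than e edits."""
    n = len(read)
    piece = max(1, n // (e + 1))
    for i in range(e + 1):
        lo = i * piece
        hi = n if i == e else lo + piece   # last piece absorbs the remainder
        if read[lo:hi] in segment:         # one edit-free piece survives
            return True
    return False                           # dissimilar: skip DP alignment
```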


Sidebar: Related Works on Pre-Alignment Filtering Using the Pigeonhole Principle

Pigeonhole-filtering-based pre-alignment filtering can accelerate read mappers even without specialized hardware. For example, the Adjacency Filter [1] accelerates sequence alignment by up to 19×. The accuracy and speed of pre-alignment filtering with the pigeonhole principle have rapidly improved over the last seven years. Shifted Hamming Distance (SHD) [2] uses SIMD-capable CPUs to provide high filtering speed, but supports a sequence length up to only 128 base pairs due to SIMD register widths. GateKeeper [3] utilizes the large amounts of parallelism offered by FPGA architectures to accelerate SHD and overcome such sequence length limitations. MAGNET [4] provides a comprehensive analysis of all sources of filtering inaccuracy of GateKeeper and SHD. Shouji [5] leverages this analysis to improve the accuracy of pre-alignment filtering by up to two orders of magnitude compared to both GateKeeper and SHD, using a new algorithm and a new FPGA architecture. SneakySnake [6] achieves up to four orders of magnitude higher filtering accuracy compared to GateKeeper and SHD by mapping the pre-alignment filtering problem to the single net routing (SNR) problem in VLSI chip layout. SNR finds the shortest routing path that interconnects two terminals on the boundaries of a VLSI chip layout in the presence of obstacles. SneakySnake is the only pre-alignment filter that works on CPUs, GPUs, and FPGAs. GenCache [7] proposes to perform highly-parallel pre-alignment filtering inside the CPU cache to reduce data movement and improve energy efficiency, with about 20% cache area overhead. GenCache shows that using different existing pre-alignment filters together, each of which operates only for a given edit distance threshold (e.g., using SHD only when E is between 1 and 5), provides a speedup over GenCache with a single pre-alignment filter.

REFERENCES

  1. Hongyi Xin et al. Accelerating Read Mapping with FastHASH. BMC Genomics, 2013.

  2. Hongyi Xin et al. Shifted Hamming Distance: A Fast and Accurate SIMD-Friendly Filter to Accelerate Alignment Verification in Read Mapping. Bioinformatics, 2015.

  3. Mohammed Alser et al. GateKeeper: A New Hardware Architecture for Accelerating Pre-Alignment in DNA Short Read Mapping. Bioinformatics, 2017.

  4. Mohammed Alser et al. MAGNET: Understanding and Improving the Accuracy of Genome Pre-Alignment Filtering. Transactions on Internet Research, 2017.

  5. Mohammed Alser et al. Shouji: A Fast and Efficient Pre-Alignment Filter for Sequence Alignment. Bioinformatics, 2019.

  6. Mohammed Alser et al. SneakySnake: A Fast and Accurate Universal Genome Pre-Alignment Filter for CPUs, GPUs, and FPGAs. arXiv:1910.09020 [q-bio.GN], 2019.

  7. Anirban Nag et al. GenCache: Leveraging In-Cache Operators for Efficient Sequence Alignment. In MICRO, 2019.

Base Counting

The base counting filter compares the numbers of bases (A, C, G, T) in the read with the corresponding base counts in the reference segment. If one sequence has, for example, three more As than another sequence, then their alignment has at least three edits. If the count difference for any base is greater than E, then the two sequences are dissimilar and the reference segment is discarded. Such a simple filtering approach rejects a significant fraction of dissimilar sequences (e.g., 49.8%–80.4% of sequences, as shown in GASSST [rizk2010gassst]) and thus avoids a large fraction of the expensive verification computations required by sequence alignment algorithms.
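
A minimal sketch of this filter follows; it applies the reasoning above directly (each edit can change the count of a given base by at most one, so a per-base count gap larger than E rules out similarity). The function name is illustrative.

```python
from collections import Counter

def base_count_pass(read: str, segment: str, e: int) -> bool:
    """False means the base-composition gap alone implies more than e edits."""
    r, s = Counter(read), Counter(segment)
    return all(abs(r[base] - s[base]) <= e for base in "ACGT")
```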

q-gram Filtering Approach

The q-gram filtering approach considers all of a sequence's possible overlapping substrings of length q (known as q-grams). Given a sequence of length n, there are n − q + 1 overlapping q-grams, obtained by sliding a window of length q over the sequence. A single difference in one of the sequences can affect at most q overlapping q-grams. Thus, E differences can affect no more than q × E q-grams, where E is the edit distance threshold. The minimum number of shared q-grams between two similar sequences is therefore (n − q + 1) − (q × E). This filtering approach requires very simple operations (e.g., sums and comparisons), which makes it attractive for hardware acceleration, such as in GRIM-Filter [kim2018grim]. GRIM-Filter exploits the high memory bandwidth and computation capability in the logic layer of 3D-stacked memory to accelerate q-gram filtering in the DRAM chip itself, using a new representation of the reference genome that is friendly to in-memory processing. q-gram filtering is generally robust only for a small number of edits, as multiple edits that fall within the same q-gram are significantly underestimated (e.g., counted as a single edit).
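
The sketch below applies the q-gram lemma stated above in its simplest software form, counting shared q-grams (with multiplicity) and comparing against the (n − q + 1) − (q × E) lower bound; parameter names are illustrative.

```python
from collections import Counter

def qgram_pass(read: str, segment: str, q: int, e: int) -> bool:
    """False means the pair cannot be within edit distance e (q-gram lemma)."""
    def grams(s):
        return Counter(s[i:i + q] for i in range(len(s) - q + 1))
    g_read, g_seg = grams(read), grams(segment)
    shared = sum(min(count, g_seg[gram]) for gram, count in g_read.items())
    return shared >= (len(read) - q + 1) - q * e
```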

Sparse Dynamic Programming

Sparse DP algorithms exploit the exact matches (seeds) shared between a read and a reference segment to reduce execution time. These algorithms exclude the corresponding locations of these seeds from estimating the number of edits between the two sequences, as they were already detected as exact matches during indexing. Sparse DP filtering techniques apply DP-based alignment algorithms only between every two non-overlapping seeds to quickly estimate the total number of edits. This approach is also known as chaining, and is used in minimap2 [li2018minimap2].
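
A minimal chaining sketch is shown below: given exact-match anchors (read position, reference position, length) found during seeding, it scores only the anchors rather than every DP-matrix entry. The scoring used here (sum of chained anchor lengths, no gap cost) is a deliberate simplification of the gap-aware chaining score that minimap2 actually optimizes.

```python
def best_chain_score(anchors: list) -> int:
    """anchors: (read_pos, ref_pos, length) exact matches; O(n^2) sparse DP."""
    anchors = sorted(anchors)                     # by read, then ref position
    score = [length for _, _, length in anchors]  # best chain ending at anchor j
    for j, (rj, gj, lj) in enumerate(anchors):
        for i in range(j):
            ri, gi, li = anchors[i]
            if ri + li <= rj and gi + li <= gj:   # anchor i strictly precedes j
                score[j] = max(score[j], score[i] + lj)
    return max(score, default=0)
```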

Accelerating Sequence Alignment

After filtering out most of the mapping locations that lead to dissimilar sequence pairs, read mapping calculates the sequence alignment information for every read and reference segment extracted at each mapping location. Sequence alignment calculation is typically accelerated using one of two approaches: (1) accelerating the DP-based algorithms using hardware accelerators without altering algorithmic behavior, and (2) developing heuristics that sacrifice the optimality of the alignment score solution in order to reduce alignment time. With the first approach, it is challenging to rapidly calculate sequence alignment of long reads with high parallelism. As long reads have high sequencing error rates (up to 20% of the read length), the edit distance threshold for long reads is typically higher than that for short reads, which results in calculating more entries in the DP matrix than for short reads. The use of heuristics (i.e., the second approach) helps to reduce the number of calculated entries in the DP matrix and hence allows both the execution time and memory footprint to grow only linearly with read length (as opposed to quadratically with classical DP). Next, we describe the two approaches in detail.

Accurate Alignment Accelerators

From a hardware perspective, sequence alignment acceleration has five directions: (1) using SIMD-capable CPUs, (2) using multicore CPUs and GPUs, (3) using FPGAs, (4) using ASICs, and (5) using processing-in-memory architectures. Traditional DP-based algorithms are typically accelerated by computing only the necessary regions (i.e., diagonal vectors) of the DP matrix rather than the entire matrix, as proposed in Ukkonen's banded algorithm. This reduces the search space of the DP-based algorithm and reduces computation time. The number of diagonal bands required for computing the DP matrix is 2E+1, where E is the edit distance threshold. For example, the number of entries in the banded DP matrix for a 2 Mb long read can be 1.2 trillion. Parasail [daily2016parasail] and KSW2 (used in minimap2 [li2018minimap2]) exploit both Ukkonen's banded algorithm and SIMD-capable CPUs to compute banded alignment for a sequence pair with a configurable scoring function. SIMD instructions offer significant parallelism to the matrix computation by executing the same vector operation on multiple operands at once. KSW2 is nearly as fast as Parasail when KSW2 does not use heuristics (explained in the next subsection).
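
The sketch below shows the banded idea for the unit-cost (edit distance) case: only cells within E of the main diagonal, i.e., a band of 2E+1 diagonals, are computed, and the pair is rejected when the distance exceeds E. This is a simplified rendering of Ukkonen's banded algorithm, not the vectorized kernels of Parasail or KSW2.

```python
def banded_edit_distance(a: str, b: str, e: int):
    """Edit distance computed only inside a band of 2e+1 diagonals.
    Returns None if the distance exceeds e."""
    INF = float("inf")
    n, m = len(a), len(b)
    if abs(n - m) > e:
        return None                                  # band cannot reach the corner
    prev = [j if j <= e else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [INF] * (m + 1)
        if i <= e:
            curr[0] = i
        for j in range(max(1, i - e), min(m, i + e) + 1):
            curr[j] = min(prev[j - 1] + (a[i - 1] != b[j - 1]),  # (mis)match
                          prev[j] + 1,                           # deletion
                          curr[j - 1] + 1)                       # insertion
        prev = curr
    return prev[m] if prev[m] <= e else None
```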

The multicore architecture of CPUs and GPUs provides the ability to compute alignments of many independent sequence pairs concurrently. GASAL2 [ahmed2019gasal2] exploits the multicore architecture of both CPUs and GPUs for highly-parallel computation of sequence alignment with a user-defined scoring function. Unlike other GPU-accelerated tools, GASAL2 transfers the bases to the GPU without encoding them into binary format, and hides the data transfer time by overlapping GPU and CPU execution. GASAL2 is significantly faster than Parasail (even when Parasail is executed with 56 CPU threads). BWA-MEM2 [vasimuddin2019efficient] accelerates the banded sequence alignment of its predecessor (BWA-MEM) by leveraging multicore and SIMD parallelism. However, to achieve this acceleration, BWA-MEM2 builds an index structure that is 6× larger than that of minimap2.

Other designs, such as FPGASW [fei2018fpgasw], exploit the very large number of hardware execution units in FPGAs to form a linear systolic array. Each execution unit in the systolic array is responsible for computing the value of a single entry of the DP matrix. The systolic array computes a single vector of the matrix at a time. The data dependency between the entries restricts the systolic array to computing the vectors sequentially (e.g., top-to-bottom, left-to-right, or in an anti-diagonal manner). FPGASW has a similar execution time to its GPU implementation, but is more power-efficient.

Specialized hardware accelerators (i.e., ASIC designs) provide application-specific, power- and area-efficient solutions to accelerate sequence alignment. For example, GenAx [fujiki2018genax] is composed of SillaX, a sequence alignment accelerator, and a second accelerator for finding seeds. SillaX supports both a configurable scoring function and traceback operations. SillaX is more efficient for short reads than for long reads, as it consists of an automata processor whose performance scales quadratically with the edit distance. GenAx is 31.7× faster than the predecessor of BWA-MEM2 (i.e., BWA-MEM) for short reads.

Recent processing-in-memory architectures such as RAPID [gupta2019rapid] exploit the ability to perform computation inside or near the memory chip to enable efficient sequence alignment. RAPID modifies the DP-based alignment algorithm to make it friendly to in-memory parallel computation by calculating two DP matrices: one for substitutions and exact matches, and another for insertions and deletions. RAPID claims that this approach efficiently enables higher levels of parallelism compared to traditional DP algorithms. The two main benefits of RAPID and such PIM-based architectures are higher performance and higher energy efficiency [mutlu2019processing, ghose2019processing], as they alleviate the need to transfer data between the main memory and the CPU cores over slow and energy-hungry buses, while providing a high degree of parallelism. On average, RAPID is faster and more power-efficient than a 384-GPU cluster running CUDAlign [de2016cudalign], a GPU implementation of sequence alignment.

Heuristic-Based Alignment Accelerators

The second direction is to limit the functionality of the alignment algorithm or sacrifice the optimality of the alignment solution in order to reduce execution time. The use of restrictive functionality and heuristics limits the possible applications of the algorithms that take this direction. Examples of limiting functionality include restricting the scoring function, or accelerating only the computation of the DP matrix while omitting the backtracking step [chen2014accelerating]. There are several existing algorithms and corresponding hardware accelerators that limit scoring function flexibility. Levenshtein distance and Myers's bit-vector algorithm are examples of algorithms whose scoring functions are fixed, such that they penalize all types of edits equally when calculating the total alignment score. Restrictive scoring functions reduce the total execution time of the alignment algorithm and reduce the bit-width requirement of the register that accommodates the value of each entry in the DP matrix. ASAP [banerjee2019asap] uses FPGAs to accelerate Levenshtein distance calculation substantially compared to its CPU implementation. The use of a fixed scoring function, as in Edlib [vsovsic2017edlib], which is the state-of-the-art implementation of Myers's bit-vector algorithm, helps to outperform Parasail (which uses a flexible scoring function) by 12–1000×. One downside of fixed-function scoring is that it may lead to the selection of a suboptimal sequence alignment.

There are other algorithms and hardware architectures that provide low alignment time by trading off accuracy. Darwin [turakhia2018darwin] builds a customized hardware architecture to speed up the alignment process by dividing the DP matrix into overlapping submatrices and processing each submatrix independently using systolic arrays. Darwin provides a speedup of three orders of magnitude compared to Edlib [vsovsic2017edlib]. Dividing the DP matrix (known as the Four-Russians Method) enables significant parallelism during DP matrix computation, but it leads to suboptimal alignment calculation [rizk2010gassst]. Darwin claims that choosing a large submatrix size and ensuring sufficient overlap (128 entries) between adjacent submatrices may provide optimal alignment calculation for some datasets.

There are other proposals that limit the number of calculated entries of the DP matrix based on one of two approaches: (1) using sparse DP or (2) using a greedy approach to maintain a high alignment score. Both approaches can produce suboptimal alignments [slater2005automated, zhang2000greedy]. The first approach uses the same sparse DP algorithm used for pre-alignment filtering, but as an alignment step, as done in the exonerate tool [slater2005automated]. The second approach is employed in X-drop [zhang2000greedy], which (1) avoids calculating entries (and their neighbors) whose alignment scores are more than X below the highest score seen so far (where X is a user-specified parameter), and (2) stops early when a high alignment score is not possible. The X-drop algorithm is guaranteed to find the optimal alignment between relatively-similar sequences for only some scoring functions [zhang2000greedy]. A similar algorithm (known as Z-drop) makes KSW2 significantly faster than Parasail.
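
The sketch below illustrates the X-drop idea in its simplest, ungapped form (the actual algorithm in [zhang2000greedy] applies the same cutoff while also handling gaps): extension stops as soon as the running score falls more than X below the best score seen so far. The scoring values are illustrative.

```python
def xdrop_extend(a: str, b: str, x: int, match: int = 1, mismatch: int = -1):
    """Greedy ungapped extension with the X-drop cutoff.
    Returns (best score, extension length achieving it)."""
    score = best = best_len = 0
    for i in range(min(len(a), len(b))):
        score += match if a[i] == b[i] else mismatch
        if score > best:
            best, best_len = score, i + 1
        elif best - score > x:       # X-drop: this extension cannot recover
            break
    return best, best_len
```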

Discussion and Future Opportunities

Despite more than two decades of attempts, bridging the performance gap between sequencing machines and read mapping is still challenging. We summarize four main challenges below.

First, we need to accelerate the entire read mapping process rather than its individual steps. Accelerating only a single step of read mapping limits the overall achieved speedup according to Amdahl's Law. Illumina and NVIDIA have recently started following a more holistic approach, and they claim to accelerate genome analysis significantly, mainly by using specialization and hardware/software co-design. Illumina has built an FPGA-based platform, called DRAGEN (https://www.illumina.com/products/by-type/informatics-products/dragen-bio-it-platform.html), that accelerates all steps of genome analysis, including read mapping and variant calling. DRAGEN reduces the overall analysis time from 32 CPU hours to only 37 minutes [goyal2017ultra]. NVIDIA has built Parabricks, a software suite accelerated using the company's latest GPUs. Parabricks (https://developer.nvidia.com/clara-parabricks) can analyze whole human genomes at 30× coverage in about 45 minutes.

Second, we need to reduce the high amount of data movement that takes place during genome analysis. Moving data (1) between compute units and main memory, (2) between multiple hardware accelerators, and (3) between the sequencing machine and the computer performing the analysis incurs high costs in terms of execution time and energy. These costs are a significant barrier to enabling efficient analysis that can keep up with sequencing technologies, and some recent works try to tackle this problem [kim2018grim, mutlu2019processing, ghose2019processing]. GenASM [senolcali.micro2020] is a framework that uses bitvector-based ASM to accelerate multiple steps of the genome analysis pipeline, and is designed to be implemented inside 3D-stacked memory. Through a combination of hardware–software co-design to unlock parallelism and processing-in-memory to reduce data movement, GenASM can perform (1) pre-alignment filtering for short reads, (2) sequence alignment for both short and long reads, and (3) whole genome alignment, among other use cases. For short/long read alignment, GenASM achieves a 111×/116× speedup over state-of-the-art software read mappers while reducing power consumption by 33×/37×. DRAGEN reduces data movement between the sequencing machine and the computer performing analysis by adding specialized hardware support inside the sequencing machine for data compression. However, this still requires movement of compressed data. Performing read mapping inside the sequencing machine itself can significantly improve efficiency by eliminating sequencer-to-computer movement, and embedding a single specialized chip for read mapping within a portable sequencing device can potentially enable new applications of genome sequencing (e.g., rapid surveillance of new diseases such as COVID-19, near-patient testing, bringing precision medicine to remote locations). Unfortunately, efforts in this direction remain very limited.

Third, we need to develop flexible hardware architectures that do not conservatively limit the range of supported parameter values at design time. Commonly-used read mappers (e.g., minimap2) have different input parameters, each of which has a wide range of input values. For example, the edit distance threshold is typically user-defined and can be very high (15-20% of the read length) for recent long reads. A configurable scoring function is another example, as it determines the number of bits needed to store each entry of the DP matrix. Other design-time restrictions also exist (e.g., DRAGEN imposes a restriction on the maximum frequency of seed occurrence). Due to rapid changes in sequencing technologies (e.g., higher sequencing error rates and longer read lengths), these design restrictions can quickly make specialized hardware obsolete. Thus, read mappers need to adapt their algorithms and their hardware architectures to be modular and scalable so that they can be implemented for any sequence length and edit distance threshold based on the sequencing technology.

Fourth, we need to adapt existing genomic data formats for hardware accelerators or develop more efficient file formats. Most sequencing data is stored in the FASTQ/FASTA format, where each base takes a single byte (8 bits) of memory. This encoding is inefficient, as only 2 bits (3 bits when the ambiguous base, N, is included) are needed to encode each DNA base. The sequencing machine converts sequenced bases into FASTQ/FASTA format, and hardware accelerators then convert the file contents into their own (accelerator-specific) compact binary representations for efficient processing. This process, which requires multiple format conversions, wastes time. For example, only 43% of the sequence alignment time in BWA-MEM2 [vasimuddin2019efficient] is spent on calculating the DP matrix, while 33% of the sequence alignment time is spent on pre-processing the input sequences for loading into SIMD registers [vasimuddin2019efficient]. To address this inefficiency, we need to widely adopt efficient hardware-friendly formats, such as UCSC's .2bit format (https://genome.ucsc.edu/goldenPath/help/twoBit), to maximize the benefits of hardware accelerators and reduce resource utilization. We are not aware of any recent read mapper that uses such formats.
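
The sketch below shows the basic idea behind such formats: packing four bases per byte instead of one. The base-to-code mapping here is illustrative (the actual .2bit format uses its own encoding and tracks N bases in separate masks).

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # illustrative 2-bit encoding

def pack(seq: str) -> bytes:
    """Pack a DNA string into 2 bits per base, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | CODE[base]
        out.append(byte)                  # a final partial byte uses the low bits
    return bytes(out)

assert len(pack("ACGTACGT")) == 2         # 8 bases -> 2 bytes instead of 8
```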

The acceleration efforts we highlight in this work represent state-of-the-art efforts to reduce current bottlenecks in the genome analysis pipeline. We hope that these efforts and the challenges we discuss provide a foundation for future work in accelerating read mappers and developing other genome sequence analysis tools.

Acknowledgments

This work is supported by funding from Intel, the Semiconductor Research Corporation, and VMware to Onur Mutlu.

References