A Non-volatile Near-Memory Read Mapping Accelerator

09/07/2017 ∙ by S. Karen Khatamifard, et al. ∙ University of Southern California

DNA sequencing entails the process of determining the precise physical order of the four bases (Adenine, Guanine, Cytosine, Thymine) in a DNA strand. As semiconductor technology revolutionized computing, DNA sequencing technology, often termed Next Generation Sequencing (NGS), revolutionized genomic research. Modern NGS platforms can sequence millions of short DNA fragments in parallel. The resulting short DNA sequences are termed (short) reads. Mapping each read to a reference genome of the same species (itself a full-fledged assembly of already sequenced reads), termed sequence mapping, is an emerging application. Sequence mapping enables detailed study of genetic variations, and thereby catalyzes personalized health care solutions. Due to the large scale of the problem, well-studied pair-wise sequence similarity detection (or sequence alignment) algorithms fall short of efficiently mapping individual reads to the reference genome. Mapping represents a search-heavy, data-intensive operation and barely features any complex floating point arithmetic. Therefore, sequence mapping can greatly benefit from in- or near-memory search and processing. Fast parallel associative search enabled by Ternary Content Addressable Memory (TCAM) can particularly help; however, CMOS-based TCAM implementations cannot accommodate the large memory footprint in an area- and energy-efficient manner, which is where non-volatile TCAM comes to the rescue. Still, brute-force TCAM search over a search space as large as sequence mapping demands consumes unacceptably high energy. This paper provides an effective solution to the energy problem to tap the potential of non-volatile TCAM for high-throughput, energy-efficient sequence mapping: BioMAP. BioMAP can improve the throughput of sequence mapping by 7.5x, and the energy consumption by 109.0x, when compared to a highly-optimized software implementation for modern GPUs.


1 Introduction

DNA sequencing is the physical or biochemical process of extracting the order of the four bases (Adenine, Guanine, Cytosine, Thymine) in a DNA strand. As semiconductor technology revolutionized computing, DNA sequencing technology, termed High-throughput Sequencing or Next Generation Sequencing (NGS), revolutionized genomic research. Modern NGS platforms can sequence hundreds of millions of short DNA fragments in parallel. The resulting (short) DNA fragments, referred to as (short) reads, typically contain 100-200 bases [1]. A common critical first step spanning a rich and diverse set of emerging bioinformatics applications is read mapping: mapping each output read from an NGS machine to a reference genome of the same species (which itself represents a full-fledged assembly of already processed reads).

As a representative example, modern NGS machines from Illumina [1], a prominent NGS platform producer, can sequence more than 600 Giga-bases (which translates into hundreds of millions of output reads) per run, roughly 200× the length of a human genome of approximately 3 Giga-bases. Fig. 1 depicts the scaling trend for DNA sequencing in terms of the total number of human genomes sequenced. The values until 2015 reflect historical publication records, with milestones explicitly marked. The values beyond 2015 reflect three different projections: the first follows the historical growth until 2015; the second, a more conservative prediction from Illumina; the third, Moore's Law. Historically, the total quantity of sequenced data has been doubling approximately every 7 months. Even under the more conservative projections from Fig. 1, the rapid growth of sequenced data challenges the throughput performance of read mapping.

Figure 1: Scaling trend for DNA sequencing [2].

The importance of read mapping motivated the development of several algorithms. The increasing scale of the problem per Fig. 1, however, renders well-studied pair-wise sequence similarity detection or alignment algorithms inefficient [3]. Worse, reads are subject to noise due to imperfections in NGS platforms and genome variations, which adds to the complexity of achieving high throughput. Both algorithmic solutions and hardware acceleration via GPUs [4] or FPGAs [5] therefore have to trade mapping accuracy for throughput performance to varying degrees; this is possible because read mapping is by definition a similarity (rather than exact) matching problem. As optimizations are usually confined to compute-intensive stages of mapping, in light of the scaling projections from Fig. 1, most of these solutions are fundamentally limited by data transfer overheads. In this paper, we take a data-centric position to guide the design (and explore the design space) of scalable, energy-efficient, high-throughput read mapping. Specifically, instead of optimizing an algorithm developed for general-purpose computers or GPUs, we rethink the algorithm from the ground up along with the accelerator design.

Read mapping represents a search-heavy, memory-intensive operation and barely requires complex floating point arithmetic; it can therefore greatly benefit from in- or near-memory search and processing. Fast parallel associative search enabled by Ternary Content Addressable Memory (TCAM) can particularly help. Unfortunately, CMOS-based TCAM cannot accommodate the large memory footprint in an area- and energy-efficient manner, which is where non-volatile TCAM comes to the rescue [6]. Still, brute-force TCAM search over a search space as large as read mapping demands induces excessive energy consumption, rendering (non-volatile) TCAM-based acceleration infeasible. This paper provides an effective solution to tap the potential of non-volatile TCAM for scalable, energy-efficient, high-throughput read mapping, BioMAP, which

  • introduces a novel similarity matching mechanism to trade mapping accuracy for throughput and energy efficiency in a much more scalable manner than existing solutions;

  • features a novel data representation to facilitate efficient similarity matching without compromising storage complexity;

  • tailors common search space pruning approaches to its novel similarity matching mechanism in order to identify and discard unnecessary memory accesses, and thereby, to prevent excessive energy consumption;

  • employs multi-phase hierarchical mapping to maximize mapping accuracy, where each phase acts as a filtering layer for the subsequent phase which in turn performs more sophisticated mapping (considering only the subset of reads the previous phase fails to map);

  • is expandable to similar search-intensive problems from other application domains beyond bioinformatics.

In the following, we provide a proof-of-concept BioMAP implementation and explore the design space: Section 2 discusses the basics of BioMAP; Section 3 covers practical considerations; Sections 4 and 5 provide the evaluation; Section 6 compares and contrasts BioMAP to related work; and Section 7 concludes the paper.

2 BioMAP: Macroscopic View

Short (i.e., 100-200 base long) reads from modern Illumina NGS platforms [1] constitute more than 90% of all reads in the world currently. Accordingly, BioMAP is designed for short read mapping.

Terminology: Without loss of generality, we will refer to each read simply as a query; and the reference genome, as the reference. Each query and the reference represent strings of characters from the alphabet {A, G, C, T} which stand for the bases {Adenine, Guanine, Cytosine, Thymine}. The inputs to BioMAP are a dataset of querys and the reference, where the reference is many orders of magnitude longer than each query. For example, if the reference is the human genome, the reference length is approximately 3 Giga-bases. On the other hand, technological capabilities of modern NGS platforms limit the maximum query length.

2.1 Problem Definition: Read Mapping

Basics: Read mapping entails finding the most similar portions of a given reference to each query from a dataset corresponding to the same species, as output by an NGS machine. Fig. 2 demonstrates an example, with different portions from the same reference on top, and two sample querys to be mapped at the bottom. The first query results in one base mismatch, and the second in five, when aligned to their respective positions in the reference. The query length is not representative, but simplifies demonstration.

Figure 2: Read mapping example.

Input querys are subject to noise due to imperfections in NGS platforms and potential genome variations. Therefore, read mapping by definition seeks similarity rather than an exact match between the input querys and the reference. Accordingly, for each input query, BioMAP tries to locate the sub-sequence of the reference most similar to the query, and returns the range of its indices.

Mapping Reverse Complement of Reads: Depending on the DNA strand sequenced, the sequencing platforms may output the reverse complement of a read. The reverse complement of a sequence is formed by reversing the order of its letters and interchanging A with T, and C with G. For example, the reverse complement of the sequence ACCGCCTA is TAGGCGGT. Since NGS platforms typically sequence almost half of the DNA strands in reverse order, BioMAP is designed to handle reverse complements of reads (i.e., querys), as well.

Mismatches Induced by NGS Platform: NGS platform induced mismatches are called read errors and come in three types: An extra base may get inserted randomly into the query, leading to an insertion error. Or, a base may get randomly deleted from the query, leading to a deletion error. These two types of errors are often referred to as indels. Finally, random substitution of bases within the query leads to a substitution error. According to Illumina data-sheets, platform-induced indels in short reads are negligible [7]. However, per-base substitution error probability can be as high as 1.0%, depending on the sequencing machine. Therefore, BioMAP is designed to perform mapping in the presence of a notable number of platform-induced substitution errors.

Mismatches Induced by Genome Variations: In general, the sequenced genome is slightly different from the reference genome. For example, any two human genomes are more than 99% identical, differing in less than 1% of their bases. Such differences lead to mismatches between the reads and the reference genome, since the reads come from the sequenced genome but are mapped to the reference genome. This type of mismatch can take different forms, as well, such as substitution, insertion, deletion, duplication, or inversion. The most common variations are single substitutions (i.e., single-nucleotide polymorphisms, SNPs) and short indels. Other types of variations are much more challenging to detect. In fact, to date there is no widely-accepted algorithm for identifying long indels, duplications, or inversions [8, 9].

While genome variations are rare, it is important to capture them, since this information is usually what we are after. Any short read mapping algorithm capable of handling platform-induced mismatches (i.e., substitution errors) can detect SNPs, too. However, the other common type of mutation-induced variation, short indels, should also be covered. Accordingly, BioMAP is designed to consider all of these.

Throughout the paper, we use the term mismatch to refer to both the mismatches induced by the NGS platform and the ones induced by genome variations. These mismatches can be of substitution type or indel type. As discussed, the substitution type is the dominant one.

To summarize, read mapping is a search-heavy similarity matching operation and can greatly benefit from parallel in- or near-memory associative search enabled by Ternary Content Addressable Memory (TCAM). We will next discuss the feasibility of TCAM-based acceleration for read mapping.

Figure 3: Structural organization of BioMAP.

2.2 Why Naive (Non-volatile) TCAM-based Acceleration Does Not Work

Facilitating fast parallel in-memory pattern matching, TCAM-based acceleration is particularly suitable for read mapping. TCAM is a special variant of associative memory (which permits data retrieval by indexing by content rather than by address) that can store and search the "don't care" state X in addition to a logic 0 or 1. Moreover, for memory arrays as large as read mapping demands, non-volatile TCAM can overcome the area and energy inefficiencies of CMOS-based TCAM [6].

We will next look into the energy consumption of read mapping, comparing a non-volatile TCAM-based implementation with a highly optimized GPU-based solution deploying one of the fastest known algorithms to date [10]. The non-volatile TCAM mimics the least energy-hungry implementation from Guo et al. [6], corresponding to an array size of 1K×1K bits = 1 Mbit. For this design point, searching for a pattern of length 1 Kbit (which represents the maximum possible length, i.e., the row length) in the entire array takes approximately 2.5 ns and consumes 245 nJ.

If we simply encode each base from the alphabet {A, G, C, T} using 2 bits, and if a human genome of approximately 3 Giga-bases (= 6 Gbits) represents the reference, the reference can fit into 6K TCAM arrays (of size 1K×1K bits = 1 Mbit each). For each query of a typical length of 100 bases [1], i.e., 200 bits, the following naive procedure can cover the entire search space: By construction, each 1K×1K-bit TCAM array can search for at most one 1-Kbit pattern at a time, which resides in a query register. We can align the most significant bit of the (200-bit-long) query with the most significant bit position of the TCAM's 1-Kbit query register, and pad the remaining (1K-200) bits with Xs, for the very first search in the array. We can then repeat the search by shifting the contents of the TCAM's query register (i.e., the padded query) to the right by one bit at a time, leaving the unused more significant bit positions with Xs, until the least significant bit of the query reaches the least significant bit position in the query register. The total number of these bit-wise shifts (and hence, searches) is on the order of the row length, 1K. Putting it all together, mapping a given query to the reference in this case would take around 1K searches in each of the 6K arrays, with 245 nJ consumed per search. The overall energy consumption therefore would become 6K × 1K × 245 nJ ≈ 1500 mJ.

The GPU solution from Luo et al. [10], on the other hand, can process 133.3K querys per second on an Nvidia K40. Hence, it takes 1/133.3K seconds to map a single query. Even under the unrealistic assumption (favoring TCAM) that the entire peak average power (TDP) goes to mapping a single query to the reference, the energy consumption would become at most 235 W × (1/133.3K) s ≈ 1.8 mJ.

The GPU and TCAM designs feature similar technology nodes. Even under the assumptions favoring TCAM, the TCAM-based naive implementation consumes 3 orders of magnitude more energy than the GPU-based one. This remarkable energy difference renders a direct adaptation of non-volatile TCAM-based search infeasible. The difference stems from the gap in the size of the search spaces. While the TCAM-based design considers the entire search space to cover all possible alignments, the GPU-based design first prunes the search space to eliminate infeasible alignments, which in turn leads to orders of magnitude fewer (search) operations. BioMAP adopts a similar pruning strategy while deploying non-volatile TCAM arrays to enable even more energy-efficient search.
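The back-of-envelope comparison above is easy to reproduce; the short Python sketch below uses only the constants quoted in the text (it is an illustrative estimate, not a model of either platform):

```python
# Back-of-envelope comparison from Section 2.2; all constants are taken from
# the text above (a rough sanity check, not a model of either platform).
ARRAYS             = 6 * 1024   # 6K TCAM arrays of 1K x 1K bits hold the 6-Gbit reference
SEARCHES_PER_ARRAY = 1024       # ~row-length bit-wise shifts per array
ENERGY_PER_SEARCH  = 245e-9     # J per 1K x 1K array search (Guo et al. [6])

tcam_energy = ARRAYS * SEARCHES_PER_ARRAY * ENERGY_PER_SEARCH
print(f"naive TCAM energy per query ~ {tcam_energy * 1e3:.0f} mJ")    # ~1540 mJ

GPU_TDP        = 235.0          # W, Nvidia Tesla K40
GPU_THROUGHPUT = 133.3e3        # querys per second (SOAP3-dp [10])
gpu_energy     = GPU_TDP / GPU_THROUGHPUT
print(f"GPU energy per query   <= {gpu_energy * 1e3:.1f} mJ")         # ~1.8 mJ

print(f"gap: {tcam_energy / gpu_energy:.0f}x")                        # ~3 orders of magnitude
```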

2.3 BioMAP Basics

For BioMAP we rethink the algorithm from the ground up along with the hardware design, instead of trying to map an algorithm developed for general-purpose computers or GPUs directly to hardware. The result is a near-memory read mapping accelerator deploying non-volatile TCAM arrays designed for scalable, energy-efficient high-throughput similarity search. The key features of BioMAP are

  • a novel TCAM similarity matching mechanism to trade mapping accuracy for throughput and energy efficiency without compromising scalability;

  • a novel data representation to facilitate efficient similarity matching without compromising storage complexity, and

  • multi-phase hierarchical mapping to maximize mapping accuracy, where each phase acts as a filtering layer for the subsequent phase which in turn performs more sophisticated mapping (considering only the subset of reads the previous phase fails to map).

In order to identify and discard unnecessary memory accesses, and thereby, to prevent excessive energy consumption during TCAM search, BioMAP tailors well-known search space pruning (i.e., filtering) techniques to its novel similarity matching mechanism. While TCAMs capable of similarity matching have been explored before, Section 6 reveals why such schemes are not applicable considering the scale of read mapping.

BioMAP Organization: Fig. 3 provides the structural organization for BioMAP. BioMAP pipeline comprises two major units: Filter Unit (FilterU) and Match Unit (MatchU). Each query from the dataset to be mapped streams into the (first stage of the) BioMAP pipeline (i.e., FilterU) over the input queue. Once the mapping completes, the outcome streams out of the (last stage of the) BioMAP pipeline (i.e., MatchU) over the output queue. Non-volatile TCAM arrays (which feature BioMAP’s novel similarity matching mechanism) within MatchU keep the entire reference.

Input and output queues handle the communication to the outside world, by retrieving querys to be processed on the input end, and upon completion of the mapping, by providing the indices of the most similar sub-sequences of the reference to each query, on the output end.

Figure 4: Filter Unit (FilterU)

FilterU filters (indices of) sub-sequences of the reference which are more likely to result in a match to the incoming query, by examining sub-sequences of the incoming query itself. We call these indices potentially matching indices, PMI. FilterU feeds MatchU with a stream of ⟨PMI, query⟩ tuples over the search queue. MatchU in turn conducts the search by only considering the PMIs of the reference. In this manner, BioMAP prunes the search space. MatchU's non-volatile TCAM arrays incorporate BioMAP's novel similarity matching mechanism, and are optimized for similarity search as opposed to exact search.

The input queue feeds the BioMAP pipeline with the querys to be mapped to the reference. The query dataset resides in memory. BioMAP initiates the streaming of the input querys into the memory-mapped input queue over a Direct Memory Access (DMA) request. The input queue in turn sends the querys to FilterU for search space pruning before the search takes place. Finally, for each input query, once the mapping completes, the output queue collects from MatchU the indices of the sub-sequence of the reference featuring the most similar match to the query. The output queue is memory-mapped, as well, and BioMAP writes back these indices to a dedicated memory location, over DMA.

In the following, we will detail the steps for query processing in each unit in case of a match. If no sub-sequence of the reference matches the input query (i.e., a mismatch is the case), no mapping takes place, and BioMAP triggers the update of a dedicated flag at the memory address to hold the result. BioMAP can detect a mismatch during processing at FilterU or MatchU.

Filter Unit (FilterU): Fig. 4 provides the structural organization of FilterU, which serves to compact the search space for each query to be mapped to the reference, as follows: We will refer to each sub-sequence of length seed as a prefix, where seed represents a design parameter and assumes a much lower value than the query length. As each prefix is a string of characters from the 4-character alphabet {A, G, C, T}, a prefix of length seed can take 4^seed different forms. FilterU relies on a pre-processing step which entails searching for each prefix of length seed in the reference, and in case of a match, recording the TCAM array, column, and row number of the corresponding occurrence. The Potential Match Index Table (PMIT) keeps this information. However, as the same prefix may occur multiple times along the reference string, PMIT may contain multiple entries for the very same prefix. Therefore, FilterU has a smaller table called the PMIT Index Locator (PMITIL) for bookkeeping. PMITIL has 4^seed entries, one for each possible value of the prefix. Each PMITIL entry keeps the range (i.e., start and end) of addresses in PMIT where the matches for the corresponding prefix reside (BioMAP arranges PMIT to store multiple entries corresponding to the same prefix at consecutive addresses). PMIT and PMITIL generation constitutes a pre-processing step which BioMAP needs to perform only once for each reference, before read mapping starts.

Upon receipt of a new query from the head of the input queue, FilterU uses the first seed bases of the query as the prefix to consult PMITIL, and subsequently, PMIT. FilterU keeps the query being processed in the FilterU Query Register (FilterUQR) while filtering is in progress. If there is a match, FilterU first broadcasts the query being processed to all TCAM arrays. Then, it sends the corresponding TCAM array, column, and row numbers (i.e., the PMIs) to MatchU, over the search queue. We will refer to these three values simply as the target array, column, and row coordinates.
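The pre-processing that builds PMIT/PMITIL and the per-query lookup FilterU performs can be sketched as a plain k-mer index. This is a functional illustration only: the function names and the row-geometry constants below are ours, and the real tables store packed TCAM array/column/row coordinates in DRAM rather than Python lists.

```python
from collections import defaultdict

SEED = 15                  # design parameter (Section 3.1)
ROW_BASES = 1024 // 3      # bases per 1K-bit TCAM row under the 3-bit encoding (illustrative)
ROWS_PER_ARRAY = 1024

def build_pmi_tables(reference: str):
    """Pre-processing: for every seed-long prefix occurring in the reference,
    record the TCAM (array, row, column) coordinate where it starts."""
    occurrences = defaultdict(list)                 # prefix -> list of coordinates
    for pos in range(len(reference) - SEED + 1):
        prefix = reference[pos:pos + SEED]
        row_global, col = divmod(pos, ROW_BASES)    # 2-D layout of the reference
        array, row = divmod(row_global, ROWS_PER_ARRAY)
        occurrences[prefix].append((array, row, col))
    # PMIT keeps same-prefix entries consecutive; PMITIL keeps (start, count) per prefix
    pmit, pmitil, start = [], {}, 0
    for prefix, coords in occurrences.items():
        pmitil[prefix] = (start, len(coords))
        pmit.extend(coords)
        start += len(coords)
    return pmit, pmitil

def filter_query(query: str, pmit, pmitil):
    """FilterU step: use the first SEED bases of the query to fetch its PMIs."""
    start, count = pmitil.get(query[:SEED], (0, 0))
    return pmit[start:start + count]                # potentially matching indices (PMIs)
```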

Figure 5: Match Unit (MatchU)
(a) full match
(b) fragmented match
Figure 6: Full (a) and fragmented (b) match in the array.

Match Unit (MatchU): Fig. 5 provides the structural organization of MatchU, which orchestrates the search. MatchU features a Dispatch Unit (DispatchU) and non-volatile TCAM arrays optimized for similarity search. DispatchU acts as a scheduler for TCAM search. For each input query to be mapped to the reference, DispatchU collects the target TCAM array, column, and row coordinates, as extracted from the PMIT in FilterU, to initiate the targeted search. The input query stays in the Query Register (QR) of the TCAM array during TCAM access. The Shift Logic (ShL) in the target TCAM array in turn first aligns the seed-long prefix of the query with the seed-long (matching) sub-sequence of the reference residing in the target row, starting from the target column. To this end, ShL shifts the query bits in QR and inserts Xs accordingly. The Match Unit Controller (MatchCtrl) orchestrates this operation. Once alignment completes, MatchCtrl activates only the target row for search, to improve energy efficiency. Once the search completes, MatchCtrl provides DispatchU with the indices of the reference which demarcate the most similar sub-sequence to the entire query. DispatchU then forwards these indices to the output queue.

Figure 7: Life-cycle of a query in BioMAP.

Fig. 6 depicts two different match scenarios: In Fig. 6(a), the query (shown in dark shade within QR, with white space corresponding to Xs for padding) matches a sub-sequence of the reference which is entirely stored in a single row of the array. We call this scenario a full match. In Fig. 6(b), on the other hand, the query matches a sub-sequence of the reference which is stored in two consecutive rows of the array. We call this scenario a fragmented match. Fragmentation can happen at either end of the query. For example, in Fig. 6(b), the first portion of the query (shown in darker shade) matches the end of row j, while the rest (shown in lighter shade) matches the beginning of the next row, row j+1. MatchCtrl needs to address such fragmentation because BioMAP lays out the character string representing the reference consecutively, in two dimensions, in each array.

Conventional TCAM can only detect a full match; handling a fragmented match requires extra logic. By default, if a full match is not the case, the TCAM array selects the longest sub-sequence l of the reference matching the input query, where l ends at a row boundary. The darker-shade region in Fig. 6(b) corresponds to such an l. As l may be aligned to either the beginning (as in Fig. 6(b)) or the end of the query, MatchCtrl has to additionally check the next or the previous row, respectively, for a match to the unmatched portion of the query. We call the first case a fragmented tail match; the second, a fragmented head match. In case of a fragmented match, search in the TCAM array takes two steps (see the sketch below). As a fragmented match may also happen at TCAM array boundaries, each array's last row keeps the contents of the first row of the next array in sequence.
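The control flow MatchCtrl follows for full versus fragmented matches can be summarized with a small functional sketch. The helper row_matches below is a hypothetical stand-in for one activated-row TCAM search, and applying the tolerance to each fragment independently is a simplification of ours.

```python
ROW_BASES = 341   # bases per 1K-bit row under the 3-bit encoding (illustrative)

def match_at(query, array, row, col, tolerance, row_matches):
    """Two-step search emulating MatchCtrl: try a full match in the target row;
    if the query spills over, check the remainder against the next row
    (the fragmented head case, checking the previous row, is symmetric)."""
    room = ROW_BASES - col                 # bases of the query that fit in this row
    if len(query) <= room:
        # full match: the whole query aligns within a single row
        return row_matches(array, row, query, col, tolerance)
    head, tail = query[:room], query[room:]
    # fragmented tail match: head at the end of `row`, tail at the start of `row + 1`
    return (row_matches(array, row, head, col, tolerance) and
            row_matches(array, row + 1, tail, 0, tolerance))
```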

2.4 Putting It All Together

Fig. 7 summarizes the 6 steps in mapping a query to the reference: First, FilterU retrieves a new query from the head of the input queue (step 1). In this example seed=7 (bases), with the corresponding 7-base prefix of the query underlined. Then, FilterU locates the 7-base prefix ACCCTGA in PMITIL and extracts the corresponding PMIT address(es) (step 2). Next, from the PMIT addresses collected at step 2, FilterU retrieves the TCAM array, column, and row numbers (for targeted search in MatchU) of the sub-sequences of the reference which match the prefix ACCCTGA (step 3). FilterU then sends the query along with these target coordinates to MatchU over the search queue (step 4). At step 5, DispatchU initiates search in the target TCAM array, at the target column and row, and collects the match outcome. At step 6, MatchU sends the match outcome to the output queue.

3 BioMAP: Microscopic View

We continue with a microscopic view of the proof-of-concept BioMAP design to cover implementation details, specifically, how BioMAP performs mapping in the presence of mismatches induced by NGS platform or genome variations covered in Section 2.1.

3.1 Search Space Pruning

In order to prune the search space, BioMAP first locates sub-sequences of the reference matching the seed-long prefix of the query in FilterU (Section 2.3). seed represents a key BioMAP design parameter which dictates not only the storage complexity, but also the degree of search space pruning, which in turn determines BioMAP’s throughput performance and energy efficiency.

PMITIL grows with 4^seed; therefore, the larger the seed, the higher the storage complexity. However, a larger seed is more likely to result in a lower number of prefix matches in the PMI tables, and hence a lower number of targeted searches in MatchU. While the value of the seed remains much less than the expected length of the query, it should be carefully set to best exploit the storage complexity vs. throughput (or energy efficiency) trade-off.

PMIT can have at most as many entries as the total number of seed-long sub-sequences contained within the reference. This practically translates into the length of the reference, as a prefix can start from each base position of the reference onward. Recall that PMIT is organized to keep multiple matches consecutively. Therefore, it suffices to keep per PMITIL entry just the start address in PMIT of the first match, along with the number of matches (as depicted in Fig. 7). PMIT, on the other hand, has to keep the ⟨TCAM array number, column number, row number⟩ tuple for each prefix match. If the reference is the human genome, PMIT would have approximately 3 Giga entries. As we will detail in Section 4.2, 32 bits suffice to store each ⟨TCAM array number, column number, row number⟩ tuple per PMIT entry; and 64 bits, each ⟨PMIT start address, number of matches⟩ (or ⟨PMIT start address, PMIT end address⟩) pair per PMITIL entry.

To improve throughput performance, after generating PMIT, BioMAP shuffles the order of its entries as follows (a sketch of this re-ordering follows the list):

  • PMIT keeps the entries corresponding to the very same prefix always at consecutive addresses, and re-orders such entries further to have all entries pointing to the same TCAM array reside at consecutive addresses. BioMAP processes multiple PMIT matches per prefix in this consecutive order. Under such re-ordering, sending a list of PMIs over the network and performing search in the array happen in a pipelined fashion. This masks network latency and consequently, improves performance significantly.

  • BioMAP re-orders the PMIT entries to have search requests to different TCAM chips interleaved. In other words, BioMAP tries to avoid sending multiple consecutive search requests to the very same TCAM chip to maximize (TCAM) chip-level parallelism.
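A minimal sketch of this two-level re-ordering, under the simplifying assumption (ours) that a prefix group is assigned to the chip holding its first target array:

```python
from collections import defaultdict, deque

def reorder_pmit(entries_per_prefix, arrays_per_chip):
    """entries_per_prefix: dict of prefix -> list of (array, row, col) PMIs.
    Returns a flat PMIT where (1) entries of the same prefix stay consecutive
    and are grouped by target array, and (2) successive prefix groups are
    interleaved across TCAM chips to maximize chip-level parallelism."""
    buckets = defaultdict(deque)
    for prefix, coords in entries_per_prefix.items():
        coords = sorted(coords, key=lambda c: c[0])    # group same-prefix PMIs by array
        chip = coords[0][0] // arrays_per_chip
        buckets[chip].append(coords)
    # round-robin over chips so back-to-back search requests hit different chips
    flat, queues = [], deque(buckets.values())
    while queues:
        chip_queue = queues.popleft()
        flat.extend(chip_queue.popleft())
        if chip_queue:
            queues.append(chip_queue)
    return flat
```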

3.2 Data Representation

Each input query and the reference itself represent character strings over the alphabet {A, G, C, T}. Conventional bioinformatics formats such as FASTA [11] encode each letter from such alphabets of bases by single-letter ASCII codes. However, TCAM arrays conduct the search at bit granularity. Therefore, BioMAP needs to translate base character mismatches to bit mismatches. To this end, BioMAP adopts an encoding which renders the very same number of mismatched bits for a mismatch between any two base characters. This would not be the case, if each base character in {A, G, C, T} is encoded by simply using 2 bits (i.e., some base character mismatches would result in 1-bit, others, in 2-bit differences). BioMAP’s encoding uses 3 bits per base character, where any two 3-bit code-words differ by exactly 2 bits, such as {111, 100, 010, 001}. Thereby BioMAP guarantees that exactly 2 bits would mismatch for any base character mismatch.
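The property the encoding relies on is easy to verify; the assignment of bases to code-words below is an arbitrary illustrative choice.

```python
from itertools import combinations

ENCODING = {"A": 0b111, "G": 0b100, "C": 0b010, "T": 0b001}   # 3-bit code-words from the text

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# any two distinct code-words differ in exactly 2 bits, so every base-character
# mismatch costs exactly 2 bit mismatches in the TCAM row
assert all(hamming(x, y) == 2 for x, y in combinations(ENCODING.values(), 2))

# with a plain 2-bit encoding this uniformity is lost (some mismatches cost 1 bit, others 2)
TWO_BIT = {"A": 0b00, "G": 0b01, "C": 0b10, "T": 0b11}
print(sorted({hamming(x, y) for x, y in combinations(TWO_BIT.values(), 2)}))   # [1, 2]
```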

3.3 Similarity Search

Fig. 8 depicts a representative resistive TCAM cell. The two resistors attached to the first two access transistors carry the data bit value D and its complement D̄, respectively. The high (low) resistance value R_high (R_low) encodes logic 1 (0). The third resistor, attached to the third access transistor, is hardwired to logic 1, hence its resistance remains constant at R_high.

To search for logic 0, the Search Line (SL) is set to 0 and its complement to 1, such that one access transistor turns on and the other turns off; thereby only the resistor carrying D̄ gets connected to the Match Line (ML). If the cell content is 0, i.e., D = 0 and D̄ = 1, there is a match, and R_high gets connected to ML. Otherwise, if the cell content is 1, i.e., D = 1 and D̄ = 0, there is a mismatch, and R_low gets connected to ML. A symmetric discussion holds for searching for logic 1. On a per-TCAM-cell basis, R_high connected to ML indicates a match; R_low, a mismatch. To search for X, both SL and its complement are set to 0, and the Search X Line is set to 1, such that only the hardwired R_high attached to the third transistor is connected to ML. This is how a search for X always renders a match, independent of the value of the stored bit.

Figure 8: Resistive TCAM cell [6].

Each cell within a row contributes to the effective resistance connected to ML, R_ML, by R_high (R_low) on a match (mismatch). The Sense Amplifier (SA) connected to the ML in each row signals a (mis)match for the entire row depending on the value of R_ML. The SA would only signal a match if all cells match, i.e., if each cell contributes to R_ML by R_high; let us call R_ML in this case R_match. The SA would signal a mismatch if at least one cell mismatches, i.e., contributes to R_ML by R_low. The value of R_ML in this case evolves with the number of mismatches, and assumes the value closest to R_match under a single-cell (bit) mismatch.

In a TCAM array based on the cell from Fig. 8, unless all bits within a row match, the SA always signals a mismatch for the entire row. However, due to the mismatches detailed in Section 2.1, a matching query may indeed have a few bases different from the corresponding sub-sequence of the reference. To resolve this discrepancy, BioMAP deploys tunable SAs which associate a wider R_ML range with a row-wide match. These SAs can be tuned to signal a row-wide match when fewer than a given number t of bits mismatch, which translates into fewer than t R_low resistances connected to ML. We will refer to t as the tolerance, which represents an adjustable design parameter. BioMAP handles substitution mismatches by tuning the tolerance.
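The dependence of R_ML on the number of mismatching bits, and hence the threshold a tolerance of t implies, can be illustrated with a simple parallel-resistance model. The nominal resistance values below are assumptions chosen only to reflect the 11.5x R_high-to-R_low ratio reported for PCM in Section 4.3.

```python
R_HIGH   = 1.15e6   # ohm, assumed nominal high (matching) cell resistance
R_LOW    = 1.0e5    # ohm, assumed nominal low (mismatching) cell resistance
ROW_BITS = 1024

def r_ml(mismatching_bits: int) -> float:
    """Effective match-line resistance: all cells of a row in parallel,
    R_HIGH per matching cell and R_LOW per mismatching cell."""
    matching = ROW_BITS - mismatching_bits
    return 1.0 / (matching / R_HIGH + mismatching_bits / R_LOW)

# each base mismatch costs exactly 2 bit mismatches under the 3-bit encoding,
# so a tolerance of t base mismatches means accepting up to 2*t low-resistance cells
for base_mismatches in range(5):
    print(base_mismatches, f"{r_ml(2 * base_mismatches):8.1f} ohm")
# the gap between successive R_ML levels shrinks as mismatches accumulate,
# which is why the sensing threshold must be set in a variation-aware manner
```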

The gap between R_ML levels corresponding to different numbers of mismatching bits decreases as the number of mismatching bits grows, complicating SA design. At the same time, due to PVT (Process, Voltage, Temperature) variations, individual TCAM cell resistance levels may notably deviate from the nominal R_high or R_low, leading to divergence of the R_ML levels from their expected values. In Section 4.4, we will detail how BioMAP tunes the SAs in a variation-aware fashion.

3.4 Hierarchical Multi-Phase Search

Our focus so far has been on the basics of BioMAP's mapping mechanism. We will next look into mapping accuracy, and specifically identify under what circumstances BioMAP would fail to map a given query that is in fact similar enough to the reference. As explained in Section 2.1, the most likely reason is the reverse complement of a query, since NGS platforms typically sequence almost half of the DNA strands in reverse order. The next likely reason is having substitution mismatches in the seed-long prefix used for search space pruning (Section 2.3). A corrupted prefix may lead to ill-addressed search requests, i.e., FilterU sending incorrect PMIs to MatchU. Finally, the least likely reason is a short indel in the query, since TCAM arrays can only handle substitution errors. Other reasons might be the existence of complex repetitive regions or complex genome variations, which are very challenging to detect and are beyond the scope of this work.

BioMAP employs multi-phase hierarchical mapping, where each phase acts as a filtering layer for the next phase which in turn performs more sophisticated mapping, considering only the subset of querys the previous phase failed to map. In other words, a phase re-attempts mapping only for the subset of querys that the previous phase failed to map. If the mapping in a phase fails, MatchU raises the Failed-Map signal. BioMAP in turn feeds Failed-Map back to FilterU, to trigger more sophisticated mapping attempts in the subsequent phase(s), as need be.

To handle failed mappings due to reverse complement of querys, after getting Failed-Map from MatchU (at the end of the first phase), FilterU immediately sends PMIs corresponding to the reverse complement of the query to the MatchU. To accelerate this process, MatchU employs an extra register inside the Shift Logic, which keeps the reverse complement of the query in addition to the original. MatchU copies the reverse complement in this register at the time it gets the original query (during the first phase). Therefore, upon receipt of Failed-Map, FilterU does not need to broadcast the reverse complement of the query separately, but only the PMIs for the reverse complement (which FilterU simply extracts by consulting the PMI tables with the seed-long prefix of the reverse complement of the query). Attempting mapping for reverse complements hence forms BioMAP’s second phase for mapping.

If the second phase fails to map the query as well, we are left with three possibilities: (i) there are mismatches in the prefix; (ii) there are indels anywhere in the query (since TCAMs cannot detect them); or (iii) the query simply is too dissimilar to the reference due to too many mismatches. Unfortunately case (iii), which is typically due to complex genome variations, is very challenging to fix; however, such variations are rare. To address the first two cases, BioMAP chunks the query into two halves, and re-attempts mapping each half separately in two subsequent phases. Such chunking is promising since the mismatches that result in a failed mapping are less likely to occur in shorter querys. After these two phases, in case the alignment still fails, BioMAP attempts to align the reverse complements of the two halves, as well.

In Section 5.3, we will quantify how chunking can improve mapping accuracy significantly. However, chunking is helpless if BioMAP fails to align both halves of the query. BioMAP could fail to map a (half) query under two scenarios: due to mismatches in the prefix, or due to indels anywhere in the query. Let the probability of having mismatches in the prefix of a query be p_prefix, and the probability of having indels in a query be p_indel.

Let p_m be the per-base probability of a mismatch, be it due to the NGS platform or genome variations. Then

    p_prefix = 1 - (1 - p_m)^seed

applies. We can estimate p_m by adding the NGS platform substitution error rate, 0.1% [7], and the average genome variation rate, 0.1% [12, 13, 8]. Using this estimate, for a representative seed value of 15 (Section 4), p_prefix barely reaches 3.0%.

Let p_i be the per-base probability of a short indel, and L, the length of a query. Then

    p_indel = 1 - (1 - p_i)^L

applies. While there is no exact consensus on the rate of short indels, different studies estimate the expected probability of short indels based on previously observed cases [12]. For example, the work [14] reports the number of short indels identified across a sequenced genome of a few billion bases. Let us, for the sake of illustration, use a conservative choice for p_i, namely the short indel rate of 0.009% assumed in Section 4.6. Using this estimate, for a representative query length of 100 bases [1], p_indel barely reaches 1.0%.

We can estimate the probability of BioMAP failing to align a half by adding p_prefix and p_indel, which is around 4.0%. Then, the probability of BioMAP failing to align both halves becomes (4.0%)² ≈ 0.16%.
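These estimates are straightforward to reproduce; the rates below are the ones quoted above, with the per-base indel rate taken from the simulated dataset of Section 4.6.

```python
SEED, QUERY_LEN = 15, 100
P_SUB        = 0.001       # NGS platform substitution error rate [7]
P_VAR        = 0.001       # average genome variation rate [12, 13, 8]
P_INDEL_BASE = 0.00009     # per-base short indel rate (0.009%, Section 4.6)

p_m          = P_SUB + P_VAR
p_prefix     = 1 - (1 - p_m) ** SEED                 # corrupted seed-long prefix
p_indel      = 1 - (1 - P_INDEL_BASE) ** QUERY_LEN   # at least one indel in the query
p_half_fails = p_prefix + p_indel                    # ~4% per half
p_both_fail  = p_half_fails ** 2                     # ~0.15-0.16% for the whole query

print(f"p_prefix ~ {p_prefix:.1%}, p_indel ~ {p_indel:.1%}, "
      f"one half ~ {p_half_fails:.1%}, both halves ~ {p_both_fail:.2%}")
```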

Without loss of generality, the proof-of-concept BioMAP implementation adopts 2-way chunking, where multi-way (and not necessarily uniform) chunking can further help reduce the number of failed mappings. We leave the exploration of this rich design space to future work. Through chunking, the probability of missing a mapping can be significantly reduced, but depending on the application, this may still not be good enough. To cover such cases, BioMAP and software implementations of sophisticated high-overhead algorithms can be paired as follows: if after the multi-phase attempts BioMAP fails to map some querys, it stores them in a dedicated memory location, over DMA, for the software to read and process. Although the overhead of this solution is negligible (due to the very low likelihood), we will quantify it in Section 5. Highly optimized software packages (such as [10]) already commonly rely on similar multi-phase refinement.

Putting it all together, let us conclude this section with a thought experiment: Assume that there are two algorithms, A and B, for mapping reads to the reference genome. While A can handle substitution errors only, in time t_A per read, B can handle both substitutions and indels, in time t_B per read. Also assume that t_A ≪ t_B. This is generally true, since handling indels is much more difficult than handling substitution errors. Let us define a new meta-algorithm C which applies algorithm A to each read and, only if the read cannot be mapped, passes the read to the next-phase algorithm B. Then we can approximate the run time of algorithm C by

    T_C ≈ N × (t_A + p_indel × t_B),

where N is the total number of reads and p_indel is the probability of indels. Since p_indel ≪ 1 and t_A ≪ t_B, this meta-algorithm can lead to huge computational savings when compared to running B alone, which requires N × t_B. This simple thought experiment illustrates the time savings resulting from the hierarchical alignment structure.
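A quick numerical illustration of this estimate; t_A, t_B, and the first-phase failure probability below are made-up values chosen only to show the shape of the savings, not measured numbers.

```python
N        = 20_000_000   # reads, matching the dataset size of Section 4.6
T_A      = 1.0          # arbitrary time units per read for the fast substitution-only phase
T_B      = 50.0         # assumed much slower indel-capable phase (t_A << t_B)
P_FAIL_A = 0.04         # fraction of reads the first phase fails to map (~4%, Section 3.4)

t_meta   = N * (T_A + P_FAIL_A * T_B)   # hierarchical meta-algorithm C
t_b_only = N * T_B                      # running the heavyweight algorithm on every read
print(f"speedup of C over B alone ~ {t_b_only / t_meta:.1f}x")   # ~16.7x with these values
```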

From a statistical point of view, one might argue that using A first and B next might cause a statistical loss when a read gets mapped with A while it could have been mapped with B with higher likelihood. We can avoid this loss by carefully setting a threshold for the number of acceptable substitution errors in A. Let us use p_I and p_S to denote the indel and substitution probability, respectively. Then we just need to accept mappings in A only when the number of substitutions stays below a threshold derived from p_I and p_S, which guarantees that any sequence is mapped to the location (in the reference) with the largest likelihood of similarity. In the worst case, we can always set the threshold for the number of acceptable errors in A to one, which prevents such statistical loss altogether.

3.5 System Integration

Without loss of generality, all components of the proof-of-concept BioMAP implementation reside in a DIMM attached to the main memory bus, similar to the resistive TCAM accelerator from Guo et al. [6] or the associative compute engine ACC-DIMM [15]. While BioMAP features non-volatile TCAM data arrays as both of these designs do, BioMAP arrays do not include any of the priority index logic, population count logic, or the reduction network from Guo et al. [6], or the programmable microcontrollers of ACC-DIMM [15]. Instead, BioMAP tailors the data and control paths to read mapping, which entails minimal logic for filtering, scheduling, and queuing search requests. At the same time, BioMAP's TCAM arrays incorporate a novel similarity match mechanism purpose-built for read mapping.

4 Evaluation Setup

We next provide the configuration, modeling, and simulation details for the evaluation of the proof-of-concept BioMAP implementation.

4.1 System-level Characterization

As explained in Section 3.5, the proof-of-concept design resides in a DIMM attached to the main memory bus. We evaluate BioMAP using 3 different TCAM chips, all containing 1K×1K-bit arrays. These chips have different capacities: 512 Mbit, 1 Gbit, and 2 Gbit, respectively. We use a human genome as the reference (Section 4.6), which has approximately 3 Giga-bases. As discussed in Section 3.2, BioMAP adopts a 3-bit representation for each base. Therefore, a total of 16, 8, and 4 TCAM chips find place on a single BioMAP DIMM, to store the 9-Gbit reference, considering the 3 different chip configurations, respectively. We implement the PMI tables as DRAM modules, and keep them in the main memory in a separate DIMM such that the host can claim the DRAM space back as part of the main memory, as need be. All BioMAP logic and controllers (from FilterU and MatchU) reside in the corresponding DIMM controllers. Another option (not considered in the evaluation) is packing the PMI tables into the same DIMM as the TCAM chips, to have a self-contained BioMAP DIMM (of possibly higher energy efficiency and throughput by optimizing intra-DIMM communication). We do not practice this option to avoid favoring BioMAP over the baseline for comparison (Section 4.5). We use a modified version of DRAMSim2 [16] for simulation. While the default DRAMSim2 can model inter-DIMM communication, we had to implement the intra-DIMM interactions during search, the control logic, and the network related operations. We model a DDR4 DRAM to store the PMIT and PMITIL tables.

4.2 PMI Table Generation

We evaluate BioMAP considering different seed values. As explained in Section 3.1, while PMIT keeps an entry for each possible seed-long prefix contained within the reference, PMITIL contains 4^seed entries. In other words, PMIT keeps an entry for each possible base position in the reference to demarcate the start of a seed-long prefix. Therefore, PMIT capacity becomes practically independent of the seed for feasible seed values. Allocating an entry for each possible base position in the reference, a tight-enough upper bound for PMIT capacity for the human genome used as the reference for evaluation (Section 4.6) is approximately 11.4 GB, independent of the seed. PMITIL capacity, however, increases exponentially with seed, as captured by Table 1 for practical seed values ranging from 10 to 15. PMIT and PMITIL capacity together determine the DRAM space requirement of the proof-of-concept BioMAP implementation. We cap the maximum value of the seed at 15 to prevent excessive growth in PMITIL size.

Size (GB)        seed=10  seed=11  seed=12  seed=13  seed=14  seed=15
PMITIL             0.008    0.033    0.134    0.536    2.147    8.590
PMIT + PMITIL     11.450   11.475   11.576   11.978   13.589   20.032
Table 1: PMI table capacity as a function of seed.
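The PMITIL column of Table 1 follows directly from the 4^seed entry count and the 64-bit (8-byte) entry size; a short check, treating PMIT as the fixed ~11.44 GB derived above:

```python
PMIT_GB = 11.44   # ~2.86 Giga seed positions x 4 bytes (32-bit PMIT entries)
for seed in range(10, 16):
    pmitil_gb = (4 ** seed) * 8 / 1e9   # 8 bytes (64 bits) per PMITIL entry
    print(f"seed={seed}: PMITIL = {pmitil_gb:7.3f} GB, total = {PMIT_GB + pmitil_gb:7.3f} GB")
# seed=15 yields PMITIL ~ 8.59 GB and a total of ~20.03 GB, matching Table 1
```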

4.3 Circuit-level Characterization

The proof-of-concept BioMAP implementation uses Phase Change Memory (PCM) as the resistive memory technology for the TCAM arrays, which features a relatively high R_high-to-R_low ratio: 11.5 on average [17]. A higher R_high-to-R_low ratio eases sensing (i.e., distinguishing between matches and mismatches as explained in Section 3.3), and therefore enables arrays with longer rows. BioMAP's TCAM arrays are similar to the most energy-efficient design from Guo et al. [6], which corresponds to a 1K×1K-bit configuration.

We synthesize logic circuits with Synopsys Design Compiler vH2013.12 using the FreePDK45 library [18]. Then, to match the technology of our baseline for comparison (Section 4.5), we scale the outcome from 45nm to 28nm using ITRS projections [19]. A single search operation takes 2 ns to complete, while consuming 1 nJ of energy. The area of each 2-Gbit (1-Gbit, 512-Mbit) TCAM chip is nearly 27.8 mm² (13.9 mm², 7.0 mm²). A single BioMAP DIMM employing 2-Gbit (1-Gbit, 512-Mbit) chips consumes 7.1 W (10.7 W, 15.2 W) of power on average. We use ORION2.0 [20] to model the network. The intra-DIMM H-tree network operates at 750 MHz, while each hop (1 router + link) consumes 0.045 W.

(a) Throughput Performance
(b) Energy
Figure 9: Throughput performance and energy consumption.

4.4 Similarity Matching Specification

BioMAP adopts the Voltage Latch Sense Amplifier (VLSA) design from [21] to implement tunable sensing as explained in Section 3.3. We simulate the VLSA in HSPICE v2015.06 using the FreePDK45 [18] library. The VLSA's threshold voltage sets the boundary between the ranges of effective resistance (R_ML) values that the SA perceives as a (row-wide) match or a mismatch. We configure the VLSA's threshold voltage to account for potential fluctuations in R_high and R_low values due to PVT variations.

4.4.1 Setting the sensing threshold

We conduct a Monte Carlo analysis using the (variation-afflicted) R_high and R_low distributions from IBM [17], extracted from measured data and characterized by their means and standard deviations. Considering a row size of 1 Kbit, we find the R_ML distribution over 1M sample scenarios, each corresponding to a different number of base mismatches. Using the resulting distributions, and capping the maximum number of base mismatches that are permitted to pass as a match (i.e., the tolerance as explained in Section 3.3), we set the SA's sensing threshold in a variation-aware manner.
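A sketch of such a variation-aware threshold-setting procedure; the normal distributions and their parameters below are placeholders rather than the measured IBM data from [17], and the sample count is reduced to keep the sketch lightweight.

```python
import numpy as np

rng = np.random.default_rng(0)
ROW_BITS, SAMPLES = 1024, 10_000            # text uses 1M scenarios; reduced here
R_HIGH_MU, R_HIGH_SIGMA = 1.15e6, 1.0e5     # assumed, NOT the measured values from [17]
R_LOW_MU,  R_LOW_SIGMA  = 1.0e5,  1.0e4     # assumed, NOT the measured values from [17]

def r_ml_samples(base_mismatches: int) -> np.ndarray:
    """Sample R_ML of a 1K-bit row with the given number of base mismatches
    (2 low-resistance cells per mismatching base under the 3-bit encoding)."""
    low_cells = 2 * base_mismatches
    high = rng.normal(R_HIGH_MU, R_HIGH_SIGMA, (SAMPLES, ROW_BITS - low_cells))
    low  = rng.normal(R_LOW_MU,  R_LOW_SIGMA,  (SAMPLES, low_cells))
    return 1.0 / ((1.0 / high).sum(axis=1) + (1.0 / low).sum(axis=1))

TOLERANCE = 2
accept = r_ml_samples(TOLERANCE)        # rows that should still be sensed as a match
reject = r_ml_samples(TOLERANCE + 1)    # rows that should be sensed as a mismatch
# place the threshold between the low tail of the accept distribution and the
# high tail of the reject distribution (R_ML drops as mismatches accumulate)
threshold = 0.5 * (np.percentile(accept, 1) + np.percentile(reject, 99))
print(f"R_ML sensing threshold ~ {threshold:.1f} ohm")
```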

4.4.2 Sensing Accuracy

As explained in Section 3.3, under PVT variations, sense amplifiers (SAs) may trigger a (row-wide) match in case of an actual (row-wide) mismatch. In each such case, the number of base mismatches is higher than the preset tolerance value. In the following, we will refer to this difference in the number of base mismatches with respect to the tolerance as overshoot. We should note that these cases are not errors; rather, they translate into a query of lower similarity than expected being matched to a sub-sequence of the reference. Therefore, as long as the overshoot (in terms of base mismatches) with respect to the anticipated tolerance remains bounded, each such case can easily pass as a less similar match (which in fact can be an actual match where the input query was significantly corrupted by mismatches). The Monte Carlo analysis from Section 4.4.1 shows that for the different representative tolerance values used in Section 5, the overshoot is usually less than 3, with the probability of an overshoot of size 3 or larger barely reaching 0.05%. A maximum overshoot of 2 is acceptable for read mapping.

4.5 Baseline for Comparison

As a comparison baseline, we pick a highly optimized GPU implementation of the popular BWA algorithm, SOAP3-dp [10]. A pure software-based implementation of BioMAP is orders of magnitude slower than SOAP3-dp. We evaluate the throughput performance and power consumption of SOAP3-dp on an NVIDIA Tesla K40 GPU. We measure the power consumption of the GPU using the NVIDIA-SMI (System Management Interface) command. We use the same reference and query dataset (Section 4.6) as BioMAP as inputs. We compare BioMAP against two different configurations of SOAP3-dp: the first one, which we will refer to as SOAP-sub, only captures mismatches of substitution type, while the second one, SOAP-full, captures all types of mismatches.

4.6 Input Dataset

We use a real human genome, g1k_v37, from the 1000 genomes project [22] as the reference genome; and 20 million 150-base long real reads from NA12878 [23] as a query dataset. For mapping accuracy analysis, we further generate 20 million more querys using 150-base long randomly picked sub-sequences from this reference, which we corrupt according to the substitution error model of Illumina sequencing platforms with an error rate of 1.0%, SNP rate of 0.09%, and short indel rate of 0.009%, following Section 3.4. For a fair comparison (not to favor BioMAP) we choose the number of querys to have the reference + querys fit into the main memory of the GPU, such that the GPU does not suffer from extra energy-hungry data communication. We also limit the evaluation to a single BioMAP DIMM to keep the resource utilization comparable to the GPU baseline.

4.7 Design Space Exploration

We evaluate three BioMAP configurations. The first one, which we will refer to as BioMAP-sub, only has phases one and two, hence covers substitution errors and reverse complements. The second one, BioMAP-full, has all phases, hence can capture short indels, as well. Finally, BioMAP++ represents the scenario where unaligned reads from BioMAP-full are fed to SOAP3-dp periodically (after processing every 500K querys) to achieve the best possible accuracy.

5 Evaluation

In this section, we provide the evaluation of the proof-of-concept BioMAP implementation. We start with throughput performance and energy characterization in Section 5.1; quantify system bottlenecks in Section 5.2; discuss mapping accuracy in Section 5.3, and scalability in Section 5.4, respectively.

5.1 Throughput Performance and Energy

Larger seed values prune the search space more, resulting in a progressively lower number of search operations in processing each query. We observe that increasing the seed value monotonically and significantly improves performance. For instance, BioMAP with a seed value of 15 has 48.9% higher throughput than BioMAP with a seed value of 14. However, as the size of PMITIL grows exponentially with larger seeds, we cap the maximum seed value at 15. In the following, we report all performance numbers considering BioMAP with a seed of 15.

Fig. 9(a) depicts the throughput performance. The Y-axis represents the number of querys mapped per second. The X-axis shows the number of chips used for each experiment. The two red horizontal lines correspond to the throughput of the two baselines, SOAP-sub and SOAP-full, respectively. With a larger number of chips, BioMAP can perform more search operations in parallel. Consequently, the throughput performance increases with the number of chips. For 16 chips, BioMAP-full services 449.7K querys per second, which is around 4.0× the peak throughput of SOAP-full. Besides, if we exclude indels and perform the mapping considering substitutions only (i.e., BioMAP-sub and SOAP-sub), BioMAP-sub with 16 chips can map 545.6K querys per second, and still remains 4.0× faster than SOAP-sub. The alignment rate (i.e., the fraction of querys successfully aligned or mapped) of BioMAP-full is around 95.9%, while SOAP-full can align 97.4% of the querys. Although this difference is negligible, one may want to improve the alignment rate of BioMAP further, by feeding the unaligned querys of BioMAP-full to SOAP-full (i.e., BioMAP++). Throughput under BioMAP++ is slightly less than BioMAP-full itself, around 440.9K querys per second for 16 chips. This shows that BioMAP++ can increase the alignment rate of BioMAP to reach the alignment rate of one of the best known software platforms, SOAP-full, with negligible overhead (as explained in Section 3.4).

Fig. 9(b) demonstrates the energy consumption. As an energy efficiency metric, the Y-axis captures the number of querys that a platform can service per milli-Joule. The X-axis shows the number of chips BioMAP uses in each experiment. Two horizontal lines mark the number of querys per milli-Joule for the two baselines. Using more chips, BioMAP can perform more search operations in parallel, which in turn leads to higher array utilization rates. Consequently, BioMAP becomes more energy efficient. As Fig. 9(b) depicts, BioMAP-full (BioMAP-sub) with 16 chips can service around 29.6 (35.9) querys per milli-Joule, which is 26.2× (28.5×) more than SOAP-full (SOAP-sub). Since the execution of SOAP-full on the evaluated GPU consumes around 6.6× more power than BioMAP with 16 chips, we observe that using SOAP-full to align the 4.1% of querys left unaligned by BioMAP-full decreases the energy efficiency of BioMAP++ by 51.1% compared to BioMAP-full alone. Still, one may want to trade energy efficiency for alignment rate, to be able to reach alignment rates similar to SOAP-full.

5.2 System Bottlenecks

We next identify system bottlenecks. According to simulation results, MatchU is the most time consuming stage of the BioMAP pipeline. Fig. 10(a) depicts the share of time spent in each unit of MatchU, for 16 chips, which represents the fastest design point. The throughput bottleneck is the communication of querys and PMIs, taking 57.9% of the time. In Fig. 10(a), Logic refers to DispatchU, queues, and controllers in MatchU; Array, to the TCAM arrays. 24.8% of the time goes to actual search operations in the TCAM arrays. Besides, 17.3% of the time is spent in DispatchU (and the rest of the logic).

(a) Time
(b) Energy
Figure 10: Time and energy break-down.

Fig. 10(b) depicts the energy share of each BioMAP unit, for the most energy-efficient platform, using 16 chips. Logic refers to the total energy consumption of all logic units incorporated in FilterU and MatchU; DRAM, to the PMI tables; Array, to the TCAM arrays. Since BioMAP keeps 16 chips utilized in parallel, most of the energy goes to communication, rendering a share of 46.2%. Performing actual search operations in the TCAM arrays consumes 23.7% of the energy. 19.7% of the energy goes to search space pruning (i.e., accessing the PMI tables); and 10.4%, to logic operations.

5.3 Mapping Accuracy

                        BioMAP-sub  SOAP-sub  BioMAP-full  SOAP-full
Misalignment Rate            2.74%     0.98%        2.99%      1.12%
Alignment Failure Rate       2.88%     1.38%        0.04%      0.01%
Total                        5.62%     2.36%        3.03%      1.13%
Table 2: Accuracy of BioMAP w.r.t. SOAP3-dp.

To compare the alignment or mapping accuracy of BioMAP to SOAP3-dp, we use a simulated input dataset with known expected matching indices. Mapping fails either when a read is aligned to a wrong part of the reference, or when it is not aligned to any part of the reference at all. Considering the different configurations, Table 2 quantifies these mapping failure rates as the Misalignment Rate and the Alignment Failure Rate, respectively. We should note that misaligned querys are still mapped to a part of the reference which is similar enough; although not mapped to the expected part, they should not be regarded as errors.

BioMAP-sub fails to map 2.88% of the querys, while mapping 2.74% of them to wrong parts of the reference. BioMAP-full, featuring hierarchical multi-phase search, on the other hand, aligns 98.6% of the querys that BioMAP-sub fails to map, while misaligning only 8.8% of them. Compared to SOAP-sub, BioMAP-sub is only 3.3% less accurate overall. BioMAP-full, in turn, in total fails to map only 1.9% more querys compared to SOAP-full.

This analysis depicts how effective multi-phase search is in improving the mapping accuracy when dealing with short indels. While the accuracy difference between BioMAP-full and SOAP-full is only around 1.9%, using BioMAP++ can always boost the accuracy of BioMAP to the level of SOAP-full, as necessary.

5.4 Scalability

A higher tolerance value can effectively decrease the rate of querys missed during mapping, i.e., querys which have more of their bases corrupted by mismatches than the tolerance allows (per Section 2.1). Fig. 11(a) depicts how the throughput of BioMAP (with 16 chips) and SOAP3-dp scale with different tolerance values (i.e., the number of base mismatches that are permitted to pass as matches during mapping), as captured by the X-axis. The left Y-axis denotes throughput in terms of the total number of querys processed per second; the right Y-axis, the relative throughput improvement of BioMAP over SOAP3-dp. For BioMAP, changing the tolerance is equivalent to changing the SA's threshold voltage, following the methodology from Section 4.4. To increase the tolerance, the SA's threshold voltage should decrease, which marginally slows the SA down. On the other hand, increasing the tolerance improves BioMAP's alignment rate, which leads to a lower number of querys going through extra search phases. As shown in Fig. 11(a), increasing the tolerance improves BioMAP's throughput performance marginally. SOAP3-dp's throughput, however, degrades more notably with increasing tolerance, since it cannot prune the search space as aggressively under a higher tolerance value. Consequently, BioMAP's throughput improvement over SOAP3-dp increases for higher tolerance values, as shown in Fig. 11(a), reaching its maximum for a tolerance of 4 (the largest allowed in SOAP3-dp). Fig. 11(b) compares the alignment rates (Y-axis) of BioMAP and SOAP3-dp under different tolerance values (X-axis). We observe that BioMAP's alignment rate closely tracks SOAP3-dp's. For a tolerance of 4, BioMAP fails to align only around 2.9% more querys when compared to SOAP3-dp. To summarize, BioMAP can handle higher tolerance values compared to SOAP3-dp, with negligible alignment rate reduction.

Figure 11: Scalability w.r.t. SOAP. (a) throughput; (b) alignment rate.
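To make the role of the tolerance parameter concrete, the following minimal sketch models, at a functional level, what the SA-threshold tuning of Section 4.4 achieves in hardware: a stored reference fragment counts as a match if its number of base mismatches against the query does not exceed the tolerance. The function names are illustrative assumptions, not BioMAP's actual circuitry.

```python
# Functional model of tolerance-based approximate matching (illustrative only):
# in hardware, BioMAP realizes this comparison within a single TCAM row search
# by tuning the sense amplifier's reference voltage instead of counting explicitly.

TOLERANCE = 4  # number of base mismatches still accepted as a match

def base_mismatches(query: str, fragment: str) -> int:
    """Count mismatching bases between a query and an equally long fragment."""
    assert len(query) == len(fragment)
    return sum(1 for q, f in zip(query, fragment) if q != f)

def approx_match(query: str, fragment: str, tolerance: int = TOLERANCE) -> bool:
    """Accept the fragment if it differs from the query in <= tolerance bases."""
    return base_mismatches(query, fragment) <= tolerance

# Example: two mismatching bases pass under tolerance 4, but not under 1.
print(approx_match("ACGTACGT", "ACGAACGA"))               # True
print(approx_match("ACGTACGT", "ACGAACGA", tolerance=1))  # False
```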

5.5 Extension to Similar Application Domains

While tailored to read mapping, BioMAP is fundamentally applicable to similar search-intensive problems. For such extensions, we first need to determine the minimum number of bits needed to represent each character in the alphabet of the new problem (as covered in Section 3.2), along with an acceptable tolerance level. As the number of bits representing each character grows, so does the number of bit mismatches incurred per character mismatch; consequently, sensing becomes even easier to design. We next need to tune the threshold of the sense amplifiers (Section 4.4) and pick a seed that maximizes energy efficiency and throughput. We leave such exploration to future work.
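As an illustration, the sketch below (a hypothetical example, not part of BioMAP) computes the minimum number of bits per character for a new alphabet and encodes a string into fixed-width bit patterns suitable for storage in TCAM rows; a 20-letter amino-acid alphabet, for instance, needs 5 bits per character, versus 2 for DNA bases.

```python
import math

def bits_per_character(alphabet: str) -> int:
    """Minimum number of bits needed to distinguish every character."""
    return max(1, math.ceil(math.log2(len(set(alphabet)))))

def encode(sequence: str, alphabet: str) -> str:
    """Encode a sequence as a fixed-width bit string (one code per character)."""
    width = bits_per_character(alphabet)
    codes = {ch: format(i, f"0{width}b")
             for i, ch in enumerate(sorted(set(alphabet)))}
    return "".join(codes[ch] for ch in sequence)

dna = "ACGT"                      # 4 characters  -> 2 bits each
protein = "ACDEFGHIKLMNPQRSTVWY"  # 20 amino acids -> 5 bits each
print(bits_per_character(dna), bits_per_character(protein))  # 2 5
print(encode("ACGT", dna))        # 00011011
```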

6 Related Work

Read Mapping Alternatives: With no pre-processing of the reference, the computational complexity of read mapping scales at least linearly with the reference length. Therefore, advanced pre-processing is prevalent. Pre-processing in the form of hashing forms the basis of many popular software packages such as SOAP [24], Eland (part of the Illumina suite), and MAQ [25]. Another popular option with a smaller memory footprint, including SOAP2 [26], Bowtie(2) [27, 28], and BWA [29, 30], is based on the Burrows-Wheeler Transform (BWT), an invertible transformation of the reference. Under BWT, the computational complexity of mapping becomes a function of the read length only. The baseline for comparison, SOAP3-dp [10], represents an open-source, energy-efficient GPU implementation of BWA. SOAP3-dp can also handle noise in input reads; unlike BioMAP, however, its computational complexity depends on the value of the tolerance.
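To make the BWT claim concrete, the following toy sketch (an illustrative implementation, not the one used by SOAP2, Bowtie, or BWA) builds a small FM-index and performs exact backward search; note that the search loop executes once per query character, independent of the reference length.

```python
def build_fm_index(reference: str):
    text = reference + "$"  # unique terminator, lexicographically smallest
    # Burrows-Wheeler Transform via sorted rotations (fine for a toy example;
    # real tools use far more efficient, compressed index construction).
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    bwt = "".join(rot[-1] for rot in rotations)
    alphabet = sorted(set(text))
    # C[c]: number of characters in the text strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # Occ[c][i]: occurrences of c in bwt[:i] (prefix counts).
    Occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            Occ[c][i + 1] = Occ[c][i] + (1 if ch == c else 0)
    return C, Occ, len(bwt)

def backward_search(query: str, C, Occ, n) -> int:
    """Count exact occurrences of `query`; one iteration per query character."""
    lo, hi = 0, n
    for c in reversed(query):
        if c not in C:
            return 0
        lo = C[c] + Occ[c][lo]
        hi = C[c] + Occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo

C, Occ, n = build_fm_index("ACGTACGGACGT")
print(backward_search("ACG", C, Occ, n))  # 3
```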

High-throughput FPGA implementations also exist, but at the expense of significantly higher power consumption and hence lower energy efficiency. For example, Vanderbauwhede et al. [31] introduce a high-throughput implementation on RIVYERA clusters employing 128 Xilinx SPARTAN-6 LX150 FPGAs, which can process 45K (100-base-long) queries per second at the cost of orders-of-magnitude higher power consumption than BioMAP.

Race Logic: Madhavan et al. [32] propose hardware acceleration of dynamic programming using the exotic race logic. As a case study, they compute the similarity between two strings corresponding to the read and a sub-sequence of the reference. The proposed accelerator requires around 120ns and 1nJ per alignment, which is much slower than BioMAP. Moreover, its energy consumption grows quadratically with the read length, which impairs scalability.

Resistive CAM Accelerators: Guo et al. [6, 15] explore the potential of TCAM arrays for accelerating data-intensive applications. Yavits et al. [33] propose an associative processor that employs resistive-CAM-based look-up tables to implement diverse functions. From a technology perspective, recent representative demonstrations of resistive TCAM arrays include Li et al. [34] and Yan et al. [35]: the IBM design [34] features a 1Mb PCM-based CAM array in IBM 90nm technology, with a measured search latency of 1.9ns. Yan et al. [35], on the other hand, propose two novel MTJ-based cell designs in 45nm technology and show that searching for a match takes around 0.6ns for a row size of 256. BioMAP can adopt any resistive CAM array, including these more recent proposals, to tap the potential of non-volatile TCAM for scalable, high-throughput, energy-efficient read mapping.

Imani et al. [36, 37] introduce approximate resistive CAM arrays to accelerate approximate-computing workloads. The first design [36] relies on an exotic cell organization that connects the cell to ML on a match and disconnects it on a mismatch (the opposite of the state-of-the-art convention per Fig. 8). This complicates match detection, thereby restricts the row length to at most 8 bits, and impairs the applicability to read mapping. The second design [37] uses resistive CAM arrays to accelerate brain-inspired computing. It does not rely on the previous exotic cell; however, to find the Hamming distance (its similarity metric) robustly, it limits the row size to only 4 bits, which likewise impairs the applicability to read mapping. While these proposals show great potential for many applications, extending them to approximate matching for read mapping is not straightforward due to the restricted row sizes, which challenge the amortization of the per-row SA overhead. At the same time, reads, and hence queries, are getting longer as NGS platforms improve, and TCAM search happens at row granularity. BioMAP adds support for approximate matches much less intrusively, by carefully tuning the SA reference voltage in a variation-aware manner, without restricting the row size or introducing additional SAs that would incur a notable area cost. At the same time, BioMAP features other key components (beyond the Match Unit with CAM arrays capable of detecting approximate matches in place), which together constitute the scalable, high-throughput, energy-efficient read mapping accelerator. Neither the similarity match detection mechanism proposed in this paper nor any CAM array capable of handling approximate matches would, by itself, be sufficient to implement an efficient read mapping accelerator, as demonstrated in Section 2.2.

Promising Alternative Computing Paradigms: The evaluated proof-of-concept implementation represents one feasible point in BioMAP's rich design space. The system interface could also take a different form, rather than embedding the accelerator in a DIMM attached to the main memory bus; 3D stacking, for example, is an option that would enable even more parallelism. That said, the evaluated design point features TCAM, which is well suited to the efficient similarity search that read mapping demands, and the evaluated interface leads to a relatively less intrusive design.

We could implement BioMAP using emerging in/near-memory logic, but the scale of the problem would demand careful optimization of data communication between the memory modules and the logic embedded in or near memory. Cache-based in-memory solutions such as Compute Caches [38], on the other hand, may not be suitable due to the relatively large memory footprint of reference genomes, unless a feasible form of data compaction or compression is employed.

Recently, Kim et al. have explored an efficient filtering step for hash-based read mapping using 3D-stacked memories [39]. This study covers filtering (i.e., search-space pruning) only, which is similar in functionality to BioMAP's Filter Unit. In line with our observations, the authors aim to prune the search space in order to maximize throughput, and report up to 6.4× improvement. While the high cost of “quadratic-time dynamic programming” algorithms for string (i.e., query) matching motivated Kim et al. [39], BioMAP already relies on the much faster and more energy-efficient string matching enabled by TCAM search. Hence, BioMAP employs a simpler, low-latency filtering mechanism (incorporated in the Filter Unit) that is tailored to this more efficient string matching mechanism (incorporated in the Match Unit). Although complex, higher-overhead methods like [39] can prune the search space further, embedding them into BioMAP's Filter Unit would be overkill, rendering the Filter Unit the system bottleneck and diminishing the benefits of the Match Unit's fast, energy-efficient similarity matching. We believe that proposals like [39] are more suitable for long read mapping.
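For intuition, the following minimal sketch (an illustrative example, not BioMAP's PMI-table organization nor GRIM-Filter's scheme) captures the basic idea behind seed-based search-space pruning: index the reference by k-base seeds, then examine only the candidate locations that share a seed with the query.

```python
from collections import defaultdict

def build_seed_index(reference: str, k: int) -> dict:
    """Map every k-base seed of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def candidate_positions(query: str, index: dict, k: int):
    """Positions worth full matching: anywhere a seed of the query hits the reference."""
    candidates = set()
    for offset in range(len(query) - k + 1):
        for pos in index.get(query[offset:offset + k], []):
            candidates.add(pos - offset)   # align candidate to the query's start
    return sorted(c for c in candidates if c >= 0)

reference = "ACGTACGGACGTTTGACGTA"
index = build_seed_index(reference, k=4)
print(candidate_positions("ACGTTTGA", index, k=4))  # [0, 8, 15]; only these need full matching
```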

7 Discussion & Conclusion

As semiconductor technology revolutionized computing, high-throughput DNA sequencing technology, termed Next Generation Sequencing (NGS), revolutionized genomic research. As a result, a progressively growing number of short DNA sequences, generated at a rate that outpaces Moore's Law, needs to be mapped to reference genomes, which themselves represent full-fledged assemblies of already sequenced DNA fragments. This search-heavy, data-intensive mapping task requires no complex floating-point arithmetic and is therefore particularly suitable to in- or near-memory processing, where non-volatile memory can accommodate the large memory footprint in an area- and energy-efficient manner. This paper details the design of such a near-(non-volatile)-memory sequence mapping accelerator, BioMAP, which delivers 4.0× higher throughput while consuming 26.2× less energy when compared to a highly optimized software implementation for modern GPUs.

Short (i.e., 100-200-base-long) reads from modern Illumina NGS platforms [1] currently constitute more than 90% of all reads produced worldwide. This dominance is unlikely to change in the near future, since the progressively dropping sequencing cost of short-read technologies keeps them significantly more cost-efficient than long-read counterparts such as PacBio [40] or Oxford Nanopore [41] (where read lengths can exceed tens of thousands of bases). The key benefit of long-read sequencing comes from its ability to directly extract long-range information, not necessarily from higher accuracy; moreover, many emerging technologies such as 10xGENOMICS [42] can obtain long-range information from short reads. Although it is very hard to predict the future exactly, and also considering practical factors such as market share and market capitalization, we believe that short-read platforms will remain prevalent at least in the near future. Accordingly, BioMAP is designed for short read mapping.

References

  • [1] “Illumina sequencing by synthesis (SBS) technology: https://www.illumina.com/technology/next-generation-sequencing/sequencing-technology.html.”
  • [2] Z. D. Stephens, S. Y. Lee, F. Faghri, R. H. Campbell, C. Zhai, M. J. Efron, R. Iyer, M. C. Schatz, S. Sinha, and G. E. Robinson, “Big Data: Astronomical or Genomical?,” PLOS Biology, vol. 13, July 2015.
  • [3] S. Aluru and N. Jammula, “A review of hardware acceleration for computational genomics,” IEEE Design & Test, vol. 31, no. 1, 2014.
  • [4] P. Klus, S. Lam, D. Lyberg, M. S. Cheung, G. Pullan, I. McFarlane, G. S. Yeo, and B. Y. Lam, “BarraCUDA - a fast short read sequence aligner using graphics processing units,” BMC Research Notes, vol. 5, Jan. 2012.
  • [5] Y. Chen, B. Schmidt, and D. L. Maskell, “A hybrid short read mapping accelerator,” BMC Bioinformatics, vol. 14, no. 1, 2013.
  • [6] Q. Guo, X. Guo, Y. Bai, and E. Ipek, “A resistive TCAM accelerator for data-intensive computing,” in IEEE International Symposium on Microarchitecture (MICRO), 2011.
  • [7] M. Schirmer, R. D‘Amore, U. Z. Ijaz, N. Hall, and C. Quince, “Illumina error profiles: resolving fine-scale variation in metagenomic sequencing data,” BMC Bioinformatics, vol. 17, no. 1, 2016.
  • [8] J. Wala et al., “Genome-wide detection of structural variants and indels by local assembly,” bioRxiv 105080, 2017.
  • [9] R. P. Abo, M. Ducar, E. P. Garcia, A. R. Thorner, V. Rojas-Rudilla, L. Lin, L. M. Sholl, W. C. Hahn, M. Meyerson, N. I. Lindeman, et al., “BreaKmer: Detection of structural variation in targeted massively parallel sequencing data using k-mers,” Nucleic Acids Research, vol. 43, no. 3, pp. e19–e19, 2014.
  • [10] R. Luo, T. Wong, J. Zhu, C.-M. Liu, X. Zhu, E. Wu, L.-K. Lee, H. Lin, W. Zhu, D. W. Cheung, et al., “SOAP3-dp: Fast, accurate and sensitive GPU-based short read aligner,” PLOS ONE, vol. 8, no. 5, 2013.
  • [11] D. J. Lipman and W. R. Pearson, “Rapid and sensitive protein similarity searches,” Science, vol. 227, no. 4693, 1985.
  • [12] J. M. Mullaney, R. E. Mills, W. S. Pittard, and S. E. Devine, “Small insertions and deletions (indels) in human genomes,” Human Molecular Genetics, vol. 19, no. R2, 2010.
  • [13] S.-M. Ahn et al., “The first Korean genome sequence and analysis: Full genome sequencing for a socio-ethnic group,” Genome Research, vol. 19, no. 9, 2009.
  • [14] J. Wang, W. Wang, R. Li, Y. Li, G. Tian, L. Goodman, W. Fan, J. Zhang, J. Li, J. Zhang, et al., “The diploid genome sequence of an Asian individual,” Nature, vol. 456, no. 7218, pp. 60–65, 2008.
  • [15] Q. Guo, X. Guo, R. Patel, E. Ipek, and E. G. Friedman, “AC-DIMM: associative computing with STT-MRAM,” in ACM/IEEE International Symposium on Computer Architecture (ISCA), June 2013.
  • [16] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, “DRAMSim2: A cycle accurate memory system simulator,” IEEE Computer Architecture Letters, vol. 10, no. 1, 2011.
  • [17] H. Cheng, W. Chien, M. BrightSky, Y. Ho, Y. Zhu, A. Ray, R. Bruce, W. Kim, C. Yeh, H. Lung, et al., “Novel fast-switching and high-data-retention phase-change memory based on new Ga-Sb-Ge material,” in IEEE International Electron Devices Meeting (IEDM), 2015.
  • [18] NCSU-EDA, “FreePDK45: https://www.eda.ncsu.edu/wiki/FreePDK45:Contents.”
  • [19] L. Wilson, “International Technology Roadmap for Semiconductors (ITRS),” Semiconductor Industry Association, 2013.
  • [20] A. B. Kahng, B. Li, L.-S. Peh, and K. Samadi, “ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration,” in Proceedings of the Design, Automation & Test in Europe (DATE), 2009.
  • [21] M. H. Abu-Rahma, Y. Chen, W. Sy, W. L. Ong, L. Y. Ting, S. S. Yoon, M. Han, and E. Terzioglu, “Characterization of SRAM sense amplifier input offset for yield prediction in 28nm CMOS,” in IEEE Custom Integrated Circuits Conference (CICC), Sept. 2011.
  • [22] “1000 genomes project: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/.”
  • [23] “Na12878.” ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/.
  • [24] R. Li, Y. Li, K. Kristiansen, and J. Wang, “SOAP: Short oligonucleotide alignment program,” Bioinformatics, vol. 24, no. 5, 2008.
  • [25] H. Li, J. Ruan, and R. Durbin, “Mapping short DNA sequencing reads and calling variants using mapping quality scores,” Genome Research, vol. 18, no. 11, 2008.
  • [26] R. Li, C. Yu, Y. Li, T.-W. Lam, S.-M. Yiu, K. Kristiansen, and J. Wang, “SOAP2: An improved ultrafast tool for short read alignment,” Bioinformatics, vol. 25, no. 15, 2009.
  • [27] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, “Ultrafast and memory-efficient alignment of short DNA sequences to the human genome,” Genome Biology, vol. 10, no. 3, 2009.
  • [28] B. Langmead and S. L. Salzberg, “Fast gapped-read alignment with Bowtie 2,” Nature Methods, vol. 9, no. 4, 2012.
  • [29] H. Li and R. Durbin, “Fast and accurate short read alignment with Burrows-Wheeler transform,” Bioinformatics, vol. 25, no. 14, 2009.
  • [30] H. Li and R. Durbin, “Fast and accurate long-read alignment with Burrows-Wheeler transform,” Bioinformatics, vol. 26, no. 5, 2010.
  • [31] W. Vanderbauwhede and K. Benkrid, High-performance computing using FPGAs. Springer, 2013.
  • [32] A. Madhavan, T. Sherwood, and D. Strukov, “Race logic: A hardware acceleration for dynamic programming algorithms,” ACM/IEEE International Symposium on Computer Architecture (ISCA), 2014.
  • [33] L. Yavits, S. Kvatinsky, A. Morad, and R. Ginosar, “Resistive associative processor,” IEEE Computer Architecture Letters, vol. 14, no. 2, 2015.
  • [34] J. Li, R. Montoye, M. Ishii, K. Stawiasz, T. Nishida, K. Maloney, G. Ditlow, S. Lewis, T. Maffitt, R. Jordan, L. Chang, and P. Song, “1Mb 0.41µm² 2T-2R cell nonvolatile TCAM with two-bit encoding and clocked self-referenced sensing,” in Symposium on VLSI Circuits, June 2013.
  • [35] B. Yan, Z. Li, Y. Zhang, J. Yang, H. Li, W. Zhao, and P. C.-F. Chia, “A high-speed robust NVM-TCAM design using body bias feedback,” in Great Lakes Symposium on VLSI (GLSVLSI), 2015.
  • [36] M. Imani, D. Peroni, A. Rahimi, and T. Rosing, “Resistive CAM acceleration for tunable approximate computing,” IEEE Transactions on Emerging Topics in Computing, 2017.
  • [37] M. Imani, A. Rahimi, D. Kong, T. Rosing, and J. Rabaey, “Exploring hyperdimensional associative memory,” IEEE International Symposium on High Performance Computer Architecture (HPCA), 2017.
  • [38] S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, and R. Das, “Compute caches,” in IEEE International Symposium on High Performance Computer Architecture (HPCA), 2017.
  • [39] J. Kim, D. Senol, H. Xin, D. Lee, M. Alser, H. Hassan, O. Ergin, C. Alkan, and O. Mutlu, “Genome Read In-Memory (GRIM) filter: Fast location filtering in DNA read mapping using emerging memory technologies,” 2017. https://people.inf.ethz.ch/omutlu/pub/GRIM-genome-read-in-memoryfilter_psb17-poster.pdf
  • [40] “Pacific BioSciences: http://www.pacb.com/products-and-services/pacbio-systems/.”
  • [41] D. Branton, D. W. Deamer, A. Marziali, H. Bayley, S. A. Benner, T. Butler, M. Di Ventra, S. Garaj, A. Hibbs, X. Huang, et al., “The potential and challenges of nanopore sequencing,” Nature Biotechnology, vol. 26, no. 10, 2008.
  • [42] “10xGENOMICS: https://www.10xgenomics.com/genome/.”