## I Introduction

Burrows-Wheeler transform (BWT) [1] is a string rearrangement algorithm first proposed for data compression. Making use of the properties between BWT and its original string, FM-index [4] is designed for efficient string searching. For a certain string, the data structure of its FM-index contains the BWT, the suffix array (SA) and two auxiliary tables of the target string. With its high efficiency in both time and memory, FM-index is widely adopted by many next-generation sequencing (NGS) applications [6, 7, 5, 10].

For FM-index-based sequence aligners such as BWA-backtrack [6] and Bowtie2 [5], the reference sequence is indexed for fast read-locating processes. For this kind of aligners, because only the reference sequence needs to be indexed, the hardware accelerators ([9, 11]) usually build the FM-index externally using CPUs. Since the length of a reference sequence can be as large as three billion base pairs (bps), the memory usage of naive index constructing method is unaffordable. Therefore, how to reduce memory usage in FM-indexing becomes an important issue [8, 3].

Recently, some de novo assemblers ([10], for example) apply FM-index for overlap finding of reads. Also, the reference and reads are both indexed in new sequence aligners such as BWA-SW [7]. For these applications, external construction of FM-index has much higher time overhead in comparison with traditional algorithms. Therefore, on-chip indexing becomes necessary and important. Our previous research [2] has demonstrated an FM-index constructor with a lightweight iterative algorithm [3] with an ASIC, but the cost of the indexer is still high when compared to the work proposed here.

In this work, we propose a novel memory efficient FM-index construction algorithm, Self-Aided Incremental Indexing (SAII), which is suitable for hardware realization. This algorithm builds the FM-index incrementally. In each iteration, it utilizes a meta-index to construct the complete index. Since SAII makes use of the FM-index itself for construction, it has no memory overhead for FM-index-based applications. Only few computational logic units are needed. In our hardware system, the processing speed is accelerated by a special prefetch mechanism and a parallel architecture. We choose Altera Stratix V FPGA as the evaluation platform. The SAII FM-index constructor is very compact in terms of logic usage, so it can be integrated with other functional blocks to form a complete hardware pipeline in emerging NGS applications.

## Ii Background

### Ii-a Burrows-Wheeler Transform

To construct the BWT of target sequence , a simple method is via the translation of suffix array () with Eq. 1. Also, BWT can be obtained by collecting all characters in the last column of sorted suffixes. Fig. 1 shows an example of the BWT of sequence =ACGATTG$, where character $ is the end-of-string character. The lexical order is $ACGT.

(1) |

### Ii-B FM-index and Backward Search Algorithm

FM-index is extended from BWT and suffix array, with two auxiliary tables— array and table. The definitions for array and table are shown in Eq. 2 and 3, where is the length of target sequence .

(2) |

(3) |

With array and table, the position of query can be efficiently located within a lower bound and upper bound as shown in Eq. 4 and 5.

(4) |

(5) |

In [4], Ferragina and Manzini have proved that if and only if is a substring of . This searching algorithm starts from the end of the query sequence and extends iteratively. Therefore it is also called backward search algorithm.

## Iii Self-Aided Incremental Indexing (SAII) Algorithm

An example of backward search algorithm is shown in Fig. 2. The initial values are set at . Here we discuss the mathematical insights of the lower bound in backward search algorithm. is the sum of , and 1. records the total number of characters lexically smaller than in target sequence . The term is the occurrence of in . With an additional offset, represents the suffix array index of lexically smallest sequence. Similar concept can be used to account for the upper bound. If only occurs once in , is equal to . It is also of interest that what would happen if is not a substring of . Since measures the occurrences of lexically smaller substrings in , the lower bound guarantees the following inequality:

(6) |

SAII utilizes Eq. 6 to build FM-index incrementally. For a target sequence , if the FM-index of its substring $ has been constructed, the suffix array index of the query string $ can be determined by calculating . Since could not be a substring of , the suffix array index obtained is unique and follows Eq. 6. Therefore, we can insert into the FM-index of , generating the new FM-index of without sorting the whole string all over again.

The algorithm of SAII is shown in Alg. 1, and an example of a target sequence ACGCT$ is provided in Fig. 3. In the first iteration, the initial character is and the BWT is also . The corresponding table and are calculated. In the second iteration, a new character T is added to the target sequence. Then we use and obtained in the previous iteration to calculate the new , and the updated is inserted to this position to form a new . With this updated , we recalculate and for this iteration. Then SAII is ready to move on to next iteration. After all the characters in the target sequence are read, the corresponding FM-index of the reference is correctly built.

Because the construction process is entirely based on FM-index itself, nearly no extra computational resources are needed and the memory overhead is zero for FM-index-based applications. The time complexity of SAII algorithm is , where is completeness of the table. The details of is given in Sec. IV-B.

## Iv Hardware Implementation and Discussion

### Iv-a Overall Architecture

The hardware system includes a finite-state machine controller and a combinational computing logic. The finite-state machine of the proposed hardware is shown in Fig. 4. There are four states in this system: Initial, Search, Update and Finish states.

Initial and Finish states control the initial and finishing conditions of the hardware. Search state computes the lower bound with Eq. 4. A two-stage parallel pop counter is used in Search state for fast computing. Update State updates the latest incoming symbol to the $ position in previous iteration. Also, it inserts the $ symbol to the updated index based on the position calculated in Search state. The whole FM-index data structure including , array and table all need to be updated in this state. The finite-state machine repeats between Search and Update states to construct the FM-index and moves to Finish state after the last character of the target sequence is processed.

### Iv-B Data Structure

For DNA aligners, the size of alphabets () is five (including $). Since symbol $ occurs only once, in our hardware system only A,C,G, and T are encoded. End-of-string character $ is encoded the same as the symbol, A, and an additional special pointer is designed to store the position of $. This design uses only two bits for each character, which is only 67% in comparison to that of naive 3-bit encoding.

The memory usage of the three main components of FM-index, array, table and , are , and , respectively. However, the length of genome sequence data is sometimes very large. The length of a human chromosome can exceed 200 Mbp and the whole human genome is more than 3 Gbp. With this scale of data, the memory usage of a complete table is very expensive in hardware systems. It should be noted that table is a hash table obtained from designed for fast computation of and . The correct bounds can still be calculated even without table at the cost of searching efficiency.

In our hardware system, incomplete table is used to achieve balance between memory usage and computing speed. An incomplete table stores the occurrence values at every characters. It is times smaller than a complete table. With an incomplete table, the calculation of is split into two parts. First, is stored in the incomplete table. Second, the occurrences of from to is calculated with a pop counter. In our hardware implementation, is set at 2,048. The pop counter is designed with a two-stage architecture, in which the first stage has 32 parallel adders and the second stage has 64 parallel adders. The incomplete table can be adjusted for different applications with simple modification of parameters.

### Iv-C Constructing BWT with Prefetch Mechanism

In our hardware system, Search and Update states are in charge of the construction of FM-index. To save computing resource, and table are both segmented and stored in the BRAM. Though this saves lots of area, it takes longer time to update the BRAM due to the fixed word length. Therefore, how to make use of the limited bandwidth is very important. To address this issue, we design a prefetch mechanism that saves 50% time of BRAM updating. The timing diagram without prefetching is shown in Fig. 6(a). In Search state, is calculated; in Update state, the incoming character is updated to the FM-index as shown in Line 6 in Alg. 1; in Insert state, the is inserted to the FM-index, as shown in Line 9 in Alg. 1. In both Update and Insert states, and table in the BRAM have to be refreshed, so the processing time is long.

Prefetch mechanism is designed to reduce the runtime to refresh BRAM. As shown in Fig. 3, SAII algorithm first replaces the in with the new character, calculates the new insertion position of and inserts new to the . With prefetch mechanism, the insertion of is combined with next Update state and executed after the hardware system sees the next character. This does not generate the completely correct FM-index yet because of its early update design. Therefore, the early updated position and character have to tracked with an additional monitor to make sure the calculation of is correct in the next iteration. In the last iteration, since there is no more character for prefetching, the final FM-index is correct. Fig. 6(b) shows the timing diagram of the hardware system with prefetching.

Assume that one update takes cycles and one backward search takes cycles (), prefetch mechanism reduces the computing time from to for each iteration. The computing time is nearly half of the original design. The expected runtime () of the SAII hardware system is shown in Eq. 7, where stands for search time and is set to 3 cycles in our implementation.

(7) |

### Iv-D Discussion

We implement our SAII FM-index constructor on an Altera Stratix V FPGA (5SGXEA7N2F45C2N) board. The hardware system is synthesized using Altera Quartus (v.15.0) tool. It only uses 21,944 ALMs (9%) and 266,496 BRAMs ( 1%) on this FPGA. Also, the serial-input design uses a very small proportion of the I/O bandwidth. These properties make the SAII algorithm easily integrated into existing sequencing pipelines at very low hardware cost.

The operation frequency can reach 120 MHz even with the worst case model (900 mV, 85 C). Sequences with different lengths, from 16,384-bp to 131,072-bp, are tested and the results are shown in Fig. 6

. It takes about 21 ms to finish the indexing of a 131,072-bp sequence. The runtime is very close to our theoretical estimation given in Eq.

7. Even for genomes with several million base pairs, it is estimated that our system can construct the FM-index in seconds.## V Conclusions

With many emerging applications based on FM-index, an efficient index construction algorithm is needed. Previous algorithms ([8, 3]) need additional working space to build the index, raising the costs of hardware systems. In this paper, we propose a novel hardware-compatible Self-Aided Incremental Indexing (SAII) algorithm to construct FM-index with no memory overhead. This algorithm is accelerated with a parallel pop counter and a special prefetch mechanism. Its realization on FPGA needs very few hardware resources and can be easily integrated in different FM-index-based applications.

## Acknowledgment

This work is supported by the Ministry of Science and Technology, Taiwan, under Grant numbers MOST 105-2221-E-002-090 and 106-2221-E-002-055. Nae-Chyun Chen would like to thank NOVATEK for providing fellowship.

## References

- [1] (1994) A block-sorting lossless data compression algorithm. Cited by: §I.
- [2] (2015) Power efficient special processor design for burrows-wheeler-transform-based short read sequence alignment. In Biomedical Circuits and Systems Conference (BioCAS), 2015 IEEE, pp. 1–4. Cited by: §I.
- [3] (2012) Lightweight data indexing and compression in external memory. Algorithmica 63 (3), pp. 707–730. Cited by: §I, §I, §V.
- [4] (2000) Opportunistic data structures with applications. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pp. 390–398. Cited by: §I, §II-B.
- [5] (2012) Fast gapped-read alignment with bowtie 2. Nature methods 9 (4), pp. 357–359. Cited by: §I, §I.
- [6] (2009) Fast and accurate short read alignment with burrows–wheeler transform. Bioinformatics 25 (14), pp. 1754–1760. Cited by: §I, §I.
- [7] (2010) Fast and accurate long-read alignment with burrows–wheeler transform. Bioinformatics 26 (5), pp. 589–595. Cited by: §I, §I.
- [8] (2009) A linear-time burrows-wheeler transform using induced sorting.. In SPIRE, Vol. 5721, pp. 90–101. Cited by: §I, §V.
- [9] (2012) Hardware acceleration of short read mapping. In Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on, pp. 161–168. Cited by: §I.
- [10] (2010) Efficient construction of an assembly string graph using the fm-index. Bioinformatics 26 (12), pp. i367–i373. Cited by: §I, §I.
- [11] (2016) Hardware-acceleration of short-read alignment based on the burrows-wheeler transform. IEEE Transactions on Parallel and Distributed Systems 27 (5), pp. 1358–1372. Cited by: §I.

Comments

There are no comments yet.