Vectorized Character Counting for Faster Pattern Matching

by   Roman Snytsar, et al.

Many modern sequence alignment tools implement fast string matching using the space efficient data structure called FM-index. The succinct nature of this data structure presents unique challenges for the algorithm designers. In this paper, we explore the opportunities for parallelization of the exact and inexact matches and present an efficient SIMD solution for the Occ portion of the algorithm. Our implementation computes all eight Occ values required for the inexact match algorithm step in a single pass. We showcase the algorithm performance in a multi-core genome aligner and discuss effects of the memory prefetch.



page 1

page 2

page 3

page 4


Cartesian Tree Matching and Indexing

We introduce a new metric of match, called Cartesian tree matching, whic...

Scout Algorithm For Fast Substring Matching

Exact substring matching is a common task in many software applications....

An Efficient Implementation of Manacher's Algorithm

Manacher's algorithm has been shown to be optimal to the longest palindr...

copMEM: Finding maximal exact matches via sampling both genomes

Genome-to-genome comparisons require designating anchor points, which ar...

The colored longest common prefix array computed via sequential scans

Due to the increased availability of large datasets of biological sequen...

SOPanG 2: online searching over a pan-genome without false positives

The pan-genome can be stored as elastic-degenerate (ED) string, a recent...

Periodicity in Data Streams with Wildcards

We investigate the problem of detecting periodic trends within a string ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

FM-index has been developed as a space efficient index for string matching. The backward search over the index finds exact matches of a pattern in time that is linear relative to the length of the pattern regardless the size of the reference. Even though the applications are numerous, for instance text compression [Navarro07], and indexing [Zhang13], FM-index has become especially popular with the developers of DNA sequence aligners like Bowtie [Langmead09], SOAPv2 [Soap09], and BWA [Li09]. Next, we introduce the fundamentals of FM-index construction and operation.

2 Background

Let be a string of length over some alphabet . A special character that is not part of the alphabet and is lexicographically smaller than any character in is appended to the end of the string. denotes a character in R at position , and is a substring of R ranging from to . Suffix array is then defined as an integer array containing starting positions of all suffixes of in a sorted order so that


Suffix array could be constructed by simply sorting all suffixes of a string. More sophisticated algorithms take into account the fact that all strings are related to each other and achieve much better asymptotic complexity and practical performance [Puglisi07].

4 $ $ACA G 0 0 1 0
0 ACAG$ ACAG $ 0 0 1 0
2 AG$ AG$A C 0 1 1 0
1 CAG$ CAG$ A 1 1 1 0
3 G$ G$AC A 2 1 1 0
C= 0 2 3 4 4
Figure 1: Data structures comprising FM-index are in boxes.

Burrows-Wheeler Matrix is obtained by writing out all rotations of string and sorting them lexicographically. The last column of the then forms a string known as the Burrows-Wheeler Transform . Sorting string rotations is closely related to sorting prefixes as shown in Figure 1, and can be easily obtained form :


Next, for each character in and for every we record the number of occurrences of b in the substring , and store it in the table . Additionally, we store the total of occurrences of all characters lexicographically preceding in into the table . It is easy to compute as an exclusive prefix sum of the last row of . For any single character at position in a pattern , the interval of rows in the BWM starting with this character is easily computed from :


From this initial interval, it is possible to extend the search backward from the starting position using the following recursive procedure:

ExactRecur(W, i, k, l)
 if  then
 return ExactRecur(W, i-1, k, l)
Listing 1: Backward Exact Match.

After the search is complete, the final BWT interval is mapped back to the locations in reference using the suffix array.

BWA [Li09] extends this algorithm to allow a predetermined number of mismatches :

InexRecur(W, i, z, k, l)
 if  then
 if  then
 for each  do
  if  then
  if  then
 return I
Listing 2: Inexact Match.

Note that every step of the inexact match algorithm consumes eight values compared to two values required by the exact match. In theory, all values could be precomputed, but holding the full array for the human genome reference would consume approximately 100GB of memory. To save memory space, the FM-index over the DNA alphabet is often stored using a cache-friendly approach introduced in [Gog14], that harkens back to the bucket layout from the original FM-index paper [Ferragina00]. Values of for every k that is a multiple of 128 are stored in memory followed by 128 characters of BWT in 2-bit encoding. Four Occ counters occupy 256 bits, as does the BWT string. Such data arrangement aligns well with AVX vector operations. For counters that are not at the factor of 128 positions, the values must be calculated on the fly. Furthermore, suffix array is compressed it a similar manner. Only values of where is a multiple of 32 are stored in memory, while all the values in between are recomputed using the Inverse Suffix Array relationships:


It means that the Equation 4 is applied over and over until for some j the result comes out to be a multiple of 32, and the SA value could be constructed according to Equation 5.

Even though the memory saving measures do not change the asymptotic complexity of the match algorithm, in reality they add hundreds of computations of to every search. Given that the search is performed multiple times for each and every read out of billions required for the alignment of a human genome, function performance becomes crucial.

3 Solution

Our approach to computing could be traced back to the algorithm by [Vigna08], that performs memory table lookups to count character occurrences in each byte of the BWT string. We replace the memory lookups with the half-byte register lookups, building on an idea first proposed by Mula for the bit population count [Mula17]. Note however that we do not attempt to reduce the character counting problem to bit counting, and apply the half-byte technique directly to the BWT string.

The input BWT string is masked with zeros beforehand for situations when the character at position is in the middle of the byte. The result of 256-bit occurrence count should be corrected for the extra A characters.














vpsrld, vpand









Figure 2: Example of counting letter ’g’ in a 32-character portion of BWT string.

3.1 Lookup

Every byte in the BWT string is split into its higher and lower half. Since each half byte value cannot be greater than 15, the lookup values now fit into a single vector register and could be retrieved via the VPSHUFB instruction. The lookup returns all four counters in a 2-bit format packed into a byte. Two bits are sufficient as a half byte contains just two characters. Additionally, the result is pre-converted into its bit complement to assist with subsequent extraction operation.

3.2 Extraction

After the lookup phase, all counters are tightly packed into two vector registers. Any addition operation could result in overflowing the 2-bit values. Before proceeding we have to extract counters for a given character X.

Both and are shifted to bring into the two lower bits of the byte. High bits of are then filled with ones and high bits of – with zeros. At this point contains unsigned byte values of and contains values of . The merged values are then fed into the VPSADBW instruction. It sums the absolute differences of eight consecutive bytes and stores the result as a 64-bit integer. At the end of this operation the result vector contains four partial sums in the form of . The final horizontal sum yields , and the final subtraction is combinable with the correction for the extra A count.

3.3 Aggregation

The extraction sequence runs four times to collect partial sums for characters ACGT, CATG, TGCA, and GTAC in four vector registers. Aggregating four final sums within a single register then takes just three additions and three shuffle operations, only one of which crosses the 128-bit lane boundary. The absence of data dependencies between extraction operations facilitates efficient use of the SIMD pipeline and keeps all arithmetic ports busy. Outputs for all eight counters required for one step of the inexact search fit into a single AVX512 register and are computed in one vector pass.





T, G, C, A




A, C, G, T

A, C, G, T



Figure 3: Data shuffling.

4 Implementation

We have implemented the half-byte algorithm using the AVX2 instruction set. The AVX512 version computing 8 values in parallel has also been integrated in the inexact search algorithm. The assembly code along with the Intel Architecture Code Analyzer throughput report is listed in the Appendix. The code is indeed well balanced across the ports but is expected to bottleneck on the backend meaning that the memory access pattern is crucial for the real word performance.

4.1 Experimental Setup

The computer platform is an Intel Xeon Platinum 8168 system with 16 cores running at 2.7 GHz and 32GB of RAM. To test the software performance we have run the BWA alignment tool with 16 threads (-t 16) on a 30X Human genome sample NA12878 from the 1000 Genomes database using hg38 as a reference. We have executed the BWA version 7.15 to establish the baseline, and then replaced the code with our AVX2 and AVX512 implementations. The total runtimes got collected from the BWA reports, and the percentage of time spent in the BWA and SA code measured via profiling.

4.2 Results