FM-index has been developed as a space efficient index for string matching. The backward search over the index finds exact matches of a pattern in time that is linear relative to the length of the pattern regardless the size of the reference. Even though the applications are numerous, for instance text compression [Navarro07], and indexing [Zhang13], FM-index has become especially popular with the developers of DNA sequence aligners like Bowtie [Langmead09], SOAPv2 [Soap09], and BWA [Li09]. Next, we introduce the fundamentals of FM-index construction and operation.
Let be a string of length over some alphabet . A special character that is not part of the alphabet and is lexicographically smaller than any character in is appended to the end of the string. denotes a character in R at position , and is a substring of R ranging from to . Suffix array is then defined as an integer array containing starting positions of all suffixes of in a sorted order so that
Suffix array could be constructed by simply sorting all suffixes of a string. More sophisticated algorithms take into account the fact that all strings are related to each other and achieve much better asymptotic complexity and practical performance [Puglisi07].
Burrows-Wheeler Matrix is obtained by writing out all rotations of string and sorting them lexicographically. The last column of the then forms a string known as the Burrows-Wheeler Transform . Sorting string rotations is closely related to sorting prefixes as shown in Figure 1, and can be easily obtained form :
Next, for each character in and for every we record the number of occurrences of b in the substring , and store it in the table . Additionally, we store the total of occurrences of all characters lexicographically preceding in into the table . It is easy to compute as an exclusive prefix sum of the last row of . For any single character at position in a pattern , the interval of rows in the BWM starting with this character is easily computed from :
From this initial interval, it is possible to extend the search backward from the starting position using the following recursive procedure:
After the search is complete, the final BWT interval is mapped back to the locations in reference using the suffix array.
BWA [Li09] extends this algorithm to allow a predetermined number of mismatches :
Note that every step of the inexact match algorithm consumes eight values compared to two values required by the exact match. In theory, all values could be precomputed, but holding the full array for the human genome reference would consume approximately 100GB of memory. To save memory space, the FM-index over the DNA alphabet is often stored using a cache-friendly approach introduced in [Gog14], that harkens back to the bucket layout from the original FM-index paper [Ferragina00]. Values of for every k that is a multiple of 128 are stored in memory followed by 128 characters of BWT in 2-bit encoding. Four Occ counters occupy 256 bits, as does the BWT string. Such data arrangement aligns well with AVX vector operations. For counters that are not at the factor of 128 positions, the values must be calculated on the fly. Furthermore, suffix array is compressed it a similar manner. Only values of where is a multiple of 32 are stored in memory, while all the values in between are recomputed using the Inverse Suffix Array relationships:
Even though the memory saving measures do not change the asymptotic complexity of the match algorithm, in reality they add hundreds of computations of to every search. Given that the search is performed multiple times for each and every read out of billions required for the alignment of a human genome, function performance becomes crucial.
Our approach to computing could be traced back to the algorithm by [Vigna08], that performs memory table lookups to count character occurrences in each byte of the BWT string. We replace the memory lookups with the half-byte register lookups, building on an idea first proposed by Mula for the bit population count [Mula17]. Note however that we do not attempt to reduce the character counting problem to bit counting, and apply the half-byte technique directly to the BWT string.
The input BWT string is masked with zeros beforehand for situations when the character at position is in the middle of the byte. The result of 256-bit occurrence count should be corrected for the extra A characters.
Every byte in the BWT string is split into its higher and lower half. Since each half byte value cannot be greater than 15, the lookup values now fit into a single vector register and could be retrieved via the VPSHUFB instruction. The lookup returns all four counters in a 2-bit format packed into a byte. Two bits are sufficient as a half byte contains just two characters. Additionally, the result is pre-converted into its bit complement to assist with subsequent extraction operation.
After the lookup phase, all counters are tightly packed into two vector registers. Any addition operation could result in overflowing the 2-bit values. Before proceeding we have to extract counters for a given character X.
Both and are shifted to bring into the two lower bits of the byte. High bits of are then filled with ones and high bits of – with zeros. At this point contains unsigned byte values of and contains values of . The merged values are then fed into the VPSADBW instruction. It sums the absolute differences of eight consecutive bytes and stores the result as a 64-bit integer. At the end of this operation the result vector contains four partial sums in the form of . The final horizontal sum yields , and the final subtraction is combinable with the correction for the extra A count.
The extraction sequence runs four times to collect partial sums for characters ACGT, CATG, TGCA, and GTAC in four vector registers. Aggregating four final sums within a single register then takes just three additions and three shuffle operations, only one of which crosses the 128-bit lane boundary. The absence of data dependencies between extraction operations facilitates efficient use of the SIMD pipeline and keeps all arithmetic ports busy. Outputs for all eight counters required for one step of the inexact search fit into a single AVX512 register and are computed in one vector pass.
We have implemented the half-byte algorithm using the AVX2 instruction set. The AVX512 version computing 8 values in parallel has also been integrated in the inexact search algorithm. The assembly code along with the Intel Architecture Code Analyzer throughput report is listed in the Appendix. The code is indeed well balanced across the ports but is expected to bottleneck on the backend meaning that the memory access pattern is crucial for the real word performance.
4.1 Experimental Setup
The computer platform is an Intel Xeon Platinum 8168 system with 16 cores running at 2.7 GHz and 32GB of RAM. To test the software performance we have run the BWA alignment tool with 16 threads (-t 16) on a 30X Human genome sample NA12878 from the 1000 Genomes database using hg38 as a reference. We have executed the BWA version 7.15 to establish the baseline, and then replaced the code with our AVX2 and AVX512 implementations. The total runtimes got collected from the BWA reports, and the percentage of time spent in the BWA and SA code measured via profiling.