Introduction
The applications of pattern matching are tremendous and in practice in minutely basis over the globe. Almost every aspect of our lives involves us to search for data in a reference that is short data (small document or small database) or big data (DNA data, internet webpages, banking data, etc). The input is a text () of length over an alphabet of size where the length of pattern () is and number of allowed errors (mismatch, insertion, or deletion) is . The output are the starting positions in of the sub-sequences that are at Hamming (or edit distance) with .
Exact matching, in which the value is zero, is the simple form of pattern matching problem. Many structures solved the exact matching problem in optimal time and space (linear) [1]. This includes suffix trees [2, 3, 4], suffix arrays [5], and FM-index [6]. While, approximate pattern matching, where the value of
is one or more, has not been solved in deterministic linear time and space. Due to this and in order to obtain as much as practical solution as possible, solutions for approximate matching problem is solved by using structures that solve the exact matching (the above structures as examples) followed by heuristic techniques. Tools that are solving reads-to-genome alignment problem, which is the main motivation of the work in this paper, are examples of this approach. More reviews and surveys for reads-to-genome alignment solutions and approximate pattern matching in general can be found at
[7, 8, 9, 10].Given the larger constant factor of suffix trees when compared to other linear structures such as suffix arrays or FM-index, the design of suffix tree structure is more flexible and dynamic to tackle string problems where approximate pattern matching is one of them.
Methods
The proposed structure in this paper involves the following four phases where further explanations, elaborations, and motivations for each phase are provided in the Supplementary Material. The first stage is to build suffix tree of the input data and preprocess the tree. Secondly, building a novel index that index all suffixes under all internal nodes in the suffix tree in linear time and with maintaining the inter-connectivity among the suffixes under different internal nodes. The third stage is to collect keys of nodes at depth in the path of each suffix in the input data and map these outcomes to the index in stage 2. Lastly, the algorithm based on error tree to compute the approximate pattern matching is described.
Stage 1: Building suffix tree
Build a suffix tree () for ensuring the following attributes are constructed with the : attribute at each node which is equal to the lengths of all edges from root to the node, attribute at each node which is equal to the index in of the starting character of string extracted from root to that node, attribute at each node that stores the memory location of parent node, and (which must be originally constructed in ) should be preserved. The cost of this stage is linear in space and time.
Stage 2: Building a novel index
In a general analysis of suffix trees, there are main challenges that are easily faced and needed to be carefully addressed in order to resolve the approximate matching problem more efficiently or optimally. Firstly, under different internal nodes there are similar suffixes; how can we track these same suffixes across different nodes so that if a process is performed on a single suffix then we can apply or just link the outcomes of this process to the same suffix in different node/s and save the costs of performing the same process again and again. Secondly, the structure under an internal node is symmetric partially or fully to the structure of other internal nodes (we mean by ”partially symmetric” a subtree under an internal node is symmetric with the subtree under another internal nodes); how can we trace and embed these interconnectivity across different internal nodes.
The possible answer to the first challenge is to build a global index for each suffixes, then record the index value of each suffix under each internal node. Away from the cost (where is the height of ) of this indexing schema, the index values that will be recorded in some internal node will be distributed randomly. This will lead to a costly computation when the index values of two different nodes need to be compared or intersected for some purposes. In addition, given that this indexing schema will be useful when two internal nodes are fully symmetric, but it will not be highly useful in case two internal nodes are partially symmetric or asymmetric (which are the common cases). So, the indexing schema can not be arbitrary and has to follow some structure and take into account (and take advantage) of the interconnectivity among different internal nodes, but how to find or create this structure and from which suffix/node should I start the indexing schema and in which order. The answers for these questions are in the following structure.
As we need to refer to this index and to distinguish it from suffix indexes in , let’s denote this index as OT (the reason for this naming is in the Acknowledgment section) and before showing how to build the index, the following observations must be introduced.
First observations, which is the key building block in construction the aforementioned index, is the following. If node has a suffix link to node , then all the set of suffixes under node , denoted as subset , must be a subset or equal to the set of the suffixes under node denoted as subset . This indicates that if we assign index values to the suffixes in subset , then these suffixes will be implicitly indexed in subset and we just need to assign new index values to the indexes . Note that this process will work recursively, in other words, if node has a to node , then we will just need to assign index values to the set where is the set of suffixes under node and there will be no computation or indexing process associated with the set of as they are already covered in index.
Secondly, the structure of includes the fact that an internal node, let’s say node , may have up to nodes with linking to it. Now, in order to build index correctly, we should start indexing all suffixes under each node with linking to node before indexing node . This indicates a postorder traversal process, hence, we must construct a tree structure in order to perform this postorder traversal.
Building tree
As we need to refer to this tree structure and to distinguish it from tree and other tree structures, the name of this tree is tree (the reason for this naming is in the Acknowledgment section). tree is constructed by reversing the suffix links in as the following:
-
The root of tree is the root of .
-
Internal nodes are all internal nodes in with at least one incoming .
-
Leaf nodes are all internal nodes in with only outgoing (no incoming ).
-
There is a directed edge from node to node if has to node .
.
Leaf nodes in are not included in tree as there is no outgoing from or incoming to them. Note that by the construction properties of and suffix links, tree will be a directed acyclic graph. Clearly, the space and time cost for building tree is linear (). The tree structure can be built implicitly (inside tree) or explicitly (outside ). For graphical representation of , this webtool https://hwv.dk/st/?$ can draw tree nicely, place the string between ? and $.
Building index
Building index requires both and trees. The following steps describe how to build the index:
-
Initialize a counter, , to zero and a list, , with size equal to the number of leaves in with each element in the list is initialized to -1 (for instance).
-
Traverse tree using postorder method. At each traversed node, let’s say node , scan each leaf node, let’s say leaf node which is within the construction of tree not tree, under and compute the length of the suffix from node to leaf . This length can be computed as: of + of node . Now, check if equals to -1. If yes, let equal to and then increment . If not, then do nothing (which means this suffix length is already covered in index).
-
Perform the second step until postorder traversal is completed.
-
For each node, let’s say node , in tree, set two attributes: and . Assign the value of index of the length of suffix from node to under node to the attribute; similarly for the attribute . This process can be performed within the second step.
Clearly, the space cost of is linear but the time cost will be as the leaf nodes under each internal node in should be processed. However, the following algorithm shows how to build index in linear time.
Linear time construction of index
The non-linear time cost is due to the trivial process of checking whether each suffix under each internal node is already indexed in index. This process is needed to find the set under node and index them in index (as we know already that the set was already indexed when we visited node and we know that set is a subset of set .
Let’s denote the set of suffixes that need to be indexed in index under node to be . Note that the at all internal nodes will be bounded to , then the algorithm that could find directly all of all internal nodes (without processing all suffixes under each internal node) will cost linear time. The following paragraphs describes the linear algorithm to achieve so.
After long and deep thoughts and analysis, the of node could be found using both structures of and trees and processing child (direct child) leaf nodes of node , of child (direct child) internal nodes of node , nodes that link (back and forth) to node and its child internal nodes. So, using both and trees, a postorder traversal of tree, and several rules/tricks (that can be derived from python code snippet in Listing 1, of all internal nodes can be computed in linear time and space. Few minor rules/tricks, that until now cost overhead on true negative cases more then saving costs on true positive cases, are currently under investigations and implementation for inclusion or exclusion.
Next, we traverse in postorder method and build index using the of all internal nodes only (again without processing all suffices under all internal nodes). Moreover, as a validation metric and as shown in the implementation, the above traversal does not involve checking whether any suffix in any list is already existed in the (it involves processing the lists directly without any checking which indicates their correctness),
As the eventual upper bound of index is linear (in which the and the already covered suffixes are indexed) and as the upper bound of all of all internal nodes is also linear (in fact they count almost half of the suffixes), the cost of the algorithm is linear (the linear bound was also verified by implementation).
Another expectedly linear algorithm in which suffixes indexes of each leaf node are processed in a bottom-up approach. Briefly, starting from each leaf node, we walk to parent node and check whether the previous suffix index () is existed (covered) under any node from the nodes that link to the parent node. If no, then append to the of parent node and walk up to the upper parent node. If yes, then stop the walk. The process proceeds until root node is reached if needed. This algorithm is under implementation and testing. The initial intuition of the cost of this algorithm is , but there are some rules/tricks that can be applied which may render the algorithm to be linear. With or without these rules/tricks, we will check, proof, and show its linearity (or non-linearity) and compare it with the trivial algorithm and the linear one.
Stage 3: Building Error Tree
Note that index can be useful for different string processing problems not only for approximate pattern matching, so its description and presentation is independent. Now, for approximate pattern matching problem, we are using some processes performed earlier in error tree structure with major modifications made due to the presence of index.
-
Set a key attribute for each leaf node where the key of leftmost leaf node must be 0, the next leaf node is 1, and proceed until the rightmost leaf node which must have a key equal to the number of leaf nodes minus 1. Likewise, set attribute named as to each internal node in . The assignment of unique keys for internal nodes should not intersects with leaf-nodes’ keys. The reason for separating the keys assignment between leaf nodes and internal nodes is to make the keys of leaf nodes serialized (from left to right) so they can be retrieved faster. The cost of this process is linear.
As the pattern length is given, one may trim the tree then records the indexes of suffixes of the trimmed nodes in the new leaf nodes of the trimmed tree. This reduces the space cost especially when the pattern length is short and in the case the final structure may be serialized and de-serialized for later usage. This trimming process is omitted in this paper.
-
Initialize a dictionary (or list), denoted as , where keys are integers and values are lists. Now, collect for each suffix the key of the first node with depth equal to or more than , let this key be , then append to the list of the value of where is the index of suffix in the input data (the reason for subtracting 1 will be shown in the next section). As we will need to store a single key for each suffixes, the cost of this process is linear in space and time.
-
Once traversal has finished, then for each key in initialize an empty list, . Now, for each suffix index, in find and append it to . Once all elements in appended to , sort and assign it back to . As finding index value will cost a constant time and the list of contains integer values (hence sorting will cost linear time) and as all contain in total suffixes, the cost of this process is linear.
Searching for approximate pattern matching
The following algorithm shows how to search for pattern of length with up to Hamming distance (edit distance will be described later).
Let’s assume , find the key of the node that reached by each suffix of the pattern. If a suffix ends in the middle of an edge, then return the key of the sink node of that edge. This process will cost linear time and space using the suffix links of . Note that the keys of some suffixes will be the same, so remove duplicated keys. Moreover, record for each suffixes the number of mismatches occurred on the edges.
Now, walk the pattern in . If the walk is on an edge and a mismatch occurred at position , then proceed the walking as exact matching until the end of pattern reached. If reached, report approximate matching at position which occurred at positions equal to the suffix indexes under the reached point. If not, report no approximate matching with value.
If the walk reach no mismatches on an edge and reach an internal node, let’s say node , then find the positions of and in the list of where is the key of pattern’s suffix . Now, map back the index values found in between and , if any, to their suffix indexes and report these indexes as the approximate matching of the pattern at position equal to the of . Continue the searching using the above manner until the end of pattern is reached.
For the case of , after inserting the and , get the first index value to the right of the and the first left index value to the left of the and proceed with these values in the next insertion for the upcoming . Proceed in similar fashion.
The time cost for search for a pattern of length will be in the worst case where tree is unbalanced and in the average case where is balanced. Note that the approximate matching outcomes of can be used for approximate matching process of . Lastly, given that upper bound of the number of patterns that are at Hamming distance (combinations) with a pattern of length is , however, merging and sorting the values of and of each internal node in with all index values in (which needs additional space) then apply few trivial procedures can contribute in linear searching time (this algorithm is under implementation).
Category | Organism | Accession | Size (byte) |
---|---|---|---|
Virus | Gordoniaphage GAL1[13] | GCF 001884535.1 | 50,654 |
Bacteria | WS1 bacterium JGI 0000059-K21[14] | GCA 000398605.1 | 521,951 |
Protist | Astrammina rara[14] | GCA 000211355.2 | 1,712,167 |
Fungus | Nosema ceranae[14] | GCA 000988165.1 | 5,809,207 |
Protist | Cryptosporidium parvumIowa II[14] | GCA 000165345.1 | 9,216,802 |
Protist | Spironucleus salmonicida[14] | GCA 000497125.1 | 13,142,503 |
Protist | Tieghemostelium lacteum[14] | GCA 001606155.1 | 23,672,980 |
Fungus | Fusarium graminearumPH-1 [13] | GCF 000240135.3 | 36,915,673 |
Protist | Salpingoeca rosetta[14] | GCA 000188695.1 | 56,150,373 |
Results
The structure proposed in this paper was implemented using python language (python3). First process in the implementation is building . For building , we used a python package (urlhttps://pypi.org/project/suffix-trees/) without any modification and ensuring that the attributes of are implemented at each node.
Next, was traversed using iterative postorder method where in this traversal the following preprocessing steps were computed: assigning serialized keys to leaf nodes from left to right, assigning keys to internal nodes (where these keys do not intersect with the keys of the leaf nodes), collect the keys of nodes at depth for each suffix in the input data ( dictionary), construct tree, set attributes of key_of_leftmost_leaf_node and key_of_rightmost_leaf_node to each internal node where the value of key_of_leftmost_leaf_node equals the key of leftmost leaf node under the internal node (key_of_rightmost_leaf_node likewise), and create two auxiliary lists. First list is to store the suffixes indexes of leaf nodes from left to right. The second list maps suffix index of leaf node to the key of that leaf node; as an example, if the suffix index stored at leaf node is , then store at position of the list the key of the leaf node. The size of each list is linear (equals to the number of leaf nodes).
For building index, an iterative postorder traversal of is computed to collect the (discussed in the section of Building index) at each internal node. Next, an iterative postorder traversal of tree is computed to build the index. Lastly, map/convert the suffixes in each key of dictionary to their index values then sort the resultant values.
For testing the implementation of , we used ten genomes, listed in Table 1, ranging in size from 50KB to 100MB. The time cost for each step involved in the linear building structure is listed in Table 1. For each genome, we built structure for value of 30. Table2 shows the time needed to build error tree for each genome. Note that building of index took time (and space) close to building . This applies also to the step of preprocessing tree.
Category | Size (byte) | Building (minute) | Preprocessing (minute) | Building index (minute) | Processing (minute) | Total time (minute) |
---|---|---|---|---|---|---|
Gordonia- | 50,654 | 0.01 | 0.01 | 0.01 | 0.00 | 0.02 |
WS1 bacterium | 521,951 | 0.10 | 0.10 | 0.11 | 0.01 | 0.32 |
Astrammina | 1,712,167 | 0.30 | 0.29 | 0.27 | 0.02 | 0.90 |
Nosema | 5,809,207 | 1.37 | 1.16 | 1.38 | 0.09 | 4.00 |
Cryptosporidium | 9,216,802 | 2.26 | 1.93 | 2.31 | 0.15 | 6.65 |
Spironucleus | 13,142,503 | 3.42 | 2.84 | 2.68 | 0.20 | 9.14 |
Tieghemostelium | 23,672,980 | 6.06 | 6.41 | 5.00 | 0.38 | 17.85 |
Fusarium | 36,915,673 | 9.77 | 8.42 | 10.07 | 0.63 | 28.90 |
Salpingoeca | 56,150,373 | 15.65 | 13.17 | 16.06 | 0.93 | 45.82 |
Chondrus | 106,387,446 | 29.02 | 27.28 | 34.04 | 1.61 | 91.96 |
The structure of allows to handle different values of without the need to reconstruct or re-process tree, index, or tree. All what is needed is to construct dictionary for each value then perform the last step in stage 3 for each dictionary. Lastly, insertions and deletions can be naively tackled by structure.
Lastly, the estimated constant factor for the linear algorithm is a bit large (10-16). To test this, we run the trivial algorithm, that needs
time for building index, on each genome in the dataset (data to be collected and shown). We could notice that the trivial is faster than the linear one on small genomes, as expected, but for large genomes the linear algorithm is much faster. This is due to the constant factor of linear algorithm being larger than factor and the fact that linear algorithm needs to perform two traversal (traverse and then tree) while the trivial algorithm perform only a single traversal ( tree).The run time for each genome is presented in Table 3
Data and materials availability: To be published.
Acknowledgment The name tree stands for Okaily-Sheehy-Huang-Rajasekaran which are the last name of the first author and the last names of his PhD committee in the University of Connecticut, Department of Computer Science, where the initial version of error tree structure was a chapter in the PhD dissertation. The committee members were: Chun-Hsi Huang (Major Advisor), Sanguthevar Rajasekaran, and Don Sheehy. This meant to tribute them and appreciate their kind, influential, and professional teaching and supervision. The name stands for the last names of the authors of this work.
References and Notes
- [1] S. I. Hakak, et al., IEEE access 7, 69614 (2019).
- [2] P. Weiner, 14th Annual Symposium on Switching and Automata Theory (swat 1973) (IEEE, 1973), pp. 1–11.
- [3] E. M. McCreight, Journal of the ACM (JACM) 23, 262 (1976).
- [4] E. Ukkonen, Algorithmica 14, 249 (1995).
- [5] M. I. Abouelhoda, S. Kurtz, E. Ohlebusch, Journal of discrete algorithms 2, 53 (2004).
- [6] P. Ferragina, G. Manzini, Proceedings 41st Annual Symposium on Foundations of Computer Science (IEEE, 2000), pp. 390–398.
- [7] M. Alser, et al., Genome biology 22, 1 (2021).
- [8] G. Kucherov, Bioinformatics 35, 3547 (2019).
- [9] S. Canzar, S. L. Salzberg, Proceedings of the IEEE 105, 436 (2015).
- [10] G. Kucherov, K. Salikhov, D. Tsur, Theoretical Computer Science 638, 145 (2016).
- [11] A. Al-Okaily (2016).
- [12] A. Al-Okaily, Journal of Computational Biology 22, 1118 (2015).
- [13] N. A. O’Leary, et al., Nucleic acids research 44, D733 (2016).
- [14] K. Clark, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, E. W. Sayers, Nucleic acids research 44, D67 (2016).
Comments
There are no comments yet.