Suffix Trees, DAWGs and CDAWGs for Forward and Backward Tries
The suffix tree, DAWG, and CDAWG are fundamental indexing structures for a string, with numerous applications in bioinformatics, information retrieval, data mining, and other areas. An edge-labeled rooted tree (trie) is a natural generalization of a string. Breslauer [TCS 191(1-2): 131-144, 1998] proposed the suffix tree for a backward trie, where the strings in the trie are read in the leaf-to-root direction. In contrast to a backward trie, we call an ordinary trie a forward trie. Despite a few follow-up works after Breslauer's paper, indexing forward and backward tries is not yet well understood. In this paper, we give a full perspective on the sizes of indexing structures such as suffix trees, DAWGs, and CDAWGs for forward and backward tries. In particular, we show that the size of the DAWG for a forward trie with n nodes is Ω(σ n), where σ is the number of distinct characters in the trie. This becomes Ω(n^2) for a large alphabet. Nevertheless, we show that there is a compact O(n)-space representation of the DAWG for a forward trie over any alphabet, and we present an O(n σ)-time, O(n)-space algorithm that constructs such a representation of the DAWG for a growing forward trie.
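To make the two reading directions concrete, the following is a minimal Python sketch, not the paper's construction. It builds a small trie and lists the strings obtained by reading from each node toward the root (the backward-trie view) and from each node down to a leaf (the forward-trie view). The class name `Trie` and the methods `insert`, `backward_suffixes`, and `forward_suffixes` are hypothetical names chosen for this illustration.

```python
class Trie:
    """Illustrative trie with parent pointers (names are hypothetical)."""

    def __init__(self):
        # Node 0 is the root; parallel lists hold per-node data.
        self.parent = [None]
        self.label = [""]      # character on the edge from the parent
        self.children = [{}]   # character -> child node id

    def insert(self, word):
        """Insert a word along a root-to-leaf path, creating nodes as needed."""
        node = 0
        for ch in word:
            nxt = self.children[node].get(ch)
            if nxt is None:
                nxt = len(self.parent)
                self.parent.append(node)
                self.label.append(ch)
                self.children.append({})
                self.children[node][ch] = nxt
            node = nxt
        return node

    def backward_suffixes(self):
        """Strings read from each non-root node up to the root."""
        out = set()
        for v in range(1, len(self.parent)):
            chars = []
            u = v
            while u != 0:
                chars.append(self.label[u])
                u = self.parent[u]
            out.add("".join(chars))
        return out

    def forward_suffixes(self):
        """Strings read from each node down to a leaf below it."""
        out = set()

        def walk(v, prefix):
            if not self.children[v]:
                out.add(prefix)  # reached a leaf; record the path label
                return
            for ch, child in self.children[v].items():
                walk(child, prefix + ch)

        for v in range(len(self.parent)):
            walk(v, "")
        return out


trie = Trie()
trie.insert("ab")
trie.insert("ac")
print(sorted(trie.backward_suffixes()))  # strings read toward the root
print(sorted(trie.forward_suffixes()))   # strings read toward the leaves
```

In this sketch each non-root node contributes exactly one leaf-to-root string, whereas a node in the forward view contributes one string per leaf beneath it. The sketch enumerates strings explicitly only for illustration; it does not build a suffix tree, DAWG, or CDAWG.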