Block Palindromes: A New Generalization of Palindromes

06/01/2018
by   Keisuke Goto, et al.
FUJITSU

We propose a new generalization of palindromes and gapped palindromes called block palindromes. A block palindrome is a string that becomes a palindrome when identical substrings are replaced with a distinct character. We investigate several properties of block palindromes and, in particular, study substrings of a string which are block palindromes. In so doing, we introduce the notion of a maximal block palindrome, which leads to a compact representation of all block palindromes that occur in a string. We also propose an algorithm which enumerates all maximal block palindromes that appear in a given string $T$ in $O(|T| + \|MBP(T)\|)$ time, where $\|MBP(T)\|$ is the output size, which is optimal unless all the maximal block palindromes can be represented in a more compact way.


1 Introduction

A palindrome is a string that is equal to its reverse, e.g., “AblewasIereIsawElba” (for simplicity, we treat upper-case and lower-case characters as the same). Palindromes have been studied in combinatorics on words and stringology.

Much research has focused on finding the palindromic structure of a string. Manacher [12] proposed a beautiful algorithm that enumerates all maximal palindromes of a string. Kosolobov et al. [11] proved that the language $\mathrm{Pal}^k$ is recognizable in $O(kn)$ time, where $\mathrm{Pal}$ is the language of all nonempty palindromes and $n$ is the length of an input string. Alatabbi et al. [2] considered maximal palindromic factorization, in which all factors are maximal palindromes. They also considered the problem of computing the fewest palindromic factorization, and proposed off-line linear-time algorithms. Later, I et al. [9] and Fici et al. [4] independently proposed on-line $O(n \log n)$-time algorithms, where $n$ is the length of an input string. Similar problems were also considered, such as computing the palindromic length [3], computing palindromic covers [9], and palindrome pattern matching [8].

A gapped palindrome is a generalization of a palindrome that becomes a palindrome when a center substring is replaced by a character, where the center substring is a substring whose beginning and ending positions are equally far from the beginning and ending positions of the input string, respectively. For example, “Madam,heisAdam” is a gapped palindrome, and it becomes a palindrome if the center substring “m,heis” is replaced by a character. Gapped palindromes play an important role in molecular biology since they model the hairpin structure of DNA and RNA sequences; see, e.g., [14]. Several problems have been considered, such as enumeration of exact gapped palindromes of a string [10], enumeration of approximate gapped palindromes [13, 7], and finding maximal-length long-armed or length-constrained gapped palindromes [5].

In this paper, we consider the notion of block palindromes [1], which is a new generalization of palindromes and also of gapped palindromes. (Footnote 1: Block palindromes were first introduced in a problem of the 2015 British Informatics Olympiad [1], but we were not aware of it when we wrote the first version of this paper.) A block palindrome is a string that becomes a palindrome when identical substrings are replaced with a distinct character. More precisely, a block palindrome is a “symmetric” factorization $f_{-k} \cdots f_{-1} f_0 f_1 \cdots f_k$ of a string, where the center factor $f_0$ is a string (which may be empty) and every other factor is a non-empty string with $f_i = f_{-i}$ for any $1 \le i \le k$. We also call a factor a block. For example, a factorization “To|kyo||and||Kyo|to” is a block palindrome, where “|” is a mark to distinguish adjacent blocks. Palindromes and gapped palindromes are special cases of block palindromes: for a palindrome, all blocks are characters, and for a gapped palindrome, the center block is a string and the other blocks are characters.
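The definition above can be checked mechanically. The following is a minimal sketch, not part of the original paper: the function name `is_block_palindrome` and the convention that a list of even length has an empty center block are assumptions made only for illustration.

```python
def is_block_palindrome(blocks):
    """Return True if `blocks` is a symmetric factorization.

    Assumption of this sketch: a list of odd length keeps its center block
    in the middle, and a list of even length has an empty center block.
    """
    n = len(blocks)
    for i in range(n // 2):
        # Every non-center block must be non-empty and equal to its mirror.
        if not blocks[i] or blocks[i] != blocks[n - 1 - i]:
            return False
    return True


# Upper-case and lower-case characters are treated as the same, as in the text.
blocks = [b.lower() for b in "To|kyo|and|Kyo|to".split("|")]
print(is_block_palindrome(blocks))         # True
print(is_block_palindrome(list("level")))  # True: an ordinary palindrome
```

An ordinary palindrome passes the check with single-character blocks, which matches the remark that palindromes are a special case.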

We investigate several properties of block palindromes. We introduce the notion of maximal block palindromes to concisely represent all block palindromes in a string, and propose an algorithm which enumerates all maximal block palindromes in a string $T$ in $O(|T| + \|MBP(T)\|)$ time, where $\|MBP(T)\|$ is the output size. This is optimal unless all the maximal block palindromes can be represented in a more compact way.

2 Preliminaries

Let $\Sigma$ be an integer alphabet. An element of $\Sigma^*$ is called a string. The string of length 0 is called the empty string, and is denoted by $\varepsilon$. Although $\varepsilon$ is not contained in $\Sigma$, we sometimes call $\varepsilon$ the empty character for convenience. For a string $T = xyz$, $x$, $y$, and $z$ are called a prefix, substring, and suffix of $T$, respectively. In particular, a prefix (resp. suffix) $x$ of $T$ is called a proper prefix (resp. suffix) iff $x \ne T$. A non-empty string that is a proper prefix and also a proper suffix of $T$ is called a border of $T$. Hence, a string of length $n$ can have at most $n - 1$ borders of length ranging from 1 to $n - 1$. A string which does not have any borders is called an unbordered string. For $1 \le i \le j \le |T|$, a substring of $T$ which begins at position $i$ and ends at position $j$ is denoted by $T[i..j]$. For convenience, let $T[i..j] = \varepsilon$ if $i > j$.
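To make the notion of a border concrete, here is a small sketch that lists border lengths by direct comparison. The helper names `borders` and `is_unbordered` are hypothetical, and the quadratic-time comparison is chosen for readability rather than efficiency.

```python
def borders(s):
    """Lengths of all borders of s: non-empty proper prefixes that are also suffixes."""
    return [k for k in range(1, len(s)) if s[:k] == s[-k:]]


def is_unbordered(s):
    """A string is unbordered if it has no border at all."""
    return not borders(s)


print(borders("abaab"))      # [2]  ("ab" is both a prefix and a suffix)
print(borders("aaaa"))       # [1, 2, 3]
print(is_unbordered("abc"))  # True
```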

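In this paper, we also consider half-positions $i + 0.5$ for integers $i$. For convenience, for a half-position $c$ and an integer $d$ such that $1 \le c - d \le c + d \le |T|$, let $T[c-d..c+d] = T[\lceil c-d \rceil..\lfloor c+d \rfloor]$. Note that $T[c]$ for a half-position $c$ is the empty character. The position $c = (|T| + 1)/2$ is called the center position of $T$, $T[c]$ is called the center character of $T$, and $T[c-d..c+d]$ for an integer $d$ is called a center substring of $T$.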

For a string $T$ and integers $1 \le i, j \le |T|$, a longest common extension (LCE) query $\mathrm{LCE}_T(i, j)$ asks for the length of the longest common prefix of the two suffixes $T[i..|T|]$ and $T[j..|T|]$ of $T$. When clear from the context, $\mathrm{LCE}_T(i, j)$ is abbreviated as $\mathrm{LCE}(i, j)$. It is well known that if $T$ is drawn from an integer alphabet of size polynomially bounded in $|T|$, then LCE queries for $T$ can be answered in constant time after an $O(|T|)$-time preprocessing, e.g., by constructing the suffix tree of $T$ and a data structure for lowest common ancestor queries on the tree [6].
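The semantics of an LCE query can be illustrated with a naive scan. This sketch is only meant to fix the definition: it uses 1-indexed positions as in the text and takes time proportional to the answer, whereas the constant-time bound cited above relies on a suffix tree with lowest common ancestor queries. The function name `lce` is an assumption of the sketch.

```python
def lce(t, i, j):
    """Length of the longest common prefix of t[i..] and t[j..] (1-indexed)."""
    n = len(t)
    k = 0
    # Extend the match character by character until a mismatch or the end of t.
    while i + k <= n and j + k <= n and t[i - 1 + k] == t[j - 1 + k]:
        k += 1
    return k


print(lce("abcabd", 1, 4))  # 2, since "ab" is the longest common prefix
```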

For a block palindrome $F$, the length of $F$, denoted by $|F|$, is the total length of its blocks, and the size of $F$, denoted by $\|F\|$, is the number of its non-empty blocks. A block palindrome is even if its size is even (that is, the center block is the empty string), and odd otherwise (that is, the center block is non-empty).

3 Properties of Block Palindromes

In this section, we investigate the properties of block palindromes. We assume that $T$ is an input string of length $n$ in the rest of the paper.

Since there are $2^{n-1}$ factorizations of $T$ and block palindromes are symmetric, there are $O(2^{n/2})$ block palindromes of $T$. Moreover, there is a tight example that consists of only the same characters.

Although a string has a huge number of block palindromes, they are highly redundant. To look for more essential properties of block palindromes, we define the largest block palindrome, which is a representative of the other block palindromes. A block palindrome of $T$ that has the largest number of blocks among all block palindromes of $T$ is called the largest block palindrome of $T$. Note that every block of the largest block palindrome is an unbordered substring, and each non-center block is the shortest border of the substring of $T$ that remains after all blocks outside it are removed. So, the largest block palindrome of $T$ is unique. The largest block palindrome is a representative of all block palindromes of $T$ in the sense that all block palindromes of $T$ can be represented as block palindromes over the blocks of the largest block palindrome.

A natural and immediate question is how to efficiently compute the largest block palindrome of $T$. The following theorem answers this question.

Theorem 1.

The largest block palindrome of $T$ can be computed in $O(n)$ time.

Proof. We construct a data structure in $O(n)$ time that can answer any LCE query in constant time.

We greedily compute the blocks from the outside inward by LCE queries. Suppose that we compute the shortest border of $T[b..e]$. For $i = 1$ to $\lfloor (e - b + 1)/2 \rfloor$, we check whether $T[b..b+i-1]$ is a border of $T[b..e]$ by checking whether $\mathrm{LCE}(b, e-i+1) \ge i$. If $T[b..e]$ does not have any border, we obtain the center block $T[b..e]$. Otherwise, we obtain the shortest border $T[b..b+i-1]$ of $T[b..e]$, and compute the inner blocks for $T[b+i..e-i]$. Since the number of LCE queries is $O(n)$ and each LCE query takes constant time, the largest block palindrome of $T$ can be computed in $O(n)$ time. ∎
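The greedy procedure in the proof can be sketched as follows. This is an illustrative version only: it replaces each LCE query by a direct substring comparison, so it does not achieve the linear-time bound of Theorem 1, and the function name `largest_block_palindrome` and the 0-indexed half-open ranges are assumptions of the sketch.

```python
def largest_block_palindrome(t):
    """Peel off the shortest border repeatedly, from the outside inward."""
    left, right = [], []
    b, e = 0, len(t)  # the remaining substring is t[b:e]
    while True:
        found = 0
        # Try border lengths in increasing order; the first hit is the shortest border.
        for i in range(1, (e - b) // 2 + 1):
            if t[b:b + i] == t[e - i:e]:
                found = i
                break
        if not found:
            break  # t[b:e] is unbordered (or empty): it is the center block
        left.append(t[b:b + found])
        right.append(t[e - found:e])
        b += found
        e -= found
    center = t[b:e]
    # An empty center block is omitted from the returned list.
    return left + ([center] if center else []) + right[::-1]


print(largest_block_palindrome("tokyoandkyoto"))
# ['to', 'kyo', 'and', 'kyo', 'to']
```

Restricting the search to lengths up to half of the remaining substring is enough, because a shortest border is itself unbordered and therefore cannot overlap its mirror occurrence.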

So far, we have considered only block palindromes that are equal to $T$ itself. Next, we consider block palindromes that appear as substrings in $T$. We define a maximal block palindrome, which is a representative of some block palindromes in $T$, and study how many maximal block palindromes can appear in $T$.

For a half-position $c$ and an integer $d$, consider the set of largest block palindromes of substrings of $T$ whose center positions are all $c$ and whose center blocks appear at $T[c-d..c+d]$. A largest block palindrome in this set that is the longest, namely one whose number of blocks is maximal among all largest block palindromes in the set, is called a maximal block palindrome.

We remark that a maximal block palindrome is a representative of all the largest block palindromes that share its center position and its center block.

Remark 1.

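For a half-position $c$ and an integer $d$, any largest block palindrome $f_{-k} \cdots f_k$ with center position $c$ and center block $T[c-d..c+d]$ is a sub-factorization of the maximal block palindrome $g_{-m} \cdots g_m$ with the same center position and center block, that is, $k \le m$ and $f_i = g_i$ for $-k \le i \le k$.

Proof. Assume that the statement does not hold. Let $f_i$ be the innermost block such that $f_i \ne g_i$, so that $|f_i| < |g_i|$ or $|f_i| > |g_i|$. If $|f_i| < |g_i|$, then $f_i$ is a border of $g_i$, which contradicts that $g_{-m} \cdots g_m$ is a largest block palindrome. The same argument applies to the case $|f_i| > |g_i|$. Therefore, such $f_i$ and $g_i$ do not exist and the statement holds. ∎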

We are interested in how many maximal block palindromes can appear in $T$. The number is trivially upper bounded by $O(n^2)$ since there are $O(n^2)$ substrings which can be center substrings. If there is no limitation on the size of maximal block palindromes, we can easily see that this bound is tight. For a string $T$ of length $n$ in which the characters are all distinct, any substring $T[i..j]$ is unbordered, and there is at least one maximal block palindrome that contains $T[i..j]$ as a center block. Thus, $T$ can contain $\Theta(n^2)$ maximal block palindromes. The following example says that the number of maximal block palindromes having at least three blocks also has the same tight upper bound.

Example 1.

The number of maximal block palindromes in that have at least three blocks is , where for a character denotes run of of length , and .

For convenience, we denote by , where , , , , , and are strings , , , , , and , respectively. There are maximal block palindromes of size three that, for , , = and they are unbordered, where the parentheses indicate blocks.

We remark that the upper bound is reduced to $O(n)$ if we impose a limitation on the lengths of center blocks.

Remark 2.

For any constant $k$, a string of length $n$ can contain $O(n)$ maximal block palindromes whose center blocks are of length $k$ because there are $O(n)$ possible center blocks. In particular, a string contains $O(n)$ maximal block palindromes of even size (i.e., the center blocks must be empty) because the number of occurrences of center blocks is $O(n)$.

The following lemma shows an interesting property of maximal block palindromes, and this property can be used for the proof of Lemma 2.

Lemma 1.

For a half-position $c$ and two distinct integers $d$ and $d'$, two largest block palindromes with center position $c$ whose center blocks appear at $T[c-d..c+d]$ and $T[c-d'..c+d']$, respectively, do not share block boundaries; namely, the ending position of a block of one never equals the ending position of a block of the other.

Proof. Similar to Remark 1, if we assume that this lemma does not hold, a block of one of the two must have a border, which contradicts that both are largest block palindromes. ∎

Let $\|MBP(T)\|$ denote the sum of the sizes of all maximal block palindromes in $T$.

Lemma 2.

For any string $T$ of length $n$, $\|MBP(T)\| = O(n^2)$.

Proof. By Lemma 1, any two largest block palindromes whose center positions are the same but whose center blocks are different do not share block boundaries. This implies that, for a half-position $c$, the total number of blocks of maximal block palindromes whose center position is $c$ is $O(n)$. Since there are $O(n)$ center positions, we have $\|MBP(T)\| = O(n^2)$. ∎

4 Enumeration of Maximal Block Palindromes

In this section, we consider how to enumerate all the maximal block palindromes $MBP(T)$ in $T$. A brute-force approach based on Theorem 1 would compute the largest block palindrome for every possible substring of $T$ (while suppressing output of non-maximal ones), which takes $O(n^3)$ time.
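The brute-force approach can be sketched as below, mainly as a reference against which faster methods can be checked on small inputs. The sketch is an assumption-laden illustration: it compares substrings directly instead of using LCE queries (so it is slower than the bound quoted above), it uses 0-indexed half-open ranges, and the names `largest_bp_span` and `maximal_block_palindromes` are hypothetical. Largest block palindromes are grouped by the range of their center block, and the one with the most block pairs in each group is kept.

```python
def largest_bp_span(t, b, e):
    """Greedy peeling on t[b:e]; returns (number of block pairs, center begin, center end)."""
    pairs = 0
    while True:
        found = 0
        for i in range(1, (e - b) // 2 + 1):
            if t[b:b + i] == t[e - i:e]:
                found = i
                break
        if not found:
            return pairs, b, e
        pairs += 1
        b += found
        e -= found


def maximal_block_palindromes(t):
    """Return sorted tuples (begin, end, center_begin, center_end), one per maximal block palindrome."""
    best = {}
    n = len(t)
    for b in range(n):
        for e in range(b + 1, n + 1):
            pairs, cb, ce = largest_bp_span(t, b, e)
            # Keep, for each center block, the largest block palindrome with the most blocks.
            if (cb, ce) not in best or pairs > best[(cb, ce)][0]:
                best[(cb, ce)] = (pairs, b, e)
    return sorted((b, e, cb, ce) for (cb, ce), (_, b, e) in best.items())


print(len(maximal_block_palindromes("abc")))  # 6: all characters are distinct
print(maximal_block_palindromes("aa"))        # [(0, 1, 0, 1), (0, 2, 1, 1), (1, 2, 1, 2)]
```

For a string with all distinct characters every substring is its own maximal block palindrome, which gives $n(n+1)/2$ results and agrees with the quadratic bound discussed in Section 3.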

We propose an optimal solution running in $O(n + \|MBP(T)\|)$ time.

Theorem 2.

All maximal block palindromes that appear in $T$ can be enumerated in $O(n + \|MBP(T)\|)$ time, where $\|MBP(T)\|$ is the output size.

We actually consider a variant of the problem: we propose an algorithm to enumerate all the maximal block palindromes of size at least 2, whose total output size is denoted by $\|MBP_{\ge 2}(T)\|$, in optimal $O(n + \|MBP_{\ge 2}(T)\|)$ time. That is to say, we can completely ignore maximal block palindromes of size 1, which might not be interesting if we focus on palindromic structures in $T$. If we want to enumerate $MBP(T)$, we can do that by slightly modifying the algorithm.

Our algorithm proceeds in two steps: (i) enumerate all the pairing unbordered blocks for all center positions in a batch processing, and (ii) build maximal block palindromes from the enumerated blocks.

In Step (i), we first enumerate every pair of occurrences of an unbordered substring in $T$. Note that each such pair will be a component of a maximal block palindrome, so the total number of enumerated pairs is bounded by the output size. We preprocess $T$ in $O(n)$ time and space to support LCE queries in constant time. We also compute, for every character $a$ in $T$, the list storing all the occurrences of $a$ in increasing order; all of these lists can be obtained by sorting the positions $i$ of $T$ with the key $T[i]$ by radix sort in $O(n)$ time and space.

Now we focus on an occurrence $i$ of a character $a$, and identify every pair of occurrences of an unbordered substring such that the left one starts at $i$. Let $j_1 < j_2 < \cdots < j_m$ be the occurrences of $a$ to the right of $i$ in $T$. We process $j_1, \ldots, j_m$ in increasing order to identify common unbordered substrings starting at $i$ and $j_h$ using LCE queries. In the first round, for $j_1$, we see that for any $\ell$ with $1 \le \ell \le \min\{\mathrm{LCE}(i, j_1), j_1 - i\}$, the common substring of length $\ell$ starting at $i$ and $j_1$ is unbordered, and thus we report each pair of such unbordered substrings. While processing $j_1, \ldots, j_m$ in increasing order, we maintain a set $S$ of positive integers (as a sorted list of intervals) such that, for every $\ell \in S$, $T[i..i+\ell-1]$ has a border caused by the common substrings starting at $i$ and the $j_h$'s processed so far. We use $S$ to efficiently skip the $\ell$'s such that $T[i..i+\ell-1]$ has a border in the later rounds. For example, in the first round, we add the interval $[j_1 - i + 1, j_1 - i + \mathrm{LCE}(i, j_1)]$ to $S$ (which is initially empty) since, for any $\ell$ in this interval, $T[i..i+\ell-1]$ has a border caused by the common substring starting at $i$ and $j_1$. When processing $j_h$ for $h > 1$, we see that for any $\ell \in \{1, \ldots, \min\{\mathrm{LCE}(i, j_h), j_h - i\}\} \setminus S$, the common substring of length $\ell$ starting at $i$ and $j_h$ is unbordered. Updating $S$ can be easily done in constant time by adding (merging if necessary) the interval $[j_h - i + 1, j_h - i + \mathrm{LCE}(i, j_h)]$ to $S$ (observe that the new interval is always pushed back to or merged with the last interval of $S$ as we process $j_1, \ldots, j_m$ in increasing order). Note that $\{1, \ldots, \min\{\mathrm{LCE}(i, j_h), j_h - i\}\} \setminus S$ always contains $1$, and we can incrementally enumerate its elements in constant time per element because $S$ is maintained as a sorted list of intervals. Thus, the computation cost can be charged to the size of the output, i.e., Step (i) runs in time linear in $n$ plus the output size in total.
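The interval-skipping idea of Step (i) can be sketched as follows. This is a hedged illustration rather than the authors' implementation: LCE queries are answered by a naive scan (so the stated time bound is not achieved), positions are 0-indexed, the occurrence lists are built with a dictionary instead of radix sort, and the names `lce0`, `unbordered_pairs`, and `covered` (playing the role of the set $S$ of bordered lengths) are assumptions of the sketch.

```python
from bisect import bisect_right
from collections import defaultdict


def lce0(t, i, j):
    """Naive longest common extension of t[i:] and t[j:] (0-indexed)."""
    k = 0
    while i + k < len(t) and j + k < len(t) and t[i + k] == t[j + k]:
        k += 1
    return k


def unbordered_pairs(t):
    """List (i, j, length): t[i:i+length] == t[j:j+length] is unbordered and i + length <= j."""
    occ = defaultdict(list)
    for pos, ch in enumerate(t):
        occ[ch].append(pos)  # occurrences of each character, in increasing order
    out = []
    for i, ch in enumerate(t):
        covered = []  # sorted disjoint intervals [lo, hi] of lengths known to be bordered
        later = occ[ch][bisect_right(occ[ch], i):]  # occurrences to the right of i
        for j in later:
            gap = j - i
            m = lce0(t, i, j)
            limit = min(m, gap)  # the two occurrences must not overlap
            length = 1
            # Report every length in 1..limit that is not covered by a bordered interval.
            for lo, hi in covered:
                if lo > limit:
                    break
                out.extend((i, j, l) for l in range(length, min(lo, limit + 1)))
                length = max(length, hi + 1)
            out.extend((i, j, l) for l in range(length, limit + 1))
            # Lengths gap+1 .. gap+m now have a border caused by the occurrence at j.
            lo, hi = gap + 1, gap + m
            if covered and covered[-1][1] >= lo - 1:
                covered[-1][1] = max(covered[-1][1], hi)  # merge with the last interval
            else:
                covered.append([lo, hi])
    return out


print(unbordered_pairs("abab"))  # [(0, 2, 1), (0, 2, 2), (1, 3, 1)]
```

In the string "aabaab", for example, the pair of occurrences of "aab" at positions 0 and 3 is reported, while the bordered candidate "aa" of length 2 is skipped because the earlier occurrence of "a" at position 1 has already put length 2 into `covered`.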

When we find a pair of occurrences of an unbordered substring of length $\ell$ starting at $i$ and $j$, we list it as a triple $(c, j, j + \ell)$, where $c = (i + j + \ell - 1)/2$ is the center of the pairing blocks. After listing all those triples, we sort them using the first and second elements as keys by radix sort, which can be done in time and space linear in $n$ plus the number of triples.

Now we are ready to proceed to Step (ii), in which we build the maximal block palindromes from the sorted list of triples computed in Step (i). For building the maximal block palindromes with center $c$, we scan the sublist of triples having center $c$ and connect the pairing blocks whose beginning and ending positions are adjacent, using the information of the second (the beginning position of the block) and third (the ending position of the block plus one) elements of the triples. We build all the $c$-centered maximal block palindromes by extending their blocks outwards simultaneously with a pre-initialized array $A$ of length $n + 1$. When we look at a triple $(c, b, e)$, we record the block in $A[e]$, and connect the block with the block ending at $b - 1$ if such a block exists (which can be noticed by the information in $A[b]$). Since the block boundaries are not shared due to Lemma 1, the information written in $A$ can be propagated correctly to extend the blocks. This runs in time linear in the size of the sublist. We can also clear $A$ in the same time by scanning the sublist again while resetting the entries we touched.

Since the initialization cost of $A$ is paid once at the very beginning of Step (ii) and the other computation cost can be charged to the output size, the total time complexity is linear in $n$ plus the output size.

For enumerating $MBP(T)$, we modify Step (ii). While scanning the sublist for center $c$, we can identify all the positions $e$ such that $e$ is not an ending position of any pairing block, for which the center substring $T[2c - e..e]$ is unbordered. If the unbordered substring cannot be extended outwards by blocks (which can also be checked while scanning the sublist), it is a maximal block palindrome of size 1 to output for $MBP(T)$. The algorithm runs in $O(n + \|MBP(T)\|)$ time in total as the additional cost can be charged to the output size.

5 Conclusions

In this paper, we investigated several properties of block palindromes, which are a generalization of palindromes and gapped palindromes. We also proposed an optimal algorithm to enumerate all maximal block palindromes appearing in a given string. As mentioned in Remark 2, if we impose a limitation on the lengths of center blocks, the upper bound of the number of maximal block palindromes is reduced to $O(n)$, where $n$ is the length of an input string. In particular, for maximal block palindromes of even size, the center blocks are restricted to be empty. The situation is similar to considering ordinary palindromes (in which the center blocks are strict) versus maximal gapped palindromes (in which the restriction on the center blocks is relaxed). It would be interesting to investigate the properties of maximal block palindromes whose center blocks have restricted lengths and develop efficient algorithms to enumerate only such a subset of maximal block palindromes.

References

  • [1] The 2015 British Informatics Olympiad, 2015. Last accessed 6/13/2018. URL: http://olympiad.org.uk/2015/index.html.
  • [2] Ali Alatabbi, Costas S. Iliopoulos, and M. Sohel Rahman. Maximal palindromic factorization. In Jan Holub and Jan Žďárek, editors, Proceedings of the Prague Stringology Conference 2013, pages 70–77, Czech Technical University in Prague, Czech Republic, 2013.
  • [3] Kirill Borozdin, Dmitry Kosolobov, Mikhail Rubinchik, and Arseny M. Shur. Palindromic Length in Linear Time. In CPM, volume 78 of LIPIcs, pages 23:1–23:12. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2017.
  • [4] Gabriele Fici, Travis Gagie, Juha Kärkkäinen, and Dominik Kempa. A subquadratic algorithm for minimum palindromic factorization. J. Discrete Algorithms, 28:41–48, 2014.
  • [5] Shivika Gupta and Rajesh Prasad. Searching gapped palindromes in DNA sequences using Burrows Wheeler type transformation. Journal of Information and Optimization Sciences, 37(1):51–74, jan 2016. URL: http://www.tandfonline.com/doi/full/10.1080/02522667.2015.1103044, doi:10.1080/02522667.2015.1103044.
  • [6] Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997.
  • [7] Ping-Hui Hsu, Kuan-Yu Chen, and Kun-Mao Chao. Finding all approximate gapped palindromes. Int. J. Found. Comput. Sci., 21(6):925–939, 2010.
  • [8] Tomohiro I, Shunsuke Inenaga, and Masayuki Takeda. Palindrome pattern matching. Theoretical Computer Science, 483:162–170, apr 2013. URL: https://www.sciencedirect.com/science/article/pii/S0304397512001041, doi:10.1016/J.TCS.2012.01.047.
  • [9] Tomohiro I, Shiho Sugimoto, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Computing Palindromic Factorizations and Palindromic Covers On-line. In CPM, volume 8486 of Lecture Notes in Computer Science, pages 150–161. Springer, 2014.
  • [10] Roman Kolpakov and Gregory Kucherov. Searching for gapped palindromes. Theor. Comput. Sci., 410(51):5365–5373, 2009. URL: https://doi.org/10.1016/j.tcs.2009.09.013, doi:10.1016/j.tcs.2009.09.013.
  • [11] Dmitry Kosolobov, Mikhail Rubinchik, and Arseny M. Shur. Pal^k is linear recognizable online. In SOFSEM, volume 8939 of Lecture Notes in Computer Science, pages 289–301. Springer, 2015.
  • [12] Glenn K. Manacher. A New Linear-Time "On-Line" Algorithm for Finding the Smallest Initial Palindrome of a String. J. ACM, 22(3):346–351, 1975.
  • [13] Shintaro Narisada, Diptarama, Kazuyuki Narisawa, Shunsuke Inenaga, and Ayumi Shinohara. Computing Longest Single-arm-gapped Palindromes in a String. In SOFSEM, volume 10139 of Lecture Notes in Computer Science, pages 375–386. Springer, 2017.
  • [14] Gerald R Smith. Meeting DNA palindromes head-to-head. Genes & development, 22:2612–2620, 2008.