Dynamic Palindrome Detection

06/24/2019
by   Amihood Amir, et al.
Bar-Ilan University

Lately, there has been a growing interest in dynamic string matching problems. Specifically, the dynamic Longest Common Factor problem has been researched and some interesting results have been reached. In this paper we examine another classic string problem in a dynamic setting - finding the longest palindrome substring of a given string. We show that the longest palindrome can be maintained in poly-logarithmic time per symbol edit.

1 Introduction

Palindrome recognition is one of the fundamental problems in computer science. It is among the first problems assigned in a programming course, and it features prominently in tests and assignments for automata and language courses, since it is a good example of a context-free language that is non-regular. It visits complexity courses as an example of a problem that is solved in linear time by a two-tape Turing machine [24] but requires quadratic time on a single-tape machine [20]. Seeking all palindromes in a string is also a good example of the uses of subword trees. Apostolico, Breslauer and Galil [9] considered parallel algorithms for the problem. Manacher [21] and Galil [14] showed how to use DPDAs for recognizing palindrome prefixes of a string that is input online. Amir and Porat [8] showed how to recognize approximate palindrome prefixes of a string that is being input online.

In addition to its myriad theoretical virtues, the palindrome also plays an important role in nature. Because the DNA is double stranded, its base pair representation offers palindromes in hairpin structures, for example. Many restriction enzymes recognize and cut specific palindromic sequences. In addition palindromic sequences play roles in methyl group attachments and in T cell receptors. For some examples of the varied roles of palindromes in Biology see, e.g. [15, 13, 19, 25].

Due to the importance, both theoretical and practical, of palindromes, it is surprising that the problem of finding palindromes in a dynamic text has not been studied. Clearly, one can re-run a palindrome detection algorithm after every change in the text, but this is obviously a very inefficient way of handling the problem.

In the 1990’s the active field of dynamic graph algorithms was started, with the motive of answering questions on graphs that dynamically change over time. For an overview see [12]. Recently, there has been a growing interest in dynamic pattern matching. This natural interest grew from the fact that the biggest digital library in the world - the web - is constantly changing, as well as from the fact that other big digital libraries - genomes and astrophysical data - are also subject to change through mutation and time, respectively.

Historically, some dynamic string matching algorithms had been developed. Amir and Farach [4] introduced dynamic dictionary matching, which was later improved by Amir et al. [5]. Idury and Scheffer [17] designed an automaton-based dynamic dictionary algorithm. Gu et al. [16] and Sahinalp and Vishkin [23] developed a dynamic indexing algorithm, where a dynamic text is indexed. Amir et al. [7] showed a pattern matching algorithm where the text is dynamic and the pattern is static.

The last few years saw a resurgence of interest in dynamic string matching. In 2017 a theory began to develop with its nascent set of tools. Bille et al. [10] investigated dynamic relative compression and dynamic partial sums. Amir et al. [2] considered the longest common factor (LCF) problem. They investigated the case after one error. Special cases of the dynamic LCF problem were discussed by Amir and Boneh [1]. The fully dynamic LCF problem was tackled by Amir et al. [3]. Amir and Kondratovsky [6] made a first step toward a fully dynamic string matching algorithm by considering a dynamic pattern and text that is changing in a limited fashion.

In this paper we consider the problem of finding the longest palindrome in a dynamic string. The changes to the string are character replacements.

The contributions of this paper are:

  1. We present a deterministic algorithm for computing the longest palindrome in a dynamic text in polylogarithmic time per substitution.

  2. We reinforce the dynamic LCP as an important tool for dynamic string matching algorithms.

  3. We prove some novel combinatorial properties of palindromes and periodic palindromes. This deeper understanding of the nature of palindromes enables the efficient dynamic longest palindrome detection algorithm.

This paper is organized as follows. Section 2 gives the basic pattern matching definitions and tools and can be safely skipped by the practitioner. Section 3 summarizes the known techniques for dynamic LCP. Section 4 gives the dynamic algorithm for finding the longest palindrome in a changing sequence. We conclude with some open problems and future directions.

2 Preliminaries

We begin with basic definitions and notation generally following [11].

Let $S = S[1]S[2]\ldots S[n]$ be a string of length $|S| = n$ over a finite ordered alphabet $\Sigma$ of size $|\Sigma| = \sigma = O(1)$. By $\varepsilon$ we denote the empty string. For two positions $i$ and $j$ on $S$, we denote by $S[i..j] = S[i]\ldots S[j]$ the factor (sometimes called substring) of $S$ that starts at position $i$ and ends at position $j$ (it equals $\varepsilon$ if $j < i$). We recall that a prefix of $S$ is a factor that starts at position $1$ ($S[1..j]$) and a suffix is a factor that ends at position $n$ ($S[i..n]$). We denote the reverse string of $S$ by $S^R$, i.e. $S^R = S[n]S[n-1]\ldots S[1]$.

We say that a string $S$ is a palindrome if $S = S^R$. Let $S$ be a string and $S[i..j]$ a factor of $S$. We say that $S[i..j]$ is a palindromic factor if $S[i..j]$ is a palindrome. $S[i..j]$ is a longest palindromic factor if there is no palindromic factor $S[i'..j']$ of $S$ where $j' - i' > j - i$.

Given two strings $S_1$ and $S_2$, the string $P$ that is a prefix of both is the longest common prefix (LCP) of $S_1$ and $S_2$ if there is no longer prefix of $S_1$ that is also a prefix of $S_2$.

Let $Y$ be a string of length $m$ with $0 < m \le n$. We say that there exists an occurrence of $Y$ in $S$, or, more simply, that $Y$ occurs in $S$, when $Y$ is a factor of $S$. Every occurrence of $Y$ can be characterised by a starting position in $S$. Thus we say that $Y$ occurs at the starting position $i$ in $S$ when $Y = S[i..i+m-1]$.
We say that string of size has a period if for every such that , it is satisfied that for some . The period of is the minimal for which that condition holds.
We say that a substring of , denoted as , is a run with period if its period is , but and . This means that no substring properly containing has a period .
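As an illustration of the notion of a period, here is a minimal Python sketch. It assumes the standard definition (a string `s` has period `p` when `s[i] == s[i + p]` for every valid `i`); the exact side conditions of the definition above did not survive in this copy, and the function name is hypothetical. Indices are 0-based, unlike the 1-based convention of the text.

```python
def smallest_period(s: str) -> int:
    """Return the smallest p >= 1 such that s[i] == s[i + p] for all valid i.

    Assumption: this is the textbook definition of a period; a string with
    no shorter period gets p == len(s). Uses the classic border (failure)
    array, so it runs in linear time.
    """
    n = len(s)
    if n == 0:
        return 0
    border = [0] * n  # border[i] = length of the longest proper border of s[:i + 1]
    k = 0
    for i in range(1, n):
        while k > 0 and s[i] != s[k]:
            k = border[k - 1]
        if s[i] == s[k]:
            k += 1
        border[i] = k
    # A border of length b corresponds to a period of length n - b.
    return n - border[-1]
```

For example, `smallest_period("abcabcab")` is 3, since the longest border is `"abcab"`.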

2.1 Suffix tree and suffix array.

The suffix tree $\mathcal{T}(S)$ of a non-empty string $S$ of length $n$ is a compact trie representing all suffixes of $S$. The branching nodes of the trie as well as the terminal nodes, that correspond to suffixes of $S$, become explicit nodes of the suffix tree, while the other nodes are implicit. Each edge of the suffix tree can be viewed as an upward maximal path of implicit nodes starting with an explicit node. Moreover, each node belongs to a unique path of that kind. Thus, each node of the trie can be represented in the suffix tree by the edge it belongs to and an index within the corresponding path. We let $\mathcal{L}(v)$ denote the path-label of a node $v$, i.e., the concatenation of the edge labels along the path from the root to $v$. We say that $v$ is path-labelled $\mathcal{L}(v)$. Additionally, $\mathcal{D}(v) = |\mathcal{L}(v)|$ is used to denote the string-depth of node $v$. Node $v$ is a terminal node if its path-label is a suffix of $S$, that is, $\mathcal{L}(v) = S[i..n]$ for some $1 \le i \le n$; here $v$ is also labelled with index $i$. It should be clear that each factor of $S$ is uniquely represented by either an explicit or an implicit node of $\mathcal{T}(S)$, called its locus. In standard suffix tree implementations, we assume that each node of the suffix tree is able to access its parent. Once $\mathcal{T}(S)$ is constructed, it can be traversed in a depth-first manner to compute the string-depth $\mathcal{D}(v)$ for each node $v$. It is known that the suffix tree of a string of length $n$, over a fixed-sized ordered alphabet, can be computed in time and space $O(n)$ [11].

The suffix array of a string $S$, denoted as $SA(S)$, is an integer array of size $n+1$ storing the starting positions of all (lexicographically) sorted non-empty suffixes of $S$, i.e. for all $1 < r \le n+1$ we have $S[SA(S)[r-1]..n] < S[SA(S)[r]..n]$. Note that we explicitly add the empty suffix to the array. The suffix array $SA(S)$ corresponds to a pre-order traversal of all the leaves of the suffix tree $\mathcal{T}(S)$. The inverse $iSA(S)$ of the array $SA(S)$ is defined by $iSA(S)[SA(S)[r]] = r$, for all $1 \le r \le n+1$.
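To make the suffix array and its inverse concrete, the following is a deliberately naive Python sketch (sorting suffixes directly, which is far from the linear-time constructions the text refers to). It uses 0-based positions and, as in the text, includes the empty suffix; the function names are hypothetical.

```python
def suffix_array(s: str) -> list[int]:
    """Starting positions of all suffixes of s (including the empty one,
    at position len(s)) in lexicographic order. Naive: sorts the suffixes
    themselves, so it is only meant for small illustrative inputs."""
    return sorted(range(len(s) + 1), key=lambda i: s[i:])


def inverse_suffix_array(sa: list[int]) -> list[int]:
    """isa[sa[r]] == r for every rank r."""
    isa = [0] * len(sa)
    for rank, start in enumerate(sa):
        isa[start] = rank
    return isa
```

For `"banana"` the sorted suffixes are the empty suffix, `a`, `ana`, `anana`, `banana`, `na`, `nana`, giving the array `[6, 5, 3, 1, 0, 4, 2]`.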

2.2 The Karp-Rabin Algorithm

Karp and Rabin developed a randomized linear time algorithm for finding all occurrences of a pattern in a text (pattern matching) [18]. The main idea of their algorithm is computing a numeric signature of the pattern, then sliding the pattern over the text and comparing the signature of the text substring that is tested against the pattern to the pattern signature. Any signature that can be updated in constant time per shift is a good candidate. Such a signature is also called a rolling hash function. For example, treat every alphabet symbol as a digit. The hash of a substring would be the representation of the substring as a number in a prime base, taken modulo a second prime. Clearly the computation of a hash in a single shift can be done in constant time, and a hash equality implies a substring equality with high probability.
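The rolling-hash idea can be sketched as follows. This is a minimal illustration, not the original algorithm's exact parameters: the base `257` and the modulus `2**61 - 1` are assumptions chosen for the example, and each hash hit is verified by a direct comparison so that the output is always correct.

```python
def karp_rabin(text: str, pattern: str, base: int = 257,
               mod: int = (1 << 61) - 1) -> list[int]:
    """Return all 0-based starting positions of pattern in text.

    base and mod are illustrative choices (both prime). The window hash
    is updated in O(1) per shift; hash hits are double-checked.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the leading symbol in a window
    p_hash = t_hash = 0
    for k in range(m):
        p_hash = (p_hash * base + ord(pattern[k])) % mod
        t_hash = (t_hash * base + ord(text[k])) % mod
    hits = []
    for i in range(n - m + 1):
        if t_hash == p_hash and text[i:i + m] == pattern:
            hits.append(i)
        if i + m < n:
            # Drop text[i], shift by one digit, append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return hits
```

Calling `karp_rabin("abracadabra", "abra")` reports the occurrences at positions 0 and 7.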

3 Dynamic Longest Common Prefix queries

Definition and implementation qualities

Dynamic Longest Common Prefix queries are a fundamental and powerful tool for maintaining properties of a dynamic string.

Definition 1

[The Dynamic LCP problem] Let $T$ be a text string of length $n$ over alphabet $\Sigma$. A Dynamic Longest Common Prefix (LCP) algorithm supports two queries:

  1. $\mathrm{LCP}(i, j)$ - Return the longest common prefix of $T[i..n]$ and $T[j..n]$.

  2. $\mathrm{Update}(i, c)$ - Change the $i$-th symbol in $T$ to be $c$.

The quality of an implementation for D-LCP can be measured by various parameters: 1) update time, 2) time for LCP query on the current text, and 3) whether the algorithm is deterministic or randomized.

Note that since Static LCP can be done with linear time preprocessing and constant time query, any solution in which the update time is not sublinear will not be better than doing the static LCP preprocessing from scratch after every update.

3.1 The Deterministic Implementation

There are a number of algorithms that yield a polylogarithmic computation of an LCP query on a dynamic text, following a polylogarithmic processing per change. We mention Mehlhorn et al. [22]. Using their algorithm with appropriate deamortization, one can compute the LCP in time , and per text change.

3.2 Randomized Implementation

It is a folklore fact that dynamic LCP can be achieved via Rabin-Karp methods. In this case, the LCP of two indices can be computed in time with high probability.
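One way such a hashing-based dynamic LCP could look is sketched below. This is an assumption-laden illustration rather than the construction the authors have in mind: it keeps position-weighted symbol hashes in a Fenwick tree (so a substitution costs $O(\log n)$) and answers an LCP query by binary search over hash comparisons ($O(\log^2 n)$). The class name, base and modulus are hypothetical, indices are 0-based, and answers are correct only with high probability.

```python
class HashedDynamicLCP:
    """Dynamic LCP via polynomial hashes stored in a Fenwick tree."""

    MOD = (1 << 61) - 1   # illustrative prime modulus
    BASE = 911382323      # illustrative base

    def __init__(self, text: str):
        self.n = len(text)
        self.chars = list(text)
        self.pw = [1] * (self.n + 1)
        for k in range(1, self.n + 1):
            self.pw[k] = self.pw[k - 1] * self.BASE % self.MOD
        self.tree = [0] * (self.n + 1)
        for i, c in enumerate(text):
            self._add(i, ord(c) * self.pw[i] % self.MOD)

    def _add(self, i: int, delta: int) -> None:
        i += 1
        while i <= self.n:
            self.tree[i] = (self.tree[i] + delta) % self.MOD
            i += i & -i

    def _prefix(self, k: int) -> int:
        # Sum of ord(chars[t]) * BASE**t over t in [0, k).
        total = 0
        while k > 0:
            total = (total + self.tree[k]) % self.MOD
            k -= k & -k
        return total

    def _weighted(self, i: int, length: int) -> int:
        return (self._prefix(i + length) - self._prefix(i)) % self.MOD

    def update(self, i: int, c: str) -> None:
        """Substitute the symbol at position i with c."""
        delta = (ord(c) - ord(self.chars[i])) * self.pw[i] % self.MOD
        self.chars[i] = c
        self._add(i, delta)

    def _equal(self, i: int, j: int, length: int) -> bool:
        # The weighted hash at i carries an extra factor BASE**i, so
        # cross-multiply to compare the two substrings' plain hashes.
        return (self._weighted(i, length) * self.pw[j] % self.MOD
                == self._weighted(j, length) * self.pw[i] % self.MOD)

    def lcp(self, i: int, j: int) -> int:
        """Length of the longest common prefix of the suffixes at i and j."""
        lo, hi = 0, self.n - max(i, j)
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if self._equal(i, j, mid):
                lo = mid
            else:
                hi = mid - 1
        return lo
```

A short usage example: for the text `"abcabd"`, `lcp(0, 3)` is 2; after `update(5, "c")` the text becomes `"abcabc"` and the same query returns 3.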

4 Dynamic Longest Palindrome Substring

4.1 The Algorithm’s Idea

The goal is to maintain a data structure containing all the maximal palindromes. Maximal in this context means that the palindrome cannot be expanded around its center. The longest palindrome substring is obviously a maximal palindrome, so as long as we keep track of the maximal palindromes in the text, we have the longest palindrome substring as well. Given an index in , we can find the maximal palindrome centered in this index using a single LCP query on .
After a text update, some maximal palindromes may be cut and some may be extended. We should query the relevant centers for the updated sizes of the affected maximal palindromes.
In the worst case, a single update can affect a linear number of maximal palindromes, so checking every single affected palindrome will not result in sublinear time. We make several observations on maximal palindromes that allow us to reduce the number of contested palindromes to a polylogarithmic one.
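The notion of a maximal palindrome around a center can be sketched with plain symbol-by-symbol expansion. This is a stand-in for the single LCP query used in the text (it takes time linear in the palindrome size rather than one query), with hypothetical names and 0-based inclusive indices. Centers are "doubled" so that both odd and even palindromes are covered.

```python
def maximal_palindrome(text: str, center2: int) -> tuple[int, int]:
    """Maximal palindrome around a doubled center.

    center2 == 2 * i     -> centered on symbol i (odd length)
    center2 == 2 * i + 1 -> centered between symbols i and i + 1 (even length)
    Returns inclusive (start, end); start > end denotes the empty palindrome.
    """
    left, right = center2 // 2, (center2 + 1) // 2
    while left >= 0 and right < len(text) and text[left] == text[right]:
        left -= 1
        right += 1
    return left + 1, right - 1


def longest_palindrome(text: str) -> str:
    """Static baseline: inspect every center and keep the longest result."""
    best = (0, -1)
    for center2 in range(2 * len(text) - 1):
        start, end = maximal_palindrome(text, center2)
        if end - start > best[1] - best[0]:
            best = (start, end)
    return text[best[0]:best[1] + 1]
```

Re-running `longest_palindrome` after every substitution is the inefficient baseline that the dynamic algorithm is meant to avoid.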

4.2 Locally Maximal Palindromes

Let $T$ be a text. A Locally Maximal Palindrome of $T$ is defined to be a substring $T[i..j]$ so that $T[i..j]$ is a palindrome and $T[i-1] \neq T[j+1]$. This means that the palindrome cannot be extended to the sides from its center.
The first observation we make about locally maximal palindromes gives an upper bound on the amount of similar sized maximal palindromes within a given distance from each other’s starting points. For this purpose, we pick some constant . We denote . We partition the maximal palindromes of some text to classes. Class contains the palindromes whose size is such that .
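The partition into exponential size classes can be sketched as below. The base of the exponent and the exact inequalities are lost in this copy, so the sketch assumes a hypothetical base `a > 1` and assumes that class $i$ holds the sizes $s$ with $a^i \le s < a^{i+1}$.

```python
def size_class(size: int, a: float = 1.5) -> int:
    """Index i of the class with a**i <= size < a**(i + 1).

    Assumptions: a > 1 is a hypothetical class base and size >= 1.
    A loop is used instead of math.log to avoid floating-point edge
    cases at exact powers of a.
    """
    if size < 1 or a <= 1:
        raise ValueError("size must be >= 1 and a must be > 1")
    i, bound = 0, a
    while bound <= size:
        bound *= a
        i += 1
    return i
```

With `a = 2.0`, sizes 4 through 7 all fall into class 2, so the number of classes is logarithmic in the text length.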

Lemma 1

Let and be two maximal palindromes in class that also satisfy and neither of them contains the other. Assume w.l.o.g that . Then has a period.

Proof: Consider the overlap between and : . Since it is contained in , which is a palindrome, its reverse appears in the symmetric place in . So we have . Symmetrically, the reverse of the overlap should also appear in . So . We, therefore, have two instances of the same string, , starting in two different indices in the text. If the difference between the indices is smaller than half the size of the substring, then this substring has a period. The size of the overlap is since is the chunk of that is not participating in the overlap. is at least since belongs to class , and . So . The difference between the starting indices is . We already have a bound for . As for , note that . Therefore the difference between the ending indices equals the difference between the starting indices plus the difference between the palindromes’ sizes. The difference between the sizes is bounded by so overall we have for the difference between the ending indices. It is also the case that for the difference between the two instances of . For a period, we need , which is satisfied for . We fix from now on.

The main implication of the above lemma is the following:

Lemma 2

At most two locally maximal palindromes in class can start in an interval of size (unless one of them is contained within the other).

Proof: Consider 3 maximal palindromes , , and , ordered by the starting index, such that and none of them is contained within the other. According to the previous lemma, the entire interval containing and is a periodic string with period size . Note that according to our proof, for an appropriate choice of we get that . So is smaller than any palindrome in class , including which is fully contained in this periodic interval. From periodicity we have and . From the fact that is a palindrome we have , as those are symmetric indices in the palindrome (they must be included in since ). Transitivity now yields in contradiction to ’s maximality.

   

This is the key observation for lowering the amount of necessary LCP queries.

4.3 Maintaining all the LMPs

The data structure we use for maintaining all the LMPs consists of priority queues. contains the LMPs with size such that , i.e. the ’th class palindromes. The values kept in are the start and end indices of each palindrome, sorted by the value of the starting index. We also maintain extra data about the maximum size of a palindrome within . When an index is changed, we need to update every LMP that touches that index. We first observe a simplified case in which does not contain palindromes that are fully contained in another palindrome.

Given an update in index - we need to update all the palindromes that touched . Palindromes that fully extend are cut in index (and in the symmetrical index as well), as the equality was destroyed. Palindromes that end just before may have been extended. Their new value is checked using LCP queries.
We start by checking all the palindromes starting in for extension. Since we are assuming that does not contain palindromes that are fully contained in each other, we have at most one palindrome starting in in every . Now, we want to find all the other palindromes that are affected by the update. We do it by considering exponentially growing intervals of distances of the palindromes’ starting index from . At step , we look at all the palindromes that start within distance from such that . Note that the palindromes that start in the said distances from must be at least in class (otherwise - they will not reach all the way to from that distance). Also, the size of this distances interval is . Our lemma directly implies that in step , every size class with index larger than consists of no more than 2 palindromes in the contested interval. So, for every value of , we have to inspect each priority queue a constant number of times.

To conclude: There are priority queues and each of them is queried a constant number of times for every exponential interval. There are such intervals so the time complexity for this simplified case is , where is the LCP query time.
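The "cut" step for a palindrome that extends across the substituted index can be sketched as a constant-time computation: the palindrome keeps its center and shrinks to the part strictly between the changed index and its mirror image. The function name is hypothetical, indices are 0-based and inclusive, and the sketch assumes the substitution really changed the symbol (so the two mirrored positions now differ).

```python
def cut_palindrome(start: int, end: int, x: int) -> tuple[int, int]:
    """New extent of the palindrome text[start..end] after a substitution at x.

    Assumption: the symbol at x actually changed. If x is outside the
    palindrome or is exactly its center, the extent is unchanged (a
    possible extension is checked separately, e.g. with an LCP query).
    """
    if not start <= x <= end:
        return start, end
    mirror = start + end - x  # position symmetric to x around the center
    if mirror == x:
        return start, end
    lo, hi = min(x, mirror), max(x, mirror)
    return lo + 1, hi - 1
```

For instance, the palindrome `abacaba` at positions 0..6 with a substitution at position 1 is cut to positions 2..4, i.e. `aca`.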

Sadly, our simplification is far from being true. Palindromes of the same size class can be contained in each other in great quantities. For example, consider the text $a^n$, a single symbol repeated $n$ times. The whole text is a palindrome, and every single index is the beginning of an LMP that is contained in the LMP starting in the previous index. We need to enhance both our data structure and understanding of locally maximal palindromes to deal with these cases.
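The nesting phenomenon is easy to observe on a small unary text. The sketch below enumerates locally maximal palindromes by brute force (treating the text boundaries as mismatches, which is an assumption about how the definition handles the ends of the text); the function name is hypothetical and indices are 0-based and inclusive.

```python
def locally_maximal_palindromes(text: str) -> list[tuple[int, int]]:
    """All non-empty palindromes that cannot be extended around their center."""
    n = len(text)
    result = []
    for center2 in range(2 * n - 1):  # doubled centers: symbols and gaps
        left, right = center2 // 2, (center2 + 1) // 2
        while left >= 0 and right < n and text[left] == text[right]:
            left -= 1
            right += 1
        if left + 1 <= right - 1:
            result.append((left + 1, right - 1))
    return result
```

On `"aaaa"` this yields `(0, 0), (0, 1), (0, 2), (0, 3), (1, 3), (2, 3), (3, 3)`: every index starts an LMP, and the LMP starting at index 2 is contained in the one starting at index 1, which in turn is contained in the whole text.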

4.4 Central Periodic Palindromes

The example we presented for many palindromes of similar sizes that are contained in each other actually demonstrates the structure of palindromes of that type. The following Theorem is the key to handling those palindromes:

Theorem 1

Let and be two LMPs in size class . If is contained in then has a period of size , where is the difference between the starting indices. Note that the period is at most .


Proof: Since is a substring of and is also a LMP, it can not share a center with . Therefore its reverse appears in the symmetrical indices in . But since is also a palindrome then it is equal to its reverse. Therefore, we have two instances of . Because of our choice of and because and are in the same size class, then is periodic and the size of its period is the difference between the starting indices of the instances of .    

Note that the formula only works if the initial is the left side instance of with respect to the center of . If we are given the right side instance - we can calculate the left side instance and proceed to apply the formula.
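The reflection argument in the proof can be sketched in code. The exact formula of Theorem 1 did not survive in this copy, so the following only illustrates the underlying idea under stated assumptions: an inner palindrome is mirrored around the center of the outer palindrome, and the distance between the two copies is a period of the span they cover together. All names are hypothetical; indices are 0-based and inclusive.

```python
def is_palindrome(text: str, start: int, end: int) -> bool:
    piece = text[start:end + 1]
    return piece == piece[::-1]


def reflected_period(outer: tuple[int, int], inner: tuple[int, int]) -> int:
    """Distance between an inner palindrome and its mirror image inside outer.

    Assumes inner lies inside outer and does not share its center, so the
    two copies start at different positions.
    """
    (o_start, o_end), (i_start, i_end) = outer, inner
    mirror_start = o_start + o_end - i_end  # start of the mirrored copy
    return abs(i_start - mirror_start)


def has_period(text: str, start: int, end: int, p: int) -> bool:
    return all(text[i] == text[i + p] for i in range(start, end - p + 1))
```

For `text = "abaabaaba"`, the whole text (0..8) and its prefix `abaaba` (0..5) are both palindromes; the mirrored copy starts at position 3, so the derived period is 3, and the text indeed has period 3.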

We call the periodic palindrome that is created as a result of an LMP that is contained within another LMP, , in the same size class the Central Periodic Palindrome of , or the CPP of . We call the period of a CPP the periodic seed of the CPP. We call a maximal run of the periodic seed a periodic palindromes cluster. We point out two important substrings of a periodic palindromes cluster:

  • The maximal palindrome prefix: The longest prefix of the cluster that is a palindrome.

  • The maximal palindrome suffix: The longest suffix of the cluster that is a palindrome.

It is possible for a CPP to be both a prefix CPP and a suffix CPP. Note that for the maximal palindrome prefix, all the prefixes of size are LMPs (with and being the period and the remainder of the largest palindrome prefix, respectively). The same applies to the maximal palindrome suffix with suffixes of the same sizes. The mentioned LMPs are represented by the cluster. This means that if we know the starting and ending position of , its periodic seed, its maximal palindrome prefix and its maximal palindrome suffix, then the existence of all those LMPs is implied.
The periodic palindrome clusters and their components are our key ingredient for efficient palindrome detection. They have several properties that make them convenient to work with. For example: all the LMPs that are contained in a cluster are either represented by the cluster or smaller than twice the size of the periodic seed. More formally:

Lemma 3

Let be a periodic palindromes cluster with period . Let be the maximal palindrome prefix of with a remainder , and let be the maximal palindrome suffix of with a remainder . A substring of with is a locally maximal palindrome only if . We call the set of LMPs that are represented by .

Proof: First, we show that every element in is a palindrome. From periodicity, we have for . From ’s symmetry as a palindrome we have for . Transitivity now yields that , which makes this interval a palindrome. Symmetrical arguments can be made to show that is a palindrome too.
Now, let be an LMP within with such that is not in . If and , then can be extended around its center due to similar arguments as in the proof of Lemma 2. Otherwise, we can assume that (the proof for the case where is symmetrical). Let be the minimal value of such that and . Note that and . Since is a palindrome containing the palindrome , appears in the symmetrical place in . The difference between the starting indices of these two instances will be , which is smaller than . That yields a period smaller than for the prefix of size of , which indicates that has a period smaller than .
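The implicit representation can be illustrated with a small sketch. It relies on a standard fact that matches the first part of the proof: if a palindrome of length $\ell$ has period $p$, then its prefixes of lengths $\ell - p, \ell - 2p, \ldots$ are palindromes as well. The exact size formula of the lemma is lost in this copy, so the enumeration below (sizes congruent to $\ell \bmod p$) is an assumption, and it also lists the short prefixes that the text delegates to smaller size classes. The function name is hypothetical.

```python
def implied_palindrome_prefix_sizes(prefix_len: int, p: int) -> list[int]:
    """Sizes of palindromic prefixes implied by a palindromic prefix of
    length prefix_len that has period p.

    Assumption: sizes are r + k * p with r = prefix_len % p, k >= 0,
    restricted to positive sizes not exceeding prefix_len.
    """
    r = prefix_len % p
    return [size for size in range(r, prefix_len + 1, p) if size > 0]
```

For the palindrome `abababa` with period 2 this gives the sizes 1, 3, 5 and 7, so only the period and the longest palindromic prefix need to be stored in order to recover all of them.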

The above lemma implies that if we have a cluster in class size in our data structure, and we find an LMP in the same class that is contained in , we do not need to explicitly store it. cannot be smaller than twice the size of the period. So according to the lemma it is implicitly represented by .
The contained LMPs that are smaller than twice the size of the period can be handled within smaller exponential size classes.
Two other important properties of the periodic palindromes clusters are:

  • Two clusters within the same exponential size class can not be contained within each other.

  • Two clusters in size class can not have starting indices with distance smaller than from each other.

These properties can be proved by observing that if two clusters violate any of them, the run of one of the clusters can be extended.

4.5 Extension and cuts of CPPs

Our algorithm represents LMPs under substitutions using periodic palindromes clusters. We, therefore, need to understand how clusters act under substitution. We wish to maintain every cluster along with its periodic seed and its maximal palindrome prefix and suffix.

We start by examining the case in which a cluster is cut by a substitution in index . For a clearer exposition, we denote the cut cluster to be (rather than representing as some substring ). Let its period be and the remainders of the contained prefix CPP and the contained suffix CPP be and respectively. The substitution splits into two periodic palindromes clusters: and . We show how the implied LMPs that are centered in the left side of are affected and deduce the resulting maximal palindrome prefix and maximal palindrome suffix of the cluster . Symmetrical arguments can be made for the LMPs centered in the right side of and .
The LMPs consistent with the period that are centered in the left side of can be sorted into two groups:

  1. LMPs that are not touched by : The implied LMPs can be either prefixes of the maximal palindrome prefix or suffixes of the maximal palindrome suffix. Since we are considering LMPs with centers in the left side of that were not touched by , these can only be prefixes. Those are for every such that . We observe that the largest LMP in this set is with being the maximal that satisfies the previous constraint. We point out that contains all the other LMPs in that set. Assuming that , there is no palindrome prefix larger than according to Lemma 3. So is the maximal palindrome prefix of . The assumption that implies that it is still periodic. If not, we will not keep as a cluster, but rather all the LMPs that are represented by it. Since in this case the cluster is not periodic, the number of represented LMPs is bounded by a constant.

  2. LMPs that are touched by : In this case we may consider both LMPs that are prefixes of and LMPs that are suffixes of .
    The relevant prefixes are for every such that and . The first constraint implies that the prefix is indeed touched by , and the second constraint implies the location of the center. The smallest that satisfies these two constraints will yield the represented prefix LMP that extends farthest to the left after the change. Denote this pivot value of as , and denote . is the size of the suffix that was cut from the pivotal LMP. A prefix of the same size should be removed, so the new largest LMP that touches from the left will be (among the LMPs that are prefixes of the original maximal palindrome prefix).
    The relevant suffixes: Actually, the only possible candidate to extend the farthest to the left after a substitution in index is the maximal palindrome suffix denoted as , since its center is the farthest to the left from all its suffixes. If it is indeed cut by and its center is in the left side of , the resulting LMP after the substitution will be .
    Out of these two candidates, the one that extends farthest to the right will be the maximal palindrome suffix of .

All of the above can be calculated in constant time given and .

We now analyse periodic palindromes clusters that are touched by in their ends:
Again, for ease of exposition, we denote the cluster as . We assume that . The case where touches the left side of is treated symmetrically. Let the periodic seed of be . Denote the remainder of ’s prefix CPP as . Finally, denote . The LCP query in the last notation indicates the maximal extension of the run. Therefore the new updated interval for the cluster is . It can be proven by induction that given a prefix CPP with period , the string is a palindrome for every . If we take the maximal such that , we get the maximal palindrome prefix of . Denote this prefix as .
As for the maximal palindrome suffix - consider the suffix of size of . According to the previous claim this is a palindrome but it is not necessarily a LMP. Since it is within the periodic cluster, it may be extended around the center as long as the resulting extension is within the cluster. can be extended by to the right and to the left. Since , this extension will result in the LMP . is the maximal palindrome suffix cluster .

All of the above can be calculated in constant time given and . can be calculated using a dynamic LCP data structure.
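The extension of a periodic run can be sketched as follows. In the text the maximal extension is obtained from one dynamic LCP query (the suffix starting right after the run is compared with the suffix one period earlier); the sketch below replaces that query with a plain scan, so it is only an illustration of what is being computed. Names are hypothetical, indices are 0-based and inclusive, and the run is assumed to be at least one period long.

```python
def extend_run_right(text: str, start: int, end: int, p: int) -> int:
    """Largest end' >= end such that text[start..end'] still has period p.

    Assumes end - start + 1 >= p. A dynamic LCP structure would return
    the same value as end + LCP(end + 1, end + 1 - p).
    """
    while end + 1 < len(text) and text[end + 1] == text[end + 1 - p]:
        end += 1
    return end


def extend_run_left(text: str, start: int, end: int, p: int) -> int:
    """Smallest start' <= start such that text[start'..end] still has period p."""
    while start - 1 >= 0 and text[start - 1] == text[start - 1 + p]:
        start -= 1
    return start
```

For `"abcabcabcabx"`, the run at positions 0..5 with period 3 extends to position 10; the mismatch at position 11 stops it.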

4.6 Adding CPPs to the maximal palindrome algorithm

We enhance our algorithm to maintain collections in addition to the LMP collections. The maintained invariant in this setting is that every LMP in the text is either represented explicitly as an LMP or implicitly as a part of a cluster. In addition to the priority queues, we also define . contains periodic palindromes clusters in the ’th exponential size class. Every cluster is stored along with its corresponding period , prefix CPP and suffix CPP. As in , they are sorted by increasing value of the starting index. We maintain the invariant that each of those priority queues does not contain any element that is contained in another element in the queue. In this is naturally preserved as long as we maintain valid clusters due to properties of clusters. Preserving this condition in requires more delicate care.
Given a substitution in index , needs to be tested against every and in every exponential distance level. First, we extract all the affected LMPs and clusters from the priority queues. We treat every LMP we extracted from some as in the simplified case - we cut it in a symmetrical manner if it was cut by and check for extension if touches one of its ends. We save the results of those extensions and cuts in some temporary list . As for the extracted s, we treat them as described in the previous section. Additionally, for every cluster created as a product of an extension or a cut in the process, we query the new center of the cluster for the LMP centered in this index. This is due to the fact that it is the only candidate among the LMPs represented by the cluster to be extendible beyond the cluster’s range. We add the results of these queries to as well. We also add the updated cluster to with respect to its size. If a cluster was cut to the point when it is no longer periodic then we do not add it to . Instead, we add all the LMPs that are implied by the cut CPP to . Since the cut CPP is no longer periodic then the amount of implied LMPs is bounded by a constant.

Note that the size of is at most , since we add an element to only once an element that is cutting is met in either or . This happens a constant number of times in every exponential distance level. So we have queues multiplied by exponential distance levels.

At this stage, every LMP in the new text is represented either in or in . The next natural step would be adding every LMP saved in to the appropriate . But this may violate our invariant that does not contain two elements such that one of them is containing the other. We handle it by deducing the existence of a cluster and adding the cluster to our data structure instead of the contained elements. This is done as follows:
For every , we query that matches ’s size for the predecessor of . We denote the returned LMP as .

Lemma 4

If there is an element in that contains , must contain . Additionally, there are at most two elements in that contain .

Proof: has the largest value of that is less than , so every successor of can not contain . Assume that does not contain . That necessarily means that . Assume that there is another element in that contains . Denote it as . As seen, must be a predecessor of , so . also contains . Therefore . Transitivity yields that and contains , in contradiction to not containing two elements that contain each other.
As for the existence of no more than two including elements, belongs to the ’th size class, meaning that its size is at least . If we have three LMPs in the th class that share an interval of size , the distance between their starting indices won’t be more than . Lemma 2 suggests that this is impossible.    

Given lemma 4 we can find and check if it contains . If it does then we calculate the derived from and and a period of this CPP using the formula provided in theorem 1, and check for the extension of the cluster in both directions using LCP queries. We proceed to ignore and add the new cluster to the appropriate .

Another violation that may result from adding to is that contains elements in . This case can be tested in a similar way. Query for the successor of and denote it as . As in Lemma 4, it can be proven that if any element in is contained within then must be contained within . Also, there are at most two elements that are contained within in . We can add , remove (and its successor, if necessary), and proceed to calculate the cluster as in the previous case.

It may seem like we are done at this point, but there is another subtle detail that we need to take care of. Since the period of a periodic cluster is required in order to calculate the resulting cluster after an extension or a cut, the periods of the new clusters must be calculated. For the clusters created as a result of an extension or a cut this is done as described in the previous section. For clusters that were created to solve violations we need to work harder. Note that the formula in Theorem 1 yields some period of the CPP (or cluster), which is valid for checking the extension, but is not necessarily the smallest period. We present a subroutine FindCppPeriod to solve that problem. FindCppPeriod is meant to be used when all the LMPs are either represented explicitly or implicitly. Therefore running it after we fully evaluated will be sufficient. FindCppPeriod also assumes that we know the starting and ending indices of the of . This assumption is valid because FindCppPeriod is called only after two LMPs within the same size class with one of them containing the other are met. In this setting, Theorem 1 can be used to calculate the periodic interval that defines the .

Denote the containing LMP , and its size class as . The following lemma is the key to finding the period:

Lemma 5
  1. The CPP of is unique in the sense that every LMP in size class will yield the same CPP using the formula from Theorem 1 on and .

  2. Let be the CPP of . There is an LMP in the size class that is contained in such that if we apply the formula from Theorem 1 on and , we get the minimal period of the CPP.

Proof: For the first part of the lemma, assume that we have two different LMPs, and , and the CPPs derived from their existence are and respectively. We can assume, without loss of generality, that both and are the left side instances with respect to the center of of and , respectively. If then the resulting CPPs would be the same since the formula from Theorem 1 depends only on the starting index of the inner LMP. Otherwise assume, w.l.o.g., that . According to the formula, that would mean that fully contains . Specifically it fully contains . So we have that is an LMP in size class that is contained in the CPP but is not consistent with the period of (since it is neither a prefix nor a suffix of ). This contradicts Lemma 3.

For the second part of the lemma, consider the unique CPP with a periodic seed . The existence of is a result of some LMP in size class that is contained within . is either a proper prefix or a proper suffix of . We will prove the lemma assuming that it is a proper prefix of . The proof for the case in which is a proper suffix of is symmetrical. Since is a proper suffix of , and an LMP, the period of cannot be extended to the left. Otherwise, would have been extendable around its center. So is the beginning of a prefix with period and remainder , and every substring of the form is an LMP. Let be the maximal such that . Let . is an LMP contained in . Since is of the said form too, is at least as long as , implying that it is in size class . Since is the maximal value of that satisfies , we have . The size of is , the size of is and the distance between the starting locations is . Applying the formula from Theorem 1, we get . If the formula yielded a number smaller than , this would contradict being the minimal period. So it must yield .

To conclude: in order to compute the periodic seed of the cluster, we need to find that in our collection and apply Theorem 1. Thus the subroutine works as follows:

FindCppPeriod
Let be the palindrome of interest in size class and let be the CPP of . First, query for the successor of , denoted as . is a candidate for being if it is contained within , and is in the size class . If is a candidate for being , apply Theorem 1 and get a candidate for the cluster’s period. Do the same for the second successor of , provided that it is a candidate for being . Proceed to check all the LMPs in . Apply Theorem 1 on every LMP in that is a candidate. As we previously claimed, there are no more than two LMPs in that are contained in , so we did not miss any candidates in .
Assuming that is represented explicitly, we already have the right value of in hand. But what if it is represented as part of a cluster? In this case we use the fact that is in the size class , meaning that its start index is in the interval and its ending index is in the interval . If is indeed represented as part of a cluster, it is either a prefix of a cluster or a suffix of one. Since is in the size class , the cluster containing it must be in a size class such that . But as we previously showed, there are no more than three clusters in size class with starting indices in an interval smaller than . This is at least - the size of and . So every priority queue with will yield at most four candidates for the cluster that implies the existence of . These are at most two clusters that start within , located with two successor queries on , and at most two clusters that end within , located by two predecessor queries on . Given a cluster, its period, and the remainder of both the prefix CPP and the suffix CPP, we can find the largest implied prefix LMP contained in in constant time. We can also find the largest implied suffix LMP that is contained in . These will be the only possible candidates for from this cluster since, as implied by the proof of Lemma 5, is the largest extension of a run that is contained in . To conclude this case, we iterate through every queue with and get the candidate clusters for containing . From every candidate cluster, we get at most two LMPs that are candidates for being . We apply Theorem 1 on every that is encountered in this process.
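The candidate lookup described above, a bounded number of successor and predecessor queries restricted to an interval, might be sketched as follows. This is only an illustration under assumptions: a sorted Python list stands in for one priority queue, its keys are taken to be cluster starting (or ending) indices, the interval bounds are treated as inclusive, and the function names are hypothetical.

```python
import bisect


def successors_in_interval(sorted_keys, lo, hi, limit=2):
    """Up to `limit` keys k with lo <= k <= hi, found by successor steps from lo."""
    out = []
    i = bisect.bisect_left(sorted_keys, lo)  # first key >= lo
    while i < len(sorted_keys) and sorted_keys[i] <= hi and len(out) < limit:
        out.append(sorted_keys[i])
        i += 1
    return out


def predecessors_in_interval(sorted_keys, lo, hi, limit=2):
    """Up to `limit` keys k with lo <= k <= hi, found by predecessor steps from hi."""
    out = []
    i = bisect.bisect_right(sorted_keys, hi) - 1  # last key <= hi
    while i >= 0 and sorted_keys[i] >= lo and len(out) < limit:
        out.append(sorted_keys[i])
        i -= 1
    return out
```

Because `limit` is a constant, each queue contributes a constant number of queries, which is what the complexity accounting relies on.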

After that process, it is guaranteed that one of the candidates that we tested was indeed . We take the minimal value of that was collected. This value must be the periodic seed.

Complexity of finding the period: In the worst case, we do a constant number of predecessor and successor queries on each of the priority queues. This may take time. We also go through , and . Each of these queries may yield a candidate for and we do constant work to produce a candidate for from each candidate. So the complexity is .

With this, our algorithm is finally complete. We try to add every LMP in to its appropriate . If the insertion results in two LMPs in containing each other, the contained element is removed, the periodic palindrome cluster is calculated, and FindCppPeriod is invoked to compute its periodic seed. The prefix CPP and the suffix CPP of the cluster can be deduced from and from , the CPP of the containing LMP.

Complexity: Finding, extending and cutting all the affected LMPs and CPPs takes a constant number of priority queue queries and LCP queries per exponential distance level for each priority queue, resulting in time, where is the time for LCP computation. In the worst case, we invoke FindCppPeriod for every LMP in when adding it to , resulting in time. Overall, the complexity is . Since dynamic LCP queries can be computed in polylogarithmic time, this is .
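When experimenting with an implementation of the dynamic structure, a naive recomputation can serve as a correctness oracle. The sketch below is such a baseline and not the algorithm of this paper: it recomputes the longest palindromic substring from scratch by center expansion in quadratic time after each edit. It assumes that an edit is a single symbol substitution, and the function names are illustrative.

```python
def longest_palindrome(s):
    """Return (start, length) of a longest palindromic substring, leftmost on ties."""
    n = len(s)
    best_start, best_len = 0, 0
    # 2n - 1 centers: even values sit on a character, odd values between two.
    for center in range(2 * n - 1):
        lo = center // 2
        hi = lo + center % 2
        while lo >= 0 and hi < n and s[lo] == s[hi]:
            lo -= 1
            hi += 1
        length = hi - lo - 1
        if length > best_len:
            best_start, best_len = lo + 1, length
    return best_start, best_len


def substitute(s, i, ch):
    """Return a copy of s with position i replaced by the single symbol ch."""
    return s[:i] + ch + s[i + 1:]
```

Comparing the answer of the dynamic structure with `longest_palindrome` after every call to `substitute` on random inputs is a simple way to catch bookkeeping errors in the cluster maintenance.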

5 Conclusion and Open Problems

We presented a dynamic algorithm for maintaining the longest palindromic substring in a changing text. This can be done in time per change.

We made heavy use of a polylogarithmic time dynamic LCP algorithm. It would be interesting to tighten up the dynamic LCP time as much as possible, and thus achieve logarithmically better time.

The field of dynamic string matching has been re-emerging in recent years. It would be interesting to study various string problems in a dynamic setting, such as finding the longest periodic substring, and finding various motifs.

References

  • [1] A. Amir and I. Boneh. Locally maximal common factors as a tool for efficient dynamic string algorithms. In Proc. 29th Annual Symposium on Combinatorial Pattern Matching (CPM), LIPIcs, pages 11:1–11:13, 2018.
  • [2] A. Amir, P. Charalampopoulos, C.S. Iliopoulos, S.P. Pissis, and J. Radoszewski. Longest common factor after one edit operation. In Proc. 24th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS, pages 14–26. Springer, 2017.
  • [3] A. Amir, P. Charalampopoulos, S. P. Pissis, and J. Radoszewski. Longest common factor made fully dynamic. Technical Report abs/1804.08731, CoRR, April 2018.
  • [4] A. Amir and M. Farach. Adaptive dictionary matching. Proc. 32nd IEEE FOCS, pages 760–766, 1991.
  • [5] A. Amir, M. Farach, R.M. Idury, J.A. La Poutré, and A.A Schäffer. Improved dynamic dictionary matching. Information and Computation, 119(2):258–282, 1995.
  • [6] A. Amir and E. Kondratovsky. Searching for a modified pattern in a changing text. In Proc. 25th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS, pages 241–253. Springer, 2018.
  • [7] A. Amir, G.M. Landau, M. Lewenstein, and D. Sokol. Dynamic text and static pattern matching. ACM Transactions on Algorithms, 3(2), 2007.
  • [8] A. Amir and B. Porat. Approximate on-line palindrome recognition, and applications. In Proc. 25th Annual Symposium on Combinatorial Pattern Matching (CPM), pages 21–29, 2014.
  • [9] A. Apostolico, D. Breslauer, and Z. Galil. Optimal parallel algorithms for periods, palindromes and squares. In Proc. 19th International Colloquium on Automata, Languages, and Programming (ICALP), volume 623 of LNCS, pages 296–307. Springer, 1992.
  • [10] P. Bille, A. R. Christiansen, P. H. Cording, I. L. Gørtz, F. R. Skjoldjensen, H. W. Vildhøj, and S. Vind. Dynamic relative compression, dynamic partial sums, and substring concatenation. Algorithmica, 16(4):464–497, 2017.
  • [11] M. Crochemore, C. Hancart, and T. Lecroq. Algorithms on Strings. Cambridge University Press, 2007.
  • [12] C. Demetrescu, D. Eppstein, Z. Galil, and G. Italiano. Algorithms and Theory of Computation Handbook, chapter Dynamic Graph Algorithms, pages 9–9. Chapman & Hall/CRC, 2010.
  • [13] A. Fuglsang. Distribution of potential type II restriction sites (palindromes) in prokaryotes. Biochemical and Biophysical Research Communications, 310(2):280–285, 2003.
  • [14] Z. Galil. On converting on-line algorithms into real-time and on real-time algorithms for string matching and palindrome recognition. SIGACT News, pages 26–30, Nov.-Dec. 1975.
  • [15] M.S. Gelfand and E.V. Koonin. Avoidance of palindromic words in bacterial and archaeal genomes: a close connection with restriction enzymes. Nucleic Acids Res, 25:2430–2439, 1997.
  • [16] M. Gu, M. Farach, and R. Beigel. An efficient algorithm for dynamic text indexing. Proc. 5th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 697–704, 1994.
  • [17] R.M. Idury and A.A Schäffer. Dynamic dictionary matching with failure functions. Proc. 3rd Annual Symposium on Combinatorial Pattern Matching, pages 273–284, 1992.
  • [18] R.M. Karp and M.O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Res. and Dev., pages 249–260, 1987.
  • [19] B. Lisnic, I.K. Svetec, H. Saric, I. Nikolic, and Z. Zgaga. Palindrome content of the yeast Saccharomyces cerevisiae genome. Curr Genetics, 47:289–297, 2005.
  • [20] W. Maass. Quadratic lower bounds for deterministic and nondeterministic one-tape Turing machines (extended abstract). In Proc. 16th Annual ACM Symposium on the Theory of Computing (STOC), pages 401–408, 1984.
  • [21] G. Manacher. A new linear-time “on-line” algorithm for finding the smallest initial palindrome of a string. Journal of the ACM, 22(3):346–351, 1975.
  • [22] K. Mehlhorn, R. Sundar, and C. Uhrig. Maintaining dynamic sequences under equality-tests in polylogarithmic time. In Proc. 5th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 213–222, 1994.
  • [23] S. C. Sahinalp and U. Vishkin. Efficient approximate and dynamic matching of patterns using a labeling paradigm. Proc. 37th FOCS, pages 320–328, 1996.
  • [24] A.O. Slisenko. Recognition of palindromes by multihead Turing machines. In Proc. of the Steklov Math. Inst., volume 129, pages 30–202. Acad. of Sciences of the USSR, 1973.
  • [25] S. K. Srivastava and H.S. Robins. Palindromic nucleotide analysis in human T cell receptor rearrangements. PLOS ONE, 7(12):e52250, 2012.