Algorithms for Anti-Powers in Strings

05/25/2018 ∙ by Golnaz Badkobeh, et al. ∙ Helsingin yliopisto UniPa 0

A string S[1,n] is a power (or tandem repeat) of order k and period n/k if it can be decomposed into k consecutive equal-length blocks of letters. Powers and periods are fundamental to string processing, and algorithms for their efficient computation have wide application and are heavily studied. Recently, Fici et al. (Proc. ICALP 2016) defined an anti-power of order k to be a string composed of k pairwise-distinct blocks of the same length (n/k, called anti-period). Anti-powers are a natural converse to powers, and are objects of combinatorial interest in their own right. In this paper we initiate the algorithmic study of anti-powers. Given a string S, we describe an optimal algorithm for locating all substrings of S that are anti-powers of a specified order. The optimality of the algorithm follows from a combinatorial lemma that provides a lower bound on the number of distinct anti-powers of a given order: we prove that a string of length n can contain Θ(n^2/k) distinct anti-powers of order k.


1 Introduction

A vast literature exists on algorithms for locating regularities in strings. One of the most natural notions of regularity is that of an exact repetition (also called power or tandem repeat), that is, a substring formed by two or more contiguous identical blocks — the number of these identical blocks is called the order of the repetition. Often, the efficiency of such algorithms derives from combinatorial results on the structure of the strings. The reader is pointed to Ba15 for a survey on combinatorial results about text redundancies and algorithms for locating them.

Recently, a new notion of regularity for strings based on diversity rather than on equality has been introduced: an anti-power of order k FRSZ16 (see Fi18 for the extended version) is a string that can be decomposed into k pairwise-distinct strings of identical length. This new notion is at the basis of a new unavoidable property. Indeed, regardless of the alphabet size, every infinite string must contain powers of any order or anti-powers of any order FRSZ16; Fi18. Defant Def17 (see also Narayanan N17) studied the sequence of lengths of the shortest prefixes of the Thue-Morse word that are k-anti-powers, and proved that this sequence grows linearly in k.

In this paper, we focus on the problem of finding efficient algorithms to locate anti-powers in a finite string. While there exist several algorithms for locating repetitions in strings (see for example Cr09), we present here the first algorithm that locates anti-power substrings in a given input string. Furthermore, we exhibit a lower bound on the number of distinct substrings that are anti-powers of a specified order, which allows us to prove that the time complexity of our algorithm is optimal.

2 Preliminaries

Let S = S[1,n] be a string of length |S| = n over an alphabet Σ of size σ. The empty string ε is the string of length 0. For 1 ≤ i ≤ j ≤ n, S[i] denotes the i-th symbol of S, and S[i,j] the contiguous sequence of symbols (called factor or substring) S[i]S[i+1]⋯S[j]. A substring S[i,j] is a suffix of S if j = n and it is a prefix of S if i = 1. A power of order k (or k-power) is a string that is the concatenation of k identical strings. An anti-power of order k (or k-anti-power) is a string that can be decomposed into k pairwise-distinct strings of identical length FRSZ16. The period of a k-power (resp. the anti-period of a k-anti-power) of length n is the integer n/k.

For example, is a -power (also called a square) of period , while is a -anti-power of anti-period (but also a -anti-power of anti-period ).
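The two definitions can be checked directly by cutting a string into k equal-length blocks and comparing them. The following is a minimal sketch; the function names and the sample strings are illustrative choices, not taken from the paper.

```python
def blocks(s, k):
    """Split s into k consecutive blocks of equal length.

    Returns None when s is empty, k is not positive, or k does not divide len(s).
    """
    if k <= 0 or not s or len(s) % k != 0:
        return None
    p = len(s) // k  # the period (or anti-period)
    return [s[i:i + p] for i in range(0, len(s), p)]


def is_power(s, k):
    """True if s is the concatenation of k identical blocks."""
    b = blocks(s, k)
    return b is not None and len(set(b)) == 1


def is_anti_power(s, k):
    """True if s splits into k pairwise-distinct blocks of equal length."""
    b = blocks(s, k)
    return b is not None and len(set(b)) == k


# Illustrative strings (assumed examples, not the paper's):
print(is_power("abab", 2))         # blocks ab, ab
print(is_anti_power("aabbab", 3))  # blocks aa, bb, ab
print(is_anti_power("aabbab", 2))  # blocks aab, bab
```

Note that the same string can be an anti-power of several orders at once, as the last two calls illustrate.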

In this paper, we consider the following problem:

Problem 1.

Given a string S and an integer k, locate all the substrings of S that are anti-powers of order k.

We describe an optimal solution to this problem in Section 4. Before that, in Section 3, we prove a lower bound on the number of anti-powers of order k that can be present in a string of length n, which allows us to establish the optimality of our algorithm.

3 Lower Bound on the Number of Anti-Powers

Over an unbounded alphabet, it is easy to see that a string of length n can contain Θ(n^2/k) anti-powers of order k (think of a string consisting of all-distinct letters). However, somewhat more surprisingly, this bound also holds over a finite alphabet, as we now show.
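The count for the all-distinct case can be worked out explicitly. In a string of n pairwise-distinct letters, every substring whose length is a multiple pk of k is a k-anti-power of anti-period p, and there are n − pk + 1 substrings of that length. The derivation below is a sketch; the symbol q is introduced here only for convenience.

```latex
\[
  \sum_{p=1}^{q} (n - pk + 1)
  \;=\; q(n+1) - \frac{k\,q(q+1)}{2},
  \qquad q = \lfloor n/k \rfloor .
\]
% When k divides n, substituting q = n/k gives
\[
  \frac{n(n+1)}{k} - \frac{n(n+k)}{2k}
  \;=\; \frac{n(n-k+2)}{2k},
\]
% which is at most n^2/k for all k, and at least n^2/(4k) whenever k \le n/2.
```

For instance, n = 4 and k = 2 give 3 + 1 = 4 anti-powers, in agreement with n(n − k + 2)/(2k) = 4.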

For every positive integer , we let denote the string obtained by concatenating the binary expansions of integers from to followed by a symbol . So for example . We have that . Let us write .

Lemma 1.

Every string of length contains anti-powers of order .

Proof.

As mentioned before, we have . Let denote the number of anti-powers of order in with anti-period .

The number of anti-powers of order is at least the sum of the number of anti-powers of order with anti-period greater than . It is readily verified that if the anti-period is such that then at every position in there is a -anti-power of anti-period . This is because there are at least two ’s in every factor of of length , and every factor of containing at least two ’s has, by construction, only one occurrence in .

Hence we have:

Thus we have , as claimed. ∎

4 Computing Anti-Powers of Order k

This section is devoted to establishing the following theorem and we assume is over an alphabet .

Theorem 2.

Given a string S and an integer k, the locations of all substrings of S that are k-anti-powers can be determined in O(n^2/k) time and O(n) space.

In light of the lower bound established in the previous section on the number of anti-powers of a given order that can occur in a string, this solution to Problem 1 is optimal.

4.1 Computing anti-powers having anti-period 1

We begin with a lemma that we will use in our algorithm.

Lemma 3.

Given a string , the longest substring of that consists of pairwise-distinct symbols can be computed in time and space.

Proof.

We scan left to right, and maintain two pointers into it. Through the scan, both and are monotonically nondecreasing. We maintain the invariant that the symbols in the substring delineated by and , i.e., , are all distinct. In order to maintain this invariant, we keep an array , initially all 0s, such that immediately before we increment , is the rightmost position of symbol in (or 0 if does not appear in ). Clearly, for the invariant to hold, we must have that , otherwise there are (at least) two occurrences of in . In other words, if contains distinct letters then so will , provided . Initially, and the invariant holds. We increment until , at which point we know that the symbols of were distinct. If is the length of the longest such substring we have seen so far, we record and . We then restore the invariant by setting , which has the effect of dropping the left occurrence of the repeated symbol , so that again contains distinct symbols. The runtime is clearly linear in . The only non-constant space usage is for . ∎
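The two-pointer scan described in the proof can be sketched as follows. This is an illustration under assumptions: the names `i`, `j`, `last`, and `best` are chosen here, and a dictionary stands in for the array indexed by symbol, so the running time is linear only in expectation.

```python
def longest_distinct_substring(s):
    """Return (start, end), 1-based and inclusive, of the leftmost longest
    substring of s whose symbols are pairwise distinct.

    Returns (0, 0) for the empty string.
    """
    last = {}      # last[c] = rightmost position of symbol c seen so far
    best = (0, 0)  # best window found so far
    best_len = 0
    i = 1          # left end of the current window
    for j, c in enumerate(s, start=1):
        # If c already occurs inside the window, drop everything up to
        # and including its previous occurrence to restore the invariant.
        if last.get(c, 0) >= i:
            i = last[c] + 1
        last[c] = j
        if j - i + 1 > best_len:
            best_len = j - i + 1
            best = (i, j)
    return best


print(longest_distinct_substring("abcabcbb"))
```

Both pointers only move to the right, which is what makes the scan linear in the length of the string.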

Obviously, the above algorithm can be used to efficiently compute -anti-powers having anti-period 1. We will use it as a building block for finding -anti-powers of all anti-periods.

4.2 Optimal algorithm for computing anti-powers

Let us now describe our algorithm. Firstly, observe that the maximum anti-period of a -anti-power within is . Our algorithm works in rounds, . In a generic round we will determine if contains (as a substring) a -anti-power of anti-period . Let be an integer name for substring amongst all substrings of length in — two substrings and have the same name if and only if the substrings are equal. Note that the number of names for any substring length is bounded above by , the length of the string. We can determine a suitable for all and in linear time from the names of substrings of length as follows. We create an array of pairs, , one for each position in the string. Initially, for all pairs. In round , we are computing the names of the substrings of length . We stably radix sort the pairs in time using as the sort key for pair . We then scan the sorted list of pairs, and for every run of adjacent pairs for which both and are equal, we assign them the same new name , overwriting their fields. After this scan, clearly only substrings and of length that are equal will have the same name because they had the same name and their last letters ( and ) are equal. We can now assign by scanning the list of pairs again and for each pair encountered setting .
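The naming step can be illustrated with a short sketch that derives the names of the length-p substrings from the names of the length-(p − 1) substrings and the last letter of each substring. The function name `next_names` and its parameters are hypothetical, and Python's comparison sort replaces the stable radix sort of the text, so one round costs O(n log n) here rather than O(n).

```python
def next_names(s, prev, p):
    """Name all substrings of length p of s (0-based start positions).

    prev[i] must be the name of s[i:i+p-1]; for p = 1 pass a list of
    equal values, since all empty substrings are equal.
    Equal substrings receive equal names, numbered from 1.
    """
    m = len(s) - p + 1  # number of substrings of length p
    # A length-p substring is determined by its length-(p-1) prefix
    # (represented by its name) together with its last letter.
    keys = [(prev[i], s[i + p - 1]) for i in range(m)]
    order = sorted(range(m), key=keys.__getitem__)
    names = [0] * m
    name = 0
    for r, i in enumerate(order):
        # A new name starts whenever the key changes in sorted order.
        if r == 0 or keys[i] != keys[order[r - 1]]:
            name += 1
        names[i] = name
    return names


s = "aabababbbabb"
n1 = next_names(s, [1] * len(s), 1)
n2 = next_names(s, n1, 2)
n3 = next_names(s, n2, 3)
print(n2)
print(n3)
```

Because the names come out of a sort, they also respect lexicographic order in this sketch, although the algorithm only needs equal substrings to share a name.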

To find a k-anti-power of anti-period p, we must find a set of k distinct substrings of length p, whose starting positions are spaced exactly p positions apart and so are all equal modulo p.

Let be the set of positions in that are equal to modulo , i.e., .

Let be the string of length formed by concatenating the values (in increasing order of ) for which . We can form in time by visiting each and computing in constant time. As observed above, any substring of length in that contains all-distinct letters corresponds to a -anti-power. In particular, if is made up of distinct letters, then is a -anti-power.

Thus, in round of our algorithm we compute for each . The total space and time required is . We then scan each of these strings in turn and detect substrings of length containing distinct letters, using the algorithm in the proof of Lemma 3. This process is denoted by function Distinct, in Line 4.2 of our Algorithm. Function Distinct outputs a set of starting and ending positions of -anti-powers whose anti-periods are and starting positions . The time required to scan each string is and so is in total for round . The extra space needed for each scan is for the array of previous positions.

Because each round takes O(n) time, and there are n/k rounds, the total running time to output all anti-powers of order k is O(n^2/k). Since we can reuse space between rounds, the total space usage is O(n).

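Algorithm AntiPowers(S, k)
  for p ← 1 to n/k do
    for i ← 1 to p do
      S' ← metastring of the names of the length-p substrings of S starting at positions equal to i modulo p
      AP ← Distinct(S', k)
      output AP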

p    1             2       2      3             3       3
i    1             1       2      1             2       3
S'   aabababbbabb  133434  22242  1263          245     434
AP   -             -       -      (1,9),(4,12)  (2,10)  -
Table 1: The step-by-step computations performed by Algorithm AntiPowers for input S = aabababbbabb and k = 3.
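The whole procedure can be sketched end to end and run on the string aabababbbabb with k = 3, the input whose metastrings appear in Table 1. This is an illustration under assumptions rather than the authors' implementation: the helper names are invented here, names are assigned with a comparison sort instead of a radix sort (so the sketch runs in O((n^2/k) log n) time), and `distinct_windows` uses a fixed-width window with a counter of repeated names in place of the two-pointer scan of Lemma 3.

```python
def distinct_windows(meta, k, first, p):
    """Report (start, end) pairs, 1-based inclusive, for every window of k
    consecutive pairwise-distinct names in meta.

    meta[t] names the substring of length p starting at position first + t*p.
    """
    hits = []
    count = {}
    dup = 0  # number of names occurring at least twice in the window
    for j, x in enumerate(meta):
        count[x] = count.get(x, 0) + 1
        if count[x] == 2:
            dup += 1
        if j >= k:  # slide: drop the name that left the window
            y = meta[j - k]
            count[y] -= 1
            if count[y] == 1:
                dup -= 1
        if j >= k - 1 and dup == 0:
            start = first + (j - k + 1) * p
            hits.append((start, start + k * p - 1))
    return hits


def anti_powers(s, k):
    """Return all substrings of s that are k-anti-powers, as (start, end)."""
    if k < 1:
        raise ValueError("k must be positive")
    n = len(s)
    out = []
    names = [1] * n  # all empty substrings are equal
    for p in range(1, n // k + 1):       # one round per anti-period
        m = n - p + 1
        keys = [(names[i], s[i + p - 1]) for i in range(m)]
        rank = {key: r for r, key in enumerate(sorted(set(keys)), start=1)}
        names = [rank[key] for key in keys]
        for i in range(p):               # one metastring per residue class
            out.extend(distinct_windows(names[i::p], k, i + 1, p))
    return out


print(anti_powers("aabababbbabb", 3))
```

On this input the sketch reports nothing for anti-periods 1 and 2. For p = 3 and i = 3 the metastring 434 repeats the name 4, which corresponds to S[3,11] = bab·abb·bab, so no anti-power is reported for that residue class either.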

5 Conclusions and Open Problems

The algorithm of the previous section is optimal in the sense that there are strings for which we must spend Ω(n^2/k) time simply to list the anti-powers of order k, because there are that many of them (as established in Section 3). One wonders, though, whether an output-sensitive algorithm is possible, one whose running time depends on the number of anti-powers of order k actually present in the input. Alternatively, do conditional lower bounds on anti-power computation exist?

Many interesting algorithmic problems concerning anti-powers remain. For example, suppose we are to preprocess and build a data structure so that later, given queries of the form , we have to determine quickly whether the substring is an anti-power of order . Using suffix trees w1973 and weighted ancestor queries GawrychowskiLN14 it is fairly straightforward to achieve query time, in space. Alternatively, by storing metastrings for all possible anti-periods, it is not difficult to arrive at a data structure that requires space and answers queries in time. Is it possible to achieve a space-time tradeoff between the extremes defined by these two solutions, or even better, to simultaneously achieve the minima of the space and query bounds?

Acknowledgements

Our sincere thanks go to the anonymous reviewers, whose comments materially improved our initial manuscript. Golnaz Badkobeh is partially supported by the Leverhulme Trust on the Leverhulme Early Career Scheme. Simon J. Puglisi is supported by the Academy of Finland via grant 294143.

References

  • [1] Golnaz Badkobeh, Maxime Crochemore, Costas S. Iliopoulos, and Marcin Kubica. Text redundancies. In Valerie Berthé and Michel Rigo, editors, Combinatorics, Words and Symbolic Dynamics, pages 151–174. Cambridge University Press, 2015.
  • [2] Maxime Crochemore, Lucian Ilie, and Wojciech Rytter. Repetitions in strings: Algorithms and combinatorics. Theoretical Computer Science, 410(50):5227 – 5235, 2009.
  • [3] Colin Defant. Anti-Power Prefixes of the Thue-Morse Word. Electronic Journal of Combinatorics, 24(1):#P1.32, 2017.
  • [4] Gabriele Fici, Antonio Restivo, Manuel Silva, and Luca Q. Zamboni. Anti-powers in infinite words. In 43rd International Colloquium on Automata, Languages, and Programming, (ICALP), volume 55 of LIPIcs, pages 124:1–124:9. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2016.
  • [5] Gabriele Fici, Antonio Restivo, Manuel Silva, and Luca Q. Zamboni. Anti-powers in infinite words. J. Comb. Theory, Ser. A, 157:109–119, 2018.
  • [6] Pawel Gawrychowski, Moshe Lewenstein, and Patrick K. Nicholson. Weighted ancestors in suffix trees. In Proc. 22nd Annual European Symposium on Algorithms (ESA), volume 8737 of Lecture Notes in Computer Science, pages 455–466. Springer, 2014.
  • [7] Shyam Narayanan. Functions on antipower prefix lengths of the Thue-Morse word. https://arxiv.org/abs/1705.06310.
  • [8] P. Weiner. Linear pattern matching algorithms. In IEEE 14th Annual Symposium on Switching and Automata Theory, pages 1–11. IEEE, 1973.