An Efficient Implementation of Manacher's Algorithm

03/17/2020
by   Shoupu Wan, et al.

Manacher's algorithm has been shown to be optimal for the longest palindromic substring problem. Many existing implementations of this algorithm, however, require in-memory construction of an augmented string that is twice as long as the original string. Although this preprocessing has found widespread use, we found that it is neither economical nor necessary. We present a more efficient implementation of Manacher's algorithm based on index mapping that makes the string augmentation process obsolete.

1 Introduction

Finding a longest palindromic substring (LPS) in a given string is a fundamentally important problem, as it has widespread applications in mathematics, physics, chemistry, genetics, music, etc. [5, 6, 4, 3, 2]. To anyone familiar with the song “Rhythm of the Rain”, the prelude may be very impressive; it is an example of a musical palindrome. In genetics, palindromic sequences have an important capability: forming hairpins [1]. It is amazing to learn that palindromes have played such an important role in life from the very beginning. Here I pick the LPS problem for two reasons. First, this problem is closely related to the study of symmetry. Oftentimes, uncovering the underlying symmetry is the key to a great solution. This problem exemplifies how the conscious application of mathematical analysis can help devise an algorithm. Second, this problem is a perfect case study for demonstrating how to refactor, step by step, messy and monolithic code bloated with duplication into a succinct and modular solution free from duplication. Most of the techniques are discussed in depth in the book [7].

Here is the structure of this article. In section 2, we state the problem. In section 3, reflection symmetry is explained, together with the necessary mathematical context, with a view to its application to the LPS problem. In section 4, Manacher’s algorithm is presented together with some intuition. In section 5, we discuss existing solutions based on string-augmentation preprocessing. In section 6 and section 7, we present the new index-mapping approach to implementing Manacher’s algorithm. Also in these sections, we perform a multi-stage refactoring that eventually leads to a modular and highly readable solution to the LPS problem. Finally, in section 8, we put all the implementations presented in this article to the test and compare their performance. All solutions provided in this article are implemented in Java.

2 The Problem Statement

The LPS problem takes various forms in the literature. For the sake of this article, we state the problem as

“Given an input string, find the longest palindromic substring in it (or one of them if there are more).”

According to the Merriam-Webster dictionary, a palindrome is “a word, phrase, or sequence that reads the same backward as forward”. The length of a palindromic string can be either odd or even. Accordingly, we may classify palindromic strings as odd or even. For an odd palindrome, its center of symmetry, i.e., palindromic center or simply center, falls on a character. For a nonempty even palindrome, its center falls between two characters, which in this article will be referred to as the left center and the right center, respectively. Obviously, an empty string is also palindromic; it is the trivial case. A palindromic substring (PSS) of a string is any substring that is a palindrome. For a string of length $n$, there are $2n + 1$ palindromic centers, albeit some of them may be trivial. The sole PSS of an empty string is trivial. The first and last PSS’s of a nonempty string are trivial. For a specific non-trivial palindromic center, there may be a series of co-centered palindromic substrings. We call the longest among these co-centered palindromic substrings the ‘prime palindromic substring’. Without loss of generality, we will limit our discussion to prime palindromic substrings only.

Solving the problem is not hard if optimality is not a concern. One possible solution, for example, is the following:

Iterate through each possible center and for each center, calculate the length of PSS. To calculate the length of PSS at a specific center, one can dispatch two indexes off the center outwardly in opposite directions symmetrically. If, at any step, a mismatch is encountered, stop. The substring lies between the two indexes.
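The naive procedure above can be sketched in Java as follows. This is a minimal illustration, not code from the article: the class name NaiveLps and its method are hypothetical, and the centers are numbered 0 to 2n so that odd numbers fall on characters and even numbers fall between them.

```java
// Hypothetical helper class; the names are illustrative only.
final class NaiveLps {
    // Returns one longest palindromic substring by expanding around each of the 2n + 1 centers.
    static String longestPalindrome(String s) {
        int n = s.length();
        int bestStart = 0, bestEnd = 0; // half-open range [bestStart, bestEnd)
        for (int c = 0; c <= 2 * n; ++c) {
            // Odd c: centered on character c / 2.
            // Even c: centered between characters c / 2 - 1 and c / 2.
            int left = (c % 2 == 1) ? c / 2 : c / 2 - 1;
            int right = c / 2;
            // Move the two indexes outward symmetrically until a mismatch or an end is hit.
            while (left >= 0 && right < n && s.charAt(left) == s.charAt(right)) {
                --left;
                ++right;
            }
            // The palindrome occupies [left + 1, right).
            if (right - left - 1 > bestEnd - bestStart) {
                bestStart = left + 1;
                bestEnd = right;
            }
        }
        return s.substring(bestStart, bestEnd);
    }
}
```

Each center may cost up to n comparisons, which is where the quadratic worst case comes from (for example, on a string made of a single repeated character).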

The runtime complexity of such a naive solution is $O(n^2)$. The difficulty of this problem lies in beating the quadratic runtime. In his 1975 paper, Glenn Manacher presented an algorithm with linear runtime. It was later found that his method works not only for prefix PSS’s but for all PSS’s. This algorithm is now known as Manacher’s algorithm [5]. We will dedicate the next few sections to a thorough understanding of this algorithm. First, we need a little bit of math about symmetry.

3 Reflection symmetry

Reflection symmetry, a.k.a. mirror-image symmetry (https://en.wikipedia.org/wiki/Reflection_symmetry), refers to spatial invariance under a reflection transformation. A reflection transformation is the operation that transforms coordinates to their mirror images w.r.t. a fixed point, which we will refer to as the axis of symmetry or center of symmetry. In one-dimensional (1D) space, a coordinate, $x$, and its transformed coordinate, $x'$, w.r.t. axis $c$ are related by the equation:

$x' = 2c - x$    (1)

Figure 1: Two nearby axes of reflection symmetry partition the entire space periodically and infinitely. Overlapping palindromes also form similar periodic patterns. Panels: (a) an axis of reflection symmetry; (b) another axis of symmetry; (c) two concurrent axes; (d) overlapping palindromes.

As shown in Figure 1(a), an axis of symmetry $c_1$ partitions the entire 1D space into two half-spaces about itself, one on the left and one on the right. There is a one-to-one mapping between the points in the two half-spaces. A second axis $c_2$ does similarly (Figure 1(b)). It is truly magical when two axes of reflection symmetry are present near each other along the $x$-axis. By repeatedly applying the reflection transformation, one finds infinitely many axes of symmetry alternating along the $x$-axis, and collectively they partition the entire space into periodic regions with period $2d$, where $d$ is the distance between the two axes of symmetry (see Figure 1(c)). This effect may be familiar to you if you have ever stepped in between two parallel mirrors: an array of clones of ‘you’ appears, alternately facing toward and away from you, aligned and coordinated. This symmetric configuration in a discretized 1D space resulting from multiple reflections is the key intuition behind the LPS problem.
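The period can be checked with a short derivation. As a sketch, assume the two axes sit at coordinates $c_1$ and $c_2$ with $d = c_2 - c_1$ (symbol names chosen here for illustration), and use the reflection rule that maps $x$ to $2c - x$ about an axis $c$, the same arithmetic as toMirrorImage in Table 1. Reflecting first about $c_1$ and then about $c_2$ gives:

```latex
\begin{align*}
x'  &= 2c_1 - x, \\
x'' &= 2c_2 - x' = 2c_2 - (2c_1 - x) = x + 2(c_2 - c_1) = x + 2d .
\end{align*}
```

Two successive reflections therefore amount to a translation by $2d$, so whatever appears at $x$ reappears at $x + 2d$, which is the periodicity described above.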

Figure 1(d) shows an infinite palindromic string. In reality, however, the aforementioned symmetry does not exist in full, as there is no infinite space or string. Nevertheless, the argument still holds for a finite string within the overlapping regions. Of prime interest to us is a collection of ‘crowded’, overlapping PSS’s. Under the wings of a large PSS, some shorter palindromes may take shelter. The larger PSS that covers the shorter ones projects their mirror images onto the opposite wing because of reflection symmetry (see Figure 2). This reflective projection may be applied recursively as many times as there are enclosing palindromes. As a result, a substring $s$ may be projected to its mirror image $s'$, which, in turn, may be projected to its image $s''$, and so on.

Figure 2: Some examples of palindromic substrings in the string “bananas”.

4 Manacher’s Algorithm

Equipped with this understanding of the reflection symmetry, we are in a better position to crack the mystery of Manacher’s algorithm. Manacher’s algorithm leans heavily on cached PSS’s. The reflective projection relates the to-be-calculated palindromic center with its mirror image in the cache and this is the key step to avoid repeated character comparisons.

Take the string “bananas” as an example, as shown in Figure 2. Knowing that the PSS “ana” centered on the first ‘n’ has length 3, and that the PSS “anana” mirrors it onto the PSS centered on the second ‘n’, we can skip all character pairs but the outermost one, which is ‘n’ and ‘s’. Upon seeing that they do not match, the length of the PSS centered on the second ‘n’ is finally pinned at 3. So with one additional character comparison, we obtain the length of that PSS. That is where the savings come from.

To sum up, let us iterate through the string from left to right and cache the result in an array, e.g.,

index: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
length: 0, 1, 0, 1, 0, 3, 0, 5, 0, 3, 0, 1, 0, 1, 0
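The cached array above can be reproduced by brute force, which is handy for checking any faster implementation. The following is only a sketch with a hypothetical class name (PssTable); it expands around each of the 2n + 1 centers rather than applying Manacher's algorithm.

```java
// Hypothetical checker; not part of the article's listings.
final class PssTable {
    // Returns, for every center 0..2n, the length of the prime PSS at that center.
    static int[] pssLengths(String s) {
        int n = s.length();
        int[] pss = new int[2 * n + 1];
        for (int c = 0; c <= 2 * n; ++c) {
            // Odd c sits on character c / 2; even c sits between c / 2 - 1 and c / 2.
            int left = (c % 2 == 1) ? c / 2 : c / 2 - 1;
            int right = c / 2;
            while (left >= 0 && right < n && s.charAt(left) == s.charAt(right)) {
                --left;
                ++right;
            }
            pss[c] = right - left - 1; // number of original characters in the palindrome
        }
        return pss;
    }
}
```

For the input “bananas”, pssLengths returns the 15 lengths listed above, with the maximum 5 at index 7 (the palindrome “anana”).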

At each step, we also keep a reference PSS, the one that reaches farthest to the right. When examining a new center, we first check its mirror image with respect to the reference. Depending on the relationship between the mirror-image PSS and the reference PSS, some or all character comparisons for the new center may be spared, just as in the case of “bananas”. Without further ado, Manacher’s algorithm is listed below.

  1. Initialize an array pss of size 2 * n + 1, whose element at index i stores the length of the PSS centered at i.

  2. Initialize refCenter = 0, which stores the palindromic center whose right wing reaches farthest to the right in each iteration of the main loop.

  3. For each index j = 1, 2, ..., 2 * n in the augmented array:

    1. If j lies outside of PSS(refCenter) to the right, calculate PSS(j) from scratch; update refCenter to j; skip to the next iteration.

    2. Otherwise, find the mirror image, k, of j w.r.t. refCenter.

    3. If PSS(k) is completely contained in PSS(refCenter), without touching its left boundary, then pss(j) = pss(k).

    4. Otherwise, we need to calculate PSS(j), but only the portion outside of PSS(refCenter), if any; update refCenter to j.
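The three-way decision in step 3 can be isolated in a few lines. The sketch below is illustrative only: the class ManacherStep, the enum Action, and the method classify are hypothetical names, and pss is assumed to hold PSS lengths measured in original characters, as in the “bananas” array above.

```java
// Hypothetical illustration of the case analysis in step 3.
final class ManacherStep {
    enum Action { FROM_SCRATCH, COPY_MIRROR, EXTEND_FROM_BOUND }

    // Decides how PSS(j) is obtained, given the reference center and the cached lengths.
    static Action classify(int[] pss, int refCenter, int j) {
        int rightBound = refCenter + pss[refCenter];
        if (rightBound <= j) {
            return Action.FROM_SCRATCH;      // j is not covered by the reference PSS
        }
        int k = 2 * refCenter - j;           // mirror image of j about refCenter
        int leftBound = refCenter - pss[refCenter];
        if (k - pss[k] > leftBound) {
            return Action.COPY_MIRROR;       // PSS(k) lies strictly inside the reference PSS
        }
        return Action.EXTEND_FROM_BOUND;     // only characters beyond rightBound need checking
    }
}
```

With the “bananas” array and refCenter = 7, index 8 falls in the copy case, index 9 must be extended from the right bound 12 (the single ‘n’ versus ‘s’ comparison described earlier), and index 12 is computed from scratch.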

5 Augmented-String Implementations

I hope you have grasped the gist of Manacher’s algorithm before we talk about its implementations. Traditionally, implementations of Manacher’s algorithm augment the original string by inserting a dummy character between each adjacent pair of characters in the original string. For uniformity, dummy characters are also added at both ends (a single dummy character in the case of an empty string). By doing so, we establish a one-to-one mapping between the PSS’s in the original string and the odd PSS’s in the augmented string. So for “bananas”, the augmented string would be “ b a n a n a s ” if the blank space is chosen as the augmenting character. It is important that the chosen dummy character be absent from the original string. Otherwise, spurious results may arise.

With the literally augmented string, implementing Manacher’s algorithm becomes straightforward. One can refer to several published implementations in different programming languages. There is a Java version from the CS department of Princeton University (https://algs4.cs.princeton.edu/53substring/Manacher.java.html). There is a Python version by Fred Akalin (https://www.akalin.com/longest-palindrome-linear-time). There is even a Haskell implementation, along with a discussion of the algorithm itself, in Johan Jeuring’s blog (http://finding-palindromes.blogspot.com/2012/05/finding-palindromes-efficiently.html) and in [4]. Lastly, an implementation of my own is provided for reference (Listing 1).
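To make the augmentation step concrete, here is a minimal Java sketch of it. The class Augmenter and its method are hypothetical names introduced for illustration; the check that rejects a dummy character already present in the input reflects the requirement stated above.

```java
// Hypothetical helper showing the literal string augmentation.
final class Augmenter {
    // Surrounds every character of s with the dummy character, e.g. "ab" -> "#a#b#".
    static String augment(String s, char dummy) {
        if (s.indexOf(dummy) >= 0) {
            throw new IllegalArgumentException("dummy character occurs in the input");
        }
        StringBuilder sb = new StringBuilder(2 * s.length() + 1);
        sb.append(dummy);
        for (int i = 0; i < s.length(); ++i) {
            sb.append(s.charAt(i)).append(dummy);
        }
        return sb.toString(); // length is always 2n + 1
    }
}
```

The result always has 2n + 1 characters, which is the extra memory that the later sections set out to avoid.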

public String longestPalindrome(String input) {
    // Build the augmented string: '\0' at even indexes, input characters at odd indexes.
    // This assumes that '\0' does not occur in the input.
    final int n = input.length() * 2 + 1;
    char[] aug = new char[n];
    for (int i = 0; i < n; ++i) {
        aug[i] = (i & 1) == 0 ? 0 : input.charAt(i / 2);
    }
    int maxStart = 0, maxEnd = 0;
    // pss[j] = length, in the augmented string, of the PSS centered on j
    int[] pss = new int[n];
    pss[0] = 1;
    for (int center = 0, j = 1; j < n; ++j) {
        int wing = pss[center] / 2;
        if (j < center + wing) {
            int jmirror = 2 * center - j;
            int jwing = pss[jmirror] / 2;
            // the mirror PSS lies strictly inside the reference PSS: reuse its length
            if (jmirror - jwing > center - wing) {
                pss[j] = pss[jmirror];
                continue;
            }
        }
        // get the length of the PSS centered on j, comparing only beyond the reference PSS
        for (int i = center + wing + 1; ; i++) {
            if (2 * j - i < 0 || i == n || aug[2 * j - i] != aug[i]) {
                pss[j] = 2 * (i - 1 - j) + 1;
                break;
            }
        }
        center = j; // j becomes the new reference center
        // pss[center] / 2 equals the PSS length in the original string
        if (maxEnd - maxStart < pss[center] / 2) {
            maxStart = (center + 1 - pss[center] / 2) / 2;
            maxEnd = (center + 1 + pss[center] / 2) / 2;
        }
    } //end for
    return input.substring(maxStart, maxEnd);
}
Listing 1: An implementation of Manacher’s algorithm based on literally augmented string.

6 Virtual Augmentation

Even though the string-augmentation approach has found widespread use in implementations of Manacher’s algorithm, it is neither convenient nor necessary. For large strings such as DNA chains in genome sequencing, it is costly to construct the augmented string with its doubled memory footprint [6]. Furthermore, it is onerous and sometimes quite annoying to have to identify a suitable dummy character for the augmentation process. In this section, we seek a more concise and economical way to implement Manacher’s algorithm, one that spares the string augmentation process.

Arithmetic        | Semantic                                                                      | Helper Function
i / 2             | The right center in the original string, 0-based (only for even i); the left center is i / 2 - 1 |
(i - 1) / 2       | The center in the original string (only for odd i)                           |
2 * i - x         | Mirror image of x about the center in the augmented string                   | toMirrorImage
i - pss[i]        | Left bounding index in the augmented string                                  | getLeftBound
i + pss[i]        | Right bounding index in the augmented string                                 | getRightBound
(i - pss[i]) / 2  | Left bounding index in the original string (*)                               |
(i + pss[i]) / 2  | Right bounding index (exclusive) in the original string (*)                  |
(*) The division is exact: if i is even, the length of the PSS centered on i can only be even. Conversely, if i is odd, the length of the PSS centered on i can only be odd.
Table 1: Semantics of index arithmetic expressions given i being the palindromic center and x an arbitrary index, both in the augmented string; pss is the array storing lengths of palindromic substrings.
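As a worked instance of the expressions in Table 1, the following Java sketch maps a center in the virtual augmented string back to a substring of the original. The class IndexMappingDemo is a hypothetical illustration; unlike the article's helper functions it takes the PSS length as an explicit argument instead of reading it from the pss array.

```java
// Hypothetical demo of the index arithmetic from Table 1.
final class IndexMappingDemo {
    // Mirror image of augmented index x about center i.
    static int toMirrorImage(int i, int x) {
        return 2 * i - x;
    }

    // Left and right bounding indexes in the augmented string for a PSS of length len at center i.
    static int leftBound(int i, int len) {
        return i - len;
    }

    static int rightBound(int i, int len) {
        return i + len;
    }

    // Halving the augmented bounds yields the half-open range in the original string.
    static String originalSubstring(String s, int i, int len) {
        return s.substring(leftBound(i, len) / 2, rightBound(i, len) / 2);
    }
}
```

For “bananas”, center 7 with length 5 gives augmented bounds 2 and 12, hence the original range [1, 6), which is “anana”.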

The key is to establish an index mapping between the original and the augmented string. Consider the string “bananas” (Figure 2) and its augmented string “ b a n a n a s ”. Let us establish the correspondence rules between the two. First, each character in the original string corresponds to a character in the augmented string at an odd index. So there is a one-to-one correspondence between the indexes of the original string and the odd indexes of the augmented string. The even indexes of the augmented string, in turn, correspond to the boundaries between adjacent characters and to the two ends of the string. Together, they establish a one-to-one mapping between the PSS’s of the original string and the odd PSS’s of the augmented string. It is easy to see that a PSS in the augmented string is completely determined by the corresponding PSS in the original string. The added augmentation characters do not interfere at all. Therefore, if we can formulate the PSS’s of the original string in terms of the indexes of the hypothetically augmented string, we are freed from the need to construct the augmented string in memory. We name this method “index mapping”. Accordingly, the process of relying on mapped indexes for calculating palindromic substrings is named “virtualized augmentation”. Not only does virtualized augmentation make the double-sized memory consumption obsolete, but it also frees us from the burden of choosing a dummy character. Some arithmetic expressions and their semantics are tabulated in Table 1 for reference. Based on the idea of virtual augmentation, we arrive at a new approach to implementing Manacher’s algorithm (Listing 2). Note that the solution in Listing 2 has $O(n)$ runtime as well as $O(n)$ memory complexity.

public String longestPalindrome(String input) {
    // pss[i] = length of the PSS centered on virtual (augmented) index i
    int[] pss = new int[input.length() * 2 + 1];
    for (int i = 1, refCenter = 0; i < pss.length; ++i) {
        // refCenter = center whose PSS reaches farthest to the right
        if (refCenter + pss[refCenter] <= i) {
            // i is not covered by the reference PSS: compute from scratch
            pss[i] = palength(input, i, i);
            refCenter = i;
        } else {
            int im = refCenter * 2 - i; // mirror image of i about refCenter
            //assert im >= 0
            if (im - pss[im] > refCenter - pss[refCenter]) {
                // the mirror PSS lies strictly inside the reference PSS
                pss[i] = pss[im];
            } else {
                // compare only beyond the right bound of the reference PSS
                pss[i] = palength(input, i, refCenter + pss[refCenter]);
                refCenter = i;
            }
        }
    }

    // scan for the center of the longest PSS
    for (int i = 0, refCenter = 0; ; ++i) {
        if (i == pss.length) {
            return substring(input, refCenter, pss[refCenter]);
        }
        if (pss[i] > pss[refCenter]) {
            refCenter = i;
        }
    }
}

String substring(String str, int c, int len) {
    //if c is even, len can only be even
    //if c is odd, len can only be odd
    int s = (c - len) / 2;
    int e = (c + len) / 2;
    return str.substring(s, e);
}

// Length of the PSS centered on virtual index c, scanning outward from virtual index i
int palength(String s, int c, int i) {
    final int n = 2 * s.length() + 1;
    for (int mi = 2 * c - i; ; ++i, --mi) {
        // even indexes are virtual dummies and always match; odd indexes map to real characters
        if (mi < 0 || i >= n || ((i & 1) == 1 && s.charAt(i / 2) != s.charAt(mi / 2))) {
            return i - c - 1;
        }
    }
}
Listing 2: Implementation of Manacher’s algorithm using index mapping.

7 Solutions for Readability

With virtual augmentation, we resolved the memory footprint issue and got rid of the troublesome dummy character. The mission is accomplished, but by no means should we settle here. The code in Listing 2 is like a bowl of spaghetti, isn’t it? Even though the code is divided into three functions, its cleanliness still suffers. It took some courage for me to read it and figure out what each line does just a couple of days after I wrote it. I cannot imagine it would be any easier for someone who has just come across the code. A principle that all developers should stick to is “If it is not readable, it is not acceptable”. Our next goal is to seek a more readable way to implement it.

At a glance, the code is packed with arithmetic expressions, some for symmetry transformations, some for index lookups, etc. These all result from the virtual augmentation. But if you look carefully, you may spot bloated repetitions of some arithmetic operations. Some are hard “copy-n-paste” duplicates of others, while more belong to the category of so-called “soft duplicates” [7]. Another problem with the implementation is that it is monolithic. The same function does too many things at once, violating the Single Responsibility Principle (https://en.wikipedia.org/wiki/Single-responsibility_principle). A well-organized solution in this context should consist of a group of meaningful, single-purpose, and reusable functions.

So we have two related tasks: eliminating duplicate code and making the solution more modular. Our starting point is Listing 2. We may approach both tasks by first understanding the semantics of the operations involved, especially those that are repeated. Where it helps, we can factor them out into helper functions.

Take as an example the following condition in Listing 2:

    if (im - pss[im] > refCenter - pss[refCenter])

It may be obscure to untrained eyes. But the pattern is “Given an index, return the index minus the element at that index”. So we can factor it out into a function, e.g., getLeftBound (see Table 1). All occurrences of the same logic may be replaced by a call to getLeftBound, which alone removes the repeated copies of this expression. In fact, similar refactoring may be performed for the other entries in Table 1. By doing so, we first modularize the solution by creating succinct, easily understandable helper functions. The helper functions, in turn, are reused to reduce code duplication: two birds with one stone. The technique is discussed at length in the book [7].

public class LongestPalindromeSolver {

    private String input;
    // The i-th element of array 'pss' records the max length of the palindrome centered
    // on a char or between two adjacent chars of the input string, depending on the parity of i
    // (char positions are 0-based):
    // if i is even, it is centered between chars i/2 - 1 and i/2;
    // if i is odd, it is centered on char (i - 1)/2
    // pss[0] = 0 by definition
    private int[] pss;

    public String longestPalindrome(String input) {
        this.input = input;
        pss = new int[input.length() * 2 + 1];
        solve();
        return substring(argmax());
    }

    // the helper methods solve, substring, argmax, etc. are given in Listing 4
}
Listing 3: Modularized solution of the LPS problem.

Our refactored solution, the class LongestPalindromeSolver, is listed in Listing 3 and Listing 4. The class has two private fields, one for the input and the other for the output of the algorithm (the pss array). The only point of entry is the public method longestPalindrome, which takes care of bookkeeping and dispatching. All other methods are helper functions listed in Listing 4 and explained below.

/*
 * Implement Manacher’s algorithm
 */
void solve() {
    for (int i = 1, refCenter = 0; i < pss.length; ++i) {
        // refCenter = center whose PSS reaches farthest to the right
        if (getRightBound(refCenter) <= i) {
            pss[i] = palength(i, i);
            refCenter = i; // i becomes the rightmost palindrome
            continue;
        }
        int mi = toMirrorImage(refCenter, i);
        //assert mi >= 0
        if (getLeftBound(mi) > getLeftBound(refCenter)) {
            // the mirror palindrome is strictly wrapped by the reference palindrome
            pss[i] = pss[mi];
        } else {
            // calculate only the part outside the reference palindrome
            pss[i] = palength(i, getRightBound(refCenter));
            refCenter = i;
        }
    }
}

int getLeftBound(int i) {
    return i - pss[i];
}

int getRightBound(int i) {
    return i + pss[i];
}

int toMirrorImage(int axis, int x) {
    return 2 * axis - x;
}

/*
 * Determine if there is a mismatch between the virtual chars at i and j.
 * A mismatch happens if the indexes are odd (i.e., they map to real chars)
 * and the corresponding chars are not equal; i and j must have the same parity.
 */
boolean isMismatch(int i, int j) {
    return (i & 1) == 1 && input.charAt(i / 2) != input.charAt(j / 2);
}

/*
 * Find the length of the palindrome in string input centered at center,
 * scanning outward from the virtual index 'index'.
 */
int palength(int center, int index) {
    for (int mi = toMirrorImage(center, index); ; ++index, --mi) {
        // stop on one of the conditions: reaching the left end,
        // reaching the right end, or encountering a char mismatch
        if (mi < 0 || index >= pss.length || isMismatch(index, mi)) {
            return index - center - 1;
        }
    }
}

String substring(int center) {
    int left = getLeftBound(center) / 2;
    int right = getRightBound(center) / 2;
    return input.substring(left, right);
}

/*
 * Find the index of the maximum palindromic substring
 */
int argmax() {
    int maxi = 0;
    for (int i = 0; i < pss.length; ++i) {
        if (pss[i] > pss[maxi]) {
            maxi = i;
        }
    }
    return maxi;
}
Listing 4: The refactored helper functions.

The solve function implements the bulk of Manacher’s algorithm. Methods getLeftBound, getRightBound, and toMirrorImage each correspond to an index-mapping expression in Table 1, while palength measures the palindrome around a given center. Method isMismatch checks for a pairwise character mismatch. Method substring constructs the final result, the longest palindromic substring. Method argmax is, as its name suggests, a quick-and-dirty implementation of the mathematical function of the same name.

8 Experiment

To test the performance of the implementations and catch possible regressions, we designed a simple experiment to compare the implementations listed in this article. We randomly generated strings of various lengths over different alphabets as the testing benchmarks. To reduce error, each run is repeated three times and the average is taken. We found no strong correlation between runtime and the choice of alphabet or the length of the longest palindromic substring. The linearity of runtime versus input size stands out clearly in Table 2, as expected. In summary, our new index-mapping based implementations perform similarly to the approach based on string augmentation but are more efficient in terms of memory footprint. Where the implementation with the augmented string fails, the virtualized augmentation approach still runs successfully. Besides the gain in readability, there is also a slight improvement in runtime in our modularized solution.

Length String Augmentation Index Mapping Modular Solution
OutOfMemory
Table 2: Runtime measurements for the three implementations: “String Augmentation” (Listing 1), “Index Mapping” (Listing 2), and “Modularized Index Mapping” (Listing 3).
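An experiment of this kind could be set up roughly as follows. This is a hedged sketch and not the harness used for Table 2: the class LpsBenchmark, its method names, the seed parameter, and the use of System.nanoTime are all assumptions, and JIT warm-up and garbage-collection effects are ignored.

```java
import java.util.Random;
import java.util.function.Function;

// Hypothetical timing harness; not the one used to produce Table 2.
final class LpsBenchmark {
    // Builds a pseudo-random string of the given length over the given alphabet.
    static String randomString(int length, String alphabet, long seed) {
        Random rng = new Random(seed);
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; ++i) {
            sb.append(alphabet.charAt(rng.nextInt(alphabet.length())));
        }
        return sb.toString();
    }

    // Runs the solver `repeats` times on the input and returns the mean wall-clock time in ms.
    static double averageMillis(Function<String, String> solver, String input, int repeats) {
        long totalNanos = 0;
        for (int r = 0; r < repeats; ++r) {
            long start = System.nanoTime();
            String result = solver.apply(input);
            totalNanos += System.nanoTime() - start;
            if (result == null) {
                throw new IllegalStateException("solver returned null");
            }
        }
        return totalNanos / 1e6 / repeats;
    }
}
```

A solver is passed in as a function from input string to result, for example new LongestPalindromeSolver()::longestPalindrome with three repeats per input, matching the averaging described above. For more reliable numbers, a dedicated benchmarking framework would be preferable to hand-rolled timing.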

9 Conclusion

In conclusion, we went through the longest palindromic substring problem as a case study. We discussed the reflection symmetry needed to understand Manacher’s algorithm. We presented a novel implementation of Manacher’s algorithm that uses index mapping to avoid the tedious and costly string augmentation. We compared the new approach against string augmentation in terms of memory footprint and runtime. Using the techniques presented in the book [7], we refactored the monolithic solution with its bloated duplication into a more modular and readable one.

References

  • [1] S. Larionov, A. Loskutov, and E. Ryadchenko (2008) Chromosome evolution with naked eye: palindromic context of the life origin. Chaos 18 (2), pp. 013105.
  • [2] M. Crochemore and W. Rytter (1994) Text algorithms. Maxime Crochemore.
  • [3] Z. Galil (1981) String matching in real time. Journal of the ACM (JACM) 28 (1), pp. 134–149.
  • [4] J. T. Jeuring (1993) Theories for algorithm calculation. Utrecht University.
  • [5] G. Manacher (1975) A new linear-time “on-line” algorithm for finding the smallest initial palindrome of a string. J. ACM 22 (3), pp. 346–351.
  • [6] H. Shiu, K. Ng, J. Fang, R. C. Lee, and C. Huang (2010) Data hiding methods based upon DNA sequences. Information Sciences 180 (11), pp. 2196–2208.
  • [7] S. Wan (approx. 2020) Lean code. TBD.