The integral relationship between the suffix array (SA) and Burrows–Wheeler transform (BWT) is explored by Adjeroh et al., who also illustrate the versatility of the BWT beyond its original motivation in lossless block compression. BWT applications include compressed index structures using backward search pattern matching, multimedia information retrieval, and bioinformatics sequence processing; the BWT is also at the heart of the bzip2 suite of text compressors. By its association with the BWT, this also indicates the importance of the SA data structure and hence our interest in exploring its combinatorial properties.
These combinatorial properties can be useful when checking the performance or integrity of string algorithms or data structures on string sequences in testbeds when properties of the employed string sequences are well understood. In particular, due to current trends involving massive data sets, indexing data structures need to work in external memory or on distributed systems. For devising a solution adaptable to these scenarios, it is crucial to test whether the computed index (consisting of the suffix array or the BWT, for instance) is correctly stored on the hard disk or on the computing nodes, respectively. This test is more cumbersome than in the case of a single machine working only with its own RAM. One way to test is to compute the index for an instance whose index shape can be easily verified. For example, one could check the validity of the computed BWT on a Fibonacci word since the shape of its BWT is known [39, 13, 45].
Other studies based on Fibonacci words are the suffix tree  or the Lempel-Ziv 77 (LZ77) factorization . In , the suffix array and its inverse of each even Fibonacci word is studied as an arithmetic progression. In this study, the authors, like many at that time, did not append the artificial $ delimiter (also known as a sentinel) to the input string, thus allowing suffixes to be prefixes of other suffixes. This small fact makes the definition of for a string with suffix array incompatible with the traditional BWT defined on the BWT matrix, namely the lexicographic sorting of all cyclic rotations of the string .
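For instance, the string bab has the suffix array [2, 3, 1] and the SA-based BWT bab, while bab$ has the suffix array [4, 2, 3, 1] and the SA-based BWT bba$. However, the traditional BWT constructed by reading the last characters of the lexicographically sorted cyclic rotations of bab yields bba, which is equal to the SA-based BWT of bab$ after removing the $ character.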
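The discrepancy between the two BWT definitions can be reproduced with a minimal sketch. The function names below are our own illustrative choices, suffixes and rotations are sorted naively, and we assume that $ sorts before the letters (as it does in ASCII).

```python
def suffix_array(text):
    """Return the 1-based suffix array of text by naive suffix sorting."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def bwt_from_sa(text):
    """SA-based BWT: the character preceding each suffix in sorted order.

    For the suffix starting at position 1 we wrap around to the last
    character (text[i - 2] with i = 1 is text[-1]).
    """
    return "".join(text[i - 2] for i in suffix_array(text))


def bwt_from_rotations(text):
    """Rotation-based BWT: last column of the sorted cyclic rotations."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


print(suffix_array("bab"), bwt_from_sa("bab"))    # [2, 3, 1] bab
print(suffix_array("bab$"), bwt_from_sa("bab$"))  # [4, 2, 3, 1] bba$
print(bwt_from_rotations("bab"))                  # bba
```

Without the delimiter the two constructions disagree on bab, whereas on bab$ both return bba$.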
Note that not all strings are in the BWT image. An -time algorithm is given by Giuliani et al.  for identifying all the positions in a string where a $ can be inserted into so that becomes the BWT image of a string ending with $.
Despite this incompatibility in the suffix array based definition of the BWT, we can still observe a regularity for even Fibonacci words [34, Sect. 5]. Similarly, both methods for constructing the BWT are compatible when the string consists of a Lyndon word. The authors of [34, Remark 1] also observed similar characteristics for other, more peculiar string sequences. For the general case, the $ delimiter makes both methods equivalent, however the suffix array approach is typically preferred as it requires time  compared to with the BWT matrix method . By utilizing combinatorial properties of the BWT, an in-place algorithm is given by Crochemore et al. , which avoids the need for explicit storage for the suffix sort and output arrays, and runs in time using extra memory (apart from storing the input text). Köppl et al. [33, Sect. 5.1] adapted this algorithm to compute the traditional BWT within the same space and time bounds.
Up to now, it has remained unknown whether we can formulate a class of string sequences for which we can give the shape of the suffix array as an arithmetic progression (independent of the $ delimiter). With this article, we address this question and establish a correspondence between strings and suffix arrays generated by arithmetic progressions. Calling a permutation of integers arithmetically progressed if all pairs of successive entries of the permutation have the same difference modulo , we show that an arithmetically progressed permutation coincides with the suffix array of a unary, binary or ternary string. We analyze the conditions of a given arithmetically progressed permutation under which we can find a uniquely defined string over either a unary, a binary, or ternary alphabet having as its suffix array.
The simplest case is for unary alphabets: Given the unary alphabet and a string of length over , is an arithmetically progressed permutation with ratio .
For the case of a binary alphabet , several strings of length exist that solve the problem. Trivially, the solutions for the unary alphabet also solve the problem for the binary alphabet. However, studying those strings of length whose suffix array is , there are now multiple solutions: each with such that has this suffix array. Similarly, has the suffix array , which is an arithmetically progressed permutation with ratio 1.
A non-lexicographic ordering on strings, the -order, considered for a modified FM-index , provides a curious example for the case with ratio : if is a proper subsequence of a string of length , then precedes in -order, written . This implies that , so that for every , thus enabling trivial suffix sorting.
Clearly, any string ordering method that prioritizes minimal length within its definition will have a suffix array progression ratio of . Binary Gray codes (originally known as Reflected Binary Code by the inventor Frank Gray) have this property and are ordered so that adjacent strings differ by exactly one bit. These codes, which exhibit non-lexicographic order, have numerous applications, notably in error detection schemes such as flagging unexpected changes in data for digital communication systems, and logic circuit minimization. While it seems that Gray code order has not been applied directly to order the rows in a BWT matrix, just four years after the BWT appeared in 1994, Chapin and Tate applied the concept when they investigated the effect of both alphabet and string reordering on BWT-based compressibility. Their string sorting technique for the BWT matrix ordered the strings in a manner analogous to reflected Gray codes, but for more general alphabets. This modification inverted the sorting order for alternating character positions with which they demonstrated improved compression.
To date the only known BWT designed specifically for binary strings is the binary block order -BWT . This binary string sorting method prioritizes length, thus also exhibiting a suffix array with progression ratio . Ordering strings of the same length, as applicable to forming the BWT matrix, is by a play on decreasing and increasing lexicographic ordering of the run length exponents of blocks of bits. Experimentation showed roughly equal compressibility to the original BWT in binary lexicographic order but with instances where the -BWT gave better compression.
To see that the above two binary orders are distinct: 1101 comes before 1110 in Gray code order, whereas, 1110 comes before 1101 in binary block order.
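The Gray code half of this comparison can be checked with a short sketch. We assume that "Gray code order" refers to the position of a codeword in the standard reflected binary Gray code sequence; the helper name gray_rank is ours. The binary block order is not modelled here.

```python
def gray_rank(bits):
    """Position of a codeword in the reflected binary Gray code sequence.

    Decoding a Gray codeword g amounts to XOR-ing g with all of its
    right shifts (g ^ (g >> 1) ^ (g >> 2) ^ ...).
    """
    value = int(bits, 2)
    rank = 0
    while value:
        rank ^= value
        value >>= 1
    return rank


print(gray_rank("1101"), gray_rank("1110"))  # 9 11
```

Under this assumption 1101 has rank 9 and 1110 has rank 11, so 1101 indeed comes first.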
In what follows, we present a comprehensive analysis of strings whose suffix arrays are arithmetically progressed permutations (under the standard lexicographic order). In practice, such knowledge can reduce the space for the suffix array to .
The structure of the paper is as follows. (Compared to the conference version, we added applications to Christoffel words and balanced words, generalized our results for larger alphabets, and gave examples for indeterminate strings.) In Section 2 we give the basic definitions and background, and also deal with the elementary case of a unary alphabet. We present the main results in Section 3, where we cover ternary and binary alphabets, and consider inverse permutations. In Section 4, we illustrate the binary case for Christoffel words and in particular establish that every (lower) Christoffel word has an arithmetically progressed suffix array. We go on to link the binary characterization to balanced words and Fibonacci words. A characterization of strings with larger alphabets follows in Section 5. We overview the theme concepts for meta strings in Section 6. We conclude in Section 7 and propose a list of open problems and research directions, showing there is plenty of scope for further investigation. We now proceed to the foundational concepts presented in the following section, starting with the case of a unary alphabet.
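Let \(\Sigma\) be an alphabet with size \(\sigma := |\Sigma|\). An element of \(\Sigma\) is called a character (also known as letter or symbol in the literature). Let \(\Sigma^+\) denote the set of all nonempty finite strings over \(\Sigma\). The empty string of length zero is denoted by \(\varepsilon\); we write \(\Sigma^* = \Sigma^+ \cup \{\varepsilon\}\). Given an integer \(n \ge 1\), a string (also known as word in the literature) of length \(n\) over \(\Sigma\) takes the form \(T = t_1 \cdots t_n\) with each \(t_i \in \Sigma\). We write \(T = T[1..n]\) with \(T[i] = t_i\). The length \(n\) of a string \(T\) is denoted by \(|T|\). If \(T = uwv\) for some strings \(u, w, v \in \Sigma^*\), then \(u\) is a prefix, \(w\) is a substring, and \(v\) is a suffix of \(T\); we say \(u\) (resp. \(w\) and \(v\)) is proper if \(u \neq T\) (resp. \(w \neq T\) and \(v \neq T\)). We say that a string \(T\) of length \(n\) has period \(p \in [1..n-1]\) if \(T[i] = T[i+p]\) for every \(i \in [1..n-p]\) (note that we allow periods larger than \(n/2\)). If \(T = uv\), then \(vu\) is said to be a cyclic rotation of \(T\). A string \(T\) is said to be a repetition if and only if it has a factorization \(T = u^k\) for some integer \(k \ge 2\); otherwise, \(T\) is said to be primitive. A string that is both a proper prefix and a proper suffix of a string \(T\) is called a border of \(T\); a string is border-free if the only border it has is the empty string \(\varepsilon\).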
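If \(\Sigma\) is a totally ordered alphabet with order \(<\), then this order induces the lexicographic ordering \(\prec\) on \(\Sigma^*\) such that \(u \prec v\) for two strings \(u, v \in \Sigma^*\) if and only if either \(u\) is a proper prefix of \(v\), or \(u = ras\), \(v = rbt\) for two characters \(a, b \in \Sigma\) such that \(a < b\) and for some strings \(r, s, t \in \Sigma^*\). In the following, we select a totally ordered alphabet \(\Sigma := \{\texttt{a}, \texttt{b}, \texttt{c}\}\) having three characters with \(\texttt{a} < \texttt{b} < \texttt{c}\).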
A string is a Lyndon word if it is strictly least in the lexicographic order among all its cyclic rotations . For instance, abcac and aaacbaabaaacc are Lyndon words, while the string aaacbaabaaac with border aaac is not.
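The three examples can be checked directly against the definition. The following naive quadratic-time sketch (the name is_lyndon is ours) compares a word with each of its other cyclic rotations.

```python
def is_lyndon(word):
    """True if word is strictly smaller than all of its other cyclic rotations."""
    return all(word < word[i:] + word[:i] for i in range(1, len(word)))


for word in ("abcac", "aaacbaabaaacc", "aaacbaabaaac"):
    print(word, is_lyndon(word))
```

For aaacbaabaaac the rotation aaacaaacbaab, which starts at the border occurrence of aaac, is smaller than the word itself, so the check fails as stated.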
A reciprocal relationship exists between the suffix array of a text and its Lyndon factorization, that is, the unique factorization of the text by greedily choosing the maximal length Lyndon prefix while processing the text from start to end: the Lyndon factorization of a text can be obtained from its suffix array ; conversely, the suffix array of a text can be constructed iteratively from its Lyndon factorization .
Lyndon words have numerous applications in combinatorics and algebra, and prove challenging entities due to their non-commutativity. Additionally, Lyndon words can arise naturally in Big Data: an instance in a biological sequence over the DNA alphabet, with , is the following substring occurring in a SARS-CoV-2 genome (https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta):
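For the rest of the article, we take a string \(T\) of length \(n\). The suffix array \(\mathsf{SA} := \mathsf{SA}_T[1..n]\) of \(T\) is a permutation of the integers \([1..n]\) such that \(T[\mathsf{SA}[i]..n]\) is the \(i\)-th lexicographically smallest suffix of \(T\). We denote with ISA its inverse, i.e., \(\mathsf{ISA}[\mathsf{SA}[i]] = i\). By definition, ISA is also a permutation. The string BWT with \(\mathsf{BWT}[i] = T[\mathsf{SA}[i]-1]\) (where \(T[0] := T[n]\)) is the (SA-based) BWT of \(T\).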
The focus of this paper is on arithmetic progressions. An arithmetic progression is a sequence of numbers such that the differences between any two consecutive terms are of the same value: Given an arithmetic progression , there is an integer such that for all . We call the ratio of this arithmetic progression. Similarly to sequences, we can define permutations that are based on arithmetic progressions: An arithmetically progressed permutation with ratio is an array with for all , where we stipulate that . (We can also support negative values of : Given a negative , we exchange it with and use instead of . Here if , for an integer , and for . In particular, .) In what follows, we want to study (a) strings whose suffix arrays are arithmetically progressed permutations, and (b) the shape of these suffix arrays. For a warm-up, we start with the unary alphabet:
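To make the definition concrete, the following sketch generates and recognizes arithmetically progressed permutations. It assumes 1-based values in which a multiple of n is represented by n itself (rather than 0); the names progression and ratio_of are ours.

```python
def progression(p1, k, n):
    """Return [p_1, ..., p_n] with p_{i+1} = p_i + k (mod n), values in 1..n."""
    seq = [p1]
    for _ in range(n - 1):
        # Shift to 0-based, reduce modulo n, shift back: n stays n, not 0.
        seq.append((seq[-1] + k - 1) % n + 1)
    return seq


def ratio_of(perm):
    """Return the ratio k in 1..n-1 if perm is an arithmetically
    progressed permutation of 1..n, and None otherwise."""
    n = len(perm)
    if n < 2 or sorted(perm) != list(range(1, n + 1)):
        return None
    differences = {(b - a) % n for a, b in zip(perm, perm[1:])}
    return differences.pop() if len(differences) == 1 else None


print(progression(2, 3, 8))             # [2, 5, 8, 3, 6, 1, 4, 7]
print(ratio_of([2, 5, 8, 3, 6, 1, 4, 7]))  # 3
print(ratio_of([4, 3, 2, 1]))           # 3
```

Note that the descending permutation [4, 3, 2, 1] has ratio 3 under this convention, which corresponds to a difference of -1 modulo 4.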
Given the unary alphabet , the suffix array of a string of length over is uniquely defined by the arithmetically progressed permutation with ratio .
Conversely, given the arithmetically progressed permutation , we want to know the number of strings from a general totally ordered alphabet with the natural order , having as their suffix array. For that, we fix a string of length with . Let be the number of occurrences of the character appearing in . Then . By construction, each character has to appear after all characters with . Therefore, such that the position of the characters are uniquely determined. (Note that we slightly misused notation as the exponentiations of the characters being integers have to be understood as writing a character as many times as the exponent. So does not give but a string of length having only ’s as characters.) In other words, we can reduce this problem to the classic stars and bars problem [23, Chp. II, Sect. 5] with stars and bars, yielding possible strings. Hence we obtain:
There are strings of length over an alphabet with size having the suffix array .
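The count can be cross-checked by brute force on small instances. In the sketch below we assume that the permutation in question is the descending one, [n, n-1, ..., 1], and that the stars-and-bars count is the binomial coefficient C(n + s - 1, n) for an alphabet {1, ..., s}; both are our reading of the statement, and all names are illustrative.

```python
from itertools import product
from math import comb


def suffix_array(text):
    """1-based suffix array by naive sorting; text may be a tuple of integers."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def count_descending(n, sigma):
    """Count strings of length n over {1, ..., sigma} with suffix array [n, ..., 1]."""
    target = list(range(n, 0, -1))
    return sum(
        1
        for text in product(range(1, sigma + 1), repeat=n)
        if suffix_array(text) == target
    )


for n in range(1, 6):
    for sigma in range(1, 4):
        # Assumed closed form: n stars and sigma - 1 bars.
        print(n, sigma, count_descending(n, sigma), comb(n + sigma - 1, n))
```

The two columns agree on these instances because the strings counted are exactly the non-increasing ones.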
As described above, strings of Theorem 2.2 have the form . The BWT based on the suffix array is . For , it does not coincide with the BWT based on the rotations since the lexicographically smallest rotation is , and hence the first entry of this BWT is . For , the last character ‘1’ acts as the dollar sign being unique and least among all characters, making both BWT definitions equivalent.
For the rest of the analysis, we omit the arithmetically progressed permutation of ratio as this case is complete. All other permutations (including those of ratio ) are covered in our following theorems whose results we summarized in Fig. 1.
| Min. alphabet size | Properties of Strings | Reference |
| 2 | unique, Lyndon word | Theorem 3.9 |
| 2 | unique, period | Theorem 3.9 |
| 2 | unique, period | Theorem 3.9 |
| 1 | trivially periodic | Theorem 2.2 |
3 Arithmetically Progressed Suffix Arrays
We start with the claim that each arithmetically progressed permutation coincides with the suffix array of a string on a ternary alphabet. Subsequently, given an arithmetically progressed permutation , we show that either there is precisely one string with whose characters are drawn from a ternary alphabet, or, if there are multiple candidate strings, then there is precisely one whose characters are drawn from a binary alphabet. For this aim, we start with the restriction on and to be coprime:
Two integers are coprime (also known as relatively prime in the literature) if their greatest common divisor (gcd) is one. An ideal is a subgroup of . It generates if , i.e., . Fixing one element of an ideal generating induces an arithmetically progressed permutation with ratio by setting for every . On the contrary, each arithmetically progressed permutation with ratio induces an ideal (the induced ideals are the same for two arithmetically progressed permutations that are shifted). Consequently, there is no arithmetically progressed permutation with ratio if and are not coprime since in this case , from which we obtain:
The numbers and must be coprime if there exists an arithmetically progressed permutation of length with ratio .
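The coprimality condition is easy to observe experimentally. The sketch below (names are ours, values 1-based with n standing in for 0) checks for small lengths that stepping by k modulo n visits every value exactly once precisely when gcd(k, n) = 1.

```python
from math import gcd


def progression(p1, k, n):
    """Return [p_1, ..., p_n] with p_{i+1} = p_i + k (mod n), values in 1..n."""
    seq = [p1]
    for _ in range(n - 1):
        seq.append((seq[-1] + k - 1) % n + 1)
    return seq


def is_permutation(seq):
    """True if seq contains each of 1..len(seq) exactly once."""
    return sorted(seq) == list(range(1, len(seq) + 1))


print(progression(1, 2, 6), is_permutation(progression(1, 2, 6)))
# [1, 3, 5, 1, 3, 5] False  (gcd(2, 6) = 2)
print(progression(1, 5, 6), is_permutation(progression(1, 5, 6)))
# [1, 6, 5, 4, 3, 2] True   (gcd(5, 6) = 1)
```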
3.2 Ternary Alphabet
Given an arithmetically progressed permutation with ratio , we define the ternary string by splitting right after the values and into the three subarrays , , and (one of which is possibly empty) such that . Subsequently, we set
Figure 2 gives an example of induced ternary/binary strings.
Given an arithmetically progressed permutation with ratio , for defined in Eq. 1.
Suppose we have constructed . Since , according to the above assignment of , the suffixes starting with a lexicographically precede the suffixes starting with b, which lexicographically precede the suffixes starting with c. Hence, and store the same text positions as , , and , respectively. Consequently, it remains to show that the entries of each subarray (, or ) are also sorted appropriately. Let and be two neighboring entries within the same subarray. Thus, holds, and the lexicographic order of their corresponding suffixes and is determined by comparing the subsequent positions, starting with and . Since we have , we can recursively show that these entries remain in the same order next to each other in the suffix array until either reaching the last array entry or a subarray split, that is, (1) or (2) .
When becomes (the last entry in SA), is in the subsequent subarray of the subarray of (remember that or ends directly after , cf. Fig. 3). Hence , and .
The split at the value ensures that when reaching and , we can stop the comparison here as there is no character following . The split here ensures that we can compare the suffixes and by the characters . If we did not split here, , and the suffix would be a prefix of , resulting in (which yields a contradiction unless ).
To sum up, the text positions stored in each of , and are in the same order as in since the subsequent text positions of each consecutive pair of entries and are consecutive in for the smallest integer such that .
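The construction can be tried out on small instances. The sketch below is our reading of it rather than a transcription of Eq. 1: for a permutation with ratio k and last entry p_n, we assume that the two split values are p_n - 1 (taken as n when p_n = 1) and n - k. When both splits coincide, the middle subarray is empty and only the letters a and c are used. All function names are illustrative.

```python
from math import gcd


def progression(p1, k, n):
    """Return [p_1, ..., p_n] with p_{i+1} = p_i + k (mod n), values in 1..n."""
    seq = [p1]
    for _ in range(n - 1):
        seq.append((seq[-1] + k - 1) % n + 1)
    return seq


def suffix_array(text):
    """1-based suffix array by naive suffix sorting."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def ternary_string(perm, k):
    """Assign a, b, c to the three subarrays of perm (assumed split rule)."""
    n = len(perm)
    index_of = {value: idx for idx, value in enumerate(perm)}
    # Assumption: split right after the values p_n - 1 and n - k.
    splits = (index_of[(perm[-1] - 2) % n + 1], index_of[n - k])
    text = [""] * n
    for idx, position in enumerate(perm):
        # The block number is the number of splits lying strictly before idx.
        text[position - 1] = "abc"[sum(idx > s for s in splits)]
    return "".join(text)


perm = progression(2, 3, 8)
print(perm, ternary_string(perm, 3))  # [2, 5, 8, 3, 6, 1, 4, 7] cabcabcb
print(suffix_array("cabcabcb"))       # [2, 5, 8, 3, 6, 1, 4, 7]
```

An exhaustive check over all coprime pairs and all starting values for small n is a cheap way to gain confidence in the split rule assumed here.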
Knowing the suffix array of the ternary string of Eq. 1, we can give a characterization of its BWT. We start with the observation that both BWT definitions (rotation based and suffix array based) coincide for the strings of Eq. 1 (but do not in general as highlighted in the introduction, cf. Fig. 4), and then continue with insights into what the BWT looks like.
Given an arithmetically progressed permutation with ratio and the string of Eq. 1, the BWT of defined on the BWT matrix coincides with the BWT of defined on the suffix array.
According to Theorem 3.2, , and therefore the BWT of defined on the suffix array is given by . The BWT matrix is constituted of the lexicographically ordered cyclic rotations of . The BWT based on the BWT matrix is obtained by reading the last column of the BWT matrix from top downwards (see Fig. 4). Formally, let be the starting position of the lexicographically -th smallest rotation . Then is if , or if . We prove the equality by showing that, for all , the rotation starting at is lexicographically smaller than the rotation starting at . We do that by comparing both rotations and characterwise:
Let be the first position where and differ, i.e., and for every .
First we show that by a contradiction: Assuming that , we conclude that by the definition of . Since is the ratio of , we have
and . By Eq. 1, and belong to different subarrays of , therefore and , contradicting the choice of as the first position where and differ.
This concludes that (hence, ) and . Hence, and are characters given by two consecutive entries in , i.e., and for a . Thus , and by definition of we have , leading finally to . Hence, . ∎
Since is an arithmetically progressed permutation with ratio then so is the sequence with . In particular, is a cyclic shift of with because . However, is a split position of one of the subarrays , , or , meaning that starts with one of these subarrays and ends with another of them (cf. Fig. 5). Consequently, there is a such that , and we have the property that with is the -th rotation of . ∎
We will determine the parameter after Eq. 3 in Section 3.4, where is defined such that . With Lemma 3.4, we obtain the following corollary which shows that the number of runs in for defined in Eq. 1 is minimal:
For an arithmetically progressed permutation and the string defined by Eq. 1, consists of exactly 2 runs if is binary, while it consists of exactly 3 runs if is ternary.
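Both the coincidence of the two BWT definitions and the run counts can be observed on small instances. The sketch reuses our assumed split rule (splits after the values p_n - 1 and n - k), so it illustrates the statements under that assumption rather than reproducing Eq. 1 verbatim; all names are ours.

```python
from itertools import groupby
from math import gcd


def progression(p1, k, n):
    """Return [p_1, ..., p_n] with p_{i+1} = p_i + k (mod n), values in 1..n."""
    seq = [p1]
    for _ in range(n - 1):
        seq.append((seq[-1] + k - 1) % n + 1)
    return seq


def ternary_string(perm, k):
    """Assign a, b, c to the subarrays of perm (assumed split rule)."""
    n = len(perm)
    index_of = {value: idx for idx, value in enumerate(perm)}
    splits = (index_of[(perm[-1] - 2) % n + 1], index_of[n - k])
    text = [""] * n
    for idx, position in enumerate(perm):
        text[position - 1] = "abc"[sum(idx > s for s in splits)]
    return "".join(text)


def bwt_from_sa(text):
    """SA-based BWT (wrapping around for the suffix at position 1)."""
    order = sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])
    return "".join(text[i - 2] for i in order)


def bwt_from_rotations(text):
    """Rotation-based BWT: last column of the sorted cyclic rotations."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(text):
    """Number of maximal runs of equal characters."""
    return sum(1 for _ in groupby(text))


text = ternary_string(progression(2, 3, 8), 3)
print(text, bwt_from_sa(text), count_runs(bwt_from_sa(text)))
# cabcabcb cccaabbb 3
```

In this example the BWT has one run per distinct character, which is the smallest number of runs any string over three distinct characters can have.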
Given an arithmetically progressed permutation with ratio such that , the string given in Eq. 1 is unique.
The only possible way to define another string would be to change the borders of the subarrays , , and . Since , and , as well as and , are stored as a consecutive pair of text positions in .
If is not split between its consecutive text positions and , then . Consequently, we have the contradiction .
If is not split between its consecutive text positions and , then . Since , and , this leads to the contradiction , cf. Fig. 3.
Following this analysis of the ternary case we proceed to consider binary strings. A preliminary observation is given in Fig. 2, which shows, for the cases is 1 and in Theorem 3.6, namely Rotations (5) and (8), that a rotation of in the permutation gives a rotation of one in the corresponding binary strings. We formalize this observation in the following lemma, drawing a connection between binary strings whose suffix arrays are arithmetically progressed and start with or .
Let be an arithmetically progressed permutation with ratio and for a binary string over with . Suppose that the number of ’s in is and that is the first rotation of . Then is the -th rotation of with . Furthermore, .
Since , and . In the following, we show that for with (since ). For that, we show that each pair of suffixes in is kept in the same relative order in (excluding ):
Consider two text positions with .
If for the least , then .
Otherwise, is a proper prefix of .
If , then .
Otherwise (), is a proper prefix of , and similarly .
Hence the relative order of these suffixes given by and is the same. In total, we have for , hence is an arithmetically progressed permutation with ratio . Given the first entries in represent all suffixes of starting with a, is the -th rotation of since is the -th entry of , i.e., the smallest suffix starting with b in . Finally, since the strings and are rotations of each other, their BWTs are the same. ∎
3.3 Binary Alphabet
We start with the construction of a binary string from an arithmetically progressed permutation:
Given an arithmetically progressed permutation with ratio such that , we can modify of Eq. 1 to be a string over the binary alphabet with .
If , then is split after the occurrences of the values and , which gives only two non-empty subarrays. If , is split after the occurrence of , which implies that is empty since . Hence, can be constructed with a binary alphabet in those cases, cf. Fig. 2.
For the case , is split after the occurrences of the values and , so contains only the text position . By construction, the requirement is that the suffix is smaller than all other suffixes starting with c. So instead of assigning the unique character like in Theorem 3.2, we can assign , which still makes the smallest suffix starting with c. We conclude this case by converting the binary alphabet to . Cf. Fig. 2, where in Rotation (6) has become bbabbabb with period . ∎
The main result of this section is the following theorem. There, we characterize all binary strings whose suffix arrays are arithmetically progressed permutations. More precisely, we identify which of them are unique (the exact number of these binary strings is not covered by Theorem 3.6), periodic, or a Lyndon word.
[Figure: panels for Case 1, Case 2, and Case 3.]
Let and be two coprime integers. If , there are exactly three binary strings of length whose suffix arrays are arithmetically progressed permutations with ratio . Each such solution is characterized by
for all text positions and an index , called the split index.
The individual solutions are obtained by fixing the values for and , the position of the lexicographically largest suffix starting with , of :
and , and
The string has period in Cases 1 and 2, while of Case 3 is a Lyndon word, which is not periodic by definition.
For , Cases 2 and 3 each yield exactly one binary string, but Case 1 yields binary strings according to Theorem 2.2.
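The count of three can be confirmed by exhaustive search on small lengths. In the sketch below we assume that the excluded ratio is k = n - 1 (the descending permutation treated in Section 2) and that the three solutions are distinguished by the first suffix array entry being 1, k + 1, or n; both assumptions are our reading of the theorem, and the function names are ours.

```python
from itertools import product
from math import gcd


def suffix_array(text):
    """1-based suffix array by naive suffix sorting."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def ratio_of(perm):
    """Ratio k in 1..n-1 if perm is arithmetically progressed, else None."""
    n = len(perm)
    differences = {(b - a) % n for a, b in zip(perm, perm[1:])}
    return differences.pop() if len(differences) == 1 else None


def binary_strings_with_ratio(n, k):
    """All binary strings of length n whose suffix array has ratio k."""
    found = []
    for chars in product("ab", repeat=n):
        text = "".join(chars)
        sa = suffix_array(text)
        if ratio_of(sa) == k:
            found.append((text, sa))
    return found


for text, sa in binary_strings_with_ratio(5, 2):
    print(text, sa)
```

For n = 5 and k = 2 the search returns ababb, babba, and bbabb, whose suffix arrays start with 1, 5, and 3, respectively.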
Let be a binary string of length , and suppose that is an arithmetically progressed permutation with ratio . Further let be the position of the largest suffix of starting with a. Then and thus . We have for all since
is the starting position of the largest suffix ().
and are the lexicographically largest suffix starting with a and the lexicographically smallest suffix starting with b, respectively, such that .
To sum up, since by construction, holds for whenever . This, together with the coprimality of and , determines uniquely in the three cases (cf. Fig. 6 for the case that and , and Fig. 7 for sketches of the proof):
We first observe that the case gives us , and this case was already treated with Theorem 2.1. In the following, we assume , and under this assumption we have , and (otherwise would be the second smallest suffix, i.e., and hence ). Consequently, is the smallest suffix starting with b, namely ba, and therefore .
If , then (otherwise ). Therefore, is the smallest suffix starting with b, and consequently .
For the periodicity, with for we need to check two conditions:
If , then breaks the periodicity.
If , then breaks the periodicity.
For Case 1, and (hence and ), thus Case 1 is periodic.
Case 2 is analogous to Case 1.
For Case 3, does not have period as , and hence