Periodicity in Data Streams with Wildcards

We investigate the problem of detecting periodic trends within a string S of length n, arriving in the streaming model, containing at most k wildcard characters, where k = o(n). A wildcard character is a special character that can be assigned any other character. We say S has wildcard-period p if there exists an assignment to each of the wildcard characters so that in the resulting stream the length n-p prefix equals the length n-p suffix. We present a two-pass streaming algorithm that computes wildcard-periods of S using O(k^3 polylog n) bits of space, and we also show that this problem cannot be solved in sublinear space in one pass. We then give a one-pass randomized streaming algorithm that computes all wildcard-periods p of S with p < n/2 and no wildcard characters appearing in the last p symbols of S, using O(k^3 polylog n) space.

1 Introduction

We study the problem of detecting repetitive structure in a data stream containing a small number of wildcard characters. Given an alphabet Σ and a special wildcard character ‘⊥’ (although wildcard characters are usually denoted with ‘?’, we use ‘⊥’ to differentiate them from compilation errors, the LaTeX equivalent of wildcard characters), let S ∈ (Σ ∪ {⊥})^n be a stream that contains at most k wildcards. We can assign a value from Σ to each wildcard character in S, resulting in many possible values of S. We then informally say that S has wildcard-period p if there exists an assignment to each of the wildcard characters in S so that the resulting string consists of repetitions of a block of p characters.

Example 1

The string has wildcard-period 3, since assigning ‘c’ to the first wildcard character, ‘b’ to the second wildcard character, and ‘a’ to the third results in the string ‘abcabcabcabc’, which consists of repetitions of the substring ‘abc’ of length 3.
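The check described in Example 1 can be sketched in a few lines of Python. This is only an illustration: the input string `"ab?a?c?bcabc"` is a hypothetical instance chosen to be consistent with the assignment in the example (it is not taken from the paper), `?` is an ASCII stand-in for the wildcard symbol, and the helper names `assign_wildcards` and `has_period` are assumptions.

```python
WILDCARD = "?"  # assumption: ASCII stand-in for the wildcard symbol


def assign_wildcards(s, values, wildcard=WILDCARD):
    """Replace the wildcards of s, left to right, with the given values."""
    values = list(values)
    if s.count(wildcard) != len(values):
        raise ValueError("need exactly one value per wildcard")
    it = iter(values)
    return "".join(next(it) if c == wildcard else c for c in s)


def has_period(t, p):
    """Exact periodicity: the length n-p prefix equals the length n-p suffix."""
    return 0 < p <= len(t) and t[: len(t) - p] == t[p:]


# Hypothetical string with three wildcards, assigned 'c', 'b', 'a' in order.
filled = assign_wildcards("ab?a?c?bcabc", "cba")
print(filled, has_period(filled, 3))
```

Once the wildcards are fixed, the question reduces to exact periodicity, which is why the difficulty of the problem lies in finding the assignment.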

The identification of repetitive structure in data has applications in bioinformatics, natural language processing, and time series data mining. Specifically, finding the smallest period of a string is necessary preprocessing for many algorithms, such as the classic Knuth-Morris-Pratt [KMP77] algorithm in pattern matching, or the basic local alignment search tool (BLAST) [AGM90] in computational biology.

We consider our problem in the streaming model, where we process the input in sequential order and in sublinear space. In practice, however, some of the data may be erased or corrupted beyond repair, resulting in symbols that we cannot read, which we model as wildcard characters. As a consequence, we attempt to perform pattern matching with optimistic assignments to these values. This motivation has resulted in a substantial body of literature on string algorithms with wildcard characters [MR95, Ind98, CH02, Kal02, CC07, HR14, LNV14, GKP16].

One possible approach to our problem is to generalize the exact periodicity problem, for which [EJS10] give a two-pass streaming algorithm for finding the smallest exact period of a string of length that uses -space and time per arriving symbol. Their results can be easily generalized to an algorithm for finding the wildcard-period of strings using -space, but at a cost of post-processing time, which is often undesirable. More recently, [EGSZ17] study the problem of -periodicity, where a string is permitted to have up to permanent changes. The authors give a two-pass streaming algorithm that uses bits of space and runs in amortized time per arriving symbol. This algorithm can be modified to recover the wildcard-period. We show how to do this more efficiently in Theorem 3.2.

1.1 Our Contributions

The challenge of determining periodicity in the presence of wildcard characters can first be approached by working toward an understanding of specific structural properties of strings with wildcard characters. We show in Lemma 2 that the number of possible assignments to the wildcard characters over all periods is “small”. This allows us to compress our data into sublinear space. In this paper, given a string S with at most k wildcard characters, we show:

  1. a two-pass randomized streaming algorithm that computes all wildcard-periods of S using O(k^3 polylog n) space, regardless of period length, running in amortized time per arriving symbol,

  2. a one-pass randomized streaming algorithm that computes all wildcard-periods p of S with p < n/2 and no wildcard characters appearing in the last p symbols of S, using O(k^3 polylog n) space, running in amortized time per arriving symbol (see Appendix 0.A),

  3. a lower bound showing that any one-pass streaming algorithm that computes all wildcard-periods of S requires Ω(n) space, even when randomization is allowed,

  4. a lower bound that, for with , any one-pass randomized streaming algorithm that computes all wildcard-periods of S with probability at least requires space, even under the promise that the wildcard-periods are at most .

We remark that our algorithm can be easily modified to return the smallest, largest, or any desired wildcard-period of . Finally, we note in Appendix 0.B several results in the related problem of determining distance to -periodicity. We give an overview of our techniques in Section 2.

1.2 Related Work

The study of periodicity in data streams was initiated in [EJS10], in which the authors give an algorithm that detects the period of a string using O(polylog n) bits of space. Independently, [BG11] gives a similar result with improved running time. Also, [EAE06] studies mining periodic patterns in streams, [CM11] studies periodicity via linear sketches, and [IKM00] studies periodicity in time-series databases and online data. [EMS10] and [LN11] study the problem of distinguishing periodic strings from aperiodic ones in the property testing model of sublinear-time computation. Furthermore, [AEL10] studies approximate periodicity in the RAM model under the Hamming and swap distance metrics.

The pattern matching literature is vast (see [AG97] for a survey) and has many variants. In the data stream model, [PP09] and [CFP16] study exact and approximate variants in offline and online settings. We use the sketches from [CFP16], though other works [AGMP13, CEPR09, RS17, PL07] give different sketches for strings. [CJPS13] also show several lower bounds for the online pattern matching problem.

Strings with wildcard characters have been extensively studied in the offline model, where they are usually called “partial words”. Blanchet-Sadri [Bla08] presents a number of combinatorial properties of partial words, including a large section devoted to periodicity. Notably, [BMRW12] gives algorithms for determining the periodicity of partial words. Manea et al. [MMT14] improve these results, presenting time-efficient offline algorithms for determining periodicity on partial words, minimizing either total time or update time per symbol.

Golan et al. [GKP16] study the pattern matching problem with a small number of wildcards in the streaming model. Prior to this work, several works had studied other aspects of pattern matching under wildcards (see [CH02], [CC07], [HR14], and [LNV14]).

Many ideas used in these sublinear algorithms stem from related work in the classical offline model. The well-known KMP algorithm [KMP77] initially used periodic structures to search for patterns within a text. Galil and Seiferas [GS83] later improved the space performance of this pattern matching algorithm. Recently, [Gaw13] also used the properties of periodic strings for pattern matching when the strings are compressed. These properties have allowed several algorithms to satisfy non-trivial requirements of their respective models (see [GKP16], [CFP15] for example).

1.3 Preliminaries

Given an input stream S[1, ..., n] of length n over some alphabet Σ, we denote the i-th character of S by S[i], and the substring between locations i and j (inclusive) by S[i, j]. We say that two strings U, V ∈ Σ^n have a mismatch at index i if U[i] ≠ V[i]. The Hamming distance between U and V is the number of such mismatches, denoted HAM(U, V). We denote the concatenation of U and V by U ∘ V, and the greatest common divisor of two integers x and y by gcd(x, y).

Multiple standard and equivalent definitions of periodicity are often used interchangeably. We say S has period p if S = B^ℓ B', where B is a block of length p that appears ℓ ≥ 1 times in a row, and B' is a prefix of B. For instance, has period 4 where , and . Equivalently, S[i] = S[i+p] for all 1 ≤ i ≤ n-p. Similarly, the following definition is also used for periodicity.

Definition 1

We say string S has period p if the length n-p prefix of S is identical to its length n-p suffix, S[1, n-p] = S[p+1, n].

More generally, we say S has k-period p (i.e., S has period p with k mismatches) if S[i] = S[i+p] for all but at most k (valid) indices i. Equivalently, the following definition is also used for k-periodicity.

Definition 2

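We say string S has k-period p if the Hamming distance between its length n-p prefix and suffix satisfies HAM(S[1, n-p], S[p+1, n]) ≤ k.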

The definition of k-periodicity lends itself to the following observation.

Observation 1.1

If p is a k-period of S, then at most k substrings in the sequence of substrings S[1, p], S[p+1, 2p], S[2p+1, 3p], ... can differ from the preceding substring in the sequence.
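As an offline illustration of Definition 2 and Observation 1.1, the following Python sketch checks k-periodicity through the Hamming distance between the prefix and the suffix, and counts how many length-p blocks differ from their predecessor. The function names and the sample string are assumptions made for this sketch; it reads the whole string at once and is not a streaming algorithm.

```python
def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    if len(u) != len(v):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(u, v))


def has_k_period(s, p, k):
    """True if the length n-p prefix and suffix of s differ in at most k places."""
    return hamming(s[: len(s) - p], s[p:]) <= k


def changed_blocks(s, p):
    """Count full length-p blocks that differ from the preceding block."""
    full = len(s) - len(s) % p
    blocks = [s[i : i + p] for i in range(0, full, p)]
    return sum(prev != cur for prev, cur in zip(blocks, blocks[1:]))


# Hypothetical string: 'abc' repeated four times with one character changed.
s = "abcabcabdabc"
print(has_k_period(s, 3, 2), has_k_period(s, 3, 1), changed_blocks(s, 3))
```

A single changed character produces two mismatches between the prefix and the suffix, and two blocks that differ from their predecessor, which matches the bound in the observation for k = 2.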

Finally, we use the following definition of wildcard-periodicity:

Definition 3

We say that a string S has wildcard-period p if there exists an assignment to the wildcard characters so that S[1, n-p] = S[p+1, n] (i.e., the resulting string has period p; see Example 1).

Note that the determinism of the assignments of the characters is very important, as evidenced by Example 2.

Example 2

Consider the string . To check whether has wildcard-period , we must compare and . At first glance, one might think assigning the character ‘’ to the wildcard in the prefix and an ‘’ in the suffix will make the prefix and the suffix identical. However, this is not a legal move; there is not a single character that the wildcard can be replaced with that makes the above prefix and the suffix the same. Thus, does not have a wildcard-period of 1.

The following example emphasizes the difference between k-periodicity and wildcard-periodicity:

Example 3

For , the string has -period . However, to obtain wildcard-period , at least five characters in must be changed to wildcards (for example, all of the characters ‘’ or ‘’).

Therefore, k-periodicity is a good notion for capturing periodicity with respect to long-term, persistent changes, while wildcard-periodicity is a good notion for capturing periodicity against a number of symbols that are errors or erasures.

We shall require data structures and subroutines that allow us to compare strings with mismatches. The following fingerprinting technique uses Karp-Rabin fingerprints [KR87] and has several useful properties:

Theorem 1.2

[KR87] Given two strings U and V of length n, there exists a polynomial encoding that uses O(log n) bits of space, and outputs whether U = V or U ≠ V. Moreover, this encoding supports concatenation of strings and can be done in the streaming setting.
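To make the concatenation property concrete, here is a minimal Python sketch of a polynomial (Karp-Rabin style) fingerprint. It is an illustration under stated assumptions rather than the exact encoding of [KR87]: the modulus 2^61 - 1, the random base, and the function names `make_base`, `fingerprint`, and `concat` are all choices made for this sketch. Equal strings always receive equal fingerprints; unequal strings collide only with small probability over the choice of base.

```python
import random

MODULUS = (1 << 61) - 1  # assumption: a fixed Mersenne prime as the modulus


def make_base(rng=random):
    """Pick a random evaluation point for the polynomial encoding."""
    return rng.randrange(2, MODULUS - 1)


def fingerprint(s, base):
    """Return (value, base^len(s)), reading s one symbol at a time.

    value = sum of ord(s[i]) * base^i modulo MODULUS; only two numbers are
    kept while scanning, so the computation fits the streaming setting.
    """
    value, power = 0, 1
    for c in s:
        value = (value + ord(c) * power) % MODULUS
        power = (power * base) % MODULUS
    return value, power


def concat(fu, fv):
    """Fingerprint of U followed by V, computed from the two fingerprints."""
    (vu, pu), (vv, pv) = fu, fv
    return (vu + pu * vv) % MODULUS, (pu * pv) % MODULUS


base = make_base()
left, right = fingerprint("abc", base), fingerprint("abcab", base)
print(concat(left, right) == fingerprint("abcabcab", base))
```

Keeping the power of the base alongside the value is what allows two fingerprints to be combined without revisiting the underlying symbols.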

From here, we use the term fingerprint to refer to this data structure. We will also need to use an algorithm for pattern matching with mismatches, which we call the k-mismatch algorithm.

Theorem 1.3

[CFP16] Given a string and an index , there exists an algorithm which, with probability , outputs all indices where using bits of space. Moreover, the algorithm runs in amortized time per arriving symbol.

Concurrent with our work, Clifford et al. [CKP17] provide a nearly-optimal solution to the k-mismatch problem, which can potentially be used in the framework of [EGSZ17] to immediately improve over the existing k-periodicity algorithms.

2 Our Approach

To find all the wildcard-periods of S, during our first pass we determine a set T of candidate wildcard-periods, similar to the approach in [EGSZ17], that includes all the true wildcard-periods. We also determine a set W of positions of the wildcard characters. By a structural result (Lemma 2), we can then use the second pass to verify the candidates and identify the true wildcard-periods.

Pattern matching and periodicity seem to have a symbiotic relationship (for example, exact pattern matching and exact periodicity use each other as subroutines [KMP77, EJS10], as do k-mismatch pattern matching [CFP16] and k-periodicity [EGSZ17]). It feels tempting and natural to try to apply the algorithm from [GKP16] for pattern matching with wildcards. Unfortunately, there does not seem to be an immediate way of doing this: the [GKP16] algorithm searches for a wildcard-free pattern in text containing up to k wildcards, while we would like to allow wildcards in both the pattern and the text. We instead choose to use the k-mismatch algorithm from [CFP16] in the first pass and obtain new structural results about possible assignments to the wildcard characters in the second pass.

In the first pass, we treat wildcards simply as an additional character. We let T be the set of indices p (candidate periods) that satisfy

HAM(S[1, x], S[p+1, p+x]) ≤ 2k,

where HAM denotes the Hamming distance, for some appropriate value of x ≤ n-p that we specify later. Note that each wildcard character can cause up to two mismatches; thus, all true wildcard-periods must satisfy the above inequality. We show that T can be easily compressed, even though it may contain a linear number of candidates. Specifically, we can succinctly represent T by adding a few additional “false candidates” into T.

If the correct assignments of the wildcards were known a priori, then the problem would reduce to determining exact periodicity. Unfortunately, we do not know the correct assignments to the wildcard characters prior to the data stream, so most of the difficulty lies in guessing the assignments, bounding the total number of assignments, and storing these assignments. Thus, the main difference between wildcard-periodicity and both exact periodicity and k-periodicity is the process of verifying candidates. Whereas exact and k-periodicity can be verified by comparing the number of mismatches between the prefix and suffix of length n-p, wildcard-periodicity is sensitive to the correct assignments of the wildcards. We address this challenge by recording W, the positions of the wildcard characters, in the first pass. Since we also have the list of candidate wildcard-periods following the first pass, we can guess the assignments of the wildcard characters in the second pass by looking at the characters in a few select locations, as in Example 4.

Example 4

The string has wildcard-period . The assignment of the wildcard at position must be the characters at positions . Note that and .

From Example 4, we observe the following:

Observation 2.1

If S has wildcard-period p and a wildcard character is known to be at position i, then the assignment of the wildcard must be the character S[i + jp], for some integer j, that is not a wildcard.
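Observation 2.1 suggests a simple offline test: positions that are congruent modulo p must all carry the same character, so every wildcard is forced to take the value of any non-wildcard position in its residue class. The Python sketch below implements this test directly. It is an illustration only, with `?` as an ASCII stand-in for the wildcard, 0-based indices, a hypothetical function name `wildcard_assignment`, and hypothetical input strings; it reads the whole string and is not the streaming algorithm of this paper.

```python
WILDCARD = "?"  # assumption: ASCII stand-in for the wildcard symbol


def wildcard_assignment(s, p, wildcard=WILDCARD):
    """Return {index: character} for the forced wildcard values (0-based),
    or None if no consistent assignment exists, i.e. p is not a
    wildcard-period of s. Wildcards whose whole residue class consists of
    wildcards are unconstrained and are left out of the result.
    """
    n = len(s)
    if not 0 < p <= n:
        raise ValueError("period must satisfy 0 < p <= len(s)")
    assignment = {}
    for r in range(p):
        chain = range(r, n, p)  # positions r, r+p, r+2p, ...
        letters = {s[i] for i in chain if s[i] != wildcard}
        if len(letters) > 1:
            return None  # two different characters p-apart: not a period
        if letters:
            (c,) = letters
            for i in chain:
                if s[i] == wildcard:
                    assignment[i] = c
    return assignment


print(wildcard_assignment("ab?a?c?bcabc", 3))
print(wildcard_assignment("a?b", 1))
```

The second call returns `None` because a single wildcard cannot take two different values at once, which is the point made by Example 2.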

We show how to use Observation 2.1 and the compressed version of T in the second pass to verify the candidates and output the true wildcard-periods of S.

We note that recent algorithmic improvements to the -mismatch problem [CKP17] use space. Using this algorithm in place of Theorem 1.3 as a subroutine in our algorithms improves the space usage to bits in the two-pass algorithm.

3 Two-Pass Algorithm to Compute Wildcard-Periods

In this section, we provide a two-pass, O(k^3 polylog n)-space algorithm to output all wildcard-periods of some string S containing at most k wildcard characters. At a high level, we first identify a list of candidates for the periods of S, detected via the k-mismatch algorithm of [CFP16] as a black box. Although the number of candidates could be linear, it turns out the string has enough structure that the list of candidates can be succinctly expressed as the union of arithmetic progressions.

However, this list of candidates is insufficient for identifying the possible assignments to the wildcard characters. To address this issue, we explore the structure of periods with wildcards in order to limit the possible assignments for each wildcard character. Thus, the first pass also records W, the positions of all wildcard characters, so that during the second pass, we go over S as well as the compressed data to verify the candidate periods.

We present two algorithms in parallel to find the periods, based on their lengths. The first algorithm identifies all periods p with p ≤ n/2, while the second algorithm identifies all periods p with p > n/2.

3.1 Computing Small Wildcard-Periods

In this section, we describe a two-pass algorithm for finding wildcard-periods of length at most n/2. The first pass of the algorithm identifies a set T of candidate wildcard-periods in terms of indices of S, and maintains its succinct representation T^C, which includes a number of additional indices. It also records W, the positions of all wildcard characters. The second pass of the algorithm recovers each index of T from T^C and verifies whether or not the index is a wildcard-period. We can find the assignments of the wildcard characters in the second pass by looking at the characters in a few locations that we determine via W. We emphasize the following properties of T and T^C:

  1. All wildcard-periods (possibly as well as additional candidate wildcard-periods that are false positives) are in T.

  2. T^C can be stored in sublinear space and T can be fully recovered from T^C.

  3. In the second pass, we can verify and eliminate in sublinear space candidates that are not true periods.

In the first pass, we treat the wildcard characters as a regular, additional alphabet symbol. We observe that if a string S with at most k such wildcards has wildcard-period p, there are at most 2k indices i such that S[i] ≠ S[i+p], caused by the wildcard characters (the converse is not necessarily true). It follows that any wildcard-period p must satisfy the Hamming-distance bound

HAM(S[1, x], S[p+1, p+x]) ≤ 2k

for all x ≤ n-p, and specifically for x = n/2. Thus, we set x = n/2 and refer to any index p that satisfies HAM(S[1, n/2], S[p+1, p+n/2]) ≤ 2k as a candidate wildcard-period. The set of all candidate wildcard-periods forms the set T. Because this inequality is a necessary but not sufficient condition for a wildcard-period p, Property 1 follows.
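The candidate filter described above can be illustrated offline in Python. This sketch makes several assumptions: it compares substrings directly instead of running the streaming mismatch algorithm of [CFP16], it uses a threshold of 2k mismatches (each wildcard can cause up to two mismatches when treated as an ordinary symbol), it fixes the comparison length at half the string length, and the function names and input string are hypothetical.

```python
def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def candidate_periods(s, k):
    """Indices p <= n/2 whose shifted window of length n//2 differs from the
    prefix of the same length in at most 2k positions. Wildcards are treated
    as an ordinary extra symbol, so the result may contain false positives
    but should not miss a true wildcard-period of length at most n/2.
    """
    n = len(s)
    x = n // 2
    return [p for p in range(1, n // 2 + 1)
            if hamming(s[:x], s[p:p + x]) <= 2 * k]


# Hypothetical string of length 18 with one wildcard, written as '?'.
print(candidate_periods("abcab?abcabcabcabc", 1))
```

The filter only narrows the search: whether a surviving index is a true wildcard-period still depends on finding a consistent assignment for the wildcards, which is the task of the second pass.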

We give the first pass of the algorithm in full in Algorithm 1.

Input: A stream of symbols with at most wildcard characters .
Output: A succinct representation of all candidate wildcard periods and the positions of the wildcard characters.

1:initialize for each .
2:initialize .
3:for each index (found using the -mismatch algorithm) such that
 do
4:     consider for which is in the interval
5:     if there exists no candidate in the interval  then
6:         add to .
7:     else
8:         let be the smallest candidate in and either or .
9:         if  then
10:              set .
11:         else
12:              set .               
13:record the positions of all wildcard characters.
Algorithm 1 (To determine any wildcard-period with ) First pass

Here, we show why the remaining properties for and are satisfied. Our algorithm divides the candidates into ranges and stores the candidates in each range in compressed form as an arithmetic series.

Since we use the -mismatch algorithm in the first pass, we describe a structural property of the resulting list of candidates:

Theorem 3.1

[EGSZ17] Let be a candidate -period for a string , with all contained within . Given the fingerprints of and , we can determine whether or not has -period for any by storing at most additional fingerprints. These fingerprints represent substrings of the form , where is an integer and .

The structural property can be visualized in Figure 1.

Fig. 1: The dots represent candidate wildcard-periods. For any interval that has more than two dots, it follows that all dots are equally spaced after the first. The black dots represent while white dots are artificially inserted to form , dots that follow an arithmetic sequence.

Even though the list of candidates could be linear in size, Theorem 3.1 enforces a structure upon the list of candidates, so that an arithmetic sequence with first term and common difference includes all of . Thus, we can succinctly represent a superset that contains and Property 2 follows.

We now show that any wildcard period is included among the list of candidates stored by Algorithm 1 during the first pass, and can be recovered from the list.

Lemma 1

If is a period and , then can be recovered from and .

Proof

Suppose is a wildcard period. Then there exists an assignment to the wildcard characters such that . It follows that for ,

so the index will be reported by the -mismatch algorithm in the first pass.

If at that time during Pass 1 there is no other index in , then will be inserted into , so can clearly be recovered from . If there is another index in , then will be updated to be a divisor of . Hence, is a multiple of . Furthermore, any future update to will result in a value that divides the current value of , due to a greatest common divisor operation. Thus, will remain a multiple of the final value of , and so the set at the end of the first pass will contain .
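The bookkeeping used in this argument, namely keeping the first candidate in a range together with the greatest common divisor of the offsets of later candidates, can be sketched as follows. The class name `CandidateRange` and its methods are hypothetical, a single range is modeled rather than the family of intervals used by Algorithm 1, and candidates are assumed to arrive in increasing order.

```python
from math import gcd


class CandidateRange:
    """Succinct superset of the candidates reported inside one range."""

    def __init__(self):
        self.first = None  # smallest candidate seen so far
        self.step = 0      # gcd of (candidate - first) over later candidates

    def add(self, p):
        """Record candidate p using O(1) words of state."""
        if self.first is None:
            self.first = p
        else:
            # gcd(0, d) == d, so the first offset initialises the step;
            # every later update yields a divisor of the current step.
            self.step = gcd(self.step, p - self.first)

    def expand(self, upper):
        """All indices first, first + step, ... up to upper (inclusive)."""
        if self.first is None:
            return []
        if self.step == 0:
            return [self.first]
        return list(range(self.first, upper + 1, self.step))


r = CandidateRange()
for p in [4, 10, 16, 28]:
    r.add(p)
print(r.first, r.step, r.expand(28))
```

In this run the expanded list also contains 22, an index that was never reported: this is the kind of "false candidate" that the succinct representation admits and that the second pass must rule out.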

It remains to show that the list of candidate wildcard-periods can be verified in sublinear space in the second pass (Property 3). To do this, we need a combinatorial property for periodicity on strings with wildcard characters.

3.2 Verifying Candidates

Recall that after the first pass, the algorithm maintains succinctly represented arithmetic progressions , corresponding to the candidate wildcard periods. The algorithm also maintains , the list of positions of wildcard characters in . In the second pass, the algorithm must check, for each , , whether for an appropriate setting of the wildcard characters. The challenge is computing the fingerprints of both and in sublinear space, especially if the number of candidates is linear.

We first set a specific and note that for the smallest candidate , there are at most unique substrings , , . Since any other candidate satisfies for some integer , then is the concatenation

Thus, by storing fingerprints and positions, we can recover the fingerprint of the substring for each .

The second obstacle is handling wildcard characters in the computation of the fingerprints of and . To address this challenge, our algorithm delays the calculation of the contribution of wildcard characters to the fingerprints until we know the assignment of the wildcard character with respect to a candidate period. We show that for a specific , then there are at most possible assignments for the wildcard character with respect to all candidates , across all , where is the positions of all wildcard characters recorded by Algorithm 1. Therefore, we can compute the assignment for each wildcard character with respect to a candidate period in the second pass, and then compute the fingerprint of and .

Lemma 2

For a given , and , let denote the assignment of . Then .

Proof

Let be the smallest candidate in and be the largest candidate in so that for some integer . We partition into , the set of indices greater than , and , the set of indices no more than . We consider the wildcard characters , and note that the proof for is symmetric. Consider the sequences

Each term in a sequence that differs from the previous term corresponds to a mismatch between , , . For each , there are at most unique chains of substrings with length beginning at index . Hence, across all sequences , , , there are at most unique characters. Since the assignment of with respect to any candidate is for some integer , then it follows that there are at most assignments of across all . As the symmetric proof holds for , then there are at most assignments of across all .

Thus, deciding the assignment of with respect to a candidate is simple: For each such that :

  1. Let be the smallest candidate in and be the largest candidate in so that for some .

  2. For each :

    1. If , succinctly record the values of , , , .

    2. If , succinctly record the values of , , , .

    Let so that for some .

  3. The assignment of with respect to is any that is not a wildcard character (where is an integer).

We describe the second pass in Algorithm 2, recalling that at the end of the first pass, the algorithm records arithmetic progressions, succinctly represented, as well as the positions of all wildcard characters.

Input: A stream of symbols with at most wildcard characters, a succinct representation of all candidate wildcard periods and the position of the wildcard characters.
Output: All wildcard-periods .

1:for each such that  do
2:     for each such that , implicitly determine the value of with respect to .
3:     let be the integer for which is in the interval
4:     if  then has multiple values in
5:         record up to unique fingerprints of length , starting from .
6:     else has one value in
7:         record up to unique fingerprints of length , starting from .      
8:     check if and return if this is true.
9:for each which is in interval for some integer  do
10:     if there exists an index in whose distance from is a multiple of  then
11:         check if and return if this is true.      
Algorithm 2 (To determine any wildcard-period with ) Second pass

For each arithmetic progression, there are total possibilities for all of the wildcard characters. Thus, the algorithm maintains the characters corresponding to the value of all wildcard characters across all candidate positions.

We now show the ability to construct the fingerprints of for any candidate period .

Lemma 3

Let be a candidate -period for a string , with all contained within . Given the fingerprints of and , we can determine whether or not has wildcard-period for any by storing at most additional fingerprints.

Proof

Consider a decomposition of into substrings of length , so that . Even though the algorithm does not record a fingerprint for each , each index for which corresponds to at least one mismatch. Since the first pass searched for positions that contained at most mismatches, then it follows from Observation 1.1 that there are indices for which . Thus, recording the fingerprints and locations of these indices suffices to build fingerprints for , ignoring the wildcard characters. Then we can verify whether or not is a wildcard-period of if the assignment of the wildcard characters with respect to is also known.

By Theorem 3.1, the greatest common divisor of the difference between each in is a -period. That is, can be decomposed so that has length , and each subsequent substring has length . Then there exist at most indices for which , by Observation 1.1. Ignoring wildcard characters, storing the fingerprints and positions of these indices allows the recovery of the fingerprint of from the fingerprint of , since is a multiple of . By Lemma 2, we know the values of the wildcard characters with respect to . Therefore, we can confirm whether or not is a wildcard-period.

We now show correctness of the algorithm.

Lemma 4

For any period , the algorithm outputs .

Proof

Since the intervals cover , then for some . It follows from Lemma 1 that after the first pass, can be recovered from and . Thus, the second pass tests whether or not is a wildcard-period. By Lemma 3, the algorithm outputs , as desired.

3.3 Computing Large Wildcard-Periods

As in Algorithm 1, we would like to identify candidate periods during the first pass of the algorithm, while treating the wildcard characters as an additional symbol in the alphabet. Unfortunately, if a wildcard-period p is greater than n/2, then it no longer satisfies

HAM(S[1, n/2], S[p+1, p+n/2]) ≤ 2k

since p + n/2 > n, and S[p+1, p+n/2] is undefined. However, by treating the wildcard characters as an additional symbol, recall that the Hamming-distance bound HAM(S[1, x], S[p+1, p+x]) ≤ 2k holds for all x ≤ n-p. Then we would like to use as large an x as possible while still satisfying x ≤ n-p when choosing candidate wildcard periods p. To this effect, the observation in [EJS10] states that we can try exponentially decreasing values of x. Specifically, we run instances of the algorithm in succession, with x = n/2, n/4, n/8, and so on. Note that one of these values of x is the largest possible value that still satisfies x ≤ n-p. As a result, the corresponding algorithm instance outputs p, while the other instances do not output anything. We detail the first pass in full in Algorithm 3 in Appendix 0.C.

This partition of into the disjoint intervals , , guarantees that any -period is contained in one of these intervals. Moreover, the intervals partition

and so can be recovered from and . We present the second pass in Algorithm 4 in Appendix 0.C.

Since correctness follows from the same arguments as the case where p ≤ n/2, it remains to analyze the space complexity of our algorithm.

Theorem 3.2

There exists a two-pass randomized algorithm using O(k^3 polylog n) bits of space that finds the wildcard-periods of S and runs in amortized time per arriving symbol.

Proof

In the first pass, for each , we maintain a -mismatch algorithm which requires bits of space, as in Theorem 1.3. Since , we use bits of space in total in the first pass.

In the second pass, we maintain fingerprints for any set of indices in , and there are indices in for each , for a total of bits of space. In addition, we store the assignments for all the wildcard positions in each interval , where and . Thus, bits of space suffice for both passes.

The running time of the algorithm is dominated by the time spent for parallel copies of -mismatch algorithm in the first pass, i.e., Algorithm 3. From Theorem 1.3, the -mismatch algorithm runs in amortized time per arriving symbol. The rest of the algorithm consists of simple tasks like computing gcd and can be performed very quickly. In the second pass, in total at most assignments are determined and stored. Thus, the second pass runs in amortized time per arriving symbol.

4 Lower Bounds

We first note that [EJS10] shows that computing the period of a string in one pass requires Ω(n) space. Since the problem of periodicity for strings containing wildcards is a generalization of exact periodicity, the same lower bound applies.

Theorem 4.1 (Implied by Theorem 3 of [EJS10] and Theorem 16 of [EGSZ17])

Given a string S with at most k wildcard characters, any one-pass streaming algorithm that computes the smallest wildcard-period requires Ω(n) space.

To show a lower bound that randomized streaming algorithm that computes all wildcard-periods of with probability at least , even under the promise that the wildcard-periods are at most , consider the following construction. Define an infinite string , as in [GMSU16], and let be the prefix of length . Define to be the set of binary strings of length with Hamming distance from . For , let be the set of binary strings of length with either or . Pick uniformly at random from . Then Theorem 17 in [EGSZ17] shows a lower bound on the size of the sketches necessary to determine whether or .

Theorem 4.2

[EGSZ17] Any sketching function that determines whether or from and , with probability at least for , uses space.

Suppose Alice has , along with the locations of the first positions in which . Alice replaces these locations with wildcard characters , runs the wildcard-period algorithm, and forwards the state of the algorithm to Bob, who has . Bob then continues running the algorithm on to determine the wildcard-period of the string . Observe that:

Lemma 5

If , then the string has period . On the other hand, if , then has period greater than .

Combining Theorem 4.2 and Lemma 5:

Theorem 4.3

For with , any one-pass randomized streaming algorithm that computes all wildcard-periods of an input string with probability at least requires space, even under the promise that the wildcard-periods are at most .

Acknowledgements

We would like to thank the anonymous reviewers for their helpful comments. The work was supported by the National Science Foundation under NSF Awards #1649515 and #1619081.

References

  • [AEL10] Amihood Amir, Estrella Eisenberg, and Avivit Levy. Approximate periodicity. Algorithms and Computation, pages 25–36, 2010.
  • [AG97] Alberto Apostolico and Zvi Galil, editors. Pattern Matching Algorithms. Oxford University Press, Oxford, UK, 1997.
  • [AGM90] Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basic local alignment search tool. Journal of molecular biology, 215(3):403–410, 1990.
  • [AGMP13] Alexandr Andoni, Assaf Goldberger, Andrew McGregor, and Ely Porat. Homomorphic fingerprints under misalignments: sketching edit and shift distances. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 931–940, 2013.
  • [BG11] Dany Breslauer and Zvi Galil. Real-time streaming string-matching. In Combinatorial Pattern Matching, pages 162–172. Springer, 2011.
  • [Bla08] Francine Blanchet-Sadri. Algorithmic Combinatorics on Partial Words. Discrete mathematics and its applications. CRC Press, 2008.
  • [BMRW12] Francine Blanchet-Sadri, Robert Mercas, Abraham Rashin, and Elara Willett. Periodicity algorithms and a conjecture on overlaps in partial words. Theor. Comput. Sci., 443:35–45, 2012.
  • [CC07] Peter Clifford and Raphaël Clifford. Simple deterministic wildcard matching. Inf. Process. Lett., 101(2):53–54, 2007.
  • [CEPR09] Raphaël Clifford, Klim Efremenko, Ely Porat, and Amir Rothschild. From coding theory to efficient pattern matching. In Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 778–784, 2009.
  • [CFP15] Raphaël Clifford, Allyx Fontaine, Ely Porat, Benjamin Sach, and Tatiana A. Starikovskaya. Dictionary matching in a stream. In Algorithms - ESA 23rd Annual European Symposium, Proceedings, pages 361–372, 2015.
  • [CFP16] Raphaël Clifford, Allyx Fontaine, Ely Porat, Benjamin Sach, and Tatiana A. Starikovskaya. The k-mismatch problem revisited. In Proceedings of the 27th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 2039–2052, 2016.
  • [CH02] Richard Cole and Ramesh Hariharan. Verifying candidate matches in sparse and wildcard matching. In Proceedings on 34th Annual ACM Symposium on Theory of Computing (STOC), pages 592–601, 2002.
  • [CJPS13] Raphaël Clifford, Markus Jalsenius, Ely Porat, and Benjamin Sach. Space lower bounds for online pattern matching. Theoretical Computer Science, 483:68–74, 2013.
  • [CKP17] Raphaël Clifford, Tomasz Kociumaka, and Ely Porat. The streaming k-mismatch problem. CoRR, abs/1708.05223, 2017.
  • [CM11] Michael S. Crouch and Andrew McGregor. Periodicity and cyclic shifts via linear sketches. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques - 14th International Workshop, APPROX, and 15th International Workshop, RANDOM. Proceedings, pages 158–170, 2011.
  • [EAE06] Mohamed G. Elfeky, Walid G. Aref, and Ahmed K. Elmagarmid. STAGGER: periodicity mining of data streams using expanding sliding windows. In Proceedings of the 6th IEEE International Conference on Data Mining (ICDM), pages 188–199, 2006.
  • [EGSZ17] Funda Ergün, Elena Grigorescu, Erfan Sadeqi Azer, and Samson Zhou. Streaming periodicity with mismatches. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM, pages 42:1–42:21, 2017.
  • [EJS10] Funda Ergün, Hossein Jowhari, and Mert Saglam. Periodicity in streams. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, 13th International Workshop, APPROX 2010, and 14th International Workshop, RANDOM 2010. Proceedings, pages 545–559, 2010.
  • [EMS10] Funda Ergün, S. Muthukrishnan, and Süleyman Cenk Sahinalp. Periodicity testing with sublinear samples and space. ACM Trans. Algorithms, 6(2):43:1–43:14, 2010.
  • [Gaw13] Pawel Gawrychowski. Optimal pattern matching in LZW compressed strings. ACM Transactions on Algorithms (TALG), 9(3):25, 2013.
  • [GKP16] Shay Golan, Tsvi Kopelowitz, and Ely Porat. Streaming pattern matching with d wildcards. In 24th Annual European Symposium on Algorithms, pages 44:1–44:16, 2016.
  • [GMSU16] Pawel Gawrychowski, Oleg Merkurev, Arseny M. Shur, and Przemyslaw Uznanski. Tight tradeoffs for real-time approximation of longest palindromes in streams. In 27th Annual Symposium on Combinatorial Pattern Matching, CPM, pages 18:1–18:13, 2016.
  • [GS83] Zvi Galil and Joel Seiferas. Time-space-optimal string matching. Journal of Computer and System Sciences, 26(3):280–294, 1983.
  • [HR14] Danny Hermelin and Liat Rozenberg. Parameterized complexity analysis for the closest string with wildcards problem. In Combinatorial Pattern Matching - 25th Annual Symposium, CPM Proceedings, pages 140–149, 2014.
  • [IKM00] Piotr Indyk, Nick Koudas, and S. Muthukrishnan. Identifying representative trends in massive time series data sets using sketches. In VLDB, Proceedings of 26th International Conference on Very Large Data Bases, pages 363–372, 2000.
  • [Ind98] Piotr Indyk. Faster algorithms for string matching problems: Matching the convolution bound. In 39th Annual Symposium on Foundations of Computer Science, FOCS, pages 166–173, 1998.
  • [Kal02] Adam Kalai. Efficient pattern-matching with don’t cares. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 655–656, 2002.
  • [KMP77] Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323–350, 1977.
  • [KNW10] Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS, pages 41–52, 2010.
  • [KR87] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249–260, 1987.
  • [LN11] Oded Lachish and Ilan Newman. Testing periodicity. Algorithmica, 60(2):401–420, 2011.
  • [LNV14] Moshe Lewenstein, Yakov Nekrich, and Jeffrey Scott Vitter. Space-efficient string indexing for wildcard pattern matching. In 31st International Symposium on Theoretical Aspects of Computer Science (STACS), pages 506–517, 2014.
  • [MG82] Jayadev Misra and David Gries. Finding repeated elements. Sci. Comput. Program., 2(2):143–152, 1982.
  • [MMT14] Florin Manea, Robert Mercas, and Catalin Tiseanu. An algorithmic toolbox for periodic partial words. Discrete Applied Mathematics, 179:174–192, 2014.
  • [MR95] S. Muthukrishnan and H. Ramesh. String matching under a general matching relation. Inf. Comput., 122(1):140–148, 1995.
  • [PL07] Ely Porat and Ohad Lipsky. Improved sketching of Hamming distance with error correcting. In Annual Symposium on Combinatorial Pattern Matching, pages 173–182, 2007.
  • [PP09] Benny Porat and Ely Porat. Exact and approximate pattern matching in the streaming model. In 50th Annual IEEE Symposium on Foundations of Computer Science, FOCS, pages 315–323, 2009.
  • [RS17] Jakub Radoszewski and Tatiana A. Starikovskaya. Streaming k-mismatch with error correcting and applications. In 2017 Data Compression Conference, DCC, pages 290–299, 2017.

Appendix 0.A One-Pass Algorithm to Compute Small Wildcard-Periods

In this section, we address the problem of computing any wildcard-period p that satisfies p < n/2, under the condition that no wildcard character appears in the last p symbols of the string. As in Section 3, we run two algorithms in parallel. The first algorithm will return any wildcard-period that satisfies and the second algorithm will return any wildcard-period that satisfies . In the first process, we identify all indices such that . We simultaneously track the positions of the wildcard characters and the symbol that is positions away from each wildcard character, so that we know the assignment of each wildcard character with respect to each candidate period. Unfortunately, the second process cannot use the same paradigm, since the -mismatch algorithm reports candidate periods too late for fingerprints to be built. As a result, we must pre-emptively guess the candidate periods.
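As an offline point of reference for the problem just stated, the sketch below checks wildcard-periods by brute force in linear space. It relies on the fact that p is a wildcard-period exactly when, within every residue class of positions modulo p, all non-wildcard characters agree (each wildcard receives a single value, so a whole class must collapse to one character). The function names, the 0-based indexing, and the use of `?` as the wildcard symbol are assumptions made for illustration; this is not the streaming algorithm of this appendix.

```python
WILDCARD = "?"  # assumed stand-in for the paper's wildcard symbol


def has_wildcard_period(s: str, p: int, wildcard: str = WILDCARD) -> bool:
    """Return True if some wildcard assignment makes s[:n-p] equal s[p:]."""
    n = len(s)
    if not 1 <= p <= n:
        raise ValueError("p must satisfy 1 <= p <= len(s)")
    for r in range(p):
        # Positions r, r+p, r+2p, ... must all carry one common character.
        seen = {c for c in s[r::p] if c != wildcard}
        if len(seen) > 1:
            return False
    return True


def small_wildcard_periods(s: str, wildcard: str = WILDCARD) -> list:
    """Wildcard-periods p < n/2 with no wildcard among the last p symbols."""
    n = len(s)
    return [
        p
        for p in range(1, n)
        if 2 * p < n
        and wildcard not in s[n - p:]
        and has_wildcard_period(s, p, wildcard)
    ]
```

Note that `has_wildcard_period("a?b", 1)` is false under this definition: the single wildcard cannot equal both `a` and `b` at once.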

0.A.1 Computing Small Wildcard-Periods

In this section, we describe the algorithm that finds any wildcard-period with . We first designate wildcard characters as unique characters and run the -mismatch algorithm to find

When the -mismatch algorithm finds indices , we use the fingerprints for and to simultaneously build the fingerprint for and continue building the fingerprint for respectively. Concurrently, we also track the positions of each wildcard character. For some position of a wildcard character, we identify an arbitrary non-wildcard character that is at a position . By Lemma 2, we can do this in space, and thus replace the wildcard characters in the fingerprints of and .

The -mismatch algorithm outputs upon reading character . Thus for , it follows that so we can identify in time to build . From Theorem 3.1, we can build each of these fingerprints from a sequence of compressed fingerprints.
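The fingerprint manipulations in this subsection can be illustrated with a small sketch, assuming Karp-Rabin-style polynomial fingerprints in the spirit of [KR87]; the exact fingerprint definition used by the paper is not shown in this section. The sketch shows two properties the discussion relies on: the fingerprint of a substring can be derived from two prefix fingerprints, and a wildcard can be replaced by its assigned character after the fact, because the fingerprint is linear in each position. The modulus `P`, the base `R`, and all function names are assumptions.

```python
P = (1 << 61) - 1  # assumed prime modulus (a Mersenne prime)
R = 1_000_003      # assumed fixed base; normally drawn at random from [1, P)


def fingerprint(s: str) -> int:
    """Polynomial fingerprint: sum of ord(s[i]) * R**i, taken modulo P."""
    acc, power = 0, 1
    for ch in s:
        acc = (acc + ord(ch) * power) % P
        power = (power * R) % P
    return acc


def substring_fingerprint(fp_prefix_j: int, fp_prefix_i: int, i: int) -> int:
    """Fingerprint of s[i:j], given the fingerprints of s[:j] and s[:i]."""
    inverse_shift = pow(R, -i, P)  # modular inverse of R**i (Python 3.8+)
    return ((fp_prefix_j - fp_prefix_i) * inverse_shift) % P


def substitute(fp: int, pos: int, old: str, new: str) -> int:
    """Fingerprint after replacing the character `old` at index `pos` by `new`."""
    return (fp + (ord(new) - ord(old)) * pow(R, pos, P)) % P
```

For example, `substitute(fingerprint("ab?bab"), 2, "?", "a")` equals `fingerprint("ababab")`, so a fingerprint built while a wildcard's value was still unknown can be corrected once the assignment is known.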

0.A.2 Computing Large Wildcard-Periods

We now describe an algorithm for identifying all wildcard-periods such that . Let be the interval of length for and again define a set of candidate periods:

Let be a wildcard-period of . We first consider the case where and then the case where .

Observation 0.A.1

[CFP16] If is a -period for , then each such that

must be at least symbols apart.

By Observation 0.A.1, if , then . Moreover, we can detect whether by index . On the other hand, , and so we can properly build the fingerprint of .

Now, consider the case where . [EGSZ17] show that we can compute the fingerprint of by storing the fingerprints and positions of substrings.

Thus, we can build the fingerprint of regardless of whether or . In both cases, we again simultaneously track the positions of each wildcard character. For some position of a wildcard character, we identify an arbitrary non-wildcard character that is at a position .

By a similar reasoning to Lemma 2, we can do this in space, and thus replace the wildcard characters in the fingerprints of and .
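The wildcard bookkeeping described above can be pictured with the following offline sketch. The position offset is missing from the text here, so the sketch assumes that the matching character is taken from the same residue class modulo the candidate period p, which is what a period-p assignment requires. It stores one representative per residue class and is therefore not space-efficient; the names `wildcard_assignments` and `?` are hypothetical.

```python
def wildcard_assignments(s: str, p: int, wildcard: str = "?") -> dict:
    """Map each wildcard index to a non-wildcard character p-aligned with it.

    A wildcard whose whole residue class (mod p) consists of wildcards maps
    to None, meaning any character may be assigned to it.
    """
    representative = {}
    for i, c in enumerate(s):
        if c != wildcard:
            # Keep the first non-wildcard character seen in each residue class.
            representative.setdefault(i % p, c)
    return {
        i: representative.get(i % p)
        for i, c in enumerate(s)
        if c == wildcard
    }
```

For instance, `wildcard_assignments("ab?bab?b", 2)` returns `{2: "a", 6: "a"}`.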

Theorem 0.A.2

There exists a one-pass algorithm that outputs all the wildcard-periods of a given string with , and uses bits of space.

Proof

The -mismatch subroutine that identifies candidate wildcard-periods uses bits of space. We also maintain fingerprints for any set of indices in , and there are indices in for each , for a total of fingerprints. In addition, we store the assignments for all the wildcard positions in each interval , where and . Thus, bits of space suffice.

Appendix 0.B Distance to -Periodicity

In this section, we address the problem of finding distance to -periodicity in a string of length containing wildcard characters. That is, we find the minimum number of character changes in to obtain a string that has wildcard-period .
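To make the quantity concrete, here is a minimal offline sketch that computes it exactly, assuming a candidate period called `p` and the wildcard symbol `?` (both names are illustrative). Positions that are congruent modulo p must end up equal, so within each such class it suffices to keep the most frequent non-wildcard character and change the rest; wildcards never need to be changed.

```python
from collections import Counter


def distance_to_periodicity(s: str, p: int, wildcard: str = "?") -> int:
    """Minimum number of character changes so that s has wildcard-period p."""
    if not 1 <= p <= len(s):
        raise ValueError("p must satisfy 1 <= p <= len(s)")
    total = 0
    for r in range(p):
        # Frequencies of the non-wildcard characters at positions r, r+p, ...
        counts = Counter(c for c in s[r::p] if c != wildcard)
        if counts:
            # Keep the most frequent character, change every other one.
            total += sum(counts.values()) - max(counts.values())
    return total
```

For example, `distance_to_periodicity("abcabd", 3)` is 1, since only the class holding `c` and `d` is inconsistent. This uses space linear in the alphabet per class and is meant only as a reference against which streaming estimates could be compared.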

Suppose without loss of generality that divides , so that for some integer . Then can be visualized as a matrix so that . Intuitively, is the smallest number of changes to entries in matrix so that all the characters in each row are the same. Let be the frequency vector of the entries in , the row of , excluding both the most frequent character of and any wildcard characters that appear in . Then it follows that

It remains to estimate using one of several well-known techniques. Indeed, [EJS10] uses several references to obtain results that directly translate to strings containing wildcard characters. For example, [EJS10] use a heavy-hitter algorithm from [MG82] to approximate . We can slightly modify the technique by ignoring wildcard characters to obtain the following result:

Theorem 0.B.1

There exists a deterministic one-pass streaming algorithm that provides a -approximation of using bits of space.
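The modification mentioned above, ignoring wildcard characters inside a heavy-hitter summary, might look like the following sketch of the Misra-Gries frequent-elements summary [MG82]. This is the textbook form of the summary with a wildcard filter added; how the per-row summaries are combined into the estimate of Theorem 0.B.1 is not shown in this section, and the names `misra_gries` and `num_counters` are assumptions.

```python
def misra_gries(stream, num_counters: int, wildcard: str = "?") -> dict:
    """Misra-Gries summary of a stream, skipping wildcard characters.

    Each returned count underestimates the true frequency of its key by at
    most m / (num_counters + 1), where m is the number of non-wildcard items.
    """
    counters = {}
    for item in stream:
        if item == wildcard:
            continue  # wildcards carry no information about the row
        if item in counters:
            counters[item] += 1
        elif len(counters) < num_counters:
            counters[item] = 1
        else:
            # No free counter: decrement every counter, dropping zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

With a single counter, `misra_gries("aaab?ac?a", 1)` returns `{"a": 3}`: the true count of `a` is 5 among 7 non-wildcard items, which is within the stated error of 7/2.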

Similarly, [EJS10] use a distinct-elements algorithm from [KNW10] to approximate . Again, the technique can be modified by ignoring wildcard characters to obtain the following result:

Theorem 0.B.2

There exists a one-pass streaming algorithm that provides a -approximation of with probability at least , using bits of space.

Appendix 0.C Full Algorithms

In this section, we provide the full algorithms for finding wildcard-periods . We detail the first pass in full in Algorithm 3.

Input: A stream of symbols with at most wildcard characters.
Output: A succinct representation of all candidate wildcard periods and the position of the wildcard characters.

1:initialize for each and .
2:initialize .
3:for each index , let be the largest such that do
4:     using the -mismatch algorithm, check whether
5:     if so, let then
6:         let be the integer for which is in the interval
7:         if there exists no candidate in the interval  then
8:              add to .
9:         else
10:              let be the smallest candidate in