    Hidden Words Statistics for Large Patterns

We study here the so-called subsequence pattern matching, also known as hidden pattern matching, in which one searches for a given pattern w of length m as a subsequence in a random text of length n. The quantity of interest is the number of occurrences of w as a subsequence (i.e., occurring in not necessarily consecutive text locations). This problem finds many applications, from intrusion detection to trace reconstruction, deletion channels, and DNA-based storage systems. In all of these applications, the pattern w is of variable length. To the best of our knowledge, this problem has only been tackled for a fixed length m=O(1) [Flajolet, Szpankowski and Vallée, 2006]. In our main result we prove that for m=o(n^{1/3}) the number of subsequence occurrences is asymptotically normally distributed. In addition, we show that under some constraints on the structure of w the asymptotic normality can be extended to m=o(√n). For a special pattern w consisting of the same symbol, we indicate that for m=o(n) the distribution of the number of subsequence occurrences is either asymptotically normal or asymptotically log-normal. We conjecture that this dichotomy holds for all patterns. We use Hoeffding's projection method for U-statistics to prove our findings.


1. Introduction and Motivation

One of the most interesting and least studied problems in pattern matching is known as subsequence string matching or hidden pattern matching. In this case, we search for a pattern $w = w_1 w_2 \cdots w_m$ of length $m$ in a text $\Xi_n = \xi_1 \xi_2 \cdots \xi_n$ of length $n$ as a subsequence, that is, we are looking for indices $1 \le i_1 < i_2 < \cdots < i_m \le n$ such that $\xi_{i_1} = w_1, \ldots, \xi_{i_m} = w_m$. We say that $w$ is hidden in the text $\Xi_n$. We do not put any constraints on the gaps $i_{j+1} - i_j$, so in the language of [Flajolet, Szpankowski and Vallée, 2006] this is known as unconstrained hidden pattern matching. The most interesting quantity in such a problem is the number of subsequence occurrences in a text generated by a random source. In this paper, we study the limiting distribution of this quantity when $m$, the length of the pattern, grows with $n$.
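
To make the counted quantity concrete, the following is a minimal Python sketch (not taken from the paper) that counts the index tuples $i_1 < \cdots < i_m$ matching a pattern, using a standard dynamic program. The function name `count_subsequence_occurrences` is our own illustrative choice.

```python
def count_subsequence_occurrences(text, pattern):
    """Count index tuples i_1 < ... < i_m with text[i_j] == pattern[j] for all j."""
    m = len(pattern)
    # ways[j] = number of ways to embed pattern[:j] into the text read so far
    ways = [1] + [0] * m
    for symbol in text:
        # Traverse j from right to left so that one text position is used
        # at most once within any single embedding.
        for j in range(m, 0, -1):
            if pattern[j - 1] == symbol:
                ways[j] += ways[j - 1]
    return ways[m]


# "ab" is hidden in "abab" at positions (1,2), (1,4) and (3,4).
occurrences = count_subsequence_occurrences("abab", "ab")
```

The run time is proportional to the product of the text and pattern lengths, so the count can be computed even when the number of occurrences itself is astronomically large.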

Hereafter, we assume that a memoryless source generates the text $\Xi_n$, that is, all symbols are generated independently with probability $p_a$ for symbol $a \in \mathcal{A}$, where the alphabet $\mathcal{A}$ is assumed to be finite. We denote by $p_w = \prod_{j=1}^m p_{w_j}$ the probability of the pattern $w$. Our goal is to understand the probabilistic behavior, in particular the limiting distribution, of the number of subsequence occurrences, which we denote by $Z$. It is known that the behavior of $Z$ depends on the order of magnitude of the pattern length $m$. For example, for exact pattern matching (i.e., the pattern must occur as a string in consecutive positions of the text), the limiting distribution is normal for $m = O(1)$ (more precisely, when $n p_w \to \infty$, hence up to $m$ of order $\log n$), but it becomes a Pólya–Aeppli distribution when $n p_w \to \lambda > 0$ for some constant $\lambda$, and finally (conditioned on being non-zero) it turns into a geometric distribution when $n p_w \to 0$. We might expect a similar behaviour for subsequence pattern matching. In [Flajolet, Szpankowski and Vallée, 2006] it was proved by analytic combinatoric methods that the number of subsequence occurrences, $Z$, is asymptotically normal when $m = O(1)$, and not much is known beyond this regime. (Asymptotic normality for fixed $m$ also follows from general results for $U$-statistics.) However, in many applications, as discussed below, we need to consider patterns whose lengths grow with $n$. In this paper, we prove two main results. In Theorem 2.6 we establish that for $m = o(n^{1/3})$ the number of subsequence occurrences is asymptotically normally distributed. Furthermore, in Theorem 2.7 we show that under some constraints on the structure of $w$, the asymptotic normality can be extended to $m = o(\sqrt{n})$. Moreover, for the special pattern $w = a^m$ consisting of the same symbol repeated, we show in Theorem 2.4 that for $m = o(\sqrt{n})$ the distribution of the number of occurrences is asymptotically normal, while for larger $m$ (within the range covered by Theorem 2.4) it is asymptotically log-normal. We conjecture that this dichotomy is true for a large class of patterns. Finally, for random typical $w$ we establish in Corollary 4.4 that $Z$ is asymptotically normal for a range of pattern lengths stated there.

Regarding methodology, unlike [Flajolet, Szpankowski and Vallée, 2006] we use here probabilistic tools. We first observe that $Z$ can be represented as a $U$-statistic (see (2.3) and Section 3.2). This suggests applying the Hoeffding projection method to prove asymptotic normality of $Z$ for some large patterns. Indeed, we first decompose $Z$ into a sum of orthogonal random variables with variances of decreasing order in $n$ (for $m$ not too large), and show that the variable with the largest variance converges to a normal distribution, proving our main results, Theorems 2.6 and 2.7.

The hidden pattern matching problem, especially for large patterns, finds many applications, from intrusion detection to trace reconstruction, deletion channels, and DNA-based storage systems [1; 4; 5; 6; 12; 17]. Below we discuss two of them in some detail, namely the deletion channel and the trace reconstruction problem.

A deletion channel [5; 6; 7; 14; 17; 20] with parameter $d$ takes a binary sequence $\Xi_n = \xi_1 \cdots \xi_n$, where $\xi_i \in \{0,1\}$, as input and deletes each symbol in the sequence independently with probability $d$. The output of such a channel is then a subsequence $\zeta = \zeta(\Xi_n) = \xi_{i_1} \cdots \xi_{i_M}$ of $\Xi_n$, where $M$ follows the binomial distribution $\mathrm{Bin}(n, 1-d)$, and the indices $i_1 < \cdots < i_M$ correspond to the bits that are not deleted. Despite significant effort [6; 14; 15; 17; 20], the mutual information between the input and output of the deletion channel and its capacity are still unknown. However, it turns out that the mutual information can be exactly formulated as a problem of subsequence pattern matching. It has been proved that

$$I(\Xi_n;\zeta(\Xi_n)) = \sum_{w} d^{\,n-|w|}(1-d)^{|w|}\Bigl(\mathbb{E}\bigl[Z_{\Xi_n}(w)\log Z_{\Xi_n}(w)\bigr] - \mathbb{E}\bigl[Z_{\Xi_n}(w)\bigr]\log \mathbb{E}\bigl[Z_{\Xi_n}(w)\bigr]\Bigr), \tag{1.1}$$

where the sum is over all binary sequences $w$ of length smaller than $n$ and $Z_{\Xi_n}(w)$ is the number of subsequence occurrences of $w$ in the text $\Xi_n$. As one can see, to find precise asymptotics of the mutual information we need to understand the probabilistic behavior of $Z_{\Xi_n}(w)$ for large $m$ and typical $w$. The trace reconstruction problem [4; 11; 16; 18] is related to the deletion channel problem, since it asks how many independent copies of the deletion-channel output we need to see until we can reconstruct the input sequence with high probability.
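
The weights $d^{n-|w|}(1-d)^{|w|}$ in (1.1) have a direct interpretation: for a fixed input $x$, the probability that the channel outputs $w$ is this weight times the number of ways $w$ is hidden in $x$. The sketch below is our own illustration (not the paper's code) and checks this exactly for a short input by enumerating all keep/delete masks. The names `count_embeddings` and `output_distribution`, the input `x`, and the value of `d` are assumptions chosen for the example.

```python
from fractions import Fraction
from itertools import product


def count_embeddings(text, pattern):
    """Number of occurrences of pattern as a subsequence of text."""
    ways = [1] + [0] * len(pattern)
    for symbol in text:
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == symbol:
                ways[j] += ways[j - 1]
    return ways[-1]


def output_distribution(x, d):
    """Exact law of the deletion-channel output for a fixed input x."""
    n = len(x)
    law = {}
    for mask in product([False, True], repeat=n):  # True means the symbol is kept
        out = "".join(s for s, keep in zip(x, mask) if keep)
        kept = sum(mask)
        law[out] = law.get(out, 0) + d ** (n - kept) * (1 - d) ** kept
    return law


x = "01101"            # illustrative input
d = Fraction(1, 3)     # illustrative deletion probability
law = output_distribution(x, d)
formula = {
    w: d ** (len(x) - len(w)) * (1 - d) ** len(w) * count_embeddings(x, w)
    for w in law
}
```

Exact rational arithmetic is used so that the two dictionaries can be compared for equality rather than up to rounding.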

2. Main Results

In this section we formulate our problem precisely and present our main results. Proofs are deferred to the next section.

2.1. Problem formulation and notation

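We consider a random string $\Xi_n = \xi_1 \cdots \xi_n$ of length $n$. We assume that $\xi_1, \ldots, \xi_n$ are i.i.d. random letters from a finite alphabet $\mathcal{A}$; each letter $\xi_i$ has the distribution

$$\mathbb{P}(\xi_i = a) = p_a, \qquad a \in \mathcal{A}, \tag{2.1}$$

for some given vector $p = (p_a)_{a\in\mathcal{A}}$; we assume $p_a > 0$ for each $a \in \mathcal{A}$. We may also use $\xi$ for a random letter with this distribution.

Let $w = w_1 \cdots w_m$ be a fixed string of length $m$ over the same alphabet $\mathcal{A}$. We assume $n \ge m$. Let

$$p_w := \prod_{j=1}^{m} p_{w_j}, \tag{2.2}$$

which is the probability that $\xi_1 \cdots \xi_m$ equals $w$.

Let $Z$ be the number of occurrences of $w$ as a subsequence of $\xi_1 \cdots \xi_n$.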

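For a set $S$ (in our case $[n]$ or $[m]$) and $k \ge 0$, let $\binom{S}{k}$ be the collection of sets $\alpha \subseteq S$ with $|\alpha| = k$. Thus, $\bigl|\binom{S}{k}\bigr| = \binom{|S|}{k}$. For $k = 0$, $\binom{S}{0}$ contains just the empty set $\emptyset$. For $k = 1$, we identify $\binom{S}{1}$ and $S$ in the obvious way. We write $\alpha \in \binom{[n]}{k}$ as $\{\alpha_1, \ldots, \alpha_k\}$, where we assume that $\alpha_1 < \cdots < \alpha_k$. Then

$$Z = \sum_{\alpha \in \binom{[n]}{m}} I_\alpha, \tag{2.3}$$

where

$$I_\alpha = \prod_{j=1}^{m} \mathbf{1}\{\xi_{\alpha_j} = w_j\}. \tag{2.4}$$
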
Remark 2.1.

In the limit theorems, we are studying the asymptotic distribution of $Z$. We then assume that $n \to \infty$ and (usually) $m = m_n \to \infty$; we thus implicitly consider a sequence of words $w^{(n)}$ of lengths $m_n$. But for simplicity we do not show this in the notation. ∎

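We have $\mathbb{E} I_\alpha = p_w$ for every $\alpha$. Hence,

$$\mathbb{E} Z = \sum_{\alpha \in \binom{[n]}{m}} \mathbb{E} I_\alpha = \binom{n}{m} p_w. \tag{2.5}$$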
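
As a sanity check of (2.5), the following hedged Python sketch computes $\mathbb{E} Z$ exactly for a small case by summing over all texts of length $n$, and compares it with $\binom{n}{m} p_w$. The alphabet, the probabilities, the pattern, and the helper names `count_embeddings` and `exact_mean` are illustrative assumptions, not part of the paper.

```python
from fractions import Fraction
from itertools import product
from math import comb


def count_embeddings(text, pattern):
    """Number of occurrences of pattern as a subsequence of text."""
    ways = [1] + [0] * len(pattern)
    for symbol in text:
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == symbol:
                ways[j] += ways[j - 1]
    return ways[-1]


def exact_mean(n, w, p):
    """E[Z] by brute force: sum over all |A|^n texts, weighted by their probability."""
    total = Fraction(0)
    for text in product(p.keys(), repeat=n):
        prob = Fraction(1)
        for symbol in text:
            prob *= p[symbol]
        total += prob * count_embeddings(text, w)
    return total


p = {"a": Fraction(1, 3), "b": Fraction(2, 3)}   # illustrative letter probabilities
w, n = "aba", 6
p_w = p["a"] * p["b"] * p["a"]
mean_brute_force = exact_mean(n, w, p)
mean_formula = comb(n, len(w)) * p_w
```

The brute-force sum is exponential in $n$ and is only meant to confirm the closed form on toy inputs.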

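Further, let

$$Y_\alpha := p_w^{-1} I_\alpha, \tag{2.6}$$

so $\mathbb{E} Y_\alpha = 1$, and

$$Z^* := p_w^{-1} Z = \sum_{\alpha \in \binom{[n]}{m}} Y_\alpha, \tag{2.7}$$

so $\mathbb{E} Z^* = \binom{n}{m}$ and

$$Z^* - \mathbb{E} Z^* = p_w^{-1} Z - \binom{n}{m} = \sum_{\alpha \in \binom{[n]}{m}} (Y_\alpha - 1). \tag{2.8}$$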

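We also write $\|X\|_2 := (\mathbb{E}|X|^2)^{1/2}$ for the $L^2$ norm of a random variable $X$, while $\|x\|$ is the usual Euclidean norm of a vector $x$ in some $\mathbb{R}^N$.

$C$ denotes constants that may be different at different occurrences; they may depend on the alphabet $\mathcal{A}$ and $p$, but not on $n$, $m$ or $w$.

Finally, $\overset{d}{\longrightarrow}$ and $\overset{p}{\longrightarrow}$ mean convergence in distribution and in probability, respectively.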

We are now ready to present our main results regarding the limiting distribution of $Z$, the number of subsequence occurrences, when $m \to \infty$. We start with a simple example, namely $w = a^m$ for some $a \in \mathcal{A}$, and show that, depending on whether $m = o(\sqrt{n})$ or not, the number of subsequence occurrences asymptotically follows either the normal distribution or the log-normal distribution.

Before we present our results we consider asymptotically normal and log-normal distributions in general, and discuss their relation.

2.2. Asymptotic normality and log-normality

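If $X_n$ is a sequence of random variables and $a_n$ and $b_n$ are sequences of real numbers, with $b_n > 0$, then

$$X_n \sim \mathrm{AsN}(a_n, b_n) \tag{2.9}$$

means that

$$\frac{X_n - a_n}{\sqrt{b_n}} \overset{d}{\longrightarrow} N(0,1). \tag{2.10}$$

We say that $X_n$ is asymptotically normal if $X_n \sim \mathrm{AsN}(a_n, b_n)$ for some $a_n$ and $b_n$, and asymptotically log-normal if $\ln X_n \sim \mathrm{AsN}(a_n, b_n)$ for some $a_n$ and $b_n$ (this assumes $X_n > 0$). Note that these notions are equivalent when the asymptotic variance $b_n$ is small, as made precise by the following lemma.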

Lemma 2.2.

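If $b_n \to 0$, and $a_n$ are arbitrary, then

$$\ln X_n \sim \mathrm{AsN}(a_n, b_n) \iff X_n \sim \mathrm{AsN}\bigl(e^{a_n}, b_n e^{2a_n}\bigr). \tag{2.11}$$

Proof.

By replacing $X_n$ by $X_n / e^{a_n}$, we may assume that $a_n = 0$. If $\ln X_n \sim \mathrm{AsN}(0, b_n)$ with $b_n \to 0$, then $\ln X_n \overset{p}{\longrightarrow} 0$, and thus $X_n \overset{p}{\longrightarrow} 1$. It follows that $(X_n - 1)/\ln X_n \overset{p}{\longrightarrow} 1$ (with $0/0 := 1$), and thus

$$\frac{X_n - 1}{b_n^{1/2}} = \frac{X_n - 1}{\ln X_n} \cdot \frac{\ln X_n}{b_n^{1/2}} \overset{d}{\longrightarrow} N(0,1), \tag{2.12}$$

and thus $X_n \sim \mathrm{AsN}(1, b_n)$.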

The converse is proved by the same argument. ∎

Remark 2.3.

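Lemma 2.2 is best possible. Suppose that $\ln X_n \sim \mathrm{AsN}(a_n, b_n)$. If $b_n \to b > 0$, then $\ln X_n - a_n \overset{d}{\longrightarrow} N(0, b)$, and thus

$$X_n / e^{a_n} \overset{d}{\longrightarrow} e^{\zeta_b}, \qquad \zeta_b \sim N(0, b). \tag{2.13}$$

In this case (and only in this case), $X_n$ thus converges in distribution, after scaling, to a log-normal distribution. If $b_n \to \infty$, then no linear scaling of $X_n$ can converge in distribution to a non-degenerate limit, as is easily seen. ∎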

2.3. A simple example

We consider first a simple example where the asymptotic distribution can be found easily by explicit calculations. Fix $a \in \mathcal{A}$ and let $w = a^m = a \cdots a$, a string with $m$ identical letters. Then, if $N_a$ is the number of occurrences of $a$ in $\xi_1 \cdots \xi_n$, we have

$$Z = \binom{N_a}{m}. \tag{2.14}$$

We will show that $Z$ is asymptotically normal if $m$ is small, and asymptotically log-normal for larger $m$.
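
The identity (2.14) is easy to confirm on a concrete text. The sketch below is an illustration of ours, with an arbitrary sample text and the hypothetical helper `count_embeddings`; it compares a direct subsequence count with the binomial coefficient $\binom{N_a}{m}$.

```python
from math import comb


def count_embeddings(text, pattern):
    """Number of occurrences of pattern as a subsequence of text."""
    ways = [1] + [0] * len(pattern)
    for symbol in text:
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == symbol:
                ways[j] += ways[j - 1]
    return ways[-1]


text = "abaacabaa"          # illustrative text; it contains N_a = 6 letters 'a'
n_a = text.count("a")
# Compare the direct count of a^m with C(N_a, m) for every m up to N_a + 1;
# math.comb returns 0 when m exceeds N_a, matching an impossible embedding.
pairs = [(count_embeddings(text, "a" * m), comb(n_a, m)) for m in range(0, n_a + 2)]
```

Because $Z$ is a deterministic function of the binomial variable $N_a$, its limit law can be read off from the Central Limit Theorem for $N_a$, which is what the proof of Theorem 2.4 does.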

Theorem 2.4.

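Let $w = a^m$. Suppose that $m < n p_a$, with $n p_a - m \gg n^{1/2}$.

(i) Then

$$\ln Z \sim \mathrm{AsN}\Bigl(\ln\binom{n p_a}{m},\; n p_a (1 - p_a) \ln^2\Bigl(1 - \frac{m}{n p_a}\Bigr)\Bigr). \tag{2.15}$$

(ii) In particular, if $m = o(n)$, then

$$\ln Z \sim \mathrm{AsN}\Bigl(\ln\binom{n p_a}{m},\; (p_a^{-1} - 1)\frac{m^2}{n}\Bigr). \tag{2.16}$$

(iii) If $m = o(n^{1/2})$, then this implies

$$Z / \mathbb{E} Z \sim \mathrm{AsN}\Bigl(1,\; (p_a^{-1} - 1)\frac{m^2}{n}\Bigr), \tag{2.17}$$

and thus

$$Z \sim \mathrm{AsN}\Bigl(\mathbb{E} Z,\; (p_a^{-1} - 1)\frac{m^2}{n}(\mathbb{E} Z)^2\Bigr). \tag{2.18}$$
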
Proof.

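(i): We have $N_a \sim \mathrm{Bin}(n, p_a)$. Define $Y := N_a - n p_a$. Then, by the Central Limit Theorem,

$$Y \sim \mathrm{AsN}\bigl(0, n p_a (1 - p_a)\bigr). \tag{2.19}$$

By (2.14), we have

$$\begin{aligned}
\ln Z - \ln\binom{n p_a}{m}
&= \ln\binom{n p_a + Y}{m} - \ln\binom{n p_a}{m} \\
&= \ln\Gamma(n p_a + Y + 1) - \ln\Gamma(n p_a + Y - m + 1) - \ln m! \\
&\qquad - \bigl(\ln\Gamma(n p_a + 1) - \ln\Gamma(n p_a - m + 1) - \ln m!\bigr) \\
&= \int_{y=0}^{Y} \int_{x=-m}^{0} (\ln\Gamma)''(n p_a + x + y + 1)\,dx\,dy,
\end{aligned} \tag{2.20}$$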

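where $\Gamma$ is the Euler gamma function. We fix a sequence $\omega_n$ such that $n^{1/2} \ll \omega_n \ll n p_a - m$; this is possible by the assumption. Note that (2.19) implies that $Y/\omega_n \overset{p}{\longrightarrow} 0$, and thus $\mathbb{P}(|Y| \le \omega_n) \to 1$. We may thus in the sequel assume $|Y| \le \omega_n$. We assume also that $n$ is so large that $\omega_n \le (n p_a - m)/2$.

Stirling's formula implies, by taking the logarithm and differentiating twice (in a suitable complex half-plane),

$$(\ln\Gamma)''(x) = \frac{1}{x} + O\Bigl(\frac{1}{x^2}\Bigr) = \frac{1}{x}\Bigl(1 + O\Bigl(\frac{1}{x}\Bigr)\Bigr), \qquad x \ge 1. \tag{2.21}$$

Consequently, (2.20) yields, noting that the assumptions just made imply $n p_a + x + y + 1 \ge (n p_a - m)/2$ on the domain of integration,

$$\begin{aligned}
\ln Z - \ln\binom{n p_a}{m}
&= \int_{y=0}^{Y} \int_{x=-m}^{0} \frac{1}{n p_a + x + y + 1}\Bigl(1 + O\Bigl(\frac{1}{n p_a - m}\Bigr)\Bigr)\,dx\,dy \\
&= \Bigl(1 + O\Bigl(\frac{\omega_n}{n p_a - m}\Bigr)\Bigr)\, Y \int_{x=-m}^{0} \frac{1}{n p_a + x}\,dx \\
&= (1 + o(1))\, Y \ln\frac{n p_a}{n p_a - m}.
\end{aligned} \tag{2.22}$$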

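Consequently, using also (2.19), we obtain

$$\frac{\ln Z - \ln\binom{n p_a}{m}}{n^{1/2}\,\bigl|\ln\bigl(1 - \frac{m}{n p_a}\bigr)\bigr|} = (1 + o_p(1))\frac{Y}{n^{1/2}} \overset{d}{\longrightarrow} N\bigl(0, p_a(1 - p_a)\bigr), \tag{2.23}$$

which is equivalent to (2.15).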

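(ii): If $m = o(n)$, then $\bigl|\ln\bigl(1 - \frac{m}{n p_a}\bigr)\bigr| \sim \frac{m}{n p_a}$, and (2.16) follows.

(iii): If $m = o(n^{1/2})$, then (ii) applies, so (2.16) holds; since the asymptotic variance $(p_a^{-1} - 1) m^2/n$ then tends to $0$, Lemma 2.2 implies

$$Z \Big/ \binom{n p_a}{m} \sim \mathrm{AsN}\Bigl(1,\; (p_a^{-1} - 1)\frac{m^2}{n}\Bigr). \tag{2.24}$$

Furthermore,

$$\mathbb{E} Z = \binom{n}{m} p_a^m = \frac{n^m e^{O(m^2/n)}}{m!}\, p_a^m \sim \frac{n^m}{m!}\, p_a^m \tag{2.25}$$

and, similarly, $\binom{n p_a}{m} \sim \frac{(n p_a)^m}{m!}$. Hence, $\binom{n p_a}{m} \sim \mathbb{E} Z$ and (2.17) follows from (2.24); (2.18) is an immediate consequence. ∎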

Example 2.5.

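Let $w = a^m$ as in Theorem 2.4, and let $m \sim \rho n^{1/2}$ for some $\rho > 0$. Then, as $n \to \infty$, by Theorem 2.4(ii), with $Z_n := Z$, $z_n := \binom{n p_a}{m}$, and $\sigma^2 := (p_a^{-1} - 1)\rho^2$,

$$\ln Z_n \sim \mathrm{AsN}(\ln z_n, \sigma^2) \tag{2.26}$$

and thus

$$Z_n / z_n \overset{d}{\longrightarrow} e^{\zeta_{\sigma^2}}, \qquad \zeta_{\sigma^2} \sim N(0, \sigma^2). \tag{2.27}$$

Hence, $Z_n / z_n$ converges in distribution to a log-normal distribution, so $Z_n$ is asymptotically log-normal but not asymptotically normal. See also Remark 2.3. ∎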

2.4. General results

We now present our main results. First, however, we outline the road map of our approach. We observe that the representation (2.3) shows that $Z$ can be viewed as a $U$-statistic. For convenience, we consider $Z^*$ in (2.7), which differs from $Z$ by a constant factor only, and show in (3.18) that $Z^*$ can be decomposed into a sum $\sum_{\ell} V_\ell$ of orthogonal random variables such that, when $m$ is not too large, the first term $V_1$ dominates the fluctuations of $Z^*$. Next, in Lemma 3.7 we prove that $V_1$, appropriately normalized, converges to the standard normal distribution. This will allow us to conclude the asymptotic normality of $Z$.

In this paper, we only consider the region $m = o(n^{1/2})$. First, for $m = o(n^{1/3})$ we claim that the number of subsequence occurrences is always asymptotically normal.

Theorem 2.6.

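If $m = o(n^{1/3})$, then

$$Z \sim \mathrm{AsN}\Bigl(\binom{n}{m} p_w,\; \sigma_1^2 p_w^2\Bigr), \tag{2.28}$$

where

$$\sigma_1^2 = \sum_{i=1}^{n} \sum_{a \in \mathcal{A}} p_a^{-1} \Biggl(\sum_{j:\, w_j = a} \binom{i-1}{j-1}\binom{n-i}{m-j}\Biggr)^{2} - n \binom{n-1}{m-1}^{2}. \tag{2.29}$$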

Furthermore, and .
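
The variance parameter $\sigma_1^2$ in (2.29) is an explicit finite sum and can be evaluated exactly. The following Python sketch is our own illustration, with the hypothetical function name `sigma1_squared` and arbitrary small parameters. As a consistency check it uses the single-letter pattern $w = a^m$, for which the inner sum collapses to $\binom{n-1}{m-1}$ by Vandermonde's identity, so that $\sigma_1^2 = (p_a^{-1} - 1)\frac{m^2}{n}\binom{n}{m}^2$, in agreement with the variance in (2.18) after division by $p_w^2$.

```python
from fractions import Fraction
from math import comb


def sigma1_squared(n, w, p):
    """Evaluate (2.29) exactly; p maps each letter to its probability (a Fraction)."""
    m = len(w)
    total = Fraction(0)
    for i in range(1, n + 1):
        for a, p_a in p.items():
            # Inner sum over the pattern positions j (1-based) that carry letter a.
            inner = sum(
                comb(i - 1, j - 1) * comb(n - i, m - j)
                for j in range(1, m + 1)
                if w[j - 1] == a
            )
            total += Fraction(inner * inner) / p_a
    return total - n * comb(n - 1, m - 1) ** 2


p = {"a": Fraction(1, 3), "b": Fraction(2, 3)}   # illustrative letter probabilities
n, m = 10, 3
value_single_letter = sigma1_squared(n, "a" * m, p)
closed_form = (1 / p["a"] - 1) * Fraction(m * m, n) * comb(n, m) ** 2
value_mixed = sigma1_squared(n, "aba", p)
```

For a pattern with several distinct letters no such collapse is available in general, and the double sum has to be evaluated as written.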

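In the second main result, we restrict attention to patterns $w$ that are not typical for the random text; in return, we allow $m = o(n^{1/2})$.

Theorem 2.7.

Let $q = (q_a)_{a \in \mathcal{A}}$ be the proportions of the letters in $w$, i.e., $q_a := \frac{1}{m}\bigl|\{j : w_j = a\}\bigr|$. Suppose that $\liminf_{n\to\infty} \|q - p\| > 0$. If further $m = o(n^{1/2})$, then we have the asymptotic normality

$$Z \sim \mathrm{AsN}\Bigl(\binom{n}{m} p_w,\; \sigma_1^2 p_w^2\Bigr), \tag{2.30}$$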

where is given by (2.29). Furthermore, and .

3. Analysis and Proofs

In this section we will prove our main results. We start with some preliminaries.

3.1. Preliminaries and more notation

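Let, for $a \in \mathcal{A}$,

$$\varphi_a(x) := p_a^{-1} \mathbf{1}\{x = a\} - 1. \tag{3.1}$$

Thus, letting $\xi$ be any random variable with the distribution of $\xi_1$,

$$\mathbb{E}\,\varphi_a(\xi) = 0, \qquad a \in \mathcal{A}. \tag{3.2}$$

Let $p_* := \min_{a \in \mathcal{A}} p_a$ and

$$B := p_*^{-1} - 1. \tag{3.3}$$
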
Lemma 3.1.

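Let $\varphi_a$ and $B$ be as above.

(i) For every $a \in \mathcal{A}$,

$$\mathbb{E}\bigl[\varphi_a(\xi)^2\bigr] = p_a^{-1} - 1 \le B. \tag{3.4}$$

(ii) For some $c_1 > 0$ and every $a \in \mathcal{A}$,

$$\|\varphi_a(\xi)\|_2 = (p_a^{-1} - 1)^{1/2} \ge c_1. \tag{3.5}$$

(iii) For any vector $r = (r_a)_{a \in \mathcal{A}}$ with $\sum_{a \in \mathcal{A}} r_a = 1$,

$$\Bigl\|\sum_{a \in \mathcal{A}} r_a \varphi_a(\xi)\Bigr\|_2 \ge \|r - p\| := \Bigl(\sum_{a \in \mathcal{A}} |r_a - p_a|^2\Bigr)^{1/2}. \tag{3.6}$$
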
Proof.

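The definition (3.1) yields

$$\mathbb{E}\bigl[\varphi_a(\xi)^2\bigr] = p_a^{-2} \operatorname{Var}\bigl[\mathbf{1}\{\xi = a\}\bigr] = p_a^{-2} p_a (1 - p_a) = p_a^{-1} - 1. \tag{3.7}$$

Hence, (3.4) and (3.5) follow, with $B$ given by (3.3).

Finally, for every $x \in \mathcal{A}$, by (3.1) again,

$$\sum_{a \in \mathcal{A}} r_a \varphi_a(x) = r_x p_x^{-1} - \sum_{a \in \mathcal{A}} r_a = r_x / p_x - 1 \tag{3.8}$$

and thus

$$\mathbb{E}\Bigl(\sum_{a \in \mathcal{A}} r_a \varphi_a(\xi)\Bigr)^{2} = \sum_{a \in \mathcal{A}} p_a (r_a / p_a - 1)^2 = \sum_{a \in \mathcal{A}} p_a^{-1} (r_a - p_a)^2, \tag{3.9}$$

and (3.6) follows, since $p_a^{-1} \ge 1$. ∎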
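
The three parts of Lemma 3.1 can be checked with exact arithmetic on a small alphabet. The sketch below is illustrative only: the probability vector `p`, the weight vector `r`, and the helpers `phi` and `expect` are assumptions made for the example.

```python
from fractions import Fraction

p = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}  # illustrative


def phi(a, x):
    """The centered indicator (3.1): 1{x = a} / p_a - 1."""
    return (1 / p[a] if x == a else 0) - 1


def expect(f):
    """Expectation of f(xi) for a random letter xi with distribution p."""
    return sum(p[x] * f(x) for x in p)


means = {a: expect(lambda x, a=a: phi(a, x)) for a in p}
second_moments = {a: expect(lambda x, a=a: phi(a, x) ** 2) for a in p}

r = {"a": Fraction(1, 4), "b": Fraction(1, 4), "c": Fraction(1, 2)}  # sums to 1
lhs = expect(lambda x: sum(r[a] * phi(a, x) for a in p) ** 2)   # left side of (3.9)
rhs = sum((r[a] - p[a]) ** 2 / p[a] for a in p)                 # right side of (3.9)
euclid_sq = sum((r[a] - p[a]) ** 2 for a in p)                  # squared norm in (3.6)
```

Here `lhs` and `rhs` both equal 13/16, and the inequality (3.6) amounts to `lhs >= euclid_sq`.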

3.2. A decomposition

The representation (2.3) shows that $Z$ is a special case of a $U$-statistic. (Recall that, in general, a $U$-statistic is a sum over subsets $\alpha$ as in (2.3) of $f(\xi_{\alpha_1}, \ldots, \xi_{\alpha_m})$ for some function $f$.) For fixed $m$, the general theory of Hoeffding applies and yields asymptotic normality. (Cf. [13, Section 4] for a related problem.) For $m \to \infty$ (our main interest), we can still use Hoeffding's orthogonal decomposition, which in our case takes the following form.

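By the definitions in Section 2.1 and (3.1),

$$Y_\alpha = \prod_{j=1}^{m} \bigl(p_{w_j}^{-1} \mathbf{1}\{\xi_{\alpha_j} = w_j\}\bigr) = \prod_{j=1}^{m} \bigl(\varphi_{w_j}(\xi_{\alpha_j}) + 1\bigr). \tag{3.10}$$

By multiplying out this product, we obtain

$$Y_\alpha = \sum_{\gamma \subseteq [m]} \prod_{j \in \gamma} \varphi_{w_j}(\xi_{\alpha_j}). \tag{3.11}$$

Hence,

$$Z^* = \sum_{\alpha \in \binom{[n]}{m}} Y_\alpha = \sum_{\alpha \in \binom{[n]}{m}} \sum_{\gamma \subseteq [m]} \prod_{j \in \gamma} \varphi_{w_j}(\xi_{\alpha_j}) = \sum_{\alpha \in \binom{[n]}{m}} \sum_{\gamma \subseteq [m]} \prod_{k=1}^{|\gamma|} \varphi_{w_{\gamma_k}}(\xi_{\alpha_{\gamma_k}}). \tag{3.12}$$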

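We rearrange this sum. First, let $\ell := |\gamma|$, and consider all terms with a given $\ell$. For each $\alpha \in \binom{[n]}{m}$ and $\gamma \in \binom{[m]}{\ell}$, with $\gamma = \{\gamma_1, \ldots, \gamma_\ell\}$, let

$$\alpha_\gamma := \{\alpha_{\gamma_1}, \ldots, \alpha_{\gamma_\ell}\} \in \binom{[n]}{\ell}. \tag{3.13}$$

For given $\beta \in \binom{[n]}{\ell}$ and $\gamma \in \binom{[m]}{\ell}$, the number of $\alpha \in \binom{[n]}{m}$ such that $\alpha_\gamma = \beta$ equals the number of ways to choose, for each $k \in [\ell + 1]$, $\gamma_k - \gamma_{k-1} - 1$ elements of $\alpha$ in a gap of length $\beta_k - \beta_{k-1} - 1$, where we define $\beta_0 = \gamma_0 := 0$ and $\beta_{\ell+1} := n + 1$, $\gamma_{\ell+1} := m + 1$; this number is

$$c(\beta, \gamma) := \prod_{k=1}^{\ell+1} \binom{\beta_k - \beta_{k-1} - 1}{\gamma_k - \gamma_{k-1} - 1}. \tag{3.14}$$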

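Consequently, combining the terms in (3.12) with the same $\alpha_\gamma$,

$$Z^* = \sum_{\ell=0}^{m} \sum_{\gamma \in \binom{[m]}{\ell}} \sum_{\beta \in \binom{[n]}{\ell}} c(\beta, \gamma) \prod_{k=1}^{\ell} \varphi_{w_{\gamma_k}}(\xi_{\beta_k}). \tag{3.15}$$

We define, for $0 \le \ell \le m$ and $\beta \in \binom{[n]}{\ell}$,

$$V_{\ell,\beta} := \sum_{\gamma \in \binom{[m]}{\ell}} c(\beta, \gamma) \prod_{k=1}^{\ell} \varphi_{w_{\gamma_k}}(\xi_{\beta_k}) \tag{3.16}$$

and

$$V_\ell := \sum_{\beta \in \binom{[n]}{\ell}} V_{\ell,\beta}. \tag{3.17}$$

Thus (3.15) yields the decomposition

$$Z^* = \sum_{\ell=0}^{m} V_\ell. \tag{3.18}$$

For $\ell = 0$, $\binom{[n]}{0}$ contains only the empty set $\emptyset$, and

$$V_0 = V_{0,\emptyset} = \binom{n}{m} = \mathbb{E} Z^*. \tag{3.19}$$

Furthermore, note that two summands in (3.15) with different $\beta$ are orthogonal, as a consequence of (3.2) and the independence of different $\xi_i$. Consequently, the variables $V_{\ell,\beta}$ ($0 \le \ell \le m$, $\beta \in \binom{[n]}{\ell}$) are orthogonal, and hence the variables $V_\ell$ ($0 \le \ell \le m$) are orthogonal.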

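Let

$$\sigma_\ell^2 := \operatorname{Var}(V_\ell) = \mathbb{E} V_\ell^2 = \sum_{\beta \in \binom{[n]}{\ell}} \mathbb{E} V_{\ell,\beta}^2, \qquad 1 \le \ell \le m. \tag{3.20}$$

Note also that, by the combinatorial definition of $c(\beta, \gamma)$ given before (3.14), we see that

$$\sum_{\beta \in \binom{[n]}{\ell}} c(\beta, \gamma) = \binom{n}{m}, \tag{3.21}$$

since this is just the number of $\alpha \in \binom{[n]}{m}$, and

$$\sum_{\gamma \in \binom{[m]}{\ell}} c(\beta, \gamma) = \binom{n - \ell}{m - \ell}, \tag{3.22}$$

since this sum is the total number of ways to choose the remaining $m - \ell$ elements of $\alpha$ among the $n - \ell$ elements of $[n] \setminus \beta$, that is, in the gaps.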
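
The gap-counting coefficient $c(\beta, \gamma)$ of (3.14) and the two summation identities (3.21) and (3.22) can be verified by brute force for small parameters. The following Python sketch is an illustration of ours; the function names `c` and `c_brute` and the values of `n`, `m`, `ell` are assumptions.

```python
from itertools import combinations
from math import comb


def c(beta, gamma, n, m):
    """Product formula (3.14), with beta_0 = gamma_0 = 0, beta_{l+1} = n+1, gamma_{l+1} = m+1."""
    b = (0, *beta, n + 1)
    g = (0, *gamma, m + 1)
    result = 1
    for k in range(1, len(b)):
        # Choose gamma_k - gamma_{k-1} - 1 positions inside a gap of length beta_k - beta_{k-1} - 1.
        result *= comb(b[k] - b[k - 1] - 1, g[k] - g[k - 1] - 1)
    return result


def c_brute(beta, gamma, n, m):
    """Direct count of m-subsets alpha of [n] whose gamma-indexed elements equal beta."""
    return sum(
        1
        for alpha in combinations(range(1, n + 1), m)
        if tuple(alpha[k - 1] for k in gamma) == tuple(beta)
    )


n, m, ell = 7, 4, 2
betas = list(combinations(range(1, n + 1), ell))
gammas = list(combinations(range(1, m + 1), ell))
formula_matches_count = all(
    c(b, g, n, m) == c_brute(b, g, n, m) for b in betas for g in gammas
)
sums_over_beta = [sum(c(b, g, n, m) for b in betas) for g in gammas]    # (3.21)
sums_over_gamma = [sum(c(b, g, n, m) for g in gammas) for b in betas]   # (3.22)
```

Note that `itertools.combinations` yields tuples in increasing order, which matches the convention $\alpha_1 < \cdots < \alpha_m$ used in the text.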

3.3. The projection method

We use the projection method used by Hoeffding to prove asymptotic normality for $U$-statistics. Translated to the present setting, the idea of the projection method is to approximate $Z^* - \mathbb{E} Z^*$ by $V_1$, thus ignoring all terms with $\ell \ge 2$ in the sum in (3.18). In order to do this, we estimate variances.

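First, by (3.4) and the independence of the $\xi_i$,

$$\Bigl\|\prod_{k=1}^{\ell} \varphi_{w_{\gamma_k}}(\xi_{\beta_k})\Bigr\|_2 = \Bigl(\prod_{k=1}^{\ell} \mathbb{E}\bigl|\varphi_{w_{\gamma_k}}(\xi_{\beta_k})\bigr|^2\Bigr)^{1/2} \le B^{\ell/2}. \tag{3.23}$$

By Minkowski's inequality, (3.16), (3.23) and (3.22),

$$\|V_{\ell,\beta}\|_2 \le \sum_{\gamma \in \binom{[m]}{\ell}} c(\beta, \gamma) B^{\ell/2} = B^{\ell/2} \binom{n - \ell}{m - \ell} \tag{3.24}$$

or, equivalently,

$$\mathbb{E} V_{\ell,\beta}^2 \le B^{\ell} \binom{n - \ell}{m - \ell}^{2}. \tag{3.25}$$

This leads to the following estimates.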

Lemma 3.2.

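For $1 \le \ell \le m$,

$$\sigma_\ell^2 := \mathbb{E} V_\ell^2 \le \hat\sigma_\ell^2 := B^{\ell} \binom{n}{\ell} \binom{n - \ell}{m - \ell}^{2}. \tag{3.26}$$

Proof.

The definition of $V_\ell$ in (3.17) and (3.25) yield, since the summands are orthogonal,

$$\sigma_\ell^2 := \mathbb{E} V_\ell^2 = \sum_{\beta \in \binom{[n]}{\ell}} \mathbb{E} V_{\ell,\beta}^2 \le \binom{n}{\ell} B^{\ell} \binom{n - \ell}{m - \ell}^{2}, \tag{3.27}$$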

as needed. ∎

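Note that, for $1 \le \ell < m$,

$$\frac{\hat\sigma_{\ell+1}^2}{\hat\sigma_\ell^2} = B\,\frac{n - \ell}{\ell + 1}\Bigl(\frac{m - \ell}{n - \ell}\Bigr)^{2} = \frac{B (m - \ell)^2}{(\ell + 1)(n - \ell)} \le \frac{B m^2}{(\ell + 1)\, n}. \tag{3.28}$$

Lemma 3.3.

If $B m^2 \le n$, then

$$\operatorname{Var}(Z^* - V_1) \le B^2 m^2 \binom{n - 1}{m - 1}^{2}. \tag{3.29}$$

Proof.

By (3.28) and the assumption, for $1 \le \ell < m$,

$$\frac{\hat\sigma_{\ell+1}^2}{\hat\sigma_\ell^2} \le \frac{1}{\ell + 1} \le \frac{1}{2}, \tag{3.30}$$

and thus, summing a geometric series,

$$\operatorname{Var}(Z^* - V_1) = \sum_{\ell=2}^{m} \operatorname{Var}(V_\ell) \le \sum_{\ell=2}^{m} \hat\sigma_\ell^2 \le \sum_{\ell=2}^{m} 2^{2-\ell} \hat\sigma_2^2 \le 2 \hat\sigma_2^2 = B^2 n (n - 1) \binom{n - 2}{m - 2}^{2} \le B^2 m^2 \binom{n - 1}{m - 1}^{2}. \tag{3.31}$$

∎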
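
The chain of inequalities in the proof can be traced numerically. The sketch below is a hedged illustration that evaluates the bounds $\hat\sigma_\ell^2$ of (3.26) exactly and checks the ratio bound (3.30) and the final estimate (3.31). The condition $B m^2 \le n$ is the form of the lemma's hypothesis assumed here, and the values of `n`, `m` and `B`, as well as the function name `sigma_hat_sq`, are arbitrary choices of ours.

```python
from fractions import Fraction
from math import comb


def sigma_hat_sq(ell, n, m, B):
    """Upper bound (3.26): B^ell * C(n, ell) * C(n - ell, m - ell)^2."""
    return B ** ell * comb(n, ell) * comb(n - ell, m - ell) ** 2


n, m = 100, 5
B = Fraction(2)      # corresponds to p_* = 1/3, an illustrative choice
hypothesis_holds = B * m * m <= n   # assumed form of the hypothesis of Lemma 3.3

# Successive ratios of the bounds, for ell = 1, ..., m - 1.
ratios = [
    sigma_hat_sq(ell + 1, n, m, B) / sigma_hat_sq(ell, n, m, B)
    for ell in range(1, m)
]
tail = sum(sigma_hat_sq(ell, n, m, B) for ell in range(2, m + 1))
twice_second = 2 * sigma_hat_sq(2, n, m, B)
final_bound = B ** 2 * m ** 2 * comb(n - 1, m - 1) ** 2
```

In this example the ratios decay much faster than the factor $1/2$ used in the proof, which is all that the geometric-series step requires.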

3.4. The first term V1

For $\ell = 1$, we identify $\binom{[n]}{1}$ and $[n]$, and we write $V_{1,i} := V_{1,\{i\}}$. Note that, by (3.14),

$$c(i, j) := c(\{i\}, \{j\}) = \binom{i - 1}{j - 1}\binom{n - i}{m - j}. \tag{3.32}$$

Remark 3.4.

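For later use, we define also

$$\pi(i, j) := \frac{c(i, j)}{c(1, 1)} = \frac{c(i, j)}{\binom{n - 1}{m - 1}}. \tag{3.33}$$

Then, for fixed $i$, $\pi(i, \cdot)$ is the probability function of a (shifted) hypergeometric distribution:

$$\pi(i, j) = \mathbb{P}(X = j - 1) = \frac{\binom{i - 1}{j - 1}\binom{n - i}{m - j}}{\binom{n - 1}{m - 1}}, \tag{3.34}$$

which we write as

$$X \sim \mathrm{HGe}(n - 1, m - 1, i - 1). \tag{3.35}$$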
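
That $\pi(i, \cdot)$ is a probability distribution, and that it has the usual hypergeometric mean, can be confirmed exactly for small parameters. This Python sketch is our own illustration; the function name `pi` and the values of `n`, `m`, `i` are assumptions.

```python
from fractions import Fraction
from math import comb


def pi(i, j, n, m):
    """The normalized coefficient (3.33)-(3.34) as an exact fraction."""
    return Fraction(comb(i - 1, j - 1) * comb(n - i, m - j), comb(n - 1, m - 1))


n, m, i = 12, 5, 4
weights = [pi(i, j, n, m) for j in range(1, m + 1)]
total_mass = sum(weights)
# Mean of X = j - 1, i.e. of the number of marked items in a sample of size m - 1
# drawn from n - 1 items of which i - 1 are marked.
mean_x = sum((j - 1) * weight for j, weight in zip(range(1, m + 1), weights))
```

The total mass equals 1 by Vandermonde's identity, and the mean equals $(m-1)(i-1)/(n-1)$, the standard mean of a hypergeometric variable with these parameters.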

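For $\ell = 1$, (3.17) and (3.16) become

$$V_1 = \sum_{i=1}^{n} V_{1,i} \tag{3.36}$$

with, using (3.32),

$$V_{1,i} = \sum_{j=1}^{m} c(i, j)\, \varphi_{w_j}(\xi_i) = \sum_{j=1}^{m} \binom{i - 1}{j - 1}\binom{n - i}{m - j} \varphi_{w_j}(\xi_i). \tag{3.37}$$

Note that $V_{1,i}$ is a function of $\xi_i$, and thus the random variables $V_{1,i}$ are independent. Furthermore, (3.2) implies $\mathbb{E} V_{1,i} = 0$. Let

$$\tau_i^2 := \operatorname{Var} V_{1,i} = \mathbb{E} V_{1,i}^2. \tag{3.38}$$

Then, see (3.20),

$$\sigma_1^2 = \operatorname{Var} V_1 = \sum_{i=1}^{n} \operatorname{Var} V_{1,i} = \sum_{i=1}^{n} \tau_i^2. \tag{3.39}$$

Observe that it follows from (3.37) and (3.1) that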

 τ2i=∑a∈Ap−1a⎛⎜⎝∑j: wj=a(i−1j−1)(n−im−j)⎞⎟⎠