1. Introduction and Motivation
One of the most interesting and least studied problems in pattern matching is known as subsequence string matching or hidden pattern matching. In this case, we search for a pattern of length in the text of length as a subsequence, that is, we are looking for indices such that . We say that is hidden in the text . We do not put any constraints on the gaps , so in the language of  this is known as unconstrained hidden pattern matching. The most interesting quantity in such a problem is the number of subsequence occurrences in a text generated by a random source. In this paper, we study the limiting distribution of this quantity when , the length of the pattern, grows with .
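As a concrete illustration of the quantity being counted, the following Python sketch counts the occurrences of a pattern as a subsequence of a text, with no constraints on the gaps, using a standard dynamic program. It is not taken from the paper; the function name and the example strings are illustrative assumptions.

```python
def count_subsequence_occurrences(text: str, pattern: str) -> int:
    """Count increasing index tuples at which `text` spells out `pattern`.

    Hypothetical helper for illustration only (not the paper's method).
    """
    m = len(pattern)
    # ways[j] = number of ways to hide pattern[:j] in the text read so far.
    ways = [1] + [0] * m
    for symbol in text:
        # Iterate j downwards so the current text position is used
        # at most once within any single embedding.
        for j in range(m, 0, -1):
            if pattern[j - 1] == symbol:
                ways[j] += ways[j - 1]
    return ways[m]


if __name__ == "__main__":
    print(count_subsequence_occurrences("banana", "ana"))  # 4
```

The running time is proportional to the product of the text length and the pattern length, so the count can be computed even when the number of occurrences itself is astronomically large.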
Hereafter, we assume that a memoryless source generates the text , that is, all symbols are generated independently with probability for symbol , where the alphabet is assumed to be finite. We denote by the probability of the pattern . Our goal is to understand the probabilistic behavior, in particular the limiting distribution, of the number of subsequence occurrences, which we denote by . It is known that the behavior of depends on the order of magnitude of the pattern length . For example, for exact pattern matching (i.e., the pattern must occur as a string in consecutive positions of the text), the limiting distribution is normal for (more precisely, when , hence up to ), but it becomes a Pólya–Aeppli distribution when for some constant , and finally (conditioned on being non-zero) it turns into a geometric distribution when (see also ).
We might expect a similar behavior for subsequence pattern matching. In  it was proved by analytic combinatoric methods that the number of subsequence occurrences, , is asymptotically normal when , and not much is known beyond this regime. (See also . Asymptotic normality for fixed also follows from general results for U-statistics.) However, in many applications, as discussed below, we need to consider patterns whose lengths grow with . In this paper, we prove two main results. In Theorem 2.6 we establish that for the number of subsequence occurrences is asymptotically normal. Furthermore, in Theorem 2.7 we show that under some constraints on the structure of , the asymptotic normality can be extended to . Moreover, for the special pattern consisting of the same symbol repeated, we show in Theorem 2.4 that for , the distribution of the number of occurrences is asymptotically normal, while for larger (up to for some ) it is asymptotically log-normal. We conjecture that this dichotomy holds for a large class of patterns. Finally, for random typical we establish in Corollary 4.4 that is asymptotically normal for .
Regarding methodology, unlike  we use here probabilistic tools. We first observe that can be represented as a U-statistic (see (2.3) and Section 3.2). This suggests applying Hoeffding's projection method to prove asymptotic normality of for some large patterns. Indeed, we first decompose (for not too large), and show that the variable with the largest variance converges to a normal distribution, proving our main results, Theorems 2.6 and 2.7.
The hidden pattern matching problem, especially for large patterns, has many applications, ranging from intrusion detection to trace reconstruction, deletion channels, and DNA-based storage systems [1; 4; 5; 6; 12; 17]. Below we discuss two of them in some detail, namely the deletion channel and the trace reconstruction problem.
A deletion channel [5; 6; 7; 14; 17; 20] with parameter takes a binary sequence where as input and deletes each symbol in the sequence independently with probability . The output of such a channel is then a subsequence of , where follows the binomial distribution, and the indices correspond to the bits that are not deleted. Despite significant effort [6; 14; 15; 17; 20], the mutual information between the input and output of the deletion channel and its capacity are still unknown. However, it turns out that the mutual information can be formulated exactly as a subsequence pattern matching problem. In  it was proved that
where the sum is over all binary sequences of length smaller than and is the number of subsequence occurrences of in the text . As one can see, to find precise asymptotics of the mutual information we need to understand the probabilistic behavior of for and typical . The trace reconstruction problem [4; 11; 16; 18] is related to the deletion channel problem, since it asks how many copies of the output of the deletion channel we need to see before we can reconstruct the input sequence with high probability.
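The deletion channel described above is easy to simulate. The sketch below is only an illustration under stated assumptions: the function names, the seed, and the deletion probability 0.3 are hypothetical choices, and the code does not compute the mutual information. It shows that the channel output is always a subsequence of the input, which is why subsequence occurrence counts enter the analysis.

```python
import random


def deletion_channel(bits, d, rng):
    """Delete each symbol independently with probability d, keeping order.

    Hypothetical helper: `d` is the deletion probability, `rng` a random.Random.
    """
    # rng.random() lies in [0, 1), so a symbol is kept with probability 1 - d.
    return [b for b in bits if rng.random() >= d]


def is_subsequence(short, long):
    """Return True if `short` can be obtained from `long` by deleting symbols."""
    it = iter(long)
    # Each symbol of `short` must be found in the not yet consumed part of `long`.
    return all(any(s == t for t in it) for s in short)


if __name__ == "__main__":
    rng = random.Random(2024)  # arbitrary seed, for reproducibility only
    x = [rng.randrange(2) for _ in range(20)]
    y = deletion_channel(x, 0.3, rng)
    print(len(x), len(y), is_subsequence(y, x))
```

The number of surviving symbols is a sum of independent keep/delete decisions, which is the binomial output length mentioned in the text.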
2. Main Results
In this section we formulate our problem precisely and present our main results. Proofs are deferred to the next section.
2.1. Problem formulation and notation
We consider a random string of length . We assume that are i.i.d. random letters from a finite alphabet ; each letter has the distribution
for some given vector; we assume for each . We may also use for a random letter with this distribution.
Let be a fixed string of length over the same alphabet . We assume . Let
which is the probability that equals .
Let be the number of occurrences of as a subsequence of .
For a set (in our case or ) and , let be the collection of sets with . Thus, . For , contains just the empty set . For , we identify and in the obvious way. We write as , where we assume that . Then
In the limit theorems, we study the asymptotic distribution of . We then assume that and (usually) ; we thus implicitly consider a sequence of words of lengths , but for simplicity we do not show this in the notation.
We have for every . Hence,
so , and
We also write for the norm of a random variable , while is the usual Euclidean norm of a vector in some .
denotes constants that may be different at different occurrences; they may depend on the alphabet and , but not on , or .
Finally, and mean convergence in distribution and probability, respectively.
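The representation of the count as a sum over index sets, and the resulting formula for its mean, can be checked exactly on a tiny instance. The sketch below is an illustration rather than part of the paper. It assumes that the mean equals the number of index sets of the pattern's size times the product of the letter probabilities of the pattern, which follows from linearity of expectation. The names `count_by_subsets` and `exact_mean`, the two-letter alphabet, and the probabilities are hypothetical choices.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb, prod


def count_by_subsets(text, pattern):
    """Sum, over all index sets of size len(pattern), the indicator that the
    text restricted to the (sorted) index set equals the pattern."""
    m = len(pattern)
    return sum(
        all(text[i] == pattern[j] for j, i in enumerate(idx))
        for idx in combinations(range(len(text)), m)
    )


def exact_mean(n, pattern, probs):
    """Exact expected count for an i.i.d. text of length n, by enumerating
    every possible text (only feasible for very small n)."""
    total = Fraction(0)
    for text in product(probs, repeat=n):
        weight = prod((probs[c] for c in text), start=Fraction(1))
        total += weight * count_by_subsets(text, pattern)
    return total


if __name__ == "__main__":
    probs = {"a": Fraction(1, 3), "b": Fraction(2, 3)}  # assumed letter distribution
    pattern_prob = probs["a"] * probs["b"] * probs["a"]  # probability of "aba"
    print(exact_mean(6, "aba", probs) == comb(6, 3) * pattern_prob)  # True
```

Exact rational arithmetic is used so that the comparison is an identity rather than a floating-point approximation.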
We are now ready to present our main results regarding the limiting distribution of , the number of subsequence occurrences when . We start with a simple example, namely, for some , and show that depending on whether
or not the number of subsequences will follow asymptotically either the normal distribution or the log-normal distribution.
Before we present our results we consider asymptotically normal and log-normal distributions in general, and discuss their relation.
2.2. Asymptotic normality and log-normality
If is a sequence of random variables and and are sequences of real numbers, with , then
We say that is asymptotically normal if for some and , and asymptotically log-normal if for some and (this assumes ). Note that these notions are equivalent when the asymptotic variance is small, as made precise by the following lemma.
If , and are arbitrary, then
By replacing by , we may assume that . If with , then , and thus . It follows that (with ), and thus
and thus .
The converse is proved by the same argument. ∎
Lemma 2.2 is best possible. Suppose that . If , then , and thus
In this case (and only in this case), thus converges in distribution, after scaling, to a log-normal distribution. If , then no linear scaling of can converge in distribution to a non-degenerate limit, as is easily seen.
2.3. A simple example
We consider first a simple example where the asymptotic distribution can be found easily by explicit calculations. Fix and let , a string with identical letters. Then, if is the number of occurrences of in , then
We will show that is asymptotically normal if is small, and log-normal for larger .
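For a pattern consisting of one letter repeated, every choice of positions among the occurrences of that letter in the text is a subsequence occurrence, so the count is a binomial coefficient in the number of occurrences of the letter. The following sketch checks this identity against brute-force enumeration on a small text. The function names and the example text are assumptions made for illustration.

```python
from itertools import combinations
from math import comb


def count_repeated_letter(text, letter, m):
    """Occurrences of `letter` repeated m times as a subsequence of `text`,
    computed as a binomial coefficient in the number of copies of `letter`."""
    copies = text.count(letter)
    return comb(copies, m)  # comb returns 0 when m > copies


def count_repeated_letter_brute(text, letter, m):
    """The same count obtained by enumerating all index sets of size m."""
    return sum(
        all(text[i] == letter for i in idx)
        for idx in combinations(range(len(text)), m)
    )


if __name__ == "__main__":
    text = "abaabba"  # contains four copies of "a"
    print(count_repeated_letter(text, "a", 2),
          count_repeated_letter_brute(text, "a", 2))  # 6 6
```

Since the number of copies of the letter in a random text is binomially distributed, the distribution of the count in this example is that of a binomial coefficient evaluated at a binomial random variable.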
Let . Suppose that , with .
In particular, if , then
If , then this implies
(i): We have . Define
. Then, by the Central Limit Theorem,
By (2.14), we have
where is the Euler gamma function. We fix a sequence such that ; this is possible by the assumption. Note that (2.19) implies that , and thus . We may thus in the sequel assume . We assume also that is so large that .
Stirling’s formula implies, by taking the logarithm and differentiating twice (in the complex half-plane , say)
Consequently, (2.3) yields, noting that the assumptions just made imply ,
Consequently, using also (2.19), we obtain
which is equivalent to (2.15).
2.4. General results
We now present our main results. First, however, we outline the road map of our approach. We observe that the representation (2.3) shows that can be viewed as a U-statistic. For convenience, we consider in (2.7), which differs from by a constant factor only, and show in (3.18) that can be decomposed into a sum of orthogonal random variables such that, when is not too large, . Next, in Lemma 3.7 we prove that appropriately normalized converges to the standard normal distribution. This will allow us to conclude the asymptotic normality of .
In this paper, we only consider the region . First, for we claim that the number of subsequence occurrences is always asymptotically normal.
If , then
Furthermore, and .
In the second main result, we restrict attention to patterns that are not typical for the random text; however, we will allow .
Let be the proportions of the letters in , i.e., . Suppose that . If further , then we have the asymptotic normality
where is given by (2.29). Furthermore, and .
3. Analysis and Proofs
In this section we will prove our main results. We start with some preliminaries.
3.1. Preliminaries and more notation
Let, for ,
Thus, letting be any random variable with the distribution of ,
Let and be as above.
For every ,
For some and every ,
For any vector with ,
3.2. A decomposition
The representation (2.3) shows that is a special case of a U-statistic. (Recall that, in general, a U-statistic is a sum over subsets as in (2.3) of for some function .) For fixed , the general theory of Hoeffding  applies and yields asymptotic normality. (Cf. [13, Section 4] for a related problem.) For (our main interest), we can still use the orthogonal decomposition of , which in our case takes the following form.
By multiplying out this product, we obtain
We rearrange this sum. First, let , and consider all terms with a given . For each and , with , let
For given and , the number of such that equals the number of ways to choose, for each , elements of in a gap of length , where we define and , ; this number is
Consequently, combining the terms in (3.12) with the same ,
We define, for and ,
Thus (3.15) yields the decomposition
For , contains only the empty set , and
Furthermore, note that two summands in (3.15) with different are orthogonal, as a consequence of (3.2) and independence of different . Consequently, the variables (, ) are orthogonal, and hence the variables () are orthogonal.
Note also that by the combinatorial definition of given before (3.14), we see that
since this is just the number of , and
since this sum is the total number of ways to choose elements of the elements of in the gaps.
3.3. The projection method
We use the projection method used by Hoeffding  to prove asymptotic normality for U-statistics. Translated to the present setting, the idea of the projection method is to approximate by , thus ignoring all terms with in the sum in (3.18). In order to do this, we estimate variances.
First, by (3.4) and the independence of the ,
This leads to the following estimates.
Note that, for ,
If , then
By (3.28) and the assumption, for ,
and thus, summing a geometric series,
3.4. The first term
For , we identify and , and we write . Note that, by (3.14),
For later use, we define also
Then, for fixed ,
is a (shifted) hypergeometric distribution:
which we write as