## I Introduction

Due to advances in DNA sequencing, vast amounts of biological sequence data are available nowadays. Developing efficient methods for the analysis and storage of this type of data will benefit from gaining a better mathematical understanding of the structure of these sequences. Biological sequences are formed by genomic mutations, which alter the sequence in each generation to create a new sequence in the next generation. These processes can be viewed as stochastic string editing operations that shape the statistical properties of sequence data.

In this paper, our goal is to gain a better understanding of the evolution of sequences under random mutations. We represent the evolutionary process as a stochastic system in which an arbitrary initial string evolves through random mutation events. In such systems, we will study the evolution of the frequencies of words of length , i.e., -mers, as the sequence evolves. The analysis of -mers has various applications, including identifying functions and evolutionary features [sievers2017]. Alignment-free sequence comparison also relies on -mer frequencies [zielezinski2017]. Their analysis is also of interest because other statistical properties can be computed from -mer frequencies.

From an information-theoretic point of view, stochastic sequence generation process through mutation can be viewed as a *source* of information. We study the entropy of such sources, which can be viewed as representing the complexity of sequences generated by the source. Sequence complexity measures, including entropy, have been used to determine the origin and/or the role of DNA sequences [orlov2004, farach1995, wan2003]

, for example to classify protein-coding and non-coding regions of a genome. The entropy of a source also determines how well the sequences it produces can be compressed, an increasingly important problem given the growth of biological data.

Several types of mutations exist, including substitution, duplication, insertion, and deletion. Substitution refers to changing a symbol in the sequence, e.g., . Duplication mutations, where a segment of DNA (called the template) is
copied and inserted elsewhere in the genome, may be of the
*tandem* or *interspersed* type. In tandem duplication, the copy is inserted immediately after the template. For example, from , we may obtain , where the template is overlined and the copy is underlined. For interspersed duplication, there is generally no relationship between where the template is located and where the copy is inserted. As an example, two possibilities for after a single interspersed duplication are and . Our focus will be on duplication mutations, which are thought to play a critical role in generating new genetic material [ohno1970evolution].

Tandem duplication is generally thought to be caused by slipped-strand mispairings [mundy2004], where during DNA synthesis, one strand in a DNA duplex becomes misaligned with the other. Tandem duplications and substitutions, along with other mutations, lead to tandem repeats, i.e., stretches of DNA in which the same pattern is repeated many times. Tandem repeats are known to cause important phenomena such as chromosome fragility [usdin2008a]. Interspersed duplications are caused by transposons, or “jumping genes”, which are elements in the genome that can “copy/paste” themselves into different locations. Interspersed duplication is of interest as it leads to interspersed repeats, which form 45% of the human genome [lander2001initial].

We will analyze two systems involving the types of duplications discussed. The first system models a sequence evolving through tandem duplications and substitutions (TDS) and the second system represents interspersed duplications (ID). Along with duplications, other types of mutations occur. But for simplicity, our attention is limited to the aforementioned systems, and we leave more comprehensive analysis to future work. Furthermore, the significantly more complex effect of natural selection is not considered.

In TDS systems, in each step, i) a randomly chosen substring of the sequence is duplicated and inserted in tandem, or ii) a position is chosen at random and the symbol in that position is changed to one of the other symbols. In ID systems, a string evolves through random interspersed-duplication events, i.e., in each step, a random segment of the string is duplicated and then inserted in a random position in the string, independent of the position of the original segment.

Our analysis starts by considering how

-mer frequencies evolve as mutations occur. To analyze their evolution, we use the stochastic-approximation method, which enables modeling of a discrete dynamic system by a corresponding continuous system described by an ordinary differential equation (ODE). For the TDS model, our approach allows us to compute the limit for the frequency of any

-mer as a function of model parameters. We will then use these results to provide bounds on the entropy of sequences generated by tandem duplications and substitutions. For the ID model, we show that the frequencies of strings of length larger than one are, in the limit, consistent with those of iid sequences; implying that in a certain sense, a sequence evolving through interspersed duplication is unrecognizable from an iid sequence. Note that an iid sequence has the maximum entropy among sequences with a given symbol distribution. The structure of the limit set for -mer frequencies in ID systems, however, leads to trivial upper bounds on the entropy. However, in certain cases these bounds are satisfied with equality.In previous work, the related problem of finding the combinatorial capacity of duplication systems has been studied. The combinatorial capacity is related to entropy but is defined based on the size of the set of sequences that can be generated by the system, without considering their probabilities. The combinatorial capacity is studied by

[farnoud2016, jain2017c], for duplication systems (without allowing other types of mutations) and by [jain2017b] for systems with both tandem duplication and substitution. Compared to combinatorial capacity, entropy, which is studied in this paper, provides a more accurate measure of the complexity and compressibility of sequences generated by the system. For duplication systems and duplication/substitution systems, entropy has been studied by [elishco2016b, elishco2018]. While this work considers a wider range of systems, it only allows duplications involving single symbols. Furthermore, it does not study -mer frequencies. The stochastic-approximation framework has been used for estimation of model parameters in tandem duplication systems [farnouda]. Estimating the entropy of DNA sequences has been studied in [schmitt1997, DNAEntropyDatacompression, farach1995]. However these works focus on estimating the entropy from a given sequence, rather than computing the entropy of a stochastic sequence generation system that models evolution. Duplication systems have also been studied in the context of designing error-correcting codes [jain2017duplication, dolecek2010repetition, chee2017deciding, lenz2017bounds].The rest of the paper is organized as follows. Notation and preliminaries are given in the next section. In Section III, we present the framework for the application of stochastic approximation to our string-duplication systems. Section IV contains the analysis of the evolution of -mer frequencies in tandem duplication systems and the proof of entropy bounds. Section V is devoted to the analysis of -mer frequencies in strings undergoing random interspersed duplications. We close the paper with concluding remarks in Section VI.

## Ii Notation and Preliminaries

For a positive integer , let . For a finite alphabet , the set of all finite strings over is denoted , and the set of all finite non-empty strings is denoted . Also, let denote the set of -mers, i.e., length- strings, over . The elements in strings are indexed starting from 1, e.g., , where is the length of . For a string , denotes the length- substring of starting at . Furthermore, the concatenation of two strings and is denoted by . For a non-negative integer , and , is a concatenation of copies of

. Vectors and strings are denoted by boldface letters such as

, while scalars and symbols by normal letters such as .Consider an initial string and a process where in each step a random transform, or “mutation”, is applied to , resulting in . To avoid the complications arising from boundaries, we assume the strings are circular, with a given origin and direction. Let the length of be denoted by . To a duplication of length , which may be tandem or interspersed depending on the model under study, we assign probability . For TDS systems, in which substitutions are present, we denote the probability of substitution with . For ID systems, we let . The position of the template in duplication mutations is chosen at random among the possible options. For interspersed duplication, the position at which the copy is inserted is also chosen randomly. Furthermore, for substitution mutations, the position of the symbol that is substituted is chosen randomly. We assume there exists such that for all . Hence, we have .

For a string , denote the number of appearances of in as , and its frequency as , where . For example, if , then . Furthermore, for any set , we define , and . Thus is a vector representing the number of appearances of in the string at time and is the normalized version of . We let

be the filtration generated by the random variables

.Before proceeding to the analysis of -mer frequencies, we present two results for the evolution of symbol frequencies (1-mers). These results can be viewed as extensions of results for Pólya urn models. In such models, a random ball is chosen from an urn containing balls of different colors. The chosen ball is returned to the urn, along with a predetermined number of balls of the same color. It is known that the ratio of the balls of each color (equivalent to symbol frequencies) is a martingale and converges almost surely. While strings are more complex objects than urns, we describe similar results that are valid for any duplication process in which for each , all -substring of have the same chance of being duplicated. In particular, these results hold both for TDS systems with and for ID systems.

###### Theorem 1.

In a duplication system with , the random variables , , are martingales and converge almost surely.

###### Proof:

Suppose . We have

We thus have and so is a martingale. Since it is nonnegative, by the martingale convergence theorem, it converges almost surely. ∎

The above theorem does not in fact require the distribution to be constant and bounded. Under our assumption that is so, we can in addition obtain the following result on the probability of deviating from its starting value.

###### Theorem 2.

For all and we have

###### Proof:

Since for and , . Thus

implying that

Let so that and note that

By the Hoeffding-Azuma inequality, we have . ∎

The preceding theorem implies that it is unlikely for the composition of a long DNA sequence to change dramatically through random duplication events of bounded length. Such changes, if observed, are likely the result of context-dependent duplications or other biased mutations.

Unfortunately, this simple martingale argument does not extend to when . Therefore, for analyzing such cases, we use the more flexible technique of stochastic approximation as described in the sequel.

##
Iii Stochastic Approximation

for Duplication Systems

In this section, we present an overview of the application of stochastic approximation to the analysis of duplication systems. By using stochastic approximation, our goal is to find out how changes with by finding a differential equation whose solution approximates .

We state a set of conditions that must be satisfied for our analysis. Let denote the expected value conditioned on the fact that the length of the duplicated substring is and let . In the case of substitution, we let . We consider the following conditions.

###### A 1

There exists such that for .

###### A 2

, and thus , are bounded.

###### A 3

is bounded.

###### A 4

For each , is a function of only, so we can write .

###### A 5

The function is Lipschitz.

We assume (A1) holds. From this follows (A.2) since for each -mer, a mutation can create or eliminate a bounded number of occurrences. Additionally, (A.3) holds because each element of is between 0 and 1. The correctness of (A.4) and (A.5) will be shown for each system.

To understand how varies, our starting point is its difference sequence . We note that

(1) |

For the first term of the right side of (1), we have

(2) |

where , and where we have used , which follows from the boundedness of (see (A1)).

Furthermore, for the second term of the right side of (1), we have

(3) |

where . Note that is a bounded martingale difference sequence.

From (1), (2), and (3), we find

where we have used the fact that . This follows from the boundedness of , which in turn follows from the boundedness of for all . We note that determines the overall expected behavior of the system.

In the rest of the paper, the element of that corresponds to is denoted by . More precisely, . This notation also extends to .

An additional condition requires and , which can be proven using the Borel-Cantelli lemma if . Given these and our discussion above, the following theorem, which relates the discrete system describing to a continuous system, follows from the stochastic-approximation theorem [borkar2008stochastic, Theorem 2].

###### Theorem 3.

The vector of -mer frequencies converges almost surely to a compact connected internally chain transitive invariant set of the ODE

Note the dual use of the symbol in the theorem; the meaning
is however clear from the subscript. We recall that [borkar2008stochastic, Theorem 2] a set is an
*invariant* set of an ODE
if it is closed and for some
implies that for all . The invariant
set is *internally chain transitive *with respect to the
ODE , provided that
for every and positive reals and ,
there exist and a sequence with
, , and such
that for , if , then for some ,
is in the -neighborhood of .

## Iv Tandem Duplication with Substitution

In this section, we study the behavior of a system that allows tandem duplication and substitution mutations. First, we will determine the limits of the frequencies of -mers. Then, after presenting a theorem relating the limits to entropy, we find bounds on the entropy of these systems.

Let , so is the vector of all -mer occurrences, and is the vector of all -mer frequencies. From Section III we know that we can use the differential equation to determine the limit of -mer frequencies. To find the differential equation, in Theorem 4, we determine for with and , where it can be observed that (A.4) and (A.5) hold in our model

In the next subsection, we will give some necessary definitions. We will then prove that is a linear function of , which leads to a linear first-order differential equation. This linear form facilitates determining the asymptotic behavior of the -mer frequencies. We will then show that the entropy of stochastic string systems can be related to the capacity of semiconstrained systems defined by the limit set of the -mer frequencies. Leveraging the simple form of the limits for systems with tandem duplications and substitutions, we will provide bounds on the entropy of these systems.

### Iv-a Definitions

The following definitions will be useful for finding . For and , define to be a sequence of length whose th element is determined by whether the symbol in position of equals the symbol in position . More specifically, the th element of is

where

is a dummy variable. Only the positions of ‘0’s in

are of importance to us. Let the lengths of the maximal runs of ‘0’s immediately after the initial and at the end of be denoted by and , respectively. Note that either of or may be equal to 0. If , then . Otherwise, we have , for some that starts and ends with . For example, for , we may haveand and . Note that a duplication of length is equivalent to inserting 0s into . For example, may come from after a length 3 tandem duplication with the overlined substring as a template and can be viewed as the result of inserting three 0s into between the overlined two symbols.

To enable us to succinctly represent the results, we define several functions. These functions relate to the frequencies of other substrings that can generate via appropriate duplication events.

Consider the sequence , for which . This string can be created through duplications of length 2 from (in two ways) and from . These correspond to runs of of length in . For a string and positive integers and , let be the sequence obtained from by removing symbols in positions . For example, . Define

where the sum is over all that are the indices of the start of (not necessarily maximal) runs of s in , i.e., . For , . There is a slight abuse of notation in the definition of above (as well as the definitions of and below). While the argument of is , on the right side, for sequences with may appear. We note however that can be obtained from by summing over the elements of corresponding to strings that include as a prefix.

New occurrences of can also be generated from strings that are not of the form . For example, consider the sequence , for which . This sequence can be created through a length-3 tandem duplication from and , where the part that is to be duplicated is overlined. The following definitions will be of use in the analysis of this type of duplication.

In the special case where and , we will benefit from defining

We let if .

Lastly, we use to denote set of strings at Hamming distance 1 from . For example, for , . Also for , the indicator function equals 1 if and equals 0 otherwise.

### Iv-B Evolution of -mer Frequencies

We first find for (duplication) and then for (substitution).

When analyzing , we only consider substrings of length , which simplifies the derivation. The frequencies of shorter substrings can be found by summing over the frequencies of longer substrings.

###### Theorem 4.

For an integer and a string , if , then

and if ,

Before proving the theorem, we present two examples for and :

###### Proof:

Suppose a duplication of length occurs in , resulting in . The number of occurrences of may change due to the duplication event. To study this change, we consider the -substrings of that are eliminated (do not exist in ) and the -substrings of that are new (do not exist in ). Any new -substring must intersect with both the template and the copy in . Likewise, an eliminated -substring must include symbols on both sides of the template in , i.e., the template must be a strict substring of the -substring that includes neither its leftmost symbol nor its rightmost symbol.

As an example, suppose

(4) |

where , the (new) copy is underlined and the template is overlined, and . Let , the new -substrings are , , , and the eliminated substring is . Note that here the two substrings are counted as different. Formally, let

where the substring is duplicated. The new -substrings created in are

for . Note that we have defined such that the first element of the copy, , is at position in . The -substrings eliminated from are , for .

For a given , let

denote the indicator random variable for the event that

, that is, the duplication creates a new occurrence of in in which the first symbol of the copy is in position . In example denoted by (4), if , then and thus .Furthermore, let denote the number of occurrences of that are eliminated. We have

where the second equality follows from the fact that each of the eliminated -substrings are equal to with probability .

To find , it suffices to find , or equivalently, . We consider different cases based on the value of , which determines how overlaps with the template and the copy. These cases are illustrated in Figure 1 and are considered in Lemmas 5–7.

Similarly, using Lemma 7, it can be shown that

(6) |

To complete the proof, we need to show that summed over the range reduces to or as appropriate.

From Lemma 6, if , then

(7) |

and if , also

(8) |

Finally, if , from the same lemma, we find

(9) |

where the last step follows from the definition of .

###### Lemma 5 (Case 1).

For ,

###### Proof:

For (regardless of whether or ), the new occurrences of always contain some (but not all) of the template and all of the new copy. This scenario is labeled as Case 1 in Figure 1.

Suppose . Since the copy and the template are identical, elements of that coincide with the same positions in these two substrings must also be identical. So a necessary condition for is

Assume this condition is satisfied. Then if and only if the sequence starting at the beginning of the template in is equal to , which has probability . As an example for , consider

where , is overlined, and the copy is underlined. Note that contains . For , consider

∎

###### Lemma 6 (Case 2).

Suppose . If , then

and if , then