Tandem duplication is a biological process that creates consecutive copies of a segment of a genome during DNA replication. Representing genomes as strings, this event transforms a string $AXB$ into another string $AXXB$. This process is known to occur either at small scale, at the nucleotide level, or at large scale, at the genome level [4, 5, 6, 17, 24]. For instance, it is known that Huntington's disease is associated with the duplication of the 3 nucleotides CAG , whereas at the genome level, tandem duplications are known to involve multiple genes during cancer progression . Furthermore, gene duplication is believed to be the main driving force behind evolution, and the majority of duplications affecting organisms are believed to be of the tandem type (see e.g. ).
For these reasons, tandem duplications have received significant attention in the last decades, both in practice and theory. The combinatorial aspects of tandem duplications have been studied extensively by computational biologists [16, 9, 11, 26] and, in parallel, by various formal language communities [7, 27, 18]. From the latter perspective, a natural question arises: given a string $S$, what is the language that can be obtained starting from $S$ and applying (any number of) tandem duplications, i.e. rules of the form $AXB \to AXXB$? This question was first asked in 1984 in the context of so-called copying systems . Combined with results from , it was shown that this language is regular if $S$ is on a binary alphabet, but not regular for larger alphabets. These results were rediscovered 15 years later in [7, 27]. In , it was shown that given an unbounded duplication language (‘unbounded’ means that the size of the duplicated string is not necessarily bounded by any constant), the membership, inclusion and regularity testing problems can all be decided in linear time; the same holds for equivalence testing between two such languages. In [18, 19, 14], similar problems are also considered when the duplication size is bounded. More recently in [13, 15], the authors study the expressive power of tandem duplications, a notion based on the subsequences that can be obtained from a copy operation.
In this work, we are interested in a question posed in  (pp. 306, Open Problem 3) by Leupold et al., who raised the problem of computing the minimum number of tandem duplications needed to transform a string $S$ into another string $T$. We call this the Tandem Duplication (TD) distance problem. The TD distance is one of the many ways of comparing two genomes represented as strings in computational biology; other notable examples include the breakpoint  and transpositions distances, the latter having recently been shown NP-hard in a celebrated paper of Bulteau et al. . The TD distance has itself received special attention recently, owing to its role in cancer evolution .
Our results. In this paper, we solve the problem posed by Leupold et al. in 2004 and show that computing the TD distance from a string $S$ to a string $T$ is NP-hard. We show that this result holds even if $S$ is exemplar, i.e. if each character of $S$ is distinct. Exemplar strings are commonly studied in computational biology , since they represent genomes that existed prior to duplication events. We note that simply deciding if $S$ can be transformed into $T$ by a sequence of TDs still has unknown complexity. In our case, we show that the hardness of minimizing TDs holds on instances in which such a sequence is guaranteed to exist.
As demonstrated by the transpositions distance in , obtaining NP-hardness results for string distances can sometimes be an involved task. Our hardness reduction is also quite technical, and one of the tools we develop for it is a new problem we call the Cost-Effective Subgraph. In this problem, we are given a graph $G$ with a cost $c$, and we must choose a subset $X$ of $V(G)$. Each edge with both endpoints in $X$ has a cost of $|X|$, every other edge costs $c$, and the goal is to find a subset $X$ of minimum cost. We show that this problem is W[1]-hard for parameter $p + c$, where $p$ is the cost that we can save below the upper bound $c|E(G)|$. (In other words, if we were to state the maximization version of the Cost-Effective Subgraph problem, $p$ would be the value to maximize. The minimization version, however, is more convenient to use for our needs.) The problem enforces optimizing the tradeoff between covering many edges versus having a large subset of high cost, which might be applicable to other problems. In our case it captures the main difficulty in computing TD distances. We then obtain some positive results by showing that if $S$ is exemplar, then one can decide if $S$ can be transformed into $T$ using at most $k$ duplications in time $2^{O(k^2)} + poly(n)$. The result is obtained through an exponential size kernel. Finally, we conclude with several open problems that might be of interest to the theoretical computer science community.
This paper is organized as follows. In Section 2, we give basic definitions. In Section 3, we show that computing the TD distance is NP-hard through the Cost-Effective Subgraph problem. In Section 4, we show that computing the exemplar TD distance is FPT. In Section 5, we conclude the paper with several open problems.
2 Preliminary notions
We borrow the string terminology and notation from . Unless stated otherwise, all the strings in the paper are on an alphabet denoted $\Sigma$. For a string $S$, we write $\Sigma(S)$ for the subset of characters of $\Sigma$ that have at least one occurrence in $S$. A string $S$ is called exemplar if $|S| = |\Sigma(S)|$, i.e. each character present in $S$ occurs only once. A substring of $S$ is a contiguous sequence of characters within $S$. A prefix (resp. suffix) is a substring that occurs at the beginning (resp. end) of $S$. A subsequence of $S$ is a string that can be obtained by successively deleting characters from $S$.
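A tandem duplication (TD) is an operation on a string $S$ that copies a substring $X$ of $S$ and inserts the copy after the occurrence of $X$ in $S$. In other words, a TD transforms $S = AXB$ into $AXXB$. Given another string $T$, we write $S \Rightarrow T$ if there exist strings $A, B, X$ such that $S = AXB$ and $T = AXXB$. More generally, we write $S \Rightarrow_k T$ if there exist $S_1, \ldots, S_{k-1}$ such that $S \Rightarrow S_1 \Rightarrow \ldots \Rightarrow S_{k-1} \Rightarrow T$. We also write $S \Rightarrow_* T$ if there exists some $k$ such that $S \Rightarrow_k T$.
The TD distance $dist_{TD}(S, T)$ between two strings $S$ and $T$ is the minimum value of $k$ satisfying $S \Rightarrow_k T$. If $S \Rightarrow_* T$ does not hold, then $dist_{TD}(S, T) = \infty$.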
A square string is a string of the form $XX$, i.e. a concatenation of two identical substrings. Given a string $S$, a contraction is the reverse of a tandem duplication. That is, it takes a square string $XX$ contained in $S$ and deletes one of the two copies of $X$. We write $T \rightarrowtail S$ if there exist strings $A, B, X$ such that $T = AXXB$ and $S = AXB$. We also define $T \rightarrowtail_k S$ and $T \rightarrowtail_* S$ for contractions analogously as for TDs (note that $S \Rightarrow_k T$ if and only if $T \rightarrowtail_k S$, and $S \Rightarrow_* T$ if and only if $T \rightarrowtail_* S$). When there is no possible confusion, we will sometimes write instead .
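The two operations defined above can be illustrated with a short sketch. The Python below is only an illustration: the helper names `tandem_duplicate` and `contractions`, and the choice of 0-based half-open indices `[i, j)` to select the duplicated substring, are assumptions made for this example and do not come from the text.

```python
def tandem_duplicate(s, i, j):
    """Apply one TD to s: with A = s[:i], X = s[i:j], B = s[j:], return A X X B.

    The (i, j) index convention is an assumption of this sketch.
    """
    if not (0 <= i < j <= len(s)):
        raise ValueError("need 0 <= i < j <= len(s)")
    return s[:j] + s[i:j] + s[j:]


def contractions(s):
    """Return the set of strings obtained from s by a single contraction,
    i.e. by deleting one copy of X in some occurrence of a square XX."""
    results = set()
    n = len(s)
    for i in range(n):
        # Try every square XX starting at position i, with |X| = length.
        for length in range(1, (n - i) // 2 + 1):
            if s[i:i + length] == s[i + length:i + 2 * length]:
                results.add(s[:i + length] + s[i + 2 * length:])
    return results
```

For instance, duplicating `bc` in `abc` gives `abcbc`, and the only contraction available on `abcbc` gives back `abc`, in line with the remark that a contraction is the reverse of a tandem duplication.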
We have the following problem.
The $k$-Tandem Duplication ($k$-TD) problem:
Input: two strings $S$ and $T$ over the same alphabet $\Sigma$ and an integer $k$.
Question: is $dist_{TD}(S, T) \le k$?
In the Exemplar-$k$-TD variant of this problem, $S$ is required to be exemplar. In either variant, we may call $S$ the source string and $T$ the target string. We will often use the fact that $S$ and $T$ form a YES instance if and only if $T$ can be transformed into $S$ by a sequence of at most $k$ contractions. See Fig. 1 for a simple example.
We recall that although we study the minimization problem here, it is unknown whether the question $S \Rightarrow_* T$ can be decided in polynomial time. Nonetheless, our NP-hardness reduction applies to ‘promise’ instances in which $S \Rightarrow_* T$ always holds.
3 NP-hardness of Exemplar-$k$-TD
To facilitate the presentation of our hardness proof, we first make an intermediate reduction using the Cost-Effective Subgraph problem, which we will then reduce to the promise version of the Exemplar-$k$-TD problem.
The Cost-Effective Subgraph problem
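Suppose we are given a graph $G$ and an integer cost $c$. For a subset $X \subseteq V(G)$, let $E(X)$ denote the edges inside of $X$, i.e. the edges of $G$ with both endpoints in $X$. The cost of $X$ is defined as
$$cost(X) = c \cdot \left(|E(G)| - |E(X)|\right) + |X| \cdot |E(X)|.$$
The Cost-Effective Subgraph problem asks for a subset $X \subseteq V(G)$ of minimum cost. In the decision version of the problem, we are given an integer $r$ and we want to know if there is a subset $X$ whose cost is at most $r$. Observe that $X = \emptyset$ or $X = V(G)$ are possible solutions.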
The idea is that each edge “outside” of $X$ costs $c$ and each edge “inside” costs $|X|$. Therefore, we pay $c$ for each edge not included in $X$, but if $X$ gets too large, we pay more for edges in $X$. We must therefore find a balance between the size of $X$ and its number of edges. The connection with $k$-TD can be roughly described as follows: in our reduction, we will have many substrings which need to be deleted through contractions. We will have to choose an initial set $X$ of contractions and then, each substring will have two ways to be contracted: one of cost $c$, and the other of cost $|X|$.
An obvious solution for a Cost-Effective Subgraph instance is to take $X = \emptyset$, which is of cost $c|E(G)|$. Another formulation of the problem could be whether there is a subset $X$ of cost at most $c|E(G)| - p$, where $p$ can be seen as a “profit” to maximize. Treating $c$ as a parameter, we show the NP-hardness and W[1]-hardness in parameters $p + c$ of the Cost-Effective Subgraph problem (we do not study the parameter $r$). Our reduction to $k$-TD does not preserve W[1]-hardness and we only use the NP-hardness in this paper, but the W[1]-hardness might be of independent interest.
Before proceeding, we briefly argue the relevance of parameter $c$ in the W[1]-hardness. If $c$ is a fixed constant, then we may assume that any solution $X$ satisfies $|X| \le c$. This is because if $|X| > c$, every edge included in $X$ will cost more than $c$ and putting $X = \emptyset$ yields a cost that is at least as low. Thus for fixed $c$, it suffices to brute-force every subset of size at most $c$ and we get an $n^{O(c)}$ time algorithm. Our W[1]-hardness shows that it is difficult to remove this exponential dependence between $n$ and $c$.
The Cost-Effective Subgraph problem is NP-hard and W[1]-hard for parameter $p + c$.
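The brute-force argument above can be sketched as follows. This is a minimal illustration, assuming the cost function $cost(X) = c(|E(G)| - |E(X)|) + |X| \cdot |E(X)|$ described in this section; the function names `cost` and `min_cost_subset` and the edge-list representation of the graph are hypothetical choices made for the example.

```python
from itertools import combinations


def cost(edges, c, X):
    """Cost of vertex subset X: edges inside X cost |X| each, all others cost c."""
    X = set(X)
    inside = sum(1 for u, v in edges if u in X and v in X)
    return c * (len(edges) - inside) + len(X) * inside


def min_cost_subset(vertices, edges, c):
    """Brute force over subsets of size at most c.

    Larger subsets are never cheaper than the empty set, since every edge
    inside them costs more than c.
    """
    vertices = list(vertices)
    best_cost, best_set = c * len(edges), frozenset()  # X = empty set
    for size in range(1, min(c, len(vertices)) + 1):
        for X in combinations(vertices, size):
            value = cost(edges, c, X)
            if value < best_cost:
                best_cost, best_set = value, frozenset(X)
    return best_cost, best_set
```

On the complete graph on 4 vertices with $c = 4$, the empty set and the full vertex set both cost 24, while any 3 vertices cost $4 \cdot 3 + 3 \cdot 3 = 21$, which shows the tradeoff between covering edges and keeping the subset small.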
We reduce from CLIQUE, a classic NP-hard problem where we are given a graph $G$ and an integer $k$ and must decide whether $G$ contains a clique of size at least $k$. The problem is also W[1]-hard in parameter $k$. We will assume that $k$ is even (which does not alter either hardness result).
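Let $(G, k)$ be a CLIQUE instance, letting $n = |V(G)|$ and $m = |E(G)|$. The graph in our Cost-Effective Subgraph instance is also $G$. We set the cost $c = 3k/2$, which is an integer since $k$ is even, and put
$$r = c\left(m - \binom{k}{2}\right) + k\binom{k}{2}.$$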
We ask whether $G$ admits a subset $X$ satisfying $cost(X) \le r$. We show that $(G, k)$ is a YES instance to CLIQUE if and only if $G$ contains a set $X$ of cost at most $r$. This will prove both NP-hardness and W[1]-hardness in $p + c$ (noting that here $p = cm - r = (c - k)\binom{k}{2}$, so that $p + c$ is bounded by a function of $k$).
The forward direction is easy to see. If $(G, k)$ is a YES instance, it has a clique $X$ of size (exactly) $k$. Since $|E(X)| = \binom{k}{2}$, the cost of $X$ is precisely $r$.
Let us consider the converse direction. Assume that $(G, k)$ is a NO instance of CLIQUE. Let $X \subseteq V(G)$ be any subset of vertices. We will show that $cost(X) > r$. There are 3 cases to consider depending on $|X|$.
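Case 1: $|X| = k$. Since $(G, k)$ is a NO instance, $X$ is not a clique and thus $|E(X)| = \binom{k}{2} - h$, where $h > 0$. We have that $cost(X) = c\left(m - \binom{k}{2} + h\right) + k\left(\binom{k}{2} - h\right) = r + h(c - k)$. Since $c > k$ and $h > 0$, the cost of $X$ is strictly greater than $r$.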
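Case 2: $|X| = k + l$ for some $l > 0$. Denote $|E(X)| = \binom{k}{2} + h$, where $h \le kl + \binom{l}{2}$ (actually, $h < kl + \binom{l}{2}$, but we do not bother). The cost of $X$ is
$$cost(X) = c\left(m - \binom{k}{2} - h\right) + (k + l)\left(\binom{k}{2} + h\right).$$
Consider the difference
$$cost(X) - r = l\binom{k}{2} + h(k + l - c).$$
If $l \ge c - k$, then the difference is clearly above $0$ regardless of $h$, and then $cost(X) > r$ as desired. Thus we may assume that $l < c - k$. In this case, we may assume that $h = kl + \binom{l}{2}$, as this minimizes $cost(X)$. But in this case, using $c = 3k/2$, we get $cost(X) - r = l\left(\frac{k(3l - 1)}{4} + \frac{l(l - 1)}{2}\right) > 0$.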
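Case 3: $|X| = k - l$, with $0 < l \le k$. If $l = k$, then $X = \emptyset$ and $cost(X) = cm > r$. So we assume $l < k$. Put $|E(X)| = \binom{k}{2} - h$, where $h \ge \binom{k}{2} - \binom{k - l}{2} = kl - \binom{l + 1}{2}$. We have
$$cost(X) = c\left(m - \binom{k}{2} + h\right) + (k - l)\left(\binom{k}{2} - h\right).$$
The difference with this cost and $r$ is
$$cost(X) - r = h(c - k + l) - l\binom{k}{2} \ge \left(kl - \binom{l + 1}{2}\right)\left(\frac{k}{2} + l\right) - l\binom{k}{2} = \frac{l}{2}\left(l\left(\frac{3k}{2} - l - 1\right) + \frac{k}{2}\right) > 0,$$
where the middle step uses $c = 3k/2$, the latter since $l < k$. Again, it follows that $cost(X) > r$.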
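The equivalence established by the three cases can be sanity-checked by exhaustive search on very small graphs. The sketch below is not part of the proof. It assumes the cost $c = 3k/2$ and the threshold $r = c\left(m - \binom{k}{2}\right) + k\binom{k}{2}$, both reconstructed from the surrounding argument, and all function names are hypothetical.

```python
from itertools import combinations
from math import comb


def cost(edges, c, X):
    """Edges inside X cost |X| each, every other edge costs c."""
    X = set(X)
    inside = sum(1 for u, v in edges if u in X and v in X)
    return c * (len(edges) - inside) + len(X) * inside


def has_clique(vertices, edges, k):
    """Exhaustively test whether some k vertices are pairwise adjacent."""
    edge_set = {frozenset(e) for e in edges}
    return any(
        all(frozenset(pair) in edge_set for pair in combinations(X, 2))
        for X in combinations(list(vertices), k)
    )


def reduction_agrees(vertices, edges, k):
    """Check: (G, k) is a YES instance of CLIQUE iff some subset costs <= r."""
    vertices = list(vertices)
    c = 3 * k // 2  # assumed cost, requires k even
    m = len(edges)
    r = c * (m - comb(k, 2)) + k * comb(k, 2)  # assumed threshold
    best = min(
        cost(edges, c, X)
        for size in range(len(vertices) + 1)
        for X in combinations(vertices, size)
    )
    return has_clique(vertices, edges, k) == (best <= r)
```

For example, with $k = 4$ on the complete graph on 4 vertices we get $r = 24$, which is attained by taking all vertices. After removing one edge, $r = 18$ while the cheapest subset (all four vertices, with 5 inside edges) costs 20, in agreement with the converse direction.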
Reduction to Exemplar-$k$-TD
Since the reduction is somewhat technical, we provide an overview of the techniques that we will use. Let be a Cost-Effective Subgraph instance where is the cost and the optimization value, and with vertices . We will construct strings and and argue on the number of contractions to go from to . We would like our source string to be , where each is a distinct character that corresponds to vertex . Let be obtained by doubling every , i.e. . Our goal is to put , where each is a substring gadget corresponding to edge that we must remove to go from to . In a contraction sequence from to , we make it so that we first want to contract some, but not necessarily all, of the doubled ’s of , resulting in another string . Let be the number of ’s contracted from to . For instance, we could have , where only and were contracted, and thus . The idea is that these contracted ’s correspond to the vertices of a cost-effective subgraph. After is transformed to , we then force each to use to contract it. For , a contraction sequence that we would like to enforce would take the form
where we underline the substring affected by contractions at each step. We make it so that when contracting into , we have two options. Suppose that are the endpoints of edge . If, in , we had chosen to contract and , we can contract using a sequence of moves. Otherwise, we must contract using another more costly sequence of moves. The total cost to eliminate the gadgets will be , where is the number of edges that can be contracted using the first choice, i.e. for which both endpoints were chosen in .
Unfortunately, constructing and the ’s to implement the above idea is not straightforward. The main difficulty lies in forcing an optimal solution to behave as we describe, i.e. enforcing going from to first, enforcing the ’s to use , and enforcing the two options to contract with the desired costs. In particular, we must replace the ’s by carefully constructed substrings . We must also repeat the sequence of ’s a certain number of times. We now proceed with the technical details.
The Exemplar-$k$-TD problem is NP-complete.
To see that the problem is in NP, note that $dist_{TD}(S, T) \le |T|$ since each contraction from $T$ to $S$ removes a character. Thus a sequence of contractions can serve as a certificate, has polynomial size and is easy to verify.
For hardness, we reduce from the Cost-Effective Subgraph problem. Let $(G, c, r)$ be an instance of Cost-Effective Subgraph, letting $n = |V(G)|$ and $m = |E(G)|$. Here $c$ is the “outsider edge” cost and we ask whether there is a subset $X \subseteq V(G)$ such that $cost(X) \le r$. We denote $V(G) = \{v_1, \ldots, v_n\}$ and $E(G) = \{e_1, \ldots, e_m\}$. The ordering of vertices and edges is arbitrary but remains fixed for the remainder of the proof. For convenience, we allow the edge indices to loop through $1$ to $m$, and so we put $e_i = e_{i + lm}$ for any integer $l$. Thus we may sometimes refer to an edge $e_{i + lm}$ with an index $i + lm > m$, meaning that $e_{i + lm}$ is actually the edge $e_i$.
The construction. Let us first make an observation. If we take an exemplar string $X = x_1 x_2 \ldots x_l$ (i.e. a string in which no character occurs twice), we can double its characters and obtain a string $X' = x_1 x_1 x_2 x_2 \ldots x_l x_l$. The length of $X'$ is only twice that of $X$ and $dist_{TD}(X, X') = l$, i.e. going from $X'$ to $X$ requires $l$ contractions. We will sometimes describe pairs of strings $X$ and $X'$ at distance $l$ without explicitly describing $X$ and $X'$, but the reader can assume that $X$ starts as an exemplar string and we obtain $X'$ by doubling it.
Now we show how to construct $S$ and $T$. First let be large (but polynomial) integers. We choose to be a multiple of . For concreteness, we put and , but it is enough to think of these values as simply “large enough”. Instead of doubling ’s as in the intuition paragraph above, we will duplicate some characters times. Moreover, we cannot create a string that behaves exactly as described above, but we will show that we can append copies of carefully crafted substrings to obtain the desired result. We need and to be high enough so that “enough” copies behave as we desire.
For each , define an exemplar string of length . Moreover, create enough characters so that no two strings contain a character in common. Let be a string satisfying .
Then for each , define an exemplar string . Ensure that no contains a character from an string, and no two ’s contain a common character. The strings can consist of a single character, with the exception of and which are special. We assume that for and , we have strings and such that
The ’s are the building blocks of larger strings. For each , define
These strings are used as “blockers” and prevent certain contractions from happening. Also define the strings
and for edge with whose endpoints are and , define
Thus in , all substrings are turned into , except and .
Finally, define a new additional character , which will be used to separate some of the components of our string. We can now define and . We have
It follows from the definitions of and that is exemplar. Now for , define
which we will call the edge gadget. Define as
(we add brackets for clarity — they are not actual characters of ). The idea is that starts with , a modified in which becomes and the substrings are turned into . This substring serves as a choice of vertices in our cost-effective subgraph. Each edge has a “gadget substring” . Since is a multiple of , the sequence of edge gadgets is repeated times. Our goal to go from to is to get rid of all these edge gadgets by contractions. Note that because a gadget starts with and the gadget starts with , the substring has a character that the substring does not have.
The hardness proof. We now show that $G$ has a subgraph of cost at most $r$ if and only if $T$ can be contracted to $S$ using at most moves. We include the forward direction, which is the most instructive, in the main text. The other direction can be found in the Appendix. Although we shall not dig into details here, it can be deduced from the ($\Rightarrow$) direction that $S \Rightarrow_* T$ holds.
() Suppose that has a subgraph of cost at most . Thus . To go from to , first consider an edge that does not have both endpoints in . We show how to get rid of the gadget substring for using contractions. Note that contains the substring , where brackets surround the occurrence that we want to remove. We can first contract to using contractions, then contract to using contractions. The result is the substring, which becomes using two contractions (see below). This sums to moves. More visually, the sequence of contractions works as follows (as usual brackets indicate the substring and what remains of it)
This sequence of contractions effectively removes the substring gadget. Observe that after applying this sequence, it is still true that every remaining gadget substring is preceded by . We may therefore repeatedly apply this contraction sequence to every not contained in (including those gadgets for which ). This procedure is thus applied to gadgets. We assume that we have done so, and that every for which the gadget substring remains is in . Call the resulting string .
Now, let be the substring obtained from by contracting, for each , the string to . We assume that we have contracted the substring of to , which uses contractions (note that there is only one occurrence of in , namely right before the first ). Call the resulting string. At this point, for every substring gadget that remains, where corresponds to edge , contains the substrings and (instead of and ).
Let be the smallest integer for which the substring gadget is still in . This is the leftmost edge gadget still in , meaning that has the prefix
where brackets indicate the substring. To remove , first contract to , and contract to (this is possible since ). The result is . One more contraction gets rid of the second half. This requires contractions. This procedure is applied to gadgets. To recap, the contraction sequence for does as follows:
After we repeat this for every , all that remains is the string . We contract to using contractions (in total, going from to required moves). Then contract and to using contractions. One more contraction of the second half of the string yields . The summary of the number of contractions made is
($\Leftarrow$): this direction of the proof is somewhat involved and we redirect the interested reader to the Appendix. The idea is to show that a minimum contraction sequence must have a form similar to that in the ($\Rightarrow$) direction. The challenging part is to show that each substring must get removed separately in this sequence, and that “most” of them incur a cost of either or for some (this “most” is the reason that we need a large ).
4 An FPT algorithm for the exemplar problem
In this section, we will show that Exemplar-$k$-TD can be solved in time $2^{O(k^2)} + poly(n)$ by obtaining a kernel of size $O(k 2^k)$ (here $n$ is the length of $T$).
We first note that there is a very simple, brute-force algorithm to solve $k$-TD (including Exemplar-$k$-TD as a particular case). This only establishes membership in the XP class, but it will be useful to evaluate the complexity of our kernelization later on.
The $k$-TD problem can be solved in time $O(n^{2k})$, where $n$ is the size of the target string.
Let $(S, T, k)$ be a given instance of $k$-TD. Consider the branching algorithm that, starting from $T$, tries to contract every substring of the form $XX$ in $T$ and recurses on each resulting string, decrementing $k$ by $1$ each time (the branching stops when $S$ is obtained or when $k$ reaches $0$ without attaining $S$). We obtain a search tree of depth at most $k$ and degree at most $n^2$, and thus it has $O(n^{2k})$ nodes. Visiting the internal nodes of this search tree only requires enumerating $O(n^2)$ substrings, which form the set of children of the node. Hence, there is no added computation cost to consider when visiting a node.
From now on, we assume that we have an Exemplar-$k$-TD instance $(S, T, k)$, and so that $S$ is exemplar.
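The branching algorithm just described can be sketched as follows. This is a minimal illustration rather than an optimized implementation: the function name `can_contract` is hypothetical, and the early exit on string lengths is a small pruning added for this sketch (a contraction always shortens the string, so a string no longer than the source and different from it can never reach it).

```python
def can_contract(t, s, k):
    """Return True iff t can be turned into s using at most k contractions,
    i.e. iff s can be turned into t using at most k tandem duplications."""
    if t == s:
        return True
    # Out of budget, or t is already too short to ever reach s.
    if k == 0 or len(t) <= len(s):
        return False
    n = len(t)
    for i in range(n):
        # Enumerate every square XX starting at position i.
        for length in range(1, (n - i) // 2 + 1):
            if t[i:i + length] == t[i + length:i + 2 * length]:
                child = t[:i + length] + t[i + 2 * length:]
                if can_contract(child, s, k - 1):
                    return True
    return False
```

For example, `can_contract("aabb", "ab", 1)` is false while `can_contract("aabb", "ab", 2)` is true, since the squares `aa` and `bb` must be contracted separately.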
Let $a$ and $b$ be two consecutive characters in $S$ (i.e. $ab$ is a substring of $S$). We say that $(a, b)$ is $(S, T)$-stable if, in $T$, every occurrence of $a$ is followed by $b$ and every occurrence of $b$ is preceded by $a$. An $(S, T)$-stable substring $X = x_1 \ldots x_l$, where $l \ge 2$, is a substring of $S$ such that $(x_i, x_{i+1})$ is $(S, T)$-stable for every $1 \le i \le l - 1$. We also define a string with a single character $a$ to be an $(S, T)$-stable substring (provided $a$ appears in $S$ and $T$). If any substring of $S$ that strictly contains $X$ is not an $(S, T)$-stable substring, then $X$ is called a maximal $(S, T)$-stable substring. Note that these definitions do not depend on the particular strings $S$ and $T$, and so the same definitions apply for $(X, Y)$-stability, for any strings $X$ and $Y$.
We will show that every maximal $(S, T)$-stable substring can be replaced by a single character, and that if $T$ can be obtained from $S$ using at most $k$ tandem duplications, then this leaves strings of bounded size.
We first show that, roughly speaking, stability is maintained by all tandem duplications when going from $S$ to $T$.
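The stability definitions above can be made concrete with a short sketch. It assumes that the source string is exemplar and non-empty and that every character of the source occurs in the target; the function names `pair_is_stable` and `maximal_stable_substrings` are hypothetical and chosen only for this illustration.

```python
def pair_is_stable(t, a, b):
    """Check that in t every occurrence of a is immediately followed by b
    and every occurrence of b is immediately preceded by a."""
    for i, ch in enumerate(t):
        if ch == a and (i + 1 >= len(t) or t[i + 1] != b):
            return False
        if ch == b and (i == 0 or t[i - 1] != a):
            return False
    return True


def maximal_stable_substrings(s, t):
    """Split the exemplar string s into its maximal stable substrings
    with respect to t, listed from left to right."""
    blocks, current = [], s[0]
    for a, b in zip(s, s[1:]):
        if pair_is_stable(t, a, b):
            current += b  # extend the current stable substring
        else:
            blocks.append(current)
            current = b  # start a new stable substring
    blocks.append(current)
    return blocks
```

With source `abcd` and target `abcbcd`, only the pair `(b, c)` is stable, so the maximal stable substrings are `a`, `bc` and `d`. With source `abc` and target `abcabc`, the whole string `abc` is a single maximal stable substring.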
Suppose that $S \Rightarrow_* T$ and let $X$ be an $(S, T)$-stable substring. Let $S = S_0, S_1, \ldots, S_k = T$ be any minimum sequence of strings transforming $S$ to $T$ by tandem duplications. Then $X$ is $(S, S_i)$-stable for every $0 \le i \le k$.
Assume the lemma is false, and let $S_i$ be the first of $S_0, \ldots, S_k$ that does not verify the statement. Then there are two characters $a, b$ belonging to $X$ such that $(a, b)$ is $(S, S_{i-1})$-stable, but $(a, b)$ is not $(S, S_i)$-stable.
We claim that, under our assumption, $(a, b)$ is not $(S, S_j)$-stable for any $j \ge i$. As this includes $S_k = T$, this will contradict that $X$ is $(S, T)$-stable. We do this by induction: as a base case, $(a, b)$ is not $(S, S_i)$-stable so this is true for $j = i$. Assume that $(a, b)$ is not $(S, S_{j-1})$-stable, where $j - 1 \ge i$. Let $D$ be the duplication transforming $S_{j-1}$ to $S_j$ (here $D$ contains the start and end positions of the substring of $S_{j-1}$ to duplicate).
Suppose first that $(a, b)$ is not $(S, S_{j-1})$-stable because $S_{j-1}$ has an occurrence of $a$ that is not followed by $b$. Thus $S_{j-1}$ has an occurrence of $a$, say at position $p$, followed by $c \neq b$. If we assume that $(a, b)$ is $(S, S_j)$-stable, then a character $b$ must have appeared after this $a$ from $S_{j-1}$ to $S_j$. Changing the character next to this $a$ is only possible if the last character duplicated by $D$ is the $a$ at position $p$ and the first character of $D$ is a $b$. In other words, denoting $S_{j-1} = U\,bVa\,cW$ for appropriate substrings $U, V, W$, the duplication must do the following
$$U\,bVa\,cW \Rightarrow U\,bVa\,bVa\,cW.$$
But then, there is still an occurrence of $a$ followed by $c$, and it follows that $(a, b)$ cannot be $(S, S_j)$-stable.
So suppose instead that $(a, b)$ is not $(S, S_{j-1})$-stable because $S_{j-1}$ has an occurrence of $b$ preceded by $c \neq a$. If $(a, b)$ were $(S, S_j)$-stable, the character preceding this $b$ would have changed in $S_j$. But one can verify that this is impossible. For completeness, we present each possible case: either $D$ includes both $c$ and $b$, includes one of them or none. These cases are represented below, and each one of them leads to an occurrence of $b$ still preceded by $c$ (the left-hand side represents $S_{j-1}$ and the right-hand side represents $S_j$):