Genomic Problems Involving Copy Number Profiles: Complexity and Algorithms

02/12/2020 ∙ by Manuel Lafond, et al. ∙ Montana State University, Université de Sherbrooke

Recently, due to genomic sequence analysis in several types of cancer, genomic data based on copy number profiles (CNPs for short) have become increasingly popular. A CNP is a vector where each component is a non-negative integer representing the number of copies of a specific gene or segment of interest. In this paper, we present two streams of results. The first consists of negative results on two open problems regarding the computational complexity of the Minimum Copy Number Generation (MCNG) problem posed by Qingge et al. in 2018. Qingge et al. showed that the problem is NP-hard if the duplications are tandem, and left open the question of whether the problem remains NP-hard if arbitrary duplications are used. We answer this question affirmatively; in fact, we prove that it is NP-hard to even obtain a constant factor approximation. We also prove that the parameterized version is W[1]-hard, answering another open question by Qingge et al. The other result is positive and is based on a new (and more general) problem regarding CNPs. The Copy Number Profile Conforming (CNPC) problem is formally defined as follows: given two CNPs C_1 and C_2, compute two strings S_1 and S_2 with cnp(S_1)=C_1 and cnp(S_2)=C_2 such that the distance between S_1 and S_2, d(S_1,S_2), is minimized. Here, d(S_1,S_2) is a very general term: it could be any genome rearrangement distance (such as reversal, transposition, or tandem duplication). We make the first step by showing that if d(S_1,S_2) is measured by the breakpoint distance, then the problem is polynomially solvable.


1 Introduction

In cancer genomics research, intra-tumor genetic heterogeneity is one of the central problems [9, 10, 14]. Understanding the origins of cancer cell diversity could help cancer prognostics [3, 8] and also help explain drug resistance [2, 4]. It is known for some types of cancers, such as high-grade serous ovarian cancer (HGSOC), that heterogeneity is mainly acquired through endoreduplications and genome rearrangements. These result in aberrant copy number profiles (CNPs) — nonnegative integer vectors representing the numbers of genes occurring in a genome [11]. To understand how the cancer progresses, an evolutionary tree is certainly desirable, and producing a valid evolutionary tree based on these genomic data becomes a new problem. In [13], Schwarz et al. proposed a way to construct a phylogenetic tree directly from integer copy number profiles, the underlying problem being to convert CNPs into one another using the minimum number of duplications/deletions [15].

In [12], another fundamental problem was proposed. The motivation is that in the early stages of cancer, when large numbers of endoreduplications are still rare, genome sequencing is still possible. However, in the later stage we might only be able to obtain genomic data in the form of CNPs. This leads to the problem of comparing a sequenced genome with a genome with only copy-number information.

Given a genome G represented as a string and a copy number profile c, the Minimum Copy Number Generation (MCNG) problem asks for the minimum number of deletions and duplications needed to transform G into any genome in which each character occurs as many times as specified by c. Qingge et al. proved that the problem is NP-hard when the duplications are restricted to be tandem and posed several open questions: (1) Is the problem NP-hard when the duplications are arbitrary? (2) Does the problem admit a decent approximation? (3) Is the problem fixed-parameter tractable (FPT)? In this paper, we answer all three of these open questions. We show that MCNG is NP-hard to approximate within any constant factor, and that it is W[1]-hard when parameterized by the solution size. The inapproximability follows from a new general-purpose lemma on set-cover reductions that require an exact cover in one direction, but not the other. The W[1]-hardness uses a new set-cover variant in which every optimal solution is an exact cover. These set-cover extensions can make reductions to other problems easier, and may be of independent interest.

We also consider a new fundamental problem called Copy Number Profile Conforming (CNPC), which is defined as follows. Given two CNPs C_1 and C_2, compute two strings/genomes S_1 and S_2 with cnp(S_1)=C_1 and cnp(S_2)=C_2 such that the distance between S_1 and S_2, d(S_1,S_2), is minimized. The distance could be general, which means it could be any genome rearrangement distance (such as reversal, transposition, or tandem duplication). We make the first step by showing that if d(S_1,S_2) is measured by the breakpoint distance, then the problem is polynomially solvable.

2 Preliminaries

A genome G is a string, i.e. a sequence of characters, all of which belong to some alphabet Σ (the characters of G can be interpreted as genes or segments). We use genome and string interchangeably in this paper, when the context is clear. A substring of G is a sequence of contiguous characters that occur in G, and a subsequence is a string that can be obtained from G by deleting some characters. We write G[p] to denote the character at position p of G (the first position being 1), and we write G[i..j] for the substring of G from positions i to j, inclusively. For s ∈ Σ, we write G − s to denote the subsequence of G obtained by removing all occurrences of s.

We represent an alphabet as an ordered list Σ = (s_1, s_2, ..., s_m) of distinct characters. Slightly abusing notation, we may write s ∈ Σ if s is a member of this list. We write n_s(G) to denote the number of occurrences of s ∈ Σ in a genome G. A Copy-Number Profile (or CNP) on Σ is a vector c = (c_1, ..., c_m) that associates each character s_i of the alphabet with a non-negative integer c_i. We may write c(s) to denote the number associated with s ∈ Σ in c. We write c − s to denote the CNP obtained from c by setting c(s) = 0.

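The Copy Number Profile (CNP) of a genome G over alphabet Σ = (s_1, ..., s_m), denoted cnp(G), is the vector of occurrences of all characters of Σ. Formally, cnp(G) = (n_{s_1}(G), n_{s_2}(G), ..., n_{s_m}(G)), where n_s(G) is the number of occurrences of s in G. (In the theory of formal languages, the CNP of a string is called the Parikh vector.)
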
For example, if and , then and .
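The definition of the CNP of a genome can be sketched in a few lines of Python. This is only an illustration: the function name `cnp`, the representation of the alphabet as an ordered string, and the sample genome are assumptions made for this sketch, not taken from the paper.

```python
from collections import Counter

def cnp(genome, alphabet):
    """Return the copy number profile of `genome` as a tuple.

    The i-th entry is the number of occurrences of the i-th
    character of `alphabet` (an ordered list of distinct characters).
    """
    counts = Counter(genome)
    unknown = set(counts) - set(alphabet)
    if unknown:
        raise ValueError(f"characters outside the alphabet: {sorted(unknown)}")
    return tuple(counts[s] for s in alphabet)

# Hypothetical toy genome over the alphabet (a, b, c).
profile = cnp("abbcbbcca", "abc")
print(profile)
```

Here the toy genome contains two a's, four b's and three c's, so its profile has one entry per alphabet character in that order.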

Deletions and duplications on strings

We now describe the two string events of deletion and duplication. Both are illustrated in Figure 1.

Figure 1: Three strings (or toy genomes), and . From to , a deletion is applied to . From to , a duplication is applied to , with the copy inserted after position 6.

Given a genome G of length n, a deletion on G takes a substring of G and removes it. Deletions are denoted by a pair (i, j) of the positions of the substring to remove. Applying deletion (i, j) to G transforms G into G[1..i−1]G[j+1..n].

A duplication on a genome G of length n takes a substring of G, copies it and inserts the copy anywhere in G, except inside the copied substring. A duplication is defined by a triple (i, j, p) where G[i..j] is the string to duplicate and p ∈ {0, 1, ..., i−1, j, ..., n} is the position after which we insert (inserting after 0 prepends the copied substring to G). Applying duplication (i, j, p) to G transforms G into G[1..p]G[i..j]G[p+1..n].

An event is either a deletion or a duplication. If G is a genome and e is an event, we write G⟨e⟩ to denote the genome obtained by applying e on G. Given a sequence E = (e_1, ..., e_k) of events, we define G⟨E⟩ = G⟨e_1⟩⟨e_2⟩...⟨e_k⟩ as the genome obtained by successively applying the events of E to G. We may also write G⟨e_1, ..., e_k⟩ instead of G⟨(e_1, ..., e_k)⟩.
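The two events can be sketched as string operations. This is a minimal sketch assuming genomes are Python strings and positions are 1-indexed and inclusive, as in the text; the function names `apply_deletion` and `apply_duplication` are hypothetical.

```python
def apply_deletion(genome, i, j):
    """Remove the substring at positions i..j (1-indexed, inclusive)."""
    n = len(genome)
    if not (1 <= i <= j <= n):
        raise ValueError("need 1 <= i <= j <= len(genome)")
    return genome[:i - 1] + genome[j:]

def apply_duplication(genome, i, j, p):
    """Copy the substring at positions i..j and insert the copy after position p.

    p = 0 prepends the copy. Positions i <= p < j are rejected because
    they would place the copy inside the copied substring.
    """
    n = len(genome)
    if not (1 <= i <= j <= n):
        raise ValueError("need 1 <= i <= j <= len(genome)")
    if not (0 <= p <= n) or (i <= p < j):
        raise ValueError("insertion point must lie outside the copied substring")
    return genome[:p] + genome[i - 1:j] + genome[p:]

# Toy usage: delete "bc", then tandem-duplicate "bc" in the original string.
shorter = apply_deletion("abcdef", 2, 3)
longer = apply_duplication("abcdef", 2, 3, 3)
```

Inserting at p = j yields a tandem duplication, while any other allowed p yields an arbitrary (non-tandem) duplication.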

The most natural application of the above events is to compare genomes.

Let G and G' be two strings over alphabet Σ. The Genome-to-Genome distance between G and G', denoted d_GG(G, G'), is the size of the smallest sequence of events that transforms G into G'.

We also define a distance between a genome G and a CNP c, which is the minimum number of events to apply to G to obtain a genome with CNP c.

Let G be a genome and c be a CNP, both over alphabet Σ. The Genome-to-CNP distance between G and c, denoted d_GCNP(G, c), is the size of the smallest sequence of events that transforms G into a genome whose CNP is c.

The above definition leads to the following problem, which was first studied in [12].

The Minimum Copy Number Generation (MCNG) problem:
Instance: a genome G and a CNP c over alphabet Σ.
Task: compute the Genome-to-CNP distance d_GCNP(G, c).
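To make the problem statement concrete, the following sketch computes the Genome-to-CNP distance by breadth-first search over all genomes reachable by deletions and duplications. It is exponential and only usable on toy inputs; the hardness results of this paper suggest no efficient general method should be expected. The names `neighbors` and `genome_to_cnp_distance`, the dictionary encoding of the CNP, and the cut-off `max_events` are assumptions of this sketch.

```python
from collections import Counter, deque

def neighbors(genome):
    """Yield every genome obtained from `genome` by one deletion or duplication."""
    n = len(genome)
    for i in range(n):               # 0-indexed start of the substring
        for j in range(i, n):        # 0-indexed end of the substring
            yield genome[:i] + genome[j + 1:]          # deletion
            block = genome[i:j + 1]
            for p in range(n + 1):   # p = number of characters before the copy
                if i < p <= j:       # would land inside the copied substring
                    continue
                yield genome[:p] + block + genome[p:]  # duplication

def genome_to_cnp_distance(genome, target, max_events=4):
    """Return the minimum number of events, or None if more than max_events are needed."""
    wanted = Counter({s: c for s, c in target.items() if c > 0})
    seen = {genome}
    frontier = deque([(genome, 0)])
    while frontier:
        current, dist = frontier.popleft()
        if Counter(current) == wanted:
            return dist
        if dist == max_events:
            continue
        for nxt in neighbors(current):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None
```

Because breadth-first search explores genomes in order of the number of events applied, the first genome found with the target profile is reached by a shortest sequence.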

Qingge et al. proved that the MCNG problem is NP-hard when all the duplications are restricted to be tandem [12]. In the next section, we prove that this problem is not only NP-hard, but also NP-hard to approximate within any constant factor.

3 Hardness of Approximation for MCNG

In this section, we show that the Genome-to-CNP distance is hard to approximate within any constant factor. This result actually holds even if only deletions on the genome are allowed. This restriction makes the proof significantly simpler, so we first analyze the deletions-only case. We then extend this result to deletions and duplications.

Both proofs rely on a reduction from SET-COVER. Recall that in SET-COVER, we are given a collection of sets S = {S_1, ..., S_n} over universe U = {u_1, ..., u_m}, and we are asked to find a set cover of U having minimum cardinality (a set cover of U is a subset S* ⊆ S whose union is U). If S* is a set cover in which no two sets intersect, then S* is called an exact cover.
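The difference between a set cover and an exact cover matters throughout this section, so a small checker may help. This is an illustrative sketch; the function names and the encoding of sets as Python sets are assumptions.

```python
def is_set_cover(chosen, universe):
    """True if the union of the chosen sets equals the universe."""
    return set().union(*chosen) == set(universe)

def is_exact_cover(chosen, universe):
    """True if the chosen sets cover the universe and are pairwise disjoint.

    Once the union equals the universe, the sets are pairwise disjoint
    exactly when their sizes add up to the size of the universe.
    """
    total = sum(len(s) for s in chosen)
    return is_set_cover(chosen, universe) and total == len(set(universe))

universe = {1, 2, 3, 4}
exact = [{1, 2}, {3, 4}]          # disjoint and covering
overlapping = [{1, 2}, {2, 3, 4}] # covering, but element 2 is hit twice
```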

There is one interesting feature (or constraint) of our reduction g, which transforms a SET-COVER instance S into an MCNG instance g(S). A set cover S* only works on g(S) if S* is actually an exact cover, and a solution for g(S) can be turned into a set cover for S that is not necessarily exact. Thus we are unable to reduce directly from either SET-COVER or its exact version. We provide a general-purpose lemma for such situations, and our reductions serve as an example of its usefulness.

The proof relies on a result on k-SET-COVER, the special case of SET-COVER in which every given set contains at most k elements. It is known that for any constant k, the k-SET-COVER problem is hard to approximate within a factor ln k − c ln ln k for some constant c not depending on k [16].

Let B be a minimization problem, and let g be a function that transforms any SET-COVER instance S into an instance g(S) of B in polynomial time. Assume that both the following statements hold:

  • any exact cover of S of cardinality at most t can be transformed in polynomial time into a solution of value at most t for g(S);

  • any solution of value at most t for g(S) can be transformed in polynomial time into a set cover of S of cardinality at most t.

Then unless P = NP, there is no constant factor approximation algorithm for B.

Proof.

Suppose for contradiction that B admits a factor b approximation for some constant b. Choose any constant k such that k-SET-COVER is hard to approximate within factor ln k − c ln ln k, and such that b < ln k − c ln ln k. Note that k might be exponentially larger than b, but is still a constant.

Now, let S be an instance of k-SET-COVER over the universe U. Consider the intermediate reduction that transforms S into another k-SET-COVER instance S' = {T : T ⊆ S_i for some S_i ∈ S}, i.e. S' contains every subset of every set of S. Since k is a constant, S' has O(2^k |S|) sets and this can be carried out in polynomial time.

Now define B' = g(S') and consider the instance B'. By the assumptions of the lemma, a solution for B' of value t yields a set cover S* for S' with at most t sets. Clearly, S* can be transformed into a set cover for instance S: for each T ∈ S*, there exists S_i ∈ S such that T ⊆ S_i, so we get a set cover for S by adding this corresponding superset for each T ∈ S*. Thus a solution of value t for B' yields a set cover of S with at most t sets.

In the other direction, consider a set cover {S_1, ..., S_t} of S with t sets. This easily translates into an exact cover of S' with at most t sets by taking the collection

{S_1, S_2 \ S_1, S_3 \ (S_1 ∪ S_2), ..., S_t \ (S_1 ∪ ... ∪ S_{t−1})}.

By the assumptions of the lemma, this exact cover can then be transformed into a solution of value at most t for instance B'.

Therefore, S has a set cover of cardinality at most t if and only if B' has a solution of value at most t. Since there is a direct correspondence between the solution values of the two problems, a factor b approximation for B would provide a factor b approximation for k-SET-COVER, contradicting the choice of k. ∎
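The two ingredients of the proof, closing the instance under subsets and turning any set cover into an exact cover of the closed instance, can be sketched as follows. This is a hedged illustration: the function names are hypothetical, and the exact cover is built by removing from each set the elements already covered by earlier sets, which is one standard way to carry out this step.

```python
from itertools import combinations

def subset_closure(sets):
    """Return all non-empty subsets of every given set, as frozensets.

    With sets of size at most k this produces at most 2**k subsets per set.
    """
    closed = set()
    for s in sets:
        items = list(s)
        for r in range(1, len(items) + 1):
            for combo in combinations(items, r):
                closed.add(frozenset(combo))
    return closed

def make_exact(cover):
    """Turn a set cover into an exact cover made of subsets of its sets."""
    seen = set()
    exact = []
    for s in cover:
        rest = frozenset(s) - seen   # keep only elements not covered yet
        if rest:
            exact.append(rest)
            seen |= rest
    return exact

sets = [{1, 2, 3}, {3, 4}]
closure = subset_closure(sets)
exact = make_exact(sets)
```

Every set produced by `make_exact` is a subset of an original set, so it belongs to the closed instance, and the exact cover never has more sets than the cover it was built from.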

3.1 Constructing genomes and CNPs from SET-COVER instances

All of our hardness results rely on Lemma 3. We need to provide a reduction from SET-COVER to MCNG and prove that both assumptions of the lemma are satisfied.

Figure 2: An example of our construction, with and .

This reduction is the same for deletions-only and deletions-and-duplications. Given a collection of sets S = {S_1, ..., S_n} and a universe U = {u_1, ..., u_m}, we construct a genome G and a CNP c as follows (an example is illustrated in Figure 2). The alphabet is Σ = Σ_S ∪ Σ_U, where Σ_S = {⟨S_i⟩ : S_i ∈ S} and Σ_U = {⟨u_j⟩ : u_j ∈ U}. Thus, there is one character for each set of S and each element of U. Here, each ⟨S_i⟩ is a character that will serve as a separator between characters to delete. For a set S_i ∈ S, define the string q(S_i) as any string that contains each character ⟨u_j⟩ with u_j ∈ S_i exactly once. We put

G = ⟨S_1⟩ q(S_1) ⟨S_2⟩ q(S_2) ... ⟨S_n⟩ q(S_n),

i.e. G is the concatenation of the strings ⟨S_i⟩ q(S_i). As for the CNP c, put

  • c(⟨S_i⟩) = 1 for each S_i ∈ S;

  • c(⟨u_j⟩) = f(u_j) − 1 for each u_j ∈ U, where f(u_j) is the number of sets from S that contain u_j.

Notice that in G, each ⟨S_i⟩ already has the correct copy-number, whereas each ⟨u_j⟩ needs exactly one less copy. Our goal is thus to reduce the number of each ⟨u_j⟩ by 1. This concludes the construction of MCNG instances from SET-COVER instances. We now focus on the hardness of the deletions-only case.
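The construction can be sketched as follows, as an illustration only. The encoding is an assumption of this sketch: each character is a tagged tuple, `("S", i)` for the separator of the i-th set and `("u", x)` for element x, the genome is a list of such characters, and the CNP is a dictionary.

```python
def build_mcng_instance(sets, universe):
    """Build the genome and target CNP from a SET-COVER instance.

    The genome is separator, elements of set 0, separator, elements of
    set 1, and so on. Each separator must keep one copy, and each
    element must lose exactly one copy.
    """
    genome = []
    for i, s in enumerate(sets):
        genome.append(("S", i))
        # Any order of the elements works; sorting keeps it deterministic.
        genome.extend(("u", x) for x in sorted(s))
    target = {("S", i): 1 for i in range(len(sets))}
    for x in universe:
        frequency = sum(1 for s in sets if x in s)
        if frequency == 0:
            raise ValueError(f"element {x!r} belongs to no set")
        target[("u", x)] = frequency - 1
    return genome, target

genome, target = build_mcng_instance([{1, 2}, {2, 3}, {3}], [1, 2, 3])
```

In this toy instance, element 1 occurs in a single set, so its target copy number is 0, while elements 2 and 3 each occur twice and must end with one copy.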

3.2 Warmup: the deletions-only case

Suppose that we are given a set cover instance S and U, and let G and c be the genome and CNP, respectively, as constructed above.

Given an exact cover S* for S of cardinality k, one can obtain a sequence of k deletions transforming G into a genome with CNP c.

Proof.

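Denote the exact cover by S* = {S_{i_1}, ..., S_{i_k}}. Consider the sequence of k deletions that deletes the substrings q(S_{i_1}), ..., q(S_{i_k}) of the constructed genome G (i.e. the sequence first deletes the substring q(S_{i_1}), then deletes q(S_{i_2}), and so on until q(S_{i_k}) is deleted). Since S* is an exact cover, this sequence removes exactly one copy of each element character ⟨u_j⟩ and does not affect the set characters ⟨S_i⟩. Thus the k deletions transform G into a genome with the desired CNP c. ∎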
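The argument of this proof can be checked on a toy instance. The sketch below assumes the same hypothetical encoding of characters as tagged tuples; rather than applying position-based deletions, it simply omits the block of element characters following each chosen set's separator, which has the same effect as one deletion per chosen set.

```python
from collections import Counter

def target_profile(sets):
    """Target CNP: one copy per separator, one copy fewer per element."""
    frequency = Counter(x for s in sets for x in s)
    target = Counter({("S", i): 1 for i in range(len(sets))})
    for x, f in frequency.items():
        target[("u", x)] = f - 1
    return +target                   # unary plus drops zero counts

def genome_after_block_deletions(sets, cover_indices):
    """Return the genome left after deleting the block of each chosen set."""
    genome, deletions = [], 0
    for i, s in enumerate(sets):
        genome.append(("S", i))      # separators are never deleted
        if i in cover_indices:
            deletions += 1           # the whole block goes in one deletion
        else:
            genome.extend(("u", x) for x in sorted(s))
    return genome, deletions

sets = [{1, 2}, {2, 3}, {3}, {1}]
# Sets 0 and 2 form an exact cover of {1, 2, 3}.
genome, k = genome_after_block_deletions(sets, {0, 2})
# Sets 0 and 1 cover {1, 2, 3} but overlap on element 2.
bad_genome, _ = genome_after_block_deletions(sets, {0, 1})
```

With the exact cover, two deletions reach the target profile. With the overlapping cover, element 2 loses two copies instead of one, which is why the reduction needs exactness in this direction.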

Given a sequence of k deletions transforming the constructed genome G into a genome with CNP c, one can obtain a set cover for S of cardinality at most k.

Proof.

Suppose that the deletion events E = (e_1, ..., e_k) transform the constructed genome G into a genome with CNP c. Note that no deletion is allowed to delete a set-character ⟨S_i⟩, as there is only one occurrence of ⟨S_i⟩ in G and c(⟨S_i⟩) = 1. Thus all deletions remove only element characters ⟨u_j⟩. In other words, each e_l in E either deletes a substring of G between some ⟨S_i⟩ and ⟨S_{i+1}⟩ with 1 ≤ i < n, or deletes a substring after ⟨S_n⟩. Moreover, exactly one of each ⟨u_j⟩ occurrences gets deleted from G.

Call S_i affected if there is some event of E that deletes at least one character between ⟨S_i⟩ and ⟨S_{i+1}⟩ with 1 ≤ i < n, and call S_n affected if some event of E deletes characters after ⟨S_n⟩. Let S* = {S_i ∈ S : S_i is affected}. Then |S*| ≤ k, since each deletion affects at most one S_i and there are k deletion events. Moreover, S* must be a set cover, because each u_j has at least one occurrence that gets deleted and thus at least one set containing u_j that is included in S*. This concludes the proof. ∎

We have shown that all the assumptions required by Lemma 3 are satisfied. The inapproximability follows.

Assuming P ≠ NP, there is no polynomial-time constant factor approximation algorithm for MCNG when only deletions are allowed.

We mention without proof that the reduction should be adaptable to the duplication-only case, by putting for each .

3.3 The real deal: deletions and duplications

We now consider both deletions and duplications. The reduction uses the same construction as in Section 3.1. Thus we assume that we have a SET-COVER instance over , and a corresponding instance of MCNG with genome and CNP .

In that case, we observe the following: Lemma 3.2 still holds whether we allow deletions only, or both deletions and duplications. Thus we only need to show that the second assumption of Lemma 3 holds.

Unfortunately, this is not as simple as in the deletions-only case. The problem is that some duplications may copy some and occurrences, and we lose control over what gets deleted, and over what each corresponds to (in particular, some might now get deleted, which did not occur in the deletions-only case).

Nevertheless, the analogous result can be shown to hold.

Given a sequence of k events (deletions and duplications) transforming the constructed genome G into a genome with CNP c, one can obtain a set cover for S of cardinality at most k.

Due to space constraints, we redirect the reader to the Appendix for the proof. In a nutshell, given a sequence of events from to a genome with CNP , the idea is to find, for each , one occurrence of in that we have control over. More precisely, even though that occurrence of might spawn duplicates, all its copies (and copies of copies, and so on) will eventually get deleted. The character preceding this character indicates that can be added to a set cover. The crux of the proof is to show that this character exists for each , and that their corresponding form a set cover of size at most .

We arrive at our main inapproximability result, which again follows from Lemma 3.

Assuming P ≠ NP, there is no polynomial-time constant factor approximation algorithm for MCNG.

In the next section, we prove that the MCNG problem, parameterized by the solution size, is W[1]-hard. This answers another open question in [12]. We refer readers to the book by Downey and Fellows [5] for more details on FPT and W[1]-hardness.

4 W[1]-hardness for MCNG

Since SET-COVER is W[2]-hard, naturally we would like to use the ideas from the above reduction to prove the W[2]-hardness of MCNG. However, the fact that we use k-SET-COVER with constant k in the proof of Lemma 3 is crucial, and k-SET-COVER is in FPT. On the other hand, the property that is really needed in the instance of this proof, and in our MCNG reduction, is that we can transform any set cover into an exact cover. We capture this intuition in the following, and show that SET-COVER instances that have this property are W[1]-hard to solve.

An instance of SET-COVER-with-EXACT-COVER, or SET-COVER-EC for short, is a pair (S, k) where k is an integer and S is a collection of sets forming a universe U. In this problem, we require that S satisfies the property that any set cover for U of size at most k is also an exact cover. We are asked whether there exists a set cover for U of size at most k (in which case this set cover is also an exact cover). Therefore, SET-COVER-EC is a promise problem.

The SET-COVER-EC problem is W[1]-hard for parameter k.

Proof.

We show W[1]-hardness by a reduction from MULTICOLORED-CLIQUE, a technique introduced by Fellows et al. [6]. In the MULTICOLORED-CLIQUE problem, we are given a graph G, an integer k and a coloring c : V(G) → {1, ..., k} such that no two vertices of the same color share an edge. We are asked whether G contains a clique of k vertices (noting that such a clique must have a vertex of each color). This problem is W[1]-hard with respect to k.

Given an instance of MULTICOLORED-CLIQUE, we construct an instance of SET-COVER-EC. We put . For , let and for each pair , let . The universe of the SET-COVER-EC instance has one element for each color , one element for each pair of distinct colors, and two elements for each edge, one for each direction of the edge. That is,

Thus . For two colors , we will denote , i.e. we include in both elements corresponding to each . Now, for each color class and each vertex , add to the set

where is the set of neighbors of in . Then for each , and for each edge , add to the set

Figure 3: A graphical example of the constructed sets for the elements of a graph (not shown) with , where the ’s are in and the ’s in (sets have a gray background, edges represent containment, the lines are dotted only for better visualization).

The idea is that can cover every element of

, except those ordered pairs whose first element is

or . Then if we do decide to include in a set cover, it turns out that we will need to include and to cover these missing ordered pairs. See Figure 3 for an example. For instance if we include in a cover, the uncovered and can be covered with and . We show that has a multicolored clique of size if and only if admits a set cover of size . Note that we have not shown yet that is an instance of SET-COVER-EC, i.e. that any set cover of size at most is also an exact cover. This will be a later part of the proof.

First suppose that has a multi-colored clique , where for each . Consider the collection

the cardinality of is . Each element is covered since we include a set for each color. Each element is covered since we include a set for each color pair with . Consider an element , where and . Note that either or is possible, and that . If , then covers . If , then covers and if , then covers . Thus is a set cover, and is of size at most .

For the converse direction, suppose that is a set cover for of size at most . Note that to cover the elements of , must have at least one set such that for each color class . Moreover, to cover the elements of , must have at least one set such that for each pair. We deduce that has exactly sets. Hence for color , there is exactly one set in for which , and for each pair, there is exactly one set in for which .

We claim that is a multi-colored clique. We already know that contains one vertex of each color. Now, suppose that some do not share an edge, where and . Let be the set of that covers , with . Since is not an edge but is, we know that or (or both). Moreover, does not cover the and elements of , and we know that at least one of these is not covered by nor (if , then none covers , if , then none covers ). But , and and are the only sets of that have elements of , contradicting that is a set cover. This shows that is a multi-colored clique.

It remains to show that is an exact cover. Observe that no two distinct and sets in can intersect because and must be of a different color, and no two distinct and sets in can intersect because and must be from two different color pairs. Suppose that do intersect, and say that and . Then all elements in are of the form for some . Choose any such . If is of color , then since otherwise by construction could not contain . But when , no set of covers the element (it is not nor , the only two possibilities). If is of color , then since again could not contain . In this case, no set of covers . We reach a contradiction and deduce that is an exact cover. ∎

It is now almost immediate that MCNG is W[1]-hard with respect to the natural parameter, namely the number of events to transform a genome into a genome with a given profile (proof in Appendix).

The MCNG problem is W[1]-hard.

Now that we have finished presenting the negative results on MCNG, an immediate question is whether we can obtain a positive result on a related problem. In the next section, we present a positive result for an interesting variation of MCNG.

5 The Copy Number Profile Conforming Problem

We define the more general Copy Number Profile Conforming (CNPC) problem as follows: given two CNPs C_1 and C_2, the CNPC problem asks to compute two strings S_1 and S_2 with cnp(S_1)=C_1 and cnp(S_2)=C_2 such that the distance between S_1 and S_2, d(S_1,S_2), is minimized.

Let , we assume that and are bounded by a polynomial of . (This assumption is needed as the solution of our algorithm could be of size .) We simply say are polynomially bounded. Note that is a very general distance measure, i.e., it could be any genome rearrangement distance (like reversal, transposition, and tandem duplication, etc, or their combinations, e.g. tandem duplication + deletion). In this paper, we use the breakpoint distance (and the adjacency number), which is defined as follows. (These definitions are adapted from Angibaud et al. [1] and Jiang et al. [7], which generalize the corresponding concepts on permutations [17].)

Given two sequences = and =, if = we say that and are matched to each other. In a maximum matching of 2-substrings in and , a matched pair is called an adjacency, and an unmatched pair is called a breakpoint in and respectively. Then, the number of breakpoints in (resp. ) is denoted as (resp. ), and the number of (common) adjacencies between and is denoted as . For example, if , then and there are 2 and 4 breakpoints in and respectively.
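The adjacency and breakpoint counts can be sketched in code under one explicit assumption: two 2-substrings are taken to match when they consist of the same two letters in either order, as is usual for unsigned sequences. With that assumption, a maximum matching pairs up as many 2-substrings of each type as both sequences contain. The function names and sample strings are hypothetical.

```python
from collections import Counter

def two_substrings(seq):
    """Count the 2-substrings of seq, treating xy and yx as the same type."""
    return Counter(tuple(sorted((seq[i], seq[i + 1]))) for i in range(len(seq) - 1))

def common_adjacencies(a, b):
    """Size of a maximum matching between the 2-substrings of a and b."""
    shared = two_substrings(a) & two_substrings(b)   # per-type minimum
    return sum(shared.values())

def breakpoints(a, b):
    """Return the numbers of unmatched 2-substrings in a and in b."""
    adj = common_adjacencies(a, b)
    return max(len(a) - 1, 0) - adj, max(len(b) - 1, 0) - adj

adj = common_adjacencies("abcd", "badc")
bps = breakpoints("abcd", "badc")
```

In this toy pair, the 2-substrings ab and cd of the first string match ba and dc of the second, while bc and ad remain unmatched.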

Coming back to our problem, we define . From the definitions, we have

or,

Hence, the problem is really to maximize

Given -dimensional vectors and , with , and , we say is a sub-vector of if for , also denote this relation as . Henceforth, we simply call integer vectors (with the understanding that no item in a vector is negative). Given two -dimensional integer vectors and , with , and , we say is a common sub-vector of and if is a sub-vector of and is also a sub-vector of (i.e., and ). Finally, is the maximum common sub-vector of and if there is no common sub-vector of and which satisfies or .

An example is illustrated as follows. We have , , and . Both and are common sub-vectors for and , is not the maximum common sub-vector of and (since ) while is.
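Assuming the sub-vector relation is the component-wise comparison (every entry of the sub-vector is at most the corresponding entry of the other vector), the maximum common sub-vector is the component-wise minimum. The sketch below uses hypothetical function names and sample vectors.

```python
def is_sub_vector(x, y):
    """True if x and y have the same length and x[i] <= y[i] for every i."""
    return len(x) == len(y) and all(a <= b for a, b in zip(x, y))

def max_common_sub_vector(x, y):
    """Component-wise minimum of two non-negative integer vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return tuple(min(a, b) for a, b in zip(x, y))

x = (3, 2, 1, 0)
y = (2, 2, 3, 1)
z = max_common_sub_vector(x, y)
```

Any common sub-vector is bounded in each coordinate by both inputs, hence by their minimum, so no common sub-vector can exceed `z` in any coordinate.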

Given a CNP and alphabet , for , we use to denote the multiset of letters (genes) corresponding to ; more precisely, denotes the number of ’s in . Similarly, given a multiset of letters , we use to denote a string where all the letters in appear exactly once (counting multiplicity; i.e, ). is similarly defined when is a CNP. We present Algorithm 1 as follows.

  1. Compute the maximum common sub-vector of and .

  2. Given the gene alphabet , compute , and . Let and .

  3. If , then return two arbitrary strings and as and , exit; otherwise, continue.

  4. Find , and , such that and , and exactly one of is in (say ), and the other is in (say ). If such an cannot be found then return two strings and by concatenating letters in and arbitrarily at the ends of respectively, exit; otherwise, continue.

  5. Compute an arbitrary sequence with the constraint that the first letter is and the last letter is . Then obtain and ( is the concatenation operator).

  6. Finally, insert all the elements in arbitrarily at the two ends of to obtain , and insert all the elements in arbitrarily at the two ends of to obtain .

  7. Return and .

Let . Also let and . We walk through the algorithm using this input as follows.

  1. The maximum common sub-vector of and is .

  2. Compute , and
    . Compute and .

  3. Identify and such that and , and while .

  4. Compute , and .

  5. Insert elements in arbitrarily at the right end of to obtain , and insert all the elements in at the right end of to obtain .

  6. Return and .

Let be polynomially bounded. The number of common adjacencies generated by Algorithm 1 is optimal with a value either or , where with the maximum common sub-vector of and being .

Proof.

First, note that if is a 0-vector (or ) then there will not be any adjacency in and . Henceforth we discuss .

Notice that a common adjacency between and must come from two letters which are both in . That naturally gives us adjacencies, where , which can be done by using the letters in to form two arbitrary strings and (for which is a common substring). If can be found such that and , and one of them is in (say ), and the other is in (say ), then, obviously we could obtain and which are substrings of and respectively. Clearly, there are adjacencies between and (and also and ).

To see that this is optimal, first suppose that no pair as above can be found. This can only occur when there are no two components in ,…, , ,…,