A QPTAS for Gapless MEC

04/29/2018
by Shilpa Garg, et al.
Max Planck Society

We consider the problem Minimum Error Correction (MEC). A MEC instance is an n × m matrix M with entries from {0, 1, -}. Feasible solutions are composed of two binary m-bit strings, together with an assignment of each row of M to one of the two strings. The objective is to minimize the number of mismatches (errors) where a row has a value that differs from the assigned solution string. The symbol "-" is a wildcard that matches both 0 and 1. A MEC instance is gapless if in each row of M all binary entries are consecutive. Gapless-MEC is a relevant problem in computational biology, and it is closely related to segmentation problems that were introduced by [Kleinberg-Papadimitriou-Raghavan STOC'98] in the context of data mining. Without restrictions, it is known to be UG-hard to compute an O(1)-approximate solution to MEC. For both MEC and Gapless-MEC, the best polynomial time approximation algorithm has a logarithmic performance guarantee. We partially settle the approximation status of Gapless-MEC by providing a quasi-polynomial time approximation scheme (QPTAS). Additionally, for the relevant case where the binary part of a row is not contained in the binary part of another row, we provide a polynomial time approximation scheme (PTAS).


1 Introduction.

The minimum error correction problem (MEC) is a segmentation problem where we have to partition a set of length-m strings into two classes. A MEC instance is given by a set of n strings over {0, 1, -} of length m, where the symbol "-" is a wildcard symbol. The strings are represented by an n × m matrix M, where the ith string determines the ith row r_i of M. The distance of two symbols x, y from {0, 1, -} is d(x, y) = 1 if x and y are distinct binary symbols, and d(x, y) = 0 otherwise.

For two strings s, t from {0, 1, -}^m, where s_j denotes the jth symbol of the respective string, d(s, t) is the sum of the symbol distances d(s_j, t_j) over all positions j. A feasible solution to MEC is a pair (s1, s2) of two strings from {0, 1}^m. The optimization goal is to find a feasible solution that minimizes the total number of errors, i.e., the sum over all rows r_i of min{d(r_i, s1), d(r_i, s2)}.

If M is clear from the context, we sometimes omit the index.

A MEC instance is called gapless if in each of the rows of M, all entries from {0, 1} are consecutive. (As a regular expression, a valid row is a word of length m from the language -*(0|1)*-*.) The MEC problem restricted to gapless instances is Gapless-MEC.
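To make the definitions concrete, here is a minimal Python sketch of the MEC objective. The function names (symbol_dist, string_dist, mec_errors) are ours, not from the paper.

```python
# A minimal sketch of the MEC objective; names are illustrative.

def symbol_dist(x: str, y: str) -> int:
    """d(x, y): 1 iff both symbols are binary and differ; '-' matches everything."""
    return 1 if x in "01" and y in "01" and x != y else 0

def string_dist(r: str, s: str) -> int:
    """d(r, s): sum of the symbol distances (wildcards incur no cost)."""
    return sum(symbol_dist(x, y) for x, y in zip(r, s))

def mec_errors(rows: list[str], s1: str, s2: str) -> int:
    """err(s1, s2): every row is charged against its closer solution string."""
    return sum(min(string_dist(r, s1), string_dist(r, s2)) for r in rows)

# A tiny gapless instance: each row has one consecutive binary part.
M = ["011-", "-10-", "--01"]
print(mec_errors(M, "0101", "1010"))  # -> 1 (only the first row pays an error)
```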

Our motivation to study Gapless-MEC stems from its applications in computational biology. Humans are diploid, and hence there exist two versions of each chromosome. Determining the DNA sequences of these two chromosomal copies – called haplotypes – is important for many applications ranging from population history to clinical questions [21, 22]. Many important biological phenomena, such as compound heterozygosity and allele-specific events like DNA methylation or gene expression, can only be studied when haplotype-resolved genomes are available [14].

Existing sequencing technologies cannot read a chromosome from start to end, but instead deliver small pieces of the sequences (called reads). Like in a jigsaw puzzle, the underlying genome sequences are reconstructed from the reads by finding the overlaps between them.

The upcoming next-generation sequencing technologies (e.g., Pacific Biosciences) have made it feasible to produce relatively long contiguous sequences with sequencing errors, where the sequences come from both copies of a chromosome. These sequences are aligned to a reference genome or to a structure called a contig. We can formulate the result of this process as a Gapless-MEC instance: the sequences are the contiguous strings and the contig determines the columns of the strings.

Gapless-MEC is a generalization of a problem called Binary-MEC, the version of MEC restricted to instances where all entries of M are in {0, 1}. Finding an optimal solution to Binary-MEC is equivalent to solving the hypercube 2-segmentation problem (H2S), which was introduced by Kleinberg, Papadimitriou, and Raghavan [11, 12] and which is known to be NP-hard [5, 12]. The optimization version of Binary-MEC differs from H2S in that we minimize the number of mismatches instead of maximizing the number of matches. Binary-MEC allows for good approximations. Ostrovsky and Rabani [17] obtained a PTAS for Binary-MEC based on random embeddings. Building on the work of Li et al. [15], Jiao et al. [10] presented a deterministic PTAS for Binary-MEC.

Gapless-MEC was shown to be NP-hard by Cilibrasi et al. [4]. (Their result predates the hardness result of Feige [5] for H2S; the proof of the claimed NP-hardness of H2S by Kleinberg, Papadimitriou, and Raghavan [11] was never published.) Additionally, they showed that allowing a single gap in each string renders the problem APX-hard. More recently, Bonizzoni et al. [3] showed that it is Unique Games hard to approximate MEC with a constant performance guarantee, whereas it is approximable within a logarithmic factor in the size of the input. To our knowledge, prior to our result their logarithmic-factor approximation was also the best known approximation algorithm for Gapless-MEC.

1.1 Our results.

Our main result is the following theorem.

Theorem 1.

There is a quasi-polynomial time approximation scheme (QPTAS) for Gapless-MEC.

Thus we partially settle the approximability of this problem: Gapless-MEC is not APX-hard unless NP ⊆ QP (cf. [20]). Our result thereby reveals a separation between the hardness of the gapless case and the case where we allow a single gap. Furthermore, already Binary-MEC is strongly NP-hard since the input does not contain numerical values. Therefore we can exclude the existence of an FPTAS for both Binary-MEC and Gapless-MEC unless P = NP.

Additionally, we address the class of subinterval-free Gapless-MEC instances where no string is contained in another string. More precisely, for each pair of rows of M we exclude that the set of columns with binary entries of one row is a strict subset of the set of columns with binary entries of the other row.

Theorem 2.

There is a polynomial time approximation scheme (PTAS) for Gapless-MEC restricted to instances such that no string is a substring of another string.

1.2 Overview of our approach.

Our algorithm is a dynamic program (DP) that is composed of several levels. Given a general Gapless-MEC instance, we decompose the rows of the instance into length classes according to the length of the contiguous binary parts of the rows. For each length class we consider a well-selected set of columns such that each row crosses at least one and at most two of these columns. (A row crosses a column if the binary part of the row contains that column.)

We further decompose each length class into two sub-classes, one whose rows cross exactly one column and one whose rows cross exactly two columns. For the second class, it is sufficient to consider every other column, which leaves us with a collection of rooted instances. Thus for each sub-instance there is a single column (the root) which is crossed by all rows of the instance.

We further decompose rooted sub-instances into the left hand side and the right hand side of the root. Since the two sides are symmetric, we can arrange the rows and columns of these sub-instances in such a way that all rows cross the first column. We call this type of sub-instance an SWC-instance (for "simple wildcards"). We order the rows from top to bottom by increasing length in order to be able to further decompose the instance.

The first level of our DP solves these highly structured SWC-instances. The basic idea that we would like to apply is to select a constant number of rows from the instance that represent the solution. Without further precautions, however, this strategy fails because of differing densities within the instance: the selected rows have to represent both the entries of columns crossed by many short rows and the entries of columns crossed by arbitrarily small numbers of long rows. To resolve this issue, we observe that computing the solution strings s1 and s2 is equivalent to finding a partition of the rows into two sets, one assigned to s1 and the other assigned to s2. If we have the guarantee that, for both solution strings, an ε fraction of the rows of the matrix forms a Binary-MEC sub-instance, we show that the basic idea works.

This insight motivates separating SWC-instances from left to right into sub-instances with the required property and assembling them from left to right using a DP. There are, however, several complications. In order to choose the right sub-instances, we have to take into account that the choice depends on which rows are assigned to s1 and which are assigned to s2. Therefore the DP has to take special care when identifying the sub-instances.

Furthermore, in order to stitch sub-instances together to form a common solution, the algorithm for the left sub-instance has to compute a set of candidate solutions oblivious to the choices in the right sub-instance. This means that we have to compute a solution to the left sub-instance without looking at a fraction of its rows. We present an algorithm for these sub-instances in Section 2.

In order to combine the sub-instances, we face further technical complications due to having distinct sub-instances for the rows assigned to s1 and the rows assigned to s2. In Section 2.1, we introduce a DP whose cells are pairs of simpler DP cells, one for s1 and one for s2.

Before we consider general instances, we first develop our techniques on subinterval-free instances, which are easier to handle (Section 3). Observe that the instances considered until now are special rooted subinterval-free instances. We show how to solve arbitrary rooted subinterval-free instances by combining the DP with additional information about the sub-problems that contain the root. We then introduce the notion of domination in order to combine rooted subinterval-free instances with a DP proceeding from left to right. The main idea is that a dominant sub-problem dictates the solution. At the interface of two sub-instances, there can be a (contiguous) region where neither of the two sub-problems is dominant. We show that these regions can be solved directly by considering a constant number of rows (using the results from Section 2).

Until this point, all parts of our algorithm run in polynomial time. We lose this property when considering length classes, in Section 4.1. The length classes allow us to separate an instance into rooted sub-instances. The difficulty is that the left hand side of a separating column may have a completely different structure than the right hand side of that column. We do not know how to combine the two sides by considering only a polynomial number of possibilities. If we allow quasi-polynomial running time, however, we can solve the problem. We use that each of the two sub-instances (the one on the left and the one on the right) is composed of at most logarithmically many parts. Considering all parts simultaneously allows us to take care of dependencies between the left hand side and the right hand side and still solve them as if they were separate instances.

Combining such rooted instances from left to right can then be done in the same spirit as combining rooted subinterval-free instances. To solve an entire length class, we combine both solutions by running a new DP that considers quadruples of DP cells.

Finally, in Section 4.2, we are able to handle all length classes simultaneously. We solve general instances in the same spirit as the combined sub-instances of a single length class. Instead of considering quadruples of cells, however, we form collections of quadruples that are – figuratively speaking – stacked on top of each other. The key insight is that there are only logarithmically many different length classes and each collection has at most one quadruple of each length class. Considering all possible collections adds another logarithmic factor to the exponent of the running time, which is still quasi-polynomial.

1.3 Further related work.

Binary-MEC is a variant of the Hamming k-Median Clustering Problem with k = 2, and there are PTASs known for it [10, 17]. Li, Ma, and Wang [15] provided a PTAS for the general consensus pattern problem, which is closely related to MEC. Additionally, they provided a PTAS for a restricted version of the star alignment problem, allowing at most a constant number of gaps in each sequence.

Alon and Sudakov [1] provided a PTAS for H2S, the maximization version of Binary-MEC, and Wulff, Urner, and Ben-David [23] showed that there is also a PTAS for the maximization version of MEC. For Gapless-MEC, He et al. [8] studied fixed-parameter tractability in the parameter fragment length, with some restrictions. These restrictions allow their dynamic programming algorithm to focus on the reconstruction of a single haplotype and, hence, to limit the possible combinations for each column. There is an FPT algorithm parameterized by the coverage [18, 7] (and some additional parameters for pedigrees). Bonizzoni et al. [3] provided FPT algorithms for Gapless-MEC parameterized by the fragment length and by the total number of corrections. There are some tools which can be used in practice to solve Gapless-MEC instances [19, 18].

Most research in haplotype phasing deals with exact and heuristic approaches to solve Binary-MEC. Exact approaches, which solve the problem optimally, include integer linear programming [6] and fixed-parameter tractable algorithms [8, 19]. There is also a greedy heuristic approach proposed to solve Binary-MEC [2].

Lancia et al. [13] obtained a network-flow based polynomial time algorithm for Minimum Fragment Removal (MFR) for gapless fragments. Additionally, they related Minimum SNP Removal (MSR) to finding a largest independent set in a weakly triangulated graph.

1.4 Preliminaries and notation.

We consider a Gapless-MEC instance, which is an n × m matrix M. The ith row of M is the vector r_i and the jth column is the vector c_j. The length of the binary part of r_i is denoted by ℓ_i. We say that the ith row of M crosses the jth column if the entry M[i][j] is binary.

For each feasible solution (s1, s2) for M, we specify an assignment of rows to solution strings. The default assignment is specified as follows: we assign a row r to s1 if d(r, s1) ≤ d(r, s2); otherwise we assign r to s2. For the rows of M assigned to s1 we write R1, and for the rows assigned to s2 we write R2. For a given instance, (s1*, s2*) denotes an optimal solution, with row sets R1* and R2*. Observe that knowing (s1*, s2*) allows us to obtain an optimal assignment R1*, R2* by assigning each row to the solution string with the fewest errors, and knowing R1* and R2* allows us to obtain an optimal solution by selecting the column-wise majority values.
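Both directions of this observation are easy to express in code. A sketch, reusing string_dist from the snippet above (the names assign_rows and majority_string are ours):

```python
def assign_rows(rows: list[str], s1: str, s2: str):
    """Default assignment: each row goes to the solution string with
    fewer errors (ties are broken in favor of s1)."""
    R1 = [r for r in rows if string_dist(r, s1) <= string_dist(r, s2)]
    R2 = [r for r in rows if string_dist(r, s1) > string_dist(r, s2)]
    return R1, R2

def majority_string(rows: list[str], m: int) -> str:
    """Column-wise majority over the binary entries; wildcards are ignored."""
    out = []
    for j in range(m):
        ones = sum(r[j] == "1" for r in rows)
        zeros = sum(r[j] == "0" for r in rows)
        out.append("1" if ones >= zeros else "0")
    return "".join(out)
```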

2 Simple instances with wildcards.

In this section, we consider instances of Gapless-MEC where all entries of column one of M are zero or one, i.e., M[i][1] ∈ {0, 1} for each index i. Observe that the wildcards now have a simple structure, which we refer to as SWC-structure. An instance with SWC-structure is an SWC-instance.

Definition 1 (Standard ordering of SWC-instances).

We define the standard ordering of rows in M such that ℓ_i ≤ ℓ_{i+1} for each i < n, i.e., we order them from top to bottom by increasing length of the binary part.

Definition 2 (Good SWC-instances).

We call an SWC-instance good if it is in standard ordering and there are at least εn rows of R1* and at least εn rows of R2* that have only entries from {0, 1}.

To solve good SWC-instances, we generalize the PTAS for Binary-MEC by Jiao et al. [10]. Our algorithm requires partitions of the set of rows. In the following two definitions, the required number of rows may be a fractional number. To handle this, we allow the assignment of fractional rows, i.e., for a row r, we can choose an α ∈ [0, 1] and assign an α fraction of r to one set and a (1 − α) fraction to the other set.

The following two definitions allow us to introduce a structured view on optimal solutions.

Definition 3 (Trisection).

An ε-trisection (T1, T2, T3) of an instance M for R1 is a partition of the rows into three consecutive ranges that have the following properties.

  1. The first range T1 contains row r_1 and εn rows of R1.

  2. The second range T2 is consecutive to the first row set and contains εn rows of R1.

  3. The third range T3 contains the remaining rows of M.

To avoid ambiguity, we choose T2 and T3 such that their first row is a (fractional) row of R1.

We define an ε-trisection (T1', T2', T3') for R2 analogously, replacing R1 by R2.

Definition 4 (Subdivision of trisections).

We consider the row sets T1 and T2 from Definition 3 and, additionally, we divide each of these sets into k disjoint consecutive subsets, T1 = T1,1 ∪ … ∪ T1,k and T2 = T2,1 ∪ … ∪ T2,k. For each i, the subset T1,i contains an equal share of the rows of R1 in T1, and T2,i contains an equal share of the rows of R1 in T2. Analogously, each T1',i contains an equal share of the rows of R2 in T1' and each T2',i an equal share of the rows of R2 in T2'. To avoid ambiguity, each set T1,i and T2,i starts with a (fractional) row of R1, and each set T1',i and T2',i starts with a (fractional) row of R2.

We introduce a new algorithm for our setting. For an instance M, we consider the row sets from the ε-trisections of M and their subsets according to Definition 4. Additionally, we select a multi-set of rows from T1 and from T2. We then compute the weighted majority according to Definition 5 for each column, using the selected multi-sets, and keep the outcome with the minimum number of errors. The main idea is to find two small row sets that represent the whole instance M. The intuitive meaning is that we select rows from the upper part with a much lower density than the rows of the lower part. We therefore introduce a bias such that all rows are equally important.

Definition 5 (Weighted majority).

Let j be an integer and let U and W be two matrices with at least j columns. In U and W, we replace all zeros by −1 and then all wildcard symbols by zero. We then compute the number v_j, the sum of all entries in column j of U and of W. Then maj_j(U, W) = 1 if v_j ≥ 0 and maj_j(U, W) = 0 if v_j < 0.
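A sketch of Definition 5 in code, under our (reconstructed) reading that zeros vote −1, ones vote +1, wildcards abstain, and ties go to 1; the bias between the two multi-sets is realized through their multiplicities:

```python
def vote(symbol: str) -> int:
    """Zeros are replaced by -1, wildcards by 0; ones stay +1."""
    return {"0": -1, "-": 0, "1": 1}[symbol]

def maj(U: list[str], W: list[str], j: int) -> str:
    """maj_j(U, W): the sign of the summed votes in column j of both multi-sets."""
    v = sum(vote(r[j]) for r in U) + sum(vote(r[j]) for r in W)
    return "1" if v >= 0 else "0"
```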

With this preparation, we are now ready to present the algorithm. The input has a long list of parameters that will later allow our dynamic programs to control the execution. The reason is that we do not know R1* and R2*. Therefore the algorithm takes guesses of row sets as input. The values n1 and n2 are guesses of |R1*| and |R2*|.

Input : Row sets T1, T2, T1', T2' of a good SWC-instance M, numbers n1, n2.
Optional: a selection of rows U, W, U', W', see below.
Output : A pair of solution strings (s1, s2).
Run the algorithm for each possible selection of the following type and keep the best outcome (minimum number of errors);
  // If provided as input, skip the selection.
For each i, select (with repetition) a multi-set U_i of rows from T1,i and a multi-set W_i of rows from T2,i;
For each i, select (with repetition) a multi-set U'_i of rows from T1',i and a multi-set W'_i of rows from T2',i such that the selected rows are disjoint from those in U_i and W_i;
// U := U_1 ∪ … ∪ U_k. The values W, U', and W' are defined analogously.
For each column j, set s1[j] := maj_j(U, W) and s2[j] := maj_j(U', W');
For each row r of M, determine the value d(r, s1) − d(r, s2);
Assign the n1 rows with the smallest values to s1 and the remaining rows to s2.
Algorithm 1
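Since the number of chunks and the number of sampled rows per chunk are constants (depending on ε), the exhaustive enumeration of selections stays polynomial. A sketch of that enumeration; evaluate_selection is a hypothetical stand-in for the remaining steps of Algorithm 1:

```python
import itertools

def best_selection(chunks: list[list[str]], t: int, evaluate_selection):
    """Try every choice of one size-t multi-set of rows per chunk and keep
    the outcome with the minimum number of errors."""
    best = None
    per_chunk = [itertools.combinations_with_replacement(c, t) for c in chunks]
    for selection in itertools.product(*per_chunk):
        errors, strings = evaluate_selection(selection)
        if best is None or errors < best[0]:
            best = (errors, strings)
    return best
```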

Observe that for small (i.e., constant) values of n1 or n2, the algorithm can be replaced by an exact algorithm: we know R1* if and only if we know R2*, and we are able to guess constantly many rows.

Lemma 1.

Let M be a good SWC-instance. For sufficiently large subdivision and sample sizes, let the input row sets form a subdivision (Definition 4) of an ε-trisection of M. Then Algorithm 1 is a (1 + ε)-approximation algorithm for M.

The proof is based on a randomized argument using Chernoff bounds. (See Appendix A).

Lemma 1 shows that the set of solutions considered by Algorithm 1 contains at least one solution that is good enough, even though we do not look at most of the instance. It does not say that we finally compute that solution, since other solutions may have fewer errors on the sampled rows. For our dynamic programs, we need a stronger statement. We would like to be able to compute a solution for an instance and afterwards change a fraction of the assignments without losing the approximation guarantee. The next lemma is a key ingredient of our result.

Lemma 2.

Let M be a good SWC-instance and let ε > 0 be sufficiently small. Let (T1, T2, T3) be an ε-trisection for R1* and (T1', T2', T3') an ε-trisection for R2*, with subdivisions according to Definition 4. Let (s1, s2) be the solution computed by Algorithm 1 with these subdivisions and n1 = |R1*|, n2 = |R2*|. Then re-assigning the rows R1* to s1 and R2* to s2 gives a (1 + O(ε))-approximation for the instance M.

Proof.

For ease of presentation, we assume that all appearing numbers are integers. It is easy to adapt the proof by rounding fractional numbers appropriately.

We first analyze the computed solution strings. Let E1 be the total number of errors of s1 within R1* and let E2 be the total number of errors of s2 within R2*. Due to Lemma 1, E1 + E2 is at most (1 + ε) times the optimal number of errors.

We may assume E1 ≤ E2, since otherwise we can simply rename the two strings s1, s2. Additionally, by renaming R1* and R2*, we may assume that |R1*| ≤ |R2*|. Therefore both quantities are bounded. (Recall that the matrix has n rows and m columns. The value nm is a safe upper bound on the total number of errors.)

Claim 2.1.

There is a large set Q of column indices such that s1[j] = s1*[j] and s2[j] = s2*[j] for all j ∈ Q.

Proof of Claim.

We concentrate on the columns of M where both strings s1 and s2 have few errors within T2 and T2'. By counting the errors, there are only few columns where s1 has many errors within T2. Similarly, there are only few columns where s2 has many errors within T2'. Therefore there is a large set Q of columns where simultaneously both s1 and s2 have few errors each.

Now suppose that the claim was not true and there was an index j ∈ Q with, say, s1[j] ≠ s1*[j]. Then, since the rows of T2 are binary in column j, either s1 or s1* is erroneous in at least half of the rows of T2 in column j, a contradiction. ∎

Next we analyze the columns outside of Q. Let j be a column (i.e., an index) not in Q. By symmetry, we may assume s1[j] ≠ s1*[j]. We aim to show that an optimal solution always has sufficiently many errors to pay for the wrong entries of the computed strings.

Let x_j be the number of errors of s1* in column j within R1* and let y_j be the number of errors of s2* in column j within R2*.

Claim 2.2.

For each column j of M outside of Q, either x_j or y_j is large enough to account for the re-assignment errors in column j.

Proof of Claim.

We distinguish two cases. We first assume s1*[j] = 0. If also s2*[j] = 0, we are done. We therefore assume s2*[j] = 1. If there are many ones in column j of the considered rows, then s1* has many errors in column j and thus x_j is large. Otherwise, the considered rows have many zeros in column j and therefore y_j is large. We obtain the claimed bound.

In the second case, s1*[j] = 1, and we assume s2*[j] = 0. If there are many ones in column j of the considered rows, then s2* has many errors in column j and thus y_j is large. Otherwise, the considered rows have many zeros in column j and therefore x_j is large. Again, we obtain the claimed bound. ∎

Since E1 ≤ E2 by our assumption, Claim 2.2 implies that within the columns outside of Q, after reassigning the rows we still have a (1 + O(ε))-approximation.

To finish the proof, we argue that the optimal number of errors is large enough to pay for all errors in T3 and outside of Q. Let E3 be the number of errors due to assigning R1* to s1 and R2* to s2 within the remaining interval.

Then, using the size of Q stated in Claim 2.1, the total number of errors of the re-assigned solution in M is bounded by the errors of (s1, s2) within T1 and T2, the errors within T3 and in the columns outside of Q, and the errors in all remaining entries of M. Summing these contributions yields an approximation ratio of 1 + O(ε); the first inequality of the corresponding estimate uses that, for some constant c, the optimum is at least c times E3. ∎

2.1 A DP for SWC-instances.

Let M be an SWC-instance with rows r_1, …, r_n. We define s(i) and e(i) to be the start and the end of string number i of M, i.e., the column numbers of the matrix where the binary part starts and ends. For a sub-matrix B of M, s(B) denotes the index of the first column of B and e(B) the index of the last column of B.

We next specify the parts of which the DP cells are composed. We divide the input instance into blocks defined as follows.

Definition 6 (Block).

Given a good SWC-instance M, a block B is a sub-instance determined by three numbers e, t, b as follows. The first column of B is column one of M. The last column of B is column e. The first row of B is r_t and the last row is r_b. We write B1, B2, and B3 for three consecutive row ranges of B that take the roles of the ranges of a trisection.

The idea is that a block determines a trisection. We subdivide each block into chunks and select rows from these chunks. Chunks are closely related to subdivisions of trisections, but we do not assume knowledge of R1* and R2*.

Definition 7 (Chunk).

Let B be a block determined by the numbers e, t, b. We partition B1 and B2 into equally many chunks (ranges of rows). These chunks are determined by increasing row numbers: the ith chunk of B1 is the sub-matrix composed of the rows between the ith and the (i+1)st of these numbers, and the ith chunk of B2 is defined analogously.

Definition 8 (Selection).

For each block B with a set of chunks C, we consider a multi-set S of rows of constant size. We require that S contains the same number of rows from each chunk in C.

The selection S will take the role of the multi-sets U and W (or U' and W') in Algorithm 1. With these preparations we can define a DP cell.

Definition 9 (DP cell).

For each block B, each set of chunks C of B, and each selection S of rows from C, there is a DP cell represented by (B, C, S). A DP cell (B, C, S) is a predecessor of (B', C', S') if the following conditions hold.

  • e ≤ e' and b ≤ b', where e, t, b and e', t', b' are the numbers from Definition 6.

  • The chunks from C between t' and b are exactly the chunks from C' between t' and b.

  • For each pair of chunks from C and C' with the same range of rows, the selection S matches the selection S'.

The value of (B, C, S) will be an approximation of the minimum number of errors that we can have in M up to the last column of B.
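A rough rendering of Definitions 6–9 as data; all field names are ours, and the predecessor test only sketches the consistency conditions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    e: int          # last column of the block
    t: int          # first row of the block
    b: int          # last row of the block
    samples: tuple  # ((chunk_start_row, sampled_row_ids), ...), one entry per chunk

def is_predecessor(c: Cell, d: Cell) -> bool:
    """c precedes d: d's block reaches further, and chunks with the same
    row range carry the same selections (cf. Definition 9)."""
    if not (c.e <= d.e and c.b <= d.b):
        return False
    sel_c, sel_d = dict(c.samples), dict(d.samples)
    shared = sel_c.keys() & sel_d.keys()
    return all(sel_c[s] == sel_d[s] for s in shared)
```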

We now describe the dynamic program for a pair of solution strings by using joint DP cells (see also Fig. 1).

Figure 1: Example for a pair of solution strings. The solid and dashed blue lines represent the selected row sets for the two strings.

For the second solution string, we use the same notation as in Definitions 6, 7, and 8, but we mark all occurring variables with the prime symbol (′).

Definition 10 (DP cell for a pair).

We define a joint DP cell ((B, C, S), (B', C', S')) with the two single cells defined as in Definition 9. We require that

  • the rows of C and C' where chunks start are pairwise distinct, and

  • the selections S and S' are disjoint.

Definition 11 (Predecessor of a joint DP cell).

A joint DP cell (c1, c2) is a predecessor of (d1, d2) if (i) c1 = d1 and c2 is a predecessor of d2; or (ii) c1 is a predecessor of d1 and c2 = d2.

Algorithm (the joint DP).

The general idea of the algorithm is to guess trisections. Suppose we initially chose blocks B and B' that are the left-most trisections for s1 and s2. Then we obtain an approximation of the prefix of (s1, s2) up to the end of B or B' (whichever ends first) by sampling rows of B1, B2, B'1, and B'2. The sampled rows for B2 and B'2 provide the interface to the next step. Suppose B2 starts at an earlier row than B'2. Then we guess the trisection of the next block for s1, restricted to the rows of B2 and B3. Let D be that block of our algorithm. Then D overlaps B2, and we sample rows of D in order to approximate a new infix of s1. For a simplified version of the DP without the complications due to having two solution strings, we refer to Appendix B.

We globally guess two numbers n1 and n2 that represent |R1*| and |R2*|. We split the processing into an initialization phase and an update phase. In the initialization phase, we assign values to each DP cell ((B, C, S), (B', C', S')) based on Algorithm 1 with the following parameters. We obtain T1, T2 from the chunks C and T1', T2' from the chunks C'. In the execution of Algorithm 1, we use the selections S and S' instead of trying all possible selections, i.e., S and S' determine all U_i, W_i, U'_i, and W'_i in the algorithm. Let N be the sub-matrix with the rows of the two blocks and the columns one to min{e, e'}. The solution of the computation is a pair of strings, the prefixes of the two computed strings up to column min{e, e'}. The value of the cell is the number of errors of this pair within N.

In the update phase, we compute the value and the pair of strings of a DP cell as follows. We inductively assume that all DP cells for predecessors have been updated already. We try all predecessor pairs of DP cells and keep the one that gives the best result (see also Appendix B). Let (c1, c2) be a predecessor of (d1, d2). By symmetry, we assume without loss of generality that c1 = d1. There are two cases how the two pairs interact. The first case is that the block of d2 ends no later than the block of d1. We run Algorithm 1 on the new columns up to that end, with the parameters from d2 (see initialization). To obtain the full solution, we append the newly computed string piece to the corresponding string of the predecessor pair. Let N be the sub-matrix with the rows considered so far and the columns one to the new end. The solution of the computation is a pair of strings, the prefixes of the two computed strings from column one to the new end. The potential new value of the cell is the number of errors of this pair within N. We replace the stored solution with the potential new solution if the cost has decreased.

The second case is that the block of d2 ends after the block of d1. This case is the crux of the joint DP, since we have a "switch" of the roles of s1 and s2.

We run Algorithm 1 on the new columns up to the end of the block of d1, with the parameters from d2 (see initialization). To obtain the full solution, we then append the newly computed string piece to the corresponding string of the predecessor pair. Let N be the sub-matrix with the rows considered so far and the columns one to this end. The solution of the computation is a pair of strings, the prefixes of the two computed strings up to this end. The potential new value of the cell is the number of errors of this pair within N. We replace the stored solution with the potential new solution if the cost has decreased.

For the last pieces of the strings, we additionally consider special cells that are defined as before, but where the third range of a block is empty. Intuitively, we use these cells when only constantly many rows of R1 or R2 are left. For pairs of cells containing such a block, our computation considers the optimal solution within the block instead of running Algorithm 1.
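Abstracting away the guesses, the two phases have the usual shape of a left-to-right DP. A sketch, where init_value and extend stand in for the calls to Algorithm 1 described above, and the joint cells are pairs of the Cell objects sketched earlier:

```python
def run_joint_dp(cells, preds, init_value, extend):
    """Initialize every joint cell, then improve it via all predecessors."""
    value = {}
    # Processing cells by increasing total extent guarantees that all
    # predecessors of a cell are finished before the cell is updated.
    for cell in sorted(cells, key=lambda c: c[0].e + c[1].e):
        value[cell] = init_value(cell)
        for p in preds(cell):
            value[cell] = min(value[cell], value[p] + extend(p, cell))
    return value
```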

Theorem 3.

The algorithm is a PTAS for SWC-instances.

Proof.

To see that the DP works in polynomial time, we observe that, instead of the simple DP cells of Lemma 7, here we consider pairs of DP cells. Therefore the number of cells is squared and thus stays polynomial. During the recursive construction of the solution, we compare each cell to be computed with one compatible cell at a time. Therefore the construction of the solution also takes only polynomial time. As in Lemma 7, the computed solution is vacuously feasible.

We continue with analyzing the quality of the computed solution. Let (s1*, s2*) be an optimal solution. We set R1 := R1* and R2 := R2*. By renaming the two strings, we may assume that the last row of the first εn rows of R1 is below the first row of the last εn rows of R2.

We consider DP cells similar to those in the proof of Lemma 7. Starting from the top-most row of R1, for each i, the ith range contains the next εn rows of R1. We assign the rows not in R1 such that the first row of each range is contained in R1. Then we choose the number of ranges such that all rows of R1 are contained in some range.

We consider the DP cells for each i with the parameters of the corresponding blocks, chunks, and selections. The first block contains the rows of the first two ranges, and the columns one to the end of the first row of the second range. For each subsequent i, the ith block contains the rows of the ith and (i+1)st ranges, and the columns after those of the previous block to the end of the first row of the (i+1)st range.

If only a constant number of rows of R1 are left, we can compute the partial solutions optimally, and there are DP cells for exactly this purpose: there is a DP cell such that the last constantly many rows of R1 are located between its first and last row, and its block contains exactly these rows. As before, to keep a clean notation, in the following we implicitly assume that cells with constantly many rows are handled separately.

The chunks of each block are the ranges that equally distribute its rows of R1. The selection is the best possible selection as specified in Algorithm 1. Analogously, we define the blocks, chunks, and selections for R2.

We construct a solution SOL and inductively show that the value of each considered pair of cells is at most a factor 1 + O(ε) larger than the number of errors of an optimal solution restricted to the considered prefix and the considered rows. Afterwards we show that our algorithm computes a solution at least as good as SOL.

We first consider the left-most pair of DP cells. Recall that we assumed w.l.o.g. that the block for s1 ends first. We apply Lemma 2 with the parameters of the pair of cells to obtain the prefixes of s1 and s2. The total number of errors within the columns of these prefixes is therefore at most a factor 1 + O(ε) larger than in an optimal solution. There are two possibilities for the subsequent steps.

We first assume that there is no switch and consider the next cell for s1. Then, similar to the proof of Lemma 7, we apply Algorithm 1 to obtain the next infix of s1. By Lemma 2, considering this infix alone, we have at most a factor 1 + O(ε) more errors within these columns than an optimal solution.

Since the previous cell is a predecessor of the current cell, all newly assigned rows were not considered before. Note that s2 did not change: even though we looked at the same chunks, we used the same selections and therefore did not change the computed string.

The second possibility is that there is a switch and we consider the next cell for s2. The instance is shown in Fig. 2. We then apply Algorithm 1 to obtain the next infix of s2. We obtain a (1 + O(ε))-approximation analogous to the first case. ∎

Figure 2: Blocks of an instance in the DP for a pair of solution strings. The blue and gray lines represent the blocks for s1 and s2, respectively, from the first two iterations of the DP. The sketch shows a switch in the second iteration.

3 Subinterval-free instances.

We show how to generalize the results of the previous section in order to handle instances where no interval of a string is a proper subinterval of another string, and thus show Theorem 2. To this end, we first show how to handle the rooted version of subinterval-free instances, where there is one column c such that each string of the instance crosses c.

We order the rows of a subinterval-free instance from top to bottom such that for each pair of rows r, r' with the binary part of r starting to the left of the binary part of r', r is above r'. In other words, the binary strings are ordered from top to bottom with increasing starting position (i.e., column). Observe that the subinterval-freeness property ensures that the last binary entry of r' is not to the left of the last binary entry of r.
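A small sketch of this ordering; binary_span is an illustrative helper computing the first and last binary column of a row:

```python
def binary_span(r: str):
    """First and last non-wildcard column of a gapless row."""
    idx = [j for j, x in enumerate(r) if x != "-"]
    return idx[0], idx[-1]

# Sorting by start position; subinterval-freeness guarantees that the end
# positions are then non-decreasing as well.
rows_ordered = sorted(["-011-", "--110", "011--"], key=binary_span)
```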

Lemma 3.

Let M be a Gapless-MEC instance such that no string is a substring of another string. Furthermore, we assume that there is a column c of M such that each string of the instance crosses c. Then there is a PTAS for M.

Proof.

Let r_1 and r_n be the first and the last row of M. The column c determines a block K of M that spans all rows and the columns from the first binary entry of r_n, s(r_n), to the last binary entry of r_1, e(r_1). In particular, K has only binary entries.

The right hand side of s(r_n) (the submatrix of M composed of all columns with index at least s(r_n)) forms a Gapless-MEC instance as required in Theorem 3. The submatrix of M that contains all rows of M and columns one to e(r_1) forms a Gapless-MEC instance as required in Theorem 3 if we invert both the order of the rows and the columns. Instead of changing the ordering of the matrix, we can run the algorithm from right to left and from bottom to top.

We would like to apply Theorem 3 independently to the two specified sub-problems. To this end, we define a special set of DP cells, the center cells. The content of these cells is similar to the regular cells, but it contains the information for both sides simultaneously. More precisely, a center cell has the following entries (see also Figures 3 and 4).

(a) Three consecutive ranges of rows determined by row numbers. These numbers determine an upper range and the following further ranges: a left lower range as well as a right lower range. (b) A separation of these ranges into chunks. (c) A selection of rows (with repetition): constantly many rows for each chunk.

We analogously obtain a second center cell with the same variables but marked with the prime symbol. The rows selected in the second cell are required to be disjoint from those selected in the first. Also, the boundaries of chunks in the two cells have to be disjoint.

Definition 12 (Center cells).

The cells described above are called center cells.

The reason is that they take a special role as common "centers" of two separate runs of the DP: one run to the left and one run to the right. Observe that for each feasible entry of a center cell, we can apply Theorem 3 independently to the left and to the right, since the center cell takes the role of the left-most cell in Theorem 3. The strings only overlap between the columns s(r_n) and e(r_1), where we obtain an instance of Binary-MEC, which in particular is a good SWC-instance. Note that for each column j on the right hand side of e(r_1), all rows located between two rows with a binary entry at column j also have a binary entry at column j, due to the subinterval-freeness. The properties on the left hand side of s(r_n) are analogous. We will choose the ranges in such a way that, by Lemma 2, it is sufficient to consider the rows of the center cell in order to handle all rows crossing c.

None of the remaining steps from Section 2.1 interfere with each other. We therefore run the following DP. We first compute all center cells. For each cell, we store an infix of s1 and an infix of s2. The infixes start at column s(r_n) and end at column e(r_1). The entries of the two strings are those that we obtain from Algorithm 1 with the parameters of the cell. Each center cell forms a starting point for the joint DP, applied independently towards the left hand side and the right hand side.

To see that the DP yields a good enough approximation, we again compare against an optimal solution. Clearly, we get a (1 + O(ε))-approximation for the infix between columns s(r_n) and e(r_1) for the right choice of center cell, by Lemma 2. Note that the computed solution does not consider the rows above or below the selected ranges. Since the further processing respects our choices, the claim follows from Theorem 3. ∎

General subinterval-free instances

We use Lemma 3 to handle general subinterval-free instances. Instead of a single column crossed by all strings, we determine a sequence of columns with the property that each string crosses exactly one of them. Let r be the first string in