Biological evidence indicates that the genomes of different species may present essentially the same set of genes in their DNA strands, although not in the same order NadeauTaylor1984; PalmerHerbon1988, suggesting the occurrence of mutational events that affect large portions of DNA. Research indicates that these are rare events and, therefore, may provide important clues for the reconstruction of the evolutionary history among species koonin2005orthologs. One such event is the transposition, which swaps the positions of two adjacent blocks of genes in one chromosome. The Transposition Distance Problem (TDP) aims to find the minimum number of transpositions (distance) required to transform one chromosome into another, both represented by permutations. The TDP is equivalent to the problem of Sorting by Transpositions (SBT), since the latter asks for the transposition distance between a given permutation and the identity.
The first approximation algorithm to solve the SBT was devised in 1998 by Bafna and Pevzner BafnaPevzner1998, with a 1.5 approximation ratio, based on properties of a structure called the cycle graph. In 2006, Elias and Hartman EliasHartman2006 presented a 1.375-approximation algorithm with time complexity $O(n^2)$, the best known approximation ratio for SBT so far, also based on the cycle graph. Their algorithm relies on simplification, a technique meant to facilitate the handling of permutations whose cycle graphs contain long cycles. This technique consists of inserting new symbols into the original permutation $\pi$, so that the result is a new permutation $\hat{\pi}$ whose cycle graph contains only short cycles, while the transposition distance lower bounds of $\pi$ and $\hat{\pi}$ remain equal. The transpositions that sort $\hat{\pi}$ can be mimicked to sort $\pi$.
In a later study, the time complexity of their algorithm was improved to $O(n \log n)$ by Cunha and colleagues Cunha2015. Other improvements, including heuristics, were proposed by Dias and Dias Dias2010Dias2010; DiasDias2013. Solutions for TDP using different approaches to the cycle graph were proposed by Hausen and colleagues hausen2008toric; Hausen2010, Lopes and colleagues lopes2011analysis and Galvão and Dias galvao2012approximation. In addition, several recent studies address variations of the transposition event, as well as rearrangement events that combine transposition with other events, e.g. reversals dias2015sorting; lintzmayer2014sorting; lintzmayer2017sorting; rusu2017log. In 2012, Bulteau, Fertin and Rusu bulteau2012sorting demonstrated that TDP is NP-hard. Meidanis and Dias MeidanisDias2000 and Mira and Meidanis mira2005algebraic were the first authors to propose the use of an algebraic formalism to solve TDP, as an alternative to the classical formalism based on the cycle graph and to other ad hoc methods. The goal was to provide a formal approach for solving rearrangement problems using known results from the theory of permutation groups.
The first results we present are examples of permutations which, depending on how they are simplified, can make the algorithm of Elias and Hartman EliasHartman2006 produce one extra transposition above the approximation ratio. Then, using an algebraic formalization, we propose an algorithm to solve the TDP that does not use simplification to ensure the 1.375-approximation for all permutations in the Symmetric Group $S_n$. To avoid the insertion of extra symbols into the original permutation, we “desimplified” the catalog of permutations generated by Elias and Hartman EliasHartman2006 to prove their result on the diameter of 3-permutations. These are the permutations whose cycle graphs contain only cycles with three pairs of edges. We also propose a new upper bound for the transposition distance, not only for the subset of simple permutations, but for all of $S_n$. Finally, we present audit results on short permutations of maximum length , performed on the GRAAu platform, for implementations of our algorithm and that of Elias and Hartman EliasHartman2006. These results show that our algorithm performs better than that of Elias and Hartman EliasHartman2006 with respect to both the maximum approximation ratio and the percentage of correct answers, i.e., the percentage of computed distances that coincide with the exact ones.
2 Background on permutations and transpositions
Let be a permutation on the set . A transposition , with , and , “cuts” the symbols from the interval and then “pastes” them between and . Thus,
Given two permutations and , the Transposition Distance Problem (TDP) corresponds to finding the minimum such that . We call , denoted , the transposition distance between and . Consider is the identity permutation. We can see that the problem of Sorting by Transpositions (SBT) can be reduced to TDP.
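The cut-and-paste operation and the distance it induces can be sketched in Python. This is only an illustration under assumed conventions: permutations are tuples, positions are 0-based with `i < j < k`, and the names `apply_transposition` and `transposition_distance` are hypothetical. The exhaustive search is feasible only for very short permutations and is not one of the algorithms discussed in this paper.

```python
from collections import deque


def apply_transposition(perm, i, j, k):
    """Swap the adjacent blocks perm[i:j] and perm[j:k] (0-based, i < j < k)."""
    assert 0 <= i < j < k <= len(perm)
    return perm[:i] + perm[j:k] + perm[i:j] + perm[k:]


def transposition_distance(source, target):
    """Exact distance by breadth-first search over all reachable permutations."""
    source, target = tuple(source), tuple(target)
    n = len(source)
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        perm, dist = queue.popleft()
        if perm == target:
            return dist
        # Try every choice of two adjacent blocks.
        for i in range(n):
            for j in range(i + 1, n):
                for k in range(j + 1, n + 1):
                    nxt = apply_transposition(perm, i, j, k)
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append((nxt, dist + 1))
    raise ValueError("target is not a rearrangement of source")
```

Sorting a permutation is the special case in which `target` is the identity, which is how SBT reduces to TDP.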
2.1 Cycle graph
The cycle graph of , denoted by , is a directed edge-colored graph consisting of a set of vertices and a set of colored (black or gray) edges. For all , the black edges connect to . One extra black edge is inserted, connecting to . For , the gray edges connect vertex to vertex and one extra gray edge connects to . Intuitively, the black edges indicate the current state of the genes, related to their arrangement in the first chromosome modeled by , while the gray edges indicate the desired order of the genes in the second permutation, modeled by .
Figure 1 shows of with black edges, , , , , , and gray edges, , , , , , .
Both the in-degree and the out-degree of each vertex in are 1, corresponding to one black edge entering a vertex and one gray edge leaving . This induces in a unique decomposition into cycles. A $k$-cycle is a cycle in with $k$ black edges. In addition, is said to be a long cycle if $k > 3$; otherwise, is a short cycle. If $k$ is odd (even), then we also say that is an odd (even) cycle.
Figure 1 has one odd long -cycle. We can walk the cycle starting at the black edge (any edge could be used as the starting point), then taking the gray edge , and so on until we reach the gray edge .
The maximum number of cycles in is obtained if and only if is the identity permutation . In this case, each cycle is composed of exactly one black edge and one gray edge. Let us denote the number of odd cycles in , and the variation on the number of odd cycles in , after having applied a transposition . Bafna and Pevzner BafnaPevzner1998 demonstrated the following result.
Lemma 3 (Bafna and Pevzner BafnaPevzner1998).
From this result, they derived a lower bound for SBT.
Theorem 4 (Bafna and Pevzner BafnaPevzner1998).
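A small Python sketch can make the cycle decomposition and the resulting bound concrete. It assumes the usual Bafna and Pevzner convention, which may differ in vertex labelling from the construction above: the permutation is extended with 0 and n+1, black edges go from each element to its left neighbour, gray edges go from v to v+1, and the bound is taken in its common form (n + 1 - c_odd) / 2, where c_odd is the number of odd cycles. The function names are hypothetical.

```python
def cycle_graph_lengths(perm):
    """Number of black edges in each cycle of the cycle graph of perm."""
    ext = [0] + list(perm) + [len(perm) + 1]
    pos = {value: index for index, value in enumerate(ext)}
    seen = set()
    lengths = []
    # Black edge i goes from ext[i] to ext[i - 1], for i = 1 .. n + 1.
    for start in range(1, len(ext)):
        if start in seen:
            continue
        length, i = 0, start
        while i not in seen:
            seen.add(i)
            length += 1
            # Follow the gray edge from ext[i - 1] to ext[i - 1] + 1,
            # then continue with the black edge leaving that vertex.
            i = pos[ext[i - 1] + 1]
        lengths.append(length)
    return lengths


def lower_bound(perm):
    """Lower bound (n + 1 - c_odd) / 2 on the transposition distance."""
    lengths = cycle_graph_lengths(perm)
    c_odd = sum(1 for k in lengths if k % 2 == 1)
    return (len(perm) + 1 - c_odd) // 2
```

For the identity every cycle has exactly one black edge, so the bound is 0; for the reversed permutation of three symbols the graph has two even cycles and the bound is 2.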
2.2 Simple permutations
Simplification is a technique introduced to facilitate handling long cycles of . It consists of inserting new symbols (usually fractional numbers) into to obtain a new permutation , so that contains only short cycles. The transformation of into is said to be safe if, with each new inserted symbol, the lower bound of Theorem 4 is maintained, i.e. , where and denote the number of black edges in and , respectively. If is a permutation obtained from through safe transformations, then we say and are equivalent. Lin and Xue lin2001signed have shown that every permutation can be transformed into a simple one through safe transformations. It is important to mention that a permutation can be simplified in many different ways. For a complete description of simplification and related results, refer to hartman2006simpler; lin2001signed.
2.3 Permutation groups
All results presented in this subsection are classical in the literature on permutation groups. The proofs are omitted, since they can be found in basic abstract algebra textbooks dummit2004abstract; Gallian2009.
The Symmetric Group on a finite set of symbols is the group formed by all permutations on distinct elements of , defined as bijections from to itself, under the operation of composition. The product of two permutations is defined as their composition as functions. Thus, if and are permutations in , then , or simply , is the function that maps any element of to .
An element is called a fixed element of , if . If there exists a subset of distinct elements of , such that
and fixes all other elements, then we call a cycle. In cycle notation, this cycle is written as , but any of , …, denotes the same cycle . The number k is the length of , also denoted as . In this case, is also called a -cycle. Later we will show a relation of these cycles with cycles of cycle graph.
The support of , denoted , is the subset of moved (not fixed) elements of . Two permutations and are said to be disjoint if , i.e., if every symbol moved by one is fixed by the other. In addition, if and are disjoint, then they commute as elements of .
Every permutation in can be written as a product of disjoint cycles. This representation, called disjoint cycle decomposition, is unique, regardless of the order of the cycles in the product.
The identity permutation is the permutation that fixes all elements of . Fixed elements are usually not written in cycle notation. However, if it is necessary to represent them, we use 1-cycles.
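The disjoint cycle decomposition can be computed by repeatedly following images until the starting symbol is reached again. The sketch below assumes a permutation stored as a Python dictionary from each symbol to its image; the function name and the `include_fixed` flag are hypothetical.

```python
def cycle_decomposition(mapping, include_fixed=False):
    """Disjoint cycles of a permutation given as {symbol: image}."""
    seen = set()
    cycles = []
    for start in sorted(mapping):
        if start in seen:
            continue
        cycle = [start]
        seen.add(start)
        x = mapping[start]
        while x != start:
            cycle.append(x)
            seen.add(x)
            x = mapping[x]
        # Fixed elements (1-cycles) are omitted unless requested.
        if len(cycle) > 1 or include_fixed:
            cycles.append(tuple(cycle))
    return cycles
```

Each cycle is reported starting at its smallest symbol, which picks one representative among the equivalent rotations of the same cycle.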
A 2-cycle is commonly called a transposition in the algebra literature. To avoid confusion, in this text “transposition” always refers to swapping two adjacent blocks of symbols in a permutation (biological transposition).
Every permutation in can be written as a (not unique) product of 2-cycles.
A permutation is called even (odd) if it can be written as a product of an even (odd) number of 2-cycles. Next, we present some important results related to the parity of permutations.
If a permutation is written as a product of an even (odd) number of 2-cycles, then every product of 2-cycles that equals must have an even (odd) number of 2-cycles.
If , are permutations with the same parity, then the product is even.
We note that, in the algebra literature, an odd(even)-length cycle is even (odd). This should not be confused with the definition of cycle parity in the cycle graph, where an even(odd)-length cycle is even (odd).
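The parity statements above can be checked numerically. The sketch below relies on the standard fact that a cycle of length k is a product of k - 1 2-cycles, so a permutation on m symbols with c cycles (counting fixed elements) is a product of m - c 2-cycles. Permutations are dictionaries from symbol to image, the product is taken as composition of functions with the right factor applied first, and all names are hypothetical.

```python
def count_cycles(mapping):
    """Number of cycles, including fixed elements, of {symbol: image}."""
    seen = set()
    count = 0
    for start in mapping:
        if start in seen:
            continue
        count += 1
        x = start
        while x not in seen:
            seen.add(x)
            x = mapping[x]
    return count


def parity(mapping):
    """Return 'even' or 'odd' from the number of 2-cycles in a factorization."""
    two_cycles = len(mapping) - count_cycles(mapping)
    return "even" if two_cycles % 2 == 0 else "odd"


def compose(alpha, beta):
    """Product alpha * beta as functions: x -> alpha(beta(x))."""
    return {x: alpha[beta[x]] for x in beta}
```

For instance, a 3-cycle has odd length but is an even permutation, and the product of two odd permutations is even.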
Two permutations , are conjugate if there is a permutation such that . In this case, we may also say that is the conjugate of by . Conjugation is an equivalence relation that partitions into classes.
The permutations of an equivalence class induced by the conjugacy relation all have the same cycle type, i.e., the same number of cycles of each length.
2.4 Algebraic formalization for TDP
The next definitions are based on Meidanis and Dias MeidanisDias2000 and Mira and Meidanis mira2005algebraic.
We can associate each permutation of with a -cycle of . Thus, the permutation can be represented as the -cycle .
We say that a 3-cycle is applicable to if the symbols , and appear in in the same cyclic order as they do in . Therefore, there should be integers and such that and , with , where means applied times over or, simply, is the element positions ahead of in the cycle notation of . The product is a -cycle such that the symbols between and (i.e., the symbols between and , including , but not ) in are cut and then pasted between and , thus simulating a biological transposition on .
Let . The -cycle is applicable to and thus simulates a transposition. The application results in .
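The way an applicable 3-cycle simulates a transposition can be illustrated with a short Python sketch. The conventions are assumptions made for the example: the permutation is represented by the cycle that starts at 0 and lists its symbols in order, cycles are tuples, and the product of the 3-cycle with that cycle maps x to the 3-cycle applied to the image of x. Under these conventions a 3-cycle is applicable exactly when the product is again a single cycle on all symbols. All function names are hypothetical.

```python
def cycle_to_map(cycle, symbols):
    """Mapping {symbol: image} of a cycle; symbols outside it are fixed."""
    mapping = {s: s for s in symbols}
    for x, y in zip(cycle, cycle[1:] + cycle[:1]):
        mapping[x] = y
    return mapping


def compose(alpha, beta):
    """Product alpha * beta as functions: x -> alpha(beta(x))."""
    return {x: alpha[beta[x]] for x in beta}


def as_single_cycle(mapping, start):
    """Return the cycle through start if it covers every symbol, else None."""
    cycle = [start]
    x = mapping[start]
    while x != start:
        cycle.append(x)
        x = mapping[x]
    return tuple(cycle) if len(cycle) == len(mapping) else None


def apply_3cycle(three_cycle, pi_cycle):
    """Apply a 3-cycle to pi_cycle; None means it is not applicable."""
    symbols = set(pi_cycle)
    product = compose(cycle_to_map(three_cycle, symbols),
                      cycle_to_map(pi_cycle, symbols))
    return as_single_cycle(product, pi_cycle[0])
```

With `pi_cycle = (0, 1, 2, 3, 4, 5)`, the 3-cycle `(1, 3, 5)` yields `(0, 3, 4, 1, 2, 5)`, i.e. the blocks `1 2` and `3 4` are swapped, whereas `(1, 5, 3)` lists its symbols in the opposite cyclic order and breaks the cycle apart.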
Given the -cycles and , the Transposition Distance Problem consists of finding the minimum number , denoted , of transpositions modeled as applicable -cycles needed to transform into , i.e.,
From the equality above, we have that the product is equal to , since
Note that if , then .
The 3-norm of an even permutation , denoted by , corresponds to the minimal number of factors in a product of 3-cycles equal to . Denote by , the number of cycles and the number of odd-length cycles (thus, even cycles), including 1-cycles, in the disjoint cycle decomposition of , respectively. Mira and Meidanis mira2005algebraic demonstrated the following result.
Lemma 11 (Mira and Meidanis mira2005algebraic).
Lemma 12 (Mira and Meidanis mira2005algebraic).
Given and -cycles, then
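The algebraic lower bound can be evaluated directly from the two cycles. The sketch below assumes that the 3-norm of an even permutation on m symbols equals (m - c_odd) / 2, where c_odd counts its odd-length cycles including fixed elements, and that the bound is the 3-norm of the product of the target cycle with the inverse of the source cycle. These formulas and the function names are stated here as assumptions for illustration, since the displayed equations are not reproduced in this section.

```python
def cycle_to_map(cycle):
    """Mapping {symbol: image} of a cycle given as a tuple."""
    return {x: y for x, y in zip(cycle, cycle[1:] + cycle[:1])}


def odd_length_cycles(mapping):
    """Number of odd-length cycles, counting fixed elements as 1-cycles."""
    seen = set()
    count = 0
    for start in mapping:
        if start in seen:
            continue
        length, x = 0, start
        while x not in seen:
            seen.add(x)
            length += 1
            x = mapping[x]
        count += length % 2
    return count


def three_norm_bound(pi_cycle, sigma_cycle):
    """Assumed bound: 3-norm of sigma * pi^{-1}, i.e. (m - c_odd) / 2."""
    pi = cycle_to_map(pi_cycle)
    sigma = cycle_to_map(sigma_cycle)
    pi_inverse = {image: symbol for symbol, image in pi.items()}
    product = {x: sigma[pi_inverse[x]] for x in pi_inverse}
    return (len(product) - odd_length_cycles(product)) // 2
```

For the source cycle `(0, 3, 2, 1)` and the identity cycle `(0, 1, 2, 3)`, the product consists of two 2-cycles, so the bound is 2.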
3 Properties of the permutation
It is interesting to note that, taking , the cycle graph of and the permutation are equivalent structures where . To a reader used to the graphical notation of Bafna and Pevzner BafnaPevzner1998, it is enough to follow the edges of the cycles, taking note of the labels of the vertices where the gray edges enter, changing the label to . This will produce exactly the same cycles of .
If we follow the edges of the cycle graph of Figure 1, applying the previously explained process, we get .
In the circular representation of Elias and Hartman EliasHartman2006, it is enough to take note of the labels, since it does not use the symbol. Due to this equivalence and to facilitate reading, we have kept, as close as possible, in our formalism the same definitions used in the cycle graph.
3.1 Cycles of and the effect of -cycle application
Let be a cycle in the disjoint cycle representation of . If and , i.e., if the symbols , and appear in in a cyclic order that is distinct from the one in , then we say is an oriented triple and is an oriented cycle. Otherwise, if there are no oriented triples in , then is an unoriented cycle. A cycle is a segment of if . Analogously, we define a segment of a cycle of as oriented or unoriented.
Note that, from Equation 4, i.e. the effect in of the application of a -cycle , is equivalent to multiplying by . Let us denote by the difference .
Depending on the symbols of , its application can affect the cycles of in four distinct ways, described as follows:
1. , and are symbols belonging to the support of only one cycle of . We have two subcases:
(a) If , and appear in the same cyclic order in and , then .
(b) Otherwise, . Then, .
2. , and belong to the support of two different cycles and of . W.l.o.g., suppose and . Then we have that .
3. , and belong to the support of three different cycles , and of . W.l.o.g., suppose , and . Then .
From this observation, we have the following result.
If is an applicable -cycle, .
We denote by -move an applicable -cycle if . We also denote by -sequence, a sequence of applicable -cycles , , such that there are -cycles, which are -moves, while the other -cycles are -moves. If then we call this -sequence an -sequence.
As Lemma 12 defines a lower bound for TDP based on the number of odd-length cycles of , we are only interested in finding sequences of applicable 3-cycles whose application increases , thus causing to decrease.
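The effect of a move on the number of odd-length cycles can be measured by recomputing the product after the 3-cycle is applied. The sketch below assumes the same conventions as the algebraic formalization (cycles as tuples, products as composition of functions, the 3-cycle multiplied on the left of the source cycle); the function names are hypothetical, and the applicability of the 3-cycle is not checked here.

```python
def cycle_to_map(cycle, symbols):
    """Mapping {symbol: image} of a cycle; symbols outside it are fixed."""
    mapping = {s: s for s in symbols}
    for x, y in zip(cycle, cycle[1:] + cycle[:1]):
        mapping[x] = y
    return mapping


def odd_length_cycles(mapping):
    """Number of odd-length cycles, counting fixed elements as 1-cycles."""
    seen = set()
    count = 0
    for start in mapping:
        if start in seen:
            continue
        length, x = 0, start
        while x not in seen:
            seen.add(x)
            length += 1
            x = mapping[x]
        count += length % 2
    return count


def sigma_pi_inverse(pi, sigma):
    """Product sigma * pi^{-1} for mappings on the same symbols."""
    pi_inverse = {image: symbol for symbol, image in pi.items()}
    return {x: sigma[pi_inverse[x]] for x in pi_inverse}


def odd_cycle_delta(three_cycle, pi_cycle, sigma_cycle):
    """Change in the number of odd-length cycles caused by the 3-cycle."""
    symbols = set(pi_cycle)
    pi = cycle_to_map(pi_cycle, symbols)
    sigma = cycle_to_map(sigma_cycle, symbols)
    tau = cycle_to_map(three_cycle, symbols)
    new_pi = {x: tau[pi[x]] for x in pi}  # the 3-cycle applied to pi
    before = odd_length_cycles(sigma_pi_inverse(pi, sigma))
    after = odd_length_cycles(sigma_pi_inverse(new_pi, sigma))
    return after - before
```

For the source cycle `(0, 3, 2, 1)` and the identity cycle `(0, 1, 2, 3)`, applying `(0, 3, 2)` turns two 2-cycles into a 3-cycle and a fixed element, so the number of odd-length cycles grows by 2.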
3.2 as a product of two -cycles
Suppose and are two cycles of . If , i.e., if the symbols of the pairs and occur in alternate order in , we say that and intersect, and that and are intersecting cycles. Similarly, if and are such that , i.e., if the symbols of the triplets and occur in alternate order in , then we call and interleaving cycles. Analogously, we define two segments of two cycles as intersecting or interleaving.
Let and . The cycles and are examples of intersecting cycles whereas and are interleaving cycles.
Let be a cycle in a product of segments of different cycles of , in which each cycle of has at most one segment in . We call the pair an open gate in , if there is no cycle in such that and intersect; and there is no such that is an oriented triplet. Lemma 16 demonstrates that there is no open gate in , since it is a product of two cycles of the same length.
Let be two -cycles, . If is a cycle in the disjoint cycle decomposition of , then:
there is a symbol , such that and , or;
there is a cycle in , such that .
Assume , where is a product of disjoint cycles, disjoint from .
First, we show that there are symbols between and in , i.e., . By way of contradiction, assume . Then, the conjugate . Therefore , which is impossible, since, by Theorem 9, the conjugate is a -cycle.
Now, assuming , and , we show that for some , . That is, not all the symbols between and in are fixed in . For this, by way of contradiction, suppose , for every . Also, suppose . Then,
In this case, , implying that the conjugate is not a -cycle. Thus, if , for some , then the lemma holds. Otherwise, there must be a cycle in , such that and . Suppose, by contradiction that and . Then, by the equality above, . However, in this case, is not, again, a -cycle, given that and are disjoint. ∎
4 The Elias and Hartman algorithm may require one extra transposition above the 1.375 approximation ratio
An important step in the algorithm of Elias and Hartman EliasHartman2006 is the simplification of the input permutation, by the insertion of new symbols in . The positions of the new symbols are supposed to be irrelevant, provided that they are inserted through safe transformations. However, there are simplifications that, although producing equivalent permutations through safe transformations, cause the algorithm of Elias and Hartman EliasHartman2006 to require one extra transposition above the 1.375 approximation ratio. Two examples are explored next.
Consider the permutation shown in Figure 1. The lower bound given by Theorem 4 is , which is also its exact distance, corresponding to the application of four -moves, shown in Figure 3. One simplification of generates , whose cycle graph is shown in Figure 2. Note that the lower bound of is as well. However, there is no -sequence to be applied to . In fact, to optimally sort , two -sequences are required. Therefore, the algorithm of Elias and Hartman EliasHartman2006 using as input, even when applying an optimal sorting of , results in transpositions. However, the algorithm should require at most transpositions in order not to exceed the 1.375-approximation ratio.
The following example shows that, even when there are -sequences of transpositions to apply in , the algorithm of Elias and Hartman EliasHartman2006 may require one transposition above the 1.375 approximation ratio. Take the permutation (Figure 6), with both the lower bound and the distance equal to . A simplified version of is (Figure 5). The algorithm of Elias and Hartman EliasHartman2006 sorts optimally by applying a -sequence, followed by a -sequence, for a total of transpositions. However, the algorithm should not require more than transpositions in order not to exceed the 1.375-approximation ratio.
In both examples, an initial -sequence is “missed” during the simplification process that transforms into . This sequence is essential to guarantee the 1.375-approximation ratio when, after the application of a number of -sequences, bad small components remain in (Theorem 22, EliasHartman2006). These are cycle graphs that do not allow the application of -sequences.
5 A 1.375-approximate algorithm for all the permutations of
Similar to the work of Elias and Hartman EliasHartman2006, some results presented in this section are based on the analysis of a huge number of cases. For this reason, several computer programs, for enumerating the cases and searching for solutions, were implemented to assist in their demonstration. The source code is available on the GitHub platform (https://github.com/luizaugustogarcia/tdp1375/tree/master/src/main/java/br/unb/cic/tdp/proof).
When considering only the subset of simple permutations of , the algorithm of Elias and Hartman guarantees the approximation ratio of 1.375. In this section, we present an algorithm that guarantees this ratio for all the permutations in . First, we present results that will be used to show that the proposed algorithm is correct.
If there is an odd (even-length) cycle in , then a -move exists.
Since is an even permutation (Theorem 8), then there is an even number of odd cycles in . Let and be two odd cycles of . We have two cases:
and intersect. In this case, . Then is a -move.
and do not intersect. W.l.o.g, suppose . In this case, is a -move. Remark that is not a distinct case, being just a cyclic rotation of with the variables and switched.
If , then there is a -sequence.
If there is an odd (even-length) cycle in , then by Proposition 17 a -move (i.e. a -sequence) exists. Thus, consider with only even (odd-length) cycles.
There is an oriented -cycle in . If , then there is a -move and the lemma holds. Now, suppose . W.l.o.g., let and be an oriented triple. In this case, is the only permutation, relative to the positions of the symbols, not allowing a -move. Then, there is a -sequence with , and .
All the cycles of are unoriented. Let be a segment of a cycle of . We have two cases:
interleaves with another segment . In this case, . Then, , and is a -sequence.
intersects with two segments and . For each of the distinct forms of (enumerated in Appendix A), relative to the possible positions of the symbols of , and , there is a -sequence.
If there is an even (odd-length) -cycle in such that and is an oriented triple, then there is a -sequence.
If is a -move, then the lemma holds, since a -move is a -sequence, which in turn is a -sequence. Thus, w.l.o.g., suppose
We used vertical bars to indicate the points in that would be affected by the application of , and subscripts to indicate the length parity of the resulting cycles. Note that the cycle can be rewritten as the product
For each of the distinct possible forms of (Appendix B), related to the symbols of , not allowing a -move, there is a -sequence of transpositions. ∎
A configuration is a product of even (odd-length) segments of cycles with at most two open gates, so that each cycle of has at most one segment in . An unoriented configuration is a configuration consisting only of unoriented segments. If then it is referred to as a small configuration.
Let and . The products and are both small configurations of .
From a configuration , we can obtain a larger configuration such that , extending in one of three different ways, as follows:
Suppose has one or two open gates. We add a -cycle segment of a cycle to in order to close an open gate.
Suppose has no open gates. We add a -cycle segment of a cycle to , so that this new segment intersects with another one in .
Let be a configuration of two intersecting segments. If , then we call it an interleaving pair. On the other hand, if , we call it an intersecting pair.
Let be a -cycle segment of . If it is possible to extend eight times until eventually reaching a configuration such that , then we call a sufficient configuration.
Notice that our definitions of configurations are similar to those devised by Elias and Hartman EliasHartman2006. However, although they have proposed the concept of extension of configuration, there is no definition in their method analogous to our extension 3, since they have only worked with simple permutations.
Our goal is to show that any configuration of such that can be rewritten as a -sequence. To this end, we employ the results of Elias and Hartman EliasHartman2006, publicly available (http://www.sbc.su.se/~isaac/SBT1375_proof/index.html) as a catalog of configurations and their respective -sequences of transpositions.
5.2 Desimplification: undoing simplification
In this section, we propose the concept of desimplification – the opposite of simplification. In algebraic terms, given two cycles and from , such that , a desimplification step consists of the following operations:
Replacing and by the new cycle .
Removing from and the symbol .
Replacing in and the symbol by .
We call a join pair, and denote by the cycle resulting from the desimplification step of with the join pair .
Let and . Let and . Note that