Order distances and split systems

10/22/2019 ∙ by Vincent Moulton, et al.

Given a distance D on a finite set X with n elements, it is interesting to understand how the ranking R_x = z_1, z_2, ..., z_n, obtained by ordering the elements in X according to increasing distance D(x, z_i) from x, varies with different choices of x ∈ X. The order distance O_{p,q}(D) is a distance on X associated to D that quantifies these variations, where q ≥ p/2 > 0 are parameters that control how ties in the rankings are handled. The order distance O_{p,q}(D) has been intensively studied in the case where D is a treelike distance (that is, D arises as the shortest path distances in an edge-weighted tree with leaves labeled by X), but relatively little is known about properties of O_{p,q}(D) for general D. In this paper we study the order distance for various types of distances that naturally generalize treelike distances in that they can be generated by split systems, i.e. they are examples of so-called ℓ_1-distances. In particular, we show how and to what extent properties of the split systems associated to the distances D that we study can be used to infer properties of O_{p,q}(D).

1 Introduction

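A distance on a finite, non-empty set $X$ is a symmetric map $D : X \times X \to \mathbb{R}_{\geq 0}$ with $D(x,x) = 0$ and $D(x,z) \leq D(x,y) + D(y,z)$ for all $x, y, z \in X$. Following [5] we associate a new distance $O_{p,q}(D)$ on $X$ to the distance $D$ for $p, q \in \mathbb{R}$ with $q \geq p/2 > 0$ as follows. For each $x, y \in X$ with $x \neq y$ we define the sets

$$A_{x,y} = \{z \in X : D(x,z) < D(y,z)\}, \quad B_{x,y} = \{z \in X : D(x,z) > D(y,z)\}$$

and

$$E_{x,y} = \{z \in X : D(x,z) = D(y,z)\}.$$

Note that the sets $A_{x,y}$, $B_{x,y}$ and $E_{x,y}$ are pairwise disjoint and that their union is $X$. Now, for any bipartition or split $A|B$ of $X$, let $D_{A|B}$ be the distance on $X$ given by taking $D_{A|B}(u,v) = 1$ if $|\{u,v\} \cap A| = 1$ and $D_{A|B}(u,v) = 0$ otherwise for all $u, v \in X$. The order distance associated to $D$ is then defined as

$$O_{p,q}(D) = \sum_{\substack{\{x,y\} \in \binom{X}{2} \\ E_{x,y} = \emptyset}} p \cdot D_{A_{x,y}|B_{x,y}} \; + \sum_{\substack{\{x,y\} \in \binom{X}{2} \\ E_{x,y} \neq \emptyset}} \Big( \tfrac{p}{2} \big( D_{A_{x,y}|B_{x,y} \cup E_{x,y}} + D_{B_{x,y}|A_{x,y} \cup E_{x,y}} \big) + \big( q - \tfrac{p}{2} \big) \, D_{E_{x,y}|A_{x,y} \cup B_{x,y}} \Big) \quad (1)$$

where $\binom{X}{2}$ denotes the set of all 2-element subsets of $X$.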

The order distance can be regarded as the amount by which the two rankings and of the elements in generated by ordering these elements according to increasing distances from and from , respectively, differ [5]. Note that, as pointed out in [5], to ensure that satisfies the triangle inequality we must require , and that, for any with , .
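As a minimal sketch of this comparison of rankings, the following Python function computes an order distance directly from a distance matrix. It assumes one pairwise reading of Equation (1): for each pair {x, y}, two elements u and v contribute p if they rank x and y in strictly opposite order, q if exactly one of them is equidistant from x and y, and nothing otherwise. The name `order_distance` and the 0-based indexing are illustrative choices and do not come from [5].

```python
from itertools import combinations


def order_distance(D, p=2.0, q=1.0):
    """Order distance of a distance matrix D (list of lists).

    Assumption: for each pair {x, y}, elements u and v contribute
    p if they rank x and y in strictly opposite order,
    q if exactly one of them sees a tie between x and y,
    and 0 otherwise.
    """
    n = len(D)

    def side(u, x, y):
        # -1 if u is closer to x, +1 if u is closer to y, 0 on a tie
        return (D[u][x] > D[u][y]) - (D[u][x] < D[u][y])

    O = [[0.0] * n for _ in range(n)]
    for x, y in combinations(range(n), 2):
        sides = [side(u, x, y) for u in range(n)]
        for u, v in combinations(range(n), 2):
            if sides[u] == sides[v]:
                continue
            # both non-zero and different means a strict reversal
            w = p if sides[u] != 0 and sides[v] != 0 else q
            O[u][v] += w
            O[v][u] += w
    return O


# Three points on a line at positions 0, 1 and 3 (no ties occur).
line = [[abs(a - b) for b in (0, 1, 3)] for a in (0, 1, 3)]
O_line = order_distance(line, p=2, q=1)

# Positions 0, 1 and 2: the middle point is equidistant from the outer two.
tied = [[abs(a - b) for b in (0, 1, 2)] for a in (0, 1, 2)]
O_tied = order_distance(tied, p=2, q=5)
```

In the tie-free example the result does not depend on q, since q only enters through pairs on which some element sees a tie.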

Figure 1: A tree with non-negative edge weights whose leaves are labeled by the elements in the set and the distance of shortest path distances between the leaves of . The associated order distance can also be represented by the tree by adjusting the weights of its edges. Note that each edge of corresponds to a split of .

Most previous work concerning order distances has focused on their properties for treelike distances, that is, distances that arise by taking lengths of shortest paths between pairs of leaves of a tree (see e.g. [5, 14, 15, 16]). In Figure 1 we present an example of the order distance  with and associated to a treelike distance . Note that in this example the order distance is also treelike since, as we can see in the figure, it can be represented by adjusting the weights of the edges in . In fact, this is no coincidence: The main result of Bonnot et al. in [5] establishes that if is a treelike distance which can be represented by giving non-negative weights to the edges in some tree, then for , after possibly adjusting the edge weights, the order distance can also be represented by the same tree.

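In this paper, we aim to better understand to what extent this result can be extended to more general distances. In particular we shall focus on order distances for so-called $\ell_1$-distances (cf. [11, Ch. 4]), that is, distances $D$ on $X$ for which there exists a set $\mathcal{S}$ of splits of $X$, or split system, together with a non-negative weighting $\omega : \mathcal{S} \to \mathbb{R}_{\geq 0}$ such that

$$D = \sum_{S \in \mathcal{S}} \omega(S) \, D_S,$$

where $D_S$ denotes the 0/1-valued distance associated to the split $S$ as defined above. This is also referred to as an $\ell_1$-decomposition of $D$. It is natural to consider $\ell_1$-distances in the context of order distances since it follows directly from Equation (1) that the order distance $O_{p,q}(D)$ for any distance $D$ is an $\ell_1$-distance.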
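The decomposition above can be sketched in a few lines of Python. This is an illustrative helper, not code from the paper: a split is represented by one of its two sides (a set of indices in {0, ..., n-1}), and the names `split_distance` and `l1_distance` are assumptions made for this example.

```python
def split_distance(A, u, v):
    """0/1 distance of the split A | X - A: 1 iff u and v are separated."""
    return 1 if (u in A) != (v in A) else 0


def l1_distance(n, weighted_splits):
    """Distance matrix on {0, ..., n-1} generated by a weighted split system.

    weighted_splits is a list of (A, weight) pairs, where A is one side
    of a split and weight is non-negative.
    """
    D = [[0.0] * n for _ in range(n)]
    for A, w in weighted_splits:
        if w < 0:
            raise ValueError("split weights must be non-negative")
        for u in range(n):
            for v in range(n):
                D[u][v] += w * split_distance(A, u, v)
    return D


# Two splits on X = {0, 1, 2}: {0}|{1,2} with weight 1 and {2}|{0,1} with weight 2.
D_example = l1_distance(3, [({0}, 1.0), ({2}, 2.0)])
```

The example reproduces the shortest path distances of a path with edge weights 1 and 2, which is the simplest instance of a treelike distance.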

Interestingly, Bonnot et al.’s result can be re-expressed in -terminology as follows. To any tree with leaves labeled by we can associate the split system consisting of the splits of that correspond to the edges of the tree (e.g. the edge in in Figure 1 with weight 6 corresponds to the split ). Bonnot et al.’s result then states that in case , for every non-negative weighting of the splits in , there exists a non-negative weighting of the splits in such that

(2)

Since the split system arises from a tree, it has a special combinatorial property known as compatibility (see Section 2). In this paper, we will explore under what conditions Equation (2) might hold for other split systems that are not necessarily compatible.

The rest of the paper is structured as follows. In Section 2 we present some preliminaries concerning the relationship between treelike distances and compatible split systems. Then, in Section 3, we prove a variant of Bonnot et al.’s result for arbitrary in the special case where the split system underlying a tree is maximal (Theorem 1). In Section 4 we focus on the split system associated to a distance on that forms the index set for the first sum in Equation (1). In particular, we give a tight upper bound on its size (Theorem 2), and also a characterization for when it is compatible (Theorem 3).

In Section 5 we introduce the concept of an orderly split system. These are essentially split systems for which Equation (2) holds in case . Compatible split systems are special examples of orderly split systems, and we show that the more general circular split systems [3] also enjoy this property in case they have maximum size (Theorem 5). In Sections 6 and 7 we then explore to what extent this latter result can be extended to the even more general class of so-called flat split systems [6, 26]. In particular, we show that within the class of maximum sized flat split systems, the orderly split systems are precisely those that are circular (Theorem 12). In Section 8 we briefly look into consequences of our results on efficiently computing order distances. We conclude in Section 9 with some possible directions for future work.

2 Preliminaries

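For the rest of this paper $X$ will denote a finite non-empty set with $n = |X|$, and $\mathcal{S}(X)$ the set consisting of all possible splits of $X$. We also use $A|B$ to denote a split of $X$ into two non-empty subsets $A$ and $B$. We call any non-empty subset $\mathcal{S} \subseteq \mathcal{S}(X)$ a split system on $X$. A pair $(\mathcal{S}, \omega)$ consisting of a split system $\mathcal{S}$ on $X$ and a weighting $\omega : \mathcal{S} \to \mathbb{R}_{\geq 0}$ is called a weighted split system on $X$ and we denote by $D_{(\mathcal{S},\omega)} = \sum_{S \in \mathcal{S}} \omega(S) \, D_S$ the distance generated by the weighted split system $(\mathcal{S}, \omega)$. We emphasize that throughout this paper the weights of the splits in a weighted split system will always be non-negative.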

A split system on is compatible if, for any two splits and in , at least one of the intersections is empty. The splits in a compatible split system on are in one-to-one correspondence with the edges of a (necessarily) unique -tree , that is, a graph theoretic tree with vertex set and edge set together with a map such that the full image contains all vertices of degree at most two. We denote the compatible split system represented by the edges of an -tree by . The edges of the -tree in Figure 1, for example, yield the following collection of splits of : , , , , , and (which can be visualized by removal of each edge from ). Assigning to each of these splits as its weight the weight of the corresponding edge in yields a weighted compatible split system that generates .
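A brute-force compatibility test is easy to write down. The sketch below uses the standard definition (two splits A1|B1 and A2|B2 are compatible when at least one of the four intersections A1∩A2, A1∩B2, B1∩A2, B1∩B2 is empty); the helper names are illustrative assumptions.

```python
from itertools import combinations


def make_split(A, X):
    """Represent the split A | X - A as a pair of frozensets."""
    A = frozenset(A)
    return (A, frozenset(X) - A)


def compatible(s, t):
    """True if some side of s is disjoint from some side of t."""
    return any(not (P & Q) for P in s for Q in t)


def is_compatible_system(splits):
    """True if the splits are pairwise compatible."""
    return all(compatible(s, t) for s, t in combinations(splits, 2))


X = range(5)
nested = [make_split({0, 1}, X), make_split({0, 1, 2}, X), make_split({4}, X)]
crossing = [make_split({0, 1}, X), make_split({1, 2}, X)]
```

Here `nested` could be realized by the edges of a tree, while the two splits in `crossing` meet in all four intersections and therefore cannot.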

Note that a compatible split system $\mathcal{S}$ on $X$ is maximal, that is, adding any further split to $\mathcal{S}$ yields a split system that is no longer compatible, precisely in case it contains $2n - 3$ splits, in which case it corresponds to a binary $X$-tree where the elements of $X$ are in one-to-one correspondence with the leaves of the tree. Hence, maximal compatible split systems are precisely the maximum-sized or maximum, for short, compatible split systems on $X$.

In proofs we will make use of these facts concerning compatible split systems and will sometimes switch between a weighted compatible split system and its equivalent unique representation as an edge-weighted $X$-tree. Full details concerning this correspondence can be found in [23].

3 Treelike distances revisited

In this section, we consider properties of the order distance $O_{p,q}(D)$ of a treelike distance $D$ for arbitrary values $q \geq p/2 > 0$. Note that most previous results for treelike distances, such as those mentioned in the introduction, focus mainly on the case $q = p/2$.

We first consider the case where is generated by a maximum compatible split system.

Theorem 1

For any maximum compatible split system on with strictly positive weighting , the order distance associated to can be expressed as for some non-negative weighting .

Proof: Let be a maximum compatible split system on with strictly positive weighting . First note that in [19] it is shown that if is compatible then the splits and are contained in for all with . Moreover, in view of the fact that maximum compatible split systems with strictly positive weighting are precisely those that can be represented by binary -trees with strictly positive edge weights whose leaves are in one-to-one correspondence with the elements in , we must have for all with and, for any with , the splits , and correspond to three edges that share a single vertex in the binary -tree that represents . In particular, the split must be contained in . Hence, from Equation (1) and in view of the assumption we have

for some suitable non-negative weighting of the splits in , as required.

Note that the assumption in Theorem 1 that the compatible split system is maximum is necessary. Indeed, in [5, p. 258], an example is presented which provides a weighted, non-maximum compatible split system on , , such that for any , the order distance associated to cannot be expressed as for any non-negative weighting of the splits in . There exists, however, a compatible superset of splits such that for some non-negative weighting in this example.

Even expressing the order distance by a suitable superset of splits that belong to the same class of split systems (here compatible) as the split system that generates $D$ requires, in general, that we restrict to $q = p/2$. To illustrate this in the following example, we make use of the fact (see e.g. [23]) that treelike distances $D$ on a set $X$ are characterized by the following 4-point condition: For all $u, v, x, y \in X$,

$$D(u,v) + D(x,y) \leq \max\{D(u,x) + D(v,y),\; D(u,y) + D(v,x)\} \quad (3)$$

must hold. Now, consider the distance generated by the weighted non-maximum compatible split system with

on the 5-element set . We obtain the order distance with

In view of Equation (3), a weighted compatible split system with and can exist only if

holds. But this implies and, thus .

Note that the distance in the previous example is not only treelike but even an ultrametric, that is, holds for all . In [14] it is shown that for any ultrametric generated by a weighted compatible split system with strictly positive weighting , the associated order distance can be expressed as for some strictly positive weighting of the splits in , that is, all splits in are used to generate . As pointed out in [16], however, this property does not characterize ultrametrics: there are examples of distances that have this property but are not ultrametrics. Moreover, it is shown in [14] that the order distance of an ultrametric is, in general, not an ultrametric.
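Both conditions used in this section can be checked by brute force on small examples. The sketch below assumes the standard formulations: the 4-point condition holds when, on any four distinct elements, the two largest of the three pairwise sums coincide, and a distance is an ultrametric when D(x,z) ≤ max{D(x,y), D(y,z)} for all x, y, z. The input is assumed to be a distance already (symmetric, zero diagonal, triangle inequality); the function names are illustrative.

```python
from itertools import combinations, permutations


def satisfies_four_point(D, tol=1e-9):
    """Check the 4-point condition on all 4-element subsets."""
    for a, b, c, d in combinations(range(len(D)), 4):
        sums = sorted([D[a][b] + D[c][d],
                       D[a][c] + D[b][d],
                       D[a][d] + D[b][c]])
        # treelike distances have the two largest sums equal
        if sums[2] - sums[1] > tol:
            return False
    return True


def is_ultrametric(D, tol=1e-9):
    """Check D(x,z) <= max(D(x,y), D(y,z)) for all distinct x, y, z."""
    return all(D[x][z] <= max(D[x][y], D[y][z]) + tol
               for x, y, z in permutations(range(len(D)), 3))


# Leaves of a path (treelike), a unit-weight 4-cycle (not treelike),
# and a small ultrametric.
path = [[abs(a - b) for b in (0, 1, 3, 6)] for a in (0, 1, 3, 6)]
cycle = [[0, 1, 2, 1], [1, 0, 1, 2], [2, 1, 0, 1], [1, 2, 1, 0]]
ultra = [[0, 2, 4], [2, 0, 4], [4, 4, 0]]
```

Note that `path` passes the 4-point test but is not an ultrametric, which matches the fact that ultrametrics form a proper subclass of treelike distances.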

4 The midpath split system of a distance

Given a distance $D$ on $X$ we define the midpath split system $\mathcal{M}(D)$ associated to $D$ to be the set of splits of $X$ of the form $\{z \in X : D(x,z) < D(y,z)\} \,|\, \{z \in X : D(x,z) > D(y,z)\}$ for $x, y \in X$ with $x \neq y$ and $D(x,z) \neq D(y,z)$ for all $z \in X$. Note that the splits in $\mathcal{M}(D)$ are precisely those appearing in the index set of the first sum in Equation (1). We chose the name for $\mathcal{M}(D)$ since it is closely related to the midpath phylogeny introduced in [20]. In this section, we consider general properties of the split system $\mathcal{M}(D)$, including a characterization in terms of $D$ for when this split system is compatible.

First note that, as a direct consequence of the definition of , it follows that contains at most splits. As , we immediately see that for small this upper bound is not tight for . Nevertheless we have the following result:

Theorem 2

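Let $D$ be a distance on a set $X$ with $n$ elements. Then the midpath split system $\mathcal{M}(D)$ of $D$ satisfies $|\mathcal{M}(D)| \leq \binom{n}{2}$ and, for all sufficiently large $n$, this bound is tight.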

Proof: It remains to show that the upper bound is tight for all sufficiently large . To this end, consider a distance on such that, for all , the value is selected randomly from the set

, with both values having the same probability of being selected. We now argue that, for sufficiently large

, the probability that is strictly greater than 0.

Note that satisfies the triangle inequality and that for all , implying that the splits and exist. Moreover, in order to have , for any two distinct , , the splits , , and must be pairwise distinct. Now, it follows immediately from the definition of that the probability of is . More generally, the probability that at least two of the splits , , and coincide is bounded by for some constants and . This implies that the probability of is at most

which is strictly less than 1 for sufficiently large , as required.

The remainder of this section is devoted to giving a characterization of those distances for which is compatible. In [19], a 6-point condition is given that characterizes for a distance when

  (i) the split system is compatible and

  (ii) there are no with , and .

Note that for generic distances , that is, and holds for all with , the aforementioned 6-point condition characterizes when is compatible, because in this case condition (ii) cannot be violated in view of the fact that for a generic distance holds for all .

Figure 2: A distance on such that is compatible but has no midpath phylogeny. The edges of the -tree represent the splits in . For the distance we obtain but and do not yield the same ranking of .

So, we start by providing in Figure 2 an example of a non-generic distance for which is compatible but condition (ii) is violated. More precisely we have

This example also illustrates the meaning of condition (ii) in terms of the -tree representing the splits in : If the edges representing the splits and do not coincide they are required to share a vertex. The edges representing the splits and in our example, however, do not share a vertex. This implies that the 6-point condition given in [19] does not provide the characterization for arbitrary distances that we are looking for. Another aspect illustrated in Figure 2 is that distances and on the same set with do not necessarily yield the same ranking of , not even if and yield the same set for all .

Our characterization for when is compatible will also be a 6-point condition. Actually, it is more convenient to state and prove a characterization for when is not compatible. The structure of the proof of Theorem 3 is similar to the proof of the 6-point condition in [19].

Theorem 3

Let be a distance on a set . The split system is not compatible if and only if there exist with , , such that one of the following holds:

  (1) and and either , , and or , , and

  (2) and and either , , and or , , and

Proof: First assume that satisfies (1) or (2). Then the split as well as one of the splits or are contained in . Now, if satisfies (1) and , then and are not compatible in view of , , , and, if satisfies (1) and , then and are not compatible in view of , , , .

Similarly, if satisfies (2) and , then and are not compatible in view of , , , and, if satisfies (2) and , then and are not compatible in view of , , , . Hence, is not compatible, as required.

Now assume that is not compatible. Let

be the set of those pairs of , , with . Order the pairs in arbitrarily and let denote the resulting sequence. Put and for all . Note that the split system is compatible.

Let be the smallest index such that is not compatible. Such an index must exist in view of and our assumption that is not compatible. The split system , however, is compatible and there exists an -tree such that . Let . The fact that is not compatible implies that there must exist a vertex on the path in from the vertex labeled by to the vertex labeled by such that the split , corresponding to some edge in which has one endpoint at , is not compatible with . Let be such that or . The two possible configurations, depending on whether or not edge lies on the path from to in , are depicted in Figure 3, where and . It can be checked that the configuration depicted in Figure 3(a) implies that (1) holds while the configuration depicted in Figure 3(b) implies that (2) holds, as required.

Figure 3: The two possible configurations in the -tree referred to in the proof of Theorem 3. The gray triangles indicate the two subtrees of connected to the endpoints of the edge , one of them being the vertex  that lies on the path from to . The configuration in (a) yields condition (1) and the configuration in (b) yields condition (2) stated in Theorem 3.

To illustrate that a characterization as in Theorem 3 with a $k$-point condition for $k < 6$ is not possible, one can employ the following distance on a 6-element set that was presented in [19]:

0 6 5 4 13 14
6 0 2 3 12 11
5 2 0 1 8 9
4 3 1 0 10 7
13 12 8 10 0 15
14 11 9 7 15 0

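It is shown in [19] that the restriction of this distance to any 5-element subset yields a distance whose midpath split system is compatible. Writing $x_1, \dots, x_6$ for the elements in the order of the rows above, the midpath split system of the full distance, however, contains the splits $\{x_2, x_3, x_5\} \,|\, \{x_1, x_4, x_6\}$ and $\{x_1, x_3, x_5\} \,|\, \{x_2, x_4, x_6\}$, which are not compatible.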
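The claim about this 6-point distance can be checked mechanically. The following sketch assumes that the midpath split system consists of the splits {z : D(x,z) < D(y,z)} | {z : D(x,z) > D(y,z)} over all pairs x ≠ y for which no element is equidistant from x and y; elements are indexed 0 to 5 in the order of the rows of the matrix, and all function names are illustrative.

```python
from itertools import combinations

# The 6-point distance from [19], rows and columns indexed 0..5.
D6 = [
    [0, 6, 5, 4, 13, 14],
    [6, 0, 2, 3, 12, 11],
    [5, 2, 0, 1, 8, 9],
    [4, 3, 1, 0, 10, 7],
    [13, 12, 8, 10, 0, 15],
    [14, 11, 9, 7, 15, 0],
]


def midpath_splits(D, elems):
    """Midpath splits of D restricted to elems (pairs with ties are skipped)."""
    elems = list(elems)
    splits = set()
    for x, y in combinations(elems, 2):
        A = frozenset(z for z in elems if D[x][z] < D[y][z])
        B = frozenset(z for z in elems if D[x][z] > D[y][z])
        if A and B and len(A) + len(B) == len(elems):
            splits.add(frozenset((A, B)))
    return splits


def compatible(s, t):
    """True if some side of s is disjoint from some side of t."""
    return any(not (P & Q) for P in s for Q in t)


def is_compatible_system(splits):
    return all(compatible(s, t) for s, t in combinations(splits, 2))


full = midpath_splits(D6, range(6))
full_ok = is_compatible_system(full)
five_point_ok = all(
    is_compatible_system(midpath_splits(D6, sub))
    for sub in combinations(range(6), 5)
)
```

Under these assumptions `full` contains four splits with a singleton side plus the two crossing splits {1,2,4}|{0,3,5} and {0,2,4}|{1,3,5}, so `full_ok` is false while `five_point_ok` is true.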

Before continuing we briefly mention some previous work that is related to the midpath phylogeny mentioned above and that is concerned with situations where the distance on is not known and only the rankings of the elements in generated by are available. The aim then is to find a weighted compatible split system that represents these rankings. Methods that follow this approach and which heavily rely on the midpath phylogeny are presented in [15, 18, 20, 21]. Moreover, in [19] it is shown (see also [21]) that any compatible split system that represents the rankings of the elements in generated by must contain the splits of the midpath phylogeny. However, as shown in [24], in general, if the rankings can be represented at all, further splits must be added. The decision problem of whether the rankings can be represented or not can be solved in polynomial time if restricted to representations by compatible split systems in which every split is assigned the same positive weight [18, 21]. However, if the splits in the compatible split system can have arbitrary positive weights, then the problem is NP-hard [25].

5 Orderly split systems

Motivated by the main result of Bonnot et al. in [5], we call a split system orderly if for all and all non-negative weightings of the splits in there exists a non-negative weighting of the splits in such that . The result of Bonnot et al. can then be restated as follows: Every compatible split system is orderly. In this section we show that another important class of split systems also enjoys this property.

We begin by recalling that a split system $\mathcal{S}$ on $X$ is circular if there exists an ordering $\theta = x_1, x_2, \dots, x_n$ of the elements in $X$ such that, for any split $S \in \mathcal{S}$, there exist $1 \leq i \leq j < n$ with

$$S = \{x_i, x_{i+1}, \dots, x_j\} \,|\, X \setminus \{x_i, x_{i+1}, \dots, x_j\},$$

in which case we will say that $\mathcal{S}$ fits on $\theta$. Circular split systems naturally appear in the context of the so-called split decomposition of a distance (see [3, Sec. 3], where they are introduced), and have applications in phylogenetics (see e.g. [7]). Note that every compatible split system is circular but not vice versa [3].
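Whether a split system fits on a given ordering can be tested directly. The sketch below uses an equivalent formulation of the definition: a split fits on a cyclic ordering exactly when, walking once around the cycle, membership in one side of the split changes exactly twice. The function name `fits_on` is an illustrative assumption.

```python
def fits_on(ordering, splits):
    """True if every split (A, B) has both sides contiguous on the cycle.

    ordering is a list of all elements; each split is a pair of sets.
    """
    n = len(ordering)
    pos = {x: i for i, x in enumerate(ordering)}
    for A, _B in splits:
        marks = [False] * n
        for x in A:
            marks[pos[x]] = True
        # count how often membership in A flips along the cycle
        changes = sum(marks[i] != marks[(i + 1) % n] for i in range(n))
        if changes != 2:
            return False
    return True


arcs = [({0, 1}, {2, 3, 4}), ({4, 0}, {1, 2, 3})]
crossing = [({0, 2}, {1, 3, 4})]
```

The split {0,2}|{1,3,4} does not fit on the ordering 0, 1, 2, 3, 4 but does fit on 0, 2, 1, 3, 4, which illustrates that circularity is a property relative to some ordering.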

From the definition it follows immediately that a maximal circular split system on a set $X$ with $n$ elements contains precisely $\binom{n}{2}$ splits. Hence, for fixed $n$, just like for compatible split systems, the maximal circular split systems are precisely the maximum circular split systems. Moreover, while in general the ordering of the elements in $X$ onto which a circular split system fits might not be unique, it follows again immediately from the definition that any maximum circular split system uniquely determines this ordering up to reversing and shifting it.

Now, a distance $D$ on $X$ is circular if there exists a circular split system $\mathcal{S}$ with a non-negative weighting $\omega$ such that $D = D_{(\mathcal{S},\omega)}$, the distance generated by $(\mathcal{S}, \omega)$. It is shown in [9] (see also [10]) that a distance $D$ on $X$ is circular if and only if there exists an ordering $\theta = x_1, x_2, \dots, x_n$ of the elements in $X$ such that

$$\max\{D(x_i,x_j) + D(x_k,x_l),\; D(x_i,x_l) + D(x_j,x_k)\} \leq D(x_i,x_k) + D(x_j,x_l) \quad (4)$$

holds for all $1 \leq i < j < k < l \leq n$ so that, in particular, circular distances are equivalent to so-called Kalmanson distances [17]. Note that an ordering $\theta$ of the elements in $X$ satisfies condition (4) for a circular distance $D$ if and only if the necessarily unique circular split system with strictly positive weighting that generates $D$ fits on $\theta$. We now make a useful observation concerning circular split systems.

Lemma 4

Let be a circular split system on with non-negative weighting  such that fits on the ordering of the elements in . Then, for , the split system is circular and fits on .

Proof: If then is circular and it fits on . So assume that and consider an arbitrary split . By the definition of there exist with and . Note that we must have and . Assume for a contradiction that, even after possibly shifting , the elements in do not form an interval of consecutive elements in . This implies that there exist with and with such that, after possibly shifting and/or reversing , the restriction of to is . Then, in view of the definition of , we must have

But this implies

contradicting condition (4).
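Condition (4), on which this argument rests, is straightforward to test for a given ordering. The sketch below assumes the usual Kalmanson form of the condition: for any four elements appearing in the order i, j, k, l, neither D(i,j) + D(k,l) nor D(i,l) + D(j,k) exceeds the crossing sum D(i,k) + D(j,l). The function name is an illustrative assumption.

```python
from itertools import combinations


def is_kalmanson_ordering(D, ordering, tol=1e-9):
    """Check the Kalmanson condition for D with respect to ordering."""
    # combinations() keeps the relative order of the input sequence,
    # so (i, j, k, l) always appear in this order in `ordering`.
    for i, j, k, l in combinations(ordering, 4):
        crossing_sum = D[i][k] + D[j][l]
        if max(D[i][j] + D[k][l], D[i][l] + D[j][k]) > crossing_sum + tol:
            return False
    return True


# Four points on a line at positions 0, 1, 3 and 6.
line = [[abs(a - b) for b in (0, 1, 3, 6)] for a in (0, 1, 3, 6)]
```

For points on a line the natural left-to-right ordering satisfies the condition, whereas swapping two interior points violates it.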

As a consequence of Lemma 4 we immediately obtain the main result of this section:

Theorem 5

Every maximum circular split system is orderly.

Proof: Let be a maximum circular split system together with a non-negative weighting and put . In view of , Equation (1) can be written as

where, for each , the weight is a certain non-negative integer multiple of . By Lemma 4, fits onto the unique ordering of the elements in onto which the maximum circular split system fits. Hence, we have , as required.

Corollary 6

For , the order distance of a circular distance is always circular.

Note that in Theorem 5 we assume that the circular split system is maximum. To illustrate that, in contrast to compatible split systems, we cannot remove this assumption, consider, for example, the non-maximum circular split system

on and the weighting that assigns weight 1 to every split in . This yields the order distance associated to with

which is generated as by the weighted circular split system with

So, in general, if is generated by a non-maximum circular split system on the order distance associated to may be generated only by a proper superset of .

6 Linearly independent split systems

Circular split systems on $X$ are examples of linearly independent split systems as introduced in [6], that is, the set $\{D_S : S \in \mathcal{S}\}$ of split distances arising from $\mathcal{S}$ is linearly independent when viewed as elements of the vector space of all symmetric bivariate maps $f : X \times X \to \mathbb{R}$ with $f(x,x) = 0$ for all $x \in X$. Note that the dimension of this vector space clearly is $\binom{n}{2}$ and, in view of the fact that there exist linearly independent split systems of size $\binom{n}{2}$ (e.g. maximum circular split systems), this implies that a linearly independent split system $\mathcal{S}$ on $X$ is maximal with respect to set inclusion if and only if $\mathcal{S}$ has maximum size $\binom{n}{2}$. Linearly independent split systems therefore provide a natural generalization of circular split systems, and in this section we explore to what extent Theorem 5 can be generalized to maximum linearly independent split systems that are not circular.
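Linear independence of a split system can be checked by writing each split distance as a 0/1 vector indexed by the 2-element subsets of X and computing the rank of the resulting matrix. The sketch below does this with exact rational arithmetic; splits are given by one side each, and the function names are illustrative assumptions rather than anything defined in [6].

```python
from fractions import Fraction
from itertools import combinations


def split_vector(A, n):
    """Split distance of A | X - A as a vector indexed by pairs {u, v}."""
    return [1 if (u in A) != (v in A) else 0
            for u, v in combinations(range(n), 2)]


def rank(rows):
    """Rank of a matrix via Gaussian elimination over the rationals."""
    m = [[Fraction(v) for v in r] for r in rows]
    r = 0
    ncols = len(m[0]) if m else 0
    for c in range(ncols):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            if m[i][c] != 0:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


def is_linearly_independent(sides, n):
    return rank([split_vector(A, n) for A in sides]) == len(sides)


# A maximum circular split system on X = {0, 1, 2, 3} (ordering 0, 1, 2, 3)
# and the set of all seven splits of X.
circular = [{0}, {1}, {2}, {3}, {0, 1}, {1, 2}]
all_splits = circular + [{0, 2}]
```

For n = 4 the vector space has dimension 6, so the six circular splits form a basis, while adding the seventh split necessarily creates a linear dependency.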

In the following we always assume  and, to avoid fractions in computations, put . The technical lemma we state next provides a useful link between the combinatorial structure of a linearly independent split system and the order distances that it generates. To describe this link, we call a split system on a set with elements closed if for any two incompatible splits and in at least one of the following holds:

  • also contains the splits , , and .

  • and also contains the split .

  • and also contains the splits , and .

  • and also contains the splits , and .

Lemma 7

Let be a set with elements and a linearly independent split system on . If is orderly then is closed.

Proof: If is compatible then it is trivially closed. So assume that is orderly and contains two incompatible splits and . Let  be the weighting of  with and for all other . We consider the order distance with . Then, putting