Cross-modality (visual-auditory) Metric Learning Project
A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new "normalized information distance", based on the noncomputable notion of Kolmogorov complexity, and show that it is in this class and it minorizes every computable distance in the class (that is, it is universal in that it discovers all computable similarities). We demonstrate that it is a metric and call it the similarity metric. This theory forms the foundation for a new practical tool. To evidence generality and robustness we give two distinctive applications in widely divergent areas using standard compression programs like gzip and GenCompress. First, we compare whole mitochondrial genomes and infer their evolutionary history. This results in a first completely automatic computed whole mitochondrial phylogeny tree. Secondly, we fully automatically compute the language tree of 52 different languages.
How do we measure similarity—for example to determine an evolutionary distance—between two sequences, such as internet documents, different language text corpora in the same language, among different languages based on example text corpora, computer programs, or chain letters? How do we detect plagiarism of student source code in assignments? Finally, the fast advance of worldwide genome sequencing projects has raised the following fundamental question to prominence in contemporary biological science: how do we compare two genomes [30, 51]?
Our aim here is not to define a similarity measure for a certain application field based on background knowledge and feature parameters specific to that field; instead we develop a general mathematical theory of similarity that uses no background knowledge or features specific to an application area. Hence it is, without changes, applicable to different areas and even to collections of objects taken from different areas. The method automatically zooms in on the dominant similarity aspect between every two objects. To realize this goal, we first define a wide class of similarity distances. Then, we show that this class contains a particular distance that is universal in the following sense: for every pair of objects the particular distance is less than any “effective” distance in the class between those two objects. This universal distance is called the “normalized information distance” (NID), it is shown to be a metric, and, intuitively, it uncovers all similarities simultaneously that effective distances in the class uncover a single similarity apiece. (Here, “effective” is used as shorthand for a certain notion of “computability” that will acquire its precise meaning below.) We develop a practical analogue of the NID based on real-world compressors, called the “normalized compression distance” (NCD), and test it on real-world applications in a wide range of fields: we present the first completely automatic construction of the phylogeny tree based on whole mitochondrial genomes, and a completely automatic construction of a language tree for over 50 Euro-Asian languages.
Previous Work: Preliminary applications of the current approach were tentatively reported to the biological community and elsewhere [11, 31, 34]. That work, and the present paper, is based on information distance [33, 4], a universal metric that minorizes in an appropriate sense every effective metric: effective versions of Hamming distance, Euclidean distance, edit distances, Lempel-Ziv distance, and the sophisticated distances introduced in [16, 38]. Subsequent work in the linguistics setting, [2, 3], used related ad hoc compression-based methods, Appendix A. The information distance studied in [32, 33, 4, 31], and subsequently investigated in [25, 39, 43, 49], is defined as the length of the shortest binary program that is needed to transform the two objects into each other. This distance can be interpreted also as being proportional to the minimal amount of energy required to do the transformation: A species may lose genes (by deletion) or gain genes (by duplication or insertion from external sources), relatively easily. Deletion and insertion cost energy (proportional to the Kolmogorov complexity of deleting or inserting sequences in the information distance), an aspect that was stressed in . But this distance is not suitable for measuring evolutionary sequence distance. For example, H. influenza and E. coli are two closely related sister species. The former has about 1,856,000 base pairs and the latter has about 4,772,000 base pairs. However, using the information distance of , one would easily classify H. influenza with a short (of comparable length) but irrelevant species simply because of length, instead of with E. coli. The problem is that the information distance of  deals with absolute distance rather than with relative distance. The paper  defined a transformation distance between two species, and  defined a compression distance. Both of these measures are essentially related to . Other than being asymmetric, they also suffer from being absolute rather than relative. As far as the authors know, the idea of relative or normalized distance is, surprisingly, not well studied. An exception is , which investigates normalized Euclidean metric and normalized symmetric-set-difference metric to account for relative distances rather than absolute ones, and it does so for much the same reasons as does the present work. In  the equivalent functional of (V.1) in information theory, expressed in terms of the corresponding probabilistic notions, is shown to be a metric. (Our Lemma V.4 implies this result, but obviously not the other way around.)
This Work: We develop a general mathematical theory of similarity based on a notion of normalized distances. Suppose we define a new distance by setting the value between every pair of objects to the minimal upper semi-computable (Definition II.3 below) normalized distance (possibly a different distance for every pair). This new distance is a non-uniform lower bound on the upper semi-computable normalized distances. The central notion of this work is the “normalized information distance,” given by a simple formula, that is a metric, belongs to the class of normalized distances, and minorizes the non-uniform lower bound above. It is (possibly) not upper semi-computable, but it is the first universal similarity measure, and is an objective recursively invariant notion by the Church-Turing thesis . We cannot compute the normalized information distance, which is expressed in terms of the noncomputable Kolmogorov complexities of the objects concerned. Instead, we look at whether a real-world imperfect analogue works experimentally, by replacing the Kolmogorov complexities by the lengths of the compressed objects using real-world compressors like gzip or GenCompress. Here we show the results of experiments in the diverse areas of (i) bio-molecular evolution studies, and (ii) natural language evolution. In area (i): In recent years, as the complete genomes of various species have become available, it has become possible to do whole genome phylogeny (this overcomes the problem that different genes may give different trees [9, 47]). However, traditional phylogenetic methods on individual genes depended on multiple alignment of the related proteins and on the model of evolution of individual amino acids. Neither of these is practically applicable at the genome level.
In this situation, a method that can compute shared information between two individual sequences is useful because biological sequences encode information, and the occurrence of evolutionary events (such as insertions, deletions, point mutations, rearrangements, and inversions) separating two sequences sharing a common ancestor will result in partial loss of their shared information. Our theoretical approach is used experimentally to create a fully automated and reasonably accurate software tool based on such a distance to compare two genomes. We demonstrate that a whole mitochondrial genome phylogeny of the Eutherians can be reconstructed automatically from unaligned complete mitochondrial genomes by use of our software implementing (an approximation of) our theory, confirming one of the hypotheses in . These experimental confirmations of the efficacy of our comprehensive approach contrast with recent more specialized approaches such as  that have been (and perhaps can only be) tested on small numbers of genes. They have not been experimentally tried on whole mitochondrial genomes that are, apparently, already numerically out of computational range. In area (ii) we fully automatically construct the language tree of 52 primarily Indo-European languages from translations of the “Universal Declaration of Human Rights”, leading to a grouping of language families largely consistent with current linguistic viewpoints. Other experiments and applications performed earlier, not reported here, are: detecting plagiarism in student programming assignments , and phylogeny of chain letters in .
Subsequent Work: The current paper can be viewed as the theoretical basis of a trilogy of papers: In  we address the gap between the rigorously proven optimality of the normalized information distance based on the noncomputable notion of Kolmogorov complexity, and the experimental successes of the “normalized compression distance” or “NCD”, which is the same formula with the Kolmogorov complexity replaced by the lengths in bits of the compressed files using a standard compressor. We provide an axiomatization of a notion of “normal compressor,” and argue that all standard compressors, be it of the Lempel-Ziv type (gzip), block sorting type (bzip2), or statistical type (PPMZ), are normal. It is shown that the NCD based on a normal compressor is a similarity distance, satisfies the metric properties, and approximates universality. To extract a hierarchy of clusters from the distance matrix, we designed a new quartet method and a fast heuristic to implement it. The method is implemented and available on the web as a free open-source software tool: the CompLearn Toolkit. To substantiate claims of universality and robustness,  reports successful applications in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. We tested the method both on natural data sets from a single domain and combinations of different domains (music, genomes, texts, executables, Java programs), and on artificial ones where we know the right answer. In  we applied the method in detail to music clustering (independently,  applied the method of  in this area). The method has been reported extensively in the popular science press, for example [37, 41, 5, 17], and has attracted considerable attention and follow-up applications by researchers in specialized areas. One example of this is in parameter-free data mining and time series analysis . In that paper the efficacy of the compression method is evidenced by a host of experiments. It is also shown that the compression-based method is superior to any other method for comparison of heterogeneous files (for example time series) and for anomaly detection; see Appendix B.
Distance and Metric: Without loss of generality, a distance only needs to operate on finite sequences of 0’s and 1’s since every finite sequence over a finite alphabet can be represented by a finite binary sequence. Formally, a distance is a function $D$ with nonnegative real values, defined on the Cartesian product $X \times X$ of a set $X$. It is called a metric on $X$ if for every $x, y, z \in X$:
$D(x,y) = 0$ iff $x = y$ (the identity axiom);
$D(x,y) + D(y,z) \geq D(x,z)$ (the triangle inequality);
$D(x,y) = D(y,x)$ (the symmetry axiom).
A set $X$ provided with a metric is called a metric space. For example, every set $X$ has the trivial discrete metric $D(x,y) = 0$ if $x = y$ and $D(x,y) = 1$ otherwise.
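The three axioms can be checked by brute force on any finite set of points. The following is a minimal sketch, not part of the original paper; the helper names `discrete_metric` and `is_metric` and the sample strings are illustrative assumptions.

```python
from itertools import product

def discrete_metric(x, y):
    """Trivial discrete metric: 0 if the points coincide, 1 otherwise."""
    return 0 if x == y else 1

def is_metric(points, dist):
    """Brute-force check of the three metric axioms on a finite set."""
    for x, y in product(points, repeat=2):
        if (dist(x, y) == 0) != (x == y):          # identity axiom
            return False
        if dist(x, y) != dist(y, x):               # symmetry axiom
            return False
    for x, y, z in product(points, repeat=3):
        if dist(x, y) + dist(y, z) < dist(x, z):   # triangle inequality
            return False
    return True

# Illustrative sample of binary strings (an assumption for this sketch).
strings = ["", "0", "1", "01", "10", "111"]

def length_gap(x, y):
    """Absolute difference of lengths; NOT a metric on strings."""
    return abs(len(x) - len(y))

print(is_metric(strings, discrete_metric))
print(is_metric(strings, length_gap))
```

The length difference fails the identity axiom because the distinct strings "0" and "1" are at distance 0, which is why a feature-based distance of this kind is at best a metric on equivalence classes.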
Kolmogorov Complexity: A treatment of the theory of Kolmogorov complexity can be found in the text . Here we recall some basic notation and facts. We write string to mean a finite binary string. Other finite objects can be encoded into strings in natural ways. The set of strings is denoted by $\{0,1\}^*$. The Kolmogorov complexity of a file is essentially the length of the ultimate compressed version of the file. Formally, the Kolmogorov complexity, or algorithmic entropy, $K(x)$ of a string $x$ is the length of a shortest binary program $x^*$ to compute $x$ on an appropriate universal computer, such as a universal Turing machine. Thus, $K(x) = |x^*|$, the length of $x^*$, denotes the number of bits of information from which $x$ can be computationally retrieved. If there is more than one shortest program, then $x^*$ is the first one in standard enumeration.
We require that $x$ can be decompressed from its compressed version $x^*$ by a general decompressor program, but we do not require that $x$ can be compressed to $x^*$ by a general compressor program. In fact, it is easy to prove that there does not exist such a compressor program, since $K(x)$ is a noncomputable function. Thus, $K(x)$ serves as the ultimate lower bound of what a real-world compressor can possibly achieve.
To be precise, without going into details, the Kolmogorov complexity we use is the “prefix” version, where the programs of the universal computer are prefix-free (no program is a proper prefix of another program). It is equivalent to consider the length of the shortest binary program to compute $x$ in a universal programming language such as LISP or Java. Note that these programs are always prefix-free, since there is an end-of-program marker.
The conditional Kolmogorov complexity $K(x|y)$ of $x$ relative to $y$ is defined similarly as the length of a shortest program to compute $x$ if $y$ is furnished as an auxiliary input to the computation. We use the notation $K(x,y)$ for the length of a shortest binary program that prints out $x$ and $y$ and a description how to tell them apart. The functions $K(\cdot)$ and $K(\cdot|\cdot)$, though defined in terms of a particular machine model, are machine-independent up to an additive constant and acquire an asymptotically universal and absolute character through Church’s thesis, from the ability of universal machines to simulate one another and execute any effective process.
Definition II.3: A real-valued function $f(x,y)$ is upper semi-computable if there exists a rational-valued recursive function $g(x,y,t)$ such that (i) $g(x,y,t+1) \leq g(x,y,t)$, and (ii) $\lim_{t \to \infty} g(x,y,t) = f(x,y)$. It is lower semi-computable if $-f(x,y)$ is upper semi-computable, and it is computable if it is both upper and lower semi-computable.
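The monotone approximation from above in this definition can be illustrated with real compressors: each additional compressor tried can only lower the best description length found so far. This is a minimal sketch under loud assumptions: the function name `upper_bounds` and the choice of compressors are illustrative, the fixed size of each decompressor is ignored, and the sequence stops after a few steps, so it only shows the non-increasing property (i) and does not converge to the Kolmogorov complexity.

```python
import bz2
import lzma
import zlib

def upper_bounds(data: bytes):
    """Yield a non-increasing sequence of description lengths (in bits)
    for `data`, trying one more compressor at every step."""
    compressors = [
        lambda b: b,                       # step 0: store the data verbatim
        lambda b: zlib.compress(b, 9),     # Lempel-Ziv type (as in gzip)
        lambda b: bz2.compress(b, 9),      # block-sorting type
        lambda b: lzma.compress(b),        # another dictionary compressor
    ]
    best = None
    for compress in compressors:
        size = 8 * len(compress(data))
        best = size if best is None else min(best, size)
        yield best

# A highly regular input: 20,000 bytes with an obvious short description.
data = b"ACGT" * 5000
bounds = list(upper_bounds(data))
print(bounds)
```

Every value printed is a valid upper bound on the length of some description of `data` (up to the constant-size decompressor), but no finite number of steps certifies that the bound is tight, which is the practical face of noncomputability.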
It is easy to see that the functions $K(x)$ and $K(y|x^*)$ (and under the appropriate interpretation also $x^*$, given $x$) are upper semi-computable, and it is easy to prove that they are not computable. The conditional information contained in $x^*$ is equivalent to that in $(x, K(x))$: there are fixed recursive functions $f, g$ such that for every $x$ we have $f(x^*) = (x, K(x))$ and $g(x, K(x)) = x^*$. The information about $x$ contained in $y$ is defined as $I(y:x) = K(x) - K(x|y^*)$. A deep, and very useful, result  shows that there is a constant $c_1 \geq 0$, independent of $x, y$, such that
$$K(x,y) = K(x) + K(y|x^*) = K(y) + K(x|y^*), \tag{II.1}$$
with the equalities holding up to $c_1$ additive precision. Hence, up to an additive constant term, $I(x:y) = I(y:x)$.
Precision: It is customary in this area to use “additive constant $c$” or equivalently “additive $O(1)$ term” to mean a constant, accounting for the length of a fixed binary program, independent from every variable or parameter in the expression in which it occurs.
In our search for the proper definition of the distance between two, not necessarily equal length, binary strings, a natural choice is the length of the shortest program that can transform either string into the other one, both ways. This is one of the main concepts in this work. Formally, the information distance is the length $E(x,y)$ of a shortest binary program that computes $x$ from $y$ as well as computing $y$ from $x$. Being shortest, such a program should take advantage of any redundancy between the information required to go from $x$ to $y$ and the information required to go from $y$ to $x$. The program functions in a catalytic capacity in the sense that it is required to transform the input into the output, but itself remains present and unchanged throughout the computation. A principal result of  shows that the information distance equals
$$E(x,y) = \max\{K(y|x), K(x|y)\} \tag{III.1}$$
up to an additive $O(\log \max\{K(y|x), K(x|y)\})$ term. The information distance $E(x,y)$ is upper semi-computable: By dovetailing the running of all programs we can find shorter and shorter candidate prefix-free programs $p$ with $p(x) = y$ and $p(y) = x$, and in the limit obtain such a $p$ with $|p| = E(x,y)$. (It is very important here that the time of computation is completely ignored: this is why this result does not contradict the existence of one-way functions.) It was shown in , Theorem 4.2, that the information distance $E(x,y)$ is a metric. More precisely, it satisfies the metric properties up to an additive fixed finite constant. A property of $E(x,y)$ that is central for our purposes here is that it minorizes every “admissible distance” (below) up to an additive constant. In defining the class of admissible distances we want to exclude unrealistic distances like $f(x,y) = \frac{1}{2}$ for every pair $x \neq y$, by restricting the number of objects within a given distance of an object. Moreover, we want distances to be computable in some manner.
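The dovetailing argument can be made concrete with a toy model. In this hedged sketch, "programs" are Python generators with a hypothetical description length attached; the table `PROGRAMS`, the helper names, and the outputs are all illustrative assumptions, not a real universal machine. Round $t$ starts the $t$-th program and advances every started program by one step, so a non-halting program never blocks the search, and the best length found so far only decreases.

```python
def halts_after(steps, output):
    """Toy 'program': runs for `steps` steps, then returns `output`."""
    def run():
        for _ in range(steps):
            yield
        return output
    return run

def loops_forever():
    """Toy 'program' that never halts."""
    while True:
        yield

# Hypothetical enumeration of (description length in bits, program).
PROGRAMS = [
    (3, loops_forever),
    (5, halts_after(40, "y")),
    (9, halts_after(2, "y")),
    (12, halts_after(1, "other")),
]

def dovetail(programs, target, max_rounds):
    """Yield each improved upper bound on the length of the shortest
    program in the enumeration that outputs `target`."""
    running, best = [], None
    for t in range(max_rounds):
        if t < len(programs):
            length, prog = programs[t]
            running.append((length, prog()))
        for length, gen in list(running):
            try:
                next(gen)                      # advance by one step
            except StopIteration as stop:      # the program halted
                running.remove((length, gen))
                if stop.value == target and (best is None or length < best):
                    best = length
                    yield best

print(list(dovetail(PROGRAMS, "y", max_rounds=100)))  # prints [9, 5]
```

The 9-bit program halts first and gives the initial bound; the shorter 5-bit program needs more steps and improves the bound later. At no finite round can the procedure rule out that a still shorter program will halt eventually, which is why the result is semi-computable from above but not computable.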
Definition: Let $\Omega = \{0,1\}^*$. A function $D: \Omega \times \Omega \to \mathcal{R}^+$ (where $\mathcal{R}^+$ denotes the positive real numbers) is an admissible distance if it is upper semi-computable, symmetric, and for every pair of objects $x, y \in \Omega$ the distance $D(x,y)$ is the length of a binary prefix code-word that is a program that computes $x$ from $y$, and vice versa, in the reference programming language.
Remark: In  we considered “admissible metric”, but the triangle inequality metric restriction is not necessary for our purposes here.
If $D$ is an admissible distance, then for every $x$ the set $\{D(x,y) : y \in \{0,1\}^*\}$ is the length set of a prefix code. Hence it satisfies the Kraft inequality ,
$$\sum_y 2^{-D(x,y)} \leq 1, \tag{III.2}$$
which gives us the desired density condition.
Example III.3: In representing the Hamming distance $d$ between strings $x$ and $y$ of equal length $n$ differing in positions $i_1, \ldots, i_d$, we can use a simple prefix-free encoding of $(n, d, i_1, \ldots, i_d)$ in $H_n(x,y) = 2 \log n + 4 \log \log n + 2 + d \log n$ bits. We encode $n$ and $d$ prefix-free in $\log n + 2 \log \log n + 1$ bits each, see e.g. , and then the literal indexes of the actual flipped-bit positions. Hence, $H_n(x,y)$ is the length of a prefix code word (prefix program) to compute $x$ from $y$ and vice versa. Then, by the Kraft inequality,
$$\sum_y 2^{-H_n(x,y)} \leq 1. \tag{III.3}$$
It is easy to verify that $H_n$ is a metric in the sense that it satisfies the metric (in)equalities up to $O(\log n)$ additive precision.
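The Kraft inequality for a Hamming-style prefix code can be verified numerically. The sketch below is an illustration under a stated assumption: it swaps in the Elias gamma code for the self-delimiting encoding of the two integers (the paper's own encoding has slightly different lengths), followed by fixed-width literal indexes. The function names are hypothetical.

```python
from math import ceil, comb, log2

def gamma_len(m: int) -> int:
    """Length in bits of the Elias gamma code of an integer m >= 1."""
    return 2 * m.bit_length() - 1

def hamming_code_len(n: int, d: int) -> int:
    """Bits used to describe how to turn x into y when both have length n
    and differ in d positions: gamma(n), gamma(d + 1), then d indexes."""
    index_bits = max(1, ceil(log2(n)))   # fixed width of one position index
    return gamma_len(n) + gamma_len(d + 1) + d * index_bits

def kraft_sum(n: int) -> float:
    """Sum of 2^(-code length) over all y of length n, for a fixed x.
    Exactly comb(n, d) strings y lie at Hamming distance d from x."""
    return sum(comb(n, d) * 2.0 ** -hamming_code_len(n, d)
               for d in range(n + 1))

for n in (1, 8, 64, 1024):
    print(n, kraft_sum(n))
```

Because every codeword is self-delimiting and distinct strings $y$ receive distinct codewords, the sum stays at or below 1 for every $n$; the same argument gives the density condition for any admissible distance.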
Theorem: The information distance $E(x,y)$ is an admissible distance that satisfies the metric inequalities up to an additive constant, and it is minimal in the sense that for every admissible distance $D(x,y)$ we have
$$E(x,y) \leq D(x,y) + O(1).$$
Remark: This is the same statement as Theorem 4.2 in , except that there the $D$’s were also required to be metrics. But the proof given does not use that restriction and therefore suffices for the slightly more general theorem as stated here.
Suppose we want to quantify how much objects differ in terms of a given feature, for example the length in bits of files, the number of beats per second in music pieces, the number of occurrences of a given base in the genomes. Every specific feature induces a distance, and every specific distance measure can be viewed as a quantification of an associated feature difference. The above theorem states that among all features that correspond to upper semi-computable distances, that satisfy the density condition (III.2), the information distance is universal in that among all such distances it is always smallest up to constant precision. That is, it accounts for the dominant feature in which two objects are alike.
Many distances are absolute, but if we want to express similarity, then we are more interested in relative ones. For example, if two strings of length $10^6$ differ by 1000 bits, then we are inclined to think that those strings are relatively more similar than two strings of 1000 bits that have that distance.
Definition IV.1: A normalized distance or similarity distance is a function $d: \Omega \times \Omega \to [0,1]$ that is symmetric, $d(x,y) = d(y,x)$, and for every $x \in \{0,1\}^*$ and every constant $e \in [0,1]$
$$|\{y : d(x,y) \leq e \leq 1\}| < 2^{eK(x)+1}. \tag{IV.1}$$
The density requirement (IV.1) is implied by a “normalized” version of the Kraft inequality:
Lemma: Let $d: \Omega \times \Omega \to [0,1]$ satisfy
$$\sum_y 2^{-d(x,y)K(x)} \leq 1. \tag{IV.2}$$
Then, $d$ satisfies (IV.1).
Remark: We call a normalized distance a “similarity” distance, because it gives a relative similarity (with distance 0 when objects are maximally similar and distance 1 when they are maximally dissimilar) and, conversely, for a well-defined notion of absolute distance (based on some feature) we can express similarity according to that feature as a similarity distance being a normalized version of the original absolute distance. In the literature a distance that expresses lack of similarity (like ours) is often called a “dissimilarity” distance or a “disparity” distance.
The prefix-code for the Hamming distance between in Example III.3 is a program to compute from to and vice versa. To turn it into a similarity distance define with satisfying the inequality for every and for every , where this time denotes the entropy with two possibilities with probabilities and , respectively. For example, for with and is within bit flips of , we can set , yielding with the number of bit flips to obtain from . For every , the number of in the Hamming ball is upper bounded by . By the constraint on , the function satisfies the density condition (IV.1).
Clearly, unnormalized information distance (III.1) is not a proper evolutionary distance measure. Consider three species: E. coli, H. influenza, and some arbitrary bacterium $X$ of similar length as H. influenza, but not related. Information distance would have $E(X, \text{H. influenza}) < E(\text{E. coli}, \text{H. influenza})$, simply because of the length factor. It would rate two long and complex sequences that differ only by a tiny fraction of the total information as being as dissimilar as two short sequences that differ by the same absolute amount and are completely random with respect to one another. In  we considered as first attempt at a normalized information distance:
Definition V.1: Given two sequences $x$ and $y$, define the function $d_s(x,y)$ by
$$d_s(x,y) = \frac{K(x|y^*) + K(y|x^*)}{K(x,y)}. \tag{V.1}$$
Writing it differently, using (II.1),
$$d_s(x,y) = 1 - \frac{I(x:y)}{K(x,y)}, \tag{V.2}$$
where $I(x:y) = K(y) - K(y|x^*)$ is known as the mutual algorithmic information. It is “mutual” since we saw from (II.1) that it is symmetric: $I(x:y) = I(y:x)$ up to a fixed additive constant. This distance satisfies the triangle inequality, up to a small error term, and universality (below), but only within a factor 2. Mathematically more precise and satisfying is the distance:
Definition V.2: Given two sequences $x$ and $y$, define the function $d(x,y)$ by
$$d(x,y) = \frac{\max\{K(x|y^*), K(y|x^*)\}}{\max\{K(x), K(y)\}}. \tag{V.3}$$
Remark: Several natural alternatives for the denominator turn out to be wrong:
(a) Divide by the length. Then, firstly, we do not know which of the two lengths involved to divide by, possibly the sum or maximum; but furthermore the triangle inequality and the universality (domination) properties are not satisfied.
(b) In the $d$ definition divide by $K(x,y)$. Then one has $d(x,y) = \frac{1}{2}$ whenever $x$ and $y$ are random (have maximal Kolmogorov complexity) relative to one another. This is improper.
(c) In the $d_s$ definition dividing by length does not satisfy the triangle inequality.
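There is a natural interpretation of $d(x,y)$: If $K(y) \geq K(x)$ then we can rewrite
$$d(x,y) = \frac{K(y) - I(x:y)}{K(y)} = 1 - \frac{I(x:y)}{K(y)}.$$
That is, $1 - d(x,y)$ between $x$ and $y$ is the number of bits of information that is shared between the two strings per bit of information of the string with most information.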
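Lemma V.4: $d(x,y)$ satisfies the metric (in)equalities up to additive precision $O(1/K)$, where $K$ is the maximum of the Kolmogorov complexities of the objects involved in the (in)equality.
Proof: Clearly, $d(x,y)$ is precisely symmetrical. It also satisfies the identity axiom up to the required precision:
$$d(x,x) = O(1/K(x)).$$
To show that it is a metric up to the required precision, it remains to prove the triangle inequality.
Lemma: $d(x,y)$ satisfies the triangle inequality $d(x,y) \leq d(x,z) + d(z,y)$ up to an additive error term of $O(1/\max\{K(x), K(y), K(z)\})$.
Proof: Case 1: Suppose $K(z) \leq \max\{K(x), K(y)\}$. In , the following “directed triangle inequality” was proved: For all $x, y, z$, up to an additive constant term,
$$K(x|y^*) \leq K(x,z|y^*) \leq K(x|z^*) + K(z|y^*). \tag{V.4}$$
Dividing both sides by $\max\{K(x), K(y)\}$, majorizing and rearranging,
$$\frac{\max\{K(x|y^*), K(y|x^*)\}}{\max\{K(x), K(y)\}} \leq \frac{\max\{K(x|z^*) + K(z|y^*),\, K(y|z^*) + K(z|x^*)\}}{\max\{K(x), K(y)\}} \leq \frac{\max\{K(x|z^*), K(z|x^*)\}}{\max\{K(x), K(y)\}} + \frac{\max\{K(z|y^*), K(y|z^*)\}}{\max\{K(x), K(y)\}},$$
up to an additive term $O(1/\max\{K(x), K(y), K(z)\})$. Replacing $K(y)$ by $K(z)$ in the denominator of the first term in the right-hand side, and $K(x)$ by $K(z)$ in the denominator of the second term of the right-hand side, respectively, can only increase the right-hand side (again, because of the assumption).
Case 2: Suppose $K(z) = \max\{K(x), K(y), K(z)\}$. Further assume that $K(x) \geq K(y)$ (the remaining case is symmetrical). Then, using the symmetry of information to determine the maxima, we also find $K(z|x^*) \geq K(x|z^*)$ and $K(z|y^*) \geq K(y|z^*)$. Then the maxima in the terms of the equation $d(x,y) \leq d(x,z) + d(y,z)$ are determined, and our proof obligation reduces to:
$$\frac{K(x|y^*)}{K(x)} \leq \frac{K(z|x^*)}{K(z)} + \frac{K(z|y^*)}{K(z)}, \tag{V.5}$$
up to an additive term $O(1/K(z))$. To prove (V.5) we proceed as follows:
Applying the triangle inequality (V.4) and dividing both sides by $K(x)$, we have
$$\frac{K(x|y^*)}{K(x)} \leq \frac{K(x|z^*) + K(z|y^*) + O(1)}{K(x)}, \tag{V.6}$$
where the left-hand side is $\leq 1$.
Case 2.1: Assume that the right-hand side is $\leq 1$. Set $K(z) = K(x) + \Delta$, and observe that $K(x|z^*) + \Delta = K(z|x^*) + O(1)$ by (II.1). Add $\Delta$ to both the numerator and the denominator in the right-hand side of (V.6), which increases the right-hand side because it is a ratio $\leq 1$, and rewrite:
$$\frac{K(x|y^*)}{K(x)} \leq \frac{K(x|z^*) + K(z|y^*) + \Delta + O(1)}{K(x) + \Delta} = \frac{K(z|x^*) + K(z|y^*) + O(1)}{K(z)},$$
which was what we had to prove.
Case 2.2: The right-hand side is $\geq 1$. We proceed as in Case 2.1, and add $\Delta$ to both numerator and denominator. Although now the right-hand side decreases, it must still be $\geq 1$. This proves Case 2.2.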
Clearly, $d(x,y)$ takes values in the range $[0, 1 + O(1/K)]$, where $K = \max\{K(x), K(y)\}$. To show that it is a normalized distance, it is left to prove the density condition of Definition IV.1:
Lemma: The function $d(x,y)$ satisfies the density condition (IV.1).
Proof: Case 1: Assume $K(y) \leq K(x)$. Then, $d(x,y) = K(x|y^*)/K(x)$. If $d(x,y) \leq e$, then $K(x|y^*) \leq eK(x)$. Adding $K(y)$ to both sides, rewriting according to (II.1), and subtracting $K(x)$ from both sides, we obtain
$$K(y|x^*) \leq eK(x) + K(y) - K(x) \leq eK(x). \tag{V.7}$$
There are at most $\sum_{i=0}^{eK(x)} 2^i < 2^{eK(x)+1}$ binary programs of length $\leq eK(x)$. Therefore, for fixed $x$ there are $< 2^{eK(x)+1}$ objects $y$ satisfying (V.7).
Case 2: Assume $K(x) < K(y)$. Then, $d(x,y) = K(y|x^*)/K(y)$. If $d(x,y) \leq e$, then (V.7) holds again. Together, Cases 1 and 2 prove the lemma.
Since we have shown that $d(x,y)$ takes values in $[0,1]$, it satisfies the metric requirements up to the given additive precision, and it satisfies the density requirement in Definition IV.1, it follows:
Theorem: The function $d(x,y)$ is a normalized distance that satisfies the metric (in)equalities up to $O(1/K)$ precision, where $K$ is the maximum of the Kolmogorov complexities involved in the (in)equality concerned.
Remark: As far as the authors know, the idea of normalized metric is not well-studied. An exception is , which investigates normalized metrics to account for relative distances rather than absolute ones, and it does so for much the same reasons as in the present work. An example there is the normalized Euclidean metric $|x - y|/(|x| + |y|)$, where $x, y \in \mathcal{R}^n$ ($\mathcal{R}$ denotes the real numbers) and $|\cdot|$ is the Euclidean metric (the $L_2$ norm). Another example is a normalized symmetric-set-difference metric. But these normalized metrics are not necessarily effective in that the distance between two objects gives the length of an effective description to go from either object to the other one.
We now show that $d(x,y)$ is universal: it incorporates every upper semi-computable (Definition II.3) similarity in that if objects $x, y$ are similar according to a particular feature of the above type, then they are at least that similar in the $d(x,y)$ sense. We prove this by demonstrating that $d(x,y)$ is at least as small as any normalized distance between $x, y$ in the wide class of upper semi-computable normalized distances. This class is so wide that it will capture everything that can be remotely of interest.
Remark: The function $d(x,y)$ itself, being a ratio between two maxima of pairs of upper semi-computable functions, may not itself be semi-computable. (It is easy to see that this is likely, but a formal proof is difficult.) In fact, $d(x,y)$ has ostensibly only a weaker computability property: Call a function $f(x,y)$ computable in the limit if there exists a rational-valued recursive function $g(x,y,t)$ such that $\lim_{t \to \infty} g(x,y,t) = f(x,y)$. Then $d(x,y)$ is in this class. It can be shown  that this is precisely the class of functions that are Turing-reducible to the halting set. While $d(x,y)$ is possibly not upper semi-computable, it captures all similarities represented by the upper semi-computable normalized distances in the class concerned, which should suffice as a theoretical basis for all practical purposes.
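Theorem: The normalized information distance $d(x,y)$ minorizes every upper semi-computable normalized distance $f(x,y)$ by $d(x,y) \leq f(x,y) + O(1/K)$, where $K = \min\{K(x), K(y)\}$.
Proof: Let $x, y$ be a pair of objects and let $f$ be a normalized distance that is upper semi-computable. Let $f(x,y) = e$.
Case 1: Assume that $K(x) \leq K(y)$. Then, given $x$ we can recursively enumerate the pairs $x, v$ such that $f(x,v) \leq e$. Note that the enumeration contains $x, y$. By the normalization condition (IV.1), the number of pairs enumerated is less than $2^{eK(x)+1}$. Every such pair, in particular $x, y$, can be described by its index of length $\leq eK(x) + 1$ in this enumeration. Since the Kolmogorov complexity is the length of the shortest effective description, given $x$, the binary length of the index plus an $O(1)$ bit program to perform the recovery of $y$ must at least be as large as the Kolmogorov complexity, which yields $K(y|x) \leq eK(x) + O(1)$. Since $K(x) \leq K(y)$, by (II.1), $K(x|y^*) \leq K(y|x^*)$, and hence $d(x,y) = K(y|x^*)/K(y)$. Note that $K(y|x^*) \leq K(y|x) + O(1)$, because $x^*$ supplies the information $(x, K(x))$, which includes the information $x$. Substitution gives:
$$d(x,y) = \frac{K(y|x^*)}{K(y)} \leq \frac{eK(x) + O(1)}{K(y)} \leq f(x,y) + O(1/K(x)).$$
Case 2: Assume that $K(x) > K(y)$. Then, given $y$ we can recursively enumerate the pairs $u, y$ such that $f(u,y) \leq e$. Note that the enumeration contains $x, y$. By the normalization condition (IV.1), the number of pairs enumerated is less than $2^{eK(y)+1}$. Every such pair, in particular $x, y$, can be described by its index of length $\leq eK(y) + 1$ in this enumeration. Similarly to Case 1, this yields $K(x|y) \leq eK(y) + O(1)$. Also, by (II.1), $K(y|x^*) \leq K(x|y^*)$, and hence $d(x,y) = K(x|y^*)/K(x)$. Substitution gives:
$$d(x,y) = \frac{K(x|y^*)}{K(x)} \leq \frac{eK(y) + O(1)}{K(x)} \leq f(x,y) + O(1/K(y)).$$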
It is difficult to find a more appropriate type of object than DNA sequences to test our theory: such sequences are finite strings over a 4-letter alphabet that are naturally recoded as binary strings with 2 bits per letter. We will use whole mitochondrial DNA genomes of 20 mammals and the problem of Eutherian orders to experiment. The problem we consider is this: It has been debated in biology which two of the three main groups of placental mammals, Primates, Ferungulates, and Rodents, are more closely related. One cause of debate is that the maximum likelihood method of phylogeny reconstruction gives the (Ferungulates, (Primates, Rodents)) grouping for half of the proteins in the mitochondrial genome, and (Rodents, (Ferungulates, Primates)) for the other half . The authors aligned 12 concatenated mitochondrial proteins taken from the following species: rat (Rattus norvegicus), house mouse (Mus musculus), grey seal (Halichoerus grypus), harbor seal (Phoca vitulina), cat (Felis catus), white rhino (Ceratotherium simum), horse (Equus caballus), finback whale (Balaenoptera physalus), blue whale (Balaenoptera musculus), cow (Bos taurus), gibbon (Hylobates lar), gorilla (Gorilla gorilla), human (Homo sapiens), chimpanzee (Pan troglodytes), pygmy chimpanzee (Pan paniscus), orangutan (Pongo pygmaeus), Sumatran orangutan (Pongo pygmaeus abelii), using opossum (Didelphis virginiana), wallaroo (Macropus robustus) and platypus (Ornithorhynchus anatinus) as the outgroup, and built the maximum likelihood tree. The currently accepted grouping is (Rodents, (Primates, Ferungulates)).
Before applying our theory, we first examine the alternative approaches, in addition to that of . The mitochondrial genomes of the above 20 species were obtained from GenBank. Each is about 18k bases, and each base is one out of four types: adenine (A), which pairs with thymine (T), and cytosine (C), which pairs with guanine (G).
k-mer Statistic: In the early years, researchers experimented using G+C contents, or slightly more general $k$-mers (or Shannon block entropy) to classify DNA sequences. This approach uses the frequency statistics of length-$k$ substrings in a genome and the phylogeny is constructed accordingly. To re-examine this approach, we performed simple experiments: Consider all length-$k$ blocks in each mtDNA, for $k = 1, 2, \ldots, 10$. There are $l = (4^{11} - 1)/3 - 1$ different such blocks (some may not occur). We computed the frequency of (overlapping) occurrences of each block in each mtDNA. This way we obtained a vector of length $l$ for each mtDNA, where the $i$th entry is the frequency with which the $i$th block occurs overlapping in the mtDNA concerned ($1 \leq i \leq l$). For two such vectors (representing two mtDNAs) $p, q$, their distance is computed as $d(p,q) = \sqrt{(p-q)^T (p-q)}$. Using neighbor joining , the phylogeny tree that resulted is given in Figure 1. Using the hypercleaning method , we obtain equally absurd results. Similar experiments were repeated for size-$k$ blocks alone (for $k = 10, 9, 8, 7, 6$), without much improvement.
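The block-frequency experiment described above can be sketched as follows. This is an illustration only, under stated assumptions: the function names are hypothetical, "frequency" is taken to mean relative frequency of overlapping occurrences, the distance is the Euclidean distance between the concatenated frequency vectors, and the toy sequences and the short range of block lengths stand in for real mitochondrial genomes and the block lengths used in the paper.

```python
from collections import Counter
from math import sqrt

def kmer_frequencies(seq: str, k: int) -> dict:
    """Relative frequencies of the overlapping length-k blocks of seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:                      # sequence shorter than k
        return {}
    return {block: c / total for block, c in counts.items()}

def kmer_distance(a: str, b: str, ks=(1, 2, 3)) -> float:
    """Euclidean distance between the block-frequency vectors of a and b,
    taken over all block lengths in ks (blocks that never occur count 0)."""
    squared = 0.0
    for k in ks:
        fa, fb = kmer_frequencies(a, k), kmer_frequencies(b, k)
        for block in fa.keys() | fb.keys():
            squared += (fa.get(block, 0.0) - fb.get(block, 0.0)) ** 2
    return sqrt(squared)

# Toy sequences (assumptions for this sketch, not real genomes).
s1 = "ACGTACGTACGTACGGACGT"
s2 = "ACGTACGTACGAACGTACGT"
s3 = "TTTTGGGGCCCCAAAATTTT"
print(kmer_distance(s1, s2), kmer_distance(s1, s3))
```

Such a distance only sees local composition, not shared long-range structure, which is consistent with the poor trees reported for this approach.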
Gene Order: In  the authors propose to use the order of genes to infer the evolutionary history. This approach does not work for closely related species such as our example where all genes are in the same order in the mitochondrial genomes in all 20 species.
Gene Content: The gene content method, proposed in [19, 46], uses as distance the ratio between the number of genes two species share and the total number of genes. While this approach does not work here due to the fact that all 20 mammalian mitochondrial genomes share exactly the same genes, notice the similarity of the gene content formula and our general formula.
Rearrangement Distance: Reversal and rearrangement distances in [28, 26, 40] compare genomes using other partial genomic information such as the number of reversals or translocations. These operations also do not appear in our mammalian mitochondrial genomes, hence the method again is not proper for our application.
Transformation Distance or Compression Distance: The transformation distance and the compression distance proposed in earlier work essentially correspond to $K(x|y)$, which is asymmetric, and so they are not admissible distances. Using $K(x|y)$ in the GenCompress approximation version produces a wrong tree with one of the marsupials mixed up with ferungulates (the tree is not shown here).
We have shown that the normalized information distance $d$ (and up to a factor 2 this holds also for $d_s$) is universal among the wide class of normalized distances, including all computable ones. These universal distances (actually, metrics) between $x$ and $y$ are expressed in terms of $K(x)$, $K(y)$, and $K(x|y)$. The generality of the normalized information distance $d$ comes at the price of noncomputability: Kolmogorov complexity is not computable but just upper semi-computable (Section II), and $d$ itself is (likely to be) not even that. Nonetheless, using standard compressors, we can compute an approximation of $d$.
To prevent confusion, we stress that, in principle, we cannot determine how far a computable approximation of $K(x)$ exceeds its true value. What we can say is that if we flip a sequence $x$ of $n$ bits with a fair coin, then with overwhelming probability $K(x)$ is about $n$, and a real compressor will also compress $x$ to a string of about length $n$ (that is, it will not compress at all, and the compressed file length is about the Kolmogorov complexity and truly approximates it). However, these strings essentially consist of random noise and have no meaning. But if we take a meaningful string, for example the first $n$ bits of the binary representation of $\pi$ for some very large $n$, then the Kolmogorov complexity is very short (because a program of, say, 10,000 bits can compute the string), but no standard compressor will be able to compress the string significantly below its length of $n$ (it will not be able to figure out the inherent regularity). And it is precisely the rare meaningful strings, rare in comparison to the overwhelming majority of strings that consist of random noise, that we can possibly be interested in, and for which the Kolmogorov complexity depends on computable regularities. Certain of those regularities may be easy to determine, even by a simple compressor, but some regularities may take an infeasible amount of time to discover.
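The contrast between noise-like data and a highly regular string is easy to observe with an off-the-shelf compressor. The sketch below uses Python's `zlib` as a stand-in for a real-world compressor; the seed, the lengths, and the sample strings are arbitrary choices made for this illustration. It only shows that a compressed length is an upper estimate of complexity, not how close that estimate is.

```python
import random
import zlib


def compressed_size(data):
    """Length in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


rng = random.Random(0)  # fixed seed so that runs are repeatable
noise = bytes(rng.getrandbits(8) for _ in range(20_000))  # pseudo-random bytes
regular = b"ACGT" * 5_000  # same length, but one short pattern repeated

if __name__ == "__main__":
    print("noise:  ", len(noise), "->", compressed_size(noise))
    print("regular:", len(regular), "->", compressed_size(regular))
```

Note that `noise` is itself produced by a short program from a fixed seed, so its Kolmogorov complexity is small even though zlib cannot shrink it. This is the situation described above for strings whose regularity a standard compressor fails to detect.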
It is clear how to compute the real-world compressor version of the unconditional complexities involved. With respect to the conditional complexities, by the symmetry of information we have $K(x|y) = K(x,y) - K(y)$ (up to an additive constant), and it is easy to see that $K(x,y) = K(xy)$ up to additive logarithmic precision. (Here $K(xy)$ is the length of the shortest program to compute the concatenation of $x$ and $y$ without telling which is which. To retrieve $(x,y)$ we need to encode the separator between the binary programs for $x$ and $y$.) So $K(x|y)$ is roughly equal to $K(xy) - K(y)$.
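The following worked step shows where this approximation leads. It is an informal sketch that ignores the additive logarithmic terms, assumes $K(xy) \approx K(yx)$, and takes the normalized information distance in its max form, written here with plain conditional complexities as a simplification.

```latex
\begin{align*}
K(x|y) &\approx K(xy) - K(y), \qquad
K(y|x) \approx K(yx) - K(x) \approx K(xy) - K(x), \\
\max\{K(x|y),\, K(y|x)\} &\approx K(xy) - \min\{K(x),\, K(y)\}, \\
d(x,y) = \frac{\max\{K(x|y),\, K(y|x)\}}{\max\{K(x),\, K(y)\}}
  &\approx \frac{K(xy) - \min\{K(x),\, K(y)\}}{\max\{K(x),\, K(y)\}}.
\end{align*}
```

Replacing $K$ by the output length of a real compressor then gives a computable quantity of the same shape.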
In applying the approach in practice, we have to make do with an approximation based on a real-world reference compressor $C$. The resulting applied approximation of the “normalized information distance” $d$ is called the normalized compression distance (NCD):
$$\mathrm{NCD}(x,y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}.$$
Here, $C(xy)$ denotes the compressed size of the concatenation of $x$ and $y$, $C(x)$ denotes the compressed size of $x$, and $C(y)$ denotes the compressed size of $y$. The NCD is a non-negative number $0 \le r \le 1 + \epsilon$ representing how different the two files are. Smaller numbers represent more similar files. The $\epsilon$ in the upper bound is due to imperfections in our compression techniques, but for most standard compression algorithms one is unlikely to see an $\epsilon$ above 0.1 (in our experiments gzip and bzip2 achieved NCDs above 1, but PPMZ always had NCD at most 1).
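A minimal sketch of the NCD with a general-purpose compressor follows. It assumes the standard form of the distance, $(C(xy) - \min\{C(x), C(y)\}) / \max\{C(x), C(y)\}$, and uses Python's `zlib` for the reference compressor $C$. The helper names `random_dna` and `mutate` and the synthetic sequences are illustrative assumptions, not the data or software used in the experiments.

```python
import random
import zlib


def C(data):
    """Compressed size in bytes, with zlib standing in for the compressor C."""
    return len(zlib.compress(data, 9))


def ncd(x, y):
    """Normalized compression distance between two byte strings."""
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def random_dna(n, seed):
    """Hypothetical helper: n pseudo-random bases as ASCII bytes."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(n)).encode("ascii")


def mutate(seq, rate, seed):
    """Hypothetical helper: redraw each base with the given probability."""
    rng = random.Random(seed)
    return bytes(rng.choice(b"ACGT") if rng.random() < rate else base
                 for base in seq)


if __name__ == "__main__":
    a = random_dna(4000, seed=1)
    close = mutate(a, rate=0.02, seed=2)  # a lightly mutated copy of a
    far = random_dna(4000, seed=3)        # an unrelated sequence
    print("NCD(a, close) =", round(ncd(a, close), 3))
    print("NCD(a, far)   =", round(ncd(a, far), 3))
```

The concatenation here stays far below zlib's 32-kilobyte window, which matters because the second file can only be compressed relative to the first while the first is still inside the window.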
The theory as developed for the Kolmogorov-complexity based “normalized information distance” in this paper does not hold directly for the (possibly poorly) approximating NCD. In separate work, we developed the theory of NCD based on the notion of a “normal compressor,” and showed that the NCD is a (quasi-) universal similarity metric relative to a normal reference compressor $C$. The NCD violates metricity only insofar as it deviates from “normality,” and it violates universality only insofar as $C(x)$ stays above $K(x)$. The theory developed in the present paper is the boundary case $C = K$, where the “partially violated universality” has become full “universality”. The conditional $C(y|x)$ has been replaced by $C(xy) - C(x)$, which can be interpreted in stream-based compressors as the compression length of $y$ based on using the “dictionary” extracted from $x$. Similar statements hold for block sorting compressors like bzip2, and designer compressors like GenCompress. Since the writing of this paper the method has been released in the public domain as open-source software at http://complearn.sourceforge.net/: The CompLearn Toolkit is a suite of simple utilities that one can use to apply compression techniques to the process of discovering and learning patterns. The compression-based approach used is powerful because it can mine patterns in completely different domains. In fact, this method is so general that it requires no background knowledge about any particular subject area. There are no domain-specific parameters to set, and only a handful of general settings.
Number of Different k-mers: We have shown that using $k$-mer frequency statistics alone does not work well. However, let us now combine the $k$-mer approach with the incompressibility approach. Let the number of distinct, possibly overlapping, $k$-length words in a sequence $x$ be $N(x)$. With $k$ large enough, at least $\log_a n$, where $a$ is the cardinality of the alphabet and $n$ the length of $x$, we use $N(x)$ as a rough approximation to $K(x)$. For example, for a sequence with the repetition of only one letter, this $N(x)$ will be 1. The length $k$ is chosen such that: (i) if the two genomes concerned had been generated randomly, then it is unlikely that they have a $k$-length word in common; and (ii) it is usual that two homologous sequences share the same $k$-length words. A good choice is $k = O(\log_4 n)$, where $n$ is the length of the genomes and 4 is because we have 4 bases. There are $4^{\log_4 n} = n$ subwords because the alphabet has size 4 for DNA. To describe a particular choice of $N$ subwords of length $k = \log_4 n$ in a string of length $n$ we need approximately $\log \binom{n}{N} \approx N \log (n/N) = 2kN - N \log N$ bits. For a family of mitochondrial DNA, $N$ typically stays within a limited range. In this range, $2kN - N \log N$ can be approximated by $cN$ for some constant $c$. So, overall, the number of bits needed is proportional to the number $N$ of different subwords of length $k$ for this choice of parameters.
According to our experiment, $k$ should be slightly larger than $\log n$. For example, a mitochondrial DNA is about 17K bases long, so $\log_4 17000 \approx 7.03$, while the $k$ we use below falls in one of three ranges, depending on the formula and on whether spaced seeds (see below) are used.
We justify the complexity approximation using the number of different $k$-mers by the pragmatic observation that, because the genomes evolve by duplications, rearrangements and mutations, and assuming that duplicated subwords are to be regarded as duplicated information that can be “compressed out,” while distinct subwords are not “compressed out,” it can be informally and intuitively argued that a description of the set of different subwords describes $x$. With our choice of parameters it therefore is appropriate to use $N(x)$ as a plausible proportional estimate for $K(x)$ in case $x$ is a genome. So the size of the set is used to replace the $K(x)$ of genome $x$. $K(x,y)$ is replaced by the size of the union of the two subword sets, written $N(xy)$. Define $N(x|y)$ as $N(xy) - N(y)$. Given two sequences $x$ and $y$, following the definition of $d$, (V.3), the distance between $x$ and $y$ can be defined as
$$d'(x,y) = \frac{\max\{N(x|y), N(y|x)\}}{\max\{N(x), N(y)\}}. \qquad \text{(VII.2)}$$
Similarly, following $d_s$, (V.1), we can also define another distance using $N(x)$,
$$d^*(x,y) = \frac{N(x|y) + N(y|x)}{N(xy)}. \qquad \text{(VII.3)}$$
Using $d'$ and $d^*$, we computed the distance matrices for the 20 mammal mitochondrial DNAs. Then we used hyperCleaning to construct the phylogenies for the 20 mammals. Using either of $d'$ and $d^*$, we were able to construct the tree correctly for $k$ in a suitable range, as in Figure 3. A tree constructed with $d'$ for a $k$ outside that range is given in Figure 2. We note that the opossum and a few other species are misplaced. The tree constructed with $d^*$ for such a $k$ is very similar, but it correctly positioned the opossum.
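The two distances built on distinct $k$-mer counts can be sketched as follows. The sketch takes the size of the union of the two word sets as the stand-in for the joint complexity, as described above, so that the conditional count is the number of words in one sequence that are absent from the other. The function names and the toy inputs are assumptions for illustration only.

```python
def kmers(seq, k):
    """Set of distinct, possibly overlapping, length-k words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def distinct_kmer_distances(x, y, k):
    """Return (max-based distance, sum-based distance) from distinct k-mer counts."""
    words_x, words_y = kmers(x, k), kmers(y, k)
    if not words_x or not words_y:
        raise ValueError("both sequences must contain at least k letters")
    union = len(words_x | words_y)
    n_x_given_y = union - len(words_y)  # words in x that are not in y
    n_y_given_x = union - len(words_x)  # words in y that are not in x
    d_max = max(n_x_given_y, n_y_given_x) / max(len(words_x), len(words_y))
    d_sum = (n_x_given_y + n_y_given_x) / union
    return d_max, d_sum


if __name__ == "__main__":
    # Toy sequences; real genomes would call for a larger k.
    print(distinct_kmer_distances("ACGTAC", "ACGTTT", 3))
```

Both values are 0 for sequences with identical word sets and 1 for sequences that share no word, which mirrors the range of the normalized distances they imitate.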
Number of Spaced k-mers: In methods for doing DNA homology search, a pair of identical words, each from a DNA sequence, is called a “hit”. Hits have been used as “seeds” to generate a longer match between the two sequences. If we define $N(x|y)$ as the number of distinct words that are in $x$ and not in $y$, then the more hits the two sequences have, the smaller $d'(x,y)$ and $d^*(x,y)$ are. Therefore, the (VII.2), (VII.3) distances can also be interpreted as a function of the number of hits, each of which indicates some mutual information of the two sequences.
As noticed by the authors of the spaced-seed approach, though it is difficult to get the first hit (of $k$ consecutive letters) in a region, it only requires one more base match to get a second hit overlapping the existing one. This makes it inaccurate to attribute the same amount of information to each of the hits. For this reason, we also tried to use the “spaced model” introduced there to compute our distances. A length-$L$, weight-$k$ spaced template is a 0-1 string of length $L$ having $k$ entries 1. We shift the template over the DNA sequence, one position each step, starting with the first positions aligned and finishing with the last positions aligned. At each step we extract the ordered sequence of the $k$ bases in the DNA sequence covered by the 1-positions of the template to form a length-$k$ word. The number of different such words is then used to define the distances $d'$ and $d^*$ in Formulas (VII.2) and (VII.3).
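The template-shifting step can be sketched as below. The function name `spaced_words` and the example template are hypothetical; the resulting word set would simply take the place of the contiguous word set when the distinct-word counts are formed.

```python
def spaced_words(seq, template):
    """Distinct words read through the 1-positions of a 0-1 template.

    The template is shifted over seq one position at a time, from the first
    alignment to the last one that still fits entirely inside seq.
    """
    if not template or set(template) - {"0", "1"}:
        raise ValueError("template must be a non-empty string of 0s and 1s")
    ones = [i for i, bit in enumerate(template) if bit == "1"]
    span = len(template)
    return {
        "".join(seq[start + i] for i in ones)
        for start in range(len(seq) - span + 1)
    }


if __name__ == "__main__":
    # Length-3, weight-2 template: keep the first and third base of each window.
    print(sorted(spaced_words("ACGTAC", "101")))
```

An all-ones template of length $k$ reduces to ordinary contiguous $k$-mers, so the contiguous case is a special case of the spaced one.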
We applied the newly defined distances to the 20 mammal data. The performance is slightly better than the performance of the distances defined in (VII.2) and (VII.3). The modified $d'$ and $d^*$ each construct the mammal tree correctly for their own range of $k$.
Compression: To achieve the best approximation of Kolmogorov complexity, and hence most confidence in the approximation of $d_s$ and $d$, we used a new version of the GenCompress program, which achieved the best compression ratios for benchmark DNA sequences at the time of writing. GenCompress finds approximate matches (hence edit distance becomes a special case) and approximate reverse complements, among other things, with arithmetic encoding when necessary. An online service of GenCompress can be found on the web. We computed $d_s(x,y)$ between each pair of mtDNA $x$ and $y$, using GenCompress to heuristically approximate $K(x|y)$, $K(x)$, and $K(x,y)$, and constructed a tree (Figure 3) using the neighbor joining program in the MOLPHY package. The tree is identical to the maximum likelihood tree of Cao et al. For comparison, we used the hypercleaning program and obtained the same result. The phylogeny in Figure 3 re-confirms the hypothesis of (Rodents, (Primates, Ferungulates)). Using the $d$ measure gives the same result.
To further assure our results, we have extracted only the coding regions from the mtDNAs of the above species, and performed the same computation. This resulted in the same tree.
In  we have repeated these phylogeny experiments using bzip2 and PPMZ compressors, and a new quartet method to reconstruct the phylogeny tree. In all cases we obtained the correct tree. This is evidence that the compression NCD method is robust under change of compressors, as long as the window size of the used compressor is sufficient for the files concerned, that is, GenCompress can be replaced by other more general-purpose compressors. Simply use .
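Swapping the reference compressor is a one-line change when the compressed-size function is passed in as a parameter. The sketch below builds NCD matrices with three compressors from the Python standard library (`zlib`, `bz2`, `lzma`), used here only as stand-ins; they are not the bzip2, PPMZ, or GenCompress setups of the experiments, and the synthetic sequences and helper names are assumptions.

```python
import bz2
import lzma
import random
import zlib

# Compressed-size functions; each maps a byte string to a length in bytes.
COMPRESSORS = {
    "zlib": lambda data: len(zlib.compress(data, 9)),
    "bz2": lambda data: len(bz2.compress(data, 9)),
    "lzma": lambda data: len(lzma.compress(data)),
}


def ncd(x, y, size):
    """NCD of x and y relative to the compressed-size function `size`."""
    cx, cy, cxy = size(x), size(y), size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(items, size):
    """All pairwise NCD values; row i, column j holds NCD(items[i], items[j])."""
    return [[ncd(a, b, size) for b in items] for a in items]


def random_dna(n, seed):
    """Hypothetical helper: n pseudo-random bases as ASCII bytes."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(n)).encode("ascii")


def mutate(seq, rate, seed):
    """Hypothetical helper: redraw each base with the given probability."""
    rng = random.Random(seed)
    return bytes(rng.choice(b"ACGT") if rng.random() < rate else base
                 for base in seq)


if __name__ == "__main__":
    a = random_dna(4000, seed=1)
    items = [a, mutate(a, rate=0.02, seed=2), random_dna(4000, seed=3)]
    for name, size in COMPRESSORS.items():
        print(name)
        for row in distance_matrix(items, size):
            print("  ", [round(v, 3) for v in row])
```

The matrix is not forced to be symmetric, since a real compressor may give slightly different sizes for the two concatenation orders; a tree-building program that needs symmetry would have to average the two entries.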
Evaluation: This new method for whole genome comparison and phylogeny requires neither gene identification nor any human intervention; in fact, it is totally automatic. It is mathematically well-founded, being based on general information theoretic concepts. It works when there are no agreed upon evolutionary models, as further demonstrated by the successful construction of a chain letter phylogeny, and when individual gene trees do not agree (Cao et al.), as is the case for genomes. As a next step, we have applied this method to much larger nuclear genomes of fungi and yeasts. This work is not reported yet.
Normalized information distance is a totally general universal tool, not restricted to a particular application area. We show that it can also be used to successfully classify natural languages. We downloaded the text corpora of “The Universal Declaration of Human Rights” in 52 Euro-Asian languages from the United Nations website. All of them are in UNICODE. We first transform each UNICODE character in the language text into an ASCII character by removing its vowel flag if necessary. Secondly, as compressor to compute the NCD we used the Lempel-Ziv compressor gzip. This seems appropriate to compress these text corpora of sizes (2 kilobytes) not exceeding the length of the sliding window gzip uses (32 kilobytes). In the last step, we applied the $d_s$-metric (V.1) with the neighbor-joining package to obtain Figure 4. Applying the $d$-metric (V.3) with the Fitch-Margoliash method in the package PHYLIP worked even better; the resulting language classification tree is given in Figure 5. We note that all the main linguistic groups can be successfully recognized, including Romance, Celtic, Germanic, Ugro-Finnic, Slavic, Baltic, and Altaic, as labeled in the figure. In both cases, it is a rooted tree using Basque [Spain] as outgroup. The branch lengths are not proportional to the actual distances in the distance matrix.
Any language tree built by only analyzing contemporary natural text corpora is partially corrupted by historical inter-language contaminations. In fact, this is also the case with genomic evolution: According to current insights, phylogenetic trees are not only based on inheritance, but the environment is also at work through selection, and this even introduces an indirect interaction between species, called reticulation (the joining of separate lineages on a phylogenetic tree, generally through hybridization or through lateral gene transfer; it is fairly common in certain land plant clades and is thought to be rare among metazoans), which is arguably less direct than the borrowings between languages. Thus, while English is ostensibly a Germanic Anglo-Saxon language, it has absorbed a great deal of French-Latin components. Similarly, Hungarian, often considered a Finno-Ugric language, a consensus that is currently open to debate in the linguistic community, is known to have absorbed many Turkish and Slavic components. Thus, an automatic construction of a language tree based on contemporary text corpora exhibits current linguistic relations which do not necessarily coincide completely with the historic language family tree. The misclassification of English as a Romance language is reinforced by the fact that the English vocabulary in the Universal Declaration of Human Rights, being nonbasic in large part, is Latinate in large part. This presumably also accounts for the misclassification of Maltese, an Arabic dialect with lots of Italian loan words, as Romance. Having voiced these caveats, the result of our automatic experiment in language tree reconstruction is accurate.
Our method improves on the results of earlier work that used the same linguistic corpus with an asymmetric measure based on the approach sketched in the section “Related Work.” In the language tree resulting from that work, English is isolated between Romance and Celtic languages, Romani-balkan and Albanian are isolated, and Hungarian is grouped with Turkish and Uzbek. The (rooted) trees resulting from our experiments (using Basque as outgroup) seem more correct. We use Basque as outgroup since linguists regard it as a language unconnected to other languages.
We developed a mathematical theory of compression-based similarity distances and have shown that there is a universal similarity metric: the normalized information distance. This distance uncovers all upper semi-computable similarities, and therefore estimates an evolutionary or relation-wise distance on strings. A practical version was exhibited based on standard compressors. Here it has been shown to be applicable to whole genomes, and to build a large language family tree from text corpora. References to applications in a plethora of other fields can be found in the Introduction. It is perhaps useful to point out that the results reported in the figures were obtained at the very first runs and have not been selected by appropriateness from several trials. From the theory point of view we have obtained a general mathematical theory forming a solid framework spawning practical tools applicable in many fields. Based on the noncomputable notion of Kolmogorov complexity, the normalized information distance can only be approximated without convergence guarantees. Even so, the fundamental rightness of the approach is evidenced by the remarkable success (agreement with known phylogeny in biology) of the evolutionary trees obtained and the building of language trees. From the applied side of genomics our work gives the first fully automatic generation of whole genome mitochondrial phylogeny; in computational linguistics it presents a fully automatic way to build language trees and determine language families.
In related work the purpose is to infer a language tree from different-language text corpora, as well as to do authorship attribution on the basis of text corpora. The distances determined between objects are justified by ad-hoc plausibility arguments (although the information distance of [33, 4] is also mentioned). That paper is predated by our universal similarity metric work and phylogeny tree (hierarchical clustering) experiments [11, 12, 34], but it is the language tree experiment we repeated in the present paper using our own technique with somewhat better results. For comparison of the methods we give some brief details. Assume a fixed compressor ([2, 3] use the Lempel-Ziv type). Let $C(x)$ denote the length of the compressed version of a file $x$, and let $x'$ be a short file from the same source as $x$. For example, if $x$ is a long text in a language, then $x'$ is a short text in the same language. (The authors refer to sequences generated by the same ergodic source.) Then two distances are considered between files $x, y$: (i) an asymmetric distance $s(x,y)$, a ratio whose numerator quantifies the difference in compressing a short file using a database sequence generated by a different source versus one generated by the same source that generated that short file; and (ii) a symmetric distance $S(x,y)$. The distances are not metric (neither satisfies the triangle inequality) and the authors propose to “triangularize” in practice by a Procrustes method: setting $S(x,y) := S(x,z) + S(z,y)$ in case the left-hand side of the triangle inequality $S(x,y) \le S(x,z) + S(z,y)$ exceeds the right-hand side. We remark that in that case the left-hand side becomes smaller and may in its turn cause a violation of another triangle inequality as a member of the right-hand side, and so on. On the upside, despite the lack of supporting theory, the authors report successful experiments.
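The cascading effect remarked on above can be made concrete with a small sketch. It assumes one particular reading of the “triangularize” step, namely that a violated entry is reset to the right-hand side of the violated inequality and that this is repeated until no violation remains; the function names and the toy matrix are hypothetical.

```python
from itertools import permutations


def find_violation(d):
    """Return the first (x, y, z) with d[x][y] > d[x][z] + d[z][y], or None."""
    for x, y, z in permutations(range(len(d)), 3):
        if d[x][y] > d[x][z] + d[z][y]:
            return x, y, z
    return None


def triangularize(d):
    """Clip violated entries until every triangle inequality holds.

    Returns the repaired symmetric matrix and the number of clipping steps.
    """
    d = [row[:] for row in d]  # work on a copy
    steps = 0
    while (violation := find_violation(d)) is not None:
        x, y, z = violation
        d[x][y] = d[y][x] = d[x][z] + d[z][y]
        steps += 1
    return d, steps


if __name__ == "__main__":
    # Four points on a chain: neighbours are at distance 1, all others at 10.
    chain = [
        [0, 1, 10, 10],
        [1, 0, 1, 10],
        [10, 1, 0, 1],
        [10, 10, 1, 0],
    ]
    repaired, steps = triangularize(chain)
    print(steps, repaired)
```

In the toy matrix the entry between the two end points satisfies every triangle inequality at the start, yet it has to be clipped once a neighbouring entry has been lowered, which is exactly the knock-on effect described in the text.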
Note that this measure always ranges between $1/2$ (for $x = y$) and $1$ (for $x$ and $y$ satisfying $C(xy) = C(x) + C(y)$, that is, compressing $x$ doesn’t help in compressing $y$). The authors don’t give a theoretical analysis, but intuitively this formula measures similarity of $x$ and $y$ by comparing the lengths of the compressed files in combination and separately.
John Tromp carefully read and commented on an early draft, and Teemu Roos supplied reference .
27th ACM Symp. Theory of Computing, 1995, 178-189.
C. Rajski, A metric space of discrete probability distributions, Inform. Contr., 4 (1961), 371–377.
Ming Li is a CRC Chair Professor in Bioinformatics, of Computer Science at the University of Waterloo. He is a recipient of Canada’s E.W.R. Steacie Fellowship Award in 1996, and the 2001 Killam Fellowship. Together with Paul Vitányi he pioneered applications of Kolmogorov complexity and co-authored the book “An Introduction to Kolmogorov Complexity and Its Applications” (Springer-Verlag, 1993, 2nd Edition, 1997). He is a co-managing editor of Journal of Bioinformatics and Computational Biology. He currently also serves on the editorial boards of Journal of Computer and System Sciences, Information and Computation, SIAM Journal on Computing, Journal of Combinatorial Optimization, Journal of Software, and Journal of Computer Science and Technology.
Xin Chen received his Ph.D. from Peking University, Beijing, China, in 2001. He is now a Post-doc at University of California, Riverside. His research interests include data compression, pattern recognition and bioinformatics.
Xin Li obtained his B.Sc. degree in Computer Science from McMaster University (Canada) and his M.Sc. degree in Computer Science from the University of Western Ontario (Canada).
Bin Ma received his Ph.D. degree from Peking University in 1999, and has been an assistant professor in the Department of Computer Science at the University of Western Ontario since 2000. He is a recipient of Ontario Premier’s Research Excellence award in 2003 for his research in bioinformatics. He is a coauthor of two well-known bioinformatics software programs, PatternHunter and PEAKS.
Paul M.B. Vitányi is a Fellow of the Center for Mathematics and Computer Science (CWI) in Amsterdam and is Professor of Computer Science at the University of Amsterdam. He serves on the editorial boards of Distributed Computing (until 2003), Information Processing Letters, Theory of Computing Systems, Parallel Processing Letters, International Journal of Foundations of Computer Science, Journal of Computer and System Sciences (guest editor), and elsewhere. He has worked on cellular automata, computational complexity, distributed and parallel computing, machine learning and prediction, physics of computation, Kolmogorov complexity, and quantum computing. Together with Ming Li he pioneered applications of Kolmogorov complexity and co-authored “An Introduction to Kolmogorov Complexity and its Applications,” Springer-Verlag, New York, 1993 (2nd Edition 1997), parts of which have been translated into Chinese, Russian and Japanese.