# Information Distance Revisited

We consider the notion of information distance between two objects x and y introduced by Bennett, Gács, Li, Vitányi, and Zurek [1] as the minimal length of a program that computes x from y as well as computing y from x, and study different versions of this notion. It was claimed by Mahmud [11] that the prefix version of information distance equals max(K(x|y), K(y|x)) + O(1) (this equality with logarithmic precision was one of the main results of the paper by Bennett, Gács, Li, Vitányi, and Zurek). We show that this claim is false. [More will be added.]


## 1 Introduction

Informally speaking, Kolmogorov complexity measures the amount of information in an object (say, a bit string) in bits. The complexity C(x) of x is defined as the minimal bit length of a program that generates x. This definition depends on the programming language used, but one can fix an optimal language that makes the complexity function minimal up to an O(1) additive term. In a similar way one can define the conditional Kolmogorov complexity C(y|x) of a string y given some other string x as a condition. Namely, we consider the minimal length of a program that transforms x to y. Informally speaking, C(y|x) is the amount of information in y that is missing in x, the number of bits that we should give in addition to x if we want to specify y.

The notion of information distance was introduced in [1] as “the length of a shortest binary program that computes x from y as well as computing y from x.” It is clear that such a program cannot be shorter than C(x|y) or C(y|x) since it performs both tasks; on the other hand, it cannot be much longer than the sum of these two quantities (we can combine the programs that map x to y and vice versa, with a small overhead needed to separate the two parts and to distinguish x from y). As the authors of [1] note, “being shortest, such a program should take advantage of any redundancy between the information required to go from x to y and the information required to go from y to x”, and the natural question arises: to what extent is this possible? The main result of [1] gives the strongest upper bound possible and says that the information distance equals

 max(C(x|y),C(y|x))

with logarithmic precision.

In fact, in [1] the prefix version of complexity, denoted by K, and the corresponding definition of information distance were used; see, e.g., [14] for a detailed explanation of the different complexity definitions. The difference between the prefix and plain versions is logarithmic, so it does not matter which one we use if we are interested in results with logarithmic precision. However, several inequalities that are true with logarithmic precision for plain complexity become true with O(1)-precision if prefix complexity is used. So one could hope that a stronger result with O(1)-precision holds for prefix complexity. Such a claim was indeed made in [11]; in [10] a similar claim is made with reference to [1].³

³The authors of [10] define (section 2.2) the function E₁(x,y) as the prefix-free non-bipartite version of the information distance (see the discussion below in Section 4.1) and then write: “the following theorem proved in [4] was a surprise: Theorem 1. E₁(x,y) = max(K(x|y), K(y|x))”. They do not mention that in the paper they cited as [4] (it is [1] in our list) there is a logarithmic error term; in fact, they do not mention any error terms (though in other statements the constant term is written explicitly). Probably this is a typo, since the more general Theorem 2 in [10] does contain a logarithmic error term.

Unfortunately, the proof in [11] contains an error and (as we will show) the result is not valid for prefix complexity with O(1)-precision. On the other hand, it is easy to see that the original argument from [1] can be adapted for plain complexity to obtain the result with O(1)-precision, as noted in [15].

In this paper we try to clarify the situation and discuss the possible definitions of information distance in plain and prefix versions, and their subtle points (one of these subtle points was the source of an error in [11]). We also discuss some related notions. In Section 2 we consider the easier case of plain complexity; then in Section 3 we discuss the different definitions of prefix complexity (with prefix-free and prefix-stable machines, as well as definitions using the a priori probability) and in Section 4 we discuss their counterparts for the information distance. In Section 5 we use the game approach to show that indeed the relation between information distance (in the prefix version) and conditional prefix complexity is not valid with O(1)-precision, contrary to what is said in [11].

## 2 Plain complexity and information distance

Let us recall the definition of plain conditional Kolmogorov complexity. Let U be a computable partial function of two string arguments; its values are also binary strings. We may think of U as an interpreter of some programming language. The first argument is considered as a program and the second argument is an input for this program. Then we define the complexity function

 CU(y|x)=min{|p|:U(p,x)=y};

here |p| stands for the length of a binary string p, so the right hand side is the minimal length of a program that produces output y given input x. The classical Solomonoff–Kolmogorov theorem says that there exists an optimal U that makes CU minimal up to an O(1)-additive term. We fix some optimal U and then denote CU by just C. (See, e.g., [9, 14] for the details.)
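For concreteness, here is a toy illustration of this definition (our own, deliberately non-optimal, two-instruction interpreter; nothing here is taken from [9, 14]): CU(y|x) is found by brute-force search over all short programs.

```python
from itertools import product

def U(p: str, x: str):
    """A toy (non-optimal!) interpreter. The first program bit selects the
    operation, the rest is a literal: '0'+lit appends lit to the input x,
    '1'+lit ignores x and outputs lit."""
    if p == "":
        return None
    op, lit = p[0], p[1:]
    return x + lit if op == "0" else lit

def C_U(y: str, x: str, max_len: int = 12):
    """Brute-force the minimal program length C_U(y|x) for the toy U."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            if U("".join(bits), x) == y:
                return n
    return None  # no program of length <= max_len found

# Copying x costs one bit; producing y from scratch costs about |y| bits.
print(C_U("0101", "0101"))  # → 1
print(C_U("0101", ""))      # → 5
```

The optimality requirement in the Solomonoff–Kolmogorov theorem is exactly what this toy interpreter lacks: for a different U the values can be much larger, but an optimal U is minimal up to an additive constant.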

Now we want to define the information distance between x and y. One can try the following approach: take some optimal U from the definition of conditional complexity and then define

 EU(x,y)=min{|p|:U(p,x)=y and U(p,y)=x},

i.e., consider the minimal length of a program that both maps x to y and y to x. However, there is a caveat, as the following simple observation shows.

###### Proposition 1.

There exists a computable partial function U that makes CU minimal up to an O(1) additive term, and still EU(x,y) is infinite for some strings x and y, and therefore EU is not minimal.

###### Proof.

Consider an optimal function V and then define U such that U(Λ,x)=Λ, where Λ is the empty string, and U(bp,x)=bV(p,x) for each bit b. In other terms, U copies the first bit of the program to the output and then applies V to the rest of the program and the input. It is easy to see that CU is minimal up to an O(1) additive term, but U(p,x) has the same first bit as p, so if x and y have different first bits, there is no p such that U(p,x)=y and U(p,y)=x at the same time. ∎

On the other hand, the following proposition is true (and can be proven in the same way as the existence of the optimal U for conditional complexity):

###### Proposition 2.

There exists a computable partial function U that makes EU minimal up to an O(1) additive term.

Now we may define the information distance for plain complexity as the minimal function EU. It turns out that the original argument from [1] can be easily adapted to show the following result (that is a special case of a more general result about several strings proven in [15]):

###### Theorem 1.

The minimal function EU(x,y) equals max(C(x|y),C(y|x))+O(1).

###### Proof.

We provide the adapted proof here for the reader’s convenience. In one direction we have to prove that C(y|x) ⩽ EU(x,y)+O(1), and the same for C(x|y). This is obvious since the definition of EU contains more requirements for the program p (it should map both x to y and y to x, while in the definition of C(y|x) it is enough to map x to y).

To prove the reverse inequality, consider for each k the binary relation Ek on strings (of all lengths) defined as

 Ek={(x,y):C(x|y)<k and C(y|x)<k}.

By definition, this relation is symmetric. It is easy to see that Ek is (computably) enumerable uniformly in k, since we may compute better and better upper bounds for C, ultimately reaching its true value. We think of Ek as the set of edges of an undirected graph whose vertices are binary strings. Note that each vertex x of this graph has degree less than 2^k, since there are less than 2^k programs of length less than k that map x to its neighbors.

For each k, we enumerate the edges of this graph (i.e., the pairs in Ek). We want to assign colors to the edges of Ek in such a way that edges that have a common endpoint have different colors. In other terms, we require that for every vertex x all edges of Ek adjacent to x have different colors. For that, 2^{k+1} colors are enough. Indeed, each new edge needs a color that differentiates it from less than 2^k existing edges adjacent to one of its endpoints and less than 2^k edges adjacent to the other endpoint.

Let us agree to use (k+1)-bit strings as colors for edges in Ek, and perform this coloring in parallel for all k. Now we define U(p,x) for a (k+1)-bit string p and an arbitrary string x as the string y such that the edge (x,y) has color p in the coloring of edges from Ek. Note that k can be reconstructed as |p|−1. The uniqueness property for colors guarantees that there is at most one y such that (x,y) has color p, so U is well defined. It is easy to see now that if C(x|y)<k and C(y|x)<k, and p is the color of the edge (x,y), then U(p,x)=y and U(p,y)=x at the same time. This implies the reverse inequality (the O(1) term appears when we compare our U with the optimal one). ∎
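The coloring step in this proof is a simple online greedy procedure. The following sketch (the graph and the parameter k = 2 are illustrative choices of ours; the bound 2^{k+1} colors for degrees below 2^k is the one from the proof) shows that a fresh color always exists and that a color together with one endpoint determines the other endpoint, which is what makes the colors usable as programs.

```python
def greedy_edge_coloring(edges, num_colors):
    """Color edges online so that edges sharing an endpoint get different
    colors; if every vertex has degree < num_colors/2, a free color exists."""
    used = {}   # vertex -> set of colors already used on its incident edges
    color = {}
    for (u, v) in edges:
        forbidden = used.setdefault(u, set()) | used.setdefault(v, set())
        c = next(c for c in range(num_colors) if c not in forbidden)
        color[(u, v)] = c
        used[u].add(c)
        used[v].add(c)
    return color

# Degrees here are < 2**k with k = 2, so 2**(k+1) = 8 colors suffice.
edges = [("x", "y"), ("x", "z"), ("y", "z"), ("z", "w"), ("x", "w")]
color = greedy_edge_coloring(edges, 8)

def neighbor(vertex, c):
    """'U(p, x) = y': an endpoint plus a color determines the other endpoint."""
    matches = [e for e in edges if vertex in e and color[e] == c]
    assert len(matches) <= 1   # guaranteed by proper edge coloring
    return None if not matches else next(v for v in matches[0] if v != vertex)
```

Note that the same color p works in both directions: neighbor(x, p) = y and neighbor(y, p) = x, exactly as required from the program p in the proof.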

###### Remark 1.

In the definition of information distance given above we look for a program p that transforms x to y and also transforms y to x. Note that we do not tell the program which of the two transformations is requested. A weaker definition would provide this information to p as well. This modification can be done in several ways. For example, we may require in the definition of EU that U(0p,x)=y and U(1p,y)=x, using the first input bit as the direction flag. An equivalent approach is to use two computable functions U and U′ in the definition and require that U(p,x)=y and U′(p,y)=x. This corresponds to using different interpreters for the two directions.

It is easy to show that the optimal functions U and U′ exist for this (two-interpreter) version of the definition. A priori we may get a smaller value of information distance in this way (the program’s task is easier when the direction is known, informally speaking). But this is not the case, for the following simple reason. Obviously, this new quantity is still an upper bound for both conditional complexities C(y|x) and C(x|y) with O(1)-precision. Therefore Theorem 1 guarantees that this new definition of information distance coincides with the old one up to O(1) additive terms. (For the prefix versions of information distance such a simple argument does not work anymore, see below.)

We have seen that different approaches lead to the same (up to an O(1) additive term) notion of plain information distance. There is also a simple and natural quantitative characterization of this notion provided by the following theorem.

###### Theorem 2.

The function EU for the optimal U is the minimal, up to O(1) additive terms, upper semicomputable non-negative symmetric function E of two string arguments with natural values such that

 #{y:E(x,y)<n}<c⋅2^n

for some constant c and for all integers n and strings x.

Recall that upper semicomputability of E means that one can compute a sequence of total upper bounds for E that converges to E. An equivalent requirement: the set of triples (x,y,n), where x and y are strings and n is a natural number, such that E(x,y)<n, is (computably) enumerable.

###### Proof.

The function EU is upper semicomputable and symmetric. The counting inequality is true for it since it is true for the smaller function C(y|x) (with c=1: indeed, the number of programs of length less than n is at most 2^n).

On the other hand, if E is some symmetric upper semicomputable function that satisfies this counting condition, then one can, for any given x and n, enumerate all y such that E(x,y)<n. There are less than c⋅2^n strings y with this property, so each y can be described (given x) by a string of exactly n+d bits, for some constant d with 2^d ⩾ c: its ordinal number in the enumeration. Note that the value of n can be reconstructed from this string (by decreasing its length by d), so C(y|x) ⩽ n+O(1) whenever E(x,y)<n. It remains to apply the symmetry of E and Theorem 1. ∎
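The “ordinal number in the enumeration” encoding from this direction of the proof can be sketched as follows; the finite dictionary stands in for an upper semicomputable E and the sorted order stands in for the enumeration order (both are illustrative simplifications). The code word has exactly n + d bits, so n is recovered from its length alone.

```python
def ball(E, x, n):
    """Enumeration of {y : E(x,y) < n} in a fixed (here: sorted) order."""
    return sorted(y for (x2, y) in E if x2 == x and E[(x2, y)] < n)

def encode(E, x, y, d):
    """Describe y (given x) by its ordinal number in the enumeration of the
    ball of radius n, padded to exactly n + d bits (assumes 2**d >= c)."""
    n = E[(x, y)] + 1                    # smallest n with E(x,y) < n
    i = ball(E, x, n).index(y)           # ordinal number of y
    return format(i, "b").zfill(n + d)

def decode(E, x, code, d):
    n = len(code) - d                    # n is recovered from the length
    return ball(E, x, n)[int(code, 2)]

# a tiny symmetric "distance" table for one fixed x = "a" (illustrative)
E = {("a", "y1"): 0, ("a", "y2"): 1, ("a", "y3"): 1}
print(encode(E, "a", "y2", 1))   # → 001
```

The padding is the whole point: without it the description would not determine n, and n is needed to know which enumeration to use when decoding.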

###### Remark 2.

The name “information distance” motivates the following question: does the plain information distance satisfy the triangle inequality? For the logarithmic precision the answer is positive, because

 C(x|z)⩽C(x|y)+C(y|z)+O(log(C(x|y)+C(y|z))).

However, if we replace the last term by an O(1)-term, then the triangle inequality is no longer true. Indeed, for all strings x and y the distance between the empty string and x is C(x)+O(1), the distance between x and some encoding of the pair (x,y) is at most C(y|x)+O(1), and the triangle inequality for distances with O(1)-precision would imply C(x,y) ⩽ C(x)+C(y|x)+O(1), and this is not true, see, e.g., [14, section 2.1].

One may ask whether a weaker statement is true: is there a maximal (up to an O(1) additive term) function in the class of all symmetric non-negative functions that satisfy both the counting condition of Theorem 2 and the triangle inequality? The answer is negative, as the following proposition shows.

###### Proposition 3.

There are two upper semicomputable symmetric functions E1, E2 that both satisfy the counting condition and the triangle inequality, such that no function E that is bounded both by E1 and by E2 can satisfy the counting condition and the triangle inequality at the same time.

###### Proof.

Let us agree that E1(x,y) and E2(x,y) are infinite when x and y have different lengths. If x and y are n-bit strings, then E1(x,y) ⩽ k means that all the bits in x and y outside the first k positions are the same, and E2 is defined in a symmetric way (for the last k positions). Both E1 and E2 satisfy the triangle inequality (and even the ultrametric inequality) and also satisfy the counting condition, since the ball of radius k consists of the 2^k strings that coincide with the center except for the first/last k bits. If E is bounded both by E1 and E2 and satisfies the triangle inequality, then by changing the first a and the last b positions in a string x we get a string y such that E(x,y) ⩽ a+b, and it is easy to see that the number of strings that can be obtained in this way for a+b ⩽ m is not O(2^m), but Ω(m⋅2^m), so the counting condition fails for E. ∎
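The counting effect at the end of this proof can be observed by brute force for small n (the parameters n = 10 and m = 5 below are arbitrary choices of ours): we count the strings y whose differences from x fit into a prefix of length a and a suffix of length b with a + b ⩽ m, and compare with 2^m.

```python
from itertools import product

def cost(x, y):
    """Minimal a+b such that x and y differ only in the first a and the
    last b positions (the strings are assumed to have equal length)."""
    n = len(x)
    diffs = [i for i in range(n) if x[i] != y[i]]
    if not diffs:
        return 0
    best = n
    for a in range(n + 1):
        tail = [i for i in diffs if i >= a]        # diffs not covered by the prefix
        b = (n - min(tail)) if tail else 0         # suffix length needed for them
        best = min(best, a + b)
    return best

n = 10
x = "0" * n
counts = {}
for bits in product("01", repeat=n):
    m = cost(x, "".join(bits))
    counts[m] = counts.get(m, 0) + 1

reachable = lambda m: sum(counts.get(j, 0) for j in range(m + 1))
print(reachable(5), 2 ** 5)   # the first number exceeds the second
```

Already the two extreme splits (a, b) = (5, 0) and (0, 5) give 2·2^5 − 1 distinct strings within radius 5, and the intermediate splits push the count further, in line with the Ω(m·2^m) bound.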

## 3 Prefix complexity: different definitions

The notion of prefix complexity was introduced independently by Levin [5, 7, 3] and later by Chaitin [2]. There are several versions of this definition, and they all turn out to be equivalent, so people usually do not care much about technical details that are different. However, if we want to consider the counterparts of these definitions for information distance, the difference becomes important if we are interested in O(1)-precision.

Essentially there are four different definitions of prefix complexity that appear in the literature.

### 3.1 Prefix-free definition

A computable partial function U with two string arguments and string values is called prefix-free (with respect to the first argument) if U(p,x) and U(p′,x) cannot both be defined for a string p′ and its prefix p and for the same second argument x. In other words, for every string x the set of strings p such that U(p,x) is defined is prefix-free, i.e., does not contain a string and its prefix at the same time.

For a prefix-free function U we may consider the complexity function CU. In this way we get a smaller class of complexity functions (compared with the definition of plain complexity discussed above), and the Solomonoff–Kolmogorov theorem can be easily modified to show that there exists a minimal complexity function in this smaller class (up to an O(1) additive term, as usual). This function is called prefix conditional complexity and is usually denoted by K(y|x). It is greater than C(y|x) since the class of available functions is more restricted; the relation between K and C is well studied (see, e.g., [14, chapter 4] and references within).

The unconditional prefix complexity K(x) is defined in the same way, with U that does not have a second argument. We can also define K(x) as K(x|z) for some fixed string z. This string may be chosen arbitrarily; for each choice we have K(x)=K(x|z)+O(1), but the constant in the O(1) bound depends on the choice of z.
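As an illustration of what a prefix-free domain looks like, here is a minimal self-delimiting code (a standard toy example, far from optimal): the length of x is written in unary, then a terminating zero, then x itself. No codeword is a prefix of another, so a decoder using blocking reads can stop without ever seeing an end-of-input marker; this particular code only shows the weak bound K(x) ⩽ 2|x| + O(1).

```python
def encode(x: str) -> str:
    """Self-delimiting code: |x| in unary, a terminating 0, then x itself."""
    return "1" * len(x) + "0" + x

def decode(stream: str):
    """Blocking-read decoder: consume exactly one codeword, return (x, rest)."""
    n = 0
    while stream[n] == "1":   # read the unary length prefix
        n += 1
    body = stream[n + 1 : n + 1 + n]
    return body, stream[n + 1 + n :]

x, rest = decode(encode("1101") + "00111")
print(x, rest)   # → 1101 00111
```

Because the decoder knows where to stop, codewords can be concatenated and parsed unambiguously, which is exactly the "self-delimiting" behavior the prefix-free definition formalizes.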

### 3.2 Prefix-stable definition

The prefix-stable version of the definition considers another restriction on the function U. Namely, in this version the function U should be prefix-stable with respect to the first argument. This means that if U(p,x) is defined, then U(p′,x) is defined and equal to U(p,x) for all p′ that are extensions of p (i.e., when p is a prefix of p′). We consider the class of all computable partial prefix-stable functions and the corresponding functions CU, and observe that there exists an optimal function U that makes CU minimal in this class (i.e., among CU for prefix-stable functions U).

It is rather easy to see that the prefix-stable definition leads to a version of complexity that does not exceed the prefix-free one (each prefix-free computable function can be easily extended to a prefix-stable one). The reverse inequality is not so obvious and there is no known direct proof; the standard argument compares both versions with the third one (the logarithm of a maximal semimeasure, see Section 3.4 below for this definition).

Prefix-free and prefix-stable definitions correspond to the same intuitive idea: the program should be “self-delimiting”. This means that the machine gets access to an infinite sequence of bits that starts with the program and has no marker indicating the end of a program. The prefix-free and prefix-stable definitions correspond to two possible ways of accessing this sequence. The prefix-free definition corresponds to a blocking read primitive (if the machine needs one more input bit, the computation waits until this bit is provided). The prefix-stable definition corresponds to a non-blocking read primitive (the machine has access to the input bits queue and may continue computations if the queue is currently empty). We do not go into details here; the interested reader could find this discussion in [14, section 4.4].

### 3.3 A priori probability definition

In this approach we consider the a priori probability of y given x, i.e., the probability of the event “a random program maps x to y”. More precisely, consider a prefix-stable function U and an infinite sequence π of independent uniformly distributed random bits (a random variable). We say that U(π,x)=y if U(p,x)=y for some p that is a prefix of π. Since U is prefix-stable, the value U(π,x) is well defined. For given x and y, we denote by mU(y|x) the probability of this event (the measure of the set of π such that U(π,x)=y). For each prefix-stable U we get some function mU. It is easy to see that there exists an optimal U that makes mU maximal (up to an O(1)-factor). Then we define the prefix complexity K(y|x) as −log mU(y|x) for this optimal U.

It is also easy to see that prefix-free functions (used instead of prefix-stable ones) lead to the same definition of prefix complexity. Informally speaking, if we have an infinite sequence of random bits as the first argument, we do not care whether we have blocking or non-blocking read access, the bits are always there. The non-trivial and most fundamental result about prefix complexity is that this definition (as logarithm of the probability) is equivalent to the two previous ones. As a byproduct of this result we see that the prefix-free and prefix-stable definitions are equivalent. This proof and the detailed discussion of the difference between the definitions can be found, e.g., in [14, chapter 4].

### 3.4 Semimeasure definition

The semimeasure approach defines a priori probability in a different way, as a convergent series that converges as slowly as possible. More precisely, a lower semicomputable semimeasure is a non-negative real-valued function m on binary strings such that m(x) is a limit of a computable (uniformly in x) increasing sequence of rational numbers and ∑x m(x) ⩽ 1. There exists a maximal (up to an O(1)-factor) lower semicomputable semimeasure m, and its negative logarithm −log m(x) coincides with the (unconditional) prefix complexity K(x) up to an O(1) additive term.

We can define conditional prefix complexity in the same way, considering semimeasures with a parameter x. Namely, we consider lower semicomputable non-negative real-valued functions m(y|x) such that ∑y m(y|x) ⩽ 1 for every x. Again there exists a maximal function among them, denoted by m(y|x), and its negative logarithm −log m(y|x) equals K(y|x) up to an O(1) additive term.

To prove this equality, we note first that the a priori conditional probability mU(y|x) is a lower semicomputable conditional semimeasure. The lower semicomputability is easy to see: we can simulate the machine U and discover more and more programs that map x to y. The inequality ∑y mU(y|x) ⩽ 1 also has a simple probabilistic meaning: the events “π maps x to y” for a given x and different y are disjoint, so the sum of their probabilities does not exceed 1. The other direction (starting from a semimeasure, construct a machine) is a bit more difficult, but in fact it is possible (even exactly, without additional O(1)-factors). See [14, chapter 4] for details.

The semimeasure definition can be reformulated in terms of complexities (by taking exponents): K(x|y) is the minimal (up to an O(1) additive term) upper semicomputable non-negative integer function k(x,y) such that

 ∑x2−k(x,y)⩽1

for all y. A similar characterization of plain complexity would use the weaker requirement

 #{x:k(x,y)<n}<c⋅2^n

for some constant c and for all n and y. (We discussed a similar result for information distance where the additional symmetry requirement was used, but the proof is the same.)
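The link between prefix-free program sets and the convergence requirement above is the Kraft inequality: the lengths of any prefix-free set of strings satisfy ∑ 2^{−|p|} ⩽ 1, which is why the domains of prefix-free machines give rise to semimeasures. A quick check on a toy code set:

```python
def is_prefix_free(codes):
    """Check that no codeword is a proper prefix of another."""
    return not any(p != q and q.startswith(p) for p in codes for q in codes)

codes = {"0", "10", "110", "111"}            # a complete prefix-free set
assert is_prefix_free(codes)

# Kraft inequality: the lengths of any prefix-free set satisfy this bound.
kraft = sum(2.0 ** -len(p) for p in codes)
print(kraft)   # → 1.0
```

For the set above the Kraft sum equals 1 exactly (the code is complete); dropping any codeword makes the sum strictly smaller, while adding any new binary string would break prefix-freeness.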

### 3.5 Warning

There exists a definition of plain conditional complexity that does not have a prefix-version counterpart. Namely, the plain conditional complexity C(y|x) can be equivalently defined as the minimal unconditional plain complexity of a program that maps x to y. In this way we do not need the programming language used to map x to y to be optimal; it is enough to assume that we can computably translate programs in other languages into our language; this property, sometimes called the s-m-n theorem or the Gödel property of a computable numbering, is true for almost all reasonable programming languages. Of course, we still assume that the language used in the definition of unconditional Kolmogorov complexity is optimal.

One may hope that K(y|x) can be similarly defined as the minimal (unconditional) prefix complexity of a program that maps x to y. The following proposition shows that this is not the case.

###### Proposition 4.

The prefix complexity K(y|x) does not exceed (up to an O(1) additive term) the minimal prefix complexity of a program that maps x to y; however, the difference between these two quantities is not bounded.

###### Proof.

To prove the first part, assume that U1 is a prefix-stable function of one argument that makes the complexity function

 CU1(q)=min{|p|:U1(p)=q}

minimal. Then K(q)=CU1(q)+O(1). (We still need an O(1) term since the choice of an optimal prefix-stable function is arbitrary.) Then consider the function

 U2(p,x)=[U1(p)](x)

where [q](x) denotes the output of a program q on input x. Then U2 is a prefix-stable function from the definition of conditional prefix complexity, and

 CU2(y|x)⩽CU1(q)

for any program q that maps x to y (i.e., [q](x)=y). This gives the inequality mentioned in the proposition. Now we have to show that this inequality is not an equality with O(1)-precision.

Note that K(x|n) ⩽ n+O(1) for every binary string x of length n. Indeed, a prefix-stable (or prefix-free) machine that gets n as input can copy the first n bits of its program to the output. (The prefix-free machine should check that there are exactly n input bits.) In this way we get n-bit programs for all strings of length n.

Now assume that the two quantities coincide up to an O(1) additive term. Then for every string x there exists a program qx that maps the length |x| to x such that K(qx) ⩽ |x|+c for some c and all x. Note that qx may be equal to qy for x ≠ y, but this may happen only if x and y have different lengths. Consider now the set Q of all qx for all strings x, and the series

 ∑q∈Q2−K(q).

This sum does not exceed 1 (it is a part of the similar sum over all strings q, which is at most 1, see above). On the other hand, for each n we have 2^n different programs qx for the n-bit strings x, and they correspond to different terms in this sum; each of these terms is at least 2^{−n−c}. We get a converging series that contains, for every n, at least 2^n terms of size at least 2^{−n−c}. It is easy to see that such a series does not exist. Indeed, each tail of this series should be at least 2^{−c−1} (consider these 2^n terms for large n, when at least half of them are in the tail), and this is incompatible with convergence. ∎

Why do we get a bigger quantity when considering the prefix complexity of a program that maps x to y? The reason is that the prefix-freeness (or prefix-stability) requirement for the function U(p,x) is formulated separately for each x: the decision where to stop reading the program p may depend on its input x. This is not possible for a prefix-free description of a program that maps x to y. It is easy to overlook this problem when we informally describe prefix complexity as “the minimal length of a program, written in a self-delimiting language, that maps x to y”, because the words “self-delimiting language” implicitly assume that we can determine where the program ends while reading the program text (and before we know its input), and this is a wrong assumption.
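The copy machine from the proof of Proposition 4 makes this concrete. In the sketch below (our own minimal model), the condition is the desired output length n; for each fixed n the domain is prefix-free (an antichain of strings), but the union of the domains over all n contains a string together with its proper extension, so no single self-delimiting language can contain all these programs.

```python
def U(p: str, n: int):
    """Conditional machine: given condition n, accept exactly the programs
    of length n and copy them to the output. For each fixed n the domain
    {p : |p| = n} is prefix-free; across different n it is not."""
    return p if len(p) == n else None

# For a fixed condition n = 3 the domain is an antichain:
assert U("10", 3) is None and U("101", 3) == "101"

# But across conditions the domains overlap as prefixes:
assert U("10", 2) == "10" and U("101", 3) == "101"   # "10" is a prefix of "101"
```

This is precisely the freedom that the conditional definition of K(y|x) grants and the "prefix complexity of a program" definition lacks: where the program ends may depend on the condition.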

### 3.6 Historical digression

Let us comment a bit on the history of prefix complexity. It appeared first in 1971 in Levin’s PhD thesis [5]; Kolmogorov was his thesis advisor. Levin used essentially the semimeasure definition (formulated a bit differently). This thesis remained unpublished for a very long time (and it was in Russian). In 1974 Gács’ paper [3] appeared where the formula for the prefix complexity of a pair was proven. This paper mentioned prefix complexity as “introduced by Levin in [4], [5]” ([6] and [7] in our numbering). The first of these two papers does not say anything about prefix complexity explicitly, but defines the monotone complexity of sequences of natural numbers, and prefix complexity can be considered as the special case when the sequence has length 1 (this is equivalent to the prefix-stable definition of prefix complexity). The second paper (we discuss it later in this section) is marked “(to appear)” in Gács’ paper.

Gács does not reproduce the definition of prefix complexity, saying only that it is “defined as the complexity of specifying x on a machine on which it is impossible to indicate the endpoint [the English translation says “halting” instead of “endpoint”, but this is an obvious translation error] of a master program: an infinite sequence of binary symbols enters the machine and the machine must itself decide how many binary symbols are required for its computation”. This description is not completely clear, but it looks more like a prefix-free definition (if we understand it in such a way that the program is written on a one-directional tape and the machine decides where to stop reading). Gács also notes that prefix complexity (he denotes it by KP) “is equal to the [negative] base two logarithm of a universal semicomputable probability measure that can be defined on the countable set of all words”.

Levin’s 1974 paper [7] says that “the quantity KP(x) has been investigated in details in [6,7]”. Here [7] in Levin’s numbering is Gács’ paper cited above ([3] in our numbering) and has the comment “in press”, and [6] in Levin’s numbering is cited as “Левин Л.А., О различных видах алгоритмической сложности конечных объектов (в печати)” [Levin L.A., On different versions of algorithmic complexity of finite objects, to appear]. Levin does not have a paper with exactly this title, but the closest approximation is his 1976 paper [8], where prefix complexity is defined as the logarithm of a maximal semimeasure. Except for these references, [7] describes the prefix complexity in terms of prefix-stable functions: “It differs from the Kolmogorov complexity measure in that the decoding algorithm A has the following “prefix” attribute: if A(p1) and A(p2) are defined and distinct, then p1 cannot be a beginning fragment of p2”.

The prefix-free and a priori probability definitions were given independently by Chaitin in [2] (in different notation) together with the proof of their equivalence, so [2] was the first publication containing this (important) proof.

Now it seems that the most popular definition of prefix complexity is the prefix-free one (it is given as the main definition in [9], for example).

## 4 Prefix complexity and information distance

### 4.1 Four versions of prefix information distance

Both the prefix-free and prefix-stable versions of prefix complexity have their counterparts for the information distance.

Let U be a partial computable prefix-free [prefix-stable] function of two string arguments having string values. Consider the function

 EU(x,y)=min{|p|:U(p,x)=y and U(p,y)=x}

As before, one can easily prove that there exists a minimal (up to O(1)) function EU among all functions of the class considered. It will be called the prefix-free [resp. prefix-stable] information distance function.

Note that only the cases when U(p,x)=y and also U(p,y)=x matter for EU. So we may assume without loss of generality that U waits until both equalities are true before finalizing the values U(p,x) and U(p,y). Then for every p we have some matching Mp on the set of all strings: an edge (x,y) is in Mp if U(p,x)=y and U(p,y)=x. This is indeed a matching: for every x only one y may be connected with x.

The set Mp is enumerable uniformly in p. In the prefix-free version the matchings Mp and Mq are disjoint (have no common vertices) for two compatible strings p and q (one is an extension of the other). For the prefix-stable version Mp increases when p increases (and remains a matching). It is easy to see that a family of matchings Mp that has these properties always corresponds to some function U (here we have two statements: one for the prefix-free and one for the prefix-stable version).

There is another way in which this definition could be modified. As we have discussed for the plain complexity, we may consider two different functions U and U′ and consider the distance function

 EU,U′(x,y)=min{|p|:U(p,x)=y and U′(p,y)=x}.

Intuitively this means that the program knows the transformation direction in addition to the input string. This corresponds to matchings in a bipartite graph where both parts consist of all binary strings; the edge (x,y) is in the matching Mp if U(p,x)=y and U′(p,y)=x. Again, instead of the pair (U,U′) we may consider the family of matchings Mp that are disjoint (for compatible p, in the prefix-free version) or monotone (in the prefix-stable version). In this way we get two other versions of information distance that could be called bipartite prefix-free and bipartite prefix-stable information distances.

In [1] the information distance is defined as the prefix-free information distance (with the same function U for both directions, not two different ones). The definition (section III) considers the minimal function among all EU. This minimal function is denoted by E1(x,y) (while max(K(x|y),K(y|x)) is denoted by E(x,y), see section I of the same paper). The inequality E(x,y) ⩽ E1(x,y)+O(1) is obvious, and the reverse inequality with logarithmic precision is proven in [1] as Theorem 3.3.

Which of the four versions of prefix information distance is the most natural? Are they really different? It is easy to see that the prefix-stable version (bipartite or not) does not exceed the corresponding prefix-free version, since every prefix-free function has a prefix-stable extension. Also each bipartite version (prefix-free or prefix-stable) does not exceed the corresponding non-bipartite version for obvious reasons (one may take U′=U). It is hard to say which version is the most natural, and the question whether some of them coincide, or all four are different, remains open. But as we will see (Theorem 4), the smallest of all four, the prefix-stable bipartite version, is still bigger than max(K(x|y),K(y|x)) (the maximum of conditional complexities), and the difference is unbounded, so for all four versions (including the prefix-free non-bipartite version used in [1, 10, 11]) the equality with O(1)-precision is not true, contrary to what is said in [11].

However, before turning to this negative result, we prove some positive results about the definition of information distance that is a counterpart of the a priori probability definition of prefix complexity.

### 4.2 A priori probability of going back and forth

Fix some prefix-free function U. The conditional a priori probability m_U(y|x) is defined as

 m_U(y|x) = Pr_π[U(π,x)=y],

where π is a uniformly random element of the Cantor space, and U(π,x)=y means that U(p,x)=y for some p that is a prefix of π. As we discussed, there exists a maximal function among all m_U, and its negative logarithm equals the conditional prefix complexity K(y|x).
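As a sanity check on this definition (our own sketch, not from the paper), the probability Pr_π[U(π,x)=y] can be computed exactly for a hypothetical finite table U by enumerating all sufficiently long finite prefixes of π.

```python
from itertools import product

def prefixes(w):
    return [w[:i] for i in range(len(w) + 1)]

def m_U(U, x, y, n):
    """Exact value of Pr_pi[U(pi, x) = y] for a finite table U, valid
    whenever every program in the table is shorter than n bits: count the
    length-n strings w that have a prefix p with U(p, x) = y."""
    hits = sum(1 for bits in product('01', repeat=n)
               if any(U.get((p, x)) == y for p in prefixes(''.join(bits))))
    return hits / 2 ** n

# Hypothetical prefix-free table: programs '0' and '10' both map x to y.
U = {('0', 'x'): 'y', ('10', 'x'): 'y'}
print(m_U(U, 'x', 'y', 3))  # measure of [0] plus [10]: 1/2 + 1/4 = 0.75
```

Note that the answer does not depend on n once n exceeds all program lengths, which is exactly why the measure of the corresponding effectively open set is well defined.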

Now let us consider the counterpart of this construction for the information distance. The natural way to do this is to consider the function

 e_U(x,y) = Pr_π[U(π,x)=y and U(π,y)=x].

Note that in this definition the prefixes of π used for the two computations are not necessarily the same. It is easy to show, as usual, that there exists an optimal machine U that makes e_U maximal. Fixing some optimal U, we get some function e(x,y) (different optimal machines lead to functions that differ only by an O(1)-factor). The negative logarithm of this function coincides with E_0 = max(K(x|y),K(y|x)) from [1] with O(1)-precision, as the following result says.

###### Theorem 3.

−log e(x,y) = max(K(x|y), K(y|x)) + O(1).

###### Proof.

Rewriting the right-hand side in the exponential scale, we need to prove that

 e(x,y) = min(m(x|y), m(y|x))

up to O(1)-factors. One direction is obvious: e(x,y) does not exceed m(y|x), since the set of π in the definition of e_U is a subset of the corresponding set for m_U, if we use the probabilistic definition of m. The same is true for m(x|y).

The non-trivial part of the statement is the reverse inequality. Here we need to construct a machine U such that

 e_U(x,y) ⩾ min(m(x|y), m(y|x))

up to O(1)-factors.

Let us denote the right-hand side by ε(x,y). The function ε is symmetric and lower semicomputable, and ∑_y ε(x,y) ⩽ 1 for all x (due to the symmetry, we do not need the other inequality where y is fixed). This is all we need to construct U with the desired properties; in fact e_U(x,y) will be at least ε(x,y)/2 (and the factor 1/2 is important for the proof).

Every machine U has a “dual” representation: for every pair (x,y) one may consider the subset B_{x,y} of the Cantor space that consists of all π such that U(π,x)=y and U(π,y)=x. These sets are effectively open (i.e., are computably enumerable unions of intervals in the Cantor space) uniformly in (x,y), are symmetric (B_{x,y} = B_{y,x}), and have the following property: for a fixed x, the sets B_{x,y} for all y (including y = x) are disjoint.

What is important to us is that this correspondence works in both directions. If we have some family A_{x,y} of uniformly effectively open sets that is symmetric and has the disjointness property mentioned above, there exists a prefix-free machine U that generates these sets as described above. This machine works as follows: given some x, it enumerates the intervals that form A_{x,y} for all y (this is possible since the sets are effectively open uniformly in x and y). One may assume without loss of generality that all the intervals in the enumeration are disjoint. Indeed, every effectively open set can be represented as a union of a computable sequence of disjoint intervals (to make intervals disjoint, we represent the set difference between the last interval and the previously generated intervals as a finite union of intervals). Note also that for different values of y the sets A_{x,y} are disjoint by assumption. If the enumeration for x contains the interval [p] (the set of all extensions of some bit string p) as a part of A_{x,y}, then we let U(p,x)=y and U(p,y)=x (we assume that the same enumeration is used for (x,y) and (y,x)). Since all intervals are disjoint, the function U is prefix-free.

Now it remains (and this is the main part of the proof) to construct the family A_{x,y} with the required properties in such a way that the measure of A_{x,y} is at least ε(x,y)/2. In our construction it will be exactly ε(x,y)/2. For that we use the same idea as in [1], but in the continuous setting. Since ε is lower semicomputable, we may consider an increasing sequence of approximations from below (that increase with time, though we do not explicitly mention time in the notation) that converge to ε. We assume that at each step one of the values ε(x,y) increases by a dyadic rational number δ. In response to that increase, we add to A_{x,y} one or several intervals that have total measure δ/2 and do not intersect the already chosen parts of A_{x,z} and A_{y,z} for any z. For that we consider the union of all already chosen parts of the sets A_{x,·} and the union of all chosen parts of the sets A_{y,·}. The measure of the first union is bounded by s_x/2 and the measure of the second union is bounded by s_y/2, where s_x and s_y are the lower bounds for ∑_z ε(x,z) and ∑_z ε(y,z) obtained before the δ-increase. Since these sums remain bounded by 1 after the δ-increase, the two unions together leave free space of measure at least δ, and we may select a subset of measure δ/2 outside both unions. (We may even select a subset of measure δ, but this would destroy the construction at the following steps, so we add only δ/2.) ∎
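The interval bookkeeping in this construction can be sketched in code (our own toy version: measures are counted in atoms of size 2^{-g} for a fixed granularity g, and the names `allocate` and `atoms` are ours).

```python
from itertools import product

def atoms(g):
    """All dyadic intervals of size 2^-g, encoded as binary strings."""
    return [''.join(b) for b in product('01', repeat=g)]

def allocate(A, x, y, delta_atoms, g):
    """One step of the construction: epsilon(x,y) grew by delta
    (= delta_atoms * 2^-g); add fresh intervals of total measure delta/2
    to A[(x,y)], avoiding everything already chosen for x or for y."""
    used = {p for (u, v), ps in A.items() if x in (u, v) or y in (u, v)
            for p in ps}
    free = [a for a in atoms(g) if a not in used]
    need = delta_atoms // 2
    assert len(free) >= need, "invariant of the construction is violated"
    A.setdefault((x, y), set()).update(free[:need])

A = {}
allocate(A, 'x', 'y', 4, 4)   # epsilon(x,y) grew by 4/16, allocate 2/16
allocate(A, 'x', 'z', 4, 4)   # epsilon(x,z) grew by 4/16, allocate 2/16
# the sets attached to edges adjacent to x are disjoint, as required
print(len(A[('x', 'y')] & A[('x', 'z')]))  # 0
```

The assertion mirrors the counting argument above: as long as each edge consumes only half of its ε-increase, free atoms always remain.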

###### Remark 3.

As for the other settings, we may consider two functions U and U′ and the probability of the event

 e_{U,U′}(x,y) = Pr_π[U(π,x)=y and U′(π,y)=x]

for those U and U′ that make this probability maximal. The equality of Theorem 3 remains valid for this version. Indeed, the easy part can be proven in the same way, and for the difficult direction we have proven a stronger statement with the additional requirement U′ = U.

One can also describe the function e(x,y) as a maximal function in some class, thereby getting a quantitative characterization of max(K(x|y),K(y|x)). This is essentially the statement of Theorem 4.2 in [1]. In terms of semimeasures it can be reformulated as follows.

###### Proposition 5.

Consider the class of symmetric lower semicomputable functions e(x,y) with string arguments and non-negative real values such that ∑_y e(x,y) ⩽ 1 for all x. This class has a maximal function that coincides with min(m(x|y), m(y|x)) up to an O(1)-factor.

(Indeed, we have already seen that this minimum has the required properties; if some other function e in this class is given, we compare it with the conditional semimeasures m(y|x) and m(x|y) and conclude that e exceeds neither of them, up to an O(1)-factor.)

In logarithmic scale this statement can be reformulated as follows: the class of upper semicomputable symmetric functions D(x,y) with string arguments and real values such that ∑_y 2^{−D(x,y)} ⩽ 1 for each x, has a minimal element that coincides with max(K(x|y),K(y|x)) up to an O(1) additive term. Theorem 4.2 in [1] says the same with an additional condition on D: it should satisfy the triangle inequality. This restriction makes the class smaller and could increase the minimal element in the class, but this does not happen, since the function

 max(K(x|y),K(y|x)) + c

satisfies the triangle inequality for large enough c. Indeed, max(K(x|y),K(y|x)) satisfies the triangle inequality up to an O(1) additive term, and adding c increases the left-hand side by c and the right-hand side by 2c, so a large enough c absorbs the O(1) term.

###### Remark 4.

To be pedantic, we have to note that in [1] an additional condition D(x,x)=0 is required for the functions in the class; to make this possible, one has to exclude the term for y = x in the sum (now this term equals 2^{−D(x,x)} = 1) and require that ∑_{y≠x} 2^{−D(x,y)} ⩽ 1 (p. 1414, the last inequality). Note that the triangle inequality remains valid if we change D and let D(x,x)=0 for all x.

## 5 A counterexample

In this section we prove the main negative (and most technically difficult) result of this paper, which shows that none of the four prefix distances coincides with max(K(x|y),K(y|x)) up to an O(1) additive term.

###### Theorem 4.

The bipartite prefix-stable information distance exceeds max(K(x|y),K(y|x)) by more than a constant: the difference is unbounded.

As we have mentioned, the other three versions of the information distance are even bigger, so the same result is true for all of them. We will explain the proof for the non-bipartite prefix-stable version (it is a bit easier and less notation is needed) and then explain the changes needed for the bipartite prefix-stable version.

The proof uses the game approach (see [13, 12] for the general context, but the proof is self-contained). In the next section (5.1) we explain the game rules and prove that a computable winning strategy in the game implies that the difference is unbounded, and then (in Section 5.2) we explain the strategy. Finally (in Section 5.3) we discuss the modifications needed for the bipartite case.

### 5.1 It is enough to win a game

Consider the following two-player full information game. Fix some parameter ε, a positive rational number. The game field is the complete graph on a countable set (no loops); we use binary strings as graph vertices. Alice and Bob take turns.

Alice increases weights of the graph edges. We denote the weight of the edge connecting vertices u and v by m_{u,v} (here u ≠ v and m_{u,v} = m_{v,u}). Initially all m_{u,v} are zeros. At each move Alice may increase the weights of finitely many edges, using rational numbers as new weights. The weights should satisfy the inequality ∑_{v≠u} m_{u,v} ⩽ 1 for every u (the total weight of the edges adjacent to some vertex should not exceed 1).

Bob assigns some subsets of the Cantor space to edges. For each u and v (where u ≠ v) the set B_{u,v} = B_{v,u} assigned to the edge {u,v} is a clopen subset of the Cantor space (clopen subsets are subsets that are closed and open at the same time, i.e., finite unions of intervals in the Cantor space). Initially all B_{u,v} are empty. At each move Bob may increase the sets assigned to finitely many edges (using arbitrary clopen sets that contain the previous ones). For every u, the sets B_{u,v} (for all v ≠ u) should be disjoint.

The game is infinite, and the winner is determined in the limit (assuming that both Alice and Bob follow the rules). Namely, Bob wins if for every u and v (where u ≠ v) the limit value of B_{u,v} (the union of the increasing sequence of Bob’s labels for the edge {u,v}) contains an interval in the Cantor space whose size is at least ε·m_{u,v} (the limit value of Alice’s label for {u,v}, multiplied by ε). Recall that an interval in the Cantor space is the set [p] of all extensions of some string p, and its size is 2^{−|p|}.
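Bob's winning condition on one edge is easy to check mechanically once labels are represented as unions of dyadic intervals (binary strings p standing for [p]). The sketch below (ours; the function names are hypothetical) finds the largest interval contained in such a union by merging sibling intervals.

```python
def max_interval_size(intervals):
    """Size of the largest dyadic interval [p] contained in the union of
    the given dyadic intervals (binary strings p)."""
    s = set(intervals)
    merged = True
    while merged:                      # repeatedly merge sibling intervals
        merged = False
        for p in list(s):
            if not p or p not in s:
                continue
            sib = p[:-1] + ('1' if p[-1] == '0' else '0')
            if sib in s:
                s -= {p, sib}
                s.add(p[:-1])          # [p0] + [p1] = [p]
                merged = True
    if not s:
        return 0.0
    return max(2.0 ** -len(p) for p in s)

def bob_meets_requirement(B_uv, m_uv, eps):
    """Bob's limit requirement on the edge {u, v}."""
    return max_interval_size(B_uv) >= eps * m_uv

print(max_interval_size(['00', '01', '10']))  # [00] and [01] merge into [0]
```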

We claim that the existence of a computable (uniformly in ε) winning strategy for Alice in this game is enough to prove Theorem 4. But first let us make some remarks on the game rules.

###### Remark 5.

Increasing the constant ε, we make Bob’s task more difficult and Alice’s task easier. So our claim says that Alice can win the game even for arbitrarily small (positive) values of ε.

###### Remark 6.

In our definition the result of the game is determined by the limit values of m_{u,v} and B_{u,v}, so both players may postpone their moves. Two consequences of this observation will be used. First, we may assume that Bob always keeps B_{u,v} empty while m_{u,v} = 0 (he may postpone his move). Second, we may assume that Bob has to satisfy the requirement (an interval of size at least ε·m_{u,v} inside B_{u,v}) after each of his moves. Indeed, Alice may wait until this requirement is satisfied by Bob: if this never happens, Alice wins the game in the limit (due to compactness: if an infinite family of intervals covers some interval in the Cantor space, a finite subfamily covers it, too).

Now let us assume that Alice has a (uniformly) computable strategy winning the game for every ε. Since the factor ε is arbitrary, we may strengthen the requirement for Alice and require that ∑_{v≠u} m_{u,v} ⩽ 2^{−k} for every u; this corresponds to a 2^k times smaller factor in the original game (after the weights are rescaled to total weight 1). Given some integer k, consider Alice’s winning strategy for the factor 2^{−k} and the total weight 2^{−k}. We play all these strategies simultaneously against a “blind” strategy for Bob that ignores Alice’s moves and just follows the optimal machine used in the definition of information distance. Here are the details.

Consider the prefix-stable function U that makes the function

 E_U(u,v) = min{|p| : U(p,u)=v and U(p,v)=u}

minimal. For each edge {u,v} consider the union of the intervals [p] for all p such that U(p,u)=v and U(p,v)=u at the same time. This union is an effectively open set, and Bob enumerates the corresponding intervals and adds them to the label B_{u,v} when they appear in the enumeration. (Note that this set is the same for (u,v) and (v,u) by definition.) The limit set B_{u,v} then contains an interval of size 2^{−E_U(u,v)} by construction (consider the interval that corresponds to the shortest p in the definition of E_U(u,v)).

Let Alice use her winning strategy (for the factor 2^{−k} and the total weight 2^{−k}) against this Bob. Since Bob’s actions and Alice’s strategy are computable, the limit values of Alice’s weights are lower semicomputable uniformly in k. Let us denote these limit values by m^k_{u,v} (for the kth game). We know that for every u and k the sum ∑_{v≠u} m^k_{u,v} does not exceed 2^{−k}. Therefore the sum

 m_{u,v} = ∑_k m^k_{u,v}

satisfies the requirement

 ∑_{v≠u} m_{u,v} ⩽ 1

and we can apply Proposition 5, where we let e(u,v) = m_{u,v}. This proposition guarantees that

 m_{u,v} ⩽ O(min(m(u|v), m(v|u))) = 2^{−max(K(u|v),K(v|u)) + O(1)}.

If, contrary to the statement of Theorem 4, the value of the prefix-stable (non-bipartite) information distance between u and v were bounded by max(K(u|v),K(v|u)) + O(1), then max(K(u|v),K(v|u)) in the right-hand side of the last inequality could be replaced by E_U(u,v). But this means, by our construction, that Bob wins the kth game for large enough k: the maximal intervals in B_{u,v} have size at least 2^{−E_U(u,v)}, which, according to this inequality, is large enough to match 2^{−k}·m_{u,v} (and therefore 2^{−k}·m^k_{u,v}) for large enough k. We get a contradiction that finishes the proof of Theorem 4 for the non-bipartite case, assuming the existence of a uniformly computable winning strategy for Alice.

### 5.2 How to win the game

Now we present a winning strategy for Alice. It is more convenient to consider an equivalent version of the game where Alice should satisfy the requirement ∑_{v≠u} m_{u,v} ⩽ ε (for some constant ε; we assume without loss of generality that ε is a negative power of 2) and Bob should match Alice’s weights without any factor, i.e., satisfy the requirement that B_{u,v} contains an interval of size at least m_{u,v}. We need to show that Alice has a winning strategy even for small values of ε.

The idea of the strategy is that Alice maintains a finite set of “currently active” vertices, initially very large and then decreasing. The game is split into N = 1/σ stages, where σ is a negative power of 2 smaller than ε/2 (as we will see, this is enough). After each stage the set of active vertices and the edge labels satisfy the following conditions:

• Alice has zero weights on edges that connect active vertices (as we have said, we may assume without loss of generality that Bob has empty labels on these edges, too);

• for each active vertex, only a small weight is used by Alice on edges that connect it to other vertices (inactive ones; edges to active vertices are covered by the previous condition and do not carry any weight); this weight will never exceed ε/2;

• more and more space is unavailable to Bob for use on edges between active vertices, since it is already used on edges connecting active and inactive vertices.

The amount of unavailable space (for Bob) grows from stage to stage until no more space is available and Alice wins. In fact, at each stage the amount of unavailable space grows by σ, so Alice needs N = 1/σ stages to make all the space unavailable for Bob; then she makes one more request and wins, since Bob has no available space to fulfill this request.

In the previous paragraph we used the words “unavailable space” informally. What do we mean by unavailable space? Consider some active vertex u and the edges that connect it to inactive vertices. These edges have some Bob’s labels (subsets of the Cantor space). The part of the Cantor space occupied by these labels is not available to Bob for edges between u and other active vertices. Moreover, if Alice requests an interval of some size ρ, and some part (even a small one) of an interval of this size is occupied, then this entire interval cannot be used by Bob (is unavailable). In this way the unavailable space can be much bigger than the occupied space, and this difference is the main tool in our argument. (This type of accounting goes back to Gács’ paper [4], where he proved that monotone complexity and continuous a priori complexity differ by more than a constant; see also [14] for a detailed exposition of his argument.)

Let us explain this technique. First, let us agree that Alice increases only zero weights, and the new non-zero weight she uses depends only on the stage. At the first stage she uses some very small weight ρ_1 for every request, at the second stage some bigger weight ρ_2, etc. (so at the ith stage weights ρ_i are used). We will use values ρ_i that are powers of 2 (since interval sizes in the Cantor space are powers of 2 anyway). With each stage we also associate a granularity γ_i, again a power of 2; more precisely, we let ρ_i = (ε/2)·γ_i and assume that γ_{i−1} ⩽ ρ_i, so both sequences increase with i.

This commitment about the weights implies that, starting from the ith stage, only the γ_i-neighborhood of the space used by Bob matters. Here by the γ-neighborhood (where γ is a negative power of 2) of a subset A of the Cantor space we mean the union of all intervals of size γ that have nonempty intersection with A; note that the γ-neighborhood of A increases when A increases (or γ increases).
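The γ-neighborhood operation on unions of dyadic intervals is easy to make explicit (our sketch, with γ = 2^{-g}):

```python
from itertools import product

def gamma_neighborhood(intervals, g):
    """gamma-neighborhood (gamma = 2^-g) of a union of dyadic intervals
    [p]: the union of all size-2^-g intervals intersecting it."""
    out = set()
    for p in intervals:
        if len(p) >= g:
            out.add(p[:g])    # the unique size-2^-g interval containing [p]
        else:                  # [p] itself is a union of size-2^-g intervals
            out.update(p + ''.join(t)
                       for t in product('01', repeat=g - len(p)))
    return out

print(sorted(gamma_neighborhood(['0110', '10'], 2)))  # ['01', '10']
```

The monotonicity noted above is visible here: coarsening the granularity (decreasing g) can only enlarge the measure of the neighborhood.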

More precisely, let us call an interval dirty for an active vertex u (at some moment) if some part of this interval already appears in Bob’s labels for the edges that connect u to inactive vertices. This interval cannot be used later by Bob for the edges adjacent to u. After stage i, we consider all the intervals of size γ_i that are “everywhere dirty”, i.e., dirty for all active vertices (those that are dirty for some active vertices but not for the others do not count). The everywhere dirty intervals form the unavailable space after stage i, and the total measure of this space increases at least by σ at each stage. In other terms, after stage i we consider, for every active vertex u, the space allocated by Bob to all edges connecting u with (currently) inactive vertices, and the γ_i-neighborhood of this space. The intersection of these neighborhoods over all active vertices is the unavailable space (after stage i).

After stage i the total size of the unavailable space will be at least iσ (recall that N = 1/σ). At the end (after the Nth stage) we have Nσ = 1, so the total size of the everywhere dirty intervals of size γ_N is 1, while the total weight used by Alice at any vertex is at most ε/2. Then Alice makes one more request and wins: every interval of the requested size is dirty for both endpoints, so Bob has nowhere to place his interval. Of course, we need at least two vertices to remain active after the Nth stage, and this will be guaranteed if the initial number of active vertices is large enough.

The picture above places γ_i between stages i and i+1, since γ_i is used for accounting after stage i and before stage i+1.

It remains to explain how Alice plays at stage i, using requests of weight ρ_i and creating (new) everywhere dirty intervals of size γ_i with total size (= the size of their union) at least σ. This happens in several substages; each substage decreases the set of active vertices and increases the set of everywhere dirty intervals of size γ_i (for the remaining active vertices).

Before starting each substage, we look at two subsets of the Cantor space:

• (a) the set of intervals of size γ_{i−1} that were everywhere dirty after the previous stage;

• (b) the set of intervals of size γ_i that are everywhere dirty now (after the substages that are already performed).

The second set is bigger for two reasons. First, we changed the granularity (recall that the γ_i-neighborhood of some set can be bigger than its γ_{i−1}-neighborhood). Second, the previous substages created new everywhere dirty intervals of size γ_i. Our goal is to make the second set larger than the first one; the required difference in size is σ. If this goal is already achieved, we finish the stage (no more substages are necessary). If not, we initiate a new substage that creates a new everywhere dirty γ_i-interval.

Alice’s strategy for a substage

The key idea is that if Alice makes requests for all edges of a large star, she may use a lot of weight for the central vertex (the sum of the weights could be up to ε/2, since in our process the total weight on the edges that connect some active vertex to inactive ones never exceeds ε/2, and the maximal allowed total weight is ε). Still, for all other vertices of the star only one new edge of non-zero weight is added, and the central vertex will be made inactive after the substage. Bob has to allocate an interval of size at least ρ_i for every edge in the star, and these intervals should be disjoint (due to the restrictions for the center of the star). The total measure of these intervals is at least ε/2, and all of them are outside the zone (a): each of them has size at least ρ_i ⩾ γ_{i−1}, so it covers entire γ_{i−1}-intervals, and every γ_{i−1}-interval in the zone (a) is dirty for the center, i.e., partially occupied. Therefore, since the goal is not yet achieved (the part of the zone (b) outside the zone (a) has measure less than σ, which is less than ε/2), one of these new intervals used by Bob is also outside the zone (b).

Alice does the same for many stars (assuming that there are enough active vertices) and gets many new dirty γ_i-intervals outside the (b)-zone (one per star). Some of them have to coincide: if we started with many stars, we may select many new active vertices that share the same new dirty γ_i-interval. Making all other vertices inactive, we get a smaller (but still large, if we started with a large set of active vertices) set of active vertices and a new everywhere dirty γ_i-interval. The goal of the substage is achieved, and we may look again at the set of everywhere dirty γ_i-intervals (with the new interval added) to decide whether the difference between (b) and (a) is now large enough (at least σ) or a new substage is needed. The maximal number of substages needed to finish the stage is at most σ/γ_i + 1, since each substage creates a new γ_i-interval.

The same procedure is repeated for all stages. We need to check that Alice does not violate her obligation on the sum of the weights connecting some active vertex to all inactive vertices. For that, we look at the “amplification factor”: in the construction Alice uses a new weight ρ_i (for every new active vertex) to get a dirty interval of size γ_i, therefore the amplification factor is γ_i/ρ_i = 2/ε. Since the total size of the dirty intervals is at most 1, the total weight used by Alice (for each active vertex) never exceeds ε/2, as required.
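This accounting can be summarized numerically (using our notation from the discussion above, with ρ_i = (ε/2)·γ_i; the specific numbers are purely illustrative):

```python
eps = 2.0 ** -5
sigma = 2.0 ** -3          # unavailable space gained per stage
stages = int(1 / sigma)    # number of stages needed to reach measure 1

# per active vertex: each new everywhere-dirty interval of size gamma_i
# costs Alice one request of weight rho_i = (eps / 2) * gamma_i, so the
# total cost is (eps / 2) times the total dirty measure created
total_dirty = sigma * stages
per_vertex_weight = (eps / 2) * total_dirty

print(total_dirty, per_vertex_weight <= eps / 2)  # 1.0 True
```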

It remains to explain why Alice can choose enough active vertices in the beginning, so that she never runs out of them in the construction and at least two active vertices exist at the end (so the last request, for the edge connecting them, wins the game). Indeed, backwards induction shows that for each substage of each stage there is some finite number of active vertices that is sufficient for Alice to follow her plan till the end. If we wanted to upper-bound the length of the strings where a given difference between the two quantities in the statement of Theorem 4 is achieved, we would need to compute this number explicitly. But the qualitative statement (the unbounded difference) is already proven for the prefix-stable non-bipartite case. The prefix-free case is a corollary (the distance becomes bigger), but for the bipartite case we need to adapt the argument, and this is done in the next section.

### 5.3 Modifications for the bipartite case

In the bipartite case the game should be changed. Namely, we have a complete bipartite graph where the left and right parts both contain all binary strings. Alice increases weights on edges; for each vertex (left or right) the sum of the weights of all adjacent edges should not exceed 1, and ε (the parameter of the game) is again the factor in Bob’s matching requirement. In other terms, at each step Alice’s weights form a two-dimensional table m_{x,y} indexed by pairs of strings x and y; all entries are zeros except for finitely many positive rational numbers, and

 ∀x (∑_y m_{x,y} ⩽ 1),  ∀y (∑_x m_{x,y} ⩽ 1)

(now we have two requirements, since the table is not symmetric anymore; note that the diagonal entries do not have any special status).

Bob replies by assigning increasing clopen sets B_{x,y} to edges; in the limit, B_{x,y} should contain an interval of size at least ε·m_{x,y}. For each x the sets B_{x,y} (with different y) should be disjoint; the same should be true for the sets B_{x,y} with fixed y and different x.
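Alice's constraints in the bipartite game are just row and column bounds on the weight table; a minimal checker (our own sketch):

```python
def weights_legal(m, bound=1.0):
    """Check the bipartite constraints: for every left vertex x the row
    sum is at most `bound`, and likewise for every right vertex y."""
    xs = {x for x, _ in m}
    ys = {y for _, y in m}
    rows_ok = all(sum(w for (x, y), w in m.items() if x == x0) <= bound
                  for x0 in xs)
    cols_ok = all(sum(w for (x, y), w in m.items() if y == y0) <= bound
                  for y0 in ys)
    return rows_ok and cols_ok

m = {('a', 'u'): 0.5, ('a', 'v'): 0.5, ('b', 'u'): 0.4}
print(weights_legal(m))        # True: row a sums to 1.0, column u to 0.9
print(weights_legal(m, 0.8))   # False: row a exceeds 0.8
```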

Again, to prove that the bipartite prefix-stable information distance exceeds max(K(x|y),K(y|x)) by more than a constant, we show that for every ε Alice has a computable (uniformly in ε) winning strategy in this game. Then we consider the games with total weight 2^{−k} and factor 2^{−k}, and let Alice play her winning strategies against the “blind” strategy for Bob that (for the edge (x,y)) enumerates all intervals [p] such that U(p,x)=y and U′(p,y)=x at the same time.

The winning strategy for Alice works in stages as before, and the request size grows with the stage in the same way. Alice keeps a list of active vertices (both in the left and in the right part); after each stage (and substage) all weights on edges between (left and right) active vertices are zeros, and the sum of Alice’s weights on the edges between each active vertex and all inactive vertices is small. After the ith stage we consider intervals of size γ_i. When defining dirty intervals, we look only at one part (say, the right one). An interval of size γ_i is considered dirty for a right vertex y if some part of this interval is allocated to some edge connecting y to some vertex from the left part. We are interested in intervals that are dirty everywhere (i.e., for every active right vertex y). At each substage (of the ith stage), to create a new everywhere dirty interval, we use stars as before.

In each star the sum of Alice’s weights is close to the maximal allowed amount; we choose an edge for which Bob’s label is not inside the everywhere dirty intervals found during the previous substages, and look at the γ_i-interval where it goes. Since there are many stars, some interval occurs many times; Alice selects such an interval and uses the right endpoints of the corresponding edges as new active right vertices. On the left side, Alice uses vertices that do not appear in the stars as new active left vertices. (In this way we have many more active vertices on the left; if for some reason we want to keep the same number of left and right vertices, Alice may delete some of the remaining vertices.)