1 Bayes’ Law
Bayes’ law, also known as Bayes’ rule or Bayes’ theorem, proposes a way to use available data that may affect the likelihood of an event to better assess the probability of occurrence of that event. The law was proposed by the Reverend Thomas Bayes, born in the early 1700s, but this work was not made public during his lifetime. It was only after his death in 1761 that his friend Richard Price communicated Bayes’ work to John Canton in 1763, suggesting that “a communication of it to the Royal Society cannot be improper”.
In the Bayesian interpretation, a probability is a measure of our degree of belief in something, which differs from the frequentist interpretation of probability. Given a phenomenon $X$ and a proposed theory $T$ for $X$, Bayes’ law provides a tool that quantifies the validity of $T$ as supported by our initial belief (a subjective measure) in $T$ and the observation of some evidence $E$ about $X$. As an equation, Bayes’ law is stated as follows.
$$P(T \mid E) = \frac{P(E \mid T)\, P(T)}{P(E)} \tag{1}$$
We note that the above mathematical form of Bayes’ law is due to the French mathematician Laplace, who came to a similar conclusion in 1774. Following Laplace’s logic, equation 1 argues that the probability of a theory given some evidence is proportional to the probability of that evidence given that theory. In equation 1, $P(T)$ is the initial probability of $T$, which is our initial belief in $T$ prior to observing any evidence (aka the a priori probability or the prior), $P(T \mid E)$ is the probability of $T$ given $E$, which is the probability of $T$ after accounting for the evidence $E$ (aka the a posteriori probability or the posterior), and $P(E)$ is the probability of $E$ given all possible theories for $X$. Thus, Bayes’ law allows us to update our initial belief in $T$ in a way that accounts for the evidence $E$.
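The update described above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the text: the two candidate theories ("fair" and "biased"), their priors, and the likelihood values are hypothetical, and the function name `bayes_posterior` is ours. Here `prior[T]` stands for the prior $P(T)$ and `likelihood[T]` for $P(E \mid T)$.

```python
from fractions import Fraction

def bayes_posterior(prior, likelihood):
    """Return the posterior P(T | E) for every candidate theory T.

    prior[T]      -- P(T), the initial belief in theory T
    likelihood[T] -- P(E | T), probability of the evidence under T
    """
    # P(E): probability of the evidence over all candidate theories.
    evidence = sum(prior[t] * likelihood[t] for t in prior)
    return {t: prior[t] * likelihood[t] / evidence for t in prior}

# Hypothetical example: is a coin fair, or biased towards heads (0.9)?
prior = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}
# Evidence E: three heads in a row.
likelihood = {"fair": Fraction(1, 2) ** 3, "biased": Fraction(9, 10) ** 3}

posterior = bayes_posterior(prior, likelihood)
for theory, p in posterior.items():
    print(theory, p, float(p))
```

With these made-up numbers the belief in the "biased" theory rises from 1/2 to 729/854, which is the sense in which the evidence updates the prior.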
Clearly, Bayes’ law constitutes a learning algorithm and is probably one of the earliest demonstrations of a data-driven approach to learning. We view the difference between the a priori probability and the a posteriori probability in Bayes’ law as some manifestation of the amount of information in the evidence $E$ about the theory $T$. Informally put, we have
$$\text{information in } E \text{ about } T \;\sim\; P(T \mid E) - P(T) \tag{2}$$
In [5], Price writes about the problem solved by Bayes’ law that “Every judicious person will be sensible that the problem now mentioned is necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter”. In fact, Bayes’ law is considered one of the most fundamental applications of probability theory and has been compared to the Pythagorean theorem in geometry.
2 Kolmogorov’s Algorithmic Information
Algorithmic information theory was introduced independently by Solomonoff [8], Kolmogorov [3], and Chaitin [2] in 1964, 1965, and 1969, respectively. At the core of this theory is Alan Turing’s notion of a Universal Turing Machine, that is, a Turing machine capable of simulating any other Turing machine.
2.1 Why Does Universality Matter?
Picture yourself walking into a bookstore. There, you would find people from all walks of life browsing books, either for fun or to learn enough about the content of a book to decide whether it is worth the price. In this context, the quality of a book is not an intrinsic property of the book itself, but rather depends on the background and taste of the reader, which naturally differ from one reader to another. Obtaining an objective measure of the quality of any book would require a universal reader who is the highest authority in all subjects. A book review provided by such a universal reader could then be described as an intrinsic property of the book itself.
All three inventors (Kolmogorov, Solomonoff, and Chaitin) used the concept of a Universal Turing Machine (UTM) to propose new ideas with built-in universal qualities. In particular, Kolmogorov’s notion of algorithmic information relies on the existence of a universal decompressor (universal Turing machine) to propose a new definition of information that deviates from the notion of information being tied to a random variable (as discussed in Shannon’s information theory) and makes information tied to an individual string, free of any probability distributions. Similarly, Solomonoff’s notion of algorithmic probability is the halting probability of a Universal Probabilistic Turing Machine (UPTM) that takes no inputs, or equivalently, the halting probability of a UTM that takes an infinite random string as an input.
2.2 Kolmogorov Complexity
Kolmogorov used Turing’s theory of algorithms to redefine randomness as incompressibility (or equivalently, lack of regularities) and to propose that the random or incompressible content of a finite string represents the amount of uncertainty or information in it. Given a finite string $x$, Kolmogorov complexity is defined through our ability to capture the regular part of $x$ so that, when given the random part $p$ of $x$, we would be able to reconstruct $x$ from $p$ (decompress $p$ to regenerate $x$). We note that we are not interested in how $x$ got compressed down to $p$; we simply want to have an effective way to uncover $x$ from $p$ (Kolmogorov complexity is about decompression, not compression).
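As a rough illustration of the "regular part versus random part" idea, the following Python sketch uses `zlib` as one fixed, off-the-shelf decompressor. This is an assumption made purely for illustration: Kolmogorov complexity itself is not computable, and the length of a zlib description is only an upper bound relative to this particular decompressor. The helper name `description_length` and the two test strings are ours.

```python
import random
import zlib

def description_length(data: bytes) -> int:
    """Length in bytes of a zlib description of `data`.

    zlib.decompress plays the role of one fixed decompressor; the
    compressed string is an input that makes it regenerate `data`.
    """
    return len(zlib.compress(data, 9))

n = 4096
regular = b"01" * (n // 2)                            # highly regular string
rng = random.Random(0)                                # fixed seed, repeatable
noisy = bytes(rng.getrandbits(8) for _ in range(n))   # pseudo-random bytes

# Decompression really does recover the original string.
assert zlib.decompress(zlib.compress(regular, 9)) == regular

print(description_length(regular), description_length(noisy))
```

The regular string shrinks to a few dozen bytes while the pseudo-random one does not shrink at all. Note that `noisy` is produced by a short seeded program, so its true complexity is small; zlib simply cannot find that regularity, which shows how much the result depends on the chosen decompressor.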
Our discussion thus far suggests that the Kolmogorov complexity of a finite binary string $x$ is defined relative to a particular decompression algorithm (a Turing machine) $M$ as the length of a smallest input $p$ that causes $M$, when it reads $p$, to generate $x$ and then halt. In mathematical form, we have
$$K_M(x) = \min \{\, |p| : M(p) = x \,\} \tag{3}$$
As such, Kolmogorov complexity is made relative to a particular algorithm or Turing machine, which can hardly mean anything (this is similar to the point we made in the previous section when we argued that the quality of a book cannot be determined by the taste and background of a particular reader). Here, the notion of a Universal Turing Machine (UTM) comes to the rescue. Knowing that a UTM $U$ can simulate any other Turing machine $M$, rewriting equation 3 relative to $U$ gives a universal meaning to $K_U(x)$. This is true because we can easily show that for all other Turing machines $M$, we have
$$K_U(x) \le K_M(x) + c_M \tag{4}$$
where $c_M$ is a constant that depends on $M$, but not on $x$. In particular, $c_M$ is about the length of the binary encoding $\mathrm{bin}(M)$ of $M$. The input of $U$ would consist of the pair $\langle M, p \rangle$ so that $U$ would know how to simulate $M$ on the input $p$; that is, $U(\langle M, p \rangle) = M(p)$. If we next let $M(p) = x$ for some string $p$ with $|p| = K_M(x)$; that is, $p$ is a shortest input for $M$ to generate $x$, then $K_U(x) \le |\langle M, p \rangle| = \ell_M + |\mathrm{bin}(M)| + |p| = K_M(x) + c_M$, where $\ell_M$ is the length of a self-delimiting binary encoding of the length of $\mathrm{bin}(M)$, which $U$ uses to separate $\mathrm{bin}(M)$ from $p$. This result is known as the Invariance Theorem and was discovered independently by Solomonoff and Kolmogorov. Moreover, given any other UTM $U'$, since a UTM is just another Turing machine, we have
$$|K_U(x) - K_{U'}(x)| \le c_{U,U'} \tag{5}$$
where $c_{U,U'}$ is a constant that depends on $U$ and $U'$, but not on $x$. Thus, it doesn’t really matter which UTM we choose in our definition of Kolmogorov complexity, as long as we accept an additive constant error in the result, which can be large! A better statement is that it actually does matter which UTM we use in the definition of Kolmogorov complexity, but once we fix a reference UTM $U$, we have a universal definition in the sense that the value of $K_U(x)$ may exceed the true value of the amount of information in $x$ by a constant term, but it is never less than it.
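The role of the self-delimiting length header in the pair given to the universal machine can be made concrete with a small sketch. The particular code used here (each bit of the length doubled, followed by the marker `01`) is one common choice that we assume for illustration; the text does not fix a specific encoding, and the machine description below is a hypothetical bit string.

```python
from typing import Tuple

def self_delimiting(n: int) -> str:
    """Encode n >= 0 so that a reader knows where the code ends.

    Each bit of bin(n) is doubled and the marker '01' terminates it.
    """
    return "".join(b + b for b in format(n, "b")) + "01"

def encode_pair(machine: str, p: str) -> str:
    """Build the pair: length header for bin(M), then bin(M), then p."""
    return self_delimiting(len(machine)) + machine + p

def decode_pair(code: str) -> Tuple[str, str]:
    """Recover (bin(M), p) from the pair with no extra separator."""
    i, bits = 0, ""
    while code[i:i + 2] != "01":       # read doubled bits up to the marker
        bits += code[i]
        i += 2
    i += 2                             # skip the '01' marker
    n = int(bits, 2)                   # length of bin(M)
    return code[i:i + n], code[i + n:]

machine = "1011001"                    # hypothetical bin(M), 7 bits
for p in ("0101", "111000111"):
    code = encode_pair(machine, p)
    assert decode_pair(code) == (machine, p)
    print(len(code) - len(p))          # overhead added on top of |p|
```

The overhead printed is the same for both inputs: it depends only on the machine description, not on the input, which is exactly the shape of the additive constant in the Invariance Theorem.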
2.3 Kolmogorov Mutual Information
First, we review the argument of Kolmogorov for calling $K_U(x)$ the amount of information in $x$ about itself. Kolmogorov introduced the notion of the conditional complexity of a string $x$ in the presence of another string $y$ that is made available to the UTM $U$ for free. In particular, we have
$$K_U(x \mid y) = \min \{\, |p| : U(y, p) = x \,\} \tag{6}$$
Here, we follow Kolmogorov’s notation and place the auxiliary information $y$ before the input $p$. Next, Kolmogorov argued [3] that since $K_U(x \mid y) \le K_U(x)$ up to an additive constant, it is fair to call the difference $K_U(x) - K_U(x \mid y)$ the amount of information in $y$ about $x$, to be denoted by
$$I_U(y : x) = K_U(x) - K_U(x \mid y) \tag{7}$$
We note that this argument is similar to the argument we hinted at in Section 1 on Bayes’ law, when we proposed the difference between the a posteriori and a priori probabilities, $P(T \mid E) - P(T)$, as a manifestation of the amount of information in the evidence $E$ about the theory $T$.
We next ask: what is $K_U(x \mid x)$? What is $I_U(x : x)$?
Clearly, $K_U(x \mid x)$ is the constant length of the copy program that copies its input to its output. Thus, up to an additive constant, $K_U(x \mid x) = 0$ and $I_U(x : x) = K_U(x)$. For this reason, Kolmogorov suggested calling $K_U(x)$ the amount of information in $x$ about itself (we could call it self-information, similar to the notion of self-entropy).
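The behaviour of conditional complexity and of the information in one string about another can be imitated, very crudely, with a real compressor. The sketch below is an assumption-laden stand-in, not Kolmogorov's definition: it uses zlib lengths in place of complexity, and it approximates the conditional complexity of `x` given `y` by the extra compressed bytes needed for `x` once `y` has been seen. The function names are ours.

```python
import random
import zlib

def c(data: bytes) -> int:
    """Compressed length in bytes: a crude upper-bound stand-in for K."""
    return len(zlib.compress(data, 9))

def c_given(x: bytes, y: bytes) -> int:
    """Stand-in for K(x | y): extra bytes needed for x once y is known."""
    return max(c(y + x) - c(y), 0)

def info_about(y: bytes, x: bytes) -> int:
    """Stand-in for the information in y about x: K(x) - K(x | y)."""
    return c(x) - c_given(x, y)

rng = random.Random(1)                                  # fixed seed
x = bytes(rng.getrandbits(8) for _ in range(2000))
z = bytes(rng.getrandbits(8) for _ in range(2000))      # unrelated to x

print("c(x)        =", c(x))
print("c(x | x)    =", c_given(x, x))   # small: a copy costs little
print("c(x | z)    =", c_given(x, z))   # close to c(x): z does not help
print("info(x : x) =", info_about(x, x))
print("info(z : x) =", info_about(z, x))
```

In this proxy, knowing `x` makes describing `x` nearly free, so the information in `x` about itself is close to its whole compressed length, while an unrelated string carries almost none.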
We mention that prior to this work of Kolmogorov, the notion of the information content of an individual string was (almost) absent. For example, Shannon’s work was about the minimum number of bits needed on average to transmit a value taken by a random variable, as a syntactic unit independent of any semantics (in his 1948 paper, Shannon wrote “… semantic aspects of communication are irrelevant to the engineering problem …”). Similarly, Chaitin was interested in studying the size of a shortest program capable of generating a given sequence of bits on a universal Turing machine, and Solomonoff was interested in predicting the next value taken by a random variable following an unknown probability distribution. While this is true in general, the following definition of a possible information measure was first suggested by Wiener in 1948: “The amount of information provided by a single message $m_i$, $I(m_i) = -\log p_i$”, which is related to the number of bits needed to identify any of the messages that happen to occur with probability $p_i$. The comforting thing is that both Shannon’s and Kolmogorov’s notions of information agree that information is about the removal of uncertainty. This agrees with the point of view suggested by Kolmogorov, though seen from the opposite end, that information is about the ability to uncover regularity. That is, the more regular a string is, the less information it has (alternatively, the less uncertainty it contains), and vice versa.
3 Solomonoff’s Algorithmic Probability and Its Relationship to Kolmogorov Complexity
To understand Solomonoff’s algorithmic probability, we first need to recall the notion of a probabilistic Turing Machine (PTM). A PTM is similar to a non-deterministic Turing machine with an added read-only tape, called the random tape, that is full of random bits. The machine can have two possible next moves in any configuration, and the choice is made based on the next bit read off the random tape (the assumption is that the two possible next moves in any configuration are equally likely). Let $r$ be the random sequence of bits read off the random tape of a PTM $M$ when it runs on input $w$. The halting probability of $M$ along $r$ is the product of the probabilities of the choices taken at each step of the computation, which is $2^{-|r|}$. The output of $M$ is a string $x$, if $M$ accepts $w$. We note that $M$ may accept $w$ in one execution and reject it in another. We next consider the case when $M$ runs on the empty tape. The halting probability of $M$ with output $x$ is the sum of the halting probabilities of $M$ over all random strings $r$ that cause $M$ to output $x$ and halt. Solomonoff called $P_M(x)$ the algorithmic probability of $x$ relative to $M$. We have
$$P_M(x) = \sum_{r \,:\, M(r) = x} 2^{-|r|} \tag{8}$$
Given that we require $M$ to halt immediately after it outputs $x$, all random strings $r$ that appear in the equation of $P_M(x)$ must be prefix-free. By Kraft’s inequality, we have $\sum_x P_M(x) \le 1$.
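The sum defining the algorithmic probability, and the Kraft bound on it, can be checked on a toy example. The "machine" below is a hypothetical finite table from halting random strings to outputs, invented for illustration; a real probabilistic Turing machine would of course have infinitely many halting strings and an undecidable halting set.

```python
from fractions import Fraction

# Hypothetical toy machine: each halting random string r and its output.
# The keys form a prefix-free set, as required when the machine halts
# immediately after producing its output.
TOY_MACHINE = {
    "0": "a",
    "10": "b",
    "110": "a",
    "1110": "c",
}

def is_prefix_free(programs) -> bool:
    """True if no string in `programs` is a proper prefix of another."""
    ps = list(programs)
    return not any(p != q and q.startswith(p) for p in ps for q in ps)

def algorithmic_probability(machine, x) -> Fraction:
    """Sum of 2^(-|r|) over all r on which the machine outputs x."""
    terms = (Fraction(1, 2 ** len(r)) for r, out in machine.items() if out == x)
    return sum(terms, Fraction(0))

assert is_prefix_free(TOY_MACHINE)
outputs = sorted(set(TOY_MACHINE.values()))
total = sum(algorithmic_probability(TOY_MACHINE, x) for x in outputs)
for x in outputs:
    print(x, algorithmic_probability(TOY_MACHINE, x))
print("total:", total)   # at most 1, by Kraft's inequality
```

The total stays below 1 because the random string `1111...` never leads to a halt in this toy table, which is why the result is a semimeasure rather than a probability distribution.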
Solomonoff’s work on algorithmic probability assumes a deterministic Turing machine $M$ (not a probabilistic one) whose input consists of an infinite random binary string with equal probabilities for zero and one. Thus, we can write
$$P_M(x) = \sum_{p \,:\, M(p) = x} 2^{-|p|} \tag{9}$$
where the sum ranges over the finite prefixes $p$ of the random input that $M$ reads before it outputs $x$ and halts.
Solomonoff next used the notion of a Universal Turing Machine (UTM) to give his algorithmic probability a universal sense and called his algorithmic probability relative to a UTM $U$ a universal a priori probability, $P_U$. Levin [4] showed that $P_U$ is a universal lower semicomputable semimeasure in the sense that for any other probabilistic Turing machine $M$, $P_U(x) \ge c_M \, P_M(x)$, for all $x$, where $c_M > 0$ is a constant that depends on $M$, but not on $x$. This result shows that $P_U$ dominates (is superior to) any other lower semicomputable semimeasure (= the halting probability distribution generated by a probabilistic Turing machine).
We recall that Solomonoff’s overall objective was to predict the next sequence of bits in a string generated by a random source whose governing probability distribution is unknown. His method uses Bayes’ law, where the unknown a priori probability is replaced by his universal a priori probability $P_U$.
We conclude this section by emphasizing the following observations:
- Setting the halting probability of $M$ on a random string $r$ to $2^{-|r|}$ makes the argument that the less (more) randomness the string $x$ contains, the higher (lower) its algorithmic probability is.
- The algorithmic probability accounts for all the different random contents that allow the machine to recover $x$. Solomonoff writes in [8]: “The assignment of high a priori probabilities to sequences with many descriptions corresponds to a feeling that if an occurrence has many possible causes, then it is more likely.”
- Solomonoff’s logic in his algorithmic probability agrees with the logic of Epicurus, which states that “If more than one theory is consistent with the data, keep them all.” Interestingly, this expresses the opposite sentiment to Occam’s razor, adopted by Kolmogorov in his definition of $K_U(x)$, which considers only a shortest random content of $x$ that allows $U$ to recover $x$.
- Strings of high (low) algorithmic probability correspond to strings of low (high) Kolmogorov complexity.
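The last observation can be seen on a toy example: since a shortest program is itself one of the terms in the algorithmic-probability sum, the negative logarithm of the probability can never exceed the complexity. The table below is a hypothetical finite "machine" that we made up for illustration; `complexity` and `probability` are our own helper names for the shortest-program length and the sum of $2^{-|p|}$ over all programs, both taken relative to this one table.

```python
import math

# Hypothetical toy machine (prefix-free programs -> outputs).
TOY = {"0": "a", "10": "b", "110": "a", "1110": "c"}

def complexity(machine, x) -> int:
    """Length of a shortest program that outputs x (Occam: keep one)."""
    return min(len(p) for p, out in machine.items() if out == x)

def probability(machine, x) -> float:
    """Sum of 2^(-|p|) over every program for x (Epicurus: keep all)."""
    return sum(2.0 ** -len(p) for p, out in machine.items() if out == x)

for x in sorted(set(TOY.values())):
    k = complexity(TOY, x)
    neg_log_p = -math.log2(probability(TOY, x))
    # The shortest program alone contributes 2^(-k) to the sum.
    assert neg_log_p <= k
    print(x, k, round(neg_log_p, 3))
```

For `a`, which has two programs, the negative log-probability (about 0.678) is strictly below its complexity of 1; for `b` and `c`, which have a single program each, the two quantities coincide.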
4 Algorithmic Mutual Information Is Equivalent to Bayes’ Law
Levin showed that Solomonoff’s algorithmic probability is related to a special type of Kolmogorov complexity, named prefix Kolmogorov complexity (discovered independently by Levin and Chaitin), which requires programs and inputs to be prefix-free [1, 7]. It is known that for every Turing machine $M$, one can construct an equivalent prefix-free Turing machine $M'$. The prefix Kolmogorov complexity of $x$, denoted by $K_U(x)$, is the length of a shortest input that causes a fixed reference prefix-free UTM $U$ to print $x$ and then halt. In the rest of this section, we use the notation $K(x)$ for this prefix complexity, dropping the subscript $U$.
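An important result of Levin shows that up to an additive constant, for all finite strings $x$,
$$K(x) = -\log P_U(x) \tag{10}$$
where $P_U$ is Solomonoff’s universal a priori probability.
Combining this result with two simpler results (each expressed up to an additive constant), we conclude that up to an additive constant,
$$K(x, y) = K(x) + K(y \mid x) \tag{11}$$
Using prefix Kolmogorov complexity to express the amount of information in a string about another, that is, writing $I(x : y) = K(y) - K(y \mid x)$ for the amount of information in $x$ about $y$, we conclude that up to an additive constant,
$$I(x : y) = I(y : x) \tag{12}$$
This is true because up to an additive constant, we have
$$K(x) + K(y \mid x) = K(x, y) = K(y, x) = K(y) + K(x \mid y) \tag{13}$$
which implies that
$$K(y) - K(y \mid x) = K(x) - K(x \mid y) \tag{14}$$
We view this argument, that up to an additive constant the amount of information in $x$ about $y$ is equal to the amount of information in $y$ about $x$, as the algorithmic information version of Bayes’ law. In fact, one can easily use this argument to derive Bayes’ law for Solomonoff’s a priori probability. We have
$$K(x) - K(x \mid y) = K(y) - K(y \mid x) \tag{15}$$
Replacing $K$ by $-\log P_U$, we have, up to an additive constant,
$$\log P_U(x \mid y) - \log P_U(x) = \log P_U(y \mid x) - \log P_U(y) \tag{16}$$
Applying basic rules for logarithms gives, up to a multiplicative constant,
$$P_U(x \mid y) = \frac{P_U(y \mid x)\, P_U(x)}{P_U(y)} \tag{17}$$
which has the form of Bayes’ law.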
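The last two steps, from the symmetric log-form to the familiar ratio form, can be sanity-checked numerically. Since Solomonoff's universal a priori probability is not computable, the sketch below substitutes an ordinary, hypothetical joint distribution over two binary variables; it only illustrates that the symmetry of the information gain and Bayes' law are the same statement once logarithms are removed, and in this ordinary setting the equalities are exact rather than up to a constant.

```python
import math

# Hypothetical joint distribution P(x, y) over two binary variables.
JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def marginal_x(joint, x):
    return sum(p for (a, _), p in joint.items() if a == x)

def marginal_y(joint, y):
    return sum(p for (_, b), p in joint.items() if b == y)

def symmetry_gap(joint) -> float:
    """Largest violation of the symmetric log-form over all (x, y)."""
    gap = 0.0
    for (x, y), pxy in joint.items():
        px, py = marginal_x(joint, x), marginal_y(joint, y)
        x_given_y, y_given_x = pxy / py, pxy / px
        # Information in y about x versus information in x about y.
        lhs = math.log2(x_given_y) - math.log2(px)
        rhs = math.log2(y_given_x) - math.log2(py)
        gap = max(gap, abs(lhs - rhs))
        # Removing the logarithms gives Bayes' law.
        assert math.isclose(x_given_y, y_given_x * px / py)
    return gap

print(symmetry_gap(JOINT))   # zero up to floating-point rounding
```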
- [1] Bienvenu, L., Shafer, G., and Shen, A., “On the history of martingales in the study of randomness”, Electronic Journal for History of Probability and Statistics, vol. 5, no. 1, pp. 1–40, 2009.
- [2] Chaitin, G.J., “On the length of programs for computing binary sequences: statistical considerations”, Journal of the ACM, vol. 16, no. 1, pp. 145–159, 1969.
- [3] Kolmogorov, A.N., “Three approaches to the quantitative definition of information”, Problems Inform. Transmission, vol. 1, no. 1, pp. 1–7, 1965.
- [4] Zvonkin, A.K. and Levin, L.A., “The complexity of finite objects and the development of the concepts of information and randomness by means of the theory of algorithms”, Russ. Math. Surveys, vol. 25, no. 6, pp. 83–124, 1970.
- [5] Price, R., “An Essay towards solving a Problem in the Doctrine of Chances”, By the late Rev. Mr. Bayes, communicated by Mr. Price, in a letter to John Canton, M.A. and F.R.S., Dec. 23, 1763.
- [6] Jeffreys, H., Scientific Inference, 3rd Edition, Cambridge University Press, 1973.
- [7] Shen, A., Uspensky, V.A., and Vereshchagin, N., Kolmogorov Complexity and Algorithmic Randomness, Mathematical Surveys and Monographs, vol. 220, American Mathematical Society, 2017.
- [8] Solomonoff, R.J., “A formal theory of inductive inference”, part 1, part 2, Information and Control, vol. 7, pp. 1–22, pp. 224–254, 1964.
- [9] Turing, A.M., “On Computable Numbers, with an Application to the Entscheidungsproblem”, Proceedings of the London Mathematical Society, ser. 2, vol. 42, pp. 230–265, 1936 (published 1937).