1 Introduction
Given a labelled dataset, a loss function is a mathematical construct that assigns a numerical value to the discrepancy between a predicted model-based outcome and its real outcome. A cost function aggregates all losses incurred into a single numerical value that, in simple terms, evaluates how close the model is to the real data. The goal of minimizing an appropriately formulated cost function is ubiquitous and central to any machine learning algorithm. The main heuristic behind most training algorithms is that fitting a sufficiently representative training set will result in a model that captures the structure behind the elements of the target set, where a model is fitted to a set when the absolute minimum of the cost function is reached.
The algorithmic loss function that we introduce is designed to quantify the discrepancy between an inferred program (effectively a computable model of the data) and the data.
Algorithmic complexity [36, 24, 7], along with its associated complexity function $K$, is the accepted mathematical definition of randomness. Here, we adopt algorithmic randomness, with its connection to algorithmic probability, to formulate a universal search method [29, 39] for exploring non-entropy-based loss/cost functions in application to AI, and to supervised learning in particular. We exploit novel numerical approximation methods based on algorithmic randomness to navigate non-differentiable problem representations capable of implementing and comparing local estimations of algorithmic complexity, as a generalization of particular entropy-based cases, such as those rooted in cross entropy or KL divergence, among others.
In [45, 13] and [34, 48], a family of numerical methods was introduced for computing lower bounds of algorithmic complexity using algorithmic probability. The algorithmic complexity $K_L(x)$ of an object $x$ is the length of the shortest binary computer program $p$ that, running on a Turing-complete language $L$, can reproduce $x$ and halt. That is,
$$K_L(x) = \min_p \{|p| : L(p) = x\}.$$
The Invariance theorem [36, 24, 7] guarantees that the choice of computer language $L$ has only an impact bounded by a constant, which can be thought of as the length of the compiler program needed to translate a computer program from one computer language into another. The algorithmic probability $AP(x)$ [36, 28] of an object $x$ is the probability of a binary computer program $p$ producing $x$ by chance (i.e. considering that keystrokes are binary instructions) when running on a Turing-complete computer language $L$ and halting. That is,
$$AP(x) = \sum_{p : L(p) = x} 2^{-|p|}.$$
Solomonoff and Levin show that $AP$ is an optimal computable inference method [37] and that any other inference method is either a special case less powerful than $AP$, or indeed is $AP$ itself [28]. Algorithmic probability is related to algorithmic complexity by the so-called Coding theorem: $K(x) = -\log_2 AP(x) + O(1)$.
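The relation between output frequency and complexity expressed by the Coding theorem can be illustrated with a minimal sketch. It assumes a hypothetical, hand-written list of outputs standing in for the results of running every halting program in some tiny machine space; the function name `ctm_estimates` and the toy data are illustrative assumptions, not part of the methods cited above.

```python
import math
from collections import Counter


def ctm_estimates(outputs):
    """Estimate complexity from output frequencies: K(x) ~ -log2(freq(x))."""
    counts = Counter(outputs)
    total = sum(counts.values())
    return {x: -math.log2(c / total) for x, c in counts.items()}


# Hypothetical outputs of eight halting programs (illustrative data only).
toy_outputs = ["0", "0", "0", "0", "1", "1", "01", "0110"]
estimates = ctm_estimates(toy_outputs)

for string, bits in sorted(estimates.items(), key=lambda kv: kv[1]):
    print(f"{string!r}: {bits:.1f} bits")
```

Strings produced by many programs receive low complexity estimates, while strings produced by few programs receive high ones, which is the intuition behind frequency-based approximations of this kind.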
The Coding theorem [13, 34] and Block Decomposition methods [48] provide a procedure to navigate the space of computable models matching a piece of data, allowing the identification of sets of programs sufficient to reproduce the data regardless of its length, and thus relaxing the minimal length requirement. In conjunction with classical information theory, these techniques constitute a hybrid, divide-and-conquer approach to universal pattern matching, combining the best of both worlds in a single measure (BDM). The divide-and-conquer approach builds an unbiased library of computable models, each explaining a small segment of a larger piece of data; the conjoined sequence of these models can reproduce the whole and constitutes a computable model of it, as a generalization of the statistical approaches typically used in current machine learning.
Interestingly, the use of algorithmic probability and information theory to define AI algorithms has theoretically been proposed before [39, 38]. Yet, it has received limited attention in practice compared to other less powerful but more accessible techniques, due to the theoretical barrier that has prevented researchers from exploring the subject further. However, in one of his latest interviews, if not the last one, Marvin Minsky suggested that the most important direction for AI was actually the study and introduction of Algorithmic Probability [31].
2 An Algorithmic Probability Loss Function
The main task of a loss function is to measure the discrepancy between a value predicted by the model and the actual value as specified by the training data set. In most currently used machine learning paradigms this discrepancy is measured in terms of the differences between numerical values and, in the case of cross-entropy loss, between predicted probabilities. Algorithmic information theory offers us another option for measuring this discrepancy: in terms of the algorithmic distance or information deficit between the predicted output of the model and the real value, which can be expressed by the following definition:
Definition 1.
Let $y$ be the real value and $\hat{y}$ the predicted value. The algorithmic loss function $L_a$ is defined as $L_a(y, \hat{y}) = K(y|\hat{y})$. It can be interpreted as the loss incurred by the model at data sample $(x_i, y_i)$, and is defined as the information deficit of the real value with respect to the predicted value.
There is a strong theoretical argument to justify Def. 1. Let us recall that, given a training set $\hat{X} = \{(x_i, y_i)\}$, an AI model $M$ aims to capture, as well as possible, the underlying rules or mechanics that associate each input $x_i$ with its output $y_i$. Let us denote by $M^*$ the perfect or real model, such that $M^*(x_i) = y_i$. It follows that an ideal optimization metric would measure how far our model $M$ is from $M^*$, which in algorithmic terms is denoted by $K(M^*|M)$. However, we do not have access to much information regarding $M^*$ itself. What we do have is a set of pairs of the form $(x_i, y_i = M^*(x_i))$. Thus the problem translates into minimizing the distance between a program that outputs $y_i$ and a program that outputs $\hat{y}_i$ given $x_i$. Now, given that $x_i$ is constant for $M$ and $M^*$, the objective of an optimization strategy can be interpreted as minimizing $K(y_i|\hat{y}_i)$ for all $y_i$ in our data sets. Therefore, with the proposed algorithmic loss function, we are not only measuring how far our predictions are from the real values, but also how far our model is from the real explanation behind the data in a fundamental algorithmic sense.
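To make the idea of an information deficit between a real value and a prediction concrete, the following sketch uses a general-purpose compressor as a crude, computable stand-in for conditional algorithmic complexity. This is a swapped-in technique chosen only for illustration (it is not CTM or BDM), and the names `compressed_size` and `algorithmic_loss` are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (a crude stand-in for K)."""
    return len(zlib.compress(data, 9))


def algorithmic_loss(y: bytes, y_hat: bytes) -> int:
    """Proxy for K(y | y_hat): extra bytes needed to encode y once y_hat is known."""
    return max(0, compressed_size(y_hat + y) - compressed_size(y_hat))


y = bytes(range(256))                # "real" value
good_prediction = bytes(range(256))  # identical to y
bad_prediction = bytes(256)          # all zeros, unrelated to y

loss_good = algorithmic_loss(y, good_prediction)
loss_bad = algorithmic_loss(y, bad_prediction)
print(loss_good, loss_bad)
```

A prediction that already contains the information in the real value leaves little left to describe, so its loss is small; an unrelated prediction leaves almost the whole real value to be described.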
An algorithmic cost function must be defined as a function that aggregates the algorithmic loss incurred over a supervised data sample. At this moment, we do not have any reason, theoretical or otherwise, to propose any particular loss aggregation strategy. As we will show in subsequent sections, considerations such as continuity, smoothness and differentiability of the cost function are not applicable to the algorithmic cost function. We conjecture that any aggregation technique that correctly and uniformly weights the loss incurred through all the samples will be equivalent, the only relevant considerations being training efficiency and the statistical properties of the data. However, in order to remain congruent with the most widely used cost functions, we will, for the purpose of illustration, use the sum of the squared algorithmic differences over the training set $\hat{X}$ for a model $M$:
$$J_a(\hat{X}, M) = \sum_{(x_i, y_i) \in \hat{X}} K(y_i | M(x_i))^2.$$
3 Categorical Algorithmic Probability Classification
One of the main fields of application for automated learning is the categorical classification of objects. These classification tasks are often divided into supervised and unsupervised problems. In its most basic form, a supervised categorical classification task can be defined, given a set of objects $X = \{x_1, \dots, x_i, \dots\}$ and a finite set of categories $C = \{c_1, \dots, c_j, \dots, c_m\}$, as that of finding a computable function or model $M : X \to C$ such that $M(x_i) = c_j$ if and only if $x_i$ belongs to $c_j$ according to previously agreed criteria. In this section we apply our hybrid machine learning approach to supervised classification tasks.
Now, it is important to note that it is not constructive to apply the algorithmic loss function (Def. 1) to the abstract representations of classes that are commonly used in machine learning classifiers. For instance, the output of a softmax function is a vector of probabilities that represent how likely it is for an object to belong to each of the classes ([5]), which is then assigned to a one-hot vector that represents the class itself. However, the integer that such a one-hot vector signifies is an abstract representation of the class that was arbitrarily assigned, and therefore has little to no algorithmic information pertaining to the class itself. Accordingly, in order to apply the algorithmic loss function to classification tasks, we need to seek a model that outputs an information-rich object that can be used to identify the class regardless of the context within which the problem was defined. In other words, the model must output the class itself or a similar enough object that identifies it.
We can find the needed output in the definition of the classification problem: a class is defined as all the $x_i$ that are associated with it, and finding any underlying regularities to define a class beyond that is the task of a machine learning model. It follows that an algorithmic information model must output an object that minimizes the algorithmic distance to all the members of the class, so that we can classify them. Therefore, a correct interpretation of the general definition of the algorithmic loss function for a model $M$ is the following:
$$L_a(x_i, M(y_i)) = K(x_i | M(y_i)), \quad (1)$$
where $y_i$ is the class to which $x_i$ belongs, while $M(y_i)$ is the output of the model. What Eq. 1 says is that the model must produce as an output an object that is algorithmically close to all the elements of the class. In unsupervised classification tasks this object is known as a centroid of a cluster [30]. This means that the algorithmic loss function is inducing us to universally define algorithmic probability classification as a clustering by an algorithmic distance model. (Footnote 1: An unlabelled classificatory algorithmic information schema has been proposed by [11].) A general schema in our algorithmic classification proposal is based on the concept of a nearest centroid classifier.
Definition 2.
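Given a training set $\hat{X} = \{(x_1, y_1), \dots, (x_i, y_i), \dots, (x_n, y_n)\}$ with $m$ different classes $y_j$, an algorithmic classification model consists of a set of centroids $C^* = \{c_1^*, \dots, c_j^*, \dots, c_m^*\}$ such that each minimizes the equation
$$J_a(\hat{X}, M) = \sum_{x_i \in \hat{X}} K(x_i | M(y_i)),$$
where $M(y_i) = c_j^*$ is the object that the model $M$ assigns to the class $y_i$, and the class prediction for a new object $x$ is defined as:
$$\hat{y} = y_j \quad \text{such that} \quad c_j^* \in \arg\min_{c_j^* \in C^*} K(x | c_j^*).$$
In short, we assign to each object the closest class according to its algorithmic distance to one of the centroids in the set of objects $C^*$.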
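A nearest-centroid classifier of this kind can be sketched as follows. Two simplifying assumptions are made and should be read as such: conditional complexity is replaced by a compression-based proxy (zlib, not CTM or BDM), and each centroid is restricted to be a member of its own class rather than searched for over all possible objects. The names `cond`, `fit_centroids` and `predict` are hypothetical.

```python
import zlib


def c(data: bytes) -> int:
    """Compressed length in bytes, used as a rough complexity proxy."""
    return len(zlib.compress(data, 9))


def cond(x: bytes, y: bytes) -> int:
    """Proxy for K(x | y): extra bytes needed for x once y is known."""
    return max(0, c(y + x) - c(y))


def fit_centroids(train):
    """For each class, pick the member minimizing the summed proxy distance."""
    centroids = {}
    for label in {lab for _, lab in train}:
        members = [x for x, lab in train if lab == label]
        centroids[label] = min(
            members, key=lambda cand: sum(cond(x, cand) for x in members)
        )
    return centroids


def predict(x: bytes, centroids) -> str:
    """Assign x to the class whose centroid is algorithmically closest."""
    return min(centroids, key=lambda label: cond(x, centroids[label]))


train = [
    (b"abcdefgh" * 20, "letters"),
    (b"abcdefgh" * 18 + b"xy", "letters"),
    (b"01234567" * 20, "digits"),
    (b"01234567" * 18 + b"89", "digits"),
]
centroids = fit_centroids(train)
print(predict(b"abcdefgh" * 15, centroids))
```

Restricting centroids to class members keeps the sketch short; the definition itself places no such restriction on the objects a model may output.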
Now, in a strong algorithmic sense, we can say that a classifier is optimal if the class assigned to each object fully describes this object minus incidental or incompressible information. In other words, if a classifier is optimal and we know the class of an object, then we know all its characteristics, except those that are unique to the object and not shared by other objects within the class. Formally, a classifier is optimal with a degree of sophistication (in the sense of Koppel, [25, 26, 2]) of if and only if, for every , for any program and object such that and , then . The next theorem shows that minimizing the stated cost function guarantees that the classifier is optimal in a strong algorithmic sense:
Theorem 3.
If a classifier $M$ minimizes the cost function $J_a(\hat{X}, M)$, then it is an optimal classifier.
Proof.
Assume that is not optimal. Then there exist such that, for any class , there exists a program and string such that , and . Now, consider the classifier :
It follows that
∎
4 Approximating the Algorithmic Similarity Function
While theoretically sound, the proposed algorithmic loss (Def. 1) and classification (Def. 2) cost functions rely on the uncomputable mathematical object $K$ ([24, 8]). However, recent research and the availability of increasing computing power ([13, 48]) have produced a number of techniques for the computable approximation of the non-conditional version of $K$. In this section we present three methods for approximating the conditional algorithmic information function $K(x|y)$.
4.1 Conditional CTM and Domain Specific CTM
The Coding Theorem Method ([13]) is a numerical approximation to the algorithmic complexity of single objects. A generalization of CTM for approximating the conditional algorithmic complexity is the following:
Definition 4.
Let be a computable relation and a finite set of pairs of the form corresponding to . We define the conditional CTM (with respect to and ) as:
, where is the cardinality of . When is not a Turing complete space, or does not contain an exhaustive computation of all possible pairs for the space, we say that we have a domain specific CTM function.
The previous definition (Def. 4) is based on the Coding theorem ([29]), which establishes a relationship between the information complexity of an object and its algorithmic probability. If the relation approaches the space of all Turing machines, and the set of pairs is the set of all possible inputs and outputs for these machines, then in the limit conditional CTM converges to the conditional algorithmic complexity $K(x|y)$. In the computable cases where we take a reduced (finite) set of Turing machines, we then have a (lower bounded) approximation to $K$.
In the case where the conditioning input is the empty string and the relation is the one induced by the space of small Turing machines with 2 symbols and 5 states, with the set of pairs computed exhaustively, CTM approximates $K(x)$, and has been used to compute an approximation to the algorithmic complexity of small binary strings of up to size 12 ([34, 13]). Similarly, the space of small bidimensional Turing machines has been used to approximate the algorithmic complexity of small square matrices ([49]).
When the relation refers to a not necessarily Turing-complete space, or to a computable input/output object different from ordered Turing machines, or if the set of pairs has not been computed exhaustively over this space, then we have a domain specific version of CTM, and its application depends on this space. This version of CTM is also an approximation to $K$, given that, if the relation is computable, then we can define a Turing machine that reproduces it, mapping each input to the corresponding output. However, we cannot guarantee that this approximation is consistent or (relatively) unbiased. Therefore we cannot say that it is domain independent.
At the time of writing this article, a database for conditional CTM over small Turing machines had yet to be computed.
4.2 Coarse Conditional BDM
The Block Decomposition Method (BDM, [48]) decomposes an object into smaller parts for which there exist, thanks to CTM ([48]), good approximations to their algorithmic complexity; these quantities are then aggregated by following the rules of algorithmic information theory. We can apply the same concept to computing a coarse approximation to the conditional algorithmic information complexity between two objects. Formally:
Definition 5.
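We define the coarse conditional BDM of $X$ with respect to the tensor $Y$, under the partition strategy $\{\alpha_i\}$, as
$$BDM(X|Y) = \sum_{(r_i, n_i) \in Adj(X),\; r_i \notin Adj(Y)} \left( CTM(r_i) + \log(n_i) \right) + \sum_{r_j \in Adj(X) \cap Adj(Y)} f(n_j^x, n_j^y),$$
where $\{\alpha_i\}$ is a partition strategy of the objects into smaller objects for which CTM values are known, $Adj(X)$ and $Adj(Y)$ are the results of this partition for $X$ and $Y$ respectively, $n_j^x$ and $n_j^y$ are the multiplicities of the object $r_j$ within $X$ and $Y$ respectively, and $f$ is the function defined as
$$f(n_j^x, n_j^y) = \begin{cases} 0 & \text{if } n_j^x = n_j^y, \\ \log(n_j^x) & \text{otherwise.} \end{cases}$$
The sub-objects $r_i$ are called base objects or base tensors (when the object is a tensor) and are objects for which the algorithmic information complexity can be satisfactorily approximated by means of CTM.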
The motivation behind this definition is to enable us to consider partitions of the tensors $X$ and $Y$ into sets of subtensors $\{r_i^x\}$ and $\{r_j^y\}$, and then approximate the algorithmic information within the tensor $X$ that is not explained by $Y$ by considering the subtensors $r_i^x$ which are not present in the partition $\{r_j^y\}$. In other words, if we assume knowledge of $Y$ and its corresponding partition, then in order to describe the elements of the decomposition of $X$ using the partition strategy $\{\alpha_i\}$, we only need descriptions of the subtensors that are not in $\{r_j^y\}$. In the case of common subtensors, if the multiplicity is the same then we can assume that $X$ does not contain additional information, but that it does if the multiplicity differs.
The term $f(n_j^x, n_j^y)$ quantifies the additional information contained within $X$ when the multiplicity of the subtensors differs between $X$ and $Y$. This term is important in cases where such multiplicity dominates the complexity of the objects, cases that can present themselves when the objects resulting from the partition are considerably smaller than the main tensors.
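The accounting described above can be sketched for binary strings. The sketch assumes a particular reading of the definition: blocks of the first object that are absent from the second contribute their CTM value plus the base-2 logarithm of their multiplicity, and shared blocks contribute the logarithm of their multiplicity in the first object only when the two multiplicities differ. The CTM table holds made-up numbers for 2-bit blocks, and `partition` and `coarse_conditional_bdm` are hypothetical names.

```python
import math
from collections import Counter

# Hypothetical CTM values for 2-bit base blocks (illustrative numbers only).
TOY_CTM = {"00": 2.0, "11": 2.0, "01": 3.0, "10": 3.0}


def partition(s: str, size: int = 2) -> Counter:
    """Count non-overlapping blocks of `size`; a shorter trailing block is dropped."""
    return Counter(s[i:i + size] for i in range(0, len(s) - size + 1, size))


def coarse_conditional_bdm(x: str, y: str, ctm=TOY_CTM, size: int = 2) -> float:
    """Information in x not explained by y, under the reading stated above."""
    adj_x, adj_y = partition(x, size), partition(y, size)
    total = 0.0
    for block, n_x in adj_x.items():
        if block not in adj_y:
            # Block unseen in y: pay for its description and its multiplicity.
            total += ctm[block] + math.log2(n_x)
        elif n_x != adj_y[block]:
            # Shared block with a different multiplicity: pay for the count only.
            total += math.log2(n_x)
    return total


print(coarse_conditional_bdm("00000101", "00001111"))  # only block "01" is new
print(coarse_conditional_bdm("00000101", "00000101"))  # nothing left to explain
```

In a real setting the table would be replaced by precomputed CTM values for the chosen block size, and the partition strategy would be chosen to match the structure of the data.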
4.3 Strong Conditional BDM
The previous definition featured the adjective coarse because we can define a stronger version of conditional BDM that uses conditional CTM and approximates $K(X|Y)$ with greater accuracy. As explained in Section 9.2, one of the main weaknesses of coarse conditional BDM is its inability to detect the algorithmic relationship between base blocks, in contrast with conditional CTM.
Definition 6.
The strong conditional BDM of $X$ with respect to $Y$ corresponding to the partition strategy $\{\alpha_i\}$ is
$$BDM(X|Y) = \min_{P} \sum_{((r_i^x, n_i^x), (r_j^y, n_j^y)) \in P} \left( CTM(r_i^x | r_j^y) + f(n_i^x, n_j^y) \right),$$
where $P$ is a pairing of the base elements in the decompositions of $X$ and $Y$ in which the elements of $X$ can appear only once in a pair, but without restrictions as to the elements of $Y$. This is a functional relation $P : Adj(X) \to Adj(Y)$, and $f$ is the same function as specified in Def. 5. If conditional CTM is used, then we say that we have strong conditional CTM.
While we assert that the pairing strategy minimizing the given sum will yield the best approximation to $K(X|Y)$ in all cases, prior knowledge of the algorithmic structure of the objects can be used to facilitate the computation by reducing the number of possible pairings to be explored, especially when using the domain specific version of conditional BDM. For instance, if two objects are known to be produced from local dynamics, then restricting the algorithmic comparisons by considering pairs based on their respective position on the tensors will, with high probability, yield the best approximation to their algorithmic distance.
4.3.1 The Relationship Between Coarse and Strong BDM
It is easy to see that under the same partition strategy, strong conditional BDM will always present a better approximation to than its coarse counterpart. If the partition of two tensors and does not share an algorithmic connection other than subtensor equality, i.e. there exists , then this is the best case for coarse BDM, and applying both functions will yield the same approximation to . However, if there exist two base blocks where then where is the diminishing error incurred in proposition 1 in [48]. Moreover, unlike coarse conditional BDM, the accuracy of the approximation offered with the strong variant will improve in proportion to the size of the base objects, ultimately converging towards CTM and then itself.
The properties of strong and coarse conditional BDM and their relation with entropy are shown in the appendix (Section 9). In particular, we show that conditional BDM is well behaved by defining joint and mutual BDM (Section 9.1), and we show that its behavior is analogous to the corresponding Shannon’s entropy functions. We also discuss the relation that both measures have with entropy (Section 9.2), showing that, in the worst case, we converge towards conditional entropy.
5 Algorithmic Optimization Methodology
In the previous sections we proposed algorithmic loss and cost functions (Def. 1) for supervised learning tasks, along with means to compute approximations to these theoretical mathematical objects. Here we ask how to perform model parameter optimization based on such measures. Many of the most widely used optimization techniques rely on the cost function being (sufficiently) differentiable, smooth and convex ([3]), for instance gradient descent and associated methods ([6, 23]). In the next section we will show that such methods are not adequate for the algorithmic cost function.
5.1 The Nonsmooth Nature of the Algorithmic Space
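Let us start with a simple bilinear regression problem. Let
$$f(a, b) = 0.1010101\ldots \times a + 0.0101010\ldots \times b \quad (2)$$
be a linear function used to produce a set of 20 random data points $\hat{X}$ of the form $(a, b, f(a, b))$, and let $M(a, b) = s_1 \times a + s_2 \times b$ be a proposed model whose parameters $s_1$ and $s_2$ must be optimized in order to, hopefully, fit the given data.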
According to Def. 1, the loss function associated with this optimization problem is $L_a(y_i, \hat{y}_i) = K(y_i | M(a_i, b_i))$. A visualization of the surface resulting from this function, where $K$ was approximated by coarse conditional BDM (Def. 5) with a partition of size 3, can be seen on the left of Figure 1. From the plot we observe that the resulting surface is not smooth and that gradient-based approaches would fail to converge toward a non-local minimum. This observation was tested by applying several optimization techniques: gradient descent (constrained to a square of radius 0.25 around the solution), random search, and a purely random search. The purely random algorithm simply sampled 5000 random points and chose the point where the cost function was lowest. On the right of Figure 1 we can see that this random pooling of points was the best-performing optimization technique. It is well understood that a random pooling optimization method like the one we performed is not scalable to larger, more complex problems. However, the function $f$ has an algorithmic property that will allow us to construct a more efficient optimization method, which we will call algorithmic parameter optimization.
5.2 Algorithmic Parameter Optimization
The established link between algorithmic information and algorithmic probability theory ([29]) provides a path for defining optimal (under the only assumption of computable algorithms) optimization methods. The central question in the area of parameter optimization is the following: given a data set $\hat{X} = \{(x_i, y_i)\}$, what is the best model $M$ that satisfies $M(x_i) = y_i$ and, hopefully, will extend to pairs of the same phenomena that are not present in $\hat{X}$?
Algorithmic probability theory establishes and quantifies the fact that the most probable computable program is also the least complex one [36, 28], thereby formalizing a principle of parsimony such as Ockham’s razor. Formally, we define the algorithmic parameter optimization problem as the problem of finding a model $M$ such that (a) $M$ minimizes $K(M)$ and (b) $M$ minimizes the cost function $J_a(\hat{X}, M)$.
By searching for the solution using the algorithmic order we can meet both requirements in an efficient amount of time. We start with the least complex solution, therefore the most probable one, and then we move towards the most complex candidates, stopping once we find a good enough value for $J_a$ or after a determined number of steps, in the manner of other optimization methods.
Definition 7.
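Let $M$ be a model, $J_a$ the algorithmic cost function, $\hat{X}$ the training set and finally let $\Sigma = \{\sigma_1, \sigma_2, \dots, \sigma_i, \dots\}$ be the parameter space, ordered according to algorithmic order (from least to most algorithmically complex). Then the simple algorithmic parameter optimization (or algorithmic search) is performed by evaluating $J_a(\hat{X}, M_{\sigma_i})$ for each $\sigma_i$ in that order and retaining the parameters with the lowest cost found so far, where the halting condition is defined in terms of the number of iterations or a specific value for $J_a$.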
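The search procedure can be sketched as a loop over candidates sorted by a complexity estimate. Everything concrete below is an assumption made for illustration: the complexity proxy (length of the shortest repeating unit of a bit string) stands in for BDM or CTM, the cost function is a Hamming distance to a hidden target rather than an algorithmic cost, and the names `period_complexity` and `algorithmic_search` are hypothetical.

```python
from itertools import product


def period_complexity(bits: str) -> int:
    """Toy complexity proxy: length of the shortest unit whose repetition yields `bits`."""
    n = len(bits)
    for p in range(1, n + 1):
        if all(bits[i] == bits[i % p] for i in range(n)):
            return p
    return n


def algorithmic_search(cost, candidates, complexity, max_iters=None, target=0.0):
    """Visit candidates from least to most complex, keeping the lowest cost seen."""
    ordered = sorted(candidates, key=lambda s: (complexity(s), s))
    best, best_cost, steps = None, float("inf"), 0
    for cand in ordered:
        if max_iters is not None and steps >= max_iters:
            break  # halting condition: iteration budget exhausted
        steps += 1
        value = cost(cand)
        if value < best_cost:
            best, best_cost = cand, value
        if best_cost <= target:
            break  # halting condition: cost is good enough
    return best, best_cost, steps


hidden = "10101010"  # hypothetical optimum with a simple repeating structure
candidates = ["".join(b) for b in product("01", repeat=8)]
hamming = lambda s: sum(a != b for a, b in zip(s, hidden))

best, best_cost, steps = algorithmic_search(hamming, candidates, period_complexity)
print(best, best_cost, steps)
```

Because the hidden optimum has a short repeating unit, it sits near the front of the ordering and is reached after a handful of evaluations out of 256 candidates; an optimum with no such regularity would be reached much later.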
The algorithmic cost function is not expected to reach zero. In a perfect fit scenario, the loss of a sample is the relative algorithmic complexity of $y_i$ with respect to the model itself, which can be unbounded. Depending on the learning task, we can search for heuristics to define an approximation to the optimum value for $J_a$, and thus end the search when a close enough optimization has been reached. This resembles the way in which the number of clusters is naturally estimated with algorithmic information itself [44], stopping the process when the model’s complexity starts to increase rather than decrease when given the data as conditional variable. Given the semi-uncomputable nature of $K$, there is no general method to find such conditions, but they can be approximated. Another way to define a stopping condition is by combining other cost functions, such as the MSE or accuracy over a validation set in the case of classification. What justifies Def. 7 is the importance of the ordering of the parameter space and the expected execution time of the program provided.
By Def. 7, it follows that the parameter space is countable and computable. This is justified, given that any program is bound by the same requirements. For instance, in order to fit the output of the function $f$ (Eq. 2) by means of the model $M$, we must optimize over two continuous parameters $s_1$ and $s_2$. Therefore the space of parameters is composed of the pairs of real numbers $(s_1, s_2)$. However, a computer cannot fully represent a real number, using instead an approximation by means of a fixed number of bits. Since this second space is finite, so are the parameter space and the search space, which is composed of pairs of binary strings of finite size, the algorithmic information complexity value of which can be approximated by BDM or CTM. Furthermore, as the next two examples will show, for algorithmic optimization a search space based on binary strings can be considered an asset that can be exploited to speed up the algorithm and improve performance, rather than a hindrance because of its lower accuracy in representing continuous spaces. This is because algorithmic search is specifically designed to work within a computable space.
Now, consider a fixed model structure $M$. Given that the algorithmic parameter optimization always finds the lowest algorithmically complex parameters that fit the data within the halting condition, the resulting model is the most algorithmically plausible model that meets the restrictions imposed by the definition of $M$. This property results in a natural tendency to avoid overfitting. Furthermore, algorithmic optimization will always converge significantly more slowly to overly complex models that will tend to overfit the data even if they offer a better explanation of a reduced data set $\hat{X}$. Conversely, algorithmic parameter optimization will naturally be a poor performer when inferring models of high algorithmic complexity. Finally, note that the method can be applied to any cost function, preserving the above properties. Interestingly, this can potentially be used as a method of regularization in itself.
5.2.1 On the Expected Optimization Time
Given the way that algorithmic parameter optimization works, the optimization, as measured by the number of iterations, will converge faster if the optimal parameters have low algorithmic complexity, that is, if they are more plausible in the algorithmic sense. In other words, if we assume that, for the model we are defining, the parameters have an underlying algorithmic cause, then they will be found faster by algorithmic search, sometimes much faster. How much faster depends on the problem and its algorithmic complexity. In the context of artificial evolution and genetic algorithms, it has been previously shown that, by using an algorithmic probability distribution, the exponential random search can be sped up to quadratic ([9, 10, 20]).
Following the example of inferring the function in Section 5.1, the mean and median BDM values for the parameter space of pairs of 8-bit binary strings are 47.6737 and 47.7527, respectively, while the optimum parameters have a BDM of 44.2564. This lower BDM value confirms the intuition that binary representations of both parameters have an algorithmic source (repeating 10 or 01). The difference in value might seem small on a continuum, but in algorithmic terms it translates into an exponential absolute distance between candidate strings: the optimum parameters are expected to be found faster, by a factor exponential in this difference, by algorithmic optimization (compared to a search within the space). The optimum solution occupies position 1026 out of 65,281 pairs of strings. Therefore the optimum for this optimization problem can be found within 1026 iterations, or nearly 64 times faster.
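As a rough worked illustration of the figures quoted for this example, and assuming (this is an interpretation, not a statement taken from the text) that a BDM difference of $\Delta$ bits corresponds to a factor of about $2^{\Delta}$ in expected search effort:
$$\Delta = 47.6737 - 44.2564 = 3.4173 \text{ bits}, \qquad 2^{3.4173} \approx 10.7,$$
$$\frac{65{,}281}{1026} \approx 63.6.$$
The first line converts the gap between the mean BDM of the parameter space and the BDM of the optimum into a multiplicative factor; the second is the ratio between the size of the search space and the position of the optimum in algorithmic order.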
The assumption that the optimum parameters have an underlying simplicity bias is strong, but has been investigated [15, 46] and is compatible with principles of parsimony. This bias favours objects of interest that are of low algorithmic complexity, though they may appear random. For example, the decimal expansions of the constants $\pi$ or $e$ to an accuracy of 32 bits have a BDM value of 666.155 and 674.258, respectively, while the expected BDM for a random binary string of the same size is significantly larger. This means that, according to algorithmic probability, we can expect them to be found significantly faster than a random string by using algorithmic optimization methods.
At the same time, we are aware that, for the example given, mathematical analysis-based optimization techniques have a perfect and efficient solution in terms of the gradient of the MSE cost function. While algorithmic search is faster than random search for a certain class of problems, it may be slower for another large class of problems. However, algorithmic parameter optimization (Def. 7) is a domain- and problem-independent general method. This new field of algorithmic machine learning that we are introducing is at an early stage of development. In the next sections we set forth some further developments that may help boost the performance of our algorithmic search for specific cases, such as greedy search over the subtensors, and there is no reason to believe that more boosting techniques will not be developed and introduced in the future.
6 Methods
6.1 Traversing Nonsmooth Algorithmic Surfaces for Solving Ordinary Differential Equations
Thus far we have provided the mathematical foundation for machine learning based on the power of algorithmic probability at the price of operating on a non-smooth loss surface in the space of algorithmic complexity. While the directed search technique we have formulated succeeds with discrete problems, here we ask whether our tools generalize to problems in a continuous domain. To gain insight, we evaluate whether we can estimate parameters for ordinary differential equations. Parameter identification is well known to be a challenging problem in general, and in particular for continuous models when the dataset is small and in the presence of noise. Following [16], as a sample system we take a system of two ordinary differential equations (Eq. 2) with two hidden parameters. Let the model include a function that correctly approximates the numerical value corresponding to given parameters for the ODE system. Let us consider a model composed of a binary representation of the parameter pair by a 16-bit string, where the first 8 bits represent the first parameter and the last 8 bits the second, for a parameter search space of size $2^{16}$, and where within each group of 8 bits the first 4 represent the integer part and the last 4 the fractional part. Thus the hidden solution is represented by the binary string ‘0101000000010000’, which under this encoding corresponds to the parameter values 5 and 1.
6.2 Finding Computable Generative Mechanisms
An elementary cellular automaton (ECA) is a discrete and linear binary dynamical system where the state of a node is defined by the states of the node itself and its two adjacent neighbours ([42]). Despite their simplicity, the dynamics of these systems can achieve Turing completeness. The task was to classify a set of black and white images representing the evolution of one of eleven elementary cellular automata according to a random 32-bit binary initialization string. Aside from the Turing-complete rule with number 110, the automata were randomly selected among all 256 possible ECA. The training set was composed of 275 black and white images, 25 for each automaton or ‘class’. An independent validation set of the same size was also generated, along with a test set with 1,375 evenly distributed samples. An example of the data in these data sets is shown in Figure 2.
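Images of this kind can be generated with a few lines of code. The sketch below assumes the standard Wolfram rule numbering, periodic boundary conditions and an arbitrary number of time steps; the boundary condition, the number of steps and the function names `eca_step` and `eca_evolution` are assumptions, since the text does not specify them.

```python
import random


def eca_step(cells, rule):
    """One synchronous ECA update with periodic boundaries (an assumption)."""
    n = len(cells)
    return [
        (rule >> (cells[(i - 1) % n] << 2 | cells[i] << 1 | cells[(i + 1) % n])) & 1
        for i in range(n)
    ]


def eca_evolution(rule, init, steps):
    """Rows of the space-time diagram: the initial row plus `steps` updates."""
    rows = [list(init)]
    for _ in range(steps):
        rows.append(eca_step(rows[-1], rule))
    return rows


rng = random.Random(0)                        # fixed seed for reproducibility
init = [rng.randint(0, 1) for _ in range(32)]  # random 32-bit initialization
image = eca_evolution(110, init, steps=31)     # 32 x 32 binary image (steps assumed)

for row in image[:8]:
    print("".join("#" if c else "." for c in row))
```

Each row depends only on the previous one through an 8-entry lookup table encoded in the rule number, which is why two automata can differ by at most 8 bits of rule information.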
First we will illustrate the difficulty of the problem by training neural networks with simple topologies over the data. In total we trained four naive neural networks (we say that a NN topology is naive when its design does not use specific knowledge of the target data) that consisted of a flattened layer, followed by either 1, 2, 3 or 4 fully connected linear layers, ending with a softmax layer for classification. The networks were trained using ADAM optimization for 500 rounds. Of these four models, the network with 3 linear layers performed best, with an accuracy of 40.3%.
However, as shown in [18], it is possible to design a deep network that achieves a higher accuracy for this particular classification task. This topology consists of 16 convolutional layers with a kernel of , which was specifically chosen to fit the rules of ECA, a pooling layer that aggregates the data of all the convolutions into a vector of one dimension of length 16, and 11 linear layers (256 in the original version) connected to a final softmax unit. This network achieves an accuracy of 98.8% on the test set and 99.63% on the training set
after 700 rounds. This specialized topology is an example of how, by using prior knowledge of the algorithmic organization of the data, it is possible to guide the variance of a neural network towards the algorithmic structure of the data and avoid overfitting. However, as we will show over the following experiments, this is a specialized ad hoc solution that does not generalize to other tasks.
6.2.1 Algorithmic-probability Classifier Based on Coarse Conditional BDM
The algorithmic probability model chosen consists of eleven binary matrices, each corresponding to a sample class, denoted by , encompassing members . Training this model, using Def. 2, the loss function
is minimized, where is the object that the model assigns to the class . Here we approximate the conditional algorithmic complexity function with the coarse conditional BDM function, then proceed with algorithmic information optimization over the space of the possible binary matrices in order to minimize the computable cost function. However, an elementary cellular automaton can potentially use all 32 bits of information contained in a binary initialization string, and the algorithmic difference between each cellular automaton is bounded and relatively small (within 8 bits). Furthermore, each automaton and initialization string was randomly chosen without regard to an algorithmic cause. Therefore we cannot expect a significant speedup by using an algorithmic search, and it would be nearly equivalent to randomly searching through the space of matrices, which will take a length of time of the order of . Nevertheless, we perform a greedy block optimization version of algorithmic information optimization:

First, we start with eleven matrices of 0s, one per class.

Then, we perform algorithmic optimization, but only changing the bits contained in the upper left submatrix. This step is equivalent to changing all the bits in the quadrant, searching for the matrix that minimizes the cost function.

After minimizing with respect to only the upper left quadrant, we minimize over the upper right quadrant.

We repeat the procedure for the lower left and lower right quadrants.
These four steps are illustrated in Fig. 3 for the class corresponding to the automaton .
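The four steps can be sketched as follows for a single class centre. This is an illustration under stated assumptions rather than the procedure used in the experiments: the cost function `hamming_cost` is a stand-in for the coarse conditional BDM cost, each quadrant is searched exhaustively instead of in order of algorithmic complexity, and the matrices are kept tiny so that the enumeration stays tractable. All names are hypothetical.

```python
import itertools
import numpy as np

def hamming_cost(centre, samples):
    """Stand-in cost: total Hamming distance from the centre to the samples.
    The paper minimizes a cost based on coarse conditional BDM instead."""
    return sum(int(np.sum(centre != s)) for s in samples)

def greedy_block_optimize(samples, shape, cost=hamming_cost):
    rows, cols = shape
    hr, hc = rows // 2, cols // 2
    centre = np.zeros(shape, dtype=np.uint8)          # step 1: all 0s
    quadrants = [
        (slice(0, hr), slice(0, hc)),                 # upper left
        (slice(0, hr), slice(hc, cols)),              # upper right
        (slice(hr, rows), slice(0, hc)),              # lower left
        (slice(hr, rows), slice(hc, cols)),           # lower right
    ]
    for rs, cs in quadrants:
        block_shape = centre[rs, cs].shape
        n_bits = block_shape[0] * block_shape[1]
        best_block = centre[rs, cs].copy()
        best_cost = cost(centre, samples)
        # Try every assignment of the bits inside the current quadrant only.
        for bits in itertools.product((0, 1), repeat=n_bits):
            candidate = centre.copy()
            candidate[rs, cs] = np.array(bits, dtype=np.uint8).reshape(block_shape)
            c = cost(candidate, samples)
            if c < best_cost:
                best_block, best_cost = candidate[rs, cs].copy(), c
        centre[rs, cs] = best_block                   # freeze this quadrant
    return centre
```

Because each quadrant is fixed before the next one is searched, the number of candidates grows with the size of one quadrant rather than with the size of the whole matrix, which is the motivation for the greedy variant.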
6.3 Finding the Initial Conditions for Cellular Automata
The next problem was to classify black and white images representing the evolution of elementary cellular automata. In this case, we are classifying according to the initialization string that produced the corresponding evolution for a randomly chosen automaton. The classes for the experiment consisted of 10 randomly chosen binary strings, each 12 bits in length. These strings correspond to the binary representation of the following integers:
The training, validation and test sets each consisted of two hundred binary images. Each image represents the evolution to 4 steps of one of the 10 strings under a cellular automaton chosen at random from the first 128 rules (restricting to these rules avoids some trivially symmetric cases). It is important to note that the first row of the evolution (the initialization) was removed; otherwise this classification task would be trivial.
We trained and tested a group of neural network topologies on the data in order to establish the difficulty of the classification task. These networks were an adapted version of Fernandes’ topology and 4 naive neural networks that consisted of a flattened (fully-connected) layer, followed by 1, 2, and 5 groups of layers, each consisting of a fully connected linear layer with rectified linear (ReLU) activation function followed by a dropout layer, ending with a linear layer and a softmax unit for classification. The adaptation of the Fernandes topology was limited to changing the kernel of the pooling layer to take into account the nonsquare shape of the data. All networks were trained using the ADAM optimizer.
The best performing network was the shallowest one, which consists of a flattened layer, followed by a fully connected ReLU, a dropout layer, a linear layer with 10 inputs and a softmax unit. This neural network achieved an accuracy of 60.1%. At 18.5%, the performance of Fernandes’ topology was very low, being barely above random choice. This last result is to be expected, given that the topology is domain specific, and should not be expected to extend well to different problems, even though at first glance the problem may seem to be related.
6.3.1 Algorithmic-probability Classifier Based on Strong Conditional BDM
The algorithmic probability model chosen for these tasks consisted of ten 12-bit binary vectors, one per class. The model was trained using algorithmic information greedy block optimization by first optimizing the loss function over the 6 leftmost bits and then over the remaining six.
However, for this particular problem, the coarse version of conditional BDM proved inadequate for approximating the universal algorithmic distance , for which reason we opted to use the stronger version. For the stronger version of conditional BDM we approximated the local algorithmic distance , where is a binary matrix of size and is a binary vector of length 6, in the following way.

First, we computed all the outputs of all possible 12-bit binary strings for each of the first 128 ECA for a total of 528,384 pairs of 12-bit binary vectors and binary matrices, forming the set of pairs .

Then, by considering only the inner 6 bits of the vectors (discarding the 3 bits on the left and the 3 bits on the right) and, similarly, the inner submatrix, we defined
where is the cardinality of . This cropping was done to solve the frontier issue of finite space ECA.

If a particular pair was not present in the database, then considering that , we defined . This means that the algorithmic complexity of given is at least 20 bits.

In the end we obtained a database of the algorithmic distance between all 6-bit vectors and their respective possible outputs.
The previous procedure might at first seem to be too computationally costly. However, just as with Turing Machine based CTM ([35, 48]), this computation only needs to be done once, with the data obtained being reusable in various applications.
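The construction of such a database can be sketched at a reduced scale. The sketch below is not the authors' implementation: the estimate used for pairs that are present (the negative base-2 logarithm of the relative frequency of the pair) is an assumption, as are the periodic boundary, the function names and the default penalty for absent pairs, which is taken here as the ceiling of the base-2 logarithm of the number of evaluated pairs (this choice gives 20 for 528,384 pairs).

```python
import itertools
import math
from collections import Counter

def eca_step(row, rule):
    """One ECA step on a tuple of bits, with a periodic boundary (assumed)."""
    n = len(row)
    return tuple(
        (rule >> (4 * row[i - 1] + 2 * row[i] + row[(i + 1) % n])) & 1
        for i in range(n)
    )

def build_database(width, crop, steps, rules):
    """Count how often each (inner vector, inner matrix) pair is produced."""
    counts = Counter()
    total = 0
    for rule in rules:
        for init in itertools.product((0, 1), repeat=width):
            row, rows = init, []
            for _ in range(steps):
                row = eca_step(row, rule)
                rows.append(row[crop:width - crop])    # keep inner columns only
            counts[(init[crop:width - crop], tuple(rows))] += 1
            total += 1
    return counts, total

def conditional_ctm(counts, total, vector, matrix, missing=None):
    """Assumed estimate: -log2 of the pair frequency, or a fixed penalty."""
    key = (tuple(vector), tuple(tuple(r) for r in matrix))
    c = counts.get(key, 0)
    if c == 0:
        return math.ceil(math.log2(total)) if missing is None else missing
    return -math.log2(c / total)

# Reduced-scale example: 6-bit strings, 1 bit cropped per side, 4 rules.
counts, total = build_database(width=6, crop=1, steps=2, rules=range(4))
```

As noted in the text, the enumeration is done once and the resulting table is reused for every later query.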
The trained model consisted of 10 binary vectors that, as expected, corresponded to the binary expansion of each of the classes. The accuracy of the classifier was 95.5% on the test set.
6.4 Classifying NK Networks
An NK network is a dynamical system that consists of a binary Boolean network where the parameter N specifies the number of nodes or vertices and K defines the number of incoming connections that each vertex has ([17, 22, 1]). Each node has an associated K-ary Boolean function which uses the states of the nodes corresponding to the incoming connections to determine the new state of the node over a discrete time scale. The number of incoming connections K defines the stability (or lack thereof) of the network.
Given the extensive computational resources it would require to compute a CTM database, such as the one used in section 6.4, for Boolean networks of 24 nodes, we opted to do a classification based only on the algorithmic complexity of the samples as approximated by BDM. This approach is justified, considering that according to Definition 2, an algorithmic information model for the classifier can consist of three sets. Each of these sets is composed of all possible algorithmic relations, including the adjacency matrix and related binary operations, corresponding to the number of incoming connections per node (the parameter K). Therefore, given the combinatorial order of growth of these sets, we can expect the quantity of information required to specify the members of each class to increase as a function of K.
Specifically, the number of possible Boolean operations of degree is and the number of possible adjacency matrices is . It follows that the total number of possible network topologies is , and the expected number of bits required to specify a member of this set is . Therefore, the expected algorithmic complexity of the members of each class increases with and . With fixed at 24 we can do a coarse algorithmic classification simply according to the algorithmic complexity of the samples, as approximated by BDM.
Following this idea we defined a classifier where the model consisted of the mean BDM value for each of the classes in the training set. The prediction function measures the BDM of the sample and assigns it to the class centre that is the closest. This classifier achieved an accuracy of 71%. Alternatively, we employed a nearest neighbour classifier using the BDM values of the training set, which yielded virtually identical results. For completeness' sake, we recreated the last classifier using entropy, rather than BDM, to approximate the algorithmic information content. The accuracy of this classifier was 37.66%.
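The mean-value classifier described above reduces to a nearest-centre rule on a single number per sample. A minimal sketch follows, assuming the complexity values (BDM in the paper) have already been computed by some external tool; the function names are hypothetical.

```python
import numpy as np

def fit_centres(complexities, labels):
    """Return the mean complexity of the training samples of each class."""
    complexities = np.asarray(complexities, dtype=float)
    labels = np.asarray(labels)
    return {
        c: float(complexities[labels == c].mean())
        for c in np.unique(labels).tolist()
    }

def predict(centres, value):
    """Assign a sample to the class whose mean complexity is closest."""
    return min(centres, key=lambda c: abs(centres[c] - value))

# Hypothetical complexity values for three classes (K = 1, 2, 3).
centres = fit_centres([10, 12, 50, 54, 90, 94], [1, 1, 2, 2, 3, 3])
```

Replacing the complexity values with entropy estimates gives the entropy-based baseline mentioned in the text without changing the classifier itself.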
For classifying according to the Boolean rules assigned to each node, we used 10 randomly generated (ordered) lists of 4 binary Boolean rules. These rules were randomly chosen (with repetitions) from , , and , with the only limitation being that had to be among the rules. Since the initial state for the network was the vector , at least one XOr was needed in order to generate an evolution other than forty 0s. Then, to generate the samples, each list of binary rules was associated with a number of random topologies (with =2). The training and validation sets were composed of 20 samples for each class (200 samples in total) while the test set contained 2000 samples.
To classify according to network topology we used 10 randomly generated topologies consisting of 10 binary matrices of size , which represented the adjacency matrices of the chosen topologies. The random matrices had the limitation that each column had to contain two and only two 1s, so the number of incoming connections corresponds to . Then, to generate a sample we associated one of the chosen topologies with a random list of rules. This list of rules was, again, randomly chosen from the same 4 Boolean functions and with the limitation that had to be a member. The training and validation sets were composed of 20 samples for each class (200 samples in total) while the test set contained 2000 samples.
6.5 Classifying Kauffman networks
Kauffman networks are a special case of Boolean NK networks where the number of incoming connections for each node is two, that is, K = 2 ([4]). This distinction is important because when K = 2 we have a critical point that “exhibits self-organized critical behaviour”; below that (K < 2) we have too much regularity (a frozen state) and beyond it (K > 2) we have chaos.
6.5.1 Algorithmic-probability Classifier Based on Conditional CTM
For this problem we used a different type of algorithmic cluster centre. For the Boolean rules classifier, the model consisted of ten lists of Boolean operators. More precisely, the model consisted of binary strings that codified a list composed of each of the four Boolean functions used (, , and ) as encoded by the Wolfram Language. For the topology classifier, the model consisted of 10 binary matrices representing the possible network topologies.
The use of different kinds of models for this task showcases another layer of abstraction that can be used within the wider framework of algorithmic probability classification: context. Rather than using binary tensors, we can use a structured object that has meaning for the underlying problem. Yet, as we will show next, the underlying mechanics will remain virtually unchanged.
Returning to Definition 2, to train both models we have to minimize the cost function
So far we have approximated by means of conditional BDM. However, given that at the time of writing this article a sufficiently complete conditional CTM database has yet to be computed, we have estimated the CTM function by using instances of the computable objects, as previously shown in section 6.3.1. Moreover, owing to the nonlocal nature of NK networks and the abstraction layer that the models themselves are working at, rather than using BDM, we have opted to use a context dependent version of CTM directly. In the last task we will show that BDM can be used to classify a similar, yet more general problem.
Following similar steps to the ones used in section 6.3.1, by computing all the 331,776 possible NK networks with and , we compiled two lists of pairs and that contained, respectively, the pairs and , where is the topology and is the list of rules that generated the 40bit vector , which represents the evolution to ten steps of the respective networks. Next, we defined the as:
or as 19 if the pair is not present on either of the lists. Then we approximated by using the defined function directly.
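The figure of 331,776 networks can be checked by direct enumeration. The sketch below assumes that each column of the adjacency matrix contains exactly two 1s and that each of the four nodes takes one of four Boolean functions, with no further constraint applied. It also shows that the default value of 19 coincides with the ceiling of the base-2 logarithm of that count; whether this is how the value was chosen is an assumption.

```python
import itertools
import math

N, K = 4, 2          # nodes and incoming connections per node
N_RULES = 4          # number of Boolean functions available to each node

# One column per node: choose which K of the N nodes feed into it.
columns = list(itertools.combinations(range(N), K))
topologies = sum(1 for _ in itertools.product(columns, repeat=N))

# One Boolean function per node, chosen with repetition.
rule_lists = N_RULES ** N

total = topologies * rule_lists
default_bits = math.ceil(math.log2(total))
```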
6.6 Hybrid Machine Learning
So far we have presented supervised learning techniques that, in a way, diverge. In this section we will introduce one of the ways in which the two paradigms can coexist and complement each other, combining statistical machine learning with an algorithmicprobability approach.
6.6.1 Algorithmic Information Regularization
The choice of an appropriate level of model complexity that avoids both under- and overfitting is a key hyperparameter in machine learning. Indeed, on the one hand, if the model is too complex, it will fit the data used to construct the model very well but generalize poorly to unseen data. On the other hand, if the complexity is too low, the model will not capture all the information in the data. This is often referred to as the bias-variance tradeoff, because a complex model will exhibit large variance, while an overly simple one will be strongly biased. Most traditional methods feature this choice in the form of a free hyperparameter via, e.g., what is known as regularization.
A family of mathematical techniques or processes that has been developed to control overfitting of a model goes under the rubric ‘regularization’, which can be summarized as the introduction of information from the model to the training process in order to prevent overfitting of the data. A widely used method is Tikhonov regularization ([41, 32]), also known as ridge regression or weight decay, which consists in adding a penalty term to the cost function of a model, which increases in direct proportion to the norms of the variables of the model. This method of regularization can be formalized in the following way: Let
be the cost function associated with the model trained over the data set , a model weighting function of the form , where , and a positive real number. The (hyper)parameter is called a regularization parameter; the product is known as the regularization term and the regularized cost function is defined as (3)
The core premise of the previous function is that we are disincentivizing fitting towards certain parameters of the model by assigning them a higher cost in proportion to , which is a hyperparameter that is learned empirically from the data. In current machine learning processes, the most commonly used weighting functions are the sum of the norms of the linear coefficients of the model, such as in ridge regressions ([21]).
We can employ the basic form of equation 3 and define a regularization term based on the algorithmic complexity of the model and, in that way, disincentivize training towards algorithmically complex models, thus increasing their algorithmic plausibility. Formally:
Definition 8.
Let be the cost function associated with the model trained over the data set , the universal algorithmic complexity function, and a positive real number. We define the algorithmic regularization as the function
The justification of the previous definition follows from algorithmic probability and the coding theorem: Assuming an underlying computable structure, the most probable model that fits the data is the least complex one. Given the universality of algorithmic probability, we argue that the stated definition is general enough to improve the plausibility of the model of any machine learning algorithm with an associated cost function. Furthermore, the stated definition is compatible with other regularization schemes.
Just as with the algorithmic loss function (Def. 2), the resulting function is not smooth, and therefore cannot be optimized by means of gradient-based methods. One option for minimizing this class of functions is by means of algorithmic parameter optimization (Def. 7). It is important to recall that computing approximations to the algorithmic probability and complexity of objects is a recent development, and we hope to promote the development of more powerful techniques.
6.7 Algorithmic-probability Weighting
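The structure of Definition 8 (a cost plus a multiple of the complexity of the model) can be illustrated with a short sketch. The complexity estimate used here, the compressed length of the model bits, is only a stand-in chosen to keep the example self-contained; the paper relies on CTM and BDM approximations, and lossless compression is a cruder measure. The selection over a finite list of candidates likewise stands in for algorithmic parameter optimization. All names are hypothetical.

```python
import zlib
import numpy as np

def complexity_proxy(model_bits):
    """Stand-in for the complexity of a model: compressed length in bits."""
    packed = np.packbits(np.asarray(model_bits, dtype=np.uint8)).tobytes()
    return 8 * len(zlib.compress(packed, 9))

def regularized_cost(cost, model_bits, lam, complexity=complexity_proxy):
    """Cost of the model plus lam times its (approximated) complexity."""
    return cost(model_bits) + lam * complexity(model_bits)

def select_model(candidates, cost, lam, complexity=complexity_proxy):
    """Pick the candidate with the lowest regularized cost (no gradients)."""
    return min(
        candidates,
        key=lambda m: regularized_cost(cost, m, lam, complexity),
    )
```

With the regularization parameter set to zero the selection reduces to the unregularized cost, and as it grows, candidates that fit equally well are separated by their estimated complexity.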
Another, perhaps more direct way to introduce algorithmic probability into the current field of machine learning, is the following. Given that in the field of machine learning all model inference methods must be computable, the following inequality holds for any fixed training methodology:
(4) 
where is the fitted model, is the training data, is the model with the parameters during its initialization and corresponds to the length of the program implementing the training procedure. Now, using common initialization conventions, either has very low algorithmic complexity or very high (it’s random), in order to not induce a bias in the model. Thus the only parameter on the right side of the inequality that can be optimized is . It follows that increasing the algorithmic plausibility of a model can be achieved by reducing the algorithmic complexity of training set , which can be achieved by preprocessing the data and weighting each sample using its algorithmic information content, thus optimizing in the direction of samples with lower algorithmic complexity.
Accordingly, the heuristic for our definition of algorithmic probability weighting is that, to each training sample, we assign an importance factor (weight) according to its algorithmic complexity value, in order to increase or diminish the loss incurred by the sample. Formally:
Definition 9.
Let be a cost function of the form
We define the weighted approximation to the algorithmic complexity regularization of or algorithmic probability weighting as
where is a function that weights the algorithmic complexity of each sample of the training data set in a way that is constructive with respect to the goals of the model.
We have opted for flexibility regarding the specification of the function . However, taking into account the noncontinuous nature of , we recommend a discrete definition for . The following characterization has worked well in our trials and we hold that it is general enough to be used in a wide number of application domains:
where and are hyperparameters and denotes that belongs to the ith quantile of the distribution of algorithmic complexities of all the samples belonging to the same class as .
As its name implies, the previous Def. 9 can be considered analogous to sample weighting, which is normally used as a means to confer predominance or diminished importance on certain samples in the data set according to specific statistical criteria, such as survey weights and inverse variance weight [19]. However, a key difference of our definition is that traditional weighting strategies rely on statistical methods to infer values from the population, while with algorithmic probability weighting we use the universal distribution for this purpose. This makes algorithmic probability weighting a natural extension or universal generalization of the concept of sample weighting, and given its universality, it is domain independent.
Now, given that the output of and its parameters are constant from the point of view of the parameters of the model , it is easy to see that if the original cost function is continuous, differentiable, convex and smooth, so is the weighted version . Furthermore, the stated definition is compatible with other regularization techniques, including other weighting techniques, while the algorithmic complexity of the samples can be computed by numerical approximation methods such as the Block Decomposition Method.
7 Results
7.1 Estimating ODE parameters
A key step to enabling progress between a fundamentally discrete theory such as computability and algorithmic probability, and a fundamentally continuous theory such as that of differential equations and dynamical systems, is to find ways to combine both worlds. As shown in Section 5.1, optimizing the parameters with respect to the algorithmic cost function is a challenge (Fig. 4). Following algorithmic optimization, we note that parameters (5 and 1) have low algorithmic complexity due to their functional relation. This is confirmed by BDM, which assigns the unknown solution to the ODE a value of 153.719 when the expected complexity is approximately 162.658, which means that the number of more complex parameter candidates than must be on the order of . Within the parameter space, the solution is at the position 5 093 out of 65 536. Therefore the exact ODE solution can be found within less than 6 thousand iterations following the simple algorithmic parameter optimization (Def. 7) by consulting the function . Furthermore, for the training set of size 10 composed of the pairs corresponding to the list
, we need only 2 samples to identify the solution. This supports the expectation that algorithmic parameter optimization finds a solution with high probability in a relatively low number of iterations, despite a low number of samples, as long as the solution has low complexity. This is a proof of principle that our search technique can not only be used to identify parameters for an ODE problem, but also affords the advantage of faster convergence (fewer iterations), requiring less data to solve the parameter identification problem. In Fig. 5, equivalent to the pixel attacks for discrete objects, we show that the parameter identification is robust to even more than 25% of additive noise. Operating in a low complexity regime, as above, is compatible with a principle of parsimony such as Ockham’s razor, which is empirically found to be able to explain data simplicity bias [43, 15, 47], suggesting that the best explanation is the simplest, but also that what is modelled is not algorithmically random [46].
7.2 Finding Generative Rules of ECA
Following optimization, a classification function was defined to assign a new object to the class corresponding to the centre to which it is the closest according to the algorithmic distance . The classifier obtained reaches an accuracy of 98.1% on the test set and of 99.27% on the training set (Table 1).
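| Classifier | Test Set | Training Set |
|---|---|---|
| Naive network 1 | 38.88% | 95.63% |
| Naive network 2 | 39.70% | 95.63% |
| Naive network 3 | 40.36% | 100% |
| Naive network 4 | 39.05% | 100% |
| Fernandes’ | 98.8% | 99.63% |
| Algorithmic Class. | 98.1% | 99.27% |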
From the data we can see that the algorithmic classifier outperformed the four naive (or simple) neural networks, but it was outperformed slightly by the Fernandes classifier, built expressly for the purpose. But as we will show over the following sections, this last classifier is less robust and is domain specific.
Last but not least, we have tested the robustness of the classifiers by measuring how good they are at resisting one-pixel attacks ([40]). A one-pixel attack occurs when a classifier can be fooled into misclassifying an object by changing just a small portion of its information (one pixel). Intuitively, such small changes should not affect the classification of the object in most cases, yet it has recently been shown that deep neural network classifiers present just such vulnerabilities.
Algorithmic information theory tells us that algorithmic probability classifier models should have a relatively high degree of resilience in the face of such attacks: if an object belongs to a class according to a classifier it means that it is algorithmically close to a centre defining that class. A one-pixel attack constitutes a relatively small information change in an object. Therefore there is a relatively high probability that a one-pixel attack would not alter the information content of an image enough to increase the distance to the centre in a significant way. In order to test this hypothesis, we systematically and exhaustively searched for vulnerabilities in the following way: (a) One by one, we flipped (from 0 to 1 or vice versa) each of the pixels of the samples contained in the test data. (b) If a flip was enough to change the assigned classification for the sample, then it was counted as a vulnerability. (c) Finally, we divided the total number of vulnerabilities found by the total number of samples in order to obtain an expected number of vulnerabilities per sample. The results obtained are shown in Table 2.
| Classifier | Total Vulnerabilities | Per Sample | Percentage of Pixels |
|---|---|---|---|
| Fernandes’ (DNN) | 190,850 | 138.88 | 13.56% |
| Algorithmic Classifier | 15,125 | 11 | 1% |
From the results we can see that for the DNN, 13.56% of the pixels are vulnerable to one-pixel attacks, and that only 1% of the pixels manifest that vulnerability for the algorithmic classifier. These results confirm our hypothesis that the algorithmic classifier is significantly more robust in the face of small perturbations compared to the deep network classifier designed without a specific purpose in mind. It is important to clarify that we are not stating that it is not possible to increase the robustness of a neural network, but rather pointing out that algorithmic classification has a high degree of robustness naturally.
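The exhaustive search described in steps (a) to (c) can be written as a short routine that works with any classifier exposed as a function from a binary image to a label. The sketch below is illustrative; the function name and the toy classifier in the usage lines are hypothetical.

```python
import numpy as np

def count_vulnerabilities(classify, samples):
    """Flip every pixel of every sample once and count changed predictions."""
    total = 0
    for sample in samples:
        original = classify(sample)
        for idx in np.ndindex(sample.shape):
            flipped = sample.copy()
            flipped[idx] ^= 1                      # (a) flip a single pixel
            if classify(flipped) != original:      # (b) prediction changed
                total += 1
    per_sample = total / len(samples)              # (c) average per sample
    pixel_fraction = per_sample / samples[0].size
    return total, per_sample, pixel_fraction

# Toy usage: a classifier that only looks at the top-left pixel.
toy_samples = [np.zeros((2, 2), dtype=np.uint8) for _ in range(3)]
result = count_vulnerabilities(lambda img: int(img[0, 0]), toy_samples)
```

The cost of the search is one classifier call per pixel per sample, so it is only practical for small images and fast prediction functions.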
7.3 Finding Initial Conditions
The accuracy obtained using the different classifiers is represented in Table 3. Based on these results we can see that the algorithmic classifier performed significantly better than the neural networks tested. Furthermore, the first two naive topologies have enough variance present to have a good fit vis-à-vis the training set, in an obvious case of overfitting. The domain specific Fernandes topology maintained a considerably high error rate, exceeding 80%, over 3,135 ADAM training rounds.
It is important to note that in this case collisions, that is, two identical samples that belong to two different classes, can exist. Therefore it is impossible to guarantee 100% accuracy. An exhaustive search classifier that searches through the space for the corresponding initialization string reached an accuracy of 97.75% over the test set.
In order to test the generalization of the CTM database computed for this experiment, we tested our algorithmic classifying scheme on a different instance of the same basic premise: binary images of size that correspond to the output of twenty randomly selected binary strings of 24 bits each for a randomly chosen ECA. The number of samples per class remains at 20 for the training, validation and test sets. The results are shown in the following table. For this case the algorithmic classifier increased its accuracy to 96.56%. Thanks to the additional data, the neural networks also increased their accuracy to 64.11% and 61.74% for the first and second topology, respectively.
7.3.1 Network Topology Algorithmic-information Classifier
The results are summarized in Table 5. Here we can see that only the coarse BDM algorithmic information classifier, with 70% accuracy, managed to reach an accuracy that is significantly better than random choice, validating our method.
Furthermore, by analyzing the confusion matrix plot (Figure 6) we can see that the algorithmic classifier performs (relatively) well at classifying the frozen and chaotic networks, while the deep learning classifier seems to be random in its predictions. The fact that the critical stage was harder to classify is evidence of its rich dynamics, accounting for more varied algorithmic behaviours.
A second task was to classify a set of binary vectors of size 40 that represent the evolution of an NK network of four nodes (N = 4) and two incoming connections (K = 2). Given that an NK network is defined by two causal features, the topology of the network and the Boolean function of each node, we divided the second task in two: classifying according to its topology and according to the underlying Boolean rules.
7.4 Classifying Kauffman Networks
The task was to determine whether a random Boolean network belonged to the frozen, critical or chaotic phase by determining whether K = 1, 2 or 3. Furthermore, we used the full range of possible unary, binary and ternary Boolean operations corresponding to each of the functions associated with a node. For this more general, and therefore harder, classification task, we used larger objects and data sets. The objects to classify were binary vectors of size 240 bits that represented the evolution of networks of 24 nodes to ten steps with incoming connections of degree 1, 2 or 3. The training, validation and test sets were all of size 300, with 100 corresponding to each class.
For the task at hand we trained the following classifiers: a neural network, gradient boosted trees and a convolutional neural network. The first neural network had a naive topology that consisted of a ReLU layer, followed by a dropout layer, a linear layer and a final softmax unit for classification. For the convolutional model we used prior knowledge of the problem and used a specialized topology that consisted of 10 convolutional layers with a kernel of size 24, each kernel representing a stage of the evolution, with a ReLU, a pooling layer of kernel size 24, a flattened layer, a fully connected linear layer and a final softmax layer. The tree-based classifier manages an accuracy of 35% on the test set, while the naive and convolutional neural networks managed an accuracy of 43% and 31% respectively. Two of the three classifiers are nearly indistinguishable from random classification, while the naive neural network is barely above it.
For comparison purposes, we trained a neural network and a logistic regression classifier on the data. The neural network consisted of a naive topology consisting of a ReLU layer, followed by a dropout layer, a linear layer and a softmax unit. The results are shown in Table 4.
From the results obtained we can see that the neural network, with 92.50% accuracy, performed slightly better than the algorithmic classifier (91.35%) on the test set. The logistic regression accuracy is a bit further behind, at 82.35%.
However, the difference in the performance of the topology test set is much greater, with both the logistic regression and the neural network reaching very high error rates. In contrast, our algorithmic classifier reaches an accuracy of 72.4%.
7.5 A First Experiment and Proof of Concept of Algorithmic-probability Weighting
As a first experiment in algorithmic weighting, we designed an experiment using the MNIST dataset of handwritten digits [14]. This dataset, which consists of a training set of 60,000 labelled images representing handwritten digits from 0 to 9 and a test set with 10,000 examples, was chosen given its historical importance for the field and also because it offered a direct way to deploy the existing tools for measuring algorithmic complexity via binarization without compromising the integrity of the data.
The binarization was performed by using a simple mask: if the value of a (gray scale) pixel was above 0.5 then the value of the pixel was set to one, and to zero otherwise. This transformation did not affect the performance of any of the deep learning models tested, including the LeNet-5 topology ([27]), in a significant way.
Next we salted or corrupted 40% of the training samples by randomly shuffling 30% of their pixels. An example of these corrupted samples can be seen in Figure 7. With this second transformation we are reducing the useful information within a randomly selected subset of samples by a random amount, thus simulating a case where the expected amount of incidental information is high, as in the case of data loss or corruption.
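Both preprocessing steps can be sketched with NumPy. The binarization follows the threshold stated in the text. The salting routine reflects one possible reading of "randomly shuffling 30% of their pixels", namely that the selected pixels are permuted among themselves; this reading, the function names and the fixed seed are assumptions.

```python
import numpy as np

def binarize(images, threshold=0.5):
    """Set pixels above the threshold to 1 and all others to 0."""
    return (np.asarray(images) > threshold).astype(np.uint8)

def salt(images, sample_fraction=0.4, pixel_fraction=0.3, seed=0):
    """Shuffle a fraction of the pixels in a random subset of the samples."""
    rng = np.random.default_rng(seed)
    out = np.array(images, copy=True)
    n = len(out)
    chosen = rng.choice(n, size=int(round(sample_fraction * n)), replace=False)
    for i in chosen:
        flat = out[i].reshape(-1).copy()
        k = int(round(pixel_fraction * flat.size))
        pos = rng.choice(flat.size, size=k, replace=False)
        # Permute the values found at the selected positions among themselves.
        flat[pos] = flat[rng.permutation(pos)]
        out[i] = flat.reshape(out[i].shape)
    return out, np.sort(chosen)
```

Under this reading the number of black pixels in each image is preserved, so the corruption removes spatial structure without changing the overall density of the sample.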
Finally, we trained 10 neural networks of increasing depth and, more importantly, increasing variance, setting aside 20% of the training data as a validation set. The topology of these networks consisted of a flattened layer, followed by an increasing number of fully connected linear layers with rectified linear (ReLU) activation functions, and a final softmax layer for classification. In order to highlight the effect of our regularization proposal, we abstained from using other regularization techniques and large batch sizes. For instance, optimizing using variable learning rates such as RMSProp along with small stochastic batches is an alternative way of steering the training away from the salted samples.
For purposes of comparison, the neural networks were trained with and without weighting, using the sample_weight option of the train_on_batch method in Keras. The training parameters for the networks, which were trained using Keras on Python 3, were the following:

Stochastic gradient descent with a batch size of 5,000 samples.

40 epochs (therefore 80 training stages), with the exception of the last model with 10 ReLU layers, which was trained for 150 training stages.

Categorical cross-entropy as loss function.

ADAM optimizer.
The hyperparameters for the algorithmic weighting function used were:
(5) 
which means that if the BDM value for a given sample was in the 75th quantile of the algorithmic complexity within its class, then it was assigned a weight of 0.01; the assigned weight was 0.5 if it was in the 50th quantile, and 2 if it was among the lower half in terms of algorithmic complexity within its class. The values of these hyperparameters were found by random search. That is, we tried various candidates for the function on the validation set and we report the one that worked best. Although not resorted to for this particular experiment, other hyperparameter optimization methods such as grid search can be used.
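The weighting rule just described can be sketched as follows. This is an illustrative reading rather than the exact form of Eq. (5): it assumes that "in the 75th quantile" means that at least 75% of the samples of the same class have a strictly lower complexity estimate, and the function name quantile_weights is hypothetical. The complexity values are taken as given, since in the experiment they come from a BDM estimator that is not reproduced here.

```python
from bisect import bisect_left
from collections import defaultdict


def quantile_weights(complexities, labels, weights=(2.0, 0.5, 0.01)):
    """Map each sample to a weight based on its within-class complexity rank."""
    by_class = defaultdict(list)
    for value, label in zip(complexities, labels):
        by_class[label].append(value)
    for values in by_class.values():
        values.sort()

    low_w, mid_w, high_w = weights
    result = []
    for value, label in zip(complexities, labels):
        ranked = by_class[label]
        # Fraction of same-class samples with strictly lower complexity.
        frac_below = bisect_left(ranked, value) / len(ranked)
        if frac_below >= 0.75:
            result.append(high_w)   # most complex quarter: weight 0.01
        elif frac_below >= 0.5:
            result.append(mid_w)    # third quarter: weight 0.5
        else:
            result.append(low_w)    # lower half: weight 2
    return result
```

The resulting list has one entry per training sample, which is the shape expected by a per-sample weight argument such as the sample_weight option mentioned above.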
Following the theoretical properties of algorithmic regularization, by introducing algorithmic probability weighting we expected to steer the fitting of the target parameters away from random noise and towards the regularities found in the training set. Furthermore, the convergence toward the minimum of the loss function is expected to be significantly faster, in another instance of algorithmic probability speedup ([20]). We expected the positive effects of the algorithmic probability weighting to increase with the variance of the model to which it was applied. This expectation is the hypothesis tested in the numerical experiment described next.
The differences in the accuracy of the models observed through the experiments as a function of variance (number of ReLU layers) are summarized in Figure 8. The upper plots show the difference between the mean accuracy and the maximum accuracy obtained through the optimization of the network parameters for networks of varying depth. A positive value indicates that the networks trained with the algorithmic weights showed a higher accuracy than the unweighted ones. The difference in the steepness of the loss function between the models is shown in the left plot of Figure 10, which is also presented as a function of the number of ReLU layers. A positive value indicates that a linear approximation to the loss function had a steeper gradient for the weighted models when compared to the unweighted ones. In Figure 9, we can see the evolution of this difference with respect to the percentages of corrupted samples and the corrupted pixels within these samples.
As the data show (Figure 8), the networks trained with the algorithmic weights are more accurate at classifying all three sets: the salted training set, the (unsalted) test set and the (salted) validation set. This is reflected in the positive differences in both the mean accuracy (over all the training epochs) and the maximum accuracy attained by each of the networks. Also, as predicted, this difference increases with the variance of the networks: at higher variance, the difference in accuracy increases. Moreover, as shown in Figure 10, the weighted models reach the minimum of the loss function in fewer iterations, as exemplified by the steeper linear approximation to the evolution of the cost for the weighted models. This difference also increases with the variance of the model.
Figure 10: On the left are shown the differences between the slopes of the linear approximation to the evolution of the loss function for the first six weighted and unweighted models. The linear approximation was computed using linear regression over the first 20 training rounds. On the right we have the loss function of the models with 10 ReLU units. From both plots we can see that training toward the minimum of the loss function is consistently faster on the models with the algorithmic complexity sample weights, and that this difference increases with the variance of the model.
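The steepness comparison can be illustrated with an ordinary least-squares fit over the first 20 recorded loss values. The sketch below is an assumption-laden illustration: the function names are hypothetical, the training rounds are indexed 0, 1, 2, ..., and the sign convention (unweighted slope minus weighted slope, so that a positive value means the weighted loss falls faster) is chosen here to match the description of a positive difference.

```python
def fitted_slope(losses):
    """Least-squares slope of the loss values against the round index 0, 1, 2, ..."""
    n = len(losses)
    mean_x = (n - 1) / 2
    mean_y = sum(losses) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(losses))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def steepness_difference(weighted_losses, unweighted_losses, rounds=20):
    """Positive when the weighted model's loss decreases faster (assumed convention)."""
    return fitted_slope(unweighted_losses[:rounds]) - fitted_slope(weighted_losses[:rounds])


# Hypothetical loss curves: the weighted model descends twice as fast.
weighted = [2.0 - 0.10 * t for t in range(25)]
unweighted = [2.0 - 0.05 * t for t in range(25)]
difference = steepness_difference(weighted, unweighted)
```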
8 Conclusions
Here we have presented a mathematical foundation within which to solve supervised machine learning tasks using algorithmic information and algorithmic probability theories. We think this is the first time that a symbolic inference engine has been integrated with more traditional machine learning approaches. This constitutes not only a path towards bringing symbolic computation and statistical machine learning together, but also allows a state-to-state and cause-and-effect correspondence between model and data, and therefore a powerful, interpretable white-box approach to machine learning. This framework is applicable to any supervised learning task, does not require differentiability, and is naturally biased against complex models; hence it is inherently designed against overfitting, robust in the face of adversarial attacks and more tolerant to noise in continuous identification problems.
We have shown specific examples of its application to different problems. These problems included the estimation of the parameters of an ODE system; the classification of the evolution of elementary cellular automata according to their underlying generative rules; the classification of binary matrices with respect to 10 initial conditions that evolved according to a random elementary cellular automaton; the classification of the evolution of a Boolean NK network with respect to 10 associated binary rules or ten different network topologies; and the classification of the evolution of a randomly chosen network according to its connectivity (the parameter ). These tasks were chosen to highlight different approaches that can be taken to applying our model. We also assert that for these tasks it is generally hard for non-specialized classifiers to get accurate results with the amount of data given.
While simple, the ODE parameter estimation example illustrates the range of applications even in the context of a simple set of equations where the unknown parameters are those explored above in the context of a neural network [16], . These parameters correspond to a low algorithmic complexity model. Given the way that algorithmic parameter optimization works, the optimization time, as measured by the number of iterations, will converge faster if the optimal parameters have low algorithmic complexity, and therefore are more plausible in the algorithmic sense. These low complexity assumptions are compatible with a principle of parsimony such as Ockham’s razor, empirically found to be able to explain data simplicity bias [43, 15, 47], and suggesting that the best explanation is also the simplest, but also that what is modelled is not algorithmically random [46]. The advantage of our approach is that it offers a means to reveal a set of candidate generative models.
From the results obtained from the first classification task (Section 6.2), we can conclude that our vanilla algorithmic classification scheme performed significantly better than the non-specialized vanilla neural network tested. For the second task (Section 6.3), our algorithmic classifier achieved an accuracy of 95.5%, which was considerably higher than the 60.11% achieved by the best performing neural network tested.
For finding the underlying topology and the Boolean functions associated with each node, the naive neural network achieved a performance of 92.50%, compared to 91.35% for our algorithmic classifier. However, when classifying with respect to the topology, our algorithmic classifier showed a significant difference in performance, with over 39.75% greater accuracy. There was also a significant difference in performance on the fourth task, with the algorithmic classifier reaching an accuracy of 70%, compared to the 43% of the best neural network tested.
We also discussed some of the limitations and challenges of our approach, as well as how it can be combined with, and complement, more traditional statistical approaches in machine learning. Chief among the limitations is the current lack of a comprehensive Turing-machine-based conditional CTM database required for the strong version of conditional BDM. We expect to address this limitation in the future.
It is important to emphasize that we are not stating that there is no neural network that is able to obtain similar, or even better, results than our algorithms. Neither do we affirm that algorithmic probability classification in its current form is better on any metric than the existing extensive methods developed for deep learning classification. However, we have introduced a completely different view, with a new set of strengths and weaknesses, that with further development could represent a better-grounded alternative suited to a subset of tasks beyond statistical classification, where finding generative mechanisms or first principles is the main goal, with all its attendant difficulties and challenges.
References
 [1] (2003) Boolean dynamics with random couplings perspectives and problems in nonlinear science: a celebratory volume in honor of lawrence sirovich. Berlin: Springer. Cited by: §6.4.
 [2] (2009) Sophistication revisited. Theory of Computing Systems 45 (1), pp. 150–161. Cited by: §3.
 [3] (1966) Minimization of functions having lipschitz continuous first partial derivatives. Pacific Journal of mathematics 16 (1), pp. 1–3. Cited by: §5.
 [4] (1981) Random boolean networks. Cybernetics and System 12 (12), pp. 103–121. Cited by: §6.5.

 [5] (1995) Neural networks for pattern recognition. Oxford university press. Cited by: §3.
 [6] (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pp. 177–186. Cited by: §5.
 [7] (1969) On the length of programs for computing finite binary sequences: statistical considerations. Journal of the ACM 16 (1), pp. 145–159. Cited by: §1, §1.
 [8] (1982) Algorithmic information theory. In Encyclopaedia of Statistical Sciences, Vol. 1, pp. 38–41. Cited by: §4.
 [9] (2009) Evolution of mutating software. Bulletin of the EATCS 97, pp. 157–164. Cited by: §5.2.1.
 [10] (2013) Proving darwin: making biology mathematical. Vintage. Cited by: §5.2.1.
 [11] (2005) Clustering by compression. IEEE Transactions on Information theory 51 (4), pp. 1523–1545. Cited by: footnote 1.
 [12] (2012) Elements of information theory. John Wiley & Sons. Cited by: §9.1.
 [13] (2012) Numerical evaluation of algorithmic complexity for short strings: a glance into the innermost structure of randomness. Applied Mathematics and Computation 219 (1), pp. 63–77. Cited by: §1, §1, §4.1, §4.1, §4.

 [14] (2012) The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine 29 (6), pp. 141–142. Cited by: §7.5.
 [15] (2018) Input-output maps are strongly biased towards simple outputs. Nature Communications 9(761). Cited by: §5.2.1, §7.1, §8.
 [16] (2011) A simultaneous approach for parameter estimation of a system of ordinary differential equations, using artificial neural network approximation. Industrial & Engineering Chemistry Research 51 (4), pp. 1809–1814. Cited by: §6.1, §8.
 [17] (2005) Kauffman networks: analysis and applications. In Proceedings of the 2005 IEEE/ACM International conference on Computeraided design, pp. 479–484. Cited by: §6.4.
 [18] (2018)(Website) External Links: Link Cited by: §6.2.
 [19] (2008) Statistical metaanalysis with applications. Book, John Wiley & Sons.. Cited by: §6.7.
 [20] (2018) Algorithmically probable mutations reproduce aspects of evolution, such as convergence rate, genetic memory and modularity. Royal Society open science 5 (8), pp. 180399. Cited by: §5.2.1, §7.5.
 [21] (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12 (1), pp. 55–67. Cited by: §6.6.1.
 [22] (1969) Metabolic stability and epigenesis in randomly constructed genetic nets. Journal of theoretical biology 22 (3), pp. 437–467. Cited by: §6.4.
 [23] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.
 [24] (1965) Three approaches to the quantitative definition of information. Problems of Information and Transmission 1 (1), pp. 1–7. Cited by: §1, §1, §4.
 [25] (1991) An almost machineindependent theory of programlength complexity, sophistication, and induction. Information Sciences 56 (13), pp. 23–33. Cited by: §3.
 [26] (1991) Learning to predict nondeterministically generated strings. Machine Learning 7 (1), pp. 85–99. Cited by: §3.
 [27] (1998) Gradientbased learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §7.5.
 [28] (1974) Laws of information conservation (nongrowth) and aspects of the foundation of probability theory. Problems in Form. Transmission 10, pp. 206–210. Cited by: §1, §5.2.
 [29] (1974) Laws of information conservation (nongrowth) and aspects of the foundation of probability theory. Problemy Peredachi Informatsii 10 (3), pp. 30–35. Cited by: §1, §4.1, §5.2.
 [30] (1982) Least squares quantization in pcm. IEEE Transactions on Information Theory 28(2), pp. 129–137. Cited by: §3.
 [31] (2014) Panel discussion on the limits of understanding. World Science Festival, NYC, Dec 14, 2014. Cited by: §1.
 [32] (2007) Linear regularization methods. In Numerical Recipes: The Art of Scientific Computing, pp. 1006–1016. External Links: ISBN 9780521880688 Cited by: §6.6.1.
 [33] (1948) A mathematical theory of communication. Bell system technical journal 27 (3), pp. 379–423. Cited by: §9.1.
 [34] (2014) Calculating kolmogorov complexity from the output frequency distributions of small turing machines. PLoS ONE 9 (5), pp. 1–18. Cited by: §1, §1, §4.1.
 [35] (2014) Calculating kolmogorov complexity from the output frequency distributions of small turing machines. PloS one 9 (5), pp. e96223. Cited by: §6.3.1.
 [36] (1964) A formal theory of inductive inference: parts 1 and 2. Information and Control 7 (122), pp. 224–254. Cited by: §1, §1, §5.2.
 [37] (1960) A preliminary report on a general theory of inductive inference. Technical report Zator Co., Cambridge. Cited by: §1.
 [38] (2003) The kolmogorov lecture the universal distribution and machine learning. The Computer Journal 46 (6), pp. 598–601. Cited by: §1, §9.2.

 [39] (1986) The application of algorithmic probability to problems in artificial intelligence. In Machine Intelligence and Pattern Recognition, Vol. 4, pp. 473–491. Cited by: §1, §1.
 [40] (2019) One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation. Cited by: §7.2.
 [41] (1963) Solution of incorrectly formulated problems and the regularization method. Soviet Math. Dokl. 4, pp. 1035–1038. Cited by: §6.6.1.
 [42] (2002) A new kind of science. Wolfram Media. Cited by: §6.2.
 [43] (2010) On the algorithmic nature of the world. in g. dodigcrnkovic and m. burgin (eds). In Information and Computation, Cited by: §7.1, §8.
 [44] (2019) Causal deconvolution by algorithmic generative models. Nature Machine Intelligence 1, pp. 58–66. Cited by: §5.2.
 [45] (201106) Une approche expérimentale à la théorie algorithmique de la complexité, dissertation in fulfilment of the degree of doctor in computer science (committee: j.p. Ph.D. Thesis, Université de Lille 1. Note: Highest honours Cited by: §1.
 [46] (2020) Compression is comprehension, and the unreasonable effectiveness of digital computation in the natural world. In Taming Complexity: Life and work of Gregory Chaitin. F.A. Doria, S. Wuppuluri (eds.), Vol. forthcoming. Cited by: §5.2.1, §7.1, §8.
 [47] (2018) Codingtheorem like behaviour and emergence of the universal distribution from resourcebounded algorithmic probability. International Journal of Parallel Emergent and Distributed Systems. Cited by: §7.1, §8.
 [48] (2018) A decomposition method for global evaluation of shannon entropy and local estimations of algorithmic complexity. Entropy 20 (8), pp. 605. Cited by: §1, §1, §4.2, §4.3.1, §4, §6.3.1, §9.2.1, §9.3.
 [49] (2015) Twodimensional kolmogorov complexity and an empirical validation of the coding theorem method by compressibility. PeerJ Computer Science 1, pp. e23. Cited by: §4.1.
9 Appendix
9.1 Joint and Mutual BDM
In classical information theory ([33, 12]) we can think of mutual entropy as the information contained over two or more events occurring concurrently, and of joint entropy over the two communication channels or events as the average uncertainty contained over all possible combinations of events. For algorithmic information theory, the first concept can be understood as the “amount of information within an object that is explained by another” and the second concept can be interpreted as the “amount of information contained within two or more objects”.
In contrast to classical information theory, we started by defining conditional BDM. Therefore we think that the best way to define joint BDM is from the chain rule.
Definition 10.
The joint BDM of and with respect to is defined as
Following the same path, we could define mutual BDM thus:
Definition 11.
The mutual BDM of and with respect to is defined as
9.1.1 The Relationship Between Conditional, Joint and Mutual Information
The results shown in this section are evidence that our Def. for conditional BDM is well behaved, as it satisfies properties analogous to important properties of conditional, joint and mutual entropy.
It is important to note that does not imply that . However, it does imply that . This is a consequence of the fact that BDM does not measure the information encoded in the position of the subtensors.
Proposition 14.
If and are independent with respect to the partition , this is equivalent to , then .
Proof.
It is a direct consequence of the Def. 5, given that we have it that ∎
Proposition 15.
.
Proof.
First, consider the equation
While on the other hand we have it that
Notice that in both equations we have the sum over all the pairs that are in both sets, and , with the difference being in the terms corresponding to the multiplicity. Now we have to consider two cases. If we have the equality. Otherwise, in the first equation we have terms of the form , which, by Def. of , is 0; analogously for the second equation. Therefore, we have the equality. ∎
Proposition 16.
.
Proof.
∎
9.2 Coarseness and Relationship With Entropy
As mentioned in the previous section, the goal behind the Def. of coarse conditional BDM, , is to measure the amount of information contained in not present in . Ideally, this is measured by the conditional algorithmic information . The Def. 5 includes the adjective coarse given that, as we will show in this section, its behaviour is closer to Shannon’s entropy than the algorithmic information measure , relying heavily on the entropylike behaviour of BDM.
The conditional algorithmic information content function is an incomputable function. Therefore it represents a theoretical ideal that cannot be reached in practice. By construction, coarse conditional BDM is an approximation to this measure. However it differs in not taking into account two information sources: the information content shared between base blocks and the position of each block.
As an example of the first limitation, consider the string and its negation . Intuitively, we know that both strings are algorithmically close, but for a partition strategy that divides the string into substrings of size 2 with no overlapping, the sets and are disjoint. Therefore conditional BDM assigns the maximum BDM value to the shared information content. Within this limitation, we argue that conditional BDM represents a better approximation to in comparison to entropy, mainly because BDM uses the CTM approximation value for each block, rather than just its distribution, and the information content of its multiplicity, thus representing a more accurate approximation to the overall algorithmic information content of the non-shared base blocks.
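The disjointness of the block sets can be checked directly on a small case. The string used below is a hypothetical stand-in chosen for illustration, not necessarily the one used in the example above, and the helper names are assumptions.

```python
def blocks(s, size):
    """Non-overlapping substrings of the given size, dropping any shorter remainder."""
    return [s[i:i + size] for i in range(0, len(s) - size + 1, size)]


def negate(s):
    """Flip every bit of a binary string."""
    return "".join("1" if c == "0" else "0" for c in s)


x = "10101010"            # hypothetical example string
y = negate(x)             # "01010101"
shared_size_2 = set(blocks(x, 2)) & set(blocks(y, 2))
shared_size_1 = set(blocks(x, 1)) & set(blocks(y, 1))
```

With blocks of size 2 the two strings share no base block at all, while with blocks of size 1 they share every block, which is why the choice of partition matters for the conditional measure.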
The second limitation can become a significant factor when the size of the base blocks is small when compared to that of the objects analysed, given that the positional information can become the dominant factor of the information content within an object. This is an issue shared with entropy that conditional BDM inherits from the numerical challenges of CTM in BDM. However, conditional BDM has the added benefit that it is defined for finite tensors generated from different distributions by assuming the so-called universal distribution ([38]) (known to dominate any other approach) as the underlying distribution between the two ‘events’.
9.2.1 Empirical Comparison with Entropy
Owing to the origins of the BDM function, the asymptotic relationship between coarse conditional BDM and conditional entropy follows from the relationship between BDM and entropy ([48]). In this section we will focus on empirical evidence for this relationship, along with exploring the impact of the partition strategy for unidimensional objects. Further theoretical properties that establish the well-behavedness of conditional BDM are set forth in Section 9.1.
For this numerical experiment we generated a sample of 19,000 random binary strings of length 20 that are pairwise related, coming from one of 19 biased distributions where the expected number of 1s varies from 1 to 19. For each pair we computed the conditional BDM with partitions of size 1 and divided it by the conditional BDM of the first string with respect to a random string coming from a uniform distribution. To both the divisor and the dividend we added 1 to avoid divisions by zero. We repeated the experiment for conditional entropy. Both results were normalized by dividing the quotients obtained by the maximum value obtained for each distribution. In Figure 11 we show the average obtained for each biased distribution.
From Figure 11 we can see that as the underlying distribution associated with the strings is increasingly biased, the expected shared information content of two related strings is higher (conditional BDM is lower) when compared to the conditional BDM of two unrelated strings. This behaviour is congruent with what we expect and observe for conditional entropy. That the area under the normalized curve is smaller is expected, given that BDM is a finer-graded information content measure than entropy and is not perfectly symmetric, as BDM and CTM are computational approximations to an uncomputable function and are also inherently more sensitive to the fundamental limits of computable random number generators.
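For the entropy side of this comparison, one simple way to estimate the conditional entropy of one string given another is to treat position-aligned symbols as joint observations and use H(X|Y) = H(X,Y) - H(Y). The sketch below assumes this estimator for partitions of size 1; the text does not state which estimator was used in the experiment, so this is an illustration only, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(symbols):
    """Empirical Shannon entropy, in bits, of a sequence of hashable symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def conditional_entropy(x, y):
    """Empirical H(X|Y) = H(X,Y) - H(Y) from position-aligned symbols (assumed estimator)."""
    if len(x) != len(y):
        raise ValueError("strings must have the same length")
    return entropy(list(zip(x, y))) - entropy(list(y))
```

Under this estimator a string conditioned on itself yields zero, while conditioning on a constant string leaves the entropy of the first string unchanged.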
9.3 The Impact of the Partition Strategy
As shown in previous results ([48]), BDM better approximates the universal measure as the number of elements resulting from applying the partition strategy to . However, this is not the case for conditional BDM. Instead is a good approximation to when the and share a high number of base tensors, and the probability of this occurring is lower in inverse proportion to the number of elements of the partition. For this reason we must point out that conditional BDM is dependent on the chosen partition strategy .
As a simple example, consider the binary string and its inverse . Since we have the CTM approximation for strings of size 8, the best BDM value for each string is found when and . However, given that the elements of the partitions are different, we have it that , even when intuitively we know that, algorithmic informationwise, they should be very close. However, conditional BDM is able to capture this with partitions of size 1 to 4 with no overlapping, assigning a value of 0 to .
We conjecture that there is no general strategy for finding a best partition strategy. This is an issue shared with conditional block entropy and with the original BDM definition. At its worst, conditional BDM will behave like conditional entropy when comparable, while maintaining best cases close to the ideal of conditional algorithmic complexity. Thus the partition strategy can be considered a hyperparameter that can be empirically optimized from the available data.
We performed a numerical experiment to observe this behaviour by generating 2,400,000 random binary strings of size 20 with groups of 600,000 strings belonging to one of four different distributions: uniform (ten 1s expected), biased 3/20 (three 1s expected), biased 1/4 (five 1s expected) and biased 7/20 (seven 1s expected). Then, we formed pairs of strings belonging to the same distribution and computed the conditional BDM using different partition sizes from 1 to 20, for a total of 30,000 pairs per data point, normalizing the result by dividing it by the partition size to avoid this factor being the dominant one. In Figure 12 we show the average obtained for each data point.
In Figure 12 we can observe two main behaviours. The first is that as the partition size increases so does the conditional BDM value. This is because bigger partitions take into account more information from the position of each bit, and we do not expect randomly generated strings to share positional information. The drop observed after partitions of size 12 is the result of CTM values being available up to strings of size 12, the point where the program starts to rely on BDM for the computation. Additionally, the partition strategy ignores smaller partitions than the ones stated, thereby reducing the overall amount of information taken into account. The second is that not only is conditional BDM able to capture the discrepancies expected from the different distributions for partition sizes where there is no loss of statistical information (this being from size 1 to 10), but it also seems to improve in its ability to do so with larger partition sizes up to 10, therefore improving upon the results presented in Section 9.2.1.
Note that a substantial reduction in accuracy for partitions of sizes larger than 10 was expected, given that the partition strategy used discarded substrings of smaller sizes than the ones stated. For instance, the partition of size 3 of the string 10111 is just 101, thus losing the information in the remaining bits. Furthermore, for big partition sizes with respect to the string length, the statistical similarity vanishes, given that now each substring is considered a different symbol of an alphabet. Therefore, the abrupt change of behaviour observed beyond partitions of size 15 is expected and is the product of causation.
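The information loss caused by discarding short trailing substrings can be made concrete with the example from the text. The sketch assumes a non-overlapping partition that starts at the first symbol and drops any remainder shorter than the block size; the function names are illustrative.

```python
def partition(s, size):
    """Non-overlapping blocks of exactly `size` symbols; a shorter tail is dropped."""
    return [s[i:i + size] for i in range(0, len(s) - size + 1, size)]


def discarded_tail(s, size):
    """Trailing symbols that do not fill a complete block."""
    return s[len(s) - len(s) % size:]


kept = partition("10111", 3)
lost = discarded_tail("10111", 3)

# For strings of length 20, the number of symbols ignored at each partition size.
ignored = {size: 20 % size for size in range(1, 21)}
```

For strings of length 20, every partition size from 11 to 19 yields a single block and ignores between 1 and 9 symbols, which is consistent with the reduced accuracy described above for large partition sizes.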