I Introduction
The ability to predict, analyze and parsimoniously model observations is not only central to the scientific endeavor, but also a goal of unsupervised machine learning, which is a key frontier in artificial intelligence (AI) research LeCun et al. (2015). Despite impressive recent progress, artificial neural nets are still frequently outmatched by human researchers at such modeling, suffering from two drawbacks:
1. Different parts of the data are often generated by different mechanisms in different contexts. A big model that tries to fit all the data in one environment may therefore underperform in a new environment where some mechanisms are replaced by new ones, being inflexible and inefficient at combinatorial generalization Battaglia et al. (2018).
2. Big models are generally hard to interpret, and may not reveal succinct and universal knowledge such as Newton’s law of gravitation that explains only some aspects of the data. The pursuit of “intelligible intelligence” in place of inscrutable black-box neural nets is important and timely, given the growing interest in AI interpretability from AI users and policymakers, especially for AI components involved in decisions and infrastructure where trust is important Russell et al. (2015); Amodei et al. (2016); Boden et al. (2017); Krakovna and Doshi-Velez (2016).
Strategy            Definition
Divide-and-conquer  Learn multiple theories, each of which specializes to fit part of the data very well
Occam’s razor       Avoid overfitting by minimizing description length, which can include replacing fitted constants by simple integers or fractions
Unification         Try unifying learned theories by introducing parameters
Lifelong learning   Remember learned solutions and try them on future problems
To address these challenges, we will borrow from physics the core idea of a theory, which parsimoniously predicts both aspects of the future (from past observations) and also the domain in which these predictions are accurate. This suggests an alternative to the standard machine-learning paradigm of fitting a single big model to all the data: instead, learning small theories one by one, and gradually accumulating and organizing them. This paradigm suggests the four specific approaches summarized in Table 1, which we combine into a simple “AI Physicist” learning agent:
• To find individual theories from complex observations, we use the divide-and-conquer strategy with multiple theories and a novel generalized-mean loss that encourages each theory to specialize in its own domain by giving larger gradients for better-performing theories.
• To find simple theories that avoid overfitting and generalize well, we use the strategy known as Occam’s razor, favoring simple theories that explain a lot, using a computationally efficient approximation of the minimum-description-length (MDL) formalism.
• To unify similar theories found in different environments, we use the description length for clustering and then learn a “master theory” for each class of theories.
• To accelerate future learning, we use a lifelong learning strategy where learned theories are stored in a theory hub for future use.
Our approach complements other work on automatic program learning, such as neural program synthesis/induction Graves et al. (2014); Sukhbaatar et al. (2015); Reed and De Freitas (2015); Parisotto et al. (2016); Devlin et al. (2017); Bramley et al. (2018) and symbolic program induction Muggleton (1991); Lavrac and Dzeroski (1994); Liang et al. (2010); Ellis et al. (2015); Dechter et al. (2013), and builds on prior machine-learning work on divide-and-conquer Cormen et al. (2009); Fürnkranz (1999); Ghosh et al. (2017), network simplification Rissanen (1978); Hassibi and Stork (1993); Suzuki et al. (2001); Grünwald et al. (2005); Han et al. (2015a, b) and continuous learning Kirkpatrick et al. (2017); Li and Hoiem (2017); Lopez-Paz et al. (2017); Nguyen et al. (2017). It is often said that babies are born scientists, and there is arguably evidence for use of all four of these strategies during childhood development as well Bramley et al. (2018).
The rest of this paper is organized as follows. In Section II, we introduce the architecture of our “AI Physicist” learning agent, and the algorithms implementing the four strategies. We present the results of our numerical experiments using a suite of physics environment benchmarks in Section III, and discuss our conclusions in Section IV, delegating supplementary technical details to a series of appendices.
II Methods
Unsupervised learning of regularities in time series can be viewed as a supervised learning problem of predicting the future from the past. This paper focuses on the task of predicting the next state vector $\mathbf{y}_t \in \mathbb{R}^d$ in a sequence from the concatenation $\mathbf{x}_t \equiv (\mathbf{y}_{t-T}, \ldots, \mathbf{y}_{t-1})$ of the last $T$ vectors. However, our AI Physicist formalism applies more generally to learning any function from examples. In the following we first define a theory, then introduce a unified AI Physicist architecture implementing the four aforementioned strategies.
II.1 Definition of Theory
A theory $\mathcal{T}$ is a 2-tuple $(\mathbf{f}, c)$, where $\mathbf{f}$ is a prediction function that predicts the target $\mathbf{y}_t$ when the input $\mathbf{x}_t$ is within the theory’s domain, and $c$ is a domain sub-classifier which takes $\mathbf{x}_t$ as input and outputs a logit of whether $\mathbf{x}_t$ is inside this domain. When multiple theories are present, the sub-classifiers’ outputs are concatenated and fed into a softmax function, producing probabilities for which theory is applicable. Both $\mathbf{f}$ and $c$ can be implemented by a neural net or symbolic formula, and can be set to learnable during training and fixed during prediction/validation.
This definition draws inspiration from physics theories (conditional statements), such as “a ball not touching anything (condition) with vertical velocity and height $(v_0, h_0)$ will a time $t$ later have $(v, h) = (v_0 - gt,\; h_0 + v_0 t - gt^2/2)$ (prediction function)”. For our AI Physicist, theories constitute its “atoms” of learning, as well as the building blocks for higher-level manipulations.
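The definition above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `Theory`, `domain_probabilities` and `predict`, and the two toy one-dimensional theories, are assumptions introduced here.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class Theory:
    """A theory as a 2-tuple (f, c); the class name is hypothetical."""
    f: Callable[[Sequence[float]], List[float]]  # prediction function
    c: Callable[[Sequence[float]], float]        # domain sub-classifier, returns a logit

def domain_probabilities(theories: Sequence[Theory], x: Sequence[float]) -> List[float]:
    """Concatenate the sub-classifier logits and apply a softmax."""
    logits = [th.c(x) for th in theories]
    top = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(theories: Sequence[Theory], x: Sequence[float]) -> List[float]:
    """Predict with the theory whose domain is most probable for x."""
    probs = domain_probabilities(theories, x)
    best = max(range(len(theories)), key=probs.__getitem__)
    return theories[best].f(x)

# Toy 1D example with x = (y_{t-2}, y_{t-1}): free motion for negative positions,
# constant acceleration for positive positions (both theories are made up).
free = Theory(f=lambda x: [2 * x[1] - x[0]], c=lambda x: -10.0 * x[1])
fall = Theory(f=lambda x: [2 * x[1] - x[0] - 0.1], c=lambda x: 10.0 * x[1])
```

Calling `predict([free, fall], [1.0, 2.0])` selects the second theory, because its logit is larger for positive positions.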
II.2 AI Physicist Architecture Overview
Figure 1 illustrates the architecture of the AI Physicist learning agent. At the center is a theory hub which stores the learned and organized theories. When encountering a new environment, the agent first inspects the hub and proposes old theories that help account for parts of the data as well as randomly initialized new theories for the rest of the data. All these theories are trained via our divide-and-conquer strategy, first jointly with our generalized-mean loss then separately to fine-tune each theory in its domain (section II.3). Successful theories along with the corresponding data are added to the theory hub.
The theory hub has two organizing strategies: (1) Applying Occam’s razor, it snaps the learned theories, in the form of neural nets, into simpler symbolic formulas (section II.4). (2) Applying unification, it clusters and unifies the symbolic theories into master theories (section II.5). The symbolic and master theories can be added back into the theory hub, improving theory proposals for new environments. The detailed AI Physicist algorithm is presented in a series of appendices.
II.3 Divide-and-Conquer
Conventionally, a function $\mathbf{f}$ mapping $\mathbf{x}_t \mapsto \mathbf{y}_t$ is learned by parameterizing $\mathbf{f}$ by some parameter vector $\boldsymbol{\theta}$ that is adjusted to minimize a loss (empirical risk)
$$\mathcal{L} \equiv \sum_t \ell[\mathbf{f}(\mathbf{x}_t), \mathbf{y}_t], \qquad (1)$$
where $\ell$ is some non-negative distance function quantifying how far each prediction is from the target, typically satisfying $\ell(\mathbf{y}, \mathbf{y}) = 0$. In contrast, a physicist observing an unfamiliar environment typically does not try to predict everything with one model, instead starting with an easier question: is there any part or aspect of the world that can be described? For example, when Galileo famously tried to model the motion of swinging lamps in the Pisa cathedral, he completely ignored everything else, and made no attempts to simultaneously predict the behavior of sound waves, light rays, weather, or subatomic particles. In this spirit, we allow multiple competing theories $\mathcal{T}_i = (\mathbf{f}_i, c_i)$, $i = 1, 2, \ldots, M$, to specialize in different domains, with a novel generalized-mean loss
$$\mathcal{L}_\gamma \equiv \sum_t \left( \frac{1}{M} \sum_{i=1}^{M} \ell[\mathbf{f}_i(\mathbf{x}_t), \mathbf{y}_t]^{\gamma} \right)^{1/\gamma}. \qquad (2)$$
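When $\gamma < 0$, the generalized-mean loss $\mathcal{L}_\gamma$ of equation (2) will be dominated by whichever prediction function $\mathbf{f}_i$ fits each data point best. This dominance is controlled by $\gamma$, with $\mathcal{L}_\gamma \to \sum_t \min_i \ell[\mathbf{f}_i(\mathbf{x}_t), \mathbf{y}_t]$ in the limit where $\gamma \to -\infty$. This means that the best way to minimize $\mathcal{L}_\gamma$ is for each $\mathbf{f}_i$ to specialize by further improving its accuracy for the data points where it already outperforms the other theories. The following Theorem 1 formalizes the above intuition, stating that under mild conditions for the loss function $\ell$, the generalized-mean loss gives larger gradient w.r.t. the error for theories that perform better, so that a gradient-descent loss minimization encourages specialization.
Theorem 1
Let $\hat{\mathbf{y}}_t^{(i)} \equiv \mathbf{f}_i(\mathbf{x}_t)$ denote the prediction of the target $\mathbf{y}_t$ by the function $\mathbf{f}_i$, $i = 1, 2, \ldots, M$. Suppose that $\gamma < 0$ and $\ell(\hat{\mathbf{y}}_t, \mathbf{y}_t) = \ell(|\hat{\mathbf{y}}_t - \mathbf{y}_t|)$ for a monotonically increasing function $\ell(u)$ that vanishes on $[0, u_0]$ for some $u_0 \ge 0$, with $\ell(u)^\gamma$ differentiable and strictly convex for $u > u_0$. Then if $0 < \ell(\hat{\mathbf{y}}_t^{(i)}, \mathbf{y}_t) < \ell(\hat{\mathbf{y}}_t^{(j)}, \mathbf{y}_t)$, we have
$$\left| \frac{\partial \mathcal{L}_\gamma}{\partial u_t^{(i)}} \right| > \left| \frac{\partial \mathcal{L}_\gamma}{\partial u_t^{(j)}} \right|, \qquad (3)$$
where $u_t^{(i)} \equiv |\hat{\mathbf{y}}_t^{(i)} - \mathbf{y}_t|$.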
Appendix F gives the proof, and also shows that this theorem applies to mean-squared-error (MSE) loss $\ell(u) = u^2$, mean-absolute-error loss $\ell(u) = |u|$, Huber loss and our description-length loss from the next section.
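To make the specialization effect concrete, the following minimal sketch evaluates a generalized-mean loss over per-theory losses and its analytic gradient with respect to each theory's loss. The function names and the small `eps` guard against zero losses are assumptions made for this illustration. Note that Theorem 1 concerns the gradient with respect to the error, whereas this sketch differentiates with respect to the loss itself.

```python
def generalized_mean_loss(losses_per_example, gamma=-1.0, eps=1e-12):
    """Sum over examples of the generalized mean (exponent gamma) of the
    per-theory losses; losses_per_example[t][i] is theory i's loss on example t."""
    total = 0.0
    for losses in losses_per_example:
        m = len(losses)
        # eps avoids raising zero to a negative power
        mean_pow = sum((l + eps) ** gamma for l in losses) / m
        total += mean_pow ** (1.0 / gamma)
    return total

def grad_wrt_losses(losses, gamma=-1.0, eps=1e-12):
    """Analytic gradient of one example's generalized mean w.r.t. each theory's loss:
    d/dl_i = mean_pow**(1/gamma - 1) * l_i**(gamma - 1) / M."""
    m = len(losses)
    mean_pow = sum((l + eps) ** gamma for l in losses) / m
    outer = mean_pow ** (1.0 / gamma - 1.0) / m
    return [outer * (l + eps) ** (gamma - 1.0) for l in losses]

# Theory 0 fits this example ten times better than theory 1.
example = [0.1, 1.0]
harmonic = generalized_mean_loss([example], gamma=-1.0)   # harmonic mean of the two losses
grads = grad_wrt_losses(example, gamma=-1.0)              # larger for the better theory
near_min = generalized_mean_loss([example], gamma=-50.0)  # approaches min(example)
```

With $\gamma = -1$ the better-fitting theory receives the larger gradient, and for strongly negative $\gamma$ the loss approaches the smallest per-theory loss.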
We find empirically that the simple choice $\gamma = -1$ works quite well, striking a good balance between encouraging specialization for the best theory and also giving some gradient for theories that currently perform slightly worse. We term this choice $\mathcal{L}_{-1}$ the “harmonic loss”, because it corresponds to the harmonic mean of the losses for the different theories. Based on the harmonic loss, we propose an unsupervised differentiable divide-and-conquer (DDAC) algorithm (Alg. 2 in Appendix B) that simultaneously learns prediction functions $\{\mathbf{f}_i\}$ and corresponding domain classifiers $\{c_i\}$ from observations.
II.4 Occam’s Razor
The principle of Occam’s razor, that simpler explanations are better, is quite popular among physicists. This preference for parsimony helped dispense with phlogiston, aether and other superfluous concepts.
Our method therefore incorporates the minimum-description-length (MDL) formalism Rissanen (1978); Grünwald et al. (2005), which provides an elegant mathematical implementation of Occam’s razor. It is rooted in Solomonoff’s theory of inference Solomonoff (1964) and is linked to Hutter’s AIXI approach to artificial general intelligence Hutter (2000). The description length (DL) of a dataset $D$ is defined as the number of bits required to describe it. For example, if regularities are discovered that enable data compression, then the corresponding description length is defined as the number of bits of the program that produces $D$ as its output (including both the code bits and the compressed data bits). In our context of predicting a time series, this means that the description length is the number of bits required to describe the theories used plus the number of bits required to store all prediction errors. Finding the optimal data compression and hence computing the MDL is a famous hard problem that involves searching an exponentially large space, but any discovery reducing the description length is a step in the right direction, and provably avoids the overfitting problem that plagues many alternative machine-learning strategies Rissanen (1978); Grünwald et al. (2005).
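The end goal of the AI Physicist is to discover theories $\mathcal{T}$ minimizing the total description length, given by
$$\mathrm{DL}(\mathcal{T}, D) = \mathrm{DL}(\mathcal{T}) + \sum_t \mathrm{DL}(\mathbf{u}_t), \qquad (4)$$
where $\mathbf{u}_t \equiv \hat{\mathbf{y}}_t - \mathbf{y}_t$ is the prediction error at time step $t$. By discovering simple theories that can each account for parts of the data very well, the AI Physicist strives to make both $\mathrm{DL}(\mathcal{T})$ and $\sum_t \mathrm{DL}(\mathbf{u}_t)$ small.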
Physics has enjoyed great success in its pursuit of simpler theories using rather vague definitions of simplicity. In this spirit, we choose to compute the description length DL not exactly, but using an approximate heuristic that is numerically efficient, and significantly simpler than more precise versions such as Rissanen (1983), paying special attention to rational numbers since they appear in many physics theories. We compute the DL of both theories and prediction errors as the sum of the DL of all numbers that specify them, using the following conventions for the DL of integers, rational numbers and real numbers. Our MDL implementation differs from popular machine-learning approaches whose goal is efficiency and generalizability Hinton and van Camp (1993); Han et al. (2015a); Blier and Ollivier (2018) rather than intelligibility.
The number of binary digits required to specify a natural number $n = 1, 2, 3, \ldots$ is approximately $\log_2 n$, so we define $\mathrm{DL}(n) \equiv \log_2 n$ for natural numbers. For an integer $m$, we define
$$\mathrm{DL}(m) \equiv \log_2(1 + |m|). \qquad (5)$$
For a rational number $q = m/n$, the description length is the sum of that for its integer numerator and (natural number) denominator, as illustrated in Figure 2:
$$\mathrm{DL}(m/n) \equiv \log_2[(1 + |m|)\, n]. \qquad (6)$$
For a real number $r$ and a numerical precision floor $\epsilon$, we define
$$\mathrm{DL}(r) \equiv \log_+(r/\epsilon), \qquad (7)$$
where the function
$$\log_+(x) \equiv \tfrac{1}{2} \log_2(1 + x^2) \qquad (8)$$
is plotted in Figure 2. Since $\log_+(x) \approx \log_2 x$ for $x \gg 1$, $\mathrm{DL}(r)$ is approximately the description length of the integer closest to $r/\epsilon$. Since $\log_+(x) \propto x^2$ for $x \ll 1$, $\mathrm{DL}(r)$ simplifies to a quadratic (mean-squared-error) loss function below the numerical precision, which will prove useful below.
Footnote 1: Natural alternative definitions of $\log_+(x)$ include $\log_2(1 + |x|)$, $\log_2 \max(1, |x|)$, $(\ln 2)^{-1} \sinh^{-1}|x|$ and $(2 \ln 2)^{-1} \sinh^{-1}(x^2)$. Unless otherwise specified, we choose $\epsilon = 2^{-32}$ in our experiments.
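The conventions above are easy to state in code. The sketch below assumes the forms $\mathrm{DL}(n) = \log_2 n$, $\mathrm{DL}(m) = \log_2(1+|m|)$, $\mathrm{DL}(m/n) = \log_2[(1+|m|)n]$ and $\log_+(x) = \tfrac{1}{2}\log_2(1+x^2)$; the function names and the default precision floor are illustrative assumptions rather than the authors' code.

```python
import math
from fractions import Fraction

def dl_natural(n: int) -> float:
    """Description length of a natural number n >= 1 (assumed form log2 n)."""
    return math.log2(n)

def dl_integer(m: int) -> float:
    """Description length of an integer (assumed form log2(1 + |m|))."""
    return math.log2(1 + abs(m))

def dl_rational(q: Fraction) -> float:
    """Sum of the DL of the integer numerator and the natural-number denominator."""
    return dl_integer(q.numerator) + dl_natural(q.denominator)

def log_plus(x: float) -> float:
    """Smooth stand-in for log2|x|: about log2|x| for |x| >> 1, quadratic for |x| << 1."""
    return 0.5 * math.log2(1.0 + x * x)

def dl_real(r: float, eps: float = 2.0 ** -32) -> float:
    """Description length of a real number at precision floor eps (default is an assumption)."""
    return log_plus(r / eps)

three_halves_bits = dl_rational(Fraction(3, 2))   # log2(4) + log2(2)
```

Under these assumptions a simple fraction such as 3/2 costs only a few bits, whereas a generic real number of order one costs roughly $\log_2(1/\epsilon)$ bits.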
Note that as long as all prediction absolute errors $|u_i| \gg \epsilon$ for some dataset, minimizing the total description length $\sum_i \mathrm{DL}(u_i)$ instead of the MSE $\sum_i u_i^2$ corresponds to minimizing the geometric mean instead of the arithmetic mean of the squared errors, which encourages focusing more on improving already well-fit points. The sum $\sum_i \mathrm{DL}(u_i)$ drops by 1 bit whenever one prediction error is halved, which can typically be achieved by fine-tuning the fit for many valid data points that are already well predicted while increasing DL for bad or extraneous points at most marginally.
For numerical efficiency, our AI Physicist minimizes the description length of equation (4) in two steps: 1) All model parameters are set to trainable real numbers, and the DDAC algorithm is applied to minimize the harmonic loss $\mathcal{L}_{-1}$ with $\ell(\mathbf{u}) \equiv \sum_i \mathrm{DL}(u_i)$ using equation (7) and the annealing procedure for the precision floor described in Appendix B. 2) Some model parameters are replaced by rational numbers as described below, followed by re-optimization of the other parameters. The idea behind the second step is that if a physics experiment or neural net training produces a parameter value very close to a simple rational number, it would be natural to interpret this as a hint, and to check if that rational number gives an equally acceptable fit to the data, reducing total DL. We implement step 2 using continued fraction expansion as described in Appendix C and illustrated in Figure 3.
II.5 Unification
Physicists aspire not only to find simple theories that explain aspects of the world accurately, but also to discover underlying similarities between theories and unify them. For example, when James Clerk Maxwell corrected and unified four key formulas describing electricity and magnetism into his eponymous equations ($dF = 0$, $d{\star}F = J$ in differential form notation), he revealed the nature of light and enabled the era of wireless communication.
Here we make a humble attempt to automate part of this process. The goal of the unification is to output a master theory $\mathcal{T} = \{(\mathbf{f}_{\mathbf{p}}, \cdot)\}$, such that varying the parameter vector $\mathbf{p}$ can generate a continuum of theories including previously discovered ones. For example, Newton’s law of gravitation can be viewed as a master theory unifying the gravitational force formulas around different planets by introducing a parameter corresponding to planet mass. Einstein’s special relativity can be viewed as a master theory unifying the approximate formulas for $v \ll c$ and $v \approx c$ motion, where $c$ here denotes the speed of light.
We perform unification by first computing the description length $\mathrm{dl}^{(i)}$ of the prediction function $\mathbf{f}_i$ (in symbolic form) for each theory $i$ and performing clustering on $\{\mathrm{dl}^{(i)}\}$. Unification is then achieved by discovering similarities and variations between the symbolic formulas in each cluster, retaining the similar patterns, and introducing parameters in place of the parameters that vary, as detailed in Appendix D.
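As a rough illustration of the two steps just described, the sketch below clusters scalar description lengths by gaps and then replaces coefficients that vary within a cluster by named parameters. It is a simplified stand-in, not the Appendix D algorithm: the gap-based clustering, the representation of formulas as coefficient tuples of identical shape, and all names are assumptions.

```python
def cluster_by_gap(values, max_gap):
    """1D clustering of a non-empty list: sort the values and start a new
    cluster whenever two neighbours differ by more than max_gap.
    Returns clusters as lists of indices into `values`."""
    order = sorted(range(len(values)), key=values.__getitem__)
    clusters = [[order[0]]]
    for prev, cur in zip(order, order[1:]):
        if values[cur] - values[prev] > max_gap:
            clusters.append([])
        clusters[-1].append(cur)
    return clusters

def unify(coefficient_rows, tol=1e-9):
    """Keep coefficients shared by all formulas in a cluster and replace
    those that vary by parameter names 'p0', 'p1', ..."""
    template, k = [], 0
    for column in zip(*coefficient_rows):
        if max(column) - min(column) <= tol:
            template.append(column[0])
        else:
            template.append(f"p{k}")
            k += 1
    return template

# Made-up description lengths for five theories, forming two clusters.
dls = [50.1, 120.0, 52.3, 118.5, 49.0]
clusters = cluster_by_gap(dls, max_gap=10.0)

# Made-up formulas y_{t+2} = a*y_{t+1} + b*y_t + g, stored as (a, b, g).
rows = [(2, -1, -0.1), (2, -1, -0.3), (2, -1, -0.2)]
master = unify(rows)
```

Here the three formulas share the coefficients 2 and −1 and differ only in the constant term, so the sketch yields a one-parameter master formula.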
II.6 Lifelong Learning
Isaac Newton once said “If I have seen further it is by standing on the shoulders of giants”, emphasizing the utility of building on past discoveries. At a more basic level, our past experiences enable us humans to model new environments much faster than if we had to reacquire all our knowledge from scratch. We therefore embed a lifelong-learning strategy into the architecture of the AI Physicist. As shown in Fig. 1 and Alg. 1, the theory hub stores successfully learned theories, organizes them with our Occam’s razor and unification algorithms (reminiscent of what humans do while dreaming and reflecting), and when encountering new environments, uses its accumulated knowledge to propose new theories that can explain parts of the data. This both ensures that past experiences are not forgotten and enables faster learning in novel environments. The detailed algorithms for proposing and adding theories are in Appendix E.
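A theory hub of this kind can be sketched as a small container that scores stored theories on new data and proposes the ones that already explain a useful fraction of it. This is a hedged illustration only: the class `TheoryHub`, its methods, and the threshold parameters are hypothetical and do not reproduce the Appendix E algorithms.

```python
class TheoryHub:
    """Stores learned prediction functions and proposes them for new data (illustrative)."""

    def __init__(self):
        self.theories = []  # list of (name, prediction function) pairs

    def add(self, name, f):
        self.theories.append((name, f))

    def propose(self, xs, ys, max_proposals, mse_threshold, min_fraction):
        """Return names of stored theories that fit at least `min_fraction`
        of the examples with squared error below `mse_threshold`,
        best-covering theories first."""
        scored = []
        for name, f in self.theories:
            covered = sum(1 for x, y in zip(xs, ys) if (f(x) - y) ** 2 < mse_threshold)
            fraction = covered / len(xs)
            if fraction >= min_fraction:
                scored.append((fraction, name))
        scored.sort(reverse=True)
        return [name for _, name in scored[:max_proposals]]

hub = TheoryHub()
hub.add("free", lambda x: 2 * x[1] - x[0])         # made-up stored theories
hub.add("fall", lambda x: 2 * x[1] - x[0] - 0.1)

# A new environment in which three of four points follow the "fall" law.
xs = [(0, 1), (1, 2), (2, 3), (3, 4)]
ys = [1.9, 2.9, 3.9, 5.0]
proposals = hub.propose(xs, ys, max_proposals=2, mse_threshold=1e-6, min_fraction=0.2)
```

Proposed theories would then be trained alongside randomly initialized new ones, so the agent does not start from scratch.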
III Results of Numerical Experiments
III.1 Physics Environments
We test our algorithms on two suites of benchmarks, each with increasing complexity. In all cases, the goal is to predict the two-dimensional motion as accurately as possible. One suite involves chaotic and highly nonlinear motion of a charged double pendulum in two adjacent electric fields. The other suite involves balls affected by gravity, electromagnetic fields, springs and bounce boundaries, as exemplified in Figure 4. Within each spatial region, the force corresponds to a potential energy function $V \propto (ax + by + c)^n$ for some constants $a$, $b$, $c$, where $n = 0$ (no force), $n = 1$ (uniform electric or gravitational field), $n = 2$ (spring obeying Hooke’s law) or $n = \infty$ (ideal elastic bounce), and optionally also involves a uniform magnetic field. The environments are summarized in Table 4.
III.2 Numerical Results
In the mystery world example of Figure 4, after running the DDAC algorithm (Alg. 2) with the sequence of coordinates as the only input, we see that the AI Physicist has learned to simultaneously predict the future position of the ball from the previous two, and classify without external supervision the observed inputs into four big physics domains. The predictions are seen to be more accurate deep inside the domains (tiny dots) than near boundaries (larger dots) where transitions and bounces create small domains with laws of motion that are harder to infer because of complexity and limited data. Because these small domains can be automatically inferred and eliminated once the large ones are known, as described in Appendix G, all accuracy benchmarks quoted below refer to points in the large domains only.
After DDAC, the AI Physicist performs Occam’srazorwithMDL (Alg. 3) on the learned theories. As an example, it discovers that the motion deep inside the lowerleft quadrant obeys the difference equation parameterized by a learned 3layer neural net, which after the first collapseLayer transformation simplifies to
(9)  
with and . The snapping stage thereafter simplifies to
(10) 
which has lower description length in both model bits () and data bits () and gets transformed to the symbolic expressions
(11) 
where we have written the 2D position vector for brevity. During unification (Alg. D), the AI Physicist discovers multiple clusters of theories based on the DL of each theory, where one cluster has DL ranging between 48.86 and 55.63, which it unifies into a master theory with
(12) 
effectively discovering a “gravity” master theory out of the different types of environments it encounters. If so desired, the difference equations (III.2) can be automatically generalized to the more familiarlooking differential equations
where , and both the Harmonic Oscillator Equation and Lorentz Force Law of electromagnetism can be analogously autoinferred from other master theories learned.
Many mystery domains in our test suite involve laws of motion whose parameters include both rational and irrational numbers. To count a domain as “solved” below, we use the very stringent requirement that any rational numbers (including integers) must be discovered exactly, while irrational numbers must be recovered with accuracy .
We apply our AI Physicist to 40 mystery worlds in sequence (Appendix H). After this training, we apply it to a suite of 40 additional worlds to test how it learns from different numbers of examples. The results are shown in Tables 3 and 4, and Table 2 summarizes these results using the median over worlds. For comparison, we also show results for two simpler agents with similar parameter count: a “baseline” agent consisting of a three-layer feedforward MSE-minimizing leaky-ReLU network and a “newborn” AI Physicist that has not seen any past examples and therefore cannot benefit from the lifelong learning strategy.
We see that the newborn agent outperforms baseline on all the tabulated measures, and that the AI Physicist does still better. Using all data, the Newborn agent and AI Physicist are able to predict with meansquared prediction error below , more than nine orders of magnitude below baseline. Moreover, the Newborn and AI Physicist agents are able to simultaneously learn the domain classifiers with essentially perfect accuracy, without external supervision. Both agents are able to solve above 90% of all the 40 mystery worlds according to our stringent criteria.
The main advantage of the AI Physicist over the Newborn agent is seen to be its learning speed, attaining given accuracy levels faster, especially during the early stage of learning. Remarkably, for the subsequent 40 worlds, the AI Physicist reaches 0.01 MSE within 35 epochs using as little as 1% of the data, performing almost as well as with 50% of the data and much better than the Newborn agent. This illustrates that the lifelong learning strategy enables the AI Physicist to learn much faster in novel environments with less data, much as an experienced scientist can solve new problems far faster than a beginner by building on prior knowledge about similar problems.
Benchmark                           Baseline   Newborn   AI Physicist
$\log_{10}$ mean-squared error      −3.89      −13.95    −13.88
Classification accuracy             67.56%     100.00%   100.00%
Fraction of worlds solved           0.00%      90.00%    92.50%
Description length for f            11,338.7   198.9     198.9
Epochs until $10^{-2}$ MSE          95         83        15
Epochs until $10^{-4}$ MSE          6925       330       45
Epochs until $10^{-6}$ MSE          $\infty$   5403      3895
Epochs until $10^{-8}$ MSE          $\infty$   6590      5100
$\log_{10}$ MSE
  using 100% of data                −3.78      −13.89    −13.89
  using 50% of data                 −3.84      −13.76    −13.81
  using 10% of data                 −3.16      −7.38     −10.54
  using 5% of data                  −3.06      −6.06     −6.20
  using 1% of data                  −2.46      −3.69     −3.95
Epochs until $10^{-2}$ MSE
  using 100% of data                95         80        15
  using 50% of data                 190        152.5     30
  using 10% of data                 195        162.5     30
  using 5% of data                  205        165       30
  using 1% of data                  397.5      235       35
Our double-pendulum mysteries (Appendix H.2) are more challenging for all the agents, because the motion is more nonlinear and indeed chaotic. Although none of our double-pendulum mysteries get exactly solved according to our very stringent above-mentioned criterion, Figure 7 illustrates that the Newborn agent does a good job: it discovers the two domains and classifies points into them with an accuracy of 96.5%. Overall, the Newborn agent has a median best accuracy of 91.0% compared with the baseline of 76.9%. The MSE prediction error is comparable to the baseline performance in the median, since both architectures have similarly large capacity. We analyze this challenge and opportunities for improvement below.
IV Conclusions
We have presented a simple “AI Physicist” unsupervised learning agent centered around the learning and manipulation of theories, which parsimoniously predict both aspects of the future (from past observations) and the domain in which these predictions are accurate. Testing it on a suite of mystery worlds involving random combinations of gravity, electromagnetism, harmonic motion and elastic bounces, we found that its divide-and-conquer and Occam’s razor strategies effectively identified domains with different laws of motion and reduced the mean-squared prediction error billionfold, typically recovering integer and rational theory parameters exactly. These two strategies both encouraged prediction functions to specialize: the former on the domains they handled best, and the latter on the data points within their domain that they handled best. Adding the lifelong-learning strategy greatly accelerated learning in novel environments.
Our work suggests many opportunities for improvement of the implementation. The more modest success in the double-pendulum experiments illustrated the value of learned theories being simple: if they are highly complex, they are less likely to unify or generalize to future environments, and the correspondingly complex baseline model will have enough expressive power to approximate the motion in all domains at once. It will be valuable to improve techniques for simplifying complex learned neural nets (reducing their total description length from equation (4)), for example by using Monte-Carlo-Markov-Chain-based and genetic techniques Real et al. (2017); Zoph and Le (2016); Baker et al. (2016) and the analytic regression literature Schmidt and Lipson (2009) to simplify and shrink the model architecture. It will also be interesting to generalize our implementation to simplify not only the prediction functions, but also the classifiers, for example to find sharp domain boundaries composed of hyperplanes or other simple surfaces. These and other improvements to the algorithms that implement our AI Physicist paradigm could enable future unsupervised learning agents to learn simpler and more accurate models faster from fewer examples.
Acknowledgements: This work was supported by The Casey and Family Foundation, the Ethics and Governance of AI Fund, the Foundational Questions Institute and the Rothberg Family Fund for Cognitive Science. We thank Isaac Chuang, John Peurifoy and Marin Soljačić for helpful discussions and suggestions, and the Center for Brains, Minds, and Machines (CBMM) for hospitality.
Appendix A AI Physicist Algorithm
The detailed AI Physicist algorithm is presented in Algorithm 1, with links to each of the individual subalgorithms.
Appendix B The Differentiable Divide-and-Conquer (DDAC) Algorithm
Here we elaborate on our differentiable divide-and-conquer (DDAC) algorithm with generalized-mean loss $\mathcal{L}_\gamma$ (Eq. (2)). This loss with $\gamma < 0$ works with a broad range of distance functions $\ell$ satisfying Theorem 1. Since the goal of our AI Physicist is to minimize the overall description length (DL) from equation (4), we choose $\ell$ to be the DL loss function of equation (7) together with $\gamma = -1$ (harmonic loss), which works quite well in practice.
Algorithm 2 describes our differentiable divide-and-conquer implementation, which consists of two stages. In the first stage (steps 2-6), it applies the subroutine IterativeTrain with harmonic loss to train the theories a few times with the precision floor $\epsilon$ gradually lowered according to the following annealing schedule. We set the initial precision floor $\epsilon$ to be quite large so that the DL loss initially approximates an MSE loss function. After each successive iteration, we reset $\epsilon$ to the median prediction error.
The DL loss function from equation (7) is theoretically desirable but tricky to train, both because it is non-convex and because it is quite flat and uninformative far from its minimum. Our annealing schedule helps overcome both problems: initially when $\epsilon$ is large, the DL loss approximates MSE loss, which is convex and guides the training to a good approximate minimum, which further training accurately pinpoints as $\epsilon$ is reduced.
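The annealing schedule described above can be sketched as a short loop. The driver below is only an illustration: `train_round` is a hypothetical callback standing in for one call to IterativeTrain at a given precision floor, and the fake trainer simply shrinks its errors tenfold per round so that the schedule can be observed. The DL loss is assumed to have the form $\tfrac{1}{2}\log_2(1 + (u/\epsilon)^2)$.

```python
import math
import statistics

def dl_loss(u, eps):
    """Assumed DL loss of a scalar error u at precision floor eps."""
    return 0.5 * math.log2(1.0 + (u / eps) ** 2)

def next_precision_floor(errors):
    """Reset the precision floor to the median absolute prediction error."""
    return statistics.median(abs(u) for u in errors)

def anneal(train_round, eps0, rounds):
    """Run `rounds` training rounds and return the precision floor used in each."""
    eps, history = eps0, []
    for _ in range(rounds):
        history.append(eps)
        errors = train_round(eps)          # hypothetical: train at floor eps, return errors
        eps = next_precision_floor(errors)
    return history

def make_fake_trainer(initial_errors):
    """Stand-in for real training: every round shrinks all errors by a factor of 10."""
    state = {"errors": list(initial_errors)}
    def train_round(eps):
        state["errors"] = [0.1 * u for u in state["errors"]]
        return state["errors"]
    return train_round

history = anneal(make_fake_trainer([1.0, -2.0, 3.0]), eps0=10.0, rounds=3)
```

For errors far below the floor, `dl_loss(u, eps)` is close to $u^2 / (2\epsilon^2 \ln 2)$, which is the MSE-like regime the schedule starts in.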
The subroutine IterativeTrain forms the core of the algorithm. In the first stage (steps 2-6), it uses the harmonic mean of the DL loss of multiple prediction functions $\mathbf{f}_i$ (i.e., equation (2) with $\gamma = -1$ and $\ell = \mathrm{DL}$) to simultaneously train these functions, encouraging them to each specialize in the domains where they predict best (as proven by Theorem 1), and simultaneously trains the domain classifier using each example’s best-performing prediction function as target, with categorical cross-entropy loss. After several rounds of IterativeTrain with successively lower precision floors, each prediction function typically becomes good at predicting part of the dataset, and the domain classifier becomes good at predicting for each example which prediction function will predict best.
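The classifier target construction described above amounts to an argmin followed by a standard cross-entropy. The sketch below shows this on plain Python lists; the function names and the toy numbers are assumptions for illustration, and a real implementation would backpropagate through the classifier logits.

```python
import math

def best_theory_targets(losses_per_example):
    """For each example, the index of the prediction function with the smallest loss."""
    return [min(range(len(losses)), key=losses.__getitem__)
            for losses in losses_per_example]

def cross_entropy(logits_per_example, targets):
    """Mean categorical cross-entropy between softmax(logits) and integer targets."""
    total = 0.0
    for logits, k in zip(logits_per_example, targets):
        top = max(logits)
        log_z = top + math.log(sum(math.exp(z - top) for z in logits))  # log-sum-exp
        total += log_z - logits[k]
    return total / len(targets)

# Two examples, two theories: theory 0 wins on the first, theory 1 on the second.
losses = [[0.1, 2.0], [3.0, 0.5]]
targets = best_theory_targets(losses)
uninformed = cross_entropy([[0.0, 0.0], [0.0, 0.0]], targets)       # equals ln 2
confident = cross_entropy([[10.0, -10.0], [-10.0, 10.0]], targets)  # close to 0
```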
AddTheories() inspects each theory describing at least a large fraction (we use ) of the examples to see if a nonnegligible proportion of examples (we use a threshold of ) of the examples inside its domain have MSE larger than a certain limit (we use ). If so, it uses those examples to initialize a new theory , and performs tentative training together with other theories using IterativeTrain without steps s8 and s9 (it is also possible to allow steps s8 and s9 in this recursive calling of IterativeTrain, which will enable a recursive adding of theories for notwellexplained data, and may enable a more powerful DDAC algorithm). If the resulting loss is smaller than before adding the new theory, is accepted and retained, otherwise it is rejected and training reverts to the checkpoint before adding the theory. DeleteTheories() deletes theories whose domain or bestpredicted examples cover a negligible fraction of the examples (we use a delete threshold of ) .
In the second stage (steps 7-10), IterativeTrain is applied again, but the loss for each example is computed using only the theory that the domain classifier predicts (the one having the largest logit). In this way, we iteratively fine-tune each prediction function within its domain, and fine-tune each domain to the best-performing theory at each point. The reason that we assign examples to domains using our domain classifier rather than prediction accuracy is that the trained domains are likely to be simpler and more contiguous, thus generalizing better to unseen examples than, e.g., the nearest-neighbor algorithm.
We now specify the default hyperparameters used for Algorithm 1 in our experiments (unless otherwise specified). We set the initial total number of theories , from which theories are proposed from the theory hub. The initial precision floor and the number of gradient iterations . We use the Adam Kingma and Ba (2014) optimizer with default parameters for the optimization of both the prediction function and the domain classifier. We randomly split each dataset into a training set and a held-out set with 4:1 ratio. The held-out set is used only for evaluation of performance. The batch size is set to min(2000, ). We set the initial learning rate for the prediction functions and for the domain classifier . We also use a learning rate scheduler that monitors the validation loss every 10 epochs, and divides the learning rate by 10 if the validation loss has failed to decrease after 40 monitoring points and stops training early if there is no decrease after 200 epochs, or if the entire MSE loss for all the theories in their respective domains drops below .
To the main harmonic loss , we add two regularization terms. One is an $L_1$ loss whose strength increases quadratically from 0 to (to in the double-pendulum experiment) during the first 5000 epochs and remains constant thereafter. The second regularization term is a very small MSE loss of strength , to encourage the prediction functions to remain not too far away from the target outside their domain.
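The quadratic ramp of the regularization strength can be written down directly. In the sketch below, `max_strength` is a placeholder for the final strength (its value is not assumed here), the penalty is taken to be an $L_1$ norm of the parameters, and the function names are hypothetical.

```python
def ramp_strength(epoch, max_strength, ramp_epochs=5000):
    """Strength that rises quadratically from 0 to max_strength over the
    first `ramp_epochs` epochs and stays constant afterwards."""
    fraction = min(max(epoch, 0) / ramp_epochs, 1.0)
    return max_strength * fraction ** 2

def regularized_loss(main_loss, params, epoch, max_strength):
    """Main loss plus an L1 penalty on the parameters with ramped strength."""
    l1 = sum(abs(p) for p in params)
    return main_loss + ramp_strength(epoch, max_strength) * l1

halfway = ramp_strength(2500, max_strength=1.0)   # a quarter of the final strength
```

Starting the penalty at zero lets the theories fit the data first, with the pressure toward small parameters growing only once the fit has had time to settle.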
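Appendix C Occam’s Razor with MDL Algorithm
Pushing on after the DDAC algorithm with harmonic loss that minimizes the $\sum_t \mathrm{DL}(\mathbf{u}_t)$ term in Eq. (4), the AI Physicist then strives to minimize the $\mathrm{DL}(\mathcal{T})$ term, which can be decomposed as $\mathrm{DL}(\mathcal{T}) = \mathrm{DL}(\mathbf{f}) + \mathrm{DL}(\mathbf{c})$, where $\mathbf{f} = (\mathbf{f}_1, \ldots, \mathbf{f}_M)$ and $\mathbf{c} = (c_1, \ldots, c_M)$. We focus on minimizing $\mathrm{DL}(\mathbf{f})$, since in different environments the prediction functions can often be reused, while the domains may differ. As mentioned, we define $\mathrm{DL}(\mathbf{f})$ simply as the sum of the description lengths of the numbers $p_j$ parameterizing $\mathbf{f}$:
$$\mathrm{DL}(\mathbf{f}) = \sum_j \mathrm{DL}(p_j). \qquad (13)$$
This means that $\mathrm{DL}(\mathbf{f})$ can be significantly reduced if an irrational parameter is replaced by a simpler rational number.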
If a physics experiment or neural net training produces a parameter , it would be natural to interpret this as a hint, and to check if gives an equally acceptable fit to the data. We formalize this by replacing any realvalued parameter in our theory by its nearest integer if this reduces the total description length in equation (4), as detailed below. We start this search for integer candidates with the parameter that is closest to an integer, refitting for the other parameters after each successful “integer snap”.
What if we instead observe a parameter ? Whereas generic real numbers have a closest integer, they lack a closest rational number. Moreover, as illustrated in Figure 2, we care not only about closeness (to avoid increasing the second term in equation (4)), but also about simplicity (to reduce the first term). To rapidly find the best “rational snap” candidates (dots in Figure 2 that lie both near and far down), we perform a continued fraction expansion of and use each series truncation as a rational candidate. We repeat this for all parameters in the theory , again accepting only those snaps that reduce the total description length. We again wish to try the most promising snap candidates first; to rapidly identify promising candidates without having to recompute the second term in equation (4), we evaluate all truncations of all parameters as in Figure 3, comparing the description length of the rational approximation with the description length of the approximation error . The most promising candidate minimizes their sum, i.e., lies furthest down to the left of the diagonal in the figure. The figure illustrates how, given the parameter vector , the first snap to be attempted will replace the third parameter by .
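The generation of rational snap candidates by truncating a continued fraction expansion can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the `max_terms` cutoff and the tolerance `tol` are assumptions, and the ranking of candidates by description length is deliberately left out because the bit-count formulas are not reproduced in this text.

```python
import math
from fractions import Fraction


def continued_fraction_candidates(x, max_terms=8, tol=1e-12):
    """Return the rational approximations (convergents) of x obtained by
    truncating its continued fraction expansion after 1, 2, ... terms."""
    a = math.floor(x)
    # Standard convergent recurrence: h_n = a_n * h_{n-1} + h_{n-2},
    # and likewise for the denominators k_n.
    h_prev, h = 1, a
    k_prev, k = 0, 1
    candidates = [Fraction(h, k)]
    frac = x - a
    for _ in range(max_terms - 1):
        if frac < tol:  # expansion has terminated (x is numerically rational)
            break
        y = 1.0 / frac
        a = math.floor(y)
        frac = y - a
        h_prev, h = h, a * h + h_prev
        k_prev, k = k, a * k + k_prev
        candidates.append(Fraction(h, k))
    return candidates


# Example: the first four truncations for pi.
pi_candidates = continued_fraction_candidates(math.pi, max_terms=4)
```

Each successive candidate is closer to the parameter but has a larger numerator and denominator, which is the trade-off between closeness and simplicity that the snap selection has to balance.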
We propose Algorithm 3 to implement the above minimization of without increasing (Eq. 4). For each theory , we first extract the examples inside its domain, then perform a series of tentative transformations (simplifications) of the prediction function using the MinimizeDL subroutine. This subroutine takes , the transformation, and as inputs and repeatedly applies the transformation to . After each such transformation, it fine-tunes the fit of to using gradient descent. For determining whether to accept the transformation, Algorithm 3 presents the simplest 0-step patience implementation: if the description length for theory decreases, then apply the transformation again if possible, otherwise exit the loop. In general, to allow for a temporary increase of DL during the transformations, a nonzero patience can be adopted: at each step, save the best-performing model as the pivot model, and if DL does not decrease during consecutive transformations inside MinimizeDL, exit the loop. In our implementation, we use a 4-step patience.
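The patience logic of the MinimizeDL loop can be sketched as follows. This is a sketch under stated assumptions rather than Algorithm 3 itself: the callables `transform`, `fine_tune` and `description_length` are hypothetical stand-ins supplied by the caller, `transform` is assumed to return `None` when it can no longer be applied, and a k-step patience is interpreted as tolerating k consecutive non-improving transformations before exiting.

```python
import copy


def minimize_dl(model, transform, fine_tune, description_length, patience=0):
    """Repeatedly apply `transform` followed by `fine_tune`, keeping the
    lowest-description-length model seen so far (the pivot model)."""
    pivot = copy.deepcopy(model)
    best_dl = description_length(pivot)
    current = copy.deepcopy(model)
    fails = 0  # consecutive transformations that did not lower the DL
    while True:
        candidate = transform(current)
        if candidate is None:  # nothing left to transform
            break
        candidate = fine_tune(candidate)
        dl = description_length(candidate)
        if dl < best_dl:
            pivot, best_dl, fails = copy.deepcopy(candidate), dl, 0
        else:
            fails += 1
            if fails > patience:  # patience exhausted: keep the pivot
                break
        current = candidate
    return pivot, best_dl
```

With `patience=0` the loop exits at the first transformation that fails to reduce the description length; a larger patience lets the search pass through a temporary increase while still returning the pivot model.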
We now detail the five transformations used in Algorithm 3. The collapseLayer transformation finds all successive layers of a neural net where the lower layer has linear activation, and combines them into one. The toSymbolic transformation transforms the prediction function from the form of a neural net into a symbolic expression (in our implementation, from a PyTorch net to a SymPy symbolic lambda expression). These two transformations are one-time transformations (for example, once the prediction function has been transformed to a symbolic expression, toSymbolic cannot be applied to it again). The localSnap transformation successively sets the incoming weights in the first layer to 0, thus favoring inputs that are closer to the current time step. The integerSnap transformation finds the (non-snapped) parameter in that is closest to an integer, and snaps it to that integer. The rationalSnap transformation finds the (non-snapped) parameter in that has the lowest bit sum when replaced by a rational number, as described in section II.4, and snaps it to that rational number. The latter three transformations can be applied multiple times to , until there are no more parameters to snap in , or the transformation followed by fine-tuning fails to reduce the description length.
In the bigger picture, Algorithm 3 is an implementation of minimizing the without increasing the total , if the description length of is given by Eq. (13). There can be other ways to encode with a different formula for , in which case the transformations for decreasing may be different. But the structure of Algorithm 3 remains the same, with the goal of minimizing without increasing with respect to whatever DL formula it is based on.
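As an illustration of the collapseLayer transformation described above, two successive layers whose lower layer has linear (identity) activation can be merged using the identity W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2). The sketch below uses plain Python lists instead of PyTorch tensors, and the function names are hypothetical.

```python
def affine(w, b, x):
    """Apply the affine map w @ x + b using plain lists."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]


def collapse_linear_layers(w1, b1, w2, b2):
    """Merge layer 2 applied after a linear-activation layer 1 into a single
    affine layer with weight W2 W1 and bias W2 b1 + b2."""
    n_hidden = len(w1)   # width of the intermediate layer
    n_in = len(w1[0])
    w = [[sum(w2[i][k] * w1[k][j] for k in range(n_hidden))
          for j in range(n_in)]
         for i in range(len(w2))]
    b = [sum(w2[i][k] * b1[k] for k in range(n_hidden)) + b2[i]
         for i in range(len(w2))]
    return w, b


# Example: a 2 -> 2 -> 1 stack collapses to a single 2 -> 1 layer.
w1, b1 = [[1, 2], [3, 4]], [1, 0]
w2, b2 = [[1, 1]], [2]
w, b = collapse_linear_layers(w1, b1, w2, b2)
```

The collapsed layer computes the same function with fewer parameters, which is why this step can lower the description length before any snapping is attempted.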
In the still bigger picture, Algorithm 3 is a computationally efficient approximate implementation of the MDL formalism, involving the following two approximations:

The description lengths for various types of numbers are approximate, for convenience. For example, the length of the shortest self-terminating bit string encoding an arbitrary natural number grows slightly faster than our approximation , because self-termination requires storing not only the binary digits of the integer, but also the length of said bit string, recursively, requiring , where only the positive terms are included Rissanen (1983). Slight additional overhead is required to upgrade the encodings to actual programs in some suitable language, including encoding of whether bits encode integers, rational numbers, floating-point numbers, etc.

If the above-mentioned formulas were made exact, they would be mere upper bounds on the true minimum description length. For example, our algorithm gives a gigabyte description length for with precision , even though it can be computed by a rather short program, and there is no simple algorithm for determining which numbers can be accurately approximated by algebraic numbers. Computing the true minimum description length is a famous numerically intractable problem.
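The recursive length overhead mentioned in the first approximation can be illustrated with the iterated-logarithm sum log2(n) + log2(log2(n)) + ..., keeping only the positive terms. This is a sketch for intuition: the function name is hypothetical, the additive constant of Rissanen's universal code is omitted, and the plain log2(n) count used for comparison is only a stand-in baseline, not the approximation used in the paper.

```python
import math


def iterated_log_bits(n):
    """Sum log2(n) + log2(log2(n)) + ..., including only positive terms.

    Approximates the length of a self-terminating code for the natural
    number n, up to an additive constant that is omitted here.
    """
    if n < 1:
        raise ValueError("n must be a natural number >= 1")
    total = 0.0
    term = math.log2(n)
    while term > 0:
        total += term
        term = math.log2(term)
    return total


# Example: for n = 65536 the terms are 16 + 4 + 2 + 1.
bits = iterated_log_bits(65536)
overhead = bits - math.log2(65536)  # extra bits relative to a plain log2 count
```

The overhead grows much more slowly than log2(n) itself, which is consistent with the statement that the exact code length grows only slightly faster than the simple approximation.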
Appendix D Unification Algorithm
The unification process takes as input the symbolic prediction functions , and outputs master theories such that by varying each p in , we can generate a continuum of prediction functions within a certain class of prediction functions. The symbolic expression consists of 3 building blocks: operators (e.g. ,,,), input variables (e.g. ), and coefficients that can be either rational or irrational numbers. The unification algorithm first calculates the DL of each prediction function, then clusters them into clusters using e.g. K-means clustering. Within each cluster , it first canonicalizes each into a 2-tuple , where is a tree-form expression of where each internal node is an operator, and each leaf is an input variable or a coefficient. When multiple orderings are equivalent (e.g. vs. ), it always uses a predefined partial ordering. is the structure of where all coefficients are replaced by an symbol. Then the algorithm obtains a set of that has the same structure with the largest cardinality (steps 7-8). This eliminates some expressions within the cluster that might interfere with the following unification process. Step 9 is the core part, where it traverses each with synchronized steps using e.g. depth-first search or breadth-first search. This is possible since each has the same tree structure . During the traversal, whenever it encounters a coefficient and not all coefficients across at this position are the same, it replaces the coefficients by some symbol that has not been used before. Essentially, we turn all coefficients that vary across into parameters, while the coefficients that do not vary stay as they are. In this way, we obtain a master prediction function . Finally, at step 13, the algorithm merges the master prediction functions in that have the exact same form, and returns . The domain classifier is neglected during the unification process, since in different environments, each prediction function can have vastly different spatial domains. It is the prediction function (which characterizes the equation of motion) that is important for generalization.
Appendix E Adding and Proposing Theories
Here we detail the algorithms for adding theories to the hub and proposing them for use in new environments. Alg. 5 provides the simplest version of the theory proposing algorithm. Given a new dataset , the theory hub inspects all theories , and for each one, counts the number of data points where it outperforms all other theories. The top theories with largest are then proposed.
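The counting step of the theory proposing algorithm can be sketched as follows. This is a simplified illustration, not Alg. 5 itself: theories are represented as plain callables in a dictionary, the per-example `loss` function is supplied by the caller, and ties are resolved in favor of the theory listed first, all of which are assumptions of this sketch.

```python
def propose_theories(theories, data, loss, top_k):
    """Rank theories by the number of data points on which each one has the
    lowest loss, and return the names of the top_k theories with the counts.

    theories: dict mapping a name to a prediction function f(x)
    data:     list of (x, y) pairs from the new environment
    loss:     function loss(prediction, target) -> float
    """
    wins = {name: 0 for name in theories}
    for x, y in data:
        losses = {name: loss(f(x), y) for name, f in theories.items()}
        best = min(losses, key=losses.get)  # first-listed theory wins ties
        wins[best] += 1
    ranked = sorted(wins, key=wins.get, reverse=True)
    return ranked[:top_k], wins


# Example with three toy theories and a squared-error loss.
hub = {
    "double": lambda x: 2 * x,
    "square": lambda x: x * x,
    "zero": lambda x: 0.0,
}
points = [(3, 6), (4, 8), (5, 25), (1, 2)]
proposed, counts = propose_theories(hub, points, lambda p, t: (p - t) ** 2, top_k=2)
```

Counting wins per data point, rather than averaging the loss over the whole dataset, favors theories that fit some part of the new data very well, in keeping with the divide-and-conquer strategy.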
For theory adding after training with DDAC (Alg. 2), each theory calculates its description length inside its domain. If its is smaller than a threshold , then the theory, together with its corresponding examples , is added to the theory hub. The data are also added to the hub because gives a reference for how the theory was trained, and is also needed in the Occam’s razor algorithm.
Appendix F Proof of Theorem 1 and Corollary
Here we give the proof for Theorem 1, restated here for convenience.
Theorem 1 Let denote the prediction of the target by the function , .
Suppose that and for a monotonically increasing function
that vanishes on for some , with differentiable and strictly convex for .
Then if , we have
(14) 
where .
Proof. Since and , the generalized mean loss as defined in Eq. 3 can be rewritten as
(15) 
which implies that
Since only the last factor depends on , proving equation (14) is equivalent to proving that
(16) 
Let us henceforth consider only the case , since the conditions imply . Since , and , we have , so that . Because is differentiable and strictly convex, its derivative is monotonically increasing, implying that is monotonically decreasing. Thus