Zeta Distribution and Transfer Learning Problem

June 23, 2018 · Eray Özkural et al.

We explore the relations between the zeta distribution and algorithmic information theory via a new model of the transfer learning problem. The program distribution is approximated by a zeta distribution with parameter near 1. We model the training sequence as a stochastic process. We analyze the upper temporal bound for learning a training sequence and its entropy rates, assuming an oracle for the transfer learning problem. We argue from empirical evidence that power-law models are suitable for natural processes. Four sequence models are proposed. The random typing model resembles the no-free-lunch setting, in which transfer learning does not work. The zeta process samples programs independently from the zeta distribution. A model of common sub-programs inspired by genetics uses a database of sub-programs. An evolutionary zeta process samples mutations from the zeta distribution. The analysis of stochastic processes inspired by evolution suggests that AI may be feasible in nature, countering no-free-lunch style arguments.




1 Introduction

Although power-law distributions have been analyzed in depth in the physical sciences, little has been said about their relevance to Artificial Intelligence (AI). We introduce the zeta distribution as an analytic device in algorithmic information theory and propose using it to approximate the distribution of programs. We have been inspired by the empirical evidence in complex systems, especially biology and genetics, that shows an abundance of power-law distributions in nature. It is quite possible that the famous universal distribution in AI theory is closely related to power-law distributions in complex systems.

The transfer learning problem also merits our attention, as a general model of it has not been presented in machine learning literature. We develop a basic formalization of the problem using stochastic processes and introduce temporal bounds for learning a training sequence of induction problems, and transfer learning. The entropy rate of a stochastic process emerges as a critical quantity in these bounds. We show how to apply the bounds by analyzing the entropy rates of simple training sequence models that generate programs. Two models are close to what critics of AI have imagined, and easily result in unsolvable problems, while two models inspired by evolution suggest that there may be stochastic processes in nature on which AGI algorithms may be quite effective.

2 Approximating the Distribution of Programs

Solomonoff's universal distribution depends on the probability distribution of programs. A natural model is to consider programs the bits of which are generated by a fair coin. Solomonoff defined the probability of a program $\pi$ as:

$P(\pi) = 2^{-\ell(\pi)}$   (1)

where $\ell(\pi)$ is the program length in bits. The total probability of all programs thus defined unfortunately diverges if all bit-strings are considered valid programs. For constructing probability distributions, a convergent sum is required. The extended Kraft inequality shows that the total probability is less than $1$ for a prefix-free set of infinite programs [2]. Let $M$ be a reference machine which runs programs with a prefix-free encoding like LISP. The algorithmic probability that a bit-string $x$ is generated by a random program of $M$ is:

$P_M(x) = \sum_{M(\pi) = x} 2^{-\ell(\pi)}$   (2)

which conforms to Kolmogorov's axioms [9].

$P_M(x)$ is also called the universal prior, for it may be used as the prior in Bayesian inference, as any data can be encoded as a bit-string.

2.1 Zeta Distribution of Programs

We propose the zeta distribution for approximating the distribution of the programs of $M$. The distribution of Eq. 1 is already an approximation, even after normalization, since it contains many programs that are semantically incorrect, and many that do not generate any strings. A realistic program distribution requires us to specify a detailed probability model of programs, which the general model does not cover; however, the general model, approximate as it is, still gives excellent bounds on the limits of Solomonoff's universal induction method. Therefore, other general approximations may also be considered.

Additionally, the zeta function is universal, which encourages us to relate algorithmic information theory to the zeta distribution [12].

Let us consider a program bit-string $\pi$. Let $n(\pi)$ define the arithmetization of programs represented as bit-strings, where the first bit is the most significant bit:

$n(\pi) = \sum_{i=1}^{\ell(\pi)} \pi_i \, 2^{\ell(\pi)-i}$   (3)

Thus arithmetized, we now show a simple but interesting inequality about the distribution of programs:

$2^{\ell(\pi)-1} \le n(\pi) < 2^{\ell(\pi)}$   (4)
$2^{-\ell(\pi)} < \frac{1}{n(\pi)} \le 2^{-\ell(\pi)+1}$   (5)
$\frac{1}{2\,n(\pi)} \le 2^{-\ell(\pi)} < \frac{1}{n(\pi)}$   (6)

which shows an approximation that is closer than a factor of $2$. Program codes with a leading $0$ bit are discarded.

Zipf's law manifests itself as the Zipf distribution of ranked discrete objects, in order of increasing rank $k$:

$P(X = k) = \frac{k^{-s}}{\sum_{n=1}^{N} n^{-s}}$   (7)

where $X$ is a random variable over the ranks $1, \ldots, N$, the denominator is the normalization constant, and $s$ is the exponent parameter (we use the subscript notation $Z_s$ below simply to avoid confusion with exponentiation; $Z$ is a standard notation for the zeta random variable). The zeta distribution is the countably infinite version of the Zipf distribution with parameter $s$:

$P(Z_s = k) = \frac{k^{-s}}{\zeta(s)}$   (8)

where $Z_s$ is a random variable with co-domain $\mathbb{Z}^+$ and the zeta function is defined as

$\zeta(s) = \sum_{n=1}^{\infty} n^{-s}.$   (9)

Note that the zeta distribution is a discrete variant of the Pareto distribution.
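For concreteness, the following is a minimal numerical sketch of Eqs. 7-9, comparing the finite Zipf distribution to the countably infinite zeta distribution; the parameter values and the use of scipy.special.zeta are illustrative choices, not part of the paper.

import numpy as np
from scipy.special import zeta

def zipf_pmf(k, s, N):
    # Finite Zipf distribution over ranks 1..N (Eq. 7).
    ranks = np.arange(1, N + 1)
    norm = np.sum(ranks ** (-float(s)))      # normalization constant
    return k ** (-float(s)) / norm

def zeta_pmf(k, s):
    # Zeta distribution over all positive integers (Eq. 8); requires s > 1.
    return k ** (-float(s)) / zeta(s)

for k in [1, 2, 10, 100]:
    print(k, zipf_pmf(k, 1.0, 10**6), zeta_pmf(k, 1.01))

The printout makes visible the difference between the finite and the countably infinite normalizations for exponents near 1.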

It is rather involved to work with a prefix-free set of programs; therefore, we suggest an alternative device to approximate the program distribution.

Theorem 2.1

A program distribution may be approximated by the Zipf distribution with $s = 1$, or by the zeta distribution with a real $s$ close to $1$ from above.

Proof

(a) The zeta distribution is undefined for $s = 1$. However, if we use the Zipf distribution instead, and model programs up to a fixed program length, we can approximate the program distribution from above using $n(\pi)^{-1}$ and from below using $(2\,n(\pi))^{-1}$, due to the sandwich property (Eqs. 4-6).

(b) We can approximate the program distribution from below using $(2\,n(\pi))^{-1}$. Since $k^{-s}$ approaches $k^{-1}$ as $s \to 1$ from above, we can also approximate it with the zeta distribution (Eq. 8) for $s$ close to $1$.

In either case, the need for a prefix-free set of programs is obviated. We now investigate whether these simplified approximations are usable.

Theorem 2.2

The program distribution asymptotically obeys a power law with exponent $1$ as program size grows.

Proof

The probability of an arithmetized program is sandwiched between $(2\,n(\pi))^{-1}$ and $n(\pi)^{-1}$; therefore, as $\ell(\pi)$ grows, Zipf's law with $s = 1$ grows closer to $P(\pi)$:

$\frac{1}{2\,n(\pi)} \le 2^{-\ell(\pi)} < \frac{1}{n(\pi)}$   (10)
$\lim_{\ell(\pi) \to \infty} \frac{\log_2 P(\pi)}{-\log_2 n(\pi)} = 1$   (11)
Combining Theorem 2.1 and Theorem 2.2, we propose using a zeta distribution with a parameter $s$ close to $1$. Obviously, the lower and upper bounds vary only by a factor of $2$ from each other; therefore, the error in the approximation of the program distribution is at most $1$ bit (this property will be analyzed in detail in an extended version of the present paper). Substituting into Eq. 2, we propose the following approximation.

Definition 1

$P_M(x) \approx \sum_{M(\pi) = x} \frac{n(\pi)^{-s}}{\zeta(s)}$   (12)

where $s = 1 + \epsilon$ for a small $\epsilon > 0$. Definition 1 may be useful for machine learning theorists wherever they must represent a priori program probabilities, as it allows them to employ number theory. See the Elias gamma code [3] for an alternative integer code.
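As a sanity check, here is a small sketch (assuming the arithmetization with leading bit 1 described above; the exponent s = 1.01 is an arbitrary choice) that compares the exact program probability $2^{-\ell(\pi)}$ with its sandwich bounds and with the zeta weight used in Definition 1:

from scipy.special import zeta

def arithmetize(bits):
    # n(pi): read the bit-string as a binary integer, most significant bit first (Eq. 3).
    return int(bits, 2)

s = 1.01
for bits in ["1", "10", "101", "1101", "10110111"]:
    p = 2.0 ** (-len(bits))                  # P(pi) = 2^{-l(pi)}  (Eq. 1)
    n = arithmetize(bits)
    lower, upper = 1.0 / (2 * n), 1.0 / n    # sandwich bounds (Eq. 6)
    print(bits, lower, p, upper, n ** (-s) / zeta(s))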

3 Training Sequence as a Stochastic Process

Although Solomonoff has theoretically described how the transfer learning problem might be solved in [10], a detailed theoretical model of transfer learning for the universal induction setting is missing in the literature. Here, we attempt to fill this gap. In his treatise on incremental learning, Solomonoff approached the transfer learning problem by describing an update problem, which improves the guiding conditional probability distribution (GCPD) of the system, as an inductive inference problem of the type that the system usually solves. Solomonoff's modular approach started with a number of problem solving methods and invented new such methods as the system progressed. The initial methods, however, are not fully specified, and we leave their full specification as an open problem in this paper. Instead, we attempt to describe the space of training sequences using the zeta distribution, showing an interesting similarity to our world, in which most problems in a sequence can be solved, while occasionally one cannot be solved at all. For instance, a mathematician may solve most problems, but stall indefinitely at a conjecture that requires the invention of a new, non-trivial axiom.

In usual Solomonoff induction (with no transfer learning component), a computable stochastic source $\mu$ is assumed. The stochastic source may generate sequences, sets, functions, or other structures that we please, the general law of which may be induced via Solomonoff's method. We extend Solomonoff's induction model to a training sequence of induction problems by considering a stochastic process of random variables

$\mathcal{X} = \{ X_i \mid i \in \mathbb{Z}^+ \}.$   (13)

The transfer learning problem is thus constituted from solving induction problems in sequence, generated from the stochastic process $\mathcal{X}$. It does not matter which type of induction problem these are, as long as they are generated via $\mathcal{X}$.

3.1 Entropy Rate of a Training Sequence

A critical measurement of a stochastic process is its entropy rate, which is defined as follows for $\mathcal{X}$:

$H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n)$   (14)

and the conditional entropy rate,

$H'(\mathcal{X}) = \lim_{n \to \infty} H(X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1)$   (15)

which gives the entropy given past observations. Observe that there is a well-known relation between average Kolmogorov complexity and the entropy of an i.i.d. stochastic process (Equation 5 in [1]):

$\lim_{n \to \infty} \frac{E\left[ K(X_1 X_2 \ldots X_n) \right]}{n} = H(\mathcal{X})$   (16)

where $\mathcal{X}$ is a stochastic process and the $X_i$ are its random variables. We assume, without proof due to lack of space, that the relation extends to conditional entropy.
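The relation in Eq. 16 can be illustrated crudely with a general-purpose compressor standing in for Kolmogorov complexity (a computable upper bound only); the Bernoulli source and zlib are illustrative choices of ours, not the paper's:

import zlib
import numpy as np

def bernoulli_entropy(p):
    # Shannon entropy of a Bernoulli(p) variable, in bits.
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

rng = np.random.default_rng(0)
p = 0.1
for n in [10_000, 100_000, 1_000_000]:
    bits = rng.random(n) < p                       # i.i.d. Bernoulli(p) process
    packed = np.packbits(bits).tobytes()
    k_upper = 8 * len(zlib.compress(packed, 9))    # compressed length in bits, an upper bound on K
    print(n, k_upper / n, bernoulli_entropy(p))    # per-symbol code length vs. H (about 0.47 bits)

The compressed length per symbol stays above, but on the order of, the entropy rate, as Eq. 16 predicts for a computable approximation of K.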

3.2 Training Time

Let $x_i^*$ be the minimal program for exactly simulating $X_i$ on $M$. The most general expression for $x_i^*$ is given in the following:

(17)

where the pdf of the stochastic source $X_i$ is simulated by a program; the conditional parameter is optional. Let us note the following identity,

(18)

since the arguments are extraneous input to the pdf specified by the program.

Let $t_i$ denote the time taken to solve $X_i$, and let $t(\pi)$ denote the time taken by program $\pi$ on $M$. We know that the running time of extended Levin search is bias-optimal [10], and

(19)

for a computable stochastic source. The lower bound in Eq. 19 has been named the conceptual jump size by Solomonoff, because it refers to the solution of individual induction problems within a training sequence, quantifying how much conceptual innovation is required for a new problem [10]. We cannot exactly predict the conceptual jump size due to the incomputability of algorithmic probability. Extended Levin search will keep running indefinitely; it is up to the user to stop execution, which is usually bounded only by the amount of computational resources available to the user. We should also mention that Levin himself does not think that any realistic problems can be solved by Levin search, or that AI can be created on a computer [8]. In the present paper, we run counter to Levin's position by arguing that Levin search can work in an evolutionary setting, assuming an oracle for the transfer learning problem.

We substitute the relation between Kolmogorov complexity and entropy into the upper bound for the solution time,

(20)

obtaining the following fact due to the simulation-length and running-time relations above:

Lemma 1

The inequality of Lemma 1 translates to the time for the whole training sequence as follows.

Theorem 3.1
(21)

which is a simple sum over the bound of Lemma 1.

The conditional entropy rate is useful when the stochastic process has inter-dependence. Let us define the conditional Kolmogorov complexity for the training sequence $\mathcal{X}$,

(22)

We define the conditional probability likewise for the stochastic process.

(23)

The conditional complexity captures the new algorithmic information content of the $i$-th variable of the stochastic process, given the entire history.

As $i$ grows, the transfer learning oracle has to add $H'(\mathcal{X})$ bits of information to its memory on average at each step of the stochastic process, since the Kolmogorov-Shannon entropy relation (Eq. 16) holds in the limit for conditional entropy as well. Since the upper temporal bound grows exponentially, the conditional entropy rate relates only loosely to the solution time of a particular problem. We instead define the conditional expected training time upper bound with respect to $\mathcal{X}$:

(24)

3.3 Random Typing Model

Let us start by considering the well-known model of random typing. If each $X_i$ is regarded as a random $n$-bit program out of the $2^n$ such programs, the programs are independent, and the entropy rate is exactly $n$ bits (under the usual i.i.d. assumptions, e.g., we are using fair coin tosses, and we construct programs over a binary alphabet).

In the random typing model, all $X_i$ are algorithmically independent, therefore there is no saving that can be achieved by transfer learning. The time it takes for any problem is therefore:

(25)

for any of the $2^n$ programs. Since $n$ can be arbitrarily large, this model is compatible with Levin's conjecture that AI is impossible. Note that this simplistic model is reminiscent of various no-free-lunch theorems that were heralded as mathematical proof that general-purpose machine learning is impossible. However, this scenario is highly unrealistic. It is extremely difficult to find problems that are completely independent, as this would require us to use true random number generators to generate every problem. In other words, we show this "model" only to demonstrate how far removed from reality no-free-lunch theorems are. In a physical world, this model would correspond to the claim that quantum randomness saturates every observation we may make. However, we already know this claim to be false, since our observations do not consist of noise. On the contrary, there is a lot of dependable regularity in the environment we inhabit, which is sometimes termed "common sense" in the AI literature.

3.4 Power-law in Nature

A more realistic model, however, uses the zeta distribution for programs instead of the uniform distribution. We propose this indeed to be the case, since the zeta distribution is empirically observed in a multitude of domains, and there is good theoretical justification for the abundance of power laws in nature. Theorem 2.2 gives some weak and indirect justification as to why we might observe fractions of the zeta distribution of programs in a computable universe. However, there are more direct and appealing reasons why we should expect to see the zeta distribution in highly evolved complex systems. First, it is a direct consequence of the power-law ansatz and scale-invariance [1], or of preferential attachment in evolutionary systems [13]. Second, it follows from an application of the maximum entropy principle where the mean of the logarithms of observations is fixed [11]. Third, biologists have observed the zeta distribution directly in genetic evolution, thus strengthening the case that our $X_i$'s are likely to conform to zeta distributions. For instance, gene family sizes versus their frequencies follow a power-law distribution [5], and gene expression in various species follows Zipf's law [4]. Universal regularities in evolution have been observed, for instance in the power-law relation between the number of gene families and gene family size, in the number of genes in a category versus the number of genes in a genome, and in the power-law like distribution of network node degree [6]. Therefore, we are not merely following a theoretical heuristic argument; there are multiple theoretical and empirical justifications for expecting to observe the zeta distribution of programs in nature. The material evolution of the environment in a habitat is not altogether different from biological evolution. Except in the case of rare natural catastrophes, the material environment changes only gradually in accord with the dynamic flow of natural law (surprise is small), and depends mostly on the actions of organisms in a complex habitat, which may be considered to be programs from an information-theoretic point of view. In that sense, the entire ecology of the habitat in question may be considered an evolutionary system, with program frequencies similar to the case of genes in a single organism. In the following, we introduce novel models of training sequences inspired by these empirical justifications.

3.5 Identical Zeta Random Variables

Let the $X_i$ be i.i.d., generated from the zeta distribution according to Theorem 2.2. Then,

$H'(\mathcal{X}) = H(\mathcal{X}) = H(Z_s)$   (26)

indicating that the constant entropy rate depends only on the entropy of the zeta distribution. We thus analyze the running time.

(27)

For the first 1 trillion programs and an $s$ close to $1$, this is a feasible factor for a realistic program search limit.
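The entropy $H(Z_s)$ appearing in Eq. 26 can be evaluated numerically from the closed form $H(Z_s) = \log_2 \zeta(s) + \frac{s}{\ln 2}\left(\frac{-\zeta'(s)}{\zeta(s)}\right)$; the following sketch uses mpmath, and the chosen values of $s$ are arbitrary illustrations rather than figures quoted in the paper:

import mpmath as mp

def zeta_entropy_bits(s):
    # H(Z_s) = log2(zeta(s)) + (s / ln 2) * (-zeta'(s) / zeta(s)), in bits.
    z = mp.zeta(s)
    dz = mp.zeta(s, derivative=1)
    return mp.log(z, 2) + (s / mp.log(2)) * (-dz / z)

for s in [1.5, 1.1, 1.05, 1.01]:
    print(s, float(zeta_entropy_bits(s)))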

Note that AI theorists interpret i.i.d. assumptions as the main reason why no-free-lunch theorems are unrealistic [7]. Our i.i.d. zeta process here may be interpreted as an elaboration of that particular objection to no-free-lunch theorems. We therefore follow the heuristic argument that the right description of the environment we observe must be something other than the random typing model, since agents do succeed in transfer learning. The constant zeta process leans towards feasibility, but it does not yet model transfer learning in complex environments.

3.6 Zipf Distribution of Sub-programs

Based upon the observations of genetic evolution above, and the fact that the whole ecology is an evolutionary system, we may consider a process of programs with the following property. Each program that corresponds to an $X_i$ is constructed from a number of sub-programs (concatenated). The joint distribution of the sub-programs follows Zipf's law. This is a model of the gene frequencies observed in chromosomes, where each chromosome corresponds to a program and each gene corresponds to a sub-program. Such a distribution more closely models a realistic distribution of programs by constraining the possible programs, since in the real world the process that generates programs is not ergodic. The total entropy of the process therefore depends on the sub-programs, which may be assumed to be random, and on the program coding. Let each sub-program be a $b$-bit random program for the sake of simplicity. The $N$ sub-programs that correspond to instructions are specified in a database of $Nb$ bits. Instructions are not equiprobable, however, as they were in the random typing model. Let each program have $k$ instructions drawn from the set of $N$ instructions:

(28)

Then, we can model each optimal program as

(29)

which makes up a matrix of instructions, where each entry is drawn from the set of instructions. The total entropy is due to the database of sub-programs and to the entropy of the global distribution of sub-programs, which determines the entropy of each $X_i$. The total entropy is then, approximately,

(30)

where we show only the significant terms in the parameters.

Lemma 2

For the Zipf distribution of sub-programs,

(31)

due to Eq. 30,

which is to say that the entropy rate, and thus the running time, critically depends on the choice of $b$ and $N$.
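The dependence on the parameters can be made concrete with a back-of-the-envelope sketch: the decomposition into a database term and a per-program term follows the description above, while the parameter names b, N, k and the numeric values are illustrative assumptions of ours.

import numpy as np

def zipf_entropy_bits(N, s=1.0):
    # Shannon entropy (in bits) of the finite Zipf distribution over N ranks.
    ranks = np.arange(1, N + 1)
    w = ranks ** (-float(s))
    p = w / w.sum()
    return float(-(p * np.log2(p)).sum())

for b, N, k in [(8, 256, 100), (16, 4096, 100), (32, 10**6, 100)]:
    database_bits = N * b                        # N sub-programs of b random bits each
    per_program_bits = k * zipf_entropy_bits(N)  # k instructions drawn Zipf-style from N sub-programs
    print(f"b={b:3d} N={N:8d} k={k}: database={database_bits} bits, "
          f"per-program entropy ~ {per_program_bits:.1f} bits")

Growing the database (larger b or N) dominates the one-time cost, while the per-program entropy grows only logarithmically in N, which is what makes incremental learning over such a sequence attractive.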

3.7 An Evolutionary Zeta Process

Another process of programs may be determined by mimicking evolution, by considering random mutations of programs in a training sequence. Let us set

(32)
(33)

which would apply random transformations, sampled from the zeta distribution, in sequence to an initially null program. Such mutations are unlikely to be too complex. The resulting process has a small conditional entropy rate, which is wholly dependent on the distribution of mutations.

(34)
Lemma 3
(35)
(36)

The lemma suggests that if an evolutionary process evolves slowly enough, then an AI can easily learn everything there is to learn about it, provided that the time complexity of the random variables is not too large. We can also employ a different distribution of mutations in the mutation model above. For a universal induction approximation, this may be difficult to handle; however, for efficient model-based learning algorithms such as gradient descent methods, digesting new information on the order of a thousand bits is not a big challenge, given sufficiently many samples for a problem in the sequence.
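A toy simulation of the evolutionary zeta process described above (a construction of ours consistent with the prose, not the paper's exact equations; s = 1.5 is an arbitrary exponent): each step samples a mutation index from the zeta distribution, and the average description length of the sampled mutations estimates the conditional entropy rate the learner must absorb per step.

import numpy as np
from scipy.stats import zipf
from scipy.special import zeta

s = 1.5                      # zeta exponent, arbitrary choice > 1
steps = 10_000

# Sample mutation indices from the zeta distribution (scipy names it 'zipf').
mutation_ids = zipf.rvs(s, size=steps, random_state=0)

# Description length of each mutation under the zeta prior: -log2 P(Z_s = k).
new_bits = s * np.log2(mutation_ids) + np.log2(zeta(s))

print("mean new bits per mutation:", new_bits.mean())

For moderate exponents the mean is only a few bits per step, illustrating why a slowly mutating process has a small conditional entropy rate.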

4 Concluding Remarks

We have shown novel relations between Zipf's law and the program distribution by means of the arithmetization of programs. We have shown that the zeta distribution may be used for approximating program distributions. We have proposed using the conditional entropy rate as an informative quantity for transfer learning. We have extended Solomonoff's induction model to a training sequence of problems modeled as a stochastic process. We have proposed that the entropy rate of a stochastic process is informative. We have defined conditional Kolmogorov complexity and conditional probability for the sequence, and have used these quantities to define a conditional expected upper bound on training time, assuming a transfer learning oracle. We introduced sequence models to show that there is a wide range of possible stochastic processes that may be used to argue for the possibility of general-purpose AI. The random typing model is a sensible elaboration of no-free-lunch style arguments, and it demonstrates how artificial and unlikely they are: everything is interconnected in nature, and pure randomness is very hard to come by, so we rule it out as a plausible model of transfer learning. We have shown several empirical justifications for using a power-law model of natural processes. The independent zeta process tends toward feasibility, but does not explain transfer learning. The models inspired by natural evolution allow general-purpose learning to be feasible. In particular, the model of common sub-programs, which is inspired by empirical evidence in genetics, supports a view of the evolution of natural processes that allows incremental learning to be effective. The evolutionary zeta process applies random mutations, which can be slow enough for a machine learning algorithm to digest all the new information.

A more detailed analysis of the transfer learning problem will be presented in an extended journal paper. Open problems include analyzing the complexity of the optimal update algorithm, carrying out a time complexity analysis for the evolutionary processes, and accounting for the time complexity of individual programs.

Acknowledgements

The paper was substantially improved owing to the extensive and helpful comments of anonymous AGI 2014 and AGI 2018 reviewers.

References