Ultimate Intelligence Part III: Measures of Intelligence, Perception and Intelligent Agents

09/08/2017
by Eray Özkural, et al.

We propose that operator induction serves as an adequate model of perception. We explain how to reduce universal agent models to operator induction. We propose a universal measure of operator induction fitness, and show how it can be used in a reinforcement learning model and a homeostasis (self-preserving) agent based on the free energy principle. We show that the action of the homeostasis agent can be explained by the operator induction model.


1 Introduction

The ultimate intelligence research program is inspired by Seth Lloyd’s work on the ultimate physical limits to computation [15]. We investigate the ultimate physical limits and conditions of intelligence. This is the third installment of the paper series; the first two parts proposed new physical complexity measures, priors, and limits of inductive inference [18, 17].

We frame the question of the ultimate limits of intelligence in a general physical setting. To this end, we provide a general definition of an intelligent system and a physical performance criterion which, as anticipated, turns out to be a relation between physical quantities and information, the latter of which we had conceptually reduced to physics with minimum machine volume complexity in [18].

2 Notation and Background

2.1 Universal Induction

Let us recall Solomonoff’s universal distribution [21]. Let $M$ be a universal computer which runs programs with a prefix-free encoding, like LISP; $M(p) = x$ denotes that the output of program $p$ on $M$ is $x$, where $p$ and $x$ are bit strings. (A prefix-free code is a set of codes in which no code is a prefix of another. A computer file uses a prefix-free code, ending with an EOF symbol; thus, most reasonable programming languages are prefix-free.) Any unspecified variable or function is assumed to be represented as a bit string. $|x|$ denotes the length of a bit string $x$. $f(\cdot)$ refers to the function $f$ rather than its application.

The algorithmic probability that a bit string $x$ is generated by a random program of $M$ is:

P_M(x) = \sum_{M(p) = x*} 2^{-|p|}    (1)

which conforms to Kolmogorov’s axioms [13]. $P_M(x)$ considers any continuation of $x$, taking into account non-terminating programs; $x*$ uses the regular expression notation of language theory.

$P_M(x)$ is also called the universal prior, for it may be used as the prior in Bayesian inference, since any data can be encoded as a bit string. We also give the basic definition of Algorithmic Information Theory (AIT) [14], where the algorithmic entropy, or complexity, of a bit string $x$ is

H_M(x) = \min \{ |p| \,:\, M(p) = x \}    (2)

We use some variables in an overloaded fashion in the paper; e.g., the same symbol may denote a program, a policy, or a physical mechanism depending on the context.
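To make eqs. (1) and (2) concrete, here is a minimal Python sketch under stated assumptions: `toy_machine` is our own illustrative stand-in for $M$ (it is not universal, and not even strictly prefix-free), and since the true quantities are incomputable, the enumeration is truncated at a maximum program length.

```python
# A toy approximation of algorithmic probability P_M(x) and complexity H_M(x).
from itertools import product

def toy_machine(program: str):
    """Illustrative 'computer': the first two bits pick an operation applied
    to the remaining bits. Returns None for invalid (non-halting) programs.
    This machine is NOT universal; it only shows the shape of eqs. (1)-(2)."""
    if len(program) < 2:
        return None
    op, data = program[:2], program[2:]
    if op == "00":
        return data                     # identity
    if op == "01":
        return data + data              # duplication (a simple regularity)
    if op == "10":
        return "".join("1" if b == "0" else "0" for b in data)  # complement
    return None                         # '11...' treated as non-halting

def algorithmic_probability(x: str, max_len: int = 14) -> float:
    """Truncated P_M(x): sum of 2^-|p| over programs whose output starts
    with x (the x* convention, admitting any continuation)."""
    total = 0.0
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            out = toy_machine("".join(bits))
            if out is not None and out.startswith(x):
                total += 2.0 ** (-n)
    return total

def algorithmic_entropy(x: str, max_len: int = 14):
    """Truncated H_M(x): length of the shortest program outputting exactly x."""
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            if toy_machine("".join(bits)) == x:
                return n
    return None

print(algorithmic_probability("0101"))  # regular strings accumulate more mass
print(algorithmic_entropy("0101"))      # 4: duplication opcode '01' + data '01'
```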

2.2 Operator induction

Operator induction is a general form of supervised machine learning where we learn a stochastic map from question and answer pairs $(q_i, a_i)$ sampled from a (computable) stochastic source $\mu$. Operator induction can be solved by finding, in available time, a set of operators $O^j$, each a conditional probability density function (cpdf), such that the following goodness of fit is maximized:

\Psi = \sum_j \Psi^j    (3)

for a stochastic source $\mu$, where each term in the summation is the contribution of a model:

\Psi^j = 2^{-|O^j|} \prod_{i=1}^{n} O^j(a_i \mid q_i)    (4)

$(q_i, a_i)$ are the question/answer pairs in the input dataset drawn from $\mu$, and $O^j$ is a computable cpdf in (4). We can use the found operators to predict unseen data with a mixture model [24]:

P(a_{n+1} \mid q_{n+1}) = \sum_j \Psi^j \, O^j(a_{n+1} \mid q_{n+1})    (5)

The goodness of fit in this case strikes a balance between high a priori probability and reproduction of the data, as in the minimum message length (MML) method [27, 26], yet uses a universal mixture as in sequence induction. The convergence theorem for operator induction was proven in [23] using Hutter’s extension to an arbitrary alphabet; it bounds the total expected error in terms of the algorithmic complexity of the source, similarly to sequence induction.
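The following sketch illustrates eqs. (3)-(5) over an assumed finite set of candidate operators with hand-assigned description lengths; true operator induction would instead search the space of all programs.

```python
# Each operator maps a question to a pdf over answers {"0", "1"}; the integer
# is its assumed description length |O| in bits (an illustrative choice).
OPERATORS = {
    "echo":    (2, lambda q: {"0": 0.9, "1": 0.1} if q[-1] == "0"
                             else {"0": 0.1, "1": 0.9}),
    "always0": (1, lambda q: {"0": 0.9, "1": 0.1}),
    "uniform": (1, lambda q: {"0": 0.5, "1": 0.5}),
}

def fitness_terms(data):
    """One Psi^j per model, as in eq. (4): 2^-|O| times the data likelihood."""
    terms = {}
    for name, (length, op) in OPERATORS.items():
        likelihood = 1.0
        for q, a in data:
            likelihood *= op(q)[a]
        terms[name] = 2.0 ** (-length) * likelihood
    return terms

def predict(data, q_new):
    """Mixture prediction of eq. (5), normalized over the two answers."""
    terms = fitness_terms(data)
    scores = {a: sum(w * OPERATORS[name][1](q_new)[a]
                     for name, w in terms.items())
              for a in ("0", "1")}
    z = sum(scores.values())
    return {a: s / z for a, s in scores.items()}

data = [("10", "0"), ("11", "1"), ("01", "1")]   # source echoes the last bit
print(predict(data, "00"))                        # 'echo' dominates: P("0") high
```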

2.3 Set induction

Set induction generalizes unsupervised machine learning: we learn a probability density function (pdf) from a set of bitstrings $\{x_1, x_2, \ldots, x_n\}$ sampled from a stochastic source $\mu$. We can then inductively infer new members to be added to the set with:

P(x_{n+1}) = \sum_j 2^{-|\psi^j|} \prod_{i=1}^{n} \psi^j(x_i)\, \psi^j(x_{n+1})    (6)

Set induction is clearly a restricted case of operator induction where we set the $q_i$'s to the null string. Set induction is a universal form of clustering, and it perfectly models perception. If we apply set induction to a large set of 2D pictures of a room, it will necessarily give us a 3D representation of the room. If we apply it to physical sensor data, it will infer the physical theory – perfectly general, with infinite domains – that explains the data; perception is then merely a special case of scientific theory inference, though set induction works with both deterministic and non-deterministic problems.
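Reusing the mixture idea above with every question set to the null string gives a sketch of eq. (6): each candidate model is an unconditional pdf over bitstrings. The candidate pdfs and code lengths below are again illustrative assumptions.

```python
import math

# Candidate pdfs over bitstrings, with assumed code lengths |psi| in bits.
MODELS = {
    "mostly_ones": (2, lambda x: math.prod(0.8 if b == "1" else 0.2 for b in x)),
    "uniform":     (1, lambda x: 0.5 ** len(x)),
}

def set_membership(observed, x_new):
    """P(x_new) under the 2^-|psi|-weighted mixture fit to the observed set,
    normalized over the candidate models (eq. 6 with null questions)."""
    weights = {name: 2.0 ** (-l) * math.prod(pdf(x) for x in observed)
               for name, (l, pdf) in MODELS.items()}
    z = sum(weights.values())
    return sum((w / z) * MODELS[name][1](x_new) for name, w in weights.items())

observed = ["111", "110", "101"]           # a set drawn from a 1-heavy source
print(set_membership(observed, "111"))     # relatively high membership score
print(set_membership(observed, "000"))     # low
```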

2.4 Universal measures of intelligence

There is much literature on the subject of defining a measure of intelligence. Hutter has defined an intelligence order relation in the context of his universal reinforcement learning (RL) model AIXI [8], which suggests that intelligence corresponds to the set of problems an agent can solve. Also notable is the universal intelligence measure [10, 11], which is again based on the AIXI model. Their universal intelligence measure is based on the following philosophical definition compiled from their review of definitions of intelligence in the AI literature.

Definition 1 (Legg & Hutter)

Intelligence measures an agent’s ability to achieve goals in a wide range of environments.

It implies that intelligence requires an autonomous goal-following agent. The intelligence measure of [10] is defined as

\Upsilon(\pi) = \sum_{\mu \in E} 2^{-K(\mu)} V_\mu^\pi    (7)

where $\mu$ is a computable, reward-bounded environment in a set $E$ of such environments, and $V_\mu^\pi$ is the expected sum of future rewards in the total interaction sequence of agent $\pi$: $V_\mu^\pi = E(\sum_i \gamma^i r_i)$, where $r_i$ is the instantaneous reward at time $i$ generated by the interaction between the agent $\pi$ and the environment $\mu$, and $\gamma$ is the time discount factor.
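A finite sketch of eq. (7): each toy environment is weighted by $2^{-K}$, where $K$ here is an assumed complexity in bits rather than true Kolmogorov complexity, and the discounted value of a policy is estimated by Monte Carlo rollouts. The bandit environments and policies are illustrative assumptions.

```python
import random

def make_bandit(p):
    """A toy environment: action 1 pays reward 1 with probability p."""
    return lambda action: 1.0 if (action == 1 and random.random() < p) else 0.0

# (assumed complexity K in bits, environment) pairs standing in for mu in E.
ENVIRONMENTS = [(3, make_bandit(0.9)), (5, make_bandit(0.1))]

def value(env, policy, gamma=0.95, horizon=100, rollouts=500):
    """V_mu^pi: expected discounted reward sum, estimated by sampling."""
    total = 0.0
    for _ in range(rollouts):
        total += sum(gamma ** t * env(policy(t)) for t in range(horizon))
    return total / rollouts

def upsilon(policy):
    """Upsilon(pi) = sum over mu of 2^-K(mu) * V_mu^pi, over the toy set."""
    return sum(2.0 ** (-k) * value(env, policy) for k, env in ENVIRONMENTS)

print(upsilon(lambda t: 1))   # always pull arm 1: high weighted value
print(upsilon(lambda t: 0))   # never pull: zero value in every environment
```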

2.5 The free energy principle

In Asimov’s story titled “The Last Question”, the task of life is identified as overcoming the second law of thermodynamics, however futile. Variational free energy essentially measures predictive error; it was introduced by Feynman to address difficult path integral problems in quantum physics. In thermodynamic free energy, energies are negative log probabilities, as with entropy. The free energy principle states that any system must minimize its free energy to maintain its order. An adaptive system that tends to minimize average surprise (entropy) will tend to survive longer. A biological organism can be modelled as an adaptive system that has an implicit probabilistic model of the environment; variational free energy puts an upper bound on surprise, thus minimizing free energy will improve the chances of survival. In Friston’s model [9], the divergence between the pdf of the environment and an arbitrary pdf encoded by the system’s own mechanism is minimized. It has been shown in detail that the free energy principle adequately models a self-preserving agent in a stochastic dynamical system [6, 9], which we can interpret as an environment with a computable pdf. An active agent may be defined in the formalism of stochastic dynamical systems by partitioning the physical states into $x = \{\psi, s, a, \lambda\}$, where $\psi$ is an external state, $s$ is a sensory state, $a$ an active state, and $\lambda$ an internal state. Self-preservation is defined by the Markov blanket $\{s, a\}$, the removal of which partitions $x$ into external states and internal states that influence each other only through the sensory and action states. $\psi$ influences sensations $s$, which in turn influence internal states $\lambda$, resulting in the choice of action signals $a$, which impact $\psi$, forming the feedback loop of the adaptive system. The system states evolve according to the stochastic equations:

\dot{x} = f(x) + \omega    (8)
f(x) = (f_\psi(\psi, s, a),\; f_s(\psi, s, a),\; f_a(s, a, \lambda),\; f_\lambda(s, a, \lambda))    (9)
\omega = (\omega_\psi, \omega_s, \omega_a, \omega_\lambda)    (10)

where $f$ is the flow of the system states, decomposed into flows over the sets in the system partition, explicitly showing the dependencies among state sets, and $\omega$ models random fluctuations. Friston formalizes the self-preservation (homeostasis) problem as finding an internal dynamics that minimizes the uncertainty (Shannon entropy) of the external states, and shows a solution based on the principle of least action [9], wherein minimizing free energy is synonymous with minimizing the entropy of the external states, which subsequently corresponds to active inference. We have space for only some key results from the rather involved mathematical theory. $p(s, \psi \mid m)$ is the generative pdf that generates the sensorium and fictive (hidden) states from the probabilistic model $m$, and $q(\psi \mid \lambda)$ is the recognition pdf that predicts hidden states in the world given the internal state. The generative pdf factorizes as $p(s, \psi \mid m) = p(s \mid \psi, m)\, p(\psi \mid m)$. Free energy is defined as energy minus entropy:

F = E_q[-\ln p(s, \psi \mid m)] - H(q(\psi \mid \lambda))    (11)

which can be subjectively computed by the system. Free energy is also equal to surprise plus the divergence between the recognition pdf and the posterior given by the generative pdf:

F = -\ln p(s \mid m) + D_{KL}(q(\psi \mid \lambda) \,\|\, p(\psi \mid s, m))    (12)

Minimizing the divergence minimizes free energy; the internal states may be optimized to minimize predictive error using (12), since surprise is invariant with respect to $\lambda$. Free energy may also be formulated as model complexity minus accuracy of recognition:

F = D_{KL}(q(\psi \mid \lambda) \,\|\, p(\psi \mid m)) - E_q[\ln p(s(a) \mid \psi, m)]    (13)

In this case, we may choose an action that changes sensations so as to reduce predictive error; only the accuracy term is a function of the action signals, through $s(a)$. Minimization of free energy turns out to be equivalent to the information bottleneck principle of Tishby [9, 25]. The information bottleneck method is in turn equivalent to the pioneering work of Ashby, which is simple enough to state here [3, 2]:

\min_\lambda \; I(\psi; \lambda) - I(s; \lambda)    (14)

where the first term is the mutual information between the internal and hidden states, and the second term is the mutual information between the sensory and internal states. Both terms are expanded using conditional entropy, and the two terms in the middle are then eliminated because they are not relevant to the optimization problem – we do not know the hidden variables in $H(\lambda \mid \psi)$, and $H(s)$ is constant:

I(\psi; \lambda) - I(s; \lambda) = H(\lambda) - H(\lambda \mid \psi) - H(s) + H(s \mid \lambda)    (15)
\approx H(\lambda) + H(s \mid \lambda)    (16)

Minimizing (16) thus minimizes the sum of the entropy of the internal states and the entropy required to encode sensory states given internal states. In other words, it strikes an optimal balance between model complexity, $H(\lambda)$, and model accuracy, $H(s \mid \lambda)$. Friston further shows that (16) derives directly from the free energy principle, closing potential loopholes in the theory. Please see [5] for a comprehensive application of the free energy principle to agents and learning. Note also that the bulk of the theory assumes the ergodic hypothesis.
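The three decompositions (11)-(13) can be checked numerically for a discrete toy generative model; the prior, likelihood, and recognition pdf below are arbitrary assumptions chosen only to make the identities concrete.

```python
# Numerical check: F = E_q[energy] - H(q) = surprise + KL(q || posterior)
#                    = KL(q || prior) - E_q[accuracy].
import numpy as np

prior = np.array([0.7, 0.3])                  # p(psi | m): two hidden states
lik = np.array([[0.9, 0.1],                   # p(s | psi, m): rows = psi,
                [0.2, 0.8]])                  # columns = two sensory states
s = 0                                         # observed sensory state
q = np.array([0.6, 0.4])                      # recognition pdf q(psi | lambda)

joint = prior * lik[:, s]                     # p(s, psi | m)
evidence = joint.sum()                        # p(s | m)
posterior = joint / evidence                  # p(psi | s, m)

energy = -(q * np.log(joint)).sum()           # E_q[-ln p(s, psi | m)]
entropy = -(q * np.log(q)).sum()              # H(q)
kl_post = (q * np.log(q / posterior)).sum()
kl_prior = (q * np.log(q / prior)).sum()
accuracy = (q * np.log(lik[:, s])).sum()      # E_q[ln p(s | psi, m)]

print(energy - entropy)                       # eq. (11): energy minus entropy
print(-np.log(evidence) + kl_post)            # eq. (12): surprise + divergence
print(kl_prior - accuracy)                    # eq. (13): complexity - accuracy
```

All three printed values coincide, confirming that the decompositions describe the same quantity.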

3 Perception as General Intelligence

Since we are chiefly interested in stochastic problems in the physical world, we propose a straightforward informal definition of intelligence:

Definition 2

Intelligence measures the capability of a mechanism to solve prediction problems.

A mechanism is any physical machine, as usual; see [4], which suggests a similar view. Therefore, a general formulation of Solomonoff induction – operator induction – might serve as a model of general intelligence as well [24]. Recall that operator induction can infer any physically plausible cpdf; thus its approximation can solve any classical supervised machine learning problem. The only slight issue with (7) might be that it seems to exclude classical AI systems that are not agents, e.g., expert systems, machine learning tools, knowledge representation systems, search and planning algorithms, and so forth, which are somewhat more naturally encompassed by our informal definition.

3.1 Is operator induction adequate?

A question naturally arises as to whether operator induction can adequately solve every prediction problem we require in AI. There are two strong objections to operator induction that we know of. First, it is argued that in a dynamic environment, such as a physical environment, we must use an active agent model to account for changes in the environment, as in the space-time embedded agent [16], which also provides an agent-based intelligence measure. This objection may be answered by the simple observation that each decision of an active intelligent system may be considered a separate induction problem. The second objection is that basic Solomonoff induction can only predict the next bit, not the expected cumulative reward, which its extensions can. We counter this objection by noting that an agent model can be reduced to a perception and action-planning problem as in OOPS-RL [20]. In OOPS-RL, the perception module uses OOPS to search, in allotted time, for the best world-model given the history of sensory input and actions, and the planning module likewise searches for the best control program, using the perception module's world-model, to determine the action sequence that maximizes cumulative reward. OOPS contains a generalized Levin Search [12], which may be tweaked to solve either prediction or optimization problems. Hutter has also observed that standard sequence induction does not readily address optimization problems [8]. However, Solomonoff induction is still complete in the sense of Turing and can infer any computable cpdf; and when the extension to Solomonoff induction is applied to sequence prediction, it does not yield a better error bound, which seems like a conundrum. On the other hand, Levin Search with a proper universal probability density function (pdf) over programs can be modified to solve induction problems (sequence, set, operator, and sequence prediction with arbitrary loss), inversion problems (computer science problems in P and NP), and optimization problems [23]; a sketch of its time allocation follows. The planning module of OOPS-RL likewise requires us to write such an optimization program. In that sense, AIXI implies yet another variation of Levin Search for solving a particular universal optimization problem; however, it also has the unique advantage that formal transformations between the AIXI problem and many important problems, including function minimization and strategic games, have been shown [8]. Nevertheless, the discussion in [23] is rather brief. Also see [1] for a discussion of universal optimization.
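A minimal sketch of the time allocation in Levin Search, assuming a toy program space and interpreter: in phase $k$, each program $p$ receives roughly $2^k \cdot 2^{-|p|}$ steps, so shorter (a priori more probable) programs are tried first.

```python
from itertools import product

def levin_search(is_solution, run, max_phase=20):
    """run(p, steps): execute program p for at most `steps` steps, returning
    its output or None if invalid/unfinished; is_solution tests outputs."""
    for k in range(1, max_phase + 1):
        for n in range(1, k + 1):                # program lengths up to k
            steps = 2 ** (k - n)                 # ~ 2^k * 2^-|p| step budget
            for bits in product("01", repeat=n):
                p = "".join(bits)
                out = run(p, steps)
                if out is not None and is_solution(out):
                    return p, out
    return None

# Demo with a toy 'interpreter': a program is scored by its digit sum; the
# step budget is ignored here since this interpreter always halts.
target = 6
found = levin_search(lambda out: out == target,
                     lambda p, steps: sum(map(int, p)))
print(found)    # shortest program whose digit sum is 6: ('111111', 6)
```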

Proposition 1

A discrete-time universal RL model may be reduced to operator induction.

More formally, the perceptual task of an RL agent would be to infer from a history the cumulative rewards of the future, without loss of generality. Let the chronology $x$ be a sequence of sensory, reward, and action data, where $x_i$ accesses the $i$th element and $x_{i:j}$ accesses the subsequence from $i$ to $j$. Let $R$ be the cumulative reward function, where $R(i,j) = \sum_{k=i}^{j} r_k$. After observing $x_{1:n}$, we construct a dataset as follows. For every unique pair $(i, j)$ such that $1 \leq i \leq j \leq n$, we concatenate the history tuples $x_{1:i-1}$ and form a question string that also includes the next action $a_i$, and an answer string which is the cumulative reward $R(i, j)$. Solving the operator induction problem for this dataset will yield a cpdf which predicts cumulative rewards in the future. After that, choosing the next action is a simple matter of maximizing the expected cumulative reward predicted up to the planning horizon $h$. The reduction causes a quadratic blow-up in the number of data items; see the sketch below. Our somewhat cumbersome reduction suggests that all of the intelligence here comes from operator induction: surely an argmax function or a summation of rewards does not provide it, but rather builds constraints into the task. In other words, we interpret that the intelligence in an agent model is provided by inductive inference, rather than by an additional application of decision theory.
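The following sketch constructs the operator induction dataset of Proposition 1 from an observed chronology, under assumed encodings; including the horizon in the question is our own illustrative choice so the cpdf can distinguish prediction spans.

```python
def build_dataset(chronology):
    """chronology: list of (sensation, reward, action) tuples observed so far.
    For every pair i <= j, emit a question encoding the history before step i
    together with the action taken at step i and the horizon j - i, and an
    answer giving the cumulative reward R(i, j) = r_i + ... + r_j."""
    n = len(chronology)
    data = []
    for i in range(n):
        history = tuple(chronology[:i])          # x_{1:i-1}
        action_i = chronology[i][2]              # the next action a_i
        for j in range(i, n):
            reward = sum(chronology[k][1] for k in range(i, j + 1))
            data.append(((history, action_i, j - i), reward))
    return data

chron = [("s1", 0.0, "left"), ("s2", 1.0, "right"), ("s3", 0.5, "left")]
pairs = build_dataset(chron)
print(len(pairs))   # n(n+1)/2 = 6: the quadratic blow-up noted in the text
```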

4 Physical Quantification of Intelligence

Definition 1 corresponds quite well to any kind of reinforcement-learning or goal-following agent in the AI literature, and can be adapted to solve other kinds of problems. The unsupervised, active inference agent approach is proposed in place of the reinforcement learning approach in [7], where the authors argue that they did not need to invoke the notions of reward, value, or utility. In particular, they claim that they could solve the mountain-car problem by the free-energy formulation of perception. We thus propose a perceptual intelligence measure.

4.1 Universal measure of perception fitness

Note that operator induction is considered insufficient to describe universal agents such as AIXI, because basic sequence induction is inappropriate for modelling optimization problems [8]. However, a modified Levin Search procedure can solve such optimization problems, as in finding an optimal control program [20]: recall that in OOPS-RL the perception module searches for the best world-model given the history of sensory input and actions, while the planning module searches for the control program that maximizes cumulative reward under that world-model. In this paper, we consider the perception module of such a generic agent, which must produce a world-model given sensory input.

We can use the intelligence measure (7) in a physical theory of intelligence; however, it contains terms like utility that do not have physical units (i.e., we would prefer a more reductive definition). We therefore attempt to obtain such a measure using the more benign goodness of fit (3). Let the universal measure of the fitness of operator induction be defined as

\Xi(\rho) = \sum_{\mu \in S} 2^{-H(\mu)} \Psi(\mu, \rho)    (17)

where $S$ is the set of possible stochastic sources in the observable universe, $\rho$ is a physical mechanism, and $\Psi(\mu, \rho)$ is the goodness of fit (3) relative to the stochastic source $\mu$ and the physical mechanism (computer) $\rho$. This measure would be maximal if operator induction were solved exactly by an oracle machine.

Note that $S$ is finite; $\Psi(\mu, \rho)$ is likewise bounded by the amount of computation $\rho$ will spend on approximating operator induction.
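A finite sketch of eq. (17), with assumed complexities $H(\mu)$ and an idealized (oracle-like) fit function standing in for $\Psi(\mu, \rho)$:

```python
import math
import random

def perception_fitness(sources, fit):
    """Xi(rho) = sum over mu in S of 2^-H(mu) * Psi(mu, rho), over a toy S."""
    return sum(2.0 ** (-h) * fit(mu) for h, mu in sources)

# Toy sources: biased coins, each with an assumed description complexity H(mu).
sources = [(2, 0.9), (4, 0.5)]

def oracle_fit(p, n=20):
    """Stand-in for Psi(mu, rho): the likelihood an idealized mechanism, which
    recovers the true parameter, assigns to n samples from the source."""
    sample = [random.random() < p for _ in range(n)]
    return math.prod(p if b else 1.0 - p for b in sample)

print(perception_fitness(sources, oracle_fit))
```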

4.2 Application to homeostasis agent

In a presentation to Friston’s group in January 2015, we noted that the minimization of (16) is identical to the Minimum Message Length principle, which can be further refined as

\min_\lambda \; -\log P_M(\lambda) - \log P_M(s \mid \lambda)    (18)

using Solomonoff’s entropy formulation, which takes the negative logarithm of algorithmic probability [22]. In the unsupervised agent context, solving this minimization problem corresponds to inferring an optimal behavioral policy, as $\lambda$ constitutes the internal dynamics, which may be modeled as a non-terminating program. We could also directly apply induction to minimize the KL divergence. Note the correspondence to operator induction, illustrated by the sketch below.
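A minimal sketch of the two-part minimization (18), assuming a small candidate set of internal-state programs with hand-assigned code lengths and predictive cpdfs:

```python
import math

# Each candidate lambda: (assumed code length |lambda| in bits, cpdf of the
# next sensory bit given lambda). Both are illustrative assumptions.
CANDIDATES = {
    "predict_ones": (3, lambda bit: 0.9 if bit == "1" else 0.1),
    "predict_coin": (1, lambda bit: 0.5),
}

def two_part_cost(name, sensory):
    """-log P_M(lambda) - log P_M(s | lambda), approximating the first term
    by the assumed code length and the second by the cpdf's code length."""
    length, cpdf = CANDIDATES[name]
    return length + sum(-math.log2(cpdf(b)) for b in sensory)

sensory = "111111101111"
best = min(CANDIDATES, key=lambda name: two_part_cost(name, sensory))
print(best, two_part_cost(best, sensory))   # the regular model wins
```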

Theorem 4.1

Minimizing the free energy is equivalent to solving the operator induction problem for pairs $(q_i, a_i)$ where $q_i = \lambda_i$ and $a_i = s_i$.

Proof

Observe that minimizing (16) corresponds to picking the maximum of $P_M(\lambda)\, P_M(s \mid \lambda)$, since in entropy form $H(\lambda) + H(s \mid \lambda) = -\log P_M(\lambda) - \log P_M(s \mid \lambda)$.

We define a non-redundant selection of $\lambda$'s, e.g., we pick only the shortest programs that produce the same cpdf, as otherwise the entropy form would diverge. Minimizing (18) is exactly operator induction, even though the questions are programs; the ensemble here is that of all programs and all (sensory state, program) pairs in space-time, with $q_i = \lambda_i$ and $a_i = s_i$. Note that this merely establishes model equivalence; we have not yet explained how it is to be computed in detail.

Proposition 2

By the above theorem, (17) measures the goodness of fit for a given homeostasis agent mechanism, over all possible environments.

The mechanism $\rho$ that maximizes (17) achieves less error with respect to a source $\mu$ (which may be taken to correspond to the whole random dynamical system in the framework of the free energy principle), while the weight $2^{-H(\mu)}$ normalizes with respect to a random dynamical system. The measure holds for the same reasons Legg's measure holds, which are not discussed here due to space limits. We prefer the unsupervised homeostasis agent among the two agent models we discussed because it provides an exceptionally elegant and reductionist model of autonomous behavior that has been rigorously formulated physically. Note that this agent is conceptually related to the survival property of RL agents discussed in [19].

4.3 Discussion

The unsupervised model still achieves exploration and curiosity, because it would stochastically sample and navigate the environment to reduce predictive error. While here we either optimize perceptual models or choose actions that fit expectations, it might be possible to express the optimal adaptive agent policy in a general optimization framework. A more in-depth analysis of the unsupervised agent will be presented in a subsequent publication. A more general reductive definition of intelligence should also be researched. These developments could eventually help unify AGI theory.

References

  • [1] Alpcan, T., Everitt, T., Hutter, M.: Can we measure the difficulty of an optimization problem? In: 2014 IEEE Information Theory Workshop, ITW 2014, Hobart, Tasmania, Australia, November 2-5, 2014. pp. 356–360. IEEE (2014), http://dx.doi.org/10.1109/ITW.2014.6970853
  • [2] Ashby, W.R.: Principles of the self-organizing system. In: v. Foerster, H., Zopf, G.W. (eds.) Principles of Self-Organization: Transactions of the University of Illinois Symposium, pp. 255–278. Pergamon, London (1962)
  • [3] Ashby, W.: Principles of the self-organizing dynamic system. The Journal of General Psychology 37(2), 125–128 (1947)
  • [4] Dowe, D.L., Hernández-Orallo, J., Das, P.K.: Artificial General Intelligence: 4th International Conference, AGI 2011, Mountain View, CA, USA, August 3-6, 2011. Proceedings, chap. Compression and Intelligence: Social Environments and Communication, pp. 204–211. Springer Berlin Heidelberg, Berlin, Heidelberg (2011), http://dx.doi.org/10.1007/978-3-642-22887-2_21
  • [5] Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., O'Doherty, J., Pezzulo, G.: Active inference and learning. Neuroscience and Biobehavioral Reviews 68, 862–879 (2016), http://www.sciencedirect.com/science/article/pii/S0149763416301336
  • [6] Friston, K., Kilner, J., Harrison, L.: A free energy principle for the brain. Journal of Physiology-Paris 100(1–3), 70 – 87 (2006), http://www.sciencedirect.com/science/article/pii/S092842570600060X, theoretical and Computational Neuroscience: Understanding Brain Functions
  • [7] Friston, K.J., Daunizeau, J., Kiebel, S.J.: Reinforcement learning or active inference? PLOS ONE 4(7), 1–13 (07 2009), https://doi.org/10.1371/journal.pone.0006421
  • [8] Hutter, M.: Universal algorithmic intelligence: A mathematical top-down approach. In: Goertzel, B., Pennachin, C. (eds.) Artificial General Intelligence, pp. 227–290. Cognitive Technologies, Springer, Berlin (2007)
  • [9] Karl, F.: A free energy principle for biological systems. Entropy 14(11), 2100–2121 (2012), http://www.mdpi.com/1099-4300/14/11/2100
  • [10] Legg, S., Hutter, M.: Universal intelligence: A definition of machine intelligence. Minds Mach. 17(4), 391–444 (Dec 2007)
  • [11] Legg, S., Veness, J.: An approximation of the universal intelligence measure. In: Algorithmic Probability and Friends. Bayesian Prediction and Artificial Intelligence, Lecture Notes in Computer Science, vol. 7070, pp. 236–249. Springer Berlin Heidelberg (2013)

  • [12] Levin, L.: Universal problems of full search. Problems of Information Transmission 9(3), 265–266 (1973)
  • [13] Levin, L.A.: Some theorems on the algorithmic approach to probability theory and information theory. CoRR abs/1009.5894 (2010)
  • [14] Li, M., Vitanyi, P.M.: An Introduction to Kolmogorov Complexity and Its Applications. Springer Publishing Company, Incorporated, 3 edn. (2008)
  • [15] Lloyd, S.: Ultimate physical limits to computation. Nature 406 (Aug 2000)
  • [16] Orseau, L., Ring, M.: Space-time embedded intelligence. In: Bach, J., Goertzel, B., Iklé, M. (eds.) Artificial General Intelligence, Lecture Notes in Computer Science, vol. 7716, pp. 209–218. Springer Berlin Heidelberg (2012), http://dx.doi.org/10.1007/978-3-642-35506-6_22
  • [17] Özkural, E.: Ultimate Intelligence Part II: Physical Measure and Complexity of Intelligence. ArXiv e-prints (Apr 2015)
  • [18] Özkural, E.: Ultimate intelligence part I: physical completeness and objectivity of induction. In: Artificial General Intelligence - 8th International Conference, AGI 2015, Berlin, Germany, July 22-25, 2015, Proceedings. pp. 131–141 (2015), http://dx.doi.org/10.1007/978-3-319-21365-1_14
  • [19] Ring, M., Orseau, L.: Delusion, survival, and intelligent agents. In: Artificial General Intelligence, pp. 11–20. Springer Berlin Heidelberg (2011)
  • [20] Schmidhuber, J.: Optimal ordered problem solver. Machine Learning 54, 211–256 (2004)
  • [21] Solomonoff, R.J.: A formal theory of inductive inference, part i. Information and Control 7(1), 1–22 (March 1964)
  • [22] Solomonoff, R.J.: Complexity-based induction systems: Comparisons and convergence theorems. IEEE Trans. on Information Theory IT-24(4), 422–432 (July 1978)
  • [23] Solomonoff, R.J.: Progress in incremental machine learning. Tech. Rep. IDSIA-16-03, IDSIA, Lugano, Switzerland (2003)
  • [24] Solomonoff, R.J.: Three kinds of probabilistic induction: Universal distributions and convergence theorems. The Computer Journal 51(5), 566–570 (2008)
  • [25] Tishby, N., Pereira, F.C., Bialek, W.: The information bottleneck method. ArXiv Physics e-prints (Apr 2000)
  • [26] Wallace, C.S., Dowe, D.L.: Minimum message length and kolmogorov complexity. The Computer Journal 42(4), 270–283 (1999), http://comjnl.oxfordjournals.org/content/42/4/270.abstract
  • [27] Wallace, C.S., Boulton, D.M.: An information measure for classification. Computer Journal 11(2), 185–194 (1968)