I Introduction
I-A Motivation
Artificial intelligence has progressed significantly over the last decade, achieving superhuman performance in challenging domains such as Atari video games and the board game Go [23, 32]. Unfortunately, for more complex and unconstrained environments (e.g., advanced real-world systems such as autonomous vehicles), results are more limited [26]. One major challenge here is the huge space of all possible strategies (policies) that the agent (i.e., robot or machine) can perform, making tractable solutions cumbersome to obtain naively. However, humans tend to face these complex domains with relative ease.
One explanation why humans perform well in complex tasks comes from cognitive neuroscience, proposing that general intelligence is linked to efficient compression, known as the efficient coding hypothesis [4, 5, 33]. This idea is not new but can be traced back to William of Occam, saying "If there are alternative explanations for a phenomenon, then, all other things being equal, we should select the simplest one", where the simple alternative is the alternative with the shortest explanation. This methodology is known as Occam's razor [20]. Formalisations of Occam's razor have been able to detect computational regularities in, for example, the decimals of π [40], and to automatically extract low-complexity physical laws directly from data [37]. In this paper, we are interested in whether a similar formalisation can automatically extract low-complexity policies, seen as a first step towards more tractable and intelligent behaviour for agents acting in complex environments.
I-B Contribution
The main contribution of this paper is to define a complexity measure for policies in deterministic finite automata (DFA) [15] using Kolmogorov complexity [20], and to construct tractable complexity-aware planning algorithms, based on this measure, with explicit tradeoffs between performance and complexity. More precisely, our contributions are threefold:
Firstly, we define a complexity measure for deterministic policies in finite-horizon DFA with rewards as outputs. This measure uses Kolmogorov complexity to evaluate how complex a policy is to execute. Kolmogorov complexity, which can be seen as a formalisation of Occam's razor, is a computational notion of complexity able to detect not only statistical regularities (commonly exploited in standard information theory) but also computational regularities.¹ Our key insight is that optimal policies typically possess computational regularities (such as reaching a goal state) apart from statistical ones, making Kolmogorov complexity an appealing complexity evaluator. (¹An example is an infinite computable sequence whose digits show no apparent statistical regularity.)
Secondly, we present a complexity-aware planning objective based on our complexity measure, yielding an explicit tradeoff between a policy's performance and complexity. We also prove that maximising this objective is nontrivial in the sense that dynamic programming [6] is infeasible.
Thirdly, we present two algorithms for obtaining low-complexity policies. The first algorithm, Complexity-guided Optimal Policy Search (COPS), finds a policy with low complexity among all optimal policies, following a two-step procedure. In the first step, the algorithm finds all optimal policies, without any complexity constraints. Dropping the complexity constraints, this step can be done using dynamic programming. The second step runs a uniform-cost search [29] over all optimal policies, guided by a complexity-driven cost, favouring low-complexity policies and enabling moderate search depths. The second algorithm, Stage-Complexity-Aware Program (SCAP), instead penalises policies locally for executing complex manoeuvres. This is done by partitioning the horizon into stages, with local complexity constraints over the stages instead of the full horizon, which enables dynamic programming over the stages. Finally, we evaluate our algorithms on a simple navigation task where a mobile robot tries to reach a certain goal state. Our algorithms yield low-complexity policies that concur with intuition.
I-C Related Work
The interest in complexity in control and learning has a long history, going back to Bellman's curse of dimensionality [6], Witsenhausen's counterexample [38, 39] and the general open problem of under what conditions LQG admits an optimal low-dimensional feedback controller [3], just to mention a few, and has lately been reviewed as an essential component of intelligent behaviour [30]. Recent attempts to find low-complexity policies can be divided into two categories. In the first category, the system itself is approximated by a low-complexity system (e.g., of smaller dimension), from which an approximately optimal solution can be obtained. Methods in this category include bisimulation [14, 7], PCA analysis [27, 21], and information-theoretic compression such as the information bottleneck method [1, 18]. In the second category, a low-complexity policy is instead obtained directly. Here, notable methods include policy distillation [31], VC-dimension constraints [17], concise finite-state machine plans [24, 25], low-memory policies through sparsity constraints [8], and information-theoretic approaches such as KL-regularisation [28, 36], mutual information regularisation with variations [34, 13, 35], and minimal specification complexity [12, 11]. Our work belongs to this second category and resembles [24, 25, 12, 11] the most, but differs since we consider Kolmogorov complexity.
Kolmogorov complexity has also been considered in the context of reinforcement learning as a tool for complexity-constrained inference [10, 2, 16] based on Solomonoff's theory of inductive inference [20]. We differ by focusing instead on constraining the computational complexity of the obtained policy itself, assuming the underlying system to be known.
I-D Outline
The remainder of the paper is organised as follows. Section II provides preliminaries together with the problem statement. Section III defines the complexity measure together with the complexity-aware planning objective, and proves that dynamic programming is infeasible. Section IV presents two algorithms for obtaining low-complexity policies, and Section V evaluates the algorithms on a mobile robot example. Finally, Section VI concludes the paper.
II Problem Formulation
II-A Turing Machines
We give a brief review of Turing machines following [20]. A Turing machine is a mathematical model of a computing device, manipulating symbols on an infinite list of cells (known as the tape), reading one symbol at a time on the tape with an access pointer (known as the head). Formally:
Definition 1
A Turing machine is a tuple $T = (A, Q, \delta)$ with: a finite set of tape symbols $A$ with alphabet $\Sigma \subset A$ and blank symbol $\square \in A \setminus \Sigma$; a finite set of states $Q$ with start state $q_0 \in Q$; a partial function² $\delta : A \times Q \rightharpoonup (A \cup \{L, R\}) \times Q$ called the transition function. (²We use the notation $h : X \rightharpoonup Y$ to denote a partial function from a set $X$ to a set $Y$, i.e., a function that is only defined on a subset of $X$.)
An execution of a Turing machine $T$ with input string³ $x \in \Sigma^*$ starts with $x$ on the tape (and blank symbols on both sides of $x$), $q_0$ as state, and head on the first symbol of $x$. It then transitions according to $\delta$, where $\delta(a, q) = (b, q')$ specifies what the currently scanned symbol $a$ should be replaced with, or that the head should move left (right) one step if $b = L$ ($b = R$), and $q'$ specifies the next state. This transition procedure is repeated until $\delta$ becomes undefined. In this case, $T$ halts with output equal to the maximal string in $\Sigma^*$ starting at the currently scanned cell, or the empty string if a blank symbol is scanned. For each Turing machine $T$, this input-output convention defines a partial function $\Sigma^* \rightharpoonup \Sigma^*$ (defined when $T$ halts) known as a partial computable function. (³Here, $\Sigma^*$ denotes the set of all finite strings from the alphabet $\Sigma$.)
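The execution convention above can be made concrete with a small interpreter. The following is a minimal sketch (not from the paper): the transition function is a Python dict, the machine halts when no transition is defined, and the output is the maximal non-blank string starting at the head. The example machine and all names are illustrative.

```python
BLANK = "_"

def run_turing_machine(delta, tape_input, start_state="q0", max_steps=10_000):
    # Execute the machine; return the output string, or None if it does not halt in time.
    tape = dict(enumerate(tape_input))   # sparse tape; missing cells are blank
    head, state = 0, start_state
    for _ in range(max_steps):
        symbol = tape.get(head, BLANK)
        if (state, symbol) not in delta:          # transition undefined: halt
            out, pos = [], head                   # output: maximal non-blank string from the head
            while tape.get(pos, BLANK) != BLANK:
                out.append(tape[pos])
                pos += 1
            return "".join(out)
        action, state = delta[(state, symbol)]
        if action == "L":
            head -= 1
        elif action == "R":
            head += 1
        else:
            tape[head] = action                   # write a tape symbol
    return None

# Illustrative machine: skip leading 1s, flip the first 0 to a 1, then halt.
delta = {
    ("q0", "1"): ("R", "q0"),
    ("q0", "0"): ("1", "qh"),
}
```

Running it on "110" halts with the head on the flipped symbol, so only the string from the head onwards is output, illustrating the output convention.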
An important special class of Turing machines is the class of universal Turing machines. Informally, a universal Turing machine is a Turing machine that can simulate any other Turing machine, and can therefore be seen as an idealised version of a modern computer. A fundamental result states that such machines exist [20], a fact utilised when defining the Kolmogorov complexity.
II-B Kolmogorov Complexity
In this section, we present needed theory concerning Kolmogorov complexity [20]. Informally, the Kolmogorov complexity of an object $x$ is the length of the smallest program on a computer that outputs $x$. If this length is much smaller than the length of $x$ itself, then we have compressed the information in $x$ to only its essential information. The Kolmogorov complexity can therefore be seen as a formalisation of Occam's razor, stripping away all the non-essential information. More formally, the computer is a universal Turing machine, and the object $x$ and the program $p$ are strings over a finite alphabet $\Sigma$. Towards a precise definition, we need the following notion:
Definition 2 ([20])
The complexity of a string $x \in \Sigma^*$ with respect to a partial computable function $f$ is defined as
$C_f(x) = \min\{\ell(p) : f(p) = x\},$
where $\ell(p)$ is the length of string $p$ (and $C_f(x) = \infty$ if no such $p$ exists). Here, $p$ is interpreted as the program input to $f$, and $C_f(x)$ is thus the length of the smallest program that, via $f$, describes $x$.
The following fundamental result, known as the invariance theorem, asserts that there is a partial computable function yielding descriptions shorter (i.e., more compact), up to a constant, than those of any other partial computable function:
Theorem 1 (Invariance theorem [20])
There exists a partial computable function $f_0$, constructed from a universal Turing machine, with the following additively optimal property: for any other partial computable function $f$, there exists a constant $c_f$ (dependent only on $f$) such that for all $x \in \Sigma^*$: $C_{f_0}(x) \le C_f(x) + c_f$.
This function $f_0$ serves as our computer when defining the Kolmogorov complexity:
Definition 3 (Kolmogorov Complexity [20])
Let $f_0$ be as in Theorem 1. The Kolmogorov complexity of $x \in \Sigma^*$ is $C(x) := C_{f_0}(x)$.
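Since $C$ is defined via minimal programs, any off-the-shelf compressor yields a pointwise upper bound on it up to an additive constant. As a rough illustration (a crude stand-in, not the estimation method used in this paper), one can compare compressed bit-lengths of a regular and an irregular string:

```python
import random
import zlib

def compressed_length_bits(s: str) -> int:
    # Bit-length of the zlib-compressed string: a crude upper-bound proxy for C(s).
    return 8 * len(zlib.compress(s.encode("ascii"), 9))

rng = random.Random(0)
regular = "01" * 500                                        # 2-periodic, highly compressible
irregular = "".join(rng.choice("01") for _ in range(1000))  # pseudo-random, nearly incompressible
```

Both strings have the same length, yet the periodic one compresses to far fewer bits, mirroring the intuition that regular objects have low complexity.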
The Kolmogorov complexity is robust in the sense that it depends only benignly on the choice of universal Turing machine; see [20] for details. The Kolmogorov complexity is, however, not computable in general, but can only be over-approximated pointwise [20]. The method we use in this paper to approximate $C$ is from [40] and based on algorithmic probability [20]; see the Appendix for a summary. This estimation method is used in the simulations (Section V), while the underlying theoretical framework we develop (Sections III and IV) is based on the exact Kolmogorov complexity.⁴ (⁴We stress that other estimation methods can also be used, e.g., Lempel-Ziv compression [19] (used in, e.g., [9]), since the theoretical framework is independent of the particular estimation method chosen. We picked [40] due to its more direct connection with Kolmogorov complexity, whereas Lempel-Ziv compression relies on classical information theory.)
II-C Deterministic Finite Automata
We consider finite-horizon planning for discrete systems formalised as time-varying DFA [15] with actions as inputs and rewards as outputs (i.e., time-varying Mealy machines [22]), of the form:⁵ (⁵The time-varying feature of the DFA can be lifted by including the time in the state. We keep the current notation for easier readability.)

(1)  $s_{t+1} = f_t(s_t, a_t), \quad r_t = g_t(s_t, a_t), \quad t = 0, 1, \ldots, N-1,$

where $S$ is the finite set of states, $A$ the finite action set, $N$ the horizon length, and $f_t : S \times A \to S$, $g_t : S \times A \to \mathbb{R}$ the transition function specifying the next state $s_{t+1}$ at time $t+1$ given current state $s_t$ and action $a_t$, and the corresponding received reward $r_t$. The system stops at time $N$. Given a start state $s_0$, the objective is to find a (deterministic) policy $\pi = (\pi_0, \ldots, \pi_{N-1})$, $\pi_t : S \to A$, maximising the total reward $\sum_{t=0}^{N-1} r_t$ subject to the transition dynamics (1). A policy which does this for every start state is called an optimal policy.
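As an illustration of the model, a policy can be rolled out on a system of the form (1) by iterating the transition and reward functions. The sketch below uses a toy chain system and illustrative names; it is not from the paper.

```python
def rollout(f, g, policy, s0, horizon):
    # Simulate a deterministic policy pi(t, s) -> a; return (action sequence, total reward).
    s, actions, total = s0, [], 0.0
    for t in range(horizon):
        a = policy(t, s)
        total += g(t, s, a)       # reward received for (s_t, a_t)
        actions.append(a)
        s = f(t, s, a)            # transition to s_{t+1}
    return actions, total

# Toy chain 0..3: action 1 moves right (capped at 3); reward 1 for stepping onto state 3.
f = lambda t, s, a: min(s + a, 3)
g = lambda t, s, a: 1.0 if (s + a == 3 and s != 3) else 0.0
policy = lambda t, s: 1 if s < 3 else 0

actions, total = rollout(f, g, policy, s0=0, horizon=5)
```

Here the policy walks to state 3 in three steps and then stays, collecting the single entry reward.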
II-D Problem Statement
We now formalise the problem statement. In this paper, we answer the following questions:
1) Given a system as in (1), how can one define a complexity measure capturing how complex a policy $\pi$ is to execute from a start state $s_0$?
2) How can one construct a planning objective $J$ with a formal tradeoff between total reward and complexity for a policy $\pi$?
3) How can one construct algorithms that maximise $J$?
4) In particular, can dynamic programming be used to maximise $J$?
Questions 1, 2 and 4 are answered in Section III, while Section IV answers Question 3, together with the numerical evaluations in Section V.
III Complexity-Aware Planning
This section introduces a complexity measure in Section III-A and then sets up an appropriate complexity-aware planning objective in Section III-B. Finally, we prove in Section III-C that dynamic programming cannot be used to maximise this objective. Throughout this section, we fix a system as in (1).
III-A Execution Complexity
We start by defining our complexity measure for policies, capturing how complex it is to execute a policy $\pi$ from a start state $s_0$. Towards this, note that, given a start state $s_0$ and a policy $\pi$, we get a sequence of actions $a_{0:N-1} = (a_0, \ldots, a_{N-1})$ from time $t = 0$ to time $t = N-1$, generated by $\pi$ and the dynamics (1). We denote this action sequence by $a_{0:N-1}(s_0, \pi)$ and say that $\pi$ has low execution complexity if $C(a_{0:N-1}(s_0, \pi))$ is low. Intuitively, a policy with low execution complexity has a small program that can execute it.
Definition 4 (Execution Complexity)
Given a start state $s_0$, the execution complexity of a policy $\pi$ is $C_{s_0}(\pi) := C(a_{0:N-1}(s_0, \pi))$.
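Definition 4 suggests a simple computational sketch: generate the action sequence induced by a policy, then evaluate its complexity. Below, the exact (uncomputable) $C$ is replaced by a compression proxy, and the toy dynamics and policies are illustrative assumptions, not from the paper.

```python
import random
import zlib

def action_sequence(f, policy, s0, horizon):
    # Generate a_{0:N-1}(s0, pi) from the dynamics and the policy.
    s, seq = s0, []
    for t in range(horizon):
        a = policy(t, s)
        seq.append(a)
        s = f(t, s, a)
    return seq

def execution_complexity(seq):
    # Proxy for C(a_{0:N-1}): compressed bit-length of the serialised action sequence.
    return 8 * len(zlib.compress("".join(map(str, seq)).encode("ascii"), 9))

f = lambda t, s, a: (s + a) % 10                     # toy wrap-around dynamics on 10 states
constant = lambda t, s: 1                            # always the same action: trivially regular
rng = random.Random(1)
erratic = lambda t, s: rng.choice([0, 1, 2, 3, 4])   # deterministic run, but irregular actions

simple_seq = action_sequence(f, constant, 0, 200)
messy_seq = action_sequence(f, erratic, 0, 200)
```

The constant policy induces a highly compressible action sequence, so its estimated execution complexity is far below that of the erratic one.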
IiiB Complexityaware Planning Objective
The execution complexity can be used to find policies with high total reward while keeping a low complexity. Formally, we want to find an action sequence $a_{0:N-1}$ maximising the objective⁶

(2)  $J(a_{0:N-1}) = \sum_{t=0}^{N-1} r_t - \alpha\, C(a_{0:N-1}),$

subject to the dynamics (1) and start state $s_0$. Here, $\alpha > 0$ determines how much we penalise complexity relative to obtaining a high total reward. (⁶For convenience, we maximise directly over action sequences instead of policies, since the horizon is typically lower than the number of states. With this convention, $C(a_{0:N-1})$ in (2) agrees with the execution complexity in Definition 4.)
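For very small horizons, objective (2) can be maximised by exhaustive search, which is useful as a sanity check. A hedged sketch, with the exact complexity again replaced by a compression proxy and all names illustrative:

```python
import itertools
import zlib

def c_hat(seq):
    # Compression proxy standing in for the (uncomputable) Kolmogorov complexity.
    return 8 * len(zlib.compress("".join(map(str, seq)).encode("ascii"), 9))

def J(seq, f, g, s0, alpha):
    # Objective (2): total reward minus alpha times the (estimated) complexity.
    s, total = s0, 0.0
    for t, a in enumerate(seq):
        total += g(t, s, a)
        s = f(t, s, a)
    return total - alpha * c_hat(seq)

def brute_force(f, g, s0, actions, horizon, alpha):
    # Exhaustive search over all |A|^N sequences: only viable for tiny horizons.
    return max(itertools.product(actions, repeat=horizon),
               key=lambda seq: J(seq, f, g, s0, alpha))

# Toy chain 0..3: action 1 moves right (capped at 3); reward while at or entering 3.
f = lambda t, s, a: min(s + a, 3)
g = lambda t, s, a: 1.0 if s + a >= 3 else 0.0
best = brute_force(f, g, 0, (0, 1), 5, alpha=0.001)
```

With a small $\alpha$, the reward term dominates, so the maximiser must reach the rewarding state as early as possible; the complexity term then only breaks ties among reward-optimal sequences.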
Example 1 (Simple optimal policies)
For $\alpha$ sufficiently low, we obtain, for a start state $s_0$, an interesting subset of all optimal policies (i.e., policies maximising the total reward): those with the lowest complexity, which we here call simple optimal policies. Note that the simple optimal policies can also be found via the objective

(3)  $\min\, C(a_{0:N-1})$ over all optimal action sequences $a_{0:N-1}$,

subject to the dynamics (1) and start state $s_0$. See the Appendix for a proof.
IiiC Dynamic Programming is Infeasible
We are interested in algorithms obtaining the maximum of (2). A standard method for finding an action sequence maximising the total reward is dynamic programming [6]. Hence, one may ask if (2) can be solved using dynamic programming. The answer is negative and follows from the following result.
Proposition 1
For sufficiently large $N$, we cannot decompose $C$ in the form $C(a_{0:N-1}) = D_0(a_{0:N-1})$, where $D_t$ are functions given by
$D_t(a_{t:N-1}) = h_t\big(a_t, D_{t+1}(a_{t+1:N-1})\big)$
for some functions $h_t$. That is, we cannot decompose $C$ recursively into an immediate complexity plus a future complexity. In particular, we cannot obtain $C$ by backward induction.⁷ (⁷Note that $h_t$ and $D_t$ can be any functions, not necessarily computable. If one restricts these functions to be computable, then the result follows easily from the fact that $C$ itself is not computable.)
See the Appendix for the proof. Due to Proposition 1, we cannot naively apply dynamic programming to solve (2) (for large enough $N$). More precisely, the objective function $J(a_{0:N-1})$ in (2), subject to the dynamics (1) and start state $s_0$, cannot be decomposed in the form $J(a_{0:N-1}) = V_0(a_{0:N-1})$, where $V_t$ are given recursively by $V_t(a_{t:N-1}) = w_t\big(a_t, V_{t+1}(a_{t+1:N-1})\big)$ for some functions $w_t$. Indeed, if such a decomposition existed, it would contradict Proposition 1. Thus, $J$ cannot be maximised using dynamic programming. Fortunately, there are methods that circumvent this issue, presented next.
IV Complexity-Aware Planning Algorithms
This section answers Question 3 in the problem statement, presenting two algorithms that circumvent the dynamic programming issue from Section III-C. The first algorithm, COPS, restricts the task to finding simple optimal policies, while the second algorithm, SCAP, modifies the objective to focus on local (stage-wise) complexity.
IV-A Complexity-guided Optimal Policy Search (COPS)
COPS seeks a simple optimal policy as in Example 1 by a two-step procedure. Step 1 conducts ordinary dynamic programming (maximising the total reward). This yields a mapping $A^* : \{0, \ldots, N-1\} \times S \to 2^A$ such that $A^*(t, s)$ are the optimal actions at time $t$ and state $s$.⁸ (⁸Here, $2^A$ is the power set of $A$, i.e., the set of all subsets of $A$.) In step 2, a uniform-cost search [29] is executed to find an optimal action sequence with low execution complexity. A node in this search is of the form $(t, s, a_{0:t-1})$, where $t$ and $s$ are the current time and state, and $a_{0:t-1}$ is the sequence of previous actions taken to arrive at state $s$ at time $t$. If a node is not a terminal node (i.e., $t < N$), its children are given by all $(t+1, s', a_{0:t})$ such that $a_t \in A^*(t, s)$ and $s' = f_t(s, a_t)$, i.e., we expand only over optimal actions. The cost of a node is set to $C(a_{0:t-1})$. The heuristic intuition behind the cost is that a low-complexity sequence $a_{0:N-1}$ should be more likely to have low-complexity subsequences $a_{0:t-1}$; thus, the cost focuses the search on low-complexity sequences, enabling moderate search depths. The uniform-cost search is conducted by iteratively generating the children of the node with the lowest cost, starting from the root node $(0, s_0, \emptyset)$. We terminate the search when a terminal node has the lowest cost and return its action sequence $a_{0:N-1}$. The algorithm is summarised in Algorithm 1. The termination is motivated by the following result:
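Step 2 of COPS can be sketched as a standard uniform-cost search over optimal actions, with a compression proxy as the complexity-driven cost. The grid example and all names below are illustrative assumptions, not the paper's implementation.

```python
import heapq
import itertools
import zlib

def c_hat(seq):
    # Compression proxy for the complexity of an action-sequence prefix.
    return 8 * len(zlib.compress("".join(seq).encode("ascii"), 9))

def cops(optimal_actions, f, s0, horizon):
    # Uniform-cost search over optimal actions only (step 2 of COPS).
    # optimal_actions(t, s) -> optimal actions at (t, s), from a prior DP pass.
    tie = itertools.count()                 # tie-breaker so the heap never compares states
    frontier = [(0, next(tie), 0, s0, ())]  # (cost, tie, time, state, action prefix)
    while frontier:
        cost, _, t, s, prefix = heapq.heappop(frontier)
        if t == horizon:                    # cheapest node is terminal: done
            return list(prefix)
        for a in optimal_actions(t, s):
            child = prefix + (a,)
            heapq.heappush(frontier, (c_hat(child), next(tie), t + 1, f(t, s, a), child))
    return None

# Toy grid: reach (3, 3) from (0, 0) in 6 steps; every order of 3 R's and 3 D's is optimal.
def f_grid(t, s, a):
    x, y = s
    return (x + 1, y) if a == "R" else (x, y + 1)

def optimal_actions(t, s):
    x, y = s
    return [a for a, ok in (("R", x < 3), ("D", y < 3)) if ok]

plan = cops(optimal_actions, f_grid, (0, 0), 6)
```

All twenty reward-optimal orderings are candidates; the cost steers the search towards the more regular ones first.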
Proposition 2
Assume that the complexity is non-decreasing along the expanded sequences, i.e., $C(a_{0:t-1}) \le C(a_{0:t})$. Then Algorithm 1 returns an optimal action sequence with the lowest execution complexity.
We stress that $C(a_{0:t-1}) \le C(a_{0:t})$ is not always true, since adding $a_t$ to $a_{0:t-1}$ may result in higher regularity than $a_{0:t-1}$ has alone. However, the inequality is more anticipated, since $C$ is on average an increasing function with respect to sequence length [20], motivating the assumption in Proposition 2 and why the algorithm can work well in practice. Proposition 2 follows readily from uniform-cost search properties; see the Appendix.
Efficient search in Algorithm 1 is possible for moderate horizon lengths, as demonstrated numerically in Section V-B. However, for longer horizons, the procedure becomes intractable. The next algorithm works for longer horizons by focusing on local (stage-wise) complexity instead of the complexity of the whole action sequence. COPS can be seen as a special case of this latter algorithm with only one stage and the complexity weight in (4), defined below, sufficiently low.
IV-B Stage-Complexity-Aware Program (SCAP)
SCAP modifies the objective to focus only on local (stage-wise) complexity. More precisely, the modified objective is set to

(4)  $J_{\mathrm{SCAP}}(a_{0:N-1}) = \sum_{k=1}^{m} \Big( \sum_{t=(k-1)\ell}^{k\ell-1} r_t - \beta\, C\big(a_{(k-1)\ell:k\ell-1}\big) \Big).$

Here, $m$ is the number of stages in the horizon partition, each with stage length $\ell = N/m$, and $a_{(k-1)\ell:k\ell-1}$ is the executed action sequence at stage $k$, penalised by its execution complexity with weight $\beta > 0$.
The key insight is that the modified objective enables dynamic programming over the stages. More precisely, initialise the value function as $V_m(s) = 0$ for all $s \in S$ and obtain the value function for the remaining stages using the backward recursion

(5)  $V_{k-1}(s) = \max_{a \in A^\ell} \Big( \sum_{t=(k-1)\ell}^{k\ell-1} r_t - \beta\, C(a) + V_k\big(s'(s, a)\big) \Big).$

Here, $s'(s, a)$ denotes the state one arrives at by sequentially applying the action sequence $a \in A^\ell$ starting from $s$. Provided $\ell$ is small enough, the maximisation in (5) can be done by going through all $a \in A^\ell$.
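The backward recursion (5) can be sketched directly by enumerating all stage sequences. The toy chain system, the time-invariant dynamics and all names below are illustrative assumptions, and the exact complexity is again replaced by a compression proxy.

```python
import itertools
import zlib

def c_hat(seq):
    return 8 * len(zlib.compress("".join(map(str, seq)).encode("ascii"), 9))

def scap_soft(states, actions, f, g, stages, stage_len, beta):
    # Backward recursion (5) over stages; f, g are time-invariant here for brevity.
    V = {s: 0.0 for s in states}                  # terminal value: V_m = 0
    for _ in range(stages):
        newV = {}
        for s in states:
            best = float("-inf")
            for seq in itertools.product(actions, repeat=stage_len):
                st, reward = s, 0.0
                for a in seq:                     # simulate the stage from s
                    reward += g(st, a)
                    st = f(st, a)
                best = max(best, reward - beta * c_hat(seq) + V[st])
            newV[s] = best
        V = newV
    return V  # V[s]: stage-complexity-aware value from start state s

# Toy chain 0..4: action 1 moves right (capped), reward 1 per step spent at state 4.
f = lambda s, a: min(s + a, 4)
g = lambda s, a: 1.0 if s == 4 else 0.0
V = scap_soft(range(5), (0, 1), f, g, stages=2, stage_len=3, beta=0.001)
```

Starting at the rewarding state yields the highest value, while more distant start states still obtain positive value since the small $\beta$ penalty does not outweigh the reachable reward.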
IV-B1 Hard-constrained version
An alternative to (4) is the hard-constrained objective

(6)  maximise $\sum_{t=0}^{N-1} r_t$ subject to $C\big(a_{(k-1)\ell:k\ell-1}\big) \le \bar{C}_k$, $k = 1, \ldots, m$,

for some constants $\bar{C}_k > 0$. In this case, we conduct dynamic programming with $V_m(s) = 0$ for all $s \in S$ and backward recursion

(7)  $V_{k-1}(s) = \max_{a \in A_k} \Big( \sum_{t=(k-1)\ell}^{k\ell-1} r_t + V_k\big(s'(s, a)\big) \Big), \quad A_k = \{a \in A^\ell : C(a) \le \bar{C}_k\}.$
Using (7) enables more efficient computations than (5), since the admissible sets $A_k$ are typically only a fraction of $A^\ell$. Moreover, $A_k$ can be computed beforehand by going through all sequences in $A^\ell$ (tractable for small $\ell$), or sought using a uniform-cost search (for moderate $\ell$; see the Appendix for details).
IV-B2 Action sequence extraction
Once the value function has been computed using (5) or (7), an action sequence maximising (4) or (6) can be readily obtained, for a given start state $s_0$, by forward simulation over the stages (see the Appendix for details).
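The forward extraction over stages can be sketched as follows; the value functions, the stage simulator and the tiny two-state example are illustrative assumptions, not from the paper.

```python
def extract_plan(V, A_k, sim, s0, stages):
    # Forward simulation over stages: at stage k, pick the admissible sequence
    # maximising stage reward plus the next stage's value V[k+1].
    # sim(s, seq) -> (stage reward, next state).
    s, plan = s0, []
    for k in range(stages):
        best_seq, best_val = None, float("-inf")
        for seq in A_k:
            reward, nxt = sim(s, seq)
            val = reward + V[k + 1][nxt]
            if val > best_val:
                best_seq, best_val = seq, val
        _, s = sim(s, best_seq)     # commit the best stage sequence
        plan.extend(best_seq)
    return plan

# Tiny two-state example: the action sets the state; being in state 1 pays 1.
def sim(s, seq):
    reward = 0.0
    for a in seq:
        s = a
        reward += 1.0 if s == 1 else 0.0
    return reward, s

V = {1: {0: 1.0, 1: 1.0}, 2: {0: 0.0, 1: 0.0}}   # assumed precomputed values for stages 1..2
plan = extract_plan(V, [(0,), (1,)], sim, s0=0, stages=2)
```

Greedily following the value function stage by stage recovers the plan that moves to (and stays in) the rewarding state.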
Algorithm 2 summarises the procedure for the hard-constrained version.
V Numerical Evaluations
This section presents case studies evaluating the complexity-aware planning algorithms proposed in Section IV. We describe the test environment in Section V-A, evaluate COPS in Section V-B and then SCAP in Section V-C. The Kolmogorov complexity is estimated using the method from [40].
V-A Test Environment
As test environment, we consider a system as in (1) where a robot (the agent) moves around in a room in discrete steps. More precisely, the room is a square with $n$ coordinates in each direction, forming the state space $S = \{1, \ldots, n\}^2$. At each time, the robot can move to any of the four neighbouring spots or stay, i.e., the actions are $A = \{\text{up}, \text{down}, \text{left}, \text{right}, \text{stay}\}$. If the robot executes an action that would move it outside the room, then it stays at the same spot. The reward is zero everywhere except for state-action pairs that take the robot to a given goal state $s_{\mathrm{goal}}$, which yield a positive reward. We set the goal state to one of the corners of the room. The robot starts at a fixed start state, and the horizon is set so that the agent precisely receives a reward for reaching $s_{\mathrm{goal}}$ if acting optimally.
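The test environment above can be sketched as follows, assuming an illustrative room size, goal position and unit reward (the paper's exact values are not reproduced here):

```python
def make_room(n, goal):
    # n x n grid; five actions; moves that would leave the room are clipped.
    moves = {"U": (0, -1), "D": (0, 1), "L": (-1, 0), "R": (1, 0), "S": (0, 0)}

    def f(s, a):
        x, y = s
        dx, dy = moves[a]
        nx, ny = x + dx, y + dy
        # Stay at the same spot if the move would leave the room.
        return (nx, ny) if 0 <= nx < n and 0 <= ny < n else s

    def g(s, a):
        # Reward for state-action pairs taking the robot to the goal.
        return 1.0 if f(s, a) == goal else 0.0

    return f, g

f, g = make_room(5, goal=(4, 4))
```

Placing the goal in a corner means wall clipping matters: moving into the wall at the goal corner simply keeps the robot there, which is exactly the wall dynamics exploited in Setup 1 below.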
V-B Evaluation 1
V-B1 Setup
We first consider COPS. To see how the output and computation time change with problem size, we study the test environment with three different room sizes, called the small, medium and large room, respectively. We run the search until it has found 30 low-complexity action sequences, by continuing to expand the node tree. Concretely, this is done by modifying the if-statement in Algorithm 1 to a condition appending every found action sequence to a list until the list has 30 sequences, and then returning the list.
V-B2 Result
The running times for the three rooms are 21 seconds, 16 minutes and 15 minutes, respectively. The trajectories corresponding to the found action sequences are given in Fig. 1(a), 1(b) and 1(c), with execution complexities in Tables I(a), I(b) and I(c).
As a first example, consider the trajectory of the first action sequence found in the small room, labelled 1 in Fig. 1(a). Here, the robot goes right until it reaches the upper right corner and then goes down to the goal. That is, it exploits executing the same actions in batches to lower complexity, reaching an execution complexity of 47.30, as seen in Table I(a). We also note that this is the same complexity as for action sequences 2-4 in Fig. 1(a). That 1 and 3 have the same complexity is due to symmetry, and the same is true for 2 and 4. However, why, e.g., 1 and 2 have the same complexity (apart from the intuition that they both look like low-complexity executions) is unclear. It could be an inherited feature of the Kolmogorov complexity, or a bias from the estimation method; see the Appendix for a discussion of the latter.
Looking now at all sequences in all three cases, we see that it is in general common to find sequences executing batches of the same actions and then alternating between such batches to lower execution complexity (1-8 in Fig. 1(a), and 17-28 in Fig. 1(b)). Another typical feature to lower complexity is to exploit 2-periodic alternation between going down and going right (9-30 in Fig. 1(a), 1-16 and 29-30 in Fig. 1(b), and 1-30 in Fig. 1(c)). Also, which type of execution has the lowest complexity varies with the room size, i.e., is horizon-dependent. Once again, this could come from the Kolmogorov complexity itself, where varying sequence length may facilitate different compression techniques, but could also originate from the estimation method.
We also note in Tables I(a) to I(c) that the algorithm mainly finds sequences in increasing complexity order, as anticipated by Proposition 2. Indeed, only the medium room case causes some unordered sequences, where notably the search finds the higher-complexity sequences 21-28 with execution complexity 62.57 before reaching sequences 29-30 with complexity 39.75. This detour, caused by a violation of the assumption in Proposition 2, explains the long search time for the medium room, longer than for the large room although the latter has a larger horizon.
Finally, the complexity-guided cost enables moderate horizon lengths. In particular, for the large room, the number of possible optimal action sequences remaining after dynamic programming is vast. However, the search iterates over only a small fraction of the corresponding nodes to find the first low-complexity sequence, a significant decrease due to the complexity-guiding cost.
V-C Evaluation 2
We continue with SCAP, considering the hard-constrained version given by Algorithm 2, calculating the admissible sets $A_k$ by going through all sequences in $A^\ell$. The aim of the investigation is mainly to see how the complexity-performance tradeoff affects the behaviour. Towards this, we consider two similar setups, both considering the large room, but where the first setup has the goal state $s_{\mathrm{goal}}$ in the lower right corner of the room and the second setup has $s_{\mathrm{goal}}$ in the middle of the room.
V-C1 Setup 1: $s_{\mathrm{goal}}$ in the corner
In Setup 1, we penalise each stage equally with a common complexity limit $\bar{C} = \bar{C}_k$ and stage length $\ell$, and consider the large room from Section V-B with a slightly larger horizon to be compatible with the stage partition (i.e., $N = m\ell$), setting $m$ accordingly. The change in tradeoff comes into play by varying $\bar{C}$ over a range of values up to 30. These values are picked since they illustrate the tradeoff well, allowing almost no action sequences for the lowest limit while many for the highest.
V-C2 Result
The value functions are plotted in Fig. 2 for the different limits, where the unconstrained value function, obtained by ordinary dynamic programming maximising the total reward, is included as a reference. Blue (red) colour encodes low (high) value.
For the lowest limit, the room is divided into squares of side length $\ell$ due to the stage partition and the low complexity limit. More precisely, at each stage $k$, $A_k$ consists of only five action sequences: the ones executing only one action throughout the stage. This constraint implies that the robot sometimes has to wait at the walls, yielding the discontinuous jumps in the value function seen in Fig. 2(a). Consider, for example, the two trajectories in Fig. 2(a) generated by Algorithm 2, coloured on a green-to-yellow scale, where states visited more often are more yellow. The robot starting at A reaches $s_{\mathrm{goal}}$ without waiting. The robot starting at B goes right two stages and then down two stages, but hits the wall during its second going-right stage before the stage is done. This causes waiting at the wall and thus a longer time to reach $s_{\mathrm{goal}}$, even though B is actually closer to $s_{\mathrm{goal}}$ than A.
For the intermediate limit, the squares in Fig. 2(b) are greatly diminished compared to Fig. 2(a), due to the additional possible action sequences in $A_k$. Here, a robot starting at B no longer waits at the wall. We note also that the squares are more evident near the goal state, since the robot there has less time to adapt and avoid wall waiting.
Finally, for the largest limit $\bar{C} = 30$, $A_k$ has increased so much that the difference from the unconstrained case is minor, and the trajectories from A and B, already being optimal at the intermediate limit, have not changed.
V-C3 Setup 2: $s_{\mathrm{goal}}$ in the middle
Setup 2 is identical to Setup 1, except that $s_{\mathrm{goal}}$ is placed in the middle of the room. Thus, the robot can no longer rely heavily on the wall dynamics as in Setup 1. This notably changes the outcome.
V-C4 Result
Similarly to Setup 1, we plot the value functions for the different limits in Fig. 3.
For the lowest limit, the value function has higher values in a grid-like pattern, with the highest value in the middle of the room at $s_{\mathrm{goal}}$ and high values at points a multiple of $\ell$ away from $s_{\mathrm{goal}}$. At all these points, the robot can (using the five sequences in $A_k$) reach and stay at $s_{\mathrm{goal}}$, resulting in a high total reward. Between those points are also states with higher value. Here, the robot can reach but not stay at $s_{\mathrm{goal}}$; instead, the robot oscillates back and forth, crossing $s_{\mathrm{goal}}$ multiple times. This is seen for the robot starting at A in Fig. 3(a). From the remaining states, the robot can never reach the goal state, due to the heavy complexity constraint. This is the case for the robot starting and staying at B. Thus, for low complexity limits, the robot may never reach the intended objective.
As we increase the limit, the grid-like pattern expands to groups of states due to the increased set of admissible action sequences. A robot starting at A now reaches and stays at $s_{\mathrm{goal}}$, as can be seen in Fig. 3(b). The robot starting at B also reaches and stays at $s_{\mathrm{goal}}$, but takes a detour to stay within the local complexity limit, reaching the upper wall and then going down to $s_{\mathrm{goal}}$. Thus, for low complexity limits, the robot may execute an action sequence which is locally of low complexity, but globally quite complex.
Finally, for the largest limit $\bar{C} = 30$, the value function is similar to the unconstrained case, except that the stage length together with the complexity constraint still partitions the space into entities of length $\ell$, similar to Fig. 2(a) (but in this case into diamond-like shapes). The robots starting at A and B now reach the goal quickly under the high complexity limit.
VI Conclusion
In this paper, we have considered complexity-aware planning for DFA with rewards as outputs. We first defined a complexity measure for deterministic policies based on Kolmogorov complexity. Kolmogorov complexity is used since it can detect computational regularities, a typical feature of optimal policies.
We then introduced a complexity-aware planning objective based on our complexity measure, yielding an explicit tradeoff between a policy's performance and complexity. It was proven that maximising this objective is nontrivial in the sense that dynamic programming is infeasible.
We presented two algorithms for obtaining low-complexity policies, COPS and SCAP. COPS finds a policy with low complexity among all optimal policies, following a two-step procedure. In the first step, the algorithm finds all optimal policies, without any complexity constraints; dropping the complexity constraints, this step can be done using dynamic programming. The second step runs a uniform-cost search over all optimal policies, guided by a complexity-driven cost favouring low-complexity policies. SCAP instead modifies the objective, penalising policies locally for executing complex manoeuvres. This is done by partitioning the horizon into stages and enforcing local complexity constraints over these stages, where the partition enables dynamic programming. We illustrated and evaluated our algorithms on a simple navigation task where a mobile robot tries to reach a certain goal state. Our algorithms yield low-complexity policies that concur with intuition.
Future work includes comparisons with other estimation methods for the Kolmogorov complexity, and investigating how uncertainty (e.g., stochasticity) and feedback (e.g., receding horizon) can be incorporated into the existing framework. Finally, another challenge is to tractably maximise the complexity-aware planning objective in the general case.
References
 [1] (2019) State abstraction as compression in apprenticeship learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3134–3142. Cited by: §IC.
 [2] (2017) Universal reinforcement learning algorithms: survey and experiments. arXiv preprint arXiv:1705.10557. Cited by: §IC.
 [3] (1970) Introduction to stochastic control theory. Mathematics in science and engineering, Vol. 70, Academic Press (eng). External Links: ISBN 0120656507 Cited by: §IC.
 [4] (1954) Some informational aspects of visual perception.. Psychological review 61 (3), pp. 183. Cited by: §IA.
 [5] (1961) Possible principles underlying the transformation of sensory messages. Sensory communication 1, pp. 217–234. Cited by: §IA.
 [6] (1957) Dynamic programming. Princeton University Press. Cited by: §IB, §IC, §IIIC.
 [7] (2020) Learning discrete state abstractions with deep variational inference. arXiv preprint arXiv:2003.04300. Cited by: §IC.
 [8] (2021) Learning to actively reduce memory requirements for robot control tasks. In Proceedings of the 3rd Conference on Learning for Dynamics and Control, pp. 125–137. External Links: Link Cited by: §IC.
 [9] (2016) Factors of collective intelligence: how smart are agent collectives?. In Proceedings of the Twentysecond European Conference on Artificial Intelligence, pp. 542–550. Cited by: §IC, footnote 4.
 [10] (2019) A strongly asymptotically optimal agent in general environments. arXiv preprint arXiv:1903.01021. Cited by: §IC.
 [11] (2008) A software tool for hybrid control. IEEE Robotics Automation Magazine 15 (1), pp. 87–95. Cited by: §IC.
 [12] (2003) Feedback can reduce the specification complexity of motor programs. IEEE Transactions on Automatic Control 48 (2), pp. 213–223. Cited by: §IC.
 [13] (2016) Minimuminformation LQG control part i: memoryless controllers. In 2016 IEEE 55th Conference on Decision and Control (CDC), pp. 5610–5616. Cited by: §IC.
 [14] (2011) Approximate bisimulation: a bridge between computer science and control theory. European Journal of Control 17 (56), pp. 568–578. Cited by: §IC.
 [15] (2006) Introduction to automata theory, languages, and computation (3rd edition). Addison-Wesley. External Links: ISBN 0321455363 Cited by: §IB, §IIC.
 [16] (2004) Universal artificial intelligence: sequential decisions based on algorithmic probability. Springer Science & Business Media. Cited by: §IC.
 [17] (2000) Approximate planning in large POMDPs via reusable trajectories. In Advances in Neural Information Processing Systems, pp. 1001–1007. Cited by: §IC.
 [18] (2020) An information-theoretic approach for path planning in agents with computational constraints. arXiv preprint arXiv:2005.09611. Cited by: §IC.
 [19] (1976) On the complexity of finite sequences. IEEE Transactions on Information Theory 22 (1), pp. 75–81. Cited by: footnote 4.
 [20] (2020) An introduction to Kolmogorov complexity and its applications. 4th edition, Springer. Cited by: §IA, §IB, §IC, §IIA, §IIA, §IIB, §IIB, §IVA, §VIIA2, §VIIC1, Definition 2, Definition 3, Theorem 1, footnote 10.
 [21] (2017) On exploiting spectral properties for solving MDP with large state space. In 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1213–1219. Cited by: §IC.
 [22] (1955) A method for synthesizing sequential circuits. The Bell System Technical Journal 34 (5), pp. 1045–1079. External Links: Document Cited by: §IIC.
 [23] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §IA.
 [24] (2017) Concise planning and filtering: hardness and algorithms. IEEE Transactions on Automation Science and Engineering 14 (4), pp. 1666–1681. External Links: Document Cited by: §IC.
 [25] (2021) Algorithmic design for embodied intelligence in synthetic cells. IEEE Transactions on Automation Science and Engineering 18 (3), pp. 864–875. External Links: Document Cited by: §IC.
 [26] (2018) A tour of reinforcement learning: the view from continuous control. arXiv preprint arXiv:1806.09460. Cited by: §IA.
 [27] (2003) Exponential family PCA for belief compression in POMDPs. In Advances in Neural Information Processing Systems, pp. 1667–1674. Cited by: §IC.
 [28] (2012) Trading value and information in MDPs. In Decision Making with Imperfect Decision Makers, pp. 57–74. Cited by: §IC.
 [29] (2009) Artificial intelligence: a modern approach. 3rd edition, Prentice Hall Press, USA. External Links: ISBN 0136042597 Cited by: §IB, §IVA, Lemma 2.
 [30] (2016) Rationality and intelligence: a brief update. In Fundamental issues of artificial intelligence, pp. 7–28. Cited by: §IC.
 [31] (2015) Policy distillation. arXiv preprint arXiv:1511.06295. Cited by: §IC.

 [32] (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489. Cited by: §IA.
 [33] (2018) Efficient coding explains the universal law of generalization in human perception. Science 360 (6389), pp. 652–656. Cited by: §IA.
 [34] (2017) LQG control with minimum directed information: semidefinite programming approach. IEEE Transactions on Automatic Control 63 (1), pp. 37–52. Cited by: §IC.

 [35] (2017) Transfer-entropy-regularized Markov decision processes. arXiv preprint arXiv:1708.09096. Cited by: §IC.
 [36] (2011) Information theory of decisions and actions. In Perception-action cycle, pp. 601–636. Cited by: §IC.
 [37] (2020) AI Feynman 2.0: Pareto-optimal symbolic regression exploiting graph modularity. arXiv preprint arXiv:2006.10782. Cited by: §IA.
 [38] (1968) A counterexample in stochastic optimum control. SIAM Journal on Control 6 (1), pp. 131–147. Cited by: §IC.
 [39] (1972) Team decision theory and information structures in optimal control problems–part i. IEEE Transactions on Automatic Control 17 (1), pp. 15–22. Cited by: §IC.
 [40] (2018) A decomposition method for global evaluation of Shannon entropy and local estimations of algorithmic complexity. Entropy 20 (8), pp. 605. Cited by: §IA, §IIB, §V, §VIIC1, §VIIC, footnote 4.
VII Appendix
VII-A Proofs
VII-A1 Proof of the observation in Example 1
We prove that (2) and (3) are equivalent for $\lambda > 0$ sufficiently small. For brevity, let
$J(a) := \sum_{t=0}^{T-1} r_t$
denote the total reward of an action sequence $a$, subject to the dynamics and start state $s_0$. Note that the statement is trivially true if $J$ is constant (over all $a$), since (2) and (3) are then equivalent for all $\lambda$. Hence, we may assume that $J$ is not constant. In particular, the difference between the maximum of $J$ and its second-highest value is then positive:
$\Delta := \max_{a} J(a) - \max_{a \notin \mathcal{A}^\star} J(a) > 0,$
where $\mathcal{A}^\star := \operatorname{argmax}_{a} J(a)$.
We start by showing that a maximiser of (2) is a minimiser of (3) given that $\lambda$ is sufficiently small (to be specified). Towards this, let $a^\star$ be a maximiser of (2). Assume by contradiction that $a^\star \notin \mathcal{A}^\star$. Fix any $\bar{a} \in \mathcal{A}^\star$ and observe that
$J(\bar{a}) - J(a^\star) \geq \Delta.$ (8)
Moreover, since $a^\star$ is a maximiser of (2), we also have that
$J(a^\star) - \lambda K(a^\star) \geq J(\bar{a}) - \lambda K(\bar{a}),$
and therefore
$\lambda \left( K(\bar{a}) - K(a^\star) \right) \geq J(\bar{a}) - J(a^\star) \geq \Delta.$ (9)
Note that (9) does not hold for
$\lambda < \Delta / \max_{a} K(a).$ (10)
Thus, for $\lambda$ sufficiently small as specified by (10), a maximiser $a^\star$ of (2) must belong to $\mathcal{A}^\star$, and since all $a \in \mathcal{A}^\star$ attain the same reward, maximising $J(a) - \lambda K(a)$ over $\mathcal{A}^\star$ amounts to minimising $K(a)$, so
we conclude that $a^\star$ is a minimiser of (3).
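The threshold argument above can be checked numerically. In the sketch below, the rewards `J`, complexities `K`, and trade-off weight `lam` are hypothetical stand-ins for the quantities appearing in (2) and (3):

```python
# Toy check of the proof's threshold: for lam strictly below
# Delta / max K, the maximiser of J(a) - lam*K(a) coincides with the
# least-complex reward-maximiser. All numbers are illustrative.
J = {"a1": 10.0, "a2": 10.0, "a3": 7.0}  # total rewards (hypothetical)
K = {"a1": 30.0, "a2": 12.0, "a3": 5.0}  # complexities (hypothetical)

Jmax = max(J.values())
second = max(v for v in J.values() if v < Jmax)
Delta = Jmax - second                    # gap to the second-highest reward
lam = 0.9 * Delta / max(K.values())      # strictly below the bound in (10)

best = max(J, key=lambda a: J[a] - lam * K[a])     # maximiser of (2)
argmax_J = [a for a in J if J[a] == Jmax]
least_complex = min(argmax_J, key=lambda a: K[a])  # minimiser of (3)
assert best == least_complex
```

Raising `lam` well above the bound makes the cheap low-reward sequence win instead, which is exactly the failure mode that (10) rules out.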
VII-A2 Proof of Proposition 1
Proposition 1 follows immediately from the following lemma.
Lemma 1
For $n$ sufficiently large, there do not exist functions such that
(11) 
holds for all sequences of length $n$.
We may without loss of generality restrict ourselves to the binary case. We prove the lemma by contradiction. Let $\varphi$ be the partial computable function in Definition 3. Let $n$ be arbitrary. By a simple counting argument, there exists at least one $x \in \{0,1\}^n$ such that $K(x) \geq n$ (e.g., Theorem 2.2.1 in [20]). Also, the complement $\bar{x}$ of $x$, defined by inverting all zeros and ones in $x$, has complexity close to $K(x)$ in the sense that
$|K(\bar{x}) - K(x)| \leq c$ (12)
holds for some constant $c$ independent of $x$ and $n$. To see this, let $T$ be the Turing machine that takes a binary string $y$, inverts all zeros and ones, and outputs the result $\bar{y}$. In particular, $T(x) = \bar{x}$.^{9}^{9}9For brevity, we use, for a given Turing machine $T$ with corresponding partial computable function $\varphi_T$, the notation $T(y) = z$ to denote $\varphi_T(y) = z$ for binary strings $y$ and $z$. Let $T'$ in turn be the Turing machine that given input $p$ simulates the universal Turing machine corresponding to $\varphi$, obtains the output $\varphi(p)$ and then feeds it as input to $T$. Then $\varphi(p) = x$ implies $T'(p) = \bar{x}$. Hence, letting $\varphi'$ be the corresponding partial computable function of $T'$, we have by the invariance theorem,
$K(\bar{x}) \leq K_{\varphi'}(\bar{x}) + c_1 \leq K(x) + c_1,$ (13)
where $c_1$ is independent of $x$ and $n$. Furthermore, since $\bar{\bar{x}} = x$, $\varphi(p) = \bar{x}$ implies $T'(p) = x$, and we have again by the invariance theorem,
$K(x) \leq K(\bar{x}) + c_2.$ (14)
Together, (13) and (14) yield (12) with $c = \max\{c_1, c_2\}$.
We now show that $K(0^n) \leq \log n + c_3$, for some constant $c_3$ independent of $n$, where $0^n$ is the string consisting of $n$ zeros. To see this, let $b(n)$ be the binary string corresponding to $n$ and note that $\ell(b(n)) \leq \log n + c_4$, where $c_4$ is independent of $n$.^{10}^{10}10This correspondence between binary strings and integers is given by the length-increasing lexicographic ordering, where $b(n)$ in $\{0,1\}^*$ corresponds to $n$ in $\mathbb{N}$, see Chapter 1.4 in [20]. Consider the Turing machine $T_0$ that given a binary string $b$ as input, calculates the corresponding integer $m$ and outputs $0^m$. In particular, $T_0(b(n)) = 0^n$. Let $\varphi_0$ be the corresponding partial computable function of $T_0$. By the invariance theorem,
$K(0^n) \leq K_{\varphi_0}(0^n) + c_5 \leq \ell(b(n)) + c_5 \leq \log n + c_3.$ (15)
Similarly, $K(1^n) \leq \log n + c_6$, for some constant $c_6$ independent of $n$, where $1^n$ is the string consisting of $n$ ones.
Assume now that (11) holds. Then,
This yields a contradiction for $n$ sufficiently large, which proves the lemma.
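The two complexity facts used above can be illustrated empirically with an off-the-shelf compressor as a crude proxy for $K$ (an illustration under that proxy assumption only, not part of the proof): a random binary string and its complement receive nearly identical code lengths, cf. (12), while the constant string $0^n$ is far more compressible, cf. (15).

```python
import random
import zlib

random.seed(0)
n = 4096
x = "".join(random.choice("01") for _ in range(n))   # random string
xbar = "".join("1" if b == "0" else "0" for b in x)  # its complement

cx = len(zlib.compress(x.encode()))          # crude proxy for K(x)
cxbar = len(zlib.compress(xbar.encode()))    # crude proxy for K(x-bar)
c0 = len(zlib.compress(("0" * n).encode()))  # crude proxy for K(0^n)

assert abs(cx - cxbar) <= 16  # complement invariance up to a constant
assert c0 < cx // 4           # the constant string is far simpler
```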
VII-A3 Proof of Proposition 2
We need the following lemma for uniform-cost search.
Lemma 2 ([29])
Assume the cost $c$ in a uniform-cost search is such that $c(n') \geq c(n)$ for any node $n$ and any child node $n'$ of $n$. Then the uniform-cost search terminates at a node with lowest cost.
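For concreteness, a minimal uniform-cost search satisfying the assumption of Lemma 2 can be sketched as follows; the node type, child function, cost, and goal test below are illustrative placeholders rather than the quantities of Proposition 2:

```python
import heapq

def uniform_cost_search(start, children, cost, is_goal):
    """Pop nodes in order of nondecreasing cost. When cost(child) >=
    cost(parent) for all edges, the first goal node popped has the
    lowest cost among all goal nodes (Lemma 2)."""
    tie = 0  # tiebreaker so the heap never compares nodes directly
    frontier = [(cost(start), tie, start)]
    seen = set()
    while frontier:
        c, _, node = heapq.heappop(frontier)
        if node in seen:
            continue
        seen.add(node)
        if is_goal(node):
            return node, c
        for child in children(node):
            tie += 1
            heapq.heappush(frontier, (cost(child), tie, child))
    return None, float("inf")

# Example: search binary strings by length; the cheapest goal is found.
node, c = uniform_cost_search(
    "", lambda s: [s + "0", s + "1"] if len(s) < 4 else [],
    len, lambda s: s.endswith("11"))
```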
VII-B Details of SCAP
VII-B1 Uniform-cost search for finding
One can use a uniform-cost search to try to find for moderate stage length . More precisely, the uniform-cost search is conducted with nodes , cost , and termination criterion for some . If holds in the search, then one can set and the search terminates when all sequences in have been considered (cf. Proposition 2). However, since this assumption may not always hold throughout the search, one can instead consider a positive as a margin (enlarging the search, though without any formal guarantees of having found all elements of at termination).
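Since the quantities this search operates on are defined in the main text, the sketch below uses an illustrative cost and bound; it only demonstrates the generic pattern of a uniform-cost search that collects every node below a threshold and stops once the popped cost exceeds it (plus any chosen margin):

```python
import heapq

def collect_below(start, children, cost, bound):
    """Return all reachable nodes with cost at most `bound`. Because
    costs are nondecreasing along edges, the search may stop at the
    first popped node whose cost exceeds the bound."""
    frontier = [(cost(start), 0, start)]
    found, seen, tie = [], set(), 0
    while frontier:
        c, _, node = heapq.heappop(frontier)
        if node in seen:
            continue
        seen.add(node)
        if c > bound:  # every remaining node costs at least c
            break
        found.append(node)
        for child in children(node):
            tie += 1
            heapq.heappush(frontier, (cost(child), tie, child))
    return found

# Example: all binary strings of length at most 2.
short = collect_below("", lambda s: [s + "0", s + "1"], len, 2)
```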
VII-B2 Action sequence extraction
VII-C The Estimation Method
We provide a brief overview of the method from [40] used to estimate the Kolmogorov complexity, followed by a discussion of the potential bias this method might introduce in the numerical evaluations in Section V.
VII-C1 Overview
The method of [40] estimates $K(x)$ by first partitioning $x$ into substrings, all of a fixed block length $l$ (plus possibly one remainder substring of length less than $l$). Denoting such a substring by $y$ and letting $n_y$ be the number of occurrences of $y$ in the partition of $x$, the complexity is then estimated by
$\hat{K}(x) = \sum_{y} \left( \mathrm{CTM}(y) + \log_2 n_y \right),$ (16)
where $\mathrm{CTM}(y)$ approximates $K(y)$ based on the coding theorem in algorithmic probability [20], referring to [40] for details. In words, $\hat{K}(x)$ is the total complexity for generating all the distinct substrings $y$, plus the codelength needed for specifying how frequent each $y$ is in $x$, which can be encoded in a codeword of length $\log_2 n_y$.
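A minimal sketch may clarify the bookkeeping of this estimate; here `ctm` is a hypothetical lookup table standing in for the coding-theorem approximation of [40], and its numbers are illustrative only:

```python
import math
from collections import Counter

def bdm_estimate(x, block_len, ctm):
    """Estimate K(x): split x into blocks of length block_len (the last
    block may be shorter), then sum, over each distinct block y, its
    approximate complexity ctm[y] plus log2 of its occurrence count."""
    blocks = [x[i:i + block_len] for i in range(0, len(x), block_len)]
    counts = Counter(blocks)
    return sum(ctm[y] + math.log2(n) for y, n in counts.items())

# Hypothetical CTM values for the four 2-bit blocks.
ctm2 = {"00": 2.5, "01": 3.3, "10": 3.3, "11": 2.5}
est = bdm_estimate("00110100", 2, ctm2)
```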