Computing Complexity-aware Plans Using Kolmogorov Complexity

09/21/2021
by Elis Stefansson, et al.

In this paper, we introduce complexity-aware planning for finite-horizon deterministic finite automata with rewards as outputs, based on Kolmogorov complexity. Kolmogorov complexity is considered since it can detect computational regularities of deterministic optimal policies. We present a planning objective yielding an explicit trade-off between a policy's performance and complexity. It is proven that maximising this objective is non-trivial in the sense that dynamic programming is infeasible. We present two algorithms obtaining low-complexity policies, where the first algorithm obtains a low-complexity optimal policy, and the second algorithm finds a policy maximising performance while maintaining local (stage-wise) complexity constraints. We evaluate the algorithms on a simple navigation task for a mobile robot, where our algorithms yield low-complexity policies that concur with intuition.


I Introduction

I-A Motivation

Artificial intelligence has progressed significantly over the last decade, achieving superhuman performance in challenging domains such as the video game Atari and the board game Go [23, 32]. Unfortunately, for more complex and unconstrained environments (e.g., advanced real-world systems such as autonomous vehicles), results are more limited [26]. One major challenge is the huge space of all possible strategies (policies) that the agent (i.e., robot or machine) can perform, making tractable solutions cumbersome to obtain naively. However, humans tend to face these complex domains with relative ease.

One explanation for why humans perform well in complex tasks comes from cognitive neuroscience, proposing that general intelligence is linked to efficient compression, known as the efficient coding hypothesis [4, 5, 33]. This idea is not new but can be traced back to William of Occam: “If there are alternative explanations for a phenomenon, then, all other things being equal, we should select the simplest one”, where the simple alternative is the alternative with the shortest explanation. This methodology is known as Occam’s razor [20]. Formalisations of Occam’s razor have been able to detect computational regularities in numerical sequences [40] and to automatically extract low-complexity physical laws directly from data [37]. In this paper, we are interested in whether a similar formalisation can automatically extract low-complexity policies, seen as a first step towards more tractable and intelligent behaviour for agents acting in complex environments.

I-B Contribution

The main contribution of this paper is to define a complexity measure for policies in deterministic finite automata (DFA) [15] using Kolmogorov complexity [20], and to construct tractable complexity-aware planning algorithms with explicit trade-offs between performance and complexity based on this measure. More precisely, our contributions are three-fold:

Firstly, we define a complexity measure for deterministic policies in finite-horizon DFA with rewards as outputs. This measure uses Kolmogorov complexity to evaluate how complex a policy is to execute. Kolmogorov complexity, which can be seen as a formalisation of Occam’s razor, is a computational notion of complexity able not only to detect statistical regularities (commonly exploited in standard information theory) but also computational regularities. (Footnote 1: An example is a sequence generated by a short deterministic program that nevertheless exhibits no apparent statistical regularity.) Our key insight is that optimal policies typically possess computational regularities (such as reaching a goal state) apart from statistical ones, making Kolmogorov complexity an appealing complexity evaluator.

Secondly, we present a complexity-aware planning objective based on our complexity measure, yielding an explicit trade-off between a policy’s performance and complexity. We also prove that maximising this objective is non-trivial in the sense that dynamic programming [6] is infeasible.

Thirdly, we present two algorithms obtaining low-complexity policies. The first algorithm, Complexity-guided Optimal Policy Search (COPS), finds a policy with low complexity among all optimal policies, following a two-step procedure. In the first step, the algorithm finds all optimal policies, without any complexity constraints. Dropping the complexity constraints, this step can be done using dynamic programming. The second step runs a uniform-cost search [29] over all optimal policies, guided by a complexity-driven cost, favouring low-complexity policies and enabling moderate search depths. The second algorithm, Stage-complexity-aware Program (SCAP), instead penalises policies locally for executing complex manoeuvres. This is done by partitioning the horizon into stages, with local complexity constraints over the stages instead of the full horizon, which enables dynamic programming over the stages. Finally, we evaluate our algorithms on a simple navigation task where a mobile robot tries to reach a certain goal state. Our algorithms yield low-complexity policies that concur with intuition.

I-C Related Work

Interest in complexity in control and learning has a long history, going back to Bellman’s curse of dimensionality [6], Witsenhausen’s counterexample [38, 39] and the general open problem of under what conditions LQG admits an optimal low-dimensional feedback controller [3], just to mention a few, and complexity has lately been viewed as an essential component of intelligent behaviour [30]. Recent attempts to find low-complexity policies can be divided into two categories. In the first category, the system itself is approximated by a low-complexity system (e.g., of smaller dimension), from which an approximately optimal solution can be obtained. Methods in this category include bisimulation [14, 7], PCA analysis [27, 21], and information-theoretic compression such as the information bottleneck method [1, 18]. In the second category, a low-complexity policy is instead obtained directly. Here, notable methods include policy distillation [31], VC-dimension constraints [17], concise finite-state machine plans [24, 25], low-memory policies through sparsity constraints [8], and information-theoretic approaches such as KL-regularisation [28, 36], mutual information regularisation with variations [34, 13, 35], and minimal specification complexity [12, 11]. Our work belongs to this second category and resembles [24, 25, 12, 11] the most, but differs since we consider Kolmogorov complexity.

Kolmogorov complexity has also been considered in the context of reinforcement learning as a tool for complexity-constrained inference [10, 2, 16], based on Solomonoff’s theory of inductive inference [20]. We differ by focusing instead on constraining the computational complexity of the obtained policy itself, assuming the underlying system to be known.

Finally, the work [9] considers Kolmogorov complexity to measure the complexity of an action sequence similar to this line of work. We differ by also optimising over the complexity, while [9] only evaluates the complexity of an immutable object.

I-D Outline

The remainder of the paper is organised as follows. Section II provides preliminaries together with the problem statement. Section III defines the complexity measure together with the complexity-aware planning objective, and proves that dynamic programming is infeasible. Section IV presents two algorithms that yield low-complexity policies, and Section V evaluates the algorithms on a mobile robot example. Finally, Section VI concludes the paper.

II Problem Formulation

II-A Turing Machines

We give a brief review of Turing machines following [20]. A Turing machine is a mathematical model of a computing device, manipulating symbols on an infinite list of cells (known as the tape), reading one symbol at a time on the tape with an access pointer (known as the head). Formally:

Definition 1

A Turing machine is a tuple with: a finite set of tape symbols with alphabet and blank symbol; a finite set of states with a start state; and a partial function (Footnote 2: We use this notation to denote a partial function from a set to a set, i.e., a function that is only defined on a subset of its domain.) called the transition function.

An execution of a Turing machine on an input string (Footnote 3: Here, the input ranges over the set of all finite strings from the alphabet.) starts with the input string on the tape (and blank symbols on both sides of it), the start state as current state, and the head on the first symbol of the input. The machine then transitions according to the transition function, which specifies what the currently scanned symbol should be replaced with, or whether the head should move one step left or right, and which state comes next. This transition procedure is repeated until the transition function becomes undefined. In this case, the machine halts with output equal to the maximal alphabet string currently under scan, or the empty string if a blank symbol is scanned. For each Turing machine, this input-output convention defines a partial function (defined when the machine halts) known as a partial computable function.
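To make the execution convention concrete, the following minimal Python sketch simulates a Turing machine under the conventions above; the transition encoding (("write", symbol) or ("move", ±1)) and the output rule (the maximal non-blank string starting under the head) are illustrative simplifications, not the paper's formal definition.

def run_tm(delta, tape_input, start="q0", blank=" ", max_steps=100_000):
    """Minimal Turing-machine simulator.  delta maps (state, scanned symbol)
    to (op, next_state), where op is ("write", symbol), ("move", -1) for a
    left step, or ("move", +1) for a right step.  Execution halts when delta
    is undefined for the current (state, symbol) pair."""
    tape = {i: c for i, c in enumerate(tape_input)}
    head, state = 0, start
    for _ in range(max_steps):
        scanned = tape.get(head, blank)
        if (state, scanned) not in delta:
            break                      # transition undefined: halt
        op, state = delta[(state, scanned)]
        if op[0] == "write":
            tape[head] = op[1]
        else:
            head += op[1]
    # Output convention: maximal non-blank string starting under the head
    # (the empty string if a blank symbol is scanned).
    out, pos = [], head
    while tape.get(pos, blank) != blank:
        out.append(str(tape[pos]))
        pos += 1
    return "".join(out)

For instance, run_tm({}, "101") halts immediately and outputs "101", since the empty transition function leaves the head on the first input symbol.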

An important special class of Turing machines is the universal Turing machines. Informally, a universal Turing machine is a Turing machine that can simulate any other Turing machine, and can therefore be seen as an idealised version of a modern computer. A fundamental result states that there exist such machines [20], a fact utilised when defining the Kolmogorov complexity.

II-B Kolmogorov Complexity

In this section, we present the needed theory concerning Kolmogorov complexity [20]. Informally, the Kolmogorov complexity of an object is the length of the smallest program on a computer that outputs the object. If this length is much smaller than the object itself, then we have compressed the information in the object down to its essential content. The Kolmogorov complexity can therefore be seen as a formalisation of Occam’s razor, stripping away all the non-essential information. More formally, the computer is a universal Turing machine, and the object and the program are strings over a finite alphabet. Towards a precise definition, we need the following notion:

Definition 2 ([20])

The complexity of a string with respect to a partial computable function is defined as

where the length of a program string is measured in symbols. Here, the program is interpreted as input to the partial computable function, and the complexity is thus the length of the smallest program that, via this function, describes the string.
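In the standard notation of [20], with φ the partial computable function, p a program and ℓ(p) its length (these symbols are assumed here to match those of Definition 2), the definition reads:

    C_\varphi(x) \;=\; \min\{\, \ell(p) \;:\; \varphi(p) = x \,\},

with C_\varphi(x) := \infty if no program p satisfies \varphi(p) = x.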

The following fundamental result, known as the invariance theorem, asserts that there is a partial computable function with shorter (i.e., more compact) descriptions than any other partial computable function, up to a constant:

Theorem 1 (Invariance theorem [20])

There exists a partial computable function, constructed from a universal Turing machine, with the following additively optimal property: for any other partial computable function, there exists a constant (dependent only on that function) such that, for every string, the complexity with respect to the additively optimal function is at most the complexity with respect to the other function plus the constant.

This function serves as our computer when defining the Kolmogorov complexity:

Definition 3 (Kolmogorov Complexity [20])

Let the additively optimal function be as in Theorem 1. The Kolmogorov complexity of a string is its complexity with respect to this function. We abbreviate the notation when it is clear from context.

The Kolmogorov complexity is robust in the sense that it depends only benignly on the choice of universal machine; see [20] for details. The Kolmogorov complexity is, however, not computable in general, but can only be over-approximated pointwise [20]. The method we use in this paper to approximate it is from [40] and is based on algorithmic probability [20]; see the Appendix for a summary. This estimation method is used in the simulations (Section V), while the underlying theoretical framework we develop (Sections III and IV) is based on the exact Kolmogorov complexity. (Footnote 4: We stress that other estimation methods can also be used, e.g., Lempel-Ziv compression [19] (used in, e.g., [9]), since the theoretical framework is independent of the particular estimation method chosen. We picked [40] due to its more direct connection with Kolmogorov complexity, whereas Lempel-Ziv compression relies on classical information theory.)
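Since the framework is agnostic to the particular estimator of K (footnote 4), any compressor-based upper bound can serve as a stand-in in code. The sketch below uses zlib from the Python standard library; the function name estimate_K and the bits-per-byte convention are illustrative choices, and this is not the estimator of [40] used in Section V.

import zlib

def estimate_K(symbols) -> float:
    """Crude upper bound on the Kolmogorov complexity (in bits) of a finite
    symbol sequence, via the length of its zlib-compressed encoding."""
    data = ",".join(str(s) for s in symbols).encode("utf-8")
    return 8.0 * len(zlib.compress(data, 9))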

II-C Deterministic Finite Automata

We consider finite-horizon planning for discrete systems formalised as time-varying DFA [15] with actions as inputs and rewards as outputs (i.e., time-varying Mealy machines [22]), of the form (Footnote 5: The time-varying feature of the DFA can be lifted by including the time into the state. We keep the current notation for easier readability.):

(1)

where the components are the finite set of states, the finite action set, the horizon length, the transition function specifying the next state at each time given the current state and action, and the corresponding received reward. The system stops at the end of the horizon, given by the final state sets. Given a start state, the objective is to find a (deterministic) policy maximising the total reward subject to the transition dynamics. A policy which does this for every start state is called an optimal policy.
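A minimal encoding of a system of the form (1) is sketched below; the field names (states, actions, horizon, step, reward) and the convention that decisions are taken at times t = 0, ..., T are illustrative assumptions, not the paper's notation.

from dataclasses import dataclass
from typing import Callable, Hashable, Sequence

State = Hashable
Action = Hashable

@dataclass
class System:
    """Finite-horizon, time-varying DFA with rewards as outputs, cf. (1)."""
    states: Sequence[State]                         # finite state set
    actions: Sequence[Action]                       # finite action set
    horizon: int                                    # T: decisions at t = 0, ..., T
    step: Callable[[int, State, Action], State]     # f_t(s, a): next state
    reward: Callable[[int, State, Action], float]   # r_t(s, a): received reward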

II-D Problem Statement

We now formalise the problem statement. In this paper, we answer the following questions:

  1. Given a system as in (1), how can one define a complexity measure capturing how complex a policy is to execute from a start state?

  2. How can one construct a planning objective with a formal trade-off between total reward and complexity for a policy?

  3. How can one construct algorithms that maximise this objective?

  4. In particular, can dynamic programming be used to maximise this objective?

Questions 1, 2 and 4 are answered in Section III, while Section IV answers question 3 together with the numerical evaluations in Section V.

III Complexity-aware Planning

This section introduces a complexity measure in Section III-A and then sets up an appropriate complexity-aware planning objective in Section III-B. Finally, we prove that dynamic programming cannot be used to maximise this objective in Section III-C. Throughout this section, we fix a system as in (1).

III-A Execution Complexity

We start by defining our complexity measure for policies, capturing how complex it is to execute a policy from a start state. Towards this, note that, given a start state and a policy, we obtain a sequence of actions from time 0 to the end of the horizon, generated by the policy and the dynamics. We say that the policy has low execution complexity if the Kolmogorov complexity of this action sequence is low. Intuitively, a policy with low execution complexity has a small program that can execute it.

Definition 4 (Execution Complexity)

Given a start state, the execution complexity of a policy is the Kolmogorov complexity of the action sequence it generates from that start state.
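Given the illustrative System above and an estimator estimate_K (Section II-B), the execution complexity of Definition 4 can be computed by rolling the policy out through the dynamics; a sketch:

def action_sequence(sys: System, policy, s0: State) -> list:
    """Action sequence generated by the deterministic policy policy(t, s)
    from start state s0, following the dynamics of sys."""
    seq, s = [], s0
    for t in range(sys.horizon + 1):
        a = policy(t, s)
        seq.append(a)
        s = sys.step(t, s, a)
    return seq

def execution_complexity(sys: System, policy, s0: State) -> float:
    """Estimated Kolmogorov complexity of the action sequence executed by the
    policy from s0 (Definition 4), using the stand-in estimator."""
    return estimate_K(action_sequence(sys, policy, s0))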

III-B Complexity-aware Planning Objective

The execution complexity can be used to find policies with high total reward while keeping a low complexity. Formally, we want to find an action sequence maximising the objective (Footnote 6: For convenience, we maximise directly over control inputs instead of policies, since the horizon is typically lower than the number of states. With this convention, the complexity term in (2) agrees with the execution complexity in Definition 4.)

(2)

subject to the dynamics and start state. Here, the trade-off parameter determines how much we penalise complexity relative to obtaining a high total reward.
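Written out, a trade-off objective of this kind takes the form below; this is a reconstruction consistent with the surrounding text and Definition 4 (with the trade-off weight denoted β here), not necessarily the paper's exact formula (2):

    J_\beta(a_{0:T}) \;=\; \sum_{t=0}^{T} r_t(s_t, a_t) \;-\; \beta\, K(a_{0:T}),
    \qquad s_{t+1} = f_t(s_t, a_t), \quad s_0 \text{ given}.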

Example 1 (Simple optimal policies)

For a sufficiently low trade-off parameter, we obtain, for a start state, an interesting subset of all optimal policies (i.e., policies maximising the total reward) with the lowest complexity, which we here call simple optimal policies. Note that the simple optimal policies can also be found via the objective

(3)

subject to the dynamics and start state . See Appendix for a proof.

III-C Dynamic Programming is Infeasible

We are interested in algorithms obtaining the maximum of (2). A standard method for finding an optimal action sequence maximising the total reward is dynamic programming [6]. Hence, one may ask if (2) can be solved using dynamic programming. The answer is negative and follows from the following result.

Proposition 1

For sufficiently large horizons, we cannot decompose the execution complexity in the form

where the intermediate functions are given recursively by

for some functions. That is, we cannot decompose the execution complexity recursively into an immediate complexity plus a future complexity. In particular, we cannot obtain it by backward induction. (Footnote 7: Note that these functions can be any functions, not necessarily computable. If one restricts them to be computable, then the result follows easily from the fact that the Kolmogorov complexity itself is not computable.)

See the Appendix for the proof. Due to Proposition 1, we cannot naively apply dynamic programming to solve (2) (for large enough horizons). More precisely, the objective function, subject to the dynamics and start state, cannot be decomposed into value functions given recursively by an immediate term plus a future term. Indeed, if such a decomposition existed, it would contradict Proposition 1. Thus, the objective cannot be maximised using dynamic programming. Fortunately, there are methods that circumvent this issue, presented next.

IV Complexity-aware Planning Algorithms

This section answers question 3 in the problem statement, presenting two algorithms that circumvent the dynamic programming issue given by Section III-C. The first algorithm, COPS, restricts the task to finding simple optimal policies, while the second algorithm, SCAP, modifies the objective to focus on local (stage-wise) complexity.

IV-A Complexity-guided Optimal Policy Search (COPS)

COPS seeks a simple optimal policy as in Example 1 by a two-step procedure. Step 1 conducts ordinary dynamic programming (maximising the total reward). This yields a mapping giving the set of optimal actions at each time and state. (Footnote 8: Formally, this mapping takes values in the power set of the action set, i.e., the set of all subsets of actions.) In step 2, a uniform-cost search [29] is executed to find an optimal action sequence with low execution complexity. A node in this search consists of the current time, the current state, and the sequence of previous actions taken to arrive at that state at that time. If a node is not a terminal node (i.e., has not reached the end of the horizon), its children are given by the nodes reached by optimal actions, i.e., we expand only over optimal actions. The cost of a node is set to the Kolmogorov complexity of its action sequence. The heuristic intuition behind the cost is that a low-complexity sequence should be more likely to have low-complexity subsequences and, thus, the cost focuses the search on low-complexity sequences, enabling moderate search depths.

The uniform-cost search is conducted by iteratively generating children of the node with the lowest cost, starting from the root node. We terminate the search when a terminal node has the lowest cost and return its action sequence. The algorithm is summarised in Algorithm 1. The termination is motivated by the following result:

Proposition 2

Assume that the complexity of a prefix never exceeds the complexity of its extension, for all prefixes of optimal action sequences. Then Algorithm 1 returns an optimal action sequence with lowest execution complexity, i.e., one that maximises (3).

We stress that this assumption is not always true, since appending an action to a sequence may result in higher regularity than the sequence has alone. However, the assumption is the more anticipated case, since the Kolmogorov complexity is on average an increasing function of sequence length [20]; this motivates the assumption in Proposition 2 and explains why the algorithm can work well in practice. Proposition 2 follows readily by applying uniform-cost search properties; see the Appendix.

Efficient searching in Algorithm 1 is possible for moderate horizon lengths, as demonstrated numerically in Section V-B. However, for longer horizons, the procedure becomes intractable. The next algorithm works for longer horizons by focusing on local (stage-wise) complexity instead of the complexity of the whole action sequence. COPS can be seen as a special case of this latter algorithm with only one stage and a sufficiently low complexity weight, defined below in (4).

Input : System as in (1) and start state s_0.
Output : Low-complexity optimal action sequence a_{0:T}.
Step 1: Dynamic programming
V_{T+1}(s) := 0 for all s;
for t = T, T-1, ..., 0 do
        forall states s do
               forall actions a do
                      q(a) := r_t(s, a) + V_{t+1}(f_t(s, a));
               V_t(s) := max_a q(a);
               A*_t(s) := argmax_a q(a);  # set of optimal actions at (t, s)
Step 2: Uniform-cost search
Initialise priority queue of nodes (t, s, a_{0:t-1}) ordered by the cost K(a_{0:t-1});
Append root node (0, s_0, empty sequence) to the queue;
loop  do
        Pop first node (t, s, a_{0:t-1}) from the queue;
        if t = T+1 then
               return a_{0:T};
        forall a in A*_t(s) do
               Append node (t+1, f_t(s, a), a_{0:t-1} a) to the queue;
Algorithm 1 COPS
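A compact Python sketch of COPS, reusing the illustrative System and estimate_K from above; the prioritisation by the estimated complexity of the action prefix follows the description in this section, while variable names and tie-breaking details are implementation choices.

import heapq

def cops(sys: System, s0: State) -> list:
    """Step 1: dynamic programming keeping all optimal actions per (t, s).
    Step 2: uniform-cost search over optimal actions, with node cost equal to
    the estimated complexity of the action prefix."""
    T = sys.horizon
    V = {(T + 1, s): 0.0 for s in sys.states}
    opt = {}                                       # (t, s) -> list of optimal actions
    for t in range(T, -1, -1):
        for s in sys.states:
            q = {a: sys.reward(t, s, a) + V[(t + 1, sys.step(t, s, a))]
                 for a in sys.actions}
            V[(t, s)] = max(q.values())
            opt[(t, s)] = [a for a, v in q.items() if v == V[(t, s)]]
    tie = 0                                        # tie-breaker so states are never compared
    frontier = [(0.0, tie, 0, s0, [])]             # (cost, tie, t, s, prefix a_{0:t-1})
    while frontier:
        cost, _, t, s, prefix = heapq.heappop(frontier)
        if t == T + 1:
            return prefix                          # terminal node with lowest cost
        for a in opt[(t, s)]:
            tie += 1
            child = prefix + [a]
            heapq.heappush(frontier,
                           (estimate_K(child), tie, t + 1, sys.step(t, s, a), child))
    return []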

IV-B Stage-Complexity-Aware Program (SCAP)

SCAP modifies the objective to focus only on local (stage-wise) complexity. More precisely, the modified objective is set to

(4)

Here, the horizon is partitioned into a number of stages, each with a fixed stage length, and each stage's executed action sequence is penalised by its execution complexity with a stage-specific weight.

The key insight is that the modified objective enables dynamic programming over the stages. More precisely, initialise the value function as for all and obtain the value function for the remaining stages using the backward recursion

(5)

Here, the stage-successor state denotes the state one arrives at by sequentially applying the stage's action sequence starting from the current state. Provided the stage length is small enough, the maximisation in (5) can be done by going through all action sequences of that length.
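In symbols (chosen here for illustration: stage index k, N stages, stage length L, stage sequence a in A^L, accumulated stage reward R_k and stage-successor state F_k), the recursion reads, as a plausible reconstruction of (5):

    V_N(s) = 0, \qquad
    V_k(s) \;=\; \max_{a \,\in\, \mathcal{A}^{L}}
    \Big[ R_k(s, a) \;-\; \beta_k\, K(a) \;+\; V_{k+1}\big(F_k(s, a)\big) \Big],
    \quad k = N-1, \dots, 0.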

IV-B1 Hard-constrained version

An alternative to (4) is the hard-constraint objective

(6)

for some stage-wise complexity limits. In this case, we conduct dynamic programming with the value function initialised to zero at the final stage and the backward recursion

(7)

Using (7) enables more efficient computations than (5), since the admissible stage sequences are typically only a fraction of all sequences of stage length. Moreover, the admissible sets can be computed beforehand by going through all sequences of stage length (tractable for small stage lengths), or sought using a uniform-cost search (for moderate stage lengths; see the Appendix for details).

IV-B2 Action sequence extraction

Once the value function has been computed using (5) or (7), an action sequence maximising (4) or (6) can be readily obtained, for a given start state, by forward simulation over the stages (see the Appendix for details).

Algorithm 2 summarises the procedure for the hard-constrained version.

Input : System as in (1) and start state s_0.
Output : Action sequence maximising (6).
Step 1: Dynamic programming
Set V_N(s) := 0 for all s;
Compute the admissible sets A_k := {a of stage length L : K(a) <= alpha_k} for k = 0, ..., N-1;
for k = N-1, ..., 0 do
        forall states s do
               forall a in A_k do
                      q(a) := (stage reward of a from s) + V_{k+1}(stage-successor state);
               V_k(s) := max_{a in A_k} q(a);
Step 2: Action sequence extraction
s := s_0;
for k = 0, ..., N-1 do
        Set a_k := argmax_{a in A_k} [(stage reward of a from s) + V_{k+1}(stage-successor state)];
        s := stage-successor state under a_k;
return a_0 a_1 ... a_{N-1}
Algorithm 2 SCAP
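A Python sketch of the hard-constrained SCAP (Algorithm 2), enumerating the admissible stage sequences by brute force as described above; the names (L for the stage length, alphas for the per-stage complexity limits) are illustrative, and estimate_K again stands in for K.

from itertools import product

def scap_hard(sys: System, s0: State, L: int, alphas: list) -> list:
    """Stage-wise dynamic programming with hard complexity constraints, followed
    by action sequence extraction via forward simulation over the stages."""
    N = len(alphas)                                # number of stages; assumes T + 1 = N * L
    admissible = [[seq for seq in product(sys.actions, repeat=L)
                   if estimate_K(seq) <= alphas[k]] for k in range(N)]

    def apply_stage(k: int, s: State, seq):
        """State reached and reward accumulated by applying seq during stage k."""
        total = 0.0
        for i, a in enumerate(seq):
            t = k * L + i
            total += sys.reward(t, s, a)
            s = sys.step(t, s, a)
        return s, total

    # Step 1: dynamic programming over the stages.
    V = [{s: 0.0 for s in sys.states} for _ in range(N + 1)]
    for k in range(N - 1, -1, -1):
        for s in sys.states:
            V[k][s] = max(rew + V[k + 1][s_next]
                          for seq in admissible[k]
                          for s_next, rew in [apply_stage(k, s, seq)])

    # Step 2: action sequence extraction by forward simulation.
    plan, s = [], s0
    for k in range(N):
        def value(seq, k=k, s=s):
            s_next, rew = apply_stage(k, s, seq)
            return rew + V[k + 1][s_next]
        best = max(admissible[k], key=value)
        plan.extend(best)
        s = apply_stage(k, s, best)[0]
    return plan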

V Numerical Evaluations

This section presents case studies evaluating the complexity-aware planning algorithms proposed in Section IV. We describe the test environment in Section V-A, evaluate COPS in Section V-B and then SCAP in Section V-C. The Kolmogorov complexity is estimated using the method from [40].

Fig. 1: Trajectories of the first 30 action sequences using COPS for: (a) small room, (b) medium room, and (c) large room.
TABLE I: Execution complexity for the found action sequences corresponding to Fig. 1.
(a) Small room
    Action sequences 1-4: 47.30 | 5-8: 47.79 | 9-12: 47.91 | 13-16: 47.92 | 17-24: 48.30 | 25-30: 48.36
(b) Medium room
    Action sequences 1-4: 36.49 | 5-8: 38.39 | 9-12: 38.79 | 13-16: 38.39 | 17-20: 38.60 | 21-28: 62.57 | 29-30: 39.75
(c) Large room
    Action sequences 1-4: 58.80 | 5-8: 59.96 | 9-16: 60.53 | 17-20: 60.88 | 21-28: 61.20 | 29-30: 61.64

V-A Test Environment

As test environment, we consider a system as in (1) where a robot (the agent) moves around in a room with discrete steps. More precisely, the room is a square with a finite number of coordinates in each direction forming the state space. At each time, the robot can move to any of the neighbouring spots or stay, i.e., the actions are stay, up, down, left and right. If the robot executes an action that would move it outside the room, then it stays at the same spot. The reward is zero everywhere except for state-action pairs that take the robot to a given goal state. We set the goal state to one of the corners of the room. The robot starts at a given start state, and the horizon is set so that the agent precisely receives a reward for reaching the goal if acting optimally.
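Instantiated as the illustrative System from Section II-C, the test environment might look as follows; the reward value of 1 for entering the goal and the exact boundary convention are assumptions made for the sketch.

def make_room(n: int, goal: tuple, horizon: int) -> System:
    """n-by-n room; the robot can stay or move to a neighbouring spot, and
    receives a reward whenever its transition takes it to the goal state."""
    states = [(x, y) for x in range(n) for y in range(n)]
    moves = {"stay": (0, 0), "up": (0, 1), "down": (0, -1),
             "left": (-1, 0), "right": (1, 0)}

    def step(t, s, a):
        x, y = s[0] + moves[a][0], s[1] + moves[a][1]
        if not (0 <= x < n and 0 <= y < n):
            return s                   # moving outside the room: stay put
        return (x, y)

    def reward(t, s, a):
        return 1.0 if step(t, s, a) == goal else 0.0

    return System(states=states, actions=list(moves), horizon=horizon,
                  step=step, reward=reward)

For example, cops(make_room(5, (4, 0), 8), (0, 4)) searches for a low-complexity optimal path across a 5-by-5 room with the goal in a corner; the room sizes and horizons used in the evaluation below are those of the paper, not of this sketch.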

V-B Evaluation 1

V-B1 Setup

We first consider COPS. To see how the output and calculation time change with problem size, we study the test environment with different room sizes, called the small, medium and large room, respectively. We run the search until it has found 30 low-complexity action sequences, by continuing to expand the node tree. Concretely, this is done by modifying the if-statement in Algorithm 1 to a condition appending every found action sequence to a list until this list has 30 sequences, and then returning the list.

V-B2 Result

The running times for the three rooms are 21 seconds, 16 minutes and 15 minutes, respectively. The trajectories corresponding to the found action sequences are given in Fig. 1(a), 1(b) and 1(c), with execution complexities in Tables I(a), I(b) and I(c).

As a first example, consider the trajectory of the first action sequence found in the small room, labelled 1 in Fig. 1(a). Here, the robot goes right until it reaches the upper right corner and then goes down to the goal. That is, it exploits executing the same actions in batches to lower complexity, reaching an execution complexity of 47.30 as seen in Table I(a). We also note that this is the same complexity as action sequences 2-4 in Fig. 1(a). That sequences 1 and 3 have the same complexity is due to symmetry, and the same is true for 2 and 4. However, why, e.g., sequences 1 and 2 have the same complexity (apart from the intuition that they both look like low-complexity executions) is unclear. It could be an inherited feature of the Kolmogorov complexity, or a bias from the estimation method; see the Appendix for a discussion of the latter.

Looking now at all sequences in all three cases, we see that it is in general common to find sequences executing batches of the same action and then alternating between such batches to lower execution complexity (1-8 in Fig. 1(a), and 17-28 in Fig. 1(b)). Another typical feature to lower complexity is to exploit a 2-periodic alternation between going down and going right (9-30 in Fig. 1(a), 1-16 and 29-30 in Fig. 1(b), and 1-30 in Fig. 1(c)). Also, the type of execution that has the lowest complexity varies with the room size, i.e., it is horizon-dependent. Once again, this could come from the Kolmogorov complexity itself, where varying sequence length may facilitate different compression techniques, but it could also originate from the estimation method.

We also note in Tables I(a) to I(c) that the algorithm mainly finds sequences in increasing order of complexity, as anticipated by Proposition 2. Indeed, only the medium room case causes some unordered sequences, where notably the search finds the higher-complexity sequences 21-28, with execution complexity 62.57, before reaching sequences 29-30 with complexity 39.75. This detour, caused by a violation of the assumption in Proposition 2, explains the long search time for the medium room, longer than for the large room although the latter has a larger horizon.

Finally, the complexity-guided cost enables moderate horizon lengths. In particular, for the large room, the number of possible optimal action sequences after dynamic programming is enormous. However, the search iterates over only a small fraction of the corresponding nodes to find the first low-complexity sequence, a significant decrease due to the complexity-guiding cost.

Fig. 2: Value functions for Setup 1 in Section V-C, for the different complexity limits. Also shown are two corresponding trajectories (for each finite complexity limit) starting from A and B, obtained by Algorithm 2.
Fig. 3: Same as Fig. 2, except for Setup 2. Also shown are two trajectories starting from A and B, obtained by Algorithm 2.

V-C Evaluation 2

We continue with SCAP, considering the hard-constrained version given by Algorithm 2, calculating the admissible sets by going through all stage sequences. The aim of the investigation is mainly to see how the complexity-performance tradeoff affects the behaviour. Towards this, we consider two similar setups, both considering the large room, but where the first setup has the goal state in the lower right corner of the room and the second setup has it in the middle of the room.

V-C1 Setup 1: Goal in the corner

In Setup 1, we penalise each stage equally with a common complexity limit, using a fixed stage length, and consider the large room from Section V-B with a slightly larger horizon to be compatible with the stage partition, setting the number of stages accordingly. The change in tradeoff comes into play by varying the complexity limit over a range of values up to 30. These values are picked since they illustrate the tradeoff well, allowing almost no action sequences for the lowest limit while allowing many for the highest.

V-C2 Result

Value functions are plotted in Fig. 2 for the different complexity limits, where the unconstrained case is included as a reference, obtained by ordinary dynamic programming maximising the total reward. Blue (red) colour encodes low (high) value.

For the lowest complexity limit, the room is divided into squares whose side equals the stage length, due to the stage partition and the low complexity limit. More precisely, at each stage, the admissible set consists of only five action sequences: the ones executing only one action throughout the stage. This constraint implies that the robot sometimes has to wait at the walls, yielding the discontinuous jumps in the value function seen in Fig. 2 (a). Consider for example the two trajectories in Fig. 2 (a) generated by Algorithm 2, coloured on a green-to-yellow scale, where states visited more often are more yellow. The robot starting at A reaches the goal without waiting. The robot starting at B goes right two stages and then down two stages, but hits the wall during its second going-right stage before the stage is done. This causes waiting at the wall and thus a longer time to reach the goal, even though B is actually closer to the goal than A.

For the intermediate complexity limit, the squares in Fig. 2 (b) are greatly diminished compared to the lowest limit, due to the additional possible action sequences in the admissible sets. Here, a robot starting at B no longer waits at the wall. We note also that the squares are more evident near the goal state, since the robot has less time there to adapt and avoid waiting at a wall.

Finally, for the highest complexity limit of 30, the set of admissible action sequences has increased so much that the difference with the unconstrained case is minor, and the trajectories from A and B, already being optimal at the previous limit, have not changed.

V-C3 Setup 2: Goal in the middle

Setup 2 is identical to Setup 1, except that the goal state is placed in the middle of the room. Thus, the robot can no longer heavily rely on the wall dynamics as in Setup 1. This notably changes the outcome.

V-C4 Result

Similar to Setup 1, we plot the value functions for the different complexity limits in Fig. 3.

For the lowest complexity limit, the value function has higher values in a grid-like pattern, with the highest value in the middle of the room at the goal and high values at points that are a multiple of the stage length away from the goal. At all these points, the robot can (using the five sequences in the admissible set) reach and stay at the goal, resulting in a high total reward. Between those points are also states with higher value. Here, the robot can reach but not stay at the goal; instead, the robot oscillates back and forth, crossing the goal multiple times. This is seen for the robot starting at A in Fig. 3 (a). For the remaining states, the robot can never reach the goal state, due to the heavy complexity constraint. This is the case for the robot starting and staying at B. Thus, for low complexity limits, the robot may never reach the intended objective.

As we increase the complexity limit, the grid-like pattern expands to groups of states, due to the increased set of admissible action sequences. A robot starting at A now reaches and stays at the goal, as can be seen in Fig. 3 (b). The robot starting at B also reaches and stays at the goal, but takes a detour to be within the local complexity limit, reaching the upper wall and then going down to the goal. Thus, for low complexity limits, the robot may execute an action sequence which is locally of low complexity, but globally quite complex.

Finally, for the highest complexity limit, the value function is similar to the unconstrained case, except that the stage length together with the complexity constraint still partitions the space into regions whose size is set by the stage length, similar to Fig. 2 (a) (but in this case into diamond-like shapes). The robots starting at A and B now reach the goal quickly with the high complexity limit.

VI Conclusion

In this paper, we have considered complexity-aware planning for DFA with rewards as outputs. We first defined a complexity measure for deterministic policies based on Kolmogorov complexity. Kolmogorov complexity is used since it can detect computational regularities, a typical feature for optimal policies.

We then introduced a complexity-aware planning objective based on our complexity measure, yielding an explicit trade-off between a policy’s performance and complexity. It was proven that maximising this objective is non-trivial in the sense that dynamic programming is infeasible.

We presented two algorithms for obtaining low-complexity policies, COPS and SCAP. COPS finds a policy with low complexity among all optimal policies, following a two-step procedure. In the first step, the algorithm finds all optimal policies, without any complexity constraints. Dropping the complexity constraints, this step can be done using dynamic programming. The second step runs a uniform-cost search over all optimal policies, guided by a complexity-driven cost, favouring low-complexity policies. SCAP instead modifies the objective, penalising policies locally for executing complex manoeuvres. This is done by partitioning the horizon into stages and enforcing local complexity constraints over these stages, where the partition enables dynamic programming. We illustrated and evaluated our algorithms on a simple navigation task where a mobile robot tries to reach a certain goal state. Our algorithms yield low-complexity policies that concur with intuition.

Future work includes comparisons with other estimation methods of the Kolmogorov complexity and how uncertainty (e.g., stochasticity) and feedback (e.g., receding horizon) can be incorporated into the existing framework. Finally, another challenge is to tractably maximise the complexity-aware planning objective in the general case.

References

  • [1] D. Abel (2019) State abstraction as compression in apprenticeship learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3134–3142. Cited by: §I-C.
  • [2] J. Aslanides (2017) Universal reinforcement learning algorithms: survey and experiments. arXiv preprint arXiv:1705.10557. Cited by: §I-C.
  • [3] K. J. Åstrom (1970) Introduction to stochastic control theory. Mathematics in science and engineering, Vol. 70, Academic Press (eng). External Links: ISBN 0-12-065650-7 Cited by: §I-C.
  • [4] F. Attneave (1954) Some informational aspects of visual perception.. Psychological review 61 (3), pp. 183. Cited by: §I-A.
  • [5] H. B. Barlow (1961) Possible principles underlying the transformation of sensory messages. Sensory communication 1, pp. 217–234. Cited by: §I-A.
  • [6] R. Bellman (1957) Dynamic programming. Princeton University Press. Cited by: §I-B, §I-C, §III-C.
  • [7] O. Biza (2020) Learning discrete state abstractions with deep variational inference. arXiv preprint arXiv:2003.04300. Cited by: §I-C.
  • [8] M. Booker and A. Majumdar (2021) Learning to actively reduce memory requirements for robot control tasks. In Proceedings of the 3rd Conference on Learning for Dynamics and Control, pp. 125–137. External Links: Link Cited by: §I-C.
  • [9] N. Chmait (2016) Factors of collective intelligence: how smart are agent collectives?. In Proceedings of the Twenty-second European Conference on Artificial Intelligence, pp. 542–550. Cited by: §I-C, footnote 4.
  • [10] M. K. Cohen (2019) A strongly asymptotically optimal agent in general environments. arXiv preprint arXiv:1903.01021. Cited by: §I-C.
  • [11] F. Delmotte, T. R. Mehta, and M. Egerstedt (2008) A software tool for hybrid control. IEEE Robotics Automation Magazine 15 (1), pp. 87–95. Cited by: §I-C.
  • [12] M. B. Egerstedt and R. W. Brockett (2003) Feedback can reduce the specification complexity of motor programs. IEEE Transactions on Automatic Control 48 (2), pp. 213–223. Cited by: §I-C.
  • [13] R. Fox and N. Tishby (2016) Minimum-information LQG control part i: memoryless controllers. In 2016 IEEE 55th Conference on Decision and Control (CDC), pp. 5610–5616. Cited by: §I-C.
  • [14] A. Girard and G. J. Pappas (2011) Approximate bisimulation: a bridge between computer science and control theory. European Journal of Control 17 (5-6), pp. 568–578. Cited by: §I-C.
  • [15] J. E. Hopcroft (2006) Introduction to automata theory, languages, and computation (3rd edition). Addison-Wesley. External Links: ISBN 0321455363 Cited by: §I-B, §II-C.
  • [16] M. Hutter (2004) Universal artificial intelligence: sequential decisions based on algorithmic probability. Springer Science & Business Media. Cited by: §I-C.
  • [17] M. J. Kearns, Y. Mansour, and A. Y. Ng (2000) Approximate planning in large POMDPs via reusable trajectories. In Advances in Neural Information Processing Systems, pp. 1001–1007. Cited by: §I-C.
  • [18] D. T. Larsson, D. Maity, and P. Tsiotras (2020) An information-theoretic approach for path planning in agents with computational constraints. arXiv preprint arXiv:2005.09611. Cited by: §I-C.
  • [19] A. Lempel and J. Ziv (1976) On the complexity of finite sequences. IEEE Transactions on information theory 22 (1), pp. 75–81. Cited by: footnote 4.
  • [20] M. Li and P. Vitányi (2020) An introduction to kolmogorov complexity and its applications. 4th edition, Springer. Cited by: §I-A, §I-B, §I-C, §II-A, §II-A, §II-B, §II-B, §IV-A, §VII-A2, §VII-C1, Definition 2, Definition 3, Theorem 1, footnote 10.
  • [21] L. Liu, A. Chattopadhyay, and U. Mitra (2017) On exploiting spectral properties for solving MDP with large state space. In 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1213–1219. Cited by: §I-C.
  • [22] G. H. Mealy (1955) A method for synthesizing sequential circuits. The Bell System Technical Journal 34 (5), pp. 1045–1079. External Links: Document Cited by: §II-C.
  • [23] V. Mnih (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §I-A.
  • [24] J. M. O’Kane and D. A. Shell (2017) Concise planning and filtering: hardness and algorithms. IEEE Transactions on Automation Science and Engineering 14 (4), pp. 1666–1681. External Links: Document Cited by: §I-C.
  • [25] A. Pervan and T. D. Murphey (2021) Algorithmic design for embodied intelligence in synthetic cells. IEEE Transactions on Automation Science and Engineering 18 (3), pp. 864–875. External Links: Document Cited by: §I-C.
  • [26] B. Recht (2018) A tour of reinforcement learning: the view from continuous control. arXiv preprint arXiv:1806.09460. Cited by: §I-A.
  • [27] N. Roy and G. J. Gordon (2003) Exponential family PCA for belief compression in POMDPs. In Advances in Neural Information Processing Systems, pp. 1667–1674. Cited by: §I-C.
  • [28] J. Rubin, O. Shamir, and N. Tishby (2012) Trading value and information in MDPs. In Decision Making with Imperfect Decision Makers, pp. 57–74. Cited by: §I-C.
  • [29] S. Russell and P. Norvig (2009) Artificial intelligence: a modern approach. 3rd edition, Prentice Hall Press, USA. External Links: ISBN 0136042597 Cited by: §I-B, §IV-A, Lemma 2.
  • [30] S. Russell (2016) Rationality and intelligence: a brief update. In Fundamental issues of artificial intelligence, pp. 7–28. Cited by: §I-C.
  • [31] A. A. Rusu (2015) Policy distillation. arXiv preprint arXiv:1511.06295. Cited by: §I-C.
  • [32] D. Silver (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489. Cited by: §I-A.
  • [33] C. R. Sims (2018) Efficient coding explains the universal law of generalization in human perception. Science 360 (6389), pp. 652–656. Cited by: §I-A.
  • [34] T. Tanaka, P. M. Esfahani, and S. K. Mitter (2017) LQG control with minimum directed information: semidefinite programming approach. IEEE Transactions on Automatic Control 63 (1), pp. 37–52. Cited by: §I-C.
  • [35] T. Tanaka (2017) Transfer-entropy-regularized Markov decision processes. arXiv preprint arXiv:1708.09096. Cited by: §I-C.
  • [36] N. Tishby and D. Polani (2011) Information theory of decisions and actions. In Perception-action cycle, pp. 601–636. Cited by: §I-C.
  • [37] S. Udrescu (2020) AI Feynman 2.0: pareto-optimal symbolic regression exploiting graph modularity. arXiv preprint arXiv:2006.10782. Cited by: §I-A.
  • [38] H. S. Witsenhausen (1968) A counterexample in stochastic optimum control. SIAM Journal on Control 6 (1), pp. 131–147. Cited by: §I-C.
  • [39] Y. Ho and K. Chu (1972) Team decision theory and information structures in optimal control problems–part i. IEEE Transactions on Automatic Control 17 (1), pp. 15–22. Cited by: §I-C.
  • [40] H. Zenil (2018) A decomposition method for global evaluation of Shannon entropy and local estimations of algorithmic complexity. Entropy 20 (8), pp. 605. Cited by: §I-A, §II-B, §V, §VII-C1, §VII-C, footnote 4.

VII Appendix

VII-A Proofs

VII-A1 Proof of the observation in Example 1

We prove that (2) and (3) are equivalent for a sufficiently small trade-off parameter. For brevity, we write the total reward of an action sequence, subject to the dynamics and start state, in shorthand. Note that the statement is trivially true if the total reward is constant (over all action sequences), since (2) and (3) are then equivalent for every trade-off parameter. Hence, we may assume that the total reward is not constant. In particular, the difference between its maximum and its second highest value is then positive.

We start by showing that a maximiser of (2) is a minimiser of (3) given that is sufficiently small (to be specified). Towards this, let be a maximiser of (2). Assume by contradiction that . Fix any and observe that

(8)

Moreover, since is a maximiser of (2), we also have that

and therefore

(9)

Note that (9) does not hold for

(10)

Thus, for sufficiently small specified by (10), a maximiser of (2) must belong to , and since

we conclude that is a minimiser of (3).

We now conversely show that a minimiser of (3) is also a maximiser of (2) given that is sufficiently small specified by (10). More precisely, let be a minimiser of (3). Let be any maximiser of (2). By above, , hence , and since we get

from which we conclude that is a maximiser of (2).

By the above, we conclude that (2) and (3) are equivalent given that the trade-off parameter is sufficiently small, as specified by (10). This completes the proof.

VII-A2 Proof of Proposition 1

Proposition 1 follows immediately from the following lemma.

Lemma 1

For sufficiently large, there do not exist functions such that

(11)

holds for all sequences .

We may without loss of generality restrict ourselves to the binary case. We prove the lemma by contradiction. Let the additively optimal function be as in Definition 3 and let the string length be arbitrary. By a simple counting argument, there exists at least one binary string of this length whose complexity is at least its length (e.g., Theorem 2.2.1 in [20]). Also, the complement of this string, defined by inverting all its zeros and ones, has complexity close to it in the sense that

(12)

holds for some constant independent of the string and its length. To see this, let a Turing machine be given that takes a binary string, inverts all zeros and ones, and outputs the result. (Footnote 9: For brevity, we use, for a given Turing machine with corresponding partial computable function, the same notation for the machine and its function.) Let in turn a second Turing machine be given that, on a given input, simulates the universal Turing machine corresponding to the additively optimal function, obtains the output, and then feeds it as input to the first machine. Then any program describing the string via the additively optimal function also describes its complement via the second machine. Hence, considering the partial computable function corresponding to the second machine, we have by the invariance theorem,

(13)

where is independent of and . Furthermore, since , implies , we have again by the invariance theorem,

(14)

Combining equations (13) and (14) yields (12).

We now show that the complexity of the all-zeros string of a given length is, up to an additive constant independent of the length, at most logarithmic in that length. To see this, consider the binary string corresponding to the length and note that its own length is logarithmic in the represented number, up to a constant. (Footnote 10: This correspondence between natural numbers and binary strings is given by the length-increasing lexicographic ordering; see Chapter 1.4 in [20].) Consider the Turing machine that, given a binary string as input, calculates the corresponding integer and outputs that many zeros, and let its corresponding partial computable function be considered. By the invariance theorem,

(15)

Similarly, , for some constant independent of , where is the string consisting of  ones.

Assume now that (11) holds. Then,

This yields a contradiction for sufficiently large, which proves the lemma.

VII-A3 Proof of Proposition 2

We need the following lemma for uniform-cost search.

Lemma 2 ([29])

Assume the cost in a uniform-cost search is such that for any node and any child node of . Then the uniform-cost search terminates at a node with lowest cost.

[Proof of Proposition 2] The assumption in Proposition 2 corresponds to the assumption in Lemma 2. Hence, the uniform-cost search terminates at an action sequence with the lowest execution complexity, and since the returned sequence is optimal by construction, the result follows.

VII-B Details of SCAP

VII-B1 Uniform-cost search for finding the admissible sets

One can use a uniform-cost search to try to find the admissible sets for moderate stage lengths. More precisely, the uniform-cost search is conducted with nodes given by partial stage sequences, the cost given by their complexity, and a termination criterion given by the complexity limit plus some margin. If the complexity is non-decreasing along extensions throughout the search, then the margin can be set to zero and the search terminates when all sequences in the admissible set have been considered (cf. Proposition 2). However, since this assumption may not always hold throughout the search, one instead considers a positive margin (increasing the search, though without any formal guarantees of having found all elements of the admissible set at termination).

VII-B2 Action sequence extraction

The action sequence extraction for the hard-constrained version (6) is given by

for , starting at , with next state given by . In the end, at , the total action sequence has been obtained. The procedure for (4) is analogous.

VII-C The Estimation Method

We provide a brief overview of the method from [40] used to estimate the Kolmogorov complexity , followed by a discussion concerning potential bias this method might generate in the numerical evaluations in Section V.

VII-C1 Overview

The method of [40] estimates the complexity of a string by first partitioning it into substrings, all of a fixed block length (plus one possible remainder substring of smaller length). Denoting a generic such substring and counting its number of occurrences in the partition, the complexity is then estimated by

(16)

where the per-block term approximates the block's complexity based on the coding theorem in algorithmic probability [20]; we refer to [40] for details. In words, the estimate is the total complexity for generating all the distinct substrings, plus the code length needed for specifying how frequent each substring is, which can be encoded in a codeword whose length grows logarithmically with the number of occurrences.
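In the notation of [40], with x_i ranging over the distinct blocks, n_i the number of occurrences of x_i, and CTM the coding-theorem-based estimate of a block's complexity, the estimate (16) has the standard block-decomposition form (the symbols here are the usual ones and are assumed to correspond):

    \hat{K}(x) \;=\; \sum_{i} \Big( \mathrm{CTM}(x_i) \;+\; \log_2 n_i \Big).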