# The relationship between dynamic programming and active inference: the discrete, finite-horizon case

Active inference is a normative framework for generating behaviour based upon the free energy principle, a theory of self-organisation. This framework has been successfully used to solve reinforcement learning and stochastic control problems, yet the formal relation between active inference and reward maximisation has not been fully explicated. In this paper, we consider the relation between active inference and dynamic programming under the Bellman equation, which underlies many approaches to reinforcement learning and control. We show that, on partially observable Markov decision processes, dynamic programming is a limiting case of active inference. In active inference, agents select actions to minimise expected free energy. In the absence of ambiguity about states, this reduces to matching expected states with a target distribution encoding the agent's preferences. When target states correspond to rewarding states, this maximises expected reward, as in reinforcement learning. When states are ambiguous, active inference agents will choose actions that simultaneously minimise ambiguity. This allows active inference agents to supplement their reward maximising (or exploitative) behaviour with novelty-seeking (or exploratory) behaviour. This clarifies the connection between active inference and reinforcement learning, and how both frameworks may benefit from each other.


## 1 Introduction

Active inference is a normative framework for explaining behaviour under the free energy principle—a global theory of self-organisation in the neurosciences (fristonFreeenergyPrincipleUnified2010; parrMarkovBlanketsInformation2020; fristonFreeEnergyPrinciple2019a; fristonFreeEnergyPrinciple2006)—by assuming that the brain performs approximate Bayesian inference (senguptaNeuronalGaugeTheory2016; wainwrightGraphicalModelsExponential2007; bishopPatternRecognitionMachine2006; jordanIntroductionVariationalMethods1998). Within the active inference framework, there is a collection of belief updating schemes or algorithms for modelling perception, learning, and behaviour in the context of both continuous and discrete state spaces (fristonGraphicalBrainBelief2017; fristonSophisticatedInference2020). Within each scheme, active inference treats agents as systems that self-organise to some (non-equilibrium) steady-state (dacostaActiveInferenceDiscrete2020; pavliotisStochasticProcessesApplications2014); that is, an active inference agent acts upon the world so that its predicted states match a target distribution encoding its characteristic or preferred states. Building active inference agents requires: equipping the agent with a (generative) model of the environment; fitting the model to observations through approximate Bayesian inference by minimising variational free energy (bealVariationalAlgorithmsApproximate2003; wainwrightGraphicalModelsExponential2007; bishopPatternRecognitionMachine2006; jordanIntroductionVariationalMethods1998); and selecting actions that minimise expected free energy, a quantity that can be decomposed into risk (i.e., the divergence between predicted and preferred states) and information gain, leading to context-specific combinations of exploratory and exploitative behaviour (parrMarkovBlanketsInformation2020; schwartenbeckComputationalMechanismsCuriosity2019).
Exploitative behaviour ensures that predicted states match preferred states in a probabilistic sense or in the sense of maximising expected reward (dacostaActiveInferenceDiscrete2020). This framework has been used to simulate intelligent behaviour in neuroscience (parrComputationalPharmacologyOculomotion2019; adamsComputationalAnatomyPsychosis2013; kaplanPlanningNavigationActive2018; cullenActiveInferenceOpenAI2018; mirzaHumanVisualExploration2018; mirzaImpulsivityActiveInference2019), machine learning (millidgeDeepActiveInference2020; tschantzReinforcementLearningActive2020; sajidActiveInferenceDemystified2020; ueltzhofferDeepActiveInference2018; tschantzScalingActiveInference2019; millidgeImplementingPredictiveProcessing2019; catalBayesianPolicySelection2019; fountasDeepActiveInference2020; catalLearningPerceptionPlanning2020a) and robotics (catalDeepActiveInference2020; lanillosRobotSelfOther2020; sancaktarEndtoEndPixelBasedDeep2020; pio-lopezActiveInferenceRobot2016; pezzatoNovelAdaptiveController2020). Given the prevalence of reinforcement learning (RL) and stochastic optimal control in these fields, it is useful to understand the relationship between active inference and these established approaches to modelling purposeful behaviour.

Stochastic control traditionally calls on strategies that evaluate different actions on a carefully handcrafted forward model of stochastic dynamics and then select the reward-maximising action. RL has a broader and more ambitious scope. Loosely speaking, RL is a collection of methods that learn reward-maximising actions from data and seek to maximise reward in the long run. Many RL algorithms are model-free, which means that agents learn a reward-maximising state-action mapping, based on updating cached state-action pair values, through initially random actions that do not consider future state transitions. In contrast, model-based RL algorithms attempt to extend model-based control approaches by learning the dynamics and reward function from data. Because RL is a data-driven field, particular algorithms are selected based on how well they perform on benchmark problems. This has yielded a zoo of diverse algorithms, many designed to solve specific problems and each with their own strengths and limitations. This makes RL difficult to characterise as a whole. Thankfully, many RL algorithms and approaches to solving control problems originate from or otherwise build upon dynamic programming under the Bellman equation (bellmanAppliedDynamicProgramming2015; bertsekasStochasticOptimalControl1996), a collection of methods that maximise cumulative reward (although this often becomes computationally intractable in real-world problems) (bartoReinforcementLearningIntroduction1992). In what follows, we consider the relationship between active inference and dynamic programming, and discuss its implications in the broader context of RL.

This leads us to discuss the apparent differences between active inference and RL. First, while RL agents select actions to maximise cumulative reward (e.g., the solution to the Bellman equation (bellmanAppliedDynamicProgramming2015)), active inference agents select actions so that predicted states match a target distribution encoding preferred states. In fact, active inference also builds upon previous work on the duality between inference and control (todorovGeneralDualityOptimal2008; kappenOptimalControlGraphical2012; rawlikStochasticOptimalControl2013; toussaintRobotTrajectoryOptimization2009) to solve motor control problems via approximate inference (fristonActiveInferenceAgency2012; millidgeRelationshipActiveInference2020; fristonReinforcementLearningActive2009). Treating the control problem as an inference problem in this fashion, is also known as planning as inference (attiasPlanningProbabilisticInference2003; botvinickPlanningInference2012). Second, active inference agents always embody a generative (i.e., forward) model of their environment, while RL comprises both model-based algorithms as well as simpler model-free algorithms. Third, modelling exploratory behaviour—which can improve reward maximisation in the long run (especially in volatile environments)—is implemented differently in the two approaches. In most cases RL implements a simple form of exploration by incorporating randomness in decision-making (tokicValueDifferenceBasedExploration2011; wilsonHumansUseDirected2014), where the level of randomness may or may not change over time as a function of uncertainty. In other cases, RL incorporates ad-hoc "information bonus" terms to build in goal-directed exploratory drives. In contrast, goal-directed exploration emerges naturally within active inference through interactions between the reward and information gain terms in the expected free energy (schwartenbeckComputationalMechanismsCuriosity2019; dacostaActiveInferenceDiscrete2020). 
Although not covered in detail here, active inference can accommodate a principled form of random exploration (a.k.a. matching behaviour) by sampling actions from a posterior belief distribution over actions, whose precision is itself optimised—such that action selection becomes more random when the expected outcomes of actions are more uncertain (schwartenbeckDopaminergicMidbrainEncodes2015). Finally, traditional RL approaches have usually focused on cases where agents know their current state with certainty, and thus eschew uncertainty in state estimation (although RL schemes can be supplemented with Bayesian state-estimation algorithms, leading to Bayesian RL). In contrast, active inference integrates state-estimation, learning, decision-making, and motor control under the single objective of minimising free energy (dacostaActiveInferenceDiscrete2020).

Despite these well-known differences, the relationship between active inference and RL, and particularly between the objectives of free energy minimisation and reward maximisation, has not been thoroughly explicated. Their relationship has become increasingly important to understand, as a growing body of research has begun to 1) compare the performance of active inference and RL models in simulated environments (sajidActiveInferenceDemystified2020; cullenActiveInferenceOpenAI2018; millidgeDeepActiveInference2020), 2) apply active inference to study human behaviour on reward learning tasks (smithImpreciseActionSelection2020; smithGreaterDecisionUncertainty2020; smithActiveInferenceModel2020), and 3) consider the complementary predictions and interpretations they each offer in computational neuroscience, psychology, and psychiatry (schwartenbeckComputationalMechanismsCuriosity2019; schwartenbeckDopaminergicMidbrainEncodes2015; tschantzLearningActionorientedModels2020; cullenActiveInferenceOpenAI2018). In what follows, we try to clarify the relationship between RL and active inference and identify the conditions under which they are equivalent.

Despite apparent differences, we show that there is a formal relationship between active inference and RL that is most clearly seen with model-based RL. Specifically, we will see that dynamic programming under the Bellman equation is a limiting case of active inference on finite-horizon partially observable Markov decision processes (POMDPs). Equivalently, we show that a limiting case of active inference maximises reward on finite-horizon POMDPs. However, active inference also covers scenarios that do not involve reward maximisation, as it can be used to solve any problem that can be cast in terms of reaching and maintaining a target distribution on a suitable state-space (see (dacostaActiveInferenceDiscrete2020, Appendix B)). In brief, active inference reduces to dynamic programming when the target distribution is a (uniform mixture of) Dirac distribution(s) over reward-maximising trajectories. Note that, in infinite horizon POMDPs, active inference will not necessarily furnish the solution to the Bellman equation, as it plans only up to a finite temporal horizon.

In what follows, we first review dynamic programming on finite-horizon Markov Decision Processes (MDPs; Section 2). Next, we introduce active inference for finite-horizon MDPs (Section 3). Third, we demonstrate how active inference reduces to dynamic programming in a limiting case (Section 4). Finally, we show how these results generalise to POMDPs (Section 4.4). We conclude with a discussion of the implications of these results and future directions (Section 5).

## 2 Dynamic programming on finite horizon MDPs

In this section, we recall the fundamentals of discrete-time dynamic programming.

### 2.1 Basic definitions

Markov decision processes (MDPs) are a class of models specifying environmental dynamics widely used in dynamic programming, model-based RL, and more broadly in engineering and artificial intelligence (bartoReinforcementLearningIntroduction1992; stoneArtificialIntelligenceEngines2019). They have been used to simulate sequential decision-making tasks with the objective of maximising a reward or utility function. An MDP specifies the environmental dynamics in discrete time and space given the actions pursued by an agent.

[Finite horizon MDP] A finite horizon MDP comprises the following tuple:

• $S$, a finite set of states.

• $T = \{0, 1, \dots, T\}$, a finite set which stands for discrete time; $T$ is the temporal horizon or planning horizon.

• $A$, a finite set of actions.

• $P(s_{\tau+1} = s' \mid s_\tau = s, a_\tau = a)$, the probability that action $a$ in state $s$ at time $\tau$ will lead to state $s'$ at time $\tau + 1$. Here $s_\tau$, $\tau \in T$, are random variables over $S$, which correspond to the state being occupied at time $\tau$.

• $P(s_0 = s)$, which specifies the probability of being at state $s$ at the start of the trial.

• $R(s)$, the finite reward received by the agent when at state $s$.

The dynamics afforded by a finite horizon MDP can be written globally as a probability distribution over trajectories $s_{0:T}$, given a sequence of actions $a_{0:T-1}$. This factorises as follows:

$$P(s_{0:T} \mid a_{0:T-1}) = P(s_0) \prod_{\tau=1}^{T} P(s_\tau \mid s_{\tau-1}, a_{\tau-1}).$$

These MDP dynamics can be regarded as a Markov chain on the state-space $S$, given a sequence of actions (see Figure 1).
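To make the factorisation concrete, the following is a minimal sketch that samples a state trajectory from a small MDP, given a fixed action sequence. The two-state, two-action MDP below is hypothetical and used purely for illustration (its transition probabilities are not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical MDP: P[a, s, s'] = probability that action a in state s leads to s'.
P = np.array([
    [[0.9, 0.1],   # action 0
     [0.2, 0.8]],
    [[0.5, 0.5],   # action 1
     [0.1, 0.9]],
])
P0 = np.array([1.0, 0.0])  # P(s_0): the trial starts in state 0

def sample_trajectory(P0, P, actions):
    """Sample s_{0:T} from P(s_0) * prod_tau P(s_tau | s_{tau-1}, a_{tau-1})."""
    s = rng.choice(len(P0), p=P0)
    states = [s]
    for a in actions:
        s = rng.choice(P.shape[-1], p=P[a, s])
        states.append(s)
    return states

traj = sample_trajectory(P0, P, actions=[0, 1, 1])
```

Sampling forward in this way is exactly a Markov chain on the state-space, conditioned on the action sequence.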

[On the definition of reward] More generally, the reward function can be taken to depend on the previous action and previous state: $R(s', a, s)$ is the reward received after transitioning from state $s$ to state $s'$ due to action $a$ (bartoReinforcementLearningIntroduction1992; stoneArtificialIntelligenceEngines2019). However, given an MDP with such a reward function, we can recover our simplified setting without loss of generality. We define a new MDP where the states comprise the previous action, previous state, and current state of the original MDP. By inspection, the resulting reward function on the new MDP depends only on the current state.

[Admissible actions] In general, it is possible that not all actions are available at every state. Thus, $A(s) \subseteq A$ is defined to be the finite set of (allowable) actions available from state $s$. All the results in this paper concerning MDPs can be extended to this setting.

Given an MDP, the agent transitions from one state to the next as time unfolds. The transitions depend on the agent’s actions. The goal under RL is to select actions that maximise expected reward. To formalise what it means to choose actions, we introduce the notion of a state-action policy.

[State-action policy] A state-action policy $\Pi$ is a probability distribution over actions that depends on the state the agent occupies, and on time. Explicitly, it is a function

$$\Pi : A \times S \times T \to [0, 1], \quad (a, s, t) \mapsto \Pi(a \mid s, t),$$

that satisfies

$$\forall (s, t) \in S \times T : \sum_{a \in A} \Pi(a \mid s, t) = 1.$$

When the time $t$ is clear from context, we will write $\Pi(a \mid s)$. Additionally, the action at time $T$ is redundant, as no further reward can be reaped from the environment. Therefore, one often specifies state-action policies only up to time $T - 1$. This is equivalent to defining a state-action policy as $\Pi : A \times S \times (T \setminus \{T\}) \to [0, 1]$.

The state-action policy – as defined here – is stochastic and can be regarded as a generalisation of a deterministic policy that assigns a probability of $1$ to one of the available actions, and $0$ otherwise (putermanMarkovDecisionProcesses2014).

[Conflicting terminologies: policy in active inference] In active inference, a policy is defined as a sequence of actions indexed in time. To avoid terminological confusion, we use sequence of actions to denote a policy under active inference.

As previously mentioned, the goal for an RL agent at time $t$ is to choose actions that maximise future cumulative reward:

$$R(s_{t+1:T}) := \sum_{\tau=t+1}^{T} R(s_\tau).$$

More precisely, the goal is to follow a state-action policy $\Pi$ that maximises the state-value function

$$v_\Pi(s, t) := \mathbb{E}_\Pi[R(s_{t+1:T}) \mid s_t = s].$$

The state-value function scores the expected cumulative reward if the agent pursues state-action policy $\Pi$ from state $s$ at time $t$. When $\Pi$ is clear from context, we will often write $v(s, t)$. Loosely speaking, we will call the expected reward the return.

[Notation: $\mathbb{E}_\Pi$] Whilst standard in RL (bartoReinforcementLearningIntroduction1992; stoneArtificialIntelligenceEngines2019), the notation

$$\mathbb{E}_\Pi[R(s_{t+1:T}) \mid s_t = s]$$

can be misleading. It denotes the expected reward, under the transition probabilities of the MDP, for a particular state-action policy $\Pi$:

$$\mathbb{E}_{P(s_{t+1:T} \mid a_{t:T-1}, s_t = s)\, \Pi(a_{t:T-1} \mid s_{t+1:T-1}, s_t = s)}[R(s_{t+1:T})].$$

It is important to keep this correspondence in mind, as we will use both notations depending on context.
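On a small MDP, the state-value function of a fixed state-action policy can be evaluated exactly with the backward recursion $v_\Pi(s, t) = \sum_a \Pi(a \mid s, t) \sum_{s'} P(s' \mid s, a)\,[R(s') + v_\Pi(s', t+1)]$, with $v_\Pi(\cdot, T) = 0$. Here is a sketch on a hypothetical two-state, two-action MDP (all numbers illustrative):

```python
import numpy as np

# Hypothetical MDP: P[a, s, s'], reward R(s), horizon T.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.1, 0.9]],   # action 1
])
R = np.array([0.0, 1.0])        # state 1 is the rewarding state
T = 3

def policy_value(Pi, P, R, T):
    """Exact v_Pi(s, t) = E_Pi[ sum_{tau=t+1}^T R(s_tau) | s_t = s ]."""
    nA, nS = P.shape[0], P.shape[1]
    v = np.zeros((nS, T + 1))   # v(s, T) = 0: no reward left to reap
    for t in range(T - 1, -1, -1):
        for s in range(nS):
            v[s, t] = sum(
                Pi[a, s, t] * (P[a, s] @ (R + v[:, t + 1]))
                for a in range(nA)
            )
    return v

# Uniformly random state-action policy Pi(a | s, t)
Pi = np.full((2, 2, T), 0.5)
v = policy_value(Pi, P, R, T)
```

For instance, at the penultimate time step the recursion reduces to a one-step expected reward, $v(s, T-1) = \sum_a \Pi(a \mid s, T-1) \sum_{s'} P(s' \mid s, a) R(s')$.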

[Temporal discounting] In infinite horizon MDPs (i.e., when the temporal horizon is infinite), we often add a temporal discounting term $\gamma \in (0, 1)$ (bertsekasStochasticOptimalControl1996; bartoReinforcementLearningIntroduction1992; stoneArtificialIntelligenceEngines2019) such that the infinite sum

$$v_\Pi(s, t) := \mathbb{E}_\Pi\!\left[\sum_{\tau=t+1}^{\infty} \gamma^{\tau - t - 1} R(s_\tau) \,\middle|\, s_t = s\right]$$

converges. However, under the finite temporal horizons considered here, the expected reward converges regardless of $\gamma$, which eschews the need to include temporal discounting when evaluating expected reward. Thus, in what follows, we set $\gamma = 1$.

We want to rank state-action policies in terms of their expected reward. To do this, we introduce a partial ordering on state-action policies, such that a state-action policy is better than another when it yields a higher expected reward in any situation:

$$\Pi \geq \Pi' \iff \forall (s, t) \in S \times T : v_\Pi(s, t) \geq v_{\Pi'}(s, t).$$

Similarly, a state-action policy $\Pi$ is strictly better than $\Pi'$ if

$$\Pi > \Pi' \iff \Pi \geq \Pi' \ \text{ and } \ \exists (s, t) \in S \times T : v_\Pi(s, t) > v_{\Pi'}(s, t).$$

### 2.2 Bellman optimal state-action policies

[Bellman optimal state-action policy] A state-action policy $\Pi$ is said to be Bellman optimal if, and only if, $\Pi \geq \Pi'$ for every state-action policy $\Pi'$. That is, if it maximises the state-value function for any state at any time.

In other words, a state-action policy is Bellman optimal if it is better than all alternatives. It is important to show that this concept is not vacuous. For this, we prove a classical result (putermanMarkovDecisionProcesses2014; bertsekasStochasticOptimalControl1996): [Existence of Bellman optimal state-action policies] Given a finite horizon MDP as specified in Definition 2.1, there exists a Bellman optimal state-action policy $\Pi^*$.

Note that uniqueness of the Bellman optimal state-action policy is not implied by Proposition 2.2. Indeed, it is a general feature of MDPs that there can be multiple Bellman optimal state-action policies (putermanMarkovDecisionProcesses2014; bertsekasStochasticOptimalControl1996).

###### Proof.

Note that a Bellman optimal state-action policy is a maximal element according to the partial ordering $\geq$. Existence thus consists of a simple application of Zorn's lemma. Zorn's lemma states that if any increasing chain of state-action policies

$$\Pi_1 \leq \Pi_2 \leq \Pi_3 \leq \dots \tag{1}$$

has an upper bound that is itself a state-action policy, then there is a maximal element $\Pi^*$.

Given the chain (1), we construct an upper bound. We enumerate the finite set $A \times S \times T$ as $(\alpha_1, \sigma_1, t_1), (\alpha_2, \sigma_2, t_2), \dots$ Then the sequence

$$\Pi_n(\alpha_1 \mid \sigma_1, t_1), \quad n = 1, 2, 3, \dots$$

is bounded within $[0, 1]$. By the Bolzano-Weierstrass theorem, there exists a subsequence of $(\Pi_n)_n$ that converges at $(\alpha_1, \sigma_1, t_1)$. Evaluated at $(\alpha_2, \sigma_2, t_2)$, this subsequence is again a bounded sequence, and by Bolzano-Weierstrass it has a further convergent subsequence. We repeatedly take subsequences until we obtain one that converges at every element of $A \times S \times T$. To ease notation, call the resulting subsequence $(\Pi_m)_m$.

With this, we define $\hat{\Pi} := \lim_{m \to \infty} \Pi_m$. It is straightforward to see that $\hat{\Pi}$ is a state-action policy:

$$\begin{aligned}
\hat{\Pi}(\alpha \mid \sigma, t) &= \lim_{m \to \infty} \Pi_m(\alpha \mid \sigma, t) \in [0, 1], && \forall (\alpha, \sigma, t) \in A \times S \times T, \\
\sum_{\alpha \in A} \hat{\Pi}(\alpha \mid \sigma, t) &= \lim_{m \to \infty} \sum_{\alpha \in A} \Pi_m(\alpha \mid \sigma, t) = 1, && \forall (\sigma, t) \in S \times T.
\end{aligned}$$

To show that $\hat{\Pi}$ is an upper bound, take any $\Pi_n$ in the original chain of state-action policies (1). By the definition of an increasing chain, there exists an index $M$ such that $\Pi_m \geq \Pi_n$ for all $m \geq M$. Since limits commute with finite sums, $v_{\hat{\Pi}}(s, t) = \lim_{m \to \infty} v_{\Pi_m}(s, t) \geq v_{\Pi_n}(s, t)$ for any $(s, t) \in S \times T$, so $\hat{\Pi} \geq \Pi_n$. Thus, by Zorn's lemma there exists a Bellman optimal state-action policy $\Pi^*$. ∎

Now that we know that Bellman optimal state-action policies exist, we can characterise them recursively as a return-maximising action followed by a Bellman optimal state-action policy.

[Characterisation of Bellman optimal state-action policies] For a state-action policy $\Pi$, the following are equivalent:

1. $\Pi$ is Bellman optimal.

2. $\Pi$ is:

    a. Bellman optimal when restricted to times $1, \dots, T$. In other words, for any state-action policy $\Pi'$,

    $$v_\Pi(s, t) \geq v_{\Pi'}(s, t), \quad \forall (s, t) \in S \times (T \setminus \{0\}).$$

    b. At time $t = 0$, such that $\Pi$ selects actions that maximise return:

    $$\Pi(a \mid s, 0) > 0 \iff a \in \arg\max_{a \in A} \mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = a], \quad \forall s \in S. \tag{2}$$

Note that this characterisation offers a recursive way to construct Bellman optimal state-action policies by backwards induction (i.e., by successively selecting the best action), as specified by Equation 2, starting from $t = T - 1$ and inducting backwards (putermanMarkovDecisionProcesses2014).

###### Proof.

($1 \Rightarrow 2$) We only need to show assertion (b). By contradiction, suppose that there exist $s \in S$ and $\alpha \in A$ such that $\Pi(\alpha \mid s, 0) > 0$ and

$$\mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = \alpha] < \max_{a \in A} \mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = a].$$

We let $\alpha'$ be the Bellman optimal action at state $s$ and time $0$, defined as

$$\alpha' := \arg\max_{a \in A} \mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = a].$$

Then, we let $\Pi'$ be the same state-action policy as $\Pi$, except that $\Pi'(\cdot \mid s, 0)$ assigns $\alpha'$ deterministically. Then,

$$v_\Pi(s, 0) = \sum_{a \in A} \mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = a] \, \Pi(a \mid s, 0) < \mathbb{E}_{\Pi'}[R(s_{1:T}) \mid s_0 = s, a_0 = \alpha'] = v_{\Pi'}(s, 0).$$

So $\Pi$ is not Bellman optimal, which is a contradiction.

($2 \Rightarrow 1$) We only need to show that $\Pi$ maximises $v(\cdot, 0)$. By contradiction, suppose there exist a state-action policy $\Pi'$ and a state $s$ such that

$$v_\Pi(s, 0) < v_{\Pi'}(s, 0).$$

By (b), the left-hand side equals

$$\max_{a \in A} \mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = a].$$

Unpacking the expression on the right-hand side:

$$\begin{aligned}
\sum_{a \in A} \mathbb{E}_{\Pi'}[R(s_{1:T}) \mid s_0 = s, a_0 = a] \, \Pi'(a \mid s, 0)
&= \sum_{a \in A} \sum_{\sigma \in S} \mathbb{E}_{\Pi'}[R(s_{1:T}) \mid s_1 = \sigma] \, P(s_1 = \sigma \mid s_0 = s, a_0 = a) \, \Pi'(a \mid s, 0) \\
&= \sum_{a \in A} \sum_{\sigma \in S} \{\mathbb{E}_{\Pi'}[R(s_{2:T}) \mid s_1 = \sigma] + R(\sigma)\} \, P(s_1 = \sigma \mid s_0 = s, a_0 = a) \, \Pi'(a \mid s, 0) \\
&= \sum_{a \in A} \sum_{\sigma \in S} \{v_{\Pi'}(\sigma, 1) + R(\sigma)\} \, P(s_1 = \sigma \mid s_0 = s, a_0 = a) \, \Pi'(a \mid s, 0)
\end{aligned} \tag{3}$$

Since $\Pi$ is Bellman optimal when restricted to times $1, \dots, T$ (assertion (a)), we have $v_{\Pi'}(\sigma, 1) \leq v_\Pi(\sigma, 1)$. Therefore,

$$\sum_{a \in A} \sum_{\sigma \in S} \{v_{\Pi'}(\sigma, 1) + R(\sigma)\} \, P(s_1 = \sigma \mid s_0 = s, a_0 = a) \, \Pi'(a \mid s, 0) \leq \sum_{a \in A} \sum_{\sigma \in S} \{v_\Pi(\sigma, 1) + R(\sigma)\} \, P(s_1 = \sigma \mid s_0 = s, a_0 = a) \, \Pi'(a \mid s, 0).$$

Repeating the steps above (3), but in reverse order, yields

$$\sum_{a \in A} \mathbb{E}_{\Pi'}[R(s_{1:T}) \mid s_0 = s, a_0 = a] \, \Pi'(a \mid s, 0) \leq \sum_{a \in A} \mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = a] \, \Pi'(a \mid s, 0).$$

However,

$$\sum_{a \in A} \mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = a] \, \Pi'(a \mid s, 0) \leq \max_{a \in A} \mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = a] = v_\Pi(s, 0),$$

so $v_{\Pi'}(s, 0) \leq v_\Pi(s, 0)$, which contradicts our assumption. ∎

### 2.3 Backward induction

Proposition 2.2 suggests a straightforward recursive algorithm to construct Bellman optimal state-action policies known as backward induction (putermanMarkovDecisionProcesses2014). Backward induction entails reasoning backwards in time, from a goal state at the end of a problem or solution, to determine a sequence of Bellman optimal actions. It proceeds by first considering the last time at which a decision might be made and choosing what to do in any situation at that time. Using this information, one can then determine what to do at the second-to-last decision time. This process continues backwards until one has determined the best action for every possible situation or state at every point in time. This algorithm has a long history. It was developed by the German mathematician Zermelo in 1913 to prove that chess has Bellman optimal strategies (zermeloUberAnwendungMengenlehre1913). In stochastic control, backward induction is one of the main methods for solving the Bellman equation (mirandaAppliedComputationalEconomics2002; addaDynamicEconomicsQuantitative2003; sargentOptimalControl2000). In game theory, the same method is used to compute sub-game perfect equilibria in sequential games (fudenbergGameTheory1991; watson2002strategy).

[Backward induction: construction of Bellman optimal state-action policies] The backward induction scheme

$$\begin{aligned}
\Pi(a \mid s, T-1) > 0 &\iff a \in \arg\max_{a \in A} \mathbb{E}[R(s_T) \mid s_{T-1} = s, a_{T-1} = a], && \forall s \in S \\
\Pi(a \mid s, T-2) > 0 &\iff a \in \arg\max_{a \in A} \mathbb{E}_\Pi[R(s_{T-1:T}) \mid s_{T-2} = s, a_{T-2} = a], && \forall s \in S \\
&\ \ \vdots \\
\Pi(a \mid s, 0) > 0 &\iff a \in \arg\max_{a \in A} \mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = a], && \forall s \in S
\end{aligned} \tag{4}$$

defines a Bellman optimal state-action policy $\Pi$. Furthermore, this characterisation is complete: all Bellman optimal state-action policies satisfy the backward induction relation (4).

Intuitively, this recursive scheme (4) consists of planning backwards, by starting from the end goal and working out the actions needed to achieve the goal. To give a concrete example of this kind of planning, the present scheme would consider the example actions below in the following order:

1. Desired goal: I would like to go to the grocery store,

2. Intermediate action: I need to drive to the store,

3. Current best action: I should put my shoes on.
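The backward induction scheme can be sketched in a few lines for a small MDP. The two-state, two-action MDP below is hypothetical (illustrative numbers only); the sketch returns a deterministic Bellman optimal state-action policy and the associated optimal state-value function:

```python
import numpy as np

# Hypothetical MDP: P[a, s, s'], reward R(s), horizon T.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.1, 0.9]],   # action 1
])
R = np.array([0.0, 1.0])        # state 1 is the rewarding state
T = 3

def backward_induction(P, R, T):
    """Choose, at each time t from T-1 down to 0, an action maximising
    E[R(s_{t+1}) + v*(s_{t+1}, t+1) | s_t = s, a_t = a]."""
    nA, nS, _ = P.shape
    v = np.zeros((nS, T + 1))           # v*(s, T) = 0
    pi = np.zeros((nS, T), dtype=int)   # pi[s, t] = optimal action
    for t in range(T - 1, -1, -1):
        # q[a, s]: return of taking a in s at time t, then acting optimally
        q = np.einsum('asn,n->as', P, R + v[:, t + 1])
        pi[:, t] = q.argmax(axis=0)
        v[:, t] = q.max(axis=0)
    return pi, v

pi_star, v_star = backward_induction(P, R, T)
```

In this toy problem, action 1 dominates at every state and time (it always makes the rewarding state more likely), so the recovered policy is constant; in general the optimal action may vary with both state and time-to-horizon.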

###### Proof of Proposition 2.3.
• We first prove, by induction on the temporal horizon $T$, that state-action policies defined as in (4) are Bellman optimal.

When $T = 1$,

$$\Pi(a \mid s, 0) > 0 \iff a \in \arg\max_a \mathbb{E}[R(s_1) \mid s_0 = s, a_0 = a], \quad \forall s \in S$$

is a Bellman optimal state-action policy, as it maximises the total reward possible in the MDP.

Let $T$ be finite and suppose that the Proposition holds for MDPs with a temporal horizon of $T - 1$. This means that

$$\begin{aligned}
\Pi(a \mid s, T-1) > 0 &\iff a \in \arg\max_a \mathbb{E}[R(s_T) \mid s_{T-1} = s, a_{T-1} = a], && \forall s \in S \\
\Pi(a \mid s, T-2) > 0 &\iff a \in \arg\max_a \mathbb{E}_\Pi[R(s_{T-1:T}) \mid s_{T-2} = s, a_{T-2} = a], && \forall s \in S \\
&\ \ \vdots \\
\Pi(a \mid s, 1) > 0 &\iff a \in \arg\max_a \mathbb{E}_\Pi[R(s_{2:T}) \mid s_1 = s, a_1 = a], && \forall s \in S
\end{aligned}$$

is a Bellman optimal state-action policy on the MDP restricted to times $1$ to $T$. Therefore, since

$$\Pi(a \mid s, 0) > 0 \iff a \in \arg\max_a \mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = a], \quad \forall s \in S,$$

Proposition 2.2 allows us to deduce that $\Pi$ is Bellman optimal.

• We now show that any Bellman optimal state-action policy satisfies the backward induction relation (4).

Suppose by contradiction that there exists a state-action policy $\Pi$ that is Bellman optimal but does not satisfy (4). Say there exist $(a, s, t) \in A \times S \times (T \setminus \{T\})$ such that

$$\Pi(a \mid s, t) > 0 \quad \text{and} \quad a \notin \arg\max_{\alpha \in A} \mathbb{E}_\Pi[R(s_{t+1:T}) \mid s_t = s, a_t = \alpha].$$

This implies

$$\mathbb{E}_\Pi[R(s_{t+1:T}) \mid s_t = s, a_t = a] < \max_{\alpha \in A} \mathbb{E}_\Pi[R(s_{t+1:T}) \mid s_t = s, a_t = \alpha].$$

Let $\alpha' := \arg\max_{\alpha \in A} \mathbb{E}_\Pi[R(s_{t+1:T}) \mid s_t = s, a_t = \alpha]$. Let $\Pi'$ be a state-action policy such that $\Pi'(\cdot \mid s, t)$ assigns $\alpha'$ deterministically, and such that $\Pi' = \Pi$ otherwise. Then we can contradict the Bellman optimality of $\Pi$ as follows:

$$v_\Pi(s, t) = \mathbb{E}_\Pi[R(s_{t+1:T}) \mid s_t = s] = \sum_{\alpha \in A} \mathbb{E}_\Pi[R(s_{t+1:T}) \mid s_t = s, a_t = \alpha] \, \Pi(\alpha \mid s, t) < \mathbb{E}_\Pi[R(s_{t+1:T}) \mid s_t = s, a_t = \alpha'] = v_{\Pi'}(s, t). \quad ∎$$

This concludes our discussion of dynamic programming on finite horizon MDPs.

## 3 Active inference on finite horizon MDPs

We now turn to active inference agents on finite horizon MDPs.

Here, the agent’s generative model of its environment is given by the previously defined finite horizon MDP (see Definition 2.1). This means that we assume the transition probabilities are known. We do not consider the general case where the transitions have to be learned, but comment on it in the discussion (also see Appendix A).

In what follows, we fix a time $t$ and suppose that the agent has been in states $s_{0:t}$. To ease notation, we let $\vec{s} := s_{t+1:T}$ be the future states and $\vec{a} := a_{t:T-1}$ the future actions.

Let $Q$ be the predictive distribution of the agent. That is, the distribution specifying the next actions and states that the agent encounters and pursues when at state $s_t$:

$$Q(\vec{s}, \vec{a} \mid s_t) := \prod_{\tau=t}^{T-1} Q(s_{\tau+1} \mid a_\tau, s_\tau)\, Q(a_\tau \mid s_\tau).$$

### 3.1 Perception as inference

In active inference, perception entails inferences about future, past, and current states given observations and a sequence of actions. This is done through variational Bayesian inference by minimising (variational) free energy, a.k.a. an evidence bound in machine learning (bishopPatternRecognitionMachine2006; bealVariationalAlgorithmsApproximate2003; wainwrightGraphicalModelsExponential2007). See (dacostaActiveInferenceDiscrete2020) for details on active inference in the partially observable setting.

In the MDP setting, past and current states are known, hence it is only necessary to infer future states given the current state and a sequence of actions, $Q(\vec{s} \mid \vec{a}, s_t)$. These posteriors are known exactly, in virtue of the fact that the agent knows the transition probabilities of the MDP; hence variational inference becomes exact Bayesian inference:

$$Q(\vec{s} \mid \vec{a}, s_t) := P(\vec{s} \mid \vec{a}, s_t) = \prod_{\tau=t}^{T-1} P(s_{\tau+1} \mid s_\tau, a_\tau). \tag{5}$$
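Because the transition probabilities are known, the predictive marginals over future states under a candidate action sequence can be obtained by simply propagating the (known) current state through the transition matrices. A minimal sketch on a hypothetical two-state, two-action MDP (illustrative numbers):

```python
import numpy as np

# Hypothetical MDP: P[a, s, s'] = P(s' | s, a).
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.1, 0.9]],   # action 1
])

def predictive_states(P, s_t, actions):
    """Marginals Q(s_tau | a, s_t), tau = t+1, ..., T, by exact propagation."""
    q = np.zeros(P.shape[1])
    q[s_t] = 1.0                 # the current state is known in an MDP
    marginals = []
    for a in actions:
        q = q @ P[a]             # q(s') = sum_s q(s) P(s' | s, a)
        marginals.append(q)
    return marginals

m = predictive_states(P, s_t=0, actions=[1, 1])
```

Each marginal is a categorical distribution over states, and uncertainty compounds over the planning horizon even though the current state is known exactly.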

[Unknown transition probabilities] When the transition probabilities or reward are unknown to the agent, the problem is one of RL (shohamMultiagentReinforcementLearning2003). Although we do not consider this scenario here, when the model is unknown, we simply equip the agent's generative model with a prior, and the model is then updated via variational Bayesian inference to fit the observed data. Depending on the specific learning problem and generative model structure, this can involve updating the transition probabilities (i.e., the probability of transitioning to a rewarding state under each action) and/or the target distribution (to be defined later); in POMDPs it can also involve updating the probabilities of observations under each state. See Appendix A for further details on how active inference implements reward learning and potential connections to representative RL approaches; and see (dacostaActiveInferenceDiscrete2020) for details on modelling learning in active inference more generally.

### 3.2 Planning as inference

Now that the agent has inferred future states given different sequences of actions, we must score these sequences according to the goodness of the resulting state trajectories (in terms of the agent's preferences). The expected free energy does exactly this: it is the objective that active inference agents minimise in order to select the best possible action.

Under active inference, agents minimise expected free energy in order to maintain a steady-state distribution over the state-space $S$. This steady-state specifies the agent's preferences, or the characteristic states it returns to after being perturbed. The expected free energy is defined as a functional of this steady-state distribution. In the absence of ambiguity about states, the expected free energy reduces to the following form (which is a special case of the expected free energy for partially observed MDPs—see Section 4.4).

[Expected free energy] In MDPs, the expected free energy of an action sequence $\vec{a}$ starting from $s_t$ is defined as (dacostaActiveInferenceDiscrete2020):

$$G(\vec{a} \mid s_t) = D_{KL}[Q(\vec{s} \mid \vec{a}, s_t) \,\|\, C(\vec{s})] \tag{6}$$

Therefore, minimising expected free energy corresponds to making the distribution over predicted states close to the distribution that encodes prior preferences.

The expected free energy may be rewritten as

$$G(\vec{a} \mid s_t) = \underbrace{\mathbb{E}_{Q(\vec{s} \mid \vec{a}, s_t)}[-\log C(\vec{s})]}_{\text{Expected surprise}} - \underbrace{H[Q(\vec{s} \mid \vec{a}, s_t)]}_{\text{Entropy of future states}} \tag{7}$$

Hence, minimising expected free energy amounts to minimising the expected surprise of states[^1] according to $C$ and maximising the entropy of Bayesian beliefs over future states (a maximum entropy principle (jaynesInformationTheoryStatistical1957; jaynesInformationTheoryStatistical1957a)).

[^1]: The surprise of states is an information-theoretic term (stoneInformationTheoryTutorial2015) that scores the extent to which an observation is unusual under $C$. It does not mean that the agent consciously experiences surprise.
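For categorical distributions, the decomposition of the KL divergence into expected surprise minus entropy is easy to verify numerically. A minimal sketch (the two distributions below are illustrative, not taken from the paper):

```python
import numpy as np

def expected_free_energy(q, c):
    """G = KL[Q || C] for categorical Q (predicted states) and C (preferences).
    Returns G together with its decomposition G = E_Q[-log C] - H[Q]."""
    q, c = np.asarray(q, float), np.asarray(c, float)
    kl = np.sum(q * (np.log(q) - np.log(c)))
    expected_surprise = -np.sum(q * np.log(c))   # E_Q[-log C]
    entropy = -np.sum(q * np.log(q))             # H[Q]
    return kl, expected_surprise, entropy

q = np.array([0.7, 0.3])   # predicted distribution over future states
c = np.array([0.5, 0.5])   # flat preferences (illustrative)
G, surprise, H = expected_free_energy(q, c)
```

With flat preferences the expected surprise term is constant, so minimising $G$ reduces to maximising the entropy of beliefs about future states, as the maximum entropy reading suggests.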

Evaluating the expected free energy of courses of action corresponds to planning as inference (attiasPlanningProbabilisticInference2003; botvinickPlanningInference2012). This follows from the fact that the expected free energy scores the goodness of inferred future states.

[Numerical tractability] The expected free energy is straightforward to compute using linear algebra. Given an action sequence $\vec{a}$, $Q(\vec{s} \mid \vec{a}, s_t)$ and $C(\vec{s})$ are categorical distributions over $S^{T-t}$. Let their parameters be $\mathbf{s}_{\vec{a}}, \mathbf{c} \in [0, 1]^{|S|^{T-t}}$, where $|\cdot|$ denotes the cardinality of a set. Then:

$$G(\vec{a} \mid s_t) = \mathbf{s}_{\vec{a}}^\top (\log \mathbf{s}_{\vec{a}} - \log \mathbf{c}) \tag{8}$$

Notwithstanding, (8) is expensive to evaluate repeatedly when all possible action sequences are considered. In practice, one can adopt a temporal mean field approximation over future states (millidgeWhenceExpectedFree2020):

$$Q(\vec s \mid \vec a, s_t) \approx \prod_{\tau=t+1}^{T} Q(s_\tau \mid \vec a, s_t),$$

which yields the simplified expression

$$G(\vec a \mid s_t) \approx \sum_{\tau=t+1}^{T} D_{\mathrm{KL}}\left[\,Q(s_\tau \mid \vec a, s_t)\;\middle\|\;C(s_\tau)\,\right]. \tag{9}$$

Expression (9) is much easier to handle: for each action sequence $\vec a$, 1) one evaluates the summands sequentially for $\tau = t+1, \ldots, T$, and 2) if and when the partial sum becomes significantly higher than the lowest expected free energy encountered during planning, $G(\vec a \mid s_t)$ is set to an arbitrarily high value. Setting $G(\vec a \mid s_t)$ to a high value is equivalent to pruning away unlikely trajectories. This bears some similarity to decision-tree pruning procedures used in RL (huysBonsaiTreesYour2012). It finesses exploration of the decision tree in full depth and provides an Occam’s window for selecting action sequences.
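The pruning procedure above can be sketched as follows; the candidate sequences, the predicted distributions, and the zero-width Occam's window are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def kl(q, c):
    """KL divergence between two categorical distributions."""
    q, c = np.asarray(q, float), np.asarray(c, float)
    return float(q @ (np.log(q) - np.log(c)))

def prune_and_score(candidates, prefs):
    """Score each action sequence by the mean-field expected free energy (9),
    accumulating per-timestep KL terms sequentially and pruning a sequence
    as soon as its partial sum exceeds the lowest total found so far
    (pruned sequences get an arbitrarily high score)."""
    best = np.inf
    scores = {}
    for label, q_seq in candidates.items():
        total = 0.0
        for q, c in zip(q_seq, prefs):
            total += kl(q, c)
            if total > best:        # partial sum already worse: prune
                total = np.inf
                break
        scores[label] = total
        best = min(best, total)
    return scores

# Hypothetical 3-state example over a 2-step horizon.
prefs = [[0.8, 0.1, 0.1]] * 2
candidates = {
    "towards-preferred": [[0.8, 0.1, 0.1], [0.8, 0.1, 0.1]],
    "away":              [[0.1, 0.1, 0.8], [0.1, 0.1, 0.8]],
}
scores = prune_and_score(candidates, prefs)
```

The sequence matching the preferences keeps a finite (here zero) score, while the other is pruned after its first summand already exceeds the best total found so far.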

There are complementary approaches to make planning tractable. For example, hierarchical generative models factorise decisions into multiple levels; by abstracting information at a higher level, lower levels entertain fewer actions (fristonDeepTemporalModels2018), which reduces the depth of the decision tree by orders of magnitude. Another approach is to use algorithms that search the decision tree selectively, such as Monte-Carlo tree search (silverMasteringGameGo2016; coulomEfficientSelectivityBackup2006), or to amortise the expected free energy minimisation using artificial neural networks (i.e., learning to plan (catalLearningPerceptionPlanning2020a)).

## 4 Reward maximisation as active inference

In the following, we show how active inference can solve the reward maximisation problem.

### 4.1 Reward maximisation as reaching preferences

From the definition of expected free energy (6), active inference can be thought of as reaching and remaining at a target distribution over state-space. This distribution encodes the agent’s preferences. In short, simulating active inference can be regarded as engineering a stationary process (pavliotisStochasticProcessesApplications2014), where the stationary distribution encodes the agent’s preferences.

The idea that underwrites the rest of this paper is that when the stationary distribution has all of its mass on reward maximising states, then the agent will maximise reward. To illustrate this, we define a distribution $C_\lambda$, encoding the agent’s preferences over the state-space $S$, such that rewarding states become preferred states:

$$C_\lambda(s) := \frac{\exp \lambda R(s)}{\sum_{\varsigma \in S} \exp \lambda R(\varsigma)}, \qquad \forall s \in S.$$

The parameter $\lambda > 0$ is an inverse temperature parameter, which scores how motivated the agent is to occupy reward maximising states. Note that states that maximise the reward $R$ maximise $C_\lambda$ and minimise $-\log C_\lambda$ for any $\lambda > 0$.

Using the additive property of the reward function, we can extend $C_\lambda$ to a probability distribution over trajectories $\vec\sigma \in S^T$. Specifically, $C_\lambda(\vec\sigma)$ scores to what extent a trajectory is preferred over another trajectory:

$$\begin{aligned} C_\lambda(\vec\sigma) &:= \frac{\exp \lambda R(\vec\sigma)}{\sum_{\vec\varsigma \in S^T} \exp \lambda R(\vec\varsigma)} = \prod_{\tau=1}^{T} \frac{\exp \lambda R(\sigma_\tau)}{\sum_{\varsigma \in S} \exp \lambda R(\varsigma)} = \prod_{\tau=1}^{T} C_\lambda(\sigma_\tau), && \forall \vec\sigma \in S^T \\ \iff -\log C_\lambda(\vec\sigma) &= -\lambda R(\vec\sigma) - c'(\lambda) = -\sum_{\tau=1}^{T} \lambda R(\sigma_\tau) - c'(\lambda), && \forall \vec\sigma \in S^T, \end{aligned} \tag{10}$$

where $c'(\lambda)$ is constant w.r.t. $\vec\sigma$.

When the preferences are defined in this way, the zero-temperature limit $\lambda \to +\infty$ is the case where the preferences are non-zero only for states or trajectories that maximise reward. In this case, $C_\lambda$ is a uniform mixture of Dirac distributions over reward maximising trajectories:

$$\lim_{\lambda \to +\infty} C_\lambda \propto \sum_{\vec s \in I^{T-t}} \mathrm{Dirac}_{\vec s}, \qquad I := \operatorname*{arg\,max}_{s \in S} R(s). \tag{11}$$

This is because, for a reward maximising state $s \in I$, $\exp \lambda R(s)$ grows more quickly as $\lambda \to +\infty$ than for a non-reward maximising state $s \notin I$. Since $C_\lambda$ is constrained to be normalised to $1$ (as it is a probability distribution), the relative mass on non-reward maximising states vanishes. Hence, in the limit $\lambda \to +\infty$, $C_\lambda$ is non-zero (and uniform) only on reward maximising states.
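The convergence in (11) is easy to verify numerically; the reward values and the helper `preferences` below are illustrative assumptions:

```python
import numpy as np

def preferences(reward, lam):
    """C_lambda(s): softmax of lambda * R(s) over states, as in (10)."""
    z = lam * np.asarray(reward, float)
    z -= z.max()               # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical rewards over four states; states 1 and 3 are the maximisers.
R = np.array([0.0, 1.0, 0.5, 1.0])

c_low = preferences(R, lam=1.0)     # broad preferences over all states
c_high = preferences(R, lam=100.0)  # near the zero-temperature limit

# As lambda grows, C_lambda concentrates uniformly on the argmax states,
# approaching a uniform mixture of Dirac distributions.
```

At low $\lambda$ every state retains some preference mass; at high $\lambda$ the mass is split (here roughly half and half) between the two reward maximising states.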

We now show how reaching preferred states can be formulated as reward maximisation:

The sequence of actions that minimises expected free energy also maximises expected reward in the limiting case $\lambda \to +\infty$:

$$\lim_{\lambda \to +\infty} \operatorname*{arg\,min}_{\vec a} G(\vec a \mid s_t) \subseteq \operatorname*{arg\,max}_{\vec a} \mathbb{E}_{Q(\vec s \mid \vec a, s_t)}[R(\vec s)]$$

Furthermore, of those action sequences that maximise expected reward, the expected free energy minimisers will be those that maximise the entropy of future states $\mathrm{H}[Q(\vec s \mid \vec a, s_t)]$.

###### Proof.

The inclusion follows from the fact that, as $\lambda \to +\infty$, the expected surprise term $-\lambda \mathbb{E}_{Q(\vec s \mid \vec a, s_t)}[R(\vec s)] - c'(\lambda)$ dominates the entropy term in (7), so a minimiser of the expected free energy has to maximise $\mathbb{E}_{Q(\vec s \mid \vec a, s_t)}[R(\vec s)]$. Among such action sequences, the expected free energy minimisers are those that maximise the entropy of future states $\mathrm{H}[Q(\vec s \mid \vec a, s_t)]$. ∎

In the zero temperature limit $\lambda \to +\infty$, minimising expected free energy corresponds to choosing the action sequence such that $Q(\vec s \mid \vec a, s_t)$ has most mass on reward maximising states or trajectories. See Figure 2 for an illustration. Of those candidates with the same amount of mass, the maximiser of the entropy of future states will be chosen.
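A quick numerical check of this proposition on a hypothetical one-step problem (the transition matrix and rewards are made up for illustration): for large $\lambda$, the action whose predicted states minimise the KL divergence to $C_\lambda$ coincides with the action maximising expected reward.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical MDP slice: 2 actions, 3 successor states.
P = np.array([[0.7, 0.2, 0.1],    # P(s1 | a=0, s0)
              [0.1, 0.3, 0.6]])   # P(s1 | a=1, s0)
R = np.array([0.0, 0.5, 1.0])     # reward R(s1)

lam = 50.0
C = softmax(lam * R)              # preferences C_lambda, as in (10)

# One step ahead, Q(s1 | a, s0) = P(s1 | a, s0), so the expected free
# energy of each action is the KL divergence to the preferences.
G = np.array([(p * (np.log(p) - np.log(C))).sum() for p in P])
a_active = int(np.argmin(G))

# Reward maximising action: maximise expected immediate reward.
a_reward = int(np.argmax(P @ R))

assert a_active == a_reward
```

With these numbers, action 1 steers the agent towards the high-reward state, and both criteria select it.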

### 4.2 Bellman optimality on a temporal horizon of 1

In this section, we first consider the case of a single-step decision problem (i.e., a temporal horizon of $T = 1$) and demonstrate how one simple active inference scheme maximises reward on this problem in the limit $\lambda \to +\infty$. This will act as an important building block for when we subsequently consider the more general multi-step decision problems that are addressed by both generic dynamic programming and active inference. In the multi-step case ($T > 1$), we will show that this simple active inference scheme is not guaranteed to maximise reward. However, when considering this more general class of decision problems, it is important to emphasise that, similar to RL, active inference is a broad normative framework that encompasses multiple algorithms or schemes. Thus, when we subsequently address multi-step decision problems, we will also show how a second, more sophisticated active inference scheme does maximise reward in the limit $\lambda \to +\infty$. These two schemes differ only in how the agent forms beliefs about the best possible courses of action when minimising expected free energy (the additional degree of freedom one has in POMDPs is specifying the family of distributions over which to optimise variational free energy, in order to infer states from observations; see (schwobelActiveInferenceBelief2018; parrNeuronalMessagePassing2019; heskesConvexityArgumentsEfficient2006; dacostaActiveInferenceDiscrete2020) for details).

The most common action selection procedure consists of assigning the probability of action sequences to be the softmax of the negative expected free energy (dacostaActiveInferenceDiscrete2020; fristonActiveInferenceProcess2017):

$$Q(\vec a \mid s_t) \propto \exp(-G(\vec a \mid s_t))$$

Action selection under active inference usually involves selecting the most likely action under $Q(\vec a \mid s_t)$:

$$\begin{aligned} a_t &\in \operatorname*{arg\,max}_{a \in A} Q(a \mid s_t) \\ &= \operatorname*{arg\,max}_{a \in A} \sum_{\vec a} Q(a \mid \vec a)\, Q(\vec a \mid s_t) \\ &= \operatorname*{arg\,max}_{a \in A} \sum_{\vec a} Q(a \mid \vec a)\, \exp(-G(\vec a \mid s_t)) \\ &= \operatorname*{arg\,max}_{a \in A} \sum_{\vec a : (\vec a)_t = a} \exp(-G(\vec a \mid s_t)) \end{aligned}$$

In other words, this scheme selects actions that maximise the exponentiated negative expected free energies of all possible future action sequences. This means that if one action is part of an action sequence with very low expected free energy, this score is exponentiated and adds a large contribution to the selection of that particular action.
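A toy illustration of this selection rule, with made-up expected free energies for four two-step action sequences (none of these numbers come from the paper):

```python
import numpy as np
from itertools import product

# Hypothetical planning problem: 2 actions, horizon 2, hence four
# candidate action sequences; G[seq] is the expected free energy
# G(a_vec | s_t) assigned to each sequence.
actions = (0, 1)
seqs = list(product(actions, repeat=2))
G = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 1.0, (1, 1): 1.2}

# Decision-making: Q(a_vec | s_t) is the softmax of -G(a_vec | s_t).
q = np.exp([-G[s] for s in seqs])
q /= q.sum()

# Action selection: marginalise over sequences whose first element is a,
# then take the most likely action. The softmax normaliser is common to
# all actions, so this matches summing exp(-G) directly.
marginal = {a: sum(p for s, p in zip(seqs, q) if s[0] == a)
            for a in actions}
a_t = max(actions, key=marginal.get)

# Sequence (0, 1) has by far the lowest G; its exponentiated score
# dominates, so action 0 is selected even though sequence (0, 0) is poor.
```

This shows the property described above: a single very low expected free energy sequence contributes a large exponentiated score to its first action.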

See Table 4.2 for a summary of this scheme.

| Process | Computation |
| --- | --- |
| Perceptual inference | states are directly observed in MDPs |
| Planning as inference | $G(\vec a \mid s_t) = D_{\mathrm{KL}}[Q(\vec s \mid \vec a, s_t) \,\Vert\, C(\vec s)]$ |
| Decision-making | $Q(\vec a \mid s_t) \propto \exp(-G(\vec a \mid s_t))$ |
| Action selection | $a_t \in \operatorname*{arg\,max}_{a \in A} \sum_{\vec a : (\vec a)_t = a} \exp(-G(\vec a \mid s_t))$ |

Table 4.2: Example of an active inference scheme on finite horizon MDPs.

In the zero temperature limit $\lambda \to +\infty$, the state-action policy defined as in Table 4.2,

$$a_t \in \lim_{\lambda \to +\infty} \operatorname*{arg\,max}_{a \in A} \sum_{\vec a : (\vec a)_t = a} \exp(-G(\vec a \mid s_t)), \qquad G(\vec a \mid s_t) = D_{\mathrm{KL}}\left[\,Q(\vec s \mid \vec a, s_t)\;\middle\|\;C_\lambda(\vec s)\,\right], \tag{12}$$

is Bellman optimal for the temporal horizon $T = 1$.

###### Proof.

When $T = 1$, the only action is $a_0$. We fix an arbitrary initial state $s_0 = s$. By Proposition 2.2, a Bellman optimal state-action policy is fully characterised by an action that maximises immediate reward

$$a^*_0 \in \operatorname*{arg\,max}_{a \in A} \mathbb{E}[R(s_1) \mid s_0 = s, a_0 = a].$$

Recall that, by Remark 1, this expectation stands for the return under the transition probabilities of the MDP:

$$a^*_0 \in \operatorname*{arg\,max}_{a \in A} \mathbb{E}_{P(s_1 \mid a_0 = a, s_0 = s)}[R(s_1)].$$

Since transition probabilities are assumed to be known (5), this reads