Active Inference for Stochastic Control

by Aswin Paul, et al.

Active inference has emerged as an alternative approach to control problems given its intuitive (probabilistic) formalism. However, despite its theoretical utility, computational implementations have largely been restricted to low-dimensional, deterministic settings. This paper highlights that this is a consequence of the inability to adequately model stochastic transition dynamics, particularly when an extensive policy (i.e., action trajectory) space must be evaluated during planning. Fortunately, recent advancements propose a modified planning algorithm for finite temporal horizons. We build upon this work to assess the utility of active inference for a stochastic control setting. For this, we simulate the classic windy grid-world task with additional complexities, namely: 1) environment stochasticity; 2) learning of transition dynamics; and 3) partial observability. Our results demonstrate the advantage of using active inference, compared to reinforcement learning, in both deterministic and stochastic settings.







1 Introduction

Active inference, a corollary of the free energy principle, is a formal way of describing the behaviour of self-organising systems that interface with the external world and maintain a consistent form over time [1, 2, 3]. Despite its roots in neuroscience, active inference has spread to many fields owing to its ambitious scope as a general theory of behaviour [4, 5, 6]. Optimal control is one such field, and several recent results place active inference as a promising optimal control algorithm [7, 8, 9]. However, research in the area has largely been restricted to low-dimensional and deterministic settings where defining, and evaluating, policies (i.e., action trajectories) is feasible [9]. This follows from the active inference process theory that necessitates equipping agents a priori with sequences of actions in time. For example, with N available actions and a time-horizon of T time-steps, the total number of (definable) policies that would need to be considered is N^T.
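To make this scaling concrete, a quick sketch (the action count of eight matches the King's moves used later in the paper; the horizons are illustrative):

```python
# Number of definable policies (action trajectories) grows as N**T,
# for N available actions and a time horizon of T time-steps.
num_actions = 8  # the King's moves in the windy grid-world
for horizon in (5, 10, 15):
    print(horizon, num_actions ** horizon)
# horizon 15 already yields ~3.5e13 policies, far too many to enumerate
```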

This becomes more of a challenge in stochastic environments with inherently uncertain transition dynamics and no clear way to constrain the large policy space to a smaller subspace. Fortunately, recent advancements like sophisticated inference [10] propose a modified planning approach for finite temporal horizons [11]. Briefly, sophisticated inference [10], compared to the earlier formulation [12, 9], provides a recursive form of the expected free energy that implements a deep tree search over actions (and outcomes) in the future. We reserve further details for Section 3.2.2.

In this paper, we evaluate the utility of active inference for stochastic control using the sophisticated planning objective. For this, we utilise the windy grid-world task [13], and assess our agent’s performance when varying levels of complexity are introduced e.g., stochastic wind, partial observability, and learning the transition dynamics. Through these numerical simulations, we demonstrate that active inference, compared to a Q-learning agent [13], provides a promising approach for stochastic control.

2 Stochastic control in a windy grid-world

In this section, we describe the windy grid-world task, with additional complexity, used for evaluating our active inference agent (Section 3). This is a classic grid-world task from reinforcement learning [13], with predefined start (S) and goal (G) states (Fig. 1). The aim is to navigate as optimally (i.e., within a minimum time horizon) as possible, taking into account the effect of the wind along the way. The wind runs upward through the middle of the grid, and the goal state is located in one such column. The strength of the wind is noted under each column in Fig. 1, and its amplitude is quantified by the number of unintended upward shifts the agent undergoes when acting in that column. Here, the agent controls its movement through eight available actions (i.e., the King's moves): North, South, East, West, North-West, South-West, South-East, and North-East. Every episode terminates either at the allowed time horizon, or when the agent reaches the goal state.

Figure 1: Windy grid-world task. Here, S and G denote the starting and goal locations. The wind amplitude is shown under each column. This is quantified as the number of unintended additional upward shifts the agent undergoes during each action, e.g., any action taken in column four results in one unintended shift upwards. There are eight actions (the King's moves). We plot sample paths from the start to the goal state in light and dark blue. Notice, the indirect journey to the goal is a consequence of the wind.
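The task dynamics described above can be sketched as a minimal Python environment. This is an illustrative reconstruction, not the authors' code: the wind profile is assumed to follow the classic layout from [13], and the grid is indexed (row, column) with row 0 at the top.

```python
# Deterministic windy grid-world sketch: 7 rows x 10 columns.
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]   # upward wind strength per column
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1),
         "NE": (-1, 1), "NW": (-1, -1), "SE": (1, 1), "SW": (1, -1)}

def step(state, action):
    """Apply a King's move plus the current column's wind; clip to the grid."""
    row, col = state
    dr, dc = MOVES[action]
    row = min(max(row + dr - WIND[col], 0), 6)   # wind pushes the agent up
    col = min(max(col + dc, 0), 9)
    return (row, col)
```

For example, moving East from (3, 3) lands in (2, 4): the intended step plus one unintended upward shift from the wind.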

2.1 Grid-world complexity

To test the performance of our active inference agent in a complex stochastic environment, we introduced different complexity levels to the windy grid-world setting (Table 1).

Level Wind Observability Dynamics
1 Deterministic Full (MDP) Known
2 Stochastic Full (MDP) Known
3 Deterministic Full (MDP) Learned
4 Stochastic Full (MDP) Learned
5 Stochastic Partial (POMDP) Known
Table 1: Five complexity levels for the windy grid-world task

2.1.1 Wind properties

In the deterministic setting, the amplitude of the wind remains constant. Conversely, in the stochastic setting, the effect of the windy columns varies by one around the mean values. We consider two settings: medium and high stochasticity. The mean value is observed 70% of the time in the medium case and 40% of the time in the high case (Table 2); the adjacent wind values (mean ± 1) are observed with the remaining probability. Here, stochasticity is not externally introduced to the system, but is built into the transition dynamics (Section 3) of the environment.

Level    Mean wind amplitude    Wind amplitude ± 1
Medium   70% of the time        15% each
High     40% of the time        30% each
Table 2: Stochastic nature of the wind
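Under the assumption that the remaining probability mass is split equally between the two adjacent amplitudes, the wind model above can be sampled as follows (a sketch; the function name is illustrative):

```python
import random

def sample_wind(mean_amplitude, level="medium", rng=random):
    """Sample a wind amplitude: the mean value 70% (medium) or 40% (high)
    of the time, and mean - 1 or mean + 1 with the remaining probability."""
    if mean_amplitude == 0:
        return 0  # calm columns stay deterministic
    p_mean = 0.70 if level == "medium" else 0.40
    p_adj = (1.0 - p_mean) / 2.0
    return rng.choices([mean_amplitude - 1, mean_amplitude, mean_amplitude + 1],
                       weights=[p_adj, p_mean, p_adj])[0]
```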

2.1.2 Observability

In the fully observable setting, the agent is aware of the current state, i.e., there is no ambiguity about the state of affairs. We formalise this as a Markov decision process (MDP). In the partially observable environment, by contrast, the agent measures an indirect function of the underlying state, i.e., the current observation, which is used to infer the agent's current state. We formalise this as a partially observable MDP (POMDP). Specific details of the outcome modalities used in the task are discussed in Appendix 0.B.


2.1.3 Transition dynamics known to agent

In the known set-up, the agent is equipped with the transition probabilities beforehand. If these are not known, the agent begins the trials with uninformative (uniform) priors and updates its beliefs (Eq. 9) using random transitions. Briefly, random actions are sampled and the transition dynamics updated to reflect the best explanation for the observations at hand. These learned dynamics are then used for planning.

3 Active inference on finite temporal horizons

3.1 Generative model

The generative model is formally defined as a tuple of finite sets (S, O, T, U, B, C, A):

  • S: the set of states, one per cell of the 7 × 10 grid, including a predefined (fixed) start state.

  • O: the set of outcomes, where O = S in the fully observable setting, and O ≠ S under partial observability (here, outcomes introduce ambiguity for the agent, as similar outcomes map to different (hidden) states; see Appendix 0.B, Table B.1 for implementation details).

  • T: time, t ∈ {1, 2, ..., T}, where T is a finite time horizon available per episode.

  • U: the set of actions, where U = {N, S, E, W, NW, SW, SE, NE}.

  • B: the transition dynamics, i.e., B(s_{t+1} | s_t, u_t) encodes the probability that action u_t taken at state s_t at time t results in state s_{t+1} at time t+1.

  • C: prior preferences over outcomes; here, C assigns preference to the predefined goal state.

  • A: the likelihood distribution, A(o_t | s_t), used in the partially observable setting.

Accordingly, the agent's generative model is defined as the following probability distribution:

P(o_{1:T}, s_{1:T} | u_{1:T}) = P(s_1) ∏_{t=1}^{T} P(o_t | s_t) ∏_{t=1}^{T-1} P(s_{t+1} | s_t, u_t)
3.2 Full observability

3.2.1 Perception:

During full observability, states can be directly accessed by the agent, given known or learned transition dynamics. The posterior (predictive) estimates over states, Q(s_{t+1} | s_t, u_t), can then be calculated directly from B [11].


3.2.2 Planning:

In active inference, the expected free energy (G) [9] is used for planning. For finite temporal horizons, the agent acts to minimise G [11]. Here, we calculate G using the recursive formulation introduced in [10]: the immediate expected free energy plus the expected free energy of future actions,

G(u_T | s_T) = D_KL[ Q(s_{T+1} | s_T, u_T) || C ]   (Eq. 4)

for t = T, and

G(u_t | s_t) = D_KL[ Q(s_{t+1} | s_t, u_t) || C ] + E_Q[ G(u_{t+1} | s_{t+1}) ]   (Eq. 5)

for t < T. In Eq. 5, the second term is calculated as,

E_Q[ G(u_{t+1} | s_{t+1}) ] = Σ_{s_{t+1}} Q(s_{t+1} | s_t, u_t) Σ_{u_{t+1}} Q(u_{t+1} | s_{t+1}) G(u_{t+1} | s_{t+1})   (Eq. 6)

Prior preferences over states are encoded such that the agent prefers to observe itself in the goal state at every time-step: C(s) = 1 for the goal state, and C(s) = 0 otherwise. In matrix form, the i-th element of C corresponds to the i-th state in S.
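A minimal sketch of this recursion for the fully observable case is given below. It is illustrative only (brute-force recursion, with no Occam's window or caching): B is assumed to be a list of column-stochastic transition matrices, one per action, and C a preference distribution over states.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kl(q, p, eps=1e-16):
    """KL divergence D_KL[q || p] between two discrete distributions."""
    return float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))

def G(s, t, T, B, C):
    """Expected free energy of each action in state s at time t (Section 3.2.2)."""
    g = np.zeros(len(B))
    for u, Bu in enumerate(B):
        q_next = Bu[:, s]            # predictive distribution over s_{t+1}
        g[u] = kl(q_next, C)         # risk: divergence from preferences
        if t < T:                    # recursive term (cf. Eq. 5)
            g_next = np.array([G(s2, t + 1, T, B, C)
                               for s2 in range(len(C))])
            q_u = np.array([softmax(-gs) for gs in g_next])   # Q(u'|s')
            g[u] += q_next @ np.sum(q_u * g_next, axis=1)
    return g
```

On a toy two-state problem where one action always moves to the preferred state and the other always moves away, G correctly assigns the lower (better) expected free energy to the former.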

3.2.3 Action selection:

A distribution for action selection is defined using the expected free energy such that,

Q(u_t | s_t) = σ( −G(u_t | s_t) )   (Eq. 7)

Here, σ is the softmax function ensuring that the components sum to one. At each time-step, actions are sampled from:

u_t ∼ Q(u_t | s_t)   (Eq. 8)
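Given the vector of expected free energies for the current state (from the planning step of Section 3.2.2), sampling an action is one softmax and one draw. A sketch, with illustrative names:

```python
import numpy as np

def select_action(g, rng):
    """Sample an action index from sigma(-G), the softmax of negative
    expected free energy over actions."""
    p = np.exp(-g - np.max(-g))   # numerically stable softmax of -g
    p = p / p.sum()
    return int(rng.choice(len(g), p=p))
```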

3.2.4 Learning transition dynamics:

We learn the transition dynamics, B, across time using conjugacy update rules [14, 12, 9]:

b(u_t) ← b(u_t) + s_{t+1} ⊗ s_t,   B(u) = Norm[ b(u) ]   (Eq. 9)

Here, B is the learned transition dynamics updated over time (via the Dirichlet counts b), u_t is the action taken at time t, s_{t+1} is the state-vector at time t+1 reached as a consequence of action u_t, s_t is the state-vector at time t when taking action u_t, and ⊗ is the Kronecker product of the corresponding state-vectors. Furthermore, we also assessed the model accuracy obtained after a given number of trials used to update b, when random actions were employed to explore the transition dynamics. These learned transitions were used for control in Level-3 and Level-4 of the problem.
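The update rule can be sketched as follows, assuming one Dirichlet count matrix per action and a uniform (uninformative) prior; the environment step function and all names are illustrative, not the authors' implementation:

```python
import numpy as np

def learn_B(env_step, n_states, n_actions, n_updates, rng):
    """Accumulate Dirichlet counts b from random transitions (cf. Eq. 9),
    then normalise them into column-stochastic transition matrices B."""
    b = np.ones((n_actions, n_states, n_states))  # uniform prior counts
    s = int(rng.integers(n_states))
    for _ in range(n_updates):
        u = int(rng.integers(n_actions))          # random exploratory action
        s_next = env_step(s, u)
        b[u, s_next, s] += 1.0                    # outer-product count update
        s = s_next
    return b / b.sum(axis=1, keepdims=True)       # normalise over s_{t+1}
```

With enough updates the normalised counts concentrate on the true transitions; with few updates they remain imprecise, mirroring the self-learned dynamics used in Levels 3 and 4.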

3.3 Partial observability

We formalise partial observability as a partially observed MDP (POMDP). Here, the agent has access only to indirect observations about the environment. Specific details of the outcome modalities used in this work are discussed in Appendix 0.B. These outcome modalities are the same for many states; e.g., states 2 and 11 have the same outcome modalities (see Appendix 0.B, Table B.1). Here, we evaluate the ability of the active inference agent to perform optimal inference and planning in the face of ambiguity. The critical advancement of sophisticated inference [10] over the classical formulation [9] is that it allows us to perform a deep tree search over actions in the future. The agent infers the hidden states by minimising the variational free energy (F), a functional of its predictive distribution (the generative model of Section 3.1), using variational (Bayesian) inference; for a rigorous treatment, please refer to [10, 11]. In this scheme, actions are considered random variables at each time-step, assuming successive actions are conditionally independent. This comes at the cost of having to consider many action sequences in time. The search over policies in time is optimised both by restricting the search to future outcomes with non-trivial posterior probability, and by evaluating only policies with significant prior probability under the expected free energy (i.e., Occam's window). In the partially observable setting, the expected free energy accommodates ambiguity in future observations, prioritising both preference-seeking and ambiguity reduction in observations [10].
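For exact discrete state inference, the variational posterior reduces to Bayes' rule: the posterior over hidden states is proportional to the likelihood of the observed outcome times the predictive prior. A sketch (the likelihood matrix A and all names are illustrative):

```python
import numpy as np

def infer_state(A, o, prior):
    """Posterior over hidden states given outcome o: Q(s) ∝ A[o, s] * prior(s)."""
    post = A[o, :] * prior
    return post / post.sum()
```

When two states emit the same outcome, the posterior cannot separate them and falls back on the prior over those states, which is exactly the ambiguity the planner must handle.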

4 Results

We compare the performance of our active inference agent with a popular reinforcement learning algorithm, Q-learning [13], across the five complexity levels. Q-learning is a model-free RL algorithm that operates by learning the 'value' of actions at a particular state. It is well suited to problems with stochastic transition and reward dynamics owing to its model-free parameterisation. Q-learning agents are extensively used in similar problem settings and exhibit state-of-the-art (SOTA) performance [13]. To train the Q-learning agents, we used fixed values for the exploration rate, learning rate, and discount factor. Training was conducted using 10 different random seeds to ensure unbiased results. The training depth for the Q-learning agents was increased with the complexity of the environment.
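For reference, tabular Q-learning updates its action values with the standard temporal-difference rule from [13]; the hyper-parameter values below are illustrative, not the paper's exact settings:

```python
import numpy as np

def q_update(Q, s, u, r, s_next, alpha=0.1, gamma=0.95):
    """One temporal-difference update of the action-value table Q:
    Q(s,u) <- Q(s,u) + alpha * (r + gamma * max_u' Q(s',u') - Q(s,u))."""
    Q[s, u] += alpha * (r + gamma * Q[s_next].max() - Q[s, u])
```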

In Level-1, we instantiate two Q-learning agents, one trained for 500 time-steps (QLearning500) and another for 5,000 time-steps (QLearning5K). Both the active inference agent and the QLearning5K agent demonstrate optimal success rates across the allowed time-horizons (see Appendix 0.A, Fig. A.1).

Using these baselines from the deterministic environment with known transition dynamics, we compared the performance of the agents in a more complex setting with medium and highly stochastic wind (Level-2; Table 2). Here, the active inference agent is clearly superior to the Q-learning agents (Fig. 2, top row). Moreover, it demonstrates better success rates for shorter time-horizons, i.e., 'optimal' action selection. Note, success rate is the percentage of trials in which the agent successfully reached the goal within the allowed time-horizon.

Figure 2: Stochastic environments: performance comparison of agents in Level-2 (top row), Level-4 (middle row), and Level-5 (bottom row) of the windy grid-world task, for medium-stochastic (left column) and high-stochastic (right column) environments, respectively. The x-axis denotes the time horizon and the y-axis the success rate over multiple trials. 'SophAgent' is the active inference agent; 'QLearning5K', 'QLearning10K', and 'QLearning20K' are Q-learning agents trained for 5,000, 10,000, and 20,000 time-steps, respectively. Each agent was trained using 10 different random seeds. 'SophAgent (5K B-updates)' and 'SophAgent (10K B-updates)' refer to active inference agents using self-learned transition dynamics with 5,000 and 10,000 updates, respectively.

Next, we considered how learning the transition dynamics impacted agent behaviour (Levels 3 and 4). Here, we used Eq. 9 to learn the transition dynamics, B. First, the algorithm learnt the dynamics by taking random actions (for example, 5,000 time-steps in 'SophAgent (5K B-updates)'; see Fig. 2, middle row). These learned transition dynamics were then used (see Fig. 3) by the active inference agent to estimate the action distribution in Eq. 8. Results for Level-3 are presented in Appendix 0.A, Fig. A.2. There, the Q-learning agent shows superior performance to the active inference agents; however, with longer time horizons, the active inference agent is competitive. Importantly, the active inference agent used self-learned, imprecise transition dynamics in these levels. Level-4 results for the medium and highly stochastic settings are presented in Fig. 2 (middle row). For medium stochasticity, QLearning10K exhibited satisfactory performance; however, it failed with a zero success rate in the highly stochastic case. This shows the need for extensive training with algorithms like Q-learning in highly stochastic environments. By contrast, the active inference agent demonstrated at-par performance. Remarkably, this performance was achieved using imprecise (compared to the true model), self-learned transition dynamics (see Fig. 3).

Figure 3: Accuracy of learned dynamics in terms of deviation from true transition dynamics in Level-4 A: Medium stochastic case B: High stochastic case

The active inference agent shows superior performance in the highly stochastic environment even with partial observability (Fig. 2, last row). Conversely, excessive training was required for the Q-Learning agent to achieve a high success rate in a medium stochastic environment, but even this training depth led to a zero success rate with high stochasticity. These results present active inference, with a recursively calculated free-energy, as a promising algorithm for stochastic control.

5 Discussion

We explored the utility of active inference with planning over finite temporal horizons across five complexity levels of the windy grid-world task. Active inference agents performed at par with, or superior to, well-trained Q-learning agents. Importantly, in the highly stochastic environments the active inference agent showed a clear advantage over the Q-learning agents. The higher success rates at lower time horizons demonstrate the 'optimality' of the actions selected in the stochastic environments presented to the agent. Additionally, this performance is obtained with no prior specification of acceptable policies, whose total number scales exponentially with the number of available actions and the time horizon. Moreover, the Level-4 results demonstrate the need for extensive training of Q-learning agents when operating in stochastic environments. We also demonstrated the ability of the active inference agents to achieve high success rates even with self-learned, but sub-optimal, transition dynamics. Methods to equip the agent to learn both transition dynamics and outcome (likelihood) dynamics in a partially observable setting have been explored previously [14, 9]; for the stochastic setting, we leave their implementation to future work.

A limitation yet to be addressed is the time consumed per trial in active inference: long run-times restricted our analysis of longer time horizons in Level-5. Deep-learning approaches that use tree searches to represent policies, proposed recently [15, 16, 17], may be useful in this setting. We leave run-time analysis and optimisation for more ambitious environments to future work. Comparing active inference with model-based RL algorithms such as Dyna-Q [13], and with control-as-inference approaches [18], is also a promising direction to pursue.
We conclude that these results establish active inference as a promising algorithm for stochastic control.

Software note

The environments and agents were custom-written in Python for the fully observable settings. The script 'SPM_MDP_VB_XX.m', available in the SPM12 package, was used in the partially observable setting. All scripts are publicly available online.


Acknowledgements

AP acknowledges research sponsorship from IITB-Monash Research Academy, Mumbai and Department of Biotechnology, Government of India. AR is funded by the Australian Research Council (Refs: DE170100128 & DP200100757) and Australian National Health and Medical Research Council Investigator Grant (Ref: 1194910). AR is a CIFAR Azrieli Global Scholar in the Brain, Mind & Consciousness Program. AR and NS are affiliated with The Wellcome Centre for Human Neuroimaging supported by core funding from Wellcome [203147/Z/16/Z].


Appendix 0.A Results Level-1 and Level-3 (Non-stochastic settings)

Figure A.1: Performance comparison of agents in Level-1 of the windy grid-world task. 'RandomAgent' refers to a naive agent that takes all actions with equal probability at every time-step.
Figure A.2: A: Performance comparison of active inference agents, with transition dynamics learned using 5,000 and 10,000 updates respectively, against the Q-learning agent in Level-3. 'QLearning5K' stands for a Q-learning agent trained for 5,000 time-steps using 10 different random seeds. B: Accuracy of the learned dynamics in terms of deviation from the true dynamics.

Appendix 0.B Outcome modalities for POMDPs

In the partially observable setting, we considered two outcome modalities, both functions of the 'side' and 'down' coordinates defined for every state in Fig. 1. Examples of the coordinates and modalities are given below. The first outcome modality is the sum of the coordinates, and the second is their product.

State   'side' coordinate (C1)   'down' coordinate (C2)   Modality 1 (C1 + C2)   Modality 2 (C1 × C2)
1       1                        1                        2                      1
2       1                        2                        3                      2
.       .                        .                        .                      .
11      2                        1                        3                      2
.       .                        .                        .                      .
31      4                        1                        5                      4
38      4                        8                        12                     32
.       .                        .                        .                      .
Table B.1: Outcome modality specifications
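Assuming states are numbered row-major on the 7 × 10 grid (consistent with the rows of Table B.1), the coordinates and the two modalities can be recovered as:

```python
def modalities(state):
    """Return the (sum, product) outcome modalities for a 1-indexed state."""
    c1 = (state - 1) // 10 + 1   # 'side' coordinate
    c2 = (state - 1) % 10 + 1    # 'down' coordinate
    return (c1 + c2, c1 * c2)
```

For instance, states 2 and 11 both map to (3, 2), illustrating the ambiguity the agent must resolve.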

These outcome modalities are the same for many states; e.g., states 2 and 11 have the same outcome modalities (see Table B.1). The results demonstrate the ability of the active inference agent to perform optimal inference and planning in the face of ambiguity. One of the outputs of 'SPM_MDP_VB_XX.m' is 'MDP.P', which returns the action probabilities the agent will use, at each time-step, for a given POMDP input. This distribution was used to conduct multiple trials to evaluate the success rate of the active inference agent.