Deep active inference agents using Monte-Carlo methods
Active inference is a Bayesian framework for understanding biological intelligence. The underlying theory brings together perception and action under one single imperative: minimizing free energy. However, despite its theoretical utility in explaining intelligence, computational implementations have been restricted to low-dimensional and idealized situations. In this paper, we present a neural architecture for building deep active inference agents operating in complex, continuous state-spaces using multiple forms of Monte-Carlo (MC) sampling. For this, we introduce a number of techniques, novel to active inference. These include: i) selecting free-energy-optimal policies via MC tree search, ii) approximating this optimal policy distribution via a feed-forward ‘habitual’ network, iii) predicting future parameter belief updates using MC dropouts and, finally, iv) optimizing state transition precision (a high-end form of attention). Our approach enables agents to learn environmental dynamics efficiently, while maintaining task performance, in relation to reward-based counterparts. We illustrate this in a new toy environment, based on the dSprites data-set, and demonstrate that active inference agents automatically create disentangled representations that are apt for modeling state transitions. In a more complex Animal-AI environment, our agents (using the same neural architecture) are able to simulate future state transitions and actions (i.e., plan), to evince reward-directed navigation, despite temporary suspension of visual input. These results show that deep active inference, equipped with MC methods, provides a flexible framework to develop biologically-inspired intelligent agents, with applications in both machine learning and cognitive science.
A common goal in cognitive science and artificial intelligence is to emulate biological intelligence, to gain new insights into the brain and build more capable machines. A widely-studied neuroscience proposition for this is the free-energy principle, which views the brain as a device performing variational (Bayesian) inference Friston [2010, 2019]. Specifically, this principle provides a framework for understanding biological intelligence, termed active inference, by bringing together perception and action under a single objective: minimizing free energy across time Friston et al. [2016, 2017a, 2017b], Pezzulo et al., Da Costa et al. However, despite the potential of active inference for modeling intelligent behavior, computational implementations have been largely restricted to low-dimensional, discrete state-space tasks Parr and Friston, Friston et al., Sajid et al., Hesp et al.
Recent advances have seen deep active inference agents solve more complex, continuous state-space tasks, including Doom Cullen et al., the mountain car problem Friston et al., Ueltzhöffer, Çatal et al., and several tasks based on the MuJoCo environment Tschantz et al., many of which use amortization to scale-up active inference Friston et al., Ueltzhöffer, Çatal et al., Millidge. A common limitation of these applications is a deviation from vanilla active inference in their ability to plan. For instance, Millidge introduced an approximation of the agent’s expected free energy (EFE), the quantity that drives action selection, based on bootstrap samples, while Tschantz et al. employed a reduced version of EFE. Additionally, since all current approaches tackle low-dimensional problems, it is unclear how they would scale up to more complex domains. Here, we propose an extension of previous formulations that is closely aligned with active inference Friston et al. [2017a, 2018] by estimating all EFE summands using a single deep neural architecture.
Our implementation of deep active inference focuses on ensuring both scalability and biological plausibility. We accomplish this by introducing MC sampling, at several levels, into active inference. For planning, we propose the use of MC tree search (MCTS) for selecting a free-energy-optimal policy. This is consistent with planning strategies employed by biological agents and provides an efficient way to select actions. Next, we approximate the optimal policy distribution using a feed-forward ‘habitual’ network. This is inspired by biological habit formation: when acting in familiar environments, habits relieve the computational burden of planning in commonly-encountered situations. Additionally, for both biological consistency and a reduced computational burden, we predict model parameter belief updates using MC-dropouts, a problem previously tackled with network ensembles. Lastly, inspired by neuromodulatory mechanisms in biological agents, we introduce a top-down mechanism that modulates precision over state transitions, which enhances learning of latent representations.
In what follows, we briefly review active inference. This is followed by a description of our deep active inference agent. We then evaluate the performance of this agent. Finally, we discuss the potential implications of this work.
Agents defined under active inference pursue two objectives: first, they sample their environment and calibrate their internal generative model to best explain sensory observations (i.e., reduce surprise) and, second, they perform actions under the objective of reducing their uncertainty about the environment. A more formal definition requires a set of random variables: $s_t$ to represent the hidden state of the world at time $t$, $o_t$ as the corresponding observation, $\pi = \{a_1, a_2, \dots, a_T\}$ as a sequence of actions (typically referred to as ‘policy’ in the active inference literature) up to a given time horizon $T$, and $P_\theta(o_t, s_t)$ as the agent’s generative model parameterized by $\theta$. From this, the agent’s surprise at time $t$ can be defined as the negative log-likelihood $-\log P_\theta(o_t)$.
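To address the first objective under this formulation, the surprise of current observations can be indirectly minimized by optimizing the parameters, $\theta$, using as a loss function the tractable expression:
$$-\log P_\theta(o_t) \le \mathbb{E}_{Q_\phi(s_t)}\big[\log Q_\phi(s_t) - \log P_\theta(o_t, s_t)\big], \tag{1}$$
where $Q_\phi(s_t)$ is an arbitrary distribution of $s_t$. The RHS expression of this inequality is the variational free energy at time $t$. This quantity is commonly referred to as the negative evidence lower bound Blei et al. in variational inference. Furthermore, to realize the second objective, the expected surprise of future observations $-\log P(o_\tau|\theta,\pi)$, where $\tau \ge t$, can be minimized by selecting the policy that is associated with the lowest EFE, $G$ Parr and Friston:
$$G(\pi,\tau) = \mathbb{E}_{P(o_\tau|s_\tau,\theta)}\,\mathbb{E}_{Q_\phi(s_\tau,\theta|\pi)}\big[\log Q_\phi(s_\tau,\theta|\pi) - \log P(o_\tau,s_\tau,\theta|\pi)\big]. \tag{2}$$
Finally, the process of action selection in active inference is realized as sampling from the distribution
$$P(\pi) = \sigma\big(-\gamma\, G(\pi)\big), \tag{3}$$
where $G(\pi)$ is the EFE of policy $\pi$ accumulated over the future time-steps $\tau$, $\gamma$ is a temperature parameter and $\sigma(\cdot)$ the standard softmax function.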
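The sampling step described above can be illustrated with a minimal sketch. It assumes that the accumulated EFE of each candidate policy is already available as a plain list of numbers, and that the temperature parameter acts as a multiplier on the negative EFE inside the softmax. The function names `policy_distribution` and `sample_policy` and the example values are hypothetical and are not taken from the paper's code.

```python
import math
import random

def policy_distribution(efe, gamma=1.0):
    """Softmax over negative expected free energy, scaled by gamma (assumed form)."""
    logits = [-gamma * g for g in efe]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_policy(probs, rng=random):
    """Draw one policy index from the categorical distribution `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Made-up EFE values for three candidate policies (lower is better).
efe_values = [2.0, 1.0, 3.0]
probs = policy_distribution(efe_values, gamma=2.0)
chosen = sample_policy(probs, random.Random(0))
```

Under this sketch the policy with the lowest EFE receives the highest probability, and larger values of `gamma` make the choice more deterministic.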
In this section, we introduce a deep active inference model using neural networks, based on amortization and MC sampling.
Throughout this section, we denote the parameters of the generative and recognition densities with $\theta$ and $\phi$, respectively. The parameters are partitioned as follows: $\theta = \{\theta_o, \theta_s\}$, where $\theta_o$ parameterizes the observation function $P(o_t|s_t;\theta_o)$, and $\theta_s$ parameterizes the transition function $P(s_t|s_{t-1},a_{t-1};\theta_s)$. For the recognition density, $\phi = \{\phi_s, \phi_a\}$, where $\phi_s$ are the amortization parameters of the approximate posterior $Q_{\phi_s}(s_t)$ (i.e., the state encoder), and $\phi_a$ the amortization parameters of the approximate posterior $Q_{\phi_a}(a_t)$ (i.e., our habitual network).
First, we extend the probabilistic graphical model (as defined in Sec. 2) to include the action sequences and factorize the model based on Fig. 1A. We then exploit standard variational inference machinery to calculate the free energy for each time-step $t$ as:
$$F_t = -\mathbb{E}_{Q_{\phi_s}(s_t)}\big[\log P(o_t|s_t;\theta_o)\big] + D_{KL}\big[Q_{\phi_s}(s_t)\,\|\,P(s_t|s_{t-1},a_{t-1};\theta_s)\big] + \mathbb{E}_{Q_{\phi_s}(s_t)}\big[D_{KL}[Q_{\phi_a}(a_t)\,\|\,P(a_t)]\big], \tag{4}$$
where
$$P(a_t) = \sum_{\pi:\,a_1 = a_t} P(\pi) \tag{5}$$
is the summed probability of all policies that begin with action $a_t$. We assume that $s_t$ is normally distributed and $o_t$ is Bernoulli distributed, with all parameters given by a neural network, parameterized by $\theta_o$, $\theta_s$, and $\phi_s$ for the observation, transition, and encoder models, respectively (see Sec. 3.2 for details about $Q_{\phi_a}$). With this assumption, all the terms here are standard log-likelihood and KL terms that are easy to compute for Gaussian and Bernoulli distributions. The expectations over $Q_{\phi_s}(s_t)$ are taken via MC sampling, using a single sample from the encoder.
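As a rough illustration of how such terms can be evaluated, the sketch below computes a Bernoulli reconstruction term and a KL divergence between two diagonal Gaussians, using a single sample from the encoder. It is a sketch under stated assumptions: the `decode` function is a hypothetical stand-in for the observation network, all numerical values are made up, and the term involving the habitual network is left out for brevity.

```python
import math
import random

def bernoulli_log_likelihood(obs, probs, eps=1e-7):
    """Sum of log p(o_i) for binary pixels o_i with predicted means p_i."""
    total = 0.0
    for o, p in zip(obs, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from 0 and 1
        total += o * math.log(p) + (1 - o) * math.log(1.0 - p)
    return total

def kl_diag_gaussians(mu_q, sd_q, mu_p, sd_p):
    """KL[N(mu_q, sd_q^2) || N(mu_p, sd_p^2)] for diagonal covariances."""
    kl = 0.0
    for mq, sq, mp, sp in zip(mu_q, sd_q, mu_p, sd_p):
        kl += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return kl

def sample_state(mu, sd, rng):
    """Single reparameterised sample s = mu + sd * eps from the encoder."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sd)]

def decode(state):
    """Hypothetical observation model: one Bernoulli pixel per state dimension."""
    return [1.0 / (1.0 + math.exp(-s)) for s in state]

rng = random.Random(0)
obs = [1, 0, 1]                                   # made-up binary observation
mu_q, sd_q = [1.0, -1.0, 0.5], [0.3, 0.3, 0.3]    # made-up encoder output
mu_p, sd_p = [0.8, -0.5, 0.0], [1.0, 1.0, 1.0]    # made-up transition prediction

state = sample_state(mu_q, sd_q, rng)             # single MC sample
reconstruction = -bernoulli_log_likelihood(obs, decode(state))
state_kl = kl_diag_gaussians(mu_q, sd_q, mu_p, sd_p)
free_energy = reconstruction + state_kl           # action term omitted here
```

The closed-form KL avoids any sampling for the second term, so only the reconstruction term carries MC noise in this sketch.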
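Next, we consider EFE. At time-step $t$ and for a time horizon up to time $T$, EFE is defined as Friston et al. [2017a]:
$$G(\pi) = \sum_{\tau=t}^{T} G(\pi,\tau) = \sum_{\tau=t}^{T} \mathbb{E}_{\tilde{Q}}\big[\log Q(s_\tau,\theta|\pi) - \log \tilde{P}(o_\tau,s_\tau,\theta|\pi)\big], \tag{6}$$
where $\tilde{Q} = Q(o_\tau,s_\tau,\theta|\pi) = Q(\theta|\pi)\,Q(s_\tau|\theta,\pi)\,Q(o_\tau|s_\tau,\theta,\pi)$ and $\tilde{P}(o_\tau,s_\tau,\theta|\pi) = P(o_\tau|\pi)\,P(s_\tau|o_\tau,\pi)\,P(\theta|s_\tau,o_\tau,\pi)$. Following Schwartenbeck et al., the EFE of a single time instance $\tau$ can be further decomposed as
$$\begin{aligned} G(\pi,\tau) = {} & -\mathbb{E}_{\tilde{Q}}\big[\log P(o_\tau|\pi)\big] && \text{(7a)} \\ & + \mathbb{E}_{\tilde{Q}}\big[\log Q(s_\tau|\pi) - \log P(s_\tau|o_\tau,\pi)\big] && \text{(7b)} \\ & + \mathbb{E}_{\tilde{Q}}\big[\log Q(\theta|s_\tau,\pi) - \log P(\theta|s_\tau,o_\tau,\pi)\big]. && \text{(7c)} \end{aligned}$$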
Interestingly, each term constitutes a conceptually meaningful expression. The term (7a) corresponds to the likelihood assigned to the desired observations $o_\tau$, and plays an analogous role to the notion of reward in the reinforcement learning literature Sutton and Barto. The term (7b) corresponds to the mutual information between the agent’s beliefs about its latent representation of the world, before and after making a new observation, and hence, it reflects a motivation to explore areas of the environment that resolve state uncertainty. Similarly, the term (7c) describes the tendency of active inference agents to reduce their uncertainty about model parameters via new observations and is usually referred to in the literature as active learning Friston et al., novelty, or curiosity Schwartenbeck et al.
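However, two of the three terms that constitute EFE cannot be easily computed as written in Eq. (7). To make computation practical, we re-arrange these expressions and make further use of MC sampling to render them tractable, re-writing Eq. (7) as
$$\begin{aligned} G(\pi,\tau) = {} & -\mathbb{E}_{Q(\theta|\pi)Q(s_\tau|\theta,\pi)Q(o_\tau|s_\tau,\theta,\pi)}\big[\log P(o_\tau|\pi)\big] && \text{(8a)} \\ & + \mathbb{E}_{Q(\theta|\pi)}\big[\mathbb{E}_{Q(o_\tau|\theta,\pi)} H(s_\tau|o_\tau,\pi) - H(s_\tau|\pi)\big] && \text{(8b)} \\ & + \mathbb{E}_{Q(\theta|\pi)Q(s_\tau|\theta,\pi)} H(o_\tau|s_\tau,\theta,\pi) - \mathbb{E}_{Q(s_\tau|\pi)} H(o_\tau|s_\tau,\pi), && \text{(8c)} \end{aligned}$$
where these expressions can be calculated from the deep neural network illustrated in Fig. 1B. The derivation of Eq. (8) can be found in the supplementary material. To calculate the terms (8a) and (8b), we sample $\theta$, $s_\tau$ and $o_\tau$ sequentially (through ancestral sampling) and then $o_\tau$ is compared with the prior distribution $\log P(o_\tau|\pi)$. The parameters of the neural network $\theta$ are sampled from $Q(\theta)$ using the MC dropout technique Gal and Ghahramani. Similarly, to calculate the expectation of $H(o_\tau|s_\tau,\pi)$, the same drawn $\theta$ is used again and $s_\tau$ is re-sampled $N$ times while, for $H(o_\tau|s_\tau,\theta,\pi)$, the set of parameters $\theta$ is also re-sampled $N$ times. Finally, all entropies can be computed using the standard formulas for multivariate Gaussian and Bernoulli distributions.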
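To make the role of MC dropout more concrete, the following sketch estimates how much the predicted observation depends on the sampled parameters, for a single Bernoulli output. It compares the entropy of the averaged prediction with the average entropy across dropout samples, a common estimator of the information an observation carries about the parameters. The tiny network, its weights, the dropout rate and the function names are all hypothetical, and the sign convention with which such a quantity enters EFE is not settled by this sketch.

```python
import math
import random

def bernoulli_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli variable with mean p."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def forward_with_dropout(x, w_hidden, w_out, rate, rng):
    """One stochastic pass: ReLU hidden layer with dropout, sigmoid output."""
    hidden = []
    for row in w_hidden:
        h = max(0.0, sum(w * xi for w, xi in zip(row, x)))
        keep = rng.random() >= rate  # dropout stays active at prediction time
        hidden.append(h / (1.0 - rate) if keep else 0.0)
    logit = sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-logit))

def parameter_information_gain(x, w_hidden, w_out, rate=0.5, n=200, seed=0):
    """Entropy of the mean prediction minus the mean entropy over dropout samples."""
    rng = random.Random(seed)
    ps = [forward_with_dropout(x, w_hidden, w_out, rate, rng) for _ in range(n)]
    mean_p = sum(ps) / n
    marginal_entropy = bernoulli_entropy(mean_p)
    expected_entropy = sum(bernoulli_entropy(p) for p in ps) / n
    return marginal_entropy - expected_entropy

# Made-up weights and input, for illustration only.
w_hidden = [[0.5, -0.2], [0.1, 0.8], [-0.6, 0.3], [0.9, 0.4]]
w_out = [1.2, -0.7, 0.5, 0.9]
x = [1.0, 2.0]
gain = parameter_information_gain(x, w_hidden, w_out, rate=0.5)
gain_no_dropout = parameter_information_gain(x, w_hidden, w_out, rate=0.0)
```

With dropout switched off every pass returns the same prediction, so the estimate collapses to zero, which is a convenient sanity check.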
In active inference, agents choose an action given by their EFE. In particular, any given action is selected with a probability proportional to the accumulated negative EFE of the corresponding policies $G(\pi)$ (see Eq. (3) and Ref. Parr and Friston). However, computing $G$ across all policies is costly since it involves making an exponentially-increasing number of predictions for $T$ steps into the future, and computing all the terms in Eq. (8). To solve this problem, we employ two methods operating in tandem. First, we employ standard MCTS Coulom, Browne et al., Silver et al., a search algorithm in which different potential future trajectories of states are explored in the form of a search tree (Fig. 1C), giving emphasis to the most likely future trajectories. This algorithm is used to calculate the distribution over actions $P(a_t)$, defined in Eq. (5), and control the agent’s final decisions. Second, we make use of amortized inference through a habitual neural network that directly approximates the distribution over actions, which we parameterize by $\phi_a$ and denote $Q_{\phi_a}(a_t)$. In essence, $Q_{\phi_a}(a_t)$ acts as a variational posterior that approximates $P(a_t|s_t)$, with a prior $P(a_t)$, calculated by MCTS (see Fig. 1A). During learning, this network is trained to reproduce the last executed action $a_{t-1}$ (selected by sampling $P(a_{t-1})$) using the last state $s_{t-1}$. Since both tasks used in this paper (Sec. 4) have discrete action spaces $\mathcal{A}$, we define $Q_{\phi_a}(a_t)$ as a neural network with parameters $\phi_a$ and $|\mathcal{A}|$ softmax output units.
During the MCTS process, the agent generates a weighted search tree iteratively that is later sampled during action selection. In each single MCTS loop, one plausible state-action trajectory $(s_t, a_t, s_{t+1}, a_{t+1}, \dots)$, starting from the present time-step $t$, is calculated. For states that are explored for the first time, the distribution $P_{\theta_s}(s_{t+1}|s_t,a_t)$ is used. States that have been explored are stored in the buffer search tree and accessed during later loops of the same planning process. The weights of the search tree $\tilde{G}(s_t,a_t)$ represent the agent’s best estimation for EFE after taking action $a_t$ from state $s_t$. An upper confidence bound for $G(s_t,a_t)$ is defined as
$$U(s_t,a_t) = \tilde{G}(s_t,a_t) + c_{explore} \cdot Q_{\phi_a}(a_t|s_t) \cdot \frac{1}{1 + N(a_t,s_t)}, \tag{9}$$
where $N(a_t,s_t)$ is the number of times that $a_t$ was explored from state $s_t$, and $c_{explore}$ a hyper-parameter that controls exploration. In each round, the EFE of the newly-explored parts of the trajectory is calculated and back-propagated to all visited nodes of the search tree. Additionally, actions are sampled in two ways. Actions from states that have been explored are sampled from $\sigma(U(s_t,a_t))$ while actions from new states are sampled from $Q_{\phi_a}(a_t)$.
Finally, the actions that assemble the selected policy are drawn from $P(a_t) = N(a_t,s_t) / \sum_j N(a_{j,t},s_t)$. In our implementation, the planning loop stops if either the process has identified a clear option (i.e. if $\max P(a_t) - 1/|\mathcal{A}| > T_{dec}$) or the maximum number of allowed loops has been reached.
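The planning loop described above can be sketched for the simplest case of a one-level tree. The sketch makes several assumptions that are not taken from the paper's code: the score to maximize is the running mean of the negative EFE plus an exploration bonus weighted by the habitual prior and shrinking with the visit count; the final action distribution is proportional to visit counts; the stopping rule compares the most visited action against a uniform baseline; the greedy choice of the highest score replaces sampling from the scores; and a `min_loops` guard is added so the loop cannot stop after a single visit. All names and numbers are hypothetical.

```python
import random

def ucb_scores(value, prior, visits, c_explore=1.0):
    """Value estimate plus a prior-weighted bonus that decays with visits."""
    return [v + c_explore * p / (1 + n) for v, p, n in zip(value, prior, visits)]

def plan(simulate_neg_efe, prior, max_loops=100, min_loops=10,
         threshold=0.5, c_explore=1.0):
    """One-level planning loop; returns visit-based action probabilities."""
    n_actions = len(prior)
    visits = [0] * n_actions
    value = [0.0] * n_actions  # running mean of negative EFE per action
    probs = list(prior)
    for loop in range(1, max_loops + 1):
        scores = ucb_scores(value, prior, visits, c_explore)
        action = max(range(n_actions), key=scores.__getitem__)
        outcome = simulate_neg_efe(action)          # imagined rollout
        visits[action] += 1
        value[action] += (outcome - value[action]) / visits[action]
        total = sum(visits)
        probs = [n / total for n in visits]
        clear_option = max(probs) - 1.0 / n_actions > threshold
        if loop >= min_loops and clear_option:
            break                                   # a clear option emerged
    return probs, visits

# Made-up environment: action 1 has the lowest EFE (highest negative EFE).
rng = random.Random(0)
true_neg_efe = [-1.0, -0.2, -2.0, -1.5]

def simulate_neg_efe(action):
    return true_neg_efe[action] + rng.gauss(0.0, 0.1)

habit_prior = [0.25, 0.25, 0.25, 0.25]
probs, visits = plan(simulate_neg_efe, habit_prior)
best_action = probs.index(max(probs))
```

In this toy setting the loop ends well before `max_loops`, because visits quickly concentrate on the action with the lowest simulated EFE.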
Through the combination of the approximation $Q_{\phi_a}(a_t)$ and the MCTS, our agent has at its disposal two methods of action selection. We refer to $Q_{\phi_a}(a_t)$ as the habitual network, as it corresponds to a form of fast decision-making that quickly evaluates and selects an action, in contrast with the more deliberative system that includes future imagination via MC tree traversals Van Der Meer et al.
One of the key elements of our framework is the state transition model $P(s_t|s_{t-1},a_{t-1};\theta_s)$, that belongs to the agent’s generative model. In our implementation, we take $s_t \sim \mathcal{N}(\mu, \sigma^2/\omega)$, where $\mu$ and $\sigma$ come from the linear and softplus units (respectively) of a neural network with parameters $\theta_s$ applied to $s_{t-1}$, and, importantly, $\omega$ is a precision factor (c.f. Fig. 1A) modulating the uncertainty on the agent’s estimate of the hidden state of the environment Parr and Friston. We model the precision factor as a simple function of the belief update about the agent’s current policy,
$$\omega = \frac{\alpha}{1 + e^{-\frac{b - D_{t-1}}{c}}} + d, \qquad D_{t-1} = D_{KL}\big[Q_{\phi_a}(a_{t-1})\,\|\,P(a_{t-1})\big], \tag{10}$$
where $\{\alpha, b, c, d\}$ are fixed hyper-parameters. Note that $\omega$ is a monotonically decreasing function of $D_{t-1}$, such that when the posterior belief about the current policy is similar to the prior, precision is high.
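A small numerical sketch can help build intuition for a precision factor of this kind. It assumes a shifted, scaled logistic function of the KL divergence between the habitual distribution and the planner's distribution, and it assumes that precision divides the variance of the transition model. The parameter names `alpha`, `b`, `c`, `d` and their values are hypothetical and chosen only for illustration.

```python
import math

def precision(kl, alpha=1.0, b=25.0, c=5.0, d=1.5):
    """Logistic precision factor that decreases as the KL divergence grows."""
    z = (kl - b) / c
    z = max(-60.0, min(60.0, z))  # clip to avoid overflow in exp
    return alpha / (1.0 + math.exp(z)) + d

def scaled_std(sigma, omega):
    """Standard deviation of a Gaussian whose variance is sigma^2 / omega."""
    return sigma / math.sqrt(omega)

omega_habit_formed = precision(0.0)     # habit matches the planner
omega_midpoint = precision(25.0)        # KL equal to the midpoint b
omega_no_habit = precision(100.0)       # habit far from the planner

narrow = scaled_std(1.0, omega_habit_formed)
wide = scaled_std(1.0, omega_no_habit)
```

Under these assumptions a well-formed habit (small divergence) yields high precision and a narrower transition distribution, while a large divergence pushes precision down towards its floor `d`.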
In cognitive terms, $\omega$ can be thought of as a means of top-down attention Byers and Serences, that regulates which transitions should be learnt in detail and which can be learnt less precisely. This attention mechanism acts as a form of resource allocation: if $D_{t-1}$ is high, then a habit has not yet been formed, reflecting a generic lack of knowledge. Therefore, the precision of the prior (i.e., the belief about the current state before a new observation has been received) is low, and less effort is spent learning $Q_{\phi_s}(s_t)$.
In practice, the effect of $\omega$ is to incentivize disentanglement in the latent state representation: the precision factor is somewhat analogous to the $\beta$ parameter in $\beta$-VAE Higgins et al., effectively pushing the state encoder to have independent dimensions (since $P(s_t|s_{t-1},a_{t-1};\theta_s)$ has a diagonal covariance matrix). As training progresses and the habitual network becomes a better approximation of $P(a_t)$, $\omega$ is gradually increased, implementing a natural form of precision annealing.
First, we present the two environments that were used to validate our agent’s performance.
We defined a simple 2D environment based on the dSprites dataset Matthey et al., Higgins et al. This was used to quantify the agent’s behavior against ground truth state-spaces and evaluate the agent’s ability to disentangle state representations. This is feasible as the dSprites data is designed for characterizing disentanglement, using a set of interpretable, independent ground-truth latent factors. In this task, which we call object sorting, the agent controls the position of the object via 4 different actions (right, left, up or down) and is required to sort single objects based on their shape (a latent factor). The agent receives reward when it moves the object across the bottom border, and the reward value depends on the shape and location as depicted in Fig. 2A. For the results presented in Section 4, the agent was trained in an on-policy fashion, with a batch size of 50.
We used a variation of ‘preferences’ task from the Animal-AI environment Crosby et al. . The complexity of this, partially observable, 3D environment is the ideal test-bed for showcasing the agent’s reward-directed exploration of the environment, whilst avoiding negative reward or getting stuck in corners. In addition, to test the agent’s ability to rely on its internal model, we used a lights-off variant of this task, with temporary suspension of visual input at any given time-step with probability . For the results presented in Section 4, the agent was trained in an off-policy fashion due to computational constraints. The training data for this was created using a simple rule: move in the direction of the greenest pixels.
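A rule of the kind used to generate the off-policy training data could look like the sketch below. It is only an illustration: the image is assumed to be a nested list of rows of `(r, g, b)` tuples with values in [0, 1], the greenness measure is one arbitrary choice, and the three action names are hypothetical and do not correspond to the actual Animal-AI action space.

```python
def greenness(pixel):
    """How much the green channel exceeds the mean of red and blue."""
    r, g, b = pixel
    return max(0.0, g - 0.5 * (r + b))

def greenest_direction(image):
    """Pick 'left', 'forward' or 'right' by the greenest third of the frame."""
    width = len(image[0])
    third = width / 3.0
    totals = {"left": 0.0, "forward": 0.0, "right": 0.0}
    for row in image:
        for x, pixel in enumerate(row):
            if x < third:
                key = "left"
            elif x < 2 * third:
                key = "forward"
            else:
                key = "right"
            totals[key] += greenness(pixel)
    return max(totals, key=totals.get)

# Made-up 2x6 frame: grey on the left and centre, green on the right.
grey, green = (0.1, 0.1, 0.1), (0.0, 0.9, 0.1)
frame = [[grey] * 4 + [green] * 2 for _ in range(2)]
action = greenest_direction(frame)
```

Data collected with such a hand-written controller is biased towards reward-approaching trajectories, which is one reason to treat it as a bootstrap for training rather than as a target behavior.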
In the experiments that follow, we encode the actual reward from both environments as the prior distribution of future expected observations or, in active inference terms, the expected outcomes. We optimized the networks using ADAM Kingma and Ba , with loss given in Eq. (4) and an extra regularization term . The explicit training procedure is detailed in the supplementary material. The complete source-code, data, and pre-trained agents, is available on GitHub (https://github.com/zfountas/deep-active-inference-mc).
We initially show – through a simple visual demonstration (Fig. 2B) – that agents learn the environment dynamics with or without consistent visual input for both dynamic dSprites and AnimalAI. This is further investigated, for the dynamic dSprites, by evaluating task performance (Fig. 3A-C), as well as reconstruction loss for both predicted visual input and reward (Fig. 3D-E) during training.
To explore the effect of using different EFE functionals on behavior, we trained and compared active inference agents under three different formulations, all of which used the implicit reward function given by the prior over expected observations, against a baseline reward-maximizing agent. These include (i) beliefs about the latent states (i.e., terms a,b from Eq. 7), (ii) beliefs about both the latent states and model parameters (i.e., complete Eq. 7) and (iii) beliefs about the latent states, with a down-weighted reward signal. We found that, although all agents exhibit similar performance in collecting rewards (Fig. 3B), active inference agents have a clear tendency to explore the environment (Fig. 3C). Interestingly, our results also demonstrate that all three formulations are better at reconstructing the expected reward, in comparison to a reward-maximizing baseline (Fig. 3D). Additionally, our agents are capable of reconstructing the current observation, as well as predicting 5 time-steps into the future, for all formulations of EFE, with a loss similar to that of the baseline (Fig. 3E).
Disentanglement of latent spaces leads to lower dimensional temporal dynamics that are easier to predict Hsieh et al. Thus, generating a disentangled latent space can be beneficial for learning the parameters of the transition function. Due to the similarity between the precision term $\omega$ and the hyper-parameter $\beta$ in $\beta$-VAE Higgins et al. discussed in Sec. 3.3, we hypothesized that $\omega$ could play an important role in regulating transition learning. To explore this hypothesis, we compared the total correlation (as a metric for disentanglement Kim and Mnih) of latent state beliefs between agents that have been trained with the different EFE functionals, the baseline (reward-maximizing) agent, an agent trained without top-down attention (although the average value of $\omega$ was maintained), as well as a simple variational autoencoder that received the same visual inputs. As seen in Fig. 3F, all active inference agents using $\omega$ generated structures with significantly more disentanglement (see traversals in supp. material). Indeed, the performance ranking here is the same as in Fig. 3D, pointing to disentanglement as a possible reason for the performance difference in predicting rewards.
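For readers unfamiliar with total correlation, the sketch below computes it under a simplifying Gaussian assumption, where it reduces to half the difference between the sum of log marginal variances and the log determinant of the covariance matrix. This closed form is a swapped-in approximation used for illustration only and is not the estimator used to produce the figures; the helper names and the synthetic samples are hypothetical.

```python
import math
import random

def covariance(samples):
    """Unbiased sample covariance matrix of a list of equal-length vectors."""
    n, d = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for row in samples:
        for i in range(d):
            for j in range(d):
                cov[i][j] += (row[i] - means[i]) * (row[j] - means[j])
    return [[c / (n - 1) for c in r] for r in cov]

def log_det_cholesky(m):
    """Log-determinant of a positive-definite matrix via Cholesky factorisation."""
    d = len(m)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(m[i][i] - s)
            else:
                low[i][j] = (m[i][j] - s) / low[j][j]
    return 2.0 * sum(math.log(low[i][i]) for i in range(d))

def gaussian_total_correlation(samples):
    """TC = 0.5 * (sum_i log var_i - log det cov) under a Gaussian assumption."""
    cov = covariance(samples)
    sum_log_var = sum(math.log(cov[i][i]) for i in range(len(cov)))
    return 0.5 * (sum_log_var - log_det_cholesky(cov))

rng = random.Random(0)
independent = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(2000)]
correlated = []
for _ in range(2000):
    z = rng.gauss(0, 1)
    correlated.append([z, z + 0.5 * rng.gauss(0, 1)])

tc_independent = gaussian_total_correlation(independent)
tc_correlated = gaussian_total_correlation(correlated)
```

Lower values indicate more statistically independent latent dimensions, which is the sense in which total correlation serves as a disentanglement metric here.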
The training process in the dynamic dSprites environment revealed two types of behavior. Initially, we see epistemic exploration (i.e., curiosity), that is overtaken by reward seeking (i.e., goal-directed behavior) once the agent is reasonably confident about the environment. An example of this can be seen in the left trajectory plot in Fig. 4Ai, where the untrained agent – with no concept of reward – deliberates between multiple options and chooses the path that enables it to quickly move to the next round. The same agent, after learning iterations, can now optimally plan where to move the current object, in order to maximize potential reward, . We next investigated the sensitivity when deciding, by changing the threshold . We see that changing the threshold has clear implications for the distribution of explored alternative trajectories i.e., number of simulated states (Fig. 4Aii). This plays an important role in the performance, with maximum performance found at (Fig. 4Aiii).
Agents trained in the Animal-AI environment also exhibit interesting (and intelligent) behavior. Here, the agent is able to make complex plans, by avoiding obstacles with negative reward and approaching expected outcomes (red and green objects respectively, Fig. 4Bi). Maximum performance can be found for MCTS loops and (Fig. 4Bii; details in the supplementary material). When deployed in ’lights-off’ experiments, the agent can successfully maintain an accurate representation of the world state and simulate future plans despite temporary suspension of visual input (Fig. 2B). This is particularly interesting because is defined as a feed-forward network, without the ability to maintain memory of states before . As expected, the agent’s ability to operate in this set-up becomes progressively worse the longer the visual input is removed, while shorter decision thresholds are found to preserve performance longer (Fig. 4Biii).
The attractiveness of active inference stems from the biological plausibility of the framework Friston et al. [2017a], Isomura and Friston, Adams et al. Accordingly, we focused on scaling up active inference, inspired by the neurobiological structure and function that support intelligence. This is reflected in the hierarchical generative model, where the higher-level policy network contextualizes lower-level state representations. This speaks to a separation of temporal scales afforded by cortical hierarchies in the brain and provides a flexible framework to develop biologically-inspired intelligent agents.
We introduced MCTS for tackling planning problems with vast search spaces Coulom, Kocsis and Szepesvári, Browne et al., Guo et al., Silver et al. This approach builds upon Çatal et al.’s deep active inference proposal, to use tree search to recursively re-evaluate EFE for each policy, but is computationally more efficient. Additionally, using MCTS offers an Occam’s window for policy pruning; that is, we stop evaluating a policy path if its EFE becomes much higher than a particular upper confidence bound. This pruning drastically reduces the number of paths one has to evaluate. It is also consistent with biological planning, where agents adopt brute force exploration of possible paths in a decision tree, up to a resource-limited finite depth Snider et al. This could be due to imprecise evidence about different future trajectories Solway and Botvinick where environmental constraints subvert evaluation accuracy van Opheusden et al., Holding or alleviate computational load Huys et al. Previous work addressing the depth of possible future trajectories in human subjects under changing conditions shows that both increased cognitive load Holding and time constraints Burns, Van Harreveld et al., van Opheusden et al. reduce search depth. Huys et al. highlighted that in tasks involving alleviated computational load, subjects might evaluate only subsets of decision trees. This is consistent with our experiments as the agent selects to evaluate only particular trajectories based on their prior probability to occur.
We have shown that the precision factor, $\omega$, can be used to incorporate uncertainty over the prior and enhances disentanglement by encouraging statistical independence between features Mathieu et al., Kim et al., Fatemi Shariatpanahi and Nili Ahmadabadi, Mott et al. This is precisely why it has been associated with attention Parr et al.; a signal that shapes uncertainty Dayan et al. Attention enables flexible modulation of neural activity that allows behaviorally relevant sensory data to be processed more efficiently Baluch and Itti, Sasaki et al., Byers and Serences. The neural realizations of this have been linked with neuromodulatory systems, e.g., cholinergic and noradrenergic Posner and Petersen, Dayan and Yu, Gu, Yu and Dayan, Moran et al. In active inference, they have been associated specifically with noradrenaline for modulating uncertainty about state transitions Parr and Friston, noradrenergic modulation of visual attention Parr and dopamine for policy selection Friston et al. [2017a], Parr.
A limitation of this work lies in its comparison to reward-maximizing agents. That is, if the specific goal is to maximize reward, then it is not clear whether deep active inference (i.e., full specification of EFE) has any performance benefits over simple reward-seeking agents (i.e., using only Eq. 7a). We emphasize, however, that the primary purpose of the active inference framework is to serve as a model for biological cognition, and not as an optimal solution for reward-based tasks. Therefore, we have deliberately not focused on bench-marking performance gains against state-of-the-art reinforcement learning agents, although we hypothesize that insights from active inference could prove useful in complex environments where either reward maximization isn’t the objective, or in instances where direct reward maximization leads to sub-optimal performance.
There are several extensions that can be explored, such as testing whether performance would increase with more complex, larger neural networks, e.g., using LSTMs to model state transitions. One could also assess if including episodic memory would finesse EFE evaluation over a longer time horizon, without increasing computational complexity. Future work should also test how performance shifts if the objective of the task changes. Lastly, it might be neurobiologically interesting to see whether the generated disentangled latent structures are apt for understanding functional segregation in the brain.
The authors would like to thank Sultan Kenjeyev for his valuable contributions and comments on early versions of the model presented in the current manuscript and the Emotech team for the great support throughout the project. NS was funded by the Medical Research Council (MR/S502522/1). PM and KJF were funded by the Wellcome Trust (Ref: 210920/Z/18/Z - PM; Ref: 088130/Z/09/Z - KJF).
Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
IEEE International Conference on Computer Vision, pages 2979–2987, 2019.
where we have only used the definition , and the definition of the standard (and conditional) Shannon entropy.
Next, the term (7c) can be re-written as:
where the first equality is obtained via a normal Bayes inversion, and the second via the factorization of .
The model presented here was implemented in Python using the TensorFlow 2.0 library. We initialized 3 different ADAM optimizers, which we used in parallel, to allow learning parameters with different rates. The networks were optimized using an initial learning rate of and, as a loss function, the first two terms of Eq. (4). In experiments where regularization was used, the loss function used by this optimizer was adjusted to
where is a hyper parameter, starting with value and gradually increasing to . In our experiments, we found that the effect of regularization is only to improve the speed of convergence and not the behavior of the agent and, thus, it can be safely omitted.
The parameters of the network were optimized using a rate of and only the second term of Eq. (4) as a loss. Finally, the parameters of were optimized with a learning rate of and only the final term of Eq. (4) as a loss. For all presented experiments and learning curves, batch size was set to 50. A learning iteration is defined as 1000 optimization steps with new data generated from the corresponding environment.
In order to learn to plan further into the future, the agents were trained to map transitions every 5 simulation time-steps in dynamic dSprites and 3 simulation time-steps in Animal-AI. Finally, the runtime of the results presented here is as follows. For the agents in the dynamic dSprites environment, training of the final version of the agents took approximately hours per version (on-policy, 700 learning iterations) using an NVIDIA Titan RTX GPU. Producing the learning and performance curves in Fig. 3, took hours per agent when the 1-step and habitual strategies were employed and approximately days when the full MCTS planner was used (Fig. 3A). For the Animal-AI environment, off-policy training took approximately hours per agent, on-policy training took days and, the results presented in Fig. 4 took approximately days, using an NVIDIA GeForce GTX 1660 super GPU (CPU: i7-4790k, RAM: 16GB DDR3).
In both simulated environments, the network structure used was almost identical, consisting of convolutional, deconvolutional, fully-connected and dropout layers (Fig. 5). In both cases, the dimensionality of the latent space was 10. For the top-down attention mechanism, the parameters used were and for the Animal-AI environment and and for dynamic dSprites. The action space was for Animal-AI and for dynamic dSprites. Finally, with respect to the planner, we set in both cases, (when another value is not specifically mentioned), the depth of MCTS simulation rollouts was set to , while the maximum number of MCTS loops was set to for dynamic dSprites and for Animal-AI.