Scaling active inference

by Alexander Tschantz, et al.

In reinforcement learning (RL), agents often operate in partially observed and uncertain environments. Model-based RL suggests that this is best achieved by learning and exploiting a probabilistic model of the world. 'Active inference' is an emerging normative framework in cognitive and computational neuroscience that offers a unifying account of how biological agents achieve this. On this framework, inference, learning and action emerge from a single imperative to maximize the Bayesian evidence for a niched model of the world. However, implementations of this process have thus far been restricted to low-dimensional and idealized situations. Here, we present a working implementation of active inference that applies to high-dimensional tasks, with proof-of-principle results demonstrating efficient exploration and an order of magnitude increase in sample efficiency over strong model-free baselines. Our results demonstrate the feasibility of applying active inference at scale and highlight the operational homologies between active inference and current model-based approaches to RL.






1 Introduction

In model-based reinforcement learning (RL), agents first learn a predictive model of the world, before using this model to determine actions atkeson_comparison_1997 . Encoding a model of the world plausibly affords several advantages. For instance, such models can be used to perform perceptual inference ha_recurrent_2018 , implement prospective control chua_deep_2018 ; schrittwieser_mastering_2019 , quantify and resolve uncertainty shyam_model-based_2019 , and generalize existing knowledge to new tasks and environments hafner_learning_2018 . As such, the use of predictive models has been touted as a potential solution to the sample inefficiencies of modern RL algorithms deisenroth_pilco:_2011 ; schmidhuber_making_1990 .

At the same time, the theoretical framework of active inference has emerged in cognitive and computational neuroscience as a unifying account of perception, action, and learning friston_free-energy_2010 ; friston_active_2017 . Active inference suggests that biological systems learn a probabilistic model of their habitable environment and that the states of the system change to maximize the evidence for this model friston_active_2015 ; friston_free_2019 . The resulting scheme casts perception, action and learning as emergent processes of (approximate) Bayesian inference, thereby offering a potentially unifying theory of adaptive biological systems. Despite its strong theoretical foundations, existing computational implementations have been restricted to low-dimensional tasks, often with discrete state spaces and actions friston_active_2015 ; friston_active_2017 ; friston_active_2017-1 ; friston_active_2017-2 ; friston_deep_2018 . Here, we establish a formal connection between active inference and model-based RL. In doing so, we extend practical implementations of active inference so that they work effectively at scale, and we situate model-based RL within the broad theoretical context offered by active inference.

We present a model of active inference that is applicable in high-dimensional control tasks with both continuous states and actions. Our model builds upon previous attempts to scale active inference millidge_deep_2019 ; ueltzhoffer_deep_2018 ; catal_bayesian_2019 by including an efficient planning algorithm, as well as the quantification and active resolution of model uncertainty. Consistent with the active inference framework, learning and inference are achieved by maximizing a single lower bound on Bayesian model evidence, and policies are selected to maximize a lower bound on expected Bayesian model evidence friston_active_2015 . We demonstrate that this unified normative scheme enables sample-efficient learning, strong performance on difficult control tasks, and a principled approach to active exploration. Moreover, we establish homologies between our active-inference-based model and state-of-the-art approaches to model-based RL.

In what follows, we specify the general mathematical formulation of active inference, before describing our implementation, which is applicable in both partially-observed and fully-observed environments. We then present preliminary results in three challenging fully-observed continuous control benchmarks, leaving the analysis of partially-observed environments (i.e. pixels) to future work. These results demonstrate that our algorithm facilitates active exploration over long temporal horizons and significantly outperforms a strong model-free RL baseline, in terms of both sample efficiency and performance.

2 Active inference

Following previous work friston_active_2017 ; friston_active_2015 , we consider active inference in the context of a partially observed Markov decision process (POMDP). At each time step $t$, the true state of the environment $\mathrm{s}_t$ evolves according to the stochastic transition dynamics $\mathrm{s}_t \sim \mathrm{P}(\mathrm{s}_t \mid \mathrm{s}_{t-1}, \mathrm{a}_{t-1})$, where $\mathrm{a}_t$ denotes an agent's actions. Agents do not always have access to the true state of the environment, but might instead receive observations $\mathrm{o}_t$, which are generated according to $\mathrm{o}_t \sim \mathrm{P}(\mathrm{o}_t \mid \mathrm{s}_t)$. As such, agents must operate on beliefs about the true state of the environment. In what follows, we denote the true dynamics with upright letters ($\mathrm{P}$) and a model of these dynamics (the agent) with italics ($P$).

Active inference proposes that agents implement and update a generative model of their world $P(\tilde{o}, \tilde{s}, \tilde{\theta}, \pi)$, where the tilde notation denotes a sequence of variables through time (e.g., $\tilde{s} = \{s_0, \dots, s_T\}$), $\pi$ denotes a policy (a sequence of actions), and $\theta$ denotes parameters of the generative model, which are themselves random variables. Additionally, agents maintain a recognition distribution $Q(\tilde{s}, \tilde{\theta}, \pi)$, representing an agent's (approximately optimal) beliefs over states, policies and model parameters.

As new observations are sampled, agents update the parameters of their recognition distribution to minimize variational free energy $\mathcal{F}$:

$$\mathcal{F} = \mathbb{E}_{Q(\tilde{s}, \tilde{\theta}, \pi)}\big[\ln Q(\tilde{s}, \tilde{\theta}, \pi) - \ln P(\tilde{o}, \tilde{s}, \tilde{\theta}, \pi)\big] \qquad (1)$$

This makes the recognition distribution converge towards an approximation of the (intractable) posterior distribution $P(\tilde{s}, \tilde{\theta}, \pi \mid \tilde{o})$, thereby implementing a tractable form of (approximate) Bayesian inference blei_variational_2017 .

Crucially, active inference also proposes that an agent's goals and desires are encoded in the generative model as prior preferences for favourable observations friston_free_2019 ; baltieri2019pid , e.g., a blood temperature of 37°C. Free energy then provides a proxy for how surprising (i.e., unlikely) some observations are under the agent's model. While minimising Eq. 1 provides an estimate of how surprising some observations are, it cannot reduce this quantity directly. To achieve this, agents must change their observations through action. Acting to minimise variational free energy ensures the minimisation of surprisal $-\ln P(\tilde{o})$, or equivalently the maximisation of the (Bayesian) model evidence $P(\tilde{o})$, since free energy provides an upper bound on surprisal. Active inference, therefore, proposes that agents select policies in order to minimize expected free energy $\mathcal{G}$ friston_free_2019 , where the expected free energy for a given policy $\pi$ at some future time $\tau$ is:

$$\mathcal{G}(\pi, \tau) = \mathbb{E}_{\tilde{Q}}\big[\ln Q(s_\tau, \theta \mid \pi) - \ln P(o_\tau, s_\tau, \theta \mid \pi)\big] \qquad (2)$$

Expected free energy provides a bound on expected surprisal, and can be decomposed into extrinsic value, which quantifies the degree to which expected observations are congruent with an agent's prior beliefs, and intrinsic value, which quantifies the amount of information an agent expects to gain from enacting some policy friston_active_2017-1 ; friston_active_2015 ; friston_active_2017 . This decomposition affords a natural interpretation: to avoid being surprised, one should sample unsurprising data, but also learn about the world to make data less surprising per se. Selecting policies that minimize Eq. 2 will, therefore, ensure that probable (i.e., favourable, given an agent's normative priors) observations are preferentially sampled, while also ensuring that agents gather information about their environment.

3 Model

In cognitive and computational neuroscience, implementations of active inference agents generally follow one of two approaches. The first considers the generative model and recognition distribution to be Gaussian under the Laplace approximation and prescribes gradient-descent updates that recurrently minimize free energy with each new observation friston_predictive_2009 ; buckley_free_2017 ; baltieri2019pid . While this approach is purported to be biologically plausible and enjoys empirical support under the guise of predictive coding friston_predictive_2009 ; clark_whatever_2013 , it is not straightforward to extend this implementation to the prospective free energy minimization discussed in Sec. 2. The second approach employs discrete distributions (e.g., Categorical, Dirichlet) that are updated via variational message-passing friston_active_2015 . While this approach provides an elegant framework for evaluating expected free energy, it can only be applied in discrete state and action spaces, meaning it is not directly applicable to the high-dimensional states and continuous actions considered in RL benchmarks.

In the current paper, we take an alternative approach and employ amortized inference kingma_auto-encoding_2013 , which utilizes function approximators (i.e., neural networks) to parameterize distributions. Free energy is then minimized with respect to the parameters of the function approximators, rather than the variational parameters themselves. We detail our generative model and recognition distribution in Sec. 3.1, how learning and inference are implemented in Sec. 3.2, how policy selection and trajectory sampling are implemented in Sec. 3.3 & Sec. 3.4, and how expected free energy is evaluated in Sec. 3.5. Finally, we describe the implementation details for the fully-observed case in Sec. 3.6.

3.1 Generative model & recognition distribution

We consider a generative model over sequences of observations $\tilde{o}$, hidden states $\tilde{s}$, policies $\pi$ and parameters $\theta$:

$$P(\tilde{o}, \tilde{s}, \tilde{\theta}, \pi) = P(\pi)\, P(\theta) \prod_{t=0}^{T} P(o_t \mid s_t)\, P(s_t \mid s_{t-1}, \pi, \theta) \qquad (3)$$

where we have assumed that the parameterization of the likelihood distribution is fixed. In Eq. 3, we have parametrized both the likelihood distribution $P(o_t \mid s_t)$ and the transition distribution $P(s_t \mid s_{t-1}, \pi, \theta)$ with function approximators. Specifically, the likelihood distribution is described by a multivariate Gaussian with mean and covariance parameterized by some (potentially non-linear) function approximator, while the prior (transition) distribution is described by a Gaussian with mean and variance parameterized by some function approximator.

Amortizing the inference procedure offers several benefits. For instance, the number of parameters remains constant with respect to the size of the data set, and inference can be achieved through a single forward pass of a network. Moreover, while the amount of information encoded about variables is fixed, the conditional relationships between variables can be arbitrarily complex. In Eq. 3, the parameters of the transition distribution, $\theta$, are themselves random variables. In the current context, these parameters are the weights of the transition neural network. This approach allows the uncertainty about these parameters to be quantified and casts learning as a process of (variational) inference blundell_weight_2015 . The prior probability of $\theta$ is given by a standard Gaussian, which acts as a regularizer during learning. Although we have only considered distributions over the parameters of the transition distribution, the same scheme could be applied to the parameters of the likelihood distribution. Finally, the prior probability of policies is a softmax function of the negative expected free energy of those policies, $P(\pi) = \sigma(-\mathcal{G}(\pi))$ friston_active_2015 . This formalizes the notion that policies are a priori more likely if they are expected to minimize free energy in the future friston_free_2019 .
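The softmax prior over policies can be illustrated with a small numerical sketch (the particular expected free energy values below are hypothetical, chosen only for illustration):

```python
import numpy as np

def policy_prior(neg_efe, temperature=1.0):
    """Softmax prior over policies: P(pi) proportional to exp(-G(pi))."""
    z = np.asarray(neg_efe, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Three candidate policies with (hypothetical) expected free energies G
G = np.array([4.0, 1.0, 2.5])
prior = policy_prior(-G)
# The policy with the lowest expected free energy receives the highest prior mass
```

The temperature parameter is an illustrative addition: lowering it concentrates the prior on the policy with the smallest expected free energy.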

To make active inference applicable to the kinds of tasks considered in RL, we treat reward signals as observations in a separate modality. Therefore, we extend the generative model to include an additional scalar Gaussian over reward observations with unit variance and a mean parameterized by a fully-connected neural network.

We consider a recognition distribution over sequences of hidden states $\tilde{s}$, policies $\pi$ and parameters $\theta$:

$$Q(\tilde{s}, \tilde{\theta}, \pi) = Q(\pi)\, Q(\theta) \prod_{t=0}^{T} Q(s_t) \qquad (4)$$

The distribution $Q(s_t)$ is a diagonal Gaussian with mean and variance parameterized by some function approximator, while the variational posteriors over parameters $Q(\theta)$ and policies $Q(\pi)$ are both diagonal Gaussians.

3.2 Learning & Inference

In order to implement learning, we derive the updates for $Q(s_t)$, $Q(\theta)$ and $Q(\pi)$ that minimize free energy $\mathcal{F}$. Given Eq. 3 and 4, the variational free energy for a given time point $t$ can be defined as:

$$\mathcal{F}_t = \mathbb{E}_{Q(\theta) Q(s_{t-1})}\Big[ D_{\mathrm{KL}}\big[Q(s_t) \,\|\, P(s_t \mid s_{t-1}, a_{t-1}, \theta)\big] \Big] + D_{\mathrm{KL}}\big[Q(\theta) \,\|\, P(\theta)\big] - \mathbb{E}_{Q(s_t)}\big[\ln P(o_t \mid s_t)\big] \qquad (5)$$

where we have followed friston_active_2015 and omitted the term involving the prior over policies from the optimisation of $Q(s_t)$, allowing us to ignore the dependency between hidden states and (the prior probability of) policies. We optimise $Q(\pi)$ separately, as described in the following section.

Eq. 5 can be minimized with respect to the network parameters using stochastic gradient descent. Given some observation $o_t$, the negative log-likelihood (third term) can be calculated by mapping the observation to the variational parameters of $Q(s_t)$. The reparameterization trick kingma_auto-encoding_2013 is then utilized to obtain a differentiable sample from $Q(s_t)$ (for a Gaussian $\mathcal{N}(\mu, \sigma^2)$, a reparameterized sample is obtained via $s = \mu + \sigma \epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$), which is then passed through the likelihood network, giving the parameters of the likelihood distribution. The negative log-likelihood of the observation is then calculated under this distribution. Next, the parameter divergence (second term) is calculated analytically, as both distributions are fully factorized Gaussians. Finally, the state divergence (first term) is calculated by taking samples from $Q(\theta)$, again using the reparameterization trick. For each sample of $\theta$, a reparameterized sample from the previous beliefs over hidden states $Q(s_{t-1})$ is propagated through the transition network, conditioned on the action $a_{t-1}$ that was taken at the previous time step, giving the parameters of the transition distribution. The KL-divergence term is then calculated analytically for each sample of $\theta$ and averaged.

This procedure is carried out in batched fashion over the available data set. At test time, inference can be achieved by directly mapping observations to the variational parameters in a single forward pass. This approach to inferring hidden states is similar to that of a variational autoencoder kingma_auto-encoding_2013 , but here the global prior $\mathcal{N}(0, I)$ has been replaced with a prior based on the transition distribution. Moreover, the inference of parameters is homologous to the Bayesian neural network approach to parameter learning blundell_weight_2015 .
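The three terms of Eq. 5 can be sketched concretely for fully factorized Gaussians. The following is a minimal illustration, not the paper's implementation: dimensions, belief parameters, and the single-sample Monte Carlo estimates are all illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_kl(mu_q, var_q, mu_p, var_p):
    """Analytic KL[N(mu_q, var_q) || N(mu_p, var_p)] for diagonal Gaussians."""
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def gauss_nll(x, mu, var):
    """Negative log-likelihood of x under a diagonal Gaussian."""
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def reparam_sample(mu, var):
    """Differentiable (reparameterized) sample: s = mu + sigma * eps."""
    return mu + np.sqrt(var) * rng.standard_normal(mu.shape)

# Illustrative beliefs for a 4-dimensional latent state
mu_q, var_q = np.zeros(4), np.ones(4)          # Q(s_t)
mu_p, var_p = 0.1 * np.ones(4), np.ones(4)     # transition prior (one theta sample)
o_t = rng.standard_normal(4)                   # observation

state_div = gauss_kl(mu_q, var_q, mu_p, var_p)        # first term of Eq. 5
param_div = gauss_kl(np.zeros(8), 0.5 * np.ones(8),   # second term: Q(theta) vs N(0, I)
                     np.zeros(8), np.ones(8))
nll = gauss_nll(o_t, reparam_sample(mu_q, var_q), np.ones(4))  # third term

free_energy = state_div + param_div + nll
```

In the full scheme, `state_div` and `nll` would be averaged over multiple reparameterized samples and the gradient taken with respect to the network parameters producing the means and variances.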

Deriving updates for all parameters through a single (variational) objective function offers several potential benefits. First, the learned latent space must balance the compression of observations against the (action-conditioned) temporal transitions. This is in contrast to ‘modular’ approaches, whereby a latent space is first learned to compress observations and a transition model is subsequently learned in this fixed latent space ha_recurrent_2018 . Moreover, this approach allows the quantification of uncertainty in both hidden states and model parameters, thereby capturing both aleatoric and epistemic uncertainty depeweg_decomposition_2017 ; depeweg_decomposition_2017-1 .

3.3 Policy selection

Under active inference, policy selection is achieved by updating $Q(\pi)$ in order to minimize free energy $\mathcal{F}$. Given the prior belief that policies minimize expected free energy, i.e., $P(\pi) = \sigma(-\mathcal{G}(\pi))$ (as specified in Eq. 3), free energy is minimized when $Q(\pi) = \sigma(-\mathcal{G}(\pi))$ friston_active_2015 . For discrete action spaces with short temporal horizons, $\mathcal{G}(\pi)$ can be evaluated in full by considering each possible policy friston_active_2017 . However, in continuous action spaces, there are infinitely many policies, meaning an alternative approach is required.

In the current work, we treat $Q(\pi)$ as a diagonal Gaussian. At each time step, we optimise the parameters of $Q(\pi)$ such that $Q(\pi) \approx \sigma(-\mathcal{G}(\pi))$. While this solution will fail to capture the exact shape of the expected free energy landscape, agents need only identify its peak to enact the optimal policy. To optimise the parameters of $Q(\pi)$, we utilise the cross-entropy method (CEM) hafner_learning_2018 ; chua_deep_2018 . At each time step $t$, we consider policies of a fixed horizon $H$, i.e., $\pi = \{a_t, \dots, a_{t+H}\}$. The distribution over policies is initialized as a standard Gaussian and optimized as follows:

  • Sample $J$ candidate policies from $Q(\pi)$

  • Evaluate $-\mathcal{G}(\pi)$ for each sample (described in the following section), returning a scalar value

  • Refit $Q(\pi)$ to the top $K$ samples

This procedure is repeated $I$ times, after which the mean of the belief over the action for the current time step is returned. Moreover, this procedure is carried out after each new observation. The values of $H$, $J$, $K$ and $I$ were fixed across the current experiments.
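The CEM loop above can be sketched as follows. Here a toy quadratic objective stands in for the negative expected free energy, and all sizes (horizon, candidate counts, iteration count) are illustrative defaults rather than the paper's settings:

```python
import numpy as np

def cem_plan(objective, horizon, action_dim, iters=10, candidates=500, top_k=50, seed=0):
    """Cross-entropy method: iteratively refit a diagonal Gaussian over
    action sequences to the elite (highest-scoring) candidates."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample J candidate policies from Q(pi)
        pis = mu + sigma * rng.standard_normal((candidates, horizon, action_dim))
        # Score each candidate (here a toy objective standing in for -G(pi))
        scores = np.array([objective(pi) for pi in pis])
        elites = pis[np.argsort(scores)[-top_k:]]
        # Refit Q(pi) to the top-K samples
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu[0]  # return the mean action for the current time step

# Toy objective: prefer action sequences close to 0.7 on every dimension
best = cem_plan(lambda pi: -np.sum((pi - 0.7) ** 2), horizon=5, action_dim=2)
```

Only the first action of the optimized sequence is executed; the distribution is re-optimized from scratch after the next observation, in the model-predictive-control style described below.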

This process of model predictive control camacho_model_2007 was selected for consistency with previous computational models of active inference friston_active_2017 , where a distribution over policies is updated after each new observation. Alternative approaches include optimizing a parametrized policy with respect to past evaluations of expected free energy millidge_deep_2019 . However, this approach is not suited for non-stationary objective functions or active exploration shyam_model-based_2019 . Alternatively, a parametrized policy could be optimized with respect to imagined rollouts from a transition model hafner_learning_2018 , which would enable active exploration shyam_model-based_2019 . The effectiveness of these approaches depends on the complexity of the value function relative to the transition dynamics dong_bootstrapping_2019 , as well as the stationarity of the value function.

3.4 Trajectory sampling

To evaluate the expected free energy for a given policy, it is first necessary to evaluate the expected future beliefs conditioned on that policy. The fact that the transition model is probabilistic, and that its parameters are random variables, induces a distribution over future trajectories friston_active_2015 . Several approaches exist to approximate the propagation of uncertain trajectories chua_deep_2018 . For instance, one can ignore uncertainty entirely and propagate the mean of the distributions, or one can explicitly propagate the full statistics of the distribution deisenroth_gaussian_2015 . In the current work, we utilise a particle approach chua_deep_2018 ; hafner_learning_2018 , whereby a set of Monte Carlo samples is propagated. In particular, we draw a set of samples from the parameter distribution $Q(\theta)$ and, for each parameter sample, propagate a set of state particles through the transition model. To infer observations and rewards, we pass all samples through the respective model and average.
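A minimal sketch of this particle scheme, using a linear-Gaussian stand-in for the learned transition model (the particle counts, dimensions, and dynamics below are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
P, N, H, D = 8, 20, 10, 4   # parameter samples, state particles, horizon, state dim

# Stand-in for Q(theta): P sampled transition matrices around a shared mean
theta_mean = 0.9 * np.eye(D)
thetas = theta_mean + 0.05 * rng.standard_normal((P, D, D))

def transition(theta, s, a):
    """Stochastic stand-in for the transition distribution."""
    return s @ theta.T + a + 0.01 * rng.standard_normal(s.shape)

s0 = rng.standard_normal((N, D))            # particles from current state belief
actions = 0.1 * rng.standard_normal((H, D)) # candidate policy

# For each parameter sample, propagate N particles through the horizon
trajectories = np.empty((P, H, N, D))
for p in range(P):
    s = s0.copy()
    for t in range(H):
        s = transition(thetas[p], s, actions[t])
        trajectories[p, t] = s

# Predicted state at each future step: average over parameters and particles
predicted_mean = trajectories.mean(axis=(0, 2))  # shape (H, D)
```

The spread of particles across parameter samples reflects epistemic uncertainty, while the spread within a parameter sample reflects the stochasticity of the dynamics.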

3.5 Expected free energy

In this section we describe how to evaluate $\mathcal{G}(\pi, \tau)$, where we have used $\tilde{Q} = Q(o_\tau, s_\tau, \theta \mid \pi)$ for notational convenience. The negative expected free energy for a policy is equal to the sum of negative expected free energies over time, $-\mathcal{G}(\pi) = \sum_\tau -\mathcal{G}(\pi, \tau)$, where

$$-\mathcal{G}(\pi, \tau) = \underbrace{\mathbb{E}_{\tilde{Q}}\big[\ln P(o_\tau)\big]}_{\text{extrinsic value}} + \underbrace{\mathbb{E}_{\tilde{Q}}\big[\ln Q(s_\tau \mid o_\tau, \pi) - \ln Q(s_\tau \mid \pi)\big]}_{\text{state information gain}} + \underbrace{\mathbb{E}_{\tilde{Q}}\big[\ln Q(\theta \mid s_\tau, o_\tau, \pi) - \ln Q(\theta \mid s_\tau, \pi)\big]}_{\text{parameter epistemic value}} \qquad (6)$$

We refer to friston_active_2017 for a derivation of Eq. 6. The first term (extrinsic value) quantifies the degree to which the expected observations are congruent with the agent's prior beliefs (i.e., preferences) $P(o_\tau)$. Note that in active inference, there is no intrinsic delineation of reward signals: all observations are assigned some a priori probability. However, as RL environments specify a distinct reward signal, we have defined the agent's prior preferences over reward observations only. Moreover, as RL environments are constructed such that agents simply wish to maximize the sum of rewards (rather than obtain any particular reward observation), we evaluate extrinsic value as the expected reward under predicted states, such that extrinsic value increases as larger rewards are predicted. We refer the reader to catal_bayesian_2019 for an alternative formulation where agents learn a specific prior distribution.

The second term (state information gain) quantifies the expected reduction in uncertainty in beliefs over hidden states $s_\tau$. In other words, it encourages agents to sample data that resolve uncertainty about the hidden state of the environment. This term is formally equivalent to a number of established quantities, such as (expected) Bayesian surprise, mutual information, and the expected reduction in posterior entropy friston_active_2015 ; tschantz_learning_2019 , and has been used to describe various epistemic foraging behaviors, such as saccades parr_discrete_2018 ; yang_active_nodate ; itti_bayesian_2009 ; mirza_introducing_2019 and sentence comprehension friston_deep_2018 . In the current paper, we conduct experiments in fully observed environments and, as such, do not consider the state information gain term in our analysis.

The final term (parameter epistemic value) quantifies the expected reduction in uncertainty in beliefs over parameters $\theta$, and encourages agents to actively explore the environment in order to resolve uncertainty in their model schwartenbeck_computational_2019 ; friston_active_2017-2 . In order to evaluate parameter epistemic value, we use a nearest-neighbor estimate of the entropies depeweg_decomposition_2017-1 ; beirlant_nonparametric_1997 . In other words, we estimate the entropy via spatial properties of samples from the relevant distributions. Specifically, we estimate the entropy as $\hat{H} \approx \frac{1}{N} \sum_{i=1}^{N} \ln (N \cdot d_i) + \ln 2 + \gamma$, where $N$ is the number of samples from the distribution, $d_i$ is the distance of sample $i$ to its nearest neighbor among the other samples, and $\gamma$ is the Euler constant. Alternatively, parameter epistemic value could be rewritten as the (expected) Bayesian surprise of the distribution over parameters and then calculated analytically houthooft_vime:_2016 ; barron_information_2018 ; itti_bayesian_2009 ; mirza_introducing_2019 . However, this requires fictive updates to the parameter distribution, which is computationally expensive when conducted for each candidate policy at each time step.
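A one-dimensional version of this nearest-neighbor entropy estimate can be sketched as follows (sorting the samples makes nearest-neighbor distances cheap to compute; the sample count is illustrative):

```python
import numpy as np

def nn_entropy_1d(samples):
    """1-NN (Kozachenko-Leonenko style) entropy estimate for 1-D samples:
    H ~ (1/N) * sum_i ln(N * d_i) + ln(2) + Euler's constant."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    gaps = np.diff(x)
    # Nearest-neighbor distance for each (sorted) sample
    d = np.empty(n)
    d[0], d[-1] = gaps[0], gaps[-1]
    d[1:-1] = np.minimum(gaps[:-1], gaps[1:])
    return np.mean(np.log(n * d)) + np.log(2.0) + np.euler_gamma

rng = np.random.default_rng(0)
est = nn_entropy_1d(rng.standard_normal(5000))
# Analytic entropy of N(0, 1) is 0.5 * ln(2 * pi * e), roughly 1.42
```

For the multivariate distributions used in practice, the same idea applies with Euclidean nearest-neighbor distances and the appropriate unit-ball volume in place of the factor 2.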

3.6 Fully observed model

The model presented in the preceding sections serves as the most general formulation, applicable in both partially-observed and fully-observed environments. In what follows, we describe an implementation for the fully-observed case, leaving an analysis of the partially-observed case for future work.

To adapt the generative model for fully-observed environments, we utilise a fixed identity covariance for the likelihood distribution and parameterize its mean as the identity function, thereby encoding the belief that there is a direct mapping between states and observations. For the transition distribution, we parameterize the mean with a neural network and utilise a fixed unit variance. In all experiments, this is a feed-forward network with two fully connected layers of size 500 with ReLU activations, whose output defines the dimensionality of the latent state.

Note that by treating the variance of the transition distribution as fixed, the evaluation of parameter epistemic value is significantly simplified. Specifically, the second entropy term in parameter epistemic value becomes constant across policies, such that we need only evaluate the first entropy term. We use 5 samples from $Q(\theta)$ to evaluate the expectation in this entropy term throughout. Finally, we treat the variance of the recognition distribution as a fixed unit parameter and parameterize its mean as the identity function, thereby encoding the belief that there is a direct mapping between observations and states. Note that this means that the parameters of the likelihood and recognition distributions are fixed and are thus excluded from the optimisation scheme.

4 Experiments

In this section, we investigate (i) whether the proposed active inference model can successfully promote exploration in the absence of reward observations (i.e. exploration), and (ii) whether the model can achieve good performance and high sample efficiency on challenging continuous control tasks (i.e. exploitation). We evaluate these two aspects of the model separately, leaving analysis of their joint performance (i.e. the exploration-exploitation dilemma) to future work.

We utilise the following learning scheme for both the exploration and exploitation experiments. We initialize the data set with 5 seed episodes collected under random actions. For each iteration of the experiment, we train the agent’s model via Eq. 5 with 100 batches randomly sampled from the data set, using a batch size of 50. Agents then collect data from the environment until the episode ends (when the maximum number of steps is reached, or when the agent reaches a terminal state).
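This alternation between training and data collection can be summarized with the following skeleton. The environment, `train_on_batch`, and `plan_action` below are hypothetical toy stand-ins for the components described in Sec. 3; only the loop structure mirrors the scheme above:

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyEnv:
    """Tiny 1-D environment standing in for the control task (hypothetical)."""
    def __init__(self):
        self.t = 0
    def reset(self):
        self.t = 0
        return np.zeros(1)
    def step(self, action):
        self.t += 1
        obs = np.tanh(action) + 0.1 * rng.standard_normal(1)
        return obs, float(-np.abs(obs)), self.t >= 20  # obs, reward, done

def train_on_batch(batch):   # stand-in for minimizing Eq. 5 on one batch
    return len(batch)

def plan_action(obs):        # stand-in for CEM planning over -G(pi)
    return rng.standard_normal(1)

dataset = []
env = ToyEnv()
for _ in range(5):                                # 5 seed episodes, random actions
    obs, done = env.reset(), False
    while not done:
        action = rng.standard_normal(1)
        obs, reward, done = env.step(action)
        dataset.append((obs, action, reward))

for iteration in range(3):                        # each iteration of the experiment
    for _ in range(100):                          # 100 randomly sampled batches
        idx = rng.integers(len(dataset), size=min(50, len(dataset)))
        train_on_batch([dataset[i] for i in idx])
    obs, done = env.reset(), False                # then collect one episode
    while not done:
        action = plan_action(obs)
        obs, reward, done = env.step(action)
        dataset.append((obs, action, reward))
```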

4.1 Exploration

To test whether the active inference model enables efficient exploration, we examine the region of state space visited by different algorithms in the continuous MountainCar environment. We compare the active inference model to two algorithms: (i) a ‘reward’ agent, which operates via the same scheme but only selects actions based on extrinsic value, and (ii) an ε-greedy agent, which selects actions based on extrinsic value but additionally adds Gaussian exploration noise to each action. Agents learn and act in the environment for 100 epochs. The cumulative coverage of state space is plotted in Fig. 1. These results demonstrate that the active inference agent effectively explores more of the state space, relative to the other algorithms.

Figure 1: Comparison of exploration strategies. (A) The cumulative state-space coverage after 100 epochs for the reward agent. (B) The cumulative state-space coverage after 100 epochs for the ε-greedy agent. (C) The cumulative state-space coverage after 100 epochs for the active inference agent. These results demonstrate that the active inference agent explores more of the state space, relative to the other exploration strategies.

4.2 Exploitation

Next, we investigate whether the active inference agent can achieve good performance on continuous control tasks. We explore performance in the inverted pendulum task and the more challenging hopper task. The performance of our model is compared to a strong model-free baseline, DDPG lillicrap_continuous_2019 .

As both environments have well-shaped rewards, we only consider the exploitation component (extrinsic value) of the expected free energy objective function, ignoring the exploration component (epistemic value). As such, we utilise a point-estimate version of the model, thus removing the distributions over parameters. To retain stochasticity in the transition model, we parameterize both the mean and variance of the transition distribution using a function approximator (as opposed to just the mean), and fix the variance of the recognition distribution to 0.1. Moreover, following hafner_learning_2018 , we use an action repeat of 3 for all algorithms, enabling shorter planning horizons and a more pronounced learning signal.

In Fig. 2, we plot the performance of both algorithms as a function of the number of epochs. We show the mean performance over a fixed set of 5 random seeds, with shaded areas denoting the 95% interquartile range at each epoch. These results demonstrate that the active inference agent can achieve strong performance in under 100 epochs on both tasks, demonstrating an order of magnitude better sample efficiency than the model-free baseline.

Figure 2: Comparison of Performance on two continuous control tasks. (A) Average returns over 1500 epochs on the inverted pendulum task for the active inference agent and the model-free DDPG agent. (B) Average returns over 1500 epochs on the hopper task for the active inference agent and the model-free DDPG agent. Note that for A & B, we stopped the active inference agent after 100 epochs due to convergence. (C & D) A zoomed-in view of figures A & B, showing a more fine-grained view of the active inference agent’s progression over 100 epochs. For all figures, the filled lines represent the mean of 5 random seeds, whereas the shaded areas denote the 95% interquartile range. Together, these results demonstrate that the active inference agent can learn difficult continuous control tasks, with a far greater sample efficiency, relative to a strong model-free baseline.

5 Previous work

Deep active inference

Previous work has explored the prospect of scaling active inference using amortized inference. In ueltzhoffer_deep_2018 , the authors parameterized both the generative model and recognition distribution with function approximators and used evolutionary strategies to optimise the free energy functional when gradients were not available. Similarly, millidge_deep_2019 utilized amortization to parametrize distributions and also amortized action by learning a parameterized approximation of the expected free energy bound. Finally, catal_bayesian_2019 extended previous work to include a specific planning component based on CEM. The authors focused on the problem of learning the prior distribution over reward observations and demonstrated this could be implemented in a learning-from-example framework.

Our work builds upon these previous models by incorporating model uncertainty and its active resolution. In other words, we extend the previous point-estimate models to include full distributions over parameters and update the expected free energy functional such that the uncertainty in these distributions is actively minimized. This brings our implementation in line with the canonical models of active inference from the cognitive and computational neuroscience literature friston_free_2019 . Moreover, it enables us to evaluate the feasibility of active exploration under the scaled active inference framework, apply the model to more complex control tasks, and obtain increased sample efficiency, relative to previous models.

Model-based RL

The model presented in the current work bears a number of resemblances with model-based approaches to RL. First, variational autoencoders have been used extensively to map observations into a compressed latent space, thereby simplifying the problem of action selection and the process of learning a forward transition model ha_recurrent_2018 ; hafner_learning_2018 ; igl_deep_2018 ; karl_deep_2016 ; kaiser_model-based_2019 ; barron_information_2018 ; watter_embed_2015 . Moreover, the CEM algorithm is a popular method for implementing planning in model-based RL hafner_learning_2018 ; chua_deep_2018 ; nagabandi_neural_2017 . Recent work has additionally highlighted the importance of using a probabilistic dynamics model in order to capture epistemic uncertainty chua_deep_2018 ; hafner_learning_2018 ; deisenroth_pilco:_2011 ; yarin_gal_improving_2016 ; kahn_uncertainty-aware_2017 ; vuong_uncertainty-aware_2019 . The success of these approaches has demonstrated that deterministic models are prone to model bias, which can lead to overfitting in low data regimes. Most approaches either utilize Bayesian neural networks depeweg_decomposition_2017-1 , ensembles of deterministic networks chua_deep_2018 , dropout yarin_gal_improving_2016 or Gaussian processes deisenroth_gaussian_2015 in order to capture uncertainty. In the current work, we opted for Bayesian neural networks to ensure consistency with the variational principles espoused by the active inference framework, but note that ensembles can be made explicitly Bayesian with minor modifications pearce_bayesian_2018 .

Information gain

Identifying scalable and efficient exploration strategies remains one of the key open questions in RL. Model-free methods, such as ε-greedy or Boltzmann choice rules sutton_introduction_1998 , utilize noise in the action-selection process or uncertainty in the reward statistics agrawal_analysis_2012 ; speekenbrink_uncertainty_2015 .
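These two model-free rules are simple enough to sketch. The snippet below is a generic illustration (not tied to any cited implementation); `q_values` stands for any vector of action-value estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the greedy (highest-value) action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    """Sample actions in proportion to exponentiated value estimates;
    lower temperatures concentrate mass on the greedy action."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    probs = np.exp(prefs - prefs.max())  # subtract max for stability
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))
```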

A more powerful approach osband_generalization_2016 is to construct a model of the world, allowing the agent to evaluate which parts of state space it has (and has not) visited. For instance, bellemare_unifying_2016 construct a pseudo-count measure for estimating state visitation frequency in continuous state spaces. Alternatively, an explicit forward model of the transition dynamics can be learned. This allows for measures such as the amount of prediction error stadie_incentivizing_2015 ; thrun_efficient_1992 ; chentanez_intrinsically_2005 ; meyer_possibility_1991 or prediction error improvement lopes_exploration_2012 to be utilized for exploration.
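The prediction-error measure cited above can be sketched in a few lines: the intrinsic reward for a transition is the learned forward model's squared prediction error, so poorly modelled transitions attract exploration. This is a minimal illustration; the forward model below is a hypothetical stand-in, not a learned network.

```python
import numpy as np

def prediction_error_bonus(forward_model, state, action, next_state):
    """Exploration bonus: squared error of the forward model's prediction.
    Transitions the model predicts poorly yield a large intrinsic reward."""
    predicted = forward_model(state, action)
    return float(np.sum((predicted - next_state) ** 2))

# Hypothetical learned model that (wrongly) assumes actions have no effect.
model = lambda s, a: s
bonus = prediction_error_bonus(model, np.array([0.0]), np.array([1.0]),
                               np.array([1.0]))
# The agent would then maximize: extrinsic_reward + beta * bonus,
# where beta trades off exploration against exploitation.
```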

If the learned model (implicitly or explicitly) captures probabilistic features then information-theoretic measures can be used to guide exploration (see aubret_survey_2019 for a review). In still_information-theoretic_2012 , the authors derived an information-theoretic measure to maximize the predictive power of the agent, while in mohamed_variational_2015 , the authors derived an objective function to maximize the mutual information between actions and future states of the environment (i.e., empowerment).

Of particular relevance to the current work is the use of information gain to promote exploration, which has been demonstrated to outperform alternative measures such as prediction error hester_intrinsically_2017 . From a theoretical perspective, information gain helps overcome what is known as the "TV-problem" itti_bayesian_2009 , where (unpredictable) noise in the environment is mistakenly treated as epistemically valuable. This is because information gain considers the amount of information provided for beliefs, as opposed to the amount of information provided by the signal per se.

Information gain can be traced back to lindley_measure_1956 , who used it to quantify the amount of information to be gained from some experiment. sun_planning_2011 developed a Bayesian framework in order to maximize information gain via dynamic programming; however, the experiments were limited to discrete state spaces with tabular MDPs. In houthooft_vime:_2016 , the authors utilized Bayesian neural networks to quantify the amount of information gained from some (action-conditioned) transition. This work was further extended in barron_information_2018 , where the amount of information gain was quantified with respect to a latent dynamics model.
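For the diagonal-Gaussian parameter posteriors typically used in variational Bayesian neural networks, the information gained from a transition has a closed form: the KL divergence between the parameter posterior after the Bayesian update and the posterior before it. A minimal sketch (the numbers below are purely illustrative):

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) between diagonal Gaussians, in nats."""
    mu_q, sigma_q = np.asarray(mu_q, float), np.asarray(sigma_q, float)
    mu_p, sigma_p = np.asarray(mu_p, float), np.asarray(sigma_p, float)
    return float(np.sum(
        np.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
        - 0.5))

# Information gained from one transition: KL between the parameter
# posterior after the update and the posterior before it.
before = ([0.0, 0.0], [1.0, 1.0])  # (means, std devs) prior to the update
after = ([0.5, 0.0], [0.8, 1.0])   # posterior after observing the transition
gain = gaussian_kl(after[0], after[1], before[0], before[1])
```

Because pure observation noise leaves the parameter posterior unchanged, its information gain is zero, which is exactly why this measure sidesteps the "TV-problem" discussed above.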

In parallel with the current work, shyam_model-based_2019 looked to maximize expected information gain, which entails an active approach to exploration. This is in contrast to the majority of exploration strategies in RL, which are reactive, in the sense that they must first observe an informative state before being able to gather information shyam_model-based_2019 . This can lead to problems of over-commitment, whereby the apparent informativeness of parts of state space must be unlearned once the relevant information has been gathered. However, shyam_model-based_2019 optimized expected information gain offline, whereas the current model uses an online approach. Finally, the use of nearest-neighbour entropy estimators for information gain has been explored in mirchev_approximate_2018 ; depeweg_decomposition_2017-1 .

6 Discussion & Conclusion

We have presented a model of active inference that can scale to continuous control tasks, complex dynamics and high-dimensional state spaces. The presented model can be trained via a single objective function, expected free energy, that captures both epistemic and aleatoric uncertainty, and prescribes both goal-directed and information-gathering behaviour via a single normative drive.
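For reference, a common decomposition of (negative) expected free energy from the active inference literature separates an extrinsic, goal-directed term from two epistemic, information-gain terms; the notation below is generic and may differ in detail from the paper's own formulation.

```latex
% o_\tau: observations, s_\tau: latent states, \theta: model parameters,
% \pi: policy. Maximizing -G(\pi,\tau) drives both goal-directed and
% information-gathering behaviour.
-G(\pi, \tau) =
  \underbrace{\mathbb{E}_{q(o_\tau \mid \pi)}\big[\log p(o_\tau)\big]}_{\text{extrinsic (goal-directed) value}}
  + \underbrace{\mathbb{E}_{q(o_\tau \mid \pi)}\, D_{\mathrm{KL}}\big[q(s_\tau \mid o_\tau, \pi)\,\big\|\,q(s_\tau \mid \pi)\big]}_{\text{state information gain}}
  + \underbrace{\mathbb{E}_{q(o_\tau, s_\tau \mid \pi)}\, D_{\mathrm{KL}}\big[q(\theta \mid s_\tau, o_\tau, \pi)\,\big\|\,q(\theta)\big]}_{\text{parameter information gain}}
```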

Our model makes two primary contributions. First, we showed that the full active inference construct can be scaled to the kinds of tasks considered in the RL literature. This involved extending previous models of deep active inference to include model uncertainty and expected information gain. Second, we highlighted the overlap between active inference and state-of-the-art approaches to model-based RL. These include the use of variational inference for the compression of observations, the use of variational inference for learning distributions over parameters, the use of probabilistic models of dynamics, the use of prospective planning in latent space, and the active resolution of uncertainty.

While active inference derives the properties of living systems from first principles friston_free_2019 , and model-based RL has attempted to engineer adaptive agents through the most effective means available, both perspectives have converged on similar solutions. Our work has exploited this convergence to show that active inference provides a principled and unified theoretical framework in which to contextualize the various developments in model-based RL. This perspective by itself offers little practical benefit. However, active inference offers two potentially novel perspectives from which model-based RL can benefit. The first is casting reward as (prior) probabilities. This provides a principled framework for learning reward structure (i.e., reward shaping), for assigning rewards (i.e., probability) across multiple observation modalities juechems_where_2019 , and for learning-from-demonstration catal_bayesian_2019 . The second perspective is casting both exploration and exploitation as two components of a single imperative to maximize expected Bayesian model evidence. This perspective has the potential to recast the exploration-exploitation dilemma as a problem of optimizing parameters in order to maximize model evidence. We leave a practical investigation of this perspective to future work.

7 Acknowledgements

AT is funded by a PhD studentship from the Dr. Mortimer and Theresa Sackler Foundation and the School of Engineering and Informatics at the University of Sussex. CLB is supported by BBSRC grant number BB/P022197/1. MB acknowledges support as an International Research Fellow of the Japan Society for the Promotion of Science. AT and AKS are grateful to the Dr. Mortimer and Theresa Sackler Foundation, which supports the Sackler Centre for Consciousness Science. AKS is additionally grateful to the Canadian Institute for Advanced Research (Azrieli Programme on Brain, Mind, and Consciousness).