1 Introduction
Deep Reinforcement Learning (RL) has led to successful results in several domains, such as robotics, video games and board games Schrittwieser et al. (2020); OpenAI et al. (2019); Badia et al. (2020). From a neuroscience perspective, the reward prediction error signal that drives learning in deep RL closely relates to the neural activity of dopamine neurons for reward-based learning Schultz (1998); Bogacz (2020). However, the reward functions used in deep RL typically require domain- and task-specific design from humans, spoiling the generalization capabilities of RL agents. Furthermore, the possibility of faulty reward functions makes the application of deep RL risky in real-world contexts, given the unexpected behaviors that may derive from them Clark and Amodei (2016); Krakovna and others (2020); Popov and others (2017).

Active Inference (AIF) has recently emerged as a unifying framework for learning perception and action. In AIF, agents operate according to one absolute imperative: minimize their free energy Friston et al. (2016). With respect to past experience, this encourages agents to update an internal model of the world to maximize evidence with respect to sensory data. With regard to future actions, the inference process becomes ‘active’ and agents select behaviors that fulfill optimistic predictions of their model, which are represented as preferred outcomes or goals Friston et al. (2010). Compared to RL, the AIF framework provides a more natural way of encoding objectives for control. However, its applicability has been limited by shortcomings in scaling the approach to complex environments, and current implementations have focused on tasks with either low-dimensional sensory inputs and/or small sets of discrete actions Da Costa et al. (2020). Moreover, several experiments in the literature have replaced the agent’s preferred outcomes with RL-like rewards from the environment, downplaying the AIF potential to provide self-supervised objectives Fountas et al. (2020); Millidge (2019); Tschantz et al. (2020).
One of the major shortcomings in scaling AIF to high-dimensional, e.g. image-based, environments comes from the necessity of building accurate models of the world, which try to reconstruct every detail in the sensory data. This complexity is also reflected in the control stage, when AIF agents compare future imaginary outcomes of potential actions with their goals, to select the most convenient behaviors. In particular, we advocate that fulfilling goals in image space can be poorly informative for building an objective for control.
In this work, we propose Contrastive Active Inference, a framework for AIF that exploits contrastive learning both to reduce the complexity of the agent’s internal model and to provide a more suitable objective for fulfilling preferred outcomes. Our method provides a self-supervised objective that constantly informs the agent about the distance from its goal, without needing to reconstruct the outcomes of potential actions in high-dimensional image space.
The contributions of our work can be summarised as follows: (i) we propose a framework for AIF that drastically reduces the computational power required both for learning the model and planning future actions, (ii) we combine our method with value iteration methods for planning, inspired by the RL literature, to amortize the cost of planning in AIF, (iii) we compare our framework to state-of-the-art RL techniques and to a non-contrastive AIF formulation, showing that our method compares well with reward-based systems and outperforms non-contrastive AIF, (iv) we show that contrastive methods work better than reconstruction-based methods in the presence of distractors in the environment, (v) we found that our contrastive objective for control allows matching desired goals, despite differences in the backgrounds. The latter finding could have important consequences for deploying AIF in real-world settings, such as robotics, where perfectly reconstructing observations from the environment and matching them with high-dimensional preferences is practically unfeasible.
2 Background
The control setting can be formalized as a Partially Observable Markov Decision Process (POMDP), denoted by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \Omega, \mathcal{O}, \gamma)$, where $\mathcal{S}$ is the set of unobserved states, $\mathcal{A}$ is the set of actions, $\mathcal{T}$ is the state transition function, also referred to as the dynamics of the environment, $\Omega$ is the set of observations, $\mathcal{O}$ is a set of conditional observation probabilities, and $\gamma \in [0, 1)$ is a discount factor (Figure 1). We use the terms observations and outcomes interchangeably throughout the work. In RL, the agent also has access to a reward function $r(s_t, a_t)$, mapping state-action pairs to rewards.

Active Inference. In AIF, the goal of the agent is to minimize (a variational bound on) the surprisal over observations, $-\log p(o)$. With respect to past observations, the upper bound leads to the variational free energy $\mathcal{F}$, which for timestep $t$ is:
$$\mathcal{F}_t = \mathbb{E}_{q(s_t)}\big[\log q(s_t) - \log p(o_t, s_t \mid s_{t-1}, a_{t-1})\big] \quad (1)$$

where $q(s_t)$ represents an approximate posterior.
The agent hence builds a generative model over states, actions and observations, by defining a state transition function $p(s_t \mid s_{t-1}, a_{t-1})$ and a likelihood mapping $p(o_t \mid s_t)$, while the posterior distribution over states is approximated by the variational distribution $q(s_t)$. The free energy can then be decomposed as:
$$\mathcal{F}_t = \underbrace{D_{KL}\big[q(s_t) \,\|\, p(s_t \mid s_{t-1}, a_{t-1})\big]}_{\text{complexity}} - \underbrace{\mathbb{E}_{q(s_t)}\big[\log p(o_t \mid s_t)\big]}_{\text{accuracy}} \quad (2)$$
This implies that minimizing variational free energy, on the one hand, maximizes the likelihood of observations under the likelihood mapping (i.e. maximizing accuracy), whilst minimizing the KL divergence between the approximate posterior and prior (i.e. complexity). Note that for the past we assume that outcomes and actions are observed, hence inferences are only made about the states. Also note that the variational free energy is defined as the negative of the evidence lower bound known from the variational autoencoder framework Rezende et al. (2014); Kingma and Welling (2014).

For future timesteps, the agent has to make inferences about both future states and actions, while taking into account expectations over future observations. Crucially, in active inference the agent has a prior distribution $\tilde{p}(o)$ on preferred outcomes it expects to obtain. Action selection is then cast as an inference problem, i.e. inferring actions that will yield preferred outcomes, or more formally that minimize the expected free energy $\mathcal{G}$:
$$\mathcal{G} = \mathbb{E}_{\tilde{q}}\big[\log q(s_{t:T}, a_{t:T}) - \log \tilde{p}(o_{t:T}, s_{t:T}, a_{t:T})\big] \quad (3)$$

where $\tilde{p}$ is the agent’s biased generative model, and the expectation $\tilde{q}$ is over predicted observations, states and actions.
If we assume that the variational posterior over states is a good approximation of the true posterior, i.e. $q(s_t \mid o_t) \approx p(s_t \mid o_t)$, and we also consider a uniform prior over actions Millidge et al. (2020), the expected free energy can be formulated as:
$$\mathcal{G}_t = -\underbrace{\mathbb{E}\big[D_{KL}[q(s_t \mid o_t) \,\|\, q(s_t)]\big]}_{\text{intrinsic value}} - \underbrace{\mathbb{E}\big[\log \tilde{p}(o_t)\big]}_{\text{extrinsic value}} - \underbrace{\mathcal{H}\big(q(a_t)\big)}_{\text{action entropy}} \quad (4)$$
Intuitively, this means that the agent will infer actions for which observations have a high information gain about the states (i.e. intrinsic value), which will yield preferred outcomes (i.e. extrinsic value), while also keeping its possible actions as varied as possible (i.e. action entropy).
Full derivations of the equations in this section are provided in the Appendix.
Reinforcement Learning. In RL, the objective of the agent is to maximize the discounted sum of rewards, or return, $\sum_t \gamma^t r_t$, over time. Deep RL can also be cast as probabilistic inference, by introducing an optimality variable $\mathcal{O}_t$, which denotes whether time step $t$ is optimal Levine (2018). The distribution over the optimality variable is defined in terms of rewards as $p(\mathcal{O}_t = 1 \mid s_t, a_t) = \exp\big(r(s_t, a_t)\big)$. Inference is then obtained by optimizing the following variational lower bound:
$$\log p(\mathcal{O}_{1:T}) \geq \mathbb{E}_{q}\Big[\sum_{t=1}^{T} r(s_t, a_t) + \mathcal{H}\big(q(a_t \mid s_t)\big)\Big] \quad (5)$$
where the reward-maximizing RL objective is augmented with an action entropy term, as in maximum entropy control Haarnoja et al. (2018). As also highlighted in Millidge et al. (2020), if we assume the preferred outcomes are distributed according to rewards, i.e. $\tilde{p}(o_t) = \exp\big(r(o_t)\big)$, we can see that RL works like AIF, but encodes the optimality value in the likelihood rather than in the prior.
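As a concrete illustration of the bound, the bracketed quantity in Equation 5 can be computed for a sampled trajectory of a tabular policy. The sketch below uses illustrative names (not from the paper’s implementation) and simply sums per-step rewards and policy entropies:

```python
import numpy as np

def max_entropy_objective(rewards, action_probs):
    """Return sum_t [ r_t + H(q(a_t|s_t)) ], the quantity inside the
    expectation of the maximum-entropy bound (Eq. 5), for one sampled
    trajectory of a tabular policy."""
    rewards = np.asarray(rewards, dtype=float)
    # per-step entropy of the action distribution
    entropies = np.array([-np.sum(p * np.log(p)) for p in action_probs])
    return float(np.sum(rewards + entropies))
```

For a uniform two-action policy, each step contributes its reward plus ln 2 of entropy, so a more stochastic policy raises the objective even at equal return.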
In order to improve the sample-efficiency of RL, model-based approaches (MBRL), where the agent relies on an internal model of the environment to plan high-rewarding actions, have been studied.
Contrastive Learning.
Contrastive representations, which aim to organize the data by distinguishing similar and dissimilar pairs, can be learned through Noise Contrastive Estimation (NCE) Gutmann and Hyvärinen (2010). Following Poole et al. (2019), an NCE loss can be defined as a lower bound on the mutual information between two variables. Given two random variables $X$ and $Y$, the NCE lower bound is:

$$I(X; Y) \geq \mathbb{E}\left[\frac{1}{K}\sum_{i=1}^{K} \log \frac{e^{f(x_i, y_i)}}{\frac{1}{K}\sum_{j=1}^{K} e^{f(x_i, y_j)}}\right] \quad (6)$$

where the expectation is over $K$ independent samples from the joint distribution $\prod_i p(x_i, y_i)$, and $f(x, y)$ is a function, called a critic, that approximates the density ratio $p(x \mid y)/p(x)$. Crucially, the critic can be unbounded, as in van den Oord et al. (2019), where the authors showed that an inner product of transformed samples from $X$ and $Y$, namely $f(x, y) = g(x)^\top h(y)$, with $g$ and $h$ neural network functions, works well as a critic.

3 Contrastive Active Inference
In this section, we present the Contrastive Active Inference framework, which reformulates the problem of optimizing the free energy of the past and the expected free energy of the future as contrastive learning problems.
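Both contrastive functionals derived below build on the NCE bound of Equation 6 with an inner-product critic. As a minimal reference, the bound can be estimated as in the following sketch (illustrative names; it assumes the paired samples have already been embedded by the critic’s two networks):

```python
import numpy as np

def info_nce(x_feats, y_feats):
    """InfoNCE estimate of the bound in Eq. 6 with an inner-product
    critic f(x, y) = g(x)^T h(y). Inputs are (K, d) arrays of already
    embedded samples; row i of both arrays comes from the same joint
    draw, so off-diagonal pairs act as negatives."""
    scores = x_feats @ y_feats.T  # (K, K) critic values f(x_i, y_j)
    positives = np.diag(scores)
    # log of each positive score against the row-wise average of all scores
    log_ratios = positives - np.log(np.mean(np.exp(scores), axis=1))
    return float(np.mean(log_ratios))
```

Note that the estimate can never exceed log K, where K is the number of samples in the batch, which is why larger batches of negatives tighten the bound.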
3.1 Contrastive Free Energy of the Past
In order to learn a generative model of the environment following AIF, an agent could minimize the variational free energy from Equation 2. For high-dimensional signals, such as pixel-based images, the model works similarly to a Variational Auto-Encoder (VAE) Kingma and Welling (2014), with the information encoded in the latent state being used to produce reconstructions of the high-dimensional observations through the likelihood model.
However, reconstructing images at pixel level has several shortfalls: (a) it requires models with high capacity, (b) it can be quite computationally expensive, and (c) there is the risk that most of the representation capacity is wasted on complex details of the images that are irrelevant for the task.
We can avoid predicting observations by using an NCE loss. By optimizing the mutual information between states and observations, it becomes possible to infer $s_t$ from $o_t$ without having to compute a reconstruction. In order to turn the variational free energy loss into a contrastive loss, we add the constant marginal log-probability of the data $\log p(o_t)$ to $\mathcal{F}_t$, obtaining:

$$\mathcal{F}_t + \log p(o_t) = D_{KL}\big[q(s_t) \,\|\, p(s_t \mid s_{t-1}, a_{t-1})\big] - \mathbb{E}_{q(s_t)}\Big[\log \frac{p(o_t \mid s_t)}{p(o_t)}\Big] \quad (7)$$

As for Equation 6, we can apply a lower bound on the mutual information between $s_t$ and $o_t$. We can define the contrastive free energy of the past as:

$$\mathcal{F}^{NCE}_t = D_{KL}\big[q(s_t) \,\|\, p(s_t \mid s_{t-1}, a_{t-1})\big] - \mathbb{E}\big[f(o_t, s_t)\big] + \mathbb{E}\Big[\log \frac{1}{K}\sum_{j=1}^{K} e^{f(o_j, s_t)}\Big] \quad (8)$$

where the dynamics is modelled as $p(s_t \mid s_{t-1}, a_{t-1})$, and the samples $o_j$ represent observations that do not match with the state $s_t$, catalyzing the contrastive mechanism. Given that the NCE term lower-bounds the mutual information, this contrastive utility provides an upper bound on the variational free energy, up to the constant $\log p(o_t)$, and thus on surprisal.
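A single-timestep evaluation of the contrastive free energy of the past can be sketched as follows. This is a hypothetical simplification with illustrative names: it assumes diagonal Gaussian posterior and prior, pre-embedded critic features, and includes the positive in the NCE normalizer (one common variant); the paper’s actual networks are neural and trained in batch:

```python
import numpy as np

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL divergence between diagonal Gaussians q and p
    (the complexity term shared by Eq. 2 and Eq. 8)."""
    return float(0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0))

def contrastive_free_energy_past(mu_q, var_q, mu_p, var_p,
                                 s_feat, o_feat, neg_o_feats):
    """Single-step sketch of F^NCE: dynamics KL plus an InfoNCE-style
    term that pulls the state embedding toward its matching observation
    and away from the negatives."""
    kl = gaussian_kl(mu_q, var_q, mu_p, var_p)
    pos = s_feat @ o_feat                        # critic value f(o_t, s_t)
    scores = np.concatenate([[pos], neg_o_feats @ s_feat])
    nce = pos - np.log(np.sum(np.exp(scores)))   # <= 0, near 0 when the
    return kl - nce                              # positive dominates
```

When the posterior matches the prior and the positive score dominates the negatives, the loss approaches zero, which is the behavior the bound predicts.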
3.2 Contrastive Free Energy of the Future
Performing active inference for action selection means inferring actions that realize preferred outcomes, by minimizing the expected free energy $\mathcal{G}$. In order to assess how likely expected future outcomes are to fulfill the agent’s preferences, in Equation 4, the agent uses its generative model to predict future observations.
Reconstructing imaginary observations in the future can be computationally expensive. Furthermore, matching imagined outcomes with the agent’s preferences in pixel space can be poorly informative, as pixels are not supposed to capture any semantics about observations. Also, observations that are “far” in pixel space aren’t necessarily far in transition space. For example, when the goal is behind a door, standing before the door is “far” in pixel space but only one action away (i.e. opening the door).
When the agent learns a contrastive model of the world, following Equation 8, it can exploit its ability to match observations with states without reconstructions, in order to search for the states that correspond to its preferences. Hence, we formulate the expectation in the expected free energy in terms of the preferred outcomes, so that we can add the constant marginal $\log \tilde{p}(o_t)$, obtaining:

$$\mathcal{G}_t + \log \tilde{p}(o_t) = -\mathbb{E}\big[D_{KL}[q(s_t \mid o_t) \,\|\, q(s_t)]\big] - \mathbb{E}\Big[\log \frac{\tilde{p}(o_t \mid s_t)}{\tilde{p}(o_t)}\Big] - \mathcal{H}\big(q(a_t)\big) \quad (9)$$

With abuse of notation, the second term is a mutual information that quantifies the amount of information shared between future imaginary states and preferred outcomes.
We further assume $q(s_t \mid o_t) \approx q(s_t)$, which constrains the agent to only modify its actions, preventing it from changing the dynamics of the world to accomplish its goal, as pointed out in Levine (2018). This leads to the following objective for the contrastive free energy of the future:

$$\mathcal{G}^{NCE}_t = -\mathbb{E}\big[f(\tilde{o}_t, s_t)\big] + \mathbb{E}\Big[\log \frac{1}{K}\sum_{j=1}^{K} e^{f(o_j, s_t)}\Big] - \mathcal{H}\big(q(a_t)\big) \quad (10)$$

Similarly to $\mathcal{F}^{NCE}_t$, the samples $o_j$ foster the contrastive mechanism, ensuring that the state $s_t$ corresponds to the preferred outcomes, while also being as distinguishable as possible from other observations. This implies a process similar to the ambiguity minimization typically associated with the AIF framework Friston et al. (2015).
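To make the role of the contrastive objective of the future concrete, the following sketch scores one imagined state against the embedded preferred outcome, normalized against negative observations. Names are illustrative, the action-entropy term is omitted for brevity, and the positive is again folded into the normalizer:

```python
import numpy as np

def contrastive_future_utility(state_feat, goal_feat, neg_obs_feats):
    """Sketch of a per-state contrastive expected-utility term: negative
    critic score of the imagined state against the preferred outcome,
    normalized against negatives. Lower values mean the state is closer
    to (and more distinguishable as) the goal."""
    pos = state_feat @ goal_feat
    normalizer = np.log(np.exp(pos) + np.sum(np.exp(neg_obs_feats @ state_feat)))
    return float(normalizer - pos)
```

Under this score, an imagined state whose embedding aligns with the goal receives a strictly lower (better) utility than one aligning with a negative observation, which is exactly the dense signal discussed in Section 5.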
4 Model and Algorithm
The AIF framework entails perception and action, in a unified view. In practice, this is translated into learning a world model, to capture the underlying dynamics of the environment, minimizing the free energy of the past, and learning a behavior model, which proposes actions to accomplish the agent’s preferences, minimizing the free energy of the future. In this work, we exploit the high expressiveness of deep neural networks to learn the world and the behavior models.
The world model is composed of the following components: a prior network $p(s_t \mid s_{t-1}, a_{t-1})$, a posterior network $q(s_t \mid o_t, s_{t-1}, a_{t-1})$, and a representation model $f(o_t, s_t)$.

For the prior network, we use a GRU Chung et al. (2014), while the posterior network combines a GRU with a CNN to process observations. Both the prior and the posterior outputs are used to parameterize Gaussian multivariate distributions, which represent a stochastic state, from which we sample using the reparameterization trick Kingma and Welling (2014). This setup is inspired by the models presented in Hafner et al. (2019); Çatal et al. (2020b); Buesing et al. (2018). For the representation model, we utilize a network that first processes $o_t$ and $s_t$ with MLPs and then computes the dot-product between the outputs, obtaining $f(o_t, s_t) = g(o_t)^\top h(s_t)$, analogously to van den Oord et al. (2019). The unified world model loss is the contrastive free energy $\mathcal{F}^{NCE}_t$, summed over timesteps.
In order to amortize the cost of long-term planning for behavior learning, we use an expected utility function $v(s_t)$ to estimate the expected free energy of the future from state $s_t$, similarly to Millidge (2019). The behavior model is then composed of an action network and an expected utility network, both MLPs that are concurrently trained as in actor-critic architectures for RL Konda and Tsitsiklis (2000); Haarnoja et al. (2018). The action network aims to minimize the expected utility, which is an estimate of the expected free energy of the future over a potentially infinite horizon, while the utility network aims to predict a good estimate of the expected free energy of the future that is obtainable by following the actions of the action network. In the utility network loss, the sum from the current time step to an infinite horizon is obtained by using a TD($\lambda$) exponentially-weighted estimator that trades off bias and variance Schulman et al. (2018) (details in Appendix).

The training routine, which alternates updates to the models with data collection, is shown in Algorithm 1. At each training iteration of the model, we sample trajectories of fixed length from the replay buffer. Negative samples for the contrastive functionals are selected, for each state, by taking intra-episode negatives, corresponding to temporally different observations, and extra-episode negatives, corresponding to observations from different episodes.
Most of the above choices, along with the training routine itself, are deliberately inspired by current state-of-the-art approaches for MBRL Hafner et al. (2020, 2021); Clavera et al. (2020). The motivation behind this is twofold: on the one hand, we want to show that approaches that have been used to scale RL for complex planning can also straightforwardly be applied to scale AIF. On the other hand, in the next section, we offer a direct comparison to current state-of-the-art techniques for RL that, being unbiased with respect to the models’ architecture and the training routine, can focus on the relevant contributions of this paper, which concern the contrastive functionals for perception and action.
Relevant parameterization for the experiments can be found in the next section, while hyperparameters and a detailed description of each network are left to the Appendix.
5 Experiments
In this section, we compare the contrastive AIF method to likelihood-based AIF and MBRL in high-dimensional image-based settings. As the experiments are based on environments originally designed for RL, we defined ad-hoc preferred outcomes for AIF. Our experimentation aims to answer the following questions: (i) is it possible to achieve high-dimensional goals with AIF-based methods? (ii) what is the difference in performance between RL-based and AIF-based methods? (iii) does contrastive AIF perform better than likelihood-based AIF? (iv) in what contexts are contrastive methods more desirable than likelihood-based methods? (v) are AIF-based methods resilient to variations in the environment background?
We compare the following four flavors of MBRL and AIF, sharing similar model architectures and all trained according to Algorithm 1:

Dreamer: a state-of-the-art MBRL agent that maximizes rewards, using a reconstruction-based world model Hafner et al. (2020).

Contrastive Dreamer: an MBRL agent that maximizes rewards, using the contrastive representation model from the previous section.

Likelihood-AIF: the agent minimizes the AIF functionals, using observation reconstructions. The representation model from the previous section is replaced with an observation likelihood model $p(o_t \mid s_t)$, which we model as a transposed CNN. Similar approaches have been presented in Fountas et al. (2020); Millidge (2019).

Contrastive-AIF (ours): the agent minimizes the contrastive free energy functionals.
Table 2: (top) Multiply-accumulate operations and parameters of the representation model. (bottom) Wall-clock training time relative to Dreamer.

Model        MMACs   # Params
Likelihood   212.2   4485.7k
Ours          15.4   1266.7k

Method                    Time (w.r.t. Dreamer)
Contrastive Dreamer/AIF   0.84
Likelihood-AIF            3.24
In Table 2, we compare the number of parameters and of multiply-accumulate (MAC) operations required for the two flavors of the representation model in our implementation: likelihood-based and contrastive (ours). Using a contrastive representation makes the model 13.8 times more efficient in terms of MAC operations and reduces the number of parameters by a factor of 3.5.

In Table 2, we also compare the computation speed in our experiments, measuring wall-clock time and using Dreamer as a reference. Contrastive methods are on average 16% faster, while Likelihood-AIF, which in addition to Dreamer reconstructs observations for behavior learning, is 224% slower.
5.1 MiniGrid Navigation
We performed experiments on the Empty 6×6 and the Empty 8×8 environments from the MiniGrid suite Chevalier-Boisvert et al. (2018). In these tasks, the agent, represented as a red arrow, should reach the green goal square by navigating a black grid (see Figure 2(a)). The agent only sees a part of the environment, corresponding to a 7×7 grid centered on the agent (in the bottom center tile). We render observations as 64×64 pixels. For RL, a positive reward between 0 and 1 is provided as soon as the agent reaches the goal tile: the faster the agent reaches the goal, the higher the reward. For AIF agents, we defined the preferred outcome as the agent seeing itself on the green goal tile, as shown in Figure 2 (left).

For the 6×6 task, the world model is trained by sampling trajectories of fixed length from the replay buffer, while the behavior model is trained on imagined trajectories of fixed horizon. For the 8×8 task, we increased the trajectory length to 11 and the imagination horizon to 10. For both tasks, we first collected random episodes, to populate the replay buffer, and train for a fixed number of steps after collecting a new trajectory. As the action set is discrete, we optimized the action network employing REINFORCE gradients Williams (1992) with respect to the expected utility network’s estimates.
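For the discrete-action case, the REINFORCE update can be sketched as the gradient of the score-function estimator with respect to the action logits. This is an illustrative, simplified single-step version (names are ours; the advantage stands in for the expected utility network’s estimate):

```python
import numpy as np

def reinforce_grad_logits(logits, action, advantage):
    """Gradient of -advantage * log pi(action) w.r.t. the logits of a
    softmax policy, which works out to (pi - one_hot(action)) * advantage."""
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    return (probs - one_hot) * advantage
```

A positive advantage pushes probability mass toward the sampled action, and the gradient components always sum to zero, reflecting the softmax normalization constraint.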
We assess performance in terms of the rewards achieved along one trajectory, stressing that AIF methods did not have access to the reward function during training, but only to the goal observation. The results, displayed in Figure 2 (right), show the average sum of rewards obtained along training, over the number of trajectories collected. We chose to compare over the number of trajectories because the trajectories’ length depends on whether the agent completed the task or not.
In this benchmark, we see that MBRL algorithms rapidly converge to highly rewarding trajectories, in both the 6×6 and the 8×8 tasks. Likelihood-AIF struggles to converge to trajectories that reach the goal consistently and fast, mostly achieving a mean reward lower than 0.4. In contrast, our method performs comparably to the MBRL methods in the 6×6 grid and reaches the goal twice as consistently as Likelihood-AIF in the 8×8 grid, approaching Dreamer and Contrastive Dreamer’s results.
Figure 2: (left) Empty task goal image. (right) Results: shaded areas indicate standard deviation across several runs.
Utility Function Analysis. In order to understand the differences between the utility functions we experimented with, we analyze the values assigned to each tile in the 8×8 task by every method. For the AIF methods, we collected all possible transitions in the environment and used the model to compute utility values for each tile. The results are shown in Figure 3.
The reward signal for the Empty environment is very sparse and informative only once the agent reaches the goal. In contrast, AIF methods provide denser utility values. In particular, we noticed that the Likelihood-AIF model provides a very strong signal for the goal position, whereas other values are less informative of the goal. Instead, the Contrastive-AIF model seems to capture some semantic information about the environment: it assigns high values to all corners, which are conceptually closer outcomes to the goal, while also providing the steepest signal for the green corner and its neighbor tiles. As also supported by the results obtained in terms of rewards, our method provides a denser and more informative signal to reach the goal in this task.
5.2 Reacher Task
We performed continuous-control experiments on the Reacher Easy and Hard tasks from the DeepMind Control (DMC) Suite Tassa et al. (2018) and on Reacher Easy from the Distracting Control Suite Stone et al. (2021). In this task, a two-link arm should penetrate a goal sphere with its tip in order to obtain rewards, with the sphere being bigger in the Easy task and smaller in the Hard one. The Distracting Suite adds an extra layer of complexity to the environment, altering the camera angle, the arm and the goal colors, and the background. In particular, we used the ‘easy’ version of this benchmark, corresponding to smaller changes in the camera angles and in the colors, with the background chosen from one of four videos (example in Figure 3(c)).
In order to provide consistent goals for the AIF agents, we fixed the goal sphere position as shown in Figures 3(b) and 3(a). As there is no fixed background in the Distracting Suite task, we could not use a goal image with the correct background, as that would have meant changing it at every trajectory. To avoid introducing ‘external’ interventions into the AIF experiments, we decided to use a goal image with the original blue background from the DMC Suite, to test the AIF capability to generalize goals to environments having the same dynamics but different backgrounds.
For both tasks, the world model is trained by sampling trajectories of fixed length, while the behavior model is trained on imagined trajectories of fixed horizon. We first collect random episodes, to populate the replay buffer, and train for a fixed number of steps after every new trajectory. As the action set is continuous, we optimized the action network by backpropagating the expected utility value through the dynamics, using the reparameterization trick for sampling actions Hafner et al. (2020); Clavera et al. (2020).

The results are presented in Figure 5, evaluating agents in terms of the rewards obtained per trajectory, with the trajectory length fixed for all agents.
Reacher Easy/Hard.
The results on the Reacher Easy and Hard tasks show that our method was the fastest to converge to stable high rewards, with Contrastive Dreamer and Dreamer following. In particular, Dreamer’s delay in convergence should be attributed to its more complex model, which took more epochs of training than the contrastive ones to provide good imagined trajectories for planning, especially for the Hard task. Likelihood-AIF failed to converge in all runs, because of the difficulty of matching the goal state in pixel space, which differs from any other environment observation by only a small number of pixels.
Distracting Reacher Easy. On the Distracting task, we found that Dreamer failed to succeed. As we show in the Appendix, the reconstruction model’s capacity was entirely spent on reconstructing the complex backgrounds, failing to capture relevant information for the task. Conversely, Contrastive Dreamer was able to ignore the complexity of the observations and the distractions present in the environment, eventually succeeding in the task. Surprisingly, our Contrastive-AIF method also succeeded, showing generalization capabilities that are not shared by its likelihood counterpart.
We believe this result is important for two reasons: (1) it provides evidence that contrastive features better capture semantic information in the environment, potentially ignoring complex irrelevant details, (2) contrastive objectives for planning can be invariant to changes in the background, when the underlying dynamics of the task stays the same.
Utility Function Analysis. To collect further insights on the different methods’ objectives, we analyze the utility values assigned to observations with different poses in the Reacher Hard task. In Figure 6, we show a comparison where all the values are normalized in the range [0,1], considering the maximum and minimum values achievable by each method.
The reward signal is sparse and provided only when the arm penetrates the goal sphere with its orange tip. In particular, the maximum reward is obtained only when the tip is entirely contained in the sphere. The Likelihood-AIF utility looks very flat due to the static background, which causes any observation to be very similar to the preferred outcome in pixel space. Even a pose that is very different from the goal, such as the top left one, is separated only by a relatively small number of pixels from the goal one, in the bottom right corner, and this translates into very minor differences in utility values (i.e. 0.98 vs 1.00). For Contrastive-AIF, we see that the model provides higher utility values for observations that look perceptually similar to the goal and lower values for more distant states, providing a denser signal to optimize for reaching the goal. This was certainly crucial in achieving the task in this experiment, though overly-shaped utility functions can be more difficult to optimize Andrychowicz et al. (2017), and future work should analyze the consequences of such dense shaping.
6 Related Work
Contrastive Learning.
Contrastive learning methods have recently led to important breakthroughs in the unsupervised learning landscape. Techniques like MoCo Chen et al. (2020c); He et al. (2020) and SimCLR Chen et al. (2020a, b) have progressively improved performance in image recognition, by using only a few supervised labels. Contrastive learning representations have also proven successful when employed for natural language processing van den Oord et al. (2019) and model-free RL Srinivas et al. (2020).

Model-based Control. Improvements in the dynamics generative model Hafner et al. (2019) have recently allowed model-based RL methods to reach state-of-the-art performance, both in control tasks Hafner et al. (2020) and on video games Hafner et al. (2021); Kaiser et al. (2020). An important line of research focuses on correctly balancing real-world experience with data generated from the internal model of the agent Janner et al. (2019); Clavera et al. (2020).
Outcome-Driven Control. The idea of using desired outcomes to generate control objectives has been explored in RL as well Schaul et al. (2015); Ganin et al. (2018); Rudner et al. (2021). In Lynch et al. (2019), the authors propose a system that, given a desired goal, can sample plans of action from a latent space and decode them to act on the environment. DISCERN Warde-Farley et al. (2019) maximizes mutual information to the goal, using cosine similarity between the goal and a given observation, in the feature space of a CNN model.
Active Inference. In our work, we used active inference to derive actions, which is just one possibility to perform AIF, as discussed in Friston et al. (2021); Millidge et al. (2020). In other works, the expected free energy is passively used as the utility function to select the best behavior among potential sequences of actions Friston et al. (2016, 2015). Methods that combine the expressiveness of neural networks with AIF have been rising in popularity in recent years Çatal et al. (2020a). In Fountas et al. (2020), the authors propose an amortized version of Monte Carlo Tree Search, through a habit network, for planning. In Tschantz et al. (2020), AIF is shown to perform better than RL algorithms in terms of reward maximization and exploration, on small-scale tasks. In Millidge (2019), the author proposes an objective to amortize planning in a value iteration fashion.
7 Discussion
We presented the Contrastive Active Inference framework, a contrastive learning approach for active inference, that casts the free energy minimization imperatives of AIF as contrastive learning problems. We derived the contrastive objective functionals and corroborated their applicability through empirical experimentation, in both continuous and discrete action settings, with high-dimensional observations. Combining our method with models and learning routines inspired by the model-based RL scene, we found that our approach can perform comparably to models that have access to human-designed rewards. Our results show that contrastive features better capture relevant information about the dynamics of the task, which can be exploited both to find conceptually similar states to preferred outcomes and to make the agent’s preferences invariant to irrelevant changes in the environment (e.g. background, colors, camera angle).
While the possibility of matching states to outcomes in terms of similar features is rather convenient in image-based tasks, the risk is that, if the agent has never seen the desired outcome, it will converge to the semantically closest state it knows in the environment. This raises important concerns about the necessity of providing good exploratory data about the environment, in order to prevent the agent from getting stuck in local minima. For this reason, we aim to look into combining our agent with exploration-driven data collection, for zero-shot goal achievement Mazzaglia et al. (2021); Sekar et al. (2020). Another complementary line of research would be equipping our method with better experience replay mechanisms, such as HER Andrychowicz et al. (2017), to improve the generalization capabilities of the system.
Broader impact
Active inference is a biologically-plausible unifying theory for perception and action. Implementations of active inference that are both tractable and computationally cheap are important to foster further research towards potentially better theories of the human brain. By strongly reducing the computational requirements of our system, compared to other deep active inference implementations, we aim to make the study of this framework more accessible. Furthermore, our successful results on the robotic manipulator task with varying realistic backgrounds show that contrastive methods are promising for real-world applications with complex observations and distracting elements.
This research received funding from the Flemish Government (AI Research Program).
References
 [1] (2017) Hindsight experience replay. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Cited by: §5.2, §7.
 [2] (2020) Agent57: outperforming the atari human benchmark. External Links: 2003.13350 Cited by: §1.
 [3] (202007) Dopamine role in learning and action inference. eLife 9, pp. e53262. External Links: Document, Link, ISSN 2050084X Cited by: §1.
 [4] (2018) Learning and querying fast generative models for reinforcement learning. External Links: 1802.03006 Cited by: Appendix B, §4.
 [5] (2020) Learning perception and planning with deep active inference. In ICASSP 2020  2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. , pp. 3952–3956. External Links: Document Cited by: §6.
 [6] (2020) Learning generative state space models for active inference. Frontiers in Computational Neuroscience 14, pp. 103. External Links: Document, ISSN 16625188 Cited by: Appendix B, §4.
 [7] (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: §6.
 [8] (2020) Big selfsupervised models are strong semisupervised learners. arXiv preprint arXiv:2006.10029. Cited by: §6.
 [9] (2020) Improved baselines with momentum contrastive learning. External Links: 2003.04297 Cited by: §6.
 [10] (2018) Minimalistic gridworld environment for openai gym. GitHub. Note: https://github.com/maximecb/gymminigrid Cited by: §5.1.

 [11] (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014. Cited by: §4.
 [12] (2016) Faulty reward functions in the wild. Cited by: §1.
 [13] (2020) Model-augmented actor-critic: backpropagating through paths. External Links: 2005.08068 Cited by: §4, §5.2, §6.
 [14] (2020) Active inference on discrete state-spaces: a synthesis. Journal of Mathematical Psychology 99, pp. 102447. External Links: ISSN 0022-2496, Document, Link Cited by: §1.
 [15] (2020) Deep active inference agents using Monte-Carlo methods. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 11662–11675. Cited by: §1, 3rd item, §6.
 [16] (2015) Active inference and epistemic value. Cogn Neurosci 6 (4), pp. 187–214. Cited by: §3.2, §6.
 [17] (2021) Sophisticated inference. Neural Computation 33 (3), pp. 713–763. External Links: ISSN 0899-7667, Document Cited by: §6.
 [18] (2016) Active inference and learning. Neuroscience & Biobehavioral Reviews 68, pp. 862–879. External Links: ISSN 0149-7634, Document, Link Cited by: §1, §6.
 [19] (2010) Action and behavior: a free-energy formulation. Biological Cybernetics 102 (3), pp. 227–260. External Links: ISSN 1432-0770, Document, Link Cited by: §1.

 [20] (2018) Synthesizing programs for images using reinforced adversarial learning. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80. Cited by: §6.
 [21] (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Y. W. Teh and M. Titterington (Eds.), Proceedings of Machine Learning Research, Vol. 9, pp. 297–304. External Links: Link Cited by: §2.
 [22] (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 1861–1870. Cited by: §2, §4.
 [23] (2019) Learning latent dynamics for planning from pixels. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 2555–2565. Cited by: Appendix B, §4, §6.
 [24] (2021) Mastering Atari with discrete world models. External Links: 2010.02193 Cited by: §4, 1st item, §6.
 [25] (2020) Dream to control: learning behaviors by latent imagination. In ICLR. Cited by: §4, 1st item, 2nd item, §5.2, §6.

 [26] (2020) Momentum contrast for unsupervised visual representation learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13–19, 2020, pp. 9726–9735. External Links: Document Cited by: §6.
 [27] (2019) When to trust your model: model-based policy optimization. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. External Links: Link Cited by: §6.
 [28] (2020) Model-based reinforcement learning for Atari. External Links: 1903.00374 Cited by: §6.
 [29] (2014) Auto-encoding variational Bayes. External Links: 1312.6114 Cited by: §2, §3.1, §4.
 [30] (2000) Actor-critic algorithms. In Advances in Neural Information Processing Systems, pp. 1008–1014. Cited by: §4.
 [31] (2020) Specification gaming: the flip side of AI ingenuity. Cited by: §1.
 [32] (2018) Reinforcement learning and control as probabilistic inference: tutorial and review. External Links: 1805.00909 Cited by: §2, §3.2.
 [33] (2019) Learning latent plans from play. External Links: 1903.01973 Cited by: §6.
 [34] (2020) Contrastive variational model-based reinforcement learning for complex observations. In Proceedings of the 4th Conference on Robot Learning. Cited by: 2nd item.
 [35] (2021) Self-supervised exploration via latent Bayesian surprise. External Links: 2104.07495 Cited by: §7.
 [36] (2020) On the relationship between active inference and control as inference. External Links: 2006.12964 Cited by: §2, §2, §6.
 [37] (2019) Deep active inference as variational policy gradients. External Links: 1907.03876 Cited by: §1, §4, 3rd item, §6.
 [38] (2019) Solving Rubik's Cube with a robot hand. External Links: 1910.07113 Cited by: §1.
 [39] (2019) On variational bounds of mutual information. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 5171–5180. Cited by: §2.
 [40] (2017) Data-efficient deep reinforcement learning for dexterous manipulation. External Links: 1704.03073 Cited by: §1.
 [41] (2014) Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML), Vol. 32, pp. 1278–1286. Cited by: §2.
 [42] (2021) Outcome-driven reinforcement learning via variational inference. External Links: 2104.10190 Cited by: §6.
 [43] (2015) Universal value function approximators. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, pp. 1312–1320. External Links: Link Cited by: §6.
 [44] (2020) Mastering Atari, Go, chess and shogi by planning with a learned model. Nature 588 (7839), pp. 604–609. External Links: ISSN 1476-4687, Link, Document Cited by: §1.
 [45] (2018) High-dimensional continuous control using generalized advantage estimation. External Links: 1506.02438 Cited by: Appendix B, §4.
 [46] (1998) Predictive reward signal of dopamine neurons. J Neurophysiol 80 (1), pp. 1–27. Cited by: §1.
 [47] (2020) Planning to explore via selfsupervised world models. In ICML, Cited by: §7.
 [48] (2020) CURL: contrastive unsupervised representations for reinforcement learning. External Links: 2004.04136 Cited by: §6.
 [49] (2021) The distracting control suite – a challenging benchmark for reinforcement learning from pixels. External Links: 2101.02722 Cited by: §5.2.
 [50] (2018) DeepMind control suite. External Links: 1801.00690 Cited by: §5.2.
 [51] (2020) Reinforcement learning through active inference. External Links: 2002.12636 Cited by: §1, §6.
 [52] (2019) Representation learning with contrastive predictive coding. External Links: 1807.03748 Cited by: §2, §4, §6.
 [53] (2019) Unsupervised control through non-parametric discriminative rewards. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019. Cited by: §6.
 [54] (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8 (3–4), pp. 229–256. External Links: ISSN 0885-6125, Link, Document Cited by: §5.1.
Appendix A Background Derivations
In this section, we provide the derivations of the equations provided in section 2.
In all equations, both for the past and the future, we consider only a single time step. This is possible thanks to the Markov assumption, which states that the environment's properties exclusively depend on the previous time step. This makes it possible to write step-wise formulas, by applying ancestral sampling to the state dynamics up to the current time step:
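The factorized dynamics referenced above can be written in the following standard form (a reconstruction in our own notation, as the original equation was lost in extraction):

```latex
q(s_{1:t} \mid a_{1:t-1}) \;=\; \prod_{\tau=1}^{t} q(s_\tau \mid s_{\tau-1}, a_{\tau-1})
```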
To simplify and shorten the equations, we mostly omit conditioning on past states and actions. However, as shown in section 4, the transition dynamics explicitly take ancestral sampling into account, using recurrent neural networks that process multiple time steps.
a.1 Free Energy of the Past
For past observations, the objective is to build a model of the environment for perception. Since computing the exact posterior over states is intractable, we learn to approximate it with a variational distribution. As we show, this process provides an upper bound on the surprisal (negative log evidence) of the model:
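A standard reconstruction of the dropped derivation, in our notation, is:

```latex
\begin{aligned}
-\log p(o_t) &= -\log \int p(o_t, s_t)\, ds_t \\
&= -\log \int q(s_t)\,\frac{p(o_t, s_t)}{q(s_t)}\, ds_t \\
&= -\log \mathbb{E}_{q(s_t)}\!\left[\frac{p(o_t, s_t)}{q(s_t)}\right] \\
&\leq \mathbb{E}_{q(s_t)}\!\left[\log \frac{q(s_t)}{p(o_t, s_t)}\right] \doteq F_t
\end{aligned}
```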
where we applied Jensen’s inequality in the fourth row, obtaining the variational free energy (Equation 1).
The free energy of the past can be rewritten in two main ways:
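The two rewritings referenced here are standard in variational inference; in our notation (reconstructed, as the original equation was lost):

```latex
F_t \;=\; \underbrace{-\log p(o_t)}_{\text{log evidence}} + \; D_{\mathrm{KL}}\big[q(s_t)\,\|\,p(s_t \mid o_t)\big]
\;=\; \underbrace{D_{\mathrm{KL}}\big[q(s_t)\,\|\,p(s_t)\big]}_{\text{complexity}} - \underbrace{\mathbb{E}_{q(s_t)}\big[\log p(o_t \mid s_t)\big]}_{\text{accuracy}}
```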
where the first expression highlights the bound on the model's evidence, and the second shows the balance between the complexity of the state model and the accuracy of the likelihood model. From the latter, the free energy of the past (Equation 2) can be obtained by making the conditioning on previous states and actions explicit, according to the Markov assumption, and by choosing the state posterior model as the approximate variational distribution.
a.2 Free Energy of the Future
For the future, the agent selects the actions that it expects to minimize the free energy. In particular, active inference assumes that the agent's model of the future is biased towards its preferred outcomes, distributed according to a prior over preferred outcomes. Thus, we define the agent's generative model accordingly, and we aim to find the distributions of future states and actions by applying variational inference. If we consider expectations taken over trajectories sampled from the variational distribution, the expected free energy (Equation 3) becomes:
where we make conditioning on the previous state and action explicit for the sake of clarity.
We now assume that the variational state posterior model approximates the true posterior over states, as a consequence of minimizing the free energy of the past. Thus, we can rewrite the above result as:
Then, we assume that the agent's model likelihood over actions is uniform and constant:
Finally, by dropping the constant and rewriting all terms as KL divergences and entropies, we obtain:
which is the expected free energy described in Equation 4.
Appendix B Model Details
The world model, composed of the prior network, the posterior network, and the representation model, is presented in Figure 7.
The prior and the posterior networks share a GRU cell, used to remember information from the past. The prior network first combines the previous state and action using a linear layer, then processes the output with the GRU cell, and finally uses a 2-layer MLP to compute the stochastic state from the hidden state of the GRU. The posterior network also has access to the features computed by a 4-layer CNN over observations. This setup is inspired by the models presented in [23, 6, 4]. For the representation model, on the one hand, we take the features computed from the observations by the posterior's CNN, process them with a 2-layer MLP and apply a nonlinearity, obtaining the observation features. On the other hand, we take the state, process it with a 2-layer MLP and apply a nonlinearity, obtaining the state features. Finally, we compute the dot product between the two. In the world model's loss, we clip the KL divergence term below 3 free nats, to avoid posterior collapse.
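As an illustration, the dot-product representation model described above can be sketched as follows. This is a minimal NumPy sketch under our own assumptions (layer sizes, ReLU hidden layers, tanh output, and an InfoNCE-style loss with in-batch negatives), not the authors' implementation:

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    # 2-layer MLP; ReLU hidden layer and tanh output are our assumptions
    h = np.maximum(x @ W1 + b1, 0.0)
    return np.tanh(h @ W2 + b2)

def contrastive_scores(obs_feats, states, params_o, params_s):
    """Dot-product representation model: score(o, s) = g_o(o)^T g_s(s).

    obs_feats: CNN features of the observations, shape [B, D_o]
    states:    latent states, shape [B, D_s]
    """
    z_o = mlp(obs_feats, *params_o)  # observation branch
    z_s = mlp(states, *params_s)     # state branch
    return z_o @ z_s.T               # [B, B] matrix of similarity scores

def nce_loss(scores):
    """InfoNCE-style loss: matching (o, s) pairs lie on the diagonal."""
    logits = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Training the two branches with such a loss pulls matching observation/state pairs together while pushing apart the other pairs in the batch, which is the mechanism behind the contrastive features discussed in the main text.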
The behavior model is composed of the action network and the expected utility network, which are both 3-layer MLPs. In order to obtain an estimate of future utility that trades off bias and variance, we used GAE(λ) estimation [45]. In practice, this translates into approximating the infinite-horizon utility with:
where λ is a hyperparameter and H is the imagination horizon for future trajectories. Given the above definition, we can rewrite the actor network loss and the utility network loss accordingly. In the actor loss, we scale down the action entropy term, to prevent entropy maximization from taking over the rest of the objective. In order to stabilize training, when updating the actor network, we use the expected utility network and the world model from the previous epoch of training.
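As a concrete sketch, the λ-return recursion implied by GAE(λ) can be computed as below. This is our own minimal implementation of the standard backward recursion over an imagined trajectory; the function and argument names are illustrative:

```python
def lambda_returns(utilities, values, gamma=0.99, lam=0.95):
    """Approximate the infinite-horizon utility with lambda-returns.

    utilities: expected utilities r_t along an imagined trajectory, length H
    values:    value estimates v(s_{t+1}) of the successor states, length H
    """
    H = len(utilities)
    returns = [0.0] * H
    last = values[-1]  # bootstrap from the final value estimate
    for t in reversed(range(H)):
        # mix the one-step bootstrap with the longer lambda-return
        last = utilities[t] + gamma * ((1.0 - lam) * values[t] + lam * last)
        returns[t] = last
    return returns
```

With λ = 0 this reduces to one-step bootstrapping (low variance, high bias); with λ = 1 it approaches the Monte-Carlo return over the horizon (high variance, low bias), which is the trade-off mentioned above.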
Appendix C Hyperparameters
Name  Value
World Model
Latent state dimension  30
GRU cell dimension  200
Adam learning rate  
Behavior Model
γ parameter  0.99
λ parameter  0.95
Adam learning rate  
Common
Hidden layers dimension  200
Gradient clipping  100
Appendix D Experiment Details
Hardware. We ran the experiments on a TitanX GPU, with an i5-2400 CPU and 16GB of RAM.
Preferred Outcomes. For the tasks of our experiments, the preferred outcomes are 64x64x3 images (displayed in Figures 2, 3(a) and 3(b)). The corresponding distributions are defined as 64x64x3 multivariate Laplace distributions, centered on the images' pixel values. We also experimented with 64x64x3 multivariate Gaussians with unit variance, obtaining similar results.
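For illustration, the log-density of an observation under such a pixel-wise Laplace preference prior can be computed as follows. This is a minimal sketch; the unit scale is our assumption, mirroring the unit-variance Gaussian variant mentioned above:

```python
import numpy as np

def preferred_outcome_logprob(obs, goal, scale=1.0):
    """Log-density of an observation under a factorized Laplace prior
    centered on the preferred-outcome image (e.g. 64x64x3 pixel values)."""
    # sum of independent Laplace log-densities, one per pixel
    return float(np.sum(-np.abs(obs - goal) / scale - np.log(2.0 * scale)))
```

The log-density is maximal when the observation equals the goal image and decreases linearly with the pixel-wise absolute error, which is what makes the Laplace choice robust to a few mismatching pixels.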
Baselines. In section 5, we compare four flavors of model-based control: Dreamer, Contrastive Dreamer, Likelihood-AIF and Contrastive-AIF. Losses for each of these methods are provided in Table 4, adopting the following additional definitions:
where the contrastive objective is the same as in Equation 5.
Dreamer
Contrastive Dreamer
Likelihood-AIF
Contrastive-AIF
Distracting Suite Reconstructions. In the Reacher Easy experiment from the Distracting Control Suite, we found that Dreamer, a state-of-the-art algorithm on the DeepMind Control Suite, was not able to succeed. We hypothesized that this was because the world model spent most of its capacity on predicting the complex background, and was thus unable to capture relevant information about the task.
In Figure 8, we compare ground-truth observations and reconstructions from the Dreamer posterior model. As expected, we found that although the model correctly stored information about several details of the background, it missed crucial information about the arm pose. Although better world models could alleviate problems like this, we strongly believe that different representation learning approaches, like contrastive learning, provide a better solution to the issue.