Contrastive Active Inference

10/19/2021
by Pietro Mazzaglia, et al.
Ghent University

Active inference is a unifying theory for perception and action resting upon the idea that the brain maintains an internal model of the world by minimizing free energy. From a behavioral perspective, active inference agents can be seen as self-evidencing beings that act to fulfill their optimistic predictions, namely preferred outcomes or goals. In contrast, reinforcement learning requires human-designed rewards to accomplish any desired outcome. Although active inference could provide a more natural self-supervised objective for control, its applicability has been limited because of the shortcomings in scaling the approach to complex environments. In this work, we propose a contrastive objective for active inference that strongly reduces the computational burden in learning the agent's generative model and planning future actions. Our method performs notably better than likelihood-based active inference in image-based tasks, while also being computationally cheaper and easier to train. We compare to reinforcement learning agents that have access to human-designed reward functions, showing that our approach closely matches their performance. Finally, we also show that contrastive methods perform significantly better in the case of distractors in the environment and that our method is able to generalize goals to variations in the background.



1 Introduction

Deep Reinforcement Learning (RL) has led to successful results in several domains, such as robotics, video games and board games Schrittwieser et al. (2020); OpenAI et al. (2019); Badia et al. (2020). From a neuroscience perspective, the reward prediction error signal that drives learning in deep RL closely relates to the neural activity of dopamine neurons for reward-based learning Schultz (1998); Bogacz (2020). However, the reward functions used in deep RL typically require domain- and task-specific design from humans, spoiling the generalization capabilities of RL agents. Furthermore, the possibility of faulty reward functions makes the application of deep RL risky in real-world contexts, given the unexpected behaviors that may derive from it Clark and Amodei (2016); Krakovna and others (2020); Popov and others (2017).

Active Inference (AIF) has recently emerged as a unifying framework for learning perception and action. In AIF, agents operate according to one absolute imperative: minimize their free energy Friston et al. (2016). With respect to past experience, this encourages the agent to update an internal model of the world to maximize evidence with respect to sensory data. With regard to future actions, the inference process becomes ‘active’ and agents select behaviors that fulfill optimistic predictions of their model, which are represented as preferred outcomes or goals Friston et al. (2010). Compared to RL, the AIF framework provides a more natural way of encoding objectives for control. However, its applicability has been limited because of the shortcomings in scaling the approach to complex environments, and current implementations have focused on tasks with low-dimensional sensory inputs and/or small sets of discrete actions Da Costa et al. (2020). Moreover, several experiments in the literature have replaced the agent’s preferred outcomes with RL-like rewards from the environment, downplaying the AIF potential to provide self-supervised objectives Fountas et al. (2020); Millidge (2019); Tschantz et al. (2020).

One of the major shortcomings in scaling AIF to high-dimensional, e.g. image-based, environments comes from the necessity of building accurate models of the world, which try to reconstruct every detail in the sensory data. This complexity is also reflected in the control stage, when AIF agents compare future imaginary outcomes of potential actions with their goals, to select the most convenient behaviors. In particular, we advocate that fulfilling goals in image space can be poorly informative as an objective for control.

In this work, we propose Contrastive Active Inference, a framework for AIF that aims to both reduce the complexity of the agent’s internal model and to propose a more suitable objective to fulfill preferred outcomes, by exploiting contrastive learning. Our method provides a self-supervised objective that constantly informs the agent about the distance from its goal, without needing to reconstruct the outputs of potential actions in high-dimensional image space.

The contributions of our work can be summarised as follows: (i) we propose a framework for AIF that drastically reduces the computational power required both for learning the model and planning future actions, (ii) we combine our method with value iteration methods for planning, inspired by the RL literature, to amortize the cost of planning in AIF, (iii) we compare our framework to state-of-the-art RL techniques and to a non-contrastive AIF formulation, showing that our method compares well with reward-based systems and outperforms non-contrastive AIF, (iv) we show that contrastive methods work better than reconstruction-based methods in the presence of distractors in the environment, (v) we show that our contrastive objective for control allows matching desired goals, despite differences in the backgrounds. The latter finding could have important consequences for deploying AIF in real-world settings, such as robotics, where perfectly reconstructing observations from the environment and matching them with high-dimensional preferences is practically unfeasible.

2 Background

Figure 1: POMDP Graphical Model

The control setting can be formalized as a Partially Observable Markov Decision Process (POMDP), which is denoted with the tuple (S, A, T, Ω, O, γ), where S is the set of unobserved states, A is the set of actions, T is the state transition function, also referred to as the dynamics of the environment, Ω is the set of observations, O is a set of conditional observation probabilities, and γ is a discount factor (Figure 1). We use the terms observations and outcomes interchangeably throughout the work. In RL, the agent also has access to a reward function r, mapping state-action pairs to rewards.
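For concreteness, the tuple above can be mirrored as a small container type. The following is a minimal sketch of ours; the `POMDP` class, its field names, and the toy dynamics are illustrative and not part of the paper:

```python
from typing import NamedTuple, Callable, Sequence

class POMDP(NamedTuple):
    """Container mirroring the tuple (S, A, T, Omega, O, gamma) described above."""
    states: Sequence            # S: unobserved states
    actions: Sequence           # A: actions
    transition: Callable        # T: dynamics, mapping (s, a) to the next state
    observations: Sequence      # Omega: observations
    observation_fn: Callable    # O: conditional observation probabilities, o given s
    gamma: float                # discount factor

# toy instance: two states, one action, identity dynamics and observation map
model = POMDP(states=[0, 1], actions=[0],
              transition=lambda s, a: s,
              observations=[0, 1],
              observation_fn=lambda s: s,
              gamma=0.99)
```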

Active Inference. In AIF, the goal of the agent is to minimize (a variational bound on) the surprisal over observations, −log p(o). With respect to past observations, the upper bound leads to the variational free energy F, which for timestep t is:

F_t = E_{q(s_t)}[log q(s_t) − log p(o_t, s_t | s_{t−1}, a_{t−1})]    (1)

where q(s_t) represents an approximate posterior.

The agent hence builds a generative model over states, actions and observations, by defining a state transition function p(s_t | s_{t−1}, a_{t−1}) and a likelihood mapping p(o_t | s_t), while the posterior distribution over states is approximated by the variational distribution q(s_t). The free energy can then be decomposed as:

F_t = D_KL[q(s_t) || p(s_t | s_{t−1}, a_{t−1})] − E_{q(s_t)}[log p(o_t | s_t)]    (2)

This implies that minimizing variational free energy, on the one hand, maximizes the likelihood of observations under the likelihood mapping (i.e. maximizing accuracy), whilst minimizing the KL divergence between the approximate posterior and prior (i.e. minimizing complexity). Note that for the past we assume that outcomes and actions are observed, hence inferences are made only about the states. Also note that the variational free energy is defined as the negative evidence lower bound as known from the variational autoencoder framework Rezende et al. (2014); Kingma and Welling (2014).
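The equivalence between the joint form of the free energy (Equation 1) and its accuracy/complexity decomposition (Equation 2) can be checked numerically. A minimal sketch with one-dimensional Gaussians; all distribution parameters and the observation value are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mu, sigma):
    # log-density of a univariate Gaussian
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

# approximate posterior q(s) = N(0,1), prior p(s|...) = N(1,1),
# likelihood p(o|s) = N(s,1), one observed outcome o = 0.5
s = rng.normal(0.0, 1.0, size=200_000)   # samples from q
o = 0.5

# Eq. (1): F = E_q[log q(s) - log p(o, s)]
F_joint = np.mean(log_normal(s, 0, 1)
                  - (log_normal(s, 1, 1) + log_normal(o, s, 1)))

# Eq. (2): F = KL[q || p] - E_q[log p(o|s)]; KL is analytic for equal variances
kl_analytic = 0.5 * (0.0 - 1.0)**2
F_decomposed = kl_analytic - np.mean(log_normal(o, s, 1))

print(abs(F_joint - F_decomposed))  # small Monte Carlo discrepancy
```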

For future timesteps, the agent has to make inferences about both future states s_t and actions a_t, while taking into account expectations over future observations. Crucially, in active inference the agent has a prior distribution p̃(o_t) on preferred outcomes it expects to obtain. Action selection is then cast as an inference problem, i.e. inferring actions that will yield preferred outcomes, or more formally that minimize the expected free energy G:

G_t = E_{q̃(o_t, s_t, a_t)}[log q(s_t, a_t) − log p̃(o_t, s_t, a_t)]    (3)

where p̃(o_t, s_t, a_t) is the agent’s biased generative model, and the expectation is over predicted observations, states and actions q̃(o_t, s_t, a_t).

If we assume the variational posterior over states is a good approximation of the true posterior, i.e. q(s_t | o_t) ≈ p(s_t | o_t), and we also consider a uniform prior over actions Millidge et al. (2020), the expected free energy can be formulated as:

G_t = − E_q̃[log q(s_t | o_t) − log q(s_t)] − E_q̃[log p̃(o_t)] − H[q(a_t)]    (4)

Intuitively, this means that the agent will infer actions for which observations have a high information gain about the states (i.e. intrinsic value), which will yield preferred outcomes (i.e. extrinsic value), while also keeping its possible actions as varied as possible (i.e. action entropy).

Full derivations of the equations in this section are provided in the Appendix.

Reinforcement Learning. In RL, the objective of the agent is to maximize the discounted sum of rewards, or return, Σ_t γ^t r_t, over time. Deep RL can also be cast as probabilistic inference, by introducing an optimality variable O_t, which denotes whether the time step t is optimal Levine (2018). The distribution over the optimality variable is defined in terms of rewards as p(O_t = 1 | s_t, a_t) = exp(r(s_t, a_t)). Inference is then obtained by optimizing the following variational lower bound:

log p(O_{1:T} = 1) ≥ E_q[Σ_t r(s_t, a_t) − log q(a_t | s_t)]    (5)

where the reward-maximizing RL objective is augmented with an action entropy term, as in maximum entropy control Haarnoja et al. (2018). As also highlighted in Millidge et al. (2020), if we assume log p̃(o_t) = r(o_t), we can see that RL works like AIF, but encoding optimality value in the likelihood rather than in the prior.
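For a single timestep and a categorical policy, the bound in Equation 5 reduces to expected reward plus action entropy, which can be verified directly. A small sketch; the policy probabilities and per-action rewards are toy values of ours:

```python
import numpy as np

# one-step instance of Eq. (5): a categorical policy over two actions
pi = np.array([0.7, 0.3])   # q(a|s)
r = np.array([1.0, 0.0])    # per-action reward; p(O=1|s,a) = exp(r(s,a))

# variational bound: E_q[r(s,a) - log q(a|s)]
bound = np.sum(pi * (r - np.log(pi)))

# the same quantity, split as expected reward + action entropy
expected_reward = np.sum(pi * r)
entropy = -np.sum(pi * np.log(pi))

print(bound, expected_reward + entropy)  # identical by construction
```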

In order to improve the sample-efficiency of RL, model-based approaches (MBRL), in which the agent relies on an internal model of the environment to plan high-rewarding actions, have been studied.

Contrastive Learning. Contrastive representations, which aim to organize the data by distinguishing similar and dissimilar pairs, can be learned through Noise Contrastive Estimation (NCE) Gutmann and Hyvärinen (2010). Following Poole et al. (2019), an NCE loss can be defined as a lower bound on the mutual information between two variables. Given two random variables X and Y, the NCE lower bound is:

I(X; Y) ≥ E[1/K Σ_{i=1}^K log (e^{f(x_i, y_i)} / (1/K Σ_{j=1}^K e^{f(x_i, y_j)}))] =: I_NCE    (6)

where the expectation is over K independent samples from the joint distribution Π_i p(x_i, y_i), and f(x, y) is a function, called critic, that approximates the density ratio p(y | x)/p(y). Crucially, the critic can be unbounded, as in van den Oord et al. (2019), where the authors showed that an inner product of transformed samples from X and Y, namely f(x, y) = h(x)^T g(y), with h and g learned functions, works well as a critic.
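The NCE bound of Equation 6 is easy to implement with an inner-product critic. The sketch below uses identity "encoders" on a toy correlated pair, which is our own illustration; a known property of the estimator, that it can never exceed log K, serves as a sanity check:

```python
import numpy as np

rng = np.random.default_rng(1)

def info_nce(fx, gy):
    """I_NCE with an inner-product critic f(x, y) = h(x)^T g(y) (Eq. 6).
    fx, gy: (K, d) arrays of transformed samples; row i of each is a joint pair."""
    scores = fx @ gy.T  # scores[i, j] = f(x_i, y_j)
    # log of (e^{f(x_i,y_i)} / mean_j e^{f(x_i,y_j)}) for each row i
    log_ratio = np.diag(scores) - (np.logaddexp.reduce(scores, axis=1)
                                   - np.log(len(fx)))
    return log_ratio.mean()

# toy correlated variables: y is a noisy copy of x, encoders are the identity
K = 128
x = rng.normal(size=(K, 4))
y = x + 0.1 * rng.normal(size=(K, 4))
bound = info_nce(x, y)
```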

3 Contrastive Active Inference

In this section, we present the Contrastive Active Inference framework, which reformulates the problem of optimizing the free energy of the past and the expected free energy of the future as contrastive learning problems.

3.1 Contrastive Free Energy of the Past

In order to learn a generative model of the environment following AIF, an agent could minimize the variational free energy from Equation 2. For high-dimensional signals, such as pixel-based images, the model works similarly to a Variational AutoEncoder (VAE) Kingma and Welling (2014), with the information encoded in the latent state being used to produce reconstructions of the high-dimensional observations through the likelihood model.

However, reconstructing images at pixel level has several shortfalls: (a) it requires models with high capacity, (b) it can be quite computationally expensive, and (c) there is the risk that most of the representation capacity is wasted on complex details of the images that are irrelevant for the task.

We can avoid predicting observations by using an NCE loss. Optimizing the mutual information between states and observations makes it possible to infer s_t from o_t, without having to compute a reconstruction. In order to turn the variational free energy loss into a contrastive loss, we add the constant marginal log-probability of the data, log p(o_t), to F_t, obtaining:

F_t + log p(o_t) = D_KL[q(s_t) || p(s_t | s_{t−1}, a_{t−1})] − E_{q(s_t)}[log p(o_t | s_t) − log p(o_t)]    (7)

As for Equation 6, we can apply a lower bound on the mutual information between states and observations. We can define the contrastive free energy of the past as:

F^NCE_t = D_KL[q(s_t) || q(s_t | s_{t−1}, a_{t−1})] − E[f(o_t, s_t) − log 1/K Σ_{j=1}^K e^{f(o_j, s_t)}]    (8)

where the dynamics is modelled as q(s_t | s_{t−1}, a_{t−1}), and the samples o_j from the marginal distribution p(o) represent observations that do not match the state s_t, catalyzing the contrastive mechanism. Given the inequality I_NCE ≤ I, this contrastive utility provides an upper bound on the variational free energy, F^NCE_t ≥ F_t + log p(o_t), and thus on surprisal.
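The contrastive free energy of Equation 8 combines a contrastive accuracy term with the usual KL complexity term. A minimal numpy sketch, assuming the encoders producing the embeddings are given; all shapes, names, and distribution parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def log_mean_exp(v):
    m = v.max()
    return m + np.log(np.mean(np.exp(v - m)))

def kl_diag_gauss(mu_q, sig_q, mu_p, sig_p):
    # KL[q || p] between diagonal Gaussians: the complexity term of Eq. (8)
    return np.sum(np.log(sig_p / sig_q)
                  + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2) - 0.5)

def contrastive_free_energy(h_o, g_s, h_neg, mu_q, sig_q, mu_p, sig_p):
    """Sketch of Eq. (8): h_o, g_s are embedded observation/state for the
    positive pair; h_neg are embedded negative observations."""
    pos = h_o @ g_s                               # f(o_t, s_t)
    cand = np.concatenate([[pos], h_neg @ g_s])   # candidate set incl. positive
    return -(pos - log_mean_exp(cand)) + kl_diag_gauss(mu_q, sig_q, mu_p, sig_p)

d = 8
h_o, g_s = rng.normal(size=d), rng.normal(size=d)
h_neg = rng.normal(size=(16, d))   # 16 non-matching observations
F = contrastive_free_energy(h_o, g_s, h_neg,
                            mu_q=np.zeros(d), sig_q=np.ones(d),
                            mu_p=np.zeros(d), sig_p=np.ones(d))
```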

3.2 Contrastive Free Energy of the Future

Performing active inference for action selection means inferring actions that realize preferred outcomes, by minimizing the expected free energy . In order to assess how likely expected future outcomes are to fulfill the agent’s preferences, in Equation 4, the agent uses its generative model to predict future observations.

Reconstructing imaginary observations in the future can be computationally expensive. Furthermore, matching imagined outcomes with the agent’s preferences in pixel space can be poorly informative, as pixels are not supposed to capture any semantics about observations. Also, observations that are “far” in pixel space aren’t necessarily far in transition space. For example, when the goal is behind a door, standing before the door is “far” in pixel space but only one action away (i.e. opening the door).

When the agent learns a contrastive model of the world, following Equation 8, it can exploit its ability to match observations with states without reconstructions, in order to search for the states that correspond to its preferences. Hence, we formulate the expectation in the expected free energy in terms of the preferred outcomes, so that we can add the constant marginal log p̃(o_t), obtaining:

G_t + log p̃(o_t) = − I(s_t; õ_t) − H[q(a_t)]    (9)

With abuse of notation, the mutual information between s_t and õ_t quantifies the amount of information shared between future imaginary states and preferred outcomes.

We further assume q(s_t | s_{t−1}, a_{t−1}) = p(s_t | s_{t−1}, a_{t−1}), which constrains the agent to only modify its actions, preventing it from changing the dynamics of the world to accomplish its goal, as pointed out in Levine (2018). This leads to the following objective for the contrastive free energy of the future:

G^NCE_t = − E[f(õ_t, s_t) − log 1/K Σ_{j=1}^K e^{f(o_j, s_t)}] − H[q(a_t)]    (10)

Similarly to F^NCE_t, the samples o_j from p(o) foster the contrastive mechanism, ensuring that the state s_t corresponds to the preferred outcomes, while also being as distinguishable as possible from other observations. This component implies a process similar to the ambiguity minimization aspect typically associated with the AIF framework Friston et al. (2015).
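As a planning signal, the contrastive term of Equation 10 can score imagined future states against the embedded preferred outcome. The sketch below (our own illustration, omitting the action-entropy term and assuming the encoders are given) constructs one imagined state that aligns with the goal and checks that it receives the lowest free energy:

```python
import numpy as np

rng = np.random.default_rng(3)

def expected_utility(g_states, h_goal, h_neg):
    """Contrastive utility for each imagined state embedding in g_states,
    scored against the embedded preferred outcome h_goal and contrasted
    with negative observation embeddings h_neg. Lower is better."""
    utilities = []
    for g_s in g_states:
        pos = h_goal @ g_s                            # f(o_goal, s)
        cand = np.concatenate([[pos], h_neg @ g_s])   # positive + negatives
        m = cand.max()
        lme = m + np.log(np.mean(np.exp(cand - m)))   # log-mean-exp
        utilities.append(-(pos - lme))
    return np.array(utilities)

# three imagined future states; the second is built to align with the goal
h_goal = np.array([1.0, 0.0, 0.0, 0.0])
g_states = np.stack([np.array([0.0, 1.0, -1.0, 0.5]),
                     5.0 * h_goal,                    # matches the goal direction
                     np.array([-0.2, 0.3, 0.1, 0.0])])
h_neg = rng.normal(size=(8, 4))
h_neg[:, 0] = 0.0    # toy negatives chosen orthogonal to the goal direction
u = expected_utility(g_states, h_goal, h_neg)
best = int(np.argmin(u))   # index of the preferred imagined state
```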

4 Model and Algorithm

The AIF framework entails perception and action, in a unified view. In practice, this is translated into learning a world model, to capture the underlying dynamics of the environment, minimizing the free energy of the past, and learning a behavior model, which proposes actions to accomplish the agent’s preferences, minimizing the free energy of the future. In this work, we exploit the high expressiveness of deep neural networks to learn the world and the behavior models.

The world model is composed of the following components:

  • Prior network: q(s_t | s_{t−1}, a_{t−1})
  • Posterior network: q(s_t | s_{t−1}, a_{t−1}, o_t)
  • Representation model: f(o_t, s_t)

For the prior network, we use a GRU Chung et al. (2014), while the posterior network combines a GRU with a CNN to process observations. Both the prior and the posterior outputs are used to parameterize multivariate Gaussian distributions, which represent a stochastic state, from which we sample using the reparameterization trick Kingma and Welling (2014). This setup is inspired by the models presented in Hafner et al. (2019); Çatal et al. (2020b); Buesing et al. (2018). For the representation model, we utilize a network that first processes o_t and s_t with MLPs and then computes the dot product between the outputs, obtaining f(o_t, s_t) = h(o_t)^T g(s_t), analogously to van den Oord et al. (2019). We indicate the unified world model loss with L_model.

In order to amortize the cost of long-term planning for behavior learning, we use an expected utility function U(s_t) to estimate the expected free energy in the future for the state s_t, similarly to Millidge (2019). The behavior model is then composed of the following components:

  • Action network: q(a_t | s_t)
  • Expected utility network: U(s_t)

where the action and expected utility networks are both MLPs that are concurrently trained as in actor-critic architectures for RL Konda and Tsitsiklis (2000); Haarnoja et al. (2018). The action network aims to minimize the expected utility, which is an estimate of the expected free energy of the future over a potentially infinite horizon, while the utility network aims to predict a good estimate of the expected free energy of the future that is obtainable by following the actions of the action network. We indicate the action network loss with L_action and the utility network loss with L_utility, where the sum from the current time step to an infinite horizon is obtained by using a TD(λ) exponentially-weighted estimator that trades off bias and variance Schulman et al. (2018) (details in Appendix).
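A TD(λ) estimator of this kind can be computed with a simple backward recursion. A sketch under our own assumptions (generic per-step utilities and value predictions; the hyperparameter values and the bootstrapping convention at the horizon are illustrative):

```python
import numpy as np

def lambda_returns(utilities, values, gamma=0.9, lam=0.8):
    """Exponentially-weighted TD(lambda) targets, computed backwards:
    G_t = u_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}),
    with the return beyond the horizon bootstrapped by V."""
    H = len(utilities)
    G = np.zeros(H)
    last = values[H]                  # bootstrap value past the horizon
    for t in reversed(range(H)):
        G[t] = utilities[t] + gamma * ((1 - lam) * values[t + 1] + lam * last)
        last = G[t]
    return G

u = np.array([1.0, 1.0])        # per-step estimates along an imagined trajectory
v = np.array([0.5, 0.5, 0.5])   # value-network predictions V(s_t), length H + 1
G = lambda_returns(u, v)
```

With λ = 0 this reduces to one-step TD targets; with λ = 1 it becomes a Monte Carlo return bootstrapped only at the horizon, which is the bias/variance trade-off mentioned above.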

The training routine, which alternates updates to the models with data collection, is shown in Algorithm 1. At each training iteration of the model, we sample B trajectories of length L from the replay buffer D. Negative samples for the contrastive functionals are selected, for each state, by taking intra-episode negatives, corresponding to temporally different observations, and extra-episode negatives, corresponding to observations from different episodes.
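The intra-/extra-episode negative-sampling scheme can be sketched as follows; the function name, argument names, and counts are our own illustration, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_negatives(buffer, ep, t, n_intra=2, n_extra=2):
    """For the positive observation buffer[ep][t], draw temporally different
    observations from the same episode (intra-episode negatives) and
    observations from other episodes (extra-episode negatives)."""
    T = len(buffer[ep])
    intra_idx = rng.choice([i for i in range(T) if i != t],
                           size=n_intra, replace=False)
    intra = [buffer[ep][i] for i in intra_idx]
    extra = []
    other_eps = [e for e in range(len(buffer)) if e != ep]
    for e in rng.choice(other_eps, size=n_extra, replace=True):
        extra.append(buffer[e][rng.integers(len(buffer[e]))])
    return intra + extra

# toy replay buffer: 3 episodes of 5 "observations" (tuples stand in for images)
buffer = [[(e, t) for t in range(5)] for e in range(3)]
negs = sample_negatives(buffer, ep=0, t=2)
```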

Most of the above choices, along with the training routine itself, are deliberately inspired by current state-of-the-art approaches for MBRL Hafner et al. (2020, 2021); Clavera et al. (2020). The motivation behind this is twofold: on the one hand, we want to show that approaches that have been used to scale RL for complex planning can also straightforwardly be applied for scaling AIF. On the other hand, in the next section, we offer a direct comparison to current state-of-the-art techniques for RL that, being unbiased with respect to the models’ architecture and the training routine, can focus on the relevant contributions of this paper, which concern the contrastive functionals for perception and action.

Relevant parameterization for the experiments can be found in the next section, while hyperparameters and a detailed description of each network are left to the Appendix.

1   Initialize world model parameters and behavior model parameters. Initialize dataset D with R random-action episodes.
2   while not done do
3       for update step u = 1..U do
4           Sample B trajectories of length L from D
5           Infer states using the world model
6           Update the world model parameters on the B trajectories, minimizing L_model
7           Imagine I trajectories of length H from each inferred state
8           Update the action network parameters on the I trajectories, minimizing L_action
9           Update the expected utility network parameters on the I trajectories, minimizing L_utility
10      end for
11      Reset the environment, infer the initial state, set t = 0, and init a new trajectory with the first observation
12      while environment not done do
13          Infer action a_t using the action network
14          Act on the environment with a_t, and receive observation o_{t+1}
15          Add the transition to the buffer D, and set t = t + 1
16          Infer state s_{t+1} using the world model
17      end while
18  end while
Algorithm 1 Training and Data Collection

5 Experiments

In this section, we compare the contrastive AIF method to likelihood-based AIF and MBRL in high-dimensional image-based settings. As the experiments are based on environments originally designed for RL, we defined ad-hoc preferred outcomes for AIF. Our experimentation aims to answer the following questions: (i) is it possible to achieve high-dimensional goals with AIF-based methods? (ii) what is the difference in performance between RL-based and AIF-based methods? (iii) does contrastive AIF perform better than likelihood-based AIF? (iv) in what contexts are contrastive methods more desirable than likelihood-based methods? (v) are AIF-based methods resilient to variations in the environment background?

We compare the following four flavors of MBRL and AIF, sharing similar model architectures and all trained according to Algorithm 1:

  • Dreamer: the agent builds a world model able to reconstruct both observations and rewards from the state. Reconstructed rewards for imagined trajectories are then used to optimize the behavior model in an MBRL fashion Hafner et al. (2020, 2021).

  • Contrastive Dreamer: this method is analogous to its reconstruction-based counterpart, except that it uses a contrastive representation model, like our approach. Similar methods have been studied in Hafner et al. (2020); Ma et al. (2020).

  • Likelihood-AIF: the agent minimizes the AIF functionals, using observation reconstructions. The representation model from the previous section is replaced with an observation likelihood model p(o_t | s_t), which we model as a transposed CNN. Similar approaches have been presented in Fountas et al. (2020); Millidge (2019).

  • Contrastive-AIF (ours): the agent minimizes the contrastive free energy functionals.

Table 1: Computational Efficiency

             MMACs    # Params
Likelihood   212.2    4485.7k
Ours          15.4    1266.7k

Table 2: Computation Time (w.r.t. Dreamer)

Contrastive Dreamer/AIF    0.84
Likelihood-AIF             3.24

In Table 1, we compare the number of parameters and of multiply-accumulate (MAC) operations required for the two flavors of the representation model in our implementation: likelihood-based and contrastive (ours). Using a contrastive representation makes the model 13.8 times more efficient in terms of MAC operations and reduces the number of parameters by a factor of 3.5.

In Table 2, we compare the computation speed in our experiments, measuring wall-clock time and using Dreamer as a reference. Contrastive methods are on average 16% faster, while Likelihood-AIF, which in addition to Dreamer reconstructs observations for behavior learning, is 224% slower.

5.1 MiniGrid Navigation

We performed experiments on the Empty 6×6 and the Empty 8×8 environments from the MiniGrid suite Chevalier-Boisvert et al. (2018). In these tasks, the agent, represented as a red arrow, should reach the green goal square by navigating a black grid (see Figure 2(a)). The agent only sees a part of the environment, corresponding to a 7×7 grid centered on the agent (in the bottom center tile). We render observations as 64×64 pixels. For RL, a positive reward between 0 and 1 is provided as soon as the agent reaches the goal tile: the faster the agent reaches the goal, the higher the reward. For AIF agents, we defined the preferred outcome as the agent seeing itself on the green goal tile, as shown in Figure 2 (left).

For the 6×6 task, the world model is trained by sampling trajectories of length L, while the behavior model is trained by imagining trajectories of H steps. For the 8×8 task, we increased the trajectory length to 11 and the imagination horizon to 10. For both tasks, we first collected R random episodes, to populate the replay buffer, and train for U steps after collecting a new trajectory. As the action set is discrete, we optimized the action network employing REINFORCE gradients Williams (1992) with respect to the expected utility network’s estimates.
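For a single-state softmax policy, the REINFORCE (score-function) estimator can be checked against the analytic policy gradient. A minimal sketch of ours, with toy logits and a reward vector standing in for the utility estimates:

```python
import numpy as np

rng = np.random.default_rng(5)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

logits = np.array([0.2, -0.1, 0.0])
r = np.array([1.0, 0.0, 0.5])   # per-action return signal (toy values)
pi = softmax(logits)

# analytic gradient of J = sum_a pi(a) r(a) with respect to the logits
analytic = pi * (r - pi @ r)

# Monte Carlo REINFORCE estimate: E_a~pi[grad log pi(a) * r(a)],
# where grad log pi(a) = onehot(a) - pi for a softmax policy
a = rng.choice(3, size=200_000, p=pi)
onehot = np.eye(3)[a]
grads = (onehot - pi) * r[a][:, None]
estimate = grads.mean(axis=0)

print(np.max(np.abs(estimate - analytic)))  # small Monte Carlo error
```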

We assess performance in terms of the rewards achieved along one trajectory, stressing that AIF methods did not have access to the reward function but only to the goal observation, during training. The results, displayed in Figure 2 (right), show the average sum of rewards obtained along training, over the number of trajectories collected. We chose to compare over the number of trajectories as the trajectories’ length depends on whether the agent completed the task or not.

In this benchmark, we see that MBRL algorithms rapidly converge to highly rewarding trajectories in both the 6×6 and the 8×8 tasks. Likelihood-AIF struggles to converge to trajectories that reach the goal consistently and quickly, mostly achieving a mean reward lower than 0.4. In contrast, our method performs comparably to the MBRL methods in the 6×6 grid and reaches the goal twice as consistently as Likelihood-AIF in the 8×8 grid, approaching Dreamer and Contrastive Dreamer’s results.

Figure 2: MiniGrid Experiments.

(left) Empty task goal image. (right) Results: shaded areas indicate standard deviation across several runs.

Figure 3: Utility Values MiniGrid. (a) Grid Task, (b) Rewards, (c) AIF Model, (d) NCE Model (ours). Darker tiles correspond to higher utility values (b-c-d).

Utility Function Analysis. In order to understand the differences between the utility functions we experimented with, we analyze the values assigned to each tile in the 8×8 task by every method. For the AIF methods, we collected all possible transitions in the environment and used the model to compute utility values for each tile. The results are shown in Figure 3.

The reward signal for the Empty environment is very sparse and informative only once the agent reaches the goal. In contrast, AIF methods provide denser utility values. In particular, we noticed that the Likelihood-AIF model provides a very strong signal for the goal position, whereas other values are less informative of the goal. Instead, the Contrastive-AIF model seems to capture some semantic information about the environment: it assigns high values to all corners, which are conceptually closer outcomes to the goal, while also providing the steepest signal for the green corner and its neighbor tiles. As also supported by the results obtained in terms of rewards, our method provides a denser and more informative signal to reach the goal in this task.

5.2 Reacher Task

We performed continuous-control experiments on the Reacher Easy and Hard tasks from the DeepMind Control (DMC) Suite Tassa et al. (2018) and on Reacher Easy from the Distracting Control Suite Stone et al. (2021). In these tasks, a two-link arm should penetrate a goal sphere with its tip in order to obtain rewards, with the sphere being bigger in the Easy task and smaller in the Hard one. The Distracting Suite adds an extra layer of complexity to the environment, altering the camera angle, the arm and goal colors, and the background. In particular, we used the ‘easy’ version of this benchmark, corresponding to smaller changes in the camera angles and in the colors, with the background chosen from one of four videos (example in Figure 4(c)).

In order to provide consistent goals for the AIF agents, we fixed the goal sphere position as shown in Figures 4(a) and 4(b). As there is no fixed background in the Distracting Suite task, we could not use a goal image with the correct background, as that would have meant changing it at every trajectory. To avoid introducing ‘external’ interventions into the AIF experiments, we decided to use a goal image with the original blue background from the DMC Suite, testing the AIF capability to generalize goals to environments having the same dynamics but different backgrounds.

For both tasks, the world model is trained by sampling trajectories of length L, while the behavior model is trained by imagining trajectories of H steps. We first collect R random episodes, to populate the replay buffer, and train for U steps after every new trajectory. As the action space is continuous, we optimized the action network by backpropagating the expected utility value through the dynamics, using the reparameterization trick for sampling actions Hafner et al. (2020); Clavera et al. (2020).
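The reparameterization trick referenced above expresses a sampled action as a deterministic function of the policy parameters and an independent noise variable, so gradients can flow through the sample. A small sketch of ours; the tanh squashing and the parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)

# reparameterized continuous action: a = tanh(mu + sigma * eps), eps ~ N(0, 1)
mu, sigma = 0.3, 0.5
eps = rng.normal(size=100_000)
actions = np.tanh(mu + sigma * eps)   # squashed into the valid action range

# with the SAME eps we can probe how actions respond to a shift in mu,
# which is exactly the gradient path that REINFORCE does not provide
actions_shifted = np.tanh((mu + 0.01) + sigma * eps)
sensitivity = (actions_shifted - actions).mean() / 0.01
```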

The results are presented in Figure 5, evaluating agents in terms of the rewards obtained per trajectory. The length of a trajectory is fixed to 1,000 steps.

(a) Reacher Easy Goal
(b) Reacher Hard Goal
(c) Distracting Reacher Easy
Figure 4: Continuous tasks setup. Note that the Reacher Easy Goal is also used for the Distracting Reacher Easy task, without changing the goal’s background.
Figure 5: Reacher Results. Shaded areas indicate standard deviation across several runs.
Figure 6: Utility Values Reacher. Normalized utility values for multiple poses in Reacher Hard.

Reacher Easy/Hard.

The results on the Reacher Easy and Hard tasks show that our method was the fastest to converge to stable high rewards, with Contrastive Dreamer and Dreamer following. In particular, Dreamer’s delay in convergence should be attributed to its more complex model, which took more epochs of training than the contrastive ones to provide good imagined trajectories for planning, especially in the Hard task. Likelihood-AIF failed to converge in all runs, because of the difficulty of matching the goal state in pixel space, which differs by only a small number of pixels from any other environment observation.

Distracting Reacher Easy. On the Distracting task, we found that Dreamer failed. As we show in the Appendix, the reconstruction model’s capacity was entirely spent on reconstructing the complex backgrounds, failing to capture information relevant to the task. Conversely, Contrastive Dreamer was able to ignore the complexity of the observations and the distractions present in the environment, eventually succeeding in the task. Surprisingly, our Contrastive-AIF method also succeeded, showing generalization capabilities that its likelihood counterpart does not share.

We believe this result is important for two reasons: (1) it provides evidence that contrastive features better capture semantic information in the environment, potentially ignoring complex irrelevant details, (2) contrastive objectives for planning can be invariant to changes in the background, when the underlying dynamics of the task stays the same.

Utility Function Analysis. To collect further insights on the different methods’ objectives, we analyze the utility values assigned to observations with different poses in the Reacher Hard task. In Figure 6, we show a comparison where all the values are normalized in the range [0,1], considering the maximum and minimum values achievable by each method.

The reward signal is sparse and provided only when the arm penetrates the goal sphere with its orange tip. In particular, a reward of 1 is obtained only when the tip is entirely contained in the sphere. The Likelihood-AIF utility looks very flat, due to the static background, which causes any observation to be very similar to the preferred outcome in pixel space. Even a pose that is very different from the goal, such as the top left one, is separated by only a relatively small number of pixels from the goal pose in the bottom right corner, and this translates into very minor differences in utility values (i.e. 0.98 vs 1.00). For Contrastive-AIF, we see that the model provides higher utility values for observations that look perceptually similar to the goal and lower values for more distant states, providing a denser signal to optimize for reaching the goal. This was certainly crucial to achieving the task in this experiment, though overly-shaped utility functions can be more difficult to optimize Andrychowicz et al. (2017), and future work should analyze the consequences of such dense shaping.

6 Related Work

Contrastive Learning.

Contrastive learning methods have recently led to important breakthroughs in the unsupervised learning landscape. Techniques like MoCo Chen et al. (2020c); He et al. (2020) and SimCLR Chen et al. (2020a, b) have progressively improved performance in image recognition, using only a few supervised labels. Contrastive representations have also proven successful when employed for natural language processing van den Oord et al. (2019) and model-free RL Srinivas et al. (2020).
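At the core of these techniques is an InfoNCE-style objective, which classifies the matching pair among a batch of negatives. A minimal NumPy sketch, assuming paired embedding batches as input (all names are ours):

```python
import numpy as np

def info_nce_loss(queries, keys, temperature=0.1):
    """InfoNCE: each query's positive key is the one at the same index;
    all other keys in the batch act as negatives."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature                # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # cross-entropy on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_matched = info_nce_loss(z, z)                       # positives identical
loss_random = info_nce_loss(z, rng.normal(size=(8, 16))) # positives unrelated
print(loss_matched, loss_random)
```

When the paired embeddings agree, the diagonal dominates the softmax and the loss approaches zero; unrelated pairs yield a loss close to the log of the batch size.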

Model-based Control. Improvements in the dynamics generative model Hafner et al. (2019) have recently allowed model-based RL methods to reach state-of-the-art performance, both in control tasks Hafner et al. (2020) and in video games Hafner et al. (2021); Kaiser et al. (2020). An important line of research focuses on correctly balancing real-world experience with data generated from the internal model of the agent Janner et al. (2019); Clavera et al. (2020).

Outcome-Driven Control. The idea of using desired outcomes to generate control objectives has been explored in RL as well Schaul et al. (2015); Ganin et al. (2018); Rudner et al. (2021). In Lynch et al. (2019), the authors propose a system that, given a desired goal, can sample plans of action from a latent space and decode them to act on the environment. DISCERN Warde-Farley et al. (2019) maximizes mutual information with the goal, using cosine similarity between the goal and a given observation in the feature space of a CNN model.
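A DISCERN-style utility of this kind can be sketched as a cosine similarity in feature space (the embedding inputs are assumed to be precomputed by a CNN; names are ours):

```python
import numpy as np

def cosine_utility(obs_features, goal_features):
    """Reward proportional to the cosine similarity between the embedding
    of the current observation and the embedding of the goal image."""
    o = obs_features / np.linalg.norm(obs_features)
    g = goal_features / np.linalg.norm(goal_features)
    return float(o @ g)

goal = np.array([1.0, 0.0, 1.0])
print(cosine_utility(goal, goal))                       # identical embeddings
print(cosine_utility(np.array([0.0, 1.0, 0.0]), goal))  # orthogonal embeddings
```

Because the similarity is computed in feature space rather than pixel space, perceptually similar states receive high utility even when many pixels differ.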

Active Inference. In our work, we used active inference to derive actions, which is just one way to perform AIF, as discussed in Friston et al. (2021); Millidge et al. (2020). In other works, the expected free energy is passively used as the utility function to select the best behavior among candidate sequences of actions Friston et al. (2016, 2015). Methods that combine the expressiveness of neural networks with AIF have been rising in popularity in recent years Çatal et al. (2020a). In Fountas et al. (2020), the authors propose an amortized version of Monte Carlo Tree Search, through a habit network, for planning. In Tschantz et al. (2020), AIF is shown to perform better than RL algorithms in terms of reward maximization and exploration on small-scale tasks. In Millidge (2019), the author proposes an objective to amortize planning in a value-iteration fashion.

7 Discussion

We presented the Contrastive Active Inference framework, a contrastive learning approach for active inference that casts the free energy minimization imperatives of AIF as contrastive learning problems. We derived the contrastive objective functionals and corroborated their applicability through empirical experiments, in both continuous and discrete action settings, with high-dimensional observations. Combining our method with models and learning routines inspired by the model-based RL literature, we found that our approach can perform comparably to models that have access to human-designed rewards. Our results show that contrastive features better capture relevant information about the dynamics of the task, which can be exploited both to find states conceptually similar to preferred outcomes and to make the agent’s preferences invariant to irrelevant changes in the environment (e.g. background, colors, camera angle).

While the possibility of matching states to outcomes in terms of similar features is rather convenient in image-based tasks, the risk is that, if the agent has never seen the desired outcome, it will converge to the semantically closest state it knows in the environment. This raises important concerns about the necessity of providing good exploratory data about the environment, in order to prevent the agent from getting stuck in local minima. For this reason, we aim to look into combining our agent with exploration-driven data collection, for zero-shot goal achievement Mazzaglia et al. (2021); Sekar et al. (2020). Another complementary line of research would be equipping our method with better experience replay mechanisms, such as HER Andrychowicz et al. (2017), to improve the generalization capabilities of the system.

Broader impact

Active inference is a biologically-plausible unifying theory for perception and action. Implementations of active inference that are both tractable and computationally cheap are important to foster further research towards potentially better theories of the human brain. By strongly reducing the computational requirements of our system, compared to other deep active inference implementations, we aim to make the study of this framework more accessible. Furthermore, our successful results on the robotic manipulator task with varying realistic backgrounds show that contrastive methods are promising for real-world applications with complex observations and distracting elements.

This research received funding from the Flemish Government (AI Research Program).

References

  • [1] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. Pieter Abbeel, and W. Zaremba (2017) Hindsight experience replay. In Advances in Neural Information Processing Systems, Vol. 30.
  • [2] A. P. Badia, B. Piot, S. Kapturowski, P. Sprechmann, A. Vitvitskyi, D. Guo, and C. Blundell (2020) Agent57: outperforming the Atari human benchmark. arXiv:2003.13350.
  • [3] R. Bogacz (2020) Dopamine role in learning and action inference. eLife 9, e53262.
  • [4] L. Buesing, T. Weber, S. Racaniere, S. M. A. Eslami, D. Rezende, D. P. Reichert, F. Viola, F. Besse, K. Gregor, D. Hassabis, and D. Wierstra (2018) Learning and querying fast generative models for reinforcement learning. arXiv:1802.03006.
  • [5] O. Çatal, T. Verbelen, J. Nauta, C. D. Boom, and B. Dhoedt (2020) Learning perception and planning with deep active inference. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3952–3956.
  • [6] O. Çatal, S. Wauthier, C. De Boom, T. Verbelen, and B. Dhoedt (2020) Learning generative state space models for active inference. Frontiers in Computational Neuroscience 14, pp. 103.
  • [7] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv:2002.05709.
  • [8] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton (2020) Big self-supervised models are strong semi-supervised learners. arXiv:2006.10029.
  • [9] X. Chen, H. Fan, R. Girshick, and K. He (2020) Improved baselines with momentum contrastive learning. arXiv:2003.04297.
  • [10] M. Chevalier-Boisvert, L. Willems, and S. Pal (2018) Minimalistic gridworld environment for OpenAI Gym. GitHub: https://github.com/maximecb/gym-minigrid.
  • [11] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning.
  • [12] J. Clark and D. Amodei (2016) Faulty reward functions in the wild.
  • [13] I. Clavera, V. Fu, and P. Abbeel (2020) Model-augmented actor-critic: backpropagating through paths. arXiv:2005.08068.
  • [14] L. Da Costa, T. Parr, N. Sajid, S. Veselic, V. Neacsu, and K. Friston (2020) Active inference on discrete state-spaces: a synthesis. Journal of Mathematical Psychology 99, pp. 102447.
  • [15] Z. Fountas, N. Sajid, P. Mediano, and K. Friston (2020) Deep active inference agents using Monte-Carlo methods. In Advances in Neural Information Processing Systems, Vol. 33, pp. 11662–11675.
  • [16] K. Friston, F. Rigoli, D. Ognibene, C. Mathys, T. Fitzgerald, and G. Pezzulo (2015) Active inference and epistemic value. Cognitive Neuroscience 6 (4), pp. 187–214.
  • [17] K. Friston, L. Da Costa, D. Hafner, C. Hesp, and T. Parr (2021) Sophisticated inference. Neural Computation 33 (3), pp. 713–763.
  • [18] K. Friston, T. FitzGerald, F. Rigoli, P. Schwartenbeck, J. O’Doherty, and G. Pezzulo (2016) Active inference and learning. Neuroscience & Biobehavioral Reviews 68, pp. 862–879.
  • [19] K. J. Friston, J. Daunizeau, J. Kilner, and S. J. Kiebel (2010) Action and behavior: a free-energy formulation. Biological Cybernetics 102 (3), pp. 227–260.
  • [20] Y. Ganin, T. Kulkarni, I. Babuschkin, S. M. A. Eslami, and O. Vinyals (2018) Synthesizing programs for images using reinforced adversarial learning. In Proceedings of the 35th International Conference on Machine Learning, PMLR Vol. 80.
  • [21] M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR Vol. 9, pp. 297–304.
  • [22] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, PMLR Vol. 80, pp. 1861–1870.
  • [23] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson (2019) Learning latent dynamics for planning from pixels. In Proceedings of the 36th International Conference on Machine Learning, PMLR Vol. 97, pp. 2555–2565.
  • [24] D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba (2021) Mastering Atari with discrete world models. arXiv:2010.02193.
  • [25] D. Hafner, T. P. Lillicrap, J. Ba, and M. Norouzi (2020) Dream to control: learning behaviors by latent imagination. In ICLR.
  • [26] K. He, H. Fan, Y. Wu, S. Xie, and R. B. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9726–9735.
  • [27] M. Janner, J. Fu, M. Zhang, and S. Levine (2019) When to trust your model: model-based policy optimization. In Advances in Neural Information Processing Systems, Vol. 32.
  • [28] L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, A. Mohiuddin, R. Sepassi, G. Tucker, and H. Michalewski (2020) Model-based reinforcement learning for Atari. arXiv:1903.00374.
  • [29] D. P. Kingma and M. Welling (2014) Auto-encoding variational Bayes. arXiv:1312.6114.
  • [30] V. R. Konda and J. N. Tsitsiklis (2000) Actor-critic algorithms. In Advances in Neural Information Processing Systems, pp. 1008–1014.
  • [31] V. Krakovna et al. (2020) Specification gaming: the flip side of AI ingenuity.
  • [32] S. Levine (2018) Reinforcement learning and control as probabilistic inference: tutorial and review. arXiv:1805.00909.
  • [33] C. Lynch, M. Khansari, T. Xiao, V. Kumar, J. Tompson, S. Levine, and P. Sermanet (2019) Learning latent plans from play. arXiv:1903.01973.
  • [34] X. Ma, S. Chen, D. Hsu, and W. S. Lee (2020) Contrastive variational model-based reinforcement learning for complex observations. In Proceedings of the 4th Conference on Robot Learning.
  • [35] P. Mazzaglia, O. Catal, T. Verbelen, and B. Dhoedt (2021) Self-supervised exploration via latent Bayesian surprise. arXiv:2104.07495.
  • [36] B. Millidge, A. Tschantz, A. K. Seth, and C. L. Buckley (2020) On the relationship between active inference and control as inference. arXiv:2006.12964.
  • [37] B. Millidge (2019) Deep active inference as variational policy gradients. arXiv:1907.03876.
  • [38] OpenAI, I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, J. Schneider, N. Tezak, J. Tworek, P. Welinder, L. Weng, Q. Yuan, W. Zaremba, and L. Zhang (2019) Solving Rubik’s cube with a robot hand. arXiv:1910.07113.
  • [39] B. Poole, S. Ozair, A. Van Den Oord, A. Alemi, and G. Tucker (2019) On variational bounds of mutual information. In Proceedings of the 36th International Conference on Machine Learning, PMLR Vol. 97, pp. 5171–5180.
  • [40] I. Popov et al. (2017) Data-efficient deep reinforcement learning for dexterous manipulation. arXiv:1704.03073.
  • [41] D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML), Vol. 32, pp. 1278–1286.
  • [42] T. G. J. Rudner, V. H. Pong, R. McAllister, Y. Gal, and S. Levine (2021) Outcome-driven reinforcement learning via variational inference. arXiv:2104.10190.
  • [43] T. Schaul, D. Horgan, K. Gregor, and D. Silver (2015) Universal value function approximators. In Proceedings of the 32nd International Conference on Machine Learning, PMLR Vol. 37, pp. 1312–1320.
  • [44] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, et al. (2020) Mastering Atari, Go, chess and shogi by planning with a learned model. Nature 588 (7839), pp. 604–609.
  • [45] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2018) High-dimensional continuous control using generalized advantage estimation. arXiv:1506.02438.
  • [46] W. Schultz (1998) Predictive reward signal of dopamine neurons. Journal of Neurophysiology 80 (1), pp. 1–27.
  • [47] R. Sekar, O. Rybkin, K. Daniilidis, P. Abbeel, D. Hafner, and D. Pathak (2020) Planning to explore via self-supervised world models. In ICML.
  • [48] A. Srinivas, M. Laskin, and P. Abbeel (2020) CURL: contrastive unsupervised representations for reinforcement learning. arXiv:2004.04136.
  • [49] A. Stone, O. Ramirez, K. Konolige, and R. Jonschkowski (2021) The distracting control suite – a challenging benchmark for reinforcement learning from pixels. arXiv:2101.02722.
  • [50] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. de Las Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, T. Lillicrap, and M. Riedmiller (2018) DeepMind Control Suite. arXiv:1801.00690.
  • [51] A. Tschantz, B. Millidge, A. K. Seth, and C. L. Buckley (2020) Reinforcement learning through active inference. arXiv:2002.12636.
  • [52] A. van den Oord, Y. Li, and O. Vinyals (2019) Representation learning with contrastive predictive coding. arXiv:1807.03748.
  • [53] D. Warde-Farley, T. V. de Wiele, T. D. Kulkarni, C. Ionescu, S. Hansen, and V. Mnih (2019) Unsupervised control through non-parametric discriminative rewards. In 7th International Conference on Learning Representations (ICLR 2019).
  • [54] R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3–4), pp. 229–256.

Appendix A Background Derivations

In this section, we provide the derivations of the equations provided in section 2.

In all equations, both for the past and the future, we consider only one time step t. This is possible thanks to the Markov assumption, stating that the environment properties exclusively depend on the previous time step. This makes it possible to write step-wise formulas, by applying ancestral sampling, i.e. for the state dynamics until time t:

p(s_{1:t} \mid a_{1:t-1}) = \prod_{\tau=1}^{t} p(s_\tau \mid s_{\tau-1}, a_{\tau-1}).

To simplify and shorten the Equations, we mostly omit conditioning on past states and actions. However, as shown in section 4, the transition dynamics explicitly take ancestral sampling into account, by using recurrent neural networks that process multiple time steps.
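The ancestral-sampling roll-out described above can be sketched as follows (the linear-Gaussian transition is a toy stand-in for the learned dynamics model; names are ours):

```python
import numpy as np

def ancestral_sample(s0, actions, transition, rng):
    """Roll out s_1..s_T by sampling each state conditioned only on the
    previous state and action (Markov assumption)."""
    states, s = [], s0
    for a in actions:
        s = transition(s, a, rng)  # sample s_t ~ p(s_t | s_{t-1}, a_{t-1})
        states.append(s)
    return states

# toy linear-Gaussian transition, standing in for the learned model
def toy_transition(s, a, rng):
    return 0.9 * s + a + 0.1 * rng.normal(size=s.shape)

rng = np.random.default_rng(0)
traj = ancestral_sample(np.zeros(2), [np.ones(2)] * 5, toy_transition, rng)
print(len(traj))  # 5
```

Each sampled state depends only on its immediate predecessor, which is what licenses the step-wise form of the free-energy objectives.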

a.1 Free Energy of the Past

For past observations, the objective is to build a model of the environment for perception. Since computing the posterior p(s|o) is intractable, we learn to approximate it with a variational distribution q(s|o). As we show, this process provides an upper bound on the surprisal (negative log evidence) of the model:

-\log p(o) = -\log \int p(o, s) \, ds
           = -\log \mathbb{E}_{q(s|o)}\!\left[\frac{p(o, s)}{q(s|o)}\right]
           \leq \mathbb{E}_{q(s|o)}\left[\log q(s|o) - \log p(o, s)\right] = \mathcal{F},

where we applied Jensen’s inequality in the last row, obtaining the variational free energy (Equation 1).

The free energy of the past can be rewritten in two main ways:

\mathcal{F} = D_{KL}\left[q(s|o) \,\|\, p(s|o)\right] - \log p(o)
            = D_{KL}\left[q(s|o) \,\|\, p(s)\right] - \mathbb{E}_{q(s|o)}\left[\log p(o|s)\right],

where the first expression highlights the bound on the model’s evidence, and the second shows the balance between the complexity of the state model and the accuracy of the likelihood model. From the latter, Equation 2 can be obtained by expliciting the state prior as p(s_t | s_{t-1}, a_{t-1}), according to the Markov assumption, and by choosing the corresponding step-wise posterior as the approximate variational distribution.
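These two decompositions can be verified numerically on a toy discrete model, using the standard identities F = KL[q(s) ‖ p(s|o)] − log p(o) = KL[q(s) ‖ p(s)] − E_q[log p(o|s)] (the probabilities below are made up for illustration):

```python
import numpy as np

# toy discrete model: 2 states, one fixed observation o
p_s = np.array([0.7, 0.3])          # prior p(s)
p_o_given_s = np.array([0.9, 0.2])  # likelihood p(o|s)
q_s = np.array([0.8, 0.2])          # approximate posterior q(s)

p_o = np.sum(p_s * p_o_given_s)     # evidence p(o)
p_s_given_o = p_s * p_o_given_s / p_o  # exact posterior p(s|o)

kl = lambda a, b: np.sum(a * np.log(a / b))

# evidence-bound form: F = KL[q || p(s|o)] - log p(o)
f1 = kl(q_s, p_s_given_o) - np.log(p_o)
# complexity-accuracy form: F = KL[q || p(s)] - E_q[log p(o|s)]
f2 = kl(q_s, p_s) - np.sum(q_s * np.log(p_o_given_s))

print(np.isclose(f1, f2))   # True: both forms agree
print(f1 >= -np.log(p_o))   # True: F upper-bounds the surprisal
```

The first form makes the bound explicit (F equals the surprisal plus a non-negative KL term), while the second is the one that is actually optimized, since it does not require the intractable posterior.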

a.2 Free Energy of the Future

For the future, the agent selects actions that it expects will minimize the free energy. In particular, active inference assumes that the agent’s model of the future is biased towards its preferred outcomes, distributed according to the prior \tilde{p}(o). Thus, we define the agent’s generative model as \tilde{p}(o, s, a) and we aim to find the distributions of future states and actions by applying variational inference, with the variational distribution q(s, a). If we consider expectations taken over trajectories sampled from q, the expected free energy (Equation 3) becomes:

\mathcal{G}_t = \mathbb{E}_{q(o_t, s_t, a_t \mid s_{t-1}, a_{t-1})}\left[\log q(s_t, a_t \mid s_{t-1}, a_{t-1}) - \log \tilde{p}(o_t, s_t, a_t \mid s_{t-1}, a_{t-1})\right],

where we make the conditioning on the previous state-action pair (s_{t-1}, a_{t-1}) explicit for the sake of clarity.

We now assume that the variational state posterior model approximates the true posterior over states, as a consequence of minimizing the free energy of the past. Thus, we can rewrite the above result as:

Then, we assume that the agent’s model likelihood over actions is uniform and constant:

Finally, by dropping the constant and rewriting all terms as KL divergences and entropies, we obtain the expected free energy as described in Equation 4.

Appendix B Model Details

Figure 7: World Model. Prior, posterior and representation models. For the posterior CNN, the configuration of each layer is provided.

The world model, composed of the prior network, the posterior network and the representation model, is presented in Figure 7.

The prior and the posterior network share a GRU cell, used to remember information from the past. The prior network first combines previous states and actions using a linear layer, then processes the output with the GRU cell, and finally uses a 2-layer MLP to compute the stochastic state from the hidden state of the GRU. The posterior network additionally has access to the features computed by a 4-layer CNN over observations. This setup is inspired by the models presented in [23, 6, 4]. For the representation model, on the one hand, we take the features computed from the observations by the posterior’s CNN, process them with a 2-layer MLP and apply a non-linearity. On the other hand, we take the state, process it with a 2-layer MLP and apply a non-linearity. Finally, we compute a dot product between the two resulting embeddings. In the world model’s loss, we clip the KL divergence term below 3 free nats, to avoid posterior collapse.
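The representation model’s dot-product scoring described above can be sketched as follows (random matrices stand in for the learned MLP weights; all names and sizes other than the 30-dimensional state and 200-unit hidden layers are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, weights):
    """2-layer MLP with a tanh non-linearity after each layer, standing
    in for the learned projection heads of the representation model."""
    for w in weights:
        x = np.tanh(x @ w)
    return x

obs_feat = rng.normal(size=32)   # CNN features of the observation
state = rng.normal(size=30)      # stochastic latent state (dimension 30)

# random weights standing in for trained parameters (hidden size 200)
w_obs = [rng.normal(size=(32, 200)) * 0.1, rng.normal(size=(200, 64)) * 0.1]
w_state = [rng.normal(size=(30, 200)) * 0.1, rng.normal(size=(200, 64)) * 0.1]

f_o = mlp(obs_feat, w_obs)       # observation embedding
f_s = mlp(state, w_state)        # state embedding
score = f_o @ f_s                # dot-product compatibility score
print(float(score))
```

During training, this scalar score plays the role of the unnormalized log-density in the contrastive objective, with embeddings of other observations in the batch acting as negatives.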

The behavior model is composed of the action network and the expected utility network, which are both 3-layer MLPs. In order to get a good estimate of future utility, able to trade off bias and variance, we used GAE(λ) estimation [45]. In practice, this translates into approximating the infinite-horizon utility with:

where λ is a hyperparameter and the estimate is computed over the imagination horizon for future trajectories. Given the above definition, the actor network loss and the utility network loss can be rewritten accordingly. In the actor loss, we scale the action entropy term, to prevent entropy maximization from taking over the rest of the objective. In order to stabilize training, when updating the actor network, we use the expected utility network and the world model from the previous epoch of training.
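The GAE(λ) estimator can be sketched in its standard advantage form (a generic sketch, not the authors’ exact utility recursion; here it is written over rewards, whereas the paper applies the same idea to expected utility):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: an exponentially-weighted
    average of n-step TD errors, trading off bias and variance."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae_t = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        gae_t = delta + gamma * lam * gae_t
        advantages[t] = gae_t
    return advantages

rewards = np.array([0.0, 0.0, 1.0])
values = np.array([0.5, 0.5, 0.5, 0.0])  # includes the bootstrap value
adv = gae(rewards, values)
print(adv)
```

Setting λ = 0 recovers one-step TD errors (low variance, high bias), while λ = 1 recovers Monte Carlo returns minus the baseline (high variance, low bias); intermediate values interpolate between the two.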

Appendix C Hyper Parameters

Name                        Value
World Model
  Latent state dimension    30
  GRU cell dimension        200
  Adam learning rate
Behavior Model
  γ parameter               0.99
  λ parameter               0.95
  Adam learning rate
Common
  Hidden layers dimension   200
  Gradient clipping         100
Table 3: World and behavior models hyperparameters.

Appendix D Experiment Details

Hardware. We ran the experiments on a Titan-X GPU, with an i5-2400 CPU and 16GB of RAM.

Preferred Outcomes. For the tasks in our experiments, the preferred outcomes are 64x64x3 images (displayed in Figures 2, 3(a) and 3(b)). The corresponding distributions are defined as 64x64x3 multivariate Laplace distributions, centered on the images’ pixel values. We also experimented with 64x64x3 multivariate Gaussians with unit variance, obtaining similar results.
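The log-probability of an observation under such a pixel-wise Laplace preference can be sketched as follows (scale fixed to 1; the function name is ours):

```python
import numpy as np

def laplace_logprob(obs, goal, scale=1.0):
    """Log-probability of an observation under a multivariate Laplace
    distribution centered on the goal image (independent pixels)."""
    return np.sum(-np.abs(obs - goal) / scale - np.log(2.0 * scale))

goal = np.zeros((64, 64, 3))
shifted = goal + 0.1
# the utility is maximal when the observation matches the goal exactly
print(laplace_logprob(goal, goal) > laplace_logprob(shifted, goal))  # True
```

Compared to a Gaussian, the Laplace penalizes pixel deviations with an L1 rather than L2 term, which makes the preference less sensitive to a few large pixel errors.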

Baselines. In section 5, we compare four different flavors of model-based control: Dreamer, Contrastive Dreamer, Likelihood-AIF and Contrastive-AIF. Losses for each of these methods are provided in Table 4, adopting the following additional definitions:

where the contrastive term is the same as in Equation 5.

Dreamer
Contrastive Dreamer
Likelihood-AIF
Contrastive-AIF
Table 4: Baselines overview. All losses are summed over multiple timesteps.

Distracting Suite Reconstructions. In the Reacher Easy experiment from the Distracting Control Suite, we found that Dreamer, a state-of-the-art algorithm on the DeepMind Control Suite, was not able to succeed. We hypothesized that this was because the world model spent most of its capacity predicting the complex background, and was thus unable to capture relevant information about the task.

In Figure 8, we compare ground-truth observations and reconstructions from the Dreamer posterior model. As expected, we found that although the model correctly stored information about several details of the background, it missed crucial information about the arm pose. Although better world models could alleviate problems like this, we strongly believe that different representation learning approaches, such as contrastive learning, provide a better solution to the issue.

Figure 8: Dreamer Reconstructions