Inference of Affordances and Active Motor Control in Simulated Agents

Flexible, goal-directed behavior is a fundamental aspect of human life. Based on the free energy minimization principle, the theory of active inference formalizes the generation of such behavior from a computational neuroscience perspective. Based on the theory, we introduce an output-probabilistic, temporally predictive, modular artificial neural network architecture, which processes sensorimotor information, infers behavior-relevant aspects of its world, and invokes highly flexible, goal-directed behavior. We show that our architecture, which is trained end-to-end to minimize an approximation of free energy, develops latent states that can be interpreted as affordance maps. That is, the emerging latent states signal which actions lead to which effects dependent on the local context. In combination with active inference, we show that flexible, goal-directed behavior can be invoked, incorporating the emerging affordance maps. As a result, our simulated agent flexibly steers through continuous spaces, avoids collisions with obstacles, and prefers pathways that lead to the goal with high certainty. Additionally, we show that the learned agent is highly suitable for zero-shot generalization across environments: After training the agent in a handful of fixed environments with obstacles and other terrains affecting its behavior, it performs similarly well in procedurally generated environments containing different amounts of obstacles and terrains of various sizes at different locations. To improve and focus model learning further, we plan to invoke active inference-based, information-gain-oriented behavior also while learning the temporally predictive model itself in the near future. Moreover, we intend to foster the development of both deeper event-predictive abstractions and compact, habitual behavioral primitives.




1 Introduction

We, as humans, direct our actions towards goals. But how do we select goals and how do we reach them? In this work we will focus on a more specific version of the latter question: Given a goal and some information about the environment, how can suitable actions be inferred that ultimately lead to the goal with high certainty?

The free energy principle proposed in Friston (2005a) serves as a good starting point for an answer. It is sometimes regarded as a “unified theory of the brain” (Friston, 2010c), because it attempts to explain a variety of brain processes such as perception, learning, and goal-directed action selection, based on a single objective: to minimize free energy. Free energy constitutes an upper bound on surprise, which results from interactions with the environment. When actions are selected to minimize free energy, the process is referred to as active inference. Active inference states that agents infer suitable actions by minimizing expected free energy, leading to risk-sensitive, goal-directed planning.

One limitation of active inference-based planning is computational complexity: Optimal active inference requires an agent to predict the free energy for all possible action sequences potentially far into the future. This soon becomes computationally intractable, which is why so far mostly simple, discrete environments with small state and action spaces have been investigated (Friston et al., 2015). How do biological agents, such as humans, deal with this computational explosion when planning behavior in our complex, dynamic world? It appears that humans, and other animals, have developed a variety of inductive biases that facilitate processing high-dimensional sensorimotor information in familiar situations (Butz, 2008; Butz et al., 2021). Affordances (Gibson, 1986), for example, encode object- and situation-specific action possibilities. By equipping an active inference agent with the tendency to infer affordances, then, inference-based planning could first focus on afforded environmental interactions, significantly alleviating the computational load when considering interaction options.

In this work, we model these conjectures by means of an output-probabilistic, temporally predictive artificial neural network architecture. The architecture is designed to focus on local environmental properties, from which it predicts action-dependent interaction consequences via latent state encodings. We show that, through this processing pipeline, affordance maps emerge, which encode behavior-relevant properties of the environment. These affordance maps can then be employed during goal-directed planning. Given spatially local visual information, the resulting latent affordance codes constrain the considered environmental interactions. As a result, planning via active inference becomes more effective and enables, for example, the avoidance of uncertainty while moving towards a given goal location. We furthermore show that the architecture exhibits zero-shot learning abilities (Eppe et al., 2022), directly solving related environments and the tasks within them.

2 Foundations

This section introduces the theoretical foundations of our work. We first specify our problem setting and notation. We then introduce the free energy principle and show how we can perform active inference-based goal-directed planning with two different algorithms. Subsequently, we combine the theory of affordances with the idea of cognitive maps and arrive at the concept of affordance maps. We propose that the incorporation of affordance maps can facilitate goal-directed planning via active inference.

2.1 Problem formulation and notation

We consider problems in which an agent interacts with its environment by performing actions $a$ and in turn receiving sensory states $s$. The sensory states reveal only parts of the environmental states $\vartheta$, which therefore are not directly observable, i.e., we are facing a partially observable Markov decision process.¹ In every time step $t$, an agent selects and performs an action $a_t$, and receives a sensory state (often called observation) of the next time step $s_{t+1}$.

¹ In Markov decision processes, usually the environment additionally returns a reward in each time step, which is to be maximized by the agent. Here, we do not define a reward function but instead plan in a model-predictive, goal-directed manner.

Model-based planning, such as active inference, requires a model of the world to simulate actions and their consequences. We use a forward model $f_M$ that predicts the unfolding motor-activity-dependent sensory dynamics while an agent interacts with its environment. In order to deal with partial observability, the forward model needs to be equipped with its own internal hidden state $h$. Its purpose is to encode the state of the environment, including potentially non-observable parts. Given a current sensory state $s_t$, an internal hidden state $h_t$, and an action $a_t$, the forward model computes an estimate of the sensory state in the next time step:

$$(\tilde{s}_{t+1}, h_{t+1}) = f_M(s_t, a_t, h_t) \tag{1}$$

See Figure 1 for a depiction of how environment, agent, forward model, and action selection relate to each other. We use recurrent neural networks as our forward models.


Figure 1: Depiction of a (partially observable) Markov decision process. An agent interacts with its environment by sending actions $a$ and receiving consequent sensory states $s$. Partial observability here implies that the sensory state does not encode the whole environmental state $\vartheta$. Rather, certain aspects remain hidden for the agent and must be inferred from the sensory state. To deal with this, our agent utilizes a forward model $f_M$ with its own internal hidden state $h$. It predicts sensory states $\tilde{s}$, which aid the action selection algorithm to produce appropriate actions. In order to stay in tune with the environment and to predict multiple time steps into the future, the forward model also receives observed and predicted sensory states (dashed arrows).

2.2 Toward free energy-based planning

The free energy principle starts by formalizing life itself, very generally, as having an interior and exterior, separated by some boundary (Friston, 2013). For life to maintain homeostasis, this boundary, protecting the interior, needs to be maintained. It follows that living things need to be in specific states because only a small number of all possible states ensure homeostasis. The free energy principle formalizes this maintenance of homeostatic states by means of minimizing entropy. But how can entropy be computed? One possibility is given by the presence of an internal, generative model of the world. In this case, we can regard entropy as the expected surprise about encountered sensory states given the model (Friston, 2010c). In other words: Living things must minimize expected surprise.

This implies that all living things act as if they strive to maintain a model of the world over time in some way or another. Surprise, however, is not directly accessible for a living thing. In order to compute the surprise corresponding to some sensory input, it is necessary to integrate over all possible states in the world that could have led to that input (Friston, 2009b). We can see this in the formal definition of surprise for a given sensory state $s$ (Friston et al., 2010a):

$$-\log p(s \mid m) = -\log \int p(s, \vartheta \mid m) \, d\vartheta \tag{2}$$

where $m$ is the model or the living thing itself, and $\vartheta$ are all environmental states, including states that are not fully observable for the living thing. The consideration of all these states is infeasible. Thus, according to the free energy principle, living things minimize free energy, which is defined as follows (Friston et al., 2010a):

$$FE(s, q) = \mathbb{E}_{q(\vartheta)}\big[-\log p(s, \vartheta \mid m)\big] - H\big[q(\vartheta)\big] \tag{3}$$

where $q$ is an approximate posterior, $H$ denotes entropy, and $\mathbb{E}$ denotes expected value. Since now all parameters are accessible, this quantity is computable. Rewriting it shows that free energy can be decomposed into a surprise and a divergence term:

$$FE(s, q) = -\log p(s \mid m) + D_{KL}\big[\, q(\vartheta) \,\|\, p(\vartheta \mid s, m) \,\big] \tag{4}$$

$D_{KL}$ denotes the Kullback-Leibler divergence. Since the divergence cannot be less than zero, free energy is an upper bound on surprise, our original quantity of interest.
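The bound described above can be checked numerically on a toy problem. The following is a minimal sketch, assuming a discrete model with two hidden environmental states; the state names, the probabilities, and all function names are hypothetical and only serve to illustrate that free energy equals surprise plus a non-negative divergence.

```python
import math

# Hypothetical toy model: joint probability p(s, theta | m) of one observed
# sensory state s together with each of two hidden environmental states.
joint = {"free": 0.30, "blocked": 0.10}

def surprise(joint):
    """-log p(s | m), obtained by summing over all hidden states."""
    return -math.log(sum(joint.values()))

def free_energy(q, joint):
    """E_q[-log p(s, theta | m)] - H[q] for a discrete distribution q."""
    energy = -sum(q[k] * math.log(joint[k]) for k in q if q[k] > 0)
    entropy = -sum(q[k] * math.log(q[k]) for k in q if q[k] > 0)
    return energy - entropy

def kl_to_posterior(q, joint):
    """KL divergence between q and the exact posterior p(theta | s, m)."""
    evidence = sum(joint.values())
    return sum(q[k] * math.log(q[k] / (joint[k] / evidence)) for k in q if q[k] > 0)

q_guess = {"free": 0.5, "blocked": 0.5}  # a poor approximate posterior
q_exact = {k: v / sum(joint.values()) for k, v in joint.items()}

# Free energy of the poor guess exceeds surprise by exactly the divergence;
# for the exact posterior the bound is tight.
gap = free_energy(q_guess, joint) - surprise(joint)
```

In this sketch, `gap` coincides with `kl_to_posterior(q_guess, joint)`, while `free_energy(q_exact, joint)` coincides with the surprise itself.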

Given a generative model of the world, surprise corresponds to an unexpected, inaccurate prediction of sensory information. In order to minimize free energy, an agent equipped with a generative world model thus has two ways to minimize the discrepancy between predicted and actually encountered sensory information: (1.) The internal world model can be adjusted to better resemble the world. In the short term, this relates to perception, while in the long term, this corresponds to learning. (2.) The agent can manipulate the world via its actions, such that the world better fits its internal model. In this case, an agent chooses actions that minimize expected free energy in the future, pursuing active inference.

2.2.1 Active inference

When the free energy principle is employed as a process theory for action selection, it is called active inference. The name comes from the fact that the brain actively samples the world to perform inference: It infers actions (also called control states) that minimize expected free energy (EFE), that is, surprise in anticipated future states. This is closely related to the principle of planning as inference in the machine learning and control theory communities (Botvinick and Toussaint, 2012; Lenz et al., 2015). According to Friston et al. (2015b), a policy $\pi$ is evaluated by projecting it $T$ time steps into the future and evaluating the EFE for this time horizon. Including the internal hidden states $h$, EFE can be formalized as

$$EFE(\pi, t, T, h_t) = \sum_{\tau=t+1}^{t+T} \Big( D_{KL}\big[\, p(s_\tau \mid \pi, h_t) \,\|\, p_{\mathrm{des}}(s_\tau) \,\big] + \beta \, H\big[\, p(s_\tau \mid \pi, h_t) \,\big] \Big) \tag{5}$$

where $t$ is the current time step, $\tau$ indexes the considered sequence of states projected into the future, $p_{\mathrm{des}}$ denotes the distribution over desired sensory states, and $\beta$ is a weighting parameter. This formula equates EFE with a sum of two components. The first part is the Kullback-Leibler divergence, which estimates how far the predicted sensory states deviate from desired ones. The second part is the entropy of the predicted sensory states, which quantifies uncertainty. We introduce $\beta$ to weigh these components. It enables us to tune the trade-off between choosing actions that minimize uncertainty and actions that minimize divergence from desired states. Based on this formula, policies can be evaluated and the policy with the least EFE can be chosen:

$$\hat{\pi} = \arg\min_{\pi} EFE(\pi, t, T, h_t) \tag{6}$$

Intuitively speaking, active inference-based planning agents choose actions that lead to desired sensory states with high certainty.
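To make the trade-off between divergence and uncertainty concrete, the following minimal sketch evaluates two hypothetical policies. It assumes one-dimensional Gaussian predictions and a Gaussian target, takes the divergence from predicted to desired states, and sums over the predicted time steps; the policy names and all numbers are invented for illustration.

```python
import math

def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL divergence from N(mu_p, sigma_p^2) to N(mu_q, sigma_q^2)."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
            - 0.5)

def gaussian_entropy(sigma):
    """Differential entropy of a one-dimensional normal distribution."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def expected_free_energy(predictions, target, beta):
    """Sum divergence plus beta-weighted entropy over predicted time steps.

    predictions: list of (mean, std) per future time step
    target: (mean, std) of the desired sensory state
    """
    return sum(gaussian_kl(mu, sd, *target) + beta * gaussian_entropy(sd)
               for mu, sd in predictions)

# Two hypothetical policies that approach the target with different certainty.
policies = {
    "through_fog": [(0.5, 1.0), (1.0, 1.5)],
    "clear_path": [(0.4, 0.2), (0.9, 0.3)],
}
target = (1.0, 0.5)

best = min(policies,
           key=lambda name: expected_free_energy(policies[name], target, beta=1.0))
```

Under these assumptions the more certain policy is selected, even though the uncertain one reaches the target mean exactly in its last step.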

2.2.2 Planning via active inference

On the computational level, active inference tells us to minimize EFE to perform goal-directed planning. Thus, it provides an objective to optimize actions. However, it does not specify how to optimize the actions on an algorithmic level. We thus detail two planning algorithms that can be employed for this kind of action selection. In both algorithms, we limit ourselves to a finite prediction horizon with fixed policy lengths. In order to evaluate possible policies, both algorithms employ a forward model and “imagine” the execution of a policy:

$$(\tilde{s}_{\tau+1}, h_{\tau+1}) = f_M(\tilde{s}_\tau, a_\tau, h_\tau), \quad \tau = t, \dots, t+T-1, \quad \tilde{s}_t = s_t \tag{7}$$

For active inference-based planning, we can compute the EFE for the predicted sequence and optimize the actions using one of the planning algorithms.

After a fixed number of optimization cycles, both algorithms return a sequence of actions. The first action can then be executed in the environment.

Gradient-based active inference

Action inference (Otte et al., 2017; Butz et al., 2019) is a gradient-based optimization algorithm for model-predictive control. Therefore, it requires the forward model to be differentiable. The algorithm maintains a policy, which, in each optimization cycle, is fed into the forward model. Afterwards, we use backpropagation through time to backpropagate the EFE onto the policy. We obtain the gradient by taking the derivative of the EFE with respect to an action $a_\tau$ from the policy. After multiple optimization cycles, the algorithm returns the first action of the optimized policy.
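The optimization loop can be illustrated with a minimal sketch. It is not the implementation used here: the forward model is a hypothetical one-dimensional stand-in, the objective is a summed squared distance instead of EFE, and the gradient is approximated by central finite differences rather than backpropagation through time. All names and hyperparameters are assumptions.

```python
def rollout(position, policy):
    """Hypothetical forward model: each action shifts the position by half its value."""
    path = []
    for action in policy:
        position = position + 0.5 * action
        path.append(position)
    return path

def objective(policy, start, target):
    """Stand-in for EFE: summed squared distance of the imagined path to the target."""
    return sum((p - target) ** 2 for p in rollout(start, policy))

def numerical_gradient(policy, start, target, eps=1e-5):
    """Central finite differences, substituting for backpropagation through time."""
    grad = []
    for i in range(len(policy)):
        up = policy[:i] + [policy[i] + eps] + policy[i + 1:]
        down = policy[:i] + [policy[i] - eps] + policy[i + 1:]
        grad.append((objective(up, start, target)
                     - objective(down, start, target)) / (2 * eps))
    return grad

def infer_actions(start, target, horizon=5, cycles=200, lr=0.1):
    """Maintain a policy and refine it by gradient descent in each optimization cycle."""
    policy = [0.0] * horizon
    for _ in range(cycles):
        grad = numerical_gradient(policy, start, target)
        # Keep actions within assumed bounds of [-1, 1].
        policy = [max(-1.0, min(1.0, a - lr * g)) for a, g in zip(policy, grad)]
    return policy

policy = infer_actions(start=0.0, target=1.0)
first_action = policy[0]  # only this action would be executed before replanning
```

With a differentiable neural forward model, the finite-difference step would be replaced by automatic differentiation, which scales far better with the policy length.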

Evolutionary-based active inference

The cross-entropy method (CEM; Rubinstein, 1999) is an evolutionary optimization algorithm. CEM maintains the parameters of a probability distribution and minimizes the cross-entropy between this distribution and a distribution that minimizes the given objective. It does so by sampling candidates, evaluating them according to EFE, and using the best performing candidates to estimate the parameters of its probability distribution for the next optimization cycle. Recently, CEM has been used as a gradient-free optimization technique for model-based control and reinforcement learning (Chua et al., 2018; Hafner et al., 2019; Pinneri et al., 2020). In such a model-predictive control setting, CEM maintains a sequence of probability distributions and candidates correspond to policies. After multiple optimization cycles, the algorithm returns the first action of the best sampled policy.
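The sampling and refitting cycle can be sketched as follows. This is a simplified illustration under stated assumptions: the forward model and cost are hypothetical one-dimensional stand-ins for the learned model and EFE, each time step has an independent normal distribution, and the population sizes are arbitrary.

```python
import random
import statistics

def rollout_cost(policy, start=0.0, target=1.0):
    """Stand-in for EFE: summed squared distance of a toy rollout to the target."""
    position, cost = start, 0.0
    for action in policy:
        position += 0.5 * max(-1.0, min(1.0, action))  # actions clipped to [-1, 1]
        cost += (position - target) ** 2
    return cost

def cem_plan(horizon=5, candidates=50, elites=10, cycles=20, seed=0):
    """Cross-entropy method over action sequences with one Gaussian per time step."""
    rng = random.Random(seed)
    means = [0.0] * horizon
    stds = [1.0] * horizon
    best_policy, best_cost = None, float("inf")
    for _ in range(cycles):
        # Sample candidate policies from the current sequence of distributions.
        population = [[rng.gauss(m, s) for m, s in zip(means, stds)]
                      for _ in range(candidates)]
        population.sort(key=rollout_cost)
        if rollout_cost(population[0]) < best_cost:
            best_policy, best_cost = population[0], rollout_cost(population[0])
        # Refit the distributions to the best performing candidates.
        elite = population[:elites]
        means = [statistics.fmean(column) for column in zip(*elite)]
        stds = [statistics.pstdev(column) + 1e-3 for column in zip(*elite)]
    return best_policy, best_cost

best_policy, best_cost = cem_plan()
first_action = best_policy[0]  # executed in the environment before replanning
```

Because no gradients are required, the same loop also works when the objective is not differentiable with respect to the actions.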

2.3 Behavior-oriented predictive encodings

In theory, given a sufficiently accurate model, active inference enables an agent to plan risk-sensitive, goal-directed behavior regardless of the complexity of the problem. In practice, however, considering all possible actions and consequences thereof quickly becomes computationally intractable. To counteract this problem, it appears that humans and other animals have developed a variety of inductive learning biases to focus the planning process by means of behavior-oriented, internal representations. Here, we focus on biases that lead to the development of affordances, cognitive maps, and, in combination, affordance maps.

2.3.1 Affordances

Gibson (1986) defines affordances as what the environment offers an animal: Depending on the environment, affordances act as indicators for possible environmental interactions. As a result, affordances fundamentally determine how animals behave depending on their environment. They constitute behavioral options from which the animal can select suitable ones in order to fulfil its current goal.

To give an example, imagine a flat surface at the height of a human’s knees. Given the structure underneath is sufficiently sturdy, it is possible to sit on the surface in a way that requires relatively little effort. Therefore, such a surface is sit-upon-able: It offers a human the possibility to sit on it in an effortless way.

The theory of affordances explicitly states that to (visually) perceive the environment is to perceive what it affords. Animals do not see the world as it is and derive their behavioral options from their perspective. Rather, Gibson proposes that affordances are perceived directly, assigning distinct meanings to different parts of the environment. From an ecological perspective, it appears that vision may have evolved for exactly this purpose (Gibson, 1986): to convey what behaviors are possible in the current situation. First, however, an animal needs to learn the relationship between visual stimuli and their meaning for behavior. This is non-trivial: Similar visual stimuli can mean different things, or the other way round. Furthermore, visual input is rich such that the animal needs to effectively focus on the behavior-relevant information.

2.3.2 Cognitive maps

The concept of cognitive maps was introduced in Tolman (1948). Tolman showed that after exploring a given maze, rats were able to navigate towards a food source regardless of their starting position. He concluded that the rats acquired a mental representation of the maze: a cognitive map. Place cells in the hippocampus seem to be a promising candidate for the neural correlate of this concept (O’Keefe and Nadel, 1978). These cells tend to fire when the animal is at associated locations. Visual input acts as the main stimulus, but the olfactory and vestibular senses also play a role. Together, place cells constitute a cognitive map, which the animal appears to use for orientation, reflection, and planning (Diba and Buzsaki, 2007; Pfeiffer and Foster, 2013).

2.3.3 Affordance maps

Cognitive maps are well-suited for flexible navigation and goal-directed planning. However, to improve the efficiency of the planning mechanisms, it will be useful to encode behavior-relevant aspects, such as the aforementioned affordances, within the cognitive map. Accordingly, we combine the theory of affordances with cognitive maps, leading to affordance maps. Their function is to map spatial locations onto affordance codes. Like cognitive maps, their encoding depends on visual cues. In contrast to cognitive maps’ traditional focus on map-building, though, affordance maps signal distinct behavioral options at particular environmental locations. As an example, consider a hallway corner situation with corridors to your right and behind you. An affordance map would encode successful navigation options for turning to the right or turning around.

2.4 Related neural network models

Ha and Schmidhuber (2018) used a world model to facilitate planning via reinforcement learning. Their overall architecture consists of a vision model which compresses visual information, a memory module, and a controller model, which predicts actions given a history of the compressed visual information. Their vision model is given by a variational autoencoder, which is trained in an unsupervised manner to reconstruct its input. Therefore, and in contrast to ours, their vision model is not trained to extract meaningful, behavior-oriented information. This is why we would not regard the emerging compressed codes as affordance codes.

Affordance maps were used before in Qi et al. (2020) to aid planning. The authors put an agent into an environment (VizDoom) with hazardous regions that were to be avoided. The agent moved around in its environment and collected experiences of harm or no harm, which were backprojected onto the pixels of the input to the agent’s visual system, thereby performing image segmentation. The authors then trained a convolutional neural network (CNN) on the resulting data, the output of which was utilized by the A* algorithm for planning. In contrast to ours, their architecture was not trained in an end-to-end fashion, meaning that the resulting affordance codes were not optimized to suit their forward model.

3 Model

We now detail the proposed architecture, which learns a forward model of the environment with spatial affordance encodings. The architecture predicts a probability density function over changes in sensory states given the latent, history-compressing state of the architecture as well as the current sensory state and action. This action-dependent forward model of the environment thus enables active inference-based planning. We first specify the architecture, then detail the model learning mechanism, and finally turn to goal-directed, risk-sensitive active inference.

3.1 Affordance-conditioned inference

Our model adheres to the general notion introduced above (cf. Equation 1). It consists of three main components: a forward model $f_M$, a vision model $v_M$, and a look-up map $\omega$ of the environment. The model with its different components is illustrated in Figure 2.

Our system learns a forward model of its environment by maintaining an internal history-compressing encoding $h_t$ and state estimations over time. Combined with the current action $a_t$, the forward model predicts the next sensory state $\tilde{s}_{t+1}$ and a consequent hidden state $h_{t+1}$. Focusing on motion control tasks, we encode the state $s_t$ by a two-dimensional positional encoding $p_t$, where the forward model continues to predict changes in positions $\Delta\tilde{p}_{t+1}$ given the last positional change $\Delta p_t$, hidden state $h_t$, and current action $a_t$. To enable the model to consider the properties of different regions in an environment during goal-directed planning, though, we introduce an additional contextual input $c_t$, which is able to modify the forward model predictions (cf. Butz et al. 2019, for a related approach without map-specificity). In each time step, the forward model $f_M$ thus additionally receives a context encoding vector $c_t$, which should encode the currently behavior-relevant characteristics of the environment.

This context code $c_t$ is produced by the vision model $v_M$, which receives a visual representation $v_t$ of the agent’s current surroundings in the form of a small pixel image. The vision model is thus designed to generate vector embeddings that accurately modify the forward model’s predictions context-dependently.

The prediction of the forward model can thus be formalized as follows:

$$(\Delta\tilde{p}_{t+1}, h_{t+1}) = f_M\big(\Delta p_t, a_t, v_M(v_t), h_t\big) \tag{8}$$

The visual information $v_t$ can be understood as a local view of the environment surrounding the agent. Thus, $v_t$ depends on the agent’s location $p_t$. To enable the model to predict $\Delta\tilde{p}_{t+1}$ for various agent positions, for example, for "imagined" trajectories while planning, the system is equipped with a look-up map $\omega$ to translate positions into local views of the environment, $v_t = \omega(p_t)$. We augment the model with the ability to probe particular map locations, translate the location into a local image, and extract behavior-relevant information from the image. Intuitively speaking, this is as if the network can put its virtual finger (or focus of attention) to any location on the map and consider the context-dependent behavioral consequences at the considered location. As a result, the system is able to consider behavioral consequences dependent on probed environmental locations. In future work, the learning of completely internal maps may be investigated further.
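The interplay of map look-up, vision model, and forward model during an imagined trajectory can be sketched in a few lines. The example below is purely illustrative: it uses a hypothetical discrete grid instead of a continuous space, hand-written rules in place of the trained networks, and it omits the hidden state and the last positional change for brevity.

```python
# Hypothetical toy map: 0 marks free space, 1 marks an obstacle.
GRID = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

def lookup(position, radius=1):
    """Look-up map: crop the local view around (row, col); outside counts as blocked."""
    row, col = position
    return [[GRID[r][c] if 0 <= r < len(GRID) and 0 <= c < len(GRID[0]) else 1
             for c in range(col - radius, col + radius + 1)]
            for r in range(row - radius, row + radius + 1)]

def vision_model(view):
    """Stand-in for the learned vision model: blocked flags for up, down, left, right."""
    mid = len(view) // 2
    return (view[mid - 1][mid], view[mid + 1][mid],
            view[mid][mid - 1], view[mid][mid + 1])

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def forward_model(action, context):
    """Stand-in for the forward model: predicted change in position given the context."""
    blocked = dict(zip(MOVES, context))
    return (0, 0) if blocked[action] else MOVES[action]

def imagine(position, policy):
    """Roll out a policy, probing the map at each predicted position."""
    path = []
    for action in policy:
        context = vision_model(lookup(position))
        delta = forward_model(action, context)
        position = (position[0] + delta[0], position[1] + delta[1])
        path.append(position)
    return path

path = imagine((0, 1), ["down", "left", "down", "down"])
```

In this sketch the first action is predicted to have no effect because the context code marks the cell below as blocked, so the imagined path detours around the obstacle.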


Figure 2: Affordance map architecture: Based on the current position $p_t$, the architecture performs a lookup in an environmental map $\omega$. The vision model $v_M$ receives the resulting visual information $v_t$ and produces a contextual code $c_t$. The forward model $f_M$ utilizes this context $c_t$, the last change in the position $\Delta p_t$, an action $a_t$, and its internal hidden state $h_t$ to predict a probability distribution over the next change in position. During training, the loss between predicted and actual change in position is backpropagated onto $f_M$ (red arrows) and further onto $v_M$ (orange arrows) to train both models end-to-end. During planning, the map look-up is performed using position predictions. For gradient-based active inference, EFE is backpropagated onto the action code $a_t$ (red and green arrows). For planning with the cross-entropy method, $a_t$ is modified directly via evolutionary optimization.

The consequence of this model design is that the context code $c_t$ will tend to encode visible, behavior-influencing aspects, that is, affordances. The context is therefore a compressed version of the environment’s behaviorally-relevant characteristics at the corresponding position. Accordingly, the incorporation of the affordance codes can be expected to improve both the accuracy of action-dependent predictions and active inference-based planning.

3.2 Uncertainty estimation

The free energy principle is inherently probabilistic and therefore active inference requires our architecture to produce probability density functions over sensory states. We implement this in terms of a forward model that does not predict a point estimate of the change in sensory state in the next time step, but rather the parameters of a probability distribution over this quantity. We choose the multivariate normal distribution with diagonal covariance matrix (i.e., covariances are set to $0$). The output of the forward model is then given by a mean vector $\mu_{t+1}$ and a vector of standard deviations $\sigma_{t+1}$. We thus replace $\Delta\tilde{p}_{t+1}$ with $(\mu_{t+1}, \sigma_{t+1})$.

3.3 Training

We train our architecture in an end-to-end, self-supervised fashion to perform one-step ahead predictions via backpropagation through time. The gradient flow during training is depicted in Figure 2. Inputs consist of the sensor-action tuples described above. The only induced target is given by the change in position in the next time step $\Delta p_{t+1}$. This target signal is compared to the output of the forward model by the negative log-likelihood (NLL),² which approximates free energy assuming no uncertainty in our point estimate of environmental state (see Equation 1). Due to end-to-end backpropagation, the vision model is trained to output compact, forward model-conditioning representations of the visual input.

² See Appendix D for a description of how to compute gradients when the objective is given by the NLL of a multivariate normal distribution.

While we use the NLL as the objective during training here, in future work one could utilize the full free energy (FE) in a probabilistic architecture. However, there is a close relationship between NLL and FE due to the Kullback-Leibler divergence: In Appendix F, we show that minimizing NLL is equivalent to minimizing the Kullback-Leibler divergence up to a constant factor and an additive constant. Thus, through NLL-based learning we can approximate learning through FE minimization.
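For reference, the negative log-likelihood of a target under a normal distribution with diagonal covariance can be written in a few lines. The sketch below is an illustration, not the training code; the function name and the example values are assumptions.

```python
import math

def gaussian_nll(target, mean, std):
    """Negative log-likelihood of target under a diagonal multivariate normal.

    target, mean, std: sequences of equal length, one entry per dimension.
    """
    return sum(
        math.log(s) + 0.5 * math.log(2 * math.pi) + 0.5 * ((t - m) / s) ** 2
        for t, m, s in zip(target, mean, std)
    )

# Hypothetical observed change in position and three candidate predictions.
observed = (0.1, -0.2)
confident_right = gaussian_nll(observed, mean=(0.1, -0.2), std=(0.1, 0.1))
uncertain_wrong = gaussian_nll(observed, mean=(0.5, 0.5), std=(1.0, 1.0))
confident_wrong = gaussian_nll(observed, mean=(0.5, 0.5), std=(0.1, 0.1))
```

The loss rewards small standard deviations only where the mean is accurate: a confident but wrong prediction is penalized far more than an uncertain one, which is what allows the standard deviation output to act as an uncertainty estimate.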

3.4 Goal-directed control

We perform goal-directed control via gradient- or evolutionary-based active inference as described in Section 2.2.2. Usually, in order to predict multiple time steps into the future given a policy, the forward model receives its own output as input. Since our architecture predicts the parameters of a normal distribution, we use the predicted mean as input in the next prediction time step. The model incorporates visual information from locations corresponding to the predicted means. Therefore, the model does not blindly imagine a path but simultaneously “looks” at, or focuses on, predicted positions, incorporating the inferred affordance code into the forward model predictions.

In order to compare the predicted path to the given target and to look up the visual information, we need absolute locations. We thus take the cumulative sum of the predicted mean changes in position and add the current absolute position. To compute EFE along a predicted path we also need to consider the standard deviations at every point. For that, we first convert standard deviations to variances, compute the cumulative sum, and convert back to standard deviations. We then can compute the EFE between the resulting sequence of probability distributions over predicted absolute positions and the given target according to Equation 5. To do so, we encode the target with a multivariate normal distribution as well, setting the mean to the target location and the standard deviation to a fixed value. We can thus optimize the policy via the gradient- or evolutionary-based EFE minimization method introduced in Section 2.2.2 above.³

³ Appendix B summarizes the particular adjustments we applied to these algorithms.
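The accumulation of means and standard deviations along a predicted path can be sketched as follows for a single coordinate. Summing variances treats the per-step prediction errors as independent, which is an assumption of this illustration; the function name and the numbers are hypothetical.

```python
import math
from itertools import accumulate

def absolute_path(start, delta_means, delta_stds):
    """Turn per-step predictions of positional change into absolute positions.

    Means add up directly. Standard deviations are squared to variances,
    accumulated, and converted back to standard deviations.
    """
    means = [start + total for total in accumulate(delta_means)]
    stds = [math.sqrt(total) for total in accumulate(s ** 2 for s in delta_stds)]
    return list(zip(means, stds))

# Hypothetical three-step prediction; the last step carries high uncertainty.
path = absolute_path(start=2.0,
                     delta_means=[1.0, 1.0, -0.5],
                     delta_stds=[0.3, 0.4, 1.2])
```

Here the predicted absolute positions are 3.0, 4.0, and 3.5 with standard deviations of roughly 0.3, 0.5, and 1.3. Uncertainty therefore never shrinks along an imagined path, so a single uncertain step also raises the entropy term of all later steps.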

4 Experiments and results

To evaluate the abilities of our neural affordance map architecture, we first introduce the environmental simulator and specify our evaluation procedure in general. The individual experimental results then evaluate the system’s planning abilities to avoid obstacles and regional uncertainty as well as to generalize to unseen environments. With respect to the affordance codes $c_t$, we show emerging affordance maps and examine disentanglement.

4.1 Environment

The environment used in our experiments is a physics-based simulation of a circular, vehicle-like agent with radius in a 2-dimensional space with an arbitrary size of units. It is confined by borders, which prevent the vehicle from leaving the area. The vehicle is able to fly around in the environment by adjusting its throttles, which are attached between the vertical and horizontal axes in a diagonal fashion. They take values between and resembling actions and enable the vehicle to reach a maximum velocity of approximately units per time step within the environment. Therefore, at least time steps are required for the vehicle to fly from the very left to the very right. Due to its mass, the vehicle undergoes inertia and per default, it is not affected by gravity. See Figure 3 for a depiction of the environment and an agent. It is implemented as an OpenAI Gym (Brockman et al., 2016).


Figure 3: The simulation environment we use in our experiments. It resembles a confined space in which a vehicle (green) can move around by adjusting its throttles (blue). Its goal is to navigate towards the target (red). Depending on the experiment, obstacles or different terrains (black) are present, which affect the vehicle’s sensorimotor dynamics.

The environment can contain obstacles, which block the way. Friction values are larger when the vehicle touches obstacles or borders. Furthermore, the environment can comprise different terrains, which locally change the sensorimotor dynamics. Force fields pull the agent up- or downwards. If the vehicle is inside a fog terrain, the environment returns a position that is corrupted by Gaussian noise. Two values from a standard normal distribution are sampled and added to each coordinate. This implies a standard deviation of approximately $\sqrt{2} \approx 1.41$ on the difference between positions from two consecutive time steps within fog.⁴

⁴ Since $\sqrt{1^2 + 1^2} = \sqrt{2} \approx 1.41$.
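The effect of fog on observed positional changes can be checked with a short simulation. The sketch below assumes independent standard normal noise on each observed position of a single coordinate and a hypothetical constant true velocity; it is not part of the simulator.

```python
import math
import random
import statistics

rng = random.Random(0)

# Hypothetical true trajectory along one coordinate with constant velocity.
true_positions = [0.1 * t for t in range(20001)]

# Inside fog, every observed position is corrupted by standard normal noise.
observed = [p + rng.gauss(0.0, 1.0) for p in true_positions]

# Differences between observations from consecutive time steps.
observed_deltas = [b - a for a, b in zip(observed, observed[1:])]

empirical_std = statistics.pstdev(observed_deltas)
theoretical_std = math.sqrt(2)  # two independent unit variances add up to 2
```

The empirical standard deviation of the observed changes lies close to the theoretical value, which by far exceeds the true change of 0.1 per step in this example.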

4.2 Model and agent

The vision model is given by a CNN, which produces the context activations. We always evaluate contexts of sizes , and . The forward model consists of a Long Short-Term Memory (LSTM; Hochreiter and Schmidhuber, 1997) followed by two parallel fully connected layers for the means and standard deviations, respectively. For each setting, we train versions of the same architecture with different initial weights and report averaged results. See Appendix A for more details on the hyperparameters and training procedure.

Active inference performance is evaluated after performing goal-directed control runs per setting for time steps. For each trained architecture instance, we consider distinct start and target positions corresponding to each corner of the environment. The start position is chosen randomly with a uniform distribution over a units square with a distance of units to the borders. The target position is chosen in the same way in the diagonally opposite corner. We consider the agent to have reached the target when its distance to the target falls below units. The prediction horizon always has a length of . For active inference based planning (Equation 5), the target is provided to the system as a Gaussian distribution with standard deviation . We reduce the standard deviation of the target distribution to once the agent comes closer than units. The two different standard deviations can be seen as corresponding to, for example, smelling and seeing the target, respectively. See Appendix B for more details on the hyperparameters. All hyperparameters were optimized empirically or with Hyperopt (Bergstra et al., 2013) via Tune (Liaw et al., 2018).
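The distance-dependent target distribution and the success criterion can be sketched as follows. This is an illustrative sketch only: the function names and all parameters (`far_std`, `near_std`, `switch_distance`, `reach_distance`) are hypothetical placeholders and not the values used in our experiments. The sketch also assumes that the standard deviation is re-evaluated from the current distance at every time step.

```python
import math


def target_std(agent_pos, target_pos, far_std, near_std, switch_distance):
    """Pick the standard deviation of the Gaussian target distribution.

    The wide distribution (`far_std`) is used far from the target and the
    narrow one (`near_std`) once the agent is closer than `switch_distance`.
    All arguments are hypothetical placeholders.
    """
    distance = math.dist(agent_pos, target_pos)
    return near_std if distance < switch_distance else far_std


def reached_target(agent_pos, target_pos, reach_distance):
    """A run counts as successful once the distance falls below a threshold."""
    return math.dist(agent_pos, target_pos) < reach_distance
```

For example, `target_std((0, 0), (3, 4), far_std=2.0, near_std=0.5, switch_distance=6.0)` selects the narrow distribution, because the Euclidean distance of 5 lies below the switching distance.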

In order to get an idea about the nature of the emerging affordance codes, we plot affordance maps by generating position-dependent context activations via the vision model for each possible location in the environment. That is, in the affordance maps shown below, the x- and y-axes correspond to locations in the environment while the color of each dot represents the context activation at that position. With context size , we directly interpret the activations as RGB color values.
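The construction of such a map can be sketched as follows. This is a minimal sketch under stated assumptions: `context_fn` is a hypothetical stand-in for the vision model applied to the local view at a given position, and the per-channel min-max rescaling to the unit interval is our assumption for display purposes, not necessarily the scaling used for the figures.

```python
import numpy as np


def affordance_map(context_fn, width, height, resolution):
    """Evaluate a context function on a grid and return an RGB image.

    `context_fn` maps an (x, y) position to a vector of three context
    activations; it is a hypothetical placeholder for the vision model.
    """
    xs = np.linspace(0.0, width, resolution)
    ys = np.linspace(0.0, height, resolution)
    # grid[i, j] holds the position (xs[j], ys[i])
    grid = np.stack(np.meshgrid(xs, ys, indexing="xy"), axis=-1)
    contexts = np.array([[context_fn(p) for p in row] for row in grid])
    # Rescale each channel to [0, 1] so it can be shown as a color value
    lo = contexts.min(axis=(0, 1), keepdims=True)
    hi = contexts.max(axis=(0, 1), keepdims=True)
    return (contexts - lo) / np.maximum(hi - lo, 1e-12)
```

The returned array has shape `(resolution, resolution, 3)` and can be passed to any image plotting routine that accepts RGB values in the unit interval.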

4.3 Experiment I: Obstacle avoidance

The first of our experiments examines our architecture’s ability to avoid obstacles during active inference through the use of affordance codes. As a baseline experiment, we consider context size , disabling information flow from the vision model to the forward model. With context sizes larger than , however, the forward model can be informed about obstacles and borders via the context.

We train the architecture on the environment depicted in Figure 3, where black areas resemble obstacles. Visual information has one channel only for the obstacles and borders. We examine performance with context sizes , and . We perform goal-directed control on the same environment we train on.


Figure 4: Results from Experiment I (Subsection 4.3): (A) validation loss during training; (B) exemplary affordance map for context size ; (C) prediction error during goal-directed control; (D) ratio of runs that made it to the target; (E) mean distance to the target; (F) ratio of steps performed close to target. EB is short for evolutionary- and GB for gradient-based active inference.

Figure 4 shows the results. Larger context sizes lead to smaller validation losses (Figure 4A), indicating their utility in improving the forward model’s accuracy. The affordance map (Figure 4B) shows that obstacles are encoded differently from the rest of the environment. Moreover, the colors indicate that the different sides of the obstacles are encoded similarly to the corresponding sides of the arena’s boundary, confirming behavior-relevant encodings of the visually perceived environment. We evaluate goal-directed planning in terms of prediction error (Figure 4C), percentage of goals reached (Figure 4D), mean distance to the target (Figure 4E) and ratio of steps closer than units to the target (Figure 4F). We find a general trend towards improvement with increasing context sizes, while the target is reached almost always with all context sizes. Gradient-based outperforms evolutionary-based active inference with context size . With larger context sizes, they perform similarly.

4.4 Experiment II: Generalization

In this experiment we examine how well our architecture is able to generalize to similar environments. In Experiment I (Subsection 4.3), we trained on a single environment. Once the architecture is trained, we expect that our system should be able to successfully perform goal-directed control in other environments as well, given we provide the corresponding visual input. The local view onto the map essentially allows us to change position and size of obstacles without expecting significant deterioration in performance.

We reuse the trained models from Experiment I (Subsection 4.3) and apply them to goal-directed control in two additional environments (see Figure 5). We only consider evolutionary-based active inference.


Figure 5: Additional environments used during goal-directed control in Experiment II (Subsection 4.4).

Figure 6 shows the results. All metrics (Figure 6C-F) indicate an even stronger trend towards improvement with increasing context sizes (when compared to the performance from Experiment I, cf. Subsection 4.3). Furthermore, we find slightly worse performance in the environment with obstacles, while slightly better performance is achieved in the environment with obstacles. We believe this is mainly due to the fact that the environment with two obstacles blocks the direct path much more severely. Overall, these results indicate that (i) the system generalizes well to similar environments and (ii) a context size of three values is beneficial for performance optimization.


Figure 6: Results from Experiment II (Subsection 4.4): (A-B) exemplary affordance maps for context size ; (C) prediction error during goal-directed control; (D) ratio of runs that made it to the target; (E) mean distance to the target; (F) ratio of steps performed close to target.

4.5 Experiment III: Behavioral relevance of affordance codes

Affordances should only encode visual information if it is relevant to the behavior of an agent. Is our architecture able to ignore visual information for creating its affordance maps, if this information has no effect on the agent’s behavior? Furthermore, affordances should encode different visual information with the same behavioral meaning similarly. To investigate our architecture in this regard, we perform an experiment similar to Experiment I (Subsection 4.3), but with two additional channels in the cognitive map. The first channel encodes the borders and upper obstacles, the second channel encodes the lower obstacles, and the third channel encodes meaningless information, which does not affect the behavior of the agent. Figure 7 shows the corresponding environment. We compare the results from this “hard condition” to the results of Experiment I (Subsection 4.3), to which we refer as the “easy condition”. We only consider evolutionary-based active inference.


Figure 7: One of the environments (hard condition) used in Experiment III (Subsection 4.5). Black and grey circles represent obstacles. They look different to the network but signal the same influence on behavior (i.e. path blockage). Green circles represent nothing and therefore mean the same as the white background behaviorally.

Figure 8 shows the results. We do not find a significant difference between the two conditions. The developing affordance map (Figure 8B) is qualitatively similar to the one obtained from Experiment I (Subsection 4.3): neither significant visual differences between the encodings of the different obstacles nor traces of the meaningless information remain. Appendix C shows an example of how this affordance map develops over the course of training. Finally, performance in goal-directed control also stays similar to Experiment I (Figure 8C-F).


Figure 8: Results from Experiment III (Subsection 4.5): (A) validation loss during training; (B) exemplary affordance map for context size ; (C) prediction error during goal-directed control; (D) ratio of runs that made it to the target; (E) mean distance to the target; (F) ratio of steps performed close to target.

4.6 Experiment IV: Uncertainty avoidance

Active inference leads to risk-sensitive goal-directed control. In this experiment, we examined the architecture’s ability to avoid regions of uncertainty during planning. We consider a run a success if the agent reached the target and was at no point inside a fog terrain. As mentioned above, we introduce an additional hyperparameter , which scales the influence of the entropy term on the free energy (see Equation 5). Here, we set to to foster avoidance of uncertainty. For gradient-based active inference, we additionally consider .

We train the architecture on the environment depicted in Figure 3, where black areas now indicate fog terrains instead of obstacles. The cognitive map consists of two channels: one channel for fog terrains and one channel for the borders.

Figure 9 shows the results. We find that the context encoding clearly improves the validation error (Figure 9A). Increasing the context size beyond leads to small additional improvements. The affordance map (Figure 9B) shows that free areas, obstacles, and fog are encoded differently. Gradient-based active inference reliably reaches the target with ; however, it does not avoid fog terrains. With , fog terrains are avoided more frequently, but the target is reached less often. A more detailed examination of the performance with gradient-based active inference shows that the gradient-based optimization tends to fall into local optima, such as actively flying to a wall and staying there to avoid uncertainty. Except for context size , evolutionary-based active inference outperforms gradient-based active inference. In the case of evolutionary-based active inference, all metrics (Figure 9C-F) improve when context is computed. However, the improvement between context size and is marginal at best, indicating that as long as the uncertainty induced by fog areas can be encoded, successful fog avoidance can be accomplished when applying evolutionary-based action planning.


Figure 9: Results from Experiment IV (Subsection 4.6): (A) validation loss during training; (B) exemplary affordance map for context size ; (C) prediction error during goal-directed control; (D) ratio of runs that made it to the target without touching fog; (E) ratio of runs that made it to the target; (F) ratio of runs that did not touch fog. EB is short for evolutionary- and GB10 and GB50 for gradient-based active inference with and respectively.

4.7 Experiment V: Disentanglement

In our final experiment, we examined the architecture’s ability to combine previously learned affordance codes. We trained each architecture instance on four different environments. The environments are constructed as shown in Figure 3, with black areas resembling obstacles in the first, fog terrains in the second, force fields pointing upwards in the third, and force fields pointing downwards in the fourth environment. Accordingly, the cognitive map consists of four channels, one for each of the aforementioned properties. We evaluate the architecture on procedurally generated environments. In each environment, a randomly chosen number between and obstacles, fog terrains, force fields pointing downwards, and force fields pointing upwards with randomly chosen radii between and are placed at random locations in the environment. All obstacles and fog terrains have a minimum distance of units from each other, ensuring that the agent is able to fly between them and thus preventing dead ends. Furthermore, all properties have a minimum distance of to each border, again to avoid dead ends. Patches of size units are left free in the corners such that start and target positions are not affected. We generate environments with two different conditions. In the first condition (easy), force fields are handled similarly to obstacles and fog terrains in that they have a minimum distance of units to all other obstacles, terrains, and force fields. This means that properties do not overlap. In the second condition (hard), force fields can overlap with each other, obstacles, and fog terrains. Due to the increased complexity of the problem, we use a context size of instead of and double the size of the last feedforward layer of the vision model to neurons. We only consider evolutionary-based active inference.
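The placement of non-overlapping properties in the easy condition can be sketched with simple rejection sampling. The sketch below is not our generator: the function name, its parameters, and the choice of rejection sampling are assumptions made for illustration, properties are modelled as circles in a square environment of side length `size`, and the free corner patches as well as the overlapping force fields of the hard condition are omitted for brevity.

```python
import numpy as np


def place_circles(n, radius_range, min_gap, border_gap, size, rng, max_tries=10_000):
    """Place up to `n` circles with minimum gaps via rejection sampling.

    Every circle keeps at least `border_gap` distance to each border and at
    least `min_gap` distance (edge to edge) to every other circle. All
    parameter names and values are hypothetical placeholders.
    """
    circles = []
    tries = 0
    while len(circles) < n and tries < max_tries:
        tries += 1
        r = rng.uniform(*radius_range)
        lo = r + border_gap
        hi = size - r - border_gap
        x, y = rng.uniform(lo, hi, size=2)
        # Accept the candidate only if it respects the gap to all placed circles
        if all(np.hypot(x - cx, y - cy) >= r + cr + min_gap for cx, cy, cr in circles):
            circles.append((x, y, r))
    return circles
```

A call such as `place_circles(5, (1.0, 2.0), 1.5, 1.0, 40.0, np.random.default_rng(0))` returns a list of `(x, y, radius)` tuples. If the constraints cannot be satisfied within `max_tries` candidates, fewer than `n` circles are returned.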

Figure 10 shows the results. Larger context sizes lead to smaller validation losses (Figure 10A). The prediction error during control (Figure 10C), however, is larger with context size compared to . The affordance map (Figure 10B) shows that the network has indeed learned to encode the distinct areas with distinct encodings: obstacles green, fog terrains blue, and force fields yellow. The encoding also partially incorporates boundary directions, thus encoding the properties relative to the free space from which the agent may enter the area (see, for example, the borders of the environment and the blue fog terrains). As expected, the agent always performs better in the easy condition. Best performance (Figure 10D-F) is achieved with context size , although a context size of already yields very good performance, which is somewhat surprising. It appears that our task is slightly too easy and is usually accomplished by simply avoiding all unusual areas. Future research may make the environments more complex or challenging, for example by requiring the agent to fly through a force field to reach the goal, in order to further reveal the full potential of the affordance map.


Figure 10: Results from Experiment V (Subsection 4.7): (A) validation loss during training; (B) exemplary affordance map for context size , where we used the first three vectors of a PCA to project the eight-dimensional embeddings into 3D color space; (C) prediction error during goal-directed control; (D) ratio of runs that made it to the target without touching fog; (E) ratio of runs that made it to the target; (F) ratio of runs that did not touch fog.
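The PCA projection mentioned in the caption of Figure 10 can be sketched as follows. This is a minimal sketch and not the plotting code behind the figure: the function name is hypothetical, and the final min-max rescaling of each projected component to the unit interval is our assumption for turning the projection into displayable colors.

```python
import numpy as np


def pca_to_rgb(embeddings):
    """Project embeddings onto their first three principal components.

    `embeddings` has shape (n_points, n_dims) with n_dims >= 3. The result
    has shape (n_points, 3) with each column rescaled to [0, 1].
    """
    x = np.asarray(embeddings, dtype=float)
    centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:3].T
    lo = projected.min(axis=0, keepdims=True)
    hi = projected.max(axis=0, keepdims=True)
    return (projected - lo) / np.maximum(hi - lo, 1e-12)
```

Since the sign of each principal direction is arbitrary, the resulting colors are only meaningful relative to each other within one map.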

5 Discussion

In this paper, we have connected active inference with the theory of affordances in order to guide the search for suitable behavioral policies via active inference in recurrent neural networks. The resulting architecture is able to perform risk-sensitive goal-directed planning while considering the properties of the agent’s local environment. This section provides a summary of our architecture’s abilities, compares it to related work, and finally presents possible future work directions.

5.1 Conclusion

Experiment I (Subsection 4.3) has shown that our proposed architecture facilitates goal-directed planning via active inference. Both the validation loss and the performance during goal-directed control revealed an advantage of incorporating affordance information, i.e. using a context size larger than . The affordance maps confirmed that the architecture is able to infer relationships between environmental features and their meaning for the agent’s behavior: Depending on the direction of and the distance to the next obstacle, different codes emerged. Experiment II (Subsection 4.4) has shown that once the relationship between environmental features and their meaning has been learned, this knowledge can be generalized to other environments with similar, but differently sized and positioned obstacles. Experiment III (Subsection 4.5) has shown that our architecture is able to map properties of the environment that are encoded differently visually but have the same influence on behavior onto the same affordance codes. Moreover, our architecture was able to learn to ignore information that does not have any behavioral meaning. This provides evidence that our architecture indeed learns affordances.

Experiment IV (Subsection 4.6) has shown that our architecture is able to avoid regions of uncertainty (fog terrains) during planning via active inference, thus exhibiting risk-sensitive goal-directed behavior. We find that evolutionary-based active inference outperforms gradient-based active inference once contextual information is provided. Gradient-based active inference either fails to avoid areas of uncertainty (small values) or avoids them without reaching the target (large ). Since the loss term is the same for both planning algorithms and evolutionary-based active inference succeeds in reaching the target while avoiding obstacles, we conclude that the backpropagated gradient signal often leads to local optima, out of which the optimizer does not escape. The broader initial search of the evolutionary-based process avoids these local optima. Future work may consider combining both techniques.

Experiment V (Subsection 4.7) underlined our architecture’s generalization abilities, but also showed that it has difficulties in disentangling the learned affordances. If properties of the environment do not overlap and with sufficiently large context size, the agent can successfully reach the target without touching regions of uncertainty nearly all the time. This is less so when properties do overlap. We propose that an additional regularization that may foster a disentanglement or factorization of the learned affordances could lead to a fully successful generalization to arbitrary combinations of previously encountered properties.

5.2 Future work

A current challenge for our architecture lies in the ability to handle different kinds of uncertainty (Der Kiureghian and Ditlevsen, 2009). For example, aleatoric uncertainty addresses the uncertainty inherent in the input data distribution. In this work, we used regions of uncertainty in the environment where the sensory states were corrupted by noise and other regions where this was not the case. Therefore, the amount of uncertainty in the data varied, which is referred to as heteroscedasticity. In contrast to aleatoric uncertainty, epistemic uncertainty results from mis- or underspecification of the model.

Distributional parameter prediction captures both kinds of uncertainty, albeit entangled (Hüllermeier and Waegeman, 2021). The affordance mechanism enables our architecture to model the heteroscedastic aspect of aleatoric uncertainty. It is not able to distinguish it from epistemic uncertainty, though. This could be accomplished by introducing Bayesian neural networks (Mullachery et al., 2018), which are able to represent epistemic uncertainty by using random variables as weights. Bayesian neural networks have been used before in recurrent settings (Fortunato et al., 2019) and can easily be approximated with MC dropout (Gal and Ghahramani, 2016). In conjunction with distributional parameter prediction, aleatoric and epistemic uncertainty may then be separated. In Chua et al. (2018), the authors successfully achieved this in a reinforcement learning setting.
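One common way to separate the two kinds of uncertainty, given several stochastic forward passes (for example, MC dropout samples or ensemble members) that each predict a mean and a standard deviation, is the law of total variance. The following sketch illustrates this decomposition. It is not part of our architecture, and the function name and input layout are assumptions made for illustration.

```python
import numpy as np


def decompose_uncertainty(means, stds):
    """Split predictive variance into aleatoric and epistemic parts.

    `means` and `stds` have shape (n_samples, ...), one entry per stochastic
    forward pass. The aleatoric part is the average predicted variance, the
    epistemic part is the variance of the predicted means across passes.
    """
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    aleatoric = np.mean(stds ** 2, axis=0)
    epistemic = np.var(means, axis=0)
    return aleatoric, epistemic
```

The sum of both parts is the total predictive variance of the resulting mixture. Inside fog one would expect a large aleatoric part, while a large epistemic part would point to regions the model has rarely visited.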

In this work, we trained our architecture on previously collected data and only afterwards performed goal-directed control. Alternatively, one could perform goal-directed control from the very beginning and train the architecture on inferred actions and the corresponding encountered observations in a self-supervised learning manner. This should increase performance since the distribution of the training data for the forward model then more closely matches the distribution of the data encountered during control. If aleatoric and epistemic uncertainty are disentangled by the architecture as described above, this could be used to guide exploration towards areas with remaining high epistemic uncertainty. We expect that such an information-gain-driven behavioral exploration may focus the self-supervised training towards the affordances.

Eventually, one could merge the exploration and exploitation phases. In this case, the exploration-exploitation dilemma needs to be resolved: How should the agent decide whether it should exploit previously acquired knowledge to reach its goal or instead explore the environment to gain further knowledge that can be exploited in later trials? The active inference mechanism (Friston et al., 2015b) generally provides a solution to this problem, although optimal parameter tuning remains challenging (Tani, 2017).

In this work, our model computed affordance codes directly from visual information. Future work could examine to what extent it is possible to fully memorize an affordance akin to a cognitive map. A straightforward approach would be to train a multi-layer perceptron that maps absolute positions onto affordance codes, in which case translational invariance is lost. Alternatively, a recurrent neural network that receives actions could predict affordances in future time steps conditioned upon previously encountered affordances. Additionally, the forward model could be split into an encoder, which maps sensory states onto internal hidden states, and a transition model, which maps internal hidden states and actions onto next internal hidden states. The introduction of an observation model that translates internal hidden states back into sensory states would then enable the whole planning process to take place in hidden state space akin to PlaNet (Hafner et al., 2019) and Dreamer (Hafner et al., 2019).

Beyond learning cognitive maps, the planning mechanism is also still somewhat limited. Our proposed architecture solves the considered tasks in a greedy manner. During planning, we compare the predicted sensory states to a fixed desired sensory state over the predicted trajectory. Our model therefore prefers actions that lead closer to the target only within the prediction horizon. This can be problematic if we consider e.g. tool use. Imagine an environment with keys and doors. Here, it might be necessary to temporarily steer away from the target in order to pick up a key and eventually get closer to the target after unlocking and passing through the door. Without further modifications, our agent would not make such a detour deliberately. In the future, we want to investigate how this can be mitigated through hierarchical planning on events (Eppe et al., 2022).

The event-segmentation theory states that humans segment continuous sensory information into separate events (Zacks et al., 2007; Butz et al., 2021; Zacks and Tversky, 2001). These can be characterized as compact representations of what is or was going on and are separated by event boundaries. Within one event, sensorimotor contingencies are somewhat consistent, but these can change drastically across boundaries. Events can consist of other events, resulting in a hierarchy with increasing abstraction that enables the emergence of habitual behavior on lower levels and conceptual thoughts on higher levels (Butz et al., 2021).

How do humans detect event boundaries? One approach is to continuously infer the currently ongoing event retrospectively from recently encountered action-observation tuples. With REPRISE, Butz et al. showed how this is possible within recurrent neural networks that make use of inferred context or event codes to facilitate goal-directed control of different types of vehicles (Butz et al., 2019). In contrast, the objective function of GateL0RD is designed in a way that allows only sparse changes in the internal hidden states of a gated recurrent forward model (Gumbsch et al., 2021). Both architectures incorporate inductive biases that reflect that most of the time, we are within one event and comparably rarely at event boundaries. We propose that our affordance mechanism can facilitate the emergence of event codes when we consider interaction events: If the affordance code does not change, we stay in the same event, and if it does, we likely encounter an event boundary. With such architectural enhancements, we expect to be able to solve even more elaborate sequential environmental interaction tasks in the near future.


  • J. Bergstra, D. Yamins, and D. Cox (2013) Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In International conference on machine learning, pp. 115–123. Cited by: §4.2.
  • M. Botvinick and M. Toussaint (2012) Planning as inference. Trends in Cognitive Sciences 16 (10), pp. 485 – 488. External Links: Document, ISSN 1364-6613 Cited by: §2.2.1.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540. Cited by: §4.1.
  • M. V. Butz, A. Achimova, D. Bilkey, and A. Knott (2021) Event‐predictive cognition: a root for conceptual human thought. Topics in Cognitive Science 13, pp. 10–24. External Links: Document Cited by: §1, §5.2.
  • M. V. Butz, D. Bilkey, D. Humaidan, A. Knott, and S. Otte (2019) Learning, planning, and control in a monolithic neural event inference architecture. arXiv:1809.07412 [cs]. Note: arXiv: 1809.07412 External Links: Link Cited by: §2.2.2, §3.1, §5.2.
  • M. V. Butz (2008) How and why the brain lays the foundations for a conscious self. Constructivist Foundations 4 (1), pp. 1–42. Cited by: §1.
  • K. Chua, R. Calandra, R. McAllister, and S. Levine (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in neural information processing systems 31. External Links: 1805.12114v2, Link Cited by: Appendix B, §2.2.2, §5.2.
  • P. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein (2005) A tutorial on the cross-entropy method. Annals of operations research 134 (1), pp. 19–67. Cited by: Appendix B.
  • A. Der Kiureghian and O. Ditlevsen (2009) Aleatory or epistemic? does it matter?. Structural safety 31 (2), pp. 105–112. Cited by: §5.2.
  • K. Diba and G. Buzsaki (2007) Forward and reverse hippocampal place-cell sequences during ripples. Nature neuroscience 10 (10), pp. 1241–1242. External Links: Document, ISSN 1097-6256 Cited by: §2.3.2.
  • M. Eppe, C. Gumbsch, M. Kerzel, P. D. H. Nguyen, M. V. Butz, and S. Wermter (2022) Intelligent problem-solving as integrated hierarchical reinforcement learning. Nature Machine Intelligence 4 (1), pp. 11–20. External Links: Document, ISSN 2522-5839 Cited by: §1, §5.2.
  • M. Fortunato, C. Blundell, and O. Vinyals (2019) Bayesian recurrent neural networks. arXiv:1704.02798 [cs, stat]. Note: arXiv: 1704.02798 External Links: Link Cited by: §5.2.
  • K. J. Friston, J. Daunizeau, J. Kilner, and S. J. Kiebel (2010a) Action and behavior: a free-energy formulation. Biological Cybernetics 102 (3), pp. 227–260. External Links: Document, ISSN 1432-0770, Link Cited by: §2.2.
  • K. Friston, F. Rigoli, D. Ognibene, C. Mathys, T. FitzGerald, and G. Pezzulo (2015) Active inference and epistemic value. Cognitive Neuroscience 6, pp. 187–214. External Links: Document Cited by: §1.
  • K. Friston, F. Rigoli, D. Ognibene, C. Mathys, T. FitzGerald, and G. Pezzulo (2015b) Active inference and epistemic value. Cognitive Neuroscience 6, pp. 187–214. External Links: Document, Link Cited by: §2.2.1, §5.2.
  • K. Friston (2013) Life as we know it. Journal of The Royal Society Interface 10 (86). External Links: Document Cited by: §2.2.
  • K. Friston (2005a) A theory of cortical responses. Philosophical transactions of the Royal Society B: Biological sciences 360 (1456), pp. 815–836. Note: Publisher: The Royal Society London Cited by: §1.
  • K. Friston (2009b) The free-energy principle: a rough guide to the brain?. Trends in Cognitive Sciences 13, pp. 293–301. External Links: Document, Link Cited by: §2.2.
  • K. Friston (2010c) The free-energy principle: a unified brain theory?. Nature reviews neuroscience 11 (2), pp. 127–138. Note: Publisher: Nature publishing group Cited by: §1, §2.2.
  • Y. Gal and Z. Ghahramani (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International conference on machine learning, pp. 1050–1059. Cited by: §5.2.
  • J. J. Gibson (1986) The ecological approach to visual perception. Vol. 1, Psychology Press, New York. Cited by: §1, §2.3.1, §2.3.1.
  • C. Gumbsch, M. V. Butz, and G. Martius (2021) Sparsely changing latent states for prediction and planning in partially observable domains. 35th Conference on Neural Information Processing Systems (NeurIPS 2021). Cited by: §5.2.
  • D. Ha and J. Schmidhuber (2018) World models. arXiv preprint arXiv:1803.10122. External Links: Document, 1803.10122v4, Link Cited by: §2.4.
  • D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi (2019) Dream to control: learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603. External Links: 1912.01603v3, Link Cited by: §5.2.
  • D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson (2019) Learning latent dynamics for planning from pixels. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 2555–2565. External Links: Link Cited by: §2.2.2, §5.2.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9, pp. 1735–1780. External Links: Document, Link Cited by: §4.2.
  • E. Hüllermeier and W. Waegeman (2021) Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Machine Learning 110 (3), pp. 457–506. External Links: Document, ISSN 1573-0565, Link Cited by: §5.2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix A, Appendix A, Appendix B.
  • I. Lenz, R. Knepper, and A. Saxena (2015) DeepMPC: learning deep latent features for model predictive control. Robotics: Science and Systems. External Links: Document Cited by: §2.2.1.
  • R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, and I. Stoica (2018) Tune: a research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118. Cited by: §4.2.
  • V. Mullachery, A. Khera, and A. Husain (2018) Bayesian neural networks. arXiv preprint arXiv:1801.07710. Cited by: §5.2.
  • S. J. Nowlan and G. E. Hinton (2018) Simplifying neural networks by soft weight sharing. In The Mathematics of Generalization, pp. 373–394. External Links: Document, Link Cited by: Appendix A.
  • J. O’Keefe and L. Nadel (1978) The hippocampus as a cognitive map. Oxford: Clarendon Press. Cited by: §2.3.2.
  • S. Otte, T. Schmitt, K. Friston, and M. V. Butz (2017) Inferring adaptive goal-directed behavior within recurrent neural networks. In Artificial Neural Networks and Machine Learning – ICANN 2017, A. Lintas, S. Rovetta, P. F.M.J. Verschure, and A. E.P. Villa (Eds.), Vol. 10613, pp. 227–235. Note: Series Title: Lecture Notes in Computer Science External Links: Document, ISBN 978-3-319-68599-1 978-3-319-68600-4, Link Cited by: §2.2.2.
  • R. Pascanu, T. Mikolov, and Y. Bengio (2013) On the difficulty of training recurrent neural networks. In International conference on machine learning, pp. 1310–1318. Cited by: Appendix A, Appendix A.
  • B. E. Pfeiffer and D. J. Foster (2013) Hippocampal place-cell sequences depict future paths to remembered goals. Nature 497 (7447), pp. 74–79. External Links: Document Cited by: §2.3.2.
  • C. Pinneri, S. Sawant, S. Blaes, J. Achterhold, J. Stueckler, M. Rolinek, and G. Martius (2020) Sample-efficient cross-entropy method for real-time planning. arXiv preprint arXiv:2008.06389. External Links: 2008.06389v1, Link Cited by: Appendix B, §2.2.2.
  • W. Qi, R. T. Mullapudi, S. Gupta, and D. Ramanan (2020) Learning to move with affordance maps. arXiv preprint arXiv:2001.02364 ICLR 2020. External Links: Link Cited by: §2.4.
  • R. Rubinstein (1999) The cross-entropy method for combinatorial and continuous optimization. Methodology and computing in applied probability 1 (2), pp. 127–190. Cited by: §2.2.2.
  • J. Tani (2017) Dialogue: exploring robotic minds by predictive coding principle. IEEE CDS Newsletter: Cognitive and Developmental Systems 14 (1), pp. 4–13. Cited by: §5.2.
  • E. C. Tolman (1948) Cognitive maps in rats and men. Psychological Review 55 (4), pp. 189. Cited by: §2.3.2.
  • T. Wang and J. Ba (2019) Exploring model-based planning with policy networks. arXiv preprint arXiv:1906.08649. Cited by: Appendix B.
  • J. M. Zacks and B. Tversky (2001) Event structure in perception and conception. Psychological Bulletin 127 (1), pp. 3. Cited by: §5.2.
  • J. M. Zacks, N. K. Speer, K. M. Swallow, T. S. Braver, and J. R. Reynolds (2007) Event perception: a mind-brain perspective. Psychological Bulletin 133, pp. 273–293. External Links: Document, Link Cited by: §5.2.


Appendix A Model and training hyperparameters

Unless noted otherwise, we use the following hyperparameters: The changes in position are scaled up by a constant factor of before feeding them into the forward model, such that the forward model receives inputs which approximately cover the interval between and . The LSTM has hidden size

with biases turned off. One fully connected layer predicts mean vectors via a linear activation function. A second, parallel fully connected layer predicts vectors of standard deviations via the exponential activation function, which yields strictly positive values and therefore ensures valid standard deviations. From a probabilistic point of view, under the assumption that the values before the activation function are uniformly distributed, these mappings implement an uninformative prior in a Bayesian framework

[Nowlan and Hinton, 2018]. The forward model has parameters in total. We use Adam [Kingma and Ba, 2014] as our optimizer with learning rate , -values , and

. We perform gradient clipping

[Pascanu et al., 2013] and set the maximum norm to .
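The two output heads described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names `gaussian_heads` and `diagonal_gaussian_nll` and all numeric values are hypothetical, and the sketch assumes independent output dimensions (a diagonal covariance matrix, as in Appendix D).

```python
import math


def gaussian_heads(pre_mean, pre_std):
    """Map two pre-activation vectors to means and standard deviations."""
    means = list(pre_mean)                  # linear (identity) activation
    stds = [math.exp(v) for v in pre_std]   # exponential activation, always > 0
    return means, stds


def diagonal_gaussian_nll(target, means, stds):
    """Negative log-likelihood of `target` under independent normal distributions."""
    nll = 0.0
    for x, mu, sigma in zip(target, means, stds):
        nll += 0.5 * math.log(2.0 * math.pi) + math.log(sigma)
        nll += 0.5 * ((x - mu) / sigma) ** 2
    return nll


# Hypothetical pre-activations for a two-dimensional prediction.
means, stds = gaussian_heads([0.2, -0.1], [-1.0, 0.5])
loss = diagonal_gaussian_nll([0.25, 0.0], means, stds)
```

Because the exponential maps any real pre-activation to a positive number, no additional clamping is needed to keep the predicted standard deviations valid.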

Visual input consists of pixels with the number of channels depending on the experiment. We obtain it by rasterization of a square of the environment, centering and excluding the agent. Due to the maximum velocity of units per time step, the agent’s next position is always within its visual field. Each channel corresponds to a property (obstacle, fog terrain, force field up and down) of the environment. The presence of a property is encoded with

s, while the rest of the tensor is set to


The vision model consists of a convolutional layer, a max pooling layer, and another convolutional layer followed by a fully connected layer. The convolutional layers have kernel size

with stride

, no padding and

channels. The max pooling layer has a receptive field size of with stride . The fully connected layer has size . We use the activation function in all layers. The vision model has parameters in total, where denotes the number of channels of the input. We use Adam [Kingma and Ba, 2014] as our optimizer with learning rate , -values , and . We perform gradient clipping [Pascanu et al., 2013] and set the maximum norm to .
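Gradient clipping by norm, as proposed by Pascanu et al. [2013], rescales the whole gradient vector whenever its norm exceeds a threshold. The following sketch shows the operation on a flat list of gradient values; the function name and the threshold used in the example are hypothetical and do not reflect the maximum norm used in this work.

```python
import math


def clip_gradient_norm(gradients, max_norm):
    """Rescale `gradients` so that their Euclidean norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in gradients))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in gradients]
    return list(gradients)


# Hypothetical example: a gradient of norm 5 is rescaled to norm 1.
clipped = clip_gradient_norm([3.0, 4.0], max_norm=1.0)
```

Rescaling preserves the direction of the gradient, which distinguishes this method from clipping each component separately.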

We generate training data by sending randomly generated actions to the environment. We generate the actions in a way that ensures good coverage of the whole environment. For each environment used in our experiments, we gather sequences of sensor-action-tuples. We use sequences for training and for validation. Each sequence has a length of time steps. We train end-to-end for epochs with batch size . We backpropagate the error through time every time steps and reset the hidden states every batches. This way, we train the model to keep its hidden states from exploding during goal-directed control as well.
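The interplay between truncated backpropagation through time and periodic hidden-state resets can be sketched as follows. This is only an illustration under assumptions: the helper names and the example numbers are hypothetical, and the sketch assumes that hidden states are carried over between batches and zeroed only at every reset.

```python
def truncated_bptt_chunks(sequence_length, bptt_steps):
    """Split a sequence into consecutive (start, stop) windows for truncated BPTT."""
    return [(start, min(start + bptt_steps, sequence_length))
            for start in range(0, sequence_length, bptt_steps)]


def should_reset_hidden(batch_index, reset_every):
    """Hidden states are zeroed only on every `reset_every`-th batch."""
    return batch_index % reset_every == 0


# Hypothetical numbers: windows of 4 steps over a sequence of 10 steps.
chunks = truncated_bptt_chunks(sequence_length=10, bptt_steps=4)
resets = [should_reset_hidden(b, reset_every=3) for b in range(6)]
```

Carrying hidden states across several batches exposes the network to long-running states during training, which is the situation it later faces during control.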

Appendix B Details on planning algorithms

When planning with gradient-based active inference, we apply the following adjustments to improve performance. Firstly, if an optimization cycle increases EFE, we perform early stopping and use the policy from the cycle before. Secondly, we decrease the learning rate exponentially over the policy from the future to the present. This leads to more stable paths since actions which lie in the later future are adapted more than actions to be executed in the nearer future. More precisely, given a mean learning rate and decay , we set the learning rate for action to:

See Appendix E for a description of how to compute gradients when the objective is given by the FE between two multivariate normal distributions. After each update, we clamp the policy to be in the correct value range. Finally, after optimization, we shift the policy while copying the last element. We use Adam [Kingma and Ba, 2014] with learning rate , -values , and , set the exponential learning rate decay to , and perform optimization cycles.
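The three policy adjustments described above (per-action learning rates, clamping, and shifting) can be sketched in a few lines of Python. The exponential weighting below is one possible schedule: normalizing the rates so that their average equals the mean learning rate is an assumption of this sketch and not necessarily the exact formula used in this work. All function names and example values are hypothetical.

```python
def per_action_learning_rates(mean_lr, decay, horizon):
    """Exponentially smaller learning rates for actions closer to the present."""
    # Weight decay**(horizon - 1 - t): largest for the most distant action.
    weights = [decay ** (horizon - 1 - t) for t in range(horizon)]
    scale = mean_lr * horizon / sum(weights)   # assumed normalization
    return [w * scale for w in weights]


def clamp_policy(policy, low, high):
    """Clamp every action component to the allowed value range."""
    return [[min(max(a, low), high) for a in action] for action in policy]


def shift_policy(policy):
    """Drop the executed action and duplicate the last one to keep the horizon."""
    return policy[1:] + [list(policy[-1])]


rates = per_action_learning_rates(mean_lr=0.1, decay=0.9, horizon=5)
policy = clamp_policy([[2.0, -3.0], [0.5, 0.2], [0.1, 0.0]], low=-1.0, high=1.0)
next_policy = shift_policy(policy)
```

Shifting the optimized policy gives the next planning step a warm start, so that the optimizer only has to refine the plan instead of rebuilding it.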

During evolutionary-based planning, we use normal distributions to model actions. In order to improve performance, we apply the following modifications. We use a momentum term on the means and covariances [De Boer et al., 2005]. After a single optimization iteration, we keep a fixed number of the elites for the next iteration [Pinneri et al., 2020]. After optimization, we do not discard the means but shift them [Wang and Ba, 2019; Chua et al., 2018] while copying the last action in order to not start from scratch in the next optimization. We reset the variances, however, to avoid local minima. Analogously, we shift the elites that we keep [Pinneri et al., 2020]. We use the first action from the best sampled policy as the optimization result [Pinneri et al., 2020]. Instead of clipping sampled actions, we perform rejection sampling and sample until we have an action within the allowed value range. We generate trajectory candidates, use elites for parameter estimation, keep elites for the next optimization cycle, use an initial covariance of , and a momentum of .
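The modifications listed above can be combined into a compact cross-entropy-method planner. The following Python sketch is a toy illustration, not the planner used in this work: it assumes one-dimensional actions with independent variances per time step instead of full covariances, a hypothetical cost function `toy_cost`, and hypothetical default values for all hyperparameters.

```python
import random


def sample_in_range(mean, std, low, high, rng):
    """Rejection sampling: redraw until the sampled action is inside [low, high]."""
    while True:
        value = rng.gauss(mean, std)
        if low <= value <= high:
            return value


def shift(sequence):
    """Drop the first element and duplicate the last one to keep the length."""
    return sequence[1:] + [sequence[-1]]


def cem_plan(cost_fn, means, kept_elites, rng, init_std=0.5, num_candidates=32,
             num_elites=8, num_kept=2, momentum=0.1, iterations=5, low=-1.0, high=1.0):
    means = list(means)
    stds = [init_std] * len(means)   # variances restart from their initial value
    kept = [list(e) for e in kept_elites]
    best_cost, best_policy = float("inf"), None
    for _ in range(iterations):
        candidates = [[sample_in_range(m, s, low, high, rng) for m, s in zip(means, stds)]
                      for _ in range(num_candidates)] + kept
        ranked = sorted(candidates, key=cost_fn)
        elites = ranked[:num_elites]
        if cost_fn(ranked[0]) < best_cost:
            best_cost, best_policy = cost_fn(ranked[0]), ranked[0]
        for t in range(len(means)):
            column = [elite[t] for elite in elites]
            elite_mean = sum(column) / len(column)
            elite_var = sum((v - elite_mean) ** 2 for v in column) / len(column)
            # Momentum: blend the previous estimate with the new elite statistics.
            means[t] = momentum * means[t] + (1.0 - momentum) * elite_mean
            stds[t] = (momentum * stds[t] ** 2 + (1.0 - momentum) * elite_var) ** 0.5
        kept = elites[:num_kept]     # elites carried into the next iteration
    # Return the first action of the best sampled policy plus shifted warm-start data.
    return best_policy[0], shift(means), [shift(e) for e in kept]


def toy_cost(policy):
    """Hypothetical cost: the summed actions should reach the value 1.5."""
    return (sum(policy) - 1.5) ** 2


rng = random.Random(0)
first_action, next_means, next_elites = cem_plan(toy_cost, [0.0] * 4, [], rng)
```

In a control loop, `next_means` and `next_elites` would be passed to the next call of `cem_plan`, while the standard deviations are reset inside the function.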

Appendix C Affordance maps from experiment III after different amounts of epochs

Here, we show affordance maps from Experiment III (Subsection 4.5) after different numbers of training epochs. In Figure 11 we see that, as training progresses, the upper and lower obstacles are encoded more similarly, additional meaningless information is filtered out more thoroughly, and the affordance maps become more distinctive in how they encode different behavioral possibilities.


Figure 11: Exemplary affordance maps from Experiment III (Subsection 4.5) after different amounts of epochs.

Appendix D Derivative of negative log-likelihood in a normal distribution

In this section we derive the derivatives of the log-likelihood of a multivariate normal distribution with respect to the distribution’s parameters. The derivatives of the negative log-likelihood follow by flipping the sign.

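The likelihood in a multivariate normal distribution is given by its probability density function, here written for a $k$-dimensional observation $x$ with mean $\mu$ and covariance matrix $\Sigma$:

$$p(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^k \, |\Sigma|}} \exp\left(-\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$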


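This leads to the following log-likelihood:

$$\ln p(x \mid \mu, \Sigma) = -\frac{k}{2} \ln(2\pi) - \frac{1}{2} \ln|\Sigma| - \frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu)$$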


Now, we take the derivative of the log-likelihood function with respect to the parameters of our probability distribution. The resulting quantity is also referred to as the score. We start by calculating the derivative with respect to the mean $\mu$:

$$\frac{\partial \ln p(x \mid \mu, \Sigma)}{\partial \mu} = \Sigma^{-1} (x - \mu)$$


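Next, we calculate the derivative with respect to the covariance matrix $\Sigma$. We assume $\Sigma$ to be symmetric:

$$\frac{\partial \ln p(x \mid \mu, \Sigma)}{\partial \Sigma} = -\frac{1}{2} \Sigma^{-1} + \frac{1}{2} \Sigma^{-1} (x - \mu)(x - \mu)^\top \Sigma^{-1}$$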


We are now able to calculate the derivatives of the log-likelihood function of a multivariate normal distribution. In this work, we apply two simplifications: First, we use the special case of a bivariate normal distribution. Second, we assume all covariances to be $0$, leading to a diagonal covariance matrix. With these assumptions, the multivariate normal distribution factors into two univariate normal distributions. We replace the covariance matrix $\Sigma$ with a vector of variances $\sigma^2$ and end up with the following score:

$$\frac{\partial \ln p}{\partial \mu_i} = \frac{x_i - \mu_i}{\sigma_i^2}, \qquad \frac{\partial \ln p}{\partial \sigma_i^2} = -\frac{1}{2\sigma_i^2} + \frac{(x_i - \mu_i)^2}{2\sigma_i^4}, \qquad i \in \{1, 2\}$$
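The score of a normal distribution with diagonal covariance can be checked numerically. The following Python sketch compares the standard closed-form derivatives with respect to the means and variances against central finite differences; the function names and the test values are hypothetical.

```python
import math


def log_likelihood(x, mu, var):
    """Log-likelihood of `x` under independent normals with means `mu`, variances `var`."""
    return sum(-0.5 * math.log(2.0 * math.pi * v) - 0.5 * (xi - m) ** 2 / v
               for xi, m, v in zip(x, mu, var))


def score(x, mu, var):
    """Closed-form derivatives of the log-likelihood w.r.t. means and variances."""
    d_mu = [(xi - m) / v for xi, m, v in zip(x, mu, var)]
    d_var = [-0.5 / v + 0.5 * (xi - m) ** 2 / v ** 2 for xi, m, v in zip(x, mu, var)]
    return d_mu, d_var


def finite_difference(f, values, index, eps=1e-6):
    """Central finite difference of `f` with respect to `values[index]`."""
    up, down = list(values), list(values)
    up[index] += eps
    down[index] -= eps
    return (f(up) - f(down)) / (2.0 * eps)


x, mu, var = [0.3, -0.2], [0.1, 0.4], [0.5, 2.0]
d_mu, d_var = score(x, mu, var)
numeric_d_mu = [finite_difference(lambda m: log_likelihood(x, m, var), mu, i)
                for i in range(2)]
numeric_d_var = [finite_difference(lambda v: log_likelihood(x, mu, v), var, i)
                 for i in range(2)]
```

The analytic and numeric values agree up to the discretization error of the finite differences; negating them yields the gradients of the negative log-likelihood loss.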


Appendix E Derivative of free energy between normal distributions

In this section we derive the derivatives of the expected free energy, as used in this work, between two multivariate normal distributions with respect to the parameters of one of the distributions. We first take the derivative of the entropy and subsequently of the divergence term.

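The entropy of a $k$-dimensional multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$ is given by:

$$H = \frac{k}{2} \ln(2\pi e) + \frac{1}{2} \ln|\Sigma|$$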


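Now we take the derivative with respect to the mean $\mu$. Since the entropy does not depend on the mean, the derivative vanishes:

$$\frac{\partial H}{\partial \mu} = 0$$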


Next, we take the derivative with respect to the covariance matrix $\Sigma$ (assuming $\Sigma$ to be symmetric):

$$\frac{\partial H}{\partial \Sigma} = \frac{1}{2} \Sigma^{-1}$$


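We are now able to calculate the derivative of the entropy of a multivariate normal distribution. Following the simplifications from above (Appendix D), we again replace the covariance matrix $\Sigma$ with a vector of variances $\sigma^2$ and end up with the following gradients:

$$\frac{\partial H}{\partial \mu_i} = 0, \qquad \frac{\partial H}{\partial \sigma_i^2} = \frac{1}{2\sigma_i^2}$$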


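The Kullback-Leibler divergence between two $k$-dimensional multivariate normal distributions $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$ is given by:

$$D_{KL}\left(\mathcal{N}(\mu_1, \Sigma_1) \,\|\, \mathcal{N}(\mu_2, \Sigma_2)\right) = \frac{1}{2} \left[ \mathrm{tr}\left(\Sigma_2^{-1} \Sigma_1\right) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) - k + \ln\frac{|\Sigma_2|}{|\Sigma_1|} \right]$$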


We first take the derivative with respect to the mean of the first distribution $\mu_1$:

$$\frac{\partial D_{KL}}{\partial \mu_1} = \Sigma_2^{-1} (\mu_1 - \mu_2)$$


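Now we take the derivative with respect to the covariance matrix of the first distribution $\Sigma_1$ (assuming $\Sigma_1$ and $\Sigma_2$ to be symmetric):

$$\frac{\partial D_{KL}}{\partial \Sigma_1} = \frac{1}{2} \left( \Sigma_2^{-1} - \Sigma_1^{-1} \right)$$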


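We are now able to calculate the gradients of the Kullback-Leibler divergence between two multivariate normal distributions. Following the simplifications from above, we again replace the covariance matrices $\Sigma_1$ and $\Sigma_2$ with vectors of variances $\sigma_1^2$ and $\sigma_2^2$ and end up with the following gradients:

$$\frac{\partial D_{KL}}{\partial \mu_{1,i}} = \frac{\mu_{1,i} - \mu_{2,i}}{\sigma_{2,i}^2}, \qquad \frac{\partial D_{KL}}{\partial \sigma_{1,i}^2} = \frac{1}{2} \left( \frac{1}{\sigma_{2,i}^2} - \frac{1}{\sigma_{1,i}^2} \right)$$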
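The gradients of the entropy and of the Kullback-Leibler divergence for diagonal covariances can be verified numerically in the same way as the score. The sketch below uses the standard closed-form expressions for normal distributions with independent dimensions; the function names and all numeric values are hypothetical, and how the two terms are weighted in the free energy is deliberately left out.

```python
import math


def entropy(var):
    """Entropy of a normal distribution with diagonal covariance `var`."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * v) for v in var)


def kl_divergence(mu1, var1, mu2, var2):
    """KL divergence from N(mu1, diag(var1)) to N(mu2, diag(var2))."""
    return sum(0.5 * (v1 / v2 + (m2 - m1) ** 2 / v2 - 1.0 + math.log(v2 / v1))
               for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2))


def entropy_gradient(var):
    """Derivative of the entropy w.r.t. each variance (zero w.r.t. the means)."""
    return [0.5 / v for v in var]


def kl_gradients(mu1, var1, mu2, var2):
    """Derivatives of the KL divergence w.r.t. mu1 and var1."""
    d_mu1 = [(m1 - m2) / v2 for m1, m2, v2 in zip(mu1, mu2, var2)]
    d_var1 = [0.5 * (1.0 / v2 - 1.0 / v1) for v1, v2 in zip(var1, var2)]
    return d_mu1, d_var1


def finite_difference(f, values, index, eps=1e-6):
    """Central finite difference of `f` with respect to `values[index]`."""
    up, down = list(values), list(values)
    up[index] += eps
    down[index] -= eps
    return (f(up) - f(down)) / (2.0 * eps)


mu1, var1, mu2, var2 = [0.2, -0.5], [0.3, 1.5], [0.0, 0.1], [0.8, 0.6]
d_mu1, d_var1 = kl_gradients(mu1, var1, mu2, var2)
num_d_mu1 = [finite_difference(lambda m: kl_divergence(m, var1, mu2, var2), mu1, i)
             for i in range(2)]
num_d_var1 = [finite_difference(lambda v: kl_divergence(mu1, v, mu2, var2), var1, i)
              for i in range(2)]
num_d_entropy = [finite_difference(entropy, var1, i) for i in range(2)]
```

The divergence gradient with respect to the variances vanishes exactly when both distributions have the same variances, which the closed form makes directly visible.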


Appendix F Relationship between negative log-likelihood and Kullback-Leibler divergence

In this work, we trained our architecture with the negative log-likelihood as the loss but performed goal-directed control via EFE minimization which includes the Kullback-Leibler divergence. Here, we show the relationship between the negative log-likelihood and the Kullback-Leibler divergence in general.

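The Kullback-Leibler divergence between two probability distributions $p$ and $q$ is defined as

$$D_{KL}(p \,\|\, q) = \mathbb{E}_{x \sim p}\left[\ln p(x)\right] - \mathbb{E}_{x \sim p}\left[\ln q(x)\right]$$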


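where $\mathbb{E}$ denotes the expected value. Let us now assume that $p$ describes the distribution of some data we want to approximate with $q$. The left term does not depend on $q$ and therefore is constant. If we now take $N$ samples from the real distribution with $x_i \sim p$, we end up with the following approximation of the remaining term:

$$-\mathbb{E}_{x \sim p}\left[\ln q(x)\right] \approx -\frac{1}{N} \sum_{i=1}^{N} \ln q(x_i)$$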


which, up to a constant factor, is the definition of the negative log-likelihood.

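We conclude that minimizing the negative log-likelihood of the data under the model is equivalent to minimizing a sample-based estimate of the Kullback-Leibler divergence between the data distribution and the model distribution.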
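This relationship can be illustrated with a small Monte Carlo experiment. The sketch below assumes univariate normal distributions for both the data and the model, which is a simplification chosen for illustration; the function names, the sample size, and the candidate models are hypothetical. The mean negative log-likelihood of the samples approximates the entropy of the data distribution plus the Kullback-Leibler divergence to the model.

```python
import math
import random


def log_normal_pdf(x, mu, sigma):
    """Log-density of a univariate normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2


def kl_univariate(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL divergence from N(mu_p, sigma_p^2) to N(mu_q, sigma_q^2)."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2) - 0.5)


rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]   # data distribution p = N(0, 1)
entropy_p = 0.5 * math.log(2.0 * math.pi * math.e)       # entropy of N(0, 1)


def mean_nll(mu_q, sigma_q):
    """Average negative log-likelihood of the samples under the model q."""
    return -sum(log_normal_pdf(x, mu_q, sigma_q) for x in samples) / len(samples)


# Candidate models q: the data distribution itself and two mismatched ones.
candidates = [(0.0, 1.0), (0.5, 1.0), (0.0, 2.0)]
nlls = [mean_nll(mu_q, sigma_q) for mu_q, sigma_q in candidates]
kls = [kl_univariate(0.0, 1.0, mu_q, sigma_q) for mu_q, sigma_q in candidates]
```

Both lists rank the candidate models in the same order, because the two quantities differ only by the entropy of the data distribution, which does not depend on the model.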