# Deep Active Inference for Partially Observable MDPs

Deep active inference has been proposed as a scalable approach to perception and action that deals with large policy and state spaces. However, current models are limited to fully observable domains. In this paper, we describe a deep active inference model that can learn successful policies directly from high-dimensional sensory inputs. The deep learning architecture optimizes a variant of the expected free energy and encodes the continuous state representation by means of a variational autoencoder. We show, in the OpenAI benchmark, that our approach has comparable or better performance than deep Q-learning, a state-of-the-art deep reinforcement learning algorithm.

## 1 Introduction

Deep active inference (dAIF) [1, 2, 3, 4, 5, 6] has been proposed as an alternative to Deep Reinforcement Learning (RL) [7, 8] as a general scalable approach to perception, learning and action. The active inference mathematical framework, originally proposed by Friston in [9], relies on the assumption that an agent will perceive and act in an environment such as to minimize its free energy [10]. Under this perspective, action is a consequence of top-down proprioceptive predictions coming from higher cortical levels, i.e., motor reflexes minimize prediction errors [11].

On the one hand, works on dAIF, such as [2, 12, 13], have focused on scaling the optimization of the Variational Free-Energy bound (VFE), as described in [9, 14], to high-dimensional inputs such as images, modelling the generative process with deep learning architectures. This type of approach preserves the optimization framework (i.e., dynamic expectation maximization [15]) under the Laplace approximation by exploiting the forward and backward passes of the neural network. Alternatively, pure end-to-end solutions to VFE optimization can be achieved by approximating all the probability density functions with neural networks [1, 3].

On the other hand, Expected Free Energy (EFE) and Generalized Free Energy (GFE) were proposed to extend the one-step ahead implicit action computation into an explicit policy formulation, where the agent is able to compute the best action taking into account a time horizon [16]. Initial agent implementations of these approaches needed the enumeration over every possible policy projected forward in time up to the time horizon, resulting in significant scaling limitations. As a solution, deep neural networks were also proposed to approximate the densities comprising the agent’s generative model [1, 2, 3, 4, 5, 6], allowing active inference to be scaled up to larger and more complex tasks.

However, despite the general theoretical formulation, current state-of-the-art dAIF has only been successfully tested in toy problems with fully observable state spaces (Markov Decision Processes, MDP). Conversely, Deep Q-learning (DQN) approaches [7] can deal with high-dimensional inputs such as images.

Here, we propose a dAIF model (the code is available at https://github.com/Grottoh/Deep-Active-Inference-for-Partially-Observable-MDPs) that extends the formulation presented in [3] to tackle problems where the state is not observable (i.e. Partially Observable Markov Decision Processes, POMDP; we formulate image-based estimation and control as a POMDP, see [17] for a discussion). In particular, the environment state has to be inferred directly from high-dimensional visual input. The agent's objective is to minimize its EFE into the future up to some time horizon $T$, similarly to a receding horizon controller. We compared the performance of our proposed dAIF algorithm against DQN in the OpenAI CartPole-v1 environment. We show that the proposed approach has comparable or better performance depending on observability.

## 2 Deep Active Inference Model

We define the active inference agent’s objective as optimizing its variational free energy (VFE) at time t, which can be expressed as:

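$$-F_t = D_{KL}\left[q(s,a) \,\|\, p(o_t, s_{0:t}, a_{0:t})\right] \tag{1}$$

$$= -\mathbb{E}_{q(s_t)}\left[\ln p(o_t \mid s_t)\right] + D_{KL}\left[q(s_t) \,\|\, p(s_t \mid s_{t-1}, a_{t-1})\right] + D_{KL}\left[q(a_t \mid s_t) \,\|\, p(a_t \mid s_t)\right] \tag{2}$$

where $o_t$ is the observation at time $t$, $s_t$ is the state of the environment, $a_t$ is the agent's action and $\mathbb{E}_{q(s_t)}$ is the expectation over the variational density $q(s_t)$.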

We approximate the densities of Eq. 2 with deep neural networks as proposed in [1, 3, 4]. The first term, containing the densities $q(s_t)$ and $p(o_t \mid s_t)$, concerns the mapping of observations to states, and vice versa. We capture this objective with a variational autoencoder (VAE). A graphical representation of this part of the neural network architecture is depicted in Fig. 1; see the appendix for network details.

We can use an encoder network with parameters $\theta$ to model $q_\theta(s_t \mid o_t)$, and a decoder network with parameters $\vartheta$ to model $p_\vartheta(o_t \mid s_t)$. The encoder network encodes high-dimensional input as a distribution over low-dimensional latent states, returning the sufficient statistics of a multivariate Gaussian, i.e. the mean $s_\mu$ and variance $s_\Sigma$. The decoder network subsequently reconstructs the original input from the reparametrized sufficient statistics $z$. The distribution over latent states can be used as a model of the environment in case the true state of an environment is inaccessible to the agent (i.e. in a POMDP).
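As a minimal sketch of the reparametrization step mentioned above, the snippet below draws a latent sample from the encoder's mean and variance. It assumes a diagonal Gaussian parameterized by a log-variance, which is a common VAE convention but is not stated in the text; the function name `reparameterize` is hypothetical and plain Python lists stand in for tensors.

```python
import math
import random


def reparameterize(mean, log_var, rng):
    """Sample z = mean + std * eps, with eps ~ N(0, 1) per dimension.

    `mean` and `log_var` are lists of equal length (assumed diagonal Gaussian).
    `rng` is a random.Random instance, passed in so sampling is reproducible.
    """
    if len(mean) != len(log_var):
        raise ValueError("mean and log_var must have the same length")
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)  # std = exp(log_var / 2)
        for m, lv in zip(mean, log_var)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # Two-dimensional latent state with unit variance (log_var = 0).
    print(reparameterize([0.0, 1.0], [0.0, 0.0], rng))
```

Writing the sample as a deterministic function of the mean, the variance and external noise is what allows gradients to flow from the decoder back into the encoder.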

The second term of Eq. 2 can be interpreted as a state prediction error, which is expressed as the Kullback-Leibler (KL) divergence between the state derived at time $t$ and the state that was predicted for time $t$ at the previous time point. In order to compute this term the agent must, in addition to the already addressed $q(s_t)$, have a transition model $p(s_t \mid s_{t-1}, a_{t-1})$, which is the probability of being in a state given the previous state and action. We compute the MAP estimate with a feedforward network $f_\phi(s_{\mu,t-1}, a_{t-1})$. To compute the state prediction error, instead of using the KL divergence over the densities, we use the Mean-Squared-Error (MSE) between the encoded mean state $s_{\mu,t}$ and the predicted state returned by $f_\phi$.
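The state prediction error described above can be illustrated with a few lines of plain Python. This is only a sketch: `toy_transition` is a hypothetical stand-in for the learned transition network (it is not the network used in the paper), and lists stand in for tensors.

```python
def toy_transition(prev_state_mean, action):
    """Hypothetical stand-in for the transition network.

    Shifts every state dimension by +0.1 for action 1 and -0.1 otherwise.
    A real agent would use a trained feedforward network here.
    """
    shift = 0.1 if action == 1 else -0.1
    return [s + shift for s in prev_state_mean]


def state_prediction_error(encoded_mean, predicted_state):
    """Mean squared error between the encoded mean state and the prediction."""
    if len(encoded_mean) != len(predicted_state):
        raise ValueError("state vectors must have the same length")
    squared = [(e - p) ** 2 for e, p in zip(encoded_mean, predicted_state)]
    return sum(squared) / len(squared)


if __name__ == "__main__":
    predicted = toy_transition([0.0, 1.0], action=1)
    print(state_prediction_error([0.2, 1.0], predicted))
```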

The third and final term contains the last two unaddressed densities $q(a_t \mid s_t)$ and $p(a_t \mid s_t)$. We model the variational density $q(a_t \mid s_t)$ using a feedforward neural network $q_\xi(a_t \mid s_{\mu,t}, s_{\Sigma,t})$ parameterized by $\xi$, which returns a distribution over actions given a multivariate Gaussian over states. Finally, we model the action conditioned on the state, or policy, $p(a_t \mid s_t)$. According to the active inference literature, if an agent that minimizes the free energy does not have the prior belief that it selects policies that minimize its expected free energy (EFE), it would infer policies that do not minimize its free energy [16]. Therefore, we can assume that our agent expects to act so as to minimize its EFE into the future. The EFE of a policy $\pi$ from time $t$ onwards can be expressed as:

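$$G_\pi = \sum_{\tau > t} G_{\pi,\tau}, \qquad G_{\pi,\tau} = \underbrace{-\ln p(o_\tau)}_{-r_\tau} + D_{KL}\left[q(s_\tau \mid \pi) \,\|\, q(s_\tau \mid o_\tau)\right] \tag{3}$$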

Note that the EFE has been transformed into an RL instance by substituting the negative log-likelihood of an observation (i.e. surprise) with the reward [3, 18]. Since under this formulation minimizing one's EFE involves computing one's EFE for each possible policy $\pi$ for potentially infinite time points $\tau$, a tractable way to compute $G_\pi$ is required. Here we estimate $G_\pi$ via bootstrapping, as proposed in [3]. To this end the agent is equipped with an EFE-value (feedforward) network $f_\psi(s_{\mu,t}, s_{\Sigma,t})$ with parameters $\psi$, which returns an estimate $\tilde{G}_t$ that specifies an estimated EFE for each possible action. This network is trained with the aid of a bootstrapped EFE estimate $\hat{G}_t$, which consists of the free energy for the current time step, and a $\beta$-discounted value net approximation of the free energy expected under $q(a_{t+1} \mid s_{t+1})$ for the next time step:

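$$\hat{G}_t = -r_t + D_{KL}\left[q(s_t) \,\|\, q(s_t \mid o_t)\right] + \beta\, \mathbb{E}_{q(a_{t+1} \mid s_{t+1})}\left[\tilde{G}_{t+1}\right] \tag{4}$$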

In this form the parameters of $f_\psi$ can be optimized through gradient descent on the loss $L_t$ (see Fig. 2):

$$L_t = \mathrm{MSE}(\tilde{G}_t, \hat{G}_t) \tag{5}$$
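To make the bootstrapping step concrete, the sketch below computes the target of Eq. 4 and the squared error of Eq. 5 for a single transition. It assumes that the state KL term is already available as a number, that the value-network estimate for the next time step is given as one EFE value per action, and that the loss is evaluated on the entry of the action actually taken; the function names are hypothetical.

```python
def bootstrapped_efe(reward, state_kl, next_action_probs, next_efe_values, beta=0.99):
    """Bootstrapped EFE target: -r_t + KL term + beta * E_q[next-step EFE].

    `next_action_probs` is the policy distribution over actions at t+1 and
    `next_efe_values` the value-network EFE estimate per action at t+1.
    """
    if len(next_action_probs) != len(next_efe_values):
        raise ValueError("one probability and one EFE value per action required")
    expected_next = sum(p * g for p, g in zip(next_action_probs, next_efe_values))
    return -reward + state_kl + beta * expected_next


def efe_loss(efe_estimate, efe_target):
    """Squared error between the current estimate and the bootstrapped target."""
    return (efe_estimate - efe_target) ** 2


if __name__ == "__main__":
    target = bootstrapped_efe(1.0, 0.5, [0.5, 0.5], [-10.0, -20.0], beta=0.9)
    print(target, efe_loss(-12.0, target))
```

In practice the target would be treated as a constant during the gradient step, so that only the estimate is adjusted towards it.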

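The distribution over actions can then at last be modelled as a precision-weighted Boltzmann distribution over our EFE estimates [3, 16]:

$$p(a_t \mid s_t) = \sigma(-\gamma \tilde{G}_t) \tag{6}$$

where $\sigma$ denotes the softmax function and $\gamma$ is the precision parameter.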
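A precision-weighted Boltzmann distribution of this kind is a softmax over the negated, scaled EFE estimates. The following sketch shows it for a list of per-action EFE values; the function name `boltzmann_policy` is hypothetical, and the maximum is subtracted before exponentiation purely for numerical stability.

```python
import math


def boltzmann_policy(efe_values, gamma=1.0):
    """Return softmax(-gamma * G) over a list of per-action EFE estimates.

    Lower EFE gives higher probability; larger gamma gives a sharper policy.
    """
    logits = [-gamma * g for g in efe_values]
    max_logit = max(logits)  # subtracting the max does not change the softmax
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Two actions, the first with the lower expected free energy.
    print(boltzmann_policy([0.0, 2.0], gamma=1.0))
    print(boltzmann_policy([0.0, 2.0], gamma=12.0))
```

With a low precision the policy stays close to uniform, whereas a high precision concentrates almost all probability on the action with the lowest estimated EFE.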

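Finally, Eq. 2 is computed with the neural network density approximations as follows (see Fig. 4):

$$-F_t = -\mathbb{E}_{q_\theta(s_t \mid o_{t-3:t})}\left[\ln p_\vartheta(o_{t-3:t} \mid z_t)\right] + \mathrm{MSE}\left(s_{\mu,t}, f_\phi(s_{\mu,t-1}, a_{t-1})\right) + D_{KL}\left[q_\xi(a_t \mid s_{\mu,t}, s_{\Sigma,t}) \,\|\, \sigma\left(-\gamma f_\psi(s_{\mu,t}, s_{\Sigma,t})\right)\right] \tag{7}$$

where $s_{\mu,t}$ and $s_{\Sigma,t}$ are encoded by $q_\theta(s_t \mid o_{t-3:t})$.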
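The third term of Eq. 7 is a KL divergence between two distributions over a discrete action set: the output of the policy network and the Boltzmann distribution over EFE estimates. A minimal sketch of that computation is given below, assuming both distributions are already available as lists of probabilities; the small `eps` guard against taking the logarithm of zero is an implementation assumption, not something specified in the text.

```python
import math


def categorical_kl(q, p, eps=1e-12):
    """KL[q || p] for two discrete distributions given as probability lists.

    Terms with q_i == 0 contribute nothing, following the usual convention.
    """
    if len(q) != len(p):
        raise ValueError("distributions must have the same number of outcomes")
    return sum(
        qi * math.log((qi + eps) / (pi + eps))
        for qi, pi in zip(q, p)
        if qi > 0.0
    )


if __name__ == "__main__":
    policy_output = [0.7, 0.3]     # e.g. output of the policy network
    boltzmann_target = [0.9, 0.1]  # e.g. softmax over negated EFE estimates
    print(categorical_kl(policy_output, boltzmann_target))
```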

## 3 Experimental Setup

To evaluate the proposed algorithm we used OpenAI Gym's CartPole-v1 environment, as depicted in Fig. 4. In the CartPole-v1 environment, a pole is attached to a cart that moves along a track. The pole is initially upright, and the agent's objective is to keep the pole from tilting too far to one side or the other by increasing or decreasing the cart's velocity. Additionally, the position of the cart must remain within certain bounds. An episode of the task terminates when the agent fails either of these objectives, or when it has survived for 500 time steps. At each time step the agent receives a reward of 1.

The CartPole state consists of four continuous values: the cart position, the cart velocity, the pole angle and the velocity of the pole at its tip. In each run the state values are initialized at random within a small margin to ensure variability between runs. The agent can exert influence on the next state through two discrete actions: pushing the cart to the left, or pushing it to the right.

Tests were conducted in two scenarios: 1) an MDP scenario in which the agent has direct access to the state of the environment, and 2) a POMDP scenario in which the agent does not have direct access to the environment state, and instead receives pixel values from which it must derive meaningful hidden states. By default, rendering the CartPole-v1 environment returns a (color, height, width) array of pixel values. In our experiments we provide the POMDP agents with a downscaled and cropped image. The agents thus receive a pixel value array in which the cart is centered until it comes near the left or right border.

## 4 Results

The performance of our dAIF agents was compared against DQN agents for the MDP and the POMDP scenarios, and against an agent that selects its actions at random. Each agent was equipped with a memory buffer and a target network [19]. The memory buffer stores transitions from which the agent can sample random batches on which to perform batch gradient descent. The target network is a copy of the value network whose weights are not updated directly through gradient descent, but are instead updated periodically with the weights of the value network. In between updates this provides the agent with fixed EFE-value or Q-value targets, such that the value network does not have to chase a constantly moving objective.
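The two mechanisms described above can be sketched in plain Python as follows. This is an illustrative sketch only: the class `ReplayMemory` and the function `maybe_sync_target` are hypothetical names, network weights are represented by a dictionary rather than real network parameters, and the defaults mirror the memory size and freeze period listed in the appendix.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity buffer; the oldest transitions are dropped when full."""

    def __init__(self, capacity=65536):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        # Copy to a list so sampling works on a plain sequence.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def maybe_sync_target(step, value_params, target_params, freeze_period=25):
    """Copy value-network weights into the target every `freeze_period` steps."""
    if step % freeze_period == 0:
        target_params.clear()
        target_params.update(value_params)
        return True
    return False


if __name__ == "__main__":
    memory = ReplayMemory(capacity=4)
    for t in range(6):
        memory.push(("obs", t))
    print(len(memory), memory.sample(2, random.Random(0)))
```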

The VAE of the POMDP dAIF agent is pre-trained to deconstruct input images into a distribution over latent states and to subsequently reconstruct them as accurately as possible.

Fig. 5 shows the mean and standard deviation of the moving average reward (MAR) over all runs for the five algorithms at each episode. Each agent performed 10 runs of 5000 episodes. The moving average reward for an episode $e$ is calculated using a smoothing average:

$$\mathrm{MAR}_e = 0.1\,\mathrm{CR}_e + 0.9\,\mathrm{MAR}_{e-1} \tag{8}$$

where $\mathrm{CR}_e$ is the cumulative reward of episode $e$ and $\mathrm{MAR}_{e-1}$ is the MAR of the previous episode.
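The smoothing in Eq. 8 is a one-line recurrence. The sketch below applies it to a list of per-episode cumulative rewards; the starting value of the moving average is not given in the text, so the `initial=0.0` default is an assumption, and the function name is hypothetical.

```python
def moving_average_rewards(cumulative_rewards, initial=0.0):
    """Apply MAR_e = 0.1 * CR_e + 0.9 * MAR_{e-1} to a sequence of rewards.

    `initial` is the assumed moving average before the first episode.
    Returns one smoothed value per episode.
    """
    mar = initial
    smoothed = []
    for cr in cumulative_rewards:
        mar = 0.1 * cr + 0.9 * mar
        smoothed.append(mar)
    return smoothed


if __name__ == "__main__":
    # Three episodes that each lasted 10 time steps (reward 1 per step).
    print(moving_average_rewards([10.0, 10.0, 10.0]))
```

Because each new episode contributes only 10% of the value, the curve reacts slowly to sudden changes in performance, which is what makes the learning curves readable across 5000 episodes.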

The dAIF MDP agent results closely resemble those presented in [3], and the agent outperforms the DQN MDP agent by a significant margin. Further, the standard deviation shadings show that the dAIF MDP agent is more consistent between runs than the DQN agent. Both POMDP agents are shown to be capable of learning successful policies, attaining comparable performance.

We have exploited probabilistic model-based control through a VAE that encodes the state. On the one hand, this allows the tracking of an internal state which can be used for a range of purposes, such as the planning of rewarding policies and the forming of expectations about the future. On the other hand, it makes every part of the algorithm dependent on the proper encoding of the latent space, in contrast to the DQN, which did not require a state representation to achieve the same performance. However, we expect our approach to improve relative to DQN in more complex environments where the world state encoding can play a more relevant role.

## 5 Conclusion

We described a dAIF model that tackles partially observable state problems, i.e., it learns the policy from high-dimensional inputs, such as images. Results show that in the MDP case the dAIF agent outperforms the DQN agent, and performs more consistently between runs. Both agents were also shown to be capable of learning (less) successful policies in the POMDP version, where the performance of the dAIF and DQN models was found to be comparable. Further work will focus on validating the model on a broader range of more complex problems.

## Appendix

Deep Q Agent MDP

| Networks & params. | Description |
|---|---|
| | Number of states. |
| | Number of actions. |
| Q-value network | Fully connected network using an Adam optimizer with a learning rate of , of the form: . |
| | Discount factor set to 0.98 |
| Memory size | Maximum number of transitions that can be stored in the memory buffer: 65,536 |
| Mini-batch size | 32 |
| Target network freeze period | The number of time steps the target network's parameters are frozen, until they are updated with the parameters of the value network: 25 |

Deep Q Agent POMDP

| Networks & params. | Description |
|---|---|
| | Number of actions. |
| Q-value network | Consists of three 3D convolutional layers (each followed by batch normalization and a rectified linear unit) with kernels and strides with respectively 3, 16 and 32 input channels, ending with 32 output channels. The output is fed to a fully connected layer which leads to a fully connected layer. Uses an Adam optimizer with the learning rate set to . |
| | Discount factor set to 0.99 |
| Memory size | Maximum number of transitions that can be stored in the memory buffer: 65,536 |
| Mini-batch size | 32 |
| Target network freeze period | The number of time steps the target network's parameters are frozen, until they are updated with the parameters of the value network: 25 |

Deep Active Inference Agent MDP

| Networks & params. | Description |
|---|---|
| | Number of states. |
| | Number of actions. |
| Transition network | Fully connected network using an Adam optimizer with a learning rate of , of the form: . |
| Policy network | Fully connected network using an Adam optimizer with a learning rate of , of the form: , a softmax function is applied to the output. |
| EFE-value network | Fully connected network using an Adam optimizer with a learning rate of , of the form: . |
| $\gamma$ | Precision parameter set to 1.0 |
| $\beta$ | Discount factor set to 0.99 |
| Memory size | Maximum number of transitions that can be stored in the memory buffer: 65,536 |
| Mini-batch size | 32 |
| Target network freeze period | The number of time steps the target network's parameters are frozen, until they are updated with the parameters of the value network: 25 |

Deep Active Inference Agent POMDP

| Networks & params. | Description |
|---|---|
| | Size of the VAE latent space, here set to 32. |
| | Number of actions. |
| Encoder-network | Consists of three 3D convolutional layers (each followed by batch normalization and a rectified linear unit) with kernels and strides with respectively 3, 16 and 32 input channels, ending with 32 output channels. The output is fed to a fully connected layer which splits to two additional fully connected layers. Uses an Adam optimizer with the learning rate set to . |
| Decoder-network | Consists of a fully connected layer leading to a fully connected layer leading to three 3D transposed convolutional layers (each followed by batch normalization and a rectified linear unit) with kernels and strides with respectively 32, 16 and 3 input channels, ending with 3 output channels. Uses an Adam optimizer with the learning rate set to . |
| Transition-network | Fully connected network using an Adam optimizer with a learning rate of , of the form: . |
| Policy-network | Fully connected network using an Adam optimizer with a learning rate of , of the form: , a softmax function is applied to the output. |
| EFE-value-network | Fully connected network using an Adam optimizer with a learning rate of , of the form: . |
| $\gamma$ | Precision parameter set to 12.0 |
| $\beta$ | Discount factor set to 0.99 |
| | A constant that is multiplied with the VAE loss to take it to the same scale as the rest of the VFE terms, set to |
| Memory size | Maximum number of transitions that can be stored in the memory buffer: 65,536 |
| Mini-batch size | 32 |
| Target network freeze period | The number of time steps the target network's parameters are frozen, until they are updated with the parameters of the value network: 25 |