Active Inference for Integrated State-Estimation, Control, and Learning

by Mohamed Baioumy, et al.
University of Oxford

This work presents an approach for control, state-estimation and learning of model (hyper)parameters for robotic manipulators. It is based on the active inference framework, prominent in computational neuroscience as a theory of the brain, where behaviour arises from minimizing variational free-energy. The robotic manipulator shows adaptive and robust behaviour compared to state-of-the-art methods. Additionally, we show the exact relationship to classic methods such as PID control. Finally, we show that by learning a temporal parameter and model variances, our approach can deal with unmodelled dynamics, damp oscillations, and remain robust against disturbances and poor initial parameters. The approach is validated on the ‘Franka Emika Panda’ 7 DoF manipulator.




I Introduction

Since it is infeasible to model all time-varying dynamics and disturbances a priori, it is crucial that intelligent robotic systems are able to adapt to unmodeled dynamics and model uncertainties. This is necessary for many applications, such as aerial vehicles encountering unpredicted wind dynamics [sierra2019wind], manipulators handling objects of unknown weight [wei2018adaptive] and autonomous vehicles on slippery road surfaces [arifin2019lateral]. There exists a variety of methods to deal with such problems. For example, machine learning methods can be used either to learn inverse dynamics or to tune conventional controllers. These require large amounts of training data and several iterations for learning [nguyen2009local, ledezma2017first]. On the other hand, approaches from adaptive control such as Model Reference Adaptive Control suffer from scalability issues with the number of degrees of freedom [zhang2017review, pezzato2019novel].

Humans are skilled in dealing with these problems and recent work in robotics has taken inspiration from ‘active inference’ [friston2017active], a neuroscientific theory of the brain and behaviour. Active inference provides a framework for understanding decision-making of biological agents. Under the active inference framework, optimal behavior arises from minimising variational free-energy: a measure of the fit between an internal model and (past) sensory observations. Additionally, agents act to fulfill prior beliefs about preferred future observations [FEProughbrainguide]. This framework has been employed to explain and simulate a wide range of behaviors including abstract rule learning [pezzulo2018hierarchical], planning and navigation [kaplan2018planning].

Fig. 1: The Franka Emika Panda 7 DoF manipulator real robot (left) and simulated robot in Gazebo (right).

In [pezzato2019novel], the authors use the active inference framework for joint space control of a robotic manipulator. The active inference controller (AIC) outperforms the state-of-the-art Model Reference Adaptive Control (MRAC) in tasks requiring adaptive behaviour. The active inference approach of [pezzato2019novel] avoids the scalability issues of MRAC by requiring only a fixed number of parameters. However, this approach has a few limitations, including sensitivity to its initial parameters. When one of the model parameters or variances is changed slightly, the system suffers from severe oscillations and never settles at its target state.

The contributions of this paper are: 1) We present a novel formulation of active inference which includes the introduction of a temporal parameter $\tau$ that affects the controller performance. 2) We show that under this framework, we are able to derive approaches to automatically update the model hyperparameters, including the variances as well as the introduced parameter $\tau$. This ensures the controller is tuned properly during run-time.

Our approach provides a number of benefits: learning happens during execution-time, it has a fixed number of parameters to be specified regardless of the DoF (number of degrees of freedom), it does not require an accurate model of the dynamics, and it is robust to poor initial settings.

Additionally, we highlight theoretical results showing that if our introduced parameter approaches zero ($\tau \to 0$), the approach reduces to a classic PID controller. When $\tau$ approaches infinity ($\tau \to \infty$), the approach reduces to a pure estimator. As a running example throughout the paper, the mass-spring-damper system [dahleh1998theory] is used. However, we validate the approach on a ‘Franka Emika Panda’ 7 DoF manipulator in the results section. Our approach is tested against previous work while carrying different payloads and with different hyperparameter settings.

II Related work

There exists a variety of methods that allow robotic systems to show adaptive and robust behaviour. Some work considers conventional controllers coupled with artificial intelligence methods. For example, in [yu2010study, khodadadi2018self] PID controllers are tuned with fuzzy logic. Other approaches rely on estimating accurate inverse dynamic models using machine learning methods [nguyen2009local]. These approaches require large amounts of training data and several iterations for learning [ledezma2017first], are hard to generalize [jamone2014incremental, kappler2017new], and require expert definition (for instance, for the best topology of a neural network).


Existing adaptive control approaches include self-tuning adaptive control, which represents the robot as a linear discrete-time model and estimates the unknown parameters online, substituting them in the control law [walters1982application]. Another method is Model Reference Adaptive Control (MRAC) [zhang2017review], which finds a control signal to be applied to the robot actuators. This signal forces the system to behave as specified by a chosen reference model; however, these approaches suffer from scalability issues. For instance, [pezzato2019novel] reports that in order to use the MRAC on a real 7 DoF robot previously tuned in simulation, a severe re-tuning of 63 parameters had to be performed.

The framework of active inference has the potential to facilitate building intelligent, robust and adaptive robotic systems. However, there are only a handful of attempts to use active inference in robotics and control. In [pio2016active], a PR2 robot, simulated in ROS, is controlled by open-loop active inference; however, the computational complexity made an online implementation infeasible. In [lanillos2018adaptive], the authors use free-energy minimization for adaptive body perception. In [oliver2019active], an implementation of active inference is presented with real hardware on a 3 DoF humanoid robot. It was capable of performing reaching behaviors with both arms and active head object tracking in the visual field with noisy observations. However, the control was performed using velocity commands rather than torque commands. This assumes that the low-level controllers reliably achieve the requested velocities.

The work on active inference described above mostly addressed simple tasks and was not compared to state-of-the-art methods. In [pezzato2019novel], a method for joint space control of robotic manipulators (using torque commands rather than velocity commands) is presented that outperforms the state-of-the-art MRAC in adaptability. Therefore, we use the work in [pezzato2019novel] as the baseline for our results.

III Active Inference framework

This section introduces active inference as a general framework and derives the key equations for the free-energy ($F$). The free-energy term, $F$, is used in later sections to achieve state-estimation, control and hyperparameter learning.

III-A Variational free energy

Active Inference considers an agent in a dynamic environment that receives an observation $y$ about a state $x$. Given a model of the agent’s world, Bayes’ rule can be used to infer the state as

$$p(x \mid y) = \frac{p(y \mid x)\,p(x)}{p(y)}. \tag{1}$$
Performing such a calculation is computationally expensive, especially if the posterior is a non-standard distribution. The normalization term $p(y) = \int p(y \mid x)\,p(x)\,dx$ involves calculating an integral, making calculations of all but trivial examples infeasible. Instead, the agent can approximate the posterior distribution with a ‘variational distribution’ $Q(x)$ over states, which we can define to have a simpler form (such as a Gaussian). The goal is then to minimize the difference between the two distributions. The mismatch between the two distributions can be computed using the KL-divergence [fox2012tutorial]:

$$KL\big(Q(x)\,\|\,p(x \mid y)\big) = \int Q(x)\,\ln\frac{Q(x)}{p(x \mid y)}\,dx = F + \ln p(y), \qquad F = \int Q(x)\,\ln\frac{Q(x)}{p(x, y)}\,dx. \tag{2}$$

The quantity $F$ is referred to as the (variational) free-energy. Since $\ln p(y)$ does not depend on $Q$, minimizing $F$ minimizes the KL-divergence. The negative of $F$ is often referred to as the evidence lower bound (ELBO) in the Machine Learning community.
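The relation between the KL-divergence and the free-energy can be checked numerically on a toy problem. The sketch below is illustrative only: it assumes a hypothetical three-state discrete model (the `prior` and `likelihood` values are made up for the example), replaces the integral by a sum, and checks that the KL-divergence and $F$ differ by exactly $\ln p(y)$, so that $F$ is smallest when $Q$ equals the posterior.

```python
import numpy as np

# Hypothetical toy model: three discrete states and one fixed observation y.
prior = np.array([0.5, 0.3, 0.2])        # p(x)
likelihood = np.array([0.9, 0.2, 0.1])   # p(y | x) for the observed y

joint = likelihood * prior               # p(x, y)
evidence = joint.sum()                   # p(y), the normalization term
posterior = joint / evidence             # p(x | y) from Bayes' rule


def free_energy(q):
    """Variational free-energy F = sum_x Q(x) ln(Q(x) / p(x, y))."""
    return float(np.sum(q * np.log(q / joint)))


def kl_to_posterior(q):
    """KL(Q || p(x | y)) for a discrete variational distribution Q."""
    return float(np.sum(q * np.log(q / posterior)))


q = np.array([0.6, 0.3, 0.1])            # an arbitrary variational distribution
gap = kl_to_posterior(q) - free_energy(q)  # should equal ln p(y)
```

At `q = posterior` the KL-divergence vanishes and $F$ equals $-\ln p(y)$, which is the smallest value $F$ can take in this toy model.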

Maximizing the ELBO (equivalently, minimizing $F$ and thus the KL-divergence) is common in variational inference, a method for approximating probability densities [fox2012tutorial]. Active inference, on the other hand, is a framework that utilizes variational inference to explain the behaviour of biological agents. If we choose $Q(x)$ to be a Gaussian distribution with mean $\mu$, and utilize the Laplace approximation [friston2007variational], the free-energy expression simplifies to:

$$F \approx -\ln p(\mu, y) + C. \tag{3}$$
Now the expression for variational free-energy is solely dependent on one parameter, $\mu$, which is referred to as the ‘belief state’ or simply ‘belief’. The objective is to find the $\mu$ which minimizes $F$. This results in the agent finding the best estimate of its state. For a robotic manipulator, the set of observations ($y$) and beliefs ($\mu$) are vectors whose length depends on the number of degrees of freedom.

III-B Generalized motions

The state of the robot is given by its joint positions and velocities, $\tilde{\mu} = [\mu, \mu']$. In the Active Inference literature, the state given by position and higher derivatives is also known as generalised coordinates of motion [buckley2017free]. We also consider observations given by joint encoders, $\tilde{y} = [y, y']$, which are the main source of information about the state.

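III-C Observation model and state transition model

Taking generalized motions into account, the joint probability from Equation (3) can be written as:

$$p(\tilde{y}, \tilde{\mu}) = p(y \mid \mu)\,p(y' \mid \mu')\,p(\mu' \mid \mu)\,p(\mu'' \mid \mu'). \tag{4}$$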
where $p(y \mid \mu)$ is the probability of receiving an observation $y$ while in (belief) state $\mu$, and $p(\mu' \mid \mu)$ is the state transition model (also referred to as the dynamic model or the generative model). The state transition model predicts the state evolution given the current state. These distributions are assumed Gaussian according to:

$$p(y \mid \mu) = \mathcal{N}(g(\mu), \sigma_y), \quad p(y' \mid \mu') = \mathcal{N}(g'(\mu'), \sigma_{y'}), \quad p(\mu' \mid \mu) = \mathcal{N}(f(\mu), \sigma_\mu), \quad p(\mu'' \mid \mu') = \mathcal{N}(f'(\mu'), \sigma_{\mu'}), \tag{5}$$

where the functions $g(\mu)$ and $g'(\mu')$ represent a mapping between observations and states. For many applications in robotics the state is directly observable. For instance, in the context of a robotic manipulator, the state consists of the positions and velocities of all joints and the manipulator is provided with position and velocity encoders. Thus we can assume $g(\mu) = \mu$ and $g'(\mu') = \mu'$. The functions $f(\mu)$ and $f'(\mu')$ represent the evolution of the belief state over time. This encodes the agent’s preference over future states (in this case the preferred future state is the target state, $\mu_d$). In our case $f(\mu) = (\mu_d - \mu)\tau^{-1}$ and $f'(\mu') = -\mu'\tau^{-1}$, where $\mu_d$ is the desired state and $\tau$ a time scale (explained further in Section IV-D). In [pezzato2019novel], $\tau$ is set to one by default, but we will show the limitations of that choice.

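Now that all the terms have been defined, $F$ can be expanded to:

$$\begin{aligned} F = \tfrac{1}{2}\Big[\, &\sigma_y^{-1}\varepsilon_y^2 + \sigma_{y'}^{-1}\varepsilon_{y'}^2 \\ &+ \sigma_\mu^{-1}\varepsilon_\mu^2 + \sigma_{\mu'}^{-1}\varepsilon_{\mu'}^2 \\ &+ \ln\big(\sigma_y\,\sigma_{y'}\,\sigma_\mu\,\sigma_{\mu'}\big) \Big] + C, \end{aligned} \tag{6}$$

where $\varepsilon_y = y - \mu$, $\varepsilon_{y'} = y' - \mu'$, $\varepsilon_\mu = \mu' - f(\mu)$ and $\varepsilon_{\mu'} = \mu'' - f'(\mu')$, and $C$ refers to constant terms.

Equation (6) differs from the work presented in [pezzato2019novel] in the third line (the logarithmic terms in the variances, which are not included there). The importance of this difference is highlighted in Section V-A.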

IV State-estimation and Control

We now introduce how to perform state-estimation and control by minimizing $F$. We show how the estimation step uses the observations to refine its (state) estimate and then biases that estimate towards the goal state. The control step then steers the system from its observed state $\tilde{y}$ to its estimated state $\tilde{\mu}$. This can be considered an estimator coupled with a moving-target PID. In Section IV-E we show that if $\tau \to 0$, the approach reduces to a classic PID controller. Additionally, if $\tau \to \infty$, the approach reduces to a pure estimator.

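IV-A State estimation by minimizing free-energy

Estimating the state of our system is achieved by finding the value of $\tilde{\mu}$ that minimizes $F$. If we are able to compute the gradient of $F$ with respect to $\tilde{\mu}$, gradient descent is a simple way to accomplish that:

$$\dot{\tilde{\mu}} = D\tilde{\mu} - \kappa_\mu \frac{\partial F}{\partial \tilde{\mu}}. \tag{7}$$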
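where $\kappa_\mu$ is a tuning parameter and $D$ is a temporal derivative operator. Using Equation 7, the agent takes one step of gradient descent at every time-step. In this case the equation expands to:

$$\begin{aligned} \dot{\mu} &= \mu' + \kappa_\mu\big[\sigma_y^{-1}\varepsilon_y - \sigma_\mu^{-1}\tau^{-1}\varepsilon_\mu\big], \\ \dot{\mu}' &= \mu'' + \kappa_\mu\big[\sigma_{y'}^{-1}\varepsilon_{y'} - \sigma_\mu^{-1}\varepsilon_\mu - \sigma_{\mu'}^{-1}\tau^{-1}\varepsilon_{\mu'}\big], \end{aligned} \tag{8}$$

with $\varepsilon_y = y - \mu$, $\varepsilon_{y'} = y' - \mu'$, $\varepsilon_\mu = \mu' + (\mu - \mu_d)\tau^{-1}$ and $\varepsilon_{\mu'} = \mu'' + \mu'\tau^{-1}$.

The first equation states that the belief is refined using the term $\sigma_y^{-1}\varepsilon_y$, which moves our new estimate towards the value just observed. Additionally, the term $\sigma_\mu^{-1}\tau^{-1}\varepsilon_\mu$ shifts the belief towards the target ($\mu_d$). Essentially, this ‘biases’ the estimate towards preferred future states (the preferred future observation in this case is the target state $\mu_d$). The degree to which the system is biased (rather than refined using the observation) depends on the values of $\tau$ and $\sigma_\mu$.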
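A minimal single-joint sketch of this estimation step is given below. It is not the authors' implementation: the notation ($\varepsilon_y = y - \mu$, $\varepsilon_{y'} = y' - \mu'$, $\varepsilon_\mu = \mu' + (\mu - \mu_d)\tau^{-1}$, $\varepsilon_{\mu'} = \mu'' + \mu'\tau^{-1}$), the scalar free-energy with logarithmic variance terms, the default values in the hypothetical `Params` container, and the choice $\mu'' = 0$ are all assumptions made for illustration. The gradients are written out by hand so that they can be compared against a finite-difference derivative of `free_energy`.

```python
import math
from dataclasses import dataclass


@dataclass
class Params:
    """Hypothetical hyperparameter container (values are placeholders)."""
    sigma_y: float = 1.0
    sigma_yp: float = 1.0
    sigma_mu: float = 1.0
    sigma_mup: float = 1.0
    tau_inv: float = 1.0
    kappa_mu: float = 10.0


def errors(mu, mu_p, y, y_p, mu_d, p, mu_pp=0.0):
    """Prediction errors for observations and for the assumed state evolution."""
    eps_y = y - mu
    eps_yp = y_p - mu_p
    eps_mu = mu_p + (mu - mu_d) * p.tau_inv
    eps_mup = mu_pp + mu_p * p.tau_inv
    return eps_y, eps_yp, eps_mu, eps_mup


def free_energy(mu, mu_p, y, y_p, mu_d, p, mu_pp=0.0):
    """Scalar free-energy under the Gaussian assumptions (constants dropped)."""
    eps_y, eps_yp, eps_mu, eps_mup = errors(mu, mu_p, y, y_p, mu_d, p, mu_pp)
    quad = (eps_y**2 / p.sigma_y + eps_yp**2 / p.sigma_yp
            + eps_mu**2 / p.sigma_mu + eps_mup**2 / p.sigma_mup)
    logs = math.log(p.sigma_y * p.sigma_yp * p.sigma_mu * p.sigma_mup)
    return 0.5 * (quad + logs)


def belief_gradients(mu, mu_p, y, y_p, mu_d, p, mu_pp=0.0):
    """Analytic dF/dmu and dF/dmu' for the free-energy above."""
    eps_y, eps_yp, eps_mu, eps_mup = errors(mu, mu_p, y, y_p, mu_d, p, mu_pp)
    dF_dmu = -eps_y / p.sigma_y + p.tau_inv * eps_mu / p.sigma_mu
    dF_dmup = (-eps_yp / p.sigma_yp + eps_mu / p.sigma_mu
               + p.tau_inv * eps_mup / p.sigma_mup)
    return dF_dmu, dF_dmup


def belief_step(mu, mu_p, y, y_p, mu_d, p, dt=1e-3, mu_pp=0.0):
    """One explicit Euler step of gradient descent on F for the beliefs."""
    dF_dmu, dF_dmup = belief_gradients(mu, mu_p, y, y_p, mu_d, p, mu_pp)
    mu_new = mu + dt * (mu_p - p.kappa_mu * dF_dmu)
    mu_p_new = mu_p + dt * (mu_pp - p.kappa_mu * dF_dmup)
    return mu_new, mu_p_new
```

With `tau_inv = 0` the bias term disappears and `belief_step` only pulls the belief towards the observation, which matches the pure-estimator limit discussed in the text.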

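IV-B Control by minimizing free-energy

Similar to state-estimation, to find the control action $a$ which minimizes $F$, gradient descent is used:

$$\dot{a} = -\kappa_a \frac{\partial F}{\partial a} = -\kappa_a \frac{\partial \tilde{y}}{\partial a}\frac{\partial F}{\partial \tilde{y}}, \tag{9}$$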
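where $\kappa_a$ is a tuning parameter. The term $\partial \tilde{y}/\partial a$ is assumed linear, and equal to the identity matrix (multiplied by a constant), similar to existing work [pezzato2019novel, oliver2019active]. Actions are then computed as:

$$\dot{a} = -\kappa_a\big[\sigma_y^{-1}(y - \mu) + \sigma_{y'}^{-1}(y' - \mu')\big]. \tag{10}$$

This controller essentially steers the system from its observed state $\tilde{y}$ to its estimated state $\tilde{\mu}$.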

Note that the current control law does not contain any information about the dynamical system; it is thus a reactive controller. The control law only requires $\tilde{y}$ and $\tilde{\mu}$, the latter being biased towards the future desired state ($\mu_d$). This controller thus operates in the presence of unmodelled dynamics, similar to a PID controller.

IV-C Simultaneous state-estimation and control

The presented approach performs state estimation and control simultaneously. The estimation and control steps are dependent: the estimation step refines the belief using the observation $\tilde{y}$ and biases the belief towards the target $\mu_d$. The controller then steers the system from the observation $\tilde{y}$ to the refined, then biased, estimated state $\tilde{\mu}$. If and are larger, the estimate is biased more towards the target $\mu_d$.

To illustrate this, consider the mass-spring-damper system. It is given by the equation $\ddot{x} = a - kx - c\dot{x}$, where $x$ is the position of the mass, $a$ the control action, $k$ the spring constant (set to ), $c$ the damper coefficient (set to ) and the system has unit mass. It is simulated with initial conditions , and . Equations (8) are used separately to perform state estimation. To challenge the system, the initial beliefs are inaccurate ( and ). The system is simulated for different values of $\tau^{-1}$ and the results are presented in Figure 2. Code for the simulations is available in the accompanying GitHub repository.

Fig. 2: Separately performing state-estimation for different values of $\tau^{-1}$. Higher values of $\tau^{-1}$ give more bias towards the target.

It is clear that higher values of $\tau^{-1}$ give more bias towards the target. For (green line), the estimate is close to the target (black dashed line) and far away from the actual position (blue dashed line). In contrast, when setting (red line), the estimate follows the real trajectory more closely (not perfectly, since the observations are noisy and the trajectory is highly non-linear). If $\tau^{-1} \to 0$, the estimation step reduces to a pure estimator, which would follow the trajectory without any bias towards the target.

Enabling control steers the system to its target. The value of $\tau^{-1}$ in this case determines how aggressive the controller is. Since larger values move the estimate more towards the target, the difference between the estimate and the observation is larger and thus the controller is more aggressive. An illustration for different values of $\tau^{-1}$ is shown in Figure 3.

Fig. 3: Control for different values of $\tau^{-1}$. Higher values of $\tau^{-1}$ provide more bias towards the target and thus more aggressive control, causing overshoot and oscillations.
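The closed loop of estimation and control on the mass-spring-damper can be sketched in a few lines. This is an illustrative simulation, not the code from the authors' repository: the plant constants `k` and `c`, the gains `kappa_mu` and `kappa_a`, unit variances, the initial belief `mu0`, the choice $\mu'' = 0$, and explicit Euler integration are all assumptions chosen here so that the loop is well behaved. Observation noise is off by default (`noise_std=0.0`).

```python
import numpy as np


def simulate(mu_d=1.0, x0=0.0, mu0=0.5, t_end=20.0, dt=1e-3,
             k=1.0, c=2.0, tau_inv=1.0, kappa_mu=50.0, kappa_a=5.0,
             sigma_y=1.0, sigma_yp=1.0, sigma_mu=1.0, sigma_mup=1.0,
             noise_std=0.0, seed=0):
    """Simulate x'' = a - k x - c x' under free-energy estimation and control."""
    rng = np.random.default_rng(seed)
    x, v, a = x0, 0.0, 0.0          # plant position, velocity, control action
    mu, mu_p = mu0, 0.0             # beliefs about position and velocity
    steps = int(round(t_end / dt))
    xs = np.empty(steps)
    for i in range(steps):
        # Position and velocity are observed directly (optionally with noise).
        y = x + noise_std * rng.standard_normal()
        y_p = v + noise_std * rng.standard_normal()

        # Prediction errors (mu'' is taken as zero in this sketch).
        eps_y, eps_yp = y - mu, y_p - mu_p
        eps_mu = mu_p + (mu - mu_d) * tau_inv
        eps_mup = mu_p * tau_inv

        # Gradient descent on F for the beliefs.
        dF_dmu = -eps_y / sigma_y + tau_inv * eps_mu / sigma_mu
        dF_dmup = (-eps_yp / sigma_yp + eps_mu / sigma_mu
                   + tau_inv * eps_mup / sigma_mup)
        mu_dot = mu_p - kappa_mu * dF_dmu
        mu_p_dot = -kappa_mu * dF_dmup

        # Gradient descent on F for the action (d y / d a taken as identity).
        a_dot = -kappa_a * (eps_y / sigma_y + eps_yp / sigma_yp)

        # Explicit Euler update of plant, action and beliefs.
        acc = a - k * x - c * v
        x, v = x + dt * v, v + dt * acc
        a += dt * a_dot
        mu += dt * mu_dot
        mu_p += dt * mu_p_dot
        xs[i] = x
    return xs


positions = simulate()
```

Re-running `simulate` with larger `tau_inv` or with different variances is one way to explore the trade-off between response speed and oscillation described above; the specific behaviour will depend on the assumed constants.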

IV-D Understanding the temporal parameter

Recall that the generative model specified by Equations 4 and 5 includes the function $f$, which determines how the state evolves over time: $f(\mu) = (\mu_d - \mu)\tau^{-1}$.

The state is specified to evolve over time according to the difference between the current state $\mu$ and the target $\mu_d$: the derivative is evaluated as $(\mu_d - \mu)$ divided by a time scale $\tau$. The smaller $\tau$, the larger the derivative. If $\tau$ approaches zero ($\tau \to 0$), the value of $f(\mu)$ approaches $\infty$. As a result, the estimate is infinitely biased towards the target and $\mu \to \mu_d$.

IV-E Relationship to a classic PID Controller

A classic PID controller defines an error term $e = \mu_d - y$. The control law is then designed as

$$a = P\,e + I\int e\,dt + D\,\frac{de}{dt},$$

where ‘P’, ‘I’ and ‘D’ are tuning parameters.

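For the control law defined by active inference, our $(\mu - y)$ is similar to the error term. Additionally, as explained in the previous section, when $\tau \to 0$ then $\mu \to \mu_d$. Now the control law of active inference can be rewritten in terms of the error term as:

$$\dot{a} = \kappa_a\big[\sigma_y^{-1} e + \sigma_{y'}^{-1}\dot{e}\big], \quad\text{so that, up to an integration constant,}\quad a = \kappa_a\sigma_{y'}^{-1}\,e + \kappa_a\sigma_y^{-1}\int e\,dt.$$

This means that if $\tau \to 0$, the active inference controller is equivalent to a PI controller (PID with $D = 0$) with a ‘P’ gain of $\kappa_a\sigma_{y'}^{-1}$ and an ‘I’ gain of $\kappa_a\sigma_y^{-1}$. If one considers the generalized motions (from Section III-B) up to third order rather than second, the resulting control law would include a non-zero ‘D’ term.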
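The step from the active inference control law to the PI gains can be written out explicitly. The derivation below is a sketch under stated assumptions rather than a reproduction of the original equations: the control law is taken to be $\dot{a} = -\kappa_a[\sigma_y^{-1}(y-\mu) + \sigma_{y'}^{-1}(y'-\mu')]$, the error is $e = \mu_d - y$, the target $\mu_d$ is constant (so $\dot{e} = -y'$), and in the limit $\tau \to 0$ the beliefs satisfy $\mu \to \mu_d$ and $\mu' \to 0$.

```latex
\begin{align*}
\dot{a} &= -\kappa_a\left[\sigma_y^{-1}(y-\mu) + \sigma_{y'}^{-1}(y'-\mu')\right] \\
        &\xrightarrow{\;\tau \to 0\;} -\kappa_a\left[\sigma_y^{-1}(y-\mu_d) + \sigma_{y'}^{-1}\,y'\right]
         = \kappa_a\left[\sigma_y^{-1}\,e + \sigma_{y'}^{-1}\,\dot{e}\right], \\
a(t)    &= \underbrace{\kappa_a\sigma_{y'}^{-1}}_{P}\; e(t)
         + \underbrace{\kappa_a\sigma_y^{-1}}_{I}\int_0^t e(s)\,ds + \text{const}.
\end{align*}
```

Integrating once turns the velocity-error term into the proportional term and the position-error term into the integral term, which is why the variance on the velocity observation plays the role of the ‘P’ gain.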

The relationship to a pure estimator is straightforward. As previously mentioned, if $\tau \to \infty$, the estimation step reduces to a pure estimator; that is, the estimation step has zero bias towards the target. As Figure 2 has shown, for very small values of $\tau^{-1}$, the estimator follows the real position without bias.

V Learning hyperparameters as Active Inference

We have shown that state estimation and control can be performed using gradient descent on the free-energy $F$. The same applies to the hyperparameters. Estimating $\tau$ and the model variances is done using gradient descent on $F$.

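V-A Learning model variances

As illustrated in Section IV-E, the model variances $\sigma_y$ and $\sigma_{y'}$ can be considered as gains for the controller, similar to the ‘P’ and ‘I’ gains in a PID controller. Additionally, the values $\sigma_\mu$ and $\sigma_{\mu'}$ affect how much the estimation step biases the controller towards the desired position ($\tau$ also affects the bias towards the target).

We can update $\sigma_y$ and $\sigma_{y'}$ using gradient descent on $F$ as:

$$\dot{\sigma}_y = -\kappa_\sigma\frac{\partial F}{\partial \sigma_y}, \qquad \dot{\sigma}_{y'} = -\kappa_\sigma\frac{\partial F}{\partial \sigma_{y'}}, \tag{11}$$

with learning rate $\kappa_\sigma$.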
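The presented update rules have several practical issues. First, in any high-dimensional case, the variance would be a covariance matrix. Since most equations presented so far use the inverse, updating the covariance matrix using Equations 11 and then inverting it would be computationally expensive. A workaround is to simply update the inverse covariance matrix directly, sometimes referred to as the precision matrix or information matrix, as done in [2020UKRAS_baioumy]:

$$\frac{d}{dt}\sigma_y^{-1} = -\kappa_\sigma\frac{\partial F}{\partial \sigma_y^{-1}}, \qquad \frac{d}{dt}\sigma_{y'}^{-1} = -\kappa_\sigma\frac{\partial F}{\partial \sigma_{y'}^{-1}}, \tag{12}$$

with learning rate $\kappa_\sigma$.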
The second issue is that a variance needs to be positive and a covariance matrix needs to be positive semi-definite. However, the update rules from Equations 12 may violate these conditions. One way to avoid this problem is to set a positive lower bound on the variance (as suggested in [bogacz2017tutorial]). In the case of the covariance matrix, all diagonal elements have a positive lower bound and all non-diagonal elements are set to zero. Other workarounds are suggested in the discussion.
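A scalar sketch of a bounded precision update is shown below. It assumes that the part of the free-energy depending on one precision $\pi = \sigma^{-1}$ is $\tfrac{1}{2}(\pi\varepsilon^2 - \ln\pi)$, so that $\partial F/\partial\pi = \tfrac{1}{2}(\varepsilon^2 - \pi^{-1})$. The function names, the step size `rate` and the clipping bounds `pi_min` and `pi_max` are hypothetical; bounding the precision from above corresponds to the lower bound on the variance mentioned in the text, and the bound from below is added here only to keep the discrete update positive.

```python
def update_precision(pi, eps, rate, pi_min=1e-3, pi_max=1e3):
    """One gradient-descent step on F for a scalar precision, with clipping."""
    grad = 0.5 * (eps ** 2 - 1.0 / pi)      # dF/dpi for 0.5 * (pi * eps^2 - ln pi)
    return min(max(pi - rate * grad, pi_min), pi_max)


def run_precision_updates(errors, pi0=1.0, rate=1.0, pi_min=1e-3, pi_max=1e3):
    """Apply the update once per prediction error and return the final precision."""
    pi = pi0
    for eps in errors:
        pi = update_precision(pi, eps, rate, pi_min, pi_max)
    return pi


# With a constant prediction error the variance 1/pi settles at eps^2.
final_pi = run_precision_updates([0.5] * 2000)
```

For a diagonal precision matrix the same rule could be applied element-wise to the diagonal while the off-diagonal entries are kept at zero.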

We demonstrate this using fixed at 0.5 and will be varied. If is too high, the system suffers from oscillations and overshoot. However, if $\sigma_y$ and $\sigma_{y'}$ are updated during run-time, the controller shows improved behaviour. Results are shown in Figure 4.

The convergence of $\sigma_y$ occurs when $\partial F/\partial\sigma_y = 0$, that is, when $\sigma_y = \varepsilon_y^2$ with $\varepsilon_y = y - \mu$ (and analogously for the other variances). Since the observations change over time and have a certain level of noise, $\sigma_y$ converges to the expected value of $\varepsilon_y^2$. This does not necessarily happen upon reaching the target state.

Fig. 4: Effect of updating the control variances ($\sigma_y$ and $\sigma_{y'}$). When the value of is initialized at high values, the system oscillates. Updating $\sigma_y$ and $\sigma_{y'}$ essentially ‘tunes’ the controller and ensures robust performance.

Fig. 5: Effect of updating the value of $\tau^{-1}$ during operation. This figure shows that increasing either or results in overshoots and severe oscillations. Rather than tuning and , updating $\tau^{-1}$ can be sufficient.

Fig. 6: Results comparing the active inference controller with and without updating hyperparameters (, and ). This graph corresponds to the second column of Table I. Note the difference in the scale of the y axes.

V-B Learning the temporal parameter

Figure 3 showed the importance of choosing appropriate values for $\tau^{-1}$: if the value is too high, the controller suffers from overshoot and oscillations. On the other hand, a low value results in a slow response. Ideally, the value of $\tau^{-1}$ would be high at the start but decrease as the system reaches the target. This would essentially be the value that minimizes $F$, and it can be found using gradient descent on $F$ as:

$$\frac{d}{dt}\tau^{-1} = -\kappa_\tau\frac{\partial F}{\partial \tau^{-1}}, \tag{13}$$

with learning rate $\kappa_\tau$.

Note how the inverse $\tau^{-1}$ is updated rather than $\tau$ directly. Similar to the variances, all previous equations contain $\tau^{-1}$, and since inverting a matrix is computationally expensive, the inverse is directly updated. Additionally, $\tau^{-1}$ requires the definition of a lower bound. The optimization can result in $\tau^{-1}$ approaching zero, which means the controller converts to a pure estimator. In this work, $\tau^{-1}$ is set to have a minimum value of 0.5 on all diagonal elements and zero elsewhere.

Using Equation 13, the oscillations can be damped and the settling time improved, as shown in Figure 5. Note how updating only $\tau^{-1}$ is sufficient to eliminate the oscillations (no update of $\sigma_y$ or $\sigma_{y'}$ was used).

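The convergence of $\tau^{-1}$ happens when

$$\frac{\partial F}{\partial \tau^{-1}} = \sigma_\mu^{-1}\varepsilon_\mu\,(\mu - \mu_d) + \sigma_{\mu'}^{-1}\varepsilon_{\mu'}\,\mu' = 0,$$

with $\varepsilon_\mu = \mu' + (\mu - \mu_d)\tau^{-1}$ and $\varepsilon_{\mu'} = \mu'' + \mu'\tau^{-1}$.

This occurs when $\mu = \mu_d$ and $\mu' = 0$, which corresponds to the controller settling at its target position. The updates for $\tau^{-1}$ will thus retune the controller appropriately until the target is reached. This gives a preference for updating $\tau^{-1}$ rather than updating $\sigma_y$ or $\sigma_{y'}$ in most cases, since at the desired state $\tau^{-1}$ is not updated anymore, unlike $\sigma_y$ and $\sigma_{y'}$.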
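The update of the temporal parameter with its lower bound can be sketched for a single joint as follows. This is an illustration under assumptions, not the authors' code: the gradient is derived from the prediction errors $\varepsilon_\mu = \mu' + (\mu - \mu_d)\tau^{-1}$ and $\varepsilon_{\mu'} = \mu'' + \mu'\tau^{-1}$, the function names and the step size `rate` are hypothetical, and only the lower bound of 0.5 is taken from the text.

```python
def tau_inv_gradient(tau_inv, mu, mu_p, mu_d, sigma_mu=1.0, sigma_mup=1.0, mu_pp=0.0):
    """dF/d(tau^-1) for the two state-evolution error terms of the free-energy."""
    eps_mu = mu_p + (mu - mu_d) * tau_inv
    eps_mup = mu_pp + mu_p * tau_inv
    return eps_mu * (mu - mu_d) / sigma_mu + eps_mup * mu_p / sigma_mup


def update_tau_inv(tau_inv, mu, mu_p, mu_d, rate, lower_bound=0.5, **kwargs):
    """One gradient-descent step on tau^-1, clipped at the lower bound."""
    grad = tau_inv_gradient(tau_inv, mu, mu_p, mu_d, **kwargs)
    return max(tau_inv - rate * grad, lower_bound)


# Far from the target the gradient is positive, so tau^-1 is reduced.
grad_far = tau_inv_gradient(0.6, mu=0.0, mu_p=0.0, mu_d=1.0)
# At the target with zero velocity the gradient vanishes and tau^-1 stops changing.
grad_at_target = tau_inv_gradient(2.0, mu=1.0, mu_p=0.0, mu_d=1.0)
```

The second call illustrates the convergence condition stated above: once the belief sits on the target with zero velocity, further calls to `update_tau_inv` leave the parameter unchanged.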

VI Results on a robotic manipulator

This section evaluates the presented approach and uses the active inference controller (AIC) from [pezzato2019novel] as a benchmark since the authors have shown their work outperforms state-of-the-art MRAC. We show that our approach outperforms the AIC from [pezzato2019novel] for carrying different payloads, different initial parameters for the variances and different values of . A summary of the results is reported in this section; however, full results are posted along with the video demonstrations.

The AIC from [pezzato2019novel] achieves adaptive control without an explicit model of the system dynamics. However, it is sensitive to the initialization of its parameters. By slightly changing for instance, the system suffers from severe oscillations and never settles at its target state. Our approach overcomes this problem by updating the variances during run-time.

Consider the task of reaching a target starting from

where each element in these vectors corresponds to one of the 7 joints of the Panda manipulator (Figure 1). If the AIC from [pezzato2019novel] is tuned properly (, , and ), this results in satisfactory behaviour. In this case, refers to the $7 \times 7$ identity matrix. However, if we vary to other values ( and ), the performance gets considerably worse. In our approach, we update , and online to retune the controller. We ran the experiment of moving from to for several values of and recorded the Mean Absolute Error (MAE) for all joints in Table I.

The Mean Absolute Error (MAE) is defined as:
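A minimal sketch of how such an error can be computed from a recorded trajectory is given below. It assumes, as an illustration only, that the MAE is the absolute deviation between the measured joint positions and the target, averaged over all time steps and all joints; the function name and array layout are hypothetical.

```python
import numpy as np


def mean_absolute_error(trajectory, target):
    """Average |position - target| over time steps (rows) and joints (columns)."""
    trajectory = np.asarray(trajectory, dtype=float)   # shape (T, n_joints)
    target = np.asarray(target, dtype=float)           # shape (n_joints,)
    return float(np.mean(np.abs(trajectory - target)))


# Two time steps of a two-joint arm moving towards the target [1, 1].
example_mae = mean_absolute_error([[0.0, 0.0], [1.0, 1.0]], [1.0, 1.0])
```

Because the average runs over the whole trajectory, a controller that oscillates around the target accumulates error for as long as it fails to settle.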

When the AIC is properly tuned , the two cases have the same MAE. However, when , the controller suffers from severe oscillations and never settles around its target (visualized in Figure 6). The MAE increases to more than triple its value, while in the case of tuning the hyperparameters, the MAE actually decreases. This is due to the fact that increasing makes the controller more aggressive and, when tuned, it does not oscillate and also has a slightly faster response.

In a similar fashion, results for changing the value of are recorded in Table II. Again, the MAE is much lower when tuning the hyperparameters.

For the last experiment, the robot is supposed to carry varying payloads. We test three different masses: , and (max payload for the Panda arm). The MAE for these cases is recorded in Table III. Again, our approach outperforms the approach from previous work even when it is properly tuned.

The controller presented does not require a dynamic model and performs robustly on both the mass-spring-damper and an industrial manipulator. Additionally, the approach does not need any offline training. These benefits also apply to the approach presented in [pezzato2019novel]; however, when initial parameters are slightly altered, the performance of the presented approach is clearly superior. The presented approach damps oscillations robustly and updating ensures convergence to the target. Additionally, the presented approach performs better when carrying different payloads.

                            = 0.1    0.3      0.5
No updates                  0.028    0.088    0.118
Updating hyperparameters    0.028    0.025    0.032
TABLE I: Mean Absolute Error (MAE) for different values of in case of updating hyperparameters and no updates.

                            = 2      = 3
No updates                  0.091    0.123
Updating hyperparameters    0.025    0.032
TABLE II: Mean Absolute Error (MAE) for different values of in case of updating hyperparameters and no updates.

No updates                  0.024    0.029    0.027
Updating hyperparameters    0.020    0.020    0.021
TABLE III: Mean Absolute Error (MAE) for different payloads in case of updating hyperparameters and no updates.

VII Discussion and future work

In Section III, the generative model was selected to have the form $p(\tilde{y}, \tilde{\mu})$, which does not explicitly include any notion of an action. Thus, to choose the action that minimizes free-energy, the chain rule was utilized (Equation 9). Alternatively, the actions could be explicitly added to the generative model. Additionally, the presented models could be efficiently solved as factor graphs [loeliger2007factor, vanderbroeck2019active].

To improve the estimation, several modifications are possible: a prior factor could be introduced or a sliding window could be considered based on the last steps. Additionally, the current method only returns the control actions for the next timestep and thus planning ahead is not possible. Solving this can be achieved by a forward sliding window (receding horizon) similar to model predictive control [kouvaritakis2016model].

For updating the variances, a lower bound was set since the value has to be strictly positive (or positive semi-definite for a matrix). An alternative would be to put the variable through a mapping to a strictly positive function. For instance, and the update rules would choose to minimize .
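One common choice for such a mapping is the exponential, sketched below for a single scalar variance. This is an assumption made for illustration and not necessarily the mapping the authors have in mind: the variance is written as $\sigma = e^{\zeta}$, the relevant part of the free-energy is taken to be $\tfrac{1}{2}(\varepsilon^2/\sigma + \ln\sigma)$, and gradient descent is carried out on the unconstrained variable $\zeta$, for which $\partial F/\partial\zeta = \tfrac{1}{2}(1 - \varepsilon^2 e^{-\zeta})$. The function name and step size are hypothetical.

```python
import math


def update_log_variance(zeta, eps, rate):
    """One gradient-descent step on F with respect to zeta, where sigma = exp(zeta)."""
    sigma = math.exp(zeta)                   # always strictly positive
    grad = 0.5 * (1.0 - eps ** 2 / sigma)    # dF/dzeta for 0.5 * (eps^2 / sigma + ln sigma)
    return zeta - rate * grad


zeta = 0.0                                   # corresponds to sigma = 1
for _ in range(200):
    zeta = update_log_variance(zeta, eps=0.5, rate=1.0)
sigma = math.exp(zeta)                       # settles at eps^2 for a constant error
```

No clipping is needed in this parameterization, since every real value of $\zeta$ maps to a valid positive variance.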

Finally, using the Laplace approximation allowed us to only optimize the mean of the variational distribution Q. The covariance was not computed or used. Future work should look into utilizing the covariances and using full variational inference rather than the Laplace approximation.

VIII Conclusions

In this paper, a method for state-estimation, control and learning of model (hyper)parameters is introduced based on minimizing free-energy. Online estimation of relevant quantities can be achieved with one step of gradient descent on the free-energy for each iteration of the controller. We showed that when a temporal parameter $\tau$ approaches zero, the approach reduces to a PID controller, and if $\tau$ approaches $\infty$, it reduces to a pure estimator. We then demonstrated the effectiveness of the framework on a 7 DoF robotic arm and showed adaptability and robustness, outperforming previous work by a large margin. Our approach provides a number of benefits: it does not require training data (or trials to learn), it has a fixed number of parameters to be specified regardless of the DoF, it does not require an accurate model of the dynamics (see Section IV-B), and it damps oscillations while being robust against poor initial settings (see Section VI).


The authors thank Matias Mattamala and Mees Vanderbroeck for helpful comments and feedback.