I Introduction
This survey presents the current work in active inference for robotics and artificial agents and discusses the benefits and challenges it must address if it is to become a practical and revolutionary unified mathematical framework for estimation, control, planning and learning.
Active inference (AIF) is a biologically plausible mathematical construct based on the free energy principle (FEP) proposed by Karl Friston [1]. This principle describes how living systems resist a natural tendency to disorder. Its backbone can be traced back to the work of Helmholtz on perception [2]. In the presence of uncertain stimuli, such as the Dallenbach illusion [3] depicted in Fig. 1a, our perceptual apparatus tends to fill in missing information by using prior knowledge, a process referred to as unconscious inference [2]. Fig. 1a shows a cow rotated 90 degrees clockwise that is impossible to see before this prior information is available and impossible to unsee afterwards. One implication of this theory is that the brain maintains an internal model, i.e., a generative model, of the causes of sensation, which is combined with the sensory stream to form a given percept. Algorithmically, it has been suggested that this is achieved by minimizing the difference (error) between top-down predictions of this internal model and incoming sensory data. Importantly, under this predictive coding framework [4], it has also been suggested that agents can act on the world to change the sensation so that it better fits the same internal model, thus providing a dual account of both perception and action [5, 6]. Phenomenal experience again supports this active component [7]. For example, when reaching an escalator (Fig. 1b), even if it is broken, we perceive it as moving, prepare our body to match the velocity of the stairs, and are often surprised. Once we realize that it is stopped, we adapt again to the new situation.
Hence, adaptive behaviour can be viewed as an active inference process in which the agent selects those actions that support the maximization of model evidence, or equivalently, the minimization of surprise. This can be performed by exploiting the internal model or generating exploratory behaviours that reduce the model entropy: i.e., reduce uncertainty by maximizing information gain. This work describes how these concepts can be applied to and enrich robotic systems.
I-A Overview
We formalize and describe AIF in the context of current challenges in robotics, where adaptation to uncertain, complex and changing environments plays a major role. We survey robotics applications of AIF and provide the appropriate mathematical and theoretical background, aiming to offer both a review and a technical reference. Table I summarises the most relevant works, organized into state estimation, control, planning (i.e., computing actions into the future), and high-level cognitive skills (e.g., self-other distinction).
| Research | Topic | Approach and implementation | References |
|---|---|---|---|
| Estimation | Linear systems with colored noise | Dynamic Expectation Maximization | [8], [9], [10], [11] |
| | Multisensory estimation and learning | Predictive coding | [12, 13] |
| | Localization | Laser-based continuous AIF | [14] |
| Control | Humanoid robots and manipulators | Continuous torque control. Low-dimensional input | [15], [16], [17], [18] |
| | | High-dimensional input with function learning (Deep AIF) | [19], [20], [21] |
| | Fault-tolerant systems | Threshold on the surprise, precision learning | [22], [23] |
| | Bio-inspired agents | Phototaxis, continuous AIF | [24] |
| Planning | Discrete-time stochastic control | AIF with rewards | [25], [26], [27], [28], [29], [30] |
| | | Discrete inference | [31], [32] |
| | | Deep AIF | [33], [34], [35], [36] |
| | Navigation | Deep AIF | [37], [38], [39], [40] |
| | | Recurrent spiking neural networks | [41] |
| Cognitive | Symbolic reasoning | AIF + behavior trees | [42] |
| | Human-robot interaction | Imitative interactions via visuo-proprioceptive sequences | [43], [44], [45], [46] |
| | Self/other distinction | Self-recognition. Robot mirror test using movement and visual cues | [47, 48] |
I-B Paper structure
Section II introduces the common ground and notation. State estimation, control, planning, learning, and hierarchical representations are mathematically described in Sec. III, Sec. IV, Sec. V, Sec. VI and Sec. VII, respectively. Section VIII details relevant robotic experiments that showcase the advantages of AIF approaches. Finally, Sec. IX describes the relation of AIF with other frameworks, such as classical control and reinforcement learning, and Sec. X discusses the benefits and challenges of making AIF a standard modelling technique for robotic systems.
II Active inference common ground
We introduce the standard equations and concepts from the AIF literature, and the notation used in this paper, framed for estimation and control of robotic systems.
AIF solves the dual problem of estimation and control by optimizing a single objective: a free energy bound [1, 5, 6]. This entails updating the internal state and generating the control actions that minimize the error in the predicted observations. Hence, AIF has the particularity that the generative model of actions is cast into a generative model of predicted observations or inputs. In contrast to other similar approaches, control actions are generated to make the world more predictable and less surprising. Figure 2b sketches the hierarchical AIF, where the lowest level is the sensory input. Estimation and control are solved through Bayesian inference, usually with stochastic variational approaches that transform the inference problem into an optimization problem. In AIF the internal state is conditionally independent of the external state (world). However, they can affect each other through sensory and action states [49]. An important consequence is that the robot encodes its intentions as preferences, and these preferences drive both the state estimation and the control.
To help new robotic researchers in AIF to get started, we provide a very distilled summary of works in Table II. It contains what the authors consider seminal papers that led to the current advancements in the state of the art with the focus on the control and robotics communities.
| Topic | References |
|---|---|
| General introduction and tutorial | [50, 5] |
| Derivations for robot control | [13, 17] |
| Relationship with classical control | [51, 18] |
| Relationship with optimal control | [52, 53] |
| Discrete active inference | [54] |
| RL and active inference | [55], [25] |
| State estimation | [56], [9] |
| Predictive processing | [57], [58] |
| Human-robot interaction | [45] |
| Neuroscientific foundations | [1], [59] |
For consistency, we will use the notation described in Table III. Figure 2b describes the AIF architecture with the sensory observations $\mathbf{y}$ and the latent states $\mathbf{x}$. Variables can be matrices or tensors. For instance, the sensory observation $\mathbf{y}$ may have the dimension of every sensor modality (e.g., visual and joint angles) and their higher-order derivatives; for rotational joints these are the angles, the joint velocities, accelerations, etc. (In the AIF terminology, this way of encoding observations and states is called generalized coordinates; see Appendix D for a detailed explanation.)
II-A From Bayesian Inference to the Free Energy Principle
We will start by designing an agent that does not have access to the world/body state but has to infer it from the sensor measurements. In terms of Bayesian inference, it infers the most probable state of the world $\mathbf{x}$ using imperfect noisy sensory observations $\mathbf{y}$. According to Bayes' rule,
$$p(\mathbf{x}|\mathbf{y}) = \frac{p(\mathbf{y}|\mathbf{x})\,p(\mathbf{x})}{p(\mathbf{y})} \tag{1}$$
The probability of a state given the observed data is encoded in the posterior probability $p(\mathbf{x}|\mathbf{y})$. The likelihood $p(\mathbf{y}|\mathbf{x})$ measures the compatibility of the sensory input with the state, while the prior probability $p(\mathbf{x})$ is the current belief about the state before receiving the observation $\mathbf{y}$. Finally, $p(\mathbf{y})$ is the marginal likelihood, which corresponds to the probability of observing $\mathbf{y}$ regardless of the state. The goal is to find the value of $\mathbf{x}$ which maximizes the posterior $p(\mathbf{x}|\mathbf{y})$.
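To make the roles of prior, likelihood and marginal likelihood concrete, the following minimal sketch applies Bayes' rule to a finite set of states. The gripper scenario, the state and observation names, and all probabilities are hypothetical values chosen for illustration; they do not come from the surveyed works.

```python
def posterior(prior, likelihood, y):
    """Bayes' rule over a finite set of states.

    prior[x]          -- p(x), belief before seeing the observation
    likelihood[x][y]  -- p(y | x), compatibility of observation y with state x
    """
    joint = {x: likelihood[x][y] * prior[x] for x in prior}   # p(y | x) p(x)
    evidence = sum(joint.values())                            # p(y), marginal likelihood
    return {x: joint[x] / evidence for x in joint}, evidence


# Hypothetical example: is the gripper holding an object, given a noisy touch sensor?
prior = {"holding": 0.5, "empty": 0.5}
likelihood = {
    "holding": {"contact": 0.9, "no_contact": 0.1},
    "empty": {"contact": 0.2, "no_contact": 0.8},
}
post, evidence = posterior(prior, likelihood, "contact")
map_state = max(post, key=post.get)   # most probable state given the observation
```

In this toy case the sum defining the marginal likelihood has only two terms. In continuous or high-dimensional state spaces the corresponding integral is what makes exact inference difficult.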
II-A1 Variational Free Energy
Now, we model the influence of the world on the agent as the tendency of the agent to find an equilibrium between its internal model and the external process. This is the core of the free energy principle [1], which in Bayesian terminology states that the agent maximizes model evidence by making the density $q(\mathbf{x})$, which describes its internal state, approximate the world posterior $p(\mathbf{x}|\mathbf{y})$. Besides the biological motivation, there is also a computational reason for introducing this variational density. The posterior is typically intractable and cannot be evaluated directly in most cases, particularly in continuous spaces. In the free energy principle, a variational Bayes approach is used to obtain a tractable solution (the variational inference approach is common in modern machine learning for approximating probability densities [60]). Instead of computing an exact posterior, it is approximated through optimization [61, 62]. This approach requires the auxiliary variational density $q(\mathbf{x})$. The idea is to minimize the Kullback-Leibler (KL) divergence between $q(\mathbf{x})$ and $p(\mathbf{x}|\mathbf{y})$, as the divergence is 0 when both distributions are the same.
$$KL[q(\mathbf{x})\,\|\,p(\mathbf{x}|\mathbf{y})] = \int q(\mathbf{x}) \ln \frac{q(\mathbf{x})}{p(\mathbf{x}|\mathbf{y})}\, d\mathbf{x} = \int q(\mathbf{x}) \ln \frac{q(\mathbf{x})}{p(\mathbf{x},\mathbf{y})}\, d\mathbf{x} + \ln p(\mathbf{y}) = F + \ln p(\mathbf{y}) \tag{2}$$
$F$ is defined as the variational free energy (VFE) and measures the divergence between the variational density $q(\mathbf{x})$ and the joint distribution (generative model) $p(\mathbf{x},\mathbf{y})$. In machine learning the negative VFE is also known as the Evidence Lower Bound (ELBO); see Appendix A for a demonstration. The VFE can be evaluated because it depends on $q(\mathbf{x})$ and the knowledge about the environment of the agent $p(\mathbf{x},\mathbf{y})$.
Interestingly, because the KL divergence is never negative, the VFE is an upper bound on the surprise: $F \geq -\ln p(\mathbf{y})$, which measures the atypicality of events quantified through the negative log probability of sensory data. Therefore, minimizing $F$ with respect to $q$ is equivalent to approximating the posterior density. In the ideal case (e.g., no noise), when the model is able to capture the real generative process, the KL divergence is zero and $F$ becomes the negative log marginal likelihood, i.e., the surprise. The key advantage of this formalism is that it reduces the intractable Bayesian inference problem given in Eq. 1 into an optimization problem. Crucially for AIF agents, the state and action are simultaneously inferred by optimizing $F$.
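The bound can be checked numerically on a finite state space, where the integral becomes a sum. The sketch below is illustrative only: the two-state model and its probabilities are hypothetical, and the discrete form is a simplifying assumption rather than the continuous formulation used in the rest of the paper.

```python
import math


def free_energy(q, prior, likelihood, y):
    """F = sum_x q(x) * ln( q(x) / p(x, y) ) for a finite state space."""
    return sum(
        q[x] * math.log(q[x] / (likelihood[x][y] * prior[x]))
        for x in q
        if q[x] > 0.0
    )


# Hypothetical two-state generative model.
prior = {"holding": 0.5, "empty": 0.5}
likelihood = {
    "holding": {"contact": 0.9, "no_contact": 0.1},
    "empty": {"contact": 0.2, "no_contact": 0.8},
}
y = "contact"

evidence = sum(likelihood[x][y] * prior[x] for x in prior)            # p(y)
surprise = -math.log(evidence)                                         # -ln p(y)
exact = {x: likelihood[x][y] * prior[x] / evidence for x in prior}     # p(x | y)

F_prior = free_energy(prior, prior, likelihood, y)   # q = prior: loose bound
F_exact = free_energy(exact, prior, likelihood, y)   # q = exact posterior: tight bound
```

With the variational density set to the prior, the free energy lies strictly above the surprise; with the exact posterior, the KL term vanishes and the two coincide.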
II-A2 Mean-field and Laplace approximation
We have not yet described the nature of the auxiliary density that is encoded by the internal state of the system. Here, for mathematical convenience and to reflect the majority of the active inference literature, we choose the variational density to be a factorization of random variables with a known form, i.e., Gaussian, and track the densities through their sufficient statistics, the mean and the variance (see Appendix B and C). Other forms of variational approximation are out of the scope of this paper. The VFE (introduced in Sec. II-A1) is by definition:
(3)
(4) 
The VFE under the mean-field and Laplace approximations simplifies to the following (note that the first term of Eq. (4) vanishes and the second term becomes Eq. (5); see Appendix C for a short explanation of the mean-field and Laplace approximations):
(5) 
where are the sufficient statistics of the factorized variational density that codifies the process state , and is the optimal variance that optimizes the variational free energy.
One of the advantages of using these approximations is that estimation and control become a quadratic optimization problem that explicitly minimizes the prediction error of the model. We will show this explicitly in the following sections.
II-B Generative models
We differentiate two functionals [63]: the generative process, which defines the real system in the environment that is responsible for data generation—usually referred to as ’the plant’ in control engineering— and the generative model, which describes the agent’s internal representation (approximation) of the generative process. The AIF agent’s behaviour is driven by the generative model. This model can be defined by an expert designer (operational specification) or learnt through interaction with the world.
The generative model of the system at instant given all past observations and states is defined as the joint distribution over states and observations:
(6) 
Where the transition model collapses to when , is the likelihood of the observations given the state (observation model).
We can also describe the generative model from the dynamical systems approach using statespace equations:
internal state dynamics  (7a)  
observation model  (7b) 
Both equations encode the generative model described in Eq. 6. Equation 7a describes the evolution of the internal state—using the prior information or desired preference—and Eq. 7b describes the generative model of the causes, usually simplified to the likelihood of the sensory output given the internal states. Here , are the estimated state and output, and and are the process and observation (Gaussian) noise. Finally,
is the time-derivative of the state vector (an explanation of this operation and of the generalized coordinates used in continuous AIF can be found in Appendix D).
The functions that describe the generative model can be explicit or function approximators, such as neural networks. According to the model type chosen, different behaviours of an agent can be achieved.
As a useful and particular example, in the case of a linear plant, the agent’s generative model can be described as:
where , and are the plant specific matrices and here is the control input.
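As an illustration of the linear case, the sketch below simulates a plant written in the standard state-space form $\dot{\mathbf{x}} = A\mathbf{x} + B u + \mathbf{w}$, $\mathbf{y} = C\mathbf{x} + \mathbf{z}$. This form, the mass-spring-damper matrices, the single scalar input, and the crude per-step additive noise are all assumptions made for the example.

```python
import random


def simulate_linear_plant(A, B, C, u, x0, dt=0.01, steps=1000, noise_std=0.0, seed=0):
    """Euler-integrate x' = A x + B u + w and read out y = C x + z (assumed form)."""
    rng = random.Random(seed)
    n = len(x0)
    x = list(x0)
    ys = []
    for _ in range(steps):
        # State derivative with a crude additive process-noise term.
        dx = [
            sum(A[i][j] * x[j] for j in range(n)) + B[i] * u + rng.gauss(0.0, noise_std)
            for i in range(n)
        ]
        x = [x[i] + dt * dx[i] for i in range(n)]
        # Scalar observation with additive observation noise.
        ys.append(sum(C[j] * x[j] for j in range(n)) + rng.gauss(0.0, noise_std))
    return x, ys


# Hypothetical mass-spring-damper: state = [position, velocity], input = constant force.
A = [[0.0, 1.0], [-4.0, -2.0]]
B = [0.0, 1.0]
C = [1.0, 0.0]   # only the position is measured
x_final, ys = simulate_linear_plant(A, B, C, u=2.0, x0=[0.0, 0.0], dt=0.01, steps=2000)
```

Without noise the position settles where the spring force balances the input, here at 0.5. Setting `noise_std` above zero gives the noisy observations an estimator would have to filter.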
II-C Perception and action
As stated at the beginning of Sec. II, AIF agents perceive and act in the environment by optimizing the same objective, i.e., the VFE [1, 13]. Perception involves the estimation of states and parameters (e.g., ), whereas action has a dual role of resolving uncertainty and obtaining new data which concords with the agent’s belief/intention. Action involves the computation of the control signal to act in the environment. Together, the optimization of VFE through perception and action drives the agent towards reducing the prediction error and producing better sensory predictions. Assuming that the variational density is described by the latent variable they solve the following equations:
(8a)  
(8b) 
Both equations are usually solved through gradient descent. It is important to highlight that the original works on active inference do not explicitly model the action in the generative model, as it is encoded within the observation. Alternatives to this formulation are left for the discussion section.
III State estimation
This section summarizes the mathematical formalization of the first of the four blocks treated in this paper: state estimation. We solve state estimation similarly to applying a Bayesian filter. This involves the estimation of two components: the mean estimate and the associated confidence (precision) of the estimate. Under the free energy principle framework, both can be estimated using the first two derivatives of the free energy. State inference is driven by the following differential equation:
(9) 
The VFE under the Laplace and mean-field approximations has a closed form and is defined as:
(10) 
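A minimal scalar sketch of state estimation as gradient descent on a Laplace-approximated free energy is given below. It is a deliberate simplification: it assumes a static state, an identity observation model, no generalized coordinates, and a free energy that reduces (up to constants) to precision-weighted squared prediction errors. The function names, step size and numerical values are hypothetical.

```python
def free_energy(mu, y, eta, var_y, var_p):
    """Laplace-approximated VFE up to constants: precision-weighted squared errors."""
    eps_y = y - mu      # sensory prediction error, assuming the observation model g(mu) = mu
    eps_p = mu - eta    # prior prediction error
    return 0.5 * (eps_y ** 2 / var_y + eps_p ** 2 / var_p)


def estimate_state(y, eta, var_y, var_p, step=0.05, iters=500):
    mu = eta                                              # start from the prior mean
    for _ in range(iters):
        grad = -(y - mu) / var_y + (mu - eta) / var_p     # first derivative dF/dmu
        mu -= step * grad                                 # gradient descent on F
    return mu


y, eta, var_y, var_p = 2.0, 0.0, 0.5, 1.0
mu_hat = estimate_state(y, eta, var_y, var_p)

# Closed-form minimiser for this linear-Gaussian case: a precision-weighted average.
mu_star = (y / var_y + eta / var_p) / (1.0 / var_y + 1.0 / var_p)
# Second derivative d2F/dmu2: the confidence (precision) of the estimate.
precision = 1.0 / var_y + 1.0 / var_p
```

The first derivative drives the mean estimate, while the second derivative gives the precision of the estimate, mirroring the two components named above.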
III-A State estimation minimizes the prediction error
III-B Confidence in estimation
The inverse variance or precision of the state estimate represents the agent's confidence in the estimation. This precision also minimizes the free energy, simplified as the negative curvature of the internal energy at the estimated states [65, 8]:
(12)
where the generalized precision matrices are computed as given in Appendix D.
AIF in continuous time (with generalized coordinates) enables us to track the evolution of the probability density of the trajectory of states [56], instead of just its point estimates, thereby endowing the method with an accurate state estimate. (When controlling a real robotic system, higher-order derivatives beyond accelerations are problematic when the noise smoothness properties are unknown.) The key advantage of this model is its ability to leverage the generalized coordinates to capture the noise smoothness in data [9]. However, the information contained in the higher-order noise derivatives is less valuable for the estimation process. Therefore, the noise precision matrices in the VFE expression should be designed such that the prediction errors coming from higher-order derivatives are weighted less than those coming from lower-order derivatives. This raises the importance of noise precision modeling [9]. One of the main challenges when using generalized coordinates for real robots is that the quality of estimation is highly sensitive to the assumed noise smoothness of the signal, especially for low noise smoothness [10, 11]. To resolve this issue, future research can focus on estimating the smoothness of the coloured noise. Although a few attempts have been made to provide theoretical guarantees of stability and convergence for estimation involving generalized coordinates [66, 9, 8], proofs of optimality guarantees remain largely open.
IV Control
Here we extend the previous section by introducing the control actions. That is, we describe how the robot both estimates its state and computes the control actions by filtering the information from previous observations according to its internal model dynamics. This can be seen as a low-level control where the actions correct for external and internal perturbations.
AIF robots use the same objective function for both estimation and control: the VFE. Thereby, the agent not only updates its state influenced by the world but can also apply actions to change the state of the world. This happens as an indirect consequence of actively sampling sensory data that is more in line with what is predicted by the internal model. Actions in active inference play a fundamental role in the approximation of the real distribution, acting also in the marginal likelihood by changing the real robot’s configuration and modifying the sensory input . The AIF robot is driven by:
(13a)  
(13b) 
In AIF the control actions steer the system towards minimizing the prediction errors in the VFE. This is again achieved through gradient descent. However, the VFE described is not a function of the control actions directly; instead, the actions influence it by modifying the sensory input. Thus, in Eq. 13b, we can differentiate the observations w.r.t. the actions using the chain rule. The step sizes of the two iterative updates are tuning parameters. The partial derivative of the sensory input with respect to the control action is a central point in active inference, and it has been tackled in different ways in past work such as [13, 17].
In contrast to other control frameworks, the desired state (goal or reference) is encoded in the internal state dynamics, Eq. (7a), as a preference. Thus, being away from the desired state increases the prediction error, which generates the control actions that steer the system towards the goal. The properties and limitations of this formulation, where the VFE does not depend explicitly on the actions, are analysed in the discussion, together with possible alternatives.
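The interplay between the two gradient descents can be sketched on a toy problem. The example below is a simplified illustration, not the formulation of any cited work: it assumes a scalar, noise-free, velocity-controlled plant, a static quadratic free energy $F = \frac{1}{2}[(y-\mu)^2/\sigma_y + (\mu - \text{goal})^2/\sigma_p]$ with the goal encoded as a prior preference, and a known unit gain for the derivative of the observation with respect to the action. All names and gains are hypothetical.

```python
def run_aif_controller(goal, x0=0.0, var_y=1.0, var_p=1.0,
                       k_mu=4.0, k_a=40.0, dt=0.01, steps=3000):
    """Scalar active inference loop on a velocity-controlled plant x' = a (assumed)."""
    x = x0        # true (hidden) plant state
    mu = x0       # agent's belief about the state
    a = 0.0       # control action (commanded velocity)
    dy_da = 1.0   # assumed sign and gain of the action's effect on the observation
    for _ in range(steps):
        y = x                                # noise-free observation, for clarity
        eps_y = (y - mu) / var_y             # weighted sensory prediction error
        eps_p = (mu - goal) / var_p          # weighted deviation from the preference
        dmu = -k_mu * (-eps_y + eps_p)       # perception: -k_mu * dF/dmu
        da = -k_a * eps_y * dy_da            # action: -k_a * dF/dy * dy/da (chain rule)
        # Forward-Euler update of plant, belief and action from the old values.
        x, mu, a = x + dt * a, mu + dt * dmu, a + dt * da
    return x, mu, a


x_end, mu_end, a_end = run_aif_controller(goal=1.0)
```

The belief is first pulled towards the preferred state, which creates a sensory prediction error; the action then changes the plant until the observation matches the belief. At convergence the plant state and the belief both sit at the goal and the action vanishes.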
V Planning
In previous sections, we considered AIF in the context of continuous-time systems, where it has been cast as a gradient descent on the instantaneous variational free energy (VFE). These sections showed that active inference can naturally describe estimation and control using a common probabilistic framework. In this section, we show how active inference can be used to explicitly model future states and observations in a plan-dependent manner, thereby enabling prospective planning. This capability is crucial for many tasks and environments where actions have delayed consequences, where simply doing what is best in the current moment is not necessarily what is best in the long run.
To this end, we first introduce the concept of a partially observable Markov decision process (POMDP), which we omitted in the introduction for clarity, and model the control actions as a discrete-time optimization problem. Second, we define the expected free energy of the future (EFE) and, finally, we describe the AIF approach to finding the optimal control plan.
V-A Discrete-time optimization under the Markov assumption
Modelling states and observations using generalized coordinates, as described in Sec. III, does allow the system to incorporate knowledge of the future, since generalized coordinates are a Taylor expansion in time around some given time point. However, to use generalized coordinates in practice, it is necessary to truncate them at some small order, meaning they can only model smooth and local changes over time. This is not sufficient for many complex control tasks where the consequences of action may occur far into the future. While there are alternative continuous-time methods to model future trajectories, these typically require substantially more mathematical machinery for limited algorithmic gain. Therefore, for reasons of mathematical tractability and computational (i.e., statistical) efficiency, AIF agents' plans are typically constructed in discrete time. Moreover, it is further assumed that the relationship between states, observations and time is described by a POMDP [67]. Intuitively, a POMDP assumes that there are discrete-time sequences of observations and hidden states. (Again, for reasons of conceptual simplicity and mathematical tractability, we only consider future trajectories up to some finite time horizon. Many of the results presented may apply in the infinite-horizon case, but require more nuanced mathematical machinery to demonstrate.) It is then assumed that, at any given instant, observations depend only on the hidden state at that instant. Similarly, it is assumed that the hidden state at the current time depends only on the hidden state at the previous time step (Markov assumption) and the action at the previous time step. Analogously to the continuous-time state-space AIF approach described in the previous sections, only the current observation and the previous state are needed to infer states and actions, instead of the entire history of states and observations, which improves the tractability of the control problem.
Additionally, these assumptions are not as restrictive in terms of generality as they first appear: since the states are hidden, they can, by definition, include whatever information is necessary to ensure that the state at some instant depends only on the state and action at the previous time step.
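The locality of truncated generalized coordinates can be seen in a small numerical sketch. The example treats the generalized coordinates simply as the coefficients of a truncated Taylor expansion; the test signal (a sine wave) and the truncation order are arbitrary choices made for illustration.

```python
import math


def taylor_predict(derivs, dt):
    """Predict a signal at time offset dt from its truncated generalized coordinates.

    derivs = [x, x', x'', ...] evaluated at the current time.
    """
    return sum(d * dt ** n / math.factorial(n) for n, d in enumerate(derivs))


# Generalized coordinates of sin(t) at t = 0, truncated at order 3: [0, 1, 0, -1].
derivs = [0.0, 1.0, 0.0, -1.0]
near_error = abs(taylor_predict(derivs, 0.1) - math.sin(0.1))   # local prediction
far_error = abs(taylor_predict(derivs, 3.0) - math.sin(3.0))    # distant prediction
```

The prediction is very accurate close to the expansion point and degrades quickly further away, which is why longer-horizon planning is usually formulated in discrete time instead.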
The dependency structure entailed by the planning over time for stochastic variables is as follows,
(14) 
This factorization of the joint distribution over observations, hidden states, and action trajectories will also mirror the structure of our active inference agent’s generative model.
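The Markov factorization can be made concrete by computing the probability of one trajectory as a product of an initial-state term, transition terms and likelihood terms. The sketch below conditions on a given action sequence, which is an assumption about how actions enter the factorization; the two-state "door" model and its probabilities are hypothetical.

```python
def trajectory_probability(states, observations, actions, p_init, p_trans, p_obs):
    """Probability of a state/observation trajectory given actions, under Markov assumptions.

    p_init[s]          -- probability of the initial state
    p_trans[a][s][s2]  -- probability of moving from s to s2 after action a
    p_obs[s][o]        -- probability of observing o in state s
    """
    prob = p_init[states[0]] * p_obs[states[0]][observations[0]]
    for t in range(1, len(states)):
        prob *= p_trans[actions[t - 1]][states[t - 1]][states[t]]   # Markov transition
        prob *= p_obs[states[t]][observations[t]]                   # o_t depends on s_t only
    return prob


# Hypothetical two-state world: a door that is "open" or "closed".
p_init = {"closed": 0.8, "open": 0.2}
p_trans = {
    "push": {"closed": {"closed": 0.3, "open": 0.7}, "open": {"closed": 0.0, "open": 1.0}},
    "wait": {"closed": {"closed": 1.0, "open": 0.0}, "open": {"closed": 0.1, "open": 0.9}},
}
p_obs = {"closed": {"dark": 0.9, "light": 0.1}, "open": {"dark": 0.2, "light": 0.8}}

p = trajectory_probability(
    states=["closed", "open"], observations=["dark", "light"], actions=["push"],
    p_init=p_init, p_trans=p_trans, p_obs=p_obs,
)
```

Because every factor only looks one step back, the cost of evaluating a trajectory grows linearly with its length.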
V-B Expected Free Energy
To augment active inference for this scenario we introduce the concept of the expected free energy (EFE) [68, 69, 70]. The EFE quantifies the average free energy of a plan-conditioned trajectory of states and observations, rather than the instantaneous free energy. As with the instantaneous VFE, agents are then mandated to minimize this quantity through both perception and action. An important terminological note concerns the word "plan" and its relation to the word "policy". A plan is a sequence of actions. This differs from the policy used in reinforcement learning, which refers to a parameterised action distribution conditioned on the current state. Interestingly, these control plans naturally confer active inference agents with both reward-seeking and exploratory behaviours.
The EFE provides a measure quantifying the notion of the free energy expected over future trajectories. In effect, the EFE computes the average free energy of a trajectory, taking into account the fact that, because future observations are unknown, they must be considered as random variables to be inferred, given a specific plan. Mathematically, the EFE is defined as,
(15) 
Like the VFE introduced in previous sections, the EFE quantifies the difference between a variational density and a generative model. However, the distributions are over trajectories of states and observations, instead of just the state and observation at a single time step, and there is an additional expectation over future observations, meaning that the EFE is a functional of both states and observations, as opposed to only states. In the continuous-time formulation, goals are encoded as set-points which, when compared against the current state estimates, form a prediction error that is minimized by action. In the POMDP formalism of EFE agents, the goals or 'preferences' of the agent are encoded into the generative model to form a biased generative model $\tilde{p}$, where the tilde notation denotes that the model is biased towards predicting the agent's preferred environment, or equivalently, rewarding states and observations.
V-C Finding the optimal plan
One advantage of the EFE is that it allows the derivation of an expression for the optimal plan over a trajectory in terms of the sum of the EFEs for each individual time step. This is possible due to the statistical factorizations intrinsic to the definition of the POMDP, whereby the current state only depends on the action and the state at the previous time step. Specifically, assuming the variational density also factorizes in time, the EFE of a trajectory can be decomposed into a sum of the EFE for each individual time step,
(16) 
where,
(17) 
Hence, the optimal plan (where a plan denotes a sequence of individual actions up to some time horizon) is a softmax distribution over the sums of the EFE for each plan-conditioned trajectory. Specifically,
(18)  
The optimal posterior is given by is
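The softmax construction over plans can be sketched as follows. The sketch assumes the usual sign convention, in which plans with lower summed EFE receive higher probability, and a small finite set of candidate plans; the plan names and per-step EFE values are hypothetical.

```python
import math


def plan_posterior(G_per_step):
    """Softmax over the negative summed EFE of each candidate plan (assumed convention).

    G_per_step[plan] is a list with the expected free energy of each time step.
    """
    totals = {plan: sum(G) for plan, G in G_per_step.items()}
    g_min = min(totals.values())                          # shift for numerical stability
    weights = {plan: math.exp(-(g - g_min)) for plan, g in totals.items()}
    z = sum(weights.values())
    return {plan: w / z for plan, w in weights.items()}


# Hypothetical per-step EFE values for three two-step plans.
G_per_step = {
    ("left", "left"): [1.2, 1.5],
    ("left", "right"): [1.2, 0.4],
    ("right", "right"): [0.9, 2.0],
}
q_plan = plan_posterior(G_per_step)
best_plan = max(q_plan, key=q_plan.get)
```

Note that the plan with the lowest first-step EFE is not the one selected: the ranking depends on the sum over the whole horizon.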
To gain an understanding of the kinds of behaviours that active inference agents will exhibit in practice, it is worthwhile studying the structure of the EFE objective in more detail. Crucially, the EFE can be decomposed into two terms: an extrinsic value term, which scores how close the agent is to achieving its goals, or to maximizing its reward or utility, and an intrinsic value term, which scores the information gain an agent could receive by executing some plan. This information gain, mathematically, is simply the divergence between the posterior and prior variational distributions (or the agent's 'beliefs') about the state trajectory, and maximizing this divergence essentially mandates the agent to seek out states which are maximally informative, thereby maximally reducing uncertainty. In effect, active inference agents possess a natural and inherent desire to seek out and explore novel states of the world which will cause them to update their world model.
The EFE can be decomposed as follows—See Appendix E for derivation,
(19) 
which involves both extrinsic and intrinsic value terms. The exploratory, information-seeking behaviour induced by the intrinsic value term has been extensively studied in the active inference literature [68, 71] for its relationship to the human notion of curiosity and the intrinsic motivation that produces exploratory behaviours. Artificial curiosity [72], intrinsic motivation [73] and goal-directed exploration [74] are crucial for learning and planning. Furthermore, the intrinsic value term has also been applied productively in reward-based AIF [36, 55, 34, 35] to obtain agents which proactively learn to explore large state spaces with sparse rewards.
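The two terms can be computed explicitly for a single time step of a small discrete model. The sketch below assumes a common approximation in which the extrinsic value is the expected log preference over predicted observations and the intrinsic value is the expected information gain about the hidden state; the derivation in Appendix E remains the reference for the exact expression. The two likelihood tables and the flat preference are hypothetical.

```python
import math


def efe_terms(q_s, p_obs, log_pref):
    """One-step extrinsic value, information gain and EFE estimate for a discrete model.

    q_s[s]       -- predicted state distribution under a plan
    p_obs[s][o]  -- likelihood of observation o in state s
    log_pref[o]  -- log preference over observations (biased model)
    """
    obs = list(log_pref)
    q_o = {o: sum(q_s[s] * p_obs[s][o] for s in q_s) for o in obs}   # predicted observations
    extrinsic = sum(q_o[o] * log_pref[o] for o in obs)               # expected log preference
    info_gain = 0.0
    for o in obs:
        for s in q_s:
            joint = q_s[s] * p_obs[s][o]
            if joint > 0.0:
                # Expected KL from prior to posterior beliefs about the state.
                info_gain += joint * math.log(p_obs[s][o] / q_o[o])
    return extrinsic, info_gain, -extrinsic - info_gain


p_obs_informative = {"s0": {"a": 0.95, "b": 0.05}, "s1": {"a": 0.05, "b": 0.95}}
p_obs_ambiguous = {"s0": {"a": 0.5, "b": 0.5}, "s1": {"a": 0.5, "b": 0.5}}
q_s = {"s0": 0.5, "s1": 0.5}
log_pref = {"a": math.log(0.5), "b": math.log(0.5)}   # no preference between outcomes

_, ig_inf, G_inf = efe_terms(q_s, p_obs_informative, log_pref)
_, ig_amb, G_amb = efe_terms(q_s, p_obs_ambiguous, log_pref)
```

With identical preferences, the plan that leads to informative observations obtains the lower EFE, which is the exploratory drive described above.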
VI Learning
In this section, we summarize how we can learn the generative models used in previous sections without the need for an expert designer. More importantly, function approximators [75, 19, 33, 36] help extend the proposed framework to computationally tractable active inference agents that use high-dimensional inputs and state spaces.
We can learn the likelihood and transition model (or prior) of the generative model, the variational approximate posterior, and the desired observation distribution. In the literature, two main approaches have been taken to achieve this: 1) discrete state space [71, 54] (Sec. VI-A) and 2) function approximation for adaptive control [12, 19, 20] (Sec. VI-B1) and planning schemes [36] (Sec. VI-B2).
VI-A Discrete state space
Distributions are described as discrete categorical distributions, which explicitly enumerate every possible state and explicitly assign a probability to each. In practice, these distributions are implemented through normalized matrices and vectors representing each state, or each state combination. For instance, the likelihood mapping can be represented by a matrix whose two dimensions are the dimensionality of the observation space and the dimensionality of the state space. Moreover, in a discrete state space setting, it is often tractable to explicitly evaluate the integral over time and compute the optimal plan posterior, since agents in discrete states typically exist in small enough environments such that all policies can be explicitly enumerated and evaluated [71]. Discrete state space active inference has been widely applied in computational neuroscience to simulate choice behaviour, e.g., saccades [76] and exploratory behaviour [68].
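A minimal sketch of the matrix representation follows. It assumes a likelihood matrix with one row per observation and one column per hidden state, each column normalized to sum to one; the counts used to build it are hypothetical.

```python
def normalize_columns(counts):
    """Turn non-negative counts[o][s] into a column-normalized likelihood matrix."""
    n_obs, n_states = len(counts), len(counts[0])
    col_sums = [sum(counts[o][s] for o in range(n_obs)) for s in range(n_states)]
    return [[counts[o][s] / col_sums[s] for s in range(n_states)] for o in range(n_obs)]


def infer_state(A, prior, o):
    """Exact categorical posterior over states from likelihood matrix A and a prior vector."""
    unnorm = [A[o][s] * prior[s] for s in range(len(prior))]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Hypothetical counts of 3 observation symbols (rows) in 2 hidden states (columns).
counts = [[8, 1],
          [1, 1],
          [1, 8]]
A = normalize_columns(counts)                       # each column sums to one
posterior = infer_state(A, prior=[0.5, 0.5], o=0)   # belief after seeing symbol 0
```

Every quantity is an explicit table, so inference is exact, but the tables grow with the product of the sizes of the spaces they relate.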
Discrete state space active inference suffers from clear limitations of scalability and expressiveness. First, the restriction of using discrete categorical distributions means that it must be possible to model the world with a discrete and low-dimensional set of states (to be able to successfully store the full matrices on a digital computer). (This problem may be addressed by having a hierarchy of states or by using sparse matrices where applicable. We leave it to future work to see whether this would enable discrete state space active inference to scale to real-world tasks.) This restriction renders operating in any kind of continuous environment with fine-grained discretisation infeasible. Second, and more importantly, explicitly evaluating the path integral in Equation 16 by enumerating all policies is an operation with exponential computational complexity in the time horizon as well as the size of the state space, since this increases with the branching factor of the policy tree. This exponential complexity very rapidly limits the scalability of this direct form of discrete state space active inference to relatively simple tasks such as the T-maze [68], although more recent studies are pushing the limit on the computational power of this method [77].
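The growth in the number of plans can be checked directly by enumeration. The four action names and the horizons below are arbitrary illustrative choices.

```python
import itertools


def enumerate_plans(actions, horizon):
    """All action sequences of a given length; there are len(actions) ** horizon of them."""
    return list(itertools.product(actions, repeat=horizon))


actions = ["left", "right", "forward", "stay"]
counts = {T: len(enumerate_plans(actions, T)) for T in (1, 2, 4, 8)}
```

With only four actions, a horizon of eight steps already yields 65536 candidate plans, each of which would need its own EFE evaluation under exhaustive enumeration.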
VI-B Deep active inference
Another approach, which is substantially more scalable, although at the cost of losing performance and convergence guarantees, as well as interpretability, is to instead parametrize the distributions with general function approximators [75], such as Gaussian processes [12] or artificial neural networks [33]. This allows, in theory, a very broad class of distributions to be represented faithfully, since deep artificial neural networks (ANNs) can approximate arbitrary continuous nonlinear functions given sufficient depth and width. For instance, the following approximate posterior
(20) 
can be parametrized as follows,
(21) 
where the network prediction (a forward pass in a deep ANN) outputs a predicted mean and variance for the Gaussian posterior as a function of the input. The network weights can be straightforwardly optimized, for instance with the backpropagation algorithm, where the loss function is the VFE or the EFE.
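A toy version of such a parametrization is sketched below in plain Python. It assumes a scalar observation, a single hidden layer with tanh units, and a log-variance output that is exponentiated to keep the variance positive; the architecture, the parameter names and the untrained random weights are all hypothetical, and no training loop is shown.

```python
import math
import random


def mlp_posterior(y, params):
    """Map a scalar observation y to the mean and variance of a Gaussian posterior."""
    hidden = [math.tanh(w * y + b) for w, b in zip(params["w1"], params["b1"])]
    mean = sum(w * h for w, h in zip(params["w_mean"], hidden)) + params["b_mean"]
    log_var = sum(w * h for w, h in zip(params["w_logvar"], hidden)) + params["b_logvar"]
    return mean, math.exp(log_var)   # exponentiating keeps the variance positive


# Hypothetical, untrained weights; in practice they would be fitted by backpropagation.
rng = random.Random(0)
params = {
    "w1": [rng.uniform(-1.0, 1.0) for _ in range(4)],
    "b1": [0.0] * 4,
    "w_mean": [rng.uniform(-1.0, 1.0) for _ in range(4)],
    "b_mean": 0.0,
    "w_logvar": [rng.uniform(-1.0, 1.0) for _ in range(4)],
    "b_logvar": -1.0,
}
mean, var = mlp_posterior(0.3, params)
```

In a practical implementation the same structure would be expressed in an automatic differentiation framework so that the gradients of the VFE or EFE with respect to the weights are obtained without manual derivation.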
When the functions are approximated by deep networks, the approach is termed deep active inference (deep AIF). Here, we first describe learning for adaptive control and then explain how to use learning in planning. Hierarchical learning is detailed in Sec. VII.
VI-B1 Adaptive control with deep AIF
Estimation and adaptive control can be scaled to high-dimensional inputs through stochastic optimization of the VFE exploiting deep ANNs [6]. Algorithm 1 describes the non-hierarchical deep AIF for adaptive control. The most common approach is to learn the forward model (the sensory output given the internal state) with self-supervised learning and then use the VFE optimization for online estimation and control [19, 20]. The state at a given instant is inferred by using the forward pass of the network to obtain the estimated sensory input and then backpropagating the weighted error (exploiting the Jacobian of the network). The action (e.g., torques, velocities, etc.) is computed with an analogous procedure, but in this case the change of the sensory input w.r.t. the action must be modeled. There are several techniques to resolve this partial derivative [13, 23]. This approach can be further extended to multimodal sensory input [20] and combined with model-agnostic AIF approaches.
VI-B2 Planning with deep AIF
Planning ahead can be modeled with deep AIF. Here all densities (even the policy) and parameters are learnt by optimizing the EFE. The algorithm is described in Alg. 2. This deep AIF approach draws heavily from recent work in machine learning, especially deep RL, and has achieved significant success in scaling active inference up to challenging RL benchmark tasks, meeting and potentially exceeding the state of the art in deep RL [36, 55, 28, 34].
Interestingly, if we explicitly model the parameters in the generative and approximate posterior distributions, the EFE objective also gives rise to an information gain term over the model parameters, which encourages deliberate exploration to efficiently search the parameter space. The ensuing exploration is usually described as one aspect of artificial curiosity or intrinsic motivation [72, 73, 74]; namely, novelty seeking.
While this approach allows for the expression of arbitrary probability densities, it does not solve the problem of the computationally expensive evaluation of the EFE path integral used to compute the optimal policy. However, the deep AIF literature has proposed several solutions that draw from either model-based or model-free RL. One approach is to use the fact that the EFE satisfies a recursive relationship similar to the Bellman equation in traditional RL to derive a bootstrapped estimator similar to a value function, which can be optimized across batches [36]. Another approach is to utilize black-box optimization methods, such as genetic algorithms, to directly approximate this quantity through sampling [33]. A final approach, which is closely related to model-based planning and is more common in the literature, is to approximate the path integral with samples of future rollouts given different policies, which can be simulated by the agent using its own generative model to imagine and score future trajectories [55]. This model-based approach has been used fairly widely in the literature, and has been utilized to develop powerful deep active inference algorithms capable of matching the performance of state-of-the-art model-based RL systems in MDP problems, such as the Cartpole, Acrobot and Lunarlander [36] and the mountain car problem [78], and in POMDPs, such as visual control of the cartpole [34], car racing [37, 79] and the AnimalAI environment [35].
VII Hierarchical Representation
In previous sections, we described the state in AIF as a variational density. This section introduces how to represent the state of the robot in a hierarchical setting. Using a hierarchical recurrent neural network (RNN) we can generate predictive behaviour exploiting the free energy principle. Murata and colleagues [43] developed a variational hierarchically-organized RNN model, referred to as the Stochastic Continuous-Time RNN (S-CTRNN). This model inherits the dynamic property of the multiple timescales RNN (MTRNN) [80], wherein the neural activation dynamics in the higher layer are dominated by slower dynamics through a larger time constant, and those in the lower layer are dominated by faster dynamics through a smaller time constant. Although this model was successfully applied to predictive coding and active inference in a physical humanoid robot that imitatively interacts with a human-operated robot, the model was limited because the random latent variable is allocated only in the initial latent state of the MTRNN.
By considering that random latent variables should be introduced not only in the initial time step but also at all time steps during sequential processing, another variational RNN, the so-called predictive-coding-inspired variational RNN (PVRNN) [81], was developed. This model was inspired by the sequence prior scheme [82] introduced in a model called the variational RNN (VRNN), wherein the prior changes over time. PVRNN extended the sequence prior by incorporating (1) a scheme called error regression, which infers random latent variables using prediction error signals during action generation and is analogous to predictive coding, and (2) the multiple timescale scheme described above. A graphical representation of PVRNN is shown in Figure 3.
In each layer of the network, the internal state at each time step is defined by two latent variables, one deterministic and one random. The random variable is represented by a Gaussian distribution with a mean and a standard deviation. The lowest layer contains the output, which predicts the sensory observation including both exteroception and proprioception. In robotics applications, motor movement can be generated by feeding the prediction of proprioception, in terms of target joint angles at the next time step, to the motor controller.
Both learning and action generation are performed through iterative interactions between the top-down generative process and the bottom-up inference process. The generative model can be written in factorized form:
(22) 
Although is a deterministic latent variable, it can be considered to have a Dirac delta distribution centered on as . In the generative process, the information propagates from the highest layer to the lowest layer at each time step in the forward direction. At each layer the deterministic latent variable and the prior distribution are computed. The lowest layer computes the sensory prediction . First, is computed as shown in Eq. (23) where its internal value is denoted by .
(23)  
where is a sampling of the posterior distribution estimated in the last iteration of the inference process. represent the learnable parameters in terms of connectivity weight matrices between layers and their deterministic and stochastic units. represents another learnable parameter bias. The output is computed as mapping from in the lowest layer as:
(24) 
where is a learnable bias.
The prior distribution takes a Gaussian distribution with mean
. The prior depends on by following the sequence prior scheme [82].(25)  
Next, the inference process using the free energy minimization is described. The VFE can be written as:
(26) 
where is the generative model with learnable parameter and is the posterior inference model with parameter . The objective of the inference in learning is to obtain optimal values for and by minimizing the free energy. Let us consider the posterior inference with a given sensory observation for the pasttime window from time step to of PVRNN. The approximate posterior for is represented as:
(27)  
The adaptation variable represents the parameters for the posterior inference model . With considering the generative process described in Eq. 23 and Eq. 25, the free energy for PVRNN can be obtained as:
(28)  
where the first term is the accuracy and the second term is the complexity, corresponding to those in Eq. 26. In practice, the generative process, sweeping from the higher layer to the lower layer and forward in time, is performed by following Eq. 23 and Eq. 25 using the adaptation variable and the connectivity weights updated in the previous iteration. The inference process, sweeping from the lower layer to the higher layer and backward in time, is implemented with backpropagation through time (BPTT) [83]. This uses the error calculated between the output computed in the last generative process and the target to update the current adaptation variable and connectivity weights. The paired computation of the generative process and the inference process is iterated until the free energy is minimized. After the learning process has converged, action generation can be conducted.
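The trade-off between the accuracy and complexity terms can be illustrated with a minimal scalar sketch. It assumes a unit-variance Gaussian likelihood (so the negative accuracy reduces to a squared prediction error, up to a constant), scalar Gaussian priors and posteriors per time step, and a hypothetical weighting factor `w` on the complexity term; none of these choices is claimed to match the exact PVRNN objective.

```python
import math

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL[N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)] for scalar Gaussians."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

def free_energy(targets, predictions, posteriors, priors, w=1.0):
    """Squared prediction error (negative accuracy up to a constant)
    plus w times the summed KL complexity over all time steps."""
    prediction_error = sum(0.5 * (x - x_hat) ** 2
                           for x, x_hat in zip(targets, predictions))
    complexity = sum(gaussian_kl(mq, sq, mp, sp)
                     for (mq, sq), (mp, sp) in zip(posteriors, priors))
    return prediction_error + w * complexity

# One time step: prediction off by 1, posterior mean shifted by 1 from the prior.
value = free_energy(targets=[1.0], predictions=[0.0],
                    posteriors=[(1.0, 1.0)], priors=[(0.0, 1.0)], w=2.0)
```

Increasing `w` penalizes posteriors that depart from the prior, so the minimizer stays closer to the prior prediction; decreasing it lets the posterior follow the observations more closely.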
The basic architecture described above has been extended and implemented in various robotic experimental tasks including human-robot imitative interaction [84, 85], dyadic robot imitative interaction [86], goal-directed planning for robot object manipulation [87], and goal-directed planning using active inference and reinforcement learning in navigation [29].
VIII Robotic experiments
While the previous sections were devoted to describing the mathematical insights of AIF, this section presents selected experiments from the literature that showcase the relevant characteristics of AIF for robotics.
Historical preamble.
While Friston was developing the basis of AIF, the free energy principle [1], Tani and colleagues [88, 89] were investigating models similar to AIF in real robots. In [88], a hierarchically organized RNN was trained as a generative model to predict and generate visuo-proprioceptive sequence patterns for a set of movement primitives. It was demonstrated that the robot could successfully adapt its movement pattern to the corresponding movement primitive in real time when the environment changed. However, these models are limited because they were predicated on a deterministic dynamics perspective instead of the Bayesian perspective used in the formal formulation of AIF [90]. A robotic implementation of AIF following Friston's exact formalism for reaching tasks was described in [15], using a simulated 7-DOF robot arm with the generative models and parameters known in advance. Finally, Lanillos and colleagues [12, 13] developed and deployed an AIF model on a humanoid robot for estimation and control. Concurrently, similar approaches were being investigated for industrial manipulators [23].
An important concept that emerged in these works is that actions in AIF realize sensory consequences of prior causes. For this reason, there is no need for precise inverse dynamical models, which might be problematic to compute. Compared to optimal control, AIF is therefore appealing for robotic applications where the dynamics of the robot or the task are uncertain. However, other challenges appear, e.g., inverse dynamic modelling is shifted to the design or learning of meaningful generative models and prior beliefs (preferences). Despite the challenges, AIF is showing huge potential for real robotic applications [6] in estimation, adaptive control, fault-tolerant control, prospective planning and complex cognitive skills (i.e., human-robot collaboration, self/other distinction).
In the following subsections we describe experiments for: A) estimation, B) adaptive control, C) fault-tolerant control, D) planning and E) complex cognition.
VIII-A Estimation
VFE optimization has been successfully applied to adaptive robot estimation and learning in applications such as body estimation [12], drone state tracking [9] and navigation [39, 14].
In [12] the authors proposed a computational model based on predictive coding that allows generic multisensory integration for inferring and learning the robot's body configuration by means of arbitrary sensors affected by Gaussian noise. Figure 4a shows the experimental setup. They learned the forward model using Gaussian processes. This is particularly useful when an accurate model of the body and the environment is not available, or when self-calibration of different sensors is needed. Body learning was formulated as obtaining the mapping that encodes the dependency of the sensory values on the joint variables. Body perception/estimation was achieved by minimizing the VFE (through stochastic gradient descent), which iteratively reduces the discrepancy between the belief of the robot about its configuration and the observed posterior. The results showed that different sensor modalities improve the refinement of the body configuration and allow the body estimation to adapt to the most plausible solution when strong visual-tactile perturbations are injected or when sensory inputs are missing. Interestingly, the system was prone to visual-tactile illusions, similarly to how humans process multisensory body information [91].
Based on Dynamic Expectation Maximization (DEM) [65], [9] proposed a simultaneous state and input observer design for stable Linear Time Invariant (LTI) systems with coloured noise, which was tested on a real system. The use of generalized coordinates enabled this observer to outperform the Kalman filter for state estimation and the Unknown Input Observer (UIO) for input estimation under coloured noise. Similar ideas of perception were used to reformulate DEM into a blind system identification algorithm [8] for the simultaneous estimation of states, inputs, parameters and noise hyperparameters of an LTI system. This estimator was shown to outperform other classical estimators for parameter estimation under coloured noise. On the application side, DEM was applied to the perception of a quadrotor flying under wind conditions [10, 11]. The DEM-based learning algorithm successfully learned the dynamic model of the quadrotor, yielding accurate output predictions when compared to other state-of-the-art system identification methods [10]. The existence of a mathematical proof for stable parameter estimation [66] supports its reliability and safety for real robotic applications. These results demonstrate the applicability of DEM as a learning algorithm for future robots to learn their generative model from sensory data, giving them the capability to make accurate predictions of the world.
VIII-B Adaptive control
VIII-B1 Without learning
AIF robot body perception and action were successfully deployed, for the first time, on a physical robot, the iCub humanoid [13]. It provided natural behaviours for upper-body reaching and head object tracking using proprioceptive and visual sensory inputs. The AIF algorithm was validated in terms of noise robustness and multisensory integration, and was also applied to reaching and grasping a moving object. Figure 4e shows the behavioural sequence for one of the tests. The chosen models for active inference allowed goals to be expressed in Cartesian or image space by specifying attractors in that space and translating them into joint space using the inverse of the Jacobian matrix. The robot was controlled using velocity commands, which make it easy to define the partial derivatives of the joint position with respect to the control actions by considering discrete updates of the joint positions at each time step. This removed the need for complex inverse dynamical models to compute the control actions. The experimental results showed the potential of AIF for dual perception and action, particularly for counteracting unexpected changes in the environment or the model.
AIF for torque control in the joint space of industrial arms was shown in [17]. The authors presented an adaptive scheme, the Active Inference Controller (AIC), for controlling generic robot manipulators with an arbitrary number of DOFs. The control law is model-free and lightweight, and it uses proprioceptive sensors for position and velocity to control a 7-DOF robot arm in joint space. By choosing the joint positions as the state of the system to be controlled, the generative model of the sensory input reduced to the identity mapping. The generative function of the state dynamics was used to impose a specific behaviour: the robot believed that each joint was moving towards a given target following the dynamics of a chosen first-order linear system. Furthermore, instead of computing the forward dynamics of the robot manipulator, the authors in [17] proposed to approximate the partial derivatives of the sensory input with respect to the control input by encoding only the sign of this relationship, relying on the adaptability of the algorithm to compensate for the modelling errors. Note that by specifying a certain generative model in AIF, one can define a robot that thinks it will behave in a particular way. This is closely related to the more established idea of model reference adaptive control (MRAC) [92]. The results showed that the AIC greatly outperformed MRAC in terms of adaptability to unmodelled dynamics, tuning effort, disturbance rejection, computational effort, and overall performance in pick-and-place tasks with unknown object weights. A great advantage of this approach was the ability to transfer from simulation to the real robot without retuning the controller, while preserving compliant behaviour through torque commands. The model parameters of the AIC were, however, treated as constants and regarded as tuning parameters for the designer to achieve a smooth response.
Also, the multisensory AIF presented in [20] showed better performance than the optimized factory controller of the Panda robot (Fig. 4f) when adapting to changes in external and internal parameters, such as gravity, stiffness, end-effector inertia and external perturbations.
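The two ingredients discussed in this subsection, a belief that is biased towards the target and an action update that only uses the sign of the sensory sensitivity to the action, can be sketched for a single velocity-controlled joint. This is a deliberately simplified illustration and not the controller of [13] or [17]: it assumes a noise-free identity observation model, a static prior centred on the target instead of generalized coordinates, Euler integration, and hand-picked gains; all names are hypothetical.

```python
def sign(value):
    return 1.0 if value >= 0.0 else -1.0

def simulate_aif_joint(target, q0=0.0, steps=3000, dt=0.01,
                       pi_y=1.0, pi_mu=1.0, kappa_mu=1.0, kappa_a=4.0,
                       dy_da=0.37):
    """Simulate one joint driven by gradient descent on a quadratic free energy.

    F = 0.5 * pi_y * (y - mu)**2 + 0.5 * pi_mu * (mu - target)**2
    Only the sign of dy_da (sensitivity of the reading to the action) is used.
    """
    q, a, mu = q0, 0.0, q0
    for _ in range(steps):
        y = q                     # identity observation model, no noise
        eps_y = y - mu            # sensory prediction error
        eps_mu = mu - target      # deviation of the belief from the target
        # Perception: move the belief down the free-energy gradient.
        mu += dt * kappa_mu * (pi_y * eps_y - pi_mu * eps_mu)
        # Action: same gradient, with dy/da replaced by its sign.
        a += dt * (-kappa_a * sign(dy_da) * pi_y * eps_y)
        q += dt * a               # velocity-controlled joint
    return q, mu

final_q, final_mu = simulate_aif_joint(target=1.0)
```

Because the belief is pulled towards the target, the sensory prediction error only vanishes once the joint itself reaches the target, which is what drives the action.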
VIII-B2 With precision learning
The authors in [18] presented an evolution of [17] that includes online parameter learning for controller auto-tuning. The authors demonstrated how the minimization of the VFE produces effective state estimation and control for robotic manipulators. They introduced a temporal parameter in the dynamics generative model used in [17], corresponding to a variable time constant of the first-order linear system used as the model reference for the joint motions. In essence, the authors showed that the gains of the controller correspond to the covariance matrices of the observation model, and that learning the optimal covariance matrices amounts to finding the optimal gains for the controller [18]. The results showed improved performance in terms of response and robustness compared to a manually tuned AIC, but they also revealed some limitations of the standard active inference formulation when (hyper)parameter learning is introduced. The belief about the current state is intentionally biased towards the target to achieve control, which is uncommon in the control community. Thus, the state reconstruction is not accurate unless the system reaches the target. This is also reflected in the learned covariance matrices, which converged to much higher values than the Gaussian noise affecting the controlled system.
VIII-B3 With function learning
Instead of designing the observation and the dynamical model, a solution for body estimation and control was proposed in [16], where function learning was employed. In this approach, only forward models needed to be learned. The authors pointed out how this learning approach was not simpler than classical inverse dynamics techniques, because it required learning the state forward dynamics and the Jacobian of the observation model with respect to the latent space.
In [12] AIF estimation with function learning was solved using GP regression for low-dimensional inputs. It exploited the closed-form equation to compute the partial derivatives with respect to the body state. A pixel-based deep AIF controller [19] was presented to handle high-dimensional inputs, such as images. Figure 4b depicts the PixelAIF architecture and a sequence of the dual perception-action inference process. In the perception row, the arm overlays represent the robot visual input and the predicted image. In the action row, the overlays represent the visual input and the desired goal in the visual space. This approach incorporated generative model learning using convolutional neural networks for one sensor modality (visual) and performed velocity control. Finally, a multimodal variational autoencoder AIF (MAIF) torque controller [20] was presented. It combined VFE optimization with generative model learning, extending the previous AIF formulations to work with high-dimensional multimodal input at the torque level. Figure 4e shows the robot imagining its future trajectory, thus describing AIF as a generative model of actions. Figure 4f shows the MAIF results with the Panda manipulator when changing external parameters (the Jupiter experiment), in which the gravity parameter was modified. According to the end-effector error plot, the AIF formulation achieved better performance than the built-in controller provided by the robot manufacturer. Further tests showed the adaptation properties to input noise and changes in internal parameters, such as stiffness.
In all these methods the computation of the observation model Jacobian (inverse model) is indirect and depends on proper representation learning. This Jacobian has a physical meaning and describes the change in perception or the action to be exerted. Figure 4c visualizes the convolutional decoder Jacobian of a virtual arm after learning the visual mapping [21]. The red and blue pixels define the change of the arm edge w.r.t. the elbow and shoulder joints.
Other works focused on robot navigation with active inference. In particular, [38, 37, 39] proposed a method to learn generative state-space models from pixel data. In [38] the authors approximated the variational posterior distributions, as well as the likelihood model, with two deep neural networks. This allows simple navigation tasks to be performed using a Kuka YouBot platform in an aisle.
VIII-B4 In combination with discrete planning
Adaptive control can also be achieved for high-level behaviour with active inference, as shown in [42]. A discrete formulation of active inference is used in combination with behaviour trees [93] to provide prior preferences over specific states while adapting at runtime to unforeseen contingencies during mobile manipulation in dynamic environments. The solution proposed in [42] blended acting and planning to provide a solution for continual planning and hierarchical deliberation for long-term tasks in robotics. The core idea is to use behaviour trees as a graphical method to encode priors for the active inference algorithm. These priors were used to bias the generative model online and to guide the search with active inference to provide adaptive and fast responses to changes in the environment. The experiments were conducted both in simulation and on a real robot, considering two different mobile manipulators and two similar tasks in a retail environment, for instance stocking an empty shelf. The hybrid combination of active inference and behaviour trees provides reactivity to unforeseen events, allowing the mobile manipulators to quickly perform, repeat, or skip actions according to the state of the environment. Crucially, the method also provides safety guarantees and convergence of the high-level behaviour of the robot to the given goal.
VIII-C Fault-tolerant control
AIF can also help in case of degraded sensory input if sensory redundancy is provided in the system [12]. As pointed out in [15], if the robot has poor proprioceptive information, multimodal integration with visual data can compensate and restore effective control [12]. This property of active inference of naturally fusing different sensory modalities for both state estimation and control has been taken further for fault-tolerant control [23, 22].
The work in [23] proposes the use of the AIC from [17] with the addition of visual information on the end-effector position, using the GP learning approach from [12]. The authors proposed a scheme for online threshold generation for the detection and isolation of sensory faults based on the sensory prediction errors in the free energy. Fault recovery when either the proprioceptive sensors or the camera were marked as faulty is achieved by simply setting to zero the precision matrices (or inverse covariances) of the corresponding sensors. Results on a simulated 2-DOF robot arm showed that the fault-tolerant AIC can detect and recover from freezing encoders and camera misalignment while still converging to a given goal. The main advantage with respect to standard fault-tolerant approaches is that fault detection and isolation, as well as fault recovery, do not require the design of additional signals to monitor or alternative controllers besides what is already provided by the AIC. However, because AIF is by nature an adaptation mechanism biased towards the desired state, this bias hindered fault detection and produced false alarms. In fact, the sensory prediction errors in the free energy can increase due to a changing goal for the robot manipulator and not necessarily due to a faulty sensor. This problem can be addressed by introducing an unbiased AIC controller [22].
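The recovery mechanism of zeroing the precision of a faulty sensor can be sketched with a scalar precision-weighted fusion of two redundant readings. The fixed threshold, the sensor values and the function names below are assumptions for illustration only; the scheme in [23] generates its thresholds online and operates on the full controller.

```python
def fuse(readings, precisions):
    """Precision-weighted estimate minimizing sum_i 0.5 * pi_i * (y_i - mu)**2."""
    total = sum(precisions)
    if total == 0.0:
        raise ValueError("all sensors are marked faulty")
    return sum(p * y for p, y in zip(precisions, readings)) / total

def isolate_faults(readings, belief, precisions, threshold):
    """Zero the precision of each sensor whose squared prediction error
    with respect to the current belief exceeds the threshold."""
    return [0.0 if (y - belief) ** 2 > threshold else p
            for y, p in zip(readings, precisions)]

# Healthy operation: encoder and camera agree on the joint angle (rad).
precisions = [1.0, 1.0]
belief = fuse([0.50, 0.52], precisions)

# The camera becomes misaligned and reports a large offset.
faulty_readings = [0.50, 1.40]
precisions_after = isolate_faults(faulty_readings, belief, precisions, threshold=0.05)
recovered = fuse(faulty_readings, precisions_after)
```

With the camera precision set to zero, the estimate is driven by the encoder alone, without switching to a separate controller or estimator.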
Subsequently, the authors in [94] highlighted how fault detection and recovery can be achieved automatically through precision learning. This provides a stochastic rather than deterministic method for fault detection (the probability of a sensor being faulty) and allows fault-tolerant behaviour without needing any threshold definition.
VIII-D Planning
As described in Section V, planning can be cast as optimizing the parameters of the plan density with respect to the EFE cost function. This process favors plans which realise an agent's prior preferences (i.e., goal-driven behaviour), while at the same time gathering the most information from the environment (i.e., exploration).
A classical illustrative example is modelling saccadic eye movements. In [95] they computed the saccades (using expected free energy, EFE) and then executed them through vanilla prediction error minimization (free energy gradients).
In [55], the authors investigated the performance of deep AIF in the context of planning with rewards in standard MDP RL benchmarks (see Fig. 5). Implementation-wise, the deep AIF agent optimizes the plan density at each time step and executes the first action specified by the most likely plan. This involves estimating the EFE for each plan, which in turn involves evaluating the expected future beliefs and observations given the plan. This can be achieved through the generative model, whereby beliefs about future hidden states (given the current hidden state and plan) are evaluated via the transition distribution, and beliefs about observations, given some hidden state, are evaluated using the likelihood distribution. Given this counterfactual distribution over future states and observations, the EFE can be approximated [36, 55]. The final requirement is an optimization procedure which iteratively updates the plan density such that the EFE is minimized. In [55], the plan is parameterized as a diagonal Gaussian and the cross-entropy method [96] is used to optimize its parameters so that the EFE is minimized. The experiments focused on whether the algorithm was able to balance exploration and exploitation.
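The optimization loop described above can be sketched with a cross-entropy method over a diagonal Gaussian plan. The cost used here is a toy stand-in for the EFE (a goal-distance term plus a small action penalty on a one-dimensional point mass, with no information-gain term), and the dynamics, hyperparameters and names are assumptions rather than the setup of [55].

```python
import random

def rollout_cost(plan, start=0.0, goal=3.0, action_penalty=0.1):
    """Toy stand-in for the EFE of a plan: squared distance of the final
    state to the goal plus a quadratic action penalty."""
    state = start
    for action in plan:
        state += action          # trivial known transition model
    return (state - goal) ** 2 + action_penalty * sum(a * a for a in plan)

def cem_plan(cost_fn, horizon=5, n_samples=200, n_elite=20, n_iters=20, seed=0):
    """Refit a diagonal Gaussian over action sequences to the lowest-cost samples."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(n_iters):
        plans = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                 for _ in range(n_samples)]
        elite = sorted(plans, key=cost_fn)[:n_elite]
        mean = [sum(p[t] for p in elite) / n_elite for t in range(horizon)]
        std = [max(1e-3, (sum((p[t] - mean[t]) ** 2 for p in elite) / n_elite) ** 0.5)
               for t in range(horizon)]
    return mean, std

plan_mean, plan_std = cem_plan(rollout_cost)
first_action = plan_mean[0]      # only the first action is executed, then replan
```

Executing only the first action and replanning at the next step is the receding-horizon pattern described in the text; in a full agent, `rollout_cost` would be replaced by an EFE estimate computed from the learned transition and likelihood models.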
The performance was evaluated in domains with (i) well-shaped rewards (Half Cheetah), (ii) extremely sparse rewards, where agents only receive a reward when the goal is achieved (Mountain Car and Cup Catch), and (iii) a complete absence of rewards, where success is measured by the percentage of the maze covered (Ant Maze). In sparse-reward environments, deep AIF was compared to two baselines: a reward algorithm, which only selects plans based on the extrinsic term (ignoring the information gain), and a variance algorithm, which seeks out uncertain transitions by acting to maximise the output variance of the transition model. For environments with well-shaped rewards, deep AIF was compared to the maximum reward obtained after 100 episodes by soft actor-critic (SAC) [97], a state-of-the-art model-free RL algorithm.
The Mountain Car experiment is shown in Fig. 5, where we plot the total reward obtained for each episode over 25 episodes, where each episode is at most 200 time steps. These results showed that deep AIF rapidly explores and consistently reaches the goal, achieving optimal performance in a single trial. In contrast, the benchmark algorithms were, on average, unable to successfully explore and achieve good performance. Deep AIF performs comparably to benchmarks on the Cup Catch environment (Fig. 1B). Figure 1 C&D shows that deep AIF performs substantially better than a state-of-the-art model-free algorithm after 100 episodes on the challenging Half Cheetah tasks. This reflects robust performance in environments with well-shaped rewards and provides considerable improvements in sample efficiency. The directed exploration afforded by minimizing the EFE proves beneficial in environments with no reward structure. The exploration rate of deep AIF was substantially higher than that of a random baseline in the Ant Maze environment, resulting in a more substantial portion of the maze being covered.
VIII-E Complex cognition
VIII-E1 Intention-blended human-robot collaboration
Ohata and Tani [44] applied the frameworks of predictive coding and active inference to a social cognition study in which they investigated the dynamics of intention in human-robot interactions. In this study, multimodal imitative interactions between a humanoid robot and a human counterpart were simulated, in which the robot and the human imitate each other's body movements simultaneously. The imitation task was designed to reveal the difference between strong intention and weak intention in social interaction. During imitative interactions, there were cases where the robot tried to perform a different movement primitive than the human. In such conflicting situations, it can be assumed that a robot with a strong intention would keep performing the original primitive, whereas a robot with a weak intention would change its intention to adapt to the human's primitive. Body movements were implemented using three types of motion primitives (A, B, and C) that follow specific probabilistic state transition rules. Every time primitive A occurs, either primitive B or C follows with 50% chance, and primitive A always follows primitives B and C (Fig. 6A). Therefore,
To model multimodal perception and action generation, a hierarchically-organized variational RNN (see Sec. VII) was extended. This model comprises three modules: the proprioception module, the vision module, and the associative module (Figure 6B).
In the simulation of mutual imitative interaction, the balance between the accuracy and complexity terms in the free energy was the key focus. This balance determines how the approximate posterior is optimized through free energy minimization given the observation and the prior. Results showed that when the complexity term was less dominant, the robot tended to change its prediction, namely its intention, easily so that it could adapt to the observation. In a conflicting situation of motion primitives, the robot tended to change its motion primitive to the one the human presented. When the complexity term was more dominant, the robot tended to ignore the observation and retained its original intention. Figure 6C describes the joint dynamics depending on the dominance of the complexity term, for a conflicting situation in which the network predicted primitive B but observed C. In the less dominant condition (a), the network reconstructed the observation and modified the prediction such that primitive C persisted. In the more dominant condition (b), the network ignored the observation and maintained its original prediction.
The congruence between an agent's intention, the action and its anticipated outcome can be linked to the psychological concept of agency [100, 101]. In the context of AIF, an agent's intention can be formulated as a predictive model, and the congruence between the predicted action outcomes and the observation could be considered to contribute to the system being in control of the actions and their consequences [102]. Hence, in the accuracy-term-dominant condition the robot is endowed with weaker agency, and in the complexity-term-dominant condition the robot has stronger agency.
VIII-E2 Self/other distinction
The work in [103] presented an algorithm that enables a robot to perform non-appearance self/other distinction on a mirror by distinguishing its simple actions from other agents. Developing this visual-kinesthetic matching is essential for safe human-robot interaction, especially in social robotics [104, 48]. Using movement cues the robot first learns the visual forward kinematics and then exploits the Bayesian model evidence to accumulate evidence. The potential of modelling the high-order cognition needed to pass the mirror test in robots makes AIF very attractive for the cognitive sciences view of robotics (see [48] for a computational modelling roadmap).
Ix Connections with other frameworks
As one might have realised from previous sections, AIF shares many similarities with other more established control schemes and theories. In this section, we summarise the main commonalities between active inference and other approaches.
IX-A Relationship with classical controllers
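Consider the generative model specified by Eq. (10). This model includes the function $f(\cdot)$, which determines how the belief state evolves over time. This can be, for example, according to a first-order linear system, which results in $f(\mu) = (\mu_d - \mu)\tau^{-1}$ [5, 17, 13]. The belief state is thus specified to evolve (linearly) over time with a derivative proportional to the difference between the target $\mu_d$ and the current belief $\mu$. The term $\mu_d$ indicates the desired state to be achieved, while $\tau$ is the time constant. The smaller $\tau$, the larger the derivative. If $\tau$ approaches zero ($\tau \to 0$), the value of $\mu$ approaches $\mu_d$. As a result, the belief is infinitely biased towards the target and $\mu = \mu_d$. A classic PID controller defines an error term $e = \mu_d - y$ between the target and the measurement $y$. The control action is then chosen as:
$$u = K_p e + K_i \int e \, dt + K_d \dot{e},$$
where $K_p$, $K_i$ and $K_d$ are tuning parameters. For the control law defined by active inference, our sensory prediction error $\varepsilon_y = y - \mu$ is similar to the error term. Additionally, as explained in the previous section, when $\tau \to 0$ then $\mu = \mu_d$, so that $\varepsilon_y = -e$. Now the control law of active inference can be rewritten in terms of the error term as:
$$\dot{u} = \kappa_a \left( \sigma_y^{-1} e + \sigma_{y'}^{-1} \dot{e} \right) \quad \Longrightarrow \quad u = \kappa_a \sigma_{y'}^{-1} e + \kappa_a \sigma_y^{-1} \int e \, dt,$$
where $\kappa_a$ is the action learning rate and $\sigma_y$, $\sigma_{y'}$ are the variances associated with the observation and its first derivative.
This means that if $\tau \to 0$, the active inference controller is equivalent to a PI controller, i.e., a PID controller with $K_d = 0$, a P gain of $\kappa_a \sigma_{y'}^{-1}$ and an I gain of $\kappa_a \sigma_y^{-1}$. If one considers the generalized motions up to a third order, the control law would include a nonzero D term. A detailed analysis of how PID arises in active inference under approximate linear generative models can be found in [51, 18].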
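The claimed equivalence can be checked numerically. The following is a minimal sketch under stated assumptions rather than the controller of any cited work: the plant is a hypothetical first-order system, the belief is pinned to the target (the limit of a vanishing time constant), the sensitivity of the observation to the action is replaced by a constant of +1, and the names `kappa`, `sigma_y` and `sigma_dy` (action learning rate and the variances on the observation and its derivative) are illustrative.

```python
def simulate_aif(kappa, sigma_y, sigma_dy, target=1.0, dt=0.01, steps=2000):
    """Velocity-form active inference law with the belief fixed at the target."""
    y = 0.0
    # Start from the value a PI controller would output, so both loops share u(0).
    u = (kappa / sigma_dy) * (target - y)
    ys = []
    for _ in range(steps):
        dy = -y + u                 # toy plant (assumption): dy/dt = -y + u
        eps_y = y - target          # prediction error on the observation
        eps_dy = dy - 0.0           # prediction error on its derivative (constant target)
        u += dt * (-kappa) * (eps_y / sigma_y + eps_dy / sigma_dy)
        y += dt * dy                # explicit Euler step
        ys.append(y)
    return ys


def simulate_pi(kp, ki, target=1.0, dt=0.01, steps=2000):
    """Textbook PI controller acting on the same toy plant."""
    y, integral = 0.0, 0.0
    ys = []
    for _ in range(steps):
        e = target - y
        u = kp * e + ki * integral
        dy = -y + u
        integral += dt * e
        y += dt * dy
        ys.append(y)
    return ys


kappa, sigma_y, sigma_dy = 1.0, 0.5, 0.5
aif_ys = simulate_aif(kappa, sigma_y, sigma_dy)
pi_ys = simulate_pi(kp=kappa / sigma_dy, ki=kappa / sigma_y)
max_gap = max(abs(a - b) for a, b in zip(aif_ys, pi_ys))
print(f"largest difference between the two trajectories: {max_gap:.2e}")
```

Under these assumptions the two trajectories coincide up to floating-point rounding, with the proportional gain set by the precision on the derivative of the observation and the integral gain by the precision on the observation itself.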
Regarding more complex controllers, AIF shares similarities with Linear Quadratic Gaussian control (LQG), since both are grounded in Bayesian inference and optimal control [105]. However, a closer look reveals several key differences between the two approaches regarding the formulation of the state space, the cost functions, and their minimization; see [106].
AIF can also be considered a form of model predictive control [53] due to the evaluation of the expected free energy. The key difficulty is evaluating the expectation over all possible future trajectories, which can be approximated via Monte-Carlo sampling of these trajectories using a forward model of the environmental dynamics. Many classical model-predictive control planning algorithms, such as the Cross-Entropy Method (CEM) [107] and Path Integral Control (PI) [108, 109], can be used to estimate the value of this integral in cases where explicitly enumerating every possible future trajectory, as is commonly done in discrete-state-space AIF [71], is infeasible. Moreover, it has recently been demonstrated that many of these classical planning algorithms can be interpreted as performing variational inference [110], thus linking model-predictive control closely with the active inference interpretation of adaptive and intelligent behaviour as being fundamentally an inference process.
IX-B Relationship to Reinforcement Learning
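As an illustration of sampling-based planning, the sketch below implements a basic Cross-Entropy Method over action sequences. It is a toy under stated assumptions: the dynamics are a hypothetical one-dimensional integrator, and the accumulated squared distance to a goal stands in for the expected free energy that an AIF agent would evaluate along each sampled trajectory.

```python
import random
import statistics


def rollout_cost(actions, x0=0.0, goal=1.0):
    """Accumulated squared distance to the goal under toy dynamics x <- x + a.

    This cost is only a placeholder for the expected free energy of a trajectory.
    """
    x, cost = x0, 0.0
    for a in actions:
        x += a
        cost += (x - goal) ** 2
    return cost


def cem_plan(horizon=5, samples=200, elites=20, iterations=10, seed=0):
    """Refine a Gaussian over action sequences towards low-cost trajectories."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iterations):
        # Sample candidate action sequences from the current proposal.
        candidates = [
            [rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(samples)
        ]
        # Keep the lowest-cost sequences and refit the proposal to them.
        candidates.sort(key=rollout_cost)
        best = candidates[:elites]
        mean = [statistics.fmean(c[t] for c in best) for t in range(horizon)]
        std = [statistics.pstdev([c[t] for c in best]) + 1e-6 for t in range(horizon)]
    return mean


plan = cem_plan()
print("planned actions:", [round(a, 2) for a in plan])
print("cost of plan:", round(rollout_cost(plan), 4))
```

Only the cost function would need to change to plug in a learned forward model and an expected free energy estimate; the sampling and refitting loop stays the same.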
Planning with AIF using the expected free energy functional has close relationships to the field of RL. This fact is not unduly surprising since both attempt to solve the same problem of optimal plan selection in unknown environments through largely similar approaches. The key difference between RL and active inference lies in the properties of the EFE. Crucially, the intrinsic and posterior divergence terms of the EFE are not present in RL; only the extrinsic value term is. These differences arise from the fact that active inference is posed as a variational inference procedure, thus requiring a variational approximate distribution, while RL is solely an optimization problem based around maximizing expected reward. Considering only the extrinsic value term, the EFE of a policy $\pi$ becomes
$$\mathcal{G}(\pi) \approx -\mathbb{E}_{q(o \mid \pi)}\left[ \ln \tilde{p}(o) \right].$$
Defining the desired distribution $\tilde{p}(o)$ to be a Boltzmann distribution around the reward $r(o)$, i.e., $\tilde{p}(o) \propto \exp(r(o))$, the extrinsic value term simply reduces, up to an additive constant, to the average reward expected under the variational belief distribution in the future. The only remaining difference to RL is that in model-free RL the reward is evaluated under the true environmental distribution instead of under the model distribution, while in model-based RL the reward is in practice evaluated under the model distribution, so the equivalence is more exact.
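The reduction to expected reward can be verified on a small discrete example. This is a sketch with made-up numbers: `rewards` holds a hypothetical reward for each of three outcomes, and `q_a`, `q_b` are the predicted outcome distributions of two hypothetical policies.

```python
import math


def extrinsic_value(q, rewards):
    """E_q[ln p~(o)] for a Boltzmann preference p~(o) proportional to exp(r(o))."""
    log_z = math.log(sum(math.exp(r) for r in rewards))  # normalizing constant
    return sum(qi * (r - log_z) for qi, r in zip(q, rewards))


def expected_reward(q, rewards):
    """E_q[r(o)], the quantity maximized in reinforcement learning."""
    return sum(qi * r for qi, r in zip(q, rewards))


rewards = [0.0, 1.0, 3.0]
q_a = [0.7, 0.2, 0.1]   # policy A mostly reaches the low-reward outcome
q_b = [0.1, 0.2, 0.7]   # policy B mostly reaches the high-reward outcome

gap_extrinsic = extrinsic_value(q_b, rewards) - extrinsic_value(q_a, rewards)
gap_reward = expected_reward(q_b, rewards) - expected_reward(q_a, rewards)
print(gap_extrinsic, gap_reward)
```

The two criteria differ only by the constant log-normalizer, so they rank policies identically.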
Importantly, however, active inference generalizes and extends reinforcement learning in several important ways. Firstly, the expected free energy objective generalizes the notion of utility by including an intrinsic informational objective, which encourages exploration and equips agents with an ‘intrinsic curiosity’. It can be shown that this exploratory drive enables agents to perform better in many environments which are high dimensional with sparse rewards, where exploration is necessary to solve the task [55]. From a mathematical perspective, it is also possible to generalize the expected free energy to a class of objectives called divergence functionals, which all result in this combination of reward-seeking and information-seeking behaviour [70, 116]. Secondly, active inference’s notion of encoding goals as a prior distribution over observations is more flexible than the use of rewards in reinforcement learning, since, as shown above, RL implicitly assumes that the prior distribution is Boltzmann, while active inference is free to use any other prior distribution instead, which enables considerably more flexibility in specifying goals.
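To make the role of the informational term concrete, the sketch below evaluates a one-step expected free energy for a discrete model as the negative sum of an expected information gain about the hidden state and an expected log preference. It is a simplified two-term form under stated assumptions; the likelihood matrices and flat preferences are invented for illustration.

```python
import math


def expected_free_energy(q_states, likelihood, log_preferences):
    """One-step G = -(expected information gain) - (expected log preference).

    q_states[s]       : predicted state distribution under a policy
    likelihood[s][o]  : p(o | s)
    log_preferences[o]: log of the preferred outcome distribution
    """
    n_states, n_obs = len(q_states), len(log_preferences)
    # Predicted outcome distribution q(o) = sum_s q(s) p(o | s).
    q_obs = [
        sum(q_states[s] * likelihood[s][o] for s in range(n_states))
        for o in range(n_obs)
    ]
    # Mutual information between states and outcomes (intrinsic value).
    info_gain = 0.0
    for s in range(n_states):
        for o in range(n_obs):
            p = likelihood[s][o]
            if q_states[s] > 0 and p > 0:
                info_gain += q_states[s] * p * math.log(p / q_obs[o])
    # Expected log preference (extrinsic value).
    extrinsic = sum(q_obs[o] * log_preferences[o] for o in range(n_obs))
    return -info_gain - extrinsic


q_states = [0.5, 0.5]
flat_preferences = [math.log(0.5), math.log(0.5)]   # no outcome is preferred
informative = [[0.9, 0.1], [0.1, 0.9]]              # outcomes reveal the state
uninformative = [[0.5, 0.5], [0.5, 0.5]]            # outcomes reveal nothing

g_informative = expected_free_energy(q_states, informative, flat_preferences)
g_uninformative = expected_free_energy(q_states, uninformative, flat_preferences)
print(g_informative, g_uninformative)
```

With identical (flat) preferences, the option whose outcomes disambiguate the hidden state receives the lower expected free energy, which is the exploratory drive that a pure reward objective lacks.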
X Benefits and Challenges
For robot perception, control and learning, Active Inference is: 1) a unified framework, 2) with functional biological plausibility, 3) using variational Bayesian inference. Each of these three aspects leads to specific expected benefits for robot perception and control, as well as to interesting open research challenges.
X-A Unified framework
Perhaps the most exciting aspect of AIF is the natural integration of perception and action into a single objective, namely the minimization of Variational Free Energy (or the Expected Free Energy when planning). A potential result is that an AIF agent could use action to make its world behave more predictably. This introduces, in a sense, a double model-bias; not only is state estimation recursively biased towards a Bayesian prior, even the actions help reinforce this bias. In other words, rather than trying to accurately model the world, and requiring unfathomable amounts of data to do so, the AIF agent gets by with a strongly simplified model which it enforces (where possible) through its own actions. This may hold an essential key to solving the data and experience hunger of present-day state-of-the-art learning algorithms.
There are several interesting research challenges still to be solved, related to this model-bias. First, the question is how to avoid suboptimal convergence [117], possibly through the presence of intrinsic value in the optimization criterion [31]. Second, if the action consists of some continuous-time feedback signal, it is a challenge to find the update rule for (the parameters of) that feedback signal. For simple systems, the derivative of the observations with respect to the action, which the gradient of the VFE requires, can be computed directly (e.g., [5]). However, this is very challenging for complex systems (see the closed-form equation in [63, appendix] for the case in which the system is known), and hence it is usually approximated with a directional constant [13, 18, 23] or sidestepped by performing the control on the proprioceptive states [13, 20]. Potentially interesting alternative solutions exist in the concepts of adaptive interaction [118] or direct gradient descent control [119]. Finally, there is a challenging question regarding the purpose of the belief (internal state), which in vanilla AIF is biased towards the desired state. This means that, unless the agent is at the target state, the belief will be biased and therefore does not accurately represent the actual hidden state. A potential solution might consist of introducing the action in the generative model [115, 22] in the same way that we included it in the planning.
As a second expected benefit, AIF not only integrates action with perception, it also provides for a natural Bayesian integration of intrinsic and extrinsic value into a single planning objective. This is formally appealing. An interesting remaining research challenge is to find a principled way to balance intrinsic and extrinsic value, regulated in AIF by the precision of the priors on desired observations [79], to completely remove the need for a heuristic exploration/exploitation tradeoff.
Finally, AIF provides a way to integrate generative models in a hierarchical fashion aiding the construction of complex probabilistic controllers [95].
X-B Functional biological plausibility
The concept of AIF has a very strong presence in the field of neuroscience, with an ever-increasing list of showcases of the functional biological plausibility of the concept [71]. This naturally leads to great expectations for the field of robotics; if AIF is indeed an accurate mathematical description of the neurological processes underlying biological perception, control, and most importantly cognition, then it may bring similar cognitive skills to our robots. Specifically, the AIF approach (and the general framework of predictive coding) allows one, by construction [120], to go beyond control and achieve high-order cognitive and metacognitive capabilities, such as monitoring, self-explainability and, to some degree, “awareness”.
In a broader sense of metacognition, i.e., cognition about cognition [121], the robot should be able to evaluate and monitor the first-order cognitive processes [122]. One distinctive characteristic of AIF is that it uses a second-order formulation in terms of the precision of predictions. The predictive model does not just predict sensations but also their predictability. This means that robots can become self-attentive and monitor their own uncertainty. This is relevant for applications such as human-robot interaction in industrial [44] and healthcare settings.
Furthermore, there are interesting connections between AIF and two awareness abilities in humans that may be essential for interaction, and would bring to robotics a way to enforce explainability and safety. First, self-awareness, where the agent is able to differentiate itself as an independent entity. For instance, Tani and colleagues [123, 124] as well as Lanillos et al. [47, 48] proposed that the prediction/reconstruction error for the past sensation could be related to the sense of self-awareness in machines. Second, agency, i.e., the feeling of controlling the actions and consequences. Hence, robotic solutions for body self-awareness [6, 91] and models of agency are exciting opportunities under the AIF approach.
Additionally, there is an open research question of a very different nature, but also connected to the biological plausibility of Active Inference. Inherited from Friston’s world-renowned work on dynamic causal modeling (DCM) for brain imaging [125], AIF uses the concept of “generalized coordinates”, i.e., appending to the system states the derivatives of those states up to the 6th order, which is quite alien for control engineers. We have shown that this improves estimation accuracy in drones in wind [10], for example. The open research question is to what extent the use of generalized coordinates is worth the methodological investment for generic robotic applications, and whether it can be replaced by extending system model states with noise filter model states.
X-C Variational Bayesian Inference
The fundamental mathematical operation of AIF is variational Bayesian inference, an approach which has already been gaining popularity within the robotics community [126, 127, 112]. The expected benefit is that, when a proper (hierarchical) set of Bayesian priors is in place, robots will be able to perceive, decide, and learn with much less data, or with far fewer trials, than current systems. To obtain such a proper set of priors, we expect that AIF will provide a highly natural framework for humans to teach robots; the required hierarchical set of Bayesian priors can possibly be obtained through a proper curriculum of demonstration and training. This is still a wide open and very exciting research area.
Second, we expect that the variational Bayesian inference approach will lead to a leap forward in fault monitoring and fault tolerance. If the entire control hierarchy is based on prediction errors, then all unexpected sensory inputs will be noticed. By maintaining not only state-dependent expectations of the sensor value itself, but also expectations of the variance of those values, the system will only trigger on signals that are outside regular bounds. As a follow-up to our preliminary robot arm experiments, we expect eventually to see robots that predict all sensor signals at all times, and detect any unexpected behaviour, including (unexpected) collisions, sensor/actuator malfunction, sample frequency hiccups, and even unexpected user behaviour.
Third, we expect that the variational Bayesian inference approach will help alleviate the combinatorial explosion associated with making longer-term plans, and the accompanying deterioration in the accuracy of predictions with the number of planning steps. In principle this should be an emergent property of AIF, in the sense that the objective to minimize free energy entails a minimization of complexity, both statistically and in terms of computational cost via the Jarzynski equality (see [32] for an approach along these lines). AIF suggests possible approaches to address both issues. For example, it suggests a principled solution to the combinatorial explosion of plans in terms of an approximate factorisation of the distributions over actions, a technique core to variational approaches in general. Uncertainty accumulated during the execution of a given policy can be quantified and thus actions planned accordingly. The generality of the variational Bayesian formulation provides an intuitive foundation from which to develop new uncertainty-sensitive approaches and gives a very appealing probabilistic solution to model-predictive control.
XI Summary
We discussed how a theory of cognition originating in computational neuroscience opened up opportunities to improve robotic systems. In particular, we detailed its application in estimation, adaptive control, planning and learning. We described both the mathematical formulation of AIF and the most relevant works in the literature, and showcased some experiments and lessons learned from deploying AIF in real robotic systems, such as industrial manipulators or humanoids. We also described its connection with other fields like classical control and RL. Finally, we discussed the benefits and challenges of this approach to transform AIF into a standard methodology in robotics and to give robots human-like interactive capabilities.
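The idea of triggering only on signals outside their expected bounds can be sketched as a check on precision-weighted prediction errors. This is an illustrative toy, not the fault detector of any cited work; the function name, the threshold of three standard deviations and the sample values are assumptions.

```python
import math


def detect_faults(observations, predicted_means, predicted_vars, threshold=3.0):
    """Return the indices whose prediction error, scaled by the predicted
    standard deviation, exceeds `threshold`."""
    flagged = []
    for i, (y, m, v) in enumerate(zip(observations, predicted_means, predicted_vars)):
        scaled_error = abs(y - m) / math.sqrt(v)
        if scaled_error > threshold:
            flagged.append(i)
    return flagged


observations = [0.1, -0.2, 2.5, 2.5]
predicted_means = [0.0, 0.0, 0.0, 0.0]
# The last reading is expected to be noisy, so the same deviation is tolerated there.
predicted_vars = [0.01, 0.01, 0.01, 4.0]

print(detect_faults(observations, predicted_means, predicted_vars))  # prints [2]
```

The same deviation of 2.5 is flagged when the model expects a precise signal and tolerated when the model expects a noisy one, which is why predicting the variance matters as much as predicting the value.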
Acknowledgment
We would like to thank Karl Friston for his comments on the manuscript and his invaluable inspiration for AIF in robotics.
References
 [1] K. Friston, “The free-energy principle: a unified brain theory?” Nature reviews neuroscience, vol. 11, no. 2, pp. 127–138, 2010.
 [2] H. v. Helmholtz, Handbuch der physiologischen Optik. L. Voss, 1867.
 [3] K. M. Dallenbach, “A puzzle-picture with a new principle of concealment,” The American journal of psychology, pp. 431–433, 1951.
 [4] R. P. Rao and D. H. Ballard, “Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects,” Nature neuroscience, vol. 2, no. 1, pp. 79–87, 1999.
 [5] C. L. Buckley, C. S. Kim, S. McGregor, and A. K. Seth, “The free energy principle for action and perception: A mathematical review,” Journal of Mathematical Psychology, vol. 81, pp. 55–79, 2017.
 [6] P. Lanillos and M. van Gerven, “Neuroscience-inspired perception-action in robotics: applying active inference for state estimation, control and self-perception,” arXiv preprint arXiv:2105.04261, 2021.
 [7] P. Lanillos, S. Franklin, A. Maselli, and D. W. Franklin, “Active strategies for multisensory conflict suppression in the virtual hand illusion,” Scientific Reports, 2021.
 [8] A. Anil Meera and M. Wisse, “Dynamic expectation maximization algorithm for estimation of linear systems with colored noise,” Entropy, vol. 23, no. 10, 2021.
 [9] A. A. Meera and M. Wisse, “Free energy principle based state and input observer design for linear systems with colored noise,” in 2020 American Control Conference (ACC). IEEE, 2020, pp. 5052–5058.
 [10] A. Anil Meera and M. Wisse, “A brain inspired learning algorithm for the perception of a quadrotor in wind,” arXiv preprint arXiv:2109.11971, 2021.
 [11] F. Bos, A. Anil Meera, D. Benders, and M. Wisse, “Free energy principle for state and input estimation of a quadcopter flying in wind,” arXiv preprint arXiv:2109.12052, 2021.
 [12] P. Lanillos and G. Cheng, “Adaptive robot body learning and estimation through predictive coding,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 4083–4090.
 [13] G. Oliver, P. Lanillos, and G. Cheng, “An empirical study of active inference on a humanoid robot,” IEEE Transactions on Cognitive and Developmental Systems, 2021.
 [14] D. Burghardt and P. Lanillos, “Robot localization and navigation through predictive processing using lidar,” arXiv preprint arXiv:2109.04139, 2021.
 [15] L. Pio-Lopez, A. Nizard, K. Friston, and G. Pezzulo, “Active inference and robot control: a case study,” Journal of The Royal Society Interface, vol. 13, no. 122, p. 20160616, 2016.
 [16] P. Lanillos and G. Cheng, “Active inference with function learning for robot body perception,” in Proc. Int. Workshop Continual Unsupervised Sensorimotor Learn., 2018, pp. 1–5.
 [17] C. Pezzato, R. Ferrari, and C. H. Corbato, “A novel adaptive controller for robot manipulators based on active inference,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 2973–2980, 2020.
 [18] M. Baioumy, P. Duckworth, B. Lacerda, and N. Hawes, “Active inference for integrated state-estimation, control, and learning,” in International conference on Robotics and Automation, ICRA, 2021.
 [19] C. Sancaktar, M. van Gerven, and P. Lanillos, “End-to-end pixel-based deep active inference for body perception and action,” in 2020 Joint IEEE 10th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), 2020.
 [20] C. Meo and P. Lanillos, “Multimodal vae active inference controller,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021.
 [21] T. Rood, M. van Gerven, and P. Lanillos, “A deep active inference model of the rubber-hand illusion,” arXiv preprint arXiv:2008.07408, 2020.
 [22] M. Baioumy, C. Pezzato, R. Ferrari, C. H. Corbato, and N. Hawes, “Fault-tolerant control of robot manipulators with sensory faults using unbiased active inference,” in European Control Conference, ECC, 2021.
 [23] C. Pezzato, M. Baioumy, C. H. Corbato, N. Hawes, M. Wisse, and R. Ferrari, “Active inference for fault tolerant control of robot manipulators with sensory faults,” in International Workshop on Active Inference. Springer, 2020, pp. 20–27.
 [24] M. Baltieri and C. L. Buckley, “An active inference implementation of phototaxis,” in ECAL 2017, the Fourteenth European Conference on Artificial Life, 2017.
 [25] K. J. Friston, J. Daunizeau, and S. J. Kiebel, “Reinforcement learning or active inference?” PloS one, vol. 4, no. 7, p. e6421, 2009.
 [26] B. Millidge, A. Tschantz, A. K. Seth, and C. L. Buckley, “Reinforcement learning as iterative and amortised inference,” 2020.
 [27] N. Sajid, P. J. Ball, and K. J. Friston, “Active inference: demystified and compared,” arXiv, pp. arXiv–1909, 2019.
 [28] A. Tschantz, M. Baltieri, A. K. Seth, and C. L. Buckley, “Scaling active inference,” in 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 2020, pp. 1–8.
 [29] D. Han, K. Doya, and J. Tani, “Goal-directed planning by reinforcement learning and active inference,” arXiv preprint arXiv:2106.09938v2, 2021.
 [30] B. Millidge, “Combining active inference and hierarchical predictive coding: A tutorial introduction and case study,” 2019.
 [31] A. Tschantz, A. K. Seth, and C. L. Buckley, “Learning action-oriented models through active inference,” PLoS computational biology, vol. 16, no. 4, p. e1007805, 2020.
 [32] K. Friston, L. Da Costa, D. Hafner, C. Hesp, and T. Parr, “Sophisticated inference,” arXiv preprint arXiv:2006.04120, 2020.
 [33] K. Ueltzhöffer, “Deep active inference,” Biological cybernetics, vol. 112, no. 6, pp. 547–573, 2018.
 [34] O. van der Himst and P. Lanillos, “Deep active inference for partially observable mdps,” Communications in Computer and Information Science, p. 61–71, 2020.
 [35] Z. Fountas, N. Sajid, P. A. M. Mediano, and K. Friston, “Deep active inference agents using Monte-Carlo methods,” 2020.
 [36] B. Millidge, “Deep active inference as variational policy gradients,” Journal of Mathematical Psychology, vol. 96, p. 102348, 2020.
 [37] O. Çatal, S. Wauthier, C. De Boom, T. Verbelen, and B. Dhoedt, “Learning generative state space models for active inference,” Frontiers in Computational Neuroscience, vol. 14, p. 103, 2020.
 [38] O. Çatal, S. Wauthier, T. Verbelen, C. De Boom, and B. Dhoedt, “Deep active inference for autonomous robot navigation,” in The bridging AI and cognitive science (BAICS) workshop, ICLR, 2020.
 [39] O. Çatal, T. Verbelen, T. Van de Maele, B. Dhoedt, and A. Safron, “Robot navigation as hierarchical active inference,” Neural Networks, vol. 142, pp. 192–204, 2021.
 [40] T. Matsumoto and J. Tani, “Goal-directed planning for habituated agents by active inference using a variational recurrent neural network,” Entropy, vol. 22, no. 5, p. 564, May 2020.
 [41] M. Traub, M. V. Butz, R. Legenstein, and S. Otte, “Dynamic action inference with recurrent spiking neural networks,” in Artificial Neural Networks and Machine Learning – ICANN 2021, I. Farkas, P. Masulli, S. Otte, and S. Wermter, Eds. Cham: Springer International Publishing, 2021, pp. 233–244.
 [42] C. Pezzato, C. Hernandez, S. Bonhof, and M. Wisse, “Active inference and behavior trees for reactive action planning and execution in robotics,” arXiv preprint arXiv:2011.09756, 2020.
 [43] S. Murata, Y. Yamashita, H. Arie, T. Ogata, S. Sugano, and J. Tani, “Learning to perceive the world as probabilistic or deterministic via interaction with others: A neurorobotics experiment,” IEEE transactions on neural networks and learning systems, vol. 28, no. 4, pp. 830–848, 2015.
 [44] W. Ohata and J. Tani, “Investigation of the sense of agency in social cognition, based on frameworks of predictive coding and active inference: A simulation study on multimodal imitative interaction,” Frontiers in Neurorobotics, vol. 14, Sep 2020.
 [45] H. F. Chame, A. Ahmadi, and J. Tani, “A hybrid human-neurorobotics approach to primary intersubjectivity via active inference,” Frontiers in Psychology, vol. 11, p. 3207, 2020.
 [46] H. F. Chame and J. Tani, “Cognitive and motor compliance in intentional human-robot interaction,” 2020.
 [47] P. Lanillos, J. Pages, and G. Cheng, “Robot self/other distinction: Active inference meets neural networks learning in a mirror,” in European Conference on Artificial Intelligence. Amsterdam: IOS Press, 2020.
 [48] M. Hoffmann, S. Wang, V. Outrata, E. Alzueta, and P. Lanillos, “Robot in the mirror: toward an embodied computational model of mirror self-recognition,” KI-Künstliche Intelligenz, vol. 35, no. 1, pp. 37–51, 2021.
 [49] M. Kirchhoff, T. Parr, E. Palacios, K. Friston, and J. Kiverstein, “The markov blankets of life: autonomy, active inference and the free energy principle,” Journal of The royal society interface, vol. 15, no. 138, p. 20170792, 2018.
 [50] R. Bogacz, “A tutorial on the free-energy framework for modelling perception and learning,” Journal of mathematical psychology, vol. 76, pp. 198–211, 2017.
 [51] M. Baltieri and C. L. Buckley, “Pid control as a process of active inference with linear generative models,” Entropy, vol. 21, no. 3, p. 257, 2019.
 [52] T. van de Laar, A. Özçelikkale, and H. Wymeersch, “Application of the free energy principle to estimation and control,” 2020.
 [53] M. Baioumy, M. Mattamala, and N. Hawes, “Variational inference for predictive and reactive controllers,” in ICRA 2020 Workshop on New advances in Braininspired Perception, Interaction and Learning, Paris, France, 2020.
 [54] L. Da Costa, T. Parr, N. Sajid, S. Veselic, V. Neacsu, and K. Friston, “Active inference on discrete state-spaces: a synthesis,” Journal of Mathematical Psychology, vol. 99, p. 102447, 2020.
 [55] A. Tschantz, B. Millidge, A. K. Seth, and C. L. Buckley, “Reinforcement learning through active inference,” arXiv preprint arXiv:2002.12636, 2020.
 [56] K. Friston, N. Trujillo-Barreto, and J. Daunizeau, “Dem: A variational treatment of dynamic systems,” NeuroImage, vol. 41, no. 3, pp. 849–885, 2008.
 [57] A. Ciria, G. Schillaci, G. Pezzulo, V. V. Hafner, and B. Lara, “Predictive processing in cognitive robotics: a review,” 2021.
 [58] M. Spratling, “A review of predictive coding algorithms,” Brain and Cognition, vol. 112, pp. 92–97, 2017, perspectives on Human Probabilistic Inferences and the ’Bayesian Brain’.
 [59] G. Pezzulo, F. Rigoli, and K. Friston, “Active inference, homeostatic regulation and adaptive behavioural control,” Progress in Neurobiology, vol. 134, pp. 17–35, 2015.
 [60] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, “An introduction to variational methods for graphical models,” Machine learning, vol. 37, no. 2, pp. 183–233, 1999.
 [61] K. Friston, J. Mattout, N. Trujillo-Barreto, J. Ashburner, and W. Penny, “Variational free energy and the laplace approximation,” NeuroImage, vol. 34, pp. 220–34, 02 2007.
 [62] C. Buckley, C. Kim, S. McGregor, and A. Seth, “The free energy principle for action and perception: A mathematical review,” Journal of Mathematical Psychology, vol. 81, pp. 55–79, 2017.
 [63] K. J. Friston, J. Daunizeau, J. Kilner, and S. J. Kiebel, “Action and behavior: a free-energy formulation,” Biological cybernetics, vol. 102, no. 3, pp. 227–260, 2010.
 [64] K. Friston and S. Kiebel, “Predictive coding under the free-energy principle,” Philosophical transactions of the Royal Society B: Biological sciences, vol. 364, no. 1521, pp. 1211–1221, 2009.
 [65] K. J. Friston, N. Trujillo-Barreto, and J. Daunizeau, “Dem: a variational treatment of dynamic systems,” Neuroimage, vol. 41, no. 3, pp. 849–885, 2008.
 [66] A. Anil Meera and M. Wisse, “On the convergence of dem’s linear parameter estimator,” in International Workshop on Active Inference, 2021.
 [67] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and acting in partially observable stochastic domains,” Artificial intelligence, vol. 101, no. 1–2, pp. 99–134, 1998.
 [68] K. Friston, F. Rigoli, D. Ognibene, C. Mathys, T. Fitzgerald, and G. Pezzulo, “Active inference and epistemic value,” Cognitive neuroscience, vol. 6, no. 4, pp. 187–214, 2015.
 [69] T. Parr and K. J. Friston, “Generalised free energy and active inference,” Biological cybernetics, vol. 113, no. 5, pp. 495–513, 2019.
 [70] B. Millidge, A. Tschantz, and C. L. Buckley, “Whence the expected free energy?” Neural Computation, vol. 33, no. 2, pp. 447–482, 2021.
 [71] K. Friston, T. FitzGerald, F. Rigoli, P. Schwartenbeck, and G. Pezzulo, “Active inference: a process theory,” Neural computation, vol. 29, no. 1, pp. 1–49, 2017.
 [72] J. Schmidhuber, “Formal theory of creativity, fun, and intrinsic motivation (1990–2010),” IEEE Transactions on Autonomous Mental Development, vol. 2, no. 3, pp. 230–247, 2010.
 [73] P.-Y. Oudeyer and F. Kaplan, “What is intrinsic motivation? a typology of computational approaches,” Frontiers in neurorobotics, vol. 1, p. 6, 2009.
 [74] P. Schwartenbeck, J. Passecker, T. U. Hauser, T. H. FitzGerald, M. Kronbichler, and K. J. Friston, “Computational mechanisms of curiosity and goal-directed exploration,” Elife, vol. 8, p. e41703, 2019.
 [75] P. Lanillos and G. Cheng, “Active inference with function learning for robot body perception,” International Workshop on Continual Unsupervised Sensorimotor Learning, IEEE Developmental Learning and Epigenetic Robotics (ICDL-Epirob), 2018.
 [76] T. Parr, N. Sajid, L. Da Costa, M. B. Mirza, and K. J. Friston, “Generative models for active vision,” Frontiers in Neurorobotics, vol. 15, p. 34, 2021.
 [77] N. Sajid, P. J. Ball, T. Parr, and K. J. Friston, “Active inference: demystified and compared,” Neural Computation, vol. 33, no. 3, pp. 674–712, 2021.
 [78] O. Çatal, J. Nauta, T. Verbelen, P. Simoens, and B. Dhoedt, “Bayesian policy selection using active inference,” 2019.
 [79] N. van Hoeffelen and P. Lanillos, “Deep active inference for pixel-based discrete control: Evaluation on the car racing problem,” arXiv preprint arXiv:2109.04155, 2021.
 [80] Y. Yamashita and J. Tani, “Emergence of functional hierarchy in a multiple timescale neural network model: a humanoid robot experiment,” PLoS computational biology, vol. 4, no. 11, p. e1000220, 2008.
 [81] A. Ahmadi and J. Tani, “A novel predictive-coding-inspired variational rnn model for online prediction and recognition,” Neural computation, vol. 31, no. 11, pp. 2025–2074, 2019.
 [82] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio, “A recurrent latent variable model for sequential data,” in Advances in neural information processing systems, 2015, pp. 2980–2988.
 [83] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” California Univ San Diego La Jolla Inst for Cognitive Science, Tech. Rep., 1985.
 [84] H. F. Chame and J. Tani, “Cognitive and motor compliance in intentional human-robot interaction,” in Proc. of 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 11 291–11 297.
 [85] H. F. Chame, A. Ahmadi, and J. Tani, “A hybrid human-neurorobotics approach to primary intersubjectivity via active inference,” Frontiers in Psychology, vol. 11, p. 3207, 2020.
 [86] N. Wirkuttis and J. Tani, “Leading or following? dyadic robot imitative interaction using the active inference framework,” IEEE Robotics and Automation Letters, vol. 6, no. 3, pp. 6024–6031, 2021.
 [87] T. Matsumoto and J. Tani, “Goal-directed planning for habituated agents by active inference using a variational recurrent neural network,” Entropy, vol. 22, no. 5, p. 564, 2020.
 [88] J. Tani, “Learning to generate articulated behavior through the bottom-up and the top-down interaction processes,” Neural networks, vol. 16, no. 1, pp. 11–23, 2003.
 [89] J. Tani, M. Ito, and Y. Sugita, “Self-organization of distributedly represented multiple behavior schemata in a mirror system: reviews of robot experiments using rnnpb,” Neural Networks, vol. 17, no. 8–9, pp. 1273–1289, 2004.
 [90] K. Friston, J. Mattout, and J. Kilner, “Action understanding and active inference,” Biological cybernetics, vol. 104, no. 1, pp. 137–160, 2011.
 [91] N.-A. Hinz, P. Lanillos, H. Mueller, and G. Cheng, “Drifting perceptual patterns suggest prediction errors fusion rather than hypothesis selection: replicating the rubber-hand illusion on a robot,” in 2018 Joint IEEE 8th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob). IEEE, 2018, pp. 125–132.
 [92] K. Astrom, “Theory and applications of adaptive control - a survey,” Automatica, vol. 19, no. 5, pp. 471–486, 1983.
 [93] M. Colledanchise and P. Ogren, “How Behavior Trees Modularize Hybrid Control Systems and Generalize Sequential Behavior Compositions, the Subsumption Architecture, and Decision Trees,” IEEE Transactions on Robotics, vol. 33, no. 2, pp. 372–389, 2017.
 [94] M. Baioumy, C. Pezzato, C. H. Corbato, N. Hawes, and R. Ferrari, “Towards stochastic fault-tolerant control using precision learning and active inference,” in International Workshop on Active Inference. Springer, 2021.
 [95] K. J. Friston, T. Parr, and B. de Vries, “The graphical brain: belief propagation and active inference,” Network Neuroscience, vol. 1, no. 4, pp. 381–414, 2017.
 [96] R. Y. Rubinstein, “Optimization of computer simulation models with rare events,” European Journal of Operational Research, vol. 99, no. 1, pp. 89–112, 1997.
 [97] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International conference on machine learning. PMLR, 2018, pp. 1861–1870.
 [98] P. Mazzaglia, T. Verbelen, and B. Dhoedt, “Contrastive active inference,” Advances in Neural Information Processing Systems, vol. 34, 2021.
 [99] A. D. Noel, C. van Hoof, and B. Millidge, “Online reinforcement learning with sparse rewards through an active inference capsule,” arXiv preprint arXiv:2106.02390, 2021.
 [100] S. Gallagher, “Philosophical conceptions of the self: implications for cognitive science,” Trends in cognitive sciences, vol. 4, no. 1, pp. 14–21, 2000.
 [101] M. Synofzik, G. Vosgerau, and A. Newen, “Beyond the comparator model: a multifactorial two-step account of agency,” Consciousness and cognition, vol. 17, no. 1, pp. 219–239, 2008.
 [102] K. Friston, “Prediction, perception and agency,” International Journal of Psychophysiology, vol. 83, no. 2, pp. 248–252, 2012.
 [103] P. Lanillos, J. Pages, and G. Cheng, “Robot self/other distinction: active inference meets neural networks learning in a mirror,” in European Conference on Artificial Intelligence (ECAI 2020), 2020.
 [104] Y. Nagai, “Predictive learning: its key role in early cognitive development,” Philosophical Transactions of the Royal Society B, vol. 374, no. 1771, p. 20180030, 2019.
 [105] K. Friston, “What is optimal about motor control?” Neuron, vol. 72, no. 3, pp. 488–498, 2011.
 [106] M. Baltieri and C. L. Buckley, “On Kalman-Bucy filters, linear quadratic control and active inference,” arXiv preprint arXiv:2005.06269, 2020.
 [107] R. Rubinstein, “The cross-entropy method for combinatorial and continuous optimization,” Methodology and computing in applied probability, vol. 1, no. 2, pp. 127–190, 1999.
 [108] E. Theodorou, J. Buchli, and S. Schaal, “A generalized path integral control approach to reinforcement learning,” The Journal of Machine Learning Research, vol. 11, pp. 3137–3181, 2010.
 [109] G. Williams, A. Aldrich, and E. A. Theodorou, “Model predictive path integral control: From theory to parallel computation,” Journal of Guidance, Control, and Dynamics, vol. 40, no. 2, pp. 344–357, 2017.
 [110] M. Okada and T. Taniguchi, “Variational inference MPC for Bayesian model-based reinforcement learning,” in Conference on Robot Learning. PMLR, 2020, pp. 258–272.
 [111] M. Botvinick and M. Toussaint, “Planning as inference,” Trends in cognitive sciences, vol. 16, no. 10, pp. 485–488, 2012.
 [112] S. Levine, “Reinforcement learning and control as probabilistic inference: Tutorial and review,” arXiv preprint arXiv:1805.00909, 2018.
 [113] B. Millidge, A. Tschantz, A. K. Seth, and C. L. Buckley, “On the relationship between active inference and control as inference,” arXiv preprint arXiv:2006.12964, 2020.
 [114] J. Watson, A. Imohiosen, and J. Peters, “Active inference or control as inference? a unifying view,” arXiv preprint arXiv:2010.00262, 2020.
 [115] J. Watson, H. Abdulsamad, and J. Peters, “Stochastic optimal control as approximate input inference,” in Conference on Robot Learning. PMLR, 2020, pp. 697–716.
 [116] B. Millidge, A. Tschantz, A. Seth, and C. Buckley, “Understanding the origin of information-seeking exploration in probabilistic objectives for control,” 2021.

 [117] M. Deisenroth and C. E. Rasmussen, “PILCO: A model-based and data-efficient approach to policy search,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11). Citeseer, 2011, pp. 465–472.
 [118] F. Lin, R. D. Brandt, and G. Saikalis, “Self-tuning of PID controllers by adaptive interaction,” in Proceedings of the 2000 American Control Conference. ACC (IEEE Cat. No. 00CH36334), vol. 5. IEEE, 2000, pp. 3676–3681.
 [119] J. Naiborhu, S. Nababan, R. Saragih, and I. Pranoto, “Direct gradient descent control as a dynamic feedback control for linear system,” Bulletin of the Malaysian Mathematical Sciences Society, vol. 29, no. 2, 2006.
 [120] L. Sandved-Smith, C. Hesp, J. Mattout, K. Friston, A. Lutz, and M. J. Ramstead, “Towards a computational phenomenology of mental action: modelling meta-awareness and attentional control with deep parametric active inference,” Neuroscience of consciousness, vol. 2021, no. 1, p. niab018, 2021.
 [121] A. Cleeremans, D. Achoui, A. Beauny, L. Keuninckx, J.-R. Martin, S. Muñoz-Moldes, L. Vuillaume, and A. De Heering, “Learning to be conscious,” Trends in cognitive sciences, vol. 24, no. 2, pp. 112–123, 2020.
 [122] C. Hesp, R. Smith, T. Parr, M. Allen, K. J. Friston, and M. J. Ramstead, “Deeply felt affect: The emergence of valence in deep active inference,” Neural computation, vol. 33, no. 2, pp. 398–446, 2021.
 [123] J. Tani, “An interpretation of the ‘self’ from the dynamical systems perspective: A constructivist approach,” Journal of Consciousness Studies, vol. 5, no. 5–6, pp. 516–542, 1998.
 [124] J. Tani and J. White, “Cognitive neurorobotics and self in the shared world, a focused review of ongoing research,” Adaptive Behavior, p. 1059712320962158, 2020.
 [125] K. J. Friston, L. Harrison, and W. Penny, “Dynamic causal modelling,” Neuroimage, vol. 19, no. 4, pp. 1273–1302, 2003.
 [126] M. Toussaint, “Robot trajectory optimization using approximate inference,” in Proceedings of the 26th annual international conference on machine learning, 2009, pp. 1049–1056.
 [127] M. P. Deisenroth, G. Neumann, J. Peters et al., “A survey on policy search for robotics,” Foundations and trends in Robotics, vol. 2, no. 1–2, pp. 388–403, 2013.