The physician and physicist Hermann von Helmholtz described visual perception as an unconscious mechanism that infers the world. In other words, the brain has generative models that complete or reconstruct the world from partial information. Nowadays, a scientific mainstream describes the inner workings of the brain as those of a Bayesian inference machine. This approach holds that we adjust the contribution of each cue (visual, proprioceptive, tactile, etc.) to our interpretation in a Bayes-optimal way, taking into account sensor and motor uncertainties. This implies that the brain is able to encode uncertainty not only for perception but also for acting in the world. Optimal feedback control was proposed for modelling motor coordination. Alternatively, active inference holds that perception and action are two sides of the same process: an unconscious mechanism that infers and adapts to the environment. Either way, perception is inevitably connected to the body's senses and actuators, the body being the entity of interaction and possibly learned through development.
Several of the brain theories that have arisen in recent decades can be unified under the free-energy principle. This principle accounts for perception, action and learning through the minimization of surprise, that is, the discrepancy between the current state and the predicted or desired one, also known as prediction error. According to this approach, free-energy is a way of quantifying surprise, and it can be optimized by changing the current beliefs (perception) or by acting on the environment (action) to reduce the difference between reality and prediction.
We present robotic body perception as a flexible and dynamic process that approximates the latent body configuration using the error between the expected and the observed sensory information. In this work we provide an active inference mathematical model for a humanoid robot that combines perception and action, extending previous work on free-energy-based body perception. This model enabled the robot to have adaptive body perception and to perform robust reaching behaviors even under high levels of sensor noise and discrepancies between the model and the real robot (Fig. 1). This is due to the way the optimization framework fuses the available information from different sources and to the coupling between action and perception.
I-A Related work
Multisensory perception has been widely studied in the literature and enables the robot to combine joint information with other sensors such as images and tactile cues. Bayesian estimation has been shown to achieve robust and accurate model-based robot arm tracking even under occlusion. Furthermore, integrated visuomotor processes enabled humanoid robots to learn object representations through manipulation without any prior knowledge about them, learn motor representations for robust reaching [13, 14] and even visuotactile motor representations for reaching and avoidance behaviors.
Active inference (under the free-energy principle) includes action as a classical spinal reflex arc pathway triggered by perception prediction errors, and has been mainly studied in theoretical or simulated conditions. Simulated examples include the work presented in [16], a one-degree-of-freedom simulated vehicle and a two-degrees-of-freedom simulated robot arm.
A first model of free-energy optimization in a real robot worked as an approximate Bayesian filter, with which the robot was able to perceive its arm location by fusing visual, proprioceptive and tactile information. However, the authors left out the action. In this work, we go one step further and model and apply active inference to the iCub robot for dual arm reaching with active head tracking. For reproducibility, the code is publicly available (to be released). While the goal of the arms is to minimize the prediction error between the goal (object) and the visual location of the end-effector, the goal of the head is to keep the object centered in the field of view to provide wider and more accurate reaching capabilities.
I-B Paper organization
First, in Sec. II we explain the general mathematical free-energy optimization for perception and action. Afterwards, Sec. III describes the iCub physical model, and in Sec. IV and V we detail the active inference computational model that allows the robot to perform robust reaching and tracking tasks. Finally, Sec. VI shows the results obtained and analyzes the advantages and limitations of the proposed algorithm.
II Free-energy optimization model
II-A Bayesian inference
According to the Bayesian inference model for the brain, the body configuration $x$ is inferred using the available sensory data $s$ by applying Bayes' theorem. (We define body configuration or body schema as a generic way to refer to the body position in space, for instance, the joint angles of the robot.)
$$p(x|s) = \frac{p(s|x)\,p(x)}{p(s)} \tag{1}$$
where the posterior probability, $p(x|s)$, corresponding to the probability of body configuration $x$ given the observed data $s$, is obtained as a consequence of three antecedents: (1) the likelihood, $p(s|x)$, or compatibility of the observed data $s$ with the body configuration $x$, (2) the prior probability, $p(x)$, or current belief about the configuration before receiving the sensory data $s$, also known as previous state belief, and (3) the marginal likelihood, $p(s)$, which corresponds to the marginalization of the likelihood of receiving sensory data $s$ regardless of the configuration. This is a normalization term, $p(s) = \int p(s|x)\,p(x)\,dx$, which ensures that the posterior probability, $p(x|s)$, integrates to 1 over the whole range of $x$.
The goal is to find the value of $x$ that maximizes $p(x|s)$, because it is the most likely value for the real-world body configuration according to the sensory data obtained, $s$. This direct method presents a great difficulty: the marginalization over all the possible body states becomes intractable.
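To make the role of the normalization term concrete, the following minimal sketch evaluates Bayes' theorem on a one-dimensional grid, assuming a single joint angle $x$, one noisy reading $s$, and Gaussian likelihood and prior. The grid, the variances and every function name are hypothetical and chosen only for illustration. In one dimension the integral is a cheap sum; the number of grid points grows exponentially with the number of joints, which is the intractability mentioned above.

```python
import math

def gaussian(value, mean, var):
    """Density of a normal distribution N(mean, var) evaluated at value."""
    return math.exp(-(value - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def grid_posterior(s, grid, prior_mean, prior_var, sensor_var):
    """Return (posterior over the grid, marginal likelihood) for one reading s."""
    dx = grid[1] - grid[0]
    # Joint density p(s|x) p(x) at every candidate configuration x.
    joint = [gaussian(s, x, sensor_var) * gaussian(x, prior_mean, prior_var) for x in grid]
    # Marginal likelihood p(s): a Riemann sum standing in for the integral over x.
    evidence = sum(joint) * dx
    posterior = [j / evidence for j in joint]
    return posterior, evidence

# Hypothetical 1-D example: one joint angle (rad) and one noisy encoder reading.
grid = [i * 0.01 for i in range(-300, 301)]
posterior, evidence = grid_posterior(s=0.5, grid=grid, prior_mean=0.0,
                                     prior_var=1.0, sensor_var=0.25)
best_index = max(range(len(grid)), key=lambda i: posterior[i])
x_map = grid[best_index]  # most likely configuration given s
```

With these numbers the most likely configuration lies between the prior mean and the reading, closer to the reading because the sensor variance is the smaller of the two.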
II-B Free-energy principle
The free-energy principle provides a tractable solution to this obstacle: instead of calculating the marginal likelihood, the idea is to minimize the Kullback-Leibler divergence between a reference distribution $q(x)$ and the real posterior $p(x|s)$, so that the first is a good approximation of the second.
$$D_{KL}\big(q(x)\,\|\,p(x|s)\big) = \int q(x) \ln\frac{q(x)}{p(x|s)}\,dx = \int q(x) \ln\frac{q(x)}{p(s,x)}\,dx + \ln p(s) \tag{2}$$
It is important to note that the marginal likelihood, $p(s)$, is independent of the configuration variable $x$ and that, because $q(x)$ is a probability distribution, its integral over its entire range is 1. Thus, $\ln p(s)$ falls out of the integration.
Maximizing the negative of the first term effectively minimizes the difference between these two densities, with only the marginal likelihood remaining. Unlike the whole expression of the K-L divergence, the first term can be evaluated because it depends only on the reference distribution $q(x)$ and on the knowledge about the environment that we can assume the agent has, $p(s,x) = p(s|x)\,p(x)$. This term is defined as the variational negative free-energy $F$.
$$F = -\int q(x) \ln\frac{q(x)}{p(s,x)}\,dx \tag{3}$$
Maximizing this expression with respect to $q(x)$ is known as free-energy optimization and results in minimizing the K-L divergence between the two distributions. (The expression for free-energy also appears in the literature in its positive version, without the preceding negative sign; in that case the objective of the optimization is to minimize the expression.)
Maximizing $F$ is equivalent to the previous goal of maximizing the posterior probability $p(x|s)$, due to the fact that all probability distributions are non-negative. Considering that the second term of the K-L divergence, $\ln p(s)$, depends neither on the reference distribution $q(x)$ nor on the value of the body configuration $x$, the same value of $x$ optimizes all three quantities: $F$, $D_{KL}$ and $p(x|s)$.
According to the free-energy optimization theory, there are two ways to minimize surprise, which accounts for the discrepancy between the current state and the predicted or desired one (prediction error): changing the belief (perceptual inference) or acting on the world (active inference). Both perceptual inference and active inference optimize the value of the free-energy expression $F$, while active inference also optimizes the value of the marginal likelihood $p(s)$ by acting on the environment and changing the sensory data $s$.
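The relation between negative free-energy, the K-L divergence and the marginal likelihood can be checked numerically. The sketch below does so on a discrete toy problem with three possible body configurations; the prior, likelihood and reference distribution are hypothetical values picked for illustration, and sums replace the integrals of the continuous case.

```python
import math

def kl_divergence(q, p):
    """Discrete K-L divergence D_KL(q || p)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

def negative_free_energy(q, joint):
    """F = -sum_x q(x) ln(q(x) / p(s, x)) for one observed s."""
    return -sum(qi * math.log(qi / ji) for qi, ji in zip(q, joint) if qi > 0.0)

# Hypothetical toy model with three body configurations.
prior = [0.5, 0.3, 0.2]        # p(x)
likelihood = [0.1, 0.6, 0.3]   # p(s|x) for the observed s
joint = [l * p for l, p in zip(likelihood, prior)]  # p(s, x)
evidence = sum(joint)                                # p(s)
posterior = [j / evidence for j in joint]            # p(x|s)

q = [0.2, 0.5, 0.3]  # an arbitrary reference distribution
# The gap between ln p(s) and F is exactly the K-L divergence.
gap = math.log(evidence) - negative_free_energy(q, joint)
```

Setting the reference distribution equal to the posterior closes the gap, so that the negative free-energy reaches its upper bound, the log marginal likelihood.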
II-C Probability density distributions
Let the probability densities follow a normal distribution, with the mean $\mu$ being the value of the body configuration that best relates to the sensory input and prior state knowledge. The density of a normal distribution with respect to $x$ with mean $\mu$ and variance $\sigma^2$ is defined as:
$$f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \tag{4}$$
We then define the reference distribution $q(x)$ as a Dirac delta function centered at the mean value $\mu$ of the normal distribution (4): $q(x) = \delta(x - \mu)$. These assumptions simplify the expression of the variational negative free-energy, considering the fundamental property of the Dirac delta function that the integral over the whole range of $x$ of $\delta(x - \mu)$ multiplied by any arbitrary function $g(x)$ is $g(\mu)$. The variational negative free-energy is now reduced to likelihood and prior probabilities:
$$F = \ln p(s,\mu) = \ln p(s|\mu) + \ln p(\mu) \tag{5}$$
The K-L divergence is non-negative, and equal to $0$ only when $q(x) = p(x|s)$. Negative free-energy is therefore bounded above by $\ln p(s)$; it is non-positive whenever the probabilities involved lie between $0$ and $1$, since the natural logarithm is non-positive in this domain.
II-D Perceptual inference
Perceptual inference is the process of updating the inner model belief to best account for sensory data, minimizing the prediction error.
The agent must update the most likely or optimal value of the body configuration $\mu$ in each state. This optimal value is the one that maximizes negative free-energy; therefore, a first-order iterative gradient ascent algorithm is applied to approach the local maximum of the function. In this case, this means that $\mu$ should be changed proportionally to the gradient of negative free-energy.
For static systems, this update is done directly considering only the gradient ascent formulation: $\dot{\mu} = \partial F / \partial \mu$. In dynamic systems, the time derivatives of the body configuration should be considered. Usually first and second order derivatives are considered, $\mu'$ and $\mu''$, but higher order derivatives could also be considered if their dynamic equations of behavior are known. The state variable is now a vector: $\boldsymbol{\mu} = [\mu, \mu', \mu'']$. In this case, all values and derivatives must be updated taking into consideration the next higher order derivative:
$$\dot{\boldsymbol{\mu}} = D\boldsymbol{\mu} + \frac{\partial F}{\partial \boldsymbol{\mu}} \tag{6}$$
where $D$ is the block-matrix derivative operator with the superdiagonal filled with ones.
When negative free-energy is maximized, the value of its derivative is $\partial F / \partial \boldsymbol{\mu} = 0$, and the system is at equilibrium; i.e. $\dot{\mu} = 0$ for static systems and $\dot{\boldsymbol{\mu}} = D\boldsymbol{\mu}$ for dynamic systems.
The expression (6) denotes the change of $\boldsymbol{\mu}$ with time, and it is used to update the value of $\boldsymbol{\mu}$ with any numerical integration method. We use a simple first-order Euler integration method, where in each iteration the value is calculated using a linear dependency: $\boldsymbol{\mu}_{i+1} = \boldsymbol{\mu}_i + \dot{\boldsymbol{\mu}}_i\,\Delta t$, where $\Delta t$ is the period of execution of the updating cycle for the internal state.
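The static case of this update rule can be sketched in a few lines. The example below assumes a scalar belief, a Gaussian sensor likelihood and a Gaussian prior, so that the gradient of negative free-energy is a sum of two variance-weighted prediction errors. The numbers and function names are hypothetical, and the dynamic case with the derivative operator is not covered.

```python
def free_energy_gradient(mu, s, prior_mean, sensor_var, prior_var):
    """dF/dmu for F = ln N(s; mu, sensor_var) + ln N(mu; prior_mean, prior_var)."""
    sensory_error = (s - mu) / sensor_var        # pulls the belief towards the reading
    prior_error = (prior_mean - mu) / prior_var  # pulls the belief towards the prior
    return sensory_error + prior_error

def perceive(s, mu0, prior_mean, sensor_var, prior_var, dt=0.01, steps=2000):
    """Gradient ascent on F with first-order Euler integration (static system)."""
    mu = mu0
    for _ in range(steps):
        mu_dot = free_energy_gradient(mu, s, prior_mean, sensor_var, prior_var)
        mu = mu + mu_dot * dt
    return mu

# Hypothetical values: reading 0.5, prior centered at 0, sensor four times more precise.
mu_final = perceive(s=0.5, mu0=-1.0, prior_mean=0.0, sensor_var=0.25, prior_var=1.0)
```

At convergence the gradient vanishes and the belief settles at the precision-weighted average of reading and prior, which is the equilibrium condition described above.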
II-E Active inference
Active inference is the extension of perceptual inference to the relationship between sensors and actions, taking into account that actions can change the world to make sensory data more consistent with the predictions made by the inner model.
Action plays a core role in the optimization and improves the approximation of the real distribution, therefore reducing the prediction error by minimizing free-energy. It also acts on the marginal likelihood by changing the real configuration, which modifies the sensory data to obtain new data that is more in concordance with the agent's belief.
In this case, the optimal value is also the one that maximizes negative free-energy, and again a gradient ascent approach is taken to update the value of the action $a$:
$$\dot{a} = \frac{\partial F}{\partial a} \tag{7}$$
where $a$ is updated using a first-order Euler numerical integration with an explicit gain.
The combination of active inference and perception provides the mathematical framework for free-energy optimization.
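The coupling between perception and action can be illustrated with a one-joint toy simulation. This is only a sketch under strong assumptions that are not taken from the robot model: a noiseless encoder, a goal encoded as a Gaussian attractor on the belief, a velocity-controlled joint, and hypothetical variances and gain. The belief is pulled towards the goal, the resulting sensory prediction error drives the action, and the action moves the real joint until prediction and observation agree.

```python
def simulate_reaching(goal, x0=0.0, dt=0.01, steps=20000,
                      var_s=1.0, var_mu=1.0, gain_a=100.0):
    """One-joint active inference loop with Euler integration.

    Assumed F = -(s - mu)^2 / (2 var_s) - (goal - mu)^2 / (2 var_mu).
    """
    x = x0    # real joint angle (hidden from the agent)
    mu = x0   # belief about the joint angle
    a = 0.0   # action: commanded joint velocity
    for _ in range(steps):
        s = x                             # noiseless encoder reading
        err_s = (s - mu) / var_s          # weighted sensory prediction error
        err_goal = (goal - mu) / var_mu   # weighted attractor error
        mu_dot = err_s + err_goal         # dF/dmu
        a_dot = -err_s * dt               # dF/da = dF/ds * ds/da, with ds/da = dt
        mu += dt * mu_dot                 # perception update
        a += dt * gain_a * a_dot          # action update with explicit gain
        x += dt * a                       # the world integrates the velocity
    return x, mu, a

x_end, mu_end, a_end = simulate_reaching(goal=1.0)
```

In this toy setting the real joint, the belief and the goal coincide at equilibrium and the commanded velocity returns to zero. The large gain compensates for the small factor `dt` in the sensitivity of the sensor to the action.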
III Robot physical model
iCub (v1.4) is a 104 centimeter tall, 22 kilogram humanoid robot that resembles a small child, with 53 degrees of freedom powered by electric motors and driven by tendons. The upper body has 38 degrees of freedom, distributed as 7 for each arm, 9 for each hand and 6 for the head (3 in the neck and 3 for the eyes). The lower body has 15 degrees of freedom, 6 for each leg and 3 more in the waist. The software is built on top of the YARP (Yet Another Robot Platform) framework, which facilitates communication between different hardware and software implementations in robotics.
The robot is divided into several kinematic chains, which are distributed according to its extremities. All kinematic chains are defined through homogeneous transformation matrices using the Denavit-Hartenberg convention. We focus on two kinematic chains, those whose end-effectors are the right hand (without considering its fingers) and the left eye.
|Chain||Link||Joint angle offset + range (deg)||Joint name|
|Arm (right/left)||4||-90 + [0, 160.8]||(r/l)_shoulder_roll|
|Arm (right/left)||5||-105 + [-37, 100]||(r/l)_shoulder_yaw|
|Arm (right/left)||6||[5.5, 106]||(r/l)_elbow|
|Head||3||90 + [-40, 30]||neck_pitch|
|Head||5||90 + [-55, 55]||neck_yaw|
Without loss of generality, the arm model is defined as a three degree of freedom (revolute joints) system: r_shoulder_roll, r_shoulder_yaw and r_elbow. The left eye camera observes the end-effector position and the world around it. The joints considered for the motion of the head are neck_pitch and neck_yaw.
The symbolic matrices for the kinematics of these chains in terms of the joint variable were obtained using Mathematica. These are the homogeneous transformation matrices for both complete chains from the local robot origin to the end-effector reference frame, as well as their partial derivatives in terms of its three degrees of freedom.
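For readers unfamiliar with the convention, the sketch below composes standard Denavit-Hartenberg transforms numerically for a chain of revolute joints. It is not the symbolic Mathematica derivation used in this work: the two-link planar arm and its parameters are hypothetical, and reading each joint angle as a fixed offset plus the joint variable is an assumption about how the table entries such as `-90 + [0, 160.8]` are meant.

```python
import math

def dh_transform(a, d, alpha, theta):
    """Homogeneous transform of one link in the standard DH convention."""
    ca, sa = math.cos(alpha), math.sin(alpha)
    ct, st = math.cos(theta), math.sin(theta)
    return [
        [ct, -st * ca, st * sa, a * ct],
        [st, ct * ca, -ct * sa, a * st],
        [0.0, sa, ca, d],
        [0.0, 0.0, 0.0, 1.0],
    ]

def matmul4(m1, m2):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(m1[i][k] * m2[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def forward_kinematics(dh_rows, joint_angles):
    """Chain the link transforms; each row is (a, d, alpha, theta_offset)."""
    pose = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for (a, d, alpha, offset), q in zip(dh_rows, joint_angles):
        pose = matmul4(pose, dh_transform(a, d, alpha, offset + q))
    return pose

# Hypothetical planar arm with two unit-length links (not the iCub parameters).
planar_arm = [(1.0, 0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0)]
pose = forward_kinematics(planar_arm, [math.pi / 2, 0.0])
end_effector = (pose[0][3], pose[1][3], pose[2][3])
```

With the first joint at 90 degrees and the second at zero, the arm points straight along the y axis, so the end-effector sits two link lengths away from the origin.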
IV Active inference computational model for iCub arm reaching task
IV-A Problem formulation
The body configuration, or internal variables, is defined as the joint angles. The estimated states are the beliefs the agent has about the joint angle positions, and the action is the angular velocity of those same joints. Because the joints are velocity controlled, first-order dynamics must also be considered.
The sensory data are obtained through several input sensors that provide information about the position of the end-effector in the visual field and about the joint angle positions.
The likelihood is made up of proprioception functions in terms of the current body configuration, while the prior takes into account the dynamic model of the agent, which describes how this internal state changes with time. The combination of both probabilities formalizes the negative free-energy. Adapting Eq. 5 for the model described in Fig. 2:
IV-B Negative free-energy optimization
In order to define the conditional densities for each of the terms, we should define the expressions for the sensory data. The joint angle position is obtained directly from the joint angle sensors. Let us assume that this input is noisy and follows a normal distribution with mean at the internal value and its own (encoder) variance. The end-effector visual position is defined by a non-linear function that depends on the body configuration and is obtained using the forward model of the right arm and the pinhole camera model for the left eye camera of the robot. Let us assume that this input is noisy and follows a normal distribution with mean at the value of this function and its own (visual) variance. The dynamic model is determined by a function that depends on both the current state and the causal variables (e.g. the visual plane position of the object to be reached). We assume that it is also noisy and follows a normal distribution with mean at the value of this function and its own (dynamics) variance.
Considering the same normal distribution assumptions for the internal state and sensorial terms, the expressions of probability functions are extended to consider all the elements of the vectors, where :
Variational negative free-energy, considering the previous density functions, is obtained by applying the natural logarithm to (10). The product is transformed into a summation due to the properties of the natural logarithm.
The vectorial equations used for the gradient ascent formulation are obtained by differentiating the scalar free-energy term with respect to the internal state vector and the action vector (Eq. (6) and (7)). The dependency of $F$ with respect to the vector of internal variables $\mu$ can be calculated using the chain rule on the functions that depend on those internal variables. The dependency of $F$ with respect to the vector of actions $a$ is calculated considering that the only magnitudes directly affected by action are the values obtained from the sensors.
Even though angular velocity control is being carried out, the agent can also be aware of the values of the first-order dynamics, and they can be updated using a gradient ascent formulation. The dependency of $F$ with respect to the first-order dynamics vector $\mu'$ is limited to the influence of the dynamic model.
Equation (18) shows that the update of the first-order dynamics has a negative feedback, causing a beneficial effect on the stability of the process. This also means that at the equilibrium point the value of the derivative should be zero.
Considering the equations above, the complete update equations are:
A first-order Euler integration method is applied to update the values of $\mu$, $\mu'$ and $a$ in each iteration.
IV-C Perceptual attractor dynamics
We introduce the reaching goal as a perceptual attractor in the visual field as follows:
The internal variable dynamics will be defined in terms of the attractor:
where the mapping function transforms the attractor vector from the target (visual) space to the joint space. The system is velocity controlled, therefore the target space is a linear velocity vector and the joint space is an angular velocity.
The visual Jacobian matrix $J_v$ that relates visual space (2 coordinates) to joint space (3 DOF) is a rectangular matrix, therefore the mapping matrix used is its generalized inverse (Moore-Penrose pseudoinverse), $J_v^{+}$. This matrix is calculated using the singular-value decomposition (SVD), where $J_v = U \Sigma V^{T}$ and $J_v^{+} = V \Sigma^{+} U^{T}$.
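The pseudoinverse construction can be sketched with NumPy as follows. The Jacobian entries and the attractor vector are hypothetical placeholders rather than values from the robot, and the cutoff `tol` for discarding near-zero singular values is an illustrative choice.

```python
import numpy as np

def pseudoinverse_svd(jacobian, tol=1e-10):
    """Moore-Penrose pseudoinverse through the singular-value decomposition."""
    u, singular_values, vt = np.linalg.svd(jacobian, full_matrices=False)
    # Invert only the singular values that are safely away from zero.
    inv_singular = np.array([1.0 / sv if sv > tol else 0.0 for sv in singular_values])
    return vt.T @ np.diag(inv_singular) @ u.T

# Hypothetical 2x3 visual Jacobian: 2 image coordinates, 3 joints.
visual_jacobian = np.array([[1.0, 0.0, 2.0],
                            [0.0, 1.0, -1.0]])
mapping = pseudoinverse_svd(visual_jacobian)   # 3x2 mapping matrix

attractor = np.array([0.3, -0.1])              # desired velocity in the image
joint_velocity = mapping @ attractor           # corresponding joint-space velocity
```

Because the arm has more joints than image coordinates, many joint velocities produce the same image motion; the pseudoinverse selects the one with the smallest norm.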
IV-D Active inference
The action is set to be an angular velocity magnitude, $a$, which corresponds to the angular joint velocity in the latent space. We must calculate the partial-derivative matrices of the sensor values with respect to the action in (17) to quantify the dependency of these values on the velocities of the joints.
We assume that the control action $a$ is updated every cycle, and therefore keeps the same value during each interval of time between cycles. For each period (cycle time between updates), the equation of uniform circular motion is satisfied for each joint position. If this equation is discretized, at each sampling instant the joint position is incremented by the control action multiplied by the sampling period. The dependency of the joint angle position with respect to the control action is therefore defined.
The partial derivatives of joint position with respect to action, considering that there is no cross-influence or coupling between joint velocities and that the sensed joint position and its expected value should converge at equilibrium, are given by the following expression:
If the dependency of joint position with respect to action is known, we can use the chain rule to calculate the dependency for the visual sensor. Considering that the sensed visual values should also converge to their expected values at equilibrium, the partial derivatives are given by the following expression:
V Active inference computational model for iCub head object tracking task
We extend the arm reaching model to the head to obtain an object tracking behavior. The goal of this task is to keep the object in the center of the visual field, thus extending the working range of reaching. Two available degrees of freedom (yaw and pitch) are used for this purpose.
V-A Problem formulation
Sensory data and proprioception for the humanoid robot head are defined by the internal variable beliefs, the actions and the first-order dynamics vectors. Because the end-effector in this case is the camera itself, the only sensory input is the joint angle position.
V-B Negative free-energy optimization
Variational negative free-energy for the head motion is obtained from the conditional densities, and its dependency with respect to the internal variables, actions and first-order dynamics is calculated.
V-C Perceptual attractor dynamics
In order to obtain the desired motion in the visual field, an attractor is defined towards the center of the image. The attractor position is read from the visual input and is dynamically updated with the motion of the head, while the center coordinates have a constant value.
Internal variable dynamics are defined in terms of the attractor. With two pixel coordinates and two degrees of freedom, the inverse of the Jacobian matrix can be used directly as the mapping matrix in the visual space.
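A minimal sketch of this square case is given below. It assumes that the Jacobian maps head joint velocities (ordered as yaw, pitch) to the image velocity of the tracked object, and that the attractor is the weighted pixel offset from the object to the image center. The Jacobian entries, the image size and the sign conventions are hypothetical.

```python
def inverse_2x2(matrix):
    """Closed-form inverse of a 2x2 matrix stored as nested lists."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("Jacobian is singular; the head mapping is undefined")
    return [[d / det, -b / det], [-c / det, a / det]]

def head_joint_velocity(object_px, center_px, jacobian, weight=1.0):
    """Map the image-space attractor to (yaw, pitch) joint velocities."""
    # Attractor: desired image motion of the object, pointing at the center.
    attractor = [weight * (center_px[0] - object_px[0]),
                 weight * (center_px[1] - object_px[1])]
    inv = inverse_2x2(jacobian)
    return [inv[0][0] * attractor[0] + inv[0][1] * attractor[1],
            inv[1][0] * attractor[0] + inv[1][1] * attractor[1]]

# Hypothetical Jacobian in pixels per radian: turning the head towards the
# object moves it the opposite way in the image.
image_jacobian = [[-200.0, 0.0], [0.0, -200.0]]
velocity = head_joint_velocity(object_px=(200.0, 120.0), center_px=(160.0, 120.0),
                               jacobian=image_jacobian)
```

Here the object is 40 pixels to the right of the center, so only the yaw joint is asked to move, and the commanded velocity vanishes once the object is centered.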
VI Results
VI-A Experimental Setup
The iCub robot is placed in a controlled environment with an easily recognizable object in front of it, which serves as a perceptual attractor to produce the movement of the robot. Two of the causal variables are the horizontal and vertical positions of that object in the image plane obtained from the left eye camera, and a further one is a weighting factor that adjusts the strength of the attractor. The right arm end-effector is also recognized through a visual marker placed on the hand of the robot, which provides its horizontal and vertical visual coordinates. The goal is to obtain a reaching behavior in the robot. The proposed algorithm generates an action in the right arm towards the object in order to reduce the discrepancy between the current state and the desired state imposed by the goal.
Three different experiments were performed: (1) robustness, right arm reaching towards a series of locations that the robot must follow in its visual plane, (2) dynamics evaluation, right arm reaching model and active head model towards a moving object, (3) generalization, both arms reaching and active head.
The relevant parameters of the algorithm are: (1) the variance of the encoder sensor, (2) the variance of the visual perception, (3) the variance of the attractor dynamics and (4) the action gains. These parameters were tuned empirically with their physical meaning in mind and remained constant during the experiments, except the encoder variance, which was modified to withstand larger deviations in the encoder-noise condition of the first experiment.
VI-B Right arm reaching with sensory fusion under noisy sensors
The first experiment tests the robustness of the algorithm under two different conditions (Figure 3). The robot has to reach four different static locations in the visual field with the right arm. Once a location is reached, the next location becomes active. A location is considered reached when the visual position is inside a disk with a radius of five pixels centered at the location. Performance is assessed using the root mean square (RMS) of the errors in the visual plane coordinates between the real end-effector location (visual marker) and the target location.
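The reaching criterion and the error measure can be written down compactly. The sketch below assumes that the error of each sample is the Euclidean pixel distance between the marker and the target, and that the boundary of the disk counts as reached; the text does not spell out either detail, and the function names and sample coordinates are hypothetical.

```python
import math

def is_reached(position_px, target_px, radius_px=5.0):
    """True when the visual position lies inside the disk around the target."""
    distance = math.hypot(position_px[0] - target_px[0],
                          position_px[1] - target_px[1])
    return distance <= radius_px

def rms_error(positions_px, target_px):
    """Root mean square of the pixel distances between positions and target."""
    squared = [(u - target_px[0]) ** 2 + (v - target_px[1]) ** 2
               for u, v in positions_px]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical marker positions (pixels) recorded while approaching a target.
target = (0.0, 0.0)
trajectory = [(3.0, 4.0), (0.0, 0.0)]
error = rms_error(trajectory, target)
```

In a sequence of targets, `is_reached` would be evaluated at every cycle and the next location activated as soon as it returns true.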
Under the first condition we did not add any noise (only intrinsic noise from the sensors and model errors were present). We tested the contribution of each source of sensory information to the reaching task: visual, joint angle encoders, and both together. Figures 2(a) and 2(b) show the RMS error and the path followed by the arm, respectively. Even though the model had been verified and the camera calibrated, there was a difference between the forward model and the real robot, due to possible deviations in parameters and motor backlash, which implies that the robot has to use the visual information to correct its real position. Employing joint angle encoders and vision together provides the best behavior for reaching the fixed locations in the visual field, achieving all positions in the most direct way. Visual perception alone also reaches all the positions but does not follow the optimum path, while using only the encoder values fails to reach all locations.
Under the second condition we injected Gaussian noise into the robot motor encoders in order to test the robustness against high noise. Four trials were performed with zero-mean Gaussian noise of increasing standard deviation, starting from no added noise (control test). Figures 2(d) and 2(e) show the reaching error and the followed path for each trial. The runs with no noise and with the lowest noise level achieved very similar results, with the former reaching the objectives slightly faster. At the next noise level, motion was affected by deviations in the path followed by the end-effector. The extreme case caused oscillations and erroneous approach trajectories that produced significant delays in reaching the target locations. These results show the importance of reliable visual perception when discrepancies in the model or unreliable joint angle measurements are present.
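The noise injection of the second condition can be mimicked offline as below. This is a sketch with hypothetical joint values and a hypothetical noise level of 10 degrees, using a seeded generator so that the run is repeatable; it does not reproduce the interface to the robot encoders.

```python
import random
import statistics

def noisy_encoders(joint_angles_deg, noise_std_deg, rng):
    """Add zero-mean Gaussian noise to every encoder reading (degrees)."""
    return [q + rng.gauss(0.0, noise_std_deg) for q in joint_angles_deg]

rng = random.Random(0)          # seeded for repeatability
clean = [30.0, 45.0, 60.0]      # hypothetical joint angles in degrees

# Control test: a standard deviation of zero leaves the readings untouched.
control = noisy_encoders(clean, 0.0, rng)

# Empirical check of the injected noise level on a single joint.
samples = [noisy_encoders([0.0], 10.0, rng)[0] for _ in range(20000)]
sample_std = statistics.pstdev(samples)
```

Feeding such corrupted readings to the perception update while keeping the visual input clean is one way to study how the fused estimate degrades as the encoder noise grows.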
VI-C Right arm reaching of moving object with active head
We evaluated the algorithm with the right arm model and active head for a moving object (manually operated by a human, Fig. 3(i)). Variable dynamics are shown in Fig. 4. The initial position of the right arm lies outside of the visual plane (the visual measurement is missing in Fig. 3(e)). Hence, the free-energy optimization algorithm relies only on joint measurements to produce the reaching motion until the right hand appears in the visual plane (it enters Fig. 3(d) from the top). Fig. 3(a) and 3(b) show both the encoder measurements and the estimated joint angles of the arm and head. Fig. 3(d) and 3(e) show how the calculated and real visual positions of the right arm end-effector follow the perceptual attractor, while the head tries to keep the object at the center of the visual field. Right arm actions are depicted in Fig. 3(g), and the stop action is produced by the sense of touch: contact in any of the pressure sensors triggers the grasping motion of the hand. Finally, Fig. 3(h) shows that the algorithm optimizes (maximizes) the value of negative free-energy for both arm and head.
VI-D Generalization: Dual arm reaching and active head
We generalized the algorithm to dual arm reaching. The free-energy optimization reaching task was replicated for the left arm, obtaining a reaching motion for both arms with a tracking motion performed by the head. The result of this experiment, along with other runs of the previous experiments, can be found in the supplementary video (to be released).
This work presents the first active inference model working on a real humanoid robot for dual arm reaching and active head object tracking. The robot, evaluated with different levels of sensor noise (up to 40 degrees of joint angle deviation), was able to reach the visual goal, compensating for the errors through free-energy optimization. The body configuration was treated as an unobserved variable and the forward model as an approximation of the real end-effector location that is corrected online with visual input, thus tackling model errors. The proposed approach can be generalized to whole-body reaching and can incorporate forward model learning, as shown in previous work.
[1] H. v. Helmholtz, Handbuch der physiologischen Optik. L. Voss, 1867.
[2] D. C. Knill and A. Pouget, “The bayesian brain: the role of uncertainty in neural coding and computation,” TRENDS in Neurosciences, vol. 27, no. 12, pp. 712–719, 2004.
[3] K. Friston, “A theory of cortical responses,” Philos Trans R Soc Lond B: Biological Sciences, vol. 360, no. 1456, pp. 815–836, 2005.
[4] E. Todorov and M. I. Jordan, “Optimal feedback control as a theory of motor coordination,” Nature neuroscience, vol. 5, no. 11, p. 1226, 2002.
[5] K. J. Friston, “The free-energy principle: a unified brain theory?” Nature Reviews. Neuroscience, vol. 11, pp. 127–138, 02 2010.
[6] P. Lanillos, E. Dean-Leon, and G. Cheng, “Yielding self-perception in robots through sensorimotor contingencies,” IEEE Trans. on Cognitive and Developmental Systems, no. 99, pp. 1–1, 2016.
[7] Y. Kuniyoshi and S. Sangawa, “Early motor development from partially ordered neural-body dynamics: experiments with a cortico-spinal-musculo-skeletal model,” Biological cybernetics, vol. 95, p. 589, 2006.
[8] K. J. Friston, J. Daunizeau, J. Kilner, and S. J. Kiebel, “Action and behavior: a free-energy formulation,” Biological cybernetics, vol. 102, no. 3, pp. 227–260, 2010.
[9] P. Lanillos and G. Cheng, “Adaptive robot body learning and estimation through predictive coding,” Intelligent Robots and Systems (IROS), 2018 IEEE/RSJ Int. Conf. on, 2018.
[10] C. Fantacci, U. Pattacini, V. Tikhanoff, and L. Natale, “Visual end-effector tracking using a 3d model-aided particle filter for humanoid robot platforms,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 1411–1418.
[11] C. Garcia Cifuentes, J. Issac, M. Wüthrich, S. Schaal, and J. Bohg, “Probabilistic articulated real-time tracking for robot manipulation,” IEEE Robotics and Automation Letters, vol. PP, 10 2016.
[12] A. Ude, D. Omrčen, and G. Cheng, “Making object learning and recognition an active process,” International Journal of Humanoid Robotics, vol. 5, no. 02, pp. 267–286, 2008.
[13] C. Gaskett and G. Cheng, “Online learning of a motor map for humanoid robot reaching,” in 2nd Int. Conf. on computational inte., robotics and autonomous systems, Singapore, 2003.
[14] L. Jamone, M. Brandao, L. Natale, K. Hashimoto, G. Sandini, and A. Takanishi, “Autonomous online generation of a motor representation of the workspace for intelligent whole-body reaching,” Robotics and Autonomous Systems, vol. 62, no. 4, pp. 556–567, 2014.
[15] A. Roncone, M. Hoffmann, U. Pattacini, L. Fadiga, and G. Metta, “Peripersonal space and margin of safety around the body: learning visuo-tactile associations in a humanoid robot with artificial skin,” PloS one, vol. 11, no. 10, p. e0163713, 2016.
[16] L. Pio-Lopez, A. Nizard, K. Friston, and G. Pezzulo, “Active inference and robot control: a case study,” J R Soc Interface, vol. 13, 2016.
[17] M. Baltieri and C. L. Buckley, “An active inference implementation of phototaxis,” 2018 Conference on Artificial Life, no. 29, pp. 36–43, 2017.
[18] P. Lanillos and G. Cheng, “Active inference with function learning for robot body perception,” International Workshop on Continual Unsupervised Sensorimotor Learning, ICDL-Epirob, 2018.
[19] S. Kullback and R. A. Leibler, “On information and sufficiency,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 03 1951.
[20] R. Bogacz, “A tutorial on the free-energy framework for modelling perception and learning,” Journal of Mathematical Psychology, vol. 76, no. B, pp. 198–211, 2015.
[21] G. Metta, G. Sandini, D. Vernon, L. Natale, and F. Nori, “The icub humanoid robot: An open platform for research in embodied cognition,” Performance Metrics for Intelligent Systems Workshop, 01 2008.
[22] G. Metta, P. Fitzpatrick, and L. Natale, “Yarp: Yet another robot platform,” International Journal of Advanced Robotic Systems, vol. 3(1), pp. 43–48, 2006.