I. Introduction
In robotic systems, perception and action have often been studied in isolation, without an overarching principle of how to put the two processes together. Yet there is compelling evidence that the two processes are interdependent in humans and animals [3]. In the robotics literature, the direct coupling between action and perception has been emphasized especially in behavior-based robotics [4] and by proponents of embodied cognition [5], but more recently, approaches applying machine learning to sensorimotor processing have also focused on the interactive nature of perception (see [1] for a review). The main insight of interactive perception is that sensory processing can be enhanced by manipulating or interacting with the environment. This can be achieved by creating novel signals through movement [6, 7] or by exploiting action-perception regularities that are generated when the same action is performed repeatedly in the same environment [8, 9]. For example, object segmentation can be improved by separating different objects through movement, and object properties like inertia or weight can be estimated through interaction. In such cases, action directly subserves the perceptual process. In other cases, however, interactive perception has the primary objective of achieving a manipulation goal. Defining the objective is therefore critical in determining what kind of sensorimotor coupling can arise.
The formal framework that deals with adaptive systems optimizing arbitrary objective functions under uncertainty is decision theory. A rational agent has to decide which action to take from the action set according to the desirability of the action, quantified by a utility function. A fundamental problem of such perfect-rationality models [10, 11, 12] is that they ignore the computational costs that arise when searching for the maximum-utility action. As such costs can be prohibitive, decision-making with limited information-processing resources has recently been studied extensively in psychology, economics, cognitive science, computer science, and artificial intelligence research [13, 14, 15, 16, 17, 18]. In the following, we argue that such resource limitations are crucial for the emergence of sensorimotor coupling.
I-A An Information-Theoretic Principle for Bounded-Rational Decision-Making with Context Dependence
In this study, we use an information-theoretic model of bounded-rational decision-making [19, 20, 21, 22, 23]. In a decision-making task with context, an agent is presented with a world-state $w$ and has to find an optimal action from a set of admissible actions $\mathcal{A}$. The desirability of an action $a$ under a particular world-state $w$ is quantified by the utility function $U(w,a)$. The objective of the decision-maker is to maximize the utility depending on the context:
\[
a^*(w) = \operatorname*{arg\,max}_{a \in \mathcal{A}} U(w,a) \tag{1}
\]
For an agent with limited computational resources that has to react within a certain time limit, searching for the best action can become intractable, especially when the number of possible actions is enormous. Thus, a bounded-rational agent tries to find a good enough yet tractable solution. In multiple contexts, bounded-rational decision-making requires computing multiple strategies under limited computational resources, which can be expressed as a set of probability distributions $p(a|w)$ over actions $a$ given the different world-states $w$. Mathematically, the informational cost of computing such strategies can be measured in terms of an “information distance”, namely the Kullback-Leibler divergence $D_{\mathrm{KL}}\big(p(a|w)\,\|\,p_0(a)\big)$ from a prior behavior $p_0(a)$ to the posterior strategies $p(a|w)$. This information-processing cost can be motivated on axiomatic grounds [24, 22] and has been used previously in the robotics and control literature [25, 26, 27, 28]. An upper bound $K$ on the Kullback-Leibler divergence constrains the decision-maker to spend at most $K$ bits to adapt its behavior. The resulting optimization problem can be formalized as
\[
\max_{p(a|w)} \sum_w p(w) \Big[ \sum_a p(a|w)\, U(w,a) - \frac{1}{\beta}\, D_{\mathrm{KL}}\big(p(a|w)\,\|\,p_0(a)\big) \Big] \tag{2}
\]
where the inverse temperature $\beta$ governs the trade-off between expected utility and information cost. For $\beta \to \infty$, classic decision theory is recovered, whereas for $\beta \to 0$ the decision-maker has no access to computational resources at all and thus acts according to the prior. When optimizing the expected free energy (2) over all possible contexts $w$, it can readily be shown that the optimal prior is given by the marginal distribution $p(a) = \sum_w p(w)\,p(a|w)$ [see Section 2.1.1 in [29] or [30]]. Plugging in the marginal as the optimal prior yields the following variational principle for bounded-rational decision-making:
\[
\max_{p(a|w)} \sum_w p(w) \sum_a p(a|w)\, U(w,a) - \frac{1}{\beta}\, I(W;A) \tag{3}
\]
where $I(W;A)$ is the mutual information between actions and world-states, measuring the reduction of uncertainty about the action $a$ after observing the world-state $w$, or vice versa. This problem formulation is commonly known as the rate-distortion problem in information theory [31]. The solution of (3) is given by a set of two self-consistent equations:
\[
p(a|w) = \frac{1}{Z(w)}\, p(a)\, \exp\big(\beta\, U(w,a)\big) \tag{4}
\]
\[
p(a) = \sum_w p(w)\, p(a|w) \tag{5}
\]
where $Z(w) = \sum_a p(a) \exp(\beta\, U(w,a))$ is the partition sum. In practice, the solution can be computed using the Blahut-Arimoto algorithm [32, 33, 34] by starting with an initial distribution over actions and then iterating equations (4) and (5) until the distributions converge. The iteration is guaranteed to converge to a global optimum, provided that the initial distribution has the same support as the optimal solution [29, 30, 35].
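For discrete world-states and actions, the iteration of equations (4) and (5) is straightforward to implement. The following sketch shows one possible NumPy implementation; variable names are illustrative and a fixed iteration count stands in for a proper convergence test:

```python
import numpy as np

def blahut_arimoto(U, p_w, beta, n_iter=200):
    """Iterate Eqs. (4)-(5): U is a |W| x |A| utility matrix,
    p_w the world-state distribution, beta the inverse temperature."""
    n_w, n_a = U.shape
    p_a = np.full(n_a, 1.0 / n_a)              # initial prior over actions
    for _ in range(n_iter):
        # Eq. (4): posterior p(a|w) proportional to p(a) exp(beta U(w,a))
        logits = beta * U + np.log(p_a)[None, :]
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p_a_given_w = np.exp(logits)
        p_a_given_w /= p_a_given_w.sum(axis=1, keepdims=True)
        # Eq. (5): marginal p(a) = sum_w p(w) p(a|w)
        p_a = p_w @ p_a_given_w
    return p_a_given_w, p_a
```

For a high inverse temperature and a diagonal utility, the resulting posterior is nearly deterministic, recovering the classic maximum-utility choice.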
I-B An Information-Theoretic Principle for Perception-Action Coupling
We follow the work of [2], where the authors extend the rate-distortion framework to systems with multiple information-processing nodes. The serial perception-action system consists of two stages: a perceptual stage that maps world-states $w$ to observations $x$, and an action stage that maps observations $x$ to actions $a$. The three random variables $W$, $X$ and $A$ for world-state, observation and action form a serial chain of two channels, which is expressed by the graphical model $W \to X \to A$ and implies the following conditional independence:
\[
p(a|x,w) = p(a|x) \tag{6}
\]
We assume that the utility function $U(w,a)$ depends only on the world-state $w$ and the action $a$; the internal percept $x$ does not influence the utility. The price of information processing in the perceptual channel, governed by an inverse temperature $\beta_1$, can differ from the price of information processing in the action channel, governed by $\beta_2$. Formally, we set up the following variational problem:
\[
\max_{p(x|w),\, p(a|x)} J\big(p(x|w), p(a|x)\big) \tag{7}
\]
\[
J\big(p(x|w), p(a|x)\big) = \mathbb{E}_{p(w)\,p(x|w)\,p(a|x)}\big[U(w,a)\big] - \frac{1}{\beta_1}\, I(W;X) - \frac{1}{\beta_2}\, I(X;A) \tag{8}
\]
Here we define $J$ as the overall objective function. Note that in this problem statement both the perceptual channel $p(x|w)$ and the action channel $p(a|x)$ are optimized. This is in contrast to traditional problem statements where a likelihood model is assumed to be given and the decision rule is built on top of this model. The coupling between action and perception thus falls out naturally from extending the rate-distortion problem to Equation (8). Similar to the rate-distortion case of Equation (3), the solution is given by the following set of four analytic self-consistent equations [2]:
\[
p(x|w) = \frac{1}{Z(w)}\, p(x)\, \exp\big(\beta_1\, \Delta F(w,x)\big) \tag{9}
\]
\[
p(x) = \sum_w p(w)\, p(x|w) \tag{10}
\]
\[
p(a|x) = \frac{1}{Z(x)}\, p(a)\, \exp\big(\beta_2\, \mathbb{E}_{p(w|x)}[U(w,a)]\big) \tag{11}
\]
\[
p(a) = \sum_x p(x)\, p(a|x) \tag{12}
\]
where $Z(w)$ and $Z(x)$ denote the corresponding partition sums over percepts given the world-state $w$ and over actions given the internal percept $x$. The conditional probability $p(w|x)$ is given by Bayes' rule, $p(w|x) = p(x|w)\,p(w)/p(x)$, and
\[
\Delta F(w,x) = \sum_a p(a|x) \Big[ U(w,a) - \frac{1}{\beta_2} \log \frac{p(a|x)}{p(a)} \Big]
\]
is the free-energy difference of the action stage. Equations (9)-(12) can be computed by starting with arbitrary initial distributions in lieu of $p(x|w)$, $p(x)$, $p(a|x)$ and $p(a)$, and then iterating (9) to (12) until convergence. As the iterations involve evaluations of the utility function over all possible actions and world-states, the computation can become very costly. Another drawback of such Blahut-Arimoto-style algorithms is that they cannot be applied straightforwardly to continuous problems; closed-form analytic solutions exist only for special cases. Here we propose an alternative online optimization method to solve this problem.
II. Theoretical Results
In this study, we present an algorithm that updates an agent's perceptual module and its behavioural policy (expressed through two separate parametric models) in a joint fashion under constrained information-processing resources, according to the framework of information-theoretic bounded rationality.
Implementation of the perceptual channel
We consider neural networks of the type depicted in Figure 1 as a parameterized model of the perceptual distribution $p_\theta(x|w)$. The network possesses one input layer, one hidden layer and one output layer. The input $w$ is a real-valued column vector representing the world-state. The synaptic weights between the input and the hidden layer are expressed as a real matrix $\theta_1$, and the weights between the hidden and the output layer as another real matrix $\theta_2$. The activation function in the hidden layer is the hyperbolic tangent, $h = \tanh(\theta_1 w)$. We apply a softmax activation function in the output layer to compute the perceptual distribution, $p_\theta(x_j|w) = \exp(z_j)/\sum_k \exp(z_k)$ with $z = \theta_2 h$. Accordingly, the gradients of the log-output distribution with respect to the parameter matrices $\theta_2$ and $\theta_1$ are given by:
\[
\frac{\partial \log p_\theta(x_j|w)}{\partial (\theta_2)_{kl}} = \big(\delta_{jk} - p_\theta(x_k|w)\big)\, h_l \tag{13}
\]
\[
\frac{\partial \log p_\theta(x_j|w)}{\partial (\theta_1)_{kl}} = \big(1 - h_k^2\big)\, w_l \sum_m \big(\delta_{jm} - p_\theta(x_m|w)\big)\, (\theta_2)_{mk} \tag{14}
\]
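The hidden-layer gradient follows from standard backpropagation through the tanh nonlinearity. A finite-difference check of the output-layer gradient (13) can be sketched as follows, with placeholder layer sizes and random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n_w, n_h, n_x = 4, 5, 3                      # placeholder layer sizes
theta1 = rng.normal(size=(n_h, n_w))         # input-to-hidden weights
theta2 = rng.normal(size=(n_x, n_h))         # hidden-to-output weights
w = rng.normal(size=n_w)                     # a world-state vector

def log_p(theta2):
    """Log of the softmax output distribution over percepts x."""
    z = theta2 @ np.tanh(theta1 @ w)
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

j = 1                                        # percept component under test
h = np.tanh(theta1 @ w)
p = np.exp(log_p(theta2))
# analytic gradient of log p(x_j|w) w.r.t. (theta2)_{kl}: (delta_{jk} - p_k) h_l
analytic = (np.eye(n_x)[j] - p)[:, None] * h[None, :]
# central finite differences over every entry of theta2
eps = 1e-6
numeric = np.zeros_like(theta2)
for k in range(n_x):
    for l in range(n_h):
        d = np.zeros_like(theta2)
        d[k, l] = eps
        numeric[k, l] = (log_p(theta2 + d)[j] - log_p(theta2 - d)[j]) / (2 * eps)
```

The analytic and numeric gradients agree to finite-difference precision; the same kind of check can be applied to the hidden-layer gradient (14).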
Implementation of the action channel
Due to the abstract nature of the observation and action spaces, we assume here for simplicity discrete action choices. We therefore parameterize the action channel as a multinomial (categorical) distribution
\[
p_\psi(a|x) = \prod_{k=1}^{K} \psi_k(x)^{a_k} \tag{15}
\]
with dimensionality $K$ and an auxiliary function $\psi$. Note that the parameter $\psi$ is conditioned on observations $x$; in our implementation $\psi$ is expressed and updated as a real-valued matrix of dimensionality $|\mathcal{X}| \times K$. We represent actions as binary-valued vectors in one-hot encoding of the form $a = (a_1, \dots, a_K)$ with $a_k \in \{0,1\}$ and $a_{k'} = 0$ for all $k' \neq k$ where $a_k = 1$. The normalization constraint of a conditional distribution is satisfied by defining $\psi_K(x) = 1 - \sum_{k=1}^{K-1} \psi_k(x)$. Thus, the gradient of the log-action distribution with respect to the parameter $\psi_k(x)$ is given by
\[
\frac{\partial \log p_\psi(a|x)}{\partial \psi_k(x)} = \frac{a_k}{\psi_k(x)} - \frac{a_K}{\psi_K(x)}, \qquad k = 1, \dots, K-1. \tag{16}
\]
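Putting the two parameterizations together, a single forward pass of the agent, sampling a percept from the softmax network and an action from the categorical channel, might look as follows; all sizes and initializations here are placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

n_w, n_h, n_x, n_a = 4, 20, 13, 13        # placeholder dimensions
theta1 = rng.normal(0, 0.1, (n_h, n_w))   # input-to-hidden weights
theta2 = rng.normal(0, 0.1, (n_x, n_h))   # hidden-to-output weights
psi = np.full((n_x, n_a), 1.0 / n_a)      # categorical parameters psi_k(x)

w = rng.normal(size=n_w)                      # a world-state feature vector
p_x = softmax(theta2 @ np.tanh(theta1 @ w))   # perceptual distribution p(x|w)
x = rng.choice(n_x, p=p_x)                    # sample a percept
a = rng.choice(n_a, p=psi[x])                 # sample an action from p(a|x)
```

One such rollout $(w, x, a)$ is exactly the raw material needed by the stochastic parameter updates described next.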
Parameter updates
In the course of the simulation, the bounded-rational decision-maker constantly updates the parameters representing the perceptual and the action channel. The overall objective in (8) is expressed as a parametric function of $\theta_1$, $\theta_2$ and $\psi$:
\[
J(\theta_1, \theta_2, \psi) = \mathbb{E}_{p(w)\, p_\theta(x|w)\, p_\psi(a|x)}\big[U(w,a)\big] - \frac{1}{\beta_1}\, I(W;X) - \frac{1}{\beta_2}\, I(X;A) \tag{17}
\]
By defining the auxiliary term
\[
f(w,x,a) := U(w,a) - \frac{1}{\beta_1} \log \frac{p_\theta(x|w)}{p_\theta(x)} - \frac{1}{\beta_2} \log \frac{p_\psi(a|x)}{p_\psi(a)},
\]
where $p_\theta(x)$ and $p_\psi(a)$ denote the corresponding marginals, the objective can be rewritten as $J = \mathbb{E}[f(w,x,a)]$. Here we apply the log-trick to transform the derivative of $J$ into an expected value, by noticing that for any parametric distribution $p_\phi$ the identity $\frac{\partial p_\phi}{\partial \phi} = p_\phi\, \frac{\partial \log p_\phi}{\partial \phi}$ is valid. This trick allows us to rewrite the derivative of the overall objective as follows:
\[
\frac{\partial J}{\partial \theta_1} = \mathbb{E}\Big[f(w,x,a)\, \frac{\partial \log p_\theta(x|w)}{\partial \theta_1}\Big] \tag{18}
\]
\[
\frac{\partial J}{\partial \theta_2} = \mathbb{E}\Big[f(w,x,a)\, \frac{\partial \log p_\theta(x|w)}{\partial \theta_2}\Big] \tag{19}
\]
\[
\frac{\partial J}{\partial \psi} = \mathbb{E}\Big[f(w,x,a)\, \frac{\partial \log p_\psi(a|x)}{\partial \psi}\Big] \tag{20}
\]
The expectation values can be approximated by drawing sample triplets $(w, x, a)$ from the joint distribution $p(w)\, p_\theta(x|w)\, p_\psi(a|x)$. The number of samples $N$ governs the accuracy of the approximation: a large $N$ provides high accuracy but demands vast computational resources. Setting the batch size to $N = 1$ economizes the computational cost at every iteration by avoiding the expensive evaluation of the full expectation, thus leading to an effective online rule for parameter updates, as in stochastic gradient ascent (see [36] for a similar method). We apply a soft update rule with learning rate $\eta_\phi$ to optimize the parameters in an online fashion,
\[
\phi \leftarrow \phi + \eta_\phi\, f(w,x,a)\, \frac{\partial \log p_\phi}{\partial \phi}, \tag{21}
\]
for each parameter $\phi \in \{\theta_1, \theta_2, \psi\}$. Note that stochastic gradient ascent does not always converge to a global maximum. The global maximum is attained only when the objective function is concave and the learning rates decrease at an appropriate rate; otherwise a local maximum may be reached [37]. Therefore, the online rule for the problem described here is not guaranteed to converge to globally optimal solutions and should be treated carefully by using small learning rates $\eta_\phi$.
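A minimal end-to-end sketch of the online update rule (21) on a toy discrete task is given below. For readability, both channels are represented as softmax tables here instead of the neural network used in the text, and the marginals $p(x)$, $p(a)$ needed for $f(w,x,a)$ are tracked by running averages; both are implementation choices of this sketch, not prescribed by the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_w, n_x, n_a = 3, 3, 3
U = np.eye(n_w)                  # toy utility: action k suits world-state k
beta1, beta2, eta = 20.0, 20.0, 0.1
theta = 0.01 * rng.normal(size=(n_w, n_x))   # logits of p(x|w)
psi = 0.01 * rng.normal(size=(n_x, n_a))     # logits of p(a|x)
p_x_hat = np.full(n_x, 1.0 / n_x)            # running estimate of p(x)
p_a_hat = np.full(n_a, 1.0 / n_a)            # running estimate of p(a)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(30000):
    w = rng.integers(n_w)                                 # environment draws w
    p_x = softmax(theta[w]); x = rng.choice(n_x, p=p_x)   # sample percept
    p_a = softmax(psi[x]);   a = rng.choice(n_a, p=p_a)   # sample action
    p_x_hat = 0.999 * p_x_hat + 0.001 * np.eye(n_x)[x]    # update marginals
    p_a_hat = 0.999 * p_a_hat + 0.001 * np.eye(n_a)[a]
    # single-sample estimate of f(w, x, a)
    f = (U[w, a] - np.log(p_x[x] / p_x_hat[x]) / beta1
                 - np.log(p_a[a] / p_a_hat[a]) / beta2)
    # Eq. (21): score-function gradient ascent on both channels
    theta[w] += eta * f * (np.eye(n_x)[x] - p_x)
    psi[x]   += eta * f * (np.eye(n_a)[a] - p_a)
```

With high-capacity channels (large $\beta_1$, $\beta_2$), the expected utility of the learned serial policy approaches the maximum, although, as discussed above, convergence to the global optimum is not guaranteed.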
III. Experimental Results
Iterating the analytical solutions (Equations (9)-(12)) requires the evaluation of the utility function for all pairs $(w,a)$ in each iteration step. A major advantage of the gradient-update scheme derived in the previous section is that it is suitable for online updates, i.e. one iteration can be performed after every interaction of the robot with the environment. In such a scheme, world-states $w$ are generated (randomly) by the environment according to $p(w)$. Each $w$ is processed by the perceptual stage of the agent, which samples a percept $x \sim p_\theta(x|w)$ (in our case, by drawing a sample from the distribution given by the softmax output of the neural network). Similarly, the agent samples an action $a \sim p_\psi(a|x)$. This leads to a rollout $(w, x, a)$, which allows the agent to evaluate the utility $U(w,a)$ and to perform one stochastic gradient update step. In the following, we first compare this online stochastic gradient update scheme against solutions obtained from iterating the analytical solution equations. Afterwards we demonstrate the scheme on a more challenging task in a simulated robot environment.
III-A Comparison with Baseline
To empirically verify the convergence of our gradient-update scheme and the correctness of the resulting solution, we compare against the (analytical) solution obtained by iterating the set of self-consistent equations as given in [2]. To this end, we use the “predator-prey” example from [2]. In the example, a fictional animal encounters other animals, which can either be prey that should be hunted, or predators that should be avoided. To decide which action to take, the animal has a perceptual sensor to determine the size of the encountered animal. The example is described by the utility function shown in Figure 2A. Animals belong to one of three groups: small, medium-sized and large animals. All large animals are predators, thus the only action that yields nonzero utility is “flee” (regardless of the particular size of the large animal). For each of the small animals, a specific hunting action yields the highest utility, therefore it is relevant to distinguish between the individual animals of the small group. In contrast, for the animals of the medium-sized group the specific hunting actions yield the same utility as a generic hunting action that works equally well for all medium-sized animals. The example clearly illustrates the importance of coupling perception with the downstream action part of the agent: the distinction between the individual animals of the medium-sized and large groups is irrelevant for acting. Thus, spending (computational) capacity in the perceptual channel on this distinction is wasteful and should be avoided, particularly if the capacity of the perceptual channel is limited.
The original example in [2] used categorical distributions for perception $p(x|w)$ and action $p(a|x)$. Here, we use a neural network with one hidden layer for perception (with the input being a binary encoding of the world-state $w$) and a multinomial distribution (parameterized as given by Equation (15)) for the action stage (with $a$ being a one-hot encoding of the action). The neural network consisted of four input neurons, 20 hidden and 13 output neurons, initialized with Glorot's scheme (also known as Xavier initialization) [38]. The parameters $\psi$ were initialized such that all actions were equally probable. We found that convergence of the gradient-update scheme crucially depends on using different learning rates for the perceptual channel and the action channel. Figure 2B shows the evolution of the objective value (Equation (8)) during gradient-update iterations of Equation (21), using a large $\beta_1$ (corresponding to large perceptual capacity) and a large $\beta_2$ (high-capacity action channel). The dashed red line in the panel indicates the baseline, that is, the value of the objective function obtained by iterating the set of analytical solutions (Equations (9)-(12)) until convergence as in [2]. As shown in the figure, the gradient-update scheme converges to a solution with the same objective value as the analytical baseline method (after roughly 50000 iterations). Panel C of Figure 2 shows the corresponding behavior (omitting the binary/one-hot encoding from the notation for simplicity) after the gradient-update iterations. Comparing Panel C against Figure 6D in [2] shows that the solution obtained from the gradient-update scheme is qualitatively identical to the solution obtained by iterating the set of self-consistent equations.
Importantly, the solution reflects the intuition that an information-optimal agent does not distinguish between the individual animals of the medium-sized and large groups, even if the computational capacity of the perceptual channel would in principle allow for such a distinction.
We have performed the same comparison for the other settings of $\beta_1$ and $\beta_2$ in [2], corresponding to low capacity of either the perceptual or the action channel, and found that the gradient scheme converges to solutions that are qualitatively identical (same objective value, same behavior) to the solutions obtained from iterating the self-consistent equations. We conclude that the gradient-update scheme, in conjunction with a neural network for the perceptual stage and a multinomial distribution for the action stage, successfully matches the analytical baseline. Due to length limitations of the manuscript, we omit further plots of the empirical baseline comparison.
III-B Robot Simulation
We also test our method in a simulated robotic environment to illustrate the use of the proposed update principle for sensorimotor coupling with parametric perception and action modules. To this end, we designed a simplified grasping task with a simulated Nao robot. In the simulation, the robot is positioned next to a table with mugs on it (see Figure 3). The mugs differ in the number and orientation of their handles: one mug (m0) has no handle at all, two mugs have one handle (positioned such that the handle is either to the left (mL) or to the right (mR)), and one mug has handles on both sides (m2), allowing a direct grasp with either the left or the right hand of the robot. The Nao robot has two cameras; in this simulation we make use of the chest camera, which shows the area on the table directly in front of the robot (see the bottom-left inset in Figure 3). Based on this camera input, the (bounded-rational) agent has to decide how to grasp the mug appropriately. We defined four possible actions: lift the mug with both hands (a2), lift the mug with the left hand (aL), lift it with the right hand (aR), or execute no lift (a0). Additionally, we defined the utility function shown in Figure 4A, where each mug has one preferred action (yielding the highest utility), and grasping a single-handle mug with both hands yields slightly lower utility.
As in the previous example, the perceptual stage of the robot is implemented by a neural network with one hidden layer (192 input neurons, four hidden and four output neurons, Xavier initialization), and the action stage is implemented with a multinomial distribution (parameterized according to Equation (15), initialized to have uniform probabilities over actions). The perceptual and action stages are then learned through interaction with the environment, using the stochastic gradient update scheme proposed in this paper. At the beginning of each trial, one mug is randomly selected according to $p(w)$ (a uniform distribution in our case) and placed in front of the agent. An image from the chest camera of the robot is then fed into the neural network for perceptual processing (the 2D image is flattened into a 1D vector). Accordingly, an observation $x$ and an action $a$ are sampled by the agent. After evaluating the utility $U(w,a)$, the model parameters $\theta_1$, $\theta_2$ and $\psi$ are updated with a gradient ascent step as described in the previous section.

Figure 4 shows three experiments with different computational capacities of the perception and action channels: high-capacity channels (large $\beta_1$ and $\beta_2$), high perceptual capacity combined with low action capacity (large $\beta_1$, small $\beta_2$), and low-capacity channels (small $\beta_1$ and $\beta_2$). Panel B shows how the objective value evolves during training: in all cases, the online update scheme converges to a stable solution, and the dashed lines show the optimal solution. Panels C and D show the evolution of the mutual information of the perceptual channel $I(W;X)$ and the action channel $I(X;A)$. The mutual information of both the perceptual and the action stage is high under high computational capacity. Lowering the capacity of the action channel leads to a reduction of mutual information in both channels; note that for the second case only the information-processing price of the action stage is changed, and the perceptual channel adjusts accordingly. Under (very) low computational capacity, the agent develops a single-action strategy that requires no computation at all (the channel capacity of both perception and action is effectively zero).
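Mutual-information curves of the kind shown in Panels C and D can be obtained directly from the learned discrete channels. A small helper of the kind we have in mind (names illustrative) might look like this:

```python
import numpy as np

def mutual_information(p_in, p_cond):
    """I(In; Out) in bits, for an input distribution p_in and a
    channel matrix p_cond[i, j] = p(out_j | in_i)."""
    p_joint = p_in[:, None] * p_cond          # joint p(in, out)
    p_out = p_joint.sum(axis=0)               # output marginal
    mask = p_joint > 0                        # skip zero-probability cells
    ratio = p_joint[mask] / (p_in[:, None] * p_out[None, :])[mask]
    return float((p_joint[mask] * np.log2(ratio)).sum())
```

Applied to $p(w)$ and $p_\theta(x|w)$ this yields $I(W;X)$, and applied to $p(x)$ and $p_\psi(a|x)$ it yields $I(X;A)$; a noiseless channel over four equiprobable inputs gives the expected two bits, while a constant channel gives zero.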
Figure 5 shows the behaviour of the robot after convergence. With high computational capacity, the robot learns to associate the camera images with the best possible action for each mug (Panels A and B). If the action channel does not have sufficient capacity, the agent is not able to apply specific actions to specific contexts; its policy therefore collapses into two modes: lift with both hands or do nothing at all (Panel C). Accordingly, the agent spends less information processing in the perceptual channel, such that it only discriminates between mugs with handle(s) and mugs without a handle (Panel D). Under (very) low computational capacity, the agent always chooses to lift the mug with both hands, which requires no computation (Panels E and F).
IV. Discussion
In this study we propose a novel online optimization rule to find bounded-optimal perception-action couplings in serial perception-action systems. The perceptual channel is implemented as a multi-layer neural network, while the action channel is represented by a parametric distribution, a multinomial in our case. Our method is illustrated with a NAO robot simulator.
The proposed algorithm can be improved in several ways. In the case of the rate-distortion problem (3), the Blahut-Arimoto algorithm is guaranteed to converge to a unique maximum [see Section 2.1.1 in [29]]. For the extension of bounded-rational problems to systems with multiple information-processing stages there is no convergence proof, so it cannot be ruled out that the solutions obtained by iterating the self-consistent equations until numerical convergence are only local optima. A future improvement would therefore be a better theoretical understanding of the convergence properties of the extended rate-distortion problem.
Another issue is that the learning rates for the parameter updates (see Equation (21)) have a significant impact on the development of bounded-optimal behaviors. Inappropriate learning rates easily lead to optimization failure, in which the bounded-rational decision-maker is unable to find the bounded-optimal solutions. We chose a grid-search method to optimize the learning rates. A future improvement would require a better understanding of the relationship between the learning rates of the perceptual and the action module, and the study of better optimization procedures for choosing the learning rates accordingly.
In information-theoretic bounded rationality, the main assumption is that the decision-maker's behavioral policy may not deviate too much from some prior policy. Deviations from the prior policy are costly and are modeled through the Kullback-Leibler divergence (Section I-A). This is similar to other regularization techniques from robotics such as Trust Region Policy Optimization (TRPO, [39]) and Relative Entropy Policy Search (REPS, [25]). The main difference lies in the choice of the prior behavioral policy. In both TRPO and REPS, the prior policy represents the agent's behavior at a previous iteration of the optimization procedure, or a set of initial expert trajectories (in imitation learning). In our approach, on the other hand, the agent's prior policy is optimal with respect to the information-theoretic constraints (Eq. (3)), encouraging the agent to ignore irrelevant sensory information that has little impact on the reward.
Conceptually, bounded rationality has its most obvious application when the information-processing capacity is limited by physical constraints such as time or space. In our present model this was not the case, which raises the question why one would restrict the system to a smaller capacity than it might naturally have. Apart from possible effects on learning speed, which we did not investigate here, restricting the channel capacity creates a bottleneck that filters out irrelevant information and creates abstractions that are useful for generalization [29]. For instance, in Panel D of Figure 5, two abstract percepts emerge: “mug with handle(s)” (x2) and “mug without handles” (x4).
In summary, our information-theoretic principle (8) for perception-action coupling provides a novel, generic and principled method that could in principle be applied to combine any parameterized perception and action modules. Compared to the existing literature, this approach is most similar in spirit to approaches that learn the perceptual features most useful for solving a particular task [40, 41, 42, 43]. Here, this feature search is integrated into a single bounded-rational optimization problem.
References
 [1] J. Bohg, K. Hausman, B. Sankaran, O. Brock, D. Kragic and S. Schaal, ”Interactive Perception: Leveraging Action in Perception and Perception in Action”, arXiv:1604.03670
 [2] T. Genewein, F. Leibfried, J. Grau-Moya and D. A. Braun, ”Bounded rationality, abstraction and hierarchical decision-making: an information-theoretic optimality principle”, Frontiers in Robotics and AI, vol. 2(27), 2015, pp. 1-24.
 [3] A. Noë, ”Action in Perception”, Bradford Books, 2004
 [4] R. C. Arkin, ”Behavior-based Robotics”, MIT Press, 1998
 [5] R. Pfeifer and J. C. Bongard, ”How the Body Shapes the Way We Think: A New View of Intelligence”, MIT Press, 2006
 [6] P. Fitzpatrick and G. Metta, ”Towards manipulationdriven vision”, IEEE/RSJ International Conference on Intelligent Robots and Systems, vol. 1, 2002, pp.43–48
 [7] H. van Hoof, O. Kroemer and J. Peters, ”Probabilistic segmentation and targeted exploration of objects in cluttered environments”, IEEE Transactions on Robotics, vol. 30, no. 5, 2014, pp. 1198–1209
 [8] M. Gupta and G. S. Sukhatme, ”Using manipulation primitives for brick sorting in clutter”, International Conference on Robotics and Automation, 2012
 [9] L. Y. Chang, J. R. Smith and D. Fox, ”Interactive singulation of objects from a pile”, International Conference on Robotics and Automation, 2012
 [10] F. P. Ramsey, ”Truth and probability”, in The Foundations of Mathematics and Other Logical Essays, ed. R. B. Braithwaite (New York, NY: Harcourt, Brace and Co), 1931, pp. 156-198.
 [11] J. Von Neumann and O. Morgenstern, ”Theory of Games and Economic Behavior”, Princeton: Princeton University Press, 1944.
 [12] L. J. Savage, ”The Foundations of Statistics”, New York: Wiley, 1954.

 [13] G. Gigerenzer and P. M. Todd, ”Simple Heuristics That Make Us Smart”, Oxford: Oxford University Press, 1999.
 [14] D. Kahneman, ”Maps of bounded rationality: psychology for behavioral economics”, Am. Econ. Rev., vol. 93, 2003, pp. 1449-1475.
 [15] A. Howes, R. L. Lewis and A. Vera, ”Rational adaptation under task and processing constraints: implications for testing theories of cognition and action”, Psychol. Rev., vol. 116, 2009, pp. 717-751.
 [16] S. Russell, ”Rationality and intelligence”, in Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, ed. C. Mellish (San Francisco, CA: Morgan Kaufmann), 1995, pp. 950-957.
 [17] S. Russell and P. Norvig, ”Artificial Intelligence: A Modern Approach”, Upper Saddle River: Prentice Hall, 2002.
 [18] R. L. Lewis, A. Howes and S. Singh, ”Computational rationality: linking mechanism and behavior through bounded utility maximization”, Top. Cogn. Sci., vol. 6, 2014, pp. 279-311.

 [19] D. A. Braun, P. A. Ortega, E. Theodorou and S. Schaal, ”Path integral control and bounded rationality”, in IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (Piscataway: IEEE), 2011, pp. 202-209.
 [20] D. A. Braun and P. A. Ortega, ”Information-theoretic bounded rationality and epsilon-optimality”, Entropy, vol. 16, 2014, pp. 4662-4676.
 [21] P. A. Ortega and D. A. Braun, ”Free energy and the generalized optimality equations for sequential decision making”, Workshop and Conference, 2012
 [22] P. A. Ortega and D. A. Braun, ”Thermodynamics as a theory of decision-making with information-processing costs”, Proc. R. Soc. A Math. Phys. Eng. Sci., vol. 469(2153), 2013.
 [23] P. A. Ortega, D. A. Braun and N. Tishby, ”Monte Carlo methods for exact & efficient solution of the generalized optimality equations”, in Proceedings of IEEE International Conference on Robotics and Automation, Hong Kong, 2014.
 [24] P. A. Ortega and D. A. Braun, ”A conversion between utility and information”, in Third Conference on Artificial General Intelligence (AGI 2010) (Lugano: Atlantis Press), 2010, pp. 115-120.
 [25] J. Peters, K. Mülling and Y. Altün, ”Relative Entropy Policy Search”, Twenty-Fourth National Conference on Artificial Intelligence (AAAI-10), 2010, pp. 1607-1612
 [26] E. Theodorou, J. Buchli and S. Schaal, ”A Generalized Path Integral Control Approach to Reinforcement Learning”, J. Mach. Learn. Res., vol. 11, 2010, pp. 3137–3181
 [27] E. Todorov, ”Linearlysolvable Markov decision problems”, Advances in neural information processing systems, 2006, pp. 1369–1376
 [28] H. J. Kappen ”Linear theory for control of nonlinear stochastic systems”, Physical review letters, vol. 95, no. 20, 2005, p. 200201
 [29] N. Tishby, F. C. Pereira, and W. Bialek, ”The information bottleneck method”, in The 37th Annual Allerton Conference on Communication, Control, and Computing, 1999.
 [30] I. Csiszar and G. Tusnady, ”Information geometry and alternating minimization procedures”, Stat. Decis., vol. 1, 1984, pp. 205-237.
 [31] C. E. Shannon, ”Coding Theorems for a Discrete Source With a Fidelity Criterion”, Institute of Radio Engineers, International Convention Record, vol. 7, 1959, pp. 142-163.
 [32] R. Blahut, ”Computation of channel capacity and rate-distortion functions”, IEEE Trans. Inf. Theory, vol. 18, 1972, pp. 460-473.
 [33] S. Arimoto, ”An algorithm for computing the capacity of arbitrary discrete memoryless channels”, IEEE Trans. Inf. Theory, vol. 18, 1972, pp. 14-20.
 [34] R.W. Yeung, ”Information Theory and Network Coding”, New York: Springer, 2008.
 [35] T. M. Cover, and J. A. Thomas, ”Elements of Information Theory”, Hoboken: John Wiley & Sons, 1991.
 [36] F. Leibfried and D. A. Braun, ”Bounded Rational Decision-Making in Feedforward Neural Networks”, Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, 2016
 [37] L. Bottou, ”Online Algorithms and Stochastic Approximations”, Online Learning and Neural Networks, Cambridge University Press, 1998.
 [38] X. Glorot and Y. Bengio, ”Understanding the difficulty of training deep feedforward neural networks”, In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS’10), 2010
 [39] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, P. Abbeel, ”Trust Region Policy Optimization”, International Conference on Machine Learning (ICML), 2015
 [40] P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine, ”Learning to poke by poking: Experiential learning of intuitive physics”, in Advances in Neural Information Processing Systems, 2016
 [41] R. Jonschkowski and O. Brock, ”Learning state representations with robotic priors”, Autonomous Robots, vol. 39, no. 3, pp. 407–428, 2015
 [42] S. Levine, C. Finn, T. Darrell and P. Abbeel, ”Endtoend training of deep visuomotor policies”, Journal of Machine Learning Research, vol. 17, no. 39, pp. 1–40, 2016
 [43] J. Piater, S. Jodogne, R. Detry, D. Kraft, N. Krueger, O. Kroemer and J. Peters, ”Learning visual representations for perceptionaction systems”, International Journal of Robotics Research, 2011