In robotic systems, perception and action have often been studied in isolation, without an overarching principle for how to put the two processes together. Yet there is compelling evidence that the two processes are interdependent in humans and animals. In the robotics literature, the direct coupling between action and perception has been especially emphasized in behavior-based robotics and by proponents of embodied cognition, but more recently approaches applying machine learning to sensorimotor processing have also focused on the interactive nature of perception (see Bohg et al. for a review). The main insight of interactive perception is that sensory processing can be enhanced by manipulating or interacting with the environment. This can be achieved by creating novel signals through movement [6, 7] or by exploiting action-perception regularities that are generated when the same action is performed repeatedly in the same environment [8, 9]. For example, object segmentation can be improved by separating different objects through movement, and object properties like inertia or weight can be estimated through interaction. In such cases action directly subserves the perceptual process. In other cases, however, interactive perception has the primary objective of achieving a manipulation goal. Defining the objective is therefore critical in determining what kind of sensorimotor coupling can arise.
The formal framework that deals with adaptive systems optimizing arbitrary objective functions under uncertainty is decision theory. A rational agent has to decide which action to take from the action set according to the desirability of each action, quantified by a utility function. A fundamental problem of such perfect-rationality models [10, 11, 12] is that they ignore the computational costs that arise when searching for the maximum-utility action. As such costs can be prohibitive, decision-making with limited information-processing resources has recently been studied extensively in psychology, economics, cognitive science, computer science, and artificial intelligence research [13, 14, 15, 16, 17, 18]. In the following we argue that such resource limitations are crucial for the emergence of sensorimotor coupling.
I-A An Information-Theoretic Principle for Bounded Rational Decision-Making with Context-Dependence
In this study, we use an information-theoretic model of bounded rational decision-making [19, 20, 21, 22, 23]. In a decision-making task with context, an agent is presented with a world-state $w$ and has to find an optimal action $a^*$ from a set of admissible actions $\mathcal{A}$. The desirability of an action $a$ under a particular world-state $w$ is quantified by the utility function $U(w,a)$. The objective of the decision-maker is to maximize the utility depending on the context:

$$a^*(w) = \operatorname*{arg\,max}_{a \in \mathcal{A}} U(w,a) \qquad (1)$$
For an agent with limited computational resources that has to react within a certain time limit, searching for the best action can become intractable, especially when the number of possible actions is enormous. Thus, a bounded rational agent tries to find a good enough yet tractable solution. In multiple contexts, bounded rational decision-making requires computing multiple strategies under limited computational resources, which can be expressed as a set of probability distributions $p(a|w)$ over actions given the different world-states. Mathematically, this informational cost can be measured in terms of an “information distance”, namely the Kullback-Leibler divergence $D_{\mathrm{KL}}(p(a|w)\,\|\,p(a))$ from a prior behavior $p(a)$ to the posterior strategies $p(a|w)$. This information-processing cost can be motivated on axiomatic grounds [24, 22] and has been used previously in the robotics and control literature [25, 26, 27, 28]. An upper bound on the Kullback-Leibler divergence constrains the decision-maker to spend at most a fixed number of bits to adapt its behavior. The resulting optimization problem can be formalized as

$$\max_{p(a|w)} \; \mathbb{E}_{p(w)p(a|w)}[U(w,a)] - \frac{1}{\beta} \sum_w p(w)\, D_{\mathrm{KL}}\big(p(a|w)\,\|\,p(a)\big) \qquad (2)$$
where the inverse temperature $\beta$ governs the trade-off between expected utility and information cost. For $\beta \to \infty$, classic decision theory is recovered, whereas for $\beta \to 0$ the decision-maker has no access to computational resources at all and thus acts according to the prior. When optimizing the expected free energy over all possible contexts, it can readily be shown that the optimal prior is given by the marginal distribution $p(a) = \sum_w p(w)\,p(a|w)$ [see Section 2.1.1 in Genewein et al.]. Plugging in the marginal as the optimal prior yields the following variational principle for bounded rational decision-making

$$\max_{p(a|w)} \; \mathbb{E}_{p(w)p(a|w)}[U(w,a)] - \frac{1}{\beta}\, I(W;A) \qquad (3)$$
where $I(W;A)$ is the mutual information between actions and world-states and measures the reduction of uncertainty about the action after observing the world-state, or vice versa. This problem formulation is commonly known as the rate-distortion problem from information theory. The solution of (3) is given by a set of two self-consistent equations:

$$p(a|w) = \frac{1}{Z(w)}\, p(a)\, e^{\beta U(w,a)} \qquad (4)$$

$$p(a) = \sum_w p(w)\, p(a|w) \qquad (5)$$
where $Z(w) = \sum_a p(a)\, e^{\beta U(w,a)}$ is the partition sum. In practice, the solution can be computed using the Blahut-Arimoto algorithm [32, 33, 34] by starting with an initial distribution and then iterating through both equations (4) and (5) until the distributions converge. The iteration is guaranteed to converge to a global optimum under the prerequisite that the initial distribution has the same support as the optimal one [29, 30, 35].
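As an illustration, the self-consistent equations (4) and (5) can be iterated in a few lines. The following sketch assumes a finite world-state and action set with the utility given as a matrix; all names and sizes are illustrative, not taken from our implementation:

```python
import numpy as np

def blahut_arimoto(U, p_w, beta, n_iter=200, tol=1e-10):
    """Iterate the self-consistent equations (4) and (5):
    p(a|w) ∝ p(a) exp(beta * U(w, a)),  p(a) = sum_w p(w) p(a|w)."""
    _, A = U.shape
    p_a = np.full(A, 1.0 / A)                        # initial prior over actions
    for _ in range(n_iter):
        logits = beta * U + np.log(p_a)[None, :]     # log of p(a) e^{beta U}
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p_a_given_w = np.exp(logits)
        p_a_given_w /= p_a_given_w.sum(axis=1, keepdims=True)  # divide by Z(w)
        p_a_new = p_w @ p_a_given_w                  # marginal over actions
        if np.max(np.abs(p_a_new - p_a)) < tol:
            p_a = p_a_new
            break
        p_a = p_a_new
    return p_a_given_w, p_a
```

For large $\beta$ the returned posterior concentrates on the maximum-utility action per world-state, while for $\beta = 0$ it collapses onto the prior.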
I-B An Information-Theoretic Principle for Perception-Action Coupling
We follow the work of Genewein et al., where the rate-distortion framework is extended to systems with multiple information-processing nodes. The serial perception-action system consists of two stages: a perceptual stage $p(x|w)$ that maps world-states $w$ to observations $x$, and an action stage $p(a|x)$ that maps observations to actions $a$. The three random variables $W$, $X$, and $A$ for world-state, observation, and action form a serial chain of two channels, expressed by the graphical model $W \to X \to A$, which implies the following conditional independence:

$$p(a|w,x) = p(a|x)$$
We assume that the utility function depends only on the world-state $w$ and the action $a$; the internal percept $x$ does not influence the utility. The price of information processing in the perceptual channel, $1/\beta_1$, can differ from the price of information processing in the action channel, $1/\beta_2$. Formally, we set up the following variational problem:

$$\max_{p(x|w),\,p(a|x)} \; \mathbb{E}_{p(w)p(x|w)p(a|x)}[U(w,a)] - \frac{1}{\beta_1}\, I(W;X) - \frac{1}{\beta_2}\, I(X;A) \qquad (8)$$

Here we define the maximand as the overall objective function $J$. Note that in this problem statement both the perceptual channel and the action channel are optimized. This is in contrast to traditional problem statements, where a likelihood model is assumed to be given and the decision rule is built on top of this model. The coupling between action and perception thus falls out naturally by extending the rate-distortion problem to Equation (8). The solution of (8) is given by the following set of self-consistent equations:

$$p(x|w) = \frac{1}{Z(w)}\, p(x)\, e^{\beta_1 \Delta F(w,x)} \qquad (9)$$

$$p(x) = \sum_w p(w)\, p(x|w) \qquad (10)$$

$$p(a|x) = \frac{1}{Z(x)}\, p(a)\, e^{\beta_2 \mathbb{E}_{p(w|x)}[U(w,a)]} \qquad (11)$$

$$p(a) = \sum_x p(x)\, p(a|x) \qquad (12)$$
where $Z(w) = \sum_x p(x)\, e^{\beta_1 \Delta F(w,x)}$ and $Z(x) = \sum_a p(a)\, e^{\beta_2 \mathbb{E}_{p(w|x)}[U(w,a)]}$ denote the corresponding partition sums of the world-state $w$ and the internal percept $x$. The conditional probability $p(w|x)$ is given by Bayes’ rule, $p(w|x) = p(x|w)\,p(w)/p(x)$, and

$$\Delta F(w,x) = \mathbb{E}_{p(a|x)}[U(w,a)] - \frac{1}{\beta_2}\, D_{\mathrm{KL}}\big(p(a|x)\,\|\,p(a)\big)$$

is the free-energy difference of the action stage. Equations (9)–(12) can be computed by starting with arbitrary initial distributions in lieu of the optimal ones and then iterating (9) to (12) until convergence. As the iterations involve evaluations of the utility function over all possible actions and world-states, the computation can become very costly. Another drawback of such Blahut-Arimoto-style algorithms is that they cannot be applied straightforwardly to continuous problems; closed-form analytic solutions exist only for special cases. Here we propose an alternative online optimization method to solve this problem.
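The Blahut-Arimoto-style iteration for the serial chain might be sketched as follows, with the Bayes posterior and the free-energy difference computed explicitly in each sweep. This is a toy sketch assuming finite, tabular distributions; the variable names and the random initialization are ours:

```python
import numpy as np

def serial_blahut_arimoto(U, p_w, beta1, beta2, X, n_iter=300):
    """Iterate the self-consistent equations of the serial chain W -> X -> A.

    U     : (W, A) utility matrix
    p_w   : (W,) world-state distribution
    beta1 : resource parameter of the perceptual channel
    beta2 : resource parameter of the action channel
    X     : number of internal percepts
    """
    W, A = U.shape
    rng = np.random.default_rng(0)
    p_x_given_w = rng.dirichlet(np.ones(X), size=W)   # random init of p(x|w)
    p_a_given_x = rng.dirichlet(np.ones(A), size=X)   # random init of p(a|x)
    for _ in range(n_iter):
        p_x = p_w @ p_x_given_w                       # marginal p(x)
        p_a = p_x @ p_a_given_x                       # marginal p(a)
        # Bayes' rule: p(w|x) ∝ p(x|w) p(w)
        p_w_given_x = (p_x_given_w * p_w[:, None]).T
        p_w_given_x /= p_w_given_x.sum(axis=1, keepdims=True)
        # action stage: p(a|x) ∝ p(a) exp(beta2 * E_{p(w|x)}[U(w, a)])
        EU = p_w_given_x @ U                          # (X, A)
        p_a_given_x = p_a[None, :] * np.exp(beta2 * EU)
        p_a_given_x /= p_a_given_x.sum(axis=1, keepdims=True)
        # free-energy difference of the action stage, dF(w, x)
        EU_wx = p_a_given_x @ U.T                     # (X, W): E_{p(a|x)}[U(w, a)]
        dkl = (p_a_given_x
               * np.log(np.maximum(p_a_given_x, 1e-30) / p_a[None, :])).sum(axis=1)
        dF = EU_wx.T - dkl[None, :] / beta2           # (W, X)
        # perceptual stage: p(x|w) ∝ p(x) exp(beta1 * dF(w, x))
        p_x_given_w = p_x[None, :] * np.exp(beta1 * dF)
        p_x_given_w /= p_x_given_w.sum(axis=1, keepdims=True)
    return p_x_given_w, p_a_given_x
```

On a small problem where two world-states share the same optimal action, the iteration typically merges them into a single percept, illustrating the abstraction effect discussed later.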
II Theoretical Results
In this study, we present an algorithm to update an agent’s perceptual module and its behavioural policy—expressed through two separate parametric models—in a joint fashion under constrained information-processing resources according to the framework of information-theoretic bounded rationality.
Implementation of the perceptual channel
We consider neural networks of the type depicted in Figure 1 as a parameterized model to represent the perceptual distribution $p(x|w)$. The network possesses one input layer, one hidden layer, and one output layer. The world-state $w$ is represented as a real-valued column vector. The synaptic weights between the input and the hidden layer are expressed as a real matrix $W_1$, and the weights between the hidden and the output layer as another real matrix $W_2$. The activation function in the hidden layer is the hyperbolic tangent, yielding hidden activations $h = \tanh(W_1 w)$. We apply a soft-max activation function in the output layer to compute the perceptual distribution $p(x|w) = \mathrm{softmax}(W_2 h)$. Accordingly, the gradients of the log-output distribution with respect to the parameter matrices $W_1$ and $W_2$ are given by:

$$\nabla_{W_2} \log p(x|w) = (\mathbf{1}_x - p)\, h^\top$$

$$\nabla_{W_1} \log p(x|w) = \big[\big(W_2^\top (\mathbf{1}_x - p)\big) \odot (1 - h \odot h)\big]\, w^\top$$

where $p$ denotes the vector of output probabilities and $\mathbf{1}_x$ the one-hot encoding of the percept $x$.
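A minimal sketch of this forward pass and the corresponding log-probability gradients (standard soft-max/tanh backpropagation; shapes and names are illustrative):

```python
import numpy as np

def perceptual_forward(w, W1, W2):
    """Forward pass of the perceptual network: tanh hidden layer,
    soft-max output layer giving p(x|w)."""
    h = np.tanh(W1 @ w)                  # hidden activations
    z = W2 @ h                           # output pre-activations
    z -= z.max()                         # numerical stability
    p = np.exp(z) / np.exp(z).sum()      # soft-max: p(x|w)
    return p, h

def grad_log_p_x(w, W1, W2, x_idx):
    """Gradients of log p(x|w) w.r.t. W1 and W2 for the sampled percept x."""
    p, h = perceptual_forward(w, W1, W2)
    delta_out = -p.copy()
    delta_out[x_idx] += 1.0              # one-hot(x) - p
    grad_W2 = np.outer(delta_out, h)
    delta_hid = (W2.T @ delta_out) * (1.0 - h ** 2)   # tanh derivative
    grad_W1 = np.outer(delta_hid, w)
    return grad_W1, grad_W2
```

A finite-difference check of these gradients is a quick way to validate the implementation before plugging it into the online update rule.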
Implementation of the action channel
Due to the abstract nature of the observation and action space, we assume here for simplicity discrete action choices. We therefore parameterize the action channel as a multinomial distribution:

$$p(a|x;\theta) = \prod_{k=1}^{|\mathcal{A}|} \phi_k(x)^{a_k}, \qquad \phi_k(x) = \frac{e^{\theta_{x,k}}}{\sum_{j} e^{\theta_{x,j}}} \qquad (15)$$

with dimensionality $|\mathcal{A}|$ and the auxiliary soft-max function $\phi(x)$. Note that the parameter $\theta$ is conditioned on the observation $x$. In our implementation $\theta$ is expressed and updated as a real-valued matrix with dimensionality $|\mathcal{X}| \times |\mathcal{A}|$. We represent actions as binary-valued vectors in one-hot encoding, having the form $a = (a_1, \ldots, a_{|\mathcal{A}|})$ with $a_k \in \{0,1\}$ and $\sum_k a_k = 1$. The conventional normalization constraint of a conditional distribution is satisfied by the soft-max definition of $\phi$. Thus, the gradient of the log action distribution with respect to the parameter $\theta$ is given by

$$\frac{\partial \log p(a|x;\theta)}{\partial \theta_{x,k}} = a_k - \phi_k(x).$$
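In code, the action channel and its log-gradient might look as follows (a sketch assuming the tabular soft-max parameterization described above; the names are ours):

```python
import numpy as np

def action_policy(theta, x_idx):
    """p(a|x) as a soft-max over the x-th row of the parameter matrix theta."""
    z = theta[x_idx] - theta[x_idx].max()   # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return p

def grad_log_p_a(theta, x_idx, a_idx):
    """Gradient of log p(a|x) w.r.t. theta; nonzero only in row x."""
    grad = np.zeros_like(theta)
    p = action_policy(theta, x_idx)
    grad[x_idx] = -p
    grad[x_idx, a_idx] += 1.0               # one-hot(a) - p(.|x)
    return grad
```

The gradient rows sum to zero, reflecting that the soft-max distribution stays normalized under any parameter update.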
In the course of the simulation, the bounded rational decision-maker constantly updates the parameters representing the perceptual and the action channel. The overall objective in (8) is expressed as a parametric function of $W_1$, $W_2$, and $\theta$:

$$J(W_1, W_2, \theta) = \sum_{w,x,a} p(w)\, p(x|w; W_1, W_2)\, p(a|x; \theta) \left[ U(w,a) - \frac{1}{\beta_1} \log \frac{p(x|w)}{p(x)} - \frac{1}{\beta_2} \log \frac{p(a|x)}{p(a)} \right]$$

By defining the auxiliary term $f(w,x,a) = U(w,a) - \frac{1}{\beta_1} \log \frac{p(x|w)}{p(x)} - \frac{1}{\beta_2} \log \frac{p(a|x)}{p(a)}$, the objective can be rewritten as $J = \mathbb{E}[f(w,x,a)]$. Here we apply the log-trick to transform the derivative of $J$ into an expected value, by noticing that for any parametric distribution $p_\phi$ the identity $\nabla_\phi p_\phi = p_\phi \nabla_\phi \log p_\phi$ holds. This trick allows us to rewrite the derivative of the overall objective as follows:

$$\nabla J = \mathbb{E}_{p(w)p(x|w)p(a|x)}\big[ f(w,x,a)\, \nabla \log\big(p(x|w)\, p(a|x)\big) \big]$$
The expectation value can be approximated by drawing sample triplets $(w_i, x_i, a_i)$ from the joint distribution $p(w)\,p(x|w)\,p(a|x)$. The number of samples $N$ governs the accuracy of the approximation. A large $N$ provides high accuracy but demands vast computational resources. Setting the batch size to $N = 1$ economizes the computational cost at every iteration by avoiding the expensive evaluation of the summand function, thus leading to an effective online rule for parameter updates, as is done in stochastic gradient ascent (see Bottou for a similar method). We apply a soft update rule to optimize the parameters in an online fashion by introducing a learning rate $\eta_\phi$ for each parameter $\phi \in \{W_1, W_2, \theta\}$:

$$\phi \leftarrow \phi + \eta_\phi\, f(w,x,a)\, \nabla_\phi \log\big(p(x|w)\, p(a|x)\big) \qquad (21)$$

Note that stochastic gradient ascent does not always converge to a global maximum. The global maximum is attained only when the objective function is concave and the learning rates decrease at an appropriate rate; otherwise a local maximum might be attained. Therefore, the online rule for the problem described is not guaranteed to converge to globally optimal solutions and should be treated carefully by using small learning rates.
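Putting the pieces together, the online rule can be illustrated on a toy problem in which both channels are tabular soft-max distributions. This replaces the neural perceptual stage with a table purely for brevity; all sizes, $\beta$-values, and learning rates below are illustrative choices, not those of our experiments:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)

# Toy sizes and utility: action k is only rewarded in world-state k.
W, X, A = 3, 3, 3
U = np.eye(W)
beta1, beta2 = 5.0, 5.0            # resource parameters of the two channels
theta_x = np.zeros((W, X))         # tabular parameters of p(x|w)
theta_a = np.zeros((X, A))         # tabular parameters of p(a|x)
p_x_marg = np.full(X, 1.0 / X)     # running estimate of the marginal p(x)
p_a_marg = np.full(A, 1.0 / A)     # running estimate of the marginal p(a)
eta, eta_marg = 0.05, 0.01         # learning rates (grid-searched in practice)

for step in range(30000):
    w = rng.integers(W)                                  # environment draws w
    p_x = softmax(theta_x[w]); x = rng.choice(X, p=p_x)  # sample percept
    p_a = softmax(theta_a[x]); a = rng.choice(A, p=p_a)  # sample action
    # sampled integrand f(w, x, a) of the objective
    f = (U[w, a]
         - np.log(p_x[x] / p_x_marg[x]) / beta1
         - np.log(p_a[a] / p_a_marg[a]) / beta2)
    # score-function (log-trick) gradient step for both channels
    gx = -p_x; gx[x] += 1.0                              # grad log p(x|w)
    ga = -p_a; ga[a] += 1.0                              # grad log p(a|x)
    theta_x[w] += eta * f * gx
    theta_a[x] += eta * f * ga
    # soft update of the marginal estimates toward the sampled symbols
    p_x_marg = (1 - eta_marg) * p_x_marg; p_x_marg[x] += eta_marg
    p_a_marg = (1 - eta_marg) * p_a_marg; p_a_marg[a] += eta_marg
```

After training, the expected utility of the learned chain should lie well above the level of a uniform policy, since the channel capacities afforded by these $\beta$-values suffice to differentiate the three world-states.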
III Experimental Results
Iterating the analytical solutions (Equations (9)-(12)) requires the evaluation of the utility function for all pairs $(w,a)$ in each iteration step. A major advantage of the gradient-update scheme derived in the previous section is that it is suitable for online updates, i.e., one iteration can be performed after every interaction of the robot with the environment. In such a scheme, world-states $w$ are generated (randomly) by the environment according to $p(w)$. Each $w$ is processed by the perceptual stage of the agent, which samples a percept $x \sim p(x|w)$ (in our case, drawing a sample from the distribution that results from the soft-max output of the neural network). Similarly, the agent samples an action $a \sim p(a|x)$. This leads to a roll-out $(w, x, a)$ which allows evaluating $f(w,x,a)$ and performing one stochastic gradient update step. In the following, we first compare this online stochastic gradient update scheme against solutions obtained from iterating the analytical solution equations. Afterwards we demonstrate the scheme on a more challenging task in a simulated robot environment.
III-A Comparison with Baseline
To empirically verify the convergence of our gradient update scheme and the correctness of the resulting solution, we compare against the (analytical) solution obtained by iterating the set of self-consistent equations given above. To this end, we use the “predator-prey” example from Genewein et al. In the example, a fictional animal encounters other animals, which can either be prey that should be hunted, or predators that should be avoided. To decide which action to take, the animal has a perceptual sensor to determine the size of the encountered animal. The example is described by the utility function shown in Figure 2A. Animals belong to one of three groups: small, medium-sized, and large animals. All large animals are predators, so the only action that yields non-zero utility is “flee” (regardless of the particular size of the large animal). For each of the small animals, a specific hunting-action yields the highest utility; it is therefore relevant to distinguish between the individual animals of the small group. In contrast, for the animals of the medium-sized group the specific hunting-actions yield the same utility as a generic hunting-action that works equally well for all medium-sized animals. The example clearly illustrates the importance of coupling perception with the downstream action-part of the agent. The distinction between the individual animals of the medium- and large-sized groups is irrelevant for acting. Thus, spending (computational) capacity on the perceptual channel for this distinction is wasteful and should be avoided, particularly if the capacity of the perceptual channel is limited.
The original example in Genewein et al. used categorical distributions for perception and action. Here, we use a neural network with one hidden layer for perception (with the input being a binary encoding of the world-state $w$) and a multinomial distribution (parameterized as given by Equation (15)) for the action stage (with actions represented in one-hot encoding). The neural network consisted of four input neurons, 20 hidden and 13 output neurons, initialized with Glorot’s scheme (also known as Xavier-initialization). The parameters were initialized such that all actions were equally probable. We found that convergence of the gradient-update scheme crucially depends on using different learning rates for the perceptual channel and the action channel. Figure 2B shows the evolution of the objective value (Equation (8)) during gradient-update iterations of Equation (21), using a large $\beta_1$ (corresponding to large perceptual capacity) and a large $\beta_2$ (high-capacity action channel). The dashed red line in the panel indicates the baseline, that is, the value of the objective function obtained by iterating the set of analytical solutions (Equations (9)-(12)) until convergence. As shown in the figure, the gradient update scheme converges to a solution with the same objective value as the analytical baseline method (after roughly 50000 iterations). Panel C of Figure 2 shows the corresponding behavior (omitting binary/one-hot encoding from the notation for simplicity) at the end of the gradient-update iterations. Comparing panel C against Figure 6D in Genewein et al. shows that the solution obtained from the gradient update scheme is qualitatively identical to the solution obtained by iterating the set of self-consistent equations.
Importantly, the solution reflects the intuition that an information-optimal agent does not distinguish between the individual animals of the medium and large groups, even if the computational capacity of the perceptual channel would in principle allow for such a distinction.
We have performed the same comparison for the other settings of $\beta_1$ and $\beta_2$ in Genewein et al., corresponding to either low capacity of the perceptual or the action channel, and found that the gradient scheme converges to solutions that are qualitatively identical (same objective value, same behavior) to the solutions obtained from iterating the self-consistent equations. We conclude that the gradient update scheme, in conjunction with a neural network for the perceptual stage and a multinomial distribution for the action stage, successfully matches the analytical baseline. Due to length limitations, we have omitted further plots of the empirical baseline comparison.
III-B Robot Simulation
We also test our method in a simulated robotic environment to illustrate the use of the proposed update principle for sensorimotor coupling with parametric perception and action modules. To this end, we designed a simplified grasping task with a simulated Nao robot. In the simulation, the robot is positioned next to a table with mugs on it (see Figure 3). The mugs differ in the number and orientation of their handles. One mug (m0) has no handle at all, two mugs have one handle (positioned such that the handle is either to the left (mL) or to the right (mR)), and one mug has handles on both sides (m2), allowing a direct grasp with either the left or the right hand of the robot. The Nao robot has two cameras; in this simulation we make use of the chest camera, which shows the area on the table directly in front of the robot (see the bottom-left inset in Figure 3). Based on this camera input, the (bounded-rational) agent has to decide how to grasp the mug appropriately. We defined four possible actions: lift the mug with both hands (a2), lift the mug with the left hand (aL), lift with the right hand (aR), or execute no lift (a0). Additionally, we defined the utility function shown in Figure 4A, where each mug has one preferred action yielding the highest utility, and grasping a single-handle mug with both hands yields slightly lower utility.
As in the previous example, the perceptual stage of the robot is implemented by a neural network with one hidden layer (192 input neurons, four hidden and four output neurons, Xavier-initialization), and the action stage is implemented with a multinomial distribution (parameterized according to Equation (15), initialized to have uniform probabilities over actions). The perceptual and action stages are then learned through interaction with the environment, using the stochastic gradient update scheme proposed in this paper. At the beginning of each trial, one mug is randomly selected according to $p(w)$ (a uniform distribution in our case) and placed in front of the agent. An image from the chest camera of the robot is then fed into the neural network for perceptual processing (the 2D image is flattened into a 1D vector). Accordingly, an observation $x$ and an action $a$ are sampled by the agent. After evaluating the utility $U(w,a)$, the model parameters $W_1$, $W_2$, and $\theta$ are updated with a gradient ascent step as described in the previous section.
Figure 4 shows three experiments with different computational capacities of the perception and action channels: high-capacity channels (large $\beta_1$ and $\beta_2$), high perceptual capacity combined with low action capacity (large $\beta_1$, small $\beta_2$), and low-capacity channels (small $\beta_1$ and $\beta_2$). Panel B shows how the objective value evolves during training. In all cases, the online update scheme converges to a stable solution. The dashed lines show the optimal solution. Panels C and D show the evolution of the mutual information of the perceptual channel $I(W;X)$ and the action channel $I(X;A)$. With high computational capacity, the mutual information is high in both the perceptual and the action stage. Lowering the capacity of the action channel leads to a reduction of the mutual information in both channels. Note that for the second case only the information-processing price of the action stage is changed; the perceptual channel adjusts accordingly. Under (very) low computational capacity, the agent develops a single-action strategy that requires no computation at all (the channel capacity of both perception and action is effectively zero).
Figure 5 shows the behaviour of the robot after convergence. With high computational capacity, the robot learns to associate the camera images with the best possible action for each mug (Panels A & B). If the action channel does not have sufficient capacity, the agent is not able to apply specific actions to specific contexts; therefore, its policy collapses into two modes: lift with both hands or do nothing at all (Panel C). Accordingly, the agent spends less information processing in the perceptual channel, such that it only discriminates between mugs with handle(s) and mugs without a handle (Panel D). Under (very) low computational capacity, the agent always chooses to lift the mug with both hands, which requires no computation (Panels E & F).
In this study we have proposed a novel online optimization rule to find bounded-optimal perception-action coupling in serial perception-action systems. The perceptual channel is implemented as a multi-layer neural network, while the action channel is represented by a parametric distribution, a multinomial in our case. Our method is illustrated with a Nao robot simulator.
The proposed algorithm can be improved in several ways. In the case of the rate-distortion problem (3), the Blahut-Arimoto algorithm is guaranteed to converge to a unique maximum [see Section 2.1.1 in Genewein et al.]. For the extension of bounded rational problems to systems with multiple information-processing stages, however, there is no convergence proof, so it cannot be ruled out that the solutions obtained by iterating the self-consistent equations until numerical convergence are only local optima. One future improvement would therefore be to better understand the convergence properties of the extended rate-distortion problem theoretically.
Another issue is that the learning rates for the parameter updates (see Equation (21)) have a significant impact on the development of the bounded-optimal behaviors. Inappropriate learning rates easily lead to an optimization failure, in that the bounded-rational decision-maker is unable to find the bounded-optimal solutions. We chose a grid-search method to optimize the learning rates. A future improvement would be to better understand the relationship between the learning rates of the perceptual and action modules and to study better optimization procedures for choosing the learning rates accordingly.
In information-theoretic bounded rationality, the main assumption is that the decision-maker’s behavioral policy may not deviate too much from some prior policy. Deviations from the prior policy are costly and are modeled through the Kullback-Leibler divergence (Section I-A). This is similar to other regularization techniques proposed in robotics, such as Trust Region Policy Optimization (TRPO) and Relative Entropy Policy Search (REPS). The main difference lies in the choice of the prior behavioral policy. In both TRPO and REPS, the prior policy represents the agent’s behavior at a previous iteration of the optimization procedure or a set of initial expert trajectories (in imitation learning). In our approach, on the other hand, the agent’s prior policy is optimal w.r.t. the information-theoretic constraints (Equation (3)), encouraging the agent to ignore irrelevant sensory information that has little impact on the reward.
Conceptually, bounded rationality has its most obvious application when the information-processing capacity is limited by physical constraints such as time or space. In our present model this was not the case, which raises the question why one would restrict the system to a smaller capacity than it might naturally have. Apart from possible effects on learning speed that we did not investigate here, restricting the channel capacity creates a bottleneck that filters out irrelevant information and creates abstractions that are useful for generalization. For instance, in Panel D of Figure 5, two abstract percepts emerge: “mug with handle(s)” (x2) and “mug without handles” (x4).
In summary, our information-theoretic principle (8) for perception-action coupling provides a novel, generic, and principled method that could in principle be applied to combine any parameterized perception and action modules. Compared to the existing literature, this approach is most similar in spirit to approaches that learn the particular perceptual features that are most useful for solving a particular task [40, 41, 42, 43]. Here this feature search is integrated into a single bounded rational optimization problem.
-  J. Bohg, K. Hausman, B. Sankaran, O. Brock, D. Kragic and S. Schaal, ”Interactive Perception: Leveraging Action in Perception and Perception in Action”, arXiv:1604.03670v2
-  T. Genewein, F. Leibfried, J. Grau-Moya and D. A. Braun, ”Bounded rationality, abstraction and hierarchical decision-making: an information-theoretic optimality principle”, Frontiers in Robotics and AI, vol. 2(27), 2015, pp.1-24.
-  A. Noë, ”Action in Perception”, MIT Press (A Bradford Book), 2004
-  R. C. Arkin, ”Behavior-based Robotics”, MIT Press, 1998
-  R. Pfeifer and J. C. Bongard, ”How the Body Shapes the Way We Think: A New View of Intelligence”, MIT Press, 2006
-  P. Fitzpatrick and G. Metta, ”Towards manipulation-driven vision”, IEEE/RSJ International Conference on Intelligent Robots and Systems, vol. 1, 2002, pp.43–48
-  H. van Hoof, O. Kroemer and J. Peters, ”Probabilistic segmentation and targeted exploration of objects in cluttered environments”, IEEE Transactions on Robotics, vol. 30, no. 5, 2014, pp. 1198–1209
-  M. Gupta and G. S. Sukhatme, ”Using manipulation primitives for brick sorting in clutter”, International Conference on Robotics and Automation, 2012
-  L. Y. Chang, J. R. Smith and D. Fox, ”Interactive singulation of objects from a pile”, International Conference on Robotics and Automation, 2012
-  F. P. Ramsey, ”Truth and probability”, in The Foundations of Mathematics and Other Logical Essays, ed. R. B. Braithwaite (New York, NY: Harcourt, Brace and Co), 1931, pp. 156-198.
-  J. Von Neumann and O. Morgenstern, ”Theory of Games and Economic Behavior”, Princeton: Princeton University Press, 1944.
-  L. J. Savage, ”The Foundations of Statistics”, New York: Wiley, 1954.
-  G. Gigerenzer and P. M. Todd, ”Simple Heuristics That Make Us Smart”, Oxford: Oxford University Press, 1999.
-  D. Kahneman, ”Maps of bounded rationality: psychology for behavioral economics”, Am. Econ. Rev., vol. 93, 2003, pp.1449-1475.
-  A. Howes, R. L. Lewis and A. Vera, ”Rational adaptation under task and processing constraints: implications for testing theories of cognition and action”, Psychol. Rev., vol. 116, 2009, pp. 717-751.
-  S. Russell, ”Rationality and intelligence”, in Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, ed. C. Mellish (San Francisco, CA: Morgan Kaufmann), 1995, pp. 950-957.
-  S. Russell and P. Norvig, ”Artificial Intelligence: A Modern Approach”, Upper Saddle River: Prentice Hall, 2002.
-  R. L. Lewis, A. Howes and S. Singh, ”Computational rationality: linking mechanism and behavior through bounded utility maximization”, Top. Cogn. Sci., vol 6, 2014, pp. 279-311.
-  D. A. Braun, P. A. Ortega, E. Theodorou and S. Schaal, ”Path integral control and bounded rationality”, in IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (Piscataway: IEEE), 2011, pp. 202-209.
-  D. A. Braun, and P. A. Ortega, ”Information-theoretic bounded rationality and epsilon-optimality”, Entropy 16, 2014, pp. 4662-4676.
-  P. A. Ortega and D. A. Braun, ”Free energy and the generalized optimality equations for sequential decision making”, European Workshop on Reinforcement Learning, JMLR Workshop and Conference Proceedings, 2012
-  P.A. Ortega and D. A. Braun, ”Thermodynamics as a theory of decision-making with information-processing costs”, Proc. R. Soc. A Math. Phys. Eng. Sci., vol. 469(2153), 2013.
-  P. A. Ortega, D. A. Braun and N. Tishby, ”Monte Carlo methods for exact & efficient solution of the generalized optimality equations”, in Proceedings of IEEE International Conference on Robotics and Automation, Hong Kong, 2014.
-  P. A. Ortega and D. A. Braun, ”A conversion between utility and information”, in Third Conference on Artificial General Intelligence (AGI 2010) (Lugano: Atlantis Press), 2010, pp. 115-120.
-  J. Peters, K. Mülling, Y. Altün, ”Relative Entropy Policy Search”, Twenty-Fourth National Conference on Artificial Intelligence (AAAI-10), 2010, pp. 1607–1612
-  E. Theodorou, J. Buchli and S. Schaal, ”A Generalized Path Integral Control Approach to Reinforcement Learning”, J. Mach. Learn. Res., vol. 11, 2010, pp. 3137–3181
-  E. Todorov, ”Linearly-solvable Markov decision problems”, Advances in neural information processing systems, 2006, pp. 1369–1376
-  H. J. Kappen ”Linear theory for control of nonlinear stochastic systems”, Physical review letters, vol. 95, no. 20, 2005, p. 200201
-  N. Tishby, F. C. Pereira, and W. Bialek, ”The information bottleneck method”, in The 37th Annual Allerton Conference on Communication, Control, and Computing, 1999.
-  I. Csiszar, and G. Tusnady, ”Information geometry and alternating minimization procedures”, Stat. Decis. vol. 1, 1984, pp. 205-237.
-  C. E. Shannon, ”Coding Theorems for a Discrete Source With a Fidelity Criterion”, Institute of Radio Engineers, International Convention Record, vol. 7, 1959, pp. 142-163.
-  R. Blahut, ”Computation of channel capacity and rate-distortion functions”, IEEE Trans. Inf. Theory 18, 1972, pp. 460-473.
-  S. Arimoto, ”An algorithm for computing the capacity of arbitrary discrete memoryless channels”, IEEE Trans. Inf. Theory 18, 1972, pp. 14-20.
-  R.W. Yeung, ”Information Theory and Network Coding”, New York: Springer, 2008.
-  T. M. Cover, and J. A. Thomas, ”Elements of Information Theory”, Hoboken: John Wiley & Sons, 1991.
-  F. Leibfried and D. A. Braun, ”Bounded Rational Decision-Making in Feedforward Neural Networks”, Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, 2016
-  L. Bottou, ”Online Algorithms and Stochastic Approximations”, in Online Learning and Neural Networks, Cambridge University Press, 1998.
-  X. Glorot and Y. Bengio, ”Understanding the difficulty of training deep feedforward neural networks”, In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS’10), 2010
-  J. Schulman, S. Levine, P. Moritz, M. I. Jordan, P. Abbeel, ”Trust Region Policy Optimization”, International Conference on Machine Learning (ICML), 2015
-  P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine, ”Learning to poke by poking: Experiential learning of intuitive physics”, in Advances in Neural Information Processing Systems, 2016
-  R. Jonschkowski and O. Brock, ”Learning state representations with robotic priors”, Autonomous Robots, vol. 39, no. 3, pp. 407–428, 2015
-  S. Levine, C. Finn, T. Darrell and P. Abbeel, ”End-to-end training of deep visuomotor policies”, Journal of Machine Learning Research, vol. 17, no. 39, pp. 1–40, 2016
-  J. Piater, S. Jodogne, R. Detry, D. Kraft, N. Krueger, O. Kroemer and J. Peters, ”Learning visual representations for perception-action systems”, International Journal of Robotics Research, 2011