Intelligent systems are often formalized as decision-makers that learn probabilistic models of their environment and optimize utilities. Such utility functions can represent different classes of problems, such as classification, regression or reinforcement learning. For such agents, learning optimal policies by enumerating all possibilities and determining their expected utilities is usually too costly. Intelligent agents must instead invest their limited resources such that they optimally trade off utility versus processing costs, which can be formalized in the framework of bounded rationality. The information-theoretic approach to bounded rationality provides an abstract model to formalize how such agents behave in order to maximize utility within a given resource limit, where resources are quantified by information processing constraints [12, 29, 45, 49].
Intriguingly, the information-theoretic model of bounded rationality can also explain the emergence of hierarchies and abstractions, in particular when multiple bounded rational agents are involved in a decision-making process. In this case an optimal arrangement of decision-makers leads to specialization of agents and an optimal division of labor, which can be exploited to reduce computational effort. Here, we introduce a novel gradient-based on-line learning paradigm for hierarchical decision-making systems. Our method finds an optimal soft partitioning of the problem space by imposing information-theoretic constraints on both the expert selection and the expert specialization. We argue that these constraints enforce an efficient division of labor in systems that are bounded. As an example, we apply our algorithm to systems that are limited in their representational power, in particular by assuming linear decision-makers that can be combined to solve problems that are too complex for each decision-maker alone.
The outline of this paper is as follows: first we give an introduction to bounded rationality, next we introduce our novel approach and demonstrate how it can be applied to classification, regression, non-linear control, and reinforcement learning. Finally, we conclude.
II-A Bounded Rational Decision Making
An important concept in decision making is the Maximum Utility principle, where an agent always chooses an optimal action $a^*_s$ that maximizes its expected utility given the context $s$, i.e.
$$a^*_s = \arg\max_a U(s,a),$$
where the utility is given by a function $U(s,a)$ and the states are distributed according to a known and fixed distribution $\rho(s)$. Solving this optimization problem naively leads to an exhaustive search over all possible $(s,a)$ pairs, which is in general prohibitive. One way to address this is to study decision-makers that have limited processing power, e.g. that have to act within a given time limit. Instead of finding an optimal strategy, a bounded-rational decision-maker optimally trades off expected utility and the processing costs required to adapt the system. In this study we consider the information-theoretic free-energy principle [32, 33] of bounded rationality, where the decision-maker’s behavior is described by a probability distribution $p(a|s)$ over actions $a$ given a particular state $s$, and the decision-maker’s processing costs are given by the Kullback-Leibler divergence $D_{\mathrm{KL}}(p(a|s)\,\|\,p(a))$ between the agent’s prior distribution $p(a)$ over the actions and the posterior policies $p(a|s)$.
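We can model the decision-maker’s processing power by defining an upper bound $B$ on the Kullback-Leibler divergence that the agent can maximally spend to adapt its prior behavior, which results in the following constrained optimization problem:
$$\max_{p(a|s)} \sum_s \rho(s) \sum_a p(a|s)\, U(s,a) \quad \text{s.t.} \quad \mathbb{E}_{\rho(s)}\big[D_{\mathrm{KL}}(p(a|s)\,\|\,p(a))\big] \le B.$$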
This constraint can be regarded as a regularization imposed on the form the posterior distributions $p(a|s)$ can take. The resulting constrained optimization problem can be transformed into an unconstrained variational problem by introducing a Lagrange multiplier $\beta$ that governs the trade-off between expected utility gain and information cost:
$$\max_{p(a|s)} \mathbb{E}_{\rho(s)}\Big[\mathbb{E}_{p(a|s)}[U(s,a)] - \frac{1}{\beta} D_{\mathrm{KL}}(p(a|s)\,\|\,p(a))\Big]. \tag{4}$$
For $\beta \to \infty$ the agent acts perfectly rationally, and for $\beta \to 0$ the agent can only act according to the prior. The optimal prior in this case is given by the marginal $p(a) = \sum_s \rho(s)\, p(a|s)$. The solution to this optimization problem can be found by applying Blahut-Arimoto type algorithms, similar to rate-distortion theory [9, 5, 13].
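As an illustration of the Blahut-Arimoto type iteration mentioned above, the following minimal sketch alternates between a Boltzmann-like posterior update and the marginal prior update on a small discrete problem. The function name `bounded_rational_policy`, the uniform initial prior, and the fixed iteration count are assumptions made for this example and are not taken from the original text.

```python
import math


def bounded_rational_policy(utility, rho, beta, iterations=200):
    """Blahut-Arimoto style iteration for one bounded-rational agent.

    utility[s][a] is U(s, a), rho[s] is the state distribution and beta is
    the resource parameter. Returns (policy, prior) with policy[s][a] = p(a|s).
    """
    n_states, n_actions = len(utility), len(utility[0])
    prior = [1.0 / n_actions] * n_actions  # assumed uniform initial prior p(a)
    policy = [prior[:] for _ in range(n_states)]
    for _ in range(iterations):
        # Posterior update: p(a|s) proportional to p(a) * exp(beta * U(s, a)).
        for s in range(n_states):
            u_max = max(utility[s])  # subtract the max for numerical stability
            weights = [prior[a] * math.exp(beta * (utility[s][a] - u_max))
                       for a in range(n_actions)]
            z = sum(weights)
            policy[s] = [w / z for w in weights]
        # Prior update: p(a) is the marginal of the posterior under rho(s).
        prior = [sum(rho[s] * policy[s][a] for s in range(n_states))
                 for a in range(n_actions)]
    return policy, prior
```

With a large `beta` the returned policy concentrates on the utility-maximizing action in each state, while `beta = 0` leaves the policy equal to the prior, mirroring the two limits discussed above.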
II-A1 Hierarchical Decision Making
Combining several bounded-rational agents by a selection policy allows for solving optimization problems that exceed the capabilities of the individual decision-makers. To achieve this, the search space is split into partitions that can be solved by the individual decision-makers. A two-stage mechanism is introduced: the first stage consists of an expert selection policy $p(x|s)$ that chooses an expert $x$ based on the past performance of $x$ given the state $s$. This requires high processing power in the expert selection, such that an optimal mapping from states to experts can be learned. The second stage chooses an action according to the expert’s policy $p(a|s,x)$. The optimization problem given by (4) can be extended to incorporate a trade-off between computational costs and utility in both stages in the following way:
$$\max_{p(x|s),\,p(a|s,x)} \mathbb{E}[U(s,a)] - \frac{1}{\beta_1} I(S;X) - \frac{1}{\beta_2} I(S;A|X), \tag{5}$$
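where $\beta_1$ is the resource parameter for the expert selection stage and $\beta_2$ for the experts, and $I(\cdot\,;\cdot)$ is the mutual information between the two random variables. The solution can be found by iterating the following set of equations:
$$\begin{aligned}
p(x|s) &= \frac{1}{Z(s)}\, p(x) \exp\big(\beta_1 \Delta F(s,x)\big), \\
p(x) &= \sum_s \rho(s)\, p(x|s), \\
p(a|s,x) &= \frac{1}{Z(s,x)}\, p(a|x) \exp\big(\beta_2 U(s,a)\big), \\
p(a|x) &= \sum_s p(s|x)\, p(a|s,x),
\end{aligned}$$
where $Z(s)$ and $Z(s,x)$ are normalization factors and $\Delta F(s,x) = \mathbb{E}_{p(a|s,x)}[U(s,a)] - \frac{1}{\beta_2} D_{\mathrm{KL}}(p(a|s,x)\,\|\,p(a|x))$ is the free energy of the action selection stage. The effective distribution $p(a|x)$ encapsulates a mixture-of-experts policy consisting of the experts $p(a|s,x)$ weighted by the responsibilities $p(s|x)$. Note that the Bayesian posterior $p(s|x) = \rho(s)\, p(x|s) / p(x)$ is not determined by a given likelihood model, but is the result of the optimization process (5). The hierarchical system is depicted in Figure 1.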
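The iteration of the two-stage solution equations can be sketched for a small discrete problem as follows. This is only an illustrative sketch: the function name `hierarchical_iteration`, the random initialization of the expert priors (used to break the symmetry between experts), the fixed iteration count, and the small threshold guarding unused experts are all assumptions of this example.

```python
import math
import random


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def hierarchical_iteration(U, rho, beta1, beta2, n_experts,
                           iterations=100, seed=0):
    """Iterate the self-consistent equations of the two-stage hierarchy.

    U[s][a] is the utility, rho[s] the state distribution. Returns
    (p_x_s, p_a_sx, p_x, p_a_x) as nested lists.
    """
    rng = random.Random(seed)
    S, A, X = len(U), len(U[0]), n_experts
    p_x = [1.0 / X] * X
    # Assumed random initial expert priors p(a|x) to break symmetry.
    p_a_x = [normalize([rng.random() + 0.1 for _ in range(A)])
             for _ in range(X)]
    p_x_s, p_a_sx = [], []
    for _ in range(iterations):
        p_a_sx, dF = [], []
        for s in range(S):
            u_max = max(U[s])
            post_s, dF_s = [], []
            for x in range(X):
                # Expert posterior: p(a|s,x) prop. to p(a|x) exp(beta2 U(s,a)).
                post = normalize([p_a_x[x][a] * math.exp(beta2 * (U[s][a] - u_max))
                                  for a in range(A)])
                # Free energy: expected utility minus KL cost scaled by 1/beta2.
                free = sum(p * (U[s][a] - math.log(p / p_a_x[x][a]) / beta2)
                           for a, p in enumerate(post) if p > 0.0)
                post_s.append(post)
                dF_s.append(free)
            p_a_sx.append(post_s)
            dF.append(dF_s)
        # Selector posterior: p(x|s) prop. to p(x) exp(beta1 * dF(s,x)).
        p_x_s = [normalize([p_x[x] * math.exp(beta1 * (dF[s][x] - max(dF[s])))
                            for x in range(X)]) for s in range(S)]
        # Selector prior: marginal of the selector posterior under rho(s).
        p_x = [sum(rho[s] * p_x_s[s][x] for s in range(S)) for x in range(X)]
        # Expert priors: mixture of posteriors weighted by p(s|x).
        for x in range(X):
            if p_x[x] > 1e-12:  # leave the prior of an unused expert unchanged
                p_a_x[x] = [sum(rho[s] * p_x_s[s][x] * p_a_sx[s][x][a]
                                for s in range(S)) / p_x[x] for a in range(A)]
    return p_x_s, p_a_sx, p_x, p_a_x
```

Because the fixed point depends on the initialization, different seeds may lead to different partitions of the state space among the experts.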
II-B Maximum Entropy Reinforcement Learning
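We model sequential decision problems by defining a Markov Decision Process as a tuple $(\mathcal{S}, \mathcal{A}, P, r)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of actions, $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ is the transition probability, and $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward function. The aim is to find the parameter $\theta$ of a policy $\pi_\theta$ that maximizes the expected reward:
$$\theta^* = \arg\max_\theta \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{T} r(s_t, a_t)\Big].$$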
We define $r(\tau) = \sum_{t=0}^{T} r(s_t, a_t)$ as the cumulative reward of trajectory $\tau = \{(s_t, a_t)\}_{t=0}^{T}$, which is sampled by acting according to the policy $\pi_\theta$, i.e. $a_t \sim \pi_\theta(\cdot|s_t)$ and $s_{t+1} \sim P(\cdot|s_t, a_t)$. Learning in this environment can then be modeled by reinforcement learning, where an agent interacts with an environment defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r)$ over a number of (discrete) time steps $t$. At each time step $t$, the agent receives a state $s_t$ and selects an action $a_t$ according to the policy $\pi_\theta$. In return, the agent receives the next state $s_{t+1}$ and a scalar reward $r_t$. This process continues until the agent reaches a terminal state, after which the process restarts. The goal of the agent is to maximize the expected return from each state $s_t$, which is typically defined as the infinite-horizon discounted sum of the rewards. A common choice for achieving this is Q-Learning, where we make use of an action-value function that is defined as the discounted sum of rewards:
$$Q(s_t, a_t) = \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\Big|\, s_t, a_t\Big],$$
where $\gamma \in [0, 1)$ is a discount factor. Learning the optimal policy can be achieved in many ways. Here, we consider policy gradient methods, which are a popular choice to tackle reinforcement learning problems. The main idea is to directly manipulate the parameters $\theta$ of the policy in order to maximize the objective $J(\theta)$ by taking steps in the direction of the gradient $\nabla_\theta J(\theta)$. The gradient of the policy can be written as
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\big],$$
where $p_\theta(\tau)$ is the probability of a trajectory $\tau$ under the policy $\pi_\theta$. This result of the policy gradient theorem has inspired several algorithms, which often differ in how they estimate the cumulative reward, e.g. Q-Learning, the REINFORCE algorithm, and Actor-Critic algorithms. In this study we will introduce a hierarchical Actor-Critic algorithm.
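The score-function form of the policy gradient can be checked on a toy problem. The sketch below uses a one-step bandit with a softmax policy, which is a simplification chosen for this example rather than the setting of the paper; the function names are hypothetical. It computes the expectation of $\nabla_\theta \log \pi_\theta(a)\, r(a)$ exactly by enumerating the actions, so that it can be compared against a finite-difference gradient of the expected reward.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def expected_reward(theta, rewards):
    """J(theta) = sum_a pi_theta(a) r(a) for a one-step bandit."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def score_function_gradient(theta, rewards):
    """Exact expectation of grad log pi(a) * r(a) under a softmax policy."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (p_a, r_a) in enumerate(zip(pi, rewards)):
        for k in range(len(theta)):
            # For a softmax policy: d/d theta_k log pi(a) = 1[a == k] - pi(k).
            score = (1.0 if a == k else 0.0) - pi[k]
            grad[k] += p_a * score * r_a
    return grad
```

In practice the expectation is replaced by samples, which is where the variance issues of policy gradient estimators arise.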
To balance exploration vs. exploitation, maximum entropy reinforcement learning introduces an additional policy entropy term as a regularization term in the value function. The optimal value function under this constraint is defined as
$$V^*(s) = \max_\pi \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t \Big(r(s_t, a_t) + \frac{1}{\beta} H\big(\pi(\cdot|s_t)\big)\Big)\Big].$$
Here, $\beta$ trades off between reward and entropy. With $\beta \to \infty$ we recover the standard RL value function and with $\beta \to 0$ we recover the value function under a random policy. We can define this objective as an inference problem, where we specify a fixed prior distribution over trajectories. In the next sections we generalize this assumption by assuming the prior distribution to be part of the optimization problem, as discussed in the earlier section on bounded rationality.
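The two limits of the trade-off parameter can be made concrete with a one-step example. Assuming a uniform prior over a finite set of actions and writing the trade-off parameter as `beta` (both assumptions of this sketch), the entropy-regularized value of a state reduces to a log-mean-exp of the action values, which interpolates between the mean and the maximum.

```python
import math


def soft_value(q_values, beta):
    """(1 / beta) * log of the mean of exp(beta * Q) under a uniform prior."""
    n = len(q_values)
    q_max = max(q_values)
    # Log-mean-exp with the maximum factored out for numerical stability.
    log_mean_exp = math.log(
        sum(math.exp(beta * (q - q_max)) for q in q_values) / n)
    return q_max + log_mean_exp / beta
```

For large `beta` the result approaches `max(q_values)` (the standard value function), and for `beta` close to zero it approaches the average of `q_values` (the value under a uniformly random policy).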
III Specialization in Hierarchical Systems
In this section we introduce our novel gradient-based algorithm to learn the components of a hierarchical multi-agent policy. Information-theoretic bounded rationality argues that hierarchies and abstractions emerge when agents have only limited computational resources. In particular, we leverage the hierarchical model introduced earlier to learn a disentangled representation of the state and action spaces. We will see how limiting the amount of uncertainty each agent can reduce leads to specialization. First we will show how this principle can be transformed into a general on-line learning algorithm, and afterwards we will show how it can be applied to classification as an illustrative example and to reinforcement learning. In the following we will derive our algorithm with a focus on reinforcement learning.
The model consists of two stages: an expert selection stage and an action selection stage. The first stage learns a soft partitioning of the state space and assigns each partition optimally to the experts according to a parametrized policy $p_\theta(x|s)$ with parameters $\theta$, such that the expected utility is maximized under the information-theoretic constraint on $I(S;X)$. The second stage is defined by a set of policies $p_\vartheta(a|s,x)$ that maximize the expected utility for each expert $x$. In the following we devise a gradient-based on-line learning algorithm to find the optimal parameters.
First, we note that in the reinforcement learning setup the utility is given by the reward function $r(s,a)$. Second, in maximum entropy RL the regularization penalizes deviation from a fixed uniform prior, but in a more general setting we can discourage deviation from an arbitrary prior policy $p(a)$:
$$\max_\pi \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t \Big(r(s_t, a_t) - \frac{1}{\beta} \log \frac{\pi(a_t|s_t)}{p(a_t)}\Big)\Big].$$
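As discussed in Section II-A, the optimal prior (in terms of an optimal trade-off between utility and processing cost) is the marginal of the posterior policy, given by $p(a) = \sum_s \rho(s)\, \pi(a|s)$. We can incorporate this into the optimization problem (5) by rewriting it to:
$$\max_{\theta, \vartheta} \sum_{x,a} p_\theta(x|s)\, p_\vartheta(a|s,x)\, J(s,x,a),$$
where the objective is given by
$$J(s,x,a) = r(s,a) - \frac{1}{\beta_1} \log \frac{p_\theta(x|s)}{p(x)} - \frac{1}{\beta_2} \log \frac{p_\vartheta(a|s,x)}{p(a|x)},$$
and $\theta$, $\vartheta$ are the parameters of the selection policy and the expert policies, respectively. Note that each expert policy has a distinct set of parameters $\vartheta_x$, i.e. $\vartheta = \{\vartheta_x\}_x$, but we drop the index for readability. Because our algorithm is on-line and the action space is continuous, it would be prohibitive to compute the prior in each step. Instead we approximate the prior distributions $p(x)$ and $p(a|x)$ by exponential running mean averages of the posterior policies with momentum terms $\lambda_1$ and $\lambda_2$:
$$p_{t+1}(x) = \lambda_1\, p_t(x) + (1 - \lambda_1)\, p_\theta(x|s_t), \qquad p_{t+1}(a|x) = \lambda_2\, p_t(a|x) + (1 - \lambda_2)\, p_\vartheta(a|s_t, x_t).$$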
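In our experiments we set the momentum terms to 0.99. To optimize the objective we define two separate value functions: one to estimate the discounted sum of rewards and one to estimate the free energy of the expert policies. The discounted reward for the experts is
$$R_t = \sum_{l=0}^{T} \gamma^l\, r(s_{t+l}, a_{t+l}),$$
which we learn by parameterizing the value function with a neural network and performing regression on
$$\sum_{\tau \in \mathcal{T}} \sum_t \big(V_\phi(s_t) - R_t\big)^2,$$
where $\phi$ are the parameters of the value net and $\mathcal{T}$ is a set of trajectories up to horizon $T$ collected by roll-outs of the policies. Similar to the discounted reward we can now define the discounted free energy as
$$F_t = \sum_{l=0}^{T} \gamma^l\, f(s_{t+l}, x_{t+l}, a_{t+l}),$$
where $f(s,x,a) = r(s,a) - \frac{1}{\beta_2} \log \frac{p_\vartheta(a|s,x)}{p(a|x)}$. This value function is likewise learned by parameterizing it with a neural network and performing regression on the mean squared error
$$\sum_{\tau \in \mathcal{T}} \sum_t \big(F_\psi(s_t) - F_t\big)^2,$$
where $\psi$ are the parameters of the value net, and $\mathcal{T}$ is a set of trajectories.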
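The running-mean approximation of the priors described above can be sketched for a discrete distribution as follows. The helper name `update_prior` and the list-based representation are assumptions of this example; with continuous actions the same update would be applied to the parameters or densities of the policy instead.

```python
def update_prior(prior, posterior, momentum=0.99):
    """Exponential running mean of posterior policies (discrete case).

    A momentum close to 1 makes the prior change slowly, so that it tracks
    the average posterior over many recent steps.
    """
    return [momentum * p + (1.0 - momentum) * q
            for p, q in zip(prior, posterior)]


# Usage: track the selector prior p(x) over a stream of posteriors p(x|s_t).
prior = [0.5, 0.5]
for posterior in ([0.9, 0.1], [0.8, 0.2], [0.7, 0.3]):
    prior = update_prior(prior, posterior)
```

Since each update is a convex combination of two normalized distributions, the running prior remains normalized.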
III-A Expert Selection
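The selector network learns a policy $p_\theta(x|s)$ that maps states $s$ to expert policies based on their past performance. The resource parameter $\beta_1$ constrains how well this gating step can differentiate between the experts. For $\beta_1 \to 0$ the selection follows the prior $p(x)$ alone, i.e. it distributes each state uniformly across the experts or assigns all states to one single expert, depending on $p(x)$. For $\beta_1 \to \infty$, the selection converges to the perfectly rational selector, which always chooses the optimal expert $x^*_s$. Thus, the expert selection stage optimizes the following objective:
$$\max_\theta J(\theta), \qquad J(\theta) = \mathbb{E}_{p_\theta(x|s)}\Big[f(s,x) - \frac{1}{\beta_1} \log \frac{p_\theta(x|s)}{p(x)}\Big],$$
where $f(s,x) = \mathbb{E}_{p_\vartheta(a|s,x)}\big[r(s,a) - \frac{1}{\beta_2} \log \frac{p_\vartheta(a|s,x)}{p(a|x)}\big]$, which is the free energy of the expert. The gradient of $J(\theta)$ is then given (up to an additive constant) by
$$\nabla_\theta J(\theta) = \mathbb{E}_{p_\theta(x|s)}\Big[\nabla_\theta \log p_\theta(x|s)\, \Big(f(s,x) - \frac{1}{\beta_1} \log \frac{p_\theta(x|s)}{p(x)}\Big)\Big].$$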
The double expectation can be replaced by Monte Carlo estimates, where in practice we use a single sampled tuple $(s, x, a)$. This formulation is known as the policy gradient method and is prone to producing high-variance gradients. A common technique to reduce the variance is to formulate the updates using the advantage function instead of the reward [40, 30]. The advantage function is a measure of how well a certain action $a$ performs in a state $s$ compared to the average performance in that state, i.e. $A(s,a) = Q(s,a) - V(s)$. Here, $V(s)$ is called the value function and is a measure of how well the agent performs in state $s$, and $Q(s,a)$ is an estimate of the cumulative reward achieved in state $s$ when the agent executes action $a$. Thus the advantage is an estimate of how advantageous it is to pick $a$ in state $s$ in relation to a baseline performance $V(s)$. Instead of learning the value and the Q function, we can approximate the advantage function in the following way:
This yields the following gradient estimates for the selector:
where $\mathcal{T}$ is a set of trajectories produced by the policies. This formulation allows us to perform the updates as in advantage actor-critic models. The expectation can be estimated via Monte Carlo sampling, which enables us to employ our algorithm in an on-line fashion.
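A common way to approximate the advantage from a learned value function alone is the one-step temporal-difference form $r_t + \gamma V(s_{t+1}) - V(s_t)$. The sketch below assumes this form for illustration; the estimator used in the experiments may differ (for example, by using the free energy in place of the reward or a multi-step estimate), and the function name is hypothetical.

```python
def one_step_advantages(signals, values, gamma=0.99):
    """One-step advantage estimates A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    signals[t] is the per-step learning signal (e.g. the reward) and
    values holds V(s_0), ..., V(s_T), i.e. one more entry than signals.
    The value of a terminal state should be passed as 0.
    """
    if len(values) != len(signals) + 1:
        raise ValueError("values must have exactly one more entry than signals")
    return [r + gamma * v_next - v
            for r, v, v_next in zip(signals, values[:-1], values[1:])]
```

A positive entry indicates that the chosen action performed better than the baseline predicted for that state, so its log-probability is pushed up by the gradient step.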
III-B Action Selection
The actions are given by the posterior action distribution of the experts. Each expert maintains a policy for each of the world states and updates those according to the utility/cost trade-off. The advantage function for each expert is given as
The objective of this stage is then to maximize the expected advantage , yielding the gradient estimates
for each of the experts.
The algorithm we propose to find such an optimal hierarchical structure is based on the alternating optimization paradigm. During one phase the expert selector distribution is optimized while the action selectors are held fixed, and vice versa. The length of the phases can either be fixed or limited by checking for a convergence criterion. The complete algorithm is given in Algorithm 1.
IV Experiments and Results
In the following we will show how our approach can be applied to learning tasks where the overall complexity of the problem exceeds the processing power of (linear) experts. In particular, we will look at classification, regression, non-linear control (gain scheduling), and reinforcement learning. In our experiments we model the selector and both value functions as artificial neural networks and train them according to the gradient estimates we derived in Section III.
IV-A Classification and Regression with Linear Decision-Makers
When dealing with complex data for classification (or regression) it is often beneficial to combine multiple classifiers (or regressors). In the framework of ensemble learning, for example, multiple classifiers are joined together to stabilize learning and improve the results, as in Mixture-of-Experts and Multiple Classifier Systems. The method we propose, when applied to classification problems, can be interpreted as a member of this family of algorithms. The application to regression can be considered an example of local linear regression. In accordance with Section II-A we define the utility as the negative loss, i.e. $U(\hat{y}, y) = -\mathcal{L}(\hat{y}, y)$, where $\hat{y}$ is the action of the expert, i.e. the estimated class label or the estimated regression value, $y$ is the ground truth, and $s$ are the input features. For classification, we chose the cross-entropy loss $\mathcal{L}(\hat{y}, y) = -\sum_i y_i \log \hat{y}_i$ as a performance measure, for regression the mean squared error $\mathcal{L}(\hat{y}, y) = \|\hat{y} - y\|^2$. The objective for expert selection becomes
where , i.e. the free energy of the expert . For action selection the objective then becomes
. Our method is able to partition the problem set into subspaces and fit a linear decision-maker on each subset. This is achieved by the emergence of a hierarchy of specialized agents, as is evidenced by the decision boundaries. We implemented our experiments using TensorFlow and scikit-learn .
IV-B Gain Scheduling by Combining Linear Decision-Makers
When dealing with non-linear dynamics, one approach is to decompose them into several linear sub-problems and design a linear controller for each sub-problem. This method is known as gain scheduling and is well established in the control literature. In some cases it is possible to find auxiliary variables that correlate well with the changes in the underlying dynamics. It is then possible to reduce the effects of the parameter variations simply by changing the parameters of the regulator as a function of the auxiliary variables. For example, in flight control systems the Mach number and the dynamic pressure are measured by sensors and used as scheduling variables. A main problem in designing such systems is to find suitable scheduling variables. This is usually done by incorporating prior knowledge about the system dynamics. Here, we demonstrate how our approach can be applied to automatically learn the scheduling regions and the pertinent linear controllers without pre-specifying the operating points. As an illustrative example, consider a scalar non-linear plant defined by the following piecewise linear dynamics:
where is the system state, are the system matrices, are the control matrices and is a random Gaussian noise source. Here, is a state partition into piecewise linear (affine) control regimes. To approach this problem, we denote the plant state as states , the control signal as the action and learn a set of linear Gaussian control policies (i.e. the experts), where we perform gradient descent to find the optimal parameters, as described in Section III.
Consider the following operation regimes:
where and . Assuming quadratic costs , it is straightforward to find an optimal controller for each partition and we can switch between the regimes by defining the system state as the scheduling variable. We set and and and find the optimal gains and , for regimes and respectively. Obviously, this is only possible if the plant dynamics are known. In contrast, our algorithm is able to learn the scheduling policy and the control policy automatically, which is shown in Figure 4. The gains found by our algorithm were and and achieve a control cost of compared to achieved by gain scheduling. We set the noise source to , causing the control policy to shift when the plant state is close to zero. The prior is , showing that both expert policies were used to control the plant. The mutual information of bits and the entropy of decaying towards 0 show that the selector successfully learns to partition the state space. The resource parameter was set to to drive specialization.
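To make the gain-scheduling baseline concrete, the following sketch simulates a scalar piecewise linear plant under a state-dependent linear feedback law. Every number in it (system coefficients, control coefficients, gains, initial state) is a hypothetical placeholder and not one of the values used in the experiment; the split into two regimes at zero is likewise an assumption of this example.

```python
import random


def simulate(gains, a=(1.2, 0.8), b=(1.0, 0.5), x0=2.0, steps=50,
             noise_std=0.0, seed=0):
    """Roll out x_{t+1} = a_i x_t + b_i u_t + noise with u_t = -k_i x_t.

    Regime i is 0 for x_t >= 0 and 1 otherwise (assumed partition).
    Returns the accumulated quadratic cost sum_t (x_t^2 + u_t^2).
    """
    rng = random.Random(seed)
    x, cost = x0, 0.0
    for _ in range(steps):
        i = 0 if x >= 0.0 else 1      # scheduling variable: the plant state
        u = -gains[i] * x             # linear feedback with regime gain k_i
        cost += x * x + u * u         # quadratic state and control cost
        x = a[i] * x + b[i] * u + rng.gauss(0.0, noise_std)
    return cost


# Usage: compare scheduled feedback against applying no control at all.
scheduled_cost = simulate(gains=(0.8, 0.4))
uncontrolled_cost = simulate(gains=(0.0, 0.0))
```

With these placeholder values the first regime is open-loop unstable, so the uncontrolled cost grows rapidly while the scheduled controller drives the state towards zero.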
IV-C Reinforcement Learning with Linear Decision-Makers
In the following we will show how our hierarchical system is able to solve a continuous reinforcement learning problem using an optimal arrangement of linear control policies. We evaluate on a task known as Acrobot , more commonly referred to as the inverted double pendulum. The task is to swing a double-linked pendulum up and keep it balanced as long as possible. The agent receives a reward of 10 plus a distance penalty between its actual state and the goal state. The episode is terminated if the agent reaches a predefined terminal state (hanging downwards) or after 1000 time steps. To balance the pendulum the agent is able to apply a force to the central joint of , i.e. move it to the left or the right, respectively. This environment poses a non-linear control problem and can thus not be solved optimally by a single linear controller. We show that using our approach, a committee of linear experts can solve this non-linear task. The results are shown in Figure 5. We allowed for five experts (both with ), but our system learns that three experts are sufficient to solve the task. The priors for each expert (lower right Figure, each color represents an expert) center on -1, 0, and 1, which correspond to swinging the double pendulum to the left, no force, and swinging to the right, respectively. The remaining two experts overlap accordingly. We can see that the mean in the five expert setup decreases, while the selection increases to
. Both indicate that the system has learned an optimal arrangement of three experts and is thus able to achieve maximum reward, eventually catching up to the performance of TRPO, our non-linear control baseline, which was trained with three-layer policy and value neural networks using ReLU activation functions. Our method successfully learned a partitioning of the double-pendulum state space without any prior information about the system dynamics or the state space. We implemented our experiments in TensorFlow and OpenAI Gym.
Recently, there has been increased interest in investigating the effects of information-theoretic constraints on reinforcement learning tasks with mixture-of-experts policies. For example, the authors of  have proposed a divide-and-conquer principle for policy learning in a reinforcement learning setting. They argue that splitting a central policy into several sub-policies improves the learning phase by requiring fewer samples overall. To implement this idea they split the action and state space into pre-defined partitions and train policies on these partitions. The information-theoretic constraints during training enforce that multiple experts are kept similar to each other, so that the expert policies can be fused into one central policy. In contrast, in our approach the information-theoretic constraints enforce that all experts stay close to their respective priors, thereby generating as little informational surprise as possible. This leads to specialization of experts, because each expert is assigned a sub-space of the input space where information can be processed efficiently without deviating too much from the optimal prior adapted to that region. Crucially, in our setup the partitioning is not predefined but part of the optimization process.
Our approach belongs to a wider class of models that use information constraints for regularization to deal more efficiently with learning and decision-making problems [11, 31, 28, 25, 19, 35, 26, 18, 15, 3, 21, 38, 20, 17, 16]. One such prominent approach is Trust Region Policy Optimization (TRPO). The main idea is to constrain each update step to a trust region around the current state of the system. This region is defined by the Kullback-Leibler divergence between the old policy and the new policy. The smooth updates provide a theoretical guarantee of monotonic policy improvement. Another similar approach is relative entropy policy search, where the idea is to learn a gating policy that can decide which sub-policy to choose. To achieve this the authors impose a constraint between the data distribution and the next policy.
Another approach related to ours is the Mixture of Experts (ME) model, originally introduced by  as a tree structure to solve complex classification and regression tasks by leveraging the divide-and-conquer paradigm. MEs consist of three main components: gates, experts, and a probabilistic model that combines the expert predictions. The objective of the gate is to find a soft partitioning of the input space and assign partitions to the experts which perform best on them. Experts are built to perform optimally in regression or classification tasks given an assigned partition. The model output is a weighted sum of the experts’ outputs, weighted by how confident the gate is in each expert’s opinion. MEs exhibit a high degree of flexibility, as evidenced by the variety of models and algorithms employed in the three components. Our model allows learning such models, but can also be applied to more general decision-making scenarios like reinforcement learning.
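The combination rule of a Mixture of Experts, a gate-weighted sum of expert outputs, can be sketched for scalar inputs as follows. The softmax gate over linear scores, the linear experts, and all names and parameter values are assumptions chosen for this illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def mixture_prediction(x, gate_params, expert_params):
    """Gate-weighted sum of linear expert outputs for a scalar input x.

    gate_params[k] and expert_params[k] are (weight, bias) pairs of the
    gate score and the prediction of expert k, respectively.
    """
    gate = softmax([w * x + b for w, b in gate_params])
    outputs = [w * x + b for w, b in expert_params]
    return sum(g * y for g, y in zip(gate, outputs)), gate


# Usage: two linear experts (y = -x and y = x) with a sharp gate that
# switches at x = 0 jointly approximate the non-linear function |x|.
gate_params = [(-20.0, 0.0), (20.0, 0.0)]
expert_params = [(-1.0, 0.0), (1.0, 0.0)]
prediction, responsibilities = mixture_prediction(-2.0, gate_params, expert_params)
```

Neither linear expert can represent the target on its own; the soft partitioning by the gate is what makes the combination expressive.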
Prior work in the control literature on a similar setup assumed the system dynamics to be given, e.g. in [2, 36, 50]. The authors of  split the state space of the inverted pendulum into predefined bins and find a linear controller that stabilizes each bin by learning a selection policy over these predefined controllers. Our approach differs from these in that we use only the reward signal and simultaneously learn a partitioning of the state space and linear controllers on these partitions. This poses a difficult learning problem, as both system parts have to adjust to one another on different timescales. Other decentralized approaches (e.g. ) have trained separate decentralized models to fuse them into a single model that can be used by a reinforcement learning agent. In contrast, our method learns several sub-policies.
In summary, we introduce a promising novel gradient-based on-line learning paradigm for hierarchical multi-agent systems. Our method finds an optimal soft partitioning by considering the agents’ limitations in terms of information-theoretic constraints, supporting expert specialization. Importantly, our model is capable of doing so without any prior information about the task. This becomes especially difficult in continuous control tasks, where the system dynamics are unknown. Our method is abstract and principled in a way that allows it to be employed on a variety of tasks including multi-agent decision-making, mixture-of-expert regression, and divide-and-conquer reinforcement learning. An open question remains how to apply our method to high-dimensional control tasks.
References
-  Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
-  Ekaterina Abramova, Luke Dickens, Daniel Kuhn, and Aldo Faisal. Hierarchical, heterogeneous control of non-linear dynamical systems using reinforcement learning. European Workshop On Reinforcement Learning at ICML, 2012.
-  Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 22–31, 2017.
-  Rakshit Allamraju and Girish Chowdhary. Communication efficient decentralized gaussian process fusion for multi-uas path planning. In 2017 American Control Conference (ACC), pages 4442–4447. IEEE, 2017.
-  S. Arimoto. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Transactions on Information Theory, 18(1):14–20, Jan 1972.
-  Christopher G Atkeson, Andrew W Moore, and Stefan Schaal. Locally weighted learning for control. In Lazy learning, pages 75–113. Springer, 1997.
-  Peter Bellmann, Patrick Thiam, and Friedhelm Schwenker. Multi-classifier-systems: Architectures, algorithms and applications. Computational Intelligence for Pattern Recognition, pages 83–113. Springer, 2018.
-  James C Bezdek and Richard J Hathaway. Convergence of alternating optimization. Neural, Parallel & Scientific Computations, 11(4):351–368, 2003.
-  Richard E. Blahut. Computation of channel capacity and rate-distortion functions. IEEE Transactions on Information Theory, 18(4):460–473, Jul 1972.
-  Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
-  Christian Daniel, Gerhard Neumann, and Jan Peters. Hierarchical relative entropy policy search. In Artificial Intelligence and Statistics, pages 273–281, 2012.
-  Edward Vul, Noah Goodman, Thomas L. Griffiths, and Joshua B. Tenenbaum. One and done? Optimal decisions from very few samples. Cognitive Science, 38(4):599–637, 2014.
-  Tim Genewein, Felix Leibfried, Jordi Grau-Moya, and Daniel Alexander Braun. Bounded rationality, abstraction, and hierarchical decision-making: An information-theoretic optimality principle. Frontiers in Robotics and AI, 2:27, 2015.
-  Samuel J Gershman, Eric J Horvitz, and Joshua B Tenenbaum. Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science, 349(6245):273–278, 2015.
-  Dibya Ghosh, Avi Singh, Aravind Rajeswaran, Vikash Kumar, and Sergey Levine. Divide-and-conquer reinforcement learning. arXiv preprint arXiv:1711.09874, 2017.
-  Sebastian Gottwald and Daniel A. Braun. Bounded rational decision-making from elementary computations that reduce uncertainty. Entropy, 21(4), 2019.
-  Sebastian Gottwald and Daniel A Braun. Systems of bounded rational agents with information-theoretic constraints. Neural computation, 31(2):440–476, 2019.
-  Jordi Grau-Moya, Matthias Krüger, and Daniel A Braun. Non-equilibrium relations for bounded rational decision-making in changing environments. Entropy, 20(1):1, 2017.
-  Jordi Grau-Moya, Felix Leibfried, Tim Genewein, and Daniel A Braun. Planning with information-processing constraints and model uncertainty in markov decision processes. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 475–491. Springer, 2016.
-  Jordi Grau-Moya, Felix Leibfried, and Peter Vrancx. Soft q-learning with mutual-information regularization. International Conference on Learning Representations, 2019.
-  Heinke Hihn, Sebastian Gottwald, and Daniel A Braun. Bounded rational decision-making with adaptive neural network priors. In IAPR Workshop on Artificial Neural Networks in Pattern Recognition, pages 213–225. Springer, 2018.
-  Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991.
-  Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.
-  Ludmila I Kuncheva. Combining pattern classifiers: methods and algorithms. John Wiley & Sons, 2004.
-  Felix Leibfried and Daniel A Braun. A reward-maximizing spiking neuron as a bounded rational decision maker. Neural computation, 27(8):1686–1720, 2015.
-  Felix Leibfried, Jordi Grau-Moya, and Haitham B Ammar. An information-theoretic optimality principle for deep reinforcement learning. Deep Reinforcement Learning Workshop NIPS 2018, 2017.
-  Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
-  Georg Martius, Ralf Der, and Nihat Ay. Information driven self-organization of complex robotic behaviors. PloS one, 8(5):e63400, 2013.
-  Richard D. McKelvey and Thomas R. Palfrey. Quantal response equilibria for normal form games. Games and Economic Behavior, 10(1):6 – 38, 1995.
-  Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
-  Gerhard Neumann, Christian Daniel, Andras Kupcsik, Marc Deisenroth, and Jan Peters. Information-theoretic motor skill learning. In Proceedings of the AAAI Workshop on Intelligent Robotic Systems, 2013.
-  Pedro A. Ortega and Daniel A. Braun. Thermodynamics as a theory of decision-making with information-processing costs. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 469(2153), 2013.
-  Pedro A Ortega, Daniel A Braun, Justin Dyer, Kee-Eung Kim, and Naftali Tishby. Information-theoretic bounded rationality. arXiv preprint arXiv:1512.06789, 2015.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
-  Zhen Peng, Tim Genewein, Felix Leibfried, and Daniel A Braun. An information-theoretic on-line update principle for perception-action coupling. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages 789–796. IEEE, 2017.
-  Jette Randløv, Andrew G Barto, and Michael T Rosenstein. Combining reinforcement learning with a local control algorithm. International Conference on Machine Learning, 2000.
-  Wilson J Rugh and Jeff S Shamma. Research on gain scheduling. Automatica, 36(10):1401–1425, 2000.
-  Sonja Schach, Sebastian Gottwald, and Daniel A Braun. Quantifying motor task performance by bounded rational decision theory. Frontiers in neuroscience, 12, 2018.
-  John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
-  John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. International Conference on Learning Representations, 2015.
-  Herbert A. Simon. A behavioral model of rational choice. The Quarterly Journal of Economics, 69(1):99–118, 1955.
-  Richard S Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in neural information processing systems, pages 1038–1044, 1996.
-  Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
-  Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
-  Naftali Tishby and Daniel Polani. Information theory of decisions and actions. In Perception-Action Cycle: Models, Architectures, and Hardware. Springer, 2011.
-  John Von Neumann and Oskar Morgenstern. Theory of games and economic behavior (commemorative edition). Princeton university press, 2007.
-  Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
-  Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
-  David H. Wolpert. Information Theory – The Bridge Connecting Bounded Rational Game Theory and Statistical Physics, pages 262–290. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006.
-  Junichiro Yoshimoto, Masaya Nishimura, Yoichi Tokita, and Shin Ishii. Acrobot control by learning the switching of multiple controllers. Artificial Life and Robotics, 9(2):67–71, 2005.
-  Seniha Esen Yuksel, Joseph N Wilson, and Paul D Gader. Twenty years of mixture of experts. IEEE transactions on neural networks and learning systems, 23(8):1177–1193, 2012.