
An Information-Theoretic Optimality Principle for Deep Reinforcement Learning
In this paper, we methodologically address the problem of cumulative reward overestimation in deep reinforcement learning. We generalise notions from information-theoretic bounded rationality to handle high-dimensional state spaces efficiently. The resultant algorithm encompasses a wide range of learning outcomes that can be demonstrated by tuning a Lagrange multiplier that intrinsically penalises rewards. We show that deep Q-networks arise as a special case of our proposed approach. We introduce a novel scheduling scheme for bounded-rational behaviour that ensures sample efficiency and robustness. In experiments on Atari games, we show that our algorithm outperforms other deep reinforcement learning algorithms (e.g., deep and double deep Q-networks) in terms of both gameplay performance and sample complexity.
08/06/2017 ∙ by Felix Leibfried, et al.
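A minimal tabular sketch of the kind of backup this abstract describes, assuming a KL penalty against a reference policy rho controlled by a Lagrange multiplier beta: as beta grows, the soft maximum approaches the hard max of the standard deep Q-network target. All names (beta, rho, gamma, lr) are illustrative assumptions, not the paper's code.

```python
import numpy as np

def soft_max_value(q_values, rho, beta):
    """Free-energy value: (1/beta) * log sum_a rho(a) * exp(beta * q(a))."""
    # Subtract the max before exponentiating for numerical stability.
    shifted = beta * (q_values - q_values.max())
    return q_values.max() + np.log(np.dot(rho, np.exp(shifted))) / beta

def soft_q_update(Q, s, a, r, s_next, rho, beta, gamma=0.99, lr=0.1):
    """One temporal-difference step toward the soft Bellman target."""
    target = r + gamma * soft_max_value(Q[s_next], rho, beta)
    Q[s, a] += lr * (target - Q[s, a])
    return Q

# With a uniform reference policy and large beta, the backup recovers the
# standard hard-max Q-learning target, i.e. the deep Q-network special case.
Q = np.zeros((5, 2))                      # 5 states, 2 actions
rho = np.ones(2) / 2                      # uniform reference policy
Q = soft_q_update(Q, s=0, a=1, r=1.0, s_next=3, rho=rho, beta=100.0)
```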

Planning with Information-Processing Constraints and Model Uncertainty in Markov Decision Processes
Information-theoretic principles for learning and acting have been proposed to solve particular classes of Markov Decision Problems. Mathematically, such approaches are governed by a variational free energy principle and allow solving MDP planning problems with information-processing constraints expressed in terms of a Kullback-Leibler divergence with respect to a reference distribution. Here we consider a generalization of such MDP planners by taking model uncertainty into account. As model uncertainty can also be formalized as an information-processing constraint, we can derive a unified solution from a single generalized variational principle. We provide a generalized value iteration scheme together with a convergence proof. As limit cases, this generalized scheme includes standard value iteration with a known model, Bayesian MDP planning, and robust planning. We demonstrate the benefits of this approach in a grid world simulation.
04/07/2016 ∙ by Jordi Grau-Moya, et al.
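A toy sketch of a generalized value iteration scheme of the kind the abstract outlines: a single free-energy backup with two inverse temperatures, alpha for the action (information-processing) constraint and beta for model uncertainty. In the limits, beta -> 0 recovers Bayesian model averaging, beta < 0 yields robust (pessimistic) planning, and alpha -> inf approaches standard value iteration. The names and the tiny two-model MDP below are assumptions, not the paper's exact operator.

```python
import numpy as np

def free_energy(weights, values, inv_temp):
    """(1/inv_temp) * log sum_i w_i * exp(inv_temp * v_i), numerically stable."""
    shift = values.max() if inv_temp > 0 else values.min()
    return shift + np.log(np.dot(weights, np.exp(inv_temp * (values - shift)))) / inv_temp

def generalized_backup(V, P_models, R, w_models, pi0, alpha, beta, gamma=0.9):
    """One sweep of the generalized free-energy value iteration."""
    n_states, n_actions = R.shape
    V_new = np.empty(n_states)
    for s in range(n_states):
        q = np.empty(n_actions)
        for a in range(n_actions):
            # Expected return under each candidate model of the dynamics.
            returns = np.array([R[s, a] + gamma * P[s, a] @ V for P in P_models])
            # Free energy over models: optimistic (beta > 0) or robust (beta < 0).
            q[a] = free_energy(w_models, returns, beta)
        # Free energy over actions against the prior policy pi0.
        V_new[s] = free_energy(pi0, q, alpha)
    return V_new

rng = np.random.default_rng(0)
P1 = rng.dirichlet(np.ones(2), size=(2, 2))   # model 1: P[s, a] over next states
P2 = rng.dirichlet(np.ones(2), size=(2, 2))   # model 2
R = rng.normal(size=(2, 2))
V = np.zeros(2)
for _ in range(50):
    V = generalized_backup(V, [P1, P2], R, np.array([0.5, 0.5]),
                           np.ones(2) / 2, alpha=5.0, beta=-1.0)  # robust flavor
```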

Adaptive information-theoretic bounded rational decision-making with parametric priors
Deviations from rational decision-making due to limited computational resources have been studied in the field of bounded rationality, originally proposed by Herbert Simon. There have been a number of different approaches to model bounded rationality, ranging from optimality principles to heuristics. Here we take an information-theoretic approach to bounded rationality, where information-processing costs are measured by the relative entropy between a posterior decision strategy and a given fixed prior strategy. In the case of multiple environments, it can be shown that there is an optimal prior rendering the bounded rationality problem equivalent to the rate distortion problem for lossy compression in information theory. Accordingly, the optimal prior and posterior strategies can be computed by the well-known Blahut-Arimoto algorithm, which requires the computation of partition sums over all possible outcomes and therefore cannot be applied straightforwardly to continuous problems. Here we derive a sampling-based alternative update rule for the adaptation of prior behaviors of decision-makers, and we show convergence to the optimal prior predicted by rate distortion theory. Importantly, the update rule avoids typically infeasible operations such as the computation of partition sums. We provide a proof of concept in simulations with discrete action and environment domains. This approach is not only interesting as a generic computational method, but might also provide a more realistic model of human decision-making processes occurring on a fast and a slow time scale.
11/05/2015 ∙ by Jordi Grau-Moya, et al.
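For reference, a compact sketch of the Blahut-Arimoto alternation the abstract builds on: given utilities U(s, a), alternate between the bounded-rational posterior pi(a|s) proportional to p(a) * exp(beta * U(s, a)) and the optimal prior p(a) = sum_s p(s) * pi(a|s). The paper's contribution is a sampling-based update that avoids these explicit partition sums; this dense version is only the baseline it improves on, and the variable names are assumptions.

```python
import numpy as np

def blahut_arimoto(U, p_env, beta, n_iters=100):
    """U: (n_envs, n_actions) utilities; p_env: distribution over environments."""
    n_envs, n_actions = U.shape
    prior = np.ones(n_actions) / n_actions          # initial prior over actions
    for _ in range(n_iters):
        # Posterior per environment: softmax of utilities tilted by the prior.
        logits = beta * U + np.log(prior)[None, :]
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        posterior = np.exp(logits)
        posterior /= posterior.sum(axis=1, keepdims=True)
        # Optimal prior: marginal of the posterior over environments.
        prior = p_env @ posterior
    return prior, posterior

# Example: two environments that each prefer a different action.
U = np.array([[1.0, 0.0], [0.0, 1.0]])
prior, posterior = blahut_arimoto(U, p_env=np.array([0.5, 0.5]), beta=2.0)
```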

Bounded Rational Decision-Making in Changing Environments
A perfectly rational decision-maker chooses the best action with the highest utility gain from a set of possible actions. The optimality principles that describe such decision processes do not take into account the computational costs of finding the optimal action. Bounded rational decision-making addresses this problem by explicitly trading off information-processing costs and expected utility. Interestingly, a similar trade-off between energy and entropy arises when describing changes in thermodynamic systems. This similarity has recently been used to describe bounded rational agents. Crucially, this framework assumes that the environment does not change while the decision-maker is computing the optimal policy. When this requirement is not fulfilled, the decision-maker will suffer inefficiencies in utility that arise because the current policy is optimal for an environment in the past. Here we borrow concepts from non-equilibrium thermodynamics to quantify these inefficiencies and illustrate their relationship with computational resources in simulations.
12/24/2013 ∙ by Jordi Grau-Moya, et al.
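A minimal sketch of the free-energy trade-off referred to above, assuming the standard bounded-rational setup: the decision-maker maximizes expected utility minus (1/beta) times the KL divergence to a prior p0, whose optimum is a Gibbs/softmax distribution with beta acting as an inverse temperature for computational resources. Names and the toy utilities are assumptions.

```python
import numpy as np

def bounded_rational_policy(utilities, p0, beta):
    """Equilibrium policy: p(a) proportional to p0(a) * exp(beta * U(a))."""
    logits = beta * utilities + np.log(p0)
    logits -= logits.max()                 # numerical stability
    policy = np.exp(logits)
    return policy / policy.sum()

# If the environment changes after the policy was computed, the agent keeps
# acting with a policy optimal for the *old* utilities; the resulting utility
# shortfall is the kind of inefficiency the paper quantifies thermodynamically.
U_old = np.array([2.0, 1.0, 0.0])
U_new = np.array([0.0, 1.0, 2.0])          # environment has changed
p0 = np.ones(3) / 3
pi = bounded_rational_policy(U_old, p0, beta=5.0)
shortfall = U_new.max() - pi @ U_new       # expected utility lost
```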

A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function
We propose a novel Bayesian approach to solve stochastic optimization problems that involve finding extrema of noisy, nonlinear functions. Previous work has focused on representing possible functions explicitly, which leads to a two-step procedure of first, doing inference over the function space and second, finding the extrema of these functions. Here we skip the representation step and directly model the distribution over extrema. To this end, we devise a nonparametric conjugate prior based on a kernel regressor. The resulting posterior distribution directly captures the uncertainty over the maximum of the unknown function. We illustrate the effectiveness of our model by optimizing a noisy, high-dimensional, non-convex objective function.
06/09/2012 ∙ by Pedro A. Ortega, et al.
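A hedged sketch in the spirit of this abstract, not the paper's exact posterior: given noisy evaluations (x_i, y_i) of an unknown function, form a kernel-regression estimate and treat its soft-maximum as an unnormalized density over the location of the maximum. The Gaussian kernel, the softmax reading, and all names are assumptions for illustration.

```python
import numpy as np

def nw_estimate(x_query, xs, ys, bandwidth=0.2):
    """Nadaraya-Watson kernel regression estimate at each query point."""
    w = np.exp(-0.5 * ((x_query[:, None] - xs[None, :]) / bandwidth) ** 2)
    return (w @ ys) / np.maximum(w.sum(axis=1), 1e-12)

def posterior_over_argmax(grid, xs, ys, beta=5.0):
    """Softmax of the regression estimate as a density over the maximizer."""
    f_hat = nw_estimate(grid, xs, ys)
    logits = beta * (f_hat - f_hat.max())   # stable softmax
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
xs = rng.uniform(-2, 2, size=30)
ys = -xs**2 + rng.normal(scale=0.3, size=30)   # noisy peak at x = 0
grid = np.linspace(-2, 2, 201)
p_max = posterior_over_argmax(grid, xs, ys)     # concentrates near x = 0
```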

Balancing Two-Player Stochastic Games with Soft Q-Learning
Within the context of video games, the notion of perfectly rational agents can be undesirable, as it leads to uninteresting situations where humans face tough adversarial decision makers. Current frameworks for stochastic games and reinforcement learning prohibit tuneable strategies, as they seek optimal performance. In this paper, we enable such tuneable behaviour by generalising soft Q-learning to stochastic games, where more than one agent interacts strategically. We contribute both theoretically and empirically. On the theory side, we show that games with soft Q-learning exhibit a unique value and generalise team games and zero-sum games far beyond these two extremes to cover a continuous spectrum of gaming behaviour. Experimentally, we show how tuning agents' constraints affects performance and demonstrate, through a neural network architecture, how to reliably balance games with high-dimensional representations.
02/09/2018 ∙ by Jordi Grau-Moya, et al.
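A sketch of the tuneable two-player value the abstract gestures at: nested soft maxima over each agent's actions, with inverse temperatures beta1 and beta2 controlling how rational each agent is. beta2 -> +inf recovers a team game (both maximize), beta2 -> -inf a zero-sum game (the opponent minimizes), and intermediate values interpolate. Names and the uniform reference distribution are assumptions.

```python
import numpy as np

def soft_max(values, beta):
    """(1/beta) * log mean(exp(beta * v)), stable for beta of either sign."""
    shift = values.max() if beta > 0 else values.min()
    return shift + np.log(np.mean(np.exp(beta * (values - shift)))) / beta

def two_player_soft_value(Q_s, beta1, beta2):
    """Q_s: (n_actions1, n_actions2) joint Q-values at one state."""
    # Inner soft-max over the second agent's actions, per first-agent action.
    inner = np.array([soft_max(Q_s[a1], beta2) for a1 in range(Q_s.shape[0])])
    # Outer soft-max over the first agent's actions.
    return soft_max(inner, beta1)

Q_s = np.array([[1.0, -1.0], [0.0, 0.5]])
v_team = two_player_soft_value(Q_s, beta1=50.0, beta2=50.0)   # ~ max-max
v_zero = two_player_soft_value(Q_s, beta1=50.0, beta2=-50.0)  # ~ max-min
```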

Disentangled Skill Embeddings for Reinforcement Learning
We propose a novel framework for multi-task reinforcement learning (MTRL). Using a variational inference formulation, we learn policies that generalize across both changing dynamics and goals. The resulting policies are parametrized by shared parameters that allow for transfer between different dynamics and goal conditions, and by task-specific latent-space embeddings that allow for specialization to particular tasks. We show how the latent spaces enable generalization to unseen dynamics and goal conditions. Additionally, policies equipped with such embeddings serve as a space of skills (or options) for hierarchical reinforcement learning. Since we can change task dynamics and goals independently, we name our framework Disentangled Skill Embeddings (DSE).
06/21/2019 ∙ by Janith C. Petangoda, et al.
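A schematic sketch of the factorization the abstract describes: one shared policy network reused across tasks, specialized by a low-dimensional latent embedding z that encodes the task's dynamics and goal, so that swapping z transfers the shared skill knowledge to a new task. The shapes, names, and the simple concatenation architecture are assumptions for illustration, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, EMBED_DIM, HIDDEN, ACTION_DIM = 8, 2, 32, 3

# Shared parameters (reused by every task).
W1 = rng.normal(scale=0.1, size=(HIDDEN, STATE_DIM + EMBED_DIM))
W2 = rng.normal(scale=0.1, size=(ACTION_DIM, HIDDEN))

# Task-specific latent embeddings (one per dynamics/goal combination).
task_embeddings = {"task_A": rng.normal(size=EMBED_DIM),
                   "task_B": rng.normal(size=EMBED_DIM)}

def policy(state, task_id):
    """Mean action from shared weights conditioned on the task embedding."""
    x = np.concatenate([state, task_embeddings[task_id]])
    h = np.tanh(W1 @ x)
    return W2 @ h

action = policy(rng.normal(size=STATE_DIM), "task_A")
```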

A Unified Bellman Optimality Principle Combining Reward Maximization and Empowerment
Empowerment is an information-theoretic method that can be used to intrinsically motivate learning agents. It attempts to maximize an agent's control over the environment by encouraging visits to states with a large number of reachable next states. Empowered learning has been shown to lead to complex behaviors, without requiring an explicit reward signal. In this paper, we investigate the use of empowerment in the presence of an extrinsic reward signal. We hypothesize that empowerment can guide reinforcement learning (RL) agents to find good early behavioral solutions by encouraging highly empowered states. We propose a unified Bellman optimality principle for empowered reward maximization. Our empowered reward maximization approach generalizes both Bellman's optimality principle and recent information-theoretical extensions to it. We prove uniqueness of the empowered values and show convergence to the optimal solution. We then apply this idea to develop off-policy actor-critic RL algorithms for high-dimensional continuous domains. We experimentally validate our methods in robotics domains (MuJoCo). Our methods demonstrate improved initial and competitive final performance compared to model-free state-of-the-art techniques.
07/26/2019 ∙ by Felix Leibfried, et al.
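A tabular sketch of the idea in this abstract: augment the Bellman backup with an intrinsic empowerment term, the channel capacity between actions and next states, weighted by a coefficient beta, so that beta = 0 recovers the standard Bellman optimality backup. The Blahut-Arimoto capacity loop and all names are illustrative assumptions, not the paper's exact operator.

```python
import numpy as np

def empowerment(P_sa, n_iters=50):
    """Channel capacity of the action -> next-state channel via Blahut-Arimoto."""
    n_actions = P_sa.shape[0]
    w = np.ones(n_actions) / n_actions           # source distribution over actions
    for _ in range(n_iters):
        marginal = w @ P_sa                       # induced p(s')
        # KL(P(.|a) || marginal) per action; epsilons guard zero probabilities.
        kl = np.sum(P_sa * np.log(P_sa / (marginal + 1e-12) + 1e-12), axis=1)
        w = w * np.exp(kl)
        w /= w.sum()
    return float(w @ kl)                          # capacity (nats), post-convergence

def empowered_backup(V, P, R, beta, gamma=0.99):
    """One sweep: V(s) = max_a [R(s,a) + beta * empowerment(s) + gamma * E V(s')]."""
    n_states = P.shape[0]
    V_new = np.empty(n_states)
    for s in range(n_states):
        bonus = beta * empowerment(P[s])          # P[s]: (n_actions, n_states)
        V_new[s] = np.max(R[s] + bonus + gamma * (P[s] @ V))
    return V_new
```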

Mutual-Information Regularization in Markov Decision Processes and Actor-Critic Learning
Cumulative entropy regularization introduces a regulatory signal to the reinforcement learning (RL) problem that encourages policies with high-entropy actions, which is equivalent to enforcing small deviations from a uniform reference marginal policy. This has been shown to improve exploration and robustness, and it tackles the value overestimation problem. It also leads to a significant performance increase in tabular and high-dimensional settings, as demonstrated via algorithms such as soft Q-learning (SQL) and soft actor-critic (SAC). Cumulative entropy regularization has been extended to optimize over the reference marginal policy instead of keeping it fixed, yielding a regularization that minimizes the mutual information between states and actions. While this was initially proposed for Markov Decision Processes (MDPs) in tabular settings, it was recently shown that a similar principle leads to significant improvements over vanilla SQL in RL for high-dimensional domains with discrete actions and function approximators. Here, we follow the motivation of mutual-information regularization from an inference perspective and theoretically analyze the corresponding Bellman operator. Inspired by this Bellman operator, we devise a novel mutual-information regularized actor-critic learning (MIRACLE) algorithm for continuous action spaces that optimizes over the reference marginal policy. We empirically validate MIRACLE in the MuJoCo robotics simulator, where we demonstrate that it can compete with contemporary RL methods. Most notably, it can improve over the model-free state-of-the-art SAC algorithm, which implicitly assumes a fixed reference policy.
09/11/2019 ∙ by Felix Leibfried, et al.
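A tabular sketch of mutual-information regularization as the abstract describes it: the soft Q-learning backup, but with the fixed reference policy replaced by a learned marginal rho(a) that is itself updated toward the state-average of the current policy, so the penalty approximates the mutual information I(S; A). The names and the simple averaging update are assumptions, not the MIRACLE algorithm itself.

```python
import numpy as np

def mi_regularized_step(Q, rho, p_state, beta):
    """Return the policy, updated marginal, and soft state values."""
    # Policy: softmax of Q tilted by the current marginal (prior) rho.
    logits = beta * Q + np.log(rho)[None, :]
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    pi = np.exp(logits)
    pi /= pi.sum(axis=1, keepdims=True)
    # Marginal update: average the policy over the state distribution.
    rho_new = p_state @ pi
    # Soft value per state: expected Q minus (1/beta) * KL(pi || rho_new).
    V = np.sum(pi * (Q - np.log(pi / rho_new[None, :]) / beta), axis=1)
    return pi, rho_new, V

n_states, n_actions = 4, 3
Q = np.random.default_rng(0).normal(size=(n_states, n_actions))
rho = np.ones(n_actions) / n_actions
pi, rho, V = mi_regularized_step(Q, rho, p_state=np.ones(n_states) / n_states,
                                 beta=2.0)
```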