
Robust temporal difference learning for critical domains
We present a new Q-function operator for temporal difference (TD) learning methods that explicitly encodes robustness against significant rare events (SREs) in critical domains. The operator, which we call the κ-operator, allows learning a safe policy in a model-based fashion without actually observing the SRE. We introduce single- and multi-agent robust TD methods using the κ-operator. We prove convergence of the operator to the optimal safe Q-function with respect to the model using the theory of Generalized Markov Decision Processes. In addition, we prove convergence to the optimal Q-function of the original MDP given that the probability of SREs vanishes. Empirical evaluations demonstrate the superior performance of κ-based TD methods both in the early learning phase and in the final converged stage. In addition, we show robustness of the proposed method to small model errors, as well as its applicability in a multi-agent context.
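The abstract does not spell out the operator's exact form, but its role in a TD update can be sketched. The Python below is a hypothetical illustration only, assuming the κ-operator interpolates (with weight κ) between the standard greedy backup and a model-based worst case over SRE outcomes; the function names and the specific blend are assumptions, not the paper's definition.

```python
import numpy as np

def kappa_backup(q_next, sre_values, kappa):
    """Hypothetical robust backup: blend the greedy TD target with a
    model-based worst case over significant-rare-event (SRE) outcomes.
    sre_values come from the model, so the SRE never has to be observed."""
    return (1.0 - kappa) * np.max(q_next) + kappa * np.min(sre_values)

def robust_td_update(q, s, a, r, q_next, sre_values,
                     alpha=0.1, gamma=0.9, kappa=0.2):
    """One tabular TD step using the robust backup as the target."""
    target = r + gamma * kappa_backup(q_next, sre_values, kappa)
    q[s, a] += alpha * (target - q[s, a])
    return q
```

With κ = 0 this reduces to the ordinary Q-learning target; larger κ trades expected return for safety against the modelled rare events.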
01/23/2019 ∙ by Richard Klima, et al.

Computing Approximate Equilibria in Sequential Adversarial Games by Exploitability Descent
In this paper, we present exploitability descent, a new algorithm to compute approximate equilibria in two-player zero-sum extensive-form games with imperfect information, by direct policy optimization against worst-case opponents. We prove that when following this optimization, the exploitability of a player's strategy converges asymptotically to zero; hence, when both players employ this optimization, the joint policies converge to a Nash equilibrium. Unlike fictitious play (XFP) and counterfactual regret minimization (CFR), our convergence result pertains to the policies being optimized rather than the average policies. Our experiments demonstrate convergence rates comparable to XFP and CFR in four benchmark games in the tabular case. Using function approximation, we find that our algorithm outperforms the tabular version in two of the games, which, to the best of our knowledge, is the first such result in imperfect-information games among this class of algorithms.
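A minimal tabular sketch of the idea, on rock-paper-scissors rather than an extensive-form game: each player takes a softmax policy-gradient step on its payoff against the opponent's current best response, and exploitability is tracked directly. The step sizes and iteration count are illustrative assumptions, not the paper's settings.

```python
import numpy as np

# Rock-paper-scissors payoff for the row player (zero-sum, game value 0).
A = np.array([[0., -1., 1.],
              [1., 0., -1.],
              [-1., 1., 0.]])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def exploitability(x, y):
    # total best-response gain of both players over the game value
    return np.max(A @ y) + np.max(-(x @ A))

logits_x = np.array([1.0, 0.0, 0.0])   # start away from equilibrium
logits_y = np.array([0.0, 1.0, 0.0])
best, lr = np.inf, 0.05
for _ in range(2000):
    x, y = softmax(logits_x), softmax(logits_y)
    best = min(best, exploitability(x, y))
    j = np.argmin(x @ A)                 # column best response to x
    i = np.argmax(A @ y)                 # row best response to y
    gx = A[:, j]                         # row payoff gradient vs. that BR
    gy = -A[i, :]                        # column payoff gradient vs. that BR
    logits_x += lr * x * (gx - x @ gx)   # softmax policy-gradient steps
    logits_y += lr * y * (gy - y @ gy)
```

Exploitability starts near 0.73 for these initial logits and descends toward the zero-exploitability uniform strategies.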
03/13/2019 ∙ by Edward Lockhart, et al.

Neural Replicator Dynamics
In multi-agent learning, agents interact in inherently non-stationary environments due to their concurrent policy updates. It is, therefore, paramount to develop and analyze algorithms that learn effectively despite these non-stationarities. A number of works have successfully conducted this analysis under the lens of evolutionary game theory (EGT), wherein a population of individuals interact and evolve based on biologically inspired operators. These studies have mainly focused on establishing connections to value-iteration-based approaches in stateless or tabular games. We extend this line of inquiry to formally establish links between EGT and policy gradient (PG) methods, which have been extensively applied in single- and multi-agent learning. We pinpoint weaknesses of the commonly used softmax PG algorithm in adversarial and non-stationary settings and contrast PG's behavior to that predicted by replicator dynamics (RD), a central model in EGT. We consequently provide theoretical results that establish links between EGT and PG methods, then derive Neural Replicator Dynamics (NeuRD), a parameterized version of RD that constitutes a novel method with several advantages. First, as NeuRD reduces to the well-studied no-regret Hedge algorithm in the tabular setting, it inherits no-regret guarantees that enable convergence to equilibria in games. Second, NeuRD is shown to be more adaptive to non-stationarity, in comparison to PG, when learning in canonical games and imperfect-information benchmarks including Poker. Third, modifying any PG-based algorithm to use the NeuRD update rule is straightforward and incurs no added computational cost. Finally, while single-agent learning is not the main focus of the paper, we verify empirically that NeuRD is competitive in these settings with a recent baseline algorithm.
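The tabular contrast between softmax PG and the NeuRD update can be made concrete: PG scales an action's logit update by that action's current probability, while NeuRD drops that factor (recovering the Hedge update on logits). A sketch with illustrative numbers:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def pg_logit_step(logits, q, lr=1.0):
    """Tabular softmax policy gradient: each logit's update is scaled
    by that action's current probability."""
    pi = softmax(logits)
    return logits + lr * pi * (q - pi @ q)

def neurd_logit_step(logits, q, lr=1.0):
    """NeuRD drops the probability scaling; in the tabular case this
    is the Hedge update on logits."""
    pi = softmax(logits)
    return logits + lr * (q - pi @ q)

# An action driven to near-zero probability barely moves under PG,
# but NeuRD can still revive it when the environment shifts.
logits = np.array([5.0, -5.0])     # action 1 almost never played
q = np.array([0.0, 1.0])           # ...but it is now the better action
pg_delta = pg_logit_step(logits, q)[1] - logits[1]
neurd_delta = neurd_logit_step(logits, q)[1] - logits[1]
```

This vanishing-gradient effect at low probabilities is exactly the adaptivity gap in non-stationary settings that the abstract describes.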
06/01/2019 ∙ by Shayegan Omidshafiei, et al.

Actor-Critic Policy Optimization in Partially Observable Multi-Agent Environments
Optimization of parameterized policies for reinforcement learning (RL) is an important and challenging problem in artificial intelligence. Among the most common approaches are algorithms based on gradient ascent of a score function representing discounted return. In this paper, we examine the role of these policy gradient and actor-critic algorithms in partially observable multi-agent environments. We show several candidate policy update rules and relate them to a foundation of regret minimization and multi-agent learning techniques for the one-shot and tabular cases, leading to previously unknown convergence guarantees. We apply our method to model-free multi-agent reinforcement learning in adversarial sequential decision problems (zero-sum imperfect-information games), using RL-style function approximation. We evaluate on commonly used benchmark Poker domains, showing performance against fixed policies and empirical convergence to approximate Nash equilibria in self-play with rates similar to or better than a baseline model-free algorithm for zero-sum games, without any domain-specific state space reductions.
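As a toy stand-in (a single-state bandit, far simpler than the paper's partially observable multi-agent setting), a tabular actor-critic with a per-action value critic and a softmax policy-gradient actor might look like the following; all names and learning rates are illustrative assumptions.

```python
import numpy as np

class TabularActorCritic:
    """Minimal single-state actor-critic sketch: the critic learns
    per-action values from sampled rewards, and the actor takes a
    softmax policy-gradient step on the critic's advantage."""
    def __init__(self, n_actions, lr_actor=0.1, lr_critic=0.2):
        self.logits = np.zeros(n_actions)
        self.q = np.zeros(n_actions)
        self.lr_a, self.lr_c = lr_actor, lr_critic

    def policy(self):
        e = np.exp(self.logits - self.logits.max())
        return e / e.sum()

    def step(self, reward_fn, rng):
        pi = self.policy()
        a = rng.choice(len(pi), p=pi)
        r = reward_fn(a)
        self.q[a] += self.lr_c * (r - self.q[a])   # critic update
        adv = self.q - pi @ self.q                 # advantage estimate
        self.logits += self.lr_a * pi * adv        # actor update
        return a, r

rng = np.random.default_rng(0)
agent = TabularActorCritic(2)
for _ in range(2000):
    agent.step(lambda a: float(a == 1), rng)  # hypothetical reward: action 1 pays 1
```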
10/21/2018 ∙ by Sriram Srinivasan, et al.

Fast Convergence for Object Detection by Learning how to Combine Error Functions
In this paper, we introduce an innovative method to improve the convergence speed and accuracy of object detection neural networks. Our approach, CONVERGE-FAST-AUXNET, is based on employing multiple, dependent loss metrics and weighting them optimally using an online-trained auxiliary network. Experiments are performed in the well-known RoboCup@Work challenge environment. A fully convolutional segmentation network is trained on detecting objects' pick-up points. We empirically obtain an approximate measure for the rate of success of a robotic pick-up operation based on the accuracy of the object detection network. Our experiments show that adding an optimally weighted Euclidean distance loss to a network trained on the commonly used Intersection over Union (IoU) metric reduces the convergence time by 42.48%, and the pick-up rate is improved by 39.90%. Compared to other methods, the improvement in pick-up rate is 24.5%.
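The shape of the combined objective can be sketched as a convex combination of the dependent loss terms; in the paper the weights come from the online-trained auxiliary network, whereas here they are free parameters for illustration (function names and values are assumptions, not from the paper).

```python
import numpy as np

def combine_losses(loss_terms, weight_logits):
    """Hypothetical combined objective: a convex combination of
    dependent loss terms (e.g. an IoU loss and a Euclidean distance
    loss), with weights normalized via softmax so they stay on the
    simplex while being trained."""
    w = np.exp(weight_logits - np.max(weight_logits))
    w = w / w.sum()
    return float(w @ loss_terms), w

total, w = combine_losses(np.array([0.8, 0.4]),   # [iou_loss, euclidean_loss]
                          np.array([0.0, 0.0]))   # equal weighting to start
```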
08/13/2018 ∙ by Benjamin Schnieders, et al.

Value-Decomposition Networks for Cooperative Multi-Agent Learning
We study the problem of cooperative multi-agent reinforcement learning with a single joint reward signal. This class of learning problems is difficult because of the often large combined action and observation spaces. In the fully centralized and decentralized approaches, we find the problem of spurious rewards and a phenomenon we call the "lazy agent" problem, which arises due to partial observability. We address these problems by training individual agents with a novel value-decomposition network architecture, which learns to decompose the team value function into agent-wise value functions. We perform an experimental evaluation across a range of partially observable multi-agent domains and show that learning such value decompositions leads to superior results, in particular when combined with weight sharing, role information and information channels.
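The core decomposition is simple to state in code: the team Q-value is the sum of per-agent Q-values, so a single TD error on the joint reward backs up into every agent's own table (or, in the paper, network). A tabular sketch with illustrative names:

```python
import numpy as np

def team_q(per_agent_q, joint_action):
    """VDN's additive decomposition: the team value is the sum of
    each agent's own Q-value for its own action."""
    return sum(q[a] for q, a in zip(per_agent_q, joint_action))

def vdn_td_update(per_agent_q, joint_action, team_reward,
                  next_per_agent_q, alpha=0.1, gamma=0.9):
    """TD learning on the single joint reward: because the team Q is
    a sum, the one team TD error backs up into every agent's table."""
    target = team_reward + gamma * sum(np.max(q) for q in next_per_agent_q)
    td = target - team_q(per_agent_q, joint_action)
    for q, a in zip(per_agent_q, joint_action):
        q[a] += alpha * td
    return td

q1, q2 = np.zeros(2), np.zeros(2)
vdn_td_update([q1, q2], (0, 1), team_reward=1.0, next_per_agent_q=[q1, q2])
```

A convenient consequence of the additive form: greedy decentralized execution stays consistent with the joint greedy action, since the argmax of a sum of per-agent terms decomposes into per-agent argmaxes.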
06/16/2017 ∙ by Peter Sunehag, et al.

Symmetric Decomposition of Asymmetric Games
We introduce new theoretical insights into two-population asymmetric games allowing for an elegant symmetric decomposition into two single-population symmetric games. Specifically, we show how an asymmetric bimatrix game (A,B) can be decomposed into its symmetric counterparts by envisioning and investigating the payoff tables (A and B) that constitute the asymmetric game as two independent, single-population, symmetric games. We reveal several surprising formal relationships between an asymmetric two-population game and its symmetric single-population counterparts, which facilitate a convenient analysis of the original asymmetric game due to the dimensionality reduction of the decomposition. The main finding reveals that if (x,y) is a Nash equilibrium of an asymmetric game (A,B), then y is a Nash equilibrium of the symmetric counterpart game determined by payoff table A, and x is a Nash equilibrium of the symmetric counterpart game determined by payoff table B. The reverse also holds: combinations of Nash equilibria of the counterpart games form Nash equilibria of the asymmetric game. We illustrate how these formal relationships aid in identifying and analysing the Nash structure of asymmetric games by examining the evolutionary dynamics of the simpler counterpart games in several canonical examples.
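The main finding can be checked numerically on a small example. Battle of the Sexes below is a standard asymmetric game chosen for illustration (not taken from the paper): at its mixed equilibrium (x,y), the claim is that y is a symmetric equilibrium of the counterpart game with table A, and x of the counterpart game with table B.

```python
import numpy as np

# Battle of the Sexes: row payoff table A, column payoff table B.
A = np.array([[2., 0.],
              [0., 1.]])
B = np.array([[1., 0.],
              [0., 2.]])

# Known mixed Nash equilibrium (x, y) of the asymmetric game (A, B).
x = np.array([2/3, 1/3])
y = np.array([1/3, 2/3])

def is_symmetric_ne(s, M, tol=1e-9):
    """s is a Nash equilibrium of the single-population symmetric game
    with payoff table M iff no pure strategy earns more than s does
    against s."""
    return bool(np.all(M @ s <= s @ M @ s + tol))
```

Both checks pass: A @ y and B @ x are constant vectors (2/3, 2/3), so every pure strategy earns exactly the equilibrium payoff, which is the indifference condition of a fully mixed symmetric equilibrium.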
11/14/2017 ∙ by Karl Tuyls, et al.

Lenient Multi-Agent Deep Reinforcement Learning
A significant amount of research in recent years has been dedicated towards single-agent deep reinforcement learning. Much of the success of deep reinforcement learning can be attributed to the use of experience replay memories within which state transitions are stored. Function approximation methods such as convolutional neural networks (referred to as deep Q-Networks, or DQNs, in this context) can subsequently be trained by sampling the stored transitions. However, care is required when using experience replay memories within multi-agent systems, as stored transitions can become outdated because agents update their respective policies in parallel [1]. In this work we apply leniency [2] to multi-agent deep reinforcement learning (MA-DRL), acting as a control mechanism to determine which sampled state transitions are allowed to update the DQN. The resulting Lenient-DQN (LDQN) is evaluated using variations of the Coordinated Multi-Agent Object Transportation Problem (CMOTP) outlined by Busoniu et al. [3]. The LDQN significantly outperforms the existing Hysteretic-DQN (HDQN) [4] within environments that yield stochastic rewards. Based on results from experiments conducted using vanilla and double Q-learning versions of the lenient and hysteretic algorithms, we advocate a hybrid approach where learners initially use vanilla Q-learning before transitioning to double Q-learning upon converging on a cooperative joint policy.
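The leniency gate can be sketched in tabular form (the LDQN applies the analogous test to sampled replay transitions): positive TD errors always update, while negative ones are forgiven with a probability given by the current leniency of the state-action pair, which decays with visits. Names and the decay schedule are illustrative assumptions.

```python
import numpy as np

def lenient_update(q, key, td_error, leniency, rng,
                   alpha=0.1, decay=0.995):
    """Sketch of a lenient tabular update: apply positive TD errors
    always; forgive (skip) negative ones with probability equal to
    the current leniency for this state-action pair, then decay that
    leniency so agents grow less forgiving over time."""
    if td_error >= 0 or rng.random() >= leniency[key]:
        q[key] += alpha * td_error
    leniency[key] *= decay   # hypothetical per-visit decay schedule
```

Early on (high leniency), agents ignore low returns caused by teammates still exploring; as leniency decays, they converge toward ordinary Q-learning.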
07/14/2017 ∙ by Gregory Palmer, et al.

SA-IGA: A Multi-agent Reinforcement Learning Method Towards Socially Optimal Outcomes
In multi-agent environments, the capability of learning is important for an agent to behave appropriately in the face of unknown opponents and a dynamic environment. From the system designer's perspective, it is desirable if the agents can learn to coordinate towards socially optimal outcomes, while also avoiding being exploited by selfish opponents. To this end, we propose a novel gradient-ascent-based algorithm (SA-IGA) which augments the basic gradient-ascent algorithm by incorporating social awareness into the policy update process. We theoretically analyze the learning dynamics of SA-IGA using dynamical systems theory, and SA-IGA is shown to have linear dynamics for a wide range of games, including symmetric games. The learning dynamics of two representative games (the prisoner's dilemma game and the coordination game) are analyzed in detail. Based on the idea of SA-IGA, we further propose a practical multi-agent learning algorithm, called SA-PGA, based on the Q-learning update rule. Simulation results show that the SA-PGA agent can achieve higher social welfare than the previous social-optimality-oriented Conditional Joint Action Learner (CJAL) and is also robust against individually rational opponents by reaching Nash equilibrium solutions.
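The socially-aware update can be illustrated on the prisoner's dilemma with standard payoffs (R, S, T, P = 3, 0, 5, 1); these numbers and the fixed blend weight below are illustrative assumptions, since SA-IGA adapts the social-awareness weight online.

```python
import numpy as np

# Prisoner's dilemma with cooperation probabilities p (self) and q (peer).
def selfish_grad(p, q):
    return -q - 1.0              # d u_self / d p: defection dominates

def social_grad(p, q):
    return 3.0 - 2.0 * q         # d (u_self + u_peer) / d p: cooperation helps

def sa_iga_step(p, q, w, lr=0.05):
    """Socially-aware gradient ascent: ascend a blend of the selfish
    payoff gradient and the social-welfare gradient. The blend weight
    w adapts online in SA-IGA; it is held fixed here for illustration."""
    g = (1.0 - w) * selfish_grad(p, q) + w * social_grad(p, q)
    return float(np.clip(p + lr * g, 0.0, 1.0))

p = q = 0.5
for _ in range(200):
    p = q = sa_iga_step(p, q, w=0.9)   # two symmetric socially-aware agents
```

With a high social weight the symmetric dynamics drive cooperation probabilities to 1; with w = 0 the same update recovers plain IGA and converges to mutual defection.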
03/08/2018 ∙ by Chengwei Zhang, et al.

A multi-agent reinforcement learning model of common-pool resource appropriation
Humanity faces numerous problems of common-pool resource appropriation. This class of multi-agent social dilemma includes the problems of ensuring sustainable use of fresh water, common fisheries, grazing pastures, and irrigation systems. Abstract models of common-pool resource appropriation based on non-cooperative game theory predict that self-interested agents will generally fail to find socially positive equilibria, a phenomenon called the tragedy of the commons. However, in reality, human societies are sometimes able to discover and implement stable cooperative solutions. Decades of behavioral game theory research have sought to uncover aspects of human behavior that make this possible. Most of that work was based on laboratory experiments where participants only make a single choice: how much to appropriate. Recognizing the importance of spatial and temporal resource dynamics, a recent trend has been toward experiments in more complex, real-time, video-game-like environments. However, standard methods of non-cooperative game theory can no longer be used to generate predictions for this case. Here we show that deep reinforcement learning can be used instead. To that end, we study the emergent behavior of groups of independently learning agents in a partially observed Markov game modeling common-pool resource appropriation. Our experiments highlight the importance of trial-and-error learning in common-pool resource appropriation and shed light on the relationship between exclusion, sustainability, and inequality.
07/20/2017 ∙ by Julien Perolat, et al.

The Mechanics of n-Player Differentiable Games
The cornerstone underpinning deep learning is the guarantee that gradient descent on an objective converges to local minima. Unfortunately, this guarantee fails in settings, such as generative adversarial nets, where there are multiple interacting losses. The behavior of gradient-based methods in games is not well understood, and is becoming increasingly important as adversarial and multi-objective architectures proliferate. In this paper, we develop new techniques to understand and control the dynamics in general games. The key result is to decompose the second-order dynamics into two components. The first is related to potential games, which reduce to gradient descent on an implicit function; the second relates to Hamiltonian games, a new class of games that obey a conservation law, akin to conservation laws in classical mechanical systems. The decomposition motivates Symplectic Gradient Adjustment (SGA), a new algorithm for finding stable fixed points in general games. Basic experiments show SGA is competitive with recently proposed algorithms for finding local Nash equilibria in GANs, whilst at the same time being applicable to, and having guarantees in, much more general games.
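The Hamiltonian case and the SGA correction can be shown on the classic bilinear game with losses f1(x, y) = xy and f2(x, y) = -xy: simultaneous gradient descent spirals away from the equilibrium at the origin, while adding the antisymmetric adjustment makes it converge. Step sizes here are illustrative choices.

```python
import numpy as np

def simultaneous_grad(z):
    """Simultaneous gradient (df1/dx, df2/dy) for f1 = x*y, f2 = -x*y."""
    x, y = z
    return np.array([y, -x])

def sga_step(z, lr=0.1, lam=1.0):
    """Symplectic Gradient Adjustment: add the antisymmetric part of
    the game Jacobian, transposed, times the simultaneous gradient."""
    xi = simultaneous_grad(z)
    H = np.array([[0., 1.],      # Jacobian of xi (constant for this game)
                  [-1., 0.]])
    A = 0.5 * (H - H.T)          # antisymmetric (Hamiltonian) part
    return z - lr * (xi + lam * A.T @ xi)

z_sga = np.array([1.0, 1.0])
z_vanilla = np.array([1.0, 1.0])
for _ in range(100):
    z_sga = sga_step(z_sga)
    z_vanilla = z_vanilla - 0.1 * simultaneous_grad(z_vanilla)
```

In this purely Hamiltonian game the full Jacobian equals its antisymmetric part, so the adjustment turns the conserved rotation into a contraction toward the fixed point.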
02/15/2018 ∙ by David Balduzzi, et al.
Karl Tuyls
Staff Research Scientist at Google DeepMind since 2017; Professor of Computer Science at the University of Liverpool since 2013; Visiting Senior Research Fellow at King's College London from 2012 to 2014.