David Balduzzi

is this you? claim profile


Staff Research Scientist at DeepMind

  • Stable Opponent Shaping in Differentiable Games

    A growing number of learning methods are actually games which optimise multiple, interdependent objectives in parallel -- from GANs and intrinsic curiosity to multi-agent RL. Opponent shaping is a powerful approach to improve learning dynamics in such games, accounting for the fact that the 'environment' includes agents adapting to one another's updates. Learning with Opponent-Learning Awareness (LOLA) is a recent algorithm which exploits this dynamic response and encourages cooperation in settings like the Iterated Prisoner's Dilemma. Although experimentally successful, we show that LOLA can exhibit 'arrogant' behaviour directly at odds with convergence. In fact, remarkably few algorithms have theoretical guarantees applying across all differentiable games. In this paper we present Stable Opponent Shaping (SOS), a new method that interpolates between LOLA and a stable variant named LookAhead. We prove that LookAhead locally converges and avoids strict saddles in all differentiable games, the strongest results in the field so far. SOS inherits these desirable guarantees, while also shaping the learning of opponents and consistently either matching or outperforming LOLA experimentally.

    11/20/2018 ∙ by Alistair Letcher, et al. ∙ 74 share

    read it

  • Open-ended Learning in Symmetric Zero-sum Games

    Zero-sum games such as chess and poker are, abstractly, functions that evaluate pairs of agents, for example labeling them `winner' and `loser'. If the game is approximately transitive, then self-play generates sequences of agents of increasing strength. However, nontransitive games, such as rock-paper-scissors, can exhibit strategic cycles, and there is no longer a clear objective -- we want agents to increase in strength, but against whom is unclear. In this paper, we introduce a geometric framework for formulating agent objectives in zero-sum games, in order to construct adaptive sequences of objectives that yield open-ended learning. The framework allows us to reason about population performance in nontransitive games, and enables the development of a new algorithm (rectified Nash response, PSRO_rN) that uses game-theoretic niching to construct diverse populations of effective agents, producing a stronger set of agents than existing algorithms. We apply PSRO_rN to two highly nontransitive resource allocation games and find that PSRO_rN consistently outperforms the existing alternatives.

    01/23/2019 ∙ by David Balduzzi, et al. ∙ 16 share

    read it

  • Strongly-Typed Recurrent Neural Networks

    Recurrent neural networks are increasing popular models for sequential learning. Unfortunately, although the most effective RNN architectures are perhaps excessively complicated, extensive searches have not found simpler alternatives. This paper imports ideas from physics and functional programming into RNN design to provide guiding principles. From physics, we introduce type constraints, analogous to the constraints that forbids adding meters to seconds. From functional programming, we require that strongly-typed architectures factorize into stateless learnware and state-dependent firmware, reducing the impact of side-effects. The features learned by strongly-typed nets have a simple semantic interpretation via dynamic average-pooling on one-dimensional convolutions. We also show that strongly-typed gradients are better behaved than in classical architectures, and characterize the representational power of strongly-typed nets. Finally, experiments show that, despite being more constrained, strongly-typed architectures achieve lower training and comparable generalization error to classical architectures.

    02/06/2016 ∙ by David Balduzzi, et al. ∙ 0 share

    read it

  • Deep Online Convex Optimization by Putting Forecaster to Sleep

    Methods from convex optimization such as accelerated gradient descent are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since neural networks are not convex and standard guarantees do not apply. This paper develops the first rigorous link between online convex optimization and error backpropagation on convolutional networks. The first step is to introduce circadian games, a mild generalization of convex games with similar convergence properties. The main result is that error backpropagation on a convolutional network is equivalent to playing out a circadian game. It follows immediately that the waking-regret of players in the game (the units in the neural network) controls the overall rate of convergence of the network. Finally, we explore some implications of the results: (i) we describe the representations learned by a neural network game-theoretically, (ii) propose a learning setting at the level of individual units that can be plugged into deep architectures, and (iii) propose a new approach to adaptive model selection by applying bandit algorithms to choose which players to wake on each round.

    09/06/2015 ∙ by David Balduzzi, et al. ∙ 0 share

    read it

  • Strongly-Typed Agents are Guaranteed to Interact Safely

    As artificial agents proliferate, it is becoming increasingly important to ensure that their interactions with one another are well-behaved. In this paper, we formalize a common-sense notion of when algorithms are well-behaved: an algorithm is safe if it does no harm. Motivated by recent progress in deep learning, we focus on the specific case where agents update their actions according to gradient descent. The first result is that gradient descent converges to a Nash equilibrium in safe games. The paper provides sufficient conditions that guarantee safe interactions. The main contribution is to define strongly-typed agents and show they are guaranteed to interact safely. A series of examples show that strong-typing generalizes certain key features of convexity and is closely related to blind source separation. The analysis introduce a new perspective on classical multilinear games based on tensor decomposition.

    02/24/2017 ∙ by David Balduzzi, et al. ∙ 0 share

    read it

  • The Shattered Gradients Problem: If resnets are the answer, then what is the question?

    A long-standing obstacle to progress in deep learning is the problem of vanishing and exploding gradients. The problem has largely been overcome through the introduction of carefully constructed initializations and batch normalization. Nevertheless, architectures incorporating skip-connections such as resnets perform much better than standard feedforward architectures despite well-chosen initialization and batch normalization. In this paper, we identify the shattered gradients problem. Specifically, we show that the correlation between gradients in standard feedforward networks decays exponentially with depth resulting in gradients that resemble white noise. In contrast, the gradients in architectures with skip-connections are far more resistant to shattering decaying sublinearly. Detailed empirical evidence is presented in support of the analysis, on both fully-connected networks and convnets. Finally, we present a new "looks linear" (LL) initialization that prevents shattering. Preliminary experiments show the new initialization allows to train very deep networks without the addition of skip-connections.

    02/28/2017 ∙ by David Balduzzi, et al. ∙ 0 share

    read it

  • On the information-theoretic structure of distributed measurements

    The internal structure of a measuring device, which depends on what its components are and how they are organized, determines how it categorizes its inputs. This paper presents a geometric approach to studying the internal structure of measurements performed by distributed systems such as probabilistic cellular automata. It constructs the quale, a family of sections of a suitably defined presheaf, whose elements correspond to the measurements performed by all subsystems of a distributed system. Using the quale we quantify (i) the information generated by a measurement; (ii) the extent to which a measurement is context-dependent; and (iii) whether a measurement is decomposable into independent submeasurements, which turns out to be equivalent to context-dependence. Finally, we show that only indecomposable measurements are more informative than the sum of their submeasurements.

    07/06/2011 ∙ by David Balduzzi, et al. ∙ 0 share

    read it

  • Neural Taylor Approximations: Convergence and Exploration in Rectifier Networks

    Modern convolutional networks, incorporating rectifiers and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. Nevertheless, methods from convex optimization such as gradient descent and Adam are widely used as building blocks for deep learning algorithms. This paper provides the first convergence guarantee applicable to modern convnets. The guarantee matches a lower bound for convex nonsmooth functions. The key technical tool is the neural Taylor approximation -- a straightforward application of Taylor expansions to neural networks -- and the associated Taylor loss. Experiments on a range of optimizers, layers, and tasks provide evidence that the analysis accurately captures the dynamics of neural optimization. The second half of the paper applies the Taylor approximation to isolate the main difficulty in training rectifier nets: that gradients are shattered. We investigate the hypothesis that, by exploring the space of activation configurations more thoroughly, adaptive optimizers such as RMSProp and Adam are able to converge to better solutions.

    11/07/2016 ∙ by David Balduzzi, et al. ∙ 0 share

    read it

  • Deep Online Convex Optimization with Gated Games

    Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.

    04/07/2016 ∙ by David Balduzzi, et al. ∙ 0 share

    read it

  • Compliance-Aware Bandits

    Motivated by clinical trials, we study bandits with observable non-compliance. At each step, the learner chooses an arm, after, instead of observing only the reward, it also observes the action that took place. We show that such noncompliance can be helpful or hurtful to the learner in general. Unfortunately, naively incorporating compliance information into bandit algorithms loses guarantees on sublinear regret. We present hybrid algorithms that maintain regret bounds up to a multiplicative factor and can incorporate compliance information. Simulations based on real data from the International Stoke Trial show the practical potential of these algorithms.

    02/09/2016 ∙ by Nicolás Della Penna, et al. ∙ 0 share

    read it

  • Cortical prediction markets

    We investigate cortical learning from the perspective of mechanism design. First, we show that discretizing standard models of neurons and synaptic plasticity leads to rational agents maximizing simple scoring rules. Second, our main result is that the scoring rules are proper, implying that neurons faithfully encode expected utilities in their synaptic weights and encode high-scoring outcomes in their spikes. Third, with this foundation in hand, we propose a biologically plausible mechanism whereby neurons backpropagate incentives which allows them to optimize their usefulness to the rest of cortex. Finally, experiments show that networks that backpropagate incentives can learn simple tasks.

    01/07/2014 ∙ by David Balduzzi, et al. ∙ 0 share

    read it