Philip H. S. Torr

is this you? claim profile


Professor Oxford University, Chief Scientific Advisor at FiveAI

  • The StarCraft Multi-Agent Challenge

    In the last few years, deep multi-agent reinforcement learning (RL) has become a highly active area of research. A particularly challenging class of problems in this area is partially observable, cooperative, multi-agent learning, in which teams of agents must learn to coordinate their behaviour while conditioning only on their private observations. This is an attractive research area since such problems are relevant to a large number of real-world systems and are also more amenable to evaluation than general-sum problems. Standardised environments such as the ALE and MuJoCo have allowed single-agent RL to move beyond toy domains, such as grid worlds. However, there is no comparable benchmark for cooperative multi-agent RL. As a result, most papers in this field use one-off toy problems, making it difficult to measure real progress. In this paper, we propose the StarCraft Multi-Agent Challenge (SMAC) as a benchmark problem to fill this gap. SMAC is based on the popular real-time strategy game StarCraft II and focuses on micromanagement challenges where each unit is controlled by an independent agent that must act based on local observations. We offer a diverse set of challenge maps and recommendations for best practices in benchmarking and evaluations. We also open-source a deep multi-agent RL learning framework including state-of-the-art algorithms. We believe that SMAC can provide a standard benchmark environment for years to come. Videos of our best agents for several SMAC scenarios are available at:

    02/11/2019 ∙ by Mikayel Samvelyan, et al. ∙ 32 share

    read it

  • Meta Learning Deep Visual Words for Fast Video Object Segmentation

    Meta learning has attracted a lot of attention recently. In this paper, we propose a fast and novel meta learning based method for video object segmentation that quickly adapts to new domains without any fine-tuning. The proposed model performs segmentation by matching pixels to object parts. The model represents object parts using deep visual words, and meta learns them with the objective of minimizing the object segmentation loss. This is however not straightforward as no ground-truth information is available for the object parts. We tackle this problem by iteratively performing unsupervised learning of the deep visual words, followed by supervised learning of the segmentation problem, given the visual words. Our experiments show that the proposed method performs on-par with state-of-the-art methods, while being computationally much more efficient.

    12/04/2018 ∙ by Harkirat Singh Behl, et al. ∙ 26 share

    read it

  • Deeper Interpretability of Deep Networks

    Deep Convolutional Neural Networks (CNNs) have been one of the most influential recent developments in computer vision, particularly for categorization. There is an increasing demand for explainable AI as these systems are deployed in the real world. However, understanding the information represented and processed in CNNs remains in most cases challenging. Within this paper, we explore the use of new information theoretic techniques developed in the field of neuroscience to enable novel understanding of how a CNN represents information. We trained a 10-layer ResNet architecture to identify 2,000 face identities from 26M images generated using a rigorously controlled 3D face rendering model that produced variations of intrinsic (i.e. face morphology, gender, age, expression and ethnicity) and extrinsic factors (i.e. 3D pose, illumination, scale and 2D translation). With our methodology, we demonstrate that unlike human's network overgeneralizes face identities even with extreme changes of face shape, but it is more sensitive to changes of texture. To understand the processing of information underlying these counterintuitive properties, we visualize the features of shape and texture that the network processes to identify faces. Then, we shed a light into the inner workings of the black box and reveal how hidden layers represent these features and whether the representations are invariant to pose. We hope that our methodology will provide an additional valuable tool for interpretability of CNNs.

    11/19/2018 ∙ by Tian Xu, et al. ∙ 14 share

    read it

  • Domain Partitioning Network

    Standard adversarial training involves two agents, namely a generator and a discriminator, playing a mini-max game. However, even if the players converge to an equilibrium, the generator may only recover a part of the target data distribution, in a situation commonly referred to as mode collapse. In this work, we present the Domain Partitioning Network (DoPaNet), a new approach to deal with mode collapse in generative adversarial learning. We employ multiple discriminators, each encouraging the generator to cover a different part of the target distribution. To ensure these parts do not overlap and collapse into the same mode, we add a classifier as a third agent in the game. The classifier decides which discriminator the generator is trained against for each sample. Through experiments on toy examples and real images, we show the merits of DoPaNet in covering the real distribution and its superiority with respect to the competing methods. Besides, we also show that we can control the modes from which samples are generated using DoPaNet.

    02/21/2019 ∙ by Botos Csaba, et al. ∙ 12 share

    read it

  • Real-Time RGB-D Camera Pose Estimation in Novel Scenes using a Relocalisation Cascade

    Camera pose estimation is an important problem in computer vision. Common techniques either match the current image against keyframes with known poses, directly regress the pose, or establish correspondences between keypoints in the image and points in the scene to estimate the pose. In recent years, regression forests have become a popular alternative to establish such correspondences. They achieve accurate results, but have traditionally needed to be trained offline on the target scene, preventing relocalisation in new environments. Recently, we showed how to circumvent this limitation by adapting a pre-trained forest to a new scene on the fly. The adapted forests achieved relocalisation performance that was on par with that of offline forests, and our approach was able to estimate the camera pose in close to real time. In this paper, we present an extension of this work that achieves significantly better relocalisation performance whilst running fully in real time. To achieve this, we make several changes to the original approach: (i) instead of accepting the camera pose hypothesis without question, we make it possible to score the final few hypotheses using a geometric approach and select the most promising; (ii) we chain several instantiations of our relocaliser together in a cascade, allowing us to try faster but less accurate relocalisation first, only falling back to slower, more accurate relocalisation as necessary; and (iii) we tune the parameters of our cascade to achieve effective overall performance. These changes allow us to significantly improve upon the performance our original state-of-the-art method was able to achieve on the well-known 7-Scenes and Stanford 4 Scenes benchmarks. As additional contributions, we present a way of visualising the internal behaviour of our forests and show how to entirely circumvent the need to pre-train a forest on a generic scene.

    10/29/2018 ∙ by Tommaso Cavallari, et al. ∙ 8 share

    read it

  • Learn to Interpret Atari Agents

    Deep Reinforcement Learning (DeepRL) models surpass human-level performance in a multitude of tasks. Standing in stark contrast to the stellar performance is the obscure nature of the learned policies. The direct mapping from states to actions makes it hard to interpret the rationale behind the decision making of agents. In contrast to previous a-posteriori methods of visualising DeepRL policies, we propose an end-to-end trainable framework based on Rainbow, a representative Deep Q-Network (DQN) agent. Our method automatically detects important regions in the input domain, which enables characterization of general strategy and explanation for non-intuitive behaviors. Hence, we call it Region Sensitive Rainbow (RS-Rainbow). RS-Rainbow utilises a simple yet effective mechanism to incorporate innate visualisation ability into the learning model, not only improving the interpretability, but enabling the agent to leverage enhanced state representations for improved performance. Without extra supervision, specialised feature detectors focusing on distinct aspects of gameplay can be learned. Extensive experiments on the challenging platform of Atari 2600 demonstrates the superiority of RS-Rainbow. In particular, our agent achieves state of the art at just 25 parallel training.

    12/29/2018 ∙ by Zhao Yang, et al. ∙ 8 share

    read it

  • 3D Hand Shape and Pose from Images in the Wild

    We present in this work the first end-to-end deep learning based method that predicts both 3D hand shape and pose from RGB images in the wild. Our network consists of the concatenation of a deep convolutional encoder, and a fixed model-based decoder. Given an input image, and optionally 2D joint detections obtained from an independent CNN, the encoder predicts a set of hand and view parameters. The decoder has two components: A pre-computed articulated mesh deformation hand model that generates a 3D mesh from the hand parameters, and a re-projection module controlled by the view parameters that projects the generated hand into the image domain. We show that using the shape and pose prior knowledge encoded in the hand model within a deep learning framework yields state-of-the-art performance in 3D pose prediction from images on standard benchmarks, and produces geometrically valid and plausible 3D reconstructions. Additionally, we show that training with weak supervision in the form of 2D joint annotations on datasets of images in the wild, in conjunction with full supervision in the form of 3D joint annotations on limited available datasets allows for good generalization to 3D shape and pose predictions on images in the wild.

    02/09/2019 ∙ by Adnane Boukhayma, et al. ∙ 8 share

    read it

  • Stable Rank Normalization for Improved Generalization in Neural Networks and GANs

    Exciting new work on the generalization bounds for neural networks (NN) given by Neyshabur et al. , Bartlett et al. depend on two parameter-depenedent quantities: the Lipschitz constant upper-bound and the stable rank (a softer version of the rank operator). This leads to an interesting question of whether controlling these quantities might improve the generalization behaviour of NNs. To this end, we propose stable rank normalization (SRN), a novel, optimal, and computationally efficient weight-normalization scheme which minimizes the stable rank of a linear operator. Surprisingly we find that SRN, inspite of being non-convex problem, can be shown to have a unique optimal solution. Moreover, we show that SRN allows control of the data-dependent empirical Lipschitz constant, which in contrast to the Lipschitz upper-bound, reflects the true behaviour of a model on a given dataset. We provide thorough analyses to show that SRN, when applied to the linear layers of a NN for classification, provides striking improvements-11.3 standard NN along with significant reduction in memorization. When applied to the discriminator of GANs (called SRN-GAN) it improves Inception, FID, and Neural divergence scores on the CIFAR 10/100 and CelebA datasets, while learning mappings with low empirical Lipschitz constants.

    06/11/2019 ∙ by Amartya Sanyal, et al. ∙ 5 share

    read it

  • Fast Online Object Tracking and Segmentation: A Unifying Approach

    In this paper we illustrate how to perform both visual object tracking and semi-supervised video object segmentation, in real-time, with a single simple approach. Our method, dubbed SiamMask, improves the offline training procedure of popular fully-convolutional Siamese approaches for object tracking by augmenting their loss with a binary segmentation task. Once trained, SiamMask solely relies on a single bounding box initialisation and operates online, producing class-agnostic object segmentation masks and rotated bounding boxes at 35 frames per second. Despite its simplicity, versatility and fast speed, our strategy allows us to establish a new state-of-the-art among real-time trackers on VOT-2018, while at the same time demonstrating competitive performance and the best speed for the semi-supervised video object segmentation task on DAVIS-2016 and DAVIS-2017. The project website is qwang/SiamMask.

    12/12/2018 ∙ by Qiang Wang, et al. ∙ 4 share

    read it

  • Hijacking Malaria Simulators with Probabilistic Programming

    Epidemiology simulations have become a fundamental tool in the fight against the epidemics of various infectious diseases like AIDS and malaria. However, the complicated and stochastic nature of these simulators can mean their output is difficult to interpret, which reduces their usefulness to policymakers. In this paper, we introduce an approach that allows one to treat a large class of population-based epidemiology simulators as probabilistic generative models. This is achieved by hijacking the internal random number generator calls, through the use of a universal probabilistic programming system (PPS). In contrast to other methods, our approach can be easily retrofitted to simulators written in popular industrial programming frameworks. We demonstrate that our method can be used for interpretable introspection and inference, thus shedding light on black-box simulators. This reinstates much-needed trust between policymakers and evidence-based methods.

    05/29/2019 ∙ by Bradley Gram-Hansen, et al. ∙ 4 share

    read it

  • SNIP: Single-shot Network Pruning based on Connection Sensitivity

    Pruning large neural networks while maintaining the performance is often highly desirable due to the reduced space and time complexity. In existing methods, pruning is incorporated within an iterative optimization procedure with either heuristically designed pruning schedules or additional hyperparameters, undermining their utility. In this work, we present a new approach that prunes a given network once at initialization. Specifically, we introduce a saliency criterion based on connection sensitivity that identifies structurally important connections in the network for the given task even before training. This eliminates the need for both pretraining as well as the complex pruning schedule while making it robust to architecture variations. After pruning, the sparse network is trained in the standard way. Our method obtains extremely sparse networks with virtually the same accuracy as the reference network on image classification tasks and is broadly applicable to various architectures including convolutional, residual and recurrent networks. Unlike existing methods, our approach enables us to demonstrate that the retained connections are indeed relevant to the given task.

    10/04/2018 ∙ by Namhoon Lee, et al. ∙ 2 share

    read it