Zeynep Akata

is this you? claim profile


Senior Researcher at Max-Planck-Institut für Informatik

  • Visual Rationalizations in Deep Reinforcement Learning for Atari Games

    Due to the capability of deep learning to perform well in high dimensional problems, deep reinforcement learning agents perform well in challenging tasks such as Atari 2600 games. However, clearly explaining why a certain action is taken by the agent can be as important as the decision itself. Deep reinforcement learning models, as other deep learning models, tend to be opaque in their decision-making process. In this work, we propose to make deep reinforcement learning more transparent by visualizing the evidence on which the agent bases its decision. In this work, we emphasize the importance of producing a justification for an observed action, which could be applied to a black-box decision agent.

    02/01/2019 ∙ by Laurens Weitkamp, et al. ∙ 10 share

    read it

  • Combining Generative and Discriminative Models for Hybrid Inference

    A graphical model is a structured representation of the data generating process. The traditional method to reason over random variables is to perform inference in this graphical model. However, in many cases the generating process is only a poor approximation of the much more complex true data generating process, leading to suboptimal estimation. The subtleties of the generative process are however captured in the data itself and we can `learn to infer', that is, learn a direct mapping from observations to explanatory latent variables. In this work we propose a hybrid model that combines graphical inference with a learned inverse model, which we structure as in a graph neural network, while the iterative algorithm as a whole is formulated as a recurrent neural network. By using cross-validation we can automatically balance the amount of work performed by graphical inference versus learned inference. We apply our ideas to the Kalman filter, a Gaussian hidden Markov model for time sequences, and show, among other things, that our model can estimate the trajectory of a noisy chaotic Lorenz Attractor much more accurately than either the learned or graphical inference run in isolation.

    06/06/2019 ∙ by Victor Garcia Satorras, et al. ∙ 8 share

    read it

  • Primal-Dual Wasserstein GAN

    We introduce Primal-Dual Wasserstein GAN, a new learning algorithm for building latent variable models of the data distribution based on the primal and the dual formulations of the optimal transport (OT) problem. We utilize the primal formulation to learn a flexible inference mechanism and to create an optimal approximate coupling between the data distribution and the generative model. In order to learn the generative model, we use the dual formulation and train the decoder adversarially through a critic network that is regularized by the approximate coupling obtained from the primal. Unlike previous methods that violate various properties of the optimal critic, we regularize the norm and the direction of the gradients of the critic function. Our model shares many of the desirable properties of auto-encoding models in terms of mode coverage and latent structure, while avoiding their undesirable averaging properties, e.g. their inability to capture sharp visual features when modeling real images. We compare our algorithm with several other generative modeling techniques that utilize Wasserstein distances on Frechet Inception Distance (FID) and Inception Scores (IS).

    05/24/2018 ∙ by Mevlana Gemici, et al. ∙ 2 share

    read it

  • Textual Explanations for Self-Driving Vehicles

    Deep neural perception and control networks have become key components of self-driving vehicles. User acceptance is likely to benefit from easy-to-interpret textual explanations which allow end-users to understand what triggered a particular behavior. Explanations may be triggered by the neural controller, namely introspective explanations, or informed by the neural controller's output, namely rationalizations. We propose a new approach to introspective explanations which consists of two parts. First, we use a visual (spatial) attention model to train a convolutional network end-to-end from images to the vehicle control commands, i.e., acceleration and change of course. The controller's attention identifies image regions that potentially influence the network's output. Second, we use an attention-based video-to-text model to produce textual explanations of model actions. The attention maps of controller and explanation model are aligned so that explanations are grounded in the parts of the scene that mattered to the controller. We explore two approaches to attention alignment, strong- and weak-alignment. Finally, we explore a version of our model that generates rationalizations, and compare with introspective explanations on the same video segments. We evaluate these models on a novel driving dataset with ground-truth human explanations, the Berkeley DeepDrive eXplanation (BDD-X) dataset. Code is available at https://github.com/JinkyuKimUCB/explainable-deep-driving.

    07/30/2018 ∙ by Jinkyu Kim, et al. ∙ 2 share

    read it

  • Manipulating Attributes of Natural Scenes via Hallucination

    In this study, we explore building a two-stage framework for enabling users to directly manipulate high-level attributes of a natural scene. The key to our approach is a deep generative network which can hallucinate images of a scene as if they were taken at a different season (e.g. during winter), weather condition (e.g. in a cloudy day) or time of the day (e.g. at sunset). Once the scene is hallucinated with the given attributes, the corresponding look is then transferred to the input image while preserving the semantic details intact, giving a photo-realistic manipulation result. As the proposed framework hallucinates what the scene will look like, it does not require any reference style image as commonly utilized in most of the appearance or style transfer approaches. Moreover, it allows to simultaneously manipulate a given scene according to a diverse set of transient attributes within a single model, eliminating the need of training multiple networks per each translation task. Our comprehensive set of qualitative and quantitative results demonstrate the effectiveness of our approach against the competing methods.

    08/22/2018 ∙ by Levent Karacan, et al. ∙ 2 share

    read it

  • Feature Generating Networks for Zero-Shot Learning

    Suffering from the extreme training data imbalance between seen and unseen classes, most of existing state-of-the-art approaches fail to achieve satisfactory results for the challenging generalized zero-shot learning task. To circumvent the need for labeled examples of unseen classes, we propose a novel generative adversarial network (GAN) that synthesizes CNN features conditioned on class-level semantic information, offering a shortcut directly from a semantic descriptor of a class to a class-conditional feature distribution. Our proposed approach, pairing a Wasserstein GAN with a classification loss, is able to generate sufficiently discriminative CNN features to train softmax classifiers or any multimodal embedding method. Our experimental results demonstrate a significant boost in accuracy over the state of the art on five challenging datasets -- CUB, FLO, SUN, AWA and ImageNet -- in both the zero-shot learning and generalized zero-shot learning settings.

    12/04/2017 ∙ by Yongqin Xian, et al. ∙ 0 share

    read it

  • Attentive Explanations: Justifying Decisions and Pointing to the Evidence (Extended Abstract)

    Deep models are the defacto standard in visual decision problems due to their impressive performance on a wide array of visual tasks. On the other hand, their opaqueness has led to a surge of interest in explainable systems. In this work, we emphasize the importance of model explanation in various forms such as visual pointing and textual justification. The lack of data with justification annotations is one of the bottlenecks of generating multimodal explanations. Thus, we propose two large-scale datasets with annotations that visually and textually justify a classification decision for various activities, i.e. ACT-X, and for question answering, i.e. VQA-X. We also introduce a multimodal methodology for generating visual and textual explanations simultaneously. We quantitatively show that training with the textual explanations not only yields better textual justification models, but also models that better localize the evidence that support their decision.

    11/17/2017 ∙ by Dong Huk Park, et al. ∙ 0 share

    read it

  • Grounding Visual Explanations (Extended Abstract)

    Existing models which generate textual explanations enforce task relevance through a discriminative term loss function, but such mechanisms only weakly constrain mentioned object parts to actually be present in the image. In this paper, a new model is proposed for generating explanations by utilizing localized grounding of constituent phrases in generated explanations to ensure image relevance. Specifically, we introduce a phrase-critic model to refine (re-score/re-rank) generated candidate explanations and employ a relative-attribute inspired ranking loss using "flipped" phrases as negative examples for training. At test time, our phrase-critic model takes an image and a candidate explanation as input and outputs a score indicating how well the candidate explanation is grounded in the image.

    11/17/2017 ∙ by Lisa Anne Hendricks, et al. ∙ 0 share

    read it

  • Generative Adversarial Text to Image Synthesis

    Automatic synthesis of realistic images from text would be interesting and useful, but current AI systems are still far from this goal. However, in recent years generic and powerful recurrent neural network architectures have been developed to learn discriminative text feature representations. Meanwhile, deep convolutional generative adversarial networks (GANs) have begun to generate highly compelling images of specific categories, such as faces, album covers, and room interiors. In this work, we develop a novel deep architecture and GAN formulation to effectively bridge these advances in text and image model- ing, translating visual concepts from characters to pixels. We demonstrate the capability of our model to generate plausible images of birds and flowers from detailed text descriptions.

    05/17/2016 ∙ by Scott Reed, et al. ∙ 0 share

    read it

  • Zero-Shot Learning - A Comprehensive Evaluation of the Good, the Bad and the Ugly

    Due to the importance of zero-shot learning, i.e. classifying images where there is a lack of labeled training data, the number of proposed approaches has recently increased steadily. We argue that it is time to take a step back and to analyze the status quo of the area. The purpose of this paper is three-fold. First, given the fact that there is no agreed upon zero-shot learning benchmark, we first define a new benchmark by unifying both the evaluation protocols and data splits of publicly available datasets used for this task. This is an important contribution as published results are often not comparable and sometimes even flawed due to, e.g. pre-training on zero-shot test classes. Moreover, we propose a new zero-shot learning dataset, the Animals with Attributes 2 (AWA2) dataset which we make publicly available both in terms of image features and the images themselves. Second, we compare and analyze a significant number of the state-of-the-art methods in depth, both in the classic zero-shot setting but also in the more realistic generalized zero-shot setting. Finally, we discuss in detail the limitations of the current status of the area which can be taken as a basis for advancing it.

    07/03/2017 ∙ by Yongqin Xian, et al. ∙ 0 share

    read it

  • Channel-Recurrent Variational Autoencoders

    Variational Autoencoder (VAE) is an efficient framework in modeling natural images with probabilistic latent spaces. However, when the input spaces become complex, VAE becomes less effective, potentially due to the oversimplification of its latent space construction. In this paper, we propose to integrate recurrent connections across channels to both inference and generation steps of VAE. Sequentially building up the complexity of high-level features in this way allows us to capture global-to-local and coarse-to-fine structures of the input data spaces. We show that our channel-recurrent VAE improves existing approaches in multiple aspects: (1) it attains lower negative log-likelihood than standard VAE on MNIST; when trained adversarially, (2) it generates face and bird images with substantially higher visual quality than the state-of-the-art VAE-GAN and (3) channel-recurrency allows learning more interpretable representations; finally (4) it achieves competitive classification results on STL-10 in a semi-supervised setup.

    06/12/2017 ∙ by Wenling Shang, et al. ∙ 0 share

    read it