Leonid Sigal

Associate Professor in the Department of Computer Science at the University of British Columbia; Senior Research Scientist at Disney Research Pittsburgh (2010-2017); Adjunct Faculty member at Carnegie Mellon University; Postdoctoral Researcher at the University of Toronto (2007-2009); Summer Intern at Intel Corporation (2006).

  • AttentionRNN: A Structured Spatial Attention Mechanism

    Visual attention mechanisms have proven to be integral components of many modern deep neural architectures. They provide an efficient and effective way to utilize visual information selectively, which has been shown to be especially valuable in multi-modal learning tasks. However, all prior attention frameworks lack the ability to explicitly model structural dependencies among attention variables, making it difficult to predict consistent attention masks. In this paper we develop a novel structured spatial attention mechanism which is end-to-end trainable and can be integrated with any feed-forward convolutional neural network. The proposed AttentionRNN layer explicitly enforces structure over the spatial attention variables by sequentially predicting attention values in the spatial mask in a bi-directional raster-scan and inverse raster-scan order. As a result, each attention value depends not only on local image or contextual information, but also on previously predicted attention values. Our experiments show consistent quantitative and qualitative improvements on a variety of recognition tasks and datasets, including image categorization, question answering and image generation.

    05/22/2019 ∙ by Siddhesh Khandelwal, et al.
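
    A minimal PyTorch-style sketch of the core idea may help make the abstract concrete; it is illustrative, not the authors' implementation. Attention values over the spatial mask are predicted one position at a time by an RNN, so each value is conditioned on the values predicted before it. For brevity the sketch uses a single raster-scan pass and a GRU cell; the class name, hidden size and sigmoid gating are assumptions.

        import torch
        import torch.nn as nn

        class StructuredSpatialAttention(nn.Module):
            """Toy sketch: predict a spatial attention mask sequentially, so each
            attention value is conditioned on previously predicted values (single
            raster-scan pass; the paper uses bi-directional scans)."""

            def __init__(self, channels, hidden=128):
                super().__init__()
                # input at each spatial step: local feature + previous attention value
                self.rnn = nn.GRUCell(channels + 1, hidden)
                self.to_attn = nn.Linear(hidden, 1)

            def forward(self, feats):                      # feats: (B, C, H, W)
                B, C, H, W = feats.shape
                seq = feats.flatten(2).permute(2, 0, 1)    # (H*W, B, C), raster order
                h = feats.new_zeros(B, self.rnn.hidden_size)
                prev = feats.new_zeros(B, 1)
                attn_vals = []
                for x in seq:                              # iterate spatial positions
                    h = self.rnn(torch.cat([x, prev], dim=1), h)
                    prev = torch.sigmoid(self.to_attn(h))  # attention value in (0, 1)
                    attn_vals.append(prev)
                attn = torch.stack(attn_vals, dim=2).view(B, 1, H, W)
                return feats * attn, attn                  # attended features + mask

        if __name__ == "__main__":
            layer = StructuredSpatialAttention(channels=64)
            out, mask = layer(torch.randn(2, 64, 8, 8))
            print(out.shape, mask.shape)                   # (2, 64, 8, 8), (2, 1, 8, 8)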

  • Neural Sequential Phrase Grounding (SeqGROUND)

    We propose an end-to-end approach for phrase grounding in images. Unlike prior methods that typically attempt to ground each phrase independently by building an image-text embedding, our architecture formulates grounding of multiple phrases as a sequential and contextual process. Specifically, we encode region proposals and all phrases into two stacks of LSTM cells, along with the phrase-region pairs grounded so far. These LSTM stacks collectively capture context for grounding the next phrase. The resulting architecture, which we call SeqGROUND, supports many-to-many matching by allowing an image region to be matched to multiple phrases and vice versa. We show competitive performance on the Flickr30K benchmark dataset and, through ablation studies, validate the efficacy of sequential grounding as well as individual design choices in our model architecture.

    03/18/2019 ∙ by Pelin Dogan, et al.
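
    A hedged sketch of the sequential grounding loop, assuming pre-computed phrase and region encodings: phrases are grounded one at a time, and an LSTM carries the history of already grounded (phrase, region) pairs as context for the next decision. The names, dimensions and hard argmax choice are simplifications, not the paper's architecture.

        import torch
        import torch.nn as nn

        class SequentialGrounder(nn.Module):
            """Toy sketch of grounding phrases one at a time, with an LSTM carrying
            the history of already-grounded (phrase, region) pairs as context."""

            def __init__(self, dim=256):
                super().__init__()
                self.history = nn.LSTMCell(2 * dim, dim)          # consumes grounded pairs
                self.score = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(),
                                           nn.Linear(dim, 1))

            def forward(self, phrases, regions):
                # phrases: (P, D) phrase encodings; regions: (R, D) region encodings
                h = c = phrases.new_zeros(1, phrases.size(1))
                assignments = []
                for p in phrases:                                  # ground phrases in order
                    ctx = h.expand(regions.size(0), -1)
                    p_rep = p.unsqueeze(0).expand(regions.size(0), -1)
                    logits = self.score(torch.cat([p_rep, regions, ctx], dim=1)).squeeze(1)
                    best = logits.argmax()                         # hard choice for the sketch
                    assignments.append(int(best))
                    pair = torch.cat([p, regions[best]]).unsqueeze(0)
                    h, c = self.history(pair, (h, c))              # update grounding history
                return assignments

        if __name__ == "__main__":
            model = SequentialGrounder(dim=256)
            print(model(torch.randn(3, 256), torch.randn(5, 256)))  # region index per phrase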

  • Probabilistic Video Generation using Holistic Attribute Control

    Videos express highly structured spatio-temporal patterns of visual data. A video can be thought of as being governed by two factors: (i) temporally invariant (e.g., person identity) or slowly varying (e.g., activity) attribute-induced appearance, encoding the persistent content of each frame, and (ii) inter-frame motion or scene dynamics (e.g., encoding the evolution of the person executing the action). Based on this intuition, we propose a generative framework for video generation and future prediction. The proposed framework generates a video (short clip) by decoding samples sequentially drawn from a latent space distribution into full video frames. Variational Autoencoders (VAEs) are used as a means of encoding/decoding frames into/from the latent space, and an RNN as a way to model the dynamics in the latent space. We improve video generation consistency through temporally-conditional sampling, and quality by structuring the latent space with attribute controls, ensuring that attributes can be both inferred and conditioned on during learning/generation. As a result, given attributes and/or the first frame, our model is able to generate diverse but highly consistent sets of video sequences, accounting for the inherent uncertainty in the prediction task. Experimental results on the Chair CAD, Weizmann Human Action, and MIT-Flickr datasets, along with detailed comparison to the state-of-the-art, verify the effectiveness of the framework.

    03/21/2018 ∙ by Jiawei He, et al.
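
    A minimal sketch (illustrative, not the authors' model) of the latent-space rollout described above: an RNN evolves latent codes conditioned on a fixed attribute vector, and a decoder maps each code to a frame. The VAE encoder and the variational training objective are omitted; names and sizes are assumptions.

        import torch
        import torch.nn as nn

        class LatentVideoGenerator(nn.Module):
            """Toy sketch: frames are decoded from latent codes, and an RNN models the
            dynamics of those codes conditioned on a holistic attribute vector."""

            def __init__(self, z_dim=32, attr_dim=10, frame_dim=64 * 64):
                super().__init__()
                self.dynamics = nn.LSTMCell(z_dim + attr_dim, z_dim)   # latent-space RNN
                self.decode = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                            nn.Linear(256, frame_dim), nn.Sigmoid())

            def forward(self, z0, attrs, steps=8):
                # z0: (B, z_dim) latent of (or sampled for) the first frame
                # attrs: (B, attr_dim) attribute controls held fixed over the clip
                h, c = z0, torch.zeros_like(z0)
                frames = [self.decode(z0)]
                for _ in range(steps - 1):
                    h, c = self.dynamics(torch.cat([h, attrs], dim=1), (h, c))
                    frames.append(self.decode(h))                      # next frame from latent
                return torch.stack(frames, dim=1)                      # (B, steps, frame_dim)

        if __name__ == "__main__":
            gen = LatentVideoGenerator()
            clip = gen(torch.randn(2, 32), torch.randn(2, 10))
            print(clip.shape)                                          # (2, 8, 4096)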

  • Visual Reference Resolution using Attention Memory for Visual Dialog

    Visual dialog is the task of answering a series of inter-dependent questions given an input image, and often requires resolving visual references among the questions. This problem is different from visual question answering (VQA), which relies on spatial attention (a.k.a. visual grounding) estimated from an image and question pair. We propose a novel attention mechanism that exploits visual attentions in the past to resolve the current reference in the visual dialog scenario. The proposed model is equipped with an associative attention memory storing a sequence of previous (attention, key) pairs. From this memory, the model retrieves the previous attention, taking into account recency, that is most relevant to the current question, in order to resolve potentially ambiguous references. The model then merges the retrieved attention with a tentative one to obtain the final attention for the current question; specifically, we use dynamic parameter prediction to combine the two attentions conditioned on the question. Through extensive experiments on a new synthetic visual dialog dataset, we show that our model significantly outperforms the state-of-the-art (by about 16 percentage points) in situations where visual reference resolution plays an important role. Moreover, the proposed model achieves superior performance (about 2 percentage points improvement) despite having significantly fewer parameters than the baselines.

    09/23/2017 ∙ by Paul Hongsuck Seo, et al.
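
    A toy sketch of the associative attention memory, assuming pre-computed question embeddings, stored (key, attention) pairs, and a tentative attention map. The question-conditioned gate below stands in for the paper's dynamic parameter prediction and is a deliberate simplification; all names and shapes are illustrative.

        import torch
        import torch.nn.functional as F

        def resolve_attention(question, tentative_attn, mem_keys, mem_attns, gate_net):
            """Toy sketch of an associative attention memory for visual dialog.

            question:       (B, D)        embedding of the current question
            tentative_attn: (B, H, W)     attention estimated from the current question alone
            mem_keys:       (B, T, D)     keys of previous dialog rounds
            mem_attns:      (B, T, H, W)  attention maps stored for those rounds
            gate_net:       module mapping question -> scalar used to merge the two
                            attentions (a stand-in for dynamic parameter prediction)
            """
            # address the memory with the current question
            weights = F.softmax((mem_keys @ question.unsqueeze(2)).squeeze(2), dim=1)  # (B, T)
            retrieved = (weights.unsqueeze(-1).unsqueeze(-1) * mem_attns).sum(dim=1)   # (B, H, W)

            # merge retrieved and tentative attentions, conditioned on the question
            g = torch.sigmoid(gate_net(question)).view(-1, 1, 1)                       # (B, 1, 1)
            return g * retrieved + (1.0 - g) * tentative_attn

        if __name__ == "__main__":
            B, T, D, H, W = 2, 4, 128, 7, 7
            gate = torch.nn.Linear(D, 1)
            out = resolve_attention(torch.randn(B, D), torch.rand(B, H, W),
                                    torch.randn(B, T, D), torch.rand(B, T, H, W), gate)
            print(out.shape)                                                           # (2, 7, 7)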

  • Weakly-supervised Visual Grounding of Phrases with Linguistic Structures

    We propose a weakly-supervised approach that takes image-sentence pairs as input and learns to visually ground (i.e., localize) arbitrary linguistic phrases, in the form of spatial attention masks. Specifically, the model is trained with images and their associated image-level captions, without any explicit region-to-phrase correspondence annotations. To this end, we introduce an end-to-end model which learns visual groundings of phrases with two types of carefully designed loss functions. In addition to the standard discriminative loss, which enforces that attended image regions and phrases are consistently encoded, we propose a novel structural loss which makes use of the parse tree structures induced by the sentences. In particular, we ensure complementarity among the attention masks that correspond to sibling noun phrases, and compositionality of attention masks among the children and parent phrases, as defined by the sentence parse tree. We validate the effectiveness of our approach on the Microsoft COCO and Visual Genome datasets.

    05/03/2017 ∙ by Fanyi Xiao, et al.
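
    The two structural terms lend themselves to a short sketch. Below is an illustrative, simplified version, assuming soft attention masks in [0, 1]: overlap between sibling masks is penalized (complementarity) and the parent mask is pulled toward the union of its children (compositionality). The exact loss formulations in the paper may differ.

        import torch

        def structural_losses(parent_mask, child_masks):
            """Toy sketch of the two structural terms on soft attention masks.

            parent_mask: (H, W)    attention mask of a parent phrase
            child_masks: (K, H, W) masks of its K sibling child phrases
            """
            # complementarity: sibling masks should not attend to the same pixels
            pairwise = child_masks.unsqueeze(0) * child_masks.unsqueeze(1)   # (K, K, H, W)
            overlap = pairwise.sum() - (child_masks * child_masks).sum()     # drop self-terms
            complementarity = overlap / max(child_masks.numel(), 1)

            # compositionality: the parent mask should match the union of its children
            union = child_masks.amax(dim=0)
            compositionality = torch.mean((parent_mask - union) ** 2)

            return complementarity, compositionality

        if __name__ == "__main__":
            comp, compo = structural_losses(torch.rand(7, 7), torch.rand(3, 7, 7))
            print(float(comp), float(compo))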

  • Weakly-Supervised Spatial Context Networks

    We explore the power of spatial context as a self-supervisory signal for learning visual representations. In particular, we propose spatial context networks that learn to predict a representation of one image patch from another image patch within the same image, conditioned on their real-valued relative spatial offset. Unlike auto-encoders, which aim to encode and reconstruct original image patches, our network aims to encode and reconstruct intermediate representations of the spatially offset patches. As such, the network learns a spatially conditioned contextual representation. By testing performance with various patch selection mechanisms, we show that focusing on object-centric patches is important, and that using object proposals as a patch selection mechanism leads to the largest improvement in performance. Further, unlike auto-encoders, context encoders [21], or other forms of unsupervised feature learning, we illustrate that contextual supervision (with pre-trained model initialization) can improve on existing pre-trained model performance. We build our spatial context networks on top of standard VGG_19 and CNN_M architectures and, among other things, show that we can achieve improvements (with no additional explicit supervision) over the original ImageNet pre-trained VGG_19 and CNN_M models in object categorization and detection on VOC2007.

    04/10/2017 ∙ by Zuxuan Wu, et al.
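
    A minimal sketch of the patch-to-patch prediction step, assuming patch features already extracted by a frozen pretrained backbone (e.g. VGG_19); the small MLP, dimensions and names below are illustrative rather than the paper's architecture.

        import torch
        import torch.nn as nn

        class SpatialContextNet(nn.Module):
            """Toy sketch: predict the (frozen) feature representation of one patch
            from another patch's representation plus their real-valued offset."""

            def __init__(self, feat_dim=512):
                super().__init__()
                self.predict = nn.Sequential(
                    nn.Linear(feat_dim + 2, 1024), nn.ReLU(),      # +2 for (dx, dy) offset
                    nn.Linear(1024, feat_dim))

            def forward(self, feat_a, offset):
                # feat_a: (B, feat_dim) features of the source patch (e.g. from a
                # pretrained backbone); offset: (B, 2) relative (dx, dy) displacement
                return self.predict(torch.cat([feat_a, offset], dim=1))

        if __name__ == "__main__":
            net = SpatialContextNet()
            feat_a, feat_b = torch.randn(4, 512), torch.randn(4, 512)   # stand-in features
            offset = torch.randn(4, 2)
            loss = torch.nn.functional.mse_loss(net(feat_a, offset), feat_b)
            print(float(loss))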

  • Semi-Latent GAN: Learning to generate and modify facial images from attributes

    Generating and manipulating human facial images using high-level attribute controls are important and interesting problems. Models proposed in previous work can solve one of these two problems (generation or manipulation), but not both coherently. This paper proposes a novel model that learns how to both generate and modify facial images from high-level semantic attributes. Our key idea is to formulate a Semi-Latent Facial Attribute Space (SL-FAS) to systematically learn the relationship between user-defined and latent attributes, as well as between those attributes and RGB imagery. As part of this newly formulated space, we propose a new model, SL-GAN, which is a specific form of Generative Adversarial Network. Finally, we present an iterative training algorithm for SL-GAN. Experiments on the recent CelebA and CASIA-WebFace datasets validate the effectiveness of our proposed framework. We will also make data, pre-trained models and code available.

    04/07/2017 ∙ by Weidong Yin, et al.
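
    A hedged sketch of the attribute-conditioned generation half of the idea: the generator consumes user-defined attributes alongside a latent code standing in for the latent attributes. The discriminator, the iterative training algorithm and the SL-FAS formulation are omitted; all names and sizes are assumptions.

        import torch
        import torch.nn as nn

        class AttributeConditionedGenerator(nn.Module):
            """Toy sketch: generate an image from user-defined attributes plus a
            latent code, the split that a semi-latent attribute space formalizes."""

            def __init__(self, attr_dim=40, latent_dim=100, img_dim=64 * 64 * 3):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(attr_dim + latent_dim, 512), nn.ReLU(),
                    nn.Linear(512, img_dim), nn.Tanh())

            def forward(self, attrs, z):
                # attrs: (B, attr_dim) user-defined attributes (e.g. "smiling", "male")
                # z:     (B, latent_dim) latent attributes inferred rather than annotated
                return self.net(torch.cat([attrs, z], dim=1))

        if __name__ == "__main__":
            gen = AttributeConditionedGenerator()
            imgs = gen(torch.rand(2, 40), torch.randn(2, 100))
            print(imgs.shape)                               # (2, 12288)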

  • Heterogeneous Knowledge Transfer in Video Emotion Recognition, Attribution and Summarization

    Emotional content is a key element in user-generated videos. However, it is difficult to understand emotions conveyed in such videos due to the complex and unstructured nature of user-generated content and the sparsity of video frames that express emotion. In this paper, for the first time, we study the problem of transferring knowledge from heterogeneous external sources, including image and textual data, to facilitate three related tasks in video emotion understanding: emotion recognition, emotion attribution and emotion-oriented summarization. Specifically, our framework (1) learns a video encoding from an auxiliary emotional image dataset in order to improve supervised video emotion recognition, and (2) transfers knowledge from an auxiliary textual corpus for zero-shot recognition of emotion classes unseen during training. The proposed technique for knowledge transfer facilitates novel applications of emotion attribution and emotion-oriented summarization. A comprehensive set of experiments on multiple datasets demonstrates the effectiveness of our framework.

    11/16/2015 ∙ by Baohan Xu, et al.

  • Learning from Synthetic Data Using a Stacked Multichannel Autoencoder

    Learning from synthetic data has many important and practical applications. One example application is photo-sketch recognition. Using synthetic data is challenging due to the differences in feature distributions between synthetic and real data, a phenomenon we term the synthetic gap. In this paper, we investigate and formalize a general framework, the Stacked Multichannel Autoencoder (SMCAE), that enables bridging the synthetic gap and learning from synthetic data more efficiently. In particular, we show that our SMCAE can not only transform and use synthetic data on the challenging face-sketch recognition task, but can also help simulate real images, which can be used for training classifiers for recognition. Preliminary experiments validate the effectiveness of the framework.

    09/17/2015 ∙ by Xi Zhang, et al.
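
    A minimal single-channel simplification of the idea (not the SMCAE architecture itself): an autoencoder is trained to map features of a synthetic sample onto features of a paired real sample, which is one way to read "bridging the synthetic gap". The names, sizes and the pairing assumption are illustrative.

        import torch
        import torch.nn as nn

        class SyntheticToRealAutoencoder(nn.Module):
            """Toy sketch of bridging the synthetic gap: the autoencoder is trained to
            reconstruct features of a paired real sample from features of a synthetic
            one, rather than to reconstruct its own input."""

            def __init__(self, dim=256, code=64):
                super().__init__()
                self.encode = nn.Sequential(nn.Linear(dim, code), nn.ReLU())
                self.decode = nn.Linear(code, dim)

            def forward(self, synthetic_feat):
                return self.decode(self.encode(synthetic_feat))

        if __name__ == "__main__":
            model = SyntheticToRealAutoencoder()
            synthetic, real = torch.randn(8, 256), torch.randn(8, 256)  # paired stand-ins
            loss = torch.nn.functional.mse_loss(model(synthetic), real)
            loss.backward()                                             # gradients for one step
            print(float(loss))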

  • Learning Language-Visual Embedding for Movie Understanding with Natural-Language

    Learning a joint language-visual embedding has a number of very appealing properties and can result in a variety of practical applications, including natural language image/video annotation and search. In this work, we study three different joint language-visual neural network model architectures. We evaluate our models on the large-scale LSMDC16 movie dataset for two tasks: 1) standard ranking for video annotation and retrieval, and 2) our proposed movie multiple-choice test. This test facilitates automatic evaluation of visual-language models for natural language video annotation based on human activities. In addition to the original Audio Description (AD) captions provided as part of LSMDC16, we collected and will make available (a) manually generated re-phrasings of those captions obtained using Amazon MTurk, and (b) automatically generated human activity elements in "Predicate + Object" (PO) phrases based on "Knowlywood", an activity knowledge mining model. Our best model achieves Recall@10 of 19.2% on a subset of 1000 samples. For the multiple-choice test, our best model achieves an accuracy of 58.11%.

    09/26/2016 ∙ by Atousa Torabi, et al.
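
    A sketch of the standard ranking objective mentioned for task 1), assuming video and sentence embeddings of matching dimension; this is a generic margin-based formulation, not necessarily the exact loss used in the paper, and the margin value is illustrative.

        import torch
        import torch.nn.functional as F

        def ranking_loss(video_emb, text_emb, margin=0.2):
            """Toy sketch of a pairwise ranking objective for a joint language-visual
            embedding: matching video/sentence pairs should score higher than
            mismatched ones by a margin.

            video_emb, text_emb: (B, D), where row i of each is a matching pair.
            """
            v = F.normalize(video_emb, dim=1)
            t = F.normalize(text_emb, dim=1)
            scores = v @ t.t()                                 # (B, B) cosine similarities
            pos = scores.diag().unsqueeze(1)                   # matching-pair scores
            cost_t = (margin + scores - pos).clamp(min=0)      # negatives: wrong sentences
            cost_v = (margin + scores - pos.t()).clamp(min=0)  # negatives: wrong videos
            off_diag = 1.0 - torch.eye(scores.size(0))
            return ((cost_t + cost_v) * off_diag).sum() / scores.size(0)

        if __name__ == "__main__":
            print(float(ranking_loss(torch.randn(4, 300), torch.randn(4, 300))))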

  • Robust Classification by Pre-conditioned LASSO and Transductive Diffusion Component Analysis

    Modern machine learning-based recognition approaches require large-scale datasets with large numbers of labelled training images. However, such datasets are inherently difficult and costly to collect and annotate. Hence there is a great and growing interest in automatic dataset collection methods that can leverage the web, albeit in an inherently unreliable way. Collecting datasets in this way, however, requires robust and efficient ways of detecting and excluding outliers, which are common and prevalent. So far, there has been limited effort in the machine learning community to directly detect outliers for robust classification. Inspired by recent work on Pre-conditioned LASSO, this paper formulates the outlier detection task using Pre-conditioned LASSO and employs unsupervised transductive diffusion component analysis to both integrate the topological structure of the data manifold, from labeled and unlabeled instances, and reduce the feature dimensionality. Synthetic experiments, as well as results on two real-world classification tasks, show that our framework can robustly detect outliers and improve classification.

    11/19/2015 ∙ by Yanwei Fu, et al.
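
    A toy sketch of LASSO-based outlier detection in the spirit described above, using scikit-learn; it omits the pre-conditioning step and the diffusion-component features, models outliers as a sparse additive term on the labels, and all parameter values are illustrative.

        import numpy as np
        from sklearn.linear_model import Lasso

        def detect_outliers(X, y, alpha=0.05):
            """Toy sketch of LASSO-based outlier detection (without pre-conditioning):
            model the labels as y = X @ beta + gamma, where gamma is a sparse
            per-sample outlier term, and flag samples whose gamma coefficient is
            driven away from zero."""
            n = X.shape[0]
            augmented = np.hstack([X, np.eye(n)])       # one extra column per sample
            model = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
            model.fit(augmented, y)
            gamma = model.coef_[X.shape[1]:]            # per-sample outlier coefficients
            return np.abs(gamma) > 1e-6                 # boolean outlier mask

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            X = rng.normal(size=(100, 5))
            y = X @ rng.normal(size=5)
            y[:5] += 10.0                               # corrupt a few labels
            print(np.nonzero(detect_outliers(X, y))[0]) # should include indices 0..4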