Learning Robust Visual-Semantic Embeddings

03/17/2017
by Yao-Hung Hubert Tsai, et al.

Many existing methods for learning joint embeddings of images and text use only supervised information from paired images and their textual attributes. Taking advantage of the recent success of unsupervised learning in deep neural networks, we propose an end-to-end learning framework that extracts more robust multi-modal representations across domains. The proposed method combines representation learning models (i.e., auto-encoders) with cross-domain learning criteria (i.e., a Maximum Mean Discrepancy loss) to learn joint embeddings for semantic and visual features. A novel unsupervised-data adaptation inference technique is introduced to construct more comprehensive embeddings for both labeled and unlabeled data. We evaluate our method on the Animals with Attributes and Caltech-UCSD Birds 200-2011 datasets across a wide range of applications, including zero- and few-shot image recognition and retrieval, in both inductive and transductive settings. Empirically, we show that our framework improves over the current state of the art on many of the considered tasks.
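To make the cross-domain criterion concrete, below is a minimal sketch of a Maximum Mean Discrepancy (MMD) loss with an RBF kernel, applied to batches of visual and semantic codes. This is an illustrative assumption about how such an alignment term could be computed, not the authors' exact formulation; the function name rbf_mmd, the bandwidth sigma, and the stand-in embeddings are hypothetical.

import torch

def rbf_mmd(x, y, sigma=1.0):
    # Squared MMD between two batches of embeddings under an RBF kernel.
    # x: (n, d) batch, y: (m, d) batch.
    def sq_dists(a, b):
        return torch.cdist(a, b, p=2) ** 2

    k_xx = torch.exp(-sq_dists(x, x) / (2 * sigma ** 2))
    k_yy = torch.exp(-sq_dists(y, y) / (2 * sigma ** 2))
    k_xy = torch.exp(-sq_dists(x, y) / (2 * sigma ** 2))
    return k_xx.mean() + k_yy.mean() - 2 * k_xy.mean()

# Hypothetical usage: align the latent codes of two auto-encoders,
# one over image features and one over semantic attributes.
visual_codes = torch.randn(64, 128)    # stand-in for image encoder outputs
semantic_codes = torch.randn(64, 128)  # stand-in for attribute/text encoder outputs
alignment_loss = rbf_mmd(visual_codes, semantic_codes)

In a full training loop, a term like alignment_loss would be added to the auto-encoder reconstruction losses so that the visual and semantic latent distributions are pulled together in the shared embedding space.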


