Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations

04/11/2019
by   Hao Wu, et al.

We propose Unified Visual-Semantic Embeddings (Unified VSE) for learning a joint space of visual representations and textual semantics. The model unifies the embeddings of concepts at different levels: objects, attributes, relations, and full scenes. We view sentence semantics as a combination of semantic components such as objects and relations, whose embeddings are aligned with corresponding image regions. A contrastive learning approach is proposed to learn this fine-grained alignment effectively from only image-caption pairs. We also present a simple yet effective approach that enforces the coverage of caption embeddings over the semantic components appearing in the sentence. We demonstrate that Unified VSE outperforms baselines on cross-modal retrieval tasks, and that enforcing semantic coverage improves the model's robustness against text-domain adversarial attacks. Moreover, our model enables the use of visual cues to accurately resolve word dependencies in novel sentences.
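To illustrate the kind of contrastive objective used to align image and caption embeddings from matched pairs alone, here is a minimal sketch of the standard bidirectional max-margin ranking loss common in visual-semantic embedding models. This is an illustration of the general technique, not the paper's exact loss: Unified VSE additionally aligns component-level embeddings (objects, attributes, relations) with image regions, which is omitted here. The function name and margin value are my own choices.

```python
import numpy as np

def contrastive_ranking_loss(img_emb, cap_emb, margin=0.2):
    """Bidirectional max-margin ranking loss over a batch of pairs.

    img_emb, cap_emb: (N, D) L2-normalized embeddings; row i of each
    forms a matched image-caption pair, and all other rows in the batch
    serve as negatives (the only supervision is the pairing itself).
    """
    sim = img_emb @ cap_emb.T          # (N, N) cosine similarity matrix
    pos = np.diag(sim)                 # similarities of matched pairs
    # Caption retrieval: each image should score its own caption at
    # least `margin` above every mismatched caption.
    cost_cap = np.maximum(0.0, margin + sim - pos[:, None])
    # Image retrieval: each caption should score its own image at
    # least `margin` above every mismatched image.
    cost_img = np.maximum(0.0, margin + sim - pos[None, :])
    mask = 1.0 - np.eye(sim.shape[0])  # zero out the diagonal (positives)
    return float(((cost_cap + cost_img) * mask).sum() / sim.shape[0])
```

When positive pairs are well separated from all negatives by the margin, the loss is zero; mismatched pairings drive it up, which is what pushes matched image and caption embeddings together in the joint space.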


