Learning Visually-Grounded Semantics from Contrastive Adversarial Samples

We study the problem of grounding distributional representations of texts in the visual domain, namely visual-semantic embeddings (VSE for short). Beginning with an insightful adversarial attack on VSE embeddings, we show the limitations of current frameworks and image-text datasets (e.g., MS-COCO) both quantitatively and qualitatively. The large gap between the number of possible constitutions of real-world semantics and the size of parallel data, to a large extent, restricts the model's ability to establish the link between textual semantics and visual concepts. We alleviate this problem by augmenting the MS-COCO image captioning dataset with textual contrastive adversarial samples. These samples are synthesized using linguistic rules and the WordNet knowledge base. The construction procedure is both syntax- and semantics-aware. The samples force the model to ground learned embeddings to concrete concepts within the image. This simple but powerful technique brings a noticeable improvement over the baselines on a diverse set of downstream tasks, in addition to defending against known-type adversarial attacks. We release the code at https://github.com/ExplorerFreda/VSE-C.



1 Introduction

The visual grounding of language plays an indispensable role in our daily lives. We use language to name, refer to, and describe objects, their properties, and, more generally, visual concepts. Distributional semantics (e.g., global word embeddings [Pennington et al.2014]) based on large-scale corpora has shown great success in modeling the functionality and correlation of words in the natural language domain. This further contributes to the success of numerous natural language processing (NLP) tasks such as language modeling [Cheng et al.2016, Inan et al.2017], sentiment analysis [Cheng et al.2016, Kumar et al.2016], and reading comprehension [Cheng et al.2016, Chen et al.2016, Shen et al.2017]. However, effective and efficient grounding of distributional embeddings remains challenging. Being ignorant of the corresponding visual concepts, purely textual embeddings demonstrate inferior performance when incorporated with visual inputs. Typical tasks include image/video captioning, multi-modal retrieval/understanding, and visual reasoning, some of which are studied extensively in this paper.

Visual concepts and their links with textual semantics, as a cognitive alignment, provide rich supervision to learning systems. Introduced by Kiros et al. (2014), Visual-Semantic Embedding (VSE) aims at building a bridge between natural language and the underlying visual world by jointly optimizing and aligning the embedding spaces of images and their descriptive texts (captions). Nevertheless, even for large-scale datasets such as MS-COCO [Lin et al.2014], the number of image-caption pairs is far smaller than the number of possible constitutions of real-world semantics, making the dataset inevitably sparse and biased.

To reveal this, we begin by constructing textual adversarial samples to attack the state-of-the-art system VSE++ [Faghri et al.2017]. Specifically, we study the composition of sentences from two aspects: (1) content words, including nouns and numerals, and (2) prepositions indicating spatial relations (e.g., in, on, above, below). As shown in Figure 1, we manipulate the original caption to construct hard negative captions with similar structure but completely contradictory semantics. We found that the models are easily confused, suffering a noticeable drop in confidence or even making wrong predictions in the caption retrieval task.

Figure 1: An overview of our textual contrastive adversarial samples. For each caption, we have three paradigms to generate contrastive adversarial samples, i.e., noun, numeral and relation. For a given image, we expect the model to distinguish the real captions from the generated adversarial ones.

We propose VSE-C, which enforces the learning of correlation and correspondence between textual semantics and visual concepts by providing contrastive adversarial samples during the training procedure, combined with intra-pair hard negative sample mining. Rather than merely defending against adversarial attacks, we focus on studying the limitations of current visual-semantic datasets and the transferability of the learned embeddings. To bridge the large gap between the number of parallel image-caption pairs and the expressiveness of natural language, we augment the data by employing a set of heuristic rules to generate large sets of contrastive negative captions, as demonstrated in Figure 1. The candidates are selectively used for training via an intra-pair hard-example mining technique. VSE-C alleviates the bias of the dataset and provides rich and effective samples on par with the original image captions. This strengthens the link between text and visual concepts by requiring models to detect a mismatch at the level of precise concepts.

VSE-C learns discriminative and visually-grounded word embeddings on the MS-COCO dataset [Lin et al.2014]. It is extensively compared with existing works through rich experiments and analyses. Most importantly, we explore the transferability of the learned embeddings in several real-world applications, both qualitatively and quantitatively, including image-to-text retrieval and bidirectional word-to-concept retrieval. Furthermore, VSE-C demonstrates a general framework for augmenting textual inputs while preserving semantic consistency. The introduction of human priors and knowledge bases alleviates the sparsity and non-contiguity of language. We release our code and data at https://github.com/ExplorerFreda/VSE-C.

2 Related works

Joint embeddings

Joint embedding is a common technique for a wide range of tasks incorporating multiple domains, including audio-video embeddings for unsupervised representation learning [Ngiam et al.2011], shape-image embeddings [Li et al.2015] for shape inference, bilingual word embeddings for machine translation [Zou et al.2013], human pose-image embeddings for pose inference [Li2011], image-text embeddings for visual description [Reed et al.2016], and global representation learning from multiple domains [Castrejon et al.2016]. These embeddings map multiple domains into a joint vector space which describes the semantic relations between inputs (e.g., distance, correlation).

We focus on the visual-semantic embedding [Mao et al.2016, Kiros et al.2014, Faghri et al.2017], learning word embeddings with visually-grounded semantics. Examples of related applications include image caption retrieval and generation [Kiros et al.2014, Karpathy and Fei-Fei2015], and visual question-answering [Malinowski et al.2015].

Image-to-text translation

Canonical Correlation Analysis (CCA) [Hotelling1936] is a statistical method that linearly projects two views into a common space to maximize their correlation. Andrew et al. (2013) propose a deep learning framework that extends CCA so that it can learn nonlinear projections and scales better to relatively large datasets.

In state-of-the-art frameworks, pairwise ranking is often adopted to learn a distance metric [Socher et al.2014, Niu et al.2017, Nam et al.2017]. Frome et al. (2013) propose a cross-modal feature embedding framework that uses a CNN and Skip-Gram [Mikolov et al.2013] to extract representations for images and texts respectively; an objective is then applied to ensure that the distance between a matched image-text pair is smaller than that between mismatched pairs. A similar framework by Kiros et al. (2014) uses a Gated Recurrent Unit (GRU) as the sentence encoder. Wang et al. (2016) use a bidirectional loss function with structure-preserving constraints. An attention mechanism on both image and caption is used by Nam et al. (2017), whose model estimates the similarity between images and texts by sequentially focusing on subsets of image regions and words that share semantics. Huang et al. (2017) utilize a multi-modal context-modulated attention mechanism to compute the similarity between an image and a caption. Faghri et al. (2017) propose a loss that penalizes the hard negatives, i.e., the closest mismatched pairs, instead of averaging the individual violations across all negatives as in Kiros et al. (2014).

Adversarial attack in text domain

Adversarial attacks have recently drawn significant attention in the deep learning community. They span multiple domains, including image classification [Nguyen et al.2015], image segmentation and object detection [Xie et al.2017], textual reading comprehension [Jia and Liang2017], and deep reinforcement learning [Kos and Song2017].

In this paper, we present textual adversarial attacks on image-to-text translation systems such as image captioning frameworks. While focusing on the problem of learning visually-grounded semantics, the adversarial attack brings new solutions for bridging the gap between limited training data and the numerous constitutions of natural language. With extensive experiments on the effects of the adversarial samples, we reach the conclusion that current visual-semantic embeddings are "insensitive" to the underlying semantics. The proposed VSE-C shows advantages across multiple visual-semantic tasks.

3 Method

3.1 Preliminaries

Word embeddings

We manually split the embedding of each word into two parts: distributional embeddings and visually-grounded embeddings. We use GloVe [Pennington et al.2014], pre-trained on large-scale corpora without supervision, as the distributional embeddings. We focus on the visually-grounded embeddings of words. The embeddings are optimized using the visual-semantic embedding (VSE) technique.

Visual-semantic embeddings

VSE optimizes and aligns the latent spaces of the visual and textual domains. Parallel data are typically obtained from image captioning datasets such as Flickr30K [Young et al.2014] or MS-COCO [Lin et al.2014]. The training set contains image-caption pairs $\{(i_n, c_n)\}$. Typically, all captions $c_m$ and images $i_m$ with $m \neq n$ form the negative samples for a specific pair $(i_n, c_n)$.

Following the notations of Kiros et al. (2014), domain-specific encoders are first employed to extract latent features of images and captions. We use ResNet-152 [He et al.2016] as the visual-domain encoder and a GRU as the text-domain sentence encoder, both of which are effective for VSE. The features are projected into a joint latent space with a linear transformation. A hinge loss with margin $\alpha$ is employed to optimize the alignment:

$$\ell(i, c) = \sum_{c'} [\alpha - s(i, c) + s(i, c')]_+ + \sum_{i'} [\alpha - s(i, c) + s(i', c)]_+,$$

where $[x]_+ = \max(x, 0)$, and $s(i, c)$ measures the similarity between the projected image embedding and caption embedding. The summations are taken over all image-caption pairs within a sampled batch.
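As a concrete illustration, the sum-over-negatives hinge objective can be sketched over a batch similarity matrix. This is a toy NumPy version with our own variable names, assuming the matched pairs sit on the diagonal of the matrix:

```python
import numpy as np

def vse_hinge_loss(S, margin=0.2):
    """Sum-over-negatives VSE hinge loss for one batch.

    S: (N, N) similarity matrix with S[n, m] = s(i_n, c_m);
    the diagonal holds the matched (positive) pairs.
    """
    pos = np.diag(S)                                    # s(i_n, c_n)
    cost_c = np.maximum(0, margin - pos[:, None] + S)   # contrast captions c'
    cost_i = np.maximum(0, margin - pos[None, :] + S)   # contrast images i'
    # a pair is not its own negative: zero out the diagonal terms
    np.fill_diagonal(cost_c, 0)
    np.fill_diagonal(cost_i, 0)
    return cost_c.sum() + cost_i.sum()
```

When every positive pair beats all negatives by at least the margin, the loss vanishes, which is the intended fixed point of the alignment.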

3.2 Generating contrastive adversarial samples

Class Original Caption Contrastive Adversarial Example
Noun A person feeding a cat with a banana. A person feeding a dog with a banana.
Numeral A person feeding a cat with a banana. A person feeding five cats with a banana.
Relation-1 A person feeding a cat with a banana. A cat feeding a person with a banana.
Relation-2 A person feeding a cat with a banana. A person feeding a cat in a banana.
Table 1: Examples of contrastive adversarial samples generated with our heuristic rules and knowledge from WordNet. The samples fall into four types: noun replacement, numeral replacement, relation shuffling, and relation replacement.

Our contrastive adversarial samples can be split into three classes: noun, numeral and relation. Each class of samples is generated separately.


We extract a list of heads [Zwicky1985] of noun phrases in the MS-COCO dataset and label those with frequency larger than 200 as frequent heads. In addition, since images usually reflect concrete concepts better than abstract ones, we compute the concreteness of words following Turney et al. (2011), and only consider heads whose concreteness exceeds a threshold. Only frequent concrete heads may be replaced, and only by other frequent concrete heads with different meanings, to form contrastive adversarial samples.

While replacing, we utilize the hypernymy/hyponymy relations in WordNet [Miller1995] to ensure that the original noun and its replacement are semantically different. Only words without hypernymy or hyponymy relations to the original can be used as replacements for adversarial sample generation. For example, "animal" is a hypernym of "cat". Therefore, "A person feeding an animal with a banana" is not a valid contrastive adversarial caption for an image captioned "A person feeding a cat with a banana."
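The filtering rule can be sketched as follows. The tiny hypernym table and helper names here are our own illustrative stand-ins; the actual pipeline queries WordNet's hypernym/hyponym closures instead:

```python
# Toy hypernym table standing in for WordNet (an assumption for
# illustration): each noun maps to the set of its hypernyms.
HYPERNYMS = {
    "cat": {"feline", "animal"},
    "dog": {"canine", "animal"},
    "animal": set(),
}

def related(a, b):
    """True if a and b stand in a hypernymy/hyponymy relation."""
    return b in HYPERNYMS.get(a, set()) or a in HYPERNYMS.get(b, set())

def noun_replacements(noun, frequent_concrete_heads):
    """Candidate replacements: frequent concrete heads that are neither
    the noun itself nor one of its hypernyms/hyponyms."""
    return [h for h in frequent_concrete_heads
            if h != noun and not related(noun, h)]
```

With this rule, "animal" is rejected as a replacement for "cat" (hypernym), while "dog" is accepted (semantically disjoint).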


For each caption, we detect numerals and replace them with other numerals indicating different quantities to form contrastive adversarial samples. Note that “a” and “an” are treated as “one” here, though they are (indefinite) articles instead of numerals. Meanwhile, we singularize or pluralize the corresponding nouns when necessary.
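A minimal sketch of numeral replacement, with a naive add/strip-"s" inflection rule standing in for proper singularization/pluralization tools (the function and table names are ours):

```python
# Numerals recognized by the toy rule; "a"/"an" are treated as "one".
NUMERALS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3,
            "four": 4, "five": 5}

def replace_numeral(tokens, idx, new_word, new_count):
    """Replace the numeral at position idx with new_word and re-inflect
    the noun immediately after it (naive "s" rule for illustration)."""
    assert tokens[idx].lower() in NUMERALS
    out = list(tokens)
    out[idx] = new_word
    if idx + 1 < len(out):
        noun = out[idx + 1]
        if new_count > 1 and not noun.endswith("s"):
            out[idx + 1] = noun + "s"       # pluralize
        elif new_count == 1 and noun.endswith("s"):
            out[idx + 1] = noun[:-1]        # singularize
    return out
```

For example, replacing the article before "cat" with "five" turns "a person feeding a cat with a banana" into "a person feeding five cats with a banana".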


The relation class includes two different paradigms.

The first paradigm can be viewed as a shuffle of non-interchangeable noun phrases. After extracting the noun phrases of a caption, we shuffle them and put them back into the original positions. Although the bag-of-words features of the two captions remain the same, the semantic meaning is altered by this process.
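The shuffling paradigm can be sketched as below, assuming the noun-phrase spans have already been detected (e.g., by SpaCy); the function name and span format are illustrative:

```python
import random

def shuffle_noun_phrases(tokens, np_spans, rng=None):
    """Shuffle the noun phrases of a caption (relation shuffling).

    np_spans: list of (start, end) token-index spans of the noun
    phrases, in order. A non-identity permutation is drawn so the
    output caption differs from the input, while all tokens between
    noun phrases stay in place.
    """
    if len(np_spans) < 2:
        return list(tokens)
    rng = rng or random.Random(0)
    phrases = [tokens[s:e] for s, e in np_spans]
    order = list(range(len(phrases)))
    while order == sorted(order):      # reject the identity permutation
        rng.shuffle(order)
    out, prev = [], 0
    for (s, e), j in zip(np_spans, order):
        out.extend(tokens[prev:s])     # text between noun phrases
        out.extend(phrases[j])         # a different noun phrase
        prev = e
    out.extend(tokens[prev:])
    return out
```

The output keeps exactly the same bag of words as the input, yet swaps the roles of the entities, which is what makes the sample contrastive.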

The second paradigm is the replacement of prepositions. We extract the prepositions with frequency higher than 200 in the MS-COCO dataset. We then manually annotate a semantic overlap table, which can be found in Appendix A. Words in the same set of this table may have semantic overlap with each other, e.g., "by" and "with", or "in" and "among".
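A sketch of how such an overlap table can be used to pick safe replacement prepositions; the small OVERLAP list here is a hypothetical stand-in for the annotated table in Appendix A:

```python
# Hypothetical overlap sets: prepositions in the same set may describe
# the same spatial configuration, so they never replace one another.
OVERLAP = [{"by", "with"}, {"in", "among", "inside"}, {"on", "above"}]
PREPOSITIONS = {"by", "with", "in", "among", "inside", "on", "above", "under"}

def safe_replacements(prep):
    """Frequent prepositions that do not semantically overlap with prep."""
    group = next((g for g in OVERLAP if prep in g), {prep})
    return sorted(PREPOSITIONS - group - {prep})
```

For instance, "with" is never used to replace "by" (they overlap), while "under" is a valid contrastive replacement for "in".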

The noun phrase detection, preposition detection and numeral detection mentioned above are performed with SpaCy [Honnibal and Johnson2015]. Examples of different classes of contrastive adversarial sample generation are shown in Table 1.

3.3 Intra-pair hard negative mining

We extend the online hard example mining (OHEM) technique used by VSE++ [Faghri et al.2017]. The original hinge loss is modified by choosing the hardest sample within a batch (inter-pair). Mathematically,

$$\ell(i, c) = \max_{c'} [\alpha - s(i, c) + s(i, c')]_+ + \max_{i'} [\alpha - s(i, c) + s(i', c)]_+.$$

There are two major concerns regarding in-batch hard negative mining. On one hand, mining negatives from a single batch is inefficient when the batch size is not comparable with the size of the dataset. On the other hand, for real-world datasets, taking the max in the loss function is very sensitive to label noise, which produces false negative samples.

In contrast, given an image-caption pair $(i, c)$, we employ human heuristics and the WordNet knowledge base to generate a set of contrastive negative samples $C'(c)$. To utilize these candidate caption sets, we employ an intra-pair hard negative mining strategy. Specifically, during optimization, we add an extra loss term:

$$\ell_{\text{intra}}(i, c) = \max_{c' \in C'(c)} [\alpha - s(i, c) + s(i, c')]_+.$$

In our implementation, the candidate set contains a large number of samples per caption. In each iteration, we randomly sample a subset of negatives from it. This simple sampling technique is effective and computation-friendly in our empirical studies.
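The intra-pair term can be sketched for a single image-caption pair as follows (a toy NumPy version; the sampled-subset size k and all variable names are our own):

```python
import numpy as np

def intra_pair_loss(s_pos, s_negs, margin=0.2, k=5, rng=None):
    """Intra-pair hard-negative hinge for one (image, caption) pair.

    s_pos:  similarity of the matched pair, s(i, c).
    s_negs: similarities of i against the contrastive captions C'(c).
    A random subset of k candidates is drawn, and only the hardest one
    (highest similarity) contributes to the hinge.
    """
    rng = rng or np.random.default_rng(0)
    sample = rng.choice(s_negs, size=min(k, len(s_negs)), replace=False)
    return max(0.0, margin - s_pos + sample.max())
```

Sampling a subset before taking the max keeps the cost per iteration constant regardless of how many contrastive captions were generated.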

4 Experiments

We begin our experiments with an extensive study on the effect of adversarial samples on the baseline models. Even trained with hard negative mining techniques, VSE++ fails to discriminate words with completely contradictory visually-grounded semantics. Furthermore, we study the improvement brought by the introduction of contrastive adversarial samples on a diverse set of tasks.

4.1 Adversarial attacks

We select 1,000 images for testing from the MS-COCO 5k test split following Karpathy and Fei-Fei (2015). Each image is associated with five captions. Each caption in the selected test set can be manipulated to generate at least 20 contrastive adversarial samples of every type (noun, numeral, and relation adversary). The image-to-caption retrieval task is defined as ranking the candidate captions by the distance between their semantics and the given image.

We follow the metrics used by Faghri et al. (2017), computing R@1, R@10, median rank, and mean rank w.r.t. the top-ranked correct caption for each image. For each image, the retrieval database contains the full set of 5,000 captions, of which only 5 are labeled as positive. The R@k metric measures the percentage of images for which the set of top-k ranked captions contains at least one positive caption.
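For reference, these ranking metrics can be computed from a similarity matrix as sketched below (our own minimal NumPy implementation, not the released evaluation code):

```python
import numpy as np

def retrieval_metrics(sims, positives, ks=(1, 10)):
    """Caption-retrieval metrics over a set of image queries.

    sims: (num_images, num_captions) similarity scores.
    positives: list of sets; positives[q] holds the caption indices
    labeled positive for image q. The rank of an image is the best
    (lowest) rank achieved by any of its positive captions.
    """
    ranks = []
    for q, sim in enumerate(sims):
        order = np.argsort(-sim)  # captions sorted by descending similarity
        rank = min(np.where(np.isin(order, list(positives[q])))[0]) + 1
        ranks.append(rank)
    ranks = np.array(ranks)
    out = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    out["med_r"] = float(np.median(ranks))
    out["mean_r"] = float(np.mean(ranks))
    return out
```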

We attack the existing models with adversarial samples. We extend each caption with 60 adversarial samples (20 noun-typed, 20 numeral-typed, and 20 relation-typed). Therefore, each image has 60 × 5 = 300 contrastive adversarial samples in total, and the candidate retrieval set for each image grows from 5,000 to 5,300 captions. We discuss the experimental results as follows:

VSE-C is more robust to known-type adversarial attacks than VSE and VSE++.

We compare the performance of VSE-C with VSE [Kiros et al.2014] and VSE++ [Faghri et al.2017] in Table 2. Both VSE and VSE++ suffer a significant drop in performance after adding adversarial samples, while VSE-C, trained with contrastive adversarial samples, is less vulnerable to the attacks. This phenomenon reflects that the text encoders of VSE and VSE++ do not make good use of the image encodings, as the image encodings are fixed in all experiments.

Detailed attack results are shown in Table 3. The three hyper-columns show the models' ability to defend against noun-, numeral-, and relation-typed adversarial attacks, respectively. Among the three types of attacks, VSE and VSE++ suffer least from the noun attack. As the construction of the dataset ensures that the words used for replacement are frequent in the entire dataset, the visual grounding of these frequent nouns is easy to obtain. However, the semantics of relations (including prepositions and entity relations) and numbers are not diverse enough in the dataset, leading to the poor performance of VSE against these attacks.

Numeral-typed VSE-C improves the counting ability of models.

As shown in Table 3, numeral-typed contrastive adversarial samples improve the counting ability of models. However, it is not immediately clear where the gain comes from, as the creation of numeral-typed samples may change the form (i.e., singular or plural) of nouns to keep a sentence plausible. Does the gain come merely from an improved ability to distinguish singulars from plurals?

We conduct the following evaluation to study the counting ability. We extract all the images associated with captions containing plurals from our test split of MS-COCO, forming a plural split of the dataset, and generate only numeral-typed contrastive adversarial samples (changing the numerals) w.r.t. the plurals in the captions. We report the performance of VSE++ and numeral-typed VSE-C on this plural split in Table 4. It clearly shows that VSE-C not only distinguishes singulars from plurals, but also, at least to some extent, distinguishes plurals from other plurals (e.g., 3 vs. 5).

It is worth noting that such counting ability is still not evaluated completely due to the limitations of the current MS-COCO test split. We find that 99.8% of the plurals in the MS-COCO test set come from one of "two", "three", "four", and "five". This may reduce the counting problem to a much simpler classification one.

Model MS-COCO Test MS-COCO Test (w/. adversarial)
R@1 R@10 Med r. Mean r. R@1 R@10 Med r. Mean r.
VSE 47.7 87.8 2.0 5.8 28.0 71.6 4.0 11.7
VSE++ 55.7 92.4 1.0 4.3 35.6 72.5 3.0 11.8
VSE-C (+n.) 50.7 90.7 1.0 5.2 40.3 80.2 2.0 9.2
VSE-C (+num.) 53.3 90.2 1.0 5.8 46.9 86.3 2.0 6.9
VSE-C (+rel.) 52.4 89.0 1.0 5.7 42.3 82.5 2.0 7.2
VSE-C (+all) 50.2 89.8 1.0 5.2 47.4 88.8 2.0 5.5
Table 2: Evaluation on image-to-caption retrieval. Although VSE++ [Faghri et al.2017] obtains the best performance on the original MS-COCO test set, it is more vulnerable to the caption-specific adversarial attack than the proposed VSE-C, as is VSE [Kiros et al.2014].
Model MS-COCO Test (+n.) MS-COCO Test (+num.) MS-COCO Test (+rel.)
R@1 R@10 Mean r. R@1 R@10 Mean r. R@1 R@10 Mean r.
VSE 37.6 85.8 6.9 38.5 82.3 7.7 30.7 76.7 8.8
VSE++ 45.7 89.1 5.5 45.9 82.3 7.2 42.3 80.0 7.6
VSE-C (+n.) 49.2 88.4 5.7 42.1 80.3 9.1 40.4 83.3 7.1
VSE-C (+num.) 51.0 89.5 6.1 53.3 90.2 5.8 49.0 87.0 6.6
VSE-C (+rel.) 48.0 88.8 5.3 45.4 83.9 6.7 50.1 90.2 4.9
VSE-C (+all.) 49.4 89.3 5.3 49.9 89.6 5.2 47.9 89.4 5.3
Table 3: Detailed results on each type of adversarial attack. Training VSE-C on one class yields the best robustness against the adversarial attack of that class itself. In addition, training with numeral-typed adversarial samples helps improve the robustness against noun-typed and relation-typed attacks. We hypothesize that this is attributable to the singularization or pluralization of the corresponding nouns during numeral-typed adversarial sample generation.
Model MS-COCO Test (plural split, +plurals)
R@1 R@10 Mean r.
VSE++ 43.7 78.3 9.1
VSE-C (+num.) 50.6 84.4 7.8
Table 4: Results on plural-typed adversarial attack to the plural split of MS-COCO test set. This split consists of 205 images, together with the 1,025 original captions. VSE-C outperforms VSE++ by a large margin on all the three considered metrics.

4.2 Saliency visualization

Figure 2: Saliency analysis on adversarial samples. The left column shows the saliency of VSE-C on the image (what is the difference between the image and the image you imagine from the caption), while the right column shows the saliency of both VSE++ and VSE-C on the caption (what is the difference between the caption and the caption you summarize from the image). The magnitude of values indicates the level of saliency. For better visualization, the image and caption saliencies are normalized. For all three classes of textual adversarial samples, the image encoding model (ResNet-152) focuses almost exclusively on the main part of the image, i.e., the elephant. For numeral-typed and relation-typed adversarial samples, VSE-C pays much more attention than VSE++ to the manipulated segments of the sentence.

Given an image-caption pair and its corresponding textual adversarial samples, we are interested in the following question: what determines the semantic distance between the image and an adversarial caption? In other words, which part of the image or caption, in particular, makes them semantically different?

We visualize the saliency on input images and captions w.r.t. changes in sentence semantics. Specifically, given an image-caption pair $(i, c)$, we manually modify the semantics of the caption with the techniques introduced in Section 3.2, obtaining an adversarial caption $c'$. We compute the saliency of $i$ or $c$ w.r.t. this change by visualizing the Jacobians:

$$\frac{\partial}{\partial i}\left[s(i, c) - s(i, c')\right] \quad \text{and} \quad \frac{\partial}{\partial c}\left[s(i, c) - s(i, c')\right],$$

where $s(\cdot, \cdot)$ is the similarity metric for image-caption pairs.
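To make the quantity concrete, a finite-difference version of this saliency can be sketched on a toy bilinear similarity (the bilinear form and all names here are illustrative assumptions; the actual implementation backpropagates through the encoders):

```python
import numpy as np

def similarity(i, c, W):
    """Toy bilinear similarity s(i, c) = i^T W c, a stand-in for the
    learned similarity metric (an assumption for illustration)."""
    return i @ W @ c

def caption_saliency(i, c, c_adv, W, eps=1e-5):
    """Central-difference estimate of d/dc [s(i, c) - s(i, c_adv)],
    evaluated coordinate by coordinate at the caption features c."""
    diff = lambda cc: similarity(i, cc, W) - similarity(i, c_adv, W)
    grad = np.zeros_like(c)
    for d in range(len(c)):
        bump = np.zeros_like(c)
        bump[d] = eps
        grad[d] = (diff(c + bump) - diff(c - bump)) / (2 * eps)
    return grad
```

Coordinates of c with large gradient magnitude correspond to the parts of the caption that drive the semantic mismatch.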

As shown in Figure 2, for captions, VSE-C captures the change in sentence semantics and thus exhibits large saliency on the manipulated words. In contrast, although trained with hard-negative mining, VSE++ finds it difficult to capture differences other than nouns.

Interestingly, the saliency of images shows a less correlated response to semantic changes when the replaced word is not a major component of the image. We attribute this to the image embedding extractor, ResNet, which is pre-trained on the ImageNet classification task. As ResNet learns to produce shift-invariant features focusing on the major components (or concepts) of images, it inevitably learns less about secondary (and other) concepts.

4.3 Correlating words and objects

As only textual adversarial samples are provided during training, the model might overfit the training samples by memorizing incorrect co-occurrences of words or concepts. To quantitatively evaluate the learned word embeddings, we conduct experiments on word-level image-to-word retrieval. Specifically, we first examine how each noun is linked with a visual object. This task probes the concrete link between words and image concepts, which supports the effectiveness of adversarial samples in enforcing the learning of visually-grounded semantics beyond co-occurrence memorization.


Based on the captions, we extract positive objects for each image in the MS-COCO dataset by detecting the heads of noun phrases using SpaCy. As mentioned in Section 3.2, only objects without a direct hypernymy/hyponymy relation to the positive objects of an image are used as its negative objects, to avoid ambiguity. Table 5 shows an example of the construction of the image-object dataset.

Image Captions
A table with a huge glass vase and fake flowers come out of it.
A plant in a vase sits at the end of a table.
A vase with flowers in it with long stems sitting on a table with candles.
A large centerpiece that is sitting on the edge of a dining table.
Flowers in a clear vase sitting on a table.
Positive Objects: table, plant, vase.
Negative Objects: screen, pickle, sandwich, toy, hill, coat, cat, etc.
Table 5: An example of the image-to-word retrieval dataset. We extract objects by detecting heads of noun phrases in captions. We only collect the “object” words with frequency higher than 200 in MS-COCO full dataset as available positive/negative objects for each image.


Inspired by Gong et al. (2018), we train an image-word alignment network through the interaction space, since this structure reflects the property of "alignment" better than simply concatenating the feature vectors of the word and the image. In the training stage, the network is fed with batches of samples in the form of (image, word, label), where the label is 0 or 1, indicating whether the word is a negative or positive object of the image. Let $w$ denote the embedding of a word and $v$ denote the feature vector of an image extracted by ResNet-152 [He et al.2016]. As shown in Figure 3, we use the full interaction matrix $w v^\top$ as the feature for object retrieval. Both the image and word features are fixed; we tune only the parameters of a multi-layer perceptron (MLP).
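The interaction-space scoring can be sketched as follows; the flattened outer-product feature and the tiny MLP are minimal illustrative stand-ins for the actual architecture:

```python
import numpy as np

def interaction_features(w, v):
    """Full interaction matrix w v^T between a word embedding w and an
    image feature v, flattened as the input of the alignment MLP."""
    return np.outer(w, v).ravel()

def mlp_score(x, W1, b1, W2, b2):
    """Two-layer perceptron with ReLU producing an alignment logit.
    Only these MLP parameters are trained; w and v stay fixed."""
    h = np.maximum(0, W1 @ x + b1)
    return float(W2 @ h + b2)
```

Every coordinate of the interaction matrix pairs one word-embedding dimension with one image-feature dimension, which is why this representation captures "alignment" better than plain concatenation.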

Figure 3: Model structure for image-to-word retrieval. The network is trained through the interaction space. Note that only parameters in the MLP are tuned during training.
Model MAP
GloVe 58.7
VSE 61.7
VSE++ 61.1
VSE-C (+all) 62.2
VSE-C (+n.) 62.8
VSE-C (+rel.) 62.3
VSE-C (+num.) 62.0
Table 6: Evaluation result (MAP in percentage) on image-to-word retrieval.


We use mean average precision (MAP), a widely applied metric in information retrieval, to evaluate the performance of the word embeddings. Each image is treated as a query. The average precision (AP) is defined by

$$\text{AP} = \frac{1}{R} \sum_{k=1}^{n} P(k) \cdot \text{rel}(k),$$

where $n$ is the number of objects in the database (i.e., both positive and negative objects), $R$ is the number of positive objects, $P(k)$ is the precision at cut-off $k$ in the ranked list, and $\text{rel}(k)$ is an indicator function equal to 1 if the object at rank $k$ is a positive one and 0 otherwise [Turpin and Scholer2006].

Based on the definition of AP, MAP is computed as $\text{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \text{AP}(q)$, where $Q$ is the query set, i.e., the image set. It is worth noting that the retrieval database of each query may differ from the others, similar to Section 4.1.
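The AP and MAP computations can be sketched directly from ranked relevance labels (a minimal Python version of the standard definitions; the function names are ours):

```python
def average_precision(ranked_labels):
    """AP for one query: ranked_labels gives the relevance (1/0) of the
    retrieved objects in ranked order."""
    hits, total = 0, 0.0
    for k, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            total += hits / k          # precision at cut-off k
    return total / max(1, sum(ranked_labels))

def mean_average_precision(queries):
    """MAP over a set of queries, each with its own ranked relevance list
    (databases may differ per query)."""
    return sum(average_precision(q) for q in queries) / len(queries)
```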


The evaluation results are shown in Table 6. As expected, VSE-C (+n.) achieves the best performance on the image-object retrieval task. All of the VSE-C models outperform the baselines produced by VSE [Kiros et al.2014], VSE++ [Faghri et al.2017], and GloVe [Pennington et al.2014], showing the concrete link between learned word semantics and visual concepts. Surprisingly, VSE-C with only relation adversarial samples shows performance comparable to VSE-C with noun adversarial samples. This further supports the effectiveness of sentence-level manipulation (relation shuffling in Figure 1) in strengthening the link.

4.4 Concept to word

We quantitatively evaluate concept-to-word retrieval by introducing a sentence completion task. Given an image-caption pair, we manually replace concept words (nouns and relational words) with blanks. A separate model is trained to fill in the blanks.

Dataset and implementation details

Based on the captions, we extract nouns and relational words for each image in the MS-COCO dataset using SpaCy. These selected words are marked as "concept" representatives. During training, we randomly sample a word from the representative set and mask it as a blank to be filled. Given the image and the remaining words, the model is trained to predict the embedding of the masked word.


The sentence with a blank is encoded by two unidirectional GRU layers: the words before the blank and the words after the blank are encoded separately, by a forward GRU and a backward GRU respectively. The image feature extracted by a pre-trained ResNet-152 is then concatenated with the last output of both GRUs. The prediction of the embedding is made by a two-layer MLP taking the concatenated feature as input. We use cosine similarity as the loss function. Figure 4 illustrates our fill-in-the-blank model.

Figure 4: Model structure for fill-in-the-blank.


We present in Table 7 the performance of the proposed VSE-C on filling in both nouns and prepositions. VSE-based models outperform GloVe, which lacks visual grounding, and concretely correlate word semantics with image embeddings. We find only small gaps between VSE++ and VSE-C on preposition filling, which again shows the limited diversity of visual relations within the dataset.

Model Noun Filling Prep. Filling All (n. + prep.)
R@1 R@10 R@1 R@10 R@1 R@10
GloVe 23.2 58.8 23.3 79.9 23.3 66.6
VSE++ 25.0 61.7 34.9 84.9 28.4 68.1
VSE-C (ours) 27.3 62.9 35.2 85.2 30.0 70.98
Table 7: Evaluation result on the fill-in-the-blank task (in percentage). The word embeddings learned by VSE-C with all classes of contrastive adversarial samples help reach a better performance than those learned by VSE++ [Faghri et al.2017].

5 Discussion and conclusion

In this paper, we focus on the problem of learning visually-grounded semantics from parallel image-text data. With extensive experiments on adversarial attacks against existing frameworks [Kiros et al.2014, Faghri et al.2017], we obtain new insights into the limitations of both datasets and frameworks. (1) Even for large-scale datasets such as MS-COCO captioning, a large gap remains between the number of possible constitutions of real-world visual semantics and the size of the dataset. (2) Existing models are not powerful enough to fully capture or extract the information contained in visual embeddings.

We propose VSE-C, introducing contrastive adversarial samples in the text domain and an intra-pair hard-example mining technique. To delve deeper into the embedding space and its transferability, we study a set of multi-modal tasks both qualitatively and quantitatively. Beyond being robust to adversarial attacks on image-to-caption retrieval tasks, experimental results on image-to-word retrieval and fill-in-the-blank reveal the correlation between the learned word embeddings and visual concepts.

VSE-C also demonstrates a general framework for augmenting textual inputs while preserving semantic consistency. The introduction of human priors and knowledge bases alleviates the sparsity and non-contiguity of language. We hope the framework and the released data are beneficial for building more robust and data-efficient models.


  • [Andrew et al.2013] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. 2013. Deep Canonical Correlation Analysis. In Proc. of ICML.
  • [Castrejon et al.2016] Lluis Castrejon, Yusuf Aytar, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. 2016. Learning Aligned Cross-Modal Representations from Weakly Aligned Data. In Proc. of CVPR.
  • [Chen et al.2016] Danqi Chen, Jason Bolton, and Christopher D Manning. 2016. A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task. In Proc. of ACL.
  • [Cheng et al.2016] Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long Short-Term Memory-Networks for Machine Reading. In Proc. of EMNLP.
  • [Faghri et al.2017] Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2017. VSE++: Improved Visual-Semantic Embeddings. arXiv preprint arXiv:1707.05612.
  • [Frome et al.2013] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. 2013. Devise: A Deep Visual-Semantic Embedding Model. In Proc. of NIPS.
  • [Gong et al.2018] Yichen Gong, Heng Luo, and Jian Zhang. 2018. Natural Language Inference over Interaction Space. In Proc. of ICLR.
  • [He et al.2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proc. of CVPR.
  • [Honnibal and Johnson2015] Matthew Honnibal and Mark Johnson. 2015. An Improved Non-Monotonic Transition System for Dependency Parsing. In Proc. of EMNLP.
  • [Hotelling1936] Harold Hotelling. 1936. Relations between two sets of variates. Biometrika, 28(3/4):321–377.
  • [Huang et al.2017] Yan Huang, Wei Wang, and Liang Wang. 2017. Instance-Aware Image and Sentence Matching with Selective Multimodal LSTM. In Proc. of CVPR.
  • [Inan et al.2017] Hakan Inan, Khashayar Khosravi, and Richard Socher. 2017. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. In Proc. of ICLR.
  • [Jia and Liang2017] Robin Jia and Percy Liang. 2017. Adversarial Examples for Evaluating Reading Comprehension Systems. In Proc. of EMNLP.
  • [Karpathy and Fei-Fei2015] Andrej Karpathy and Li Fei-Fei. 2015. Deep Visual-Semantic Alignments for Generating Image Descriptions. In Proc. of CVPR.
  • [Kingma and Ba2015] Diederik P Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proc. of ICLR.
  • [Kiros et al.2014] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. 2014. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. arXiv preprint arXiv:1411.2539.
  • [Kos and Song2017] Jernej Kos and Dawn Song. 2017. Delving into Adversarial Attacks on Deep Policies. arXiv preprint arXiv:1705.06452.
  • [Kumar et al.2016] Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. In Proc. of ICML.
  • [Li et al.2015] Yangyan Li, Hao Su, Charles Ruizhongtai Qi, Noa Fish, Daniel Cohen-Or, and Leonidas J Guibas. 2015. Joint Embeddings of Shapes and Images via CNN Image Purification. ACM Trans. Graph.
  • [Li2011] Hang Li. 2011. Learning to Rank for Information Retrieval and Natural Language Processing. Synthesis Lectures on Human Language Technologies, 4(1):1–113.
  • [Lin et al.2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In Proc. of ECCV.
  • [Malinowski et al.2015] Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. 2015. Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images. In Proc. of ICCV.
  • [Mao et al.2016] Junhua Mao, Jiajing Xu, Kevin Jing, and Alan L Yuille. 2016. Training and Evaluating Multimodal Word Embeddings with Large-Scale Web Annotated Images. In Proc. of NIPS.
  • [Mikolov et al.2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.
  • [Miller1995] George A Miller. 1995. WordNet: A Lexical Database for English. Communications of the ACM.
  • [Nam et al.2017] Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. 2017. Dual Attention Networks for Multimodal Reasoning and Matching. In Proc. of CVPR.
  • [Ngiam et al.2011] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. 2011. Multimodal Deep Learning. In Proc. of ICML.
  • [Nguyen et al.2015] Anh Nguyen, Jason Yosinski, and Jeff Clune. 2015. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. In Proc. of CVPR.
  • [Niu et al.2017] Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, and Gang Hua. 2017. Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding. In Proc. of CVPR.
  • [Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proc. of EMNLP.
  • [Reed et al.2016] Scott Reed, Zeynep Akata, Bernt Schiele, and Honglak Lee. 2016. Learning Deep Representations of Fine-Grained Visual Descriptions. In Proc. of CVPR.
  • [Shen et al.2017] Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. 2017. ReasoNet: Learning to Stop Reading in Machine Comprehension. In Proc. of SIGKDD.
  • [Socher et al.2014] Richard Socher, Andrej Karpathy, Quoc V Le, Christopher D Manning, and Andrew Y Ng. 2014. Grounded Compositional Semantics for Finding and Describing Images with Sentences. TACL.
  • [Turney et al.2011] Peter D Turney, Yair Neuman, Dan Assaf, and Yohai Cohen. 2011. Literal and Metaphorical Sense Identification through Concrete and Abstract Context. In Proc. of EMNLP.
  • [Turpin and Scholer2006] Andrew Turpin and Falk Scholer. 2006. User Performance versus Precision Measures for Simple Search Tasks. In Proc. of SIGIR.
  • [Wang et al.2016] Liwei Wang, Yin Li, and Svetlana Lazebnik. 2016. Learning Deep Structure-Preserving Image-Text Embeddings. In Proc. of CVPR.
  • [Xie et al.2017] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Yuyin Zhou, Lingxi Xie, and Alan Yuille. 2017. Adversarial Examples for Semantic Segmentation and Object Detection. In Proc. of ICCV.
  • [Young et al.2014] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions. TACL.
  • [Zou et al.2013] Will Y Zou, Richard Socher, Daniel Cer, and Christopher D Manning. 2013. Bilingual Word Embeddings for Phrase-Based Machine Translation. In Proc. of EMNLP.
  • [Zwicky1985] Arnold M Zwicky. 1985. Heads. Journal of Linguistics.

Appendix A Semantic Overlap Table of Frequent Prepositions

Set #  Words in the Semantic Set
1 towards, toward, beyond, to
2 behind, after, past
3 outside, out
4 underneath, under, beneath, down, below
5 on, upon, up, un, atop, onto, over, above, beyond
6 in, within, among, at, during, into, inside, from, between
7 if, while
8 with, by, beside
9 around, like
10 to, for, of
11 about, within
12 because, as, for
13 as, like
14 near, next, beside
15 though
16 thru, through
17 besides, along
18 against, next, to
19 along, during, across, while
20 off, out
21 without
22 than
23 before
Table 8: Manually annotated semantic overlap table.

Table 8 shows our manually annotated semantic overlap sets. Prepositions in the same row overlap in semantics, i.e., can to some degree substitute for one another. A preposition may appear in several sets.
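For illustration, a table of this kind can drive contrastive preposition replacement: a substitute is valid only if it shares no semantic set with the original word, so the perturbed caption is guaranteed to change the described spatial relation. The sketch below uses a hypothetical subset of the rows in Table 8, and all function names are our own:

```python
import random

# A few rows from the semantic overlap table (illustrative subset).
SEMANTIC_SETS = [
    {"towards", "toward", "beyond", "to"},
    {"behind", "after", "past"},
    {"underneath", "under", "beneath", "down", "below"},
    {"in", "within", "among", "at", "during", "into", "inside", "from", "between"},
]
PREPOSITIONS = sorted(set().union(*SEMANTIC_SETS))

def contrastive_preposition(word, rng=random):
    """Sample a preposition that shares NO semantic set with `word`,
    so substituting it yields a semantically contrastive caption."""
    # Union of every set the original word belongs to.
    overlapping = set().union(*(s for s in SEMANTIC_SETS if word in s))
    candidates = [p for p in PREPOSITIONS if p not in overlapping and p != word]
    return rng.choice(candidates) if candidates else None
```

Because a preposition may appear in several sets, the union over all of its sets is taken before filtering, which is exactly why overlapping rows in Table 8 are harmless for sample construction.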

Appendix B Training Details of VSE-C

In all experiments, we use Adam [Kingma and Ba2015] as the optimizer, with the learning rate set to 1e-3 and a batch size of 128. The learning rate is decayed by a fixed multiplicative factor after every 15 epochs. We do not apply any regularization or dropout. Word embeddings are initialized with the 300-dimensional GloVe vectors [Pennington et al.2014] (http://nlp.stanford.edu/data/glove.840B.300d.zip). The text encoder is a 1-layer bidirectional GRU with 512 dimensions per direction (1,024 in total). The dimensionality of the joint (multimodal) embedding is also 1,024. Empirically, with training data and hyper-parameters fixed, different random seeds for the sampling cause no significant variance in performance.
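The step-decay schedule described above can be sketched as a simple function of the epoch index. Note that the decay factor `gamma=0.1` is an assumed placeholder, since the exact multiplier is not specified here:

```python
def stepped_lr(initial_lr, epoch, decay_every=15, gamma=0.1):
    """Step-decay schedule: multiply the learning rate by `gamma`
    once every `decay_every` epochs.

    `gamma=0.1` is an assumed value, not taken from the paper.
    """
    return initial_lr * gamma ** (epoch // decay_every)
```

With `initial_lr=1e-3` and `decay_every=15`, the rate stays at 1e-3 for epochs 0-14 and drops by a factor of `gamma` at epoch 15, 30, and so on.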