Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding

by   Hassan Akbari, et al.
Columbia University

We address the problem of phrase grounding by learning a multi-level common semantic space shared by the textual and visual modalities. This common space is instantiated at multiple layers of a Deep Convolutional Neural Network by exploiting its feature maps, as well as contextualized word-level and sentence-level embeddings extracted from a character-based language model. After dedicated non-linear mappings for the visual features at each level and for the word and sentence embeddings, we obtain a common space in which comparisons between the target text and the visual content at any semantic level can be performed simply with cosine similarity. We guide the model by a multi-level multimodal attention mechanism which outputs attended visual features at different semantic levels. The best level is selected to be compared with the text content, maximizing the pertinence scores of ground-truth image-sentence pairs. Experiments conducted on three publicly available benchmarks show significant performance gains in phrase localization over the state of the art and set a new performance record on those datasets. We also provide a detailed ablation study showing the contribution of each element of our approach.







1 Introduction

Phrase grounding [38, 31] is the task of localizing, within an image, a given natural language input phrase, as illustrated in Figure 1. This ability to link text and image content is a key component of many visual semantic tasks such as image captioning [10, 21, 18], visual question answering [2, 29, 48, 52, 11], text-based image retrieval [12, 39], and robotic navigation [44]. It is especially challenging as it requires a good representation of both the visual and textual domains and an effective way of linking them.

Figure 1: The phrase grounding task in the pointing game setting. Given the sentence on top and the image on the left, the goal is to point (illustrated by the stars here) to the correct location of each natural language query (colored text). An actual example of our method's results on Flickr30k.

On the visual side, most works exploit Deep Convolutional Neural Networks but often rely on bounding box proposals [38, 41, 15] or use a global feature of the image [10], limiting the localization ability and freedom of the method. On the textual side, methods rely on a closed vocabulary or try to train their own language model on small image-caption pair datasets [17, 59, 53, 9]. Finally, the mapping between the two modalities is often performed with a weak linear strategy [38, 51]. We argue that approaches in the literature have not fully leveraged the potential of the more powerful visual and textual models developed recently, and that there is room for developing more sophisticated representations and mapping approaches.

In this work, we propose to explicitly learn a non-linear mapping of the visual and textual modalities into a common space, and do so at different granularities for each domain. Indeed, different layers of a deep network encode each region of the image with gradually increasing levels of discriminativeness and context awareness; similarly, single words and whole sentences carry increasing levels of semantic meaning. This common space mapping is trained with weak supervision and exploited at test time with a multi-level multimodal attention mechanism, whose natural formalism for computing attention heatmaps at each level, attended features, and pertinence scores enables us to solve the phrase grounding task elegantly and effectively. We evaluate our model on three datasets commonly used in the textual grounding literature and show that it sets a new state-of-the-art performance by a large margin.

Our contributions in this paper are as follows:

  • We learn, with weak-supervision, a non-linear mapping of visual and textual features to a common region-word-sentence semantic space, where comparison between any two semantic representations can be performed with a simple cosine similarity;

  • We propose a multi-level multimodal attention mechanism, which can produce either word-level or sentence-level attention maps at different semantic levels, enabling us to choose the most representative attended visual feature among different semantic levels;

  • We set new state-of-the-art performance on three commonly used datasets, and give detailed ablation results showing how each part of our method contributes to the final performance.

In the following section, we provide a brief overview of related works in the literature; we then elaborate on our method in the sequel.

2 Related works

2.1 Grounding natural language in images

The earliest works on the textual grounding task [38, 41, 15] tried to tackle the problem by finding the right bounding box out of a set of proposals, usually obtained from pre-specified models [62, 45]. The ranking of these proposals, for each text query, can be performed using scores estimated from a reconstruction [41] or sentence generation [15] procedure, or using distances in a common space [38]. However, relying on a fixed set of pre-defined concepts and proposals may not be optimal, and the quality of the bounding boxes defines an upper bound [15, 46] on the achievable performance. Therefore, several methods [6, 61] have proposed to integrate the proposal step into their framework to improve bounding box quality. Works relying on bounding boxes often operate in a fully supervised setting [5, 53, 57, 11, 6], where the mapping between sentences and bounding boxes has to be provided at training time, which is not always available and is costly to gather. It is also worth mentioning that methods based on bounding boxes often extract features separately for each bounding box [15, 4, 46], inducing a high computational cost.

Some works [40, 17, 59, 47, 54] therefore choose not to rely on bounding boxes and propose to formalize the localization problem as finding a spatial heatmap for the referring expression. This setting is mostly weakly-supervised, where at training time only the image and the text (describing either the whole image or some parts of it) are provided but not the corresponding bounding box or segmentation mask for each description. This is the more general setting we are addressing in this paper. The top-down approaches [40, 59] and the attention-based approach [17] learn to produce a heatmap for each word of a vocabulary. At test time, all these methods produce the final heatmap by averaging the heatmaps of all the words in the query that exist in the vocabulary. Several grounding works have also explored the use of additional knowledge, such as image [46] and linguistic [47, 37] structures, phrase context [5] and exploiting pre-trained visual models predictions [4, 54].

In contrast to many works in the literature, we use no pre-defined word or image concepts in our method. We also do not leverage any knowledge from classification or object detection tasks, nor explicitly exploit image or sentence structures. We instead rely on a character-based language model with contextualized embeddings, which can handle any unseen word by considering its context in the sentence. As we explain in the sequel, the sentence and each word in it are assigned a spatial heatmap reflecting their similarity to different regions of the image at different visual semantic levels.

2.2 Mapping to common space

It is a common approach to extract visual and language features independently and fuse them before the prediction [9, 4, 6]. Current works usually apply a multi-layer perceptron (MLP) [6, 4], element-wise multiplication [14], or cosine similarity [9] to combine representations from different modalities. Other methods have used Canonical Correlation Analysis (CCA) [37, 38], which finds linear projections that maximize the correlation between projected vectors from the two views of heterogeneous data. [11] introduced the Multimodal Compact Bilinear (MCB) pooling method, which fuses visual and language features with a compressed representation of the outer product of the two feature vectors. Attention methods can also measure the matching of an image-sentence feature pair. In [51, 33], attention maps are generated from the dot product of linear projections of the visual and language features. In contrast, we use separate non-linear mappings of both the visual features (at multiple semantic levels) and the textual embeddings (both contextualized word and sentence embeddings), and use multi-level attention with a multimodal loss to learn the mapping weights.

2.3 Attention mechanisms

Attention has proved its effectiveness in many visual and language tasks [23, 1, 7, 52, 50]; it is designed to capture a better representation of image-sentence pairs based on their interactions. The Accumulated Attention method [8] proposes to estimate attention on sentences, objects, and visual feature maps in an iterative fashion, where at each iteration the attention over the other two modalities is exploited as guidance. A dense co-attention mechanism is explored in [33] to solve the Visual Question Answering task, using a fully symmetric architecture between visual and language representations. Their attention mechanism adds a dummy location to the attention map, to be attended when no region or word is relevant, along with a softmax. In AttnGAN [51], a deep attention multimodal similarity model is proposed to compute a fine-grained image-text matching loss. In contrast to these works, we remove the softmax on top of the attention maps and let the model decide which word-region pairs are related, guided by the multimodal loss. Since we map the visual features to a multi-level visual representation, we give the model the freedom to choose any location at any level for either the sentence or each word. In other words, each word can choose which level of representation (and which region in that representation) to attend to, and the same freedom is provided for the sentence. We directly calculate this attention map by cosine similarity in the common space we learn for word, sentence, and multi-level semantic visual representations. We show that this approach significantly outperforms all state-of-the-art approaches on three commonly used datasets, setting a new state-of-the-art performance.

Figure 2: Overview of our method: the textual input is processed with a pre-trained text model followed by a non-linear mapping to the common semantic space. Similarly for the image input, we use a pre-trained visual model to extract visual features maps at multiple levels and learn a non-linear mapping for each of them to the common semantic space. A multi-level attention mechanism followed by a feature level selection produces the pertinence score between the image and the sentence. We train our model using only the weak supervision of image-sentence pairs.

3 Method

In this section, we describe our method (illustrated in Figure 2) for addressing the textual grounding task and elaborate on each part in detail. In Section 3.1, we explain how we extract multi-level visual features from an image and word/sentence embeddings from the text, and how we map them to a common space. In Section 3.2, we describe how we calculate a multi-level multimodal attention map and an attended visual feature for each word/sentence. In Section 3.3, we describe how we choose the visual feature level most representative of the given text. Finally, in Section 3.4, we define a multimodal loss to train the whole model using weak supervision.

3.1 Feature Extraction and Common Space

Visual Feature Extraction: In contrast to many vision tasks, where the last layer of a pre-trained CNN is used as the visual representation of an image, we use feature maps from different layers and map them separately to a common space, obtaining a multi-level set of feature maps to be compared with text. Intuitively, different levels of visual representation are necessary for covering a wide range of visual concepts and patterns [26, 55, 58]. Thus, we extract sets of feature maps from $L$ different levels of a visual network, upsample them by bi-linear interpolation (as transposed convolution produces checkerboard artifacts [34]) to a fixed resolution for all the levels, and then apply 3 layers of 1x1 convolution (with LeakyReLU [30]) with $D$ filters to map them into equal-sized feature maps. Finally, we stack these feature maps and space-flatten them to obtain an overall image representation tensor $V \in \mathbb{R}^{L \times N \times D}$, where $N$ is the number of spatial locations at each level. This tensor is finally normalized by the $\ell_2$-norm of its last dimension. An overview of the feature extraction and common space mapping for the image can be seen in the left part of Figure 3.


In this work, we use VGG [42] as a baseline for fair comparison with other works in the literature [10, 47, 17], and the state-of-the-art CNN PNASNet-5 [28] to study the ability of our model to exploit a more powerful visual model. We detail the feature maps selected for each model in Section 4.2.
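As a rough illustration of the visual pathway described above, the NumPy sketch below (our own code, not the authors') maps multi-level feature maps into a common space. For simplicity it substitutes nearest-neighbour resizing for bilinear interpolation and plain channel-wise matrix multiplies for the learned 1x1 convolutions; all names and shapes are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize by the l2-norm along `axis`, as done for the common-space tensor."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def nearest_resize(x, out_hw):
    """Nearest-neighbour resize, used only to keep this sketch dependency-free
    (the paper uses bilinear interpolation)."""
    h, w, _ = x.shape
    rows = np.arange(out_hw[0]) * h // out_hw[0]
    cols = np.arange(out_hw[1]) * w // out_hw[1]
    return x[rows][:, cols]

def map_feature_maps_to_common_space(feature_maps, level_weights, out_hw):
    """Map feature maps from L CNN levels into a common space.

    feature_maps: list of (H_l, W_l, C_l) arrays, one per level.
    level_weights: per level, a list of channel-projection matrices standing in
                   for the three 1x1 convolutions (a 1x1 conv acts per pixel).
    Returns V of shape (L, N, D) with N = out_h * out_w, l2-normalized over D.
    """
    levels = []
    for fmap, ws in zip(feature_maps, level_weights):
        x = nearest_resize(fmap, out_hw)         # fixed resolution for all levels
        for i, W in enumerate(ws):
            x = x @ W                            # 1x1 convolution as matmul
            if i < len(ws) - 1:
                x = np.where(x > 0, x, 0.1 * x)  # LeakyReLU between layers
        levels.append(x.reshape(-1, x.shape[-1]))  # space-flatten: (N, D)
    return l2_normalize(np.stack(levels, axis=0))  # (L, N, D)
```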

Figure 3: Left: we choose feature maps from different convolutional blocks of a CNN model, resize them to the same spatial dimensions using bi-linear interpolation, and map them to feature maps of the same size. Right: word and sentence embeddings are mapped to the common space from the pre-trained ELMo [36] model. The green pathway is for word embedding, the red pathway for sentence embedding. All the orange boxes (the 1x1 convolutional layers of the visual mapping, the linear combination, and the two sets of fully connected layers of the textual mapping) are the trainable parameters of our projection to the common space.

Textual Feature Extraction: State-of-the-art works in grounding use a variety of approaches for textual feature extraction. Some use LSTMs or BiLSTMs pre-trained on big datasets (e.g., Google 1 Billion [3]) on top of either word2vec [32] or GloVe [35] representations. Others train a BiLSTM solely on image-caption datasets (mostly MSCOCO) and argue that it is necessary to train it from scratch to distinguish between visual concepts which may not be distinguishable in language (e.g., red and green are different in vision but similar in language, as both are colors) [33, 51, 17, 47, 9, 14, 61, 38, 57, 8]. These works either use the recurrent network outputs at each step as word-level representations, use their last output (in each direction for a BiLSTM) as a sentence-level representation, or use a combination of both.

In this paper, however, we use ELMo [36], a 3-layer network pre-trained on 5.5B tokens, which computes word representations on the fly (based on a character-level CNN, similar to [19, 60]) and feeds them to 2 layers of BiLSTMs that produce contextualized representations. Thus, for a given sentence, the model outputs three representations for each token (split by white space). We take a linear combination of the three representations and feed it to 2 fully connected layers (with weights shared among words), each with $D$ nodes and LeakyReLU as the non-linearity between them, to obtain each word representation $e_t$ (green pathway in the right part of Figure 3). The resulting word-based text representation for an entire sentence is the tensor $E \in \mathbb{R}^{T \times D}$ built by stacking the word representations. The sentence-level text representation is calculated from the concatenation of the last outputs of the BiLSTMs in each direction. Similarly, we apply a linear combination to the two sentence-level representations and map the result to the common space with 2 fully connected layers of $D$ nodes, producing the sentence representation $\bar{e}$ (red pathway in the right part of Figure 3). The word tensor and the sentence vector are normalized by the $\ell_2$-norm of their last dimension before being fed to the multimodal attention block.
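A minimal NumPy sketch of this textual pathway (all function and parameter names are ours, and the shapes are illustrative): a learned linear combination of the three ELMo layer outputs, followed by two shared fully connected layers with a LeakyReLU in between, and a final l2 normalization.

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    return np.where(x > 0, x, alpha * x)

def map_words_to_common_space(elmo_layers, layer_weights, W1, b1, W2, b2):
    """Project contextualized word embeddings into the common space.

    elmo_layers: (3, T, C) array -- the three ELMo representations per token.
    layer_weights: (3,) learned linear-combination weights over the layers.
    W1, b1, W2, b2: the two fully connected layers, shared across words.
    Returns (T, D) word representations, l2-normalized per word.
    """
    e = np.tensordot(layer_weights, elmo_layers, axes=1)  # (T, C) combination
    e = leaky_relu(e @ W1 + b1)                           # first FC + LeakyReLU
    e = e @ W2 + b2                                       # second FC
    return e / np.linalg.norm(e, axis=-1, keepdims=True)  # l2-normalize
```

The sentence pathway is analogous, starting from the concatenated last BiLSTM outputs instead of per-token representations.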

3.2 Multi-Level Multimodal Attention Mechanism

Given the image and sentence, our task is to estimate the correspondences between spatial regions ($n \in \{1,\dots,N\}$) in the image at different levels ($l \in \{1,\dots,L\}$) and words in the sentence at different positions ($t \in \{1,\dots,T\}$). We seek a correspondence measure $a_{l,n,t}$ between each word and each region at each level. We define this correspondence by the cosine similarity between word and image region representations at different levels in the common space:

$$a_{l,n,t} = \max\big(0,\ \langle v_{l,n},\, e_t \rangle\big), \qquad (1)$$

where $v_{l,n}$ is the normalized visual representation of region $n$ at level $l$ and $e_t$ the normalized representation of word $t$. The tensor $A = [a_{l,n,t}]$ represents a multi-level multimodal attention map which can be directly used for calculating either the visual or the textual attended representation. We apply ReLU to the attention map to zero out dissimilar word-visual region pairs, and deliberately avoid applying softmax on any dimension of the heatmap tensor. Note that this choice is very different in spirit from the commonly used approach of applying softmax to attention maps [50, 49, 8, 33, 17, 51, 40]. Indeed, for irrelevant image-sentence pairs the attention maps would be almost all zeros, while the softmax process would always force the attention to be a distribution over the image/words summing to 1. Furthermore, a group of words forming a phrase could share the same attention area, which is again hard to achieve given the competition among regions/words induced by applying softmax on the heatmap. We analyze the influence of this choice experimentally in our ablation study.
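When both modalities are already l2-normalized, the attention map of Eq. 1 reduces to a single tensor contraction followed by a ReLU; a sketch under that assumption (our code, not the authors'):

```python
import numpy as np

def attention_heatmap(V, E):
    """Multi-level multimodal attention map (Eq. 1).

    V: (L, N, D) l2-normalized visual tensor in the common space.
    E: (T, D) l2-normalized word embeddings.
    Returns H of shape (L, N, T): ReLU of cosine similarities. No softmax is
    applied, so for an irrelevant pair the map can be (almost) all zeros.
    """
    H = np.einsum('lnd,td->lnt', V, E)  # cosine similarity of unit vectors
    return np.maximum(H, 0.0)           # ReLU zeroes dissimilar pairs
```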

Given the heatmap tensor, we calculate the attended visual feature for the $l$-th level and $t$-th word as

$$a^{v}_{l,t} = \frac{\sum_{n} a_{l,n,t}\, v_{l,n}}{\sum_{n} a_{l,n,t}}, \qquad (2)$$

which is basically a weighted average over the visual representations of the $l$-th level with the attention heatmap values as weights. In other words, $a^{v}_{l,t}$ is a vector in the hyperplane spanned by a subset of the visual representations in the common space, this subset being selected based on the heatmap tensor. An overview of our multi-level multimodal attention mechanism for calculating the attended visual feature can be seen in Figure 4. In the sequel, we describe how we use this attended feature to choose the most representative hyperplane, and calculate a multimodal loss to be minimized with the weak supervision of image-sentence relevance labels.
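The attended visual feature can be computed as a heatmap-weighted average over regions; a sketch (the exact normalization convention here is our assumption):

```python
import numpy as np

def attended_visual_feature(V, H, eps=1e-8):
    """Attended visual feature per level and word (Eq. 2).

    V: (L, N, D) visual tensor; H: (L, N, T) attention heatmap.
    Returns (L, T, D): for each level l and word t, a weighted average of the
    level-l region vectors, with the heatmap values as weights.
    """
    num = np.einsum('lnt,lnd->ltd', H, V)   # heatmap-weighted sum of regions
    den = H.sum(axis=1)[:, :, None] + eps   # total weight per (level, word)
    return num / den
```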

Figure 4: For each word feature $e_t$, we compute an attention map and an attended visual feature $a^{v}_{l,t}$ at each level $l$. We choose the level that maximizes the similarity between the attended visual feature and the textual feature in the common space to produce the pertinence score $R_t$. This is equivalent to finding the hyperplane (spanned by the visual feature vectors of each level in the common space) that best matches the textual feature.

3.3 Feature Level Selection

Once we have the attended visual feature, we calculate the word-image pertinence score at level $l$ using the cosine similarity between each word and its attended visual feature:

$$R_{l,t} = \cos\big(a^{v}_{l,t},\, e_t\big). \qquad (3)$$

Intuitively, each visual feature map level can carry different semantic information; thus, for each word we apply a hard level attention, taking the score from the level contributing the most:

$$R_t = \max_{l}\, R_{l,t}. \qquad (4)$$

This procedure can be seen as projecting the textual embeddings onto the hyperplanes spanned by the visual features of the different levels and choosing the one that maximizes this projection. Intuitively, the chosen hyperplane is a better representation of the visual feature space attended by word $t$. This can be seen in the top central part of Figure 2, where selecting the maximum pertinence score over levels is equivalent to selecting the hyperplane with the smallest angle with the $t$-th word representation (i.e., the highest similarity between the attended visual feature and the textual feature), and thus the most representative hyperplane (or visual feature level).

Once we have the best word-image pertinence score, similar to [51] and inspired by the minimum classification error [20], we compute the overall (word-based) sentence-image pertinence score as:

$$R_w = \log\left(\Big(\sum_{t} \exp(\gamma_1 R_t)\Big)^{1/\gamma_1}\right), \qquad (5)$$

where $\gamma_1$ is a smoothing hyperparameter.
Similarly, for the sentence we repeat the same procedure (except that we no longer need Eq. 5, since there is a single sentence representation) to obtain the sentence attention map, attended visual feature, and sentence-image pertinence score, respectively:

$$a_{l,n,s} = \max\big(0,\ \langle v_{l,n},\, \bar{e} \rangle\big), \quad a^{v}_{l,s} = \frac{\sum_{n} a_{l,n,s}\, v_{l,n}}{\sum_{n} a_{l,n,s}}, \quad R_s = \max_{l}\, \cos\big(a^{v}_{l,s},\, \bar{e}\big). \qquad (6a,\ 6b,\ 6c)$$
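Putting Eqs. 3-5 together, a NumPy sketch of the level selection and the word-based aggregation (the value of the smoothing hyperparameter below is illustrative, not taken from the paper):

```python
import numpy as np

def pertinence_scores(A, E, gamma=5.0, eps=1e-8):
    """Word-image pertinence with hard level selection (Eqs. 3-5).

    A: (L, T, D) attended visual features; E: (T, D) word embeddings.
    gamma: smoothing hyperparameter of the log-sum-exp aggregation
           (illustrative value, not the paper's).
    Returns (R_t, R_w): per-word scores after the max over levels, and the
    word-based sentence-image pertinence score.
    """
    An = A / (np.linalg.norm(A, axis=-1, keepdims=True) + eps)
    En = E / (np.linalg.norm(E, axis=-1, keepdims=True) + eps)
    R_lt = np.einsum('ltd,td->lt', An, En)  # cosine per level and word (Eq. 3)
    R_t = R_lt.max(axis=0)                  # hard level attention (Eq. 4)
    R_w = np.log(np.exp(gamma * R_t).sum()) / gamma  # smooth max over words (Eq. 5)
    return R_t, R_w
```

Note that the log-sum-exp in Eq. 5 acts as a smooth maximum over words, so a single strongly matching word can dominate the sentence-image score.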
3.4 Multimodal Loss

In this paper, we only use weak supervision in the form of binary image-caption relevance. Thus, similar to [10, 16, 51], we train the network on a batch of image-caption pairs and force it to produce high sentence-image pertinence scores for related pairs and low scores for unrelated pairs. Considering a pertinence score $R$ (either $R_w$ or $R_s$), we calculate the posterior probability of the sentence $S_i$ being matched with the image $I_i$ by applying competition among all sentences in the batch:

$$P(S_i \mid I_i) = \frac{\exp\big(\gamma_2 R(I_i, S_i)\big)}{\sum_{j=1}^{B} \exp\big(\gamma_2 R(I_i, S_j)\big)}. \qquad (7)$$

Similarly, the posterior probability of $I_i$ being matched with $S_i$ is calculated as:

$$P(I_i \mid S_i) = \frac{\exp\big(\gamma_2 R(I_i, S_i)\big)}{\sum_{j=1}^{B} \exp\big(\gamma_2 R(I_j, S_i)\big)}. \qquad (8)$$

Then, similarly to [10, 51], we define the loss using the negative log posterior probability over the relevant image-sentence pairs:

$$\mathcal{L}_R = -\sum_{i=1}^{B} \Big( \log P(S_i \mid I_i) + \log P(I_i \mid S_i) \Big). \qquad (9)$$

As we want to train a common semantic space for both words and sentences, we combine the word loss $\mathcal{L}_w$ (computed with the word-based relevance $R_w$) and the sentence loss $\mathcal{L}_s$ (obtained with $R_s$) to define our final loss:

$$\mathcal{L} = \mathcal{L}_w + \mathcal{L}_s. \qquad (10)$$

This loss is minimized over a batch of images along with their related sentences. In preliminary experiments on held-out validation data, we found hyperparameter values that work well, and we keep them fixed for our experiments. In the next section, we evaluate our proposed model on different datasets and present an ablation study justifying the choices made in our model.
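A compact sketch of the batch-wise loss: given a B x B matrix of pertinence scores whose diagonal holds the ground-truth pairs, the two posteriors are row-wise and column-wise softmaxes (the sharpening factor is an assumed placeholder, and the code is ours):

```python
import numpy as np

def multimodal_loss(R, gamma=10.0):
    """Batch-wise matching loss (Eqs. 7-9) for one pertinence score.

    R: (B, B) matrix with R[i, j] the pertinence score between image i and
    sentence j; the diagonal holds the ground-truth pairs.
    gamma: softmax sharpening factor (an assumed placeholder value).
    Returns the summed negative log posterior in both directions.
    """
    logits = gamma * R
    # P(S_i | I_i): competition among the batch's sentences for each image
    log_p_s = logits.diagonal() - np.log(np.exp(logits).sum(axis=1))
    # P(I_i | S_i): competition among the batch's images for each sentence
    log_p_i = logits.diagonal() - np.log(np.exp(logits).sum(axis=0))
    return -(log_p_s.sum() + log_p_i.sum())
```

The final loss of Eq. 10 would then be `multimodal_loss(R_w_matrix) + multimodal_loss(R_s_matrix)`.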

4 Experiments

In this section, we will first present the datasets we use and our experimental setup. We then evaluate our approach comparing with the state-of-the-art, and further present ablation studies showing the influence of each step of our method.

4.1 Datasets

We here describe the publicly available datasets we have used for training and testing our approach.

MSCOCO 2014

[27] consists of 82,783 training images and 40,504 validation images. Each image is associated with five captions describing it. While an image may be associated with multiple bounding box annotations of objects, there is no textual description for these bounding boxes and we do not exploit them in any way. We use the train split of this dataset for training our model.


Flickr30k

[56] is a dataset of approximately 31K images with 150K descriptive captions. There are 25,380 images in the training set, 2,985 images in the validation set, and 2,984 images in the test set. In the Flickr30k Entities dataset [38], each image caption is divided into multiple phrases, each linked to a specific bounding box in the image, for a total of 244K phrases describing localized bounding boxes. We use the validation split of this dataset for validating our model during training (on MSCOCO) and its test split for evaluating our model after training.


Visual Genome

[25] contains 77,398 images in the training set, and a validation and test set of 5,000 images each. Each image includes multiple bounding box annotations, with a region description associated with each bounding box. Similar to [18], we preprocess the text and remove noisy characters and punctuation from the region descriptions. All bounding box annotations without a description, and all descriptions longer than ten words, have been ignored. Certain images were also found to have bounding box annotations exceeding the image size, which we fixed to lie within the image boundaries. Note that we use the bounding boxes only to evaluate our method and not in training, as our method is weakly supervised. Since the type of region description in this dataset differs from MSCOCO and Flickr30k, we separately train our model on the training split of this dataset.


ReferIt

consists of 20,000 images from the IAPR TC-12 dataset [13], along with 99,535 segmented image regions from the SAIAPR-12 dataset [6]. Images are associated with descriptions for the entire image as well as for localized image regions, collected in a two-player game [22] providing approximately 130K isolated entity descriptions. In our work, we have used only the unique descriptions associated with each region, and any region without an associated description has been ignored. We have used a split similar to [15], which contains 9,000 training images, 1,000 validation images, and 10,000 test images. We use the validation split of this dataset for validating our model while training on the Visual Genome (VG) dataset, and its test split for evaluating the model after training (on VG).

4.2 Experimental Setup

We train on batches of image-caption pairs in which each image (caption) is related to exactly one caption (image). Image-caption pairs are sampled randomly with a uniform distribution. We train the network for 20 epochs with the Adam optimizer [24], dividing the learning rate by 2 once at the 10th epoch and again at the 15th epoch. The common-space mapping dimension $D$ and the LeakyReLU negative slope used in the non-linear mappings are kept fixed, and we regularize the mapping weights with $\ell_2$ regularization. Finally, we list the visual feature map levels selected for both VGG16 and PNASNet in Table 1. Both the visual and textual network weights are frozen during training; only the common space mapping weights are trainable. In the ablation study, we use 10 epochs without dividing the learning rate, while the rest of the settings remain the same.

Network Layer
VGG conv4_1
VGG conv4_3
VGG conv5_1
VGG conv5_3
PNASNet Cell 5
PNASNet Cell 7
PNASNet Cell 9
PNASNet Cell 11
Table 1: List of the layers used as our 4 feature map levels for each network.
Method Settings Flickr30K ReferIt VG
Baseline Random 27.24 24.30 11.15
Baseline Center 49.20 30.40 20.55
FCVC [10] VGG 29.03 33.52 14.03
TD [59] Inception-V2 42.40 31.97 19.31
CGVS [40] Inception-V3 50.10 - -
VGLS [47] VGG - - 24.40
SSS [17] VGG 49.10 39.98 30.03
Ours VGG 61.66 60.01 48.76
Ours PNASNet 69.19 61.89 55.16
Table 2: Phrase localization accuracy (pointing game) on Flickr30K, ReferIt and VisualGenome (VG) compared to state of the art.

4.3 Phrase Localization Evaluation

Figure 5: Image-sentence pair from Flickr30K with four queries (colored text) and the corresponding heatmaps and selected max values (stars). Notably, the model shows an understanding of what is being described: here it points only to the man who is pushing his motocross bike up.

As stated in Section 4.1, we use the train split of MSCOCO for training our model and evaluate it on the test split of Flickr30k. Since the queries in ReferIt and Visual Genome are referring expressions, and not parts of a complete sentence as in Flickr30k, we separately train the model on the train split of Visual Genome and evaluate it on the test splits of ReferIt and Visual Genome. For evaluation on Flickr30k, we feed a complete sentence to the model and take a weighted average of the attention heatmaps of the words in each query, with the word-image pertinence scores of Eq. 4 as weights. For ReferIt and Visual Genome, we treat each query as a single sentence and take its sentence-level attention heatmap as the final query pointing heatmap. Once the pointing heatmaps are calculated, we take the max location as the pointing location for the given query and evaluate the model by the pointing game accuracy: $\mathrm{Acc} = \#\mathrm{hits} / (\#\mathrm{hits} + \#\mathrm{misses})$.
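The pointing game metric itself is simple to implement; a sketch (the box format and names are our own conventions):

```python
import numpy as np

def pointing_game_accuracy(heatmaps, gt_boxes):
    """Pointing game: a query counts as a hit when the heatmap argmax falls
    inside the ground-truth box; Acc = #hits / (#hits + #misses).

    heatmaps: list of (H, W) arrays, one per query.
    gt_boxes: list of (x0, y0, x1, y1) boxes, inclusive pixel coordinates.
    """
    hits = 0
    for hm, (x0, y0, x1, y1) in zip(heatmaps, gt_boxes):
        y, x = np.unravel_index(np.argmax(hm), hm.shape)  # max (pointing) location
        if x0 <= x <= x1 and y0 <= y <= y1:
            hits += 1
    return hits / len(heatmaps)
```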

Results for the pointing game accuracy can be found in Table 2 for the Flickr30k, ReferIt, and Visual Genome datasets. The results show that our method significantly outperforms all state-of-the-art methods in all conditions and on all datasets. For fair comparison, we used VGG16 as in [10, 17], and the model still gives an absolute pointing game accuracy improvement over the best previous VGG-based results of 12.56% on Flickr30k, 20.03% on ReferIt, and 18.73% on VisualGenome, corresponding to relative improvements of 25.6%, 50.1%, and 62.4%, respectively. Results with the more recent PNASNet model are even better, especially on Flickr30K and VisualGenome. In the next section, we break our method into its different parts to study the efficacy of each of our choices, and elaborate on the parts of the model contributing most to these results.

4.4 Ablation Study

We trained multiple configurations of our approach on MSCOCO and evaluated them on Flickr30K, with a PNASNet visual model, to better understand which aspects of our method affect the performance positively or negatively. We report these results in Table 3, sorted by performance to show the most successful combinations.

We specifically evaluated: the efficacy of using multi-level feature maps and level selection; the influence of using softmax on the attention maps; the use of ELMo for text embedding versus the commonly used approach of training a Bi-LSTM; the use of a linear or non-linear mapping into the common space for the text and visual features (NLT and NLV); and finally the choice of the visual layer (M: middle layer, L: last layer, ML: multi-level feature maps) compared to the word and sentence embeddings (WL and SL) when level attention is not used. We used Cell 7 as the middle layer and Cell 11 as the last layer (to be compared with the word and sentence embeddings in Eq. 1 and Eq. 6a, respectively).

Comparing the results in the table, we see that the level-attention mechanism based on multi-level feature maps significantly improves the performance over separate visual-textual feature comparison. We also see that the non-linear mappings in our model are crucial: replacing either of them with a linear one significantly degrades the performance. The non-linear mapping appears more important on the visual side, but the best results are obtained with non-linear mappings on both the text and visual sides. We further find that applying softmax on the heatmaps has a very negative effect on the performance of the model. This makes sense since, as elaborated in Section 3.2, this commonly used approach forces the heatmap to follow an unnecessary distribution over either words or regions. Finally, the results show the importance of a strong contextualized text embedding: when we replace the pre-trained BiLSTMs of the ELMo model with a trainable BiLSTM (on top of the ELMo word embeddings) and feed the BiLSTM outputs directly to the attention model, the performance again drops significantly. It is worth mentioning that we conducted the same experiments with a different visual network (Inception-V4 [43]) and observed the same trend for the baseline choices.

Figure 6: Some image-sentence pairs from Flickr30K, with two queries (colored text) and corresponding heatmaps and selected max value (stars).
Figure 7: Some failure cases of our model, where it makes semantic mistakes when pointing to regions.
 #   WL   SL   Acc.
 1   ML   ML   67.73
 2   M    L    62.67
 3   M    L    58.40
 4   M    L    56.92
 5   M    L    56.42
 6   M    L    54.75
 7   M    L    47.20
 8   M    L    44.83
Table 3: Ablation study results on Flickr30K using PNASNet. SA: Softmax Attention; NLT: Non-Linear Text mapping; NLV: Non-Linear Visual mapping; WL: Word-Layer; SL: Sentence-Layer; Acc.: pointing accuracy.

4.5 Qualitative results

Figures 5, 6, and 7 show example heatmaps generated for queries from the Flickr30K dataset. Specifically, we upsample the heatmaps from their original resolution to the original image size by bilinear interpolation. We observe that the max (pointing) location in each heatmap points to the correct location in the image, and the heatmaps often capture the relevant part of the image for each query. The model can deal with persons, context, and objects even when they are described with very specific words (e.g., "bronco"), which shows the power of using a character-based contextualized text embedding. Finally, Figure 7 shows some localization failures involving concepts that are semantically close, or challenging capture conditions. For example, the picture frames are mistakenly pointed to for the query "window", which is overexposed.

5 Conclusion

In this paper, we present a weakly supervised method for phrase localization that relies on a multi-level attention mechanism on top of multi-level visual semantic features and contextualized text embeddings. We non-linearly map both the contextualized text embeddings and the multi-level visual semantic features into a common space and compute a multi-level attention map to choose the best representative visual semantic level for the text and for each word in it. We show that this combination sets a new state-of-the-art performance and provide quantitative results demonstrating the importance of (1) a proper common-space mapping, (2) strong contextualized text embeddings, and (3) the freedom of each word to choose the appropriate visual semantic level. Future work lies in studying other applications such as Visual Question Answering and Image Captioning.


  • [1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, volume 3, page 6, 2018.
  • [2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV), 2015.
  • [3] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson. One billion word benchmark for measuring progress in statistical language modeling. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
  • [4] K. Chen, J. Gao, and R. Nevatia. Knowledge aided consistency for weakly supervised phrase grounding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [5] K. Chen, R. Kovvuri, J. Gao, and R. Nevatia. Msrc: Multimodal spatial regression with semantic context for phrase grounding. International Journal of Multimedia Information Retrieval, 7(1):17–28, 2018.
  • [6] K. Chen, R. Kovvuri, and R. Nevatia. Query-guided regression network with context policy for phrase grounding. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
  • [7] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6298–6306. IEEE, 2017.
  • [8] C. Deng, Q. Wu, Q. Wu, F. Hu, F. Lyu, and M. Tan. Visual grounding via accumulated attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7746–7755, 2018.
  • [9] M. Engilberge, L. Chevallier, P. Pérez, and M. Cord. Finding beans in burgers: Deep semantic-visual embedding with localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3984–3993, 2018.
  • [10] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1473–1482, 2015.
  • [11] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2016.
  • [12] A. Gordo, J. Almazán, J. Revaud, and D. Larlus. Deep image retrieval: Learning global representations for image search. In European Conference on Computer Vision, pages 241–257. Springer, 2016.
  • [13] M. Grubinger, P. Clough, H. Müller, and T. Deselaers. The iapr tc-12 benchmark: A new evaluation resource for visual information systems. In Int. Workshop OntoImage, volume 5, 2006.
  • [14] L. A. Hendricks, R. Hu, T. Darrell, and Z. Akata. Grounding visual explanations. In European Conference on Computer Vision. Springer, 2018.
  • [15] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4555–4564, 2016.
  • [16] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, pages 2333–2338. ACM, 2013.
  • [17] S. A. Javed, S. Saxena, and V. Gandhi. Learning unsupervised visual grounding through semantic self-supervision. arXiv preprint arXiv:1803.06506, 2018.
  • [18] J. Johnson, A. Karpathy, and L. Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4565–4574, 2016.
  • [19] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
  • [20] B.-H. Juang, W. Hou, and C.-H. Lee. Minimum classification error rate methods for speech recognition. IEEE Transactions on Speech and Audio processing, 5(3):257–265, 1997.
  • [21] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015.
  • [22] S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014.
  • [23] M. Khademi and O. Schulte. Image caption generation with hierarchical contextual visual spatial attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1943–1951, 2018.
  • [24] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [25] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
  • [26] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017.
  • [27] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [28] C. Liu, B. Zoph, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. arXiv preprint arXiv:1712.00559, 2017.
  • [29] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems, pages 289–297, 2016.
  • [30] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing. Citeseer, 2013.
  • [31] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016.
  • [32] T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, 2013.
  • [33] D.-K. Nguyen and T. Okatani. Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [34] A. Odena, V. Dumoulin, and C. Olah. Deconvolution and checkerboard artifacts. Distill, 2016.
  • [35] J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
  • [36] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 2227–2237, 2018.
  • [37] B. A. Plummer, A. Mallya, C. M. Cervantes, J. Hockenmaier, and S. Lazebnik. Phrase localization and visual relationship detection with comprehensive image-language cues. In Proc. ICCV, 2017.
  • [38] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015.
  • [39] F. Radenović, G. Tolias, and O. Chum. Cnn image retrieval learns from bow: Unsupervised fine-tuning with hard examples. In European conference on computer vision, pages 3–20. Springer, 2016.
  • [40] V. Ramanishka, A. Das, J. Zhang, and K. Saenko. Top-down visual saliency guided by captions. In IEEE International Conference on Computer Vision and Pattern Recognition, 2017.
  • [41] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision, pages 817–834. Springer, 2016.
  • [42] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
  • [43] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
  • [44] J. Thomason, J. Sinapov, and R. Mooney. Guiding interaction behaviors for multi-modal grounded language learning. In Proceedings of the First Workshop on Language Grounding for Robotics, pages 20–24, 2017.
  • [45] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International journal of computer vision, 104(2):154–171, 2013.
  • [46] M. Wang, M. Azab, N. Kojima, R. Mihalcea, and J. Deng. Structured matching for phrase localization. In European Conference on Computer Vision, pages 696–711. Springer, 2016.
  • [47] F. Xiao, L. Sigal, and Y. J. Lee. Weakly-supervised visual grounding of phrases with linguistic structures. In IEEE International Conference on Computer Vision and Pattern Recognition, 2017.
  • [48] C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. In International conference on machine learning, pages 2397–2406, 2016.
  • [49] H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, pages 451–466. Springer, 2016.
  • [50] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.
  • [51] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. arXiv preprint, 2018.
  • [52] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29, 2016.
  • [53] R. Yeh, J. Xiong, W.-M. Hwu, M. Do, and A. Schwing. Interpretable and globally optimal prediction for textual grounding using image concepts. In Advances in Neural Information Processing Systems, pages 1912–1922, 2017.
  • [54] R. A. Yeh, M. N. Do, and A. G. Schwing. Unsupervised textual grounding: Linking words to image concepts. In Proc. CVPR, volume 8, 2018.
  • [55] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.
  • [56] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
  • [57] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. Mattnet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [58] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
  • [59] J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff. Top-down neural attention by excitation backprop. International Journal of Computer Vision, 126(10):1084–1102, 2018.
  • [60] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pages 649–657, 2015.
  • [61] F. Zhao, J. Li, J. Zhao, and J. Feng. Weakly supervised phrase localization with multi-scale anchored transformer network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [62] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In European conference on computer vision, pages 391–405. Springer, 2014.