Contextual Grounding of Natural Language Entities in Images

11/05/2019
by Farley Lai, et al.

In this paper, we introduce a contextual grounding approach that captures the context in corresponding text entities and image regions to improve grounding accuracy. Specifically, the proposed architecture accepts pre-trained text token embeddings and image object features from an off-the-shelf object detector as input. Additional encoding that captures positional and spatial information can be added to enhance feature quality. Separate text and image branches facilitate respective architectural refinements for the different modalities. The text branch is pre-trained on a large-scale masked language modeling task, while the image branch is trained from scratch. Next, the model learns the contextual representations of the text tokens and image objects through layers of high-order interaction in each branch. The final grounding head ranks the correspondence between the textual and visual representations through cross-modal interaction. In the evaluation, we show that our model achieves a state-of-the-art grounding accuracy of 71.36% on the Flickr30K Entities dataset. No additional pre-training is necessary to deliver competitive results compared with related work that often requires task-agnostic and task-specific pre-training on cross-modal datasets. The implementation is publicly available at https://gitlab.com/necla-ml/grounding.
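As a rough illustration of the architecture the abstract describes, the sketch below wires pre-trained token embeddings and detector object features through separate text and image branches, adds a spatial (bounding-box) encoding, and ranks text-object correspondence with a grounding head. This is a minimal sketch under stated assumptions: the module names, dimensions (2048-d RoI features, 768-d token embeddings), the Transformer-based image encoder, and the scaled dot-product ranking head are all illustrative choices, not the authors' exact implementation (see the linked repository for that).

```python
# Minimal sketch of the two-branch contextual grounding architecture.
# All names, dimensions, and the dot-product ranking head are
# illustrative assumptions, not the authors' exact implementation.
import torch
import torch.nn as nn


class ContextualGrounding(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=3):
        super().__init__()
        # Image branch: trained from scratch on detector object features
        # augmented with spatial (bounding-box) encodings.
        self.box_proj = nn.Linear(4, d_model)      # (x1, y1, x2, y2), normalized
        self.obj_proj = nn.Linear(2048, d_model)   # e.g., Faster R-CNN RoI features
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.image_encoder = nn.TransformerEncoder(layer, n_layers)
        # Text branch: stands in for a pre-trained masked language model
        # (e.g., BERT); here just a projection over given token embeddings.
        self.txt_proj = nn.Linear(768, d_model)

    def forward(self, token_emb, obj_feats, obj_boxes):
        # Contextualize image objects through self-attention layers.
        v = self.image_encoder(self.obj_proj(obj_feats) + self.box_proj(obj_boxes))
        t = self.txt_proj(token_emb)
        # Grounding head: rank text-object correspondence by scaled dot product.
        scores = torch.einsum('btd,bvd->btv', t, v) / t.size(-1) ** 0.5
        return scores  # (batch, tokens, objects)


model = ContextualGrounding()
scores = model(torch.randn(2, 12, 768),   # pre-trained token embeddings
               torch.randn(2, 36, 2048),  # detector object features
               torch.rand(2, 36, 4))      # normalized box coordinates
print(scores.argmax(-1).shape)            # grounded object index per token: (2, 12)
```

At inference, each entity token (or a phrase's representative token) would be grounded to its highest-scoring object region, matching the ranking role of the grounding head described in the abstract.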

