CK-Transformer: Commonsense Knowledge Enhanced Transformers for Referring Expression Comprehension

02/17/2023
by Zhi Zhang et al.

The task of multimodal referring expression comprehension (REC), which aims to localize an image region described by a natural language expression, has recently received increasing attention within the research community. In this paper, we specifically focus on referring expression comprehension with commonsense knowledge (KB-Ref), a task which typically requires reasoning beyond spatial, visual or semantic information. We propose a novel framework, Commonsense Knowledge Enhanced Transformers (CK-Transformer), which effectively integrates commonsense knowledge into the representations of objects in an image, facilitating identification of the target objects referred to by the expressions. We conduct extensive experiments on several benchmarks for the task of KB-Ref. Our results show that the proposed CK-Transformer achieves a new state of the art, with an absolute improvement of 3.14 over the existing state of the art.
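The abstract describes integrating commonsense knowledge into object representations so that objects can be scored against a referring expression. A minimal sketch of one plausible fusion mechanism is shown below; it is not the paper's actual architecture. The dimensions, the residual cross-attention fusion, and the dot-product scoring against a pooled expression embedding are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def knowledge_fusion(obj_feats, fact_feats):
    """Hypothetical fusion step: each object representation attends over
    retrieved commonsense-fact embeddings and is enriched with the result
    via a residual connection (a single cross-attention head, no learned
    projections, for illustration only)."""
    d = obj_feats.shape[1]
    scores = obj_feats @ fact_feats.T / np.sqrt(d)   # (num_objs, num_facts)
    attn = softmax(scores, axis=-1)
    return obj_feats + attn @ fact_feats             # residual fusion

# Toy inputs standing in for detector features, retrieved knowledge
# embeddings, and a pooled expression embedding.
num_objs, num_facts, d = 5, 8, 16
obj_feats = rng.standard_normal((num_objs, d))
fact_feats = rng.standard_normal((num_facts, d))
expr_feat = rng.standard_normal(d)

fused = knowledge_fusion(obj_feats, fact_feats)
logits = fused @ expr_feat        # score each enriched object vs. the expression
target = int(np.argmax(logits))   # predicted referent index
print(target)
```

In a full model the fusion and scoring would of course be learned transformer layers rather than raw dot products; the sketch only conveys the idea of enriching object features with external knowledge before matching them to the expression.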


Related research:

- 06/02/2020 · Give Me Something to Eat: Referring Expression Comprehension with Commonsense Knowledge
  Conventional referring expression comprehension (REF) assumes people to ...

- 12/16/2021 · SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning
  Answering complex questions about images is an ambitious goal for machin...

- 04/01/2019 · Ranking and Selecting Multi-Hop Knowledge Paths to Better Predict Human Needs
  To make machines better understand sentiments, research needs to move fr...

- 07/31/2022 · One for All: One-stage Referring Expression Comprehension with Dynamic Reasoning
  Referring Expression Comprehension (REC) is one of the most important ta...

- 03/01/2020 · Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension
  Referring expression comprehension (REF) aims at identifying a particula...

- 12/16/2021 · KAT: A Knowledge Augmented Transformer for Vision-and-Language
  The primary focus of recent work with largescale transformers has been o...

- 07/08/2019 · Variational Context: Exploiting Visual and Textual Context for Grounding Referring Expressions
  We focus on grounding (i.e., localizing or linking) referring expression...
