QR-CLIP: Introducing Explicit Open-World Knowledge for Location and Time Reasoning

02/02/2023
by Weimin Shi, et al.

Everyday images can carry abstract meanings that require us to recall background knowledge and infer deeper information from them. To encourage such human-like reasoning, in this work we teach machines to predict where and when an image was taken, rather than performing basic tasks like traditional segmentation or classification. Inspired by Horn's QR theory, we design a novel QR-CLIP model consisting of two components: 1) the Quantity module first retrieves more open-world knowledge as the candidate language inputs; 2) the Relevance module then carefully weighs the vision and language cues and infers the location and time. Experiments show QR-CLIP's effectiveness: it outperforms the previous SOTA on each task by an average of about 10% on location and time reasoning. This study lays a technical foundation for location and time reasoning and suggests that effectively introducing open-world knowledge is one of the panaceas for these tasks.
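To make the two-stage idea in the abstract concrete, here is a minimal, hypothetical sketch built on the public Hugging Face CLIP API rather than the authors' released code. The knowledge sentences, label sets, top-k cutoff, and softmax-weighted fusion are illustrative assumptions, not the paper's actual Quantity and Relevance modules: stage 1 gathers candidate open-world knowledge and scores it against the image, stage 2 re-weights the retained candidates and uses them to pick location and time labels.

```python
# Hypothetical sketch of a retrieve-then-weigh pipeline in the spirit of QR-CLIP.
# Not the authors' implementation; candidate texts and labels are toy placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Stage 1 ("Quantity"): collect a broad pool of open-world knowledge sentences.
# Here a tiny hand-written list; in practice these would be retrieved, e.g. from Wikipedia.
knowledge_pool = [
    "The Berlin Wall fell in November 1989 in Germany.",
    "The Eiffel Tower in Paris was completed in 1889.",
    "The 2008 Summer Olympics were held in Beijing, China.",
]

image = Image.new("RGB", (224, 224))  # placeholder; load a real photograph in practice

inputs = processor(text=knowledge_pool, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
scores = outputs.logits_per_image.squeeze(0)  # image-text similarity per candidate
topk = scores.topk(k=2)                       # keep the most promising knowledge

# Stage 2 ("Relevance"): weight the retained candidates by their similarity to the
# image, then use them to score location / time labels (again a toy label set).
weights = torch.softmax(topk.values, dim=0)
location_labels = ["Germany", "France", "China"]
time_labels = ["1880s", "1980s", "2000s"]

def score_labels(labels):
    # Compare each retained knowledge sentence with each label in CLIP's text space,
    # then aggregate the similarities with the relevance weights.
    kept_text = [knowledge_pool[i] for i in topk.indices]
    with torch.no_grad():
        cand = model.get_text_features(
            **processor(text=kept_text, return_tensors="pt", padding=True))
        lab = model.get_text_features(
            **processor(text=labels, return_tensors="pt", padding=True))
    cand = cand / cand.norm(dim=-1, keepdim=True)
    lab = lab / lab.norm(dim=-1, keepdim=True)
    sim = cand @ lab.T                        # (k, num_labels)
    return (weights.unsqueeze(1) * sim).sum(dim=0)

print("location:", location_labels[score_labels(location_labels).argmax()])
print("time:", time_labels[score_labels(time_labels).argmax()])
```

The split mirrors the abstract's framing: a recall-heavy first stage that favors breadth of candidate knowledge, followed by a second stage that trades breadth for relevance before committing to a location and time prediction.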
