ClipSitu: Effectively Leveraging CLIP for Conditional Predictions in Situation Recognition

07/02/2023
by   Debaditya Roy, et al.

Situation recognition is the task of generating a structured summary of what is happening in an image using an activity verb and the semantic roles played by actors and objects. In this task, the same activity verb can describe a diverse set of situations, and the same actor or object category can play a diverse set of semantic roles depending on the situation depicted in the image. Hence, a model needs to understand both the context of the image and the visual-linguistic meaning of semantic roles. We therefore leverage the CLIP foundation model, which has learned the context of images via language descriptions. We show that deeper and wider multi-layer perceptron (MLP) blocks operating on CLIP image and text embedding features obtain noteworthy results on situation recognition and even outperform the state-of-the-art CoFormer, a Transformer-based model, thanks to the external implicit visual-linguistic knowledge encapsulated by CLIP and the expressive power of modern MLP block designs. Motivated by this, we design a cross-attention-based Transformer that uses CLIP visual tokens to model the relation between textual roles and visual entities. This cross-attention-based Transformer, known as ClipSitu XTF, outperforms the existing state of the art by a large margin of 14.1 in top-1 accuracy on semantic role labelling (value) on the imSitu dataset. We will make the code publicly available.
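
The cross-attention idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the ClipSitu XTF implementation: it assumes that role text embeddings act as queries and CLIP visual tokens act as keys and values, uses a single head with no learned projections, and replaces real CLIP outputs with random vectors. The function names and the dimensions are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(role_queries, patch_tokens):
    """Let each role query attend over all visual tokens.

    Assumption: single head, scaled dot-product scores, no learned
    query/key/value projections (a real model would learn these).
    """
    dim = len(patch_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in role_queries:
        # Similarity between this role and every visual token.
        scores = [scale * sum(q * k for q, k in zip(query, token))
                  for token in patch_tokens]
        attn = softmax(scores)
        # Attention-weighted average of the visual tokens.
        pooled = [sum(a * token[d] for a, token in zip(attn, patch_tokens))
                  for d in range(dim)]
        outputs.append(pooled)
        weights.append(attn)
    return outputs, weights


# Random stand-ins for CLIP features; sizes are illustrative only.
rng = random.Random(0)
dim, num_patches, num_roles = 8, 5, 3
patch_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
role_queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_roles)]

role_features, attn = cross_attention(role_queries, patch_tokens)
```

Each row of `attn` is a distribution over visual tokens for one role, and each entry of `role_features` is the corresponding pooled visual feature. In a full model, a classifier over such features would predict the value filling each role.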

research
03/26/2020

Grounded Situation Recognition

We introduce Grounded Situation Recognition (GSR), a task that requires ...
research
12/03/2016

Commonly Uncommon: Semantic Sparsity in Situation Recognition

Semantic sparsity is a common challenge in structured visual classificat...
research
08/14/2017

Situation Recognition with Graph Neural Networks

We address the problem of recognizing situations in images. Given an ima...
research
12/10/2021

Rethinking the Two-Stage Framework for Grounded Situation Recognition

Grounded Situation Recognition (GSR), i.e., recognizing the salient acti...
research
12/22/2020

Multi-Head Self-Attention with Role-Guided Masks

The state of the art in learning meaningful semantic representations of ...
research
12/08/2022

VASR: Visual Analogies of Situation Recognition

A core process in human cognition is analogical mapping: the ability to ...
research
03/30/2022

Collaborative Transformers for Grounded Situation Recognition

Grounded situation recognition is the task of predicting the main activi...
