
Grounded Situation Recognition

by Sarah Pratt, et al.
Allen Institute for Artificial Intelligence

We introduce Grounded Situation Recognition (GSR), a task that requires producing structured semantic summaries of images describing the primary activity, the entities engaged in the activity with their roles (e.g. agent, tool), and bounding-box groundings of those entities. GSR presents important technical challenges: identifying semantic saliency, categorizing and localizing a large and diverse set of entities, overcoming semantic sparsity, and disambiguating roles. Moreover, unlike captioning, GSR is straightforward to evaluate. To study this new task, we create the Situations With Groundings (SWiG) dataset, which adds 278,336 bounding-box groundings to the 11,538 entity classes in the imSitu dataset. We propose a Joint Situation Localizer and find that jointly predicting situations and groundings with end-to-end training handily outperforms independent training on the entire grounding metric suite, with relative gains between 8 [...]. [...] exciting future directions enabled by our models: conditional querying, visual chaining, and grounded, semantic-aware image retrieval. Code and data available at
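To make the structured summary described in the abstract concrete, the following is a minimal sketch of how one GSR output might be represented. The class names, field names, box coordinate convention, and the example values are all illustrative assumptions; they are not taken from the SWiG release or the authors' code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Assumed box convention (hypothetical): (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class GroundedEntity:
    """An entity filling one role, with an optional bounding-box grounding."""
    noun: str
    box: Optional[Box] = None  # None when the entity has no grounding in the image


@dataclass
class GroundedSituation:
    """The primary activity plus a mapping from role name to grounded entity."""
    verb: str
    roles: Dict[str, GroundedEntity] = field(default_factory=dict)

    def grounded_roles(self) -> List[str]:
        """Return the names of roles whose entity has a bounding box, sorted."""
        return sorted(name for name, ent in self.roles.items() if ent.box is not None)


# Illustrative example only; the verb, roles, and boxes are made up.
example = GroundedSituation(
    verb="cutting",
    roles={
        "agent": GroundedEntity("person", (10.0, 20.0, 200.0, 300.0)),
        "tool": GroundedEntity("knife", (120.0, 150.0, 180.0, 190.0)),
        "place": GroundedEntity("kitchen"),  # no box: not localized
    },
)
```

Keeping the box optional reflects that a role can be named without being localized; whether and how the dataset encodes that case is not stated in the abstract.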





Code Repositories


Situation With Groundings (SWiG) dataset and Joint Situation Localizer (JSL)
