Learning Human-Human Interactions in Images from Weak Textual Supervision

04/27/2023
by Morris Alper, et al.

Interactions between humans are diverse and context-dependent, but previous works have treated them as categorical, disregarding the heavy tail of possible interactions. We propose a new paradigm of learning human-human interactions as free text from a single still image, allowing for flexibility in modeling the unlimited space of situations and relationships between people. To overcome the absence of data labelled specifically for this task, we use knowledge distillation applied to synthetic caption data produced by a large language model without explicit supervision. We show that the pseudo-labels produced by this procedure can be used to train a captioning model to effectively understand human-human interactions in images, as assessed by a variety of metrics that measure the textual and semantic faithfulness and factual groundedness of our predictions. We further show that our approach outperforms state-of-the-art (SOTA) image captioning and situation recognition models on this task. We will release our code and pseudo-labels along with Waldo and Wenda, a manually curated test set for still image human-human interaction understanding.


