Cross-Modality Time-Variant Relation Learning for Generating Dynamic Scene Graphs

05/15/2023
by Jingyi Wang, et al.

Dynamic scene graphs generated from video clips can enhance semantic visual understanding in a wide range of challenging tasks, such as environmental perception, autonomous navigation, and task planning for self-driving vehicles and mobile robots. In the temporal and spatial modeling involved in dynamic scene graph generation, learning the time-variant relations among frames is particularly difficult. In this paper, we propose a Time-variant Relation-aware TRansformer (TR^2), which aims to model the temporal change of relations in dynamic scene graphs. Explicitly, we use the difference between text embeddings of prompted sentences about relation labels as the supervision signal for relations. In this way, cross-modality feature guidance is realized for learning time-variant relations. Implicitly, we design a relation feature fusion module with a transformer and an additional message token that describes the difference between adjacent frames. Extensive experiments on the Action Genome dataset show that TR^2 can effectively model time-variant relations. TR^2 significantly outperforms previous state-of-the-art methods under two different settings by 2.1
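The explicit supervision idea in the abstract (using the difference between text embeddings of prompted relation sentences as a target for the change in relation features between adjacent frames) can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt template, the `toy_text_embedding` stand-in for a real text encoder, the mean-squared-error loss, and the assumption that visual relation features are already projected into the text embedding space are all hypothetical choices made for illustration.

```python
import hashlib

import numpy as np


def toy_text_embedding(sentence: str, dim: int = 16) -> np.ndarray:
    """Hypothetical stand-in for a pretrained text encoder.

    Maps a sentence to a deterministic unit-norm vector by seeding a
    random generator from a hash of the sentence.
    """
    digest = hashlib.sha256(sentence.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:4], "little")
    vec = np.random.default_rng(seed).standard_normal(dim)
    return vec / np.linalg.norm(vec)


def prompt(subject: str, relation: str, obj: str) -> str:
    """Assumed prompt template; the abstract does not specify one."""
    return f"a photo of a {subject} {relation} a {obj}"


def difference_alignment_loss(
    vis_prev: np.ndarray,
    vis_next: np.ndarray,
    txt_prev: np.ndarray,
    txt_next: np.ndarray,
) -> float:
    """MSE between the change in visual relation features and the
    change in text embeddings across two adjacent frames.

    Assumes visual features live in the same space as text embeddings.
    """
    visual_change = vis_next - vis_prev
    text_change = txt_next - txt_prev
    return float(np.mean((visual_change - text_change) ** 2))


# Relation labels for the same subject-object pair in two adjacent frames.
txt_t0 = toy_text_embedding(prompt("person", "holding", "cup"))
txt_t1 = toy_text_embedding(prompt("person", "drinking from", "cup"))

# Visual features that change exactly as the text does incur zero loss;
# visual features that stay static while the relation changes are penalized.
loss_aligned = difference_alignment_loss(txt_t0, txt_t1, txt_t0, txt_t1)
loss_static = difference_alignment_loss(txt_t0, txt_t0, txt_t0, txt_t1)
```

Under these assumptions, the loss is zero when the relation label does not change and the visual features stay the same, so the text-side difference acts as a signal for when and how the relation representation should move between frames.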

research
12/18/2021

Exploiting Long-Term Dependencies for Generating Dynamic Scene Graphs

Structured video representation in the form of dynamic scene graphs is a...
research
07/26/2021

Spatial-Temporal Transformer for Dynamic Scene Graph Generation

Dynamic scene graph generation aims at generating a scene graph of the g...
research
11/11/2022

SSGVS: Semantic Scene Graph-to-Video Synthesis

As a natural extension of the image synthesis task, video synthesis has ...
research
08/10/2023

Local-Global Information Interaction Debiasing for Dynamic Scene Graph Generation

The task of dynamic scene graph generation (DynSGG) aims to generate sce...
research
12/14/2021

Temporal Transformer Networks with Self-Supervision for Action Recognition

In recent years, 2D Convolutional Networks-based video action recognitio...
research
09/19/2016

On Support Relations and Semantic Scene Graphs

Scene understanding is a popular and challenging topic in both computer ...
research
10/17/2022

Real-Time Driver Monitoring Systems through Modality and View Analysis

Driver distractions are known to be the dominant cause of road accidents...
