Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos

03/26/2022
by Muheng Li, et al.

Action recognition models have shown a promising capability to classify human actions in short video clips. In real scenarios, however, multiple correlated human actions commonly occur in particular orders, forming semantically meaningful activities. Conventional action recognition approaches focus on analyzing single actions, but they fail to fully reason about the contextual relations between adjacent actions, which provide potential temporal logic for understanding long videos. In this paper, we propose a prompt-based framework, Bridge-Prompt (Br-Prompt), to model the semantics across adjacent actions, so that it simultaneously exploits both out-of-context and contextual information from a series of ordinal actions in instructional videos. More specifically, we reformulate the individual action labels as integrated text prompts for supervision, which bridge the gap between individual action semantics. The generated text prompts are paired with the corresponding video clips, and the pairs are used to co-train the text encoder and the video encoder via a contrastive approach. The learned vision encoder has a stronger capability for ordinal-action-related downstream tasks, e.g., action segmentation and human activity recognition. We evaluate our approach on several video datasets: Georgia Tech Egocentric Activities (GTEA), 50Salads, and the Breakfast dataset. Br-Prompt achieves state-of-the-art results on multiple benchmarks. Code is available at https://github.com/ttlmh/Bridge-Prompt.
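
The contrastive co-training described above follows a CLIP-style recipe: each video clip is encoded into a feature vector, its ordinal text prompt is encoded by the text encoder, and matching clip/prompt pairs are pulled together while mismatched pairs are pushed apart. The PyTorch sketch below is only an illustration of that idea, not the paper's exact objective; the tensor shapes, the temperature value, and the prompt wording in the docstring are assumptions.

import torch
import torch.nn.functional as F

def contrastive_prompt_loss(video_features, text_features, temperature=0.07):
    """Symmetric InfoNCE-style loss between batched video-clip features and
    the features of their paired ordinal text prompts (illustrative sketch).

    video_features: (B, D) output of a video encoder (assumed shape)
    text_features:  (B, D) output of a text encoder fed with prompts such as
                    "First, take the knife. Then, cut the tomato." (assumed wording)
    """
    # L2-normalize so the dot product becomes a cosine similarity
    video_features = F.normalize(video_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity logits; matching clip/prompt pairs lie on the diagonal
    logits = video_features @ text_features.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: video-to-text and text-to-video
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)

In practice such a loss would be computed over mini-batches of clip/prompt pairs while both encoders are updated jointly; the fine-tuned vision encoder can then be reused for downstream tasks such as action segmentation.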
