Contextual Explainable Video Representation: Human Perception-based Understanding

12/12/2022
by Khoa Vo, et al.

Video understanding is a growing field and a subject of intense research, encompassing many tasks that require understanding both spatial and temporal information, e.g., action detection, action recognition, video captioning, and video retrieval. One of the most challenging problems in video understanding is feature extraction, i.e., extracting a contextual visual representation from a given untrimmed video, which is difficult because of the long and complicated temporal structure of unconstrained videos. Unlike existing approaches, which apply a pre-trained backbone network as a black box to extract visual representations, our approach aims to extract the most contextual information with an explainable mechanism. As we observe, humans typically perceive a video through the interactions between three main factors: the actors, the relevant objects, and the surrounding environment. It is therefore crucial to design a contextual, explainable video representation extraction method that can capture each of these factors and model the relationships between them. In this paper, we discuss approaches that incorporate the human perception process into modeling actors, objects, and the environment. We choose video paragraph captioning and temporal action detection to illustrate the effectiveness of human perception-based contextual representation in video understanding. Source code is publicly available at https://github.com/UARK-AICV/Video_Representation.
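To make the three-factor idea concrete, here is a minimal sketch (not the authors' implementation; all names, feature dimensions, and the residual-fusion design are illustrative assumptions) of how actor features might attend over object and environment features with scaled dot-product attention to produce a contextual representation per actor:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_factors(actors, objects, environment):
    """Hypothetical fusion of the three perception factors.

    Each actor feature (shape (Na, d)) attends over the concatenated
    object (No, d) and environment (Ne, d) features via scaled
    dot-product attention, then a residual connection keeps the
    actor's own identity in the output."""
    d = actors.shape[-1]
    context = np.concatenate([objects, environment], axis=0)  # (No+Ne, d)
    scores = actors @ context.T / np.sqrt(d)                  # (Na, No+Ne)
    weights = softmax(scores, axis=-1)                        # rows sum to 1
    attended = weights @ context                              # (Na, d)
    return actors + attended

rng = np.random.default_rng(0)
actors = rng.normal(size=(2, 8))       # e.g., 2 detected actors
objs = rng.normal(size=(3, 8))         # e.g., 3 relevant objects
env = rng.normal(size=(4, 8))          # e.g., 4 environment snippets
fused = fuse_factors(actors, objs, env)
print(fused.shape)  # (2, 8): one contextual representation per actor
```

The attention weights expose which objects or environment regions each actor relied on, which is one plausible route to the explainability the abstract describes.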


