ViLP: Knowledge Exploration using Vision, Language, and Pose Embeddings for Video Action Recognition

08/07/2023
by   Soumyabrata Chaudhuri, et al.

Video Action Recognition (VAR) is a challenging task due to its inherent complexities. Though different approaches have been explored in the literature, designing a unified framework to recognize a large number of human actions remains a challenging problem. Recently, Multi-Modal Learning (MML) has demonstrated promising results in this domain. In the literature, the 2D skeleton or pose modality has often been used for this task, either independently or in conjunction with the visual information (RGB modality) present in videos. However, the combination of pose, visual information, and text attributes has not been explored yet, even though text and pose attributes have independently been shown to be effective in numerous computer vision tasks. In this paper, we present the first pose-augmented Vision-Language Model (VLM) for VAR. Notably, our scheme achieves an accuracy of 92.81% on the action recognition benchmark datasets UCF-101 and HMDB-51, even without any video data pre-training, and an accuracy of 96.11% after Kinetics pre-training.
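At a high level, such a pose-augmented VLM combines per-video embeddings from a visual encoder, a text encoder, and a pose (skeleton) encoder in a shared space before classification. The snippet below is a minimal illustrative sketch of this kind of three-modality late fusion, not the ViLP architecture itself; the module names, feature dimensions, and the averaging-based fusion rule are assumptions made for the example (the class count of 101 follows UCF-101).

# Minimal sketch (not the authors' implementation): late fusion of vision,
# text, and pose embeddings for action recognition, in the spirit of a
# pose-augmented vision-language model. Encoder choices, dimensions, and the
# fusion rule are illustrative assumptions.
import torch
import torch.nn as nn

class TriModalFusion(nn.Module):
    def __init__(self, vis_dim=512, txt_dim=512, pose_dim=256,
                 embed_dim=512, num_classes=101):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.vis_proj = nn.Linear(vis_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)
        self.pose_proj = nn.Linear(pose_dim, embed_dim)
        # Simple classifier over the fused representation.
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, vis_feat, txt_feat, pose_feat):
        # L2-normalize each projected embedding, then average (late fusion).
        v = nn.functional.normalize(self.vis_proj(vis_feat), dim=-1)
        t = nn.functional.normalize(self.txt_proj(txt_feat), dim=-1)
        p = nn.functional.normalize(self.pose_proj(pose_feat), dim=-1)
        fused = (v + t + p) / 3.0
        return self.classifier(fused)

# Usage with random features standing in for per-video encoder outputs.
model = TriModalFusion()
vis = torch.randn(4, 512)   # e.g., pooled RGB frame features
txt = torch.randn(4, 512)   # e.g., text embedding of an action description
pose = torch.randn(4, 256)  # e.g., pooled 2D skeleton features
logits = model(vis, txt, pose)
print(logits.shape)         # (4, 101)

In practice the fusion rule could just as easily be concatenation followed by a projection, or cross-modal attention; the averaging above is only the simplest way to illustrate combining the three modalities in one space.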

Related research

EV-Action: Electromyography-Vision Multi-Modal Action Dataset (04/20/2019)
Multi-modal human motion analysis is a critical and attractive research ...

UNIK: A Unified Framework for Real-world Skeleton-based Action Recognition (07/19/2021)
Action recognition based on skeleton data has recently witnessed increas...

2D Pose Estimation based Child Action Recognition (12/18/2022)
We present a graph convolutional network with 2D pose estimation for the...

Valley: Video Assistant with Large Language model Enhanced abilitY (06/12/2023)
Recently, several multi-modal models have been developed for joint image...

How Much Does Audio Matter to Recognize Egocentric Object Interactions? (06/03/2019)
Sounds are an important source of information on our daily interactions ...

Rolling-Unrolling LSTMs for Action Anticipation from First-Person Video (05/04/2020)
In this paper, we tackle the problem of egocentric action anticipation, ...

Distilling Knowledge from Language Models for Video-based Action Anticipation (10/12/2022)
Anticipating future actions in a video is useful for many autonomous and...
