AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual Masked Autoencoder

09/15/2023
by   Xingjian Diao, et al.
0

Learning high-quality video representation has shown significant applications in computer vision and remains challenging. Previous work based on mask autoencoders such as ImageMAE and VideoMAE has proven the effectiveness of learning representations in images and videos through reconstruction strategy in the visual modality. However, these models exhibit inherent limitations, particularly in scenarios where extracting features solely from the visual modality proves challenging, such as when dealing with low-resolution and blurry original videos. Based on this, we propose AV-MaskEnhancer for learning high-quality video representation by combining visual and audio information. Our approach addresses the challenge by demonstrating the complementary nature of audio and video features in cross-modality content. Moreover, our result of the video classification task on the UCF101 dataset outperforms the existing work and reaches the state-of-the-art, with a top-1 accuracy of 98.8 top-5 accuracy of 99.9

READ FULL TEXT

page 2

page 4

research
12/15/2022

MAViL: Masked Audio-Video Learners

We present Masked Audio-Video Learners (MAViL) to train audio-visual rep...
research
03/06/2022

Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos

With the assumption that a video dataset is multimodality annotated in w...
research
08/07/2023

AudioVMAF: Audio Quality Prediction with VMAF

Video Multimethod Assessment Fusion (VMAF) [1], [2], [3] is a popular to...
research
05/29/2020

Not made for each other- Audio-Visual Dissonance-based Deepfake Detection and Localization

We propose detection of deepfake videos based on the dissimilarity betwe...
research
06/15/2021

Is this Harmful? Learning to Predict Harmfulness Ratings from Video

Automatically identifying harmful content in video is an important task ...
research
06/16/2023

CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models

Recent work has studied text-to-audio synthesis using large amounts of p...
research
07/04/2019

LumièreNet: Lecture Video Synthesis from Audio

We present LumièreNet, a simple, modular, and completely deep-learning b...

Please sign up or login with your details

Forgot password? Click here to reset