A study on joint modeling and data augmentation of multi-modalities for audio-visual scene classification

03/07/2022
by Qing Wang, et al.

In this paper, we propose two techniques, joint modeling and data augmentation, to improve system performance for audio-visual scene classification (AVSC). We extract video embeddings with pre-trained networks trained only on image data sets, whereas we train the audio embedding models from scratch. We explore different neural network architectures for joint modeling to effectively combine the video and audio modalities. Moreover, we investigate data augmentation strategies to increase the size of the audio-visual training set. For the video modality, we verify the effectiveness of several operations in RandAugment. We also propose an audio-video joint mixup scheme to further improve AVSC performance. Evaluated on the development set of TAU Urban Audio Visual Scenes 2021, our final system achieves a best accuracy of 94.2% for DCASE 2021 Task 1b.
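
The abstract does not spell out how the audio-video joint mixup works. One plausible reading, sketched below with NumPy, is that a single mixing weight and a single sample pairing are shared across both modalities, so the mixed audio, mixed video, and mixed label of each example stay consistent. The function name `joint_mixup`, the `alpha` value, and the array shapes are illustrative assumptions, not details from the paper.

```python
import numpy as np


def joint_mixup(audio, video, labels, alpha=0.2, rng=None):
    """Mix paired audio/video batches with one shared weight and pairing.

    Assumption: the same lambda and the same permutation are applied to
    both modalities and to the one-hot labels. This is a sketch of the
    general mixup idea, not the authors' exact scheme.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)             # mixing weight in [0, 1]
    perm = rng.permutation(audio.shape[0])   # partner index for each sample
    mixed_audio = lam * audio + (1.0 - lam) * audio[perm]
    mixed_video = lam * video + (1.0 - lam) * video[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_audio, mixed_video, mixed_labels, lam


# Hypothetical batch: 4 clips, log-mel audio features, video embeddings,
# and one-hot labels over 10 scene classes.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 64, 100))
video = rng.normal(size=(4, 512))
labels = np.eye(10)[rng.integers(0, 10, size=4)]
mixed_audio, mixed_video, mixed_labels, lam = joint_mixup(
    audio, video, labels, alpha=0.2, rng=rng
)
```

Sharing the weight and the pairing keeps each mixed audio-video pair aligned with its soft label. Mixing the two modalities independently would produce pairs whose audio and video point to different label mixtures.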

research
12/11/2019

Audiogmenter: a MATLAB Toolbox for Audio Data Augmentation

Audio data augmentation is a key step in training deep neural networks f...
research
05/28/2021

Audio-visual scene classification: analysis of DCASE 2021 Challenge submissions

This paper presents the details of the Audio-Visual Scene Classification...
research
03/02/2022

Improving Generalization of Deep Networks for Estimating Physical Properties of Containers and Fillings

We present methods to estimate the physical properties of household cont...
research
04/25/2022

Audio-Visual Scene Classification Using A Transfer Learning Based Joint Optimization Strategy

Recently, audio-visual scene classification (AVSC) has attracted increas...
research
08/20/2021

Video Ads Content Structuring by Combining Scene Confidence Prediction and Tagging

Video ads segmentation and tagging is a challenging task due to two main...
research
07/11/2020

Look and Listen: A Multi-modality Late Fusion Approach to Scene Classification for Autonomous Machines

The novelty of this study consists in a multi-modality approach to scene...
research
08/24/2022

Improving Natural-Language-based Audio Retrieval with Transfer Learning and Audio Text Augmentations

The absence of large labeled datasets remains a significant challenge in...
