Audio-Visual Scene Classification Using A Transfer Learning Based Joint Optimization Strategy

04/25/2022
by   Chengxin Chen, et al.
0

Recently, audio-visual scene classification (AVSC) has attracted increasing attention from multidisciplinary communities. Previous studies tended to adopt a pipeline training strategy, which uses well-trained visual and acoustic encoders to extract high-level representations (embeddings) first, then utilizes them to train the audio-visual classifier. In this way, the extracted embeddings are well suited for uni-modal classifiers, but not necessarily suited for multi-modal ones. In this paper, we propose a joint training framework, using the acoustic features and raw images directly as inputs for the AVSC task. Specifically, we retrieve the bottom layers of pre-trained image models as visual encoder, and jointly optimize the scene classifier and 1D-CNN based acoustic encoder during training. We evaluate the approach on the development dataset of TAU Urban Audio-Visual Scenes 2021. The experimental results show that our proposed approach achieves significant improvement over the conventional pipeline training strategy. Moreover, our best single system outperforms previous state-of-the-art methods, yielding a log loss of 0.1517 and accuracy of 94.59

READ FULL TEXT
research
04/30/2023

Deep Learning Based Multimodal with Two-phase Training Strategy for Daily Life Video Classification

In this paper, we present a deep learning based multimodal system for cl...
research
10/16/2019

Contextual Joint Factor Acoustic Embeddings

Embedding acoustic information into fixed length representations is of i...
research
03/07/2022

A study on joint modeling and data augmentation of multi-modalities for audio-visual scene classification

In this paper, we propose two techniques, namely joint modeling and data...
research
06/12/2021

Deep Learning Frameworks Applied For Audio-Visual Scene Classification

In this paper, we present deep learning frameworks for audio-visual scen...
research
07/11/2018

A punishment voting algorithm based on super categories construction for acoustic scene classification

In acoustic scene classification researches, audio segment is usually sp...
research
07/28/2021

Squeeze-Excitation Convolutional Recurrent Neural Networks for Audio-Visual Scene Classification

The use of multiple and semantically correlated sources can provide comp...
research
03/31/2022

1-D CNN based Acoustic Scene Classification via Reducing Layer-wise Dimensionality

This paper presents an alternate representation framework to commonly us...

Please sign up or login with your details

Forgot password? Click here to reset