Deep Learning Based Multimodal with Two-phase Training Strategy for Daily Life Video Classification

04/30/2023
by Lam Pham, et al.

In this paper, we present a deep learning based multimodal system for classifying daily life videos. To train the system, we propose a two-phase training strategy. In the first training phase (Phase I), we extract the audio and visual (image) data from the original video and train independent deep learning models on each modality. After training, we obtain audio embeddings and visual embeddings by extracting feature maps from these pre-trained models. In the second training phase (Phase II), we train a fusion layer that combines the audio and visual embeddings and a dense layer that classifies the combined embedding into the target daily scenes. In extensive experiments on the DCASE (IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events) 2021 Task 1B Development benchmark dataset, the system achieved a best classification accuracy of 80.5% with only audio data; results are also reported with only visual data and with both audio and visual data. The highest classification accuracy of 95.3% improves on the DCASE baseline and is very competitive with state-of-the-art systems.
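
The Phase II step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how the fusion layer combines the embeddings, so plain concatenation is assumed here, and the Phase I embeddings are replaced by synthetic stand-ins. The embedding sizes, the class count, the learning rate, and the helper names `softmax` and `train_dense` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def train_dense(x, y, n_classes, epochs=300, lr=0.1):
    """Fit one softmax (dense) layer with full-batch gradient descent.

    Hypothetical helper: the paper's actual classifier and optimizer
    are not described in the abstract.
    """
    w = np.zeros((x.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        # Gradient of the mean cross-entropy loss w.r.t. the logits.
        grad = (softmax(x @ w + b) - onehot) / len(x)
        w -= lr * (x.T @ grad)
        b -= lr * grad.sum(axis=0)
    return w, b


# Synthetic stand-ins for the frozen Phase I embeddings (assumed sizes:
# 8-d audio, 16-d visual, 3 scene classes). Each class shifts one dimension.
n, n_classes = 300, 3
y = rng.integers(0, n_classes, size=n)
audio_emb = rng.normal(size=(n, 8)) + 3.0 * np.eye(n_classes, 8)[y]
visual_emb = rng.normal(size=(n, 16)) + 3.0 * np.eye(n_classes, 16)[y]

# Phase II: fuse the two embeddings (assumed: concatenation), then train
# only the dense classifier on top; the embeddings themselves stay fixed.
fused = np.concatenate([audio_emb, visual_emb], axis=1)
w, b = train_dense(fused, y, n_classes)

pred = softmax(fused @ w + b).argmax(axis=1)
accuracy = float((pred == y).mean())
print(f"training accuracy on synthetic data: {accuracy:.3f}")
```

The point of the sketch is the separation of concerns: the per-modality models are trained first and then only queried for embeddings, so Phase II optimizes a small fusion-plus-classifier head rather than the whole network. The accuracy it prints refers to the synthetic data only and says nothing about the results reported above.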

