DeepMSRF: A novel Deep Multimodal Speaker Recognition framework with Feature selection

07/14/2020
by Ehsan Asali, et al.

For recognizing speakers in video streams, significant research effort has gone into building rich machine learning models that extract high-level speaker features such as facial expression, emotion, and gender. However, such a model cannot be obtained with single-modality feature extractors that exploit only the audio signals or only the image frames extracted from a video stream. In this paper, we address the problem from a different perspective and propose a novel multimodal data fusion framework called DeepMSRF, Deep Multimodal Speaker Recognition with Feature selection. DeepMSRF is fed features of two modalities, the speaker's audio and face images, and trains a two-stream VGGNet on both to obtain a comprehensive model capable of accurately recognizing the speaker's identity. We apply DeepMSRF to a subset of the VoxCeleb2 dataset whose metadata is merged with the VGGFace2 dataset. For any given video stream, DeepMSRF first identifies the speaker's gender and then recognizes his or her name. The experimental results show that DeepMSRF outperforms single-modality speaker recognition methods by at least 3 percent in accuracy.
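The abstract describes a two-stream VGGNet that ingests a face image and an audio representation and fuses them to classify the speaker. The paper's exact architecture is not reproduced on this page, so the following is only a minimal PyTorch sketch under stated assumptions: both streams are VGG16 backbones, the audio is presented as a 3-channel spectrogram image, and fusion is a simple concatenation of the two feature maps ahead of a shared classifier. The class name TwoStreamSpeakerNet and all layer sizes are illustrative, not taken from the paper.

```python
# Minimal two-stream fusion sketch, NOT the authors' implementation.
# Assumptions: VGG16 backbones for both streams, spectrograms rendered
# as 3-channel 224x224 images, and concatenation-based feature fusion.
import torch
import torch.nn as nn
from torchvision.models import vgg16


class TwoStreamSpeakerNet(nn.Module):
    def __init__(self, num_speakers: int):
        super().__init__()
        # One VGG16 convolutional backbone per modality (no pretrained weights here).
        self.face_stream = vgg16(weights=None).features
        self.audio_stream = vgg16(weights=None).features
        self.pool = nn.AdaptiveAvgPool2d((7, 7))
        # Concatenate the two 512x7x7 feature maps, then classify the speaker.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(2 * 512 * 7 * 7, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(4096, num_speakers),
        )

    def forward(self, face: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        f = self.pool(self.face_stream(face))    # (B, 512, 7, 7)
        a = self.pool(self.audio_stream(audio))  # (B, 512, 7, 7)
        return self.classifier(torch.cat([f, a], dim=1))


# Usage: face crops and spectrogram "images", both resized to 224x224.
model = TwoStreamSpeakerNet(num_speakers=100)
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 100])
```

Feature-level concatenation is only one possible fusion choice. The gender-first, identity-second pipeline the abstract describes could be realized by training one such network as a gender classifier and routing each clip to a per-gender speaker classifier of the same shape.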


Related research

06/07/2021
Active Speaker Detection as a Multi-Objective Optimization with Uncertainty-based Multimodal Fusion
It is now well established from a variety of studies that there is a sig...

07/06/2019
Multimodal Fusion with Deep Neural Networks for Audio-Video Emotion Recognition
This paper presents a novel deep neural network (DNN) for multimodal fus...

07/17/2015
Deep Multimodal Speaker Naming
Automatic speaker naming is the problem of localizing as well as identif...

08/28/2023
Video Multimodal Emotion Recognition System for Real World Applications
This paper proposes a system capable of recognizing a speaker's utteranc...

07/06/2023
Read, Look or Listen? What's Needed for Solving a Multimodal Dataset
The prevalence of large-scale multimodal datasets presents unique challe...

05/03/2018
Framewise approach in multimodal emotion recognition in OMG challenge
In this report we describe our approach, which achieves 53% unweighted accu...

10/13/2020
Three-Dimensional Lip Motion Network for Text-Independent Speaker Recognition
Lip motion reflects behavior characteristics of speakers, and thus can b...
