3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition

06/18/2017
by Amirsina Torfi, et al.

Audio-visual recognition (AVR) has been considered as a solution for speech recognition tasks when the audio is corrupted, as well as a visual recognition method for speaker verification in multi-speaker scenarios. AVR systems leverage the information extracted from one modality to improve the recognition ability of the other modality by complementing the missing information. The essential problem is to find the correspondence between the audio and visual streams, which is the goal of this work. We propose the use of a coupled 3D Convolutional Neural Network (3D-CNN) architecture that maps both modalities into a representation space, so that the correspondence of audio-visual streams can be evaluated using the learned multimodal features. The proposed architecture incorporates spatial and temporal information jointly to effectively find the correlation between the temporal information of the two modalities. Using a relatively small network architecture and a much smaller dataset for training, our proposed method surpasses the performance of existing similar methods for audio-visual matching that use 3D CNNs for feature representation. We also demonstrate that an effective pair selection method can significantly increase the performance. The proposed method achieves relative improvements of over 20% on the Equal Error Rate (EER) and over 7% on the Average Precision (AP) in comparison to the state-of-the-art method.
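The abstract reports its results as relative improvements on the Equal Error Rate. As a rough illustration of what that metric measures, the sketch below approximates the EER from distances between paired audio and visual embeddings, assuming that a smaller distance means a more likely match. The function names, the threshold sweep, and the toy distances are assumptions made for this example; they are not the authors' evaluation code or data.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER from pair distances (smaller = more likely a match).

    genuine:  distances for audio-visual pairs that truly correspond
    impostor: distances for pairs that do not correspond
    """
    thresholds = sorted(set(genuine) | set(impostor))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # A pair is accepted as matching when its distance is <= t.
        far = sum(d <= t for d in impostor) / len(impostor)  # false acceptance rate
        frr = sum(d > t for d in genuine) / len(genuine)     # false rejection rate
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_improvement(baseline_error, new_error):
    """Fraction by which an error metric drops relative to a baseline."""
    return (baseline_error - new_error) / baseline_error


# Toy distances, invented purely for illustration.
genuine = [0.1, 0.2, 0.3, 0.9]
impostor = [0.25, 0.8, 1.0, 1.2]
print(equal_error_rate(genuine, impostor))
```

Under this definition, a relative improvement of 20% means the error drops by one fifth of the baseline value, not by 20 percentage points.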

research
08/06/2020

Attentive Fusion Enhanced Audio-Visual Encoding for Transformer Based Robust Speech Recognition

Audio-visual information fusion enables a performance improvement in spe...
research
10/15/2019

Seeing and Hearing Egocentric Actions: How Much Can We Learn?

Our interaction with the world is an inherently multimodal experience. H...
research
03/30/2018

Detecting Alzheimer's Disease Using Gated Convolutional Neural Network from Audio Data

We propose an automatic detection method of Alzheimer's diseases using a...
research
09/17/2016

GeThR-Net: A Generalized Temporally Hybrid Recurrent Neural Network for Multimodal Information Fusion

Data generated from real world events are usually temporal and contain m...
research
01/25/2022

Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition

Audio-visual automatic speech recognition (AV-ASR) extends the speech re...
research
07/02/2021

Continuous Emotion Recognition with Audio-visual Leader-follower Attentive Fusion

We propose an audio-visual spatial-temporal deep neural network with: (1...
research
06/03/2021

ERANNs: Efficient Residual Audio Neural Networks for Audio Pattern Recognition

We present a new architecture of convolutional neural networks (CNNs) ba...
