Characterizing Types of Convolution in Deep Convolutional Recurrent Neural Networks for Robust Speech Emotion Recognition

06/07/2017
by   Che-Wei Huang, et al.

Deep convolutional neural networks are being actively investigated in a wide range of speech and audio processing applications, including speech recognition, audio event detection and computational paralinguistics, owing to their ability to reduce factors of variation in the signal, such as speaker and environment information, for speech recognition. However, studies have suggested favoring a certain type of convolutional operation when building a deep convolutional neural network for speech applications, even though there have been promising results using other types of convolutional operations. In this work, we study four types of convolutional operations on different input features for speech emotion recognition in order to derive a comprehensive understanding. Since affective behavioral information has been shown to reflect temporally varying mental states, and convolutional operations are applied only locally in time, all of our deep neural networks share a deep recurrent sub-network architecture for further temporal modeling. We present a detailed quantitative module-wise performance analysis to gain insight into the information flow within the proposed architectures. In particular, we demonstrate the interplay between affective information and other, irrelevant information during the progression from one module to the next. Finally, we show that all of our deep neural networks provide state-of-the-art performance on the eNTERFACE'05 corpus.
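The abstract does not name the four convolution types it compares. As a rough illustration of what "type of convolution" can mean for a spectrogram input, the pure-Python sketch below applies the same valid-mode cross-correlation with three kernel shapes: one spanning only time, one spanning only frequency, and one spanning both. The function name `conv2d_valid`, the toy spectrogram and the kernel shapes are all assumptions made for this example, not the authors' configuration.

```python
def conv2d_valid(spec, kernel):
    """Valid-mode 2D cross-correlation of spec[freq][time] with kernel[kf][kt]."""
    n_freq, n_time = len(spec), len(spec[0])
    kf, kt = len(kernel), len(kernel[0])
    out = []
    for f in range(n_freq - kf + 1):
        row = []
        for t in range(n_time - kt + 1):
            acc = 0.0
            for i in range(kf):
                for j in range(kt):
                    acc += spec[f + i][t + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


def shape(x):
    """Return (rows, columns) of a rectangular list of lists."""
    return (len(x), len(x[0]))


# Hypothetical toy spectrogram: 4 frequency bins by 6 time frames.
spec = [[float(f + t) for t in range(6)] for f in range(4)]

# Illustrative kernels (assumed shapes, not taken from the paper).
temporal_kernel = [[1.0, 1.0, 1.0]]                 # 1 x 3: local in time only
spectral_kernel = [[1.0], [1.0], [1.0]]             # 3 x 1: local in frequency only
joint_kernel = [[1.0] * 3 for _ in range(3)]        # 3 x 3: local in both axes

temporal_out = conv2d_valid(spec, temporal_kernel)
spectral_out = conv2d_valid(spec, spectral_kernel)
joint_out = conv2d_valid(spec, joint_kernel)

print(shape(temporal_out))  # frequency axis preserved, time axis shortened
print(shape(spectral_out))  # time axis preserved, frequency axis shortened
print(shape(joint_out))     # both axes shortened
```

In every case each output value depends on only a few neighboring frames, which is consistent with the abstract's motivation for placing a recurrent sub-network after the convolutional layers to model longer temporal context.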

research
09/15/2021

FSER: Deep Convolutional Neural Networks for Speech Emotion Recognition

Using mel-spectrograms over conventional MFCCs features, we assess the a...
research
03/03/2020

Untangling in Invariant Speech Recognition

Encouraged by the success of deep neural networks on a variety of visual...
research
03/06/2020

Multi-Time-Scale Convolution for Emotion Recognition from Speech Audio Signals

Robustness against temporal variations is important for emotion recognit...
research
10/10/2018

Multimodal Speech Emotion Recognition Using Audio and Text

Speech emotion recognition is a challenging task, and extensive reliance...
research
05/15/2020

ConcealNet: An End-to-end Neural Network for Packet Loss Concealment in Deep Speech Emotion Recognition

Packet loss is a common problem in data transmission, including speech d...
research
01/21/2020

A Comprehensive Study on Temporal Modeling for Online Action Detection

Online action detection (OAD) is a practical yet challenging task, which...
