Enhancing Segment-Based Speech Emotion Recognition by Deep Self-Learning

by Shuiyang Mao, et al.

Despite the widespread use of deep neural networks (DNNs) for speech emotion recognition (SER), their performance is severely restricted by the paucity of labeled training data. Recently, segment-based approaches for SER have emerged, which train backbone networks on shorter segments instead of whole utterances and thus naturally augment the training examples without additional resources. However, one core challenge remains for segment-based approaches: most emotional corpora do not provide ground-truth labels at the segment level. To train a segment-based emotion model on such datasets in a supervised manner, the most common practice is to assign each segment the emotion label of its utterance. However, this practice typically introduces noisy (incorrect) labels, as emotional information is not uniformly distributed across the whole utterance. Moreover, DNNs have been shown to easily over-fit a dataset when trained with noisy labels. To address this, this work proposes a simple and effective deep self-learning (DSL) framework, which comprises a procedure to progressively correct segment-level labels in an iterative learning manner. The DSL method produces dynamically generated soft emotion labels, leading to significant performance improvements. Experiments on three well-known emotional corpora demonstrate noticeable gains using the proposed method.
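
To make the labeling problem concrete, the following is a minimal Python sketch of the two ideas the abstract describes: segments inheriting the utterance label, and those labels being softened over iterations. The abstract does not state the actual correction rule, so the blending update in `refine_soft_labels`, the mixing weight `alpha`, and all function names here are hypothetical illustrations, not the authors' method.

```python
from typing import List, Sequence


def split_into_segments(frames: Sequence[float], seg_len: int, hop: int) -> List[List[float]]:
    """Cut a frame sequence into fixed-length, possibly overlapping segments."""
    if seg_len <= 0 or hop <= 0:
        raise ValueError("seg_len and hop must be positive")
    last_start = len(frames) - seg_len
    return [list(frames[s:s + seg_len]) for s in range(0, last_start + 1, hop)]


def one_hot(label: int, num_classes: int) -> List[float]:
    """Hard label as a probability vector."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def inherit_utterance_label(num_segments: int, label: int, num_classes: int) -> List[List[float]]:
    """Common baseline: every segment gets the utterance-level label."""
    return [one_hot(label, num_classes) for _ in range(num_segments)]


def refine_soft_labels(
    current: List[List[float]],
    predicted: List[List[float]],
    alpha: float = 0.5,
) -> List[List[float]]:
    """Assumed update rule (not from the paper): blend the current segment
    labels with the model's posteriors to obtain soft labels."""
    return [
        [(1.0 - alpha) * c + alpha * p for c, p in zip(cur, pred)]
        for cur, pred in zip(current, predicted)
    ]


# Toy utterance: 10 frames, 3 emotion classes, utterance label = class 1.
segments = split_into_segments(list(range(10)), seg_len=4, hop=2)
labels = inherit_utterance_label(len(segments), label=1, num_classes=3)

# Stand-in posteriors; in practice these would come from the trained model.
posteriors = [[0.2, 0.6, 0.2] for _ in segments]
soft_labels = refine_soft_labels(labels, posteriors, alpha=0.5)
```

In an iterative scheme of this kind, the model would be retrained on `soft_labels` and the refinement repeated, so that segments carrying little emotional content drift away from the inherited hard label.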


Advancing Multiple Instance Learning with Attention Modeling for Categorical Speech Emotion Recognition

Categorical speech emotion recognition is typically performed as a seque...

Emotion Profile Refinery for Speech Emotion Classification

Human emotions are inherently ambiguous and impure. When designing syste...

End-to-End Label Uncertainty Modeling in Speech Emotion Recognition using Bayesian Neural Networks and Label Distribution Learning

To train machine learning algorithms to predict emotional expressions in...

EigenEmo: Spectral Utterance Representation Using Dynamic Mode Decomposition for Speech Emotion Classification

Human emotional speech is, by its very nature, a variant signal. This re...

Estimating the Uncertainty in Emotion Class Labels with Utterance-Specific Dirichlet Priors

Emotion recognition is a key attribute for artificial intelligence syste...

Estimating the Uncertainty in Emotion Attributes using Deep Evidential Regression

In automatic emotion recognition (AER), labels assigned by different hum...

A Multi-task Neural Approach for Emotion Attribution, Classification and Summarization

Emotional content is a crucial ingredient in user-generated videos. Howe...
