Towards adversarial learning of speaker-invariant representation for speech emotion recognition

03/22/2019

Speech emotion recognition (SER) has attracted great attention in recent years due to the high demand for emotionally intelligent speech interfaces. Deriving speaker-invariant representations is crucial for SER. In this paper, we propose applying adversarial training to SER to learn such representations. Our model consists of three parts: a representation learning sub-network built from a time-delay neural network (TDNN) and an LSTM with statistical pooling, an emotion classification network, and a speaker classification network. Both the emotion and speaker classification networks take the output of the representation learning network as input. Two training strategies are employed: one based on domain adversarial training (DAT) and the other based on cross-gradient training (CGT). Besides the conventional data set, we also evaluate our proposed models on a much larger publicly available emotion data set with 250 speakers. Evaluation results show that on IEMOCAP, DAT and CGT provide 5.6 system without speaker-invariant representation learning on 5-fold cross validation. On the larger emotion data set, while CGT fails to yield better results than the baseline, DAT can still provide 9.8 standalone test set.
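Two ingredients named in the abstract can be illustrated with a minimal sketch in plain Python. The first function shows statistical pooling as it is commonly defined (per-dimension mean and standard deviation over time); the second shows the sign flip that the standard gradient-reversal formulation of DAT applies to the speaker gradient before it reaches the shared representation network. The function names, the use of population standard deviation, and the weight `lam` are assumptions made for illustration and are not taken from the paper.

```python
import math
from typing import List, Sequence


def statistical_pooling(frames: Sequence[Sequence[float]]) -> List[float]:
    """Collapse a variable-length sequence of frame vectors into one
    fixed-size vector: per-dimension means followed by per-dimension
    standard deviations (population std; an assumption of this sketch)."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return means + stds


def dat_encoder_gradient(
    grad_emotion: Sequence[float],
    grad_speaker: Sequence[float],
    lam: float,
) -> List[float]:
    """Gradient applied to the shared representation parameters under
    gradient reversal: the emotion gradient is kept as-is, while the
    speaker gradient is negated and scaled by lam (hypothetical weight),
    pushing the representation to be less informative about the speaker."""
    return [ge - lam * gs for ge, gs in zip(grad_emotion, grad_speaker)]


# Two frames of 2-dimensional features pooled into a 4-dimensional vector.
pooled = statistical_pooling([[1.0, 2.0], [3.0, 6.0]])

# Combined update direction for the shared network with lam = 2.0.
combined = dat_encoder_gradient([1.0, -2.0], [0.5, 0.5], lam=2.0)
```

In this sketch the two classifier heads would still be trained with their own unmodified gradients; only the shared representation network sees the reversed speaker gradient.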
