Multi-Channel Auto-Encoder for Speech Emotion Recognition

10/25/2018 · by Zefang Zong, et al.

Inferring emotion status from users' queries plays an important role in enhancing the capability of voice dialogue applications. Even though several related works have obtained satisfactory results, the performance can still be further improved. In this paper, we propose a novel framework named multi-channel auto-encoder (MTC-AE) for emotion recognition from acoustic information. MTC-AE contains multiple local DNNs based on different low-level descriptors with different statistics functions, whose representations are partly concatenated together, enabling the structure to consider both local and global features simultaneously. Experiments on the benchmark dataset IEMOCAP show that our method significantly outperforms the existing state-of-the-art results, achieving 64.8% leave-one-speaker-out unweighted accuracy, which is 2.4% higher than the best previous result on this dataset.




1 Introduction

As voice-controlled intelligent applications develop rapidly, emotion recognition and analysis are becoming more and more important. Obtaining information from literal expressions alone can no longer satisfy our demands, since a great part of the information is conveyed by human emotions. Some utterances, for example ironic phrases, may have a meaning completely opposite to what they sound like literally, and voice-controlled virtual assistants like Siri may work much better with emotion information. Inferring emotion from voice data can therefore help to understand the users' intended meaning, as well as provide more humanized responses.

Traditionally, two major frameworks were explored for speech emotion recognition. One is the HMM-GMM framework based on dynamic features [1]; the other classifies high-level representations, generated by applying many statistics functions to low-level descriptors (LLDs), with support vector machines (SVMs) [2].


Recently, more and more attention has been paid to deep learning methods for speech emotion recognition, which bring better performance than the traditional frameworks [3]. Some researchers focus on utterance-level features, extracting high-level representations from LLDs and then utilizing a deep neural network (DNN) for classification [4]. Meanwhile, instead of high-level statistics representations, other researchers use frame-level representations or raw signals as inputs to neural networks for end-to-end training [5, 6]. Generally, deep learning approaches have made a great contribution to the field of speech emotion recognition.

However, current deep learning methods have some apparent limitations in the training procedure. In frameworks using utterance-level features, the inputs to the neural networks are usually generated by concatenating LLDs directly. Not only does this ignore the independent nature of each feature, it also produces an input of very high dimensionality, which makes a satisfying result hard to reach because of overfitting on the limited training data. Although the dimensionality can be reduced with other methods, the reduction process inevitably loses some important information.

In this paper, we put forward the multi-channel auto-encoder (MTC-AE) as a new scheme to avoid the limitations listed above. Instead of training one network on all features concatenated directly, we train several local DNN classifiers separately, which keeps the information arising from the independence of each classifier and adds a strong regularization to the whole system, helping to relieve overfitting. Features are taken from the bottleneck layer of each classifier and concatenated into a higher-dimensional one. The concatenated feature is the input of a global classifier; because of the regularization from the local classifiers, it is more discriminative for classification. Finally, we fuse the outputs of all classifiers to obtain the final prediction. Importantly, the global classifier and the local ones are trained simultaneously through a single objective function, which guarantees that the final result considers both the independence and the relevance among different features. In addition, inspired by bottleneck features [7], we initialize each local DNN with a stacked denoising auto-encoder (SDAE), completing the MTC-AE scheme, to lower the classification error caused by corrupted inputs and reach even better performance. The structure of MTC-AE is shown in Figure 1.

Experiments on the benchmark dataset IEMOCAP show that our method outperforms the existing state-of-the-art methods, achieving 64.8% unweighted accuracy with leave-one-speaker-out (LOSO) 10-fold cross-validation.

2 Multi-channel Auto-Encoder

In this section, we interpret the complete MTC-AE scheme in detail. In each local DNN, we use SDAEs for initialization before supervised training, to denoise corrupted versions of the inputs. A local classifier is obtained by training each local DNN on one LLD. Meanwhile, we take the bottleneck layer of each local DNN and concatenate them together as the overall representation used to train a global classifier. Training the global classifier and the local ones simultaneously not only takes both the relevance and the independence of different LLDs into consideration, but also relieves the overfitting caused by high-dimensional inputs and limited training data. Figure 1 shows the detailed structure.
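As a concrete illustration, the forward pass of this multi-channel structure can be sketched as follows. This is a minimal NumPy sketch, not the authors' code: the channel count, layer sizes, and random initialization are placeholder assumptions, and biases are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def elu(z, alpha=1.0):
    # ELU activation, used throughout the paper's networks
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

# Placeholder sizes (the paper elides the exact hidden-unit counts)
N_CHANNELS, FEAT_DIM, HIDDEN, BOTTLENECK, N_CLASSES = 4, 21, 32, 8, 4

def make_local(feat_dim):
    """One local DNN: hidden layer -> bottleneck -> softmax classifier."""
    return {
        "W1": rng.normal(0, 0.1, (feat_dim, HIDDEN)),
        "W2": rng.normal(0, 0.1, (HIDDEN, BOTTLENECK)),
        "W3": rng.normal(0, 0.1, (BOTTLENECK, N_CLASSES)),
    }

local_nets = [make_local(FEAT_DIM) for _ in range(N_CHANNELS)]
W_global = rng.normal(0, 0.1, (N_CHANNELS * BOTTLENECK, N_CLASSES))

def forward(xs):
    """xs: one feature vector per channel -> (local posteriors, global posterior)."""
    bottlenecks, local_preds = [], []
    for x, p in zip(xs, local_nets):
        h = elu(x @ p["W1"])
        b = elu(h @ p["W2"])               # bottleneck representation
        bottlenecks.append(b)
        local_preds.append(softmax(b @ p["W3"]))
    # Bottlenecks of all channels are concatenated for the global classifier
    g = softmax(np.concatenate(bottlenecks) @ W_global)
    return local_preds, g

xs = [rng.normal(size=FEAT_DIM) for _ in range(N_CHANNELS)]
local_preds, global_pred = forward(xs)
```

Each channel produces its own class posterior, while the concatenated bottlenecks feed one global classifier, mirroring the local/global split described above.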

Figure 1: The structure of the multi-channel auto-encoder (MTC-AE). The yellow layers in each local classifier are pre-trained by SDAEs. The blue layers are bottleneck layers, which are concatenated for the global classifier. The remaining layers are fully connected layers with random initialization for each local classifier. All the predictions are fused for the final prediction.

2.1 Local Initialization

Because of the large number of local classifiers, our network structure is relatively complex, so the initialization of the network parameters is a key factor affecting performance. Voice data is also frequently corrupted, and such corruption effectively influences training. Inspired by DBNF [7], we utilize SDAEs for initialization. As shown in Figure 1, the lowest two layers in each local DNN are pre-trained using SDAEs in an unsupervised manner. A denoising auto-encoder operates mostly like a traditional auto-encoder, except that it takes as input a corrupted version $\tilde{x}$, generated by a corruption process applied to the clean data $x$, and is meant to reconstruct the original data $x$. Generally, a random fraction of the elements of $x$ are set to zero to obtain the corrupted version $\tilde{x}$. The reconstruction process can be written mathematically as

$$h = f(W\tilde{x} + b), \qquad \hat{x} = g(W'h + b'),$$

where $\tilde{x}$ is the input, $\hat{x}$ is the output, $W$ and $W'$ are the linear transforms, $b$ and $b'$ are the biases, and $f$ and $g$ are the non-linear activation functions. In our formulation, we use the ELU function [8] for both $f$ and $g$:

$$\mathrm{ELU}(z) = \begin{cases} z, & z > 0 \\ \alpha\,(e^{z} - 1), & z \le 0 \end{cases}$$

where $\alpha$ is a fixed hyperparameter. We use the back-propagation algorithm to train the auto-encoder by minimizing the cost function

$$L = \lVert \hat{x} - x \rVert_2^2 + \lambda \left( \lVert W \rVert_2^2 + \lVert W' \rVert_2^2 \right),$$

where $\lambda$ is the hyperparameter that limits the influence of the regularization term.

After the previous auto-encoder is trained, its hidden representation is regarded as the input data for training the next auto-encoder. In total, for each local classifier, two auto-encoders are trained to initialize the lowest two layers.
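The masking-noise corruption step can be sketched as below. This is an illustrative sketch, not the authors' code; the corruption fraction passed in is an assumed placeholder, since the paper elides its value.

```python
import random

def corrupt(x, frac, seed=None):
    """Return a copy of x with round(frac * len(x)) randomly chosen
    elements set to zero (the masking noise used by the denoising AE)."""
    rng = random.Random(seed)
    x_tilde = list(x)
    for i in rng.sample(range(len(x)), round(frac * len(x))):
        x_tilde[i] = 0.0
    return x_tilde

# Example: zero out a quarter of an 8-dimensional feature vector
x = [0.5, -1.2, 0.9, 2.0, -0.3, 1.1, 0.4, -0.8]
x_tilde = corrupt(x, frac=0.25, seed=0)
```

The auto-encoder is then trained to reconstruct the clean `x` from `x_tilde`, which forces the learned representation to be robust to missing or corrupted inputs.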

2.2 Joint Fine-tuning

After training the SDAEs, a bottleneck layer, a hidden layer and a classification layer are connected on top to form a feed-forward neural network for each local classifier. The three layers attached on top are initialized randomly, while the lower layers are initialized with the auto-encoder weights. Then, as shown in Figure 1, each feed-forward neural network is utilized as a block to form a local classifier in our framework.

It is important to note that measuring only the local classifiers considers the independence of the features sufficiently, but ignores the relevance among different features. Therefore, we concatenate the bottleneck layers of all local classifiers as a global representation to train the global classifier, which is initialized randomly. The global classifier takes the relevance among the local representations into consideration and measures it effectively.

Moreover, in order to optimize the whole system considering both the relevance and the independence, we use a single objective function to train the local classifiers and the global one simultaneously:

$$L = H(p, q_{g}) + \lambda \sum_{i=1}^{N} H(p, q_{i}),$$

where $p$ is the true distribution of the one-hot label, $q_{g}$ is the approximating distribution of the global classifier (with parameters $\theta_{g}$), $q_{i}$ is that of the $i$-th local classifier (with parameters $\theta_{i}$), $N$ is the number of local classifiers, and $\lambda$ is the weight coefficient balancing the global and local terms. In particular, for $\lambda = 0$ only the global classifier is trained, while as $\lambda \to \infty$ the objective is dominated by the local classifiers instead of the global one. $H$ is a function that returns the cross-entropy between an approximating distribution $q$ and a true distribution $p$, written mathematically as

$$H(p, q) = -\sum_{k} p_{k} \log q_{k}.$$

This objective function can be optimized with the back-propagation algorithm.
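The joint objective above can be computed as a plain sum of cross-entropies. The sketch below is illustrative; the value of `lam` ($\lambda$) is an assumption, since the paper elides the one it uses.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k * log(q_k), for a one-hot (or soft) label p."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

def joint_loss(p, q_global, q_locals, lam):
    """Global cross-entropy plus lam times the summed local cross-entropies."""
    return cross_entropy(p, q_global) + lam * sum(
        cross_entropy(p, q_i) for q_i in q_locals)

p = [0.0, 1.0, 0.0, 0.0]                      # one-hot true label
q_g = [0.1, 0.7, 0.1, 0.1]                    # global classifier posterior
q_l = [[0.25, 0.25, 0.25, 0.25],              # local classifier posteriors
       [0.10, 0.60, 0.20, 0.10]]
loss = joint_loss(p, q_g, q_l, lam=0.5)
```

Setting `lam=0` recovers training of the global classifier alone, matching the limiting case discussed above.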

Since our model contains many classifiers, we simply fuse their outputs by a weighted summation, with a weight parameter $\beta$ balancing the global classifier and the local classifiers:

$$\hat{y} = q_{g} + \beta \sum_{i=1}^{N} q_{i}.$$

The position of the maximum element of $\hat{y}$ is regarded as the final prediction result.
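The fusion step reduces to a weighted element-wise sum followed by an argmax, as in this sketch; the value of `beta` ($\beta$) is an illustrative assumption, since the paper elides it.

```python
def fuse(q_global, q_locals, beta):
    """Weighted sum of global and local posteriors; returns (fused, argmax)."""
    n = len(q_global)
    fused = [q_global[k] + beta * sum(q[k] for q in q_locals) for k in range(n)]
    return fused, max(range(n), key=fused.__getitem__)

q_g = [0.1, 0.6, 0.2, 0.1]
q_l = [[0.3, 0.4, 0.2, 0.1],
       [0.2, 0.5, 0.2, 0.1]]
fused, pred = fuse(q_g, q_l, beta=0.5)   # pred is the predicted class index
```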

3 Results and Discussion

3.1 Dataset

The IEMOCAP [9] database contains approximately 12 hours of audio-visual conversations of 10 English speakers, manually segmented into utterances. The database contains the following categorical labels: anger, happiness, sadness, neutral, excitement, frustration, fear, surprise, and others. In our experiment, to compare with the former state-of-the-art methods mentioned in Section 3.2, we form a four-class emotion classification dataset containing {happy, angry, sad, neutral}, merging the happiness and excitement categories into a single happy category. Table 1 presents the number of utterances and the corresponding percentage of each category.

Category Happy Anger Sad Neutral Total
Utterances 1636 1103 1084 1708 5531
Percentage 29.6 19.9 19.6 30.9 100
Table 1: The number of utterances and their corresponding percentage for each emotion category.
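The four-class dataset construction amounts to a simple label mapping: happiness and excitement merge into happy, and utterances with any other label are discarded. A minimal sketch (the label strings are illustrative, not IEMOCAP's exact file format):

```python
# Map raw IEMOCAP labels into the four target classes; labels absent from
# the map (frustration, fear, surprise, others) are dropped.
LABEL_MAP = {
    "happiness": "happy",
    "excitement": "happy",   # merged into the happy category
    "anger": "angry",
    "sadness": "sad",
    "neutral": "neutral",
}

def relabel(utterances):
    """Keep only utterances whose label maps into the four target classes."""
    return [(utt, LABEL_MAP[lab]) for utt, lab in utterances if lab in LABEL_MAP]

data = [("u1", "happiness"), ("u2", "excitement"),
        ("u3", "frustration"), ("u4", "anger")]
kept = relabel(data)
```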

3.2 Experimental Details

Evaluation We performed all evaluations using 10-fold LOSO cross-validation, in the same manner as most approaches, so that there is no speaker overlap between the training and test data. To evaluate performance, we use the unweighted accuracy (UA), which has been used in several previous emotion challenges. UA is a good measurement in this case since the class distribution is imbalanced.
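Unweighted accuracy is the mean of the per-class recalls, so each class counts equally regardless of how many utterances it has. A minimal sketch:

```python
def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls: every class contributes equally,
    which is why UA is preferred under class imbalance."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

y_true = ["happy", "happy", "sad", "sad", "sad", "angry"]
y_pred = ["happy", "sad",   "sad", "sad", "happy", "angry"]
ua = unweighted_accuracy(y_true, y_pred)
```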

Feature Extraction We utilize the openSMILE toolkit [10] to extract the statistics features used in the INTERSPEECH 2010 Paralinguistic Challenge [11], as discussed in [4]. In total, 1582-dimensional features are generated by extracting the 38 kinds of LLDs shown in Table 2 and applying the 21 statistics functions shown in Table 3. Details of these features can be found in [11].

Low-level Descriptors (LLDs)
PCM loudness
MFCC [0-14]
log Mel Frequency Band [0-7]
Line Spectral Pairs (LSP) Frequency [0-7]
F0 by sub-harmonic summation
F0 Envelope
Voicing probability
Jitter local
Jitter difference of difference of periods (DDP)
Shimmer local
Table 2: 38-dimensional frame-level acoustic features
Statistics Functions
Position maximum/minimum
Arithmetic mean, standard deviation
Linear regression coefficients 1/2
Linear regression error quadratic/absolute
Quartile 1/2/3
Quartile range 2-1/3-2/3-1
Percentile 1/99
Percentile range 99-1
Up-level time 75/99
Table 3: 21 kinds of statistics functions applied on LLDs
Method Reference Validation Setting UA (%)
Ensemble of SVM Trees [APSIPA ASC, 2012] 10-fold LOSO 60.9
Replicated Softmax Models + SVM [ISCAS, 2014] 10-fold LOSO 57.4
CNN Feature + MKL Classifier [ICDM, 2016] 10-fold 61.3
Contextual LSTM [ACL, 2017] First 8 speakers for training 57.1
Attention-based RNN [ICASSP, 2017] 4 Sessions for training 58.8
Deep Multi-layered Neural Network [Neural Networks, 2017] 8-fold LOSO 60.9
Multi-task DBN Feature + SVM [Trans-AC, 2017] 10-fold LOSO 62.4
MTC-DNN Our Method 10-fold LOSO 62.7
MTC-AE Our Method 10-fold LOSO 64.8
Table 4: The performance on IEMOCAP dataset with different models and comparison with the state of the art based on unweighted accuracy

Network Setting In our experiments, the whole framework consists of 38 local classifiers, corresponding to the 38-dimensional frame-level acoustic features mentioned above.

For pre-training the SDAEs, Adam [12] is used for optimization. Input vectors are corrupted by applying masking noise that sets a random fraction of their elements to zero. Each auto-encoder is fully trained before the next one is trained on top of it.

For the fine-tuning process, a bottleneck layer is added, followed by a new hidden layer; the global classifier likewise uses a new hidden layer. Similarly, Adam is used for optimization. After each epoch, the current model is evaluated on the validation set, and the model performing best on this set is used for testing.

All of these processes are run on GPUs using the Keras toolkit.

State-of-the-art Methods To evaluate the effectiveness of the proposed framework, we compare the emotion classification performance with the following state-of-the-art methods on IEMOCAP:

[APSIPA ASC, 2012] [13] proposed an ensemble of trees of binary SVM classifiers to address the sentence-level multimodal emotion recognition problem.

[ISCAS, 2014] [14] proposed a multi-modal framework for emotion recognition using bag-of-words features and undirected, replicated softmax topic models.

[ICDM, 2016] [15] fed features extracted by deep convolutional neural networks (CNNs) into a multiple kernel learning classifier for multimodal emotion recognition.

[ACL, 2017] [16] proposed an LSTM-based model to capture contextual information between utterance-level features in the same video.

[ICASSP, 2017] [17] studied automatically discovering emotionally relevant speech features using a deep recurrent neural network (RNN) and a local-attention-based feature pooling strategy.

[Neural Networks, 2017] [18] proposed a speech emotion recognition system that empirically explores feed-forward and recurrent neural network architectures and their variants.

[Trans-AC, 2017] [4] proposed a framework for acoustic emotion recognition based on the deep belief network (DBN) framework.

3.3 Experimental Results and Analysis

Table 4 compares the classification performance of different state-of-the-art models on the IEMOCAP dataset. Our experiments were run with 10-fold LOSO, as most of the previous work did, and were carried out both with and without SDAE pre-training; we call the network without SDAEs MTC-DNN. Before our work, the highest unweighted accuracy was obtained by using a multi-task deep belief network to obtain the feature representation and an SVM as the classifier [4]. Compared with this, the two methods we propose, MTC-DNN and MTC-AE, improve the UA by 0.3% and 2.4%, respectively, and both outperform the existing state-of-the-art methods. This shows that taking the independence of different features into consideration does help to improve the performance of the system; it also adds a strong regularization to the whole system and makes the representation more discriminative for classification. It is also remarkable that using SDAEs to pre-train the network further improves the classification accuracy: SDAE pre-training explicitly reduces the noise existing in the data and helps the network converge faster.

4 Conclusion

In this paper, we proposed a novel architecture named MTC-AE for speech emotion recognition. The multiple local DNNs not only keep the information arising from the independence of each feature, but also add a regularization to the whole scheme, helping to relieve overfitting. Moreover, the bottleneck layers of the local DNNs are concatenated to form a strong global classifier, and SDAEs are utilized to initialize the complex network. Experiments show that our method significantly outperforms the existing state-of-the-art methods, with 64.8% UA on the IEMOCAP dataset.


  • [1] B. Schuller, G. Rigoll, and M. Lang, “Hidden markov model-based speech emotion recognition,” in International Conference on Multimedia and Expo, 2003. ICME ’03. Proceedings, 2003, pp. I–401–4 vol. 1.
  • [2] Björn Schuller, Bogdan Vlasenko, Florian Eyben, Gerhard Rigoll, and Andreas Wendemuth, “Acoustic emotion recognition: A benchmark comparison of performances,” in Automatic Speech Recognition & Understanding, 2009. ASRU 2009. IEEE Workshop on. IEEE, 2009, pp. 552–557.
  • [3] Yelin Kim, Honglak Lee, and Emily Mower Provost, “Deep learning for robust feature generation in audiovisual emotion recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 3687–3691.
  • [4] Rui Xia and Yang Liu, “A multi-task learning framework for emotion recognition using 2d continuous space,” IEEE Transactions on Affective Computing, vol. 8, no. 1, pp. 3–14, 2017.
  • [5] Shiqing Zhang, Shiliang Zhang, Tiejun Huang, and Wen Gao, “Multimodal deep convolutional neural network for audio-visual emotion recognition,” in Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval. ACM, 2016, pp. 281–284.
  • [6] George Trigeorgis, Fabien Ringeval, Raymond Brueckner, Erik Marchi, Mihalis A Nicolaou, Björn Schuller, and Stefanos Zafeiriou, “Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 5200–5204.
  • [7] Jonas Gehring, Yajie Miao, Florian Metze, and Alex Waibel, “Extracting deep bottleneck features using stacked auto-encoders,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 3377–3381.
  • [8] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter, “Fast and accurate deep network learning by exponential linear units (elus),” Computer Science, 2015.
  • [9] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan, “Iemocap: Interactive emotional dyadic motion capture database,” Language resources and evaluation, vol. 42, no. 4, pp. 335, 2008.
  • [10] Florian Eyben, “Opensmile: the munich versatile and fast open-source audio feature extractor,” in ACM International Conference on Multimedia, 2010, pp. 1459–1462.
  • [11] Björn Schuller, Stefan Steidl, Anton Batliner, Felix Burkhardt, Laurence Devillers, Christian A. Müller, and Shrikanth S. Narayanan, “The interspeech 2010 paralinguistic challenge,” in INTERSPEECH 2010, Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September, 2010, pp. 2794–2797.
  • [12] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” Computer Science, 2014.
  • [13] Viktor Rozgic, Sankaranarayanan Ananthakrishnan, Shirin Saleem, Rohit Kumar, and Rohit Prasad, “Ensemble of svm trees for multimodal emotion recognition,” in Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), 2012 Asia-Pacific. IEEE, 2012, pp. 1–4.
  • [14] Mohit Shah, Chaitali Chakrabarti, and Andreas Spanias, “A multi-modal approach to emotion recognition using undirected topic models,” in Circuits and Systems (ISCAS), 2014 IEEE International Symposium on. IEEE, 2014, pp. 754–757.
  • [15] Soujanya Poria, Iti Chaturvedi, Erik Cambria, and Amir Hussain, “Convolutional mkl based multimodal emotion recognition and sentiment analysis,” in Data Mining (ICDM), 2016 IEEE 16th International Conference on. IEEE, 2016, pp. 439–448.
  • [16] Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency, “Context-dependent sentiment analysis in user-generated videos,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, vol. 1, pp. 873–883.
  • [17] Seyedmahdad Mirsamadi, Emad Barsoum, and Cha Zhang, “Automatic speech emotion recognition using recurrent neural networks with local attention,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 2227–2231.
  • [18] H. M. Fayek, M Lech, and L Cavedon, “Evaluating deep learning architectures for speech emotion recognition.,” Neural Networks, 2017.