Transfer learning from High-Resource to Low-Resource Language Improves Speech Affect Recognition Classification Accuracy

03/04/2021 ∙ by Sara Durrani, et al. ∙ FAST University

Speech Affect Recognition is the problem of extracting emotional affects from audio data. Corpora for low-resource languages are rare, and affect recognition is a difficult task in cross-corpus settings. We present an approach in which the model is trained on a high-resource language and fine-tuned to recognize affects in a low-resource language. We train the model in the same-corpus setting on SAVEE, EMOVO, Urdu, and IEMOCAP, achieving baseline accuracies of 60.45, 68.05, 80.34, and 56.58 percent respectively. To capture the diversity of affects across languages, cross-corpus evaluations are discussed in detail. We find that accuracy improves when target-domain data is added to the training data. Finally, we show that performance improves for low-resource language speech affect recognition, achieving a UAR of 69.32 and 68.2 for Urdu and Italian speech affects.




1 Introduction

Speech affect analysis is an open research problem that is making an impact on the research community. Work on speech recognition and detection has been under way for many years (Karsten et al., 2007). Automatic speech recognition systems work by identifying different speakers (Javed et al., 2020a). The models developed can be speaker-dependent or speaker-independent. These acoustic models give good results and fulfill several domain requirements of different applications (Tariq et al., 2019). The problem arises when these models have to deal with speech data from speakers of different age groups, genders, accents, and languages (Asad et al., 2020). In this context, language is a major barrier for many speech recognition systems. In real-time scenarios, speech carries different affects that strongly influence the performance of speech recognition systems (Dilawar et al., 2018). These affects include sadness, anger, disgust, happiness, and many more (Beg and Van Beek, 2010). Speech affect recognition has applications in multiple domains, including call centers (Burkhardt et al., 2006), face affect recognition (Jain et al., 2018), smart classrooms, human behaviour analysis, and improving the customer shopping experience (Vidrascu and Devillers, 2005). Affects in speech are also used to track depression and mental pressure (Bangash et al., 2017) in smart home and office environments (Huang et al., 2019).

Early work relied on the continuous Hidden Markov Model and the Gaussian Mixture Model. Schuller et al. (2003) discuss two methods. The first is a global static framework that uses features derived from the speech signal (Javed et al., 2020b). The second extracts low-level instantaneous features and applies a continuous Hidden Markov Model.

Feature engineering has become more advanced, capturing more of the information used by the machine learning techniques applied to speech affect recognition and analysis (Uzair et al., 2019). A support vector machine (Pan et al., 2012) trained on such features, including linear prediction cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCCs) (Logan and others, 2000), speech energy, and pitch, achieved 91 percent accuracy on a Chinese database (Sahar et al., 2019). Speech affect recognition accuracy has been boosted by the advent of deep neural architectures (Beg and Beek, 2013). The automatic feature selection property of these networks has made the task easier (Awan and Beg, 2021). The deep neural architecture (Han et al., 2014) extracts high-level features and produces probability distributions from a deep neural network. An extreme learning machine, a network with a single hidden layer, is then fed utterance-level features and identifies the hidden emotions (Naeem et al., 2020a). Deep neural networks have further been explored together with LSTMs for this speech task (Zhao et al., 2019). The 1D and 2D convolutional neural networks learn local and global affect features from speech and spectrograms (Alvi et al., 2017). This architecture consists of two feature learning blocks: a max-pooling layer and one convolutional layer. Overall, the network takes advantage of both the LSTM and the CNN (Zafar et al., 2020). LSTM networks have been explored in depth for this task to capture richer features and improve accuracy. Xie et al. (2019) introduce an attention mechanism with the LSTM for processing time-series signal information. This work increased the accuracy on a standard emotion corpus (Khawaja et al., 2018).

An attention-based bidirectional LSTM with multiple streams is proposed in (Chiba et al., 2020) for individual temporal speech parameters. An advanced form of convolutional network, the temporal convolutional network, is introduced in (Liu et al., 2020b) together with a vector quantization variational encoder. The encoder is trained in an unsupervised manner on a large amount of unlabeled data. Because such evaluations are limited to a single corpus, transfer learning has been proposed and is widely used (Arshad et al., 2019). Feng and Chaspari (2020) modify the loss of a Siamese neural network for training in a transfer learning setting; results are obtained through a distance loss between same-class and different-class pairs. In addition, problems arise when systems are tested on cross-language data. Liu et al. (2020a) address cross-corpus speech affect recognition with an architecture of two modules: first, features are extracted, and then a domain-adaptive layer is introduced. Cross-corpus experimentation and evaluation affect the performance of affect recognition systems (Farooq et al., 2019).

A variety of speech corpora now exist, with data in different languages, advanced emotions, and diverse labeling. Different populations speak different languages. On record, 389 languages are each spoken by at least one million people, and together they cover 94.1 percent of the world's population. It is very difficult to obtain large data sets for every language for model training (Zafar et al., 2019), and the data is inadequate for every low-resource language. Researchers and speech system developers therefore face a shortage of data (Javed et al., 2019). It seems impossible for a single model trained on a single-language corpus to capture all variations, dynamics, emotions, and affects (Beg et al., 2019). The model must generalize to work effectively.

Figure 1: A general view of speech affect recognition: the speech signal is input to a CNN, which recognizes the different speech affects.

In automatic speech recognition, it is often overlooked that a model trained on a single corpus will fail in cross-corpus settings (Naeem et al., 2020b). To resolve this problem, transfer learning is introduced into deep neural architectures. Transfer learning transfers domain knowledge from the source domain to the target domain. The technique is especially helpful where little or no labeled data is available (Majeed et al., 2020). Transfer learning increases accuracy in speech recognition for low-resource languages. Because of its capacity to capture speech features and its adaptability to cross-corpus settings, we use an encoder-decoder model with attention (Zahid et al., 2020). Transfer learning is applied to convolutional neural networks followed by an LSTM with an attention mechanism. In this work, we address the above-mentioned challenges by analyzing results in same-corpus, cross-corpus, and multilingual settings.

2 Methodology

We use the encoder-decoder model with attention (Bansal et al., 2018) in all experiment settings to transfer domain knowledge. This allows us to transfer trained parameters between the chosen models. Learning through transfer is flexible and carries domain knowledge from a high-resource language to a low-resource one. The hyperparameters can also be set so that the model fits the available computational resources. We train an English model on the IEMOCAP data set and make it available for the different evaluations and for parameter transfer. For the Affect SAR Model (IEMOCAP-Urdu), we take the IEMOCAP pre-trained model and retrain it with 320 Urdu samples. During this training the parameters are updated, which transfers domain knowledge from the high-resource language to the low-resource one. Only the encoder parameters are updated, because knowledge of the speech signal is shared across languages and can be processed in the same way.
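
The parameter transfer described above can be illustrated with a minimal sketch. It is not the authors' implementation: the parameter names, the toy gradient values, and the `transfer_parameters` and `fine_tune_step` helpers are hypothetical, and a plain gradient step on a dictionary stands in for a real deep learning framework. The sketch copies a pre-trained parameter set and then updates only the entries whose names start with `encoder.` on low-resource data.

```python
import copy

def transfer_parameters(pretrained):
    """Copy every pre-trained parameter so the source model stays intact."""
    return copy.deepcopy(pretrained)

def fine_tune_step(params, grads, lr=0.1, trainable_prefix="encoder."):
    """Apply one gradient-descent step to the trainable parameters only.

    Parameters whose name does not start with `trainable_prefix` are frozen
    and returned unchanged.
    """
    updated = {}
    for name, values in params.items():
        if name.startswith(trainable_prefix):
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
        else:
            updated[name] = list(values)
    return updated

# Hypothetical parameters of a model pre-trained on the high-resource corpus.
pretrained = {
    "encoder.conv": [0.5, -0.2],
    "encoder.lstm": [0.1, 0.3],
    "decoder.out": [0.7, -0.4],
}
# Hypothetical gradients computed on a batch of low-resource samples.
grads = {
    "encoder.conv": [1.0, 1.0],
    "encoder.lstm": [-1.0, 0.0],
    "decoder.out": [2.0, 2.0],
}

target = fine_tune_step(transfer_parameters(pretrained), grads)
print(target["encoder.conv"])  # moved by the update
print(target["decoder.out"])   # unchanged: frozen during fine-tuning
```

Selecting trainable parameters by name prefix keeps the choice of what to freeze in one place, so the same step could be reused to compare encoder-only and decoder-only transfer.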

Figure 2: The encoder-decoder architecture for speech affect recognition. The speech signal is the input to the encoder, and the decoder generates class labels.

3 Experimental Setup

3.1 Speech Data sets

For speech affect recognition across languages, we selected four publicly available speech data sets in different languages to capture maximum diversity. These data sets are annotated differently and contain recordings of basic and advanced speech affects. The affects are studied in depth by considering the important positive and negative classes for this classification problem. Table 1 gives the details of the selected data sets.

Data Set | Language | Recordings | Affects List                                        | References
SAVEE    | English  | 480        | Anger, Sad, Neutral, Happy, Surprise, Fear, Disgust | (Jackson and Haq, 2014)
EMOVO    | Italian  | 588        | Anger, Sad, Neutral, Joy, Surprise, Fear, Disgust   | (Costantini et al., 2014)
URDU     | Urdu     | 400        | Anger, Sad, Neutral, Happy                          | (Latif et al., 2018a)
IEMOCAP  | English  | 5531       | Anger, Sad, Neutral, Happy, Excited                 | (Busso et al., 2008)
Table 1: Corpus names, languages, number of utterances, recorded affects, and references. The class labels are assigned according to the genre of affects.

All data sets are relevant and well organized. IEMOCAP is the largest; it consists of five sessions comprising audio, video and images. The sessions are recorded by five male-female pairs. Gold labels are assigned to the data by crowdsourcing. The data is gathered in a controlled environment and acted under specific conditions, and thus yields high-accuracy results.

3.2 Speech Features Details

Feature extraction is a crucial part of developing a good model. We use Chromagram and Tonnetz feature sets, which are widely used for a distinguishable representation of pitch and harmony. Spectral contrast in particular shows the detailed spectrum of a sound, in contrast to MFCC spectrograms. Tonnetz captures the pitch and harmony classes of a sound. The tonal centroids of a sound are measured in tonal centroid space (Harte et al., 2006). The pitch classes and their harmonic relations are studied in depth in the form of harmonic network representations. This feature set includes frequency, spectral information, pitch, energy, and static and dynamic variations.
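
As a rough illustration of the pitch-class and tonal-centroid ideas above, the sketch below folds frequencies into a 12-bin chroma vector and projects it into the six-dimensional tonal centroid space of Harte et al. (2006). The function names and the toy input are assumptions, the radii (1, 1, 0.5) are taken to be those of that paper, and this is not the feature pipeline used in the experiments, which would normally rely on an audio analysis library applied to real spectra.

```python
import math

def pitch_class(freq_hz, a4_hz=440.0):
    """Map a frequency to a pitch class 0..11 (C=0, ..., A=9, B=11)."""
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return midi % 12

def chroma_vector(partials):
    """Sum the magnitudes of (frequency, magnitude) pairs per pitch class."""
    chroma = [0.0] * 12
    for freq_hz, magnitude in partials:
        chroma[pitch_class(freq_hz)] += magnitude
    return chroma

def tonal_centroid(chroma, r_fifths=1.0, r_minor=1.0, r_major=0.5):
    """Project a chroma vector onto the circles of fifths, minor thirds
    and major thirds, after L1 normalisation."""
    total = sum(abs(c) for c in chroma)
    if total == 0:
        return [0.0] * 6
    centroid = [0.0] * 6
    for k, c in enumerate(chroma):
        w = c / total
        centroid[0] += w * r_fifths * math.sin(k * 7 * math.pi / 6)
        centroid[1] += w * r_fifths * math.cos(k * 7 * math.pi / 6)
        centroid[2] += w * r_minor * math.sin(k * 3 * math.pi / 2)
        centroid[3] += w * r_minor * math.cos(k * 3 * math.pi / 2)
        centroid[4] += w * r_major * math.sin(k * 2 * math.pi / 3)
        centroid[5] += w * r_major * math.cos(k * 2 * math.pi / 3)
    return centroid

# Toy input: an A major triad (A4, C#5, E5) with equal magnitudes.
partials = [(440.0, 1.0), (554.37, 1.0), (659.25, 1.0)]
chroma = chroma_vector(partials)
centroid = tonal_centroid(chroma)
print(chroma)
print(centroid)
```

Each pair of centroid coordinates places the sound on one circle of harmonic relations, so harmonically close sounds end up close together in the six-dimensional space.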

4 Experimental Results

This section describes the different cross-corpus settings and compares our results with those of existing techniques. The possible scenarios are explored and studied in detail.

4.1 Proposed Baseline Model Results for single corpus

We trained our model with the proposed approach on each corpus to set the baseline accuracy results. The performance of our approach is compared with closely related work using Deep Belief Networks (Latif et al., 2018b) and a sparse autoencoder with SVM using transfer learning in speech emotion recognition (Deng et al., 2013). We train the model on 80 percent of each corpus and test it on the remaining 20 percent of unseen data. Figure 3 shows the comparison; our approach outperforms the other two in accuracy.
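
A minimal sketch of an 80/20 split is given below. The text does not say whether the split is stratified; keeping the same share of every affect in both parts is an assumption made here, and the utterance identifiers, the balanced toy corpus, and the `stratified_split` helper are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(samples, train_fraction=0.8, seed=0):
    """Split (utterance_id, label) pairs so each label keeps the same train share."""
    by_label = defaultdict(list)
    for item in samples:
        by_label[item[1]].append(item)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        cut = int(round(train_fraction * len(items)))
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test

# Hypothetical corpus: 400 utterances spread evenly over four affects.
labels = ["Anger", "Sad", "Neutral", "Happy"]
corpus = [(f"utt_{i:03d}", labels[i % 4]) for i in range(400)]
train, test = stratified_split(corpus)
print(len(train), len(test))  # 320 80
```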

Figure 3: The baseline accuracy comparison on different corpora for our Affect Recognition Model, the Deep Belief Network, and the sparse autoencoder with SVM.

4.2 Cross Corpus Setting

In the cross-corpus setting, we use IEMOCAP and EMOVO for training the model. The overall task is to use one language data set for training and the remaining corpora for testing. The remaining corpora, which include SAVEE and Urdu, are used for evaluation. Cross-corpus settings are well suited to testing generalization across language barriers, since this evaluation makes models more robust and leaves room for improving results. We compared the recognition rates across languages. Figures 4 and 5 show the recognition rates for our model and the two existing approaches when the model is trained on IEMOCAP and EMOVO respectively.

Figure 4: The comparison results of approaches in a cross-corpus setting when the Affect Recognition Model is trained using the IEMOCAP data set.
Figure 5: The comparison results of approaches in a cross-corpus setting when the Affect Recognition Model is trained using the EMOVO data set.

The results show that the Affect Recognition Model outperforms the two existing approaches in accuracy and performs better in cross-corpus settings.

4.3 Transfer Learning Results

Transfer learning enables us to use corpora of different languages jointly for training, which improves model performance across corpora. We use IEMOCAP and Urdu for training and test the model on EMOVO, SAVEE and Urdu. The Urdu data set is divided into 80 percent for training (the 320 samples used for retraining) and 20 percent for testing. IEMOCAP is a large data set, so we use three sessions for training and leave the rest for evaluation. Evaluation uses three-fold cross-validation on the specified test corpora, i.e. EMOVO, SAVEE and Urdu.
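
The three-fold evaluation mentioned above can be sketched as follows. How the folds were actually drawn is not stated in the text, so contiguous folds over sample indices are an assumption, and `k_fold_indices` is a hypothetical helper rather than part of the experimental code.

```python
def k_fold_indices(n_samples, k=3):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, k=3))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

Every sample appears in exactly one test fold, so averaging the three fold scores uses each utterance of the corpus once for testing.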

4.4 Evaluations

UAR: The unweighted average recall (UAR) score is also calculated for Speech Affect Recognition. UAR is the mean of the recall values computed for each class. It gives a simple measure of accuracy on a data set whose classes have imbalanced sample counts. Table 2 shows the transfer learning results, with a UAR of 69.32 for Urdu and 68.52 for English (SAVEE); the score for Italian (EMOVO) is listed alongside them. These results show that useful performance can be obtained for languages with little annotated data and limited training capacity. Transfer learning thus helps most where resources are scarce.

Figure 6: The comparison results of approaches in a transfer learning setting when the affect recognition model (IEMOCAP-Urdu) is trained.

Data Set | UAR
SAVEE    | 68.52
EMOVO    | 70.41
Urdu     | 69.32
Table 2: The UAR scores of the selected corpora.
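
The UAR metric described in this section can be sketched in a few lines. The toy labels below are made up for illustration and are not taken from the experiments; the example contrasts UAR with plain accuracy on an imbalanced label set.

```python
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly, regardless of class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy imbalanced example: 8 "Neutral" and 2 "Anger" utterances.
y_true = ["Neutral"] * 8 + ["Anger"] * 2
y_pred = ["Neutral"] * 8 + ["Neutral", "Anger"]
print(accuracy(y_true, y_pred))                   # 0.9
print(unweighted_average_recall(y_true, y_pred))  # 0.75
```

Accuracy is dominated by the majority class, while UAR drops to 0.75 because half of the minority class is missed.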

5 Analysis

The experiments we performed on Speech Affect Recognition allow us to note several key points. Transfer learning outperformed the baseline results of same-corpus training. Accuracy for all data sets is higher in the transfer learning setting than in the cross-corpus and baseline settings (even though the baseline model is trained and tested on the same corpus). The low-resource languages in our study, Urdu and Italian, score high classification accuracy compared with the high-resource language, English. Transferring the encoder parameters together with the attention parameters proved most effective. Transferring the decoder parameters alone does not improve accuracy. This dependency suggests that decoder parameters without the encoder might not transfer enough knowledge.
We also found that adding target-domain data during training improves classification accuracy. However, the performance of systems and models drops due to several associated factors: speech affects are highly sensitive to age, gender, noise, and language diversity. We find that transfer learning solves this problem to a great extent, but for higher accuracy, domain data from a variety of languages can be added to the training data.

6 Conclusions and Future Work

In this paper, we evaluate the performance of an encoder-decoder model with attention for Speech Affect Recognition in same-corpus, cross-corpus, and transfer learning settings. The detailed experiments show that the transfer learning based model outperforms the existing approaches in these settings. We experiment on four corpora across different languages, transferring high-resource language features and domain knowledge to low-resource languages. This will be helpful in building applications specific to Speech Affect Recognition. Moreover, the technique also helps when little or no labeled low-resource language data is available. We show that it is possible to improve results with this approach. In future work, we aim to address advanced speech affects in low-resource languages to capture their high diversity and improve the Speech Affect Recognition rate.


References

  • H. M. Alvi, H. Sahar, A. A. Bangash, and M. O. Beg (2017) Ensights: a tool for energy aware software development. In 2017 13th International Conference on Emerging Technologies (ICET), pp. 1–6. Cited by: §1.
  • M. U. Arshad, M. F. Bashir, A. Majeed, W. Shahzad, and M. O. Beg (2019) Corpus for emotion detection on roman urdu. In 2019 22nd International Multitopic Conference (INMIC), pp. 1–6. Cited by: §1.
  • M. Asad, M. Asim, T. Javed, M. O. Beg, H. Mujtaba, and S. Abbas (2020) DeepDetect: detection of distributed denial of service attacks using deep learning. The Computer Journal 63 (7), pp. 983–994. Cited by: §1.
  • M. N. Awan and M. O. Beg (2021) TOP-rank: a topicalpostionrank for extraction and classification of keyphrases in text. Computer Speech & Language 65, pp. 101116. Cited by: §1.
  • A. A. Bangash, H. Sahar, and M. O. Beg (2017) A methodology for relating software structure with energy consumption. In 2017 IEEE 17th International Working Conference on Source Code Analysis and Manipulation (SCAM), pp. 111–120. Cited by: §1.
  • S. Bansal, H. Kamper, K. Livescu, A. Lopez, and S. Goldwater (2018) Low-resource speech-to-text translation. Proc. Interspeech 2018, pp. 1298–1302. Cited by: §2.
  • M. Beg and P. v. Beek (2013) A constraint programming approach for integrated spatial and temporal scheduling for clustered architectures. ACM Transactions on Embedded Computing Systems (TECS) 13 (1), pp. 1–23. Cited by: §1.
  • M. O. Beg, M. N. Awan, and S. S. Ali (2019) Algorithmic machine learning for prediction of stock prices. In FinTech as a Disruptive Technology for Financial Institutions, pp. 142–169. Cited by: §1.
  • M. Beg and P. Van Beek (2010) A graph theoretic approach to cache-conscious placement of data for direct mapped caches. In Proceedings of the 2010 international symposium on Memory management, pp. 113–120. Cited by: §1.
  • F. Burkhardt, J. Ajmera, R. Englert, J. Stegmann, and W. Burleson (2006) Detecting anger in automated voice portal dialogs. In Ninth International Conference on Spoken Language Processing, Cited by: §1.
  • C. Busso, M. Bulut, C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan (2008) IEMOCAP: interactive emotional dyadic motion capture database. Language resources and evaluation 42 (4), pp. 335. Cited by: Table 1.
  • Y. Chiba, T. Nose, and A. Ito (2020) Multi-stream attention-based blstm with feature segmentation for speech emotion recognition. Power 1 (2), pp. 3. Cited by: §1.
  • G. Costantini, I. Iaderola, A. Paoloni, and M. Todisco (2014) EMOVO corpus: an italian emotional speech database. In International Conference on Language Resources and Evaluation (LREC 2014), pp. 3501–3504. Cited by: Table 1.
  • J. Deng, Z. Zhang, E. Marchi, and B. Schuller (2013) Sparse autoencoder-based feature transfer learning for speech emotion recognition. In 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, pp. 511–516. Cited by: §4.1.
  • N. Dilawar, H. Majeed, M. O. Beg, N. Ejaz, K. Muhammad, I. Mehmood, and Y. Nam (2018) Understanding citizen issues through reviews: a step towards data informed planning in smart cities. Applied Sciences 8 (9), pp. 1589. Cited by: §1.
  • M. U. Farooq, S. U. R. Khan, and M. O. Beg (2019) MELTA: a method level energy estimation technique for android development. In 2019 International Conference on Innovative Computing (ICIC), pp. 1–10. Cited by: §1.
  • K. Feng and T. Chaspari (2020) A siamese neural network with modified distance loss for transfer learning in speech emotion recognition. In Proc. Conference on Artificial Intelligence (AAAI 2020), February 7, 2020, pp. 29–35. Cited by: §1.
  • K. Han, D. Yu, and I. Tashev (2014) Speech emotion recognition using deep neural network and extreme learning machine. In Fifteenth annual conference of the international speech communication association, Cited by: §1.
  • C. Harte, M. Sandler, and M. Gasser (2006) Detecting harmonic change in musical audio. In Proceedings of the 1st ACM workshop on Audio and music computing multimedia, pp. 21–26. Cited by: §3.2.
  • Z. Huang, J. Epps, and D. Joachim (2019) Speech landmark bigrams for depression detection from naturalistic smartphone speech. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5856–5860. Cited by: §1.
  • P. Jackson and S. Haq (2014) Surrey audio-visual expressed emotion (savee) database. University of Surrey: Guildford, UK. Cited by: Table 1.
  • N. Jain, S. Kumar, A. Kumar, P. Shamsolmoali, and M. Zareapoor (2018) Hybrid deep neural networks for face emotion recognition. Pattern Recognition Letters 115, pp. 101–106. Cited by: §1.
  • A. R. Javed, M. O. Beg, M. Asim, T. Baker, and A. H. Al-Bayatti (2020a) AlphaLogger: detecting motion-based side-channel attack using smartphone keystrokes. Journal of Ambient Intelligence and Humanized Computing, pp. 1–14. Cited by: §1.
  • A. R. Javed, M. U. Sarwar, M. O. Beg, M. Asim, T. Baker, and H. Tawfik (2020b) A collaborative healthcare framework for shared healthcare plan with ambient intelligence. Human-centric Computing and Information Sciences 10 (1), pp. 1–21. Cited by: §1.
  • H. T. Javed, M. O. Beg, H. Mujtaba, H. Majeed, and M. Asim (2019) Fairness in real-time energy pricing for smart grid using unsupervised learning. The Computer Journal 62 (3), pp. 414–429. Cited by: §1.
  • M. Karsten, S. Keshav, S. Prasad, and M. Beg (2007) An axiomatic basis for communication. ACM SIGCOMM Computer Communication Review 37 (4), pp. 217–228. Cited by: §1.
  • H. S. Khawaja, M. O. Beg, and S. Qamar (2018) Domain specific emotion lexicon expansion. In 2018 14th International Conference on Emerging Technologies (ICET), pp. 1–5. Cited by: §1.
  • S. Latif, A. Qayyum, M. Usman, and J. Qadir (2018a) Cross lingual speech emotion recognition: urdu vs. western languages. In 2018 International Conference on Frontiers of Information Technology (FIT), pp. 88–93. Cited by: Table 1.
  • S. Latif, R. Rana, S. Younis, J. Qadir, and J. Epps (2018b) Transfer learning for improving speech emotion classification accuracy. Proc. Interspeech 2018, pp. 257–261. Cited by: §4.1.
  • J. Liu, W. Zheng, Y. Zong, C. Lu, and C. Tang (2020a) Cross-corpus speech emotion recognition based on deep domain-adaptive convolutional neural network. IEICE TRANSACTIONS on Information and Systems 103 (2), pp. 459–463. Cited by: §1.
  • J. Liu, Z. Liu, L. Wang, Y. Gao, L. Guo, and J. Dang (2020b) Temporal attention convolutional network for speech emotion recognition with latent representation. Proc. Interspeech 2020, pp. 2337–2341. Cited by: §1.
  • B. Logan et al. (2000) Mel frequency cepstral coefficients for music modeling. In ISMIR, Vol. 270, pp. 1–11. Cited by: §1.
  • A. Majeed, H. Mujtaba, and M. O. Beg (2020) Emotion detection in roman urdu text using machine learning. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering Workshops, pp. 125–130. Cited by: §1.
  • B. Naeem, A. Khan, M. O. Beg, and H. Mujtaba (2020a) A deep learning framework for clickbait detection on social area network using natural language cues. Journal of Computational Social Science, pp. 1–13. Cited by: §1.
  • S. Naeem, M. Iqbal, M. Saqib, M. Saad, M. S. Raza, Z. Ali, N. Akhtar, M. O. Beg, W. Shahzad, and M. U. Arshad (2020b) Subspace gaussian mixture model for continuous urdu speech recognition using kaldi. In 2020 14th International Conference on Open Source Systems and Technologies (ICOSST), pp. 1–7. Cited by: §1.
  • Y. Pan, P. Shen, and L. Shen (2012) Speech emotion recognition using support vector machine. International Journal of Smart Home 6 (2), pp. 101–108. Cited by: §1.
  • H. Sahar, A. A. Bangash, and M. O. Beg (2019) Towards energy aware object-oriented development of android applications. Sustainable Computing: Informatics and Systems 21, pp. 28–46. Cited by: §1.
  • B. Schuller, G. Rigoll, and M. Lang (2003) Hidden markov model-based speech emotion recognition. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings.(ICASSP’03)., Vol. 2, pp. II–1. Cited by: §1.
  • M. Tariq, H. Majeed, M. O. Beg, F. A. Khan, and A. Derhab (2019) Accurate detection of sitting posture activities in a secure iot based assisted living environment. Future Generation Computer Systems 92, pp. 745–757. Cited by: §1.
  • A. Uzair, M. O. Beg, H. Mujtaba, and H. Majeed (2019) Weec: web energy efficient computing: a machine learning approach. Sustainable Computing: Informatics and Systems 22, pp. 230–243. Cited by: §1.
  • L. Vidrascu and L. Devillers (2005) Annotation and detection of blended emotions in real human-human dialogs recorded in a call center. In 2005 IEEE International Conference on Multimedia and Expo, pp. 4–pp. Cited by: §1.
  • Y. Xie, R. Liang, Z. Liang, and L. Zhao (2019) Attention-based dense lstm for speech emotion recognition. IEICE TRANSACTIONS on Information and Systems 102 (7), pp. 1426–1429. Cited by: §1.
  • A. Zafar, H. Mujtaba, M. T. Baig, and M. O. Beg (2019) Using patterns as objectives for general video game level generation. ICGA Journal 41 (2), pp. 66–77. Cited by: §1.
  • A. Zafar, H. Mujtaba, and M. O. Beg (2020) Search-based procedural content generation for gvg-lg. Applied Soft Computing 86, pp. 105909. Cited by: §1.
  • R. Zahid, M. O. Idrees, H. Mujtaba, and M. O. Beg (2020) Roman urdu reviews dataset for aspect based opinion mining. In 2020 35th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW), pp. 138–143. Cited by: §1.
  • J. Zhao, X. Mao, and L. Chen (2019) Speech emotion recognition using deep 1d & 2d cnn lstm networks. Biomedical Signal Processing and Control 47, pp. 312–323. Cited by: §1.