Exploring Wav2vec 2.0 fine-tuning for improved speech emotion recognition

by Li-Wei Chen, et al.
Carnegie Mellon University

While wav2vec 2.0 has been proposed for automatic speech recognition (ASR), it can also be used for speech emotion recognition (SER); its performance can be significantly improved using different fine-tuning strategies. Two baseline methods, vanilla fine-tuning (V-FT) and task adaptive pretraining (TAPT), are first presented. We show that V-FT is able to outperform state-of-the-art models on the IEMOCAP dataset. TAPT, an existing NLP fine-tuning strategy, further improves the performance on SER. We also introduce a novel fine-tuning method termed P-TAPT, which modifies the TAPT objective to learn contextualized emotion representations. Experiments show that P-TAPT performs better than TAPT, especially under low-resource settings. Compared to prior works in this literature, our top-line system achieved a 7.4% absolute improvement in unweighted accuracy (UA) over the state-of-the-art performance on IEMOCAP. Our code is publicly available.




1 Introduction

Speech emotion recognition (SER) remains one of the key components in human-machine interaction and in human communication systems. With the development of deep learning, several attempts [9, 11, 28] have been made to automatically learn emotion representations from audio signals using neural networks. However, the improvement of deep-learning-based systems is often limited by the lack of annotated data. Commonly used SER datasets [3, 14, 17] are relatively small in comparison to automatic speech recognition (ASR) datasets. Moreover, systems trained on these datasets may not generalize well to other domains such as call centers.

Self-supervised pretrained models [6, 16] provide a solution by first learning from a large-scale speech corpus without explicit labeling. The knowledge learned from pretraining can be transferred to downstream tasks by either using the model as a feature extractor or directly fine-tuning the whole model. While first introduced for natural language processing (NLP), several pretrained models [26, 1, 12] have been developed for speech processing. Wav2vec [26] is a multi-layer convolutional neural network (CNN) trained to predict future frames conditioned on past frames by minimizing a contrastive loss. On the other hand, wav2vec 2.0 [1] is a transformer-based model that adopts a masked learning objective to predict missing frames from the remaining context.

Despite the success of these methods in ASR, speaker verification, and mispronunciation detection [1, 8, 23], only a few attempts [2, 27, 24] have been made to apply them to SER. Boigne et al. [2] find that wav2vec features are superior to traditional spectral-based features on SER. Xia et al. [27] compare features extracted with different time spans and conclude that features with longer temporal context, such as wav2vec, perform better on SER. Pepino et al. [24] show that features extracted from a linear combination of layers outperform single-layer representations in wav2vec 2.0 on SER. While these studies demonstrated the usefulness of the pretrained models as feature extractors, little research has been conducted on fine-tuning them for SER.

One persistent issue in fine-tuning pretrained models is the mismatch between the pretraining and target domains [10, 13]. Task adaptive pretraining (TAPT) [10] was proposed to resolve this domain shift by continuing the pretraining process on the target dataset. Hsu et al. [13] show that TAPT greatly improves generalization and robustness on ASR when the pretraining and fine-tuning data are dissimilar. Since the speech in the pretraining ASR corpus differs from emotive speech in multiple regards [22], we consider TAPT a compelling method for fine-tuning on SER.

Figure 1: System overview of our methods. (a) Emotion state estimation phase of P-TAPT. An additional CNN with stride 2 is used to align the time steps between wav2vec and wav2vec 2.0. The output of cluster assignments will be used as pseudo-labels for the P-TAPT objective. (b) Model architecture and pretraining objective of wav2vec 2.0 along with our P-TAPT objective.

In this paper, we explore methods for fine-tuning wav2vec 2.0 on SER. We show that by adding a simple neural network on top of wav2vec 2.0, vanilla fine-tuning (V-FT) outperforms state-of-the-art (SOTA) methods on the IEMOCAP [3] dataset. In addition, with V-FT as a baseline, TAPT significantly boosts the performance of fine-tuning wav2vec 2.0 on SER. Furthermore, motivated by previous works on segment-based emotion features [9, 19, 27] and self-supervised representation learning [12, 4], we develop a novel fine-tuning procedure for SER which yields even better performance, especially in low-resource conditions. Finally, we achieve a 7.4% absolute increase in unweighted accuracy (UA) over the SOTA performance on IEMOCAP.

2 Method

We first review wav2vec 2.0, which serves as the backbone model for the methods we examine. We then present the two baseline methods we established. Finally, we introduce pseudo-label task adaptive pretraining (P-TAPT), a novel method we designed to fine-tune wav2vec 2.0 on SER.

2.1 The wav2vec 2.0 model

Wav2vec 2.0 is a transformer-based model trained to extract contextualized representations from raw audio signals. Figure 1(b) shows the wav2vec 2.0 model architecture and its pretraining objective. It consists of three sub-modules: a feature encoder, a transformer module, and a quantization module. The feature encoder is a multi-layer CNN that processes the input signal into low-level features. The transformer module is then applied to these features to produce contextualized representations. The quantization module discretizes the low-level features using a trainable codebook. To train the model, part of the low-level features are masked from the transformer module, and the objective is to identify the quantized version of the masked features based on their context (footnote 2: there is an additional diversity loss in pretraining which promotes the diversity of the quantization codebook).
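To make the masked-prediction objective concrete, the following is a minimal sketch of a contrastive loss for a single masked time step. It assumes the commonly described form (cosine similarity, temperature-scaled softmax over the true quantized target and a set of distractors). The function names, the temperature value, and the toy vectors are illustrative assumptions, and the diversity loss is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of picking the true quantized target
    among the distractors, given the transformer output `context`."""
    logits = [cosine(context, target) / temperature]
    logits += [cosine(context, d) / temperature for d in distractors]
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_norm - logits[0]

# Toy 2-D vectors (assumed values, for illustration only).
context = [1.0, 0.0]
target = [0.9, 0.1]
distractors = [[0.0, 1.0], [-1.0, 0.0]]

# Loss is small when the context points at the true target ...
loss_good = contrastive_loss(context, target, distractors)
# ... and large when the "target" is a vector the context points away from.
loss_bad = contrastive_loss(context, distractors[1], [target, distractors[0]])
```

The loss shrinks as the contextualized output aligns with the quantized version of the masked frame and moves away from the distractors.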

2.2 Comparing methods

As there is no existing baseline system for fine-tuning wav2vec 2.0 on SER, we created two baselines. One is the conventional fine-tuning method, and the other is task adaptive pretraining, which was first introduced in NLP.

Vanilla fine-tuning. Wav2vec 2.0 differs from its NLP counterparts [6] in that there is no utterance-level pretraining task to naturally form a sentence representation. As a consequence, aggregation across time steps is required to fine-tune on utterance-level classification tasks. We experimented with different configurations and found that using average pooling on the final layer is simple yet effective for SER. Specifically, the final contextualized representation extracted by wav2vec 2.0 is first processed by global average pooling across the time dimension, followed by a ReLU activation and a single linear layer to predict the emotion categories. In addition, a modified version of SpecAugment [21] proposed in wav2vec 2.0 is applied during training for better generalization. We use this architecture for the fine-tuning stage of all three methods. We abbreviate the vanilla fine-tuning method as V-FT.
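The classification head described above can be sketched in a few lines. This is a dependency-free illustration rather than the released implementation: the function name `vft_head`, the toy dimensions, and the random weights are assumptions, and the frame vectors stand in for the final-layer outputs of wav2vec 2.0.

```python
import random

def vft_head(frames, weight, bias):
    """frames: T x D list of contextualized frame vectors.
    weight: C x D, bias: length C. Returns C emotion logits."""
    num_frames = len(frames)
    dim = len(frames[0])
    # Global average pooling across the time dimension.
    pooled = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    # ReLU activation.
    activated = [max(0.0, x) for x in pooled]
    # Single linear layer producing one logit per emotion category.
    return [sum(w * x for w, x in zip(row, activated)) + b
            for row, b in zip(weight, bias)]

random.seed(0)
T, D, C = 50, 8, 4  # toy sizes (assumed); C = 4 matches the IEMOCAP setup
frames = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
weight = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(C)]
bias = [0.0] * C

logits = vft_head(frames, weight, bias)
predicted = max(range(C), key=lambda c: logits[c])  # index of the top class
```

Because pooling removes the time axis, utterances of any length map to a fixed-size vector before the linear layer.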

Task adaptive pretraining. Task adaptive pretraining (TAPT) [10] is a simple but effective method to fine-tune pretrained language models [6] on domain-specific tasks. It bridges the difference between the pretraining and target domain by continuing to pretrain on the target dataset. In this paper, we examine TAPT as one of the methods of fine-tuning wav2vec 2.0 on SER. To distinguish from the original pretraining and fine-tuning stage, we define an intermediate task adaptation stage for the continual pretraining process.

2.3 Pseudo-label task adaptive pretraining

While TAPT adapts to the emotive speech by continual training with the pretraining objective, it does not make use of the emotion labels. Essentially, the contextualized representations obtained will be general features suitable for various downstream tasks. As we only focus on SER, we propose to adapt this objective to generate emotion specific features. Instead of identifying the missing low-level features, we focus on predicting the emotion state of the masked sequence. One advantage it brings is better data efficiency. Reconstruction of missing audio parts is a more complicated task, which makes the model vulnerable to over-fitting. Additionally, it simplifies the fine-tuning stage as it already filters out information unrelated to emotion recognition from the contextualized representation.

However, frame-level emotion states need to be recognized to realize our method. While only utterance-level emotion labels are given for most SER datasets, several studies [27, 9, 19] indicate that frame-level emotion information can still be inferred by training with a segment-based classification objective. In particular, as shown in Figure 1(a), we fine-tune wav2vec to extract frame-level emotion representations that are useful for predicting an utterance-level emotion label. We find that using CNN architectures such as wav2vec is important since the locality of CNNs preserves sequential structure. After training, we run the k-means clustering algorithm [18] on all of the extracted representations from the target dataset. As Caron et al. [4] have shown, the k-means cluster assignments on intermediate layers of CNN classifiers can capture information related to the target labels. Therefore, we interpret this cluster assignment as a pseudo-label that represents the local emotion state.
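The pseudo-labeling step can be illustrated with a plain implementation of Lloyd's k-means. This is a sketch under simplifying assumptions: the 2-D toy points stand in for the frame-level representations extracted by the fine-tuned wav2vec model, and the function name, iteration count, and seeds are hypothetical.

```python
import random

def kmeans_labels(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy stand-in for frame-level emotion representations (2-D for readability).
rng = random.Random(1)
frames = [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(30)]
frames += [[rng.gauss(5, 0.1), rng.gauss(5, 0.1)] for _ in range(30)]

# One pseudo-label per frame; these would serve as P-TAPT targets.
pseudo_labels = kmeans_labels(frames, k=2)
```

Each frame receives a cluster index, and frames with similar representations share the same pseudo-label.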

We replace the TAPT objective with our new P-TAPT objective. We add a position-wise linear head composed of two linear layers to predict the cluster assignments of the masked frames. In practice, we run k-means clustering multiple times with different numbers of clusters, and our model predicts an ensemble of cluster assignments with multiple linear heads. This cluster ensemble technique has been shown to facilitate representation learning in HuBERT [12], a recently developed self-supervised speech representation learning model.
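A minimal sketch of such an ensemble objective is given below: each head is scored with a cross-entropy against the pseudo-labels of its own clustering, and only masked frames contribute. Averaging over heads and masked frames is an assumption made here for illustration, as are the function names and toy logits.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one frame's logits against its pseudo-label."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_norm - logits[label]

def ptapt_loss(head_logits, head_labels, masked):
    """head_logits[h][t]: logits of head h at frame t (length k_h).
    head_labels[h][t]: k-means pseudo-label of frame t for clustering h.
    masked: indices of frames hidden from the transformer.
    Averages the loss over heads and masked frames (an assumption)."""
    total = 0.0
    for logits, labels in zip(head_logits, head_labels):
        total += sum(cross_entropy(logits[t], labels[t]) for t in masked)
    return total / (len(head_logits) * len(masked))

# Two hypothetical heads: one for k=2 clusters, one for k=3 clusters.
head_logits = [
    [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]],
    [[3.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 3.0]],
]
head_labels = [[0, 1, 0], [0, 1, 2]]

# Frame 2 is unmasked, so it does not contribute to the loss.
loss = ptapt_loss(head_logits, head_labels, masked=[0, 1])
```

Each head has its own output size because each clustering uses a different number of clusters.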

3 Experimental Setup

3.1 Dataset

We use two datasets for evaluation, IEMOCAP [3] and SAVEE [14]. We only use the speech modality.

IEMOCAP. Interactive Emotional Dyadic Motion Capture (IEMOCAP) is a popular dataset for evaluating SER systems. It contains five recording sessions, each with one male speaker and one female speaker. To compare with previous works, we use the default labels provided by IEMOCAP but consider only four emotion categories: neutral, sad, angry, and happy. In particular, the “excited” category is merged with “happy” due to its sparsity in the dataset. The total amount of speech is about 7 hours.

SAVEE. The Surrey Audio-Visual Expressed Emotion (SAVEE) dataset contains four male speakers: DC, JE, JK, and KL. Each speaker reads out the same set of 120 sentences labeled with one of the 7 emotion categories: angry, disgust, sad, fear, happy, surprise, and neutral. We use all of the emotion categories, which results in 480 utterances with a total of 30 minutes of speech.

3.2 Training and evaluation procedure

All experiments use the same learning rate with the Adam optimizer [15]. For the wav2vec model, we use a pretrained model developed by Facebook AI (footnote 3: https://github.com/pytorch/fairseq/tree/master/examples/wav2vec). We build our wav2vec 2.0 implementation on top of the huggingface implementation and adopt a pretrained model checkpoint from Facebook AI (footnote 4: https://huggingface.co/facebook/wav2vec2-base). Both models are pretrained on the LibriSpeech 960h speech corpus [20] without transcriptions. We evaluate our systems using unweighted accuracy (UA) [11] under a speaker-independent setting; the speakers in the test set are excluded from the training data. Additional implementation details are provided in our GitHub repository. We run each experiment 5 times for the full IEMOCAP dataset and 20 times for SAVEE and for the sub-sampled versions of IEMOCAP. Additionally, we observe that wav2vec 2.0 fails to converge with some of the random seeds. We therefore discard and rerun outlier runs whose performance is more than two standard deviations from the mean.
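For reference, unweighted accuracy is the mean of the per-class recalls, so each emotion category contributes equally regardless of how many utterances it has. The sketch below illustrates this with made-up labels; the function name and the toy predictions are assumptions.

```python
from collections import defaultdict

def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every emotion class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Made-up labels for an imbalanced four-class example.
y_true = ["neu", "neu", "neu", "neu", "sad", "sad", "ang", "hap"]
y_pred = ["neu", "neu", "neu", "sad", "sad", "neu", "ang", "neu"]

ua = unweighted_accuracy(y_true, y_pred)
# Plain (weighted) accuracy for comparison: fraction of correct utterances.
wa = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case the per-class recalls are 0.75, 0.5, 1.0 and 0.0, so UA is lower than plain accuracy because the missed minority class is not outweighed by the majority class.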

IEMOCAP. To have a fair comparison with the majority of previous works, we split the dataset by leaving one session out as the test set; the remaining four sessions are used for training. Note that most of the papers using IEMOCAP do not explicitly define their validation set [7]. We therefore train on all four sessions for a fixed 15 epochs without validation, using a batch size of 64. The number of epochs is chosen so that each of our competing methods converges in terms of training loss.

SAVEE. A similar evaluation procedure is used for SAVEE. In each fold one speaker is left out for testing and the remaining three are used for training. We increase the number of training epochs to 30 and batch size is halved to 32 for the smaller training set.
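Both protocols are instances of leave-one-group-out cross-validation, with the session (IEMOCAP) or the speaker (SAVEE) as the group. A minimal sketch follows; the helper name and the per-utterance session ids are hypothetical.

```python
def leave_one_group_out(groups):
    """Yield (held_out, train_idx, test_idx) with each group held out once."""
    for held_out in sorted(set(groups)):
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train_idx, test_idx

# Hypothetical per-utterance session ids for an IEMOCAP-style corpus.
sessions = [1, 1, 2, 3, 3, 4, 5, 5]
folds = list(leave_one_group_out(sessions))

for held_out, train_idx, test_idx in folds:
    print(f"test session {held_out}: train={train_idx} test={test_idx}")
```

Passing speaker ids such as "DC", "JE", "JK" and "KL" instead of session numbers gives the SAVEE variant.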

4 Results and Discussion

4.1 Comparison of fine-tuning methods

Table 1 compares performance for the fine-tuning methods on IEMOCAP. For all sessions except the first, TAPT yields a noticeable improvement over V-FT, and P-TAPT performs better than TAPT for all sessions. On the other hand, Table 2 shows that on SAVEE, both TAPT and P-TAPT outperform V-FT by a large margin. However, the performance of P-TAPT is very close to that of TAPT on SAVEE. We analyze these results by considering the characteristics of SAVEE and IEMOCAP.

Domain shift and linguistic content. We first quantify the domain shift between both datasets and the pretraining dataset. We take the wav2vec 2.0 model pretrained on LibriSpeech and calculate the pretraining loss on both datasets along with the test set of LibriSpeech.

Session   1     2     3     4     5     Mean
P-TAPT    72.8  80.2  71.0  73.6  73.7  74.3
Table 1: Comparison of methods on IEMOCAP in UA (%)

Speaker DC JE JK KL Mean
TAPT 69.9 49.7 71.1
P-TAPT 86.7 84.2
Table 2: Comparison of methods on SAVEE in UA (%)

Table 3 verifies the presence of domain shift on both datasets, which leaves room for TAPT to improve performance. The smaller loss on SAVEE indicates that it is closer to LibriSpeech, as the model already generalizes well to SAVEE. However, the improvement from TAPT is larger on SAVEE than on IEMOCAP despite the smaller domain shift. We observe a strong correlation between linguistic content and emotion labels in SAVEE (footnote 5: two-thirds of the sentences are specific to one emotion and shared across all speakers). We conjecture that our model captures this correlation, which lets it surpass human evaluators who annotate emotion from only para-linguistic information. This also explains why P-TAPT does not further improve on TAPT, as the TAPT objective is already suitable for modeling linguistic information. Nonetheless, in more naturally elicited emotional conversations (IEMOCAP), P-TAPT performs better than TAPT.

Data efficiency. We also investigated the behavior of our methods when presented with different amounts of training data. Specifically, we fix session 5 of IEMOCAP as the held-out test set and gradually halve the amount of training examples in the remaining four sessions by random selection. We compare TAPT and P-TAPT using the ratio of their corresponding improvements over V-FT. A lower ratio indicates that the improvement from P-TAPT is larger relative to that from TAPT. As shown in Table 4, this ratio is lower under low-resource settings with one hour or less of training data. Thus P-TAPT is more data-efficient than TAPT. As mentioned in Section 2.3, we attribute this to the change of objective from reconstruction of audio frames to prediction of emotion states, which is less data-intensive though it requires labeled data.
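One plausible reading of this ratio is TAPT's improvement over V-FT divided by P-TAPT's improvement over V-FT, expressed as a percentage. The sketch below follows that reading, which is an interpretation of the description above; the UA values used are hypothetical and are not taken from the tables.

```python
def improvement_ratio(ua_vft, ua_tapt, ua_ptapt):
    """Improvement of TAPT over V-FT as a percentage of the improvement
    of P-TAPT over V-FT (assumed definition of the ratio)."""
    gain_tapt = ua_tapt - ua_vft
    gain_ptapt = ua_ptapt - ua_vft
    return 100.0 * gain_tapt / gain_ptapt

# Hypothetical UA values (%), chosen only to illustrate the computation.
ratio = improvement_ratio(ua_vft=60.0, ua_tapt=62.6, ua_ptapt=64.0)
print(f"ratio = {ratio:.1f}%")
```

Under this definition, a ratio of 100% would mean that TAPT and P-TAPT improve on V-FT equally, and smaller values mean that P-TAPT adds more on top of TAPT.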

Dataset Libri.(test-clean) SAVEE IEMOCAP
Table 3: Wav2vec 2.0 pretraining loss on different datasets

4.2 Comparison with prior works

Table 5 compares our performance on IEMOCAP to that of existing SOTA models. We only include methods that evaluate under speaker-independent settings. Simply fine-tuning the wav2vec 2.0 model (using V-FT) outperforms wav2vec 2.0 without fine-tuning [24] by 3.6% absolute UA. The P-TAPT method provides a 7.4% absolute improvement over SOTA models on IEMOCAP. We also show performance for methods that use both speech and text [5, 25]; our audio-only method appears comparable.

Data size   0.5hr  1hr   3hr   7hr
P-TAPT      58.8   64.1  70.0  73.7
Ratio (%)   65.0   62.5  80.6  81.3
Table 4: Comparison of methods on data efficiency in UA (%) on session 5 of IEMOCAP

Method                     Feature                    UA (%)
FCN+Attention [28]         Spectrogram                63.9
Wav2vec w/o. FT [2]        Wav2vec                    64.3
Wav2vec w. FT [27]         Waveform                   66.9
Wav2vec 2.0 w/o. FT [24]   Wav2vec 2.0 (footnote 6)   66.3
Wav2vec 2.0 w. V-FT        Waveform                   69.9
Wav2vec 2.0 w. TAPT        Waveform                   73.5
Wav2vec 2.0 w. P-TAPT      Waveform

Audio + Text [5]           MFCC+ALBERT (footnote 7)   72.1
Audio + ASR [25]           MFCC+BERT                  75.9
Table 5: Comparison with prior works on IEMOCAP
Footnote 6: We identify it as a feature instead of a model architecture since they only use wav2vec/wav2vec 2.0 as a feature extractor without fine-tuning.
Footnote 7: Referring to the text features extracted from ALBERT [16] and BERT [6].

5 Conclusion

We describe different fine-tuning strategies for wav2vec 2.0 on SER. These strategies produce SOTA performance on IEMOCAP, a well-studied corpus. We verify the presence of domain shift in SER and demonstrate that addressing it improves performance. We describe an algorithm for learning contextualized emotion representations and show its advantage in fine-tuning a wav2vec 2.0 model for SER. We believe that these techniques can be generalized to other tasks and can provide a basis for research on the utility of contextualized emotion representations. We intend to continue exploring the usefulness of our approach in a multi-modal setting.

6 Acknowledgements

We are grateful to PwC USA as well as to The Digital Transformation and Innovation Center at Carnegie Mellon University for supporting our research. We thank Yangyang Xia and Richard M. Stern for discussions and feedback.


  • [1] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli (2020) Wav2vec 2.0: a framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems, Vol. 33, pp. 12449–12460.
  • [2] J. Boigne, B. Liyanage, and T. Östrem (2020) Recognizing more emotions with less data using self-supervised transfer learning. arXiv:2011.05585.
  • [3] C. Busso, M. Bulut, C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan (2008) IEMOCAP: Interactive emotional dyadic motion capture database. Language resources and evaluation 42 (4), pp. 335–359.
  • [4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In European Conference on Computer Vision.
  • [5] M. Chen and X. Zhao (2020) A Multi-Scale Fusion Framework for Bimodal Speech Emotion Recognition. In Proc. Interspeech 2020, pp. 374–378.
  • [6] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
  • [7] C. Etienne, G. Fidanza, A. Petrovskii, L. Devillers, and B. Schmauch (2018) CNN+LSTM Architecture for Speech Emotion Recognition with Data Augmentation. In Proc. Workshop on Speech, Music and Mind (SMM 2018), pp. 21–25.
  • [8] Z. Fan, M. Li, S. Zhou, and B. Xu (2021) Exploring wav2vec 2.0 on Speaker Verification and Language Identification. In Proc. Interspeech 2021, pp. 1509–1513.
  • [9] H. M. Fayek, M. Lech, and L. Cavedon (2017) Evaluating deep learning architectures for speech emotion recognition. Neural Networks 92, pp. 60–68.
  • [10] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith (2020) Don’t stop pretraining: adapt language models to domains and tasks. In Proceedings of ACL.
  • [11] K. Han, D. Yu, and I. Tashev (2014) Speech emotion recognition using deep neural network and extreme learning machine. In Fifteenth Annual Conference of the International Speech Communication Association.
  • [12] W. Hsu, B. Bolte, Y. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed (2021) HuBERT: self-supervised speech representation learning by masked prediction of hidden units. arXiv:2106.07447.
  • [13] W. Hsu, A. Sriram, A. Baevski, T. Likhomanenko, Q. Xu, V. Pratap, J. Kahn, A. Lee, R. Collobert, G. Synnaeve, and M. Auli (2021) Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training. In Proc. Interspeech 2021, pp. 721–725.
  • [14] Cited by: §1, §3.1.
  • [15] D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  • [16] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2020) ALBERT: a lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.
  • [17] S. R. Livingstone and F. Russo (2018) The Ryerson audio-visual database of emotional speech and song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13.
  • [18] S. Lloyd (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory 28 (2), pp. 129–137.
  • [19] S. Mao, P.C. Ching, C.-C. J. Kuo, and T. Lee (2020) Advancing Multiple Instance Learning with Attention Modeling for Categorical Speech Emotion Recognition. In Proc. Interspeech 2020, pp. 2357–2361.
  • [20] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210.
  • [21] D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019) SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proc. Interspeech 2019, pp. 2613–2617.
  • [22] M. D. Pell (2001) Influence of emotion and focus location on prosody in matched statements and questions. The Journal of the Acoustical Society of America 109 (4), pp. 1668–80.
  • [23] L. Peng, K. Fu, B. Lin, D. Ke, and J. Zhan (2021) A Study on Fine-Tuning wav2vec2.0 Model for the Task of Mispronunciation Detection and Diagnosis. In Proc. Interspeech 2021, pp. 4448–4452.
  • [24] L. Pepino, P. Riera, and L. Ferrer (2021) Emotion Recognition from Speech Using wav2vec 2.0 Embeddings. In Proc. Interspeech 2021, pp. 3400–3404.
  • [25] J. Santoso, T. Yamada, S. Makino, K. Ishizuka, and T. Hiramura (2021) Speech Emotion Recognition Based on Attention Weight Correction Using Word-Level Confidence Measure. In Proc. Interspeech 2021, pp. 1947–1951.
  • [26] S. Schneider, A. Baevski, R. Collobert, and M. Auli (2019) wav2vec: Unsupervised Pre-Training for Speech Recognition. In Proc. Interspeech 2019, pp. 3465–3469.
  • [27] Y. Xia, L. Chen, A. Rudnicky, and R. M. Stern (2021) Temporal Context in Speech Emotion Recognition. In Proc. Interspeech 2021, pp. 3370–3374.
  • [28] Y. Zhang, J. Du, Z. Wang, J. Zhang, and Y. Tu (2018) Attention based fully convolutional network for speech emotion recognition. In 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1771–1775.