While wav2vec 2.0 has been proposed for automatic speech recognition (ASR), it can also be used for speech emotion recognition (SER); its performance can be significantly improved using different fine-tuning strategies. Two baseline methods, vanilla fine-tuning (V-FT) and task adaptive pretraining (TAPT), are first presented. We show that V-FT is able to outperform state-of-the-art models on the IEMOCAP dataset. TAPT, an existing NLP fine-tuning strategy, further improves performance on SER. We also introduce a novel fine-tuning method termed P-TAPT, which modifies the TAPT objective to learn contextualized emotion representations. Experiments show that P-TAPT performs better than TAPT, especially under low-resource settings. Compared to prior works in this literature, our top-line system achieves a 7.4% absolute improvement in unweighted accuracy (UA) over state-of-the-art performance on IEMOCAP. Our code is publicly available.
Speech emotion recognition (SER) remains one of the key components in human-machine interaction and in human communication systems. With the development of deep learning, several attempts [9, 11, 28] have been made to automatically learn emotion representations from audio signals using neural networks. However, the improvement of deep-learning-based systems is often limited by the lack of annotated data. Commonly used SER datasets [3, 14, 17]
are relatively small in size in comparison to automatic speech recognition (ASR) datasets. Moreover, systems trained on these datasets may not generalize well to other domains such as call centers.
Self-supervised pretrained models provide a solution by first learning from a large-scale speech corpus without explicit labeling. The knowledge learned from pretraining can be transferred to downstream tasks either by using the model as a feature extractor or by directly fine-tuning the whole model. While first introduced for natural language processing (NLP), several pretrained models [26, 1, 12] have been developed for speech processing. Wav2vec is a multi-layer convolutional neural network (CNN) trained to predict future frames conditioned on past frames by minimizing a contrastive loss. On the other hand, wav2vec 2.0 is a transformer-based model that adopts a masked learning objective to predict missing frames from the remaining context.
Despite the success of these methods in ASR, speaker verification, and mispronunciation detection [1, 8, 23], only a few attempts [2, 27, 24] have been made to apply them to SER. Boigne et al. find that wav2vec features are superior to traditional spectral-based features for SER. Xia et al. compare features extracted with different time spans and conclude that features with longer temporal context, such as wav2vec, perform better on SER. Pepino et al. show that features extracted from a linear combination of layers outperform single-layer representations of wav2vec 2.0 on SER. While these studies demonstrate the usefulness of pretrained models as feature extractors, little research has been conducted on fine-tuning them for SER.
One persistent issue in fine-tuning pretrained models is the mismatch between the pretraining and target domains [10, 13]. Task adaptive pretraining (TAPT) was proposed to resolve this domain shift by continuing the pretraining process on the target dataset. Hsu et al. show that TAPT greatly improves generalization and robustness on ASR when the pretraining and fine-tuning data are dissimilar. Since the speech in the pretraining ASR corpus differs from emotive speech in multiple regards, we consider TAPT a compelling method for fine-tuning on SER.
In this paper, we explore methods for fine-tuning wav2vec 2.0 on SER. We show that by adding a simple neural network on top of wav2vec 2.0, vanilla fine-tuning (V-FT) outperforms state-of-the-art (SOTA) methods on the IEMOCAP dataset. In addition, with V-FT as a baseline, TAPT significantly boosts the performance of fine-tuning wav2vec 2.0 on SER. Furthermore, motivated by previous works on segment-based emotion features [9, 19, 27] and self-supervised representation learning [12, 4], we develop a novel fine-tuning procedure for SER which yields even better performance, especially in low-resource conditions. Finally, we achieve a 7.4% absolute increase in unweighted accuracy (UA) over the SOTA performance on IEMOCAP.
We first review wav2vec 2.0, which serves as the backbone model for the methods we examine. We then present the two baseline methods we established. Finally, we introduce pseudo-label task adaptive pretraining (P-TAPT), a novel method we designed to fine-tune wav2vec 2.0 on SER.
Wav2vec 2.0 is a transformer-based model trained to extract contextualized representations from raw audio signal. Figure 1.b shows the wav2vec 2.0 model architecture and its pretraining objective. It consists of three sub-modules: a feature encoder, a transformer module, and a quantization module. The feature encoder is a multi-layer CNN that processes the input signal into low-level features. Based on this representation, the transformer module is further applied to produce contextualized representations. The quantization module discretizes the low-level features into a trainable codebook. To train the model, part of the low-level features are masked from the transformer module, and the objective is to identify the quantized version of the masked features based on their context. (There is an additional diversity loss in pretraining, which promotes the diversity of the quantization codebook.)
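The masked contrastive objective can be illustrated with a minimal numpy sketch. The shapes, noise model, and number of distractors below are illustrative stand-ins, not the actual wav2vec 2.0 computation (which uses quantized targets and learned transformer outputs):

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 50, 8                       # frames, feature dimension (toy sizes)
z = rng.normal(size=(T, D))        # low-level features from the CNN encoder
mask = rng.random(T) < 0.5         # frames hidden from the transformer

# Stand-in for the transformer's contextualized outputs; in the real model
# these are computed from the unmasked context, not derived from z directly.
c = z + 0.1 * rng.normal(size=(T, D))

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# For each masked frame, the objective is to identify its own (quantized)
# target among distractors sampled from the other masked frames.
masked_idx = np.flatnonzero(mask)
correct = 0
for i in masked_idx:
    distractors = rng.choice(masked_idx[masked_idx != i], size=5, replace=False)
    candidates = np.concatenate(([i], distractors))
    sims = [cosine(c[i], z[j]) for j in candidates]
    correct += int(np.argmax(sims) == 0)   # index 0 is the true target

accuracy = correct / len(masked_idx)
```

In pretraining this identification step is expressed as a contrastive (InfoNCE-style) loss over the similarities rather than an argmax accuracy.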
As there is no existing baseline system fine-tuning wav2vec 2.0 on SER, we created two baselines. One is the conventional fine-tuning method, and the other is task adaptive pretraining, which was first introduced in NLP.
Vanilla fine-tuning. Wav2vec 2.0 differs from its NLP counterparts in that there is no utterance-level pretraining task to naturally form a sentence representation. As a consequence, aggregation across time steps is required to fine-tune on utterance-level classification tasks. We experimented with different configurations and found that average pooling on the final layer is simple yet effective for SER. Specifically, the final contextualized representation extracted by wav2vec 2.0 is first processed by global average pooling across the time dimension, followed by a ReLU activation and a single linear layer to predict the emotion categories. In addition, a modified version of SpecAugment proposed in wav2vec 2.0 is applied during training for better generalization. We use this architecture for the fine-tuning stage of all three methods, and abbreviate the vanilla fine-tuning method as V-FT.
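The forward pass of this classification head can be sketched in numpy as follows (the hidden size matches the wav2vec 2.0 base model; the random features and weights are hypothetical placeholders for the transformer output and the trained head):

```python
import numpy as np

rng = np.random.default_rng(0)

T, D, n_classes = 120, 768, 4      # frames, wav2vec 2.0 hidden size, emotions

# Hypothetical contextualized representations from the final transformer layer.
h = rng.normal(size=(T, D))

# Head used for fine-tuning: mean-pool across time, then ReLU and one
# linear layer over the emotion categories.
W = rng.normal(size=(D, n_classes)) * 0.02
b = np.zeros(n_classes)

pooled = h.mean(axis=0)                    # (D,) utterance-level vector
logits = np.maximum(pooled, 0) @ W + b     # ReLU, then linear projection

# Softmax over the four emotion categories.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
predicted_emotion = int(probs.argmax())
```

In training, the softmax output would be fed to a cross-entropy loss against the utterance-level emotion label.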
Task adaptive pretraining. Task adaptive pretraining (TAPT) is a simple but effective method for fine-tuning pretrained language models on domain-specific tasks. It bridges the gap between the pretraining and target domains by continuing to pretrain on the target dataset. In this paper, we examine TAPT as one method for fine-tuning wav2vec 2.0 on SER. To distinguish it from the original pretraining and fine-tuning stages, we define an intermediate task adaptation stage for the continual pretraining process.
While TAPT adapts to emotive speech by continual training with the pretraining objective, it does not make use of the emotion labels. Essentially, the contextualized representations obtained will be general features suitable for various downstream tasks. As we focus only on SER, we propose to adapt this objective to generate emotion-specific features. Instead of identifying the missing low-level features, we focus on predicting the emotion state of the masked sequence. One advantage this brings is better data efficiency: reconstruction of missing audio is a more complicated task, which makes the model vulnerable to over-fitting. Additionally, it simplifies the fine-tuning stage, since information unrelated to emotion recognition has already been filtered out of the contextualized representation.
However, frame-level emotion states need to be recognized to realize our method. While only utterance-level emotion labels are given for most SER datasets, several studies [27, 9, 19] indicate that frame-level emotion information can still be inferred by training with a segment-based classification objective. In particular, as shown in Figure 1.a, we fine-tune wav2vec to extract frame-level emotion representations that are useful for predicting an utterance-level emotion label. We find that using CNN architectures such as wav2vec is important, since the locality of CNNs preserves sequential structure. After training, we run the k-means clustering algorithm on all of the extracted representations from the target dataset. As Caron et al. have shown, the k-means cluster assignments on intermediate layers of CNN classifiers can capture information related to the target labels. Therefore, we interpret this cluster assignment as a pseudo-label that represents local emotion state.
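The pseudo-labeling step amounts to clustering all frame-level representations and reading off the cluster indices; a minimal sketch with scikit-learn, using random features as a stand-in for the fine-tuned wav2vec representations:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical frame-level emotion representations extracted by the
# fine-tuned CNN, concatenated over all utterances in the target dataset.
features = rng.normal(size=(1000, 32))   # (n_frames, feature_dim), toy sizes

# Cluster assignments serve as frame-level pseudo-labels of local
# emotion state.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(features)
pseudo_labels = kmeans.labels_           # one pseudo-label per frame
```

Each frame of each utterance thereby receives a discrete label that the task adaptation stage can try to predict.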
We replace the TAPT objective with our new P-TAPT objective. We add a position-wise linear head composed of two linear layers to predict the cluster assignments of the masked frames. In practice, we run k-means clustering multiple times with different numbers of clusters, and our model predicts an ensemble of cluster assignments with multiple linear heads. This cluster-ensemble technique has been shown to facilitate representation learning in HuBERT, a recently developed self-supervised speech representation learning model.
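The cluster-ensemble loss can be sketched as follows. This is an illustrative simplification (single linear heads with random weights, random stand-in features); the actual P-TAPT heads are two-layer and trained jointly with the backbone:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 16))    # frame-level representations (toy)

# Cluster ensemble: several k-means runs with different numbers of clusters;
# each run yields one set of pseudo-labels, predicted by its own head.
cluster_sizes = [8, 16, 32]
label_sets = [
    KMeans(n_clusters=k, n_init=4, random_state=0).fit_predict(features)
    for k in cluster_sizes
]

def head_loss(hidden, labels, W):
    """Cross-entropy of one position-wise linear head on masked frames."""
    logits = hidden @ W
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

mask = rng.random(len(features)) < 0.5   # loss is computed on masked frames
hidden = rng.normal(size=(500, 16))      # transformer output (stand-in)

# Total P-TAPT-style loss: sum of each head's cross-entropy.
total_loss = sum(
    head_loss(hidden[mask], labels[mask], rng.normal(size=(16, k)) * 0.1)
    for k, labels in zip(cluster_sizes, label_sets)
)
```

Minimizing this sum trains the model to predict every clustering granularity at once from the unmasked context.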
IEMOCAP. The Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset is a popular benchmark for evaluating SER systems. It contains five recording sessions, each with one male and one female speaker. To compare with previous works, we use the default labels provided by IEMOCAP; however, only four emotion categories are considered: neutral, sad, angry, and happy. In particular, the “excited” category is merged with “happy” due to its sparsity in the dataset. The total amount of speech is about 7 hours.
SAVEE. The Surrey Audio-Visual Expressed Emotion (SAVEE) dataset contains four male speakers: DC, JE, JK, and KL. Each speaker reads out the same set of 120 sentences, each labeled with one of seven emotion categories: angry, disgust, sad, fear, happy, surprise, and neutral. We use all of the emotion categories, resulting in 480 utterances with a total of 30 minutes of speech.
All experiments use the same learning rate with the Adam optimizer. For the wav2vec model, we use a pretrained model developed by Facebook AI (https://github.com/pytorch/fairseq/tree/master/examples/wav2vec). We build our wav2vec 2.0 implementation on top of the Hugging Face implementation, and adopt a pretrained model checkpoint from Facebook AI (https://huggingface.co/facebook/wav2vec2-base). Both models are pretrained on the unsupervised speech of LibriSpeech 960h without transcriptions. We evaluate our systems using unweighted accuracy (UA)
under a speaker-independent setting; the speakers in the test set are excluded from the training data. Additional implementation details are provided in our GitHub repository. We run each experiment 5 times on the full IEMOCAP dataset, and 20 times on SAVEE and on sub-sampled versions of IEMOCAP. Additionally, we observe that wav2vec 2.0 fails to converge with some random seeds; we therefore discard and rerun outlier runs whose performance is more than two standard deviations from the mean.
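The outlier-discarding rule can be sketched in a few lines of numpy (the run scores below are fabricated for illustration; note that with very few runs a single outlier inflates the standard deviation enough that nothing exceeds the threshold, so the rule is most meaningful for the 20-run settings):

```python
import numpy as np

def discard_outliers(scores):
    """Keep runs whose score lies within two standard deviations of the mean."""
    scores = np.asarray(scores, dtype=float)
    mu, sigma = scores.mean(), scores.std()
    return scores[np.abs(scores - mu) <= 2 * sigma]

# 20 hypothetical UA scores; the last run failed to converge.
runs = [69.5, 70.2, 70.8, 69.9, 70.4,
        70.1, 69.7, 70.6, 70.0, 70.3,
        69.8, 70.5, 70.2, 69.6, 70.7,
        70.1, 69.9, 70.4, 70.0, 25.0]
kept = discard_outliers(runs)   # the 25.0 run is discarded and rerun
```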
IEMOCAP. For a fair comparison with the majority of previous works, we split the dataset by leaving one session out as the test set; the remaining four sessions are used for training. Note that most papers using IEMOCAP do not explicitly define their validation set. We therefore train on all four sessions for a fixed 15 epochs without validation, using a batch size of 64. The number of epochs is chosen so that each of our competing methods converges in terms of training loss.
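The leave-one-session-out protocol amounts to five folds; a minimal sketch (session names are illustrative):

```python
# Leave-one-session-out cross-validation over IEMOCAP's five sessions.
sessions = ["Ses01", "Ses02", "Ses03", "Ses04", "Ses05"]

folds = [
    ([s for s in sessions if s != held_out],  # four training sessions
     held_out)                                # one held-out test session
    for held_out in sessions
]
```

The same pattern applies to SAVEE with speakers in place of sessions.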
SAVEE. A similar evaluation procedure is used for SAVEE. In each fold, one speaker is left out for testing and the remaining three are used for training. We increase the number of training epochs to 30 and halve the batch size to 32 for the smaller training set.
Table 1 compares the performance of the fine-tuning methods on IEMOCAP. For all sessions except the first, TAPT yields a noticeable improvement over V-FT, and P-TAPT performs better than TAPT for all sessions. Table 2 shows that on SAVEE, both TAPT and P-TAPT outperform V-FT by a large margin; however, the performance of P-TAPT is very close to that of TAPT. We analyze these results by considering the characteristics of SAVEE and IEMOCAP.
Domain shift and linguistic content. We first quantify the domain shift between both datasets and the pretraining dataset. We take the wav2vec 2.0 model pretrained on LibriSpeech and calculate the pretraining loss on both datasets along with the test set of LibriSpeech.
Table 3 verifies the presence of domain shift on both datasets, providing room for TAPT to improve. The smaller loss indicates that SAVEE is closer to LibriSpeech, as the model can already generalize well to it. However, the improvement from TAPT is larger on SAVEE than on IEMOCAP despite the smaller domain shift. We observe a strong correlation between linguistic content and emotion labels in SAVEE (two-thirds of the sentences are specific to one emotion and shared across all speakers). We conjecture that this correlation is captured by our model, which thereby surpasses human evaluators, who annotate emotion from para-linguistic information alone. This also explains why P-TAPT does not further improve over TAPT here, as the TAPT objective is already suitable for modeling linguistic information. Nonetheless, in more naturally elicited emotional conversations (IEMOCAP), P-TAPT performs better than TAPT.
Data efficiency. We also investigated the behavior of our methods when presented with different amounts of training data. Specifically, we fix session 5 of IEMOCAP as the held-out test set and gradually halve the amount of training examples in the remaining four sessions by random selection. We compare TAPT and P-TAPT using the ratio of their corresponding improvements over V-FT; a lower ratio indicates that the improvement from P-TAPT is more significant than that from TAPT. As shown in Table 4, this ratio is lower under low-resource settings with one hour or less of training data; thus P-TAPT is more data-efficient than TAPT. As mentioned in Section 2.3, we attribute this to the change of objective from reconstruction of audio frames to prediction of emotion states, which is less data-intensive although it requires labeled data.
Table 5 compares our performance on IEMOCAP to that of existing SOTA models. We only include methods evaluated under speaker-independent settings. Simply fine-tuning the wav2vec 2.0 model (using V-FT) outperforms wav2vec 2.0 without fine-tuning by 3.6% absolute UA. The P-TAPT method provides a 7.4% absolute improvement over SOTA models on IEMOCAP. We also show the performance of methods that use both speech and text [5, 25]; our audio-only method appears comparable.
| Method | Input | UA (%) |
| --- | --- | --- |
| Wav2vec w/o FT | wav2vec features | 64.3 |
| Wav2vec w/ FT | Waveform | 66.9 |
| Wav2vec 2.0 w/o FT | wav2vec 2.0 features* | 66.3 |
| Wav2vec 2.0 w/ V-FT | Waveform | 69.9 |
| Wav2vec 2.0 w/ TAPT | Waveform | 73.5 |
| Wav2vec 2.0 w/ P-TAPT | Waveform | |
| Audio + Text | MFCC + ALBERT† | 72.1 |
| Audio + ASR | MFCC + BERT | 75.9 |

*We identify these as features rather than model architectures, since these methods use wav2vec/wav2vec 2.0 only as a feature extractor without fine-tuning.
†Referring to the text features extracted from ALBERT and BERT.
We describe different fine-tuning strategies for wav2vec 2.0 on SER. These strategies produce SOTA performance on IEMOCAP, a well-studied corpus. We verify the presence of domain shift in SER and demonstrate that addressing it improves performance. We describe an algorithm for learning contextualized emotion representations, and show its advantage in fine-tuning a wav2vec 2.0 model for SER. We believe that these techniques can be generalized to other tasks and can provide a basis for research on the utility of contextualized emotion representations. We intend to continue exploring the usefulness of this approach in a multi-modal setting.
We are grateful to PwC USA as well as to The Digital Transformation and Innovation Center at Carnegie Mellon University for supporting our research. We thank Yangyang Xia and Richard M. Stern for discussions and feedback.
Wav2vec 2.0: a framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems, Vol. 33, pp. 12449–12460.
Recognizing more emotions with less data using self-supervised transfer learning. arXiv:2011.05585.
Deep clustering for unsupervised learning of visual features. In European Conference on Computer Vision.
Advancing multiple instance learning with attention modeling for categorical speech emotion recognition. In Proc. Interspeech 2020, pp. 2357–2361.