In recent years, a growing number of people have been interacting with virtual voice assistants, such as Siri, Alexa, Cortana and Google Assistant. Interfaces that ignore a user’s emotional state or fail to manifest the appropriate emotion can dramatically impede performance and risk being perceived as cold, socially inept, untrustworthy, and incompetent. Because of this, speech emotion recognition is becoming an increasingly relevant task.
With the exception of MSP-Podcast, these datasets are relatively small in size, usually including only a few dozen speakers. In terms of modeling techniques, many different traditional approaches have been proposed, using hidden Markov models (HMMs) [32, 29], support vector machines (SVMs) [33, 26], [4, 17], and, more recently, deep neural networks (DNNs) [23, 30, 28]. Yet, while DNNs have shown large gains over traditional approaches on tasks like automatic speech recognition (ASR) and speaker identification, the gains observed on emotion recognition are limited, likely due to the small size of the datasets.
A common strategy when dealing with small training datasets is to apply transfer learning techniques. One approach is to take a model trained on an auxiliary task for which large datasets are available and use it to improve robustness on the task of interest, for which data is scarce. The model learned on the auxiliary task can be used as a feature extractor or fine-tuned, after replacing some of its final layers, for the task of interest. Recently, transfer learning approaches have been explored in the field of speech emotion recognition. In one work, a deep neural network based on transformers is pretrained on LibriSpeech using multiple self-supervised objectives at different time scales. Then, the model is used as a feature extractor or fine-tuned for speech emotion recognition, among other downstream tasks. Similarly, in other works, a deep encoder is pretrained on a contrastive predictive coding task, and the resulting embeddings are tested on speech emotion datasets. Recently, several models for automatic speech recognition (ASR) that use self-supervised pretraining have been released, including wav2vec and VQ-wav2vec. A few recent studies [34, 21, 3] have successfully applied representations from these models as features for emotion recognition.
In line with those works, in this paper, we explore the use of the wav2vec 2.0 model, an improved version of the original wav2vec model, as a feature extractor for speech emotion recognition. The main contributions of our paper are (1) the use of wav2vec 2.0 representations for speech emotion recognition, which, to our knowledge, had not been done before for this task, (2) a novel approach for the downstream model which leverages information from multiple layers of the wav2vec 2.0 model and leads to significant improvements over previous approaches, and (3) an analysis of the importance of the different layers in wav2vec 2.0 for the emotion recognition task. Our results are superior to others in the literature for models based only on acoustic information for IEMOCAP and RAVDESS. The code to replicate the results of this paper will soon be released at https://github.com/habla-liaa/ser-with-w2v2.
In our study, we extracted features from two released wav2vec 2.0 models and used them for speech emotion recognition. In this section, we describe the wav2vec 2.0 model, the datasets used for training and evaluation, and the downstream models.
2.1 Wav2vec 2.0 model architecture
Wav2vec 2.0 is a framework for self-supervised learning of representations from raw audio. The model consists of three stages. The first stage is a local encoder, which contains several convolutional blocks and encodes the raw audio into a sequence of embeddings with a stride of 20 ms and a receptive field of 25 ms. Two models have been released for public use, a large one and a base one, where the embeddings are 1024- and 768-dimensional, respectively. The second stage is a contextualized encoder, which takes the local encoder representations as input. Its architecture consists of several transformer encoder blocks. The base model uses 12 transformer blocks with 8 attention heads each, while the large model uses 24 transformer blocks with 16 attention heads each. Finally, a quantization module takes the local encoder representations as input and consists of 2 codebooks with 320 entries each. A linear map is used to turn the local encoder representations into logits. Given the logits, Gumbel-Softmax is applied to sample from each codebook. The selected codes are concatenated and a linear transformation is applied to the resulting vector, leading to a quantized representation of the local encoder output, which is used in the objective function, as explained below.
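To make the quantization step more concrete, the following is a minimal NumPy sketch of the forward pass only: a linear map to logits, one Gumbel-perturbed hard choice per codebook, concatenation, and a final linear transformation. It is an illustration under stated assumptions, not the wav2vec 2.0 implementation: the function names, the code dimension of 128, the temperature value, and the random weights are all hypothetical, and the differentiable (straight-through) part of Gumbel-Softmax used during training is omitted.

```python
import numpy as np

def gumbel_argmax(logits, temperature, rng):
    """Pick one entry per codebook by adding Gumbel noise to the logits.

    logits: array of shape (n_codebooks, n_entries).
    Returns an integer array of shape (n_codebooks,).
    """
    uniform = rng.uniform(1e-10, 1.0, size=logits.shape)
    gumbel_noise = -np.log(-np.log(uniform))
    return np.argmax((logits + gumbel_noise) / temperature, axis=-1)

def quantize(z, w_logits, codebooks, w_out, temperature=2.0, rng=None):
    """Quantize one local encoder frame z of shape (D,).

    w_logits:  (D, n_codebooks * n_entries) linear map to logits.
    codebooks: (n_codebooks, n_entries, code_dim) code vectors.
    w_out:     (n_codebooks * code_dim, out_dim) final linear transformation.
    The temperature value is illustrative (an assumption).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n_codebooks, n_entries, _ = codebooks.shape
    logits = (z @ w_logits).reshape(n_codebooks, n_entries)
    chosen = gumbel_argmax(logits, temperature, rng)
    # Concatenate the selected code from each codebook.
    selected = np.concatenate([codebooks[g, chosen[g]] for g in range(n_codebooks)])
    return selected @ w_out

# Example with 2 codebooks of 320 entries each, as described in the text.
init_rng = np.random.default_rng(1)
z_example = init_rng.normal(size=768)
w_logits_example = init_rng.normal(size=(768, 2 * 320))
codebooks_example = init_rng.normal(size=(2, 320, 128))  # code_dim=128 is hypothetical
w_out_example = init_rng.normal(size=(2 * 128, 256))
q_example = quantize(z_example, w_logits_example, codebooks_example, w_out_example)
```

In this sketch the sampling is driven by an explicit random generator, so passing generators with the same seed reproduces the same quantized vector.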
2.2 Wav2vec 2.0 pretraining and finetuning
The wav2vec 2.0 model is pretrained in a self-supervised setting, similar to the masked language modelling used in BERT for NLP. Contiguous time steps from the local encoder representations are randomly masked and the model is trained to reproduce the quantized local encoder representations for those masked frames at the output of the contextualized encoder.
The training objective is composed of terms of the form

$$w_t = \frac{\exp(\mathrm{sim}(c_t, q_t)/\kappa)}{\sum_{\tilde{q} \in \tilde{Q}} \exp(\mathrm{sim}(c_t, \tilde{q})/\kappa)},$$

where $\mathrm{sim}$ is the cosine similarity between the contextualized encoder outputs $c_t$ and the quantized local encoder representations $q_t$, $t$ is the time step, $\kappa$ is the temperature and $\tilde{Q}$ is the union of a set of $K$ distractors and $q_t$. The distractors are outputs of the local encoder sampled from masked frames belonging to the same utterance as $q_t$. The contrastive loss is then given by $-\log w_t$ summed over all masked frames. Finally, terms to encourage diversity of the codebooks and L2 regularization are added to the contrastive loss.
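The contrastive term for a single masked frame can be sketched in a few lines of NumPy. This is a minimal illustration, assuming cosine similarity as in the wav2vec 2.0 formulation and a softmax over the true target plus the distractors; the function names and the temperature value of 0.1 are our own assumptions, and the codebook diversity and L2 terms are left out.

```python
import numpy as np

def cosine_similarity(vector, candidates):
    """Cosine similarity between `vector` (D,) and each row of `candidates` (N, D)."""
    vector = vector / np.linalg.norm(vector)
    candidates = candidates / np.linalg.norm(candidates, axis=-1, keepdims=True)
    return candidates @ vector

def contrastive_term(c_t, q_t, distractors, temperature=0.1):
    """Negative log-probability of the true target q_t among q_t and the distractors.

    c_t: contextualized encoder output for a masked frame, shape (D,).
    q_t: quantized local encoder representation for that frame, shape (D,).
    distractors: array of shape (K, D).
    """
    candidates = np.vstack([q_t[None, :], distractors])  # row 0 is the true target
    logits = cosine_similarity(c_t, candidates) / temperature
    logits = logits - logits.max()  # for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]

# Toy example: the context vector points in the same direction as the target,
# while the two distractors are orthogonal to it, so the loss is close to zero.
target = np.array([1.0, 0.0, 0.0])
toy_distractors = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
loss_aligned = contrastive_term(2.0 * target, target, toy_distractors)
loss_opposed = contrastive_term(-target, target, toy_distractors)
```

The full contrastive loss would sum this term over all masked frames of an utterance.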
The main goal of the wav2vec 2.0 paper was to use the learned representations to improve ASR performance, requiring less data for training and enabling its use for low-resource languages. To this end, the model trained as described above is finetuned for ASR using a labelled speech corpus like LibriSpeech. A randomly initialized linear projection is added at the output of the contextual encoder and the connectionist temporal classification (CTC) loss is minimized. The models finetuned on 960 hours of LibriSpeech reach state-of-the-art results in automatic speech recognition when evaluated on different subsets of LibriSpeech. Even when finetuning using considerably fewer hours, wav2vec 2.0 models reach a performance comparable to the state of the art.
In this paper, we compared the performance in speech emotion recognition when using both the wav2vec 2.0 base model pretrained on LibriSpeech without finetuning (https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small.pt; we will call this model Wav2vec2-PT), and a model finetuned for ASR using a 960-hour subset of LibriSpeech (https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small_960h.pt; Wav2vec2-FT). In both cases, we used the base model as we did not see significant performance improvements when using the large one, and it allowed us to reduce computational requirements.
We compare results when using the output of the local and the contextualized encoders as input to the downstream models. We also propose to use a weighted average of the outputs of the 12 transformer blocks, along with those of the local encoder. The weights of this average are trained along with the rest of the downstream model, as explained in the next subsection.
As baseline features, we also calculated magnitude spectrograms with a Hann window of 25 ms and a hop size of 10 ms, and eGeMAPS low-level descriptors (LLDs), which are commonly used in the emotion recognition literature. The eGeMAPS features were extracted using opensmile-python. In order to match the sequence lengths of the eGeMAPS features, which use a stride of 10 ms, with the lengths of the wav2vec 2.0 features, which have a stride of 20 ms, we downsampled the eGeMAPS LLDs by averaging every 2 consecutive frames.
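The frame-rate matching described above amounts to averaging non-overlapping pairs of frames. A minimal NumPy sketch follows; the function name is hypothetical, and dropping a trailing odd frame is an assumption, since the text does not say how leftover frames are handled.

```python
import numpy as np

def downsample_by_averaging(features, factor=2):
    """Average every `factor` consecutive frames of a (T, D) feature matrix.

    Trailing frames that do not fill a complete group are dropped
    (an assumption made for this sketch).
    """
    n_frames = (features.shape[0] // factor) * factor
    trimmed = features[:n_frames]
    return trimmed.reshape(-1, factor, features.shape[1]).mean(axis=1)

# Five frames at a 10 ms stride become two frames at a 20 ms stride.
lld_example = np.arange(10, dtype=float).reshape(5, 2)
lld_downsampled = downsample_by_averaging(lld_example)
```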
All features were normalized by subtracting the mean and dividing by the standard deviation of that feature over all the data for the corresponding speaker. When disabling speaker normalization for comparison of results, we replaced it with global normalization per feature. In this case, the statistics are computed over the training data only, excluding the test data.
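As an illustration of speaker-dependent normalization, the sketch below pools all frames of each speaker, computes a per-feature mean and standard deviation, and applies them to every utterance of that speaker. The function name, the list-based data layout, and the small `eps` constant are assumptions made for the example.

```python
import numpy as np

def normalize_by_speaker(features, speaker_ids, eps=1e-8):
    """Z-normalize each feature using statistics of the corresponding speaker.

    features: list of (T_i, D) arrays, one per utterance.
    speaker_ids: list of speaker labels, one per utterance.
    eps avoids division by zero for constant features (an assumption).
    """
    normalized = [None] * len(features)
    for speaker in set(speaker_ids):
        indices = [i for i, s in enumerate(speaker_ids) if s == speaker]
        frames = np.concatenate([features[i] for i in indices], axis=0)
        mean, std = frames.mean(axis=0), frames.std(axis=0)
        for i in indices:
            normalized[i] = (features[i] - mean) / (std + eps)
    return normalized

# Toy data: three utterances from two speakers with different offsets.
toy_rng = np.random.default_rng(0)
toy_features = [toy_rng.normal(5.0, 2.0, size=(30, 4)),
                toy_rng.normal(-3.0, 0.5, size=(20, 4)),
                toy_rng.normal(5.0, 2.0, size=(25, 4))]
toy_speakers = ["spk_a", "spk_b", "spk_a"]
toy_normalized = normalize_by_speaker(toy_features, toy_speakers)
```

Global normalization would follow the same pattern with a single mean and standard deviation computed over the training data only.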
2.4 Downstream models
Our downstream model follows a simple design taken from previous work, which should reduce the chance of overfitting. We will refer to this model as Dense. Its dense layers are applied to each frame independently, which is equivalent to using 1D pointwise convolutional layers (with a kernel size of 1) and 128 filters. The outputs are averaged over time, resulting in a vector of size 128. Finally, another dense layer with softmax activation returns probabilities for each of the emotion classes. During the training of the downstream model, the weights of the wav2vec 2.0 model remain unaltered, so it serves as a feature extractor.
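The forward pass of such a model can be sketched in plain NumPy. This is a minimal illustration, not the released implementation: it assumes two frame-wise dense layers with ReLU activations (the activation and the weight initialization are assumptions), followed by average pooling over time and a softmax output layer; training is not shown.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    exponentials = np.exp(x - x.max())
    return exponentials / exponentials.sum()

def init_dense_model(input_dim, n_classes, hidden=128, seed=0):
    """Create randomly initialized parameters (illustrative initialization)."""
    rng = np.random.default_rng(seed)

    def layer(n_in, n_out):
        return rng.normal(0.0, 0.05, size=(n_in, n_out)), np.zeros(n_out)

    return {"dense1": layer(input_dim, hidden),
            "dense2": layer(hidden, hidden),
            "out": layer(hidden, n_classes)}

def dense_forward(params, frames):
    """Map a (T, D) feature sequence to class probabilities of shape (n_classes,)."""
    hidden = frames
    for name in ("dense1", "dense2"):
        weights, bias = params[name]
        # Applied independently to each frame, like a convolution with kernel size 1.
        hidden = relu(hidden @ weights + bias)
    pooled = hidden.mean(axis=0)  # average over time -> vector of size 128
    weights, bias = params["out"]
    return softmax(pooled @ weights + bias)

# Example: 400 frames of 768-dimensional features and 4 emotion classes.
dense_params = init_dense_model(input_dim=768, n_classes=4)
example_frames = np.random.default_rng(1).normal(size=(400, 768))
example_probs = dense_forward(dense_params, example_frames)
```

Because the pooling is a plain average, the output does not depend on the order of the frames, which is the limitation discussed in the results section.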
For computational reasons, we used a maximum sequence length of 400 for IEMOCAP and 250 for RAVDESS, as input to the network. This is equivalent to 8 seconds and 5 seconds, respectively. Note that the output of the contextualized encoder can contain information from the full input waveform which might be longer than the sequence seen by the network. This is because each output of the contextual encoder has a receptive field given by the whole input waveform, since this encoder is a transformer. On the other hand, the local encoder has a limited receptive field, so its outputs can only capture local information.
For the case in which we take as features the activations from both the wav2vec 2.0 local encoder and the transformer blocks, we incorporate a trainable weighted average as the first layer. This layer learns the weights $\alpha_i$ for each of the wav2vec 2.0 layer activations, $h_i$, where $h_0$ corresponds to the local encoder output, $h_1$ through $h_{11}$ are the outputs of the internal blocks in the transformer and $h_{12}$ is the output of the contextualized encoder. Then, the activations are combined as follows

$$o = \frac{\sum_{i=0}^{12} \alpha_i h_i}{\sum_{i=0}^{12} \alpha_i}.$$

The weights are initialized with 1.0. Note that the layer activations can be combined in this way because they are all the same size. This way of extracting features from a pretrained model is similar to the approach used in ELMo for NLP.
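The layer combination can be illustrated with a short NumPy sketch. It assumes that the weighted average is normalized by the sum of the weights, and it only shows the forward computation; in the actual model the weights would be trainable parameters updated together with the rest of the downstream network. The function name and array layout are hypothetical.

```python
import numpy as np

def weighted_layer_average(layer_activations, weights):
    """Combine per-layer activations into a single feature sequence.

    layer_activations: array of shape (L, T, D), one (T, D) matrix per layer.
    weights: array of shape (L,), one scalar weight per layer.
    Returns an array of shape (T, D).
    """
    weights = np.asarray(weights, dtype=float)
    weighted_sum = np.tensordot(weights, layer_activations, axes=1)
    return weighted_sum / weights.sum()

# 13 layers (local encoder + 12 transformer blocks), 5 frames, 768 dimensions.
activations_example = np.random.default_rng(0).normal(size=(13, 5, 768))
initial_weights = np.ones(13)  # weights initialized with 1.0
combined_example = weighted_layer_average(activations_example, initial_weights)
```

With all weights equal to 1.0, as at initialization, the result is simply the mean of the 13 layer activations.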
In the results section, we compare performance of the Dense model with the LSTM model, in which the second dense layer is replaced by an LSTM layer. Finally, we also evaluate a third model, called Fusion model, which incorporates a branch taking as input eGeMAPS features. In this last model, the outputs of the first dense layer are concatenated before applying the second dense layer. The described downstream models can be seen in Figure 1.
The downstream models were trained using batches of 32 utterances, Adam optimizer with a learning rate of 0.001, and early stopping with a patience of 4 epochs monitoring the validation loss.
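The early stopping rule can be sketched as follows. This is a minimal illustration, assuming that training stops once the validation loss has not improved for `patience` consecutive epochs; the function name is hypothetical and the exact convention of the training framework used may differ.

```python
def early_stopping_epoch(val_losses, patience=4):
    """Return the index of the epoch at which training would stop.

    val_losses: validation loss after each epoch, in order.
    If the patience is never exhausted, the last epoch index is returned.
    """
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1

# The best loss is reached at epoch 1; four epochs without improvement follow.
stop_epoch = early_stopping_epoch([1.0, 0.8, 0.9, 0.85, 0.9, 0.95, 0.7])
```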
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) is a multi-modal database of emotional speech and song. It features 24 different actors (12 male and 12 female) enacting 2 statements, “Kids are talking by the door” and “Dogs are sitting by the door”, with 8 different emotions: happy, sad, angry, fearful, surprise, disgust, calm, and neutral. These emotions are expressed in two different intensities, normal and strong, except for neutral (normal only). Each of the combinations was spoken and sung, and repeated 2 times, leading to 104 unique vocalizations per actor. Following previous work, we merged the neutral and calm emotions, resulting in 7 emotions, and used the first 20 actors for training, actors 21-22 for validation to do early stopping, and actors 23-24 for test.
The Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset has a length of approximately 12 hours and consists of scripted and improvised dialogues by 10 speakers. It is composed of 5 sessions, each including speech from an actor and an actress. Annotators were asked to label each sample choosing one or more labels from a pool of emotions. In this work, we used 4 emotional classes: anger, happiness, sadness and neutral, and, following previous work, we relabeled excitement samples as happiness. Instances from other classes or with no majority label across the annotations were discarded. (Note that discarding no-agreement samples and samples from non-target emotions is not an ideal practice. Here, we decided to do this since it is standard practice in the emotion recognition literature, facilitating comparisons across papers.) We also trimmed the waveforms to 15 seconds to reduce the computational requirements when extracting wav2vec 2.0 features. Only 2% of the waveforms were longer than 15 seconds. To evaluate the models on IEMOCAP, we performed 5-fold cross-validation, leaving one session out for each of the folds, as is standard practice with this dataset. We used one of the 4 training sessions to perform early stopping for each cross-validation model.
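The leave-one-session-out protocol can be sketched as a small helper that enumerates the folds. The function name is hypothetical, and the rule used to pick the validation session (the session following the test session, cyclically) is an assumption made for illustration, since the text only states that one of the 4 training sessions is used for early stopping.

```python
def leave_one_session_out(sessions=(1, 2, 3, 4, 5)):
    """Build one fold per session: 1 test session, 1 validation session, 3 training sessions."""
    folds = []
    for position, test_session in enumerate(sessions):
        # Hypothetical choice: validate on the next session in cyclic order.
        val_session = sessions[(position + 1) % len(sessions)]
        train_sessions = [s for s in sessions if s not in (test_session, val_session)]
        folds.append({"train": train_sessions, "val": val_session, "test": test_session})
    return folds

iemocap_folds = leave_one_session_out()
```

Since each session contains a distinct pair of speakers, splitting by session keeps the test speakers unseen during training.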
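Table 1: Average recall (%) on IEMOCAP and RAVDESS for the different input features.

| Model | Features | IEMOCAP | RAVDESS |
|---|---|---|---|
| None | eGeMAPS | 52.4 ± 0.1 | 57.0 ± 2.4 |
| None | Spectrogram | 49.8 ± 1.0 | 44.5 ± 0.8 |
| Wav2vec2-PT | Local enc. | 60.3 ± 0.7 | 65.4 ± 1.7 |
| Wav2vec2-PT | Cont. enc. | 58.5 ± 0.6 | 69.0 ± 0.2 |
| Wav2vec2-PT | All layers | 67.2 ± 0.7 | 84.3 ± 1.7 |
| Wav2vec2-FT | Local enc. | 57.3 ± 1.0 | 58.8 ± 2.7 |
| Wav2vec2-FT | Cont. enc. | 44.6 ± 1.0 | 37.5 ± 3.0 |
| Wav2vec2-FT | All layers | 63.8 ± 0.3 | 68.7 ± 0.9 |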
3 Results and discussion
We calculated the performance of our models by training them 5 times with different seeds. Table 1 shows the average recall over all emotion classes obtained using the different features extracted from wav2vec 2.0 models. We also show two systems based on eGeMAPS and spectrogram features, which can be considered as baselines. In all cases, features are normalized by speaker and the Dense model architecture in Figure 1 is used as the downstream model. We can see that, for both wav2vec 2.0 models, the one finetuned on 960 hours of LibriSpeech (wav2vec2-FT) and the one that is not finetuned (wav2vec2-PT), the local encoder representations lead to better results than both of the baseline features. It is worth noting that eGeMAPS, spectrograms and the wav2vec 2.0 local encoder representations contain information restricted to a local window around each frame, of 60 ms for eGeMAPS and 25 ms for the others. Also, the downstream model combines information from consecutive frames using just a global average, which is a very simple approach that might be suboptimal because it cannot take into account temporal patterns in the features. In spite of that, the local encoder representations, particularly the ones obtained from the PT model, reach a performance comparable to much more complex models from the literature, such as a CNN with Bi-LSTM layers trained in a fully supervised setting with spectrograms as input.
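For reference, the average recall over classes (often called unweighted average recall) can be computed as in the following NumPy sketch. The function name and the toy labels are our own; the metric is the mean of the per-class recalls, so every class counts equally regardless of how many samples it has.

```python
import numpy as np

def average_recall(y_true, y_pred):
    """Mean of the per-class recalls over the classes present in y_true."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == label] == label) for label in np.unique(y_true)]
    return float(np.mean(recalls))

# Class 0 has recall 2/3 and class 1 has recall 1, so the average is 5/6.
toy_recall = average_recall([0, 0, 0, 1], [0, 0, 1, 1])
```

In the toy example the plain accuracy would be 3/4, which shows how the two metrics differ when classes are imbalanced.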
We can also see that the wav2vec2-PT features perform better than wav2vec2-FT features in all cases. In particular, the model using only the contextualized encoder outputs for the wav2vec2-FT model has the worst results in the table. We hypothesize that this is because when the model is finetuned for an ASR task, information that is not relevant for that task but might be relevant for speech emotion recognition is lost from the embeddings. For example, information about the pitch might not be important for speech recognition, while it is essential for speech emotion recognition.
Further, Table 1 shows that using a weighted average of the transformer block outputs along with the local encoder outputs (rows labelled "All layers") results in better performance than using only the local or the contextual encoder outputs. Using only the local encoder representations might not give information about events occurring at time scales larger than the receptive field (25 ms). Further, the output of the contextual encoder might be similar to the quantized local encoder representations, since the contrastive loss objective is designed to achieve this goal. Hence, using both of these layers, along with all intermediate layers in the transformer, provides additional information that is valuable to the model.
Figure 2 shows the weights from the weighted average layer once the downstream models are trained. We can see that the middle layers are given larger weights. This could be because these layers have already contextualized enough information from the local encoder, but they are not yet as specific to the pretraining or finetuning task as the last layers. Also, in the case of the features extracted from the wav2vec2-PT model, the weights tend to be larger in the layers closer to the output when compared with the weights for the features from the wav2vec2-FT model. This again suggests that the layers of the wav2vec2-FT model closer to the output are less useful for emotion recognition than those of the wav2vec2-PT model. Finally, note that in both datasets, the different training seeds lead to very similar weights, as observed from the error bars in Figure 2, from which we can conclude that these weights are not too sensitive to the neural network initialization.
Finally, we experimented with several variations of the best performing model in Table 1. The first line in Table 2 shows the results for that system, corresponding to the fifth line in Table 1. The rest of the lines in Table 2 show results for that model and the others in Figure 1 applying global normalization instead of speaker-dependent normalization. This is the scenario most commonly used in papers, as it is the one with the fewest assumptions about the available data: it treats all samples from a speaker as independent from each other and does not assume that additional samples from a speaker are available at test time.
Comparing the first and the second line in Table 2, we can see that results significantly degrade when doing global normalization. This is expected since normalization by speaker helps the model focus on the emotion characteristics by eliminating part of the speaker information. The degradation is larger in RAVDESS probably because, in this dataset, audios from different emotions do not have much variation in lexical content. Hence, by reducing the effect of speaker in the features all that is left is the variation due to the emotions. On the other hand, in IEMOCAP data, the variation due to lexical content is still in the samples after speaker normalization.
The third and fourth lines in Table 2 show the results obtained using the LSTM model and the Fusion model. The latter fuses eGeMAPS with the wav2vec 2.0 features. We can see that using an LSTM layer before the global pooling does not seem to bring improvements over using a simple dense layer. This might be because wav2vec 2.0 features are already contextualized and have global information about the full utterance. Also, LSTMs might be more prone to overfitting or optimization problems than a simpler dense layer. Finally, the table shows some modest improvements from the addition of eGeMAPS features, suggesting that wav2vec 2.0 features may be lacking some of the information present in eGeMAPS.
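Table 2: Average recall (%) for variations of the best performing model and for other works on IEMOCAP.

| Model - Norm | IEMOCAP | RAVDESS |
|---|---|---|
| Dense - Speaker | 67.2 ± 0.7 | 84.3 ± 1.7 |
| Dense - Global | 65.8 ± 0.3 | 75.7 ± 2.3 |
| LSTM - Global | 64.8 ± 1.9 | 74.6 ± 3.7 |
| Fusion - Global | 66.3 ± 0.7 | 77.5 ± 1.0 |
| BiLSTM w. attention | 58.8 | - |
| TDNN-LSTM w. attention | 60.7 | - |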
Table 2 also shows some of the results obtained in other works on the IEMOCAP dataset, using the same experimental setup we are using. For a fair comparison with our work, we restrict the comparison to models in the literature that do not use automatic or manual transcriptions. We can see that all of our models perform better than the state of the art.
Finally, we also compare our results with those of a previously published system, which consists of a deep convolutional neural network using acoustic features as input. For this comparison, we used the Dense downstream model with global normalization and imitated their experimental setup for both IEMOCAP and RAVDESS. For IEMOCAP, they only use the improvised sessions and full-agreement utterances. For RAVDESS, they use 5-fold cross-validation, dividing the data randomly. Moreover, they do not merge calm and neutral emotions, so the total number of emotions to be predicted is 8. We outperformed their models on both datasets, obtaining an average recall of 84.1 ± 1.2% in RAVDESS, and 72.1 ± 0.9% in IEMOCAP, compared to the results in the paper which are 64.3% and 71.6%, respectively.
In this work, we explored different ways of extracting and modeling features from pretrained wav2vec 2.0 models for speech emotion recognition. We proposed to combine the different layers in the wav2vec 2.0 model using trainable weights and model the resulting features with a simple DNN with a time-wise pooling layer. We evaluated our models on two standard emotion datasets, IEMOCAP and RAVDESS, and showed superior results in both cases, compared to those in recent literature. We found that the combination of information from different layers in the wav2vec 2.0 model led to improved results over using only the encoder outputs, as in previous works. Further, we found that the combination of the wav2vec 2.0 features with a set of prosodic features gave additional gains, suggesting that the wav2vec 2.0 model does not contain all the prosodic information needed for emotion recognition. Finally, we showed that a wav2vec 2.0 model finetuned for the task of ASR worked worse than the one trained only with the self-supervised task, indicating that the finetuning eliminates information from the embeddings that is useful for emotion recognition.
This material is based upon work supported by a Google Faculty Research Award, 2019, and an Amazon Research Award, 2019.
- [1] (2020) vq-wav2vec: self-supervised learning of discrete speech representations. In International Conference on Learning Representations.
- [2] (2020) wav2vec 2.0: a framework for self-supervised learning of speech representations. arXiv:2006.11477.
- [3] (2020) Recognizing more emotions with less data using self-supervised transfer learning. arXiv:2011.05585.
- [4] (2005) Emotions in speech-experiments with prosody and quality features in speech for use in categorical and dimensional emotion recognition environments.
- [5] (2009) Emotion in human–computer interaction. In Human-computer interaction fundamentals, Vol. 20094635.
- [6] (2005) A database of German emotional speech. In Ninth European Conference on Speech Communication and Technology.
- [7] (2008) IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation 42 (4).
- [8] (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.
- [9] (2015) The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing 7 (2).
- [10] openSMILE: the Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia.
- [11] (2017) Evaluating deep learning architectures for speech emotion recognition. Neural Networks 92.
- [12] Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML.
- [13] (2020) Conformer: convolution-augmented transformer for speech recognition. arXiv:2005.08100.
- [14] (2010) Machine Audition: Principles, Algorithms and Systems. W. Wang (Ed.).
- [15] (2020) Speech emotion recognition with deep convolutional neural networks. Biomedical Signal Processing and Control 59.
- [16] (2016) Categorical reparameterization with Gumbel-softmax. arXiv:1611.01144.
- [17] (2011) Emotion recognition using a hierarchical binary decision tree approach. Speech Communication 53 (9-10).
- [18] Contrastive unsupervised learning for speech emotion recognition. arXiv:2102.06357.
- [19] (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English. PLOS ONE 13 (5).
- [20] (2019) Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings. IEEE Transactions on Affective Computing 10 (4).
- [21] (2020) On the use of self-supervised pre-trained acoustic and linguistic features for continuous speech emotion recognition. arXiv:2011.09212.
- [22] (2016) Analysis of DNN approaches to speaker identification. In ICASSP.
- [23] (2017) Automatic speech emotion recognition using recurrent neural networks with local attention. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- [24] (2015) Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- [25] (2018) Deep contextualized word representations. In NAACL.
- [26] (2013) Emotion recognition from speech using global and local prosodic features. International Journal of Speech Technology 16 (2).
- [27] (2019) No sample left behind: towards a comprehensive evaluation of speech emotion recognition system. In Proc. Workshop on Speech, Music and Mind 2019.
- [28] (2018) Emotion identification from raw speech signals using DNNs. In Interspeech.
- [29] (2007) Emotion recognition using mel-frequency cepstral coefficients. Information and Media Technologies 2 (3).
- [30] (2017) Efficient emotion recognition from speech using deep learning on spectrograms. In Interspeech.
- [31] (2019) wav2vec: unsupervised pre-training for speech recognition. arXiv:1904.05862.
- [32] Hidden Markov model-based speech emotion recognition. Vol. 2.
- [33] (2011) Automatic speech emotion recognition using support vector machine. In Proceedings of 2011 International Conference on Electronic & Mechanical Engineering and Information Technology, Vol. 2.
- [34] (2020) Jointly fine-tuning "BERT-like" self supervised models to improve multimodal speech emotion recognition. arXiv:2008.06682.
- [35] (2017) Attention is all you need. arXiv:1706.03762.
- [36] (2019) Emotion recognition from speech. arXiv:1912.10458.
- [37] (2021) Transformer based unsupervised pre-training for acoustic representation learning. arXiv:2007.14602.
- [38] (2021) General-purpose speech representation learning through a self-supervised multi-granularity framework. arXiv:2102.01930.