The Role of Phonetic Units in Speech Emotion Recognition

08/02/2021 ∙ by Jiahong Yuan, et al. ∙ Baidu, Inc.

We propose a method for emotion recognition through emotion-dependent speech recognition using Wav2vec 2.0. Our method achieved a significant improvement over most previously reported results on IEMOCAP, a benchmark emotion dataset. Different types of phonetic units are employed and compared in terms of accuracy and robustness of emotion recognition within and across datasets and languages. Models of phonemes, broad phonetic classes, and syllables all significantly outperform the utterance model, demonstrating that phonetic units are helpful and should be incorporated in speech emotion recognition. The best performance is obtained with broad phonetic classes. Further research is needed to investigate the optimal set of broad phonetic classes for the task of emotion recognition. Finally, we found that Wav2vec 2.0 can be fine-tuned to recognize coarser-grained or larger phonetic units than phonemes, such as broad phonetic classes and syllables.

1 Introduction

Speech emotion recognition is essentially a sequence-to-one classification problem, whereas speech recognition is a sequence-to-sequence problem. In this paper we attempt to bridge these two problems. More specifically, we propose a method for emotion recognition through emotion-dependent speech recognition.

Speech emotion recognition has witnessed steady advancement over the last two decades [Ververidis&Kotropoulos, Ayadietal, Koolagudi&Rao, Anagnostopoulosetal, akccay2020speech]. Much of the earlier effort was concentrated on feature engineering. The emobase feature set in the widely used OpenSMILE toolkit [Eybenetal], for example, consists of 998 acoustic features for emotion recognition, including prosodic, spectral, and voice quality features. In recent years, more effort has been devoted to improving deep learning model architectures. Many studies in the literature are based on IEMOCAP [busso2008iemocap], a benchmark emotion dataset. At ICASSP 2020, half a dozen papers reported a cross-validation accuracy of 70% or better on this dataset [wangetal, liuetal, luetal, Pappagarietal, Yehetal], based on acoustic data only. [wangetal] achieved the highest accuracy, 73%, by employing a dual-sequence LSTM model.

A confounding factor for emotion recognition is the acoustic variability of different phonetic units such as phonemes. Figure 1 compares the spectra of two vowels (from the same speaker) in the same emotion. Clearly, their spectra differ in the location of spectral peaks, which determines vowel quality. There are two approaches to overcoming this problem. The predominant one is to make emotion features and models independent of, and more robust to, phonetic variability. The other approach is to take phonetic variability into consideration by developing phonetically aware features and models. This study adopts the second approach.

A relatively small number of studies on speech emotion recognition have investigated the effect of phonetic variability. [Shahetal] incorporated articulatory information in emotion recognition. They observed that for the vowel /AE/ anger forces a larger opening of the jaw than sadness does, while for the vowel /IY/ anger makes the lips more protruded. [Bitouketal] computed statistics of Mel-frequency cepstral coefficients over three phonetic classes (stressed vowels, unstressed vowels, and consonants) for emotion recognition. They found that spectral features computed from the consonant regions of an utterance contain more information about emotion than either stressed or unstressed vowel features. [Schulleretal] investigated whether acoustic emotion recognition strongly depends on phonetic content and demonstrated that phoneme-specific emotion models can lead to higher accuracies. [Dhamyal2020ThePB] used a self-attention based emotion classification model to study the phonetic bases of emotions by discovering the most “attended” phonemes for each emotion, and found that the distribution of these attended phonemes differs significantly between natural and acted emotions.

To exploit the effect of phonetic variability for emotion recognition, we train emotion-dependent speech recognition models. For example, the vowel /AA/ produced in “happy” speech and in “sad” speech is treated as two different units with different acoustic models. With emotion-dependent speech recognition models, emotion recognition can be done as a by-product of speech recognition. Similar efforts have been made in the literature. [Vlasenko&Wendemuth] trained two sets of HMM/GMM acoustic models of phonemes, for high-arousal and low-arousal/neutral emotions respectively, and determined the emotion of each phoneme in an utterance (obtained from speech recognition) by applying and comparing the two models of that phoneme. Compared to [Vlasenko&Wendemuth], our approach is simpler: we combine speech recognition and emotion classification into a single step of recognizing emotion-dependent phonetic units such as phonemes.
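To make the labeling concrete, the sketch below shows one way an emotion-dependent target sequence could be built from a phoneme transcript and an utterance-level emotion label. This is an illustrative sketch, not the authors' code; the “UNIT_emotion” tag format is an assumption.

```python
# Minimal sketch: turn a phoneme transcript plus an utterance-level emotion
# label into an emotion-dependent target sequence.  The "UNIT_emotion" tag
# format is an illustrative assumption, not the paper's exact convention.

EMOTIONS = ["neutral", "happy", "angry", "sad"]

def emotion_dependent_targets(phonemes, emotion):
    """Tag every phonetic unit in the utterance with the utterance's emotion."""
    assert emotion in EMOTIONS
    return [f"{p}_{emotion}" for p in phonemes]

# The same phoneme appears as a different unit under each emotion.
print(emotion_dependent_targets(["HH", "AE", "P", "IY"], "happy"))
# ['HH_happy', 'AE_happy', 'P_happy', 'IY_happy']
print(emotion_dependent_targets(["S", "AE", "D"], "sad"))
# ['S_sad', 'AE_sad', 'D_sad']
```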

Emotion-dependent models may require more training data than emotion-independent models because the number of units (and models) is multiplied by n, where n is the number of emotions. However, with the recent advancement of pretrained acoustic models, the amount of data needed for training speech recognition models is greatly reduced. For example, Wav2vec 2.0 [baevski2020wav2vec] outperforms the previous state of the art on the Librispeech [panayotov2015librispeech] test set in terms of word error rate while using 100 times less labeled data. By fine-tuning Wav2vec 2.0, we can train well-performing emotion-dependent models even with a small amount of data.

We can also use phonetic units that are coarser grained (e.g., broad phonetic classes) or larger (e.g., syllables) than phonemes in emotion-dependent models. Besides increasing the amount of training data available for each unit, we explore this idea in this study from three perspectives: (1) Can pretrained acoustic representations be fine-tuned to recognize coarser or larger phonetic units than phonemes? (2) What are the best phonetic units for emotion recognition? (3) Given that phonemes are language-specific while broad phonetic classes and syllables are language-general, are broad-phonetic-class or syllable models more robust than phoneme models for cross-lingual emotion recognition?

In the following sections we first introduce the method of fine-tuning Wav2vec 2.0 for emotion recognition, followed by experiments on the English IEMOCAP dataset and three other datasets in German, Arabic, and Mandarin Chinese. Conclusions and discussion are presented in the last section.

Figure 1: Spectra of happy /AE/ and /IY/.

2 Fine-tuning Wav2vec 2.0 for emotion recognition

2.1 The procedure

Wav2vec 2.0 is a framework for self-supervised learning of speech representations. It consists of multiple convolutional layers and self-attention layers, and it is pretrained on audio alone by masking the speech input in the latent space and solving a contrastive task defined over a quantization of the latent representations. Pretrained Wav2vec 2.0 models can be fine-tuned for speech recognition with labeled data.

The procedure of fine-tuning Wav2vec 2.0 for emotion recognition is illustrated in Figure 2. The core idea is to use emotion-dependent units as the labeled targets. A randomly initialized linear projection is added on top of the contextual representations of Wav2vec 2.0 to map the representations onto emotion-dependent units (i.e., classes), and the entire model is optimized by minimizing the CTC loss [graves2006connectionist] during fine-tuning.
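As an illustration of this setup, the sketch below attaches a randomly initialized output projection over an emotion-dependent unit vocabulary and computes the CTC loss for one dummy batch. It uses the HuggingFace transformers port of Wav2vec 2.0 as a stand-in for the fairseq recipe used in the paper; the vocabulary size and checkpoint are placeholders.

```python
# Sketch of CTC fine-tuning over emotion-dependent units, using the HuggingFace
# transformers port of Wav2vec 2.0 (the paper fine-tunes the fairseq
# libri960_big.pt checkpoint; this is an illustrative substitute).
import torch
from transformers import Wav2Vec2Config, Wav2Vec2ForCTC

NUM_UNITS = 4 * 40 + 1  # e.g., 4 emotions x ~40 phonemes, plus CTC blank (assumed size)

# The output projection (lm_head) is resized to the emotion-dependent unit
# inventory and randomly initialized; the rest of the model is pretrained.
config = Wav2Vec2Config.from_pretrained("facebook/wav2vec2-large-960h")
config.vocab_size = NUM_UNITS
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-960h", config=config, ignore_mismatched_sizes=True
)

# One illustrative update on a dummy batch: real inputs are 16 kHz waveforms and
# integer-encoded emotion-dependent unit sequences.
waveforms = torch.randn(2, 16000)                # two one-second utterances
labels = torch.randint(1, NUM_UNITS, (2, 12))    # emotion-dependent unit ids
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

loss = model(input_values=waveforms, labels=labels).loss  # CTC loss
loss.backward()
optimizer.step()
```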

The fine-tuned model can be used to transcribe speech into emotion-dependent units. For the purpose of speech emotion recognition, the recognized emotion-dependent units of an utterance are mapped to an emotion category by majority vote. For example, if most of the recognized units carry the “happy” tag, the utterance is classified as “happy”.
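A possible implementation of this majority vote, assuming recognized units are tagged with an emotion suffix as in the earlier sketch:

```python
# Map a recognized sequence of emotion-dependent units to one utterance-level
# emotion by majority vote (ties are broken arbitrarily here).
from collections import Counter

def utterance_emotion(recognized_units):
    emotions = [u.rsplit("_", 1)[-1] for u in recognized_units]
    return Counter(emotions).most_common(1)[0][0]

print(utterance_emotion(["HH_happy", "AE_happy", "P_sad", "IY_happy"]))  # happy
```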

Figure 2: Fine-tuning wav2vec 2.0 for emotion recognition.

2.2 Phonetic units

We tried four types of phonetic units for recognition: phonemes, broad phonetic classes, syllables, and the entire utterance. They are summarized, with examples, in Figure 3.

Figure 3: Phonetic units for speech emotion recognition.
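For illustration, a broad-phonetic-class inventory can be derived from ARPAbet phonemes with a mapping like the one below. The class set shown is an assumption for the sketch, not necessarily the grouping used in the paper.

```python
# Illustrative mapping from ARPAbet phonemes to broad phonetic classes.
# The class inventory is an assumption, not the paper's exact set.
BROAD_CLASSES = {
    "vowel":       {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
                    "IH", "IY", "OW", "OY", "UH", "UW"},
    "stop":        {"B", "D", "G", "K", "P", "T"},
    "fricative":   {"DH", "F", "HH", "S", "SH", "TH", "V", "Z", "ZH"},
    "affricate":   {"CH", "JH"},
    "nasal":       {"M", "N", "NG"},
    "approximant": {"L", "R", "W", "Y"},
}
PHONE_TO_CLASS = {p: c for c, phones in BROAD_CLASSES.items() for p in phones}

def to_broad_classes(phonemes):
    return [PHONE_TO_CLASS[p] for p in phonemes]

print(to_broad_classes(["HH", "AE", "P", "IY"]))
# ['fricative', 'vowel', 'stop', 'vowel']
```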

3 Experiments on IEMOCAP

3.1 Data

IEMOCAP is a benchmark emotion dataset in English. It consists of 12 hours of speech from 10 professional actors. Following the literature, we extracted 5531 utterances of four emotion types from the dataset: 1708 neutral, 1636 happy (also including excited), 1103 angry, and 1084 sad.

3.2 Fine-tuning

The Wav2vec 2.0 large model pre-trained on 960 hours of Librispeech audio (libri960_big.pt) was used for fine-tuning. The model was fine-tuned for 15k updates in all our experiments. For the first 10k updates only the output classifier was trained, after which the Transformer was also updated. max_tokens was set to 1M (equivalent to 62.5 seconds of audio at a 16 kHz sampling rate), and the learning rate was 5e-5.
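A rough sketch of this two-stage schedule, again using the HuggingFace model from the earlier sketch as a stand-in for the fairseq setup (the attribute names follow the transformers API, and everything except the output projection and, later, the Transformer encoder is assumed to stay frozen):

```python
# Two-stage schedule: only the output projection (lm_head) is trained for the
# first 10k updates; the Transformer encoder is unfrozen afterwards.
# Attribute names follow the HuggingFace Wav2Vec2ForCTC implementation.
from transformers import Wav2Vec2ForCTC

FREEZE_UPDATES, TOTAL_UPDATES = 10_000, 15_000

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")

def set_stage(model, update):
    for p in model.parameters():
        p.requires_grad = False               # freeze everything by default
    for p in model.lm_head.parameters():
        p.requires_grad = True                # output classifier is always trained
    if update >= FREEZE_UPDATES:
        for p in model.wav2vec2.encoder.parameters():
            p.requires_grad = True            # Transformer joins after 10k updates

set_stage(model, update=0)       # stage 1: classifier only
set_stage(model, update=12_000)  # stage 2: classifier + Transformer
```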

Figure 4 shows a typical training loss curve observed in fine-tuning.

Figure 4: A typical training loss curve observed in fine-tuning.

To be consistent with the literature, we conducted 10-fold cross-validation on the dataset. In each fold, the utterances of one speaker were used for testing, and those of the other nine speakers for fine-tuning.
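One simple way to build such speaker-independent folds is scikit-learn's LeaveOneGroupOut with speaker IDs as groups; the utterance and speaker identifiers below are placeholders.

```python
# Speaker-independent folds: each fold holds out all utterances of one speaker.
from sklearn.model_selection import LeaveOneGroupOut

utterances = ["utt_001", "utt_002", "utt_003", "utt_004"]  # placeholder IDs
speakers   = ["spk_A",   "spk_A",   "spk_B",   "spk_C"]    # speaker per utterance

for fold, (train_idx, test_idx) in enumerate(
        LeaveOneGroupOut().split(utterances, groups=speakers)):
    print(f"fold {fold}: test speaker {speakers[test_idx[0]]}, "
          f"{len(train_idx)} train / {len(test_idx)} test utterances")
```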

3.3 Results

3.3.1 Classification of emotion

The performance of emotion recognition is evaluated in terms of both weighted and unweighted accuracy. Weighted accuracy (WA) is the overall accuracy on the entire dataset, while unweighted accuracy (UA) is the average of the accuracies (recalls) of the emotion categories. The accuracies of models using different phonetic units are reported in Table 1.

Phonetic units WA UA
Phonemes 75.5% 75.6%
Broad phonetic classes 76.8% 77.2%
Syllables 75.4% 75.4%
Utterance 62.6% 62.5%
Table 1: Emotion recognition accuracy from using different phonetic units.
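A minimal sketch of how WA and UA can be computed from per-utterance predictions; WA corresponds to scikit-learn's accuracy and UA to balanced accuracy (the labels below are toy data, not results from the paper).

```python
# Weighted accuracy (WA) = overall accuracy; unweighted accuracy (UA) = mean of
# per-class recalls, i.e. balanced accuracy.
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = ["neutral", "happy", "happy", "sad", "angry", "sad"]      # toy labels
y_pred = ["neutral", "happy", "sad",   "sad", "angry", "neutral"]  # toy predictions

print(f"WA = {accuracy_score(y_true, y_pred):.3f}, "
      f"UA = {balanced_accuracy_score(y_true, y_pred):.3f}")
```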

From Table 1 we can see that all models except the utterance model perform significantly better than most previously reported results. In the utterance model, the entire utterance is a single unit; therefore there are only four emotion-dependent units (one for each of the four emotions) to recognize. This is essentially a classification of the utterance without considering its phonetic content. The results demonstrate that phonetic units are helpful and should be incorporated in speech emotion recognition.

The best performance is from broad phonetic classes, while phonemes and syllables perform equally well. Table 2 shows the confusion matrix of the emotion recognition results from fine-tuning broad phonetic class models.

Target emotion \ Predicted emotion Neutral Happy Angry Sad
Neutral 1189 298 100 121
Happy 222 1328 64 22
Angry 67 69 958 9
Sad 195 93 25 771
Table 2: Confusion matrix of emotion recognition using broad phonetic classes.

3.3.2 Recognition of phonetic units

Although Wav2vec 2.0 has achieved great success in speech recognition in terms of word error rate and phoneme error rate, it remains to be investigated whether the model can be fine-tuned to recognize other phonetic units such as broad phonetic classes and syllables.

We evaluated the recognition results of phonemes, broad phonetic classes, and syllables using the NIST scoring toolkit SCTK [SCTK], both including and excluding emotions. The error rates are listed in Table 3.

Phonetic units with emotion without emotion
Phonemes 31.9% 14.9%
(N=194,473) (24.2%,5.0%,2.8%) (6.6%,5.2%,3.0%)
Broad phon. cl. 28.2% 11.5%
(N=194,473) (20.3%,5.2%,2.8%) (3.0%,5.4%,3.0%)
Syllables 32.4% 16.3%
(N=78,943) (21.4%,8.8%,2.2%) (4.9%,9.0%,2.4%)
Table 3: Recognition error rates of different phonetic units counting or not counting emotion. The percentages in brackets are substitution, deletion, and insertion error rates respectively.

If we don’t count the emotion type of each recognized phonetic unit, the recognition error rates for phonemes, broad phonetic classes, and syllables are 14.9%, 11.5%, and 16.3% respectively. The results show that Wav2vec 2.0 can be fine-tuned to recognize not only phonemes, but also coarser-grained or larger phonetic units such as broad phonetic classes and syllables, without a language model.
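As a sketch of the “without emotion” condition, the emotion tag can be stripped from every unit in both the reference and hypothesis sequences before scoring. The tag format follows the earlier illustrative UNIT_emotion assumption; the paper's actual scoring is done with SCTK.

```python
# "Without emotion" scoring: drop the emotion tag from each unit in the
# reference and the hypothesis before computing the error rate.
# UNIT_emotion is the same illustrative tag format assumed above.

def strip_emotion(units):
    return [u.rsplit("_", 1)[0] for u in units]

hyp = ["HH_happy", "AE_sad", "P_happy", "IY_happy"]
ref = ["HH_happy", "AE_happy", "P_happy", "IY_happy"]

print(strip_emotion(hyp))  # ['HH', 'AE', 'P', 'IY'] -- the AE emotion error disappears
print(strip_emotion(ref))  # ['HH', 'AE', 'P', 'IY']
```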

4 Experiments on datasets in other languages

4.1 Data

We re-trained models of phonemes, broad phonetic classes, and syllables on the entire dataset of IEMOCAP, with the same hyperparameters as used in the experiments above. We then tested these models on three other emotion datasets: EmoDB [EmoDB], KSUEmotions [KSUEmotions], and Mandarin Affective Speech [MandarinAffective].

EmoDB is a German dataset of emotional speech, containing utterances performed in seven target emotions (including neutral) by ten actors. KSUEmotions contains emotional Modern Standard Arabic (MSA) speech from 23 subjects, in six emotions. Mandarin Affective Speech contains recordings in five emotions by 68 Mandarin speakers. All three datasets contain the neutral, happy, angry, and sad emotions; the utterances in these four emotions were extracted for our experiments. The number of utterances in each emotion and each dataset is listed in Table 4.

Datasets Neutral Happy Angry Sad
EmoDB (German) 79 71 127 62
KSU (Arabic) 280 280 280 280
Affective (Mandarin) 4079 4080 4080 4080
Table 4: The number of utterances in each emotion and each dataset.

4.2 Results

Table 5 lists the recognition results on the three datasets.

Datasets Phonemes Broad phonetic classes Syllables
EmoDB (German) 60.4% 66.7% 68.4%
KSU (Arabic) 54.2% 60.6% 55.0%
Affective (Mandarin) 40.7% 39.9% 41.0%
Table 5: The (weighted) accuracies of emotion recognition on three other datasets using models trained on IEMOCAP (English).

The Wav2vec 2.0 pre-trained models were trained on audio with a 16 kHz sampling rate. Recordings in IEMOCAP, EmoDB, and KSUEmotions are also sampled at 16 kHz. The Mandarin Affective Speech dataset, however, uses an 8 kHz sampling rate, so we upsampled its audio to 16 kHz for testing. Although the sampling rate then matches, audio upsampled from 8 kHz to 16 kHz contains no energy between 4 and 8 kHz, which is dramatically different from the training data. This may explain why the IEMOCAP models performed poorly on the Mandarin dataset. We also tested the 16 kHz models directly on the 8 kHz audio; the results were even worse.
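For reference, one minimal way to upsample the 8 kHz recordings to 16 kHz is shown below, using torchaudio; the paper does not specify the resampling tool, and the file paths are placeholders.

```python
# Upsample an 8 kHz recording to 16 kHz before feeding it to the 16 kHz models.
# Note: resampling adds no energy above 4 kHz; it only changes the sampling rate.
import torchaudio
import torchaudio.functional as F

waveform, sr = torchaudio.load("utterance_8k.wav")   # placeholder file path
assert sr == 8000
upsampled = F.resample(waveform, orig_freq=8000, new_freq=16000)
torchaudio.save("utterance_16k.wav", upsampled, 16000)
```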

We also conducted a 7-fold cross-validation experiment on the Mandarin Affective Speech dataset itself, partitioning the dataset into training and test sets and using initials and finals as the phonetic units. The overall (weighted) accuracy was 75.2%. This result shows that the poor performance of the IEMOCAP models on this dataset is not due to the nature of the dataset itself. The mismatch in sampling rate is probably the main factor responsible for the poor performance, together with other possible factors such as linguistic and cultural differences.

The models performed reasonably well on the German and Arabic datasets, suggesting that our method has potential for cross-lingual emotion recognition. The models of broad phonetic classes and syllables seem to work better than the phoneme models. This may be because broad phonetic classes and syllables are language-general whereas phonemes are language-specific. It may also be because there are more training data for broad phonetic classes and syllables than for phonemes, making these models more robust. Further studies are needed to test these hypotheses.

5 Conclusions and discussion

We propose a method for emotion recognition through fine-tuning wav2vec 2.0 for recognition of emotion-dependent phonetic units, including phonemes, broad phonetic classes, syllables, as well as the entire utterance. The models of phonemes, broad phonetic classes, and syllables all significantly outperform the utterance model, demonstrating that phonetic units are helpful and should be incorporated in speech emotion recognition.

The best performance is from broad phonetic classes. The advantages of using broad phonetic classes for other tasks such as speaking rate estimation [yuan&liberman] and speech enhancement [enhancement] have been reported in the literature. Using broad phonetic classes may force the model to disregard differences among phonemes within a broad phonetic class, but it may also encourage the model to attend to the phonetic variability across broad phonetic classes in emotion recognition. It will be interesting to further investigate the optimal set of broad phonetic classes for the task of emotion recognition.

Models of broad phonetic classes and syllables outperform phonemes in the setting of cross-lingual emotion recognition. This may be because broad phonetic classes and syllables are language-general whereas phonemes are language-specific, or because more training data are available for broad phonetic classes and syllables than phonemes. Further research is needed to test these hypotheses.

Our study also shows that Wav2vec 2.0 can be fine-tuned to recognize not only phonemes, but also coarser-grained or larger phonetic units such as broad phonetic classes and syllables, without a language model.

The proposed method of fine-tuning Wav2vec 2.0 for emotion recognition achieved a significant improvement over most previously reported results on IEMOCAP. In the future we will investigate how to make fine-tuned models more robust across datasets where there are mismatches in recording conditions as well as sampling rate.

References