Emotion Recognition in Low-Resource Settings: An Evaluation of Automatic Feature Selection Methods

08/28/2019 ∙ by Fasih Haider, et al.

Research in automatic emotion recognition has seldom addressed the issue of computational resource utilization. With the advent of ambient technology, which employs a variety of low-power, resource-constrained devices, this issue is increasingly gaining interest. This is especially the case in the context of health and elderly care technologies, where interventions aim at maintaining the user's independence as unobtrusively as possible. In this context, efforts are being made to model human social signals such as emotions, which can aid health monitoring. This paper focuses on emotion recognition from speech data. In order to minimize the system's memory and computational needs, a minimum number of features should be extracted for use in machine learning models. A number of feature set reduction methods exist which seek to find minimal sets of relevant features. We evaluate three state-of-the-art feature selection methods: Infinite Latent Feature Selection (ILFS), ReliefF and the generalized Fisher score (Fisher), and compare them with our recently proposed feature selection method, 'Active Feature Selection' (AFS). The evaluation is performed on three emotion recognition data sets (EmoDB, SAVEE and EMOVO) using two standard speech feature sets (eGeMAPs and emobase). The results show that similar or better accuracy can be achieved using subsets of features substantially smaller than the entire feature set. A machine learning model trained on a smaller feature set reduces the memory and computational requirements of an emotion recognition system, which can in turn lower the barriers to the use of health monitoring technology.


1 Introduction

Speech signals are used in a number of automatic prediction tasks, including cognitive state detection haider2015, cognitive load estimation bib:SchullerSteidlEtAl14in, presentation quality assessment haider2016ICASSP and emotion recognition el2011survey; bib:SchullerBatlinerEtAl11r. Emotional states can influence health and intervention outcomes. Positive emotions have been linked with health improvement, while negative emotions may have a negative impact consedine_role_2007. For example, long-term accumulations of negative emotions are predisposing factors for depression (ibid.), while positive emotion-related humour and optimism have been linked with positive health outcomes, e.g. influence on the immune system or association with cardiovascular disease dimsdale_psychological_2008. Emotion recognition is used in many applications in the domain of health technologies, e.g. to assess mental health conditions valstar_avec_2013; desmet_emotion_2013. Applications using speech usually extract emotions as an additional signal in complex systems, e.g. in smart environments mano_exploiting_2016 or in depression recognition desmet_emotion_2013. Approaches to speech signal analysis usually employ a very high-dimensional feature space consisting of large numbers of potentially relevant acoustic features. For the speech signal, these features are usually obtained by applying statistical functionals to basic energy, spectral and voicing-related acoustic descriptors eyben2009openear extracted from speech intervals lasting a few seconds bib:VerveridisKotropoulos06em. Although there is no general consensus on what the ideal set of features should be, this “brute-force” approach of employing as many features as possible seems to outperform alternative (Markovian) approaches of modelling temporal dynamics at the classifier level bib:WeningerEybenEtAl13acemaud.

On the other hand, the use of such high-dimensional data sets poses serious challenges for prediction, as they suffer from the so-called “curse of dimensionality”, a high degree of redundancy in the feature set, and a large number of features with poor descriptive value. For example, Su and Luz noticed that in a cognitive load prediction data set about 4% of the entire feature set (over 250 features) had a standard deviation of less than 0.01 and therefore contributed negligibly to the classification task su2016predicting. Moreover, such high-dimensional approaches are not suitable for designing an emotion recognition system with low power, cost, memory and computational resources, such as the Raspberry Pi Zero (https://www.raspberrypi.org/products/raspberry-pi-zero/, last accessed January 2019).

The main contribution of this study is the evaluation of different state-of-the-art feature selection methods, including our Active Feature Selection (AFS) method, on emotion recognition from speech, which has, to the best of our knowledge, not yet been systematically explored. This study extends our previous work haider2018saameat, where we first introduced the AFS method and tested it on the eating condition recognition challenge schuller_interspeech_2015.

2 Background and Related Work

The automatic identification of emotions in speech is a challenging task, and both identifying relevant acoustic features and carrying out systematic comparative evaluations have proved difficult anagnostopoulos_features_2015. In 2016, the eGeMAPs set eyben2016geneva (see Section 4.2) was designed based on features’ potential to reflect affective processes (extensively used in the literature) and their theoretical significance. It was proposed to establish a common ground of emotion-related speech features, and it has since become a de-facto standard. The set of target emotions has mostly been fixed around the ‘Big Six’ (see Section 4.1), and, similarly, evaluations are more and more frequently performed on a number of publicly available corpora (see Section 4.1). In the health domain, feature selection methods for speech processing have been applied to determine the most discriminant features in support of automation efforts, e.g. for the assessment of patients with pre-dementia and Alzheimer’s disease konig_automatic_2015-1 or for the detection of sleep apnea goldshtein_automatic_2011-1. The automatic emotion recognition problem has gained a lot of attention in the past few years dhall2018emotiw; EmotiW2017 and is addressed by processing facial, speech, body movement and biometric information knyazev2017convolutional; Haider:2016:ARV:3011263.3011270; madzlan2015automatic; ICMI2017:HuEtAl; haider2015. Numerous studies Haider:2016:ARV:3011263.3011270; Vielzeuf_Pateux_Jurie_2017; ICMI2017:HuEtAl; Wang_ICMI2017; knyazev2017convolutional; ouyang2017audio extract audio features with openSMILE using de-facto standard presets: IS10, GeMAPS, eGeMAPS, emobase.

The reviewed literature suggests that although the accuracy of various machine learning approaches in this area is promising, automatic dimensionality reduction efforts have focused largely on the removal of noisy or redundant features, with less attention paid to computational resource utilisation.

There are many dimensionality reduction methods: some are feature selection methods which require labelled data, such as correlation-based feature selection and Fisher feature selection hall1999correlation; gu_generalized_2012, and some are feature transformation methods which do not require labelled data, such as Principal Component Analysis (PCA) and independent component analysis wang2006independent. Recently, efforts have been made to reduce dimensionality using PCA to improve emotion recognition from speech jagini2017exploring; aher2016analysis; wang2010speech in different settings, including noisy conditions aher2016analysis, but dimensionality reduction using feature selection methods remains less explored.

3 Feature Selection Methods

In this section we briefly describe the feature selection methods used in this study, along with our AFS method. We selected three state-of-the-art feature selection methods; the motivation for using them is their robust performance, as demonstrated by roffo2017infinite.

3.1 Infinite Latent Feature Selection (ILFS)

The ILFS method roffo2017infinite performs an unsupervised ranking of all features. In a pre-processing stage, each feature is represented by a descriptor reflecting how discriminative it is. A probabilistic latent graph containing each feature is then built, in which weighted edges model pairwise relations among feature distributions, created using probabilistic latent semantic analysis. The relevance of each feature is computed by looking at its weight within arbitrary sets of cues. Each path in the graph represents a selection of features. The final ranking of a feature accounts for its redundancy across all possible feature subsets, selecting the most discriminative and relevant features.

Its evaluation on a range of different tasks, e.g. object recognition and DNA microarray classification, confirms its robustness, outperforming other methods in both robustness and ranking quality.
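To give a concrete flavour of the path-based ranking this family of methods builds on, the sketch below implements a simplified, Inf-FS-style relevance score rather than the full latent formulation of ILFS: pairwise feature affinities are aggregated over paths of all lengths via the convergent geometric series sum_{l>=1} (rA)^l = (I - rA)^{-1} - I. The affinity definition and function names here are our own illustrative assumptions, not the authors' code.

```python
import numpy as np

def inf_fs_style_rank(X, r=0.9):
    """Simplified infinite-path feature ranking (Inf-FS flavour, not full ILFS).

    Builds a feature-affinity graph and scores each feature by the total
    weight of all paths (of every length) passing through it.
    """
    n_feat = X.shape[1]
    std = X.std(axis=0)
    # abs. correlation between feature columns; constant features map to 0
    corr = np.abs(np.nan_to_num(np.corrcoef(X, rowvar=False)))
    # assumed affinity: reward mutually non-redundant, well-spread feature pairs
    spread = np.minimum.outer(std, std) / (std.max() + 1e-12)
    A = 0.5 * (1.0 - corr) + 0.5 * spread
    # rescale so the series sum_{l>=1} A^l converges (spectral radius < 1)
    A = r * A / (np.abs(np.linalg.eigvals(A)).max() + 1e-12)
    S = np.linalg.inv(np.eye(n_feat) - A) - np.eye(n_feat)
    scores = S.sum(axis=1)               # per-feature path "energy"
    return np.argsort(scores)[::-1]      # feature indices, most relevant first
```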

3.2 ReliefF

The ReliefF algorithm robnik-sikonja_adaptation_1997 ranks features and selects the top-scoring ones. The score is calculated by weighting features over a set of randomly sampled instances. For each sampled instance, a weight vector represents the relevance of each feature with respect to the class labels: neighbours are selected from the same class (nearest hits) and from each of the other classes (nearest misses). The weight of a feature increases when its difference from the nearest hits is low and its difference from the nearest misses is high. The per-instance weight vectors are combined into a global relevance vector, and the final subset consists of all features whose relevance exceeds a manually set threshold. ReliefF is a widely used feature selection method which has been continuously improved since its first publication.
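A minimal sketch of the weight update just described is given below (one nearest hit and one nearest miss per class, L1 distance, class priors weighting the misses); the parameter names and defaults are illustrative assumptions rather than the exact variant used in the paper.

```python
import numpy as np

def relieff(X, y, n_iter=100, seed=0):
    """Minimal ReliefF sketch. Assumes every class has >= 2 instances.

    Returns one relevance weight per feature; features above a chosen
    threshold are kept.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0) + 1e-12    # per-feature value range
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))
    m = min(n_iter, n)
    w = np.zeros(d)
    for i in rng.choice(n, size=m, replace=False):
        dist = np.abs(X - X[i]).sum(axis=1)         # L1 distance to all
        dist[i] = np.inf                            # exclude the instance itself
        hit = np.argmin(np.where(y == y[i], dist, np.inf))
        w -= np.abs(X[i] - X[hit]) / span           # differing near hit: penalty
        for c in classes:
            if c == y[i]:
                continue
            miss = np.argmin(np.where(y == c, dist, np.inf))
            scale = prior[c] / (1.0 - prior[y[i]])  # weight misses by priors
            w += scale * np.abs(X[i] - X[miss]) / span
    return w / m
```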

3.3 Generalized Fisher score (Fisher)

The generalized Fisher score gu_generalized_2012 extends the Fisher score to take into account redundancy and combinations of features. A subset of features is sought that maximizes a lower bound of the traditional Fisher score: combinations of features are evaluated jointly, and redundant features are discarded. The resulting quadratically constrained linear program (QCLP) is solved with a cutting-plane algorithm; in each iteration, a multiple kernel learning sub-problem is solved by multivariate ridge regression followed by projected gradient descent to update the kernel weights. Evaluations of the method report state-of-the-art results, outperforming many feature selection methods while having lower complexity.
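For reference, the traditional per-feature Fisher score that the generalized method lower-bounds can be computed directly; the sketch below is that classic score only, with the QCLP subset search omitted.

```python
import numpy as np

def fisher_score(X, y):
    """Classic per-feature Fisher score: between-class scatter of the feature
    means divided by the within-class variance (higher = more discriminative).
    The generalized method of gu_generalized_2012 instead searches for a
    feature *subset* maximizing a lower bound of this quantity."""
    mu = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - mu) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / (within + 1e-12)
```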

3.4 Active feature selection method

An Active Feature Selection method was recently introduced in haider2018saameat; it divides a feature set into subsets. The term ‘Active’ is used because, compared to other approaches, it evaluates feature subsets rather than each feature separately. It involves clustering the data set into N clusters (we explore N = 5, 10, 15, ..., 100; see Figure 3) using Self-Organizing Maps (SOM) with 200 iterations and batch training kohonen1998self, then evaluating the discriminative power of the features present in each cluster in a Leave-One-Subject-Out (LOSO) cross-validation setting, as depicted in Figure 1, and selecting the cluster with the highest validation accuracy. Note that we cluster the dimensions rather than the instances, and we evaluate all the features in a cluster together rather than each feature separately. Our hypothesis is that noisy features have different characteristics from informative features, so clustering will divide the features into subsets according to their common characteristics. An example of this self-organizing clustering is depicted in Figure 4, where the 88 eGeMAPs features are clustered into 70 clusters, and the features present in cluster number 23 (containing 2 of the 88) provide better results than the features in any other cluster.

Figure 1: Active feature selection method. D(m,n) represents the data, where m is the total number of training instances and n is the total number of dimensions (988 for emobase and 88 for eGeMAPs).
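The sketch below illustrates the AFS pipeline under stated assumptions: the MiniSom package stands in for the SOM implementation (the paper's experiments used a different one), scikit-learn handles the LOSO evaluation, and the grid shape and helper names are our own. A 7x10 grid, for instance, would mirror the 70-cluster example of Figure 4.

```python
import numpy as np
from minisom import MiniSom                       # assumed SOM implementation
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

def active_feature_selection(X, y, subjects, rows, cols, seed=0):
    """AFS sketch: cluster the feature *dimensions* of X with a SOM, then keep
    the cluster whose features give the best LOSO validation UAR."""
    feats = X.T                                   # one SOM sample per feature
    som = MiniSom(rows, cols, feats.shape[1], random_seed=seed)
    som.train_batch(feats, 200)                   # 200 iterations, batch mode
    # map every feature to its winning SOM node (= cluster id)
    labels = np.array([cols * r + c for r, c in (som.winner(f) for f in feats)])
    logo, best_idx, best_uar = LeaveOneGroupOut(), None, -1.0
    for cl in np.unique(labels):
        idx = np.flatnonzero(labels == cl)
        uar = cross_val_score(SVC(kernel='linear', C=0.75), X[:, idx], y,
                              groups=subjects, cv=logo,
                              scoring='balanced_accuracy').mean()
        if uar > best_uar:                        # keep best-scoring cluster
            best_idx, best_uar = idx, uar
    return best_idx, best_uar
```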

4 Experimentation

4.1 Data set

Three corpora were selected for their shared characteristics and public availability: EmoDB, SAVEE, and EMOVO. They consist of recorded acted performances, annotated using the well-known and widely used Big Six emotion categories: anger, disgust, fear, happiness, sadness and surprise, plus neutral, except in the older EmoDB data set, where boredom was used instead of surprise. Their characteristics are summarised in Tables 1 and 2.

Berlin Database of Emotional Speech (EmoDB)
The EmoDB corpus burkhardt_database_2005 is a data set commonly used in the automatic emotion recognition literature. It features 535 acted emotions in German, based on utterances carrying no emotional bias. The corpus was recorded in a controlled environment, resulting in high-quality recordings. Actors were allowed to move freely around the microphones, affecting absolute signal intensity. In addition to the emotion, each recording was labelled with a phonetic transcription using the SAMPA phonetic alphabet, emotional characteristics of the voice, segmentation of the syllables, and stress. The quality of the data set was evaluated in perception tests carried out with 20 human participants. In a first recognition test, subjects listened to a recording once before assigning one of the available categories, achieving an average recognition rate of 86%. A second, naturalness test was then performed. Recordings achieving a recognition rate lower than 80% or a naturalness rate lower than 60% were discarded, reducing the corpus to 535 recordings from the original 800.

Surrey Audio-Visual Expressed Emotion (SAVEE)
SAVEE HaqJackson_AVSP09 is an audio-visual data set that was recorded to support the development of an automatic emotion recognition system. The corpus is a set of 480 British English utterances. Each actor made 15 recordings per emotion (3 common, 2 emotion-specific, and 10 generic sentences different for each emotion) and 30 neutral recordings (the 3 common sentences and all the emotion-specific sentences). No limitation regarding audio characteristics (e.g. absolute signal intensity) is explicitly stated in the description of the data set. A qualitative evaluation of the database was run as a perception test with 10 human subjects. The mean classification accuracy was 66.5% for the audio modality, 88% for the visual modality, and 91.8% for the combined audio-visual modalities.

Italian Emotional Speech Database (EMOVO)
The EMOVO corpus costantini_emovo_2014 is a speech data set featuring emotions acted by 6 persons, based on both semantically neutral and nonsense sentences. Actors were allowed to move freely around the microphones and the volume was manually adjusted, affecting absolute signal intensity. A qualitative evaluation was performed using a discrimination test: two phrases were selected and, for each, 12 subjects had to choose between two proposed emotions. The mean accuracy of the test was about 80%.

Corpus | Size (utterances) | Participants | Emotion categories
EmoDB | 535 | 10 (5 males, 5 females); German native speaker actors | anger, disgust, fear, joy, sadness, boredom + neutral
SAVEE | 480 | 4 (males); English native speaker actors | anger, disgust, fear, happiness, sadness, surprise + neutral
EMOVO | 588 | 6 (3 males, 3 females); Italian native speaker actors | anger, disgust, fear, happiness, sadness, surprise + neutral
Table 1: Main characteristics of the data sets.
Corpus | Neutral | Anger | Disgust | Fear | Happiness | Sadness | Surprise | Boredom
EmoDB | 79 | 127 | 46 | 69 | 71 | 62 | - | 81
SAVEE | 120 | 60 | 60 | 60 | 60 | 60 | 60 | -
EMOVO | 84 | 84 | 84 | 84 | 84 | 84 | 84 | -
Table 2: Distribution of recordings across emotion categories.

4.2 Volume normalization and feature extraction

We normalized the volume of all speech utterances to the range [-1, +1] before any acoustic feature extraction. The motivation for this normalization is to make the model robust against different recording conditions, such as the distance between microphone and subject. We use the openSMILE toolkit eyben2013recent for the extraction of two acoustic feature sets which are widely used for emotion recognition, as follows:

emobase: This acoustic feature set contains MFCC, voice quality, fundamental frequency (F0), F0 envelope, LSP and intensity features, along with their first and second order derivatives. In addition, a number of statistical functionals are applied to these features, resulting in a total of 988 features for every speech utterance.

eGeMAPs: The eGeMAPs eyben2016geneva feature set contains F0 semitone, loudness, spectral flux, MFCC, jitter, shimmer, F1, F2, F3, alpha ratio, Hammarberg index and slope V0 features, with a number of statistical functionals applied to them, resulting in a total of 88 features for every speech utterance.
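As an illustration of this pipeline, the sketch below peak-normalizes one utterance and extracts eGeMAPS functionals. It uses the opensmile Python package as a convenient stand-in for the openSMILE toolkit invoked in the paper, and the eGeMAPSv02 preset as the closest available configuration; both are assumptions on our part, as is the soundfile I/O.

```python
import numpy as np
import soundfile as sf     # assumed audio I/O library
import opensmile           # Python wrapper around the openSMILE toolkit

def normalize_and_extract(wav_path):
    """Peak-normalize one utterance to [-1, +1], then extract the 88
    eGeMAPS functionals with openSMILE."""
    audio, sr = sf.read(wav_path)
    audio = audio / (np.abs(audio).max() + 1e-12)     # volume normalization
    smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.eGeMAPSv02,  # closest preset to eGeMAPs
        feature_level=opensmile.FeatureLevel.Functionals,
    )
    return smile.process_signal(audio, sr)            # 1 x 88 feature frame
```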

4.3 Classification Method

The classification is performed using a linear Support Vector Machine (SVM) trained with the SMO solver and a box constraint of 0.75. The classifier is implemented in MATLAB (http://uk.mathworks.com/products/matlab/, last accessed January 2019) using the Statistics and Machine Learning Toolbox. The feature selection methods are evaluated for the SVM classifier in a Leave-One-Subject-Out cross-validation setting using the Unweighted Average Recall (UAR) measure.
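An equivalent setup can be reproduced outside MATLAB; the sketch below is a scikit-learn rendering under our own assumptions (libsvm's SVC, whose underlying solver is SMO-based, with C playing the role of the box constraint), with UAR computed as macro-averaged recall.

```python
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.metrics import recall_score

def loso_uar(X, y, subjects):
    """Linear SVM with box constraint 0.75, evaluated Leave-One-Subject-Out.
    subjects: one speaker id per utterance, defining the LOSO folds."""
    clf = SVC(kernel='linear', C=0.75)
    pred = cross_val_predict(clf, X, y, groups=subjects, cv=LeaveOneGroupOut())
    return recall_score(y, pred, average='macro')   # UAR
```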

4.4 Evaluation Criteria

All of the emotion recognition data sets are labelled with seven classes, and we evaluate the classifier using UAR, i.e. the unweighted average of the per-class recalls. The method with the highest UAR is considered the best. A blind (chance-level) guess for this task yields a UAR of 14.29%; however, we set the baseline to the UAR obtained using the entire feature set.
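Explicitly, for K classes with TP_k correct predictions out of N_k instances of class k:

```latex
\mathrm{UAR} \;=\; \frac{1}{K}\sum_{k=1}^{K}\frac{\mathrm{TP}_k}{N_k},
\qquad K = 7 \;\Rightarrow\; \text{chance-level UAR} = \tfrac{1}{7} \approx 14.29\%.
```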

4.5 Results and discussion

We evaluated three automatic feature selection methods, namely ILFS, ReliefF and Fisher, along with our recently proposed AFS method, using two different acoustic feature sets extracted from three different data sets. The results of the three feature selection methods are shown in Figure 2. It is observed that around 30 out of 88 eGeMAPs features and around 100 out of 988 emobase features are sufficient to provide almost the same UAR as the highest achieved UAR on all three data sets. The best results of each feature selection method are given in Table 3. These results confirm that higher accuracy can be achieved using a subset of the features than using the full feature set. The results for each data set are as follows:

  1. EmoDB: The ILFS method provides better UAR results (69.68% for eGeMAPs and 76.86% for emobase) than the other methods while reducing the number of features (74 out of 88 for eGeMAPs and 685 out of 988 for emobase). For eGeMAPs, the AFS method provides a UAR of 68.46% (around 1% lower than ILFS) using 81 features. For emobase, AFS provides a UAR of 75.81% (around 1% lower than ILFS) using 696 features.

  2. EMOVO: The Fisher method provides a better UAR (40.99%) than the other methods for the eGeMAPs feature set (using only 25 out of 88 features), and the ReliefF method provides a better UAR (37.07%) than the other methods for the emobase feature set (348 out of 988 features). The results of AFS are slightly lower than the best method (by around 2%), but the number of selected features is substantially smaller than for the other methods: AFS selects only 2 out of 88 eGeMAPs features and 56 out of 988 emobase features, yielding UARs of 38.95% and 36.39%, respectively.

  3. SAVEE: The Fisher method provides a better UAR than the other methods for both the eGeMAPs feature set (34 out of 88 features, with a UAR of 42.38%) and the emobase feature set (158 out of 988 features, with a UAR of 42.38%). For eGeMAPs, the results of AFS are around 2% lower than the best method. For emobase, AFS provides a UAR of 37.50% (around 5% lower than Fisher) using only 21 features.

Figure 2: Results of the feature selection methods.

Figure 3: AFS method results. The x-axis represents the number of clusters (5, 10, 15, ..., 100); the y-axis represents the number of features (numFeat) and the Unweighted Average Recall (UAR, in %) of the best cluster.
Data Set | Feature Set | Method | numFeat | UAR
EmoDB | eGeMAPs | Baseline | 88 | 0.6849
EmoDB | eGeMAPs | ILFS | 74 | 0.6968
EmoDB | eGeMAPs | ReliefF | 88 | 0.6849
EmoDB | eGeMAPs | Fisher | 88 | 0.6849
EmoDB | eGeMAPs | AFS | 81 | 0.6846
EmoDB | emobase | Baseline | 988 | 0.7455
EmoDB | emobase | ILFS | 685 | 0.7686
EmoDB | emobase | ReliefF | 666 | 0.7528
EmoDB | emobase | Fisher | 975 | 0.7523
EmoDB | emobase | AFS | 696 | 0.7581
EMOVO | eGeMAPs | Baseline | 88 | 0.3741
EMOVO | eGeMAPs | ILFS | 28 | 0.3810
EMOVO | eGeMAPs | ReliefF | 20 | 0.3776
EMOVO | eGeMAPs | Fisher | 25 | 0.4099
EMOVO | eGeMAPs | AFS | 2 | 0.3895
EMOVO | emobase | Baseline | 988 | 0.3435
EMOVO | emobase | ILFS | 113 | 0.3469
EMOVO | emobase | ReliefF | 348 | 0.3707
EMOVO | emobase | Fisher | 464 | 0.3622
EMOVO | emobase | AFS | 56 | 0.3639
SAVEE | eGeMAPs | Baseline | 88 | 0.4083
SAVEE | eGeMAPs | ILFS | 86 | 0.4202
SAVEE | eGeMAPs | ReliefF | 82 | 0.4143
SAVEE | eGeMAPs | Fisher | 34 | 0.4238
SAVEE | eGeMAPs | AFS | 68 | 0.4048
SAVEE | emobase | Baseline | 988 | 0.3810
SAVEE | emobase | ILFS | 574 | 0.3881
SAVEE | emobase | ReliefF | 72 | 0.3929
SAVEE | emobase | Fisher | 158 | 0.4238
SAVEE | emobase | AFS | 21 | 0.3750
Table 3: Best results of the feature selection methods.

Figure 4: AFS method results: the number of features present in each cluster (i.e. hexagon, or neuron) along with the UAR (%) obtained using eGeMAPs features for the EMOVO data set, where 2 out of 88 features provide better results than the other feature subsets.

The generalized Fisher score provides the best results in 3 out of 6 cases, ILFS in 2 out of 6 cases, and ReliefF in 1 out of 6 cases. The AFS method comes second in 3 out of 6 cases, as shown in Table 3. It is also observed that the AFS method provides almost the same UAR as the other state-of-the-art feature selection methods, but with a smaller number of dimensions. We noticed that for the EMOVO data set only 2 out of 88 eGeMAPs features (selected by AFS) provide better results than ReliefF, ILFS and the baseline (i.e. the entire feature set); to gain further insight into this result, we show the evaluation of the clusters (feature subsets) by AFS in Figure 4. From Figure 4, we observe that many clusters of features provide better results than the blind guess (14.29%), while the cluster selected by AFS contains only 2 features and leads to a UAR of 38.95%. One possible line of future work is to fuse features from different clusters in the classification task to see whether the UAR can be improved further. The AFS method was also evaluated with different numbers of clusters for the SOM algorithm; the best UAR obtained for each number of clusters, together with the number of features (numFeat) in the selected cluster, is shown in Figure 3.

In a previous study haider2018saameat, we demonstrated that the AFS method is able to select a feature subset which provides better results than the entire feature set and a PCA-based feature set for eating condition recognition. However, those results were not analysed in as much detail as in this study, and AFS had not been evaluated on multiple data sets or compared against other feature selection methods; this study is therefore a step towards demonstrating the generalisability of the AFS method. The contribution of this study is thus not only the evaluation of AFS, but also the assessment of the extent to which AFS, ReliefF, Fisher and ILFS can select subsets small enough to impose lower computational demands on low-resource systems, while preserving or improving emotion recognition performance in comparison to the full feature sets.

5 Conclusion

This study evaluated three state-of-the-art feature selection methods, namely infinite latent feature selection, ReliefF and the generalized Fisher score, for emotion recognition, along with the recently proposed ‘Active Feature Selection’ method, using three different emotion recognition data sets, namely EmoDB, EMOVO and SAVEE. The results show that a higher UAR can be achieved using a subset of the full feature set. In summary, around 30 out of 88 eGeMAPs features and 100 out of 988 emobase features are sufficient to obtain almost the same UAR as the full feature set provides. This finding is relevant to the development of machine learning models for machines with low computational resources. The AFS method provides almost the same UAR as the other methods while using fewer features. However, AFS currently uses only the features present in a single cluster. In future studies, we will explore methods to rank the clusters of features and to fuse different clusters for possible accuracy improvements. Other possible avenues for future work include testing AFS on other modalities alongside speech.

6 Acknowledgement

This research has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 769661.

References