Automatic Speech Emotion Recognition (SER) is key for building intelligent human-machine interfaces that can adapt to the affective state of the user, especially in cases like call centers where no other information modality is available.
Extracting features capable of capturing the emotional state of the speaker is a challenging task for SER. Prosodic, spectral and voice quality Low Level Descriptors (LLDs), extracted from speech frames, have been extensively used for SER. Proposed SER approaches mainly differ in the aggregation and temporal modeling of the input sequence of LLDs. In utterance-based approaches, statistical functionals are applied over all LLD values of the included frames. These utterance-level statistical representations have been successfully used for SER with Support Vector Machines (SVMs), Convolutional Neural Networks (CNNs) and Deep Belief Networks (DBNs) in a multi-task learning setup. Moreover, segment-based approaches have shown that computing statistical functionals over LLDs at appropriate timescales yields a significant performance improvement for SER systems. Specifically, statistical representations have been extracted from overlapping segments, each one spanning a couple of words. The resulting sequence of segment representations is fed as input to a Long Short-Term Memory (LSTM) unit for SER classification.
Direct SER approaches are usually based on raw LLDs extracted from emotional utterances. CNNs and Bidirectional LSTMs (BLSTMs) over spectrogram representations have reported state-of-the-art performance on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database. LSTMs with attention mechanisms have also been proposed in order to accommodate an active selection of the most emotionally salient frames. To this end, Sparse Auto-Encoders (SAEs) for learning salient features from spectrograms of emotional utterances have also been studied.
Despite the great progress that has been made in SER, the aforementioned LLDs are extracted under the assumption of a linear source-filter model of speech generation. However, vocal fold oscillations and vocal tract fluid dynamics often exhibit highly nonlinear dynamical properties which might not be aptly captured by conventional LLDs. Nonlinear analysis of a speech signal through the reconstruction of its corresponding Phase Space (PS) consists in embedding the signal in a higher-dimensional space where its dynamics are unfolded. Recurrent patterns of these orbits are indicative attributes of the system's behavior and can be analyzed using Recurrence Plots (RPs). Recurrence Quantification Analysis (RQA) provides complexity measures for an RP which are capable of identifying a system's transitions between chaotic and ordered regimes. A variety of nonlinear features, such as the Teager Energy Operator, modulation features from instantaneous amplitude and phase, as well as geometrical measures from PS orbits, have been reported to yield significant improvement on SER when combined with conventional feature sets. However, RQA has not yet been employed for SER: RQA measures have been shown to be statistically significant for the discrimination of emotions, but an actual SER experimental setup is missing.
In this paper, we extract RQA measures from speech-frames and evaluate them for SER. We test the efficacy of the proposed RQA feature set under both utterance and segment-based approaches by calculating statistical functionals over the respective time lengths. SVMs and Logistic Regression (LR) classifiers are used for the utterance-based approach as well as an Attention-BLSTM (A-BLSTM) for the respective segment-based approach. The performance of the proposed RQA feature set, as well as the fusion of the RQA features with conventional feature sets, is reported on three databases and compared with state-of-the-art results for Speaker-Dependent (SD), Speaker-Independent (SI) and Leave One Session Out (LOSO) SER experiments.
2 Feature Extraction
2.1 Baseline Feature Set (IS10 Set)
2.2 Proposed Nonlinear Feature Set (RQA Set)
The RQA feature set for a given speech segment or utterance is extracted as described next. First, we break the given speech signal into frames and for each one we reconstruct its PS as shown in Section 2.2.1. For each PS orbit, its respective RP is computed as explained in Section 2.2.2. In order to quantify the complex structures of the RP, a list of RQA measures (described in Section 2.2.3) is extracted, resulting in a fixed-dimensional representation of the input speech frame. Representations for speech segments and utterances containing multiple frames are obtained by applying a set of statistical functionals (listed in Section 2.2.4) over the frame-level attributes and their deltas, yielding the final feature vector.
2.2.1 Phase Space Reconstruction
Given a speech frame we reconstruct its corresponding PS trajectory by computing time-delayed versions of the original speech frame at multiples of a time lag and stacking them into delay vectors as shown next:

x_i = (s_i, s_{i+τ}, ..., s_{i+(m−1)τ})

where m is the embedding dimension of the reconstructed PS and τ is the time lag. If the embedding theorem holds and the aforementioned parameters are set appropriately, then the orbit defined by the points x_i truthfully preserves invariant quantities of the true underlying dynamics, which are assumed to be unknown. The parameters τ and m for each speech frame are estimated individually by using Average Mutual Information (AMI) and False Nearest Neighbors (FNN), respectively.
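As a minimal sketch of the delay embedding described above (pure Python; the function and parameter names are ours, and in the paper the embedding dimension and lag are estimated per frame via FNN and AMI rather than fixed):

```python
def reconstruct_phase_space(frame, dim, lag):
    """Time-delay embedding of a 1-D speech frame.

    Each reconstructed point collects `dim` samples spaced `lag` apart:
    x_i = (s_i, s_{i+lag}, ..., s_{i+(dim-1)*lag}).
    """
    n_points = len(frame) - (dim - 1) * lag
    return [tuple(frame[i + j * lag] for j in range(dim))
            for i in range(n_points)]

# Example: a 6-sample frame embedded with dim=3, lag=2
orbit = reconstruct_phase_space([0, 1, 2, 3, 4, 5], dim=3, lag=2)
# orbit == [(0, 2, 4), (1, 3, 5)]
```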
2.2.2 Recurrence Plot
Given a PS trajectory we analyze the recurrence properties of its states by calculating the pairwise distances and thresholding these values in order to compute the corresponding RP. RPs are binary square matrices and are defined element-wise as shown next:

R_{i,j} = Θ(ε − ‖x_i − x_j‖_p)

where Θ(·) is the Heaviside step function, ε is the thresholding value and ‖·‖_p is the norm used to define the distance between trajectory points (for p = 1, p = 2 or p = ∞ we compute the Manhattan, Euclidean or Supremum norm, respectively). Thus, the matrix consists of ones in areas where the states of the orbit are close and zeros elsewhere. The measure of proximity is defined by the threshold ε, for which multiple selection criteria have been studied. We consider three criteria depending on: 1) a fixed ad-hoc threshold value, 2) a fixed Recurrence Rate (RR) as defined in Table 1 (i.e., ε is set according to a fixed quantile of the pairwise distances of the PS points), and 3) a fixed ratio of the standard deviation of the points. For fixed parameter values we simply write R_{i,j} for the respective entry of the RP matrix.
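A sketch of the RP computation (NumPy-based; the function name is ours), including the fixed-recurrence-rate thresholding criterion implemented as a quantile of the pairwise distances:

```python
import numpy as np

def recurrence_plot(orbit, eps=None, fixed_rr=None, p=2):
    """Binary RP from a phase-space orbit.

    R[i, j] = 1 iff ||x_i - x_j||_p <= eps. If `fixed_rr` is given instead
    of `eps`, the threshold is set to the corresponding quantile of the
    pairwise distances, so the recurrence rate is approximately `fixed_rr`.
    p = 1, 2 or np.inf gives the Manhattan, Euclidean or Supremum norm.
    """
    X = np.asarray(orbit, dtype=float)
    diffs = X[:, None, :] - X[None, :, :]
    D = np.linalg.norm(diffs, ord=p, axis=-1)   # pairwise distance matrix
    if fixed_rr is not None:
        eps = np.quantile(D, fixed_rr)
    return (D <= eps).astype(int)
```

By construction the resulting matrix is symmetric with a main diagonal of ones (the Line of Identity).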
An l-length diagonal line (of ones) is defined by:

(1 − R_{i−1,j−1}) (1 − R_{i+l,j+l}) ∏_{k=0}^{l−1} R_{i+k,j+k} = 1

An l-length vertical line is described by:

(1 − R_{i,j−1}) (1 − R_{i,j+l}) ∏_{k=0}^{l−1} R_{i,j+k} = 1

An l-length white vertical line (of zeros) is defined as:

R_{i,j−1} R_{i,j+l} ∏_{k=0}^{l−1} (1 − R_{i,j+k}) = 1

We also denote with P_D(l), P_V(l) and P_W(l) the histogram distributions of lengths of diagonal, vertical and white vertical lines, respectively. Hence, the total numbers of these lines are correspondingly N_D = Σ_{l≥l_D} P_D(l), N_V = Σ_{l≥l_V} P_V(l) and N_W = Σ_{l≥l_W} P_W(l), where l_D, l_V and l_W define the minimum lengths for each type of line.
Emerging small-scale structures based on lines of ones or zeros reflect the dynamic behavior of the system. For instance, diagonal lines indicate both similar evolution of states for different parts of the PS orbit and deterministic chaotic dynamics of the system. This is also depicted in Figure 1.
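For illustration, the diagonal-line histogram and a line-based ratio such as determinism (DET) could be computed from a binary RP as follows (helper names are ours; in practice pyunicorn provides these measures directly):

```python
import numpy as np
from collections import Counter

def diagonal_line_histogram(R):
    """Histogram of lengths of diagonal runs of ones in a binary RP."""
    R = np.asarray(R)
    n = R.shape[0]
    hist = Counter()
    for k in range(-(n - 1), n):            # scan every diagonal
        run = 0
        for v in np.diagonal(R, offset=k):
            if v:
                run += 1
            elif run:
                hist[run] += 1
                run = 0
        if run:                              # close a run ending at the border
            hist[run] += 1
    return hist

def determinism(R, l_min=2):
    """Fraction of recurrence points forming diagonals of length >= l_min."""
    hist = diagonal_line_histogram(R)
    total = sum(l * c for l, c in hist.items())
    det = sum(l * c for l, c in hist.items() if l >= l_min)
    return det / total if total else 0.0
```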
2.2.3 Recurrence Quantification Analysis (RQA)
For each RP we extract RQA measures using the pyunicorn framework. Following the notation established in Section 2.2.2, we provide an overview of these measures in Table 1; they are comprehensively studied in the RQA literature.
|Max Diagonal Length|L_max = max{ l : P_D(l) > 0 }|
|Average Diagonal Length|L_avg = Σ_{l≥l_D} l P_D(l) / Σ_{l≥l_D} P_D(l)|
|Max Vertical Length|V_max = max{ l : P_V(l) > 0 }|
|Max White Vertical Length|W_max = max{ l : P_W(l) > 0 }|
|Average White Vertical Length|W_avg = Σ_{l≥l_W} l P_W(l) / Σ_{l≥l_W} P_W(l)|
|White Vertical Entropy|W_ent = −Σ_l p(l) ln p(l), with p(l) = P_W(l)/N_W|
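The vertical-line and white-vertical-line measures of Table 1 follow the same run-counting pattern over columns instead of diagonals. A sketch (helper names are ours; laminarity, shown as the example ratio, is a standard RQA measure computed from the same histogram):

```python
from collections import Counter

def vertical_line_histogram(R, white=False):
    """Histogram of vertical runs of ones (or zeros, if white=True) per column."""
    target = 0 if white else 1
    hist = Counter()
    for j in range(len(R)):                  # scan each column
        run = 0
        for i in range(len(R)):
            if R[i][j] == target:
                run += 1
            elif run:
                hist[run] += 1
                run = 0
        if run:                              # close a run ending at the border
            hist[run] += 1
    return hist

def laminarity(R, v_min=2):
    """Fraction of recurrence points forming vertical lines of length >= v_min."""
    hist = vertical_line_histogram(R)
    total = sum(l * c for l, c in hist.items())
    lam = sum(l * c for l, c in hist.items() if l >= v_min)
    return lam / total if total else 0.0
```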
2.2.4 Statistical Functionals
2.3 Fused Feature Set (RQA + IS10 Set)
For any emotional speech segment or utterance we extract both the IS10 and RQA feature sets as described previously and concatenate them into the final feature vector.
3 Classification Methods
We investigate both utterance-based and segment-based SER as outlined below:
Utterance-based method: For each utterance we obtain its statistical representation by extracting the corresponding feature set as described in Section 2. For emotion classification we employ an SVM with a Radial Basis Function (RBF) kernel and a one-versus-rest LR classifier. The cost coefficient C, the only hyper-parameter to be tuned, is selected from a fixed interval for both the SVM and LR models. Both models are implemented using the scikit-learn framework.
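A minimal scikit-learn sketch of this classification stage (the toy features stand in for the utterance-level statistics, and the C grid shown is illustrative, not the paper's exact interval):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import GridSearchCV

# Toy utterance-level feature vectors and emotion labels
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 8) + 2, rng.randn(20, 8) - 2])
y = np.array([0] * 20 + [1] * 20)

# RBF-kernel SVM; the cost coefficient C is the only tuned hyper-parameter
svm = GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 10]}, cv=3).fit(X, y)

# One-versus-rest logistic regression
lr = OneVsRestClassifier(LogisticRegression(C=1.0, max_iter=1000)).fit(X, y)
```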
Segment-based method: We break each utterance into fixed-length overlapping segments. For each speech segment we extract the feature sets described in Section 2 and, as a result, each utterance is now represented by a sequence of statistical vectors corresponding to different time steps. This sequence is fed as input to a Long Short-Term Memory (LSTM) unit for emotion classification. SER can be formulated as a many-to-one sequence learning problem where the expected output for each sequence of segment features is an emotional label derived from the activations of the last hidden layer. We employ an A-BLSTM architecture where the decision for the emotional label is derived from a weighted aggregation of all timesteps. We implement this architecture in pytorch. The grid of hyper-parameters consists of the number of layers, number of hidden nodes, input noise, dropout rate and learning rate.
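A sketch of such an attention-BLSTM in pytorch (layer sizes are illustrative, and the attention shown is a simple learned weighting over timesteps, which may differ in detail from the exact architecture used here):

```python
import torch
import torch.nn as nn

class ABLSTM(nn.Module):
    """BLSTM whose output is an attention-weighted sum over all timesteps."""
    def __init__(self, input_dim, hidden_dim, n_classes):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                             bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)    # one score per timestep
        self.out = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, x):                  # x: (batch, timesteps, input_dim)
        h, _ = self.blstm(x)               # (batch, T, 2*hidden_dim)
        w = torch.softmax(self.attn(h), dim=1)   # attention weights over T
        context = (w * h).sum(dim=1)       # weighted aggregation of timesteps
        return self.out(context)           # (batch, n_classes) logits

model = ABLSTM(input_dim=64, hidden_dim=32, n_classes=4)
logits = model(torch.randn(2, 10, 64))     # 2 utterances, 10 segments each
```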
4 Experiments and Results
The following databases are used in our experiments:
SAVEE: The Surrey Audio-Visual Expressed Emotion (SAVEE) database is composed of emotional speech voiced by male actors. SAVEE includes utterances of the emotions anger, disgust, fear, happiness, sadness, surprise and neutral.
Emo-DB: The Berlin Database of Emotional Speech (Emo-DB) contains emotional sentences in German, voiced by both male and female actors. The included emotions are anger, disgust, fear, joy, sadness, boredom and neutral.
IEMOCAP: The IEMOCAP database contains video data of scripted and improvised dialogs recorded from actors. Utterances are organized in sessions of dyadic interactions between actors. For our experiments we consider utterances of 4 emotions (1103 angry, 1636 happy, 1708 neutral and 1084 sad), where we merge the excitement and happiness classes into the latter.
We evaluate our proposed feature set under three different SER tasks described next. We also compare our results with the most relevant experimental setups reported in the literature. For all tasks we report Weighted Accuracy (WA), the percentage of correct classification decisions, and Unweighted Accuracy (UA), the average of the per-class recall percentages.
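Concretely, the two metrics can be computed as follows (plain Python; UA is simply the macro-average of per-class recall):

```python
from collections import defaultdict

def weighted_accuracy(y_true, y_pred):
    """WA: overall fraction of correct classification decisions."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def unweighted_accuracy(y_true, y_pred):
    """UA: recall averaged over emotional classes (macro recall)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += (t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Example: class 0 recall = 2/3, class 1 recall = 1
wa = weighted_accuracy([0, 0, 0, 1], [0, 0, 1, 1])    # 0.75
ua = unweighted_accuracy([0, 0, 0, 1], [0, 0, 1, 1])  # (2/3 + 1)/2
```

Note that WA favors the majority class on imbalanced data, while UA weights every class equally, which is why both are reported.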
After an extensive study of the RQA configuration parameters described in Section 2.2.2, we conclude that the best results on SER tasks are obtained with a fixed frame duration for extracting RPs. In addition, the best performing RP configuration appears to be the Manhattan norm with a threshold set according to a fixed recurrence rate.
4.1 Speaker Dependent (SD)
We evaluate RQA features on SAVEE and Emo-DB following the utterance-based approach described in Section 3. In this setup we apply per-speaker z-normalization (PS-N) and randomly split utterances into train and test sets. Accuracies using cross-validation are summarized in Table 2 for the best performing classifier hyper-parameter values.
The fused set achieves a significant performance improvement over the baseline IS10 feature set for both datasets, in both WA and UA on SAVEE as well as on Emo-DB. The feature set used in the ESR baseline is extracted over cepstral, spectral and prosodic LLDs similar to the ones used in IS10. Noticeably, it achieves similar performance to ours when we use only IS10, but our fused set with LR outperforms it in both UA and WA on Emo-DB and SAVEE. The proposed combination of features and LR also surpasses a Convolutional SAE approach in terms of WA on both Emo-DB and SAVEE. Presumably, RQA measures contain information closely related to speaker-specific emotional dynamics not captured by conventional features.
|LLDs Stats|ESR|76.3|73.4|88.7|87.9|
4.2 Speaker Independent (SI)
Again, we follow the utterance-based approach described in Section 3 on both SAVEE and Emo-DB, but we do not make any assumptions about the identity of the user during training. We use leave-one-speaker-out cross-validation, where one speaker is kept for testing and the rest for training. The mean and standard deviation are calculated only on training data and used for z-normalization of all data. From now on we refer to this normalization as Per Fold-Normalization (PF-N). Table 3 presents accuracies averaged over all folds for the best performing classifier hyper-parameter values.
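The PF-N scheme amounts to fitting the scaler on the training fold only, e.g. with scikit-learn's StandardScaler (the data here is synthetic, purely for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_train = rng.randn(100, 5) * 3 + 7   # training-fold features
X_test = rng.randn(20, 5) * 3 + 7     # held-out speaker's features

# Per Fold-Normalization: statistics come from the training fold alone
scaler = StandardScaler().fit(X_train)
X_train_n = scaler.transform(X_train)  # zero mean, unit std by construction
X_test_n = scaler.transform(X_test)    # same transform; no test statistics leak
```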
In comparison with the baseline IS10 feature set, the fused feature set obtains an absolute improvement in both WA and UA on SAVEE as well as on Emo-DB. Furthermore, our fused set achieves higher performance on SAVEE and slightly lower on Emo-DB compared to an approach where Weighted Spectral Features based on Hu Moments (WSFHM) are fused with IS10 at the utterance level, which is similar to ours. In a direct comparison using the same model (SVM), we surpass the reported performance in terms of WA on both SAVEE and Emo-DB. In addition, both the RQA and IS10 sets on their own achieve quite low performance on SAVEE; however, their combination yields an impressive performance improvement in both WA and UA over IS10 when we use LR. Our results suggest that RQA measures preserve invariant aspects of nonlinear dynamics occurring in emotional speech that are shared across different speakers.
|LLDs Stats|ESR|51.5|49.3|82.4|78.7|
4.3 Leave One Session Out (LOSO)
In this task, we assume that the test-speaker identity is unknown but we are able to train our model on other speakers who are recorded in similar conditions. We evaluate both the utterance and segment-based methods (described in Section 3) on IEMOCAP. Given our assumption, we treat each of the sessions as a speaker group and use LOSO in order to create train and test folds. In each fold, we use the remaining sessions for training and the held-out session for testing. Within the testing session we use one speaker as the test set and the other for tuning the hyper-parameters of our models; we then repeat the evaluation by reversing the roles of the two speakers. In the final assessment, we report the average WA and UA obtained over all speakers. In order to be easily comparable with the literature we follow three different normalization schemes. We use the aforementioned PS-N and PF-N schemes as well as Global z-normalization (G-N). In G-N we calculate the global mean and standard deviation from all the available samples in the dataset and perform z-normalization with them. Results on IEMOCAP for the three normalization schemes are shown in Table 4.
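The LOSO fold construction can be sketched as follows (the session/speaker bookkeeping is hypothetical, assuming the standard IEMOCAP layout of five sessions with one female and one male speaker each):

```python
# Hypothetical speaker identifiers: session s has speakers S{s}_spkF, S{s}_spkM
sessions = {s: [f"S{s}_spk{k}" for k in ("F", "M")] for s in range(1, 6)}

folds = []
for test_session in sessions:
    # Train on all speakers outside the held-out session
    train = [spk for s, spks in sessions.items()
             if s != test_session for spk in spks]
    # Within the test session, each speaker takes a turn as the test set
    # while the other is used for hyper-parameter tuning (dev set)
    a, b = sessions[test_session]
    folds.append({"train": train, "dev": b, "test": a})
    folds.append({"train": train, "dev": a, "test": b})
```

Final scores would then be averaged over all ten (session, speaker) folds.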
A consistent performance improvement is shown for all combinations of normalization techniques and employed models when the fused set is used instead of IS10: the fused set yields a relative improvement in both WA and UA for the SVM under all normalization strategies, and the same applies for LR and A-BLSTM. In accordance with our intuition, a segment-based approach using A-BLSTM surpasses all utterance-based ones in both WA and UA for all normalization schemes when the fused set is used.
In related work, a stacked autoencoder is used to extract feature representations from spectrograms of glottal flow signals and a BLSTM is then used for classification. We surpass the first reported result in UA and the second by a margin in both WA and UA, even with simple models. Compared to a multi-task DBN trained for both discrete emotion classification and valence-activation prediction, we report higher WA and higher UA. We also report higher UA and lower WA compared to CNNs over spectrograms. We assume that this inconsistency in performance metrics occurs because a slightly different experimental setup is followed, where the final session is excluded from testing.
We investigated the usage of nonlinear RQA measures extracted from RPs for SER. The effectiveness of these features has been tested under both utterance-based and segment-based approaches across three emotion databases. The fusion of the nonlinear and conventional feature sets yields a significant performance improvement over traditional feature sets for all SER tasks; the improvement is especially large when the speaker identity is unknown. The fused feature set improves on the state-of-the-art for SER under most testing conditions, classification methods and datasets. Recurrence analysis of speech signals is thus a promising direction for SER research. In the future, we plan to automatically extract features from RPs using convolutional autoencoders in order to substitute the hand-crafted RQA measures.
This work has been partially supported by the BabyRobot project supported by EU H2020 (grant #687831). Special thanks to Nikolaos Athanasiou and Nikolaos Ellinas for their contributions on the experimental environment setup.
-  B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. Müller, and S. Narayanan, “The interspeech 2010 paralinguistic challenge,” in Proceedings of INTERSPEECH, 2010, pp. 2794–2797.
-  C. M. Lee and S. S. Narayanan, “Toward detecting emotions in spoken dialogs,” IEEE Transactions on speech and audio processing, vol. 13, no. 2, pp. 293–303, 2005.
-  M. El Ayadi, M. S. Kamel, and F. Karray, “Survey on speech emotion recognition: Features, classification schemes, and databases,” Pattern Recognition, vol. 44, no. 3, pp. 572–587, 2011.
-  E. Mower, M. J. Mataric, and S. Narayanan, “A framework for automatic human emotion classification using emotion profiles,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 5, pp. 1057–1070, 2011.
-  Z. Aldeneh and E. M. Provost, “Using regional saliency for speech emotion recognition,” in Proceedings of ICASSP, 2017, pp. 2741–2745.
-  R. Xia and Y. Liu, “A multi-task learning framework for emotion recognition using 2d continuous space,” IEEE Transactions on Affective Computing, vol. 8, no. 1, pp. 3–14, 2017.
-  B. Schuller and G. Rigoll, “Timing levels in segment-based speech emotion recognition,” in Proceedings of INTERSPEECH, 2006.
-  E. Tzinis and A. Potamianos, “Segment-based speech emotion recognition using recurrent neural networks,” in Affective Computing and Intelligent Interaction (ACII), 2017, pp. 190–195.
-  H. M. Fayek, M. Lech, and L. Cavedon, “Evaluating deep learning architectures for speech emotion recognition,” Neural Networks, vol. 92, pp. 60–68, 2017.
-  S. Ghosh, E. Laksana, L.-P. Morency, and S. Scherer, “Representation learning for speech emotion recognition.” in Proceedings of INTERSPEECH, 2016, pp. 3603–3607.
-  C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “Iemocap: Interactive emotional dyadic motion capture database,” Language resources and evaluation, vol. 42, no. 4, p. 335, 2008.
-  C.-W. Huang and S. S. Narayanan, “Attention assisted discovery of sub-utterance structure in speech emotion recognition.” in Proceedings of INTERSPEECH, 2016, pp. 1387–1391.
-  S. Mirsamadi, E. Barsoum, and C. Zhang, “Automatic speech emotion recognition using recurrent neural networks with local attention,” in Proceedings of ICASSP, 2017, pp. 2227–2231.
-  Q. Mao, M. Dong, Z. Huang, and Y. Zhan, “Learning salient features for speech emotion recognition using convolutional neural networks,” IEEE Transactions on Multimedia, vol. 16, no. 8, pp. 2203–2213, 2014.
-  H. Herzel, “Bifurcations and chaos in voice signals,” Applied Mechanics Reviews, vol. 46, no. 7, pp. 399–413, 1993.
-  V. Pitsikalis and P. Maragos, “Analysis and classification of speech signals by generalized fractal dimension features,” Speech Communication, vol. 51, no. 12, pp. 1206–1223, 2009.
-  J.-P. Eckmann, S. O. Kamphorst, and D. Ruelle, “Recurrence plots of dynamical systems,” EPL (Europhysics Letters), vol. 4, no. 9, p. 973, 1987.
-  N. Marwan, M. C. Romano, M. Thiel, and J. Kurths, “Recurrence plots for the analysis of complex systems,” Physics reports, vol. 438, no. 5-6, pp. 237–329, 2007.
-  R. Sun and E. Moore, “Investigating glottal parameters and teager energy operators in emotion recognition,” in Affective Computing and Intelligent Interaction, 2011, pp. 425–434.
-  T. Chaspari, D. Dimitriadis, and P. Maragos, “Emotion classification of speech using modulation features,” in Proceedings of Signal Processing Conference (EUSIPCO), 2014, pp. 1552–1556.
-  A. Shahzadi, A. Ahmadyfard, A. Harimi, and K. Yaghmaie, “Speech emotion recognition using nonlinear dynamics features,” Turkish Journal of Electrical Engineering & Computer Sciences, vol. 23, no. Sup. 1, pp. 2056–2073, 2015.
-  A. Lombardi, P. Guccione, and C. Guaragnella, “Exploring recurrence properties of vowels for analysis of emotions in speech,” Sensors & Transducers, vol. 204, no. 9, p. 45, 2016.
-  F. Eyben, F. Weninger, F. Gross, and B. Schuller, “Recent developments in opensmile, the munich open-source multimedia feature extractor,” in Proceedings of the 21st ACM International Conference on Multimedia, 2013, pp. 835–838.
-  T. Sauer, J. A. Yorke, and M. Casdagli, “Embedology,” Journal of statistical Physics, vol. 65, no. 3-4, pp. 579–616, 1991.
-  A. M. Fraser and H. L. Swinney, “Independent coordinates for strange attractors from mutual information,” Physical review A, vol. 33, no. 2, p. 1134, 1986.
-  M. B. Kennel, R. Brown, and H. D. I. Abarbanel, “Determining embedding dimension for phase-space reconstruction using a geometrical construction,” Phys. Rev. A, vol. 45, pp. 3403–3411, 1992.
-  S. Schinkel, O. Dimigen, and N. Marwan, “Selection of recurrence threshold for signal detection,” The european physical journal special topics, vol. 164, no. 1, pp. 45–53, 2008.
-  M. Thiel, M. C. Romano, J. Kurths, R. Meucci, E. Allaria, and F. T. Arecchi, “Influence of observational noise on the recurrence quantification analysis,” Physica D: Nonlinear Phenomena, vol. 171, no. 3, pp. 138–152, 2002.
-  N. Marwan, J. Kurths, and S. Foerster, “Analysing spatially extended high-dimensional dynamics by recurrence plots,” Physics Letters A, vol. 379, no. 10, pp. 894 – 900, 2015.
-  J. F. Donges, J. Heitzig, B. Beronov, M. Wiedermann, J. Runge, Q. Y. Feng, L. Tupikina, V. Stolbova, R. V. Donner, N. Marwan et al., “Unified functional network and nonlinear time series analysis for complex systems science: The pyunicorn package,” Chaos: An Interdisciplinary Journal of Nonlinear Science, vol. 25, no. 11, p. 113101, 2015.
-  C. L. Webber Jr and J. P. Zbilut, “Recurrence quantification analysis of nonlinear dynamical systems,” Tutorials in contemporary nonlinear methods for the behavioral sciences, pp. 26–94, 2005.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017.
-  S. Haq and P. Jackson, “Speaker-dependent audio-visual emotion recognition,” in Proceedings Int. Conf. on Auditory-Visual Speech Processing (AVSP’08), Norwich, UK, Sept. 2009.
-  F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, “A database of german emotional speech,” in Ninth European Conference on Speech Communication and Technology, 2005.
-  Y. Sun and G. Wen, “Ensemble softmax regression model for speech emotion recognition,” Multimedia Tools and Applications, vol. 76, no. 6, pp. 8305–8328, 2017.
-  Y. Sun, G. Wen, and J. Wang, “Weighted spectral features based on local hu moments for speech emotion recognition,” Biomedical signal processing and control, vol. 18, pp. 80–90, 2015.