"I have vxxx bxx connexxxn!": Facing Packet Loss in Deep Speech Emotion Recognition

05/15/2020
by Mostafa M. Mohamed, et al.

In applications that use emotion recognition via speech, frame-loss can be a severe issue: the audio stream loses some data frames for a variety of reasons, such as low bandwidth. In this contribution, we investigate for the first time the effects of frame-loss on the performance of emotion recognition via speech. Reproducible, extensive experiments are reported on the popular RECOLA corpus, using a state-of-the-art end-to-end deep neural network that mainly consists of convolution blocks and recurrent layers. A simple environment based on a Markov Chain model with two main parameters is used to model the loss mechanism. We explore matched, mismatched, and multi-conditions training settings. As one expects, the matched setting yields the best performance, while the mismatched setting yields the lowest. Furthermore, frame-loss as a data augmentation technique is introduced as a general-purpose strategy to overcome the effects of frame-loss. It can be used during training, and we observed it to produce models that are more robust against frame-loss in run-time environments.



1 Introduction

There is a rise of affective computing applications that predict emotions from speech or other signals, such as images. Such applications depend heavily on the quality of the audio stream to predict emotions correctly. In streaming applications, a variety of factors can lower the quality of the received data, such as a lower data rate and packet loss [buffering], or varying throughput in mobile communication [schmid2019deep]. Any such issue causes drops in the input stream, which can lead to severe degradation of the application's performance. Such degradation can arise for several reasons, for example, because some models depend on the audio context to predict the emotions of succeeding time points, or because they assume the continuity of the input speech. These are typical assumptions made by neural network models like [tzirakis2018, Adieu], owing to the design of recurrent neural networks [deeplearn].

The impact of disturbances on automatic ‘speech emotion recognition’ (SER) has been investigated for speech in the presence of noise [Schuller06-ERI, Weninger11-RON, Pohjalainen16-SAC], reverberation [Schuller11-ASS, Weninger11-RON], and in narrowband transmission [Marchi16-TEO] and coded speech [albin2015objective, Marchi16-TEO]. However, to the authors' best knowledge, no work exists that investigates the impact of packet (or frame) loss on SER. Only a few papers address SER in a VoIP setting [pao2012integration], yet without systematically investigating the impact of packet loss. This seems surprising, given that a main application of SER is found in call centres, and SER is currently finding its way onto mobile phones [Marchi16-RTO]. Packet loss and its impact on speech processing have so far largely been studied in the context of automatic speech recognition [milner2001robust], with later enhancements in [Lotfidereshgi2018].

The main aim of this paper is to examine the effects of such frame-loss on the performance of models for emotion recognition via speech. In addition, we attempt to make such models more robust against frame-loss.

This paper is organised as follows: Section 2 details the approach, Section 3 presents the experimental settings and the results, and Section 4 concludes the paper.

2 Approach

The approach centres on an end-to-end model that predicts emotions, defined along the two dimensions arousal and valence. The model is trained and tested under a variety of settings, simulated by a mechanism modelling lossy environments.

Figure 1: End-to-end model for speech emotion recognition.

2.1 Packet loss generation model

In order to model the lossy and non-lossy packets – or, more precisely, frames in our case – in a given sequence, we adapt a Markov Chain [bishop2006pattern], as shown in Figure 2; we write it as $M(p, q)$, where $p$ is the probability of remaining in the no-loss state and $q$ the probability of remaining in the loss state. This is a standard approach for packet loss modelling [gilbert]; note, however, that more complex models, e. g., with three states, have also been used [milner2004analysis], for example, to model burst behaviour. Other models are reviewed in a recent survey [da2019mac]. Given a sequence of $n$ frames, we can use $M(p, q)$ to sample a binary sequence of length $n$. This is achieved by starting at the start state, then transitioning between the states $s_1$ (for no-loss) and $s_0$ (for loss) based on the transition probabilities $p$ and $q$, until $n$ states have been enumerated. The sampled sequence of states is then transformed directly into a binary string, by replacing $s_1$ by ‘1’ and $s_0$ by ‘0’; a minimal sketch of this sampler is given after Figure 2.

The sampled binary string can be used to select elements from the given sequence: only the frames at positions whose corresponding character is ‘1’ are kept. For example, the binary sequence ‘1011’ applied to the frame sequence $(x_1, x_2, x_3, x_4)$ selects the frames $x_1$, $x_3$, and $x_4$.

Figure 2: Markov Chain that samples a binary sequence, which can be used as a mask for loss/non-loss combinations.
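To make the sampling mechanism concrete, the following is a minimal Python sketch of the two-state chain, assuming (as suggested by Figure 2) that the chain starts in the no-loss state; the function name sample_mask is our own choice.

    import numpy as np

    def sample_mask(n, p, q, rng=None):
        # Sample a binary loss mask of length n from M(p, q):
        # p is the probability of remaining in the no-loss state s1,
        # q is the probability of remaining in the loss state s0.
        rng = rng or np.random.default_rng()
        mask = np.empty(n, dtype=np.int8)
        state = 1  # assumed start: the no-loss state s1
        for i in range(n):
            mask[i] = state
            stay = p if state == 1 else q
            if rng.random() >= stay:  # leave the current state
                state = 1 - state
        return mask

    # Keep only the frames whose mask bit is '1'.
    frames = np.arange(10)
    kept = frames[sample_mask(10, p=0.9, q=0.5).astype(bool)]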

The intuition behind this model is that it can mimic a variety of possibilities. The value of $p$ models the overall stability of the system, in particular, how unlikely it is that a frame-loss error occurs. The value of $q$ models the intensity of frame-loss when it occurs: high values of $q$ correspond to persistent errors that last long. Different combinations of the two correspond to different possibilities, as shown in Table 1. An environment with a low bandwidth can be thought of as having low values for both parameters, which mirrors a scenario of frequent, non-persistent frame-loss issues. If both parameters have high values, this mirrors an environment with a low chance of a persistent breakdown event.

            q low            q high
p high      stable           sudden breakdown
p low       low bandwidth    extremely unstable
Table 1: How different values of p and q may model different environments with different causes of frame-loss.

Furthermore, we need to drop frames from two sequences simultaneously, namely an input audio sequence $x$ and the corresponding sequence of output labels $y$. Even though both span the same duration of time, the sample rate of $x$ is higher than that of $y$. For simplicity, it is assumed that the sample rate of $x$ is a multiple of the sample rate of $y$, with a multiplicative factor $k$. Based on this assumption, if we acquire a binary string from the model to drop the frames of the output labels $y$, then we can construct a mask to drop the corresponding elements of $x$ by repeating each element of the string $k$ times in place. For example, if $k = 3$ and the mask ‘1011’ is used to drop frames from $y$, then $x$ is dropped using ‘111000111111’. This mechanism ensures that the dropped frames correspond to the same time tags. Eventually, the given Markov Chain samples binary strings with an expected fraction of losses [xiao2018packet]:

$$\frac{1 - p}{(1 - p) + (1 - q)} \qquad (1)$$
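As a sketch of the mask expansion and of Equation (1), the snippet below, reusing the sample_mask helper from the earlier sketch, repeats each label-mask bit $k$ times for the audio and checks the expected loss fraction empirically:

    import numpy as np

    k = 3                                  # audio rate / label rate
    label_mask = np.array([1, 0, 1, 1])    # drops frames of y
    audio_mask = np.repeat(label_mask, k)  # '1011' -> '111000111111'

    # For long masks, the fraction of zeros approaches Equation (1):
    # (1 - p) / ((1 - p) + (1 - q)) = 1/6 for p = 0.9, q = 0.5.
    p, q = 0.9, 0.5
    mask = sample_mask(1_000_000, p, q)
    print(1.0 - mask.mean())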

2.2 Dataset

The dataset used in the experiments is the RECOLA dataset [RECOLA]. It consists of 16 training tracks, 15 validation tracks, and 15 test tracks. Each track consists of 5 minutes of audio [RECOLA], recorded at 44.1 kHz, and is labelled across time, with labels collected at a frequency of 25 Hz. Each track features one student participant; the participants have a mean age of 22 years and come from a variety of language backgrounds: 33 French, 8 Italian, 4 German, and 1 Portuguese. In our experiments, the audio tracks are down-sampled to 16 kHz, and the labels are down-sampled by a factor of 5 (from 25 Hz to the model's 5 Hz output rate) using median pooling. Since the labels for the test portion were not freely accessible at the time of the experiments, the validation portion is used for testing.
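A minimal sketch of the label down-sampling via median pooling, assuming a single label dimension and a track length divisible by the factor:

    import numpy as np

    def median_pool(labels, factor=5):
        # Pool non-overlapping windows of `factor` labels to their median.
        return np.median(labels.reshape(-1, factor), axis=1)

    labels_25hz = np.random.rand(7500)     # 5 minutes at 25 Hz
    labels_5hz = median_pool(labels_25hz)  # 1500 values at 5 Hz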

2.3 Model

A model is needed that can recognise emotions from speech, where emotions are defined by the two main dimensions arousal and valence. For this purpose, an end-to-end deep model is used, owing to its simplicity and strong performance. One model architecture is adopted in all the experiments, based on a variant of the model introduced by [tzirakis2018], with slightly different hyperparameters.

The model's architecture is depicted in Figure 1. It starts with a batch normalisation layer [batchnorm], followed by three convolution blocks, then a bidirectional LSTM layer [LSTM], and finally a time-distributed fully-connected layer [FC] with two output features; bidirectional LSTMs have been shown to be effective in ASR [bilstm]. Each of the convolution blocks and the recurrent layer is followed by a dropout layer to reduce overfitting [dropout]. Each convolution block consists of a 1D convolution layer followed by a max-pooling layer. The convolution layers have filter sizes 27, 14, and 3, respectively; the numbers of output channels are 64, 128, and 128, respectively; and the pooling sizes are 40, 20, and 4, respectively. The bidirectional LSTM has 64 output units. The sizes of the pooling layers are chosen to reduce the input sample rate from 16 kHz to an output sample rate of 5 Hz. Accordingly, the convolution layers are padded to preserve the input length, and their filter sizes are chosen to render the overlap rate advised in [tzirakis2018]. The overlap rate is calculated by the formula:

(2)
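The following Keras sketch assembles the architecture described above; the activation functions and the dropout rate are not specified in the text, so the ReLU, tanh, and rate 0.5 used here are assumptions:

    import tensorflow as tf
    from tensorflow.keras import layers, models

    def build_model(input_len=16000 * 20):  # 20-second segments at 16 kHz
        inp = layers.Input(shape=(input_len, 1))
        x = layers.BatchNormalization()(inp)
        # Conv blocks: filter sizes 27/14/3, channels 64/128/128,
        # pooling sizes 40/20/4 (together reducing 16 kHz to 5 Hz).
        for channels, kernel, pool in [(64, 27, 40), (128, 14, 20), (128, 3, 4)]:
            x = layers.Conv1D(channels, kernel, padding="same",
                              activation="relu")(x)  # ReLU is an assumption
            x = layers.MaxPooling1D(pool)(x)
            x = layers.Dropout(0.5)(x)                # rate 0.5 is an assumption
        # 64 units per direction (one reading of the text's "64 output units").
        x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
        x = layers.Dropout(0.5)(x)
        # Two outputs per time step (arousal, valence); tanh keeps them
        # in [-1, 1] (an assumption).
        out = layers.TimeDistributed(layers.Dense(2, activation="tanh"))(x)
        return models.Model(inp, out)

With 20-second inputs at 16 kHz, the pooling factors (40 x 20 x 4 = 3200) yield 100 output steps, i. e., 5 Hz, matching the down-sampled labels.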

During training, the input and output data are segmented into frames of 20 seconds, in order to reduce the time complexity needed by the LSTM layers to operate on long sequences. The training is performed using the Adadelta optimisation algorithm [adadelta] with a fixed learning rate, number of epochs, and mini-batch size. Similar to [tzirakis2018], the loss function used for training maximises the concordance correlation coefficient (CCC) [ccc]. The loss function is $L = 1 - \rho_c$, where $\rho_c$ is the CCC, defined by the formula:

$$\rho_c = \frac{2\,\sigma_{xy}}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2} \qquad (3)$$

where $\sigma_x^2$ and $\sigma_y^2$ are the variances of $x$ and $y$, respectively, $\mu_x$ and $\mu_y$ are the means of $x$ and $y$, respectively, and $\sigma_{xy}$ is the covariance of $x$ and $y$. The loss function applies the CCC along the time dimension of the data, then averages the values across examples and emotion dimensions, in order to ensure that both emotion dimensions are optimised adequately.
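A sketch of this loss in TensorFlow, assuming prediction tensors of shape (batch, time, 2): the CCC is computed along the time axis and the result averaged over examples and emotion dimensions.

    import tensorflow as tf

    def ccc_loss(y_true, y_pred):
        # Means and variances along the time axis.
        mu_t = tf.reduce_mean(y_true, axis=1)
        mu_p = tf.reduce_mean(y_pred, axis=1)
        var_t = tf.math.reduce_variance(y_true, axis=1)
        var_p = tf.math.reduce_variance(y_pred, axis=1)
        cov = tf.reduce_mean((y_true - mu_t[:, None, :]) *
                             (y_pred - mu_p[:, None, :]), axis=1)
        ccc = 2.0 * cov / (var_t + var_p + tf.square(mu_t - mu_p))
        return 1.0 - tf.reduce_mean(ccc)  # L = 1 - CCC, Equation (3)

    # Usage (the exact learning rate, epochs, and batch size are not given):
    # model.compile(optimizer=tf.keras.optimizers.Adadelta(), loss=ccc_loss)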

3 Experiments and Results

Figure 3: CCC scores for arousal and valence compared against different frame-drop rates, for the different training settings.

3.1 Experimental settings

The effects of frame-loss on emotion recognition are investigated under four different settings: matched, mismatched, multi-conditions, and augmentation. The main difference between these settings is the training environment. Table 2 shows the validation CCC scores for all the chosen settings. The testing environment is the same for all of them; it considers several combinations of the two parameters $p$ and $q$. Depending on the chosen values of both parameters and on the training setting, a corresponding model is chosen to be tested using the CCC (Equation 3). The testing is done by applying the frame-loss with the corresponding parameter combination individually to each of the five-minute tracks, then predicting the labels for the remaining frames. The comparison between labels and predictions is then done individually for each emotion dimension, by calculating the CCC on the concatenation of all tracks (since they might have different lengths after applying frame-loss).
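This protocol can be summarised by the following schematic sketch; tracks and model are hypothetical placeholders, and sample_mask follows the earlier sketch:

    import numpy as np

    def ccc(x, y):
        cov = ((x - x.mean()) * (y - y.mean())).mean()
        return 2 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

    K = 3200  # audio samples (16 kHz) per label frame (5 Hz)
    all_true, all_pred = [], []
    for audio, labels in tracks:  # hypothetical iterable of 5-minute tracks
        keep = sample_mask(len(labels), p=0.9, q=0.5).astype(bool)
        lossy_audio = audio[np.repeat(keep, K)]  # drop the matching audio
        preds = model.predict(lossy_audio[None, :, None])[0]  # hypothetical model
        all_true.append(labels[keep])
        all_pred.append(preds)
    # CCC per emotion dimension on the concatenation of all tracks.
    y, y_hat = np.concatenate(all_true), np.concatenate(all_pred)
    scores = [ccc(y[:, d], y_hat[:, d]) for d in range(2)]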

3.1.1 Mismatched training

In the mismatched setting, the training is run on clean data without any application of frame-loss, and the same model is used for all test combinations.

3.1.2 Multi-conditions training

In the multi-conditions training setting, for each training batch, the two values $p$ and $q$ are sampled uniformly from their respective ranges. Then, a frame-loss mask is sampled accordingly using the Markov Chain $M(p, q)$, and the sampled mask is used to drop frames for all the examples in the batch. Only one model is trained in this setting, and it is used for all test combinations. During sampling, $p$ is clipped from below to prevent an extremely high loss of training data, which would degrade the training quality severely.
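A sketch of the per-batch sampling; the ranges and the clipping threshold below are hypothetical placeholders, since the exact values are not given here:

    import numpy as np

    rng = np.random.default_rng()
    P_RANGE, Q_RANGE, P_MIN = (0.5, 1.0), (0.0, 1.0), 0.6  # assumptions

    def sample_batch_mask(n_frames):
        p = max(rng.uniform(*P_RANGE), P_MIN)  # clip p from below
        q = rng.uniform(*Q_RANGE)
        return sample_mask(n_frames, p, q)     # reuses the earlier sketch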

3.1.3 Matched training

The training environment in the matched setting relies on partial multi-conditions training, because there are many test combinations of the two parameters $p$ and $q$, and it would be impractical to train a model for each of those combinations. Consequently, the values are clustered into three categories: low, medium, and high. Using these categories, there are nine combinations for which models are trained. In each combination, values for both $p$ and $q$ are sampled uniformly for each batch from the corresponding categories' ranges. As in the multi-conditions setting, a mask is generated with the introduced Markov Chain according to the sampled values of $p$ and $q$, to drop the frames of the whole batch; $p$ is again clipped from below to prevent severe degradation of the training quality. Nevertheless, some residual degradation is visible in the last row of Table 2. During testing, depending on the categories in which the testing values of $p$ and $q$ lie, the model with the matching category combination is chosen.

3.1.4 Augmentation training

In this setting, one of the models from the matched training setting is used, namely the one trained with a high $p$ and a low $q$, i. e., a low drop-rate. This one model is then used for all test combinations. The setup is similar to the multi-conditions setup, with one key difference: the model used for testing. The main aim of this setting is to examine the effectiveness of frame-loss as a data augmentation technique [augmentation], which can be used during training to improve the results or to make the model more robust in degraded run-time environments.

setting         p      q      arousal  valence
mis             -      -      .789     .529
[tzirakis2018]  -      -      .815     .502
multi           -      -      .630     .366
match           high   mid    .797     .542
match/aug       high   low    .769     .503
match           mid    low    .736     .501
match           high   high   .729     .489
match           mid    high   .702     .452
match           mid    mid    .701     .425
match           low    low    .662     .426
match           low    mid    .650     .405
match           low    high   .430     .176
Table 2: CCC scores on validation data (without any frame-loss) for the different training settings; the p and q columns give the parameter categories used during training, where low, mid, and high denote value ranges for p and q. The result of [tzirakis2018] is shown in the second row.

3.2 Results

Figure 4: CCC scores for valence and arousal, for the matched, mismatched, and multi-conditions settings. $p$ is the probability of remaining in a non-loss state; $q$ is the probability of remaining in a loss state.

Figure 3 compares the scores along a single dimension, namely the ratio of dropped frames after applying the frame-loss. The test results are shown in detail in Figure 4, where the different combinations of valence/arousal and the three training settings (matched, mismatched, and multi-conditions) are examined.

According to Figure 3, the matched setting generally has the best overall performance, while the mismatched setting has the worst. The performance of the matched setting is expected, since the model is trained on data that is the most similar to the test data, in comparison to the other settings. In addition, for low drop-rates, the multi-conditions setting tends to have the worst performance, while the matched and mismatched settings are more or less on par.

The previous results were the main motivation to examine the augmentation setting, which tries to combine the advantages of the mismatched and multi-conditions settings, without matching the training and testing. In this case, one model is trained with parameters that cause a low drop-rate. The aim is to achieve the high performance of the matched setting for low drop-rates, while retaining some of the high performance of the multi-conditions setting at high drop-rates. The results in Figure 3 show that this is indeed the case: the augmentation setting achieves nearly the same performance as the matched setting for low drop-rates, while making some improvement over the mismatched setting for higher drop-rates.

Given the results of the different settings, one strategy to overcome frame-loss effects is to match the training environment to the deployment environment. Where such matching is hard to perform, frame-loss data augmentation can serve as a general-purpose technique. For environments with particularly severe degradation of the audio quality, training in the multi-conditions setting can be used instead.

4 Conclusions

In this paper, the effects of frame-loss on the performance of automatic speech emotion recognition were examined. A Markov Chain model was utilised to model environments with frame-loss, in which an audio stream can lose data packets during transmission. For this examination, an end-to-end deep model, consisting mainly of convolution blocks and recurrent layers, was used, and the RECOLA dataset was chosen for the experiments.

The experiments covered three main settings: matched, mismatched, and multi-conditions. In all of the settings, the models were tested with a variety of frame-loss conditions, while the training was the crucial difference between the settings. In the mismatched setting, the model was trained on clean data. In the matched setting, a variety of models were trained based on low, mid, or high values of the parameters. In the multi-conditions setting, one model was trained using a mixture of all parameter combinations.

The results have shown that the matched setting had the best overall performance, while the mismatched setting had the worst. The multi-conditions setting was on par with the matched setting for heavily lossy data, but was the worst on data with a low frame-loss rate. On the other hand, the matched and mismatched settings performed on par for data with low frame-loss rates.

An additional setting was examined to test a general-purpose solution for the frame-loss problem, namely training with frame-loss as a data augmentation mechanism, using parameters that lead to low frame-loss rates. Augmentation has been shown to be a compromise strategy that combines the advantages of the mismatched and multi-conditions settings, without matching the training to the test environments. It performs on par with the matched setting at low frame-loss rates, while improving over the mismatched setting at high frame-loss rates.

Future work should investigate the use of Packet Loss Concealment (PLC) methods [rodbro2006hidden] in the context of SER, instead of classical PLC [hmminterspeech]. This could include recent deep learning approaches, including ones from the image processing domain [athar2018latent] originally tailored to occlusion restoration, as it has repeatedly been shown that audio can well be modelled as an ‘image’ using the spectrogram or related representations [Cummins17-AID].

References