Packet Loss Concealment (PLC) is any technique that attempts to mitigate the effects of lost or overly delayed packets, a common problem in speech transmission over VoIP [voip]. Packet loss can degrade the performance of many speech processing systems that assume a complete speech signal is transmitted, including Speech Emotion Recognition (SER). A variety of classical techniques attempt to solve the packet loss problem, for example using Hidden Markov Models (HMMs) [hmminterspeech] and Linear Predictive Coding (LPC) [schuller2013]; there are also encoding-based techniques [codec]. However, in the era of deep learning and the rise of a variety of generative networks, such as sequential generative Recurrent Neural Networks [graves] and Generative Adversarial Networks (GANs) [GANS], generating data in place of the lost packets is a promising avenue for advanced concealment techniques. There exist studies [Lotfidereshgi2018]
that attempt to solve packet loss in the context of Automatic Speech Recognition (ASR). However, to the authors' best knowledge, there are no studies addressing PLC in the context of Speech Emotion Recognition. In an earlier work, the authors investigated techniques to train end-to-end SER models to be robust in the presence of frame loss [mywork]. Here, we instead attempt to address PLC directly. Furthermore, this problem can occur on a variety of devices, including mobile devices [Marchi16-RTO]. Providing a neural network that performs PLC end-to-end is favourable, because it is easier to embed into different applications without the need for extra processing. More importantly, hardware optimised for neural network processing is on the rise [smartphones]; hence, an end-to-end PLC solution would be the most suitable for future hardware. The contributions of this paper are providing such an end-to-end PLC neural network (which we call ConcealNet) and examining the effects of using it on SER in lossy environments.
The paper is organised as follows: Section 2 reviews existing techniques that are also used for PLC with different models or settings. Section 3 presents the main approach. The experiments and evaluations are discussed in Section 4. Finally, Section 5 summarises and concludes the paper.
2 Related work
A PLC approach is proposed in [PLC2016], which relies on features representing speech data. The concealment is performed at the feature level and then decoded, rather than being executed on the actual speech directly. The approach was realised in the context of enhancing Automatic Speech Recognition (ASR). Based on [PLC2016], [Lotfidereshgi2018] implemented a PLC algorithm that operates directly on the speech data using LSTM-based neural networks. They also applied it to ASR, evaluating on the TIMIT dataset [TIMIT]. The main advantage of both approaches is that they can be applied in a frame-by-frame fashion, which is suitable for real-time application on losses of small packets. Additionally, they have the potential to be extended to a neural end-to-end PLC. GAN-based approaches are utilised in [GAN, acousticinpainting], where the generator adopts an architecture similar to an auto-encoder. These models use long audio segments (around 3 200 ms) to make predictions, which is longer than a typical packet size in VoIP of around 10-20 ms [voip, PLC2016]. Such a setup is most effective for offline processing and for long losses. In [xiao2018packet], an approach faces PLC not in the context of speech, but rather of the data transmitted from pose tracking sensors. The authors also chose LSTM-based RNNs, while using a two-state Markov Chain for packet loss injection. [khan2018] considers Cartesian Genetic Programming [Miller2011] for signal reconstruction in an abstract setting. Another approach, [mack2019deep], attempts to reconstruct STFT signals under a variety of deformations, such as destructive interference and packet loss; however, it does not directly address PLC in particular.
3.1 Recurrent generative modelling
Given an input sequence $\mathbf{x} = (x_1, x_2, \dots)$, where $s_{t-1}$ denotes the previous state representing the whole preceding sequence, a recurrent neural cell $C$ is an operation that computes an output $y_t$ and a next state $s_t$, i.e. $(y_t, s_t) = C(x_t, s_{t-1})$ [deeplearn]. Two effective and commonly used recurrent cells are the gated recurrent cells LSTM [LSTM] and GRU [GRU]. Additionally, $C$ here could also refer to a stack of recurrent cells, such that at each time step, each cell takes its input from the output of the preceding cell at the same time step.
As shown in [graves], generative Recurrent Neural Networks (RNNs) can be used to generate data by training the cells to predict elements of sequences from the preceding elements. Inspired by this, we train a similar model $G$ as a regression task instead of a classification task, to enable an RNN to generate audio segments. $G$ will later be used to conceal packet loss by generating audio segments for the lost packets.
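As an illustration of this training setup, next-segment regression pairs can be formed by shifting the segment sequence by one step, as in this minimal sketch (the helper name `next_segment_pairs` is ours, not from the paper):

```python
import numpy as np

def next_segment_pairs(audio, seg_len=100):
    """Split audio into fixed-length segments and form (input, target)
    pairs, where each target is the segment that follows its input."""
    n = len(audio) // seg_len
    segs = audio[: n * seg_len].reshape(n, seg_len)
    return segs[:-1], segs[1:]  # inputs x_1..x_{T-1}, targets x_2..x_T
```

Training $G$ then amounts to regressing each target segment from the running input sequence.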
Training. We train the generative RNN $G$ using the input speech segments $\mathbf{x} = (x_1, \dots, x_{T-1})$ and the predicted output $\hat{\mathbf{y}} = G(\mathbf{x})$, which is a sequence $(\hat{y}_1, \dots, \hat{y}_{T-1})$, by concatenating the predicted segments and comparing them to the concatenation of the shifted sequence $\mathbf{y} = (x_2, \dots, x_T)$ as the corresponding ground truth. This optimises the loss function $\mathcal{L} = 1 - \rho_c(\hat{\mathbf{y}}, \mathbf{y})$, where $\rho_c$ is the concordance correlation coefficient (CCC), which measures data reproducibility [ccc], given by:
$$\rho_c = \frac{2\,s_{xy}}{s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2},$$
where $\bar{x}, \bar{y}$ are the means, $s_x^2, s_y^2$ are the variances, and $s_{xy}$ is the covariance of $x$ and $y$.
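A minimal NumPy sketch of the CCC and the resulting loss (helper names are our own illustration):

```python
import numpy as np

def ccc(x, y):
    """Concordance correlation coefficient between two 1-D signals."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()  # population covariance
    return 2 * cov / (x.var() + y.var() + (mx - my) ** 2)

def ccc_loss(pred, target):
    """1 - CCC, so that perfect reproduction gives zero loss."""
    return 1.0 - ccc(pred, target)
```

A perfect prediction yields a CCC of 1 and thus a loss of 0.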
Stressed training. The models might need to generate several consecutive frames in environments of severe packet loss. We can enhance this by adopting a stressed training scheme: we compose $G$ three times, obtaining $\hat{\mathbf{y}}^{(1)} = G(\mathbf{x})$ as before, in addition to $\hat{\mathbf{y}}^{(2)} = G(\hat{\mathbf{y}}^{(1)})$ and $\hat{\mathbf{y}}^{(3)} = G(\hat{\mathbf{y}}^{(2)})$. Consequently, we optimise:
$$\mathcal{L}_{\text{stress}} = \sum_{i=1}^{3}\left(1 - \rho_c\big(\hat{\mathbf{y}}^{(i)}, \mathbf{y}^{(i)}\big)\right),$$
where $\mathbf{y}^{(i)}$ is the ground truth shifted $i$ steps ahead. This can be generalised to a deeper composition; however, it becomes more expensive to train.
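The stressed scheme can be sketched as follows, with `generate` standing in for the trained RNN $G$, `targets` holding the ground truth at each composition depth, and `ccc` a concordance-correlation function; all names are our own illustration:

```python
def stressed_loss(generate, x, targets, ccc, depth=3):
    """Compose the generator on its own predictions and sum the
    1 - CCC losses over all composition depths (default 3)."""
    loss, pred = 0.0, x
    for i in range(depth):
        pred = generate(pred)  # feed the previous prediction back in
        loss += 1.0 - ccc(pred.ravel(), targets[i].ravel())
    return loss
```

A deeper composition simply increases `depth`, at the cost of more forward passes per training step.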
Data processing. The model processes speech with a sliding window of segments of duration 6.25 ms (corresponding to an array of length 100 at a 16 kHz sample rate). This is close to a typical packet duration of 10-20 ms [voip, PLC2016]. This length is a compromise between two concerns: the speed of inference on one hand, and the number of trainable parameters and the generalisation ability on the other. Smaller segment durations need much longer training and inference times, because a very long sequence is processed linearly without parallelisation [lstmspeed]. Bigger segments oblige the model to have more parameters, which is more difficult to train. Furthermore, during training, the tracks are split into chunks of length 20 s to allow fast training.
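The segment arithmetic above can be made explicit with a small helper (ours, for illustration):

```python
def segment_params(sample_rate=16_000, seg_ms=6.25, chunk_s=20):
    """Segment length in samples, and segments per training chunk."""
    seg_len = int(sample_rate * seg_ms / 1000)      # 100 samples at 16 kHz
    segs_per_chunk = int(chunk_s * 1000 / seg_ms)   # segments in a 20 s chunk
    return seg_len, segs_per_chunk
```

At 16 kHz, each 6.25 ms segment holds 100 samples, and a 20 s training chunk contains 3 200 segments.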
The hyperparameter space is explored using BOHB [BOHB], a state-of-the-art tool for hyperparameter optimisation. The best architecture it discovered is shown in Table 1. The training is performed with an Adam optimiser [Adam] using the stressed training loss in Equation 2. To speed up training, we first train for a number of epochs without the stressed training, then continue with the stressed training; a lower learning rate is used for the stressed phase. A learning-rate decay is used, in addition to dropout layers [dropout] inserted after each LSTM layer during training, to reduce overfitting.
3.2.1 Recurrent concealment cells
Given an input sequence $\mathbf{x} = (x_1, \dots, x_T)$ with a binary mask $\mathbf{m} = (m_1, \dots, m_T)$, where $m_t = 1$ only if $x_t$ is not lost and $m_t = 0$ otherwise, and given a generative recurrent cell $G$ with output value $\hat{y}_t$ that estimates the next element of the input sequence, namely $\hat{y}_t \approx x_{t+1}$, we introduce a wrapper concealment recurrent cell $C$ that uses $G$ to fix $\mathbf{x}$. The input of $C$ at time $t$ is $(x_t, m_t)$, and its previous state is $(s_{t-1}, \hat{y}_{t-1})$. One step of the concealment cell is executed according to the equations:
$$z_t = m_t\, x_t + (1 - m_t)\, \hat{y}_{t-1},$$
$$(\hat{y}_t, s_t) = G(z_t, s_{t-1}).$$
The initial state is given by $(s_0, \hat{y}_0)$, where $\hat{y}_0 = p$ is a default-response vector, used in case the initial frames were lost.
The value of $z_t$ will be the same as $x_t$ if $m_t = 1$ (non-lost frame). Otherwise, it will be the generated value $\hat{y}_{t-1}$, which is the predicted element according to the cell $G$. After that, the predicted next element $\hat{y}_t$ is computed using $G$, in addition to the state $s_t$ that will be used in the next time step. A visual demonstration of how this cell operates is depicted in Figure 1.
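One step of the wrapper cell, sketched in plain Python with `generator_step` standing in for the cell $G$ (names are our illustration, not the paper's code):

```python
def conceal_step(generator_step, x_t, m_t, state, y_prev):
    """One concealment-cell step: keep the real frame when received
    (m_t == 1), otherwise substitute the generator's prediction from
    the previous step; then advance the generator on the fixed frame."""
    z_t = x_t if m_t == 1 else y_prev            # fixed (concealed) frame
    y_t, new_state = generator_step(z_t, state)  # predict the next frame
    return z_t, y_t, new_state
```

Unrolling this step over a masked sequence yields both the concealed signal $z_t$ and the running predictions used to fill future losses.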
This cell behaves similarly to the PLC algorithm in [Lotfidereshgi2018]. However, formulating it as a neural operation allows models to perform PLC end-to-end, from corrupt raw audio to concealed raw audio. It can also be embedded to make end-to-end inference from lossy speech to emotions directly (or for other tasks).
3.2.2 End-to-end PLC inference
Putting together the aforementioned components, we extract the recurrent cells and Fully Connected layers [FC] from the generative RNN trained in Section 3.1. The extracted cells are then stacked to form one cell $G$, which we wrap using the concealment wrapper introduced in Subsection 3.2.1. Consequently, we construct an end-to-end PLC inference RNN (which we call ConcealNet) that takes the input sequence $\mathbf{x}$ and a corresponding loss mask $\mathbf{m}$ to predict a fixed signal $\mathbf{z}$. The resulting fixed signal is the audio after applying PLC, where lost segments are concealed and non-lost segments are copied.
Bidirectional concealment. Bidirectional RNNs have shown promising improvements in ASR [bilstm], which motivates us to introduce a bidirectional variant, assuming non-causal processing is an option, e.g. with a small buffer or in post-hoc application. If we train a backwards generative network and apply the same ConcealNet architecture to it (by reversing the input and output sequences), the results of the two networks (forward and backward) can be merged by averaging to obtain a simple bidirectional variant of ConcealNet. This variant tends to perform better in general. However, its main disadvantage is that it cannot be used in real-time settings, because it assumes knowledge of future context.
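Assuming the forward and backward concealment networks are available as callables, the averaging merge could look like the following sketch (function names are our own):

```python
import numpy as np

def bidirectional_conceal(forward_net, backward_net, x, mask):
    """Merge forward and backward concealment by averaging. The
    backward pass runs on the time-reversed signal and mask, and its
    output is reversed back before averaging with the forward pass."""
    fwd = forward_net(x, mask)
    bwd = backward_net(x[::-1], mask[::-1])[::-1]
    return 0.5 * (fwd + bwd)
```

The reversal ensures both passes align in time before their outputs are averaged sample-wise.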
3.3 Dataset
The dataset used in the experiments is the RECOLA dataset [RECOLA]. It consists of 16 training tracks and 15 validation tracks. Each track consists of 5 minutes of audio [RECOLA], which we downsampled to 16 kHz. Each track is labelled with emotions across time; the labels were collected at a frequency of 25 Hz, which we reduced to 5 Hz using median pooling. Emotions are represented by two main dimensions, namely arousal and valence.
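The 25 Hz to 5 Hz label reduction can be sketched as non-overlapping median pooling (our own illustration):

```python
import numpy as np

def median_pool_labels(labels, factor=5):
    """Reduce the label rate (e.g. 25 Hz -> 5 Hz) by taking the median
    over non-overlapping windows of `factor` consecutive frames."""
    n = len(labels) // factor
    return np.median(labels[: n * factor].reshape(n, factor), axis=1)
```

Each second of 25 Hz labels thus collapses to five pooled values, smoothing annotation noise.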
3.4 Emotion model
For emotion prediction, we use a state-of-the-art end-to-end model [tzirakis2018] that predicts emotions from raw audio. The model predicts two dimensions across time, namely arousal and valence. The architecture we use consists of 3 convolution blocks, followed by 2 LSTM layers of 85 units, then a Fully Connected layer of 65 units, and a final output layer. Each convolution block consists of a convolution layer with 47 output channels followed by max pooling, with kernel and pooling sizes chosen per block for the 3 blocks respectively.
This emotion model is appended to the ConcealNet presented in Subsection 3.2.2 to make end-to-end predictions of emotions from speech with lost packets.
3.5 Packet loss generation
To simulate the behaviour of lossy and non-lossy packets in a given sequence, we adopt the two-state Markov Chain shown in Figure 2. [gilbert] has shown this to be an effective approach for packet loss modelling; other models exist, like a three-state model [milner2004analysis] for burst behaviour, and [da2019mac] reviews further models. Given a sequence of $T$ frames, we sample a binary mask $\mathbf{m}$ by starting at the no-loss state, then transitioning between the no-loss and loss states based on the transition probabilities until $T$ states are enumerated. The sampled sequence of states is directly transformed into the mask $\mathbf{m}$.
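A sketch of sampling such a mask from a two-state (Gilbert) model; the probability names `p_loss` and `p_recover` are our own, and the paper's actual transition probabilities are not reproduced here:

```python
import random

def sample_loss_mask(n, p_loss=0.1, p_recover=0.5, seed=None):
    """Sample a binary mask from a two-state Gilbert model: from the
    no-loss state, move to loss with probability p_loss; from the loss
    state, return to no-loss with probability p_recover.
    mask[t] == 1 means frame t is received, 0 means it is lost."""
    rng = random.Random(seed)
    mask, lost = [], False  # start in the no-loss state
    for _ in range(n):
        lost = (rng.random() < p_loss) if not lost else (rng.random() >= p_recover)
        mask.append(0 if lost else 1)
    return mask
```

Larger `p_loss` makes losses more frequent, while smaller `p_recover` makes loss bursts longer.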
3.6.1 0-substitution concealment
This is a simple baseline, which replaces all the lost values with one constant value, namely 0 [perkins1998survey]. Even though this baseline is very simple, it serves the purpose of showing how important solving the concealment problem is and how far it can be improved.
3.6.2 Linear interpolation
Linear interpolation is a technique which conceals a lost segment using a linear function joining the last point before the loss and the first point after the loss, predicting the lost values in between [interpolation]. For a loss spanning positions $a < t < b$, with $x_a$ and $x_b$ received, the concealed values are given by:
$$\hat{x}_t = x_a + \frac{t - a}{b - a}\,(x_b - x_a).$$
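A compact sketch of this baseline using `numpy.interp`, which performs exactly this piecewise-linear fill between received samples (the function name is ours):

```python
import numpy as np

def linear_interp_conceal(x, mask):
    """Conceal lost samples (mask == 0) by linearly interpolating
    between the nearest received samples on either side. Losses at
    the edges are filled with the nearest received value."""
    x = np.asarray(x, dtype=float)
    good = np.flatnonzero(mask)  # indices of received samples
    return np.interp(np.arange(len(x)), good, x[good])
```

`np.interp` clamps to the endpoint values outside the received range, which handles losses at the start or end of the sequence.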
4 Experiments and Results
In order to evaluate different methods for Packet Loss Concealment (PLC), we first use input signals $\mathbf{x}$ with a corresponding loss mask $\mathbf{m}$ (sampled by the two-state Markov Chain) and corresponding ground truth emotion labels $\mathbf{y}$. Then, we use the end-to-end model to acquire the concealed signal $\mathbf{z}$ and the corresponding predicted emotion labels $\hat{\mathbf{y}}$. Consequently, we compare $\mathbf{z}$ against $\mathbf{x}$ to examine the quality of the concealment, and $\hat{\mathbf{y}}$ against $\mathbf{y}$ to examine the quality of the emotion prediction after PLC. In all scenarios, we use the CCC [ccc] as the comparison metric, since it measures data reproducibility. The results on the quality of the concealment of the audio data are shown in Table 2, while Table 3 shows the results of the emotion predictions after concealment. Finally, the results of the stressed training scheme are shown in Table 4.
Both versions of ConcealNet perform much better in all scenarios of emotion prediction. Especially when the loss-burst probability is not high, both versions of ConcealNet show only a small drop in emotion prediction, even when the overall frame drop rate is up to 64 %. For audio concealment, the bidirectional ConcealNet has the best performance, followed by the forward ConcealNet. The results degrade, however, in one scenario: when the loss-burst probability is high, the 0-substitution baseline achieves the best results in loss concealment.
The scenario with a high loss-burst probability is the one where ConcealNet experiences relatively long losses and is expected to recover many consecutive segments for each loss occurrence, which is extremely challenging. However, we observe that the stressed training manages to mitigate this problem, as shown by the improvements in Table 4, where the stress-trained models generally perform better, especially in the scenarios with more losses.
5 Conclusions
In this paper, a concealment RNN (ConcealNet) was introduced. It consists of two main components: a stacked generative recurrent cell, trained to predict elements of sequences given the preceding elements, and a wrapper for such a stacked cell. The wrapped recurrent cell can be used as a recurrent layer that, given an input sequence $\mathbf{x}$ and a corresponding binary mask marking losses, outputs a concealed sequence $\mathbf{z}$. A generative RNN consisting of two LSTM layers was trained for use by ConcealNet, in addition to an emotion model connected to ConcealNet to conceal audio and predict emotions end-to-end. A stressed training scheme was introduced to improve the performance of ConcealNet on long-term losses. Furthermore, the proposed ConcealNet was used in two variants, one processing the sequence forwards and the other processing it bidirectionally by averaging forward and backward passes. The fully reproducible experiments on the popular RECOLA continuous emotion database have shown that the proposed ConcealNet achieves considerably good results in scenarios without overly long losses, even when losses are frequent. In environments with short packet losses, after using ConcealNet, the degradation of speech emotion prediction is minor: for arousal, the CCC dropped from 76.93 % to 75.99 %, while for valence, it dropped from 43.18 % to 39.81 %. The bidirectional variant of ConcealNet achieves even better results. The scenario with long packet losses has been shown to be the most challenging, as one may expect; however, the introduced stressed training technique mitigates this issue and has shown an improvement of the results.
Future work can consider the usage of attention mechanisms and the introduction of generative approaches such as variants of generative adversarial topologies or variational solutions.