INTERSPEECH’s Computational Paralinguistics Challenges (ComParE) have regularly introduced the speech community to exciting new challenges since 2009 (http://www.compare.openaudio.eu/). These challenges, set up as prediction tasks, focus on extracting important speaker-related information from the audio signal. In more than a decade of ComParE, researchers have come up with innovative solutions to these challenges.
These efforts can broadly be divided into two categories: feature-engineering based approaches [Schmitt2016BOAW, Eyben2016GeMAPS, Freitag2017AuDeep, Amiriparian2017DeepSpect, Montacie2018, Gosztolya2018, Gosztolya2019, Wu2019, Carbonneau2020Feature]
and deep learning-based end-to-end approaches [0001SSA18, Zhao2018DeepSpect, ElsnerLRMI19, DBLP:journals/corr/abs-1802-01115, Fritsch2020RawCNN]. Feature-engineering approaches have concentrated on extracting task-specific features to be utilized by classifiers for prediction. End-to-end approaches, on the other hand, have focused on applying complex neural network architectures to bypass feature engineering. There might not be a clear winner between these two approaches [0001SSA18], but combining them has emerged as a trend. In such hybrid systems, Deep Neural Networks (DNNs) are used only to extract useful features automatically, and a simple classifier is trained on these features afterwards. Two such systems, auDeep [Freitag2017AuDeep] and DeepSpectrum [Amiriparian2017DeepSpect], are already part of this year’s baseline method.
For end-to-end (E2E) approaches, prior work has concentrated on applying single-model systems for the prediction tasks [GosztolyaBGT17, 0001SSA18, ElsnerLRMI19, DBLP:journals/corr/abs-1802-01115]. A single neural network, being a non-linear model, can have high variance and thus produce unstable results. Moreover, prior work has concentrated on applying the same architecture across different tasks. Using a one-size-fits-all policy for different ComParE tasks ignores task-specific requirements that can be exploited in model design [0001SSA18]. Hence, we study the application of E2E models to obtain a more robust but task-specific solution. We build ensemble-based E2E systems to obtain robust results across the different ComParE 2020 tasks. Utilizing multiple models instead of one yields better performance across all the tasks. In addition, we study task-specific requirements and explore incorporating them into the E2E solution.
ComParE 2020 poses three new challenges [compare2020] to the community: 1) the breathing sub-challenge to predict the output signal of a respiratory belt worn by the speaker, 2) the elderly sub-challenge to classify the arousal (A) and valence (V) level of elderly speakers and 3) the mask sub-challenge to predict if the speaker wears a mask or not while they speak.
For the breathing sub-challenge, the E2E baseline system optimizes Pearson’s coefficient, which is also the task metric, to solve a regression problem. However, E2E predictions do not match the scale of the ground truth. To alleviate this scale issue, we study multi-loss strategies for our E2E model, where we optimize Pearson’s coefficient along with mean-squared-error. The elderly sub-challenge entails learning two closely related tasks (arousal and valence level prediction), so as a natural choice, we explore multi-task learning. The sub-challenge also faces the issue of imbalanced class data and we apply sampling schemes to augment the data to reduce the class imbalance. In the mask sub-challenge, our single E2E model performs better than the best baseline result. On further investigation, we analyze the trained models to understand which frequency bands hold the largest importance. Our findings lead us to create low-frequency band features. A fusion of the baseline with our E2E models, including these features, results in a substantial performance gain.
We also include our plans to explore and expand our presented experiments. We hope to include these results in an extended version of this article.
In this section, we describe the end-to-end system usage in an ensemble learning scheme. We also present task-specific modifications to capture task requirements in the end-to-end system. Keep in mind that this paper describes our solutions for a competition, so we broke with the tradition of using only a few techniques. Instead, we used several to get the best results. Still, we did our best to measure the impact of each modification on the development data and tested only the best ones.
2.1 End-to-end learning
End-to-end learning is an emerging paradigm within deep learning. Researchers across various fields have adopted this paradigm, supported by the availability of large datasets and powerful computational resources. In theory, end-to-end systems replace traditional pipeline-based solutions with a single deep neural network. They allow the complete model to be trained in a single optimization step. They also promise to bypass the laborious feature-engineering step by having a single system solve every aspect of the prediction problem. In practice, however, these systems are built on top of existing features. The advantages of this paradigm make it an attractive choice for ComParE tasks.
In our experiments, we employ the same DNN model architecture for the elderly and mask sub-challenges. For the breathing sub-challenge, we use a different DNN architecture, based on the baseline system, as a starting point for further development. We describe the details of these end-to-end model architectures in Section 3.1. Our models process either spectral input features or, in the case of the breathing task, raw audio signals. The DNNs can then be optimized directly to perform the given task. This single-model approach allowed us to quickly adapt the general framework to the specifics of each sub-challenge.
2.2 Ensemble learning
DNNs are known to be sensitive to the random initialization, and our experiments also confirm this. This issue is especially severe if the amount of training data is limited, which is usually the case for paralinguistic tasks. A solution to this problem is applying ensemble learning. We train several differently initialized DNNs and then combine their predictions to get stable and even better results.
Here, we employ a variant of bootstrap aggregation (bagging). Originally, bagging trains each model using only a random subset of the training data to produce diverse systems. As the training data is already limited, we decide to use all available data during training and rely on random initialization and data shuffling to produce a diverse set of DNNs. In the combination step, we average the outputs of the differently initialized DNNs to make the final prediction.
Ensemble learning can also be performed with other approaches. For our mask sub-challenge experiments, we perform an equal-weighted, soft-voting-based combination of baseline prediction systems, such as Support Vector Machines (SVM), with our ensemble DNNs.
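The two combination steps described above (averaging the outputs of differently initialized DNNs, then equal-weight soft voting with a baseline classifier) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names and all posterior values are hypothetical, and it assumes every model outputs class posteriors for the same utterances in the same order.

```python
from typing import List, Sequence

Posteriors = List[List[float]]  # one row per utterance, one column per class


def average_posteriors(model_outputs: Sequence[Posteriors]) -> Posteriors:
    """Average class posteriors over several models with equal weights."""
    n_models = len(model_outputs)
    n_items = len(model_outputs[0])
    n_classes = len(model_outputs[0][0])
    return [
        [sum(m[i][c] for m in model_outputs) / n_models for c in range(n_classes)]
        for i in range(n_items)
    ]


def predict_labels(posteriors: Posteriors) -> List[int]:
    """Pick the most probable class index for every utterance."""
    return [max(range(len(row)), key=row.__getitem__) for row in posteriors]


# Hypothetical posteriors from three differently initialized DNNs.
dnn_runs = [
    [[0.6, 0.4], [0.3, 0.7]],
    [[0.4, 0.6], [0.2, 0.8]],
    [[0.8, 0.2], [0.4, 0.6]],
]
dnn_ensemble = average_posteriors(dnn_runs)

# Equal-weight soft voting between the DNN ensemble and a baseline classifier.
svm_posteriors = [[0.3, 0.7], [0.1, 0.9]]  # hypothetical SVM outputs
fused = average_posteriors([dnn_ensemble, svm_posteriors])
labels = predict_labels(fused)
```

In this toy case the second DNN disagrees with the other two on the first utterance, but the averaged ensemble still settles on one decision, which is the stabilizing effect the ensemble is meant to provide.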
2.3 Multiple objectives
Training an end-to-end system does not have to be restricted to using a single loss function. Often multiple losses are taken into consideration to focus on multiple aspects of the prediction problem. This technique also helps regularize training.
For the breathing sub-challenge, the end-to-end baseline system is trained with a correlation-based loss. However, this loss does not constrain the outputs to the same scale as the labels. To match the scale of the outputs to that of the labels, we combine the correlation loss with the mean squared error (MSE), which can help regularize the end-to-end baseline system.
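The scale issue can be made concrete with a small sketch. Pearson's correlation is invariant to shifting and scaling of the prediction, so a correlation-only loss cannot penalize a wrong output scale, whereas an added MSE term can. The exact form of the combined loss below, `(1 - corr) + mse_weight * MSE`, is an assumption made for illustration; the text only states that the two losses are combined. The function names and toy signals are hypothetical.

```python
import math
from typing import Sequence


def pearson(pred: Sequence[float], target: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multi_loss(pred, target, mse_weight: float = 0.1) -> float:
    """Assumed combination: correlation loss plus a weighted MSE regularizer."""
    return (1.0 - pearson(pred, target)) + mse_weight * mse(pred, target)


target = [0.0, 1.0, 0.0, -1.0]              # toy belt signal
matched = [0.0, 1.0, 0.0, -1.0]             # right shape, right scale
rescaled = [5.0 * t + 2.0 for t in target]  # right shape, wrong scale

corr_only_rescaled = 1.0 - pearson(rescaled, target)  # blind to the scale error
combined_matched = multi_loss(matched, target)
combined_rescaled = multi_loss(rescaled, target)      # MSE term penalizes the scale
```

Both predictions are perfectly correlated with the target, so only the combined loss separates them.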
2.4 Multi-task learning
Multi-task learning trains a single model to perform multiple tasks simultaneously. Recent work [Latif2020Multitask] has shown the benefits of using this scheme for paralinguistic tasks. Intuitively, the unified model of multi-task learning acts as a form of data augmentation by sharing information relevant for one task with the other. This intuition is especially relevant in the case of the elderly sub-challenge. The arousal and valence levels are two related dimensions that describe the emotional experience of the speaker. Thus, we experiment with a single end-to-end model trained to predict the arousal and valence levels in a joint framework.
2.5 Resampling strategies for multitask learning
In the elderly sub-challenge, we observe a class imbalance problem. Over-represented classes are a common issue in paralinguistic data [GosztolyaBGT17]. To address the data imbalance, we choose two sampling techniques: upsampling and probabilistic sampling [GosztolyaBGT17]. Upsampling is a simple method that repeats the underrepresented examples until the data becomes balanced. Probabilistic sampling takes a more principled approach. It defines a desired class distribution and, during training, selects examples in such a way that the overall distribution of the training data fits the desired one. This desired distribution is a linear combination of the original class distribution and a uniform one, weighted by a pair of mixing coefficients.
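A minimal single-task sketch of probabilistic sampling is given below. It assumes that the two mixing coefficients are `lam` for the uniform distribution and `1 - lam` for the original one, so that `lam = 0` keeps the original class distribution and `lam = 1` gives a fully balanced one. This assignment, the name `lam`, the value 0.5, and the toy labels are all illustrative assumptions.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence


def target_distribution(labels: Sequence[Hashable], lam: float) -> Dict[Hashable, float]:
    """Mix the empirical class prior with a uniform distribution."""
    counts = Counter(labels)
    n_items = len(labels)
    n_classes = len(counts)
    return {
        c: lam / n_classes + (1.0 - lam) * counts[c] / n_items
        for c in counts
    }


def sample_indices(labels: Sequence[Hashable], lam: float, n_samples: int,
                   rng: random.Random) -> List[int]:
    """Draw a class from the target distribution, then an example of that class."""
    dist = target_distribution(labels, lam)
    by_class = defaultdict(list)
    for idx, c in enumerate(labels):
        by_class[c].append(idx)
    classes = list(dist)
    weights = [dist[c] for c in classes]
    drawn = rng.choices(classes, weights=weights, k=n_samples)
    return [rng.choice(by_class[c]) for c in drawn]


# Toy imbalanced label set (hypothetical).
labels = ["low"] * 2 + ["medium"] * 6 + ["high"] * 2
dist = target_distribution(labels, lam=0.5)
batch = sample_indices(labels, lam=0.5, n_samples=8, rng=random.Random(0))
```

With `lam = 0.5`, the over-represented "medium" class drops from a prior of 0.6 to a sampling probability of roughly 0.47, halfway towards the uniform value of one third.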
These resampling methods are easy to use; however, we had to adapt them to work in a multi-task setup. To upsample, we grouped the training examples by label pair and resampled so that each group would have the same amount of training data. Although this adaptation does not guarantee balanced data for each individual task, in practice it works quite well, as shown in Section 3.3. A similar modification can be applied when using probabilistic sampling in a multi-task setting. First, we generate the desired distribution for each task. Then, during training, we select a label pair that fits these distributions and use a training instance that has those labels.
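The multi-task upsampling step can be sketched as below: examples are grouped by their (arousal, valence) label pair, and every group is padded with random repeats until it matches the largest group. Growing each group to the size of the largest one and picking the repeats at random are assumptions made for this sketch; the function name and toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence, Tuple


def upsample_label_pairs(label_pairs: Sequence[Tuple[Hashable, Hashable]],
                         rng: random.Random) -> List[int]:
    """Return example indices in which every label pair is equally frequent."""
    groups = defaultdict(list)
    for idx, pair in enumerate(label_pairs):
        groups[pair].append(idx)
    target_size = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(members)  # keep every original example once
        # Pad smaller groups with randomly chosen repeats.
        balanced.extend(rng.choices(members, k=target_size - len(members)))
    rng.shuffle(balanced)
    return balanced


# Toy (arousal, valence) label pairs (hypothetical).
pairs = [("low", "medium")] * 4 + [("high", "medium")] * 2 + [("medium", "high")]
balanced = upsample_label_pairs(pairs, random.Random(0))
group_sizes = Counter(pairs[i] for i in balanced)
```

Note that balancing the pairs does not balance each task on its own: in the toy data, valence "medium" still occurs in two of the three groups and therefore remains twice as frequent as "high".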
2.6 Low-frequency features for mask sub-challenge
For the mask sub-challenge, we hypothesize that wearing a mask changes the resonance conditions in the vocal tract, as the mask might reflect some of the frequencies back into the tract [phonation_speech, Mel_phonation]. To test this hypothesis, we look at the gradients of the output with respect to the inputs and plot them per input frequency band in Figure 1. We notice that the end-to-end models have large gradients for the ten lowest frequency bands. Based on this observation, we compute features that focus on low-frequency information. Specifically, we extract Mel-spectrogram features with 200 filter-banks and then use the ten lowest filter-banks as input features, which we refer to as lowest-10-features.
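The two operations in this paragraph can be sketched on plain nested lists. The sketch assumes the spectrogram and the gradient matrix are stored with one row per Mel band, ordered from the lowest to the highest band, and one column per frame. Summarizing the gradients by their mean absolute value per band is an assumption, as are the function names and toy values.

```python
from typing import List

Matrix = List[List[float]]  # rows: Mel bands (lowest first), columns: frames


def band_importance(gradients: Matrix) -> List[float]:
    """Mean absolute input gradient per frequency band (assumed summary)."""
    return [sum(abs(g) for g in band) / len(band) for band in gradients]


def lowest_bands(mel_spectrogram: Matrix, n_bands: int = 10) -> Matrix:
    """Keep only the n_bands lowest Mel filter-bank rows."""
    return mel_spectrogram[:n_bands]


# Toy gradients for three bands and two frames (hypothetical values).
toy_gradients = [[1.0, -1.0], [0.5, -0.5], [0.1, 0.1]]
importance = band_importance(toy_gradients)

# Toy 200-band spectrogram with three frames; row b holds the value b.
toy_mel = [[float(band)] * 3 for band in range(200)]
lowest_10 = lowest_bands(toy_mel)
```

The resulting `lowest_10` matrix has ten rows and the original number of frames, and would take the place of the full 200-band input.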
As a pre-processing step for extracting these features, we also examine enhancing the lower frequencies by manipulating the input audio. We apply a low-frequency enhancing scheme that pre-emphasizes the audio and then passes it through a fifth-order low-pass Butterworth filter [oppenheimBook]; we denote this scheme as preemphasis+butterworth. Such schemes can allow the Mel-spectrogram to better represent the information relevant for this task, which depends on the low-frequency bands.
3 Experiments and results
3.1 Experimental setup
For the elderly and mask sub-challenges, we extract Mel-spectrograms from the audio files as inputs, in a similar fashion to the auDeep [Freitag2017AuDeep] pipeline. For the breathing sub-challenge, in contrast, we input the raw audio directly, as it leads to better results than Mel-spectrograms.
In our experiments, we use two different end-to-end systems. For the elderly and mask sub-challenges, the spectral input is first processed by a 1D convolutional layer with 100 neurons; a recurrent layer containing 100 LSTM cells then accumulates the outputs of the filters. We pass the outputs of the recurrent layer to a feedforward layer (100 rectified-linear units) and then apply a classification layer. In the multi-task experiments, we split the structure after the LSTM layer, passing the recurrent layer output to a separate set of hidden and output layers for each task. For the breathing sub-challenge, we opted for the same structure as the best baseline system; for details, see [compare2020].
For all tasks, we use ensemble learning. For the mask sub-challenge, we obtained the best results using 50 models, while for the other tasks, ten models were enough to reach the peak performance. After training the individual models, we averaged their output to create the final predictions.
For evaluation on the test set, we train our models on the combined training and development set. We note that the ComParE challenge restricts the number of submissions per team and task to five evaluations on the test set. As the competition is ongoing, we have used only a few of the available submissions to check the best systems so far. This limitation implies that we cannot test all of our methods. In the result tables, we use a question mark (?) to indicate the solutions not yet evaluated on the test data.
3.2 Breathing sub-challenge
On this task, we used the end-to-end baseline system for further development, as this system performs quite well [compare2020]. However, it suffers from a mismatch between the scale of the end-to-end predictions and that of the output labels. To alleviate this mismatch, we apply a multi-loss scheme that uses an MSE-based loss to regularize the baseline correlation loss.
|Avg. per DNN||Best DNN|
In Table 1, we compare single-loss and multi-loss strategies. The single-loss models use either Pearson’s correlation (corr) or the MSE. The multi-loss strategy combines the two losses (corr+MSE) with a regularization weight of 0.1. The correlation-based E2E model (E2E-corr) performs best when averaging the correlation values of ten randomly initialized DNNs. The best single result belongs to the corr+MSE-based E2E model; however, its averaged results are lower, which suggests that this value is unreliable. We suspect that further tuning of the regularization weight is required, and we hope to complete this analysis as part of our future work.
|System (loss function)||Dev (corr)||Dev (MSE)||Test (corr)|
|Baseline (E2E) [compare2020]||.507||1.682||.731|
In Table 2, we present the results of 10-model ensembles to compare with the baseline performance. Even though the baseline system produced high correlation values, it had the highest MSE value. Combining the predictions of 10 models (corr) reduced the MSE significantly and outperformed the baseline results. Using the MSE as the loss function gave the lowest correlation but, naturally, produced the lowest MSE. Lastly, the correlation of the multi-loss ensemble (E2E-corr+MSE) drops in comparison to the E2E-corr ensemble because of the MSE regularization. However, in terms of MSE, it is much better. On the evaluation set, E2E-corr+MSE was also slightly worse than the E2E-corr ensemble in terms of overall correlation. Nevertheless, our E2E-corr ensemble outperforms the baseline result and shows an absolute improvement of 2.6 correlation points over it.
3.3 Elderly sub-challenge
The elderly task presents a prediction problem with class imbalance. For valence prediction, 44 out of the 87 stories have a medium-valence label. Upon inspecting some of our initial models, we observe that the output prediction favours the over-represented classes. To cope with these issues, we apply the resampling methods described in section 2.5.
Table 3 presents the ensemble E2E models evaluated on the elderly sub-challenge. Applying the sampling techniques improves the performance significantly in each case. For arousal, upsampling showed more benefit than probabilistic sampling (Table 3). In contrast, probabilistic sampling was very beneficial for the valence sub-task. The multi-task models consistently outperformed the single-task ones. For the two best systems, we also checked the performance of the individual DNNs and saw that ensemble learning is essential for good performance. On average, a single multi-task DNN yielded 38.4%/36.4% (A/V) with upsampling and 36.6%/38.5% with probabilistic sampling.
Unfortunately, the test results are below the official baseline. The considerable difference between the scores on the development and test data suggests that our model overfits when we train the evaluation system on the combined training and development set.
|System||Dev (A/V)||Test (A/V)|
|Baseline (linguistic) [compare2020]||40.6/49.2||44.0/49.0|
|E2E (single task)||35.0/39.7||?|
|E2E (single task + upsampl.)||39.8/41.5||?|
|E2E (multitask + upsampl.)||42.9/42.4||38.0/39.5|
|E2E (single task + prob. sampl.)||35.6/39.6||?|
|E2E (multitask + prob. sampl.)||40.0/45.5||45.8/34.8|
We also suspect that there is a significant mismatch between the development and test data in this sub-challenge. Strong evidence for this can be found in the baseline paper [compare2020], where the test performance does not correlate with the scores achieved on the development set. The official acoustic baseline model (DeepSpectrum+SVM) produces almost the worst results on the development set, and the difference between its development and test scores is large. This observation suggests that tuning parameters on the development data might not yield the best model for the evaluation set.
3.4 Mask sub-challenge
Training a 50-model ensemble, we saw that the averaged prediction significantly outperformed both our single E2E model and the best individual baseline system (auDeep-fused). Our individual E2E models achieved 66.0% UAR on average, but their combination reached 68.0% (E2E). The best individual baseline uses auDeep-based features in an SVM system. Our E2E ensemble outperformed this model on both the development and test sets, as shown in Table 4. However, our E2E ensemble is outperformed by the fusion of the best baseline models, which is an SVM based on auDeep-fused, bag-of-audio-words, openSMILE and DeepSpectrum features [compare2020].
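UAR, the metric quoted above, stands for unweighted average recall, that is, the mean of the per-class recalls. A minimal sketch of its computation follows; the function name and the toy labels are hypothetical.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def unweighted_average_recall(y_true: Sequence[Hashable],
                              y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recalls; every class counts equally."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy imbalanced example: always predicting the majority class.
y_true = ["mask"] * 8 + ["clear"] * 2
y_pred = ["mask"] * 10
uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In the toy example, plain accuracy is 0.8 while UAR is only 0.5, which is why UAR is better suited to tasks with imbalanced classes.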
In Section 2.6, we observed that the lower-frequency bands of the audio hold important information for the mask sub-challenge. Based on this observation, we applied preemphasis+butterworth to the input audio and then extracted the lowest-10-features to build an E2E ensemble (E2E lowest-10-features). This ensemble model outperformed the E2E ensemble built with only the preprocessed input audio (E2E preemphasis+butterworth) but was worse than our vanilla ensemble.
Combining the regular ensemble (E2E) with E2E lowest-10-features fared better, resulting in a slight improvement over the vanilla ensemble. We combined this model with predictions from SVMs trained on bag-of-audio-word features (BoAW-fused) and DeepSpectrum-resnet50 features (E2E+lowest-10-feats+baseline) via soft voting with equal weights. The combined system achieved our best result on the development set and, on the test set, improved on the fusion-based baseline by 3.8% UAR absolute (75.6% vs. 71.8%).
|System||Dev||Test|
|auDeep-fused (baseline) [compare2020]||64.4||66.6|
|Fusion of the bests (baseline) [compare2020]||–||71.8|
|E2E + E2E lowest-10-features||68.6||?|
|E2E + E2E lowest-10-features + baseline||70.2||75.6|
4 Future work
For the breathing sub-challenge, we observe that regularizing with MSE can help alleviate the mismatch of scales between the output and the labels. However, the regularized model still lags behind the regular E2E model in performance. To study this effect, we plan to explore other regularization schemes to obtain a better balance between mismatch issues and performance.
On the elderly sub-challenge, our current system overfits the training and development data and performs poorly on the test set. We plan to investigate this effect further and apply regularization schemes to reduce the overfitting. Another limitation of our current system is that it is trained to classify short segments of the stories, and the decisions made for these segments are then merged with a soft voting method. Instead, we could concatenate the audio files of the same story and classify them directly, as our E2E architecture allows us to use arbitrarily long inputs.
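The segment-to-story merging used by the current system can be sketched as follows. The sketch assumes that soft voting means averaging the class posteriors of all segments of a story with equal weights and then taking the most probable class. The function name, story identifiers, and posterior values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def story_level_predictions(segment_story_ids: Sequence[Hashable],
                            segment_posteriors: Sequence[Sequence[float]]
                            ) -> Dict[Hashable, int]:
    """Average segment posteriors per story and return the winning class index."""
    sums: Dict[Hashable, List[float]] = {}
    counts = defaultdict(int)
    for story, posterior in zip(segment_story_ids, segment_posteriors):
        if story not in sums:
            sums[story] = [0.0] * len(posterior)
        sums[story] = [s + p for s, p in zip(sums[story], posterior)]
        counts[story] += 1
    decisions = {}
    for story, total in sums.items():
        mean = [s / counts[story] for s in total]
        decisions[story] = max(range(len(mean)), key=mean.__getitem__)
    return decisions


# Two segments of story "s1" and one segment of story "s2" (toy posteriors
# over three classes, e.g. low / medium / high).
story_ids = ["s1", "s1", "s2"]
posteriors = [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7], [0.2, 0.5, 0.3]]
decisions = story_level_predictions(story_ids, posteriors)
```

For story "s1", the two segments disagree, and the averaged posterior (0.35, 0.25, 0.40) decides in favour of the third class.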
For the mask sub-challenge, we currently combine predictions from separate models for the vanilla and lowest-10-features E2E scenarios. In contrast with this late fusion, we plan to explore early and intermediate fusion of features to better exploit the information present in these spectrograms. Our lowest-10-features approach naïvely extracts the ten lowest frequency bands to use as features for the E2E model. Instead, we plan to develop specialized low-frequency features to aid learning by the E2E model. In the mask sub-challenge, we observe that combining our E2E ensembles with the baseline systems achieves the best result. We plan to explore similar combinations for both the breathing and elderly sub-challenges.
We presented Aalto’s E2E ensemble solution for the three INTERSPEECH 2020 ComParE tasks. In our study, the ensemble E2E models achieved better performance than the individual E2E models on average. We also proposed task-specific modifications for the underlying E2E models. We studied modifications based on multi-task learning, re-sampling of training data for multi-task scenarios, and feature engineering based on the initial E2E ensemble models. Our best models showed absolute improvements upon the competitive baselines for the breathing and mask sub-challenges by 2.8% and 3.8%, respectively. Overall, our paper showcased the benefits of using an ensemble of E2E models and task-specific modifications for computational paralinguistic tasks.
We thank Antonia Hamilton and Alexis Macintyre for granting us access to a subset of the UCL Speech Breath Monitoring (UCL-SBM) database used in the Breathing Sub-Challenge. This work was supported by the Academy of Finland (grants 312490, 329267) and the Kone Foundation. Aalto ScienceIT provided the computational resources.