The expression of emotions in speech sounds and the corresponding ability to perceive such emotions are both fundamental aspects of human communication (mauss2009measures; latif2021survey). Understanding vocal bursts and non-linguistic vocalizations, which are central to the expression of emotion (scherer2017computational), has been overlooked in the field of computer audition. In this respect, the ExVo challenge (BairdExVo2022) presents an excellent opportunity to take advantage of recent advances in affective computing to study the automatic recognition of emotions through vocalizations.
In supervised learning, data augmentation (Yang20ARC; Triantafyllopoulos21DSC) is typically applied to enlarge the data set in order to prevent overfitting and improve the generalization of the models. Self-supervised learning (SSL) pursues the same goal by different means: it attempts to learn generic representations from massive amounts of data without human annotations (liu2022audio). Because no labels are available during upstream training, SSL methods create pseudo-labels for the training data through data augmentation (zbontar2021barlow) or define ‘positive’ and ‘negative’ pairings (chen2020simclr; baevski2020wav2vec); recent advances suggest that results competitive with supervised learning can be obtained in this way.
Redundancy reduction is a principle first proposed in neuroscience by H. B. Barlow in 1961 (barlow1961possible). The principle hypothesizes that the goal of sensory processing is to recode highly redundant sensory inputs into a factorial code with statistically independent components, and it has successfully explained the organization of the visual system (barlow2001redundancy). It has led to several works on supervised and unsupervised learning (balle2016end; deco1997non). Following that principle, this work is mainly inspired by (zbontar2021barlow; barlow1961possible; goyal2019scaling): the original data and its augmented version are fed into the same network, and their outputs should be highly similar, differing only slightly.
Several works have shown that combining different loss functions is beneficial when training deep neural networks (bentaieb2017uncertainty; babaeizadeh2018adjustable), as this allows heterogeneous information to be incorporated. When training with multiple losses, a common approach to balancing them is to minimize a weighted sum of the individual losses (dosovitskiy2019you; liebel2018auxiliary).
The key contributions of our work are two-fold:
We present a simple but effective redundancy reduction method that generates two distorted views of every sample in the batch and then reduces redundancy in the feature map by operating on the cross-correlation matrix of the resulting embeddings.
We apply a novel weighted loss strategy that updates the loss weights during training while restraining their sum, in order to find the best weight for each loss.
The remainder of our paper is organized as follows. We present our Redundancy Reduction Twins network architecture and Restrained Uncertainty Weight Loss strategy in Section 2. In Section 3 and Section 4, we describe our experimental setting and results respectively. Finally, we conclude with a brief summary and outline the future directions.
In this section, we propose the Redundancy Reduction Twins Network (RRTN), a redundancy reduction training framework which applies a pair of identical networks to measure the similarity of distorted versions of a sample and makes the cross-correlation matrix between them as close to the identity matrix as possible. In addition, we also apply a loss strategy named Restrained Uncertainty Weight Loss (RUWL), which dynamically adjusts the weights of different losses and keeps the sum of the weights close to a constant.
2.1 Redundancy Reduction Twins Network
Figure 1 shows the architecture of our proposed Redundancy Reduction Twins Network (RRTN) training framework. We employ the original input sample and its augmented form as the network’s input data. Following Figure 1, we refer to the output of the encoder as the ‘representations’ and to the output of the projector as the ‘embeddings’. The representations are used for the regression task and the embeddings are fed to the Barlow Twins loss function.
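The data flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the names `rrtn_forward`, `encoder`, `projector`, and `regressor` are hypothetical, the separate regression head is an assumption, and the toy callables stand in for real networks.

```python
def rrtn_forward(x, x_aug, encoder, projector, regressor):
    """Pass an input and its distorted view through the same modules.

    All three callables are shared between the two views, which is what
    makes the two branches "twins". The names are hypothetical.
    """
    rep, rep_aug = encoder(x), encoder(x_aug)            # 'representations'
    emb, emb_aug = projector(rep), projector(rep_aug)    # 'embeddings'
    pred, pred_aug = regressor(rep), regressor(rep_aug)  # regression outputs
    return pred, pred_aug, emb, emb_aug


# Toy stand-ins for the real networks, only to show the wiring.
toy_encoder = lambda v: [2 * t for t in v]
toy_projector = lambda r: [t + 1 for t in r]
toy_regressor = lambda r: sum(r)

outputs = rrtn_forward([1, 2], [1, 3], toy_encoder, toy_projector, toy_regressor)
print(outputs)
```

In this reading, the two predictions would feed the regression losses and the two embeddings would feed the Barlow Twins loss.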
For our best approach, which we discuss in Section 4, the encoder is a CNN14 network (Kong19-PANNS) without the classification layer, which yields 2048 output units. The subsequent projector differs from that of the original Barlow Twins network (zbontar2021barlow), which contains 3 linear layers with 8192 units each. Our projector includes only 1 linear layer with 2048 units, as our method does not benefit from very high-dimensional embeddings. As Equation 2 shows, the Barlow Twins loss is a sum over the elements of a matrix; when the embedding dimension rises, the loss value can become thousands of times larger than the regression loss, which makes joint optimization difficult.
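To make the scaling argument concrete, the following sketch counts how many terms a loss summed over a square cross-correlation matrix contains for the two embedding sizes mentioned above. It is plain arithmetic; the helper name `bt_term_counts` is hypothetical.

```python
def bt_term_counts(dim):
    """Return (diagonal, off-diagonal) term counts of a dim x dim matrix."""
    diagonal = dim
    off_diagonal = dim * (dim - 1)
    return diagonal, off_diagonal


for dim in (2048, 8192):
    diag, off = bt_term_counts(dim)
    print(f"dim={dim}: {diag} diagonal terms, {off} off-diagonal terms")
```

Going from 2048 to 8192 units multiplies the number of off-diagonal terms by roughly 16, which is one way to see why a smaller projector keeps the loss magnitudes closer together.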
2.2 Loss functions
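Concordance Correlation Coefficient Loss (CCC Loss): the CCC measures the concordance between the prediction ($\hat{y}$) and the ground truth ($y$). This statistic quantifies the agreement between these two measures of the same variable. The CCC loss is defined as
$$\mathcal{L}_{CCC} = 1 - \frac{2\sigma_{\hat{y}y}}{\sigma_{\hat{y}}^2 + \sigma_{y}^2 + (\mu_{\hat{y}} - \mu_{y})^2} \tag{1}$$
where $\mu_{\hat{y}}$ and $\mu_{y}$ are the means, $\sigma_{\hat{y}}^2$ and $\sigma_{y}^2$ the variances, and $\sigma_{\hat{y}y}$ the covariance of the prediction and the ground truth. In the considered emotion recognition task, ten emotion labels are provided, and we use the CCC defined and provided by the ExVo challenge (BairdExVo2022), which is the mean over the emotion dimensions: $\overline{CCC} = \frac{1}{10}\sum_{k=1}^{10} CCC_k$.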
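As a rough illustration of the metric described above, the sketch below computes the CCC from population statistics and averages it over the emotion dimensions. The function names are hypothetical, and taking the loss as one minus the CCC is an assumption (a common convention), not something confirmed by the challenge code.

```python
def ccc(pred, gold):
    """Concordance correlation coefficient of two equal-length sequences."""
    n = len(pred)
    mu_p, mu_g = sum(pred) / n, sum(gold) / n
    var_p = sum((p - mu_p) ** 2 for p in pred) / n
    var_g = sum((g - mu_g) ** 2 for g in gold) / n
    cov = sum((p - mu_p) * (g - mu_g) for p, g in zip(pred, gold)) / n
    # The denominator is zero only if both sequences are the same constant.
    return 2 * cov / (var_p + var_g + (mu_p - mu_g) ** 2)


def mean_ccc(pred_rows, gold_rows):
    """Average the CCC over output dimensions (rows are samples)."""
    dims = len(pred_rows[0])
    per_dim = [
        ccc([row[d] for row in pred_rows], [row[d] for row in gold_rows])
        for d in range(dims)
    ]
    return sum(per_dim) / dims


def ccc_loss(pred_rows, gold_rows):
    """Assumed loss form: one minus the mean CCC."""
    return 1.0 - mean_ccc(pred_rows, gold_rows)
```

A perfect prediction gives a CCC of 1 and therefore a loss of 0 under this assumed form, while a perfectly anti-correlated prediction gives a CCC of -1.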
Barlow Twins Loss (BT Loss): this loss is designed to maximize the similarity between the embedding vectors while reducing the redundancy between their components. The BT loss is defined as
$$\mathcal{L}_{BT} = \sum_{i} (1 - \mathcal{C}_{ii})^2 + \lambda \sum_{i} \sum_{j \neq i} \mathcal{C}_{ij}^2 \tag{2}$$
in which $\lambda$ is a positive constant that weighs the importance of the first and second terms of the loss function and is kept fixed in this work. $\mathcal{C}$ is the cross-correlation matrix computed along the batch dimension between the embeddings $z^{A}$ and $z^{B}$ of the two network inputs, which is
$$\mathcal{C}_{ij} = \frac{\sum_{b} z^{A}_{b,i}\, z^{B}_{b,j}}{\sqrt{\sum_{b} (z^{A}_{b,i})^2}\, \sqrt{\sum_{b} (z^{B}_{b,j})^2}} \tag{3}$$
where $b$ is the batch sample index and $i$, $j$ index the vector dimension of the network output. Both terms in the definition serve a distinct purpose. The first term in Equation 2 tries to equate the diagonal elements of the cross-correlation matrix to 1, which makes the embedding invariant to the distortions applied, while the second term, which is called the redundancy reduction term, decorrelates the different vector components of the embedding by trying to equate the off-diagonal elements of the cross-correlation matrix to 0.
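The two terms described above can be sketched in plain Python, following the formulation of the cited Barlow Twins work. The function names are hypothetical, the embeddings are mean-centred along the batch before correlating (an assumption taken from that formulation), and the value of `lam` in the usage lines is illustrative only, not the value used in this work.

```python
import math


def cross_correlation(z_a, z_b):
    """Cross-correlation matrix of two batches of embeddings.

    z_a and z_b are lists of rows (one row per sample). Every column is
    assumed to have non-zero variance, otherwise the norm is zero.
    """
    n, d = len(z_a), len(z_a[0])

    def center(z):
        means = [sum(row[i] for row in z) / n for i in range(d)]
        return [[row[i] - means[i] for i in range(d)] for row in z]

    a, b = center(z_a), center(z_b)
    norm_a = [math.sqrt(sum(row[i] ** 2 for row in a)) for i in range(d)]
    norm_b = [math.sqrt(sum(row[j] ** 2 for row in b)) for j in range(d)]
    return [
        [sum(a[k][i] * b[k][j] for k in range(n)) / (norm_a[i] * norm_b[j])
         for j in range(d)]
        for i in range(d)
    ]


def barlow_twins_loss(z_a, z_b, lam):
    """Invariance term plus lam times the redundancy reduction term."""
    c = cross_correlation(z_a, z_b)
    d = len(c)
    on_diag = sum((1.0 - c[i][i]) ** 2 for i in range(d))
    off_diag = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return on_diag + lam * off_diag


z = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
z_swapped = [[row[1], row[0]] for row in z]
print(barlow_twins_loss(z, z, lam=0.5))          # identical, decorrelated views
print(barlow_twins_loss(z, z_swapped, lam=0.5))  # components swapped
```

With identical and already decorrelated views, the cross-correlation matrix is the identity and both terms vanish; swapping the two components moves all correlation off the diagonal, so both terms are penalized.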
2.3 Restrained Uncertainty Weight Loss (RUWL)
From Equation 1, the CCC loss is bounded and of order 1, since the CCC lies in [-1, 1]. Meanwhile, as Equation 2 shows, the BT loss is a sum over the elements of a matrix. For the high-dimensional embedding used in our work, which has 2048 dimensions, the value of the BT loss can be much higher than that of the CCC loss.
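As shown in Figure 1, in order to optimise the network with three losses, we combine them in a weighted sum, which is
$$\mathcal{L} = w_1 \mathcal{L}_{CCC} + w_2 \mathcal{L}'_{CCC} + w_3 \mathcal{L}_{BT}$$
where $w_i$ is the weight of the $i$-th loss and $\mathcal{L}'_{CCC}$ is the CCC loss of the distorted input sample.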
Since each of the contributing loss functions may behave differently, weighting each loss appropriately is essential. We apply the weighted loss function described in (liebel2018auxiliary) to tune the weights automatically. Furthermore, to prevent the network from constantly shrinking the weights to reduce the loss without learning anything about the task, we propose a Restrained Uncertainty Weight Loss for updating the weights, which is
where The final loss for our RRTN framework is defined as
where a is the parameter that dynamically updates the weights as the network is updated, and is a constant for each loss. Based on our prior knowledge, in our work, we set , while initializing the parameters .
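The general idea of this section can be sketched as follows. This is not the RUWL formula itself: the uncertainty weighting follows the form given in the cited work (liebel2018auxiliary), while the restraint term, written here as the absolute difference between the sum of the weights and a target constant, is purely an assumption made for illustration. All names (`sigmas`, `target_sum`, and the functions) are hypothetical.

```python
import math


def uncertainty_weights(sigmas):
    """Per-loss weights 1 / (2 * sigma^2), as in the cited weighting scheme."""
    return [1.0 / (2.0 * s * s) for s in sigmas]


def restrained_weighted_loss(losses, sigmas, target_sum):
    """Weighted sum of losses with a regulariser and an assumed restraint.

    The last term is an assumption: it penalises the distance between the
    sum of the weights and a constant, so that shrinking every weight is
    no longer a free way to reduce the total loss.
    """
    weights = uncertainty_weights(sigmas)
    weighted = sum(w * l for w, l in zip(weights, losses))
    regulariser = sum(math.log(1.0 + s * s) for s in sigmas)
    restraint = abs(target_sum - sum(weights))
    return weighted + regulariser + restraint


# Illustrative values only: two regression losses and one much larger loss.
example_losses = [0.4, 0.5, 10.0]
total = restrained_weighted_loss(example_losses, [1.0, 1.0, 1.0], target_sum=1.5)
print(total)
```

In a real training loop the `sigmas` would be learnable parameters updated together with the network; they are plain floats here to keep the sketch self-contained.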
|Encoder||# Param (M)||Mean CCC||Relative Local Gain (%)||Relative Global Gain (%)|
|ResNet 18 (baseline)||11.18||0.613||-||-|
|ResNet 18 RUWL||12.23||0.628||2.4||2.4|
|ResNet 18 RUWL||12.23||0.634||3.6||3.6|
|CNN10 RRTN RUWL||6.00||0.667||1.8||8.8|
|CNN10 RRTN RUWL||6.00||0.668||2.0||9.0|
|CNN14 RRTN RUWL||83.95||0.674||4.2||10.0|
|CNN14 RRTN RUWL||83.95||0.678||4.8||10.6|
The Hume Vocal Bursts dataset (HUME-VB) (Cowen2022HumeVB) is a large-scale dataset of self-recorded emotional non-linguistic vocalizations (vocal bursts), which contains a total of 2,270 minutes of audio data from 1,702 speakers aged from 20 to 39 years. We use the original release of processed .wav files, not the follow-up release of unprocessed ones.
In the database, each vocal burst has been labeled in terms of the intensity of ten different expressed emotions. Each emotion’s intensity ratings were normalized to the range [0, 1]. The audio files were normalized to -3 decibels and converted to 16 kHz, 16 bit, mono. Finally, the official data have been split into three equally sized sets: training, validation, and test. The ten emotion labels in the dataset are Amusement, Awe, Awkwardness, Distress, Excitement, Fear, Horror, Sadness, Surprise, and Triumph.
3.2 Experiment settings
As input features, we use 64-dimensional log Mel-spectrograms with a 32 ms window length and a 10 ms frame shift. To unify the shape of the input data, we crop or pad the time dimension to 250 frames, so that each input to the RRTN has 250 time frames and 64 Mel bins. Each input sample is augmented by SpecAugment (Park_spec_2019) to produce the distorted views shown in Figure 1. We set SpecAugment to a time drop width of 64, 2 time stripes, a frequency drop width of 8, and 2 frequency stripes.
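The preprocessing described above can be sketched in plain Python. This is a minimal illustration rather than the implementation used in this work: cropping from the start, zero-padding at the end, the mask value of 0, and all function names are assumptions. The stripe widths are drawn uniformly below the configured drop width, in the style of SpecAugment.

```python
import random

SAMPLE_RATE = 16_000                        # Hz, from the dataset description
WINDOW_SAMPLES = SAMPLE_RATE * 32 // 1000   # 32 ms window
HOP_SAMPLES = SAMPLE_RATE * 10 // 1000      # 10 ms frame shift


def crop_or_pad(frames, target_len=250, pad_value=0.0):
    """Return exactly target_len frames (crop the tail or pad the end)."""
    n_mels = len(frames[0])
    out = [list(f) for f in frames[:target_len]]
    while len(out) < target_len:
        out.append([pad_value] * n_mels)
    return out


def mask_stripes(frames, time_drop_width=64, time_stripes=2,
                 freq_drop_width=8, freq_stripes=2, rng=None):
    """Zero out random time and frequency stripes on a copy of frames."""
    rng = rng or random.Random()
    out = [list(f) for f in frames]
    n_frames, n_mels = len(out), len(out[0])
    for _ in range(time_stripes):
        width = rng.randrange(0, time_drop_width)
        start = rng.randrange(0, n_frames - width)
        for t in range(start, start + width):
            out[t] = [0.0] * n_mels
    for _ in range(freq_stripes):
        width = rng.randrange(0, freq_drop_width)
        start = rng.randrange(0, n_mels - width)
        for row in out:
            for m in range(start, start + width):
                row[m] = 0.0
    return out


clip = [[1.0] * 64 for _ in range(300)]     # dummy log-Mel frames
fixed = crop_or_pad(clip)
distorted = mask_stripes(fixed, rng=random.Random(0))
print(WINDOW_SAMPLES, HOP_SAMPLES, len(fixed), len(fixed[0]))
```

At 16 kHz the window and shift correspond to 512 and 160 samples, so 250 frames cover roughly 2.5 seconds of audio.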
4 Results and Discussion
Our experimental results for the multi-output regression of the vocal emotion task are presented in Table 2. We follow the original network implementations described in (he2016deep; Kong19-PANNS) to build ResNet 18 and CNN 10/CNN 14.
As shown in Table 2, we use the ResNet 18 result as the global baseline. We fit each network into the RRTN framework; in terms of global gain, our method with CNN 14 obtains the highest improvement, with a CCC of 0.678, which is a 10.6 % increase. Further experiments on the RUWL reveal that applying a weighted loss is a simple and effective way to improve the prediction ability of the network.
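The global gain quoted above can be checked with a short calculation, assuming the gain is the relative increase over the reference score. The helper name `relative_gain` is hypothetical; the two scores are taken from Table 2.

```python
def relative_gain(score, reference):
    """Relative improvement of score over reference, in percent."""
    return (score - reference) / reference * 100.0


baseline = 0.613   # ResNet 18 (baseline)
best = 0.678       # best CNN14 result
print(f"{relative_gain(best, baseline):.1f} %")   # 10.6 %
```

The local gain would follow the same formula with the result of the unmodified encoder as the reference instead of the global baseline.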
By applying the RRTN training framework, there is always a gain in the CCC compared to the outcome of the original network, which we dub the local gain. However, the improvement obtained with RRTN varies considerably across encoders: CNN 14 improves by up to 4.8 %, while CNN 10 improves by only up to 2 %. CNN 10 and CNN 14 share the same network structure, consisting of 4 and 6 convolutional blocks, respectively. Each convolutional block consists of 2 convolutional layers with a kernel size of 3 × 3. The goal of the RRTN is to improve the generalizability of the encoder’s features by maximizing the similarity and eliminating the redundant information between pairs of samples. A deeper and wider network appears to be more conducive to the joint optimization strategy. In our work, RRTN only introduces a single linear layer as the projector, which does not significantly increase the number of parameters of the network but may lack representational power. Further experiments should be conducted to find the optimal number of linear layers in the projector.
Overall, most of the performance gain appears to come from the introduction of the RRTN framework, while RUWL adds a marginal gain on top of it. This suggests that minimizing redundant information in the features is the most important contribution of our proposal, while the automatic weighting of the different losses is less critical.
In this work, we proposed the Redundancy Reduction Twins Network (RRTN), a lightweight and effective training framework. The cross-correlation matrix was used to quantify the similarity between the outputs of the same network fed with distorted versions of a sample, and it is made as close to the identity matrix as possible to maximize similarity. Since the final loss is the sum of multiple weighted losses, we applied the Restrained Uncertainty Weight Loss for joint training to find the best weight for each loss. Our experimental results demonstrate the validity of our proposed methods.
Moreover, we believe that further refinements of the proposed loss function and framework could lead to more efficient solutions and better performance.
This work was funded by the China Scholarship Council (CSC), Grant # 202006290013.