Redundancy Reduction Twins Network: A Training Framework for Multi-output Emotion Regression

06/18/2022
by Xin Jing, et al.

In this paper, we propose the Redundancy Reduction Twins Network (RRTN), a redundancy reduction training framework that minimizes redundancy by measuring the cross-correlation matrix between the outputs of the same network fed with distorted versions of a sample and bringing it as close to the identity matrix as possible. RRTN also applies the Barlow Twins loss function to help maximize the similarity of representations obtained from different distorted versions of a sample. However, as the distribution of losses can cause performance fluctuations in the network, we also propose a Restrained Uncertainty Weight Loss (RUWL) for joint training to identify the best weights for the loss function. Our best approach on CNN14 with the proposed methodology obtains a CCC of 0.678 on emotion regression on the ExVo Multi-task dev set, a 4.8 % increase over the vanilla CNN14 CCC of 0.647, which constitutes a significant difference at the 95 % confidence level (2-tailed).


1 Introduction

The expression of emotions in speech sounds and the corresponding ability to perceive such emotions are both fundamental aspects of human communication (mauss2009measures; latif2021survey). Understanding vocal bursts and non-linguistic vocalizations, which are central to the expression of emotion (scherer2017computational), has been overlooked in the field of computer audition. In this respect, the ExVo challenge (BairdExVo2022) presents an excellent opportunity to take advantage of recent advances in affective computing to study the automatic recognition of emotions through vocalizations.

In supervised learning, data augmentation (Yang20ARC; Triantafyllopoulos21DSC) is typically applied to enlarge the data set in order to prevent overfitting and improve the generalization of models. Self-supervised learning (SSL) pursues the same goal by different means, attempting to learn generic representations from massive data without human annotations (liu2022audio). Since no labels are available during upstream training, recent advances in SSL suggest that it is possible to obtain results competitive with supervised learning by creating pseudo-labels for training data through data augmentation (zbontar2021barlow), or by defining ‘positive’ and ‘negative’ pairings (chen2020simclr; baevski2020wav2vec). Redundancy reduction is a principle first proposed in neuroscience by H. B. Barlow (barlow1961possible) in 1961. The principle hypothesizes that the goal of sensory processing is to recode highly redundant sensory inputs into a factorial code with statistically independent components, and it successfully explains the organization of the visual system (barlow2001redundancy). It has led to several different works on supervised and unsupervised learning (balle2016end; deco1997non). Following that principle, this work is mainly inspired by (zbontar2021barlow; barlow1961possible; goyal2019scaling), where the original data and an augmented version are fed into the same network and their outputs should be highly similar, with only slight differences.

Several works have shown that combining different loss functions is beneficial when training deep neural networks (bentaieb2017uncertainty; babaeizadeh2018adjustable), as this allows for incorporating heterogeneous information. When training with multiple losses, a common approach is to balance their different properties by minimizing a weighted sum of the individual losses (dosovitskiy2019you; liebel2018auxiliary).

The key contributions of our work are two-fold:

  1. We present a simple but effective redundancy reduction method that generates two distorted views of every sample in the batch and then reduces redundancy in the feature map by operating on the cross-correlation matrix of the resulting embeddings.

  2. We apply a novel weighted loss strategy that updates the loss weights dynamically while restraining their sum, in order to find the best weight for each loss.

The remainder of our paper is organized as follows. We present our Redundancy Reduction Twins network architecture and Restrained Uncertainty Weight Loss strategy in Section 2. In Section 3 and Section 4, we describe our experimental setting and results respectively. Finally, we conclude with a brief summary and outline the future directions.

2 Methodology

In this section, we propose the Redundancy Reduction Twins Network (RRTN), a redundancy reduction training framework which applies a pair of identical networks to measure the similarity of distorted versions of a sample and makes the cross-correlation matrix between them as close to the identity matrix as possible. In addition, we also apply a loss strategy named Restrained Uncertainty Weight Loss (RUWL), which dynamically adjusts the weights of different losses and keeps the sum of the weights close to a constant.

Figure 1: The proposed Redundancy Reduction Twins Network takes distorted versions of a batch of samples and measures the similarity of the two embeddings using the cross-correlation matrix. It is designed to mitigate redundancy by trying to equate the diagonal elements of the cross-correlation matrix to 1. The introduced Restrained Uncertainty Weight Loss dynamically adjusts the loss weights while keeping their sum close to the upper limit.
|               | Train | Validation | Σ     |
| Minutes       | 739   | 726        | 1465  |
| No.           | 19990 | 19396      | 59201 |
| Speakers      | 571   | 568        | 1139  |
| USA           | 206   | 206        | -     |
| China         | 79    | 79         | -     |
| South Africa  | 244   | 244        | -     |
| Venezuela     | 44    | 44         | -     |

Table 1: An overview of the Hume-VB dataset. The train and validation information is publicly accessible, while the test set is blinded.

2.1 Redundancy Reduction Twins Network

Figure 1 shows the architecture of our proposed Redundancy Reduction Twins Network (RRTN) training framework. We employ the original input sample and its augmented form as the network’s input data. As shown in Figure 1, we refer to the output of the encoder as ‘representations’ and the output of the projector as ‘embeddings’. The representations are used for the regression task, and the embeddings are fed to the Barlow Twins loss function.

For our best approach, which we discuss in Section 4, the encoder is a CNN14 network (Kong19-PANNS) without its classification layer, yielding 2048 output units. The subsequent projector differs from the original Barlow Twins network (zbontar2021barlow), which contains three linear layers with 8192 units each; our projector consists of a single linear layer with 2048 units, as our method does not benefit from very high-dimensional embeddings. As Equation 2 shows, the Barlow Twins loss is a sum over the elements of a matrix, so when the embedding dimension rises, the loss value can become thousands of times larger than the regression loss, which makes joint optimization difficult.
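To make the data flow concrete, the following is a minimal PyTorch sketch of the twin forward pass under the assumptions above; the class and attribute names (`RRTN`, `projector`, `regressor`) are ours, and the encoder stands in for any backbone that returns 2048-dimensional clip-level representations.

```python
import torch
import torch.nn as nn

class RRTN(nn.Module):
    """Sketch of the Redundancy Reduction Twins Network.

    `encoder` is assumed to be a CNN14-style backbone whose
    classification head has been removed, returning a 2048-d
    representation per clip.
    """
    def __init__(self, encoder: nn.Module, repr_dim: int = 2048,
                 emb_dim: int = 2048, n_emotions: int = 10):
        super().__init__()
        self.encoder = encoder
        # Single-layer projector (unlike the 3-layer, 8192-unit
        # projector of the original Barlow Twins).
        self.projector = nn.Linear(repr_dim, emb_dim)
        # Regression head for the ten emotion intensities.
        self.regressor = nn.Linear(repr_dim, n_emotions)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor):
        # The two views pass through the same (weight-shared) encoder.
        h_a, h_b = self.encoder(x_a), self.encoder(x_b)
        # Embeddings feed the Barlow Twins loss ...
        z_a, z_b = self.projector(h_a), self.projector(h_b)
        # ... while representations feed the CCC regression losses.
        y_a, y_b = self.regressor(h_a), self.regressor(h_b)
        return (z_a, z_b), (y_a, y_b)
```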

2.2 Loss functions

Concordance Correlation Coefficient Loss (CCC Loss): measures the concordance between the prediction $\hat{y}$ and the ground truth $y$. This statistic quantifies the agreement between these two measures of the same variable. The CCC loss is defined as

$$\mathcal{L}_{\mathrm{CCC}} = 1 - \rho_c = 1 - \frac{2\,\sigma_{y\hat{y}}}{\sigma_y^2 + \sigma_{\hat{y}}^2 + \left(\mu_y - \mu_{\hat{y}}\right)^2}, \qquad (1)$$

where $\mu_y$ and $\mu_{\hat{y}}$ are the means, $\sigma_y^2$ and $\sigma_{\hat{y}}^2$ are the variances, and $\sigma_{y\hat{y}}$ is the covariance of the ground truth and the prediction. In the considered emotion recognition task, ten emotion labels are provided, and we use the CCC defined and provided by the ExVo challenge (BairdExVo2022), which is the mean over the emotion dimensions: $\bar{\rho}_c = \frac{1}{10}\sum_{i=1}^{10} \rho_c^{(i)}$.
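For reference, a straightforward PyTorch implementation of Equation 1, computed per emotion dimension and then averaged, could look as follows; the function name and the small stabilising epsilon are our additions.

```python
import torch

def ccc_loss(pred: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    """1 - CCC, averaged over the ten emotion dimensions.

    pred, gold: (batch, 10) tensors of predicted / annotated
    emotion intensities.
    """
    mu_p, mu_g = pred.mean(dim=0), gold.mean(dim=0)
    var_p = pred.var(dim=0, unbiased=False)
    var_g = gold.var(dim=0, unbiased=False)
    # Covariance between prediction and ground truth, per dimension.
    cov = ((pred - mu_p) * (gold - mu_g)).mean(dim=0)
    ccc = 2.0 * cov / (var_p + var_g + (mu_p - mu_g) ** 2 + 1e-8)
    return 1.0 - ccc.mean()
```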

Barlow Twins Loss (BT Loss): is designed to maximize the similarity between the embedding vectors while reducing the redundancy between their components. The BT loss is defined as

$$\mathcal{L}_{\mathrm{BT}} = \sum_i \left(1 - \mathcal{C}_{ii}\right)^2 + \lambda \sum_i \sum_{j \neq i} \mathcal{C}_{ij}^2, \qquad (2)$$

in which $\lambda$ is a positive constant that weighs the importance of the first and second term in the loss function; we keep it fixed throughout this work. $\mathcal{C}$ is the cross-correlation matrix

$$\mathcal{C}_{ij} = \frac{\sum_b z_{b,i}^{A}\, z_{b,j}^{B}}{\sqrt{\sum_b \left(z_{b,i}^{A}\right)^2}\, \sqrt{\sum_b \left(z_{b,j}^{B}\right)^2}}, \qquad (3)$$

where $b$ indexes the samples in a batch and $i, j$ index the vector dimensions of the network outputs $z^{A}$ and $z^{B}$ for the two distorted views. Both terms in the definition serve a distinct purpose. The first term in Equation 2 tries to equate the diagonal elements of the cross-correlation matrix to 1, which makes the embedding invariant to the distortions applied, while the second term, called the redundancy reduction term, decorrelates the different vector components of the embedding by trying to equate the off-diagonal elements of the cross-correlation matrix to 0.
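A compact PyTorch sketch of Equations 2 and 3 follows, with batch-standardised embeddings as in the original Barlow Twins; the function name, the stabilising epsilon, and the `lam` parameter (standing in for the fixed $\lambda$) are our choices.

```python
import torch

def barlow_twins_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                      lam: float) -> torch.Tensor:
    """Equation 2: invariance term + lam * redundancy-reduction term.

    z_a, z_b: (batch, dim) embeddings of the two distorted views.
    """
    n, d = z_a.shape
    # Standardise each embedding dimension over the batch, so that
    # the matrix below is a cross-correlation matrix (Equation 3).
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-8)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-8)
    c = (z_a.T @ z_b) / n                       # d x d matrix
    on_diag = (torch.diagonal(c) - 1.0).pow(2).sum()
    # Zero the diagonal, then square-sum the remaining entries.
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lam * off_diag
```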

2.3 Restrained Uncertainty Weight Loss (RUWL)

From Equation 1, the upper limit of the CCC loss is 1. Meanwhile, as Equation 2 shows, the BT loss is a sum over matrix elements. For the high-dimensional embedding in our work (2048 dimensions), the value of the BT loss can be much higher than the CCC loss.

As shown in Figure 1, in order to optimise the network with three losses, we add them with different weights:

$$\mathcal{L} = \omega_1 \mathcal{L}_{\mathrm{CCC}} + \omega_2 \mathcal{L}_{\mathrm{CCC}}' + \omega_3 \mathcal{L}_{\mathrm{BT}}, \qquad (4)$$

where the $\omega_i$ are the weights of the different losses and $\mathcal{L}_{\mathrm{CCC}}'$ is the loss of the distorted input sample.

Since each of the contributing single loss functions may behave differently, appropriately weighting each loss is essential. We apply the weighted loss function described in (liebel2018auxiliary) to tune the $\omega_i$ automatically. Furthermore, to prevent the network from constantly shrinking the weights to reduce the loss without learning anything about the task, we propose a Restrained Uncertainty Weight Loss for updating the $\omega_i$,

$$\omega_i = \frac{1}{2 a_i^2}, \qquad (5)$$

where each $a_i$ is a learnable parameter. The final loss for our RRTN framework is defined as

$$\mathcal{L}_{\mathrm{final}} = \sum_{i=1}^{3} \left[\frac{1}{2 a_i^2}\,\mathcal{L}_i + \ln\left(1 + a_i^2\right)\right] + \left(c - \sum_{i=1}^{3} \frac{1}{2 a_i^2}\right)^2, \qquad (6)$$

where $a_i$ is the parameter that dynamically updates the weights as the network is updated, and $c$ is a constant that restrains the sum of the weights. Based on our prior knowledge, we choose $c$ and the initial values of the parameters $a_i$ before training.
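Because Equations 5 and 6 are only partially legible in our copy of the paper, the following is a speculative sketch that combines the uncertainty weighting of (liebel2018auxiliary) with a penalty keeping the weight sum near the constant $c$; all names, the default $c$, and the initial values of `a` are our assumptions, not necessarily the authors' exact formulation.

```python
import torch
import torch.nn as nn

class RestrainedUncertaintyWeightLoss(nn.Module):
    """Hedged sketch of RUWL: uncertainty-weighted loss sum plus a
    restraint keeping the sum of the weights close to `c`.
    Initial values below are placeholders, not the paper's settings.
    """
    def __init__(self, n_losses: int = 3, c: float = 1.0):
        super().__init__()
        # One learnable scale a_i per loss; weight w_i = 1 / (2 a_i^2).
        self.a = nn.Parameter(torch.ones(n_losses))
        self.c = c

    def forward(self, losses: list) -> torch.Tensor:
        w = 1.0 / (2.0 * self.a ** 2)
        weighted = sum(w_i * l_i for w_i, l_i in zip(w, losses))
        # Regulariser prevents the a_i from growing without bound.
        reg = torch.log1p(self.a ** 2).sum()
        # Restraint keeps the weight sum near the constant c.
        restraint = (self.c - w.sum()) ** 2
        return weighted + reg + restraint
```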

| Encoder                 | # Param (M) | CCC       | Relative Local Gain (%) | Relative Global Gain (%) |
| ResNet 18 (baseline)    | 11.18       | 0.613     | -   | -    |
| ResNet 18 + RRTN        | 12.23       | 0.628     | 2.4 | 2.4  |
| ResNet 18 + RRTN + RUWL | 12.23       | 0.634     | 3.6 | 3.6  |
| CNN10                   | 4.95        | 0.655     | -   | 6.9  |
| CNN10 + RRTN            | 6.00        | 0.667     | 1.8 | 8.8  |
| CNN10 + RRTN + RUWL     | 6.00        | 0.668     | 2.0 | 9.0  |
| CNN14                   | 79.69       | 0.647     | -   | 5.5  |
| CNN14 + RRTN            | 83.95       | *0.674*   | 4.2 | 10.0 |
| CNN14 + RRTN + RUWL     | 83.95       | **0.678** | 4.8 | 10.6 |

Table 2: Overall results for our proposed RRTN framework, measured by the Concordance Correlation Coefficient (CCC). The parameter counts of the networks are also presented. Relative Local Gain indicates the improvement over each base network itself; Relative Global Gain refers to the improvement over the ResNet 18 baseline. The best result of our experiments is marked in bold, and the second best in italics.

3 Experiments

3.1 Dataset

The Hume Vocal Bursts dataset (HUME-VB) (Cowen2022HumeVB) is a large-scale dataset of self-recorded emotional non-linguistic vocalizations (vocal bursts), containing a total of 2,270 minutes of audio data from 1,702 speakers aged from 20 to 39 years (we use the original release of processed .wav files, not the follow-up release of unprocessed ones).

In the database, each vocal burst has been labeled in terms of the intensity of ten different expressed emotions. Each emotion’s intensity ratings were normalized to a range of [0, 1]. The audio files were normalized to -3 decibels and converted to 16 kHz, 16 bit, mono. Finally, the official data have been split into three equally sized sets: training, validation, and test. There are ten emotion labels in the dataset: Amusement, Awe, Awkwardness, Distress, Excitement, Fear, Horror, Sadness, Surprise, and Triumph.

3.2 Experiment settings

As input features, we use 64-dimensional log Mel spectrograms with a 32 ms window length and a 10 ms frame shift. To unify the shape of the input data, we crop/pad the time dimension to 250 frames, so the input features for the RRTN have a size of 250 x 64. Each input sample is augmented by SpecAugment (Park_spec_2019) to produce the distorted views shown in Figure 1. We configure SpecAugment with a time drop width of 64, 2 time stripes, a frequency drop width of 8, and 2 frequency stripes.
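The stated masking configuration can be approximated with torchaudio's masking transforms; this is a rough sketch under our naming (`distort`) and our assumed (batch, channel, mel, time) layout, not necessarily the authors' exact augmentation code.

```python
import torch
import torchaudio.transforms as T

# Two time stripes of width <= 64 frames and two frequency stripes
# of width <= 8 mel bins, following the configuration in the text.
_masks = [T.TimeMasking(time_mask_param=64),
          T.TimeMasking(time_mask_param=64),
          T.FrequencyMasking(freq_mask_param=8),
          T.FrequencyMasking(freq_mask_param=8)]

def distort(mel: torch.Tensor) -> torch.Tensor:
    """mel: (batch, 1, 64 mel bins, 250 frames) log-Mel input -> masked copy."""
    out = mel.clone()
    for m in _masks:
        out = m(out)
    return out
```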

Throughout our experiments, we use the AdamW optimiser with fixed epsilon and weight decay values, and train for 60 epochs with a fixed mini-batch size and learning rate. All models were developed in PyTorch 1.8.1 and trained on a single Nvidia RTX 3090 GPU.
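Putting the sketches above together, a hypothetical training step could look as follows; the concrete epsilon, weight decay, learning rate, and lambda values are not legible in our copy of the paper, so the numbers below are placeholders rather than the authors' settings, and the fully connected encoder is a stand-in for the CNN14 backbone.

```python
import torch

# Stand-in encoder producing 2048-d representations from a
# (batch, 1, 64, 250) input; the paper uses a CNN14 backbone
# (Kong19-PANNS) without its classification layer.
encoder = torch.nn.Sequential(
    torch.nn.Flatten(),                      # -> (batch, 16000)
    torch.nn.Linear(1 * 64 * 250, 2048),
    torch.nn.ReLU())

model = RRTN(encoder)                                # defined above
ruwl = RestrainedUncertaintyWeightLoss(n_losses=3)   # defined above
optim = torch.optim.AdamW(
    list(model.parameters()) + list(ruwl.parameters()),
    lr=1e-4, eps=1e-8, weight_decay=1e-2)            # placeholder values

def train_step(mel: torch.Tensor, gold: torch.Tensor) -> float:
    view_a, view_b = mel, distort(mel)   # original + distorted view
    (z_a, z_b), (y_a, y_b) = model(view_a, view_b)
    losses = [ccc_loss(y_a, gold),                     # original view
              ccc_loss(y_b, gold),                     # distorted view
              barlow_twins_loss(z_a, z_b, lam=5e-3)]   # placeholder lambda
    loss = ruwl(losses)
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
```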

4 Results and Discussion

Our experimental results on the multi-output vocal emotion regression task are presented in Table 2; we follow the original network implementations described in (he2016deep; Kong19-PANNS) to build ResNet 18 and CNN10/CNN14.

As shown in Table 2, we set the ResNet 18 result as the global baseline and fit each network into the RRTN framework. In terms of global gain, our method on CNN14 obtains the highest improvement, with a CCC of 0.678, a 10.6 % increase. Further experiments on the RUWL reveal that applying a weighted loss is a simple and effective way to improve the prediction ability of the network.

By applying the RRTN training framework, there is always a gain in the CCC compared to the outcome of the original network, which we dub the local gain. However, we find that the improvement obtained by using RRTN with various encoders is quite variable: CNN14 improves by up to 4.8 %, while CNN10 only improves by up to 2.0 %. CNN10 and CNN14 share the same network structure, consisting of 4 and 6 convolutional blocks, respectively, where each convolutional block consists of 2 convolutional layers with a kernel size of 3 x 3. The goal of the RRTN is to improve the generalizability of the encoder’s features by maximizing the similarity and eliminating the redundant information between pairs of samples; a deeper and wider network appears more conducive to this joint optimization strategy. In our work, RRTN only introduces a single linear layer as the projector, which does not significantly increase the parameter count of the network, but may lack representational power. Further experiments should be conducted to find the optimal number of linear layers in the projector.

Overall, it seems that most of the performance gain comes from the introduction of the RRTN framework, while RUWL adds a marginal gain on top. This suggests that minimizing redundant information in the features is the most important contribution of our proposal, while the automatic weighting of the different losses is less critical.

5 Conclusion

In this work, we proposed the Redundancy Reduction Twins Network (RRTN), a lightweight and effective training framework. The cross-correlation matrix is used to quantify the similarity between the outputs of the same network fed with distorted versions of a sample, and it is driven as close to the identity matrix as possible to maximize similarity. Since the final loss is a sum of multiple weighted losses, we applied the Restrained Uncertainty Weight Loss for joint training to find the best weight for each loss. Our experimental results demonstrate the validity of the proposed methodology.

Moreover, we believe that further refinements of the proposed loss function and framework could lead to more efficient solutions and better performance.

6 Acknowledgements

This work was funded by the China Scholarship Council (CSC), Grant # 202006290013.

References