In many practical applications for speech emotion recognition systems, the testing data (target domain) is different from the labeled data used to train the models (source domain). The mismatch in data distribution leads to a performance degradation of the trained models [1, 2, 3, 4, 5]. Therefore, it is vital to develop more robust systems that are more resilient to changes in train and test conditions [6, 7, 8, 9]. One approach to ensure that models perform well on the target domain is to use training data drawn from the same distribution. However, this approach can be expensive, since it requires enough data with emotional labels to build models specific to a new target domain. A more practical approach is to use available labeled data from similar domains along with unlabeled data from the target domain, creating models that generalize well to the new testing conditions without the need to annotate extra data with emotional labels. This study proposes an elegant solution for the problem of mismatch in train-test data distributions based on domain adversary training.
We formulate the machine-learning problem as follows. We have one or more source domains with annotated emotional data, which are used to train the models, and a large target domain with unlabeled data (see Fig. 1). The testing data come from the target domain. Due to the prohibitive cost of annotating new data every time we change the target domain and the abundance of unlabeled data, we aim to use the unlabeled target data to extract useful information, reducing the differences between source and target domains. The envisioned system generalizes better and is more robust by maximizing the performance using a shared data representation for source and target domains. The key principle in our approach is to find a consistent feature representation for the source and target domains. Common approaches to address this problem include feature transformation methods, where the representation of the source data is mapped into a representation that resembles the features in the target domain, and finding a common representation between the domains, such that the features are invariant across domains [11, 12, 13]. The common domain-invariant features do not necessarily contain useful information about the main task. Therefore, it is vital to constrain the learned common feature representation to ensure that it is discriminative with respect to the main task.
This paper explores the idea of finding a common representation between the source and target domains, such that data from the domains become indistinguishable while maintaining the critical information used in emotion recognition. This work is inspired by Ganin et al., who proposed training an adversarial multitask network. The approach searches the feature space for a representation that accurately describes the data from either domain and contains the information needed for the main task, which in our study is the prediction of emotional attributes (arousal, valence, and dominance). The discriminative and domain-invariant features are learned by aligning the data distributions from the domains through back-propagation. This approach allows our framework to use unlabeled target data to learn a flexible representation.
This paper shows that adversarial training using unlabeled training data benefits emotion recognition. By using the abundant unlabeled target data, we gain on average 27.3% relative improvement in concordance correlation coefficient (CCC) compared to just training with the source data. We evaluate the effect of adversarial training by visualizing the similarity of the data representation learned by the network from both domains. The visualization shows that adversarial training aligns the data distributions as expected, reducing the gap between source and target domains. The study also shows the effect of the number of shared layers between the domain classifier and the emotion predictor on the performance of the system. The size of the source domain is an important factor that determines the optimal number of shared layers in the network. This novel framework for emotion recognition provides an appealing approach to effectively leverage unlabeled data from a new domain, generalizing the models and improving the classification performance.
This paper is organized as follows. Section II discusses previous work on speech emotion recognition, with emphasis on frameworks that aim to reduce the mismatch between train and test datasets. Section III presents our proposed model, providing the motivation and the details of the proposed framework. Section IV presents the experiment evaluation including the databases, network structure and acoustic features. Section V presents the experimental results and the analysis of the main findings. Section VI finalizes the study with conclusions and future research directions.
II Related Work
The key challenge in speech emotion recognition is to build classifiers that perform well under various conditions. The cross-corpora evaluation by Shami and Verhelst demonstrated the drop in classification performance observed when training on one emotional corpus and testing on another. Other studies have shown similar results [16, 17, 18, 5]. Several approaches have been proposed to solve this problem. Shami and Verhelst proposed to include more variability in the training data by merging emotional databases. They demonstrated that it is possible to achieve classification performance comparable to within-corpus results. More recently, Chang and Scherer showed that data from other domains can improve the within-corpus performance of a neural network. They employed a deep convolutional generative adversarial network (DCGAN) to extract and learn useful feature representations from unlabeled data from a different domain. This led to better generalization compared to a model that did not make use of unlabeled data.
The main approach to attenuate the mismatch between train and test conditions is to minimize the differences in the feature space between both domains. Zhang et al. showed that by separately normalizing the features of each corpus, it is possible to minimize cross-corpus variability. Hassan et al. increased the weight of the train data that matches the test data distribution by using kernel mean matching (KMM), the Kullback-Leibler importance estimation procedure (KLIEP), and unconstrained least-squares importance fitting (uLSIF). Zong et al. used least square regression (LSR) to mitigate the projected mean and covariance differences between source data and unlabeled target samples while learning the regression coefficient matrix. Because the learned coefficient matrix depends on the samples selected from the target domain, multiple matrices were estimated and used to test new samples. Each matrix was used to predict an emotional label for a test sample, combining the results with the majority vote rule.
Studies have also explored mapping both train and test domains to a common space, where the feature representation is more robust to the variations between the domains. Deng et al. used auto-encoders to find a common feature representation across the domains. They trained an auto-encoder such that it minimized the reconstruction error on both domains. Building upon this work, Mao et al. proposed to learn a shared feature representation across domains by constraining their model to share the class priors across domains. Sagha et al., also motivated by the work of Deng et al., used principal component analysis (PCA) along with kernel canonical correlation analysis (KCCA) to find views with the highest correlation between the source and target corpora. First, they used PCA to represent the feature space of the source and target data. Then, the features for the source and target domains were projected using the PCA in both domains. Finally, they used KCCA to select the top dimensions that maximized the correlation between the views. Inspired by universum learning, where unlabeled data is used to regularize the training process of a support vector machine (SVM), Deng et al. proposed adding a universum loss to the reconstruction loss of an auto-encoder. The added loss function was defined as the sum of a margin-based loss and the $\epsilon$-insensitive loss, making use of both labeled and unlabeled data. This approach aimed to learn an auto-encoder classifier that has low reconstruction and classification errors on both domains.
Song et al. proposed a couple of methods based on non-negative matrix factorization (NMF) that utilized data from both train and test domains. The proposed methods aimed to represent a matrix formed by data from both domains as two non-negative matrices whose product is an approximation of the original matrix. The factorization was regularized by maximum mean discrepancy (MMD) to ensure that the differences in the feature distributions of the two corpora were minimized. The proposed methods aimed to learn a robust low-dimensional feature representation using either unlabeled data or labels as hard constraints on the problem. Abdelwahab and Busso proposed creating an ensemble of classifiers, where each classifier focuses on a different feature space (each classifier maximized the performance for a given emotion category). The features were selected over the labeled data from the target domain obtained with active learning. This semi-supervised approach learned discriminant features for the target domain, increasing the robustness against shifts in the data distributions between domains.
Instead of finding a common representation between domains, Deng et al.  trained a sparse auto-encoder on the target data and used it to reconstruct the source data. This approach used feature transformation in a way that exploits the underlying structure in emotional speech learned from the target data. Deng et al.  used two denoising auto-encoders. The first auto-encoder was trained on the target data and the second auto-encoder was trained on the source data, but it was constrained to be close to the first auto-encoder. The second auto-encoder was then used to reconstruct the data from both source and target domains.
Our proposed approach takes an innovative formulation with respect to previous work, relying on the domain adversarial neural network (DANN). While Shinohara showed that the use of DANN can increase the robustness of an automatic speech recognition (ASR) system against certain types of noise, this framework has not been used for speech emotion recognition. Applying it to this task is an important contribution, as the framework can reduce the mismatch between train and test sets in a principled way. DANN relies on adversarial training for domain adaptation to learn a flexible representation during the training of the emotion classifier. As the training data changes, both the emotion and domain classifiers readjust their weights to find the new representation that satisfies all conditions. The domain classifier can be considered as a regularizer that prevents the main classifier from over-fitting to the source domain. The final learned representation performs well on the target domain without sacrificing the performance on the source domain.
III Proposed Approach
We present an unsupervised approach to reduce the mismatch between source and target domains by creating a discriminative feature representation that leverages unlabeled data from the target domain.
We aim to learn a common representation between the source and target domains, where samples from both domains are indistinguishable to a domain classifier. This approach is useful because all the knowledge learned while training the classifier on the source domain is directly applicable to the target domain data. We learn the representation by using a gradient reversal layer where the gradient produced by the domain classifier is multiplied by a negative value when it is propagated back to the shared layers. Changing the sign of the gradient causes the feature representation of the samples from both domains to converge, reducing the gap between domains. Ideally, the performance of the domain classifiers should be at random level where both domains “look” the same. When such a representation is learned, the data distributions of both domains are aligned. This approach leads to a large performance improvement in the target domain, as demonstrated by our experimental evaluation (see Section V). A key feature of this framework is that it is unsupervised, since it does not require labeled data from the target domain. However, this framework would continue to be useful when labeled data from the target domain is available, working as a semi-supervised approach.
III-B Domain Adversarial Neural Network for Emotion Recognition
Ganin et al., inspired by the recent work on generative adversarial networks (GAN), proposed the domain adversarial neural network (DANN). The network is trained using labeled data from the source domain and unlabeled data from the target domain. The network learns two classifiers: the main classification task, and the domain classifier, which determines whether the input sample is from the source or the target domain. Both classifiers share the first few layers that determine the representation of the data used for classification. The approach introduced a gradient reversal layer between the domain classifier and the feature representation layers. The layer passes the data during forward propagation and inverts the sign of the gradient during backward propagation. The network attempts to minimize the task classification error and find a representation that maximizes the error of the domain classifier. By considering these two goals, the model ensures a discriminative representation for the main recognition task that makes the samples from either domain indistinguishable.
Figure 2 shows an example structure of the DANN network. The network is fed labeled source data and unlabeled target data in equal proportions. In our formulation, we propose to predict emotional attribute descriptors as the primary goal. We train the primary recognition task with the source data, for which we have emotional labels. For the domain classifier, we train the classifier with data from the source and target domains. Notice that the domain classifier does not require emotional labels, so we can rely on unlabeled data from the target domain. The classifiers are trained in parallel. The network's objective is defined as:

$$E(\theta_f, \theta_y, \theta_d) = \frac{1}{n}\sum_{i=1}^{n} L_y^i(\theta_f, \theta_y) - \lambda \left( \frac{1}{n}\sum_{i=1}^{n} L_d^i(\theta_f, \theta_d) + \frac{1}{n'}\sum_{i=n+1}^{n+n'} L_d^i(\theta_f, \theta_d) \right) \tag{1}$$

where $\theta_f$ represents the parameters of the shared layers providing the regularized feature representation, $\theta_y$ represents the parameters of the layers associated with the main prediction task, and $\theta_d$ represents the parameters of the layers of the domain classifier ($n$ is the number of labeled training samples, $n'$ is the number of unlabeled training samples). The optimization process consists of two loss functions. $L_y$ is the prediction loss for the main task, and $L_d$ is the domain classification loss. The prediction loss and the domain classification loss compete against each other in an adversarial manner. The parameter $\lambda$ is a regularization multiplier that controls the tradeoff between the losses. This is a minimax problem. It attempts to find a saddle point parametrized by $\hat{\theta}_f$, $\hat{\theta}_y$, and $\hat{\theta}_d$:

$$(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d), \qquad \hat{\theta}_d = \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d) \tag{2}$$

At the saddle point, the classification loss on the source domain is minimized and the domain classification loss is maximized with respect to the feature representation. The maximization is achieved by introducing a gradient reversal layer that changes the sign of the gradient going from the domain classification layers to the feature representation layers (see white layer in Fig. 2). With the reversed sign, the updates applied to the feature representation parameters increase, rather than decrease, the domain classification loss. With this approach, stochastic gradient descent tries to make the features similar across domains, so what is learned from the source domain remains effective for the target domain without loss in performance.
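The behavior of the gradient reversal layer can be illustrated with a minimal sketch in plain Python. It is not the implementation used in this study: the one-weight feature extractor, the one-weight logistic domain classifier, and the names `grl_forward`, `grl_backward`, `domain_loss`, and `adversarial_step` are all hypothetical and chosen only to show the sign flip.

```python
import math


def grl_forward(features):
    """Forward pass of the gradient reversal layer: the identity."""
    return list(features)


def grl_backward(grad_from_domain, lam):
    """Backward pass: reverse the sign of the gradient and scale it by lam."""
    return [-lam * g for g in grad_from_domain]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def domain_loss(w_f, w_d, x, d):
    """Binary cross-entropy of a toy domain classifier (assumed setup).

    feature = w_f * x, domain logit = w_d * feature, d is 0 or 1.
    """
    feat = grl_forward([w_f * x])[0]
    p = sigmoid(w_d * feat)
    return -(d * math.log(p) + (1 - d) * math.log(1 - p))


def adversarial_step(w_f, w_d, x, d, lam, lr):
    """One gradient step on the domain loss with the reversal layer in place."""
    feat = w_f * x
    p = sigmoid(w_d * feat)
    dz = p - d                       # derivative of the loss w.r.t. the logit
    grad_w_d = dz * feat             # domain classifier sees the true gradient
    grad_feat = dz * w_d             # gradient arriving at the reversal layer
    grad_w_f = grl_backward([grad_feat], lam)[0] * x  # reversed for the features
    return w_f - lr * grad_w_f, w_d - lr * grad_w_d
```

In this toy setting, the update to the domain classifier weight lowers the domain loss, while the update to the feature weight raises it, which is the adversarial behavior described above.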
The simple concept of choosing a representation that confuses a competent domain classifier leads to models that perform better in the target domain without impacting the performance in the source domain. This approach is particularly useful for emotion recognition, as most annotated corpora come from studio settings that greatly differ from real-world testing conditions. This unsupervised approach uses available unlabeled data to align the distributions of both domains. The aligned distributions lead to a common representation, causing the domain classifier’s performance to drop to random chance levels. The common indistinguishable representation retains discriminative information learned during the training of the models with data from the source domain. We improve the classifier’s performance on the target domain without having to collect new annotated data. This is an important contribution in this field, taking us one step closer toward robust speech emotion classifiers that generalize well in most testing conditions.
IV Experimental Evaluation
We define the main task as a regression problem to estimate the emotional content conveyed in speech described by the emotional attributes arousal (calm versus activated), valence (negative versus positive), and dominance (weak versus strong). This section describes the databases (Section IV-A), the acoustic features (Section IV-B) and the specific network structures (Section IV-C) used in the experimental evaluation.
IV-A Emotional Databases
The experimental evaluation considers a multi-corpus setting with three databases. The source domain (train set) corresponds to two databases: the USC-IEMOCAP and MSP-IMPROV corpora. The target domain corresponds to the MSP-Podcast database.
IV-A1 The USC-IEMOCAP Corpus
The USC-IEMOCAP database is an audiovisual corpus recorded from ten actors during dyadic interactions. It has approximately 12 hours of recordings with detailed motion capture information carefully synchronized with audio (this study only uses the audio). The goal of the data collection was to elicit natural emotions within a controlled setting. This goal was achieved with two elicitation frameworks: emotional scripts, and improvisation of hypothetical scenarios. These approaches allowed the actors to express spontaneous emotional behaviors driven by the context, as opposed to read speech displaying prototypical emotions. Several dyadic interactions were recorded and manually segmented into turns. Each turn was annotated with emotional labels by at least two evaluators across emotional attributes (valence, arousal, dominance). Dimensional attributes take integer values that range from one to five. The dimensional attribute of an utterance is the average of the values given by the annotators. We linearly map the scores between $-3$ and 3.
IV-A2 The MSP-IMPROV Corpus
The MSP-IMPROV database is a multimodal emotional database recorded from actors interacting in dyadic sessions. The recordings were carefully designed to promote natural emotional behaviors, while maintaining control over lexical and emotional contents. The corpus relied on a novel elicitation scheme, where two actors improvised scenarios that lead one of them to utter target sentences. For each of these target sentences, four emotional scenarios were created to contextualize the sentence to elicit happy, angry, sad and neutral reactions, respectively. The approach allows the actor to express emotions as dictated by the scenarios, avoiding prototypical reactions that are characteristic of other acted emotional corpora. Busso et al. showed that the target sentences occurring within these improvised dyadic interactions were perceived as more natural than read renditions of the same sentences. The MSP-IMPROV corpus includes not only the target sentences, but also other sentences during the improvisations that led one of the actors to utter the target sentence. It also includes the natural interactions between the actors during the breaks.
The corpus consists of 8,438 turns of emotional sentences recorded from 12 actors (over 9 hours). The sessions were manually segmented into speaking turns, which were annotated with emotional labels using perceptual evaluations conducted with crowdsourcing. Each turn was annotated by at least five evaluators, who annotated the emotional content in terms of the dimensional attributes arousal, valence, and dominance. Dimensional attributes take integer values that range from one to five. The consensus label assigned to each speech turn is the average value of the scores provided by the evaluators, which we linearly map between $-3$ and 3.
IV-A3 The MSP-Podcast Corpus
The MSP-Podcast corpus is an extensive collection of natural speech from multiple speakers appearing in Creative Commons licensed recordings downloaded from audio-sharing websites . Some of the key aspects of the corpus are the different conditions in which the recordings are collected, large number of speakers, and a large variety of natural content from spontaneous conversations conveying emotional behaviors. The audio was preprocessed removing portions that contain noise, music or overlapped speech. The recordings were then segmented into speaking turns creating a big audio repository with sentences that are between 2.75 seconds and 11 seconds. Emotional models trained with existing databases are then used to retrieve speech turns with target emotional content. The candidate segments were annotated with emotional labels using an improved version of the crowdsourcing framework proposed by Burmania et al. .
Each speech segment was annotated by at least five raters, who provided scores for the emotional dimensions arousal, valence and dominance using seven-point Likert scales. The consensus scores are the average scores assigned by the evaluators, which are shifted such that they are in the range between $-3$ and 3. The collection of this corpus is an ongoing effort. This study uses 14,227 labeled sentences. From this set, we use 4,283 labeled sentences coming from 50 speakers as our test set, which is consistently used across conditions. For the within-corpus evaluation (i.e., training and testing in the same domain), we define a development set with 1,860 sentences from 10 speakers, and a train set with the remainder of the corpus (8,084 sentences). This study also uses 73,209 unlabeled sentences from the audio repository of segmented speech turns. The unlabeled segments are used to train the domain classifier in the DANN approach.
IV-B Acoustic Features
The acoustic features correspond to the set proposed for the INTERSPEECH 2013 Computational Paralinguistics Challenge (ComParE). This feature set includes 6,373 acoustic features extracted at the sentence level (one feature vector per sentence). First, it extracts 65 frame-by-frame low-level descriptors (LLDs), which include various acoustic characteristics such as Mel-frequency cepstral coefficients (MFCCs), fundamental frequency, and energy. The externalization of emotion is conveyed through different aspects of speech production, so including these LLDs is important to capture emotional cues. After estimating the LLDs, a list of functions is estimated for each LLD, which are referred to as high-level descriptor (HLD) features. These HLDs include standard deviation, minimum, maximum, and ranges. The acoustic features are extracted using OpenSMILE.
We separately normalize the features from each domain (i.e., corpus) to have zero mean and unit standard deviation. The mean and the variance of the data are calculated considering only the values of the features within the 5% and 95% quantiles, to avoid outliers skewing the values. After normalization, we ignore any value greater than 10 times the standard deviation by setting it to zero.
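The normalization described above can be sketched for a single feature within a single corpus as follows. This is an illustrative sketch rather than the code used in the study: the function names, the linear-interpolation quantile, and the use of the absolute value when discarding extreme values are assumptions.

```python
import statistics


def quantile(sorted_vals, q):
    """Quantile of an already sorted list, using linear interpolation."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def robust_normalize(values, low_q=0.05, high_q=0.95, max_abs=10.0):
    """Z-normalize one feature of one corpus with trimmed statistics.

    The mean and standard deviation are estimated only from the values
    between the low_q and high_q quantiles. Normalized values whose
    magnitude exceeds max_abs (assumption: absolute value) are set to zero.
    """
    ordered = sorted(values)
    lo, hi = quantile(ordered, low_q), quantile(ordered, high_q)
    core = [v for v in values if lo <= v <= hi]
    mean = statistics.fmean(core)
    std = statistics.pstdev(core)
    if std == 0.0:
        # Constant feature: nothing to scale, return zeros.
        return [0.0 for _ in values]
    scaled = [(v - mean) / std for v in values]
    return [0.0 if abs(z) > max_abs else z for z in scaled]
```

Applying `robust_normalize` independently to each feature of each corpus mirrors the per-domain normalization: an extreme outlier does not influence the estimated mean and variance, and it is zeroed after scaling.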
IV-C Network Structure
As discussed in Section III-B, the DANN approach has three main components: the domain classifier layers, task classifier layers, and feature representation layers. The domain classifier layers are implemented with two layers across all the experimental evaluations. The task classifier layers are also implemented with two layers, except for the shallow network described below. The number of layers in the feature representation layers is a parameter of the network that is set on the development set. We consider different numbers of shared layers, evaluating the performance of the system with one, two, three or four layers.
We also study whether a simple shallow network can achieve similar performance compared to the deep network. In the shallow network, the task classifier layer and the feature representation layer are implemented with one layer each. We implement the domain classifier layer with two layers.
IV-D Baseline Systems
We establish two baselines. The first baseline is a network trained and validated only on the source data. This condition creates a mismatch between the train and test conditions. The second baseline corresponds to within corpus evaluations, where the models are trained and tested with data from the target domain. This model assumes that training data from the target domain is available, so it corresponds to the ideal condition. The parameters of the networks are optimized using the development set. The baselines are implemented with similar architectures, serving as a fair comparison with the proposed method (e.g., number of layers, number of nodes). The key difference with the DANN models is the lack of the domain classification layers, where the feature representation layers only consider the primary classification task.
We train the networks using Keras with TensorFlow as the back-end. We use batch normalization and dropout. The dropout rates are $p=0.2$ for the input layer and $p=0.5$ for the rest of the layers. We further regularize the models using a max-norm constraint of four on the weights and a clip norm of ten on the gradient. The loss function used for the main regression task is the mean square error (MSE). The loss function for the domain classification task is the categorical cross-entropy. We use Adam as the optimizer with a learning rate of 5 . We train the models for 100 epochs with a batch size of 256 sentences. A parameter of the DANN model is $\lambda$ in Equation 1, which controls the tradeoff between the task and domain classification losses. We follow a similar approach to the one proposed by Ganin et al., where $\lambda$ is initialized equal to zero for the first ten training epochs. Then, we slowly increase its value until it reaches its final value by the end of the training. We train each model twenty times to reduce the effect of initialization on the performance of the classifiers. We report the average performance across the trials.
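A schedule of this kind for the tradeoff parameter can be sketched as below. The text only states that the parameter is zero for the first ten epochs and then grows slowly until the end of training, so the linear ramp, the function name `lambda_schedule`, and the `lam_max` argument (a placeholder for the final value, which is not given here) are assumptions made for illustration.

```python
def lambda_schedule(epoch, lam_max, total_epochs=100, warmup_epochs=10):
    """Tradeoff weight for the domain loss at a given zero-based epoch.

    Returns 0 during the warm-up epochs, then ramps linearly (assumed shape)
    so that the last epoch uses lam_max (hypothetical final value).
    """
    if epoch < warmup_epochs:
        return 0.0
    ramp_length = max(1, total_epochs - 1 - warmup_epochs)
    progress = min(1.0, (epoch - warmup_epochs) / ramp_length)
    return lam_max * progress
```

Keeping the weight at zero during the first epochs lets the task layers settle before the adversarial signal from the domain classifier starts to shape the shared representation.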
In adversarial training, we need unlabeled data to train the domain classifier. We randomly select samples from the unlabeled portion of the target domain to be fed to the domain classifier. The number of selected samples from the audio repository of unlabeled speech turns is equal to the number of samples in the source domain, keeping the balance in the training set of the domain classification task.
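The balanced selection of unlabeled target samples can be sketched as follows. The function names and the convention of labeling source samples with 0 and target samples with 1 are assumptions for illustration only.

```python
import random


def sample_unlabeled(unlabeled_pool, n_source, seed=None):
    """Randomly draw as many unlabeled target items as there are source items."""
    if n_source > len(unlabeled_pool):
        raise ValueError("not enough unlabeled target samples to balance the set")
    rng = random.Random(seed)
    return rng.sample(list(unlabeled_pool), n_source)


def domain_training_set(source_items, unlabeled_pool, seed=None):
    """Build a balanced training set for the domain classifier.

    Domain labels (assumed convention): 0 = source, 1 = target.
    """
    target_items = sample_unlabeled(unlabeled_pool, len(source_items), seed)
    return [(x, 0) for x in source_items] + [(x, 1) for x in target_items]
```

Drawing a new random subset for each training run (a different `seed`) is one way to obtain the different sets of unlabeled data across trials.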
V Experimental Results
The performance results for the baseline models and the DANN models are reported in terms of the root mean square error (RMSE), Pearson’s correlation coefficient (PR), and concordance correlation coefficient (CCC) between the ground-truth labels and the estimated values. While we present PR and RMSE, the analysis focuses on CCC as the performance metric, which combines mean square error (MSE) and PR. CCC is defined as:

$$\rho_c = \frac{2\rho\,\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2} \tag{3}$$

where $\rho$ is the Pearson’s correlation coefficient, and $\sigma_x$ and $\sigma_y$, and $\mu_x$ and $\mu_y$ are the standard deviations and means of the predicted score $x$ and the ground truth label $y$, respectively.
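As a small worked example, the concordance correlation coefficient can be computed directly from its standard definition, using the identity that the Pearson correlation times both standard deviations equals the covariance. The function name `ccc` is illustrative, and population (biased) statistics are assumed.

```python
def ccc(pred, truth):
    """Concordance correlation coefficient between predictions and labels."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(pred)
    mean_x = sum(pred) / n
    mean_y = sum(truth) / n
    var_x = sum((x - mean_x) ** 2 for x in pred) / n
    var_y = sum((y - mean_y) ** 2 for y in truth) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(pred, truth)) / n
    denom = var_x + var_y + (mean_x - mean_y) ** 2
    if denom == 0.0:
        raise ValueError("CCC is undefined for identical constant inputs")
    # 2 * rho * sigma_x * sigma_y is equal to 2 * cov
    return 2.0 * cov / denom
```

For predictions `[1, 2, 3]` and labels `[2, 3, 4]`, the Pearson correlation is 1 but the CCC is 4/7, since the metric also penalizes the constant offset between predictions and labels.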
V-A Number of Layers for the Shared Feature Representation
We first study the effect of the number of shared layers between the domain classifier and the primary regression task on the DANN model’s performance (i.e., the feature representation layers in Fig. 2). The domain and task classifier layers are implemented with two layers each. We vary the number of shared layers between the classifiers and observe how the changes in feature representation affect the regression performance. This evaluation is conducted exclusively on the validation set of the target domain (i.e., the MSP-Podcast corpus).
Table I shows the average RMSE, PR, and CCC for the models trained with one, two, three and four shared layers. For arousal, we consistently observe better performance (lower RMSE and higher PR and CCC) with one shared layer between the domain and task classifiers. For the CCC values, the differences are statistically significant for all the cases, except when the source domain is the MSP-IMPROV corpus and the DANN model is implemented with two layers (one-tailed t-test over the average across the twenty trials, asserting significance when the $p$-value falls below the significance threshold). The performance degrades as more shared layers are added. For valence and dominance, the number of shared layers that provides the best performance varies from one corpus to another. In most cases, two or three shared layers provide the best performance. Based on these results on the validation set, we set the number of shared layers for the feature representation networks to one for arousal, two for valence and three for dominance.
V-B Regression Performance of the DANN Model
This section presents the regression results achieved by the DANN model and the baseline models. Table II lists the performance for the within-corpus evaluation, where the models are trained and tested with data from the MSP-Podcast corpus (referred to as target in the table), and the cross-corpus evaluations, where the models are trained with other corpora (referred to as src in the table). These baseline results are compared with our proposed DANN method (referred to as dann in the table). The results are tabulated in terms of emotional dimensions and network structures. These values are the average of twenty trials to account for different initializations and sets of unlabeled data used to train the DANN models. For the rows denoted “All Databases”, we combine all the source domains together (IEMOCAP and MSP-IMPROV corpora), treating them as a single source domain. To determine statistical differences between the src and dann conditions, we use the one-tailed t-test over the twenty trials, asserting significance when the $p$-value falls below the significance threshold. We highlight in bold when the difference between these conditions is statistically significant.
To visualize the results in Table II, we create figures showing the average performance under different conditions (Figs. 3-6). We use statistical significance tests to compare different conditions (values obtained from Table II). We test the hypothesis that population means for matched conditions are different using a one-tailed z-test. We assert significance when the $p$-value falls below the significance threshold. We use an asterisk in the figures to indicate if there is a statistically significant difference relative to the baseline model trained with the source domain.
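A one-tailed z-test over matched conditions can be sketched as below. The text does not detail how the statistic is formed, so treating the matched conditions as paired differences, the function name, and the direction of the alternative hypothesis (DANN better than the source baseline) are assumptions made for illustration.

```python
import math
import statistics


def one_tailed_paired_z(dann_scores, src_scores):
    """Return (z, p) for the alternative 'mean of dann - src is above zero'.

    Assumes the two lists hold CCC values for the same matched conditions,
    in the same order, and that there are enough pairs for the normal
    approximation to be reasonable.
    """
    if len(dann_scores) != len(src_scores) or len(dann_scores) < 2:
        raise ValueError("need two equally sized lists with at least two pairs")
    diffs = [d - s for d, s in zip(dann_scores, src_scores)]
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        raise ValueError("differences have zero variance")
    z = statistics.fmean(diffs) / (sd / math.sqrt(len(diffs)))
    p = 1.0 - statistics.NormalDist().cdf(z)
    return z, p
```

With hundreds of matched conditions, as in the figures discussed here, the normal approximation behind this statistic is a reasonable choice.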
Figure 3 shows the average concordance correlation coefficient across emotional dimensions, training sources, trials, and network structures (three emotional dimensions × twenty trials × three sources × five structures = 900 matched conditions). The average performance for the within-corpus evaluation (target) is close to double the performance for the cross-corpus evaluations (source). This result demonstrates the importance of minimizing the mismatch between train and test conditions in emotion recognition. The figure shows that the proposed DANN approach greatly improves the performance, achieving a 27.3% relative gain compared to the systems trained with the source domains. As highlighted by the asterisk, the improvement is statistically significant. The proposed approach reduces the gap between within-corpus and cross-corpus emotion recognition results, effectively using unlabeled data from the target domain.
Figure 4 shows the average concordance correlation coefficient for each emotional dimension (twenty trials × three sources × five structures = 300 matched conditions). On average, the figure shows that models trained using DANN consistently outperform models trained with the source data. The asterisk denotes that the difference is statistically significant across all emotional dimensions. The relative improvements over the source models are 22.8% for arousal, 33.4% for valence and 15.5% for dominance. In general, the values for CCC are lower for valence, validating findings from previous studies, which indicated that acoustic features are less discriminative for this emotional attribute.
Figure 5 shows the average concordance correlation coefficient per source domain (three emotional dimensions × twenty trials × five structures = 300 matched conditions). The figure shows the within-corpus performance (target) with a solid horizontal line. The results consistently show significant improvements when using DANN. The relative improvements in performance over training with the source domain are 7.7% for the USC-IEMOCAP corpus, 36.4% for the MSP-IMPROV corpus, and 25% when we combine all the corpora. Figure 5 also shows that, on average, combining all the sources into one domain improves the performance of the systems in recognizing emotions. Adding variability during the training of the models is important, as also demonstrated by Shami and Verhelst. DANN models also benefit from adding variability. By leveraging the added data representations, DANN models are able to find a common representation between the domains without sacrificing the critical features relevant for learning the primary task.
Figure 6 compares the performance for deep and shallow networks (see Section IV-C). The figure summarizes the results for the within-corpus evaluations (target), the cross-corpus evaluations (source), and our proposed DANN model (three emotional dimensions × twenty trials × three sources = 180 matched conditions). For the target models, we observe significant improvements when using deep structures over shallow structures. However, the differences are not statistically significant for the source and DANN models.
V-C Data Representation
The results in Tables I and II demonstrate the benefits of using the proposed DANN framework. This section aims to understand the key aspect of the DANN approach by visualizing the feature representation created during the training process.
We use the t-SNE toolkit to create 2D projections of the feature distributions at different layers of the networks. Figure 7 shows the distributions of the data from the source domain (blue circles) and the target domain (red crosses) after projecting them into the feature representation created by two models. This example corresponds to the models trained for arousal using the USC-IEMOCAP corpus as the source domain (as explained in Section V-A, the DANN model for arousal has only one shared layer as a feature representation). Figure 7(a) shows the data representation at the shared layer of the DANN model. Figure 7(b) shows the data representation at the equivalent layer of the DNN model trained with the source domain (i.e., the USC-IEMOCAP database). With adversarial domain training, the feature distributions for samples from both domains are almost indistinguishable, demonstrating that the proposed approach is able to find a common representation. Without adversarial domain training, in contrast, there are large regions in the feature space where it is easy to separate samples from the source and target domains. Figure 7(b) suggests the presence of a source-target mismatch, which affects the performance of the emotion classifier.
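The projection step described above can be sketched as follows. This is only an illustration under stated assumptions: it uses the `TSNE` class from scikit-learn rather than the t-SNE toolkit used in this study, the helper `project_domains` is hypothetical, and the activations are synthetic stand-ins for the outputs of a hidden layer.

```python
import numpy as np
from sklearn.manifold import TSNE


def project_domains(source_feats, target_feats, perplexity=30.0, seed=0):
    """Jointly embed source and target layer activations in 2D with t-SNE.

    Returns the 2D embedding and a domain indicator (0 = source,
    1 = target) so that both domains can be plotted in the same space.
    """
    stacked = np.vstack([source_feats, target_feats])
    domain = np.concatenate([
        np.zeros(len(source_feats), dtype=int),
        np.ones(len(target_feats), dtype=int),
    ])
    # Both domains are embedded together; separate embeddings would not
    # share a coordinate system and could not be compared.
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca",
                random_state=seed)
    embedding = tsne.fit_transform(stacked)
    return embedding, domain


rng = np.random.default_rng(0)
# Synthetic stand-ins for hidden-layer activations (not data from the paper).
src = rng.normal(0.0, 1.0, size=(60, 16))
tgt = rng.normal(0.5, 1.0, size=(60, 16))
emb, dom = project_domains(src, tgt, perplexity=15.0)
print(emb.shape)
```

The rows of `emb` with `dom == 0` and `dom == 1` can then be drawn with different markers to inspect how much the two domains overlap at a given layer.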
We also use the t-SNE toolkit to explore the feature representation when the DANN model is trained with four shared layers. The objective of this evaluation is to visualize the distribution of the data in each of the shared layers. This evaluation is implemented using the models for dominance, using the MSP-IMPROV corpus as the source domain. Figures 8(a)-8(d) show the changes in the data representation for each of the four shared layers in the DANN model. At the first layer (Fig. 8(a)), the feature representations for the source data (blue circles) and the target data (red crosses) are dissimilar enough for the domain classifier to distinguish them. While the difference between domains in the feature representation has decreased at the second layer (Fig. 8(b)), there are still some regions dominated by samples from one of the two domains. At the third layer (Fig. 8(c)), the feature representations of the domains are similar enough to confuse the domain classifier. The common representation is maintained in the fourth layer (Fig. 8(d)), where the data from the target domain is indistinguishable from the data from the source domain. This final representation is used by the emotion regressor to predict the emotional state of the target speech. For comparison, we also trained a baseline model with six layers, matching the combined number of shared and task classifier layers in the DANN model. Figures 8(e)-8(h) show the feature representation of the corresponding first four layers of this model. In the DANN model, the data representations of the samples from the source and target domains become more similar in deeper layers. This trend is not observed in the model trained with only the source data. The DANN model effectively reduces the mismatch in the feature representation across domains, which leads to significant improvements in the regression models.
This study proposed an appealing framework for emotion recognition that exploits available unlabeled data from the target domain. The proposed approach relies on a domain adversarial neural network (DANN), which creates a flexible and discriminative feature representation that reduces the gap in the feature space between the source and target domains. By using adversarial training, we learned domain-invariant representations that remain discriminative for the primary regression task. The model aims to find a balanced representation that aligns the domain distributions while retaining crucial information for the primary regression task. As demonstrated by the experimental evaluation, the proposed adversarial training leads to significant improvements in emotion recognition performance over models trained exclusively with data from the source domain. We visualized the data representation of both domains by projecting the features into the shared layers of the proposed DANN model. The results showed that the model converged to a common representation, where the source and target domains became indistinguishable. The experimental evaluation also showed that the amount of labeled data from the source domain plays a small role in determining how many shared layers are needed between the domain and regression tasks. Since the number of shared layers has a strong impact on the system’s performance, it is vital to identify the optimal number of shared layers, given a specific source domain.
One challenging aspect in using the proposed approach is the difficulty of training adversarial networks. For example, Shinohara noted that in ASR problems, the improvements of DANN were large for some types of noise, but smaller for others. They suggested that tuning the parameters could lead to better results. We also observed that the framework failed to converge for certain parameters, which is common in minimax problems. When properly trained, however, this powerful framework can elegantly solve one of the most important problems in speech emotion recognition: reducing the mismatch between train and test domains.
In the case of multiple sources, our approach seems to work well when the sources are combined and treated as a single domain. This approach forces the network to learn a representation that is common across all the source domains. We hypothesize that a better approach is to use asymmetric transformations, where the model learns multiple possible representations for the test data, creating one representation for each source. During testing, the network chooses the most useful representation for each data point. An alternative approach is to transform the available sources to match the target domains. Finally, this unsupervised approach can be easily extended to cases where limited labeled data from the target domain is available (semi-supervised approach), providing a flexible framework to build emotion recognition systems that generalize across domains.
This study was funded by the National Science Foundation (NSF) CAREER grant IIS-1453781.
-  C. Busso, M. Bulut, and S.S. Narayanan, “Toward effective automatic recognition systems of emotion in speech,” in Social emotions in nature and artifact: emotions in human and human-computer interaction, J. Gratch and S. Marsella, Eds., pp. 110–127. Oxford University Press, New York, NY, USA, November 2013.
-  Y. Kim and E. Mower Provost, “Say cheese vs. smile: Reducing speech-related variability for facial emotion recognition,” in ACM International Conference on Multimedia (MM 2014), Orlando, FL, USA, November 2014, pp. 27–36.
-  D. Ververidis and C. Kotropoulos, “Automatic speech classification to five emotional states based on gender information,” in European Signal Processing Conference (EUSIPCO 2004), Vienna, Austria, September 2004, pp. 341–34.
-  T. Vogt and E. André, “Comparing feature sets for acted and spontaneous speech in view of automatic emotion recognition,” in IEEE International Conference on Multimedia and Expo (ICME 2005), Amsterdam, The Netherlands, July 2005, pp. 474–477.
-  S. Parthasarathy and C. Busso, “Jointly predicting arousal, valence and dominance with multi-task learning,” in Interspeech 2017, Stockholm, Sweden, August 2017, pp. 1103–1107.
-  M. Abdelwahab and C. Busso, “Ensemble feature selection for domain adaptation in speech emotion recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017), New Orleans, LA, USA, March 2017, pp. 5000–5004.
-  M. Abdelwahab and C. Busso, “Incremental adaptation using active learning for acoustic emotion recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017), New Orleans, LA, USA, March 2017, pp. 5160–5164.
-  Y. Zong, W. Zheng, T. Zhang, and X. Huang, “Cross-corpus speech emotion recognition based on domain-adaptive least-squares regression,” IEEE Signal Processing Letters, vol. 23, no. 5, pp. 585–589, May 2016.
-  M. Abdelwahab and C. Busso, “Supervised domain adaptation for emotion recognition from speech,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2015), Brisbane, Australia, April 2015, pp. 5058–5062.
-  T. Rahman and C. Busso, “A personalized emotion recognition system using an unsupervised feature adaptation scheme,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2012), Kyoto, Japan, March 2012, pp. 5117–5120.
-  J. Deng, R. Xia, Z. Zhang, Y. Liu, and B. Schuller, in International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2014), Florence, Italy, May 2014, pp. 4818–4822.
-  Y. Zhang, Y. Liu, F. Weninger, and B. Schuller, “Multi-task deep neural network with shared hidden layers: Breaking down the wall between emotion representations,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017), New Orleans, LA, USA, March 2017, pp. 4490–4494.
-  X. Glorot, A. Bordes, and Y. Bengio, “Domain adaptation for large-scale sentiment classification: A deep learning approach,” in International conference on machine learning (ICML 2011), Bellevue, WA, USA, June-July 2011, pp. 513–520.
-  Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” Journal of Machine Learning Research, vol. 17, no. 59, pp. 1–35, April 2016.
-  M. Shami and W. Verhelst, “Automatic classification of expressiveness in speech: A multi-corpus study,” in Speaker Classification II, C. Müller, Ed., vol. 4441 of Lecture Notes in Computer Science, pp. 43–56. Springer-Verlag Berlin Heidelberg, Berlin, Germany, August 2007.
-  A. Austermann, N. Esau, L. Kleinjohann, and B. Kleinjohann, “Fuzzy emotion recognition in natural speech dialogue,” in IEEE International Workshop on Robot and Human Interactive Communication, Nashville, TN, USA, August 2005, pp. 317–322.
-  B. Schuller, B. Vlasenko, F. Eyben, M. Wöllmer, A. Stuhlsatz, A. Wendemuth, and G. Rigoll, “Cross-corpus acoustic emotion recognition: Variances and strategies,” IEEE Transactions on Affective Computing, vol. 1, no. 2, pp. 119–131, July-Dec 2010.
-  L. Vidrascu and L. Devillers, “Anger detection performances based on prosodic and acoustic cues in several corpora,” in Second International Workshop on Emotion: Corpora for Research on Emotion and Affect, International conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, May 2008, pp. 13–16.
-  J. Chang and S. Scherer, “Learning representations of emotional speech with deep convolutional generative adversarial networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017), New Orleans, LA, USA, March 2017, pp. 2746–2750.
-  Z. Zhang, F. Weninger, M. Wöllmer, and B. Schuller, “Unsupervised learning in cross-corpus acoustic emotion recognition,” in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU 2011), Waikoloa, HI, USA, December 2011, pp. 523–528.
-  A. Hassan, R. Damper, and M. Niranjan, “On acoustic emotion recognition: compensating for covariate shift,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 7, pp. 1458–1468, July 2013.
-  Q. Mao, W. Xue, Q. Rao, F. Zhang, and Y. Zhan, “Domain adaptation for speech emotion recognition by sharing priors between related source and target classes,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2016), Shanghai, China, March 2016, pp. 2608–2612.
-  H. Sagha, J. Deng, M. Gavryukova, J. Han, and B. Schuller, “Cross lingual speech emotion recognition using canonical correlation analysis on principal component subspace,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2016), Shanghai, China, March 2016, pp. 5800–5804.
-  J. Deng, X. Xu, Z. Zhang, S. Frühholz, and B. Schuller, “Universum autoencoder-based domain adaptation for speech emotion recognition,” IEEE Signal Processing Letters, vol. 24, no. 4, pp. 500–504, April 2017.
-  P. Song, W. Zheng, S. Ou, X. Zhang, Y. Jin, J. Liu, and Y. Yu, “Cross-corpus speech emotion recognition based on transfer non-negative matrix factorization,” Speech Communication, vol. 83, pp. 34–41, October 2016.
-  J. Deng, Z. Zhang, E. Marchi, and B. Schuller, “Sparse autoencoder-based feature transfer learning for speech emotion recognition,” in Affective Computing and Intelligent Interaction (ACII 2013), Geneva, Switzerland, September 2013, pp. 511–516.
-  J. Deng, Z. Zhang, F. Eyben, and B. Schuller, “Autoencoder-based unsupervised domain adaptation for speech emotion recognition,” IEEE Signal Processing Letters, vol. 21, no. 9, pp. 1068–1072, September 2014.
-  Y. Shinohara, “Adversarial multi-task learning of deep neural networks for robust speech recognition,” in Interspeech 2016, San Francisco, CA, USA, September 2016, pp. 2369–2372.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems (NIPS 2014), Montreal, Canada, December 2014, vol. 27, pp. 2672–2680.
-  C. Busso, M. Bulut, C.C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J.N. Chang, S. Lee, and S.S. Narayanan, “IEMOCAP: Interactive emotional dyadic motion capture database,” Journal of Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, December 2008.
-  C. Busso, S. Parthasarathy, A. Burmania, M. AbdelWahab, N. Sadoughi, and E. Mower Provost, “MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception,” IEEE Transactions on Affective Computing, vol. 8, no. 1, pp. 67–80, January-March 2017.
-  R. Lotfian and C. Busso, “Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings,” IEEE Transactions on Affective Computing, vol. To appear, 2018.
-  C. Busso and S.S. Narayanan, “Recording audio-visual emotional databases from actors: a closer look,” in Second International Workshop on Emotion: Corpora for Research on Emotion and Affect, International conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, May 2008, pp. 17–22.
-  A. Burmania, S. Parthasarathy, and C. Busso, “Increasing the reliability of crowdsourcing evaluations using online quality assessment,” IEEE Transactions on Affective Computing, vol. 7, no. 4, pp. 374–388, October-December 2016.
-  B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, M. Chetouani, F. Weninger, F. Eyben, E. Marchi, M. Mortillaro, H. Salamin, A. Polychroniou, F. Valente, and S. Kim, “The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism,” in Interspeech 2013, Lyon, France, August 2013, pp. 148–152.
-  F. Eyben, M. Wöllmer, and B. Schuller, “OpenSMILE: the Munich versatile and fast open-source audio feature extractor,” in ACM International conference on Multimedia (MM 2010), Florence, Italy, October 2010, pp. 1459–1462.
-  “Keras: Deep learning library for Theano and TensorFlow,” https://keras.io/, April 2017.
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D.G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: A system for large-scale machine learning,” in Symposium on Operating Systems Design and Implementation (OSDI 2016), Savannah, GA, USA, November 2016, pp. 265–283.
-  “Incorporating Nesterov momentum into Adam,” in Workshop track at International Conference on Learning Representations (ICLR 2015), San Juan, Puerto Rico, May 2015, pp. 1–4.
-  C. Busso and T. Rahman, “Unveiling the acoustic properties that describe the valence dimension,” in Interspeech 2012, Portland, OR, USA, September 2012, pp. 1179–1182.
-  L. Van Der Maaten, “Accelerating t-SNE using tree-based algorithms,” Journal of machine learning research, vol. 15, no. 1, pp. 3221–3245, October 2014.