Sharing experiences, expressing opinions, and recording activities by posting images on social networks such as Instagram and Twitter is increasingly popular. This offers a great opportunity for visual sentiment analysis, which infers users' emotional behaviors and supports personalized services [zhao2018predicting], such as blog recommendation [borth2013large] and tourism [alaei2019sentiment].
Recent advances in deep learning have significantly improved the state-of-the-art performance in visual sentiment classification [yang2018weakly, yang2018visual, katsurai2016image] and image emotion distribution learning [yang2017joint]. However, because sentiment labels are difficult to acquire due to the high subjectivity of human perception, it is necessary to train a model on a labeled source domain that generalizes well to a new domain. Because of domain shift or dataset bias [torralba2011unbiased], even a slight departure from a network’s training domain can lead to incorrect predictions and significantly reduce its performance.
Therefore, establishing knowledge transfer from a labeled source domain to an unlabeled target domain for visual sentiment analysis has attracted significant attention [zhao2018emotiongan, zhao2019cycleemotiongan]. Most existing deep unsupervised domain adaptation (UDA) methods assume that there is only a single source domain and that the labeled source data are implicitly sampled from the same underlying distribution. In practice, there are obvious biases across domains, even for images of the same sentiment. For example, as shown in Figure 1, the styles of artistic photos, abstract paintings, and natural images are quite different. Simply combining different sources into one source and directly employing a state-of-the-art single-source UDA model [zhu2017unpaired] leads to a drop in classification accuracy from 61.59% (trained on the single-best source) to 59.84% (trained on the simply combined source). This is because images from different source domains may interfere with each other during the learning process [riemer2018learning].
Compared with the source-only method [peng2015mixed], a UDA model can alleviate the domain shift between a single source domain and the target domain to some extent. However, domain shift still exists between multiple source domains and the target domain, which motivates multi-source domain adaptation (MDA) for visual sentiment classification. Despite the rapid progress in UDA, no study has investigated MDA for visual sentiment classification. This task is much more challenging for the following reasons. First, the multiple sources differ not only from the target but also from each other. Existing MDA methods only align each source and target pair. Although different sources are matched towards the target, there may exist significant misalignment across different sources. Second, these MDA methods focus on matching the visual features but ignore the semantic labels, and therefore can hardly characterize the consistency of image sentiment. Third, current methods typically require multiple classifiers or transfer structures for different source domains. This leads to high model complexity and low learning stability when learning from small-scale sentiment data.
In this paper, we study the multi-source unsupervised domain adaptation problem for visual sentiment classification. Specifically, we propose a novel adversarial framework, termed Multi-source Sentiment Generative Adversarial Network (MSGAN), which is composed of three pipelines, i.e. image reconstruction, image translation, and cycle-reconstruction. The image reconstruction and cycle-reconstruction pipelines learn a unified sentiment space where data from the source and target domains share similar distributions. Subsequently, the image translation pipeline restricted by emotional semantic consistency learns to adapt the source domain images to appear as if they were drawn from the target domain, while preserving the annotation information. Notably, thanks to the unified sentiment latent space, MSGAN requires a single classification network to handle data from different source domains.
In summary, the contributions of this paper are threefold:
1. We propose to adapt visual sentiment from multiple source domains to a target domain in an end-to-end manner. To the best of our knowledge, this is the first multi-source domain adaptation work on visual sentiment classification.
2. We develop a novel adversarial framework, termed MSGAN, for visual sentiment classification. By jointly learning the image reconstruction, image translation, and cycle-reconstruction pipelines, images from multiple sources and one target can be mapped to a unified sentiment latent space. Meanwhile, the semantic consistency loss in the image translation pipeline preserves the semantic information of images.
3. We conduct extensive experiments on the ArtPhoto [machajdik2010affective], FI [you2016building], Twitter I [you2015robust], and Twitter II [you2016building] datasets, and the results demonstrate the superiority of the proposed MSGAN model compared with the state-of-the-art MDA approaches.
Visual Sentiment Classification: Recently, with the great success of convolutional neural networks (CNNs) on many computer vision tasks, CNNs have also been employed in sentiment classification. You et al. [you2015robust] proposed a progressive CNN architecture to make use of noisily labeled data for binary sentiment classification. Yang et al. [yang2018retrieving] employed deep metric learning to optimize both retrieval and classification tasks by jointly optimizing a cross-entropy loss and a novel sentiment constraint. Rather than improving global image representations, several methods [you2017visual, yang2018weakly] consider local information for sentiment classification. All the above methods learn the mapping between image content and sentiments in a supervised manner. In this paper, we study how to adapt models from multiple labeled source domains to an unlabeled target domain for visual sentiment classification.
Single-source Domain Adaptation: Since data from the source and target domains have intrinsically different distributions, the key problem in single-source UDA is how to reduce the domain shift. Discrepancy-based methods explicitly measure the discrepancy between the source and target domains on corresponding activation layers of the two network streams [sun2017correlation, zhuo2017deep]. Adversarial generative models combine the domain discriminative model with a generative component, generally based on GANs [goodfellow2014generative]; for example, Coupled Generative Adversarial Networks (CoGAN) [liu2016coupled] can learn a joint distribution of multi-domain images with a tuple of GANs. Reconstruction-based methods incorporate a reconstruction loss to minimize the difference between the input and the reconstructed input [ghifary2015domain, ghifary2016deep]. Zhao et al. [zhao2019cycleemotiongan] proposed CycleEmotionGAN for image emotion classification, which adapts source domain images to have distributions similar to the target ones by enforcing emotional semantic consistency. However, none of these methods can handle data from multiple source domains, which is the goal of this paper.
Multi-source Domain Adaptation: Compared with single-source UDA, multi-source domain adaptation (MDA) assumes that training data from multiple sources are available [zhao2019multi]. Early efforts on this task used shallow models [sun2013bayesian]. MDA has also developed with theoretical support. Blitzer et al. [blitzer2008learning] provided the first learning bound for MDA. Mansour et al. [mansour2009domain] claimed that an ideal target hypothesis can be represented by a distribution-weighted combination of source hypotheses. Among more applied works, Deep Cocktail Network (DCTN) [xu2018deep] proposed a k-way domain discriminator and category classifier for digit classification and real-world object recognition. Zhao et al. [zhao2018adversarial] proposed new generalization bounds and algorithms under both classification and regression settings for MDA. Peng et al. [peng2018moment] directly matched all the distributions based on moments and provided a concrete proof of why matching the moments of multiple distributions works for MDA. Different from these methods, we learn a unified sentiment latent space that jointly aligns data from all source and target domains.
Suppose we have $M$ source domains $S_1, S_2, \dots, S_M$ and one target domain $T$. In the unsupervised multi-source domain adaptation (MDA) scenario, $S_1, S_2, \dots, S_M$ are labeled and $T$ is fully unlabeled. For the $i$-th source domain $S_i$, the observed images and corresponding sentiment labels drawn from the source distribution $p_i(x, y)$ are $X_i = \{x_i^j\}_{j=1}^{N_i}$ and $Y_i = \{y_i^j\}_{j=1}^{N_i}$, where $N_i$ is the number of images in $S_i$. The target images drawn from the target distribution $p_T(x, y)$ are $X_T = \{x_T^j\}_{j=1}^{N_T}$ without label observation, where $N_T$ is the number of target images.
The main idea of MSGAN is to learn a mapping that can align the images from both the multiple source and target domains to have similar distributions in a unified sentiment space. As shown in Figure 2, images from both the source and target domains are mapped to have similar distributions, and their information is preserved by the reconstruction loss.
To achieve this, we introduce the encoders $E_{S_1}, \dots, E_{S_M}, E_T$ (one per source domain and one for the target domain) and the generators $G_S$ and $G_T$. To obtain the sentiment latent space, we enforce a weight-sharing constraint on the encoders and generators. Specifically, we share the weights of the last block layers of $E_{S_1}, \dots, E_{S_M}$ and $E_T$ to extract the high-level representations of the input images from the multiple sources and the target. Similarly, we share the weights of the first block layers of $G_S$ and $G_T$ to decode their high-level representations for reconstructing the input images. The unified sentiment space is learned from the cycle-consistency constraint [liu2017unsupervised]: translating an image to the other domain and back again should recover the original image. With the latent space established, we do not need multiple transfer structures.
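The weight-sharing constraint can be illustrated with a minimal sketch in plain Python. It is not the MSGAN architecture: the `Dense`, `Encoder`, and `build_encoders` names, the layer sizes, and the domain labels `S1`, `S2`, `S3`, `T` are all hypothetical. The only point shown is that every per-domain encoder keeps its own low-level layer but reuses one and the same high-level layer object.

```python
import math
import random


class Dense:
    """Tiny fully connected layer with tanh activation, stored as plain lists."""

    def __init__(self, n_in, n_out, rng):
        self.weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                        for _ in range(n_out)]

    def __call__(self, x):
        return [math.tanh(sum(w * v for w, v in zip(row, x)))
                for row in self.weights]


class Encoder:
    """Domain-specific first layer followed by a layer shared across domains."""

    def __init__(self, private, shared):
        self.private = private  # low-level weights, one set per domain
        self.shared = shared    # high-level weights, same object for all domains

    def __call__(self, x):
        return self.shared(self.private(x))


def build_encoders(domains, n_in=4, n_hidden=3, n_latent=2, seed=0):
    """Create one encoder per domain; all of them reuse one shared last layer."""
    rng = random.Random(seed)
    shared = Dense(n_hidden, n_latent, rng)
    return {name: Encoder(Dense(n_in, n_hidden, rng), shared) for name in domains}


# Hypothetical setup: three source domains and one target domain.
encoders = build_encoders(["S1", "S2", "S3", "T"])
codes = {name: enc([0.1, 0.2, 0.3, 0.4]) for name, enc in encoders.items()}
```

Because the last layer is a single shared object, any update to it affects the codes of every domain at once, which is what pushes the high-level representations into a common space.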
Multi-source Sentiment Generative Adversarial Network
Our framework, as illustrated in Figure 3, is based on variational autoencoders (VAEs) and generative adversarial networks (GANs). It consists of three pipelines: image reconstruction, image translation, and cycle-reconstruction. The image reconstruction pipeline includes the domain image encoders $E_{S_1}, \dots, E_{S_M}, E_T$, which encode images into a unified sentiment space, and two image generators $G_S$ and $G_T$, which reconstruct the input images. The image translation pipeline includes two generative adversarial networks, $\{G_S, D_S\}$ and $\{G_T, D_T\}$, which learn the mapping between the multiple source domains and the target domain. The cycle-reconstruction pipeline is used to learn a unified sentiment space and to ensure that the features of images from different domains preserve the information of their original images.
Image Reconstruction Pipeline
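The image reconstruction pipeline is achieved by multiple encoder-generator pairs $\{E_{S_i}, G_S\}$, each of which maps an input image $x_{S_i}$ to a code in the latent space $Z$ via $E_{S_i}$ and then decodes a randomly perturbed version of the code to reconstruct the input image via $G_S$. We assume the components in the latent space $Z$ are conditionally independent and Gaussian with unit variance. The reconstructed image is $\tilde{x}_{S_i} = G_S(E_{S_i}(x_{S_i}) + \eta)$, where $\eta$ is noise with the same distribution as $\mathcal{N}(0, I)$. Similarly, $\{E_T, G_T\}$ constitutes a VAE for the target domain, where the reconstructed image is $\tilde{x}_T = G_T(E_T(x_T) + \eta)$.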
MVAE training aims to minimize a variational upper bound as follows:
where the hyper-parameters control the weights of the objective terms, and the KL divergence terms penalize the deviation of the distribution of the latent code from the prior distribution. The conditional distributions of the two generators are modeled using Laplacian distributions.
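As a rough illustration of how one term of such an objective can be evaluated, the sketch below computes a weighted VAE loss for a single image. It rests on stated assumptions rather than on the exact MSGAN formulation: a unit-variance Gaussian posterior centered at the encoder output, a standard normal prior (so the KL term reduces to half the squared norm of the mean), and a unit-scale Laplacian likelihood (so the negative log-likelihood reduces, up to a constant, to the absolute reconstruction error). The function name `vae_loss` and the argument names are hypothetical.

```python
def vae_loss(x, x_rec, mu, weight_kl, weight_nll):
    """Weighted VAE objective for one image given as flat lists of floats.

    Assumes posterior N(mu, I), prior N(0, I), and a unit-scale Laplacian
    likelihood, so that (dropping additive constants in the NLL):
        KL  = 0.5 * ||mu||^2
        NLL = sum_i |x_i - x_rec_i|
    """
    kl = 0.5 * sum(m * m for m in mu)
    nll = sum(abs(a - b) for a, b in zip(x, x_rec))
    return weight_kl * kl + weight_nll * nll


# Toy values: a two-pixel "image", its reconstruction, and a 2-D latent mean.
# The weight 100.0 on the NLL term is an illustrative placeholder.
example = vae_loss(x=[0.0, 1.0], x_rec=[0.5, 1.0], mu=[1.0, -1.0],
                   weight_kl=0.1, weight_nll=100.0)
```

With these toy inputs the KL term is 1.0 and the reconstruction term is 0.5, so the weighted sum is dominated by the reconstruction error, which is the usual regime when the likelihood weight is much larger than the KL weight.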
Image Translation Pipeline
As mentioned above, the unified sentiment space allows us to use only one generator, the target-domain generator $G_T$, to adapt multi-source images so that they are indistinguishable from images of the target domain. $G_T$ can generate two types of images: (1) images from the reconstruction pipeline and (2) images from the translation pipeline. A similar process is applied to the source-side generator $G_S$. Meanwhile, two discriminators, one for each generator, are used to distinguish translated images from real images of the corresponding domain.
The GAN objective functions are defined by:
To preserve the semantics of the images adapted from the source domains to the target domain by the target-domain generator, an emotional semantic consistency (ESC) loss is used, defined by:
where $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the KL divergence between two distributions.
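A minimal sketch of a KL-based consistency term is given below. It assumes that the two distributions being compared are class-probability vectors predicted for an image before and after translation; the direction of the divergence, the averaging over the batch, and the names `kl_divergence` and `esc_loss` are assumptions made for illustration, not details confirmed by the text.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def esc_loss(source_preds, adapted_preds):
    """Mean KL divergence between predictions before and after translation."""
    pairs = list(zip(source_preds, adapted_preds))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


# Identical predictions give zero loss; diverging predictions are penalized.
unchanged = esc_loss([[0.5, 0.5]], [[0.5, 0.5]])
shifted = esc_loss([[0.9, 0.1]], [[0.5, 0.5]])
```

The loss is zero only when translation leaves the predicted sentiment distribution unchanged, which is the behavior a semantic consistency constraint is meant to encourage.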
Cycle Reconstruction Pipeline
The cycle reconstruction pipeline is used to learn a unified sentiment latent space and ensure that features of images from different domains preserve the information of their original images. According to [liu2017unsupervised], a VAE-like objective function is used to model the cycle-consistency constraint, defined by:
The hyper-parameters control the weights of the two different objective terms.
Therefore, the augmented MVAE-GAN loss is:
Sentiment Classification with Adapted Images
After jointly learning the image reconstruction, image translation, and cycle-reconstruction pipelines, both the source images and the adapted images of different domains can be mapped to the same latent representation in a unified sentiment latent space. Meanwhile, the semantic consistency loss in the image translation pipeline ensures that the semantic information, i.e. the corresponding sentiment labels, is preserved before and after image translation.
Generally, multiple classification models are needed, one for each domain, for the final classification task. In contrast, thanks to the unified latent space, the proposed MSGAN is augmented with a single classifier optimized by minimizing the following cross-entropy loss:
where the softmax function converts the classifier outputs into class probabilities and an indicator function selects the ground-truth class. The hyper-parameters control the weights of the two different objective terms.
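The softmax cross-entropy referred to here can be sketched in a few lines of plain Python. This is the standard textbook form, shown only to make the roles of the softmax and the indicator concrete; the function names and the toy logits are hypothetical, and any weighting between source and adapted images is left out.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw classifier scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits_batch, labels):
    """Mean negative log-probability assigned to the ground-truth class.

    Indexing with the label plays the role of the indicator function:
    only the true class contributes to each sample's loss.
    """
    losses = [-math.log(softmax(logits)[y])
              for logits, y in zip(logits_batch, labels)]
    return sum(losses) / len(losses)


# Binary sentiment example: class 0 = negative, class 1 = positive.
uninformed = cross_entropy([[0.0, 0.0]], [1])
confident = cross_entropy([[-3.0, 3.0]], [1])
```

An uninformed two-class prediction gives a loss of ln 2, and the loss shrinks as the classifier puts more probability on the correct polarity.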
We jointly solve the learning problems of the encoders, generators, discriminators, and classifier for the image reconstruction, image translation, and cycle-reconstruction pipelines and the classification task.
Inheriting from GAN, training the proposed MSGAN framework results in solving a mini-max problem where the optimization aims to find a saddle point. It can be seen as a two-player zero-sum game. The first player is a team consisting of the encoders and generators. The second player is a team consisting of the adversarial discriminators. In addition to defeating the second player, the first player has to minimize the MVAE losses and the cycle-consistency losses. We apply an alternating gradient update scheme similar to the one described in [goodfellow2014generative].
The procedure is summarized in Algorithm 1, which updates the parameters of the encoders, generators, discriminators, and classifier.
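To make the idea of alternating updates toward a saddle point concrete, the toy sketch below runs alternating gradient descent-ascent on a simple two-variable function. It is not the MSGAN training loop: the objective `f(x, y) = 0.5*x**2 + x*y - 0.5*y**2`, the step size, and the function name are chosen purely for illustration, with `x` standing in for the minimizing team and `y` for the maximizing team.

```python
def alternating_updates(x=1.0, y=1.0, lr=0.1, steps=200):
    """Alternating gradient descent-ascent on a toy mini-max objective.

    f(x, y) = 0.5*x**2 + x*y - 0.5*y**2 is minimized over x and maximized
    over y; its unique saddle point is (0, 0).
    """
    for _ in range(steps):
        x -= lr * (x + y)  # df/dx = x + y: descent step for the minimizer
        y += lr * (x - y)  # df/dy = x - y: ascent step using the updated x
    return x, y


x_final, y_final = alternating_updates()
```

Each player takes one gradient step while the other is held fixed, and on this well-behaved toy problem the iterates spiral into the saddle point. Real GAN objectives are far less benign, which is why training stability remains a practical concern.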
| Standards | Method | FI | ArtPhoto | Twitter I | Twitter II |
|---|---|---|---|---|---|
| Single-best DA | CycleGAN [zhu2017unpaired] | 63.87 | 61.11 | 61.59 | 70.24 |
| Source-combined DA | CycleGAN [zhu2017unpaired] | 66.05 | 60.49 | 59.84 | 71.07 |
| Multi-source DA | DCTN [xu2018deep] | 65.31 | 62.34 | 62.59 | 66.94 |
| Oracle (train on sufficiently labeled target data) | | 75.24 | 64.81 | 68.5 | 72.72 |
This section presents the experimental analysis of MSGAN. First, the detailed experimental setups are introduced, including the datasets, baselines, and evaluation metrics. Second, the performance of MSGAN and the state-of-the-art algorithms in MDA is reported. Finally, an in-depth analysis on MSGAN, including a parameter sensitivity analysis and an ablation study, is presented.
We evaluate our framework on four public datasets: Flickr and Instagram (FI) [you2016building], Artistic (ArtPhoto) [machajdik2010affective], Twitter I [you2015robust], and Twitter II [you2016building]. FI is collected by querying social websites with eight sentiment categories as keywords and contains 22,700 images. The ArtPhoto dataset consists of 806 artistic photographs from a photo-sharing site, retrieved by searching for emotion categories. We treat excitement, amusement, awe, and contentment as positive and disgust, anger, fear, and sadness as negative [mikels2005emotional]. The Twitter I and Twitter II datasets are collected from social websites and labeled with sentiment polarity (i.e. positive, negative); they consist of 1,269 and 603 images, respectively.
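The grouping of the eight emotion categories into two polarities can be written down directly. The sketch below encodes exactly the grouping stated above; the constant and function names are hypothetical, and how the labels are actually stored in each dataset is not specified here.

```python
# Emotion categories grouped into sentiment polarity, as described in the text.
POSITIVE = {"excitement", "amusement", "awe", "contentment"}
NEGATIVE = {"disgust", "anger", "fear", "sadness"}


def to_polarity(emotion):
    """Map one of the eight emotion categories to 'positive' or 'negative'."""
    key = emotion.strip().lower()
    if key in POSITIVE:
        return "positive"
    if key in NEGATIVE:
        return "negative"
    raise ValueError(f"unknown emotion category: {emotion!r}")


labels = [to_polarity(e) for e in ["Awe", "fear", "contentment", "anger"]]
```

Collapsing to two classes gives the eight-category datasets the same label space as the Twitter datasets, so that a single binary classifier can be shared across domains.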
To the best of our knowledge, MSGAN is the first work on multi-source domain adaptation for classifying visual sentiment. We compare MSGAN with three types of baseline algorithms, termed Source-only, Single-source DA, and Multi-source DA. The Source-only methods are trained on source images and tested directly on the target images. The Single-source DA methods include CycleGAN [zhu2017unpaired] and CycleEmotionGAN [zhao2019cycleemotiongan]. For CycleGAN, we extend the original transfer network: we first adapt the source images to the target domain cycle-consistently, and then train the classifier on the adapted source images with the emotion labels of the corresponding source images. Since those methods operate in the single-source setting, we employ two MDA standards: (1) single-best, i.e. performing adaptation on each single source and reporting the best result among the three sources; (2) source-combined, i.e. combining all source domains into a traditional single source. For resnet-simple-extend, our encoder and classifier can be seen as a simple extension of ResNet18 [he2016deep], so we train a classifier with the same network on the combined sources. Additionally, we introduce two methods, DCTN [xu2018deep] and MDAN [zhao2018adversarial], as the Multi-source DA baselines. For comparison, we also report the results of an oracle setting, where the classifier is both trained and tested on the target domain.
Comparison with State-of-the-art
The performance comparisons between the proposed MSGAN model and the state-of-the-art approaches measured by classification accuracy are shown in Table 1. From the results, we have the following observations:
(1) The source-only methods cannot handle the domain shift or dataset bias, since a rough combination is used to transfer all source data. Different domains have different joint probability distributions of observed images and emotion labels, so simply combining all source data for training harms model performance.
(2) Both adaptation methods, CycleGAN [zhu2017unpaired] and CycleEmotionGAN [zhao2019cycleemotiongan], are superior to the source-only methods, and CycleEmotionGAN performs better of the two. This result demonstrates the effectiveness of CycleEmotionGAN for unsupervised domain adaptation in classifying image emotions. In the source-combined setting, however, adaptation methods can transfer negative samples across the multiple source domains, which indicates that a multi-source domain adapter should not be modeled in the same way as a single-source domain adapter.
(3) The proposed MSGAN achieves superior performance over the state-of-the-art approaches. The improvements come from four aspects: the image reconstruction, image translation, and cycle-reconstruction pipelines, and the unified latent space. First, compared with source-combined DA methods, MSGAN further improves the classification performance, which demonstrates that the proposed sentiment latent space can bridge the gap between multiple sources more effectively. In particular, the images in the ArtPhoto dataset, such as abstract oil paintings, are very different from those in the other source datasets, yet MSGAN can better distinguish them and thus improves classification performance over other adaptation methods. Second, compared with single-source DA, MSGAN utilizes more useful information from multiple sources. Third, while other multi-source DA methods only consider the alignment between each source domain and the target domain, MSGAN attempts to align all source and target domains jointly. In addition, existing DA methods, such as CycleGAN [zhu2017unpaired] and MDAN [zhao2018adversarial], focus on matching the visual features but ignore the semantic labels. Therefore, they may not characterize the consistency of visual sentiment well and, as a result, may not preserve the mappings between visual content and the corresponding sentiment. In contrast, all the DA methods with a sentiment consistency loss, such as the single-source method [zhao2019cycleemotiongan] and our multi-source method, significantly outperform the source-only approach, which demonstrates the effectiveness of preserving the sentiments of the adapted images for visual sentiment classification.
(4) The oracle method, i.e. testing on the target domain using the model trained on the same domain, achieves the best performance. However, this model is trained using the ground truth sentiment labels from the target domain, which are actually unavailable in unsupervised domain adaptation.
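The effect of naively combining sources can be checked with simple arithmetic on the CycleGAN accuracies reported for the single-best and source-combined standards. The sketch below only subtracts those reported numbers; the variable names are hypothetical and no additional results are implied.

```python
# Reported CycleGAN accuracies (%) under the two single-source standards.
DATASETS = ["FI", "ArtPhoto", "Twitter I", "Twitter II"]
single_best = [63.87, 61.11, 61.59, 70.24]
source_combined = [66.05, 60.49, 59.84, 71.07]

# Change in accuracy when all sources are merged into one training set.
gain = {name: round(combined - best, 2)
        for name, best, combined in zip(DATASETS, single_best, source_combined)}

# Targets on which merging the sources lowers accuracy.
hurt = [name for name, delta in gain.items() if delta < 0]
```

Merging the sources helps on FI and Twitter II but lowers accuracy on ArtPhoto and Twitter I, by 0.62 and 1.75 points respectively, which is consistent with the observation that sources can interfere with each other when combined without care.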
Visualization. We visualize the results of image-space adaptation from Artphoto, FI, Twitter II to Twitter I in Figure 4. We can see that with our final proposed MSGAN method (d), the styles of the images are similar to FI while the emotion semantic information is well preserved.
We incrementally investigate the effectiveness of different components in MSGAN. The results are shown in Table 2. We can observe that: (1) MVAE+GAN can obtain better performance by making different adapted domains more closely aggregated; (2) adding the cycle-consistency loss could further improve the accuracy, again demonstrating the effectiveness of the unified sentiment latent space; (3) ESC loss also contributes to the visual sentiment adaptation task; (4) the modules are orthogonal to each other, since adding each one of them does not introduce performance degradation.
Parameter Sensitivity. We analyze the impact of the hyper-parameter values in Eqs. (1), (2), (6), and (7) on the sentiment classification accuracy. For different weight values on the negative log-likelihood terms, we computed the classification accuracy over different weight values on the KL terms for both Artphoto, Twitter I, Twitter II → FI and FI, Twitter I, Twitter II → Artphoto. The results are reported in Figure 5. We can observe that, in general, a larger weight value on the negative log-likelihood terms yields a better result. We also find that setting the weights of the KL terms to 0.1 results in consistently good performance. We therefore set the KL-term weights to 0.1 and use a large weight on the negative log-likelihood terms.
In this paper, we tackle the problem of multi-source domain adaptation (MDA) in visual sentiment classification. A novel framework, termed Multi-source Sentiment Generative Adversarial Network (MSGAN), is proposed to learn a unified sentiment latent space such that data from both the source and target domains share a similar distribution. Such mappings are learned via three pipelines: image reconstruction, image translation, and cycle-reconstruction. Extensive experiments conducted on four benchmark datasets demonstrate that MSGAN significantly outperforms the state-of-the-art MDA approaches for visual sentiment classification. In future work, we plan to extend the MSGAN model to other image emotion recognition tasks, such as emotion distribution learning [zhao2017continuous]. We will also investigate methods that can alleviate the intrinsic training instability of GAN variants.
NExT research is supported by the National Research Foundation, Prime Minister’s Office, Singapore under its IRC@SG Funding Initiative.