Weighed Domain-Invariant Representation Learning for Cross-domain Sentiment Analysis

09/18/2019 ∙ by Minlong Peng, et al. ∙ FUDAN University

Cross-domain sentiment analysis is currently a hot topic in the research and engineering areas. One of the most popular frameworks in this field is the domain-invariant representation learning (DIRL) paradigm, which aims to learn a distribution-invariant feature representation across domains. However, in this work, we find out that applying DIRL may harm domain adaptation when the label distribution P(Y) changes across domains. To address this problem, we propose a modification to DIRL, obtaining a novel weighted domain-invariant representation learning (WDIRL) framework. We show that it is easy to transfer existing SOTA DIRL models to WDIRL. Empirical studies on extensive cross-domain sentiment analysis tasks verified our statements and showed the effectiveness of our proposed solution.


1 Introduction

Sentiment analysis aims to predict the sentiment polarity of user-generated data with emotional orientation, like movie reviews. The exponential increase of online reviews makes it an interesting topic in research and industrial areas. However, reviews can span many different domains, and the collection and preprocessing of large amounts of data for new domains is often time-consuming and expensive. Therefore, cross-domain sentiment analysis is currently a hot topic; it aims to transfer knowledge from a label-rich source domain (S) to a label-scarce target domain (T).

In recent years, one of the most popular frameworks for cross-domain sentiment analysis has been the domain-invariant representation learning (DIRL) framework Glorot et al. (2011); Fernando et al. (2013); Ganin et al. (2016); Zellinger et al. (2017); Li et al. (2017). Methods of this framework follow the idea of extracting a domain-invariant feature representation, in which the data distributions of the source and target domains are similar. Based on the resultant representations, they learn the supervised classifier using rich labeled data of the source domain. The main difference among these methods is the technique applied to force the feature representations to be domain-invariant.

However, in this work, we discover that applying DIRL may harm domain adaptation in the situation that the label distribution shifts across domains. Specifically, let X and Y denote the input and label random variable, respectively, and G(X) denote the feature representation of X. We found out that when P(Y) changes across domains while P(X|Y) stays the same, forcing G(X) to be domain-invariant will make G(X) uninformative to Y. This will, in turn, harm the generalization of the supervised classifier to the target domain. In addition, for the more general condition that both P(Y) and P(X|Y) shift across domains, we deduced a conflict between the objective of making the classification error small and that of making G(X) domain-invariant.

We argue that the problem is worth studying since the shift of P(Y) exists in many real-world cross-domain sentiment analysis tasks Glorot et al. (2011). For example, the marginal distribution of the sentiment of a product can be affected by the overall social environment and change in different time periods; and for different products, their marginal distributions of the sentiment are naturally considered different. Moreover, there are many factors, such as the original data distribution, data collection time, and data cleaning method, that can affect P(Y) of the collected target domain unlabeled dataset. Note that in real-world cross-domain tasks, we do not know the labels of the collected target domain data. Thus, we cannot align its label distribution P_T(Y) beforehand with that of the source domain labeled data P_S(Y), as done in many previous works Glorot et al. (2011); Ganin et al. (2016); Tzeng et al. (2017); Li et al. (2017); He et al. (2018); Peng et al. (2018).

To address the problem of DIRL caused by the shift of P(Y), we propose a modification to DIRL, obtaining a weighted domain-invariant representation learning (WDIRL) framework. This framework additionally introduces a class weight w to weigh source domain examples by class, hoping to make P(Y) of the weighted source domain close to that of the target domain. Based on w, it resolves domain shift in two steps. In the first step, it forces the marginal distribution P(G(X)) to be domain-invariant between the target domain and the weighted source domain instead of the original one, obtaining a supervised classifier P_S(Y|X) and a class weight w. In the second step, it resolves the shift of P(Y) by adjusting P_S(Y|X) using w for label prediction in the target domain. We detail these two steps in §4. Moreover, we illustrate how to transfer existing DIRL models to their WDIRL counterparts, taking the representative metric-based CMD model Zellinger et al. (2017) and the adversarial-learning-based DANN model Ganin et al. (2016) as examples.

In summary, the contributions of this paper include: (i) we theoretically and empirically analyze the problem of DIRL for domain adaptation when the label distribution P(Y) shifts across domains; (ii) we propose a novel method to address the problem and show how to incorporate it into existing DIRL models; (iii) experimental studies on extensive cross-domain sentiment analysis tasks show that models of our WDIRL framework can greatly outperform their DIRL counterparts.

2 Preliminary and Related Work

2.1 Domain Adaptation

For expression consistency, in this work, we consider domain adaptation in the unsupervised setting (however, we argue that our analysis and solution also apply to the supervised and semi-supervised domain adaptation settings). In the unsupervised domain adaptation setting, there are two different distributions over X × Y: the source domain P_S(X, Y) and the target domain P_T(X, Y). And there is a labeled dataset D_S drawn i.i.d. from P_S(X, Y) and an unlabeled dataset D_T drawn i.i.d. from the marginal distribution P_T(X):

D_S = {(x_i, y_i)}_{i=1}^{n_S},   D_T = {x_j}_{j=1}^{n_T}.

The goal of domain adaptation is to build a classifier that has good performance in the target domain using D_S and D_T.

For this purpose, many approaches have been proposed from different views, such as instance reweighting Mansour et al. (2009), pivot-based information passing Blitzer et al. (2007), spectral feature alignment Pan et al. (2010), subsampling Chen et al. (2011), and of course domain-invariant representation learning Pan et al. (2011); Gopalan et al. (2011); Long et al. (2013); Muandet et al. (2013); Yosinski et al. (2014); Long et al. (2015); Aljundi et al. (2015); Wei et al. (2016); Bousmalis et al. (2016); Pinheiro and Element (2018); Zhao et al. (2018).

2.2 Domain Invariant Representation Learning

Domain invariant representation learning (DIRL) is a very popular framework for performing domain adaptation in the cross-domain sentiment analysis field Ghifary et al. (2014); Li et al. (2017); Chen et al. (2018); Peng et al. (2018). It is heavily motivated by the following theorem Ben-David et al. (2007).

Theorem 1.

For a hypothesis h,

ε_T(h) ≤ ε_S(h) + d_1(P_S(X), P_T(X)) + min{ E_{P_S(X)}[|f_S(x) − f_T(x)|], E_{P_T(X)}[|f_S(x) − f_T(x)|] }.   (1)

Here, ε_S(h) denotes the expected loss with hypothesis h in the source domain, ε_T(h) denotes the counterpart in the target domain, d_1(·, ·) is a measure of divergence between two distributions, and f_S and f_T denote the labeling functions of the source and target domain, respectively.

Based on Theorem 1 and assuming that performing a feature transform on X will not increase the values of the first and third terms of the right side of Ineq. (1), methods of the DIRL framework apply a feature map G onto X, hoping to obtain a feature representation G(X) that has a lower value of d_1(P_S(G(X)), P_T(G(X))). To this end, different methods have been proposed. These methods can be roughly divided into two directions. The first direction is to design a differentiable metric to explicitly evaluate the discrepancy between two distributions. We call methods of this direction the metric-based DIRL methods. A representative work of this direction is the central-moment-based model proposed by Zellinger et al. (2017). In that work, they proposed the central moment discrepancy (CMD) metric to evaluate the discrepancy between two distributions. Specifically, let X_S and X_T denote an N-dimensional random vector on the compact interval [a, b]^N over distribution P_S and P_T, respectively. The CMD loss between P_S and P_T is defined by:

CMD_K(X_S, X_T) = (1 / |b − a|) ||E(X_S) − E(X_T)||_2 + Σ_{k=2}^{K} (1 / |b − a|^k) ||c_k(X_S) − c_k(X_T)||_2.   (2)

Here, E(X) denotes the expectation of X over its distribution, and

c_k(X) = E((X − E(X))^k)

is the k-th central moment, where the power is taken element-wise over the dimensional variables X_i of X.
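As a concrete illustration (our own sketch, not the authors' implementation), the empirical CMD of Eq. (2) between two sample matrices can be computed as follows; the interval bounds a, b and the order K are assumed inputs:

```python
import numpy as np

def cmd(xs, xt, a=0.0, b=1.0, K=5):
    """Empirical central moment discrepancy between two sample matrices
    of shape (n_samples, n_features) with features on [a, b]."""
    # First-order term: difference of means, scaled by the interval width.
    mean_s, mean_t = xs.mean(axis=0), xt.mean(axis=0)
    loss = np.linalg.norm(mean_s - mean_t) / abs(b - a)
    # Higher-order terms: differences of element-wise central moments.
    cs, ct = xs - mean_s, xt - mean_t
    for k in range(2, K + 1):
        mk_s = (cs ** k).mean(axis=0)
        mk_t = (ct ** k).mean(axis=0)
        loss += np.linalg.norm(mk_s - mk_t) / abs(b - a) ** k
    return loss
```

By construction, the loss is exactly zero when the two samples coincide and grows as their moments diverge, which is what makes it usable as a differentiable regularizer.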

The second direction is to perform adversarial training between the feature generator G and a domain discriminator D. We call methods of this direction the adversarial-learning-based methods. As a representative, Ganin et al. (2016) trained D to distinguish the domain of a given example x based on its representation G(x). At the same time, they encouraged G to deceive D, i.e., to make D unable to distinguish the domain of x. More specifically, G was trained to minimize the loss:

L_d = E_{x ∼ P_S(X)}[log D(G(x))] + E_{x ∼ P_T(X)}[log(1 − D(G(x)))]   (3)

over its trainable parameters, while in contrast D was trained to maximize L_d. According to the work of Goodfellow et al. (2014), this is equivalent to minimizing the Jensen-Shannon divergence Amari et al. (1987); Lin (1991) between P_S(G(X)) and P_T(G(X)) over G. Here, for a concise expression, we write P(G(X)) as the shorthand for the distribution of the representation G(X) induced by P(X).
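To make the adversarial objective of Eq. (3) concrete, the following small sketch of ours evaluates it for a fixed logistic discriminator; the parameters d_w and d_b are hypothetical stand-ins for a trained network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def domain_adversarial_loss(feat_s, feat_t, d_w, d_b):
    """Value of the Eq.-(3)-style objective for a logistic discriminator
    D(g) = sigmoid(g . d_w + d_b). D is trained to maximize this value
    (classify domains well), while the feature extractor G is trained
    to minimize it (make the two feature distributions indistinguishable)."""
    d_on_src = sigmoid(feat_s @ d_w + d_b)  # D's output on source features
    d_on_tgt = sigmoid(feat_t @ d_w + d_b)  # D's output on target features
    return np.mean(np.log(d_on_src)) + np.mean(np.log(1.0 - d_on_tgt))
```

A discriminator that separates the two feature samples attains a higher objective value than a blind one (all-zero weights, which outputs 0.5 everywhere), which is exactly the signal the feature extractor works against.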

The task loss L is the combination of the supervised learning loss L_sup and the domain-invariant learning loss L_inv, which are defined on D_S only and on the combination of D_S and D_T, respectively:

L = L_sup + α L_inv.   (4)

Here, α is a hyper-parameter for loss balance, and the aforementioned domain adversarial loss L_d and CMD_K are two concrete forms of L_inv.

3 Problem of Domain-Invariant Representation Learning

In this work, we found out that applying DIRL may harm domain adaptation in the situation that P(Y) shifts across domains. Specifically, when P_S(Y) differs from P_T(Y), forcing the feature representation G(X) to be domain-invariant may increase the value of the third term of the right side of Ineq. (1) and consequently increase the value of ε_T(h), which means a decrease of target domain performance. In the following, we start our analysis under the condition that P_S(X|Y) = P_T(X|Y). Then, we consider the more general condition that P(X|Y) also differs across domains.

When P_S(X|Y) = P_T(X|Y), we have the following theorem.

Theorem 2.

Given P_S(X|Y) = P_T(X|Y), if P_S(Y = y) ≠ P_T(Y = y) and a feature map G makes P_S(G(X)) = P_T(G(X)), then examples of class y cannot be completely distinguished from examples of the other classes in the space of G(X).

Proof.

Proofs appear in Appendix A. ∎

Remark.

According to Theorem 2, we know that when P_S(X|Y) = P_T(X|Y) and P_S(Y = y) ≠ P_T(Y = y), forcing G(X) to be domain-invariant inclines to make data of class y mix with data of other classes in the space of G(X). This will make it difficult for the supervised classifier to distinguish inputs of class y from inputs of the other classes. Think about such an extreme case that every instance x is mapped to a single consistent point in the space of G(X). In this case, P_S(G(X)) = P_T(G(X)). Therefore, G(X) is domain-invariant. As a result, the supervised classifier will assign the same label, arg max_y P_S(Y = y), to all input examples. This is definitely unacceptable. To give a more intuitive illustration of the above analysis, we offer several empirical studies on Theorem 2 in Appendix B.

When P_S(X|Y) ≠ P_T(X|Y) and P_S(Y) ≠ P_T(Y), we did not obtain such a strong conclusion as Theorem 2. Instead, we deduced a conflict between the objective of achieving superior classification performance and that of making features domain-invariant.

Suppose that P_S(Y = y) ≠ P_T(Y = y) and instances of class y are completely distinguishable from instances of the rest classes in the space of G(X), i.e.:

P(Y = y | G(X) = g) ∈ {0, 1}, ∀g.

In DIRL, we hope that:

P_S(G(X) = g) = P_T(G(X) = g), ∀g.

Consider the region Ω = {g : P(Y = y | G(X) = g) = 1}, which contains only examples of class y. According to the above assumption, we know that P(G(X) ∈ Ω) = P(Y = y). Therefore, applying DIRL will force

P_S(G(X) = g) = P_T(G(X) = g)

in region Ω. Taking the integral of G(X) over Ω for both sides of the equation, we have P_S(Y = y) = P_T(Y = y). This deduction contradicts the setting that P_S(Y = y) ≠ P_T(Y = y). Therefore, G(X) cannot be fully class-separable when it is domain-invariant. Note that the objective of the supervised learning is exactly to make G(X) class-separable. Thus, this actually indicates a conflict between the supervised learning and the domain-invariant representation learning.
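The contradiction can be checked numerically. In this toy sketch of ours, the representation takes four discrete values, the first two of which belong exclusively to class y; if the feature distributions are identical across domains, summing over the class-y region forces the two class priors to coincide:

```python
import numpy as np

# Discrete toy representation space with four cells; the first two cells
# belong exclusively to class y (a fully class-separable representation).
region_y = np.array([True, True, False, False])

# A domain-invariant representation: identical feature distributions.
p_s_g = np.array([0.3, 0.2, 0.4, 0.1])  # P_S(G(X) = g)
p_t_g = p_s_g.copy()                    # P_T(G(X) = g), forced equal by DIRL

# Summing each distribution over the class-y region recovers that
# domain's class prior P(Y = y); invariance forces them to be equal.
prior_s = p_s_g[region_y].sum()
prior_t = p_t_g[region_y].sum()
assert prior_s == prior_t  # so P_S(Y=y) = P_T(Y=y): a contradiction
                           # whenever the true priors actually differ
```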

Based on the above analysis, we can conclude that when P(Y) shifts across domains, it is impossible to obtain a feature representation that is class-separable and, at the same time, domain-invariant using the DIRL framework. However, the shift of P(Y) exists in many cross-domain sentiment analysis tasks. Therefore, it is worth studying how to deal with this problem of DIRL.

4 Weighted Domain Invariant Representation Learning

According to the above analysis, we propose a weighted version of DIRL to address the problem that the shift of P(Y) causes to DIRL. The key idea of this framework is to first align P(Y) across domains before performing domain-invariant learning, and then take the shift of P(Y) into account in the label prediction procedure. Specifically, it introduces a class weight w to weigh source domain examples by class. Based on the weighted source domain, the domain shift problem is resolved in two steps. In the first step, it applies DIRL on the target domain and the weighted source domain, aiming to alleviate the influence of the shift of P(Y) during the alignment of P(X). In the second step, it uses w to reweigh the supervised classifier obtained in the first step for target domain label prediction. We detail these two steps in §4.1 and §4.2, respectively.

4.1 Align P(X) with Class Weight

The motivation behind this practice is to adjust the data distribution of the source domain or the target domain to alleviate the shift of P(Y) across domains before applying DIRL. Considering that we only have labels of source domain data, we choose to adjust the data distribution of the source domain. To achieve this purpose, we introduce a trainable class weight w to reweigh source domain examples by class when performing DIRL, with w_y ≥ 0. Specifically, we hope that:

w_y P_S(Y = y) = P_T(Y = y), ∀y,

and we denote by w* the value of w that makes this equation hold. We shall see that when w = w*, DIRL aligns the weighted P_S(X) with P_T(X) without the influence of the shift of P(Y). According to our analysis, we know that due to the shift of P(Y), there is a conflict between the training objectives of the supervised learning L_sup and the domain-invariant learning L_inv. And the degree of the conflict will decrease as w gets close to w*. Therefore, during model training, w is expected to be optimized toward w*, since this will make P(Y) of the weighted source domain close to P_T(Y), so as to resolve the conflict.
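For example (with hypothetical priors), the ideal class weight is simply the per-class ratio of the target label prior to the source label prior:

```python
# Source and target label priors (hypothetical values).
p_s_y = {1: 0.50, 2: 0.50}
p_t_y = {1: 0.75, 2: 0.25}

# The ideal weight satisfies w_y * P_S(Y=y) = P_T(Y=y) for every class.
w_star = {y: p_t_y[y] / p_s_y[y] for y in p_s_y}
# Class-1 source examples are up-weighted and class-2 examples
# down-weighted before domain-invariant learning is applied.
```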

We now show how to transfer existing DIRL models to their WDIRL counterparts with the above idea. Let μ(P) denote a statistic function defined over a distribution P. For example, the expectation function E in CMD_K is a concrete instantiation of μ. In general, to transfer models from DIRL to WDIRL, we should replace every statistic μ(P_S(G(X))) defined in L_inv with its class-weighted counterpart:

μ^w(P_S(G(X))) = Σ_y w_y P_S(Y = y) μ(P_S(G(X) | Y = y)).
Take the CMD metric as an example. In WDIRL, the revised form of CMD_K is defined by:

CMD_K^w(X_S, X_T) = (1 / |b − a|) || Σ_y w_y P_S(Y = y) E(X_S | Y = y) − E(X_T) ||_2 + Σ_{k=2}^{K} (1 / |b − a|^k) || Σ_y w_y P_S(Y = y) c_k(X_S | Y = y) − c_k(X_T) ||_2.   (5)

Here, E(X_S | Y = y) denotes the expectation of X_S over the class-conditional distribution P_S(X | Y = y). Note that both P_S(Y = y) and E(X_S | Y = y) can be estimated using source labeled data, and E(X_T) can be estimated using target unlabeled data.
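A self-contained sketch (ours) of the first-order term of the class-weighted discrepancy: class-conditional source means are mixed with weights w_y P_S(Y = y) before being compared with the plain target-sample mean:

```python
import numpy as np

def weighted_mean_term(xs_by_class, p_s_y, w, xt, a=0.0, b=1.0):
    """First-order term of the class-weighted CMD: class-conditional
    source means, combined with weights w[y] * p_s_y[y], are compared
    with the target mean estimated from unlabeled target samples."""
    weighted_src_mean = sum(
        w[y] * p_s_y[y] * xs_by_class[y].mean(axis=0) for y in xs_by_class
    )
    return np.linalg.norm(weighted_src_mean - xt.mean(axis=0)) / abs(b - a)
```

With the ideal weights, the weighted source mean matches the target mean even though the raw (unweighted) source mean does not, which is precisely the effect the first step aims for.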

As for those adversarial-learning-based DIRL methods, e.g., DANN Ganin et al. (2016), the revised domain-invariant loss can be precisely defined by:

L_d^w = − E_{(x, y) ∼ P_S(X, Y)}[ w_y log D(G(x)) ] − E_{x ∼ P_T(X)}[ log(1 − D(G(x))) ].   (6)

During model training, D is optimized in the direction to minimize L_d^w (its classification loss), while G and w are optimized to maximize it. In the following, we denote by L_adv^w the equivalent loss defined over G and w for the revised version of domain adversarial learning.

The general task loss in WDIRL is defined by:

L = L_sup + α L_inv^w,   (7)

where L_inv^w is a unified representation of the domain-invariant loss in WDIRL, such as CMD_K^w and L_adv^w.

4.2 Align P(Y) with Class Weight

In the above step, we align P(X) across domains by performing domain-invariant learning on the class-weighted source domain and the original target domain. In this step, we deal with the shift of P(Y). Suppose that we have successfully resolved the shift of P(X) with w in the first step. Then, according to the work of Chan and Ng (2005), we have:

P_T(Y = y | x) = w*_y P_S(Y = y | x) / Σ_{y′} w*_{y′} P_S(Y = y′ | x),   (8)

where w*_y = P_T(Y = y) / P_S(Y = y). Of course, in most real-world tasks, we do not know the value of P_T(Y = y) / P_S(Y = y). However, note that this ratio is exactly the expected class weight w*_y. Therefore, a natural practice of this step is to estimate w* with the w obtained in the first step and estimate P_T(Y = y | x) with:

P̂_T(Y = y | x) = w_y P_S(Y = y | x) / Σ_{y′} w_{y′} P_S(Y = y′ | x).   (9)

In summary, to transfer methods of the DIRL paradigm to WDIRL, we should: first revise the definition of L_inv, obtaining its corresponding WDIRL form L_inv^w; then perform supervised learning and domain-invariant representation learning on D_S and D_T according to Eq. (7), obtaining a supervised classifier P_S(Y | X) and a class weight vector w; and finally, adjust P_S(Y | X) using w according to Eq. (9) to obtain the target domain classifier P̂_T(Y | X).
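The second step reduces to a one-line re-normalization of the source classifier's posteriors in the style of Eq. (9); a minimal sketch of ours:

```python
import numpy as np

def adjust_posteriors(p_s_y_given_x, w):
    """Eq.-(9)-style adjustment: p_T(y|x) is proportional to
    w_y * p_S(y|x), renormalized per example.

    p_s_y_given_x: array (n_examples, n_classes) of source-classifier
    probabilities; w: array (n_classes,) of learned class weights."""
    unnormalized = p_s_y_given_x * w  # scale each class column by w_y
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)
```

For instance, a perfectly uncertain source posterior (0.5, 0.5) combined with weights (1.5, 0.5) yields the target posterior (0.75, 0.25), reflecting the shifted target prior.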

5 Experiment

5.1 Experiment Design

Through the experiments, we empirically studied our analysis of DIRL and the effectiveness of our proposed solution in dealing with the problem it suffers from. In addition, we studied the impact of each of the two steps described in §4.1 and §4.2 on our proposed solution. To perform the study, we carried out a performance comparison between the following models:

  • SO: the source-only model trained using source domain labeled data without any domain adaptation.

  • CMD: the central-moment-based domain adaptation model Zellinger et al. (2017) of the original DIRL framework that implements L_inv with CMD_K.

  • DANN: the adversarial-learning-based domain adaptation model Ganin et al. (2016) of the original DIRL framework that implements L_inv with L_d.

  • WCMD: the weighted version of the CMD model that only applies the first step (described in §4.1) of our proposed method.

  • WDANN: the weighted version of the DANN model that only applies the first step of our proposed method.

  • WCMD†: the weighted version of the CMD model that applies both the first and second (described in §4.2) steps of our proposed method.

  • WDANN†: the weighted version of the DANN model that applies both the first and second steps of our proposed method.

  • WCMD*: a variant of WCMD† that assigns w* (estimated from target labeled data) to w and fixes this value during model training.

  • WDANN*: a variant of WDANN† that assigns w* to w and fixes this value during model training.

Intrinsically, SO can provide an empirical lower bound for the domain adaptation methods, while WCMD* and WDANN* can provide the empirical upper bounds of WCMD† and WDANN†, respectively. In addition, by comparing the performance of CMD and DANN with that of SO, we can know the effectiveness of the DIRL framework when P(Y) does not shift across domains. By comparing WCMD with CMD, or comparing WDANN with DANN, we can know the effectiveness of the first step of our proposed method. By comparing WCMD† with WCMD, or comparing WDANN† with WDANN, we can know the impact of the second step of our proposed method. And finally, by comparing WCMD† with CMD, or comparing WDANN† with DANN, we can know the general effectiveness of our proposed solution.

S→T   SO           CMD          WCMD         WCMD†        WCMD*        DANN         WDANN        WDANN†       WDANN*
B→D   83.52±0.20   79.18±0.28   82.01±0.54   83.89±0.65   84.83±0.05   80.47±0.52   84.53±0.52   84.60±0.18   84.33±0.15
B→E   81.83±0.06   78.11±0.19   84.02±0.37   84.01±0.45   84.26±0.09   76.26±1.16   84.75±0.44   83.91±0.58   83.71±0.60
B→K   82.72±0.02   80.19±0.12   83.91±0.24   85.49±0.05   85.49±0.06   79.66±0.49   82.64±0.59   83.32±0.27   84.87±0.41
D→B   82.97±0.06   81.47±0.38   83.20±0.10   83.10±0.12   83.11±0.03   82.08±0.97   83.10±0.38   82.65±0.08   82.05±0.22
D→E   81.97±0.07   80.35±0.03   82.48±0.29   83.47±0.12   83.57±0.03   78.75±0.54   83.01±0.44   83.29±0.51   83.09±0.48
D→K   83.51±0.10   82.99±0.22   86.94±0.18   86.40±0.23   86.34±0.15   81.54±0.70   85.05±0.51   85.84±0.71   86.06±0.61
E→B   80.65±0.11   78.09±0.34   79.65±0.40   81.35±0.31   81.82±0.07   78.94±0.73   80.70±0.94   81.63±0.74   81.53±0.33
E→D   80.25±0.25   77.16±1.99   80.07±0.49   82.20±0.17   81.85±0.08   76.87±0.50   79.73±0.77   81.24±0.47   82.04±0.15
E→K   87.43±0.06   83.76±0.15   86.87±0.28   88.68±0.13   89.00±0.02   84.37±0.89   87.89±0.28   88.31±0.36   88.38±0.31
K→B   80.05±0.26   75.44±0.37   81.00±0.25   82.35±0.16   82.34±0.13   75.81±0.21   80.97±0.72   81.83±0.32   81.13±0.52
K→D   79.88±0.13   73.52±0.27   79.85±0.15   83.58±0.05   83.64±0.06   74.27±0.82   80.49±0.07   83.11±0.76   83.53±0.10
K→E   87.30±0.02   81.73±0.46   87.80±0.13   87.87±0.04   88.04±0.01   82.19±0.00   87.52±0.26   87.55±0.18   87.80±0.18
Avg   82.67±0.11   79.33±0.40   83.15±0.37   84.36±0.21   84.52±0.07   79.42±0.63   83.28±0.49   83.32±0.43   84.04±0.34
Table 1: Mean accuracy ± standard deviation over five runs on the 12 binary-class cross-domain tasks.

5.2 Dataset and Task Design

We conducted experiments on the Amazon reviews dataset Blitzer et al. (2007), which is a benchmark dataset in the cross-domain sentiment analysis field. This dataset contains Amazon product reviews of four different product domains: Books (B), DVD (D), Electronics (E), and Kitchen (K) appliances. Each review is originally associated with a rating of 1-5 stars and is encoded as a 5,000-dimensional feature vector of bag-of-words unigrams and bigrams.

Binary-Class.

From this dataset, we constructed 12 binary-class cross-domain sentiment analysis tasks: B→D, B→E, B→K, D→B, D→E, D→K, E→B, E→D, E→K, K→B, K→D, K→E. Following the setting of previous works, we treated a review as class '1' if it was ranked up to 3 stars, and as class '2' if it was ranked 4 or 5 stars. For each task, D_S consisted of 1,000 examples of each class, and D_T consisted of 1,500 examples of class '1' and 500 examples of class '2'. In addition, since it is reasonable to assume that D_T can reveal the distribution of target domain data, we controlled the target domain testing dataset to have the same class ratio as D_T. Using the same label assigning mechanism, we also studied model performance over different degrees of P(Y) shift, which was evaluated by the maximum value of P_T(Y = y) / P_S(Y = y) over classes y. Please refer to Appendix C for more detail about the task design for this study.

Multi-Class.

We additionally constructed 12 multi-class cross-domain sentiment classification tasks. Tasks were designed to distinguish reviews of 1 or 2 stars (class 1) from those of 4 stars (class 2) and those of 5 stars (class 3). For each task, D_S contained 1,000 examples of each class, and D_T consisted of 500 examples of class 1, 1,500 examples of class 2, and 1,000 examples of class 3. Similarly, we also controlled the target domain testing dataset to have the same class ratio as D_T.

5.3 Implementation Detail

For all studied models, we implemented the feature generator G and the supervised classifier using the same architectures as those in Zellinger et al. (2017). For those DANN-based methods (i.e., DANN, WDANN, WDANN†, and WDANN*), we implemented the discriminator D using a 50-dimensional hidden layer with relu activation functions and a linear classification layer. Hyper-parameter K of CMD_K and CMD_K^w was set to 5 as suggested by Zellinger et al. (2017). Model optimization was performed using RmsProp Tieleman and Hinton (2012). The initial learning rate of w was set to 0.01, while that of the other parameters was set to 0.005 for all tasks.

Hyper-parameter α was set to 1 for all of the tested models. We searched for this value on task B→K. Within the search, the label distribution was set to be uniform, i.e., P(Y = 1) = P(Y = 2) = 0.5, for both domain B and K. We chose the value that maximized the performance of CMD on testing data of domain K. You may notice that this practice conflicts with the setting of unsupervised domain adaptation, in which we do not have labeled data of the target domain for training or development. However, we argue that this practice does not make the model comparison unfair, since all of the tested models shared the same value of α and α was not directly fine-tuned on any tested task. With the same consideration, for every tested model, we reported its best performance achieved on testing data of the target domain during its training (please refer to the attached source code in the appendix for more implementation detail of this work).

Figure 1: Mean accuracy of WCMD over different initializations of w. The empirical optimum value w* makes w*_y P_S(Y = y) = P_T(Y = y). The dotted line in the same color denotes the performance of the CMD model, and the highlighted marker annotates the performance of WCMD when initializing w with ŵ.

To initialize w, we used the label predictions of the source-only model on target unlabeled data. Specifically, let f_SO denote the trained source-only model. We initialized w_y by:

ŵ_y = ( (1 / |D_T|) Σ_{x ∈ D_T} 1[f_SO(x) = y] ) / P_S(Y = y).

Here, 1[·] denotes the indicator function. To offer an intuitive understanding of this strategy, we report the performance of WCMD over different initializations of w on 2 within-group (B→D, E→K) and 2 cross-group (B→K, D→E) binary-class domain adaptation tasks in Figure 1. Here, we say that domain B and D are of a group, and domain E and K are of another group, since B and D are similar, as are E and K, but the two groups are different from one another Blitzer et al. (2007). Note that P_S(Y = y) is a constant, which is estimated using source labeled data. From the figure, we can obtain three main observations. First, WCMD generally outperformed its CMD counterpart under different initializations of w. Second, it was better to initialize w with a relatively balanced value, i.e., one that makes w_y P_S(Y = y) roughly uniform over classes. Finally, ŵ was often a good initialization of w, indicating the effectiveness of the above strategy.
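The initialization strategy above can be sketched as follows (our own code; labels are 0-indexed here, and `target_preds` stands for the source-only model's hard predictions on the target unlabeled data):

```python
import numpy as np

def init_class_weight(target_preds, p_s_y, n_classes):
    """Initialize w_y as the ratio between the target prior estimated
    from source-only-model predictions on unlabeled target data and
    the known source prior P_S(Y=y)."""
    # Estimated target prior: fraction of target examples predicted as y.
    p_t_hat = np.bincount(target_preds, minlength=n_classes) / len(target_preds)
    return p_t_hat / p_s_y
```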

Model    B→D          B→K          D→E          E→K
SO       59.10±0.83   60.77±1.47   57.50±0.67   66.13±4.09
CMD      59.11±0.70   60.35±1.32   56.59±1.00   62.78±3.16
WCMD     59.16±1.00   61.32±1.67   58.32±1.89   64.94±3.91
WCMD†    60.69±0.82   61.18±1.84   60.12±0.89   66.65±3.77
WCMD*    60.26±0.76   61.77±1.43   59.84±0.84   66.42±3.70
DANN     59.16±0.60   61.85±0.64   57.80±0.32   65.50±0.53
WDANN    60.07±0.39   62.71±0.34   59.97±0.49   66.86±3.23
WDANN†   59.32±0.52   63.07±0.51   58.95±0.32   66.54±3.24
WDANN*   60.49±0.17   62.90±0.39   58.89±0.37   66.45±3.23
Table 2: Mean accuracy ± standard deviation over five runs on the 2 within-group and 2 cross-group multi-class domain adaptation tasks.
Figure 2: Relative improvement over the SO baseline under different degrees of P(Y) shift on the B→D and B→K binary-class domain adaptation tasks.

5.4 Main Result

Table 1 shows model performance on the 12 binary-class cross-domain tasks. From this table, we can obtain the following observations. First, CMD and DANN underperform the source-only model (SO) on all of the 12 tested tasks, indicating that in the studied situation DIRL degrades the domain adaptation performance rather than improving it. This observation confirms our analysis. Second, WCMD† consistently outperformed CMD and SO. This observation shows the effectiveness of our proposed method in addressing the problem of the DIRL framework in the studied situation. A similar conclusion can be obtained by comparing the performance of WDANN† with that of DANN and SO. Third, WCMD and WDANN consistently outperformed CMD and DANN, respectively, which shows the effectiveness of the first step of our proposed method. Finally, on most of the tested tasks, WCMD† and WDANN† outperform WCMD and WDANN, respectively.

Figure 2 depicts the relative improvement, e.g., (acc − acc_SO) / acc_SO, of the domain adaptation methods over the SO baseline under different degrees of P(Y) shift, on two binary-class domain adaptation tasks (refer to Appendix C for results of the other models on the other tasks). From the figure, we can see that the performance of CMD generally worsened as the P(Y) shift increased. In contrast, our proposed model was robust to the varying degree of shift. Moreover, it achieved near-upper-bound performance, characterized by WCMD*. This again verifies the effectiveness of our solution.

Table 2 reports model performance on the 2 within-group (B→D, E→K) and the 2 cross-group (B→K, D→E) multi-class domain adaptation tasks (refer to Appendix D for results on the other tasks). From this table, we observe that on some tested tasks, WCMD† and WDANN† did not greatly outperform, or even slightly underperformed, WCMD and WDANN, respectively. A possible explanation of this phenomenon is that the label distribution of D_T also differs from that of the target domain testing dataset. Therefore, the value of w estimated or learned using D_T is not fully suitable for application to the testing dataset. This explanation is supported by the observation that WCMD* and WDANN* also only slightly outperform WCMD† and WDANN† on these tasks, respectively.

6 Conclusion

In this paper, we studied the problem of the popular domain-invariant representation learning (DIRL) framework for domain adaptation when the label distribution P(Y) changes across domains. To address the problem, we proposed a weighted version of DIRL (WDIRL). We showed that existing methods of the DIRL framework can be easily transferred to our WDIRL framework. Extensive experimental studies on benchmark cross-domain sentiment analysis datasets verified our analysis and showed the effectiveness of our proposed solution.

References

  • Aljundi et al. (2015) Rahaf Aljundi, Rémi Emonet, Damien Muselet, and Marc Sebban. 2015. Landmarks-based kernelized subspace alignment for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 56–63.
  • Amari et al. (1987) Shunʼichi Amari, Ole E Barndorff-Nielsen, Robert E Kass, Steffen L Lauritzen, and CR Rao. 1987. Differential geometry in statistical inference. IMS.
  • Ben-David et al. (2007) Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. 2007. Analysis of representations for domain adaptation. In Advances in neural information processing systems, pages 137–144.
  • Blitzer et al. (2007) John Blitzer, Mark Dredze, Fernando Pereira, et al. 2007. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL, volume 7, pages 440–447.
  • Bousmalis et al. (2016) Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. 2016. Domain separation networks. In Advances in Neural Information Processing Systems, pages 343–351.
  • Chan and Ng (2005) Yee Seng Chan and Hwee Tou Ng. 2005. Word sense disambiguation with distribution estimation. In IJCAI, volume 5, pages 1010–5.
  • Chen et al. (2011) Minmin Chen, Yixin Chen, and Kilian Q Weinberger. 2011. Automatic feature decomposition for single view co-training. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 953–960.
  • Chen et al. (2018) Xilun Chen, Yu Sun, Ben Athiwaratkun, Claire Cardie, and Kilian Weinberger. 2018. Adversarial deep averaging networks for cross-lingual sentiment classification. Transactions of the Association for Computational Linguistics, 6:557–570.
  • Fernando et al. (2013) Basura Fernando, Amaury Habrard, Marc Sebban, and Tinne Tuytelaars. 2013. Unsupervised visual domain adaptation using subspace alignment. In Proceedings of the IEEE International Conference on Computer Vision, pages 2960–2967.
  • Ganin et al. (2016) Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35.
  • Ghifary et al. (2014) Muhammad Ghifary, W Bastiaan Kleijn, and Mengjie Zhang. 2014. Domain adaptive neural networks for object recognition. In Pacific Rim International Conference on Artificial Intelligence, pages 898–904. Springer.
  • Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 513–520.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680.
  • Gopalan et al. (2011) Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. 2011. Domain adaptation for object recognition: An unsupervised approach. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 999–1006. IEEE.
  • He et al. (2018) Ruidan He, Wee Sun Lee, Hwee Tou Ng, and Daniel Dahlmeier. 2018. Adaptive semi-supervised learning for cross-domain sentiment classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3467–3476.
  • Li et al. (2017) Zheng Li, Yun Zhang, Ying Wei, Yuxiang Wu, and Qiang Yang. 2017. End-to-end adversarial memory network for cross-domain sentiment classification. In IJCAI, pages 2237–2243.
  • Lin (1991) Jianhua Lin. 1991. Divergence measures based on the shannon entropy. IEEE Transactions on Information theory, 37(1):145–151.
  • Long et al. (2013) Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and S Yu Philip. 2013. Transfer feature learning with joint distribution adaptation. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 2200–2207. IEEE.
  • Long et al. (2015) Mingsheng Long, Jianmin Wang, Jiaguang Sun, and S Yu Philip. 2015. Domain invariant transfer kernel learning. IEEE Transactions on Knowledge and Data Engineering, 27(6):1519–1532.
  • Mansour et al. (2009) Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. 2009. Domain adaptation with multiple sources. In Advances in neural information processing systems, pages 1041–1048.
  • Muandet et al. (2013) Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. 2013. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pages 10–18.
  • Pan et al. (2010) Sinno Jialin Pan, Xiaochuan Ni, Jian-Tao Sun, Qiang Yang, and Zheng Chen. 2010. Cross-domain sentiment classification via spectral feature alignment. In Proceedings of the 19th international conference on World wide web, pages 751–760. ACM.
  • Pan et al. (2011) Sinno Jialin Pan, Ivor W Tsang, James T Kwok, and Qiang Yang. 2011. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210.
  • Peng et al. (2018) Minlong Peng, Qi Zhang, Yu-gang Jiang, and Xuanjing Huang. 2018. Cross-domain sentiment classification with target domain specific information. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2505–2513.
  • Pinheiro and Element (2018) Pedro O Pinheiro and AI Element. 2018. Unsupervised domain adaptation with similarity learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8004–8013.
  • Tieleman and Hinton (2012) Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31.
  • Tzeng et al. (2017) Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. 2017. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, page 4.
  • Wei et al. (2016) Pengfei Wei, Yiping Ke, and Chi Keong Goh. 2016. Deep nonlinear feature coding for unsupervised domain adaptation. In IJCAI, pages 2189–2195.
  • Yosinski et al. (2014) Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328.
  • Zellinger et al. (2017) Werner Zellinger, Thomas Grubinger, Edwin Lughofer, Thomas Natschläger, and Susanne Saminger-Platz. 2017. Central moment discrepancy (cmd) for domain-invariant representation learning. arXiv preprint arXiv:1702.08811.
  • Zhao et al. (2018) Han Zhao, Shanghang Zhang, Guanhang Wu, José MF Moura, Joao P Costeira, and Geoffrey J Gordon. 2018. Adversarial multiple source domain adaptation. In Advances in Neural Information Processing Systems, pages 8568–8579.