Domain adaptation is important in sentiment analysis as sentiment-indicating words vary between domains. Recently, multi-domain adaptation has become more pervasive, but existing approaches train on all available source domains including dissimilar ones. However, the selection of appropriate training data is as important as the choice of algorithm. We undertake -- to our knowledge for the first time -- an extensive study of domain similarity metrics in the context of sentiment analysis and propose novel representations, metrics, and a new scope for data selection. We evaluate the proposed methods on two large-scale multi-domain adaptation settings on tweets and reviews and demonstrate that they consistently outperform strong random and balanced baselines, while our proposed selection strategy outperforms instance-level selection and yields the best score on a large reviews corpus.
Domain adaptation is important for sentiment analysis, as sentiment-bearing words vary between domains. If two domains are similar, such as electronics and kitchen appliance reviews, adaptation is successful; in contrast, transfer is less productive for dissimilar domains, e.g. books and electronics reviews Blitzer et al. (2007).
Consequently, recent research has looked to the more realistic setting of multi-domain adaptation, where multiple source domains are provided and the objective is to maximize performance on the target domain. However, such approaches still train on samples from dissimilar domains that are not helpful for prediction in the actual target domain. To mitigate this, existing approaches Zhou et al. (2016) use a domain similarity measure to weight the predictions of separate source domain models. However, Ruder et al. (2017) show that training one model on all source data is generally more effective.
Even within one domain, such as a Twitter dataset, adaptation performance varies significantly depending on the choice of training samples Hovy et al. (2014). In practice, data selection and domain adaptation approaches complement each other: data selection can be seen as weighting relevant instances more highly Jiang and Zhai (2007), while data selection approaches in some cases have been shown to outperform adaptation methods Remus (2012).
When performing sentiment analysis in the real world, the domains are often unknown or not clearly separable. In this scenario, adaptation and data selection strategies are needed that do not presuppose a distinction of domains Plank (2016).
Data selection is also important because annotation is expensive. A large amount of unlabeled data is generally available for training, but annotation can typically only be afforded for a fraction of it. Data selection is able to guide us where to concentrate our annotation efforts.
In the following, we will review different strategies to select training data for multi-domain adaptation for sentiment analysis. For data selection, three factors are of importance: the representation, the similarity metric, and the level of the selection. With regard to the representation, we consider term distributions, word embeddings, and autoencoder representations (note that the latter two have not been used for data selection as far as we are aware). We consider three domain similarity metrics: Jensen-Shannon divergence, cosine similarity, and proxy distance. We finally employ three different data selection levels: domain level, training instance level, and instance subset level.
Our contributions are the following:
We present -- to our knowledge -- the first extensive study of data selection strategies for the task of sentiment analysis.
We consider novel representations and metrics for data selection and present strategies that consistently outperform random and balanced baselines on two large-scale multi-domain sentiment analysis datasets.
We propose a data selection method that is well-suited to the realistic setting of unknown or ill-defined domains and consistently outperforms instance-level selection.
We present guidelines for data selection for sentiment analysis in the wild.
Domain adaptation. Domain adaptation has a long history of research: Blitzer et al. (2006) proposed a structural correspondence learning algorithm. Daumé III (2007) introduced a kernel function that maps source and target domain data to a space that encourages in-domain similarity, while Pan et al. (2010) proposed a spectral feature alignment algorithm to align domain-specific words into meaningful clusters. Glorot et al. (2011) employed stacked Denoising Autoencoders to extract meaningful representations, which Chen et al. (2012) extended to a marginalized version to address their high computational cost. Zhuang et al. (2015) proposed to use deep auto-encoders for transfer learning, while Zhou et al. (2016) transferred the source examples to the target domain and vice versa using Bi-Transferring Deep Neural Networks.
Multi-domain adaptation. For domain adaptation from multiple sources, Mansour (2009) proposed a distribution weighted hypothesis with theoretical guarantees. Duan et al. (2009) proposed a method to learn a least-squares SVM classifier by leveraging source classifiers, while Chattopadhyay et al. (2012) assign pseudo-labels to the target data. Yang and Eisenstein (2015) embed meta-data attributes of domains, which, however, are not available when domains are unknown. Wu and Huang (2016) exploit general sentiment knowledge and word-level sentiment polarity relations for multi-source domain adaptation, while Ruder et al. (2017) use a distillation approach to adapt knowledge from source domain teachers to a target domain student.
Domain similarity metrics. Blitzer et al. (2007) show that proxy distance can be used to measure the adaptability between two domains in order to determine which examples should be annotated. Van Asch and Daelemans (2010) find that Rényi divergence outperforms other metrics in predicting POS tagging accuracy on a target domain. Plank and van Noord (2011), however, find that instance-level selection based on Jensen-Shannon divergence and topic models performs best for parsing data selection.
In contrast, we consider different levels of data selection and show that both selecting the most similar domain and the most similar subsets outperform instance-level selection for sentiment analysis. Remus (2012) also uses Jensen-Shannon divergence to select training examples for sentiment analysis. Finally, Wu and Huang (2016) propose a domain similarity metric based on a sentiment graph. Their measure, however, is only applicable if domains are clearly defined or known in advance, which is often not the case in the real world.
Term distributions. The relative frequency distributions of terms in the vocabulary have been successfully used to gauge similarity with respect to a target domain Plank and van Noord (2011); Wu and Huang (2016). The underlying assumption is that similar domains have more terms in common than dissimilar domains. The term distribution of a domain $D$ is a vector $t \in \mathbb{R}^{|V|}$ where $t_i$ is the probability of the $i$-th word in the vocabulary $V$ appearing in $D$. Term distributions, however, only capture superficial occurrence statistics, which likely cannot express a more nuanced spectrum of domain similarity.
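As a rough illustration of the term distribution described above, the following sketch counts vocabulary words over all documents of a domain and normalizes the counts. The function name `term_distribution`, the lowercasing, and the whitespace tokenization are assumptions made for this example, not details taken from the experiments.

```python
from collections import Counter


def term_distribution(documents, vocabulary):
    """Return the relative frequency of each vocabulary word in a domain.

    `documents` is a list of strings belonging to one domain and
    `vocabulary` is an ordered list of words. Tokenization by lowercasing
    and splitting on whitespace is an assumption of this sketch.
    """
    vocab_set = set(vocabulary)
    counts = Counter()
    for doc in documents:
        counts.update(tok for tok in doc.lower().split() if tok in vocab_set)
    total = sum(counts.values())
    if total == 0:
        # No vocabulary word occurs in the domain: return an all-zero vector.
        return [0.0] * len(vocabulary)
    return [counts[word] / total for word in vocabulary]
```

Words outside the vocabulary are ignored, so the returned vector sums to one whenever at least one vocabulary word occurs.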
Word embeddings. Word embeddings have been used to capture more fine-grained notions of similarity both of words Mikolov et al. (2013) and of sentences Wieting et al. (2016), but have not been considered for modeling domain similarity. In line with previous work, we use a weighted sum of pre-trained word embeddings to represent each document. In particular, we discount frequent words by weighting the embedding of each word occurring in the document with the word’s smoothed inverse probability, which is computed from the probability of the word appearing in the domain and a smoothing factor that we set following Mikolov et al. (2013). The representation of a domain is then simply the mean of its document representations.
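The weighted sum of embeddings and the domain mean can be sketched as follows. The exact weighting formula and the value of the smoothing factor are not recoverable from the text here, so the weight `a / (a + p(w))` and the default `a=1e-3` are assumptions chosen purely for illustration, as are the function names and the toy dictionaries.

```python
def document_embedding(tokens, embeddings, word_prob, a=1e-3):
    """Weighted sum of word embeddings for one document.

    `embeddings` maps words to equal-length lists of floats and `word_prob`
    maps words to their probability in the domain. The weight
    a / (a + p(w)) is an assumed smoothed inverse-probability weighting.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for tok in tokens:
        if tok not in embeddings:
            continue  # skip out-of-vocabulary words
        weight = a / (a + word_prob.get(tok, 0.0))
        for i, value in enumerate(embeddings[tok]):
            total[i] += weight * value
    return total


def domain_embedding(document_vectors):
    """Mean of the document representations of a domain."""
    n = len(document_vectors)
    return [sum(column) / n for column in zip(*document_vectors)]
```

With this weighting, a frequent word contributes far less to the document vector than a rare one, which is the discounting effect the paragraph describes.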
Autoencoder representations. Denoising autoencoders have been successfully used in recent work on domain adaptation Glorot et al. (2011); Zhuang et al. (2015). Their representations are typically created to be domain-invariant, but might still capture information that is beneficial for modeling domain similarity. We train a denoising autoencoder on the data of all domains and extract the representation for each document. To obtain the representation of a domain, we take the mean of its document representations.
Other representations. In the past, topic distributions have also been used and been found to be successful for part-of-speech tagging Plank and van Noord (2011). While topic distributions have proven to be convincing features for sentiment analysis Lu et al. (2011), in our experiments, we have found them not to be effective for selecting suitable training examples. We attribute this to the fact that topical similarity only inadequately captures the nuances that make up the notion of similarity for sentiment analysis.
Jensen-Shannon divergence. Jensen-Shannon divergence is one of the most frequently used measures of domain similarity Remus (2012) and has been shown to outperform other similarity metrics Plank and van Noord (2011). Jensen-Shannon divergence is a smoothed, symmetric variant of KL divergence. The Jensen-Shannon divergence between two different probability distributions $P$ and $Q$ can be written as
$$D_{JS}(P \| Q) = \frac{1}{2} \left[ D_{KL}(P \| M) + D_{KL}(Q \| M) \right]$$
where $M = \frac{1}{2}(P + Q)$, i.e. the average distribution of $P$ and $Q$, and $D_{KL}$ is the KL divergence:
$$D_{KL}(P \| Q) = \sum_{i} p_i \log \frac{p_i}{q_i}$$
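A minimal sketch of this computation for two discrete distributions given as equal-length lists of probabilities is shown below. The function names are hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not stated in the text.

```python
import math


def kl_divergence(p, q):
    """KL divergence between two discrete distributions.

    Terms with p_i = 0 contribute nothing and are skipped.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Average of the KL divergences of p and q to their mean distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Because the average distribution is non-zero wherever either input is non-zero, the divergence stays finite even for sparse term distributions. Lower values indicate more similar domains.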
Cosine similarity. Cosine similarity is traditionally used to measure the similarity between vectors, in particular word embeddings Mikolov et al. (2013). The cosine similarity between two vectors $v_1$ and $v_2$ is:
$$\cos(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|}$$
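For dense representations such as averaged embeddings, the cosine similarity can be computed as in the sketch below. Returning zero for an all-zero vector is an assumption of this example rather than part of the definition.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Undefined for zero vectors; this sketch returns 0.0 by convention.
        return 0.0
    return dot / (norm_u * norm_v)
```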
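Proxy distance. The $\mathcal{A}$ distance Ben-David et al. (2007) aims to identify the subset $A$ in a family of subsets $\mathcal{A}$ on which the source domain distribution $\mathcal{D}_S$ and the target domain distribution $\mathcal{D}_T$ differ the most and is defined as follows:
$$d_{\mathcal{A}}(\mathcal{D}_S, \mathcal{D}_T) = 2 \sup_{A \in \mathcal{A}} \left| \Pr_{\mathcal{D}_S}[A] - \Pr_{\mathcal{D}_T}[A] \right|$$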
In practice, the Huber loss is used as a proxy for the $\mathcal{A}$ distance Blitzer et al. (2007), but it is generally only used on a domain level to gain insights with regard to the adaptability of representations. We propose to use the proxy distance as a data selection metric in the multi-domain setting: We first sample as many examples from the source domains as we have target domain examples. We then assign one label to all source domain examples and the other label to all target domain examples and train a logistic regression classifier on the balanced binary dataset. We then use the probability of belonging to the target domain as the similarity score for each source domain example.
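The procedure can be sketched in plain Python as follows. This is an illustrative stand-in, not the original implementation: the small gradient-descent logistic regression, the label assignment (0 for source, 1 for target), the hyperparameters, and all function names are assumptions of this example.

```python
import math
import random


def _sigmoid(z):
    # Clamp the input to avoid overflow in math.exp for extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def _dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_logistic_regression(X, y, epochs=200, lr=0.5):
    """Fit a logistic regression with batch gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, label in zip(X, y):
            err = _sigmoid(_dot(w, x) + b) - label
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def proxy_similarity_scores(source, target, seed=0):
    """Score every source example by its probability of being target data.

    A balanced set is built by sampling as many source examples as there
    are target examples; source is labeled 0 and target 1 (assumed labels).
    """
    rng = random.Random(seed)
    sampled = rng.sample(source, min(len(source), len(target)))
    X = sampled + target
    y = [0] * len(sampled) + [1] * len(target)
    w, b = train_logistic_regression(X, y)
    return [_sigmoid(_dot(w, x) + b) for x in source]
```

Source examples that the classifier cannot tell apart from target data receive scores closer to one and would be preferred during selection.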
Other similarity metrics. Another similarity metric that has been successfully used in Machine Translation is a sentence’s perplexity as determined by a language model trained on the target domain Duh et al. (2013). However, in our experiments, perplexity proved unsuitable for data selection for sentiment analysis.
Finally, domain similarity can be measured on different levels, as can be seen in Figure 1: on the level of the entire domain (1(a)); on the level of a single training example (1(b)); or on a level that mediates between the two extremes and introduces diversity in the data selection (1(c)).
Domain level. Often, such as in the case of product reviews, domains are clearly delimited and documents in different domains are clearly distinct from one another. We can thus adopt these human-assigned labels and compute the similarity of each source domain with regard to the target domain. We then sample training examples only from the most similar source domain.
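Domain-level selection can be sketched as picking the source domain whose representation is closest to the target and sampling only from it. The data layout (a dictionary mapping each domain name to a representation vector and a list of examples), the use of cosine similarity, and the function names are assumptions for this illustration.

```python
import math
import random


def _cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def select_from_most_similar_domain(source_domains, target_vector, n, seed=0):
    """Return the most similar domain name and up to n examples from it.

    `source_domains` maps a domain name to a (vector, examples) pair.
    """
    best = max(
        source_domains,
        key=lambda name: _cosine(source_domains[name][0], target_vector),
    )
    examples = source_domains[best][1]
    rng = random.Random(seed)
    return best, rng.sample(examples, min(n, len(examples)))
```

A divergence such as Jensen-Shannon could be swapped in by selecting the minimum instead of the maximum.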
Training instance level. In other scenarios, such as on the web, there is no clear distinction between different domains Ruder et al. (2016). In this case, the similarity with regard to the target domain can be computed for each training instance Plank and van Noord (2011). Instances are then sorted by similarity and the training examples with the highest similarity score are chosen for training.
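Instance-level selection amounts to ranking all source examples by their similarity score and keeping the top ones, as in this short sketch. The function name and the assumption that higher scores mean greater similarity are illustrative; for a divergence, the scores would need to be negated first.

```python
def select_top_instances(examples, scores, n):
    """Return the n examples with the highest similarity scores."""
    ranked = sorted(range(len(examples)), key=lambda i: scores[i], reverse=True)
    return [examples[i] for i in ranked[:n]]
```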
Instance subset level. While representations such as word embeddings have been shown to be effective at representing words or individual documents, term distributions are more apt to capture the statistics of a collection of examples, as the sparse term distribution of a single sentence or document might not contain sufficient evidence to compute an accurate notion of similarity on an instance level. To counteract this, instead of calculating the similarity for each training instance, we propose to compute the similarities for random subsets of instances. In addition, considering a subset of examples in conjunction has the advantage of diversifying the training data and thus making the trained model more robust.
At each iteration, we sample a fixed number of subsets of a fixed size from all source domains. We retain the subset with the highest similarity with regard to the target domain at each iteration and repeat the process until we have gathered the desired number of training examples. The complete procedure is shown in Algorithm 1.
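The iterative procedure can be sketched as follows. Several details are assumptions of this example rather than facts from the text: the `similarity` callable that scores a subset against the target, the removal of already selected examples from the pool, the shrinking of the final subset to hit the requested size exactly, and all parameter names.

```python
import random


def select_subsets(pool, similarity, n_examples, subset_size, n_subsets, seed=0):
    """Greedy subset-level selection.

    At each iteration, `n_subsets` random subsets of `subset_size` examples
    are drawn from the remaining pool, and the subset that `similarity`
    rates highest is kept. Selected examples are removed from the pool
    (an assumption of this sketch) so that no example is chosen twice.
    """
    rng = random.Random(seed)
    remaining = list(range(len(pool)))
    selected = []
    while len(selected) < n_examples and remaining:
        size = min(subset_size, len(remaining), n_examples - len(selected))
        candidates = [rng.sample(remaining, size) for _ in range(n_subsets)]
        best = max(candidates, key=lambda idx: similarity([pool[i] for i in idx]))
        selected.extend(best)
        taken = set(best)
        remaining = [i for i in remaining if i not in taken]
    return [pool[i] for i in selected]
```

In practice, `similarity` could compute the term distribution of the subset and return the negative Jensen-Shannon divergence to the target domain's term distribution.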
As challenges for data selection methods vary depending on the nature of the domains to which they are applied, we evaluate our approaches on two large multi-domain sentiment analysis datasets, which seek to replicate these challenges.
Tweets+Reviews. When data selection is applied to documents on the web, methods need to be able to handle domains that are not clearly defined and inconsistent. In the extreme, domains may be unknown or non-existing. In order to emulate this diversity, we choose the data of different editions of the SemEval Twitter Sentiment Analysis task as tweets within one domain are often heterogeneous Hovy et al. (2014). Specifically, we select the training data of SemEval-2013 Task 9 (http://alt.qcri.org/semeval2014/task9/data/uploads/semeval2013_task2_train.zip) (Twitter2013-Train) and the training (Twitter2016-Train) and test data of SemEval-2016 Task 4 Subtask A (http://alt.qcri.org/semeval2016/task4/index.php?id=data-and-tools). We split the latter into its sub-domains, i.e. LiveJournal2014, SMS2013, Twitter2013, Twitter2014, Twitter2014Sarcasm, Twitter2015, Twitter2016. Furthermore, in order to gauge the methods’ aptitude to handle diverse datasets, we include the laptop and restaurant reviews of SemEval-2016 Task 5 (http://alt.qcri.org/semeval2016/task5/index.php?id=data-and-tools); we omit sentences with conflicting sentiments. Statistics of the domains can be found in Table 2.
|Dataset||# of||# of|
|camera & photo||1000||999||5409|
|cell phones & service||639||384||0|
|health & personal care||1000||1000||5225|
|jewelry & watches||1000||292||689|
|kitchen & housewares||1000||1000||17856|
|sports & outdoors||1000||1000||3728|
|tools & hardware||98||14||0|
|toys & games||1000||1000||11147|
Reviews. We furthermore consider the diametrically opposite setting of clearly distinct domains. For this scenario, we use the large version of the reviews dataset of Blitzer et al. (2006). The dataset consists of 25 different review domains with varying data sizes. To enforce realistic conditions where annotation is expensive, we employ only the provided datasets with small amounts of labeled data. We show statistics in Table 2.
Tasks. We evaluate our methods on the ternary sentence-level and binary review-level sentiment analysis tasks on the Tweets+Reviews and Reviews datasets, respectively.
We pre-process tweets by replacing URLs, user names, and hashtags. We remove stopwords and use a vocabulary of the 10,000 most frequent words across all domains. In line with previous work, we use the raw bag-of-words unigram/bigram features pre-processed with tf-idf as input to a linear SVM classifier Blitzer et al. (2006). We use GloVe vectors Pennington et al. (2014) pre-trained on 42B tokens of the Common Crawl corpus (http://nlp.stanford.edu/projects/glove/) for our word embeddings. For the auto-encoder representations, we use a denoising auto-encoder with one hidden layer and masking noise, which we train with the Adam optimizer Kingma and Ba (2015). For subset-level selection, we set the subset size and the number of subsets via a grid search over a range of values on Tweets+Reviews validation data.
We limit the number of training examples for Tweets+Reviews and for Reviews, the latter in accordance with the conventions of Bollegala et al. (2011). In all cases, we evaluate on all data in the target domain.
Baselines. We compare against two baselines: a) We randomly sample training examples from all source domains (rand); b) we randomly sample the same number of stratified examples from all source domains (all). Since the Reviews corpus contains distinct product review domains, we also include a human-labeled baseline (H), where the model is trained on data randomly sampled from the domain that was determined most similar by 5 human annotators. We provided them with the names of the review categories and review samples and tasked them with ranking the three most similar domains for each domain, with the consensus being selected as the most similar domain.
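The two sampling baselines can be sketched as below. The reading of the balanced baseline as drawing an equal share from every source domain is an interpretation of the description, and the function names, the dictionary layout, and the integer division of the budget are assumptions of this example. The human-labeled baseline is not covered.

```python
import random


def random_baseline(domains, n, seed=0):
    """Sample n examples uniformly from the pooled source domains (rand)."""
    rng = random.Random(seed)
    pool = [ex for name in sorted(domains) for ex in domains[name]]
    return rng.sample(pool, min(n, len(pool)))


def balanced_baseline(domains, n, seed=0):
    """Sample an equal share of the budget from every source domain (all).

    Domains with fewer examples than their share contribute all they have.
    """
    rng = random.Random(seed)
    per_domain = n // len(domains)
    selected = []
    for name in sorted(domains):
        examples = domains[name]
        selected.extend(rng.sample(examples, min(per_domain, len(examples))))
    return selected
```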
Our methods. For each representation, we use -- due to space considerations -- the similarity metric that is most commonly used in conjunction with it, i.e. term distribution and Jensen-Shannon divergence, word embeddings and cosine similarity, and autoencoder representations and cosine similarity. For each combination, we select examples based on three levels: a) we randomly sample from the most similar domain; b) we choose the most similar individual examples (ex); and c) we choose the most similar subsets of examples (subset). We apply proxy distance for each representation only on the instance level, although it can be naturally combined with our subset-level selection strategy.
|Representation||Term distribution||Word embeddings||AE representations|
|Metric||Jensen-Shannon||Cosine similarity||Cosine similarity|
|camera & photo||81.6||81.0||83.0||82.8||79.2||81.7||78.3||83.3||82.1||81.0||81.3||83.3||80.9||81.2||80.2|
|cell phones & service||81.6||81.9||81.9||81.4||76.3||81.1||78.4||72.8||81.5||81.5||80.3||81.6||79.9||83.3||81.3|
|computer & video games||80.1||80.6||72.7||75.2||79.4||82.0||80.1||72.6||79.9||81.3||78.9||79.3||79.8||82.1||80.4|
|health & personal care||80.2||78.9||81.3||80.6||76.7||79.5||76.7||81.2||80.6||79.3||77.9||81.9||80.7||79.8||76.7|
|jewelry & watches||84.7||86.2||82.9||82.9||80.7||86.2||87.5||83.7||83.5||86.9||86.3||84.4||86.1||87.5||88.6|
|kitchen & housewares||81.4||79.8||58.6||82.6||79.8||81.4||72.6||74.3||82.0||79.8||80.1||83.2||81.6||82.4||78.9|
|sports & outdoors||80.9||80.6||73.7||82.1||79.6||82.3||71.0||82.6||81.5||81.4||80.4||83.8||80.7||82.2||79.2|
|tools & hardware||80.5||85.2||82.3||77.5||80.4||85.0||85.7||91.1||78.6||83.0||85.7||76.8||85.4||85.7||89.3|
|toys & games||79.9||78.9||69.6||82.0||78.3||79.8||79.0||82.8||78.9||79.1||80.9||82.8||82.1||79.8||82.0|
Tweets+Reviews. Most methods outperform the random and balanced baselines on average. Term-distribution based domain-level selection achieves the best performance. Even though Twitter datasets are highly heterogeneous, training and test sets of the same edition of the competition have been collected in the same time frame and thus share similar topics and characteristics that are helpful for identifying the sentiment.
In contrast, the scores of the other selection strategies are similar across representations. Subset-level selection generally outperforms its instance-level counterpart. Using proxy distance as a similarity metric generally improves upon using the default similarity metric for the given representation. Particularly for term distributions, it helps to mitigate the sparsity of the representation.
Even though domain-level selection using word embeddings outperforms the baselines on the majority of domains, it performs -- on average -- more poorly. This shows the fallacy of choosing the most similar domain for training: The domain might be similar with regard to the representations of the words that are used, but the way the words are used is different, as can be seen with the sarcasm, laptops, and restaurant domains, where word embeddings choose semantically similar domains that are, however, less helpful for predicting sentiment. This also reveals that the mean of document representations might not be the best way to capture all aspects of domain similarity.
Autoencoder representations perform comparably to term distributions on the Tweets+Reviews dataset, while their domain-level selection method performs worse. We generally expected them to perform better than term distributions and attribute the shortfall mainly to the scarcity of data in this multi-domain dataset.
As we have limited the number of training examples, we also investigate the behaviour of the different data selection strategies when the number of training examples increases. We show this for the LiveJournal2014 domain as an example in Figure 2. We observe that -- while all selection strategies increase their performance -- subset-level selection improves slightly more, but note that this trend might vary depending on the domain.
Reviews. For the clearly distinct review domains, both the random and balanced baselines obtain comparatively stronger results. We attribute this to the fact that it is very difficult for a model to learn from a sentence that is not relevant to the target domain as in the Tweets+Reviews dataset, while an unhelpful review might still shed light on general sentiment words; in addition, the binary setting is easier than the ternary setting.
Domain-level selection based on term distributions still performs strongly, but is outperformed by subset-level selection on the same representations, which significantly improves upon instance-level selection. In comparison to the previous dataset, word embedding-based representations outperform term distributions, which we attribute to the fact that reviews are longer and contain more topical words with meaningful word representations in comparison to the noisy social media messages.
Even though product review categories have natural-language names, identifying the domain that is most similar and thus most likely to help the prediction at test time is no trivial task, as can be seen from the abysmal performance of our human-labeled baseline (H), which underlines the need for automatic data selection for domain adaptation. This is most evident with regard to the music domain, which is naturally conceptually similar to the musical instruments domain, but employs an entirely different repertoire of sentiment words.
Another behavior that we observe is that while domain-level selection yields more often the best score for a domain compared to other selection strategies, its failure mode is equally polarized and often yields one of the worst scores for a target domain, if a source domain is matched whose examples are only peripherally relevant for the prediction.
Proxy distance again consistently improves upon the default domain similarity metric, demonstrating that domain similarity as judged by the confidence value of a linear classifier is a suitable metric for data selection. Finally, autoencoder representations outperform all other representations, while subset-level selection based on autoencoder representations performs best. The multiplicity of domains compensates for the lack of data, which renders autoencoder representations more meaningful and gives them an edge over term representations.
Finally, we propose guidelines on using data selection for sentiment analysis in the wild that might be helpful for NLP practitioners:
DO use term distribution-based domain-level selection with Jensen-Shannon divergence as a simple baseline.
DO use subset-level selection instead of instance-level selection, particularly with term distributions.
DO use proxy distance instead of the default similarity metric.
DO NOT use pre-trained word embeddings for data selection on noisy data.
DO NOT use autoencoder representations for data selection if your dataset is small.
DO use autoencoder representations on large datasets.
In this paper, we have extensively studied -- for the first time as far as we are aware -- domain similarity metrics in the context of sentiment analysis. We have proposed several representations and metrics that have previously not been employed for data selection. We have introduced a novel data selection strategy that leverages subsets of examples and consistently outperforms instance-level selection. Finally, we have evaluated the proposed methods on two large-scale multi-domain adaptation settings on tweets and reviews and demonstrated that our proposed metrics and representations outperform strong random and balanced baselines, while our new selection strategy based on autoencoder representations yields the best score on the review corpus.
EMNLP ’06 Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (July): 120--128. https://doi.org/10.3115/1610075.1610094.
Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 767--774. https://doi.org/10.1007/s11222-007-9033-z.
IJCAI International Joint Conference on Artificial Intelligence, pages 4119--4125.