Data Selection Strategies for Multi-Domain Sentiment Analysis

02/08/2017 ∙ by Sebastian Ruder, et al. ∙ 0

Domain adaptation is important in sentiment analysis as sentiment-indicating words vary between domains. Recently, multi-domain adaptation has become more pervasive, but existing approaches train on all available source domains including dissimilar ones. However, the selection of appropriate training data is as important as the choice of algorithm. We undertake -- to our knowledge for the first time -- an extensive study of domain similarity metrics in the context of sentiment analysis and propose novel representations, metrics, and a new scope for data selection. We evaluate the proposed methods on two large-scale multi-domain adaptation settings on tweets and reviews and demonstrate that they consistently outperform strong random and balanced baselines, while our proposed selection strategy outperforms instance-level selection and yields the best score on a large reviews corpus.




1 Introduction

Domain adaptation is important for sentiment analysis, as sentiment-bearing words vary between domains. If two domains are similar, such as electronics and kitchen appliance reviews, adaptation is successful; in contrast, transfer is less productive for dissimilar domains, e.g. books and electronics reviews Blitzer et al. (2007).

Consequently, recent research has looked to the more realistic setting of multi-domain adaptation, where multiple source domains are provided and the objective is to maximize performance on the target domain. However, such approaches still train on samples from dissimilar domains that are not helpful for prediction in the actual target domain. To mitigate this, existing approaches Zhou et al. (2016) use a domain similarity measure to weight the predictions of separate source domain models. However, Ruder et al. (2017) show that training one model on all source data is generally more effective.

Even within one domain, such as a Twitter dataset, adaptation performance varies significantly depending on the choice of training samples Hovy et al. (2014). In practice, data selection and domain adaptation approaches complement each other: data selection can be seen as weighting relevant instances more highly Jiang and Zhai (2007), while data selection approaches in some cases have been shown to outperform adaptation methods Remus (2012).

When performing sentiment analysis in the real world, the domains are often unknown or not clearly separable. In this scenario, adaptation and data selection strategies are needed that do not presuppose a distinction of domains Plank (2016).

Data selection is also important because annotation is expensive. A large amount of unlabeled data is generally available for training, but annotation can typically only be afforded for a fraction of it. Data selection is able to guide us where to concentrate our annotation efforts.

In the following, we will review different strategies to select training data for multi-domain adaptation for sentiment analysis. For data selection, three factors are of importance: the representation, the similarity metric, and the level of the selection. With regard to the representation, we consider term distributions, word embeddings, and autoencoder representations (note that the latter two have not been used for data selection as far as we are aware). We consider three domain similarity metrics: Jensen-Shannon divergence, cosine similarity, and proxy $\mathcal{A}$ distance. We finally employ three different data selection levels: domain level, training instance level, and instance subset level.

Our contributions are the following:

  1. We present -- to our knowledge -- the first extensive study of data selection strategies for the task of sentiment analysis.

  2. We consider novel representations and metrics for data selection and present strategies that consistently outperform random and balanced baselines on two large-scale multi-domain sentiment analysis datasets.

  3. We propose a data selection method that is well-suited to the realistic setting of unknown or ill-defined domains and consistently outperforms instance-level selection.

  4. We present guidelines for data selection for sentiment analysis in the wild.

2 Related work

Domain adaptation. Domain adaptation has a long history of research: Blitzer et al. (2006) proposed a structural correspondence learning algorithm. Daumé III (2007) introduced a kernel function that maps source and target domain data to a space that encourages in-domain similarity, while Pan et al. (2010) proposed a spectral feature alignment algorithm to align domain-specific words into meaningful clusters. Glorot et al. (2011) employed stacked Denoising Autoencoders to extract meaningful representations, which Chen et al. (2012) extended to a marginalized version to address their high computational cost. Zhuang et al. (2015) proposed to use deep auto-encoders for transfer learning, while Zhou et al. (2016) transferred the source examples to the target domain and vice versa using Bi-Transferring Deep Neural Networks.

Multi-domain adaptation. For domain adaptation from multiple sources, Mansour (2009) proposed a distribution weighted hypothesis with theoretical guarantees. Duan et al. (2009) proposed a method to learn a least-squares SVM classifier by leveraging source classifiers, while Chattopadhyay et al. (2012) assign pseudo-labels to the target data. Yang and Eisenstein (2015) embed meta-data attributes of domains, which, however, are not available when domains are unknown. Wu and Huang (2016) exploit general sentiment knowledge and word-level sentiment polarity relations for multi-source domain adaptation, while Ruder et al. (2017) use a distillation approach to adapt knowledge from source domain teachers to a target domain student.

Domain similarity metrics. Blitzer et al. (2007) show that proxy $\mathcal{A}$ distance can be used to measure the adaptability between two domains in order to determine which examples should be annotated. Van Asch and Daelemans (2010) find that Rényi divergence outperforms other metrics in predicting POS tagging accuracy on a target domain. Plank and van Noord (2011), however, find that instance-level selection based on Jensen-Shannon divergence and topic models performs best for parsing data selection.

In contrast, we consider different levels of data selection and show that both selecting the most similar domain and the most similar subsets outperform instance-level selection for sentiment analysis. Remus (2012) also uses Jensen-Shannon divergence to select training examples for sentiment analysis. Finally, Wu and Huang (2016) propose a domain similarity metric based on a sentiment graph. Their measure, however, is only applicable if domains are clearly defined or known in advance, which is often not the case in the real world.

3 Data selection strategies

3.1 Representations

Term distributions. The relative frequency distributions of terms in the vocabulary have been successfully used to gauge similarity with respect to a target domain Plank and van Noord (2011); Wu and Huang (2016). The underlying assumption is that similar domains have more terms in common than dissimilar domains. The term distribution of a domain $D$ is a vector $t \in \mathbb{R}^{|V|}$, where $t_i$ is the probability of the $i$-th word in the vocabulary $V$ appearing in $D$. Term distributions, however, only capture superficial occurrence statistics, which likely cannot express a more nuanced spectrum of domain similarity.
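As a minimal sketch of the term distribution described above, the following computes relative term frequencies over a collection of tokenized documents. The function name `term_distribution` and the toy data are illustrative assumptions, not taken from the paper; out-of-vocabulary tokens are simply ignored here.

```python
from collections import Counter


def term_distribution(documents, vocabulary):
    """Relative frequency of each vocabulary term over tokenized documents."""
    vocab = set(vocabulary)
    # Count only in-vocabulary tokens across all documents of the domain.
    counts = Counter(tok for doc in documents for tok in doc if tok in vocab)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[term] / total for term in vocabulary]


# Toy usage: two tokenized documents and a three-word vocabulary.
docs = [["good", "phone"], ["good", "battery", "unknown"]]
vocab = ["good", "phone", "battery"]
print(term_distribution(docs, vocab))
```

The same function can be applied to a single document or to a random subset of documents, which is what the different selection levels later in this section vary.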

Word embeddings. Word embeddings have been used to capture more fine-grained notions of similarity both of words Mikolov et al. (2013) and of sentences Wieting et al. (2016), but have not been considered for modeling domain similarity. In line with previous work, we use a weighted sum of pre-trained word embeddings to represent each document. In particular, we discount frequent words by weighting the embedding of each word occurring in the document with the word’s smoothed inverse probability, which is computed from the probability of the word appearing in the domain and a smoothing factor set following Mikolov et al. (2013). The representation of a domain is then simply the mean of its document representations.
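The weighting scheme can be sketched as follows. The exact functional form of the smoothed inverse probability is not recoverable from the text here, so the weight `a / (a + p(w))` and the default `a = 1e-3` are assumptions chosen only for illustration; the function names and toy vectors are hypothetical as well.

```python
def document_embedding(tokens, embeddings, word_prob, a=1e-3):
    """Weighted sum of word vectors; frequent words get smaller weights."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for tok in tokens:
        if tok not in embeddings:
            continue  # skip words without a pre-trained vector
        # Assumed smoothing form: a / (a + p(w)); not confirmed by the source.
        weight = a / (a + word_prob.get(tok, 0.0))
        for i, x in enumerate(embeddings[tok]):
            vec[i] += weight * x
    return vec


def domain_embedding(doc_vectors):
    """Domain representation: mean of its document representations."""
    n = len(doc_vectors)
    return [sum(col) / n for col in zip(*doc_vectors)]


# Toy usage with two-dimensional "embeddings".
embeddings = {"the": [1.0, 0.0], "great": [0.0, 1.0]}
word_prob = {"the": 0.1, "great": 0.001}
doc_vec = document_embedding(["the", "great"], embeddings, word_prob)
print(doc_vec)
```

In this toy case the frequent word "the" contributes far less to the document vector than the rarer word "great", which is the intended discounting effect.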

Autoencoder representations. Denoising autoencoders have been successfully used in recent work on domain adaptation Glorot et al. (2011); Zhuang et al. (2015). Their representations are typically created to be domain-invariant, but might still capture information that is beneficial for modeling domain similarity. We train a denoising autoencoder on the data of all domains and extract the representation for each document. To obtain the representation of a domain, we take the mean of its document representations.

Other representations. In the past, topic distributions have also been used and been found to be successful for part-of-speech tagging Plank and van Noord (2011). While topic distributions have proven to be convincing features for sentiment analysis Lu et al. (2011), in our experiments, we have found them not to be effective for selecting suitable training examples. We attribute this to the fact that topical similarity only inadequately captures the nuances that make up the notion of similarity for sentiment analysis.

3.2 Domain similarity metrics

(a) Domain-level selection
(b) Instance-level selection
(c) Subset-level selection
Figure 1: Different data selection levels. Target domain examples are indicated by green squares and domain representations (cluster centroids) are indicated by yellow stars. The examples selected by each strategy are marked as solid. Subset-level selection leads to a more diverse training set.

Jensen-Shannon divergence. Jensen-Shannon divergence is one of the most frequently used measures of domain similarity Remus (2012) and has been shown to outperform other similarity metrics Plank and van Noord (2011). Jensen-Shannon divergence is a smoothed, symmetric variant of KL divergence. The Jensen-Shannon divergence between two different probability distributions $P$ and $Q$ can be written as $D_{JS}(P \| Q) = \frac{1}{2} \left[ D_{KL}(P \| M) + D_{KL}(Q \| M) \right]$ where $M = \frac{1}{2}(P + Q)$, i.e. the average distribution of $P$ and $Q$, and $D_{KL}$ is the KL divergence:

$$D_{KL}(P \| Q) = \sum_{i} p_i \log \frac{p_i}{q_i}$$
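A small sketch of the Jensen-Shannon divergence between two discrete distributions, using the standard definition via the average distribution and the KL divergence. The function names are illustrative, and terms with zero probability are skipped, following the usual convention that 0 log 0 = 0.

```python
import math


def kl_divergence(p, q):
    """KL divergence between discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and always finite."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # average distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy usage on two term distributions over a three-word vocabulary.
p = [0.5, 0.25, 0.25]
q = [0.1, 0.6, 0.3]
print(js_divergence(p, q))
```

Because the average distribution is non-zero wherever either input is non-zero, the divergence stays finite even for sparse term distributions. A lower value indicates more similar domains.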


Cosine similarity. Cosine similarity is traditionally used to measure the similarity between vectors, in particular word embeddings Mikolov et al. (2013). The cosine similarity between two vectors $v$ and $w$ is:

$$\cos(v, w) = \frac{v \cdot w}{\|v\| \, \|w\|}$$
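For completeness, a minimal sketch of cosine similarity on plain Python lists. Returning 0.0 for a zero vector is a convenience assumption made here, not something the paper specifies.

```python
import math


def cosine_similarity(v, w):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_v == 0 or norm_w == 0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_v * norm_w)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```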


Proxy $\mathcal{A}$ distance. The $\mathcal{A}$ distance Ben-David et al. (2007) aims to identify the subset $A$ in a family of subsets $\mathcal{A}$ on which the source domain distribution $D_S$ and the target domain distribution $D_T$ differ the most and is defined as follows:

$$d_{\mathcal{A}}(D_S, D_T) = 2 \sup_{A \in \mathcal{A}} \left| \Pr_{D_S}[A] - \Pr_{D_T}[A] \right|$$

In practice, the Huber loss is used as a proxy for the $\mathcal{A}$ distance Blitzer et al. (2007) but is generally only used on a domain level to gain insights with regard to the adaptability of representations. We propose to use the proxy $\mathcal{A}$ distance as a data selection metric in the multi-domain setting: We first sample as many examples from the source domains as we have target domain examples. We then assign one binary label to all source domain examples and the other to all target domain examples, and train a logistic regression classifier on the balanced binary dataset. We then use the probability of belonging to the target domain as the similarity score for each source domain example.
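The selection procedure described in this paragraph can be sketched with a small hand-written logistic regression. This is a sketch under stated assumptions rather than the authors' implementation: the label encoding (0 for source, 1 for target), the plain SGD training loop, its hyperparameters, and all function names are illustrative choices.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, epochs=200, lr=0.1):
    """Plain stochastic gradient descent on the logistic loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def proxy_similarity_scores(source, target, seed=0):
    """Score every source example by its predicted P(target domain)."""
    rng = random.Random(seed)
    # Balance the binary dataset: as many source as target examples.
    sampled = rng.sample(source, min(len(source), len(target)))
    X = sampled + target
    y = [0] * len(sampled) + [1] * len(target)  # assumed encoding
    w, b = train_logistic_regression(X, y)
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) for x in source]


# Toy usage: six source examples far from the target, two close to it.
far = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05], [0.0, 0.2]]
near = [[1.0, 1.0], [1.02, 0.98]]
target = [[1.0, 0.9], [1.1, 1.0], [0.9, 1.0], [1.0, 1.1]]
scores = proxy_similarity_scores(far + near, target)
print([round(s, 2) for s in scores])
```

Source examples that the classifier cannot tell apart from target examples receive high scores and would be selected first.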

Other similarity metrics. Another similarity metric that has been successfully used in Machine Translation is a sentence’s perplexity as determined by a language model trained on the target domain Duh et al. (2013). However, in our experiments, perplexity proved unsuitable for data selection for sentiment analysis.

3.3 Data selection level

Finally, domain similarity can be measured on different levels as can be seen in Figure 1: on the level of the entire domain (Figure 1a); on the level of a single training example (Figure 1b); or on a level that mediates between the two extremes and introduces diversity in the data selection (Figure 1c).

Domain level. Often, such as in the case of product reviews, domains are clearly delimited and documents in different domains are clearly distinct from one another. We can thus adopt these human-assigned labels and compute the similarity of each source domain with regard to the target domain. We then sample training examples only from the most similar source domain.
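Domain-level selection can be sketched as follows, assuming each example is already a feature vector and each domain is represented by the mean of its examples. The negative squared distance is only a stand-in similarity for this toy example (any of the metrics from Section 3.2 could be passed instead), and the function names are hypothetical.

```python
import random


def mean_vector(vectors):
    """Domain representation as the mean of its example vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def neg_squared_distance(u, v):
    """Stand-in similarity: larger (closer to zero) means more similar."""
    return -sum((a - b) ** 2 for a, b in zip(u, v))


def select_from_most_similar_domain(source_domains, target_repr, n,
                                    similarity=neg_squared_distance, seed=0):
    """Pick the most similar source domain, then sample n of its examples."""
    best = max(
        source_domains,
        key=lambda name: similarity(mean_vector(source_domains[name]), target_repr),
    )
    examples = source_domains[best]
    rng = random.Random(seed)
    return best, rng.sample(examples, min(n, len(examples)))


# Toy usage with two labeled source domains.
source_domains = {
    "kitchen": [[1.0, 1.0], [0.9, 1.1]],
    "books": [[5.0, 5.0], [4.0, 6.0]],
}
best, selected = select_from_most_similar_domain(source_domains, [1.0, 1.2], n=1)
print(best, selected)
```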

Training instance level. In other scenarios, such as on the web, there is no clear distinction between different domains Ruder et al. (2016). In this case, the similarity with regard to the target domain can be computed for each training instance Plank and van Noord (2011). Instances are then sorted by similarity and the training examples with the highest similarity score are chosen for training.
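Instance-level selection reduces to ranking individual training instances by their similarity to the target representation and keeping the top ones. A minimal sketch, with a stand-in similarity function and hypothetical names:

```python
def neg_squared_distance(u, v):
    """Stand-in similarity: larger (closer to zero) means more similar."""
    return -sum((a - b) ** 2 for a, b in zip(u, v))


def select_most_similar_instances(instances, target_repr, n,
                                  similarity=neg_squared_distance):
    """Sort instances by similarity to the target and keep the top n."""
    ranked = sorted(instances, key=lambda x: similarity(x, target_repr), reverse=True)
    return ranked[:n]


# Toy usage: the two instances closest to the target are kept.
instances = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
top = select_most_similar_instances(instances, [2.0, 2.0], n=2)
print(top)
```

Note that this ranking never considers how the selected instances relate to each other, which is the lack of diversity that subset-level selection is meant to address.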

Instance subset level. While representations such as word embeddings have been shown to be effective at representing words or individual documents, term distributions are more apt to capture the statistics of a collection of examples, as the sparse term distribution of a single sentence or document might not contain sufficient evidence to compute an accurate notion of similarity on an instance level. To counteract this, instead of calculating the similarity for each training instance, we propose to compute the similarities for random subsets of instances. In addition, considering a subset of examples in conjunction has the advantage of diversifying the training data and thus making the trained model more robust.

At each iteration, we sample $m$ subsets of size $s$ from all source domains $D$. We retain the subset with the highest similarity with regard to the target domain at each iteration and repeat the process until we have gathered $n$ training examples. The complete procedure is shown in Algorithm 1.

procedure SubsetSelect(D, s, m, n)
    T ← empty list
    numiter ← n / s
    for i in range(numiter) do
        sample m subsets of size s from D
        compute similarity score for each subset
        sort subsets by similarity score
        get subset with highest similarity score
        T.append(subset)
        D.remove(subset)
    return T
Algorithm 1 Instance subset select
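Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' code: the parameter names, the mean-vector subset representation, the stand-in similarity, and the stopping rule (stop once enough examples are gathered or the pool is too small) are assumptions.

```python
import random


def mean_vector(vectors):
    """Subset representation as the mean of its example vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def neg_squared_distance(u, v):
    """Stand-in similarity: larger (closer to zero) means more similar."""
    return -sum((a - b) ** 2 for a, b in zip(u, v))


def subset_select(source, target_repr, subset_size, num_subsets, n,
                  similarity=neg_squared_distance, seed=0):
    """Greedily collect the most similar random subsets until n examples."""
    rng = random.Random(seed)
    pool = list(source)
    selected = []
    while len(selected) < n and len(pool) >= subset_size:
        # Sample candidate subsets as index lists into the current pool.
        candidates = [rng.sample(range(len(pool)), subset_size)
                      for _ in range(num_subsets)]
        best = max(
            candidates,
            key=lambda idx: similarity(mean_vector([pool[i] for i in idx]), target_repr),
        )
        selected.extend(pool[i] for i in best)
        # Remove the chosen examples so they cannot be selected twice.
        chosen = set(best)
        pool = [x for i, x in enumerate(pool) if i not in chosen]
    return selected[:n]


# Toy usage: 20 one-dimensional examples, target representation at 19.0.
source = [[float(i)] for i in range(20)]
picked = subset_select(source, [19.0], subset_size=2, num_subsets=5, n=6)
print(picked)
```

Because each candidate subset is scored as a whole, an individually less similar example can still be selected when it appears in a well-matching subset, which is where the added diversity comes from.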

4 Experiments

4.1 Datasets and tasks

As challenges for data selection methods vary depending on the nature of the domains to which they are applied, we evaluate our approaches on two large multi-domain sentiment analysis datasets, which seek to replicate these challenges.

Tweets+Reviews. When data selection is applied to documents on the web, methods need to be able to handle domains that are inconsistent and not clearly defined. In the extreme, domains may be unknown or non-existent. In order to emulate this diversity, we choose the data of different editions of the SemEval Twitter Sentiment Analysis task, as tweets within one domain are often heterogeneous Hovy et al. (2014). Specifically, we select the training data of SemEval-2013 Task 9 (Twitter2013-Train) and the training (Twitter2016-Train) and test data of SemEval-2016 Task 4 Subtask A. We split the latter into its sub-domains, i.e. LiveJournal2014, SMS2013, Twitter2013, Twitter2014, Twitter2014Sarcasm, Twitter2015, Twitter2016. Furthermore, in order to gauge the methods’ aptitude to handle diverse datasets, we include the laptop and restaurant reviews of SemEval-2016 Task 5 and omit sentences with conflicting sentiments. Statistics of the domains can be found in Table 1.

Dataset # of sentences # of words
LiveJournal2014 1142 15216
SMS2013 2093 31774
Twitter2013 3813 74649
Twitter2014 1853 36305
Twitter2014Sarcasm 86 1219
Twitter2015 2390 46270
Twitter2016 20632 404827
Twitter2016-Train 5443 105984
Twitter2013-Train 7947 156507
Laptop-reviews 2426 31626
Restaurant-reviews 2110 25419
Table 1: Number of sentences and number of words for different Twitter and review domains.
Dataset + - ?
apparel 1000 1000 7252
automotive 584 152 0
baby 1000 900 2356
beauty 1000 493 1391
books 1000 1000 973194
camera & photo 1000 999 5409
cell phones & service 639 384 0
computer & video games 1000 458 1313
dvd 1000 1000 122438
electronics 1000 1000 21009
gourmet food 1000 208 367
grocery 1000 352 1280
health & personal care 1000 1000 5225
jewelry & watches 1000 292 689
kitchen & housewares 1000 1000 17856
magazines 1000 970 2221
music 1000 1000 172180
musical instruments 284 48 0
office products 367 64 0
outdoor living 1000 327 272
software 1000 915 475
sports & outdoors 1000 1000 3728
tools & hardware 98 14 0
toys & games 1000 1000 11147
video 1000 1000 34180
Table 2: Number of positive (+), negative (-), and unlabeled documents (?) for the different review domains of the large Amazon reviews dataset.

Reviews. We furthermore consider the diametrically opposite setting of clearly distinct domains. For this scenario, we use the large version of the reviews dataset of Blitzer et al. (2006). The dataset consists of 25 different review domains with varying data sizes. To enforce realistic conditions where annotation is expensive, we employ only the provided datasets with small amounts of labeled data. We show statistics in Table 2.

Tasks. We evaluate our methods on the ternary sentence-level and binary review-level sentiment analysis tasks on the Tweets+Reviews and Reviews datasets, respectively.

4.2 Training details

We pre-process tweets by replacing urls, user names, and hashtags. We remove stopwords and use a vocabulary of the 10,000 most frequent words across all domains. In line with previous work, we use the raw bag-of-words unigram/bigram features pre-processed with tf-idf as input to a linear SVM classifier Blitzer et al. (2006). We use GloVe vectors Pennington et al. (2014) pre-trained on 42B tokens of the Common Crawl corpus for our word embeddings. For the auto-encoder representations, we use a denoising auto-encoder with one hidden layer and masking noise, which we train with the Adam optimizer Kingma and Ba (2015). For subset-level selection, we set the subset size and the number of subsets via a grid search over a range of values on Tweets+Reviews validation data.

We limit the number of training examples separately for Tweets+Reviews and Reviews, the latter in accordance with the conventions of Bollegala et al. (2011). In all cases, we evaluate on all data in the target domain.
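The tweet pre-processing step mentioned in this section could look roughly like the sketch below. The placeholder tokens (`<url>`, `<user>`, `<hashtag>`), the regular expressions, and the lower-casing and whitespace tokenization are assumptions for illustration; the text only states that urls, user names, and hashtags are replaced.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def preprocess_tweet(text):
    """Replace urls, user names, and hashtags, then tokenize on whitespace."""
    # URLs are replaced first so that '@' or '#' inside a URL is not touched.
    text = URL_RE.sub("<url>", text)
    text = USER_RE.sub("<user>", text)
    text = HASHTAG_RE.sub("<hashtag>", text)
    return text.lower().split()


tokens = preprocess_tweet("@bob Loved it! http://t.co/x #happy")
print(tokens)
```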

4.3 Comparison methods

Baselines. We compare against two baselines: a) We randomly sample training examples from all source domains (rand); b) we randomly sample the same number of stratified examples from all source domains (all). Since the Reviews corpus contains distinct product review domains, we also include a human-labeled baseline (H), where the model is trained on data randomly sampled from the domain that was determined most similar by 5 human annotators. We provided them with the names of the review categories and review samples and tasked them with ranking the three most similar domains for each domain, with the consensus being selected as the most similar domain.

Our methods. For each representation, we use -- due to space considerations -- the similarity metric that is most commonly used in conjunction with it, i.e. term distribution and Jensen-Shannon divergence, word embeddings and cosine similarity, and autoencoder representations and cosine similarity. For each combination, we select examples based on three levels: a) We randomly sample from the most similar domain (domain); b) we choose the most similar individual examples (ex); and c) we choose the most similar subsets of examples (subset). Due to space considerations, we apply proxy $\mathcal{A}$ distance for each representation only on the instance level, although it can be naturally combined with our subset-level selection strategy.

4.4 Results

For all datasets, we report accuracy scores averaged over 10 runs. We measure statistical significance using Student’s t-test. We provide results in Tables 3 and 4.

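Representation: Term distribution (4 columns), Word embeddings (4 columns), AE representations (4 columns)
Metric: Jensen-Shannon (3 columns), proxy $\mathcal{A}$ (1 column); Cosine similarity (3 columns), proxy $\mathcal{A}$ (1 column); Cosine similarity (3 columns), proxy $\mathcal{A}$ (1 column)
Target domain rand all domain ex subset ex domain ex subset ex domain ex subset ex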
LiveJournal2014 50.9 50.6 55.3 51.4 52.4 50.3 32.0 52.1 50.6 47.8 42.6 51.9 51.1 50.3
SMS2013 53.6 47.7 57.0 53.9 54.9 53.0 60.8 56.0 51.2 53.4 60.6 54.7 51.7 52.7
Twitter2013 57.2 57.1 60.3 57.0 57.3 55.8 60.4 56.6 57.3 57.7 61.0 56.6 56.8 58.3
Twitter2014 61.2 61.6 63.2 60.8 61.8 61.5 62.1 62.1 61.8 62.5 62.6 59.6 61.8 62.3
Twitter2014Sarcasm 44.8 46.6 46.6 47.7 47.1 51.5 36.5 46.5 45.6 48.3 48.4 45.3 41.9 53.5
Twitter2015 56.7 56.2 58.3 55.6 57.0 53.5 58.0 55.7 57.5 56.3 58.0 54.4 56.2 57.0
Twitter2016 53.8 52.6 56.3 51.3 53.7 53.4 54.7 47.1 54.1 55.4 54.9 52.2 54.2 55.2
Twitter2016-Train 50.6 50.0 49.5 51.7 50.6 45.4 51.6 52.0 51.1 47.9 50.0 46.9 48.1 51.1
Twitter2013-Train 57.0 55.8 56.6 55.6 56.8 56.4 58.4 54.8 56.1 56.5 56.5 55.2 56.5 57.5
Laptop-reviews 48.1 54.1 69.5 55.0 53.6 65.2 48.8 50.4 52.7 61.1 48.9 63.6 53.2 63.5
Restaurant-reviews 56.3 60.1 71.8 61.4 60.9 69.6 52.5 63.4 61.8 67.1 71.9 70.1 63.9 67.7
Average 53.6 53.8 58.6 54.7 55.1 56.0 52.3 54.3 54.5 55.9 55.9 55.5 54.1 57.2
Table 3: Comparison of different representations and metrics for multi-domain adaptation of tweets and review datasets for ternary sentence-level sentiment analysis. For each target domain, all other domains are available as source domains. Training examples are limited to a fixed number. Baselines are random selection (rand) and stratified selection balanced across all domains (all). Training examples are a) samples from the most similar domain according to the chosen similarity metric (domain), b) the most similar individual examples (ex), or c) the most similar subsets of examples (subset). Proxy $\mathcal{A}$ is the proxy $\mathcal{A}$ distance. Significance markers relative to the rand and all baselines, respectively, are not reproduced in this table.
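Representation: Term distribution (4 columns), Word embeddings (4 columns), AE representations (4 columns)
Metric: Jensen-Shannon (3 columns), proxy $\mathcal{A}$ (1 column); Cosine similarity (3 columns), proxy $\mathcal{A}$ (1 column); Cosine similarity (3 columns), proxy $\mathcal{A}$ (1 column)
Target domain rand all H domain ex subset ex domain ex subset ex domain ex subset ex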
apparel 81.4 79.6 78.4 83.8 78.3 81.5 78.4 74.6 78.9 79.3 81.3 84.2 79.3 81.5 80.8
automotive 80.1 83.2 79.4 79.0 75.7 81.2 83.0 83.8 82.1 83.0 84.2 78.3 81.0 81.3 81.9
baby 80.0 78.7 81.2 80.9 78.6 81.0 80.2 82.2 80.9 79.5 81.1 82.2 81.3 81.5 81.5
beauty 81.3 80.9 84.1 83.8 81.3 82.6 80.7 84.7 79.8 82.2 83.1 84.7 82.6 83.1 83.3
books 73.2 71.4 72.0 79.6 73.8 75.1 75.3 72.4 73.9 73.6 75.2 77.2 76.7 75.2 74.9
camera & photo 81.6 81.0 83.0 82.8 79.2 81.7 78.3 83.3 82.1 81.0 81.3 83.3 80.9 81.2 80.2
cell phones & service 81.6 81.9 81.9 81.4 76.3 81.1 78.4 72.8 81.5 81.5 80.3 81.6 79.9 83.3 81.3
computer & video games 80.1 80.6 72.7 75.2 79.4 82.0 80.1 72.6 79.9 81.3 78.9 79.3 79.8 82.1 80.4
dvd 76.7 75.6 80.4 80.3 77.2 78.7 79.7 81.1 76.8 78.3 79.5 81.1 80.6 79.2 79.0
electronics 79.4 78.2 68.3 82.0 75.5 80.5 79.8 82.4 79.6 79.8 81.5 79.1 80.9 80.3 81.3
gourmet food 82.3 84.4 86.8 86.8 76.3 83.4 86.8 86.8 83.2 84.3 85.7 86.8 87.3 85.7 86.3
grocery 83.5 84.2 79.9 79.9 78.5 84.5 80.8 79.9 83.1 84.4 86.2 79.9 85.7 83.8 85.1
health & personal care 80.2 78.9 81.3 80.6 76.7 79.5 76.7 81.2 80.6 79.3 77.9 81.9 80.7 79.8 76.7
jewelry & watches 84.7 86.2 82.9 82.9 80.7 86.2 87.5 83.7 83.5 86.9 86.3 84.4 86.1 87.5 88.6
kitchen & housewares 81.4 79.8 58.6 82.6 79.8 81.4 72.6 74.3 82.0 79.8 80.1 83.2 81.6 82.4 78.9
magazines 75.9 75.1 75.8 74.8 76.5 78.9 73.6 75.0 75.7 74.3 75.8 75.8 77.6 76.4 76.4
music 73.4 71.0 50.1 78.1 73.1 74.3 74.8 78.9 72.0 73.7 76.3 78.9 71.7 74.5 74.3
musical instruments 85.3 86.7 70.0 84.0 75.3 85.8 87.7 84.3 85.8 86.2 89.2 82.2 84.3 88.1 86.7
office products 82.0 83.4 85.2 82.7 72.4 81.9 85.2 85.4 85.8 84.2 85.8 77.5 79.8 83.1 83.8
outdoor living 81.9 84.0 82.0 80.2 74.4 81.8 82.7 81.6 84.3 84.1 82.7 81.6 82.8 83.7 82.7
software 80.9 79.8 71.4 81.9 78.4 81.4 82.0 83.2 81.2 80.9 82.2 83.2 79.3 82.4 82.6
sports & outdoors 80.9 80.6 73.7 82.1 79.6 82.3 71.0 82.6 81.5 81.4 80.4 83.8 80.7 82.2 79.2
tools & hardware 80.5 85.2 82.3 77.5 80.4 85.0 85.7 91.1 78.6 83.0 85.7 76.8 85.4 85.7 89.3
toys & games 79.9 78.9 69.6 82.0 78.3 79.8 79.0 82.8 78.9 79.1 80.9 82.8 82.1 79.8 82.0
video 75.2 74.2 80.8 80.8 76.3 76.1 80.5 81.3 71.4 77.4 79.1 81.3 79.8 78.3 79.8
Average 80.2 80.1 76.1 81.0 77.3 81.1 80.2 80.9 80.1 80.7 81.6 81.2 81.1 81.7 81.5
Table 4: Comparison of different representations and metrics on multi-domain adaptation for document-level binary sentiment analysis. For each target domain, all other domains are available as source domains. Training examples are limited to a fixed number. H is data selection based on the most similar domain assigned by human annotators. For the rest of the legend, see Table 3.

4.5 Discussion

Tweets+Reviews. Most methods outperform the random and balanced baselines on average. Term-distribution based domain-level selection achieves the best performance. Even though Twitter datasets are highly heterogeneous, training and test sets of the same edition of the competition have been collected in the same time frame and thus share similar topics and characteristics that are helpful for identifying the sentiment.

In contrast, the scores of the other selection strategies are similar across representations. Subset-level selection generally outperforms its instance-level counterpart. Using proxy distance as a similarity metric generally improves upon using the default similarity metric for the given representation. Particularly for term distributions, it helps to mitigate the sparsity of the representation.

Even though domain-level selection using word embeddings outperforms the baselines on the majority of domains, it performs -- on average -- more poorly. This shows the fallacy of choosing the most similar domain for training: The domain might be similar with regard to the representations of the words that are used, but the way the words are used is different. This can be seen with the sarcasm, laptops, and restaurant domains, where word embeddings choose semantically similar domains that are, however, less helpful for predicting sentiment. This also reveals that the mean of document representations might not be the best way to capture all aspects of domain similarity.

Autoencoder representations perform comparably to term distributions on the Tweets+Reviews dataset, while their domain-level selection method performs worse. We generally expected them to perform better than term distributions, but reason that this is mainly due to the scarcity of data in this multi-domain dataset.

As we have limited the number of training examples, we also investigate the behaviour of the different data selection strategies when the number of training examples increases. We show this for the LiveJournal2014 domain as an example in Figure 2. We observe that -- while all selection strategies increase their performance -- subset-level selection improves slightly more, but note that this trend might vary depending on the domain.

Figure 2: Comparison of average accuracy scores of different data selection methods on the LiveJournal2014 domain of the Tweets+Reviews dataset with term distributions as representation and an increasing number of training examples.

Reviews. For the clearly distinct review domains, both the random and balanced baselines obtain comparatively stronger results. We attribute this to the fact that it is very difficult for a model to learn from a sentence that is not relevant to the target domain as in the Tweets+Reviews dataset, while an unhelpful review might still shed light on general sentiment words; in addition, the binary setting is easier than the ternary setting.

Domain-level selection based on term distributions still performs strongly, but is outperformed by subset-level selection on the same representations, which significantly improves upon instance-level selection. In comparison to the previous dataset, word embedding-based representations outperform term distributions, which we attribute to the fact that reviews are longer and contain more topical words with meaningful word representations in comparison to the noisy social media messages.

Even though product review categories have natural-language names, identifying the domain that is most similar and thus most likely to help the prediction at test time is no trivial task, as can be seen from the abysmal performance of our human-labeled baseline (H). This underlines the need for automatic data selection for domain adaptation. It is most evident with regard to the music domain, which is naturally conceptually similar to the musical instruments domain, but employs an entirely different repertoire of sentiment words.

Another behavior that we observe is that while domain-level selection yields the best score for a domain more often than other selection strategies, its failure mode is equally polarized: it often yields one of the worst scores for a target domain if a source domain is matched whose examples are only peripherally relevant for the prediction.

Proxy $\mathcal{A}$ distance again consistently improves upon the default domain similarity metric, demonstrating that domain similarity as judged by the confidence value of a linear classifier is a suitable metric for data selection. Finally, autoencoder representations outperform all other representations, while subset-level selection based on autoencoder representations performs best. The multiplicity of domains compensates for the lack of data, which renders autoencoder representations more meaningful and gives them an edge over term representations.

4.6 Guidelines

Finally, we propose guidelines on using data selection for sentiment analysis in the wild that might be helpful for NLP practitioners:

  • DO use term distribution-based domain-level selection with Jensen-Shannon divergence as a simple baseline.

  • DO use subset-level selection instead of instance-level selection, particularly with term distributions.

  • DO use proxy distance instead of the default similarity metric.

  • DO NOT use pre-trained word embeddings for data selection on noisy data.

  • DO NOT use autoencoder representations for data selection if your dataset is small.

  • DO use autoencoder representations on large datasets.

5 Conclusion

In this paper, we have extensively studied -- for the first time as far as we are aware -- domain similarity metrics in the context of sentiment analysis. We have proposed several representations and metrics that have previously not been employed for data selection. We have introduced a novel data selection strategy that leverages subsets of examples and consistently outperforms instance-level selection. Finally, we have evaluated the proposed methods on two large-scale multi-domain adaptation settings on tweets and reviews and demonstrated that our proposed metrics and representations outperform strong random and balanced baselines, while our new selection strategy based on autoencoder representations yields the best score on the review corpus.