Disentangled Representations for Manipulation of Sentiment in Text

12/22/2017 ∙ by Maria Larsson, et al. ∙ Zenuity Sigma Chalmers University of Technology

The ability to change arbitrary aspects of a text while leaving the core message intact could have a strong impact in fields like marketing and politics by enabling e.g. automatic optimization of message impact and personalized language adapted to the receiver's profile. In this paper we take a first step towards such a system by presenting an algorithm that can manipulate the sentiment of a text while preserving its semantics using disentangled representations. Validation is performed by examining trajectories in embedding space and by analyzing transformed sentences for semantic preservation and expression of the desired sentiment shift.




1 Introduction

As we live in an increasingly digitized society, algorithms for text analysis and generation can be used for a variety of purposes and may greatly reduce manual work. A system for robust manipulation of global text properties, e.g. sentiment, is one such algorithm that could change how we work with text. Though the main purpose of a text might be to communicate a concrete message, there are an infinite number of ways the message can be phrased, each with its own set of global properties. In this paper we focus on the sentiment aspect and note that robust control over the sentiment would open up a range of new possibilities, like A/B testing of different instantiations of a message with respect to some desired measure, and personalized communication automatically adapted to the receiver’s profile. Further, the ability to generate new sentences with transformed sentiment could also be useful for data augmentation when the available data is scarce.

Recent work in text generation (Hu et al., 2017; Radford et al., 2017) has shown that it is possible to generate random sentences where the sentiment can be chosen as an input parameter. This line of research has some similarities to the problem we address in this paper, with the key difference that they generate new random sentences whereas we aim to transform existing ones. This makes the problem more difficult but also more applicable to real-world applications, as shown by the recent work of Mueller et al. (2017).

In the visual domain there has been a range of recent work that aims to transform the input image to fit different aspects, e.g. to look like a painting (Gatys et al., 2015). The method presented by Gardner et al. (2015) transforms an image to a deep feature space using a convolutional neural network (CNN). This space is then traversed towards the target features. A new image is subsequently reconstructed from the deep feature representation, but with some aspect changed from the original image. In their experiments they show that this can be used to transform a smiling portrait into an angry one and make one individual look more like someone else without changing clothing or background. The method we present in this paper is loosely based on their model, with significant changes due to the discrete nature of language.

The main contributions of this work include: (1) an algorithm that can automatically transform the sentiment of a text while leaving the semantic content largely intact, and (2) preliminary qualitative analysis of the performance with regard to (a) resulting sentiment, (b) semantic stability and (c) acceptability of the transformed text.

2 Maximum mean discrepancy

The maximum mean discrepancy (MMD) (Gretton et al., 2012) is a test statistic used to determine whether two distributions are the same. Given two distributions, $P^s$ and $P^t$, the objective of the MMD is to find a smooth function $f$ which is large for samples from $P^s$ and small for samples from $P^t$. Given such a function, the MMD is the difference between the mean function values for the two sets of samples, which can be empirically estimated as

$$\mathrm{MMD}(X^s, X^t) = \sup_{f \in \mathcal{F}} \left( \frac{1}{m} \sum_{i=1}^{m} f(x^s_i) - \frac{1}{n} \sum_{j=1}^{n} f(x^t_j) \right) \qquad (1)$$

where $X^s = \{x^s_1, \dots, x^s_m\}$ are samples drawn from the source distribution $P^s$ and $X^t = \{x^t_1, \dots, x^t_n\}$ are samples drawn from the target distribution $P^t$. The function $f$ belongs to a class, $\mathcal{F}$, of smooth functions and should be chosen so as to maximize the difference between the mean values of $f$ applied to $X^s$ and $X^t$. In both (Gretton et al., 2012) and (Gardner et al., 2015), $\mathcal{F}$ is a reproducing kernel Hilbert space, allowing comparison of multi-dimensional feature vectors. The function $f^*$ attaining the supremum in equation (1) can be empirically estimated as

$$f^*(z) = \frac{1}{m} \sum_{i=1}^{m} k(x^s_i, z) - \frac{1}{n} \sum_{j=1}^{n} k(x^t_j, z) \qquad (2)$$

where $k(\cdot, \cdot)$ is a kernel function. The method presented by Gardner et al. (2015) uses a Gaussian kernel function, $k(x, x') = \exp\left(-\|x - x'\|^2 / (2\sigma^2)\right)$, with $\sigma$ being the kernel bandwidth.
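The empirical witness function of equation (2) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the toy 2-D data and the exact Gaussian parameterization (division by `2 * sigma ** 2`) are assumptions.

```python
import math

def gaussian_kernel(a, b, sigma):
    """Gaussian kernel exp(-||a - b||^2 / (2 * sigma^2)); parameterization assumed."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def witness(z, source, target, sigma=1.0):
    """Empirical witness: mean kernel value against source minus against target."""
    mean_source = sum(gaussian_kernel(x, z, sigma) for x in source) / len(source)
    mean_target = sum(gaussian_kernel(x, z, sigma) for x in target) / len(target)
    return mean_source - mean_target

def mean_difference(source, target, sigma=1.0):
    """Difference of mean witness values over the two sample sets.

    With the unnormalized witness above, this is the biased estimate
    of the squared MMD.
    """
    on_source = sum(witness(x, source, target, sigma) for x in source) / len(source)
    on_target = sum(witness(x, source, target, sigma) for x in target) / len(target)
    return on_source - on_target

# Toy 2-D "feature vectors" standing in for the two sample sets.
source = [[0.0, 0.0], [0.2, -0.1], [-0.1, 0.1]]
target = [[3.0, 3.0], [3.1, 2.9], [2.8, 3.2]]

print(witness(source[0], source, target))  # positive: looks like a source sample
print(witness(target[0], source, target))  # negative: looks like a target sample
print(mean_difference(source, target))     # positive for well separated sets
```

The sign of the witness value is what the traversal exploits: lowering it moves a vector away from the source samples and towards the target samples.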

3 Model

The problem we are addressing can be split into three different subtasks. The first task is representing sentences in a continuous space. The second task is exploiting the sentence representation and traversing the manifold in such a way that the sentiment changes. The third task is generating a sentence from the representation space. Our model uses a CNN for sentence encoding. The encoded vectors are subsequently traversed using the MMD statistic and finally decoded using a recurrent neural network (RNN).

3.1 Encoding sentences

A sentence is represented as a matrix where the rows correspond to the 300-dimensional word2vec (Mikolov et al., 2013) word embeddings of the words in the sentence. This matrix is given as input to a CNN trained for binary sentiment classification. The network consists of one convolutional layer, one max-pooling layer and finally one fully connected feed-forward layer. The convolutional layer uses four filter heights, and the filter width is 300. With 75 filters per height, this results in a total of 300 filters. The pooling layer therefore outputs a 300-dimensional feature vector, denoted $x$. This feature vector is extracted from the CNN, along with the predicted label, and used as the encoding of the input sentence.
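To make the dimensions concrete, the following sketch runs a convolution with max-over-time pooling on a random sentence matrix and checks that the encoding has 300 entries. It is only a shape illustration under stated assumptions: the filter heights `(2, 3, 4, 5)` are hypothetical placeholders, the weights are random rather than trained, and bias terms and non-linearities are omitted.

```python
import random

EMBED_DIM = 300                # word2vec dimensionality and filter width
FILTERS_PER_HEIGHT = 75        # filters per filter height
FILTER_HEIGHTS = (2, 3, 4, 5)  # hypothetical heights, chosen only for illustration

def make_filters(rng):
    """One bank of random filters per height; each filter is height x EMBED_DIM."""
    return {
        h: [[[rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)] for _ in range(h)]
            for _ in range(FILTERS_PER_HEIGHT)]
        for h in FILTER_HEIGHTS
    }

def encode(sentence, filters):
    """Slide every filter over the word rows and keep its maximum response."""
    features = []
    for height, bank in filters.items():
        windows = [sentence[i:i + height]
                   for i in range(len(sentence) - height + 1)]
        for filt in bank:
            responses = [
                sum(w * v
                    for f_row, s_row in zip(filt, window)
                    for w, v in zip(f_row, s_row))
                for window in windows
            ]
            features.append(max(responses))  # max-over-time pooling
    return features

rng = random.Random(0)
filters = make_filters(rng)
# A random six-word "sentence": one 300-dimensional row per word.
sentence = [[rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(6)]
x = encode(sentence, filters)
print(len(x))  # 300 = 4 heights * 75 filters
```

Because each filter contributes exactly one pooled value, the length of the encoding depends only on the number of filters and not on the sentence length.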

In addition to classifying sentiment, the CNN needs to encode information about the topic and semantics of the sentence. Therefore, it is trained together with the RNN. Initially, the sentiment classification task is disregarded and the joint networks are trained for encoding and decoding unlabeled sentences. The loss for this task is measured by calculating the cross-entropy error between the predicted word, $\hat{w}_t$, at position $t$ in the generated sentence and the actual word, $w_t$, at the same position in the original sentence. After this initial training phase, the CNN is trained on binary sentiment classification. The classification loss is calculated as the cross-entropy error between the predicted label and the true label for each sentence. This loss is added to the text generation loss, producing a total loss which is used to update the weights in both networks. A schematic of the training procedure is shown in figure 1.
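The combined objective can be illustrated with a small sketch. The function names and toy probabilities are assumptions, and averaging the generation loss over word positions is one plausible choice that the text does not specify; only the unweighted sum of the two losses is taken from the description.

```python
import math

def cross_entropy(probs, target_index):
    """Cross-entropy for one prediction: negative log-probability of the target."""
    return -math.log(probs[target_index])

def generation_loss(step_probs, target_words):
    """Cross-entropy between predicted and actual words, averaged over positions
    (the averaging is an assumption)."""
    losses = [cross_entropy(p, w) for p, w in zip(step_probs, target_words)]
    return sum(losses) / len(losses)

def total_loss(label_probs, true_label, step_probs, target_words):
    """Unweighted sum of the classification loss and the generation loss."""
    return (cross_entropy(label_probs, true_label)
            + generation_loss(step_probs, target_words))

# Toy example: two sentiment classes, a two-word vocabulary, two positions.
label_probs = [0.25, 0.75]              # predicted P(negative), P(positive)
step_probs = [[0.5, 0.5], [0.1, 0.9]]   # predicted word distribution per position
target_words = [0, 1]                   # indices of the actual words

print(total_loss(label_probs, 1, step_probs, target_words))
```

In the initial training phase described above, only the `generation_loss` term would be used; the classification term is added afterwards.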

Figure 1: During training, the CNN and RNN are updated using the unweighted sum of the loss for sentiment classification and for text generation.
Figure 2: Different icons distinguish feature vectors by sentiment and topic. Bold faced points are examples of original and traversed vectors.

3.2 Traversal of the representation space

Since the CNN is trained on binary sentiment classification, two separable distributions of feature vectors are generated. The MMD statistic can be used to traverse a vector originating from one of these distributions to the other. The result of the traversal is a vector that resembles the encoding of a sentence with the opposite sentiment.

When moving the feature vector by minimizing equation (2), the semantics of the original sentence may be lost if $x$ is moved too far along the manifold. To control how far $x$ is moved from its original position, a budget of change (Gardner et al., 2015), $\lambda$, is used. A source and a target set of sentence representations are created. The source set, $X^s$, contains feature vectors for sentences with the same sentiment as $x$, and the target set, $X^t$, contains feature vectors for sentences with the opposite sentiment. From these sets and the original vector, a matrix $V$ is formed. The traversed feature vector, $z$, can then be expressed in terms of $V$ and a coefficient vector $\delta$, so that equation (2) can be rewritten as an objective over $\delta$. The minimization over $\delta$ uses the BFGS algorithm (Battiti, 1990) and is constrained by the budget of change, which is enforced in the last term of that objective.
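The effect of the budget of change can be sketched on toy data. Several simplifications are assumptions rather than the method described above: the sketch uses plain gradient descent instead of BFGS, optimizes the traversed vector directly instead of a coefficient vector, and enforces the budget with a squared-distance penalty to the original vector. The bandwidth, step size and all names are likewise illustrative.

```python
import math

def kernel(a, b, sigma):
    """Gaussian kernel; the 2 * sigma^2 parameterization is an assumption."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def witness_gradient(z, source, target, sigma):
    """Gradient of the empirical witness function with respect to z."""
    grad = [0.0] * len(z)
    for points, weight in ((source, 1.0 / len(source)),
                           (target, -1.0 / len(target))):
        for p in points:
            k = kernel(p, z, sigma)
            for d in range(len(z)):
                grad[d] += weight * k * (p[d] - z[d]) / sigma ** 2
    return grad

def traverse(x, source, target, budget, sigma=2.0, lr=0.2, steps=200):
    """Minimize witness(z) + budget * ||z - x||^2 by gradient descent.

    The penalty form is an assumed stand-in for the budget of change.
    """
    z = list(x)
    for _ in range(steps):
        g = witness_gradient(z, source, target, sigma)
        z = [zd - lr * (gd + 2.0 * budget * (zd - xd))
             for zd, gd, xd in zip(z, g, x)]
    return z

def distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

source = [[0.0, 0.0], [0.2, 0.1], [-0.1, -0.2]]  # same sentiment as x
target = [[3.0, 0.0], [3.2, 0.1], [2.9, -0.1]]   # opposite sentiment
x = [0.1, 0.0]

z_loose = traverse(x, source, target, budget=0.01)  # weak constraint
z_tight = traverse(x, source, target, budget=1.0)   # strong constraint
print(distance(z_loose, x), distance(z_tight, x))
```

With a weak penalty the vector travels most of the way to the target cluster, while a strong penalty keeps it close to its starting point. This is the trade-off between changing sentiment and preserving the original content.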

3.3 Decoding sentences

The traversed feature vector, $z$, is given as input to an RNN trained for generating text. In addition to $z$, the RNN receives a start-of-sentence token as input in the first time step. For each time step, the RNN outputs the most probable word and gives this word as input to the next time step. When the most probable word is an end-of-sentence token, the generation of words is terminated. The RNN consists of a single-layer GRU cell with a state size of 300. The weight matrix for the input consists of the 300-dimensional word2vec word embeddings for the words in the vocabulary.
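The greedy decoding loop described above can be sketched independently of the network itself. Here `step_fn` is a hypothetical stand-in for one GRU time step, the token strings and the `max_len` safety cap are assumptions, and passing the feature vector as the initial state is only one possible way of feeding it to the RNN.

```python
def greedy_decode(step_fn, initial_state, sos="<s>", eos="</s>", max_len=50):
    """Feed the most probable word back in until the end token is produced."""
    words = []
    state, token = initial_state, sos
    for _ in range(max_len):            # cap added here to guarantee termination
        probs, state = step_fn(token, state)
        token = max(probs, key=probs.get)   # most probable next word
        if token == eos:
            break
        words.append(token)
    return words

# Toy stand-in for the trained RNN: the next-word distribution depends only
# on the previous word, and the state is passed through unchanged.
TABLE = {
    "<s>": {"good": 0.7, "bad": 0.3},
    "good": {"movie": 0.9, "</s>": 0.1},
    "movie": {"</s>": 0.8, "good": 0.2},
}

def toy_step(token, state):
    return TABLE[token], state

print(greedy_decode(toy_step, initial_state=None))  # ['good', 'movie']
```

Since only the single most probable word is kept at each step, an early mistake cannot be revised later, which may contribute to the repetitions visible in some generated sentences.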

4 Experiments and results

The initial encoding and decoding training uses the large movie review dataset v1.0 (Maas et al., 2011), disregarding the labels. The networks are then trained on three sentiment-labelled data sets. The first set is the movie review sentence polarity data set v1.0 (https://www.cs.cornell.edu/people/pabo/movie-review-data/) (Pang and Lee, 2005), which consists of 10 662 labelled movie-review sentences from www.rottentomatoes.com. The second set (https://archive.ics.uci.edu/ml/machine-learning-databases/00331/) contains 500 reviews for cell phones and accessories from Amazon, 500 reviews for restaurants from Yelp and 500 movie reviews from IMDB (Kotzias et al., 2015). These two sets have equal amounts of positive and negative sentences. The third set is a subset of 923 positive and 1320 negative sentences from a data set (https://github.com/oscartackstrom/sentence-sentiment-data) containing product reviews from various online sources (Täckström and McDonald, 2011). The three data sets are randomly divided 90%-10% into a training and a test set. The training set is used for updating the weights of the networks during training and is divided into batches of 64 sentences. The test set is used for evaluating the accuracy of the networks periodically during training.
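A random 90%-10% split followed by batching can be sketched as follows. The function name, the fixed seed and the rounding of the test-set size are assumptions; the example applies the split to a list the size of the first data set only.

```python
import random

def split_and_batch(sentences, test_fraction=0.1, batch_size=64, seed=0):
    """Shuffle, hold out a test fraction, and cut the rest into batches."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)   # rounding is an assumption
    test, train = shuffled[:n_test], shuffled[n_test:]
    batches = [train[i:i + batch_size]
               for i in range(0, len(train), batch_size)]
    return batches, test

# Placeholder items standing in for the 10 662 sentences of the first data set.
sentences = [f"sentence {i}" for i in range(10662)]
batches, test = split_and_batch(sentences)
print(len(test), len(batches), len(batches[-1]))  # 1066 150 60
```

Under these assumptions the last batch is smaller than 64, so a real training loop would need to either accept a short final batch or drop it.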

4.1 Preserving semantics

In order to evaluate whether the encodings from the CNN contain information about sentiment and semantics, feature vectors for the sentences with different sentiments and topics are visualized. These visualizations also serve as an aid for assessing whether the content in a sentence is preserved in the traversal. The feature vectors are reduced from 300 to 2 dimensions using principal component analysis (PCA) and the visualizations are made using the first two principal components.

The chosen topics were sentences containing either the word phone or the word movie, because such sentences would likely have little correlation, in contrast to, for example, sentences containing either comedy or drama. Negative sentences containing the word movie and positive sentences containing the word phone were traversed. The optimization of the MMD was set up with 90 positive examples and 90 negative examples for the source and target sets, and e. The examples consisted of an equal number of sentences containing the word movie and sentences containing the word phone. The topics of the sentences were not used for the traversal but were needed when visualizing the results.

The results are shown in figure 2. It is seen that a vector representing a positive sentence containing movie is moved so that the resulting vector lies within the cluster of negative sentences containing movie. In the same way, a vector representing a negative sentence containing phone is moved so that the resulting vector lies within the cluster of positive sentences containing phone. This behaviour suggests that the context and semantics may be preserved during the traversal.

Since the manifold traversal is made using two sets of examples, source and target feature vectors, the traversed feature vector will more closely resemble the sentences in the target set. This means that if we traverse the manifold for a sentence with a different topic than the sentences in the source and target sets, the traversed vector might not preserve the topic of the original sentence.

Original: unfortunately , this is a bad movie that is just plain bad
From $x$: unfortunately , this is a bad movie that is just plain bad
From $z$: overall , this is a good movie that is just good

Original: one of the oddest and most inexplicable sequels in movie history
From $x$: most of the oddest and most strange movie in history history
From $z$: most interesting and most wonderful movie in one of the oddest ways

Original: still , i do like this movie for it’s empowerment of women there ’s not enough movies out there like this one
From $x$: still , i do like this movie for one of adults ’s not like enough like ages out there ’s no women
From $z$: still , i do not like this movie ’s not one of adults for no people who do not like this

Original: i highly recommend this movie for anyone interested in art , poetry , theater , politics , or japanese history
From $x$: i highly recommend this movie , interested for poetry , poetry , poetry , interested in history , or interested history
From $z$: i highly recommend this movie , except for anything , in any movie , not n’t interested in any crappy movie

Table 1: Regenerated (from $x$), and traversed and generated (from $z$), sentences compared to the original.

4.2 Analysis of transformed sentences

There exists no single correct output for the manifold traversal. For example, given the negative sentence “The food did not taste well”, both “The food was amazing” and “I liked the food” are valid outputs that reverse the sentiment. Therefore, scores and measures used for other NLP tasks, like BLEU (Papineni et al., 2002) for machine translation, are difficult to apply to the manifold traversal. Instead we focus on qualitative evaluation. The encoding-decoding, and the model as a whole, are evaluated by generating sentences from the feature vectors $x$ (representing the original sentence) and $z$ (the traversed vector), respectively. The generated sentences are manually compared to the original. Ideally, the sentence generated from $x$ should closely resemble the original sentence, while the sentence generated from $z$ should have the same context as the original sentence but the opposite sentiment. Table 1 shows some of the better examples of sentences generated by the trained RNN. The overall impression is that, despite poor grammar, the model works well in terms of changing sentiment. We see that the generated sentences have the same topic as the original and that they are composed mostly of the same words. It is also found that shorter sentences are more easily encoded and decoded.

5 Conclusion

An algorithm for sentiment manipulation was presented and evaluated. Visualizations of the embedding space indicate that sentence representations can be moved such that the sentiment changes while the semantics is preserved. Further, examination of generated sentences from manipulated embeddings confirmed that the sentiment had changed while the semantics and acceptability had stayed largely constant.


The authors would like to acknowledge the project Towards a knowledge-based culturomics supported by a framework grant from the Swedish Research Council (2012–2016; dnr 2012-5738).