Context-aware Natural Language Generation with Recurrent Neural Networks

11/29/2016 ∙ by Jian Tang, et al. ∙ University of Michigan ∙ Peking University

This paper studies generating natural language in particular contexts or situations. We propose two novel approaches that encode the contexts into a continuous semantic representation and then decode the semantic representation into text sequences with recurrent neural networks. During decoding, the context information is attended to through a gating mechanism, which addresses the problem of long-range dependency caused by lengthy sequences. We evaluate the effectiveness of the proposed approaches on user review data, in which rich contexts are available; two informative contexts, sentiments and products, are selected for evaluation. Experiments show that the fake reviews generated by our approaches are very natural. In fake review detection with human judges, more than 50% of the fake reviews are misclassified as real reviews, and more than 90% are misclassified by an existing state-of-the-art fake review detection algorithm.





Figure 1: Examples of reviews generated by our approach. Only the sentiment rating and product id are fed to the algorithm.

Natural language generation is potentially useful in a variety of applications such as natural language understanding [Graves2013], response generation in dialogue systems [Wen et al.2015a, Wen et al.2015b, Sordoni et al.2015], text summarization [Rush, Chopra, and Weston2015], machine translation [Bahdanau, Cho, and Bengio2014] and image captioning [Xu et al.2015]. Traditional approaches usually generate language according to rules or templates designed by humans [Cheyer and Guzzoni2014], which are specific to particular tasks and domains and difficult to generalize to others. Besides, the text generated by these approaches is very uniform, lacking the large variation of human language. It is therefore a long-standing goal of the community to develop automatic approaches that learn from data and generate language as diverse as human language.

Recently, recurrent neural networks (RNNs) have proved very effective in natural language generation [Graves2013, Sutskever, Martens, and Hinton2011, Bowman et al.2015]. Compared with the traditional approaches, RNNs directly model the generation process of text sequences, and the generating function can be learned automatically from a large amount of text data, providing an end-to-end solution. Though traditional RNNs suffer from the problem of vanishing or exploding gradients, the long short-term memory (LSTM) unit [Hochreiter and Schmidhuber1997] effectively addresses this problem and is able to capture long-range dependencies in natural language. RNNs with LSTM have shown very promising results on various data sets with different structures, including Wikipedia articles [Graves2013], Linux source code [Karpathy, Johnson, and Li2015], scientific papers [Karpathy, Johnson, and Li2015] and NSF abstracts [Karpathy, Johnson, and Li2015].

Most existing work focuses on generating natural language from textual content alone while ignoring contextual information. In reality, however, natural language is usually generated in particular contexts, e.g., time, locations, emotions or sentiments. Therefore, in this paper we study context-aware natural language generation. Our goal is to generate sentences that are not only semantically and syntactically coherent but also reasonable in particular contexts. Indeed, contexts have proved very useful for various natural language processing tasks such as topic extraction [Mei et al.2006], text classification [Cao et al.2009] and language modeling [Mikolov and Zweig2012].

We propose two novel approaches for context-aware natural language generation, which map a set of contexts to text sequences. Our first model, C2S, encodes a set of contexts into a continuous representation and then decodes the representation into a text sequence through a recurrent neural network. The C2S model is able to generate semantically and syntactically very coherent sentences. However, one limitation is that when the sequences become very long, the information from the contexts may not be able to propagate to the words in distant positions. An intuitive way to address this is to build a direct dependency between the contexts and the words in the sequences, allowing the information to jump from the contexts to the words. However, not all words depend on the contexts; some may depend only on their preceding words. To resolve this, a gating mechanism is introduced to control when the information from the contexts is accessed. This is our second model: gated contexts to sequences (gC2S).

We evaluate our approaches on user reviews from Amazon and TripAdvisor, where rich contexts are available. Two informative contexts are selected: sentiment rating and product id. Fig. 1 presents two examples of reviews generated by gC2S, which are very difficult to distinguish from reviews written by real users. We choose the task of fake review detection to evaluate the effectiveness of our approach. Experimental results show that more than 50% of the fake reviews generated by our approach are misclassified as real reviews by human judges, and more than 90% are misclassified by an existing state-of-the-art fake review detection algorithm.

Related Work

The approaches to natural language generation can be roughly classified into two categories: the classical rule-based or template-based approaches, and recent approaches with recurrent neural networks, which automatically learn the natural language generator from data. Classical approaches usually rely on rules or templates defined by humans [Cheyer and Guzzoni2014], which are very brittle and hard to generalize to different tasks and domains. Though some recent approaches aim to learn the template structures from large corpora [Oh and Rudnicky2000], the training data is very expensive to obtain and the final generation process still requires additional handcrafted features.

Compared with the traditional rule-based approaches, approaches based on recurrent neural networks do not rely on any handcrafted features and provide an end-to-end solution. Our approach is also built on recurrent neural networks. [Graves2013] studied sequence generation, including text, using recurrent neural networks (RNNs) with the long short-term memory unit. [Sutskever, Martens, and Hinton2011] proposed a multiplicative RNN (MRNN) for text prediction and generation, in which different transformation functions between the hidden states are used for different input characters. [Bowman et al.2015] investigated generating sentences from continuous semantic spaces with a variational auto-encoder, in which an RNN is used for both the encoder and the decoder. These works have shown that RNNs are very effective for text generation on various data sets of different structures. However, all of them study natural language generation without contexts.

Some recent work investigates language modeling with context information. [Mikolov and Zweig2012] studied language modeling by adding the topical features of preceding words as contexts. [Wang and Cho2015] exploited the preceding words in larger windows for language modeling. These works focus on the task of language modeling and use the preceding words as context information, while our task focuses on natural language generation and studies external contexts. There is also related work on response generation in conversation systems [Sordoni et al.2015, Wen et al.2015b], in which the conversation history is treated as the context. Compared with that work, our solutions are more general and applicable to a variety of contexts, while their solutions are specifically designed for contexts with specific structures.

Problem Definition

Natural languages are usually associated with rich context information, e.g., time, location, which provide clues on how the natural languages are generated. In this paper, we study context-aware natural language generation. Given the context clues, we want to generate the corresponding natural languages. We first formally define the contexts as follows:

Definition 1

(Contexts.) The contexts of natural languages refer to the situations in which they are generated. Each context is defined as a high-dimensional vector.

The contexts of natural languages can be either discrete or continuous features. For example, the context can be a specific user or location; it can also be a continuous feature vector generated from other sources. For discrete features, the context is usually represented with a one-hot representation. Formally, we formulate our problem as follows:

Definition 2

(Context-aware natural language generation.) Given a set of contexts $C = \{c_i\}_{i=1}^{K}$, in which $K$ is the total number of context types, our goal is to generate a sequence of words $x_1, x_2, \ldots, x_T$ that is appropriate in the given contexts.

In this paper, we take user reviews as an example, for which abundant context information exists, e.g., user, time, sentiment, product. However, our proposed approaches are also applicable to other contextual data. Next, we introduce our approach for context-aware natural language generation.


In this section, we introduce our proposed approaches for generating natural language at particular contexts. We first introduce the recurrent neural networks (RNN), which are very effective models for text generation, and then introduce our proposed approaches, which map a set of contexts to a text sequence.

(a) RNN for sequence modeling
(b) C2S: Contexts to Sequences
(c) gC2S: Gated Contexts to Sequences
Figure 2: (a): classical RNN model for text modeling; (b): our first approach in which multi-layer neural networks are encoders and RNN are decoders; (c): our second approach which adds skip-connections from contexts to words, controlled by a gating function.

Recurrent Neural Network

A recurrent neural network (RNN) models the generative process of sequence data: it summarizes the information seen so far into a hidden state (a continuous representation) and then generates a new sample according to a probability distribution specified by the hidden state. Specifically, the hidden state $h_t$ of the sequence $x_{\le t} = (x_1, \ldots, x_t)$ is recursively updated as:

$$h_t = f(h_{t-1}, x_t), \tag{1}$$

where $f$ is usually a nonlinear transformation, e.g., $h_t = \tanh(U h_{t-1} + V x_t)$ ($U$, $V$ are transformation matrices). The hidden state $h_t$ summarizes the information of the entire sequence $x_{\le t}$, and the probability of generating the next word $x_{t+1}$ is defined as

$$p(x_{t+1} \mid x_{\le t}) = p(x_{t+1} \mid h_t) \propto \exp\left(O_{x_{t+1}}^{\top} h_t\right), \tag{2}$$

where $O_{x_{t+1}}$ is the low-dimensional continuous representation of word $x_{t+1}$.

The overall probability of a sequence $x = (x_1, \ldots, x_T)$ is calculated as follows:

$$p(x_1, \ldots, x_T) = \prod_{t=0}^{T-1} p(x_{t+1} \mid x_{\le t}). \tag{3}$$

Training an RNN can be done by maximizing the joint probability defined by Eqn. (3), optimized through back-propagation. However, training an RNN with the traditional state transition unit $\tanh(U h_{t-1} + V x_t)$ suffers from the problem of vanishing or exploding gradients, which is caused by the product of multiple nonlinear transformations.
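The recurrence and next-word distribution described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the tanh transition, the input embedding table `E`, the helper names and all toy parameter values are assumptions made for the example.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(U, V, h_prev, x_vec):
    """One recurrence step: new state = tanh(U h_prev + V x_vec)."""
    a = matvec(U, h_prev)
    b = matvec(V, x_vec)
    return [math.tanh(ai + bi) for ai, bi in zip(a, b)]

def next_word_probs(O, h):
    """Softmax over the scores O_w . h, with one row of O per vocabulary word."""
    scores = matvec(O, h)
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_log_prob(U, V, O, E, word_ids, h0):
    """Sum of log p(next word | previous words), starting from state h0."""
    h, log_p = h0, 0.0
    for w in word_ids:
        log_p += math.log(next_word_probs(O, h)[w])  # score word w from the current state
        h = rnn_step(U, V, h, E[w])                  # then feed it into the recurrence
    return log_p

# Toy parameters (hypothetical): hidden size 2, vocabulary of 3 words.
U = [[0.5, -0.1], [0.2, 0.3]]
V = [[1.0, 0.0], [0.0, 1.0]]
O = [[0.2, 0.1], [-0.3, 0.4], [0.0, 0.0]]
E = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # input vectors, one per word

probs = next_word_probs(O, rnn_step(U, V, [0.0, 0.0], E[0]))
log_p = sequence_log_prob(U, V, O, E, [0, 1, 2], [0.0, 0.0])
```

Working in log space, as `sequence_log_prob` does, avoids the underflow that multiplying many small probabilities would cause on long sequences.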

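[Hochreiter and Schmidhuber1997] effectively address this problem through the long short-term memory (LSTM) unit. The core idea of LSTM is to introduce a memory state and multiple gating functions that control the information written to the memory state, read from the memory state, and removed (or forgotten) from the memory state. Specifically, the detailed updating equations are listed below:

$$\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i),\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f),\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o),\\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g),\\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t,\\
h_t &= o_t \odot \tanh(c_t),
\end{aligned} \tag{4}$$

where $c_t$ is the memory state, $g_t$ is the module that transforms information from the input space to the memory space, $h_t$ is the information read from the memory state, and $i_t$, $f_t$, $o_t$ are the input, forget, and output gates respectively ($W_\ast$, $U_\ast$ are weight matrices, $b_\ast$ are bias vectors, $\sigma$ is the sigmoid function and $\odot$ is the elementwise product). $i_t$ controls the information from the input to the memory state, $f_t$ controls the information in the memory state to be forgotten, and $o_t$ controls the information read from the memory state. The memory state $c_t$ is updated through a linear combination of the input filtered by the input gate and the previous memory state filtered by the forget gate.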
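A single LSTM update of the kind described above can be sketched as follows. This follows the standard LSTM formulation and is an illustration only: the parameter layout (one weight matrix per gate acting on the concatenation of input and previous hidden state), the function names and the toy values are assumptions, not details taken from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Compute W v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(params, x, h_prev, c_prev):
    """One LSTM step. params maps a gate name to (W, b); W acts on [x; h_prev]."""
    z = list(x) + list(h_prev)  # concatenate input and previous hidden state
    i = [sigmoid(a) for a in affine(params["i"][0], z, params["i"][1])]    # input gate
    f = [sigmoid(a) for a in affine(params["f"][0], z, params["f"][1])]    # forget gate
    o = [sigmoid(a) for a in affine(params["o"][0], z, params["o"][1])]    # output gate
    g = [math.tanh(a) for a in affine(params["g"][0], z, params["g"][1])]  # candidate input
    # Memory: previous memory filtered by the forget gate plus input filtered by the input gate.
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    # Hidden state: what the output gate reads from the memory.
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

# Toy check (hypothetical sizes): input size 1, hidden size 1, all-zero parameters.
zero = ([[0.0, 0.0]], [0.0])
params = {"i": zero, "f": zero, "o": zero, "g": zero}
h, c = lstm_step(params, [1.0], [0.0], [2.0])
```

With all-zero parameters every gate equals 0.5 and the candidate input is 0, so the memory is simply halved, which makes the additive form of the memory update easy to verify by hand.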

C2S: Contexts to Sequences

An RNN effectively models the joint probability of natural language $p(x_1, \ldots, x_T)$. As mentioned previously, natural language is usually generated in particular contexts. Therefore, instead of modeling the probability $p(x)$ of observing a text sequence $x = (x_1, \ldots, x_T)$, we are more interested in the probability of observing $x$ under some contexts $C$, i.e., the conditional probability $p(x \mid C)$. In this part, we introduce two generative models for modeling the conditional probability based on recurrent neural networks.

Encoder. Our framework is built on the encoder-decoder framework [Cho et al.2014]. The essential idea is to encode the context information into a continuous semantic representation, and then decode the semantic representation into a text sequence. We first introduce how to encode the contexts of natural languages into a semantic representation. We represent the contexts as a set $C = \{c_i\}_{i=1}^{K}$, where $c_i$ is a type of context and $K$ is the number of context types. Taking reviews as an example, each $c_i$ is a sentiment rating score (ranging from 1 to 5), a product id or a user id. For discrete contexts, each $c_i$ is a one-hot vector $c_i \in \{0, 1\}^{N_i}$. The embedding of each context can be obtained through:

$$e_i = E_i c_i, \tag{5}$$

where $E_i \in \mathbb{R}^{d_e \times N_i}$, $N_i$ is the number of different context values of type $i$, and $d_e$ is the dimension of the context embedding. Once the embeddings of different contexts are obtained, they are concatenated into a long vector and followed by a nonlinear transformation, formulated as follows:

$$h_c = \tanh\left(W [e_1; e_2; \ldots; e_K] + b\right), \tag{6}$$

where $W \in \mathbb{R}^{d_h \times K d_e}$, $b$ is a bias vector, and $d_h$ is the size of the hidden state of the recurrent neural network in the decoder. By Eqn. (5) and Eqn. (6), we are able to encode the contexts into a semantic representation. Next we introduce how to decode it into a text sequence.
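The encoder described above (embed each discrete context, concatenate the embeddings, apply a nonlinear transformation) can be sketched as below. This is a hedged illustration: the tanh nonlinearity, the bias term, the name `encode_contexts` and every numeric value are assumptions chosen for the example.

```python
import math

def encode_contexts(tables, indices, W, b):
    """Embed each discrete context, concatenate, then apply tanh(W . concat + b).

    tables[i][k] is the embedding of value k of context type i. Multiplying an
    embedding matrix by a one-hot vector is the same as this table lookup.
    """
    concat = []
    for table, idx in zip(tables, indices):
        concat.extend(table[idx])
    return [math.tanh(sum(w * x for w, x in zip(row, concat)) + bias)
            for row, bias in zip(W, b)]

# Toy setup (hypothetical): 5 rating values and 3 products, embedding size 2,
# decoder hidden size 3, so W has 3 rows and 2 * 2 = 4 columns.
rating_table = [[0.1 * r, -0.1 * r] for r in range(1, 6)]
product_table = [[0.2, 0.0], [0.0, 0.2], [0.1, 0.1]]
W = [[0.5, 0.0, 0.0, 0.5],
     [0.0, 0.5, 0.5, 0.0],
     [0.1, 0.1, 0.1, 0.1]]
b = [0.0, 0.0, 0.0]

# Rating 5 (index 4) and the product with index 1.
h_c = encode_contexts([rating_table, product_table], [4, 1], W, b)
```

The output has the size of the decoder hidden state, which is what allows it to serve as the decoder's initial state.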

Decoder. We introduce two types of decoders. The first one is the vanilla recurrent neural network with LSTM unit, whose initial state is set to the context embedding $h_c$. We call this approach C2S, and the whole encoder-decoder framework is presented in Fig. 2(b). C2S has shown very promising results in the experiments. However, one limitation of the approach is that when the sequences become very long, the information from the contexts may not be able to propagate to the distant words. To resolve this, a natural solution is to build a direct dependency between the contexts and each word, i.e., to add skip-connections between the contexts and the words. By doing this, the prediction of the next word $x_{t+1}$ depends not only on the current hidden state $h_t$ but also on the context representation $h_c$. To combine the two sources of information, a simple way would be to sum or concatenate them. Here we sum them. However, simply summing the two representations treats the two sources of information equally, which may be problematic since some words depend on the context while others do not. Therefore, it is desirable to determine when the context information is required.

We achieve this through a gating mechanism. We introduce a gating function $m_t$ which depends on the current hidden state $h_t$:

$$m_t = \sigma(V_m h_t), \tag{7}$$

where $V_m \in \mathbb{R}^{d_h \times d_h}$ and $\sigma$ is the sigmoid function. The probability of the next word $x_{t+1}$ is calculated as follows:

$$p(x_{t+1} \mid x_{\le t}, C) \propto \exp\left(O_{x_{t+1}}^{\top} (h_t + m_t \odot h_c)\right), \tag{8}$$

where $\odot$ is the elementwise product. We call this model gC2S, and the whole framework is presented in Fig. 2(c).
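The gated skip-connection can be sketched as follows. The sketch assumes the combination is the decoder hidden state plus the context representation scaled elementwise by a sigmoid gate computed from the hidden state, which matches the summation described above; the function names, the gate matrix `V_m` and the toy values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_next_word_probs(O, V_m, h_t, h_c):
    """Return the gate vector and the next-word distribution.

    The gate depends only on the current hidden state h_t and decides, per
    dimension, how much of the context representation h_c is let through.
    """
    m_t = [sigmoid(a) for a in matvec(V_m, h_t)]
    mixed = [h + m * c for h, m, c in zip(h_t, m_t, h_c)]
    return m_t, softmax(matvec(O, mixed))

# Toy values (hypothetical): hidden size 2, vocabulary of 3 words.
h_t = [0.3, -0.2]
h_c = [1.0, 0.5]
V_m = [[0.0, 0.0], [0.0, 0.0]]  # zero gate weights give a gate of 0.5 everywhere
O = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

gate, probs = gated_next_word_probs(O, V_m, h_t, h_c)
```

A gate close to 0 recovers a prediction that uses the hidden state alone, while a gate close to 1 adds the full context representation at that position.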

Generation. Once the models are trained, given a set of contexts, we can generate natural language based on them. There are usually two types of approaches for natural language generation: beam search [Bahdanau, Cho, and Bengio2014], which is widely used in neural machine translation, and random sampling [Graves2013]. In our experiments, we tried both approaches. We find that the samples generated by beam search are usually very trivial, without much variation. Therefore, we adopt random sampling. During the sampling process, instead of using only the standard softmax function, we also tried different temperature values. High temperatures generate more diverse samples but make more mistakes, while low temperatures tend to generate more conservative and confident samples. In the experiments, we empirically set the temperature to 0.7.
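Temperature sampling as described above is commonly implemented by dividing the scores by the temperature before the softmax. The following is a minimal sketch under that assumption; the function names and example logits are illustrative and not taken from the paper.

```python
import math
import random

def temperature_probs(logits, temperature=0.7):
    """Softmax of logits / temperature. A temperature of 1.0 is the standard softmax."""
    scaled = [l / temperature for l in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_word(logits, temperature=0.7, rng=None):
    """Draw one word index at random from the temperature-scaled distribution."""
    rng = rng or random
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0]
sharp = temperature_probs(logits, 0.7)  # more mass on the top-scoring word
plain = temperature_probs(logits, 1.0)  # standard softmax
flat = temperature_probs(logits, 2.0)   # closer to uniform, more diverse samples
word = sample_word(logits, 0.7, random.Random(0))
```

Lowering the temperature concentrates probability on the highest-scoring words, which is the conservative behavior noted above; raising it flattens the distribution.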


In this section, we evaluate the effectiveness of the approaches C2S and gC2S on user review data. Different tasks are evaluated, including language modeling, fake review detection with human judges and with an existing state-of-the-art fake review detection algorithm, and sentiment classification on both the real and fake reviews. We first introduce the data sets to be used.

Name        train      valid    test     Median Len.  Max Len.  Vocab.
Book        5,800,000  293,050  293,055  22           100       20,000
Electronic  3,200,000  180,708  180,709  35           100       20,000
Movie       1,500,000  85,195   85,200   31           100       20,000
Hotel       230,000    12,912   12,914   87           300       20,000
Table 1: Statistics of the data sets.

Domain      RNN   C2S(P)  C2S(S)  C2S(P+S)  gC2S(P)  gC2S(S)  gC2S(P+S)
Book        27.5  27.1    27.2    26.6      25.2     25.8     24.9
Electronic  27.4  26.2    27.3    25.8      24.4     25.6     24.1
Movie       28.8  27.2    28.2    26.9      25.3     27.1     24.8
Hotel       23.6  23.2    23.4    23.1      21.3     22.4     21.2
Table 2: Results of language modeling measured by perplexity (P: product, S: sentiment).

Figure 3: Words with the largest gating values in the gC2S model on test data. Most of the words are product or sentiment related.

Data Sets

We choose the user review data as it contains rich context information, e.g., users, time, sentiments, products. We select the sentiment ratings (ranging from 1 to 5) and product ids as the context information, which we believe are the most important factors that affect the review content. We use data from two websites: Amazon and TripAdvisor. For the Amazon data, we select the three most popular domains: book, electronic and movie; the TripAdvisor data covers the hotel domain. We select the most popular words as the vocabulary, and remove all reviews that contain unknown words or that are longer than 100 words in the Amazon data or 300 words in the TripAdvisor data. The whole data set is split into train, validation and test data according to the ratio 18:1:1. The statistics of the final data sets are summarized in Table 1.
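The filtering and splitting steps can be sketched as follows. This is an illustration under stated assumptions: reviews are assumed to be pre-tokenized lists of words, the split is taken in the given order (whether the data were shuffled first is not stated), and the function names and toy data are hypothetical.

```python
def keep_review(tokens, vocab, max_len):
    """Keep a review only if it is short enough and every word is in the vocabulary."""
    return len(tokens) <= max_len and all(t in vocab for t in tokens)

def split_18_1_1(items):
    """Split items into train, validation and test parts with ratio 18:1:1."""
    n = len(items)
    n_valid = n // 20
    n_test = n // 20
    n_train = n - n_valid - n_test  # the remainder goes to training
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]
    return train, valid, test

# Toy data (hypothetical): "pool" is outside the vocabulary, so that review is dropped.
vocab = {"the", "room", "was", "clean"}
reviews = [["the", "room", "was", "clean"], ["the", "pool", "was", "clean"]]
kept = [r for r in reviews if keep_review(r, vocab, max_len=100)]

train, valid, test = split_18_1_1(list(range(40)))
```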

Training Details.

All the models are trained on a single GPU. The batch size is set to 128. The weights are randomly initialized from a uniform distribution, and the biases are initialized to 0. The initial learning rate is set to 1, and the learning rate is halved if the perplexity of the current epoch on the validation data is not lower than that of the previous epoch. The gradient is clipped if its norm is larger than 5. Dropout is used from the input to the hidden layer and from the hidden layer to the output layer in the recurrent neural networks. Different hidden sizes were tried, and the results show that larger is better. Due to the limitation of GPU memory, we use 512 by default. For the number of RNN layers, one layer is used by default, as increasing the number of layers does not yield significantly better samples.
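The learning-rate schedule and gradient clipping described above can be sketched as below. Clipping is assumed here to mean rescaling the gradient so that its L2 norm equals the threshold, which is the common convention but is not spelled out in the text; the function names are hypothetical.

```python
import math

def clip_by_norm(grad, max_norm=5.0):
    """Rescale grad so that its L2 norm is at most max_norm (assumed clipping rule)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grad]
    return list(grad)

def next_learning_rate(lr, val_ppl, prev_val_ppl):
    """Halve lr when validation perplexity is not lower than in the previous epoch."""
    if prev_val_ppl is not None and val_ppl >= prev_val_ppl:
        return lr / 2.0
    return lr

clipped = clip_by_norm([6.0, 8.0])            # norm 10 is rescaled to norm 5
lr = next_learning_rate(1.0, 30.0, None)      # first epoch: nothing to compare against
lr = next_learning_rate(lr, 29.0, 30.0)       # perplexity improved: rate unchanged
lr_after_stall = next_learning_rate(lr, 29.0, 29.0)  # no improvement: rate halved
```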

Figure 4: Results of language modeling on different lengths of reviews. The superiority of gC2S over C2S increases as the lengths of reviews increase.

Language Modeling

We start with the task of language modeling. Table 2 compares the language modeling performance of RNN, C2S and gC2S. First, both C2S and gC2S, with the sentiment context, the product context or their combination, outperform the vanilla RNN without context information. This shows that contextual information is indeed helpful for natural language understanding and prediction. Second, the product context seems to be more informative than the sentiment context for language modeling, with both the C2S and the gC2S model. This may be because many words in the reviews are relevant to the product information. Third, the gC2S model consistently outperforms the C2S model no matter which contexts are used. As explained previously, this is because in the C2S model the context information may not be able to affect the words that are far away from the beginning of the sequences, while gC2S effectively addresses this by adding a direct dependency between the contexts and the words in the sequences.

To further verify this, we compare the results of C2S and gC2S on different lengths of reviews. Fig. 4 presents the results. We can see that as the lengths of the reviews increase, the gC2S model outperforms the C2S model more significantly, showing its effectiveness for modeling lengthy reviews.

Fig. 3 presents several examples from the test data showing which words are affected by the contexts. We mark the words with the largest gating values (the average of the gating vector is compared here). We can see that most of the words strongly affected by the contexts are related to the products or sentiments.

Overall, we can see that the gC2S model is indeed more effective than C2S. Therefore, in all the following experiments, we only use gC2S.
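For reference, the perplexity used to compare the language models is conventionally the exponential of the average negative log-probability per word. The sketch below assumes that standard definition with natural logarithms; the function name and the toy probabilities are illustrative.

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-probability per token (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns probability 0.25 to every word has perplexity close to 4:
# it is as uncertain as a uniform choice among four words.
ppl = perplexity([math.log(0.25)] * 4)
```

Lower perplexity means the model assigns higher probability to the held-out text.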

Fake Review Detection

To further evaluate the effectiveness of the gC2S model for natural language generation, we choose the task of fake review detection, which aims to classify whether the reviews are written by real users or generated by the gC2S model. Real reviews are treated as positive, and fake reviews are treated as negative. For the evaluation data, we randomly select some products which have at least two real reviews for each rating score in the Amazon data and one review for each rating score in the TripAdvisor data. For each real review, a fake review is generated with gC2S according to its contexts. Table 3 summarizes the number of reviews in each domain.

Domain      #products  #reviews/rating  total
Book        74         2                740
Electronic  100        2                1,000
Movie       100        2                1,000
Hotel       55         1                275
Table 3: Summary of the evaluation data for fake review detection.

Domain      TP    FN    TN    FP
Book        73.9  26.1  47.1  52.9
Electronic  77.2  22.8  43.5  56.5
Movie       79.7  20.3  46.1  53.9
Hotel       85.2  14.8  48.5  51.5
Overall     77.9  22.1  47.5  52.5

TP: true positive, FP: false positive, TN: true negative, FN: false negative (all values in %)

Table 4: Results of fake review classification with human judgments (Positive: real reviews, Negative: fake reviews). In all the domains, more than 50% of the fake reviews are misclassified.

Human Evaluation. We use Amazon Mechanical Turk to evaluate whether the reviews are fake or not. We divide all the data into batches, each containing twenty reviews about the same product. We show the URLs of the products, the sentiment ratings and the review content to the Turkers and ask them to judge whether the reviews are written by real users or not. To control the quality of the results, some “gotcha” questions are inserted in the middle of the list of reviews. Only the results from Turkers who answer the “gotcha” questions correctly are kept. The kappa score is …. We summarize the final results in Table 4.

We can see that in all the domains, more than 50% of the fake reviews generated by the gC2S model are misclassified by the Turkers, and around 80% of the real reviews are correctly classified. This shows that the reviews generated by the gC2S model are indeed very natural.

Figure 5: Results of human evaluation w.r.t. different lengths of reviews. Lengthy fake reviews are more likely to be detected by real users.

As the RNN generates more words, the model is more likely to make mistakes. Therefore, we want to see how the results of human evaluation change as the lengths of the reviews increase. Fig. 5 presents the human evaluation results w.r.t. different lengths of reviews. We can see that for lengthy reviews, both fake and real, a smaller percentage is misclassified by human judges. However, even for the fake reviews with more than 150 words, around 40% are still misclassified by the human judges.

Feature     TP    FN    TN   FP
Unigram+LR  88.8  11.2  8.3  91.7
Bigram+LR   87.3  12.7  9.1  90.9

LR: logistic regression classifier (all values in %)

Table 5: Results of fake review detection in TripAdvisor with the approach in [Ott et al.2011]. More than 90% of the fake reviews are misclassified.

Automatic Classification. Another way to evaluate the effectiveness of our approach is to see how well an existing state-of-the-art fake review detection algorithm performs on our generated fake reviews. We adopt the approach in [Ott et al.2011], which trains a classifier with 800 real reviews from TripAdvisor and 800 fake reviews written by Amazon Mechanical Turkers. Here we use the unigram and bigram features for classification, with which the results are very close to the best results reported in [Ott et al.2011]. Table 5 summarizes the results. We can see that more than 90% of the fake reviews generated by the gC2S model are misclassified as real reviews by the classifier.

Domain      Data  Precision  Recall  F1
Book        True  0.533      0.452   0.480
            Fake  0.574      0.501   0.529
Electronic  True  0.450      0.451   0.410
            Fake  0.450      0.461   0.419
Movie       True  0.536      0.443   0.474
            Fake  0.593      0.472   0.507
Hotel       True  0.562      0.504   0.527
            Fake  0.575      0.361   0.396
Table 6: Fine granularity sentiment classification. Results on the true and fake reviews are very close.

Domain      Data  Precision  Recall  F1
Book        True  0.963      0.982   0.972
            Fake  0.971      0.994   0.982
Electronic  True  0.781      0.953   0.858
            Fake  0.761      0.993   0.861
Movie       True  0.973      0.983   0.978
            Fake  0.973      0.996   0.984
Hotel       True  0.985      0.988   0.987
            Fake  0.947      1.000   0.973
Table 7: Binary sentiment classification. Results on the true and fake reviews are very close.

Sentiment Classification

The above results show that, given particular contexts, the gC2S model is able to generate very natural reviews that are hard to distinguish from real reviews. But how well do the generated reviews reflect the context information, e.g., how well do they express the sentiment polarity? In this part we therefore compare the results of sentiment classification on both the fake and real reviews. Two types of sentiment classification are conducted: fine granularity, i.e., sentiments with five different rating scores, and binary classification, in which reviews with ratings 4 and 5 are treated as positive and reviews with ratings 1 and 2 are treated as negative.
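The mapping from rating scores to binary labels can be written as a small helper. How reviews with rating 3 are handled in the binary setting is not stated above, so returning `None` for them (and skipping those reviews) is an assumption of this sketch, as is the function name.

```python
def binary_label(rating):
    """Map a 1-5 rating to a binary sentiment label as described in the text."""
    if rating in (4, 5):
        return "positive"
    if rating in (1, 2):
        return "negative"
    return None  # rating 3: assumed to be excluded from the binary task

labels = [binary_label(r) for r in (1, 2, 3, 4, 5)]
```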

To conduct the classification, we randomly sample 100,000 real reviews from the training data to train the sentiment classifier. As for the evaluation data, a fake review is generated for each test review according to its contexts. Tables 6 and 7 summarize the results of fine granularity and binary sentiment classification respectively. We can see that the results on the real and fake reviews are very close to each other. On some domains (e.g., Book and Movie), the results on the fake reviews are even better than on the real reviews, showing that the reviews generated by the gC2S model accurately reflect the sentiment polarity.

Finally, we present some examples of fake reviews generated by the gC2S model (Table 8). We can see that the generated reviews are very natural: they are grammatically correct and accurately reflect the sentiment polarity and the product information.

Domain: Movie, Product: “Frozen”
Rating 1: i love disney movies but this one was not at all what i expected . the story line was so predictable , it was dumb .
Rating 3: i liked the movie but it didn’t hold my attention as much as i expected . they just don’t make movies like this anymore .
Rating 5: my son loves this movie and it is a good family movie . i would recomend it for anyone who likes to watch movies with kids .
Domain: Electronic, Product: “Leather case cover for ipad mini”
Rating 1: i bought this case for my ipad 3 . it was not as pictured and it was too small and the ipad mini was not secured inside it , so i returned it .
Rating 3: the case is good for the price , but seems to be of very thin plastic and not well made . i use the stylus for reading . i would recommend it if you have a mini ipad .
Rating 5: the cover is very good and it fits the ipad mini perfectly and the color is exactly what i was looking for .
Table 8: Examples of fake reviews generated by the gC2S model.


This paper studied context-aware natural language generation. We proposed two approaches, C2S and gC2S, which encode the contexts into semantic representations and then decode the representations into text sequences. The gC2S model significantly outperforms the C2S model because it adds skip-connections between the context representations and the words in the sequences, allowing the information from the contexts to directly affect the generation of words. We evaluated our approaches on user review data. Experimental results show that more than 50% of the fake reviews generated by our approach are misclassified by human judges, and more than 90% are misclassified by an existing fake review detection algorithm.

In the future, we plan to integrate more context information, e.g., the user, the detailed descriptions of the products and the product prices, into our approaches, and to evaluate our approaches in other scenarios, e.g., generating the titles of scientific papers based on the author, venue, and time information. It may also be beneficial to improve our model through the attention mechanism [Bahdanau, Cho, and Bengio2014], i.e., attending to different types of contexts when generating words in different positions.