Towards Controlled and Diverse Generation of Article Comments

07/25/2021 ∙ by Linhao Zhang, et al. ∙ Peking University

Much research in recent years has focused on automatic article commenting. However, few previous studies have focused on the controllable generation of comments. Besides, existing systems tend to generate dull and commonplace comments, which further limits their practical application. In this paper, we take the first step towards controllable generation of comments by building a system that can explicitly control the emotion of the generated comments. To achieve this, we associate each emotion category with an embedding and adopt a dynamic fusion mechanism to fuse this embedding into the decoder. A sentence-level emotion classifier is further employed to better guide the model to generate comments expressing the desired emotion. To increase the diversity of the generated comments, we propose a hierarchical copy mechanism that allows our model to directly copy words from the input articles. We also propose a restricted beam search (RBS) algorithm to increase intra-sentence diversity. Experimental results show that our model can generate informative and diverse comments that express the desired emotions with high accuracy.




1 Introduction

Automatic article commenting is a valuable yet challenging task. It requires the machine first to understand the articles and then generate coherent comments. The task was formally proposed by Qin et al. (2018), along with a large-scale dataset. Much research has since focused on this task Lin et al. (2018b); Ma et al. (2018a); Li et al. (2019).

The ability to generate comments is especially useful for online news platforms, for the comments can encourage user engagement and interactions Qin et al. (2018). Besides, an automatic commenting system also enables us to build a comment-assistant which can generate some candidate comments for users, who may later select one and refine it Zheng et al. (2018).

In addition to the practical importance of this task, it also has significant research value. It can be seen as a natural language generation (NLG) task, yet unlike machine translation or text summarization, the comments can be rather diverse. That is, for the same article, there can be numerous appropriate comments written from different angles. In this sense, the task is similar to dialog generation, yet it is more challenging because the input article is much longer and more complex than dialog text.

Figure 1: Comparison of output comments of the Seq2Seq baseline and our model. We can see that the baseline cannot control the expressed emotion, and the generated comment is dull and irrelevant to the article (red colored). By contrast, our model is emotion-controllable and can generate more diverse and relevant comments, thanks to the hierarchical copy mechanism (blue colored).

Despite the importance of this task, it is still relatively new and not well studied. One of the major limitations of current article commenting systems is that the generation process is not controllable, meaning that the comments are conditioned entirely on the articles, and users cannot further control the features of comments. In this work, we take the first step towards controllable article commenting and propose a model to control the emotion of generated comments, which has wide practical application. Take the comment-assistant of Zheng et al. (2018) as an example: it is far more desirable to offer candidate comments that each express a different emotion, so that users can choose the one that matches their own emotion towards the article.

Another problem of current commenting systems arises from the limitation of the Seq2Seq framework Sutskever et al. (2014), which is known to generate dull responses that are irrelevant to the input articles Li et al. (2015); Wei et al. (2019); Shao et al. (2017). As shown in Figure 1, the Seq2Seq baseline generates "I love this movie" for the input article, even though Ode of joy is not a movie but a TV series.

In this work, we propose a controllable article commenting system that can generate diverse and relevant comments. We first create two emotional datasets based on Qin et al. (2018). The first is a fine-grained dataset that contains {Anger, Disgust, Like, Happiness, Sadness}, and the second is a coarse-grained one that contains only {Positive, Negative}. To incorporate the emotion information into the decoding process, we first associate each emotion with an embedding and then feed the emotion embedding into the decoder at each time step. A dynamic fusion mechanism is employed to utilize the emotion information selectively at each decoding step. Besides, the decoding process is further guided by a sequence-level emotion loss term to increase the intensity of emotional expression.

To generate diverse comments, we propose a hierarchical copy mechanism to copy words from the input articles. This is based on the observation that we can discourage the generation of dull comments like "I don't know" by enhancing the relevance between comments and articles. In this way, the inter-sentence diversity is increased. Moreover, we observe that the repetition problem of the Seq2Seq framework See et al. (2017); Lin et al. (2018a) can be seen as a lack of intra-sentence diversity, and we further adopt a restricted beam search (RBS) algorithm to tackle this problem.

Figure 2: The architecture of CCS. On the left is the encoder, which encodes articles at the word and sentence levels. On the right is the decoder, with emotion information dynamically fused into the decoding process. We add an emotion loss term to further bias the decoding process. Besides, a hierarchical copy mechanism is proposed to improve the diversity of the generated comments.

To sum up, our contributions are twofold:

1) We make the first step to build a controllable article commenting system by injecting emotion into the decoding process. Features other than emotion can be controlled in a similar way.

2) We propose the hierarchical copy mechanism and RBS algorithm to increase the inter- and intra-sentence diversity, respectively.

2 Method

The overall architecture of our proposed model, Controllable Commenting System (CCS), is shown in Figure 2. We describe the details in the following subsections.

2.1 Task Definition

The task can be formulated as follows: Given an article $D$ and an emotion category $e$, the goal is to generate an appropriate comment $Y$ that expresses the emotion $e$.

Specifically, $D$ consists of $M$ sentences. The $i$-th sentence consists of $T_i$ words, where $T_i$ can vary among different sentences. For each word, we use $x_{i,j}$ to denote its word embedding.

2.2 Basic Structure: Hierarchical Seq2Seq

We encode an article by first building representations of sentences and then aggregating them into an article representation.

Word-Level Encoder - We first encode the article on the word level. For the $i$-th sentence in $D$, the word-level LSTM encoder reads the word embeddings $x_{i,1}, \dots, x_{i,T_i}$ sequentially and updates its hidden state $h_{i,j}$. Formally,

$$h_{i,j} = \mathrm{LSTM}_{w}(h_{i,j-1}, x_{i,j}) \tag{1}$$

The last hidden state, $h_{i,T_i}$, stores information of the whole sentence and thus can be used to represent that sentence:

$$s_i = h_{i,T_i} \tag{2}$$

Sentence-Level Encoder - Given sentence embeddings $s_1, \dots, s_M$, we then encode the article at the sentence level:

$$g_i = \mathrm{LSTM}_{s}(g_{i-1}, s_i) \tag{3}$$

where $\mathrm{LSTM}_{s}$ is the sentence-level LSTM encoder and $g_i$ is its hidden state. The last hidden state, $g_M$, aggregates information of all the sentences and is later used by the decoder to initialize its hidden state.

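Decoder - Similarly, the decoder LSTM updates its hidden state $d_t$ at time step $t$:

$$d_t = \mathrm{LSTM}_{dec}(d_{t-1}, y_{t-1}) \tag{4}$$

where $y_{t-1}$ is the embedding of the previous word. During training, this is the previous word of the reference comment; at test time it is the previous word emitted by the decoder. The attention mechanism of Luong et al. (2015) is also applied. At each decoding step, we compute the attention weights $\alpha_{t,i}$ between the current decoder hidden vector $d_t$ and all the outputs $g_i$ of the sentence-level encoder. We then compute the context vector $c_t$ as the weighted sum of the sentence-level encoder outputs and obtain our attention vector $a_t$ for the final prediction. Formally,

$$\begin{aligned}
\alpha_{t,i} &= \frac{\exp(d_t^{\top} W_a g_i)}{\sum_{k} \exp(d_t^{\top} W_a g_k)} \\
c_t &= \sum_{i} \alpha_{t,i}\, g_i \\
a_t &= \tanh(W_c [c_t; d_t]) \\
P(y_t \mid y_{<t}, D) &= \mathrm{softmax}(W_o a_t)
\end{aligned} \tag{5}$$

where $W_a$, $W_c$, and $W_o$ are model parameters; $[\cdot\,;\cdot]$ is vector concatenation.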
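The attention step described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the trainable scoring parameters are omitted and a plain dot product is assumed as the score, and the names `attend` and `sentence_outputs` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_outputs):
    """Attention over sentence-level encoder outputs.

    Assumption: the score is a plain dot product; the trainable
    parameters of the paper's scoring function are left out.
    """
    scores = [sum(d * g for d, g in zip(decoder_state, out))
              for out in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    # Context vector: attention-weighted sum of the encoder outputs.
    context = [sum(w * out[k] for w, out in zip(weights, encoder_outputs))
               for k in range(dim)]
    return weights, context

# Toy example: three sentence vectors and one decoder state.
sentence_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 0.0], sentence_outputs)
```

In this toy case the first and third sentences align equally well with the decoder state, so they receive the same weight and dominate the context vector.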

2.3 Dynamic Fusion Mechanism

To control the emotions expressed by the comments, we first associate each emotion with an embedding $v_e$. For clarification, we note that $e$ is the user-specified emotion category and $v_e$ is obtained by a trainable embedding layer. We then add the emotion vector to the decoder input at each decoding step, changing Equation 4 into:


In this way, emotion information is injected into the decoding process at each decoding step. However, this method is overly simplistic. It uses emotion information indiscriminately at each decoding step, yet not all steps require the same amount of emotion information. Simply using the same emotion embedding during the whole generation process may sacrifice grammaticality Ghosh et al. (2017).

To solve this issue, we adopt a dynamic fusion mechanism that can update and selectively utilize the emotion embedding at each decoding step. Besides, given that the same word can convey different emotions, we also alter the word embedding according to the emotion to be expressed. Specifically, we first employ an emotion gate $g^{e}_t$ and a word gate $g^{w}_t$:


Here $\sigma$ is the sigmoid function, and the weights of the two gates are model parameters. The resulting vectors $g^{e}_t$ and $g^{w}_t$ have the same dimensions as the emotion embedding and the word embedding, respectively. They are used to control the amount of emotion information used in the current decoding step, changing Equation 6 into:


where $\odot$ is element-wise multiplication.

Experiments show that the dynamic fusion mechanism leads to better generation quality and higher emotion accuracy.
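To make the gating idea concrete, here is a minimal sketch of one fusion step. Several details are assumptions rather than the paper's specification: both gates are assumed to read the concatenation of the previous decoder state, the word embedding and the emotion embedding, and the two gated vectors are assumed to be concatenated to form the decoder input. The function names and the random weights are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(weights, vector):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def fuse(prev_state, word_emb, emotion_emb, w_emotion_gate, w_word_gate):
    """Gate the emotion and word embeddings before they enter the decoder.

    Assumption: both gates read [prev_state; word_emb; emotion_emb].
    """
    gate_input = prev_state + word_emb + emotion_emb  # list concatenation
    emotion_gate = [sigmoid(z) for z in linear(w_emotion_gate, gate_input)]
    word_gate = [sigmoid(z) for z in linear(w_word_gate, gate_input)]
    # Element-wise multiplication of each gate with its embedding.
    gated_emotion = [g * e for g, e in zip(emotion_gate, emotion_emb)]
    gated_word = [g * w for g, w in zip(word_gate, word_emb)]
    # Assumed combination: concatenate the gated vectors as decoder input.
    return gated_word + gated_emotion, emotion_gate, word_gate

def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
dim = 3
prev_state = [0.1, -0.2, 0.3]
word_emb = [0.5, 0.5, -0.5]
emotion_emb = [1.0, -1.0, 0.0]
decoder_input, emotion_gate, word_gate = fuse(
    prev_state, word_emb, emotion_emb,
    random_matrix(dim, 3 * dim), random_matrix(dim, 3 * dim))
```

Because every gate value lies strictly between 0 and 1, each gated component is a scaled-down copy of the original embedding component, which is how the decoder can use more or less emotion information at a given step.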

2.4 Emotion Loss

To further guide the model to generate comments expressing the desired emotion, we use a sequence-level emotion classifier Song et al. (2019) to guide the generation process.

We first approximate the step embedding as the weighted sum of the embeddings of the words with the top-$K$ probabilities, where $K$ is a hyperparameter.


Then, we calculate the emotion distribution from the average of the step embeddings over all time steps, based on which we calculate the final emotion classification loss:


where the classifier weights are model parameters and $P(E)$ is the gold emotion distribution.

Intuitively, the emotion loss forces a strong dependency relationship between the input emotion category and generated comment, further biasing the comments towards our desired emotion.
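The emotion loss can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact formulation: the top-K probabilities are renormalised to sum to one, the classifier is a single linear layer followed by a softmax, and the gold emotion is given as a class index. All names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def step_embedding(word_probs, embeddings, k):
    """Probability-weighted sum of the embeddings of the k most likely words."""
    top = sorted(range(len(word_probs)),
                 key=lambda i: word_probs[i], reverse=True)[:k]
    # Assumption: the top-k probabilities are renormalised to sum to one.
    mass = sum(word_probs[i] for i in top)
    dim = len(embeddings[0])
    return [sum(word_probs[i] / mass * embeddings[i][d] for i in top)
            for d in range(dim)]

def emotion_loss(step_probs, embeddings, classifier, gold_emotion, k):
    """Cross-entropy of the gold emotion under a classifier that reads
    the average step embedding of the whole comment."""
    steps = [step_embedding(p, embeddings, k) for p in step_probs]
    dim = len(embeddings[0])
    mean = [sum(s[d] for s in steps) / len(steps) for d in range(dim)]
    logits = [sum(w * m for w, m in zip(row, mean)) for row in classifier]
    return -math.log(softmax(logits)[gold_emotion])

# Toy setup: 3-word vocabulary, 2-d embeddings, 2 emotion classes.
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
classifier = [[2.0, 0.0], [0.0, 2.0]]
step_probs = [[0.7, 0.2, 0.1], [0.6, 0.1, 0.3]]
loss_if_gold_0 = emotion_loss(step_probs, embeddings, classifier, 0, k=2)
loss_if_gold_1 = emotion_loss(step_probs, embeddings, classifier, 1, k=2)
```

Using a soft, probability-weighted embedding rather than the argmax word keeps the step differentiable, so the classification loss can flow back into the decoder.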

2.5 Hierarchical Copy Mechanism

A key limitation of Seq2Seq models is that they tend to generate dull and commonplace text such as I don’t know and I like it. We observe that we can encourage the generation of diverse comments by enhancing the relevance between comments and source articles.

To achieve this, we adopt the copy mechanism of See et al. (2017), which allows for copying words from the source text. We adapt the original copy mechanism to consider the hierarchical structure of articles. Specifically, we first compute attention over sentences.


where the attention scoring weights are model parameters, $d_t$ is the decoder's hidden state and $s_i$ is the sentence representation.

We then compute the attention over individual words and normalize it using the sentence-level attention.


where the word-level scoring weights are model parameters and the attended items are the words of each sentence. In this way, words from more informative sentences are rewarded, and words from less informative ones are discouraged.

Finally, we get the context vector as the weighted sum of all the words and obtain our attention vector $a_t$ for the final prediction. Formally,


where the projection weights are model parameters and $P_{\mathrm{vocab}}$ is a probability distribution over all words in the vocabulary. The attention distribution and the vocabulary distribution are then weighted and summed to obtain the final word distribution:


Intuitively, the mixing weight $p_{\mathrm{gen}}$ can be seen as a soft switch to choose between generating a word from the vocabulary and copying a word from the source article. Experimental results show that our hierarchical copy mechanism can effectively improve comment diversity.
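The mixing of the two distributions can be sketched as below. It follows the pointer-generator pattern of See et al. (2017); the reading that "normalizing by the sentence-level attention" means multiplying each word's within-sentence attention by its sentence's attention is an assumption, and the function and argument names are hypothetical.

```python
def hierarchical_copy_distribution(sentence_attn, word_attn, article,
                                   vocab_dist, p_gen):
    """Mix a vocabulary distribution with a hierarchical copy distribution.

    sentence_attn[i]: attention on sentence i (sums to 1)
    word_attn[i][j]:  attention on word j within sentence i (rows sum to 1)
    article[i][j]:    the source word at that position
    vocab_dist:       dict mapping word -> generation probability
    p_gen:            soft switch in [0, 1]
    """
    final = {w: p_gen * p for w, p in vocab_dist.items()}
    for i, sentence in enumerate(article):
        for j, word in enumerate(sentence):
            # Assumption: word attention is rescaled by sentence attention.
            copy_prob = sentence_attn[i] * word_attn[i][j]
            final[word] = final.get(word, 0.0) + (1.0 - p_gen) * copy_prob
    return final

article = [["ode", "of", "joy"], ["great", "series"]]
sentence_attn = [0.8, 0.2]
word_attn = [[0.1, 0.1, 0.8], [0.5, 0.5]]
vocab_dist = {"i": 0.5, "like": 0.3, "joy": 0.2}
final = hierarchical_copy_distribution(
    sentence_attn, word_attn, article, vocab_dist, p_gen=0.6)
```

Note that "series" is absent from the toy vocabulary yet still receives probability mass through copying, which is how article-specific words can reach the comment.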

2.6 Restricted Beam Search

The above method can improve the inter-sentence diversity of comments. However, the Seq2Seq framework has also long been known to generate repeated and redundant words and phrases, as in the example shown in Lin et al. (2018a):

Fatah officially officially elects Abbas as candidate for candidate.

We regard this problem as a lack of intra-sentence diversity. To mitigate this issue, we propose the Restricted Beam Search (RBS) algorithm. Based on the observation that ground-truth comments seldom contain the same n-gram multiple times, we explicitly lower the probability of those words that would create repetitive n-grams at each decoding step. Specifically, we lower the probability of generating a word $w$ if outputting $w$ creates an n-gram that already exists in the previously decoded sequence of the current beam. Formally, the probability of $w$ is modified as follows.


where the penalty weight is a hyperparameter and the count is the number of times that the n-gram created by $w$ has occurred in the previously decoded sequence of the current beam.

Despite its simplicity, we found that this procedure can substantially improve intra-sentence diversity. Note that the RBS algorithm is notably different from Paulus et al. (2017). They set $p(w) = 0$ during beam search when outputting $w$ would create an n-gram that already exists. We, on the other hand, lower the probability of $w$ by a certain amount. We believe that forcing the decoder to never output the same n-gram more than once is too strict, for repetitive trigrams do exist in natural language. In this sense, our RBS algorithm can be seen as a soft version of Paulus et al. (2017) that enables more flexibility.
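A rescoring step in the spirit of RBS might look like the sketch below. The exact penalty formula is not recoverable from the text, so the form used here (subtracting the penalty weight times the repetition count from the candidate's log-probability) is an assumption, as are the function names. The defaults `n=1` and `gamma=0.5` mirror the values reported in the training procedure.

```python
import math

def ngram_count(tokens, ngram):
    """Number of occurrences of `ngram` (a tuple) in `tokens`."""
    n = len(ngram)
    return sum(1 for i in range(len(tokens) - n + 1)
               if tuple(tokens[i:i + n]) == ngram)

def restricted_score(prefix, candidate, log_prob, n=1, gamma=0.5):
    """Soft penalty for a candidate whose new n-gram repeats in the beam.

    Assumption: score = log_prob - gamma * count, where count is how
    often the newly formed n-gram already occurs in `prefix`.
    """
    if len(prefix) < n - 1:
        return log_prob  # not enough context to form an n-gram yet
    new_ngram = tuple(prefix[len(prefix) - (n - 1):]) + (candidate,)
    return log_prob - gamma * ngram_count(prefix, new_ngram)

# One decoding step for a single beam.
prefix = ["fatah", "officially"]
candidates = {"officially": math.log(0.5), "elects": math.log(0.4)}
rescored = {w: restricted_score(prefix, w, lp) for w, lp in candidates.items()}
best = max(rescored, key=rescored.get)
```

Unlike a hard block, the repeated word keeps a finite score, so it can still win when every alternative is much less likely.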

3 Experiments

3.1 Dataset

Currently there is no off-the-shelf article commenting dataset annotated with emotions, so we created our own datasets based on the Tencent News dataset released by Qin et al. (2018). For the fine-grained setting, which involves {Anger, Disgust, Like, Happiness, Sadness}, we trained a Bi-LSTM emotion tagger using the NLPCC dataset. For the coarse-grained setting, which involves {Positive, Negative}, we used a well-trained emotion tagger provided by Baidu, which classifies a text as positive or negative with high accuracy. We then annotated the Tencent News dataset with the two taggers, creating two emotional commenting datasets, called Tencent-coarse and Tencent-fine respectively. We will release the two datasets to promote future research.

For the original Tencent News dataset, there are 169,060 articles in the training set, 4,455 in the development set and 1,397 in the test set. There are several comments for the same article, and many of these comments are short and meaningless. To improve the quality of the generated comments, we trained our model with only the comment that has the most upvotes for each article. We found that this cleaning procedure led to better generation quality and expedited training substantially.
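The cleaning step can be sketched in a few lines. The record layout (`text`, `comments`, `upvotes`) is hypothetical and chosen only for illustration; the released dataset may use different field names.

```python
def keep_top_comment(articles):
    """For each article, keep only the comment with the most upvotes."""
    cleaned = []
    for article in articles:
        if not article["comments"]:
            continue  # nothing to learn from an article without comments
        best = max(article["comments"], key=lambda c: c["upvotes"])
        cleaned.append({"text": article["text"], "comment": best["text"]})
    return cleaned

raw = [
    {"text": "article A",
     "comments": [{"text": "ok", "upvotes": 1},
                  {"text": "a thoughtful reply", "upvotes": 42}]},
    {"text": "article B", "comments": []},
]
pairs = keep_top_comment(raw)
```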

| Model | BLEU | METEOR | ROUGE | D1 | D2 | D3 | Acc-C | Acc-F |
|---|---|---|---|---|---|---|---|---|
| Seq2Seq Sutskever et al. (2014) | 5.15 | 9.43 | 22.67 | 4.51 | 16.48 | 23.56 | - | - |
| Transformer Vaswani et al. (2017) | 4.23 | 9.89 | 24.09 | 3.98 | 15.23 | 21.96 | - | - |
| Flat-Emo Huang et al. (2018) | 4.13 | 7.75 | 23.40 | 3.38 | 10.85 | 15.99 | 76.21 | 43.92 |
| Hier-Emo Huang et al. (2018) | 5.02 | 8.97 | 24.14 | 3.54 | 13.01 | 20.25 | 78.35 | 44.26 |
| CCS | 7.54 | 10.50 | 25.29 | 5.36 | 19.74 | 29.36 | 83.10 | 48.34 |

Table 1: Results of the generation quality (BLEU, METEOR, ROUGE), diversity (D1, D2, D3) and emotion accuracy (Acc-C, Acc-F) (%).

3.2 Training Procedure

We trained our models on a single NVIDIA TITAN RTX GPU. The LSTMs Hochreiter and Schmidhuber (1997) have 512-dimensional hidden states. The dimensions of both word embeddings and emotion embeddings are 512. Dropout Hinton et al. (2012) is used with the dropout rate set to 0.3. The number of layers of the LSTM encoder and decoder is set to 2. The batch size is set to 64.

We use Adam Kingma and Ba (2014) with learning rate = , and . We obtain the pretrained word embedding by training an unsupervised word2vec Mikolov et al. (2013) model on the training set.

We compare the char-based model with the word-based model. For the latter we use jieba for Chinese word segmentation (CWS) to preprocess the text. We find that the char-based model gives better performance, so we adopt this setting. This result is in line with Meng et al. (2019).

We shared the vocabulary between articles and comments, and the vocabulary size was restricted to 5,000. During training, we truncated the article to 30 sentences and limited the length of each sentence to 80 tokens. At test time our comments were produced using restricted beam search with beam size = 5. We found that setting n to 1 and the RBS penalty hyperparameter to 0.5 is enough for effective repetition reduction. Of the remaining hyperparameters, N and K are set to 20 and 50, and the other two are set to 0.01 and 1, respectively.
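The truncation and vocabulary settings above can be sketched as follows. This is an illustration only: splitting a string into characters stands in for the char-based setting, and reserving one slot for an `<unk>` token inside the 5,000-word budget is an assumption, as are the function names.

```python
from collections import Counter

def preprocess(article_sentences, max_sentences=30, max_tokens=80):
    """Char-level tokenisation with sentence and token truncation."""
    return [list(sentence)[:max_tokens]
            for sentence in article_sentences[:max_sentences]]

def build_vocab(tokenized_articles, max_size=5000, unk="<unk>"):
    """Keep the most frequent tokens; everything else maps to `unk`.

    Assumption: the unknown token counts towards `max_size`.
    """
    counts = Counter(tok for article in tokenized_articles
                     for sentence in article for tok in sentence)
    words = [unk] + [w for w, _ in counts.most_common(max_size - 1)]
    return {w: i for i, w in enumerate(words)}

long_article = ["ab" * 100] * 40          # 40 sentences of 200 characters
truncated = preprocess(long_article)
vocab = build_vocab([preprocess(["aab"])], max_size=2)
```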

3.3 Systems for Comparison

As we have mentioned, this is the first work to consider the emotion factor for article commenting. We did not find closely-related baselines in the literature. Nevertheless, we first chose two baselines that are widely used in NLG tasks:

  • Seq2Seq - In this paper we use Seq2Seq Sutskever et al. (2014) model enhanced with attention mechanism Luong et al. (2015).

  • Transformer - The Transformer Vaswani et al. (2017) also follows the encoder-decoder paradigm but relies on self-attention instead of RNN.

However, these two baselines cannot control the expressed emotion. Huang et al. (2018) proposed to generate dialogue with expressed emotion; we adapted their approach and created two new models that are emotion-controllable:

  • Flat-Emo - The article is represented as a flat structure, and the emotion embedding serves as an input to every decoding step to a Seq2Seq network, as in Huang et al. (2018).

  • Hier-Emo - Similar to Flat-Emo, only that the article is encoded in a hierarchical manner.

4 Results

4.1 Generation Quality

Automatic metrics such as BLEU, METEOR, and ROUGE are widely used for NLG tasks Lin et al. (2018a); Rose (2016); Song et al. (2019). We adopt these metrics to evaluate generation quality, that is, whether the comments are relevant and grammatical.

The results can be seen in Table 1. The first observation is that the first three baselines give similar and unsatisfactory results. This may be due to the general limitations of a non-hierarchical structure. Hier-Emo, on the other hand, makes use of this structure information and hence gives better performance. However, it still suffers from the general problems of the Seq2Seq framework. Besides, the emotion information is indiscriminately fused into the decoding process, which also hurts generation quality Ghosh et al. (2017).

Compared to baselines, our model gives substantially better performance in terms of all metrics. One reason is that we adopt the hierarchical copy mechanism, which improves the coherence between comments and articles. Besides, by adopting the dynamic fusion mechanism, our model can use the emotion information selectively, which is also beneficial to generation quality.

For clarification, we note that at test time the input emotion tags are not used as external knowledge. That is, the specified emotion categories are manually designed rather than reflecting the emotions of the gold comments, so the comparison with models that cannot use emotion information (e.g., Seq2Seq and Transformer) would be fair. The reported results are averaged over all emotion categories. Although this setting makes the task harder, we believe it is much closer to practical scenarios.

| Model | Coarse: Pos. | Coarse: Neg. | Coarse: Ave. | Fine: Disgust | Fine: Anger | Fine: Sadness | Fine: Like | Fine: Happiness | Fine: Ave. |
|---|---|---|---|---|---|---|---|---|---|
| CCS-Emo | 79.89 | 77.09 | 78.04 | 59.24 | 31.38 | 41.02 | 63.54 | 39.58 | 46.95 |
| CCS | 81.66 | 84.54 | 83.10 | 61.39 | 33.52 | 43.19 | 69.72 | 33.88 | 48.34 |

Table 2: Effectiveness of our emotion-control approach (%).

| Model | D1 | D2 | D3 |
|---|---|---|---|
| CCS w/o RBS | 3.28 | 14.25 | 22.32 |
| CCS w/o HC | 4.05 | 17.80 | 29.17 |
| CCS | 5.36 | 19.74 | 29.36 |

Table 3: Effectiveness of RBS and hierarchical copy (HC) mechanism on diversity of comments (%).

4.2 Emotion Accuracy

To measure whether the model can generate comments expressing the desired emotion, we define emotion accuracy as the agreement between the desired emotion and the emotion predicted by our emotion tagger. Acc-C and Acc-F report the coarse-grained and fine-grained results, respectively. Of the four baselines, only Flat-Emo and Hier-Emo are emotion-controllable. The results are shown in Table 1.
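As a rough illustration of this metric, the sketch below computes overall and per-category agreement between desired and tagged emotions. The tagger output is supplied as a plain list here, and the function name is hypothetical. The macro average (the mean of the per-category values) is included because the "Ave." columns of Table 2 for CCS equal the mean of the per-category accuracies.

```python
from collections import defaultdict

def emotion_accuracy(desired, predicted):
    """Agreement (in %) between desired emotions and tagger predictions.

    Returns the overall accuracy, the per-category accuracies and
    their macro average. Assumes non-empty, equally long inputs.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for want, got in zip(desired, predicted):
        totals[want] += 1
        hits[want] += int(want == got)
    per_category = {e: 100.0 * hits[e] / totals[e] for e in totals}
    overall = 100.0 * sum(hits.values()) / sum(totals.values())
    macro = sum(per_category.values()) / len(per_category)
    return overall, per_category, macro

desired = ["Like", "Like", "Anger", "Sadness"]
predicted = ["Like", "Happiness", "Anger", "Sadness"]
overall, per_category, macro = emotion_accuracy(desired, predicted)
```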

Our first observation is that these two baselines give similar results on this metric, with Hier-Emo performing slightly better. Compared to them, our model performs significantly better under both the coarse-grained setting and the fine-grained setting, with over 3% absolute improvement.

More detailed analysis validates the effectiveness of the dynamic fusion mechanism and the emotion loss term. As shown in Table 2, using a simple fusing method (CCS-Emo) results in a drastic drop in emotion accuracy, almost to the same level as our baselines.

4.3 Diversity

To measure whether the hierarchical copy mechanism can promote the diversity of the generated comments, we report the proportion of novel n-grams in all the generated texts, represented as Dn. This metric has been widely used to evaluate the diversity of generated text in a variety of NLG tasks Lin et al. (2018a); Chen and Bansal (2018); Li et al. (2015).
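One common way to compute such a diversity score is sketched below. It assumes that "novel n-grams" is read as the number of distinct n-grams divided by the total number of n-grams across all generated comments, in the style of the distinct-n metric of Li et al. (2015); the paper's exact counting may differ, and the function name is hypothetical.

```python
def distinct_n(comments, n):
    """Percentage of distinct n-grams among all n-grams in `comments`.

    `comments` is a list of token lists.
    """
    ngrams = [tuple(tokens[i:i + n])
              for tokens in comments
              for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 100.0 * len(set(ngrams)) / len(ngrams)

comments = [["i", "like", "it"], ["i", "like", "joy"]]
d1 = distinct_n(comments, 1)   # 4 distinct unigrams out of 6
d2 = distinct_n(comments, 2)   # 3 distinct bigrams out of 4
```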

As shown in Table 1, all three diversity metrics improve substantially. In particular, CCS beats Transformer on D3 by over 7% (absolute), showing that CCS can effectively generate diverse comments compared to strong baselines.

From Table 3, we can see that after removing the hierarchical copy mechanism, the diversity drops noticeably (about 1.9% absolute for D2). Besides, the RBS algorithm also contributes to comment diversity: after removing RBS, D2 drops by more than 5% absolute. To examine the effectiveness of RBS more closely, we report the percentage of repetitive n-grams within comments. From Figure 3, we can observe that the repetition problem (lack of intra-sentence diversity) is rather severe when decoding with the normal beam search algorithm (CCS w/o RBS), with repetitive 3-grams and 4-grams almost ten times more frequent than in the reference. Despite the simplicity of our RBS algorithm, the repetition problem is almost completely eliminated, with repetitive 3-grams and 4-grams only slightly more frequent than in the reference comments.
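The repetition statistic can be computed along the following lines. The sketch assumes that an n-gram counts as repetitive each time it reappears within the same comment; the paper does not spell out its counting rule, and the function name is hypothetical.

```python
def repeated_ngram_percentage(comments, n):
    """Percentage of n-grams that repeat an earlier n-gram of the same comment."""
    repeated = 0
    total = 0
    for tokens in comments:
        seen = set()
        for i in range(len(tokens) - n + 1):
            ngram = tuple(tokens[i:i + n])
            total += 1
            if ngram in seen:
                repeated += 1
            seen.add(ngram)
    return 100.0 * repeated / total if total else 0.0

example = ("Fatah officially officially elects Abbas "
           "as candidate for candidate").split()
unigram_repetition = repeated_ngram_percentage([example], 1)
bigram_repetition = repeated_ngram_percentage([example], 2)
```

For the example sentence quoted from Lin et al. (2018a), two of the nine unigrams ("officially" and "candidate") repeat an earlier token, while no bigram repeats.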

Figure 3: Restricted beam search (RBS) helps to mitigate the repetition problem. Comments from normal beam search contain many duplicated n-grams, whereas our RBS algorithm produces a number similar to that of the reference comments.
Figure 4: Case study.

4.4 Case Study

To gain insight into how well the emotion is expressed in the generated comments, we provide an example in Figure 4. We can see that the Seq2Seq baseline cannot control the expressed emotion, and the comment is not very relevant to the article. By contrast, our model can generate far more relevant comments. The main characters' names, Bozhi Zhang and Nicholas Lse, are directly copied into the comments, thanks to the hierarchical copy mechanism.

As for the emotion expressed, we can see that many emotional words appear in the generated comments, like pathetic and brute. Besides, there are also other comments that do not contain any explicit emotional words, yet also express the specified emotions.

However, we do observe some limitations of our model. The emotion expressed is not always precise. We believe a major reason is that the dataset we use to train our emotion tagger is noisy. Besides, some comments are not very informative. We believe a possible solution might be introducing external knowledge. We will leave these to future work.

5 Related Work

Automatic Article Commenting - Qin et al. (2018) formally proposed the automatic article commenting task, along with a large-scale annotated article commenting dataset, making data-driven seq2seq approaches to solve this problem viable. Much research has since focused on this area Lin et al. (2018b); Ma et al. (2018b).

However, none of these models are controllable, which limits their practical application. They also suffer from the general limitations of Seq2Seq framework, like repetition and lack of diversity.

Controllable Generation - Hu et al. (2017) is one of the earliest works to consider the problem of controllable generation. However, they aimed to generate text conditioned only on the representation vectors, which is significantly different from the task of automatic article commenting. Our work is close to Huang et al. (2018), which first introduced the emotion factor into the dialog generation process. However, as we have discussed in the Introduction, the problem of article commenting is inherently more challenging. Besides, they simply concatenated the emotion embedding into the decoder at every time step, which can be seen as a variant of our baseline Flat-Emo. As we have shown in this paper, this simple approach gives unsatisfactory results.

Copy Mechanism - See et al. (2017) proposed the Pointer-Generator, which copies words from the source text for summarization. However, they regard the input article as a flat structure, ignoring the hierarchical structure of document altogether. Based on this observation, we propose the hierarchical copy mechanism to better suit this task.

Restricted Beam Search - Our RBS algorithm can be seen as a soft version of Paulus et al. (2017). They set $p(w) = 0$ during beam search when outputting a word $w$ would create a trigram that already exists. We, on the other hand, lower the probability of $w$ by a certain amount rather than set it to 0. We believe that forcing the decoder to never output the same trigram more than once is too strict, for repetitive trigrams do exist in natural language.

6 Conclusion

In this paper, we make the first step towards controlled and diverse article commenting. We build two emotional datasets to validate our approach, and propose a dynamic fusion mechanism to effectively control the expressed emotions of comments. Besides, our model can also generate more diverse comments thanks to the hierarchical copy mechanism and RBS. Experimental results show that our model beats strong baselines.