Improving Online Forums Summarization via Unifying Hierarchical Attention Networks with Convolutional Neural Networks

Online discussion forums are prevalent and easily accessible, allowing people to share ideas and opinions by posting messages in discussion threads. Forum threads that grow significantly in length can make it difficult for participants, both newcomers and existing members, to grasp the main ideas. This study aims to create an automatic text summarizer for online forums to mitigate this problem. We present a framework based on hierarchical attention networks, unifying Bidirectional Long Short-Term Memory (Bi-LSTM) and Convolutional Neural Network (CNN) to build sentence and thread representations for forum summarization. In this scheme, Bi-LSTM derives a representation that comprises information of the whole sentence and whole thread, whereas CNN recognizes high-level patterns of dominant units with respect to the sentence and thread context. The attention mechanism is applied on top of CNN to further highlight the high-level representations that capture any important units contributing to a desirable summary. Extensive performance evaluation based on three datasets, two of which are real-life online forums and one of which is a news dataset, reveals that the proposed model outperforms several competitive baselines.


1. Introduction

Online discussion forums embody a plethora of information exchanged among people with a common interest. Typically, a discussion thread is initiated by a user posting a message (e.g. question, suggestion, narrative, etc.), then other users who are interested in the topic join the discussion, also by posting their own messages (e.g. answer, relevant experience, new question, etc.). A thread that gains popularity can span hundreds of messages, placing a burden on both newcomers and current participants, who have to spend extra time to understand or simply to catch up with the discussion so far. An automatic forum summarization method that generates a concise summary is therefore highly desirable.

One simple way to produce a summary is to identify salient sentences and aggregate them. This method naturally aligns with the concept of extractive summarization, which likewise involves selecting representative units and concatenating them according to their chronological order. In order to determine the saliency of each unit, the context must be taken into account. This factor is critical to any summarization process, whether it be an automatic system or a human tasked with selecting sentences from a document to form a summary. As an illustration, if a human is given a thread to extract key sentences from, they would first read the thread to grasp contextual information, then select sentences based on that context to compose a summary. On the other hand, if an arbitrary sentence is shown to a human without supplying the context of the thread to which that sentence belongs, there would be no clear way of deciding if the sentence should belong in the summary. Previous works (Cheng and Lapata, 2016; Yang et al., 2016; Zhou et al., 2018) have shown that the performance of a summarizer can be improved with the context information from the document structure. The model can utilize such knowledge to generate more effective representations. Similar to documents, forum threads also possess a hierarchical structure in which words constitute a sentence, sentences constitute a post, and posts constitute a thread.

In this work, we propose a data-driven approach based on hierarchical attention networks to summarize online forums. In order to utilize knowledge of the forum structure, the method hierarchically encodes sentences and threads to obtain sentence and thread representations. Meanwhile, an attention mechanism is applied to further place emphasis on salient units. Drawing inspiration from how humans read, comprehend, and then summarize a document, we arrived at a network design that unifies Bidirectional Long Short-Term Memory (Bi-LSTM) and Convolutional Neural Network (CNN). In this scheme, Bi-LSTM derives a representation that comprises information of the whole sentence and whole thread (long-term dependencies), whereas CNN recognizes high-level patterns of dominant units (words and sentences) with respect to the context from the sentence and thread. The two networks are combined with the aim of leveraging their individual strengths to achieve more effective representations than either one achieves alone. Our extensive experimental results verify this effectiveness.

The contributions of this study are as follows:

  • We propose a hierarchical attention network which unifies Bi-LSTM and CNN to obtain representations for the extractive summarization of online forums. The attention mechanism is employed to put weight on important units. Different from previous studies that apply attention directly to individual words and sentences (Yang et al., 2016; Tarnpradab et al., 2017), our findings suggest that applying attention to the high-level features extracted and compressed by CNN contributes to improvements in performance.

  • To demonstrate the advantage of the proposed hybrid model, we perform a comprehensive empirical study. The results show that the proposed approach significantly outperforms a range of competitive baselines as well as the initial study (Tarnpradab et al., 2017), with respect to both sentence-level scores and ROUGE evaluation. This encourages further investigation into the use of the proposed hybrid network for text summarization.

  • We conduct an extensive experiment using different pretrained embeddings (static and contextual) to investigate their effectiveness in improving the summarization performance. Moreover, since the proposed approach can be framed as multi-document summarization, we evaluate the performance on three datasets, two of which are from the online forums domain and one of which is from the news domain.

The remainder of this paper is organized as follows. We review the literature related to automatic summarization in Section 2. The proposed framework is introduced in Section 3. In Section 4, we provide details on the dataset and the experimental configurations for the performance studies, and explain the baselines used in the comparative study to assess the effectiveness of our proposed model. The performance results are analyzed in Section 5. Finally, we draw our conclusions in Section 6.

2. Related Work

In this study, we address the problem of online forum summarization. This section therefore describes three major strands of research related to this study: extractive summarization, neural network-based text summarization, and representation learning.

2.1. Extractive Summarization

There are mainly two kinds of methods used in text summarization, namely extractive summarization and abstractive summarization (Hahn and Mani, 2000). Owing to its effectiveness and simplicity, the extractive summarization approach has been used extensively. The technique involves segmenting text into units (e.g. sentences, phrases, paragraphs, etc.), then concatenating a key subset of these units to derive a final summary. In contrast, the abstractive approach functions similarly to paraphrasing, by which the original units are hardly preserved in the output summary. In this study, we consider the extractive summarization approach and design a deep classifier to recognize key sentences for the summary.

The extractive approach has been applied to data from various domains such as forum threads (Bhatia et al., 2014; Tarnpradab et al., 2017), online reviews (Hu et al., 2017; Liu and Wan, 2019; Nguyen et al., 2019), emails (Carenini et al., 2008), group chats (Zhang and Cranshaw, 2018; Tepper et al., 2018), meetings (Nihei et al., 2018, 2016), microblogs (Rudra et al., 2018, 2018; Sharma et al., 2019), and news (Liu et al., 2020; Thomas et al., 2016; Duan and Jatowt, 2019), to name just a few. In the news domain, articles typically follow a clear pattern where the most important point is at the top of the article, followed by the secondary point, and so forth. We generally do not observe such a clear pattern in the other domains listed above. In particular, forum content is created by multiple users; thus, the gist may be spread across different posts and is not necessarily found in the first sentence or paragraph. Furthermore, this user-generated content (UGC) generally contains noise, misspellings, and informal abbreviations, which make choosing sentences for summarization much more challenging. In our work, we focus on summarizing content in the forum thread. Given the nature of forum data, the task can be framed as multi-document summarization where the documents are created and posted by different authors.

2.2. Neural Network-based Text Summarization

A large body of research applies neural networks involving RNN (Nallapati et al., 2017), CNN (Cao et al., 2017), along with a combination of both (Singh et al., 2017; Narayan et al., 2018) to develop and improve text summarization. For example, Nallapati et al. (2017) have proposed an RNN-based sequence model entitled SummaRuNNer to produce extractive summaries. A two-layer bidirectional Gated Recurrent Unit (GRU) is applied to derive document representations. The first layer runs at the word level to derive the hidden representation of each word in both forward and backward directions. The second layer runs at the sentence level to encode the representations of sentences in the document.

Cao et al. (2017) have proposed a CNN-based summarization system entitled TCSum to perform multi-document summarization. Adopting the transfer learning concept, TCSum demonstrated that the distributed representation projected from a text classification model can be shared with the summarization model. The model can achieve state-of-the-art performance without the need for handcrafted features. A unified architecture that combines RNN and CNN for the summarization task has shown success in several works. For instance, Singh et al. (2017) have proposed Hybrid MemNet, a data-driven end-to-end network for single-document summarization, where CNN is applied to capture latent semantic features, and LSTM is applied thereafter to capture an overall representation of the document. The final document representation is generated by concatenating two document embeddings, one from CNN-LSTM and the other from the memory network. Narayan et al. (2018) also proposed a unified architecture which frames an extractive summarization problem with a reinforcement learning objective. The architecture involves LSTM and CNN to encode sentences and documents successively. The model learns to rank sentences by training the network in a reinforcement learning framework while optimizing the ROUGE evaluation metric.

Several lines of research have taken into account the hierarchical structure of the document (Zhao et al., 2019; Jiang et al., 2019; Cruz et al., 2020; Cheng and Lapata, 2016; Zhou et al., 2018). Cheng and Lapata (2016) have developed a framework containing a hierarchical document encoder and an attention-based extractor for single-document summarization. The hierarchical information has been shown to help derive a meaningful representation of a document. Zhou et al. (2018) have proposed an end-to-end neural network framework to generate extractive document summaries. Essentially, the authors have developed a hierarchical encoder via a bidirectional Gated Recurrent Unit (BiGRU) which integrates the sentence selection strategy into the scoring model, so that the model can jointly learn to score and select sentences.

The usage of an attention mechanism has also proven successful in many applications (Rush et al., 2015; Lee et al., 2020; Nema et al., 2017; Wang and Ling, 2016; Cao et al., 2016; Narayan et al., 2018; Feng et al., 2018). For example, Wang and Ling (2016) have introduced an attention-based encoder-decoder concept to summarize opinions. The authors have applied LSTM network to generate abstracts, given as input a latent representation computed from the attention-based encoder. Cao et al. (2016) have applied the attention concept to simulate human attentive reading behavior for extractive query-focused summarization. The system called AttSum is proposed and demonstrated to be capable of jointly handling the tasks of query relevance ranking and sentence saliency ranking. More recent works have applied the attention mechanism to facilitate the sentence extraction process to form a summary. Narayan et al. (2018) have proposed a hierarchical document encoder with attention-based extractor to generate extractive summaries. Their results have shown that, with attention, the model can successfully guide the representation learning of the document. Feng et al. (2018) have presented a model entitled AES: Attention Encoder-based Summarization to summarize articles. This architecture comprises an attention-based document encoder and an attention-based sentence extractor. The authors consider both unidirectional and bidirectional RNN in the experiment. The results have shown that better performance can be obtained via bidirectional RNN since the network reads a sequence in both original and reverse orders, which helps to derive better document representation.

2.3. Representation Learning

Representation learning, which aims to acquire representations automatically from the data, plays a crucial role in many Natural Language Understanding (NLU) and Natural Language Processing (NLP) models. In particular, pre-trained word representations are building blocks of NLP and NLU models and have been shown to improve downstream tasks in many domains such as text classification, machine translation, and machine comprehension, among others (Camacho-Collados and Pilehvar, 2018). Learning high-quality word representations is challenging, and many approaches have been developed to produce pre-trained word embeddings, which differ in how they model the semantics and context of the words. word2vec (Mikolov et al., 2013), a window-based model, and GloVe (Global Vectors for Word Representation) (Pennington et al., 2014), a count-based model, rely on the distributional language hypothesis in order to capture the semantics. FastText (Bojanowski et al., 2017) is a character-based word representation in which a word is represented as a bag of character n-grams and the final word vector is the sum of these representations. One of the advantages of FastText, unlike word2vec and GloVe, is its capability of handling out-of-vocabulary (OOV) words.

Although the classical word embeddings can capture semantic and syntactic characteristics of words to some extent, they fail to capture polysemy and disregard the context in which the word appears. To address the polysemous and context-dependent nature of words, contextualized word embeddings have been proposed. ELMo (Embeddings from Language Models) is a deep contextualized word representation in which each representation is a function of the input sentence, where the objective function is a bidirectional Language Model (biLM) (Peters et al., 2018). The representations are a linear combination of all of the internal layers of the biLM, where the weights are learnable for a specific task. BERT (Bidirectional Encoder Representations from Transformers) is another contextualized word representation, which is trained on bidirectional transformers by jointly conditioning on both left and right context in all layers (Devlin et al., 2019). The objective function in BERT is a masked language model where some of the words in the input sentence are randomly masked. FLAIR is a contextualized character-level word embedding which models words and context as sequences of characters and is trained on a character-level language model objective (Akbik et al., 2018).

In summary, several approaches to learning word representations have been adopted in the literature, and they differ in the ways they model meaning and context. The choice of word embeddings for particular NLP tasks is still a matter of experimentation and evaluation. In this study, we experimented with word2vec, FastText, ELMo, and BERT embeddings by integrating them in an embedding layer of the model. These embeddings initialize vectors of words/sentences present in the forum data as a substitute for random initialization.

3. Summarization Model

Our system is tasked with extracting representative sentences from a thread to form a summary, which is naturally well-suited to be formulated as a supervised-learning task. We consider a sentence as an extraction unit due to its succinctness. Let $s = (s_1, \dots, s_N)$ be the sentences in a thread and $y = (y_1, \dots, y_N)$ be the corresponding labels, where $y_i = 1$ indicates that the sentence $s_i$ is part of the summary, and $y_i = 0$ otherwise. Our goal is to find the most probable tag sequence given the thread sentences:

$$\hat{y} = \arg\max_{y \in \mathcal{T}} \; p(y \mid s) \qquad (1)$$

where $\mathcal{T}$ is the set of all possible tag sequences, and $p(y \mid s) = \prod_{i=1}^{N} p(y_i \mid s)$, where the tag of each sentence is determined independently.

In this section, we elaborate on our hierarchical framework for multi-document summarization. The proposed model is based on Hierarchical Attention Networks (HAN) to construct sentence and thread representations (Yang et al., 2016; Tarnpradab et al., 2017). Two types of neural networks, namely a bi-directional recurrent neural network and a convolutional neural network, are combined into a unified framework to maximize the capability of the summarizer. In a nutshell, the model comprises hierarchical encoders, a neural attention component, and a sentence extractor. The encoders generate the representations based on words and sentences in the forum. The neural attention mechanism pinpoints any meaningful units in the process. Finally, the sentence extractor selects and puts together all the key sentences to produce a summary. In the following, the boldface letters represent vectors and matrices. Words and sentences are denoted by their indices.

3.1. Sentence Encoder

The sentence encoder reads an input sentence as a sequence of word embeddings, then returns a sentence vector as an output. The encoder adopts a pipeline architecture in which a bi-directional recurrent neural network is followed by a convolutional neural network. Furthermore, the attention mechanism is employed while generating the sentence vector to give more emphasis to units that contribute more to the meaning of the sentence. This approach to sentence encoding is illustrated in Figure 1. We elaborate on the different network components in the following subsections.

Figure 1. Illustration of the sentence encoder.

3.1.1. Input Layer

Given that each thread is a sequence of sentences and each sentence is a sequence of words, let $s_i = (w_{i1}, \dots, w_{iT})$ denote the $i$-th sentence, where the words are indexed by $t \in \{1, \dots, T\}$ and $T$ denotes the number of words in the sentence. Each word is converted to its corresponding pretrained embedding $\mathbf{x}_{it}$ (Figure 1), and subsequently fed into the bidirectional recurrent neural network.

3.1.2. Bidirectional Recurrent Neural Network layer

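We opt for Bidirectional Long Short-Term Memory (Bi-LSTM) due to its effectiveness as evidenced in previous studies (Hochreiter and Schmidhuber, 1997). LSTM contains an input gate ($i_t$), a forget gate ($f_t$), and an output gate ($o_t$) to control the amount of information coming from the previous time-step as well as flowing out in the next time-step. This gating mechanism accommodates long-term dependencies by allowing the information flow to be sustained for a long period of time. Our Bi-LSTM model contains a forward pass and a backward pass (Eq. 2-3). The forward hidden representation $\overrightarrow{\mathbf{h}}_{it}$ comprises semantic information from the beginning of the sentence to the current time-step; conversely, the backward hidden representation $\overleftarrow{\mathbf{h}}_{it}$ comprises semantic information from the current time-step to the end of the sentence. Both vectors $\overrightarrow{\mathbf{h}}_{it}$ and $\overleftarrow{\mathbf{h}}_{it}$ are in $\mathbb{R}^{D_w}$, where $D_w$ is the dimensionality of the hidden state in the word-level Bi-LSTM. Finally, concatenating the two vectors, in particular $\mathbf{h}_{it} = [\overrightarrow{\mathbf{h}}_{it}; \overleftarrow{\mathbf{h}}_{it}]$, produces a word representation that carries contextual information of the whole sentence of which the word is a part.

$$\overrightarrow{\mathbf{h}}_{it} = \overrightarrow{\mathrm{LSTM}}(\mathbf{x}_{it}, \overrightarrow{\mathbf{h}}_{i,t-1}) \qquad (2)$$

$$\overleftarrow{\mathbf{h}}_{it} = \overleftarrow{\mathrm{LSTM}}(\mathbf{x}_{it}, \overleftarrow{\mathbf{h}}_{i,t+1}) \qquad (3)$$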


3.1.3. Convolutional Layer

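The convolutional layer is primarily used to extract high-level features from the hidden representation obtained from the preceding layer. The vectors $\mathbf{h}_{it}$ of every word in the sentence are compiled to form a matrix $\mathbf{H}_i$, which is used as an input to the CNN. Concretely, $\mathbf{H}_i = [\mathbf{h}_{i1}; \dots; \mathbf{h}_{iT}]$, where $\mathbf{H}_i \in \mathbb{R}^{T \times 2D_w}$. The convolutional layer is composed of a set of $J$ filters, where each filter $\mathbf{W}_j \in \mathbb{R}^{k \times 2D_w}$ is applied to a window of $k$ words, and $j$ denotes the index of each filter. Each filter slides across the input to form a feature map $\mathbf{c}_j$. Each feature map is obtained as:

$$\mathbf{c}_j[t] = \mathrm{ReLU}\left(\mathbf{W}_j \cdot \mathbf{H}_i[t : t+k-1] + b_j\right) \qquad (4)$$

where $\mathbf{H}_i[t : t+k-1]$ denotes a submatrix of $\mathbf{H}_i$ comprised of row $t$ to row $t+k-1$; $\cdot$ denotes the sum of element-wise products; and $b_j$ is an additive bias. A Rectified Linear Unit (ReLU) is applied element-wise as a nonlinear activation function in this study.

A one-dimensional max-pooling operation is then performed to obtain a fixed-length vector. A total of $J$ max-pooled vectors are generated, one for each $\mathbf{c}_j$. Given that each feature map is of length $L = T - k + 1$, through a 1D max-pooling window of size 2, $\mathbf{c}_j$ is transformed into a vector of half the length. In other words, only the most meaningful feature per bigram is extracted. Thus, each $\mathbf{c}_j$ is transformed into a vector $\mathbf{p}_j$ which is constituted of the max-pooled values concatenated together (Eq. 5). All resultant max-pooled vectors are combined into a final representation $\mathbf{P}$ (Eq. 6).

$$\mathbf{p}_j = \left[\max(\mathbf{c}_j[1], \mathbf{c}_j[2]), \; \max(\mathbf{c}_j[3], \mathbf{c}_j[4]), \; \dots, \; \max(\mathbf{c}_j[L-1], \mathbf{c}_j[L])\right] \qquad (5)$$

$$\mathbf{P} = [\mathbf{p}_1; \dots; \mathbf{p}_J] \qquad (6)$$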
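The convolution and pooling steps described above can be illustrated with a minimal pure-Python sketch for a single filter. It assumes a stride of 1 and no padding, which are illustrative choices rather than the configuration reported for the trained model; the function names `conv_feature_map` and `max_pool_pairs` and the toy numbers are hypothetical.

```python
def conv_feature_map(H, W, b):
    """Slide one filter W (k rows) over the rows of H and apply ReLU.

    H: list of word vectors (one row per word).
    W: filter weights, k rows of the same width as the rows of H.
    b: scalar bias.
    Assumes stride 1 and no padding, so the output has len(H) - k + 1 entries.
    """
    k = len(W)
    feature_map = []
    for t in range(len(H) - k + 1):
        window = H[t:t + k]
        # Sum of element-wise products between the filter and the window.
        total = sum(
            w * h
            for w_row, h_row in zip(W, window)
            for w, h in zip(w_row, h_row)
        )
        feature_map.append(max(0.0, total + b))  # ReLU
    return feature_map


def max_pool_pairs(c):
    """Non-overlapping max-pooling with window size 2.

    A trailing element without a partner is dropped.
    """
    return [max(c[i], c[i + 1]) for i in range(0, len(c) - 1, 2)]


# Toy input: 5 word vectors of dimension 2, one filter spanning k = 2 words.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 1.0]]
W = [[1.0, -1.0], [0.5, 0.5]]
c = conv_feature_map(H, W, b=0.0)
p = max_pool_pairs(c)
print(c, p)
```

In a full model there would be one such filter per feature map, and the pooled vectors of all filters would be stacked to form the input of the attention layer.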


3.1.4. Attention Layer

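In this section, we describe the attention mechanism employed to attend to important units in the sentence. Note that the units here refer to latent semantic features of bigrams, as a bigram is the unit of compression (max-pooling) in the prior layer. We introduce a trainable vector $\mathbf{z}_w$ for all the bigrams to capture global bigram saliency. Each vector of $\mathbf{P}$, denoted as $\mathbf{g}_m$, is selected through a multiplication operation $\mathbf{P}\mathbf{e}_m$, where $\mathbf{e}_m$ is a standard basis vector containing all zeros except for a one in the $m$-th position. Every vector $\mathbf{g}_m$ is projected to a transformed space to generate $\mathbf{u}_m$ (Eq. 7). The inner product $\mathbf{u}_m^{\top}\mathbf{z}_w$ signals the importance of the $m$-th bigram. We convert it to a normalized weight $\alpha_m$ using a softmax function (Eq. 8). Finally, a weighted sum of bigram representations is computed to obtain a sentence vector $\mathbf{s}_i$, where $\alpha_m$ is a scalar value indicating the bigram importance (Eq. 9).

$$\mathbf{u}_m = \tanh(\mathbf{W}_a \mathbf{g}_m + \mathbf{b}_a) \qquad (7)$$

$$\alpha_m = \frac{\exp(\mathbf{u}_m^{\top}\mathbf{z}_w)}{\sum_{m'} \exp(\mathbf{u}_{m'}^{\top}\mathbf{z}_w)} \qquad (8)$$

$$\mathbf{s}_i = \sum_{m} \alpha_m \mathbf{g}_m \qquad (9)$$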
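As a rough illustration of the attention step described above, the sketch below projects each bigram vector, scores it against a trainable context vector, normalizes the scores with a softmax, and returns the weighted sum. The `tanh` projection follows the formulation of Yang et al. (2016) and is an assumption here, as are the function name `attend` and the parameter names `W`, `b`, and `z`.

```python
import math


def attend(vectors, W, b, z):
    """Attention pooling over a list of equally sized vectors.

    vectors: bigram feature vectors.
    W, b:    projection matrix (list of rows) and bias (assumed tanh projection).
    z:       context vector with one entry per row of W.
    Returns (weights, pooled), where weights sum to 1 and pooled is the
    weighted sum of the input vectors.
    """
    # Project every vector: u = tanh(W g + b).
    projected = [
        [math.tanh(sum(w * g for w, g in zip(row, vec)) + bias)
         for row, bias in zip(W, b)]
        for vec in vectors
    ]
    # Importance score of each vector: inner product with the context vector.
    scores = [sum(u * zi for u, zi in zip(u_vec, z)) for u_vec in projected]
    # Softmax, shifted by the maximum score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the original vectors.
    dim = len(vectors[0])
    pooled = [sum(a * vec[d] for a, vec in zip(weights, vectors))
              for d in range(dim)]
    return weights, pooled


identity = [[1.0, 0.0], [0.0, 1.0]]
weights, pooled = attend([[1.0, 0.0], [0.0, 1.0]], identity, [0.0, 0.0], [1.0, 0.0])
print(weights, pooled)
```

With the toy context vector `[1.0, 0.0]`, the first input vector receives the larger weight because its projection has the larger inner product with `z`.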

Figure 2. Complete framework of the proposed summarization model.

3.2. Thread Encoder

The thread encoder takes as input a sequence of sentence vectors $(\mathbf{s}_1, \dots, \mathbf{s}_N)$ previously encoded through the sentence encoder, as illustrated in Figure 2. We choose to index sentences by $i$. The thread encoder has a similar network architecture to the sentence encoder, summarized by Eq. 10-19. Note that the vectors $\overrightarrow{\mathbf{h}}_i$ and $\overleftarrow{\mathbf{h}}_i$ are in $\mathbb{R}^{D_s}$, where $D_s$ is the dimensionality of the hidden state in the sentence-level Bi-LSTM (Eq. 10, 11). The concatenated vector $\mathbf{h}_i$ of every sentence in the thread is compiled to form a matrix $\mathbf{H}$ (Eq. 12, 13). Each feature map is represented by $\mathbf{c}^{s}_j \in \mathbb{R}^{N-k+1}$, where $j$ is the index of each CNN filter, $N$ is the total number of sentences in the thread, and $k$ is the filter height (Eq. 14). The vector $\mathbf{p}^{s}_j$ is constituted of the max-pooled values of $\mathbf{c}^{s}_j$ concatenated together (Eq. 15). The max-pooling window size is 2, representing a pair of consecutive sentences. All resultant max-pooled vectors are combined into a final representation $\mathbf{P}^{s}$ (Eq. 16). Each vector of $\mathbf{P}^{s}$ is denoted by $\mathbf{g}^{s}_n$ (Eq. 17). The sentence-level attention mechanism introduces a trainable vector $\mathbf{z}_s$ that encodes salient sentence-level content. The thread vector $\mathbf{v}$ is a weighted sum of sentence pairs, where $\alpha^{s}_n$ is a normalized scalar value indicating important sentence pairs of the thread (Eq. 18, 19).


3.3. Output Layer

The vector representation of each sentence is concatenated with its corresponding thread representation to construct the final sentence representation. With this, both sentence-level and thread-level context are taken into account when classifying whether or not the sentence is part of the final summary. The learned vector representations are fed into a dense layer with a sigmoid activation function. Cross-entropy is used to measure the network loss.
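The output layer can be sketched in a few lines: concatenate the two vectors, apply a single dense unit with a sigmoid, and score the prediction with binary cross-entropy. This is a minimal sketch under the assumption of a single output unit; the names `saliency` and `binary_cross_entropy` and the clipping constant `eps` are hypothetical.

```python
import math


def saliency(sentence_vec, thread_vec, weights, bias):
    """Probability that a sentence belongs in the summary.

    The sentence and thread vectors are concatenated, then passed through
    one dense unit with a sigmoid activation.
    """
    features = sentence_vec + thread_vec  # list concatenation
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def binary_cross_entropy(p, label, eps=1e-12):
    """Cross-entropy loss for one prediction p against a 0/1 label."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


prob = saliency([0.2, -0.1], [0.4, 0.3], [1.0, 1.0, 1.0, 1.0], bias=0.0)
loss = binary_cross_entropy(prob, label=1)
print(prob, loss)
```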

3.4. Sentence Extraction

We impose a limit to the number of words in the final summary – at most 20% of total words in the original thread are allowed. In order to extract salient sentences, all sentences are sorted based on saliency scores outputted from the dense layer. Sorted sentences are then iteratively added to the final summary until the compression limit is reached. At last, all the sentences in the final summary are chronologically ordered according to their appearance in the original thread. Since sentences selected by supervised summarization models tend to be redundant as in (Cao et al., 2017), we apply an additional constraint to include a sentence in the summary only if it contains at least 50% new bigrams in comparison to all existing bigrams in the final summary. Henceforth, we refer to our approach as Hierarchical Hybrid Deep Neural Network or the hybrid network for short. A complete framework of the hybrid network is illustrated in Figure 2.
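The extraction procedure above can be sketched as follows. The sketch stops at the first ranked sentence that would exceed the word budget; whether the original system stops or skips that sentence and continues is not specified in the text, so this is an assumption. The function names and the toy thread are hypothetical.

```python
def bigrams(tokens):
    """Set of adjacent token pairs in a sentence."""
    return set(zip(tokens, tokens[1:]))


def extract_summary(sentences, scores, ratio=0.2, min_new=0.5):
    """Select sentence indices for the summary.

    sentences: list of token lists in thread order.
    scores:    saliency score per sentence.
    ratio:     maximum fraction of the thread's words allowed in the summary.
    min_new:   minimum fraction of a sentence's bigrams that must be unseen.
    Returns the selected indices in chronological order.
    """
    budget = ratio * sum(len(s) for s in sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, seen, used = [], set(), 0
    for i in ranked:
        tokens = sentences[i]
        if used + len(tokens) > budget:
            break  # assumption: stop once the compression limit is reached
        grams = bigrams(tokens)
        if grams and len(grams - seen) / len(grams) < min_new:
            continue  # too redundant with the summary built so far
        chosen.append(i)
        seen |= grams
        used += len(tokens)
    return sorted(chosen)


thread = [
    "a b c d e".split(),
    "a b c d f".split(),
    "g h i j k".split(),
    "l m n o p".split(),
]
# ratio=0.5 is used only to keep the toy example small.
print(extract_summary(thread, [0.8, 0.9, 0.7, 0.1], ratio=0.5))
```

In the toy thread, the first sentence is skipped despite its high score because only one of its four bigrams is new relative to the already selected second sentence.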

4. Experiment

In this section, we first give a description of the datasets used for experiments, followed by details of how the training set is created. Then, we present experiment configurations along with a list of hyperparameters explored to achieve the best performing model. Next, we provide a brief description of baselines used in our performance study, and subsequently we give an introduction to the metrics for evaluating performance of the summarization system.

4.1. Dataset

Since the proposed approach is applicable to multi-document summarization, we perform experiments on news data in addition to online forum data. Three datasets, namely Trip advisor, Reddit, and Newsroom, are used in our study; the first two were crawled from online forums, while the third consists of news articles from major publications. Statistics of all datasets are provided in Table 1, and a brief description of each is as follows.

Trip advisor. The Trip advisor forum data were collected by Bhatia et al. (2014). In our study, there are a total of 700 TripAdvisor threads, 100 of which were originally annotated with human summaries by Bhatia et al. (2014), and the additional 600 threads were annotated later by Tarnpradab et al. (2017). We held out 100 threads as a development set and reported the performance results on the remaining threads. The development set is mainly used for hyperparameter-tuning purposes as described in Section 4.3. The reference summaries were prepared by having two human annotators generate a summary for each thread. Both annotators were instructed to read a thread, then write a corresponding summary with the length limited to between 10% and 25% of the original thread length. The annotators were also encouraged to pick sentences directly from the data.

Reddit. The Reddit forum data were prepared by Wubben et al. (2015). The dataset contains 242,666 threads in 12,980 subreddits. The size of threads ranges from 5 sentences with a few words per line to over 43,000 sentences. In our study, we utilize threads with a length of at least 10 sentences, since shorter threads do not need to be summarized. The training and test sets contain 66,589 and 17,869 threads respectively, while the development set contains 20,891 threads for hyperparameter tuning. The reference summaries were prepared by using the number of votes as a factor to select sentences. That is, all sentences are first ranked based on their final votes (upvotes minus downvotes); the ranked sentences are then iteratively added to the output list until the total number of words reaches the compression ratio (25% of the original total words); finally, the selected sentences are arranged in chronological order.

Newsroom. The Newsroom summarization dataset contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications (Grusky et al., 2018). It is used for training and evaluating summarization systems. The dataset provides training, development, and test sets. Each set comprises summary objects, each of which includes the article text, its corresponding summary, date, and density bin, to name just a few fields. The density bin denotes the summarization strategy of the reference summary, which can be extractive, abstractive, or a mix of both. In our study, we use only articles whose reference summary was generated via the extractive approach. As with the Reddit dataset, some news articles are too short to need a summary, so we filter out any articles with fewer than 10 sentences. As a result, the training and test sets contain a total of 288,671 and 31,426 articles respectively, while the development set contains 31,813 articles for exploring the best set of hyperparameters.

| | Trip advisor | Reddit | Newsroom |
|---|---|---|---|
| Vocabulary | 26,422 | 910,941 | 894,212 |
| #Threads | 700 | 105,349 | 351,910 |
| Avg #sentences | 59.41 | 64.99 | 35.57 |
| Max #sentences | 144 | 43,486 | 12,016 |
| Avg #words | 833.65 | 783.85 | 679.24 |
| Max #words | 1,559 | 532,400 | 178,463 |
| Avg #words per sentence | 14.03 | 11.84 | 16.23 |

Table 1. Data statistics

4.2. Training Set Creation

In this study, every sentence requires a label to train the deep neural network; therefore, we create a training set where each sentence is marked as True to indicate a part-of-summary unit, or False otherwise. First, an empty set S = { } is initialized per thread (or per news article). Each sentence that is not a member of the set is temporarily added to the set, the ROUGE-1 score (Lin, 2004) between the set and the gold summaries is measured, and the sentence is then removed from the set. Once all sentences have had their ROUGE-1 score measured, the candidate sentence that increases the score the most is permanently added to the set. This process is repeated until one of the following conditions is met: 1) the total number of words in the selected sentences reaches the desired compression ratio of 20%, or 2) the ROUGE score of the summary cannot be improved any further. Finally, the sentences that are members of the set are labeled True, while the others are labeled False. We utilized the ROUGE 2.0 Java package (Ganesan, 2015) to evaluate the ROUGE scores, which represent the unigram overlap between the selected sentences and the gold summaries.
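The greedy labeling procedure above can be sketched in pure Python. Note the assumptions: the sketch uses a simplified ROUGE-1 F1 computed from clipped unigram counts instead of the ROUGE 2.0 Java package, it compares against a single gold summary, and ties are broken in favor of the earliest sentence. The names `rouge1_f1` and `greedy_labels` are hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate_tokens, reference_tokens):
    """Simplified ROUGE-1 F1 based on clipped unigram overlap."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def greedy_labels(sentences, gold_tokens, ratio=0.2):
    """Label each sentence True/False by greedily maximizing ROUGE-1.

    sentences:   list of token lists in document order.
    gold_tokens: tokens of the gold summary.
    ratio:       word budget as a fraction of the document length.
    """
    budget = ratio * sum(len(s) for s in sentences)
    selected, best, used = [], 0.0, 0
    while used < budget:
        # Score every unselected sentence as a temporary addition to the set.
        trials = []
        for i, sent in enumerate(sentences):
            if i in selected:
                continue
            tokens = [tok for j in selected for tok in sentences[j]] + sent
            trials.append((rouge1_f1(tokens, gold_tokens), i))
        if not trials:
            break
        score, pick = max(trials, key=lambda pair: pair[0])
        if score <= best:
            break  # the ROUGE score cannot be improved any further
        selected.append(pick)
        best = score
        used += len(sentences[pick])
    return [i in selected for i in range(len(sentences))]


doc = [
    ["the", "cat", "sat"],
    ["dogs", "bark", "loudly"],
    ["the", "cat", "slept"],
    ["birds", "fly", "south"],
    ["fish", "swim", "deep"],
]
gold = ["the", "cat", "sat", "and", "slept"]
# ratio=0.4 is used only to keep the toy example small.
print(greedy_labels(doc, gold, ratio=0.4))
```

Choosing F1 rather than recall makes the second stopping condition meaningful: adding a sentence that repeats already covered unigrams lowers precision and therefore the score.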

4.3. Model Configuration

The optimum parameters for the hybrid model were explored through experimentation. Six-fold cross-validation was used for both the tuning and training processes. We performed a random search by sampling without replacement over 80% of all possible configurations, since the whole configuration space is too large. All hyperparameters are listed in Table 2, some of which are based on the recommendation of Zhang and Wallace (2017). We found the best configuration for the number of Bi-LSTM neurons to be 200 at the sentence encoder and 100 at the thread encoder. For CNN hyperparameters, the best explored number of convolutional layers at the sentence and thread levels is 2, and the best number of filters at both levels is 100, where each filter has a size as well as a stride length of 2. The best explored dropout rate is 0.3, with a learning rate of 0.001 and a batch size of 16. Lastly, the RMSprop optimizer was found to best optimize the binary cross-entropy loss function in our model.

The training/validation/test split was set to 0.8/0.1/0.1 of all threads. We kept this split ratio fixed in all the experiments and for all datasets. To prevent the model from overfitting, we applied early stopping during the training process. This was done by computing an error value of the model on a validation dataset for every epoch, and terminating the training if the error value monotonically increased. After obtaining the best configuration, we retrained the model on the union of training and development sets and evaluated it on the test set. All the training forum threads or news articles are iterated at each epoch. The training process continues until loss value converges or the maximum epoch number of 500 is met.
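The early-stopping rule described above can be illustrated with a short sketch. The text only states that training terminates when the validation error increases monotonically; the window length (`patience`) and the function name `stopping_epoch` are assumptions introduced here for illustration.

```python
def stopping_epoch(val_errors, patience=3):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the validation error has risen for `patience`
    consecutive epochs, i.e. it increased monotonically over that window.
    `patience` is a hypothetical parameter; the paper does not state one.
    """
    rises = 0
    for epoch in range(1, len(val_errors)):
        if val_errors[epoch] > val_errors[epoch - 1]:
            rises += 1
        else:
            rises = 0  # the error went down or stayed flat: reset the window
        if rises >= patience:
            return epoch + 1
    return None
```

For example, with validation errors `[0.5, 0.4, 0.41, 0.42, 0.43]` and the default window of three, the error rises at epochs 3, 4, and 5, so training would stop at epoch 5.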

Regarding the pretrained vectors, we apply word2vec, FastText (Mikolov et al., 2018), and ELMo as word-level embedding vectors. BERT, however, is applied as sentence-level embedding vectors, since sentence vectors produced by BERT have been shown to give better performance.

| Hyperparameter | Range |
|---|---|
| Number of Bi-LSTM hidden layer neurons | {25, 50, 100, 200} |
| Number of convolutional layers | {1, 2, 3, 4, 5} |
| Number of CNN filters | {100, 200, 400, 600} |
| CNN receptive field size | {1, 2, 3, 4, 5, [2,3], [2,3,4], [2,3,4,5]} |
| Dropout rate | {0.0, 0.1, 0.3, 0.5} |
| Learning rate | {0.001, 0.01, 0.1} |
| Batch size | {16, 32, 64, 128} |
| Optimizer | {SGD (Ruder, 2016), RMSprop (Hinton et al., ), Adadelta (Zeiler, 2012), Adagrad (Duchi et al., 2011), Adam (Kingma and Ba, 2014), Adamax (Kingma and Ba, 2014)} |

Table 2. Hyperparameter values evaluated in the proposed model.
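As an illustration of the random search over Table 2, the sketch below enumerates the Cartesian product of the listed values and samples a fraction of it without replacement. It is a simplified reading, not the authors' tuning script: it treats the Bi-LSTM size as a single parameter (the study tunes the sentence and thread encoders separately), and the dictionary keys and the function `sample_configurations` are hypothetical names.

```python
import itertools
import random

# Value ranges copied from Table 2; the key names are illustrative only.
SEARCH_SPACE = {
    "bilstm_units": [25, 50, 100, 200],
    "conv_layers": [1, 2, 3, 4, 5],
    "cnn_filters": [100, 200, 400, 600],
    "receptive_field": [(1,), (2,), (3,), (4,), (5,), (2, 3), (2, 3, 4), (2, 3, 4, 5)],
    "dropout": [0.0, 0.1, 0.3, 0.5],
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [16, 32, 64, 128],
    "optimizer": ["SGD", "RMSprop", "Adadelta", "Adagrad", "Adam", "Adamax"],
}


def sample_configurations(space, fraction, seed=0):
    """Sample a fraction of the full grid without replacement."""
    keys = list(space)
    grid = list(itertools.product(*(space[k] for k in keys)))
    rng = random.Random(seed)
    chosen = rng.sample(grid, int(fraction * len(grid)))  # no duplicates
    return [dict(zip(keys, values)) for values in chosen]
```

Under this simplified reading the grid holds 184,320 configurations, so even a partial sweep is costly when each configuration is evaluated with six-fold cross-validation.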

4.4. Baselines

We compare the proposed model against both unsupervised and supervised methods. The more detailed descriptions of the baselines are as follows.

4.4.1. Unsupervised-learning Baselines

The unsupervised-learning baselines below are used for our comparative study:

  • ILP  (Berg-Kirkpatrick et al., 2011), a baseline Integer Linear Programming framework implemented by Boudin et al. (2015).

  • SumBasic  (Vanderwende et al., 2007), an approach that assumes words occurring frequently in a document cluster have a higher chance of being included in the summary.

  • KL-SUM  (Haghighi and Vanderwende, 2009), a method that adds sentences to the summary so long as it decreases the KL Divergence.

  • LSA  (Steinberger and others, ), the latent semantic analysis technique to identify semantically important sentences.

  • LEXRANK  (Erkan and Radev, 2004), a graph-based summarization approach based on eigenvector centrality.

  • MEAD  (Radev et al., 2004), a centroid-based summarization system that scores sentences based on sentence length, centroid, and position.

  • Opinosis  (Ganesan et al., 2010), a graph based algorithm for generating abstractive summaries from large amounts of highly redundant text.

  • TextRank  (Barrios et al., 2016), a graph-based extractive summarization algorithm which computes similarity among sentences.

4.4.2. Supervised-learning Baselines

We also include traditional supervised-learning methods, namely Support Vector Machine (SVM) and LIBLINEAR (Fan et al., 2008), in our study; the option ‘-c 10 -w1 5’ was set for SVM and ‘-c 0.1 -w1 5’ for LogReg. Both employ the following features: 1) cosine similarity of the current sentence to the thread centroid, 2) relative sentence position within the thread, 3) the number of words in the sentence excluding stopwords, and 4) max/avg/total TF-IDF scores of the constituent words. The features were designed such that they carry similar information to our proposed model.

4.4.3. Deep Learning Baseline

Neural network methods, including LSTM and CNN, have been used as deep learning baselines in our study. For LSTM, we implemented a neural network containing a single layer of LSTM to classify sentences in each input thread/news article. For CNN, the CNN model for sentence classification proposed by Kim (2014) is applied. The input layer was initialized with pre-trained static word embeddings. The network uses features extracted from the convolutional layer to perform classification.

In addition, we also implemented a variant of HAN, namely hierarchical convolutional neural network which simply replaces LSTM with CNN. This allows us to examine the effectiveness of each individual network versus the unified network.

4.5. Evaluation Methods

We report ROUGE-1, ROUGE-2, and ROUGE-L scores along with sentence-level scores for the evaluation. In particular, the quantitative values for each method are computed as precision, recall, and F1 measure. We also refer to the ROUGE metrics as R-1, R-2, and R-L for short.

ROUGE-1 and ROUGE-2 are metrics commonly used in the DUC and TAC competitions (Dang and Owczarzak, 2008). R-1 and R-2 precision scores are computed as the number of n-grams the system summary has in common with its corresponding human reference summaries divided by the total number of n-grams in the system summary, where R-1 and R-2 set n=1 and n=2, respectively. R-1 and R-2 recall scores are calculated the same way, except that the number of overlapping n-grams is divided by the total number of n-grams in the human reference summary. Finally, the F1 score for R-1 and R-2 is the harmonic mean of precision and recall. We use R-1 and R-2 as a means to assess informativeness. ROUGE-L measures the longest common subsequence of words between the sentences in the system summary and the reference summary. The higher the R-L, the more likely it is that the output summary has n-grams in the same order as the reference summary, which would better preserve the semantic meaning of the reference summary.
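The definitions above translate directly into code. The following is a minimal sketch, assuming a single reference summary and pre-tokenized input; it is not the ROUGE 2.0 package used in the study, and it computes ROUGE-L over the whole summary as one token sequence rather than sentence by sentence. The function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def prf(overlap, system_total, reference_total):
    """Precision, recall, and their harmonic mean (F1)."""
    precision = overlap / system_total if system_total else 0.0
    recall = overlap / reference_total if reference_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def rouge_n(system, reference, n):
    """ROUGE-N from clipped n-gram overlap counts."""
    sys_ngrams, ref_ngrams = ngrams(system, n), ngrams(reference, n)
    overlap = sum((Counter(sys_ngrams) & Counter(ref_ngrams)).values())
    return prf(overlap, len(sys_ngrams), len(ref_ngrams))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(system, reference):
    """ROUGE-L with the LCS length in place of the n-gram overlap."""
    return prf(lcs_length(system, reference), len(system), len(reference))
```

For the system summary "the cat sat on the mat" and the reference "the cat lay on the mat", five of six unigrams and three of five bigrams overlap, and the longest common subsequence has length five.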

The sentence-level score is based on "labels", meaning that each sentence has a true or false value indicating whether the sentence is to be part of the summary. Sentences labelled true in both the reference set (actual class) and the system set (predicted class) are regarded as true positives. Sentences labelled true in the reference set yet false in the system set are false negatives. Sentences labelled true in the system set yet false in the reference set are false positives. Finally, sentences labelled false in both the system set and the reference set are true negatives. Table 3 presents the confusion matrix.

| | Predicted Class: True | Predicted Class: False |
|---|---|---|
| Actual Class: True | True Positives | False Negatives |
| Actual Class: False | False Positives | True Negatives |

Table 3. Confusion Matrix.

Sentence-level precision is the number of true positives divided by the sum of true positives and false positives. Sentence-level recall is the number of true positives divided by the sum of true positives and false negatives. Lastly, sentence-level F1 is the harmonic mean of recall and precision. Sentence-level scores basically report the classification performance of the model.
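The sentence-level scores can be computed from the two label sequences as follows. This is a minimal sketch; the function name `sentence_level_scores` is illustrative, and returning zero when a denominator is empty is a convention assumed here.

```python
def sentence_level_scores(reference_labels, predicted_labels):
    """Precision, recall, and F1 over per-sentence True/False labels."""
    pairs = list(zip(reference_labels, predicted_labels))
    tp = sum(1 for ref, pred in pairs if ref and pred)
    fp = sum(1 for ref, pred in pairs if not ref and pred)
    fn = sum(1 for ref, pred in pairs if ref and not pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

A model that predicts very few sentences as summary-worthy can score high precision with low recall, which is the pattern reported for some of the contextual embeddings in Table 4.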

5. Performance Evaluation Results and Discussions

Our hybrid network is compared against a set of unsupervised and supervised approaches, along with variants of hierarchical methods. In this section, we first discuss the performance of different methods which involve traditional machine learning baselines, non-hierarchical and hierarchical deep learning methods. Then, we explain comparisons, observations, and provide our detailed analysis. After that, extensive ablation studies are presented.

5.1. Comparison with Traditional Machine Learning Baselines

Among the unsupervised-learning baselines in Table 4, the sentence classification results from MEAD demonstrate good performance. MEAD has also been shown to perform well in previous studies such as (Luo et al., 2016). In this study, MEAD and LexRank are centroid-based, meaning that sentences that contain more words from the cluster centroid are considered to hold key information, thereby increasing their likelihood of being included in the final summary. A similar pattern in results appears in KL-Sum and LSA. Nonetheless, in terms of ROUGE evaluation as shown in Table 5, they were all outperformed by hierarchical-based approaches. Opinosis has poor performance since it relies heavily on the redundancy of the data to generate a graph with meaningful paths. Overall, the hierarchical approaches appear to achieve better performance without the need for sophisticated constraint optimization such as in ILP.

Regarding the supervised-learning baselines, according to Table 5, a pattern of high precision and low recall can generally be observed for both SVM and LogReg. The R-1 results reflect that among the sentences classified as True, there are several unigrams overlapping with the reference summaries. However, when evaluating with higher n-grams, the results show that only a few matches exist between the system and the references. Considering the sentence-level scores of the Trip advisor dataset as an example, it can be seen that LogReg has failed to extract representative sentences as evidenced from the 14.50% precision and 12.10% recall, which are the lowest.

Comparing the traditional models against the hierarchical-based models shows that the hierarchical models have better potential in classifying and selecting salient sentences to form a summary. Furthermore, both traditional baselines share one disadvantage, which is their reliance on a set of features from the feature engineering process. These handcrafted features, usually obtained from studying signals observable from the data, might not be able to capture all the traits necessary for the models to learn and differentiate between classes.

5.2. Comparison to Non-hierarchical Deep Learning Methods

In general, LSTM outperforms CNN in terms of sentence classification as well as ROUGE evaluation. Particularly for the sentence classification task, LSTM achieves high precision scores across all datasets. This indicates the importance of learning sequential information for obtaining an effective representation. Although CNN has proven to be efficient in previous studies, the results in Table 4 show that omitting sequential information results in inferior performance.

In terms of ROUGE evaluation, according to Table 5, the R-1 and R-2 scores of both the LSTM and CNN baselines are quite competitive compared to the hierarchical-based methods. However, with respect to R-L scores, hierarchical-based models generally perform better by a significant margin. We observe that hierarchical models have an advantage over the non-hierarchical deep learning methods in the sense that they also exploit the hierarchical structure on top of the sequential information learned via LSTM and the feature extraction via CNN.

5.3. Comparison with Another Hierarchical Attention-based Deep Network

Of all the hierarchical-based models, we compare the proposed model against the state-of-the-art HAN model to examine whether the hybrid architecture contributes to performance gain/loss. We hypothesize that unifying LSTM and CNN encourages the leverage of both short-term and long-term dependencies, which are keys to learning and generating effective representation for the summarizer. We also make a comparison with the hierarchical convolutional neural network (HCNN) to observe the effect of excluding long-term dependencies captured by LSTM.

According to Table 4, the sentence-level scores show that on average the performance of the hybrid network is comparable to that of the other hierarchical methods regardless of the choice of embedding. HCNN is generally the weakest among the three hierarchical models. This demonstrates that LSTM layers play a key role in capturing sequential information, which is essential for the system to understand input documents. Without LSTM layers, the system only obtains a high-level representation through CNN, which captures only important n-grams/sentences, independent of their position in the sentence/thread. This has been shown to be insufficient to generate an effective representation. Using both LSTM and CNN is a promising avenue for improving summarization. It is important to note that, when the contextual representation is employed, especially for the Reddit and Newsroom datasets, the results show high precision yet low recall. This indicates that few sentences are predicted as part-of-summary sentences; however, most of these predicted labels are correct. Figure 3 illustrates a comparison among hierarchical methods with respect to sentence-level scores across all datasets.

With respect to ROUGE evaluation, Table 5 shows that ROUGE scores for the hierarchical model are promising. Among the hierarchical models, the hybrid methods outperform others in all datasets, as displayed in Figure 4 (a) - (i). We present example summaries generated by the hierarchical models in Figure 5. The results indicate that for the hybrid model, among all its true-labeled sentences, 46.67% were labeled correctly which is higher than the rest of the hierarchical models.

We also observed the behavior of each hierarchical model in terms of loss that is minimized. Figure 6 (a) - (f) illustrate the training loss of each hierarchical model per fold. We note that, for every fold of every model, the objective loss continuously decreases and begins to converge very early on. The average losses across all epochs of HAN, HCNN, and Hybrid model are approximately 0.2555, 0.2564, and 0.2466, respectively. More fluctuations also appear in the HCNN curve. The hybrid model converges faster due to the larger model complexity.

Figure 3. Comparison of F1-scores among Hierarchical Methods based on sentence-level scores. Panels: (a) Trip advisor, (b) Reddit, (c) Newsroom. x-axis denotes types of embeddings; 1=w2v, 2=FastText, 3=ELMo, 4=ELMo+w2v, 5=ELMo+FastText, 6=BERT, 7=BERT+w2v, 8=BERT+FastText. y-axis denotes F1 scores normalized between [0,1]. Blue bars represent HAN, orange bars represent HCNN, and gray bars represent the Hybrid model.

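| Group | Embedding | Method | Trip advisor P | Trip advisor R | Trip advisor F | Reddit P | Reddit R | Reddit F | Newsroom P | Newsroom R | Newsroom F |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baselines | | ILP | 22.60 | 13.60 | 15.60±0.40 | 22.86 | 23.40 | 23.12±0.10 | 16.05 | 20.15 | 17.87±0.10 |
| Baselines | | Sum-Basic | 22.90 | 14.70 | 16.70±0.50 | 22.31 | 17.11 | 19.37±0.20 | 16.57 | 18.52 | 17.49±0.10 |
| Baselines | | KL-Sum | 21.10 | 15.20 | 16.30±0.50 | 23.58 | 17.91 | 20.36±0.10 | 17.25 | 20.74 | 18.83±0.20 |
| Baselines | | LSA | 21.05 | 15.02 | 17.53±0.50 | 23.59 | 17.91 | 20.37±0.10 | 27.64 | 32.34 | 29.81±0.20 |
| Baselines | | LexRank | 21.50 | 14.30 | 16.00±0.50 | 24.69 | 18.17 | 20.94±0.10 | 29.22 | 32.98 | 29.65±0.20 |
| Baselines | | MEAD | 29.20 | 27.80 | 26.80±0.50 | 26.83 | 28.26 | 27.52±0.10 | 25.40 | 41.03 | 31.38±0.10 |
| Baselines | | SVM | 34.30 | 32.70 | 31.40±0.40 | 17.09 | 4.32 | 6.90±0.10 | 27.19 | 14.09 | 18.56±0.30 |
| Baselines | | LogReg | 14.50 | 12.10 | 12.50±0.50 | 5.10 | 0.67 | 1.18±0.30 | 18.43 | 6.22 | 9.30±0.40 |
| Baselines | | LSTM | 43.12 | 38.09 | 40.44±0.03 | 35.02 | 30.45 | 32.27±0.02 | 35.31 | 26.30 | 30.17±0.01 |
| Baselines | | CNN | 35.17 | 23.35 | 28.03±0.03 | 27.91 | 22.61 | 24.98±0.02 | 27.63 | 26.88 | 26.23±0.01 |
| Hierarchical + Static Embedding | w2v | HAN | 39.65 | 33.41 | 36.26±0.05 | 27.01 | 29.74 | 28.31±0.02 | 25.80 | 27.33 | 26.55±0.01 |
| Hierarchical + Static Embedding | w2v | HCNN | 36.37 | 26.78 | 30.84±0.03 | 26.64 | 23.23 | 24.82±0.04 | 23.20 | 31.43 | 26.69±0.03 |
| Hierarchical + Static Embedding | w2v | Hybrid | 40.65 | 32.49 | 36.12±0.02 | 27.63 | 30.78 | 29.12±0.04 | 25.40 | 26.65 | 26.01±0.03 |
| Hierarchical + Static Embedding | FastText | HAN | 35.56 | 25.43 | 29.66±0.05 | 27.11 | 28.00 | 27.55±0.03 | 25.90 | 28.52 | 27.15±0.01 |
| Hierarchical + Static Embedding | FastText | HCNN | 34.97 | 25.56 | 29.53±0.03 | 26.60 | 23.22 | 24.80±0.03 | 22.64 | 28.54 | 25.25±0.02 |
| Hierarchical + Static Embedding | FastText | Hybrid | 39.97 | 32.25 | 35.70±0.02 | 29.48 | 20.89 | 24.45±0.03 | 25.32 | 26.81 | 26.04±0.02 |
| Hierarchical + ELMo | ELMo | HAN | 39.74 | 33.48 | 36.31±0.01 | 35.60 | 1.98 | 3.74±0.01 | 31.61 | 25.84 | 28.44±0.02 |
| Hierarchical + ELMo | ELMo | HCNN | 38.02 | 30.44 | 33.81±0.01 | 36.07 | 5.03 | 8.82±0.01 | 31.22 | 19.22 | 23.79±0.01 |
| Hierarchical + ELMo | ELMo | Hybrid | 40.74 | 36.92 | 38.69±0.02 | 35.49 | 3.75 | 6.79±0.01 | 30.87 | 26.52 | 28.53±0.02 |
| Hierarchical + ELMo | ELMo+w2v | HAN | 38.05 | 30.06 | 33.56±0.01 | 30.78 | 15.69 | 20.79±0.01 | 26.31 | 27.17 | 26.73±0.02 |
| Hierarchical + ELMo | ELMo+w2v | HCNN | 38.08 | 31.08 | 34.18±0.01 | 33.62 | 2.29 | 4.28±0.01 | 26.57 | 26.07 | 26.32±0.01 |
| Hierarchical + ELMo | ELMo+w2v | Hybrid | 38.30 | 32.50 | 35.15±0.01 | 29.33 | 14.60 | 19.50±0.02 | 26.44 | 25.54 | 25.98±0.01 |
| Hierarchical + ELMo | ELMo+FastText | HAN | 37.67 | 29.12 | 32.83±0.02 | 30.35 | 15.39 | 20.42±0.01 | 25.89 | 24.16 | 24.99±0.01 |
| Hierarchical + ELMo | ELMo+FastText | HCNN | 38.20 | 30.30 | 33.78±0.01 | 34.45 | 3.56 | 6.45±0.01 | 25.79 | 24.22 | 24.98±0.02 |
| Hierarchical + ELMo | ELMo+FastText | Hybrid | 38.16 | 31.88 | 34.72±0.01 | 29.61 | 11.63 | 16.70±0.02 | 26.80 | 30.66 | 28.60±0.02 |
| Hierarchical + BERT | BERT | HAN | 40.94 | 44.00 | 42.41±0.01 | 33.96 | 1.14 | 2.19±0.01 | 19.71 | 4.40 | 7.20±0.01 |
| Hierarchical + BERT | BERT | HCNN | 43.63 | 31.60 | 36.65±0.01 | 33.76 | 1.13 | 2.18±0.01 | 20.09 | 4.39 | 7.20±0.01 |
| Hierarchical + BERT | BERT | Hybrid | 41.53 | 46.45 | 43.85±0.01 | 33.60 | 1.12 | 2.16±0.01 | 19.78 | 4.34 | 7.12±0.01 |
| Hierarchical + BERT | BERT+w2v | HAN | 39.69 | 41.53 | 40.59±0.01 | 28.15 | 3.32 | 5.93±0.02 | 28.15 | 3.32 | 5.93±0.01 |
| Hierarchical + BERT | BERT+w2v | HCNN | 40.43 | 42.81 | 41.59±0.01 | 34.02 | 2.59 | 4.81±0.01 | 34.02 | 2.59 | 4.81±0.01 |
| Hierarchical + BERT | BERT+w2v | Hybrid | 40.20 | 42.47 | 41.30±0.01 | 31.86 | 4.02 | 7.13±0.03 | 31.86 | 4.02 | 7.13±0.02 |
| Hierarchical + BERT | BERT+FastText | HAN | 39.73 | 41.81 | 40.75±0.01 | 32.03 | 2.92 | 5.34±0.02 | 20.01 | 4.38 | 7.19±0.01 |
| Hierarchical + BERT | BERT+FastText | HCNN | 40.24 | 42.10 | 41.15±0.02 | 33.17 | 2.74 | 5.05±0.02 | 19.92 | 4.37 | 7.16±0.01 |
| Hierarchical + BERT | BERT+FastText | Hybrid | 40.37 | 42.32 | 41.32±0.01 | 36.33 | 2.47 | 4.61±0.01 | 19.70 | 4.34 | 7.11±0.01 |

Table 4. Sentence-level classification results from all models. Precision (P), Recall (R), and F1 scores (F) are reported in percentage. Variance (±) of F1 scores across all threads/news articles is also presented.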

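| Group | Embedding | Method | Trip advisor R-1 | Trip advisor R-2 | Trip advisor R-L | Reddit R-1 | Reddit R-2 | Reddit R-L | Newsroom R-1 | Newsroom R-2 | Newsroom R-L |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baselines | | ILP | 29.30 | 9.90 | 12.80 | 40.85 | 38.96 | 37.84 | 17.45 | 16.89 | 16.05 |
| Baselines | | Sum-Basic | 33.10 | 10.40 | 13.70 | 36.66 | 34.72 | 36.63 | 16.28 | 15.63 | 16.31 |
| Baselines | | KL-Sum | 35.50 | 12.30 | 13.40 | 46.87 | 45.23 | 46.78 | 21.10 | 20.53 | 20.96 |
| Baselines | | LSA | 34.20 | 14.50 | 13.60 | 46.85 | 45.24 | 46.88 | 25.29 | 24.80 | 24.41 |
| Baselines | | LexRank | 38.70 | 14.20 | 13.20 | 44.93 | 43.28 | 44.85 | 21.08 | 20.51 | 20.99 |
| Baselines | | MEAD | 38.50 | 15.40 | 22.00 | 44.70 | 46.94 | 47.57 | 20.97 | 22.24 | 22.26 |
| Baselines | | Opinosis | 0.62 | 0.10 | 0.99 | 1.33 | 0.24 | 1.67 | 2.22 | 0.95 | 2.24 |
| Baselines | | TextRank | - | - | - | - | - | - | 24.45 | 10.12 | 20.13 |
| Baselines | | SVM | 24.70 | 10.00 | 25.80 | 6.02 | 2.57 | 7.46 | 17.46 | 17.76 | 25.04 |
| Baselines | | LogReg | 29.40 | 7.80 | 10.30 | 0.73 | 0.34 | 0.92 | 7.21 | 7.35 | 11.23 |
| Baselines | | LSTM | 33.02 | 11.92 | 20.07 | 48.00 | 34.40 | 38.81 | 24.47 | 15.06 | 19.87 |
| Baselines | | CNN | 33.37 | 12.22 | 20.07 | 40.00 | 24.41 | 29.23 | 20.46 | 9.19 | 14.82 |
| Hierarchical + Static Embedding | w2v | HAN | 36.19 | 13.15 | 21.84 | 46.95 | 32.09 | 44.52 | 24.26 | 12.30 | 18.39 |
| Hierarchical + Static Embedding | w2v | HCNN | 36.34 | 12.74 | 21.60 | 46.71 | 31.63 | 44.61 | 23.68 | 11.35 | 17.25 |
| Hierarchical + Static Embedding | w2v | Hybrid | 38.13 | 15.51 | 32.01 | 54.67 | 42.84 | 53.34 | 25.56 | 12.41 | 24.11 |
| Hierarchical + Static Embedding | FastText | HAN | 37.23 | 13.29 | 22.05 | 46.62 | 31.73 | 44.17 | 24.38 | 12.54 | 18.46 |
| Hierarchical + Static Embedding | FastText | HCNN | 36.55 | 13.20 | 22.29 | 43.82 | 27.31 | 41.07 | 23.66 | 11.23 | 17.29 |
| Hierarchical + Static Embedding | FastText | Hybrid | 37.10 | 14.02 | 30.31 | 53.86 | 41.83 | 52.55 | 25.33 | 12.03 | 23.81 |
| Hierarchical + ELMo | ELMo | HAN | 35.86 | 13.94 | 30.00 | 41.62 | 26.00 | 30.57 | 25.01 | 12.88 | 18.88 |
| Hierarchical + ELMo | ELMo | HCNN | 36.29 | 14.34 | 30.75 | 41.88 | 26.40 | 30.73 | 25.14 | 12.95 | 18.53 |
| Hierarchical + ELMo | ELMo | Hybrid | 35.67 | 13.36 | 30.21 | 44.88 | 28.96 | 42.16 | 26.45 | 13.59 | 24.99 |
| Hierarchical + ELMo | ELMo+w2v | HAN | 35.80 | 13.83 | 28.62 | 40.60 | 24.82 | 30.40 | 24.32 | 12.06 | 17.69 |
| Hierarchical + ELMo | ELMo+w2v | HCNN | 35.90 | 13.78 | 29.23 | 40.85 | 25.72 | 30.95 | 23.77 | 11.60 | 17.60 |
| Hierarchical + ELMo | ELMo+w2v | Hybrid | 35.92 | 14.07 | 28.85 | 43.48 | 27.28 | 41.20 | 25.87 | 12.77 | 24.43 |
| Hierarchical + ELMo | ELMo+FastText | HAN | 36.26 | 14.43 | 30.05 | 41.68 | 25.52 | 30.46 | 23.43 | 11.38 | 17.35 |
| Hierarchical + ELMo | ELMo+FastText | HCNN | 35.81 | 13.88 | 30.41 | 41.63 | 26.27 | 30.85 | 23.67 | 11.64 | 17.21 |
| Hierarchical + ELMo | ELMo+FastText | Hybrid | 37.17 | 15.19 | 30.64 | 44.17 | 27.90 | 41.48 | 25.69 | 12.54 | 24.27 |
| Hierarchical + BERT | BERT | HAN | 35.02 | 10.84 | 19.50 | 40.24 | 23.64 | 28.78 | 25.88 | 14.23 | 19.69 |
| Hierarchical + BERT | BERT | HCNN | 34.78 | 10.84 | 19.71 | 40.08 | 23.42 | 28.51 | 26.06 | 14.37 | 19.83 |
| Hierarchical + BERT | BERT | Hybrid | 36.13 | 11.96 | 29.28 | 42.08 | 25.23 | 39.63 | 27.51 | 15.33 | 26.58 |
| Hierarchical + BERT | BERT+w2v | HAN | 30.58 | 9.24 | 19.15 | 40.65 | 24.50 | 29.20 | 25.93 | 14.29 | 19.51 |
| Hierarchical + BERT | BERT+w2v | HCNN | 30.72 | 9.25 | 18.75 | 40.52 | 24.30 | 28.89 | 25.56 | 13.92 | 19.58 |
| Hierarchical + BERT | BERT+w2v | Hybrid | 33.21 | 11.24 | 29.19 | 42.51 | 26.15 | 40.21 | 26.16 | 14.44 | 20.08 |
| Hierarchical + BERT | BERT+FastText | HAN | 33.32 | 10.24 | 19.29 | 45.47 | 31.13 | 35.55 | 25.64 | 14.34 | 19.76 |
| Hierarchical + BERT | BERT+FastText | HCNN | 34.04 | 10.65 | 19.44 | 45.17 | 30.55 | 35.18 | 25.34 | 13.91 | 19.31 |
| Hierarchical + BERT | BERT+FastText | Hybrid | 35.14 | 12.03 | 29.23 | 47.66 | 32.88 | 45.48 | 27.43 | 15.25 | 26.45 |

Table 5. Summarization results from all models. F1 scores are reported in percentage for ROUGE-1, ROUGE-2, and ROUGE-L, respectively.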
Figure 4. Comparison of F1-scores among Hierarchical Methods based on ROUGE scores. Panels (a)-(c) show R-1 scores, panels (d)-(f) show R-2 scores, and panels (g)-(i) show R-L scores, each for Trip advisor, Reddit, and Newsroom in that order. x-axis denotes types of embeddings; 1=w2v, 2=FastText, 3=ELMo, 4=ELMo+w2v, 5=ELMo+FastText, 6=BERT, 7=BERT+w2v, 8=BERT+FastText. y-axis denotes F1 scores normalized between [0,1]. Blue bars represent HAN, orange bars represent HCNN, and gray bars represent the Hybrid model.
Figure 5. Example of output summaries generated by each hierarchical model. Presented in Bold are correctly labelled sentences. Accuracy value (%) is computed as a ratio of number of correctly labelled sentences out of total sentences selected by the model.
Figure 6. Plots showing the convergence of training loss per fold; panels (a)-(f) correspond to Folds 1-6. The results from the first 20 epochs are displayed since all models plateau from this point forward.

5.4. Ablation Study

In this subsection, we discuss our observations from extensive ablation experiments conducted to better understand our model from various aspects.

5.4.1. Model Component Analysis

Comprehensive component analysis is performed by adding different components on top of baseline methods as presented in Table 6. The results reveal that a model equipped with either CNN or LSTM alone performs poorly across all datasets. This indicates that leveraging the hierarchical structure of an input document to generate a document representation helps boost the performance. Specifically, the hierarchical structure captures information at both word and sentence levels: word-level representation is learned and subsequently aggregated to form a sentence; likewise, sentence-level representation is learned and subsequently aggregated to form the document representation. Among all hierarchical-based models, the ROUGE-L scores of HAN and HCNN are comparable, whereas the union of both yields an evident performance gain. The shift in improvement is also noticeable (>1%) for the Reddit and Newsroom datasets, both of which are larger in size than Trip advisor.

According to Table 6, when comparing the ROUGE-L F1 of the proposed hybrid model against that of baseline-LSTM and baseline-CNN, the overall improvement is 11.94 percentage points over both baselines for the Trip advisor dataset, 14.53 and 24.11 points for the Reddit dataset, and 4.24 and 9.29 points for the Newsroom dataset, respectively.

| Model | Trip advisor | Reddit | Newsroom |
|---|---|---|---|
| Baseline LSTM | 20.07 | 38.81 | 19.87 |
| Baseline CNN | 20.07 | 29.23 | 14.82 |
| HAN | 21.84 | 44.52 | 18.39 |
| HCNN | 21.60 | 44.61 | 17.25 |
| Hybrid | 32.01 | 53.34 | 24.11 |
| Overall Improvement (vs. Baseline LSTM) | +11.94 | +14.53 | +4.24 |
| Overall Improvement (vs. Baseline CNN) | +11.94 | +24.11 | +9.29 |

Table 6. Ablation study to investigate the effect of each component in the hierarchical-based models. F1 scores of ROUGE-L are compared (unit in percentage). The two overall-improvement rows give the Hybrid model performance gain compared to Baseline LSTM and Baseline CNN, respectively.
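The overall-improvement figures in Table 6 are plain differences in ROUGE-L F1 and can be re-derived from the table. The snippet below is a small check of that arithmetic; the dictionary layout and the helper `gain` are illustrative names only.

```python
# ROUGE-L F1 scores (in %) copied from Table 6.
ROUGE_L = {
    "Baseline LSTM": {"Trip advisor": 20.07, "Reddit": 38.81, "Newsroom": 19.87},
    "Baseline CNN": {"Trip advisor": 20.07, "Reddit": 29.23, "Newsroom": 14.82},
    "Hybrid": {"Trip advisor": 32.01, "Reddit": 53.34, "Newsroom": 24.11},
}


def gain(baseline, dataset):
    """Absolute gain of the Hybrid model over a baseline, in percentage points."""
    return round(ROUGE_L["Hybrid"][dataset] - ROUGE_L[baseline][dataset], 2)


for baseline in ("Baseline LSTM", "Baseline CNN"):
    print(baseline, {d: gain(baseline, d) for d in ROUGE_L["Hybrid"]})
```

The gains are absolute differences in percentage points rather than relative improvements.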

5.4.2. Effect of CNN configurations

Tables 7 and 8 show the model performance for different receptive field sizes and different numbers of convolutional layers, while other parameters remain fixed. It is important to note that when multiple receptive field sizes are used, such as [2,3], the outputs obtained from each local feature need to be concatenated first to yield a representation that will be used by the layer following the CNN. We observe that a receptive field size of 2 outperforms the others across all datasets in both sentence classification and ROUGE evaluation. For the Reddit dataset, in particular, the decrease in performance with other sizes is notable. Regarding the number of convolutional layers, we observe that increasing the number of convolutional layers leads to a drop in overall performance. With 6 layers (the highest number), the classification performance, as well as the output summary quality, is the lowest. The CNN part of our Hybrid model, therefore, applies a receptive field size of 2 and a single convolutional layer.

| Size | Trip advisor SL | Trip advisor R-1 | Trip advisor R-2 | Trip advisor R-L | Reddit SL | Reddit R-1 | Reddit R-2 | Reddit R-L | Newsroom SL | Newsroom R-1 | Newsroom R-2 | Newsroom R-L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 36.12 | 38.13 | 15.51 | 32.01 | 29.12 | 54.67 | 42.84 | 53.34 | 26.01 | 25.56 | 12.41 | 24.11 |
| [2,3] | -2.41 | -2.44 | -1.45 | -1.65 | -8.33 | -10.51 | -14.72 | -11.68 | -2.00 | -2.90 | -1.69 | -2.32 |
| [2,3,4] | -2.64 | -1.97 | -0.97 | -1.23 | -12.33 | -11.03 | -15.06 | -11.86 | -1.15 | -2.76 | -1.56 | -2.24 |
| [2,3,4,5] | -2.91 | -2.37 | -1.44 | -1.60 | -7.84 | -10.56 | -14.71 | -11.55 | -0.66 | -3.17 | -1.88 | -2.48 |

Table 7. Ablation study to investigate the effect of receptive field size on the overall performance. F1 scores are reported for both sentence-level classification (SL) and ROUGE evaluation (R-1, R-2, R-L). The row for size 2 holds the best values (in %). The other rows present the loss compared to the best values, also in %.
| Depth | Trip advisor SL | Trip advisor R-1 | Trip advisor R-2 | Trip advisor R-L | Reddit SL | Reddit R-1 | Reddit R-2 | Reddit R-L | Newsroom SL | Newsroom R-1 | Newsroom R-2 | Newsroom R-L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 36.12 | 38.13 | 15.51 | 32.01 | 29.12 | 54.67 | 42.84 | 53.34 | 26.01 | 25.56 | 12.41 | 24.11 |
| 2 | -2.02 | -2.11 | -1.11 | -1.17 | -2.01 | -11.07 | -15.59 | -12.17 | -0.97 | -2.98 | -1.72 | -2.49 |
| 3 | -1.28 | -2.36 | -1.39 | -1.57 | -10.33 | -10.68 | -14.69 | -11.68 | -2.30 | -3.15 | -2.13 | -2.70 |
| 4 | -2.03 | -2.30 | -1.10 | -1.42 | -12.04 | -10.86 | -14.87 | -11.72 | -1.64 | -3.04 | -1.87 | -2.39 |
| 5 | -1.80 | -2.48 | -1.51 | -1.75 | -9.97 | -10.82 | -15.08 | -11.80 | -0.94 | -2.84 | -1.66 | -2.31 |
| 6 | -2.34 | -2.14 | -1.10 | -1.15 | -10.55 | -11.27 | -15.39 | -12.23 | -3.07 | -3.37 | -2.29 | -2.80 |

Table 8. Ablation study to investigate the effect of the number of convolutional layer(s) on the overall performance. F1 scores are reported for both sentence-level classification (SL) and ROUGE evaluation (R-1, R-2, R-L). The row for depth 1 holds the best values (in %). The other rows present the loss compared to the best values, also in %.

5.4.3. Representation Learning

In this section, we discuss the impact of different embeddings on the hierarchical models. We report the effect of using static word embeddings versus contextual embeddings. We also concatenate static and contextual embeddings to examine their joint effect on the performance.

We investigate the outputs from the model initialized with only static word embeddings. The results for all datasets in Table 4 show that word2vec is mostly superior to FastText in classifying sentence labels as well as in terms of ROUGE evaluation. With respect to contextual representations, for the Trip advisor dataset, the results show that BERT embeddings yield better classification performance in terms of sentence-level scores than ELMo. For the remaining datasets, however, the sentence-level results mostly reveal a significant drop in recall, and thus a drop in F1-scores as a consequence. As mentioned in Section 5.3, this indicates that the system determines few sentences as summary-worthy; yet, most of these sentences are correctly labelled. From Table 5, the ROUGE evaluation shows that, in general, using static word embeddings achieves better performance.

In addition, inspired by the study of Peters et al. (2018), we concatenated both static and contextual representations at the sentence-level encoders. In terms of sentence-level scores, the results generally show that the concatenation does not significantly affect the performance, except for ELMo+w2v and ELMo+FastText on the Reddit dataset, for which a significant improvement can be noticed in the HAN and Hybrid models. With respect to ROUGE evaluation, the results obtained from the concatenated representation also do not reflect a significant improvement.

5.4.4. Effect of Attention Mechanism on Selecting Salient Units

We investigate the attention layer to validate whether the attention mechanism aids in selecting representative units. Table 9 shows that, with respect to ROUGE evaluation, when the attention mechanism is incorporated in the model, the performance is improved across all datasets. In particular, for the larger datasets (Reddit and Newsroom), the difference in results between with and without attention is nontrivial. In terms of sentence-level classification, incorporating the attention mechanism does not significantly affect the performance.

It is important to note that the proposed hybrid model attends to important bigrams at the word level and contiguous sentence pairs at the sentence level. At the word level, the attention value of each bigram influences the sentence vector to which the bigram belongs. The attention weight is computed according to the relevance of each bigram, given the sentence context. If a sentence contains many bigrams with high attention values, its corresponding sentence vector will potentially contain information about these prominent bigrams. At the sentence level, likewise, the attention values of the sentence pairs influence the resulting thread vector. A high attention value of a sentence pair indicates its importance and relevance to the thread's key concept. This attention-weighted sentence pair goes through softmax normalization, from which the output indicates how likely a sentence pair is to be a key unit for the summary.
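The weighting step can be illustrated with a small sketch. It assumes that a relevance score has already been computed for each unit (a bigram feature at the word level, or a sentence-pair feature at the sentence level), and it only shows the softmax normalization and the weighted sum. The functions `softmax` and `attend` are illustrative names, not the model's actual layers.

```python
import math


def softmax(scores):
    """Normalize raw relevance scores into weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, scores):
    """Attention-weighted sum of feature vectors.

    features: one vector per unit (e.g. CNN features of each bigram).
    scores:   one raw relevance score per unit (assumed precomputed).
    """
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, pooled
```

Units with higher scores contribute more to the pooled vector, which is how prominent bigrams come to dominate the encoded sentence in this simplified picture.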

Figure 7 illustrates a visualization of words in the example summary. A bigram with a high attention weight is highlighted with a darker shade than bigrams with lower attention. The sentence “I am glad you are so mellow and think that it might be difficult filling up the morning before you get married at noon!!” contains two bigrams, namely “glad you” and “are so”, which have attention weights of 0.782 and 0.555, respectively. The sentence encoder outputs a weighted sum of the bigrams using normalized attention as weight, and the two aforementioned bigrams are represented the most in the encoded sentence. Later, in the thread encoder, this sentence is also shown to be in one of the sentence pairs ranked highest by attention weights. Finally, in the final summary, it can be noticed that the majority of sentences (italicized) are those belonging to the example sentence pairs with highly ranked weights. Nevertheless, we emphasize that a sentence in a sentence pair with high attention will not necessarily be selected into the final summary. High attention weights only indicate the significance of the constituent unit. In other words, whether or not a sentence is chosen for the summary is determined by the output layer, which considers both sentence and thread representations concatenated together. However, it is observed that when sentences belong to sentence pairs with high attention weights, they have a higher chance of being selected into the final thread summary.

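| Dataset | Sentence-level (with) | Sentence-level (without) | R-1 (with) | R-1 (without) | R-2 (with) | R-2 (without) | R-L (with) | R-L (without) |
|---|---|---|---|---|---|---|---|---|
| Trip advisor | 36.12 | 36.13 | 38.13 | 35.74 | 15.51 | 14.4 | 32.01 | 30.75 |
| Reddit | 29.12 | 30.33 | 54.67 | 43.55 | 42.84 | 27.30 | 53.34 | 41.00 |
| Newsroom | 26.01 | 24.14 | 25.56 | 19.73 | 12.41 | 9.20 | 24.11 | 14.65 |

Table 9. Ablation study to investigate the effect of the attention mechanism. The results are obtained from the Hybrid model. "with" means the attention mechanism is applied in the model, whereas "without" is the opposite case.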
Figure 7. Visualization of the generated summary for a forum thread. Top: Relative attention weights for each bigram by the hybrid model, over an entire thread. Bigrams with a darker highlight present higher importance. The attention values of all bigrams were obtained from word-level attention layer. Bottom: The first row presents a list of top 25 bigrams that are ranked according to their attention values, formatted as a (bigram, attention weights) tuple. The second row presents a list of sentence pairs ranked by attention weights, where the highest weight is 0.527. The bigrams in bold and underlined are those with highest attention weights. The third row presents the final summary which lists all chronologically-ordered extracted sentences. The sentences in italic are those in the top 6 sentence pairs with highest attention weights.

6. Conclusions

In this study, we present a framework based on hierarchical attention networks to extractively summarize online forum threads. Our proposed networks unify two deep neural networks, namely Bi-LSTM and CNN, to obtain representations that are used to classify whether or not the sentence is summary-worthy. Since the proposed approach can be framed as multi-document summarization, we also evaluate the proposed approach on news domain dataset, in addition to online forums. The experimental results on three real-life datasets have demonstrated that the proposed model outperforms the majority of baseline methods.

Our findings confirm the initial hypothesis that the capability of encoders can be enhanced through the unified architecture. In essence, Bi-LSTM serves a role to capture contextual information, whereas CNN helps to signify prominent units that are keys pertaining to a summary. Together, the strength of both deep neural networks have been leveraged to achieve effective representations.

Finally, we have conducted extensive experiments to investigate the effect of the attention mechanism and pretrained embeddings. The results show that applying attention to the high-level features extracted and compressed by CNN, together with the contextual embeddings, provides a promising avenue towards improving extractive summarization performance.
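To make the idea of attention over CNN features concrete, the following is a minimal sketch in plain Python, not the implementation used in this study. It assumes random vectors as a stand-in for the Bi-LSTM outputs, a window of two positions (echoing the bigrams in Figure 7), a ReLU activation, and additive attention in the style of hierarchical attention networks. All function names, dimensions, and weights (`conv1d`, `attention_pool`, `demo`) are hypothetical and chosen only for illustration.

```python
import math
import random


def conv1d(seq, filters, bias, window=2):
    """Slide a window over seq and apply each filter with a ReLU (assumed).

    seq:     list of T vectors, each of length d
    filters: list of F filters, each a flat list of length window * d
    Returns T - window + 1 feature vectors, each of length F.
    """
    out = []
    for t in range(len(seq) - window + 1):
        # Concatenate the vectors inside the current window.
        patch = [x for vec in seq[t:t + window] for x in vec]
        out.append([max(0.0, sum(w * x for w, x in zip(f, patch)) + b)
                    for f, b in zip(filters, bias)])
    return out


def attention_pool(features, W, b, context):
    """Additive attention over a list of feature vectors.

    u_t = tanh(W f_t + b); alpha = softmax(u_t . context);
    the pooled vector is the alpha-weighted sum of the features.
    """
    scores = []
    for f in features:
        u = [math.tanh(sum(w * x for w, x in zip(row, f)) + bi)
             for row, bi in zip(W, b)]
        scores.append(sum(ui * ci for ui, ci in zip(u, context)))
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(features[0])
    pooled = [sum(a * f[j] for a, f in zip(alphas, features))
              for j in range(dim)]
    return pooled, alphas


def demo(seed=0):
    """Run the sketch on random inputs (hypothetical sizes and weights)."""
    rng = random.Random(seed)
    T, d, F, A = 6, 4, 3, 5  # positions, context dim, filters, attention dim

    def rand(n):
        return [rng.uniform(-1, 1) for _ in range(n)]

    context_vectors = [rand(d) for _ in range(T)]  # stand-in for Bi-LSTM outputs
    filters = [rand(2 * d) for _ in range(F)]
    features = conv1d(context_vectors, filters, bias=rand(F), window=2)
    W = [rand(F) for _ in range(A)]
    pooled, alphas = attention_pool(features, W, rand(A), rand(A))
    return features, pooled, alphas
```

Calling `demo()` returns the convolved features, the pooled representation, and one attention weight per window. The weights sum to one, so they can be read as the relative importance of each window, which is the kind of quantity visualized in Figure 7.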

7. Acknowledgements

This work was supported by Crystal Photonics, Inc. (Grant 1063271).

