Efficient and Effective Single-Document Summarizations and A Word-Embedding Measurement of Quality

10/01/2017 · Liqun Shao et al. (UMass Lowell)

Our task is to generate an effective summary for a given document with specific realtime requirements. We use the softplus function to enhance keyword rankings to favor important sentences, based on which we present a number of summarization algorithms using various keyword extraction and topic clustering methods. We show that our algorithms meet the realtime requirements and yield the best ROUGE recall scores on DUC-02 over all previously-known algorithms. To evaluate the quality of summaries without human-generated benchmarks, we define a measure called WESM based on word embedding using Word Mover's Distance. We show that the orderings of the ROUGE and WESM scores of our algorithms are highly comparable, suggesting that WESM may serve as a viable alternative for measuring the quality of a summary.


1 Introduction

Text summarization algorithms have been studied intensively and extensively. An effective summary must be human-readable and convey the central meanings of the original document within a given length boundary. The common approach of unsupervised summarization algorithms extracts sentences based on keyword importance rankings (e.g., see [DUC, 2002, Mihalcea and Tarau, 2004, Rose et al., 2010, Lin and Bilmes, 2011, Parveen et al., 2015]), where a keyword may also be a phrase. A sentence with a larger number of keywords of higher ranking scores is considered more important for extraction. Supervised algorithms include CNN and RNN models for generating extractive and abstractive summaries (e.g., see [Rush et al., 2015, Nallapati et al., 2016, Cheng and Lapata, 2016]).

We were asked to construct a general-purpose text-automation tool to produce, among other things, an effective summary for a given document with the following realtime requirements: Generate a summary instantly for a document of up to 2,000 words, under 1 second for a document of slightly over 5,000 words, and under 3 seconds for a very long document of around 10,000 words. Moreover, we need to deal with documents of arbitrary topics without knowing what the topics are in advance. After investigating all existing summarization algorithms, we conclude that unsupervised single-document summarization algorithms would be the best approach to meeting our requirements.

We use topic clusterings to obtain a good topic coverage in the summary when extracting key sentences. In particular, we first determine which topic a sentence belongs to, and then extract key sentences to cover as many topics as possible within the given length boundary.

Human judgement is the best evaluation of the quality of a summarization algorithm. It is a standard practice to run an algorithm over DUC data and compute the ROUGE recall scores with a set of DUC benchmarks, which are human-generated summaries for articles of a moderate size. DUC-02 [DUC, 2002], in particular, is a small set of benchmarks for single-document summarizations. When dealing with a large number of documents of unknown topics and various sizes, human judgement may be impractical, and so we would like to have an alternative mechanism of measurement without human involvement. Ideally, this mechanism should preserve the same ordering as ROUGE over DUC data; namely, if s1 and s2 are two summaries of the same DUC document produced by two algorithms, and the ROUGE score of s1 is higher than that of s2, then it should also be the case under the new measure.

Louis and Nenkova [Louis and Nenkova, 2009] devised an unsupervised method to evaluate summarization without human models using common similarity measures: Kullback-Leibler divergence, Jensen-Shannon divergence, and cosine similarity. These measures, as well as the information-theoretic similarity measure of [Aslam and Frost, 2003], are meant to measure lexical similarities, which makes them unsuitable for measuring semantic similarities.

Word embeddings such as Word2Vec can be used to fill this void, and we devise WESM (Word-Embedding Similarity Measure) based on Word Mover's Distance (WMD) [Kusner et al., 2015] to measure the word-embedding similarity of the summary and the original document. WESM is meant to evaluate summaries for new datasets when no human-generated benchmarks are available. WESM has the advantage that it can measure the semantic similarity of documents. We show that WESM correlates well with ROUGE on DUC-02. Thus, WESM may be used as an alternative summarization evaluation method when benchmarks are unavailable.

The major contributions of this paper are summarized below:

  1. We present a number of summarization algorithms using topic clustering methods and enhanced keyword rankings by the softplus function, and show that they meet the realtime requirements and outperform all the previously-known summarization algorithms under the ROUGE measures over DUC-02.

  2. We propose a new mechanism WESM as an alternative measurement of summary quality when human-generated benchmarks are unavailable.

The rest of the paper is organized as follows: We survey in Section 2 unsupervised single-document summarization algorithms. We present in Section 3 the details of our summarization algorithms and describe WESM in Section 4. We report the results of extensive experiments in Section 5 and conclude the paper in Section 6.

2 Early Work

Early work on single-document summarizations can be described in the following three categories: keyword extractions, coverage and diversity optimizations, and topic clusterings.

2.1 Keyword extractions

To identify keywords in a document over a corpus of documents, the measure of term-frequency-inverse-document-frequency (TF-IDF) [Salton and Buckley, 1987] is often used. When document corpora are unavailable, the measure of word co-occurrences (WCO) can produce a comparable performance to TF-IDF over a large corpus of documents [Matsuo and Ishizuka, 2003]. The methods of TextRank [Mihalcea and Tarau, 2004] and RAKE (Rapid Automatic Keyword Extraction) [Rose et al., 2010] further refine the WCO method from different perspectives, which are also sufficiently fast to become candidates for meeting the realtime requirements.

TextRank computes the rank of a word in an undirected, weighted word-graph using a slightly modified PageRank algorithm [Brin and Page, 1998]. To construct a word-graph for a given document, first remove stop words and represent each remaining word as a node, then link two words if they both appear in a sliding window of a small size. Finally, assign the number of co-occurrences of the endpoints of an edge as a weight to the edge.
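
The following is a minimal sketch of this construction, assuming a networkx backend, a toy stop-word list, and the common damping factor of 0.85; it is an illustration, not the cited package we actually use (see Section 5).

```python
# Sketch of TextRank keyword scoring: build a word co-occurrence graph and
# run PageRank over it. Stop-word list and window size are toy assumptions.
import itertools
import networkx as nx

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "with", "on"}

def textrank_keywords(tokens, window=2):
    words = [t.lower() for t in tokens if t.lower() not in STOP_WORDS]
    graph = nx.Graph()
    graph.add_nodes_from(words)
    # Link two words if they co-occur in a sliding window; the number of
    # co-occurrences becomes the edge weight.
    for i in range(max(len(words) - window + 1, 0)):
        for u, v in itertools.combinations(words[i:i + window], 2):
            if u != v:
                weight = graph.get_edge_data(u, v, {"weight": 0})["weight"]
                graph.add_edge(u, v, weight=weight + 1)
    return nx.pagerank(graph, alpha=0.85, weight="weight")

scores = textrank_keywords(
    "Hurricane Gilbert swept toward Jamaica with hundred mile an hour winds".split())
```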

RAKE first removes stop words using a stoplist, and then generates words (including phrases) using a set of word delimiters and a set of phrase delimiters. For each remaining word w, the degree of w is the frequency of w plus the number of co-occurrences of consecutive word pairs (w, w') and (w', w) in the document, where w' is also a remaining word. The score of w is the degree of w divided by the frequency of w. We note that the quality of RAKE also depends on a properly-chosen stoplist, which is language dependent.
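
The following sketch illustrates RAKE-style word scoring (degree divided by frequency), assuming whitespace tokenization and a toy stoplist; full RAKE also uses phrase delimiters and scores multi-word phrases.

```python
# Sketch of RAKE word scoring deg(w)/freq(w).
from collections import defaultdict

STOPLIST = {"the", "a", "an", "and", "of", "to", "as", "was", "it"}

def rake_word_scores(text):
    # Stop words act as delimiters that split the text into candidate phrases.
    phrases, phrase = [], []
    for token in text.lower().split():
        if token in STOPLIST:
            if phrase:
                phrases.append(phrase)
            phrase = []
        else:
            phrase.append(token)
    if phrase:
        phrases.append(phrase)
    freq, degree = defaultdict(int), defaultdict(int)
    for p in phrases:
        for w in p:
            freq[w] += 1
            degree[w] += len(p)  # frequency plus co-occurrences within the phrase
    return {w: degree[w] / freq[w] for w in freq}
```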

2.2 Coverage and diversity optimization

The general framework of selecting sentences gives rise to optimization problems with objective functions being monotone submodular [Lin and Bilmes, 2011] to promote coverage and diversity. Among them is an objective function of the form f(S) = L(S) + λR(S) with a summary S and a coefficient λ ≥ 0, where L(S) measures the coverage of the summary and R(S) rewards diversity. We use SubmodularF to denote the algorithm computing this objective function. SubmodularF uses TF-IDF values of words in sentences to compute the cosine similarity of two sentences. While it is NP-hard to maximize a submodular objective function subject to a summary length constraint, the submodularity allows a greedy approximation with a proven constant approximation ratio [Lin and Bilmes, 2011].
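
The following sketch illustrates the greedy selection for an objective of this form; cover() and diversity() are placeholders for the coverage and diversity terms, and the plain feasibility check is a simplification of the cost-scaled greedy analyzed in [Lin and Bilmes, 2011].

```python
# Sketch of greedy maximization of f(S) = L(S) + lambda * R(S) under a
# character budget; cover() and diversity() are caller-supplied placeholders.
def greedy_summary(sentences, budget, cover, diversity, lam=0.5):
    def f(chosen):
        return cover(chosen) + lam * diversity(chosen)
    summary, remaining = [], list(sentences)
    while remaining:
        # Take the sentence with the largest marginal gain in f.
        best = max(remaining, key=lambda s: f(summary + [s]) - f(summary))
        remaining.remove(best)
        if sum(len(s) for s in summary) + len(best) <= budget:
            summary.append(best)
    return summary
```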

SubmodularF needs labeled data to train the parameters in the objective function to achieve a better summary, and it is intended to work on multiple-document summarizations. While it is possible to run it on a single document without a corpus, we note that the greedy algorithm has at least quadratic-time complexity and produces a summary with low ROUGE scores over DUC-02 (see Section 2.4), and so it would not be a good candidate to meet our needs. This also applies to a generalized objective function consisting of a submodular component and a non-submodular component [Dasgupta et al., 2013].

2.3 Topic clusterings

Two unsupervised approaches to topic clusterings for a given document have been investigated. One is TextTiling [Hearst, 1997] and the other is LDA (Latent Dirichlet Allocation) [Blei et al., 2003]. TextTiling represents a topic as a set of consecutive paragraphs in the document. It merges adjacent paragraphs that belong to the same topic, identifying major topic shifts based on patterns of lexical co-occurrences and distributions. LDA computes for each word a distribution under a pre-determined number of topics. LDA is a computation-heavy algorithm that incurs a runtime too high to meet our realtime requirements. TextTiling runs in near-linear time, which meets the requirements of efficiency.

2.4 Other algorithms

Following the general framework of selecting sentences to meet the requirements of topic coverage and diversity, a number of unsupervised single-document summarization algorithms have been devised. The most notable is the ILP-based algorithm of [Parveen et al., 2016], which produces the best ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-SU4 (R-SU4) scores on DUC-02 among all earlier algorithms, including Lead [Parveen et al., 2015], DUC-02 Best, TextRank, LREG [Cheng and Lapata, 2016], Mead [Radev et al., 2004], the system of [Woodsend and Lapata, 2010], URANK [Wan, 2010], UniformLink [Wan and Xiao, 2010], Egraph + Coherence [Parveen and Strube, 2015], Tgraph + Coherence (Topical Coherence for Graph-based Extractive Summarization) [Parveen et al., 2015], NN-SE [Cheng and Lapata, 2016], and SubmodularF.

The algorithm of [Parveen et al., 2016] maximizes importance, non-redundancy, and pattern-based coherence of sentences to generate a coherent summary using integer linear programming (ILP). It computes the ranks of selected sentences for the summary by the Hubs and Authorities algorithm (HITS) [Kleinberg, 1999], and ensures that each selected sentence has unique information. It then uses mined patterns to extract sentences if the connectivity among nodes in the projection graph matches the connectivity among nodes in a coherence pattern. Because of space limitations, we omit the descriptions of the other algorithms.

Table 1 shows the comparison results, where the results for SubmodularF are obtained using the best parameters trained on DUC-03 [Lin and Bilmes, 2011]. Thus, to demonstrate the effectiveness of our algorithms, we will compare them only with the algorithm of [Parveen et al., 2016] over DUC-02.

Methods R-1 R-2 R-SU4
Lead 45.9 18.0 20.1
DUC 2002 Best 48.0 22.8 –
TextRank 47.0 19.5 21.7
LREG 43.8 20.7 –
Mead 44.5 20.0 21.0
Woodsend and Lapata (2010) 45.4 21.3 –
URANK 48.5 21.5 –
UniformLink 47.1 20.1 –
Egraph + Coh. 48.5 23.0 25.3
Tgraph + Coh. 48.1 24.3 24.2
NN-SE 47.4 23.0 –
SubmodularF 39.6 16.9 17.8
Parveen et al. (2016) 49.0 24.7 25.8
Table 1: ROUGE scores (%) on DUC-02 data; "–" marks scores not reported.

Solving ILP, however, is time-consuming even on documents of a moderate size, for ILP is NP-hard. Thus, the algorithm of [Parveen et al., 2016] does not meet the requirements of time efficiency, and we need to investigate new methods.

3 Our Methods

We use TextRank and RAKE to obtain initial ranking scores of keywords, and use the softplus function [Glorot et al., 2011]

sp(x) = ln(1 + e^x)    (1)

to enhance keyword rankings to favor sentences that are more important.

3.1 Softplus ranking

Assume that after filtering, a sentence s consists of keywords w_1, ..., w_k, and each w_i has a ranking score r(w_i) produced by TextRank or RAKE. Following Shao and Wang [Shao and Wang, 2016], we use their central sentence extraction algorithm for ranking sentences by importance. We can rank s using one of the following two methods:

r_d(s) = Σ_{i=1}^{k} r(w_i),    (2)

r_e(s) = Σ_{i=1}^{k} sp(r(w_i)).    (3)

Let DTRank (Direct TextRank) and ETRank (Enhanced TextRank) denote the methods of ranking sentences using, respectively, r_d and r_e over TextRank keyword rankings, and DRAKE (Direct RAKE) and ERAKE (Enhanced RAKE) denote the methods of ranking sentences using, respectively, r_d and r_e over RAKE keyword rankings.

The softplus function is helpful because when x is a small positive number, sp(x) increases the value of x significantly (see Figure 1), and when x is large, sp(x) ≈ x.

Figure 1: The softplus function sp(x) = ln(1 + e^x).

In particular, given two sentences s1 and s2, suppose that s1 has a few keywords with high rankings and the rest of its keywords with low rankings, while s2 has medium rankings for almost all of its keywords. In this case, we would consider s1 more important than s2. However, we may end up with r_d(s1) < r_d(s2). To illustrate this with a numerical example, assume that s1 and s2 each consist of 5 keywords, with original scores (sc) and softplus scores (sp) given in Table 2.

Rank 1 2 3 4 5 Sum
s1 sc 2.6 2.2 2.1 0.3 0.2 7.4
s1 sp 2.67 2.31 2.22 0.85 0.80 8.84
s2 sc 1.6 1.5 1.5 1.5 1.4 7.5
s2 sp 1.78 1.70 1.70 1.70 1.62 8.51
Table 2: Numerical examples with given sc and sp scores.

Sentence s1 is more important than s2 because it contains three keywords with much higher ranking scores than those of s2. However, s2 would be selected without softplus, since r_d(s1) = 7.4 < 7.5 = r_d(s2). After applying softplus, s1 is selected, as it should be, since r_e(s1) = 8.84 > 8.51 = r_e(s2).

For a real-life example, consider the following two sentences, s1 and s2, from an article in DUC-02:

  • s1: Hurricane Gilbert swept toward Jamaica yesterday with 100-mile-an-hour winds, and officials issued warnings to residents on the southern coasts of the Dominican Republic, Haiti and Cuba.

  • s2: Forecasters said the hurricane was gaining strength as it passed over the ocean and would dump heavy rain on the Dominican Republic and Haiti as it moved south of Hispaniola, the Caribbean island they share, and headed west.

We consider s1 more important as it specifies the name, strength, and direction of the hurricane, the places affected, and the official warnings. Using TextRank to compute keyword scores, we have r_d(s1) < r_d(s2), which selects the less important sentence s2. After computing softplus, we have r_e(s1) > r_e(s2); the more important sentence s1 is selected.

Note that not just any exponential-type function would do the trick. What we want is a function that returns roughly the same value as the input when the input is large, and a significantly larger value than the input when the input is much less than 1. The softplus function meets this requirement.
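
The following sketch illustrates the direct and enhanced rankings of Eqs. (2) and (3), and verifies on the Table 2 scores that softplus flips the ordering; keyword_scores stands in for TextRank or RAKE output.

```python
# Sketch of r_d and r_e sentence ranking with the softplus transform.
import math

def softplus(x):
    return math.log(1.0 + math.exp(x))

def rank_sentence(keywords, keyword_scores, enhanced=True):
    transform = softplus if enhanced else (lambda x: x)
    return sum(transform(keyword_scores.get(w, 0.0)) for w in keywords)

# The Table 2 scores: softplus flips the ordering in favor of s1.
s1 = [2.6, 2.2, 2.1, 0.3, 0.2]
s2 = [1.6, 1.5, 1.5, 1.5, 1.4]
assert sum(s1) < sum(s2)                                # r_d picks s2 (7.4 < 7.5)
assert sum(map(softplus, s1)) > sum(map(softplus, s2))  # r_e picks s1
```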

3.2 Topic clustering schemes

We consider four topic clustering schemes: TCS, TCP, TCTT, and TCLDA.

  1. TCS selects sentences without checking topics.

  2. TCP treats each paragraph as a separate topic.

  3. TCTT partitions a document into a set of multi-paragraph segments using TextTiling.

  4. TCLDA computes a topic distribution for each word using LDA. We set the number of topics T from 5 to 8 depending on the length of the document. Assume that a document contains topics t_1, ..., t_T, and that topic t_j consists of words w_{j,1}, ..., w_{j,m_j}, where each word w_{j,l} has a probability p(w_{j,l} | t_j) (1 ≤ l ≤ m_j). For a document with sentences s_1, ..., s_n, we use the following maximization to determine which topic sentence s_i belongs to (1 ≤ i ≤ n):

    topic(s_i) = argmax_{1 ≤ j ≤ T} Σ_{w ∈ s_i ∩ t_j} p(w | t_j).    (4)

    A sketch of this assignment appears right after this list.
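
The following is a minimal sketch of the TCLDA assignment of Eq. (4) using gensim's LDA; the whitespace tokenization and the fixed topic count are simplifying assumptions.

```python
# Sketch of TCLDA: assign each sentence to the topic maximizing the summed
# topic-word probabilities of its words, as in Eq. (4).
from gensim import corpora
from gensim.models import LdaModel

def assign_topics(sentences, num_topics=5):
    docs = [s.lower().split() for s in sentences]
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]
    lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    topic_word = lda.get_topics()  # num_topics x vocabulary matrix of p(w | t_j)
    assignments = []
    for doc in docs:
        ids = [dictionary.token2id[w] for w in doc]
        # argmax over topics of the summed word probabilities
        scores = [sum(topic_word[j][i] for i in ids) for j in range(num_topics)]
        assignments.append(max(range(num_topics), key=scores.__getitem__))
    return assignments
```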

3.3 Summarization algorithms

The length of a summary may be specified by users, either as a number of words or as a percentage of the number of characters of the original document. By a “30% summary” we mean that the number of characters of the summary does not exceed 30% of that of the original document.

Let ℓ be the summary length (the total number of characters) specified by the user and S a summary. If S consists of sentences s_1, ..., s_m, and the number of characters of s_i is c_i, then the following inequality must hold:

Σ_{i=1}^{m} c_i ≤ ℓ.
Depending on which sentence-ranking algorithm and which topic-clustering scheme we use, we have eight combinations using ETRank and ERAKE, and eight combinations using DTRank and DRAKE, shown in Table 3. For example, ET3Rank (Enhanced TextTiling TRank) means to use r_e (Eq. 3) to rank sentences and TextTiling to compute topic clusterings, and T2RAKE (TextTiling RAKE) means to use r_d (Eq. 2) to rank sentences over RAKE keyword rankings and TextTiling to compute topic clusterings.

Methods S-R T-C
ESTRank ETRank TCS
EPTRank ETRank TCP
ET3Rank ETRank TCTT
ELDATRank ETRank TCLDA
ESRAKE ERAKE TCS
EPRAKE ERAKE TCP
ET2RAKE ERAKE TCTT
ELDARAKE ERAKE TCLDA
STRank DTRank TCS
PTRank DTRank TCP
T3Rank DTRank TCTT
LDATRank DTRank TCLDA
SRAKE DRAKE TCS
PRAKE DRAKE TCP
T2RAKE DRAKE TCTT
LDARAKE DRAKE TCLDA
Table 3: Description of all the algorithms with different sentence-ranking (S-R) and topic-clustering (T-C) schemes.

All algorithms use the following procedure for selecting sentences:

  1. Preprocessing phase

    1. Identify keywords and compute the ranking of each keyword.

    2. Compute the ranking of each sentence.

  2. Sentence selection phase

    1. Sort the sentences in descending order of their ranking scores.

    2. Select sentences one at a time, from higher scores to lower scores. For each selected sentence s, check whether its topic belongs to the known-topic set (KTS) according to the underlying topic-clustering scheme, where KTS is the set of topics of sentences placed in the summary so far. If the topic of s is in KTS, then discard s; otherwise, place s into the summary and its topic into KTS.

    3. Continue this procedure until the summary reaches its length constraint.

    4. If the number of topics contained in the KTS is equal to the number of topics in the document, empty KTS and repeat the procedure from Step 1.
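
To make the selection phase concrete, here is a minimal sketch under simplifying assumptions (sentences as plain strings, a precomputed topic per sentence, and a character budget):

```python
# Sketch of the sentence selection phase: visit sentences in descending rank
# order, skip a sentence whose topic is already in KTS, and empty KTS once
# every topic in the document has been covered.
def select_sentences(ranked, topic_of, num_topics, max_chars):
    summary, used, kts = [], set(), set()
    length = 0
    while True:
        progressed = False
        for s in ranked:  # already sorted by descending ranking score
            if s in used or topic_of[s] in kts:
                continue
            if length + len(s) > max_chars:
                return summary  # length constraint reached
            summary.append(s)
            used.add(s)
            kts.add(topic_of[s])
            length += len(s)
            progressed = True
            if len(kts) == num_topics:
                kts.clear()  # all topics covered: start a new round
        if not progressed:
            return summary  # no sentence left to add
```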

Figure 2 shows an example of a 30% summary generated by ET3Rank on an article in NewsIR-16.

Figure 2: An example of 30% summary of an article in NewsIR-16 by ET3Rank, where the original document is on the left and the summary is on the right.

4 A Word-Embedding Measurement of Quality

Word2Vec [Mikolov et al., 2013a, Mikolov et al., 2013b] is an NN model that learns a vector representation for each word contained in a corpus of documents. The model consists of an input layer, a projection layer, and an output layer to predict nearby words in the context. In particular, a sequence of words w_1, ..., w_T is used to train a Word2Vec model by maximizing the average log probability of neighboring words:

(1/T) Σ_{t=1}^{T} Σ_{w_j ∈ N(w_t)} log p(w_j | w_t),    (5)

where N(w_t) is the set of the center word w_t's neighboring words within the training context of size c, and p(w_j | w_t) is defined by the softmax function. Word2Vec can learn complex word relationships if it is trained on a very large dataset.
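
As an aside on tooling, a skip-gram Word2Vec model of this kind can be trained in a few lines with gensim (which we use in Section 5); the toy corpus and hyperparameters below are illustrative only, not the paper's settings.

```python
# Training a skip-gram Word2Vec model (gensim 4 signature).
from gensim.models import Word2Vec

sentences = [
    ["hurricane", "gilbert", "swept", "toward", "jamaica"],
    ["forecasters", "said", "the", "hurricane", "was", "gaining", "strength"],
]
model = Word2Vec(sentences, vector_size=100, window=5, sg=1, min_count=1)
vector = model.wv["hurricane"]  # the learned 100-dimensional embedding
```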

4.1 Word Mover’s Distance

Word Mover's Distance (WMD) [Kusner et al., 2015] uses Word2Vec as a word-embedding representation method. It measures the dissimilarity between two documents by calculating the minimum cumulative distance that the embedded words of one document must "travel" to reach those of the other. Although two documents may not share any words in common, WMD can still measure their semantic similarity by considering their word embeddings, while bag-of-words or TF-IDF methods only measure similarity by the appearance of words. A smaller WMD value indicates that the two documents are more similar.

4.2 A word-embedding similarity measure

Based on WMD's ability to measure the semantic similarity of documents, we propose a summarization evaluation measure WESM (Word-Embedding Similarity Measure). Given two documents d_1 and d_2, let WMD(d_1, d_2) denote the Word Mover's Distance between d_1 and d_2. Given a document D, assume that it consists of paragraphs P_1, ..., P_k. Let S be a summary of D. We compare the word-embedding similarity of S with D using WESM as follows:

WESM(S, D) = (1/k) Σ_{i=1}^{k} 1 / (1 + WMD(S, P_i)).    (6)

The value of WESM is between 0 and 1. Under this measure, the higher the WESM value, the more similar S is to D.
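
The following is a minimal sketch of computing WESM, assuming the Eq. (6) form above, gensim word vectors with wmdistance() (which requires an optimal-transport backend such as POT to be installed), and whitespace tokenization.

```python
# Sketch of WESM: average a WMD-based similarity between the summary and
# each paragraph of the document; wv is a gensim KeyedVectors instance.
def wesm(summary, paragraphs, wv):
    s_tokens = summary.lower().split()
    total = 0.0
    for p in paragraphs:
        d = wv.wmdistance(s_tokens, p.lower().split())
        total += 1.0 / (1.0 + d)  # maps a WMD in [0, inf) into (0, 1]
    return total / len(paragraphs)
```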

5 Numerical Analysis

We evaluate the quality of summaries using the DUC-02 dataset [DUC, 2002] and the NewsIR-16 dataset [Corney et al., 2016]. DUC-02 consists of 60 reference sets, each of which consists of a number of documents, single-document summary benchmarks, and multi-document abstracts/extracts. The common ROUGE recall measures ROUGE-1, ROUGE-2, and ROUGE-SU4 are used to compare the quality of summarization algorithms over DUC data. NewsIR-16 consists of 1 million articles from English news media sites and blogs.

We use various software packages to implement TextRank (with window size = 2) [MIT, 2014], RAKE [MIT, 2015], TextTiling [Boutsioukis, 2016], and LDA and Word2Vec [Rehurek, 2017].

We use the existing Word2Vec model trained on English Wikipedia [Foundation, 2017], which consists of 3.75 million articles formatted in XML. We choose this dataset for its large size and the diverse topics it covers.

5.1 ROUGE evaluations over DUC-02

As mentioned before, we use the algorithm of [Parveen et al., 2016] to cover all previously known algorithms for the purpose of comparing the quality of summaries, as it produces the best results among them.

Among all the algorithms we devise, we present only those with at least one ROUGE recall score better than or equal to the corresponding score of [Parveen et al., 2016], identified in bold (see Table 4). Also shown in the table is the average of the three ROUGE scores (R-AVG). We can see that ET3Rank is the winner, followed by T2RAKE; both are superior to [Parveen et al., 2016]. Moreover, ET2RAKE offers the highest ROUGE-1 score of 49.3.

Methods R-1 R-2 R-SU4 R-AVG
Parveen et al. (2016) 49.0 24.7 25.8 33.17
ET3Rank 49.2 25.6 27.5 34.10
ESRAKE 49.0 23.6 26.1 32.90
ET2RAKE 49.3 21.4 24.5 31.73
PRAKE 49.0 24.5 25.3 32.93
T2RAKE 49.1 25.4 25.8 33.43
Table 4: ROUGE scores (%) on DUC-02 data.

5.2 WESM evaluations over DUC-02 and NewsIR-16

Table 5 shows the evaluation results on DUC-02 and NewsIR-16 using WESM based on the Word2Vec model trained on English Wikipedia. The number in the Benchmarks row is the average score over all benchmark summaries in DUC-02. For the remaining rows, each number is the average score of summaries produced by the corresponding algorithm for all documents in DUC-02 or NewsIR-16. The size constraint of a summary on DUC-02 for each document is the same as that of the corresponding DUC-02 summary benchmark.

For NewsIR-16, we select at random 1,000 documents from NewsIR-16 and remove the title, references, and other unrelated content from each article. Based on the observation that a 30% length allows for a good summary, we compute 30% summaries of these articles using each algorithm.

Datasets DUC-02 NewsIR-16
Benchmarks 3.021 –
ET3Rank 3.382 2.002
ESRAKE 3.175 1.956
ET2RAKE 3.148 1.923
PRAKE 3.150 1.970
T2RAKE 3.247 1.990
Table 5: Scores (%) over DUC-02 and NewsIR-16 under WESM trained on English Wikipedia.

It is expected that the scores of our algorithms are better than the benchmark score under each measure, for the benchmarks often use words not in the original documents, and hence have smaller similarities to them.

5.3 Normalized ℓ1-norm

We would like to determine whether WESM is a viable measure. From our experiments, we know that the all-around best algorithm ET3Rank, the second-best algorithm T2RAKE, and ET2RAKE occupy the same positions under R-AVG over DUC-02 and under WESM over both DUC-02 and NewsIR-16 (see Table 6); ESRAKE and PRAKE occupy the same positions under R-AVG over DUC-02 and under WESM over NewsIR-16, while they differ by only one place between R-AVG and WESM over DUC-02.

Methods R-AVG WESM
DUC-02 DUC-02 NewsIR-16
ET3Rank 1 1 1
ESRAKE 4 3 4
ET2RAKE 5 5 5
PRAKE 3 4 3
T2RAKE 2 2 2
Table 6: Orderings of R-AVG scores over DUC-02 and WESM scores over DUC-02 and NewsIR-16.

Next, we compare the ordering of the R-AVG scores and the WESM scores over DUC-02. For this purpose, we use the normalized ℓ1-norm to compare the distance of two orderings. Let o_1, ..., o_n be a sequence of objects, where each o_i has two values a_i and b_i such that (a_1, ..., a_n) and (b_1, ..., b_n) are, respectively, permutations of (1, ..., n). Let

M(n) = max Σ_{i=1}^{n} |a_i − b_i| = ⌊n²/2⌋,

where the maximum is taken over all pairs of permutations; this is the maximum distance two permutations can possibly have. Then the normalized ℓ1-norm of (a_1, ..., a_n) and (b_1, ..., b_n) is defined by

ℓ1(a, b) = (1/M(n)) Σ_{i=1}^{n} |a_i − b_i|.

Table 6 shows the orderings of the R-AVG scores over DUC-02 and WESM scores over DUC-02 and NewsIR-16 (from Tables 4 and 5).

It is straightforward to see that the normalized ℓ1-norm between the R-AVG and WESM orderings over DUC-02 is 2/12 ≈ 0.17, that between the R-AVG ordering over DUC-02 and the WESM ordering over NewsIR-16 is 0, and that between the WESM orderings over DUC-02 and NewsIR-16 is 2/12 ≈ 0.17. This indicates that WESM and ROUGE are highly comparable over DUC-02 and NewsIR-16, and the orderings of WESM on different datasets, while showing some spread, are still similar.
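
These values can be reproduced with a few lines of code; the orderings below are copied from Table 6.

```python
# Reproducing the Table 6 comparison with the normalized l1-norm.
def normalized_l1(a, b):
    n = len(a)
    max_dist = n * n // 2  # floor(n^2/2): maximum distance of two permutations
    return sum(abs(x - y) for x, y in zip(a, b)) / max_dist

r_avg_duc = [1, 4, 5, 3, 2]   # R-AVG ordering over DUC-02
wesm_duc  = [1, 3, 5, 4, 2]   # WESM ordering over DUC-02
wesm_news = [1, 4, 5, 3, 2]   # WESM ordering over NewsIR-16

print(normalized_l1(r_avg_duc, wesm_duc))    # 2/12 ~ 0.167
print(normalized_l1(r_avg_duc, wesm_news))   # 0.0
print(normalized_l1(wesm_duc, wesm_news))    # 2/12 ~ 0.167
```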

5.4 Runtime analysis

We carried out runtime analysis through experiments on a computer with a 3.5 GHz Intel Xeon E5-1620 v3 CPU, using a Python implementation of our summarization algorithms. Since DUC-02 documents are short, all but the LDA-based algorithms run in about the same time. To obtain a finer distinction, we ran our experiments on NewsIR-16. Since the average size of NewsIR-16 articles is 405 words, we selected at random a number of articles from NewsIR-16 and merged them to generate longer articles. For each size from around 500 to around 10,000 words, in increments of 500 words, we selected at random 100 articles and computed the average runtime of the different algorithms to produce a 30% summary (see Figure 3). We note that the time complexity of each of our algorithms lies mainly in the preprocessing phase; the size of summaries in the sentence selection phase only incurs minor fluctuations of computation time, and so it suffices to compare the runtime for producing 30% summaries.

Figure 3: Runtime analysis, where the unit on the x-axis is 100 words and the unit of the y-axis is seconds.

We can see from Figure 3 that ESRAKE and PRAKE incur about the same linear runtime and are extremely fast. Also, ET3Rank, ET2RAKE, and T2RAKE incur about the same runtime. While this runtime is higher because of the use of TextTiling, it is close to linear and meets the realtime requirements. For example, for a document of up to 3,000 words, of over 3,000 but fewer than 5,500 words, and of around 10,000 words, the runtime of ET3Rank is under 0.5, 1, and 2.75 seconds, respectively.

The runtime of SubmodularF is acceptable for documents of moderate size (not shown in the paper), but for a document of about 10,000 words, the runtime is close to 4 seconds. The runtime of LDA-based algorithms is much higher. For example, LDARAKE takes about 16 seconds for a document of about 2,000 words, about 41 seconds for a document of about 5,000 words, and about 79 seconds for a document of about 10,000 words.

6 Conclusions

We presented a number of unsupervised single-document summarization algorithms for generating effective summaries in realtime and a new measure based on word-embedding similarities to evaluate the quality of a summary. We showed that ET3Rank is the best all-around algorithm. A web-based summarization tool using ET3Rank and T2RAKE will be made available to the public.

To further obtain better topic clusterings efficiently, we plan to extend TextTiling over non-consecutive paragraphs. To obtain a better understanding of word-embedding similarity measures, we plan to compare WESM with human evaluation and other unsupervised methods including those devised by Louis and Nenkova [Louis and Nenkova, 2009]. We also plan to explore new ways to measure summary qualities without human-generated benchmarks.

Acknowledgements

We thank the members of the Text Automation Lab at UMass Lowell for their support and fruitful discussions.

References

  • [Aslam and Frost, 2003] Aslam, J. A. and Frost, M. (2003). An information-theoretic measure for document similarity. In SIGIR ’03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 449–450, New York, NY, USA. ACM.
  • [Blei et al., 2003] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
  • [Boutsioukis, 2016] Boutsioukis, G. (2016). Natural language toolkit: Texttiling.
  • [Brin and Page, 1998] Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30:107–117.
  • [Cheng and Lapata, 2016] Cheng, J. and Lapata, M. (2016). Neural summarization by extracting sentences and words. CoRR, abs/1603.07252.
  • [Corney et al., 2016] Corney, D., Albakour, D., Martinez, M., and Moussa, S. (2016). What do a million news articles look like? In Proceedings of the First International Workshop on Recent Trends in News Information Retrieval co-located with 38th European Conference on Information Retrieval (ECIR 2016), Padua, Italy, March 20, 2016., pages 42–47.
  • [Dasgupta et al., 2013] Dasgupta, A., Kumar, R., and Ravi, S. (2013). Summarization through submodularity and dispersion. In ACL (1), pages 1014–1022. The Association for Computer Linguistics.
  • [DUC, 2002] DUC (2002). Document understanding conference 2002.
  • [Foundation, 2017] Foundation, W. (2017). Wikimedia downloads.
  • [Glorot et al., 2011] Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In AISTATS, volume 15, page 275.
  • [Hearst, 1997] Hearst, M. A. (1997). Texttiling: Segmenting text into multi-paragraph subtopic passages. Computational linguistics, 23(1):33–64.
  • [Kleinberg, 1999] Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604–632.
  • [Kusner et al., 2015] Kusner, M. J., Sun, Y., Kolkin, N. I., and Weinberger, K. Q. (2015). From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pages 957–966.
  • [Lin and Bilmes, 2011] Lin, H. and Bilmes, J. A. (2011). A class of submodular functions for document summarization. In Lin, D., Matsumoto, Y., and Mihalcea, R., editors, ACL, pages 510–520. The Association for Computer Linguistics.
  • [Louis and Nenkova, 2009] Louis, A. and Nenkova, A. (2009). Automatically evaluating content selection in summarization without human models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP '09), pages 306–314, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • [Matsuo and Ishizuka, 2003] Matsuo, Y. and Ishizuka, M. (2003). Keyword extraction from a single document using word co-occurrence statistical information. In Proceedings of the Sixteenth International Florida Artificial Intelligence Research Society Conference, pages 392–396. AAAI Press.
  • [Mihalcea and Tarau, 2004] Mihalcea, R. and Tarau, P. (2004). Textrank: Bringing order into texts. In Proceedings of EMNLP-04 and the 2004 Conference on Empirical Methods in Natural Language Processing.
  • [Mikolov et al., 2013a] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • [Mikolov et al., 2013b] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • [MIT, 2014] MIT (2014). Textrank implementation in python.
  • [MIT, 2015] MIT (2015). A python implementation of the rapid automatic keyword extraction.
  • [Nallapati et al., 2016] Nallapati, R., Zhou, B., dos Santos, C. N., Gülçehre, Ç., and Xiang, B. (2016). Abstractive text summarization using sequence-to-sequence RNNs and beyond. In CoNLL, pages 280–290. ACL.
  • [Parveen et al., 2016] Parveen, D., Mesgar, M., and Strube, M. (2016). Generating coherent summaries of scientific articles using coherence patterns. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 772–783.
  • [Parveen et al., 2015] Parveen, D., Ramsl, H.-M., and Strube, M. (2015). Topical coherence for graph-based extractive summarization. In Màrquez, L., Callison-Burch, C., Su, J., Pighin, D., and Marton, Y., editors, EMNLP, pages 1949–1954. The Association for Computational Linguistics.
  • [Parveen and Strube, 2015] Parveen, D. and Strube, M. (2015). Integrating importance, non-redundancy and coherence in graph-based extractive summarization. In Yang, Q. and Wooldridge, M., editors, IJCAI, pages 1298–1304. AAAI Press.
  • [Radev et al., 2004] Radev, D. R., Allison, T., Blair-Goldensohn, S., Blitzer, J., Celebi, A., Dimitrov, S., Drabek, E., Hakim, A., Lam, W., Liu, D., et al. (2004). Mead-a platform for multidocument multilingual text summarization. In LREC.
  • [Rehurek, 2017] Rehurek, R. (2017). gensim 2.0.0.
  • [Rose et al., 2010] Rose, S., Engel, D., Cramer, N., and Cowley, W. (2010). Automatic keyword extraction from individual documents. In Berry, M. W. and Kogan, J., editors, Text Mining. Applications and Theory, pages 1–20. John Wiley and Sons, Ltd.
  • [Rush et al., 2015] Rush, A. M., Chopra, S., and Weston, J. (2015). A neural attention model for abstractive sentence summarization. CoRR, abs/1509.00685.
  • [Salton and Buckley, 1987] Salton, G. and Buckley, C. (1987). Term weighting approaches in automatic text retrieval. Technical report, Cornell University, Ithaca, NY, USA.
  • [Shao and Wang, 2016] Shao, L. and Wang, J. (2016). Dtatg: An automatic title generator based on dependency trees. In Fred, A. L. N., Dietz, J. L. G., Aveiro, D., Liu, K., Bernardino, J., and Filipe, J., editors, KDIR, pages 166–173. SciTePress.
  • [Wan, 2010] Wan, X. (2010). Towards a unified approach to simultaneous single-document and multi-document summarizations. In Proceedings of the 23rd international conference on computational linguistics, pages 1137–1145. Association for Computational Linguistics.
  • [Wan and Xiao, 2010] Wan, X. and Xiao, J. (2010). Exploiting neighborhood knowledge for single document summarization and keyphrase extraction. ACM Trans. Inf. Syst., 28(2).
  • [Woodsend and Lapata, 2010] Woodsend, K. and Lapata, M. (2010). Automatic generation of story highlights. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 565–574. Association for Computational Linguistics.