
Specificity-Based Sentence Ordering for Multi-Document Extractive Risk Summarization

by Berk Ekmekci, et al.

Risk mining technologies seek to find relevant textual extractions that capture entity-risk relationships. However, when high volume data sets are processed, a multitude of relevant extractions can be returned, shifting the focus to how best to present the results. We provide the details of a risk mining multi-document extractive summarization system that produces high quality output by modeling shifts in specificity that are characteristic of well-formed discourses. In particular, we propose a novel selection algorithm that alternates between extracts based on human curated or expanded autoencoded key terms, which exhibit greater specificity or generality as it relates to an entity-risk relationship. Through this extract ordering, and without the need for more complex discourse-aware NLP, we induce felicitous shifts in specificity in the alternating summaries that outperform non-alternating summaries on automatic ROUGE and BLEU scores, and manual understandability and preferences evaluations - achieving no statistically significant difference when compared to human authored summaries.




1 Introduction

Risk mining seeks to identify the expression of entity-risk relationships in textual data Leidner and Schilder (2010). For example, (1a-b) describe a CNN-Terrorism relationship that is indicated by the reference to CNN in (1a) and Terrorism risk category keywords - pipe bomb (1a) and bomb threat (1b).

(1a) Later Wednesday, CNN received a pipe bomb at its Time Warner Center headquarters in Manhattan sent to ex-CIA director John Brennan, prompting CNN to evacuate its offices.

(1b) It was the second time in two days that the building was evacuated in a bomb threat.

The goal of risk mining systems is to identify the highest value and most relevant text extractions that embody an entity-risk relationship, indexed by an entity and a keyword/phrase - obviating the need for a manual review of numerous sources. However, as systems expand both in volume of data analyzed and discovery of new entity-risk relation expressions, the number of relevant extracts increases and the challenge to review the information returns. We rely on extractive summarization (see generally, Nenkova and McKeown (2011)) to address this problem with particular emphasis on creating high quality output that appropriately orders extracted clauses by information specificity.

To illustrate, (1a) provides specific referents about time (Later Wednesday), events (receiv[ing] a pipe bomb), location (Time Warner Center headquarters in Manhattan), people (ex-CIA director John Brennan), and the resulting event of evacuat[ing] its [CNN’s] offices. (1b), through the use of less specific references (It, the building) and its sequencing relative to (1a), generalizes that this was the second such event in two days. If we reorder (1a-b) in (2a-b), which can happen in extractive systems, the flow of information is less felicitous and reads less easily.

(2a) It was the second time in two days that the building was evacuated in a bomb threat.

(2b) Later Wednesday, CNN received a pipe bomb at its Time Warner Center headquarters in Manhattan sent to ex-CIA director John Brennan, prompting CNN to evacuate its offices.

One way to improve output in these circumstances is to control sentence ordering. This is simpler in single documents as preserving the order of the extract in the documents works to encourage a coherent summary (e.g., McKeown et al. (1999)). However, for multi-document summaries, this is not as simple and approaches to sentence ordering become much more complex.

We propose a novel approach to ordering extractive summaries by focusing on the specific or general nature of the extracts when building the summary. In particular, we identify two groups of extracts from a keyword-based risk mining system: one characterized as more specific (from a manually curated set of keywords) and one characterized as more general (from a semantically encoded set of keywords). Alternating the extract selection between these two groups, which are ranked by bidirectional token distances between the entity and the risk keyword, creates extractive summaries that outperform non-alternating systems - so much so that our top performing system fails to be significantly different from the comparative human authored summaries.

In this paper, we review risk mining, extractive summarization, and discuss information specificity in discourse (Section 2). Section 3 presents our risk mining system, with emphasis on: entity-keyword extraction, the expansion of the human curated taxonomy, and the nature of these extraction sets relative to specificity in discourse. Section 4 introduces and presents the results of several experiments evaluated with automatic ROUGE, BLEU, and manual preference and readability judgments. We discuss the results and related work in extractive summarization (Section 5), and discuss future work in Section 6.

2 Background

This section provides a high-level overview of risk mining (Section 2.1), automatic summarization (Section 2.2), and the relationship between specificity and discourse (Section 2.3) to contextualize the presentation of our system in Section 3. Citations are not meant to be exhaustive. Additional treatment of automatic summarization comparable to ours is discussed in Section 5.

2.1 Risk Mining

Risk mining systems typically start with a keyword list that captures, from a subject matter expert’s perspective, a risk category of interest and entities that are subject to that risk (e.g., media outlets subject to terrorism, persons subject to fraud). Systems also expand this “seed” keyword list and fine tune output through some combination of machine learning and human-in-the-loop review until a desired level of performance is achieved Leidner and Schilder (2010); Nugent and Leidner (2016). Domains where risk mining has been applied include financial risks based on filings and stock prices Kogan et al. (2009); Dasgupta et al. (2016); general risks in news Lu et al. (2009b); Nugent et al. (2017), and supply chain risks Carstens et al. (2017). Further, methods of keyword list expansion include ontology merging Subramaniam et al. (2010), crowdsourcing Meng et al. (2015) and paraphrase detection Plachouras et al. (2018). The goal of the expansion is to minimize human involvement while still preserving expert judgment, maintaining and improving performance through the return of highly relevant extracts.

2.2 Automatic Summarization

Approaches to automatic text summarization fall into either the abstractive or extractive categories. Abstractive approaches seek to identify relevant phrases and sentences and then produce a summary that rewrites those extracts, with recent approaches making use of graphs Tan et al. (2017); Dohare et al. (2018) or neural networks Chopra et al. (2016); Paulus et al. (2018). Extractive approaches attempt to: identify relevant text extractions in single and multi-document source material; rank the extracts to find the most informative; and combine the selected extracts into a summarized discourse.

Finding and ranking relevant extracts is based on queries Rahman and Borah (2015), document word frequencies Conroy et al. (2006); Gupta et al. (2007), probabilities Vanderwende et al. (2007), tf-idf weighting Erkan and Radev (2004b); Fung and Ngai (2006), topic modeling Lin and Hovy (2000), sentence clustering McKeown et al. (1999); Siddharthan et al. (2004), graph-based methods Erkan and Radev (2004a); Erkan and Radev (2004b); Mihalcea and Tarau (2004), and neural networks Filippova et al. (2015); Nallapati et al. (2017). Our extraction method (Section 3.2) is based on entity-keyword matching in multiple documents with subsequent ranking of token distances between entities and risk keywords.

Once extracts are selected for inclusion, techniques are applied to improve the overall quality of the summary. Improvements on the sentence level include sentence compression Turner and Charniak (2005); Galley and McKeown (2007) and fusion Jing and McKeown (2000); Barzilay and McKeown (2005). Improvements on the semantic and pragmatic level include use of lexical chains Barzilay and Elhadad (1997); Galley and McKeown (2003), WordNet Fellbaum (1998) -based concepts Schiffman et al. (2002), Latent Semantic Analysis Gong and Liu (2001); Hachey et al. (2006), and discourse relation and graph representations Ono et al. (1994); Marcu (1998, 1997, 2000); Wang et al. (2015). As discussed in Section 5, our system most closely aligns with sentence ordering methods of improvement in multi-document extractive summarization research.

2.3 Specificity and Discourse Structure

At the word level, specificity can be defined in terms of generics and habituals in (3a-d):

(3a) Generic: Dogs love to go for walks.

(3b) Non-Generic: The dog is in the backyard.

(3c) Habitual: She had trouble walking after she slipped and fell.

(3d) Non-Habitual: She slipped and fell in January of 2019.

Generics describe either a class of entities - dogs in (3a), or a member of a class of entities - the dog in (3b). Habituals describe either regular events - trouble walking in (3c), or specific events - slipped and fell in (3d). The ability to detect generics and habituals computationally relies on word-level features such as plurals, quantifiers, verb tenses, categories of noun phrases, and lexical resources such as WordNet (see generally, Mathew and Katz (2009); Friedrich and Pinkal (2015)).

Beyond the sentence, Li (2017) links occurrences of information specificity to rhetorical relations. For example, the background relation provides general backdrop information for subsequent clauses; elaboration provides more specific unfolding of events; and specification provides more specific detail of the previous information (see e.g., Rhetorical Structure Theory Mann and Thompson (1987) and the Penn Discourse TreeBank Prasad et al. (2008)). Mulkar-Mehta et al. (2011) weave generics and habituals into a “granularity” framework of part-of and causality shifts across clauses in discourse. Howald and Abramson (2012) and Howald and Katz (2011) demonstrate that annotated granularities improved machine learning prediction of Segmented Discourse Representation Theory Asher and Lascarides (2003) rhetorical relations.

Appropriately ordered shifts in specificity are generally associated with texts of higher quality Louis and Nenkova (2011a), which can be interpreted as increased readability Dixon (1982, 1987), higher coherence Hobbs (1985); Kehler (2002) and accommodation of the intended audience Beaver and Clark (2008); Djalali et al. (2011). Louis and Nenkova (2011b) further observe that automatic summaries tend to be much more specific than their human authored counterparts and are judged to be incoherent and of lower comparative quality. As discussed in Section 4, rather than explicitly identifying and exploiting habituals, generics or rhetorical relations, we model shifts in specificity by alternating selection from sets of extracts that are characterized as more or less specific relative to an entity-risk relation as a byproduct of risk term expansion.

3 System

Our initial extraction system is a custom NLP processing pipeline capable of ingesting and analyzing hundreds of thousands of text documents relative to a manually-curated seed taxonomy. The system consists of five components:

  1. Document Ingest and Processing: Raw text documents are read from disk and tokenization, lemmatization, and sentencization are performed.

  2. Keyword/Entity Detection: Instances of both keywords and entities are identified in the processed text, and each risk keyword occurrence is matched to the nearest entity token.

  3. Match Filtering and Sentence Retrieval: Matches within the documents are filtered, categorized by pair distance, and corresponding spans retrieved for context. For comparison to methods relying on sentence co-occurrence, the sentences are retrieved for context.

  4. Semantic Encoding and Taxonomy Expansion: A semantic vectorization algorithm is trained on domain-specific texts and used to perform automated expansion of the keyword taxonomy.

  5. Extractive Summarization Construction: From the total collection of extracts, summaries are formed based on different combinations of distances, keyword frequencies, and taxonomy.

We leverage spaCy (Version 2.0.16; Honnibal and Montani (2017)) as the document ingest and low-level NLP platform for this system. This choice was influenced by spaCy’s high speed parsing Choi et al. (2015), out-of-the-box parallel processing, and Python compatibility. In particular, spaCy’s pipe() function allows for a text generator object to be provided and takes advantage of multi-core processing to parallelize batching. In this implementation, each processed document piped in by spaCy is converted to its lemmatized form with sentence breaks noted so that sentence and multi-sentence identification of keyword/entity distances can be captured.

3.1 Keyword/Entity Detection

The shorter the token distance between entity and keyword, the stronger the entity-risk relationship is as a function of semantic and pragmatic coherence. (4) describes the entity Verizon and its litigation risk associated with lawsuit settlement (indicated by settle and lawsuit keywords).

(4) In 2011, Verizon agreed to pay $20 million to settle a class-action lawsuit by the federal Equal Employment Opportunity Commission alleging that the company violated the Americans with Disabilities Act by denying reasonable accommodations for hundreds of employees with disabilities.

We return the entire sentence to provide additional context - a class-action lawsuit and the allegation that Verizon denied reasonable accommodations for hundreds of employees with disabilities. Extracts can further improve when the distances are considered bidirectionally. For example, (5) extends (4) to the prior contiguous sentence which contains settlement. This extension provides greater context for Verizon’s lawsuit. (5) contains a background relation and provides the larger context that Verizon is in violation of settlement terms from a previous lawsuit.

(5) McDonald says this treatment violated the terms of a settlement the company reached a few years earlier regarding its treatment of employees with disabilities. In 2011, Verizon agreed to pay $20 million to settle a class-action lawsuit by the federal Equal Employment Opportunity Commission ….

The system detection process begins by testing for matches of each keyword with each entity, for every possible keyword-entity pairing in the document. For every instance of every keyword, the nearest instance of every available entity is paired regardless of whether it comes before or after the keyword (Algorithm 1). An entity may be found to have multiple risk terms associated with it, but each instance of a risk term will only apply itself to the closest entity - helping to minimize overreaching conclusions of risk while maintaining system flexibility.

Require: taxonomy and entities lists
  for keyword in taxonomy do
      for entity in entities do
          keywordLocs = findLocs(keyword)
          entityLocs = findLocs(entity)
          for kLoc in keywordLocs do
              bestHit = findClosestPair(kLoc, entityLocs)
              results.append((keyword, entity, bestHit))
          end for
      end for
  end for
  return  results, where each bestHit from findClosestPair is a pair of token indices
Algorithm 1 Entity-Keyword Pairing
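As a rough illustration, Algorithm 1 can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: it assumes a document is already a list of lemmatized tokens, that keywords and entities are single tokens, and the helper names (find_locs, find_closest_pair, pair_entities_keywords) are hypothetical. Like the pseudocode, it pairs each keyword instance with the nearest instance of each entity, in either direction.

```python
from bisect import bisect_left


def find_locs(tokens, term):
    """Return the token indices at which a single-token term occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def find_closest_pair(k_loc, entity_locs):
    """Return (keyword index, nearest entity index); entity_locs must be sorted."""
    pos = bisect_left(entity_locs, k_loc)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = entity_locs[max(pos - 1, 0):pos + 1]
    nearest = min(candidates, key=lambda e_loc: abs(e_loc - k_loc))
    return k_loc, nearest


def pair_entities_keywords(tokens, taxonomy, entities):
    """Pair every keyword instance with the closest instance of each entity."""
    results = []
    for keyword in taxonomy:
        keyword_locs = find_locs(tokens, keyword)
        for entity in entities:
            entity_locs = find_locs(tokens, entity)
            if not entity_locs:
                continue  # entity absent from this document
            for k_loc in keyword_locs:
                best_hit = find_closest_pair(k_loc, entity_locs)
                results.append((keyword, entity, best_hit))
    return results


# Toy document loosely based on example (4), already lemmatized and lowercased.
tokens = "verizon agree to settle a lawsuit".split()
print(pair_entities_keywords(tokens, ["settle", "lawsuit"], ["verizon"]))
```

The token distance used for later ranking is then simply the absolute difference between the two indices in each pair.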

The system’s token distance approach promotes efficiency compared to more complex NLP. However, the computational cost of this is: a total of (i × a) × (j × b) comparisons must be made for each document, where i is the number of keyword terms across all taxonomic categories, a the average number of instances of each keyword per document, j the number of entities provided, and b the average number of entity instances per document. Changing any single one of these variables will result in computational load changing with O(n) complexity, but their cumulative effects can quickly add up. (Footnote 1: For parallelization purposes, each keyword and entity is independent of each other keyword and entity. This means that in an infinitely parallel (theoretical) computational scheme, the system runs on O(a × b), which will vary as a function of the risk and text domains.)

Cybersecurity
  Seed (n=20): antivirus, cybersecurity, data breach, denial of service, hacker, malware, network intrusion, phishing, ransomware, spyware, virus, …
  Expanded (n=32 additional): 4frontsecurity, attack, beware, cyberattack, cyberstalking, detection, identity, opsware, phish, ransom, security, socialware

Terrorism
  Seed (n=23): bioterrorism, car bomb, counterterrorism, extremist, hijack, jihad, lone wolf, mass shooting, separatism, suicide bomber, terrorist, …
  Expanded (n=47 additional): bombmaker, consequence, criticism, fascist, hate, hezbollah, hijacker, jihadi, massive, military, suspicious, …

Legal
  Seed (n=26): allegation, bankruptcy, indictment, infringement, lawsuit, litigation, misappropriation, negligence, plaintiff, regulatory violation, statutory, …
  Expanded (n=54 additional): action, carelessness, extortion, foreclosure, infringe, pre-litigation, reckless, relitigate, require, suit, tort, …

Table 1: Sample risk terms: qualitatively bolded terms are more specific and smallcaps terms are more general relative to the risk category.

3.2 Encoding

Our system automates term expansion by using similarity calculations of semantic vectors. These vectors were generated by training a fastText skipgram model Bojanowski et al. (2017), which relies on words and subwords from the same data sources identified in the initial run of the system using the seed taxonomy. This ensures that domain usage of language is well-represented, and any rich domain-specific text may be used to train semantic vectors (see generally, Mikolov et al. (2013)).

For each risk term, the system searches the model vocabulary for the minimized normalized dot product (a basic similarity score found in the fastText codebase), and returns the top-scoring vocabulary terms. Upon qualitative review, the expansion finds new keywords that are specific to the entity-risk relationship, but a higher proportion of new keywords that are more general (Table 1). (Footnote 2: The method also produces tokenized variants and misspellings (neg igence, thisagreement), items clearly out of semantic bounds (gorilla, papilloma, titration), and substring drifting (fines, vines, wines). These are low frequency and typically down selected by the system rather than removed.)
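To make the expansion step concrete, the sketch below ranks vocabulary terms by normalized dot product (cosine similarity) against a seed term. It is only an illustration under stated assumptions: the three-dimensional vectors are invented, the function names cosine and expand_term are hypothetical, and a real run would use vectors from a trained fastText model rather than a hand-written dictionary.

```python
import math


def cosine(u, v):
    """Normalized dot product of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_term(term, vectors, top_k=2):
    """Return the top_k vocabulary terms most similar to `term`."""
    query = vectors[term]
    scored = [(cosine(query, vec), word)
              for word, vec in vectors.items() if word != term]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [word for _, word in scored[:top_k]]


# Toy vectors invented for illustration only (not fastText output).
toy_vectors = {
    "lawsuit": [0.9, 0.1, 0.0],
    "suit": [0.8, 0.2, 0.1],
    "litigation": [0.85, 0.05, 0.05],
    "gorilla": [0.0, 0.1, 0.9],
}
print(expand_term("lawsuit", toy_vectors))
```

In this toy vocabulary the out-of-bounds term gorilla scores far below the two legal terms, which mirrors how low-similarity candidates fall out of the top-scoring set.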

Category        Seed    Expanded    % Increase
Cybersecurity   1.40    2.41        72.14
Terrorism       2.13    2.46        15.49
Legal           1.73    3.60        108.09
Table 2: WordNet Polysemy Seed and Increased Expanded Averages.

To illustrate more quantitatively, the possibility of the content of the extracts having a general or specific character is indicated in Table 2. We calculated a polysemy average: for every word in the keyword sets, we averaged the number of definitions per word from WordNet. The higher the number (the more definitions) the more general the keyword can become relative to the context. Greater increases are seen for Cybersecurity and Legal (more general) compared to Terrorism where the expansion appears to have maintained a similar mix of specific and general. While the filtering of documents by entities may somewhat control the contexts, there is, of course, no guarantee of this. However, we suggest that our method benefits from operating within a specified entity-risk relationship (controlling the extraction, expansion and source material).
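The polysemy average and the percentage column of Table 2 can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the sense counts in toy_senses are hypothetical stand-ins for WordNet lookups, and the function names are not from the paper. The final loop recomputes the percentage increases from the seed and expanded averages reported in Table 2.

```python
def polysemy_average(keywords, sense_counts):
    """Mean number of dictionary senses per keyword."""
    return sum(sense_counts[word] for word in keywords) / len(keywords)


def percent_increase(seed_avg, expanded_avg):
    """Relative change from the seed to the expanded average, in percent."""
    return round((expanded_avg - seed_avg) / seed_avg * 100, 2)


# Hypothetical sense counts, not taken from WordNet.
toy_senses = {"malware": 1, "virus": 3, "attack": 9, "security": 9}
seed_avg = polysemy_average(["malware", "virus"], toy_senses)
expanded_avg = polysemy_average(["attack", "security"], toy_senses)
print(seed_avg, expanded_avg, percent_increase(seed_avg, expanded_avg))

# Recompute the percentage column of Table 2 from its two averages.
table_2 = [("Cybersecurity", 1.40, 2.41),
           ("Terrorism", 2.13, 2.46),
           ("Legal", 1.73, 3.60)]
for category, seed, expanded in table_2:
    print(category, percent_increase(seed, expanded))
```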

3.3 Selection

After processing, the resulting extracts are deduplicated, preserving the lowest distance version. Remaining extracts are ranked by highest frequency keyword and then by shortest distance within the keyword. Summary extract selection proceeds as follows (Algorithm 2):

Require: ranked list by distance and keyword frequency
  while summary is less than n number of words do
      if keyword not in selectedWords then
          summary+=top extract
          remove extract
          rerank remaining results
      end if
  end while
  return  summary
Algorithm 2 Extract Selection
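One possible reading of Algorithm 2 is sketched below in Python. Several details are assumptions made for the sake of a runnable example: extracts are dictionaries with hypothetical keyword, distance and text fields, an extract whose keyword is already represented is discarded so that the loop always makes progress, and selection stops as soon as the next extract would exceed the word budget.

```python
from collections import Counter


def rank_extracts(extracts):
    """Sort by keyword frequency (descending), then token distance (ascending)."""
    freq = Counter(e["keyword"] for e in extracts)
    return sorted(extracts, key=lambda e: (-freq[e["keyword"]], e["distance"]))


def select_summary(extracts, max_words=100):
    """Greedily add the top-ranked extract whose keyword is not yet used."""
    remaining = list(extracts)
    summary, used_keywords, word_count = [], set(), 0
    while remaining:
        remaining = rank_extracts(remaining)  # rerank after every removal
        top = remaining.pop(0)
        if top["keyword"] in used_keywords:
            continue  # keyword already represented; discard this extract
        n_words = len(top["text"].split())
        if word_count + n_words > max_words:
            break  # adding this extract would exceed the word budget
        summary.append(top["text"])
        used_keywords.add(top["keyword"])
        word_count += n_words
    return " ".join(summary)


# Invented extracts for illustration only.
toy_extracts = [
    {"keyword": "lawsuit", "distance": 4, "text": "Costco faces a lawsuit."},
    {"keyword": "lawsuit", "distance": 9, "text": "The lawsuit names Costco."},
    {"keyword": "negligence", "distance": 2, "text": "Costco denied negligence."},
]
print(select_summary(toy_extracts))
```

Skipping already-used keywords keeps the summary from repeating the same risk term, at the cost of sometimes passing over a short-distance extract.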

For experimentation (Section 4), we first selected the top Fortune 100 companies from 2017 as input (entities) into a proprietary news retrieval system for the most recent 1,000 articles mentioning each company (sources). Ignoring low coverage and bodiless articles, 99,424 individual documents were returned. Second, each article was fed into the system and risk detections for three risk relationships (Cybersecurity, Terrorism, and Legal) were found with a distance cutoff of 100 (word) tokens. Lastly, a baseline extract was selected at random for each identified risk from the corresponding document for pairwise comparison. The probability of a multi-sentence extract occurring in the output is high - approx. 70% with an average token distance of 30 for multi- or single sentence extraction (standard deviation is as high as 25 tokens).

System: Costco-Legal summary

H: A lawsuit was brought against Costco for negligence, carlessness, and having defective conditions. Costco is also being investigated for potential corporate misconduct concerning sales of products that are alleged to be counterfeit and/or to Infringe patent/trademark rights. The Acushnet Company who is the holder of certain Titleist golf ball patnets is also in litigation with Costco alleging patent infringement and false advertising.

AlternateThirds: The suit claims Costco should be held liable for the injuries due to its “negligence and carelessness,” and for having “dangerous or defective conditions.” In addition to the litigation with Tiffany & Co., the Company has also recently been in litigation with Acushnet Company, represented to be the holder of certain Titleist golf ball patents, concerning allegations that Costco has committed patent infringement and false advertising in connection with the sale of certain golf balls in its membership warehouses. The plaintiffs did not accept Costcos proposals for settlement and Costco ultimately prevailed on a motion for summary judgment.

MixedThirds: The suit claims Costco should be held liable for the injuries due to its “negligence and carelessness,” and for having “dangerous or defective conditions.” In her motion, Pronzini challenges Costcos allegation that it is headquartered in Washington. The lawsuit claims Costco should have known about any “unsafe, dangerous or defective conditions” in the store.

Expanded: Costco’s settlement of this matter does not constitute an admission of staff’s charges as set forth in paragraphs 4 through 12 above. In addition to the litigation with Tiffany & Co., the Company has also recently been in litigation with Acushnet Company, represented to be the holder of certain Titleist golf ball patents, concerning allegations that Costco has committed patent infringement and false advertising in connection with the sale of certain golf balls in its membership warehouses.

Table 3: Sample Expanded and Human Summaries for Costco-Legal Entity Risk Relationship.

4 Evaluation and Results

We evaluate four systems that produce different combinations of general and specific extracts:

  • Seed - Seed extracts only.

  • Expanded - Expanded extracts only.

  • MixedThirds - The first selection is from the expanded set, all remaining selections are from the seed set.

  • AlternateThirds - Selection proceeds from expanded to seed to expanded.

Depending on the specificity or generality of a given clause, a pair of extracts may flow from general to specific or vice versa. We choose a canonical narrative flow for the overall text - i.e., general to specific (and back to general) (see e.g. Labov (1972)) - which is tested in the MixedThirds and AlternateThirds systems. Table 3 provides example output for the Costco-Legal entity-risk relation, thresholded to 100 words.
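The difference between the MixedThirds and AlternateThirds orderings can be sketched as a selection pattern over the two ranked extract sets. The code below is an illustration under assumptions, not the system's implementation: it assumes three selection slots per summary, takes each set as a list of extract strings already ranked best-first, and uses hypothetical names (PATTERNS, select_by_pattern).

```python
# Source set to draw from at each of three selection slots (an assumption).
PATTERNS = {
    "AlternateThirds": ("expanded", "seed", "expanded"),
    "MixedThirds": ("expanded", "seed", "seed"),
}


def select_by_pattern(system, expanded, seed, max_words=100):
    """Take the best remaining extract from the set named at each slot."""
    pools = {"expanded": list(expanded), "seed": list(seed)}
    summary, word_count = [], 0
    for source in PATTERNS[system]:
        if not pools[source]:
            continue  # nothing left in this set
        n_words = len(pools[source][0].split())
        if word_count + n_words > max_words:
            break  # the word limit can cut a summary short of three extracts
        summary.append(pools[source].pop(0))
        word_count += n_words
    return summary


# Invented extracts for illustration only.
expanded_set = ["General claim one.", "General claim two."]
seed_set = ["Specific detail one.", "Specific detail two."]
print(select_by_pattern("AlternateThirds", expanded_set, seed_set))
print(select_by_pattern("MixedThirds", expanded_set, seed_set))
```

With a tight word limit the final slot may never be filled, in which case an AlternateThirds summary ends on a seed extract rather than returning to the expanded set.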

We further test a random baseline system as well as two existing extractive summarization systems, TextRank Mihalcea and Tarau (2004) and LexRank Radev (2004). (Footnote 3: TextRank was implemented with Summa NLP’s TextRank and LexRank with Crabcamp’s LexRank.)

  • Baseline - For a given entity risk relationship, extracts are randomly selected until the 100 word limit is reached.

  • TextRank - Each extract is a node in a graph with weighted edges by normalized word overlap between sentences.

  • LexRank - Each extract is a node in a graph with weighted edges based on cosine similarity of the extract set’s TF-IDF vectors.

We asked six analysts (subject matter experts in risk analysis) to write human summaries for each entity-risk relationship relying on reference extracts filtered by lowest distance and keyword. These human summaries, also thresholded at 100 words, were used in ‘intrinsic’ comparison evaluations (how informative the summaries are) with ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Lin (2004) and BLEU (Bilingual Evaluation Understudy) Papineni et al. (2002); Loper and Bird (2002). (Footnote 4: ROUGE was implemented with Kavita Ganesan’s Java ROUGE 2.0 and BLEU with the Natural Language Toolkit (NLTK).) ROUGE and BLEU alone can be limited without additional ‘extrinsic’ evaluations (how well the summaries are formed) to support and appropriately characterize results. Consequently, we conducted two additional manual evaluations: an A/B Preference Judgment task, pitting all systems against human summaries, and a Readability Judgment task using a 3-Point scale: Fluent (5) = no grammatical or informative barriers, Understandable (3) = some grammatical or informative barriers, Disfluent (1) = significant grammatical or informative barriers.

System            ROUGE-1   ROUGE-2   ROUGE-L   ROUGE-SU   BLEU-4
Seed              9.18      2.78      8.04      3.45       29.48
Expanded          20.45     10.81     18.35     11.55      30.22
MixedThirds       12.29     4.11      10.43     4.93       31.79
AlternateThirds   18.12     8.51      15.66     9.37       32.05
Baseline          9.74      3.35      9.33      4.03       30.61
TextRank          8.05      2.80      9.01      3.24       28.62
LexRank           9.48      2.83      8.74      3.53       29.96
Table 4: ROUGE-1,-2,-L,-SU Average and BLEU-4 Results (top three scores bolded).

4.1 Results

We focus results on the system level for simplification as performance was similar across all risk categories and evaluations. In Table 4, we report averages for unigram (ROUGE-1), bigram (ROUGE-2), longest common subsequence (ROUGE-L), and skip-4-gram using unigram co-occurrence statistics (ROUGE-SU) and the BLEU (4-gram) score. Each system summary was compared against two human summaries from the same entity-risk relationship. System summaries that pulled from the expanded (more general) set of extractions performed best across all versions of ROUGE and BLEU-4: Expanded, AlternateThirds, and MixedThirds have the top three scores on every ROUGE variant, and AlternateThirds and MixedThirds have the top two BLEU-4 scores.
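For readers unfamiliar with the n-gram overlap behind ROUGE-1 and ROUGE-2, a simplified recall-style computation is sketched below. It is not the ROUGE 2.0 implementation used for the reported scores: it assumes whitespace tokenization and lowercasing, handles a single reference, and omits stemming, stop-word handling and the precision and F-measure variants. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram overlap divided by the number of reference n-grams."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


# Invented sentences for illustration only.
reference = "Costco faces a patent lawsuit"
candidate = "Costco faces a lawsuit"
print(rouge_n_recall(candidate, reference, n=1))
print(rouge_n_recall(candidate, reference, n=2))
```

Here four of the five reference unigrams and two of the four reference bigrams appear in the candidate, which shows why bigram scores are typically much lower than unigram scores for the same pair of summaries.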

System            χ² (p [d.f.=1])
Seed              17.64 (p < 0.001)
Expanded          12.82 (p < 0.001)
MixedThirds       11.68 (p < 0.001)
AlternateThirds   3.68 (p > 0.05)
Baseline          23.12 (p < 0.001)
TextRank          49.08 (p < 0.001)
LexRank           13.62 (p < 0.001)
Table 5: Pearson’s χ² for Preference Judgments. No statistically significant difference when AlternateThirds is compared to human summaries (p > 0.05).
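As a hedged illustration of how a statistic like those in Table 5 can be read, the sketch below computes Pearson's chi-square for a two-way preference count against an even split, which has one degree of freedom. The paper does not spell out the exact test design, so the goodness-of-fit setup, the vote counts and the function name are all assumptions; the critical values are the standard ones for one degree of freedom.

```python
def chi_square_equal_preference(human_votes, system_votes):
    """Pearson's chi-square against an even split; two outcomes give d.f. = 1."""
    expected = (human_votes + system_votes) / 2
    return ((human_votes - expected) ** 2
            + (system_votes - expected) ** 2) / expected


CRITICAL_05 = 3.841    # chi-square critical value for d.f. = 1 at p = 0.05
CRITICAL_001 = 10.828  # chi-square critical value for d.f. = 1 at p = 0.001

# Invented vote counts for illustration only.
for human, system in [(60, 40), (55, 45)]:
    stat = chi_square_equal_preference(human, system)
    print(human, system, stat, stat > CRITICAL_05)
```

Under this reading, a statistic of 3.68 falls just below the 3.841 threshold and is therefore not significant at the 0.05 level, while values above 10.828 are significant at the 0.001 level.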

For A/B Preference Judgments, 2,000 annotations (1,000 double annotated instances) were collected for human summaries versus all systems. There is a trend of greater preference for the expanded over non-expanded systems (Figure 1). This is supported with Pearson’s χ² (Table 5), where there is no statistically significant difference between AlternateThirds and human summaries. Statistically significant differences exist for all other system comparisons, though with a narrowing percentage preference gap for the expanded systems. Average κ Cohen (1960) for the Preference Judgment was quite low at .069, indicating not only the difficulty of the task and a significant source of disagreement among the risk analysts, but also increased randomization based on the lack of a third ‘no difference’ option.
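Inter-annotator agreement of the kind reported here is commonly measured with Cohen's kappa, which corrects observed agreement for the agreement expected by chance. The following minimal sketch shows the calculation for two annotators; the labels are invented and the function name is hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement: product of each annotator's label proportions.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented A/B preference labels for four items.
annotator_1 = ["human", "human", "system", "system"]
annotator_2 = ["human", "system", "system", "system"]
print(cohen_kappa(annotator_1, annotator_2))
```

A value near zero, such as the .069 reported for the preference task, means the annotators agreed little more often than chance would predict.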

System Readability
Human 3.75
Baseline 2.54
AlternateThirds 2.50
Expanded 2.37
Seed 2.31
MixedThirds 2.20
LexRank 2.14
TextRank 1.92
Table 6: Average Readability (1-3-5 Scale). AlternateThirds and Baseline (Discussed in Section 5) have the highest non-human readability across all systems.

For Readability Judgments, 1,600 annotations were collected (800 doubly annotated instances) for all systems and human summaries. The human summaries garnered the highest scores with a 3.75 average (Table 6), with the Expanded and AlternateThirds (and Baseline) achieving scores between 2.37 and 2.54. AlternateThirds and Expanded also had the highest proportion of “5” ratings (20%) compared to 50% for the human summaries and 15% or lower for the other systems. Average κ improved to .163, but was still low.

Figure 1: Expert preference ratings.

5 Discussion and Related Work

Overall, the AlternateThirds and MixedThirds systems have the highest content overlap and are packaged in a way that yields high readability and preference ratings when compared to human summaries. When variation was observed in the results (low scores for these systems, or high scores for non-alternating systems) it often had to do with the experimental design rather than specificity ordering. For example, Baseline extractions received “5” ratings (cf. Tables 4 and 6 for good Baseline performance) when they were short coherent discourses (6):

(6) Well before a deranged anti-Semite opened fire in the Tree of Life Synagogue, instances of anti-Semitism and hate crimes were on the rise. White nationalists have felt emboldened to march in American cities. And days before the shooting, a gunman tried to shoot up a predominantly black church. When he failed, he went to a nearby Kroger outside Louisville, Kentucky, and killed two people there instead.

Further, the performance of TextRank and LexRank was likely inhibited by being run on the extracts rather than the documents themselves, though LexRank did outperform the Seed system on the A/B Preference evaluation.

Thresholding at 100 words created lower scored AlternateThirds summaries if only two extracts could be selected because the word limit would otherwise be exceeded (i.e., no final expanded extract). Also, while the top distance-ranked extracts were the substrate for the human summaries, the systems could use a broader range of extracts and create interesting (though less on point) highly rated summaries - e.g. the Seed system in (7):

(7) If there is such a thing as a hate crime, we saw it at Kroger and we saw it in the synagogue again in Pittsburgh,” McConnell said. The Kroger Co. announced today a $1 million donation to the USO as part of its annual Honoring Our Heroes campaign. Kroger’s Honoring Our Heroes campaign has supported veterans, active duty military and their families since 2010, raising more than $21 million through both corporate funds and customer donations.

While a variety of discourse level extractive summarization approaches attempt to create well-formed discourses, from which specificity and a host of other pragmatic phenomena would follow and contribute to higher quality, sentence ordering approaches are most similar to what is proposed here. For single documents, maintaining the order of extracts in the source material has provided positive improvements in quality Lin and Hovy (2002); McKeown et al. (1999); Ji and Nie (2008). Sentence ordering for multi-document summarization is harder as there is no a priori structural discourse relationship between documents. Nonetheless, chronology can be inferred and ordered across clusters of documents for improved output Barzilay et al. (2002); Ji and Nie (2008); Bollegala et al. (2005, 2010).

Discourse awareness in our system comes from semantic coherence associated with token distances, and pragmatic (rhetorical) coherence associated with the multi-sentence extractions and the nature of specificity in the extraction sets. Our system is lower complexity compared to other systems, but there is less control over the specific and general nature of the extracts and their ordering. Observed benefits from the extractions within a tightly constrained domain cannot be disregarded. While current research and detection of text specificity (e.g. Li (2017)) shows promise of more control, it remains a very difficult problem.

6 Conclusion and Future Work

For short extractive multi-document summaries in the context of our risk mining system, focusing on ordering of information specificity as a means of structuring discourse has provided tangible improvements in output quality. Future experimentation will extend to contexts beyond risk mining to test the generalizability of our method in less controlled environments. Further, as summary thresholds increase, our method may require additional constraints to ensure, for example, that global discourse patterns are adhered to - especially as other non-narrative structures are considered (see e.g. Smith (2003)).

As noted, observed improvements do not require intricate control of the extracted information. While greater control to improve output more consistently would certainly be welcome, care must be taken not to overburden the system where it is not clear, based on current research into specificity and discourse, that improvement will be found. Nonetheless, specificity-leaning features improve output in extractive summary discourses in the absence of more in-depth NLP - an encouraging step toward focusing on less well-studied discourse phenomena as a means of progress.

References


  • Asher and Lascarides (2003) Nicholas Asher and Alex Lascarides. 2003. Logics of Conversation. Cambridge University Press, Cambridge, UK.
  • Barzilay and Elhadad (1997) Regina Barzilay and Michael Elhadad. 1997. Using lexical chains for text summarization. In Proceedings of the ACL Workshop on Intelligent Scalable Text Summarization, pages 10–17.
  • Barzilay et al. (2002) Regina Barzilay, Noemie Elhadad, and Kathleen R. McKeown. 2002. Inferring strategies for sentence ordering in multidocument news summarization. J. Artif. Int. Res., 17(1):35–55.
  • Barzilay and McKeown (2005) Regina Barzilay and Kathleen R. McKeown. 2005. Sentence fusion for multidocument news summarization. Computational Linguistics, 31(3):297–328.
  • Beaver and Clark (2008) David I. Beaver and Brady Z. Clark. 2008. Sense and Sensitivity: How Focus Determines Meaning. Wiley-Blackwell.
  • Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
  • Bollegala et al. (2005) Danushka Bollegala, Naoaki Okazaki, and Mitsuru Ishizuka. 2005. A machine learning approach to sentence ordering for multidocument summarization and its evaluation. In Second International Joint Conference on Natural Language Processing: Full Papers.
  • Bollegala et al. (2010) Danushka Bollegala, Naoaki Okazaki, and Mitsuru Ishizuka. 2010. A bottom-up approach to sentence ordering for multi-document summarization. Inf. Process. Manage., 46(1):89–109.
  • Carstens et al. (2017) Lucas Carstens, Jochen L. Leidner, Krzysztof Szymanski, and Blake Howald. 2017. Modeling company risk and importance in supply graphs. In The Semantic Web - 14th International Conference, ESWC 2017, Portorož, Slovenia, May 28 - June 1, 2017, Proceedings, Part II, pages 18–32.
  • Choi et al. (2015) Jinho D. Choi, Joel Tetreault, and Amanda Stent. 2015. It depends: Dependency parser comparison using a web-based evaluation tool. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 387–396.
  • Chopra et al. (2016) Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–98, San Diego, California. Association for Computational Linguistics.
  • Cohen (1960) Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.
  • Conroy et al. (2006) John M. Conroy, Judith D. Schlesinger, and Dianne P. O’Leary. 2006. Topic-focused multi-document summarization using an approximate oracle score. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 152–159, Sydney, Australia. Association for Computational Linguistics.
  • Dasgupta et al. (2016) Tirthankar Dasgupta, Lipika Dey, Prasenjit Dey, and Rupsa Saha. 2016. A framework for mining enterprise risk and risk factors from text. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, pages 180–184.
  • Dixon (1982) Peter Dixon. 1982. Plans and written directions for complex tasks. Journal of Verbal Learning and Verbal Behavior, 21:70–84.
  • Dixon (1987) Peter Dixon. 1987. The processing of organizational and component step information in written directions. Journal of Memory and Language, 26:24–35.
  • Djalali et al. (2011) Alex Djalali, David Clausen, Sven Lauer, Karl Schultz, and Christopher Potts. 2011. Modeling expert effects and common ground using questions under discussion. In AAAI Fall Symposium: Building Representations of Common Ground with Intelligent Agents.
  • Dohare et al. (2018) Shibhansh Dohare, Vivek Gupta, and Harish Karnick. 2018. Unsupervised semantic abstractive summarization. In Proceedings of ACL 2018, Student Research Workshop, pages 74–83, Melbourne, Australia. Association for Computational Linguistics.
  • Erkan and Radev (2004a) Günes Erkan and Dragomir R. Radev. 2004a. The University of Michigan at DUC 2004. In Proceedings of the Document Understanding Conferences, pages 120–127.
  • Erkan and Radev (2004b) Günes Erkan and Dragomir R. Radev. 2004b. LexRank: Graph-based lexical centrality as salience in text summarization. J. Artif. Int. Res., 22(1):457–479.
  • Fellbaum (1998) Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. Bradford Books.
  • Filippova et al. (2015) Katja Filippova, Enrique Alfonseca, Carlos A. Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 360–368, Lisbon, Portugal. Association for Computational Linguistics.
  • Friedrich and Pinkal (2015) Annemarie Friedrich and Manfred Pinkal. 2015. Discourse-sensitive automatic identification of generic expressions. In ACL 2015 - Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics.
  • Fung and Ngai (2006) Pascale Fung and Grace Ngai. 2006. One story, one flow: Hidden markov story models for multilingual multidocument summarization. ACM Trans. Speech Lang. Process., 3(2):1–16.
  • Galley and McKeown (2003) Michel Galley and Kathleen McKeown. 2003. Improving word sense disambiguation in lexical chaining. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, IJCAI’03, pages 1486–1488, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
  • Galley and McKeown (2007) Michel Galley and Kathleen McKeown. 2007. Lexicalized Markov grammars for sentence compression. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 180–187, Rochester, New York. Association for Computational Linguistics.
  • Gong and Liu (2001) Yihong Gong and Xin Liu. 2001. Generic text summarization using relevance measure and latent semantic analysis. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’01, pages 19–25, New York, NY, USA. ACM.
  • Gupta et al. (2007) Surabhi Gupta, Ani Nenkova, and Dan Jurafsky. 2007. Measuring importance and query relevance in topic-focused multi-document summarization. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL ’07, pages 193–196, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Hachey et al. (2006) Ben Hachey, Gabriel Murray, and David Reitter. 2006. Dimensionality reduction aids term co-occurrence based multi-document summarization. In Proceedings of the Workshop on Task-Focused Summarization and Question Answering, SumQA ’06, pages 1–7, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Hobbs (1985) Jerry Hobbs. 1985. On the coherence and structure of discourse. CSLI Technical Report, 85-37.
  • Honnibal and Montani (2017) Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.
  • Howald and Abramson (2012) Blake Howald and Martha Abramson. 2012. The use of granularity in rhetorical relation prediction. In Proceedings of the First Joint Conference on Lexical and Computational Semantics, *SEM 2012, June 7-8, 2012, Montréal, Canada, pages 44–48.
  • Howald and Katz (2011) Blake Howald and E. Graham Katz. 2011. On the explicit and implicit spatiotemporal architecture of narratives of personal experience. In Spatial Information Theory - 10th International Conference, COSIT 2011, Belfast, ME, USA, September 12-16, 2011. Proceedings, pages 434–454.
  • Ji and Nie (2008) Donghong Ji and Yu Nie. 2008. Sentence ordering based on cluster adjacency in multi-document summarization. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II.
  • Jing and McKeown (2000) Hongyan Jing and Kathleen R. McKeown. 2000. Cut and paste based text summarization. In 1st Meeting of the North American Chapter of the Association for Computational Linguistics.
  • Kehler (2002) Andrew Kehler. 2002. Coherence, Reference, and the Theory of Grammar. CSLI Publications, Stanford, CA, USA.
  • Kogan et al. (2009) Shimon Kogan, Dimitry Levin, Bryan R. Routledge, Jacob S. Sagi, and Noah A. Smith. 2009. Predicting risk from financial reports with regression. In Proceedings of the 2009 Annual Conference of the North American Chapter of the ACL (NAACL-HLT), pages 272–280.
  • Labov (1972) William Labov. 1972. The transformation of experience in narrative syntax. In Language in the Inner City. Studies in the Black English Vernacular, pages 354–396.
  • Leidner and Schilder (2010) Jochen L. Leidner and Frank Schilder. 2010. Hunting for the black swan: Risk mining from text. In Proceedings of the ACL 2010 System Demonstrations, pages 54–59.
  • Li (2017) Junyi Li. 2017. From discourse structure to text specificity: Studies of coherence preferences. University of Pennsylvania, Ph.D. Thesis.
  • Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  • Lin and Hovy (2000) Chin-Yew Lin and Eduard Hovy. 2000. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th Conference on Computational Linguistics - Volume 1, COLING ’00, pages 495–501, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Lin and Hovy (2002) Chin-Yew Lin and Eduard Hovy. 2002. From single to multi-document summarization: A prototype system and its evaluation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 457–464, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Loper and Bird (2002) Edward Loper and Steven Bird. 2002. NLTK: The natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, ETMTNLP ’02, pages 63–70, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Louis and Nenkova (2011a) Annie Louis and Ani Nenkova. 2011a. Automatic identification of general and specific sentences by leveraging discourse annotations. In Fifth International Joint Conference on Natural Language Processing, IJCNLP 2011, Chiang Mai, Thailand, November 8-13, 2011, pages 605–613.
  • Louis and Nenkova (2011b) Annie Louis and Ani Nenkova. 2011b. Text specificity and impact on quality of news summaries. In Proceedings of the Workshop on Monolingual Text-To-Text Generation@ACL, Portland, Oregon, USA, June 24, 2011, pages 34–42.
  • Lu et al. (2009b) Hsin-Min Lu, Nina WanHsin Huang, Zhu Zhang, and Tsai-Jyh Chen. 2009b. Identifying firm-specific risk statements in news articles. In Intelligence and Security Informatics, pages 42–53. Springer.
  • Mann and Thompson (1987) William C. Mann and Sandra A. Thompson. 1987. Rhetorical Structure Theory: A Theory of Text Organization. Information Sciences Institute, Marina del Rey, CA.
  • Marcu (1997) Daniel Marcu. 1997. From discourse structures to text summaries. In Intelligent Scalable Text Summarization.
  • Marcu (1998) Daniel Marcu. 1998. To build text summaries of high quality, nuclearity is not sufficient. In Working Notes of the AAAI-98 Spring Symposium on Intelligent Text Summarization.
  • Marcu (2000) Daniel Marcu. 2000. The Theory and Practice of Discourse Parsing and Summarization. MIT Press, Cambridge, MA, USA.
  • Mathew and Katz (2009) Thomas Mathew and E. Graham Katz. 2009. Supervised categorization for habitual versus episodic sentences. In Sixth Midwest Computational Linguistics Colloquium.
  • McKeown et al. (1999) Kathleen R. McKeown, Judith L. Klavans, Vasileios Hatzivassiloglou, Regina Barzilay, and Eleazar Eskin. 1999. Towards multidocument summarization by reformulation: Progress and prospects. In Proceedings of the Sixteenth National Conference on Artificial Intelligence and the Eleventh Innovative Applications of Artificial Intelligence Conference Innovative Applications of Artificial Intelligence, AAAI ’99/IAAI ’99, pages 453–460, Menlo Park, CA, USA. American Association for Artificial Intelligence.
  • Meng et al. (2015) Rui Meng, Yongxin Tong, Lei Chen, and Caleb Chen Cao. 2015. Crowdtc: Crowdsourced taxonomy construction. In 2015 IEEE International Conference on Data Mining, pages 913–918.
  • Mihalcea and Tarau (2004) Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of EMNLP 2004, pages 404–411, Barcelona, Spain. Association for Computational Linguistics.
  • Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • Mulkar-Mehta et al. (2011) Rutu Mulkar-Mehta, Jerry Hobbs, and Eduard Hovy. 2011. Granularity in natural language discourse. In Proceedings of the Ninth International Conference on Computational Semantics, IWCS ’11, pages 360–364, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Nallapati et al. (2017) Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In AAAI, pages 3075–3081. AAAI Press.
  • Nenkova and McKeown (2011) Ani Nenkova and Kathleen McKeown. 2011. Automatic Summarization, volume 5.
  • Nugent and Leidner (2016) Timothy Nugent and Jochen L. Leidner. 2016. Risk mining: Company-risk identification from unstructured sources. In IEEE International Conference on Data Mining Workshops, ICDM Workshops 2016, December 12-15, 2016, Barcelona, Spain, pages 1308–1311.
  • Nugent et al. (2017) Timothy Nugent, Fabio Petroni, Natraj Raman, Lucas Carstens, and Jochen L. Leidner. 2017. A comparison of classification models for natural disaster and critical event detection from news. In 2017 IEEE International Conference on Big Data (Big Data), pages 3750–3759.
  • Ono et al. (1994) Kenji Ono, Kazuo Sumita, and Seiji Miike. 1994. Abstract generation based on rhetorical structure extraction. In COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Paulus et al. (2018) Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. CoRR, abs/1705.04304.
  • Plachouras et al. (2018) Vassilis Plachouras, Fabio Petroni, Timothy Nugent, and Jochen L. Leidner. 2018. A comparison of two paraphrase models for taxonomy augmentation. In Proceedings of the 2018 Annual Conference of the North American Chapter of the ACL (NAACL-HLT), pages 315–320.
  • Prasad et al. (2008) Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The penn discourse treebank 2.0. In Proceedings of the International Conference on Language Resources and Evaluation (LREC-08).
  • Radev (2004) Dragomir R. Radev. 2004. Lexrank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:2004.
  • Rahman and Borah (2015) N. Rahman and B. Borah. 2015. A survey on existing extractive techniques for query-based text summarization. In 2015 International Symposium on Advanced Computing and Communication (ISACC), pages 98–102.
  • Schiffman et al. (2002) Barry Schiffman, Ani Nenkova, and Kathleen McKeown. 2002. Experiments in multidocument summarization. In Proceedings of the Second International Conference on Human Language Technology Research, HLT ’02, pages 52–58, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
  • Siddharthan et al. (2004) Advaith Siddharthan, Ani Nenkova, and Kathleen McKeown. 2004. Syntactic simplification for improving content selection in multi-document summarization. In Proceedings of Coling 2004, pages 896–902, Geneva, Switzerland. COLING.
  • Smith (2003) Carlota S. Smith. 2003. Modes of Discourse: The Local Structure of Texts (Cambridge Studies in Linguistics). Cambridge University Press.
  • Subramaniam et al. (2010) L. Venkata Subramaniam, Amit Anil Nanavati, and Sougata Mukherjea. 2010. Enriching one taxonomy using another. In 2010 IEEE Transactions on Knowledge and Data Engineering, pages 913–918.
  • Tan et al. (2017) Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2017. Abstractive document summarization with a graph-based attentional neural model. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1171–1181, Vancouver, Canada. Association for Computational Linguistics.
  • Turner and Charniak (2005) Jenine Turner and Eugene Charniak. 2005. Supervised and unsupervised learning for sentence compression. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 290–297, Ann Arbor, Michigan. Association for Computational Linguistics.
  • Vanderwende et al. (2007) Lucy Vanderwende, Hisami Suzuki, Chris Brockett, and Ani Nenkova. 2007. Beyond sumbasic: Task-focused summarization with sentence simplification and lexical expansion. Inf. Process. Manage., 43(6):1606–1618.
  • Wang et al. (2015) Xun Wang, Yasuhisa Yoshida, Tsutomu Hirao, Katsuhito Sudoh, and Masaaki Nagata. 2015. Summarization based on task-oriented discourse parsing. IEEE/ACM Trans. Audio, Speech and Lang. Proc., 23(8):1358–1367.