A Consolidated System for Robust Multi-Document Entity Risk Extraction and Taxonomy Augmentation

09/23/2019 ∙ by Berk Ekmekci, et al. ∙ 0

We introduce a hybrid human-automated system that provides scalable entity-risk relation extractions across large data sets. Given an expert-defined keyword taxonomy, entities, and data sources, the system returns text extractions based on bidirectional token distances between entities and keywords and expands taxonomy coverage with word vector encodings. Our system offers a simpler architecture than alerting-focused systems, motivated by high-coverage use cases in the risk mining space such as due diligence activities and intelligence gathering. We provide an overview of the system and expert evaluations for a range of token distances. We demonstrate that single and multi-sentence distance groups significantly outperform baseline extractions, with shorter, single sentences being preferred by analysts. As the taxonomy expands, the amount of relevant information increases and multi-sentence extractions become more preferred, but this is tempered by entity-risk relations becoming more indirect. We discuss the implications of these observations for users, management of ambiguity and taxonomy expansion, and future system modifications.




1 Introduction

Identifying or predicting entity-risk relationships in textual data is a common natural language processing (“NLP”) task known as risk mining Leidner and Schilder (2010). For example, (1) relates the entity CNN to the risk pipe bomb, which is part of a broader terrorism risk category.

(1)Later Wednesday, CNN received a pipe bomb at its Time Warner Center headquarters in Manhattan.

Monitoring systems seek to classify such text extracts as indicating a risk or not for a range of entities and risk categories. Any collection of additional information (e.g., time, location) is often secondary to risk classification. Thus, for use cases seeking more coverage - e.g., intelligence gathering, legal and financial due diligence, and operational risk management - our experience is that traditional risk mining architectures are not entirely fit for purpose. Consequently, we present a hybrid human-automated system that leverages predefined entities, data sources and risk taxonomies to return extractions based on bidirectional entity-risk keyword surface distances.

(2) On Monday, a pipe bomb was found in a mailbox of billionaire business magnate and political activist George Soros. Later Wednesday, CNN received a pipe bomb at its Time Warner Center headquarters in Manhattan sent to ex-CIA director John Brennan and a suspicious package sent to Rep. Maxine Waters ….

By starting with expert-defined information, our system can return both (1) and (2) by varying the distance threshold without the need for a risk classification engine, morphosyntactic parsing or named entity recognition. Further, taxonomy coverage is expanded with word vector encodings trained on the same sources of data, e.g., suspicious package in (2). We believe this approach best combines the efficiencies that tuned Machine Learning (“ML”) systems can offer with the rich depth of experience and insight that analysts bring to the process.

In this paper, we review risk mining research (Section 2) to contextualize the presentation of our system (Section 3) and expert evaluations testing a range of different distance thresholds with seed and expanded taxonomies across cybersecurity, terrorism and legal/noncompliance risk categories (Section 4). We demonstrate that extractions from both single and multi-sentence distance groups outperform the baseline by statistically significant margins. We observe that as coverage of the taxonomy grows, the preference for multi-sentence extractions increases, but the entity-risk relationship becomes more indirect. We discuss these results relative to the user, information recall minimization and taxonomy management (Section 5). We conclude with further considerations and extensions of the system (Section 6).

2 Related Work

A central focus in risk mining is on the ML classification of risk from non-risk and contributing features - e.g., textual association with stock price movement and financial documents (Kogan et al. (2009), Lu et al. (2009a), Groth and Muntermann (2011), Tsai and Wang (2012), and Dasgupta et al. (2016)); sentiment of banking risks Nopp and Hanbury (2015); and risk in news (Lu et al. (2009b) and Nugent et al. (2017)). Heuristic approaches also exist - e.g., a risk taxonomy built with seed patterns used in earnings report alerting Leidner and Schilder (2010) and company risk rankings based on supply chain analysis Carstens et al. (2017). Additional research is focused on curation and management of risk taxonomies (e.g., ontology merging Subramaniam et al. (2010), crowdsourcing Meng et al. (2015) and paraphrase detection Plachouras et al. (2018)).

Our system is closest to the system proposed in Nugent and Leidner (2016), which starts with a risk taxonomy (following Leidner and Schilder (2010)) to be searched in news data. All possible extracts are paired with all possible companies and, based on an SVM model built from annotated data, tuples are classified and stored for risk analyst review. Our system similarly focuses on entity-risk relations for analyst review, but deviates in two key ways: (a) because of the specificity of the initial taxonomy, extracts are assumed to express some degree of entity-risk relationship, obviating the need for a risk classifier; and (b) extracts are based on “shallow” surface parsing rather than deeper morpho-syntactic parsing. These deviations will be revisited in Section 5.

3 System

Figure 1: System Diagram. Unshaded elements represent processes, while shaded nodes represent data input or products. Solid-bordered nodes are the system’s automatic processors or output, and dashed-bordered elements are data sources provided to the system or expert knowledge. The architecture is designed such that each processor element can operate independently (without shared memory) and is NLP platform-independent.

Our system (Figure 1) is a custom NLP processing pipeline capable of ingesting and analyzing hundreds of thousands of text documents. The system consists of four components:

  1. Document Ingest and Processing: Raw text documents are read from disk then tokenized, lemmatized, and sentencized.

  2. Keyword/Entity Detection: Instances of both keywords and entities are identified in the processed text, and each risk keyword occurrence is matched to the nearest entity token.

  3. Match Filtering and Sentence Retrieval: Matches within the documents are filtered and categorized by pair distance and/or sentence co-occurrence, and the filtered sentences are retrieved for context.

  4. Semantic Encoding and Taxonomy Expansion: A semantic vectorization algorithm is trained on domain-specific text and used to perform automated expansion of the keyword taxonomy.

This design architecture allows for significant customization, high throughput, and modularity for uses in experimental evaluation and deployment in production use-cases. We built the system to support decentralized or streaming architectures, with each document being processed independently and learning systems (specifically at the semantic encoding/expansion steps) configured for continuous learning or batch model training.

3.1 Document Ingest and Processing

We leverage spaCy (Version 2.0.16, https://spacy.io) Honnibal and Johnson (2015) as the document ingest and low-level NLP platform for this system. This choice was influenced by spaCy’s high speed parsing Choi et al. (2015), out-of-the-box parallel processing, and Python compatibility. spaCy’s default NLP pipeline runs a tokenizer, part-of-speech tagger, dependency parser, and named entity recognizer.

We used the sentence breaks found by the dependency parser to annotate each of the found keyword-entity pairs as being either in the same or different sentences. A dependency-based sentencizer is preferred to a simpler stop-character based approach due to the unpredictable formatting of certain domains of text - e.g. web-mined news and regulatory filings.

spaCy’s pipe() function allows a text generator object to be provided and takes advantage of multi-core processing to parallelize batching. In this implementation, each processed document piped in by spaCy is converted to its lemmatized form with sentence breaks noted, so that single-sentence and multi-sentence keyword/entity distances can be captured.

3.2 Keyword/Entity Detection

In the absence of intervening information or a more sophisticated approach to parsing, the mention of an entity and risk keyword within a phrase or sentence is the most morpho-syntactically, semantically and pragmatically coherent expression of the relationship. For example, (3) describes the entity Verizon and its litigation risk associated with lawsuit settlement (keywords being settle and lawsuit).

(3) In 2011, Verizon agreed to pay $20 million to settle a class-action lawsuit by the federal Equal Employment Opportunity Commission alleging that the company violated the Americans with Disabilities Act by denying reasonable accommodations for hundreds of employees with disabilities.

Returning the entire sentence yields additional information - the lawsuit is class-action and the complaint allegation is that Verizon “denied reasonable accommodations for hundreds of employees with disabilities.” The system detection process begins by testing for matches of each keyword with each entity, for every possible keyword-entity pairing in the document. Algorithm 1 provides the simplified pseudocode for this process.

Require: taxonomy and entities lists
  results = []
  for keyword in taxonomy do
     for entity in entities do
         keywordLocs = findLocs(keyword)
         entityLocs = findLocs(entity)
         for kLoc in keywordLocs do
            bestHit = findClosestPair(kLoc, entityLocs)   (two token indices)
            results.append((keyword, entity, bestHit))
         end for
     end for
  end for
  return results
Algorithm 1 Entity-Keyword Pairing

For every instance of every keyword, the nearest instance of every available entity is paired - regardless of whether it precedes or follows the keyword. Furthermore, an entity may be found to have multiple risk terms associated with it, but each instance of a risk term will only apply itself to the closest entity - helping to prevent overreaching conclusions of risk while maintaining system flexibility.
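The pairing step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system's actual code: the helper names mirror Algorithm 1, keywords and entities are assumed to be single lemmatized tokens (multi-word terms such as "data breach" would need phrase matching), and ties between equally distant entity mentions are resolved in favor of the earlier mention.

```python
from typing import List, Tuple

Hit = Tuple[str, str, Tuple[int, int]]


def find_locs(tokens: List[str], term: str) -> List[int]:
    """Return every token index at which a single-token term occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def find_closest_pair(k_loc: int, entity_locs: List[int]) -> Tuple[int, int]:
    """Pair a keyword position with the nearest entity position.

    Distance is bidirectional: the entity may precede or follow the keyword.
    """
    nearest = min(entity_locs, key=lambda e_loc: abs(e_loc - k_loc))
    return (k_loc, nearest)


def pair_entities(tokens: List[str], taxonomy: List[str], entities: List[str]) -> List[Hit]:
    """Pair each keyword occurrence with the closest mention of each entity."""
    results: List[Hit] = []
    for keyword in taxonomy:
        keyword_locs = find_locs(tokens, keyword)
        for entity in entities:
            entity_locs = find_locs(tokens, entity)
            if not entity_locs:
                continue  # entity absent from this document
            for k_loc in keyword_locs:
                results.append((keyword, entity, find_closest_pair(k_loc, entity_locs)))
    return results


# Hypothetical lemmatized document, loosely modeled on example (3).
tokens = "verizon agree to settle a lawsuit file against verizon".split()
hits = pair_entities(tokens, ["settle", "lawsuit"], ["verizon"])
```

Here `settle` (index 3) pairs with the first mention of `verizon` (index 0), while `lawsuit` (index 5) pairs with the second mention (index 8), since that one is closer.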

(4) McDonald says this treatment violated the terms of a settlement the company reached a few years earlier regarding its treatment of employees with disabilities. In 2011, Verizon agreed to pay $20 million to settle a class-action lawsuit by the federal Equal Employment Opportunity Commission ….

For example, (4) extends the extract of (3) to the prior contiguous sentence which contains settlement. This extension provides greater context for Verizon’s lawsuit. (4) is actually background for a larger proposition being made in the document that Verizon is in violation of settlement terms from a previous lawsuit.

The system’s token distance approach promotes efficiency and is preferable to more complex NLP - e.g., chunking or co-reference resolution. Nonetheless, this flexibility comes at a computational cost: a total of $(m \cdot a) \times (n \cdot b)$ comparisons must be made for each document, where m is the number of keyword terms across all taxonomic categories, a the average number of instances of each keyword per document, n the number of entities provided, and b the average number of entity instances per document. Changing any single one of these variables will result in computational load changing with linear, $O(n)$, complexity, but their cumulative effects can quickly add up. For parallelization purposes, each keyword is independent of each other keyword and each entity is independent of each other entity. This means that in an infinitely parallel (theoretical) computational scheme, the system runs on $O(a \cdot b)$, which will vary as a function of the risk and text domains.
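As a rough worked example of this cost, assume the per-document total is the product of keyword instances and entity instances. The value m = 101 is the sum of the seed taxonomy sizes in Table 1 (26 + 37 + 38) and n = 100 matches the number of companies used in Section 4.1; the per-document averages a = 2 and b = 5 are purely hypothetical.

```latex
\begin{align*}
\text{comparisons per document} &= (m \cdot a) \times (n \cdot b) \\
  &= (101 \cdot 2) \times (100 \cdot 5) = 202 \times 500 = 101{,}000 \\
\text{comparisons per keyword-entity pair} &= a \cdot b = 2 \cdot 5 = 10
\end{align*}
```

Under these assumptions, doubling any one of the four variables doubles the per-document total, while the work for a single keyword-entity pair depends only on a and b.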

| Risk Category | Keyword Seed Taxonomy | Enriched Keyword Taxonomy |
|---|---|---|
| Cybersecurity | n=26: cybercrime, hack, DDOS, data breach, ransomware, penetration, … | n=123: antivirus, 4front security, cyber deterence, cloakware, unauthenticated, … |
| Terrorism | n=37: terrorism, bio-terrorism, extremist, car bombing, hijack, guerrilla, … | n=147: anti-terrorism, bomb maker, explosives, hezbollah, jihadi, nationalist, … |
| Legal | n=38: litigation, indictment, allegation, failure to comply, sanctions violations, … | n=162: appropriation, concealment, counter suit, debtor, expropriation, issuer, … |

Table 1: Sample risk terms from the seed and expanded sets.

3.3 Match Filtering and Sentence Retrieval

The system has by this point completed the heart of document processing and risk identification. The next component seeks to (a) filter away results unlikely to hold analytic value; and (b) identify the hits as being either single sentence or multi-sentence using the sentence information generated by the spaCy dependency parse. The first of these goals is achieved with a simple hit distance cutoff wherein any keyword-entity pair with more than a certain count of intervening tokens is discarded. Setting a hard cutoff improves keyword-entity spans by excluding matches that stretch across a large document. Once filtering is complete, the system uses the document sentence breaks identified by spaCy to determine the sentence membership of each keyword and entity for each pairing and, ultimately, whether they belong to the same sentence.
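The filtering and labeling logic might look like the following sketch. It assumes sentence boundaries are available as a sorted list of sentence-start token offsets (a hypothetical representation, not necessarily what the system stores) and uses the 100-token cutoff from Section 4.1 as the default.

```python
from bisect import bisect_right
from typing import List, Optional


def sentence_index(token_idx: int, sentence_starts: List[int]) -> int:
    """Map a token index to its sentence number, given sorted sentence-start offsets."""
    return bisect_right(sentence_starts, token_idx) - 1


def classify_hit(
    k_loc: int,
    e_loc: int,
    sentence_starts: List[int],
    max_distance: int = 100,
) -> Optional[str]:
    """Discard distant pairs; label the rest as 'single' or 'multi' sentence."""
    # Count the tokens strictly between the keyword and the entity.
    intervening = max(abs(k_loc - e_loc) - 1, 0)
    if intervening > max_distance:
        return None  # filtered out by the hard distance cutoff
    same_sentence = sentence_index(k_loc, sentence_starts) == sentence_index(
        e_loc, sentence_starts
    )
    return "single" if same_sentence else "multi"


# Hypothetical document with sentences starting at tokens 0, 12 and 30.
starts = [0, 12, 30]
labels = [
    classify_hit(3, 8, starts),    # both tokens in sentence 0
    classify_hit(10, 15, starts),  # sentence 0 vs. sentence 1
    classify_hit(2, 200, starts),  # 197 intervening tokens, over the cutoff
]
```

Keeping the cutoff as a parameter reflects the point that different distance thresholds yield different extraction groups.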

3.4 Encodings

Our system automates term expansion by using similarity calculations of semantic vectors. We generated these vectors by training a fastText (https://fasttext.cc/) skipgram model, which relies on words and subwords from the same data source as the initial extractions Bojanowski et al. (2017). This ensures that domain usage of language is well-represented, and any rich domain-specific text may be used to train semantic vectors (see generally, Mikolov et al. (2013)). We chose fastText for this project because of its light weight, open source availability, efficient single-machine performance, and output of vector models for use in Python.

For each taxonomic risk term encountered, fastText scores the terms in the model vocabulary by normalized dot product (a basic similarity score found in the fastText codebase), and returns the top-scoring vocabulary terms as candidates for taxonomic expansion.
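The ranking step can be illustrated without fastText itself. The sketch below is not the fastText API: it computes the normalized dot product (cosine similarity) directly over a small dictionary of hypothetical two-dimensional vectors and treats the highest score as the most similar term, which is an assumption about how candidates are ordered.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Normalized dot product of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_term(
    term: str, vectors: Dict[str, List[float]], top_k: int = 10
) -> List[Tuple[str, float]]:
    """Return the top_k vocabulary terms most similar to an in-vocabulary term."""
    if term not in vectors:
        return []  # out-of-vocabulary seed terms yield no candidates here
    query = vectors[term]
    scored = [
        (other, cosine(query, vec)) for other, vec in vectors.items() if other != term
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors, invented for illustration only.
toy_vectors = {
    "hack": [1.0, 0.0],
    "hacker": [0.9, 0.1],
    "breach": [0.6, 0.8],
    "lawsuit": [0.0, 1.0],
}
candidates = expand_term("hack", toy_vectors, top_k=2)
```

With these toy vectors the candidates for `hack` are `hacker` followed by `breach`. In practice the returned terms would still need manual review before being added to the taxonomy.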

4 Experiment

To test the performance of the system, we designed an experiment comparing systems that use only single-sentence risk detections with systems that use only multi-sentence risk detections. This measures the performance of purely multi-sentence hits against purely single-sentence hits. In addition, a baseline system was tested that detected only risk terms without searching for corresponding entities. Taken together, the three hypotheses tested are as follows:

H1: Single-sentence extractions are preferred over baseline extractions.
H2: Multi-sentence extractions are preferred over baseline extractions.
H3: Multi-sentence extractions are preferred over single-sentence extractions.

H1 and H2 test whether each method of detecting risk co-occurrence with an entity performs better than random chance of selecting a risk term in the document and its association with the entity. H3 tests whether the distance-based measure in the system outperforms a sentential approach.

4.1 Data Processing

A virtualized Ubuntu 14.04 machine with 8 vCPUs - running on 2.30GHz Intel Xeon E5-2670 processors and 64GB RAM - was chosen to support the first experiment.

The names of the top Fortune 100 companies from 2017 (http://fortune.com/fortune500/2017/) were fed as input into a proprietary news retrieval system for the most recent 1000 articles mentioning each company. Ignoring low coverage and bodiless news articles, 99,424 individual documents were returned. Each article was then fed into the system and risk detections found with a distance cutoff of 100 tokens. For each identified risk, whether single or multi-sentence, the system also selected a baseline sentence at random from the corresponding document for pairwise comparison.

The spaCy dependency parse was the largest bottleneck, with an expected total runtime of approximately 7 calendar days of computation for the nearly 100,000 documents. In the interest of runtime, only the first 21,000 documents read in order of machine-generated news article ID were analyzed. Once all selected documents were processed, we paired single and multi-sentence spans relating to the same risk category, but potentially different entities and documents, for pairwise evaluations.

4.2 Term Expansion

As summarized in Table 1, starting with manually-created seed terms in each category of risk, encodings were learned using fastText from a concatenation of the news article text using the methodology discussed in Section 3.4. Selecting the top ten most similar terms for each in-vocabulary seed term resulted in an expanded taxonomy, with a 326.31% increase on average across the three categories. This term expansion not only introduced new vocabulary to the taxonomy but also variants and common misspellings of keywords, which are important in catching risk terms “in the wild”. Some cleanup of the term expansion was required to filter out punctuation and tokenization variants.

4.3 Evaluation

Analysts were asked to give their preference for “System A” or “System B” or “Neither” when presented with randomized pairs of output. Percentage preferences for the overarching system and each of six pairings were tested for significance with Pearson’s $\chi^2$ using raw counts.

Figure 2: Expert preference ratings. Left and right columns are single or multi-sentence as defined in Table 2.
| System | Comparison | $\chi^2$ | p |
|---|---|---|---|
| Overall | single v. multi (seed v. expand) | 8.530 | 0.003 |
| SYSTEMsbs | single v. baseline (seed) | 28.088 | 1.159e-07 |
| SYSTEMmbs | multi v. baseline (seed) | 25.358 | 4.762e-07 |
| SYSTEMsms | single v. multi (seed) | 37.763 | 7.99e-10 |
| SYSTEMsbe | single v. baseline (expand) | 6.858 | 0.008 |
| SYSTEMmbe | multi v. baseline (expand) | 25.705 | 3.978e-07 |
| SYSTEMsme | single v. multi (expand) | 6.316 | 0.011 |

Table 2: Pearson’s $\chi^2$ and p values (d.f.=1)
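To make the significance test concrete, the sketch below computes a Pearson chi-square statistic for two preference counts and the corresponding p value with one degree of freedom. How the expected counts were set up in the evaluation is not stated, so the equal-split null hypothesis and the example counts are assumptions; only the conversion from a statistic to a p value can be checked against Table 2.

```python
import math


def chi_square_equal_split(count_a: int, count_b: int) -> float:
    """Pearson chi-square for two raw counts against an equal-preference null."""
    expected = (count_a + count_b) / 2.0
    return ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected


def p_value_df1(chi_square: float) -> float:
    """Upper-tail p value of a chi-square statistic with one degree of freedom.

    With d.f.=1 the statistic is a squared standard normal variable, so
    P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi_square / 2.0))


# Hypothetical raw counts: 60 preferences for System A, 40 for System B.
statistic = chi_square_equal_split(60, 40)
p_hypothetical = p_value_df1(statistic)

# Converting a reported statistic from Table 2 (SYSTEMsbs) to its p value.
p_reported = p_value_df1(28.088)
```

The hypothetical 60/40 split gives a statistic of 4.0, and converting the SYSTEMsbs statistic of 28.088 yields a p value that agrees with the reported 1.159e-07 to the printed precision.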

5 Results and Discussion

We collected 4,514 judgments from eight subject matter experts to compute system preferences associated with single, multi- and baseline sentence extractions. Roughly 28% of all evaluated extractions (1,266/4,514) received a preference judgment (32% from the seed set (698/2,198) and 24% from the expansion set (568/2,316)), while the remaining 72% received a “Neither” rating.

As summarized in Table 2 and Figure 2, all single and multi-sentence extractions across the seed and expansion sets outperform the baseline by statistically significant margins. For the seed set, the single sentence extractions outperform the multi-sentence extractions by a statistically significant margin as well (p < .01). However, for the expansion set, the multi-sentence extractions gain significant ground (preference rises from 26% to 38%).

1,283 (28%) of evaluations were doubly annotated for calculation of Cohen’s Kappa Cohen (1960). Average $\kappa$ was 0.284 for the seed set and 0.144 for the expansion set (which suffers from low sample size). Agreement is uniformly low across all categories, but this is not an unusual result given the task and the range of analyst expertise.
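For readers less familiar with the agreement statistic, Cohen's Kappa compares observed agreement between two annotators with the agreement expected by chance from their label frequencies. The following is a minimal sketch; the label sequences are invented for illustration and are not drawn from the evaluation data.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's Kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Proportion of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments over four extraction pairs.
annotator_1 = ["A", "A", "B", "Neither"]
annotator_2 = ["A", "B", "B", "Neither"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case observed agreement is 0.75 and chance agreement is 0.3125, giving a Kappa of 7/11, or about 0.64.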

5.1 Discussion

Our system’s distance metric is clearly providing benefit well above the baseline, so we can accept the two baseline hypotheses (H1 and H2), but given the exactness of the seed taxonomy, the high proportion of “Neither” judgments is surprising. While there are polysemous examples (non-risk category uses of risk keywords), which more complex systems minimize, there are also observable variations in the directness or indirectness of the entity-risk relationships. For example, “The companies could face a number of lawsuits from Walmart, Target and Kroger” describes an entity-risk relationship where the entity is suing rather than being sued. The perceived degree of risk varied by analyst and will have to be controlled for in subsequent evaluations and possibly in components of the system.

We cannot accept H3, which tests the assumption that multi-sentence returns will have greater analytical utility. However, we observed that as the taxonomy expands, the preference for multi-sentence extractions increases by 46% over the single sentence seed set extractions (SYSTEMsms).[1] A potential reason for this is that the keywords in the expanded taxonomy exhibit a greater range of specificity as compared to the seed terms. For example, the expanded taxonomy included require and suit, which are more general and given to ambiguity in the legal/noncompliance category (the expanded taxonomy also included highly specific keywords - e.g., foreclosure and harassment).

[1] Also note that despite an overall 326% increase in coverage, the proportion of “Neither” judgments in a randomly selected set of 100 sentences increases from 52% to 75%.

To the extent that more general keywords proliferate in the taxonomy, multi-sentence information may be more preferred for understanding the entity-risk relationship. This additional information may be preferred in some use cases, but a concomitant increase in non-risk relationships would certainly be unwelcome. Accommodating shifts in semantic granularity as the taxonomy expands will likely have to be considered in addition to distance thresholding to tune results.

6 Conclusion and Future Work

We have described a configurable, scalable system for finding entity-event relations across large data sets. Observed drawbacks should be addressed while maintaining flexibility for analyst users - possibilities include summarization of results for better presentation, alternative source data at the direction of the analyst for given risk categories, token distance thresholding and considerations of semantic granularity.

While designed for high-coverage use cases in risk mining, our system could be used for any data, entities and taxonomies to support generalized entity-x relationship reporting. Further, while the current system seeks to be low complexity (shallow parsing and no risk classifier), the system’s modular design can facilitate any number of additions for customized use cases and integrations, and performance comparisons to similar systems.


Thank you to anonymous reviewers from NAACL Industry Track as well as Matt Machado, Nathan Maynes, and Scott McFadden for support and feedback. Thank you also to extraction output evaluators Alli Zube, Jason Sciarotta, Jeff Earnest, Jonathan Feng, Larry Fowlkes, Regina Reese, Romel Lira and Sean Ahearn.