Identifying or predicting entity-risk relationships in textual data is a common natural language processing (“NLP”) task known as risk mining Leidner and Schilder (2010). For example, (1) relates the entity CNN to the risk pipe bomb, which is part of a broader terrorism risk category.
(1) Later Wednesday, CNN received a pipe bomb at its Time Warner Center headquarters in Manhattan.
Monitoring systems seek to classify such text extracts as indicating a risk or not for a range of entities and risk categories. Any collection of additional information (e.g., time, location) is often secondary to risk classification. Thus, for use cases seeking broader coverage - e.g., intelligence gathering, legal and financial due diligence, and operational risk management - our experience is that traditional risk mining architectures are not entirely fit for purpose. Consequently, we present a hybrid-automated system that leverages predefined entities, data sources, and risk taxonomies to return extractions based on bidirectional entity-risk keyword surface distances.
(2) On Monday, a pipe bomb was found in a mailbox of billionaire business magnate and political activist George Soros. Later Wednesday, CNN received a pipe bomb at its Time Warner Center headquarters in Manhattan sent to ex-CIA director John Brennan and a suspicious package sent to Rep. Maxine Waters ….
By starting with expert-defined information, our system can return both (1) and (2) by varying the distance threshold, without the need for a risk classification engine, morphosyntactic parsing, or named entity recognition. Further, taxonomy coverage is expanded with word vector encodings trained on the same sources of data - e.g., suspicious package in (2). We believe this approach best combines the efficiencies that tuned Machine Learning (“ML”) systems can offer with the rich depth of experience and insight that analysts bring to the process.
In this paper, we review risk mining research (Section 2) to contextualize the presentation of our system (Section 3) and expert evaluations testing a range of distance thresholds with seed and expanded taxonomies across the cybersecurity, terrorism, and legal/noncompliance risk categories (Section 4). We demonstrate that extractions from single- vs. multi-sentence distance groups outperform the baseline by statistically significant margins. We observe that as coverage of the taxonomy grows, the preference for multi-sentence extractions increases, but the entity-risk relationship becomes more indirect. We discuss these results relative to the user, information recall minimization, and taxonomy management (Section 5). We conclude with further considerations and extensions of the system (Section 6).
2 Related Work
A central focus in risk mining is the ML classification of risk from non-risk and contributing features - e.g., textual association with stock price movement in financial documents (Kogan et al. (2009), Lu et al. (2009a), Groth and Muntermann (2011), Tsai and Wang (2012), and Dasgupta et al. (2016)); sentiment of banking risks Nopp and Hanbury (2015); and risk in news (Lu et al. (2009b) and Nugent et al. (2017)). Heuristic approaches also exist - e.g., a risk taxonomy built with seed patterns used in earnings report alerting Leidner and Schilder (2010) and company risk rankings based on supply chain analysis Carstens et al. (2017). Additional research focuses on the curation and management of risk taxonomies (e.g., ontology merging Subramaniam et al. (2010), crowdsourcing Meng et al. (2015), and paraphrase detection Plachouras et al. (2018)).
Our system is closest to the system proposed in Nugent and Leidner (2016), which starts with a risk taxonomy (following Leidner and Schilder (2010)) to be searched in news data. All possible extracts are paired with all possible companies and, based on an SVM model built from annotated data, tuples are classified and stored for risk analyst review. Our system similarly focuses on entity-risk relations for analyst review, but deviates in two key ways: (a) because of the specificity of the initial taxonomy, extracts are assumed to express some degree of entity-risk relationship, obviating the need for a risk classifier; and (b) extracts are based on “shallow” surface parsing rather than deeper morpho-syntactic parsing. These deviations will be revisited in Section 5.
3 System Description
Our system (Figure 1) is a custom NLP processing pipeline capable of ingesting and analyzing hundreds of thousands of text documents. The system consists of four components:
Document Ingest and Processing: Raw text documents are read from disk then tokenized, lemmatized, and sentencized.
Keyword/Entity Detection: Instances of both keywords and entities are identified in the processed text, and each risk keyword occurrence is matched to the nearest entity token.
Match Filtering and Sentence Retrieval: Matches within the documents are filtered and categorized by pair distance and/or sentence co-occurrence, and the filtered sentences are retrieved for context.
Semantic Encoding and Taxonomy Expansion: A semantic vectorization algorithm is trained on domain-specific text and used to perform automated expansion of the keyword taxonomy.
This design architecture allows for significant customization, high throughput, and modularity for uses in experimental evaluation and deployment in production use-cases. We built the system to support decentralized or streaming architectures, with each document being processed independently and learning systems (specifically at the semantic encoding/expansion steps) configured for continuous learning or batch model training.
3.1 Document Ingest and Processing
We leverage spaCy (Version 2.0.16, https://spacy.io) Honnibal and Johnson (2015) as the document ingest and low-level NLP platform for this system. This choice was influenced by spaCy’s high-speed parsing Choi et al. (2015), out-of-the-box parallel processing, and Python compatibility. spaCy’s default NLP pipeline runs a tokenizer, part-of-speech tagger, dependency parser, and named entity recognizer.
We used the sentence breaks found by the dependency parser to annotate each of the found keyword-entity pairs as being either in the same or different sentences. A dependency-based sentencizer is preferred to a simpler stop-character-based approach due to the unpredictable formatting of certain domains of text - e.g., web-mined news and regulatory filings.
spaCy’s pipe() function allows a text generator object to be provided and takes advantage of multi-core processing to parallelize batching. In this implementation, each processed document piped in by spaCy is converted to its lemmatized form with sentence breaks noted, so that sentence and multi-sentence identification of keyword/entity distances can be captured.
3.2 Keyword/Entity Detection
In the absence of intervening information or a more sophisticated approach to parsing, the mention of an entity and a risk keyword within a phrase or sentence is the most morpho-syntactically, semantically, and pragmatically coherent expression of the relationship. For example, (3) describes the entity Verizon and its litigation risk associated with a lawsuit settlement (the keywords being settle and lawsuit).
(3) In 2011, Verizon agreed to pay $20 million to settle a class-action lawsuit by the federal Equal Employment Opportunity Commission alleging that the company violated the Americans with Disabilities Act by denying reasonable accommodations for hundreds of employees with disabilities.
Returning the entire sentence yields additional information - the lawsuit is class-action and the complaint allegation is that Verizon “denied reasonable accommodations for hundreds of employees with disabilities.” The system detection process begins by testing for matches of each keyword with each entity, for every possible keyword-entity pairing in the document. Algorithm 1 provides the simplified pseudocode for this process.
For every instance of every keyword, the nearest instance of every available entity is paired - regardless of whether it precedes or follows the keyword. Furthermore, an entity may be found to have multiple risk terms associated with it, but each instance of a risk term applies only to the closest entity - helping to prevent overreaching conclusions of risk while maintaining system flexibility.
(4) McDonald says this treatment violated the terms of a settlement the company reached a few years earlier regarding its treatment of employees with disabilities. In 2011, Verizon agreed to pay $20 million to settle a class-action lawsuit by the federal Equal Employment Opportunity Commission ….
For example, (4) extends the extract of (3) to the prior contiguous sentence which contains settlement. This extension provides greater context for Verizon’s lawsuit. (4) is actually background for a larger proposition being made in the document that Verizon is in violation of settlement terms from a previous lawsuit.
The system’s token distance approach promotes efficiency and is preferable to more complex NLP - e.g., chunking or coreference resolution. Nonetheless, this flexibility comes at a computational cost: a total of m · a · n · b comparisons must be made for each document, where m is the number of keyword terms across all taxonomic categories, a the average number of instances of each keyword per document, n the number of entities provided, and b the average number of entity instances per document. Changing any single one of these variables changes the computational load accordingly, and their cumulative effects can quickly add up. For parallelization purposes, each keyword is independent of each other keyword and each entity is independent of each other entity. This means that in an infinitely parallel (theoretical) computational scheme, the system runs in O(a · b) time, which will vary as a function of the risk and text domains.
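The nearest-entity pairing at the core of this component can be sketched as follows (a simplified stand-in for Algorithm 1; function names are ours, and entity positions are assumed to be sorted so that a binary search can replace the brute-force scan):

```python
from bisect import bisect_left

def pair_nearest(keyword_positions, entity_positions):
    """Pair each keyword occurrence with the nearest entity occurrence,
    looking both before and after the keyword. entity_positions must be
    sorted ascending; ties prefer the preceding entity."""
    pairs = []
    for k in keyword_positions:
        i = bisect_left(entity_positions, k)
        # Only the closest entity on each side can be the nearest overall.
        candidates = entity_positions[max(i - 1, 0):i + 1]
        if candidates:
            e = min(candidates, key=lambda p: abs(p - k))
            pairs.append((k, e, abs(k - e)))
    return pairs
```

With sorted entity positions, the per-keyword cost drops from a linear scan over all entity instances to a logarithmic lookup; the comparison count given above describes the brute-force scheme.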
|Risk Category||Keyword Seed Taxonomy||Enriched Keyword Taxonomy|
|Cybersecurity||cybercrime, hack, DDOS, antivirus, data breach, ransomware, penetration, …||4front security, cyber deterence, cloakware, unauthenticated, …|
|Terrorism||terrorism, bio-terrorism, extremist, car bombing, hijack, guerrilla, …||anti-terrorism, bomb maker, explosives, hezbollah, jihadi, nationalist, …|
|Legal/Noncompliance||litigation, indictment, allegation, failure to comply, sanctions violations, …||appropriation, concealment, counter suit, debtor, expropriation, issuer, …|
Table 1: Seed and enriched keyword taxonomies across the three risk categories.
3.3 Match Filtering and Sentence Retrieval
The system has by this point completed the heart of document processing and risk identification. The next component seeks to (a) filter away results unlikely to hold analytic value; and (b) identify the hits as being either single-sentence or multi-sentence using the sentence information generated by the spaCy dependency parse. The first of these goals is achieved with a simple hit distance cutoff wherein any keyword-entity pair with more than a certain count of intervening tokens is discarded. The hard cutoff improves keyword-entity spans by excluding matches that span large stretches of long documents. Once filtering is complete, the system uses the sentence breaks identified by spaCy to determine the sentence membership of each keyword and entity for each pairing and, ultimately, whether they belong to the same sentence.
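This filtering and categorization step might look like the following sketch (names are illustrative; sentence_of stands in for the per-token sentence indices recovered from the dependency parse):

```python
def filter_and_categorize(pairs, sentence_of, cutoff=100):
    """Discard keyword-entity pairs with too many intervening tokens,
    then tag survivors as single- or multi-sentence co-occurrences.

    pairs: iterable of (keyword_pos, entity_pos) token indices
    sentence_of: list mapping token index -> sentence index
    """
    single, multi = [], []
    for k, e in pairs:
        if abs(k - e) > cutoff:
            continue  # hard distance cutoff
        if sentence_of[k] == sentence_of[e]:
            single.append((k, e))
        else:
            multi.append((k, e))
    return single, multi
```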
3.4 Semantic Encoding and Taxonomy Expansion
Our system automates term expansion by using similarity calculations over semantic vectors. We generated these vectors by training a fastText (https://fasttext.cc/) skipgram model, which relies on words and subwords, on the same data source as the initial extractions Bojanowski et al. (2017). This ensures that domain usage of language is well represented, and any rich domain-specific text may be used to train semantic vectors (see generally Mikolov et al. (2013)). We chose fastText for this project because of its light weight, open source availability, efficient single-machine performance, and output of vector models for use in Python.
For each taxonomic risk term encountered, fastText searches the model vocabulary using the normalized dot product (the basic similarity score found in the fastText codebase) and returns the top-scoring vocabulary terms as candidates for taxonomic expansion.
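A pure-Python stand-in for this neighbor lookup, using toy two-dimensional vectors (a real deployment would query the trained fastText model's full vocabulary):

```python
from math import sqrt

def cosine(u, v):
    """Normalized dot product between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

def expand_term(seed, vectors, top_k=10):
    """Return the top_k vocabulary terms most similar to a seed term."""
    scored = [(cosine(vectors[seed], vec), term)
              for term, vec in vectors.items() if term != seed]
    return [term for _, term in sorted(scored, reverse=True)[:top_k]]

# Toy vocabulary; real vectors come from the trained skipgram model.
toy_vectors = {"lawsuit": (1.0, 0.0), "litigation": (0.9, 0.1), "bomb": (0.0, 1.0)}
```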
4 Evaluation
To test the performance of the system, we designed an experiment comparing systems using single-sentence risk detection against systems using only multi-sentence risks. This measures the performance of purely multi-sentence hits against purely single-sentence hits. In addition, a baseline system was tested that detected only risk terms without searching for corresponding entities. Taken together, three hypotheses are tested: H1 and H2 test whether each method of detecting risk co-occurrence with an entity performs better than the random chance of selecting a risk term in the document and associating it with the entity; H3 tests whether the distance-based measure in the system outperforms a sentential approach.
4.1 Data Processing
A virtualized Ubuntu 14.04 machine with 8 vCPUs - running on 2.30GHz Intel Xeon E5-2670 processors and 64GB RAM - was chosen to support the first experiment.
The names of the top Fortune 100 companies from 2017 (http://fortune.com/fortune500/2017/) were fed as input into a proprietary news retrieval system for the most recent 1000 articles mentioning each company. Ignoring low coverage and bodiless news articles, 99,424 individual documents were returned. Each article was then fed into the system and risk detections found with a distance cutoff of 100 tokens. For each identified risk, whether single or multi-sentence, the system also selected a baseline sentence at random from the corresponding document for pairwise comparison.
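The per-document baseline selection can be sketched as follows (illustrative; the optional seed parameter is included only to make the sketch reproducible):

```python
import random

def baseline_sentence(sentences, seed=None):
    """Select one sentence uniformly at random from a document's
    sentences to serve as the baseline comparison span."""
    rng = random.Random(seed)
    return rng.choice(sentences)
```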
The spaCy dependency parse was the largest bottleneck, with an expected total runtime of approximately seven calendar days for the nearly 100,000 documents. In the interest of runtime, only the first 21,000 documents, read in order of machine-generated news article ID, were analyzed. Once all selected documents were processed, we paired single- and multi-sentence spans relating to the same risk category, but potentially different entities and documents, for pairwise evaluations.
4.2 Term Expansion
As summarized in Table 1, starting with manually created seed terms in each category of risk, encodings were learned using fastText from a concatenation of the news article text, following the methodology discussed in Section 3.4. Selecting the ten most similar terms for each in-vocabulary seed term resulted in an expanded taxonomy, with a 326.31% increase on average across the three categories. This term expansion introduced not only new vocabulary to the taxonomy but also variants and common misspellings of keywords, which are important in catching risk terms “in the wild”. Some cleanup of the term expansion was required to filter out punctuation and tokenization variants.
Analysts were asked to give their preference for “System A” or “System B” or “Neither” when presented with randomized pairs of output. Percentage preferences for the overarching system and each of six pairings were tested for significance with Pearson’s χ² using raw counts.
|Comparison||p-value|
|single v. multi (seed v. expand)||p=0.003|
|single v. baseline (seed)||p=1.159e-07|
|multi v. baseline (seed)||p=4.762e-07|
|single v. multi (seed)||p=7.99e-10|
|single v. baseline (expand)||p=0.008|
|multi v. baseline (expand)||p=3.978e-07|
|single v. multi (expand)||p=0.011|
Table 2: Pearson’s χ² p-values for pairwise preference comparisons.
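The significance test on raw preference counts can be sketched with the standard library alone for a two-way preference split (illustrative counts; “Neither” judgments are excluded, and a statistics library would normally supply the p-value):

```python
from math import erfc, sqrt

def chi_square_pref(count_a, count_b):
    """Pearson chi-squared goodness-of-fit against a 50/50 null for two
    preference counts ("Neither" judgments excluded), with df = 1."""
    total = count_a + count_b
    expected = total / 2
    stat = sum((obs - expected) ** 2 / expected for obs in (count_a, count_b))
    # For df = 1, the chi-squared survival function is erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value
```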
5 Results and Discussion
We collected 4,514 judgments from eight subject matter experts to compute system preferences associated with single-, multi-, and baseline-sentence extractions. Roughly 28% of all evaluated extractions (1,266/4,514) received a preference judgment (32% from the seed set (698/2,198) and 24% from the expansion set (568/2,316)); the remaining 72% received a “Neither” rating.
As summarized in Table 2 and Figure 2, all single- and multi-sentence extractions across the seed and expansion sets outperform the baseline by statistically significant margins. For the seed set, the single-sentence extractions also outperform the multi-sentence extractions by a statistically significant margin (p < .01). However, for the expansion set, the multi-sentence extractions gain significant ground (a 26% to 38% increase in preference).
1,283 (28%) of the evaluations were doubly annotated for calculation of Cohen’s kappa Cohen (1960). The average κ was 0.284 for the seed set and 0.144 for the expansion set (which suffers from low sample size). Agreement is uniformly low across all categories, but this is not an unusual result given the task and the range of analyst expertise.
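Cohen’s κ for a doubly annotated subset can be computed with a small standard-library function (the labels shown are illustrative):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' parallel label sequences."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```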
Our system’s distance metric is clearly providing benefit well above the baseline, so we can accept H1 and H2, but given the exactness of the seed taxonomy, the high proportion of “Neither” judgments is surprising. While there are polysemous examples (non-risk-category uses of risk keywords), which are minimized in more complex systems, there are observable variations in the directness or indirectness of the entity-risk relationships. For example, The companies could face a number of lawsuits from Walmart, Target and Kroger describes an entity-risk relationship where the entity is suing rather than being sued. The degree of risk varied by analyst and will have to be controlled for in subsequent evaluations and possibly in components of the system.
We cannot accept H3, which tests the assumption that multi-sentence returns will have greater analytical utility. However, we observed that as the taxonomy expands, the preference for multi-sentence extractions increases by 46% over the single-sentence seed set extractions. (Also note that despite an overall 326% increase in coverage, the proportion of “Neither” judgments in a randomly selected set of 100 sentences increases from 52% to 75%.) A potential reason for this is that the keywords in the expanded taxonomy exhibit a greater range of specificity compared to the seed terms. For example, the expanded taxonomy included require and suit, which are more general and given to ambiguity in the legal/noncompliance category (the expanded taxonomy also included highly specific keywords - e.g., foreclosure and harassment).
To the extent that more general keywords proliferate in the taxonomy, multi-sentence information may be preferred for understanding the entity-risk relationship. This additional information may be welcome in some use cases, but a concomitant increase in non-risk relationships would certainly not be. Accommodating shifts in semantic granularity as the taxonomy expands will likely have to be considered, in addition to distance thresholding, to tune results.
6 Conclusion and Future Work
We have described a configurable, scalable system for finding entity-event relations across large data sets. Addressing the observed drawbacks should be done while maintaining flexibility for analyst users - possibilities include summarization of results for better presentation, alternative source data at the direction of the analyst for given risk categories, token distance thresholding, and considerations of semantic granularity.
While designed for high-coverage use cases in risk mining, our system could be used for any data, entities and taxonomies to support generalized entity-x relationship reporting. Further, while the current system seeks to be low complexity (shallow parsing and no risk classifier), the system’s modular design can facilitate any number of additions for customized use cases and integrations, and performance comparisons to similar systems.
Acknowledgments
Thank you to the anonymous reviewers from the NAACL Industry Track as well as Matt Machado, Nathan Maynes, and Scott McFadden for support and feedback. Thank you also to extraction output evaluators Alli Zube, Jason Sciarotta, Jeff Earnest, Jonathan Feng, Larry Fowlkes, Regina Reese, Romel Lira, and Sean Ahearn.
References
- Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
- Carstens et al. (2017) Lucas Carstens, Jochen L. Leidner, Krzysztof Szymanski, and Blake Howald. 2017. Modeling company risk and importance in supply graphs. In The Semantic Web - 14th International Conference, ESWC 2017, Portorož, Slovenia, May 28 - June 1, 2017, Proceedings, Part II, pages 18–32.
- Choi et al. (2015) Jinho D. Choi, Joel Tetreault, and Amanda Stent. 2015. It depends: Dependency parser comparison using a web-based evaluation tool. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 387–396.
- Cohen (1960) Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.
- Dasgupta et al. (2016) Tirthankar Dasgupta, Lipika Dey, Prasenjit Dey, and Rupsa Saha. 2016. A framework for mining enterprise risk and risk factors from text. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, pages 180–184.
- Groth and Muntermann (2011) Sven S. Groth and Jan Muntermann. 2011. An intraday market risk management approach based on textual analysis. Decision Support Systems, 50(4):680–691.
- Honnibal and Johnson (2015) Matthew Honnibal and Mark Johnson. 2015. An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1373–1378, Lisbon, Portugal. Association for Computational Linguistics.
- Kogan et al. (2009) Shimon Kogan, Dimitry Levin, Bryan R. Routledge, Jacob S. Sagi, and Noah A. Smith. 2009. Predicting risk from financial reports with regression. In Proceedings of the 2009 Annual Conference of the North American Chapter of the ACL (NAACL-HLT), pages 272–280.
- Leidner and Schilder (2010) Jochen L. Leidner and Frank Schilder. 2010. Hunting for the black swan: Risk mining from text. In Proceedings of the ACL 2010 System Demonstrations, pages 54–59.
- Lu et al. (2009a) Hsin-Min Lu, Nina Wan-Hsin Huang, Shu-Hsing Li, and Tsai-Jyh Chen. 2009a. Risk statement recognition in news articles. In Proceedings of the 2009 Annual Conference of the International Conference on Information Systems, pages 54–59.
- Lu et al. (2009b) Hsin-Min Lu, Nina WanHsin Huang, Zhu Zhang, and Tsai-Jyh Chen. 2009b. Identifying firm-specific risk statements in news articles. In Intelligence and Security Informatics, pages 42–53. Springer.
- Meng et al. (2015) Rui Meng, Yongxin Tong, Lei Chen, and Caleb Chen Cao. 2015. Crowdtc: Crowdsourced taxonomy construction. In 2015 IEEE International Conference on Data Mining, pages 913–918.
- Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- Nopp and Hanbury (2015) Clemens Nopp and Allan Hanbury. 2015. Detecting risks in the banking system by sentiment analysis. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 591–600.
- Nugent and Leidner (2016) Timothy Nugent and Jochen L. Leidner. 2016. Risk mining: Company-risk identification from unstructured sources. In IEEE International Conference on Data Mining Workshops, ICDM Workshops 2016, December 12-15, 2016, Barcelona, Spain., pages 1308–1311.
- Nugent et al. (2017) Timothy Nugent, Fabio Petroni, Natraj Raman, Lucas Carstens, and Jochen L. Leidner. 2017. A comparison of classification models for natural disaster and critical event detection from news. In 2017 IEEE International Conference on Big Data (Big Data), pages 3750–3759.
- Plachouras et al. (2018) Vassilis Plachouras, Fabio Petroni, Timothy Nugent, and Jochen L. Leidner. 2018. A comparison of two paraphrase models for taxonomy augmentation. In Proceedings of the 2018 Annual Conference of the North American Chapter of the ACL (NAACL-HLT), pages 315–320.
- Subramaniam et al. (2010) L. Venkata Subramaniam, Amit Anil Nanavati, and Sougata Mukherjea. 2010. Enriching one taxonomy using another. IEEE Transactions on Knowledge and Data Engineering, pages 913–918.
- Tsai and Wang (2012) Ming-Feng Tsai and Chuan-Ju Wang. 2012. Visualization on financial terms via risk ranking from financial reports. In Proceedings of COLING 2012, pages 447–452.