A Multilingual Information Extraction Pipeline for Investigative Journalism

09/01/2018
by Gregor Wiedemann, et al.

We introduce an advanced information extraction pipeline that automatically processes very large collections of unstructured textual data for investigative journalism. The pipeline serves as a new input processor for the upcoming major release of our New/s/leak 2.0 software, which we develop in cooperation with a large German news organization. In the target use case, journalists receive a large collection of files, up to several gigabytes in size, with unknown contents. Collections may originate either from official disclosures of documents, e.g. Freedom of Information Act requests, or from unofficial data leaks. Our software prepares a visually aided exploration of the collection so that journalists can quickly learn about potential stories contained in the data. It is based on the automatic extraction of entities and their co-occurrence in documents. In contrast to comparable projects, we focus on three major requirements that particularly serve investigative journalism in cross-border collaborations: 1) composition of multiple state-of-the-art NLP tools for entity extraction, 2) support for multilingual document sets covering up to 40 languages, and 3) fast and easy-to-use extraction of full text, metadata and entities from various file formats.
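To illustrate the idea of entity co-occurrence mentioned in the abstract, here is a minimal Python sketch. It is not the New/s/leak implementation: the function name `cooccurrence_counts`, the input format (one list of already-extracted entity strings per document) and the sample entities are all assumptions made for illustration. It counts, for each unordered pair of entities, the number of documents in which both appear.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(doc_entities):
    """Count the documents in which each unordered entity pair co-occurs.

    doc_entities: iterable of lists, one list of entity strings per document
    (hypothetical input format; entity extraction itself is out of scope here).
    """
    counts = Counter()
    for entities in doc_entities:
        # Deduplicate within a document so repeated mentions count once,
        # and sort so each pair has a single canonical ordering.
        unique = sorted(set(entities))
        for pair in combinations(unique, 2):
            counts[pair] += 1
    return counts


# Made-up example data: three documents with pre-extracted entities.
docs = [
    ["Alice Example", "Acme Corp", "Berlin"],
    ["Alice Example", "Acme Corp"],
    ["Berlin", "Acme Corp", "Acme Corp"],
]
counts = cooccurrence_counts(docs)
for pair, n in counts.most_common():
    print(pair, n)
```

Pairs with high counts are candidates for edges in an entity network that a visual exploration tool could display. A real pipeline would first have to obtain the entity lists from NLP tools and would likely weight or filter pairs rather than use raw counts.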

