Quootstrap: Scalable Unsupervised Extraction of Quotation-Speaker Pairs from Large News Corpora via Bootstrapping

04/07/2018
by Dario Pavllo, et al.

We propose Quootstrap, a method for extracting quotations, as well as the names of the speakers who uttered them, from large news corpora. Whereas prior work has addressed this problem primarily with supervised machine learning, our approach follows a fully unsupervised bootstrapping paradigm. It leverages the redundancy present in large news corpora, more precisely, the fact that the same quotation often appears across multiple news articles in slightly different contexts. Starting from a few seed patterns, such as ["Q", said S.], our method extracts a set of quotation-speaker pairs (Q, S), which are in turn used for discovering new patterns expressing the same quotations; the process is then repeated with the larger pattern set. Our algorithm is highly scalable, which we demonstrate by running it on the large ICWSM 2011 Spinn3r corpus. Validating our results against a crowdsourced ground truth, we obtain 90% precision at 40% recall, with higher recall values for more frequently reported (and thus likely more interesting) quotations. Finally, we showcase the usefulness of our algorithm's output for computational social science by analyzing the sentiment expressed in our extracted quotations.
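
The bootstrapping loop described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how patterns are represented, generalized, or filtered, so everything here is an assumption made for illustration, including the `$Q`/`$S` placeholder syntax, the capitalized-words regex for speaker names, the use of whole sentences as induced patterns, and the function names `compile_pattern` and `bootstrap`.

```python
import re

# Assumed placeholder syntax: $Q marks the quotation text, $S the speaker name.
QUOTE_GROUP = r'(?P<Q>[^"]+)'
# Assumption: a speaker is two or more capitalized words (e.g. "Jane Doe").
SPEAKER_GROUP = r'(?P<S>[A-Z][a-z]+(?: [A-Z][a-z]+)+)'


def compile_pattern(template):
    """Turn a template such as '"$Q", said $S.' into a compiled regex."""
    parts = re.split(r'(\$Q|\$S)', template)
    regex = ''.join(
        QUOTE_GROUP if p == '$Q' else SPEAKER_GROUP if p == '$S' else re.escape(p)
        for p in parts
    )
    return re.compile(regex)


def bootstrap(corpus, seed_templates, max_iters=5):
    """Alternate between extracting (Q, S) pairs and inducing new templates."""
    templates = set(seed_templates)
    pairs = set()
    for _ in range(max_iters):
        # Step 1: apply every known pattern to every sentence.
        regexes = [compile_pattern(t) for t in templates]
        new_pairs = {
            (m.group('Q'), m.group('S'))
            for sent in corpus
            for rx in regexes
            for m in [rx.search(sent)]
            if m
        }
        # Step 2: any sentence that mentions a known pair yields a new template.
        # (Toy simplification: the whole sentence becomes the template.)
        new_templates = {
            sent.replace(q, '$Q').replace(s, '$S')
            for (q, s) in new_pairs
            for sent in corpus
            if q in sent and s in sent
        }
        # Stop once an iteration discovers nothing new.
        if new_pairs <= pairs and new_templates <= templates:
            break
        pairs |= new_pairs
        templates |= new_templates
    return pairs, templates


# Hypothetical three-sentence corpus; the first two repeat the same quotation.
CORPUS = [
    '"We will win", said Jane Doe.',
    'Jane Doe told reporters: "We will win".',
    'John Smith told reporters: "Taxes are too high".',
]

pairs, templates = bootstrap(CORPUS, ['"$Q", said $S.'])
```

In this toy run, the seed pattern finds the Jane Doe quotation in the first sentence; the second sentence repeats that quotation in a different context and so yields a new pattern, which in turn captures the John Smith quotation that the seed alone would miss. A realistic system would also need to generalize patterns and discard unreliable ones, which this sketch does not attempt.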

research
01/10/2020

Linking Social Media Posts to News with Siamese Transformers

Many computational social science projects examine online discourse surr...
research
05/13/2021

SaRoCo: Detecting Satire in a Novel Romanian Corpus of News Articles

In this work, we introduce a corpus for satire detection in Romanian new...
research
03/07/2021

RevDet: Robust and Memory Efficient Event Detection and Tracking in Large News Feeds

With the ever-growing volume of online news feeds, event-based organizat...
research
12/31/2015

Event Specific Multimodal Pattern Mining with Image-Caption Pairs

In this paper we describe a novel framework and algorithms for discoveri...
research
09/25/2020

PerKey: A Persian News Corpus for Keyphrase Extraction and Generation

Keyphrases provide an extremely dense summary of a text. Such informatio...
research
07/07/2022

Quote Erat Demonstrandum: A Web Interface for Exploring the Quotebank Corpus

The use of attributed quotes is the most direct and least filtered pathw...
research
12/01/2021

STEM: Unsupervised STructural EMbedding for Stance Detection

Stance detection is an important task, supporting many downstream tasks ...
