Semi-Supervised Cleansing of Web Argument Corpora

11/03/2020
by Jonas Dorsch, et al.

Debate portals and similar web platforms constitute one of the main text sources in computational argumentation research and its applications. While the corpora built upon these sources are rich in argumentatively relevant content and structure, they also include text that is irrelevant, or even detrimental, to their purpose. In this paper, we present a precision-oriented approach to detecting such irrelevant text in a semi-supervised way. Given a few seed examples, the approach automatically learns basic lexical patterns of relevance and irrelevance and then incrementally bootstraps new patterns from the sentences that match the patterns learned so far. In the existing args.me corpus with 400k argumentative texts, our approach detects almost 87k irrelevant sentences, at a precision of 0.97 according to manual evaluation. With low effort, the approach can be adapted to other web argument corpora, providing a generic way to improve corpus quality.
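
The bootstrapping loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the pattern type or the thresholds, so the word n-gram patterns, the `min_support` and `min_precision` parameters, and all function names below are assumptions chosen for illustration. For brevity, the sketch only grows the set of irrelevant sentences and keeps the relevant seeds fixed.

```python
from collections import Counter


def ngrams(sentence, n_min=2, n_max=3):
    """Return the set of lowercase word n-grams (n_min..n_max) of a sentence."""
    tokens = sentence.lower().split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    }


def learn_patterns(irrelevant, relevant, min_support=2, min_precision=0.95):
    """Keep n-grams that occur (almost) only in irrelevant sentences.

    Both thresholds are hypothetical; they stand in for whatever
    precision-oriented filter a real system would use.
    """
    irr_counts = Counter(g for s in irrelevant for g in ngrams(s))
    rel_counts = Counter(g for s in relevant for g in ngrams(s))
    patterns = set()
    for gram, irr in irr_counts.items():
        precision = irr / (irr + rel_counts[gram])
        if irr >= min_support and precision >= min_precision:
            patterns.add(gram)
    return patterns


def bootstrap(seed_irrelevant, seed_relevant, unlabeled, max_rounds=10):
    """Alternate between learning patterns and labeling matching sentences."""
    irrelevant = set(seed_irrelevant)
    pool = set(unlabeled) - irrelevant
    patterns = set()
    for _ in range(max_rounds):
        # Learn patterns from everything labeled irrelevant so far.
        patterns = learn_patterns(irrelevant, seed_relevant)
        # Sentences matching any pattern become new irrelevant examples.
        matched = {s for s in pool if ngrams(s) & patterns}
        if not matched:
            break
        irrelevant |= matched
        pool -= matched
    return patterns, irrelevant - set(seed_irrelevant)


# Toy usage with invented sentences (not taken from args.me).
patterns, detected = bootstrap(
    seed_irrelevant=["thank you for reading", "thank you for the debate"],
    seed_relevant=["taxes fund public schools"],
    unlabeled=[
        "thank you and vote for pro",
        "i thank you so vote for pro",
        "vote for pro",
        "taxes fund public roads",
    ],
)
```

In this toy run, the seed pattern "thank you" first pulls in two closing remarks; these in turn supply enough support for "vote for pro" to become a pattern, which then matches a sentence the seeds alone would have missed. The sentence about taxes matches no pattern and stays unlabeled.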

research
06/04/2018

Neural Adversarial Training for Semi-supervised Japanese Predicate-argument Structure Analysis

Japanese predicate-argument structure (PAS) analysis involves zero anaph...
research
10/27/2020

Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus

Large text corpora are increasingly important for a wide variety of Natu...
research
06/18/2022

Automatic Summarization of Russian Texts: Comparison of Extractive and Abstractive Methods

The development of large and super-large language models, such as GPT-3,...
research
06/18/2022

Argumentative Text Generation in Economic Domain

The development of large and super-large language models, such as GPT-3,...
research
11/30/2019

Automatic Creation of Text Corpora for Low-Resource Languages from the Internet: The Case of Swiss German

This paper presents SwissCrawl, the largest Swiss German text corpus to ...
research
05/22/2020

The Discussion Tracker Corpus of Collaborative Argumentation

Although Natural Language Processing (NLP) research on argument mining h...
research
09/04/2020

ViS-Á-ViS : Detecting Similar Patterns in Annotated Literary Text

We present a web-based system called ViS-Á-ViS aiming to assist literary...
