Know thy corpus! Robust methods for digital curation of Web corpora

03/13/2020
by Serge Sharoff, et al.

This paper proposes a novel framework for digital curation of Web corpora that provides robust estimates of their parameters, such as their composition and lexicon. In recent years, language models pre-trained on large corpora have emerged as clear winners in numerous NLP tasks, but the corpora behind their success have not been properly analysed. The paper presents a procedure for robust frequency estimation, which helps to establish the core lexicon of a given corpus, and a procedure for estimating corpus composition via unsupervised topic models and supervised genre classification of Web pages. Applying this digital curation study to several Web-derived corpora reveals considerable differences between them. First, the corpora differ in their frequency bursts, which affect the core lexicon obtained from each corpus. Second, they differ in the kinds of texts they contain: for example, OpenWebText contains considerably more topical news and political argumentation than ukWac or Wikipedia. The tools and the results of the analysis have been released.
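
To make the idea of a frequency burst concrete, here is a minimal sketch of how a single document can inflate a word's corpus frequency, and how capping per-document counts dampens that effect. It is an illustration under stated assumptions, not the procedure from the paper: the function name `robust_frequencies`, the parameter `k`, and the cap of median plus `k` times the median absolute deviation are all hypothetical choices made for this example.

```python
from collections import Counter
from statistics import median


def robust_frequencies(docs, k=3.0):
    """Return (naive, robust) relative frequencies for every word.

    docs: list of documents, each a list of tokens.
    k: hypothetical tuning constant; larger values clip less.
    This is a simple winsorizing heuristic assumed for illustration,
    not the estimator used in the paper.
    """
    per_doc = [Counter(doc) for doc in docs]
    total_tokens = sum(len(doc) for doc in docs)
    vocab = set().union(*per_doc)

    naive, robust = {}, {}
    for word in vocab:
        # A Counter returns 0 for words missing from a document.
        counts = [c[word] for c in per_doc]
        med = median(counts)
        mad = median(abs(x - med) for x in counts)
        # Cap each document's contribution; max(mad, 1) keeps the cap
        # above the median when most documents have identical counts.
        cap = med + k * max(mad, 1)
        clipped = [min(x, cap) for x in counts]
        naive[word] = sum(counts) / total_tokens
        robust[word] = sum(clipped) / total_tokens
    return naive, robust


if __name__ == "__main__":
    # One document repeats "brexit" 50 times; the others never use it.
    docs = [["the"] * 5 + ["brexit"] * 50]
    docs += [["the"] * 5 + ["cat"] * 5 for _ in range(4)]
    naive, robust = robust_frequencies(docs)
    print("top word (naive): ", max(naive, key=naive.get))
    print("top word (robust):", max(robust, key=robust.get))
```

In this toy corpus the bursty word dominates the naive ranking because of a single document, while the capped estimate ranks the evenly distributed word first. A core lexicon built from the capped estimate would therefore be less sensitive to a few unusual pages.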

research
06/09/2022

Corpus Similarity Measures Remain Robust Across Diverse Languages

This paper experiments with frequency-based corpus similarity measures a...
research
11/06/2020

Corpora Compared: The Case of the Swedish Gigaword Wikipedia Corpora

In this work, we show that the difference in performance of embeddings f...
research
06/30/2022

esCorpius: A Massive Spanish Crawling Corpus

In the recent years, transformer-based models have lead to significant a...
research
09/11/2023

D2WFP: A Novel Protocol for Forensically Identifying, Extracting, and Analysing Deep and Dark Web Browsing Activities

The use of the un-indexed web, commonly known as the deep web and dark w...
research
04/03/2021

Representations of Language Varieties Are Reliable Given Corpus Similarity Measures

This paper measures similarity both within and between 84 language varie...
research
10/28/2020

Character Entropy in Modern and Historical Texts: Comparison Metrics for an Undeciphered Manuscript

This paper outlines the creation of three corpora for multilingual compa...
research
11/09/2020

An Analysis of Dataset Overlap on Winograd-Style Tasks

The Winograd Schema Challenge (WSC) and variants inspired by it have bec...
