What's in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus

05/06/2021
by Alexandra Sasha Luccioni, et al.

Whereas much of the success of the current generation of neural language models has been driven by increasingly large training corpora, relatively little research has been dedicated to analyzing these massive sources of textual data. In this exploratory analysis, we delve deeper into the Common Crawl, a colossal web corpus that is extensively used for training language models. We find that it contains a significant amount of undesirable content, including hate speech and sexually explicit content, even after filtering procedures. We conclude with a discussion of the potential impacts of this content on language models and call for a more mindful approach to corpus collection and analysis.

research
03/06/2022

Leashing the Inner Demons: Self-Detoxification for Language Models

Language models (LMs) can reproduce (or amplify) toxic language seen dur...
research
03/02/2019

Reliable Access to Massive Restricted Texts: Experience-based Evaluation

Libraries are seeing growing numbers of digitized textual corpora that f...
research
12/20/2022

Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data

As demand for large corpora increases with the size of current state-of-...
research
01/25/2022

Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection

Language models increasingly rely on massive web dumps for diverse text ...
research
03/04/2023

Could a Large Language Model be Conscious?

There has recently been widespread discussion of whether large language ...
research
08/03/2022

A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis

In human speech, the attitude of a speaker cannot be fully expressed onl...
research
03/05/2023

FQP 2.0: Industry Trend Analysis via Hierarchical Financial Data

Analyzing trends across industries is critical to maintaining a healthy ...
