Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus

10/27/2020
by   Isaac Caswell, et al.
0

Large text corpora are increasingly important for a wide variety of Natural Language Processing (NLP) tasks, and automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context. LangID is largely treated as solved in the literature, with models reported that achieve over 90 models on up to 1,629 languages with comparable quality on held-out test sets, but find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around 5 suggesting a need for more robust evaluation. Further analysis revealed a variety of error modes, arising from domain mismatch, class imbalance, language similarity, and insufficiently expressive models. We propose two classes of techniques to mitigate these errors: wordlist-based tunable-precision filters (for which we release curated lists in about 500 languages) and transformer-based semi-supervised LangID models, which increase median dataset precision from 5.5 data set covering 100K or more relatively clean sentences in each of 500+ languages, paving the way towards a 1,000-language web text corpus.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
03/22/2021

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

With the success of large-scale pre-training and multilingual modeling i...
research
04/18/2021

Documenting the English Colossal Clean Crawled Corpus

As language models are trained on ever more text, researchers are turnin...
research
11/11/2021

SynthBio: A Case Study in Human-AI Collaborative Curation of Text Datasets

NLP researchers need more, higher-quality text datasets. Human-labeled d...
research
11/03/2020

Semi-Supervised Cleansing of Web Argument Corpora

Debate portals and similar web platforms constitute one of the main text...
research
01/17/2022

Towards a Cleaner Document-Oriented Multilingual Crawled Corpus

The need for raw large raw corpora has dramatically increased in recent ...
research
06/30/2022

esCorpius: A Massive Spanish Crawling Corpus

In the recent years, transformer-based models have lead to significant a...
research
08/05/2016

Boundary-based MWE segmentation with text partitioning

This work presents a fine-grained, text-chunking algorithm designed for ...

Please sign up or login with your details

Forgot password? Click here to reset