Automatic Creation of Text Corpora for Low-Resource Languages from the Internet: The Case of Swiss German

11/30/2019
by Lucy Linder, et al.

This paper presents SwissCrawl, the largest Swiss German text corpus to date. Composed of more than half a million sentences, it was generated using a customized web scraping tool that could be applied to other low-resource languages as well. The approach demonstrates how freely available web pages can be used to construct comprehensive text corpora, which are of fundamental importance for natural language processing. In an experimental evaluation, we show that using the new corpus leads to significant improvements for the task of language modeling. To capture new content, the scraping tool will run continuously, so the corpus keeps growing over time.

research
07/13/2020

GGPONC: A Corpus of German Medical Text with Rich Metadata Based on Clinical Practice Guidelines

The lack of publicly available text corpora is a major obstacle for prog...
research
07/15/2020

Sinhala Language Corpora and Stopwords from a Decade of Sri Lankan Facebook

This paper presents two colloquial Sinhala language corpora from the lan...
research
10/27/2022

Creating a morphological and syntactic tagged corpus for the Uzbek language

Nowadays, creation of the tagged corpora is becoming one of the most imp...
research
05/22/2023

The Grammar and Syntax Based Corpus Analysis Tool For The Ukrainian Language

This paper provides an overview of a text mining tool the StyloMetrix de...
research
04/03/2023

Approaches to Corpus Creation for Low-Resource Language Technology: the Case of Southern Kurdish and Laki

One of the major challenges that under-represented and endangered langua...
research
11/03/2020

Semi-Supervised Cleansing of Web Argument Corpora

Debate portals and similar web platforms constitute one of the main text...
