The WiLI benchmark dataset for written language identification

01/23/2018
by Martin Thoma, et al.

This paper describes the WiLI-2018 benchmark dataset for monolingual written natural language identification. WiLI-2018 is a publicly available, free-of-charge dataset of short text extracts from Wikipedia. It contains 1000 paragraphs for each of 235 languages, totaling 235,000 paragraphs. WiLI is a classification dataset: given an unknown paragraph written in one dominant language, the task is to decide which language it is.
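To make the classification task concrete, here is a minimal sketch of a character-trigram baseline. It is not the method evaluated in the paper. The two-language toy samples, the labels `eng` and `deu`, and the function names `char_ngrams`, `train`, and `predict` are all assumptions chosen for illustration; a real experiment would load the WiLI-2018 paragraphs and labels instead.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams of a lowercased text padded with one space per side."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def train(samples):
    """Build one n-gram count profile per label from (text, label) pairs."""
    profiles = {}
    for text, label in samples:
        profiles.setdefault(label, Counter()).update(char_ngrams(text))
    return profiles


def predict(profiles, paragraph):
    """Return the label whose profile overlaps most with the paragraph's n-grams."""
    grams = char_ngrams(paragraph)

    def overlap(profile):
        # Normalize by profile size so larger training sets are not favored.
        total = sum(profile.values())
        return sum(count * profile[g] for g, count in grams.items()) / total

    return max(profiles, key=lambda label: overlap(profiles[label]))


# Toy training data (hypothetical; stands in for the real dataset).
samples = [
    ("The quick brown fox jumps over the lazy dog.", "eng"),
    ("This is a short paragraph written in English.", "eng"),
    ("Der schnelle braune Fuchs springt über den faulen Hund.", "deu"),
    ("Dies ist ein kurzer Absatz, der auf Deutsch geschrieben ist.", "deu"),
]
profiles = train(samples)

print(predict(profiles, "The dog is in the garden"))
print(predict(profiles, "Der Hund ist schnell"))
```

With 235 classes and 1000 paragraphs per class, the same structure applies unchanged; only `samples` grows, and a held-out split would be needed to measure accuracy.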


