The WiLI benchmark dataset for written language identification

01/23/2018
by   Martin Thoma, et al.

This paper describes the WiLI-2018 benchmark dataset for monolingual written natural language identification. WiLI-2018 is a publicly available, free-of-charge dataset of short text extracts from Wikipedia. It contains 1,000 paragraphs for each of 235 languages, totaling 235,000 paragraphs. WiLI is a classification dataset: given an unknown paragraph written in one dominant language, the task is to decide which language it is written in.
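
To make the classification task concrete, here is a minimal sketch of a language identifier that takes a paragraph and returns one label. It is not the paper's method. It assumes a simple character-bigram profile per language compared by cosine similarity. The training sentences and the labels `eng`, `deu`, and `fra` are made-up placeholders, not data from WiLI-2018.

```python
from collections import Counter
import math


def char_ngrams(text, n=2):
    """Count overlapping character n-grams in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples):
    """Build one n-gram profile per language from (paragraph, label) pairs."""
    profiles = {}
    for paragraph, label in samples:
        profiles.setdefault(label, Counter()).update(char_ngrams(paragraph))
    return profiles


def predict(profiles, paragraph):
    """Return the label whose profile is most similar to the paragraph."""
    grams = char_ngrams(paragraph)
    return max(profiles, key=lambda label: cosine(profiles[label], grams))


# Hypothetical toy training data (NOT taken from WiLI-2018).
TOY_SAMPLES = [
    ("the quick brown fox jumps over the lazy dog and runs through the field", "eng"),
    ("der schnelle braune fuchs springt über den faulen hund und läuft durch das feld", "deu"),
    ("le renard brun rapide saute par-dessus le chien paresseux et court dans le champ", "fra"),
]

profiles = train(TOY_SAMPLES)
print(predict(profiles, "the dog runs over the field"))
```

With the real dataset, the toy pairs would be replaced by the benchmark's paragraphs and language labels. A profile-based classifier like this is only a rough baseline for comparison.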

08/06/2019

Predicting Prosodic Prominence from Text with Pre-trained Contextualized Word Representations

In this paper we introduce a new natural language processing dataset and...
07/02/2020

Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset

This paper describes the Dakshina dataset, a new resource consisting of ...
01/29/2017

The HASYv2 dataset

This paper describes the HASYv2 dataset. HASY is a publicly available, f...
05/21/2019

MultiWiki: Interlingual Text Passage Alignment in Wikipedia

In this article we address the problem of text passage alignment across ...
03/14/2019

Complexity-entropy analysis at different levels of organization in written language

Written language is complex. A written text can be considered an attempt...
09/22/2020

Investigating Machine Learning Methods for Language and Dialect Identification of Cuneiform Texts

Identification of the languages written using cuneiform symbols is a dif...
06/06/2018

NumtaDB - Assembled Bengali Handwritten Digits

To benchmark Bengali digit recognition algorithms, a large publicly avai...