StoryDB: Broad Multi-language Narrative Dataset

09/29/2021
by Alexey Tikhonov, et al.
This paper presents StoryDB, a broad multi-language dataset of narratives. StoryDB is a corpus of texts that includes stories in 42 different languages. Every language includes 500+ stories, and some languages include more than 20,000. Every story is indexed across languages and labeled with tags such as genre or topic. The corpus shows rich topical and language variation and can serve as a resource for studying the role of narrative in natural language processing across various languages, including low-resource ones. We also demonstrate how the dataset can be used to benchmark three modern multilingual models, namely mDistillBERT, mBERT, and XLM-RoBERTa.

research
03/15/2017

InScript: Narrative texts annotated with script information

This paper presents the InScript corpus (Narrative Texts Instantiating S...
research
03/13/2020

LSCP: Enhanced Large Scale Colloquial Persian Language Understanding

Language recognition has been significantly advanced in recent years by ...
research
05/23/2023

LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages

Knowing the language of an input text/audio is a necessary first step fo...
research
11/04/2022

CLSE: Corpus of Linguistically Significant Entities

One of the biggest challenges of natural language generation (NLG) is th...
research
10/06/2015

Language Segmentation

Language segmentation consists in finding the boundaries where one langu...
research
03/28/2023

Carolina: a General Corpus of Contemporary Brazilian Portuguese with Provenance, Typology and Versioning Information

This paper presents the first publicly available version of the Carolina...
research
03/09/2022

Automatic Language Identification for Celtic Texts

Language identification is an important Natural Language Processing task...
