LSCP: Enhanced Large Scale Colloquial Persian Language Understanding

03/13/2020
by Hadi Abdi Khojasteh, et al.

Language recognition has advanced significantly in recent years through modern machine learning methods such as deep learning and through benchmarks with rich annotations. However, research is still limited for low-resource formal languages, and there is a significant gap in describing colloquial language, especially for low-resource languages such as Persian. To address this gap, we propose the "Large Scale Colloquial Persian Dataset" (LSCP). LSCP is hierarchically organized in a semantic taxonomy that treats multi-task informal Persian language understanding as a comprehensive problem. This encompasses the recognition of multiple semantic aspects of human-level sentences, captured naturally from real-world sentences. We believe that further investigation and processing, as well as the application of novel algorithms and methods, can strengthen computerized understanding and processing of low-resource languages. The proposed corpus consists of 120M sentences derived from 27M tweets, annotated with parse trees, part-of-speech tags, sentiment polarity, and translations in five different languages.
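As a rough illustration of the annotation layers listed above, a single sentence record could be modelled along the following lines. This is only a sketch: the class name, field names, tag label, sentiment range, and language code are assumptions made for illustration and are not taken from the LSCP release, and the example values are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class AnnotatedSentence:
    """Hypothetical record for one sentence; not the actual LSCP schema."""
    text: str                         # colloquial Persian sentence
    pos_tags: List[Tuple[str, str]]   # (token, tag) pairs
    parse_tree: str                   # bracketed parse-tree string
    sentiment: float                  # polarity, assumed here to lie in [-1, 1]
    # Maps an (assumed) language code to the translated sentence.
    translations: Dict[str, str] = field(default_factory=dict)

    def tokens(self) -> List[str]:
        """Return the tokens without their part-of-speech tags."""
        return [token for token, _ in self.pos_tags]


# Made-up example values, for illustration only.
example = AnnotatedSentence(
    text="خوبه",
    pos_tags=[("خوبه", "ADJ")],
    parse_tree="(S (ADJ خوبه))",
    sentiment=0.8,
    translations={"en": "It's good"},
)
```

Keeping every annotation layer on one record reflects the multi-task framing of the abstract, where a single sentence serves tagging, parsing, sentiment, and translation tasks at once.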


