Informal Persian Universal Dependency Treebank

01/10/2022
by   Roya Kabiri, et al.
0

This paper presents the phonological, morphological, and syntactic distinctions between formal and informal Persian, showing that these two variants have fundamental differences that cannot be attributed solely to pronunciation discrepancies. Given that informal Persian exhibits particular characteristics, any computational model trained on formal Persian is unlikely to transfer well to informal Persian, necessitating the creation of dedicated treebanks for this variety. We thus detail the development of the open-source Informal Persian Universal Dependency Treebank, a new treebank annotated within the Universal Dependencies scheme. We then investigate the parsing of informal Persian by training two dependency parsers on existing formal treebanks and evaluating them on out-of-domain data, i.e. the development set of our informal treebank. Our results show that parsers experience a substantial performance drop when we move across the two domains, as they face more unknown tokens and structures and fail to generalize well. Furthermore, the dependency relations whose performance deteriorates the most represent the unique properties of the informal variant. The ultimate goal of this study that demonstrates a broader impact is to provide a stepping-stone to reveal the significance of informal variants of languages, which have been widely overlooked in natural language processing tools across languages.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
04/26/2022

Developing Universal Dependency Treebanks for Magahi and Braj

In this paper, we discuss the development of treebanks for two low-resou...
research
04/17/2021

Monotonicity Marking from Universal Dependency Trees

Dependency parsing is a tool widely used in the field of Natural languag...
research
01/28/2021

Syntactic Nuclei in Dependency Parsing – A Multilingual Exploration

Standard models for syntactic dependency parsing take words to be the el...
research
03/16/2020

Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

We introduce Stanza, an open-source Python natural language processing t...
research
01/25/2023

Weakly Supervised Headline Dependency Parsing

English news headlines form a register with unique syntactic properties ...
research
10/10/2017

The Galactic Dependencies Treebanks: Getting More Data by Synthesizing New Languages

We release Galactic Dependencies 1.0---a large set of synthetic language...

Please sign up or login with your details

Forgot password? Click here to reset