Log In Sign Up

Vyākarana: A Colorless Green Benchmark for Syntactic Evaluation in Indic Languages

by   Rajaswa Patil, et al.

While there has been significant progress towards developing NLU datasets and benchmarks for Indic languages, syntactic evaluation has been relatively less explored. Unlike English, Indic languages have rich morphosyntax, grammatical genders, free linear word-order, and highly inflectional morphology. In this paper, we introduce Vyākarana: a benchmark of gender-balanced Colorless Green sentences in Indic languages for syntactic evaluation of multilingual language models. The benchmark comprises four syntax-related tasks: PoS Tagging, Syntax Tree-depth Prediction, Grammatical Case Marking, and Subject-Verb Agreement. We use the datasets from the evaluation tasks to probe five multilingual language models of varying architectures for syntax in Indic languages. Our results show that the token-level and sentence-level representations from the Indic language models (IndicBERT and MuRIL) do not capture the syntax in Indic languages as efficiently as the other highly multilingual language models. Further, our layer-wise probing experiments reveal that while mBERT, DistilmBERT, and XLM-R localize the syntax in middle layers, the Indic language models do not show such syntactic localization.


page 1

page 2

page 3

page 4


Cross-Linguistic Syntactic Evaluation of Word Prediction Models

A range of studies have concluded that neural word prediction models can...

Syntax Representation in Word Embeddings and Neural Networks – A Survey

Neural networks trained on natural language processing tasks capture syn...

Multilingual Syntax-aware Language Modeling through Dependency Tree Conversion

Incorporating stronger syntactic biases into neural language models (LMs...

Towards Syntactic Iberian Polarity Classification

Lexicon-based methods using syntactic rules for polarity classification ...

This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish

The availability of compute and data to train larger and larger language...

Syntax-aware Multilingual Semantic Role Labeling

Recently, semantic role labeling (SRL) has earned a series of success wi...