Automatic Readability Assessment for Closely Related Languages

05/22/2023
by Joseph Marvin Imperial, et al.

In recent years, research on automatic readability assessment (ARA) has shifted towards expensive deep learning-based methods, with the primary goal of increasing model accuracy. Such methods, however, are rarely applicable to low-resource languages, where traditional handcrafted features are still widely used because NLP tools for extracting deeper linguistic representations are lacking. In this work, we take a step back from the technical component and focus on how linguistic aspects such as mutual intelligibility or degree of language relatedness can improve ARA in a low-resource setting. We collect short stories written in three languages of the Philippines (Tagalog, Bikol, and Cebuano) to train readability assessment models and explore the interaction of data and features in various cross-lingual setups. Our results show that including CrossNGO, a novel specialized feature that exploits n-gram overlap between languages with high mutual intelligibility, significantly improves the performance of ARA models compared to using off-the-shelf large multilingual language models alone. When both linguistic representations are combined, we achieve state-of-the-art results for Tagalog and Cebuano, and baseline scores for ARA in Bikol.
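
To make the idea of an n-gram overlap feature concrete, the following is a minimal sketch and not the authors' implementation of CrossNGO. It assumes character n-grams, a reference set built from the most frequent n-grams of a corpus in one language, and a simple proportion as the feature value; the abstract does not specify these details. The function names (`char_ngrams`, `top_ngrams`, `ngram_overlap`) are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all character n-grams of a lowercased, whitespace-normalized text."""
    cleaned = " ".join(text.lower().split())
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def top_ngrams(corpus, n, k):
    """Return the set of the k most frequent character n-grams in a corpus.

    Assumption: the reference list for a language is its top-k n-grams.
    """
    counts = Counter()
    for doc in corpus:
        counts.update(char_ngrams(doc, n))
    return {gram for gram, _ in counts.most_common(k)}


def ngram_overlap(text, reference, n):
    """Fraction of the text's distinct n-grams that also occur in the reference set."""
    grams = set(char_ngrams(text, n))
    if not grams:
        return 0.0  # empty or too-short text has no n-grams to compare
    return len(grams & reference) / len(grams)
```

Under these assumptions, one overlap score per reference language (and per n-gram size) could be appended to a document's feature vector, so that a text in one language receives a higher score against a closely related language than against a distant one.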

research
01/29/2022

Does Transliteration Help Multilingual Language Modeling?

As there is a scarcity of large representative corpora for most language...
research
06/15/2021

Knowledge-Rich BERT Embeddings for Readability Assessment

Automatic readability assessment (ARA) is the task of evaluating the lev...
research
04/01/2021

Low-Resource Language Modelling of South African Languages

Language models are the foundation of current neural network-based model...
research
04/18/2023

Romanization-based Large-scale Adaptation of Multilingual Language Models

Large multilingual pretrained language models (mPLMs) have become the de...
research
12/01/2020

Automatically Identifying Language Family from Acoustic Examples in Low Resource Scenarios

Existing multilingual speech NLP works focus on a relatively small subse...
research
01/28/2021

Does Typological Blinding Impede Cross-Lingual Sharing?

Bridging the performance gap between high- and low-resource languages ha...
research
09/25/2021

Pushing on Text Readability Assessment: A Transformer Meets Handcrafted Linguistic Features

We report two essential improvements in readability assessment: 1. three...
