Simple or Complex? Learning to Predict Readability of Bengali Texts

12/09/2020
by   Mir Tafseer Nayeem, et al.

Determining the readability of a text is the first step toward simplifying it. In this paper, we present a readability analysis tool that analyzes text written in the Bengali language and provides in-depth information on its readability and complexity. Despite being the 7th most spoken language in the world, with 230 million native speakers, Bengali lacks fundamental resources for natural language processing. Readability research on Bengali has so far been narrow and sometimes faulty because of this lack of resources. We therefore adapt document-level readability formulas traditionally used in the U.S. education system to the Bengali language with a proper age-to-age comparison. Because large-scale human-annotated corpora are unavailable, we further divide the document-level task into a sentence-level task and experiment with neural architectures, which will serve as baselines for future work on Bengali readability prediction. In the process, we present several human-annotated corpora and dictionaries: a document-level dataset comprising 618 documents with 12 different grade levels; a large-scale sentence-level dataset comprising more than 96K sentences with simple and complex labels; a consonant conjunct count algorithm and a corpus of 341 words to validate its effectiveness; a list of 3,396 easy words; and an updated pronunciation dictionary with more than 67K words. These resources can be useful for several other tasks in this low-resource language. We make our code and datasets publicly available at https://github.com/tafseer-nayeem/BengaliReadability for reproducibility.
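
The abstract mentions a consonant conjunct count algorithm but does not describe it. As a rough illustration only, the sketch below counts conjuncts under a simple assumption: a conjunct is a run of two or more Bengali consonants joined by the hasanta sign (U+09CD). The function name `count_conjuncts`, the consonant range, and this definition are assumptions made for illustration, not the authors' implementation, which is available in the linked repository.

```python
import re

# Assumption: a conjunct is two or more consonants joined by hasanta (U+09CD).
# This is an illustrative definition, not the algorithm from the paper.
HASANTA = "\u09CD"
NUKTA = "\u09BC"  # may follow a consonant in decomposed forms

# Bengali consonant letters (U+0995..U+09B9) plus the precomposed
# letters U+09DC, U+09DD, and U+09DF; an optional nukta may follow.
CONSONANT = "[\u0995-\u09B9\u09DC\u09DD\u09DF]" + NUKTA + "?"

# One match corresponds to one conjunct cluster, however many consonants it joins.
CONJUNCT_RE = re.compile(f"{CONSONANT}(?:{HASANTA}{CONSONANT})+")


def count_conjuncts(text: str) -> int:
    """Return the number of consonant conjunct clusters found in text."""
    return len(CONJUNCT_RE.findall(text))


if __name__ == "__main__":
    # Words are written with escapes so the code points are unambiguous.
    bishwo = "\u09AC\u09BF\u09B6\u09CD\u09AC"              # one cluster
    bangla = "\u09AC\u09BE\u0982\u09B2\u09BE"              # no cluster
    rashtro = "\u09B0\u09BE\u09B7\u09CD\u099F\u09CD\u09B0"  # one three-consonant cluster
    for word in (bishwo, bangla, rashtro):
        print(word, count_conjuncts(word))
```

A per-word count of this kind could serve as one feature of word difficulty, but the sketch ignores cases a complete algorithm would need to handle, such as a word-final hasanta or special joining forms.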

