Plagiarism Detection in the Bengali Language: A Text Similarity-Based Approach

03/25/2022
by   Satyajit Ghosh, et al.

Plagiarism means taking another person's work without giving them credit for it. It is one of the most serious problems in academia and among researchers. Although multiple tools are available to detect plagiarism in a document, most of them are domain-specific and designed to work on English texts, yet plagiarism is not limited to a single language. Bengali is the most widely spoken language of Bangladesh and the second most spoken language in India, with 300 million native speakers and 37 million second-language speakers. Plagiarism detection requires a large corpus for comparison. Because Bengali literature spans 1300 years, most of its books have not yet been properly digitized. As no such corpus existed for our purpose, we collected Bengali literature books from the National Digital Library of India, extracted their text with a comprehensive methodology, and constructed our corpus. Our experimental results find out average accuracy between 72.10 OCR. The Levenshtein distance algorithm is used to determine plagiarism. We have built a web application for end users and successfully tested it for plagiarism detection in Bengali texts. In the future, we aim to construct a corpus with more books for more accurate detection.
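
The abstract names the Levenshtein distance as the basis for the similarity check. The following is a minimal Python sketch of that idea, not the authors' implementation: the function names `levenshtein` and `similarity`, and the choice to normalize the distance by the length of the longer string, are assumptions made for illustration. The sketch compares Unicode code points, so a Bengali consonant and its attached vowel sign count as separate symbols; a real system might compare words or grapheme clusters instead.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, counted in Unicode code points."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop (less memory)
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match when cost is 0)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
    # Hypothetical Bengali sentence pair, used only as sample input.
    print(similarity("আমি ভাত খাই", "আমি ভাত খাব"))
```

Under this sketch, a submitted passage would be scored against each candidate passage in the corpus and flagged when the similarity exceeds some threshold; the threshold itself is a design choice that the abstract does not specify.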


