Benchmark Performance of Machine And Deep Learning Based Methodologies for Urdu Text Document Classification

03/03/2020
by Muhammad Nabeel Asim, et al.

To provide benchmark performance for Urdu text document classification, the contribution of this paper is manifold. First, it provides a publicly available benchmark dataset manually tagged against 6 classes. Second, it investigates the performance impact of traditional machine learning based Urdu text document classification methodologies by embedding 10 filter-based feature selection algorithms which have been widely used for other languages. Third, for the very first time, it assesses the performance of various deep learning based methodologies for Urdu text document classification. In this regard, for experimentation, we adapt 10 deep learning classification methodologies which have produced the best performance figures for English text classification. Fourth, it investigates the performance impact of transfer learning by utilizing the Bidirectional Encoder Representations from Transformers (BERT) approach for the Urdu language. Fifth, it evaluates the integrity of a hybrid approach which combines traditional machine learning based feature engineering and deep learning based automated feature engineering. Experimental results show that the feature selection approach named Normalised Difference Measure, combined with a Support Vector Machine, surpasses state-of-the-art performance on two closed source benchmark datasets, CLE Urdu Digest 1000k and CLE Urdu Digest 1Million, with a significant margin of 32 respectively. Across all three datasets, Normalised Difference Measure outperforms the other filter-based feature selection algorithms, as it significantly uplifts the performance of all adopted machine learning, deep learning, and hybrid approaches. The source code and presented dataset are available in a GitHub repository.
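The abstract names Normalised Difference Measure (NDM) but does not define it. The sketch below is an illustration only, not the authors' implementation. It assumes the definition commonly given in the feature selection literature, in which a term's score for one class is |tpr - fpr| / min(tpr, fpr), with tpr and fpr the fractions of positive and negative documents containing the term, and a small constant substituted when the minimum is zero. The function names, the `eps` value, the one-vs-rest framing, and the toy English tokens are all assumptions made for this example.

```python
from collections import Counter


def ndm_scores(docs, labels, positive, eps=1e-6):
    """Score terms for one class with |tpr - fpr| / min(tpr, fpr).

    Assumed definition (not stated in the abstract): tpr and fpr are the
    fractions of positive and negative documents that contain the term.
    eps is a hypothetical stand-in used when min(tpr, fpr) is zero.
    """
    pos_total = sum(1 for y in labels if y == positive)
    neg_total = len(labels) - pos_total
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        target = pos_df if y == positive else neg_df
        target.update(set(tokens))  # count each term once per document

    scores = {}
    for term in set(pos_df) | set(neg_df):
        tpr = pos_df[term] / pos_total if pos_total else 0.0
        fpr = neg_df[term] / neg_total if neg_total else 0.0
        denom = min(tpr, fpr) or eps  # avoid division by zero
        scores[term] = abs(tpr - fpr) / denom
    return scores


def top_k_terms(scores, k):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy, made-up corpus of tokenised documents (placeholder English tokens).
docs = [
    ["cricket", "match", "team"],
    ["cricket", "score"],
    ["market", "stock", "team"],
    ["market", "bank"],
]
labels = ["sports", "sports", "business", "business"]

scores = ndm_scores(docs, labels, positive="sports")
selected = top_k_terms(scores, 2)
```

Under this assumed definition, a term that appears equally often in both classes ("team" here) scores zero, while terms concentrated in either class rank highest. The selected vocabulary would then be used to build the feature vectors passed to a classifier such as a Support Vector Machine.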


