Machine and Deep Learning Methods with Manual and Automatic Labelling for News Classification in Bangla Language

10/19/2022
by   Istiak Ahmad, et al.

Research in Natural Language Processing (NLP) has become increasingly important due to applications such as text classification, text mining, sentiment analysis, POS tagging, named entity recognition, textual entailment, and many others. This paper introduces several machine learning (ML) and deep learning (DL) methods with manual and automatic labelling for news classification in the Bangla language. The ML algorithms are Logistic Regression (LR), Stochastic Gradient Descent (SGD), Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbour (KNN), used with Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and Doc2Vec embedding models. The DL algorithms are Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), Gated Recurrent Unit (GRU), and Convolutional Neural Network (CNN), used with Word2Vec, GloVe, and FastText word embedding models. We develop automatic labelling methods using Latent Dirichlet Allocation (LDA) and investigate the performance of single-label and multi-label article classification methods. To investigate performance, we developed Potrika from scratch, the largest and most extensive dataset for news classification in the Bangla language, comprising 185.51 million words and 12.57 million sentences contained in 664,880 news articles in eight distinct categories, curated from six popular online news portals in Bangladesh for the period 2014-2020. GRU with FastText achieves the highest accuracy, 91.83%, on manually labelled data. For the automatic labelling case, KNN with Doc2Vec achieves the highest accuracy for both single-label and multi-label data, at 57.72% for the single-label case. The methods developed in this paper are expected to advance research in Bangla and other languages.
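
The distinction between single-label and multi-label automatic labelling can be illustrated with a minimal Python sketch. It assumes a per-article topic distribution has already been produced by a topic model such as LDA. The category names, the 0.3 threshold, and the function names are hypothetical choices made for illustration, not details taken from the paper.

```python
from typing import List

# Hypothetical category names; the abstract does not name the eight categories.
TOPIC_NAMES: List[str] = ["politics", "sports", "economy"]


def single_label(topic_probs: List[float]) -> str:
    """Pick the single most probable topic as the article's label."""
    best = max(range(len(topic_probs)), key=lambda i: topic_probs[i])
    return TOPIC_NAMES[best]


def multi_label(topic_probs: List[float], threshold: float = 0.3) -> List[str]:
    """Keep every topic whose probability reaches the (assumed) threshold."""
    labels = [TOPIC_NAMES[i] for i, p in enumerate(topic_probs) if p >= threshold]
    # Fall back to the top topic so that no article is left unlabelled.
    return labels or [single_label(topic_probs)]


# Example topic distribution for one article (illustrative values only).
doc_topics = [0.55, 0.35, 0.10]
print(single_label(doc_topics))
print(multi_label(doc_topics))
```

In this sketch, an article dominated by one topic receives one label in both modes, while an article that mixes topics can receive several labels in the multi-label mode.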
