Izindaba-Tindzaba: Machine learning news categorisation for Long and Short Text for isiZulu and Siswati

06/12/2023
by Andani Madodonga, et al.

Native South African languages are classified as low-resource languages. It is therefore essential to build resources for these languages so that they can benefit from advances in natural language processing. This work creates annotated news datasets for isiZulu and Siswati for the task of news topic classification, and presents findings from baseline classification models trained on them. Because data for these languages is scarce, the datasets were augmented and oversampled to increase their size and to overcome class imbalance. Four classification models were used: Logistic Regression, Naive Bayes, XGBoost and LSTM. These models were trained on three text representations: Bag-of-Words, TF-IDF and Word2vec. The results show that XGBoost, Logistic Regression and LSTM trained on Word2vec performed better than the other combinations.
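
As a rough illustration of two steps named in the abstract, oversampling to balance classes and a Bag-of-Words representation, the following is a minimal sketch using only the Python standard library. It assumes simple random oversampling and whitespace tokenisation; the abstract does not say which oversampling or tokenisation method the authors used. The function names (`random_oversample`, `bag_of_words`) and the toy documents and labels are hypothetical placeholders, not the authors' code or data.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class examples until every class matches the largest.

    Assumption: plain random oversampling; the paper's method may differ.
    """
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Draw extra copies (with replacement) to reach the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


def bag_of_words(texts):
    """Return (vocabulary, count vectors) using whitespace tokenisation."""
    vocab = sorted({tok for t in texts for tok in t.lower().split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for t in texts:
        counts = Counter(t.lower().split())
        vec = [0] * len(vocab)
        for tok, c in counts.items():
            vec[index[tok]] = c
        vectors.append(vec)
    return vocab, vectors


# Hypothetical placeholder documents and labels, not real news data.
texts = ["doc a one", "doc a two", "doc a three", "doc b one"]
labels = ["politics", "politics", "politics", "sport"]

bal_texts, bal_labels = random_oversample(texts, labels)
vocab, X = bag_of_words(bal_texts)
```

After this step each class has the same number of examples, and every document is a count vector over the shared vocabulary, which could then be passed to a classifier such as Logistic Regression or Naive Bayes.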


