Evaluating Various Tokenizers for Arabic Text Classification

06/14/2021
by Zaid Alyafeai, et al.

The first step in any NLP pipeline is learning word vector representations. However, given a large text corpus, representing every word is inefficient. Many tokenization algorithms have emerged in the literature to tackle this problem by creating subwords, which in turn limits the vocabulary size of a text corpus. However, such algorithms are mostly language-agnostic and lack a proper way of capturing meaningful tokens, and evaluating them in practice is difficult. In this paper, we introduce three new tokenization algorithms for Arabic and compare them to three baselines using unsupervised evaluations. We also compare all six algorithms on three tasks: sentiment analysis, news classification, and poetry classification. Our experiments show that the performance of such tokenization algorithms depends on the size of the dataset, the type of task, and the amount of morphology present in the dataset.
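To illustrate how a language-agnostic subword tokenizer limits vocabulary size, here is a minimal sketch of byte-pair encoding (BPE), a widely used algorithm of this kind. It is not claimed to be one of the algorithms or baselines evaluated in the paper. The function names (`learn_bpe`, `merge_pair`, `tokenize`) and the toy English corpus are assumptions made for this example only.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Each distinct word starts as a tuple of single characters with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the new merge to every word in the working vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(symbols, best))] += freq
        vocab = new_vocab
    return merges


def tokenize(word, merges):
    """Split `word` into subwords by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Toy corpus (an assumption for illustration, not data from the paper).
corpus = ["low", "low", "lower", "lowest"]
merges = learn_bpe(corpus, num_merges=2)
tokens = tokenize("lowly", merges)
```

The number of merges bounds how many subword units are added on top of the character set, which is how such tokenizers cap vocabulary size. Because the merges are driven purely by pair frequency, nothing guarantees that the resulting tokens line up with meaningful morphological units, which is the limitation the abstract points to.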


