Lexicon and Rule-based Word Lemmatization Approach for the Somali Language

08/03/2023
by Shafie Abdi Mohamed, et al.

Lemmatization is a Natural Language Processing (NLP) technique that normalizes text by reducing morphological variants of words to their root forms. It is a core pre-processing step in many NLP tasks, including text indexing, information retrieval, and machine learning for NLP. This paper pioneers text lemmatization for the Somali language, a low-resource language with little or no prior effective adoption of NLP methods and datasets. Specifically, we develop a lexicon and rule-based lemmatizer for Somali text as a starting point for a full-fledged Somali lemmatization system for various NLP tasks. Taking the language's morphological rules into account, we have built an initial lexicon of 1247 root words and 7173 derivationally related terms, enriched with rules for lemmatizing words not present in the lexicon. We tested the algorithm on 120 documents of various lengths, including news articles, social media posts, and text messages. Our initial results show that the algorithm achieves an accuracy of 57% for relatively long documents (e.g. full news articles), 60.57% for news article extracts, and 95.87% for short texts such as social media messages.
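The general shape of a lexicon and rule-based lemmatizer can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the lexicon entries, the suffix list, the minimum stem length, and the names `LEXICON`, `SUFFIX_RULES`, `MIN_STEM_LEN`, `lemmatize`, and `lemmatize_text` are all hypothetical placeholders rather than material from the paper. The sketch looks each token up in the lexicon first and falls back to suffix-stripping rules only for words the lexicon does not cover.

```python
from typing import Dict, List

# Toy lexicon mapping surface forms to root words.
# Illustrative entries only; NOT taken from the paper's lexicon.
LEXICON: Dict[str, str] = {
    "buug": "buug",
    "buugga": "buug",
    "buugaag": "buug",
    "guri": "guri",
    "guriga": "guri",
}

# Hypothetical fallback rules: suffixes to strip when a word is not in the lexicon.
# A real rule set would be derived from the language's morphology.
SUFFIX_RULES: List[str] = ["ka", "ga", "ta", "da"]

# Assumed guard so that very short words are not stripped down to nothing.
MIN_STEM_LEN = 3


def lemmatize(word: str) -> str:
    """Return the lexicon root if known, else apply the first matching suffix rule."""
    token = word.lower()
    if token in LEXICON:
        return LEXICON[token]
    for suffix in SUFFIX_RULES:
        stem_len = len(token) - len(suffix)
        if token.endswith(suffix) and stem_len >= MIN_STEM_LEN:
            return token[:stem_len]
    # No lexicon entry and no applicable rule: leave the token unchanged.
    return token


def lemmatize_text(text: str) -> List[str]:
    """Lemmatize a whitespace-tokenized text."""
    return [lemmatize(tok) for tok in text.split()]
```

Checking the lexicon before the rules means curated entries always take precedence, and the rules only approximate a root for out-of-lexicon words, which is one plausible reading of "enriched with rules for lemmatizing words not present in the lexicon".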

research
02/23/2020

A Nepali Rule Based Stemmer and its performance on different NLP applications

Stemming is an integral part of Natural Language Processing (NLP). It's ...
research
07/13/2017

Lithium NLP: A System for Rich Information Extraction from Noisy User Generated Text on Social Media

In this paper, we describe the Lithium Natural Language Processing (NLP)...
research
02/13/2020

Comparison of Turkish Word Representations Trained on Different Morphological Forms

Increased popularity of different text representations has also brought ...
research
05/21/2020

Towards Finite-State Morphology of Kurdish

Morphological analysis is the study of the formation and structure of wo...
research
10/02/2019

Neural Word Decomposition Models for Abusive Language Detection

User generated text on social media often suffers from a lot of undesire...
research
07/04/2022

Location reference recognition from texts: A survey and comparison

A vast amount of location information exists in unstructured texts, such...
research
03/31/2020

Automatic Extraction of Bengali Root Verbs using Paninian Grammar

In this research work, we have proposed an algorithm based on supervised...
