A Nepali Rule Based Stemmer and its performance on different NLP applications

02/23/2020
by Pravesh Koirala, et al.

Stemming is an integral part of Natural Language Processing (NLP) and a preprocessing step in almost every NLP application. Arguably, its most important use is in Information Retrieval (IR). While much work has been done on stemming for languages like English, Nepali stemming has received little attention. This study focuses on creating a rule-based stemmer for Nepali text. Specifically, it is an affix-stripping system that identifies two different classes of suffixes in Nepali grammar and strips them separately. Only a single negativity prefix (Na) is identified and stripped. The study applies a number of techniques, such as exception word identification, morphological normalization and word transformation, to increase stemming performance. The stemmer is tested intrinsically using Paice's method and extrinsically on a basic tf-idf based IR system and an elementary news topic classifier using a Multinomial Naive Bayes classifier. The difference in performance of these systems with and without the stemmer is analysed.
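To make the affix-stripping idea concrete, the following is a minimal sketch of how such a stemmer might be organised: an exception list is consulted first, two suffix classes are stripped in separate passes, and a single negativity prefix is removed last. It is not the rule set or the pipeline from the paper. The romanized suffix lists, the exception words, the minimum stem length and all function names are assumptions chosen for illustration only, and morphological normalization and word transformation are left out.

```python
# Illustrative sketch only. The rule lists below are hypothetical, romanized
# placeholders and are NOT the suffix classes defined in the paper.
CLASS_ONE_SUFFIXES = ("haru", "bata", "lai", "le", "ma")  # assumed: plural/postposition-like
CLASS_TWO_SUFFIXES = ("eko", "dai", "nu")                 # assumed: verb-ending-like
NEGATIVITY_PREFIX = "na"
EXCEPTION_WORDS = {"naam", "nepal"}  # assumed: words that must not be stemmed
MIN_STEM_LEN = 2                     # assumed guard against over-stripping


def strip_suffixes(word, suffixes):
    """Repeatedly strip the longest matching suffix from one suffix class."""
    ordered = sorted(suffixes, key=len, reverse=True)
    changed = True
    while changed:
        changed = False
        for suffix in ordered:
            if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
                word = word[: -len(suffix)]
                changed = True
                break
    return word


def stem(word):
    """Exception check, then each suffix class in turn, then the prefix."""
    if word in EXCEPTION_WORDS:
        return word
    result = strip_suffixes(word, CLASS_ONE_SUFFIXES)
    result = strip_suffixes(result, CLASS_TWO_SUFFIXES)
    if (result.startswith(NEGATIVITY_PREFIX)
            and len(result) - len(NEGATIVITY_PREFIX) >= MIN_STEM_LEN):
        result = result[len(NEGATIVITY_PREFIX):]
    return result


if __name__ == "__main__":
    for token in ["gharharuma", "garnu", "nagareko", "naam"]:
        print(token, "->", stem(token))
```

With these toy rules, "gharharuma" reduces to "ghar", and both "garnu" and "nagareko" reduce to "gar", while "naam" is left intact only because it is on the exception list. Without that entry, the prefix rule would wrongly cut it to "am", which is the kind of error exception word identification is meant to prevent.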

