SS4MCT: A Statistical Stemmer for Morphologically Complex Texts

05/25/2016
by   Javid Dadashkarimi, et al.

Many approaches have been proposed to address inflection matching problems in information retrieval, and stemming is a common one. Among stemming techniques, statistical stemming has been shown to be effective in a number of languages, particularly highly inflected ones. Common statistical techniques rely heavily on string similarity in terms of prefix and suffix matching. However, infixes are common in irregular and informal inflections in morphologically complex texts, so a stemmer must be able to find them as well. In this paper we propose a method for finding affixes in different positions of a word. The method derives statistical inflectional rules from the minimum edit distance table of word pairs and estimates the likelihoods of these rules in a language. The rules are then used to statistically stem words and can be applied in different text mining tasks. Experimental results on the CLEF 2008 and CLEF 2009 English-Persian CLIR tasks indicate that the proposed method significantly outperforms all the baselines in terms of MAP.
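
To make the idea of deriving rules from a minimum edit distance table more concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a standard Levenshtein table with unit costs, a simple backtrace, a coarse prefix/infix/suffix labelling of each edited segment, and a relative-frequency estimate of rule likelihood. The function names (`edit_table`, `backtrace`, `extract_rule`, `rule_likelihoods`), the rule representation, and the English word pairs are all hypothetical and chosen only for illustration; the abstract does not specify any of them.

```python
from collections import Counter


def edit_table(a, b):
    """Levenshtein table: d[i][j] is the cost of turning a[:i] into b[:j]."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match or substitution
    return d


def backtrace(a, b):
    """Recover one optimal sequence of edit operations from the table."""
    d = edit_table(a, b)
    i, j = len(a), len(b)
    ops = []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1] and d[i][j] == d[i - 1][j - 1]:
            ops.append(("keep", a[i - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            ops.append(("ins", b[j - 1]))
            j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", a[i - 1]))
            i -= 1
        else:
            ops.append(("sub", a[i - 1] + ">" + b[j - 1]))
            i, j = i - 1, j - 1
    ops.reverse()
    return ops


def extract_rule(a, b):
    """Group consecutive edits of one kind and label each group by position.

    Assumption: a group is a 'prefix' if it starts the alignment, a 'suffix'
    if it ends it, and an 'infix' otherwise.
    """
    ops = backtrace(a, b)
    rule, k = [], 0
    while k < len(ops):
        kind = ops[k][0]
        if kind == "keep":
            k += 1
            continue
        start, chars = k, ""
        while k < len(ops) and ops[k][0] == kind:
            chars += ops[k][1]
            k += 1
        if start == 0:
            pos = "prefix"
        elif k == len(ops):
            pos = "suffix"
        else:
            pos = "infix"
        rule.append((pos, kind, chars))
    return tuple(rule)


def rule_likelihoods(pairs):
    """Relative frequency of each rule over a list of (stem, inflected) pairs."""
    counts = Counter(extract_rule(a, b) for a, b in pairs)
    total = sum(counts.values())
    return {rule: n / total for rule, n in counts.items()}


# Illustrative word pairs (hypothetical data, not from the paper).
pairs = [("walk", "walked"), ("talk", "talked"), ("sing", "sang"), ("do", "undo")]
likelihoods = rule_likelihoods(pairs)
```

With these toy pairs, "walk"/"walked" yields a suffix insertion of "ed", "sing"/"sang" yields an infix substitution, and "do"/"undo" yields a prefix insertion of "un". Because the backtrace reads edits from anywhere in the table, changes in the middle of a word are captured in the same way as changes at its edges, which is the property the abstract highlights over pure prefix and suffix matching.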
