Content-based Text Categorization using Wikitology

08/17/2012
by Muhammad Rafi, et al.

A major computational burden in document clustering is calculating the similarity measure between pairs of documents. A similarity measure is a function that assigns a real number between 0 and 1 to a pair of documents, depending on the degree of similarity between them. A value of zero means that the documents are completely dissimilar, whereas a value of one indicates that the documents are practically identical. Traditionally, vector-based models have been used to compute document similarity. These models represent a document by features present in it, and the resulting similarity measures generally cannot account for the semantics of the document. Documents written in human languages contain contexts, and the words used to describe these contexts are generally semantically related. Motivated by this fact, many researchers have proposed semantic-based similarity measures that annotate text using external thesauri such as WordNet (a lexical database). In this paper, we define a semantic similarity measure based on documents represented as topic maps. Topic maps are rapidly becoming an industry standard for knowledge representation, with a focus on later search and extraction. The documents are transformed into topic-map-based coded knowledge, and the similarity between a pair of documents is represented as a correlation between their common patterns. Experimental studies on text mining datasets show that this new similarity measure is more effective than similarity measures commonly used in text clustering.
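
To make the notion of a vector-based similarity measure concrete, here is a minimal sketch of cosine similarity over bag-of-words term counts. It illustrates only the traditional baseline the abstract contrasts against, not the topic-map measure proposed in the paper. The function name `cosine_similarity` and the whitespace tokenization are assumptions made for this example.

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between term-count vectors of two documents.

    Illustrative baseline only: tokenization is a naive lowercase
    whitespace split, which is an assumption of this sketch.
    """
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())

    # Dot product over shared terms; terms missing from either
    # document contribute zero.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())

    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))

    # An empty document has no direction; treat it as dissimilar.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Synonymous documents with no shared tokens score 0.0,
    # even though a human reader would call them related.
    print(cosine_similarity("car", "automobile"))
    print(cosine_similarity("the car is fast", "the automobile is fast"))
```

Because term counts are non-negative, the score always falls between 0 and 1, matching the range described above. The first call also shows the limitation that motivates semantic measures: two synonyms share no tokens, so a purely lexical vector model rates them as completely dissimilar.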

research
12/29/2011

Document Clustering based on Topic Maps

Importance of document clustering is now widely acknowledged by research...
research
11/15/2016

SimDoc: Topic Sequence Alignment based Document Similarity Framework

Document similarity is the problem of estimating the degree to which a g...
research
08/06/2018

An Efficient Approach to Learning Chinese Judgment Document Similarity Based on Knowledge Summarization

A previous similar case in common law systems can be used as a reference...
research
02/21/2016

Interactive Storytelling over Document Collections

Storytelling algorithms aim to 'connect the dots' between disparate docu...
research
01/10/2012

Sentence based semantic similarity measure for blog-posts

Blogs-Online digital diary like application on web 2.0 has opened new an...
research
01/16/2014

Which Clustering Do You Want? Inducing Your Ideal Clustering with Minimal Feedback

While traditional research on text clustering has largely focused on gro...
research
02/16/2017

Clustering articles based on semantic similarity

Document clustering is generally the first step for topic identification...
