Document Clustering based on Topic Maps

12/29/2011
by Muhammad Rafi, et al.

The importance of document clustering is now widely acknowledged by researchers for better management, smart navigation, efficient filtering, and concise summarization of large document collections such as the World Wide Web (WWW). The next challenge lies in clustering documents by their semantic content. The problem of document clustering has two main components: (1) representing each document in a form that inherently captures the semantics of its text, which may also help reduce the dimensionality of the document, and (2) defining a similarity measure over that semantic representation that assigns higher numerical values to document pairs with a stronger semantic relationship. The feature space of documents can be very challenging for clustering: a document may contain multiple topics, a large set of class-independent general words, and only a handful of class-specific core words. With these features in mind, traditional agglomerative clustering algorithms, which are based on either the Document Vector model (DVM) or the Suffix Tree model (STC), are less efficient at producing results with high cluster quality. This paper introduces a new approach to document clustering based on the Topic Map representation of documents. Each document is transformed into a compact form. A similarity measure is proposed based on the information inferred from topic map data and structures. The suggested method is implemented using agglomerative hierarchical clustering and tested on standard information retrieval (IR) datasets. The comparative experiment shows that the proposed approach is effective in improving cluster quality.
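To make the two components named in the abstract concrete, the following is a minimal sketch of agglomerative hierarchical clustering driven by a pairwise similarity over compact document representations. It is not the authors' method: the abstract does not specify how topic maps are built or how the proposed similarity measure is computed. Here each document is assumed to be already reduced to a plain set of topic names, Jaccard overlap stands in for the similarity measure, and average linkage with a stopping `threshold` is an illustrative choice. The names `jaccard`, `average_link`, `agglomerative`, and the sample `docs` are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two topic sets, in [0, 1]; a stand-in similarity (assumption)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def average_link(c1, c2, docs):
    """Mean pairwise similarity between the documents of two clusters."""
    total = sum(jaccard(docs[i], docs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative(docs, threshold):
    """Repeatedly merge the most similar pair of clusters.

    Stops when the best pair's similarity falls below `threshold`
    or when only one cluster remains. Returns lists of document indices.
    """
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        # Score every pair of current clusters and pick the most similar one.
        (a, b), best = max(
            (((x, y), average_link(clusters[x], clusters[y], docs))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        # a < b always holds, so deleting index b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical documents, each already reduced to a set of topic names.
docs = [
    {"clustering", "topic maps", "semantics"},
    {"clustering", "topic maps", "similarity"},
    {"football", "league", "transfer"},
    {"football", "league", "injury"},
]
print(agglomerative(docs, threshold=0.3))
```

The sketch rescans all cluster pairs on every merge, which is fine for illustration but scales poorly; a similarity that exploits the structure of the topic map, rather than a flat set of names, would replace `jaccard` without changing the merge loop.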

research
12/29/2011

A comparison of two suffix tree-based document clustering algorithms

Document clustering as an unsupervised approach extensively used to navi...
research
08/17/2012

Content-based Text Categorization using Wikitology

A major computational burden, while performing document clustering, is t...
research
11/06/2018

Semantic Term "Blurring" and Stochastic "Barcoding" for Improved Unsupervised Text Classification

The abundance of text data being produced in the modern age makes it inc...
research
11/15/2016

SimDoc: Topic Sequence Alignment based Document Similarity Framework

Document similarity is the problem of estimating the degree to which a g...
research
08/06/2018

An Efficient Approach to Learning Chinese Judgment Document Similarity Based on Knowledge Summarization

A previous similar case in common law systems can be used as a reference...
research
11/26/2015

OntoSeg: a Novel Approach to Text Segmentation using Ontological Similarity

Text segmentation (TS) aims at dividing long text into coherent segments...
research
01/06/2010

Random Indexing K-tree

Random Indexing (RI) K-tree is the combination of two algorithms for clu...
