A Comparative Study on Data Representation to Categorize Text Documents

03/06/2022
by Dulani Meedeniya, et al.

In the modern world, text documents play an important role in most organizations. Their constant growth widens the scope of document storage, which creates a need for effective text retrieval and search capabilities. This paper proposes two document preprocessing methods. The objective of the study is to find an appropriate data representation for text categorization by comparing two data representation approaches. The first approach groups documents based on their titles, and the second groups them based on the document body. Both approaches apply the same clustering and classification techniques to the test data sets: clustering divides the documents into categories, and classification techniques validate the clustering results. The study shows that grouping text documents by their titles performs better than grouping them by document body.
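
The comparison described above can be illustrated with a minimal sketch. The abstract does not name the clustering or classification algorithms used, so everything here is an assumption made for illustration: term-frequency vectors as the representation, a simple cosine-based k-means for clustering, and a leave-one-out nearest-centroid classifier as the validation step. The toy documents and all function names are hypothetical.

```python
import math
import random
from collections import Counter

def tf_vector(text):
    """L2-normalised bag-of-words term-frequency vector (assumed representation)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {term: c / norm for term, c in counts.items()}

def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())

def centroid(vectors):
    """Normalised sum of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for term, w in v.items():
            total[term] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {term: w / norm for term, w in total.items()}

def kmeans(vectors, k, iters=20, seed=0):
    """Cosine-based k-means; returns one cluster label per vector."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)
    labels = None
    for _ in range(iters):
        new_labels = [max(range(k), key=lambda c: cosine(v, centers[c]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = centroid(members)
    return labels

def validate(vectors, labels):
    """Leave-one-out nearest-centroid accuracy against the cluster labels."""
    hits = 0
    for i, v in enumerate(vectors):
        groups = {}
        for j, (u, l) in enumerate(zip(vectors, labels)):
            if j != i:
                groups.setdefault(l, []).append(u)
        cents = {l: centroid(m) for l, m in groups.items()}
        predicted = max(cents, key=lambda l: cosine(v, cents[l]))
        hits += predicted == labels[i]
    return hits / len(vectors)

# Hypothetical toy corpus of (title, body) pairs, for illustration only.
docs = [
    ("football match report", "the team scored two goals in the second half"),
    ("football league results", "the club won the match after a late goal"),
    ("stock market update", "shares fell as investors reacted to the rates"),
    ("stock prices and bonds", "the market closed lower and bond yields rose"),
    ("football transfer news", "the striker signed a contract with the club"),
    ("market outlook for stock", "analysts expect shares to recover next year"),
]

k = 2
title_vecs = [tf_vector(title) for title, _ in docs]
body_vecs = [tf_vector(body) for _, body in docs]

# Same clustering and validation pipeline, two different representations.
title_labels = kmeans(title_vecs, k)
body_labels = kmeans(body_vecs, k)
title_score = validate(title_vecs, title_labels)
body_score = validate(body_vecs, body_labels)

print("title-based agreement:", title_score)
print("body-based agreement:", body_score)
```

The agreement score only measures how well a classifier can reproduce the cluster assignments for each representation; results on a toy corpus like this say nothing about the data sets or findings of the study itself.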
