Efficient Big Text Data Clustering Algorithms using Hadoop and Spark

12/01/2021
by Sergios Gerakidis, et al.

Document clustering is a traditional, efficient, and effective text mining technique for gaining insight into which documents of a collection can be grouped together. K-Means and Hierarchical Agglomerative Clustering (HAC) are two of the best-known and most commonly used clustering algorithms, the former for its low time cost and the latter for its accuracy. However, even K-Means can incur unacceptable time costs when clustering text over large-scale collections. In this paper we first review some of the most valuable approaches to document clustering over such 'big data' (large-scale) collections. We then present two promising alternatives: (a) a variation of an existing fast K-Means-based clustering technique, BigKClustering (BKC), adapted so that it can be applied to document clustering, and (b) a hybrid clustering approach based on a customized version of the Buckshot algorithm, which first applies a hierarchical clustering procedure to a sample of the input dataset and then uses the resulting clusters as the initial centers for a K-Means-based assignment of the remaining documents, with very few iterations. We also give efficient adaptations of the proposed techniques to the MapReduce model, which we test experimentally using Apache Hadoop and Spark on a real cluster environment. The experiments show that both techniques achieve acceptable clustering quality and significant time improvements over K-Means, the Buckshot-based algorithm especially, making them promising alternatives for big document collections.
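The Buckshot idea summarized in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' customized algorithm or their MapReduce adaptation. It assumes the classic Buckshot sample size of roughly sqrt(k * n), cosine similarity, and a naive centroid-linkage HAC; the function names (`hac_centers`, `buckshot`) and the `iterations` and `seed` parameters are hypothetical and chosen only for illustration.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def hac_centers(sample, k):
    """Naive HAC (assumed centroid linkage): merge the two most similar
    clusters until k remain, then return the cluster centroids."""
    clusters = [[v] for v in sample]
    while len(clusters) > k:
        cents = [centroid(c) for c in clusters]
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(cents[i], cents[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i is unaffected
    return [centroid(c) for c in clusters]


def buckshot(docs, k, iterations=2, seed=0):
    """Hybrid clustering: HAC on a sample seeds a few K-Means-style passes."""
    rng = random.Random(seed)
    # Assumption: classic Buckshot sample size, about sqrt(k * n).
    sample_size = min(len(docs), max(k, math.ceil(math.sqrt(k * len(docs)))))
    sample = rng.sample(docs, sample_size)
    centers = hac_centers(sample, k)

    labels = []
    for _ in range(iterations):
        # Assignment step: each document goes to its most similar center.
        labels = [
            max(range(len(centers)), key=lambda c: cosine(d, centers[c]))
            for d in docs
        ]
        # Update step: recompute each non-empty cluster's center.
        for c in range(len(centers)):
            members = [d for d, lab in zip(docs, labels) if lab == c]
            if members:
                centers[c] = centroid(members)
    # labels come from the last assignment pass; centers from the last update.
    return labels, centers


if __name__ == "__main__":
    toy_docs = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05],
                [0.0, 1.0], [0.1, 0.9], [0.05, 1.0]]
    print(buckshot(toy_docs, k=2)[0])
```

The quadratic HAC step runs only on the sample, which is why the hybrid stays cheap: the full collection is touched only by the linear assignment passes, and those are the part that maps naturally onto a MapReduce-style job.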

research
01/06/2010

K-tree: Large Scale Document Clustering

We introduce K-tree in an information retrieval context. It is an effici...
research
12/29/2011

A comparison of two suffix tree-based document clustering algorithms

Document clustering as an unsupervised approach extensively used to navi...
research
03/06/2022

A Comparative Study on Data Representation to Categorize Text Documents

In the modern world text documents play an important role in most of the...
research
07/02/2017

Classification non supervisée des données hétérogènes à large échelle

When it comes to cluster massive data, response time, disk access and qu...
research
07/30/2021

Efficient Sparse Spherical k-Means for Document Clustering

Spherical k-Means is frequently used to cluster document collections bec...
research
10/07/2018

A Fast Text Similarity Measure for Large Document Collections using Multi-reference Cosine and Genetic Algorithm

One of the important factors that make a search engine fast and accurate...
research
01/06/2010

Document Clustering with K-tree

This paper describes the approach taken to the XML Mining track at INEX ...
