Influence of various text embeddings on clustering performance in NLP

05/04/2023
by Rohan Saha, et al.

With the advent of e-commerce platforms, reviews have become crucial for customers assessing the credibility of a product. However, the star rating does not always match the review text written by the customer. For example, a three-star rating (out of five) may be incongruous with review text that would better suit a five-star rating. A clustering approach can be used to relabel such reviews with appropriate star ratings by grouping the review texts into clusters. In this work, we explore the choice of text embeddings used to represent these reviews and the impact that choice has on the performance of various classes of clustering algorithms. We use contextual (BERT) and non-contextual (Word2Vec) text embeddings to represent the text and measure their impact on three classes of clustering algorithms: partitioning-based (KMeans), single-linkage agglomerative hierarchical, and density-based (DBSCAN and HDBSCAN), each under various experimental settings. We evaluate the algorithms with the silhouette score, the adjusted Rand index, and the cluster purity score, and discuss the impact of the different embeddings on clustering performance. Our results indicate that the type of embedding chosen drastically affects the performance of an algorithm, that performance varies greatly across different types of clustering algorithms, that neither embedding type is consistently better than the other, and that DBSCAN outperforms KMeans and single-linkage agglomerative clustering but also labels more data points as outliers. We provide a thorough comparison of the performance of the different algorithms and offer numerous ideas to foster further research in the domain of text clustering.
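Two of the evaluation metrics named above, cluster purity and the adjusted Rand index, compare cluster assignments against reference labels such as star ratings. The sketch below is a minimal standard-library illustration of the commonly used definitions, not the authors' implementation: the function names `cluster_purity` and `adjusted_rand_index` and the toy labels are assumptions made for this example, and the abstract does not say how outlier points from DBSCAN or HDBSCAN were treated when scoring. The silhouette score is left out because it needs the embedding vectors themselves rather than labels alone.

```python
from collections import Counter
from math import comb


def cluster_purity(true_labels, cluster_labels):
    """Fraction of points whose true label is the majority label of their cluster."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")
    if len(true_labels) == 0:
        raise ValueError("label sequences must be non-empty")
    # Count true labels separately inside each cluster.
    members = {}
    for truth, cluster in zip(true_labels, cluster_labels):
        members.setdefault(cluster, Counter())[truth] += 1
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(true_labels)


def adjusted_rand_index(true_labels, cluster_labels):
    """Pair-counting agreement between two labelings, corrected for chance."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")
    n = len(true_labels)
    if n < 2:
        raise ValueError("at least two points are required")
    # Contingency counts: cells, row totals (true), column totals (clusters).
    cells = Counter(zip(true_labels, cluster_labels))
    rows = Counter(true_labels)
    cols = Counter(cluster_labels)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case: both labelings are trivially identical partitions.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical toy data: star ratings and cluster ids for six reviews.
stars = [1, 1, 1, 5, 5, 5]
clusters = [0, 0, 1, 1, 1, 1]
purity = cluster_purity(stars, clusters)
ari = adjusted_rand_index(stars, clusters)
```

In the toy data, one one-star review lands in the cluster dominated by five-star reviews, so five of the six points match their cluster's majority label and the purity is 5/6. Purity alone rewards producing many small clusters, which is one reason to report it alongside a chance-corrected measure such as the adjusted Rand index.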


