Clustering Vietnamese Conversations From Facebook Page To Build Training Dataset For Chatbot

12/31/2021
by Trieu Hai Nguyen, et al.

The biggest challenge in building chatbots is training data: it must be realistic and large enough to train a chatbot. We create a tool that collects real training data from the Facebook Messenger conversations of a Facebook page. After text preprocessing, the collected data yields two datasets, FVnC and Sample. We use PhoBERT, a BERT-based model pre-trained for Vietnamese, to extract features from our text data. The K-Means and DBSCAN algorithms then cluster the output embeddings of PhoBERT_base. We evaluate clustering performance with the V-measure score and the Silhouette score. We also demonstrate the effectiveness of PhoBERT for feature extraction, compared to other models, on the Sample dataset and a wiki dataset. In addition, we propose a GridSearch algorithm that combines both clustering evaluations to find optimal parameters. By clustering this large number of conversations, we save considerable time and effort in building data and storylines for training chatbots.
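
The clustering and parameter-search steps described in the abstract can be illustrated with a minimal sketch, assuming scikit-learn. It is not the authors' implementation. Synthetic 768-dimensional vectors stand in for PhoBERT_base embeddings so the example runs without downloading a model, and the way the two scores are combined (a plain average in the hypothetical `combined_score` helper) is an assumption, since the abstract does not state the combination rule. The helper names and the parameter grids are also illustrative.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, v_measure_score

# Assumption: synthetic vectors stand in for PhoBERT_base sentence embeddings
# (768 dimensions); y_true plays the role of reference topic labels.
X, y_true = make_blobs(n_samples=300, n_features=768, centers=4,
                       cluster_std=1.0, random_state=0)


def combined_score(X, y_true, labels):
    """Hypothetical combination: average of V-measure and Silhouette score."""
    n_labels = len(set(labels))
    # The Silhouette score is only defined for 2 <= n_labels <= n_samples - 1.
    if n_labels < 2 or n_labels >= len(labels):
        return None
    v = v_measure_score(y_true, labels)   # agreement with reference labels
    s = silhouette_score(X, labels)       # cohesion/separation, label-free
    return (v + s) / 2.0


def grid_search_kmeans(X, y_true, k_values):
    """Return (best_k, best_score) over the candidate cluster counts."""
    best = None
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = combined_score(X, y_true, labels)
        if score is not None and (best is None or score > best[1]):
            best = (k, score)
    return best


def grid_search_dbscan(X, y_true, eps_values, min_samples_values):
    """Return ((eps, min_samples), best_score) over the parameter grid."""
    best = None
    for eps in eps_values:
        for min_samples in min_samples_values:
            # DBSCAN marks noise points with label -1; this sketch simply
            # treats noise as one more group when scoring.
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
            score = combined_score(X, y_true, labels)
            if score is not None and (best is None or score > best[1]):
                best = ((eps, min_samples), score)
    return best


best_kmeans = grid_search_kmeans(X, y_true, k_values=range(2, 8))
best_dbscan = grid_search_dbscan(X, y_true,
                                 eps_values=[20.0, 50.0, 100.0],
                                 min_samples_values=[5, 10])
print("K-Means best (k, score):", best_kmeans)
print("DBSCAN best ((eps, min_samples), score):", best_dbscan)
```

With real data, `X` would be replaced by the embeddings extracted from the preprocessed conversations. The V-measure term requires reference labels, so a search on unlabeled data would have to rely on the Silhouette score alone.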
