A Comparative Study on Transfer Learning and Distance Metrics in Semantic Clustering over the COVID-19 Tweets

11/16/2021
by   Elnaz Zafarani-Moattar, et al.

This paper presents a comparative study of topic detection on COVID-19 data. Among the various approaches to topic detection, this paper adopts clustering. Clustering requires a distance measure, and computing distances requires an embedding. The aim of this research is to study three factors simultaneously, namely embedding methods, distance metrics and clustering methods, together with their interactions. The dataset consists of one month of tweets collected using COVID-19-related hashtags. Five embedding methods, ranging from earlier to more recent ones, are selected: Word2Vec, fastText, GloVe, BERT and T5. Five clustering methods are investigated: k-means, DBSCAN, OPTICS, spectral clustering and Jarvis-Patrick. Euclidean distance and cosine distance, the most important distance metrics in this field, are also examined. First, more than 7,500 tests are performed to tune the parameters. Then, every combination of embedding method, distance metric and clustering method (5 × 2 × 5 = 50 cases) is evaluated using the silhouette metric. The results of these 50 tests are examined first. Next, each method is assessed by its rank across all the tests in which it appears. Finally, the major variables of the research (embedding methods, distance metrics and clustering methods) are studied separately, averaging over the control variables to neutralize their effect. The experimental results show that T5 strongly outperforms the other embedding methods in terms of the silhouette metric. Among the distance metrics, cosine distance is slightly better. Among the clustering methods, DBSCAN is superior to the others.
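To make the evaluation setup concrete, the following is a minimal sketch of how a silhouette score can be computed under either Euclidean or cosine distance, and how the 50 combinations arise. It is not the authors' implementation: the function names, the toy 2-D vectors standing in for tweet embeddings, and the cluster labels are all illustrative assumptions, and the abstract does not say which libraries were used.

```python
import math
from itertools import product


def euclidean_distance(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def silhouette_score(points, labels, distance):
    """Mean silhouette coefficient over all points, in [-1, 1]."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Convention: a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(distance(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(distance(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# The three factors named in the abstract: 5 x 2 x 5 = 50 combinations.
embeddings = ["Word2Vec", "fastText", "GloVe", "BERT", "T5"]
metrics = {"euclidean": euclidean_distance, "cosine": cosine_distance}
clusterers = ["k-means", "DBSCAN", "OPTICS", "spectral", "Jarvis-Patrick"]
combinations = list(product(embeddings, metrics, clusterers))

# Hypothetical toy "embeddings" and two candidate clusterings of them.
toy_points = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
good_labels = [0, 0, 1, 1]  # groups nearby vectors together
bad_labels = [0, 1, 0, 1]   # splits nearby vectors apart

scores_good = {
    name: silhouette_score(toy_points, good_labels, fn)
    for name, fn in metrics.items()
}
scores_bad = {
    name: silhouette_score(toy_points, bad_labels, fn)
    for name, fn in metrics.items()
}
```

In a study of this kind, `toy_points` would be replaced by the vectors produced by one embedding method and the labels by the output of one clustering method, with the score then averaged or ranked across the remaining factors.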


research
03/30/2022

Benchmarking distance-based partitioning methods for mixed-type data

Clustering mixed-type data, that is, observation by variable data that c...
research
02/13/2023

Transferable Deep Metric Learning for Clustering

Clustering in high dimension spaces is a difficult task; the usual dista...
research
07/01/2019

Learning to Link

Clustering is an important part of many modern data analysis pipelines, ...
research
04/18/2022

Time Series Clustering for Grouping Products Based on Price and Sales Patterns

Developing technology and changing lifestyles have made online grocery d...
research
08/15/2019

Pearson Distance is not a Distance

The Pearson distance between a pair of random variables X,Y with correla...
research
10/25/2022

Clustering of Threat Information to Mitigate Information Overload for Computer Emergency Response Teams

The constantly increasing number of threats and the existing diversity o...
research
10/21/2019

Improving Vehicle Re-Identification using CNN Latent Spaces: Metrics Comparison and Track-to-track Extension

This paper addresses the problem of vehicle re-identification using dist...
