Are Classes Clusters?

by Kees Varekamp, et al.

Sentence embedding models aim to provide general-purpose embeddings for sentences. Most of the models studied in this paper claim to perform well on semantic textual similarity (STS) tasks, but they do not report on their suitability for clustering. This paper looks at four recent sentence embedding models: Universal Sentence Encoder (Cer et al., 2018), Sentence-BERT (Reimers and Gurevych, 2019), LASER (Artetxe and Schwenk, 2019), and DeCLUTR (Giorgi et al., 2020). It gives a brief overview of the ideas behind their implementations, then investigates how well topic classes in two text classification datasets, Amazon Reviews (Ni et al., 2019) and News Category Dataset (Misra, 2018), map to clusters in the corresponding sentence embedding space. While the performance of the resulting classification model is far from perfect, it is better than random. This is interesting because the classification model was constructed in an unsupervised way: the topic classes in these real-life topic classification datasets can be partly reconstructed by clustering the corresponding sentence embeddings.
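The general idea of checking whether classes form clusters can be illustrated with a minimal sketch. It is not the paper's procedure: the abstract does not name the clustering algorithm or the scoring method, so plain k-means and a majority-vote mapping from cluster to class are assumed here. The toy 2-D vectors, the class names "books" and "sports", and all function names are hypothetical stand-ins for real sentence embeddings and dataset labels. The gold labels are used only to score the clusters, not to build them.

```python
import random
from collections import Counter


def make_toy_embeddings(n_per_class=50, seed=0):
    """Two well-separated 2-D blobs standing in for sentence embeddings (toy data)."""
    rng = random.Random(seed)
    centers = {"books": (0.0, 0.0), "sports": (6.0, 6.0)}  # hypothetical classes
    points, labels = [], []
    for label, (cx, cy) in centers.items():
        for _ in range(n_per_class):
            points.append((rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)))
            labels.append(label)
    return points, labels


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns one cluster index per point. No labels are used."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members (empty clusters stay put).
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment


def map_clusters_to_classes(assignment, labels):
    """Label each cluster with the most common gold class among its members."""
    mapping = {}
    for cluster in set(assignment):
        members = [lab for a, lab in zip(assignment, labels) if a == cluster]
        mapping[cluster] = Counter(members).most_common(1)[0][0]
    return mapping


def accuracy(assignment, labels, mapping):
    """Fraction of points whose cluster's class matches their gold class."""
    hits = sum(mapping[a] == lab for a, lab in zip(assignment, labels))
    return hits / len(labels)


if __name__ == "__main__":
    points, labels = make_toy_embeddings()
    assignment = kmeans(points, k=2)
    mapping = map_clusters_to_classes(assignment, labels)
    print("cluster -> class:", mapping)
    print("accuracy:", accuracy(assignment, labels, mapping))
```

With real data, the toy points would be replaced by the vectors produced by one of the embedding models, and the accuracy would be compared against a random baseline for the same number of classes.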


