Unsupervised Sentence Representations as Word Information Series: Revisiting TF--IDF

Sentence representation at the semantic level is a challenging task for Natural Language Processing and Artificial Intelligence. Despite advances in word embeddings (i.e., word vector representations), capturing sentence meaning remains an open question due to the complexity of semantic interactions among words. In this paper, we present an embedding method aimed at learning unsupervised sentence representations from unlabeled text. We propose an unsupervised method that models a sentence as a weighted series of word embeddings. The weights of the word embeddings are fitted using Shannon's word entropies provided by the Term Frequency--Inverse Document Frequency (TF--IDF) transform. The hyperparameters of the model can be selected according to the properties of the data (e.g., sentence length and textual genre). Hyperparameter selection covers word embedding methods and dimensionalities, as well as weighting schemata. Our method offers several advantages over existing approaches: identifiable modules, short training times, online inference of (unseen) sentence representations, and independence from domain, external knowledge, and language resources. Results show that our model outperforms the state of the art on well-known Semantic Textual Similarity (STS) benchmarks. Moreover, it reaches state-of-the-art performance even when compared to supervised and knowledge-based STS systems.
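To make the core idea concrete, the following is a minimal sketch of a TF-IDF-weighted combination of word embeddings, the general family the abstract describes. It is not the paper's exact model (the entropy-based weighting and hyperparameter selection are omitted); the toy corpus, the `tfidf_weights` and `sentence_embedding` helpers, and the random embedding table are all illustrative assumptions.

```python
import math
import numpy as np

def tfidf_weights(sentences):
    """Compute per-word IDF over a small corpus of tokenized sentences.

    Illustrative helper, not the paper's implementation: uses the plain
    log(N / df) variant of inverse document frequency.
    """
    n = len(sentences)
    df = {}
    for sent in sentences:
        for w in set(sent):
            df[w] = df.get(w, 0) + 1
    return {w: math.log(n / c) for w, c in df.items()}

def sentence_embedding(sent, embeddings, idf, dim=4):
    """Weighted average of word vectors: TF within the sentence times IDF.

    Words missing from `embeddings` or `idf` contribute nothing.
    """
    vec = np.zeros(dim)
    total = 0.0
    for w in sent:
        weight = (sent.count(w) / len(sent)) * idf.get(w, 0.0)
        vec += weight * embeddings.get(w, np.zeros(dim))
        total += weight
    return vec / total if total > 0 else vec

# Toy usage: three short "documents" and random word vectors stand in
# for a real corpus and pretrained embeddings.
corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["a", "cat", "ran"]]
idf = tfidf_weights(corpus)
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=4) for w in idf}
vec = sentence_embedding(corpus[0], emb, idf)
```

Because the weights depend only on corpus statistics and a fixed embedding table, an unseen sentence can be embedded online with no retraining, which is one of the advantages the abstract highlights.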
