Learning Semantic Similarity for Very Short Texts

12/02/2015
by Cedric De Boom, et al.

Leveraging data on social media, such as Twitter and Facebook, requires information retrieval algorithms that can relate very short text fragments to each other. Traditional text similarity methods such as tf-idf cosine similarity, which are based on word overlap, mostly fail to produce good results in this case, since there is little or no word overlap. Recently, distributed word representations, or word embeddings, have been shown to successfully allow words to match on the semantic level. In order to pair short text fragments - as a concatenation of separate words - an adequate distributed sentence representation is needed, which in existing literature is often obtained by naively combining the individual word representations. We therefore investigate several text representations, each a combination of word embeddings, in the context of semantic pair matching, and compare the effectiveness of these naive techniques, as well as traditional tf-idf similarity, for fragments of different lengths. Our main contribution is a first step towards a hybrid method that combines the strength of dense distributed representations - as opposed to sparse term matching - with the strength of tf-idf based methods to automatically reduce the impact of less informative terms. Our new approach outperforms the existing techniques in a toy experimental set-up, leading to the conclusion that the combination of word embeddings and tf-idf information might lead to a better model for semantic content within very short text fragments.
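The contrast drawn in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the hybrid method, so the code below is not the authors' approach: it assumes a simple idf-weighted sum of word vectors as a stand-in for "combining word embeddings with tf-idf information". The word vectors in `EMB`, the corpus in `CORPUS`, and all function names are hypothetical and invented for this illustration.

```python
import math
from collections import Counter

# Hypothetical 3-d word vectors, invented for this illustration only.
EMB = {
    "the": [0.0, 0.0, 1.0], "an": [0.0, 0.0, -1.0],
    "great": [0.0, 1.0, 0.0], "excellent": [0.1, 0.9, 0.0],
    "film": [1.0, 0.0, 0.0], "movie": [0.9, 0.1, 0.0],
}

# Tiny made-up corpus, used only to estimate document frequencies.
CORPUS = [
    ["the", "great", "film"],
    ["an", "excellent", "movie"],
    ["the", "train", "was", "late"],
    ["an", "apple", "fell"],
]

def idf_table(corpus):
    """idf(w) = log(N / df(w)), where df counts documents containing w."""
    n = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc))
    return {w: math.log(n / c) for w, c in df.items()}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def tfidf_cosine(a, b, idf):
    """Sparse term matching: only words shared by both texts contribute."""
    va = {w: c * idf[w] for w, c in Counter(a).items()}
    vb = {w: c * idf[w] for w, c in Counter(b).items()}
    vocab = sorted(set(va) | set(vb))
    return cosine([va.get(w, 0.0) for w in vocab],
                  [vb.get(w, 0.0) for w in vocab])

def aggregate(tokens, weights=None):
    """Sum the word vectors, optionally scaling each by a per-word weight."""
    total = [0.0] * len(next(iter(EMB.values())))
    for w in tokens:
        k = 1.0 if weights is None else weights[w]
        total = [t + k * x for t, x in zip(total, EMB[w])]
    return total

idf = idf_table(CORPUS)
a = ["the", "great", "film"]
b = ["an", "excellent", "movie"]

sparse = tfidf_cosine(a, b, idf)                              # no shared words
plain = cosine(aggregate(a), aggregate(b))                    # unweighted sum
weighted = cosine(aggregate(a, idf), aggregate(b, idf))       # idf-weighted sum
print(sparse, plain, weighted)
```

In this toy setting the two fragments share no words, so the tf-idf cosine similarity is zero. The unweighted embedding sum is pulled apart by the frequent words "the" and "an", while the idf-weighted sum downweights them and yields a higher similarity. The numbers carry no empirical meaning, since every input is made up.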

research
07/02/2016

Representation learning for very short texts using weighted word embedding aggregation

Short text messages such as tweets are very noisy and sparse in their us...
research
02/17/2016

A Comprehensive Comparative Study of Word and Sentence Similarity Measures

Sentence similarity is considered the basis of many natural language tas...
research
10/07/2020

MuSeM: Detecting Incongruent News Headlines using Mutual Attentive Semantic Matching

Measuring the congruence between two texts has several useful applicatio...
research
02/25/2020

Declarative Memory-based Structure for the Representation of Text Data

In the era of intelligent computing, computational progress in text proc...
research
05/26/2021

A data-driven strategy to combine word embeddings in information retrieval

Word embeddings are vital descriptors of words in unigram representation...
research
01/21/2021

Multi-sense embeddings through a word sense disambiguation process

Natural Language Understanding has seen an increasing number of publicat...
research
02/10/2019

Word embeddings for idiolect identification

The term idiolect refers to the unique and distinctive use of language o...
