Improving a tf-idf weighted document vector embedding

02/26/2019
by Craig W. Schmidt, et al.

We examine a number of methods to compute a dense vector embedding for a document in a corpus, given a set of word vectors such as those from word2vec or GloVe. We describe two methods that can improve upon a simple weighted sum and that are optimal in the sense that they maximize a particular weighted cosine similarity measure. We consider several weighting functions, including inverse document frequency (idf), smooth inverse frequency (SIF), and the sub-sampling function used in word2vec. We find that idf works best for our applications. We also apply the common component removal proposed by Arora et al. as a post-processing step and find it helpful in most cases. We compare these embedding variations to the doc2vec embedding on a new evaluation task using TripAdvisor reviews, and also on the CQADupStack benchmark from the literature.
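
To make the baseline concrete, the following is a minimal sketch of the simple weighted sum of word vectors that the abstract takes as its starting point, with idf and SIF as interchangeable weighting functions. It is not the authors' implementation. The function names, the toy corpus, the two-dimensional toy word vectors, the idf variant log(N / df), the SIF parameter a = 1e-3, and the final unit-length normalisation are all assumptions made for illustration. The two improved methods, the word2vec sub-sampling weights, and common component removal are not shown.

```python
import math
from collections import Counter


def idf_weights(docs):
    """idf(w) = log(N / df(w)); one common variant, assumed here."""
    n_docs = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    return {w: math.log(n_docs / count) for w, count in df.items()}


def sif_weights(docs, a=1e-3):
    """SIF weight a / (a + p(w)), with p(w) the corpus unigram probability."""
    tf = Counter(w for doc in docs for w in doc)
    total = sum(tf.values())
    return {w: a / (a + count / total) for w, count in tf.items()}


def embed(doc, word_vectors, weights):
    """Weighted sum of word vectors, scaled to unit length (our choice)."""
    dim = len(next(iter(word_vectors.values())))
    vec = [0.0] * dim
    for w in doc:
        if w in word_vectors and w in weights:
            for i, x in enumerate(word_vectors[w]):
                vec[i] += weights[w] * x
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


# Hypothetical toy corpus and word vectors; real use would load
# pretrained word2vec or GloVe vectors instead.
docs = [
    ["the", "hotel", "was", "clean"],
    ["the", "room", "was", "clean"],
    ["the", "food", "was", "cold"],
]
word_vectors = {
    "the": [0.5, 0.5], "was": [0.5, 0.5],
    "hotel": [1.0, 0.0], "room": [0.9, 0.1],
    "clean": [0.0, 1.0], "food": [0.0, -1.0], "cold": [-1.0, 0.0],
}

idf = idf_weights(docs)
sif = sif_weights(docs)
doc_vectors = [embed(doc, word_vectors, idf) for doc in docs]
sim_hotel_room = cosine(doc_vectors[0], doc_vectors[1])
sim_hotel_food = cosine(doc_vectors[0], doc_vectors[2])
```

In this toy setup, words that occur in every document ("the", "was") receive an idf weight of zero and drop out of the sum, so the two reviews about a clean hotel or room end up closer to each other than to the review about cold food. Passing `sif` instead of `idf` to `embed` swaps the weighting function without changing anything else.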

research
09/30/2019

A Critique of the Smooth Inverse Frequency Sentence Embeddings

We critically review the smooth inverse frequency sentence embedding met...
research
08/02/2015

Class Vectors: Embedding representation of Document Classes

Distributed representations of words and paragraphs as semantic embeddin...
research
02/09/2019

A new simple and effective measure for bag-of-word inter-document similarity measurement

To measure the similarity of two documents in the bag-of-words (BoW) vec...
research
03/22/2018

Context is Everything: Finding Meaning Statistically in Semantic Spaces

This paper introduces a simple and explicit measure of word importance i...
research
07/18/2019

Evaluating the Utility of Document Embedding Vector Difference for Relation Learning

Recent work has demonstrated that vector offsets obtained by subtracting...
research
04/22/2020

Preserving the Hypernym Tree of WordNet in Dense Embeddings

In this paper, we provide a novel way to generate low-dimension (dense) ...
research
03/28/2022

Specialized Document Embeddings for Aspect-based Similarity of Research Papers

Document embeddings and similarity measures underpin content-based recom...
