WordRank: Learning Word Embeddings via Robust Ranking

06/09/2015
by   Shihao Ji, et al.

Embedding words in a vector space has gained a lot of attention in recent years. While state-of-the-art methods provide efficient computation of word similarities via a low-dimensional matrix embedding, their motivation is often left unclear. In this paper, we argue that word embedding can be naturally viewed as a ranking problem due to the ranking nature of the evaluation metrics. Based on this insight, we propose a novel framework, WordRank, that efficiently estimates word representations via robust ranking, in which attention and robustness to noise are readily achieved via DCG-like ranking losses. The performance of WordRank is measured on word similarity and word analogy benchmarks, and the results are compared to state-of-the-art word embedding techniques. Our algorithm is highly competitive with the state of the art on large corpora, and outperforms it by a significant margin when the training set is limited (i.e., sparse and noisy). With 17 million tokens, WordRank performs almost as well as existing methods using 7.2 billion tokens on a popular word similarity benchmark. Our multi-node distributed implementation of WordRank is publicly available for general usage.
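To make the ranking view concrete, below is a minimal NumPy sketch of a WordRank-style objective. It is an illustration under stated assumptions, not the paper's exact formulation: the names soft_rank, wordrank_loss, alpha, and beta are ours; the rank of a context word is replaced by a logistic-type convex upper bound so it is differentiable; and rho(x) = log2(1 + x) stands in for the DCG-like transform mentioned in the abstract, which grows slowly for large ranks and thereby concentrates attention on the top of the ranked list while damping the influence of noisy pairs.

    import numpy as np

    def soft_rank(U, V, w, c):
        """Differentiable upper bound on the rank of context c for word w:
        each indicator I(score margin <= 0) is bounded by log2(1 + 2^(-margin))."""
        scores = V @ U[w]                        # scores of every context for word w
        margins = scores[c] - scores             # margin of c against every other context
        losses = np.log2(1.0 + 2.0 ** (-margins))
        return losses.sum() - losses[c]          # exclude the self term (margin 0 gives 1)

    def wordrank_loss(U, V, pairs, counts, alpha=1.0, beta=9.0):
        """Robust ranking objective: a concave, DCG-like transform of the
        (bounded) rank, weighted by co-occurrence counts. alpha and beta are
        illustrative offset/scale hyperparameters."""
        rho = lambda x: np.log2(1.0 + x)         # slowly growing, noise-robust transform
        return sum(r * rho(soft_rank(U, V, w, c) / alpha + beta)
                   for (w, c), r in zip(pairs, counts))

    # Tiny usage example with random vectors and a single (word, context) pair.
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(100, 16))    # word embeddings
    V = rng.normal(scale=0.1, size=(100, 16))    # context embeddings
    print(wordrank_loss(U, V, pairs=[(3, 7)], counts=[1.0]))

Because rho is concave, pushing a context from rank 2 to rank 1 reduces the loss far more than pushing one from rank 1000 to rank 999; this single property yields both the attention effect and the robustness to noise claimed in the abstract.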


Related research

06/01/2020  Attention Word Embedding
Word embedding models learn semantically rich vector representations of ...

12/28/2019  Learning Numeral Embeddings
Word embedding is an essential building block for deep learning methods ...

03/07/2020  Discovering linguistic (ir)regularities in word embeddings through max-margin separating hyperplanes
We experiment with new methods for learning how related words are positi...

11/30/2016  Low-dimensional Data Embedding via Robust Ranking
We describe a new method called t-ETE for finding a low-dimensional embe...

08/16/2021  IsoScore: Measuring the Uniformity of Vector Space Utilization
The recent success of distributed word representations has led to an inc...

12/20/2014  Word Representations via Gaussian Embedding
Current work in lexical distributed representations maps each word to a ...

10/21/2022  TransLIST: A Transformer-Based Linguistically Informed Sanskrit Tokenizer
Sanskrit Word Segmentation (SWS) is essential in making digitized texts ...
