Card-660: Cambridge Rare Word Dataset - a Reliable Benchmark for Infrequent Word Representation Models

08/28/2018
by Mohammad Taher Pilehvar, et al.

Rare word representation has recently enjoyed a surge of interest, owing to the crucial role that effective handling of infrequent words can play in accurate semantic understanding. However, there is a paucity of reliable benchmarks for evaluating and comparing these techniques. We show in this paper that the only existing benchmark (the Stanford Rare Word dataset) suffers from low-confidence annotations and limited vocabulary; hence, it does not constitute a solid comparison framework. To fill this evaluation gap, we propose the CAmbridge Rare word Dataset (Card-660), an expert-annotated word similarity dataset that provides a highly reliable, yet challenging, benchmark for rare word representation techniques. Through a set of experiments, we show that even the best mainstream word embeddings, with millions of words in their vocabularies, are unable to achieve a Pearson correlation higher than 0.43 on the dataset, compared to a human-level upper bound of 0.90. We release the dataset and the annotation materials at https://pilehvar.github.io/card-660/.
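As an illustration of how a word similarity benchmark of this kind is typically scored, the following minimal sketch compares cosine similarities from an embedding table against gold similarity judgements using Pearson correlation. It is not the authors' evaluation script: the toy vectors, the toy word pairs and scores, the function names, and the choice to assign a fixed fallback score to pairs containing an out-of-vocabulary word are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def evaluate(pairs, vectors, oov_score=0.0):
    """Score an embedding table on (word1, word2, gold_score) triples.

    Pairs with a word missing from `vectors` receive `oov_score`
    (an assumption of this sketch, not a documented protocol).
    Returns the Pearson correlation and the number of such pairs.
    """
    gold, predicted, missing = [], [], 0
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:
            predicted.append(cosine(vectors[w1], vectors[w2]))
        else:
            predicted.append(oov_score)
            missing += 1
        gold.append(score)
    return pearson(gold, predicted), missing


# Hypothetical toy data, made up for illustration only.
toy_vectors = {
    "cat": [1.0, 0.0],
    "feline": [0.9, 0.1],
    "car": [0.0, 1.0],
    "truck": [0.1, 0.9],
}
toy_pairs = [
    ("cat", "feline", 4.0),
    ("car", "truck", 3.5),
    ("cat", "car", 0.5),
    ("feline", "zyzzyva", 0.0),  # "zyzzyva" is out of vocabulary here
]

r, missing = evaluate(toy_pairs, toy_vectors)
print(f"Pearson r = {r:.3f}, pairs with a missing word = {missing}")
```

Counting the pairs that fall back to `oov_score` alongside the correlation makes it visible how much of a low score may stem from vocabulary coverage rather than from vector quality, which is the kind of limitation a rare word benchmark is meant to expose.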

research
08/28/2018

WiC: 10,000 Example Pairs for Evaluating Context-Sensitive Representations

By design, word embeddings are unable to model the dynamic nature of wor...
research
09/07/2021

Rare Words Degenerate All Words

Despite advances in neural network language model, the representation de...
research
09/05/2023

Haystack: A Panoptic Scene Graph Dataset to Evaluate Rare Predicate Classes

Current scene graph datasets suffer from strong long-tail distributions ...
research
05/05/2022

One Size Does Not Fit All: The Case for Personalised Word Complexity Models

Complex Word Identification (CWI) aims to detect words within a text tha...
research
07/24/2017

Learning Rare Word Representations using Semantic Bridging

We propose a methodology that adapts graph embedding techniques (DeepWal...
research
04/02/2019

Attentive Mimicking: Better Word Embeddings by Attending to Informative Contexts

Learning high-quality embeddings for rare words is a hard problem becaus...
research
08/30/2021

RetroGAN: A Cyclic Post-Specialization System for Improving Out-of-Knowledge and Rare Word Representations

Retrofitting is a technique used to move word vectors closer together or...
