Large Scale Substitution-based Word Sense Induction

10/14/2021
by   Matan Eyal, et al.

We present a word-sense induction method based on pre-trained masked language models (MLMs), which can cheaply scale to large vocabularies and large corpora. The result is a corpus that is sense-tagged according to a corpus-derived sense inventory, in which each sense is associated with indicative words. Evaluation on English Wikipedia that was sense-tagged using our method shows that both the induced senses and the per-instance sense assignments are of high quality, even compared to WSD methods such as Babelfy. Furthermore, by training a static word embedding algorithm on the sense-tagged corpus, we obtain high-quality static senseful embeddings. These outperform existing senseful embedding techniques on the WiC dataset and on a new outlier detection dataset we developed. The data-driven nature of the algorithm allows it to induce corpus-specific senses, which may not appear in standard sense inventories, as we demonstrate with a case study on the scientific domain.
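To make the idea of substitution-based sense induction concrete, here is a minimal sketch in Python. It is not the paper's algorithm: the substitute sets are made-up toy data standing in for what an MLM might propose for a masked occurrence of "bass", and the greedy Jaccard-overlap clustering, the `induce_senses` helper and the `threshold` value are all assumptions chosen for illustration. It only shows how instances with similar substitutes can be grouped into senses, each described by its most frequent substitutes.

```python
from collections import Counter

# Hypothetical toy data: for each occurrence of "bass", the substitutes a
# masked language model might propose. These sets are invented for this sketch.
instances = {
    "ctx1": {"guitar", "drum", "piano", "instrument"},
    "ctx2": {"trout", "salmon", "fish", "perch"},
    "ctx3": {"guitar", "piano", "violin", "drum"},
    "ctx4": {"fish", "trout", "pike", "salmon"},
}


def jaccard(a, b):
    """Jaccard overlap between two sets of substitute words."""
    return len(a & b) / len(a | b)


def induce_senses(instances, threshold=0.2):
    """Greedily group instances whose substitute sets overlap.

    Assumption: a simple one-pass clustering, used here only as a stand-in
    for a real clustering step. Each cluster keeps its member instance ids
    and a Counter of the substitutes seen in them.
    """
    clusters = []
    for inst_id, subs in instances.items():
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = jaccard(subs, set(cluster["words"]))
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            # No existing cluster is similar enough: open a new sense.
            best = {"members": [], "words": Counter()}
            clusters.append(best)
        best["members"].append(inst_id)
        best["words"].update(subs)
    return clusters


clusters = induce_senses(instances)

# Per-instance sense tags, e.g. "ctx1" maps to the index of its cluster.
sense_of = {inst: i for i, c in enumerate(clusters) for inst in c["members"]}

for i, cluster in enumerate(clusters):
    # The most frequent substitutes serve as indicative words for the sense.
    indicative = [w for w, _ in cluster["words"].most_common(3)]
    print(f"sense {i}: members={cluster['members']} indicative={indicative}")
```

In this toy setting the musical and fish contexts end up in separate clusters, and `sense_of` plays the role of the per-instance sense assignment; a tagged corpus of this kind is what a static embedding algorithm would then be trained on.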

research
10/05/2021

Learning Sense-Specific Static Embeddings using Contextualised Word Embeddings as a Proxy

Contextualised word embeddings generated from Neural Language Models (NL...
research
10/24/2016

Geometry of Polysemy

Vector representations of words have heralded a transformational approac...
research
09/28/2022

RuDSI: graph-based word sense induction dataset for Russian

We present RuDSI, a new benchmark for word sense induction (WSI) in Russ...
research
06/05/2019

Topic Sensitive Attention on Generic Corpora Corrects Sense Bias in Pretrained Embeddings

Given a small corpus D_T pertaining to a limited set of focused topics,...
research
03/22/2018

Word sense induction using word embeddings and community detection in complex networks

Word Sense Induction (WSI) is the ability to automatically induce word s...
research
02/01/2018

Adapting predominant and novel sense discovery algorithms for identifying corpus-specific sense differences

Word senses are not static and may have temporal, spatial or corpus-spec...
research
04/07/2022

Automatic WordNet Construction using Word Sense Induction through Sentence Embeddings

Language resources such as wordnets remain indispensable tools for diffe...
