Topic Sensitive Attention on Generic Corpora Corrects Sense Bias in Pretrained Embeddings

06/05/2019
by   Vihari Piratla, et al.

Given a small corpus D_T pertaining to a limited set of focused topics, our goal is to train embeddings that accurately capture the sense of words in the topic in spite of the limited size of D_T. These embeddings may be used in various tasks involving D_T. A popular strategy in limited data settings is to adapt pre-trained embeddings E trained on a large corpus. To correct for sense drift, fine-tuning, regularization, projection, and pivoting have been proposed recently. Among these, regularization informed by a word's corpus frequency performed well, but we improve upon it using a new regularizer based on the stability of its cooccurrence with other words. However, a thorough comparison across ten topics, spanning three tasks, with standardized settings of hyper-parameters, reveals that even the best embedding adaptation strategies provide small gains beyond well-tuned baselines, which many earlier comparisons ignored. In a bold departure from adapting pretrained embeddings, we propose using D_T to probe, attend to, and borrow fragments from any large, topic-rich source corpus (such as Wikipedia), which need not be the corpus used to pretrain embeddings. This step is made scalable and practical by suitable indexing. We reach the surprising conclusion that even limited corpus augmentation is more useful than adapting embeddings, which suggests that non-dominant sense information may be irrevocably obliterated from pretrained embeddings and cannot be salvaged by adaptation.
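The abstract's central idea is to use the small topic corpus D_T as a query against a large generic source corpus and to borrow only the topically relevant fragments before training embeddings from scratch. The following is a minimal sketch of that retrieve-and-augment step, not the paper's implementation: the paper uses proper indexing and topic-sensitive attention, whereas this sketch assumes a simple term-overlap score, and names such as topic_terms, select_fragments, min_overlap, and budget are illustrative choices.

```python
# Minimal sketch of corpus augmentation: probe a large generic source corpus
# (e.g., a Wikipedia dump) with the vocabulary of the small topic corpus D_T
# and keep only fragments that overlap the topic. Illustrative only; the
# paper's method relies on an index and attention rather than a linear scan.

from collections import Counter
from typing import Iterable, List


def topic_terms(d_t: Iterable[List[str]], top_k: int = 500) -> set:
    """Pick the most frequent words of the small topic corpus D_T as probe terms."""
    counts = Counter(w for sent in d_t for w in sent)
    return {w for w, _ in counts.most_common(top_k)}


def select_fragments(source: Iterable[List[str]],
                     terms: set,
                     min_overlap: int = 2,
                     budget: int = 100_000) -> List[List[str]]:
    """Keep source sentences mentioning at least `min_overlap` probe terms,
    up to `budget` sentences. In practice this scan would be replaced by
    queries against an inverted index over the source corpus."""
    kept = []
    for sent in source:
        if sum(1 for w in sent if w in terms) >= min_overlap:
            kept.append(sent)
            if len(kept) >= budget:
                break
    return kept
```

The union of D_T and the selected fragments can then be passed to any standard embedding trainer (e.g., word2vec or GloVe), which is the alternative the abstract argues is more effective than adapting pretrained embeddings.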


research
10/14/2021

Large Scale Substitution-based Word Sense Induction

We present a word-sense induction method based on pre-trained masked lan...
research
09/14/2019

Multi-view and Multi-source Transfers in Neural Topic Modeling with Pretrained Topic and Word Embeddings

Though word embeddings and topics are complementary representations, sev...
research
04/30/2020

Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too!

Topic models are a useful analysis tool to uncover the underlying themes...
research
04/17/2021

Multi-source Neural Topic Modeling in Multi-view Embedding Spaces

Though word embeddings and topics are complementary representations, sev...
research
03/30/2022

Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling

This work presents a new resource for borrowing identification and analy...
research
02/01/2018

Adapting predominant and novel sense discovery algorithms for identifying corpus-specific sense differences

Word senses are not static and may have temporal, spatial or corpus-spec...
research
09/22/2000

A Comparison between Supervised Learning Algorithms for Word Sense Disambiguation

This paper describes a set of comparative experiments, including cross-c...
