Unsupervised Low-Dimensional Vector Representations for Words, Phrases and Text that are Transparent, Scalable, and produce Similarity Metrics that are Complementary to Neural Embeddings

01/05/2018
by Neil R. Smalheiser, et al.

Neural embeddings are a popular set of methods for representing words, phrases or text as a low-dimensional vector (typically 50-500 dimensions). However, it is difficult to interpret these dimensions in a meaningful manner, and creating neural embeddings requires extensive training and tuning of multiple parameters and hyperparameters. We present here a simple unsupervised method for representing words, phrases or text as a low-dimensional vector, in which the meaning and relative importance of dimensions are transparent to inspection. We have created a near-comprehensive vector representation of words, and selected bigrams, trigrams and abbreviations, using the set of titles and abstracts in PubMed as a corpus. This vector is used to create several novel implicit word-word and text-text similarity metrics. The implicit word-word similarity metrics correlate well with human judgement of word pair similarity and relatedness, and outperform or equal all other reported methods on a variety of biomedical benchmarks, including several implementations of neural embeddings trained on PubMed corpora. Our implicit word-word metrics capture different aspects of word-word relatedness than word2vec-based metrics and are only partially correlated with them (rho = 0.5-0.8 depending on task and corpus). The vector representations of words, bigrams, trigrams, abbreviations, and PubMed titles+abstracts are all publicly available from http://arrowsmith.psych.uic.edu for release under a CC-BY-NC license. Several public web query interfaces are also available at the same site, including one which allows the user to specify a given word and view its most closely related terms according to direct co-occurrence as well as different implicit similarity metrics.
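To make the distinction between direct co-occurrence and an "implicit" similarity metric concrete, the sketch below computes a second-order similarity between two words as the cosine of their document co-occurrence profiles, so that words which never appear together can still score as related through shared neighbor terms. This is an illustrative toy, not the paper's actual construction: the corpus, the profile-building function, and the example words are all invented for demonstration.

```python
# Toy illustration (not the paper's method): "implicit" word-word similarity
# as cosine over document co-occurrence profiles, so two words that never
# co-occur directly can still be related through shared neighbor terms.
from collections import Counter
from itertools import combinations
import math

def cooccurrence_profiles(documents):
    """Map each word to a Counter of the words it co-occurs with per document."""
    profiles = {}
    for doc in documents:
        tokens = set(doc.lower().split())
        for a, b in combinations(sorted(tokens), 2):
            profiles.setdefault(a, Counter())[b] += 1
            profiles.setdefault(b, Counter())[a] += 1
    return profiles

def implicit_similarity(profiles, w1, w2):
    """Cosine similarity of two words' co-occurrence profiles (second order)."""
    p1 = profiles.get(w1, Counter())
    p2 = profiles.get(w2, Counter())
    dot = sum(p1[t] * p2[t] for t in p1.keys() & p2.keys())
    norm1 = math.sqrt(sum(v * v for v in p1.values()))
    norm2 = math.sqrt(sum(v * v for v in p2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Invented mini-corpus standing in for PubMed titles/abstracts.
docs = [
    "aspirin reduces fever and inflammation",
    "ibuprofen reduces fever and pain",
    "aspirin inhibits platelet aggregation",
]
profiles = cooccurrence_profiles(docs)
# "aspirin" and "ibuprofen" never co-occur directly, yet they share context
# words ("reduces", "fever", "and"), so implicit similarity is well above zero.
print(implicit_similarity(profiles, "aspirin", "ibuprofen"))  # ~0.57
```

The partial overlap with neural embeddings reported above (rho = 0.5-0.8) would then be measured as a rank correlation, such as Spearman's rho, between scores like these and word2vec cosine scores over the same set of word pairs.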



Related research

06/18/2015 · "The Sum of Its Parts": Joint Learning of Word and Phrase Representations with Autoencoders
Recently, there has been a lot of effort to represent words in continuou...

03/13/2018 · Enhanced Word Representations for Bridging Anaphora Resolution
Most current models of word representations (e.g., GloVe) have successfull...

12/16/2014 · Rehabilitation of Count-based Models for Word Vector Representations
Recent works on word representations mostly rely on predictive models. D...

05/18/2021 · WOVe: Incorporating Word Order in GloVe Word Embeddings
Word vector representations open up new opportunities to extract useful ...

04/17/2021 · Characterizing Idioms: Conventionality and Contingency
Idioms are unlike other phrases in two important ways. First, the words ...

11/06/2019 · Gextext: Disease Network Extraction from Biomedical Literature
PURPOSE: We propose a fully unsupervised method to learn latent disease ...

09/07/2021 · Learning grounded word meaning representations on similarity graphs
This paper introduces a novel approach to learn visually grounded meanin...
