On the Impact of Knowledge-based Linguistic Annotations in the Quality of Scientific Embeddings

by   Andres Garcia-Silva, et al.

In essence, embedding algorithms work by optimizing the distance between a word and its usual context in order to generate an embedding space that encodes the distributional representation of words. In addition to single words or word pieces, other features which result from the linguistic analysis of text, including lexical, grammatical and semantic information, can be used to improve the quality of embedding spaces. However, until now we did not have a precise understanding of the impact that such individual annotations and their possible combinations may have in the quality of the embeddings. In this paper, we conduct a comprehensive study on the use of explicit linguistic annotations to generate embeddings from a scientific corpus and quantify their impact in the resulting representations. Our results show how the effect of such annotations in the embeddings varies depending on the evaluation task. In general, we observe that learning embeddings using linguistic annotations contributes to achieve better evaluation results.



page 1

page 2

page 3

page 4


Uncovering divergent linguistic information in word embeddings with lessons for intrinsic and extrinsic evaluation

Following the recent success of word embeddings, it has been argued that...

DirectProbe: Studying Representations without Classifiers

Understanding how linguistic structures are encoded in contextualized em...

On the Effects of Knowledge-Augmented Data in Word Embeddings

This paper investigates techniques for knowledge injection into word emb...

An Approach to Mispronunciation Detection and Diagnosis with Acoustic, Phonetic and Linguistic (APL) Embeddings

Many mispronunciation detection and diagnosis (MD D) research approach...

Towards Better Understanding with Uniformity and Explicit Regularization of Embeddings in Embedding-based Neural Topic Models

Embedding-based neural topic models could explicitly represent words and...

Picking BERT's Brain: Probing for Linguistic Dependencies in Contextualized Embeddings Using Representational Similarity Analysis

As the name implies, contextualized representations of language are typi...

Concept2vec: Metrics for Evaluating Quality of Embeddings for Ontological Concepts

Although there is an emerging trend towards generating embeddings for pr...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.