Domain-Specific Word Embeddings with Structure Prediction

10/06/2022
by Stephanie Brandl et al.

Complementary to finding good general word embeddings, an important question for representation learning is to find dynamic word embeddings, e.g., across time or domain. Current methods do not offer a way to use or predict information on structure between sub-corpora, time, or domain, and dynamic embeddings can only be compared after post-alignment. We propose novel word embedding methods that simultaneously provide general word representations for the whole corpus, domain-specific representations for each sub-corpus, sub-corpus structure, and embedding alignment. We present an empirical evaluation on New York Times articles and two English Wikipedia datasets with articles on science and philosophy. Our method, called Word2Vec with Structure Prediction (W2VPred), outperforms baselines on general analogy tests, domain-specific analogy tests, and several specific word embedding evaluations, as well as on structure prediction when no structure is given a priori. As a use case in the field of Digital Humanities, we demonstrate how to raise novel research questions for high literature from the German Text Archive.
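To make the setup concrete, the following is a minimal, purely illustrative Python sketch, not the authors' W2VPred implementation (which learns all components jointly). It trains a general word2vec model on the whole corpus and warm-started domain-specific models per sub-corpus with gensim, then estimates a rough structure between sub-corpora from the similarity of their embeddings. The sub-corpus names and toy sentences are invented placeholders.

```python
# Illustrative sketch only -- not the authors' W2VPred method, which learns
# general embeddings, domain embeddings, alignment, and sub-corpus structure
# jointly. This toy version approximates the ingredients with off-the-shelf
# gensim word2vec; all corpus data below is made up.

import numpy as np
from gensim.models import Word2Vec

# Hypothetical sub-corpora: each value is a list of tokenised sentences.
sub_corpora = {
    "science": [
        ["the", "cell", "divides", "under", "the", "microscope"],
        ["the", "experiment", "confirms", "the", "theory"],
    ],
    "philosophy": [
        ["the", "mind", "reflects", "on", "the", "theory"],
        ["reason", "grounds", "the", "argument"],
    ],
}

# General representations: one model trained on the union of all sub-corpora.
all_sentences = [s for sents in sub_corpora.values() for s in sents]
general = Word2Vec(all_sentences, vector_size=50, min_count=1, seed=0)

# Domain-specific representations: one model per sub-corpus, warm-started from
# the general vectors so the spaces stay roughly aligned (a crude stand-in for
# the alignment that the paper optimises jointly).
domain_models = {}
for name, sentences in sub_corpora.items():
    model = Word2Vec(vector_size=50, min_count=1, seed=0)
    model.build_vocab(all_sentences)
    for word, idx in model.wv.key_to_index.items():
        model.wv.vectors[idx] = general.wv[word]
    model.train(sentences, total_examples=len(sentences), epochs=20)
    domain_models[name] = model

def corpus_similarity(m1, m2):
    """Average cosine similarity over the shared vocabulary: a rough proxy
    for the structural closeness of two sub-corpora."""
    shared = set(m1.wv.key_to_index) & set(m2.wv.key_to_index)
    sims = [
        np.dot(m1.wv[w], m2.wv[w])
        / (np.linalg.norm(m1.wv[w]) * np.linalg.norm(m2.wv[w]))
        for w in shared
    ]
    return float(np.mean(sims))

print(corpus_similarity(domain_models["science"], domain_models["philosophy"]))
```

Warm-starting the domain models from the general vectors is only a stand-in for the joint training described in the abstract; in W2VPred the alignment and the sub-corpus structure are learned together with the embeddings rather than fixed by initialisation.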
