Creation and Evaluation of Datasets for Distributional Semantics Tasks in the Digital Humanities Domain

03/07/2019
by Gerhard Wohlgenannt, et al.

Word embeddings are well studied in the general domain: they are usually trained on large text corpora and have been evaluated, for example, on word similarity and analogy tasks, and as input to downstream NLP processes. In contrast, this work explores the suitability of word embedding technologies in the specialized digital humanities domain. After training embedding models of various types on two popular fantasy novel book series, we evaluate their performance on two task types: term analogies and word intrusion. To this end, we manually construct test datasets with domain experts. Our contributions include an evaluation of various word embedding techniques on the different task types, with the finding that even embeddings trained on small corpora perform well on, for example, the word intrusion task. Furthermore, we provide extensive, high-quality digital humanities datasets for further investigation, as well as the implementation needed to reproduce or extend the experiments.
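To make the two task types concrete, here is a minimal sketch of how a term analogy and a word intrusion item could be scored against an embedding model. It is an illustration under stated assumptions, not the authors' implementation: the toy vectors, the vocabulary, and the function names `solve_analogy` and `find_intruder` are hypothetical, the analogy is solved with the common vector-offset method, and the intruder is taken to be the term with the lowest mean cosine similarity to the rest of the set.

```python
import numpy as np

# Hypothetical toy embeddings (3 dimensions); a real model would be
# trained on the book corpora and have far more dimensions and terms.
emb = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "sword": np.array([-1.0, 0.1, 0.1]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def solve_analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset method."""
    target = emb[b] - emb[a] + emb[c]
    # The three question terms are excluded from the candidate answers.
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

def find_intruder(emb, words):
    """Return the term least similar, on average, to the other terms."""
    def mean_similarity(w):
        return np.mean([cosine(emb[w], emb[o]) for o in words if o != w])
    return min(words, key=mean_similarity)

analogy_answer = solve_analogy(emb, "man", "woman", "king")
intruder = find_intruder(emb, ["man", "woman", "king", "queen", "sword"])
print(analogy_answer, intruder)
```

Accuracy on a test dataset would then be the share of items for which the predicted answer matches the one chosen by the domain experts.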
