Determining the Characteristic Vocabulary for a Specialized Dictionary using Word2vec and a Directed Crawler

05/31/2016
by Gregory Grefenstette, et al.

Specialized dictionaries are used to understand concepts in specific domains, especially where those concepts are not part of the general vocabulary or have meanings that differ from ordinary language. The first step in creating a specialized dictionary is detecting the characteristic vocabulary of the domain in question. Classical methods for detecting this vocabulary involve gathering a domain corpus, calculating statistics on the terms found there, and then comparing these statistics to those of a background or general-language corpus. Terms that are found significantly more often in the specialized corpus than in the background corpus are candidates for the characteristic vocabulary of the domain. Here we present two tools, a directed crawler and a distributional semantics package, that can be used together, circumventing the need for a background corpus. Both tools are available on the web.
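
To make the classical baseline described in the abstract concrete, the following is a minimal sketch of comparing term statistics between a domain corpus and a background corpus. It is an illustration only, not the authors' tools: the function names `tokenize` and `characteristic_terms`, the add-one smoothing, the log-ratio score, and the `min_count` and `top_n` parameters are all assumptions chosen for brevity.

```python
from collections import Counter
import math
import re


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def characteristic_terms(domain_text, background_text, min_count=2, top_n=10):
    """Rank domain terms by smoothed log ratio of relative frequencies.

    Assumed scoring: log(P(term | domain) / P(term | background)),
    with add-one smoothing so unseen background terms do not divide by zero.
    """
    domain = Counter(tokenize(domain_text))
    background = Counter(tokenize(background_text))
    domain_total = sum(domain.values())
    background_total = sum(background.values())
    vocab_size = len(set(domain) | set(background))

    scored = []
    for term, count in domain.items():
        if count < min_count:
            continue  # skip terms too rare to give a stable estimate
        p_domain = (count + 1) / (domain_total + vocab_size)
        p_background = (background[term] + 1) / (background_total + vocab_size)
        scored.append((term, math.log(p_domain / p_background)))

    # Highest scores are the strongest candidates for domain vocabulary.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


domain_text = (
    "the espresso machine brews espresso with a portafilter "
    "and the portafilter holds the coffee"
)
background_text = "the cat sat on the mat and the dog saw the cat with a ball"
ranked = characteristic_terms(domain_text, background_text)
```

On this toy input, the domain-specific words rank above the function word "the", which receives a negative score because it is relatively more frequent in the background text. A real application would need much larger corpora and probably a significance test rather than a raw ratio.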

research
09/21/2020

Accent Estimation of Japanese Words from Their Surfaces and Romanizations for Building Large Vocabulary Accent Dictionaries

In Japanese text-to-speech (TTS), it is necessary to add accent informat...
research
09/15/2020

Using Known Words to Learn More Words: A Distributional Analysis of Child Vocabulary Development

Why do children learn some words before others? Understanding individual...
research
08/04/2022

Vocabulary Transfer for Medical Texts

Vocabulary transfer is a transfer learning subtask in which language mod...
research
04/19/2019

Recognizing the vocabulary of Brazilian popular newspapers with a free-access computational dictionary

We report an experiment to check the identification of a set of words in...
research
06/15/2018

Stylized innovation: interrogating incrementally available randomised dictionaries

Inspired by recent work of Fink, Reeves, Palma and Farr (2017) on innova...
research
05/18/2021

An Automated Method to Enrich Consumer Health Vocabularies Using GloVe Word Embeddings and An Auxiliary Lexical Resource

Background: Clear language makes communication easier between any two pa...
