Deep learning models for representing out-of-vocabulary words

07/14/2020
by Johannes V. Lochter, et al.

Communication has become increasingly dynamic with the popularization of social networks and applications that allow people to express themselves and communicate instantly. In this scenario, the quality of distributed representation models is affected by new words that appear frequently or that result from spelling errors. Such words, which are unknown to the models and are called out-of-vocabulary (OOV) words, must be handled properly so that they do not degrade the quality of natural language processing (NLP) applications, which depend on an appropriate vector representation of the texts. To better understand this problem and find the best techniques for handling OOV words, we present a comprehensive performance evaluation of deep learning models for representing OOV words. We performed an intrinsic evaluation using a benchmark dataset and an extrinsic evaluation using three NLP tasks: text categorization, named entity recognition, and part-of-speech tagging. Although the results indicated that the best technique for handling OOV words differs from task to task, Comick, a deep learning method that infers the embedding from the context and the morphological structure of the OOV word, obtained promising results.
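To make the OOV problem concrete, the sketch below shows one simple way a representation can be inferred for a word that is missing from an embedding table: averaging vectors associated with its character n-grams. This is only an illustrative morphological heuristic, not Comick and not any of the models evaluated in the study. The toy embedding values and the names `EMBEDDINGS`, `char_ngrams`, `build_ngram_table`, and `embed` are assumptions made up for this example.

```python
from collections import defaultdict

# Toy embedding table (hypothetical values, for illustration only).
EMBEDDINGS = {
    "running": [0.9, 0.1, 0.0],
    "runner": [0.8, 0.2, 0.0],
    "jumping": [0.1, 0.9, 0.0],
    "table": [0.0, 0.0, 1.0],
}


def char_ngrams(word, n=3):
    """Return the character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def build_ngram_table(embeddings, n=3):
    """Give each n-gram the mean vector of the known words that contain it."""
    buckets = defaultdict(list)
    for word, vector in embeddings.items():
        for gram in set(char_ngrams(word, n)):
            buckets[gram].append(vector)
    return {gram: mean(vectors) for gram, vectors in buckets.items()}


def embed(word, embeddings, ngram_table, n=3):
    """Look a word up; for an OOV word, average its known n-gram vectors."""
    if word in embeddings:
        return embeddings[word]
    known = [ngram_table[g] for g in char_ngrams(word, n) if g in ngram_table]
    if not known:
        # Nothing recognizable in the word: fall back to a zero vector.
        dimension = len(next(iter(embeddings.values())))
        return [0.0] * dimension
    return mean(known)


NGRAM_TABLE = build_ngram_table(EMBEDDINGS)

# "runing" is a misspelling that is absent from the toy vocabulary.
oov_vector = embed("runing", EMBEDDINGS, NGRAM_TABLE)
```

In this toy setting the misspelling "runing" shares n-grams such as "run" and "ing" with known words, so its inferred vector lands closer to "running" than to "table". A heuristic like this uses only the word's form; the abstract notes that Comick also takes the surrounding context into account.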


