Multi hash embeddings in spaCy

12/19/2022
by Lester James Miranda, et al.

The distributed representation of symbols is one of the key technologies in machine learning systems today, playing a pivotal role in modern natural language processing. Traditional word embeddings associate a separate vector with each word. While this approach is simple and leads to good performance, it requires a lot of memory to represent a large vocabulary. To reduce the memory footprint, the default embedding layer in spaCy is a hash embeddings layer: a stochastic approximation of traditional embeddings that provides unique vectors for a large number of words without explicitly storing a separate vector for each of them. To compute meaningful representations for both known and unknown words, hash embeddings represent each word as a summary of the normalized word form, subword information, and word shape. Together, these features produce a multi-embedding of a word. In this technical report we first lay out a bit of history and introduce the embedding methods in spaCy in detail. We then critically evaluate the hash embedding architecture with multi-embeddings on Named Entity Recognition datasets from a variety of domains and languages. The experiments validate most key design choices behind spaCy's embedders, but we also uncover a few surprising results.
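To make the idea concrete, here is a minimal, hedged sketch of a multi-hash embedding in plain Python/NumPy. It is not spaCy's actual implementation (spaCy's MultiHashEmbed lives in thinc and uses MurmurHash with learned tables and a projection layer); the feature extractors, table size, and summation below are illustrative assumptions only. The point it demonstrates is the one from the abstract: each word is described by several features (normalized form, prefix, suffix, shape), each feature is hashed into a small fixed-size table, and the resulting rows are combined, so known and unknown words alike get a vector without storing one row per word.

```python
import numpy as np

# Illustrative sketch of a multi-hash embedding (NOT spaCy's actual code).
# Each word is described by several features; every feature is hashed into
# a small fixed-size table, so unseen words still receive vectors without
# storing a dedicated row per vocabulary item.

rng = np.random.default_rng(0)
N_ROWS, WIDTH = 5000, 96                      # table size and vector width (arbitrary here)
FEATURES = ("norm", "prefix", "suffix", "shape")
tables = {f: rng.normal(0, 1, (N_ROWS, WIDTH)).astype("float32") for f in FEATURES}

def word_features(word: str) -> dict:
    """Crude stand-ins for spaCy's NORM/PREFIX/SUFFIX/SHAPE attributes."""
    shape = "".join(
        "X" if c.isupper() else "x" if c.islower() else "d" if c.isdigit() else c
        for c in word
    )
    return {"norm": word.lower(), "prefix": word[:1], "suffix": word[-3:], "shape": shape}

def embed(word: str) -> np.ndarray:
    """Hash each feature into its table and sum the rows into one vector."""
    feats = word_features(word)
    rows = [tables[name][hash((name, value)) % N_ROWS] for name, value in feats.items()]
    return np.sum(rows, axis=0)

# Both known and unknown words get a vector of the same width.
print(embed("spaCy").shape)        # (96,)
print(embed("Zyzzogeton").shape)   # (96,)
```

The memory saving comes from the fixed table size: many distinct feature values share rows through hash collisions, and because a word's vector is built from several independently hashed features, two words rarely collide on all of them at once.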

