kōan: A Corrected CBOW Implementation

12/30/2020
by Ozan Irsoy et al.

It is a common belief in the NLP community that continuous bag-of-words (CBOW) word embeddings tend to underperform skip-gram (SG) embeddings. We find that this belief is founded less on theoretical differences in their training objectives than on faulty CBOW implementations in standard software libraries such as the official word2vec.c implementation and Gensim. We show that our correct implementation of CBOW yields word embeddings that are fully competitive with SG on various intrinsic and extrinsic tasks, while being more than three times as fast to train. We release our implementation, kōan, at https://github.com/bloomberg/koan.
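To make the implementation issue concrete, here is a minimal Python sketch (not kōan's actual C++ code; variable names are illustrative) of one negative-sampling CBOW update. The hidden layer is the mean of the context vectors, so the gradient flowing back to each context vector should carry a 1/|context| factor; the widely used implementations omit that factor, which is the discrepancy the paper corrects.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_update(W_in, W_out, context_ids, target_id, negative_ids,
                lr=0.05, corrected=True):
    """One SGD step of negative-sampling CBOW (illustrative sketch).

    W_in:  (V, d) array of input/context embeddings
    W_out: (V, d) array of output/target embeddings
    """
    C = len(context_ids)
    h = W_in[context_ids].mean(axis=0)   # hidden layer: mean of context vectors

    grad_h = np.zeros_like(h)
    pairs = [(target_id, 1.0)] + [(n, 0.0) for n in negative_ids]
    for wid, label in pairs:
        score = sigmoid(W_out[wid] @ h)
        g = score - label                # gradient of the logistic loss
        grad_h += g * W_out[wid]
        W_out[wid] -= lr * g * h

    # Because h is a *mean*, d(loss)/d(context vector) = grad_h / C.
    # word2vec.c and Gensim apply grad_h without the 1/C factor;
    # the corrected update keeps it.
    scale = 1.0 / C if corrected else 1.0
    for cid in context_ids:
        W_in[cid] -= lr * scale * grad_h
    return W_in, W_out
```

With `corrected=False`, the update effectively treats the hidden layer as a sum during backpropagation while using a mean in the forward pass, an inconsistency that grows with the context window size.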


