Asynchronous Training of Word Embeddings for Large Text Corpora

12/07/2018
by   Avishek Anand, et al.

Word embeddings are a powerful approach for analyzing language and have been widely popular in numerous tasks in information retrieval and text mining. Training embeddings over huge corpora is computationally expensive because the input is typically processed sequentially and parameters are updated synchronously. Distributed architectures for asynchronous training that have been proposed either focus on scaling vocabulary sizes and dimensionality or suffer from expensive synchronization latencies. In this paper, we propose a scalable approach to train word embeddings by instead partitioning the input space, in order to scale to massive text corpora while not sacrificing the quality of the embeddings. Our training procedure does not involve any parameter synchronization except a final sub-model merge phase that typically executes in a few minutes. Our distributed training scales seamlessly to large corpus sizes, and we obtain comparable, and sometimes even up to 45% better, performance from our distributed procedure, which requires 1/10 of the time taken by the baseline approach. Finally, we also show that we are robust to missing words in sub-models and are able to effectively reconstruct word representations.
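To make the idea concrete, here is a minimal sketch in Python of what such a pipeline could look like, using gensim's Word2Vec for the per-partition training. Everything beyond the high-level recipe in the abstract is an assumption: the Procrustes-based alignment, the averaging merge, and all function names are illustrative, and the paper's actual merge procedure may differ.

```python
"""Sketch: synchronization-free embedding training over corpus partitions.

Each partition trains an independent Word2Vec sub-model (no parameter
server, no gradient exchange). A final merge phase aligns sub-models to
a reference with an orthogonal Procrustes rotation over their shared
vocabulary and averages the vectors; a word missing from some sub-models
is reconstructed from the sub-models that did observe it. This is an
illustrative assumption, not the paper's published code.
"""
import numpy as np
from gensim.models import Word2Vec


def train_submodel(sentences, dim=100):
    # Train one partition completely independently of the others.
    return Word2Vec(sentences, vector_size=dim, window=5,
                    min_count=5, workers=4).wv


def procrustes_align(ref, other):
    """Rotation mapping `other` into `ref`'s coordinate system,
    fit on the vocabulary the two sub-models share."""
    shared = [w for w in other.index_to_key if w in ref.key_to_index]
    A = np.stack([ref[w] for w in shared])    # reference vectors
    B = np.stack([other[w] for w in shared])  # vectors to rotate
    u, _, vt = np.linalg.svd(B.T @ A)
    return u @ vt                             # orthogonal rotation R


def merge(submodels):
    """Merge aligned sub-models by averaging per-word vectors."""
    ref, rest = submodels[0], submodels[1:]
    rotations = [procrustes_align(ref, m) for m in rest]
    vocab = set(ref.index_to_key)
    for m in rest:
        vocab.update(m.index_to_key)

    merged = {}
    for w in vocab:
        vecs = []
        if w in ref.key_to_index:
            vecs.append(ref[w])
        for m, R in zip(rest, rotations):
            if w in m.key_to_index:
                vecs.append(m[w] @ R)          # map into ref's space
        # Words absent from some sub-models are reconstructed from
        # whichever sub-models contain them.
        merged[w] = np.mean(vecs, axis=0)
    return merged
```

In this sketch, each `train_submodel` call can run on a separate machine or process; since no state is shared during training, the only communication is shipping the finished sub-models to the node that runs `merge`, which matches the abstract's claim of a short, one-shot merge phase.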


