Parallelizing Word2Vec in Shared and Distributed Memory

04/15/2016
by Shihao Ji, et al.

Word2Vec is a widely used algorithm for extracting low-dimensional vector representations of words. It has recently generated considerable excitement in the machine learning and natural language processing (NLP) communities due to its exceptional performance in many NLP applications, such as named entity recognition, sentiment analysis, machine translation, and question answering. State-of-the-art algorithms, including those by Mikolov et al., have been parallelized for multi-core CPU architectures, but they are based on vector-vector operations that are memory-bandwidth intensive and do not use computational resources efficiently. In this paper, we improve reuse of various data structures in the algorithm through the use of minibatching, allowing us to express the problem using matrix multiply operations. We also explore different techniques to distribute word2vec computation across nodes in a compute cluster, and demonstrate good strong scalability up to 32 nodes. In combination, these techniques allow us to scale the computation near-linearly across cores and nodes and to process hundreds of millions of words per second, making ours, to the best of our knowledge, the fastest word2vec implementation.
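The minibatching idea can be illustrated with a small NumPy sketch (all names, shapes, and sizes below are illustrative assumptions, not taken from the paper): scoring each input word against each negative sample with individual dot products is memory-bound, whereas batching the same work into a single matrix multiply lets an optimized GEMM reuse the loaded vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 100                            # vocabulary size, embedding dimension (illustrative)
Win = rng.standard_normal((V, d)) * 0.01    # input (word) vectors
Wout = rng.standard_normal((V, d)) * 0.01   # output (context) vectors

batch = rng.integers(0, V, size=16)         # minibatch of input-word indices
negs = rng.integers(0, V, size=8)           # negative samples shared across the batch

# Vector-vector formulation: one dot product per (word, sample) pair.
# Each vector is re-read from memory for every pair -- bandwidth-bound.
scores_vv = np.empty((len(batch), len(negs)))
for i, w in enumerate(batch):
    for j, n in enumerate(negs):
        scores_vv[i, j] = Win[w] @ Wout[n]

# Matrix-multiply formulation: the same scores as one GEMM,
# so each loaded vector is reused across the whole minibatch.
scores_mm = Win[batch] @ Wout[negs].T

assert np.allclose(scores_vv, scores_mm)    # identical results, better data reuse
```

The two formulations compute the same scores; the gain comes purely from turning many small vector-vector operations into one compute-dense level-3 BLAS call.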
