Text comparison using word vector representations and dimensionality reduction

07/02/2016
by Hendrik Heuer, et al.

This paper describes a technique to compare large text sources using word vector representations (word2vec) and dimensionality reduction (t-SNE), and shows how it can be implemented in Python. The technique provides a bird's-eye view of text sources, e.g. text summaries and their source material, and lets users explore them like a geographical map. Word vector representations capture many linguistic properties such as gender, tense, and plurality, and even semantic concepts like "capital city of". Using dimensionality reduction, a 2D map can be computed in which semantically similar words lie close to each other. The technique uses the word2vec model from the gensim Python library and t-SNE from scikit-learn.


