Semantic Word Clouds with Background Corpus Normalization and t-distributed Stochastic Neighbor Embedding

08/11/2017
by Erich Schubert, et al.

Many word clouds attach no semantics to word placement and instead use a random layout optimized solely for aesthetic purposes. We propose a novel approach to model word significance and word affinity within a document, and in comparison to a large background corpus. We demonstrate its usefulness for generating more meaningful word clouds as a visual summary of a given document. We select keywords based on their significance and construct the word cloud based on the derived affinity. Based on a modified t-distributed stochastic neighbor embedding (t-SNE), we generate a semantic word placement. For words that cooccur significantly, we include edges, and we cluster the words according to their cooccurrence. For this we designed a scalable and memory-efficient sketch-based approach, usable on commodity hardware, to aggregate the corpus statistics needed for normalization and for identifying keywords as well as significant cooccurrences. We empirically validate our approach using a large Wikipedia corpus.
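The abstract does not say which sketch or which significance score the authors use. As a rough illustration of the general idea only, the following sketch assumes a count-min sketch for approximate background counts and a smoothed relative-frequency ratio as the score. The names `CountMinSketch` and `significance`, the table dimensions, and the toy data are all hypothetical and are not taken from the paper.

```python
import hashlib


class CountMinSketch:
    """Approximate counter in fixed memory (assumed sketch type, not the paper's).

    Estimates never fall below the true count; hash collisions can only
    inflate them.
    """

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        # One hash per row, obtained by salting BLAKE2b with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row, col in self._buckets(key):
            self.table[row][col] += count

    def estimate(self, key):
        # The minimum over rows is the least collision-inflated counter.
        return min(self.table[row][col] for row, col in self._buckets(key))


def significance(word, doc_counts, doc_total, background, background_total):
    """Ratio of in-document rate to add-one-smoothed background rate (illustrative score)."""
    doc_rate = doc_counts.get(word, 0) / doc_total
    bg_rate = (background.estimate(word) + 1) / (background_total + 1)
    return doc_rate / bg_rate


# Toy background corpus and document, for illustration only.
background_tokens = "the cat sat on the mat the dog".split()
background = CountMinSketch()
for token in background_tokens:
    background.add(token)

doc_counts = {"the": 2, "embedding": 2}
doc_total = sum(doc_counts.values())

score_the = significance("the", doc_counts, doc_total, background, len(background_tokens))
score_embedding = significance("embedding", doc_counts, doc_total, background, len(background_tokens))
```

In this toy setup, a word that is frequent in the document but rare in the background scores higher than a word that is common everywhere, which is the intuition behind normalizing against a background corpus. Cooccurrence pairs could be counted the same way by using a combined key per word pair.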

research
06/22/2023

MySemCloud: Semantic-aware Word Cloud Editing

Word clouds are a popular text visualization technique that summarize an...
research
10/14/2022

Word Clouds in the Wild

Word clouds are frequently used to analyze and communicate text data in ...
research
01/14/2020

Humpty Dumpty: Controlling Word Meanings via Corpus Poisoning

Word embeddings, i.e., low-dimensional vector representations such as Gl...
research
08/10/2015

Measuring Word Significance using Distributed Representations of Words

Distributed representations of words as real-valued vectors in a relativ...
research
12/16/2021

Khmer Word Search: Challenges, Solutions, and Semantic-Aware Search

Search is one of the key functionalities in digital platforms and applic...
research
02/07/2022

Moving Other Way: Exploring Word Mover Distance Extensions

The word mover's distance (WMD) is a popular semantic similarity metric ...
research
12/29/2021

Using word clouds for fast identification of papers' subject domain and reviewers' competences

Generating word (tag) clouds is a powerful data visualization technique ...
