Exploring text datasets by visualizing relevant words

07/17/2017
by Franziska Horn, et al.

When working with a new dataset, it is important to explore and familiarize oneself with it before applying any advanced machine learning algorithms. However, to the best of our knowledge, no tools exist that quickly and reliably give insight into the contents of a selection of documents with respect to what distinguishes them from documents belonging to other categories. In this paper we propose to extract 'relevant words' from a collection of texts, which summarize the contents of the documents belonging to a certain class (or a discovered cluster in the case of unlabeled datasets), and to visualize them in word clouds so that salient features can be surveyed at a glance. We compare three methods for extracting relevant words and demonstrate the usefulness of the resulting word clouds by providing an overview of the classes contained in a dataset of scientific publications and by discovering trending topics in recent New York Times article snippets.
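To make the idea of class-specific 'relevant words' concrete, the following is a minimal sketch and not one of the three methods compared in the paper, which the abstract does not specify. It assumes a simple stand-in score: the fraction of documents in the target class that contain a word, minus the highest such fraction in any other class. The function names `tokenize` and `relevant_words`, the scoring rule, and the toy documents are all illustrative assumptions.

```python
from collections import Counter, defaultdict
import re


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def relevant_words(docs, labels, target, top_k=5):
    """Rank words that occur in `target` documents by an assumed score:
    document-frequency rate in the target class minus the highest
    document-frequency rate in any other class."""
    class_sizes = Counter(labels)
    # doc_freq[label][word] = number of documents of that class containing word
    doc_freq = defaultdict(Counter)
    for text, label in zip(docs, labels):
        for word in set(tokenize(text)):
            doc_freq[label][word] += 1

    scores = {}
    for word, count in doc_freq[target].items():
        rate_in = count / class_sizes[target]
        rate_out = max(
            (doc_freq[c][word] / class_sizes[c]
             for c in class_sizes if c != target),
            default=0.0,
        )
        scores[word] = rate_in - rate_out

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


# Toy data, invented purely for illustration.
docs = [
    "neural networks learn features",
    "deep neural networks",
    "stock markets fell",
    "markets rallied as stock prices rose",
]
labels = ["ml", "ml", "finance", "finance"]
top_ml = relevant_words(docs, labels, "ml", top_k=2)
```

In a word cloud, scores like these could be mapped to font size so that the most distinctive words of a class stand out; how the paper performs that mapping is not stated in the abstract.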

research
07/18/2017

Discovering topics in text datasets by visualizing relevant words

When dealing with large collections of documents, it is imperative to qu...
research
09/30/2021

Tag Clouds for Software Documents Visualization

Legacy software documents are hard to understand and visualize. The tag ...
research
04/10/2017

SemEval 2017 Task 10: ScienceIE - Extracting Keyphrases and Relations from Scientific Publications

We describe the SemEval task of extracting keyphrases and relations betw...
research
10/25/2016

Scalable Dynamic Topic Modeling with Clustered Latent Dirichlet Allocation (CLDA)

Topic modeling, a method for extracting the underlying themes from a col...
research
09/19/2023

Interactive Distillation of Large Single-Topic Corpora of Scientific Papers

Highly specific datasets of scientific literature are important for both...
research
06/04/2021

MexPub: Deep Transfer Learning for Metadata Extraction from German Publications

Extracting metadata from scientific papers can be considered a solved pr...
research
09/10/2008

Automatic Identification and Data Extraction from 2-Dimensional Plots in Digital Documents

Most search engines index the textual content of documents in digital li...
