
Exploring text datasets by visualizing relevant words

by Franziska Horn, et al.
Berlin Institute of Technology (Technische Universität Berlin)

When working with a new dataset, it is important to first explore and familiarize oneself with it before applying any advanced machine learning algorithms. However, to the best of our knowledge, no tools exist that quickly and reliably give insight into the contents of a selection of documents with respect to what distinguishes them from documents belonging to other categories. In this paper, we propose to extract "relevant words" from a collection of texts, which summarize the contents of the documents belonging to a certain class (or to a discovered cluster in the case of unlabeled datasets), and to visualize them in word clouds to allow for a survey of salient features at a glance. We compare three methods for extracting relevant words and demonstrate the usefulness of the resulting word clouds by providing an overview of the classes contained in a dataset of scientific publications, as well as by discovering trending topics from recent New York Times article snippets.
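
To make the idea of "relevant words" concrete, the following is a minimal sketch of one simple scoring heuristic: the mean tf-idf weight of a word inside the target class minus its mean tf-idf weight in all other documents. This heuristic is an assumption chosen for illustration and is not necessarily one of the three methods compared in the paper. The names `tokenize` and `relevant_words` and the toy documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def relevant_words(docs, labels, target, top_k=5):
    """Rank words by how strongly they distinguish `target` from other classes.

    Illustrative heuristic (an assumption, not the paper's method):
    score(w) = mean tf-idf of w in target docs - mean tf-idf of w elsewhere.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    idf = {w: math.log(n_docs / df[w]) + 1.0 for w in df}

    n_in = sum(1 for label in labels if label == target)
    n_out = n_docs - n_in
    in_sum, out_sum = Counter(), Counter()
    for tokens, label in zip(tokenized, labels):
        if not tokens:
            continue
        bucket = in_sum if label == target else out_sum
        for w, count in Counter(tokens).items():
            bucket[w] += (count / len(tokens)) * idf[w]

    scores = {
        w: in_sum[w] / max(n_in, 1) - out_sum[w] / max(n_out, 1) for w in df
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Toy corpus (invented for this example only).
docs = [
    "the neural network learns features",
    "the neural model trains fast",
    "the cell protein binds dna",
    "the protein folds in the cell",
]
labels = ["ml", "ml", "bio", "bio"]

ml_words = relevant_words(docs, labels, "ml", top_k=3)
bio_words = relevant_words(docs, labels, "bio", top_k=2)
```

The resulting scores could serve as word weights (for example, font sizes) in a word cloud. Words that occur in every class, such as "the" in the toy corpus, receive a low or negative score and so would not dominate the visualization.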

