Exploring text datasets by visualizing relevant words

07/17/2017
by Franziska Horn, et al.
Berlin Institute of Technology (Technische Universität Berlin)

When working with a new dataset, it is important to explore and familiarize oneself with it before applying any advanced machine learning algorithms. However, to the best of our knowledge, no tools exist that quickly and reliably give insight into what distinguishes a selection of documents from documents belonging to other categories. In this paper we propose to extract 'relevant words' from a collection of texts, which summarize the contents of documents belonging to a certain class (or a discovered cluster, in the case of unlabeled datasets), and to visualize them in word clouds so that salient features can be surveyed at a glance. We compare three methods for extracting relevant words and demonstrate the usefulness of the resulting word clouds in two ways: by providing an overview of the classes contained in a dataset of scientific publications, and by discovering trending topics from recent New York Times article snippets.
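The abstract does not name the three extraction methods, so the following is only a minimal sketch of the general idea, not the authors' approach. It assumes a very simple score: the fraction of documents in the target class that contain a word, minus the fraction of documents in all other classes that contain it. The function names `tokenize` and `relevant_words`, the toy documents, and the labels are all hypothetical and chosen for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def relevant_words(docs, labels, target, top_n=5):
    """Rank words by how much more often they occur in `target` documents
    than in documents of all other classes (document-frequency difference).

    Illustrative scoring only; the paper's own methods may differ.
    """
    in_class = [tokenize(d) for d, l in zip(docs, labels) if l == target]
    others = [tokenize(d) for d, l in zip(docs, labels) if l != target]
    if not in_class:
        raise ValueError(f"no documents with label {target!r}")

    # Document frequencies: number of documents that contain each word.
    df_in = Counter(w for toks in in_class for w in toks)
    df_out = Counter(w for toks in others for w in toks)

    scores = {}
    for word, count in df_in.items():
        rate_in = count / len(in_class)
        rate_out = df_out[word] / len(others) if others else 0.0
        scores[word] = rate_in - rate_out

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Toy corpus (made up for this example).
docs = [
    "the tumor cells respond to the therapy",
    "the therapy reduces tumor growth",
    "the election results surprised the voters",
    "voters in the election chose a new mayor",
]
labels = ["medicine", "medicine", "politics", "politics"]

for word, score in relevant_words(docs, labels, "medicine", top_n=3):
    print(f"{word}: {score:.2f}")
```

Words that are common everywhere, such as "the", score near zero under this scheme, while words concentrated in one class rise to the top. In a word cloud, such scores could plausibly be mapped to font sizes, one cloud per class or cluster.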
