Topological Data Analysis in Text Classification: Extracting Features with Additive Information

03/29/2020
by Shafie Gholizadeh, et al.

While the strength of topological data analysis (TDA) has been explored in many studies on high-dimensional numeric data, applying it to text remains a challenging task. The primary goal of TDA is to define and quantify the shapes in numeric data, and defining shapes in text is much harder, even though the geometries of vector spaces and conceptual spaces are clearly relevant for information retrieval and semantics. In this paper, we examine two methods of extracting topological features from text, using the two most popular word representations as the underlying data: word embeddings and TF-IDF vectors. To extract topological features from the word embedding space, we interpret the embedding of a text document as a high-dimensional time series and analyze the topology of the underlying graph whose vertices correspond to the different embedding dimensions. For the TF-IDF representation, we analyze the topology of the graph whose vertices come from the TF-IDF vectors of different blocks of the document. In both cases, we apply homological persistence to reveal the geometric structures at different distance resolutions. Our results show that these topological features carry some exclusive information that conventional text mining methods do not capture. In our experiments, adding topological features to the conventional features in ensemble models improves the classification results (by up to 5%). On the other hand, as expected, topological features by themselves may not be sufficient for effective classification. Whether TDA features from word embeddings alone might be sufficient remains an open problem, as they seem to perform within a few points of the top results obtained with a linear support vector classifier.
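To make the TF-IDF variant more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the document has already been split into non-empty blocks of tokens, uses a smoothed TF-IDF weighting and cosine distance (both are assumptions, since the abstract does not specify them), and computes only 0-dimensional persistence, i.e. the distance scales at which connected components of the block graph merge. The function names `tfidf_blocks`, `cosine_distance`, and `h0_persistence` are hypothetical.

```python
import math
from collections import Counter


def tfidf_blocks(blocks):
    """Return one sparse TF-IDF vector (dict: term -> weight) per block of tokens."""
    n = len(blocks)
    # Document frequency: in how many blocks does each term occur?
    df = Counter(term for block in blocks for term in set(block))
    vectors = []
    for block in blocks:
        tf = Counter(block)
        # Smoothed IDF (an assumption), so terms found in every block keep a nonzero weight.
        vectors.append({
            t: (c / len(block)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return vectors


def cosine_distance(u, v):
    """Cosine distance between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return 1.0 - dot / (norm_u * norm_v)


def h0_persistence(dist):
    """Death scales of connected components as the distance threshold grows.

    Every vertex is born at scale 0. A component dies at the length of the
    edge that merges it into another one; the last component never dies,
    so n vertices yield n - 1 finite death values.
    """
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # this edge merges two components
            parent[ri] = rj
            deaths.append(d)
    return deaths


# Toy document split into four blocks (illustrative data only).
blocks = [
    "the cat sat on the mat".split(),
    "the cat chased the mouse".split(),
    "stocks fell sharply today".split(),
    "markets and stocks recovered later".split(),
]
vectors = tfidf_blocks(blocks)
dist = [[cosine_distance(u, v) for v in vectors] for u in vectors]
deaths = h0_persistence(dist)
print(deaths)
```

In this toy example the two animal blocks and the two finance blocks share no terms, so the last merge happens at cosine distance 1, while the merges within each topic happen earlier. Summary statistics of such death values could serve as extra features next to conventional ones; higher-dimensional persistence (loops and voids) would need a dedicated TDA library.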

research
03/29/2020

A Novel Method of Extracting Topological Features from Word Embeddings

In recent years, topological data analysis has been utilized for a wide ...
research
03/25/2021

Persistence Homology of TEDtalk: Do Sentence Embeddings Have a Topological Shape?

Topological data analysis (TDA) has recently emerged as a new technique ...
research
11/17/2020

Argumentative Topology: Finding Loop(holes) in Logic

Advances in natural language processing have resulted in increased capab...
research
02/07/2021

A Note on Argumentative Topology: Circularity and Syllogisms as Unsolved Problems

In the last couple of years there were a few attempts to apply topologic...
research
08/22/2022

Dialogue Term Extraction using Transfer Learning and Topological Data Analysis

Goal oriented dialogue systems were originally designed as a natural lan...
research
12/08/2020

A Topological Method for Comparing Document Semantics

Comparing document semantics is one of the toughest tasks in both Natura...
research
05/31/2017

Does the Geometry of Word Embeddings Help Document Classification? A Case Study on Persistent Homology Based Representations

We investigate the pertinence of methods from algebraic topology for tex...
