
Topological Data Analysis in Text Classification: Extracting Features with Additive Information

While the strength of Topological Data Analysis has been explored in many studies on high dimensional numeric data, it is still a challenging task to apply it to text. As the primary goal in topological data analysis is to define and quantify the shapes in numeric data, defining shapes in text is much more challenging, even though the geometries of vector spaces and conceptual spaces are clearly relevant for information retrieval and semantics. In this paper, we examine two different methods of extracting topological features from text, using as the underlying representations of words the two most popular methods, namely word embeddings and TF-IDF vectors. To extract topological features from the word embedding space, we interpret the embedding of a text document as a high dimensional time series, and we analyze the topology of the underlying graph where the vertices correspond to different embedding dimensions. For topological data analysis with the TF-IDF representations, we analyze the topology of the graph whose vertices come from the TF-IDF vectors of different blocks in the textual document. In both cases, we apply homological persistence to reveal the geometric structures under different distance resolutions. Our results show that these topological features carry some exclusive information that is not captured by conventional text mining methods. In our experiments, we observe that adding topological features to the conventional features in ensemble models improves the classification results (by up to 5%). On the other hand, as expected, topological features by themselves may not be sufficient for effective classification. It is an open problem to see whether TDA features from word embeddings might be sufficient, as they seem to perform within a few points of the top results obtained with a linear support vector classifier.


1 Introduction

The primary goal in Topological Data Analysis (TDA) is to analyze the shapes in data. While TDA has received significant attention in data mining for numeric data [de2006coordinate, de2007coverage, khasawneh2016chatter, khasawneh2018topological, pereira2015persistent, maletic2016persistent], its application in natural language processing still appears to be challenging. Defining shapes in text seems much more challenging, even though vector spaces are used as standard tools to define geometries in text mining and information retrieval [manning2008introduction], and conceptual spaces [gardenfors2014geometry] are relevant for cognitive modeling and the semantics of natural language.

In this study, the focus is on introducing and examining two methods to apply TDA to text classification. Term frequency (or TF-IDF) and word embeddings are the most frequently used methods to translate text into numerical data. Therefore they deserve to be examined, as a priority, for their potential to reveal hidden dimensions through topological methods.

First, we introduce a novel method of using word embeddings where we view text documents as time series. We believe this method shows great promise, since it can be applied to documents irrespective of their length (with some likely limitations explained in Section 6), and it encodes the temporal succession of words in a latent semantic space. Our algorithm analyzes the topology of the embedding space to discover relations among different embedding dimensions of the analyzed text. The precise nature of this space is not clear to us at this point. However, we know it is there, because our experiments show its influence on the accuracy of classification.

In the second experiment, working with TF-IDF representations of textual documents, we use a method that divides the text into a fixed number of blocks, analyzes the topological structure of the relations among the different blocks, and summarizes the results. As with the first method, this topological summary consists of numerical features derived from the persistence diagram of each document. And as in the first case, it improves the accuracy of classification, proving the existence of the latent topological dimension (speaking metaphorically).

The intuitive idea behind both experiments has to do with the central premise of topological data analysis, namely that when examining a cloud of data points at different resolutions, the emerging diagrams encode global geometric properties of the point cloud, as shown in Figure 1 and later in Figure 2. There we observe that, as we change the threshold, i.e. the distance at which we add connections between the points, new elements are added to the persistence diagram, culminating in a clear circle, or torus-like, signature in Figure 1 (shown as the long line in the right panel), and a more complex representation of the geometry in Figure 2. In our case we measure, and use as features for machine learning, the birth and death diameters in dimensions 0 and 1, as well as derived quantities, that is, the number of holes, the average death diameter divided by its standard deviation, and the same ratio for the duration (death minus birth).

This paper is structured as follows. In Section 2, we review the basic methods in topological data analysis and survey a few studies that utilized TDA in the wider area of natural language processing. Section 3 describes our methods of extracting topological features out of textual documents. Then our experiments are explained in Section 4, followed by results in Section 5. We discuss the contributions, limitations, and open problems in Section 6.

2 Background

2.1 A Brief Sketch of Topological Data Analysis (TDA)

In TDA, we often use simplicial complexes to study shapes. A simplex can be a single point (0-simplex), two connected data points (1-simplex), three fully-connected points (2-simplex), or, in general, k+1 fully-connected data points (k-simplex). For consistency of the definition, we use the (-1)-simplex to describe the empty set [zomorodian2010computational]. Then we can define a simplicial complex as a union of simplices satisfying one condition: if a simplex is in the simplicial complex, then any of its subsets should also be in the complex.

The topological characteristics of a simplicial complex can be summarized in Betti numbers [edelsbrunner2000topological]. The n-th Betti number B_n is defined as the number of n-dimensional holes in a simplicial complex. More specifically, B_0 is the number of connected components, B_1 is the number of 1-dimensional holes (loops), and B_2 is the number of 2-dimensional voids, etc. Note that for a d-dimensional shape, the n-th Betti number is zero for any n > d. Betti numbers for some topological shapes are shown in Table 1.

Order  Type        A Point  Circle  Sphere  Torus
B_0    components  1        1       1       1
B_1    loops       0        1       0       2
B_2    voids       0        0       1       1
B_3    3D holes    0        0       0       0
Table 1: Betti numbers for some topological shapes.
Figure 1: A data cloud (left), persistence diagram (middle), and its equivalent barcodes (right). On a persistence diagram, death radii are plotted vs. birth radii. Barcodes plot the same information with one-dimensional bars from birth radii to death radii.

Persistent homology is a technique in TDA to find topological patterns in data [edelsbrunner2000topological, zomorodian2005computing, munch2017user]. Dealing with a set of discrete data points, we can define a radius around each data point and connect the points within that radius of each other. Then we compute the number of holes or loops in the resulting simplicial complex. However, some data points might produce a fully-connected partition where there is no hole, i.e., a k-simplex. If we increase the defined radius gradually, or, equivalently, decrease the resolution, the resulting simplicial complex will change. Subsequently, the holes and their numbers (Betti numbers) in the shape will change. So, as the radius increases, many holes (e.g., loops in dimension 1) will come into the picture and then disappear. We can illustrate the birth and death radii of the holes for each dimension in a persistence diagram [edelsbrunner2000topological]. Equivalently, the birth and death radii of holes can be shown with barcodes, where the lifetime of every hole is represented by a one-dimensional bar from the birth radius to the death radius [collins2004barcode, ghrist2008barcodes, carlsson2014topological]. An example of these barcodes and the equivalent persistence diagram is shown in Figure 1.

In persistent homology, we can study the distances among data points in different ways. In the procedure we described above, the information structure based on thresholding distances is called a Vietoris-Rips filtration [ghrist2008barcodes]. In a Vietoris-Rips complex, any k-simplex consists of k+1 nodes whose pairwise distances are less than or equal to the threshold. We refer the interested reader to [zomorodian2005computing, munch2017user] for more details on persistent homology.
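For concreteness, the following minimal Python sketch (assuming the open-source ripser and numpy packages; all names are illustrative, not from our implementation) computes Vietoris-Rips persistence diagrams for a toy point cloud sampled from a circle, where we would expect one long-lived loop in dimension 1, as in Figure 1.

import numpy as np
from ripser import ripser  # Vietoris-Rips persistent homology

# Toy point cloud sampled (noisily) from a circle: we expect one
# long-lived loop in dimension 1, as in the barcode of Figure 1.
theta = np.random.uniform(0, 2 * np.pi, 60)
points = np.column_stack([np.cos(theta), np.sin(theta)])
points += np.random.normal(scale=0.05, size=points.shape)

# Persistence diagrams for dimension 0 (components) and dimension 1 (loops).
diagrams = ripser(points, maxdim=1)["dgms"]
for dim, dgm in enumerate(diagrams):
    print(f"dimension {dim}: {len(dgm)} (birth, death) pairs")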

2.2 TDA in Text Processing

Study Input Task/Application
Wagner et al. [wagner2012computational] TF-IDF Measuring heterogeneity of documents in corpus
Zhu [zhu2013persistent] TF-IDF Finding repetitions in text
Torres-Tramón et al. [torres2015topic] TF Topic detection in Twitter data
Almgren et al. [almgren2017mining] Word2Vec Image popularity prediction in social media
Doshi and Zadrozny [doshi2018movie] TF-IDF Classification
Gholizadeh et al. [gholizadeh2018topological] NER Tags Authorship profiling
Savle and Zadrozny [savle2019topological] TF-IDF Text entailment prediction
Table 2: Studies covering TDA in text processing.

There are only a few studies in the literature utilizing topological data analysis for text processing. In most cases persistent homology is applied to term frequency vectors representing the documents. This method is used for classification [doshi2018movie], measuring heterogeneity of documents [wagner2012computational], finding repetitions in text [zhu2013persistent], topic detection [torres2015topic], and text entailment prediction [savle2019topological]. In other cases, topological data analysis is applied to word embedding representations [almgren2017mining] or to tagged text [gholizadeh2018topological] (after performing named entity recognition). We organize these contributions in Table 2.

In addition to the contributions mentioned in Table 2, there are several studies utilizing persistent homology in time series and system analysis [pereira2015persistent, khasawneh2014stability, perea2015sliding, maletic2016persistent, stolz2017persistent, garland2016exploring, aurenhammer1991voronoi, gholizadeh2018short]. These approaches are relevant for us, since texts represented by word embeddings can be viewed as time series, as we show in Section 3.

3 Methodology

As mentioned before, we may derive topological features from the TF-IDF space or from the word embedding representation of the text. Here we use both approaches.

3.1 Topological features from word embeddings

Our method of extracting topological features from embeddings is described in Algorithm 1. Assume that a document with N tokens is represented in a D-dimensional word embedding space by the N x D matrix E. We will treat this matrix as a D-dimensional time series. Of course, the length of this time series is equal to N. Here, we intend to investigate the topological characteristics of this time series representing the text. First, we smooth each time-series dimension (a column of E). Smoothing is a standard technique in time-series analysis [gardner2006exponential] which usually reduces the noise and improves the accuracy of prediction.

To smooth each column of the embedding representation we use a centered window of two lags and two leads, as in Eq. 1.

Ê_{t,d} = (E_{t-2,d} + E_{t-1,d} + E_{t,d} + E_{t+1,d} + E_{t+2,d}) / 5    (1)

Then, on the smoothed matrix Ê, we calculate the distance between different embedding dimensions (columns Ê_i and Ê_j).

δ_{i,j} = (1 - cos(Ê_i, Ê_j)) · ||Ê_i|| · ||Ê_j||    (2)

Note that the measure of distance as defined in Eq. 2 encodes the word order of the text. The cosine function in the equation compares the elements of Ê_i and Ê_j with the same indices, e.g., the first element of Ê_i is compared with the first element of Ê_j, etc. But recall that in the smoothing step, as in Eq. 1, each index of a smoothed column combines two lag and two lead indices of the original column. Therefore, comparing the same index of the smoothed columns Ê_i and Ê_j is in fact comparing each index of the i-th original (non-smoothed) column with a few indices of the j-th original column. That is how the cosine function encodes the word order in the algorithm. The distance as defined in Eq. 2 also takes into account the magnitudes of the columns (dimensions) that are being compared.
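A minimal numpy sketch of the smoothing and distance steps is given below; the equal five-token weighting and the exact combination of cosine dissimilarity with column magnitudes follow our reading of Eqs. 1 and 2 above, and the function names are illustrative rather than part of our implementation.

import numpy as np

def smooth_columns(E, lags=2):
    # Eq. 1 (as written above): centered window of two lags and two leads,
    # truncated at the document boundaries.
    N, _ = E.shape
    E_hat = np.zeros_like(E, dtype=float)
    for t in range(N):
        lo, hi = max(0, t - lags), min(N, t + lags + 1)
        E_hat[t] = E[lo:hi].mean(axis=0)
    return E_hat

def dimension_distances(E_hat):
    # Eq. 2 (as written above): pairwise distance between embedding
    # dimensions (columns), combining cosine dissimilarity with magnitudes.
    D = E_hat.shape[1]
    norms = np.linalg.norm(E_hat, axis=0)
    delta = np.zeros((D, D))
    for i in range(D):
        for j in range(D):
            cos = E_hat[:, i] @ E_hat[:, j] / (norms[i] * norms[j] + 1e-12)
            delta[i, j] = (1.0 - cos) * norms[i] * norms[j]
    return delta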

The pairwise distance matrix Δ can be interpreted as the adjacency matrix of a weighted graph. Thus we can easily apply persistent homology to it and get persistence diagrams at dimension 0 (components) and dimension 1 (loops). Then, for each embedding dimension, we exclude the corresponding vertex of the graph and measure the change in the persistence diagrams. These measures represent the sensitivity of the graph to each embedding dimension. We know that word embeddings represent the tokens of the text, but our main goal is to provide a new representation for the whole document.

To this end, the document is translated into a graph with adjacency matrix Δ in Steps 1-6 of Algorithm 1, and then into a persistence diagram. We assume that the sensitivity of the diagram with respect to each embedding dimension represents the significance of that embedding dimension in the diagram, and therefore in the original document.

This means that, effectively, we will be classifying documents based on the significance of each embedding dimension. Since the embeddings used to represent the words are derived from a large corpus of text, and they encode similarities and differences between contexts of words in the corpus, we are employing this latent knowledge in a way similar to the standard use of TF-IDF weighting in information retrieval. In other words, the feature vectors W0 and W1 of Algorithm 1 represent the importance of particular dimensions in a document, similarly to the TF-IDF values representing the importance of words.

We use the Wasserstein distance [edelsbrunner2010computational, berwald2018computing, cohen2010lipschitz] in Algorithm 1. It measures the minimum cost of mapping one distribution to another. It is also a common metric to quantify the difference between persistence diagrams [marchese2017k]. Remember that persistence diagrams are in fact sets of points in the 2D plane. To compare two persistence diagrams, the Wasserstein distance measures the minimum cost of moving the points of the first diagram to convert it into the second diagram.
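As a small illustration (assuming the open-source persim package; the two diagrams here are made up for the example), the Wasserstein distance between two diagrams can be computed directly from their (birth, death) points:

import numpy as np
from persim import wasserstein

dgm_a = np.array([[0.0, 1.0], [0.2, 0.6]])  # (birth, death) points of diagram A
dgm_b = np.array([[0.0, 0.9]])              # (birth, death) point of diagram B
# Minimum cost of moving A's points onto B's points (or onto the diagonal).
print(wasserstein(dgm_a, dgm_b))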

Finally, as shown in Algorithm 1, we get D features for topological dimension 0 (components) and another D features for topological dimension 1 (loops). We use the resulting 2D topological features to represent the text.

0:  word embedding representation of text: a matrix E of size N x D, where N is the number of tokens in the text and D is the dimensionality of the word embedding.
0:  embedding-based topological features of text: a vector of size 2D.
1:  for d = 1 to D do
2:     Smooth the d-th column of E: update the smoothed matrix Ê with the smoothed column d of E (Eq. 1).
3:  end for
4:  for i, j = 1 to D do
5:     Calculate the distance δ_{i,j} between columns Ê_i and Ê_j (Eq. 2).
6:  end for
7:  Apply persistent homology on Δ = [δ_{i,j}]. Get persistence diagrams P0 and P1 for components and loops respectively.
8:  for d = 1 to D do
9:     Make the persistence diagrams excluding the d-th column and row of Δ, i.e., P0_{-d} and P1_{-d}.
10:    Calculate the Wasserstein distances W0_d and W1_d between the persistence diagrams including and excluding dimension d.
11: end for
12: return W0 = (W0_1, ..., W0_D) and W1 = (W1_1, ..., W1_D) as the feature vector
Algorithm 1 Topological Features from Word Embedding
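The following Python sketch mirrors Steps 7-12 of Algorithm 1, assuming a precomputed pairwise distance matrix delta among embedding dimensions (for example from the snippet after Eq. 2) and the open-source ripser and persim packages; the function names and the handling of the infinite dimension-0 bar are our choices for the example, not part of the algorithm.

import numpy as np
from ripser import ripser       # persistent homology on a distance matrix
from persim import wasserstein  # Wasserstein distance between diagrams

def clean(dgm):
    # Keep finite (birth, death) points; pad with a diagonal point if empty.
    dgm = dgm[np.isfinite(dgm).all(axis=1)]
    return dgm if len(dgm) else np.zeros((1, 2))

def embedding_topological_features(delta):
    # Steps 7-12: sensitivity of the persistence diagrams to removing
    # each embedding dimension, measured by the Wasserstein distance.
    D = delta.shape[0]
    full = ripser(delta, distance_matrix=True, maxdim=1)["dgms"]
    w0, w1 = np.zeros(D), np.zeros(D)
    for d in range(D):
        keep = [k for k in range(D) if k != d]
        reduced = ripser(delta[np.ix_(keep, keep)],
                         distance_matrix=True, maxdim=1)["dgms"]
        w0[d] = wasserstein(clean(full[0]), clean(reduced[0]))
        w1[d] = wasserstein(clean(full[1]), clean(reduced[1]))
    return np.concatenate([w0, w1])  # W0 and W1: a vector of size 2D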

3.2 Topological features from term frequency space

To apply persistent homology on the TF-IDF space, we follow the approach in [zhu2013persistent], i.e., dividing the textual document into a fixed number of blocks and then searching for repetitive patterns in the text. Our method is described in Algorithm 2.

Figure 2: Working on a graph of 10 vertices, persistent homology thresholds the distance (e.g., cosine distance) among the nodes using all possible thresholds. The resulting edges for a few choices of threshold are shown here. Topological characteristics are summarized in the persistence diagram (bottom right).

We divide each document into consecutive blocks of equal size and calculate a TF-IDF vector for each block. We chose 10 blocks, but one may try a different number of blocks for each document. However, we note that using a large number of blocks could make the TF-IDF vectors too sparse, so that comparing them would not be useful. For instance, if the average number of tokens in a document is only 200 and we divide each document into 100 blocks, there would be two tokens in each block, and most of the blocks would have zero similarity.

In our experiments, we work on graphs of 10 vertices, where each vertex is represented by its TF-IDF vector. An example of such a graph is illustrated in Figure 2. The figure shows that when persistent homology is applied, the number of edges connecting the ten vertices increases with the size of the radius (as we described in Section 2.1). The distance between two vertices is derived from the cosine similarity of the vectors associated with each vertex. With 10 vertices, in topological dimension 0 (components) we get exactly 10 diameters of birth and 10 diameters of death. Since for topological dimension 0 all of the birth diameters are always equal to zero, we only retrieve the death diameters. For topological dimension 1 (loops), we may get a different number of loops for different documents. Thus, if we retrieved all of the birth and death diameters, we would get different numbers of features for different textual documents. Therefore, we summarize the information from topological dimension 1 (loops) in five statistically inferred features: the number of loops, the average diameter of birth, the average diameter of duration, the standard deviation of birth diameters, and the standard deviation of duration diameters. This is similar to what Mittal and Gupta suggested in [mittal2017topological] to summarize a persistence diagram, that is, using six features from the persistence diagram, including the number of holes, the average lifetime of holes, the maximum diameter of holes, and the maximum distance between holes in each dimension. Here we utilize some similar features. The resulting 15 features (10 from dimension zero plus 5 from dimension one) represent patterns in the text. (As noted by [zhu2013persistent], such a representation may capture, e.g., repetitive patterns in the text.)

0:  text: an array R of size T, where T is the number of tokens in the text.
0:  TF-IDF based topological features of text: a vector of size 15.
1:  Divide R into 10 equal-size arrays R_1, ..., R_10 of size T/10.
2:  for b = 1 to 10 do
3:     Calculate the TF-IDF vector V_b of R_b.
4:  end for
5:  Apply persistent homology on V_1, ..., V_10; get persistence diagrams P0 and P1.
6:  Set F0 = (x_1, ..., x_10), where the x_i's are the diameters of death for components (dimension 0). We get exactly 10 death diameters.
7:  For each loop (dimension 1), we have the diameters of birth and death. Calculate F1 consisting of:
  • number of loops

  • average diameter of birth

  • average diameter of duration (death minus birth diameter)

  • standard deviation of birth diameters

  • standard deviation of duration diameters

8:  return F0 and F1
Algorithm 2 Topological Features from TF-IDF
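A minimal Python sketch of Algorithm 2 follows, assuming scikit-learn's TfidfVectorizer (already fitted on the corpus) as a stand-in for the text2vec vectors used in our experiments, plus ripser; the function name and the capping of the infinite dimension-0 bar are illustrative choices.

import numpy as np
from ripser import ripser
from sklearn.metrics.pairwise import cosine_distances

def tfidf_topological_features(tokens, vectorizer, n_blocks=10):
    # Steps 1-4: TF-IDF vector for each of the 10 equal-size blocks.
    blocks = [" ".join(b) for b in np.array_split(tokens, n_blocks)]
    V = vectorizer.transform(blocks)

    # Step 5: persistent homology on the cosine distances between blocks.
    dgms = ripser(cosine_distances(V), distance_matrix=True, maxdim=1)["dgms"]

    # Step 6: death diameters in dimension 0 (the infinite bar is capped).
    deaths0 = dgms[0][:, 1]
    cap = deaths0[np.isfinite(deaths0)].max() if np.isfinite(deaths0).any() else 1.0
    deaths0 = np.where(np.isfinite(deaths0), deaths0, cap)

    # Step 7: five summary statistics for the loops (dimension 1).
    h1 = dgms[1]
    if len(h1):
        births, durations = h1[:, 0], h1[:, 1] - h1[:, 0]
        f1 = [len(h1), births.mean(), durations.mean(), births.std(), durations.std()]
    else:
        f1 = [0.0] * 5
    return np.concatenate([deaths0, f1])

In use, the vectorizer would be a TfidfVectorizer fitted on the blocks of the training corpus, so that the IDF component is shared across documents.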

4 Description of the Experiments

We run both algorithms on the Wikipedia Movie Plots data set from Kaggle (https://www.kaggle.com/aminejallouli/genre-classification-based-on-wiki-movies-plots/data). We selected the movie plots annotated with the four major genres of Drama, Comedy, Action, and Romance. Keeping only the plots containing at least 200 words, we tried to predict the genres solely based on the plot texts.

The data set contains 11,500 total records. Each record may have been annotated by more than one label. More specifications per class are shown in Table 3. We used 2/3 of the records for training and 1/3 for testing.

To represent the data in the word embedding space, we used fastText [bojanowski2016enriching, joulin2016bag] vectors pre-trained on Wikipedia 2017, with a vocabulary of about 2.5M words and 300-dimensional vectors (https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec). We chose fastText since in our initial experiments it showed slightly better performance compared to Google word2vec [mikolov2013efficient, mikolov2013distributed, mikolov2013linguistic], GloVe [pennington2014glove], and ConceptNet Numberbatch [speer2017conceptnet] pre-trained vectors. To apply persistent homology and extract topological features we utilized the Ripser [bauer2019ripser] package. The TF-IDF vectors for Algorithm 2 were extracted with the text2vec package [selivanov2016text2vec].
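To make the input of Algorithm 1 concrete, the following sketch (assuming the gensim package and the wiki.en.vec file above; the tokenization is deliberately naive and the names are illustrative) builds the N x D matrix for a plot text from the pre-trained vectors.

import numpy as np
from gensim.models import KeyedVectors

# Load the pre-trained fastText vectors from the .vec text file (slow, done once).
vectors = KeyedVectors.load_word2vec_format("wiki.en.vec")

def plot_to_matrix(text):
    # Stack the vectors of in-vocabulary tokens: an N x D input matrix for Algorithm 1.
    tokens = [t for t in text.lower().split() if t in vectors]
    return np.stack([vectors[t] for t in tokens])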

Specification Drama Comedy Action Romance
Overlap with drama - 524 223 379
Overlap with comedy 524 - 207 544
Overlap with action 223 207 - 117
Overlap with romance 379 544 117 -
Exclusive Records 4592 3302 1181 672
Total Records 5615 4477 1658 1614
Table 3: Number of records per class and overlaps among different classes.

5 Results and Discussion

For each record in the data set, we computed two sets of topological features based on word embeddings as in Algorithm 1 and TF-IDF space as in Algorithm 2. We will call these two sets of features TP1 and TP2, respectively.

First, we fed TP1 to the XGBoost [chen2016xgboost] classifier. Then we tried adding the TP2 features to the same classifier to boost the results. We also tried a Bidirectional LSTM to classify the records without using our topological features; the bidirectional LSTM model was trained in five epochs with the Adam optimizer [kingma2014adam].

While the bidirectional LSTM showed stronger performance than the XGBoost models fed with our topological features, we assumed that there might be some exclusive information carried by our topological features that is not captured by the LSTM. Thus we tried combining the LSTM results with the XGBoost models. As one of the easiest ways to combine the results, we fed the probabilities (not the rounded predictions) returned by the two models (LSTM and XGBoost using TP1 and TP2) to a logistic regression model.
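As a sketch of this stacking step (assuming per-genre probability arrays p_lstm and p_xgb from the two base models on the training split and a binary label matrix y; the names are illustrative), one logistic-regression combiner can be fitted per class, since each record may carry more than one genre:

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_stackers(p_lstm, p_xgb, y):
    # One binary logistic-regression combiner per genre, fed with the
    # two base models' predicted probabilities for that genre.
    stackers = []
    for c in range(y.shape[1]):
        X = np.column_stack([p_lstm[:, c], p_xgb[:, c]])
        stackers.append(LogisticRegression().fit(X, y[:, c]))
    return stackers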

As shown in Table 4, our best ensemble model outperforms the LSTM accuracy and F1-score by 1.6% and 5.1%, respectively. The previous results (https://www.kaggle.com/aminejallouli/genre-classification-based-on-wiki-movies-plots/notebook) using a linear Support Vector Classifier (SVC) and multinomial Naïve Bayes are also provided in the table. The detailed results per class are provided in Table 5.

Note that the topological features that we extracted from the word embedding space (i.e., TP1) can classify the records alone with an accuracy comparable, but not equal, to the LSTM. On the other hand, the topological features extracted from the TF-IDF space are primarily used to reflect some repetitive patterns in the text, as Zhu [zhu2013persistent] suggested in a similar study. However, as shown in Table 4 and Table 5, using both topological feature sets can boost the accuracy of classification in the ensemble model.

Classifier Pre. Rec. F1 Acc.
1 BiLSTM 68.0 59.7 0.608 76.2
2 XGBoost on TP1 59.6 53.2 0.560 71.1
3 XGBoost on TP1 & TP2 59.9 53.7 0.564 71.4
4 BiLSTM + XGBoost on TP1 67.8 64.8 0.656 77.3
5 BiLSTM + XGBoost on TP1 & TP2 68.5 64.6 0.659 77.8
Previous Results (Linear SVC) 73.5
Previous Results (Naïve Bayes) 73.3
Table 4: Macro-average results by different methods. The ensemble model using both embeddings and topological features improves the F1 measure by about 5%.
Class BiLSTM XGB XGB2 TL4 TL5 prev. SVC prev. NB
action 87.7 86.7 86.9 89.3 88.9 81.5 82.7
comedy 75.6 69.0 69.1 76.9 77.7 74.6 73.3
drama 69.9 63.9 64.3 71.0 71.6 66.1 67.4
romance 87.6 86.0 85.9 87.8 87.8 88.3 84.3
macro-avg 76.2 71.1 71.4 77.3 77.8 73.5 73.3
Table 5: Accuracy per class using different methods. Here BiLSTM, XGB, XGB2, TL4, and TL5 are the same as models 1 to 5 in Table 4. prev. SVC and prev. NB refer to the previous results using linear SVC and multinomial Naïve Bayes, respectively. We see across the board superior performance of the ensemble models with topological features.

6 Conclusions

We first summarize our contributions and argue for the potential of topological methods to contribute to text analysis, and then discuss some limitations and open problems.

6.1 Summary of contributions

In this paper, we used two different methods to extract topological features from text and applied them to the task of document classification. The first method converts text, represented as a sequence of word embeddings, into a high dimensional time series, which is then analyzed using the machinery of topological data analysis, namely homological persistence. The second method augments the classical TF-IDF representation of the text with topological features.

Specifically, we have leveraged existing word embeddings along with the topology of the text to show that such structure can carry useful information for machine learning classifiers to learn from. To extract topological features from the word embedding space, using the high dimensional time series derived from the embeddings, we measured and analyzed the topology of the graph whose vertices are the different embedding dimensions.

For topological data analysis of the TF-IDF space, we analyzed the topology of the graph whose vertices are the TF-IDF vectors of different blocks of a textual document. As we have shown in the results, while a classifier utilizing only topological features may fail to outperform more conventional models like bidirectional LSTMs, these topological features are capable of carrying some exclusive information that is not captured by conventional text mining methods. Therefore, adding these features to models based on more conventional features can boost the results. In our experiment, using topological features in the ensemble model resulted in a 4.9% increase in recall, a 0.5% increase in precision, and a 5.1% increase in F1 score.

Briefly, our contributions are as follows:

  • We introduced a new algorithm of extracting topological features from text, namely by converting a sequence of word embeddings into a time series, and analyzing the dimensions of the resulting series for topological persistence.

  • This algorithm works with documents of any length and, importantly, preserves the word order in its representation.

  • We have shown that this new method produces features of value for the task of document classification.

  • We showed that even if the representation of documents is derived from the standard TF-IDF matrix, similarly produced topological features improve the accuracy of classification.

Based on the above, we suggest that topological methods deserve deeper examination as a tool for text analysis. We believe that, as with the geometries of vector spaces and conceptual spaces mentioned in Section 1, the topological features, which capture certain geometric invariants, are relevant for text analytics and the semantics of natural language.

6.2 Discussion, Limitations and Open Problems

We end with a discussion, including some of the limitations, and open problems.

The strength of our algorithm for analyzing documents as a time series of embeddings is its universal applicability, irrespective of the length of the document. The second important property is its use of word order. Finally, the algorithm produces the representation in one pass.

However, one of the limitations of our methodology is the size of the blocks of text. Regarding the embedding-based topological features, the topological structure of a short text would not be stable. Also, due to the lack of context, the embedding may not be able to provide enough information for classification tasks.

Similarly, using the TF-IDF vectors of short documents can result in poor simplicial shapes when we divide the text into 10 blocks, as in Section 3: that is, a set of separate dots in the space, most of which are not connected at all. In such a case, it is challenging to find an informative topological structure in the text.

Proving the value of the methods used in this article for other natural language processing tasks, such as summarization, entity extraction, or question answering, is both a limitation of this work and an open problem.

We see two other important open issues, one very technical, and one more programmatic. The latter has to do with connecting our work on the topology of text with the work on understanding the topological properties of deep neural networks, exemplified e.g. by [kim2020efficient] and [guss2018characterizing]. An urgent technical open problem is to find the actual text behind the topological structures. This is a challenge in our ongoing work.

References