1 Introduction
The primary goal of Topological Data Analysis (TDA) is to analyze the shapes in data. While TDA has received significant attention in data mining for numeric data [de2006coordinate, de2007coverage, khasawneh2016chatter, khasawneh2018topological, pereira2015persistent, maletic2016persistent], its application to natural language processing remains challenging. Defining shapes in text is far from obvious, even though vector spaces are standard tools for defining geometries in text mining and information retrieval [manning2008introduction], and conceptual spaces [gardenfors2014geometry] are relevant for cognitive modeling and the semantics of natural language.
In this study, the focus is on introducing and examining two methods for applying TDA to text classification. Term frequency (or TFIDF) and word embeddings are the most frequently used methods for translating text into numerical data. They therefore deserve to be examined, as a priority, for their potential to reveal hidden topological dimensions.
First, we introduce a novel method of using word embeddings in which we view text documents as time series. We believe this method shows great promise, since it can be applied to documents irrespective of their length (with some likely limitations explained in Section 6), and it encodes the temporal succession of words in a latent semantic space. Our algorithm analyzes the topology of the embedding space to discover relations among different embedding dimensions of the analyzed text. The precise nature of this space is not clear to us at this point. However, we know it is there, because our experiments show its influence on the accuracy of classification.
In the second experiment, working with TFIDF representations of textual documents, we use a method that divides the text into a fixed number of blocks, analyzes the topological structure of the relations among the different blocks, and summarizes the results. As with the first method, this topological summary consists of numerical features derived from the persistence diagram of each document. And as in the first case, it improves the accuracy of classification, proving the existence of a latent topological dimension (speaking metaphorically).
The intuitive idea behind both experiments has to do with the central premise of topological data analysis, namely that when examining a cloud of data points at different resolutions, the emerging diagrams encode global geometric properties of the point cloud, as shown in Figure 1 and later in Figure 2. There we observe that, with the change of the threshold, i.e. the distance at which we add connections between the points, new elements are added to the persistence diagram, culminating in a clear circle-like, or torus-like, signature in Figure 1 (shown as the long line in the right panel), and a more complex representation of the geometry in Figure 2. In our case we measure, and use as features for machine learning, the birth and death diameters in dimensions 0 and 1, as well as their derivatives, that is, the number of holes, the average divided by the standard deviation of the death diameter, and the same ratio for the duration (death - birth).
This paper is structured as follows. In Section 2, we review the basic methods in topological data analysis and review a few studies that utilized TDA in the wider area of natural language processing. Section 3 describes our methods of extracting topological features out of textual documents. Then our experiments are explained in Section 4 followed by results in Section 5. We discuss the contribution and its limitations and open problems in Section 6.
2 Background
2.1 A Brief Sketch of Topological Data Analysis (TDA)
In TDA, we often use simplicial complexes to study shapes. A simplex can be a single point (a 0-simplex), two connected data points (a 1-simplex), three fully-connected points (a 2-simplex), or, generally, k+1 fully-connected data points (a k-simplex). For consistency of the definition, we use the (-1)-simplex to describe the empty set [zomorodian2010computational]. Then we can define a simplicial complex as a union of simplices satisfying one condition: if a simplex is in the simplicial complex, then any subset of it must also be in the complex.
The topological characteristics of a simplicial complex can be summarized in Betti numbers [edelsbrunner2000topological]. The n-th Betti number β_n is defined as the number of n-dimensional holes in a simplicial complex. More specifically, β_0 is the number of connected components, β_1 is the number of one-dimensional holes (loops), and β_2 is the number of two-dimensional voids, etc. Note that for an m-dimensional shape, β_n is zero for any n > m. Betti numbers for some topological shapes are shown in Table 1.
Order  Type        A Point  Circle  Sphere  Torus
β_0    components  1        1       1       1
β_1    loops       0        1       0       2
β_2    voids       0        0       1       1
β_3    3D holes    0        0       0       0
Persistent homology is a technique in TDA for finding topological patterns in data [edelsbrunner2000topological, zomorodian2005computing, munch2017user]. Dealing with a set of discrete data points, we can define a radius around each data point and connect the points that lie within that radius of each other. Then we compute the number of holes or loops in the resulting simplicial complex. However, some data points might produce a fully-connected partition with no hole, i.e., a simplex. If we increase the defined radius gradually, or – equivalently – decrease the resolution, the resulting simplicial complex will change. Subsequently, the holes and their numbers (the Betti numbers) will change. So, as the radius increases, many holes (e.g., loops in dimension 1) will come into the picture and then disappear. We may illustrate the birth and death radii of the holes in each dimension in a persistence diagram [edelsbrunner2000topological]. Equivalently, the birth and death radii of holes can be shown with barcodes, where the lifetime of every hole is represented by a one-dimensional bar from the birth radius to the death radius [collins2004barcode, ghrist2008barcodes, carlsson2014topological]. An example of these barcodes and the equivalent persistence diagram are shown in Figure 1.
In persistent homology, we can study the distances among data points in different ways. The filtration based on thresholding distances, as described above, is called the Vietoris-Rips filtration [ghrist2008barcodes]. In a Vietoris-Rips complex, any simplex consists of nodes whose pairwise distances are less than or equal to the threshold. We refer the interested reader to [zomorodian2005computing, munch2017user] for more details on persistent homology.
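The thresholding step of this filtration can be illustrated with a minimal, standard-library-only sketch: sample points from a circle, connect every pair within radius r, and count the connected components (β_0 of the resulting graph) as r grows. A full analysis would use a package such as Ripser to track births and deaths across all radii at once; the function names below are ours.

```python
# Minimal sketch of the thresholding step behind a Vietoris-Rips
# filtration, using only the standard library: connect every pair of
# points within radius r and count connected components (beta_0 of
# the resulting graph). As r grows, the separate components of a
# sampled circle merge into a single loop.
import math

def circle_points(n=12, radius=1.0):
    """n points evenly spaced on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

def components_at_threshold(points, r):
    """Union-find count of connected components when points at
    distance <= r are joined by an edge."""
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= r:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})

pts = circle_points()
# 12 isolated points at a small radius; one component once neighbors connect.
print([components_at_threshold(pts, r) for r in (0.1, 0.6, 1.5)])  # [12, 1, 1]
```

In persistence-diagram terms, the twelve components born at radius 0 die one by one as the threshold passes the spacing between neighboring points, leaving a single long-lived component.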
2.2 TDA in Text Processing
Study                                          Input     Task/Application
Wagner et al. [wagner2012computational]        TFIDF     Measuring heterogeneity of documents in a corpus
Zhu [zhu2013persistent]                        TFIDF     Finding repetitions in text
Torres-Tramón et al. [torres2015topic]         TF        Topic detection in Twitter data
Almgren et al. [almgren2017mining]             Word2Vec  Image popularity prediction in social media
Doshi and Zadrozny [doshi2018movie]            TFIDF     Classification
Gholizadeh et al. [gholizadeh2018topological]  NER tags  Authorship profiling
Savle and Zadrozny [savle2019topological]      TFIDF     Text entailment prediction
There are only a few studies in the literature utilizing topological data analysis for text processing. In most cases persistent homology is applied to term frequency vectors representing the documents. This method has been used for classification [doshi2018movie], measuring the heterogeneity of documents [wagner2012computational], finding repetitions in text [zhu2013persistent], topic detection [torres2015topic], and text entailment prediction [savle2019topological]. In other cases, topological data analysis is applied to word embedding representations [almgren2017mining] or to tagged text [gholizadeh2018topological] (after performing named entity recognition). We organize these contributions in Table 2. In addition to the contributions mentioned in Table 2, there are several studies utilizing persistent homology in time series and system analysis [pereira2015persistent, khasawneh2014stability, perea2015sliding, maletic2016persistent, stolz2017persistent, garland2016exploring, aurenhammer1991voronoi, gholizadeh2018short]. These approaches are relevant for us, since texts represented by word embeddings can be viewed as time series, as we show in Section 3.
3 Methodology
As mentioned before, we may extract topological features from the TFIDF space or from a word embedding representation of text. Here we use both approaches.
3.1 Topological features from word embeddings
Our method of extracting topological features from embeddings is described in Algorithm 1. Assume that a document with n tokens is represented in a d-dimensional word embedding space by the matrix E of size n x d. We will treat this matrix as a d-dimensional time series; the length of this time series is, of course, equal to n. Here, we intend to investigate the topological characteristics of this time series representing the text. First, we smooth each dimension of the time series (a column of E). Smoothing is a standard technique in time-series analysis [gardner2006exponential] which usually reduces noise and improves the accuracy of prediction.
To smooth each column of the embedding representation we use a centered moving average with two lags and two leads, as in Eq. 1:

Ê(t, j) = (1/5) Σ_{l=-2}^{2} E(t + l, j)    (1)
Then, on the smoothed matrix Ê, we calculate the distance between different embedding dimensions (columns), combining their cosine with their magnitudes:

dist(i, j) = ‖Ê(:, i)‖ ‖Ê(:, j)‖ (1 - cos(Ê(:, i), Ê(:, j)))    (2)
Note that the measure of distance defined in Eq. 2 encodes the word order of the text. The cosine in the equation compares the elements of the two columns that have the same indices; e.g., the first element of column i is compared with the first element of column j, etc. But recall that in the smoothing step (Eq. 1), each index is compounded with two lag and two lead indices. Therefore, comparing the same index of the smoothed columns i and j is in fact comparing each index of the i-th original (non-smoothed) column with a few indices of the j-th original column. That is how the cosine encodes the word order in the algorithm. The distance defined in Eq. 2 also takes into account the magnitudes of the columns (dimensions) being compared.
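In code, one plausible reading of Eqs. 1-2 looks as follows, under our assumptions of a centered five-point average for the smoothing and a magnitude-scaled cosine for the distance; the paper's exact formulas may differ, and all names here are illustrative.

```python
# Sketch of the smoothing (Eq. 1) and column-distance (Eq. 2) steps,
# under our assumptions: a centered five-point moving average (two
# lags, two leads) and a magnitude-scaled cosine distance between
# smoothed embedding columns. E is a toy n x d embedding matrix.
import numpy as np

def smooth_columns(E):
    """Centered moving average of each column over indices t-2..t+2
    (window truncated at the document boundaries)."""
    n, _ = E.shape
    S = np.empty_like(E, dtype=float)
    for t in range(n):
        S[t] = E[max(0, t - 2):min(n, t + 3)].mean(axis=0)
    return S

def column_distances(S):
    """Pairwise distance between embedding dimensions (columns):
    ||a|| * ||b|| * (1 - cos(a, b)), i.e. the product of norms minus
    the dot product, so both angle and magnitude contribute."""
    norms = np.linalg.norm(S, axis=0)
    D = np.outer(norms, norms) - S.T @ S
    np.fill_diagonal(D, 0.0)  # remove floating-point residue
    return D

rng = np.random.default_rng(0)
E = rng.normal(size=(50, 4))            # toy document: 50 tokens, 4 dims
D = column_distances(smooth_columns(E))
print(D.shape)                          # (4, 4) adjacency-like matrix
```

The resulting symmetric, non-negative matrix D is what the next step treats as the adjacency matrix of a graph.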
The pairwise distance matrix can be interpreted as the adjacency matrix of a graph. Thus we can easily apply persistent homology to it and get persistence diagrams at dimension 0 (components) and dimension 1 (loops). Then, for each embedding dimension, we exclude the corresponding vertex of the graph and measure the change in the persistence diagrams. These measures represent the sensitivity of the graph to each embedding dimension. We know that word embeddings represent the tokens of the text, but our main goal is to provide a new representation for the whole document.
To this end, the document is translated into a graph via its adjacency matrix in Steps 1-6 of Algorithm 1, and then into a persistence diagram. We assume that the sensitivity of the diagram with respect to each embedding dimension represents the significance of that embedding dimension in the diagram, and therefore in the original document.
This means that, effectively, we will be classifying documents based on the significance of each embedding dimension. Since the embeddings used to represent the words are derived from a large corpus of text, and they encode similarities and differences between contexts of words in the corpus, we are employing this latent knowledge in a way similar to the standard use of TFIDF weighting in information retrieval. In other words, the dimension-0 and dimension-1 sensitivities represent the importance of particular embedding dimensions in a document, similarly to the TFIDF values representing the importance of words.
We use the Wasserstein distance [edelsbrunner2010computational, berwald2018computing, cohen2010lipschitz] in Algorithm 1. It measures the minimum cost of transforming one distribution into another, and it is a common metric for quantifying the difference between persistence diagrams [marchese2017k]. Recall that a persistence diagram is in fact a set of points in the 2D plane. To compare two persistence diagrams, the Wasserstein distance measures the minimum cost of moving the points of the first diagram so as to convert it into the second diagram.
Finally, as shown in Algorithm 1, we get d features for topological dimension 0 (components) and another d features for topological dimension 1 (loops), one per embedding dimension. We use the resulting topological features to represent the text.
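The vertex-exclusion sensitivity step can be sketched as follows, restricted to dimension 0 so the example stays self-contained: under a Vietoris-Rips filtration, the H0 death radii are exactly the merge heights of single-linkage clustering, i.e. the minimum-spanning-tree edge weights of the distance graph. The paper's pipeline uses Ripser and a proper Wasserstein matching over dimensions 0 and 1; the toy matrix and the crude diagram comparison below are our simplifications.

```python
# Sketch of the vertex-exclusion sensitivity idea from Algorithm 1,
# restricted to dimension 0: drop one vertex (embedding dimension) at
# a time, recompute the H0 diagram, and score the change. H0 deaths
# are computed as MST edge weights (single-linkage merge heights).
import itertools

def h0_deaths(dist, vertices):
    """Kruskal-style MST over the chosen vertices; the weights at
    which components merge are the H0 death radii."""
    vertices = list(vertices)
    parent = {v: v for v in vertices}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    deaths = []
    for w, i, j in sorted((dist[i][j], i, j)
                          for i, j in itertools.combinations(vertices, 2)):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(w)
    return sorted(deaths)

def diagram_change(d1, d2):
    """Crude 1-Wasserstein surrogate: front-pad the shorter sorted
    death list with zeros and sum absolute differences."""
    m = max(len(d1), len(d2))
    a = [0.0] * (m - len(d1)) + d1
    b = [0.0] * (m - len(d2)) + d2
    return sum(abs(x - y) for x, y in zip(a, b))

# Toy 4x4 distance matrix standing in for the matrix of Eq. 2.
D = [[0.0, 0.2, 0.9, 0.8],
     [0.2, 0.0, 0.7, 0.9],
     [0.9, 0.7, 0.0, 0.3],
     [0.8, 0.9, 0.3, 0.0]]
full = h0_deaths(D, range(4))
sensitivity = [diagram_change(full,
                              h0_deaths(D, [v for v in range(4) if v != k]))
               for k in range(4)]
print(full, sensitivity)  # one H0 sensitivity score per embedding dimension
```

Each entry of `sensitivity` plays the role of one dimension-0 feature; the dimension-1 (loop) features would be produced analogously with a real persistence library.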
3.2 Topological features from term frequency space
To apply persistent homology to the TFIDF space, we follow the approach in [zhu2013persistent], i.e., dividing the textual document into a fixed number of blocks and then searching for repetitive patterns in the text. Our method is described in Algorithm 2.
We divide each document into consecutive blocks of equal size and calculate a TFIDF vector for each block. We chose 10 blocks, but one may try a different number of blocks per document. However, we note that using a large number of blocks could make the TFIDF vectors too sparse for comparisons between them to be useful. For instance, if the average length of a document is only 200 tokens and we divide each document into 100 blocks, there would be only two tokens per block, and most pairs of blocks would have zero similarity.
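The blocking step can be sketched as follows, with plain term-frequency vectors for brevity (the paper uses TFIDF, computed with text2vec; the function names here are ours):

```python
# Illustrative sketch of the blocking step in Algorithm 2: split a
# document into 10 equal blocks, build a term-frequency vector per
# block (IDF weighting omitted for brevity), and form the pairwise
# cosine-distance matrix whose persistence is then analyzed.
from collections import Counter
import math

def block_vectors(text, n_blocks=10):
    """Term-frequency Counter for each of n_blocks consecutive,
    equally sized blocks of tokens."""
    tokens = text.lower().split()
    size = max(1, len(tokens) // n_blocks)
    return [Counter(tokens[i * size:(i + 1) * size]) for i in range(n_blocks)]

def cosine_distance(a, b):
    """1 - cosine similarity of two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

doc = "the hero returns home . the hero fights again . " * 25  # 250 tokens
blocks = block_vectors(doc)
dist = [[cosine_distance(a, b) for b in blocks] for a in blocks]
print(len(dist), len(dist[0]))  # 10 10 : one row/column per block
```

A strongly repetitive document, as in this toy example, yields many nearly identical blocks, i.e. a tight cluster of vertices, which is exactly the kind of pattern the persistence diagram summarizes.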
In our experiments, we work on graphs of 10 vertices, where each vertex is represented by its TFIDF vector. An example of such a graph is illustrated in Figure 2. The figure shows that when persistent homology is applied, the number of edges connecting the ten vertices increases with the size of the radius (as described in Section 2.1). The distance between two vertices is derived from the cosine similarity of the vectors associated with each vertex. With 10 vertices, in topological dimension 0 (components) we get exactly 10 diameters of birth and 10 diameters of death. Since for topological dimension 0 all birth diameters are always equal to zero, we only retrieve the death diameters. For topological dimension 1 (loops), we may get a different number of loops for different documents. Thus, if we retrieved all birth and death diameters, we would get different numbers of features for different textual documents. Therefore, we summarize the information from topological dimension 1 (loops) in five statistically inferred features: the number of loops, the average diameter of birth, the average diameter of duration, the standard deviation of birth diameters, and the standard deviation of duration diameters. This is similar to what Mittal and Gupta suggested in [mittal2017topological] to summarize a persistence diagram, that is, using six features from the persistence diagram, including the number of holes, the average lifetime of holes, the maximum diameter of holes, and the maximum distance between holes in each dimension. Here we utilize some similar features. The resulting features (10 from dimension zero plus 5 from dimension one) represent patterns in the text. (As noted by [zhu2013persistent], such a representation may capture, e.g., repetitive patterns in the text.)

4 Description of the Experiments
We run both algorithms on the Wikipedia Movie Plots data set from Kaggle (https://www.kaggle.com/aminejallouli/genreclassificationbasedonwikimoviesplots/data). We selected the movie plots annotated with the four major genres of Drama, Comedy, Action, and Romance. Keeping only the plots containing at least 200 words, we tried to predict the genres solely from the plot texts.
The data set contains a total of 11,500 records. Each record may be annotated with more than one label. More details per class are shown in Table 3. We used 2/3 of the records for training and 1/3 for testing.
To represent the data in word embedding space, we used fastText [bojanowski2016enriching, joulin2016bag] pretrained on Wikipedia 2017, with a vocabulary of 2.5M words and 300-dimensional vectors (https://dl.fbaipublicfiles.com/fasttext/vectorswiki/wiki.en.vec). We chose fastText since in our initial experiments it showed slightly better performance compared to the pretrained Google word2vec [mikolov2013efficient, mikolov2013distributed, mikolov2013linguistic], GloVe [pennington2014glove], and ConceptNet Numberbatch [speer2017conceptnet] vectors. To apply persistent homology and extract topological features we utilized the Ripser [bauer2019ripser] package. The TFIDF vectors for Algorithm 2 were extracted with the text2vec package [selivanov2016text2vec].
Specification         Drama  Comedy  Action  Romance
Overlap with drama    -      524     223     379
Overlap with comedy   524    -       207     544
Overlap with action   223    207     -       117
Overlap with romance  379    544     117     -
Exclusive records     4592   3302    1181    672
Total records         5615   4477    1658    1614
5 Results and Discussion
For each record in the data set, we computed two sets of topological features, based on the word embeddings as in Algorithm 1 and on the TFIDF space as in Algorithm 2. We will call these two sets of features TP1 and TP2, respectively.
First, we fed TP1 to the XGBoost [chen2016xgboost] classifier. Then we tried adding the TP2 features to the same classifier to boost the results. We also tried a bidirectional LSTM to classify the records without using our topological features; it was trained for five epochs with the Adam optimizer [kingma2014adam]. While the bidirectional LSTM showed stronger performance than the XGBoost model fed with our topological features, we assumed that there might be some exclusive information carried by our topological features that is not captured by the LSTM. Thus we tried combining the LSTM results with the XGBoost models. As one of the easiest ways to combine the results, we fed the probabilities (not the rounded predictions) returned by the two models (the LSTM and XGBoost using TP1 and TP2) to a logistic regression model.
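This late-fusion step can be sketched as follows; the probability arrays are synthetic stand-ins for the BiLSTM and XGBoost outputs, so only the mechanics, not the numbers, mirror our experiment.

```python
# Sketch of the late-fusion step: feed the class probabilities of the
# two base models to a logistic regression (one binary stacker per
# genre label). p_lstm and p_xgb are synthetic stand-ins for the
# BiLSTM and XGBoost probability outputs.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 400
y = rng.integers(0, 2, size=n)  # gold labels for one genre

# Two noisy probability streams, each loosely correlated with y.
p_lstm = np.clip(0.6 * y + rng.normal(0.2, 0.2, n), 0.0, 1.0)
p_xgb = np.clip(0.4 * y + rng.normal(0.3, 0.25, n), 0.0, 1.0)

X = np.column_stack([p_lstm, p_xgb])  # stacked probabilities as features
stacker = LogisticRegression().fit(X, y)
print(round(stacker.score(X, y), 3))  # the stacker combines both signals
</```

Because each record may carry several genre labels, one such binary stacker is trained per label.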
As shown in Table 4, our best ensemble model outperforms the LSTM accuracy and F1-score by 1.6% and 5.1%, respectively. The previous results (https://www.kaggle.com/aminejallouli/genreclassificationbasedonwikimoviesplots/notebook), using a linear Support Vector Classifier (SVC) and multinomial Naïve Bayes, are also provided in the table. The detailed results per class are provided in Table 5.
Note that the topological features extracted from the word embedding space (i.e., TP1) can classify the records on their own, with an accuracy comparable but not equal to the LSTM. On the other hand, the topological features extracted from the TFIDF space primarily reflect some repetitive patterns in the text, as Zhu [zhu2013persistent] suggested in a similar study. However, as shown in Table 4 and Table 5, using both topological feature sets boosts the accuracy of classification in the ensemble model.

Classifier  Pre.  Rec.  F1  Acc.

1  BiLSTM  68.0  59.7  0.608  76.2 
2  XGBoost on TP1  59.6  53.2  0.560  71.1 
3  XGBoost on TP1 & TP2  59.9  53.7  0.564  71.4 
4  BiLSTM + XGBoost on TP1  67.8  64.8  0.656  77.3 
5  BiLSTM + XGBoost on TP1 & TP2  68.5  64.6  0.659  77.8 
Previous Results (Linear SVC)  73.5  
Previous Results (Naïve Bayes)  73.3 
Class  BiLSTM  XGB  XGB2  TL4  TL5  prev. SVC  prev. NB 

action  87.7  86.7  86.9  89.3  88.9  81.5  82.7 
comedy  75.6  69.0  69.1  76.9  77.7  74.6  73.3 
drama  69.9  63.9  64.3  71.0  71.6  66.1  67.4 
romance  87.6  86.0  85.9  87.8  87.8  88.3  84.3 
macroavg  76.2  71.1  71.4  77.3  77.8  73.5  73.3 
6 Conclusions
We first summarize our contributions and argue for the potential of topological methods to contribute to text analysis, and then discuss some limitations and open problems.
6.1 Summary of contributions
In this paper, we used two different methods to extract topological features from text and applied them to the task of document classification. The first method converts text, represented as a sequence of word embeddings, into a high-dimensional time series, which is then analyzed using the machinery of topological data analysis, namely persistent homology. The second method augments the classical TFIDF representation of the text with topological features.
Specifically, we have leveraged existing word embeddings along with topology of text to show that such structure can carry some useful information for machine learning classifiers to learn from. To extract topological features from the word embedding space, using the high dimensional time series derived from embeddings, we measured and analyzed the topology of the graph whose vertices are different embedding dimensions.
For topological data analysis of the TFIDF space, we analyzed the topology of the graph whose vertices are the TFIDF vectors of different blocks of a textual document. As we have shown in the results, while a classifier utilizing only topological features may fail to outperform more conventional models like bidirectional LSTMs, these topological features carry some exclusive information that is not captured by conventional text mining methods. Therefore, adding these features to more conventional models can boost the results. In our experiment, using the topological features in the ensemble model resulted in a 4.9% increase in recall, a 0.5% increase in precision, and a 5.1% increase in F1 score.
Briefly, our contributions are as follows:

We introduced a new algorithm for extracting topological features from text, namely by converting a sequence of word embeddings into a time series and analyzing the dimensions of the resulting series for topological persistence.

This algorithm works with documents of any length and, importantly, preserves the word order in its representation.

We have shown that this new method produces features of value for the task of document classification.

We showed that even when the representation of documents is derived from the standard TFIDF matrix, similarly produced topological features improve the accuracy of classification.
Based on the above, we suggest that topological methods deserve deeper examination as a tool for text analysis. We believe that, as with the geometries of vector spaces and conceptual spaces mentioned earlier in Section 1, the topological features, which capture certain geometric invariants, are relevant for text analytics and the semantics of natural language.
6.2 Discussion, Limitations and Open Problems
We end with a discussion, including some of the limitations, and open problems.
The strength of our algorithm for analyzing documents as time series of embeddings lies in its universal applicability, irrespective of the length of the document. The second important property is its use of word order. Finally, the algorithm produces the representation in a single pass.
However, one of the limitations of our methodology is the size of the text blocks. Regarding the embedding-based topological features, the topological structure of a short text would not be stable. Also, due to the lack of context, the embeddings may not provide enough information for classification tasks.
Similarly, using TFIDF vectors of short documents can result in poor simplicial shapes when we divide the text into 10 blocks, as in Section 3: that is, a set of separate dots in space, most of which are not connected at all. In such a case, it is challenging to find an informative topological structure in the text.
Proving the value of the methods used in this article for other natural language processing tasks, such as summarization, entity extraction, or question answering, remains both a limitation of this work and an open problem.
We see two other important open issues, one very technical and one more programmatic. The latter has to do with connecting our work on the topology of text with work on understanding the topological properties of deep neural networks, exemplified e.g. by [kim2020efficient] and [guss2018characterizing]. An urgent technical open problem is to find the actual text behind the topological structures. This is a challenge in our ongoing work.