1 Background
Persistent homology [edelsbrunner2000topological, carlsson2009topology, edelsbrunner2008persistent, chen2015mathematical] is a technique in TDA for multiscale analysis of data clouds. In a data cloud there are no predefined links between the points; hence there are no simplices except the trivial ones, each referring to a single data point. All Betti numbers are zero except the zeroth, which equals the number of data points. The main idea of persistent homology is to place a ball of radius ε around each data point and connect those points whose balls overlap. We may vary ε over the range [0, ∞). Setting ε = 0, there is no link between the points; as ε increases gradually, some points get connected. When ε is large enough, all data points are connected to each other, i.e., the data points form a single simplex. As ε increases from 0 to ∞, the number of connected components changes; likewise, loops in the data may appear and disappear, i.e., the Betti numbers change. Persistent homology captures these changes in the Betti numbers of a data cloud. More precisely, the persistence diagram [edelsbrunner2000topological] records the birth and death ε values of each component, loop, void, etc. An example of persistent homology on a data cloud, along with the resulting persistence diagram, is illustrated in Fig. 3. Alternatively, we may show the births and deaths of loops with barcodes, where each hole is represented by a bar from its birth to its death [collins2004barcode, carlsson2005persistence, ghrist2008barcodes]. For a more general review of persistent homology and its applications we refer the reader to [edelsbrunner2008persistent, zomorodian2005computing, munch2017user, gholizadeh2018short].
While TDA, and especially persistent homology, has been applied to a wide range of problems (e.g., system analysis [maletic2016persistent, garland2016exploring, pereira2015persistent, khasawneh2014stability, perea2015sliding, stolz2017persistent], network coverage [de2006coordinate, de2007coverage], etc.), there are only a few studies using it for natural language processing. Zhu [zhu2013persistent] used persistent homology to find repetitive patterns in text by comparing vector space representations of different blocks of a document. Doshi and Zadrozny [doshi2018movie] utilized Zhu's algorithm [zhu2013persistent] for movie genre detection on the IMDB data set of movie plot summaries and showed how persistent homology can improve classification. Wagner et al. [wagner2012computational] utilized TDA to measure the discrepancy among documents represented by their TF-IDF matrices. Guan et al. [guan2016topological] utilized a topological collapsing algorithm [wilkerson2013simplifying] to develop an unsupervised method of keyphrase extraction. Almgren et al. [almgren2017mining, almgren2017extracting] examined the feasibility of persistent homology for social network analysis, utilizing the Mapper algorithm [singh2007topological] to predict image popularity based on word embedding representations of image captions. Torres-Tramón et al. [torres2015topic] introduced a topic detection algorithm for Twitter data analysis, utilizing the Mapper algorithm to map the term frequency matrix to topological representations. Savle and Zadrozny [savle2019topological] used TDA to study discourse structures and showed that topological features derived from the relations among sentences can be informative for predicting text entailment. Viewing the word embedding representation of a text as a high-dimensional time series, some ideas from recent applications of persistent homology to time series analysis [pereira2015persistent, khasawneh2014stability, perea2015sliding, maletic2016persistent, stolz2017persistent] are also applicable to text processing. In our prior work [gholizadeh2018topological]
, we applied persistent homology to a set of novels and tried to predict the author without using any conventional text mining features. For each novel, we built an adjacency matrix whose elements measure the co-appearance of the novel's main characters (persons). Utilizing persistent homology, we analyzed the graph of the main characters and fed the resulting topological features (conveying the topological styles of novelists) to a classifier. Despite the novelty of the algorithm (using persistent homology instead of conventional text mining features), it is not easily applicable to general text classification problems. Here, we introduce a different approach using word embeddings and persistent homology that can be applied to general text classification tasks.
2 Methodology
In our algorithm, Topological Inference of the Embedding Space (TIES), we utilize word embeddings and persistent homology. The input is a textual document and the output is a topological representation of the same document. Later, we may use these representations for text classification, clustering, etc. Step-by-step specifications of TIES are explained in this section.
2.1 Preprocessing
As in any other text mining method, standard preprocessing, possibly including lemmatization, stop-word removal, and, if necessary, lowercasing, is applied to the text. There might also be some specific preprocessing tasks inspired by TDA algorithms.
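A minimal, dependency-free sketch of this step (the stop-word list and tokenizer below are illustrative stand-ins; a real pipeline would use, e.g., NLTK or spaCy, which also provide lemmatization):

```python
import re

# Toy stop-word list for illustration only; real pipelines use a full list.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize on letter runs, and drop stop words."""
    text = text.lower()
    tokens = re.findall(r"[a-z']+", text)   # simple word tokenizer
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The radius IS increased, and loops appear in the data."))
# → ['radius', 'increased', 'loops', 'appear', 'data']
```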
2.2 Word Embedding Representation
In a document of n tokens, replacing each token with its embedding vector of size d results in a matrix of size n × d. This matrix can naturally be viewed as a d-dimensional time series that represents the document. More precisely, each column j (j = 1, …, d) represents the document in the j-th dimension of the embedding.
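The construction of the n × d matrix can be sketched as follows, with a toy embedding table standing in for a real pretrained model:

```python
import numpy as np

# Toy 4-dimensional embedding table (hypothetical vectors, for illustration
# only); a real run would load GloVe, fastText, or Numberbatch vectors.
EMB = {
    "market": np.array([0.1, -0.3, 0.7, 0.2]),
    "risk":   np.array([0.4,  0.1, -0.2, 0.5]),
    "price":  np.array([0.0,  0.6,  0.3, -0.1]),
}

def doc_matrix(tokens, emb, d=4):
    """Stack per-token vectors into an n x d matrix; unknown tokens -> zeros."""
    rows = [emb.get(t, np.zeros(d)) for t in tokens]
    return np.vstack(rows)

M = doc_matrix(["market", "risk", "price", "risk"], EMB)
print(M.shape)   # → (4, 4): n tokens x d dimensions
# Column j of M is the document viewed as a time series in dimension j.
```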
2.3 Aggregation on Sliding Window
One of the easiest and potentially most efficient ways to smooth these time series is to replace each element of each dimension's series with a local average over its neighborhood. Equivalently, we may take the summation over a sliding window of size w, so that each smoothed value is the sum of the w original values around it; the result is a smoothed d-dimensional time series that represents the document. For long documents this is almost the same as multiplying by a smoother matrix, which is a tridiagonal (for w = 3), pentadiagonal (for w = 5), or heptadiagonal (for w = 7) binary matrix. The only difference is that in the latter definition no value of the time series is dropped, so the result differs only slightly in size, assuming the document length n ≫ w. Note that we can also use exponential weights for the elements of the sliding window instead of simply adding them up; in one of our experiments, we tried such an exponential form.
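Both the plain window summation and an exponential variant can be sketched with a convolution; the decay base below is a hypothetical parameter, not a value from the paper:

```python
import numpy as np

def smooth(x, w=3, exponential=False, base=0.5):
    """Sum (or exponentially weight) each length-w window of a 1-D series.

    'valid' mode drops (w - 1) boundary values, matching the slight size
    change described in the text; the smoother-matrix variant would keep
    the original length instead.
    """
    if exponential:
        # Assumed exponential form: weights base**|k| around the window center.
        half = w // 2
        kernel = np.array([base ** abs(k) for k in range(-half, half + 1)])
    else:
        kernel = np.ones(w)
    return np.convolve(x, kernel, mode="valid")

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(smooth(x, w=3))   # → [ 6.  9. 12.]
```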
2.4 Computing Distances
Assume that in a document, some of the embedding dimensions, and the relations among different dimensions, carry information about the document. Such information can be revealed in a coordinate system where each embedding dimension is represented by a data point and the distance between two data points (embedding dimensions) represents their relation. A possible choice of distance is formulated in Equation 1.
(1) 
The intuition behind the distance in Equation 1 is to (1) consider the relation between two embedding dimensions via cosine similarity, (2) distinguish significant embedding dimensions, and (3) make the distance almost insensitive to the size of the document via a normalizing term. Note that Equation 1 can be replaced by any other definition satisfying these three conditions. Aggregation over the sliding window, along with a distance formula like Equation 1, guarantees that the order of the tokens in the document is reflected in the final distance matrix. For simplicity, fix the window size w; recalling that each column of the smoothed matrix, viewed as a simple time series, is a function of time (the index of the word/token in the document),
and assuming that the length of the document is large enough (n ≫ w), the shifted and unshifted series become interchangeable, since shifting the time index only excludes elements from the beginning or the end of the time series, and this effect is negligible when n is large. It is then easy to derive the resulting expansion, and similarly, in a general form for any window size w, Equation 2 holds.
(2) 
Such coefficients guarantee that token order is reflected in the final distance matrix. That is, each embedding dimension of each token in the text is compared with all the other embedding dimensions of the same token, of a few tokens before it, and of a few tokens after it, though these comparisons carry different weights. Note that similar equations can easily be derived for correlation-based and covariance-based distances. In any case, the distance matrix is sensitive to the window size w, or more generally to the smoothing algorithm; for instance, using exponential smoothing results in a geometric sequence of coefficients instead of the arithmetic sequence of coefficients in Equation 2. Regarding the sliding window, the choice of w is a trade-off between capturing more information on one side and reducing noise on the other.
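Since the exact form of Equation 1 is not reproduced here, the sketch below uses plain cosine distance between smoothed dimension columns as a stand-in that satisfies condition (1); the significance and length-normalization terms of conditions (2) and (3) are omitted:

```python
import numpy as np

def dimension_distances(M):
    """Cosine-based distance between every pair of embedding dimensions,
    i.e., between the columns of the (smoothed) n x d document matrix M.

    A simplified stand-in for Equation 1: any distance meeting the three
    stated conditions could replace it.
    """
    norms = np.linalg.norm(M, axis=0)
    norms[norms == 0] = 1.0                    # guard against zero columns
    cos = (M.T @ M) / np.outer(norms, norms)   # d x d cosine similarities
    D = 1.0 - cos                              # similarity -> distance
    np.fill_diagonal(D, 0.0)
    return D

M = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])
D = dimension_distances(M)
print(D.shape)             # → (3, 3)
print(round(D[0, 1], 3))   # orthogonal columns -> distance 1.0
```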
2.5 Applying Persistent Homology
Having the distance matrix for each document, a persistence diagram can be constructed for topological dimension 0 (number of clusters) and dimension 1 (number of loops), denoted by PD0 and PD1, respectively (these topological dimensions should not be mistaken for embedding dimensions). However, this persistence diagram alone is not very useful as a representation of the document.
In our prior work [gholizadeh2018topological], the resulting graph of the main characters (persons) in each novel, and therefore the distance matrix, was neither annotated nor needed to be annotated, since we had designed the algorithm to deal with the main characters whatever their names were. In other words, to capture the topological signature of a novelist, it did not matter whether the names of the main characters were shifted. Here, however, dealing with time series in different dimensions, the order of the embedding dimensions is meaningful, since different embedding dimensions play different roles in representing the document. Therefore, feeding the time series to the persistent homology algorithm is meaningless unless we somehow manage to distinguish the different dimensions. One intuitive way is to compare the persistence diagram with and without each embedding dimension. In other words, we can measure the change in the persistence diagram when we exclude one embedding dimension. We measure the sensitivity of the persistence diagram generated by Ripser [bauer2019ripser, bauer2017ripser] to each embedding dimension and later use it as a measure of the sensitivity of the document itself to that embedding dimension. In this way, the document is represented by an array of size d based on PD0 and another array of size d based on PD1, as formulated in Equation 3, where the measure can be any distance between two persistence diagrams, e.g., the Wasserstein distance.
(3) 
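The paper computes diagrams with Ripser; as a dependency-free illustration of the idea, the sketch below computes the dimension-0 diagram of a distance matrix via union-find (Kruskal's algorithm on the Rips filtration) and a leave-one-dimension-out sensitivity, with total persistence as a deliberately simplified stand-in for the Wasserstein distance:

```python
import numpy as np

def h0_diagram(D):
    """Dimension-0 persistence of a distance matrix via union-find: every
    point is born at scale 0, and a component dies at the weight of the
    edge that merges it into another component."""
    n = D.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    edges = sorted((D[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    deaths = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(w)                # one component dies at this scale
    return [(0.0, d) for d in deaths]       # the last component never dies

def total_persistence(diagram):
    """Sum of bar lifetimes: a crude stand-in for a diagram distance."""
    return sum(d - b for b, d in diagram)

def sensitivity(D):
    """Leave-one-dimension-out sensitivity: each embedding dimension is a
    point of the cloud, so excluding dimension j removes row/column j."""
    base = total_persistence(h0_diagram(D))
    return [abs(base - total_persistence(h0_diagram(
                np.delete(np.delete(D, j, axis=0), j, axis=1))))
            for j in range(D.shape[0])]

D = np.array([[0.0, 1.0, 5.0],
              [1.0, 0.0, 5.099],
              [5.0, 5.099, 0.0]])
print(h0_diagram(D))    # two finite bars: (0.0, 1.0) and (0.0, 5.0)
print(sensitivity(D))   # point 2 is by far the most influential
```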
A block diagram of TIES is shown in Fig. 4.
3 Data Specification
To examine our topological algorithm (TIES), we use the following data sets and predict the labels in multi-class, multi-label classification.

arXiv Papers: We downloaded all arXiv papers in quantitative finance (https://arxiv.org/archive/q-fin) published between 2011 and 2018. Then we selected five major categories (subject tags): "q-fin.GN" (General Finance), "q-fin.ST" (Statistical Finance), "q-fin.MF" (Mathematical Finance), "q-fin.PR" (Pricing of Securities), and "q-fin.RM" (Risk Management). For preprocessing, we removed the titles, author names and affiliations, abstracts, keywords, and references, and then tried to predict the subjects solely from the paper body.

IMDB Movie Reviews [maas2011learning]: Using IMDB reviews annotated with positive/negative labels, we examined TIES on the binary sentiment classification task.
Table 1 contains the specifications of both data sets. Note that each record in the arXiv data set may have more than one label. The histogram of the number of labels per record is shown in Fig. 5; as it shows, the majority of records in the arXiv data set are tagged with only a single label.
Specification | arXiv Quant. Fin. Papers | IMDB Movie Reviews
Labels | 5 (multi-label) | 2
Clean Records | 4601 | 6000
Length of Records | |
Frequency of Labels | |
In practice, for many classification and clustering tasks in text processing, the data consists only of very short documents (e.g., a limited data set of social media posts). A big challenge is therefore training word embedding models on short documents. Such a challenge is beyond the scope of this study, and we instead use pretrained word embeddings trained on large corpora. Specifically, we use the following pretrained models.

GloVe [pennington2014glove], pretrained on Wikipedia 2014 and Gigaword 5, with vocabulary size of K and d vectors (http://nlp.stanford.edu/data/glove.6B.zip).

fastText [bojanowski2016enriching, joulin2016bag], pretrained on Wikipedia 2017, with vocabulary size of M and d vectors (https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec).

ConceptNet Numberbatch [speer2017conceptnet], with vocabulary size of K and d vectors (https://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-en-17.06.txt.gz).
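All three releases share a plain-text word-vector format, so a minimal loader can be sketched as follows (the parsing rule is an assumption about that standard format, not code from the paper):

```python
import io

import numpy as np

def load_embeddings(lines):
    """Parse the word-vector text format shared by GloVe / fastText /
    Numberbatch releases: one token per line followed by its floats.
    (fastText .vec files start with a "count dim" header line, which the
    length check below skips.)"""
    emb = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 3:
            continue                      # skip headers / blank lines
        emb[parts[0]] = np.array([float(v) for v in parts[1:]])
    return emb

sample = io.StringIO("the 0.1 0.2 0.3\nmarket -0.5 0.4 0.0\n")
vecs = load_embeddings(sample)
print(sorted(vecs))        # → ['market', 'the']
print(vecs["the"].shape)   # → (3,)
```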
Model | Embedding | Window | Prec. (%) | Rec. (%) | F1 (%) | Acc. (%)
TIES + XGBoost | fastText | 3 | 61.9 | 55.4 | 57.5 | 80.1
TIES + XGBoost | GloVe | 3 | 63.1 | 56.7 | 59.7 | 80.7
TIES + XGBoost | Numberbatch | 3 | 68.7 | 60.5 | 64.3 | 82.6
TIES + XGBoost | fastText | 5 | 60.8 | 54.7 | 57.6 | 79.8
TIES + XGBoost | GloVe | 5 | 61.8 | 56.1 | 58.8 | 80.3
TIES + XGBoost | Numberbatch | 5 | 65.5 | 58.4 | 61.7 | 81.6
TIES + XGBoost | fastText | 7 | 58.9 | 54.4 | 56.6 | 79.5
TIES + XGBoost | GloVe | 7 | 62.8 | 56.4 | 59.4 | 80.6
TIES + XGBoost | Numberbatch | 7 | 65.7 | 57.7 | 61.4 | 81.3
TIES + XGBoost | fastText | 7 expon. | 60.3 | 54.6 | 57.3 | 79.7
TIES + XGBoost | GloVe | 7 expon. | 61.2 | 55.9 | 58.4 | 80.2
TIES + XGBoost | Numberbatch | 7 expon. | 66.4 | 59.6 | 62.8 | 82.2
CNN | fastText | - | 57.1 | 64.3 | 60.5 | 80.0
CNN | GloVe | - | 57.6 | 64.2 | 60.7 | 80.6
CNN | Numberbatch | - | 55.0 | 67.6 | 60.7 | 79.8
Model | Embedding | Window | Prec. (%) | Rec. (%) | F1 (%) | Acc. (%)
TIES + XGBoost | fastText | 3 | 84.8 | 85.8 | 85.3 | 85.4
TIES + XGBoost | GloVe | 3 | 86.9 | 88.0 | 87.4 | 87.5
TIES + XGBoost | Numberbatch | 3 | 87.9 | 89.0 | 88.4 | 88.5
TIES + XGBoost | fastText | 5 | 84.2 | 85.2 | 84.7 | 84.8
TIES + XGBoost | GloVe | 5 | 85.6 | 86.6 | 86.1 | 86.2
TIES + XGBoost | Numberbatch | 5 | 86.5 | 87.6 | 87.0 | 87.1
TIES + XGBoost | fastText | 7 | 82.8 | 83.8 | 83.3 | 83.4
TIES + XGBoost | GloVe | 7 | 83.8 | 84.8 | 84.3 | 84.4
TIES + XGBoost | Numberbatch | 7 | 85.3 | 86.3 | 85.8 | 85.9
TIES + XGBoost | fastText | 7 expon. | 84.3 | 85.3 | 84.8 | 84.9
TIES + XGBoost | GloVe | 7 expon. | 86.5 | 87.6 | 87.0 | 87.1
TIES + XGBoost | Numberbatch | 7 expon. | 87.0 | 88.1 | 87.5 | 87.6
Shaukat et al. (2020) [shaukat2020sentiment] | Lexicon-based | - | | | | 86.7
Giatsoglou et al. (2017) [giatsoglou2017sentiment] | Hybrid | - | | | 88.0 | 87.8
Subject | Test Records | Precision (%) | Recall (%) | F1 (%) | Accuracy (%)
q-fin.GN | 410 | 73.2 | 68.5 | 70.8 | 83.8
q-fin.ST | 396 | 70.2 | 67.5 | 68.8 | 83.6
q-fin.MF | 306 | 66.0 | 45.6 | 53.9 | 77.5
q-fin.PR | 305 | 69.5 | 55.2 | 61.5 | 82.7
q-fin.RM | 307 | 62.5 | 61.0 | 61.7 | 84.5
4 Results and Discussion
We ran our binary classification and multi-label multi-class classification on both data sets using XGBoost [chen2015xgboost, chen2016xgboost]. In each data set, 2/3 of the records were randomly selected for training and 1/3 used for testing. Tables 2 and 3 show the results on the arXiv papers data set and the IMDB movie reviews data set, respectively. On each data set, we ran the classifier with different pretrained embedding models and different sliding window sizes for smoothing the embedding signals. For both the arXiv papers and the IMDB movie reviews data sets, the best result is achieved using ConceptNet Numberbatch as the pretrained embedding with a window size of 3. Detailed results for the arXiv papers set are shown in Table 4. As shown in Fig. 6, the performance of the classifier does not depend on the length of the records.
To evaluate our results on the arXiv data set, we ran a convolutional neural network using the same pretrained word embeddings. As shown in Table 2, our best configuration of TIES outperforms the baseline CNN model in terms of accuracy and F1 score. For the IMDB reviews data set, we compare our results to the previous results of the lexicon-based approach of Shaukat et al. (2020) [shaukat2020sentiment] and the hybrid approach of Giatsoglou et al. (2017) [giatsoglou2017sentiment]. The comparison reveals that TIES outperforms the previous models.
5 Conclusion
In this paper, we introduced a novel method to define and extract topological features from word embedding representations of a corpus and used them for text classification. We utilized persistent homology, the most commonly used tool from topological data analysis, to interpret the embedding space of each textual document. In our experiments, we showed that our topological features can outperform conventional text mining features, especially when the textual documents are long. However, TIES analyzes the different embedding dimensions as time series, so a large number of tokens per document is required to achieve reasonable results; we acknowledge this as the main limitation of our algorithm. Also, it is not easy to measure or interpret the impact of different parts of the text input on the output of TIES. This is one of the possible future directions for this study.