A Novel Method of Extracting Topological Features from Word Embeddings

03/29/2020
by   Shafie Gholizadeh, et al.
UNC Charlotte

In recent years, topological data analysis has been utilized for a wide range of problems involving high-dimensional noisy data. While text representations are often high dimensional and noisy, there are only a few works on the application of topological data analysis in natural language processing. In this paper, we introduce a novel algorithm to extract topological features from word embedding representations of text that can be used for text classification. Working on word embeddings, topological data analysis can interpret the high-dimensional embedding space and discover the relations among different embedding dimensions. We use persistent homology, the most commonly used tool from topological data analysis, for our experiments. Examining our topological algorithm on long textual documents, we show that our topological features may outperform conventional text mining features.


1 Background

Persistent homology [edelsbrunner2000topological, carlsson2009topology, edelsbrunner2008persistent, chen2015mathematical] is a technique in TDA for multi-scale analysis of data clouds. In a data cloud, there are no pre-defined links between the points. Therefore, there are no $k$-simplices except $0$-simplices, each of which trivially refers to a single data point. The Betti numbers are all zero, except $\beta_0$, which is equal to the number of data points. The main idea of persistent homology is to define an $\epsilon$-distance around each data point and then connect those points with overlapping $\epsilon$-distances. We may vary $\epsilon$ in the range $[0, \infty)$. Setting $\epsilon = 0$, there will be no link between the points. However, increasing $\epsilon$ gradually, some points will get connected. When $\epsilon$ is large enough ($\epsilon \to \infty$), all data points are connected to each other, i.e., the $n$ data points form an $(n-1)$-simplex. On the way of increasing $\epsilon$ from $0$ to $\infty$, the number of connected components changes. Also, many loops in the data may appear and disappear, i.e., the Betti numbers change. On a data cloud, changes in Betti numbers are captured by persistent homology. More precisely, the persistence diagram [edelsbrunner2000topological] captures the birth and death $\epsilon$'s of each component, loop, void, etc. An example of persistent homology on a data cloud, along with the resulting persistence diagram, is illustrated in Fig. 3. Alternatively, we may show the birth and death of loops with barcodes, where each hole is represented by a bar from its birth to its death [collins2004barcode, carlsson2005persistence, ghrist2008barcodes]. For a more general review of persistent homology and its applications, we refer the reader to [edelsbrunner2008persistent, zomorodian2005computing, munch2017user, gholizadeh2018short].
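As a concrete illustration (not part of the paper's pipeline), the following minimal sketch computes persistence diagrams for a small synthetic point cloud using the ripser Python package, which is also the software the paper uses later for its experiments. The point cloud and parameters here are illustrative assumptions.

```python
import numpy as np
from ripser import ripser

# A small synthetic data cloud: 100 points sampled near a circle,
# so we expect one prominent 1-dimensional feature (a loop).
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 100)
points = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((100, 2))

# Compute persistent homology up to dimension 1 (components and loops).
diagrams = ripser(points, maxdim=1)["dgms"]

# diagrams[0]: birth/death pairs of connected components (dimension 0)
# diagrams[1]: birth/death pairs of loops (dimension 1)
print("dimension-0 features:", len(diagrams[0]))
print("dimension-1 features:", len(diagrams[1]))
```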

Figure 1: 0-simplex, 1-simplex, 2-simplex and 3-simplex (top). An example of a simplicial complex (bottom).

While TDA, and especially persistent homology, has been applied to a wide range of problems (e.g., system analysis [maletic2016persistent, garland2016exploring, pereira2015persistent, khasawneh2014stability, perea2015sliding, stolz2017persistent], network coverage [de2006coordinate, de2007coverage], etc.), there are only a few studies using it for natural language processing. Zhu in [zhu2013persistent] used persistent homology to find repetitive patterns in text, comparing vector space representations of different blocks in the documents. Doshi and Zadrozny in [doshi2018movie] utilized Zhu's algorithm [zhu2013persistent] for movie genre detection on the IMDB data set of movie plot summaries and showed how persistent homology can improve the classification. In [wagner2012computational], Wagner et al. utilized TDA to measure the discrepancy among documents represented by their TF-IDF matrices. Guan et al. in [guan2016topological] utilized a topological collapsing algorithm [wilkerson2013simplifying] to develop an unsupervised method of key-phrase extraction. Almgren et al. in [almgren2017mining] and [almgren2017extracting] examined the feasibility of persistent homology for social network analysis; the authors utilized the Mapper algorithm [singh2007topological] to predict image popularity based on word embedding representations of images' captions. In [torres2015topic], Torres-Tramón et al. introduced a topic detection algorithm for Twitter data analysis, utilizing the Mapper algorithm to map the term frequency matrix to topological representations. Savle and Zadrozny in [savle2019topological] used TDA to study discourse structures and showed that topological features derived from the relations among sentences can be informative for predicting text entailment. Considering the word embedding representation of text as a high-dimensional time series, some ideas from the recent applications of persistent homology in time series analysis [pereira2015persistent, khasawneh2014stability, perea2015sliding, maletic2016persistent, stolz2017persistent] are also relevant to text processing.

Figure 2: Betti numbers for some simple shapes.
Figure 3: Persistent homology on a data cloud and the resulting persistence diagram.

In our prior work [gholizadeh2018topological], we applied persistent homology to a set of novels and tried to predict the author without using any conventional text mining features. For each novel, we built an adjacency matrix whose elements measure the co-appearance of the main characters (persons) in the novel. Utilizing persistent homology, we analyzed the graph of the main characters in the novels and fed the resulting topological features (conveying the topological styles of the novelists) to a classifier. Despite the novelty of the algorithm (using persistent homology instead of conventional text mining features), it is not easily applicable to general text classification problems. Here, we introduce a different approach using word embeddings and persistent homology that can be applied to general text classification tasks.

2 Methodology

In our algorithm, Topological Inference of the Embedding Space (TIES), we utilize word embeddings and persistent homology. The input is a textual document and the output is a topological representation of the same document. Later, we may use these representations for text classification, clustering, etc. Step-by-step specifications of TIES are explained in this section.

2.1 Pre-processing

As in any other text mining method, standard pre-processing, possibly including lemmatization, removing stop words and, if necessary, lowercasing, is applied to the text. There might also be some specific pre-processing tasks that are inspired by TDA algorithms.
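A minimal pre-processing sketch along these lines, assuming NLTK is available (the paper does not specify which toolkit it uses):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required NLTK resources.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

def preprocess(text):
    """Lowercase, tokenize, drop stop words, and lemmatize a document."""
    stop = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    tokens = nltk.word_tokenize(text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens if t.isalpha() and t not in stop]

print(preprocess("Topological features were extracted from the embedded documents."))
```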

2.2 Word Embedding Representation

In a document of size $n$, replacing each token with its embedding vector of size $D$ results in a matrix $X$ of size $n \times D$. This matrix can naturally be viewed as a $D$-dimensional time series that represents the document. More precisely, each column $X_j$ ($j = 1, \dots, D$) represents the document in dimension $j$ of the embedding.
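A minimal sketch of this step, using a toy embedding lookup in place of a real pre-trained model (the pre-trained models actually used are discussed in Section 3):

```python
import numpy as np

# Toy embedding table: in practice this would be a pre-trained model
# such as GloVe, fastText, or ConceptNet Numberbatch (see Section 3).
D = 4
vocab = ["market", "risk", "volatility", "pricing", "return"]
rng = np.random.default_rng(0)
embeddings = {w: rng.standard_normal(D) for w in vocab}

def embed_document(tokens):
    """Stack the embedding vectors of known tokens into an (n x D) matrix,
    i.e., a D-dimensional time series indexed by token position."""
    return np.vstack([embeddings[t] for t in tokens if t in embeddings])

X = embed_document(["market", "risk", "return", "volatility"])
print(X.shape)  # (4, 4): n tokens by D embedding dimensions
```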

2.3 Aggregation on Sliding Window

The embedding signals defined above are typically noisy, so some smoothing is helpful. One of the easiest and potentially most efficient ways of such smoothing is to replace each element $X_{i,j}$ of $X$ with a local average over its neighborhood. Equivalently, we may describe it by taking the summation over a sliding window of size $s = 2w + 1$ centered at position $i$, so the smoothed series is $\tilde{X}_{i,j} = \sum_{k=i-w}^{i+w} X_{k,j}$, and $\tilde{X}$ is the smoothed $D$-dimensional time series that represents the document.

For long documents this is almost the same as applying a smoother matrix $S$, i.e., $\tilde{X} = S X$, where $S$ is a tridiagonal (for $s = 3$), pentadiagonal (for $s = 5$) or heptadiagonal (for $s = 7$) binary matrix. The only difference is that in the latter definition no value of the time series is dropped, so the result is only slightly different in size, assuming $n \gg s$. Note that we can also use exponential weights for the elements in the sliding window instead of simply adding them up. In one of our experiments, we used an exponential form in which the weights decay geometrically with distance from the center of the window.
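A sketch of this window-sum smoothing, under the assumption that the window is simply truncated at the document boundaries (the boundary handling only matters for short documents):

```python
import numpy as np

def smooth(X, window=3):
    """Replace each row of the (n x D) embedding matrix X by the sum of the
    rows inside a sliding window of the given (odd) size centered on it."""
    n, _ = X.shape
    half = window // 2
    smoothed = np.empty_like(X)
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        smoothed[i] = X[lo:hi].sum(axis=0)
    return smoothed

# Example: smooth a toy 10-token, 4-dimensional embedding matrix.
X = np.random.default_rng(0).standard_normal((10, 4))
X_tilde = smooth(X, window=3)
print(X_tilde.shape)  # (10, 4)
```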

2.4 Computing Distances

Assume that in a document, some of the embedding dimensions, and the relations among different dimensions, carry some information about the document. Such information could be revealed in a coordinate system where each embedding dimension is represented by a data point, and the distance between two data points (embedding dimensions) represents their relation. A possible choice of distance is formulated in Equation 1.

(1)

The intuition behind defining the distance in Equation 1 is to (1) capture the relation between each pair of smoothed embedding dimensions via cosine similarity, (2) distinguish significant embedding dimensions, and (3) make the distance almost insensitive to the size of the document. Note that Equation 1 can be replaced by any other definition satisfying these three conditions. Aggregation over the sliding window, together with a distance formula like Equation 1, guarantees that the order of the tokens in the document is reflected in the final distance matrix.

For simplicity, let us fix the window size at $s = 3$. Recalling that each column of $X$, viewed as a simple time series, is a function of time (the index of the word/token in the document), and assuming that the length of the document is large enough ($n \gg s$), shifting the time index only excludes a few elements from the beginning or the end of the time series, so its effect is negligible. Expanding the window sums then shows that the inner product of two smoothed columns reduces to a weighted combination of lagged inner products of the original columns, and similarly, in a general form for any window size $s$, Equation 2 holds.

(2)

Such coefficients guarantee that the token order is reflected in the final distance matrix. It means that each embedding dimension at each token in the text is compared with all the other embedding dimensions at the same token, a few tokens before it, and a few tokens after it, though these comparisons carry different weights. Note that similar equations can easily be derived for correlation-based and covariance-based distances. In any case, the distance matrix is sensitive to the window size $s$, or more generally to the smoothing algorithm. For instance, using exponential smoothing results in a geometric sequence of coefficients instead of the arithmetic sequence of coefficients in Equation 2. Regarding the sliding window, the choice of $s$ is a trade-off between increasing the captured information on one side and decreasing the noise on the other.
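Since Equation 1 itself is not reproduced here, the sketch below uses a hypothetical distance that merely satisfies the three stated conditions (a cosine-based relation, emphasis on significant dimensions, and rough insensitivity to document length); it is not the paper's exact formula.

```python
import numpy as np

def distance_matrix(X_tilde):
    """Hypothetical pairwise distance between embedding dimensions (columns).
    It only satisfies the three conditions described for Equation 1 and is
    not the paper's exact definition."""
    n, D = X_tilde.shape
    norms = np.linalg.norm(X_tilde, axis=0) + 1e-12
    # Cosine similarity between every pair of embedding dimensions.
    cosine = (X_tilde.T @ X_tilde) / np.outer(norms, norms)
    # Per-dimension RMS magnitude: dividing by sqrt(n) keeps it comparable
    # across documents of different lengths (condition 3).
    significance = norms / np.sqrt(n)
    significance = significance / significance.max()
    # Weak (low-magnitude) dimensions end up far from everything else,
    # which is one possible reading of condition 2.
    M = (1.0 - cosine) / np.outer(significance, significance)
    np.fill_diagonal(M, 0.0)
    return M

M = distance_matrix(np.random.default_rng(0).standard_normal((200, 8)))
print(M.shape)  # (8, 8)
```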

2.5 Applying Persistent Homology

Having the distance matrix for each document, a persistence diagram can be constructed for topological dimension 0 (the number of clusters) and dimension 1 (the number of loops), denoted by $PD_0$ and $PD_1$ respectively. (These topological dimensions should not be mistaken for embedding dimensions.) However, this persistence diagram alone is not very useful as a representation of the document.

In our prior work [gholizadeh2018topological], the resulting graph of the main characters (persons) in each novel, and therefore the distance matrix, was not annotated, nor did it need to be, since we had designed the algorithm to deal with the main characters whatever their names are. In other words, to capture the topological signature of a novelist, it did not matter whether the names of the main characters were shifted. Here, however, dealing with time series in different dimensions, the order of the embedding dimensions is meaningful, since different embedding dimensions play different roles in representing the document. Therefore, feeding the time series to the persistent homology algorithm is meaningless unless we somehow distinguish the different dimensions. One intuitive way is to compare the persistence diagram with and without each embedding dimension; in other words, we can measure the change in the persistence diagram when we exclude one embedding dimension. We measure the sensitivity of the persistence diagram generated by Ripser [bauer2019ripser, bauer2017ripser] to each embedding dimension and use it as a measure of the sensitivity of the document itself to that dimension. This way the document is represented by an array of size $D$ based on $PD_0$ and another array of size $D$ based on $PD_1$, as formulated in Equation 3, where the distance between two persistence diagrams can be any suitable measure, e.g., the Wasserstein distance.

(3)
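A sketch of this leave-one-dimension-out sensitivity computation, assuming the ripser and persim packages and the hypothetical distance_matrix helper above (the paper specifies Ripser and a diagram distance such as Wasserstein, but not this exact code):

```python
import numpy as np
from ripser import ripser
from persim import wasserstein

def _finite(dgm):
    """Drop points at infinity (e.g., the everlasting dimension-0 component)."""
    return dgm[np.isfinite(dgm).all(axis=1)]

def ties_features(X_tilde, dist_fn):
    """Sensitivity of the persistence diagram to each embedding dimension:
    distance between the diagram of the full distance matrix and the diagram
    obtained after dropping one dimension, for PD_0 and PD_1 separately."""
    D = X_tilde.shape[1]
    full = ripser(dist_fn(X_tilde), distance_matrix=True, maxdim=1)["dgms"]
    feats = np.zeros((D, 2))
    for j in range(D):
        reduced = np.delete(X_tilde, j, axis=1)
        dgms = ripser(dist_fn(reduced), distance_matrix=True, maxdim=1)["dgms"]
        feats[j, 0] = wasserstein(_finite(full[0]), _finite(dgms[0]))  # PD_0 sensitivity
        feats[j, 1] = wasserstein(_finite(full[1]), _finite(dgms[1]))  # PD_1 sensitivity
    return feats.ravel()  # a 2*D-long topological feature vector per document

features = ties_features(np.random.default_rng(0).standard_normal((200, 8)), distance_matrix)
print(features.shape)  # (16,)
```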

A block diagram of TIES is shown in Fig. 4.

Figure 4: A block diagram of TIES. Word embedding representations are aggregated over sliding windows. Then, defining the distance between pairs of embedding dimensions, a persistence diagram can represent the text. Finally, the topological features are the sensitivities of the persistence diagram with respect to the different dimensions.

3 Data Specification

To examine our topological algorithm (TIES), we use the following data sets and predict the labels in a multi-class, multi-label classification setting.

  • arXiv Papers: We downloaded all arXiv papers in quantitative finance (https://arXiv.org/archive/q-fin) published between 2011 and 2018. Then we selected five major categories (subject tags): “q-fin.GN” (General Finance), “q-fin.ST” (Statistical Finance), “q-fin.MF” (Mathematical Finance), “q-fin.PR” (Pricing of Securities), and “q-fin.RM” (Risk Management). For pre-processing, we removed the titles, author names and affiliations, abstracts, keywords, and references. Then we tried to predict the subjects solely based on the paper body.

  • IMDB Movie Review [maas2011learning]: Using IMDB reviews annotated with positive/negative labels, we examined TIES on the binary sentiment classification task.

Table 1 contains the specifications of both data sets. Note that each record in the arXiv data set may have more than one label. The histogram of the number of labels per record is shown in Fig. 5. As shown in the histogram, the majority of records in the arXiv data set are tagged with only a single label.

Specification         arXiv Quant. Fin. Papers   IMDB Movie Reviews
Labels                5 (multi-label)            2
Clean Records         4601                       6000
Length of Records     -                          -
Frequency of Labels   see Fig. 5                 -
Table 1: Data specification for the arXiv papers and IMDB Movie Review data sets.
Figure 5: Histogram of the number of labels per document in the arXiv papers data set.

In practice, for many classification and clustering tasks in text processing, the data covers only very short documents (e.g., a limited data set of social media posts). A big challenge is therefore training word embedding models on short documents. Such a challenge is beyond the scope of this study, and we use versions of word embeddings that were previously trained on large corpora. Specifically, we use the following pre-trained models: fastText, GloVe, and ConceptNet Numberbatch.
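As an illustration, pre-trained vectors of this kind can be loaded through gensim's downloader API; the model identifiers below are gensim's public names and are assumptions, not necessarily the exact versions used in our experiments.

```python
import gensim.downloader as api

# Gensim identifiers for publicly available pre-trained embeddings;
# the exact versions used in the paper may differ.
MODELS = {
    "fastText": "fasttext-wiki-news-subwords-300",
    "GloVe": "glove-wiki-gigaword-300",
    "Numberbatch": "conceptnet-numberbatch-17-06-300",
}

embeddings = {name: api.load(key) for name, key in MODELS.items()}
for name, kv in embeddings.items():
    print(name, kv.vector_size)
```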

Model Embedding Window Prec. (%) Rec. (%) F1 Acc. (%)
TIES + XGBoost fastText 3 61.9 55.4 0.575 80.1
TIES + XGBoost GloVe 3 63.1 56.7 0.597 80.7
TIES + XGBoost Numberbatch 3 68.7 60.5 0.643 82.6
TIES + XGBoost fastText 5 60.8 54.7 0.576 79.8
TIES + XGBoost GloVe 5 61.8 56.1 0.588 80.3
TIES + XGBoost Numberbatch 5 65.5 58.4 0.617 81.6
TIES + XGBoost fastText 7 58.9 54.4 0.566 79.5
TIES + XGBoost GloVe 7 62.8 56.4 0.594 80.6
TIES + XGBoost Numberbatch 7 65.7 57.7 0.614 81.3
TIES + XGBoost fastText 7 expon. 60.3 54.6 0.573 79.7
TIES + XGBoost GloVe 7 expon. 61.2 55.9 0.584 80.2
TIES + XGBoost Numberbatch 7 expon. 66.4 59.6 0.628 82.2
CNN fastText - 57.1 64.3 60.5 80.0
CNN GloVe - 57.6 64.2 60.7 80.6
CNN Numberbatch - 55.0 67.6 60.7 79.8
Table 2: Results on the arXiv papers data set. The best result is achieved using ConceptNet Numberbatch as the pre-trained embedding and a window size of 3.
Model Embedding Window Prec. (%) Rec. (%) F1 Acc. (%)
TIES + XGBoost fastText 3 84.8 85.8 0.853 85.4
TIES + XGBoost GloVe 3 86.9 88.0 0.874 87.5
TIES + XGBoost Numberbatch 3 87.9 89.0 0.884 88.5
TIES + XGBoost fastText 5 84.2 85.2 0.847 84.8
TIES + XGBoost GloVe 5 85.6 86.6 0.861 86.2
TIES + XGBoost Numberbatch 5 86.5 87.6 0.870 87.1
TIES + XGBoost fastText 7 82.8 83.8 0.833 83.4
TIES + XGBoost GloVe 7 83.8 84.8 0.843 84.4
TIES + XGBoost Numberbatch 7 85.3 86.3 0.858 85.9
TIES + XGBoost fastText 7 expon. 84.3 85.3 0.848 84.9
TIES + XGBoost GloVe 7 expon. 86.5 87.6 0.870 87.1
TIES + XGBoost Numberbatch 7 expon. 87.0 88.1 0.875 87.6
Shaukat et al. (2020) [shaukat2020sentiment] Lexicon-based - - - - 86.7
Giatsoglou et al. (2017) [giatsoglou2017sentiment] Hybrid - - - 0.880 87.8
Table 3: Results on the IMDB Movie Review data set. The best result is achieved using ConceptNet Numberbatch as the pre-trained embedding and a window size of 3.
Subject Test Records Precision (%) Recall (%) F1 Accuracy (%)
q-fin.GN 410 73.2 68.5 0.708 83.8
q-fin.ST 396 70.2 67.5 0.688 83.6
q-fin.MF 306 66.0 45.6 0.539 77.5
q-fin.PR 305 69.5 55.2 0.615 82.7
q-fin.RM 307 62.5 61.0 0.617 84.5
Table 4: Per-class results on the arXiv papers data set using ConceptNet Numberbatch as the pre-trained embedding and a window size of 3.

4 Results and Discussion

We run our binary classification and multi-label, multi-class classification on both data sets using XGBoost [chen2015xgboost, chen2016xgboost] with fixed hyperparameters. In each data set, 2/3 of the records were randomly selected for training and 1/3 were used for testing. Table 2 and Table 3 show the results on the arXiv papers data set and the IMDB movie review data set, respectively. On each data set, we run the classifier using different pre-trained embedding models and different sliding window sizes to smooth the embedding signals. For both the arXiv papers and IMDB Movie Review data sets, the best result is achieved using ConceptNet Numberbatch as the pre-trained embedding and a window size of 3. Detailed results for the arXiv papers set are shown in Table 4. As shown in Fig. 6, the performance of the classifier does not depend on the length of the records.
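A sketch of the classification stage on top of the extracted topological feature vectors, using placeholder data, a 2/3-1/3 split as in the paper, and illustrative XGBoost hyperparameters (the exact settings used in our experiments are not reproduced here):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from xgboost import XGBClassifier

# Placeholder data: one TIES feature vector (length 2*D) per document,
# with a binary label, standing in for the IMDB sentiment task.
rng = np.random.default_rng(0)
features = rng.standard_normal((600, 2 * 300))
labels = rng.integers(0, 2, 600)

# 2/3 training, 1/3 testing.
X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=1/3, random_state=0)

clf = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("Accuracy:", accuracy_score(y_te, pred), "F1:", f1_score(y_te, pred))
```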

To evaluate our results, for the arXiv data set we ran a convolutional neural network using the same pre-trained word embeddings. As shown in Table 2, our best configuration using TIES outperforms the baseline CNN model in terms of accuracy and F1 score. For the IMDB reviews data set, we compare our results to the previous results of the lexicon-based approach of Shaukat et al. (2020) [shaukat2020sentiment] and the hybrid approach of Giatsoglou et al. (2017) [giatsoglou2017sentiment]. The comparison reveals that TIES outperforms these previous models.

Figure 6: Accuracy of classification vs length of records in arXiv data set of papers.

5 Conclusion

In this paper, we introduced a novel method to define and extract topological features from word embedding representations of a corpus and used them for text classification. We utilized persistent homology, the most commonly used tool from topological data analysis, to interpret the embedding space of each textual document. In our experiments, we showed that, working on textual documents, our topological features can outperform conventional text mining features. Especially when the textual documents are long, using these topological features can improve the results. However, in TIES, we analyze the different embedding dimensions as time series; thus, to achieve reasonable results, a large number of tokens in each textual document is required. We acknowledge this issue as the main limitation of our algorithm. Also, it is not easy to measure or interpret the impact of different parts of the text input on the output of TIES. This is one of the possible future directions for this study.

References