Introduction
In 2003, Yoshua Bengio and his team proposed the first neural network for natural language processing (NLP) ([first_neural_net_nlp]). Since then, there has been a flurry of new model architectures surpassing previous records on many textual tasks. Recurrent neural networks (RNNs) ([first_recurrent_neural_net]) were replaced by long short-term memory (LSTM) networks ([first_long_short_term_memory_net]) before attention-based models were proposed ([transformer]). One of the first text encoder models using only the attention mechanism, without any recurrent parts, is BERT, which achieved state-of-the-art performance on many NLP benchmark tasks. Its success came with the desire to understand what type of language knowledge these models acquire, such as grammar rules, semantics, syntactic relations between words, or even the world knowledge they can infer through language. Numerous studies on these different subjects are well summarized under the term BERTology ([bertology]).
The geometric and topological information contained in BERT has recently caught the interest of the topological data analysis (TDA) community ([introduction_persistent_homology_comp_science, acceptability_judgement_topo_att_maps, artificial_text_topo_att_maps]). The activations of an attention head can be transformed into an attention graph. One can then filter this graph and apply persistent homology to study the evolution of connected components and higher-order structures, compute a collection of Betti numbers, or store the average barcode length of a given homology dimension. This approach provides a topological representation of textual input and can be used to train a new type of textual classifier or to develop new interpretability methods for NLP.
The contributions of this paper are summarized as follows:

We reproduce the results of [betti_bert] but in a broader setting, with different choices of attention graph filtrations, types of persistent homology (ordinary and directed), and symmetry functions. In addition, we use different topological features (they considered Betti numbers; we consider persistence images).

We study UMAP projections of persistence diagrams obtained via the attention graphs and compare their distribution with the classification of attention maps proposed by [dark_secret_of_bert]. We could not observe any correlation, but we find that the cluster composition is stable across input sentences, indicating a specialization of the attention heads.

We propose a new method, inspired by Grad-CAM ([gradcam]), to rate attention heads based on the topological inputs. It works remarkably well, allowing us to prune the number of heads from 144 down to ten with no reduction in either classification performance or robustness against adversarial attacks. Moreover, the selected heads largely display attention patterns focusing on the [SEP] token, leading to a new perspective on the no-op hypothesis proposed in [what_does_bert_look_at].

Our research is the first study that considers adversarial attacks in relation to topology-based models for NLP tasks.
1 Related Work
Topological Data Analysis
Combining algebraic topology and machine learning has become a vast field of investigation in the past decade. To the best of our knowledge,
[introduction_persistent_homology_comp_science] was the first to use persistent homology in the context of NLP: the author differentiated child writing from teenager writing using persistence tools. This was followed by an increase in interest from the scientific community in TDA methods for NLP, including a successful attempt to predict the genre of a movie from its plot ([tda_movie_genre]), an application of persistent homology to depict textual entailment in legal processes ([tda_legal_entailment]), an unsuccessful TED-talk rating ([tda_tedtalk]), and detection of artificially generated text ([artificial_text_topo_att_maps]). Further demonstrating this interest, [acceptability_judgement_topo_att_maps] examined, independently and in parallel to our work, the linguistic acceptability of text using the topology of the attention graphs. They were able to enhance the performance of existing models for two standard practices in linguistics: binary judgments and linguistic minimal pairs. They also proposed a method for analyzing the linguistic functions of attention heads and interpreting the correspondence between the graph features and grammatical phenomena.
The challenge of combining a neural network with a topological layer is the differentiability of the overall objective function. The PersLay model ([perslay]) obtained excellent classification performance on real-life graph datasets such as social graphs or data from the medical and biological domains. This is similar to Persformer ([persformer]), which processes persistence diagrams without handcrafted features by using the self-attention mechanism, achieving state-of-the-art results on the standard ORBIT5K dataset.
BERT Model Multiple studies have shown that BERT is overparametrized. [dark_secret_of_bert] obtained an increase in model performance when disabling the attention in certain attention heads. [are_sixteen_heads_better_than_one] proposed a pruning method that disables 40% of the attention heads while keeping the accuracy high, and the authors showed that some layers can be reduced to one attention head for better performance. Even the original transformer architecture has been pruned from 48 heads down to 10 while maintaining the accuracy level ([analysing_multi_head_attention]). Interestingly, these remaining heads displayed specific interpretable linguistic functions. Such specialized heads have also been found in BERT by [what_does_bert_look_at], who present attention heads that attend to the direct objects of verbs, to the determiners of nouns, or to coreferent mentions.
[bert_plays_lottery] showed that BERT contains many subnetworks achieving performance comparable to the full model and, furthermore, that the worst-performing subnetwork remains highly trainable.
Another approach to illustrate the information contained in a pretrained model is the study of the transferability of the contextual representations stored inside the model. [linguistic_knowledge_and_transferability_of:contextual_representations] found that linear models trained on top of frozen contextual representations, such as those of the pretrained BERT model, are competitive with state-of-the-art task-specific models. This is related to our work, as we also extract information from a frozen BERT model and train a topological classifier on it.
2 Background
2.1 BERT Model
The multi-headed attention model BERT ([bert]) was one of the first models to use only the encoder part of the original transformer architecture ([transformer]) while obtaining state-of-the-art results on many NLP tasks. It is composed of 12 encoder layers connected in series, each containing 12 attention heads applied in parallel. We focus on the attention heads of BERT, as they are the part of the model containing the topological information we want to analyse. The input of an attention head is a matrix $X$ of size $n \times d$ (where $d = 768$) whose rows are vector representations of the $n$ tokens of the input sentence, and the output is a new representation of size $n \times d_k$ (with $d_k = 64$) following the formula:

$$\mathrm{Head}(X) = \underbrace{\mathrm{softmax}\!\left(\frac{X W_Q (X W_K)^\top}{\sqrt{d_k}}\right)}_{A} X W_V, \qquad (1)$$

where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$ are trainable matrices that project the token embeddings into a lower-dimensional vector space, and the softmax is taken over the second dimension. The matrix $A$ is an $n \times n$ matrix called the attention matrix. Its entries $a_{ij}$ can be interpreted as the attention given to the token at position $j$ when computing the new representation of the token at position $i$. They are non-negative, and the attention scores of a token sum to one, i.e., $\sum_{j=1}^{n} a_{ij} = 1$ for all $i$.
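As a concrete check of formula (1), one attention head can be sketched in NumPy; the dimensions (6 tokens, $d = 768$, $d_k = 64$) match BERT-base, while the input and weight values are random placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    """One attention head: X is (n, d); W_Q, W_K, W_V are (d, d_k)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (n, n) attention matrix
    return A @ V, A  # new token representations and the attention matrix

rng = np.random.default_rng(0)
n, d, d_k = 6, 768, 64                     # 6 tokens, BERT-base head sizes
X = rng.normal(size=(n, d))
W = [rng.normal(size=(d, d_k), scale=d ** -0.5) for _ in range(3)]
out, A = attention_head(X, *W)
assert A.shape == (n, n) and np.allclose(A.sum(axis=1), 1.0)  # rows sum to 1
```

The row-stochastic property of `A` checked in the last line is exactly the constraint $\sum_j a_{ij} = 1$ used throughout the paper.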
2.2 Attention Maps and Attention Graphs
Attention maps are representations of the attention matrices as pixel images: the higher $a_{ij}$ is, the darker the corresponding pixel. These maps were intensively studied in [dark_secret_of_bert] and divided into four classes: diagonal patterns, vertical patterns, diagonal and vertical patterns, and heterogeneous patterns (see Figure 1).
Another representation of an attention matrix is an attention graph. Given a head, we construct a weighted directed graph taking as vertices the tokens of the input sentence and connecting two tokens $t_i$ and $t_j$ with an edge from $t_i$ to $t_j$ of weight $a_{ij}$ and an edge in the opposite direction of weight $a_{ji}$. No further modification is needed to apply directed persistent homology. In the non-directed versions, this directed graph is transformed into a complete non-directed weighted graph on the set of tokens via a symmetry function $f$: the edge connecting tokens $t_i$ and $t_j$ is assigned the weight $1 - f(a_{ij}, a_{ji})$, so that the larger the weight of the edge connecting two vertices, the lower the symmetrized attention score between the two tokens. The transformation from attention map to attention graph is illustrated in Figure 3.
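A minimal sketch of this symmetrization, assuming the transformed edge weight is $1 - f(a_{ij}, a_{ji})$ and using the four symmetry functions considered later in the paper (maximum, minimum, multiplication, mean):

```python
import numpy as np

# Symmetry functions f(a_ij, a_ji) used to symmetrize the attention scores.
SYMMETRY = {
    "max": max,
    "min": min,
    "mul": lambda x, y: x * y,
    "mean": lambda x, y: (x + y) / 2,
}

def attention_graph(A, sym="max"):
    """Turn an attention matrix A (rows sum to 1) into a symmetric matrix of
    edge weights 1 - f(a_ij, a_ji): high mutual attention -> low weight."""
    f = SYMMETRY[sym]
    n = A.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = 1.0 - f(A[i, j], A[j, i])
    return D

A = np.array([[0.7, 0.3], [0.1, 0.9]])
assert np.isclose(attention_graph(A, "max")[0, 1], 0.7)  # 1 - max(0.3, 0.1)
```

The matrix `D` plays the role of a distance matrix: pairs of tokens that attend strongly to each other are close, which is the convention the filtrations below rely on.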
2.3 Persistent Homology
The attention graphs constructed from the attention heads contain the topological structure we are interested in. Topologically, a weighted graph and the corresponding unweighted graph are identical. To encode the topological information provided by the graph weights, we use filtrations of the graph. A filtration is a sequence of nested topological subspaces indexed either by a discrete set such as $\mathbb{N}$ or by a continuous real-valued parameter. Starting with an attention graph, we consider three types of filtrations: Ordinary, MultiDim, and Directed.

 Ordinary

We initiate the filtration with only the vertices of the graph. Then we add the edges one by one, in order of increasing weight, until we obtain the complete graph (see Figure 3): given two edges $e_1$ and $e_2$, if the weight of $e_1$ is smaller than the weight of $e_2$, then $e_1$ is added before $e_2$. This filtration is based on a real-valued parameter $t$ taking values in $[0, 1]$, and the filtration at stage $t$ is formed by all the vertices together with the edges of weight smaller than or equal to $t$.
 MultiDim

The Ordinary filtration can be seen as starting with 0-simplices and then adding 1-simplices to construct the graph. The 0-simplices can be thought of as points, the 1-simplices as edges, the 2-simplices as triangles, the 3-simplices as tetrahedrons, and so on. The edges of the MultiDim filtration have the same filtration values as in the Ordinary filtration, but we add a 2-simplex every time three edges form a triangle.
 Directed

We consider the directed version of the attention graph and again start the filtration with only the vertices. The idea is similar to the MultiDim filtration: we add the edges one by one, depending on their weights, and we add a 2-simplex if its boundary 1-simplices are present and do not form a directed cycle.
Filtrations are the topological interpretation of the edge weights and their directions: if the weights or the directions of the edges were different, the filtration would change accordingly.
Given a filtered simplicial complex, we can analyze it through persistent homology. The idea is to keep track of the appearance and disappearance of topological features along the filtration by computing the homology of each topological space encountered and keeping track of the maps induced by the inclusions. An introduction to the mathematical background of TDA can be found in [introduction_persistent_homology_comp_science]. One can think of 0-dimensional persistent features as connected components, 1-dimensional features as holes, and 2-dimensional features as cavities (2-dimensional holes). The birth time of a persistent feature is the filtration value at which the feature appears. For example, the birth time of all the 0-dimensional features of our graph filtrations is 0, and the birth time of a 1-dimensional feature is the filtration value of the edge completing a graph cycle. The death time is the filtration value at which the feature disappears. For a 0-dimensional feature, it corresponds to the weight of the edge connecting the corresponding vertex to the main connected component. If there are no 2-simplices in the filtration, a 1-dimensional feature never vanishes, and its death time is said to be equal to $+\infty$; for computational purposes, it is set to the maximal filtration value. The birth and death times of each feature are stored in a persistence diagram: a persistent feature is seen as a point in $\mathbb{R}^2$ with its birth time as $x$-coordinate and its death time as $y$-coordinate (see Figure 3). Hence a persistence diagram is a multiset of elements of $\mathbb{R}^2$ – multi because two or more persistent features may have the same birth and death times.
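For intuition, the 0-dimensional part of this computation reduces to a union-find pass over the edges sorted by weight, as in Kruskal's algorithm; this is a minimal stand-in for the persistence computations the paper delegates to the Gudhi and Giotto-tda libraries:

```python
import numpy as np

def h0_persistence(D):
    """0-dimensional persistence of the edge-weight filtration of a complete
    graph with distance matrix D: every vertex is born at 0, and a connected
    component dies at the weight of the edge merging it into another one."""
    n = D.shape[0]
    parent = list(range(n))

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted((D[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    bars = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                    # this edge merges two components
            parent[ri] = rj
            bars.append((0.0, w))       # one component dies at weight w
    bars.append((0.0, float("inf")))    # the last component never dies
    return bars
```

On a 3-token graph, the two finite bars end exactly at the weights of the two minimum-spanning-tree edges, and the infinite bar records the component that survives the whole filtration.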
2.4 Persistence Images
Incorporating persistence diagrams into a machine learning pipeline faces two main challenges. Firstly, although one can define distances on the space of persistence diagrams, such as the bottleneck distance or a Wasserstein-type distance (see Definition 5.1), the underlying data structure remains a multiset, not a vector. Secondly, the space of persistence diagrams cannot be isometrically embedded into a Hilbert space, and this remains true for any of the distances considered ([persistence_diagrams_not_hilbert_space], Theorem 4.3).
To overcome this issue, the TDA community has developed many methods to convert a collection of persistence diagrams into a collection of vectors in a Hilbert space. The method we consider uses persistence images ([persistence_images]) because they are simple to compute and interpret. Furthermore, they are compatible with convolutional neural networks.
A persistence diagram $D$ can be seen as a non-continuous map that counts the number of points of the diagram at the input location. This function can be decomposed into a finite sum of indicator functions $\mathbb{1}_p$, one per diagram point $p$, that return 1 if the input is $p$ and 0 otherwise. These indicator functions can be seen as probability density functions. To work with continuous functions, $\mathbb{1}_p$ is approximated by the 2-dimensional Gaussian distribution $g_p$ of mean $p$ and of variance $\sigma^2$, a hyperparameter that has to be chosen. We sum up all these continuous functions to obtain a continuous approximation of the diagram. For more flexibility and stability, a weight function $w \colon \mathbb{R}^2 \to \mathbb{R}$ is incorporated inside the sum to emphasize certain regions of the persistence diagram. The obtained function $\rho_D = \sum_{p \in D} w(p)\, g_p$ is called the persistence surface of the persistence diagram $D$.
The last step is to integrate the persistence surface over the cells of a grid with given horizontal and vertical boundaries. This grid defines the frame and resolution of the persistence image and has to be chosen (see Figure 2). The value of a pixel $c$ of the persistence image of the persistence diagram $D$ is given by

$$I(c) = \iint_{c} \rho_D(x, y)\, dx\, dy.$$

The values of the integrals over all pixels are stored in an image, called the persistence image of the persistence diagram.
For our application, we consider various image frames, image resolutions (see Table 12), and weight functions (see Table 13). An illustration for the Ordinary filtration can be found in Figure 3. The tables are in the appendix, with examples for all types of persistence images (see Table 14).
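A minimal persistence image can be sketched as follows. For simplicity the Gaussian is sampled at pixel centers rather than integrated over cells, points are mapped to (birth, persistence) coordinates, and the linear weight in the persistence coordinate is one common (assumed) choice:

```python
import numpy as np

def persistence_image(diagram, res=(50, 5), frame=((0, 1), (0, 1)),
                      sigma=0.05, weight=lambda b, p: p):
    """Sketch of a persistence image: diagram is a list of (birth, death)
    pairs with finite deaths. Each point is mapped to (birth, persistence),
    smoothed by an isotropic Gaussian of std sigma, weighted, and sampled
    on a res[0] x res[1] grid over the given frame."""
    (x0, x1), (y0, y1) = frame
    xs = np.linspace(x0, x1, res[0])          # birth axis
    ys = np.linspace(y0, y1, res[1])          # persistence axis
    X, Y = np.meshgrid(xs, ys)
    img = np.zeros_like(X)
    for b, d in diagram:
        p = d - b                             # persistence (lifespan)
        g = np.exp(-((X - b) ** 2 + (Y - p) ** 2) / (2 * sigma ** 2))
        img += weight(b, p) * g / (2 * np.pi * sigma ** 2)
    return img
```

With the default linear weight, points on the diagonal (persistence zero) contribute nothing, which is the usual stability-motivated choice in the persistence image literature.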





3 Methodology
We compare BERT against the topological model on numerous classification tasks. BERT is fine-tuned for a variable number of epochs. To apply the topological model, we first transform each sentence into a stack of persistence images. To do so, we feed a fine-tuned BERT model with the sentence and extract the attention graphs of each head. We then transform the attention graphs into persistence images. One attention graph generates a number of persistence images equal to the number of homology dimensions considered in the filtration (2 for Ordinary, 3 for the two others). For the Ordinary filtration, a sentence is thus transformed into 288 images, and for the two others into 432 images. The topological classifier receives as input a 4-dimensional tensor whose dimensions are the batch size, the number of persistence images per sentence, and the width and height of the images.
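The bookkeeping of image counts and input shapes can be sketched as follows, using the 50 × 5 resolution quoted later for the Ordinary case:

```python
import numpy as np

n_heads = 144                    # 12 layers x 12 heads in bert-base
dims = {"Ordinary": 2, "MultiDim": 3, "Directed": 3}  # homology dims per head
H, W = 50, 5                     # persistence-image resolution (Ordinary case)

for filt, d in dims.items():
    n_images = n_heads * d       # persistence images per sentence
    assert n_images in (288, 432)

# the 4-dimensional input tensor of the topological classifier:
batch = np.zeros((8, n_heads * dims["Ordinary"], H, W))
assert batch.shape == (8, 288, 50, 5)
```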
We use the Gudhi library ([gudhi]) to convert persistence diagrams into persistence images for the Ordinary and MultiDim filtrations, and the Giotto-tda library ([giottotda]) to manage the Directed filtration.
3.1 Data
We load the datasets from Hugging Face and we follow the data processing proposed by [betti_bert]. All dataset statistics are presented in Table 1.

 CoLA

The Corpus of Linguistic Acceptability ([CoLA]) is part of the GLUE benchmark ([GLUE]). The task is to detect whether a sentence is grammatically correct (class 1) or not (class 0). We consider the public part of the dataset, which contains labels, and disregard the hidden part. We use the original validation set for prediction, and we split the train set into train and validation subsets.
 IMDB

Large Movie Review Dataset v1.0 ([IMDB]) consists of movie reviews labeled with two sentiments, "positive" (1) or "negative" (0). It contains 50,000 reviews. We first divide the dataset into two equal-sized subsets, one for training and the other for testing and validation. To obtain attention graphs of manageable sizes, we consider only sentences of length at most 128 after tokenization with the standard BERT uncased tokenizer. We then divide the second subset into validation and prediction datasets.
 SPAM

The SMS Spam Collection v.1 ([SPAM]) contains text messages, and the task is to determine whether they are spam (1) or ham (0). It contains 5,574 messages that we divide into train, validation, and prediction subsets.
 SST2

The binary version of the Stanford Sentiment Treebank ([SST2]) is also part of the GLUE benchmark and consists of parts of movie reviews labeled with two sentiments, "positive" (1) and "negative" (0). As for the CoLA dataset, we split the original train set into two subsets and use the original validation set for prediction.
                      CoLA   IMDB   SPAM   SST2
# Train sent.         7695   2675   4459   60614
# Validation sent.     856   1414    557    6735
# Prediction sent.    1043   1414    558     872
%                       70     55     14     51
Mean sent. size         11     88     25     13
Max sent. size          47    128    238     64
3.2 Model
We run the experiments with the "bert-base-uncased" model from the HuggingFace library ([huggingface_transformers]) for the BERT baselines. For exploratory purposes, we consider numerous variations:

 Finetuning

We fine-tuned BERT for 4, 10, and 20 epochs, following the procedure proposed in https://github.com/MohamedAteya/BERTFineTuningSentenceClassificationforCoLA, to study the importance of fine-tuning with respect to the performance of the topological classifiers. This gives us three BERT models per task, and each is used to transform sentences into persistence images.
 Symmetry function

The choice of the symmetry function is a crucial step to get from the attention matrices to the non-directed attention graphs. To explore its relevance with respect to the performance of the topological classifier, we use four different symmetry functions: maximum, minimum, multiplication, and mean.
 Filtration

We also consider the three types of filtrations described in Section 2.3. Unfortunately, we are limited in terms of computation time and power, as the cost can reach excessively high values depending on the length of the sentence to transform (see Figure 4). Hence, for the MultiDim filtration we consider only the maximum as symmetry function, and the Directed filtration is only applied to the CoLA dataset.
In total, one fine-tuned BERT model generates 5 or 6 persistence image datasets (PI-datasets). We refer to a PI-dataset by the fine-tuned BERT model producing it (name of the dataset and number of fine-tuning epochs), the type of filtration, and the symmetry function, e.g., "IMDB, 10 epochs, MultiDim, max".
Our topological classifiers are convolutional neural networks taking as input the stack of persistence images of one sentence. The architecture is identified by running a hyperparameter search with the Optuna library ([optuna]) for 500 trials, tuning the number of convolutional and fully connected layers, the learning rate, the optimizer, and the dropout rate. Each hyperparameter search is done on one dataset (CoLA, IMDB, SPAM, or SST2), and we say that the model designed from its result is specific to this dataset. We run hyperparameter searches for each classification task and type of filtration. The architectures of all the topological classifiers are presented in Table 15.
We then investigate the performance of a topological classifier specific to one dataset when evaluated on the other datasets. To do so, for each PIdataset, we evaluate the performance of two topological models: one whose hyperparameters are optimized on the current dataset, the other one specific to the CoLA dataset. We refer to the first model as the specific model and to the second as the general model.
The inference time of our combination of the BERT model and the topological classifier is greater than the inference time of the BERT model itself. For example, when using the Ordinary filtration, the time needed to transform a tokenized sentence into persistence images and predict its class is two times greater than the BERT prediction time. It goes up to 20 times for the MultiDim filtration and 70 times for the Directed filtration. The bottleneck is the computation of the persistence images from the persistence diagrams, which is implemented only on CPU, not on GPU.
3.3 Hardware
All the computations are done on a virtual machine from the Google Cloud Platform (https://cloud.google.com/). We work on an "n1-standard-8" machine with 8 CPUs, 30 GB of RAM, and an NVIDIA Tesla T4 GPU with 16 GB of VRAM.
4 Results of the topological classifiers
The performance obtained by the model whose hyperparameters are optimized on the CoLA dataset and by the specific models is outlined in Table 2 and Table 3. As performance measures, we choose the accuracy on the prediction set and the Matthews correlation coefficient, which is a good metric for unbalanced datasets.
The general topological classifier outperforms BERT on the CoLA dataset and obtains similar performance on the other datasets. This suggests that the persistence images contain as much syntactic information as the encoding provided by BERT. The biggest increase in performance is generally obtained by the MultiDim filtration. It seems that the more topological information is provided to the model, the better it performs. However, the enhancement is not commensurate with the computational cost of the persistent homology of the MultiDim filtration. The model based on the Directed filtration performs only as well as the Ordinary one. One explanation could be that the persistence images produced by the Giotto-tda library have different ranges, making it difficult for the CNN to compare them.
There is no symmetry function that works best in all cases. The mean is the best choice on the CoLA dataset. The multiplication performs less well on IMDB, but better on the three others. In general, all symmetry functions perform similarly on a given task; hence this choice is not crucial for the overall performance of the topological classifier. Interestingly, there is also no significant difference between the results obtained by the specific models and the general model. In some cases, when applied to another dataset, the general model outperforms the specific model. This observation suggests that we could use one topological classifier for multi-task learning with a performance similar to BERT.
Lastly, there is an overall tendency for a performance boost when we increase the number of BERT fine-tuning epochs. This is the case on each task, for any filtration and any symmetry function, even when the BERT model overfits by training for more epochs (SST2 and IMDB).
5 UMAP Description
It is challenging to understand how BERT learns to solve a task, and many attempts have been published ([attention_is_not_explanation, what_does_bert_look_at, bertology]). Persistence diagrams of the attention graphs offer a new perspective to look at attention heads.
To do so, we use the UMAP (Uniform Manifold Approximation and Projection) library ([umap]), which projects a high-dimensional point cloud onto the two-dimensional plane. The data we use consist of the persistence diagrams of the 144 attention heads corresponding to one sentence. To give a graph structure to the set of data points, we compute the Wasserstein distance between each pair of diagrams.
Definition 5.1 ([computational_topology_for_data_analysis], Definition 3.9 and 3.10).
Let $q \geq 1$ be fixed and let $D_1$ and $D_2$ be two finite persistence diagrams of the same homology dimension. Let $D_1'$ and $D_2'$ be the diagrams obtained from $D_1$ and $D_2$ by adding all the points of the diagonal with infinite multiplicity. We define the Wasserstein distance between $D_1$ and $D_2$ by

$$W_q(D_1, D_2) = \left( \inf_{\phi \in \Phi} \sum_{p \in D_1'} \lVert p - \phi(p) \rVert_\infty^q \right)^{1/q},$$

where $\Phi$ is the set of all bijections between $D_1'$ and $D_2'$.
From the pairwise distances between the diagrams, UMAP constructs a graph as follows: it connects each point to a fixed number of closest neighbors, then projects this graph onto the two-dimensional plane. The number of neighbors to connect to has to be chosen manually: the higher it is, the more global information is retrieved; the lower it is, the more local information is displayed in the final projection. Figure 5 shows examples of UMAP projections.
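For finite diagrams, the distance of Definition 5.1 can be computed as an optimal assignment problem by augmenting each diagram with diagonal slots for the other's points; a sketch using SciPy's assignment solver (the infimum over bijections reduces to a minimum-cost matching):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def wasserstein(d1, d2, q=2):
    """q-Wasserstein distance between two finite persistence diagrams
    (lists of (birth, death) pairs), allowing points to be matched to the
    diagonal, via a minimum-cost assignment on an augmented cost matrix."""
    d1 = np.asarray(d1, float).reshape(-1, 2)
    d2 = np.asarray(d2, float).reshape(-1, 2)
    n, m = len(d1), len(d2)
    # inf-norm cost of sending a point to its nearest diagonal point
    diag1 = (d1[:, 1] - d1[:, 0]) / 2
    diag2 = (d2[:, 1] - d2[:, 0]) / 2
    C = np.zeros((n + m, n + m))
    for i in range(n):
        for j in range(m):
            C[i, j] = np.max(np.abs(d1[i] - d2[j])) ** q  # point-to-point
    for i in range(n):
        C[i, m:] = diag1[i] ** q      # d1 point -> diagonal
    for j in range(m):
        C[n:, j] = diag2[j] ** q      # d2 point -> diagonal
    rows, cols = linear_sum_assignment(C)     # diagonal-to-diagonal costs 0
    return C[rows, cols].sum() ** (1.0 / q)

assert np.isclose(wasserstein([(0, 1)], [(0.1, 1.1)]), 0.1)
```

The resulting matrix of pairwise distances between the 144 diagrams can then be passed to `umap.UMAP(metric="precomputed")` to obtain the projections discussed in this section.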
UMAP provides us with information about the similarity between persistence diagrams. The main observation is the formation of clusters, indicating groups of diagrams that look similar to each other and different from the diagrams outside the cluster.
The primary motivation for using UMAP projections of persistence diagrams is to give another perspective on the self-attention patterns depicted in [dark_secret_of_bert]: we want to see whether the persistence diagrams are clustered in a similar way. We consider sentences of the CoLA dataset, and we plot the UMAP for each category of diagrams: one per type of filtration (Ordinary, MultiDim, or Directed) and per homology dimension (0 and 1 for Ordinary; 0, 1, and 2 for MultiDim and Directed). We manually identified heads with the pattern type of their self-attention maps for each sentence. Then we display the UMAP projections, coloring the dots with respect to the pattern: yellow for diagonal, red for vertical, blue for diagonal and vertical, and green for heterogeneous.
In Figures 5, 6, 12, 13, 14, and 15, the distribution of the persistence diagrams is not correlated with the distribution of the attention map classes. The colors are spread evenly across all the points, with no clear monochromatic area. In some projection maps, a color gradient between green and red can be observed, but with no clear distinction between the two groups. Nevertheless, some cluster formations are shared across sentences. In the next subsections, we explore the cluster compositions and compare them across sentences. We observe similar patterns for a large range of sentences, but only present the plots of three sentences for illustrative purposes. The three sentences are:

“Our friends won’t buy this analysis, let alone the next one we propose.” (1)

“I know a boy mad at John.” (1)

“Mary has more friends than two.” (0)
We only look at the Ordinary filtration here. The UMAP projection for the MultiDim and the Directed filtrations can be found in Appendix A and Appendix B.
Figure 5 shows the UMAP projections of the 0-dimensional diagrams for the three considered sentences.
A small cluster is present for the first two sentences. The overall shape and the color distribution are similar across sentences, and we observed this pattern for many more sentences. Looking at the persistence images of each head, no clear pattern distinguishes the diagrams inside the cluster from those outside. But surprisingly, the cluster is formed on average by the same heads. For example, the cluster for the second sentence contains 20 of the 23 heads of the cluster from sentence 1. The circled area for sentence 3 contains all the elements of this cluster of size 23.
We observe no common cluster for the first homological dimension, but the UMAP projections tend to separate the green and red points.
5.1 Discussion
The classes proposed by [dark_secret_of_bert] are not observed through the topological lens in these three sentences. However, the UMAP projections display clusters that appear for each sentence and whose composition is shared across the examples. For persistence diagrams of homology dimension zero, there is a small dense cluster shared across sentences, but we could not find a pattern shared by the diagrams inside it. For homology dimension one, there is a cluster whose diagrams contain mostly short-lifespan features, i.e., persistence features with death time equal or very close to the birth time. And for homology dimension two, there is a cluster whose diagrams present almost only features with a high birth time. These specific clusters vary in size, but their composition is shared across sentences: on average, more than 90% of the smallest specific cluster is shared with the clusters of the other sentences. We made the same observation for numerous sentences of the CoLA dataset, leading to the claim that the attention heads of the fine-tuned BERT model specialize in searching for specific information. This claim was acknowledged in [bertology], and we provide a new approach to back it up.
6 Pruning the heads
Instead of exploring the structure of the persistence diagrams, we investigate which part of the input is most relevant for the model to predict the class of the sentence. To do so, we develop a method inspired by Grad-CAM ([gradcam]).
Grad-CAM is used to help understand the decisions made by deep learning models: given an image, it produces a heatmap that shows how the model makes its decision for that image. In our case, as the input is composed of either 288 (twice the number of heads) or 432 (three times the number of heads) images, we cannot directly apply the method proposed by [gradcam]. Instead, we compute the gradient of the output logit with respect to the input images. This yields a tensor of the same shape as the input (for example, [288, 50, 5] in the Ordinary case). Then we average the absolute value of the gradient over each channel to obtain a number representing the influence of each individual image on the model output. Finally, we take the mean of the values corresponding to images coming from the same attention head (two images in the case of the Ordinary filtration, three for the others). We end up with a score for each of the 144 heads of BERT for one input sentence. We perform this procedure for a large number of sentences and average the obtained scores.
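The scoring step can be sketched as follows, assuming the gradient tensor has already been obtained (e.g. with PyTorch autograd); the shapes follow the Ordinary case, where consecutive pairs of channels belong to the same head:

```python
import numpy as np

def head_scores(grad, n_heads=144, per_head=2):
    """Relevance score per attention head from the gradient of the output
    logit w.r.t. the input persistence images.
    grad: (n_heads * per_head, H, W) for one sentence, e.g. (288, 50, 5)."""
    g = np.abs(grad).mean(axis=(1, 2))                # influence of each image
    return g.reshape(n_heads, per_head).mean(axis=1)  # average per head

# Scores are averaged over many sentences before ranking the heads.
grad = np.zeros((288, 50, 5))
grad[10] = 1.0            # channels 10 and 11 belong to head 10 // 2 = 5
scores = head_scores(grad)
assert scores.argmax() == 5 and np.isclose(scores[5], 0.5)
```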
Our hypothesis is that the higher the score of a head, the more relevant it is to the topological model, and the more information it contains about the sentence structure with respect to the current task. Figure 7 displays the 30 best attention heads of the topological model trained on the PI-dataset "CoLA, 4 Epochs, Ordinary", for various numbers of sentences considered and different symmetry functions.
The heads with the highest scores are independent of the number of sentences used, as there is almost no difference between Figures 6(a), 6(b) and 6(c), Figures 6(d), 6(e) and 6(f), and Figures 6(g), 6(h) and 6(i). We also observed that the best-performing heads are almost independent of the symmetry function considered. In general, the best heads are located in the deep layers of BERT. Hence, the heads of BERT located in the later layers are both the ones that change the most during fine-tuning ([dark_secret_of_bert]) and the most relevant for the topological classifier.
6.1 Experiments
We design the following experiment to determine whether these high-scoring heads contain most of the information necessary for our topological classifier to perform well. First, we determine the heads with the highest scores: we train a model on a selected PI-dataset, apply our rating procedure to it, and keep the $k$ heads with the highest scores (for $k$ = 70, 50, 30, 10, 5, 3, 2, and finally the single best head). Then, we train another model on a PI-dataset restricted to the persistence images related to the $k$ highest-scoring heads. When considering the 70 best heads, we discard the persistence images from the other 74 heads; in the case of the Ordinary filtration, the input of shape [288, 50, 5] is pruned to the shape [140, 50, 5], as each head produces two persistence images (one for the 0-dimensional features, one for the 1-dimensional features). Table 4 presents the performance obtained from such a pruning. The base PI-dataset considered is "CoLA, 4 Epochs, Ordinary, max"; the other columns are variations of the base PI-dataset obtained by changing either the symmetry function or the number of fine-tuning epochs. In Table 4, the heads are rated on the base PI-dataset while the model is trained on the PI-dataset identified by the column; in Table 5, the rating and the training both use the PI-dataset identified by the column.
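The channel pruning described above can be sketched as follows (shapes as in the Ordinary case, where each head contributes two consecutive channels):

```python
import numpy as np

def prune_images(images, scores, k, per_head=2):
    """Keep only the persistence images produced by the k highest-scoring
    heads. images: (n_sentences, n_heads * per_head, H, W); scores: one
    relevance value per head."""
    top = np.sort(np.argsort(scores)[::-1][:k])   # kept head indices, sorted
    channels = np.concatenate([per_head * h + np.arange(per_head) for h in top])
    return images[:, channels]

imgs = np.zeros((4, 288, 50, 5))
pruned = prune_images(imgs, np.random.rand(144), k=70)
assert pruned.shape == (4, 140, 50, 5)   # 70 heads x 2 images, as in the text
```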
CoLA 4 Epochs Ordinary Max  CoLA 4 Epochs Ordinary Min  CoLA 4 Epochs Ordinary Mean  CoLA 10 Epochs Ordinary Max  CoLA 20 Epochs Ordinary Max  

BERT  
144 Heads  
70 Heads  
50 Heads  
30 Heads  
10 Heads  
5 Heads  
3 Heads  
2 Heads  
1 Head 
CoLA 4 Epochs Ordinary Max  CoLA 4 Epochs Ordinary Min  CoLA 4 Epochs Ordinary Mean  CoLA 10 Epochs Ordinary Max  CoLA 20 Epochs Ordinary Max  

BERT  
144 Heads  
70 Heads  
50 Heads  
30 Heads  
10 Heads  
5 Heads  
3 Heads  
2 Heads  
1 Head 
The performance of the topological classifier trained on the base PI-dataset does not decrease as the number of input images decreases. It even increases, outperforming the 144-head model by up to 2% in accuracy on the prediction set. Astonishingly, with 2 heads, our topological classifier outperforms BERT by 1.5% in accuracy. The model receives only 4 of the initial 288 images, and its accuracy remains very high. It diminishes when considering only one head, but is still close to 80%.
The same trend can be observed for all the other PI-datasets: an increasing or constant accuracy as the number of considered heads decreases from 70 to 10. However, the accuracy drops when fewer than ten heads are considered. In all the models, decreasing the number of input images increases the performance of our topological classifier, down to a certain minimal number of heads. Choosing a model-specific rating of heads changes neither the trend nor the values of the results. Hence, the heads containing the most relevant information with respect to our topological classifier are consistent across different symmetry functions and numbers of fine-tuning epochs.
We do not observe the same phenomenon across different datasets: there are no high-scoring heads shared across tasks (see Tables 8 and 9 in the Appendix). For the pruning to be efficient, the head scores have to be determined specifically for each dataset.
We also consider the effect of image pruning for the other filtrations (see Table 10 in the Appendix) and observe that they also gain in performance from it. Interestingly, to increase the performance of the topological classifier, one should prefer removing some well-chosen input images over considering more complex and computation-demanding filtrations.
The highest-rated heads may not be the only ones from which the model retrieves valuable information. To explore this, we trained the model while keeping the images coming from all the attention heads except the highest-rated ones (see Table 11 in the Appendix). We conclude that the high-scoring heads are not necessary: even without them, the topological classifier obtains a performance similar to the non-pruning case.
These results support the claim of [bert_plays_lottery] that one can find a good submodel inside BERT even when it is highly pruned. In [are_sixteen_heads_better_than_one], some layers were pruned to a single head with no effect on performance. With our procedure, we could reduce the number of heads to five without a loss of performance.
We further investigate how our pruning procedure behaves across different fine-tuned BERT models in Appendix C.
6.2 Discussion
From all these experiments, a clear observation arises: specific heads are highly relevant for our model to perform comparably to BERT or even to outperform it. To investigate what these heads look like, we plot the attention maps ([dark_secret_of_bert]) for the three best heads of our base PI-dataset (“CoLA, 4 Epochs, Ordinary, max”) for ten sentences in class 1 (grammatically correct) and ten sentences in class 0 (grammatically incorrect).
Almost all the attention maps have high values in the column corresponding to the [SEP] token. This means that, to encode the sentence, the head mainly considers the current vectorization of [SEP] to compute the new representation of each token. This suggests that this pattern contains sufficient information for the topological classifier to perform well. In [what_does_bert_look_at], this peculiar pattern on [SEP] is interpreted as a no-op function: the default mode a head enters when it cannot apply its specific function. For example, the authors of [what_does_bert_look_at] found a head specialized in verb-subject recognition; this head puts all its attention on [SEP] if the input word is not a verb. Our observation suggests that attention to [SEP] carries useful information and is not only a way for the head to perform no operation.
Persistence images are of great help in deducing how full attention on [SEP] can be used to classify sentences. Figure 9 plots the persistence images corresponding to the above attention maps.
These images correspond to attention almost exclusively on the token [SEP]. Each bar in the persistence image represents the filtration value at which a token is connected to the [SEP] vertex. Sparse diagrams come from filtrations where the vertices are connected to the [SEP] vertex at widely spread filtration values; packed diagrams come from filtrations where the vertices are connected to [SEP] within a narrow range of filtration values. The valuable information of the full-attention-to-[SEP] pattern might thus lie in when each token connects to the [SEP] vertex in the attention graph, which is easily read off by the topological classifiers.
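Under this star-graph reading, the 0-dimensional barcode can be sketched directly. Assuming every vertex is present at filtration value 0 and each token's only edge goes to [SEP], a token's connected component dies exactly when that edge appears. The helper below is an illustrative simplification, not the paper's pipeline:

```python
def star_barcode(sep_edge_values):
    """0-dimensional bars of a star filtration around [SEP].

    sep_edge_values: filtration value at which each token's edge to the
    [SEP] vertex appears. Each token contributes one bar (0, edge_value),
    since its component merges into [SEP]'s component at that value.
    """
    return sorted((0.0, v) for v in sep_edge_values)

# Packed diagram: tokens join [SEP] within a narrow range of filtration values.
packed = star_barcode([0.30, 0.31, 0.32, 0.33])
# Sparse diagram: tokens join [SEP] at widely spread filtration values.
sparse = star_barcode([0.05, 0.40, 0.75, 0.95])
print(packed)
```

The spread of the death times is exactly what distinguishes the sparse and packed diagrams described above.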
But where does the model look when it processes persistence images? The regions most relevant to the model appear when the gradient is visualized, as in Figure 10. The darker the red, the more the area pushes the model towards class 0; the darker the blue, the more it pushes towards class 1. The white areas are not considered relevant by the model for the output classification.
The model behaves differently for each sentence label. If the sentence is in class 0, any positive pixel value will decrease the model output, thus increasing the probability of class 0. This is independent of the death time of the feature; hence only its existence is relevant. If the sentence belongs to class 1, then again, any point in the image will push the output probability towards class 1, except for some particular regions, generally situated in a specific range of filtration values, where the influence is inverted. Persistence features of dimension 0 that die in this filtration range push the model output towards class 0. A 0-dimensional feature dies when it gets connected to the main connected component; hence, if a vertex has its lowest-valued edge in this specific filtration range, it will push the model towards class 0. From the perspective of the CoLA dataset, a sentence containing such a token has a higher chance of being predicted as grammatically incorrect.
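The gradient visualization can be sketched as follows. This is a minimal illustration with a linear stand-in classifier (the paper's model is a neural network, but the principle is the same): the sign of the gradient of the output with respect to each pixel says whether that pixel pushes the prediction towards class 1 (positive, "blue") or class 0 (negative, "red"). All names, shapes, and the random data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
image = rng.random((50, 5))           # one persistence image
w = rng.standard_normal((50, 5))      # stand-in weights of a trained model

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gradient(image, w):
    """Exact gradient of sigmoid(<w, image>) with respect to every pixel."""
    s = sigmoid(float(np.sum(w * image)))
    return s * (1.0 - s) * w          # chain rule: sigmoid'(z) * dz/dpixel

grad = input_gradient(image, w)
saliency_sign = np.sign(grad)         # +1 pushes toward class 1, -1 toward class 0
print(grad.shape)
```

For a deep classifier the gradient would be obtained by backpropagation rather than in closed form, but the sign map is read the same way.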
7 Adversarial Attacks
The transformation of attention head activations into persistence images, despite being computationally demanding, could increase the robustness of our model. On the other hand, pruning the number of heads considered for the input of the model might diminish the stability of the classifier. To explore both concerns, we confront our topological model with adversarial attacks. We used TextAttack ([textattack]) to generate one hundred attacks for the SST2 dataset. After removing the skipped and failed attempts, 89 successful attacks on BERT remain. We then apply the topological classifier “SST2, 4 Epochs, Ordinary, max” to each sentence before and after the changes made by the attacks, with various numbers of heads considered, determined by the pruning method presented in Section 6.
We consider SST2 rather than the CoLA dataset because the attacks generated by TextAttack mostly transform a grammatically correct sentence into a grammatically incorrect one, and the attack is considered a success even if the model detects the grammatical mistake. We looked at the 89 sentences in SST2 that were initially correctly classified by the BERT model.
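The tallies in Table 6 can be sketched as follows. The reading of the “#” column as the number of attacked sentences the topological classifier already classified correctly before the attack is an assumption, and all names and the toy data are illustrative:

```python
def tally(true_labels, preds_before, preds_after):
    """Count attacks avoided by the topological classifier.

    All sentences fooled BERT. An attack on sentence i is counted as
    avoided when the classifier still predicts the true label on the
    perturbed sentence; the second value is how many sentences the
    classifier classified correctly before the attack (assumed to be
    the '#' column of Table 6).
    """
    avoided = sum(t == a for t, a in zip(true_labels, preds_after))
    correct_before = sum(t == b for t, b in zip(true_labels, preds_before))
    return avoided, correct_before

# Toy example with 4 attacked sentences.
avoided, n_correct_before = tally([1, 0, 1, 1], [1, 0, 1, 0], [1, 0, 0, 1])
print(avoided, n_correct_before)  # 3 3
```

“Avoided common attacks” would then restrict the first count to the sentences that were correctly classified before the attack.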
Avoided Attacks  Avoided Common Attacks  #  
144 heads  46  (52%)  40  (45%)  83 
70 heads  45  (50%)  40  (48%)  84 
50 heads  42  (47%)  36  (43%)  83 
30 heads  39  (44%)  35  (42%)  84 
10 heads  39  (44%)  33  (40%)  83 
5 heads  40  (45%)  36  (42%)  85 
3 heads  52  (58%)  45  (56%)  81 
2 heads  45  (51%)  36  (46%)  79 
1 head  45  (51%)  36  (45%)  80 
Table 6 shows that the topological model is much more stable than the BERT model. Only about half of the attacks that succeed on BERT also fool the topological classifier, which is surprising since the attention graphs come from the fooled BERT model. Moreover, the stability of our model does not decrease with the number of considered heads. Even with persistence images coming from fewer than 5 heads, about half of the attacks are still avoided. This suggests that the robustness of the classifier based on persistent homology is not due to the large number of input images.
Furthermore, the robustness is not due to stability of the persistence images themselves against adversarial attacks. Figure 11 displays the perturbation of the persistence images before and after the first 10 attacks. The squares represent the attention heads sorted by layers, with one pixel per head. The darker the pixel, the larger the Euclidean distance between the images generated from that head. We consider both the persistence images of dimension 0 and dimension 1 by summing their differences.
Looking at the perturbation value of one head across different attacks, we notice that it highly depends on the attack. For example, the perturbation value of the first head in the first layer (upper left corner) varies by up to a factor of 100 between attacks. In general, the images do change between before and after the attack. But even when the images change, the model’s output remains constant.
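The per-head perturbation of Figure 11 can be sketched as follows, assuming 144 heads arranged as 12 layers of 12 heads and two persistence images per head (dimensions 0 and 1); the shapes, names, and random stand-in data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
# Persistence images per head: axis 1 holds homology dimensions 0 and 1.
before = rng.random((144, 2, 50, 5))
after = rng.random((144, 2, 50, 5))

def perturbation_grid(before, after):
    """Euclidean distance per head and homology dimension, summed over
    dimensions, arranged as a 12 x 12 grid (layers x heads)."""
    dist = np.linalg.norm((before - after).reshape(144, 2, -1), axis=2)
    return dist.sum(axis=1).reshape(12, 12)

grid = perturbation_grid(before, after)
print(grid.shape)  # (12, 12)
```

One darker pixel in the resulting grid then marks one head whose images moved far under the attack.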
There is no canonical way to analyze the perturbations in the attention maps. The sentence length can vary before and after the attack; therefore, the dimensions of the attention maps can also vary. Hence, to measure the Euclidean distance, for example, one must first resize one of the attention maps to match the dimensions of the other. Shrinking the larger one is unsuitable, as it removes the [SEP] column and drastically changes the attention graph’s structure; the same goes for padding the smaller one. This illustrates one advantage that persistence images bring to interpretability methods: they allow one to compare the model’s behavior on two sentences of different lengths.
It is worth noticing that, in general, the attacks do not produce correct or understandable movie reviews for SST2 (see Table 7). The topological classifier seems to care less about the overall meaning of the sentence. It might represent the input more abstractly and hence provide a stability against meaning-switching words that is valuable for classification.
Before Attack  After Attack 
It’s a charming and often affecting journey.  It ’s a cutie and often afflicts journey. 
Unflinchingly bleak and desperate  Unflinchingly eerie and desperate 
Allows us to hope that Nolan is poised to embark a major career as a commercial yet inventive filmmaker.  Allows ourselves to hope that Nolan is poised to embarked a severe career as a commercial yet novelty superintendent. 
The acting, costumes, music, cinematography and sound are all astounding given the production’s austere locales.  The acting, costumes, music, cinematography and sound are all breathless given the production’s austere locales. 
It’s slow – very, very slow.  It’s slow – pretty, perfectly lent. 
The topological model presents greater stability than BERT. Furthermore, this stability is maintained regardless of the number of heads considered for the images and of the impact of the attacks on the images. The stability must therefore be intrinsic to the classification itself.
This robustness and high classification performance make our topological model more suitable than BERT when consistency and stability are needed.
8 Conclusion
In this work, we proposed numerous experiments on persistent homology applied to text classification. The model we present outperforms the BERT baselines and has higher robustness against adversarial attacks. We presented a new perspective on the specialization of BERT’s attention heads using persistence diagrams, and also developed a new BERT attention head scoring technique.
Our most surprising finding is the efficiency of our proposed ratings, allowing us to consider only ten attention heads out of 144 with no reduction in accuracy on the test dataset or in stability. Although attention to the [SEP] token was assumed to have a no-op behavior ([what_does_bert_look_at]), a majority of the best-scoring heads showcase this pattern, suggesting that, through the lens of TDA, the attention to [SEP] carries valuable information for the classification task.
One possible direction for future research is to extend the tools from TDA to other types of NLP tasks. We recommend using ordinary persistent homology up to the first dimension to limit the computational complexity, and using more powerful vector representations than persistence images, such as the ones computed by the Persformer ([persformer]). We also propose applying our rating approach to identify the most relevant heads and prune the others, which could increase performance. Lastly, we suggest training a specific classifier to detect adversarial textual attacks from the topology of the attention graphs.
9 Acknowledgements
This work was supported by the Swiss Innovation Agency (Innosuisse project 41665.1 IPICT).
We thank Matthias Kemper for helpful discussions and constructive comments on the paper.
Appendix
A MultiDim UMAPs
For this filtration, there is a clear cluster in the UMAPs of the 1-dimensional diagrams, observed across the three sentences.
This time, they represent diagrams that contain almost only points on the diagonal. Those points represent 1-dimensional persistence features that vanish at the moment they are born. In other terms, these points represent a “triangular” cycle: a cycle formed by three edges only. When the third edge is added, a 2-cell is also added and fills the inside of the triangle, making the hole of the cycle disappear. As previously observed, the clusters across the sentences share globally the same heads.
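That a triangular cycle has zero persistence in a clique filtration can be checked with a toy computation (an illustration, not the paper's code): the cycle is born when its last edge appears, and the clique filtration inserts the filling 2-cell at exactly the same filtration value, so the point lands on the diagonal:

```python
def triangle_cycle_bar(edge_values):
    """Birth and death of the 1-cycle of a single triangle in a clique
    (flag) filtration, given the filtration values of its three edges."""
    birth = max(edge_values)   # cycle exists once all three edges are present
    death = max(edge_values)   # the 2-cell enters at the same value and fills it
    return birth, death

b, d = triangle_cycle_bar([0.2, 0.5, 0.7])
print(b == d)  # True: the feature dies at birth, a diagonal point
```

Longer cycles, by contrast, are born before the simplices that fill them, which is why only the three-edge cycles pile up on the diagonal.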
For the second homological dimension diagrams, a similar pattern is observed.
The diagrams inside the cluster differ from those outside in that the former have only 2-dimensional holes with a high birth time, while the latter have 2-dimensional holes with varying birth times. When transformed into persistence images, the diagrams outside the cluster display a richer variety of patterns, whereas the images of the diagrams inside the cluster are all similar: high-value pixels in the top right corner and small values everywhere else. Again, the clusters across sentences share a similar composition.
B Directed UMAPs
We observe similar clusters as in the MultiDim filtration case.
Their meaning is identical: diagrams with almost only diagonal points for the 1-dimensional features and top-right points for the 2-dimensional features. Again, the composition of the clusters is similar across sentences.
C Pruning heads across models and datasets
We further investigate how our pruning procedure behaves across different datasets and across different fine-tuned BERT models.
CoLA 4 Epochs Ordinary Max  IMDB 4 Epochs Ordinary Max  SPAM 4 Epochs Ordinary Max  SST2 4 Epochs Ordinary Max  

BERT  
144 Heads  
70 Heads  
50 Heads  
30 Heads  
10 Heads  
5 Heads  
3 Heads  
2 Heads  
1 Head 
CoLA 4 Epochs Ordinary Max  IMDB 4 Epochs Ordinary Max  SPAM 4 Epochs Ordinary Max  SST2 4 Epochs Ordinary Max  

BERT  
144 Heads  
70 Heads  
50 Heads  
30 Heads  
10 Heads  
5 Heads  
3 Heads  
2 Heads  
1 Head 
For the head scores from CoLA, the performance of the model on the other tasks decreases as the number of considered images decreases. But when the head scores are determined for each PI-dataset, the performance remains constant as long as at least 10 heads are considered, and decreases slowly below that. For the SPAM dataset, we even observe a perfect score of 100% accuracy when considering 30 heads. The boosting effect of pruning images is most significant for the CoLA dataset.
CoLA 4 Epochs Ordinary Max  CoLA 4 Epochs MultiDim Max  CoLA 4 Epochs Directed Max  

BERT  
144 Heads  
70 Heads  
50 Heads  
30 Heads  
10 Heads  
5 Heads  
3 Heads  
2 Heads  
1 Head 
Pruning the images is beneficial for performance, with a more significant effect for the Ordinary and Directed filtrations. Without pruning, the MultiDim filtration outperforms the others, but ordinary persistence combined with pruning reaches the same peak performance of 82% in accuracy. To increase the performance of the topological classifier, one should thus prefer removing some well-chosen input images over considering more complex and computation-demanding filtrations.
Table 11 shows the results for different symmetry functions and different numbers of fine-tuning epochs. Here, the line “10 Heads” corresponds to the performance obtained by a model trained on images from 134 attention heads (we removed the images from the 10 best heads).
CoLA 4 Epochs Ordinary Max  CoLA 4 Epochs Ordinary Min  CoLA 4 Epochs Ordinary Mean  CoLA 10 Epochs Ordinary Max  CoLA 20 Epochs Ordinary Max  

BERT  
70 Heads  
50 Heads  
30 Heads  
10 Heads  
5 Heads  
3 Heads  
2 Heads  
1 Head 
The general tendency is a constant performance as the number of considered images increases. The exceptions are the mean symmetry function, where a clear increase in accuracy occurs, and the low accuracy obtained when only the worst-scoring half of the heads is considered.
D Data and model specifications
Type of persistence homology  Examples 

Ordinary  0  
Ordinary  1  
MultiDim  0  
MultiDim  1  
MultiDim  2  
Directed  0  
Directed  1  
Directed  2 