In 2003, Yoshua Bengio and his team proposed the first neural network for natural language processing (NLP) ([first_neural_net_nlp]). Since then, a steady stream of new model architectures has surpassed the previous records on many textual tasks. Recurrent neural networks (RNNs) ([first_recurrent_neural_net]) were replaced by long short-term memory (LSTM) networks ([first_long_short_term_memory_net]) before attention-based models were proposed ([transformer]).
One of the first text encoder models to use only the attention mechanism, without any recurrent parts, is BERT, which achieved state-of-the-art performance on many NLP benchmark tasks. Its success raised the question of what type of linguistic knowledge such models acquire, such as grammar rules, semantics, syntactic relations between words, or even world knowledge inferred through language. The numerous studies on these subjects are summarized under the term Bertology ([bertology]).
The geometric and topological information contained in BERT has recently caught the interest of the topological data analysis (TDA) community ([introduction_persistent_homology_comp_science, acceptability_judgement_topo_att_maps, artificial_text_topo_att_maps]). The attention head activation can be transformed into an attention graph. One can then filter this graph and apply persistent homology to study the evolution of connected components and higher-order structures, compute a collection of Betti numbers, or store the average barcode length of a given homology dimension. This approach provides a topological representation of textual input and can be used to train a new type of textual classifier or find new interpretability methods for NLP.
The contributions of this paper are summarized as follows:
We reproduce the results of [betti_bert] in a broader setting, with different choices of attention graph filtrations, types of persistent homology (ordinary and directed), and symmetry functions. We also use different topological features: they considered Betti numbers, whereas we consider persistence images.
We study UMAP projections of the persistence diagrams obtained from the attention graphs and compare their distribution with the classification of attention maps proposed by [dark_secret_of_bert]. We could not observe any correlation, but we find that the cluster composition is stable across input sentences, indicating a specialization of the attention heads.
We propose a new method to rate attention heads inspired by GradCam ([gradcam]) on the topological inputs that works remarkably well to prune the number of heads down from 144 to ten with no reduction either in classification performance or robustness against adversarial attacks. Moreover, the selected heads largely display attention patterns focusing on the [SEP] token, leading to a new perspective on the no-op hypothesis proposed in [what_does_bert_look_at].
Our research is the first study that contemplates the topic of adversarial attacks in relation to topological-based models for NLP tasks.
1 Related Work
Topological Data Analysis
Combining algebraic topology and machine learning has become a vast field of investigation in the past decade. To the best of our knowledge, [introduction_persistent_homology_comp_science] was the first to use persistent homology in the context of NLP. The author differentiated child writing from teenager writing using persistence tools. This was followed by growing interest from the scientific community in TDA methods for NLP, including a successful attempt to predict the genre of a movie from its plot ([tda_movie_genre]), an application of persistent homology to depict textual entailment in legal processes ([tda_legal_entailment]), an unsuccessful attempt at rating TED talks ([tda_tedtalk]), and the detection of artificially generated text ([artificial_text_topo_att_maps]).
As a further sign of this growing interest, [acceptability_judgement_topo_att_maps] examined, independently and in parallel to our work, the linguistic acceptability of text using the topology of the attention graphs. They were able to enhance the performance of existing models for two standard practices in linguistics: binary judgments and linguistic minimal pairs. They also proposed a method for analyzing the linguistic functions of attention heads and interpreting the correspondence between the graph features and grammatical phenomena.
The challenge of combining a neural network with a topological layer is the differentiability of the overall objective function. The PersLay model ([perslay]) obtained excellent classification performance on real-life graph datasets such as social graphs or data from medical or biological domains. This is similar to Persformer ([persformer]), which processes persistence diagrams without handcrafted features, relying instead on the self-attention mechanism, and achieves state-of-the-art results on the standard ORBIT5K dataset.
BERT Model
Multiple studies have shown that BERT is overparametrized. [dark_secret_of_bert] obtained an increase in the performance of the model when disabling the attention in certain attention heads. [are_sixteen_heads_better_than_one] proposed a pruning method that disabled 40% of the attention heads while keeping the accuracy high, and the authors showed that some layers could be reduced to one attention head for better performance. Even the original transformer architecture has been pruned from 48 heads down to 10 heads while maintaining the accuracy level ([analysing_multi_head_attention]). Interestingly, these remaining heads displayed specific interpretable linguistic functions. Such specialized heads have also been found in BERT by [what_does_bert_look_at], who present attention heads that attend to the direct objects of verbs, to the determiners of nouns, or to coreferent mentions.
[bert_plays_lottery] showed that BERT contained many subnetworks achieving performance comparable to the full model, and furthermore, that the worst performing subnetwork remains highly trainable.
Another approach to illustrate the information contained in a pretrained model is the study of the transferability of the contextual representations stored inside the model. [linguistic_knowledge_and_transferability_of:contextual_representations] found that linear models trained on top of frozen contextual representations, such as the pretrained BERT model, are competitive with state-of-the-art task-specific models. This is related to our work, as we also extract information from a frozen BERT model and train a topological classifier on it.
2.1 BERT Model
The multi-headed attention model BERT ([bert]) was one of the first models to use only the encoder part of the original transformer architecture ([transformer]) while obtaining state-of-the-art results on many NLP tasks. In its base version, it is composed of 12 encoder layers connected in series, each containing 12 attention heads applied in parallel.
We focus on the attention heads of BERT ($12 \times 12 = 144$ heads in total) as they are the part of the model containing the topological information we want to analyse. The input of an attention head is a matrix $X$ of size $n \times d$ (where $d = 768$) whose rows are vector representations of the $n$ tokens of the input sentence, and the output is a new representation $Z$ of size $n \times d'$ (with $d' = 64$) following the formula:
$$Z = \mathrm{softmax}\left(\frac{(XW^Q)(XW^K)^{\top}}{\sqrt{d'}}\right) XW^V,$$
where $W^Q, W^K, W^V \in \mathbb{R}^{d \times d'}$ are trainable matrices that project the token embeddings into a lower-dimensional vector space, and the softmax is taken over the second dimension. The matrix $A = \mathrm{softmax}\left((XW^Q)(XW^K)^{\top} / \sqrt{d'}\right)$ is an $n \times n$ matrix called the attention matrix.
Its entries $a_{ij}$ can be interpreted as the attention given to the token at position $j$ when computing the new representation of the token at position $i$. They are non-negative and the attention scores of a token sum up to one, i.e., $\sum_{j=1}^{n} a_{ij} = 1$ for all $i$.
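The computation performed by a single attention head can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: it uses the standard scaled dot-product attention, toy dimensions instead of those of BERT, random matrices in place of trained weights, and hypothetical helper names (`matmul`, `softmax_rows`, `attention_head`).

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Apply a numerically stable softmax to every row (the second dimension)."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_head(x, w_q, w_k, w_v):
    """Return the attention matrix A and the new token representations Z."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(w_q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = [[s / math.sqrt(d_head) for s in row] for row in matmul(q, k_t)]
    a = softmax_rows(scores)
    return a, matmul(a, v)

def random_matrix(rows, cols):
    """Random stand-in for embeddings or trained weights (illustration only)."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n, d, d_head = 4, 8, 2  # toy sizes, not the dimensions of BERT
A, Z = attention_head(random_matrix(n, d), random_matrix(d, d_head),
                      random_matrix(d, d_head), random_matrix(d, d_head))
```

Each row of `A` is a probability distribution over the tokens, which is the property the attention graphs rely on.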
2.2 Attention Maps and Attention Graphs
Attention maps are representations of the attention matrices as pixel images. The higher the attention score $a_{ij}$ (the entry of the attention matrix in row $i$ and column $j$), the darker the corresponding pixel. Those maps were intensively studied in [dark_secret_of_bert] and divided into four classes: diagonal patterns, vertical patterns, diagonal and vertical patterns, and heterogeneous patterns (see Figure 1).
Another representation of an attention matrix is an attention graph. Given a head, we construct a weighted directed graph taking as vertices the tokens of the input sentence and connecting two tokens $t_i$ and $t_j$ with an edge from $t_i$ to $t_j$ weighted according to the attention score $a_{ij}$ and an opposite-direction edge weighted according to $a_{ji}$. No further modification is needed to apply directed persistent homology. In the non-directed versions, this directed graph is transformed into a complete non-directed weighted graph on the set of tokens via a symmetric function $f$. The edge connecting tokens $t_i$ and $t_j$ is assigned the weight $1 - f(a_{ij}, a_{ji})$, so that the larger the weight of the edge connecting two vertices, the lower the transformed attention score between the two tokens. The transformation from attention map to attention graph is illustrated in Figure 3.
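The symmetrization step can be illustrated as follows. This is a minimal sketch, assuming that the undirected edge weight is one minus the symmetrized attention score (so that strong attention gives a small weight); the names `SYMMETRY` and `undirected_weights` are hypothetical, and the four functions are the maximum, minimum, multiplication, and mean used in the experiments.

```python
def multiply(a, b):
    return a * b

def mean(a, b):
    return (a + b) / 2.0

# The four symmetry functions considered in the experiments.
SYMMETRY = {"max": max, "min": min, "mult": multiply, "mean": mean}

def undirected_weights(attn, name):
    """Map an attention matrix to edge weights 1 - f(a_ij, a_ji) for i < j."""
    f = SYMMETRY[name]
    n = len(attn)
    return {(i, j): 1.0 - f(attn[i][j], attn[j][i])
            for i in range(n) for j in range(i + 1, n)}

# Toy 3-token attention matrix; every row sums to one.
attn = [[0.7, 0.2, 0.1],
        [0.5, 0.4, 0.1],
        [0.3, 0.3, 0.4]]
weights_max = undirected_weights(attn, "max")
weights_mean = undirected_weights(attn, "mean")
```

With the maximum, an edge is short as soon as one of the two tokens attends strongly to the other; with the minimum, both directions must be strong.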
2.3 Persistent Homology
The attention graphs constructed from the attention heads contain the topological structure we are interested in. Topologically, a weighted graph and the corresponding unweighted graph are identical. To encode the topological information provided by the graph weights, we use filtrations of the graph. A filtration is a sequence of nested topological subspaces indexed either by a discrete set such as $\mathbb{N}$ or by a continuous real-valued parameter. Starting with an attention graph, we consider three types of filtrations: Ordinary, MultiDim, and Directed.
For the Ordinary filtration, we initiate the filtration with only the vertices of the graph. Then we add edges one by one, in order of increasing weight, until we obtain the complete graph (see Figure 3): given two edges $e_1$ and $e_2$, if the weight of $e_1$ is smaller than the weight of $e_2$, then $e_1$ is added before $e_2$. This filtration is based on a real-valued parameter $t$ taking values in $[0, 1]$, and the filtration at stage $t$ is formed by all the edges with weights smaller than or equal to $t$.
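To make the notion of a filtration stage concrete, the following sketch computes the Betti numbers of the subgraph containing all edges of weight at most a given threshold. It is an illustration only: the function name `betti_numbers` and the toy weights are hypothetical, and the formula for the first Betti number is valid because the complex is a graph (no 2-simplices).

```python
def betti_numbers(n_vertices, weights, t):
    """Betti numbers (b0, b1) of the graph containing all edges of weight <= t."""
    parent = list(range(n_vertices))

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = [edge for edge, w in weights.items() if w <= t]
    components = n_vertices
    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
    # For a graph: b1 = #edges - #vertices + b0.
    return components, len(edges) - n_vertices + components

# Toy weights on a triangle graph with vertices 0, 1, 2.
weights = {(0, 1): 0.5, (0, 2): 0.7, (1, 2): 0.7}
curve = [betti_numbers(3, weights, t) for t in (0.0, 0.5, 0.7)]
```

Tracking these numbers along increasing thresholds gives Betti curves; persistent homology refines them by recording when each individual feature appears and disappears.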
The Ordinary filtration can be seen as starting with 0-simplices and then adding 1-simplices to construct the graph. The 0-simplices can be thought of as points, the 1-simplices as edges, the 2-simplices as triangles, the 3-simplices as tetrahedra, and so on. For the MultiDim filtration, the edges have the same filtration values as for the Ordinary filtration, but we add a 2-simplex every time three edges form a triangle.
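A possible way to list the simplices of the MultiDim filtration is sketched below. It assumes that a triangle enters the filtration at the value of its last boundary edge, which is how "every time three edges form a triangle" is read here; `multidim_simplices` is a hypothetical helper name.

```python
from itertools import combinations

def multidim_simplices(n_vertices, weights):
    """List (filtration value, simplex) pairs for vertices, edges, and triangles."""
    simplices = [(0.0, (v,)) for v in range(n_vertices)]
    simplices += [(w, edge) for edge, w in weights.items()]
    for i, j, k in combinations(range(n_vertices), 3):
        # A triangle appears as soon as its three boundary edges are present.
        value = max(weights[(i, j)], weights[(i, k)], weights[(j, k)])
        simplices.append((value, (i, j, k)))
    # Sort by value, then by dimension, so that faces come before cofaces.
    return sorted(simplices, key=lambda s: (s[0], len(s[1])))

weights = {(0, 1): 0.5, (0, 2): 0.7, (1, 2): 0.7}
filtration = multidim_simplices(3, weights)
```

In this toy example the cycle created by the third edge is filled by the triangle at the same filtration value, so the corresponding 1-dimensional feature has zero lifespan.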
For the Directed filtration, we consider the directed version of the attention graph and again start the filtration with only the vertices. The idea is similar to the MultiDim filtration: we add edges one by one, depending on their weight, and we add a 2-simplex if its boundary 1-simplices are present and do not form a directed cycle.
Filtrations are the topological interpretation of edge weights and their directions, for if the weights on the edges or the direction of the edges were different, the filtration would change accordingly.
Given a filtered simplicial complex, we can analyze it through persistent homology. The idea is to keep track of the appearance and disappearance of topological features through the filtration by computing the homology of each topological space encountered during the filtration and keeping track of the maps induced by the inclusions. An introduction to the mathematical background of TDA can be found in [introduction_persistent_homology_comp_science]. One can think of 0-dimensional persistent features as connected components, 1-dimensional features as holes, and 2-dimensional features as cavities (2-dimensional holes). The birth time of a persistent feature is the filtration value at which the feature appears. For example, the birth time of all the 0-dimensional features of our graph filtration is 0, and the birth time of a 1-dimensional feature is the filtration value of the edge completing a graph cycle. The death time is the filtration value at which the feature disappears. For a 0-dimensional feature, it corresponds to the weight of the edge connecting the corresponding vertex to the main connected component. If there are no 2-simplices in the filtration, a 1-dimensional feature never vanishes, and its death time is said to be equal to $\infty$; for computational purposes, it is set to the maximal filtration value $1$. The birth and death times of each feature are stored in a persistence diagram. A persistent feature is seen as a point in $\mathbb{R}^2$ with its birth time as $x$-coordinate and its death time as $y$-coordinate (see Figure 3). Hence a persistence diagram is a multi-set of elements in $\mathbb{R}^2$, a multi-set because two or more persistent features may have the same birth and death times.
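For the Ordinary filtration, the birth and death times described above can be obtained with a union-find pass over the edges sorted by weight. The sketch below is illustrative rather than the implementation used in the experiments (which rely on dedicated libraries); the name `ordinary_persistence` is hypothetical, and capping the component that never dies at the maximal filtration value is an assumption made here for symmetry with the 1-dimensional case.

```python
def ordinary_persistence(n_vertices, weights, max_value=1.0):
    """Return the 0- and 1-dimensional diagrams of a graph edge-weight filtration."""
    parent = list(range(n_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    h0, h1 = [], []
    for (i, j), w in sorted(weights.items(), key=lambda item: item[1]):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            h0.append((0.0, w))        # two components merge: one of them dies
        else:
            h1.append((w, max_value))  # the edge closes a cycle that never dies
    # Assumption: the last surviving component is capped at max_value as well.
    h0.append((0.0, max_value))
    return h0, h1

weights = {(0, 1): 0.5, (0, 2): 0.7, (1, 2): 0.7}
diagram_h0, diagram_h1 = ordinary_persistence(3, weights)
```

The edges that merge components are exactly those of a minimum spanning tree, and every other edge gives birth to a 1-dimensional feature.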
2.4 Persistence Images
Incorporating persistence diagrams into a machine learning pipeline faces two main challenges. Firstly, one can define distances on the space of persistence diagrams, such as the bottleneck distance or a Wasserstein-type distance (see Definition 5.1), but the underlying data structure remains a set, not a vector. Secondly, the space of persistence diagrams cannot be isometrically embedded into a Hilbert space, and this remains true for any type of distance considered ([persistence_diagrams_not_hilbert_space, Theorem 4.3]).
To overcome this issue, the TDA community has developed many methods to convert a collection of persistence diagrams into a collection of vectors contained in a Hilbert space. The method we consider uses persistence images ([persistence_images]) because they are simple to compute and interpret. Furthermore, they are compatible with convolutional neural networks.
A persistence diagram $D$ can be seen as a non-continuous map $\rho_D : \mathbb{R}^2 \to \mathbb{N}$ that counts the number of points in the diagram at the input location $z$. This function can be decomposed into a finite sum of indicator functions $\mathbb{1}_u$, one for each point $u \in D$, that return 1 if the input is $u$ and 0 otherwise. These indicator functions can be seen as probability density functions. To work with continuous functions, $\mathbb{1}_u$ is approximated by the 2-dimensional Gaussian distribution $g_u$ of mean $u$ and variance $\sigma^2$, a hyperparameter that has to be chosen. We sum up all those continuous functions to obtain $\sum_{u \in D} g_u$, a continuous approximation of the function $\rho_D$.
For more flexibility and stability, a weight function $w : \mathbb{R}^2 \to \mathbb{R}$ is incorporated inside the sum to emphasize certain regions of the persistence diagram. The obtained function
$$\rho_D^w(z) = \sum_{u \in D} w(u)\, g_u(z)$$
is called the persistence surface of the persistence diagram $D$.
The last step is to integrate the persistence surface over the cells of a grid of dimension $r \times s$ with given horizontal and vertical boundaries. This grid defines the frame and resolution of the persistence image and has to be chosen (see Figure 2). The value of a pixel $P$ of the persistence image of the persistence diagram $D$ is given by
$$I_D(P) = \iint_P \rho_D^w(x, y)\, \mathrm{d}x\, \mathrm{d}y.$$
The value of the integration over each pixel is stored in an $r \times s$ image, called the persistence image of the persistence diagram.
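The construction of a persistence image can be sketched directly from this description. The snippet is a simplified illustration, not the library implementation used in the experiments: it assumes an isotropic Gaussian (so that the integral over a rectangular cell factorizes into two one-dimensional integrals), works in birth-death coordinates, and uses hypothetical names (`gaussian_cell_mass`, `persistence_image`, `constant_weight`, `lifespan_weight`).

```python
import math

def gaussian_cell_mass(center, sigma, lo, hi):
    """Mass of a 1D Gaussian N(center, sigma^2) on the interval [lo, hi]."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - center) / (sigma * math.sqrt(2.0))))
    return cdf(hi) - cdf(lo)

def constant_weight(birth, death):
    return 1.0

def lifespan_weight(birth, death):
    return death - birth

def persistence_image(diagram, sigma, x_range, y_range, resolution,
                      weight=constant_weight):
    """Integrate the persistence surface over each cell of a rows x cols grid."""
    rows, cols = resolution
    dx = (x_range[1] - x_range[0]) / cols
    dy = (y_range[1] - y_range[0]) / rows
    image = [[0.0] * cols for _ in range(rows)]
    for birth, death in diagram:
        w = weight(birth, death)
        for r in range(rows):
            y_lo = y_range[0] + r * dy
            mass_y = gaussian_cell_mass(death, sigma, y_lo, y_lo + dy)
            for c in range(cols):
                x_lo = x_range[0] + c * dx
                mass_x = gaussian_cell_mass(birth, sigma, x_lo, x_lo + dx)
                image[r][c] += w * mass_x * mass_y
    return image

diagram = [(0.2, 0.6), (0.4, 0.9)]
frame = (-0.5, 1.5)
image = persistence_image(diagram, 0.05, frame, frame, (4, 4))
weighted = persistence_image(diagram, 0.05, frame, frame, (4, 4), lifespan_weight)
```

When the frame comfortably contains all the points, the pixel values sum to the total weight of the diagram, which gives a quick sanity check on the choice of frame and variance.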
For our application, we consider various image frames, image resolutions (see Table 12), and weight functions (see Table 13). An illustration for the Ordinary filtration can be found in Figure 3. The tables are in the appendix with examples for all types of persistence images (see Table 14).
We compare BERT against the topological model on several classification tasks. BERT is fine-tuned for a varying number of epochs. To apply the topological model, we first transform each sentence into a stack of persistence images. To do so, we feed a fine-tuned BERT model with the sentence and extract the attention graphs for each head. We then transform the attention graphs into persistence images. One attention graph generates a number of persistence images equal to the number of homology dimensions considered for the filtration (2 for Ordinary, 3 for the two others). For the Ordinary filtration, a sentence is thus transformed into $144 \times 2 = 288$ images, and for the two others into $144 \times 3 = 432$ images. The topological classifier receives as input a 4-dimensional tensor whose dimensions are the batch size, the number of persistence images per sentence, and the width and height of the images.
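The bookkeeping behind the shape of the classifier input can be written out explicitly. This is a small sketch with hypothetical names; the batch size is arbitrary, and the image size of 50 by 5 is taken from the Ordinary example given later in the text.

```python
N_LAYERS, N_HEADS = 12, 12  # bert-base: 12 layers of 12 heads, 144 heads in total
IMAGES_PER_HEAD = {"Ordinary": 2, "MultiDim": 3, "Directed": 3}

def classifier_input_shape(batch_size, filtration, image_shape):
    """Shape of the 4-dimensional tensor fed to the topological classifier."""
    n_images = N_LAYERS * N_HEADS * IMAGES_PER_HEAD[filtration]
    return (batch_size, n_images) + tuple(image_shape)

shape_ordinary = classifier_input_shape(8, "Ordinary", (50, 5))
shape_multidim = classifier_input_shape(8, "MultiDim", (50, 5))
```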
We use the Gudhi library ([gudhi]) to convert persistence diagrams into persistence images for the Ordinary and MultiDim filtrations, and the Giotto-tda library ([giotto-tda]) to manage the directed filtration.
We load the datasets from Hugging Face and we follow the data processing proposed by [betti_bert]. All dataset statistics are presented in Table 1.
The Corpus of Linguistic Acceptability ([CoLA]) is part of the GLUE benchmarks ([GLUE]). The task is to detect if a sentence is grammatically correct (class 1) or not (class 0). We consider the public part of the dataset that contains labels, and disregard the hidden part. We use the original validation set for prediction, and we split the train set into train and validation subsets with proportions of approximately 90% and 10% (see Table 1).
Large Movie Review Dataset v1.0 ([IMDB]) consists of movie reviews labeled with one of two sentiments, “positive” (1) or “negative” (0). It contains 50,000 reviews. We first divide the dataset into two equal-sized subsets, one for training and the other for testing and validation. To obtain attention graphs of manageable sizes, we consider only reviews of at most 128 tokens after tokenization with the standard BERT uncased tokenizer. We then divide the second subset into validation and prediction datasets of equal size (see Table 1).
The SMS Spam Collection v.1 ([SPAM]) contains SMS text messages, and the task is to determine if they are spam (1) or ham (0). It contains 5,574 messages that we divide into train, validation, and prediction subsets (approximately 80%, 10%, and 10%; see Table 1).
The binary version of Stanford Sentiment Treebank ([SST2]) is also part of the GLUE benchmarks and consists of parts of movie reviews labeled with one of two sentiments, “positive” (1) or “negative” (0). As for the CoLA dataset, we split the original train set into two subsets with proportions of approximately 90% and 10%, and use the original validation set for the prediction.
Table 1: Dataset statistics.
| | CoLA | IMDB | SPAM | SST2 |
| # Train sent. | 7695 | 2675 | 4459 | 60614 |
| # Validation sent. | 856 | 1414 | 557 | 6735 |
| # Prediction sent. | 1043 | 1414 | 558 | 872 |
| Mean sent. size | 11 | 88 | 25 | 13 |
| Max sent. size | 47 | 128 | 238 | 64 |
We run the experiments on the “bert-base-uncased” model from the HuggingFace library ([huggingface_transformers]) for the BERT baselines. For exploratory purposes, we consider numerous variations:
- Number of fine-tuning epochs
We fine-tune BERT for 4, 10, and 20 epochs, following the procedure proposed in https://github.com/MohamedAteya/BERT-Fine-Tuning-Sentence-Classification-for-CoLA, to study the importance of fine-tuning with respect to the performance of the topological classifiers. This gives us three BERT models per task, and each is used to transform sentences into persistence images.
- Symmetry function
The choice of the symmetry function is a crucial step to get from the attention matrices to the non-directed attention graphs. To explore its relevance with respect to the performance of the topological classifier, we use four different symmetry functions: maximum, minimum, multiplication and mean.
- Filtration
We also consider the three types of filtrations described in Section 2.3. Unfortunately, we are limited in terms of computation time and power, as the cost can become excessive depending on the length of the sentence to transform (see Figure 4). Hence, for the MultiDim filtration, we consider only the maximum as a symmetry function, and the Directed filtration is only applied to the CoLA dataset.
In total, one fine-tuned BERT model generates 5 or 6 persistence image datasets (PI-datasets). We refer to a PI-dataset by the fine-tuned BERT model producing it (name of the dataset and number of fine-tuning epochs), the type of filtration, and the symmetry function, e.g., “IMDB, 10 epochs, MultiDim, max”.
Our topological classifiers are convolutional neural networks taking as input the stack of persistence images from one sentence. The architecture is identified by running a hyperparameter search using the Optuna library ([optuna]) for 500 trials, tuning the number of convolution and fully connected layers, the learning rate, the optimizer, and the dropout rate. The hyperparameter search is done on one dataset (CoLA, IMDB, SPAM, or SST2), and we say that the model designed from the hyperparameter search result is specific to this dataset. We run hyperparameter searches for each classification task and type of filtration. The architectures of all the topological classifiers are presented in Table 15.
We then investigate the performance of a topological classifier specific to one dataset when evaluated on the other datasets. To do so, for each PI-dataset, we evaluate the performance of two topological models: one whose hyperparameters are optimized on the current dataset, the other one specific to the CoLA dataset. We refer to the first model as the specific model and to the second as the general model.
The inference time of our combination of the BERT model and the topological classifier is greater than the inference time of the BERT model itself. For example, when using the Ordinary filtration, the time needed to transform a tokenized sentence into a persistence image and predict its class is two times greater than the BERT prediction time. It goes up to 20 times for the MultiDim filtration and 70 times for the Directed filtration. The bottleneck is the computation of the persistence images from the persistence diagrams, which is not implemented on GPU but only on CPU.
All the computations are done on a Google Cloud Platform (https://cloud.google.com/) virtual machine of type “n1-standard-8”, with 8 CPUs, 30GB of RAM, and an NVIDIA Tesla T4 GPU with 16GB of VRAM.
4 Results of the topological classifiers
The performance results obtained by the model whose hyperparameters are optimized on the CoLA dataset and by the specific models are outlined in Table 2 and Table 3. As performance measures, we choose the accuracy on the prediction set and the Matthews correlation coefficient, which is a good metric for unbalanced datasets.
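To illustrate why the Matthews correlation coefficient is informative on unbalanced data, the sketch below computes it from the confusion matrix of a binary classifier. The function names and the toy labels are hypothetical; returning 0 when the denominator vanishes is a common convention that is assumed here.

```python
import math

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: the coefficient is 0 when the denominator is 0.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# A classifier that always predicts the majority class of an unbalanced set.
y_true = [1, 1, 1, 0]
y_majority = [1, 1, 1, 1]
acc_majority = accuracy(y_true, y_majority)
mcc_majority = matthews_corrcoef(y_true, y_majority)
mcc_perfect = matthews_corrcoef(y_true, y_true)
```

The majority-class predictor reaches an accuracy of 0.75 on this toy set but a coefficient of 0, whereas perfect predictions give a coefficient of 1.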
The general topological classifier outperforms BERT on the CoLA dataset and obtains similar performance on the other datasets. This suggests that the persistence images contain as much syntactic information as the encoding provided by BERT. The biggest increase in performance is generally obtained with the MultiDim filtration. It seems that the more topological information is provided to the model, the better it performs. However, the improvement does not justify the computational cost of the persistent homology of the MultiDim filtration. The model based on the Directed filtration performs no better than the one based on the Ordinary filtration. One explanation could be that the persistence images produced by the Giotto-tda library have different ranges, making it difficult for the CNN to compare them.
There is no symmetry function that works best in all cases. The mean is the best choice when considering the CoLA dataset. The multiplication performs less well for IMDB, but better for the three others. In general, all symmetry functions perform similarly on a given task. Hence this choice is not crucial for the overall performance of the topological classifier. Interestingly, there is also no significant difference between the results obtained by the specific models versus the general model. In some cases, when applied on another dataset, the general model outperforms the specific model. This observation would suggest that we could use one topological classifier for multi-task learning with a performance similar to BERT.
Lastly, performance tends to improve when we increase the number of BERT fine-tuning epochs. This is the case on each task with respect to any filtration and any symmetry function, even when the BERT model overfits by training for more epochs (SST2 and IMDB).
5 UMAP Description
It is challenging to understand how BERT learns to solve a task, and many attempts have been published ([attention_is_not_explanation, what_does_bert_look_at, bertology]). Persistence diagrams of the attention graphs offer a new perspective to look at attention heads.
To do so, we use the UMAP (Uniform Manifold Approximation and Projection) library ([umap]). It allows one to project a high-dimensional point cloud onto the two-dimensional plane. The data we use consists of the persistence diagrams of the 144 attention heads corresponding to one sentence. To give a graph structure to the set of data points, we compute the $q$-Wasserstein distance between each pair of diagrams.
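Definition 5.1 ([computational_topology_for_data_analysis], Definition 3.9 and 3.10).
Let $q \geq 1$ be fixed and let $D_1$ and $D_2$ be two finite persistence diagrams of the same homology dimension. Let $\overline{D}_1$ and $\overline{D}_2$ be the diagrams obtained from $D_1$ and $D_2$ by adding all points on the diagonal with infinite multiplicity. We define the $q$-Wasserstein distance between $D_1$ and $D_2$ by
$$W_q(D_1, D_2) = \inf_{\pi \in \Pi} \left( \sum_{x \in \overline{D}_1} \lVert x - \pi(x) \rVert_\infty^{\,q} \right)^{1/q},$$
where $\Pi$ is the set of all bijections between $\overline{D}_1$ and $\overline{D}_2$.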
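For very small diagrams, a Wasserstein-type distance of this kind can be computed by brute force, which makes the role of the diagonal explicit. The sketch below is an illustration only: it assumes the supremum norm as ground metric, enumerates all matchings (so it is usable only for a handful of points), and uses the hypothetical name `wasserstein`. Each diagram is augmented with as many diagonal slots as the other diagram has points, and matching a point to the diagonal costs half its lifespan.

```python
from itertools import permutations

def wasserstein(d1, d2, q=1.0):
    """Brute-force q-Wasserstein distance between two small persistence diagrams."""
    # None stands for a diagonal slot; one slot per point of the other diagram.
    left = list(d1) + [None] * len(d2)
    right = list(d2) + [None] * len(d1)

    def cost(a, b):
        if a is None and b is None:
            return 0.0  # diagonal matched to diagonal is free
        if a is None:
            return ((b[1] - b[0]) / 2.0) ** q  # supremum-norm distance to diagonal
        if b is None:
            return ((a[1] - a[0]) / 2.0) ** q
        return max(abs(a[0] - b[0]), abs(a[1] - b[1])) ** q

    best = min(
        sum(cost(a, right[j]) for a, j in zip(left, perm))
        for perm in permutations(range(len(right)))
    )
    return best ** (1.0 / q)

dist_close = wasserstein([(0.0, 1.0)], [(0.0, 0.8)])
dist_empty = wasserstein([(0.0, 1.0)], [])
dist_same = wasserstein([(0.1, 0.4), (0.2, 0.9)], [(0.2, 0.9), (0.1, 0.4)], q=2.0)
```

In practice, the matching is found with an assignment solver rather than by enumeration.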
From the pairwise distances of the diagrams, UMAP constructs a graph as follows: it connects each point to a fixed number of closest neighbors. Then it projects this graph onto the two-dimensional plane. The number of neighbors to connect to has to be chosen manually. The higher it is, the more global information will be retrieved. The lower it is, the more local information will be displayed in the final projection. Figure 5 shows examples of UMAP projections.
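The neighbor graph underlying the projection can be sketched from a precomputed distance matrix. This is a deliberate simplification with a hypothetical function name: it only builds the unweighted nearest-neighbor graph, whereas UMAP additionally assigns membership strengths to the edges and then optimizes a low-dimensional layout.

```python
def knn_graph(dist, k):
    """Connect every point to its k nearest neighbors (directed edges)."""
    edges = set()
    for i, row in enumerate(dist):
        others = sorted((j for j in range(len(row)) if j != i), key=lambda j: row[j])
        edges.update((i, j) for j in others[:k])
    return edges

# Toy distance matrix: four points on a line at positions 0, 1, 2, and 10.
positions = [0.0, 1.0, 2.0, 10.0]
dist = [[abs(a - b) for b in positions] for a in positions]
local_graph = knn_graph(dist, 1)
global_graph = knn_graph(dist, 3)
```

With one neighbor, the isolated point is only loosely attached to the rest; with three neighbors, the graph is complete, which illustrates the trade-off between local and global information.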
UMAP provides us with information about the similarity between persistence diagrams. The main observation is the formation of clusters, indicating that there are groups of diagrams that look similar and that are different from the diagrams not in the cluster.
The primary motivation to use UMAP projections of persistence diagrams is to give another perspective on the self-attention patterns depicted in [dark_secret_of_bert]. We want to see if the persistence diagrams are also clustered in a similar way. We consider sentences of the CoLA dataset, and we plot the UMAP for each category of diagrams: one per type of filtration (Ordinary, MultiDim, or Directed) and per homology dimension (0 and 1 for Ordinary; 0, 1, and 2 for MultiDim and Directed). For each sentence, we manually labeled the heads with the pattern type of their self-attention maps. Then we display the UMAP projections, coloring the dots with respect to the pattern: yellow for diagonal, red for vertical, blue for diagonal and vertical, and green for heterogeneous.
In Figures 5, 6, 12, 13, 14, and 15, the distribution of the persistence diagrams is not correlated with the distribution of the attention map classes. The colors are spread evenly across all the points, with no clear monochromatic area. In some projection maps, a color gradient between green and red can be observed, but with no clear distinction between the two groups. Nevertheless, there are some cluster formations shared across sentences. In the next subsections, we explore the cluster compositions and compare them across sentences. We observe similar patterns for a large range of sentences, but only present the plots of three sentences for illustrative purposes. The three sentences are:
“Our friends won’t buy this analysis, let alone the next one we propose.” (1)
“I know a boy mad at John.” (1)
“Mary has more friends that two.” (0)
Figure 5 shows the UMAPs of the 0-dimension diagrams for the three considered sentences.
A small cluster is present for the first two sentences. The overall shape and the color distribution are similar across sentences, and we observed that pattern for many more sentences. Looking at the persistence images of each head, no clear pattern distinguishes the diagrams inside the cluster from those outside it. But surprisingly, the cluster is formed on average by the same heads. For example, the cluster for the second sentence contains 20 of the 23 heads of the cluster from sentence 1. The circled area for sentence 3 contains all the elements of this cluster of size 23.
We observe no common cluster for the first homological dimension diagrams, but their UMAP projections tend to divide green and red points.
The classes proposed by [dark_secret_of_bert] are not observed through the topological lens in these three sentences. However, the UMAP projections display clusters that appear for each sentence and whose composition is shared in each example. For persistence diagrams of homology dimension zero, there is a small dense cluster shared across sentences, but we could not find a pattern shared between the diagrams inside. For homology dimension one, there is a cluster whose diagrams contain mostly short-lifespan features, i.e., persistence features with a death time equal or very close to the birth time. And for homology dimension two, there is a cluster whose diagrams present almost only features with a high birth time. These specific clusters vary in size, but their composition is shared across sentences. On average, more than 90% of the smallest specific cluster is shared with the clusters in the other sentences. We made the same observation for numerous sentences of the CoLA dataset, leading to the claim that the attention heads of the fine-tuned BERT model specialize in searching for specific information. This claim was acknowledged in [bertology], and we provide a new approach to back it up.
6 Pruning the heads
Instead of exploring the structure of the persistence diagrams, we investigate what part of the input is most relevant for the model to predict the class of the sentence. To do so, we develop a method inspired by GradCam ([gradcam]).
GradCam is used to help understand the decisions made by deep learning models. Given an image, GradCam produces a heatmap that shows which regions of the image drive the decision of the model. In our case, as the input is composed of either 288 images (twice the number of heads) or 432 images (three times the number of heads), we cannot directly apply the method proposed by [gradcam]. Instead, we compute the gradient of the output logit with respect to the input images. This yields a tensor of the same shape as the input (for example, [288, 50, 5] in the Ordinary case). Then we average the absolute value of the gradient over each channel to obtain a number that represents the influence of each individual image on the model output. Finally, we take the mean of these values over the images coming from the same attention head (two images in the case of the Ordinary filtration, three for the others). We end up with a score for each of the 144 heads of BERT for one input sentence. We perform this procedure for a large number of sentences and average all the obtained scores.
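The aggregation of the gradient into head scores can be sketched as follows. The gradient itself would come from backpropagation through the trained classifier; here it is replaced by a small hand-written tensor. The function names are hypothetical, and the snippet assumes that the images of one head are stored next to each other in the stack.

```python
def head_scores(gradient, images_per_head):
    """Turn a gradient with the shape of the input into one score per head."""
    # Mean absolute gradient of every image (channel) in the stack.
    per_image = [
        sum(abs(v) for row in image for v in row) / sum(len(row) for row in image)
        for image in gradient
    ]
    # Assumption: images of the same head are contiguous in the stack.
    return [
        sum(per_image[start:start + images_per_head]) / images_per_head
        for start in range(0, len(per_image), images_per_head)
    ]

def average_scores(score_lists):
    """Average the head scores obtained for several sentences."""
    return [sum(column) / len(score_lists) for column in zip(*score_lists)]

# Toy gradient: 4 images of shape 1 x 2, i.e. 2 heads with 2 images each.
gradient_a = [[[1.0, -1.0]], [[3.0, 3.0]], [[0.0, 0.0]], [[2.0, -2.0]]]
gradient_b = [[[0.0, 0.0]], [[0.0, 0.0]], [[4.0, 4.0]], [[0.0, 0.0]]]
scores_a = head_scores(gradient_a, 2)
scores_b = head_scores(gradient_b, 2)
mean_scores = average_scores([scores_a, scores_b])
```

Taking absolute values before averaging prevents positive and negative gradients within an image from cancelling each other.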
Our hypothesis is that the higher the score of a head, the more relevant it is to the topological model, and the more information about the sentence structure with respect to the current task it contains. Figure 7 displays the 30 best attention heads of the topological model trained on the PI-dataset “CoLA, 4 Epochs, Ordinary”, but with various numbers of sentences considered and for different symmetry functions.
The heads with the highest scores are independent of the number of sentences used, as there is almost no difference in Figures 6(c), 6(b) and 6(a), Figures 6(f), 6(e) and 6(d), and Figures 6(i), 6(h) and 6(g). We also observed that the best performing heads are almost independent of the symmetry function considered. In general, the best heads are located in the deep layers of BERT. Thus, the heads of BERT located in the later layers, which are the ones that change the most when fine-tuning ([dark_secret_of_bert]), are also the most relevant for the topological classifier.
We design the following experiment to determine if these high-scoring heads contain most of the information our topological classifier needs to perform well. First, we determine the heads with the highest scores: we train a model on a selected PI-dataset $A$ and apply our rating procedure to it. We look at the $k$ heads with the highest scores (for $k$ = 70, 50, 30, 10, 5, 3, 2, and finally the best head). Then, we train another model on a PI-dataset $B$, but with persistence images only related to the $k$ highest-scoring heads. When considering the 70 best heads, we do not consider the persistence images from the other 74 heads. In the case of the Ordinary filtration, the input of shape [288, 50, 5] is pruned to the shape [140, 50, 5], as each head produces two persistence images (one for the 0-dimensional features, one for the 1-dimensional features). Table 4 presents the performance obtained from such a pruning. The base PI-dataset considered is “CoLA, 4 Epochs, Ordinary, max”. The other columns are variations of the base PI-dataset obtained by changing either the symmetry function or the number of fine-tuning epochs. In Table 4, the PI-dataset $A$ is the base PI-dataset and the PI-dataset $B$ is the PI-dataset identified by the column. In Table 5, $A$ and $B$ are both the PI-dataset identified by the column.
|CoLA 4 Epochs Ordinary Max||CoLA 4 Epochs Ordinary Min||CoLA 4 Epochs Ordinary Mean||CoLA 10 Epochs Ordinary Max||CoLA 20 Epochs Ordinary Max|
|CoLA 4 Epochs Ordinary Max||CoLA 4 Epochs Ordinary Min||CoLA 4 Epochs Ordinary Mean||CoLA 10 Epochs Ordinary Max||CoLA 20 Epochs Ordinary Max|
The performance of the topological classifier trained on the base PI-dataset does not decrease as the number of input images decreases. It even increases, outperforming the 144-head model by up to 2% in accuracy on the prediction set. Astonishingly, with 2 heads, our topological classifier outperforms BERT by 1.5% in accuracy: the model receives only 4 of the initial 288 images, and its accuracy is still very high. The accuracy diminishes when considering only one head, but it remains high at almost 80%.
The same trend can be observed for all the other PI-datasets: the accuracy increases or stays constant when we decrease the number of considered heads from 70 to 10. However, we get a lower accuracy when we consider fewer than ten heads. In all the models, decreasing the number of input images increases the performance of our topological classifier, down to a certain minimal number of heads. Choosing a model-specific rating of heads does not change the behavior of the results, neither in trend nor in values. Hence, the heads containing the most relevant information with respect to our topological classifier are consistent across different symmetry functions and numbers of fine-tuning epochs.
We do not observe the same phenomenon when we look at different datasets: no high-scoring heads are shared across tasks (see Tables 8 and 9 in the Appendix). For the pruning to be efficient, the head scores have to be determined specifically for each dataset.
We also consider the effect of image pruning for other filtrations (see Table 10 in the Appendix) and observe that they also gain in performance from it. Interestingly, to increase the performance of the topological classifier, one should prefer to remove some well-chosen input images, rather than considering more complex and computation-demanding filtrations.
The highest-rated heads may not be the only ones from which the model retrieves valuable information. To explore this, we trained the model while keeping the images coming from all the attention heads except the heads with the highest rating (see Table 11 in the Appendix). We conclude that the high-scoring heads are not necessary: even without them, the topological classifier obtains a performance similar to the unpruned case.
These results support the claim of [bert_plays_lottery] that one can find a good sub-model inside BERT even when it is highly pruned. In [are_sixteen_heads_better_than_one], some layers were pruned to a single head with no effect on performance. With our procedure, we could reduce the number of heads to five without a loss of performance.
We further investigate how our pruning procedure behaves across different fine-tuned BERT models in Appendix C.
From all these experiments, a clear observation arises: specific heads are highly relevant for our model to perform comparably to BERT or even outperform it. To investigate what these heads look like, we plot the attention maps ([dark_secret_of_bert]) of the three best heads of our base PI-dataset (“CoLA, 4 Epochs, Ordinary, max”) for ten sentences in class 1 (grammatically correct) and ten sentences in class 0 (grammatically incorrect).
Almost all the feature maps have high values in the column corresponding to the [SEP] token. This means that, to encode the sentence, the head mainly uses the current vectorization of [SEP] to compute the new representation of each token. It also suggests that this pattern contains sufficient information for the topological classifier to perform well. In [what_does_bert_look_at], this peculiar pattern on [SEP] is interpreted as a no-op function: the default mode a head enters if it cannot apply its specific function. For example, the authors of [what_does_bert_look_at] found a head that is specialized in verb-subject recognition; this head puts all its attention on [SEP] if the input word is not a verb. Our observation suggests that giving attention to [SEP] carries useful information, and is not only a way for the head to perform no operation.
To deduce how full attention on [SEP] can be used to classify sentences, the persistence images are of great help. Figure 9 plots the persistence images corresponding to the above attention maps.
These images correspond to attention almost exclusively on the token [SEP]. Each bar in the persistence image represents the filtration value where a token is connected to the [SEP] vertex. Sparse diagrams represent a filtration where the vertices are connected to the [SEP] vertex at different filtration values. Packed diagrams indicate a filtration where the vertices are connected to [SEP] within a narrow range of filtration values. The valuable information of the full-attention-to-[SEP] pattern might lie in the connection to the [SEP] vertex in the attention graph, which is easily captured by the topological classifiers.
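To make the link between attention on [SEP] and the 0-dimensional bars concrete, the following sketch computes the death values of the connected components of a small attention graph with a union-find structure. It rests on assumptions that are not taken from the experiments: the attention matrix is symmetrized with the max function, the edge weight is one minus the symmetrized attention, and edges enter the filtration in order of increasing weight. The function name is hypothetical.

```python
import numpy as np

def zero_dim_deaths(attention, symmetry=np.maximum):
    """Death values of 0-dimensional features of an attention graph.

    Assumed filtration (illustrative only): edge {i, j} appears at
    1 - symmetry(a_ij, a_ji); all vertices are present from the start.
    """
    attention = np.asarray(attention, dtype=float)
    n = attention.shape[0]
    dist = 1.0 - symmetry(attention, attention.T)
    edges = sorted((dist[i, j], i, j)
                   for i in range(n) for j in range(i + 1, n))

    parent = list(range(n))

    def find(x):
        # Find the component representative, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    deaths = []
    for weight, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Two components merge: one 0-dimensional feature dies here.
            parent[root_i] = root_j
            deaths.append(float(weight))
    return deaths  # n - 1 finite deaths; one component never dies
```

When every token attends mostly to [SEP], each merge happens along an edge to the [SEP] vertex, so the death values are determined by how much attention each token gives to [SEP].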
But where does the model look when it processes persistence images? The regions that are the most relevant to the model appear when the gradient is visualized, as in Figure 10. The darker the red, the more the area influences the model towards class 0; the darker the blue, the more the influence is towards class 1. The white areas are not considered by the model as being relevant to the output classification.
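A gradient map of this kind can be approximated for any scalar-valued model with central finite differences. The sketch below is a generic illustration, not the code used for Figure 10: the function name is hypothetical, and `model` stands for any callable returning the probability of class 1 for a single image. In practice, automatic differentiation would be used instead of looping over pixels.

```python
import numpy as np

def saliency_map(model, image, eps=1e-4):
    """Approximate d model(image) / d pixel with central differences.

    model : callable mapping an array to a scalar (assumed here to be
            the probability of class 1)
    Negative entries push the output towards class 0, positive entries
    towards class 1, and entries near zero have little influence.
    """
    image = np.asarray(image, dtype=float)
    grad = np.zeros_like(image)
    for idx in np.ndindex(image.shape):
        up = image.copy()
        down = image.copy()
        up[idx] += eps
        down[idx] -= eps
        grad[idx] = (model(up) - model(down)) / (2.0 * eps)
    return grad
```

Plotting the result with a diverging colormap centered at zero gives a picture of the same type as the one described above.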
The model behaves differently for each sentence label. If the sentence is in class 0, any positive pixel value decreases the model output, thus increasing the probability of class 0. This is independent of the death time of the feature; hence only its existence is relevant. If the sentence belongs to class 1, then again any point in the image influences the output probability towards class 1, except for some particular regions, generally situated in a specific range of filtration values, where the influence is inverted. Persistence features of dimension 0 that die in this range influence the model output toward class 0. A 0-dimensional feature dies when it gets connected to the main connected component; hence, if the lowest-valued edge of a vertex falls in this specific filtration range, it will influence the model toward class 0.
From the CoLA dataset perspective, a sentence containing such a token has a higher chance of being predicted as grammatically incorrect.
7 Adversarial Attacks
The transformation of attention head activations into persistence images, despite being computationally demanding, could increase the robustness of our model. On the other hand, pruning the number of heads considered for the input of the model might diminish the stability of the classifier. To explore both concerns, we subject our topological model to adversarial attacks. We used TextAttack ([textattack]) to generate one hundred attacks for the SST2 dataset. After removing the skipped and failed attempts, 89 successful attacks on BERT remain. We then apply the topological classifier “SST2, 4 Epochs, ordinary, max” to each sentence before and after the changes made by the attacks, with various numbers of heads considered, as determined by the pruning method presented in Section 6.
We consider SST2 rather than the CoLA dataset because the attacks generated by TextAttack mostly transformed a grammatically correct sentence into a grammatically incorrect one, and such an attack is counted as a success even if the model detected the grammatical mistake. We therefore looked at the 89 sentences in SST2 that were initially correctly classified by the BERT model.
|Avoided Attacks||Avoided Common Attacks||#|
Table 6 shows that the topological model is much more stable than the BERT model. Only half of the attacks that succeed on BERT also succeed in fooling the topological classifier, which is surprising since the attention graphs come from the fooled BERT model. Moreover, the stability of our model does not decrease with the number of considered heads. Even with persistence images coming from fewer than 5 heads, the adversarial attack efficiency slightly exceeds 50%. This suggests that the robustness of the classifier based on persistence homology is not due to the large number of input images.
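The quantities in such a comparison can be computed from the predictions before and after each attack. The sketch below relies on assumed definitions, which may differ from the ones behind Table 6: an attack counts as avoided when the topological classifier still predicts the true label on the attacked sentence, and the common attacks are those on sentences that the topological classifier also classified correctly before the attack. The function name is hypothetical.

```python
import numpy as np

def avoided_attack_rates(labels, pred_before, pred_after):
    """Return (avoided rate, avoided rate on common attacks, # common).

    labels      : true labels of the attacked sentences
    pred_before : classifier predictions on the original sentences
    pred_after  : classifier predictions on the attacked sentences
    """
    labels = np.asarray(labels)
    pred_before = np.asarray(pred_before)
    pred_after = np.asarray(pred_after)

    # Assumed definition: the attack is avoided if the prediction
    # on the attacked sentence is still the true label.
    avoided = pred_after == labels
    # Assumed definition: "common" means the classifier was also
    # correct before the attack, as BERT was.
    common = pred_before == labels

    n_common = int(common.sum())
    rate_all = float(avoided.mean())
    rate_common = float(avoided[common].mean()) if n_common else float("nan")
    return rate_all, rate_common, n_common
```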
Furthermore, the robustness is not due to the stability of the persistence images against adversarial attacks. Figure 11 displays the perturbation between the persistence images before and after the first 10 attacks. The squares represent the attention heads sorted by layers, with one pixel per head. The darker the pixel, the larger the Euclidean distance between the images generated from the head. We consider the persistence images of both dimension 0 and dimension 1 by summing up their differences.
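The per-head perturbation described above can be sketched as follows. The layout is an assumption: we suppose that the 288 images are stored head by head (dimension 0, then dimension 1) and that heads are ordered layer by layer in a 12 by 12 grid. The function name is hypothetical.

```python
import numpy as np

def head_perturbation(before, after, n_layers=12, n_heads=12,
                      images_per_head=2):
    """Per-head Euclidean distance between two sets of persistence images.

    before, after : arrays of shape [n_layers * n_heads * images_per_head, H, W]
    Returns an [n_layers, n_heads] grid in which each entry is the sum,
    over the homology dimensions, of the Euclidean distance between the
    image before and the image after the attack.
    """
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    # Flatten each image and group the images head by head (assumed order).
    diff = (after - before).reshape(n_layers * n_heads, images_per_head, -1)
    per_image = np.linalg.norm(diff, axis=-1)  # one distance per image
    return per_image.sum(axis=1).reshape(n_layers, n_heads)
```

Because persistence images have a fixed shape, this distance is defined even when the attack changes the number of tokens in the sentence.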
Looking at the perturbation value of one head across different attacks, we notice that the perturbation value highly depends on the attack. For example, the perturbation value of the first head in the first layer (upper left corner) varies up to a factor of 100 between different attacks. In general, the images undergo modification before and after the attack. But even if the images change, the model’s output remains constant.
There is no canonical way to analyze the perturbations in the attention maps. The sentence length can vary before and after the attack; therefore, the dimension of the attention maps can also vary. Hence, to measure the Euclidean distance, for example, one must first resize one of the attention maps to match the dimension of the other. Shrinking the larger one is unsuitable, as it removes the [SEP] column and drastically changes the attention graph’s structure. The same goes for padding the smaller attention map. This illustrates one advantage that persistence images bring to interpretability methods: they make it possible to compare the model’s behavior on two sentences of different lengths.
It is worth noting that, in general, the attacks do not produce correct or understandable movie reviews for SST2 (see Table 7). The topological classifier seems to depend less on the overall meaning of the sentence. It might represent the input more abstractly and hence provide a stability towards meaning-switching words that is desirable for classification.
|Before Attack||After Attack|
|It’s a charming and often affecting journey.||It ’s a cutie and often afflicts journey.|
|Unflinchingly bleak and desperate||Unflinchingly eerie and desperate|
|Allows us to hope that Nolan is poised to embark a major career as a commercial yet inventive filmmaker.||Allows ourselves to hope that Nolan is poised to embarked a severe career as a commercial yet novelty superintendent.|
|The acting, costumes, music, cinematography and sound are all astounding given the production’s austere locales.||The acting, costumes, music, cinematography and sound are all breathless given the production’s austere locales.|
|It’s slow – very, very slow.||It’s slow – pretty, perfectly lent.|
The topological model presents greater stability than BERT. Furthermore, the stability is maintained regardless of the number of heads considered for the images and the impact of the attacks on the images. The stability must therefore be intrinsic to the classification itself.
This robustness and high classification performance make our topological model more suitable than BERT when consistency and stability are needed.
In this work, we proposed numerous experiments on persistent homology applied to text classification. The model we present outperforms the BERT baselines and has higher robustness against adversarial attacks. We presented a new perspective on the specialization of BERT’s attention heads using persistence diagrams, and also developed a new BERT attention head scoring technique.
Our most surprising finding is the efficiency of our proposed ratings, allowing us to consider only ten attention heads out of 144 with no reduction in accuracy on the test dataset or stability. Although the attention to the [SEP] token was assumed to have a no-op behavior ([what_does_bert_look_at]), a majority of the best scoring heads showcase this pattern, suggesting that through the lens of TDA, the attention to [SEP] displays valuable information for the classification task.
One possible direction for future research is to extend the tools from TDA to other types of NLP tasks. We recommend using ordinary persistence homology up to the first dimension to limit the computational cost, and using more powerful vector representations than persistence images, such as the ones computed by the Persformer ([persformer]). We also propose applying our rating approach to identify the most relevant heads and prune the others, which could increase performance. Lastly, we suggest training a specific classifier to detect adversarial textual attacks from the topology of the attention graphs.
This work was supported by the Swiss Innovation Agency (Innosuisse project 41665.1 IP-ICT).
We thank Matthias Kemper for helpful discussions and constructive comments on the paper.
A MultiDim UMAPs
For this filtration, there is a clear cluster in the UMAPs of the 1-dim diagrams observed across the three sentences.
This time, they represent diagrams that contain almost only points on the diagonal. Those points represent 1-dimensional persistence features that vanish at the time they are born. In other words, these points represent a “triangular” cycle, a cycle formed by only three edges. When the third edge is added, a 2-cell is also added and fills the inside of the triangle, making the hole of the cycle disappear. As previously observed, the clusters across the sentences globally share the same heads.
For the second homological dimension diagrams, a similar pattern is observed.
The difference between the diagrams inside the cluster and the ones outside is that the former have only 2-holes with a high birth time, while the latter have 2-holes with varying birth times. When transformed into persistence images, the diagrams outside the cluster display a richer variety of patterns than the diagrams inside the cluster, whose images are all similar: high-value pixels in the top right corner and small values everywhere else. Again, the clusters across sentences share a similar composition.
B Directed UMAPs
We observe similar clusters as in the MultiDim filtration case.
Their meaning is identical: diagrams with almost only diagonal points for the first-dimensional features and top right points for the second-dimensional features. Again the composition of the clusters is similar across sentences.
C Pruning heads across models and datasets
We further investigate how our pruning procedure behaves across different datasets and across different fine-tuned BERT models.
|CoLA 4 Epochs Ordinary Max||IMDB 4 Epochs Ordinary Max||SPAM 4 Epochs Ordinary Max||SST2 4 Epochs Ordinary Max|
|CoLA 4 Epochs Ordinary Max||IMDB 4 Epochs Ordinary Max||SPAM 4 Epochs Ordinary Max||SST2 4 Epochs Ordinary Max|
For the head scores from CoLA, the performance of the model on the other tasks decreases as the number of considered images decreases. But when the head scores are determined for each PI-dataset, the performance remains constant when at least 10 heads are considered, and decreases slowly when fewer than 10 heads are considered. For the SPAM dataset, we even observe a perfect score of 100% accuracy when considering 30 heads. The boosting effect of pruning images is the most significant for the CoLA dataset.
|CoLA 4 Epochs Ordinary Max||CoLA 4 Epochs MultiDim Max||CoLA 4 Epochs Directed Max|
Pruning the images is beneficial for performance, with a more significant effect for the Ordinary and Directed filtrations. Without pruning, the MultiDim filtration outperforms the others, but the ordinary persistence combined with pruning reaches the same peak performance of 82% in accuracy. Interestingly, to increase the performance of the topological classifier, one should prefer to remove some well-chosen input images, rather than considering more complex and computation-demanding filtrations.
Table 11 shows the results for different symmetry functions and different numbers of fine-tuning epochs. Here, the row “10 heads” corresponds to the performance obtained by the model when trained on images from 134 attention heads (we removed the images from the 10 best heads).
|CoLA 4 Epochs Ordinary Max||CoLA 4 Epochs Ordinary Min||CoLA 4 Epochs Ordinary Mean||CoLA 10 Epochs Ordinary Max||CoLA 20 Epochs Ordinary Max|
The general tendency is a constant performance as the number of considered images increases. The exceptions are the mean symmetry function, where a clear increase in accuracy occurs, and the low accuracy observed when only half of the worst performing heads are considered.
D Data and model specifications
|Type of persistence homology||Examples|
|Ordinary - 0|
|Ordinary - 1|
|MultiDim - 0|
|MultiDim - 1|
|MultiDim - 2|
|Directed - 0|
|Directed - 1|
|Directed - 2|