
Visualizing and Explaining Language Models

During the last decade, Natural Language Processing has become, after Computer Vision, the second field of Artificial Intelligence that was massively changed by the advent of Deep Learning. Regardless of the architecture, the language models of the day need to be able to process or generate text, as well as predict missing words, sentences or relations depending on the task. Due to their black-box nature, such models are difficult to interpret and explain to third parties. Visualization is often the bridge that language model designers use to explain their work, as the coloring of the salient words and phrases, clustering or neuron activations can be used to quickly understand the underlying models. This paper showcases the techniques used in some of the most popular Deep Learning for NLP visualizations, with a special focus on interpretability and explainability.



1 Introduction

Deep Learning (DL) models applied to texts need to cover the morphological, syntactic, semantic and pragmatic layers. Crafting networks that operate on so many levels is a challenging task due to the sparseness of the training data. Such networks have traditionally been called Language Models (LMs) DBLP:conf/sigir/PonteC98. The early iterations were based on statistical models DBLP:conf/sigir/PonteC98, whereas the latest iterations use neural networks and embeddings. Current LMs are trained on large corpora and are generally sparse. Most current popular LMs are based on Transformer architectures DBLP:conf/nips/BrownMRSKDNSSAA20.

The first implementation of a Transformer network DBLP:conf/nips/VaswaniSPUJGKP17 proved that it was possible to design networks that achieve good results for Natural Language Processing (NLP) tasks with a set of multiple sequential attention layers. A Transformer contains a series of self-attention layers that are distributed through its various components. Self-attention is an attention mechanism that computes a representation of a sequence by relating different positions of the same sequence. The Transformer model itself is simple and consists of pairs of encoders and decoders. Encoders encapsulate layers of self-attention coupled with feed-forward layers, whereas decoders encapsulate self-attention layers followed by encoder-decoder attention and feed-forward layers. The attention computation is done in parallel and the results are then combined. The result is termed multi-head attention, and it provides the model with the ability to orchestrate information from different representation subspaces (e.g., multiple weight matrices) at various positions (e.g., different words in a sentence) DBLP:conf/nips/VaswaniSPUJGKP17. Its outputs are fed either to other encoders or into decoders, depending on the architecture. There is no fixed number of encoders and decoders that can be included in this architecture, but they will typically be paired (e.g., 10 encoders and 10 decoders). In newer architectures, encoders and decoders can also be used for different tasks (e.g., encoder for Question Answering, and decoder for Text Comprehension) DBLP:journals/corr/abs-2003-11755. While the model was initially developed for machine translation tasks, it has been tested on multiple domains and was demonstrated to work well.
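The parallel attention computation described above can be illustrated with a minimal NumPy sketch of multi-head self-attention. It is only an illustration under simplifying assumptions: the weights are random rather than trained, there is no masking, bias or layer normalization, and all names (`multi_head_self_attention`, `w_q`, `w_o`, and so on) are chosen for this example rather than taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); every weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(m):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return m.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    # Scaled dot-product attention, computed for all heads at once.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)                     # each row sums to 1
    heads = weights @ v                                    # (heads, seq, d_head)
    # Concatenate the heads and mix them with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2                      # toy sizes
x = rng.normal(size=(seq_len, d_model))                    # stand-in token vectors
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
output, attention = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
```

The returned `attention` array holds one `seq_len` by `seq_len` matrix per head, which is the quantity that attention-map visualizations typically render as a heatmap.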

During the last three years, hundreds of papers and LMs inspired by Transformers were published, the best-known being BERT DBLP:conf/naacl/DevlinCLT19, RoBERTa DBLP:journals/corr/abs-1907-11692, ALBERT DBLP:conf/iclr/LanCGGSS20, XLNet DBLP:conf/nips/YangDYCSL19, DistilBERT DBLP:journals/corr/abs-1910-01108, and Reformer DBLP:conf/iclr/KitaevKL20. Some of the most popular Transformer models are included in the Transformers library, maintained by HuggingFace DBLP:journals/corr/abs-1910-03771.

Many of these models are complex and include significant architectural improvements compared to the early Transformer and BERT models. Explaining their information processing flow and results is therefore difficult, and visualization is a convenient and timely approach. Our survey is focused on visualization techniques used to explain LMs. We investigate two large tool classes: (i) model-agnostic tools that can be used to explain BERT predictions; and (ii) custom visualizations that are focused only on explaining the inner workings of LMs based on neural networks. An early version of this survey was published a year ago, but it was focused only on visualizing Transformer networks BrasoveanuA20. We have since extended the material to include new articles about Transformer visualizations, other types of networks, as well as an extended section about model-agnostic AI libraries that are focused on interpretability and explainability for NLP.

In this survey, we look at the visualization of several types of LMs based on DL networks, review the basic charts and patterns present in them and try to understand the basic methodology that was used to produce these visual representations. The rest of the paper is organized as follows: Section 2 presents the motivation and methodology of this survey. Section 3 showcases the two classes of tools, whereas Section 4 discusses the various findings. The paper concludes with some thoughts on the future of this class of visualizations.

2 Background and Methodology

The need to quickly update NLP models in case of unforeseen events suggests that developers will be well-served by explainable AI and visualization libraries, especially since debugging Transformers is a complex task. Visualizations are particularly important, as they help us debug the various problems that such models exhibit and which can only be discovered through large-scale analyses.

Traditional visualization libraries are based on the classic grammar of graphics philosophy DBLP:books/daglib/0024564, which is focused on the idea that visualizations are compositional by design. They provide various visualization primitives like circles or squares and a set of operations that can be applied on top of these primitives to create more complex shapes or animations. Unfortunately, traditional visualization libraries like D3.js DBLP:journals/tvcg/BostockOH11 or Vega DBLP:journals/tvcg/SatyanarayanMWH17 do not offer specific functions for visualizing feature spaces or neural network layers, nor support for iterative design space exploration DBLP:journals/tvcg/ParkKLCDE18 when designing AI models. This means that, for AI tasks, a lot of the functionality has to be developed from scratch.

When visualizing more complex models like those built with Transformers, we typically need to understand all the facets of the problem, from the data and training procedure, to the input, network layers, weights or the outputs of the neural network. The outputs are the core of explainability, as people will not use the networks in commercial products if they cannot explain how the outputs were obtained in the first place. It is also beneficial to highlight the paths that lead to certain outputs, as this illuminates the features or parts of the models that may need to be changed to achieve the desired results. This can sometimes be accomplished by using model-agnostic tools specifically built for benchmarking or hyperparameter scoring, such as Weights and Biases. We include such tools in our survey if examples of how to use them for visualizing Transformers already exist, either in scientific papers or other types of media posts (Medium posts, GitHub, etc.).

The second big class of visualizations discussed in this article is, naturally, the class of visualizations specifically built around Transformers, either for explaining them (like ExBERT DBLP:conf/acl/HooverSG20), or for explaining certain model-specific attributes (like embeddings or attention maps DBLP:conf/acl/Vig19).

We selected the libraries and visualizations presented here by reviewing the standard Computer Science (CS) publication libraries (e.g., IEEE, ACM, Elsevier, Springer, Wiley), but also online media posts (YouTube, Medium, GitHub and arXiv). In this extremely dynamic research field, some articles might be published on arXiv even up to a year before they are accepted for publication in a traditional conference or journal, time in which they might already garner hundreds of citations. The original BERT article DBLP:conf/naacl/DevlinCLT19 and also one of the first articles that used visualization to explain it DBLP:journals/corr/abs-1906-04341 were cited over a hundred times before being published in conference proceedings. (Footnote: article DBLP:journals/corr/abs-1906-04341 had garnered 149 citations at the moment of the submission, before being published in a conference or journal.)

When testing new models, benchmarking and fine-tuning are the two operations where we might spend the most time, as even if the scores are good, we might want to try different hyperparameter settings (e.g., learning rate, number of epochs, batch size, etc.) DBLP:journals/ijccc/FloreaA19. A hyperparameter sweep (or trial) is a central notion in both hyperparameter optimization and benchmarking. It involves running one or multiple models with different values for their hyperparameters. Since the main goal behind running such sweeps is quite often improving existing models, which is not necessarily related to interpretability and explainability, we decided to include only a minimal number of such libraries here.
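As a rough illustration of what a sweep involves, the sketch below enumerates a small grid of hyperparameter values and ranks the resulting runs. The function `train_and_evaluate` is a hypothetical stand-in that returns a made-up score; in practice it would train and validate a real model, and the libraries discussed in this section would take care of logging and charting the runs.

```python
from itertools import product

def train_and_evaluate(learning_rate, epochs, batch_size):
    """Hypothetical stand-in for a real training run; returns a toy validation score."""
    # The toy score peaks at learning_rate=3e-5, epochs=3, batch_size=32.
    return (1.0
            - abs(learning_rate - 3e-5) * 1e4
            - abs(epochs - 3) * 0.05
            - abs(batch_size - 32) / 640)

# Illustrative search space; the values are assumptions, not recommendations.
search_space = {
    "learning_rate": [1e-5, 3e-5, 5e-5],
    "epochs": [2, 3, 4],
    "batch_size": [16, 32],
}

def run_sweep(space):
    """Run one trial per combination of values and sort trials by score."""
    names = list(space)
    results = []
    for values in product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        results.append((train_and_evaluate(**config), config))
    return sorted(results, key=lambda item: item[0], reverse=True)

results = run_sweep(search_space)
best_score, best_config = results[0]
print(best_config, round(best_score, 3))
```

Each `(score, config)` pair corresponds to one trial, which is the unit that charts such as parallel coordinates plot as a single line across the hyperparameter axes.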


TensorBoard is a specialized dashboard deployed with Google TensorFlow that covers the basic visualization needs for ML experiments, from tracking, computing and visualizing metrics, to model profiling and embeddings. It is not necessarily a good tool for creating custom visualizations or for explaining results, but it can be a good tool for improving accuracy. It is sometimes also used with TensorFlow’s competing libraries like PyTorch or FastAI.


Neptune is an open-source ML benchmarking suite deployed for a variety of collaborative benchmarking tasks, including notebook tracking.

Sacred and Comet.ml are Neptune alternatives that provide basic charting capabilities and dedicated dashboards. Weights and Biases provides perhaps the largest set of visualization and customization capabilities. It comes packed with advanced visualizations that include parallel coordinates DBLP:journals/cse/HeinrichW15, perhaps the best method to navigate hyperparameter sweeps. It is the easiest and the most agile solution to integrate with production code or Jupyter notebooks out of all the ones mentioned here. Ray DBLP:conf/osdi/MoritzNWTLLEYPJ18, a distributed benchmarking framework that contains its own fine-tuning engine called Tune DBLP:journals/corr/abs-1807-05118, is popular for optimizing Transformers.

Due to space limitations, we limit ourselves to discussing only the most interesting visualizations, especially in the model-agnostic visualization section, as otherwise this article could easily become an entire book.

3 Visualizing Language Models

Language models are difficult to train for a multitude of reasons, including (but not limited to) cost, time or carbon footprint DBLP:conf/acl/StrubellGM19. Most of the LMs need to be trained on GPUs or TPU pods for days or weeks. Due to their generalization capabilities, such LMs can reliably estimate the actions for which they have a reasonable number of examples in their training datasets, whereas in cases with fewer examples they might overestimate the predicted actions, therefore inserting some bias DBLP:conf/coling/ShwartzC20. Debugging or retraining such models therefore becomes a necessity, even if the costs of such operations are still high. Visualization is just one of the methods that can help us explain such large LMs, especially since it is often combined with linguistics or statistics. Explaining the results in plain English should be what we are aiming for when we build new LMs, but this may sometimes require additional steps. An interpretation of the results, for example, would generally depend on the target domain (e.g., medicine, law, etc.) DBLP:journals/nca/Vellido20, as in some cases a complex reasoning process (e.g., compliance with local or international regulations) may need to be applied before selecting the right words for an explanation. The visualization will essentially highlight the intermediary steps (e.g., the components that lead to a side effect for medication or a legal aspect in a certain jurisdiction) required to create a basic interpretation, and therefore it is often kept minimal, and the visualized features or processes are carefully selected. If it is easy to navigate the various information pathways and understand the results and their interpretations (e.g., where they may lead us), it may be safe to call the respective visualizations explainable.

Using visualization to explain AI processes is an expanding research field. The main idea behind AI user interfaces should be to augment and expand user capabilities, rather than replace intelligence DBLP:journals/pnas/Heer19. While not necessarily needed to understand the next section, several recent surveys about visualizations and DL can help provide additional context to interested readers. We particularly recommend the following: the introduction on how Convolutional Neural Networks "see" the world from DBLP:journals/mfc/QinYLC18, the discussion on visual interpretability from DBLP:journals/jzusc/ZhangZ18, and the discussion on the importance of visualizing features from DBLP:series/lncs/NguyenYC19.

3.1 Model-Agnostic Explainable AI Tools

Explainable AI (XAI) is the key to enterprise adoption of the current wave of AI technologies, from vision to NLP and symbolic computation. (Footnote: explainable points to the idea of describing or explaining in an intuitive manner, via charts or tables, the prediction of an algorithm.) An early XAI survey DBLP:journals/tvcg/HohmanKPC19 describes methods through which visualizations can be turned into explanations for the AI models and goes on to define the terminology of the field. Early XAI libraries focused on visualizing ML features, whereas recent libraries are focused on visualizing embeddings, attention maps or various neural network layers DBLP:series/lncs/11700.

Traditionally, the first step towards transparency was to describe the contribution of each feature to the final result DBLP:journals/jmlr/GuyonE03. This often led to partial explanations of the results, since if the models themselves were black-box, knowing the names of the features was not in itself enough. Output is definitely the most important part that we would like to explain, but not the only one. To create full explanations, we need to be able to explain the entire process from its input, to its various transformations (e.g., layers), training process and output. Explanations also need to be able to reflect state changes. For example, when computing Shapley values, feature contributions are combined, and then a score that signifies the feature importance within that set of features is generated. If features are added or removed from this bundle, the Shapley value for a particular feature will change accordingly. This dynamic nature of explainability is rarely explained, but it is one of the reasons why visualizations in particular are a good fit for creating explanations in the first place.
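The dependence of a Shapley value on the set of features under consideration can be made concrete with a small exact computation. The sketch below enumerates all coalitions, which is only feasible for a handful of features; the `value` function is a hypothetical toy sentiment score invented for this example, not a real model, and libraries such as SHAP rely on approximations instead of full enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values by enumerating every coalition of the other features."""
    n = len(features)
    phi = {}
    for feature in features:
        others = [g for g in features if g != feature]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal contribution of the feature to this coalition.
                total += weight * (value(coalition | {feature}) - value(coalition))
        phi[feature] = total
    return phi

def value(coalition):
    """Hypothetical toy sentiment score for a set of active word features."""
    score = 0.0
    if "good" in coalition:
        score += 1.0
    if "not" in coalition:
        score -= 0.2
    if "good" in coalition and "not" in coalition:
        score -= 1.6  # negation interacts with "good"
    return score

two = shapley_values(["good", "movie"], value)
three = shapley_values(["good", "movie", "not"], value)
print(two)
print(three)
```

In this toy setting the contribution attributed to "good" shrinks once "not" joins the feature set, because the interaction between the two words is shared between them. The values in each run still sum to the score of the full feature set.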

Some early model-agnostic XAI libraries that were applied to NLP and Transformers visualizations include LIME DBLP:conf/kdd/Ribeiro0G16 and SHapley Additive exPlanations (SHAP) DBLP:conf/nips/LundbergL17. The latter was introduced to unify multiple explanation methods into a single model for interpreting predictions. Both SHAP and LIME can be used with classical ML libraries like scikit-learn and XGBoost, as well as with modern DL libraries like PyTorch or Keras / TensorFlow. SHAP provides visualizations for summary and dependency plots. Another XAI alternative to SHAP and LIME, ELI5 DBLP:conf/acl/FanJPGWA19, is currently routinely used for explaining BERT predictions, and was found to be more secure in case of adversarial network attacks DBLP:conf/aies/SlackHJSL20.
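To give a feel for how perturbation-based explainers of this kind operate, the following sketch fits a weighted linear surrogate around a single sentence. It is a simplified illustration of the local-surrogate idea behind LIME and does not use the `lime` package: it enumerates every word mask instead of sampling, uses an assumed exponential kernel, and explains `black_box`, a hypothetical classifier written only for this example.

```python
import numpy as np
from itertools import product

def black_box(words):
    """Hypothetical classifier: returns a positive-sentiment probability."""
    score = 0.0
    score += 2.0 if "great" in words else 0.0
    score -= 1.5 if "boring" in words else 0.0
    return 1.0 / (1.0 + np.exp(-score))

def explain(sentence, predict, kernel_width=0.75):
    """Fit a weighted linear surrogate on word-presence masks of one sentence."""
    words = sentence.split()
    n = len(words)
    # Every way of keeping (1) or dropping (0) each word of the sentence.
    masks = np.array(list(product([0, 1], repeat=n)), dtype=float)
    preds = np.array([predict([w for w, keep in zip(words, mask) if keep])
                      for mask in masks])
    # Perturbations closer to the original sentence receive larger weights.
    distance = 1.0 - masks.mean(axis=1)           # fraction of words removed
    sample_weight = np.exp(-(distance ** 2) / kernel_width ** 2)
    design = np.hstack([masks, np.ones((len(masks), 1))])   # intercept column
    sw = np.sqrt(sample_weight)[:, None]
    # Weighted least squares via ordinary least squares on rescaled rows.
    coef, *_ = np.linalg.lstsq(design * sw, preds * sw[:, 0], rcond=None)
    return dict(zip(words, coef[:n]))

weights = explain("a great but boring film", black_box)
for word, weight in weights.items():
    print(f"{word:>7s} {weight:+.3f}")
```

The signed coefficients are the kind of quantity that such libraries usually render as bar charts or as colored word highlights.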

The visualizations created with LIME and SHAP are typically restricted to classic charts (e.g., lines, bars or word clouds). The summary plots or interaction charts DBLP:conf/nips/LundbergL17 from SHAP are relatively easy to understand, whereas the more complex force plot charts like feature impact lundberg2018explainable are not necessarily easy to use, as they require a certain learning time. While the feature impact chart simply plots the expected feature impact with red (features with positive contribution to the prediction result) or blue (features with a likely negative contribution to the result) colors and should in theory be an easy-to-understand chart, there are no direct (e.g., in chart via a legend) explanations on how to interpret the start or end values, or, in some cases, what the indicators placed on top of various components mean. The interpretation of such force plot charts is generally missing and people need to read additional documentation to understand the results. This is far from ideal, as, in our opinion, visualizations need to be self-explanatory.

It can be argued that explainability libraries like SHAP or LIME tend to focus on highlighting correlations or statistical effects rather than features, and are, therefore, less reliable than interpretable models which only showcase a list of features or algorithms that contributed to the results. We consider auditing to encompass both sides of the problem, as interpretability and explainability are the key towards understanding and clearing such LMs for deployment in real-world products. Keeping this in mind, we think that focusing on neural network visualization for NLP can only help in this process, as visualizations can help process, select and highlight the most important features included in such models.

Both SHAP and LIME were proven to be easily fooled with adversarial attacks DBLP:conf/aies/SlackHJSL20. The idea of deploying biased classifiers for tasks like credit rating, recommendation or search ranking sounds a bit counterintuitive because a single classifier should not be able to do much harm. However, given the fact that such large models typically end up being ensembles, one single classifier can actually lead to severe damage including wrong predictions, different sets of biases and ultimately even different outputs than the ones typically expected from the respective LMs. Not being able to correctly audit such models (e.g., investigate their output and the features that have contributed the most to it) can lead to problems with clients and regulatory agencies.

The list of attacks that can be perpetrated using LMs is extended every month, and therefore models may need to be periodically tested to assess their suitability for certain tasks. Some of the most robust attacks include: creating token sequences that act like universal adversarial triggers on specific target predictions when concatenated to any input from a dataset DBLP:conf/emnlp/WallaceFKGS19; training data extraction attacks in which text sequences like public information or code are extracted from the training corpora and used to attack the trained language model DBLP:journals/corr/abs-2012-07805; spelling attacks in which random spellings for well-known words are generated through modifying the gradients during training DBLP:journals/corr/abs-2003-04985; hotflip attacks in which a gradient-based embeddings swap is performed to change classification results DBLP:conf/acl/EbrahimiRLD18; or even textfooler attacks DBLP:conf/aaai/JinJZS20, in which multiple attributes like embeddings, part-of-speech matches or cosine similarities are used to perform a counter-fitted embeddings swap to fool untargeted classifiers and entailment relations. It is also possible to reuse part of the code of these attacks to create new attacks if frameworks like TextAttack DBLP:conf/emnlp/MorrisLYGJQ20 or OpenAttack DBLP:journals/corr/abs-2009-09191 are used. Sometimes these kinds of adversarial attacks can also be used to improve results, as demonstrated by improving aspect-based sentiment analysis tasks by creating artificial sentences DBLP:conf/icpr/KarimiR020 or by using various BERT attribution patterns (e.g., pruned self-attention heads from a certain task) as adversaries DBLP:journals/corr/abs-2004-11207. To understand these attacks, visualization can be a useful tool. For example, Chen DBLP:journals/corr/abs-2006-01043 showcases three types of attacks perpetrated at character, word and sentence level using visualizations built around various metrics like accuracy and successful attack ratios. The Interpret framework DBLP:conf/emnlp/WallaceTWSGS19 uses color highlights based on saliency maps to showcase defenses against hotflip and untargeted classification attacks. A later publication then shows how almost all interpretations built with Interpret can be manipulated through gradient attacks (e.g., using some large gradients for irrelevant words) DBLP:conf/emnlp/WangTW020. However, most of the visualization efforts have been focused on explaining the various neural network activations from Transformer networks, rather than on the various attacks that can fool Transformers; the visualization of such attacks is therefore a relatively nascent area.

An alternative approach to such explainability libraries that can easily pick up wrong signals could be to simply use models designed specifically to be interpretable. Neural Additive Models (NAMs) combine the features of classic DNNs with the interpretability approach advocated by Generalized Additive Models (GAMs) DBLP:journals/corr/abs-2004-13912. However, since such models are quite recent, their applicability to NLP has not yet been fully explored.

Interpretability and explainability are often used interchangeably. It has to be noted that quite often, interpretability has a domain-specific component DBLP:journals/nca/Vellido20, whereas explainability is a more general term. Explainability is the term preferred by Information Visualization designers and researchers, whereas interpretability is generally the term that is preferred by ML researchers, statisticians and mathematicians.

Many other explainable AI libraries use Shapley values for computing feature importance. However, in many cases we were only barely able to discover mentions of their usage for NLP (e.g., DeepExplain), and therefore we decided not to include them in this survey.

3.2 Visualizing Recurrent Neural Networks for NLP

When visualizing LMs, it is best to start with the language resources used for their creation, from corpora to embeddings. To uncover biases in such large models we need to study gender differences, disciplines, languages, cultural context or regional and diachronic variations. A good method to include such information and compare resources for language variation is showcased in Fankhauser’s work DBLP:conf/lrec/FankhauserKT14. It uses grids, heatmaps and word clouds to provide quick access to large amounts of data about English dialects. Another work uses scatter plots to visualize how and why large corpora differ DBLP:conf/acl/Kessler17. Similar methods have later been used for visualizing large-scale embeddings.

Karpathy DBLP:journals/corr/KarpathyJL15 proposed a method through which to visually interpret the results of Recurrent Neural Networks (RNNs), Long Short-Term Memories (LSTMs) and Gated Recurrent Units (GRUs). He suggests that using interpretable activations can help navigate longer texts, whereas saturation plots would help showcase the gated units statistics.

Topic | Visualization Subject; Chart Type
a) Special topics
  • next word DBLP:conf/acl/LuoJBG19 | syntactic heights; head attentions; parallel lines
  • emergence of units | activations of cells and gates; line charts; connectivity charts
b) Hidden states
  • visualizations of representations DBLP:conf/naacl/LiCHJ16 | modification and negation; clause composition; first-derivative saliency; saliency bars; saliency grids
  • hidden states semantics DBLP:conf/visualization/SawatzkyBP19 | Predictive Semantic Encodings; performance metrics; PSE charts; bar charts
  • activation of RNNs DBLP:journals/corr/KarpathyJL15 | cells with interpretable; saturation plots; bar charts; overlap charts
  • hidden states LSTMVis DBLP:journals/tvcg/StrobeltGPR18 | hidden state patterns; hidden pathways; tables with cell activation
  • hidden states RNNVis DBLP:conf/ieeevast/MingCZLCSQ17 | hidden state; control panel; glyph-based chart; state clusters; word clusters
  • hidden states ActiVis DBLP:journals/tvcg/KahngAKC18 | neuron activation; instance selection; model overview
c) Graph Convolution
  • inductive text classification DBLP:conf/acl/ZhangYCWWW20 | performance metrics; attention map; line charts
  • unsupervised domain adaptation DBLP:conf/www/WuP0CZ20 | performance metrics; line charts
  • GCN with label propagation DBLP:journals/corr/abs-2002-06755 | node embeddings; line charts
Table 1: Articles focused on explaining RNN LMs through visualizations.

Around the same time, an LM visualization survey DBLP:conf/naacl/LiCHJ16 noted that classic Computer Vision visualization was focused on inverting representations, back-propagation and generating images from sketches, all techniques that work well for images. In NLP, however, it is important to focus on important keywords, composition and dimensional locality, especially since many of the words will depend on the context DBLP:conf/naacl/LiCHJ16. Important models will be able to capture this kind of information, and therefore it should be present in visualizations. The early saliency heatmaps clearly showcased these aspects for LSTM and Bi-Directional LSTMs DBLP:journals/tvcg/StrobeltGPR18. By later adapting the ideas of first-order saliency from Computer Vision, researchers were able to highlight intensification and negation, as well as differences between two sequences at consecutive time steps. Their saliency heatmap for SEQ2SEQ auto-encoders also works well for predicting corresponding tokens at each time step DBLP:conf/acl/LuoJBG19.
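First-derivative saliency can be sketched in a few lines: the saliency of a token is taken here as the magnitude of the gradient of the model score with respect to that token's embedding. The example below is an assumption-laden toy, with random embeddings and a hypothetical one-layer scorer whose gradient is written out by hand; real systems would obtain the gradient from an automatic differentiation framework.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "film", "was", "not", "good"]
dim, hidden = 4, 6
# Hypothetical toy parameters; a trained model would supply these.
embeddings = {word: rng.normal(size=dim) for word in vocab}
W = rng.normal(size=(hidden, dim))
v = rng.normal(size=hidden)

def score(vectors):
    """Toy differentiable scorer: sum over tokens of v . tanh(W e_t)."""
    return sum(float(v @ np.tanh(W @ e)) for e in vectors)

def saliency(tokens):
    """Norm of d(score)/d(e_t) for every token t (first-derivative saliency)."""
    result = []
    for token in tokens:
        e = embeddings[token]
        # Analytic gradient of v . tanh(W e) with respect to e.
        grad = W.T @ (v * (1.0 - np.tanh(W @ e) ** 2))
        result.append((token, float(np.linalg.norm(grad))))
    return result

tokens = ["the", "film", "was", "not", "good"]
scores = saliency(tokens)
top = max(s for _, s in scores)
for token, s in scores:
    # Crude text "heatmap": longer bars mean higher saliency.
    print(f"{token:>5s} {'#' * int(round(10 * s / top))}")
```

A saliency heatmap replaces the text bars with color intensity over the tokens, usually with one row per class or time step.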

The main characteristic that connects these papers, as can easily be observed in Table 1, regardless of the number of visualizations included in them, is the fact that they are focused on the interpretability and explainability of NLP models through visualization. We have analyzed the following characteristics:

  • Topic - the main topic of the paper (e.g., attention, representation, information probing) followed by the papers in which this topic is addressed;

  • Visualization Subject - since visualizations included in these papers were focused on a large set of subjects from Transformer components (e.g., attention heads), to correlation between tasks (e.g., via Pearson correlation charts) or performance (e.g., accuracy or other metrics represented via line charts), we have decided to extract all these in a separate column to understand what kind of charts we might be interested in creating when exploring a certain topic.

  • Chart Type - includes the various types of visual metaphors used for rendering the chosen subjects. Most of the chart types are classic (e.g., line, bar chart, t-SNE), very few being rebranded (e.g., attention maps are heatmaps) or actually new (e.g., comparative attention graphs). The chart names need to give us a clear idea of what they represent.

A large class of visualizations is dedicated to the activation of neurons and the representation of hidden states, as it can easily be seen from Table 1.

LSTMVis DBLP:journals/tvcg/StrobeltGPR18 uses a large grid of sentences for matching the various state patterns that are then expanded through additional views in the same interface. The key innovation of visualizing hidden state changes keeps the focus on the right keywords, whereas the match views help illuminate particular cases. Many other visualizations similar to LSTMVis (e.g., RNNVis DBLP:conf/ieeevast/MingCZLCSQ17 or ActiVis DBLP:journals/tvcg/KahngAKC18) follow the same template: a control panel is used for selecting the phrase or sentence, a middle view is focused on the word clusters or neuron activations, whereas the last view is typically a matrix view with highlighted cells which showcases the important words that are featured in the activated pathways. Such integrated views offer us holistic views of what these models can accomplish, except for the fact that they rarely also focus on the corpora that were used for training. In time, the visualization designers started to include this information as well, as we will observe in the next section.

We considered that Graph Convolutional Networks are also worth exploring, but since there are entire libraries dedicated to this task which are not necessarily model-agnostic (e.g., PyTorch Geometric DBLP:journals/corr/abs-1903-02428) we have limited ourselves to including some papers that offer some classic visualizations that are typically included in such libraries (e.g., node embeddings, performance).

3.3 Visualizing Transformers for NLP

No other type of neural network since the days of Kohonen’s Self-Organizing Maps DBLP:books/sp/Kohonen95 or Mandelbrot’s fractals DBLP:books/fm/Mandelbrot77 has led to such an increased demand for custom visualizations as the Transformers. When selecting Transformer visualization papers, we decided to focus on the most important topics related to interpretability and explanation. We have therefore eliminated papers that used only classic charts (e.g., bars, lines, pies). We decided to focus on the works that tried to visualize as many aspects of Transformer models as possible, from attention maps, to structural or informational probing, neural network layers, and multilingualism.

Many Transformer visualizations are focused on attention. While the attention mechanism is indeed important for the Transformer architecture, and it improves results for NLP tasks, attention weights are not necessarily easy to interpret if sophisticated encoders are used DBLP:conf/naacl/JainW19. Due to this fact, it is often not easy to test various explanations by simply modifying the weights and verifying if the outputs are also changed as a result of this. Alternative theories suggest that simply looking at information flows through such models is not enough, and that attention should only be used as explanation if certain conditions are met, e.g., if the weight distributions found via adversarial training do not perform well DBLP:conf/emnlp/WiegreffeP19. Since attention is central to Transformer models, many visualizations are rightly focused on this topic. The fact that such visualizations capture the dynamic nature of the output is not necessarily sufficient to consider them explanations. A good explanation needs to highlight the reasoning chain that led to the particular output. This is the main reason why we have mainly looked for those visualizations that focus on multiple aspects of the network in this work.
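The kind of test mentioned above, modifying attention weights and checking whether the output moves, can be sketched with a toy attention-pooling classifier. Everything in this example is hypothetical: the hidden states, query and output weights are random, and shuffling the weights is just one simple perturbation chosen for illustration, not the procedure of any cited paper.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, dim = 6, 4
hidden = rng.normal(size=(seq_len, dim))   # hypothetical encoder states, one per token
query = rng.normal(size=dim)               # hypothetical attention query
w_out = rng.normal(size=dim)               # hypothetical output layer

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict(attention):
    """Attention-weighted pooling of the hidden states followed by a sigmoid output."""
    context = attention @ hidden
    return 1.0 / (1.0 + np.exp(-(w_out @ context)))

attention = softmax(hidden @ query)        # original attention distribution
baseline = predict(attention)

# Shuffle the attention weights across tokens and record how far the output moves.
changes = [abs(predict(rng.permutation(attention)) - baseline) for _ in range(100)]
median_change = float(np.median(changes))
print(f"baseline={baseline:.3f} median |change|={median_change:.3f}")
```

If the output barely moves under such perturbations, the original attention map is a questionable explanation for that prediction; a large change is consistent with the attention weights mattering, although it is not sufficient on its own to establish a faithful explanation.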

Topic Visualization Subject Chart Type
a) Attention
attention explanation
DBLP:conf/naacl/JainW19 DBLP:conf/emnlp/WiegreffeP19
feature importance correlation
adversarial attention
performance metrics
Kendall rank statistics
adversarial charts
multiple line charts
attention flow
DBLP:conf/acl/AbnarZ20 DBLP:journals/tvcg/DeRoseWB21
raw attention
raw attention map
comparative attention flows
attention graph
comparative attention graphs
multi-head self-attention
attention for rare words
dependency scores
active heads
importance charts
bar charts
line charts
b) Hidden states
intermediate layers DBLP:journals/corr/abs-2002-04815
intermediate layers clustering
information interactions interpretation DBLP:journals/corr/abs-2004-11207
scoring attention
information flow for tokens
evaluation accuracy
attention heads correlation
attribution graphs
line charts
Pearson correlation charts
causal mediation analysis DBLP:journals/corr/abs-2004-12265
indirect effects
effects comparison
line chart
attention heads
evolution of representations
token changes and influences
distances between layers
token occurrences
line charts
line charts
t-SNE clustering
c) Probing
structural probes
summary statistics
layer-wise performance
predictions probing
bar chart
bar distribution chart
multiple bar charts
multilingual probes
DBLP:journals/corr/abs-2005-00396 DBLP:conf/conll/EgerDG20
probing task
positional embeddings
performance metrics
stability of training size
cosine similarity matrices
line charts
bar charts
information-theoretic probing of classifiers DBLP:journals/corr/abs-2003-12298
coding components
performance metrics
bar charts
line charts
psycholinguistics tests DBLP:conf/acl/GauthierHWQL20
comparing predictions
performance metrics
test content
distribution charts
line charts
Table 2: Articles focused on explaining Transformer topics through visualizations.

The recent success of Transformers has pushed Transformer-based models to the top of the leaderboards for many NLP tasks. BERT visualizations have sought to explain these results by highlighting: (i) the role of embeddings and relational dependencies within the Transformer learning processes DBLP:conf/nips/ReifYWVCPK19; (ii) the role of attention during pre-training or training (e.g., DBLP:conf/iclr/SuZCLLWD20 or DBLP:conf/acl/Vig19); or (iii) the importance of various linguistic phenomena encoded in its language model, like direct objects, noun modifiers, or possessive pronouns DBLP:journals/corr/abs-1906-04341.

Current XAI methods for Transformer models have further developed and supported the idea that understanding the linguistic information encoded in the resulting models is key to understanding their good performance in NLP tasks. Probing tasks DBLP:conf/acl/BaroniBLKC18 are simple classification problems focused on linguistic features, designed to help explore embeddings and LMs. For example, by using structural probing DBLP:conf/naacl/HewittM19, structured perceptron parsers DBLP:journals/corr/abs-2005-01641 or visualization (e.g., as demonstrated through BERT embeddings and attention layers visualizations like those from DBLP:journals/corr/abs-1906-04341 and DBLP:conf/acl/Vig19), one should be able to understand what kind of linguistic information is encoded into a Transformer model, but also what has changed since previous runs.

We have identified two large classes of Transformer visualizations:

  • Focused - visualizations centered on a single subject like attention. The papers themselves might present multiple visualizations, but these visualizations are not integrated into a single tool.

  • Holistic - visualizations or systems which seek to explain the entire Transformer model or lifecycle.

3.3.1 Focused Transformer Visualizations

The most important papers dedicated to focused visualizations are summarized in Table 2. This table follows the same conventions as Table 1.

We can clearly distinguish several large topics in this group of focused papers: the relation between attention and model outputs (e.g., especially in DBLP:conf/naacl/JainW19, DBLP:conf/emnlp/WiegreffeP19, DBLP:conf/acl/AbnarZ20, DBLP:conf/acl/VoitaTMST19), the analysis of captured linguistic information via probing (e.g., in DBLP:conf/emnlp/VoitaST19a, DBLP:conf/acl/TenneyDP19), the interpretation of information interaction (e.g., in DBLP:journals/corr/abs-2004-11207, DBLP:conf/emnlp/VoitaST19a), and multilingualism (e.g., in DBLP:conf/acl/TenneyDP19 or DBLP:conf/conll/EgerDG20). In fact, if we look closely at Table 2, we can distinguish three large classes of subjects: a) attention; b) hidden states; and c) structural or information probing. Papers that work on similar topics also tend to use the same kind of visual metaphors. This sometimes happens due to the replication of a previous study (e.g., DBLP:conf/emnlp/WiegreffeP19 replicates the experiments from DBLP:conf/naacl/JainW19 to prove that attention weights do not explain everything), whereas in other cases it happens because there is no need for more complicated visual metaphors (e.g., line charts are used in more than half of the papers to represent performance). Besides the widespread use of heatmaps to represent attention maps, one chart type that deserves to be highlighted in this category is the attention graph DBLP:journals/corr/abs-2004-11207, which tracks the information flow between the input tokens for a given prediction.
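To make the idea of tracking information flow between input tokens more concrete, the following sketch propagates attention across layers in the style of attention rollout. It rests on several assumptions that are not stated in the text above: each layer is represented by a head-averaged, row-normalized attention matrix, the residual connection is approximated by mixing each matrix equally with the identity, and the two toy layers are hypothetical values. It is a simplified stand-in rather than the procedure used to build the attention graphs in the cited papers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def attention_rollout(layers):
    """Combine per-layer attention matrices, lowest layer first.

    Each matrix is assumed to be head-averaged with rows summing to 1.
    """
    n = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        # Assumption: equal mix with the identity models the residual path.
        mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0)
                  for j in range(n)]
                 for i in range(n)]
        rollout = matmul(mixed, rollout)
    return rollout


# Hypothetical attention matrices for three tokens and two layers.
layer1 = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
layer2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.0, 0.5]]

rolled = attention_rollout([layer1, layer2])
for row in rolled:
    print([round(w, 3) for w in row])
```

Each row of the result describes how much every input token contributes to the corresponding top-layer position. Keeping only the entries above a chosen threshold gives a set of edges that can be drawn as a graph over the input tokens.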

One of the key methods used for explaining the output of LMs is called probing DBLP:conf/repeval/EttingerER16, which is used to analyze the linguistic information encoded in a fixed-length vector. Recent iterations like structural probes evaluate whether syntax trees are embedded in a network’s word representation space. Identifying such linear transformations is evidence that entire syntax trees are embedded in the LM’s vector geometry. This method works well for limited cases in which distances between words are known. Critics of probing argue that differences in probing accuracy between the various classifiers essentially render them unusable, as they fail to distinguish between different representations; e.g., two LMs can end up having different linguistic representations even if they are based on the same initial BERT model. In such scenarios, one cannot compare the accuracy of the classifiers used to predict their labels. To address such cases, an information-theoretic probing method based on minimum description length was recently proposed DBLP:conf/emnlp/VoitaT20. The basic idea is that instead of predicting labels, the probe transmits data (a description), which is then evaluated based on the returned description’s length. Such probes can be implemented on top of the classic structural probes, and are fairly stable. According to Pilault DBLP:journals/corr/abs-2002-09084, classic structural probes may not be enough for complex tasks like summarization, simply because increasing the number of random encoders seems to provide significant performance boosts. This suggests that information-theoretic probes with minimum description length may be the better probes for this task; however, this remains to be demonstrated by future experiments, as Voita’s model was published after Pilault’s article.
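The description-length idea can be sketched with online (prequential) coding, one common way of estimating how many bits are needed to transmit labels given representations. In the sketch below, the first block of labels is sent with a uniform code, and each later block is encoded with a probe trained on everything sent so far. The count-based probe, the block boundaries and the toy data are all assumptions made for illustration; an actual probing study would train a neural classifier on real LM representations.

```python
import math
from collections import Counter, defaultdict


def online_codelength(features, labels, num_classes, block_ends):
    """Bits needed to transmit labels block by block (prequential coding)."""
    if block_ends[-1] != len(labels) or len(features) != len(labels):
        raise ValueError("blocks must cover the whole dataset")
    # The first block is sent with a uniform code over the label set.
    bits = block_ends[0] * math.log2(num_classes)
    for start, end in zip(block_ends, block_ends[1:]):
        # Toy probe: label counts per discrete feature value, fitted on
        # all examples that have already been transmitted.
        counts = defaultdict(Counter)
        for x, y in zip(features[:start], labels[:start]):
            counts[x][y] += 1
        for x, y in zip(features[start:end], labels[start:end]):
            seen = counts[x]
            # Add-one smoothing keeps every probability above zero.
            p = (seen[y] + 1) / (sum(seen.values()) + num_classes)
            bits -= math.log2(p)
    return bits


# Hypothetical data: 16 binary labels and two candidate "representations".
labels = [0, 1] * 8
informative = list(labels)   # the feature fully determines the label
uninformative = [0] * 16     # a constant feature carries no information
blocks = [2, 4, 8, 16]

good_bits = online_codelength(informative, labels, 2, blocks)
bad_bits = online_codelength(uninformative, labels, 2, blocks)
print(f"informative: {good_bits:.2f} bits, uninformative: {bad_bits:.2f} bits")
```

A representation that makes the labels easy to predict from few examples yields a shorter code. This is why description length can separate representations that reach similar probing accuracy once enough training data is available.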

3.3.2 Holistic Transformer Visualizations

Several of the groups discussed in Section 3.2 tried to expand their visualizations of hidden states to encompass the entirety of the model, but they rarely included the provenance of the corpora or an easy method to navigate it. Because Transformer visualizations, by contrast, often cover the whole lifecycle (e.g., including corpora with known provenance), we decided to present them in a separate subsection.

Some of the most interesting tools or papers included in the category of holistic visualizations are compared in Table 3. These visualization systems typically integrate most of the components of a Transformer and provide detailed summaries of them. We have examined two large classes of attributes:

  • Components represents the various components of the neural networks: from corpus, to embeddings, positional heads, attention maps or outputs.

  • Summary includes the various types of views that offer us information about the state of a neuron or a layer, as well as overviews, statistics or details about the various errors encountered. Statistics might include different types of information: from correlations between layers or neurons to statistical analyses of the results. The errors column represents any error analysis method through which we can highlight where a particular error comes from (e.g., corpus, training procedure, layer, etc). While it can be argued that neuron or layer views should be included in the components section, the way these views are currently implemented suggests they are rather summaries, as neurons or layers can have different states.

We have decided against including chart types in Table 3, as each visualization suite or paper included some novel visualization types besides attention maps (heatmaps), parallel coordinates or line and bar charts.

Article | Components (C, E, H, A, out) | Summary (err, neu, lay, ove, sta)
BertViz DBLP:journals/corr/abs-1904-02679
Clark DBLP:journals/corr/abs-1906-04341
VisBERT DBLP:conf/www/AkenWLG20
ExBERT DBLP:conf/acl/HooverSG20
AttViz DBLP:journals/corr/abs-2005-05716
Vector Norms DBLP:journals/corr/abs-2004-10102
Dictionary Learning DBLP:journals/corr/abs-2103-15949

Table 3: Comparison of holistic Transformer visualizations. Legend: Corpus (C); Embeddings (E); Heads (H); Attention Maps (A); outputs (out); errors (err); neuron (neu); layers (lay); overview (ove); statistics (sta).

In our view, none of the examined visualization systems has yet managed to examine all the facets of the Transformers. This is perhaps because this area is relatively new and there is no consensus on what needs to be visualized. While it is quite obvious that individual neurons or attention maps (regardless of whether they are averaged or not) are useful and best visualized, the same cannot be said about the training corpora today, as only a few systems considered this aspect (e.g., DBLP:journals/corr/abs-2005-05716 and DBLP:conf/acl/HooverSG20). This is not ideal, as many errors might simply come from a bad corpus without researchers being aware of them DBLP:conf/lrec/00020KWN18. Errors themselves are only seriously discussed in a single publication DBLP:journals/corr/abs-1906-04341. ExBERT DBLP:conf/acl/HooverSG20 and AttViz DBLP:journals/corr/abs-2005-05716 deserve a special mention here, as they combine different views on the corpus, embeddings and attention maps to provide a holistic image of a Transformer model.

A study that looks at the similarity and stability of neural representations in LMs and brains DBLP:conf/aaai/HeijdenAS20 shows that combining predictive modelling with Representation Similarity Analysis (RSA) techniques can yield promising results. This article deserves a special mention, as it can be included in both focused and holistic visualizations. Its visualizations are rather basic in terms of design, but they contain many insights. For example, one of the tables showcases the RSA results for various layers of multiple models like BERT, ELMo and others. These kinds of analyses are rather new, and we hope they will become more common in the coming years, as they might help us clarify which LMs are more similar to the human brain.
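The core computation behind RSA can be sketched in a few lines: for each system, the pairwise dissimilarities between the representations of the same stimuli are collected, and the two resulting dissimilarity profiles are then correlated. The sketch below assumes Euclidean distance and Pearson correlation for brevity (rank correlations are also commonly used), and the two small representation sets are hypothetical; it does not reproduce the analysis of the cited study.

```python
import math


def pairwise_distances(vectors):
    """Upper triangle of the dissimilarity matrix (Euclidean distance)."""
    n = len(vectors)
    return [math.dist(vectors[i], vectors[j])
            for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def rsa_score(reps_a, reps_b):
    """Correlate the dissimilarity structure of two representation sets."""
    return pearson(pairwise_distances(reps_a), pairwise_distances(reps_b))


# Hypothetical representations of the same four stimuli in two systems.
# The spaces may have different dimensionality, since only distances matter.
layer_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
layer_b = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 4.0, 0.0), (6.0, 2.0, 0.0)]

score = rsa_score(layer_a, layer_b)
print(f"RSA score: {score:.3f}")
```

Because the comparison is made on distances rather than on raw vectors, the same procedure can relate model layers of different widths, or model layers and brain recordings, as long as both are measured on the same stimuli.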

3.3.3 Vision Transformers

Transformers have been applied in numerous fields due to their flexibility. Perhaps the most important application is the unification of vision and NLP. In the last two years, this has seen explosive growth. It would be impossible to cover all the various models in this article, and therefore we have selected only a few interesting topics related to this expansion. The one that comes to mind first is Visual Question Answering (VQA) DBLP:conf/nips/Gan0LZ0020, as this is perhaps the task that multimodal researchers have been trying to solve for decades. Vision Transformers offer an elegant, but expensive solution. Table 4 provides a short summary of interesting papers in this research direction.

Domain Topic Chart Type
language and vision
DBLP:conf/eccv/CaoGCY0020 DBLP:journals/corr/abs-2103-00112
modality importance
layer-level importance
attention heatmaps
line charts
feature maps
biomedical DBLP:conf/bibm/LiW020 attention maps head attention maps
DBLP:conf/nips/Gan0LZ0020 DBLP:conf/icml/KimSK21
multimodal alignment
training curves
visual alignment charts
line charts
Table 4: Articles focused on explaining Language and Vision (LAV) models through visualizations.

4 Discussion

Multilingual models based on Transformer architectures are error-prone since they are trained on large collections of texts. Due to this training process, the extraction of semantics from language data often captures implicit biases hidden in languages themselves or contextual biases triggered by adaptation to various niches DBLP:journals/corr/abs-1909-13705.

Besides the obvious questions of interpretability (e.g., which features are most important for the prediction, and what are the best-performing models for creating ensembles?) and explainability (e.g., what linguistic information is actually encoded in the model, and why do random encoders perform better for summarization?), another important question is security (e.g., are these models stable enough, i.e., do they obtain the same result in any circumstance?). Through adversarial training (Wallace, 2019) it is possible to remove some procedural biases from model-agnostic explainability libraries like LIME or SHAP, as well as well-known attack vectors (e.g., trigger words that might lead to a different language model output) DBLP:conf/aies/SlackHJSL20. It can be argued that probing tasks are really only good for simple tasks like NER or POS and for verifying if certain structures were encoded in an LM, but not for more complex tasks like summarization or word-level polysemy disambiguation DBLP:journals/corr/abs-2103-15949. Dictionary activation is a method that uses visualization heavily to peek into the activation of words and their senses. Such a method helps uncover both the various layers on which certain word senses are actually learned, and the contexts that might trigger these new senses.

There are several available options for understanding the inner workings of Transformer networks, as well as the results they produce. Each of them has its advantages and shortcomings, which are briefly discussed below.

Model-agnostic tools like the XAI libraries or the hyperparameter optimization and benchmarking tools can be used with a variety of networks. Due to this model-agnosticism, the visualization skills learned while debugging a certain network (e.g., a Convolutional Neural Network) will be easily transferred to debugging and optimizing other networks (e.g., Recurrent Neural Networks). By building a transferable set of skills, users might be more reluctant to try model-specific approaches, like those from the second category discussed in this paper. Some of these model-agnostic tools might be more susceptible to various adversarial attacks DBLP:journals/corr/abs-1911-02508, whereas some other tools might not provide sufficiently advanced visualizations to match our needs. If the users are already comfortable with some of these options, then these might well be their Swiss-Army knife for any scenario, whereas if they need specific visualization scenarios (e.g., visualizing a specific attention map), it is possible that they will eventually use the Transformer-focused visualizations.

Visualization of RNNs started quite early in the DL era and can be seen as a precursor to Transformer visualization. Quite often, the topics that were important during this era have remained important during the Transformer era (e.g., hidden states).

Some of the most useful tools discovered during this exploration include: visualization of attention maps (e.g., DBLP:journals/corr/abs-1906-04341) and embeddings DBLP:conf/acl/Vig19, hidden states visualization (e.g., DBLP:journals/tvcg/StrobeltGPR18), parallel coordinates plots DBLP:journals/corr/abs-1906-04341, and the inclusion of corpus views from ExBERT DBLP:conf/acl/HooverSG20.

The current generation of pre-trained LMs based on Transformers DBLP:journals/corr/abs-1910-03771 has been shown to be relatively good at picking up syntactic cues like noun modifiers, possessive pronouns, prepositions or co-referents DBLP:journals/corr/abs-1906-04341 and semantic cues like entities and relations DBLP:conf/emnlp/HanGYYLS19, but has not performed well at capturing different perspectives DBLP:conf/naacl/ChenK0CR19 or global context DBLP:conf/emnlp/WaddenWLH19, or at relation extraction DBLP:conf/emnlp/GaoHZLLSZ19. This may be because biases can already be included in the embeddings and later propagated to the downstream tasks DBLP:conf/acl-wnlp/GonenG19.

The MIT Computational Psycholinguistics Laboratory created two useful tools for exploring LMs: the LM Zoo and the SyntaxGym DBLP:conf/acl/GauthierHWQL20. The first allows users to install some classic LMs, whereas the SyntaxGym allows users to run psycholinguistic tests and generate useful visualizations based on their results.

The two large classes of Transformer visualizations we examined (focused on explaining Transformer topics or holistic) are proof that the field is extremely dynamic. While many of the articles focused on explaining Transformer topics like attention or information probing tend to use classic statistical chart types (e.g., bar charts, line charts, PCA, or Pearson correlation charts), we do not consider this a bad thing as we are still in the exploration phase of this technology. Some of these articles also showcase new charts like attention graphs or attention maps.

The second class of visualizations includes tools like BertViz DBLP:conf/acl/Vig19, AttViz DBLP:journals/corr/abs-2005-05716, VisBERT DBLP:conf/www/AkenWLG20 or ExBERT DBLP:conf/acl/HooverSG20, that aim to visualize the entire lifecycle of a neural network from corpora and inputs to the model outputs mainly through following the information flow through the various components. They also offer detailed statistics for neurons or network layers. Since most of the models included in this category are rather new, it is expected that this class will expand in the next years.

One important thing to note about visualization methods is that they can easily be imported into other domains. The averaged attention heatmaps used by Vig in his causal mediation analysis for NLP DBLP:journals/corr/abs-2004-12265, for example, were later reused for protein analysis in biology DBLP:journals/corr/abs-2006-15222. Similarly, attention maps DBLP:conf/acl/Vig19 developed for BERT models are now used in a wide variety of disciplines, from vision and speech to biology or genetics.
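The averaged attention heatmaps mentioned above are simple enough to reuse across domains because they need nothing more than the per-head attention matrices. The sketch below averages a set of heads and renders the result as a character-based heatmap. The attention values, the token names and the shading characters are hypothetical, and the text rendering is only a stand-in for a plotted heatmap; it is not the implementation used in the cited tools.

```python
def average_heads(heads):
    """Element-wise mean of several attention matrices of the same shape."""
    n_heads = len(heads)
    rows, cols = len(heads[0]), len(heads[0][0])
    return [[sum(head[i][j] for head in heads) / n_heads
             for j in range(cols)]
            for i in range(rows)]


def render_heatmap(tokens, matrix, shades=" .:*#"):
    """Return a text heatmap with one row per token; darker means higher."""
    lines = []
    for token, row in zip(tokens, matrix):
        cells = "".join(
            shades[min(int(weight * len(shades)), len(shades) - 1)]
            for weight in row
        )
        lines.append(f"{token:>6} |{cells}|")
    return "\n".join(lines)


# Hypothetical attention matrices from two heads over a two-token input.
head_a = [[0.8, 0.2], [0.1, 0.9]]
head_b = [[0.2, 0.8], [0.5, 0.5]]

averaged = average_heads([head_a, head_b])
print(render_heatmap(["cat", "sat"], averaged))
```

The same two functions apply unchanged whether the tokens are words, amino acids or image patches, which is one reason why such summaries travel easily between disciplines.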

The end goal of future visualization frameworks should be to visualize the entire lifecycle of the Transformer models, from inputs and data sources (e.g., training corpora), to embeddings or attention maps, and finally outputs. In the end, errors observed when creating such models can come from a variety of sources: from the text corpora, from some random network layer or even from some external Knowledge Graph that might feed some data into the model. Tracking such errors would be costly without visualizations.

5 Conclusion

While the current visualizations aspire to be model-agnostic, we think the directions opened by the various Transformer or RNN visualizations are worthy of expanding upon. In fact, since this is a ubiquitous architecture that has also branched from NLP into areas like semantic video processing, natural language understanding (e.g., speech, translation) and generation (e.g., text generation, music generation), the next generation XAI libraries will probably be built upon it.

Going beyond current visualizations that are model-agnostic, future frameworks will have to provide visualization components that focus on the important Transformer components like corpora, embeddings, attention heads or additional neural network layers that might be problem-specific. By focusing on the common components from larger architectures, it should be possible to enhance reusability. Other important features that should be included in future frameworks are the ability to summarize the model’s state (e.g., through averaged attention heatmaps or similar visualization mechanisms) at various levels (e.g., neurons, layers, inputs and outputs), as well as the possibility to compare multiple settings for one or multiple models. It is important to note that most of the current visualizations seem to be designed to showcase methods and properties that were already known to belong to neural networks. It would be interesting to design visualizations that allow us to explore neural networks and help us discover new properties. Such interactive exploration tools would significantly expand the role of visualization from communication to knowledge discovery.

One interesting direction is the automated development of model-specific visualizations, as more complex neural networks might also include many specific components that cannot always be included in more general model-agnostic frameworks.