Seq2Seq-Vis: A Visual Debugging Tool for Sequence-to-Sequence Models

by   Hendrik Strobelt, et al.

Neural Sequence-to-Sequence models have proven to be accurate and robust for many sequence prediction tasks, and have become the standard approach for automatic translation of text. The models work in a five stage blackbox process that involves encoding a source sequence to a vector space and then decoding out to a new target sequence. This process is now standard, but like many deep learning methods remains quite difficult to understand or debug. In this work, we present a visual analysis tool that allows interaction with a trained sequence-to-sequence model through each stage of the translation process. The aim is to identify which patterns have been learned and to detect model errors. We demonstrate the utility of our tool through several real-world large-scale sequence-to-sequence use cases.



page 2

page 6

page 8


Sequence-to-sequence neural network models for transliteration

Transliteration is a key component of machine translation systems and so...

Improving Conditioning in Context-Aware Sequence to Sequence Models

Neural sequence to sequence models are well established for applications...

A Deep Learning Based Chatbot for Campus Psychological Therapy

In this paper, we propose Evebot, an innovative, sequence to sequence (S...

Identifying Cause-and-Effect Relationships of Manufacturing Errors using Sequence-to-Sequence Learning

In car-body production the pre-formed sheet metal parts of the body are ...

Interactive-predictive neural multimodal systems

Despite the advances achieved by neural models in sequence to sequence l...

Do RNN States Encode Abstract Phonological Processes?

Sequence-to-sequence models have delivered impressive results in word fo...

Predicting the time-evolution of multi-physics systems with sequence-to-sequence models

In this work, sequence-to-sequence (seq2seq) models, originally develope...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Sequence-to-Sequence Models and Attention

We begin with a formal taxonomy of seq2seq models that will inform our visual analytics approach. Throughout this work, for brevity and clarity, we will consider the running example of automatic translation from one language to another. We use the sequence notation to represent an -word sentence in a source language, and to represent a -word sentence in a target language. Seq2seq models perform translation in a left-to-right manner, one target word at a time, until a special stop token is generated, which ends the translation.

We break down the translation process of seq2seq models into five stages: (S1) encode the source sentence, (S2) decode the current target words, (S3) attend to the encoded source, (S4) predict the next target word, and (S5) search for the best complete translation. Note that some systems use a slightly different order, but most adhere roughly to this setup. Figure 1 provides a structural overview of these five stages.

Encoder (S1): Encoding uses a deep neural network to convert a sequence of source words into a sequence of vectors . Each vector in the sequence roughly represents one word

but also takes into account the surrounding words, both preceding and succeeding, that may determine its contextual meaning. This encoding is typically done using a recurrent neural network (RNN) or a long short-term memory network (LSTM), however recently non-RNN-based methods such as convolutional neural networks (CNN)

[9, 8] and Transformer [50, 29] have also been employed. Our approach supports all types of encoding methods.

Decoder (S2): The decoding process is analogous to encoding, and takes the sequence of previously generated target words and converts them to a sequence of latent vectors . Each vector represents the state of the sentence up to and including word . This provides a similar contextual representation as in the encoder, but is only based on previous words. Upon producing a new word, the prediction is used as input to the decoder.

Attention (S3): The attention component matches encoder hidden states and decoder hidden states. For each we consider which encoder states are relevant to the next prediction. In some similar language pairs, like French and Spanish, the words often align in order, e.g., the fourth French word matches the fourth Spanish word. However for languages such as English and Chinese, the matching might be quite far away. Instead of using absolute position, attention compares the word representations to find which source position to translate. Attention forms a distribution based on the dot product between vectors . We call this value , and it indicates how closely the source and target positions match.

Prediction (S4): The prediction step produces a multi-class distribution over all the words of the target language – words that are more likely to come next have higher probability. This problem takes two factors into account: the current decoder state and the encoder states weighted by attention, known as the context vector coming from S3. These two are combined to predict a distribution over the next word .

Search (S5): To actually produce a translation, these previous steps are combined into a search procedure. Beam search is a variant of standard tree search that aims to efficiently explore the space of translations. The deep learning component of Seq2seq models predicts the probability of all next words, given a prefix. While one could simply take the highest probability word at each time step, it is possible that this choice will lead down a bad path (for instance, first picking the word ”an” and then wanting a word starting with a consonant). Beam search instead pursues several possible hypothesis translations each time step. It does so by building a tree comprising the top - hypothesis translations. At each point, all next words are generated for each. Of these, only the most likely are preserved. Once all beams have terminated by generating the stop token, the final prediction is the translation with the highest score.

Each stage of the process is crucial for effective translation, and it is hard to separate them. However, the model does preserve some separations of concerns. The decoder (S2) and encoder (S1) primarily work with their respective language, and manage the change in hidden representations over time. Attention (S3) provides a link between the two representations and connects them during training. Prediction (S4) combines the current decoder state with the information moving through the attention. Finally, search (S5) combines these with a global score table. These five stages provide the foundation for our visual analytics system.

2 Motivating Case Study: Debugging Translation

To motivate the need for our contributions, we present a representative case study. Further case studies are discussed in section 6. This case study involves a model trainer (see [48]) who is building a German-to-English translation model (our model is trained on the small-sized IWSLT ’14 dataset [31]).

The user begins by seeing that a specific example was mistranslated in a production setting. She finds the source sentence: Die längsten Reisen fangen an, wenn es auf den Straßen dunkel wird.111The closing quote of the book ‘Kant’ from German author Jörg Fauser, who is attributed as being a forerunner of German underground literature. This sentence should have been translated to: The longest journeys begin, when it gets dark in the streets. She notices that the model produces the mistranslation: The longest journey begins, when it gets to the streets. Figure 4(E/D) shows the tokenized input sentence in blue and the corresponding translation of the model in yellow (on the top). The user observes that the model does not translate the word dunkel into dark.

This mistake exemplifies several goals that motivated the development of Seq2Seq-Vis. The user would like to examine the system’s decisions, connect to training examples, and test possible changes. As described in section 1, these goals apply to all five model stages: encoder, decoder, attention, prediction, and search.

Hypothesis: Encoder (S1) Error? Seq2Seq-Vis lets the user examine similar encoder states for any example. Throughout, we will use the term neighborhood to refer to the twenty closest states in vector space from training data. Seq2Seq-Vis displays the nearest neighbor sentences for a specific encoder state as red highlights in a list of training set examples. Figure 2 shows that the nearest neighbors for dunkel match similar uses of the word. The majority seem to express variations of dunkel. The few exceptions, e.g., db, are artifacts that can motivate corrections of the training data or trigger further investigation. Overall, the encoder seems to perform well in this case.

Figure 2: Hypothesis: Encoder (S1) Error – nearest neighbors of encoder state for dunkel.

Hypothesis: Decoder (S2) Error? Similarly, the user can apply Seq2Seq-Vis to investigate the neighborhood of decoder states produced at times and (Figure 3). In addition to the neighbor list, it gives a projection view that depicts all decoder states for the current translation and all their neighbors in a 2D plane. The analyst observes that the decoder states produced by gets and streets are in proximity and share neighbors. Since these states are indicative for the next word we can switch the highlight one text position to the right (+1) and observe that the decoder states at gets and streets support producing dark, darker, or darkness. Thus, the decoder state does not seem very likely as the cause of the error.

Figure 3: Hypothesis: Decoder (S2) Error – nearest neighbors of decoder state for gets and streets, which are close in projection space.

Hypothesis: Attention (S3) Error? Since both encoder and decoder are working, another possible issue is that the attention may not focus on the corresponding source token dunkel. The previous hypothesis testing revealed that well-supported positions for adding dark are after gets or streets. This matches human intuition, as we can imagine the following sentences being valid translations: The longest travels begin when it gets dark in the streets. or The longest travels begin when it gets to the streets turning dark. In Figure 4(S3) our analyst can observe that the highlighted connection following get to the correct next word dunkel is very strong. The connection width indicates that the attention weight is very high with the correct word. Therefore, the user can assume that the attention is well set for predicting dark in this position. The hypothesis for error in S3 can be rejected with high probability.

Figure 4: Hypotheses: Attention (S3), Prediction (S4), or Beam Search (S5) Error – encoder words and decoder words (E/D), Attention (S3), top k predictions for each time step in decoder (S4), and beam search tree (S5)

Hypothesis: Prediction (S4) Error? The combination of decoder state and attention is used to compute the probability of the next word. It may be that an error occurs in this decision, leading to a poor probability of the word dark. The tool shows the most likely next words and their probabilities in Figure 4(S4). Here, our analyst can see that the model mistakenly assigns a higher probability to to than dark. However, both options are very close in probability, indicating that the model is quite uncertain and almost equally split between the two choices. These local mistakes should be automatically fixed by the beam search, because the correct choice dark leads to a globally more likely sentence.

Hypothesis: Search (S5) Error? Having eliminated all other possible issues, the problem is likely to be a search error. The user can investigate the entire beam search tree in Figure 4(S5), which shows the top considered options at each prediction step. In this case, the analyst finds that dark is never considered within the search. Since the previous test showed that to is only minimally more likely than dark, a larger would probably have lead to the model considering dark as the next best option. We therefore conclude that this local bottleneck of a too narrow beam search is the most likely error case. The analyst has identified a search error, where the approximations made by beam search cut off the better global option in favor of a worse local choice.

Exploring Solutions. When observing the -best predictions for the position of to, the analyst sees that dark and to are close in probability (Figure 4(S4)). To investigate whether the model would produce the correct answer if it had considered dark, Seq2Seq-Vis allows the user to evaluate a case-specific fix. The analyst can test this counterfactual, what would have happened if she had forced the translation to use dark at this critical position? By clicking on dark she can produce this probe (shown in Figure 5), which yields the correct translation. The user can now describe the most likely cause of error (search error) and a local fix to the problem (forced search to include dark). The analyst can now add this case to a list of well-described bugs for the model and later consider a global fix.

Figure 5: Testing a fix – by clicking on the correct word dark in the predicted top-K, the beam search is forced on a specific path (P) which leads to the correct prediction.

3 Goals and Tasks

We now step back from this specific instance and consider a common deployment cycle for a deep learning model such as seq2seq. First, a model is trained on a task with a possibly new data set, and then evaluated with a standard metric. The model performs well in aggregate and the stakeholders decide to deploy it. However, for a certain subset of examples there exist non-trivial failures. These may be noticed by users, or, in the case of translation, by post-editors who correct the output of the system. While the model itself is still useful, these examples might be significantly problematic as to cause alarm.

Although these failures can occur in any system, this issue was much less problematic in previous generations of AI systems. For instance when using rule-based techniques, a user can explore the provenance of a decision through rules activated for a given output. If there is a mistake in the system, an analyst can 1) identify which rule misfired, 2) see which previous examples motivated the inclusion of the rule, and 3) experiment with alternative instances to confirm this behavior.

Ideally, a system could provide both functionalities: the high performance of deep learning with the ability to interactively spot issues and explore alternatives. However, the current architecture of most neural networks makes it more challenging to examine decisions of the model and locate problematic cases. Our work tackles the following challenges and domain goals for seq2seq models analogous to the three steps in rule-based systems:

Goal G1 – Examine Model Decisions: It is first important to examine the model’s decision chain in order to track down the error’s root cause. As mentioned in section 1, seq2seq models make decision through several stages. While it has proven difficult to provide robust examination in general-purpose neural networks, there has been success for specific decision components. For example, the attention stage (S3) has proven specifically useful for inspection [52, 3]. Our first goal is to develop interactive visual interfaces that help users understand the model’s components, their relationships, and pinpoint sources of error.

Goal G2 – Connect Decisions to Samples from Training Data: Once a model makes a particular decision, a user should be able to trace what factors influenced this decision. While it is difficult to provide specific reasoning about the many factors that led to a decision in a trained model, we hope to provide other means of analysis. In particular, we consider the approach of mapping example states to those from previous runs of the model. For instance, the training data defines the world view of a model and therefore influences its learned decisions [20]. The goal is to utilize (past) samples from training data as a proxy to better understand the decision made on the example in question.

Goal G3 – Test Alternative Decisions: Ultimately, though, the goal of the user is to improve the model’s performance and robustness. While the current state-of-the art for diagnosing and improving deep neural network models is still in an early stage [17, 47, 28, 16], our goal is to allow the user to test specific interventions. We aim to let the user investigate causal effects of changing parts of the model the let users ask what if specific intermittent outputs of a model changed.

Our motivating case study (section 2) follows these goals: First, the user defines five hypotheses for causes of error and tests them by examining the model’s decisions (G1). Some of these decisions (for S1, S2) are represented in the model only as latent high-dimensional vectors. To make these parts tangible for the user, she connects them to representative neighbors from the training data (G2). Finally, by probing alternatives in the beam search (G3) she finds a temporary alternative that helps her to formulate a better solution.

We use these goals to compile a set of visualization and interaction tasks for Seq2Seq-Vis. The mapping of these tasks to goals is indicated by square brackets:

Task T1 - Create common visual encodings of all five model stages to allow a user to examine the learned connections between these modules. [G1]

Task T2 - Visualize state progression of latent vector sequences over time to allow for high-level view of the learned representations. [G1]

Task T3 - Explore generated latent vectors and their nearest neighbors by querying a large database of training examples to facilitate error identification and training adjustment. [G2]

Task T4 - Generate sensible alternative decisions for different stages of the model and compare them to ease model exploration and compare possible corrections. [G1, G3]

Task T5 - Create a general and coherent interface to utilize a similar front-end for many sequence-to-sequence problems such as translation, summary, and generation. [G1,G2,G3]

In the following section, we will match these tasks and goals to design decisions for Seq2Seq-Vis .

Figure 6: Overview of Seq2Seq-Vis: The two main views (a) Translation View and (b) Neighborhood View facilitate different modes of analysis. Translation View provides (c) visualizations for attention, (d) the top k word predictions for each time step, and (e) the beam search tree. The Neighborhood View goes deeper into what the model has learned by providing (f,g) a projection of state trajectories and (h) a list of nearest neighbors for a specific model state.

4 Design of Seq2Seq-Vis


is the result of an iterative design process and discussions between experts in machine learning and visualization. In regular meetings we evaluated a series of low-fidelity prototypes and tested them for usability. The design presented in this section combines the prevailing ideas into a comprehensive tool.

Seq2Seq-Vis is composed of two main views facilitating different modes of analysis: In the upper part, the translation view provides a visual encoding for each of the model stages and fosters understanding and comparison tasks. In the lower part, the neighborhood view enables deep analysis based on neighborhoods of training data. Figure 6 shows the complete tool.

4.1 Translation View

In the translation view (Figure 6a), each functional stage of the seq2seq model is mapped to a visual encoding (T1, T2, G1). We generalize and extend encodings from Olah & Carter [36] and Le et al. [23]. In Attention Vis (Figure 6

c), the encoder words are shown in blue, the decoder words in yellow, and the attention is shown through weighted bipartite connections. To reduce visual clutter the attention graph is pruned. For each decoder step all edges are excluded that fall into the lower quartile of the attention probability distribution.

Right below the yellow decoder words, the top predictions (S4 of model) for each time step are shown (Figure 6d). Each possible prediction encodes information about its probability in the underlying bar chart, as well as an indication if it was chosen for the final output (yellow highlight).

In the bottom part of the translation view, a tree visualization shows the hypotheses from the beam search stage (Figure 6e). The most probable hypothesis, which results in the final translation sentence, is highlighted. Several interactions can be triggered from the translation view, which will be explained in subsection 4.4.

4.2 Neighborhood View

The neighborhood view (Figure 6b) takes a novel approach to look at model decisions in the context of finding similar examples (T2, T3, G1, G2). As discussed in section 1

, seq2seq models produce high-dimensional vectors at each stage, e.g., encoder states, decoder states, or context states. It is difficult to interpret these vectors directly. However, we can estimate their meaning by looking at examples that produces similar vectors. To enable this comparison, we precompute the hidden states of a large set of example sentences (we use 50k sentences from the training set). For each state produced by the model on a given example,

Seq2Seq-Vis searches for nearest neighbors from this large subset of precomputed states. These nearest neighbors are input to the state trajectories (Figure 6g) and to the neighbor list (Figure 6h).

The state trajectories show the changing internal hidden state of the model with the goal of facilitating task T2. This view encodes the dynamics of a model as a continuous trajectory. First, the set for all states and their closest neighbors are projected using a non-linear algorithm, such as t-SNE [30], non-metric MDS [22], or a custom projection (see section 6). This gives a 2D positioning for each vector. We use these positions to represent each encoder/decoder sequence as a trace connecting its vectors. See Figure 6g for an example of a trace representing the encoder states for wir wollen heute mal richtig spass haben.

In the projection, the nearest neighbors to each vector are shown as nearby dots. When hovering over a vector from the input, the related nearest neighbor counterparts are highlighted and a temporary red line connects them. For vectors with many connections (high centrality), we reduce visual clutter by computing a concave hull for all related neighbors and highlight the related dots within the hull. Furthermore, we set the radius of each neighbor dot to be dependent on how many original states refer to it. E.g., if three states from a decoder sequence have one common neighbor, the neighbor’s radius is set to (we use a mapping with being number of common neighbors).

The state trajectories can be quite long. To ease understanding, we render a view showing states in their local neighborhood as a series of trajectory pictograms (Figure 6f). Each little window is a cut-out from the projection view, derived from applying a regular grid on top of the projection plane. Each pictogram shows only the cut-out region in which the respective vector can be found.

Clicking on any projected vector will show the neighbor list on the right side of the view. The neighbor list shows the actual sentences corresponding to the neighbor points, grounding these vectors in particular words and their contexts. Specifically, the neighbor list shows all the nearest neighbors for the selected point with the original sequence pair. The source or target position in the sequence that matches is highlighted in red. The user can facet the list by filtering only to show source (blue) or target (yellow) sequences. She can also offset the text highlight by or to see alignment for preceding or succeeding word positions (see Figure 3).

4.3 Global Encodings and Comparison Mode

Seq2Seq-Vis uses visual encodings that are homogenous across all views to create a coherent experience and ease the tool’s learning curve for model architects and trainers (T5). First, a consistent color scheme allows the user to identify the stage and origin of data (encoder - blue, decoder - yellow, pivot - green, compare - violet). Furthermore, every visual element with round corners is clickable and leads to a specific action. Across all views, hovering highlights related entities in red.

Additionally, the tool has a global comparison mode (T4, G1, G3). As soon as we generate a comparison sample from one of several triggers (subsection 4.4), all views switch to a mode that allows comparison between examples. Attention Vis, Trajectory Pictograms, and State Projector display a superimposed layer of visual marks labeled with a comparison color (violet) different from the pivot color (green). To ease understanding, we disable all triggers in the comparison view. However, by providing an option to swap pivot and compare roles (arrow button), we allow a consistent exploration flow from one sample to the next. The only exception is the Beam Search Tree, which is only shown for the pivot sample to save visual space.

4.4 Interacting With Examples

A major focus of Seq2Seq-Vis is interactive comparison between different sources and targets (T4, G1, G3). We consider two different modes of interactions to produce comparison samples or to modify the pivot: model-focused and language-focused changes. Model-focused interactions let the user (model architect) produce examples that the model believes are similar to the current pivot to test small, reasonable variations for the different model stages. Language-focused interactions enable the user (model trainer) to produce examples focussed on the language task and observe model behavior.

For the model-focused interactions, we utilize a different variant of neighborhoods. To replace a word with a slightly different, but interpretable substitute, we search for neighbors from the model’s word vectors. The user can trigger the substitution process by clicking on the word to be replaced. As a result, a word cloud projecting the closest words w.r.t. their vector embedding in a 2D plane is shown. A click on one of the words in the cloud replaces the original and triggers a new translation in comparison mode.

Another model-focused interaction is to modify the model directly, for instance, by altering attention weights (S3 in model). For this step, the user can switch to attention modification and select a target word for which attention should be modified. By repeatedly clicking on encoder words, she gives more weights to these encoder words. Figure 7 shows how the attention can be modified for an example. After hitting apply attn, the attention is applied for this position, overwriting the original attention distribution.

Figure 7: To re-direct attention in Seq2Seq-Vis, the user first observes a split of attention between the input 8 and 9 for converting the last digits of a year in a date conversion model. She can (a) select attention mode, (b) select the decoder word, (c) click on the preferred encoder word, (d) apply the attention change, and (e) see the models reaction.

For language-focused interactions, the user can specify direct changes to either the source or the target. The user can trigger the changes by using the manual compare button and enter a new source or a new target sentence. When the source is changed, a new full translation is triggered. If the target is changed, a prefix decode is triggered that constrains the search on a predefined path along the words entered and continues regular beam search beyond.

Alternatively, the user can select the word from the top predictions (Figure 6d) that seems to be the best next word. By clicking on one of these words, a prefix decode is triggered as described above and shown in Figure 5.

Initiating either of these interactions switches Seq2Seq-Vis into comparison mode. As a core analysis method, comparison allows to derive insights about model mechanics (model-focused probing) or how well the model solves the task (language-focused testing).

4.5 Design Iterations

Figure 8: Design variants for additional token information: (a) progression of hidden states, (b) density of neighborhoods, or (c) top predictions as heatmap.

We considered several different variants for both main views of the system. For the translation view, we considered incorporating more state information directly into the encoding. Figure 8 shows iterations for adding per-word information around encoder and decoder. Similar to LSTMVis [48], the hidden state line charts show progression along encoder and decoder hidden states (Figure 8a). Domain scientists rejected this as too noisy for the given domain goals. In a later iteration the visualization experts proposed to indicate the closeness of the nearest neighbors with a simple histogram-like encoding (Figure 8

b). This information did not help to formulate hypotheses. In addition, it did not reveal a lot of variance (see abundance of similar small gray boxes). The next design focused on incorporating language features rather than latent vectors. It showed for each time step of the decoder the top

predicted words being produced as if there was only the top beam evaluated until then. Finally, we decided to use the stronger visual variable length to encode the probability values (Figure 6d).

In the neighborhood view, the trajectory pictograms are a result of a series of visual iterations around combining the linear nature of sequences with keeping some spatial information describing vector proximity. We divide the state trajectory view into cutout regions forming a regular grid. Using a regular grid limits the variants of basic pictograms to a small and recognizable number. Alternative ideas were to center the cutout area around each state or to use the bounding box of the sequence state and all its neighbors as area. Both alternatives created highly variant multiples that introduced visual noise. For the regular grid, choosing the right grid size is important and the current static solution of applying a grid will be replaced by a non-linear function of number of displayed words to allow for scalability.

5 Implementation

Seq2Seq-Vis allows for querying and interaction with a live system. To facilitate this, it uses tight integration of a seq2seq model with the visual client. We based the interface between both parts on a REST API, and we used OpenNMT [18] for the underlying model framework. We extended the core OpenNMT-py distribution to allow easy access to latent vectors, the search beams, and the attention values. Furthermore, we added non-trivial model-diagnostic modifications for translation requests to allow prefix decoding and to apply user-specific attention. We plan to distribute Seq2Seq-Vis as the default visualization mode for OpenNMT.

To allow fast nearest neighbor searches, Python scripts extract the hidden state and context values from the model for points in a large subset of the training data. These states are saved in HDF5 files and indexed utilizing the Faiss [14] library to allow fast lookups for closest dot products between vectors. For TSNE and MDS projections we use the SciKit Learn package [40] for Python.

The model framework and the index work within a Python Flask server to deliver content via a REST interface to the client. The client is written in Typescript. Most visualization components are using the d3js library. Source code, a demo instance, and a descriptive webpage are available at

6 Use Cases

We demonstrate the application of Seq2Seq-Vis and how it helps to generate insights using examples from a toy date conversion problem, abstractive summarization, and machine translation (section 2).

Date Conversion. Seq2seq models can be difficult to build and debug even for simple problems. A common test case used to check whether a model is implemented correctly is to learn a well-specified deterministic task. Here we consider the use case of converting various date formats to the unified format YEAR-MONTH-DAY. For example, the source March 25, 2000 should be converted to the target 2000-03-25. While this problem is much simpler than language translation, it tests the different components of the system. Specifically, the encoder (S1) must learn to identify different months, the attention (S3) must learn to reorder between the source and the target, and the decoder (S2) must express the source word in a numeric format.

Figure 9: Comparing translations for a date conversion model. The input sequences March 21, 2000 and May 21, 2000 are only different by some letters. The attention (top) for predicting the correct months 3 and 5 is focused on this difference (y vs. rc). The trajectory view (bottom left) shows this difference along the progression of encoder states. The neighborhood list (bottom right) indicates that after input of M a the model is still undecided.

Seq2Seq-Vis provides tools for examining these different stages of the model. Figure 9 shows an example, where the user, following Goal 3, employs a comparison between two different translations, one starting with March and the other with May. These two translations are nearly identical, except one yields the month 3 and the other 5. Following Goal 1, the user might want to examine the models decisions. The upper translation view provides a way to compare between the attention on the two inputs. The red highlighted connections indicate that the first sentence attention focuses on r c wheres the second focuses on y. These characters are used by the model to distinguish the two months since it cannot use M a. The user can also observe how the encoder learns to use these letters. The trajectory view compares the encoder states of sentence 1 and sentence 2. Here we use a custom projection, where the y-axis is the relative position of a word in a sentence and the x-axis is a 1-d projection of the vector. This reveals that the two trajectories are similar before and after these characters, but diverge significantly around r and c. Finally, following Goal 2, the user can connect these decisions back to the training data. On the right, she can see the nearest neighbors around the letter a in M a y (highlighted in red). Interestingly, the set of nearest neighbors is almost equally split between examples of M a y and M a r c h, indicating that at this stage of decoding the model is preserving its uncertainty between the two months.

Abstractive Summarization. For our second use case we apply the tool to a summarization problem. Recently, researchers have developed methods for abstractivetext summarization that learn how to produce a shorter summarized version of a text passage. Seq2seq models are commonly used in this framework [44, 33, 39, 45]. In abstractive summarization, the target passage may not contain the same phrasing as the original. Instead, the model learns to paraphrase and alter the wording in the process of summarization.

Studying how paraphrasing happens in seq2seq systems is a core research question in this area. Rush et al. [44] describe a system using the Gigaword data set (3.8M sentences). They study the example source sentence russian defense minister ivanov called sunday for the creation of a joint front for combating global terrorism to produce a summary russia calls for joint front against terrorism. Here russia compresses the phrase russian defense minister ivanov and against paraphrases for combating.

To replicate this use case we consider a user analyzing this sentence. In particular, he is interested in understanding how the model selects the length and the level of abstraction. He can analyze this in the context of Goal 3, testing alternatives predictions of the model, in particular targeting Stage 4. As discussed in Sect 4, Seq2Seq-Vis shows the top predictions at each time step. When the user clicks on a prediction, the system will produce a sentence that incorporates this prediction. Each choice is “locked” so that further alterations can be made.

Figure 10 shows the source input to this model. We can see four different summarizations that the model produces based on different word choices. Interestingly, specific local choices do have a significant impact on length, ranging from five to thirteen words. Switching from for to on leads the decoder to insert an additional phrase on world leaders to maintain grammaticality. While the model outputs the top choice, all other choices have relatively high probabilities. This observation has motivated research into adding constraints to the prediction at each time step. Consequently, we have added methods for constraining length and prediction into the underlying seq2seq system to produce different outputs.

Figure 10: Use case of abstractive summarization. The input sentence russian defense minister ivanov called sunday for the creation of a joint front for combating global terrorism can be summarized in different ways. The yellow boxes indicate alternative translations for different prefix decode settings. Top: the unconstrained abstraction; middle: changing prediction from for to on leads to automatic insertion of on world leaders to stay grammatically correct; bottom left: changing the first word from russian to moscow or russia compresses the sentence even more while retaining its meaning.

Machine Translation. Finally, we consider a more in-depth use case of a real-world machine translation system using a complete model trained on WMT ’14 (3.96M examples) to translate from German to English. This use case considers a holistic view of how an expert might go about understanding the decisions of the system.

Figure 11 shows an example source input and its translation. Here the user has input a source sentence, translated it, and activated the neighbor view to consider the decoder states. She is interested in better understanding each stage of the model at this point. This sentence is interesting as there is significant reordering that must occur to translate from the original German to English. For instance, the subject he is at the beginning of the clause, but must interact with the verb gesprochen at the end of the German sentence.

We consider Goals 1 and 2 applied to this example, with the intent of analyzing the encoder, decoder, attention, and prediction (S1-S4). First we look at the attention. Normally, this stage focuses on the word it is translating (er), but researchers have noted that neural models often look ahead to the next word in this process [19]. We can see branches going from he to potential next steps (e.g., von or gesprochen). We can further view this process in the decoder trajectory shown below, where he and spoke are placed near each other in the path. Hovering over the vector he highlights it globally in the tool. Furthermore, if we click on he, we can link this state to other examples in our data (Goal 2). On the right we can see these related examples, with the next word (+1) highlighted. We find that the decoder is representing not just the information for the current word, but also anticipating the translation of the verb sprechen in various forms.

In this case we are seeing the model behaving correctly to produce a good translation. However, the tool can also be useful when there are issues with the system. One common issue in under-trained or under-parameterized seq2seq models is to repeatedly generate the same phrase. Figure 12 shows an example of this happening. The model repeats the phrase in Stuttgart in Stuttgart. We can easily see in the pictogram view that the decoder model has produced a loop, ending up in nearly the same position even after seeing the next word. As a short-term fix, the tool’s prefix decoding can get around this issue. It remains an interesting research question to prevent this type of cycle from occurring in general.

Figure 11: Use case language translation using WMT’14 data. The attention graph (top) shows how attention for the target word he is not only focused on the decoder counterpart er but also on the following words, even to the far away verb gesprochen (spoke). The state trajectory (bottom left) for the decoder states reveals how close he and spoke are. The neighborhood list indicates that the model sets the stage for predicting spoke as next word.
Figure 12:

An under-trained English-German model. Repetition repetition is a commonly observed phenomenon in under-trained or under-parametrized models. Here the trajectory pictograms show that for the repetition

in Stuttgart in Stuttgart the decoder states alternate in the same region before being able to break apart.

7 Related Work

Various methods [4, 34] have been proposed to generate explanations for deep learning model predictions. Understanding them still remains a difficult task. To better address the specific issues of our users, we narrow the target audience for our proposed tool. Following the classifications by Strobelt et al. [48] and Hohman et al. [13], our tool aims at model developers who have at least a conceptual understanding of how the model works. This is opposed to end users, who are agnostic to the technique used to arrive at a specific result. Following Hohman et al., analysis itself can broadly be divided into global model analysis and instance-based analysis. In global model analysis, the most commonly seen methods are visualizations of the internal structure of trained deep learning models. Instance-based analysis may be coupled with interactive experimentation with the goal of understanding a particular prediction using the local information around only one input [38].

Global Model Analysis

Most recent work focuses on visualizing hidden representations of convolutional neural networks (CNNs) [24]

for computer vision applications. Techniques for visualizing CNNs include showing neural activity in the convolutional layers as overlay over the image 

[7, 23] and directly showing the images that maximize the activity [46]. Zeiler and Fergus [54] use deconvolutional networks to explore the layers of a CNN. This approach is widely used to generate explanations of models, for example, by Yosinski et al. [53].

A similar line of work has focused on visualizing recurrent neural networks (RNNs) and other sequence models. Preliminary work by Karpathy et al. [17] uses static visualizations to understand hidden states in language models. They demonstrate that selected cells can model clear events such as open parentheses and the start of URLs. Strobelt et al. [48] introduce LSTMVis, an interactive tool that allows users to understand activation patterns of combinations of hidden states. LSTMVis shows the neighborhood of activations within the training data as an approach of making sense of the complex interactions in a context.

Similar to our approach, Kahng et al. [16] propose using the model structure as the entry point into the analysis. In their approach, they try to understand connections between misclassified examples and hidden states of parts of the network by showing activation pattern differences between correct and false examples from the training data. Ming et al. [32] propose RNNVis, a tool that uses word clouds instead of full contexts or sentences to show typical words that appear for activation patterns. Our approach to show embeddings of a whole phrase is similar to that of Johnson et al. [15]. They use three-dimensional tSNE in order to visualize progressions of context vectors. Novel in our approach are different types of progressions as well as the connection and embedding with neighborhoods.

An alternative to visualizing what a model has learned is visualizing how it is learning. RNNbow by Cashman et al. [5]

shows the gradient flow during backpropagation training in RNNs to visualize how the network is learning.

Instance-Based Analysis

Instance-based analysis is commonly used to understand local decision boundaries and relevant features for a particular input. For example, Olah et al. [37] extend methods that compute activations for image classification to build an interactive system that assesses specific images. They show that not only the learned filters of a CNN matter, but also their magnitudes. The same type of analysis can be used to answer counter-factual “what if” questions to understand the robustness of a model to pertubations. Nguyen et al. [35]

show that small perturbations to inputs of an image classifier can drastically change the output. Interactive visualization tools such as Picasso 

[11] can manipulate and occlude parts of an image as input to an image classifier. Krause et al. [21] use partial dependence diagnostics to explain how features affect the global predictions, while users can interactively tweak feature values and see how the prediction responds to instances of interest.

There is an intrinsic difficulty in perturbing inputs of models that operate on text. While adding noise to an image can be achieved by manipulating the continuous pixel values, noise for categorical text is less well defined. However, there is a rich literature for methods that compute relevant inputs for specific predictions, for example by computing local decision boundaries or using gradient-based saliency [41, 42, 55, 1, 27]. Most of these methods focus on classification problems in which only one output exists. Ruckle et al. [43] address this issue and extend saliency methods to work with multiple outputs in a question-answering system. As an alternative to saliency-methods, Ding et al. [6] use a layer-wise relevance propagation technique [2] to understand relevance of input with regard to an output in sequence-to-sentence models. Yet another approach to understand predictions within text-based models is to find the minimum input that still yields the same prediction [26, 28]. None of the previous methods use our approach of using nearest neighbors of word embeddings to compare small perturbations of RNNs.

One commonality among all these approaches is that they treat the model as a black box that generates a prediction. In contrast, we are assuming that our users have an understanding of the different parts of a sequence-to-sequence model. Therefore, we can use more in-depth analysis, such as interactive manipulations of input, output, and attention. Our beam search and attention manipulations follow the approach by Lee et al. [25] who show a basic prototype to manipulate these parts of a model.

8 Conclusions and Future Work

Seq2Seq-Vis is a tool to facilitate deep exploration of all stages of a seq2seq model. We apply our set of goals to deep learning models that are traditionally difficult to interpret. To our knowledge, our tool is the first of its kind to combine insights about model mechanics (translation view) with insights about model semantics (neighborhood view), while allowing for ”what if”-style counterfactual changes of the model’s internals.

Being an open source project, we see future work in evaluating the longitudinal feedback from real-world users for suggested improvements. Two months after release, we already observed some initial quantitative and qualitative feedback. Currently, more then 5,500 page views have been recorded and 156 users liked (starred) the project on Github. The most requested new feature is integration of the tool with other ML frameworks.

There are many avenues for future work on the algorithmic and visualization side. Improving the projection techniques to better respect the linear order of sequences would be helpful. The tool could be extended to different sequence types, including audio, images, and video. Supporting these different data types requires non-trivial expansion of visual encoding for input and output. A prerequisite for future work targeting different models and frameworks is that model architects implement open models with hooks for observation and modification of model internals. We hope that Seq2Seq-Vis will inspire novel visual and algorithmic methods to fix models without retraining them entirely.