Embedding Comparator: Visualizing Differences in Global Structure and Local Neighborhoods via Small Multiples

12/10/2019 ∙ by Angie Boggust, et al. ∙ MIT

Embeddings – mappings from high-dimensional discrete input to lower-dimensional continuous vector spaces – have been widely adopted in machine learning, linguistics, and computational biology as they often surface interesting and unexpected domain semantics. Through semi-structured interviews with embedding model researchers and practitioners, we find that current tools poorly support a central concern: comparing different embeddings when developing fairer, more robust models. In response, we present the Embedding Comparator, an interactive system that balances gaining an overview of the embedding spaces with making fine-grained comparisons of local neighborhoods. For a pair of models, we compute the similarity of the k-nearest neighbors of every embedded object, and visualize the results as Local Neighborhood Dominoes: small multiples that facilitate rapid comparisons. Using case studies, we illustrate the types of insights the Embedding Comparator reveals including how fine-tuning embeddings changes semantics, how language changes over time, and how training data differences affect two seemingly similar models.




1 Introduction

Embedding models map high-dimensional discrete objects into lower-dimensional continuous vector spaces such that the vectors of related objects are located close together. Although the individual dimensions of embedding spaces are often difficult to interpret, embeddings have become widely used in machine learning (ML) applications because their structure usefully captures domain-specific semantics. For example, in natural language processing (NLP), embeddings map words into real-valued vectors in a way that co-locates semantically similar words and yields interesting linear substructures (for example, Paris - France + Italy = Rome [28]).
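To make the linear-substructure example concrete, the following minimal sketch resolves the analogy by vector arithmetic followed by a cosine-similarity lookup. The three-dimensional vectors and the helper names (`cosine`, `analogy`) are hypothetical and chosen only for illustration; real word embeddings have hundreds of dimensions, and this is not the procedure of any particular library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(embeddings, a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

# Hypothetical toy vectors (assumption): axis 0 ~ "is a capital",
# axis 1 ~ "France", axis 2 ~ "Italy".
toy = {
    "paris":  [1.0, 1.0, 0.0],
    "france": [0.0, 1.0, 0.0],
    "italy":  [0.0, 0.0, 1.0],
    "rome":   [1.0, 0.0, 1.0],
    "berlin": [1.0, 0.0, 0.0],
}

result = analogy(toy, "paris", "france", "italy")
print(result)
```

With these toy vectors the target is exactly the vector assigned to "rome", so the lookup returns that word; with learned embeddings the match is only approximate.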

A key task when working with embedding models is evaluating the representations they learn. For instance, users may wish to determine how robust an embedding is, and whether it can be transferred between tasks in a domain with limited training data (e.g., applying an embedding of general English to legal or medical text [15]). In speech recognition [4], recommendation systems [18], computational biology [45, 5], and computational art [9, 11], evaluating embeddings has helped inform future training procedures as they have revealed the impact of different training datasets, model architectures, hyperparameters, or even random weight initializations.

To understand how users evaluate and compare embeddings, we conducted a series of semi-structured interviews with ML researchers and practitioners who frequently use embedding models as part of their research or within application domains. Participants described using a mix of quantitative metrics and static visualizations to analyze embeddings, including model precision/recall, clustering analysis, and dimensionality reduction plots. Our conversations also revealed several shortcomings with these approaches: existing tools are primarily focused on analyzing a single embedding space and, more specifically, on depicting only the global structure of this space. As a result, digging into the local neighborhoods of individual embedded objects, or comparing one embedding space against another requires a non-trivial amount of effort. Moreover, users are unable to develop tight feedback loops or rapidly iterate between generating and answering hypotheses as existing tools provide limited interactive capabilities and, thus, require tedious manual specification.

In response, we present the Embedding Comparator, an interactive system for analyzing a pair of embedding models. Drawing on the insights from our formative interviews, the Embedding Comparator balances visualizing the models’ global geometries with depicting the local neighborhood structures. To simplify identifying the similarities and differences between the two models, the system calculates a similarity score for every embedded object based on its reciprocal local neighborhood (i.e., how many of an object’s nearest neighbors are shared between the two models, and how many are unique to each model). These scores are visualized in several ways including through a histogram of scores, by color-encoding the global geometry plots, and critically, through Local Neighborhood Dominoes: small multiple visualizations that facilitate rapid comparisons of local substructures. A variety of interactive mechanics help facilitate a tight iterative loop between analyzing these global and local views. For instance, by interactively selecting points in the global plots, or by searching for specific objects, users can filter local neighborhood dominoes, and hovering over dominoes highlights their points in the global views to provide additional context.

Through three case studies, drawn from use cases described by our interviewees, we demonstrate how the Embedding Comparator helps scaffold and accelerate real-world exploration and analysis of embedding spaces. Using three popular word embedding models (fastText [27], word2vec [28], and GloVe [30]), we show how our system supports tasks such as evaluating model robustness and expressivity, and conducting linguistic analysis. As we highlight, the Embedding Comparator shifts the process of analyzing embeddings from requiring tedious and error-prone manual specification to instead browsing and manipulating a series of visualizations. As a result, only a handful of interactions are needed to surface published insights [12] about the evolution of the English language.

The Embedding Comparator is freely available as open source software, with source code at https://github.com/mitvis/embedding-comparator, and a live demo at http://vis.mit.edu/embedding-comparator.

2 Related Work

We draw on prior work developing general and model-specific techniques for interpretability, as well as visual and textual tools for analyzing embedding spaces.

2.1 ML Model Interpretability

ML models are widely regarded as being “black boxes” as it is difficult for humans to understand how models arrive at their decisions [21]. In response, numerous tools have been developed to help researchers and practitioners understand the behavior of ML models. Various visualizations have been proposed to expose the workings of specific model architectures, such as LSTMs [37], sequence-to-sequence models [36], and generative adversarial networks (GANs) [2, 3]. Other methods are instead focused on more general techniques for interpretability including explaining the importance of input features, saliency, or neuron activations [29, 6, 32, 38, 34, 16]. In contrast to this prior work, our focus is on comparing the representations learned by different models (the input embeddings or hidden layers of neural networks, as opposed to their inputs and outputs), as embedding representations may differ even while input saliency remains the same.

2.2 Visual Embedding Techniques & Tools

Interpreting the representations learned at the embedding layers of ML models is a challenging task because embedding spaces are generally high-dimensional and latent (i.e., hidden rather than observed). To reason about these spaces, researchers project the high-dimensional vectors down to two or three dimensions using techniques such as principal component analysis (PCA), t-SNE [25], and UMAP [26]. Visualizing these projections reveals the global geometry of these spaces as well as potential substructures such as clusters, but effectively doing so may require careful tuning of hyperparameters [43], a process that can require non-trivial ML expertise. The Embedding Comparator provides a modular system design such that users can use a dimensionality reduction technique of their choice. By default, however, we use PCA projections as they are deterministically generated and highlight, rather than distort, the global structure of the embedding space [39].

By default, many projection packages generate visualizations that are static and thus do not facilitate a tight question-answering feedback loop. Recently, researchers have begun to explore interactive systems for exploring embeddings including via direct manipulation of the projection [35], interactively filtering and reconfiguring visual forms [13], and defining attribute vectors and analogies [22]. While our approach draws inspiration from these prior systems, and similarly provides facilities for exploring local neighborhoods, the Embedding Comparator primarily focuses on identifying and surfacing the similarities and differences between representations of embedding models. To do so, we compute a similarity metric for every embedded object and use this metric to drive several interactive visualizations — an approach that does not rely on attribute vectors, which may not be available when certain linear substructures do not exist in the embedding space.

2.3 Techniques for Comparing Embedding Spaces

To compare spaces, a body of work has studied techniques for aligning embeddings through linear transformation [7] and alignment of neurons or the subspaces they span [20, 41]. The Embedding Comparator does not align the learned representations, but instead exposes the objects that are most and least similar between the two vector spaces, which we define through a similarity metric defined on reciprocal local neighborhoods. However, these findings suggest that identical architectures can learn different features when retrained, providing additional evidence for the need for tools that can visualize differences between two representations.

The Python software package repcomp [33] quantifies the difference between two embedding models through similarity of local neighborhoods, though it only computes a single scalar value that measures global similarity. And, Wang et al. [42] use cosine distance to find nearest neighbors between embeddings, and then arbitrarily sample words to determine if the semantic meanings differ between the models. While we use similar strategies, the Embedding Comparator computes a similarity metric for every embedded object, and then initializes its view to begin with objects that are the most and least similar between the two models. Moreover, we present these words in an interactive graphical system that facilitates rapid exploration of differences between the embedding spaces.

3 Formative Interviews

To better understand how embedding models are currently analyzed and compared, and to identify process pitfalls and limitations with existing tools, we conducted a series of semi-structured interviews with seven embedding model users. To capture diverse points of view, we recruited participants across a range of expertise including undergraduate and graduate students, postdoctoral researchers, and software engineers in industry. Through our interviews, we were able to identify that three broad classes of users and use cases exist for embedding models: embedding researchers who wish to compare models to develop novel model architectures; domain analysts who study embedded representations to uncover properties of their application domains; and, embedding engineers who use embedding models within ML-enabled products. Our findings also help corroborate and supplement the results of the literature review conducted by Liu et al. [22]. In particular, our formative interviews corroborate the high level interpretation goals — qualitative evaluation, model explanation, and data understanding — but focus primarily on understanding how users compare embeddings as opposed to interpreting a single embedding model.

3.1 Embedding Researchers

Embedding researchers are expert users whose academic or industry research directly focuses on understanding and improving embedding models. Thus, this class of users wants to develop a deep understanding of how their models work and why, and to do so they frequently compare model variants.

Through our conversations, we identified several common comparison tasks including comparing multiple layers of the same model to understand what semantic information is being captured by each layer, comparing new models against a “ground truth” model to understand their relative strengths and weaknesses, and comparing a generic model to one that has been fine-tuned for a particular task such as comparing a general English word embedding model to the same model after being tuned for performing sentiment analysis.

To perform these comparisons, researchers use a mix of quantitative performance metrics (e.g., clustering embedding vectors and calculating cluster purity) and static visualizations (e.g., heatmaps and dimensionality-reduced scatter plots). However, researchers find these approaches to be very tedious as they require a lot of repetitive manual specification (e.g., to create a heatmap of every node of every layer in their model) and need to be tweaked, customized, or sometimes rewritten to accommodate different model characteristics or idiosyncrasies.
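As an illustration of the kind of quantitative metric mentioned above, cluster purity can be computed as the fraction of objects that carry the majority label of their assigned cluster. The sketch below is a minimal version under stated assumptions: the function name `cluster_purity`, the cluster assignments, and the labels are all hypothetical, and the clustering step itself (e.g., k-means over the embedding vectors) is assumed to have been run already.

```python
from collections import Counter

def cluster_purity(cluster_ids, labels):
    """Fraction of objects whose label is the majority label of their cluster."""
    members_by_cluster = {}
    for cluster_id, label in zip(cluster_ids, labels):
        members_by_cluster.setdefault(cluster_id, []).append(label)
    # For each cluster, count how many members share its most common label.
    majority_total = sum(
        Counter(members).most_common(1)[0][1]
        for members in members_by_cluster.values()
    )
    return majority_total / len(labels)

# Hypothetical cluster assignments and ground-truth categories (assumption).
cluster_ids = [0, 0, 0, 1, 1, 1]
labels = ["animal", "animal", "vehicle", "vehicle", "vehicle", "animal"]

purity = cluster_purity(cluster_ids, labels)
print(round(purity, 3))
```

A single scalar like this summarizes a whole embedding space, which is one reason such metrics reveal little about where two models actually differ.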

3.2 Domain Analysts

Domain analysts use embeddings for non-ML related tasks. These types of users have very little or no machine learning knowledge, and are indifferent to how the embeddings are created. Instead, they wish to study the representations learned by embedding models in order to answer research questions about their particular application domain. Examples of these users include linguists, historians, social scientists, and computational biologists. For these users, embeddings are a tool or a means to an end. As one of our users said, “in computational historical linguistics, people are interested in studying how a word’s meaning changes over time. One way is to train word embedding models on texts from different eras and look at neighborhoods.”

Domain analysts more acutely experience the researchers’ frustrations with existing tools: besides the repetitive and tedious manual specification, their relative lack of ML-expertise makes it difficult to anticipate how analysis approaches must be changed when working with alternate model architectures. Moreover, these users are heavily reliant on their domain-specific expertise to guide their analysis process (e.g., to identify specific embedded objects to compare between models [19]). As a result, there is the potential that users may miss unexpected changes between models (i.e., changes to representations that their domain expertise did not anticipate).

3.3 Embedding Engineers

Embedding engineers use embedding models for downstream ML tasks or in ML-enabled products. These users have some knowledge of embedding models, but are more focused on using them for a particular task as opposed to directly researching them or studying their representations. Engineers compare embedding models to decide which model best fits their particular product needs. To compare embeddings, engineers primarily use visualization techniques and systems like the Embedding Projector [35] to explore the overall embedding space and drill down to specific objects. However, these systems are primarily concerned with visualizing a single embedding space. Thus, to compare models, users would need to adopt manual approaches (e.g., note-taking, screenshots, etc.) which can be time-consuming and error-prone (e.g., they may forget to note down an insight before proceeding with exploration).

3.4 Design Goals

To inform the design of the Embedding Comparator, we distill our formative interviews into the following design goals:

  1. Surface similarities and differences for rapid comparison. Across all users, the central goal of comparing embedding models is to understand the degree to which they are similar, and where the differences lie. And, a recurrent breakdown is how time-consuming identifying this information is with current approaches. Moreover, our interviews suggest that users are most interested in seeing the objects that are most similar or different — as one participant noted, “it is most interesting what happens at the extremes.”

  2. Provide global context. All participants use dimensionality-reduced scatter plots (like the one found in the Embedding Projector [35]) as the primary mechanism for evaluating embedding spaces. Besides users’ familiarity with these views, our interviews highlighted that global geometries provide necessary context to meet our first design goal as the overall shape and density of a scatter plot can reveal similarities and differences in a glance. For example, one participant described how viewing the global projection of a particular embedding model caused them to stumble upon an unexpected result that they later wrote a paper about.

  3. Display local neighborhoods. While projection plots provide useful global context, the substance of users’ analysis occurs by drilling down into the local neighborhoods of embedded objects. For example, a researcher noted that “[papers] often report a few nearest neighbors of their models to show that they capture some properties.” However, with existing tools, this approach unfolds in an unstructured and ad hoc way. Thus, users like domain analysts are left concerned that they may have missed an important result. In response, the Embedding Comparator must provide a structured interface element and invoke it in a consistent way to display this information.

  4. Interactively link global and local views. Participants consistently expressed frustration with the largely static nature of their existing embedding analysis tools — the lack of interactivity slows down their analysis processes and makes it difficult to deeply explore and identify differences that are not obvious. Thus, the Embedding Comparator should use interactive techniques to link global and local views together including allowing users to select local views from global views, highlight local neighborhoods within the global projections, and perform targeted searches to surface the local neighborhoods of specific embedded objects.

4 System Design

Informed by our formative interviews, the Embedding Comparator computes a similarity score for every embedded object based on its reciprocal local neighborhoods. Critically, this similarity metric does not require the two models to have the same dimensionality nor do they need to be aligned in any way. As a result, the Embedding Comparator is capable of supporting a wide range of embedding models. To allow users to rapidly identify similarities and differences between the models, this similarity metric is visualized via a number of global views (including projection plots and a histogram distribution) as well as through small multiple views of local neighborhoods called Local Neighborhood Dominoes.

4.1 Computing Local Neighborhood Similarity

An embedding space is a function $E: V \rightarrow \mathbb{R}^d$ that maps objects in vocabulary $V$ into a $d$-dimensional real-valued vector space. For example, $V$ may be a vocabulary of English words, and $E$ maps each word into a 200-dimensional vector. Such word embeddings are commonly employed in NLP models.

Here, we consider two embedding spaces $E_1$ and $E_2$ over the same set of objects $V$. Note that the embedding spaces may have different bases and may even have a different number of dimensions ($d_1 \neq d_2$). For this reason, we compute the similarity for each object in the embedding space through similarity of the reciprocal local neighborhood around the object in each of the embedding spaces. Precisely, for each object $w \in V$, we compute the local neighborhood similarity (LNS) of $w$ between the two embedding spaces as:

$$\mathrm{LNS}(w) = \mathrm{sim}\big(\mathrm{kNN}(w, E_1),\ \mathrm{kNN}(w, E_2)\big) \tag{1}$$

where $\mathrm{kNN}(w, E)$ returns the $k$-nearest neighbors of $w$ in embedding space $E$ and $\mathrm{sim}$ is a similarity metric between the two lists of nearest neighbors. Here, we compute $k$-nearest neighbors with cosine or Euclidean distance between the embeddings [40] and take $\mathrm{sim}$ as the Jaccard (intersection over union) similarity between the sets of neighbors. The Jaccard similarity between two sets $A$ and $B$ is defined as $J(A, B) = |A \cap B| / |A \cup B|$. Note that this value scales between 0 and 1, where 1 implies the two sets are identical and 0 implies the sets are disjoint. Users can choose between these different distance and similarity functions in the interface, or introduce additional functions via a JavaScript API call.
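The local neighborhood similarity described above can be sketched in a few lines. This is a minimal illustration rather than the system's implementation: the function names (`k_nearest`, `jaccard`, `local_neighborhood_similarity`) and the toy vectors are hypothetical, cosine distance is used as one of the two distance options the text mentions, and the brute-force neighbor search is only suitable for small vocabularies. Note that the two toy spaces deliberately have different dimensionalities, which the metric tolerates because only neighbor sets are compared.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def k_nearest(word, embedding, k, distance=cosine_distance):
    """Set of the k objects closest to `word` (excluding itself), by brute force."""
    others = [w for w in embedding if w != word]
    others.sort(key=lambda w: distance(embedding[word], embedding[w]))
    return set(others[:k])

def jaccard(a, b):
    """Intersection over union of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def local_neighborhood_similarity(word, emb1, emb2, k):
    """Jaccard similarity of the word's k-nearest-neighbor sets in both spaces."""
    return jaccard(k_nearest(word, emb1, k), k_nearest(word, emb2, k))

# Hypothetical toy embeddings over the same vocabulary (assumption):
# model 1 is 2-dimensional, model 2 is 3-dimensional.
emb1 = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "kitten": [0.8, 0.3],
        "car": [0.0, 1.0], "truck": [0.1, 0.9]}
emb2 = {"cat": [1.0, 0.0, 0.0], "dog": [0.9, 0.1, 0.0], "kitten": [0.0, 0.0, 1.0],
        "car": [0.0, 1.0, 0.0], "truck": [0.8, 0.5, 0.0]}

lns_cat = local_neighborhood_similarity("cat", emb1, emb2, k=2)
print(round(lns_cat, 3))
```

In this toy case "cat" has neighbors {dog, kitten} in the first space and {dog, truck} in the second, so one of three distinct neighbors is shared and the score is 1/3.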

4.2 Global Views

Figure 1: Embedding Comparator configuration options and Global Views. (A) Users can select a dataset and a pair of embedding models. (B) Interactive Global Projection plots visualize the geometry of each embedding space, with color encoding how similar the object’s local neighborhood is between the models. The distribution of these similarity values is also shown in a histogram (C). Users can tune the parameters used for defining local neighborhoods and computing vector distances (D) and search for particular objects of interest (E).

The Embedding Comparator’s left-hand sidebar (shown in Figure 1) provides configuration options and interactive global views of the embedding spaces. A user begins by selecting a Dataset, which specifies the embedding vocabulary, followed by two Models, each of which defines the embedding space for that vocabulary (Figure 1A). Users can load arbitrarily many models, and by decoupling datasets from models, the Embedding Comparator makes it easy to compare several different models trained with the same vocabulary.

Beneath each model, the Embedding Comparator shows a Global Projection (Figure 1B): a scatter plot that depicts the geometric structure of the model’s embedding space, with object names shown in a tooltip on hover. As per Design Goal 2, these projections provide valuable context during exploration, and can help reveal interesting global properties such as the presence of distinct clusters. We use PCA to perform dimensionality reduction because, as compared to alternatives like t-SNE, it is deterministic and highlights rather than distorts the global structure of the embedding space [39]. However, we provide a modular system design, and users can choose to use alternative dimensionality reduction techniques.

The Similarity Distribution (Figure 1C) displays the distribution of LNS similarity values (Equation 1) over all objects in the embedding space. Bars are colored using a diverging red-yellow-blue color scheme to draw attention to the most extreme values (objects that are the most and least similar between the two selected models) in accordance with Design Goal 1. We reapply this color encoding in the Global Projections to help users draw connections between the two visualizations, and to help reveal global patterns or clusters of objects with related similarity values (such as those in the case studies of Figure 4 and Figure 5). Both the Similarity Distribution and the Global Projections can be used to interactively filter the Local Neighborhood Dominoes (Design Goal 4; see below), and the Search Bar (Figure 1E) can be used to populate specific object(s) of interest.
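The Similarity Distribution amounts to binning per-object similarity scores over the interval from 0 to 1. The sketch below shows one way to do this with equal-width bins; the function name `similarity_histogram`, the bin count, and the scores are hypothetical, and the binning used by the actual interface may differ.

```python
def similarity_histogram(scores, num_bins=10):
    """Count scores in [0, 1] per equal-width bin; a score of 1.0 goes in the last bin."""
    counts = [0] * num_bins
    for score in scores:
        index = min(int(score * num_bins), num_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical similarity scores for six embedded objects (assumption).
scores = [0.0, 0.05, 0.33, 0.5, 0.95, 1.0]

counts = similarity_histogram(scores, num_bins=4)
print(counts)
```

The first and last bins hold the objects that changed the most and the least between the two models, which are the extremes the color scheme is meant to emphasize.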

The Parameter Controls (Figure 1D) enable the user to interactively change the value of $k$ used to define the size of local neighborhoods for computing similarity, and select the distance metric used for computing distance between vectors. The default value of $k$ is 50, which we found to provide insightful results across our various experiments and case studies. Changes in either of these controls immediately update the rest of the Embedding Comparator interface.

4.3 Local Neighborhoods

Figure 2: Example of a Local Neighborhood Domino (for the word “mean”). (A) Unselected domino. (B) Hovering over one of the words in the Common or Unique Neighbor Lists or in either Neighborhood Plot highlights that word in each of the Neighborhood Plots.

To meet Design Goal 3, the Embedding Comparator introduces Local Neighborhood Dominoes: a small multiples visualization to surface local substructures and facilitate rapid comparisons. Each domino consists of a set of interactive Neighborhood Plots and Common and Unique Neighbor Lists.

The neighborhood plots, side-by-side scatter plots that show the nearest neighbors of the domino object in each model, graphically display the relationships between the object and its neighbors. These plots use the same PCA projections as the Global Projection views to ensure that all geometries are visualized consistently, and to help users quickly gain insight regardless of model type (alternative techniques such as t-SNE often require non-trivial per-model parameter tuning [43]). Color is used to encode whether each neighbor is common to both models, or unique to a single model. These neighbors are also displayed as separate scrollable lists above and below the plots, respectively, with neighbors sorted by the distance to the domino object. For example, Figure 2A shows the domino for the word “mean” from embeddings trained on text from the early 1800s (model A, left) and the 1990s (model B, right). To facilitate cross-model comparisons, dominoes are also interactive: hovering over a neighbor in the plots or the lists highlights it across the entire domino (Fig. 2B). Motivated by Design Goal 1, and our participant’s note that the most interesting insights often lie at the extremes, the default view of the Embedding Comparator lists two columns of dominoes: the first displays dominoes of the least similar objects (in increasing order) and the second column displays those of the most similar objects (in decreasing order). To increase information scent [31], we adapt the Scented Widgets technique [44] and augment the scroll bars with a list of domino objects and sparklines of their similarity scores.
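The data behind a domino and the default two-column ordering can be sketched as follows. The helper names (`split_neighbors`, `domino_columns`), the neighbor lists, and the scores are hypothetical and for illustration only. One assumption is worth flagging: the text does not say which model's distances order the common list, so this sketch orders common neighbors by their rank in model A.

```python
def split_neighbors(neighbors_a, neighbors_b):
    """Split two nearest-first neighbor lists into common and per-model unique lists."""
    set_a, set_b = set(neighbors_a), set(neighbors_b)
    # Assumption: common neighbors keep their order from model A.
    common = [w for w in neighbors_a if w in set_b]
    unique_a = [w for w in neighbors_a if w not in set_b]
    unique_b = [w for w in neighbors_b if w not in set_a]
    return common, unique_a, unique_b

def domino_columns(scores, count):
    """Return (least similar in increasing order, most similar in decreasing order)."""
    least = sorted(scores, key=scores.get)[:count]
    most = sorted(scores, key=scores.get, reverse=True)[:count]
    return least, most

# Hypothetical nearest-first neighbor lists for the word "mean" (assumption).
neighbors_a = ["signify", "convey", "imply"]
neighbors_b = ["average", "convey", "median"]
common, unique_a, unique_b = split_neighbors(neighbors_a, neighbors_b)

# Hypothetical per-word similarity scores (assumption).
scores = {"mean": 0.10, "seven": 0.90, "gay": 0.05, "house": 0.60}
least, most = domino_columns(scores, count=2)

print(common, unique_a, unique_b)
print(least, most)
```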

The dominoes’ information-dense display is designed to facilitate rapid acquisition of neighborhood-level insights. By scanning down the dominoes, users see not only the geometries involved but also specific common and unique neighbors to trigger hypothesis generation. Previous iterations of the Embedding Comparator used separate tabular lists to display the most and least similar words across models (and common and unique neighbors for individual objects) but, in early user tests, we found that such a presentation produced a high cognitive load as users tried to map back and forth between the various lists. Thus, with the domino design, we encapsulate all local neighborhood information associated with a given embedded object into a single interface element while still supporting cross-model comparisons. For instance, returning to the “mean” domino in Figure 2A, the neighbor plots reveal substructures within the local neighborhoods — in model B, the bottom appears to relate to the mathematical notion of “mean”, while the top is more synonymous with “convey” and shares neighbors with model A. And, we confirm this hypothesis by scanning the common and unique lists, where we see more mathematical words under model B than A.

4.4 Linking Global and Local Views

Figure 3: Interactions between Global Views and Local Neighborhood Dominoes. Hovering over a domino (here, “insult”) highlights the object and its local neighborhood in the Global Projections. Brushing on the Similarity Distribution filters dominoes to those in the selected range (here, between roughly 8% and 35% similarity).

Linking the global and local views is critical for allowing users to rapidly iterate between considering the overall embedding spaces and inspecting specific points of interest. When hovering over a domino, the object and its local neighborhoods are highlighted in each of the respective Global Projections with the purple/green color encoding preserved (Fig. 3). This interaction allows a user to contextualize local neighborhoods within the overall embedding space. Similarly, interactive selections (e.g., brushing or lassoing) in the global projections or similarity histogram filter the list of dominoes, allowing users to drill down and investigate specific areas of interest.
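Brushing on the similarity histogram reduces to a range filter over the per-object scores. The following minimal sketch assumes an inclusive range and returns matching objects from least to most similar; the function name `filter_by_similarity` and the scores are hypothetical, and the real interface may treat range boundaries and ordering differently.

```python
def filter_by_similarity(scores, low, high):
    """Objects whose similarity lies in [low, high], ordered from least to most similar."""
    selected = [word for word, score in scores.items() if low <= score <= high]
    return sorted(selected, key=lambda word: scores[word])

# Hypothetical similarity scores (assumption); the range mirrors the
# "roughly 8% to 35%" brush shown in Figure 3.
scores = {"insult": 0.20, "seven": 0.90, "bore": 0.08,
          "avoid": 0.35, "europe": 0.75}

brushed = filter_by_similarity(scores, low=0.08, high=0.35)
print(brushed)
```

A lasso selection in a Global Projection could be handled the same way, with the range test replaced by a point-in-region test on the projected coordinates.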

5 Case Studies

In this section, we illustrate the types of insights the Embedding Comparator helps reveal through three case studies. These case studies map to the use cases we identified during our formative interviews and utilize three popular word embedding models — fastText [27], word2vec [28], and GloVe [30]. Thus, we demonstrate that the Embedding Comparator scaffolds and accelerates real-world analysis processes.

5.1 Transfer Learning for Fine-tuned Word Embeddings

Figure 4: View of the Embedding Comparator applied to Case Study: Transfer Learning for Fine-tuned Word Embeddings. In this case study we compare a word embedding model trained on a large corpus of English text before and after it is fine-tuned for a sentiment analysis task. The Embedding Comparator reveals the effect of transfer learning on the geography of the global embedding space via the Global Projections, as well as direct changes to neutral sentiment words that have taken on sentiment meanings (e.g., “avoid”) via the Local Neighborhood Dominoes.

Transfer learning is the process of training generic embeddings on a large dataset to capture general domain semantics, and then applying them in a new, related domain with more limited availability of labeled data. Our formative interviews revealed that researchers engage in this process to improve the performance and robustness of their embeddings [23, 15] but that existing tools make it difficult to analyze the trade-off between generalizability and domain-specific expressiveness.

We developed this case study in collaboration with an embedding model researcher who explores applications of transfer learning in NLP. Here, the researcher wishes to train an LSTM recurrent neural network [14] that predicts whether movie reviews express positive or negative sentiment using the Internet Movie Database (IMDb) dataset [24], which contains only 25,000 labeled training examples (see supplementary material for details). To get the most out of this small labeled dataset, the researcher initializes the network using fastText [27], a model of the English language with pre-trained English word embeddings, and then refines it using the movie review data. Once complete, the researcher is interested in investigating the effect of the refinement process: for example, identifying words that are ordinarily synonyms but not in the context of sentiment prediction or words which are not ordinarily synonyms but become interchangeable in the context of sentiment prediction.

We use the Embedding Comparator to compare the pre-trained fastText word embeddings with those fine-tuned for the sentiment analysis task. Figure 4 shows the initial view in the Embedding Comparator after loading both models. Our system immediately surfaces a number of insights about how the embedding space has changed as a result of fine-tuning. The color encoding shared between the Similarity Distribution and Global Projections is effective for revealing how the words that changed the most as a result of fine-tuning have also moved toward the outer regions of the embedding space. Upon further inspection by hovering over the left and right regions of the fine-tuned embedding space Global Projection, we find that positive sentiment words have moved toward the left side of the projection, while negative sentiment words moved toward the right. These Global Views enable us to identify how optimizing the embeddings for this binary classification task has affected the global shape of the vector space.

By interactively selecting (or brushing) areas of interest in any of these views, we are able to dive into the details of the affected data. For example, in the original space “bore” is most closely related to “carrying” or “containing”, but after fine-tuning for sentiment analysis it is most closely related to “boredom”, “boring”, or “dull” (Figure 7); hence, fine-tuning has had the intended effect, where “bore” now takes on its sentiment-related meaning. Similarly, “redeeming” has been associated with much stronger negative sentiment (see Figure 7). Besides words, the dominoes reveal that numbers less than 10 have also been affected by the sentiment analysis task; for example, “7” has become more closely related to positive adjectives as a result of numeric scales used to rank movies within the reviews (Figure 7). Words that remained unchanged seem uncorrelated with movie sentiment, such as proper nouns like “vietnam” and “europe” as well as “cinematographer” (Figure 10).

The embedding researcher we worked with on this case study was excited about the results the Embedding Comparator helped reveal. Generating these types of insights with prior tools would require significant tedious effort: besides manually constructing the necessary views (e.g., within a Jupyter notebook), the researcher would have needed to formulate hypotheses a priori about which specific words to investigate further. In contrast, by calculating a similarity score for every word, and by visualizing local neighborhoods as dominoes, the Embedding Comparator surfaces this information more directly. As a result, the system transforms the process of comparing embedding models from one that requires explicit, manual steering by the researcher into more of a browsing experience. This shift frees the researcher to focus on generating and testing hypotheses about their models, and allows for more serendipitous discovery. For instance, while using the Embedding Comparator, the researcher discovered an unexpected result: the word “generous” took on a more negative sentiment after fine-tuning. Explaining this finding required digging into the training data to uncover phrases such as “I’m being generous giving this movie 2 stars” and “2 stars out of a possible 10 and that is being overly generous.”

5.2 Evolution of Language with Diachronic Word Embeddings

Figure 5: View of the Embedding Comparator applied to Case Study: Evolution of Language with Diachronic Word Embeddings. In this case study, we compare a word embedding model trained on literature written in 1900-1910 with one trained on literature written in 1990-2000. Here the Embedding Comparator surfaces insights into how the English language has changed over the course of the century. For example, the word “gay” has changed in meaning from “happy” to “homosexual”, while numbers have maintained their meaning.

Our second case study follows how a domain analyst (a linguist) employs embedding models to study the evolution of languages over time. Previous work [12] has shown that word embeddings capture diachronic changes (i.e., changes over time) in language, and has proposed new statistical laws of semantic change based on these embeddings. Here, we use diachronic word embeddings from HistWords [12], trained on English books written from 1800 – 2000 grouped by decade. We select embeddings from five different decades spanning this time period (1800 – 1810, 1850 – 1860, 1900 – 1910, 1950 – 1960, and 1990 – 2000; see supplementary material for details) and evaluate how the Embedding Comparator surfaces words whose semantics have evolved over time.

Figure 5 shows the Embedding Comparator comparing embeddings of text written in 1900 – 1910 to text written in 1990 – 2000. The Embedding Comparator immediately surfaces the insights presented by Hamilton et al. [12], such as the change in meaning of “gay” (from “happy” to “homosexual”) and “major” (from “military” to “main” or “important”) over time. It also reveals words such as “aids”, whose meaning changed from “assists” to the disease HIV/AIDS, which was not named until the early 1980s [10], along with many other words that are similarly ripe for further linguistic analysis (see Figure 8). Critically, in contrast to the original analysis [12], the Embedding Comparator requires users neither to manually align the various embedding spaces nor to define and compute a task-specific semantic displacement metric to uncover these findings, because our method for comparing embedded objects through local neighborhoods is agnostic to application domains and tasks. As a result, the Embedding Comparator scaffolds and accelerates the analysis process for users regardless of their ML expertise: novice users need minimal technical knowledge to replicate state-of-the-art linguistic analysis, while more expert users can devote their effort to designing task-specific metrics only when necessary.
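To make the alignment-free comparison concrete, the following is a minimal sketch of a local neighborhood similarity score. It rests on assumptions that this section does not state: neighbors are ranked by cosine similarity, and the score is the Jaccard overlap of the two k-nearest-neighbor sets. The function names and the toy 2-D vectors are hypothetical and are not taken from the system's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbors(word, embedding, k):
    """Set of the k words closest to `word` within a single embedding space."""
    others = [w for w in embedding if w != word]
    others.sort(key=lambda w: cosine(embedding[word], embedding[w]), reverse=True)
    return set(others[:k])

def neighborhood_similarity(word, emb_a, emb_b, k):
    """Jaccard overlap of the word's k-nearest-neighbor sets in the two spaces.

    Assumption: Jaccard overlap is one plausible choice of score; each space is
    only ever compared with itself, so the two spaces never need to be aligned.
    """
    nn_a = nearest_neighbors(word, emb_a, k)
    nn_b = nearest_neighbors(word, emb_b, k)
    union = nn_a | nn_b
    return len(nn_a & nn_b) / len(union) if union else 1.0

# Hypothetical toy vectors: only "bore" moves between the two models.
pretrained = {"bore": (1.0, 0.1), "carrying": (0.9, 0.2), "containing": (0.8, 0.3),
              "boring": (0.1, 1.0), "dull": (0.2, 0.9)}
fine_tuned = dict(pretrained, bore=(0.1, 1.0))

scores = {w: neighborhood_similarity(w, pretrained, fine_tuned, k=2) for w in pretrained}
```

In this toy setup, the neighbors of “bore” switch from {“carrying”, “containing”} to {“boring”, “dull”}, so its score drops to zero. Sorting words by such a score is one way to bring the most-changed words to the top without specifying them in advance.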

Using this case study, we can also see the benefits of being able to easily switch between alternative models trained against the same dataset. For example, using the corresponding drop down menus, an analyst could switch to comparing 1800 – 1810 vs. 1900 – 1910 to see that “nice” moves away from meaning “refined” and “subtle” and moves toward “pleasant” (Figure 8), in line with previous findings [12]. Similarly, if we fix one model to 1990 – 2000 and vary the other, the similarity distribution histogram would directly visualize the pairwise diachronic changes: bars would shift rightward as models trained on more recently written text have more similar words.

5.3 Word Embeddings Pre-trained on Different Corpora

Figure 6: View of the Embedding Comparator applied to Case Study: Word Embeddings Pre-trained on Different Corpora. In this case study, we compare two word embedding models trained on different datasets. Model A was trained on Wikipedia and Newswire data, while Model B was trained on data from Twitter. Via the Similarity Distribution, the Embedding Comparator surfaces that despite having the same model architecture, the two models represent words quite differently. By looking more closely at the Local Neighborhood Dominoes, we find insights into why, including that the Twitter corpus contains Spanish words (e.g. “solo”) and emphasizes popular culture references (e.g. “swift”).

Our final case study evokes a common use case experienced by embedding engineers: choosing between models that appear equally viable for use in a downstream application (e.g., predicting topics based on customer review text). GloVe [30] is a popular model that offers several variants of embeddings trained with different datasets. Here, we demonstrate how the Embedding Comparator can be used to understand the impact of training GloVe with data from either Wikipedia & Newswire or Twitter (see supplementary material for details).

As Figure 6 shows, the Embedding Comparator immediately displays a number of differences between these two pre-trained embedding models that arise due to the datasets on which they were trained. Among the words that differ most between the two embedding spaces are shorthand or slang expressions such as “bc”, “bout”, and “def” (Figure 9). The local neighborhood dominoes for these words show the ways in which their semantic meanings differ between the two models. For example, in the Wikipedia & Newswire model “def” is associated with the notion of being defeated and “beats” (as in sporting results) and hence countries (e.g. “canada” and “usa”), while in the Twitter model it is most similar to conversational words such as “definitely” and



Another insight revealed by the Embedding Comparator is the difference in languages present in the training corpora. Words such as “era”, “dale”, and “solo” take on their English meanings in the Wikipedia & Newswire model, but are related to Spanish words in the Twitter variant. This finding suggests that the Twitter model was trained on multi-lingual data, while the Wikipedia & Newswire model may be limited to English.

Finally, the Embedding Comparator reveals how words such as “swift” and “galaxy” may be used very differently in different media. In the Wikipedia & Newswire model, “swift” refers to the adjective swift (i.e. quick), whereas in Twitter, “swift” is related to the musical artist Taylor Swift. Likewise, “galaxy” refers either to space or to the Samsung Galaxy electronics based on whether embeddings were trained on Wikipedia & Newswire or on Twitter, respectively (Figure 9).

Using these insights from the Embedding Comparator, an engineer can make a more informed decision about which set of embeddings may be more appropriate to adopt for their system. For example, if classifying longer or more formal customer reviews, the model trained on Wikipedia & Newswire would likely perform better, but if classifying casually written reviews that contain slang or multi-lingual text, the model trained on Twitter may generalize better to real-world data.

Figure 7: Additional dominoes from case study: Transfer Learning for Fine-tuned Word Embeddings. The word “bore” has changed in meaning from a general definition: “carried”, to a more sentiment rich definition: “dull”. “redeeming” has changed in sentiment from the positive sentiment definition: “compensate for faults” to a negative sentiment definition likely related to the reviewer idiom “no redeeming qualities”. The number “7” has changed from its definition as a numeric symbol to a number indicative of score (e.g. 7 out of 10).
Figure 8: Additional dominoes from case study: Evolution of Language with Diachronic Word Embeddings. The domino for “nice” compares its definition in 1800-1810, “fine”, with its definition in 1900-1910, by which time it had become “pleasant”. The word “aids” is compared from 1900-1910, when it was synonymous with “assists”, to 1990-2000, when it became associated with the condition caused by HIV. “score” is compared over the course of the 20th century as it moved in meaning from “year” (e.g. four score) to a measure of rank.
Figure 9: Additional dominoes from case study: Word Embeddings Pre-trained on Different Corpora. Using the model trained on news text, “def” is short for “defeated”, whereas using the model trained on Twitter data, “def” is slang for “definitely”. The word “dale” is an English name in the news model, but is represented by its Spanish meaning in the model trained on Twitter. “galaxy” in the news model is related to space, but in the Twitter model is related to the Samsung Galaxy line of phones.
Figure 10: Additional dominoes from each case study where the neighborhood of the word has not changed. From the Transfer Learning for Fine-tuned Word Embeddings case study, “cinematographer” does not change when fine-tuning a general English model on a movie review dataset because “cinematographer” in English already refers to the film industry. From the Evolution of Language with Diachronic Word Embeddings case study, “25” has not changed from 1800 to 2000, indicating that numbers are less susceptible to chronological changes in meaning than other words. “fernando” from the Word Embeddings Pre-trained on Different Corpora case study does not change in meaning whether the model was trained on news data or Twitter data, likely because it is a proper noun with no other meanings.

6 Discussion and Future Work

In this paper, we present the Embedding Comparator, a novel interactive system for analyzing and comparing embedding spaces. Informed by formative interviews conducted with embedding researchers, engineers, and domain analysts, the design of the Embedding Comparator balances between visualizing information about the overall embedding spaces and displaying information about local neighborhoods. To directly surface similarities and differences, a similarity score is computed for every embedded object and these scores are encoded across global and local views. And, to facilitate rapid comparisons, we introduce Local Neighborhood Dominoes: small multiple visualizations of local neighborhood geometries and lists of common and unique objects. Through a series of case studies, grounded in use cases described by our interview participants, we demonstrate how the Embedding Comparator transforms the analysis process from requiring tedious and error-prone manual specification to browsing and interacting with graphical displays. Moreover, we see that by computing a similarity metric, and using it to drive the various views, the Embedding Comparator is able to more immediately surface interesting insights, and published domain-specific results can be replicated with only a handful of interactions.

Further study of our design goals suggests several compelling avenues for future work. One straightforward extension, to provide richer linking between global and local views, is to consider how metadata tags (e.g., part-of-speech, sentiment, etc.) can be visualized and used to filter dominoes. Our similarity metric usefully surfaces differences between individual models, and an interesting next step would generalize this metric to consider pairwise (or n-wise) similarities. Visualizing and interactively manipulating this metric could accelerate workflows akin to those encountered with the HistWords case study. Finally, how might interface elements in the Embedding Comparator allow users to link to specific examples in the training data? Such interaction would help further tighten the question-answering loop by providing fine-grained context for understanding why particular similarities or differences exist between the same object across multiple models.

7 Acknowledgments

This work is supported by a grant from the MIT-IBM Watson AI Lab. We also thank Jonas Mueller for helpful feedback.


  • [1]
  • [2] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Network Dissection: Quantifying Interpretability of Deep Visual Representations. In Computer Vision and Pattern Recognition.
  • [3] David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B. Tenenbaum, William T. Freeman, and Antonio Torralba. 2019. GAN Dissection: Visualizing and Understanding Generative Adversarial Networks. In Proceedings of the International Conference on Learning Representations (ICLR).
  • [4] Samy Bengio and Georg Heigold. 2014. Word embeddings for speech recognition. In Fifteenth Annual Conference of the International Speech Communication Association.
  • [5] Tristan Bepler and Bonnie Berger. 2019. Learning protein sequence embeddings using information from structure. In International Conference on Learning Representations.
  • [6] Brandon Carter, Jonas Mueller, Siddhartha Jain, and David Gifford. 2019. What made you do this? Understanding black-box decisions with sufficient input subsets. In Artificial Intelligence and Statistics.
  • [7] Juntian Chen, Yubo Tao, and Hai Lin. 2018. Visual exploration and comparison of word embeddings. Journal of Visual Languages & Computing 48 (2018), 178–186.
  • [8] François Chollet and others. 2015. Keras. https://keras.io. (2015).
  • [9] Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and Karen Simonyan. 2017. Neural audio synthesis of musical notes with wavenet autoencoders. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 1068–1077.
  • [10] Centers for Disease Control (CDC) and others. 1982. Update on acquired immune deficiency syndrome (AIDS)–United States. MMWR. Morbidity and mortality weekly report 31, 37 (1982), 507.
  • [11] David Ha and Douglas Eck. 2018. A Neural Representation of Sketch Drawings. In International Conference on Learning Representations.
  • [12] William L Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1489–1501.
  • [13] Florian Heimerl and Michael Gleicher. 2018. Interactive analysis of word vector embeddings. In Computer Graphics Forum, Vol. 37. Wiley Online Library, 253–265.
  • [14] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
  • [15] Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 328–339.
  • [16] Minsuk Kahng, Pierre Y Andrews, Aditya Kalro, and Duen Horng Polo Chau. 2017. ActiVis: Visual exploration of industry-scale deep neural network models. IEEE transactions on visualization and computer graphics 24, 1 (2017), 88–97.
  • [17] Diederik P Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations.
  • [18] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 8 (2009), 30–37.
  • [19] Susan Leavy, Karen Wade, Gerardine Meaney, and Derek Greene. 2018. Navigating Literary Text with Word Embeddings and Semantic Lexicons. In Workshop on Computational Methods in the Humanities 2018 (COMHUM 2018), Lausanne, Switzerland, 4-5 June 2018.
  • [20] Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, and John E Hopcroft. 2016. Convergent Learning: Do different neural networks learn the same representations?. In Proceedings of the International Conference on Learning Representations (ICLR).
  • [21] Zachary C Lipton. 2016. The mythos of model interpretability. arXiv preprint arXiv:1606.03490 (2016).
  • [22] Yang Liu, Eunice Jun, Qisheng Li, and Jeffrey Heer. 2019. Latent Space Cartography: Visual Analysis of Vector Space Embeddings. Computer Graphics Forum (Proc. EuroVis) (2019). http://idl.cs.washington.edu/papers/latent-space-cartography
  • [23] Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3431–3440.
  • [24] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Portland, Oregon, USA, 142–150. http://www.aclweb.org/anthology/P11-1015
  • [25] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, Nov (2008), 2579–2605.
  • [26] Leland McInnes, John Healy, and James Melville. 2018. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018).
  • [27] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in Pre-Training Distributed Word Representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
  • [28] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.
  • [29] Chris Olah, Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Katherine Ye, and Alexander Mordvintsev. 2018. The building blocks of interpretability. Distill 3, 3 (2018), e10.
  • [30] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 1532–1543.
  • [31] Peter Pirolli. 2003. Exploring and finding information. HCI models, theories and frameworks: Toward a multidisciplinary science (2003), 157–191.
  • [32] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, 1135–1144.
  • [33] Dan Shiebler. 2018. repcomp. (2018). https://pypi.org/project/repcomp/
  • [34] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2017. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 3145–3153.
  • [35] Daniel Smilkov, Nikhil Thorat, Charles Nicholson, Emily Reif, Fernanda B Viégas, and Martin Wattenberg. 2016. Embedding projector: Interactive visualization and interpretation of embeddings. arXiv preprint arXiv:1611.05469 (2016).
  • [36] Hendrik Strobelt, Sebastian Gehrmann, Michael Behrisch, Adam Perer, Hanspeter Pfister, and Alexander M Rush. 2019. Seq2Seq-Vis: A Visual Debugging Tool for Sequence-to-Sequence Models. IEEE transactions on visualization and computer graphics 25, 1 (2019), 353–363.
  • [37] Hendrik Strobelt, Sebastian Gehrmann, Hanspeter Pfister, and Alexander M Rush. 2018. LSTMVis: A tool for visual analysis of hidden state dynamics in recurrent neural networks. IEEE transactions on visualization and computer graphics 24, 1 (2018), 667–676.
  • [38] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 3319–3328.
  • [39] TensorFlow. 2019. Embeddings | TensorFlow Core. (2019). https://www.tensorflow.org/guide/embedding, accessed 2019-09-20.
  • [40] Peter D Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of artificial intelligence research 37 (2010), 141–188.
  • [41] Liwei Wang, Lunjia Hu, Jiayuan Gu, Zhiqiang Hu, Yue Wu, Kun He, and John Hopcroft. 2018a. Towards understanding learning representations: To what extent do different neural networks learn the same representation. In Advances in Neural Information Processing Systems. 9584–9593.
  • [42] Yanshan Wang, Sijia Liu, Naveed Afzal, Majid Rastegar-Mojarad, Liwei Wang, Feichen Shen, Paul Kingsbury, and Hongfang Liu. 2018b. A comparison of word embeddings for the biomedical natural language processing. Journal of biomedical informatics 87 (2018), 12–20.
  • [43] Martin Wattenberg, Fernanda Viégas, and Ian Johnson. 2016. How to use t-SNE effectively. Distill 1, 10 (2016), e2.
  • [44] Wesley Willett, Jeffrey Heer, and Maneesh Agrawala. 2007. Scented widgets: Improving navigation cues with embedded visualizations. IEEE Transactions on Visualization and Computer Graphics 13, 6 (2007), 1129–1136.
  • [45] Kevin K Yang, Zachary Wu, Claire N Bedbrook, and Frances H Arnold. 2018. Learned protein embeddings for machine learning. Bioinformatics 34, 15 (2018), 2642–2648.

S1 Details of Case Studies

Here, we detail the datasets and preprocessing steps used in our case studies.

S1.1 Transfer Learning for Fine-tuned Word Embeddings

We downloaded pre-trained fastText [27] word embeddings, which are available online at https://fasttext.cc/docs/en/english-vectors.html. We use the 300-dimensional wiki-news-300d-1M embeddings consisting of 1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus, and statmt.org news datasets.

We train an LSTM [14] to classify binary sentiment in movie reviews from the Large Movie Review dataset [24] containing 25000 training reviews and 25000 test reviews from the Internet Movie Database (IMDb). We use default tokenization settings for this dataset as provided in Keras [8]. We define our vocabulary as the top 5000 most frequent words in the movie review dataset and truncate reviews to a maximum length of 500 words (with pre-padding). Our recurrent neural network architecture is defined as follows:

  1. Input/Embeddings Layer: Sequence with 500 words. The word at each timestep is represented by a 300-dimensional embedding.

  2. LSTM: Recurrent layer with 100-unit LSTM (forward direction only, dropout = 0.2, recurrent dropout = 0.2).

  3. Dense: 1 neuron (sentiment output), sigmoid activation.
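The fixed 500-token input in step 1 comes from the truncation and pre-padding mentioned before the list. The sketch below shows one way to do this in plain Python. It assumes that truncation drops tokens from the front of a review, which is the default of the Keras `pad_sequences` utility; the text does not say which side is cut. The helper name `pad_or_truncate` and the padding id 0 are illustrative choices.

```python
def pad_or_truncate(tokens, max_len=500, pad_id=0):
    """Return exactly max_len token ids.

    Long sequences keep their last max_len tokens (assumed front truncation);
    short sequences are padded at the front ("pre-padding") with pad_id.
    """
    kept = list(tokens)[-max_len:] if max_len > 0 else []
    return [pad_id] * (max_len - len(kept)) + kept

# Toy examples with max_len=5 instead of 500 for readability.
short_review = pad_or_truncate([7, 8, 9], max_len=5)    # padded at the front
long_review = pad_or_truncate(range(1, 9), max_len=5)   # front tokens dropped
```

Padding at the front places the real tokens at the final timesteps, immediately before the LSTM emits the state that feeds the dense layer.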

Prior to training, the embeddings are initialized using the pre-trained fastText embeddings. Of our vocabulary of size 5000, 4891 tokens were present in the fastText embeddings. For tokens not present in fastText, we initialize embeddings as all-zero vectors.
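The initialization rule above (use the pre-trained vector when one exists, otherwise use zeros) can be sketched as follows. This is an illustrative helper, not the authors' code: the function name, the dictionary-based lookup, and the 3-dimensional toy vectors are assumptions made for brevity.

```python
def build_embedding_matrix(vocab, pretrained, dim):
    """Return (matrix, n_found) with one row per vocabulary token.

    Tokens missing from `pretrained` get an all-zero row, as described in the text.
    """
    matrix, n_found = [], 0
    for token in vocab:
        vector = pretrained.get(token)
        if vector is None:
            matrix.append([0.0] * dim)      # token absent from the pre-trained set
        else:
            matrix.append(list(vector))     # copy the pre-trained vector
            n_found += 1
    return matrix, n_found

# Hypothetical toy data; the real setting uses 5000 tokens and dim=300.
toy_pretrained = {"good": [0.1, 0.2, 0.3], "bad": [-0.1, -0.2, -0.3]}
toy_vocab = ["good", "bad", "meh"]
matrix, n_found = build_embedding_matrix(toy_vocab, toy_pretrained, dim=3)
```

In the setting described above, `n_found` would correspond to the 4891 tokens that have fastText vectors.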

We train our model for 3 epochs (batch size = 64) with the Adam optimizer [17] using default parameters in Keras [8] to minimize binary cross-entropy on the training set. The final model achieves a test set accuracy of 84.7% (training set accuracy of 85.4%). We did not further tune the architecture or hyperparameters for optimal performance.
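For readers who want to reproduce a comparable setup, the architecture and training settings described in this subsection might be expressed in Keras roughly as below. This is a sketch under assumptions: the import path `tensorflow.keras`, the use of a `Constant` initializer to load the pre-trained vectors, and the helper names `build_model` and `train` are ours, and the Keras version used in the original experiments is not stated.

```python
VOCAB_SIZE = 5000   # top 5000 most frequent words
SEQ_LEN = 500       # reviews truncated / pre-padded to 500 tokens
EMBED_DIM = 300     # fastText vector size
LSTM_UNITS = 100
DROPOUT = 0.2       # used for both dropout and recurrent dropout
EPOCHS = 3
BATCH_SIZE = 64

def build_model(embedding_matrix):
    """Build the embedding -> LSTM -> dense classifier.

    embedding_matrix: NumPy array of shape (VOCAB_SIZE, EMBED_DIM) holding the
    initial vectors. TensorFlow is imported here so that the constants above
    can be inspected without TensorFlow installed.
    """
    from tensorflow import keras

    model = keras.Sequential([
        keras.layers.Embedding(
            input_dim=VOCAB_SIZE,
            output_dim=EMBED_DIM,
            embeddings_initializer=keras.initializers.Constant(embedding_matrix),
            trainable=True,  # embeddings are fine-tuned along with the classifier
        ),
        keras.layers.LSTM(LSTM_UNITS, dropout=DROPOUT, recurrent_dropout=DROPOUT),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    # "adam" selects the Adam optimizer with Keras default parameters.
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

def train(model, x_train, y_train):
    """Fit for 3 epochs with batch size 64; x_train has shape (n_reviews, SEQ_LEN)."""
    return model.fit(x_train, y_train, epochs=EPOCHS, batch_size=BATCH_SIZE)
```

After training, the fine-tuned vectors can be read back from the first layer with `model.layers[0].get_weights()[0]` and compared with the initial matrix.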

For analysis in the Embedding Comparator, we output the initial fastText embeddings and fine-tuned embeddings for the 4891 words whose embeddings were initialized from fastText.

S1.2 Word Embeddings Pre-trained on Different Corpora

In this case study, we use pre-trained embeddings from GloVe [30], available online at https://nlp.stanford.edu/projects/glove/. The Wikipedia/newswire embeddings were trained on the Wikipedia 2014 and Gigaword 5 (newswire text) datasets containing 6 billion tokens (GloVe 6B), while the Twitter word embeddings were trained on text from Twitter tweets containing 27 billion tokens (GloVe 27B). We use the 100-dimensional embeddings trained on each of these corpora. We filter each of the embedding models to the top 10K most frequent words from its respective corpus and then intersect the resulting vocabularies, giving a shared vocabulary containing 3303 words. We use the Embedding Comparator to compare embeddings from each model for words in this shared vocabulary.
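The vocabulary filtering described above (take each model's most frequent words, then intersect) can be sketched as follows. The function name and the toy word lists are hypothetical, and the sketch assumes each model's vocabulary is available as a list ordered from most to least frequent.

```python
def shared_vocabulary(ranked_vocabs, top_n):
    """Intersect the top_n most frequent words of each model.

    ranked_vocabs: list of word lists, each sorted from most to least frequent
    (an assumption about how the vocabularies are supplied).
    Returns the shared words in the frequency order of the first list.
    """
    tops = [set(words[:top_n]) for words in ranked_vocabs]
    common = set.intersection(*tops)
    return [w for w in ranked_vocabs[0][:top_n] if w in common]

# Toy example with top_n=4 instead of 10000.
wiki_words = ["the", "of", "film", "swift", "senate"]
twitter_words = ["the", "lol", "swift", "of", "film"]
shared = shared_vocabulary([wiki_words, twitter_words], top_n=4)
```

Because the function accepts any number of ranked lists, the same helper also covers comparisons of more than two models, where the intersection is taken over all of them.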

S1.3 Evolution of Language with Diachronic Word Embeddings

We use the HistWords dataset [12] in our case study of diachronic word embeddings, which have been shown to exhibit changes in semantic meaning of words over time. We use pre-trained word embeddings from [12], accessed at https://nlp.stanford.edu/projects/histwords/. We use the All English (1800s-1990s) set of embeddings, which are 300-dimensional word2vec embeddings [28]. This dataset provides word embeddings trained on English books from each decade from 1800 to 2000. For exploration in the Embedding Comparator, we select embeddings taken from five different decades spanning this time period: 1800-1810, 1850-1860, 1900-1910, 1950-1960, and 1990-2000. We filter each embedding space to the top 10000 most frequent words from its decade and compute the intersection of these sets over the five decades we selected, producing a vocabulary containing 6121 words from each model for comparison in the Embedding Comparator.