This repository contains the DendroMap implementation for scalable and interactive exploration of image datasets in machine learning.
In this paper, we present DendroMap, a novel approach to interactively exploring large-scale image datasets for machine learning. Machine learning practitioners often explore image datasets by generating a grid of images or projecting high-dimensional representations of images into 2-D using dimensionality reduction techniques (e.g., t-SNE). However, neither approach effectively scales to large datasets because images are ineffectively organized and interactions are insufficiently supported. To address these challenges, we develop DendroMap by adapting Treemaps, a well-known visualization technique. DendroMap effectively organizes images by extracting hierarchical cluster structures from high-dimensional representations of images. It enables users to make sense of the overall distributions of datasets and interactively zoom into specific areas of interest at multiple levels of abstraction. Our case studies with widely-used image datasets for deep learning demonstrate that users can discover insights about datasets and trained models by examining the diversity of images, identifying underperforming subgroups, and analyzing classification errors. We conducted a user study that evaluates the effectiveness of DendroMap in grouping and searching tasks by comparing it with a gridified version of t-SNE and found that participants preferred DendroMap over the compared method.
Contains the data used in the DendroMap Live Site. Read the readme for instructions on how to use your own data in the DendroMap!
Visualization has helped ML practitioners perform a variety of analytics tasks such as: exploring datasets, analyzing performance results, interpreting and explaining model internals, building models, monitoring training progress, and debugging models [23, 65].
Many existing visualization tools for ML support the tasks of analyzing performance results and exploring datasets at multiple levels of abstraction, ranging from individual instances to entire classes. While ML practitioners often only use summary metrics (e.g., accuracy) or class-level statistics, visualization researchers have argued for the importance of instance-level analysis. Early works include ModelTracker, Squares, and Facets-Dive [57, 58]. These tools represent each instance as a small square using the unit visualization technique, enabling users to see individual instances in the context of aggregated information. This can work particularly well for image datasets as each square can be replaced with a thumbnail of the actual image content.
While individual instance-level analysis provides detailed low-level analysis, the scale of datasets urges researchers to develop ways to slice and filter datasets, resulting in subgroup-level analysis [28, 23, 20]. This allows users to specify data subsets based on attributes and perform more fine-grained analysis than at the class level. However, image data creates a fundamental challenge in supporting such analysis because there are no annotations or attributes beyond class labels. Therefore, group structures are often created with algorithmic approaches. A common approach is to use a dimensionality reduction (DR) technique like t-SNE or UMAP, which is often applied to high-dimensional representations obtained from neuron activations. In this paper, we propose an alternative approach to visualizing relationships between images by using a hierarchical clustering algorithm.
We recognize that dataset analysis is an increasingly important topic to address. ML researchers have stressed the importance of data in deep learning by coining terms like Data-Centric AI and MLOps. Our work aligns with this trend to support data exploration for ensuring that datasets are less biased, more fair and inclusive, and contain fewer errors. A recently developed tool named Know Your Data aligns with this goal; however, its focus is on statistics based on many attributes obtained from external APIs (e.g., face recognition, object detection), while our work focuses on making sense of raw image datasets by relying on human perception.
Zahálka and Worring presented a comprehensive overview of multimedia visualization methods (primarily of images) in their survey. They categorized existing techniques into five types: basic grid, similarity space, similarity-based, spreadsheet, and thread-based. The three methods commonly used by ML practitioners described in Sect. 1 and Figure 1 (i.e., random grid, t-SNE, and a grid version of t-SNE) belong to the “basic grid,” “similarity space,” and “similarity-based” categories, respectively. Our proposed treemap-based method can also be placed in the “similarity-based” category.
The idea of using treemaps for image browsing was proposed in 2001 by PhotoMesa. PhotoMesa proposed two variations of the treemap algorithms: ordered and quantum treemaps. The ordered treemap ensures the order of images in each treemap block will match the order in file structures (e.g., by timestamp) and the quantum treemap ensures that the widths and heights of the generated rectangles are integer multiples of a given elemental size. Unlike their data, ML datasets have different properties: each dataset has a set of classes, but the images within each class have no order. Because there is no existing hierarchical structure, we extract one using agglomerative clustering algorithms and adapt the slice-dice treemap algorithm.
An important task in analyzing images or multimedia data is categorizing or exploratory searching. The key difference from tabular datasets is that image datasets are not annotated with structured attributes; images are unstructured. Many common data operations like filtering, grouping, and sorting cannot be directly applied. If we consider low-level tasks by Amar et al., only a few of the 10 tasks can be applied to images. Thus, an important challenge in interactive visualization of image data is automatic extraction of semantic information, interactive exploration of categories, or both [53, 67, 62].
As we discussed in the previous subsection, our proposed work can be considered a similarity-based approach. We briefly describe both the similarity-space and similarity-based approaches in the ML context.
The t-SNE algorithm is probably the most popular among ML researchers. They often use it to visualize cluster structures learned by deep learning models [54, 44, 55]. While t-SNE often plots each data point as a small circle in a 2-D space, the nature of images provides us with the opportunity to directly plot a small thumbnail instead of a dot. This enables users to see the image contents without interacting with each circle mark (e.g., clicking, hovering). For example, Embedding Projector displays MNIST images in t-SNE plots. However, as the number of images grows, images overlap, making it almost impossible to see them in high-density areas (see Fig. 1B).
Researchers and practitioners have devised methods to address the issue of overlapping images. The images can be rearranged in a grid either by selecting a sample of images among many in each grid cell or by redistributing all images across the grid cells on screen using optimization algorithms. Although we have not found research papers on gridifying t-SNE or UMAP, there exist several implementations [29, 43, 32], including one by Karpathy. This type of gridifying algorithm has been used in several visual analytics tools for ML for image data [68, 11, 56]. Note that the relative distances among data instances in the projected space can only approximate their distances in high-dimensional space [51, 10].
Redistributing data points or images into a rectangular grid has also been studied in non-ML contexts, such as IsoMatch and rectangular packing. Overlap removal can be made more intelligent by balancing the full use of screen space against intentionally leaving some white space to reveal cluster structures.
Our work supports hierarchical exploration of datasets by extracting hierarchical structures using clustering algorithms, so we provide a brief background about these algorithms here. Unlike the k-means clustering algorithm, which partitions data points into a fixed number of groups based on distances among data points, hierarchical clustering algorithms iteratively divide the data space into smaller spaces (i.e., divisive) or merge smaller groups into larger groups (i.e., agglomerative). We use agglomerative algorithms to form a hierarchy (called a dendrogram), since divisive algorithms do not produce high-quality results for high-dimensional data and are computationally expensive for large data. The agglomerative ones align more closely with a useful characteristic of t-SNE: focusing on similar pairs to find cluster structures.
Existing work on visualizing dendrograms includes Hierarchical Clustering Explorer (HCE), Stacked Trees, which interactively merge parts of the dendrogram, and the work of Yang et al. on steering and revising dendrograms. All of these used node-link diagrams to display dendrograms; however, this cannot easily be applied to image datasets, because dendrograms require all instances to be positioned along a single line, which means the images would become very small if we displayed them in place of the dendrogram's leaves. A space-filling technique like treemaps can resolve this challenge.
Hierarchical data exploration has been studied extensively in text domains. Text data is unstructured, so, as with images, automatic extraction of clusters is important. HierarchicalTopics extracts hierarchical structures of latent topics and enables users to explore and revise them. TopicLens allows users to zoom into certain areas of projected two-dimensional spaces. Marcilio et al. extract hierarchical structures from high-dimensional representations of deep learning data, and Duarte et al. represent data as treemap-style representations.
To help ML practitioners explore large-scale image datasets, we adapt treemaps with the following design goals:
Overview of Data Distributions. We aim to assist users in getting an overview of datasets as a beginning step for their analysis of datasets. This includes helping them answer questions like what kinds of images mostly exist in their datasets, whether they are diverse enough or biased towards any properties.
Exploring at Multiple Levels of Abstraction. We aim to design our visualization to provide users with abilities to interactively adjust the level of abstraction. While treemaps are effective at supporting abstract and elaborate interactions, we adapt the original treemap techniques by considering unique properties of the dendrogram structure and the domain of ML for images.
Instance-level Exploration. As images do not contain attributes, it is important for users to see the individual image contents while exploring datasets. We aim to effectively organize image thumbnails to help users find and inspect individual data points while they navigate over the tree structure.
Subgroup-level Analysis for ML. Both the literature in multimedia analytics and visual analytics for ML point out the importance of identifying subgroups from datasets [66, 23, 40]. This can be useful for performing a wide range of analytic tasks in ML, such as error analysis and bias discovery [60, 9].
This section describes how a dendrogram can be constructed from an image dataset, how DendroMap visualizes the dendrogram, and how supported interactions help achieve our design goals.
To create groups of images for hierarchical exploration, we use the well-known hierarchical agglomerative clustering algorithm. Unlike flat clustering algorithms (e.g., k-means), hierarchical clustering algorithms create hierarchically nested clusters without requiring a parameter $k$ for the number of clusters. Users can specify $k$ afterwards, whereas a flat clustering algorithm would need to recompute a new structure using the entire dataset for each new user request for a different $k$.
High-dimensional representations of images are used as input to the clustering algorithm. We used high-dimensional embeddings from one of the last fully-connected layers of trained deep learning models, although it is also possible to use embeddings from pre-trained models or raw image pixels. Given this input, each image vector is initialized as its own cluster, then the most similar image clusters are merged together using Ward linkage with the Euclidean distance metric to form more balanced trees. The merging process repeats until the final two clusters merge into one cluster containing all the images in the dataset. The output of the algorithm forms a special tree structure, called a dendrogram, resembling a binary tree, with leaf nodes corresponding to data instances.
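As a rough illustration of this clustering step, the sketch below builds a Ward-linkage dendrogram with SciPy and then cuts the same tree into a chosen number of clusters. It is a minimal sketch rather than the DendroMap implementation: the random `embeddings` array is a hypothetical stand-in for activations extracted from a trained model, and `build_dendrogram` is a name introduced here for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, leaves_list, linkage, to_tree


def build_dendrogram(embeddings):
    """Agglomerative clustering with Ward linkage on Euclidean distances."""
    # Z has one row per merge: (cluster a, cluster b, merge distance, size).
    Z = linkage(embeddings, method="ward", metric="euclidean")
    root = to_tree(Z)  # binary tree of ClusterNode objects
    return root, Z


# Hypothetical stand-in for image embeddings: three well-separated groups
# of five 8-dimensional vectors each.
rng = np.random.default_rng(0)
embeddings = np.vstack(
    [rng.normal(loc=center, scale=0.1, size=(5, 8)) for center in (0.0, 5.0, 10.0)]
)

root, Z = build_dendrogram(embeddings)
leaf_order = leaves_list(Z)  # left-to-right order of the leaves

# The number of clusters is chosen after the fact by cutting the same tree,
# without re-running the clustering.
labels = fcluster(Z, t=3, criterion="maxclust")
print(root.get_count(), leaf_order, labels)
```

Because the linkage matrix is computed once, asking for a different number of clusters only requires another call to `fcluster` on the same `Z`.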
DendroMap visualizes dendrogram structures using a modified treemap algorithm. It traverses the dendrogram and renders each cluster node as a grid of images using the available rectangular space. At the top of each cluster node, we display the count and classification accuracy of the images in that cluster.
Treemap Layout. The dendrogram resembles a binary tree, so there will only ever be two child nodes to lay out in the space at each point in the traversal. This allows DendroMap to adapt the simple slice-dice treemap layout. Normally, slice-dice creates undesirable aspect ratios when laying out many rectangles per level; however, the dendrogram will not have more than two children per node, always resulting in just one partition of space.
We modify the slice-dice layout to display a grid of fixed-size images on top and to include padding (to highlight hierarchical structures). To demonstrate one iteration of the modified layout, consider a node $p$ that has two children $c_1$ and $c_2$ with 6 and 4 images, respectively. The goal is to fill the 100 by 90 pixel available space depicted in Figure 3. The algorithm works as follows:
Dice if the available space from the parent is a horizontal rectangle and slice if it is vertical. In Figure 3, $p$'s width is 100 pixels and height is 90 pixels, so dicing is chosen.
Compute the ratio to partition the space. When dicing, the partition ratio is calculated by $|c_1| / (|c_1| + |c_2|)$, where $|c|$ represents the number of images in $c$. The left and right areas of the partition correspond to the two children, $c_1$ and $c_2$. In Figure 3, the dice partition ratio is computed as $6 / (6 + 4) = 0.6$, meaning 60% of the space is for $c_1$ and 40% is for $c_2$.
Adjust the partition to fit images. Based on the image size, compute the maximum number of images that can fit across the parent's entire width (or height, if slicing) by $\lfloor w_p / w_{img} \rfloor$, where $w_p$ is the width of the available space for $p$ and $w_{img}$ is the width of each image. The actual partition dimensions can then be calculated in pixels as whole multiples of $w_{img}$, resulting in a partition that fits images without cutting them off.
Add padding to show hierarchies. After laying out $c_1$ and $c_2$ and assigning them their new dimensions, a fixed padding is added to reveal the parent cluster behind them (as in Figure 3). We set a fixed padding of 10 pixels in our implementation. Color can encode the remaining height of the tree under that node.
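The first three layout steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `partition`, the 10-pixel image size in the example, the handling of square spaces, and the exact rule for snapping the split to whole images are all assumptions made here. Padding is left out for brevity.

```python
def partition(width, height, n_first, n_second, image_size):
    """One step of a modified slice-dice layout for a node with two children.

    Returns (mode, ratio, first_extent, second_extent), where the extents are
    measured along the split axis in pixels.
    """
    # Dice (split left/right) for horizontal rectangles, slice (top/bottom)
    # for vertical ones. Treating a square as horizontal is an assumption.
    mode = "dice" if width >= height else "slice"
    extent = width if mode == "dice" else height

    # Share of the space given to the first child, by image count.
    ratio = n_first / (n_first + n_second)

    # Whole images that fit along the split axis.
    max_images = extent // image_size

    # Assumed snapping rule: round the first child's share to a whole number
    # of images, leaving at least one image column/row for each child.
    first_images = round(n_first * max_images / (n_first + n_second))
    if max_images >= 2:
        first_images = min(max(first_images, 1), max_images - 1)
    first_extent = first_images * image_size
    return mode, ratio, first_extent, extent - first_extent


# The example from the text: 100 x 90 pixels, children with 6 and 4 images.
# The image size of 10 pixels is a hypothetical value.
print(partition(100, 90, 6, 4, image_size=10))  # ('dice', 0.6, 60, 40)
```

Snapping the split to a multiple of the image size is what keeps thumbnails from being cut off at the partition boundary.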
Adjusting the number of clusters. Traversing the entire dendrogram quickly fills the available screen space, making it hard to display many images. Thanks to the dendrogram’s binary tree structure, each iteration of the DendroMap algorithm only lays out two children (one partition), which allows us to render a specific number of clusters (i.e., set by users). By traversing the tree breadth-first and counting the clusters created so far, the algorithm can stop and show those clusters. For example, in Figure 2, the dendrogram traversal stops to render only the three clusters shown in the treemap.
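A breadth-first traversal that stops once a requested number of clusters is reached might look like the sketch below. The `Node` class and the function `visible_clusters` are hypothetical names introduced for illustration; the actual DendroMap code may order or count clusters differently.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def visible_clusters(root, k):
    """Split nodes breadth-first until k clusters are visible (or no splits remain)."""
    queue = deque([root])  # clusters that may still be split
    leaves = []            # single-image clusters that cannot be split
    while queue and len(queue) + len(leaves) < k:
        node = queue.popleft()
        if node.left is None or node.right is None:
            leaves.append(node)
        else:
            # Replacing one cluster by its two children adds one cluster.
            queue.append(node.left)
            queue.append(node.right)
    return leaves + list(queue)


tree = Node(
    "root",
    Node("A", Node("A1"), Node("A2")),
    Node("B", Node("B1"), Node("B2")),
)
print([n.name for n in visible_clusters(tree, 3)])  # ['B', 'A1', 'A2']
```

Each split replaces one cluster with two, so the visible count grows by exactly one per step, which makes it possible to stop at any requested number.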
Organizing images within the clusters. An interesting property of dendrograms is that the leaf nodes (i.e., images) have an order based on the hierarchical structure generated by the algorithm. We use this order to organize the list of images for each cluster node. As seen in Figure 2, the root node cluster that contains all the images is in the same order as the leaf nodes. This means that nearby images in a cluster are likely more similar than images located far apart within the cluster. For example, in Figure 1 on the right, insect images taken against a white background are clustered together within a large node. Furthermore, when there are more images to display than the available space allows, we uniformly sample images from the ordered list of images, in order to display a representative sample of images from a cluster.
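Uniform sampling from the leaf-ordered list can be sketched as follows. The function name `uniform_sample` and the specific index rule (evenly spaced positions, rounded down) are assumptions for illustration; the source only states that images are sampled uniformly from the ordered list.

```python
def uniform_sample(ordered_images, n_slots):
    """Pick at most n_slots items at evenly spaced positions, preserving order."""
    n = len(ordered_images)
    if n <= n_slots:
        return list(ordered_images)
    # Evenly spaced indices across the whole ordered list.
    return [ordered_images[i * n // n_slots] for i in range(n_slots)]


# Ten images in dendrogram leaf order, but room for only four thumbnails.
print(uniform_sample(list(range(10)), 4))  # [0, 2, 5, 7]
```

Because the list is in leaf order, evenly spaced picks draw from every region of the cluster rather than only from its first few, most similar images.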
Zooming interactions. To go past an overview and explore large-scale datasets in more detail, DendroMap supports a zooming interaction. By clicking on a cluster node, DendroMap animates to zoom into the new cluster, which enlarges the selected cluster to fit into the entire space, and creates a set of subclusters within the selected cluster. Our implementation largely follows Bostock’s zoomable treemap implementation. In addition, by taking up the entire space with the zoom-in, more images can be shown with more specific hierarchies, leading to more in-depth exploration. This process corresponds to rendering a downstream portion of the dendrogram. At any point, by clicking back on the parent cluster, the reverse process of zooming out goes back up the tree to reveal the overview again. The zoom-in and zoom-out interactions allow users to quickly get an overview of very large image collections and split the hierarchies to the desired level of detail. Please watch our video demo or use our website for this interaction.
We developed a system for DendroMap by designing coordinated views consisting of the main treemap view and the sidebar. The sidebar contains rendering settings for the treemap display, a class table for class-level error analysis, and a panel for details for a selected image.
DendroMap Settings. The sidebar contains two sliders to change the overview level: one controls the number of clusters visible and the other controls the image size.
By default, DendroMap shows eight clusters of medium-sized images to balance the level of detail and overview such that many images can be shown while still separated into distinguishable groups. These sliders allow users to easily change the overview level based on their exploration needs.
For the case when a dataset comes with predictions from a trained model, the sidebar provides two options to highlight misclassified images. One toggle highlights these images using a red border and the other toggle puts the images into focus by making the others translucent. Visually emphasizing misclassified images makes it easier for users to find groups of images that the model consistently misclassifies.
Class Table. The class table is visible if model predictions are present, and it contains information for additional error analysis at the class level. The class table updates based on the parent cluster’s images (i.e., the root or previously selected cluster; by default, all images). Each row of the table corresponds to a specific class in the dataset (e.g., cat). The next two columns of the table display the counts of images with a true or predicted class label matching the class specified.
The last three columns of the table provide useful metrics for class-level error analysis: the prediction accuracy (i.e., how often the true and predicted classes matched that row’s class), the false negative rate (i.e., how often the true class matched that row’s class but the predicted class was different), and the false positive rate (i.e., how often the predicted class matched that row’s class but the true class was different). As shown in Figure 4, each rate is encoded with the opacity of a colored dot to quickly find rows of interest in the table.
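The class table columns can be computed from paired true and predicted labels, as in the sketch below. The text does not state the denominators of the three rates, so the ones used here are assumptions: accuracy and false negative rate are taken relative to the number of images whose true label is the class, and false positive rate relative to the number of images predicted as the class. The function name `class_table` is also introduced only for illustration.

```python
def class_table(true_labels, pred_labels):
    """Per-class counts and error rates from paired true/predicted labels."""
    pairs = list(zip(true_labels, pred_labels))

    def rate(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    table = {}
    for cls in sorted(set(true_labels) | set(pred_labels)):
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        table[cls] = {
            "true_count": tp + fn,            # images whose true label is cls
            "pred_count": tp + fp,            # images predicted as cls
            "accuracy": rate(tp, tp + fn),    # assumed denominator: true_count
            "fn_rate": rate(fn, tp + fn),     # assumed denominator: true_count
            "fp_rate": rate(fp, tp + fp),     # assumed denominator: pred_count
        }
    return table


true = ["cat", "cat", "cat", "dog", "dog"]
pred = ["cat", "dog", "cat", "dog", "cat"]
for cls, row in class_table(true, pred).items():
    print(cls, row)
```

Recomputing the table on the images of the currently selected cluster, rather than on the whole dataset, would mirror how the text describes the table updating during zooming.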
By hovering over one of these entries in the table, the treemap view highlights the images used to determine that metric by making the other images translucent. This way users can use the class table in tandem with the treemap to isolate and find areas of high error or high accuracy.
Image Details. A user can click on an image in DendroMap to see detailed information: a larger view of the image, the true class label, the predicted class label if it has one, and similar images. The similar images are determined based on distances in the high-dimensional space, which can be used for counterfactual analysis [12, 17].
In this section, we describe how DendroMap can be used in practice to explore and analyze image datasets through four usage scenarios.
Consider Evan, a researcher who is in the process of preparing a model to classify different animal species. He is looking into using the ImageNet dataset and he wants to get a sense of whether the images are sufficiently diverse for training a model. Evan loads images from ImageNet across all 1,000 classes into DendroMap and immediately sees that the images are roughly divided into two large groups. One could be described as a group of organisms containing various plants and animals and features many earth-tone colors. The other group could be described as artifacts or non-living objects, such as vehicles and stock photo close-ups of everyday objects.
Evan clicks on the encapsulating rectangle for the organism cluster given his interest in animals. He incrementally increases the “Clusters Visible” slider until it reaches 20 and gradually sees the formation of distinctly-colored areas of the treemap within the overarching organism group. He notices a blue-ish cluster containing different aquatic animals, a green-ish cluster of insects and flowers, and a very colorful cluster of fruits located next to a cluster of cooked foods. He wants to more closely examine a cluster containing dogs and other fuzzy animals so he clicks to “zoom in” to this rectangle, revealing more clusters. He notices there is a large cluster of animals on grassy fields and clicks on several images to inspect their class labels: Chow Chow, pug, miniature schnauzer, and even pig and polar bear. In general, the images show dogs and other animals of different colors in a variety of poses with differently colored backgrounds, so Evan feels confident he will be able to train a capable model using this set of images.
Consider Priya, a data scientist who lives in the Southeast region of Asia and is evaluating whether ImageNet can be used to train an image classification model that she can deploy in her country. After she loads the DendroMap interface, Priya begins to click around to “zoom” into different portions of the dataset. She first clicks on the rectangle containing approximately half of the dataset and discovers a cluster containing everyday objects. She notices a cluster of taxi cabs and hovers over the class name “taxicab” in the sidebar’s class table to put just the taxicab photos in focus while the rest become faded. She notices that most are black or yellow, but she knows from personal experience that many taxis are multicolored in her country, so she makes a note to supplement the “taxicab” class with some of those images. Priya “zooms out” by clicking on the outermost rectangle and decides to visit another cluster, this one featuring many images of people interacting with a variety of everyday objects, such as “violin” and “sunscreen”. However, as she clicks on several images to get a better look at each one, she notices that the images tend to include people with lighter skin tones. She makes another note to supplement the dataset with images of people with darker skin tones interacting with the objects corresponding to each of the classes listed in the class table. Priya continues this inspection process until she feels she has a good sense of the quality of the images in this dataset and has compiled a complete list of the classes she plans to supplement.
Consider Dave, an ML engineer who is using the CIFAR-100 dataset to evaluate a trained image classification model. He opens DendroMap and sees the default view of eight rectangles or clusters. At the top of each rectangle is some information about the number of images and the average prediction accuracy of the images in each cluster. As Dave inspects the interface, he notices that the group of images with the lowest accuracy score (57 percent) consists mostly of human faces. He sees no obvious pattern at this level of overview in the hierarchical structure, so he clicks on another rectangle to get a closer look. From the class table in the sidebar, he observes that a majority of the images in this group were predicted to be “woman” or “girl”, but most were incorrect. Dave thinks perhaps his classification model has trouble determining which of those two labels is correct. He navigates back up one level by clicking on the outermost rectangle. He selects a different cluster and this time he observes that a majority of the images are predicted as “man” or “boy”, but with similar proportions of incorrect guesses (as shown in Figure 5). From these two insights, Dave hypothesizes that his model can distinguish male and female faces, but has difficulty determining whether the person is a child or adult.
Consider Anna, an ML practitioner who has trained an image classification model. During the training process, she noticed her model consistently had a harder time correctly predicting images from the artifact-related classes, so she decided to analyze her model for these classes from the ImageNet dataset, such as “umbrella” and “frying pan”. She opens DendroMap and toggles the “outline misclassified” and “focus misclassified” switches to spotlight the misclassified images, outlined in red, while the others fade. She notices that the red outlined images appear to be scattered without much of a pattern, so she gradually increases the number of clusters until DendroMap splits the images into subgroups of higher or lower accuracy. She stops when it reaches 18 clusters because she notices distinct subgroups of images with high accuracy (over 90 percent). Most of these subgroups focus on particular classes, such as “racket” or “potter’s wheel”. Anna wants to investigate the cause of clusters with much lower prediction accuracy, so she continually clicks on the next visible cluster with the lowest accuracy. She notices a pattern as she keeps drilling down towards the leaf nodes: the accuracy rate decreases as the images become more cluttered. She clicks on several misclassified images to inspect their true and predicted class labels, and she discovers that the predicted labels are not necessarily inaccurate; rather, the true and predicted labels each classify the entire image based on only a portion of it. For example, she clicks on an image of a couple of people sitting on a bench on a sunny day. The true class label for this image is “sunglasses” because one person is wearing sunglasses, whereas the predicted label for the image is “park bench” because the two people are sitting on a bench. Anna can now consider how she can train her model to handle these more complex images with multiple possible correct labels.
To evaluate the effectiveness of DendroMap for a variety of exploration tasks for large-scale machine learning datasets, we conducted a user study comparing DendroMap and a baseline visualization technique for images, t-SNE-Grid, a gridified version of t-SNE.
We compare DendroMap with a gridified version of t-SNE, which we call t-SNE-Grid. It re-adjusts the positions obtained from the t-SNE algorithm by filling the available rectangular grid space with the images, making effective use of screen space.
This process works by first taking the image representations from the dataset and reducing them to two-dimensional embeddings using t-SNE (like Fig. 6A). Then, to fill the space, two-dimensional grid points are evenly laid out over the space of image embeddings (like Fig. 6B). Finally, each grid point is assigned the closest image embedding and the corresponding image is displayed on top (like Fig. 6C). The result is a grid of images with the structure from t-SNE.
Several grid points may share the same closest image embedding, so to minimize the sum of grid assignment distances, we phrase the problem as a linear assignment problem and use the Jonker-Volgenant algorithm to obtain the optimal assignments. For this user study, to enhance the t-SNE-Grid exploration further, we implemented a one-level zoom that recomputes the grid with a smaller number of images based on where the user clicks in the t-SNE-Grid. In particular, the top $k$ images closest to the click are re-gridded with the Jonker-Volgenant algorithm to display a smaller and more focused grid of images to the user. $k$ is chosen based on the number of grid cells to show in the zoomed-in view. For example, to show an $n \times n$ grid, one could take the $n^2$ closest points and gridify them. We will open-source this implementation.
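The grid assignment and one-level zoom described above can be sketched with SciPy. This is a hedged sketch, not the authors' implementation: `gridify` and `zoom` are hypothetical names, the input is assumed to be 2-D coordinates already produced by t-SNE, and SciPy's `linear_sum_assignment` (documented as a modified Jonker-Volgenant algorithm) is used as a stand-in for whichever solver the authors used.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def gridify(points_2d, rows, cols):
    """Assign points to grid cells, minimizing the total cell-to-point distance.

    Returns a (rows, cols) array of point indices; -1 marks an empty cell.
    """
    points = np.asarray(points_2d, dtype=float)
    # Rescale the embedding to the unit square so it overlaps the grid.
    mins = points.min(axis=0)
    span = points.max(axis=0) - mins
    span[span == 0] = 1.0
    unit = (points - mins) / span

    # Evenly spaced grid points over the unit square (row index follows y).
    xs, ys = np.meshgrid(np.linspace(0, 1, cols), np.linspace(0, 1, rows))
    cells = np.column_stack([xs.ravel(), ys.ravel()])

    # Linear assignment problem: one distinct point per cell.
    cost = cdist(cells, unit)  # Euclidean distances, cells x points
    cell_idx, point_idx = linear_sum_assignment(cost)

    assignment = np.full(rows * cols, -1, dtype=int)
    assignment[cell_idx] = point_idx
    return assignment.reshape(rows, cols)


def zoom(points_2d, click_xy, rows, cols):
    """Re-grid only the rows * cols points closest to a clicked position."""
    points = np.asarray(points_2d, dtype=float)
    dist = np.linalg.norm(points - np.asarray(click_xy, dtype=float), axis=1)
    nearest = np.argsort(dist)[: rows * cols]
    local = gridify(points[nearest], rows, cols)
    # Map local indices back to indices into the full point set.
    return np.where(local >= 0, nearest[local], -1)


# Four points at the corners of a square, listed in shuffled order.
corners = [[1, 1], [0, 0], [0, 1], [1, 0]]
print(gridify(corners, 2, 2))
```

When there are more points than grid cells, the assignment keeps only as many points as there are cells, which corresponds to displaying a subset of the images in the grid.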
We recruited 20 participants by using the departmental student mailing lists. Their average age was 26. Five were female and 15 were male. Six were undergraduate and 14 were graduate students. Their degree programs included computer science, robotics, and AI. We recruited only those who have taken at least one AI or ML course. Every participant attended the study in-person and we had one participant per session. Each participant was compensated with a $20 gift card.
We used a within-subject design such that each participant evaluated both DendroMap and t-SNE-Grid. Each study session had two phases, each involving a visualization (DendroMap or t-SNE-Grid) and a dataset (the Artifact or Organism subset of CIFAR-100, which we describe in detail in Section 5.2.3). From the two visualizations and two datasets, we created four conditions. Each participant was assigned to one of these four conditions to ensure there was no bias in the order in which a participant used a particular visualization/dataset combination (shown in Table 1).
| # | Phase 1 | Phase 2 |
Every participant completed two sets of tasks, one for each visualization-dataset combination of their respective condition. For each phase, a participant was given a brief tutorial of the visualization, then they were asked to complete seven tasks while thinking aloud. We recorded their voice and screen. After each phase, the participant filled out a post-questionnaire form. All participants used the same computer setup with a 32-inch monitor.
We used the CIFAR-10 and CIFAR-100 datasets for the study. The CIFAR-10 dataset has 10 classes, each containing 6,000 images (5,000 from training set and 1,000 from test set), while the CIFAR-100 dataset has 100 classes, each containing 600 images.
We fine-tuned the ResNet50 architecture that was pretrained on the ImageNet dataset provided by TensorFlow (https://www.tensorflow.org/api_docs/python/tf/keras/applications/resnet50/ResNet50). The CIFAR-10 and CIFAR-100 images were upsampled to fit the input shape of the ResNet50 model. After extracting the image features from the models, we used Average Pooling, followed by Dense layers. The model was fine-tuned for 20 epochs, achieving a test set accuracy of on CIFAR-10 and on CIFAR-100. For use in the DendroMap and t-SNE-Grid algorithms, we represented the images in each dataset as high-dimensional vectors from the outputs of one of the last hidden layers in each respective model. For the CIFAR-10 ResNet50 model, we extracted the outputs from the second-to-last hidden layer. For the CIFAR-100 ResNet50 model, we extracted the outputs from the last hidden layer.
We divided the classes of CIFAR-100 into two sets, “Artifacts” and “Organisms”, in order to have two very distinct sets of classes for the within-subject design. This helps ensure that results from the first interface only minimally affect those from the second interface. Each set consists of 40 classes (i.e., 4 superclasses, each consisting of 10 classes). For instance, the Artifact set contains classes like bed, chair, television, and bottles, while the Organisms set contains classes like lion, tiger, crocodile, and trout.
The participants completed seven tasks. These tasks can be divided into two broad categories: grouping and searching. The grouping tasks involved identifying or analyzing groups of images based on semantically similar properties; the searching tasks involved searching for images based on specific properties. Table 2 provides a summarized description of the tasks.
| Task | Description |
|---|---|
| 1 | Categorizing images into groups across 40 classes |
| 2 | Categorizing images into groups for a single class |
| 3 | Identifying groups of images with high classification accuracy within a single class |
| 4 | Estimating the image count distribution over multiple groups within a single class |
| 5 | Searching for an image with a given text description |
| 6 | Searching for an image with a given visual description |
| 7 | Searching for an anomalous image with an incorrect class label |
In Tasks 1 and 2, participants were asked to categorize images into 3-4 groups based on semantically similar properties. Task 1 was designed to evaluate how users make sense of and categorize images across many (i.e., 40) classes whereas Task 2 focuses on how users make sense of images within a single class. The common objectives of these two tasks include analyzing diversity or any potential bias present in the distribution of the data as well as getting an overview of the data.
In Task 3, we asked participants to find two large groups, using images from a single class, that have very high classification accuracy and have specific properties. This task was designed to evaluate the scope of subgroup-level error analysis.
Task 4 is about examining the distribution of images for a single class. This task was designed based on the “characterize distribution” task discussed by Amar et al. The participants were asked to estimate the approximate proportions of four groups determined based on an attribute (e.g., color of objects).
The following two tasks are conventional searching tasks. In Task 5, participants must find an image that matches a provided text description. In Task 6, participants must find the image that matches the one on the task sheet.
Note that every participant worked with the same task list for both DendroMap and t-SNE-Grid, but used a different dataset for each of the visualizations.
For a fairer comparison, the sidebar component from DendroMap was added to the t-SNE-Grid visualization. Additionally, to ensure that participants did not rely on certain sidebar components in place of the main visualization, the class table, class filtering, and similar images components were removed from the sidebar for both DendroMap and t-SNE-Grid.
The setup of our user study gives us the scope to analyze data from a multitude of perspectives.
Our first set of analyses focused on task completion time. During the study, we recorded the time a participant took to complete each task. After conducting a paired t-test for each task, we found no significant difference between the average time taken by our participants with t-SNE-Grid and that with DendroMap.
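As a minimal sketch of the paired comparison described above, the paired t statistic can be computed from the per-participant differences between the two conditions. The function name `paired_t_statistic` and the timing values are hypothetical placeholders, not data from the study, and the sketch stops at the statistic itself rather than a p-value.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for two paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    # Work on the per-participant differences between the two conditions.
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder completion times in seconds (NOT study data).
tsne_grid_times = [62.0, 75.0, 58.0, 90.0, 71.0]
dendromap_times = [60.0, 70.0, 61.0, 85.0, 69.0]
t, dof = paired_t_statistic(tsne_grid_times, dendromap_times)
print(f"t = {t:.3f} with {dof} degrees of freedom")
```

The statistic would then be compared against a t distribution with the returned degrees of freedom to obtain a p-value.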
We evaluated the responses to the seven tasks using statistical methods.
Task 1. We instructed our participants to identify four groups such that an image can be assigned to only one group (mutually exclusive) and most images present in the interface can be assigned to one of the groups (collectively exhaustive). To evaluate the quality of the groups made by the participants, we conducted three analyses. First, to measure the collectively exhaustive property of the groups, we counted the number of classes covered by at least one of the four groups and divided that number by the total number of classes present in the dataset (i.e., 40). We counted “classes” instead of “images” because the number of classes approximates the number of images, as each class has an equal number of images. If only a portion of the images in a class belongs to a group, we count it as half; in an ideal scenario, the value would be 1.0. With DendroMap, the average value over all participants is higher, with a value of , compared to with t-SNE-Grid. A one-sided paired t-test with a significance level of 0.1 indicates that the value is significantly greater for DendroMap than for t-SNE-Grid. This suggests that, on average, participants were able to maintain the “collectively exhaustive” property better with DendroMap than with t-SNE-Grid. Next, to assess the mutual exclusiveness of the groups made by a participant, we counted the number of classes that belong to two or more groups. In an ideal scenario, the value is zero because there is no overlap between the groups. We calculated the average value to be for t-SNE-Grid and for DendroMap. The results of the same t-test also indicate that on average participants were able to create more “mutually exclusive” groups with t-SNE-Grid than DendroMap. Lastly, we calculated the entropy score of the probability distribution of the four groups to check how evenly the groups are distributed. From our analysis, we found the average entropy score of DendroMap to be , which is only slightly higher than that of t-SNE-Grid, whose average entropy is .
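The three Task 1 measures can be sketched as follows. This is an illustrative reading of the description above, not the authors' analysis code: the data layout (one dict per group mapping a class name to 1.0 for full inclusion or 0.5 for partial inclusion), the rule of taking the largest weight when a class appears in several groups, and the use of base-2 entropy are all assumptions.

```python
import math
from collections import Counter


def coverage(groups, num_classes):
    """Fraction of classes covered by at least one group (1.0 is ideal).

    Assumption: a class that appears in several groups is counted once,
    using its largest weight (1.0 = whole class, 0.5 = only a portion).
    """
    best = {}
    for group in groups:
        for cls, weight in group.items():
            best[cls] = max(best.get(cls, 0.0), weight)
    return sum(best.values()) / num_classes


def overlap_count(groups):
    """Number of classes that appear in two or more groups (0 is ideal)."""
    counts = Counter(cls for group in groups for cls in group)
    return sum(1 for n in counts.values() if n >= 2)


def entropy(group_sizes):
    """Shannon entropy in bits of the normalized group sizes."""
    total = sum(group_sizes)
    probs = [s / total for s in group_sizes if s > 0]
    return -sum(p * math.log2(p) for p in probs)


# Toy answer from a hypothetical participant, for a dataset of 8 classes.
groups = [
    {"bed": 1.0, "chair": 1.0},
    {"television": 1.0, "bottle": 0.5},
    {"bottle": 0.5, "lamp": 1.0},
    {"clock": 1.0},
]
print(coverage(groups, num_classes=8))   # 5.5 of 8 classes covered
print(overlap_count(groups))             # "bottle" is in two groups
print(entropy([10, 10, 10, 10]))         # four equally sized groups
```

For four groups the entropy is at most 2 bits, reached when the groups are equally sized, so higher values indicate a more even split.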
Task 2. Like Task 1, participants were asked to identify mutually exclusive and collectively exhaustive groups. The main difference for Task 2 is that they worked with images for only one class. To evaluate the quality of groups identified by our participants, we conducted the same three analyses as for Task 1. However, for Task 2, instead of counting the number of classes, we labeled a 10% sample of individual images. In our first analysis of the collectively exhaustive property, the average values for t-SNE-Grid and DendroMap are almost the same with the values of and respectively. This also happened with the mutual exclusiveness analysis (i.e., and ). Our final analysis of the entropy scores is also no exception (i.e., and ).
Task 3. This task is also about grouping, as participants were asked to find two large groups of images with high classification accuracy. We conducted two analyses for this task. First, we assessed the average accuracy of the two groups. To find the accuracy of each group, we divided the number of correctly classified images by the total number of images covered by the group. The average accuracy values of the two groups are 92.2% and 93.2% for t-SNE-Grid and DendroMap, respectively. DendroMap is slightly higher, but there is no significant difference. Second, we evaluated how large these groups are. The average for t-SNE-Grid is and for DendroMap is , with no statistically significant difference.
Task 4. In this task, participants estimated the approximate percentage of different cars and birds based on car color (yellow, red, white or silver, or other) or background of birds (e.g., sky), respectively. To evaluate user responses, we counted the number of car and bird images that match the aforementioned criteria and calculated the Kullback-Leibler (KL) divergence score to quantify how much the probability distributions reported by our participants differ from our own. A score of 0 means the two distributions are the same. Fig. 7 shows two histograms of the KL divergence scores for t-SNE-Grid and DendroMap. From the histograms, we see that DendroMap has more counts between 0.0 and 0.1 than t-SNE-Grid. This indicates that more participants were closer to the actual distribution when using DendroMap than when using t-SNE-Grid. This is also supported by the medians of the KL divergence scores, where the median is for t-SNE-Grid and for DendroMap.
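The KL divergence score used for Task 4 can be sketched as below. The direction of the divergence (reference distribution first, participant estimate second), the natural logarithm, and the small smoothing constant `eps` that guards against zero proportions are assumptions, since the text does not specify them; the example proportions are placeholders rather than study data.

```python
import math


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) in nats for two discrete distributions over the same bins.

    Both inputs are smoothed by eps and renormalized so that a zero
    proportion in either distribution does not cause a division by zero.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    p_s = [x + eps for x in p]
    q_s = [x + eps for x in q]
    p_sum, q_sum = sum(p_s), sum(q_s)
    return sum(
        (a / p_sum) * math.log((a / p_sum) / (b / q_sum))
        for a, b in zip(p_s, q_s)
    )


# Placeholder proportions for four groups (e.g., yellow, red, white or silver, other).
reference = [0.10, 0.20, 0.40, 0.30]   # counted from labeled images
reported = [0.15, 0.20, 0.35, 0.30]    # a participant's estimate
print(kl_divergence(reference, reported))
```

A score of 0 is returned only when the two distributions match, and the score grows as the participant's estimate moves away from the reference proportions.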
Tasks 5 & 6. These tasks were about finding specific images. All the participants of our study were successful in finding the correct images using both the t-SNE-Grid and DendroMap.
Task 7. For this task, participants were asked to find labeling errors from misclassified images. Unlike Tasks 5 and 6, multiple correct answers exist. We assessed the images selected by our participants and divided them into three categories: reasonable, somewhat reasonable, not reasonable. Based on our assessment of 20 images found among 20 participants, with t-SNE-Grid, 12 are reasonable and 3 are somewhat reasonable; with DendroMap, 15 are reasonable and 3 are somewhat reasonable. This indicates that DendroMap is likely more helpful in finding potential anomalies in image datasets as a user is required to review many images for a task like this. The images in DendroMap are divided into clusters with distinguishable boundaries, which makes it more convenient to systematically survey a large group of images than with t-SNE-Grid.
Each participant answered 10 questions in two separate post-questionnaire forms: one for DendroMap and one for t-SNE-Grid. They provided ratings on a 7-point Likert scale (7 being strongly agree). The questions and their average rating are shown in Table 3.
| Question | t-SNE-Grid | DendroMap |
|---|---|---|
| Easy to learn how to use | 6.45 | 6.30 |
| Easy to use | 6.00 | 6.00 |
| Helpful for overview | 5.95 | 6.45 |
| Helpful for detailed analysis | 5.15 | 6.05 |
| Helpful for finding specific images | 5.10 | 5.75 |
| Helpful to identify image categories | 5.70 | 6.20 |
| Helpful to discover new insights | 5.25 | 6.00 |
| Confident when using the tool | 5.85 | 6.05 |
| Enjoyed using the tool | 6.10 | 6.40 |
| Would like to use again | 5.80 | 6.65 |
The results indicate that DendroMap received higher ratings than t-SNE-Grid in 8 out of 10 questions. t-SNE-Grid received a better rating only for the first question, regarding the learnability of the visualization. This is reasonable, as t-SNE-Grid supports fewer interactions than DendroMap. For several important aspects of image visualization, namely getting an overview, performing detailed analysis, identifying image categories, and discovering new insights, DendroMap was rated statistically significantly higher than t-SNE-Grid. Moreover, participants on average were more eager to use DendroMap again than t-SNE-Grid. The difference is significant.
We observed participants’ usage while they performed the tasks. Based on their usage patterns, we have made a few important findings.
DendroMap provides a more structured workflow. Compared to t-SNE-Grid, it is easier to assess or follow how a user makes certain decisions with DendroMap. In DendroMap, the presence of clusters and the hierarchical relationships within them provide significant semantic information to users when they create groups or search images based on certain properties. One participant said: “The clustering of DendroMap was very intuitive, more so than the grid one where the boundaries between groups were not clearly defined. The ability to click into different levels of clusters was very useful as well.”
DendroMap helps with extracting more specific properties. Using the semantic information provided by DendroMap, users were able to find more detailed information about different image groups. This is more evident with Task 3 where participants worked with the images of ships and dogs to find two large groups that have high classification accuracy and specific properties. With DendroMap, participants mentioned more specific properties compared to t-SNE-Grid. For example, regarding dogs, DendroMap users described their eyes, hair length, and facial structure in addition to generic properties such as size, color, and background. With the t-SNE-Grid, participants mostly described groups using only generic properties.
Image search can be narrowed down more with DendroMap. The hierarchical relationships within the clusters helped users narrow their search for a particular image. With DendroMap, they easily found specific clusters with more images similar to the one they were looking for. The sub-clusters present within a cluster then helped users further narrow the search space. On the other hand, with t-SNE-Grid, users had to check a large group of images as there is no structured way of narrowing the search. One participant said: “With the treemap, the ability to narrow down the search without having to recompute the grid size every time, having some predetermined way of organizing the images, and having the images broken up into clusters made it very easy to scan through the images without getting lost. I was able to quickly filter the exact things I was looking for.”
Cluster summary provided with DendroMap is helpful. DendroMap provides information about each cluster and sub-cluster, such as the number of images and classification accuracy. Participants found this information useful, especially for Tasks 3 and 4. One participant expressed their liking by saying: “I like the clusters having details like how many images and the accuracy. Also, the outline of the different clusters having different sizes helped.”
Lastly, we evaluate the quality of the cluster structures generated from DendroMap computationally. We quantitatively measure k-nearest neighbor accuracy, that is, how well DendroMap preserves the top-k nearest neighbors in the original high-dimensional space.
We measure the number of common images in the top-k nearest images between one of the techniques and the original high-dimensional representation of data, while varying k (i.e., the size of the nearest neighbor list). It is a common way to evaluate the quality of DR methods. The techniques we compare are: (1) t-SNE, (2) t-SNE-Grid (described in subsection 5.1), and (3) DendroMap. We performed this experiment over 12 different datasets: CIFAR-10, CIFAR-100, and 10 subsets of CIFAR-10, each from one of the 10 classes. All are trained with ResNet50 (same setup described in subsubsection 5.2.3), but for the first two, the high-dimensional representations were taken from the last hidden layer, while those for the 10 subsets were taken from the second-to-last hidden layer.
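The neighbor-preservation measure described above can be sketched as follows, assuming Euclidean distance in both the original and the reduced space. The function names and the toy points are hypothetical, and the brute-force search is meant only to illustrate the metric, not to scale to full datasets.

```python
import math


def top_k_neighbors(points, index, k):
    """Indices of the k points nearest to points[index] (Euclidean distance)."""
    dists = [
        (math.dist(points[index], p), j)
        for j, p in enumerate(points)
        if j != index
    ]
    dists.sort()  # ties are broken by the smaller index
    return [j for _, j in dists[:k]]


def knn_overlap(high_dim, low_dim, k):
    """Average number of shared top-k neighbors between two representations."""
    total = 0
    for i in range(len(high_dim)):
        hi = set(top_k_neighbors(high_dim, i, k))
        lo = set(top_k_neighbors(low_dim, i, k))
        total += len(hi & lo)
    return total / len(high_dim)


# Toy data: four points in 3-D and two candidate 2-D layouts.
high = [(0, 0, 0), (1, 0, 0), (5, 5, 5), (6, 5, 5)]
good_layout = [(0, 0), (1, 0), (5, 5), (6, 5)]   # keeps each nearest neighbor
bad_layout = [(0, 0), (5, 5), (1, 0), (6, 5)]    # swaps two of the points
print(knn_overlap(high, good_layout, k=1))
print(knn_overlap(high, bad_layout, k=1))
```

The value ranges from 0 to k, so plotting it against k for each technique yields curves of the kind the comparison relies on.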
For t-SNE and t-SNE-Grid, which assign an (x, y) position to each data point, we rank similar images by the Euclidean distances between 2-D points. DendroMap needed a different methodology because it imposes additional structure on the 2-D space through treemaps, so it does not make sense to use Euclidean distances directly. Instead, we define the distance from one image to another in DendroMap as the distance from the first image's node in the dendrogram tree to the nearest common ancestor node of the two images. This can be thought of as how many times a user needs to zoom out from the leaf node of the first image to reach the cluster to which both images belong.
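The tree-based distance can be sketched with a parent-pointer representation of the dendrogram. Counting the number of upward steps (edges) to the nearest common ancestor is one plausible reading of "distance" here and is an assumption, as are the function names and the toy tree.

```python
def ancestors(node, parent):
    """Path from node up to the root, starting with node itself."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def tree_distance(a, b, parent):
    """Upward steps from leaf a to the nearest common ancestor of a and b."""
    b_ancestors = set(ancestors(b, parent))
    for steps, node in enumerate(ancestors(a, parent)):
        if node in b_ancestors:
            return steps
    raise ValueError("nodes do not share a common ancestor")


# Toy dendrogram: root -> (n1, img_d); n1 -> (n0, img_c); n0 -> (img_a, img_b)
parent = {
    "img_a": "n0",
    "img_b": "n0",
    "n0": "n1",
    "img_c": "n1",
    "n1": "root",
    "img_d": "root",
}
print(tree_distance("img_a", "img_b", parent))  # siblings: one zoom-out
print(tree_distance("img_a", "img_d", parent))  # must zoom out to the root
print(tree_distance("img_d", "img_a", parent))  # img_d sits directly under the root
```

Under this reading the distance is not symmetric: it depends on how deep the first image sits below the shared cluster, which matches the zoom-out interpretation from the perspective of that image.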
Figure 8 shows the results. For each of the 12 plots, the x-axis represents k (in k-nearest neighbor) and the y-axis represents the average number of common images in the two top-k image lists. We display k up to 300 for the 10,000-image datasets and up to 50 for the class-level CIFAR-10 datasets. As shown in the figure, in all cases, t-SNE outperforms the other two, as we can expect because t-SNE is designed to optimize this metric. When comparing DendroMap and t-SNE-Grid, DendroMap shares more top-k nearest neighbors with the high-dimensional representations than t-SNE-Grid for all 12 datasets. This indicates that DendroMap preserves the local similarity structures better than t-SNE-Grid.
Interactive Refinement of Tree Structures. While the agglomerative clustering algorithms generate hierarchical structures that allow users to flexibly specify the number of clusters to be displayed, the formed structures may not be ideal for some cases. Visualization researchers have extensively studied interaction methods for steering and refining clustering results [63, 13]. Future research challenges include designing interactions for treemap representations that are distinct from scatterplots and node-link diagrams.
Using Interpretable Attributes for Tree Construction. We used embedding vectors extracted from deep learning models as input to clustering algorithms, but alternative methods may help people better interpret substructures of each cluster in DendroMap. For example, representing each image with human-understandable concepts [30, 68] or additional resources may make each dimension more interpretable. Alternatively, integrating information about each dimension of the embedding vectors into the interface using explainable AI methods can also be helpful [39, 24].
Formalizing Interaction Operations. Several data manipulation operations can also be provided in DendroMap. Examples include sorting images within each node by user-specified criteria (e.g., prediction scores) and splitting and zooming into only a subset of nodes [5, 63]. Formalizing these types of operations would allow for more flexible user exploration. Integrating some ideas presented in the unit visualization literature [41, 58, 45], such as horizontally or vertically separating space based on categorical attributes in Facets [58, 57], into the treemap context would also be an interesting future direction.
ActiVis: Visual exploration of industry-scale deep neural network models. IEEE Transactions on Visualization and Computer Graphics, 24(1):88–97, 2017.