1 Introduction
Deep convolutional neural networks (CNNs) have become ubiquitous in image classification tasks thanks to architectures such as GoogLeNet and ResNet. However, we do not quite understand how these networks achieve their superhuman performance. The main challenge in deep learning, today, is its
interpretability: What representations have these networks learned that could be made human-interpretable?
Given a trained deep neural network, we can address the interpretability issue by probing neuron activations, that is, combinations of neuron firings, in response to a particular input image. With millions of input images, we would like to obtain a global view of what the neurons have learned by studying neuron activations at a particular layer, and across multiple layers of the network. We aim to address the following questions: What is the shape of the space of activations? That is, what is the organizational principle behind neuron activations, and how are the activations related within a layer and across layers?
We propose to leverage tools from topological data analysis to capture the global pattern of how a trained network reacts to millions of input images; see Figure 1 for an illustration.
In this work:

We present TopoAct, an interactive visual analytics system that uses topological summaries to explore the space of activations in deep learning classifiers. TopoAct leverages the mapper construction [34] from topological data analysis to capture the overall shape of activation vectors for interactive exploration.

For a fixed layer, TopoAct supports interactive exploration of the topological summary by exploring dataset examples within each cluster of activation vectors. Feature visualization is also applied to individual and averaged activation vectors within each cluster to create idealized images of what the network detects, and thus provide explicit comparison across clusters.

We study how classes of images are related within a layer and across layers to capture the pattern of evolution as the network goes deeper.

We present exploration scenarios where TopoAct helps us discover valuable, sometimes surprising, insights into learned representations of an image classifier, the InceptionV1 [35].

We observe structures in the topological summary, specifically loop and branching structures, which correspond to evolving patterns of activations that help us understand how a neural network reacts to millions of images.
Finally, we will release an open-source, web-based implementation on GitHub (https://github.com/architrathore/TopoAct); the current system is available via a public demo link (https://architrathore.github.io/TopoAct/).
2 Related Work
Visual analytics systems for deep learning interpretability. Visual analytics systems have been used to support model explanation, interpretation, debugging, and improvement for deep learning in recent years, see [17] for a survey. Here we focus on approaches based on neuron activations for interpretability in deep learning.
This line of research attempts to explain the internal operations and the behavior of deep neural networks by visualizing the features learned by hidden units of the network. The method of activation maximization [14] uses gradient ascent to find the input image that maximizes the activation of the neuron under investigation. It was used to visualize the hidden layers of a deep belief network [14] and deep autoencoders [22]. Simonyan et al. [32] used a similar gradient-based approach to obtain salience maps by projecting neuron activations from the fully connected layers of the convolutional network back onto the input space. Building on the idea of activation maximization, Zeiler et al. [39] proposed a deconvolutional network architecture that reconstructs inputs of convolutional layers of a CNN from its output.
These methods assume that each neuron specializes in learning one specific type of feature. However, the same neuron can be activated in response to vastly different types of input images. Reconstructing a single feature visualization, in such cases, leads to an unintelligible mix of colors, scales, or parts of objects. To address this issue, Nguyen et al. [28] proposed multifaceted feature visualization, which synthesizes a visualization of each type of input image that activates a neuron. Another issue with these visualization approaches is the (sometimes unrealistic) assumption that neurons operate in isolation. This issue is addressed by the model inversion method proposed by Mahendran et al. [24, 25]. Model inversion looks at the representations learned by the fully connected layers of a convolutional network, and reconstructs the input from these representations. TCAV uses directional derivatives of activations to quantify the sensitivity of model predictions to an underlying high-level concept [20].
While all these techniques can help us understand how a single input or a single class of inputs is “seen” by the network, visualizing activations of neurons alone is somewhat limited in explaining the global behavior of the network. To obtain a global picture of the network, Karpathy [19] used t-SNE to arrange input images that have similar CNN codes (i.e., fc7 CNN features) nearby in the embedding. Nguyen et al. [28] projected the training set images that maximally activate a neuron into a low-dimensional space, also via t-SNE. They clustered the images using k-means in the embedded space, and computed a mean image by averaging the images nearest to the cluster centroid.
The recently proposed activation atlas [7] combines feature visualization with dimensionality reduction to visualize averaged activation vectors with respect to millions of input images. For a fixed layer, activation atlas obtains a high-dimensional activation vector corresponding to each input image. These high-dimensional vectors are then projected onto a low-dimensional space via UMAP [26] or t-SNE [36]. Finally, feature visualization is applied to averaged activation vectors from small patches of the low-dimensional embedding, which allows users to intuitively understand how a particular layer reacts to millions of input images. SUMMIT [18] is another framework that summarizes neuron activations of an entire layer of a deep convolutional network using dimensionality reduction. In addition to aggregated activations, SUMMIT also computes neuron influences to construct an attribution graph, which captures relationships between neurons across layers.
Activation atlas computes average activation vectors in a low-dimensional embedding, which may introduce errors due to neighborhood distortions. In comparison, our approach aggregates activation vectors in a different manner. Using the mapper construction, a tool from topological data analysis, we obtain a topological summary of a particular layer by preserving the clusters as well as relationships between the clusters in the original high-dimensional space of activations. Our approach preserves more neighborhood structures since the topological summary is obtained within the high-dimensional activation space. Not only do we study how a particular layer of the neural network reacts to a large number of images through the lens of this topological summary, but also how such summaries evolve across layers.
Various notions of topological summaries. In topological data analysis, various notions of topological summaries have been proposed to understand and characterize the structure of a scalar function defined on some topological space. Some of these, such as merge trees [2], contour trees [6], and Reeb graphs [31], capture the behavior of the (sub)level sets of a function. Others, including Morse complexes and Morse-Smale complexes [16, 13, 11], focus on the behavior of the gradients of a function. Fewer topological summaries are applicable to a vector-valued function, including Jacobi sets [10, 3], Reeb spaces [12, 27] and their discrete variant, the mapper construction [34]. In this paper, we apply the mapper construction to the study of the space of activations to generate topological summaries suitable for interactive visualization.
3 Technical Background
In this section, we review technical background on the mapper construction, and neural network architecture. We delay the discussions on activation vectors and feature visualization until Section 4.
Mapper graph on point cloud data. In this paper, we give a high-level description of the mapper construction by Singh et al. [34] in a restrictive, point cloud setting. Given a high-dimensional point cloud $\mathbb{X} \subset \mathbb{R}^d$ equipped with a function on $\mathbb{X}$, $f: \mathbb{X} \to \mathbb{R}$, the mapper construction provides a topological summary of the data for compact representation and exploration. It utilizes the topological concept known as the nerve of a covering, first introduced by Pavel Alexandrov [1].
An open cover of $\mathbb{X}$ is a collection $\mathcal{U} = \{U_i\}_{i \in I}$ of open sets in $\mathbb{R}^d$ with an index set $I$ such that $\mathbb{X} \subseteq \bigcup_{i \in I} U_i$. Given a cover $\mathcal{U}$ of $\mathbb{X}$, the 1-dimensional nerve of $\mathcal{U}$, denoted as $\mathcal{N}_1(\mathcal{U})$, is constructed as follows: A finite set $\{i, j\} \subseteq I$ (i.e., an edge) belongs to $\mathcal{N}_1(\mathcal{U})$ if and only if the intersection of $U_i$ and $U_j$ is nonempty; if the set $\{i, j\}$ belongs to $\mathcal{N}_1(\mathcal{U})$, then any of its subsets (i.e., the point $\{i\}$ and the point $\{j\}$) is also in $\mathcal{N}_1(\mathcal{U})$.
For the mapper construction, we start with a finite cover of the image of $f$, $\mathcal{V} = \{V_j\}_{j \in J}$, such that $f(\mathbb{X}) \subseteq \bigcup_{j \in J} V_j$. Since $f$ is a scalar function, each $V_j$ is an open interval in $\mathbb{R}$. Let $\mathcal{U}$ denote the cover of $\mathbb{X}$ obtained by considering the clusters induced by points in $f^{-1}(V_j)$ for each $V_j$. The 1-dimensional nerve of $\mathcal{U}$, denoted as $\mathcal{M} = \mathcal{N}_1(\mathcal{U})$, is called the mapper graph. In the context of this paper, we refer to $\mathcal{M}$ as the topological summary graph (summary graph in short) for simplicity.
The summary graph is a multiscale representation that serves as a topological summary of the point cloud $\mathbb{X}$. Its construction relies on three parameters: the function $f$, the cover $\mathcal{V}$, and the clustering algorithm.

Lens: The function $f$ plays the role of a lens, through which we look at the data, and different lenses provide different insights [4, 34]. An interesting open problem for the mapper construction is how to define topological lenses beyond the best practice or a rule of thumb [4, 5]. In practice, height functions, distances from the barycenter of the space, surface curvature, integral geodesic distances, and geodesic distances from a source point in the space have all been used as lenses [4]. In this paper, we use the $L_2$-norm of the activation vectors as the lens, although other options are possible.

Cover: The cover $\mathcal{V}$ of $f(\mathbb{X})$ consists of a finite number of open intervals as cover elements, $\mathcal{V} = \{V_j\}_{j \in J}$. A common strategy is to use uniformly sized overlapping intervals. Let $n$ be the number of intervals and $p$ the amount of overlap between adjacent intervals. Adjusting these parameters increases or decreases the amount of aggregation that the summary graph $\mathcal{M}$ provides.

Clustering algorithm: We compute the clustering of the points lying within $f^{-1}(V_j)$ and connect the clusters whenever they have nonempty intersection. A typical algorithm to use is DBSCAN [15], a density-based clustering algorithm; it groups points in high-density regions together and marks points that lie alone in low-density regions as outliers. The algorithm requires two input parameters: minPts (the number of samples in a neighborhood for a point to be considered as a core point), and $\epsilon$ (the maximum distance between two samples for one to be considered as in the neighborhood of the other).
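As a rough illustration of the cover parameters described above, the following sketch builds uniformly sized overlapping intervals over the range of a lens. The function names `uniform_cover` and `members` are hypothetical, and the overlap convention (adjacent intervals share a fraction `overlap` of their length) is an assumption of this example; mapper libraries may define overlap differently. Closed interval endpoints are used for simplicity, whereas the text works with open intervals.

```python
def uniform_cover(lo, hi, n_intervals, overlap):
    """Return n_intervals equal-length intervals that together cover [lo, hi].

    `overlap` is the fraction (0 <= overlap < 1) of an interval's length that
    it shares with its neighbor. This convention is an assumption of the sketch.
    """
    if n_intervals < 1 or not (0.0 <= overlap < 1.0):
        raise ValueError("need n_intervals >= 1 and 0 <= overlap < 1")
    # Solve length + (n - 1) * step = hi - lo, where step = length * (1 - overlap).
    length = (hi - lo) / (1 + (n_intervals - 1) * (1 - overlap))
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]


def members(lens_values, interval):
    """Indices of points whose lens value falls inside the (closed) interval."""
    a, b = interval
    return [i for i, v in enumerate(lens_values) if a <= v <= b]


# Three intervals over [0, 10] with 50% overlap between neighbors.
cover = uniform_cover(0.0, 10.0, n_intervals=3, overlap=0.5)
```

Increasing `n_intervals` produces a finer summary, while increasing `overlap` makes it more likely that clusters from adjacent intervals share points and are therefore connected.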
We give an illustrative example of summary graph construction in Figure 2. The data set $\mathbb{X}$ is sampled from a noisy circle, and the function (lens) used is $f(x) = \|x - x_0\|_2$, where $x_0$ is the lowest point in the data. $\mathbb{X}$ is colored by the value of the function. We divide the range of $f$ into 3 overlapping intervals. For each interval, we compute the clustering of the points lying within the domain of the filter restricted to that interval, and connect the clusters whenever they have nonempty intersection. The result is the topological summary graph, whose vertices are colored by the index set.
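The noisy-circle example can be imitated with a small, self-contained sketch. It is not the authors' implementation: the clustering step below uses connected components of an $\epsilon$-neighborhood graph as a simple stand-in for DBSCAN (roughly DBSCAN with minPts = 1, so no outliers are produced), and the sample size, noise level, 50% overlap, and `eps` value are arbitrary choices for illustration. All function names are hypothetical.

```python
import math
import random


def eps_components(points, idx, eps):
    """Connected components of the eps-neighborhood graph on the indices in idx.

    Stand-in for DBSCAN: every point is treated as a core point.
    """
    unvisited = set(idx)
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        component, frontier = {seed}, [seed]
        while frontier:
            u = frontier.pop()
            near = [v for v in unvisited if math.dist(points[u], points[v]) <= eps]
            for v in near:
                unvisited.remove(v)
                component.add(v)
                frontier.append(v)
        clusters.append(frozenset(component))
    return clusters


def mapper_graph(points, lens, intervals, eps):
    """Cluster the preimage of each interval; link clusters that share points."""
    nodes = []
    for a, b in intervals:
        idx = [i for i, v in enumerate(lens) if a <= v <= b]
        nodes.extend(eps_components(points, idx, eps))
    edges = [(i, j)
             for i in range(len(nodes))
             for j in range(i + 1, len(nodes))
             if nodes[i] & nodes[j]]
    return nodes, edges


random.seed(0)
points = []
for k in range(400):  # noisy unit circle
    t = 2 * math.pi * k / 400
    r = 1 + random.uniform(-0.02, 0.02)
    points.append((r * math.cos(t), r * math.sin(t)))

lowest = min(points, key=lambda q: q[1])          # lowest point in the data
lens = [math.dist(q, lowest) for q in points]     # distance to the lowest point

lo, hi = min(lens), max(lens)
length = (hi - lo) / 2                            # 3 intervals, 50% overlap
intervals = [(lo + i * length / 2, lo + i * length / 2 + length) for i in range(3)]
nodes, edges = mapper_graph(points, lens, intervals, eps=0.3)
```

With these settings, the bottom and top arcs each form one cluster and the middle interval splits into a left and a right arc, so the summary graph is a cycle on four nodes that mirrors the loop in the data.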
Neural network architecture. We give a high-level overview of InceptionV1 [35] (also known as GoogLeNet), the neural network architecture employed in this paper. However, our framework is not restricted to a specific neural network architecture. InceptionV1 is a CNN that won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) for image classification in 2014. It was trained on ImageNet ILSVRC [8]. The ImageNet dataset consists of over 15 million labeled high-resolution images in roughly 22,000 classes/categories, while ILSVRC takes a subset of ImageNet with around 1,000 images in each of 1,000 classes, for a total of 1.2 million training images, 50,000 validation images, and 100,000 testing images. The highlights of the InceptionV1 architecture include the use of $1 \times 1$ convolution, inception modules, and global average pooling.
The $1 \times 1$ convolution from NIN (network in network) [23] is used for dimensionality reduction, and therefore computation reduction, prior to expensive convolutions with larger image patches. A new level of organization is introduced in the form of the inception module, which combines different types of convolutions for the same input and stacks all the outputs on top of each other. InceptionV1 contains 9 inception modules, each of which contains multiple convolution layers. The demo version of TopoAct explores the activations of the last layer of each inception module, for a total of 9 layers, including mixed3a, mixed3b, mixed4a, and mixed4b, which are shortened as 3a, 3b, etc. This choice is well aligned with previous literature on visual exploration of InceptionV1 [29, 30, 18, 7].
4 Methods
We describe the data analytic components of TopoAct. First, for a chosen layer of a neural network model (such as InceptionV1), we obtain activation vectors as high-dimensional point clouds for topological data analysis. Second, we construct (topological) summary graphs from these point clouds to support interactive exploration. The nodes in the summary graphs correspond to clusters of activation vectors in high-dimensional space and the edges capture relationships between these clusters. Third, for each node (cluster) in the summary graph, we apply feature visualization to individual activation vectors in the cluster as well as the averaged activation vector.
Obtaining activation vectors as point clouds. The activation of a neuron is a nonlinear transformation (i.e., a function) of its input. To start, we fix a pretrained model (i.e., InceptionV1) and a particular layer (e.g., 4c) of interest. We feed each input image to the network and collect the activations, that is, the numerical values of how much each neuron has fired with respect to the input, at a chosen layer. Since InceptionV1 is a CNN, a single neuron does not produce a single activation for an input image, but instead a collection of activations from a number of overlapping spatial patches of the image. When an entire image is passed through the network, a neuron will be evaluated multiple times, once for each patch of the image. For example, a neuron within layer 4c outputs one activation per spatial patch of the image. To simplify the construction, in our setting, we randomly sample a single spatial activation from the patches, excluding the edges to prevent boundary effects. For 300,000 images, this gives us 300,000 activation vectors for a given layer. Each activation vector is high-dimensional; its dimension depends on the number of neurons in that layer. For instance, layers 3a, 3b, and 4a have different numbers of neurons and therefore produce point clouds of different dimensions (256 dimensions for 3a, for example; see Section 5.3). We then apply the mapper framework to obtain topological summary graphs of these point clouds.
Constructing summary graphs from activation vectors. Given a point cloud of activation vectors, we now apply the mapper construction to obtain a topological summary graph. Each node in the summary graph represents a cluster of activation vectors, and there is an edge between two nodes if their corresponding clusters have a nonempty intersection. For the mapper construction parameters, all of our summary graphs use the L2-norm of the activation vector as the lens function and a fixed number of cover elements, with one of three amounts of overlap (for example, 30% or 50%). We use DBSCAN [15] as our clustering algorithm, with a fixed minimum number of points per cluster (minPts). For the $\epsilon$ parameter of the DBSCAN algorithm, which defines core points, we use two variations in our experiments. For the first variation, we use a fixed $\epsilon$, which is estimated from the distribution of pairwise distances at a middle layer. For the second variation, we employ the procedure proposed in [15] to estimate a value of $\epsilon$ for each layer, which is more adaptive to the space of activations at that particular layer. Specifically, we generate an approximate $k$-nearest neighbor ($k$-NN) graph and sort the distances to the $k$-th neighbor; we then select an $\epsilon$ based on the location of a “valley” when these distances are plotted [15]. This way, the $\epsilon$ value is more adaptive to the distribution of pairwise distances within a point cloud. As a result, each of the nine layers (3a, 3b, 4a, 4b, 4c, 4d, 4e, 5a, and 5b) receives its own adaptive $\epsilon$ value.
The above parameter configurations give rise to the 6 datasets currently deployed in our live demo. Each dataset contains 9 summary graphs (across 9 layers of InceptionV1) constructed with a particular set of parameters associated with the mapper construction, and it is named according to these parameters. In particular, each dataset name starts with “overlap” followed by a number that denotes the overlap percentage (e.g., 30 or 50). The second half of the name consists of “epsilon” followed by either “fixed” or “adaptive”, representing either the fixed $\epsilon$ or a choice of $\epsilon$ based on distances to the $k$-th nearest neighbor in an approximate $k$-NN graph. For example, overlap50epsilonfixed is the dataset containing summary graphs of 9 layers generated using 50% overlap and the fixed $\epsilon$.
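The adaptive choice of $\epsilon$ can be sketched as follows. This is only an illustration under stated assumptions: it uses exact brute-force neighbor search instead of an approximate $k$-NN graph, the value `k=4` and the toy data are arbitrary, and the "largest jump" rule in `knee_by_largest_jump` is a crude stand-in for reading the valley off the plot by eye, not the procedure used in the paper. All function names are hypothetical.

```python
import math
import random


def kth_neighbor_distances(points, k):
    """Sorted distances from every point to its k-th nearest neighbor.

    Brute force (quadratic time); an approximate k-NN graph would replace
    this step for large point clouds.
    """
    out = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(d[k - 1])
    return sorted(out)


def knee_by_largest_jump(sorted_dists):
    """Heuristic (an assumption of this sketch): value just before the biggest jump."""
    jumps = [b - a for a, b in zip(sorted_dists, sorted_dists[1:])]
    return sorted_dists[jumps.index(max(jumps))]


random.seed(1)
# Toy data: one dense blob plus three far-away outliers.
dense = [(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(200)]
outliers = [(5.0, 5.0), (-6.0, 4.0), (7.0, -5.0)]
kdist = kth_neighbor_distances(dense + outliers, k=4)
eps = knee_by_largest_jump(kdist)
```

In this toy setting, the sorted distances stay small for points in the dense blob and jump sharply at the outliers, so the selected `eps` separates the two regimes. Real activation data rarely shows such a clean break, which is why inspecting the plotted curve remains useful.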
Applying feature visualization to clusters of activation vectors. Activation vectors are high-dimensional abstract vectors. To make sense of them, we employ feature visualization to transform them into a more meaningful semantic dictionary using techniques proposed by Olah et al. [30]. While the neural network transforms an input image into activation vectors, feature visualization goes in the opposite direction. Starting from an activation vector, it synthesizes an idealized image that would have produced that activation vector. This is achieved through an iterative optimization process.
The process begins with a random noise image. Using gradient descent, this image is iteratively tweaked to optimize a specific objective. In this case, given an activation vector and a direction in the activation space, an objective defined in terms of the two is optimized. Following feature visualization [29], a transformation robustness regularizer is used, which applies small stochastic transformations to the image before each optimization step. Max-pooling can introduce high frequencies in the gradients. To tackle this issue, the gradient descent is performed in the Fourier basis, with frequencies scaled to have equal energy. This is equivalent to preconditioning the data by whitening and decorrelating.
In our setting, we apply feature visualization for each of the 300,000 input images and obtain individual activation images. Once we obtain a summary graph, we also apply feature visualization to the averaged activation per cluster (node), and obtain an averaged activation image for each cluster.
5 TopoAct User Interface and System Design
We present the user interface of TopoAct, an interactive system used to explore the organizational principles of neuron activations in deep learning image classifiers. We use InceptionV1 trained on 1.2 million ImageNet images across 1,000 classes. We obtain activation vectors of 300,000 images (300 images per class) via the pretrained model. The TopoAct user interface contains two exploration modes: single layer exploration and multilayer exploration.
Figure 1 illustrates the user interface under single layer exploration mode. The header includes information regarding the layer of choice (e.g., 3a, 3b, 4a), the dataset (across various mapper parameters) under exploration (e.g., overlap30epsilonfixed, overlap50epsilonadaptive), and a class search box that supports filtering by a set of classes. The header also contains a check box that replaces nodes in the summary graph by averaged activation images to provide an alternative overview of the topological summary (see Feature Visualization View for details).
Class search box with a shopping directory view. As illustrated in Figure 3, users can type a class name in the search box, which is used to filter the summary graph. The search algorithm uses partial matching to locate a list of possible class names. Alternatively, users can select a subset of classes from the “shopping directory” view, in which top classes within the current layer are listed in alphabetical order. The summary graph will highlight the clusters that contain any of the user-specified classes among their top three classes. As an example, we look at layer 5a of the overlap30epsilonadaptive dataset. Using the shopping directory view, we select several classes of large motor vehicles, for example, school bus, tow truck, fire engine, minibus, and minivan. Each of the nodes highlighted in the summary graph of Figure 4 contains at least one of the selected classes among its top three classes.
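The filtering logic described above might look like the following sketch. The text only states that the search uses partial matching, so the case-insensitive substring rule here is an assumption, as are the function names and the toy data; the actual system is implemented in JavaScript.

```python
def match_classes(query, class_names):
    """Class names containing the query (case-insensitive substring match).

    The substring rule is an assumption; the paper only says 'partial matching'.
    """
    q = query.strip().lower()
    return sorted(name for name in class_names if q and q in name.lower())


def highlighted_nodes(top_classes_by_node, selected):
    """Ids of nodes whose top three classes include any selected class."""
    chosen = set(selected)
    return [node for node, tops in top_classes_by_node.items()
            if chosen & set(tops[:3])]


# Hypothetical toy data: node id -> classes ranked by their share of the cluster.
top_classes = {
    0: ["fire engine", "tow truck", "electric locomotive"],
    1: ["school bus", "minibus", "minivan"],
    2: ["wig", "rugby ball", "Indian elephant"],
}
all_classes = {name for tops in top_classes.values() for name in tops}

hits = match_classes("bus", all_classes)
nodes_to_highlight = highlighted_nodes(top_classes, hits)
```

Here the query "bus" matches "minibus" and "school bus", and only node 1 is highlighted because it is the only cluster with either class among its top three.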
5.1 Single Layer Exploration Mode
For single layer exploration, the interface is composed of three views: the summary graph view, the data example view and the feature visualization view, see Figure 1 for an illustration.
Summary graph view. TopoAct uses the mapper construction to construct a topological summary graph from the activation vectors of 300,000 images across 1,000 classes. Different from dimensionality reduction approaches such as t-SNE [36] and UMAP [26], TopoAct computes and captures the shape of the activation space in the original high-dimensional space in the form of a summary graph, and preserves as much as possible the structural information when the summary graph is drawn on the 2-dimensional screen.
As shown in Figure 1(A), we use the force-directed layout by Dwyer [9] to visualize the summary graph. Each node represents a cluster of “similar” activation vectors, and each edge encodes the relation between clusters of activation vectors. Given two clusters of activation vectors $C_i$ and $C_j$, there is an edge connecting them if $C_i \cap C_j \neq \emptyset$. Given $C_i$ and $C_j$ connected by an edge $e_{ij}$, the edge weight of $e_{ij}$ is their Jaccard index. That is, $w_{ij} = |C_i \cap C_j| / |C_i \cup C_j|$. Each edge is then visualized by visual encodings (i.e., thickness and colormap) that scale proportionally with respect to its weight. As shown in Section 6, weights on the edges highlight the strength of relations between clusters.
To explore the summary graph, users can zoom and pan within the view. Hovering over a node in the summary graph displays simple statistics of the cluster: the number of activation vectors in the cluster and the averaged lens function value. Clicking on a node will give information on the top three classes (with maximum percentages) within the selected cluster; it will also update the selection for the data example view and the feature visualization view, as described below.
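For concreteness, the Jaccard edge weights described above can be computed as in the following sketch. The function names and the toy clusters are hypothetical; clusters are represented as sets of activation-vector ids.

```python
def jaccard(a, b):
    """Jaccard index |a & b| / |a | b| of two collections of ids."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def weighted_edges(clusters):
    """Map each pair of overlapping clusters to its Jaccard weight."""
    edges = {}
    ids = sorted(clusters)
    for pos, i in enumerate(ids):
        for j in ids[pos + 1:]:
            w = jaccard(clusters[i], clusters[j])
            if w > 0:  # an edge exists only when the clusters share points
                edges[(i, j)] = w
    return edges


# Toy clusters: A and B share two of six distinct points; C is isolated.
clusters = {"A": {1, 2, 3, 4}, "B": {3, 4, 5, 6}, "C": {7, 8}}
edges = weighted_edges(clusters)
```

In this toy case, the only edge is between A and B with weight 2/6, and C remains an isolated node. A weight close to 1 indicates two clusters that consist of nearly the same activation vectors.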
Data example view. To make each cluster more interpretable, we combine data examples with feature visualization. For a selected node (cluster) in the summary graph, we give a textual description of the top three classes in the cluster as well as five data examples from each top class. For example, as illustrated in Figure 5a, a selected cluster in the summary graph view for layer 5a of overlap30epsilonadaptive contains three top classes of images: fire engine, tow truck, and electric locomotive. Its corresponding data example view contains five images sampled from each class to give a concrete depiction of the input images that trigger high activations.
Feature visualization view. After a user selects a node (cluster) in the summary graph view, we display activation images pre-generated for each input image from the data example view. These individual activation images are generated by applying feature visualization to individual activation vectors from the 300,000 input images. The feature visualization view displays up to 15 such individual activation images, up to 5 for each of the top three classes; see Figure 5b. Furthermore, we also average the activation vectors that fall within the cluster and run feature inversion on the averaged activation, producing an averaged activation image per cluster (as shown in Figure 5c). Moving across clusters following edges of the summary graph will help us understand how the averaged activation images vary across clusters. We obtain a global understanding of not only what the network “sees” via these idealized images but also how these idealized images are related in the space of activations.
In addition to a graph view, we can replace each node in the summary graph by an averaged activation image as a glyph. This can be perceived as an alternative to the activation atlas [7] with one crucial difference: our summary graph captures clusters of activation vectors in their original high-dimensional space and preserves connectivity/relations between these clusters. As demonstrated in Section 6, such a global view provides valuable insights during in-depth explorations.
5.2 Multilayer Exploration Mode
In multilayer exploration mode, three adjacent layers are explored side by side; see Figure 14(top). After choosing a particular class or a set of classes using the class search box, TopoAct highlights nodes (clusters) across all three layers that contain the chosen set of classes among their top three classes. Other visualization features are inherited from the single layer exploration. Multilayer exploration helps capture the evolution of classes as images run through the network and supports structural comparisons of summaries across layers. This can be particularly useful when used in conjunction with the class search tool. As an example of class search in multilayer mode, we look at layers 4e, 5a and 5b of the overlap30epsilonadaptive dataset. We use the same selection of classes of large motor vehicles used in the earlier example of class search in single layer mode (Figure 4). Figure 6 shows the class search results, now in the multilayer exploration mode.
5.3 System Design
TopoAct is web-based, with a public demo available via GitHub (https://architrathore.github.io/TopoAct/). It is tested on Google Chrome and Firefox. It was developed using JavaScript, HTML, and CSS, together with D3.js (https://d3js.org/). The 300,000 dataset examples were sampled from ImageNet with reduced resolution. For our summary graph construction, we used the open-source Kepler Mapper library [37]. The construction of summary graphs across layers was performed on high-performance server machines with varying numbers of CPU cores and amounts of RAM. The construction took 4 hours for layers with lower-dimensional activation vectors (e.g., layer 3a produces 256-dimensional activation vectors) and 7 hours for layers with higher-dimensional activation vectors (e.g., layer 5b produces 1024-dimensional activation vectors). For our choice of $\epsilon$ for the DBSCAN algorithm, we ran PyNNDescent (https://github.com/lmcinnes/pynndescent) on a commodity workstation with a 4-core Intel i7 (4750HQ) and 8GB of RAM. Computing $\epsilon$ took on average 5 minutes per layer. Finally, we used Google Colab (https://colab.research.google.com) to run our feature visualization on GPUs (an Nvidia P100, Nvidia K80, or Nvidia T4). Feature visualization of all 300,000 input images was done via the Lucid library (https://github.com/tensorflow/lucid), which took on average 8 hours. Feature visualization of averaged activation vectors took between 2.5 hours (for 3a) and 6 hours (for 5b) per layer/summary graph.
6 Exploration Scenarios
We present various exploration scenarios using TopoAct that provide valuable insights towards learned representations of InceptionV1. Visual exploration of the space of activations takes two forms: single layer exploration and multilayer exploration. For single layer exploration, the main takeaway from these scenarios is that TopoAct captures specific topological structures, in particular, branching and loop structures, in the space of activations that are hard to detect via dimensionality reduction techniques; and such features offer new perspectives on how a neural network “sees” the input images. For multilayer exploration, the summary graphs of activation vectors within a single layer and across layers give interesting perspectives on the global organizational principles of the activation space.
6.1 Discovering Branching Structures in a Summary Graph
We begin by providing examples of interesting topological structures that capture relationships between activations during single layer exploration. We notice that by replacing nodes of a summary graph by averaged activation images, we obtain the most interpretable observations. There are two main types of topological structures that are unique to our framework, branching and loop structures, which differentiate TopoAct from prior work (e.g., [18, 7]).
Topologically, branching structures in a graph represent bifurcations. With TopoAct, while we observe variations of similar features along a specific branch, different branches may capture distinct, sometimes completely unrelated features.
Leg-Face bifurcation. Our first example of a branching structure comes from layer 4c of the overlap30epsilonadaptive dataset. Figure 7 shows two branches emerging from a node in the summary graph; we refer to such a node as the branching node. It is composed of activation vectors. The top three classes within the branching node are rugby ball, Indian elephant, and wig. While this appears to be random, the summary graph coupled with averaged activation images reveals interesting insights.
The left branch appears to capture the leg of an animal. The top three classes represented in all the clusters within this branch include various breeds of large dogs (Figure 7a). The right branch appears to capture features that resemble human faces, albeit distorted. While class names associated with clusters along the right branch may not suggest any relevance to human faces, the data examples associated with these clusters reveal that all the top classes within the right branch contain images with humans, most of which include close-ups of faces (Figures 7b, 7c). Now, let us return to the branching node. Upon close inspection (Figure 7d), the branching node contains images of rugby players and elephants that include both leg and face features, while wig images also include human faces. Therefore, the space of activations bifurcates at the branching node to further differentiate between just leg and just face features.
Bird-Mammal bifurcation. Our second example of a branching structure comes from layer 5a of the overlap30epsilonfixed dataset. The branching node, in this case, is composed of activation vectors. Figure 8 shows two branches emerging from this node. It contains images of both birds and dogs (Figure 8a), and the (averaged) activation image of the branching node appears to be a combination of the left profile of bird faces and the right profile of dog-like faces. Upon further investigation, the bottom branch focuses on the features of bird faces: profile views composed of the left eye and beak, with variations of color and texture as we move along the branch. The clusters in this branch include mainly bird species. The variations in the captured features and corresponding data samples can be seen in Figures 8d-8f. The clusters in the top branch, on the other hand, appear to capture features of animal (mammal) faces: eyes and snout, with variations in color and texture. This branch is mainly composed of images of mammals, including various dog breeds (Figures 8b-8c), wolves, foxes and even monkeys.
Geometry-Text bifurcation. Our third example of a branching structure comes from layer 4b of the overlap-30-epsilon-fixed dataset. The branching node is a relatively large cluster. Figure 9a shows examples of images in this cluster, which include classes like bison and speedboat. Since this cluster is made up of several very different classes, the corresponding averaged activation image is less representative of the cluster as a whole. Several branches emerge from this branching node, some of them very short. We focus on the three longest branches shown in Figure 9. The bottom two branches appear to capture patterns related to the text in the data images. This is fairly evident for the middle branch, since the top classes of the clusters in this branch include things like menus. Figure 9f shows some examples of these data images. The top classes in the bottom branch include images of birds, dogs, computer mice, etc. However, we can observe in Figures 9b and 9c that all these images contain some text. In the case of bird and animal images, this could be textual information such as a copyright notice or the time and location where the images were captured.
The top branch is a mixture of two types of classes. One type consists of window screens, “shoji”, and chain-link fences. Clusters along the top branch seem to capture the geometric patterns found in images of such objects (see Figure 9d). However, clusters on this branch also contain images of dogs, wolves, etc. Indeed, in some cases, as in Figure 9e, the animal could be behind the fence. Because of such cases, the clusters in the top branch appear to capture a combination of geometric patterns from objects like window screens and fences as well as features related to animals, like eyes and noses.
Wheel-Tread bifurcation. Our fourth example of a branching structure comes from layer 4c of the overlap-30-epsilon-adaptive dataset. The branching node is a small cluster. All the clusters in this example contain images of various types of automobiles, for example minibus, police van, fire engine, limousine, etc. The branching node appears to capture what looks like the wheel of a vehicle: a dark round shape with a tread-like pattern, as seen in Figure 10a. Each of the two branches of this branching node appears to focus on one of these two features. While the left branch focuses on the dark round swirling patterns of automobile wheels (Figures 10c, 10d), the right branch appears to focus more on the tread-like patterns and textures (Figure 10b).
6.2 Exploring Loop Structures in a Summary Graph
While branching structures capture bifurcations in the types of features across different objects, loop (or circular) structures seem to capture different aspects of the same underlying object.
Fur-nose-ear-eye loop of mammals. Our first example of a loop structure comes from layer 4d of the overlap-30-epsilon-fixed dataset. Figure 11 shows a loop created by six clusters. The top classes in all six clusters include various dog breeds and a variety of foxes. All these clusters seem to capture different features or body parts of these animals. The leftmost cluster appears to capture the color patterns and the texture of the fur from the body of these animals (Figure 11a). Going clockwise, the next cluster also captures the colors and texture of the fur but from a different part of the body, possibly the fur and hair surrounding the nose, as suggested by the dark spot and the swirling pattern (Figure 11b). The next two clusters (Figures 11c, 11d) appear to capture the ears of the animals. The averaged activation image captured by the fifth cluster is not as clearly attributable to a specific part of the animal body. As seen in Figure 11e, this cluster consists of images from a larger variety of animals, from foxes to Siamese cats to hogs. As a result, the corresponding averaged activation image is a mixture of various colors and slightly different textures. The last cluster appears to capture the eye and nose of the animals. We can observe in Figure 11f that the cluster contains front and side views of dog heads.
Face-body-leg loop of birds. Our next example of a loop structure comes from layer 5a of the overlap-30-epsilon-adaptive dataset. Figure 12 shows six clusters creating the loop. The top three classes of all the clusters in the loop consist of bird species and, similar to the previous example, the averaged activation images show us different features of the birds captured by these clusters. The three clusters (c–e) in the top part of the loop appear to capture the left profile views of the bird faces, with the left eye and the beak identifiable in the averaged activation image. Figures 12c–12e show that these clusters are in fact composed of bird images. The variation in the color of birds (from red to blue, and to brown) is reflected in the corresponding activation images (b, c, d, e). The three clusters on the bottom (a, b, f) appear to capture the body and legs along with a feathered texture, although not as clearly as the other three clusters. Figure 12f shows that cluster f also includes images of tigers mixed with images of birds for the representation of texture.
6.3 Studying Global View of Topological Summaries
We now explore the global view of a topological summary using the single-layer exploration mode. Instead of focusing on a single type of topological structure such as loops or branches, we investigate the distribution of topological structures. As illustrated in Figure 13, we investigate the distribution of branching structures within the largest connected component of a summary graph, at layer 5b of the overlap-30-epsilon-adaptive dataset.
We pay special attention to branching nodes with high degrees (Figures 13a, 13b). There are a few interesting (albeit speculative) observations. First, averaged activation images for these high-degree branching nodes appear to mostly focus on some form of face features. This is not too surprising, as the image dataset we used to obtain activations contains a large number of human and animal faces. These branching nodes, in some sense, serve as “anchors” or “hubs” of the underlying space of activations. Second, it appears that in order to “see” faces, a mixture of geometric and texture-based images contributes to the representation of each branching node. Nodes immediately adjacent to branching node (a), that is, those that form branches that merge at node (a), contain geometric objects that are square-shaped (envelope, bath towel), circle-shaped (bowl, pasta), pointy-shaped (tie, hammer), and bottle-shaped (beer bottles). Nodes immediately adjacent to branching node (b) have other objects that serve similar purposes, including square-shaped (vest, cuirass), circle-shaped (dough, mashed potato), pointy-shaped (ladle, ballpoint), and bottle-shaped (milk cans, whiskey jug). However, (a) and (b) seem to draw these geometric shapes from (almost completely) different classes of images.
6.4 Multilayer Comparison of Summary Graphs
Finally, we can compare the shape of activation spaces across multiple layers. Figure 14 shows a side-by-side comparison of all nine layers for the overlap-30-epsilon-adaptive dataset. There are several observable trends. First, there are more high-degree branching nodes at deeper layers (e.g., 5b), indicating more specialized differentiation among activation vectors and, consequently, among learned features. Second, there are more “islands”, that is, small isolated components, at deeper layers, indicating further separation among activation vectors and the corresponding learned features at those layers. Further investigation into structural comparisons across layers, such as tracking the evolution of a particular branching node, is nontrivial and left for future work.
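The two trends above can be quantified per layer with simple graph statistics. The following is a minimal sketch, not the TopoAct implementation: the function name `summarize_graph`, the degree threshold for a "branching node", and the size threshold for an "island" are all assumptions made for illustration, and the two toy graphs merely stand in for summary graphs at a shallow and a deep layer.

```python
from collections import defaultdict


def summarize_graph(nodes, edges, min_branch_degree=3, max_island_size=2):
    """Count branching nodes and small isolated components of a graph.

    Hypothetical helper: both thresholds are illustrative assumptions.
    """
    nodes = list(nodes)
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    # A node is treated as "branching" if it has enough neighbors.
    branching = [n for n in nodes if len(adjacency[n]) >= min_branch_degree]

    # Connected components via iterative depth-first search.
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)

    # "Islands" are components with at most max_island_size nodes.
    islands = [c for c in components if len(c) <= max_island_size]
    return {
        "branching_nodes": branching,
        "n_components": len(components),
        "n_islands": len(islands),
    }


# Toy stand-ins: a single path (shallow) versus a hub plus small fragments (deep).
shallow = summarize_graph(range(5), [(0, 1), (1, 2), (2, 3), (3, 4)])
deep = summarize_graph(range(8), [(0, 1), (0, 2), (0, 3), (0, 4), (5, 6)])
```

Computing these counts for each layer's summary graph would give a numerical counterpart to the visual comparison, although the resulting numbers depend on the chosen thresholds.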
7 Discussion
There are numerous interesting topological structures provided by TopoAct, both locally and globally. We only present a few examples in this paper. We encourage the reader to discover more interpretable, insightful observations regarding the shape of the activation space by using the live public demo (https://architrathore.github.io/TopoAct/). We conclude with some topics for discussion.
Generality. We focus on CNNs, specifically, InceptionV1 in this paper. However, our approach is not restricted to any particular network architecture. Topological summaries could be generated and used as a vehicle for visual exploration whenever neuron activations are present. In that sense, TopoAct could be generalized to explore other network architectures, such as ZFNet [38], AlexNet [21], and VGGNet [33].
Parameter tuning. As described in Section 3, the choices of lens, cover, and clustering algorithm are important in the construction of the topological summary graph. The choice of cover determines the size of clusters and how densely the summary graph is connected. In the demo, we used a uniform cover, which caused huge variations in cluster sizes. While some clusters were composed of only a handful of activation vectors, there were several very large clusters with thousands of activation vectors and large intersections between neighboring clusters. Finding meaningful relationships across such large clusters is difficult, since the top three classes are poor representatives of the cluster as a whole.
Note that the branching and loop structures explored in our examples contain relatively small clusters, for which the averaged activation images were more meaningful. The best way to remedy the large variation in cluster sizes is to use an adaptive cover, in which interval lengths are modified in such a way that each interval contains approximately the same number of points. Creating adaptive cover elements may be achieved by looking at the distribution of lens function values using histograms.
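One way to make the adaptive cover concrete is sketched below, assuming a one-dimensional lens. This is an illustration rather than the construction used in TopoAct: it uses empirical quantiles in place of a histogram, the helper name `adaptive_cover` and its parameters are hypothetical, and the overlap is defined as a fraction of points rather than of interval length so that the intervals stay balanced.

```python
import numpy as np


def adaptive_cover(lens_values, n_intervals, overlap=0.3):
    """Return (low, high) intervals holding roughly equal numbers of points.

    Hypothetical helper: endpoints are empirical quantiles of the lens
    values. Each interval is widened on both sides in quantile space, so
    `overlap` is measured as a fraction of points, not of interval length.
    """
    lens_values = np.asarray(lens_values, dtype=float)
    width = 1.0 / n_intervals        # fraction of points per base interval
    pad = overlap * width / 2.0      # extra fraction added on each side
    starts = np.arange(n_intervals) * width
    low_q = np.clip(starts - pad, 0.0, 1.0)
    high_q = np.clip(starts + width + pad, 0.0, 1.0)
    lows = np.quantile(lens_values, low_q)
    highs = np.quantile(lens_values, high_q)
    return list(zip(lows.tolist(), highs.tolist()))


rng = np.random.default_rng(0)
# Skewed synthetic lens values; a uniform cover would be badly unbalanced here.
lens = rng.exponential(scale=1.0, size=10_000)
cover = adaptive_cover(lens, n_intervals=10)
counts = [int(np.sum((lens >= lo) & (lens <= hi))) for lo, hi in cover]
```

With this construction the intervals are narrow where lens values are dense and wide where they are sparse, so `counts` stays close to constant across the cover, whereas equal-length intervals on the same data would put most points in the first few intervals.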
Scalability. In this paper, we used 300,000 activation vectors. Our goal is to scale the implementation to handle one million activation vectors. The computation of the summary graphs and the generation of activation images are both trivially parallelizable; however, computing them on the fly (to be suitable for real-time visual exploration with varying parameters) is nontrivial.
Stability. Our approach is not without limitations; however, some of these limitations are not unique to our approach (e.g., [7, 18]). The exploration scenarios we present here are specific to the choice of input images as well as the choice of activation vectors. Further analysis is required to determine how stable or sensitive the results are with respect to these choices.
For a fixed layer, the space of activations we study arises from a particular set of input images. As we vary the input, we obtain different activations and therefore different topological summaries. As the input grows larger and more diverse, such summaries become better approximations of the underlying complex space.
Even with a fixed input set of images, the output of our framework depends on the extraction of activation vectors. In a CNN, activation vectors are the results of different convolutional filters acting on a specific patch of the input. For example, the output of layer 4b in InceptionV1 is a 3D tensor with two spatial dimensions and one channel dimension. We extract a single activation vector, with one entry per channel, by randomly sampling a patch from the spatial grid. However, one specific patch may not be representative of the image as a whole (imagine a patch of sky from the image of a forest). A different choice of patch would give us a different representation of the same input as seen by the network. An alternative strategy is to sample the patch that maximizes the activation function. Since sampling is still necessary at the moment due to scalability concerns, how stable the topological summaries are with respect to different sampling techniques remains under investigation.
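The two sampling strategies can be contrasted in a short sketch. The code below is illustrative only: it assumes a (height, width, channels) tensor layout, uses a small random array as a stand-in for a real layer output, and interprets "the patch that maximizes the activation" as the spatial location whose channel vector has the largest L2 norm, which is one possible reading rather than a definition taken from the paper. The function names are hypothetical.

```python
import numpy as np


def sample_random_patch(activations, rng):
    """Return the channel vector at a uniformly random spatial location."""
    height, width, _ = activations.shape
    row = rng.integers(height)
    col = rng.integers(width)
    return activations[row, col, :]


def sample_max_patch(activations):
    """Return the channel vector with the largest L2 norm (assumed criterion)."""
    norms = np.linalg.norm(activations, axis=-1)   # shape (height, width)
    row, col = np.unravel_index(np.argmax(norms), norms.shape)
    return activations[row, col, :]


rng = np.random.default_rng(0)
# Stand-in for one layer's output on one image; the shape is illustrative only.
layer_output = rng.random((4, 4, 8))
random_vec = sample_random_patch(layer_output, rng)
max_vec = sample_max_patch(layer_output)
```

The random strategy yields a different vector on each draw for the same image, while the norm-based strategy is deterministic for a given image; comparing summaries built from each is one way to probe the stability question raised above.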
Acknowledgement
This project is partially supported by NSF DBI-1661375, NSF IIS-1513616, and NSF IIS-1910733.
References
 [1] P. S. Aleksandroff. Über den allgemeinen dimensionsbegriff und seine beziehungen zur elementaren geometrischen anschauung. Mathematische Annalen, 98(1):617–635, 1928.
 [2] K. Beketayev, D. Yeliussizov, D. Morozov, G. Weber, and B. Hamann. Measuring the distance between merge trees. Topological Methods in Data Analysis and Visualization III: Theory, Algorithms, and Applications, Mathematics and Visualization, pages 151–166, 2014.
 [3] H. Bhatia, B. Wang, G. Norgard, V. Pascucci, and P.-T. Bremer. Local, smooth, and consistent Jacobi set simplification. Computational Geometry: Theory and Applications (CGTA), 48(4):311–332, 2015.
 [4] S. Biasotti, D. Giorgi, M. Spagnuolo, and B. Falcidieno. Reeb graphs for shape analysis and applications. Theoretical Computer Science, 392:5–22, 2008.
 [5] S. Biasotti, S. Marini, M. Mortara, and G. Patane. An overview on properties and efficacy of topological skeletons in shape modelling. Shape Modeling International, 2003.
 [6] H. Carr, J. Snoeyink, and U. Axen. Computing contour trees in all dimensions. Computational Geometry, 24(2):75–94, 2003.
 [7] S. Carter, Z. Armstrong, L. Schubert, I. Johnson, and C. Olah. Activation atlas. Distill, 4(3):e15, 2019.
 [8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition, 2009.
 [9] T. Dwyer. Scalable, versatile and simple constrained graph layout. Proceedings of the 11th Eurographics/IEEE-VGTC Conference on Visualization, pages 991–1006, 2009.
 [10] H. Edelsbrunner and J. Harer. Jacobi sets of multiple Morse functions. In F. Cucker, R. DeVore, P. Olver, and E. Süli, editors, Foundations of Computational Mathematics, Minneapolis 2002, pages 37–57. Cambridge University Press, 2002.
 [11] H. Edelsbrunner, J. Harer, V. Natarajan, and V. Pascucci. Morse-Smale complexes for piecewise linear manifolds. Proceedings of the 19th Annual Symposium on Computational Geometry, pages 361–370, 2003.
 [12] H. Edelsbrunner, J. Harer, and A. Patel. Reeb spaces of piecewise linear mappings. In Proceedings of the 24th annual symposium on Computational geometry, pages 242–250, 2008.
 [13] H. Edelsbrunner, J. Harer, and A. J. Zomorodian. Hierarchical Morse-Smale complexes for piecewise linear 2-manifolds. Discrete and Computational Geometry, 30:87–107, 2003.
 [14] D. Erhan, Y. Bengio, A. Courville, and P. Vincent. Visualizing higher-layer features of a deep network. Technical Report, University of Montreal, 2009.
 [15] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. International Conference on Knowledge Discovery and Data Mining, pages 226–231, 1996.
 [16] S. Gerber and K. Potter. Data analysis with the Morse-Smale complex: The msr package for R. Journal of Statistical Software, 50(2), 2012.
 [17] F. Hohman, M. Kahng, R. Pienta, and D. H. Chau. Visual analytics in deep learning: An interrogative survey for the next frontiers. IEEE Transactions on Visualization and Computer Graphics (TVCG), 2018.
 [18] F. Hohman, H. Park, C. Robinson, and D. H. Chau. Summit: Scaling deep learning interpretability by visualizing activation and attribution summarizations. IEEE Transactions on Visualization and Computer Graphics, 2019.
 [19] A. Karpathy. t-SNE visualization of CNN codes. https://cs.stanford.edu/people/karpathy/cnnembed/, 2014.
 [20] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, and R. Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). International Conference on Machine Learning, 2018.
 [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, 2012.
 [22] Q. V. Le. Building high-level features using large scale unsupervised learning. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8595–8598, May 2013.
 [23] M. Lin, Q. Chen, and S. Yan. Network in network. International Conference on Learning Representations (ICLR), 2014.
 [24] A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5188–5196, June 2015.
 [25] A. Mahendran and A. Vedaldi. Visualizing deep convolutional neural networks using natural preimages. International Journal of Computer Vision, 120(3):233–255, 2016.
 [26] L. McInnes, J. Healy, N. Saul, and L. Großberger. UMAP: Uniform manifold approximation and projection. Journal of Open Source Software, 3(29):861, 2018.
 [27] E. Munch and B. Wang. Convergence between categorical representations of Reeb space and mapper. International Symposium on Computational Geometry (SOCG), 2016.
 [28] A. Nguyen, J. Yosinski, and J. Clune. Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks. arXiv preprint arXiv:1602.03616, 2016.
 [29] C. Olah, A. Mordvintsev, and L. Schubert. Feature visualization. Distill, 2(11):e7, 2017.
 [30] C. Olah, A. Satyanarayan, I. Johnson, S. Carter, L. Schubert, K. Ye, and A. Mordvintsev. The building blocks of interpretability. Distill, 3(3):e10, 2018.
 [31] G. Reeb. Sur les points singuliers d’une forme de pfaff completement intergrable ou d’une fonction numerique (on the singular points of a complete integral pfaff form or of a numerical function). Comptes Rendus Acad.Science Paris, 222:847–849, 1946.
 [32] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. Workshop at International Conference on Learning Representations, 2014.
 [33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. 3rd International Conference on Learning Representations (ICLR), 2015.
 [34] G. Singh, F. Mémoli, and G. Carlsson. Topological methods for the analysis of high dimensional data sets and 3D object recognition. Eurographics Symposium on Point-Based Graphics, 22, 2007.
 [35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
 [36] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
 [37] H. J. van Veen and N. Saul. Keplermapper. http://doi.org/10.5281/zenodo.1054444, Jan 2019.
 [38] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. The 26th Annual Conference on Neural Information Processing Systems, 2012.
 [39] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pages 818–833. Springer International Publishing, 2014.