Increases in computational power and data analysis capabilities have inspired a corresponding increase in the appetite for data collection. Data sets collected by scientific, industrial, financial, and political efforts continue to grow in scale and complexity. Comprehending the contents of these data sets becomes challenging as the number of items increases to thousands, millions, or more. Data sets that consist of images can be particularly problematic: while humans are very good at image understanding, no human can feasibly scan through and comprehend a collection of millions of images.
Automated machine learning methods can organize and prioritize image collections to make the best use of limited human attention and time. Classification methods can identify members of known classes, enabling humans to quickly zero in on images of interest. Unsupervised methods enable exploration of data sets where the classes may not yet be known. For example, clustering methods can identify common groups or trends within the data set. Our focus is on a complementary discovery task: highlighting novelties or anomalies within the data set.
Methods that identify novel observations help focus attention on items that merit closer human examination (Chandola et al., 2009). These are items that can potentially teach us something new about the subject of study, inspire policy changes, or overturn an existing scientific theory (Kuhn, 1962). Alternatively, they could signal instrument data collection errors, artifacts, or processing problems and inspire important corrections or upgrades. Both kinds of anomalies are valuable to identify.
Identifying the anomalies is a necessary first step, but without further information, a human reviewer may not be able to determine what makes a given item anomalous. This is especially true when the data set is too large for any individual to have reviewed completely. Users need to know what properties of the item are anomalous. For images, these properties might include color, shape, texture, location, etc. To date, very few anomaly detection methods provide explanations. We employ the DEMUD algorithm (Wagstaff et al., 2013), which expresses explanations as residual vectors in the input feature space. When the representation is derived from a neural network, additional steps are needed to convert this explanation into a form that is understandable for the human user.
In this paper, we describe DEMUD-VIS, the first method to generate human-comprehensible (visual) explanations for novel discoveries in large image data sets. This approach has three steps: (1) convert each image into a feature vector that captures semantic content, (2) employ anomaly detection to discover images with novel content, and (3) visualize the novel content for each selection in a human-interpretable fashion. We employ existing methods to address the first two steps. The primary contribution of this paper is the final step: generating a visual explanation of novel image content. We evaluated DEMUD-VIS on several large image data sets, including ImageNet, images of insect species found in freshwater streams on Earth, and images collected by the Mars Science Laboratory rover. We also conducted a user study to assess whether the explanations are useful to humans in identifying and understanding novel image content. Given the ability to explain selections automatically chosen from very large image data sets, investigators in a variety of application areas can explore and learn new aspects of their own image data sets.
2 Related Work
Developing explainable machine learning methods, models, and decisions is a growing area of interest within the research community. The goal is to improve the interface between humans and machine learning so that we can comprehend how a given method works, what a given model learned, or why a given decision was made.
Following Rudin (2018), we define an interpretable model as one in which model decisions can be directly phrased in human-understandable ways, such as by a conjunction of attribute tests in a decision tree. We contrast interpretable models with explainable models, for which post-hoc explanations can be generated to describe the model using a language other than that employed by the model itself. Such an explanation can also be viewed as an “interpretation” of a model’s decision (Biran and Cotton, 2017).
Most of the work on interpretable or explainable models has focused on supervised learning methods (Biran and Cotton, 2017). Approaches for explaining the decisions made by a supervised method range from using inherently interpretable models such as rule lists via CORELS (Angelino et al., 2018) to constructing post-hoc locally approximate models to explain the decisions of a more complex model, as with LIME (Ribeiro et al., 2016). In addition, methods from feature selection (Guyon and Elisseeff, 2003) can be used to identify the features with the greatest influence on the model in general or on particular decisions specifically. The DARPA XAI (Explainable AI) program seeks to support work across the spectrum that includes interpretable models, explainable features, and model approximation (Gunning and Aha, 2019).
Image classifiers may generate a saliency map (Dabkowski and Gal, 2017) to indicate which parts of an input image most influenced the classification decision. The saliency map can indicate relevant content, but it does not explain why or how a decision was made (Rudin, 2018). Significant effort has been invested in trying to understand the concepts being learned by deep convolutional neural networks (Montavon et al., 2018), e.g., by visualizing the features learned at each layer (Zeiler and Fergus, 2014), linking internal node activations to human concepts (Bau et al., 2017), or generating synthetic inputs to visualize class membership with DeepVis (Yosinski et al., 2015). To help understand the network’s view (representation) of a given input image, methods such as Deep Goggle (Mahendran and Vedaldi, 2015) and DeePSiM (Dosovitskiy and Brox, 2016b) “invert” the internal feature vector into reconstructed images.
Fewer methods currently exist for generating explanations for unsupervised machine learning methods, which are most relevant for supporting discovery. The MMD-critic method identifies prototypes and “criticisms” (outliers) to summarize the contents of a data set (Kim et al., 2016). In this approach, individual items help to describe the data set by example. However, no explanations are provided for why each item was selected as a prototype or a criticism, so human viewers must construct their own understanding of the content. Other approaches seek to identify a subset of features to explain why a given item should be considered an anomaly (Micenková et al., 2013; Dang et al., 2014; Duan et al., 2015). The X-PACS method identifies feature subspaces that contain groups of anomalies (Macha and Akoglu, 2018). Siddiqui et al. (2019) proposed a method to generate sequential feature-based explanations to help human experts decide whether a particular item was anomalous or not. Their method provides a ranked list of feature-value pairs that are incrementally revealed until the human expert reaches a satisfactory level of confidence. Features are revealed in an order that minimizes the item’s marginal likelihood, assuming for example that items were generated from a Gaussian mixture model. The quality of the explanations was assessed in terms of parsimony (the number of feature-value pairs the expert needed to see to reach a confident conclusion).
The DEMUD novelty detection algorithm provides an explanation that considers all feature values simultaneously (Wagstaff et al., 2013). DEMUD incrementally constructs a growing model of the user’s knowledge (using Singular Value Decomposition) and selects items with large reconstruction error (those most likely to represent new classes) for the user to review. A DEMUD explanation consists of the residual vector (information in the observation that was not captured by the current model) for each selection. DEMUD was designed to be inherently interpretable, since the explanations employ the same features by which items are originally described. However, when working with images, the features may not be individually interpretable for humans. In that case, an additional post-hoc explanation step is required.
We developed the first approach to generate visually meaningful explanations for class discovery in image data sets and published it in short form at the 2018 ICML Workshop on Human Interpretability (Wagstaff and Lee, 2018). This approach combines a CNN-based representation of semantic image content with the DEMUD algorithm for image selection and an up-convolutional network (Dosovitskiy and Brox, 2016b) to convert image feature vectors back into the image domain for visualization. The current paper provides more detail on the concept, approach, and experimental results. It also describes three new advances: (1) a new visualization step to yield more meaningful explanations (described in Section 3), (2) several additional experiments (Section 4.4), including a new scientific data set (stonefly images), a comparison of the use of different CNN layers for image content representation, and a comparison of novelty detection for class discovery and within-class novelty, and (3) the results of a large user study to determine the utility and impact of the visual explanations (Section 4.5).
Our image discovery and explanation process consists of three steps:
1. Represent image content as a feature vector.
2. Discover images with novel content using an anomaly detection algorithm that also generates explanations.
3. Render the explanations understandable to humans.
To represent image content, we employ the representation extracted by a convolutional neural network (CNN), as detailed in section 3.1. We then rank images by their novelty using a novelty detection algorithm that generates an explanation for its selections (section 3.2). Finally, we convert the algorithm’s explanations into human-interpretable images using CNN feature vector inversion (section 3.3). Figure 1 summarizes this process. Here, the CNN is CaffeNet (Jia et al., 2014), and the user can choose which of the fully connected layers (fc6, fc7, fc8) to use for extracting image content as a feature vector. The yellow daisy image was identified as a novel image, and the explanation indicates that the data model’s best reconstruction of that image was as a pile of yellow lemons. The color is close, but the spatial structure is wrong. The residual image shows the information that fell outside the model’s representation (i.e., the novel content in the daisy image); it emphasizes the dark center of the flower and the fine radial texture in its petals. The residual image is noticeably lacking in yellow, because the color was already included in the reconstruction.
3.1 CNN Features for Image Content Representation
Several methods from computer vision exist for extracting or representing image content, such as LBP (Ojala et al., 2002), SIFT (Lowe, 2004), and HOG (Dalal and Triggs, 2005). Recently, convolutional neural networks trained on massive data sets have been shown to learn effective representations of image content (Donahue et al., 2014; Oquab et al., 2014). Experiments have shown that these representations can be used for a variety of visual tasks, not just the task for which the original network was trained (Razavian et al., 2014).
We extract a feature vector to represent each image by propagating the image through a neural network that was previously trained for an object classification task. Any network trained on a sufficiently diverse set of inputs can be employed, although prior work has shown that supervised networks provide better representations than unsupervised or self-supervised networks (Bau et al., 2017). For our experiments, we used CaffeNet (Jia et al., 2014), a CNN that consists of five convolutional layers and three fully connected layers (Krizhevsky et al., 2012). It was trained on 1.2 million images from 1000 classes using the ImageNet data set (Russakovsky et al., 2015).
To extract image feature vectors, we resized each input image to 227 × 227 pixels, then propagated it through CaffeNet and recorded the activations observed at a chosen fully connected layer (fc6, fc7, or fc8). For layers fc6 or fc7, we recorded the activations prior to the linear rectification (ReLU) activation function, which clips all negative values to zero (Nair and Hinton, 2010). As we discuss in section 3.3, the pre-rectification activations carry information that is necessary for the explanation generation process. Layers fc6 and fc7 each yield feature vectors with 4096 values, while fc8, the final layer prior to the softmax classification, yields a feature vector with 1000 values (one per output class).
3.2 Novelty Detection with Explanations
To detect novel images within a data set, we employed the DEMUD method (Wagstaff et al., 2013). We chose DEMUD due to its demonstrated superiority to other methods for class discovery and its unique capability to generate per-selection explanations. DEMUD incrementally builds a model of what is known about a data set $X$, employing a singular value decomposition (SVD) to model the most prominent data components. DEMUD iteratively selects the most interesting remaining item, with respect to the current SVD model, and then updates the SVD model to incorporate (learn) the newly selected item. Interestingness (or novelty) is calculated using reconstruction error, where a higher error indicates more novelty. Reconstruction error $R$ for each item $x$ is calculated as
$$R(x) = \lVert x - (U U^T (x - \mu) + \mu) \rVert_2$$
where $U$ is the current set of top $k$ eigenvectors from the SVD of $X_S$, the set of already selected items, and $\mu$ is the mean of all previously seen $x \in X_S$. The number of principal components for the SVD model, $k$, is a pre-specified parameter. The most interesting item $x' = \arg\max_{x \in X_U} R(x)$ is moved from $X_U$, the set of items not yet selected, to $X_S$, and an incremental SVD algorithm (Lim et al., 2005) updates $U$ to incorporate $x'$. This approach minimizes redundancy in the selections, since items similar to any previously selected item will have low reconstruction error.
DEMUD’s explanation for each selection $x'$ is composed of two parts. First, $\hat{x}'$, the reconstruction of $x'$ after projection into the low-dimensional space defined by $U$, contains the information in $x'$ that the current model was able to represent:
$$\hat{x}' = U U^T (x' - \mu) + \mu$$
Second, $r'$, the difference between $x'$ and $\hat{x}'$ (referred to as the residual), captures the information in $x'$ that the current model was not able to represent:
$$r' = x' - \hat{x}'$$
Together, the reconstruction and residual can be examined to understand which features and values led to each novelty selection.
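The selection loop described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the DEMUD implementation: the function names (`fit_model`, `explain`, `demud_sketch`) are hypothetical, the first selection is supplied by the caller, and the model is refit with a full SVD in every round in place of the incremental SVD update.

```python
import numpy as np

def fit_model(selected, k):
    """Return (U, mu): up to k principal directions and the mean of the selected items."""
    mu = selected.mean(axis=0)
    # Columns of U span the directions of variation among the centered selections.
    U, s, _ = np.linalg.svd((selected - mu).T, full_matrices=False)
    # Drop directions with (numerically) zero singular value; they carry no information.
    keep = s > 1e-10 * max(s.max(), 1.0)
    return U[:, keep][:, :k], mu

def explain(x, U, mu):
    """Return (reconstruction, residual, reconstruction error) for item x."""
    x_hat = U @ (U.T @ (x - mu)) + mu
    r = x - x_hat
    return x_hat, r, float(np.linalg.norm(r))

def demud_sketch(X, k, n_select, first):
    """Greedy novelty selection. Assumption: refits the SVD each round for clarity."""
    selected = [first]
    remaining = [i for i in range(len(X)) if i != first]
    explanations = []
    while remaining and len(selected) < n_select:
        U, mu = fit_model(X[selected], k)
        errors = [explain(X[i], U, mu)[2] for i in remaining]
        best = remaining[int(np.argmax(errors))]
        x_hat, r, _ = explain(X[best], U, mu)
        explanations.append((best, x_hat, r))  # explanation for this selection
        selected.append(best)
        remaining.remove(best)
    return selected, explanations
```

Each entry of `explanations` pairs a selected index with its reconstruction and residual, so that the reconstruction plus the residual always recovers the original feature vector. Refitting from scratch is quadratic in the number of selections and is only reasonable for small examples.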
3.3 Visualization of Explanations
When previous researchers applied DEMUD to numeric data sets, the residual vectors $r'$ could be directly interpreted because each feature was already represented as a human-comprehensible value (e.g., size, age, or other measured property). In contrast, in the CNN feature domain, residual values for the 4096 features employed by layer fc6 in CaffeNet are not directly interpretable. Each feature value represents the activation of an individual neuron, for which there may not always be a known, independent correspondence to a particular image component. Prior work has proposed methods to determine an individual neuron’s sensitivity to specific image content through feature attribution or example image generation (Zeiler and Fergus, 2014; Olah et al., 2017). While these methods are useful for understanding the CNN model itself, they are not sufficient to interpret a complete feature vector with thousands of neuron activation values.
To visualize DEMUD explanations in CNN feature space, we require a method that can convert (or invert) CNN feature vectors back into the image domain. We considered several methods for this step. Let $\Phi(I)$ be the feature vector that represents image $I$. The Deep Goggle method employs gradient descent to generate a synthetic input image $I'$ that minimizes the Euclidean loss in feature space between $\Phi(I)$ and $\Phi(I')$, the feature vector of the generated image (Mahendran and Vedaldi, 2015). Dosovitskiy and Brox (2016b) trained an up-convolutional (UC) network to predict the original image given a feature vector by minimizing the Euclidean loss in image space between $I$ and $\hat{I}$, the output of the UC network. In later work, the same authors replaced the pixel-space Euclidean loss with DeePSiM, a weighted sum of feature-space loss, image-space loss, and an adversarial loss determined by a discriminator model (Dosovitskiy and Brox, 2016a). While Deep Goggle is able to visualize fine details such as texture, we found that colors and object locations were often not faithful to the original image. On the other hand, the UC method accurately represented colors and locations, but generated blurry images; this is a well-known issue with generative models optimizing image-space Euclidean loss (Mathieu et al., 2016). The UC DeePSiM method yielded the best results, generating feature vector visualizations with sharp edges, detailed textures, and accurate shapes and colors. We adopted UC DeePSiM for this study.
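To make the idea of gradient-based feature inversion concrete, the toy sketch below minimizes a feature-space Euclidean loss by gradient descent. It is only an illustration: the feature extractor is assumed to be a linear map `W` (a stand-in for a CNN), the function name `invert_features` is hypothetical, and none of the regularizers used by published inversion methods are included.

```python
import numpy as np

def invert_features(W, target, steps=500):
    """Gradient descent on ||W @ img - target||^2, starting from an all-zero image."""
    # A step size below 1 / sigma_max(W)^2 keeps plain gradient descent stable.
    lr = 0.5 / np.linalg.norm(W, 2) ** 2
    img = np.zeros(W.shape[1])
    for _ in range(steps):
        grad = 2.0 * W.T @ (W @ img - target)  # gradient of the squared loss
        img -= lr * grad
    return img

# Toy feature map that ignores the third "pixel" entirely.
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0]])
original = np.array([3.0, -1.0, 7.0])
recovered = invert_features(W, W @ original)
```

In this toy case the recovered input matches the target features, but the third entry stays at zero because the feature map discards it. This mirrors the general point that inversion can only recover what the representation retains.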
The UC DeePSiM model was trained to invert feature vectors that were directly extracted from natural images, which differs from our visualization goal. We seek to visualize two synthetic items that constitute the DEMUD explanation: the reconstruction $\hat{x}'$ and its residual $r'$. We made two changes to UC DeePSiM to enable its application to these feature vectors.
First, we modified the stage at which feature vectors are extracted for visualization. Dosovitskiy and Brox applied UC DeePSiM to the feature vectors obtained after linear rectification (ReLU), which maps all negative feature values to zero. However, the sign of the values in the residual $r'$ expresses the relationship between the selected item $x'$ and its reconstruction $\hat{x}'$ and, for our purposes, must be preserved. Therefore, we trained a new UC DeePSiM network that operates on the feature vectors obtained before rectification.
Second, we found that UC DeePSiM sometimes failed to generate meaningful images for the residual vectors $r'$. Blank or otherwise uninterpretable images were generated instead. Recall that $r'$ represents a partial image: the difference between the original and reconstructed image vectors. As a result, its feature vector values can fall into a very different range than do those obtained from natural images. Visualizations of fc7 residuals were successful, but those of fc6 and fc8 sometimes failed. While fc7’s input and output are both fully connected layers, fc6 is at the transition between convolutional and fully connected layers, and fc8 is at the transition between fully connected and output (softmax) layers. To address the value range mismatch for fc6, we transformed the residual vectors by shifting the mean value to align with the mean value of the reconstruction $\hat{x}'$:
$$\tilde{r}' = r' + (m_{\hat{x}'} - m_{r'}) \mathbf{1}$$
where $\mathbf{1}$ is a vector of ones, and $m_{\hat{x}'}$ and $m_{r'}$ are the mean values of $\hat{x}'$ and $r'$, respectively. This transformation successfully yields meaningful visualizations of residual fc6 vectors. However, it is not a complete solution for fc8 residuals, and further improving fc8 visualizations is an ongoing area of investigation.
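The mean-shift transformation amounts to adding one constant to every entry of the residual. A minimal NumPy sketch follows; the function name `shift_residual_mean` is hypothetical, and both arguments are assumed to be one-dimensional feature vectors of equal length.

```python
import numpy as np

def shift_residual_mean(r, x_hat):
    """Add a constant to every entry of r so that mean(r) equals mean(x_hat)."""
    offset = x_hat.mean() - r.mean()
    return r + offset * np.ones_like(r)

r = np.array([-1.0, 0.0, 1.0])      # toy residual with mean 0
x_hat = np.array([2.0, 4.0, 6.0])   # toy reconstruction with mean 4
shifted = shift_residual_mean(r, x_hat)
```

Because the same offset is applied to every feature, differences between residual entries are unchanged; only the overall level moves into the range of the reconstruction.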
4 Experimental Results
We conducted several experiments on benchmark and scientific data sets to evaluate DEMUD-VIS in terms of (1) its ability to discover new classes in progressively more difficult conditions and (2) the quality of the generated explanations. All data sets, extracted features, and evaluation scripts are available at https://jakehlee.github.io/visualize-img-disc.html.
4.1 Data Sets
All images were center-cropped to enforce a square aspect ratio, then resized to 227 × 227 pixels to match the input dimensions of the CaffeNet CNN.
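The preprocessing step can be sketched as follows. This is an illustration only: the function names are hypothetical, the target size is left as a parameter, and the nearest-neighbor resize is a dependency-free stand-in for whatever resampling filter an image library would provide.

```python
import numpy as np

def center_crop_square(img):
    """Crop the largest centered square from an H x W (x C) array."""
    h, w = img.shape[:2]
    side = min(h, w)
    top = (h - side) // 2
    left = (w - side) // 2
    return img[top:top + side, left:left + side]

def resize_nearest(img, size):
    """Nearest-neighbor resize to size x size (assumption: stand-in resampler)."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

image = np.zeros((300, 500, 3), dtype=np.uint8)
prepared = resize_nearest(center_crop_square(image), 227)
```

Cropping before resizing preserves the aspect ratio of the image content, at the cost of discarding the margins along the longer side.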
We compiled a subset of the data set used by the ImageNet Large Scale Visual Recognition Challenge in 2012 (ILSVRC2012) (Russakovsky et al., 2015). We randomly selected 20 classes (class definitions are provided in the code repository), then selected 50 images from each class in the ILSVRC training set to create a heterogeneous yet balanced data set that contains 1000 images. We refer to this data set as ImageNet-Random. The top row of Figure 2 shows ten random images from this data set. To pose a more challenging task, we also created an imbalanced data set by selecting 50 images from the first 10 classes and only one image from the last 10 classes for a total of 510 images.
For use in the user study described in Section 4.5, we created a different ILSVRC subset, ImageNet-Yellow, designed to be more homogeneous in color. We manually selected the following 10 classes as those containing primarily yellow objects: banana, butternut squash, carbonara, corn, daisy, lemon, orange, rapeseed (a yellow flower), school bus, and spaghetti squash. By minimizing novelty in color, this data set enabled us to test the detection, and explanation, of novelty that depends on other kinds of differences such as shape, structure, and semantic content. We selected up to 50 of the “most yellow” images from each class, which are those images for which the mean pixel value had a Euclidean distance of less than 150 from yellow (255, 255, 0) (except for the school bus class, which used a cutoff of 170 to collect more images). The daisy and school bus classes yielded only 19 and 26 images respectively, while the other classes contained 50 images each, for a total of 445 images. See row 2 of Figure 2 for examples.
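The color filter can be sketched as below. The function names are hypothetical, images are assumed to be H x W x 3 RGB arrays, and ranking the qualifying images by distance before truncating to the per-class limit is an assumption made for illustration; the text does not specify how ties or surplus images were handled.

```python
import numpy as np

YELLOW = np.array([255.0, 255.0, 0.0])

def distance_from_yellow(img):
    """Euclidean distance between the image's mean RGB pixel and pure yellow."""
    mean_rgb = img.reshape(-1, 3).mean(axis=0)
    return float(np.linalg.norm(mean_rgb - YELLOW))

def most_yellow(images, cutoff=150.0, limit=50):
    """Indices of up to `limit` images under the cutoff, closest to yellow first."""
    dists = [distance_from_yellow(im) for im in images]
    keep = [int(i) for i in np.argsort(dists) if dists[i] < cutoff]
    return keep[:limit]

yellow = np.full((2, 2, 3), [255, 255, 0], dtype=np.uint8)
pale = np.full((2, 2, 3), [255, 255, 100], dtype=np.uint8)
blue = np.full((2, 2, 3), [0, 0, 255], dtype=np.uint8)
chosen = most_yellow([blue, pale, yellow])
```

A class-specific cutoff, such as the larger value used for the school bus class, can be passed through the `cutoff` argument.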
We employed a publicly available scientific data set composed of labeled images of the Mars surface collected by the Mars Science Laboratory (Curiosity) rover, plus an additional “sun” class (see the code repository). We refer to this data set as Mars-Curiosity. Identifying images with novel content has an immediate operational value for Mars planners, as new discoveries can influence plans for the rover’s next actions (Kerner et al., 2019). This data set contains 6712 images from 25 classes collected by various instruments on the Curiosity Mars rover from sol (Mars day) 3 to 1060 (Stanboli and Wagstaff, 2017; Wagstaff et al., 2018). The classes consist of broad environmental categories such as ground and horizon, as well as engineering categories such as instruments and calibration targets. The data set is imbalanced, with class sizes ranging from 21 to 2684. See row 3 of Figure 2.
We also conducted experiments with a data set that arises from the field of ecology. The STONEFLY9 data set (http://web.engr.oregonstate.edu/~tgd/bugid/stonefly9/) consists of nine types of stoneflies that were collected in Oregon streams and rivers, then imaged with a microscope (Lytle et al., 2010; Martínez-Muñoz et al., 2009). In this domain, the ability to automatically identify novel images is beneficial for filtering out images of leaf fragments, debris, or other insects not of interest to the study (Lytle et al., 2010). We used the set0 subset, which contains 1362 images of 9 taxa (types) of stonefly larva. The data set is imbalanced, with class sizes ranging from 44 to 223. All images were taken with a microscope on a solid blue background. The classes are highly similar, and distinguishing between them is difficult even for trained human experts (Martínez-Muñoz et al., 2009). See row 4 of Figure 2.
4.2 Experimental Methodology
We conducted experiments in which we varied the novelty detection method and the image content representation method. This subsection describes each variant and the metrics we employed.
4.2.1 Novelty Detection Methods
We compared DEMUD to a standard SVD and a random baseline. In all DEMUD and SVD experiments, $k$, the number of principal components, was set to the maximum possible number of principal components, which is the lesser of $d$ (dimensionality) and $n$ (number of items). In many SVD applications, the goal is dimensionality reduction, with $k$ much smaller than $d$. In this setting, our goal is to model as much of the information content as possible so as to yield the best-quality reconstructions and therefore be maximally sensitive to new image content.
SVD-based novelty detection is based on maximizing reconstruction error, as with DEMUD:
$$R(x) = \lVert x - (U U^T (x - \mu) + \mu) \rVert_2$$
For an SVD, the model $U$ and mean $\mu$ are constructed from the entire data set $X$, while for DEMUD, $U$ and $\mu$ represent only the prior selections in $X_S$ and therefore incrementally grow and change. DEMUD requires the specification of how to initialize $U$. In these experiments, we initialized $U$ by constructing a full SVD over the entire data set and selecting the single item with the highest reconstruction error.
The random baseline selects items from $X$ randomly. This approach yields no explanations for the selections, since no model is constructed. However, in many cases it can achieve good class discovery results, especially for balanced data sets.
4.2.2 Image Content Representation Methods
We assessed three options for representing image content as a feature vector for input to the novelty detection methods.
Pixel: Pixel values of the image, flattened into a one-dimensional vector by reading the image array (where the color channel is the last axis) in C-like order.
CNN: Features extracted from the fully connected layers of CaffeNet, as described in Section 3.1.
SIFT: SIFT-based features generated using a visual bag-of-words approach. First, we used OpenCV’s SIFT module to extract SIFT keypoints from every image in the data set (Bradski, 2000). We then ran $k$-means clustering (with $c$ clusters) on all keypoints. Finally, for each image, we assigned each of its keypoints to the nearest cluster to generate a histogram of cluster counts per image. This resulted in a $c$-dimensional feature vector for each image, which can be used directly by DEMUD or a standard SVD. Since there is no standard way to select the best $c$ in advance, we report the best performance achieved across the values of $c$ that we tested.
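Two of these representations are simple enough to sketch directly in NumPy. The code below is illustrative: the function names are hypothetical, and the bag-of-words step assumes that keypoint descriptors and cluster centers have already been computed elsewhere (for example, by a SIFT extractor and a clustering routine), so neither of those steps is shown.

```python
import numpy as np

def pixel_features(img):
    """Flatten an H x W x C array into one vector in C-like (row-major) order."""
    return np.asarray(img).reshape(-1, order="C")

def bow_histogram(descriptors, centers):
    """Count how many descriptors fall nearest to each cluster center."""
    # Pairwise squared distances, shape (n_descriptors, n_clusters).
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    return np.bincount(nearest, minlength=len(centers))

centers = np.array([[0.0, 0.0], [10.0, 10.0]])
descriptors = np.array([[1.0, 0.0], [9.0, 9.0], [0.0, 1.0], [11.0, 10.0]])
hist = bow_histogram(descriptors, centers)
```

The histogram length equals the number of clusters regardless of how many keypoints an image has, which is what makes the result usable as a fixed-length feature vector.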
Note that SIFT-based representations do not provide interpretable explanations. The resulting reconstructions and residuals are in the form of distributions of unexpected values for keypoint cluster histograms and do not have a natural correspondence to the visual domain. As with the random baseline, SIFT is included in these experiments as a point of comparison for class discovery performance; it enables the first and second parts of the task (representation and discovery) but not the third part (explanation).
4.2.3 Evaluation Metrics
We assessed novelty detection in terms of the ability to discover new classes. While novelty detection is unsupervised and therefore does not have access to class labels while operating, we used the labels to quantify class discovery after the fact. After each selection $i$, we recorded $n_i$, the cumulative number of distinct classes discovered up to and including that selection. Plotting the number of discovered classes as a function of the number of items selected yields a discovery curve. We calculated the area under this curve for the first $t$ selections by summing $n_i$ from selection 1 to $t$. We then divided this value by the area under a perfect discovery curve to calculate the normalized area under the curve ($\mathrm{nAUC}_t$):
$$\mathrm{nAUC}_t = \frac{\sum_{i=1}^{t} n_i}{\sum_{i=1}^{t} \min(i, K)}$$
where $K$ is the number of classes in the data set and the denominator is the perfect discovery performance, in which a new class is discovered in each of the first $K$ selections.
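The normalized area under the discovery curve can be computed directly from the class labels of the selections, taken in selection order. The sketch below uses hypothetical function names and assumes that at least as many selections are available as the requested number of selections `t`.

```python
import numpy as np

def discovery_curve(labels_in_selection_order):
    """Cumulative number of distinct classes seen after each selection."""
    seen = set()
    curve = []
    for label in labels_in_selection_order:
        seen.add(label)
        curve.append(len(seen))
    return np.array(curve)

def nauc(labels_in_selection_order, n_classes, t):
    """Area under the discovery curve divided by the area under a perfect curve."""
    curve = discovery_curve(labels_in_selection_order[:t])
    # A perfect method finds one new class per selection until all are found.
    perfect = np.minimum(np.arange(1, t + 1), n_classes)
    return curve.sum() / perfect.sum()

score = nauc(["a", "a", "b", "b"], n_classes=2, t=4)
```

In this toy example the discovery curve is 1, 1, 2, 2 while a perfect curve is 1, 2, 2, 2, so the score is 6/7.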
The value $t$ (number of selections made) reflects the number of selections that the user wants to review. In practice, this parameter will vary depending on the number of classes $K$ (which may be unknown), the size of the data set, and human time available. For our experiments, we set $t$ to either the number of selections needed to discover all classes via random selection, or a fixed maximum number of selections, whichever is less.
To assess the class discovery performance of the random baseline, we calculated the average $\mathrm{nAUC}_t$ over 1000 trials. DEMUD and SVD are deterministic, so a single trial suffices to assess their output.
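Averaging the random baseline over repeated trials can be sketched as follows. The function name, the fixed seed, and the choice to average the discovery curve itself (from which a normalized area can then be computed) are assumptions made for illustration.

```python
import numpy as np

def random_discovery_curve(labels, t, n_trials=1000, seed=0):
    """Mean number of classes discovered after each of t random selections."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    total = np.zeros(t)
    for _ in range(n_trials):
        # Draw t items without replacement, in random order.
        order = rng.permutation(len(labels))[:t]
        seen = set()
        for step, idx in enumerate(order):
            seen.add(labels[idx])
            total[step] += len(seen)
    return total / n_trials

# Toy data set: two balanced classes with five items each.
mean_curve = random_discovery_curve([0] * 5 + [1] * 5, t=10)
```

The first point of the averaged curve is always 1 and, once every item has been selected, the last point equals the number of classes; the shape in between shows how quickly chance alone uncovers the classes.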
The second part of our evaluation was to assess the quality and utility of the explanations. Assessing explanations in an objective fashion is notoriously challenging (Montavon et al., 2018), especially for unsupervised settings in which there is no single correct output (Siddiqui et al., 2019). We evaluated the generated explanations through a user study, which is described in Section 4.5.
4.3 Class Discovery Performance
4.3.1 ImageNet Data Sets
Our first experiment employs the ImageNet-Random data set, which consists of 20 balanced classes. Figure 3(a) shows the number of classes discovered as a function of the number of selections. The “Oracle” line shows (hypothetical) perfect discovery performance. Since CaffeNet was optimized to distinguish between 1000 pre-defined image classes that include the 20 classes in this data set, we expected to see high performance when using CaffeNet feature vectors (green lines). However, we found that DEMUD using CNN-based representations was the only method to out-perform random selection. SVD using the CNN features performed much worse than random. The same was true for DEMUD and SVD when using SIFT features or pixel values. However, DEMUD using CNN layer fc8, the last layer of CaffeNet, achieved oracle-level performance for its first 13 selections (discovered a new class with each selection). Figure 3(b) compares DEMUD and SVD performance using different CaffeNet layers to represent image content. DEMUD using fc8 employs feature vectors that are most closely aligned with CaffeNet’s final classification output, so it is unsurprising that this layer yielded the best performance. However, DEMUD using fc6 and fc7 performed almost as well, suggesting that novel content is captured even at lower levels of the network. SVD’s performance was more variable, but again fc8 yielded the best result.
For real discovery problems, classes are unlikely to be equally balanced in the data set. The imbalanced ImageNet-Random data set was more challenging; while the random baseline discovered all 20 balanced classes within 70 selections, it found only 16 after 300 selections in the imbalanced data set (Figure 4(a)). In fact, DEMUD using CNN layer fc8 was the only method to discover all 20 classes within the first 300 selections. DEMUD again out-performed SVD for all representations, and DEMUD achieved better performance with the CNN-based representations over other representations. Figure 4(b) shows that DEMUD and SVD had more similar performance on this data set, but the SVD remained below the random baseline.
Finally, Figure 5(a) shows results for the ImageNet-Yellow data set. This discovery task is easier because there are only 10 classes, not 20. Once again, DEMUD using CNN features achieved the best performance, fc8 yielded the best representation, and fc6 and fc7 were not far behind (Figure 5(b)).
[Table 1: $\mathrm{nAUC}_t$ values for the balanced ImageNet-Random, imbalanced ImageNet-Random, and ImageNet-Yellow data sets.]
Figure 6 shows the first ten and last ten images obtained when using DEMUD with fc8 to select (and therefore rank) all 445 images in the ImageNet-Yellow data set. The first ten selections are very diverse, comprising images from eight different classes. The classes that appear twice (school bus and corn) include images with very different scales and backgrounds. The last ten selections (Figure 6(b)) contain images with less variation in color, texture, and backgrounds.
Table 1 compiles the $\mathrm{nAUC}_t$ values for the ImageNet-Random (both balanced and imbalanced variants) and ImageNet-Yellow data sets. DEMUD using CNN layer fc8 was the best performing method for all subsets of ImageNet.
To assess whether novelty detection using CaffeNet feature vectors could be effective in a different image domain, we applied the same techniques to the Mars-Curiosity data set. The reader may wonder which ImageNet classes the CaffeNet model proposed for the Mars-Curiosity images. We found that the most commonly predicted classes were sand viper and nematode for images of the Mars surface and visually similar classes for the rover parts, such as waffle iron for the rover wheels, which have a metal grill texture. Since the network was not trained on Mars-specific classes, this experiment tests whether abstract representations learned by CaffeNet from Earth images can generalize sufficiently well to the Mars domain.
As shown in Figure 7 and Table 2, this more challenging data set generally yielded lower discovery performance than the ImageNet data sets, except for DEMUD. DEMUD with CNN-based representations was again the only method to perform better than random selection. It was also the only method to discover all 25 classes within 300 selections. The fc8 layer yielded the best performance, but all three layers performed well, which is consistent with our results from previous experiments.
Figure 8 shows the first ten and last ten selections made by this method. Once again, the first ten selections are diverse in content, with eight different classes represented in these images. Even for repeat classes, such as the eighth and tenth images from the left (which are both in the wheel class), the content itself differs significantly. Meanwhile, the last ten selections made by the method are all from the same class, ground, and exhibit minimal novelty.
Finally, we experimented on the STONEFLY9 data set. As mentioned in the data set description, this data set is difficult and the classes are fine-grained. CaffeNet’s most common predictions for these images are a mixture of sky (kite, wing, warplane) and ocean (jellyfish) classes that contain isolated objects on a bright blue background. This experiment tests whether CaffeNet can distinguish between highly similar classes in a different non-ImageNet domain.
Figure 9 and Table 3 show that random selection achieved a high discovery performance, due to the small number of classes and mostly balanced class distribution. Once again, DEMUD was the highest performing method and the only one to out-perform the random baseline, with fc8 yielding the best result.
4.4 Visualization of Explanations
As previously discussed, novelty detection methods are most useful when they are accompanied by explanations that highlight the novel component of a given selection. DEMUD-VIS provides explanations that consist of the reconstruction and the residual feature vectors. In this section, we examine visual explanations generated by different CNN layers and examples of two kinds of novel discovery: new classes and variation within classes.
4.4.1 Comparison of Explanations from Different CNN Layers
We compared novelty detection using feature vectors extracted from the fc6, fc7, and fc8 layers of CaffeNet, which encode increasingly abstract image content. Different layers yield different selections and explanations. Figure 10 compares explanations for DEMUD-VIS selections from the ImageNet-Yellow data set when using each of these layers. For each layer experiment, we show DEMUD-VIS’s third selection, which is the most novel image given the previous two selections. The explanation for novelty in the third selection (green background) includes the reconstruction and the residual. The reconstruction contains content that DEMUD-VIS was able to model based on the first two images, while the residual contains the novel content. Note that, in general, the reconstructions shown here are expected to be poor matches for their respective selected images, since DEMUD-VIS at each round selects the item that is most poorly modeled. Images for which the reconstruction is very good are (by definition) not considered to be novel.
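The relationship between selection, reconstruction, and residual can be sketched in a few lines. This is a simplified illustration rather than the DEMUD implementation: it assumes a low-rank linear model refit with a batch SVD on the items selected so far (an incremental update would avoid the refit), and it assumes the first pick is the item farthest from the data mean. The function names, the rank `k`, and the random stand-in features are all hypothetical.

```python
import numpy as np

def fit_subspace(S, k):
    """Mean and up to k principal directions of the selected items S."""
    mu = S.mean(axis=0)
    _, s, Vt = np.linalg.svd(S - mu, full_matrices=False)
    keep = s > 1e-10                 # drop directions with no variance
    U = Vt[keep][:k].T               # shape (d, m) with m <= k
    return U, mu

def select_with_explanations(X, n_select, k=3):
    """Greedily pick the item that the current model reconstructs worst."""
    mu = X.mean(axis=0)                       # assumed starting model
    U = np.zeros((X.shape[1], 0))
    remaining = list(range(len(X)))
    selected, explanations = [], []
    for _ in range(n_select):
        R = X[remaining]
        recon = (R - mu) @ U @ U.T + mu       # content the model can explain
        resid = R - recon                     # content it cannot explain
        best = int(np.argmax(np.linalg.norm(resid, axis=1)))
        selected.append(remaining.pop(best))
        explanations.append((recon[best], resid[best]))
        U, mu = fit_subspace(X[selected], k)  # update model with the new pick
    return selected, explanations

# Random vectors standing in for CNN feature vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
picks, expl = select_with_explanations(X, n_select=5)
print(picks)
```

By construction, each selected feature vector equals its reconstruction plus its residual, so a large residual is exactly what triggers a selection. In the pipeline described here, both vectors would then be passed through the up-convolutional network to produce images.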
The fc6 explanation is clear and easily interpretable. The reconstruction combines content from the first two images (lemon and daisy) that approximates the color of the school bus but fails to reproduce its novel components (including windows, wheels, stop sign), as shown in the residual. The residual appears washed out because the vivid yellow tones were accurately captured in the reconstruction. The large discrepancy between image and reconstruction caused this image to be selected as novel. Similarly, the fc7 explanation is easy to interpret. The reconstruction is mostly composed of lemons (the best color match), while the residual emphasizes the dark center and the radial texture of the flower petals. However, the fc8 explanation is less interpretable. The reconstruction visualization employs content from the school bus image, but it is highly deformed. The residual is blurred and inaccurate compared to the selected image.
This pattern occurs with other data sets and aligns with our understanding of neural network abstraction: fc6 is closer to the original image content, while fc8 is closer to abstract class-level content. It is expected that UC DeePSiM visualization quality degrades as the feature space moves away from the convolutional layers (conv5) and closer to the final softmax classification layer (Dosovitskiy and Brox, 2016a, b). We are presented with a trade-off between explanation interpretability and class discovery performance. DEMUD-VIS with fc8 features achieved the best discovery performance across all data sets (Section 4.3), but it did not generate consistently interpretable explanations. DEMUD-VIS with fc6 features offers discovery performance that out-performs most other methods and yields the most interpretable explanations. Therefore, all visualizations in this subsection will be generated from fc6 features.
4.4.2 Explanations of First-Time Class Discovery
When DEMUD-VIS discovers a new class, the explanation helps pinpoint how it is new. Figure 11 shows the DEMUD-VIS explanation when it discovered the Eskimo dog, husky class in the balanced ImageNet-Random data set. The model had previously selected an image from the English foxhound class, which dominates the reconstruction. The residual emphasizes differences in the coat color and pose in the new image.
Figure 12 shows another example of class discovery, this time from the Mars-Curiosity data set. This selection constitutes the discovery of the ChemCam Calibration Target class. The model had previously selected five images from five distinct classes, one of which contained the REMS UV Sensor. This previously selected image contains small, dark circles on a white background, similar to the ChemCam Calibration Target image. However, the ChemCam Calibration Target image differs in the location and pattern of the dark circles, and also contains a larger patterned circle on the bottom right. These features are highlighted in the residual visualization.
Finally, Figure 13 shows an example of the discovery of the CAL class in the STONEFLY9 data set. The model had previously selected six images from four different classes (taxa), one of which was a dark insect in a similar orientation. The reconstruction replicates the dark insect, while the residual highlights the light-colored markings in the selected image. Note that the blue background is not present in the residual, because it was learned from the previous selections and therefore is not novel.
In each case, the explanations illustrate image content that can be modeled (reconstructed) from previous selections as well as content that is new, which helps humans realize that a new class is present and provides the first step in characterizing the new class.
4.4.3 Explanations of Novelty Discovery Within a Class
Novelty detection is not restricted to the discovery of new classes. Figure 14 shows an example of two very different images from the same class (REMS UV Sensor) that DEMUD-VIS selected from the Mars-Curiosity data set. While the second image does not constitute a new class discovery, the sensor’s appearance has significantly changed due to dust deposition in the 772 sols (Mars days) that elapsed between the two images. The explanation shows that the model attempted to reconstruct the sensor in its clean state, and the residual emphasizes the darker, dirtier appearance in the new image.
Figure 15 shows another example of within-class novelty from the imbalanced ImageNet-Random data set. Both images are from the English foxhound class with similar poses. However, as highlighted in the explanation, they have very different coloring (brown and white versus black). Whether such differences constitute class-level novelty depends on the granularity of interest to the user, and being made aware of this diversity can help guide those decisions.
4.5 User Study to Assess Explanation Utility
Thus far, the novelty explanations displayed have been anecdotal and interpreted by the authors. To independently assess the utility of these explanations for the general public, we conducted a large user study. We presented users with 20 images and asked them to rate the novelty of each one. We employed a randomized A/B testing strategy in which some users were given only the image and others were given the image plus the visual novelty explanation.
User study design.
Users were shown the first 20 images selected by DEMUD-VIS from the ImageNet-Yellow data set, as described in Section 4.1. Images were presented in the order of their selection (most novel first). We used 20 images from a data set of 10 classes to ensure that there would be a mixture of novel and not-novel images, with respect to their ImageNet class labels. As discussed in the previous section, even within a given class there can be images with unique or novel elements.
For each selection, users were asked to answer the question “Given the images you have already seen, does the next image contain something new?” The response options provided were:
Yes: new type of object
Some: new aspect of object type
Minor: unimportant details are new
No: duplicate of previous object
Not sure / skip
We proposed two hypotheses to assess the utility of the explanations:
Explanations increase user confidence: Users will have fewer “Not sure” responses when given the explanation, compared to no explanation provided.
Explanations increase user awareness of novelty: Users will have more “Yes” responses when given the explanation, compared to no explanation.
We used the SurveyMonkey platform to collect responses over a period of two months. We advertised the survey online, within our social networks, at Columbia University, and at the Jet Propulsion Laboratory. We received 399 responses, of which 280 (70%) were complete. The completion rate was higher for survey A (without explanations) (80%) than for survey B (with explanations) (61%). This suggests that using the explanations required more effort, motivation, or persistence on the part of the participant. The average time to complete the survey was 366 seconds for survey A (standard deviation of 888.9 s) and 457 seconds for survey B (standard deviation of 262.7 s). [Footnote 4: We excluded one survey B response from our analysis that had a duration of 1,437,922 seconds (16 days) on the grounds that this likely did not represent continuous effort. Since the survey assesses novelty, it is important that it is completed in one session.]
We also found a difference in the distribution of educational backgrounds for those who completed each survey, despite random assignment to survey A or B. Figure 16 shows that those who completed survey B (with explanations) tended to have higher educational levels than did survey A completers. Less difference was observed in the amount of self-reported experience with machine learning, although survey B completers were slightly more likely to have run their own machine learning experiments or worked as professionals in the field.
We compiled the results of the 280 complete responses to test the hypotheses stated above.
Hypothesis 1: We found a significant overall difference (using the Fisher exact test) that indicates survey B respondents were in general 4 times more likely to respond “Not sure” than survey A respondents. This contradicts our hypothesis that survey B respondents would be less likely to respond “Not sure.” However, this occurred only 9 of 3180 times for survey A and 30 of 2420 times for survey B. The rarity of this response (only 0.7% of responses) suggests that users generally felt sufficiently confident that they could rate the novelty of a given image regardless of whether the explanation was provided. For a data set in which novelty manifests in more subtle ways, the outcome may be different. We did not find a statistically significant difference in the number of “Not sure” responses to the two surveys for any individual images.
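The counts reported above are enough to recompute the rate ratio and to run a Fisher exact test on the pooled responses. The sketch below is a from-scratch, standard-library implementation of the two-sided test (summing all tables no more probable than the observed one). The function name is hypothetical, and the exact test configuration used in the study is not stated, so the resulting p-value should be read as illustrative.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k counts in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-7))

# "Not sure" counts reported in the text.
not_sure_a, total_a = 9, 3180    # survey A (no explanations)
not_sure_b, total_b = 30, 2420   # survey B (with explanations)

ratio = (not_sure_b / total_b) / (not_sure_a / total_a)
overall = (not_sure_a + not_sure_b) / (total_a + total_b)
p_value = fisher_exact_two_sided(not_sure_a, total_a - not_sure_a,
                                 not_sure_b, total_b - not_sure_b)
print(f"rate ratio: {ratio:.2f}, overall rate: {overall:.2%}, p: {p_value:.2g}")
```

The rate ratio comes out a little above 4 and the overall rate is about 0.7%, matching the figures quoted in the text.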
Hypothesis 2: Overall, survey B users were slightly more likely to respond “Yes” (image contains a new type of object) than were survey A users, but the difference was not significant. However, there are differences in which images they considered to be novel, with survey B users showing more sensitivity to within-class novelty. For three of the 20 images (shown in Figure 17), survey B users were more likely than survey A users to confidently state that the image contained a new type of object, consistent with our hypothesis. Note that to make the explanations more widely accessible, we called the reconstruction “known content” and the residual “new content.” Each of these images was a case of within-class novelty, since the same class had appeared earlier in the selections seen by the user. As shown in Figure 17, a corn image includes a green can, a daisy image includes a yellow (not brown) center, and a banana image is dominated by the stem and very green bananas. Within-class novelty is often more subtle than the discovery of a new class. Survey B users, who received the novelty explanation highlighting these differences, judged the images to be novel more frequently than did survey A users, who did not have the aid of the explanation.
In contrast, for three other images (see Figure 18), survey A users were more likely than survey B users to vote “Yes” for novelty. These three images were all examples of ImageNet classes the user had not yet seen, so they were genuinely new. There is no clear explanation for why users given explanations would be less likely to consider these images novel. They all occurred early in the sequence (selections 2, 3, and 5), so we speculate that users were confused by or unable to interpret the explanations at this point in the survey. As they gained more experience, they were able to extract more meaning from the explanations, leading to the more fine-grained decisions that were made in Figure 17, which occurred in selections 14, 16, and 17. This observation provides useful feedback that additional user training prior to working with explanations, or the ability to reference the tutorial while browsing explanations, could benefit future users.
5 Conclusions and Future Work
We have developed and evaluated the first approach to generating visual explanations of novel image content. This technique can provide a prioritized list of unusual or anomalous images within a large data set to enable the best use of limited human review time. Images with novel content may signal the occurrence of unanticipated artifacts (e.g., data corruption) or the potential for a new scientific discovery (e.g., a new insect species). Our approach combines the use of a convolutional neural network (CNN) to extract features to represent image content, a novelty detection method based on reconstruction error, and an up-convolutional network to convert novelty explanations from the CNN feature space back into image space for human interpretation.
This approach achieves strong (sometimes near-perfect) performance in discovering new classes within large data sets for a range of image domains, including standard ImageNet images, microscope images of insects collected in freshwater streams, and images collected by a rover on Mars. We evaluated the utility of the corresponding image explanations through a user study and found that visual novelty explanations influenced user decisions about novelty by highlighting specific aspects of the image that were new. The explanations increased the chance that users detected within-class novelty. We observe that there is a human learning curve, and we expect that users would benefit from seeing multiple example explanations so that they are well positioned to understand the explanations.
This approach can be used to accelerate analysis and discovery in a variety of application areas, ranging from surveillance to remote sensing to medical diagnosis (Chandola et al., 2009; Pimentel et al., 2014). Many investigations employ cameras to observe phenomena of interest. By focusing attention on the most novel images, our approach can help investigators quickly zero in on the observations most likely to lead to new discoveries.
We have also found that DEMUD-VIS is very good at detecting mislabeled examples, since they contain unusual content with respect to their class. In a separate experiment with ImageNet data, we applied DEMUD-VIS to 63 images from the tiger cub class. The ninth selection turned out to be a leopard cub that does not belong to this class. As shown in Figure 19, the reconstruction is very poor, because DEMUD-VIS had only seen tiger cubs and could not reproduce the leopard. Tiger-stripe-like features are evident (and incorrect). The residual highlights the green leaf and the leopard’s spots, which are novel and help diagnose the ImageNet labeling error. This capability could be useful in exploring even fully labeled data sets, to help identify labeling errors and/or adversarial examples.
Acknowledgements. We thank the Planetary Data System Imaging Node for funding this project. Part of this research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. All rights reserved. ©2019 California Institute of Technology. U.S. Government sponsorship acknowledged.
- Certifiably optimal rule lists for categorical data. Journal of Machine Learning Research 19, pp. 1–79. Cited by: §2.
- Network dissection: quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6541–6549. Cited by: §2, §3.1.
- Explanation and justification in machine learning: a survey. In Proceedings of the IJCAI-17 Workshop on Explainable AI, pp. 8–13. Cited by: §2, §2.
- The OpenCV Library. Dr. Dobb’s Journal of Software Tools. Cited by: §4.2.2.
- Anomaly detection: a survey. ACM Computing Surveys 41 (3), Article No. 15. Cited by: §1, §5.
- Real time image saliency for black box classifiers. In Proceedings of the 31st Conference on Neural Information Processing Systems, pp. 6970–6979. Cited by: §2.
- Histograms of oriented gradients for human detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 886–893. Cited by: §3.1.
- Discriminative features for identifying and interpreting outliers. In 2014 IEEE 30th International Conference on Data Engineering, pp. 88–99. Cited by: §2.
- DeCAF: a deep convolutional activation feature for generic visual recognition. In Proceedings of the 31st International Conference on Machine Learning, E. P. Xing and T. Jebara (Eds.), Beijing, China, pp. 647–655. Cited by: §3.1.
- Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 658–666. Cited by: §3.3, §3.3, §4.4.1.
- Inverting visual representations with convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4829–4837. Cited by: §2, §2, §3.3, §4.4.1.
- Mining outlying aspects on numeric data. Data Mining and Knowledge Discovery 29 (5), pp. 1116–1151. Cited by: §2.
- DARPA’s explainable artificial intelligence program. AI Magazine 40 (2), pp. 44–58. Cited by: §2.
- An introduction to variable and feature selection. Journal of Machine Learning Research 3, pp. 1157–1182. Cited by: §2.
- Caffe: convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093. Cited by: §3.1, §3.
- Novelty detection for multispectral images with application to planetary exploration. In Proceedings of the Innovative Applications in Artificial Intelligence Conference, pp. 9484–9491. Cited by: §4.1.
- Examples are not enough, learn to criticize! Criticism for interpretability. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 2280–2288. Cited by: §2.
- ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1097–1105. Cited by: §3.1.
- The structure of scientific revolutions. University of Chicago Press. Cited by: §1.
- Incremental learning for visual tracking. In Advances in Neural Information Processing Systems 17, pp. 793–800. Cited by: §3.2.
- Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2), pp. 91–110. Cited by: §3.1, item 2.
- Automated processing and identification of benthic invertebrate samples. Journal of the North American Benthological Society 29 (3), pp. 867–874. Cited by: §4.1.
- Explaining anomalies in groups with characterizing subspace rules. Data Mining and Knowledge Discovery 32, pp. 1444–1480. Cited by: §2.
- Understanding deep image representations by inverting them. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5188–5196. Cited by: §2, §3.3.
- Dictionary-free categorization of very similar objects via stacked evidence trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 549–556. Cited by: §4.1.
- Deep multi-scale video prediction beyond mean square error. In 4th International Conference on Learning Representations, Cited by: §3.3.
- Explaining outliers by subspace separability. In Proceedings of the IEEE 13th International Conference on Data Mining, pp. 518–527. Cited by: §2.
- Methods for interpreting and understanding deep neural networks. Digital Signal Processing 73, pp. 1–15. Cited by: §2, §4.2.3.
- Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, pp. 807–814. Cited by: §3.1.
- Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (7), pp. 971–987. Cited by: §3.1.
- Feature visualization. Distill. Note: https://distill.pub/2017/feature-visualization Cited by: §3.3.
- Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1717–1724. Cited by: §3.1.
- A review of novelty detection. Signal Processing 99, pp. 215–249. Cited by: §5.
- CNN features off-the-shelf: An astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 806–813. Cited by: §3.1.
- “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. Cited by: §2.
- Please stop explaining black box models for high-stakes decisions. In Proceedings of the NIPS Workshop on Critiquing and Correcting Trends in Machine Learning, Cited by: §2, §2.
- ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §3.1, §4.1.
- Sequential feature explanations for anomaly detection. ACM Transactions on Knowledge Discovery from Data 13 (1). Cited by: §2, §4.2.3.
- Mars surface image (Curiosity rover) labeled data set (version 1.0.0). Note: Data set on Zenodo. Cited by: §4.1.
- Guiding scientific discovery with explanations using DEMUD. In Proceedings of the Twenty-Seventh Conference on Artificial Intelligence, pp. 905–911. Cited by: §1, §2, §3.2.
- Interpretable discovery in large image data sets. In Proceedings of the Workshop on Human Interpretability in Machine Learning (WHI), pp. 107–113. Cited by: §2.
- Deep Mars: CNN classification of Mars imagery for the PDS Imaging Atlas. In Proceedings of the Innovative Applications in Artificial Intelligence Conference, pp. 7867–7872. Cited by: §4.1.
- Understanding neural networks through deep visualization. In Proceedings of the ICML Deep Learning Workshop. Cited by: §2.
- Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision, pp. 818–833. Cited by: §2, §3.3.