1 Introduction
In computer vision, data mining, and machine learning (ML), a feature is a measurable variable that characterizes a particular kind of property or attribute of a data object (e.g., an image, a time series, or a multivariate record). Many technical solutions in these fields rely heavily on model-developers’ knowledge about various features and include human-centric feature engineering as a critical process in a model development workflow [EI96, Alp10].

On the other hand, some other technical solutions were designed to minimize the dependence on human knowledge of potentially useful features. For example, in deep learning, neural networks are typically expected to learn how to extract a good number of useful features automatically [LBH15]. At the same time, there have also been concerns that some so-called “useful” features may actually be harmful because they contribute towards undesirable biases [ZWW18, KRCP17]. Inevitably, model-developers have been interested in what features may or may not have been learned by an ML model. A class of visualization techniques, such as neuron activation plots, filter plots, gradient ascent plots [SDBR14], Deconvolution [ZF14], and their variants, have been widely used by developers of neural networks to observe neurons. Since a neural network typically consists of a huge number of neurons, such visual observation may encounter several obstacles, including the time needed to view all neurons that may reveal some features, the subjectivity and memory limitations of an observer, and uncertainty about the semantic meaning of an observed feature. More importantly, while most model-developers have a nontrivial amount of knowledge about features that are potentially useful or harmful, their initiatives are limited to searching for patterns in many thousands of neuron-based plots and speculating whether a feature has been learned.

In this work, we propose a new visual analytics approach that enables model-developers to use their knowledge and initiatives in hypothesising and evaluating whether any feature may be useful or harmful, whether such a feature is learned by a model, and how it may affect a learned model. In particular, we outline a framework for testing such hypotheses systematically, and describe the underlying statistical and logical analysis for inferring conclusions about multiple hypotheses from multiple sets of testing results. Because many model-developers may not be familiar with or remember the underlying statistical and logical analysis, we developed a visual analytics tool, HypoML, for carrying out the analysis as well as for depicting the flow of inference (Figure [), facilitating rapid observation of the conclusions and the logical flow between the testing data and hypotheses. We have made HypoML available as open-source software: a demo is available at https://hypoml.bitbucket.io/ and the source code is available at https://bitbucket.org/hypoML/hypoml.bitbucket.io.

The term “feature” typically implies a piece of information contained in the original input data. Since HypoML can also be used to test a hypothesis about a piece of information that may not be part of the original data, we will use the term “concept-based hypotheses” to describe what is to be tested with HypoML.
2 Related Work
While machine learning (ML) has an important role to play in visualization and visual analytics [ERT17], almost every aspect of ML processes can benefit from visualization, as shown by a recently established ontology, VIS4ML [SKKC19]. In general, when model-developers observe some phenomena in an ML process, such as its training and testing data, results, the inner states of a model, and the provenance of the learning process, they acquire new information to inform their various decisions that affect the ML process. As demonstrated quantitatively by Tam et al. [TKC01], a model-developer can contribute a huge amount of knowledge (measured in bits) to an ML process through the use of visualization. This work focuses on the evaluation stage of ML workflows.
Methods for evaluating ML models can be categorized into two main classes: black-box analysis and white-box analysis. Here we focus our review on previous work on model evaluation that features visualization techniques. More comprehensive surveys on using visualization for ML can be found in the works of Zhang and Zhu [ZZ18] and Hohman et al. [HPRC20].
Black-box analysis enables users to investigate and evaluate ML models without knowing the internal working mechanism. Statistical metrics (e.g., accuracy, recall), ROC curves, and confusion matrices are widely used forms of black-box analysis and are commonly provided as built-in functions in machine learning environments. To aid the aggregated statistical analysis, researchers have recently proposed visualization techniques to support black-box evaluation of ML models [ACD15, KPN16, RAL17, ZWM19]. For example, Squares [RAL17] juxtaposes a set of histograms to present an instance-level visualization for models in multiclass classification problems. Manifold [ZWM19] employs a scatterplot-based visual technique to assist in the comparison between multiclass classifiers. However, these techniques focus mainly on visualizing model performance and offer limited support for model-developers to ask in-depth questions about the model or the experiment results, or to evaluate specific hypotheses in a statistically meaningful way.
White-box analysis, by contrast, opens the black box and displays the internal states of ML models. A number of visualization tools have been proposed to support white-box analysis of different ML models, including MLPs [RFFT17], CNNs [LSL17, KAKC18, PHVG18, RFFT17, LCJ18], deep generative models [LSC18, WGYS18, KTC19], and RNNs [MCZ17, SGPR18]. Although these tools have utilized some of the most sophisticated visual representations and have assisted model-developers in evaluating, understanding, and explaining their models, comprehending a huge number of high-dimensional internal states is naturally challenging for humans.
In addition, researchers have proposed techniques to summarize information about internal variables and present the summary information visually. Salience-based methods, such as CAM [ZKL16], Grad-CAM [SCD17], and guided backpropagation [SDBR14], identify discriminative regions in the input image and thus highlight important features for a certain prediction. However, these salience-based methods can only offer explanations for specific predictions and cannot confirm whether or not a concept has been learned. To offer instance-independent explanations, Yosinski et al. employed gradient ascent plots [YCFL15] to depict the patterns that an individual neuron has learned. Figure 1 illustrates a small selection of gradient ascent plots being observed in conjunction with a CNN. However, even such a simple model has a huge number of neurons, so it is impossible for model-developers to conduct a full examination. Moreover, the depicted pattern would largely be a hunch rather than a proof that a certain concept is or is not useful to the classification task. Perhaps the most relevant to our work is TCAV [KWG18], which learns human-friendly concepts from an already trained model and conducts hypothesis testing. However, TCAV requires a time-consuming process to label the concept across the whole dataset.
In this work, we propose a novel ML-testing framework that combines black-box and white-box analysis. Whether an ML model has learned a concept or feature is a typical “internal problem” that is to be investigated using white-box analysis. The new framework allows model-developers to investigate “internal problems” in the manner of black-box analysis.
3 Concept-Based Testing of ML Models
Let M be a machine-learned (ML) model that transforms an input data object to an output decision that may be a classification label or a prediction. A concept is a variable that is not explicitly defined in , but is hypothesized by an ML model-developer to be useful or harmful to the quality of the output decision should M be able to access some extra information about . Figure 2 shows several examples of concepts. We can observe that some concepts may be extracted from the original data objects using known techniques, while it may be almost impossible to infer some other concepts from the data objects.
As long as M has a finite number of constructs (e.g., neurons or tree nodes) or receives input data with finite informative dimensions, there will always be some concepts that M cannot learn. Inevitably, most model-developers will have questions about some concepts in relation to a learned model M. For example, considering the examples in Figure 2, one may ask:

Would having an extended field of view be useful for recognizing an object captured from a less ideal viewing angle?

Would another model for detecting an anomalous background or some scale inconsistency be useful to differentiate a toy from a real building?

Would another model that is able to detect an object in an unusual position and estimate the rotation angle be useful to the recognition of the object?

Would having additional information about a geographical context improve the accuracy of building recognition?
One can easily imagine many other questions about different types of extra information, such as different metadata, multiple data capture modalities, and various preprocessing techniques. All these questions are essentially hypotheses. Just as in psychology, healthcare, social science, and many other disciplines, one can conduct experiments to evaluate such hypotheses. Indeed, one can test ML models against many thousands of data objects in comparison with tens of stimuli in typical empirical studies.
Because model testing is a routine operation in ML, it is desirable to establish a structured method such that many model-developers can adopt the same method and produce comparable testing results. Because the above definition of concept is relatively broad, developers of different ML models in various applications can benefit from open-source software or commercial systems for supporting such a structured testing method.
Figure 3 illustrates the framework for concept-based hypothesis testing. Given an ML process and a training and testing dataset, a model-developer is interested in knowing how some extra information about a concept may affect the ML process and the learned model. The framework thus requires the developer to invoke two ML processes that receive two pieces of input data. As shown on the left of Figure 3, both processes take the original training data as one piece of the data. For the other piece, one process takes random noise as its input, while the other takes extra information about a concept (denoted by the sign “”).
Following the same procedure for model training, the two ML processes generate two learned models, and , respectively. The framework then requires the model-developer to test each model with two runs. As illustrated in the middle column of Figure 3, one testing run uses testing data that does not have extra information, while the other run uses testing data that includes extra information. The two runs with thus produce two sets of results, and , while the two runs with produce and . Because evaluating an ML model typically involves testing many thousands of data objects, some computational analysis of the four sets of results will be necessary.
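The two-model, two-run design described above can be sketched as a small harness. This is only an illustration under stated assumptions, not HypoML's code: `train`, the dataset arguments, and the dictionary keys are hypothetical placeholders that a model-developer would supply.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical types: a sample pairs the original data with a second input
# piece (noise or extra information); a model maps one sample to a label.
Sample = Tuple[object, object]
Model = Callable[[Sample], object]


def run_concept_test(
    train: Callable[[Sequence[Sample], Sequence[object]], Model],
    train_noise: Sequence[Sample],   # original training data + random noise
    train_extra: Sequence[Sample],   # original training data + extra information
    train_labels: Sequence[object],
    test_noise: Sequence[Sample],    # testing data without extra information
    test_extra: Sequence[Sample],    # testing data with extra information
) -> Dict[str, List[object]]:
    """Train two models and test each with two runs, giving four result sets."""
    model_plain = train(train_noise, train_labels)
    model_plus = train(train_extra, train_labels)
    return {
        "plain_model/no_extra": [model_plain(s) for s in test_noise],
        "plain_model/extra": [model_plain(s) for s in test_extra],
        "plus_model/no_extra": [model_plus(s) for s in test_noise],
        "plus_model/extra": [model_plus(s) for s in test_extra],
    }
```

Keeping the four result lists in the same sample order makes them directly usable for paired comparisons later on.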
HypoML is designed to support the computational analysis. In particular, it provides statistical and logical analysis for evaluating a set of hypotheses. The statistical analysis is based on the well-established method for hypothesis testing, while the logical analysis is formulated in this work for reasoning about the intertwining relationships between 12 hypotheses and 6 statistical conclusions drawn from different pairs of results. To assist users in understanding such complex relationships, HypoML provides a purposely designed visual representation, which enables users to trace the conclusion of each hypothesis to related statistical analysis and to the corresponding testing results.
The 12 hypotheses are listed on the right of Figure 3. The first two hypotheses, and , are about whether the concept concerned is useful (or harmful) to , and would be useful (or harmful) to . Although the conclusions for these two hypotheses cannot in principle be both positive, each can also be inconclusive. We thus follow the convention of hypothesis testing by listing them as separate hypotheses, each of which can be independently confirmed, rejected, or unproven (inconclusive).
hypothesizes that model has already learned the concept adequately, while hypothesizes that model has learned the concept adequately. For , the adverb “adequately” implies that the concept can be learned by a model, such as , without the need for any extra information about the concept. For , the adverb “adequately” implies that would perform worse without the extra information of the concept.
In general, model has not been trained with extra information. It is thus not expected to be affected by any extra information during testing. However, as a scientific exercise, one cannot take this assumption for granted since one cannot assume that a model template (e.g., an untrained neural network) has always been configured correctly or a training method has always been implemented correctly. and are thus designed to examine whether is affected positively or negatively by the extra information during testing. Because there exists an inconclusive state, they are kept as two separate hypotheses, in a way similar to and .
When model is trained with extra information, the model may learn new capability from the extra information, while losing some capability that would be learned without the extra information. The reverse could also be true. , , , and are for investigating the trade-off between different parts of in the development of its intelligence. Depending on the design of the model template or architecture, the parts of for handling the extra information ( part) and the original information () can be quite separate or rather integrated. When the two parts are more integrated, one should consider the two parts as functional units rather than geometric or topological regions. Similarly, we separate from , and separate from because of the inconclusive state in each case. We also anticipate that more testing and analysis methods may be developed in the future, which may support or reject those apparently paired hypotheses asymmetrically. Having separate hypotheses will not hinder such advancement.
4 Statistical and Logical Reasoning of Hypotheses
As shown in Figure 3, HypoML receives four sets of results, namely , , , and . Each set of results is a list of tuples, each of which consists of:

id — the unique identifier of a data object. The data object may be an image, a feature vector, a multivariate data record, or a more complex data record.

ground truth — a ground truth label, which can be a nominal value, an integer, a real number, a range, or a data record of a more complex data type (e.g., a time series).

ML label — a label generated by an ML model. The label must be of the same data type as ground truth.

ML uncertainty — an optional value indicating the uncertainty estimated by an ML model equipped with a self-assessment capacity. It is a real number in the range [0, 1] with 1 being the most uncertain. Many ML models may not have any self-assessment capacity, and in such a case, this entry takes the default value 0. Some ML models may return a confidence value, which can easily be converted to uncertainty.

correctness — This is a value in the range of [0, 1] with 1 indicating absolutely correct, and 0 indicating absolutely incorrect. The value is mostly computed based on ground truth and ML label using a user-defined function. The simplest function returns 1 (true) if ground truth equals ML label, and 0 otherwise. A more complicated function may feature a distance or similarity metric.

correctness with uncertainty — This is used by the statistical analysis and is defined as ML uncertainty correctness.
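The tuple described above might be represented as follows. This is a minimal sketch rather than HypoML's actual data model: the class and function names are hypothetical, and the combination rule (1 − uncertainty) × correctness is an assumption, chosen so that the default uncertainty of 0 leaves correctness unchanged. Converting confidence to uncertainty as its complement is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Hashable


def exact_match(ground_truth: object, ml_label: object) -> float:
    """Simplest correctness function: 1 if the labels are equal, 0 otherwise."""
    return 1.0 if ground_truth == ml_label else 0.0


@dataclass(frozen=True)
class ResultTuple:
    id: Hashable
    ground_truth: object
    ml_label: object
    correctness: float            # in [0, 1]
    ml_uncertainty: float = 0.0   # in [0, 1]; 0 when no self-assessment exists

    @property
    def correctness_with_uncertainty(self) -> float:
        # ASSUMPTION: uncertainty discounts correctness multiplicatively.
        return (1.0 - self.ml_uncertainty) * self.correctness


def make_result(
    id: Hashable,
    ground_truth: object,
    ml_label: object,
    ml_uncertainty: float = 0.0,
    correctness_fn: Callable[[object, object], float] = exact_match,
) -> ResultTuple:
    """Build one tuple, computing correctness with a user-defined function."""
    if not 0.0 <= ml_uncertainty <= 1.0:
        raise ValueError("ml_uncertainty must lie in [0, 1]")
    score = correctness_fn(ground_truth, ml_label)
    return ResultTuple(id, ground_truth, ml_label, score, ml_uncertainty)


def uncertainty_from_confidence(confidence: float) -> float:
    # ASSUMPTION: confidence and uncertainty are complementary.
    return 1.0 - confidence
```

A distance-based or similarity-based metric can be plugged in through `correctness_fn` without changing the rest of the structure.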
Given two sets of results, and , we assume that the tuples in the two lists are paired, i.e., the id entries are in exactly the same order. We can compare and with their accuracy, i.e., the average of correctness with uncertainty. As testing in ML often shows small variations of accuracy, it is necessary to measure the statistical significance. HypoML uses a paired, two-tailed test for this purpose. Let us introduce the following notation to denote the possible outcomes of the statistical analysis.

— It is statistically significant that is lower than .

— It is statistically significant that is higher than .

— It is statistically insignificant that is higher or lower than .

— or , but not .

— or , but not .
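A paired, two-tailed comparison of two result lists could be sketched as below. Several choices here are assumptions rather than HypoML's documented method: the statistic is the usual paired-difference ratio, the p-value uses a normal approximation (reasonable for the thousands of data objects typical in ML testing), and the outcome strings "lower", "higher", and "insignificant" are hypothetical stand-ins for the notation above.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_two_tailed(
    a: Sequence[float], b: Sequence[float], alpha: float = 0.05
) -> Tuple[str, float]:
    """Compare paired scores; report whether `a` is lower or higher than `b`."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two paired lists of equal length >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    d_mean = mean(diffs)
    d_sd = stdev(diffs)
    if d_sd == 0.0:
        # All differences identical: no evidence if they are all zero.
        p = 1.0 if d_mean == 0.0 else 0.0
    else:
        z = d_mean / (d_sd / math.sqrt(len(diffs)))
        # Two-tailed p-value under a standard normal approximation.
        p = math.erfc(abs(z) / math.sqrt(2.0))
    if p >= alpha:
        return "insignificant", p
    return ("lower" if d_mean < 0 else "higher"), p
```

For small samples, an exact t-distribution would be the more careful choice than the normal approximation used here.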
With four sets of results, there are six pairwise statistical comparisons, which are labelled as . Each analytical conclusion may support or reject some of the 12 hypotheses , but not all. For example, the analysis , which compares and , can inform the evaluation of and . If is statistically better than , i.e., , we can draw a conclusion that supports and rejects . If , supports and rejects . If , returns an unproven (inconclusive) verdict about and .
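The count of six comparisons, and the way a single comparison informs a pair of opposed hypotheses, can be illustrated with a short sketch. The result-set names, hypothesis names, and outcome strings below are hypothetical placeholders, not the paper's labels.

```python
from itertools import combinations
from typing import Dict

RESULT_SETS = ["R1", "R2", "R3", "R4"]            # hypothetical names
PAIRS = list(combinations(RESULT_SETS, 2))        # the six pairwise comparisons


def verdicts_for_opposed_pair(
    outcome: str, h_better: str, h_worse: str
) -> Dict[str, str]:
    """Map one comparison outcome onto two opposed hypotheses.

    `outcome` is "higher", "lower", or "insignificant" for the first result
    set relative to the second.
    """
    if outcome == "higher":
        return {h_better: "supported", h_worse: "rejected"}
    if outcome == "lower":
        return {h_better: "rejected", h_worse: "supported"}
    return {h_better: "unproven", h_worse: "unproven"}
```

For instance, `verdicts_for_opposed_pair("insignificant", "H_useful", "H_harmful")` leaves both hypothetical hypotheses unproven.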
With some careful reasoning, we can observe that can also inform the evaluation of , , , and . While can inform the evaluation of , , , , , and , it can only do so provided that some other hypotheses have already been confirmed or rejected. Table 1 summarizes the relations between the six sets of statistical analysis and the 12 hypotheses .
Table 1: Relations between the six sets of statistical analysis and the 12 hypotheses. Columns: Analysis, Condition, Hypothesis.
Clearly, reasoning about these relations is time-consuming and error-prone. In order to support the frequent analytical tasks of the developers in testing their ML models, HypoML provides automated logical analysis as well as statistical analysis. To help describe the logical analysis, we employ some additional notation:

— The statement is true.

— The statement is false.

— The statement is unproven.

— Logical conjunction.

— Logical (inclusive) disjunction.
We can now specify the logical inference from as:
: v. may conclude:

. This reads as , , and are all true, and , , and are all false.

.
Analysis cannot draw conclusions about and , but its conclusion may depend on them. In general, there is a commonsense assumption that neither nor is likely to be true.
: vs. may conclude:

(i) if then ; or
(ii) if then ; or
(iii) if . This offers an explanation but it is against a commonsense assumption that is unlikely to be true, and should be treated cautiously. 
(i) if then ; or
(ii) if then ; or
(iii) if . This offers an explanation but it is against a commonsense assumption that is unlikely to be true, and should be treated cautiously.
Because analysis does not compare with , the conclusion is limited to the context of . Mathematically, it is possible for to conclude that the concept is useful in the context of , while or concludes that the concept is harmful or is neither useful nor harmful. Considering this limitation, it is unsafe for this analysis to draw a conclusion about and . Meanwhile the analysis depends on the conclusions of and in a small way.
: v. may conclude:

(i) if , then ; or
(ii) if , then ; or
(iii) if , then . 
(i) if , then ; or
(ii) if , then ; or
(iii) if , then . This conclusion is against a commonsense assumption that a useful concept normally should not affect the extra part of M+ negatively, and should be treated cautiously.
Analysis is relatively easy to reason about, and it is useful for investigating whether the part of model for handling the original data becomes less capable due to the training with extra information.
: v. may conclude:

.

.
cannot draw conclusions about and , but its conclusion may depend on them. In general, there is a commonsense assumption that neither nor is true.
: v. may conclude:

(i) if then ; or
(ii) if then ; or
(iii) if . This offers an explanation but it is against a commonsense assumption that is unlikely to be true, and should be treated cautiously. 
(i) if then ; or
(ii) if then ; or
(iii) if . This offers an explanation but it is against a commonsense assumption that is unlikely to be true, and should be treated cautiously.
Analysis is the only comparison that may inform the evaluation of and . In general, there is a commonsense assumption that neither nor is true if the model template or architecture was correctly defined, the correct ML method was followed, and the correct ML process was executed. When or is confirmed, it usually suggests some imperfection of the model template or learning process. Therefore the conclusions of should not be taken at face value. However, the evaluation of and is necessary since and depend on them.
: vs. may conclude:

;

.
Because of the dependencies among the six sets of analysis, the computation of the logical inference must follow an appropriate order, which is summarized as follows:
STEP 0: Initialise the indicator of each hypothesis to 0.
STEP 1: Compute the six comparative values, i.e., , in terms of , , and , based on statistical analysis.
STEP 2: Compute the logical inference (i.e., in terms of ) based on , , . For each true statement, i.e., , add to the indicator of . For each false statement, i.e., , add to the indicator of .
STEP 3: Compute the indicators based on , .
STEP 4: Compute the indicators based on .
STEP 5: Then display each indicator based on positive or negative values. HypoML displays each hypothesis according to its indicator in three states: (confirmed), 0 (unproven), (rejected).
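The staged accumulation in the steps above can be sketched as follows. This is an illustration under assumptions: the increments of +1 for a true statement and −1 for a false one are assumed, the hypothesis names are hypothetical, and each stage is modelled as a function that may read the indicators produced by earlier stages.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Verdict = Tuple[str, bool]                      # (hypothesis, statement is true?)
Stage = Callable[[Dict[str, int]], List[Verdict]]


def accumulate_indicators(
    hypotheses: Iterable[str], stages: List[Stage]
) -> Dict[str, int]:
    """Run the inference stages in order, updating one indicator per hypothesis."""
    indicators = {h: 0 for h in hypotheses}     # STEP 0: initialise to 0
    for stage in stages:                        # STEPS 2-4: ordered stages
        for name, is_true in stage(indicators):
            # ASSUMPTION: +1 for a true statement, -1 for a false one.
            indicators[name] += 1 if is_true else -1
    return indicators


def display_state(indicator: int) -> str:
    """STEP 5: map an indicator to one of the three displayed states."""
    if indicator > 0:
        return "confirmed"
    if indicator < 0:
        return "rejected"
    return "unproven"
```

Passing the current indicators into each stage is one way to express that later analyses are conditional on hypotheses settled by earlier ones.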
5 Visual Analysis of Hypotheses
Figure 4 shows a typical workflow of the proposed hypothesis testing. To start with, model-developers conduct experiments and obtain four sets of results, i.e., , , , and . HypoML then performs six sets of statistical analysis by comparing each pair of the results. Based on the statistical analysis, HypoML makes logical inferences about the twelve hypotheses, deciding whether a hypothesis should be supported or rejected.
It is helpful for model-developers to make quick observations about the analysis and conclusions. It will also be useful for the model-developers to convey the outcomes of the test to other stakeholders, such as users of the ML models being evaluated. It can be difficult for some model-developers and many ML users to remember and reason about the complicated relationships among experiment results, statistical and logical analysis, and multiple hypotheses. Therefore, an effective visual representation is necessary. The bipartite graph shown in Figure 4 is a straightforward solution but it exhibits several shortcomings that hinder efficient information acquisition and effective information dissemination.
One main shortcoming is the cluttered links between the six statistical comparisons and the twelve hypotheses. These links have no obvious or memorable structures and are difficult to track by eye. One can add additional visual encoding to these links to depict three types of conclusions (i.e., reject, support, unproven) and conditional dependency. However, such encoding would further worsen the cluttering of the bipartite graph. To address this issue, we designed a matrix-based visualisation for HypoML as shown in Figure 5(a), where four types of icons (a2) are introduced to indicate reject, support, unproven, and conditional dependency.
The second shortcoming is that simply listing numerical values (e.g., the accuracy of experiment results, the p-values of statistical comparisons) incurs a fair amount of cognitive load upon users who have to compare and analyse them numerically. We thus visually encoded these values while maintaining the numerical representations. In particular, HypoML depicts experiment results with positions, since position is considered to be the most effective visual channel [Mun14]. As shown in Figure 5(c), the position of the circle indicates the average accuracy while the line indicates the 95% confidence interval.
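The two quantities plotted for each result set, the average accuracy and its 95% confidence interval, could be computed as in the sketch below. It assumes a normal approximation with z = 1.96 and clamps the interval to [0, 1]. How HypoML derives its interval is not stated here, so treat this as one plausible choice.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def accuracy_with_ci(
    scores: Sequence[float], z: float = 1.96
) -> Tuple[float, float, float]:
    """Return (mean accuracy, lower bound, upper bound) for per-object scores."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    acc = mean(scores)
    half_width = z * stdev(scores) / math.sqrt(n)   # z times the standard error
    return acc, max(0.0, acc - half_width), min(1.0, acc + half_width)
```

The `scores` argument would typically be the list of correctness-with-uncertainty values of one result set.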
We decided to encode the p-value using a glyph, and considered several alternative designs as shown in Figure 5(b1). With the first design option, the area of a circle is used to encode the level of statistical significance, i.e., the inverse of a p-value. The smaller the p-value, the more significant the difference, and the larger the circle. However, in an informal pilot study, this design was found to be “confusing” due to the reverse encoding. With the second design option, the p-value is encoded using the area of an orange circle, which is inside a large blue circle of a fixed size. While this design enables direct observation of statistical significance through the blue area as well as the p-value through the orange area, it was found to be “unintuitive” for those who were unfamiliar with the definition of a p-value. We finally settled on the third design, which is based on a widely used illustration for explaining the concept of statistical hypothesis testing. In this design, the whole shape represents a normal distribution and the area in orange coarsely encodes the p-value. The normal distribution curve can quickly remind users of the meaning of a p-value.

The third shortcoming is that while depicting the reasoning flow from data to conclusion as in Figure 4 correctly represents the temporal order of the computation, it would slow users down when they wish to find out the conclusions quickly. We thus reverse the order of the workflow in both the vertical and horizontal versions of the visual user interface (see [ and Figure 5). The horizontal design is more suitable for widescreen displays, while the vertical design can be used on portable devices and high-resolution monitors. Users may benefit from having both designs available.
Both versions of the interface were designed and developed by following an iterative design process with regular feedback from potential users, including model-developers and ML model users. Through such feedback, we discovered that most users would prefer to observe the conclusions of the hypotheses as soon as the testing results were loaded into HypoML. They could then decide whether it would be necessary to track back to the statistical comparison and experiment results for detailed reasoning. We also discovered that the double encoding used for the p-values and hypotheses had enhanced users’ perception of the information and enabled them to switch rapidly between overview (through visual encoding) and details on demand (through numerical values) by simply changing their visual attention. While each p-value is already encoded using the glyph and numerical value, we further encode it through its links with the testing results. The link width indicates the inverse of the p-value and the link style (i.e., solid or dashed) shows whether the difference between two sets of results is statistically significant or not (Figure 5(b2)). While the decision state of a hypothesis is already encoded using icons in the matrix, we double encode it using black and two greyscale values to indicate the levels of support for the hypothesis (Figure 5(a1)). The black color draws users’ attention quickly to those hypotheses that have been confirmed.
HypoML supports a set of interactions. Users are allowed to modify the threshold of the p-value, which may lead to changes in the conclusions of the hypotheses and a dynamic update of the whole visualization. By hovering over a p-value, users can highlight the two corresponding sets of results.
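The effect of moving the threshold can be illustrated with a small sketch that reports which comparisons change their significance status. The comparison names and the p-values in the usage note are made up for illustration, and the functions are hypothetical rather than part of HypoML.

```python
from typing import Dict, List


def significant(p_values: Dict[str, float], threshold: float) -> Dict[str, bool]:
    """Flag each comparison whose p-value falls below the threshold."""
    return {name: p < threshold for name, p in p_values.items()}


def flipped_by_threshold(
    p_values: Dict[str, float], old: float, new: float
) -> List[str]:
    """Names of comparisons whose significance changes when the threshold moves."""
    before = significant(p_values, old)
    after = significant(p_values, new)
    return [name for name in p_values if before[name] != after[name]]
```

With illustrative values `{"c1": 0.003, "c2": 0.03, "c3": 0.4}`, tightening the threshold from 0.05 to 0.01 flips only `c2`, so only the hypotheses that depend on that comparison would need to be re-inferred.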
6 Results and Discussions
The testing reported in this section is primarily intended to check whether HypoML can correctly transform four sets of results , , , and into visual representations of the conclusions about 12 hypotheses. The examples shown are not intended to establish the truth about the goodness of any particular ML technique, but to demonstrate the practical uses of HypoML. If a developer suspects an ML model may have a shortcoming, HypoML can help the developer confirm or reject such a hypothesis. With convolutional neural networks (CNNs), a common wisdom is that the deeper and larger a CNN is, the more likely it is that a concept will be learned by the CNN. When our tests show that a particular CNN model has not learned a concept adequately, it does not necessarily mean that a more complicated CNN model would not be able to learn the concept either. This is indeed what testing is for in software engineering: the goal of testing is to discover the shortcomings of a model or a piece of software in order to improve the model or software.
We used the Fashion MNIST dataset [XRV17] to train a CNN model for classification. The model was specified using Keras and TensorFlow in Python, and was trained and tested using the Google Colaboratory server. We used the same CNN structure as that in the official example of Keras. This CNN consists of the following layers: convolution (3x3x32, ReLU), convolution (3x3x64, ReLU), max pooling (2x2), dropout (25%), flatten, dense (128, ReLU), dropout (50%), and dense (10, softmax). We refer readers to [ker14] for more details.

In each training session, a model is trained using 40,000 training images. With a batch size of 128 and 50 epochs, convergence occurs in around 5 minutes. In each test, a model is tested against 6,666 test images. These images are all of 28 × 28 8-bit pixels. The class labels are: (0) T-shirt/top, (1) Trouser, (2) Pullover, (3) Dress, (4) Coat, (5) Sandal, (6) Shirt, (7) Sneaker, (8) Bag, and (9) Ankle boot.

The original images in the Fashion MNIST dataset feature all fashion objects in an upright position. This naturally leads to a speculation that a trained model may not be rotation invariant. One possible way to address the need for rotation invariance is to train a model with images featuring randomly rotated objects, an approach widely employed in data augmentation techniques [SSP03]. As humans can determine easily if a fashion object is in an upright position or not, one may hypothesize that a classification model may benefit from the extra information from another model that can detect the rotation angle or perform rotation normalization.
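As a sanity check on the CNN layer list given earlier in this section, the following sketch traces output shapes and trainable parameter counts in plain Python. It assumes a single-channel 28 × 28 input, stride 1, and no padding for the convolutions (the Keras defaults). The function names are hypothetical, and the dropout layers are omitted because they neither change shapes nor add parameters.

```python
from typing import List, Tuple

Row = Tuple[str, Tuple[int, ...], int]   # (layer, output shape, trainable parameters)


def trace_cnn(height: int = 28, width: int = 28, channels: int = 1) -> List[Row]:
    """Walk through the described layers, recording shapes and parameter counts."""
    rows: List[Row] = []
    h, w, c = height, width, channels
    for filters in (32, 64):
        params = (3 * 3 * c + 1) * filters      # 3x3 kernels plus one bias per filter
        h, w, c = h - 2, w - 2, filters         # ASSUMPTION: no padding, stride 1
        rows.append((f"convolution 3x3x{filters}", (h, w, c), params))
    h, w = h // 2, w // 2                       # 2x2 max pooling halves each side
    rows.append(("max pooling 2x2", (h, w, c), 0))
    flat = h * w * c
    rows.append(("flatten", (flat,), 0))
    rows.append(("dense 128", (128,), (flat + 1) * 128))
    rows.append(("dense 10", (10,), (128 + 1) * 10))
    return rows


def total_parameters(rows: List[Row]) -> int:
    """Sum the trainable parameters over all traced layers."""
    return sum(params for _, _, params in rows)
```

Under these assumptions the trace gives a 12 × 12 × 64 tensor after pooling and 1,199,882 trainable parameters in total; a larger composite input would change both figures.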
Following the workflow depicted in Figure 3, we constructed two types of data. We applied random rotation to each image in the training and testing data. This resulted in a new training dataset and testing dataset . We then created the part of the data by simply reusing the original upright images, thereby presupposing the existence of a rotation normalization model. As illustrated in Figure 6, each group of three images shows an original image (left), an image in or (middle), and an image in or (right). The middle image contains only the rotated image, together with noise in the other three quadrants. The right image contains both the rotated image and the normalized image, together with noise in the two lower quadrants.
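The quadrant layout of the stimuli described above can be sketched with plain nested lists. The function names are hypothetical, the noise is assumed to be uniform random 8-bit values, and placing the rotated object in the upper-left quadrant with the extra information in the upper-right is an assumption based on the quadrant descriptions used in this section. The rotation itself is not shown, so the rotated tile is taken as given.

```python
import random
from typing import List, Optional

Tile = List[List[int]]   # square grid of 8-bit pixel values


def noise_tile(size: int, rng: random.Random) -> Tile:
    """ASSUMPTION: noise is uniform random over the 8-bit range."""
    return [[rng.randrange(256) for _ in range(size)] for _ in range(size)]


def compose_stimulus(
    rotated: Tile, normalized: Optional[Tile] = None, seed: int = 0
) -> Tile:
    """Build a 2x2-quadrant image.

    Upper-left: rotated object. Upper-right: normalized object if given,
    otherwise noise. Both lower quadrants: noise.
    """
    size = len(rotated)
    rng = random.Random(seed)
    upper_right = normalized if normalized is not None else noise_tile(size, rng)
    lower_left = noise_tile(size, rng)
    lower_right = noise_tile(size, rng)
    top = [rotated[r] + upper_right[r] for r in range(size)]
    bottom = [lower_left[r] + lower_right[r] for r in range(size)]
    return top + bottom
```

Calling `compose_stimulus(rotated)` gives the noise-only variant, while `compose_stimulus(rotated, upright)` gives the variant carrying the extra information.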
We then trained two models and , and tested each of them using two datasets and according to the workflow in Figure 3. From the four sets of testing results, HypoML carries out statistical and logical analysis and displays the results as shown in Figure [. In [, we can observe that six hypotheses have been confirmed. They indicate:

: The concept of rotation normalization is useful to and would be useful to .

: has learned from the concept of rotation normalization adequately.

: The extra information in , when it is fed to , has a negative effect on . Although has only learned from noise in the upper-right quadrant of the stimuli, when non-noise information appears in that area, it still affects , in a negative way.

: The extra information in (upper-right quadrant) has a positive effect on .

: Learning with affects the extra part of positively. This is somewhat anticipated because is confirmed.

: Learning with affects the part of negatively, that is, if the extra information is unavailable, performs worse than , which has not learned with the extra information.
When working with the dataset, we also noticed that the fashion objects in all images are maximized within the boundary of the image. We wondered whether this would introduce biases into a trained model. As humans can usually perceive the size of an everyday object fairly quickly, we hypothesized that a model that can remap a maximized object to a more realistic size may help the classification of such an object. As shown in Figure 7, we conducted another test by following the same workflow illustrated in Figure 3. In this case, the extra information features a scaled object in the upper-right quadrant. We measured typical sizes of fashion objects in each category and defined a relative range for the category accordingly. For the extra information, we randomly selected a scaling factor within the range defined for the corresponding category, and used the factor to scale the image. The analytical result is shown on the right of Figure 7. The conclusions are more or less the same as those of the rotation normalization test.
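The relative scaling step can be illustrated with a short sketch. The per-category ranges in `SIZE_RANGE` are hypothetical placeholders (the measured ranges are not given in the text), and the nearest-neighbour downscaling, the centring of the scaled object on a black frame, and the function names are likewise assumptions.

```python
import numpy as np

# Hypothetical relative size ranges per class label (fraction of the full frame).
SIZE_RANGE = {1: (0.8, 1.0), 5: (0.3, 0.5), 8: (0.4, 0.7)}

def scale_nn(img, factor):
    """Shrink a square image by `factor` (0 < factor <= 1) and centre it on a black frame."""
    h, w = img.shape
    new_h = max(1, round(h * factor))
    new_w = max(1, round(w * factor))
    # Nearest-neighbour sampling of source rows and columns.
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    small = img[np.ix_(rows, cols)]
    out = np.zeros_like(img)
    top, left = (h - new_h) // 2, (w - new_w) // 2
    out[top:top + new_h, left:left + new_w] = small
    return out

def relatively_scaled(img, label, rng):
    """Scale an image by a factor drawn from the range defined for its category."""
    low, high = SIZE_RANGE[label]
    return scale_nn(img, rng.uniform(low, high))

rng = np.random.default_rng(1)
# Random stand-in for a 28x28 image of class 5 (Sandal).
image = rng.integers(1, 256, size=(28, 28), dtype=np.uint8)
extra = relatively_scaled(image, 5, rng)
```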
To demonstrate a slightly more complex test design, we combined the above two tests to examine the combined effects of the two concepts, namely rotation normalization and relative scaling. As shown in Figure 8, we used the upper-left quadrant for the rotated object as the information present in all training and testing data. We placed the rotation-normalized and relatively-scaled object in the lower-right quadrant. As perhaps expected, the test confirmed the same set of hypotheses as the two tests mentioned before.
In general, a CNN is expected to learn features about some aggregated properties (e.g., mean, median, or mode). We thus conducted a test to see whether providing such a feature as a piece of extra information is useful. As shown in Figure 9, we introduced the average intensity value of an object as a single-colored square in the upper-right quadrant. The analysis of the test results indicates that most hypotheses are unproven. In other words, we cannot be sure whether this extra piece of information is useful or harmful. The only hypothesis that has been confirmed is , i.e., learning with affects the extra part of positively. However, this does not translate to a confirmation of about the overall positive impact on . By observing the details of how this hypothesis (i.e., ) was confirmed, we can see that it is confirmed only within the context of , without involving any tests about .
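As a minimal sketch of this kind of extra information, the average intensity of an object can be rendered as a single-colored square. Treating every pixel brighter than the background value as part of the object, and the function name `mean_intensity_patch`, are assumptions of this example and not details given in the text.

```python
import numpy as np

def mean_intensity_patch(img, background=0):
    """Return a single-colored square of the same shape as `img`,
    filled with the mean intensity of the pixels brighter than `background`."""
    mask = img > background
    mean = img[mask].mean() if mask.any() else 0.0
    return np.full(img.shape, int(round(mean)), dtype=np.uint8)

# Toy 28x28 image with two object pixels of intensity 100 and 200.
image = np.zeros((28, 28), dtype=np.uint8)
image[0, 0] = 100
image[1, 1] = 200
patch = mean_intensity_patch(image)
```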
Considering the intensity of the images further, one common idealized requirement in computer vision is lighting invariance, i.e., a model can recognize the same object under different lighting conditions. We thus hypothesized that another model for normalizing the intensity of an image may help a classification model. Using a strategy similar to that of the first test (random rotation), we randomly changed the intensity of the original images to create the benchmark datasets and . We then used the original images as the extra information, presupposing that the original images were the results of intensity normalization.
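One simple way to change the intensity of an image at random is sketched below. The text does not state how the intensity was changed, so the multiplicative model, the range of the factor, and the function name `perturb_intensity` are assumptions made purely for illustration.

```python
import numpy as np

def perturb_intensity(img, rng, low=0.3, high=1.0):
    """Scale all pixel intensities by one random factor drawn from [low, high)."""
    factor = rng.uniform(low, high)
    scaled = np.rint(img.astype(np.float32) * factor)
    return np.clip(scaled, 0, 255).astype(np.uint8)

rng = np.random.default_rng(2)
# Random stand-in for a 28x28 Fashion MNIST image.
image = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)
darkened = perturb_intensity(image, rng)
```

In the setup described above, `darkened` would play the role of the benchmark image and the untouched `image` would serve as the extra information.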
Figure 10 shows that the extra information is useful to (), and has learned the concept adequately (). While the test confirms and , it is inconclusive about and . Interestingly, the test unexpectedly confirms , i.e., the extra information in has a positive effect on . This is in some way related to the failure to confirm as in some earlier tests. For each image in , the signals in the extra information (i.e., the upper-right quadrant) are in many ways similar to those in (i.e., the upper-left quadrant). One possible explanation is that the signals in the upper-right quadrant somehow strengthen the signals in the upper-left quadrant, even though has not learned to use the extra information.
We have also conducted several other tests about randomly-sized class labels and images with incorrect labels. HypoML has also been shown to be useful for supporting such hypothesis testing.
7 Conclusions
In this paper, we propose a novel testing framework to aid the evaluation of ML models. In particular, this framework tests a set of hypotheses about a concept, checking whether extra information about the concept can benefit an ML model, and if so, how the extra information affects the model. The testing framework is underpinned by statistical analysis of the experiment results as well as logical inferences about the relations between six statistical conclusions and twelve hypotheses. Through an implementation of this framework, HypoML, we demonstrate that with a purposely-designed visual representation, model-developers can visualize the conclusions about the twelve hypotheses as soon as the four sets of testing result data become available. This approach complements the traditional way of observing various plots for monitoring neuron activities, such as activation plots and gradient ascent plots. Model-developers who observe interesting patterns, or fail to find desired patterns, can now formulate a concept-based hypothesis and carry out a structured test to evaluate it.
We recognize that HypoML is only one of the many steps towards an ultimate goal of developing a powerful testing suite for evaluating, understanding, and explaining ML models. There is a need for further theoretical and practical developments in this direction, including, for instance, formulating more detailed logical analysis for subgroup analysis of the testing results, designing an advanced user interface for supporting detailed observation of subgroup analysis, and integrating with other visualization techniques for observing, understanding, and explaining ML models.
References
 [ACD15] Amershi S., Chickering M., Drucker S. M., Lee B., Simard P., Suh J.: Modeltracker: Redesigning performance analysis tools for machine learning. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (Seoul, Republic of Korea, 2015), ACM, pp. 337–346.
 [Alp10] Alpaydin E.: Introduction to Machine Learning, 2nd ed. The MIT Press, 2010.

 [EI96] Elder IV J. F.: Machine learning, neural, and statistical classification. Journal of the American Statistical Association 91, 433 (1996), 436–439.
 [ERT17] Endert A., Ribarsky W., Turkay C., Wong B. L. W., Nabney I., Blanco I. D., Rossi F.: The state of the art in integrating machine learning into visual analytics. Computer Graphics Forum 36, 8 (2017), 458–486.
 [HPRC20] Hohman F., Park H., Robinson C., Chau D. H. P.: SUMMIT: Scaling deep learning interpretability by visualizing activation and attribution summarizations. IEEE Transactions on Visualization and Computer Graphics 26, 1 (2020).
 [KAKC18] Kahng M., Andrews P. Y., Kalro A., Chau D. H.: ActiVis: visual exploration of industry-scale deep neural network models. IEEE Transactions on Visualization and Computer Graphics 24, 1 (2018), 88–97.
 [ker14] Keras CNN examples. https://keras.io/examples/mnist_cnn/, 2014. Accessed: 2019-12-03.
 [KPN16] Krause J., Perer A., Ng K.: Interacting with predictions: Visual inspection of black-box machine learning models. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (San Jose, CA, USA, 2016), ACM, pp. 5686–5697.
 [KRCP17] Kilbertus N., Rojas-Carulla M., Parascandolo G., Hardt M., Janzing D., Schölkopf B.: Avoiding discrimination through causal reasoning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (Long Beach, California, USA, 2017), Curran Associates, pp. 656–666.
 [KTC19] Kahng M., Thorat N., Chau D. H., Viégas F. B., Wattenberg M.: GAN Lab: Understanding complex deep generative models using interactive visual experimentation. IEEE Transactions on Visualization and Computer Graphics 25, 1 (2019), 310–320.
 [KWG18] Kim B., Wattenberg M., Gilmer J., Cai C., Wexler J., Viegas F., Sayres R.: Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). 35th International Conference on Machine Learning (2018). arXiv:1711.11279.
 [LBH15] LeCun Y., Bengio Y., Hinton G.: Deep learning. Nature 521, 7553 (2015), 436–444.
 [LCJ18] Liu D., Cui W., Jin K., Guo Y., Qu H.: Deeptracker: Visualizing the training process of convolutional neural networks. ACM Transactions on Intelligent Systems and Technology (2018).
 [LSC18] Liu M., Shi J., Cao K., Zhu J., Liu S.: Analyzing the training processes of deep generative models. IEEE Transactions on Visualization and Computer Graphics 24, 1 (2018), 77–87.
 [LSL17] Liu M., Shi J., Li Z., Li C., Zhu J., Liu S.: Towards better analysis of deep convolutional neural networks. IEEE Transactions on Visualization and Computer Graphics 23, 1 (2017), 91–100.
 [MCZ17] Ming Y., Cao S., Zhang R., Li Z., Chen Y., Song Y., Qu H.: Understanding hidden memories of recurrent neural networks. arXiv preprint arXiv:1710.10777 (2017).
 [Mun14] Munzner T.: Visualization analysis and design. AK Peters/CRC Press, 2014.
 [PHVG18] Pezzotti N., Höllt T., Van Gemert J., Lelieveldt B. P., Eisemann E., Vilanova A.: Deepeyes: Progressive visual analytics for designing deep neural networks. IEEE Transactions on Visualization and Computer Graphics 24, 1 (2018), 98–108.
 [RAL17] Ren D., Amershi S., Lee B., Suh J., Williams J. D.: Squares: Supporting interactive performance analysis for multiclass classifiers. IEEE Transactions on Visualization and Computer Graphics 23, 1 (2017), 61–70.
 [RFFT17] Rauber P. E., Fadel S. G., Falcao A. X., Telea A. C.: Visualizing the hidden activity of artificial neural networks. IEEE Transactions on Visualization and Computer Graphics 23, 1 (2017), 101–110.
 [SCD17] Selvaraju R. R., Cogswell M., Das A., Vedantam R., Parikh D., Batra D.: Grad-CAM: Visual explanations from deep networks via gradient-based localization. In International Conference on Computer Vision (ICCV) (2017), IEEE, pp. 618–626.
 [SDBR14] Springenberg J. T., Dosovitskiy A., Brox T., Riedmiller M.: Striving for simplicity: The all convolutional net, 2014. arXiv:1412.6806.

 [SGPR18] Strobelt H., Gehrmann S., Pfister H., Rush A. M.: LSTMVis: A tool for visual analysis of hidden state dynamics in recurrent neural networks. IEEE Transactions on Visualization and Computer Graphics 24, 1 (2018), 667–676.
 [SKKC19] Sacha D., Kraus M., Keim D. A., Chen M.: VIS4ML: An ontology for visual analytics assisted machine learning. IEEE Transactions on Visualization and Computer Graphics 25, 1 (2019), 385–395.
 [SSP03] Simard P. Y., Steinkraus D., Platt J.: Best practices for convolutional neural networks applied to visual document analysis. In Seventh International Conference on Document Analysis and Recognition (August 2003), Institute of Electrical and Electronics Engineers, Inc.
 [TKC01] Tam G. K. L., Kothari V., Chen M.: An analysis of machine- and human-analytics in classification. IEEE Transactions on Visualization and Computer Graphics 23, 1 (2017), 71–80.
 [WGYS18] Wang J., Gou L., Yang H., Shen H.: Ganviz: A visual analytics approach to understand the adversarial game. IEEE Transactions on Visualization and Computer Graphics 24, 6 (2018), 1905–1917.
 [XRV17] Xiao H., Rasul K., Vollgraf R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017. arXiv:1708.07747.
 [YCFL15] Yosinski J., Clune J., Fuchs T., Lipson H.: Understanding neural networks through deep visualization. In Proceedings of the 32nd International Conference on Machine Learning (Lille, France, 2015).
 [ZF14] Zeiler M. D., Fergus R.: Visualizing and understanding convolutional networks. In Proc. 13th European Conference on Computer Vision. Springer, 2014, pp. 818–833.

 [ZKL16] Zhou B., Khosla A., Lapedriza A., Oliva A., Torralba A.: Learning deep features for discriminative localization. In Conference on Computer Vision and Pattern Recognition (CVPR) (2016), IEEE, pp. 2921–2929.
 [ZWM19] Zhang J., Wang Y., Molino P., Li L., Ebert D. S.: Manifold: A model-agnostic framework for interpretation and diagnosis of machine learning models. IEEE Transactions on Visualization and Computer Graphics 25, 1 (2019), 364–373.

 [ZWW18] Zhang L., Wu Y., Wu X.: Achieving non-discrimination in prediction. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI (Stockholm, Sweden, 2018), ijcai.org, pp. 3097–3103.
 [ZZ18] Zhang Q.-s., Zhu S.-c.: Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering 19, 1 (Jan 2018), 27–39. URL: http://dx.doi.org/10.1631/FITEE.1700808, doi:10.1631/fitee.1700808.