Learning Convolutional Neural Networks with Interactive Visualization.
Deep learning's great success motivates many practitioners and students to learn about this exciting technology. However, it is often challenging for beginners to take their first step due to the complexity of understanding and applying deep learning. We present CNN Explainer, an interactive visualization tool designed for non-experts to learn and examine convolutional neural networks (CNNs), a foundational deep learning model architecture. Our tool addresses key challenges that novices face while learning about CNNs, which we identify from interviews with instructors and a survey with past students. Users can interactively visualize and inspect the data transformation and flow of intermediate results in a CNN. CNN Explainer tightly integrates a model overview that summarizes a CNN's structure, and on-demand, dynamic visual explanation views that help users understand the underlying components of CNNs. Through smooth transitions across levels of abstraction, our tool enables users to inspect the interplay between low-level operations (e.g., mathematical computations) and high-level outcomes (e.g., class predictions). To better understand our tool's benefits, we conducted a qualitative user study, which shows that CNN Explainer can help users more easily understand the inner workings of CNNs, and is engaging and enjoyable to use. We also derive design lessons from our study. Developed using modern web technologies, CNN Explainer runs locally in users' web browsers without the need for installation or specialized hardware, broadening the public's education access to modern deep learning techniques.
This section provides a high-level overview of convolutional neural networks (CNNs) in the context of image classification, which will help ground our work throughout this paper.
Image classification has a long history in the machine learning research community. The objective of supervised image classification is to map an input image, $X$, to an output class, $Y$. For example, given a cat image, a sophisticated image classifier would output a class label of “cat”. CNNs have demonstrated state-of-the-art performance on this task, in part because of their multiple layers of computation that aim to learn a better representation of image data.
CNNs are composed of several different layers (e.g., convolutional layers, downsampling layers, and activation layers); each layer performs some predetermined function on its input data. Convolutional layers “extract features” to be used for image classification, with early convolutional layers in the network extracting low-level features (e.g., edges) and later layers extracting more complex semantic features (e.g., car headlights). Through a process called backpropagation, a CNN learns kernel weights and biases from a collection of input images. These values, also known as parameters, summarize important features within the images, regardless of their location. The kernel weights slide across an input image, computing a dot product with each window of the input (an element-wise multiplication followed by a sum), which yields intermediate results that are later summed together with the learned bias value. Each neuron then produces an output based on the input image; these outputs are also called activation maps. To decrease the number of parameters and help avoid overfitting, CNNs downsample inputs using another type of layer called pooling. Activation functions are used in a CNN to introduce non-linearity, which allows the model to learn more complex patterns in data. For example, a Rectified Linear Unit (ReLU) is defined as $\max(0, x)$, which outputs the positive part of its argument. Activation functions are also often used prior to the output layer to normalize classification scores; for example, the activation function called Softmax normalizes unscaled scalar values, known as logits, to yield output class scores that sum to one. To summarize, compared to classic image classification models that can be over-parameterized and fail to take advantage of inherent properties in image data, CNNs create spatially-aware representations through multiple stacked layers of computation.
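The layer operations described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the model's actual implementation: it assumes a single-channel input, stride 1, no padding, and non-overlapping pooling, and the function names and example values are hypothetical.

```python
def conv2d(image, kernel, bias=0.0):
    """Slide `kernel` over `image` (stride 1, no padding) and add `bias`."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Multiply the kernel with the window under it, element by element,
            # and sum the products (a dot product of the two flattened arrays).
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            row.append(total + bias)
        out.append(row)
    return out


def relu(feature_map):
    """Keep the positive part of every value: max(0, x)."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Downsample by taking the maximum of each non-overlapping window."""
    out = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[r + i][c + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(max(window))
        out.append(row)
    return out


# Hypothetical 5x5 input and a 3x3 vertical-edge kernel.
image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 2],
    [2, 1, 0, 1, 0],
    [0, 3, 1, 0, 1],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

pre_activation = conv2d(image, kernel, bias=0.5)  # 3x3 intermediate result
activation_map = relu(pre_activation)             # negatives clipped to zero
pooled = max_pool(activation_map, size=2)         # 1x1 after 2x2 pooling
```

In a real CNN each convolutional neuron sums such intermediate results over all of its input channels before adding the bias; the sketch keeps one channel to stay readable.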
Researchers and practitioners have been developing visualization systems that aim to help beginners learn about deep learning concepts. Teachable Machine [websterNowAnyoneCan2017] teaches basic concepts of machine learning classification, such as overfitting and underfitting, by allowing users to train a deep neural network classifier with data collected from their own webcam or microphone. The Deep Visualization Toolbox [yosinskiUnderstandingNeuralNetworks2015] also uses live webcam images to interactively help users understand what each neuron has learned. These deep learning educational tools feature direct model manipulation as core to their experience. For example, users learn about CNNs, dense neural networks, and GANs through experimenting with model training in ConvNetJS MNIST demo [karpathyConvNetJSMNISTDemo2016], TensorFlow Playground [smilkovDirectManipulationVisualizationDeep2017], and GAN Lab [kahngGANLabUnderstanding2019], respectively. Beyond 2D visualizations, Node-Link Visualization [harleyInteractiveNodeLinkVisualization2015a] and TensorSpace [TensorSpaceJsNeural2018] demonstrate deep learning models in 3D space.
Inspired by Chris Olah’s interactive blog posts [olahNeuralNetworksManifolds2014], interactive articles explaining deep learning models with interactive visualization are gaining popularity as an alternative medium for education [carterUsingArtificialIntelligence2017, madsenVisualizingMemorizationRNNs2019]. Most existing resources, including visualizations and articles, focus on explaining either the high-level model structures and model training process or the low-level mathematics, but not both. However, we found that one key challenge for beginners learning about deep learning models is that it is difficult to connect unfamiliar layer mechanisms with complex model structures (discussed in Sect. 4). Our work aims to address this lack of research in developing visual learning tools to help learners bridge these two views on deep learning.
Before deep learning started to attract interest from students and practitioners, visualization researchers had been studying how to design algorithm visualizations (AV) to help people learn about the dynamic behavior of various algorithms [hundhausenMetaStudyAlgorithmVisualization2002, shafferAlgorithmVisualizationState2010, brownAlgorithmAnimation1988]. These tools often graphically represent data structures and algorithms using interactive visualization and animations [gallesDataStructureVisualizations2006, guoOnlinePythonTutor2013, brownAlgorithmAnimation1988]. While researchers have found mixed results on AV’s effectiveness in computer science education [fouhRoleVisualizationComputer2012, grissomAlgorithmVisualizationCS2003, byrneEvaluatingAnimationsStudent1999], growing evidence has shown that student engagement is the key factor for successfully applying AV in education settings [napsExploringRoleVisualization2002, hundhausenUsingVisualizationsLearn2000]. Naps et al. defined a taxonomy of six levels of engagement (No Viewing, Viewing, Responding, Changing, Constructing, and Presenting) at which learners can interact with AVs [napsExploringRoleVisualization2002], and studies have shown that a higher engagement level leads to better learning outcomes [hansenDesigningEducationallyEffective2002, fouhRoleVisualizationComputer2012, schweitzerInteractiveVisualizationActive2007, kehoeRethinkingEvaluationAlgorithm2001].
Deep learning models can be viewed as specialized algorithms comprised of complex and stochastic interactions between multiple different computational layers. However, there has been little work in designing and evaluating visual educational tools for deep learning in the context of AV. CNN Explainer’s design draws inspiration from the guidelines proposed in AV literature (discussed in section 4); our user study results also corroborate some of the key findings in prior AV research (discussed in subsection 7.3). Our work advances AV’s landscape in covering modern and pervasive machine learning algorithms.
Many visual analytics tools have been developed to help deep learning experts analyze their models and predictions [hohmanVisualAnalyticsDeep2019, kahngActiVisVisualExploration2018a, liuBetterAnalysisDeep2017a, liuVisualizingHighDimensionalData2017, bilalConvolutionalNeuralNetworks2018]. These tools support many different tasks. For example, recent work such as Summit [hohmanSummitScalingDeep2020] uses interactive visualization to summarize what features a CNN model has learned and how those features interact and attribute to model predictions. LSTMVis [strobeltLSTMVisToolVisual2018] makes long short-term memory (LSTM) networks more interpretable by visualizing the model’s hidden states. Similarly, GANViz [wangGANVizVisualAnalytics2018] helps experts interpret what a trained generative adversarial network (GAN) model has learned. People also use visual analytics tools to diagnose and monitor the training process of deep learning models. Two examples, DGMTracker [liuAnalyzingTrainingProcesses2018] and DeepEyes [pezzottiDeepEyesProgressiveVisual2018], help developers better understand the training process of CNNs and GANs, respectively. Visual analytics tools have also recently been developed to help experts detect and interpret vulnerabilities in their deep learning models [liuAnalyzingNoiseRobustness2018a, dasMassifInteractiveInterpretation2020a]. These existing analytics tools are designed to assist experts in analyzing their models and predictions; in contrast, we focus on non-experts and learners, helping them more easily learn about deep learning concepts.
Our goal is to build an interactive visual learning tool to help students more easily understand the internal mechanisms of CNN models. To identify the learning challenges faced by the students, we conducted interviews with deep learning instructors and surveyed past students.
To inform our tool’s design, we recruited 4 instructors (2 female, 2 male) who have taught CNNs in a large university, and refer to them as T1-T4 throughout our discussion. One instructor teaches computer vision, and the other three teach deep learning. We interviewed the participants one-on-one in a conference room (3/4) and via video-conferencing software (1/4); each interview lasted around 30 minutes. The interviews were semi-structured, following an interview guide that listed the core questions. During the interviews, we ensured that all questions were addressed while leaving time to ask the interviewees follow-up questions. We aimed to learn (1) how instructors teach CNNs in a traditional classroom setting; and (2) what the key challenges are for instructors to teach and for students to learn about CNNs.
Student survey. After the interviews, we recruited students from a large university who have previously studied CNNs to fill out an online survey. We received 43 responses, and 19 of them (4 female, 15 male) met the criteria. Among 19 participants, 10 were Ph.D. students, 3 were M.S. students, 5 were undergraduates, and 1 was a faculty member. In the survey, participants were asked what were “the biggest challenges in studying CNNs” and “the most helpful features if there was a visualization tool for explaining CNNs to beginners”. We provided pre-selected options based on the prior instructor interviews, but participants could write down their own response if it was not included in the options. The aggregated results of this survey are shown in Figure 2.
Together with a literature review, we synthesized our findings from these two studies into the following five design challenges (C1-C5).
C1. Intricate model structure. CNN models consist of many layers, each having a different structure and underlying mathematical functions [lecunDeepLearning2015]. In addition, the connection between two layers is not always the same: some neurons are connected to every neuron in the previous layer, while some only connect to a single previous neuron. T2 said “It can be very hard for them [students with less knowledge of neural networks] to understand the structure of CNNs, you know, the connections between layers.”
C2. Complex layer operations. Different layers serve different purposes in CNNs [guRecentAdvancesConvolutional2018]. For example, convolutional layers exploit the spatially local correlations in inputs, as each convolutional neuron connects to only a small region of its input, whereas max pooling layers introduce regularization to prevent overfitting. T1 said, “The most challenging part is learning the math behind it [CNN model].” Many students also reported that CNN layer computations are the most challenging learning objective (Figure 2). To make CNNs perform better than other models in tasks like image classification, these models have complex and unique mathematical operations that many beginners may not have seen elsewhere.
C3. Connection between model structure and layer operation. Based on instructor interviews and the survey results from past students (Figure 2), one of the cruxes of understanding CNNs is understanding the interplay between low-level mathematical operations (C2) and the high-level model structure (C1). Smilkov et al., creators of the popular dense neural network learning tool TensorFlow Playground [smilkovDirectManipulationVisualizationDeep2017], also found this challenge key to learning about deep learning models: “It’s not trivial to translate the equations defining a deep network into a mental model of the underlying geometric transformations.” In other words, in addition to comprehending the mathematical formulas behind different layers, students are also required to understand how each operation works within the complex, layered model structure.
C4. Effective algorithm visualization (AV). The success of applying visualization to explain machine learning algorithms to beginners [websterNowAnyoneCan2017, smilkovDirectManipulationVisualizationDeep2017, kahngGANLabUnderstanding2019] suggests that an AV tool is a promising approach to help people more easily learn about CNNs. However, AV tools need to be carefully designed to be effective in helping users gain an understanding of algorithms [fouhRoleVisualizationComputer2012]. In particular, AV systems need to clearly explain the mapping between the algorithm and visual encoding [mayerAnimationsNeedNarrations1991], and actively engage users [kehoeRethinkingEvaluationAlgorithm2001].
C5. High barrier of entry for learning deep learning models. Most neural networks written in deep learning frameworks, such as TensorFlow [abadiTensorFlowSystemLargeScale2016] and PyTorch [paszkePyTorchImperativeStyle2019a], are typically trained and deployed on dedicated servers that use powerful hardware with GPUs. Can we make understanding CNNs more accessible without such resources, so that everyone has the opportunity to learn and interact with deep learning models?
Based on the identified design challenges (section 3), we distill the following key design goals (G1–G5) for CNN Explainer, an interactive visualization tool to help students more easily learn about CNNs.
G1. Visual summary of CNN models and data flow. Based on the survey results, showing the structure of CNNs is the most desired feature for a visual learning tool (Figure 2). Therefore, to give users an overview of the structure of CNNs, we aim to create a visual summary of a CNN model by visualizing all layer outputs and connections in one view. This could help users visually track how input image data are transformed to final class predictions through a series of layer operations (C1). (subsection 5.1)
G2. Interactive interface for mathematical formulas. Since CNNs employ various complex mathematical functions to achieve high classification performance, it is important for users to understand each mathematical operation in detail (C2). In response, we would like to design an interactive interface for each mathematical formula, enabling users to examine and better understand the inner workings of layers. (subsection 5.3)
G3. Fluid transition between different levels of abstraction. To help users connect low-level layer mathematical mechanisms to high-level model structure (C3), we would like to design a focus+context display of different views, and provide smooth transitions between them. By easily navigating through different levels of CNN model abstraction, users can get a holistic picture of how a CNN works. (subsection 5.4)
G4. Clear communication and engagement. Our goal is to design and develop an interactive system that is easy to understand and engaging to use so that it can help people more easily learn about CNNs (C4). We aim to accompany our visualizations with explanations to help users interpret the graphical representation of the CNN model (subsection 5.5), and we wish to actively engage people during the learning process through visualization customization. (subsection 5.6)
G5. Web-based implementation. To develop an interactive visual learning tool that is accessible for users without specialized computational resources (C5), we would like to use modern web browsers as the platform to explain the inner workings of a CNN model. We also open-source our code to support future research and development of deep learning educational tools. (subsection 5.7)
This section outlines CNN Explainer’s visualization techniques and interface design. The interface is built on our prior prototype [wangCNN101Interactive2020a]. We visualize the forward propagation, i.e., transforming an input image into a class prediction, of a pre-trained model (Figure 3). Users can explore a CNN at different levels of abstraction through the tightly integrated Overview (subsection 5.1), Elastic Explanation View (subsection 5.2), and the Interactive Formula View (subsection 5.3). Our tool allows users to smoothly transition between these views (subsection 5.4), provides explanations to help users interpret the visualization (subsection 5.5), and engages the user to test hypotheses through visualization customization (subsection 5.6). The system is targeted towards beginners and describes all mathematical operations necessary for a CNN to classify an image.
Color scales are used throughout the visualization to show the impact of weight, bias, and activation map values. Consistently in the interface, a red to blue color scale is used to visualize neuron activation maps as heatmaps, and a yellow to green color scale represents weights and biases. A persistent color scale legend is present across all of the views, so the user always has context for the displayed colors. We chose these distinct, diverging color scales with white representing zero, so that a user can easily differentiate between varying activation and weight values. We group layers in the Tiny VGG model, our CNN architecture, into four units and two modules (Figure 3). Each unit starts with one convolutional layer. Both modules are identical and contain the same sequence of operations and hyperparameters. To analyze neuron activations throughout the network with varying contexts, users can alter the range of the heatmap color scale (subsection 5.6).
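A diverging color scale with white at zero can be sketched as follows. This is an illustrative approximation only: the exact colors, the interpolation, and which end of the red to blue scale encodes negative values are assumptions, and `diverging_color` and `max_abs_over` are hypothetical helper names.

```python
def diverging_color(value, max_abs, negative=(255, 0, 0), positive=(0, 0, 255)):
    """Map `value` in [-max_abs, max_abs] to an RGB triple, with white at zero.

    Assumption: negative values blend toward red and positive toward blue.
    """
    if max_abs <= 0:
        return (255, 255, 255)
    # Clamp to the range of the scale, then choose the end color by sign.
    t = max(-1.0, min(1.0, value / max_abs))
    end = positive if t >= 0 else negative
    strength = abs(t)
    # Linear blend from white (strength 0) to the end color (strength 1).
    return tuple(round(255 + (channel - 255) * strength) for channel in end)


def max_abs_over(activation_maps):
    """Largest absolute activation across a group of 2D activation maps."""
    return max(abs(v) for m in activation_maps for row in m for v in row)


# Two hypothetical activation maps; grouping them changes the scale's range.
map_a = [[0.0, 0.5], [-0.25, 1.0]]
map_b = [[4.0, -2.0], [0.0, 1.0]]

local_range = max_abs_over([map_a])          # scale fitted to one map
shared_range = max_abs_over([map_a, map_b])  # scale shared by both maps
```

With the shared range, the value 1.0 in `map_a` is drawn much paler than with the local range, which is why the choice of range affects how activations compare across layers.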
The Overview (Figure 1A, Figure 4A) is the opening view of CNN Explainer. This view represents the high-level structure of a CNN: neurons grouped into layers with distinct, sequential operations. It shows neuron activation maps for all layers represented as color-encoded heatmaps with a diverging red to blue color scale. Neurons in consecutive layers are connected with edges, which connect each neuron to its inputs; to see these edges, users can simply hover over any activation map. In the model, convolutional and output neurons are fully connected to the previous layer, while all other neurons are only connected to one neuron in the previous layer.
The Elastic Explanation View visualizes the computation that leads to an output without overwhelming users with low-level mathematical operations. In CNN Explainer, there are two elastic views, namely the Convolutional Elastic Explanation View (Figure 1B) and the Flatten Elastic Explanation View (Figure 4B).
Explaining the Convolutional Layer (Figure 1B). The Convolutional Elastic Explanation View is entered when a user clicks a convolutional neuron from the Overview. This view applies a convolution on each input node of the selected neuron, visualized by a kernel sliding across the input neurons, which yields an intermediate result for each input neuron. This sliding kernel forms the output heatmap during the animation, which imitates the internal process during a convolution operation. While the sliding kernel animation is in progress, the edges in this view are represented as flowing-dashed lines; upon the animation’s completion, the edges transition to solid lines.
Explaining the Flatten Layer (Figure 4B). The Flatten Elastic Explanation View visualizes the operation of transforming an n-dimensional tensor into a 1-dimensional tensor, which is shown when a user clicks an output neuron. This flattening operation is often necessary in a CNN prior to classification so that the fully-connected output layer can make classification decisions. The view uses the pixel’s color from the previous layer to encode the flatten layer’s neuron as a short colored line. Edges connect each flatten layer neuron with its source component and intermediate result. These edges are colored based on the model’s weight value. Users can hover over any component of this connection to highlight the associated edges as well as the flatten layer’s neuron and the pixel value from the previous layer.
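As a small illustration of the flatten operation, the sketch below unwraps one 2-dimensional activation map row by row and shows which flat position each pixel lands on. Row-major order is an assumption here, and how the portions from several activation maps are arranged in the full flatten layer depends on the framework's memory layout; the helper names are hypothetical.

```python
def flatten_map(activation_map):
    """Unwrap a 2D activation map into a 1D list, row by row."""
    return [value for row in activation_map for value in row]


def flat_index(row, col, width):
    """Position of pixel (row, col) in the row-major flattened list."""
    return row * width + col


# Hypothetical 2x3 activation map.
activation_map = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
]

flat = flatten_map(activation_map)
position = flat_index(row=1, col=2, width=3)  # pixel in the last row and column
```

Each flattened value keeps a one-to-one link to its source pixel, which is the link the view draws as an edge.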
The Interactive Formula View consists of four variations designed for convolutional layers, ReLU activation layers, pooling layers, and the softmax activation function. After users have built up a mental model of the CNN model structure from the previous Overview and Elastic Explanation Views, these four views demonstrate the detailed mathematics occurring in each layer.
Explaining Convolution, ReLU Activation, and Pooling (Figure 5A, B, C). Each view animates the window-sliding operation on the input matrix and output matrix over an interval, so that the user can understand how each element in the input is connected to the output, and vice versa. In addition, the user can interact with these matrices by hovering over the heatmaps to control the position of the sliding window. For example, in the Convolutional Interactive Formula View (Figure 5A), as the user controls the kernel position in either the input or the output matrix, this view visualizes the dot-product formula with input numbers and kernel weights directly extracted from the current kernel. This synchronization between the input, the output, and the mathematical function enables the user to better understand how the kernel convolves a matrix in convolutional layers.
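The link between a hovered output cell and its dot-product formula can be sketched as below. This is not the tool's code: it assumes a single input channel and no padding, and `convolution_terms` and `format_formula` are hypothetical names.

```python
def convolution_terms(image, kernel, out_row, out_col, stride=1):
    """Return the (input value, kernel weight) pairs behind one output cell."""
    k = len(kernel)
    # The output cell (out_row, out_col) corresponds to the input window
    # whose top-left corner is at (out_row * stride, out_col * stride).
    top, left = out_row * stride, out_col * stride
    return [
        (image[top + i][left + j], kernel[i][j])
        for i in range(k)
        for j in range(k)
    ]


def format_formula(terms, bias=0):
    """Render the dot product as text, one product per input/weight pair."""
    products = " + ".join(f"({x} * {w})" for x, w in terms)
    total = sum(x * w for x, w in terms) + bias
    return f"{products} + {bias} = {total}"


# Hypothetical 3x3 input and 2x2 kernel.
image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
kernel = [
    [1, 0],
    [0, -1],
]

terms = convolution_terms(image, kernel, out_row=1, out_col=1)
value = sum(x * w for x, w in terms)
formula = format_formula(terms)
```

Moving the hovered cell changes only `out_row` and `out_col`; the kernel weights stay fixed while the input values in the formula update.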
Explaining the Softmax Activation (Figure 5D). This view outlines the operations necessary to calculate the classification score. It is accessible from the Flatten Elastic Explanation View to explain how the results (logits) from the previous view lead to the final classification. The view consists of animated logit values encoded as circles colored with a light orange to dark orange color scale to provide users with a visual cue of the importance of every class. This view also includes a corresponding equation, which explains how the classification score is computed. As a user hovers over a logit circle, the corresponding value will be highlighted in the equation along with the logit circle itself, so the user can understand the impact that every logit has on the output of the softmax function. This interactivity between the logit circles and the mathematical equation is bidirectional, so the user may also hover over elements of the equation, which will highlight the appropriate logit circle. Interacting with both the logit circles and the mathematical equation in combination allows a user to discern the impact that every logit has on each classification score in the output layer.
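For reference, the softmax computation that this view explains can be written as a short function. The sketch uses the standard definition, in which each class score is the exponential of its logit divided by the sum of the exponentials of all logits; subtracting the maximum logit first is a common numerical-stability step that the text does not mention, and the example logits are made up.

```python
import math


def softmax(logits):
    """Normalize logits into class scores that are positive and sum to one."""
    # Subtracting the largest logit leaves the result unchanged
    # but keeps math.exp from overflowing on large inputs.
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three classes.
logits = [2.0, 1.0, 0.1]
scores = softmax(logits)
```

Because every logit appears in the denominator, raising one logit lowers the scores of all other classes, which is the dependency the linked highlighting is meant to expose.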
The Overview is the starting state of CNN Explainer and shows the model architecture. From this high-level view, the user can begin inspecting layers, connectivity, and classifications, and tracing activations of neurons through the model. When a user is interested in more detail, they can click on neuron activation maps in the visualization. Neurons in a layer that have simple one-to-one connections to a neuron in the previous layer do not require an auxiliary Elastic Explanation View, so upon clicking one of these neurons, a user will be able to enter the Interactive Formula View to understand the low-level operation that a tensor undergoes at that layer. If a neuron has more complex connectivity, then the user will enter an Elastic Explanation View after clicking a neuron from the Overview. In this view, CNN Explainer uses visualization and annotations before displaying mathematics. Through further interaction, a user can hover and click on parts of the Elastic Explanation View to uncover the mathematical operations as well as the values, weights, and biases used by the network for classification. This low-level Interactive Formula View is only shown after transitioning through the previous two views, with progressively more detail about how CNNs transform input data.
CNN Explainer is accompanied by an interactive tutorial article beneath the interface that explains CNNs, layer functions, and hyperparameters, and outlines the tool’s interactive features. Learners can read freely, or jump to specific sections by clicking layer names or the info buttons (Figure 5) from the main visualization. The article provides beginner users detailed information regarding CNNs that can supplement their exploration of the visualization.
Additionally, text annotations are placed throughout the visualization (e.g., explaining the flatten layer operation in the right image), which further guide users and explain concepts that are not easily discernible from the visualization alone. These annotations help users map the underlying algorithm to its visual encoding.
The Control Panel located across the top of the visualization (Figure 1) allows the user to alter the CNN input image and edit the overall representation of the network. The Hyperparameter Widget (Figure 6) enables the user to experiment with different convolution hyperparameters.
Change input image. Users can either (1) choose between preloaded input images for each output class, or (2) upload their own custom image. Preloaded images allow a user to easily access data from the classes the model was originally trained on. Users can also freely upload any image for classification into the ten classes the network was trained on. There is no limit on the size of the uploaded image: CNN Explainer resizes and crops a central square of any image uploaded, so that the image matches network dimensions. Supporting custom image upload allows users to analyze the network’s classification decisions on diverse image inputs, interactively testing their own hypotheses.
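The central-square crop can be illustrated with a short helper that computes the crop box for an uploaded image of arbitrary size. This is a sketch under assumptions: the text does not specify the order of cropping and resizing or the rounding used, and `center_square_box` is a hypothetical name.

```python
def center_square_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    # Split the leftover pixels evenly on both sides of the longer dimension.
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


landscape_box = center_square_box(640, 480)
portrait_box = center_square_box(100, 300)
```

The cropped square would then be scaled to the network's input resolution, so images of any aspect ratio end up with the same dimensions.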
Show network details. A user can toggle the “Show detail” button, which displays additional network specifications in the Overview. When toggled on, the Overview will reveal layer dimensions under the layer names and show color scale legends. Additionally, a user can vary the activation map color scale range. The CNN architecture presented by CNN Explainer is grouped into four units and two modules (Figure 3). Using the drop-down menu in the Control Panel, a user can adjust the color scale range used by the visualization to investigate activations with different groupings.
Explore hyperparameter impact. The tutorial article (subsection 5.5) includes an interactive Hyperparameter Widget that allows users to experiment with convolutional hyperparameters (Figure 6). Users can adjust the input and hyperparameters of the stand-alone visualization to test how different hyperparameters change the sliding convolutional kernel and the output’s dimensions. This interactive element emphasizes learning through experimentation by supplementing knowledge gained from reading the article and using the main visualization.
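The relationship the widget lets users explore follows the standard convolution arithmetic: for input size n, kernel size k, padding p, and stride s, the output size is floor((n + 2p - k) / s) + 1. A minimal sketch for one spatial dimension, with hypothetical function names:

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    span = input_size + 2 * padding - kernel_size
    if span < 0:
        raise ValueError("kernel is larger than the padded input")
    return span // stride + 1


def kernel_positions(input_size, kernel_size, stride=1, padding=0):
    """Start offsets of the sliding kernel, in unpadded input coordinates."""
    count = conv_output_size(input_size, kernel_size, stride, padding)
    # A negative offset means the kernel starts inside the padding.
    return [i * stride - padding for i in range(count)]


default_size = conv_output_size(5, 3)           # stride 1, no padding
strided_size = conv_output_size(5, 3, stride=2)
padded_size = conv_output_size(5, 3, padding=1)
strided_positions = kernel_positions(5, 3, stride=2)
```

A larger stride makes the kernel skip positions and shrinks the output, while padding adds positions at the border and can keep the output the same size as the input.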
CNN Explainer is a web-based, open-sourced visualization tool to teach students the foundations of CNNs. A new user only needs a modern web browser to access our tool, with no installation required. Additionally, other datasets and linear models can be quickly applied to our visualization system due to our robust implementation.
Model Training. The CNN architecture, Tiny VGG (Figure 3), presented by CNN Explainer for image classification is inspired by both the popular deep learning architecture, VGGNet [simonyanVeryDeepConvolutional2015], and Stanford’s CS231n course notes [karpathyCS231nConvolutionalNeural2016]. It is trained on the Tiny ImageNet dataset [TinyImageNetVisual2015]. The training dataset consists of 200 image classes and contains 100,000 64×64 RGB images, while the validation dataset contains 10,000 images across the 200 image classes. The model is trained using TensorFlow [abadiTensorFlowSystemLargeScale2016] on 10 handpicked, everyday classes. During the training process, the batch size and learning rate are fine-tuned using a 5-fold cross-validation scheme. This simple model achieves a 70.8% top-1 accuracy on the validation dataset.
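To make the architecture concrete, the sketch below traces tensor shapes through a model with two identical modules of two convolutional units each, followed by pooling, flattening, and a 10-class output. The 3×3 kernels, 10 filters per layer, absence of padding, and 2×2 pooling are assumptions, chosen because they reproduce the 13×13×10 shape that the usage scenario reports for max_pool_2 given a 64×64 RGB input. Apart from max_pool_2, the layer names are hypothetical.

```python
def trace_shapes(input_size=64, channels=3, filters=10, kernel=3, pool=2, classes=10):
    """List (layer name, output shape) pairs for a Tiny-VGG-like model."""
    shapes = [("input", (input_size, input_size, channels))]
    size = input_size
    for module in (1, 2):
        for unit in (1, 2):
            # Unpadded convolution with stride 1 shrinks each side by kernel - 1.
            size = size - kernel + 1
            shapes.append((f"conv_{module}_{unit}", (size, size, filters)))
            # ReLU is applied element-wise, so the shape is unchanged.
            shapes.append((f"relu_{module}_{unit}", (size, size, filters)))
        # Non-overlapping pooling divides each side by the pool size.
        size = size // pool
        shapes.append((f"max_pool_{module}", (size, size, filters)))
    shapes.append(("flatten", (size * size * filters,)))
    shapes.append(("output", (classes,)))
    return shapes


shapes = dict(trace_shapes())
for name, shape in shapes.items():
    print(f"{name:12s} {shape}")
```

Under these assumptions the spatial size goes 64, 62, 60, 30, 28, 26, 13, and the flatten layer holds 13 × 13 × 10 = 1690 values feeding the 10 output neurons.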
Front-end Visualization. CNN Explainer loads the pre-trained Tiny VGG model and computes forward propagation results in real time in a user’s web browser using TensorFlow.js [smilkovTensorFlowJsMachine2019]. These results are visualized using D3.js [bostockDataDrivenDocuments2011] throughout the multiple interactive views.
Janis is a virology researcher using CNNs in a current project. Through an online deep learning course she has a general understanding of the goals of applying CNNs, and some basic knowledge of different types of CNN layers, but she needs help filling in some gaps in knowledge. Interested in learning how a 3-dimensional input (RGB image) leads to a 1-dimensional output (vector of class probabilities) in a CNN, Janis begins exploring the architecture from the Overview.
After clicking the “Show detail” button, Janis notices that the output layer is a 1-dimensional tensor of size 10, while max_pool_2, the previous layer, is a 3-dimensional (13×13×10) tensor. Confused, she hovers over a neuron in the output layer to inspect connections between the final two layers of the architecture: the max_pool_2 layer has 10 neurons; the output layer has 10 neurons each representing a class label, and the output layer is fully-connected to the max_pool_2 layer. She clicks that output neuron, which causes a transition from the Overview (Figure 4A) to the Flatten Elastic Explanation View (Figure 4B). She notices that edges between these two layers intersect a 1-dimensional flatten layer and pass through a softmax function. By hovering over pixels from the activation map, Janis understands how the 2-dimensional matrix is “unwrapped” to yield a portion of the 1-dimensional flatten layer. To confirm her assumptions, she clicks the flatten layer name, which directs her to an explanation in the tutorial article underneath the visualization explaining the specifics of the flatten layer. As she continues to follow the edge after the flatten layer, she clicks the softmax button, which leads her to the Softmax Interactive Formula View (Figure 4C). She learns how the outputs of the flatten layer are normalized by observing the equation linked with logits through animations. Janis recognizes that her previous coursework has not taught these “hidden” operations prior to the output layer, which flatten and normalize the output of the max_pool_2 layer. CNN Explainer helps Janis learn these often-overlooked operations through a hierarchy of interactive views that expose increasingly more detail as they are explored. She now feels more equipped to apply CNNs to her virology research.
A university professor, Damian, is currently teaching a computer vision class which covers CNNs. Damian begins his lecture with standard slides. After describing the theory of convolutions, he opens CNN Explainer to demonstrate the convolution operation working inside a full CNN for image classification. With CNN Explainer projected to the class, Damian transitions from the Overview (Figure 1A) to the Convolutional Elastic Explanation View (Figure 1B). Damian encourages the class to interpret the sliding window animation (Figure 1B) as it generates several intermediate results. He then asks the class to predict kernel weights in a specific neuron. To test students’ hypotheses, Damian enters the Convolutional Interactive Formula View (Figure 1C) to display the convolution operation with the true kernel weights. In this view, he can hover over the input and output matrices to answer questions from the class, and display computations behind the operation.
Recalling the theory, a student asks how altering the stride hyperparameter affects the animated sliding window in convolutional layers. To illustrate the impact of alternative hyperparameters, Damian scrolls down to the “Convolutional Layer” section of the complementary article, where he experiments by adjusting stride and other hyperparameters with the Hyperparameter Widget (Figure 6) in front of the class. With the help of CNN Explainer, students gain a better understanding of the convolution operation and the effect of its different hyperparameters by the end of the lecture. To reinforce the concepts and encourage individual experimentation, Damian provides the class with a URL to the web-based CNN Explainer so that students can return to it in the future.
We conducted an observational study to investigate how CNN Explainer’s target users (e.g., aspiring deep learning students) would use this tool to learn about CNNs, and also to test the tool’s usability.
CNN Explainer is designed for deep learning beginners who are interested in learning CNNs. In this study, we aimed to recruit participants who aspire to learn about CNNs and have some knowledge of basic machine learning concepts (e.g., knowing what an image classifier is). We recruited 16 student participants from a large university (4 female, 12 male) through internal mailing lists (e.g., machine learning and computer science Ph.D., M.S., and undergraduate students). Seven participants were Ph.D. students, seven were M.S. students, and the other two were undergraduates. All participants were interested in learning CNNs, and none of them had known CNN Explainer before. Participants self-reported their level of knowledge on non-neural network machine learning techniques, with an average score of 3.26 on a scale of 0 to 5 (0 being “no knowledge” and 5 being “expert”), and an average score of 2.06 on CNNs (on the same scale). No participant self-reported a score of 5 for their knowledge on CNNs, and one participant had a score of 0. To help better organize our discussion, we refer to participants with a CNN knowledge score of 0, 1 or 2 as B1-B11, where “B” stands for “Beginner”, and those with a score of 3 or 4 as K1-K5, where “K” stands for “Knowledgeable”.
We conducted this study with participants one-on-one via video-conferencing software. With the permission of all participants, we recorded the participants’ audio and computer screen for subsequent analysis. After participants signed consent forms, we provided them a 5-minute overview of CNNs, followed by a 3-minute tutorial of CNN Explainer. Participants then freely explored our tool in their computer’s web browser. We also provided a feature checklist, which outlined the main features of our tool and encouraged participants to try as many features as they could. During the study, participants were asked to think aloud and share their computer screen with us; they were encouraged to ask questions when necessary. Each session ended with a usability questionnaire coupled with an exit interview that asked participants about their process of using CNN Explainer, and whether this tool could be helpful for them. Each study lasted around 50 minutes, and we compensated each participant with a $10 Amazon gift card.
The exit questionnaire included a series of 7-point Likert-scale questions about the utility and usefulness of different views in CNN Explainer (Figure 7). All average Likert ratings were above 6 except the rating of “easy to understand”. From the high ratings reported and our observations, participants found our tool easy to use and understand, retained a high engagement level during their session, and eventually gained a better understanding of CNN concepts. Our observations also reflect key findings in previous research in algorithm visualizations (AV) [fouhRoleVisualizationComputer2012, kehoeRethinkingEvaluationAlgorithm2001]. This section describes design lessons and limitations of our tool distilled from this study.
Several participants (9/16) commented that they liked how our tool visualizes both high-level CNN structure and explains low-level mathematical operations on-demand. This feature enables them to better understand the interplay between low-level layer computations and the overall CNN data transformation—one of the key challenges for understanding CNN concepts, as we identified from our instructor interviews and our student survey. For example, initially K4 was confused to see the Convolutional Elastic Explanation View, but after reading the annotation text, he remarked, “Oh, I understand what an intermediate layer is now—you run the convolution on the image, then you add all those results to get this.” After exploring the Convolutional Interactive Formula View, he immediately noted, “Every single aspect of the convolution layer is shown here. [This] is super helpful.” Similarly, B5 commented, “Good to see the big picture at once and the transition to different views […] I like that I can hide details of a unit in a compact way and expand it when [needed].”
CNN Explainer employs the fisheye view technique for presenting the Elastic Explanation Views (Figure 1B, Figure 4B): after transitioning from the Overview to a specific layer, neighboring layers are still shown while further layers (lower degree-of-interest) have lower opacity. Participants found this transition design helpful for them to learn layer-specific details while having CNN structural context in the background. For instance, K5 said “I can focus on the current layer but still know the same operation goes on for other layers.” Our observations from this study suggest that our fluid transition design between different levels of abstraction can help users to better connect unfamiliar layer mechanisms to the complex model structure.
Another favorite feature of CNN Explainer that participants mentioned was the use of animations, which received the highest rating in the exit questionnaire (Figure 7). In our tool, animations serve two purposes: to help users assimilate the relationship between different visual components and to help illustrate the model’s underlying operations.
Transition animation. Layer movement is animated during view transitions. We noticed that this helped participants stay aware of the different views, and all participants navigated through the views naturally. In addition to assisting with understanding the relationship between distinct views, animation also helped them discover the links between different visualization elements. For example, B8 quickly found that the logit circle is linked to its corresponding value in the formula, when she saw the circle-number pairs appear one-by-one with animation in the Softmax Interactive Formula View (Figure 4C).
Algorithm animation. Animations that simulate the model’s inner workings helped participants learn underlying operations by validating their hypotheses. In the Convolutional Elastic Explanation View (Figure 1B), we animate a small rectangle sliding through one matrix to mimic the CNN’s internal sliding window. We noticed many participants had their attention drawn to this animation when they first transitioned into the Convolutional Elastic Explanation View. However, they did not report that they understood the convolution operation until interacting with other features, such as reading the annotation text or transitioning to the Convolutional Interactive Formula View (Figure 1C). Some participants went back to watch the animation multiple times and commented that it made sense; for example, K5 said “Very helpful to see how the image builds as the window slides through,” but others, such as B9, remarked, “It is not easy to understand [convolution] using only animation.” Therefore, we hypothesize that this animation can indirectly help users to learn about the convolution algorithm by validating their newly formed mental models of how specific operations behave. To test this hypothesis, a rigorous controlled experiment would be needed. Related research on the effect of animation in computer science education also found that algorithm animation does not automatically improve learning, but it may lead learners to make predictions of the algorithm’s behavior, which in turn helps learning [byrneEvaluatingAnimationsStudent1999].
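The sliding-window operation that this animation mimics can be written out directly. The following is a minimal sketch, assuming a square single-channel input, a square kernel, no padding, and no bias term; the function name `conv2d` and the example values are hypothetical and are not taken from the tool's source code. As in most deep learning frameworks, the kernel is applied without flipping.

```python
def conv2d(image, kernel, stride=1):
    """Slide a square kernel over a square image and return the output matrix."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    output = []
    for i in range(out_size):          # vertical position of the window
        row = []
        for j in range(out_size):      # horizontal position of the window
            total = 0.0
            for di in range(k):
                for dj in range(k):
                    # Multiply each pixel under the window by the matching weight.
                    total += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
print(conv2d(image, kernel))  # [[-4.0, -4.0], [-4.0, -4.0]]
```

Each output value corresponds to one position of the animated rectangle: the top-left output, for example, is 1·1 + 2·0 + 4·0 + 5·(−1) = −4.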
Engagement and enjoyable experience. Moreover, we found animations helped to increase participants’ engagement level (e.g., spending more time and effort) and made CNN Explainer more enjoyable to use. In the study, many participants repeatedly played and viewed different animations. For example, K2 replayed the window sliding animation multiple times: “The is very well-animated […] I always love smooth animations.” B7 also attributed his enjoyable experience with our tool to animations: “[The tool is] enjoyable to use […] I especially like the lovely animation.”
CNN Explainer allows users to modify the visualization. For example, users can change the input image or upload their own image for classification; CNN Explainer then visualizes the new prediction with the new activation maps in every layer. Similarly, users can interactively explore how hyperparameters affect the convolution operation (Figure 6).
Hypothesis testing. In this study, many participants used visualization customization to test their predictions of model behaviors. For example, through inspecting the input layer in the Overview, B4 learned that the input layer comprised multiple different image channels (e.g., red, green, and blue). He changed the input image to a red bell pepper from Tiny ImageNet (shown on the right) and expected to see high values in the input red channel: “If I click the red image, I would see…” After the updated visualization showed what he predicted, he said “Right, it makes sense.” We found the Hyperparameter Widget also allowed participants to test their hypotheses. While reading the description of convolution hyperparameters in the tutorial article, K3 noted “Wait, then sometimes they won’t work”. He then modified the hyperparameters in the Hyperparameter Widget and noticed some combinations indeed did not yield a valid operation output: “It won’t be able to slide, because the stride and kernel size don’t fit the matrix”.
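K3's observation follows from the standard relation between input size, kernel size, padding, and stride. The sketch below assumes a square input and treats a combination as invalid whenever the window cannot cover the padded input in whole strides; this strict rule and the function name `conv_output_size` are assumptions for illustration (many frameworks instead round down and ignore the leftover border), not a description of the widget's code.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Return the output width of a convolution, or None if the window does not fit."""
    span = input_size + 2 * padding - kernel_size
    if kernel_size <= 0 or stride <= 0 or span < 0 or span % stride != 0:
        return None  # the window cannot slide evenly across the padded input
    return span // stride + 1

print(conv_output_size(5, 3, stride=1))  # 3
print(conv_output_size(5, 3, stride=2))  # 2
print(conv_output_size(5, 2, stride=2))  # None: stride and kernel size do not fit
```

With a 5×5 input, a 2×2 kernel, and a stride of 2, the window would have to move 3 cells in steps of 2, so the last step overshoots the matrix, which matches the kind of combination K3 found.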
Engagement. Participants were intrigued to modify the visualization, and their engagement sparked further interest in learning CNNs. In the study, B6 spent a large amount of time testing the CNN’s behavior on edge cases by finding “difficult” images online. He searched with keywords “koala”, “koala in a car”, “bell pepper pizza”, and eventually found a bell pepper pizza photo (shown on the right; photo by Jennifer Laughlin, used with permission). Our CNN model predicted the image as with a probability of and with a probability of . He commented, “The model is not robust […] oh, the ladybug [’s high softmax score] might come from the red dot.” Another participant, B5, uploaded his own photo as a new input image for the CNN model. After seeing his picture being classified as , B5 started to use our tool to explore the reason for this classification. He traced back the activation maps of neurons and the intermediate convolutional results from later layers to the early layers. He also asked us how experts interpret CNNs and said he would be interested in learning more about deep learning interpretability.
While we found CNN Explainer provided participants with an engaging and enjoyable learning experience and helped them to more easily learn about CNNs, we also noticed some potential improvements to our current system design from this study.
Beginners need more guidance. We found that participants with less knowledge of CNNs needed more instructions to begin using CNN Explainer. Some participants reported that the visual representation of the CNN and animation initially were not easy to understand, but the tutorial article and text annotation in the visualization greatly helped them to interpret the visualization. B8 skimmed through the tutorial article before interacting with the main visualization. She said, “After going through the article, I think I will be able to use the tool better […] I think the article is good, for beginner users especially.” B2 appreciated the ability to jump to a certain section in the article by clicking the layer name in the visualization, and he suggested that we “include a step-by-step tutorial for first time users […] There was too much information, and I didn’t know where to click at the beginning”. Therefore, we believe adding more text annotation and having a step-by-step tutorial mode could help users who are less familiar with CNNs to better understand the relationship between CNN operations and their visual representations.
Limited explanation of why CNN works. Some participants, especially those less experienced with CNNs, were interested in learning why the CNN architecture works in addition to learning how a CNN model makes predictions. For example, B7 asked “Why do we need ReLU?” when he was learning the formula of the ReLU function. B5 understood what a Max Pooling layer’s operation does but was unclear why it contributes to CNN’s performance: “It is counter-intuitive that Max Pooling reduces the [representation] size but makes the model better.” Similarly, B6 commented on the Max Pooling layer: “Why not take the minimum value? […] I know how to compute them [layers], but I don’t know why we compute them.” Even though it is still an open question why CNNs work so well for various applications [guRecentAdvancesConvolutional2018, zhangUnderstandingDeepLearning2017a], there are some commonly accepted “intuitions” of how different layers help this model class succeed. We briefly explain them in the tutorial article: for example, the ReLU function is used to introduce non-linearity in the model. However, we believe it is worth designing visualizations that help users to learn about these concepts. For example, allowing users to change the ReLU activation function to a linear function, and then visualizing the new model predictions, may help users gain understanding of why non-linear activation functions are needed in CNNs.
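The intuition behind B7's question can be shown with a toy example. The sketch below assumes a hypothetical two-layer "network" with one scalar weight per layer and no bias (the names `two_layer`, `w1`, and `w2` are illustrative only): with a linear activation, the two layers collapse into a single linear layer, whereas with ReLU they do not.

```python
def relu(x):
    """ReLU activation: pass positive values through, clamp negatives to zero."""
    return max(0.0, x)

def identity(x):
    """Linear activation, standing in for 'ReLU replaced by a linear function'."""
    return x

def two_layer(x, w1, w2, activation):
    """A toy two-layer network with scalar weights and no bias."""
    return w2 * activation(w1 * x)

w1, w2 = 2.0, -3.0

# With a linear activation, stacking layers adds no expressive power:
# the result equals one linear layer with weight w1 * w2.
for x in (-1.0, 0.5, 2.0):
    assert two_layer(x, w1, w2, identity) == (w1 * w2) * x

# With ReLU, the network is no longer linear: f(a + b) differs from f(a) + f(b).
a, b = 1.0, -1.0
combined = two_layer(a + b, w1, w2, relu)
separate = two_layer(a, w1, w2, relu) + two_layer(b, w1, w2, relu)
print(combined, separate)
```

Here `combined` is zero while `separate` is −6, so no single linear layer can reproduce the ReLU network, which is one common way to motivate non-linear activations.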
Explaining training process and CNN architecture intuitions. CNN Explainer helps users to learn how a pre-trained CNN model transforms the input image data into a class prediction. As we identified from two preliminary studies and an observational study, students are also interested in learning about the training process for CNNs, including technical approaches such as cross-validation and backpropagation. We plan to work with instructors and students to design and develop new visualizations to address these extensions.
Generalizing to other neural network models. Our observational study demonstrated that supporting users to transition between different levels of abstraction helps them more easily understand the interplay between low-level layer operations and high-level model structure. Other neural network models, such as Long short-term memory models (LSTM) [hochreiterLongShortTermMemory1997] and Transformer models [vaswaniAttentionAllYou2017], also require learners to understand the intricate layer operations in the context of a complex network structure. Therefore, our design can be adopted for explaining other neural network models to beginners.
Integrating algorithm visualization best practices. Existing work has studied how to design effective visualizations to help students learn algorithms. CNN Explainer applies two key design principles from AV: visualizations with explanations and customizable visualizations (4). However, there are many other AV design practices that future researchers can integrate in educational deep learning tools, such as giving interactive “pop quizzes” during the visualization process [napsJHAVEEnvironmentActively2000] and encouraging users to build their own visualizations [staskoUsingStudentbuiltAlgorithm1997].
Quantitative evaluation of educational effectiveness. We conducted a qualitative observational study to evaluate the usefulness and usability of CNN Explainer. We would like to further conduct quantitative user studies with a before-and-after knowledge quiz to compare the educational benefits of our tool with those of traditional educational mediums such as textbooks and lecture videos. It would be particularly valuable to investigate the educational effectiveness of visualization systems that explain deep learning concepts to beginners.
As deep learning is increasingly used throughout our everyday life, it is important to help learners take the first step toward understanding this promising yet complex technology. In this work, we present CNN Explainer, an interactive visualization system designed for non-experts to more easily learn about CNNs. Our tool runs in modern web browsers and is open-sourced, broadening the public’s education access to modern AI techniques. We discussed design lessons learned from our iterative design process and an observational user study. We hope our work will inspire further research and development of visualization tools that help democratize and lower the barrier to understanding and appropriately applying AI technologies.