Net2Vis: Transforming Deep Convolutional Networks into Publication-Ready Visualizations

02/11/2019 · Alex Bäuerle et al., Universität Ulm

To properly convey neural network architectures in publications, appropriate visualization techniques are of great importance. While most current deep learning papers contain such visualizations, these are usually handcrafted, which results in a lack of a common visual grammar as well as a significant time investment. Since these visualizations are often crafted just before publication, they are also prone to contain errors, might deviate from the actual architecture, and are sometimes ambiguous to interpret. Current automatic network visualization toolkits focus on debugging the network itself and are therefore not ideal for generating publication-ready visualizations, as they cater to a different level of detail. We therefore present an approach that automates this process by translating network architectures specified in Python into publication-ready network visualizations that can be directly embedded into any publication. To improve the readability of these visualizations, and to make them comparable, the generated visualizations adhere to a visual grammar that we derived from an analysis of existing network visualizations. Besides carefully crafted visual encodings, our grammar also incorporates abstraction through layer aggregation, as is often done to reduce the complexity of the network architecture to be communicated. Thus, our approach not only reduces the time needed to generate publication-ready network visualizations, but also enables a unified and unambiguous visualization design.


1 Related Work

Visualizing network architectures to improve them or convey the underlying ideas is an extensive field of research. The following provides an overview of how network architectures have been visualized in other works.

1.1 Handcrafted Visualizations

Handcrafted visualizations are part of many research papers that use neural network architectures [34, 43, 21, 2]. Before publishing their work, authors often manually generate visualizations of the network architectures they used. However, since they are handcrafted, these visualizations differ greatly in their visual appearance [41, 30, 32]. This diversity of network visualizations makes transferring knowledge between them hard. Visualizations designed by hand also sometimes contain errors, which can lead to misunderstandings of the network architecture [18]. In that paper, the network visualization shows the second layer as a smaller box than the preceding one. The textual description, however, indicates that the spatial resolution stays the same. When looking at the network implementation, one can clearly see that the stride of two in the first convolutional layer indeed downscales the spatial resolution; thus, the textual description accompanying this visualization is incorrect. A reader who does not consult the network implementation, however, cannot determine whether the glyph or the text is wrong.

In contrast, automatically generated visualizations can provide a unified visualization design, reduce ambiguities between different visualizations, and prevent errors in the visual representation. Authors can also offload work to such visualization tools, which gives them additional valuable paper-editing time.

1.2 Programmatically Generated Visualizations

Some programs support the automatic generation of neural network visualizations. While some of them are directly integrated into deep learning frameworks and mainly serve as network debugging tools, others have been custom-made to support certain aspects of neural network visualization. In the following, we present both types of automatic network visualization tools.
Integrated visualization tools. Most deep learning frameworks provide visualization techniques. In Tensorflow, the visualization toolkit Tensorboard [1] is used to debug network architectures and also provides a tool to visualize them [44]. In Tensorboard, complex machine learning architectures can be analyzed at a fine level of detail.
Caffe [20] also has a tool to visualize network architectures: its visualization toolkit Netscope provides visualizations similar to Tensorboard [15]. However, Netscope visualizes all individual layers without the option to create abstract aggregations, and lacks a visual representation of features such as the spatial resolution.
The deep learning API Keras [9], which can be used on top of multiple deep learning frameworks, provides some rather simplistic network visualization capabilities as well. With its built-in visualization utilities, it is possible to render images of the network architecture. These images can, however, not be customized and always include all layers of the network, which makes them unusable for publications that use large networks.
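This limitation can be reproduced with the public Keras utility plot_model; a minimal example (model choice arbitrary, output depends on the installed Keras and Graphviz versions):

```python
from tensorflow import keras

# Render Keras' built-in architecture visualization. The resulting image
# contains every single layer of the network and cannot be aggregated.
model = keras.applications.VGG19()
keras.utils.plot_model(model, to_file="vgg19.png", show_shapes=True)
```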
All of these visualizations are clearly designed for online use. Most of them employ a vertical layout for their graph representation, draw detailed visualizations that include all layers and parameters, and provide spatial information only on interaction rather than by design. For debugging the network architecture instead of visualizing its general structure, this is completely fine: scrolling through layers vertically is a sufficient interaction mechanism on websites, and the integration of interaction to provide details on demand is essential for debugging architectures. Visualizing the network architecture in a more abstract way is, nonetheless, important to convey its structure at a glance, and its affordances differ greatly from those of online analysis tools. The vertical layouts used by all of these integrated visualization tools would only allow an integration into print publications by sacrificing additional, valuable space. In print media, almost all network visualizations are drawn horizontally, since this takes up less space and follows the natural flow of reading. At the same time, interaction with printed visualizations is impossible; the visualization must thus convey the most important information inherently through its design.
Specialized external tools. Some educational visualizations try to convey simple neural network architectures to explain their functionality to novices [11, 40]. They display rather basic network architectures and are limited to specific use cases, while the visualizations are not generalizable to more complex architectures. This is perfect for educational purposes, but it does not scale to arbitrary network architectures.
One such educational network visualization is an online tool to visualize a CNN for MNIST classification [16]. While it supports detailed exploration of the activations for each layer, it is specifically targeted at the MNIST classification task and cannot be generalized. Also, its exploration in three-dimensional space is clearly designed for online use and the education of deep learning concepts, rather than the communication of a specific network architecture. Its focus, which clearly lies on interactivity and exploration, cannot be adapted to printed media. Other tools that aim at the same audience, and face similar problems when trying to generalize to different network architectures, are Tensorflow Playground [40] and GanLab [24].
Other tools have been crafted to support more general network visualizations. One such tool is ANNvisualizer [14]. It is capable of visualizing conventional networks as well as CNN architectures and is deployed as a Python package that can be imported within the network definition code. Netron visualizes neural networks by analyzing the network model definitions that are used to distribute trained networks [36]. However, the visual grammar of these two visualization applications is not applicable to print media. The glyphs employed in their techniques do not convey information apart from the layer type; additional information is displayed as textual descriptions overlaid on top of the glyphs. Their vertical layout, along with the spacing between network layers, makes even small networks appear relatively large, which renders these visualizations unsuitable for publications.
In a paper by Liu et al. [29], in one by Zeng et al. [47], as well as in the work by Bruckner [5], the authors use a visualization of the network architecture to present learned layer features. Below a visualization that only conveys the layer type as text, they present features or interconnections between layers. Such visualizations are targeted towards analyzing what a network has learned, not towards conveying network architectures, and they are embedded into, and support, specialized visualization tools. Used as standalone visualizations, however, they do not provide much information about the architecture, and are therefore not used in publications.
In another work by Liu et al. [28], generative models are visualized in great detail: the glyphs for individual layers even contain graphs that plot parameters such as activations within a specific layer. This work is also specialized to one architecture, is based on interactively exploring the architecture and parameters of the network, and does not provide an abstract network visualization to be used in publications.
ActiVis, a tool developed by Kahng et al. [23], also includes a visualization of the computation graph of a neural network. Its focus is on inspecting variables and training samples; therefore, the authors chose to include variable nodes between all operation nodes to have easy access to these variables. While this enables interaction with the graph, it also makes the graph more complex, which is undesirable in abstract graph visualizations.
The discontinued web framework Keras.js [8] was built to train neural network models in the browser. It also provides a visualization of the currently running network. This visualization is, however, very similar to what Keras provides out of the box: it uses a vertical layout, draws every individual layer of the network without aggregation, and does not use glyph shapes or color coding to convey information. This visualization helps to understand the model in use but is not designed for publications: it is too large to be printed and conveys additional information only via text.
One visualization technique that is specifically targeted towards use in scientific papers is drawconvnet [12]. It maps glyph size to the spatial dimension of layers, stacks glyphs upon each other to convey the number of feature channels, and additionally uses text to provide information about layer types and parameters. While this visualization can be used for small and simple networks, it faces three main problems. First, it does not scale to modern, large network architectures, since no aggregation technique is available. Second, layer connections are visualized simply by placing the layers from left to right; this way, parallel network parts, which frequently occur in modern architectures, cannot be represented. Third, the source code of the application needs to be modified to match the network architecture in use. Users thus have to rebuild their network architecture in drawconvnet to visualize it, and when the architecture changes, the visualization code has to be changed with it, just as a handcrafted visualization would have to be updated.
Similarly, NN-SVG also claims to create publication-ready network visualizations [27]. It provides three different styles: the first is only usable for fully connected networks, the second provides visualizations that match those of drawconvnet, and the third is borrowed from the AlexNet paper and visualizes the network in 3D. However, NN-SVG also lacks aggregation techniques and only supports sequential models; it is thus not usable for modern, parallel, and large computation graphs. Furthermore, to define the network architecture, visualization designers need to configure the application and add each layer individually.

In a survey on deep learning in visual analytics by Hohman et al. [19], many of these graph visualization techniques are also discussed. All of these visualizations are designed to serve certain use cases. An important downside of current automatic network visualization tools, however, is that they struggle to visualize large networks in a compact way. While in current state-of-the-art visualizations [15, 44], operations can be inspected in great detail, they lack the possibility of abstracting networks to make their general structure comprehensible at a glance. The visualization techniques that do aim at providing publication-ready visualizations cannot handle modern network architectures [27, 12]. Thus, in research papers, these complex network architectures are usually simplified and drawn manually [18, 2, 32]. Besides the extra time effort related to this manual drawing process, the field lacks guidelines for creating such visualizations: some properties in existing visualizations are open to arbitrary interpretation, and knowledge can hardly be transferred between different publications. In this paper, we therefore propose novel visualization techniques and guidelines for abstracting neural network visualizations. These visualizations are optimized for scientific publications, where display space is limited and interaction is impossible.

2 Properties to Visualize

In the following, we introduce properties of the network and its layers that might be visualized. To obtain these, we analyzed existing network visualizations used in scientific publications and worked out common visualization aspects. For a full list of reviewed papers, please refer to the supplementary material. We extracted the following properties as the most important ones to visualize when communicating CNNs. For their discussion, we have classified them into layer properties, which are associated with individual layers, and global properties, which are relevant for the entire CNN.

2.1 Layer Properties

Layers are the building blocks of the computation graph that defines any network architecture. Thus, visualizing the properties that parametrize these layers is important to convey the structure and architectural decisions of the network in use. Neural network layers are often visualized by simple glyphs, where one glyph represents either one layer in the network architecture or an aggregation of layers that has been introduced to simplify the visualization. The following covers the most common properties that are presented by such glyphs and explains how they were visualized by others.
Layer type. One of the most important aspects of any network architecture is which types of layers were used. Most visualizations use both text and the color of the layer glyph to convey this information [44, 46, 34, 21, 2, 45, 49, 18], while some visualizations use just textual descriptions [41, 17, 43, 32, 38, 48]. It is important to note that whenever there is a color coding for layer types, visualization designers need to provide a mapping from color to layer type. In some cases, this was done by including a legend below the visualization, e.g., [21, 2, 35].
Spatial resolution. The spatial resolution of a layer is the extent of its tensor; in the two-dimensional case, this is the dimension in x and y direction. In most visualizations, it is represented by the size of the layer glyph in combination with textual information about the layer resolution [41, 34, 7, 38, 35, 3, 48, 49, 18]. Sometimes, however, only the layer-glyph size [43, 21, 30, 2, 45, 25] or only a textual representation [32] is used. The spatial resolution is important to convey the transformation of features from input space to the latent space of a network, and in some cases further to the output space. Thus, it marks a fundamental characteristic of the network architecture.
Feature channels. The number of feature channels per layer is an important property of network architectures, especially in image processing: it indicates how many different features each layer of the neural network extracts. Thus, this is also a property that is often encoded in the layer glyphs of network visualizations. Authors indicate the number of feature channels in different ways: textual descriptions are common [41, 17, 21, 35], some authors use glyph shapes [34, 25] or text combined with glyphs [43, 30, 32, 7, 38, 45, 48, 49, 18], while some visualizations do not show the number of feature channels at all [44, 46, 2]. The number of feature channels also represents the important transformation into or from latent space; thus, feature channels and spatial resolution are often tightly coupled.
Kernel size. For convolution and pooling layers and their respective reverse operations, kernel sizes can also be a parameter of interest. They are not encoded in most visualizations of other authors but can be found in some as textual descriptions of the individual layers [41, 17, 43, 48]; we also found one visualization encoding the kernel size in its glyphs [38]. When analyzing why kernel sizes were not displayed in many visualizations, two factors were apparent. First, kernel sizes are often consistent across multiple layers. Visualizing them for each of the network layers would lead to repeated information, and since similar kernel sizes are set for many layers, they can easily be described in the text that accompanies the visualization. Second, some visualizations contain abstractions of layers. Whenever multiple layers are combined, changes in spatial resolution or in the number of resulting feature channels can still be easily visualized; kernel sizes, however, might differ within abstractions and can therefore not always be included in these visualizations.
Additional layer properties. Neural network layers have many more properties, such as weights, strides, padding, and data formats, most of which only exist for certain layer types. Some printed visualizations included special network features such as activation maps [41] or receptive fields [38]. These are, however, only provided for specialized use cases. Most visualizations only convey the layer types, the spatial resolution, and the number of feature channels.

2.2 Global Properties

Apart from layer-specific properties, general characteristics of the network can also be visualized in different ways. In the following, we describe global properties of network architectures that influence the overall appearance of the visualization.
Connections. Most neural network visualizations used in publications connect their layers from left to right, e.g., [34, 2]. This way, the natural reading direction is preserved and the visualizations nicely fit across the width of one page. While there are some exceptions, vertical layouts are only used for relatively small networks [41]. Connections between layers are visualized either through connecting lines, e.g., [25, 18], or by simply placing connected layers next to each other, e.g., [43, 35]. Some visualizations additionally add arrows to clarify the direction of data flow, e.g., [30, 34]. Skip connections introduce additional paths that let data skip some layers before being merged back with data that has been processed by all intermediate layers. They are widely used in modern network architectures and represent features of interest in the data-flow graph. Thus, to fully describe a neural network, many authors visualize them by adding lines between the respective layers [17, 21, 2, 7, 18]. In all of these papers, skip connections are indicated simply by drawing a line from one layer to another.
Aggregations. Authors often manually aggregate layers when their architectures are too complex to fit on a page [18, 44]. In addition to making the visualizations smaller, aggregations can reduce redundancy in networks: typical building blocks can be aggregated and displayed as a new, abstracted glyph. When such aggregations are made, a legend indicating which layers are aggregated into a new aggregation layer is often shown. Especially for modern, rather complex network architectures, aggregating layers in this way can help to understand the underlying architecture. Since many network architectures consist of fixed building blocks that are repeated multiple times, such aggregations are commonly used, e.g., [42, 17].
Input and output samples. Some of the network visualizations also incorporate input and output examples [34, 30, 2, 7, 38, 35, 45, 25, 18]. These visualize what type of data goes into the network and what output the network produces. Such samples are mainly useful for image- or shape-related tasks, and do not provide additional information with regard to the network architecture.

3 Visualization Design

Based on the described properties, we have developed a visual grammar to illustrate modern CNNs. We chose to visualize only a subset of the properties mentioned above, since our application focuses on conveying the overall network structure, not on analyzing the layers themselves. In the following section, we describe our proposed visualization design for CNNs. For each property described in Section 2, we either discuss how we visualize it, or why we chose not to.

3.1 Layer Properties

Our visualization design supports the direct encoding of selected layer properties. The following conveys this encoding and explains why certain layer properties are not encoded in our visualization design.
Layer type. To make layer types directly distinguishable, we encode them with colors. We chose not to use text for each individual layer, since this would lead to repetition in almost all network architectures. Perceptual research shows that color is the most dominant visual channel [10]; since analyzing color patterns can be done faster than reading a textual description for each layer, color-coding layers is superior for conveying the overall network architecture. Mackinlay's ranking [31] also provides information about which channel is best used for nominal data: here, color ranks just behind position, which we already employ for the data flow of the network. This supports our choice to use color for identifying the layer type. Using color implies that a mapping from these colors to layer types must exist; therefore, we include a legend below the visualization of the network, which maps the color of the layers to a textual layer-type description.
To make our visualizations accessible to colorblind readers, and to support publications without colored images, we provide an alternative encoding of the layer type, in which grayscale textures are used instead. Again, this mapping is resolved in the legend. We provide 12 distinguishable patterns, which can be extended when needed. An example of this encoding can be seen in Figure 1.

Figure 1: Accessible encoding of the layer type. To support colorblind readers and publications without color, we use texture to encode the layer type of our glyphs.

Spatial resolution. In 20 out of 25 inspected visualizations of neural networks in publications, the spatial resolution is visualized through the height of the layer glyph. However, not all of them use the glyph height in the same way. Whenever the spatial resolution is changed by a layer, the question arises whether this layer should already be rendered in the changed resolution, or whether the next layer should be the first affected by this change. This ambiguity makes the interpretation of such visualizations harder, as one needs to find out which representation was chosen for each visualization approach.
The transformation of the resolution is determined by multiple parameters that can be set for the layers (e.g., stride, kernel size, padding). Since the resulting resolution is thus a result of the inner workings of each layer, rather than a fixed parameter, we chose to visualize the spatial resolution as a change within the layer. For our glyphs, we change the height along the x-axis, resulting in trapezoid-shaped glyphs, as shown in Figure 2. This conveys that the resolution change is made by the mathematical operations within the layer. At the same time, the interpretation of the spatial resolution loses its ambiguity, which makes interpreting these visualizations more intuitive.

Figure 2: Glyph examples. On the left, one can see a glyph for a layer that reduces the spatial resolution. The glyph in the center has the same spatial resolution for its input and its output. The glyph on the right symbolizes a layer that increases the spatial resolution of the processed tensor.

To draw these glyph shapes, we first analyze the input and output dimensions of each layer. Visualization designers can define a minimal and a maximal height for all glyphs. To obtain height values for the input and output of each network layer, we map them to the previously calculated spatial extents of all layers: the highest and lowest spatial extent get mapped to the height extremes, which can be changed if desired, and values in between are interpolated linearly to convey the actual spatial resolution of the input and output of each layer. For the width of the layers, which is mapped to the number of feature channels, we also have a changeable maximum and minimum value. Here, the mapping is done analogously, with the exception that for feature channels, only one value per layer has to be analyzed, since this number is directly set by the user and does not depend on the input. For dense layers, we calculate the maximal and minimal number of neurons and map it to the defined height extremes. Thus, dense layers are treated separately and do not influence the height of convolutional layers, and vice versa. Using these interpolations, each input and output of every layer gets a height value assigned, and the width of each layer is defined as well.
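The following sketch illustrates this linear mapping; function and parameter names are ours, and the default height extremes are only placeholders for the user-configurable values.

```python
def glyph_height(extent, min_extent, max_extent, min_height=20.0, max_height=80.0):
    """Map a spatial extent to a glyph height by linear interpolation.

    The smallest extent in the network maps to min_height, the largest
    to max_height; all other extents are interpolated in between.
    """
    if max_extent == min_extent:  # all layers share one resolution
        return max_height
    t = (extent - min_extent) / (max_extent - min_extent)
    return min_height + t * (max_height - min_height)
```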


When analyzing abstract visualizations of network architectures, we found that it is not always important to convey the change of spatial resolution through exact numbers. Since many modern architectures additionally allow arbitrarily shaped inputs, the spatial dimension is not necessarily fixed at any given layer. We therefore provide the user with the option to toggle labels that display the exact spatial resolution for any given layer. If labels are switched off, the change of spatial resolution is still conveyed by the shape of the glyphs that represent layers in our visualizations. Whenever the visualization designer feels that exact resolution numbers are of importance, they can be displayed at the connections between layers. This follows our visualization design, in which the resolution is fixed between layers but changes within them.
We intentionally chose not to use 3D visualizations for our layer glyphs. While there are network visualizations that display layer glyphs in 3D, they struggle to convey additional information through this channel [26, 34, 43]; only one out of 25 reviewed papers manages to provide additional information through three-dimensional glyph shapes. Since pooling is almost always applied to all spatial dimensions equally, the third dimension is not needed to visualize changes in the spatial resolution. Therefore, in almost all network architectures, three-dimensional glyphs do not convey additional information.
Feature channels. The number of feature channels often follows the inverse trend of the spatial resolution: as the spatial resolution gets reduced towards the latent space, the number of feature channels typically increases throughout the network. However, feature channels differ from the spatial resolution in that they are inherent properties of one layer, rather than being derived from the previous layer. When, for instance, one declares a convolution layer in Keras as layers.Conv2D(32, [3, 3])(input), with the first parameter being the number of feature channels and the second representing the kernel size, one can see that the number of feature channels is a direct property of each layer and not a result of layer-internal calculations. Thus, we chose to visualize the number of feature channels as a direct property of the layer rather than as a change within it. Since feature channels are closely related to the spatial resolution, and often viewed in combination with it, we employ a similar visual feature to convey them: feature channels are mapped to the width of each layer glyph in our visualization design. As with the spatial resolution, feature channels can additionally be represented as text, in which case the number of feature channels is displayed below each layer.
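This distinction can be observed directly in Keras: the channel count is a declared constructor argument, while the spatial output shape only results from applying the layer to an input. A small illustration (model choice arbitrary):

```python
from tensorflow import keras

model = keras.applications.VGG16()
for layer in model.layers:
    # Conv2D layers declare their channel count via the `filters` argument,
    # whereas the spatial output shape results from the layer's computation.
    channels = getattr(layer, "filters", None)
    print(layer.name, tuple(layer.output.shape), channels)
```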
Kernel size. Convolution and pooling kernels are also of interest when analyzing a neural network architecture. In our visualization, however, we chose not to display them. The most important reason is that layer aggregations are important to reduce the complexity of the network visualization, and for these groups, there is no single kernel size, since the kernel size may differ between the layers within a group. Displaying one kernel size per group is thus not always possible, and adding multiple kernel sizes for each aggregation would require a mapping back to the layers that form the group. Also, we try to provide a minimalistic visualization of the network architecture that conveys the layer types, the change of spatial resolution, and the change in the number of feature channels. Out of 25 papers that include neural network visualizations, only four directly display the kernel size in the visualization, and six more include text about it. Since kernel sizes often do not vary greatly throughout the network, displaying them for every layer would introduce a great deal of repetition.
Additional layer properties. Additional properties that might be important for network visualizations are the stride, initialization method, regularization parameters, and others. Their importance, however, differs by network architecture and use case. We therefore chose not to make our visualization more complex by including them. Since our techniques are targeted towards printed visualizations, textual descriptions of special network properties can, where necessary, be added in the accompanying text. This is in line with our goal of providing an abstract overview of the network architecture, where detailed information can be obtained by reading the publication it is contained in.

3.2 Global Properties

Global network properties can be conveyed in different ways. In the following, we describe our proposed visual grammar for connections and aggregations, and explain why we did not include input and output samples in our visualization.
Connections. One big insight from our literature review was that, especially for printed figures, thin horizontal visualizations are preferred. To account for this, and to support the natural reading direction of western cultures, we chose a network layout that has its input on the left and its output on the right side, as 21 out of 25 reviewed papers did.
Similar to most handcrafted visualizations, we use lines for connections between layers. However, when layers directly follow each other in the visualization, the gap between layers can be removed so that no lines are visible. This follows the approach of other visualizations that simply place network layers next to each other from left to right.
We do not treat skip connections any differently from conventional connections between two layers. Whenever a layer has multiple outgoing or incoming connections, we modify the glyph that represents it, as shown in Figure 3; this way, there might be multiple ends on the left or right side of the glyph. We include such splits in the glyphs, rather than just drawing lines between layers, for two reasons. First, each input or output is of equal importance to the network, and we did not want one of them to be less prominent, as often happens in current representations of skip connections in scientific papers. Second, splits and joins of different network paths are important features of a network in general and reveal much of the design idea underlying an architecture. We therefore wanted these features to be prominently visible through the special shape of such glyphs, and thus graphically represent skip connections, as well as any other combination of network layers, in our visualization.
One might think that such layer shapes induce problems with edge crossings. However, edge crossings rarely occur in neural network architectures: we have only encountered planar network graphs, which can consequently be laid out so that no crossing edges occur.
Aggregations. Since modern network architectures can get fairly complex, we provide an automated way to aggregate multiple layers. Aggregations are displayed in a legend below the network graph and map sequences of layers to a new layer type. Whenever such a layer sequence is found in the network, it is replaced by the new layer type that has been generated through the aggregation. This both makes the network easier to understand and compresses it to fit on a printed page.
Input and output samples. We chose not to include a visualization of input and output samples, since this would require the user to provide such samples and thus interfere with the automatic nature of our visualization. Instead, we provide placeholders for these input and output samples. These placeholders comply with the size calculations of the proposed layouting, so users can replace them afterwards with actual samples, which then obey the overall layout. Alternatively, the placeholders can be removed in cases where samples are not wanted.

Figure 3: Whenever glyphs have multiple inputs or outputs, we modify their shape. Here, one can see a part of a network that contains two parallel data paths. The pooling layer displayed on the left has two outgoing connections, while the addition layer displayed on the right fuses these paths back together. The two layers in between each have a single connection for their input and output, respectively.

4 Graph Layout

To support the natural reading direction in print media, as well as to save valuable vertical space, we follow many other network visualizations and always display our graph from left (input) to right (output). Based on the graph structure we have obtained from a CNN's source code specification, we generate the desired layout. Within the graph, layers are treated as nodes that hold their predecessors and successors as directed connections. To lay out this graph, we use a slightly modified version of the network simplex algorithm [13], a technique explicitly targeted towards drawing directed rank-based graphs. It uses a four-pass algorithm: first, ranks are assigned to all layers; second, nodes are ordered within these ranks to minimize edge crossings; third, coordinates on the canvas are assigned to all nodes; and finally, edges are drawn to connect these nodes. The rank-based nature of this algorithm especially fits our use case, where parallel layers should be placed at the same x-coordinate whenever they are processed in one evaluation step of the network. Also, by ordering these nodes, sequential parts of the network tend to be drawn on the same vertical level and can thus be recognized visually. The network simplex algorithm also prefers short edges, which is desirable in our visualizations, since the nodes that represent layers are of primary interest. We intentionally did not default to an algorithm that lays out series-parallel graphs, since Keras operations are not restricted to generating such graphs.
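The first of these passes can be illustrated with a simplified longest-path layering; the actual network simplex algorithm computes optimal ranks, so this sketch (with names of our own choosing) only conveys the idea:

```python
def assign_ranks(layers, predecessors):
    """Assign each layer the length of its longest path from an input layer.

    Simplified longest-path layering: layers processed in the same
    evaluation step of the network receive the same rank and thus share
    an x-coordinate. The real network simplex algorithm refines such an
    initial ranking to minimize total edge length.
    """
    ranks = {}

    def rank(layer):
        if layer not in ranks:
            preds = predecessors.get(layer, [])
            ranks[layer] = 0 if not preds else 1 + max(rank(p) for p in preds)
        return ranks[layer]

    for layer in layers:
        rank(layer)
    return ranks
```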
Edges are added to our visualization to convey the data flow between layers. Whenever edges are short, the connection between two layers can be resolved easily. To further optimize the presentation of the graph, an approach presented by Jünger and Mutzel [22] is used to minimize edge crossings. The coordinates of nodes are assigned as proposed by Brandes and Köpf [4], who optimize a heuristic that guarantees vertical (or, rotated by 90 degrees as in our case, horizontal) inner segments, keeps edges short, and fairly balances a node with its neighbors in a three-step algorithm.


For creating the network visualizations described in Section 3, we further needed to modify the graph layout. While the algorithm described above is able to position and connect rectangular nodes, trapezoid shapes, and especially glyphs with multiple input or output handles, are not handled well. Therefore, we position each layer glyph as if it were rectangular, taking the maximum of the calculated input and output heights as its height value. The glyph is then centered at the node coordinates we obtained through the methods described above. Assuming this center point, we then draw the actual handles for input and output connections with the calculated height values, which might differ between the input and the output of the layer. These handles are drawn so that they end at the y-position of the layer they are connected to. This results in trapezoid-shaped glyphs for layers with at most one incoming and outgoing connection. Whenever multiple handles need to be drawn for input or output connections, this gets more complex. The handles are sorted by the y-position of the layer they are connected to; then, an intersection point between the bottom line of the first handle and the top line of the second is calculated. The first handle is drawn with its bottom line ending at this intersection point, while the second one's top line starts there. We repeat this process for all following inputs and outputs. This way, we can draw all desired glyph shapes described in Section 3. Furthermore, the respective top and bottom lines of these handles always either connect corners of the glyph or are intersected somewhere along this line, which visually supports the data-flow semantics within these glyphs. This process is illustrated in Figure 4.

Figure 4: For glyphs with multiple connections on either input or output side, we add a handle for each of these. The sides of these handles are always connecting corners on the input side of the glyph with corners on the output side, here depicted by a dotted line. To support this for all handles, we calculate the intersection point between such handles as marked in red.
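Computing such an intersection point reduces to a standard line-line intersection; a minimal sketch (our own helper, not part of Net2Vis) could look as follows:

```python
def line_intersection(a1, a2, b1, b2):
    """Intersection of the infinite lines through points (a1, a2) and (b1, b2).

    Used here for the bottom edge of one handle and the top edge of the
    next one. Returns None if the lines are parallel.
    """
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = a1, a2, b1, b2
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if denom == 0:
        return None
    t = ((x1 - x3) * (y3 - y4) - (y1 - y3) * (x3 - x4)) / denom
    return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))
```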

After drawing the layers onto the canvas, we connect them through lines. These lines also need to be changed from the representation obtained by the layout approach described above: in those algorithms, parallelism is assumed to be visualized by edges that change direction, whereas in our visualizations, it is conveyed by glyphs with multiple handles. Therefore, we analyze each edge and extract the part that matches the vertical position of the glyph end it is connected to. These reduced edges are then drawn, so that our graph only contains straight horizontal edges that connect two handles of layer glyphs.

5 Layer Aggregation

In the following, we explain how layers can be aggregated to support a compact network visualization. If layers are replaced by abstract glyphs, this abstraction needs to be resolved; therefore, we also include a legend in our visualizations.
Aggregation constraints. Multiple layers can be aggregated by Net2Vis. Aggregation substitutes all occurrences of a layer sequence throughout the graph with a single abstracted layer, whereby only sequential layers can be aggregated. This also includes sequences that begin with a layer that has multiple inputs, or end with a layer that has multiple outputs. For parallel parts, however, not all possible aggregations are sensible: parallel parts can only be aggregated if they can be replaced with a sequential part after the aggregation. We check this by searching for all paths from the input layer of the selection to its output layer. If there are paths that never reach the output layer of the layers to be aggregated, these layers cannot be aggregated. The same is true if one of the layers to be aggregated contains an input layer that is not contained in one of these paths. By restricting aggregation to serializable segments, no deformed aggregation layers can occur in which two outputs or inputs differ in their spatial resolution. This ensures visual consistency: layers always manipulate the data the same way for all of their connections, and all connections of a layer end on the same level in the graph.
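A simplified version of this serializability check can be expressed as a graph traversal; the sketch below is our own formulation, assuming `graph` maps each layer to its successors:

```python
def is_serializable(selection, graph):
    """Return True if `selection` (ordered from entry to exit layer) can be
    replaced by a single sequential node: every path leaving the entry must
    stay inside the selection until it reaches the exit, and every selected
    layer must lie on such a path."""
    entry, exit_layer = selection[0], selection[-1]
    selected = set(selection)
    visited, stack = set(), [entry]
    while stack:
        layer = stack.pop()
        if layer == exit_layer or layer in visited:
            continue
        visited.add(layer)
        if layer not in selected:
            return False  # a path escapes the selection before its exit
        stack.extend(graph.get(layer, []))
    return selected <= visited | {exit_layer}
```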
Automatic aggregation. To generate aggregations, the layers to be aggregated can either be selected by the user or, more conveniently, be selected automatically by Net2Vis. In this automated selection, parallel network parts are initially left as provided, since they often represent essential architectural decisions and can only be aggregated directly by the visualization designer. To obtain the automatically generated aggregations, we analyze all sequential parts of the network and search for recurring sequences of layers. The most frequent of these sequences is then assumed to be the preferred aggregation. This can be repeated until no recurring sequences are left. As the length of such sequences is not important for this automatism, we compare sequences of all lengths with one another and aggregate the most common sequence.
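The core of this heuristic, finding the most frequent contiguous layer-type sequence, could look as follows (a naive sketch of our own; tie-breaking and overlap handling are left open):

```python
from collections import Counter

def most_common_sequence(layer_types, min_len=2):
    """Count all contiguous subsequences of layer types in a sequential
    network part and return the most frequent one (or None)."""
    counts = Counter()
    n = len(layer_types)
    for length in range(min_len, n):
        for start in range(n - length + 1):
            counts[tuple(layer_types[start:start + length])] += 1
    return max(counts, key=counts.get) if counts else None
```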
To integrate these aggregations into the network, we analyze the network structure for occurrences of the aggregation. Traversing all network layers, we first search for all layers that match the type and the number of outgoing connections of the aggregation's input layer. For each of these outgoing connections, the layer type and number of outputs are checked again; this is done recursively until all layers have been visited. If exactly one matching layer has been found for every layer in the aggregation, these network layers are replaced by the aggregation glyph.
Interacting with aggregations. When an aggregation is resolved, all abstracted layers in the network expand back to their initial layout; when other aggregations contain the deleted aggregation, it is expanded within them as well. Visualization designers can additionally temporarily deactivate aggregations. Deactivating an aggregation expands all of its occurrences in the network visualization while preserving the aggregation in the legend. To visually convey the state of an aggregation, active ones are drawn with a dark outline and black description text, while for inactive aggregations, outlines and text are drawn in light gray. Upon deactivation, we analyze which aggregations depend on the deactivated one and deactivate them as well, since they cannot occur when one of their elements is inactive. For reactivation, we analyze this dependence the reverse way: when an aggregation that depends on another is reactivated, the other aggregation gets activated as well, since the activation would otherwise have no effect on the visualization. We included deactivation in addition to deletion of aggregations to enable exploring the visualization without permanently losing information. This way, the visualization designer can analyze the effect of different aggregation levels while preserving the aggregation settings.
Legend generation. Since we use only color coding to differentiate between layer types, a legend is needed that maps these color codes back to layer names. We therefore automatically create a legend to be presented alongside our graph visualization. This legend contains a glyph for each layer type in the network and displays the name of the layer centered below the glyph, as shown in Figure 5. At the same time, appealing colors are key for a good network visualization. Colors for new layers are automatically proposed and can then be adjusted by the visualization designer. To propose a new color, we have realized two options. The first option finds unused colors in HSV color space: we optimize the color difference by searching for a hue value that is as far as possible from all colors in use. This is done by arranging all hue values on a 360-degree circle and searching for the biggest gap between two of these values; the center point of this gap is then used as the proposed color for the new layer type. This way, colors that maximally differ from all colors already in use can be found. The second option for color proposition is palette-based and serves as the default. We employ the color palette from materialuicolors [33] to generate visually pleasing color mappings. This is in line with the color scheme of our application and provides a commonly used palette; here, we automatically select unused colors as long as new colors are left. The visualization designer can additionally always change the color mapping for each layer type.
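The hue-gap option can be sketched in a few lines (our own formulation, assuming hues are given in degrees):

```python
def propose_hue(used_hues):
    """Propose a hue maximally distant from all hues in use.

    Hues are degrees on the 360-degree color circle; the midpoint of the
    largest gap between neighboring hues is returned. Assumes the given
    hues are distinct."""
    hues = sorted(used_hues)
    if not hues:
        return 0.0
    # Gap from each hue to its clockwise neighbor, wrapping around at 360.
    gaps = [((hues[(i + 1) % len(hues)] - h) % 360) or 360.0
            for i, h in enumerate(hues)]
    i = max(range(len(gaps)), key=gaps.__getitem__)
    return (hues[i] + gaps[i] / 2) % 360
```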
The legend also supports layer aggregations: the glyph representing an aggregation is followed by a composition graph that shows the components of this aggregation and is generated in analogy to the network graph. Here, however, all layers are assigned the same size for all input and output dimensions, since aggregations and layer types generalize to all occurrences in the network. To align the legend items vertically, each glyph has a fixed position; for aggregated layers, the input handle of the composition graph is placed at the same vertical position as the layer representatives. Legend items are sorted from complex to simple, since this makes finding and resolving aggregations more convenient; the complexity is determined by analyzing the dependency tree of the aggregations.

Figure 5: Exemplary legend visualization automatically generated with Net2Vis. The legend maps colors to layer types and illustrates the composition of aggregations. Colors are automatically proposed but might be changed by the user. Layer names are extracted from code but can also be changed or abbreviated.

6 Application Design

To make our techniques publicly accessible, we created an application that includes all presented techniques and has been released on GitLab [6]. The application is divided into five main areas, which display different functional aspects of the visualization and can be seen in Figure 6. The following describes how users can interact with our application and how network visualizations can be obtained directly using our techniques.
Interaction. Controls that are globally important for the application are located in the menu bar at the top. Here, the visualization designer can choose to show and hide the code and preferences areas; this feature has been included to support a focused view on the network visualization. A prominent download button on the right of this menu bar always provides the possibility to export the generated visualization and save it locally as an SVG or PDF file to be used directly within publications.
In the center of the application are its most important parts. At the top, the current visualization of the network architecture is displayed as described in Section 4; below it, the legend visualization explained in Section 5 is shown. Both visualizations can be moved by clicking and dragging, and zoomed by scrolling over them.
To the left of the network and legend visualizations, the code area is displayed. Here, code that defines the network architecture can be inserted and modified. When code is added or changed, these changes are instantly recorded and analyzed: upon each change, we transfer the network code to our backend, where we process the Python file and obtain a Keras model. We convert this model representation, with all its layers and connections, into a JSON object that can be used by our graph-layouting algorithm. To be able to draw glyphs as described in Section 3, input and output dimensions are calculated for each layer in the network. We also support network architectures without predefined input dimensions by calculating the spatial resolution based on either a given input or an arbitrary input dimension. After this process, we have a JSON description for each layer in the network that contains its name, its properties as set by the network programmer, the dimensions of the input and output of the layer, and the incoming as well as outgoing connections to other layers. This is then sent back to the frontend of our application and instantly triggers an update of the visualization.
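Such a per-layer description might look like the following; the field names are illustrative, as the paper does not specify the exact schema:

```python
# Hypothetical example of the JSON description for a single layer.
layer_description = {
    "name": "conv2d_1",
    "type": "Conv2D",
    "properties": {"filters": 64, "kernel_size": [3, 3], "strides": [2, 2]},
    "input_dimensions": [256, 256, 3],
    "output_dimensions": [128, 128, 64],
    "incoming": ["input_1"],
    "outgoing": ["max_pooling2d_1"],
}
```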
While Net2Vis works automatically, users can still override default parameters if desired. On the right, preferences that define the visualization can be set; the displayed preferences change depending on which part of the visualization is currently inspected, so that only relevant parameters are presented upon interaction. By default, the network parameters are displayed, since the network is the most important part of the visualization and thus a natural starting point. When the legend pane is selected, preferences for the legend can be set accordingly. Since grouping layers is an important aspect of our technique, the functionality to either automatically group layers or to select layers and manually assign them to a group can be triggered from within both the network preferences and the legend preferences. When a legend item is selected, the color and name of this layer can be changed, and aggregations can be deactivated or removed.
Changing any property of the visualization instantly updates it, which makes for an interactive and responsive visualization design. This way, each network visualization can be adjusted to the specific network architecture, while still following the general visualization design we propose.

Figure 6: Screenshot of the Net2Vis web application. At the top, one can see the controls area, where global actions for the visualization can be performed. On the left, one can see the code area, where the network code can be inserted and edited. In the center, the generated graph visualization is shown, with the legend below it. On the right, preferences that modify individual parts of the visualization can be changed.
Figure 7: Net2Vis applied to VGG19, which is a network that was used to compete in the ImageNet classification task. With just two aggregations, we were able to present the network using 12 instead of 22 glyphs.

Visualization export. We render the network graph and the legend using SVG paths and SVG text. To support a straightforward embedding of these visualizations into publications, the network graph along with the legend can be directly downloaded from within our application. Visualization designers can thus use our pipeline of first pasting Keras code into our application, second parametrizing and abstracting the initially proposed visualization if desired, and third obtaining ready-to-use SVG/PDF graphics. Using SVG for our visualizations has the additional advantage that most image processing applications can manipulate them well; visualization designers can thus expand upon or adjust the obtained visualizations after downloading them. To allow integrating the visualization directly into LaTeX, we also convert the SVG figures on our server into PDF documents and provide them alongside the SVG figures.
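The paper does not name the conversion tool used on the server; one common way to perform such an SVG-to-PDF conversion in Python is the cairosvg package:

```python
import cairosvg

# Convert an exported network figure to PDF for direct use in LaTeX.
cairosvg.svg2pdf(url="net2vis_graph.svg", write_to="net2vis_graph.pdf")
```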

Figure 8: Net2Vis applied to ResNet, in order to demonstrate how our techniques perform on large network architectures. The repetition of residual blocks in ResNet allows us to reduce the number of glyphs that need to be drawn from 178 to just 23.
Figure 9: Net2Vis can also be used to generate visualizations of three-dimensional network architectures, which are conveyed in the same way as image-related networks, with the exception that the spatial resolution of the data flow is three-dimensional. The dashed square serves as a placeholder for a three-dimensional input sample.

7 Application Examples

To demonstrate the capabilities of Net2Vis, we have applied it to commonly used network architectures. Our teaser figure shows a variation of U-Net [37], a family of networks frequently used for semantic segmentation. Figure 7 shows the application of Net2Vis to the VGG19 network [39], which competed in the ImageNet challenge in 2014. Figure 8 shows a visualization of ResNet [17], where our aggregation functionality reduces the number of glyphs from 178 to just 23. Finally, Figure 9 demonstrates that we support not only two-dimensional network architectures, but also three- or higher-dimensional ones.

8 Conclusions and Future Work

We present a method that automatically generates visualizations of modern and complex CNN architectures, whereby users only need to paste the code that defines their model into our application. Through aggregating multiple layers, these visualizations can be abstracted. We also investigated the affordances of bringing such visualizations to print media and incorporated them into our visualization design. To further support the visualization designer, we provide well-founded defaults for many visualization parameters, such as colors and the level of detail. Our unified visualization design is targeted towards reducing ambiguity in visualizations of CNNs, while our automatic approach reduces errors in the generation of such network visualizations. While we provide one general visualization approach, modifications based on taste and the characteristics of the network architecture can be made by the visualization designer. This gives each visualization a personal touch while preserving the overall design; readers can thus transfer knowledge from one visualization to another without needing to learn the specific design language of each visualization. Altogether, Net2Vis represents the first visualization technique for modern and complex CNNs that allows direct use in publications, even for large models.

One limitation of our visualization approach is that it might lack important information for some network architectures. A consistent way of visualizing additional properties, along with intuitive customizability of the displayed features, would enhance our techniques and solve this problem. Also, this visualization is dedicated to CNNs, which are mostly used for images and 3D data. For sequence models, such as text or speech recognition systems, many of the techniques presented here could be reused; however, some aspects of the visualization would need to be redesigned for these particular use cases. Working towards an adaptation for such network architectures could make this tool more generally applicable. Finally, to demonstrate our visualization approach, we decided to focus on networks specified in Keras; supporting other deep learning frameworks could further improve the usability of our techniques.

Acknowledgements.
This work was funded by the Carl-Zeiss-Scholarship for Ph.D. students.

References