1 Introduction
Understanding the behavior of neural networks is still an open problem in the field of deep learning. Solving the black-box problem, i.e. explaining which elements of a network are used to solve each part of a given classification problem, demands the design and implementation of new tools and theoretical frameworks that enable us to untangle the result of applying automatic differentiation to a randomly initialized distributed model. Current methods that directly approach the black-box problem are varied. Notable examples include the distillation of a given network’s behavior into soft decision trees [Frosst2017], and the use of a parallel neural network to synthesize text-based explanations of the specific responses of a model [Hendricks2016]. While these methods show promising results in terms of providing better descriptions of neural behavior, distillation (i.e. the transformation of a deep network into a decision tree) incurs lower accuracy, and text-based explainers provide very little insight into the actual networks that are involved in training and classification. Other methods include the generation of activation atlases via dimensionality reduction techniques such as t-distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) [Maaten2008, carter2019], followed by Feature Visualization to approximate, via optimization, the type of input that excites each of the units in the network [Olah2017, Olah2018]. In more recent work, PCA and the Mapper algorithm have been applied to the weights of a feedforward network to study the topology of the weights in hidden layers during learning [Gabella2019], showing an inherent structure of divergence and specialization that should be explained in the coming years. These methodologies, unfortunately, need to be constantly rerun for incoming datapoints, and can become rather computationally expensive. Also notable are multiple methods for salience map production in images [Simonyan2013, Zeiler2015], which are a means of finding neurons that are responsible for firing when presented with very specific features, such as faces. Last but not least, network dissection associates units in hidden convolutional layers with the detection of specific semantic information in images, such as texture, material, and color, with the aid of a labeled dataset [Bau2018].
All these methods succeed in either (a) simplifying the structural complexity of the trained model (model induction), (b) generating human-readable sets of input images that activate specific regions of the network (deep explanations), or (c) discovering the features in the input that produce specific activations at specific hidden layers. While these are fundamental breakthroughs in the field of explainable artificial intelligence (XAI), an ideal future objective is not only to produce atlases of possible input representations that activate specific layers, but also to understand how different activation sequences result in specific classification results. Model evaluation is, then, of paramount importance if we wish to understand when these models can be trusted, why they fail, and how they can be corrected [darpa2016].
This article attempts to generate a geometric visualization of the domain of all possible activations in a given network, with the objective of assessing its behavior. We will refer to this domain as the activation space of a neural network. First, we propose the use of autoencoders (in this case, Variational Autoencoders or VAEs) to encode the hidden layers of a given network into a low-dimensional representation [Rezende2014, Kingma2013]. Second, we show how using a neural network to generate a representational space allows us to visualize and estimate the expected errors for a validation set, provide confidence intervals, and observe its behavior against noise and simple adversarial attacks. Third, and finally, we show that using an autoencoder to generate an activation map allows us to completely dissect a neural network and disentangle units with respect to their influence on the output layer. These features represent a form of self-reflection of the model by only looking at its hidden structure and repeating patterns, hence the name of the method: self-introspection.
This work mainly focuses on experimental results for feedforward networks and the dense layers of convolutional networks. A similar result for recurrent, convolutional and attention-based models could potentially be found, as long as there is a method to provide the internal states of interest to the unsupervised model. The rest, as will be shown in the following pages, requires no significant additional preparation.
2 Materials and methods
2.1 Definitions for a learning problem
Let and
be topological spaces, namely vector spaces with natural Euclidean topology. We will consider classification problems where there exists a function
that relates an input domain with an output domain . Let and represent the input and output training sets, respectively. Let the training set be defined by input-output pairs
(1) 
with the number of available training pairs. Similarly, let and be the input and output validation sets, respectively. Let the validation set be
(2) 
Here, will be the number of validation set pairs. In typical learning problems, the training and validation sets will be point clouds (compact sets) of different sizes (), with no overlap, and presenting the same degree of variability among classes. In more formal terms, consider the following properties:

Training and validation sets do not share any data:

Both input and output validation sets are similarly representative of the complete set. Since we are in Euclidean space, a way to write this would be
with denoting the Hausdorff distance between any two point clouds.

The training and validation sets better represent the input and output spaces as the number of pairs increases:
Let be a model or approximation of our desired (unknown) function , where is a set of variable parameters, and is the output domain of the model (image of ). Training can be defined as solving the following optimization problem:
(3)  
subject to 
where denotes the training error of the model for a given training dataset , and represents the validation error of the model. The constraint of the optimization problem accounts for the overfitting of the model to the training dataset, which would hinder generalization.
To solve the optimization problem, a deep learning model will update its parameters through gradient descent:
(4) 
and optimization will halt when . The derivatives are calculated via automatic differentiation (also referred to as backpropagation). The learning rate can be set to a fixed value, or changed over the course of training with adaptive rate methods, such as Adam [Kingma2014AdamAM] or RMSprop [Graves2013GeneratingSW].
2.2 Notation for a feedforward network
We will consider a multilayer perceptron with
layers as a series of operations, as follows:
(5a)  
(5b)  
(5c)  
(5d)  
(5e)  
(5f) 
In this sequence, represents a nonlinear, continuous function with a continuous inverse, is called the input layer, are the hidden layers, and is the output layer. In this model, the parameters are the weights and biases . For our particular experiments, we will use ELU [Clevert2015] as the main activation function for all hidden layers.
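As an illustration of the layer-by-layer notation above, the following is a minimal sketch of a forward pass through an ELU multilayer perceptron that also returns the hidden activations. The layer sizes, the initialization scale and the helper names (`dense`, `init_layer`, `forward`) are assumptions made for this example, not the configuration used in the experiments.

```python
import math
import random

def elu(x, alpha=1.0):
    """ELU activation: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One layer: activation(W x + b), with W stored as a list of rows."""
    return [activation(sum(w * v for w, v in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(x, layers):
    """Run the network and return (output, list of hidden-layer activations)."""
    hidden = []
    h = x
    for weights, biases in layers[:-1]:
        h = dense(h, weights, biases, elu)
        hidden.append(h)
    out_weights, out_biases = layers[-1]
    return dense(h, out_weights, out_biases, sigmoid), hidden

rng = random.Random(0)
sizes = [4, 8, 8, 3]  # hypothetical: 4 inputs, two hidden layers of 8 units, 3 outputs
layers = [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]
output, hidden = forward([0.5, -1.0, 0.25, 2.0], layers)
```

Returning the hidden activations alongside the output is the only feature of this sketch that the rest of the method relies on.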
2.3 Self-introspection
A way to restate the black box problem is as follows. At initialization, the weights and biases of a neural network are randomly prescribed. During training, the gradient of the cost function with respect to these weights and biases is calculated, and these parameters are updated with respect to these initial values and the activations of preceding units, which in turn also depend on random initial values and other preceding unit activations. The credit assignment of each parameter to the final classification error can be obtained via automatic differentiation due to the fact that they are randomly initialized (i.e. if they were set to zero at initialization, the gradient would be zero and no error could be backpropagated) but, as a consequence of this random initialization, the method assigns a random role to each unit. Thus, a clean relationship between the hidden activations and the output result cannot be found, even after the network has been trained.
Informally, the activations at the hidden layers could be interpreted as representations of the input vectors as they are being converted into output vectors, as a result of applying the nonlinear equation system described by the model. This intuition can be better understood with a simpler example. If a trained feedforward classifier were composed of only one hidden layer , the activations of that hidden layer would directly represent the relationship between the input and the output for any specific data pair . Each specific activation pattern at the hidden layer will produce a specific response at the output layer, and will only result from a very specific set of possible input values. The hidden layer could then be sorted with respect to the amplitude of each activation for each of the output categories, and the error of the classifier could be estimated for all known hidden layer patterns.
Under these assumptions, a hypothesis can be defined. In order to transform all the input datapoints of a category into the same output vector, the hidden activations in a trained classifier must become more and more similar as we travel from the first to the last layers of the network. This similarity suggests that activation patterns may belong to a low-dimensional subspace or embedding, on par with the training data, and that each output unit acts as an attractor for similar, yet different data belonging to the same category. Obtaining this embedding with a secondary deep neural network yields a low-dimensional representation of all possible trajectories followed by the input data. If such trajectories belong to a low-dimensional embedding, then hidden activations should be fairly similar for input-output pairs belonging to the same category, so that the output layer yields the same result every time. Conversely, similar activation patterns will correspond to clustered points in this newly defined low-dimensional representation.
Consider the feedforward classifier shown in Figure 1.(a), with three hidden layers. If we understand the network as a distributed function approximator that gradually transforms the input vectors into output (category) vectors via Equations (5a) through (5f), minimizing Equation (3), then the classification problem (and solution) must be encoded by the hidden activations implicitly. In order to obtain a human-readable representation of all hidden activations (at least, for the feedforward example of Figure 1), the hidden layers are concatenated into a single vector, and then fed as input to an isolated autoencoder (Fig. 1.(b)). The autoencoder’s bottleneck will produce a low-dimensional map of all possible activations, as shown in Fig. 1.(c), where all the activations of the network for a given input-output pair are, in this case, represented by a single point in 2D space. Using the bottleneck outputs as inputs to another network, an error estimator (represented by the two horizontal layers under the autoencoder’s bottleneck in Fig. 1.(b)) can be trained as well, by using the classification errors obtained by standard training. This is comparable to using t-SNE or MDS, with a fundamental difference: the autoencoder serves as an explicit function that relates activations to a low-dimensional manifold and, thus, it can be trained on a given dataset, respond to incoming data and be interrogated in real time.
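The data preparation implied by this paragraph can be sketched as follows: the hidden layers are concatenated into one vector (the autoencoder input), and the classification error of the same sample is stored next to it (the target of the error estimator). The function names and the toy activation values are hypothetical, and the mean squared error is used here only as one possible error measure.

```python
def concatenate_hidden(hidden_layers):
    """Flatten a list of per-layer activation lists into one activation vector."""
    return [a for layer in hidden_layers for a in layer]

def mean_squared_error(output, target):
    return sum((o - t) ** 2 for o, t in zip(output, target)) / len(output)

def build_record(hidden_layers, output, target):
    """Pair the concatenated activations (autoencoder input) with the
    classification error (regression target for the error estimator)."""
    return concatenate_hidden(hidden_layers), mean_squared_error(output, target)

# Hypothetical activations of a network with three hidden layers of 2 units each.
hidden_layers = [[0.2, -0.1], [0.7, 0.0], [1.3, -0.4]]
activation_vector, error = build_record(hidden_layers, [0.9, 0.1], [1.0, 0.0])
```

One such record per training sample is all that the autoencoder and the estimator need; neither of them sees the original input.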
This rather simple methodology shows interesting potential in various problems. First, it can be used to weigh the expected accuracy of a classifier in an ensemble. Second, the representation can be used as a method for anomaly detection: uncommon activations will be placed in positions far away from common activations. Third, it allows us to sort the hidden layers of the network with respect to the expected activations for a particular output category. Finally, the autoencoder can be trained on the complete learning record of a given classifier, showing the evolution of the activation patterns as a response to data, where the network gradually specializes to solve the classification problem.
2.4 Notation for a self-introspecting model
The following symbols will be used to refer to the various structures throughout the manuscript:

Classifier . The classifier is an approximator of the function . In particular, , where is the set of all possible hidden unit activations. It is trained by minimizing .

Autoencoder. The autoencoder is composed of two functions: the encoder , which provides a relationship ; and the decoder , which in turn attempts to invert the encoder’s operation: . It is trained by minimizing the reconstruction error and the Maximum Mean Discrepancy (MMD) at the bottleneck.

Estimator. The estimator generates an estimate of the error of the classifier from the low-dimensional representation provided by the encoder. It is trained by minimizing the estimation error, namely .
These systems can be trained simultaneously or sequentially. Note that, in order to train the classifier, autoencoder and estimator independently, but at the same time, backpropagated gradients must be stopped at three locations: (a) between the hidden activation sequence and the autoencoder input, (b) between the autoencoder bottleneck and the error estimator input, and (c) between the MSE error function and the output of the network . If trained sequentially, the order must be, as can be expected, (1) classifier, (2) autoencoder, and (3) estimator.
2.5 Model architecture
Two classifier models were tested in the experiments. The simplest model is a feedforward ELU model, which will be used for classifying MNIST samples. As an example of the same analysis on a convolutional model, an all-convolutional neural network [Springenberg2014] will be tested, by evaluating the activations at the dense (feedforward) layers. The architectures of both networks are given in Table 1.
In the case of the autoencoder and estimator models, the same networks were used for both classifiers. Their architectures are provided in Table 2. The autoencoder is a feedforward InfoVAE [Zhao2018], which enforces latent-space Gaussianity by minimizing the Maximum Mean Discrepancy (MMD) between the latent-space results and random samples from a standard normal distribution. The estimator is a feedforward network with a single linear output.
It must be noted that these architectures are mostly experimental and for illustrative purposes only. For more realistic examples, the complexity and depth of the networks should be increased. Additionally, for the convolutional case, only the feedforward hidden layers are used for self-introspection. Encoding the convolutional layers as inputs for the autoencoder should be studied in further work.
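The MMD term mentioned for the InfoVAE compares latent codes with samples drawn from a standard normal prior. The sketch below computes a biased estimate of the squared MMD with a Gaussian kernel. The kernel bandwidth, the sample sizes and the synthetic "latent codes" are assumptions of this example and are not taken from the experiments.

```python
import math
import random

def gaussian_kernel(a, b, bandwidth=1.0):
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y)."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two samples."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))

rng = random.Random(0)

def sample_normal(n, mean):
    """n two-dimensional points with unit variance around (mean, mean)."""
    return [[rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)] for _ in range(n)]

prior = sample_normal(100, 0.0)     # samples from the standard normal prior
matched = sample_normal(100, 0.0)   # stand-in for well-regularized latent codes
shifted = sample_normal(100, 3.0)   # stand-in for latent codes far from the prior
mmd_matched = mmd_squared(matched, prior)
mmd_shifted = mmd_squared(shifted, prior)
```

In this toy setting the shifted sample yields a much larger value than the matched one, which is the behavior that makes MMD usable as a penalty pulling the bottleneck distribution toward the prior.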
2.6 Training routine
The models were trained with minibatch stochastic gradient descent, with Adam as the optimizer. Cyclic Learning Rates (CLR) [Smith2015] were used with a triangular (sawtooth) learning cycle. Dropout regularization was included for all layers, with , unless specified otherwise [Hinton2012]. Validation errors were tested at the end of each cycle, and early stopping was implemented with a patience of cycles of monotonically increasing validation error. The complete system is implemented with TensorFlow 1.14 on Python 3.6 and run on an NVIDIA RTX 2080 Ti GPU (Nvidia Corporation, Santa Clara, California, USA).
For the feedforward network classifier and the MNIST dataset, cycle length was iterations, with a learning rate range of . Minibatch size was 128 for the classifier and the autoencoder, and 32 for the error estimator. The total number of cycles was , achieving a test set accuracy of . The error estimator required a total of cycles with a cycle length of . For CIFAR-10 and the all-convolutional net, cycle length was and , and dropout was set to . Additionally, the maximum value of the learning rate was set to decay at a rate of , achieving a final test set accuracy of . The autoencoder is trained with an identical regime, only with . The error estimator is trained with twice as many cycles ().
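The schedule described in this section can be sketched as a triangular cyclic learning rate whose peak optionally decays after each cycle, together with a patience-based stopping rule. All numerical values used with these functions are placeholders, and the multiplicative form of the peak decay is an assumption, since the text only states that the maximum decays at a given rate.

```python
def triangular_clr(iteration, cycle_length, lr_min, lr_max, decay=1.0):
    """Learning rate at a given iteration of a triangular cyclic schedule.

    The rate rises linearly from lr_min to the peak over the first half of
    each cycle and falls back over the second half. The peak is multiplied
    by `decay` after every completed cycle (decay=1.0 disables this).
    """
    cycle = iteration // cycle_length
    position = (iteration % cycle_length) / cycle_length  # in [0, 1)
    peak = lr_min + (lr_max - lr_min) * decay ** cycle
    triangle = 1.0 - abs(2.0 * position - 1.0)  # 0 -> 1 -> 0 within a cycle
    return lr_min + (peak - lr_min) * triangle

def should_stop(validation_errors, patience):
    """True once the validation error has increased for `patience`
    consecutive cycles (one recorded error per cycle)."""
    if len(validation_errors) <= patience:
        return False
    recent = validation_errors[-(patience + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

A training loop would call `triangular_clr` once per iteration and `should_stop` once per cycle, after appending the latest validation error.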
2.7 Confidence metrics and error estimation
During training, the input will flow through the classifier into an output category , producing an activation sequence . The latter will be read by the autoencoder, which in turn will produce a representation of the activation sequence . Then, the estimator will use to provide a number which should represent the expected error during training for a specific hidden layer pattern sequence, .
While these are still estimates at training time, as long as overall accuracy and errors are within the same order of magnitude for the training and validation steps (i.e. overfitting is avoided as much as possible), we can consider them an adequate measure of how much an activation sequence is expected to provide an accurate response. For this particular model, the output unit of the estimator is a linear unit, which must converge to the logarithm of the training error, . Confidence can then be measured as the negative of the logarithm of the error, namely .
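The relation between the estimator target and the confidence value can be written in a few lines. The natural logarithm and the small floor that protects against a zero error are assumptions of this sketch; the text only specifies that the estimator regresses the logarithm of the error and that confidence is its negative.

```python
import math

def log_error_target(error, floor=1e-12):
    """Regression target for the estimator: logarithm of the training error.
    The floor is an added safeguard against log(0)."""
    return math.log(max(error, floor))

def confidence(estimated_log_error):
    """Confidence as the negative of the estimated log-error."""
    return -estimated_log_error
```

With this convention, a smaller estimated error always maps to a larger confidence value.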
Net1. Deep feedforward ()  Net2. All-convolutional () 

Input  Input 
ELU  ELU, f.s. 
Sigmoidal Output  ELU, f.s. , str 
ELU, f.s.  
ELU, f.s. , str  
, global avg.  
ELU  
Sigmoidal Output 
Notation: f.s. stands for filter size (per neuron). Layer notation is , number of identical layers, number of hidden neurons in each layer. Finally, str indicates striding. Flattening and reshaping layers implemented via tf.contrib.layers.flatten() and tf.reshape(). 
Feedforward InfoVAE ()  Feedforward estimator () 

Input  Input 
ELU  ELU 
ELU  ELU 
ELU  ELU 
Linear  1 Linear Output 
ELU  
ELU  
ELU  
Linear Output 
Layer notation is , number of identical layers, number of hidden neurons in each layer. 
2.8 Reorganizing randomly initialized layers
Let us consider a selfintrospective system consisting of a feedforward classifier , an autoencoder composed of an encoder and a decoder (where here is the concatenation of all hidden layers in the classifier) and an error estimator . We will assume that we can split a training dataset into the various categories they belong to, i.e. , with representing the subset of that belongs to the th class. Since
can be understood as a set of samples from a continuous random variable
, then the vector of hidden states can also be defined as a continuous random variable
, related to the input by , which can be deduced from Equations (5a) through (5f). Let the domain of be, and let its joint probability density function (PDF) for each of the elements of
given hypothesis be . The expected hidden activation pattern of the network given a type of hypothesis or category of input data , , could in theory be calculated via the multiple integral
(6) 
This volume integral quickly becomes computationally expensive as the number of units increases, and becomes intractable if the domain of the random variable is not bounded, as is the case for ELU units, or if the input data is high-dimensional, which makes it difficult to evaluate if the PDFs need to be estimated.
If an autoencoder is used to obtain , then this random variable has a tractable PDF or prior approximately equal to the unit Gaussian . Similarly, the expected bottleneck values, or expected value of random variable for a specific hypothesis can be calculated much more quickly via
(7) 
where here is still an unbounded domain, but it can be truncated to a smaller domain (e.g. ) with negligible accuracy losses, since is non-negligible only near the origin. With this in mind, the previous integral can be numerically approximated with a Riemann sum:
(8) 
with and is an estimate of the joint PDF of , which can be obtained with the training data. With this estimate, the decoder can be used to obtain an estimate of the activation pattern represented by that lowdimensional point:
(9) 
The network’s units can be sorted by their expected activation for all hypotheses, via
(10) 
In this way, each unit is assigned to the dataset category to which it is most sensitive. If a unit serves more than one purpose, it will at least be assigned to the most significant category. This operation results in a sorting of all layers with respect to the output neuron that each unit activates with.
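The two numerical steps of this section can be sketched as follows: a Riemann-sum estimate of the expected bottleneck position for one category on a truncated 2D grid, and the assignment of each unit to the category with the largest expected activation. The histogram PDF estimate, the grid limits and the function names are assumptions of this example. The decoder that maps the expected bottleneck point back to an activation pattern is not included, so `assign_units` takes the decoded expected activations as given.

```python
def expected_latent(points, low=-3.0, high=3.0, bins=30):
    """Riemann-sum estimate of E[z | category] on a truncated 2D grid.
    The joint PDF is estimated with a normalized 2D histogram of `points`."""
    width = (high - low) / bins
    counts = {}
    inside = 0
    for z1, z2 in points:
        if low <= z1 < high and low <= z2 < high:
            cell = (min(int((z1 - low) / width), bins - 1),
                    min(int((z2 - low) / width), bins - 1))
            counts[cell] = counts.get(cell, 0) + 1
            inside += 1
    cell_area = width * width
    expectation = [0.0, 0.0]
    for (i, j), count in counts.items():
        density = count / (inside * cell_area)  # histogram estimate of the PDF
        center = (low + (i + 0.5) * width, low + (j + 0.5) * width)
        expectation[0] += center[0] * density * cell_area
        expectation[1] += center[1] * density * cell_area
    return expectation

def assign_units(expected_activations):
    """expected_activations[c][u]: expected activation of unit u for category c.
    Returns, for each unit, the category with the largest expected activation."""
    n_units = len(expected_activations[0])
    return [max(range(len(expected_activations)),
                key=lambda c: expected_activations[c][u])
            for u in range(n_units)]

def sort_units(assignment):
    """Unit indices reordered so that units sharing a category are adjacent."""
    return sorted(range(len(assignment)), key=lambda u: assignment[u])
```

For example, `assign_units([[0.9, 0.1, 0.2], [0.1, 0.8, 0.3]])` assigns the first unit to category 0 and the other two to category 1, and `sort_units` then gives the permutation used to redraw the layer.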
2.9 Artificial brainbow
One way to visually identify how much each unit in a layer contributes to a specific output classification is by color-coding each of them in terms of the expected intensity of each activation for each type of output, . This can be easily done with a series of colors , coded as 3-dimensional RGB vectors and associated with each of the output categories , and then calculating:
(11) 
The units will then acquire a more grayish tone when not supporting any specific output category, or take the color of the output category if the opposite is true.
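One possible reading of this color-coding is a weighted blend of the category colors, with weights given by the expected activation of the unit for each category. The clipping of negative activations, the normalization of the weights and the three-color palette below are assumptions of this sketch and may differ from the exact form of Equation (11).

```python
def unit_color(expected_activations, palette):
    """Blend category colors, weighted by a unit's expected activation per category.
    Negative activations are clipped to zero and weights are normalized to sum
    to 1 (both are assumptions of this sketch)."""
    weights = [max(a, 0.0) for a in expected_activations]
    total = sum(weights)
    if total == 0.0:
        weights = [1.0 / len(palette)] * len(palette)
    else:
        weights = [w / total for w in weights]
    return tuple(sum(w * color[k] for w, color in zip(weights, palette))
                 for k in range(3))

palette = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]  # hypothetical RGB colors
selective = unit_color([0.9, 0.0, 0.0], palette)    # responds to one category
unselective = unit_color([0.3, 0.3, 0.3], palette)  # responds equally to all
```

Here the selective unit takes the pure color of its category, while the unselective unit receives equal parts of every color, i.e. a gray tone for this palette.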
2.10 Robustness against noise and adversarial attacks
We will study how the autoencoder interprets the activations of the network with a series of simple experiments. First, noise will be added to MNIST on a model that has been trained on clean samples. The influence of noise on the internal states of the network should be perceivable enough to produce a visible representation. Then, the network can be trained under noise, and its performance can be reevaluated. Since the latter network has been prepared to operate under noise, its responses should be more consistent than those of the network that was trained with clean samples, which should then be reflected in their accuracy.
Similarly, the network’s resilience to adversarial attacks can be studied as trajectories in activation space. An adversarial attack essentially consists of a small variation (also known as a perturbation) in the input data that produces significantly different activation sequences in a classifier, resulting in an erroneous result by a large margin. For these experiments, the Fast Gradient Sign Method (FGSM) will be used to modify the internal states of the network from the input [Goodfellow2014]. A successful attack shall then be defined as a trajectory in activation space from the original (correct) category to the target (wrong) category, whereas the opposite will be true for failed attacks.
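For reference, FGSM perturbs the input by a fixed step in the direction of the sign of the loss gradient with respect to the input. The sketch below applies it to a single logistic unit, for which this gradient has a closed form; the weights, the input and the step size are hypothetical, and a deep classifier would obtain the same gradient through automatic differentiation instead.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def predict(x, w, b):
    """Probability of class 1 for a single logistic unit."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def input_gradient(x, y, w, b):
    """Gradient of the cross-entropy loss with respect to the input x."""
    p = predict(x, w, b)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(x, y, w, b, epsilon):
    """Fast Gradient Sign Method: x_adv = x + epsilon * sign(dLoss/dx)."""
    grad = input_gradient(x, y, w, b)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

w, b = [2.0, -1.0], 0.0   # hypothetical trained weights
x, y = [0.3, 0.1], 1.0    # clean input with true label 1
x_adv = fgsm(x, y, w, b, epsilon=0.4)
```

In this toy case the clean input is assigned to class 1 and the perturbed input to class 0, while no input component moves by more than `epsilon`.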
3 Results and discussion
The results are divided into six sections, dedicated to showcasing some practical results of using an autoencoder on the hidden activations. Our main objective is to present the flexibility and interactivity of these autoencoders in state-of-the-art problems.
3.1 Examples of activation spaces
Figure 2 shows the visualizations that can be achieved by autoencoding hidden activations. The top row corresponds to Net1, the feedforward classifier trained on MNIST, and the bottom row shows the dense activations of Net2, its convolutional counterpart, trained on CIFAR-10. These plots represent the overall behavior of the network when provided with various inputs, empirically showing that (a) there exists a low-dimensional embedding where hidden activations in a deep model are similar for similar inputs; (b) there exists a 2D representation of this embedding in which sufficient separability across activation patterns can be achieved; and (c) wrongly classified inputs are a product of uncommon hidden activations.
As a first experiment, the hidden states of the network were recorded over a total of 30 cycles, or CLR epochs. Then, the complete record was used to train a VAE, so that the trajectories of the hidden activations with respect to the inputs could be plotted. The result of this calculation is shown in Figure 2.(a), where each line represents a handwritten digit (0–9). The trajectories fade from grey into the color of the ground truth category they belong to, in order to better depict the passage of training time. In this representation, it is possible to observe how successive learning cycles further separate the hidden activations that correspond to each of the categories. Once convergence is achieved, a new VAE can be used to encode the final activation patterns for the complete MNIST dataset (Figure 2.(b)). In this subfigure, each of the input datapoints is color-coded by ground truth as well. Misclassified digits can be seen as input datapoints that elicit an activation pattern that is sufficiently abnormal to result in an erroneous output response. By studying the mean square error (MSE) of the output layer during training (Figure 2.(c)), we can observe how the misclassified digits correspond to activation patterns in between clearly defined responses, for both the training set and the test set, the latter shown in Figure 2.(d). A similar result is achieved by autoencoding the densely connected layers of Net2, in Figs. 2.(e) through (h). Fig. 2.(e) showcases how some of the datapoints become more separated than others during training. This can be thought of as a result of the training process, where some inputs cannot be separated as easily as other, more clearly defined ones. Hidden activations in the classifier are varied for a given category, yet they all seem to cluster together in a family of similar activation patterns (or data trajectories) that produce the same output. Datapoints that elicit unclear activations lie in the fringes of well-defined clusters, thus probabilistically resulting in misclassification after passing the output sigmoidal/softmax layer through an argmax operator to obtain the output category.
The classification error estimates in Figures 2.(c) and 2.(g) depict clear spatial co-registration between the estimated mean square error (MSE) and the locations of the activation patterns in space (i.e. the space defined by the bottleneck of the autoencoder) that produce classification errors, for both the training and test sets. By using this estimate of the error in a confidence metric , unclear classifications could be discarded or detected as anomalies, which could be useful in classification tasks where reliability metrics are a necessity.
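The rejection idea mentioned at the end of this subsection can be sketched as a hard classification that is withheld whenever the confidence value falls below a threshold. The threshold and the use of `None` to mark a rejected sample are assumptions of this example.

```python
def classify_with_rejection(outputs, confidence, threshold):
    """Return the index of the largest output (argmax), or None when the
    confidence is below the threshold and the sample should be flagged."""
    if confidence < threshold:
        return None
    return max(range(len(outputs)), key=lambda k: outputs[k])
```

For instance, `classify_with_rejection([0.1, 0.7, 0.2], confidence=5.0, threshold=2.0)` returns category 1, whereas the same outputs with a confidence of 1.0 are flagged instead of classified.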
3.2 Reordering a feedforward network
Being able to calculate the expected activation patterns of a given network and to obtain a low-dimensional embedding of its behavior not only provides a functional visual representation, but may also serve as a tool for dissecting and reordering the layers of a network, as well as studying the paths that the input data take as they merge and become the output vector.
The following result is presented for the feedforward network Net1 only, mostly for illustrative purposes. Figure 3 depicts the complete process, as described by the equations in Section 2.8. Figure 3.(a) provides a representation of the affinity of each unit to its respective output category, via color-coding (Equation 11). In these plots, Layer 0 corresponds to the hidden layer closest to the input, and Layer 11 directly precedes the output layer. As per Equation 10, trained units seem to have a preferred output category, i.e. a set of similar inputs from the same category that elicit a maximal strength of activation. This allows us to sort each of the layers in terms of the type of output category that they are most likely to respond to (Fig. 3.(c)). This results in a reordered network (Fig. 3.(d)), where data takes different paths depending on their actual, ground truth category (ten right subplots in Fig. 3).
A few things can be noted about the behavior of the rearranged network. First, Fig. 3.(c) provides insight into the relative relevance of each output category in the hidden layers: the distribution of units dedicated to each category is more or less uniform. This could be explained by the fact that MNIST is a balanced dataset, and thus similar fractions of the network are dedicated to activating when presented with each of the 10 different handwritten digits. Second, it appears as if each input digit follows a very specific path inside the network, suggesting (at least empirically) that there exists a common activation pattern (albeit with slight variations to accommodate input variability) for every output category. These activation columns, seen as bright yellow streaks that change for each category, show some degree of overlap, which could potentially be due to dropout regularization. Additionally, while many of the units are randomly activated, a specific subset of them (as detected by the sorting method in (c)) activate more than others, while another fraction of the network (less noticeable, but color-coded in dark blue near the bright columns in (e) through (o)) is dedicated to being inhibited. This seems to suggest that, within a deep neural network, there exists a defined subset of units dedicated to excitation, another fraction dedicated to inhibition, and a much larger section that is orthogonal (i.e. showing near-zero activation) to each of the classification categories. A much more scattered, weaker fourth subset is subtly activated across all categories, likely corresponding to units with more than one specific role in the classification problem. These fractions and elements are statistically the same for a given classification category, and can now be observed and sorted, suggesting there is a clear latent structure within a deep classifier, which could not be observed before due to random initialization.
3.3 Ordered activation sequences and latent representations
To study the correspondence of activation patterns with the classification input and the latent space, a few examples of activation patterns for a series of MNIST digits are shown in Figure 4. Two sets of 20 samples are displayed. Each number can be represented in three different ways: a black-box perspective (top plots for each digit), where the handwritten digit (center), its actual category (top left number) and the classifier’s output (bottom right digit) are provided, allowing us to directly identify misclassification; an activation-space representation of the current state of the network, which is presented as a black cross in the activation map; and direct observation of the rearranged classifier activations. The latter two allow us to study the output activation sequence further. In general terms, there seems to be a clear relationship between the strength and clarity of the activation pattern and its location in activation space. Stronger (i.e. more intense) activation patterns tend to be embedded in clusters that are far away from fringe areas, whereas weaker, noisier activations that correspond to misclassifications will stray away from such clusters, and exist between two or more of them, resulting in multiple parts of the network being activated in such a way that the hard classification output becomes corrupted. Let us comment on two examples. The fourth digit from the left in the first row, a 4 classified as a 2, elicits a weak hidden activation sequence that is interpreted by the autoencoder as intermediate between the and output categories. Similarly, the sixth digit from the right in the second row, a 6 misclassified as a 4, elicits two responses in the network: it activates regions associated mostly with and with at the same time. The strongest one, possibly due to its location and small loop size, turns out to be . 
In activation space, this translates into a hidden activation sequence that belongs to the typical activations of neither nor , and so the sequence rests halfway between the typical clusters associated with those output categories.
We can empirically conclude that, after taking appropriate measures to avoid overfitting, and assuming that the training dataset is sufficiently varied and representative of the true distributions, abnormal behaviors will exist in between clusters representing defined, strong activation patterns and, thus, may be detected and corrected, insofar as the corruption of the input information is not severe to the point of producing an activation sequence typical of another category. More research is needed to study the presence of outliers and extreme adversarial examples, as well as to provide a theoretical model that can explain these results. The following subsections will attempt to provide illustrative examples of the applicability of this method to some current problems in deep learning.
3.4 Dataset noise and network performance
A first problem, which is easy to study, consists in visualizing the performance of a network when trained with clean data and then evaluated with data that contains noise. Such a representation would be analogous to a constellation diagram of PSK and QAM signals in digital communication systems, where symbol detection errors in a channel can be visualized as displacements from the reference input signals in a 2D map [Etten2005appA]. While the overall accuracy as a function of input noise variance can be studied as a black box, the representations provide an additional picture where we can observe the damage caused to the hidden layers’ response as noise power increases.
The experiment is summarized in Figure 5. First, a feedforward network was trained on the MNIST training set. Then, a single test set digit is selected (for Fig. 5, Sample 907 was chosen, a handwritten number one). To generate the noisy input test set, 200 samples of additive white Gaussian noise (AWGN) are generated and added to identical copies of Sample 907. This set is fed into the classifier, and its behavior is recorded. The recorded hidden states are then given to the autoencoder, which generates 200 representations in the 2D activation map. Very much like in quadrature amplitude modulation (QAM) constellation diagrams, the network’s response strays away from its original response as input noise power is increased. This deviation is represented in subplots (a) through (d) for increasing values of input noise power for Net1, which has never experienced AWGN before and shows a preference to misclassify noisy ones as twos and fours for very small noise variance (note Fig. 5.(c), for instance). This preference could be thought of as a susceptibility to adversarial noise in a specific direction in activation space.
Secondly, a clone of Net1 is trained again for the same number of epochs, but this time noise is injected at the input layer (AWGN). To improve its capability to classify under all sorts of input noise power up to a limit, domain randomization is included in the training routine, so that the noise power can vary from noiseless classification to a maximum noise level for each minibatch. Its activation space map (or its classification constellation diagram) is plotted in Figs. 5.(e) through 5.(h), presenting a much more stable response to Sample 907 under various levels of input AWGN power.
This observable improvement in generalization power by introducing noise in the model has been shown empirically before [Zur2009, Rakin2018], by studying the overall accuracy of the network. The presented observations suggest that noise injection may be modifying the type of implicit function that the network finds optimal for minimizing the cost function. After all, the noise-injected network responds similarly to inputs that produce radically different output responses in Net1.
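The construction of the noisy test set and the per-minibatch noise randomization described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the noise levels (0.1 and 0.5), the uniform sampling of the noise level, and the random stand-in for Sample 907 are all hypothetical, since the original values are not recoverable from the text.

```python
import numpy as np

def make_noisy_copies(sample, n_copies, noise_std, rng):
    """Return n_copies of `sample`, each with independent AWGN added."""
    noise = rng.normal(loc=0.0, scale=noise_std,
                       size=(n_copies,) + sample.shape)
    return sample[None, ...] + noise

def randomized_noise_batch(batch, max_std, rng):
    """Domain randomization: draw one noise level per minibatch.

    Assumption: the level is drawn uniformly between 0 (noiseless)
    and max_std; the source only says it varies per minibatch.
    """
    std = rng.uniform(0.0, max_std)
    return batch + rng.normal(0.0, std, size=batch.shape), std

rng = np.random.default_rng(0)

# Stand-in for a flattened 28x28 MNIST digit (hypothetical data).
sample = rng.random(784)

# 200 noisy copies of the same digit, as in the evaluation experiment.
noisy_set = make_noisy_copies(sample, n_copies=200, noise_std=0.1, rng=rng)

# One noise-injected training minibatch with a randomized noise level.
clean_batch = rng.random((32, 784))
noisy_batch, used_std = randomized_noise_batch(clean_batch, max_std=0.5, rng=rng)
```

The noisy copies would then be fed to the classifier, and the recorded hidden activations passed to the autoencoder to obtain the 200 points in the activation map.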
3.5 Representation of adversarial examples
Having a visual representation of all possible activations may also be helpful in the understanding of adversarial attacks. As defined in Section 2.10, FGSM was deployed as a typical model of adversarial noise. The gradient of the cost function for a desired output category was calculated, and the input was modified so as to minimize this cost in a series of steps. For this particular configuration of FGSM, the number of steps was 100. The experiments were prepared on a trained Net2. Figure 6 provides six examples from the CIFAR10 test set. Each of the subplots in every row represents one of the ten possible adversarial attacks for this dataset.
A few interesting conclusions can be drawn from these visualizations. First, the reader can notice that the trajectories followed in activation space are different for each input image, but there are common patterns amongst them. A successful adversarial attack occurs when the sequence of hidden activations is modified in such a way that the output layer provides the target category instead of the correct one. In activation space, this is visualized as a trajectory that begins at a point corresponding to an adequate activation sequence and then gradually becomes a hidden activation sequence that provides the desired, wrong classification. In many cases, these paths are traveled by more than one adversarial example, suggesting that certain activation sequences represent some sort of gateway to very specific activations. Second, not all FGSM attacks succeed after 100 steps, suggesting that some activation steps and inputs are more resilient to adversarial attacks than others. For example, in the third case from the top of Figure 6, the network can be easily convinced that the white car is in fact a ship, a plane, or a truck, but is resilient to wrongly identifying the image as a bird, a deer, a dog, or a horse. Third, the trajectories cross regions with high estimated classification errors, suggesting a series of typical scenarios where an adversarial attack could be detected at these boundary crossings by a properly trained error/confidence estimator. More research is needed to study the exact structures visualized here; perhaps the adversarial trajectories follow a pattern that could be detected and countered accordingly.
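The iterative, targeted attack described at the start of this subsection can be sketched as below. This is only an illustration under loud assumptions: a randomly initialized linear softmax classifier stands in for the trained Net2, the inputs are random vectors rather than CIFAR10 images, and the step size of 0.01 is hypothetical (the value used in the experiments is missing from the text). Only the number of steps (100) comes from the source.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def targeted_fgsm(x, target, W, b, step_size, n_steps):
    """Iteratively step along -sign(gradient) of the target-class cost.

    The model is a linear softmax classifier (stand-in, not Net2), for
    which the cross-entropy gradient w.r.t. the input is W^T (p - onehot).
    Returns the list of visited inputs, i.e. the attack trajectory.
    """
    onehot = np.zeros(W.shape[0])
    onehot[target] = 1.0
    trajectory = [x.copy()]
    for _ in range(n_steps):
        p = softmax(W @ x + b)
        grad = W.T @ (p - onehot)
        # Move against the gradient sign and keep a valid input range.
        x = np.clip(x - step_size * np.sign(grad), 0.0, 1.0)
        trajectory.append(x.copy())
    return trajectory

rng = np.random.default_rng(0)
n_classes, n_features = 10, 64            # hypothetical sizes
W = rng.normal(size=(n_classes, n_features))
b = np.zeros(n_classes)
x0 = rng.random(n_features)

p_start = softmax(W @ x0 + b)
target = int(np.argmin(p_start))          # attack the least likely class
trajectory = targeted_fgsm(x0, target, W, b, step_size=0.01, n_steps=100)
p_end = softmax(W @ trajectory[-1] + b)
```

In the setting of this paper, the hidden activations recorded at each point of `trajectory` would be passed through the autoencoder to draw the path in activation space.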
3.6 A brief study on error estimate reliability
In some of our experiments, we have assumed that the output error estimates given by the self-introspective system should be reliable enough to detect most misclassified samples. Such a claim needs to be verified quantitatively. To test this hypothesis, we have analyzed the output of the self-introspective module on the test set for both MNIST and CIFAR10, on fully trained networks. Being in possession of the ground truth for the set, we have split the dataset into two categories: samples correctly classified by the network, and samples that have been misclassified. Then, the distributions of their respective error estimates are represented in green and red, respectively, by the violin plots of Figure 7. This result shows how misclassified images are, on average, two to three orders of magnitude worse than the correctly classified data for MNIST, in terms of expected classifier error. The case for CIFAR10 is not as simple, and thus we can observe how the error is within to for key categories that are systematically misclassified, such as cats, birds and dogs.
There are plenty of things to say about these plots. First, we can observe that a well-trained neural network can show a consistent difference between misclassified samples and correct classifications in a dataset with consistent differentiability. Such is the case for the MNIST classifier, shown in the top plot, where the percentage of correct classifications for each of the individual categories is shown on top of each pair of violins. Unfortunately, there is more to this situation than meets the eye. In general, datapoints that are able to cross the frontiers between two or more well-known regions of neural network responses will appear to the estimator network as if they were legitimate classifications, and it will respond with a low expected error. This phenomenon can be observed for CIFAR10 in Figure 2.(h), where the ’cat’ and ’dog’ categories have breached each other’s limits, with plenty of datapoints infiltrating the other response. In the violin plots, this can be seen in the shapes of the misclassified distributions, as a small fraction of the wrongly classified samples are evaluated as correct classifications. It is also possible that the network we trained for CIFAR10 is not optimally trained (our test set accuracy remains about 10% below the state of the art [Springenberg2014], perhaps due to adding densely connected layers after the global averaging), and thus, thanks to the self-introspective layer, we can observe how the complete model could be improved through further training. This is in many ways a positive result: this network assigns higher estimated errors to output categories where its performance is poorer, showing that it is aware that its expected performance will be worse than for other categories.
Much more research is needed to better understand how these error estimation functions could be improved to detect abnormal neural network behavior, but for now we can observe, as a general rule, that consistently learned images can be observed as well-known responses in this new domain, whereas images that are less evident to the network will provide less typical classifier activations and, thus, a greater likelihood of producing an erroneous result. How this domain is handled, and how these error and confidence functions are defined, should be studied in the future.
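The reliability check used in this subsection, splitting the test set by classification outcome and comparing the two distributions of error estimates, can be sketched as follows. The function names and the toy arrays are hypothetical, and summarizing the separation as a difference of median log10 values is an assumption made for illustration; the paper itself reports the distributions as violin plots.

```python
import numpy as np

def split_error_estimates(y_true, y_pred, err_est):
    """Split error estimates into (correctly classified, misclassified)."""
    correct = y_true == y_pred
    return err_est[correct], err_est[~correct]

def log10_gap(err_correct, err_wrong):
    """Separation between both groups, in orders of magnitude.

    Assumption: medians of log10 estimates are used as the summary.
    """
    return np.median(np.log10(err_wrong)) - np.median(np.log10(err_correct))

# Toy stand-in data (not results from the paper).
y_true = np.array([0, 1, 2, 3, 4, 5])
y_pred = np.array([0, 1, 2, 3, 9, 9])            # last two are misclassified
err_est = np.array([1e-4, 1e-4, 1e-4, 1e-4, 1e-1, 1e-1])

err_correct, err_wrong = split_error_estimates(y_true, y_pred, err_est)
gap = log10_gap(err_correct, err_wrong)          # orders of magnitude apart
```

Applying the same split per output category would give the paired distributions that each pair of violins represents.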
4 Conclusions
In this experimental work, we have examined the idea of studying the behavior of hidden layers in a deep neural network by means of unsupervised dimensionality reduction on records of hidden activation sequences. This was achieved by the use of a well-known network, a variational autoencoder, with which an interactive activation atlas can be produced. The autoencoder provides an additional layer of flexibility: it can be interrogated with new, incoming data, and additional networks and operations can be layered on top of its representations to further allow us to analyze and predict classifier behavior.
Understanding the underlying structure of a deep neural net is complicated, amongst other reasons, due to the difficulty in handling the vast number of connections, activations, and weights, and their final influence on the output layer. By applying nonlinear dimensionality reduction to the hidden layers, we can observe that a network responds very similarly to inputs that should be considered identical by the output layer. This, in turn, means that a network responds with an activation signature when the input resembles a specific subset of the training data. Consequently, misclassification is likely to occur whenever the sequence strays away from typical signatures. Well-known states for a network exist in a low-dimensional embedding, and are very sensitive to input perturbations that differ from those observed during training.
The potential of unsupervised dimensionality reduction as an essential part of model evaluation and design is not in itself a novel concept; see [Andry2016, Olah2017, Olah2018, carter2019]. Studying a network with another network, on the other hand, does not seem to have been considered to date. For now, these experiments allow us to consider model evaluation as another interactive element residing within the model architecture. Self-introspection, understood as the addition of an autoencoder and secondary functions that can analyze and regulate the behavior of a given classifier, can also serve as a human interface for model reliability in more complex systems. For instance, in ensembles or assemblies of classifiers, the individual self-confidence of each network can be studied as the inverse of its expected error. Such a metric could be used to weigh the influence of each classifier on the output of the ensemble. Since the complete system is a neural network in itself, the activations for new incoming datapoints could be plotted in an activation map, and their expected accuracies could be estimated. Such an application could be helpful in the detection of adversarial attacks in complicated algorithms (e.g. face and voice recognition), where the autoencoder could serve as a private key: if the attack is subtle, but sufficient to trick the classifier, the activations are likely to end up far away from well-known classification sequences, and thus be noticed by the autoencoder, whose gradient cannot be studied by the attacker as long as its parameters are kept secret.
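The ensemble weighting idea mentioned above, where each classifier's self-confidence is taken as the inverse of its expected error, can be sketched as below. This is one possible reading, not a method the paper specifies: normalizing the confidences to sum to one, combining class probabilities by a weighted average, the small constant `eps`, and all names and numbers are assumptions for illustration.

```python
import numpy as np

def confidence_weighted_vote(probs, expected_errors, eps=1e-12):
    """Combine classifier outputs using inverse expected error as weight.

    probs:           array (n_models, n_classes) of class probabilities
    expected_errors: array (n_models,) of estimated errors for this input
    eps:             guard against division by zero (assumption)
    """
    confidence = 1.0 / (np.asarray(expected_errors, dtype=float) + eps)
    weights = confidence / confidence.sum()      # normalize to sum to 1
    combined = weights @ np.asarray(probs, dtype=float)
    return combined, weights

# Toy example: two classifiers disagree; the first reports a low error.
probs = [[0.9, 0.1],
         [0.2, 0.8]]
expected_errors = [0.01, 0.99]

combined, weights = confidence_weighted_vote(probs, expected_errors)
winner = int(np.argmax(combined))
```

Here the classifier with the lower expected error dominates the combined output, which is the behavior the weighting is meant to produce.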
Whereas studying how the inputs elicit specific activations provides a bottom-up explanatory framework, applying unsupervised dimensionality reduction to the set of all possible activations inside a classifier results in somewhat of a top-down approach, where many interesting phenomena can be observed and analyzed as geometrical objects. This means, in turn, that current successful approaches in neural feature visualization could benefit from this approach. For example, results similar to those in Activation Atlases [carter2019] could perhaps be redefined from a series of hierarchical, low-dimensional representations, by applying Feature Visualization at the bottleneck of an autoencoder that studies the activation space of the first convolutional layers of a network. Such a result could be achieved by modifying the input so that the hidden activation pattern is located at a specific point in activation space. Other methods, such as Neural Style Transfer [Gatys2015ANA], could study the location of specific artistic styles as locations in the activation subspace of the first convolutional layers of a pretrained model, and tune the final result in this low-dimensional representation to their particular interests. Activations in recurrent neural networks (RNNs) could be encoded in a low-dimensional space, showing typical and atypical trajectories, and possibly enforcing corrections not on their input/output data, but instead on the output of the autoencoder, by rewarding specific trajectories and discarding others.
In general, applying interactive, explicit dimensionality reduction methods opens up a window of explainability that is easy to replicate (only an unsupervised neural network is needed) and whose output is easily exploitable in further calculations and training processes. More research is required in order to understand the reliability, structure and significance of these representations and how they can aid us in more complex problems.
Acknowledgments
The authors would like to thank Samuel S. Streeter and Benjamin W. Maloney from Thayer School of Engineering at Dartmouth College (03755 Hanover, NH) for the multiple fruitful discussions regarding this work and its potential future applications.
Research reported in this article was funded by projects R01 CA192803 and F31 CA196308 (National Cancer Institute, US National Institutes of Health), FIS201019860 (Spanish Ministry of Science and Innovation), TEC201676021C22R (Spanish Ministry of Science, Innovation and Universities), DTS1700055 (Spanish Ministry of Economy, Industry and Competitiveness and Instituto de Salud Carlos III), INNVAL 16/02 (IDIVAL), and INNVAL 18/23 (IDIVAL), as well as PhD grant FPU16/05705 (Spanish Ministry of Education, Culture, and Sports) and FEDER funds.