Related Work
Due to the significant potential of Interpretable Neural Networks, a lot of research has been conducted on this topic [bratko1997machine, ridgeway1998interpretable, nauck1999obtaining]. Some of the most popular techniques for visual interpretability rely on backpropagating the network’s prediction through the network, to compute its derivative with respect to the input features (i.e. the pixels in the image). Through this, the contribution of each feature towards the final prediction is calculated [zeiler2014visualizing, simonyan2013deep, springenberg2014striving, ribeiro2016should, sundararajan2017axiomatic, lundberg2017unified, shrikumar2017learning]. These techniques, however, have been shown to have some major flaws, which render them unfit for generating reliable explanations [mahendran2016salient, ghorbani2019interpretation, kindermans2019reliability, nie2018theoretical, adebayo2018sanity].

The most prominent example of interpretability in CNNs is Class Activation Mapping (CAM) [zhou2016learning]. This technique involves concluding a CNN with a Global Average Pooling (GAP) layer and a single Fully Connected (FC) layer. This allows the FC layer to learn a mapping directly from the high-level features extracted by the final convolutional layer to the classes. Thus, for any given image, it is possible to identify which extracted features activate each class the most. By projecting these features onto the original image, a map of which area activates each class can be generated. This, in turn, allows a CNN to provide a degree of reasoning for its decision, i.e. where it is looking when making a prediction. However, this comes at a cost to the model’s complexity and therefore performance [selvaraju2017grad, tagaris2019high], essentially sacrificing the model’s Fidelity for Interpretability. The same principle has been applied in other studies as well, which tried to improve results by replacing the GAP operation with either Global Max Pooling [oquab2015object] or a log-sum-exp pooling operation [pinheiro2015image].

Another approach, called Grad-CAM [selvaraju2017grad], proposes an extension of CAM, where the importance of each feature map is determined not by the FC weights, but by the backward-propagating gradients. This allows for the use of a much broader range of architectures (e.g. no GAP required, more than one FC layer allowed), which lets networks achieve the same level of interpretability without sacrificing performance. While these methods produce maps with continuous values, a binary mask (Def. Document) can be generated via a simple threshold.

The core issue of CAM-based approaches is that they rely on the final convolution layer to generate the features that form the CAM. This layer, though, usually has a much smaller resolution than the input image, and its CAM requires an image upscaling technique (e.g. bilinear interpolation) to be projected on the original image. This causes the CAM to be rather coarse (no fine details), thus reducing the interpretability of the model. This has been addressed in another study with the addition of an unsupervised network to improve the resolution and details of the CAMs [tagaris2019high], but to little effect.

Another family of techniques strives to achieve Interpretability by classifying perturbations of the input image. One way is to occlude different patches of the image and identify which occlusions result in a lower classification score (i.e. when relevant objects are occluded, the model will have a hard time classifying the image) [zeiler2014visualizing]. Another study classifies many patches containing a given pixel and then averages the class-wise scores of these patches to obtain the pixel’s class-wise score [oquab2014learning]. These approaches are computationally more expensive than the CAM-based methods, as they require multiple passes of an input image. However, at the cost of this higher computational complexity, they can theoretically achieve an arbitrarily high level of Interpretability while maintaining model Fidelity; in practice, though, the computation required for high levels of interpretability makes them impractical.

Interestingly, the first approach is conceptually similar to ours. However, instead of relying on heuristically hiding parts of the image, we propose a fully trainable process, offering a higher degree of both Fidelity and Interpretability at a much lower computational cost. The proposed framework consists of two networks: one tasked with producing input masks, and an image classifier. The two are trained jointly, in a collaborative manner, on two objectives: to minimize the classification loss and to maximize the number of masked pixels. The joint model learns to hide as much of the image as possible, while retaining a high classification performance. The main contributions of this study can be summarized as follows:

A novel framework for discriminative localization in classification tasks is proposed. Instead of relying on coarse CAMs or computationally expensive image perturbations, our framework is fully trainable, producing the finest masks possible in a single inference.

The above framework also allows manual designation of the FIR (Def. Document), i.e. the user can decide how fine or coarse the masks will be (fine masks can come at the cost of performance). This way the user can actively choose where the model stands regarding the fidelity-interpretability tradeoff.

Given this choice, the model is guaranteed to have the highest level of fidelity (i.e. highest possible classification performance) or interpretability (i.e. finest masks possible).

An exploration is conducted on the cost of interpretability to a model, i.e. how much of its performance a model needs to sacrifice to be interpretable.
Theoretical background
One main component of the framework is the production of the binary mask (Def. Document). To achieve this, the continuous output of the previous layer must be converted to binary. This paper explores two ways of performing this conversion: a deterministic approach and a stochastic one.
Binary Neuron formulation
The mathematical formulation for both types of binary neurons will be presented here.
Deterministic thresholding
The most intuitive way of converting a continuous value to a binary one is to select a threshold and set the value equal to $1$ or $0$ if it is above or below the threshold, respectively. Given a continuous variable $x$ and a threshold $\theta$, this is equivalent to a Heaviside step function offset by $\theta$:
\begin{equation}
H(x) = \begin{cases} 1, & x \geq \theta \\ 0, & x < \theta \end{cases}
\end{equation}
To effectively select the threshold $\theta$, the continuous values need to be normalized to a predetermined range (e.g. $[0, 1]$). Practically, this can be achieved by applying a sigmoid function to $x$.
\begin{equation}
\sigma(x) = \frac{1}{1 + e^{-x}}
\end{equation}
Though the sigmoid function can easily be added after any neural network layer, this is not the case for the step function (Eq. Document). From its definition, the derivative of the step function is $0$ everywhere besides $x = \theta$, where $H$ is not differentiable.
\begin{equation}
\frac{dH}{dx} = \begin{cases} 0, & x \neq \theta \\ \text{undefined}, & x = \theta \end{cases}
\end{equation}
Normally, this would prevent the error from backpropagating to layers preceding the step function, which would make their training impossible. A typical workaround is to ignore this layer during backpropagation (i.e. treat it as an identity function $f(x) = x$, which has a derivative of $1$).
\begin{equation}
\frac{dH}{dx} \approx \frac{df}{dx} = 1
\end{equation}
The name deterministic thresholding attempts to distinguish this approach from the stochastic one that will be described in Section Document.
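As an illustration of the forward and backward behaviour described in this section, the following is a minimal pure-Python sketch rather than the implementation used in the study. The threshold of 0.5, the convention that a value exactly at the threshold maps to 1, and the function names are all assumptions made for the example.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function; normalizes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def step(p: float, threshold: float = 0.5) -> int:
    """Heaviside step offset by `threshold` (assumed 0.5): 1 at or above it, 0 below."""
    return 1 if p >= threshold else 0


def step_backward(upstream_grad: float) -> float:
    """Backward pass of the step, treated as an identity (derivative of 1)."""
    return upstream_grad * 1.0


# Forward pass: normalize with the sigmoid, then binarize.
values = [-2.0, -0.1, 0.0, 0.3, 4.0]
binary = [step(sigmoid(v)) for v in values]
print(binary)  # [0, 0, 1, 1, 1]

# Backward pass: the incoming gradient passes through unchanged.
print(step_backward(0.25))  # 0.25
```

The forward pass discards all magnitude information, while the backward pass pretends the binarization never happened; this mismatch is what makes the resulting gradient an approximation.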
Binary Deterministic Neurons
By combining the aforementioned ideas, a new type of neuron can be defined, the Binary Deterministic Neuron (BDN). This includes an affine transformation, a nonlinear activation function (i.e. sigmoid) and a deterministic thresholding function:
\begin{equation}
\mathrm{BDN}(x) = H(\sigma(W x + b))
\end{equation}
where $W$ and $b$ are trainable parameters and $h = W x + b$. The derivatives of this layer with respect to its parameters $W$ and $b$ are:
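\begin{align}
\frac{\partial H}{\partial W} &= \frac{\partial H}{\partial \sigma} \cdot \frac{\partial \sigma}{\partial h} \cdot \frac{\partial h}{\partial W} \\
&\approx 1 \cdot \sigma(h) \cdot (1 - \sigma(h)) \cdot \frac{\partial (W x + b)}{\partial W} \\
&= \sigma(h) \cdot (1 - \sigma(h)) \cdot x
\end{align}
and
\begin{align}
\frac{\partial H}{\partial b} &= \frac{\partial H}{\partial \sigma} \cdot \frac{\partial \sigma}{\partial h} \cdot \frac{\partial h}{\partial b} \\
&\approx 1 \cdot \sigma(h) \cdot (1 - \sigma(h)) \cdot \frac{\partial (W x + b)}{\partial b} \\
&= \sigma(h) \cdot (1 - \sigma(h))
\end{align}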
This is only an approximation of the gradient, as the derivative of the step function was substituted with $1$ (Eq. Document). Another way to think about this would be as having an infinitely steep sigmoid function (i.e. one that resembles a step function).
The same principle can be applied to convolutional layers simply by substituting the affine transformation with its convolutional equivalent.
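The approximate BDN gradients can be checked numerically on a scalar neuron. The sketch below is illustrative only: the scalar weight, bias and input values, the 0.5 threshold and the function names are assumptions, and a real layer would use tensors instead of floats.

```python
import math


def sigmoid(h: float) -> float:
    return 1.0 / (1.0 + math.exp(-h))


def bdn_forward(w: float, b: float, x: float, threshold: float = 0.5):
    """Affine transformation, sigmoid, then deterministic threshold (assumed 0.5)."""
    h = w * x + b
    s = sigmoid(h)
    out = 1 if s >= threshold else 0
    return out, s


def bdn_grads(w: float, b: float, x: float):
    """Approximate gradients w.r.t. w and b, with the step's derivative replaced by 1."""
    _, s = bdn_forward(w, b, x)
    dsigma_dh = s * (1.0 - s)          # derivative of the sigmoid
    return dsigma_dh * x, dsigma_dh    # (grad w.r.t. w, grad w.r.t. b)


# Hypothetical scalar values, chosen only for the demonstration.
w, b, x = 0.8, -0.2, 1.5
out, s = bdn_forward(w, b, x)
grad_w, grad_b = bdn_grads(w, b, x)

# Central finite difference of the sigmoid alone, for comparison with grad_b.
eps = 1e-6
numeric = (sigmoid(w * x + b + eps) - sigmoid(w * x + b - eps)) / (2 * eps)
print(out, round(grad_w, 4), round(grad_b, 4))
```

The finite difference is taken through the sigmoid only; differentiating through the step itself would give zero almost everywhere, which is exactly why the substitution is needed.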
Stochastic thresholding
Another thought would be to convert a continuous variable to binary through a stochastic thresholding function. The motivation for this approach arose from the various visual attention models, which saw better results with stochastically generated attention masks (what they refer to as “hard attention”) rather than deterministic ones (i.e. “soft attention”) [mnih2014recurrent, ba2014multiple, xu2015show]. Contrary to the step function (Eq. Document), which is deterministic (i.e. if $x$ is greater than the threshold, $H(x)$ will always be equal to $1$), stochastic functions carry with them a degree of uncertainty. Here, there isn’t a threshold to compare to; instead, $x$ becomes $1$ or $0$ at random, with a probability equal to its value. Obviously, $x$ needs to be normalized (through Eq. Document) for this to work properly.
\begin{equation}
\tilde{H}(\sigma(x)) = \begin{cases} 1, & \text{with probability } \sigma(x) \\ 0, & \text{with probability } 1 - \sigma(x) \end{cases}
\end{equation}
This means that if $\sigma(x) = p$, $\tilde{H}(\sigma(x))$ will be equal to $1$ with a probability of $p$. In this sense, $\tilde{H}(\sigma(x))$ is essentially a Bernoulli random variable with parameter $\sigma(x)$. Computing the derivative of this function is impossible, even by making the approximation of Equation Document. The problem, in this case, is the noise induced by the stochasticity of the operation, which needs to be taken into account in the approximation.
Binary Stochastic Neurons
Similar to a BDN (introduced in Sec. Document), a Binary Stochastic Neuron (BSN) performs an affine transformation, a nonlinear activation function and a stochastic threshold:
\begin{equation}
\mathrm{BSN}(x) = \tilde{H}(\sigma(W x + b))
\end{equation}
where $\tilde{H}$ denotes the stochastic thresholding function. Backpropagating through this neuron is trickier, due to its stochastic nature. Since a derivative can’t be computed in the traditional way, it has to be estimated.
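To make the forward behaviour of a BSN concrete, the sketch below samples a scalar neuron many times and compares the empirical firing frequency with the sigmoid output. It is a minimal illustration under assumed scalar parameters and a fixed random seed, not the layer used in the experiments.

```python
import math
import random


def sigmoid(h: float) -> float:
    return 1.0 / (1.0 + math.exp(-h))


def bsn_forward(w: float, b: float, x: float, rng: random.Random) -> int:
    """Affine transformation, sigmoid, then a Bernoulli draw with p = sigmoid(h)."""
    p = sigmoid(w * x + b)
    return 1 if rng.random() < p else 0


rng = random.Random(0)          # fixed seed so the run is repeatable
w, b, x = 0.8, -0.2, 1.5        # hypothetical scalar values
p = sigmoid(w * x + b)

samples = [bsn_forward(w, b, x, rng) for _ in range(20000)]
freq = sum(samples) / len(samples)
print(f"sigmoid output: {p:.3f}, empirical frequency of 1s: {freq:.3f}")
```

Unlike the deterministic neuron, repeated forward passes on the same input give different outputs, and only their average recovers the sigmoid value.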
Gradient estimation
The simplest way of formulating the problem of gradient estimation is to consider an architecture where the noise is not inherently part of the units but is injected into them. This way the output of each BSN is a function (i.e. the step function in Eq. Document) of both a noise source $z$ and the result of a transformation on its inputs (in this case $\sigma(h)$, with $h = W x + b$).
\begin{equation}
\mathrm{BSN}(x) = f(\sigma(h), z) = \begin{cases} 1, & \sigma(h) \geq z \\ 0, & \sigma(h) < z \end{cases}, \quad z \sim U[0, 1]
\end{equation}
This decomposition of the neuron’s function creates a path for the gradients to flow during backpropagation; the only problem is the estimation of the gradient despite the noise $z$. Four separate gradient estimators will be examined for this purpose. All of these will be viewed through the lens of a single BSN, but without loss of generality they can be applied to any and every BSN in the network.
Likelihood-ratio estimation
The most general technique for estimating the gradient through stochastic units is the Likelihood-ratio method [glynn1990likelihood], also known as the Score Function estimator [kleijnen1996optimization] or the REINFORCE estimator [williams1992simple]. The goal of a non-stochastic network is to minimize a cost function $J$ w.r.t. the trainable parameters of the network (i.e. $W$ and $b$). This can be done through gradient descent, which requires the computation of the gradient of $J$ w.r.t. these parameters. Due to the existence of one or more noise sources $z$ in the network (which directly affect the value of the cost function), the goal in stochastic networks is to minimize the expected value of the cost over the noise sources w.r.t. the same parameters, through its gradient $g$. For a single BSN with pre-activation $h = W x + b$, this gradient reaches the parameters through $h$:
\begin{equation}
g = \frac{\partial \mathbb{E}_z[J]}{\partial h}
\end{equation}
The computation of $g$ is infeasible and has to be estimated. Through the likelihood-ratio method, the unbiased estimator $\hat{g}_u$ can be derived. This takes the following form.
\begin{equation}
\hat{g}_u = (\mathrm{BSN}(x) - \sigma(h)) \cdot J
\end{equation}
The subscript $u$ is added to denote that this is the uncentered form of the estimator. A proof for the equation above is provided by [bengio2013estimating]. This estimator is convenient as it only requires broadcasting the cost throughout the network (i.e. no need for backpropagation), which makes training the network possible.
Variance reduction
Even though the above estimator is unbiased, it has quite a high variance. There are several methods that help reduce this; the most prominent is subtracting a baseline from the cost. The baseline that leads to the greatest reduction in variance for the estimator is:
\begin{equation}
\bar{J} = \frac{\mathbb{E}\left[(\mathrm{BSN}(x) - \sigma(h))^2 \cdot J\right]}{\mathbb{E}\left[(\mathrm{BSN}(x) - \sigma(h))^2\right]}
\end{equation}
Out of all possible baselines, this results in the lowest variance, while not increasing the estimator’s bias. The proof of this can also be found in [bengio2013estimating]. Because $\mathrm{BSN}(x)$ and $\sigma(h)$ are specific to a single neuron, the baseline can be thought of as a weighted average of the cost values $J$, whose weights are specific to that particular neuron [bengio2013estimating]. By subtracting Eq. (Document) from the cost in Eq. (Document), the centered estimator is derived:
\begin{equation}
\hat{g}_c = (\mathrm{BSN}(x) - \sigma(h)) \cdot (J - \bar{J})
\end{equation}
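The effect of the baseline can be illustrated with a small Monte Carlo experiment on a single stochastic unit. The sketch below assumes the estimator form given in [bengio2013estimating], i.e. (output minus sigmoid) times cost, taken w.r.t. the pre-activation; the toy cost values, the Gaussian noise standing in for the rest of the network, and all names are hypothetical choices for the demonstration.

```python
import math
import random


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


rng = random.Random(0)
a = 0.4                          # hypothetical pre-activation of the unit
p = sigmoid(a)
COST_ON, COST_OFF = 3.0, 1.0     # toy cost when the unit outputs 1 or 0


def sample():
    """Draw the unit's binary output and a noisy cost that depends on it."""
    out = 1 if rng.random() < p else 0
    cost = (COST_ON if out else COST_OFF) + rng.gauss(0.0, 0.5)
    return out, cost


draws = [sample() for _ in range(50000)]

# Uncentered likelihood-ratio estimates: (output - sigmoid) * cost.
uncentered = [(out - p) * cost for out, cost in draws]

# Baseline: cost averaged with weights (output - sigmoid)^2, estimated from samples.
baseline = (mean([(out - p) ** 2 * cost for out, cost in draws])
            / mean([(out - p) ** 2 for out, cost in draws]))
centered = [(out - p) * (cost - baseline) for out, cost in draws]

# Exact gradient of the expected cost w.r.t. the pre-activation, for reference.
true_grad = p * (1.0 - p) * (COST_ON - COST_OFF)

print(f"true gradient    : {true_grad:.4f}")
print(f"uncentered mean  : {mean(uncentered):.4f}  variance: {variance(uncentered):.4f}")
print(f"centered mean    : {mean(centered):.4f}  variance: {variance(centered):.4f}")
```

In this toy setting both estimators average to roughly the same value, but the centered one does so with a much smaller spread, which is the practical reason for subtracting the baseline.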
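Straight-Through estimators
Another idea would be to completely ignore the step function and the noise induced by the stochasticity of the neuron. This is referred to as the Straight-Through estimator (ST). There are two variants of this idea, depending on whether or not the derivative of the sigmoid is ignored during computation. The two estimates of the derivative of $\mathrm{BSN}(x)$ w.r.t. the weights $W$ would be
\begin{equation}
\frac{\partial \mathrm{BSN}}{\partial W} \approx \frac{\partial \sigma}{\partial h} \cdot \frac{\partial h}{\partial W} = \sigma(h) \cdot (1 - \sigma(h)) \cdot x
\end{equation}
or
\begin{equation}
\frac{\partial \mathrm{BSN}}{\partial W} \approx \frac{\partial h}{\partial W} = x
\end{equation}
These two estimators are very simple to compute, but they are clearly biased, as their expected value generally differs from the true gradient. The reason for this bias is the discrepancy between the functions in the forward and backward pass.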
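The two Straight-Through variants differ only in whether the sigmoid derivative is kept in the backward pass. The sketch below shows both for a scalar neuron; which variant is labelled "v1" and which "v2" is an assumption here, as are the function names and example values.

```python
import math


def sigmoid(h: float) -> float:
    return 1.0 / (1.0 + math.exp(-h))


def st_keep_sigmoid(w: float, b: float, x: float) -> float:
    """ST variant that ignores only the stochastic step (assumed to be 'v1')."""
    s = sigmoid(w * x + b)
    return s * (1.0 - s) * x


def st_ignore_sigmoid(w: float, b: float, x: float) -> float:
    """ST variant that also ignores the sigmoid derivative (assumed to be 'v2')."""
    return x


w, b, x = 0.8, -0.2, 1.5   # hypothetical scalar values
print(round(st_keep_sigmoid(w, b, x), 4), st_ignore_sigmoid(w, b, x))
```

Because the sigmoid derivative never exceeds 0.25, the first variant always yields a gradient at most a quarter of the magnitude of the second for the same input.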
Slope-Annealing trick
Another approach is called the Slope-Annealing trick [DBLP:journals/corr/ChungAB16] and attempts to reduce the bias of the Straight-Through estimator. This draws inspiration from the fact that the sigmoid function becomes steeper if its argument is multiplied by a scalar larger than $1$. As this scalar increases, the sigmoid approaches the step function (Eq. Document), while remaining continuous and differentiable. Following the formulation of Equation Document:
\begin{equation}
\sigma_a(h) = \sigma(a \cdot h) = \frac{1}{1 + e^{-a h}}
\end{equation}
where $a$ is the aforementioned scalar that will be referred to as the “slope”. By applying the first Straight-Through estimator (Eq. Document) to the slope-augmented sigmoid, the estimator for this layer is derived:
\begin{equation}
\frac{\partial \mathrm{BSN}}{\partial W} \approx \frac{\partial \sigma_a}{\partial h} \cdot \frac{\partial h}{\partial W} = a \cdot \sigma_a(h) \cdot (1 - \sigma_a(h)) \cdot x
\end{equation}
The trick is to start from a value of $a = 1$ and slightly increase it during training, so that the sigmoid increasingly resembles a step function [DBLP:journals/corr/ChungAB16].
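The steepening effect of the slope can be seen directly by evaluating the slope-augmented sigmoid at a fixed input. The sketch below is illustrative: the multiplicative per-epoch schedule in `annealed_slope` and its rate are assumptions for the example, not necessarily the schedule used in this study or in [DBLP:journals/corr/ChungAB16].

```python
import math


def slope_sigmoid(h: float, a: float) -> float:
    """Sigmoid whose argument is scaled by the slope `a`."""
    return 1.0 / (1.0 + math.exp(-a * h))


def sa_grad_w(w: float, b: float, x: float, a: float) -> float:
    """Straight-Through estimate through the slope-augmented sigmoid."""
    s = slope_sigmoid(w * x + b, a)
    return a * s * (1.0 - s) * x


def annealed_slope(epoch: int, rate: float, start: float = 1.0) -> float:
    """Hypothetical schedule: the slope grows by a fraction `rate` every epoch."""
    return start * (1.0 + rate) ** epoch


# As the slope grows, the output for a small positive input approaches 1.
for a in (1.0, 5.0, 50.0):
    print(a, round(slope_sigmoid(0.2, a), 4))

print(annealed_slope(epoch=2, rate=0.1))
```

For inputs away from zero the gradient vanishes as the slope grows, which is one reason to increase the slope gradually rather than start with a steep function.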
Proposed Framework
The driving idea behind this paper is to train a Deep Learning architecture for classification, with the ability to provide a certain “reasoning” for its decision. This reasoning comes by means of discriminative localization, i.e. the ability to highlight the region in the input image that the network mainly took into account to make its decision. The goal is to produce a network that can be trained to both classify an image and produce a mask that indicates the part of that image that the network used to produce its prediction. The architecture should be able to generate the finest mask possible (i.e. highest degree of interpretability) for a given classification performance (i.e. level of fidelity). By sacrificing some of this performance, it is possible to achieve even higher degrees of interpretability. This allows for an active selection of where the framework stands on the fidelity-interpretability tradeoff.
High-level Architecture outline
As stated previously, the proposed framework consists of two networks trained jointly. The first, which is tasked with generating the binary masks, is a model whose output must have the same dimensions as its input. Because its role is to “hide” a portion of its input, this network will be referred to as the hider. The second is any regular classifier that can work well on its own in the aforementioned task and will be referred to from now on as the seeker. Thus, the only limitation lies in the design of the hider. If the input is structured (e.g. a table), the output will also be a table of the same shape. If the input data is a sequence (e.g. text), the output should be a sequence of the same length. Finally, as is our case, if the input is an image, the output should also be an image with the same resolution. There are many architectures which can be used to generate the binary masks, for example an Autoencoder or even an image segmentation network like a U-Net [ronneberger2015u].

The goal of the hider, in our case, is to produce a binary mask for the input image that will hide some of its pixels. It is trained so that it keeps the pixels with the most relevance to classification, while hiding the background. Additionally, it is rewarded for hiding as many pixels as possible from the input image, in hopes of producing concise masks. However, it is not trained directly, but through the use of the seeker, an idea inspired by Generative Adversarial Networks [goodfellow2014generative]. The seeker, on the other hand, is trained to classify the masked images produced by the hider. In contrast to the hider, the seeker has direct access to the loss function and serves the goal of backpropagating the error to the hider.

The Hide-and-Seek (HnS) framework is essentially the collaborative training of a hider and a seeker. The architecture of a HnS model for image classification is depicted in Figure Document. The input image is fed to the hider, which, with the help of a binary layer, produces a binary mask. This mask is then applied to the input and the result is fed to the seeker, which attempts to generate its prediction. This methodology could be applied to different tasks besides image classification. For example, by swapping the hider with a sequence-to-sequence model and the seeker with a sequence classifier, it could be applied to Natural Language Processing problems such as sentiment analysis. It wouldn’t be hard to imagine a model capable of masking the irrelevant words in a sentence.

The loss function has two terms: a standard classification error and a function of the number of pixels in the hider’s mask. The first term serves to train the models for classification, while the second encourages the hider to produce smaller masks. These two terms are, to an extent, competing with one another, as smaller masks might make the images harder to classify. However, if they are balanced, the trained model will be capable of producing the smallest masks that achieve a sufficiently high performance in classification.
Hider
Though there are many architectures which could be used effectively for generating a binary mask, a Convolutional Autoencoder was selected for this study, due to its simplicity. Intuitively, the hider should learn to recognize which parts of the image might be important for classification and which might not. It should therefore have sufficient capacity in its downscaling path to extract the necessary features for classification. Depending on the size of the input images and the difficulty of the task, the complexity of the hider should also be adjusted (e.g. larger networks are needed for more difficult tasks). Figure Document outlines a template architecture, which can be modified with more or fewer convolution layers to accommodate tasks of varying difficulty. The final two layers cannot be changed. The final hider layer is a convolution layer, which is meant to produce a mask with a single channel that outputs values within . Its output is fed to a binary layer, which converts its real-valued inputs to binary through either deterministic (Sec. Document) or stochastic thresholding (Sec. Document). As a convention, the binary layer is not considered to be part of the hider. Two architectures were employed in the present study, both of which follow the template of Figure Document. The first one features layers ( downscale conv, FC, upscale conv, final conv) and trainable parameters and was used on the “smaller” datasets. The second includes layers ( downscale conv, FC, upscale conv, final conv) and parameters and was used on the “large” dataset.
Seeker
The seeker is a regular CNN that could be used to perform image classification on the desired dataset without any modifications. Its input should have exactly the same shape as the hider’s, while its output is the probability that the input belongs to each class. There are no limitations regarding its capacity, though it is recommended to be lower than or equal to that of the contraction path of the hider. Three different seeker architectures were examined: a small CNN with layers ( conv, FC) and trainable parameters, a larger CNN with 6 layers ( conv, FC) and trainable parameters, and finally a ResNet50 ( layers, total parameters) [he2016deep].
Loss Function
In order to train the joint network for both classifying images and masking their content as best as possible, a loss function with two terms needed to be constructed.
The first term ($J_{clf}$) represents the classification loss between the prediction and the actual target (in our case cross-entropy). This is required to train the network for classification.
The second term ($J_{mask}$) is a measure of the amount of information that the mask allows to pass through it. While a few different metrics were considered, such as the percentage of pixels equal to $1$, or the energy of the masked image, even something as simple as the sum of the pixels of the mask worked without issue.
The weighted sum of these two terms was used as the joint network’s loss function.
\begin{align}
J &= \alpha \cdot J_{clf} + (1 - \alpha) \cdot J_{mask} \\
&= \sum_{i=1}^{N} \left[ -\alpha \cdot y_i \log \hat{y}_i + (1 - \alpha) \cdot \sum_{j=1}^{M} H_{i,j} \right]
\end{align}
The hyperparameter $\alpha$ regulates the amount each of the two losses contributes to the joint loss function.
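The joint loss can be written out in a few lines for a toy batch. The sketch below is a plain-Python illustration under assumptions: predictions are given as per-class probability lists, targets as integer class indices, masks as flat lists of 0s and 1s, and the function name and example numbers are hypothetical.

```python
import math


def joint_loss(probs, labels, masks, alpha: float) -> float:
    """Weighted sum of cross-entropy and mask size over a batch.

    probs  : list of per-class probability lists (one per sample)
    labels : list of integer class indices (one per sample)
    masks  : list of flat binary masks (one per sample)
    alpha  : weight of the classification term, between 0 and 1
    """
    j_clf = sum(-math.log(p[y]) for p, y in zip(probs, labels))
    j_mask = sum(sum(m) for m in masks)
    return alpha * j_clf + (1.0 - alpha) * j_mask


# Toy batch of two samples with two classes and four-pixel masks.
probs = [[0.5, 0.5], [0.9, 0.1]]
labels = [0, 0]
masks = [[1, 0, 1, 1], [0, 0, 1, 0]]

for alpha in (1.0, 0.5, 0.0):
    print(alpha, round(joint_loss(probs, labels, masks, alpha), 4))
```

With a weight of 1 only the cross-entropy remains, and with a weight of 0 the loss reduces to the number of unmasked pixels. Note that the two terms live on very different scales (the mask term grows with image size), which is worth keeping in mind when interpreting a particular weight.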
Relation of the loss regulator to the training objective
From Equation Document it becomes apparent that a value of $\alpha$ near $1$ leads to the joint loss being dominated by the classification loss. Empirically, this encourages the hider to not mask anything from the original image, as the classifier can use the extra information to improve its performance (which is more important to the total loss for large values of $\alpha$). On the other hand, values near $0$ encourage the hider to mask the whole image, as the classification loss becomes irrelevant. Through this parameter, the user gains control over the FIR (Def. Document), as higher values of $\alpha$ lead to an increased model fidelity, while lower values lead to an increased interpretability. However, to train a model that is actually useful, a balance between the two needs to be established. The goal set in the present study was to hide as much information as possible, without impacting the seeker’s performance. During training, on some occasions the hider converged to a suboptimal solution where it masked the exact same pixels for each image and couldn’t update further. This issue arose more frequently for smaller values of $\alpha$ and only for deterministic thresholding.
Adaptive weighting
In order to achieve the aforementioned goal and overcome the issue mentioned previously, a scheme was devised for dynamically adapting $\alpha$ during training. More specifically, the joint network started its training with $\alpha = 1$ (i.e. pure classification). The classification loss was monitored during training. If the loss stagnated for a few iterations, then the value of $\alpha$ was dropped by a small amount, causing the importance of the mask loss to increase. This encourages the hider to hide more pixels, which, in turn, makes classification harder for the seeker, leading to a temporary destabilization of the classification loss. After a while it will stabilize and stagnate again; when this happens, $\alpha$ is further decreased. This process is repeated until a very small value of $\alpha$ is reached. This technique alleviates the need for selecting a proper value for $\alpha$ beforehand, or for grid-searching over this hyperparameter. Additionally, it stabilizes the training process, because it emphasizes training the seeker early on, while the hider comes into play during the final iterations. In fact, no instances of the issue mentioned in Section Document were observed when using an adaptive $\alpha$. Furthermore, a decrease in convergence time was observed when using an adaptive $\alpha$. This speedup is speculated to be caused by the progressive training of the seeker on ever more heavily masked images. Finally, this technique also serves as an early-stopping mechanism, as the loss is monitored and gives an indication of when to stop training. Even with this technique, an issue occasionally arose where the hider would adopt extreme strategies, either masking everything or nothing at all. This means that the total loss was determined by only one of its two terms. These instances will be referred to as collapses, referencing the “mode collapse” of GANs [goodfellow2014generative].
Implementation Details
Three possible ways of speeding up the training process were examined: pretraining the hider, the seeker or both. These will be discussed in detail in Section Document. The pretraining of the hider was accomplished by training it in an unsupervised manner, like an Autoencoder: the inputs and targets are set to be the same, while the hider is trained on a reconstruction loss [goodfellow2016deep]. In the case of RGB input images, the target should be the same image converted to grayscale, to make it compatible with the single-channel binary masks. The seeker is much easier to pretrain, as this can be accomplished by training it like a regular CNN. To implement the previously mentioned adaptive scheme, a queue of the past classification losses is kept. If none of these diverge significantly from the average loss of the queue, then the value of $\alpha$ is decreased. In this case, the model’s weights are stored and the running average is flushed. The condition used was that the loss values shouldn’t fluctuate more than of the average loss.
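The queue-based adaptive scheme described above can be sketched as a small scheduler object. This is an illustration under assumptions: the window size, tolerance, decrement and lower bound below are hypothetical placeholders (the values used in the study are not reproduced here), and the class and method names are invented for the example.

```python
from collections import deque


class AdaptiveAlpha:
    """Lowers alpha whenever the recent classification losses have stagnated."""

    def __init__(self, alpha=1.0, step=0.05, min_alpha=0.05, window=3, tolerance=0.1):
        self.alpha = alpha            # start from pure classification
        self.step = step              # hypothetical decrement per stagnation
        self.min_alpha = min_alpha    # hypothetical lower bound
        self.tolerance = tolerance    # allowed deviation, as a fraction of the mean
        self.losses = deque(maxlen=window)

    def update(self, clf_loss: float) -> bool:
        """Record a loss; return True if alpha was decreased on this call."""
        self.losses.append(clf_loss)
        if len(self.losses) < self.losses.maxlen:
            return False
        avg = sum(self.losses) / len(self.losses)
        stagnated = all(abs(l - avg) <= self.tolerance * abs(avg) for l in self.losses)
        if stagnated and self.alpha > self.min_alpha:
            self.alpha = max(self.min_alpha, self.alpha - self.step)
            self.losses.clear()       # flush the queue after each decrease
            return True               # the caller would checkpoint the weights here
        return False


scheduler = AdaptiveAlpha()
history = [scheduler.update(loss) for loss in [1.0, 0.5, 0.3, 0.3, 0.3]]
print(history, scheduler.alpha)
```

In this run the loss is still falling for the first four values, so the weight is left alone; only once three consecutive losses agree to within the tolerance does the scheduler lower it and clear its queue.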
Experiments
Datasets
Experiments were performed on three datasets:

the MNIST dataset (http://yann.lecun.com/exdb/mnist/), which consists of , grayscale images of handwritten digits (10 classes in total).

the Fashion-MNIST dataset (https://github.com/zalandoresearch/fashion-mnist), which consists of , grayscale images of fashion products (10 classes in total).

the CIFAR-10 and CIFAR-100 datasets (https://www.cs.toronto.edu/~kriz/cifar.html), which consist of , RGB images distributed amongst $10$ and $100$ classes respectively.
Most of the experiments were conducted, and most of the conclusions drawn, on CIFAR-10, because its smaller size allowed for a larger number of experiments.
Evaluation criteria
There are two criteria that the models can be evaluated on: performance and variance. Performance has two components, classification performance (i.e. how accurately the seeker classifies) and masking performance (i.e. what percentage of pixels the hider masks), both of which are important. The first can be measured by the model’s Fidelity (Eq. Document), while the second through its Interpretability (Eq. Document). As discussed in Section LABEL:sec:definition, there is a tradeoff between the two, which can be regulated through the parameter $\alpha$ (Sec. Document). To properly measure this balance two additional metrics are used: the FIR (Eq. Document) and the FII (Eq. Document). Likewise, there are two forms of variance that the models can exhibit: intra-model variance (i.e. deviations in performance within the same model from epoch to epoch) and inter-model variance (i.e. deviations in performance from model to model). The first can be identified through fluctuations in the model’s training curves, while the second requires retraining the same model a number of times. These two forms of variance are depicted in Figure Document. For the experiments conducted, the models were trained for a total of times, under the same conditions, to be able to detect the latter form of variance. For simplicity, from now on both types will be referred to as variance. Note that while variance is linked to performance, low variance doesn’t require high levels of performance, only consistency. A model can have a low variance while performing poorly, if its performance doesn’t fluctuate from epoch to epoch and from training to training.
Full training vs Pretraining
The first question that arose was whether or not the hider and seeker required any form of pretraining in order for the HnS model to converge. For this reason, the same model was trained on the CIFAR-10 dataset, with four different initialization conditions (i.e. training from scratch, pretrained hider, pretrained seeker, both hider and seeker pretrained). As mentioned in Section Document, each of these models was trained times independently to properly assess the model’s variance. The first thing to check is the classification performance of the models. Figure Document illustrates the performance of each of the four initialization conditions. The grey line is the baseline performance, which was achieved by a fully trained seeker on the same dataset. By hiding portions of the image, it is expected that the performance will experience a slight drop-off, which can be attributed to the fidelity-interpretability tradeoff (see Sec. LABEL:sec:definition). The bold colored lines represent the mean performance of the models of each initialization, while the shaded area gives an indication of the inter-model variance.
The models with both components pretrained proved to be the best and most stable in this category. Interestingly, training a model initialized from a pretrained seeker proved to be the toughest of all, as its former training didn’t translate well when the hider started masking pixels. Due to the goal set in Section LABEL:sec:definition, the aim was to select models that don’t experience a steep drop-off in terms of classification performance. For this reason, only models that perform within of the baseline were examined further. A secondary task was to hide as much information as possible. For this reason an arbitrary threshold of of pixels hidden was established. If these two goals are achieved, then the model will be considered to have converged to an “optimal” solution. A total of models with a pretrained hider fulfilled these requirements. Only from each of the remaining categories managed to do so. This shows that by having a pretrained hider, it is much easier to train a HnS model and achieve optimality. It should be mentioned that a lot of the models that did not achieve this status still managed to converge to suboptimal solutions.
The masking performance of the best model of each initialization condition is depicted in Figure Document. The two models with a pretrained seeker experienced much sharper ascents, while the rest were much more stable during training. The one with just the pretrained seeker, in particular, experienced a lot of variance. This gives some indication as to why so many of these models collapsed. The pretrained hider has an interesting training curve: it starts at a rather high percentage (due to the pretraining of its component) but drops a bit as training proceeds. The reason is that, in order to increase its classification performance (which is dominant during the early stages of training due to the adaptive weighting), it allows more information to pass. When it achieves a low enough classification error, it proceeds again to hide more and more pixels, and it was actually the first of the four to converge. Perhaps the best way to assess performance in both of these areas is to project the Fidelity and Interpretability of all the models on two axes. Figure Document depicts this projection.
This paints a clearer picture regarding the different strategies. While the pretrained hider didn’t lead to any collapses, it did settle on many suboptimal solutions. In contrast, having both components pretrained led to a lot of optimal convergences, even though it suffered a few collapses.
Stochastic estimator comparison
Another objective of this research was to examine which of the stochastic estimators performs best. Four different estimators were examined: Straight-Through v1 (ST1), Straight-Through v2 (ST2), Slope-Annealing (SA) and REINFORCE. Like before, each model was trained times independently and was initialized from scratch.
In the case of the SA estimator, different rates of slope increase were examined. When applying this estimator, the slope starts off at $1$ and increases gradually during training (Sec. Document). A small increase was applied at the end of each weight update. The rate of the slope indicates the percentage of its increase at the end of each epoch. For example, a rate of means that the slope increases by a total of at the end of each epoch. First of all, the best rate needed to be identified. Five different rates were examined, including , , , and . Only the middle three had models that converged to optimal solutions, namely , and . Figure Document shows the classification performance of these three rates. This Figure, along with the previous numbers, gives an indication that smaller rates, which lead to more gradual increases of the slope, are better. The models with a rate of clearly achieve better performance, while exhibiting a lower overall variance.
The next step was to compare the different types of estimators. Out of the four, ST1 had optimal convergences, ST2 just had , SA (Rate=) had and REINFORCE . The last, while being the best known gradient estimator, couldn’t be made to work for this problem. The classification performance of these estimators can be seen in Figure Document. SA seems to be ahead of the rest, with ST2 following closely behind. The masking performance of the best model from each estimator is portrayed in Figure Document. While all three models converged to the same percentage, the two ST estimators managed to get there slightly faster. Interestingly, despite the stochasticity in the training process, these estimators don’t exhibit a lot of variance.
These two performances are known to come with a tradeoff, which cannot properly be assessed by just looking at the best model’s performance. To visualize both of the terms that enter the FII equation (i.e. Fidelity and Interpretability), a 2D projection of the models’ performance can be made (Fig. Document). In this plot, the Interpretability of a model is depicted on the x-axis, while its Fidelity is depicted on the y-axis. The performance of the models in each run is scattered as dots throughout the graph. Note that solutions near the left are considered collapsed, due to their low Interpretability. The same holds for solutions near the bottom, regarding Fidelity. Optimal solutions can be found near the top right.
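The reading of this plot can be made concrete with a small sketch that labels each run by the region it falls in and summarizes the non-collapsed runs. The collapse thresholds are not given in this section, so they are left as required parameters; the function names `classify_run` and `summarize_runs` and the example scores are hypothetical and serve only as an illustration.

```python
def classify_run(fidelity, interpretability, min_fidelity, min_interpretability):
    """Label one training run by the region of the plot it lands in.

    A run below both thresholds is reported as a Fidelity collapse
    (an arbitrary tie-breaking choice made for this sketch).
    """
    if fidelity < min_fidelity:
        return "collapsed (Fidelity)"          # near the bottom of the plot
    if interpretability < min_interpretability:
        return "collapsed (Interpretability)"  # near the left of the plot
    return "converged"


def summarize_runs(runs, min_fidelity, min_interpretability):
    """Count collapses and report mean/peak scores over non-collapsed runs."""
    labels = [classify_run(f, i, min_fidelity, min_interpretability)
              for f, i in runs]
    kept = [run for run, label in zip(runs, labels) if label == "converged"]
    summary = {
        "collapsed_fidelity": labels.count("collapsed (Fidelity)"),
        "collapsed_interpretability": labels.count("collapsed (Interpretability)"),
    }
    if kept:  # statistics exclude collapsed models
        fidelities = [f for f, _ in kept]
        interpretabilities = [i for _, i in kept]
        summary["mean_fidelity"] = sum(fidelities) / len(kept)
        summary["peak_fidelity"] = max(fidelities)
        summary["mean_interpretability"] = sum(interpretabilities) / len(kept)
        summary["peak_interpretability"] = max(interpretabilities)
    return summary


# Made-up (Fidelity, Interpretability) pairs and made-up thresholds.
example_runs = [(0.95, 0.90), (0.97, 0.10), (0.20, 0.99), (0.90, 0.70)]
print(summarize_runs(example_runs, min_fidelity=0.5, min_interpretability=0.5))
```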
All of these models achieve a high degree of Fidelity. In fact, none of the models collapsed in terms of classification performance. Interpretability, on the other hand, seems to vary a lot from model to model. Only a handful of models managed to converge to optimal solutions. A large number of models seem to be stuck in with Interpretability and Fidelity. It is unclear why this area is so common in collapses, or whether these models would ever manage to get themselves out of this region if trained for more epochs. One theory is that by hiding the wrong pixels (i.e. ones crucial for classification), the model suffers a great loss in Fidelity (), without managing to significantly increase its Interpretability. Based on this plot, it’s hard to identify the best stochastic estimator; however, ST1 seems a bit more stable than the rest.
Deterministic vs Stochastic
One of the main objectives was to determine whether stochastic thresholding would yield superior results to deterministic thresholding. The straight-through estimator (ST1) proved to be, marginally, the best of all stochastic estimators and will be used to represent the “stochastic” family. For a fair comparison, the deterministic models that were trained from scratch will be used to represent the “deterministic” family. The models were evaluated on the basis of performance (on the scales of both Fidelity and Interpretability) as well as variance, number of collapsed models and convergence speed. The quantitative results are presented in Table Document.
Metric  Det.  Stoch.
#models collapsed ( Fidelity)  1  0 
#models collapsed ( Interpretability)  2  2 
#models collapsed (total)  3  2 
mean convergence Fidelity  0.86  0.88 
peak convergence Fidelity  0.96  0.98 
mean convergence Interpretability  0.80  0.65 
peak convergence Interpretability  0.92  1.00 
mean FII (Eq. Document)  0.79  0.82 
peak FII (Eq. Document)  0.85  0.91 
mean FIR (Eq. Document)  0.53  0.60 
average convergence (epochs)  38.3  31.6 
fastest convergence (epochs)  19  20 
First, regarding their classification performance (i.e. Fidelity), stochastic models seem to be on top. Granted, the difference in terms of peak and mean performance isn’t large; however, all other evidence clearly favors stochastic models. None of the stochastic models collapsed (compared to deterministic), and they exhibited a lower variance and were generally much more stable, as indicated by Figure Document.
Footnote 1: excluding collapsed models.
Footnote 2: this model actually had an Interpretability of , with a Fidelity of ; obviously, a perfect Interpretability score would mean that the model masks the entire image and is predicting at random (which would cause a collapse).
Masking, on the other hand, seems to favor deterministic models on average; however, the best overall masking model was a stochastic one. Stochastic models tend to have a lot of intermodel variance, causing inconsistency from experiment to experiment. On the other hand, these models don’t have a lot of intramodel variance, because of the stochastic way in which they generate masks. Not much can be said in terms of collapses, as both families had the same number. The percentage hidden by the best performing deterministic and stochastic models can be seen in Figure Document. Of these two, the stochastic one reached a higher percentage, while being a lot more stable and a bit faster. While this is the case for these specific models, the overall trend favors deterministic models.
Both of these observations can be confirmed by the models’ FIR scores (Eq. Document). Stochastic models seem to favor classification over masking, thus leading to an . Deterministic models are a bit more balanced in this regard (i.e. ), which results in a better and more consistent masking performance. A more in-depth analysis will be offered in Section Document. Another way to inspect the tradeoff for these two families of models is to project their performance on two axes, Interpretability and Fidelity (Fig. Document).
This figure portrays the contrast of the two different types of thresholding. While deterministic models seem to favor masking over classifying, in stochastic models the opposite is happening. They seem to ignore the percentage of pixels hidden, while always striving for high classification performance. This is one of the reasons for their reduced variance, as will be discussed in Section Document.
Results
It should be noted that the aim of this research is not to push the boundaries of classification performance, but to augment existing classifiers with Interpretability. For this reason, all models that were tested on the HnS framework were first trained on their own, without any masking, to establish a baseline. This baseline is used to compute the model’s Fidelity (Eq. Document). The best performing models for each of the four datasets are presented in Table Document.
Dataset  Estimator  Epoch  Fidelity (%)  Interpretability (%)  FIR  FII

MNIST  Deterministic  8  
FashionMNIST  Deterministic  7  
CIFAR10  SA (rate = )  24  
CIFAR100  Deterministic  35 
The first two datasets were used to push both Fidelity and Interpretability as high as possible, as they are significantly easier than the rest. All types of models and estimators performed well on these datasets, and no collapses were observed at all. On average, models managed to obtain both a Fidelity and Interpretability of . All models converged to exactly FIR during training. Around one out of three models converged to near-optimal solutions on the CIFAR10 dataset, scoring near-baseline accuracies with approximately of the pixels masked. This is rather surprising, as the images of this dataset are quite low-resolution, meaning that a classifier doesn’t need a lot of pixels to perform adequately. In the first three datasets, the tradeoff between Fidelity and Interpretability appeared almost nonexistent. All models managed to hide a large portion of their input, while not suffering with respect to classification performance. This will be discussed further in Sec. Document. CIFAR100 is a tougher dataset, as it consists of classes. This is evident from the diminished performance. Here the tradeoff seems a bit tougher to overcome, as the models’ Fidelity needed to drop a bit in order to hide a significant portion of the input.
Discussion
Relation to other approaches
As mentioned in Section Document, HnS is conceptually similar to “occlusion sensitivity” analysis [zeiler2014visualizing]. Through this technique the authors were able to visualize the features extracted by a fully trained CNN. However, this technique involves multiple inferences on the model to identify the strongest feature map. While such techniques can theoretically achieve an arbitrarily high level of Interpretability, this would require performing a significant number of occlusions (as many as , where are the input image’s pixels). Needless to say, this is computationally infeasible. In contrast, HnS proposes using a trainable Neural Network (i.e. the hider) that learns which pixels are the most relevant for classification. This allows the hider to mask an ever-larger part of the input image, until the point where the least possible information is passed to the seeker. The downside is that the hider does require training; however, once trained it can make a prediction and generate a binary input mask in a single forward pass. This is arguably much more practical for real-world applications.
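To illustrate why occlusion-based analysis needs many inferences, the following sketch implements a basic sliding-window variant and counts the model calls it makes. It is a toy illustration, not the procedure of [zeiler2014visualizing] in detail: the stride of one pixel, the fill value, the function name `occlusion_sensitivity` and the stand-in `score_fn` (any callable returning a class score for an image given as a list of rows) are all assumptions made for this example.

```python
def occlusion_sensitivity(image, score_fn, patch, fill=0.0):
    """Slide a square occluder over `image` and record the score drop.

    Returns (heatmap, n_inferences). One inference is spent on the
    unoccluded image and one more on every occluder position.
    """
    height, width = len(image), len(image[0])
    baseline = score_fn(image)
    n_inferences = 1
    heatmap = []
    for top in range(height - patch + 1):
        row = []
        for left in range(width - patch + 1):
            occluded = [r[:] for r in image]       # copy the image
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    occluded[y][x] = fill          # hide this patch
            row.append(baseline - score_fn(occluded))
            n_inferences += 1
        heatmap.append(row)
    return heatmap, n_inferences


# Toy "model" that only looks at pixel (1, 1) of a 4x4 image.
toy_image = [[1.0] * 4 for _ in range(4)]
heatmap, calls = occlusion_sensitivity(toy_image, lambda img: img[1][1], patch=2)
print("inferences needed:", calls)
```

Even this single-patch scan grows with the number of patch positions, and testing combinations of occluded regions grows far faster, whereas a trained hider produces its mask in one forward pass.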
Applications
The HnS framework has several properties that can be exploited for various applications. These include:

Training fully interpretable Neural Networks. This has been discussed extensively throughout this paper.

Training “Student” networks. Teacher-Student training [hinton2015distilling] techniques have shown a lot of promise lately. A fully trained hider could be used to train a much smaller CNN for a task which would normally require a model of a higher capacity. The intuition is that the hider has already learned to identify the important parts of an input image and, by masking the irrelevant ones, it would help focus the CNN’s attention on the parts that actually matter.

Identifying bias in a dataset. There have been examples of models “cheating” in image classification by exploiting aspects of the images that humans would ignore (e.g. a watermark appearing in a corner of the images of some class). By analyzing the saliency of a trained HnS, possible sources of bias in the data could be identified.
Fidelity-Interpretability Tradeoff
The tradeoff was not as costly as expected; most of the models that didn’t collapse managed to hide a significant percentage of their input. Key insights can be discovered by examining the FIR of the HnS during training (Fig. Document). During the early steps, both types of models tend to weigh in favor of Fidelity. This is natural due to the nature of the adaptive weighting (Sec. Document), which pushes the models to favor classification performance over masking. As the models start getting better at classifying and starts dropping, the models start increasing their masking performance (i.e. their Interpretability), which drops their FIR. During the latter stages of training, the models converge to their final FIR.
Deterministic vs Stochastic thresholding
As analyzed in Section Document, deterministic and stochastic thresholding have very different effects on the training of the HnS. Stochastic models tend to settle for solutions with a higher degree of Fidelity. The stochasticity of the input mask leads to a better performing and more robust seeker. Additionally, it’s very hard for such a model to collapse to a solution where the hider masks the whole input image. This can be explained by the nature of the stochastic hider, which allows for a greater degree of exploration during training. Finally, these models exhibit a relatively low intermodel variance. This can, again, be attributed to the nature of stochastic thresholding.
Deterministic models offer a higher degree of exploitation, especially during the generation of the input mask. This results in more “extreme” solutions, where the models either collapse or converge to optimal solutions. This is evident from the fact that the “best model” for out of datasets was a deterministic one. They are less robust than their stochastic counterparts, but manage to outperform them in terms of Interpretability, by sacrificing a bit of Fidelity. The differences between deterministic and stochastic models, with respect to their Fidelity and Interpretability, can be seen in Figure Document. Deterministic models seem to favor obtaining high Interpretability, while stochastic ones favor Fidelity.
The high degree of intermodel variance that deterministic models exhibit compared to stochastic thresholding (Fig. Document) can be attributed to their nature. When a binary threshold is used, there could at any given point be a set of weights ( in Eq. Document) that lead to activations near the threshold. This means that slight adjustments to those weights during training could result in large output fluctuations. Stochastic thresholding is superior in this regard, as it ensures more stable transitions as the weights are updated. Small changes in the activations will only lead to small changes in the probability of a neuron being or . Statistically, the same number of pixels in the image will be hidden or passed.
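The stability argument can be illustrated with a toy comparison of the two thresholding styles on activations that sit close to the threshold. This is a sketch under assumptions, not the framework's implementation: it assumes a sigmoid squashing function, a threshold of 0.5, Bernoulli sampling for the stochastic case, and the convention that 1 means a pixel is passed; all function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def deterministic_mask(activations, threshold=0.5):
    """Hard threshold: pass a pixel (1) iff its squashed activation >= threshold."""
    return [1 if sigmoid(a) >= threshold else 0 for a in activations]


def stochastic_mask(activations, rng):
    """Bernoulli sampling: pass a pixel (1) with probability sigmoid(activation)."""
    return [1 if rng.random() < sigmoid(a) else 0 for a in activations]


def expected_pass_fraction(activations):
    """Expected fraction of passed pixels under stochastic thresholding."""
    return sum(sigmoid(a) for a in activations) / len(activations)


# Activations just below the threshold, then nudged just above it,
# as a small weight update might do.
before = [-0.01] * 1000
after = [0.01] * 1000

print("deterministic pass fraction:",
      sum(deterministic_mask(before)) / 1000, "->",
      sum(deterministic_mask(after)) / 1000)
print("stochastic expected pass fraction:",
      round(expected_pass_fraction(before), 4), "->",
      round(expected_pass_fraction(after), 4))
```

Under these assumptions the hard threshold flips every pixel at once, while the expected fraction of passed pixels under sampling barely moves, which is the behavior the paragraph above describes.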
Conclusion
This paper proposes a new framework for increasing the interpretability of any Neural Network (NN), denoted the seeker. It involves the use of another NN (called the hider), which is trained to hide portions of its input. These two models are jointly trained to minimize classification error and maximize the percentage of the input that is hidden. As a result, the hider learns to recognize which parts of the input are possibly “more interesting” and masks the rest. This framework can be adapted for nearly any application, from Natural Language Processing (where the hider is a sequence-to-sequence model tasked with masking words), to Computer Vision (where the hider is an image-to-image model, e.g. an Autoencoder or a U-Net) and even structured data. To achieve both goals of classifying accurately and masking a large portion of the input, the loss function comprises two components, which need to be regulated so that the model doesn’t emphasize only one of the goals. An adaptive weighting scheme of the two components is proposed as an alternative to manually tweaking their relative importance. The notions of Fidelity and Interpretability are introduced to help define and measure the two goals. Relevant literature describes the relationship between the two as a tradeoff. This claim was thoroughly investigated and is shown to be misleading, as we were able to achieve a high degree of Interpretability while maintaining near-baseline classification performance. An extensive examination was conducted regarding the best means of masking the input during training. Both deterministic and stochastic techniques for generating the mask were considered. While these two converged to roughly the same solutions, they reached them by different means. Experiments were performed on four different image classification datasets and showed that the HnS framework can be successfully applied to multiple tasks, without any fine-tuning.