Controversial stimuli: pitting neural networks against each other as models of human recognition

11/21/2019 ∙ by Tal Golan, et al. ∙ Columbia University

Distinct scientific theories can make similar predictions. To adjudicate between theories, we must design experiments for which the theories make distinct predictions. Here we consider the problem of comparing deep neural networks as models of human visual recognition. To efficiently determine which models better explain human responses, we synthesize controversial stimuli: images for which different models produce distinct responses. We tested nine different models, which employed different architectures and recognition algorithms, including discriminative and generative models, all trained to recognize handwritten digits (from the MNIST set of digit images). We synthesized controversial stimuli to maximize the disagreement among the models. Human subjects viewed hundreds of these stimuli and judged the probability of presence of each digit in each image. We quantified how accurately each model predicted the human judgements. We found that the generative models (which learn the distribution of images for each class) better predicted the human judgments than the discriminative models (which learn to directly map from images to labels). The best performing model was the generative Analysis-by-Synthesis model (based on variational autoencoders). However, a simpler generative model (based on Gaussian-kernel-density estimation) also performed better than each of the discriminative models. None of the candidate models fully explained the human responses. We discuss the advantages and limitations of controversial stimuli as an experimental paradigm and how they generalize and improve on adversarial examples as probes of discrepancies between models and human perception.





Candidate MNIST models

We assembled a set of nine candidate models, all trained on MNIST (Table 1). We included five families of models: (1) discriminative feedforward models: we adapted the VGG architecture [simonyan_very_2014] to MNIST (’small VGG’, see Materials and Methods) and trained it on either the standard MNIST dataset or a version extended by non-digit images (Fig. S1, dubbing the resulting model ’small VGG’), (2) discriminative recurrent models: the Capsule Network [sabour_dynamic_2017] and the Deep Predictive Coding Network (PCN) [wen_deep_2018], (3) adversarially-trained discriminative DNNs [madry_towards_2018], (4) a reconstruction-based readout of the Capsule Network [qin_detecting_2019], and (5) class-conditional generative models, which classify according to a likelihood estimate of each class, obtained either through a class-specific Gaussian Kernel Density Estimation (KDE) or through a class-specific Variational Autoencoder (VAE), the ’Analysis by Synthesis’ model [schott_towards_2019].

model family | model | error
discriminative feedforward | small VGG [simonyan_very_2014] | 0.47%
discriminative feedforward | small VGG [simonyan_very_2014] | 0.59%
discriminative recurrent | Wen PCN-E4 [wen_deep_2018] | 0.42%
discriminative recurrent | CapsuleNet [sabour_dynamic_2017] | 0.24%
adversarially trained | Madry [madry_towards_2018] () | 1.47%
adversarially trained | Madry [madry_towards_2018] () | 1.07%
reconstruction-based | CapsuleNet Recon [qin_detecting_2019] | 0.29%
generative | Gaussian KDE | 3.21%
generative | Schott ABS [schott_towards_2019] | 1.00%

A modified architecture. See Supplementary Materials and Methods.

Table 1: Tested MNIST models

Many DNN models operate under the assumption that each test image is paired with exactly one correct class (here, an MNIST digit). In contrast, human observers may detect more than one class in an image, or alternatively, detect none. To provide the tested models with greater flexibility, the outputs of all of the models were evaluated using a multi-label readout. For each class, the related penultimate activation (i.e., the logit) was fed to a sigmoid function instead of the usual softmax readout. This setup handles the detection of each digit as a binary classification problem.
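The difference between the two readouts can be illustrated with a minimal sketch. The logit values below are made up for illustration and do not come from any of the tested models. With a softmax, class probabilities compete and always sum to 1; with per-class sigmoids, an image can yield high probabilities for several digits or for none.

```python
import numpy as np

def softmax(logits):
    """Single-label readout: class probabilities compete and sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

def sigmoid(logits):
    """Multi-label readout: each class is an independent binary detector."""
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical logits for the ten digit classes (illustrative values only).
ambiguous = np.full(10, -6.0)
ambiguous[[1, 7]] = 3.0          # strong evidence for both '1' and '7'
rubbish = np.full(10, -6.0)      # no evidence for any digit

p_soft_ambiguous = softmax(ambiguous)
p_sig_ambiguous = sigmoid(ambiguous)
p_soft_rubbish = softmax(rubbish)
p_sig_rubbish = sigmoid(rubbish)
```

In this toy case the softmax is forced to split its mass between '1' and '7' and to assign 0.1 to every digit of the 'rubbish' input, whereas the sigmoid readout can report both digits as present in the first case and none in the second.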


Figure 2: Synthetic controversial stimuli, contrasting nine different MNIST-classifying models. All of these stimuli result from optimizing images to be ’seen’ as 7 (but not 3) by one model and as 3 (but not 7) by another model (see Fig. S2 for all digit pairs). Each image was synthesized to target one particular model pair. For example, the bottom-left image (seen as a 7 to human observers) was optimized so that 7 will be detected with high certainty by the generative ABS model and at the same time, the discriminative small VGG model will detect 3. All images here achieved controversiality score (Eq. 2) greater than 0.75.
Figure 3: Controversial stimuli, here organized to show the results of targeting each possible digit pair for four different model pairs (see Fig. S3 for all 36 model pairs). The rows and columns within each subpanel indicate the targeted digits. For example, the top-right image in subpanel D was optimized to be ’seen’ as 9 by the Schott ABS model and as 0 by the Gaussian KDE model. Since this image looks to us as a ’9’, it provides evidence in favor of the Schott ABS over the Gaussian KDE as a model of human digit recognition. Missing (crossed) cells are either along the diagonal (where the two models would agree) or where our optimization procedure did not converge to a sufficiently controversial image (a controversiality score of at least 0.75).

Another limitation of many DNN models is that they are typically too confident about their classifications [guo_calibration_2017]. To address this issue, we calibrated each model by applying an affine transformation to its logits [platt_probabilistic_1999, guo_calibration_2017]. The slope and intercept parameters of this transformation were shared across classes and were fit to minimize the predictive cross-entropy on MNIST test images. This procedure tunes the sigmoid readout and enables a fair comparison among the models in which all of them are well-calibrated. For pre-trained models, this calibration (as well as the usage of sigmoids instead of the softmax readout) affects only the models’ certainty rather than their classification accuracy (i.e., it does not change the most likely class given an input image).
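The calibration step can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: the data are synthetic rather than MNIST test images, the loss is assumed to be the mean per-class binary cross-entropy of the sigmoid readout, and plain gradient descent stands in for whatever optimizer was actually used. The function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(logits, targets):
    """Mean binary cross-entropy of sigmoid(logits) against 0/1 targets."""
    # log(1 + exp(z)) - t * z, with the first term computed stably
    return float(np.mean(np.logaddexp(0.0, logits) - targets * logits))

def fit_affine_calibration(logits, targets, lr=0.01, n_steps=3000):
    """Fit one slope and one intercept, shared across classes, by gradient descent."""
    slope, intercept = 1.0, 0.0
    for _ in range(n_steps):
        # derivative of the cross-entropy with respect to each calibrated logit
        residual = sigmoid(slope * logits + intercept) - targets
        slope -= lr * np.mean(residual * logits)
        intercept -= lr * np.mean(residual)
    return slope, intercept

# Synthetic stand-in for an overconfident model: its logits are 4x too large.
rng = np.random.default_rng(0)
true_logits = rng.normal(0.0, 1.5, size=(2000, 10))
targets = (rng.uniform(size=true_logits.shape) < sigmoid(true_logits)).astype(float)
overconfident_logits = 4.0 * true_logits

slope, intercept = fit_affine_calibration(overconfident_logits, targets)
loss_before = binary_cross_entropy(overconfident_logits, targets)
loss_after = binary_cross_entropy(slope * overconfident_logits + intercept, targets)
```

Because a positive slope and a shared intercept preserve the ordering of the logits, the most likely class for each image is unchanged; only the confidence is rescaled.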

Synthesizing controversial stimuli

Consider a set of candidate models. We would like to define a controversiality score for an image $x$. This score should be high if the models strongly disagree on the contents of this image.

Ideally, information-theoretic experimental design [lindley_measure_1956, houlsby_bayesian_2011] would approach this problem by formulating our beliefs regarding which model is correct as a posterior probability distribution, conditioned on observed or hypothetical stimuli and responses. Sets of potential stimuli would be scored according to their expected reduction of the entropy of this posterior probability distribution. However, this statistically-ideal approach is not currently tractable in high-level vision, where stimuli are arbitrary images and models are DNNs.
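For intuition, the expected entropy reduction can be computed exactly in a toy setting. The sketch below assumes a single stimulus, a binary (yes/no) response, and a small discrete set of models, each of which predicts the probability of a 'yes'. These simplifications, and the function names, are illustrative assumptions.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def expected_information_gain(prior, p_yes):
    """Expected reduction in entropy of the belief over models.

    prior: (M,) current belief over the M candidate models.
    p_yes: (M,) each model's predicted probability of a 'yes' response.
    """
    gain = entropy(prior)
    for likelihood in (p_yes, 1.0 - p_yes):       # the two possible responses
        p_response = np.sum(prior * likelihood)    # marginal response probability
        if p_response > 0:
            posterior = prior * likelihood / p_response   # Bayes' rule
            gain -= p_response * entropy(posterior)
    return gain

prior = np.array([0.5, 0.5])
gain_agreeing = expected_information_gain(prior, np.array([0.9, 0.9]))
gain_controversial = expected_information_gain(prior, np.array([0.95, 0.05]))
```

A stimulus on which the two models agree yields no expected information about which model is correct, while a stimulus on which they make opposite, confident predictions yields most of the one bit that is available.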

Here we use a simple heuristic approach. We consider one pair of models at a time, e.g., $A$ and $B$. For a given pair of digits, $y_a$ and $y_b$ (e.g., 3 and 7), an image $x$ is assigned a high controversiality score if it is recognized by model $A$ as digit $y_a$ and by model $B$ as digit $y_b$. The following function achieves this:

$$c(x) = \min\left\{\hat{p}(y_a \mid x, A),\ \hat{p}(y_b \mid x, B)\right\} \tag{1}$$

where $\hat{p}(y_a \mid x, A)$ is the estimated conditional probability that image $x$ contains digit $y_a$ according to model $A$, and $\min$ is the minimum function. However, this function assumes that a model cannot simultaneously assign high probabilities to both digit $y_a$ and digit $y_b$ in the same image. This assumption is true for models with softmax readout. To make the controversiality score compatible also with less restricted, multi-label readouts we used the following function instead:

$$c(x) = \min\left\{\hat{p}(y_a \mid x, A),\ 1 - \hat{p}(y_b \mid x, A),\ \hat{p}(y_b \mid x, B),\ 1 - \hat{p}(y_a \mid x, B)\right\} \tag{2}$$

If the models agree over the classification of image $x$, then $\hat{p}(y_a \mid x, A)$ and $\hat{p}(y_a \mid x, B)$ will be either both high or both low, so either $\hat{p}(y_a \mid x, A)$ or $1 - \hat{p}(y_a \mid x, B)$ will be a small number, pushing the minimum down.
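The two scores described above can be written as short functions. This sketch follows the verbal definitions in the text: the first score takes the minimum of two probabilities, and the multi-label variant additionally requires each model to reject the other model's digit. The probability vectors are made-up values, and the function names are hypothetical.

```python
import numpy as np

def controversiality_softmax(p_A, p_B, y_a, y_b):
    """High when model A reports digit y_a and model B reports digit y_b."""
    return min(p_A[y_a], p_B[y_b])

def controversiality_multilabel(p_A, p_B, y_a, y_b):
    """As above, but each model must also reject the other model's digit."""
    return min(p_A[y_a], 1.0 - p_A[y_b], p_B[y_b], 1.0 - p_B[y_a])

# Hypothetical per-digit probabilities from two models for one image.
p_A = np.full(10, 0.01)
p_A[7], p_A[3] = 0.95, 0.02      # model A: '7' present, '3' absent
p_B = np.full(10, 0.01)
p_B[3], p_B[7] = 0.90, 0.05      # model B: '3' present, '7' absent

score_disagree = controversiality_multilabel(p_A, p_B, y_a=7, y_b=3)
score_agree = controversiality_multilabel(p_A, p_A, y_a=7, y_b=3)
```

When the two models disagree as intended, all four terms are large and the score is 0.90; when both models give the same response, the terms that require the second model to see '3' and not '7' collapse, and the score drops to 0.02.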

Employing an activation-maximization approach [erhan_visualizing_2009], one can form images that maximize Eq. 2 by following its gradient with respect to the image (in practice, we differentiated a smoother surrogate function, Eq. 5). We initialized stimuli as random noise images and iteratively ascended a numerical estimate of this gradient until convergence (see Materials and Methods). This procedure gradually increases the controversiality of the image. Convergence to a sufficiently controversial stimulus (e.g., a controversiality score of at least 0.75) is not guaranteed. A controversial stimulus cannot be found, for example, if both models associate exactly the same regions of image space with the two digits. However, if a controversial image is found, it is guaranteed to provide an informative test stimulus for adjudicating between the two models.
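A toy version of this synthesis loop is sketched below. Everything in it is a simplifying assumption: the two 'models' are hypothetical single-pixel detectors on a four-pixel image, the multi-label score is ascended directly rather than through a smooth surrogate, and the step size, tolerance, and halving schedule are arbitrary. The purpose is only to show gradient ascent on a numerically estimated gradient, starting from a random noise image.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy stand-ins for two models (hypothetical): each "detects" a digit from one pixel.
def model_A(x):
    return {"y_a": sigmoid(8 * (x[0] - 0.5)), "y_b": sigmoid(8 * (x[1] - 0.5))}

def model_B(x):
    return {"y_a": sigmoid(8 * (x[2] - 0.5)), "y_b": sigmoid(8 * (x[3] - 0.5))}

def score(x):
    """High when A reports y_a (not y_b) and B reports y_b (not y_a)."""
    p_A, p_B = model_A(x), model_B(x)
    return min(p_A["y_a"], 1 - p_A["y_b"], p_B["y_b"], 1 - p_B["y_a"])

def finite_difference_grad(f, x, eps):
    """Symmetric finite-difference gradient, with probes clipped to [0, 1]."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        hi, lo = x.copy(), x.copy()
        hi[i] = min(x[i] + eps, 1.0)
        lo[i] = max(x[i] - eps, 0.0)
        grad[i] = (f(hi) - f(lo)) / (hi[i] - lo[i])
    return grad

def synthesize(f, x0, step=0.1, eps=0.25, eps_min=0.01, tol=1e-6, max_iter=1000):
    x, best = x0.copy(), f(x0)
    for _ in range(max_iter):
        candidate = np.clip(x + step * finite_difference_grad(f, x, eps), 0.0, 1.0)
        value = f(candidate)
        if value - best > tol:
            x, best = candidate, value   # accept the ascent step
        elif eps > eps_min:
            eps /= 2                     # converged at this resolution: refine
        else:
            break                        # converged at the finest resolution
    return x, best

x0 = np.random.default_rng(0).uniform(0.0, 1.0, size=4)   # random noise start
x_final, final_score = synthesize(score, x0)
```

In this toy problem the two detectors rely on different pixels, so a highly controversial image exists and the ascent finds it. If both models tied the two digits to the same pixels, no step would raise the score and the loop would stop with a low value, mirroring the failure case described above.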

Synthesized controversial stimuli expose a hierarchy of compatibility with human recognition

For each pair of models, we formed 90 controversial stimuli, targeting all possible digit pairs. Fig. 2 shows the results of this procedure for a particular digit pair across all model pairs. Fig. 3 shows the results of this procedure across all digit pairs for four model pairs.

Viewing the resulting controversial stimuli, it is immediately apparent that pairs of discriminative models can detect incompatible digits in images that are meaningless to us, the human observers. Images that are confidently classified by DNNs, but unrecognizable to humans have been described in the computer science literature (e.g., ’fooling images’ [nguyen_deep_2015], ’rubbish class examples’ [goodfellow_explaining_2015], and ’distal adversarial examples’ [schott_towards_2019]). However, instead of misleading one model (compared to some standard of ground truth), our controversial stimuli elicit disagreement between two models. For pairs of discriminatively trained models (Fig. 3A, B), human classifications are not consistent with either model, providing evidence against both.

One may hypothesize that the poor behavior of discriminative models outside the manifold of training examples is related to the lack of non-class examples in their training [e.g., mccoyd_background_2018]. To test this hypothesis, we trained a discriminative model with diverse non-digit examples (Fig. S1). The small VGG model, trained to discriminate not only among digits, but also between digits and non-digits, still detected digits in controversial images that look like ’rubbish’ to us (Fig. 3A, but see the next section for some advantages this model revealed in the quantitative testing).

There were some qualitative differences among the stimuli resulting from targeting pairs of discriminative models. Images targeting one of the two different recurrent DNN models, the Capsule Network [sabour_dynamic_2017] and the PCN [wen_deep_2018], showed increased (yet largely humanly unrecognizable) structure (e.g., Fig. 3B). When the discriminative models pitted against each other included a DNN that had experienced norm-bounded adversarial training [madry_towards_2018], the resulting controversial stimuli showed traces of human-recognizable digits (Fig. 3, Madry). These digits’ human classifications tended to be compatible with the classifications of the adversarially trained discriminative model [see tsipras_robustness_2019, for a discussion of the advantages of norm-bounded adversarial training].

And yet, when any of the discriminative models was pitted against either the reconstruction-based readout of the Capsule Network, or either of the generative models (Gaussian KDE or ABS), the controversial image was almost always a human-recognizable digit compatible with the target of the reconstruction-based or generative model (e.g., Fig. 3C). Finally, synthesizing controversial stimuli to adjudicate between the three reconstruction-based/generative models produced images whose human classifications are most similar to the ABS model (e.g., Fig. 3D).

Figure 4: The performance of the nine candidate models in predicting the human responses. Each dot marks the squared error in predicting the responses of one individual human participant, averaged across all 820 tested stimuli and 10 response categories. The vertical bars mark across-subject means. The gray dots mark the MSE of predicting each participant by the mean response pattern of the other participants. The ’noise floor lower bound’ (dashed line) marks the lowest across-subject MSE achievable by any single model. Significance indicators (right hand): A closed dot connected to an open dot indicates that the model aligned with the closed dot has significantly better (smaller) MSE than the model aligned with the open dot (p < 0.05, bootstrap, FWE corrected). The generative ABS model (in red) explains human responses to the set of controversial stimuli better than all of the other candidate models. And yet, none of the models are optimal: leave-one-subject-out prediction shows significantly smaller error.

Human psychophysics can formally adjudicate among models and reveal their limitations

Inspecting a matrix of controversial stimuli synthesized to cause disagreement among two models can provide a sense of which model is more similar to us in its decision boundaries. However, it does not tell us how a third model (not used in synthesizing the stimuli) responds to these images. Moreover, some of the resulting controversial stimuli are ambiguous to human observers. We therefore need careful human behavioral experiments to adjudicate among the models.

We evaluated each model by comparing its judgments to those of human subjects, and compared the models in terms of how well they could predict the human judgments. For the behavioral experiment, we selected 720 controversial stimuli (20 per model-pair comparison, see Materials and Methods) as well as 100 randomly selected MNIST test images. We presented these 820 stimuli to 30 human observers, in a different random order for each observer. For each image, observers rated each digit’s probability of presence from 0% to 100% on a five point scale (Fig. S4). Subjects were allowed to judge multiple digits as present in a given image with high probability; the probabilities were not constrained to sum to 1. Since there is no objective reference providing correct answers in this task (i.e., the human responses define ground truth), no feedback was provided.
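The stimulus counts quoted in the text follow from simple combinatorics, which the sketch below checks. The model labels are placeholders introduced only to count pairs.

```python
from itertools import combinations, permutations

models = [f"model_{i}" for i in range(9)]      # placeholder labels for the nine models
model_pairs = list(combinations(models, 2))    # unordered model pairs
digit_pairs = list(permutations(range(10), 2)) # ordered digit pairs (seen-as-a, seen-as-b)

n_model_pairs = len(model_pairs)               # 36
n_digit_pairs = len(digit_pairs)               # 90 stimuli synthesized per model pair
n_controversial = n_model_pairs * 20           # 20 selected per model pair -> 720
n_total = n_controversial + 100                # plus 100 MNIST test images -> 820

# For any one model: stimuli from pairs that include it vs. pairs that do not.
pairs_with_first = [pair for pair in model_pairs if models[0] in pair]
n_targeting = len(pairs_with_first) * 20                        # 8 pairs -> 160
n_not_targeting = (n_model_pairs - len(pairs_with_first)) * 20  # 28 pairs -> 560
```

The last two quantities correspond to the sets of controversial stimuli that did and did not target a given model.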

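The responses of each human subject were directly compared to each model’s predictions:

$$\mathrm{MSE}_{M,s} = \frac{1}{820 \times 10} \sum_{x} \sum_{y} \left( \hat{p}(y \mid x, M) - p_s(y \mid x) \right)^2 \tag{3}$$

where $\mathrm{MSE}_{M,s}$ is the mean squared error with which model $M$ predicts the human-judged probabilities $p_s(y \mid x)$ that image $x$ contains digit $y$ (as rated by subject $s$), and $\hat{p}(y \mid x, M)$ are the model’s corresponding judgments. This measure is proportional to the squared Euclidean distance between the model’s responses and the subject’s responses.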
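A minimal sketch of this per-subject error, assuming the human ratings are stored as an array of shape (subjects, images, digits) and the model predictions as an array of shape (images, digits). The array layout and function name are assumptions made for illustration.

```python
import numpy as np

def model_mse_per_subject(human, model):
    """Mean squared error between a model and each subject.

    human: (n_subjects, n_images, n_digits) judged probabilities in [0, 1].
    model: (n_images, n_digits) predicted probabilities in [0, 1].
    Returns an (n_subjects,) array, averaged over images and digits.
    """
    return np.mean((human - model[None, :, :]) ** 2, axis=(1, 2))

# Tiny made-up example: 2 subjects, 2 images, 2 digits.
human = np.array([[[0.0, 1.0], [1.0, 0.0]],
                  [[0.5, 0.5], [0.5, 0.5]]])
model = np.array([[0.0, 1.0], [1.0, 0.0]])
mse = model_mse_per_subject(human, model)
```

The first subject matches the model exactly, giving an error of 0; the second is off by 0.5 on every rating, giving an error of 0.25.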

Given the intersubject variability and decision noise, the true model (if it were included in our set) cannot perfectly predict the human judgments. To estimate the maximal attainable performance (i.e., the noise floor of the MSE), we compared each subject’s responses with the predictions obtained by averaging the response patterns of all of the other subjects:

$$\mathrm{MSE}_{\mathrm{LOSO},s} = \frac{1}{820 \times 10} \sum_{x} \sum_{y} \left( \bar{p}_{-s}(y \mid x) - p_s(y \mid x) \right)^2 \tag{4}$$

where $\mathrm{MSE}_{\mathrm{LOSO},s}$ is the mean squared error with which $\bar{p}_{-s}$, the mean response pattern across all subjects except $s$ (820 images × 10 digits), predicts the judged probabilities of subject $s$.
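The leave-one-subject-out estimate can be sketched in a few lines, again assuming a (subjects, images, digits) array of ratings. The layout and names are illustrative assumptions.

```python
import numpy as np

def loso_mse_per_subject(human):
    """Error of predicting each subject from the mean of all other subjects.

    human: (n_subjects, n_images, n_digits) judged probabilities.
    Returns an (n_subjects,) array, averaged over images and digits.
    """
    n_subjects = human.shape[0]
    total = human.sum(axis=0, keepdims=True)
    mean_of_others = (total - human) / (n_subjects - 1)   # leave subject s out
    return np.mean((mean_of_others - human) ** 2, axis=(1, 2))

# Tiny made-up example: 3 subjects, 1 image, 2 digits.
human = np.array([[[0.0, 0.0]],
                  [[1.0, 1.0]],
                  [[0.5, 0.5]]])
noise_floor = loso_mse_per_subject(human)
```

Here the third subject sits exactly at the mean of the other two and is predicted perfectly, while the two extreme subjects are each predicted with a squared error of 0.5625.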

The results of the experiment (Fig. 4) largely corroborate the qualitative impressions of the controversial stimuli, indicating that the deep class-generative ABS model [schott_towards_2019] is superior to the other models in predicting the human responses to the tested dataset. Its performance is followed by the shallow class-conditional model, the Gaussian KDE, which in turn is followed by the reconstruction-based readout of the Capsule Network. The discriminative models all performed significantly worse than these three models. The leave-one-subject-out noise floor estimate (Eq. 4) provided a significantly more accurate prediction compared to all of the models (black dots in Fig. 4). This indicates that none of the models (including the ABS model) fully explained the explainable variability in the data.

Prediction error as measured by Eq. 3 is a strict criterion. Achieving minimal MSE (i.e., reaching the noise floor) requires a model to exactly predict the average human response pattern. To eliminate the potential effect of miscalibration of the models with respect to the human-assigned probabilities, we conducted control analyses employing either model recalibration or more flexible correlation measures. First, we repeated the analysis after recalibrating each model to minimize the overall prediction error of the model across subjects (Fig. 4A). We also retested the models after replacing the multi-label sigmoid readout with a modified softmax [schott_towards_2019], similarly recalibrating the readout hyperparameters to minimize each model’s MSE (Fig. 4B). In addition, we compared the non-recalibrated models to the human judgements using linear correlation instead of MSE, allowing for subject-specific scaling and shifting of the models’ predictions (Fig. 5A). Finally, we applied isotonic regression to predict each subject’s individual responses from each model’s prediction with an arbitrary monotonic transformation from model to human judgements (Fig. 5B). In all of these control analyses, the advantage of the ABS model over all of the other models persisted, as well as the gap between all of the models and the noise floor.
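The two more flexible measures can be illustrated with a short sketch. Linear correlation ignores any positive scaling and shift between model predictions and a subject's responses, and isotonic regression finds the best non-decreasing mapping from one to the other. The pool-adjacent-violators routine below is a plain textbook version written for illustration (it does not treat tied inputs specially) and is not the implementation used in the study; all data values are made up.

```python
import numpy as np

def pearson_r(model, human):
    """Linear correlation: invariant to positive scaling and shifting of either input."""
    return float(np.corrcoef(model, human)[0, 1])

def isotonic_fit(x, y):
    """Least-squares non-decreasing fit of y as a function of x (pool adjacent violators)."""
    order = np.argsort(x, kind="stable")
    blocks = []                                  # each block: [sum of y, count]
    for value in y[order]:
        blocks.append([float(value), 1])
        # merge backwards while a block's mean exceeds the mean of the next block
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted_sorted = np.concatenate([np.full(c, t / c) for t, c in blocks])
    fitted = np.empty_like(fitted_sorted)
    fitted[order] = fitted_sorted                # undo the sort
    return fitted

model = np.array([0.1, 0.4, 0.8, 0.9])
human_affine = 0.5 * model + 0.2                 # same pattern, rescaled and shifted
r_affine = pearson_r(model, human_affine)

fitted = isotonic_fit(np.array([1.0, 2.0, 3.0, 4.0]), np.array([1.0, 3.0, 2.0, 4.0]))
```

A model whose predictions are an affine transformation of a subject's responses reaches a correlation of 1 even though its MSE is nonzero. In the second example, the out-of-order middle values are pooled to their mean, giving the fitted values 1, 2.5, 2.5, 4.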

To better understand the contribution of different stimuli to the results, we partitioned the MSE of each model into three components: error in predicting the responses (1) to controversial stimuli that targeted the model, (2) to controversial stimuli that did not target the model, and (3) to the MNIST test images (Fig. 5). The error partitioning uncovered two findings: First, the small VGG model (trained with non-digit examples) performed better than other discriminative models because it made fewer errors on controversial stimuli that targeted models other than itself. Many of these controversial stimuli targeting other models are not recognized as digits by humans, and small VGG recognizes them as non-digits with greater accuracy than the other discriminative models. This explains this model’s quantitative advantage over the other discriminative models (Fig. 4). However, small VGG fails when targeted in the synthesis of controversial stimuli, revealing that it does not capture human decision boundaries (Fig. 3A). A second finding evident from the error partitioning is that the two generative models, the Gaussian KDE and the ABS model, were significantly worse than the discriminative models at predicting the human responses to the 100 MNIST images (Fig. 6A).
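One way to carry out such a partition is to split the sum of squared errors by stimulus group while keeping the total number of stimuli as the denominator, so that the three contributions add up to the overall MSE. The sketch below assumes this additive convention and uses random numbers in place of real errors; the group labels and function name are hypothetical.

```python
import numpy as np

def partition_mse(sq_err, groups):
    """Split an overall MSE into additive per-group contributions.

    sq_err: (n_stimuli,) squared error per stimulus (already averaged over digits).
    groups: (n_stimuli,) group label of each stimulus.
    """
    n = sq_err.size
    return {str(g): float(sq_err[groups == g].sum() / n) for g in np.unique(groups)}

# Group sizes as in the experiment; the error values themselves are synthetic.
groups = np.array(["mnist"] * 100 + ["targeting"] * 160 + ["not_targeting"] * 560)
sq_err = np.random.default_rng(0).uniform(0.0, 0.2, size=groups.size)

contributions = partition_mse(sq_err, groups)
total_mse = float(sq_err.mean())
```

Because the groups differ in size, a group's contribution reflects both its mean error and the fraction of the stimulus set it makes up.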

Figure 5: Partitioning the mean squared prediction error of each candidate model to three components: the contribution of the 100 MNIST stimuli (black), the contribution of the 160 controversial stimuli that targeted model pairs that included the model (dark gray), and the 560 controversial stimuli that targeted model pairs that did not include the model (light gray).

Explaining the remaining predictive gap of the ABS model

While the Gaussian KDE model indeed has relatively low MNIST test accuracy, the failure of the ABS model on the MNIST test data compared to all of the discriminative models cannot be explained by its accuracy. We hypothesized that the multi-label readout we employed interacted unfavorably with the class-conditional structure of the two generative models: Since these models estimate the density of each class independently, mapping each class-density estimate directly into a class-presence rating prevented any interaction between the different detectors (i.e., the detection of ’7’ did not reduce the response of the ’1’ output). In the ABS model’s original formulation [schott_towards_2019], the modified softmax integrated the different class densities. Applying this readout instead of the multi-label readout decreased the model’s error on the MNIST test images (MSE=0.0218 instead of MSE=0.0287 with sigmoid readout). Even better MSE was obtained by recalibrating the modified softmax hyperparameters (Fig. 6B, MSE=0.0135 instead of MSE=0.0280 with recalibrated sigmoid readout). And yet, the PCN model and the Capsule Network had significantly better MSE than the ABS model even after such recalibration, probably reflecting the accuracy gap between the ABS model and these recurrent DNN models (Table 1).


In this study, we synthesized stimuli to maximally differentiate the predictions of nine different candidate models about human classification of hand-written digits. We then tested the models’ predictions by presenting these controversial stimuli to human observers. We found that a deep class-conditional generative neural network, the ABS model [schott_towards_2019] explained human responses to these stimuli significantly better than several discriminatively trained DNNs. We also found that none of the candidate models, including the ABS model, explained all of the explainable variability of the human responses.

We believe that controversial stimuli can be an important addition to the toolboxes of two groups of scientists. The first group is cognitive computational neuroscientists interested in better understanding perceptual processes such as object recognition by modeling them as artificial neural networks. Natural images will always remain a necessary benchmark. However, models often make similar predictions for natural images [e.g., schrimpf_brain-score:_2018]. Controversial stimuli guarantee that different models make considerably different predictions, and thus empower us to adjudicate among models.

The second group that might find controversial stimuli useful is computer scientists interested in comparing different DNN architectures. Controversial stimuli pinpoint where in the input space the decisions of different models disagree. This can be used for illustrating with examples the functional difference between two models, efficiently testing hypotheses about a remote black box system, or comparing models with respect to their robustness to adversarial attacks.

Adversarial examples are a special case of controversial stimuli. An ideal adversarial example would be controversial between the targeted model and ground truth. In practice, ground-truth labeling is rarely available within the adversarial-example optimization loop, so a stand-in for ground truth is used. When targeting object recognition models, a common stand-in is the assumption that the human-assigned label of an image does not change within a pixel-space ball around it. The images resulting from adversarial attacks that employ this assumption can be construed as controversial between the targeted model and a pixel-space one-nearest-neighbor classifier.

The more general perspective provided by controversial stimuli enables us to replace the ground-truth stand-in with any alternative candidate model. Contrasting models can be a more severe test of robustness to adversarial attack. Moreover, the image search space is not limited to balls around labeled examples. An unrestricted image search space has two advantages: (1) We might find informative controversial stimuli far away from the ’natural’ examples. (2) We can start from arbitrary images, including random images (as we did here). Starting from random images safeguards us from concluding that a model is robust when uninformative gradients prevent us from finding effective adversarial examples [athalye_obfuscated_2018]. If the generation of controversial stimuli fails due to uninformative gradients, the failure is transparent: a high controversiality score will not be achieved.

Implications for DNN modeling of human recognition

Generative models may better capture human recognition

The deep ABS model beat the discriminatively trained models at predicting the human responses to the controversial stimuli. One interpretation of this finding is that, like the deep ABS model, humans have a generative model for each class. Each VAE in the ABS model learns an approximation of the likelihood of an image given a digit class. Images that are far away from the category’s distribution are assigned low likelihoods and hence can be rejected as non-digits, matching the human responses to such stimuli (i.e., low probability ratings for all digits). In contrast, discriminative training does not penalize models for assigning labels with high confidence to images that are outside the training distribution. As demonstrated by the current study (Fig. 2, Fig. 3C), even a shallow class-conditional generative model (the Gaussian KDE) leads to considerably more human-compatible responses to such far-removed images.

However, even the best available generative model we tested, the ABS model, still could not fully explain the human responses, and did not match the better discriminative models at predicting human responses to the test MNIST images [see also its unsatisfactory performance on the MNIST-C data set, mu_mnist-c:_2019].

The challenge for modelers is to combine the advantages of discriminative models (i.e., good discriminative performance) and generative models (i.e., good generalization performance). The purely generative class-conditional approach is insufficient, as we show here, even for MNIST. For natural images, the shortcomings of this approach are even more apparent: [schott_towards_2019] report a failure to achieve good test accuracy on CIFAR-10 with the ABS model, and [fetaya_conditional_2019] report a lack of robustness of a CIFAR-10 normalizing flow-based class-conditional model. These difficulties might be related to an over-emphasis of low-level statistics over high-level semantic properties in the density functions learned by current generative modeling approaches, including VAEs, normalizing-flow models, and pixelCNNs [nalisnick_deep_2018]. How to combine the strengths of discriminative and generative inference remains an important problem of both machine learning and brain science.

Adversarial-training does not lead to human-compatible class boundaries

Adversarial training aims to imbue a model with robustness to perturbations within a norm-bounded ball in pixel space by introducing such adversarial perturbations to the training data as the model is being trained [goodfellow_explaining_2015, madry_towards_2018]. If we define robustness as invariance to norm-bounded perturbations in pixel space [e.g., ilyas_adversarial_2019], adversarial training might indeed give us robust models. However, here adversarially-trained models failed to predict human responses to controversial stimuli. If we define model robustness as the absence of model decisions that are incompatible with human judgements, then these models are clearly not robust.

Adversarial training is limited in two respects. First, it does not enable the model to recognize images very far from the training distribution as non-digits [schott_towards_2019]. Second, the 2D or 3D intuition of achieving well-formed decision boundaries by surrounding each training example with a ball [madry_towards_2018] breaks in high-dimensional space. For human perception, in particular, it is well known that beyond a very small ball of necessarily imperceptible perturbations, pixel-space norm is a poor measure of perceptual distance [zhou_wang_image_2004]. For example, it is easy to devise two perturbations of similar norms, where one causes the image to cross a human category boundary, while the other is invisible [jacobsen_excessive_2019]. Therefore, forming pixel-space balls of invariance around the training examples cannot capture the category regions that the human visual system employs.

Controversial stimuli: current limitations and future directions

Testing populations of DNN instances instead of single instances

Like most work using pre-trained models [kriegeskorte_deep_2015, yamins_using_2016], this study operationalized each model as a single trained DNN instance. In this setting, a model predicts a single response pattern, which should be as similar as possible to the average human response. To the extent that the training of a model results in instances that make idiosyncratic predictions, the variability across instances will reduce the model’s performance at predicting the human responses. However, an alternative approach to evaluating models considers each DNN instance as an equivalent of an individual human brain. In this setting, idiosyncratic predictions do not necessarily count against a model. Instead, the distribution of model instances should match the distribution of individual humans. After all, humans, too, might have idiosyncratic decision boundaries [kriegeskorte_deep_2015].

To compare the distribution of model instances to the distribution of individual humans, we would need a sufficiently large sample of instances, repeating DNN training with different random weight initializations or training data. As a first approximation, we can consider the means across instances and humans. Given a sample of instances of model A and a sample of instances of model B, controversial stimuli would be synthesized so that each stimulus is classified in the same way by all of the instances of model A and in another, incompatible way by all of the instances of model B. The generality of the controversial stimuli could be further validated by testing them on held-out instances of these two architectures. Only invalid predictions that are proved to generalize across instances would then be considered as evidence against a model. While this approach cannot test for the existence of idiosyncratic decision boundaries in human observers, it does not penalize models whose instances have that property. A major technical hurdle to the implementation of such multi-instance controversial stimulus synthesis is the high computational cost of the best performing model in the current study, the ABS model, which relies on iterative optimization during inference.

An additional advantage of using multiple instances per model is that it allows obtaining more informed estimates of the stability of the experiment’s results, taking into account random-variability introduced by weight initialization. This particular point is not specific to experiments using optimized stimuli; it is relevant to any study comparing trained DNNs. Random variability related to weight initialization can be a concern when comparing similarly performing models. However, experiments with alternative instances of some of the models (data not shown) suggest that our inferential results here would not qualitatively change if the entire procedure was repeated with retrained model instances.

Scaling up to many classes and many models

Synthesizing controversial stimuli for every pair of classes and every pair of models is difficult to scale up to problems with a large number of classes or a large number of models. For example, for ImageNet, which has 1000 classes, there are almost half a million stimuli for each pair of models. Exhaustive sampling of all pairs of classes for each pair of models is not required in order to distinguish the models. However, it is desirable to have a variety of controversial stimuli for each pair of models, and for this variety to cover a diversity of pairs of classes. Such diversity can be achieved either by randomly sampling class pairs, or by more advanced multi-objective optimization heuristics [e.g., the MAP-Elites algorithm, mouret_illuminating_2015, nguyen_deep_2015].
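The scale of the problem and the random-sampling option can be illustrated with a short sketch. It assumes unordered class pairs (which matches the "almost half a million" figure for 1000 classes); the function name `sample_class_pairs` and the sample size of 20 are hypothetical choices made for illustration only.

```python
import math
import random

def sample_class_pairs(n_classes, n_pairs, seed=0):
    """Draw distinct unordered class pairs uniformly at random (hypothetical helper)."""
    rng = random.Random(seed)
    total = math.comb(n_classes, 2)
    if n_pairs > total:
        raise ValueError("requested more pairs than exist")
    pairs = set()
    while len(pairs) < n_pairs:
        a, b = rng.sample(range(n_classes), 2)  # two distinct classes
        pairs.add((min(a, b), max(a, b)))       # store each pair in canonical order
    return sorted(pairs)

# Number of unordered class pairs for a 1000-class problem such as ImageNet.
n_class_pairs = math.comb(1000, 2)

# A small random subset of class pairs to target for one pair of models.
pairs = sample_class_pairs(1000, 20)
```

Random sampling of this kind gives coverage without any guarantee of diversity in the resulting images; the multi-objective heuristics cited above address that separately.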

From an information-theoretic perspective, our set of controversial stimuli should be designed for the human responses to maximally reduce our uncertainty (i.e., the entropy of our belief distribution) about the models. This optimization objective can be used to synthesize controversial stimuli adaptively as sequentially collected human responses come in. Such a process will zoom in on the most promising models, ignoring models that have been effectively eliminated by previous trials. However, this approach faces both technical and theoretical challenges. Technically, it requires joint back-propagation through all models. Theoretically, it requires an estimate of the model likelihood, i.e., the probability of the human responses given the model and stimulus set. This estimate should reflect the fact that a repeated presentation of the same stimulus is less informative than a diverse stimulus set.
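The entropy-reduction objective can be made concrete with a minimal sketch. It assumes a discrete set of possible human responses and treats each model's response likelihood as given, which is precisely the part the text identifies as theoretically difficult; it also scores one stimulus at a time, so it does not capture the reduced value of repeated presentations. All names are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def expected_information_gain(prior, likelihoods):
    """Expected reduction in entropy over models after observing one response.

    prior[m]          : current belief in model m
    likelihoods[m][r] : assumed P(response r | model m) for the candidate stimulus
    """
    n_models = len(prior)
    n_responses = len(likelihoods[0])
    gain = entropy(prior)
    for r in range(n_responses):
        joint = [prior[m] * likelihoods[m][r] for m in range(n_models)]
        p_r = sum(joint)                      # marginal probability of response r
        if p_r > 0:
            posterior = [j / p_r for j in joint]
            gain -= p_r * entropy(posterior)  # subtract expected posterior entropy
    return gain

prior = [0.5, 0.5]
controversial = [[0.9, 0.1], [0.1, 0.9]]    # the two models predict opposite responses
uncontroversial = [[0.9, 0.1], [0.9, 0.1]]  # the two models predict the same response

gain_controversial = expected_information_gain(prior, controversial)
gain_uncontroversial = expected_information_gain(prior, uncontroversial)
```

In this toy case a stimulus on which the models agree carries no expected information about which model is correct, whereas a stimulus on which they disagree does.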

Materials and Methods

Details on training/adaptation of candidate models appear in the Supplementary Materials and Methods.

Controversial Stimuli synthesis

Each controversial stimulus was initialized as a randomly seeded uniform noise image. To efficiently optimize the controversiality score (Eq. 2), we ascended the gradient of a more numerically favorable version of this quantity (Eq. 5). This version is built on an inverted LogSumExp (LSE), which serves as a smooth minimum; a hyperparameter controls the LogSumExp smoothness, and the class evidence enters as the calibrated logit for each class (the input to the sigmoid readout).
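The role of the inverted LogSumExp as a smooth minimum can be sketched as follows. This is an illustration rather than a reproduction of Eq. 5: the division by the smoothness hyperparameter (here called `alpha`) is an assumption about the normalization, and the input values are arbitrary.

```python
import math

def inverted_logsumexp(values, alpha=1.0):
    """Smooth minimum: -(1/alpha) * log(sum(exp(-alpha * v))).

    The 1/alpha scaling is an assumption made for this sketch.
    """
    m = min(values)
    # Shift by the minimum so the largest term is exp(0), avoiding overflow.
    s = sum(math.exp(-alpha * (v - m)) for v in values)
    return m - math.log(s) / alpha

values = [2.0, 0.5, 3.0]
soft = inverted_logsumexp(values, alpha=1.0)     # smooth, lies below the true minimum
sharp = inverted_logsumexp(values, alpha=100.0)  # close to min(values)
```

A small `alpha` spreads the gradient over all arguments, while a large `alpha` concentrates it on the smallest one, which is why increasing the hyperparameter over the course of optimization moves the surrogate toward the hard minimum.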

While for most models one can derive an analytical gradient of Eq. 5, this is not possible for the ABS model, since its inference is based on a latent-space optimization. Hence, following the approach of [schott_towards_2019] to forming adversarial examples, we used numerical differentiation for all models. In each optimization iteration, we used the symmetric finite difference formula to estimate the gradient of Eq. 5 with respect to the image. An indirect benefit of this approach is that one can set the finite-difference step size to be large, trading gradient precision for better handling of rough cost landscapes. For each image, we began optimizing with a step size of 1 (clipping the perturbed images to stay within the grayscale intensity range). Once the optimization converged to a local maximum, we halved the step size and continued optimizing. We kept halving upon convergence until final convergence at the smallest step size. We then increased the LSE hyperparameter to 10 and reset the step size to equal 1 again, repeating the procedure (but without resetting the optimized image). A third and final optimization epoch followed.
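A minimal sketch of this coarse-to-fine numerical optimization, under several stated assumptions: the image is a flat list of intensities in [0, 1], the difference quotient is divided by the actual distance between the two clipped probes, a fixed ascent step replaces the line search, and the final finite-difference width `h_min` is a hypothetical value. The objective `toy_objective` merely stands in for the smoothed controversiality score.

```python
def finite_difference_gradient(f, x, h, lo=0.0, hi=1.0):
    """Symmetric finite-difference gradient estimate with probes clipped to [lo, hi]."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] = min(hi, x[i] + h)
        down[i] = max(lo, x[i] - h)
        width = up[i] - down[i]  # actual probe separation after clipping (assumption)
        grad.append((f(up) - f(down)) / width if width > 0 else 0.0)
    return grad

def ascend_with_halving(f, x, h_start=1.0, h_min=1 / 64, step=0.1, max_iter=1000):
    """Gradient ascent that halves the finite-difference width upon each convergence."""
    x = list(x)
    h = h_start
    while h >= h_min:
        for _ in range(max_iter):
            g = finite_difference_gradient(f, x, h)
            cand = [min(1.0, max(0.0, xi + step * gi)) for xi, gi in zip(x, g)]
            if f(cand) <= f(x) + 1e-12:  # no improvement: converged at this width
                break
            x = cand
        h /= 2
    return x

def toy_objective(x):
    """Stand-in objective with a single maximum at 0.3 in every pixel."""
    return -sum((xi - 0.3) ** 2 for xi in x)

x_opt = ascend_with_halving(toy_objective, [0.9, 0.1])
```

With the widest probes the gradient estimate is crude (here it even points the wrong way for one pixel), but it is cheap to escape rough regions with; the later, narrower widths recover precision.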


In each optimization iteration, once a gradient estimate was determined, we used a line search for the most effective step size: we evaluated the effect of the maximal step in the direction of the gradient that did not cause intensity clipping, as well as of fractions of this step size. When the resulting image had a controversiality score (Eq. 2) of less than 0.85, we repeated the optimization procedure with a different initial random image, for up to three attempts.
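The line search can be sketched as below. The interpretation of the "maximal step" as the largest multiple of the gradient that keeps every pixel within [0, 1] is an assumption, and the set of `fractions` is hypothetical; the objective is again a toy stand-in.

```python
def max_unclipped_step(x, g, lo=0.0, hi=1.0):
    """Largest t >= 0 such that every x[i] + t * g[i] stays within [lo, hi]."""
    limits = []
    for xi, gi in zip(x, g):
        if gi > 0:
            limits.append((hi - xi) / gi)
        elif gi < 0:
            limits.append((xi - lo) / -gi)
    return min(limits) if limits else 0.0

def line_search(f, x, g, fractions=(1.0, 0.5, 0.25, 0.125), lo=0.0, hi=1.0):
    """Evaluate the maximal unclipped step and fractions of it; keep the best image."""
    t_max = max_unclipped_step(x, g, lo, hi)
    best_x, best_val = list(x), f(x)
    for frac in fractions:
        # The clamp only guards against floating-point rounding at the boundary.
        cand = [min(hi, max(lo, xi + frac * t_max * gi)) for xi, gi in zip(x, g)]
        val = f(cand)
        if val > best_val:
            best_x, best_val = cand, val
    return best_x, best_val

def toy_objective(x):
    """Stand-in objective with a single maximum at 0.3 in every pixel."""
    return -sum((xi - 0.3) ** 2 for xi in x)

# Gradient of the toy objective at the starting image, supplied by hand.
best_x, best_val = line_search(toy_objective, [0.9, 0.5], [-1.2, -0.4])
```

In this toy case the full step overshoots the maximum and half of it is selected, which is the situation the fractional candidates are meant to handle.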

For analytically differentiable models, we found that this more involved (and computationally intensive) approach to image optimization resulted in less convergence to poor local maxima compared to standard gradient ascent using symbolic differentiation.

For each model pair, we selected 20 controversial stimuli for human testing (out of up to 90 we produced). Using integer programming (IBM DOcplex) we searched for the set of 20 images with the highest total controversiality score, under the constraint that each digit is targeted exactly twice per model.
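The structure of this selection problem can be illustrated on a toy instance. The sketch below replaces the integer-programming solver with exhaustive enumeration, which is feasible only for very small problems, and it assumes that "targeted exactly twice per model" means each digit appears a fixed number of times as the first model's target and the same number of times as the second model's target. The scores, three classes, and one target per class are hypothetical.

```python
from collections import Counter
from itertools import combinations

def select_stimuli(scores, n_classes, per_class):
    """Exhaustively find the highest-scoring subset meeting the balance constraint.

    scores[(ya, yb)] : controversiality of the image targeting class ya for
                       model A and class yb for model B
    """
    best_set, best_total = None, float("-inf")
    for subset in combinations(sorted(scores), n_classes * per_class):
        count_a = Counter(ya for ya, _ in subset)
        count_b = Counter(yb for _, yb in subset)
        valid = all(count_a[c] == per_class and count_b[c] == per_class
                    for c in range(n_classes))
        total = sum(scores[pair] for pair in subset)
        if valid and total > best_total:
            best_set, best_total = subset, total
    return best_set, best_total

toy_scores = {
    (0, 1): 0.90, (0, 2): 0.95,
    (1, 0): 0.60, (1, 2): 0.80,
    (2, 0): 0.70, (2, 1): 0.50,
}
chosen, total = select_stimuli(toy_scores, n_classes=3, per_class=1)
```

Simply taking the three highest scores would select two images that both target class 0 for model A; the constraint forces a balanced set instead, at a small cost in total score.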

Human testing

30 participants (17 women, mean age = 29.3) were recruited through All participants provided informed consent at the beginning of the study, and all procedures were approved by the Columbia Morningside ethics board. We monitored the performance of the human subjects through three measures: their accuracy on the 100 MNIST images, their reaction times, and 108 controversial images (3 per model pair) that were displayed again at the end of the experiment (testing within-subject response reliability). While the participants’ performance on these measures varied, we found no basis for rejecting the data produced by any participant due to evident low effort or negligence.

Statistical inference

Differences between models with respect to their human response prediction error were tested by bootstrapping-based hypothesis testing. For each bootstrap sample (100,000 resamples), subjects and stimuli were both randomly resampled with replacement. Stimulus resampling was stratified by stimulus condition (37 conditions: controversial stimuli targeting 36 model pairs, plus MNIST test images). For each pair of models, this bootstrapping procedure yielded an empirical sampling error distribution of the difference between the models’ MSEs. The percentages of bootstrapped MSE differences below (or above) zero were used as left-tail (or right-tail) p-values. These p-values were Bonferroni corrected for multiple pairwise comparisons and for two-tailed testing.
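The resampling scheme can be sketched as follows, under the assumption that per-subject, per-stimulus squared errors have already been computed for the two models being compared. The function names, the data layout, and the default of 1000 resamples (rather than 100,000) are hypothetical choices for illustration.

```python
import random

def bootstrap_mse_difference(err_a, err_b, conditions, n_boot=1000, seed=0):
    """Bootstrap the MSE difference (A minus B), resampling subjects and stimuli.

    err_a[s][i], err_b[s][i] : squared prediction error for subject s, stimulus i
    conditions[i]            : stratum (stimulus condition) of stimulus i
    """
    rng = random.Random(seed)
    n_subjects = len(err_a)
    strata = {}
    for i, c in enumerate(conditions):
        strata.setdefault(c, []).append(i)
    diffs = []
    for _ in range(n_boot):
        subjects = [rng.randrange(n_subjects) for _ in range(n_subjects)]
        # Stratified resampling: draw with replacement within each condition.
        stimuli = [rng.choice(idx) for idx in strata.values() for _ in idx]
        total = sum(err_a[s][i] - err_b[s][i] for s in subjects for i in stimuli)
        diffs.append(total / (len(subjects) * len(stimuli)))
    p_left = sum(d < 0 for d in diffs) / n_boot   # share of differences below zero
    p_right = sum(d > 0 for d in diffs) / n_boot  # share of differences above zero
    return diffs, p_left, p_right

def bonferroni(p, n_comparisons, two_tailed=True):
    """Bonferroni correction for multiple comparisons and, optionally, two tails."""
    return min(1.0, p * n_comparisons * (2 if two_tailed else 1))
```

With nine models there are 36 pairwise comparisons, so a raw tail p-value would be multiplied by 72 under this correction before being compared with the significance threshold.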

Data and code availability

Python optimization source code, synthesized images and detailed behavioral testing results will be available on


TG acknowledges ELSC brain sciences postdoctoral fellowships for training abroad, and NVIDIA for a donation of a Titan Xp used for this research. Stimulus synthesis was conducted on the Zuckerman Institute Research Computing ’Axon’ GPU cluster. The authors wish to thank Máté Lengyel for a helpful discussion and Raphael Gerraty, Heiko Schutt, Ruben van Bergen, and Benjamin Peters for their comments on the manuscript.


Supplementary Material

Supplementary Materials and Methods

Candidate models

Most of the tested models were based on official pre-trained versions [wen_deep_2018, sabour_dynamic_2017, madry_towards_2018, schott_towards_2019], unmodified except for the readout layer. Here we describe the models we trained from scratch or more deeply altered.

Small VGG

Starting from the VGG-16 architecture [simonyan_very_2014, Table 1, architecture D], we downsized its input to the 28x28 MNIST format, removed the deepest three convolutional layers, and replaced the three fully-connected layers with a single 512-unit fully-connected layer feeding a ten-sigmoid readout layer. All weights were initialized using the Glorot uniform initializer, as implemented in Keras. Batch normalization was applied between the convolution and the ReLU operations in all layers. The model was trained with Adagrad (decay=0) for 20 epochs using a mini-batch size of 128. The epoch with the best validation performance (evaluated on 5000 held-out MNIST training examples) was used.

Reconstruction-based readout of the Capsule Network

In the training procedure of the original Capsule Network [sabour_dynamic_2017], the informativeness of the class-specific activation vectors (‘DigitCaps’) is promoted by minimizing the reconstruction error of a decoder reading out the vector activation related to each example’s correct class. [frosst_darccc:_2018, qin_detecting_2019] suggested using the reconstruction error during inference, flagging examples with high reconstruction error (conditioned on their inferred class) as potentially adversarial. While rejecting suspicious images and avoiding their classification is a legitimate engineering solution, for a vision model we require that class-conditional probabilities always be available. Hence, instead of using the reconstruction error as a rejection criterion, we used it as a classification signal. Reading out the decoder’s output in the official pre-trained Capsule Network, the 10 mean squared reconstruction errors (conditional on each class) were fed into 10 sigmoids, whose responses were calibrated as described in the results section. To eliminate a bias of this error measure towards blank images, we normalized the reconstruction error of each class by dividing it by the mean squared difference between the input image and the average image of all MNIST training examples (averaged across classes).


For each class (digit), we formed a Gaussian KDE model with a class-specific bandwidth hyper-parameter, in which a multivariate Gaussian likelihood with unit covariance is evaluated against all MNIST training examples labeled as that class. The bandwidth was chosen independently for each class from a range of 100 logarithmic steps to maximize the likelihood of 500 held-out training examples. The ten resulting log-likelihoods were fed as penultimate activations to a sigmoid readout layer, calibrated as described in the results section.
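A minimal sketch of a class-conditional Gaussian KDE and its bandwidth selection follows. It assumes an isotropic Gaussian kernel whose standard deviation is the bandwidth, which is one common parameterization and may differ from the exact form used here; the bandwidth range (`lo`, `hi`) and the toy one-dimensional data are hypothetical.

```python
import math

def kde_log_likelihood(x, examples, sigma):
    """Log-density of x under an isotropic Gaussian KDE centred on `examples`."""
    d = len(x)
    log_norm = -0.5 * d * math.log(2 * math.pi * sigma ** 2)
    exponents = [-sum((a - b) ** 2 for a, b in zip(x, e)) / (2 * sigma ** 2)
                 for e in examples]
    m = max(exponents)
    # Log-mean-exp, shifted by the maximum exponent for numerical stability.
    mean_exp = sum(math.exp(v - m) for v in exponents) / len(examples)
    return log_norm + m + math.log(mean_exp)

def choose_bandwidth(train, held_out, lo=1e-2, hi=1e1, steps=100):
    """Pick the bandwidth on a logarithmic grid that maximizes held-out likelihood."""
    grid = [lo * (hi / lo) ** (k / (steps - 1)) for k in range(steps)]
    return max(grid, key=lambda s: sum(kde_log_likelihood(x, train, s)
                                       for x in held_out))

# Toy one-dimensional "class": three training examples and two held-out examples.
train = [[0.0], [1.0], [2.0]]
held_out = [[0.5], [1.5]]
sigma_best = choose_bandwidth(train, held_out)
```

Working in log space matters for image data: with hundreds of pixel dimensions the individual kernel values underflow, whereas the shifted log-mean-exp remains finite.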

Figure S1: A random sample of the negative examples (non-digits) used as a background class for the small VGG model. (A) Pixel-scrambled MNIST images. (B) Fourier-phase-scrambled MNIST images. (C) EMNIST letters [cohen_emnist:_2017], excluding the letters o, s, z, l, i, q and g. (D) Patches cropped from natural images. The small VGG model was trained on the MNIST dataset plus a dataset of a similar size for each of these four non-digit classes (so the digit images were only a fifth of the training set). The non-digit class labels were coded as all-zero rows in the one-hot coding.
Figure S2: The entire set of controversial stimuli we synthesized, organized by digit pairs. Each subpanel indicates a targeted digit pair and the rows and columns within each subpanel indicate the targeted model pairs. For example, consider the bottom-left image in the bottom-left subpanel. This image (seen to us as a 9) is classified as a 0 by the small VGG model and as 9 by the Schott ABS model. Missing (crossed) cells are either along the diagonal (where the two models would agree) or where our optimization procedure did not converge to a sufficiently controversial image (a controversiality score of at least 0.75). Best viewed digitally.
Figure S3: The same controversial stimuli as in Fig. S2, organized by model pairs. Each subpanel shows the stimuli synthesized for one pair of targeted models. The rows and columns within each subpanel indicate the targeted digits. Missing (crossed) cells are either along the diagonal (where the two models would agree) or where our optimization procedure did not converge to a sufficiently controversial image (a controversiality score of at least 0.75). Best viewed digitally.
Figure S4: A trial in the human experiment. The subjects had to rate the presence of each digit from 0% to 100%. These ratings do not need to sum to 100%. The ’Previous’ button enabled subjects to go back one trial to correct their responses.
Figure S5: Mean squared error of each model in predicting the human responses to all of the test stimuli, after recalibrating each model independently to minimize the human response prediction error. In both subpanels, recalibration was performed by fitting two readout parameters per model (the readout parameters were shared across digits and were not fit to predict individual subjects). Note that this procedure introduces a small optimistic bias to the MSE measures. (A) Recalibrated multi-label readout. The logits (i.e., the inputs to the ten readout sigmoids, see main text) were recalibrated by a linear transformation. The slope and intercept of this transformation were the two free parameters that were adjusted to minimize each model’s MSE. (B) Recalibrated multi-class (softmax) readout. Here the models were evaluated using a modified softmax readout [schott_towards_2019] applied to the logits of the classes. For each model, the two parameters of this readout were adjusted to minimize the across-subject MSE. With both kinds of recalibrated readout (A and B), the advantage of the generative ABS model over all of the discriminative models was maintained, and all of the models still performed worse than the ’leave 1 subject out’ benchmark in predicting the human responses.
Figure S6: Alternative measures of the model-human prediction error. Both subpanels here analyze the same prediction errors depicted in main text Fig. 4 using more flexible association measures. (A) Linear correlation coefficients of each model and each subject’s responses, calculated over all of the test stimuli. Each subject’s responses were linearly correlated with each model’s predictions. Note that this measurement allows for shifting and scaling of the model’s prediction on an individual subject level. (B) Mean squared error of each model in predicting each subject’s responses to all of the test stimuli, after applying isotonic regression between each model’s predictions and each subject’s responses. The scikit-learn implementation [pedregosa_scikit-learn:_2011] was used to find a monotonic function mapping each model’s predictions to each subject’s responses. The MSE of each model’s predictions after this mapping was applied is shown here. Since the subjects used a 5-point scale, the isotonic mapping consisted of four thresholds, partitioning the model’s predictions into the five human rating levels. Similarly to subpanel A, this analysis allows for fitting each model’s predictions to each subject’s individual responses, and does so with greater flexibility. Here as well, the advantage of the generative ABS model over all of the discriminative models was maintained, and all of the models still performed worse than the ’leave 1 subject out’ benchmark.
Figure S7: Mean squared error of each model in predicting the human responses only to the 100 test MNIST images included in the human testing (i.e., no synthetic controversial stimuli were included). (A) Multi-label readout, without recalibration. The analysis from which this figure resulted is equivalent to the one described in main text Fig. 4 except for measuring the error only for the MNIST test images. When evaluated on test MNIST alone, the discriminative models are more compatible with the human responses than the generative models. (B) The MSE after recalibrating the modified softmax readout (see Fig. 4B). An (optimally recalibrated) modified softmax readout reduces the error of the ABS model on the MNIST test images, but the high-accuracy discriminative models are still better at predicting the subjects’ responses to these stimuli.