The Notorious Difficulty of Comparing Human and Machine Perception

04/20/2020 ∙ by Christina M. Funke, et al.

With the rise of machines to human-level performance in complex recognition tasks, a growing amount of work is directed towards comparing information processing in humans and machines. These works have the potential to deepen our understanding of the inner mechanisms of human perception and to improve machine learning. Drawing robust conclusions from comparison studies, however, turns out to be difficult. Here, we highlight common shortcomings that can easily lead to fragile conclusions. First, even if a model achieves high performance on a task on which humans also do well, its decision-making process is not necessarily human-like, and further analyses can reveal substantial differences. Second, the performance of neural networks is sensitive to training procedures and architectural details. Thus, generalizing conclusions from specific architectures is difficult. Finally, when comparing humans and machines, equivalent experimental settings are crucial in order to identify innate differences. Addressing these shortcomings alters or refines the conclusions of studies. We show that, despite their ability to solve closed-contour tasks, our neural networks use different decision-making strategies than humans. We further show that there is no fundamental difference between same-different and spatial tasks for common feed-forward neural networks and, finally, that neural networks do experience a "recognition gap" on minimal recognizable images. All in all, care has to be taken to not impose our human systematic bias when comparing human and machine perception.


1 Introduction

How biological brains infer environmental states from sensory data is a long-standing question in neuroscience and psychology. In recent years, a new tool to study human visual perception has emerged: artificial deep neural networks (DNNs). They perform complex perceptual inference tasks like object recognition (Krizhevsky et al., 2012) or depth estimation (Eigen & Fergus, 2015) at human-like accuracies. These artificial networks may therefore encapsulate some key aspects of the information processing in human brains and thus invite the enticing possibility that we may learn from one system by studying the other (Hassabis et al., 2017; Jozwik et al., 2018; Kar et al., 2019; Kriegeskorte & Douglas, 2018; Lake et al., 2017).

A range of studies has followed this route, often comparing characteristic features of information processing between humans and machines (Geirhos, Temme et al., 2018; Ullman et al., 2016; Srivastava et al., 2019; Han et al., 2019). However, many subtle issues exist when comparing humans and machines, which can substantially alter or even invert the conclusions of a study. To demonstrate these difficulties we discuss and analyze three case studies:

  1. Closed Contour Detection Distinct visual elements can be grouped together by the human visual system to appear as a “form” or “whole”, as described by the Gestalt principles of prägnanz or good continuation. As such, closed contours are thought to be prioritized by the human perceptual system and to be important in perceptual organization (Koffka, 2013; Elder & Zucker, 1993; Kovacs & Julesz, 1993; Tversky et al., 2004; Ringach & Shapley, 1996). Starting from the hypothesis that contour integration is difficult for DNNs, we here test how well humans and neural networks can separate closed from open contours. Surprisingly, we find that both humans and our DNN reach high accuracies (see also X. Zhang et al. (2018)). However, our further analyses reveal that our model performs this task in ways very different from humans and that it does not actually understand the concept of closedness. This case study highlights that several types of analyses are crucial to investigate the strategies learned by a machine model and to understand differences in inferential processes when comparing humans and machines.

  2. Synthetic Visual Reasoning Test The Synthetic Visual Reasoning Test (SVRT) (Fleuret et al., 2011) consists of problems that require abstract visual reasoning (cf. Figure 2A). Several studies compared humans against varying machine learning algorithms on these tasks (Fleuret et al., 2011; Ellis et al., 2015; Stabinger et al., 2016; J. Kim et al., 2018). A key result was that DNNs could solve tasks involving spatial arrangements of objects but struggled to learn the comparison of shapes (so-called same-different tasks). This led J. Kim et al. (2018) to argue that feedback mechanisms including attention and perceptual grouping would be key computational components underlying abstract visual reasoning. We show that the large divergence in task difficulty is fairly specific to the minimal networks chosen in the latter study, and that common feed-forward DNNs like ResNet-50 experience little to no difference in task difficulty under common settings. While certain differences do exist in the low-data regime, we argue that this regime is not suited for drawing conclusions about differences between human and machine visual systems given the large divergence in prior visual experiences and many other confounding factors like regularization or training procedures. In other words, care has to be taken when drawing general conclusions that reach beyond the tested architectures and training procedures.

  3. Recognition Gap Ullman et al. (2016) investigated the minimal visual object information necessary for recognition by successively cropping or reducing the resolution of a natural image until humans failed to identify the object. The study revealed that recognition performance dropped sharply if the minimal recognizable image patches were reduced any further. They refer to this drop in performance as the recognition gap. This recognition gap was much smaller in the tested machine vision algorithms, and the authors concluded that machine vision algorithms would not be able to “explain [humans’] sensitivity to precise feature configurations” (Ullman et al., 2016). In a similar study, Srivastava et al. (2019) identified “fragile recognition images” with a machine-based procedure and found a larger recognition gap for the machine algorithms than for humans. We here show that the differences in recognition gaps identified by Ullman et al. (2016) can at least in part be explained by differences in the experimental procedures for humans and machines and that a large recognition gap does exist for our DNN. Put differently, this case study emphasizes that humans and machines should be exposed to equivalent experimental settings.

2 Related work

Comparative psychology and psychophysics have a long history of studying mental processes of non-human animals and performing cross-species comparisons. For example, they investigate what can be learned about human behavior and perception by examining model systems such as monkeys or mice and describe challenges of comparing different systems (Romanes, 1883; Köhler, 1925; Koehler, 1943; Haun et al., 2011; Boesch, 2007; Tomasello & Call, 2008). With the wave of excitement about DNNs as a new model of the human visual system, it may be worthwhile to transfer lessons from this long comparative tradition.

A growing body of work discusses this on a higher level. Majaj & Pelli (2018) provide a broad overview of how machine learning can help vision scientists study biological vision, while Barrett et al. (2019) review methods for analyzing representations of biological and artificial networks. From the perspective of cognitive science, Cichy & Kaiser (2019) stress that Deep Learning models can serve as scientific models that not only provide helpful predictions and explanations but can also be used for exploration. Furthermore, from the perspective of psychology and philosophy, Buckner (2019) emphasizes often-neglected caveats when comparing humans and DNNs, such as human-centered interpretations, and calls for discussions regarding how to properly align machine and human performance. Chollet (2019) proposes a general Artificial Intelligence benchmark and suggests evaluating intelligence as “skill-acquisition efficiency” rather than focusing on skills at specific tasks.

In the following, we give a brief overview of studies that compare human and machine perception. In order to test whether DNNs have cognitive abilities similar to humans, a number of studies test DNNs on abstract (visual) reasoning tasks (Barrett et al., 2018; Yan & Zhou, 2017; Wu et al., 2019; Santoro et al., 2017; Villalobos et al.). Other comparison studies focus on whether human visual phenomena such as illusions (Gomez-Villa et al., 2019; Watanabe et al., 2018; B. Kim et al., 2019) or crowding (Volokitin et al., 2017; Doerig et al., 2019) can be reproduced in computational models. In an attempt to probe intuition in machine models, DNNs are compared to intuitive physics engines, i.e. probabilistic models that simulate physical events (R. Zhang et al., 2016).

Other works investigate whether DNNs are sensible models of human perceptual processing. To this end, their predictions or internal representations are compared to those of biological systems; for example, to human and/or monkey behavioral representations (Peterson et al., 2016; Schrimpf et al., 2018; Yamins et al., 2014; Eberhardt et al., 2016; Golan et al., 2019), human fMRI representations (Han et al., 2019; Khaligh-Razavi & Kriegeskorte, 2014) or monkey cell recordings (Schrimpf et al., 2018; Khaligh-Razavi & Kriegeskorte, 2014; Yamins et al., 2014).

A great number of studies focus on manipulating tasks and/or models. Researchers often use generalization tests on data dissimilar to the training set (X. Zhang et al., 2018; Wu et al., 2019) to test whether machines understood the underlying concepts. In other studies, the degradation of object classification accuracy is measured with respect to image degradations (Geirhos, Temme et al., 2018) or with respect to the type of features that play an important role for human or machine decision-making (Geirhos, Rubisch et al., 2018; Brendel & Bethge, 2019; Kubilius et al., 2016; Ullman et al., 2016; Ritter et al., 2017). A lot of effort is being put into investigating whether humans are vulnerable to small, adversarial perturbations in images (Elsayed et al., 2018; Zhou & Firestone, 2019; Han et al., 2019; Dujmović et al., 2020), as DNNs have been shown to be (Szegedy et al., 2013). Similarly, in the field of Natural Language Processing, a trend is to manipulate the data set itself, for example by negating statements, to test whether a trained model gains an understanding of natural language or whether it only picks up on statistical regularities (Niven & Kao, 2019; McCoy et al., 2019).

Further work takes inspiration from biology or uses human knowledge explicitly in order to improve DNNs. Spoerer et al. (2017) found that recurrent connections, which are abundant in biological systems, allow for higher object recognition performance than pure feed-forward networks, especially in challenging situations such as in the presence of occlusions. Furthermore, several researchers suggest (X. Zhang et al., 2018; J. Kim et al., 2018) or show (Wu et al., 2019; Barrett et al., 2018; Santoro et al., 2017) that designing network architectures or features with human knowledge is key for machine algorithms to successfully solve abstract (reasoning) tasks.

Despite a multitude of studies, comparing human and machine perception is not straightforward. An increasing number of studies assess other comparative studies: Dujmović et al. (2020), for example, show that human and computer vision are less similar than claimed by Zhou & Firestone (2019), as humans cannot decipher adversarial examples: their judgments of the latter depend on the experimental settings, specifically the choice of stimuli and labels. Another example is the study by Srivastava et al. (2019), which performs an experiment similar to Ullman et al. (2016) but with swapped roles for humans and machines. In this case, a large recognition gap is found for machines but only a small one for humans.

3 Methods

In this section, we summarize the data sets as well as the procedures for the three case studies: (1) Closed Contour Detection, (2) Synthetic Visual Reasoning Test, and (3) Recognition Gap. All code is available at https://github.com/bethgelab/notorious_difficulty_of_comparing_human_and_machine_perception.

3.1 Data sets

Closed Contour Detection

We created a data set with images of size px that each contained either one open or one closed contour, which consisted of straight line segments, as well as several flankers with either one or two line segments (Figure 1A). The lines were black and the background was uniformly gray. More details on the stimulus generation can be found in Appendix A.1.

Additionally, we constructed variants of the data set to test generalization performance (Figure 1A). Nine variants consisted of contours with straight lines. Six of these featured varying line styles like changes in line width (, , ) and/or line color (, ). For one variant (), we increased the number of edges in the main contour. Another variant () had no flankers, and yet another variant () featured asymmetric flankers. For one further variant (), the lines were binarized (only black or gray pixels instead of different gray tones).

In another six variants, the contours as well as the flankers were curved, meaning that we modulated a circle with a radial frequency function. The first four variants did not contain any flankers and the main contour had a fixed size of (), () and (). For another variant (), the contour was a dashed line. Finally, we tested the effect of different flankers by adding one additional closed, yet dashed contour () or one to four open contours ().

Synthetic Visual Reasoning Test

The SVRT (Fleuret et al., 2011) consists of 23 different abstract visual reasoning tasks. We used the original C-code provided by Fleuret et al. (2011) to generate the images, all of which had the same size. For each problem, we used separate sets of images for training, validation and testing.

Recognition Gap

We used two data sets for this experiment. One consisted of ten natural, color images whose grayscale versions were also used in the original study by Ullman et al. (2016). We discarded one image from the original data set as it does not correspond to any ImageNet class. For our ground truth class selection, please see Appendix C.3. The second data set consisted of images from the ImageNet (Deng et al., 2009) validation set. All images were pre-processed as in standard training of ResNet (i.e. resizing the shorter side to 256 pixels, cropping centrally to 224 x 224 pixels, and normalizing).
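As a reference, this is a minimal sketch of the standard ResNet pre-processing pipeline described above, using torchvision; the normalization statistics are the usual ImageNet values and are an assumption here, not values taken from the paper.

    import torchvision.transforms as T

    # Standard ImageNet-style pre-processing: resize the shorter side to 256 px,
    # crop the central 224 x 224 patch, and normalize per channel.
    preprocess = T.Compose([
        T.Resize(256),
        T.CenterCrop(224),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406],
                    std=[0.229, 0.224, 0.225]),
    ])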

3.2 Experimental Procedures

3.2.1 Closed Contour Detection

Fine-tuning and Generalization tests

We fine-tuned a ResNet-50 (He et al., 2016), pre-trained on ImageNet (Deng et al., 2009), on the closed contour task. We replaced the last fully connected, 1000-way classification layer by a layer with only one output neuron to perform binary classification with a fixed decision threshold. The weights of all layers were fine-tuned using the Adam optimizer (Kingma & Ba, 2014) with a fixed batch size. All images were pre-processed to have the same mean and standard deviation and were randomly mirrored horizontally and vertically for data augmentation. The model was trained for a fixed number of epochs with a constant learning rate, and a held-out validation set was used for model selection.
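A minimal sketch of this fine-tuning setup in PyTorch follows; the learning rate and the loss formulation on the single output unit are assumptions, since the exact hyper-parameter values are not reproduced above.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet50

    # Replace the 1000-way ImageNet head by a single output unit (closed vs. open).
    model = resnet50(pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, 1)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # placeholder learning rate
    criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy on the single logit (assumed loss)

    def train_step(images, labels):
        """One fine-tuning step; labels are 0 (open) or 1 (closed)."""
        optimizer.zero_grad()
        logits = model(images).squeeze(1)
        loss = criterion(logits, labels.float())
        loss.backward()
        optimizer.step()
        return loss.item()

    # At test time, the single logit is compared against a decision threshold
    # (optimized per data set, see Appendix A.3).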

To determine the generalization performance, we evaluated the model on the test sets without any further training. Each of the test sets contained images. To account for the distribution shift between the original training images and the generalization tasks, we optimized the decision threshold (a single scalar) for each data set (see Appendix A.3).

Adversarial Examples

Loosely speaking, an adversarial example is an image that, to humans, appears very similar to a correctly classified image, but is misclassified by a machine vision model. We used the python package foolbox (Rauber et al., 2017) to find adversarials on the closed contour data set (parameters: CarliniWagnerL2Attack, max_iterations=1000, learning_rate=10e-3).
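A sketch of how such an attack can be run with the foolbox 1.x/2.x API (Rauber et al., 2017) is shown below; wrapping the single-logit classifier into a two-logit model and details such as the input bounds are assumptions, not the authors' exact code.

    import torch
    import torch.nn as nn
    import foolbox

    class TwoLogitWrapper(nn.Module):
        """Turn the single closed/open logit z into two class logits [-z, z]
        so that the attack can work with standard class probabilities."""
        def __init__(self, model):
            super().__init__()
            self.model = model
        def forward(self, x):
            z = self.model(x)
            return torch.cat([-z, z], dim=1)

    fmodel = foolbox.models.PyTorchModel(TwoLogitWrapper(model).eval(),
                                         bounds=(0, 1), num_classes=2)
    attack = foolbox.attacks.CarliniWagnerL2Attack(fmodel)

    # `image` is a single stimulus as a numpy array, `label` its class (0 = open, 1 = closed).
    adversarial = attack(image, label, max_iterations=1000, learning_rate=10e-3)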

BagNet-based Model and Heatmaps

We fine-tuned the weights of an ImageNet-pre-trained BagNet-33 (Brendel & Bethge, 2019). This network is a variation of ResNet-50 in which most 3 x 3 kernels are replaced by 1 x 1 kernels, so that the receptive field size at the top-most convolutional layer is restricted to 33 x 33 pixels. We replaced the final layer to map to one single output unit and used the RAdam optimizer (Liu et al., 2019) with a fixed initial learning rate. The training images were generated on-the-fly, meaning that new images were produced for each epoch; the fine-tuning lasted a fixed number of epochs in total. Since BagNet-33 yields log-likelihood values for each 33 x 33 pixel patch in the image, which can be visualized as a heatmap, we could identify exactly how each patch contributed to the classification decision. Such a straightforward interpretation of the contributions of single image patches is not possible with standard DNNs like ResNet (He et al., 2016) due to their large receptive field sizes in the top layers.
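The following sketch shows how such a BagNet-33-based model and its patch-wise heatmap could be set up; the import path and attribute names follow the public BagNet implementation (a ResNet-50 variant) but are assumptions here, not the authors' exact code.

    import torch
    import torch.nn as nn
    import bagnets.pytorchnet  # bethgelab "bag-of-local-features-models" package (assumed import path)

    bagnet = bagnets.pytorchnet.bagnet33(pretrained=True)
    bagnet.fc = nn.Linear(bagnet.fc.in_features, 1)  # single logit: closed vs. open

    def patch_logit_heatmap(model, image):
        """Patch-wise evidence map: apply the final linear layer to every spatial
        position of the last convolutional feature map instead of averaging first."""
        backbone = nn.Sequential(*list(model.children())[:-2])  # drop avgpool and fc (assumed layout)
        with torch.no_grad():
            feats = backbone(image.unsqueeze(0))                      # (1, C, H, W)
            heat = torch.einsum('bchw,oc->bohw', feats, model.fc.weight)
            heat = heat + model.fc.bias.view(1, -1, 1, 1)
        return heat[0, 0]  # (H, W) map of local logits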

3.2.2 Synthetic Visual Reasoning Test

For each of the SVRT problems, we fine-tuned the ResNet-50-based model (as described in section 3.2.1). The same pre-processing, data augmentation, optimizer and batch size as for the closed contour data set were used.

Varying Number of Training Images

To fine-tune the models, we used subsets containing either , or images. The number of epochs depended on the size of the training set: The model was fine-tuned for respectively , or epochs. For each training set size and SVRT problem, we used the best learning rate after a hyper-parameter search on the validation set, where we tested the learning rates [, , ].
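A sketch of such a grid search is given below; the training-set sizes and candidate learning rates are placeholders (the original values are not preserved above), and train_and_evaluate is a hypothetical helper that fine-tunes a model for a given SVRT problem and returns its validation accuracy.

    # Hypothetical grid search over training-set size and learning rate.
    train_sizes = [1_000, 10_000, 100_000]      # placeholder subset sizes
    learning_rates = [1e-3, 1e-4, 1e-5]         # placeholder candidate learning rates

    best = {}
    for n_train in train_sizes:
        for lr in learning_rates:
            val_acc = train_and_evaluate(problem, n_train=n_train, lr=lr)  # hypothetical helper
            if val_acc > best.get(n_train, (0.0, None))[0]:
                best[n_train] = (val_acc, lr)   # keep the best learning rate per subset size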

Initialization with Random Weights

As a control experiment, we also initialized the model with random weights and we again performed a hyper-parameter search over the learning rates [, , ].

3.2.3 Recognition Gap

Model

In order to evaluate the recognition gap, the model had to be able to handle small input images. With standard networks like ResNet (He et al., 2016), there is no clear way to do this. In contrast, BagNet-33 (Brendel & Bethge, 2019) allows us to straightforwardly analyze images as small as 33 x 33 pixels and hence was our model of choice for this experiment. For more details on BagNet-33, see Section 3.2.1.

Minimal recognizable images

Similar to Ullman et al. (2016), we defined minimal recognizable images or configurations (MIRCs) as those patches of an image for which an observer (by which we mean an ensemble of humans or one or several machine algorithms) reaches an accuracy above a criterion, but for which any additional cropping of the corners or reduction in resolution would lead to an accuracy below that criterion. MIRCs are thus inherently observer-dependent. The original study only searched for MIRCs in humans. We implemented the following procedure to find MIRCs in our DNN: We passed each pre-processed image through BagNet-33 and selected the most predictive crop according to its probability. See Appendix C.2 on how we handled cases where the probability saturates at 1 and Appendix C.1 for different treatments of ground truth class selections. If the probability of the full-size image for the ground-truth class exceeded the criterion, we again searched for the subpatch with the highest probability. We repeated the search procedure until the class probability for all subpatches fell below the criterion. If a subpatch was smaller than 33 x 33 pixels, which is BagNet-33's smallest natural patch size, the crop was upsampled to 33 x 33 pixels using bilinear sampling. We evaluated the recognition gap as the difference in accuracy between the MIRC and the best-performing sub-MIRC. This definition was more conservative than the one from Ullman et al. (2016), who considered the maximum difference between a MIRC and its sub-MIRCs. Please note that one difference between our machine procedure and the psychophysics experiment by Ullman et al. (2016) remained: the former was greedy, whereas the latter corresponded to an exhaustive search under certain assumptions.
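A simplified sketch of this greedy search is given below; the acceptance criterion, the crop step size and the assumption of a single ground-truth class are placeholders (the experiments also consider joint classes, see Appendix C.1).

    import torch
    import torch.nn.functional as F

    def mirc_search(model, image, target_class, criterion=0.5, min_size=33, shrink=0.8):
        """Greedy MIRC search sketch: descend to the child patch (four corner crops
        or a reduced-resolution version) with the highest target-class probability
        until no child exceeds the criterion; return the MIRC and the recognition gap."""

        def prob(patch):
            if patch.shape[-1] < min_size:  # upsample below BagNet-33's natural patch size
                patch = F.interpolate(patch.unsqueeze(0), size=min_size,
                                      mode='bilinear', align_corners=False).squeeze(0)
            with torch.no_grad():
                return torch.softmax(model(patch.unsqueeze(0)), dim=1)[0, target_class].item()

        def children(patch):
            s = patch.shape[-1]
            new = max(1, int(round(s * shrink)))
            crops = [patch[:, :new, :new], patch[:, :new, -new:],
                     patch[:, -new:, :new], patch[:, -new:, -new:]]      # four corner crops
            lowres = F.interpolate(patch.unsqueeze(0), size=new,
                                   mode='bilinear', align_corners=False).squeeze(0)
            return crops + [lowres]

        current = image
        while True:
            best_p, best_child = max(((prob(c), c) for c in children(current)),
                                     key=lambda t: t[0])
            if best_p < criterion:
                return current, prob(current) - best_p  # MIRC and recognition gap
            current = best_child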

4 Results

4.1 Closed Contour Detection

In this case study, we compared humans and machines on a closed contour detection task. For humans, a closed contour flanked by many open contours perceptually stands out. In contrast, detecting closed contours might be difficult for DNNs, as it presumably requires long-range contour integration.

Humans identified the closed contour stimulus very reliably in a two-interval forced choice task. Specifically, participants achieved a performance of (SEM = ) on stimuli whose generation procedure was identical to the training set. For stimuli with white instead of black lines, the performance was (SEM = ). The psychophysical experiment is described in Appendix A.2.

Our ResNet-50-based model also performed well on the closed contour task. On the test set, our model reached a high accuracy (cf. Figure 1A [i.i.d. to training]).

To gain a better understanding of the strategies and features used by our ResNet-50-based model to solve the task, we performed three additional experiments: First, we tested how well the model generalized to modifications of the data set such as different line-widths. Second, we looked at the minimal modifications necessary to flip the decision of our model. And third, we employed a BagNet-33-based model to understand whether the task could be solved without global contour integration.

Generalization

We found that our trained model generalized well to many but not all modified stimulus sets (cf. Figure 1A and B). Despite the pronounced shift from straight-lined polygons in the training data to curvy contours in the test sets, the model generalized perfectly to curvy contours () as long as the contour diameter remained below a certain size. Also, adding a dashed, closed contour () as a flanker did not lower performance. The classification ability of the model remained similarly high for the no flankers () and the asymmetric flankers condition (). When testing our model on main contours that consisted of more edges than the ones presented during training (), the performance was also hardly impaired. It remained high as well when multiple curvy open contours were added as flankers ().

The following variations seemed more difficult for our model: If the size of the contour got too large, a moderate drop in accuracy was found (). For binarized images, our model’s performance was also reduced (). And finally, (almost) chance performance was observed when varying the line width (, , ), when changing the line color (, ) or when using dashed curvy lines ().

Minimal adversarial modifications

We found that small changes to the image, which are hardly recognizable to humans, were sufficient to change the decision of the model (Figure 1B). These small changes did not alter the perception of the contours for humans, suggesting that the model does not use the same features as humans to classify closed contours.

BagNet

A BagNet-33-based model, which by construction cannot integrate contours over more than 33 pixels, still reached close-to-ceiling performance. In other words, contour integration was not necessary to perform well on the task. The heatmaps of the model (cf. Figure 1C), which highlight the contribution of each patch to the final classification decision, reveal why: an open contour could often be detected by the presence of an end-point at a short edge. Since all flankers in the training set had edges larger than 33 pixels, the presence of this feature was an indicator of an open contour. In turn, the absence of this feature was an indicator of a closed contour.

Figure 1: A: Our ResNet-50-model generalized well to many data sets, suggesting it would be able to distinguish closed and open contours. B: However, the poor performance on many other data sets showed that our model did not learn the concept of closedness. C: We generated adversarial examples for images of the closed contour data set. If the network used similar features as humans to discriminate closed from open contours, then adversarial images should swap the class label for humans. However, they appeared identical to the original images. D: The heatmaps of our BagNet-33-based model show which parts of the image provided evidence for closedness (blue) or openness (red). The patches on the sides show the most extreme, non-overlapping patches and their logit values. The logit distribution shows that most patches had logit values close to zero (y-axis truncated) and that many more patches in the open stimulus contributed positive logit values. Figure best viewed electronically.

4.2 Synthetic Visual Reasoning Test

For each SVRT subtask, we fine-tuned a pre-trained ResNet-50-based model on a much smaller number of training images than the one million images used by J. Kim et al. (2018) and reached high accuracy on all sub-tasks, including tasks that required same-different judgments (Figure 2B). This finding is contrary to the original result by J. Kim et al. (2018), which showed a large accuracy gap between same-different and spatial reasoning tasks.

The performance of our model on the test set decreased when we reduced the number of training images. In particular, we found that the performance on same-different tasks dropped more rapidly than on spatial reasoning tasks. If the ResNet-50 was trained from scratch (i.e. weights were randomly initialized instead of loaded from pre-training on ImageNet), the performance dropped only slightly on all but one spatial reasoning task. Larger drops were found on same-different tasks.

Figure 2: A: For three of the 23 SVRT problems, two example images representing the two opposing classes are shown. In each problem, the task was to find the rule that separated the images and to sort them accordingly. B: J. Kim et al. (2018) trained a DNN on each of the problems. They found that same-different tasks (red points), in contrast to spatial tasks (blue points), could not be solved with their models. Our ResNet-50-based models reached high accuracies for all problems when using sufficient training examples and weights from pre-training on ImageNet.

4.3 Recognition Gap

We tested our model on machine-selected minimal recognizable patches (MIRCs) to evaluate the recognition gap in machines in a way as similar as possible to the way in which Ullman et al. (2016) evaluated the recognition gap in humans. The recognition gap was measured as the gap between the class probability on the MIRC versus the crop or lower-resolution version of the MIRC with the highest class probability (cf. Figure 3A). On average, we found a substantial recognition gap for our model on the original data of Ullman et al. (2016), and a similar value on our subset of ImageNet. This was similar to the recognition gap in humans and contrasted with the much smaller recognition gap that Ullman et al. (2016) reported for machines on human-selected MIRCs and sub-MIRCs.

Figure 3: A: BagNet-33's probability of the correct class for decreasing patches: The sharp drop when the patch became too small or the resolution too low was called the 'recognition gap' (Ullman et al., 2016). The patch size on the x-axis corresponds to the size of the original image in pixels. Steps of reduced resolution are not displayed so that the three sample stimuli can be shown coherently. B: Recognition gaps for machine algorithms (vertical bars) and humans (gray horizontal bar). A recognition gap was identifiable for the DNN BagNet-33 when testing machine-selected stimuli from Ullman et al. (2016) and a subset of the ImageNet validation images (Deng et al., 2009). Error bars denote standard deviation.

5 Discussion

We examined three case studies comparing human and machine visual perception. Each case study illustrates a potential pitfall in these comparisons.

5.1 Closed Contour Detection — Human-biased judgment might lead to wrong conclusions

We find that both humans and our ResNet-50-based model can reliably tell apart images containing a closed contour from images containing an open contour. Furthermore, we find several successful generalization cases outside the i.i.d. regime of the training data. Although our model was trained on polygons with straight edges only, it also performs well on, for example, curvy lines. These results suggest that our model did, in fact, learn the concept of open and closed contours and that it performs a contour integration-like process similar to humans.

However, this would be a human-centered interpretation, as shown by further analyses: For one, even seemingly small changes such as different line colors or line widths often drastically decrease the performance of our model. Second, almost imperceptible image manipulations exist that flip the decision of the model. For humans, these manipulations do not alter the perception of closedness, suggesting that our model learned to solve the task without properly integrating the contours. Finally, we analyzed which alternative features could possibly allow solving the task, using a Bag-of-Features network. Interestingly, there do exist local features, such as an endpoint in conjunction with a short edge, that can often give away the correct class label. Whether or not this feature is actually used by the ResNet-50-based model is unclear, but its existence highlights the possibility that our previously stated assumption, namely that this task would only be solvable with contour integration, is misleading. In fact, as humans, we might easily miss the many statistical subtleties by which a given task could be solved. In this respect, BagNets proved to be a useful tool to test purportedly "global" visual tasks for the presence of local artifacts.

Altogether, we applied three methods to analyze the classification process adopted by a machine learning model in this case study: (1) testing the generalization of the model to non-i.i.d. data sets involving the same visual inference task; (2) generating adversarial example images; and (3) training and testing a model architecture (BagNet) that is designed to be interpretable. These techniques provide complementary ways to investigate the strategies learned by a machine learning model and to better understand differences in inferential processes compared to humans. To avoid premature conclusions about what models did and did not learn, we advocate for the routine use of such analysis techniques.

5.2 Synthetic Visual Reasoning Test — Generalizing conclusions from specific architectures and training procedures is difficult

Previous studies (Stabinger et al., 2016; J. Kim et al., 2018) explored how well deep neural networks can learn visual relations by testing them on the Synthetic Visual Reasoning Test (Fleuret et al., 2011). Both studies found a dichotomy between two task categories: While a high accuracy was reached on spatial problems, the performance on same-different problems was poor. In order to compare the two types of tasks more systematically, J. Kim et al. (2018) developed a parameterized version of the SVRT data set called PSVRT. Using this data set, they found that for same-different problems, an increase in the complexity of the data set could quickly strain their model. The DNNs used by J. Kim et al. (2018) consisted of up to six layers. From these results the authors concluded that same-different problems would be more difficult to learn than spatial problems. More generally, these papers have been perceived and cited with the broader claim that feed-forward DNNs are not able to learn same-different relationships between visual objects (Serre, 2019; Schofield et al., 2018).

The previous findings of J. Kim et al. (2018) were based on rather small neural networks consisting of up to six layers. However, typical network architectures used for object recognition consist of more layers and have larger receptive fields. When testing a representative of such DNNs, namely ResNet-50, we find that feed-forward models can in fact perform well on same-different tasks (see also the concurrent work of Messina et al. (2019)). In total, we used fewer images to train the model than J. Kim et al. (2018) (one million images) and Messina et al. (2019) (400,000 images). Although our experiments in the very low-data regime show that same-different tasks require more training samples than spatial reasoning tasks, this cannot be taken as evidence for systematic differences between feed-forward neural networks and the human visual system. In contrast to the neural networks used in this experiment, the human visual system is naturally pre-trained on large amounts of abstract visual reasoning tasks, thus making the low-data regime an unfair testing scenario from which it is almost impossible to draw solid conclusions about differences in the internal information processing. In other words, it might very well be that the human visual system trained from scratch on the two types of tasks would exhibit a similar difference in sample efficiency as a ResNet-50.

Furthermore, the performance of a network in the low-data regime is heavily influenced by many factors other than architecture, including regularization schemes or the optimizer, making it even more difficult to reach conclusions about systematic differences in the network structure between humans and machines.

5.3 Recognition Gap — Humans and machines should be exposed to equivalent experimental settings

Ullman et al. (2016) showed that humans are sensitive to small changes in minimal images. More precisely, humans exhibit a large recognition gap between minimal recognizable images, so-called MIRCs, and sub-MIRCs. For machine algorithms, in contrast, these authors identified only a small recognition gap. However, they tested machines on the patches found in humans, despite the fact that the very definition of MIRCs is inherently observer-dependent. This means that MIRCs look different depending on whether an ensemble of humans or one or several machine algorithms selects them. Put another way, an observer is likely to use different features for recognition and thus to have a lower recognition rate, and hence a lower recognition gap, on MIRCs identified by a different observer. The same argument holds for a follow-up study (Srivastava et al., 2019), which selected “fragile recognition images” (defined similarly but not identically to human-selected MIRCs by Ullman et al. (2016)) in machines and found a moderately high recognition gap for machines, but a low one for humans. Unfortunately, the selection procedures used in Ullman et al. (2016) and Srivastava et al. (2019) are quite different, leaving open the question of whether both humans and machines experience a similar recognition gap. Our results demonstrate that this gap is similar in humans and machines on the respective MIRCs.

These results highlight the importance of testing humans and machines on the exact same footing and of avoiding a human bias in the experiment design. All conditions, instructions and procedures should be as close as possible between humans and machines in order to ensure that all observed differences are due to inherently different decision strategies rather than differences in the testing procedure.

6 Conclusion

We described notorious difficulties that arise when comparing humans and machines. Our three case studies illustrated that confirmation bias can lead to misinterpreting results, that generalizing conclusions from specific architectures and training procedures is difficult, and finally that unequal testing procedures can confound decision behaviors. Addressing these shortcomings altered the conclusions of previous studies. We showed that, despite their ability to solve closed-contour tasks, our neural networks use different decision-making strategies than humans. In addition, there is no fundamental difference between same-different and spatial tasks for common feed-forward neural networks, and they do experience a “recognition gap” on minimal recognizable images.

The overarching challenge in comparison studies between humans and machines seems to be the strong internal human interpretation bias. Not only our expectations of whether or how a machine algorithm might solve a task, but also the human reference point, can confound what we read into results. Appropriate analysis tools and extensive cross-checks, such as variations in the network architecture, alignment of experimental procedures, generalization tests, adversarial examples and tests with constrained networks, help to rationalize the interpretation of findings and to put this internal bias into perspective. All in all, care has to be taken to not impose our human systematic bias when comparing human and machine perception.

7 Author contributions

The closed contour case study was designed by CMF, JB, TSAW and MB and later with WB. The code for the stimuli generation was developed by CMF. The neural networks were trained by CMF and JB. The psychophysical experiments were performed and analysed by CMF, TSAW and JB. The SVRT case study was conducted by CMF under supervision of TSAW, WB and MB. KS designed and implemented the recognition gap case study under the supervision of WB and MB, JB extended and refined it under the supervision of WB and MB. The initial idea to unite the three projects was conceived by WB, MB, TSAW and CMF, and further developed including JB. The first draft was jointly written by JB and CMF with input from TSAW and WB. All authors contributed to the final version and provided critical revisions.

8 Acknowledgments

We thank Alexander S. Ecker, Felix A. Wichmann, Matthias Kümmerer as well as Drew Linsley for helpful discussions. We thank Thomas Serre, Junkyung Kim, Matthew Ricci, Justus Piater, Sebastian Stabinger, Antonio Rodríguez-Sánchez, Shimon Ullman, Liav Assif and Daniel Harari for discussions and feedback on an earlier version of this manuscript. Furthermore, we thank Wiebke Ringels for helping with data collection for the psychophysical experiment.

We thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting CMF and JB. We acknowledge support from the German Federal Ministry of Education and Research (BMBF) through the competence center for machine learning (FKZ 01IS18039A) and the Bernstein Computational Neuroscience Program Tübingen (FKZ: 01GQ1002), the German Excellence Initiative through the Centre for Integrative Neuroscience Tübingen (EXC307), and the Deutsche Forschungsgemeinschaft (DFG; Projektnummer 276693517 – SFB 1233).

Elements of this work were presented at the Conference on Cognitive Computational Neuroscience 2019 and the Shared Visual Representations in Human and Machine Intelligence Workshop at the Conference on Neural Information Processing Systems 2019.

Appendix A Closed Contour Detection

a.1 Closed Contour Data Set - More Details

Each image in the training set contained a main contour, multiple flankers and a background image. The main contour and flankers were drawn into a large image. The main contour and flankers could either be straight or curvy lines, for which the generation processes are respectively described in A.1.1 and A.1.2. The lines had a default thickness. We then re-sized the image to its final resolution using anti-aliasing to transform the black and white pixels into smoother lines with gray pixels at the borders, which reduced the line thickness accordingly. In the following, all specifications of sizes refer to the re-sized image (i.e. a line of a given final length was drawn correspondingly larger into the original image before re-sizing). For the psychophysical experiments (see A.2), we added a white margin on each side of the image to avoid illusory contours at the borders of the image.

Varying Contrast of Background

An image from the ImageNet data set was added as background to the line drawing. We converted the image into LAB color space and linearly rescaled the pixel intensities of the image to produce a normalized contrast value between 0 (uniform gray image) and 1 (original image) (cf. Figure 6A). When adding the image to the line drawing, we replaced all pixels of the line drawing by the values of the background image wherever the background image had a higher grayscale value than the line drawing. For the experiments in the main body, the contrast of the background image was always 0. Only for the additional experiment described in A.4 did we use other contrast levels.

Generation of Image Pairs

We aimed to reduce the statistical properties that could be exploited to solve the task without judging the closedness of the contour. Therefore, we generated image pairs consisting of an “open” and a “closed” version of the same image. The two versions were designed to be almost identical and had the same flankers. They differed only in the main contour, which was either open or closed. Examples of such image pairs are shown in Figure 4. During training, either the closed or the open image of a pair was used. However, for validation and testing, both versions were used. This allowed us to compare the predictions and heatmaps for images that differed only slightly but belonged to different classes.

a.1.1 Line-drawing with Polygons as Main Contour

The data set used for training as well as some of the generalization sets consisted of straight lines. The main contour consisted of n ∈ {3, 4, 5, 6, 7, 8, 9} line segments that formed either an open or a closed contour. The generation process of the main contour is depicted on the left side of Figure 4A. To get a contour with n edges, we generated n points, each defined by a randomly sampled angle and a randomly sampled radius. By connecting the resulting points, we obtained the closed contour. We used the python PIL library (PIL 5.4.1, python3) to draw the lines that connect the endpoints. For the corresponding open contour, we sampled two radii for one of the angles such that they had a fixed distance from each other. When connecting the points, a gap was created between the points that share the same angle. This generation procedure could allow for very short lines with edges very close to each other. To avoid this, we excluded all shapes in which corner points lay too close to non-adjacent lines.

The position of the main contour was random, but we ensured that the contour did not extend over the border of the image.

Besides the main contour, several flankers consisting of either one or two line segments were added to each stimulus. The exact number of flankers was uniformly sampled from the range . The length of each line segment varied between and . For the flankers consisting of two line segments, both lines had the same length and the angle between the line segments was at least . We added the flankers successively to the image and thereby ensured a minimal distance of between the line centers. To ensure that the corresponding image pairs would have the same flankers, the distances to both the closed and open version of the main contour were accounted for when re-sampling flankers. If a flanker did not fulfill this criterion, a new flanker was sampled of the same size and the same number of line segments, but it was placed somewhere else. If a flanker extended over the border of the image, the flanker was cropped.
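A minimal sketch of this polygon generation with PIL follows (see also the curvy-contour sketch in A.1.2); the image size, radius range, line width, gap width and background value are placeholders, not the original parameter values.

    import numpy as np
    from PIL import Image, ImageDraw

    def draw_main_contour(n_edges=6, size=256, closed=True, gap=10.0):
        """Sample n_edges angles and radii, connect the points into a polygon;
        for the open version, split one vertex into an inner and an outer point
        so that a gap appears along that angle."""
        angles = np.sort(np.random.uniform(0, 2 * np.pi, n_edges))
        radii = np.random.uniform(0.2 * size, 0.45 * size, n_edges)
        cx = cy = size / 2
        pts = [(cx + r * np.cos(a), cy + r * np.sin(a)) for a, r in zip(angles, radii)]

        img = Image.new('L', (size, size), 128)  # gray background (placeholder value)
        draw = ImageDraw.Draw(img)
        if closed:
            draw.line(pts + [pts[0]], fill=0, width=2)
        else:
            r_in, r_out = radii[0] - gap / 2, radii[0] + gap / 2
            start = (cx + r_out * np.cos(angles[0]), cy + r_out * np.sin(angles[0]))
            end = (cx + r_in * np.cos(angles[0]), cy + r_in * np.sin(angles[0]))
            draw.line([start] + pts[1:] + [end], fill=0, width=2)
        return img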

Figure 4: Closed contour data set. A: Left: The main contour was generated by connecting points from a random sampling process of angles and radii. Right: Resulting line-drawing with flankers. B: Left: Generation process of curvy contours. Right: Resulting line-drawing.

a.1.2 Line-drawing with Curvy Lines as Main Contour

For some of the generalization sets, the contours consisted of curvy instead of straight lines. These were generated by modulating a circle of a given radius r_0 with a radial frequency function that was defined by two sinusoidal functions. The radius of the contour was thus given by

r(\theta) = r_0 \left(1 + a_1 \sin(f_1 \theta + \phi_1) + a_2 \sin(f_2 \theta + \phi_2)\right)   (1)

with the frequencies f_1 and f_2 (integers sampled from a fixed range), amplitudes a_1 and a_2 (random values from a fixed range) and phases \phi_1 and \phi_2 (between 0 and 2\pi). Unless stated otherwise, the diameter (diameter = 2 r_0) was a random value within a fixed range, and the contour was positioned in the center of the image. The open contours were obtained by removing a circular segment of a fixed angular size at a random phase (see Figure 4B).

For two of the generalization data sets we used dashed contours which were obtained by masking out 20 equally distributed circular segments each of size .
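A short numpy sketch of this radial-frequency construction (Eq. (1)) is given below; the frequency, amplitude and radius values are placeholders since the original ranges are not preserved above.

    import numpy as np

    def radial_frequency_contour(r0=60.0, n_points=720):
        """Return (x, y) coordinates of a curvy closed contour following Eq. (1)."""
        f1, f2 = np.random.randint(2, 8, size=2)          # placeholder frequency range
        a1, a2 = np.random.uniform(0.05, 0.2, size=2)     # placeholder amplitude range
        p1, p2 = np.random.uniform(0, 2 * np.pi, size=2)
        theta = np.linspace(0, 2 * np.pi, n_points, endpoint=False)
        r = r0 * (1 + a1 * np.sin(f1 * theta + p1) + a2 * np.sin(f2 * theta + p2))
        return r * np.cos(theta), r * np.sin(theta)

    # Open version: drop the points falling inside a circular segment at a random phase;
    # dashed version: mask out 20 equally distributed segments.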

a.1.3 More Details on Generalization Data Sets

As described in the methods (Section 3.1), we used variants of the data set as generalization data sets. Here, we provide some more details on some of these data sets:

Black-White-Black lines (). Two black lines enclosed a white line in the middle, resulting in a correspondingly larger total line thickness.

Asymmetric flankers (). The two-line flankers consisted of one long and one short line instead of two equally long lines.

W/ dashed flanker (). This data set contained an additional dashed, yet closed contour as a flanker. It was produced like the main contour in the dashed main contour set. To avoid overlap of the contours, the main contour and the flanker could only appear at four fixed positions in the image, namely the corners.

W/ multiple flankers (). In addition to the main contour, between one and four open curvy contours were added as flankers. The flankers were generated by the same process as the main contour. The circles that were modulated had a diameter of and could appear at either one of the four corners of the image or in the center.

a.2 Psychophysical Experiment: Closed Contour Detection

To estimate how well humans would be able to distinguish closed and open stimuli, we performed a psychophysical experiment in which observers reported which of two sequentially presented images contained a closed contour (two-interval forced choice (“2-IFC”) task).

Figure 5: A: In a 2-IFC task, human observers had to tell which of two images contained a closed contour. B: Accuracy of the 20 naïve observers for the different conditions.

a.2.1 Stimuli

The images of the closed contour data set were used as stimuli for the psychophysical experiments. Specifically, we used the images from the test sets that were used to evaluate the performance of the models. For our psychophysical experiments, we used two different conditions: the images contained either black (i.i.d. to the training set) or white contour lines. The latter was one of the generalization test sets.

a.2.2 Apparatus

Stimuli were displayed on a VIEWPixx 3D LCD (VPIXX Technologies; operating with the scanning backlight turned off). Outside the stimulus image, the monitor was set to mean gray. Observers viewed the display from a fixed distance (maintained via a chinrest) in a darkened chamber, at which the stimulus pixels subtended a known visual angle. The monitor was linearized (maximum luminance measured using a Konica-Minolta LS-100 photometer). Stimulus presentation and data collection were controlled via a desktop computer (Intel Core i5-4460 CPU, AMD Radeon R9 380 GPU) running Ubuntu Linux (16.04 LTS), using the Psychtoolbox Library (Brainard, 1997; Pelli, 1997; Kleiner et al., 2007, version 3.0.12) and the iShow library (http://dx.doi.org/10.5281/zenodo.34217) under MATLAB (The Mathworks, Inc., R2015b).

a.2.3 Participants

In total, 20 naïve observers participated in the experiment. Observers were paid an hourly rate for participation. Before the experiment, all subjects had given written informed consent. All subjects had normal or corrected-to-normal vision. All procedures conformed to Standard 8 of the American Psychological Association’s “Ethical Principles of Psychologists and Code of Conduct” (2010).

a.2.4 Procedure

On each trial, one closed and one open contour stimulus were presented to the observer (cf. Figure 5A). The images used for each trial were randomly picked, but we ensured that the open and closed images shown in the same trial were not the ones that were almost identical to each other (see “Generation of Image Pairs” in Appendix A.1). Thus, the number of edges of the main contour could differ between the two images shown in the same trial. Each image was shown briefly, separated by an inter-stimulus interval (blank gray screen). We instructed the observer to look at the fixation spot in the center of the screen. The observer was asked to identify whether the image containing a closed contour appeared first or second. The observer had to respond and was given feedback after each trial; a fixed inter-trial interval separated trials. Each block consisted of a fixed number of trials and observers performed five blocks. Trials with different line colors and varying background image contrasts were blocked. Here, we only report the results for black and white lines at background contrast 0. The first time a block with a new line color was shown, observers performed a practice session with trials of the corresponding line color.

a.3 Optimized Decision Criterion

In our tests of generalization, poor accuracy could simply result from a sub-optimal decision criterion rather than from the network being unable to tell the stimuli apart. To find the optimal threshold for each data set, we subdivided the interval in which the bulk of the logits lay into equally spaced candidate thresholds and picked the threshold that led to the highest performance.
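A small numpy sketch of this threshold search follows; the number of candidate thresholds and the use of the full logit range are placeholders for the elided values above.

    import numpy as np

    def optimal_threshold(logits, labels, n_candidates=1000):
        """Scan candidate thresholds over the range spanned by the logits and
        return the one that yields the highest accuracy on the data set."""
        candidates = np.linspace(logits.min(), logits.max(), n_candidates)
        accuracies = [np.mean((logits > t).astype(int) == labels) for t in candidates]
        return candidates[int(np.argmax(accuracies))]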

a.4 Additional Experiment: Increasing the Task Difficulty by Adding a Background Image

We performed an additional experiment in which we tested whether the model would become more robust, and thus generalize better, if trained on a more difficult task. This was achieved by adding an image to the background, such that the model had to learn to separate the lines from the task-irrelevant background.

In our experiment, we fine-tuned our ResNet-50-based model on images with a background image of a uniformly sampled contrast. For each data set, we evaluated the model separately on six discrete contrast levels {0, 0.2, 0.4, 0.6, 0.8, 1} (cf. Figure 6A). We found that the generalization performance did not increase substantially compared to the experiment in the main body (cf. Figure 6B).

Figure 6: A: An image of varying contrast was added as background. B: Generalization performances of our models trained on random contrast levels and tested on single contrast levels.

Appendix B SVRT

b.1 Model Accuracy on the Individual Problems

Figure 7 shows the accuracy of the models for each problem of the SVRT data set. Problem 8 is a mixture of a same-different task and a spatial task. In Figure 2, this problem was assigned to the spatial tasks.

Figure 7: Accuracy of the models for the individual problems. Bars re-plotted from J. Kim et al. (2018).

Appendix C Recognition Gap

c.1 Analysis of Different Class Selections and Different Number of Descendants

Treating the ten stimuli from Ullman et al. (2016) in our machine algorithm setting required two design choices: We needed to both pick suitable ground truth classes from ImageNet for each stimulus and choose if and how to combine them. The former is subjective, and using relationships from the WordNet Hierarchy (Miller, 1995) (as Ullman et al. (2016) did in their psychophysics experiment) only provides limited guidance. We picked classes to our best judgement (for our final ground truth class choices, please see Appendix C.3). Regarding the aspect of handling several ground truth classes, we extended our experiments: We tested whether considering all classes as one (’joint classes’, i.e. summing the probabilities) or separately (’separate classes’, i.e. rerunning the stimuli for each ground truth class) would have an effect on the recognition gap. As another check, we investigated whether the number of descendant options would alter the recognition gap: Instead of only considering the four corner crops as in the psychophysics experiment by Ullman et al. (2016) (’Ullman4’), we looked at every crop shifted by one pixel as a potential new parent (’stride-1’). The results reported in the main body correspond to joint classes and corner crops. Finally, besides analyzing the recognition gap, we also analyzed the sizes of MIRCs and the fractions of images that possess MIRCs for the mentioned conditions.

Figure 8A shows that all options result in similar values for the recognition gap. The trend of smaller MIRC sizes for stride-1 compared to four corner crops shows that the search algorithm can find even smaller MIRCs when all crops are possible descendants (cf. Figure 8B). The final analysis of how many images possess MIRCs (cf. Figure 8C) shows that recognition gaps only exist for a fraction of the tested images: In the case of the stimuli from Ullman et al. (2016), three out of nine images have MIRCs, and in the case of ImageNet, only a fraction of the images do. This means that the recognition performance on the initial full-size configurations was sufficiently high for those fractions only. Please note that we did not evaluate the recognition gap over images that did not meet this criterion. In contrast, Ullman et al. (2016) average only across MIRCs with a recognition rate above the criterion and sub-MIRCs with a recognition rate below it (personal communication). The reason why our model could only reliably classify three out of the nine stimuli from Ullman et al. (2016) can partly be traced back to the oversimplification of single-class attribution in ImageNet as well as to the overconfidence of deep learning classification algorithms (Guo et al., 2017): they often attribute a lot of evidence to one class, while the remaining classes share very little evidence.

Figure 8: A: Recognition gaps. The legend holds for all subplots. B: Size of MIRCs. C: Fraction of images with MIRCs.

c.2 Selecting Best Crop when Probabilities Saturate

We observed that many crops had very high probabilities and therefore used the “logit” measure (Ashton, 1972), l = \log\left(\frac{p}{1 - p}\right), where p is the probability of the correct class. Note that this measure is different from what the deep learning community usually refers to as “logits”, which are the values before the softmax layer. The logit l is monotonic w.r.t. the class probabilities, meaning that the higher the probability p, the higher the logit l. However, while p saturates at 1, l is unbounded and thus yields a more sensitive discrimination measure between image patches that all have probabilities close to 1.

This is a short derivation for the logit l: The probability p_c of the correct class c can be obtained by plugging the pre-softmax values z_j into the softmax formula:

p_c = \frac{\exp(z_c)}{\sum_j \exp(z_j)}   (2)

Since we are interested in the probability of the correct class, it holds that p_c > 0. Thus, in the regime of interest, we can invert both sides of the equation. After simplifying, we get:

\frac{1 - p_c}{p_c} = \sum_{j \neq c} \exp(z_j - z_c)   (3)

And finally, when taking the negative logarithm on both sides, we obtain:

l = \log\left(\frac{p_c}{1 - p_c}\right) = z_c - \log\sum_{j \neq c} \exp(z_j)   (4)

Intuitively, the logit measures in log-space how much the network’s belief in the correct class outweighs the belief in all other classes taken together. The following reassembling operations illustrate this:

l = \log\left(\frac{p_c}{1 - p_c}\right) = \log(p_c) - \log\left(\sum_{j \neq c} p_j\right)   (5)

The above formulations regarding one correct class hold when adjusting the experimental design to accept several classes as correct predictions. In brief, the logit l_C, where C stands for the set of correct classes, then states:

l_C = \log\left(\frac{\sum_{c \in C} p_c}{1 - \sum_{c \in C} p_c}\right)   (6)
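As a usage note, the following small numpy helper computes l directly from the pre-softmax values via Eq. (4), which stays numerically stable even when p_c saturates at 1.

    import numpy as np

    def logit_measure(z, correct):
        """l = log(p_c / (1 - p_c)) = z_c - logsumexp(z_j, j != c)."""
        others = np.delete(z, correct)
        m = others.max()
        return z[correct] - (m + np.log(np.exp(others - m).sum()))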

c.3 Selection of ImageNet Classes for Stimuli of Ullman et al. (2016)

Note that this selection is different from the one used by Ullman et al. (2016). We went through all classes for each image and selected the ones that we considered sensible. The tenth image, showing an eye, does not have a sensible ImageNet class; hence only nine stimuli from Ullman et al. (2016) are listed in the following table.

image | WordNet Hierarchy ID | WordNet Hierarchy description | neuron number in ResNet-50 (indexing starts at 0)
fly | n02190166 | fly | 308
ship | n02687172 | aircraft carrier, carrier, flattop, attack aircraft carrier | 403
ship | n03095699 | container ship, containership, container vessel | 510
ship | n03344393 | fireboat | 554
ship | n03662601 | lifeboat | 625
ship | n03673027 | liner, ocean liner | 628
eagle | n01608432 | kite | 21
eagle | n01614925 | bald eagle, American eagle, Haliaeetus leucocephalus | 22
glasses | n04355933 | sunglass | 836
glasses | n04356056 | sunglasses, dark glasses, shades | 837
bike | n02835271 | bicycle-built-for-two, tandem bicycle, tandem | 444
bike | n03599486 | jinrikisha, ricksha, rickshaw | 612
bike | n03785016 | moped | 665
bike | n03792782 | mountain bike, all-terrain bike, off-roader | 671
bike | n04482393 | tricycle, trike, velocipede | 870
suit | n04350905 | suit, suit of clothes | 834
suit | n04591157 | Windsor tie | 906
plane | n02690373 | airliner | 404
horse | n02389026 | sorrel | 339
horse | n03538406 | horse cart, horse-cart | 603
car | n02701002 | ambulance | 407
car | n02814533 | beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon | 436
car | n02930766 | cab, hack, taxi, taxicab | 468
car | n03100240 | convertible | 511
car | n03594945 | jeep, landrover | 609
car | n03670208 | limousine, limo | 627
car | n03769881 | minibus | 654
car | n03770679 | minivan | 656
car | n04037443 | racer, race car, racing car | 751
car | n04285008 | sports car, sport car | 817

References