Recently, deep learning methods [1, 2] have improved the state of the art in many domains, such as speech recognition, visual object detection and recognition, and machine translation. The Deep Convolutional Neural Network (DCNN) is one of the most successful deep learning architectures. It has been widely adopted by the research community since Krizhevsky et al. used a DCNN to almost halve the error rate in the ImageNet competition in 2012. Since then, DCNNs have achieved great success in various computer vision tasks, approaching or even surpassing human-level performance in some tasks [4, 5].
The advantages of DCNNs include the hierarchy of self-learned features and the ability to learn complex non-linear functions via direct end-to-end training. The hierarchy of features, representing low-level image details as well as high-level abstract properties, can be learned from the raw data without any hand-crafted manipulation. At the same time, the complex function, which bridges the gap between raw data and the learning task, is learned from the examples by the back-propagation algorithm. Usually, there are millions of weights in a DCNN model. Since such a complex model can capture data variance very well, it can absorb a large number of training samples to avoid over-fitting. Compared with conventional machine learning algorithms, deep learning has a clear advantage in terms of easy implementation, seemingly unlimited learning capacity, and unprecedented performance, which makes it very attractive to both the research community and industry in this era of big data. However, due to the end-to-end training of the “black-box” layers, the general reasoning ability of DCNNs is still not fully explored or understood, and the appropriate size of the data set to train a DCNN is still mostly empirical.
After years of evolution, CNNs have become deeper and deeper. More nonlinear and complex layers and structures are utilized in DCNNs, such as ReLU and normalization layers, dropout layers [9, 10, 11], and so on. Complexity brought opacity: although there are ways to visualize what patterns might have been learned by intermediate layers [12, 13], thus shedding some light on the inner workings of a CNN, overall we still lack a thorough understanding of the learning process. Some studies have shown that carefully designed adversarial noise can mislead the learned model [14, 15], casting doubt on the generalization capability of these new models. Did they really learn? What did they learn?
In this paper, we use simple and well-defined concepts, such as “symmetry”, “counting”, and “uniformity”, together with clean synthetic examples, to test and compare the “intelligence” of algorithms and humans. The relationship between the number of training samples and the classification accuracy is used to evaluate the learning speed. Unseen test data, especially those drawn from outside the training distribution, are used to evaluate the degree of learning at the semantic level. Implicit “end-to-end” learning of such concepts is a stepping stone to scaling up artificial intelligence for many real-world tasks such as diagnostic imaging, where symmetry (e.g., of the brain), counting (e.g., of vertebrae), and uniformity (of tissue texture or anatomical structures) are key features for the detection of certain diseases.
2 Related work
There have been some attempts to analyze the complexity and learning capacity of artificial neural networks over the last decades [16, 17]. Recently, Basu et al. derived upper bounds on the VC dimension of CNNs for texture classification tasks. Szegedy et al. reported counter-intuitive properties of neural networks, and found adversarial examples with hardly perceptible perturbations that could mislead the algorithm. Since then, many more successful attacks on deep learning have been reported [19, 20, 21].
Deep networks’ vulnerability to adversarial examples led to active research on defense mechanisms. Goodfellow et al. proposed adversarial training, and Hinton et al. proposed to distill the knowledge by model compression. Goodfellow et al. also proposed a zero-sum game framework for estimating generative models via an adversarial process, namely Generative Adversarial Nets (GAN). A GAN balances an adversarial data generator and a discriminator during training, and the trained discriminator is more tolerant of adversarial examples as a result.
While GAN-related research reveals and repairs a “statistical” weakness of CNNs, our study focuses on the “semantic”-level limitations of CNNs. In other words, we ask “how well can a CNN discover patterns or concepts from data”, a capability that is a hallmark of natural intelligence. For example, can a CNN learn the concept of “symmetry” (we will start with the simplest, “bilateral symmetry”) from examples? How fast (i.e., with how many training samples) can it learn it? How generally does it understand the concept? And ultimately, can the same network architecture be trained to learn another concept, e.g., “counting”?
Few-shot, one-shot, and zero-shot learning methods try to adapt a classifier to accommodate new classes not seen in training, given only a few examples, one example, or no example at all, respectively. The goal is to transfer learned knowledge and make the model generalizable to new classes or tasks. However, the new patterns tested in these papers are mostly analogous or homologous to the learned patterns. They have not tested the kind of semantic Gestalt visual concepts that are more diverse and more challenging for machines, but mostly trivially easy for humans. Nevertheless, zero-shot learning capability of the DCNN was observed occasionally in some rounds of our experiments. For example, see the near-100% performance on the “Deliberate test 1” rows under Setting 2 and Setting 3 in Table 3. There has been work on heuristic programs to solve visual analogy IQ tests, and on explicit modeling of higher-level visual concepts based on low-level textons [32, 33]. We approach these topics from a different angle, using a classification problem to implicitly embed the concepts, and focus on testing the limits of the end-to-end learning capacity of machines.
A similar question was posed in the language understanding domain by Winograd, where a collection of questions can be easily understood, and thus answered, by a reasonably intelligent human, but not at all easily by an algorithm. The concepts selected in this study are also easily discoverable by a reasonably intelligent human, and our experiments confirm this quantitatively. One aspect that is unique to the human visual perception process is that there is often an “Aha!” moment, after which the concept is fully understood and the error rate drops to zero or near zero immediately. This is not observed in algorithms. We believe the reason behind this cognitive gap is the same as that which underlies the Winograd Schema Challenge, one that is related to accumulated human experience (both environmental and societal), or what some may call “general intelligence”.
There has also been a long history of research in the classic computer vision domain on Gestalt visual perception theory [35, 36]. Our study draws upon some insights gained from this research field. However, with a few simple concepts, we are only scratching the surface. As future work, there are many interesting concepts to explore, such as similarity or uniformity, continuity or conformance, and proximity or grouping.
3 Study design
The Inception v3 network developed by Szegedy et al. obtained high classification accuracy at relatively low computational cost. We use this network in this study as a representative DCNN.
Three types of visual recognition/classification tasks are studied in the following sections. The first one is based on the concept of “symmetry” (section 4), the second one is based on “counting” (section 5), and the last one is based on “grouping or conformance behavior” (section 6). The first two tasks include several sub-tasks, which may require additional learning of concepts such as “uniformity” or “grouping”. All tasks are designed as binary classification problems.
We create synthetic data sets for these image recognition problems. All images are generated in size of .
To handle the binary tasks and these synthetic images, we adapt the Inception v3 model by changing the input size and replacing the original softmax output layer with one hidden fully connected layer of 1024 nodes (with ReLU activation) plus one new softmax layer of 2 nodes. The weights of the network (except those of the new layers) are initialized from a model pre-trained on the ImageNet database.
4 Symmetry
In this section, we investigate the learnability of the concept “symmetry” by CNNs as well as by humans. We focus on the simple form of bilateral symmetry, but test both global (in section 4.1) and local (in section 4.2) symmetry. Global symmetry means that all the positive example images have bilateral symmetry, while local symmetry means that all positive images contain only symmetric shapes. Therefore, the local symmetry test is also a test of a “uniformity” concept.
To bring the study closer to real-world use cases, and to make it a bit more interesting, we conduct another test exploiting the symmetry in human faces (section 4.3).
4.1 Global symmetry
Some examples of global symmetry are shown in Fig. 1. In this case, images are created by randomly sampling control points and connecting them to form polygons. The interpolation between control points can be either a straight line or a Bézier curve, which is determined randomly. From these random shapes, symmetric images are generated by creating polygons from symmetrical control points or by mirroring asymmetric shapes.
To program a computer algorithm explicitly to detect such bilateral symmetry is trivial: just fold the image along the mid-line and then check the differences of overlapping pixels. However, for an algorithm to learn this seemingly simple operation implicitly, from examples only, is not trivial at all — especially if the benchmark is human performance. Humans can often grasp the concept quickly, after seeing only a few dozen examples.
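The explicit check described above can be sketched in a few lines. This is a minimal illustration, not code from the study: the function name `is_bilaterally_symmetric`, the list-of-rows image representation, and the `tolerance` parameter are all assumptions introduced here.

```python
def is_bilaterally_symmetric(image, tolerance=0):
    """Return True if `image` mirrors onto itself across its vertical mid-line.

    `image` is a list of equal-length rows of numeric pixel intensities
    (an assumed representation). `tolerance` is the largest absolute pixel
    difference that is still treated as a match.
    """
    for row in image:
        width = len(row)
        # "Fold" the row: compare each left pixel with its mirrored right pixel.
        for x in range(width // 2):
            if abs(row[x] - row[width - 1 - x]) > tolerance:
                return False
    return True


symmetric = [[0, 9, 9, 0],
             [5, 1, 1, 5]]
asymmetric = [[0, 9, 9, 0],
              [5, 1, 2, 5]]
print(is_bilaterally_symmetric(symmetric))   # True
print(is_bilaterally_symmetric(asymmetric))  # False
```

The contrast with the learning setting is that this rule is hand-written and exact, whereas the network has to infer it from labeled examples alone.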
We conduct four rounds of training and testing, designed to evaluate model generalization from different angles and at different levels. The first round is a statistical test within the same sample distribution as the training. This round is a “smoke test” to ensure that the network is working. The second and third rounds of testing use deliberate and adversarial manipulations of the training set in order to test for “understanding”, or the lack thereof, of the real concept. The last round of testing uses samples outside and far away from the training distribution, checking again for concept-level comprehension from another angle.
In the first round, we randomly generate examples similar to those in Fig. 1 and split them into a training set, a validation set, and a testing set. We train the model on the training and validation sets, then test it on the testing set. To understand the generalization ability of the model at the concept level, we also generate new “deliberate” test samples based on the training set, with small but just sufficient modifications such that the class labels are all reversed. This deliberate test set should be easy for a learner that has learned the concept, but very confusing for a learner that merely memorized (i.e., overfit to) the training set, because every deliberate sample has a similar-looking counterpart in the training set.
Each round has its own operation for creating deliberate test samples from existing samples. Various linear or non-linear functions can serve as this operation, and we use a different one in each of the three rounds. Specifically, on a symmetric sample, the first-round operation removes some random part of the foreground in either the left or the right half to achieve asymmetry. On an asymmetric sample, it mirrors one side of the image onto the other side to achieve symmetry and then, with a chance of , symmetrically erases some part of the image (this last step avoids bias by adding a similar erasure pattern to both classes). Some of the first-round deliberate samples are shown in Fig. 2.
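The first-round label-reversing operation can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the study: images are lists of rows with background intensity 0, and the function names, the `patch` size, and the default `erase_prob` are hypothetical (the probability of the symmetric erasing step is not recoverable from the text).

```python
import random


def break_symmetry(image, rng, patch=2):
    """Erase a small foreground patch in one half of a symmetric image."""
    height, width = len(image), len(image[0])
    half = width // 2
    # Pick the left or right half; a middle column (odd width) is never touched.
    x_start, x_end = (0, half) if rng.random() < 0.5 else (width - half, width)
    foreground = [(y, x) for y in range(height)
                  for x in range(x_start, x_end) if image[y][x] != 0]
    if not foreground:
        raise ValueError("no off-center foreground pixel to erase")
    cy, cx = rng.choice(foreground)
    out = [list(row) for row in image]
    for y in range(cy, min(cy + patch, height)):
        for x in range(cx, min(cx + patch, x_end)):
            out[y][x] = 0  # background intensity assumed to be 0
    return out


def make_symmetric(image, rng):
    """Mirror one randomly chosen half onto the other half."""
    width = len(image[0])
    keep_left = rng.random() < 0.5
    out = [list(row) for row in image]
    for row in out:
        for x in range(width // 2):
            if keep_left:
                row[width - 1 - x] = row[x]
            else:
                row[x] = row[width - 1 - x]
    return out


def erase_symmetric_patch(image, rng, patch=2):
    """Erase a random patch together with its mirror image."""
    height, width = len(image), len(image[0])
    cy, cx = rng.randrange(height), rng.randrange(width)
    out = [list(row) for row in image]
    for y in range(cy, min(cy + patch, height)):
        for x in range(cx, min(cx + patch, width)):
            out[y][x] = 0
            out[y][width - 1 - x] = 0
    return out


def deliberate_sample(image, is_symmetric, rng, erase_prob=0.5):
    """Return (new_image, new_is_symmetric) with the class label reversed."""
    if is_symmetric:
        return break_symmetry(image, rng), False
    mirrored = make_symmetric(image, rng)
    if rng.random() < erase_prob:  # hypothetical probability
        mirrored = erase_symmetric_patch(mirrored, rng)
    return mirrored, True


rng = random.Random(0)
flipped, label = deliberate_sample([[0, 3, 3, 0], [7, 0, 0, 7]], True, rng)
print(label)  # False: the symmetric input is now an asymmetric sample
```

Erasing only inside one half guarantees that at least one pixel no longer matches its mirror, so a symmetric input always comes out asymmetric.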
In the second round, we construct a new training set and a new validation set, and train a new model on them. Then, we test it on a second deliberate test set. The second-round operation randomly (with a 50% chance each) either scales, or adds small shape object(s) to, the symmetric samples of the new training set. More specifically, to create deliberate symmetric samples, we either scale the whole image or add a pair of identical shapes at random but symmetric positions (in order to maintain symmetry). To create deliberate asymmetric samples, we either scale one side (left or right) of the image or add a shape on one side. For scaling, we increase or decrease the size by . For the added shape, we sample uniformly among triangle, square, and ball, with the same size of pixels and random intensity . If the added small shape is located inside a foreground shape in the image, the intensity of the additional shape is set to 0 (making a hole). Random samples from the second deliberate test set are shown in Fig. 3.
In the third round, we again generate a new training set and a new validation set. A new model is trained on them and tested on a third deliberate test set, whose samples are created by adding more small shape objects to the symmetric images of the third-round training set. The strategy for creating new symmetric samples is similar to that of the second round, but more objects/shapes are added. To create asymmetric samples, we randomly choose among three different approaches: (1) two random objects are added on the left and right sides, at asymmetric locations; (2) two random objects of different shapes are added at symmetric positions; (3) two random objects of the same shape, but of different sizes, are added at symmetric positions. Examples from the third deliberate test set are shown in Fig. 4.
In the last round, we generate brand new testing samples to evaluate the previous three models. These new samples are generated by placing shape objects (triangle, square, and ball) at symmetric or asymmetric locations for the two classes. These samples lie completely outside the previous training distributions and can therefore serve as a good challenge for models that have failed to learn the concept at the semantic level. Some of these samples are shown in Fig. 5.
4.2 Local Symmetry
In this sub-task, each image contains multiple objects. A positive image contains symmetric objects only, while a negative image contains at least one asymmetric object. Some examples are shown in Fig. 6. In these image samples, objects include equilateral triangles, squares, balls, and connected-component polygons, which are generated in a way similar to the global symmetry task. Asymmetric objects are the asymmetric polygons. The sizes of objects are in range of . No rotation is applied to objects, to ensure bilateral symmetry at the object level. After training, we create new deliberate samples to test the generalization. Two sets of deliberate samples are generated. The first testing set consists of new objects of the same size range. The symmetric objects include hexagram, 4-leaf flower (F4), and 2-leaf flower (F2). The asymmetric objects are created with an operation similar to the second-round operation of section 4.1, i.e., scaling, or adding an object to, a symmetric polygon to make it asymmetric. Some examples are shown in Fig. 7. The second testing set is constructed by using the training objects at larger sizes, in the range [40, 45].
4.3 Normal/Tampered human face
The most obvious symmetry we see every day, and are very sensitive to, is probably that of the human face, so we design an experiment to distinguish normal and tampered human faces. Although human faces are not completely symmetrical, a higher degree of facial symmetry has been shown to be correlated with beauty, attractiveness, and personality. Recently, deep learning has shown great power in face identification and verification [4, 39]. In this sub-task, we aim to train and test a DCNN’s awareness of facial symmetry.
We collected frontal face images from two public databases: the Yale cropped face database [40, 41] and the AT&T face database. The tampered faces are created by fusing two half faces together (each half comes from a different subject). Some examples are shown in Fig. 8.
To fuse two faces, we first use the DLIB implementation of a real-time face pose estimation algorithm to detect 68 facial landmarks. These landmarks are defined in . Secondly, we rotate each face so that the mid-line passing through the nose is vertical. Thirdly, we normalize the two faces to be fused to the same height, keeping the original width. Then, we align the two faces based on the face center (defined as the center of mass of the eyes, nose, and mouth). Finally, we merge the two faces into one using a sigmoid horizontal blending filter, which is defined as:
in which, (x,y) is the coordinates of image pixel; W is the image width; , and are target image and two source images respectively; in this study.
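The blending formula itself is not recoverable from the text, so the following is only one plausible form of a sigmoid horizontal blend, offered as a sketch. It assumes a logistic weight centered on the vertical mid-line of the image; the function names and the `steepness` parameter (and its default value) are hypothetical and do not come from the study.

```python
import math


def sigmoid_weight(x, width, steepness):
    """Weight given to the right-hand source at column x (0.5 at the mid-line)."""
    center = (width - 1) / 2.0
    return 1.0 / (1.0 + math.exp(-steepness * (x - center)))


def blend_halves(left_src, right_src, steepness=1.0):
    """Fuse two equally sized grayscale images along the vertical mid-line.

    Columns far to the left come almost entirely from `left_src`, columns far
    to the right from `right_src`, with a smooth transition in between.
    """
    height, width = len(left_src), len(left_src[0])
    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            w = sigmoid_weight(x, width, steepness)
            row.append((1.0 - w) * left_src[y][x] + w * right_src[y][x])
        fused.append(row)
    return fused


left = [[0.0] * 9 for _ in range(2)]
right = [[100.0] * 9 for _ in range(2)]
print([round(v, 1) for v in blend_halves(left, right, steepness=2.0)[0]])
```

A larger `steepness` gives a sharper seam, while a smaller value spreads the transition over more columns.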
5 Counting
The second major task in this study is counting. We design two sub-tasks related to counting in this section. The first one is simply counting objects. The second, and more difficult, one is counting object types.
Six basic shapes are used to generate shape objects: equilateral triangle, square, ball, hexagram, 4-leaf flower (F4), and 2-leaf flower (F2). All shapes in an image can have different sizes, positions, intensities, and orientations in some cases. Fig. 9 shows a sample image. The size of each shape is measured by the longer edge of its bounding box.
5.1 Counting objects
To formulate the counting problem as a binary classification task, we design the data set so that the two classes of images have different numbers of objects in each image sample.
So that the problem is not too trivial, we design a task of counting 3 objects vs. other-than-3 objects. Each positive image contains 3 objects, while each negative image contains a different number (1, 2, 4 or 5) of objects. The experiments are conducted in three settings. In the first setting, training images contain only balls as objects. In the second setting, training objects include triangle, square, and ball, but each image contains only one type of object. The third setting is similar to the second one, but each image may contain mixed types of objects. The objects can be located in various positions with different intensities and orientations. The size is in range of . Examples of training images are shown in Fig. 10.
Then, two deliberate tests are conducted: (1) using new objects (hexagram, F4, F2) of size to test shape sensitivity; (2) using the training objects of different size to test scale sensitivity.
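The three training settings can be summarized by a small sampler that decides, for each image, its label and the list of shapes to draw. This is a sketch of the data design only, with rendering omitted; the function name `sample_counting_spec`, the constant `TRAIN_SHAPES`, and the 50/50 class balance per draw are assumptions made for illustration.

```python
import random

TRAIN_SHAPES = ("triangle", "square", "ball")


def sample_counting_spec(setting, rng):
    """Return (shapes, is_positive) for one image of the 3-vs-other-than-3 task."""
    is_positive = rng.random() < 0.5
    # Positive images hold exactly 3 objects; negatives hold 1, 2, 4 or 5.
    count = 3 if is_positive else rng.choice((1, 2, 4, 5))
    if setting == 1:
        shapes = ["ball"] * count                      # balls only
    elif setting == 2:
        shapes = [rng.choice(TRAIN_SHAPES)] * count    # one type per image
    elif setting == 3:
        shapes = [rng.choice(TRAIN_SHAPES) for _ in range(count)]  # mixed types
    else:
        raise ValueError("setting must be 1, 2 or 3")
    return shapes, is_positive


rng = random.Random(0)
for setting in (1, 2, 3):
    print(setting, sample_counting_spec(setting, rng))
```

In every setting the label depends only on the number of objects, so a learner that has grasped counting should be indifferent to which shapes appear.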
5.2 Counting types (shape uniformity/diversity)
In this task, we design another binary classification problem: one type of shape vs. two types of shapes. More specifically, each positive image contains multiple objects of the same type, while each negative image contains two types of objects. The number of objects in each image varies. Examples are shown in Fig. 11. This task is harder than counting objects, since it requires local shape discrimination and global reasoning at the same time.
Similar to the last section, we conduct two deliberate tests to evaluate the generalization of the trained model. In training, we only use triangle and square to create image samples. In deliberate testing of new shapes, we use ball, hexagram, and F4.
6 Common Fate / Synchrony
In this section, another Gestalt-style experiment is designed to test “grouping or conformance behavior”, where multiple objects (e.g., pointy triangles) in a positive image all behave in a consistent manner, facing a single target (e.g., a dot), whereas a negative image contains objects that do not conform in the same way. Some image examples are shown in Fig. 12.
Multiple rounds of training and testing are conducted, with the first round using randomly oriented triangles as negative images. In subsequent rounds, we test on some deliberately constructed testing samples. These could be negative samples with only a few (1 or 2) outlier objects that are not targeting the focus point; or samples with fewer or more objects; or combinations of the variations; plus size variations of the triangles.
7 Experimental results
Since our synthetic data sets are all grayscale images while the network requires 3-channel input, the images are converted to 3 channels by replicating the single channel three times. In all experiments, training takes 70 epochs, and model selection is based on minimum error on the validation set. The training batch size is 40 for all synthetic data sets. We report error rate (ER), accuracy (ACC), recall, and precision, all in percent.
For data augmentation, we randomly apply (1) rotation within a range of 5 degrees; (2) horizontal and vertical shifts within 2% of the image width and height; (3) horizontal and/or vertical flips.
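For reference, the four reported quantities follow from the confusion counts of a binary classifier in the standard way. The sketch below uses hypothetical names (`binary_metrics`, `positive`); which class is treated as positive is left as a parameter, since that choice is an assumption here.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return ER, ACC, precision and recall in percent for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = 100.0 * (tp + tn) / len(pairs)
    # Precision and recall are undefined when their denominators are zero.
    precision = 100.0 * tp / (tp + fp) if tp + fp else float("nan")
    recall = 100.0 * tp / (tp + fn) if tp + fn else float("nan")
    return {"ER": 100.0 - accuracy, "ACC": accuracy,
            "precision": precision, "recall": recall}


print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 0]))
# {'ER': 25.0, 'ACC': 75.0, 'precision': 100.0, 'recall': 50.0}
```

Note that ER and ACC always sum to 100%, so the tables only need to list one of them.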
7.1 Results of global symmetry
We label symmetry as class 0 and asymmetry as class 1. In the first round, we generate the first-round data sets. Each set has 8000 examples, 4000 in each class. For the first trained model, all errors are misclassified asymmetric samples, so the recall of symmetric images is 100%. In the second round, the second model is trained on the enlarged training and validation sets, each of which has 16000 examples, 8000 in each class. Similarly, in the third round, the third model is trained on training and validation sets that each contain 32000 examples. We report all training and deliberate testing results in Table 1.
The ERs in the three rounds of training and testing are plotted in Fig. 13. There are two main observations from this plot. One is that as the number of training samples increases (by incorporating more deliberate samples), the testing errors become smaller. This shows the importance of large and representative training data. The other is that generalization of the trained model is not automatically achieved when the training set is relatively small, since the deliberate testing results are much worse than the training error. For example, the ER of is about times worse than that of . However, this is not the case in human testing, which we will report subsequently.
| Model | Test Data | Accuracy (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|
Table 1: Results of global symmetry in accuracy, precision and recall.
To better understand the behavior of the trained models, we create a new set with 4000 images in each class. These samples are generated by using simple shapes (triangle, square, and ball) only. Some examples are shown in Fig. 5. The results are: 1st model ER: , precision 88.77% and recall 100%; 2nd model ER: , precision 98.38% and recall 100%; 3rd model ER: , precision 98.57% and recall 99.98%. We can see that the models trained in the three rounds have improved ability of symmetry recognition as more deliberate samples are included in training. It is interesting to see that almost all the errors come from misclassified asymmetric images. This makes sense because the asymmetric patterns have much larger variance than symmetric ones.
Finally, we conduct the 4th round of training by combining training sets and , and test the final model on 8000 new testing samples (denoted by ). These final testing samples are created like by using new types of shape (hexagram, F4 and F2). Some examples are shown in Fig. 14. The final model achieves ER on . We also test the final model on . The ER decreases from 1.61% to 0.37%. By looking at the error cases produced by the final model (samples are shown in Fig. 15), we can see that the model fails only on fine-scale details in some asymmetric images.
These final numbers, although very impressive, indicate the lack of semantic-level learning of this relatively simple concept. And the lack of sensitivity to fine-scale details is consistent with the anecdotal errors reported in the literature.
We developed a simple game to test human performance in the same task, which consists of a maximum of three rounds of training and four rounds of testing. The three sets of training samples come from , and . The four testing sets come from , , and .
To maximize consistency of the tests across different human subjects, written instructions (see Appendix) were used and verbal communication was kept to a minimum. Each subject is first shown 12 pairs of positive and negative examples from the two classes. A subject can request to see more examples (in increments of 3) from either class. As soon as the subject believes that he/she has learned the classification rule, we test them on 20 random examples. If the subject succeeds with no errors, the next round of testing, again with 20 random examples, is conducted. If the subject makes mistakes in any round of testing, a new training session using the corresponding training set is given.
We tested 15 subjects individually in two groups. The first group of 7 subjects was tested on biased training samples, in which each symmetric image has only one connected component as the foreground object. The second group of 8 subjects was tested on unbiased training samples, in which symmetric samples have more variance (an image can contain separated symmetric objects). In the first group, 5 subjects succeeded in passing all four tests by learning from 24, 24, 30, 39, and 48 examples, respectively. Interestingly, 2 subjects in the first group failed to learn the rule. These two subjects came up with different hypotheses about discriminative object properties, such as angle, roundness, connectivity, “looks like living things?” and so on. In the second group, all subjects succeeded in passing all tests by learning from 24, 24, 30, 33, 36, 48, 64, and 300 examples, respectively. Among the 13 passing subjects, four learned from only 12 pairs of examples in the first training round; the rest required more examples from the 1st or 2nd training round (they made a few errors and quickly corrected them after seeing more deliberate training samples).
We found that the trickiest testing examples are a couple of the asymmetric images, in which a small and subtle asymmetry was missed by human eyes when testing was conducted at a fast pace.
7.2 Results of local symmetry
We create 8000 training samples and 8000 validation samples to train the model. The locally symmetric and asymmetric patterns are created in a way similar to the global symmetry task. Symmetric patterns also include small objects, such as triangles, squares, and balls. All foreground objects have size range . After training, 8000 statistical testing samples are created in the same manner so that they have the same data distribution as the training set.
Deliberate test 1: This test set is created by using new types of objects. Some examples are shown in Fig. 7. Deliberate test 2: This test set is created by using the same types of objects as in training, but larger size. The new sizes range in [40, 45].
All results are reported in Table 2. From the deliberate testing results, we can see that the learned model for local symmetry recognition is sensitive to unseen objects, but not sensitive to different object sizes.
| Test Data | Accuracy (%) | Precision (%) | Recall (%) |
|---|---|---|---|
| Deliberate test 1 | 56.57 | 54.07 | 87.4 |
| Deliberate test 2 | 97.47 | 95.28 | 99.9 |
7.3 Results of normal vs. tampered human face
We first use the Yale-cropped and AT&T data sets for training and statistical testing. The training set has 1574 samples: 462 real (positive) and 1112 tampered (negative). The testing set has 1450 samples: 338 real and 1112 tampered. Due to the limited number of samples, we use the training set as the validation set during training. The subjects in the testing set come from the same population as those in the training set, but the faces have varied illumination and facial expressions. For the tampered face class, the two source faces are swapped to create testing samples. The test accuracy is 98.76%, with precision 98.48% and recall 96.15%. Error cases are shown in Fig. 16.
When we tested the best symmetry model from section 4.1 trained on and on this face problem (testing only on the well-aligned and cropped Yale set: 380 real faces and 1406 tampered faces), the accuracy is 81.92%, with precision 64.47% and recall 33.42%. Most errors come from misclassified real faces. Random examples of error cases are shown in Fig. 17. This shows that the symmetry model cannot identify real faces well, which should be an easy task for humans.
Fig. 8 was shown to two children, a 12-year-old and a 14-year-old, with the question “what is the difference between the two groups of images?” Within 20 and 5 seconds respectively, both realized that the second group of faces are all combinations from different people. We did not yet conduct the human test on a larger scale, which is part of the future work to establish more human performance benchmarks.
7.4 Results of counting objects
As described in section 5.1, three settings of training examples are generated (all have the same number of training samples, 2000 positive and 2000 negative) to learn three models. Different testing sets are generated to evaluate these models. For each setting, new statistical testing samples, which have the same data distribution as the training set, are generated and tested. Then two deliberate testing sets are generated. Deliberate testing 1: 4000 testing samples are generated by using completely different objects (hexagram, F4, and F2). Deliberate testing 2: 4000 testing samples are generated by using the same objects as in training (triangle, square, ball), but the sizes are in range of , which is about 50% bigger than training shapes (). All testing results are reported in Table 3. For the three training settings, the ERs of deliberate testing 2 become around , , times worse than those of statistical testing, respectively. The model trained in Setting 2 outperforms those of the other two settings on the deliberate testing sets. The results show that training examples with more types of objects are useful in this task. However, mixed types of objects in one image may be distracting. With only a limited number of training examples, such variance could mislead the training or delay its convergence to some extent.
| Setting | Test Data | Accuracy (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|
| Setting 1 | Deliberate test 1 | 95.63 | 98.98 | 92.2 |
| Setting 1 | Deliberate test 2 | 62.5 | 79.83 | 33.45 |
| Setting 2 | Deliberate test 1 | 99.95 | 99.9 | 100 |
| Setting 2 | Deliberate test 2 | 88.55 | 97.53 | 79.1 |
| Setting 3 | Deliberate test 1 | 99.95 | 99.9 | 100 |
| Setting 3 | Deliberate test 2 | 85.12 | 96.43 | 72.95 |
Our hypothesis is that humans can quickly discover the counting rule after being given a few of the Setting 1 examples shown in Fig. 10. Once their brains are primed with this concept, Setting 2 and 3 testing will become easy, with zero training, or at most a couple more training samples for confirmation, and the subsequent testing performance will be 100%, with invariance to size, shape, location, etc. A quick human test with a few subjects confirmed this hypothesis. A larger-scale study is left as future work.
7.5 Results of counting types
In training, we use triangle and square to generate all images (4000 training and 4000 validation samples). The default size range is . Deliberate testing 1: We include 3 more object types in testing, i.e., ball, hexagram, and F4. In this testing set, each sample in the single-type class may contain one of the three new shapes. Each sample in the two-type class includes any combination of 2 of the 5 shapes. Deliberate testing 2: We also test on new samples with larger objects. Two sets are created with size ranges [30, 40] and [40, 50], respectively.
Additional test 1: Similar to the global symmetry task, we can add new samples to train a new model, which is expected to have less over-fitting and better generalization. We add F4 to the training shapes to train a new model and test it on the same testing set as in deliberate testing 1. Additional test 2: Following deliberate testing 2, we train a new model using samples with shapes of small and large sizes, specifically sizes range in , and test on size [30,35] to see whether generalization improves.
Results are reported in Table 4, which show that the learned model is sensitive to new object shapes and scales. Even after additional training with new training samples, the models still do not grasp the logic of type counting.
|Test Data||Accuracy (%)||Precision (%)||Recall (%)|
|Deliberate test 1||73.67||76.68||68.05|
|Deliberate test 2 [30,40]||86.6||97.66||75.0|
|Deliberate test 2 [40,50]||56.7||78.39||18.5|
|Additional test 1||83.1||79.53||89.15|
|Additional test 2||72.75||69.36||81.5|
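The three columns of the table above are linked through the confusion matrix, so one row can be used as a rough consistency check. The sketch below recomputes the percentages from raw counts. The counts are not reported in the paper: they are a hypothetical reconstruction that assumes a balanced test set of 2000 positive and 2000 negative samples, and `binary_metrics` is an illustrative helper name.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall) in percent from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = 100.0 * (tp + tn) / total
    precision = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    recall = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    return round(accuracy, 2), round(precision, 2), round(recall, 2)


# Hypothetical counts (assumed 2000 positives and 2000 negatives), chosen to be
# consistent with the "Deliberate test 2 [40,50]" row; not taken from the paper.
tp, fn = 370, 1630   # positives: detected vs. missed
fp, tn = 102, 1898   # negatives: false alarms vs. correctly rejected

print(binary_metrics(tp, fp, tn, fn))
```

Under these assumed counts the function returns 56.7, 78.39 and 18.5, matching that row. Read this way, the drop in accuracy at the largest scale comes almost entirely from missed positive samples rather than from false alarms.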
7.6 Results of common fate
We trained five models in sequential rounds, each time adding new variations of image samples to the training set. The first dataset contains samples (4000 training, 4000 validation) similar to Fig. 12, but each image has objects and one target point. In negative samples, all objects face random directions. In the second set, 4000 new samples are added to both training and validation, where all objects except one or two outliers in the negative images face the target (outliers face at least 60 degrees away from the target). In the third set, new samples are added again, where all images contain objects, and each negative image has only one outlier. In the fourth set, new samples similar to those of the previous round are added, but each with fewer objects (). Some negative image examples are shown in Fig. 18. In the fifth set, the number of training/validation samples is doubled. In each training round, the best model was picked based on accuracy on the validation set.
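The labeling rule described above (all objects face the target in positive images; outliers face at least 60 degrees away) can be sketched as a small geometric check. This is an illustration under stated assumptions, not the authors' generator: objects are represented as a position plus a heading angle measured counter-clockwise from the x-axis, and the 5-degree tolerance for "facing the target" is a hypothetical choice, as are the function names.

```python
import math


def deviation_deg(position, heading_deg, target):
    """Angle in degrees between an object's heading and the direction to the target."""
    dx = target[0] - position[0]
    dy = target[1] - position[1]
    to_target = math.degrees(math.atan2(dy, dx))
    # Wrap the difference into [-180, 180) before taking its magnitude.
    diff = (heading_deg - to_target + 180.0) % 360.0 - 180.0
    return abs(diff)


def is_positive(objects, target, tolerance_deg=5.0):
    """True if every object faces the target within the assumed tolerance."""
    return all(deviation_deg(pos, h, target) <= tolerance_deg for pos, h in objects)


target = (0.0, 0.0)
# Each object is ((x, y), heading in degrees); values are illustrative.
facing = [((10.0, 0.0), 180.0), ((0.0, -10.0), 90.0)]
with_outlier = [((10.0, 0.0), 180.0), ((0.0, -10.0), 20.0)]

print(is_positive(facing, target))        # every heading points at the target
print(is_positive(with_outlier, target))  # second object is 70 degrees off
```

The rule is independent of object size and of the number of objects, which is exactly the invariance the hold-out tests probe.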
In this task, we observed the same behavior as in the other tasks, where deliberate tests yielded clear accuracy degradation, and training with deliberate samples would bring the accuracy back up. Below we report a different set of accuracy numbers, all based on the same hold-out test set, which is different from all the training sets.
The hold-out testing images are more sparsely populated than all the training images, with each image containing only two objects and each negative sample having one outlier. This hold-out set was used to evaluate the performance of all trained models, so that the learning progression can be compared on a common basis. The learning curve (green) is plotted in Fig. 19. The learning curve shows a promising accuracy () for the final model trained and validated with 64000 images. However, the model still makes errors. Some of the error cases are shown in Fig. 20.
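The figure of 64000 images for the final model can be reproduced from the round-by-round description with simple bookkeeping. The sketch below assumes that rounds three and four each add the same 4000 training and 4000 validation samples as round two; the text states those numbers only for the first two rounds, so this is an assumption.

```python
# Samples added per round as (training, validation). Rounds 1 and 2 follow the
# text; the sizes for rounds 3 and 4 are assumed to equal those of round 2.
added = [(4000, 4000), (4000, 4000), (4000, 4000), (4000, 4000)]

totals = []
train = val = 0
for add_train, add_val in added:
    train += add_train
    val += add_val
    totals.append(train + val)

# Round 5 doubles the number of training/validation samples.
totals.append(2 * totals[-1])

print(totals)  # [8000, 16000, 24000, 32000, 64000]
```

Under this assumption the cumulative totals end at 64000, consistent with the number quoted for the final model.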
Finally, we altered the hold-out test set, to make all triangles bigger (by doubling their size). The testing accuracy dropped to around 74% (the blue point in Fig. 19). Interestingly, all errors are false negatives.
Seven human subjects participated in the classification task (using the first set). All subjects used fewer than 20 training samples to reach the aha moment of grasping the semantic concept.
One of the subjects was tested for unsupervised learning as a first step. Four training images, two positive and two negative, were shown to the subject with the question “can you separate these into two groups?” The subject succeeded in both the clustering task and the subsequent classification task (“can you put these images into the two groups you just identified?”), with 100% accuracy.
8.1 The intelligence gap
It seems that in at least two aspects machine learning has not closed the “intelligence gap”: one is that it needs many more training samples than humans do; the other is that the “aha moment” still seems to be uniquely human, and machines have not shown any sign of mastering this trick.
The comparison of learning curves for humans and deep learning in the symmetry task is shown in Fig. 21. The human accuracy is calculated as the percentage of subjects who have reached the “aha moment” at sample sizes 24, 36, 48, 64 and 300. We can see that humans can learn the rule from examples quickly and reach the “aha moment” after seeing very few examples. In contrast, the DCNN can only statistically approximate bilateral symmetry after a large number (40,000) of training examples, and still cannot reach accuracy on unseen data.
Artificial Neural Networks have indeed shown more sophisticated “intelligence” than counting. One example was demonstrated by Hoshen and Peleg: they were able to get a simple Neural Network to learn the mathematical operation of addition from pictures of numbers as inputs. Considering that human children typically learn counting before they are taught how to add numbers, this achievement is quite impressive. Nevertheless, DCNN’s rather rudimentary visual counting ability serves as a reminder that there is still some way to go to achieve a basic but semantic level of artificial visual intelligence.
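The head-count definition of human accuracy used for this comparison can be written down in a few lines. The sketch below is only an illustration of that definition: the per-subject numbers are made up, and `human_learning_curve` is a hypothetical helper name; only the checkpoint sample sizes come from the text.

```python
def human_learning_curve(samples_to_aha, checkpoints):
    """Percentage of subjects who reached the aha moment by each checkpoint.

    samples_to_aha holds, per subject, the number of training samples seen
    before the aha moment, or None if the subject never reached it.
    """
    n = len(samples_to_aha)
    curve = []
    for c in checkpoints:
        reached = sum(1 for s in samples_to_aha if s is not None and s <= c)
        curve.append(100.0 * reached / n)
    return curve


checkpoints = [24, 36, 48, 64, 300]   # sample sizes named in the text
subjects = [10, 30, 50, None]         # made-up values, for illustration only

print(human_learning_curve(subjects, checkpoints))
```

Because each subject either has or has not grasped the rule, the human curve is a step function of sample size, whereas the DCNN curve is a per-image test accuracy; the two are plotted on the same axis but measure different things.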
8.2 Study implications
Imagine that you have access to anonymized brain scans from 10,000 patients. 2000 of them eventually died of a certain disease, and the rest lived long and were free of that disease. Now you want to find out whether the brain scans could have predicted the disease. You could try an “end-to-end” learning using DCNN by feeding the brain scans as training examples. Will it work? Or will you need to understand more about the disease and its (potentially very complex) manifestations in the brain scan?
Let’s assume that the disease manifestation in the brain is actually based on different configurations of some “hot spots” (i.e., high intensity areas), with an asymmetric configuration indicating disease, while symmetry predicts no disease. Or, the number of hot spots within a certain region of the brain is indicative of the disease: having, say, three or fewer hot spots means healthy, and more means disease, regardless of the shape or size of each hot spot. Or, it is the diversity/heterogeneity of those hot spots that matters. Or, it is a combination of these rules for different parts of the brain that work together to determine the patient’s fate. But you do not know any of these “rules” a priori.
Based on the experiments in this paper, you have reason to stop and think, and to doubt that a direct application of DCNN will always solve the problem. And you know roughly how many more examples you may need if you do suspect a particular rule is at work. Or, at least you know how to design synthetic experiments to shed some light on this problem, e.g., to look for a performance limit, in case a first round of “end-to-end” learning has failed.
8.3 Is visual intelligence learned or hard-wired?
One might argue that symmetry is not a fair test case, because humans may be hard-wired to be sensitive to it. Indeed, symmetry is very common in our world, both in natural anatomy and in man-made objects. And being sensitive to symmetry probably carries a strong evolutionary advantage. But then again, one can also argue that most, if not all, of human visual intelligence may be somewhat hard-wired. If this is true, then does it mean that an “end-to-end” learning-from-examples paradigm may not be sufficient to achieve full visual intelligence? And that proper understanding, followed by direct hard-coding (i.e., semantic modeling), of such wirings may be at least an indispensable stepping-stone?
8.4 Why synthetic testing?
Another question could be: why do we need synthetic examples when we already have large annotated data sets like ImageNet? We believe popular learning targets such as cats, dogs, human faces, or more generally, animals on earth, are not the best targets for testing a learning machine rigorously. Animal species today are very sparsely distributed on the evolutionary continuum due to multiple prehistoric mass extinctions: there are no other animals with zebra stripes, nor any other animal with a long nose like an elephant’s. When a learning machine sees a striped table cloth and calls it a “zebra”, or calls a child wearing a long-nosed mask an “elephant”, is it wrong? After all, the table cloth could be a cutout from a massive zebra picture, and “elephant” would be the exact right name to call a child during her Halloween adventure. Therefore, when a CNN is fooled by some noise-like adversarial overlays, could it be that it is seeing a “zebra-stripes”- or “elephant-nose”-type of unique feature for that object? What really sets humans apart is that humans see the full picture: the table and the child together with all the other distractions. In order to see the full picture, then, CNNs have to learn all the objects and concepts at the same time. This is not yet possible today.
In this paper, we step back and investigate at a much smaller scale: we focus on simple and well-defined concepts, and on clean, synthetic examples, to test and compare the “intelligence” of algorithms and humans in a quantitative way.
8.5 Error analysis
For symmetry tasks, DCNN made more mistakes in classifying asymmetric patterns as symmetric ones. This behavior is somewhat like that of a human. The reason may lie, on the one hand, in the fact that symmetry is a precisely defined concept in which even a one-pixel difference makes a pattern asymmetric; and on the other hand, in the limited ability for fine-scale discrimination, of humans as well as of machines. Humans often make careless mistakes or suffer from sensory defects (e.g., short-sightedness), while today’s DCNNs are not sensitive to fine-grained variations in the images unless re-designed specifically for a particular task.
In the local symmetry study and the facial study, the DCNN models learned very well on seen shapes/distributions. However, they demonstrated limited generalization to new objects (second row in Table 2), even though most of the errors should be very easy for a human to avoid (e.g., the error cases in Fig. 16).
The counting experiments show that DCNN is too sensitive to object scale in the object-counting and common fate tasks, and fails to generalize in scale and object type in the object-type-counting task. With a limited number of training samples, the DCNN model cannot learn the counting rules precisely. It tends to learn a small kernel space that just covers the training sample distribution. For example, in the last scale testing experiment, the learned model can learn to fit the small and large scales, yet when tested at medium scale, the error rate spikes. Similarly, objects with a new, unseen shape seem to easily confuse the learned models. In the common fate task, we see clearly that scale-invariance is not achievable without exhaustive enumeration or explicit modeling/coding.
Humans have the prior knowledge (or tendency) that each connected component is counted once, regardless of its shape or size variations. Our trained DCNN model does not have this explicit knowledge or tendency. So, interestingly, it tends to count twice or more for objects larger than what it saw in the training set. But can we really judge this behavior as wrong, or is it merely “uninformed”, that is, uninformed of our world, in which scale-invariance is a common law due to both the natural growth phenomenon and our perspective visual system? Imagine an alien world where things grow in density instead of size, with a sensing system based on parallel waveforms or touch only; scale-invariance would then be a very alien concept. This raises the question: how much of our visual intelligence is specific to this world that we live in and to this bodily construct that we possess, and not universal at all?
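The strictness of the definition can be made concrete with an exact check. The sketch below assumes images are lists of pixel rows and that the symmetry axis is the vertical line through the image center; both are simplifying assumptions for illustration, and `is_bilaterally_symmetric` is a hypothetical helper, not part of the released scripts.

```python
def is_bilaterally_symmetric(image):
    """True if every row reads the same left-to-right and right-to-left."""
    return all(row == row[::-1] for row in image)


pattern = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]

# Copy the pattern and flip a single pixel.
perturbed = [row[:] for row in pattern]
perturbed[1][0] = 0

print(is_bilaterally_symmetric(pattern))    # True
print(is_bilaterally_symmetric(perturbed))  # False
```

An exact rule like this has no notion of "almost symmetric", which is why a classifier with limited fine-scale discrimination tends to err toward calling near-symmetric patterns symmetric.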
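The human prior described above, counting each connected component once, can be hard-coded directly. The sketch below is a minimal illustration, assuming binary images given as lists of rows and 4-connectivity; the function names are hypothetical and this is not the method used in the experiments.

```python
from collections import deque


def count_objects(image):
    """Count 4-connected components of nonzero pixels using a breadth-first flood fill."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] and not seen[r][c]:
                count += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and image[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


def upscale(image, k):
    """Enlarge a binary image by an integer factor k (nearest neighbour)."""
    return [[v for v in row for _ in range(k)] for row in image for _ in range(k)]


small = [
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]

print(count_objects(small))              # 3
print(count_objects(upscale(small, 3)))  # 3
```

With the rule written explicitly, the count is unchanged when every object is enlarged, which is the scale-invariance that the trained model did not acquire from examples alone.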
In the end, a machine may have to live among us to learn the whole experience, before acquiring Gestalt-style-intelligence in visual perception.
8.6 Sample size and general intelligence
Our experiments show that some “intelligence gaps” can be asymptotically closed with sufficiently large training data, albeit at very different convergence rates. One can hope that refinement of the network architecture, trained with ever larger data sets, will bring about an eventual breakthrough. However, others may believe that “general intelligence”, the kind of intelligence that understands relationships and complex rules, and that sees the full picture, can be reached only by tackling the Gestalt-type “aha moment” problem (plus the Winograd Schema Challenge) first. Our object-counting experiments support the latter view. For one thing, a DCNN can learn that an object of many different sizes is the same thing as long as examples of each size are present in the training set. But it could never learn or infer by itself the rule that “size does not matter!”. This rule has to be hard-coded.
Even if a data-driven deep learning algorithm can learn complex rules in a given domain, it may still have a difficult time combining rules or prior knowledge from multiple different perspectives (e.g., “size does not matter”, “rotation does not matter”, but “shape matters”). With an increasing number of potentially hidden rules in effect, the required sample size for “end-to-end” training will grow exponentially.
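The growth argument can be illustrated with a simple count. The sketch below assumes that each hidden rule corresponds to an independent factor of variation that must be covered by enumeration, and that each factor takes a fixed number of values; the factor names and value counts are made up for illustration.

```python
from itertools import product

# Hypothetical factors of variation; names and value counts are illustrative.
factors = {
    "size": range(5),
    "rotation": range(5),
    "shape": range(5),
}


def combinations_to_cover(factors):
    """Number of distinct joint settings when every factor varies independently."""
    return sum(1 for _ in product(*factors.values()))


print(combinations_to_cover(factors))  # 125
```

With v values per factor and n factors, the number of joint settings is v to the power n, so each additional factor multiplies the count by v. A learner that can only cover variation by seeing examples of every combination inherits this growth, while an explicit rule such as “size does not matter” removes a factor entirely.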
8.7 Future work: Aha Challenge, a call for participation
The datasets used in this work and scripts to generate the synthetic images are available online at https://github.com/zhennany/synthetic.
As future work, we would like to propose an “Aha Challenge” to the research community. The idea is to bring together focused and quantitative research on algorithmic vs. natural visual intelligence, toward Gestalt-style pattern recognition based on a relatively small number of training images.
The goal of this challenge is two-fold: to achieve a deeper, and quantitative, understanding of the nature of human visual intelligence; and to improve higher- or semantic-level “intelligence” in visual learning algorithms.
Participation can be in three forms: submission of new algorithms; submission of human study results; and proposal for new types of tests.
An algorithm submission should be tested on multiple visual learning tasks, each designed with deliberate adversarial testing that a human can easily pass (once an “aha moment” has been reached).
A human subject study should be conducted in a way that is as comparable as possible to the algorithm training process, with minimal extra information or hints provided besides showing the training images to the human subjects.
This challenge is analogous to the “Winograd Schema Challenge” in the language understanding domain. As demonstrated by the Winograd Challenge, if the machine does not have all the knowledge of this human world, it cannot correctly infer all the meanings of, or relationships among, words or visual objects.
We invite submissions of new types of tests, new algorithms, or human study results, to establish “intelligence” baselines, and to build machines that one day will be able to exclaim “Aha! I see!”.
APPENDIX: Written instruction for the human test
Below is the written instruction used in the human test for the symmetry study.
You are participating in a game of “visual classification”. The goal for you is to learn to classify visual patterns into two classes, after seeing examples of both classes.
Here is how the game will proceed:
1. You will be shown examples of visual patterns from two classes, class 0 on the left of the screen and class 1 on the right;
2. As soon as you believe that you have learned the classification rule, you can stop the training;
3. You will be shown 20 random test examples. Please label each as either class 0 or class 1;
4. If you succeed with no errors, 3 additional rounds of testing, each with 20 patterns, will be conducted to confirm your learning;
5. In case you make mistakes in any round of testing, a new training session will be given to strengthen your learning. Again you can stop the training once you believe you have learned the classifier;
6. The game stops after a maximum of 4 rounds.
Please keep this confidential so that we can invite more participants.
-  Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
-  J. Schmidhuber, Deep learning in neural networks: An overview, Neural networks 61 (2015) 85–117.
-  A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, 2012, pp. 1097–1105.
-  Y. Taigman, M. Yang, M. Ranzato, L. Wolf, Deepface: Closing the gap to human-level performance in face verification, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 1701–1708.
-  K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.
-  D. H. Hubel, T. N. Wiesel, Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex, The Journal of physiology 160 (1) (1962) 106–154.
-  D. J. Felleman, D. C. Van Essen, Distributed hierarchical processing in the primate cerebral cortex., Cerebral cortex (New York, NY: 1991) 1 (1) (1991) 1–47.
-  N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting., Journal of machine learning research 15 (1) (2014) 1929–1958.
-  K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
-  M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: European conference on computer vision, Springer, 2014, pp. 818–833.
-  A. Mahendran, A. Vedaldi, Understanding deep image representations by inverting them, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 5188–5196.
-  C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus, Intriguing properties of neural networks, arXiv preprint arXiv:1312.6199.
-  I. Evtimov, K. Eykholt, E. Fernandes, T. Kohno, B. Li, A. Prakash, A. Rahmati, D. Song, Robust physical-world attacks on machine learning models, arXiv preprint arXiv:1707.08945.
-  C. L. Giles, T. Maxwell, Learning, invariance, and generalization in high-order neural networks, Applied optics 26 (23) (1987) 4972–4978.
-  C. M. Bishop, Neural networks for pattern recognition, Oxford university press, 1995.
-  S. Basu, M. Karki, S. Mukhopadhyay, S. Ganguly, R. Nemani, R. DiBiano, S. Gayaka, A theoretical analysis of deep neural networks for texture classification, in: Neural Networks (IJCNN), 2016 International Joint Conference on, IEEE, 2016, pp. 992–999.
-  N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, A. Swami, Practical black-box attacks against deep learning systems using adversarial examples, arXiv preprint arXiv:1602.02697.
-  V. Behzadan, A. Munir, Vulnerability of deep reinforcement learning to policy induction attacks, arXiv preprint arXiv:1701.04143.
-  S. Huang, N. Papernot, I. Goodfellow, Y. Duan, P. Abbeel, Adversarial attacks on neural network policies, arXiv preprint arXiv:1702.02284.
-  I. J. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial examples, arXiv preprint arXiv:1412.6572.
-  G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, arXiv preprint arXiv:1503.02531.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 27, Curran Associates, Inc., 2014, pp. 2672–2680.
-  J. Snell, K. Swersky, R. S. Zemel, Prototypical networks for few-shot learning, arXiv preprint arXiv:1703.05175.
-  S. Ravi, H. Larochelle, Optimization as a model for few-shot learning, in: International Conference on Learning Representations, 2017.
-  B. M. Lake, R. Salakhutdinov, J. B. Tenenbaum, Human-level concept learning through probabilistic program induction, Science 350 (6266) (2015) 1332–1338.
-  L. Fei-Fei, R. Fergus, P. Perona, One-shot learning of object categories, IEEE transactions on pattern analysis and machine intelligence 28 (4) (2006) 594–611.
-  M. Palatucci, D. Pomerleau, G. E. Hinton, T. M. Mitchell, Zero-shot learning with semantic output codes, in: Advances in neural information processing systems, 2009, pp. 1410–1418.
-  R. Socher, M. Ganjoo, C. D. Manning, A. Ng, Zero-shot learning through cross-modal transfer, in: Advances in neural information processing systems, 2013, pp. 935–943.
-  T. G. Evans, A heuristic program to solve geometric-analogy problems, in: Proceedings of the April 21-23, 1964, spring joint computer conference, ACM, 1964, pp. 327–338.
-  S.-C. Zhu, C.-E. Guo, Y. Wang, Z. Xu, What are textons?, International Journal of Computer Vision 62 (1) (2005) 121–143.
-  T.-F. Wu, G.-S. Xia, S.-C. Zhu, Compositional boosting for computing hierarchical image structures, in: Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, IEEE, 2007, pp. 1–8.
-  H. J. Levesque, E. Davis, L. Morgenstern, The winograd schema challenge., in: AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, Vol. 46, 2011, p. 47.
-  J. Wagemans, J. H. Elder, M. Kubovy, S. E. Palmer, M. A. Peterson, M. Singh, R. von der Heydt, A century of gestalt psychology in visual perception: I. perceptual grouping and figure–ground organization., Psychological bulletin 138 (6) (2012) 1172.
-  F. Jäkel, M. Singh, F. A. Wichmann, M. H. Herzog, An overview of quantitative approaches in gestalt perception, Vision research 126 (2016) 3–8.
-  K. Grammer, R. Thornhill, Human (homo sapiens) facial attractiveness and sexual selection: the role of symmetry and averageness., Journal of comparative psychology 108 (3) (1994) 233.
-  B. Fink, N. Neave, J. T. Manning, K. Grammer, Facial symmetry and the ‘big-five’personality factors, Personality and individual differences 39 (3) (2005) 523–529.
-  Y. Sun, Y. Chen, X. Wang, X. Tang, Deep learning face representation by joint identification-verification, in: Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 27, Curran Associates, Inc., 2014, pp. 1988–1996.
-  A. Georghiades, P. Belhumeur, D. Kriegman, From few to many: Illumination cone models for face recognition under variable lighting and pose, IEEE Trans. Pattern Anal. Mach. Intelligence 23 (6) (2001) 643–660.
-  K. Lee, J. Ho, D. Kriegman, Acquiring linear subspaces for face recognition under variable lighting, IEEE Trans. Pattern Anal. Mach. Intelligence 27 (5) (2005) 684–698.
-  F. S. Samaria, A. C. Harter, Parameterisation of a stochastic model for human face identification, in: Applications of Computer Vision, 1994., Proceedings of the Second IEEE Workshop on, IEEE, 1994, pp. 138–142.
-  V. Kazemi, J. Sullivan, One millisecond face alignment with an ensemble of regression trees, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1867–1874.
-  C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, M. Pantic, 300 faces in-the-wild challenge: The first facial landmark localization challenge, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2013, pp. 397–403.
-  Y. Hoshen, S. Peleg, Visual learning of arithmetic operation., in: AAAI, 2016, pp. 3733–3739.
-  T.-Y. Lin, A. RoyChowdhury, S. Maji, Bilinear cnn models for fine-grained visual recognition, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1449–1457.