The unprecedented performance achieved by deep convolutional neural networks for image classification is linked primarily to their ability to capture rich structural features at various layers within the networks. Here we design a series of experiments, inspired by children's learning of the arithmetic addition of two integers, to show that such deep networks can go beyond structural features to learn deeper knowledge. In our experiments, a set of images is constructed, each image containing an arithmetic addition n+m in its central area, and several classification networks are then trained over a subset of the images, using the sum as the label. Tests on the excluded images show that, as the image set gets larger, the networks learn the law of arithmetic addition well enough to build up a strong autonomous reasoning ability. For instance, networks trained over a small percentage of the images can classify a large majority of the remaining images correctly, and many arithmetic additions involving integers that were never seen during training can also be solved correctly by the trained networks.
Human intelligence is largely characterized by the ability to draw inferences about many other cases (similar or new) from learnt instances or knowledge [3, 16]. Nearly every human being builds up this ability gradually while learning and growing. One of the most typical scenarios is children learning the arithmetic addition of two integers at a very early age. During the learning, parents or teachers show examples one by one, tell the answers, and then test the children with new additions. One proven result is that, after teaching several dual-examples such as “, ”, “, ”, and “, ”, and additionally , some children could give the correct answer to the new addition . At this stage, children may not understand the commutativity of arithmetic addition at all. Nevertheless, they get the correct answer merely by simple inference. As the learning continues, children gradually come to understand the law of arithmetic addition and then carry out new additions correctly, which builds up their autonomous calculating ability.
Here, each example is usually presented on a white board or a piece of paper. Children observe the question and start learning. They first recognize the two integers n and m as well as the symbol ‘+’ through their structural patterns. Then, they gradually learn the law of arithmetic addition as more and more examples are taught. Clearly, this law goes far beyond, and is thus much more important than, the structural patterns perceived from the addition formula itself, as it implies deeper and more intrinsic knowledge.
In comparison, over the past decade we have witnessed the huge success of various deep learning networks (DLNs) in applications such as image analysis, machine translation, and natural language processing. One benchmark example is the worldwide image-classification competition around ImageNet [9, 1], where deep convolutional neural networks (CNNs) have been dominant for years. The unprecedented performance achieved by these classification networks is linked primarily to their ability to capture rich structural features at various layers within the networks. Nevertheless, as far as we are aware, no examples have been given to show whether, and to what extent, these deep networks have learnt intrinsic knowledge beyond the structural features perceived from individual images. This is largely because it is extremely challenging to define clearly the specific knowledge that needs to be learnt in the image classification task (as well as in many similar tasks), which hinders the quantitative assessment of a network's autonomous reasoning ability after training is completed. Encouragingly, recent work around AlphaGo [12, 13, 15] and AlphaStar has shed light on the reasoning ability of DLNs by providing some proven evidence. Through supervised learning and reinforcement learning, these Alpha-agents have swept almost all top-ranked human players. However, these successful examples require that the game rules be provided explicitly.
We return to deep image classification networks, where no rules or knowledge are provided except for the labels of the training data. Here, we focus on two questions: (1) how strong an autonomous reasoning ability can these networks build up once training is completed, and (2) have they learnt specific knowledge beyond the structural features that leads to this ability? Inspired by children’s learning of the arithmetic addition of two integers, as discussed earlier, we carry out a series of experiments on several deep CNNs for image classification to assess their autonomous reasoning abilities quantitatively. To this end, we construct a closed set of images, each containing an arithmetic addition of two integers in its central area (see the input part of Figure 1 for one example). Then, we train several deep CNNs over a subset of the images. Although each image in our experiments is very simple compared with the images used in ImageNet, it implies clear rules or knowledge that anyone observing the image can easily understand. More specifically, the goal here is threefold: (1) to recognize the two integers and the symbol ‘+’ within each formula, (2) to understand the arithmetic meaning of each integer and of the symbol ‘+’ (more important), and (3) to master the law of arithmetic addition (most important): ones digits versus tens digits, commutativity between addend and augend, and carry-over in additions. The first part is a classic pattern recognition task. The second and third parts together form the learning of the knowledge necessary for computing arithmetic additions. After training is completed, we test on all images that were not seen during training. Clearly, the test results will illustrate whether, and to what extent, these networks can learn the specific knowledge (for mastering arithmetic addition) that lies beyond the structural patterns perceived from images. Only when the necessary knowledge has been mastered well can the trained networks build up strong autonomous reasoning abilities.
We would like to point out that some previous works have addressed the task of arithmetic addition, but in different ways that reflect different properties of networks. For example, Liang et al. applied an Optical Character Recognition (OCR) system to recognize the operators and digits in images, based on which the additions are then calculated. Hoshen et al. adopted a network consisting only of fully connected layers to perform additions, where the inputs to the network are two images and the output is also an image containing the summation result, and a simple OCR system is used to recognize the digits for measuring accuracy. As a result, the network learns an image-to-image mapping that concentrates on the transformation of structural patterns perceived in images. In contrast, we directly output the results as classification labels rather than as output images, allowing the network to focus on learning the underlying knowledge of arithmetic addition and on building its autonomous reasoning ability.
Let denote an image set and its size (i.e., the number of images in the set). By upper-limiting and , we have constructed several image sets whose sizes range from to . Here, each image is of resolution . By using the sum as the label for each image, we convert the calculation of n+m into an image classification problem. We have trained several popular classification networks, such as VGG, ResNet, and SENet (see Figure 1 for the general architecture shared by these classification networks). Instead of the Top-5 accuracy commonly measured in traditional image classification tasks, we measure the Top-1 accuracy in our experiments, because we are dealing with a scientific calculation in which absolute accuracy is the top-priority requirement.
We split the image set into the training set and the test set . We first construct by random selection. Here, we focus on finding out how many images are needed for training so that a large majority ( or above) of the test images can be classified correctly by the trained network. Then, we construct by excluding all arithmetic additions , where or , i.e., any integer in the interval will not appear as the addend, the augend, or both. In this way, we would like to show that, although those integers have never been seen by the networks during training, a large majority of the arithmetic additions involving them can still be calculated correctly. For each experiment, we train the network independently for times. Here, we only report the results of SENet, as very similar results have been obtained with other networks such as VGG and ResNet.
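The conversion of n+m into a classification problem can be made concrete with a minimal sketch. It assumes that the integers start at 0 and that the upper limit is 99; both values, as well as the names `build_addition_set` and `top1_accuracy`, are illustrative and not taken from the experiments. Image rendering is omitted, and only the formula text and its label are enumerated.

```python
from collections import Counter


def build_addition_set(max_int):
    """Enumerate every addition n+m with 0 <= n, m <= max_int.

    Returns (formula_text, label) pairs: the text is what would be drawn
    in the centre of an image, the label is the class index (the sum).
    """
    return [(f"{n}+{m}", n + m)
            for n in range(max_int + 1)
            for m in range(max_int + 1)]


def top1_accuracy(predicted, expected):
    """Fraction of samples whose single best guess equals the label."""
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)


samples = build_addition_set(max_int=99)      # assumed upper limit
class_sizes = Counter(label for _, label in samples)

print(len(samples))        # number of images in the set
print(len(class_sizes))    # number of classes (distinct sums)
print(class_sizes[10])     # number of images whose label is 10
```

Under these assumptions the class sizes are far from uniform: the smallest and largest sums each have a single image, while sums in the middle of the range have many.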
In the first experiment, we try to verify the commutativity between addend and augend in arithmetic additions. To this end, we remove all images that contain (i.e., they are all put into the test set ). From the other images, we randomly select some into the training set while making sure that their dual images are all excluded. In this way, we can select at most of the images into the training set. After training is completed, we first examine the calculation of . The results show that we can achieve accuracy when the image set is large enough ( or above), with only two exceptions, i.e., and ( denotes the maximum integer), as these two additions do not have corresponding labels among the training images. Then, we test all dual images. Figure 2a presents the accuracy averaged over 10 independent trials. When the image set gets large, the accuracy increases dramatically, e.g., to almost when . Consequently, we believe that the commutativity law (knowledge beyond the structural features) has been learnt very successfully.
Next, we analyze the incorrect calculations in the image set with (as an example). To this end, we count the incorrect inferences and accumulate them over 10 trials, with the results shown in Figure 2b. As seen, most incorrect calculations are gathered at the two ends, with small or large sums (labels). This is because there are fewer images in at these two ends, which leaves fewer training images for the network to observe, so that the reasoning logic cannot be built up robustly. We further verify the above distribution by looking at the results of one trial. Figure 2c shows the learning map of one trial, where each small cell represents one of four possible states (in different colors): train right, train wrong, test right, and test wrong.
As seen, no red cells appear (i.e., the training is accurate); no deep-blue cells appear along the 45° diagonal line (i.e., the calculations of are all correct, except for and ); and only 30 deep-blue cells (incorrect inferences) appear, all gathering around the corner .
In the second experiment, we also remove all images that contain . Then, we further randomly exclude some pairs of and images so that they are never seen during training. Here, we aim to find out how many images are needed for training so that a large majority (over ) of the test images can be classified correctly by the trained networks. This is equivalent to asking how to split the image set into the training set and the test set , in percentages. Figure 3 shows these percentages for image sets of different sizes such that the test achieves over accuracy. As seen, when the image set is too small, i.e., , we cannot exclude even a single pair from training, as the excluded pair cannot be calculated correctly. In other words, the learning of arithmetic additions involving one-digit integers fails. When , the task becomes solvable, whereas images in need to be selected into the training set to achieve over accuracy on the test set. For the largest image set , only images in are needed for training to accomplish the same task.
In the third experiment, we select images in to form the training set completely at random (i.e., without any constraints). Here, we focus on the dynamic relationship between the test set size and the test accuracy. Figure 4 shows the results (again, averaged over trials), where indicates how many images in are used in the tests (as a percentage). As seen, for , the test accuracy drops quickly as more images in are reserved for the test set . This situation improves as increases. For instance, for , a very high test accuracy (over ) can be maintained even when we select only of the images in to train the network.
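The dual-exclusion split used in the first experiment can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: it assumes integers from 0 to an upper limit of 99, takes the maximal training set (one image from every dual pair), and uses the hypothetical name `commutativity_split`.

```python
import random


def commutativity_split(max_int, seed=0):
    """Hold out every n+n and, for each dual pair, exactly one of n+m / m+n."""
    rng = random.Random(seed)
    train, test = [], []
    for n in range(max_int + 1):
        test.append((n, n))                    # n+n always goes to the test set
        for m in range(n + 1, max_int + 1):
            if rng.random() < 0.5:
                kept, dual = (n, m), (m, n)
            else:
                kept, dual = (m, n), (n, m)
            train.append(kept)                 # one orientation is trained on
            test.append(dual)                  # its dual is only ever tested
    return train, test


train, test = commutativity_split(max_int=99)  # assumed upper limit
train_labels = {n + m for n, m in train}
unseen_labels = sorted({n + m for n, m in test} - train_labels)

print(len(train), len(test))   # sizes of the two sets
print(unseen_labels)           # sums that never occur as a training label
```

With this split, slightly less than half of the images can be used for training, and the only sums without any training image are the smallest and the largest ones, which arise solely from adding the minimum or the maximum integer to itself.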
In the following experiments, we follow certain rules to exclude some arithmetic additions from the training set. First, we exclude all arithmetic additions of the form or from training. Here, is a fixed integer and . This task seems very easy, as our multiple trials (choosing different ) show that the trained network can classify all excluded images correctly. This result indicates that, although the network is blind to one specific integer, a very strong reasoning logic has still been built up after training is completed.
To increase the level of difficulty, we exclude several consecutive integers. For instance, arithmetic additions of the form or , where and , are all excluded, as shown in Figure 5. Here, Figure 5a shows the learning map, which displays four types of cells, i.e., train right, train wrong, test right, and test wrong (the same as in Figure 2c). Notice that the two colored stripes in this figure form the test set, so that each image in this set has been excluded from training but needs to be classified during testing. In the learning map, an intersection region (highlighted by the yellow rectangle) produces a few incorrect results. We then zoom in on this region to examine the detailed results, as shown in Figure 5b. Interestingly, most of these incorrect calculations are short by 10, indicating that the network does not seem to fully understand the carry-over rule in arithmetic additions. We further zoom in on one incorrect calculation () to examine the distribution of all classification probabilities, as shown in Figure 5c. It is found that the network outputs because it receives the highest probability (), whereas the correct answer 131 receives the second-highest probability ().
Finally, we would like to report that the network collapses if the integers are excluded consecutively in a range such as . We believe that this is because the integer 6 has never been seen by the network in the tens position, whereas seeing it in the ones position does not help the learning of the tens position.
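The range-exclusion split, the "short by 10" error, and the missing tens digit can be illustrated with a small sketch. All concrete numbers here are assumptions chosen for illustration (upper limit 99, excluded ranges 64 to 66 and 60 to 69, operands 65 and 66), and the function names are hypothetical; none of them are taken from the experiments.

```python
def excluded_range_split(max_int, low, high):
    """Hold out every addition whose addend or augend lies in [low, high]."""
    train, test = [], []
    for n in range(max_int + 1):
        for m in range(max_int + 1):
            held_out = low <= n <= high or low <= m <= high
            (test if held_out else train).append((n, m))
    return train, test


def add_dropping_carry(n, m):
    """Two-digit addition that forgets the carry from the ones to the tens."""
    ones = (n % 10 + m % 10) % 10
    tens = n // 10 + m // 10
    return 10 * tens + ones


def tens_digits_seen(pairs):
    """Tens digits that occur in at least one operand of the given pairs."""
    return {k // 10 for pair in pairs for k in pair}


# Narrow exclusion (assumed range): every tens digit is still seen in training.
train, test = excluded_range_split(max_int=99, low=64, high=66)
print(len(train), len(test))
print(sorted(set(range(10)) - tens_digits_seen(train)))

# Dropping the carry gives a result that is exactly 10 too small.
print(65 + 66, add_dropping_carry(65, 66))

# Wide exclusion (assumed range): one tens digit disappears from training.
wide_train, _ = excluded_range_split(max_int=99, low=60, high=69)
print(sorted(set(range(10)) - tens_digits_seen(wide_train)))
```

In the narrow case the two held-out stripes overlap in a small square where both operands are unseen, which corresponds to the intersection region discussed above. In the wide case no training image shows the digit 6 in the tens position at all.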
Some discussion is necessary to complement the quantitative assessment results presented above. First of all, we would like to point out that the labelling of the training data does itself contain the rules of arithmetic addition. Our results show that these rules have been learnt well, so that new arithmetic additions can be solved (with very high accuracy) by the trained networks. Our discussion in the following focuses on a comparison between deep CNN-based image classification and the learning of arithmetic additions by image classification networks, with respect to the mechanisms at work in these two tasks. In the former, structural features extracted from images alone play the most important role. Taking ImageNet as an example, an image is classified into the class “dog” because neurons that extract features such as a dog’s legs, a dog’s head, etc. are triggered. Similarly, for the class “bird”, neurons responding to different parts of birds (e.g., eyes, wings, legs, etc.) are triggered.
In the latter task, however, we see a different scenario. As illustrated by the example shown in Figure 6, 11 different combinations of two integers constitute Class-10. If and are taken away from the training (which does happen in our experiments presented earlier), the structural patterns of 4, 5, and 6 do not contribute to the training of this class. Seeing one of them (in combination with another integer) during training would lead to a classification into other classes, e.g., (or ), (or , and or (). Now, when testing (or ) and , any mechanism driven by structural patterns would more likely classify each of them into a class that saw 4, 5, or 6 during training, rather than into Class-10, which never sees any of them during training. Nevertheless, our experimental results presented earlier show that a large majority of the arithmetic additions excluded from training have been classified correctly. We believe that this is owing to the successful learning of the knowledge (beyond the structural patterns perceived from individual images) involved in computing arithmetic additions, so that the trained networks have built up a strong autonomous reasoning ability. To the best of our knowledge, this is the first time that a meaningful example has been designed as evidence of the learning of deeper knowledge that can be defined clearly but lies far beyond the structural features perceived from individual images.
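The Class-10 argument can be checked by enumeration. The sketch below assumes integers from 0 to 99 and uses the hypothetical helper `class_members`; it lists the members of the class, removes those built only from 4, 5, and 6, and confirms that none of the remaining members shows any of these three integers.

```python
def class_members(label, max_int):
    """All ordered pairs (n, m) with n + m == label and 0 <= n, m <= max_int."""
    return [(n, label - n)
            for n in range(max_int + 1)
            if 0 <= label - n <= max_int]


members = class_members(10, max_int=99)            # assumed upper limit
held_out = [(n, m) for n, m in members if {n, m} <= {4, 5, 6}]
remaining = [pair for pair in members if pair not in held_out]
integers_left = {k for pair in remaining for k in pair}

print(len(members))                  # combinations that make up Class-10
print(held_out)                      # members built only from 4, 5, and 6
print(integers_left & {4, 5, 6})     # overlap with the held-out integers
```

Once these members are held out, the class label 10 is never paired with the visual patterns of 4, 5, or 6 during training, which is why a purely pattern-driven classifier would be expected to miss them.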
IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
AAAI Conference on Artificial Intelligence.