Object recognition aims at automatically assigning labels of object categories from a finite label collection to a given image. It is a fundamental problem in the field of computer vision and also a core technique for many applications girshick2014rich ; long2015fully ; sun2014deep . Various algorithms for object recognition have been developed in the past decades, which can be roughly summarized into the following standard pipeline: first, a variety of handcrafted features are extracted; then the features are fed into some sophisticated feature encoding or transformation procedures (e.g. dimension reduction, feature pooling); finally, the resulting high-level features are classified with trained classifiers. Though many works (e.g. SIFT ng2003sift , LBP he1990texture , HOG dalal2005histograms , Gabor marvcelja1980mathematical ) focus on developing better handcrafted features, feature design remains the major bottleneck for improving the performance of object recognition.
In recent years, great progress has been achieved in object recognition which is arguably attributed to the availability of larger datasets for training more sophisticated models and greater computation resources, and more importantly the application of deep learning algorithms.
The convolutional neural network (CNN), a popular example of deep learning algorithms, adopts a deep architecture that consists of many stacked convolutional and fully-connected layers. Such an architecture is specifically designed for solving computer vision related problems le1990handwritten ; lecun1998gradient ; ciresan2012multi ; yin2013icdar ; krizhevsky2012imagenet and has also seen many other successful applications. The CNN architecture is end-to-end trainable and is able to automatically learn features, at different abstraction levels, that perform well for a specific target task. With these high-level features, it is possible to classify images accurately with a simple classifier. Nowadays, CNN-based algorithms achieve state-of-the-art results on many challenging tasks girshick2014rich ; girshick2015fast ; ren2015faster ; taigman2014deepface ; sun2014deep .
However, the simple feedforward architecture cannot handle some challenging cases of object recognition well. For instance, it has been observed that the powerful GoogLeNet szegedy2015going often fails to recognize small objects in an image. Another challenging case for the feedforward deep architecture is fine-grained object recognition zhang2014part , where the differences among fine-grained categories are quite subtle. Distinguishing fine-grained categories requires CNN-based models to extract features from the most discriminative regions. Therefore, part annotations are usually utilized to assist fine-grained image classification xiao2015application . Through empirical statistics on the classification errors, we find that the network is able to predict several candidate categories that include the correct one with high confidence. However, making the correct final decision on a single category is difficult for network-based models, due to the distraction from the other candidate categories. Motivated by the above observations, we propose a novel “Learning with Rethinking” (LR) algorithm in this paper: instead of making the final decision based on a single pass of the data through the network, we introduce feedback connections and allow the network-based models to “re-think” the decision and take the high-level feedback information into feature extraction. Benefiting from the feedback, the model is able to extract more discriminative low-level features with the guidance of the high-level information.
We propose two new types of layers, the “feedback” layer and the “emphasis” layer, to serve as the channel for transferring the feedback information. The feedback layer connects one top layer to one specific bottom layer in the network. The emphasis layer produces different weights, based on such top-down information, for the feature maps in the connected bottom layer. The proposed “Learning with Rethinking” algorithm exploits the fed-back posterior probability of candidate object categories in the feedback layer, and endows the network model with the ability to “re-think” the decision during training. A new prediction is then made that takes the previous prediction into account.
Figure 1 provides the overall pipeline of the “Learning with Rethinking” algorithm in a time-unfolded manner for illustration. Here we take the network-in-network (NIN) network as a basic CNN structure and illustrate how we build the proposed “Learning with Rethinking” network by augmenting the existing neural network architecture. The small “ibex” in the image is misclassified as “parachute” by a classic feedforward CNN (as shown in the left part, where the probability of “parachute” is larger than that of “ibex”). In contrast, by further exploiting the posterior probabilities in the top layer before making the final decision, the “Learning with Rethinking” algorithm recurrently adjusts the feature maps in hidden layers through the feedback connection and identifies the correct category among the distracting categories. We describe this pipeline in detail in Section 3.
2 Related Work
It has been more than two decades since LeNet was first applied to OCR in 1990 le1990handwritten . Many algorithms have been developed to improve the performance of CNN, although the basic framework of CNN has not changed much since it was proposed. The large object recognition data set ILSVRC2012, also known as ImageNet deng2009imagenet , has greatly propelled the progress in this area. Some of the most well-known progress in CNN structure has been made along with the continuous improvements of the performance on the ImageNet data set. Since AlexNet was proposed on ILSVRC2012, there have been some remarkable advances in CNN architecture zeiler2014visualizing ; DBLP:journals/corr/LinCY13 ; simonyan2014very ; szegedy2015going ; he2016deep . There are also some task-specific modifications of the CNN structure cirecsan2011committee ; sermanet2011traffic ; sun2014deep ; barat2016string ; cao2015large . For example, in multi-resolution CNN cirecsan2011committee ; sermanet2011traffic ; sun2014deep , combining features in lower layers leads to a more detailed representation of an input image. MOP-CNN gong2014multi is another algorithm proposed to extract more powerful features. With a combination of VLAD and CNN, MOP-CNN extracts a multi-scale and robust feature. This algorithm does not actually change the CNN structure, but utilizes a pre-trained CNN model and modifies the feature extraction procedure.
loosens the weight sharing constraint in normal convolution layer, and is suitable for face related tasks. Leaky ReLU graham2014spatially adds a negative slope to the normal ReLU, to preserve information discarded by ReLU. PReLU he2015delving further enhances this by making the negative slope learnable. Spatial Pyramid Pooling (SPP) he2014spatial extends max-pooling so that the CNN can avoid input warping or resizing and still produce fixed-length features. Inspired by Dropout srivastava2014dropout , DropConnect wan2013regularization regularizes the CNN by randomly setting a subset of weights to zero within each layer. Spatial Dropout tompson2015efficient randomly sets some feature maps to zero entirely. DropSample yang2016dropsample randomly selects low-confidence samples during training according to the output of the CNN. The commonly used fully-connected layer can be transformed into a convolution layer with kernel size , as shown in DBLP:journals/corr/LinCY13 . With this transformation, a CNN can take input of any size and output classification maps.
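As a concrete illustration of the activation and regularization variants mentioned above, the following is a minimal NumPy sketch, not the cited authors' implementations. The function names, the default slope value, and the `(C, H, W)` layout are assumptions made for this example; PReLU would use the same formula as `leaky_relu` with `slope` treated as a learnable parameter, and the usual rescaling of kept activations in dropout is omitted for brevity.

```python
import numpy as np

def relu(x):
    # Standard ReLU: negative inputs are discarded (set to zero).
    return np.maximum(x, 0.0)

def leaky_relu(x, slope=0.01):
    # Leaky ReLU: negative inputs are scaled by a small slope instead of zeroed.
    # The default slope is a hypothetical value chosen for illustration.
    return np.where(x > 0, x, slope * x)

def spatial_dropout(feature_maps, drop_prob, rng):
    # Spatial Dropout: zero entire feature maps (channels) at random.
    # feature_maps has shape (C, H, W); one keep/drop decision per channel.
    keep = rng.random(feature_maps.shape[0]) >= drop_prob
    return feature_maps * keep[:, None, None]
```

For example, `leaky_relu(np.array([-2.0, 3.0]), slope=0.1)` keeps a scaled copy of the negative input, whereas `relu` maps it to zero.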
Considerable work has been devoted to improving the performance of a CNN model through modifying its architecture. However, all these algorithms are still founded on a single feedforward pass of samples. Little effort has been made to recurrently improve the performance of a CNN model. In this work, we argue that recurrent recognition processing is more consistent with the mechanism embedded in the human brain for visual processing, motivated by neuroscience research buffalo2010backward ; gilbert2007brain ; hupe1998cortical ; bastos2015dcm .
Based on the analysis of response latencies to a newly-presented image, there are two stages of visual processing: a pre-attentive phase and an attentional phase, corresponding to feedforward and recurrent processing respectively lamme2000distinct . The feedback connections play an important role in the attentional phase gilbert2007brain ; hupe1998cortical . Unlike feedforward connections, which directly carry information, the feedback connections primarily play a modulatory role bastos2015dcm . Experiments have shown that recurrent processing makes object recognition in degraded images more robust wyatte2012limits .
The idea of recursive or recurrent neural networks has a long history, and the recursive neural network (RNN) is successful in modeling temporal and sequential data graves2009novel ; donahue2015long . Several works consider employing recursive neural networks for processing a single image. Eigen et al. eigen2014understanding proposed a recursive convolutional network for image classification, finding that too large a recursion depth may result in inferior performance due to over-fitting. Liang and Hu liang2015recurrent enhanced the recursive layer by feeding the feed-forward inputs into all unfolded layers; the recurrent connections are spatial within the same recursive layer. Kim et al. kim2016deeply proposed a deep recursive convolutional neural network for image super-resolution. The recursion depth is much larger, and all predictions from the intermediate recursions are utilized to obtain the final output.
Our “Learning with Rethinking” algorithm differs from the above recursive neural networks in that we feed the posterior probabilities in the top layer into the next recursion. The “Learning with Rethinking” algorithm recurrently adjusts the feature maps in hidden layers through the feedback connection and identifies the correct category among the distracting categories.
The idea of refining predictions is similar to cascading, which is a multistage ensemble learning algorithm. The subsequent stages focus on refining the predictions of previous stages sun2013deep ; li2015convolutional ; timofte2016seven ; ren2015faster . For instance, state-of-the-art object detection algorithms adopt a two-stage pipeline ren2015faster . The region proposal network proposes object candidates in the first stage, and the detection network focuses on classifying the proposals in the following stage. Sun et al. sun2013deep proposed three-stage cascaded convolutional neural networks for facial point detection, where the subsequent stages focus on giving more accurate keypoint estimates. Li et al. li2015convolutional ; qin2016joint proposed three-stage cascaded convolutional neural networks for face detection, where the first two stages quickly reject easy background regions, and the third stage carefully evaluates a small number of challenging candidates. Timofte et al. timofte2016seven employed four-stage cascaded models to gradually refine the contents in image super-resolution. They kept the same settings for all the stages, but the models are trained per stage.
Our “Learning with Rethinking” algorithm differs from the above cascading algorithms in that we recurrently refine the same model in all stages. In contrast, cascading algorithms need to train a separate model for each stage.
The most related work on utilizing a recurrent neural network for object recognition is dasNet stollenga2014deep . It makes use of a reinforcement learning strategy to iteratively adjust some weights of feature maps, and the final classification results are made after several iterations. Our “Learning with Rethinking” algorithm differs from dasNet in three major aspects. Firstly, we use a neural network to feed information back into lower layers, which is relatively easy to compute. Secondly, we only use the posterior probability of the previous feedforward pass as the feedback information, which is much more efficient in both time and space. Thirdly, our algorithm can be regarded as a further-training algorithm that is easy to apply to any pre-trained model and further boosts its performance. In comparison, dasNet needs to be trained from random initialization.
3 Learning with Rethinking
In this section, we briefly review the conventional architecture of convolutional neural networks (CNN). Then we elaborate how to incorporate the feedback mechanism into existing CNN architectures and propose the “Learning with Rethinking” algorithm to improve the performance of CNN for object recognition. The basic idea of “Learning with Rethinking” is intuitive: in addition to the feedforward connections in a neural network, several feedback connections directed from a top layer to certain bottom layers are also established to provide top-down information for object recognition. With the higher-level information from the top layer, the bottom layers are kept informed of which categories in the training data are being misclassified and which categories require extra effort to distinguish. Such information is fed back through the “emphasis layer” and the “feedback layer” devised in this work.
3.1 Conventional Convolutional Neural Networks
In the conventional convolutional neural network (CNN) architecture, multiple layers of different types (e.g., convolutional layers and pooling layers) are connected in a purely feedforward manner and the information only moves in one direction. In particular, each layer takes a collection of feature maps output by the previous layers as input, and produces a set of new feature maps via convolution or pooling operations. The new feature maps are then fed into the next layer directly. By stacking multiple convolutional layers interlaced with pooling layers, a CNN can extract features at different abstraction levels with increasingly larger receptive fields. One advantage of employing such a feedforward mechanism in the CNN architecture is that the operations involved in producing the feature maps (such as convolution and pooling) are computationally efficient because the computation graph has no directed cycle. In addition, back-propagation of errors from top layers to bottom ones can be applied straightforwardly to efficiently optimize the parameters of the CNN. However, such a feedforward mechanism also has a limitation: since each layer only interacts with its neighboring layers, the important top-down information across different layers is lost.
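To make the statement about increasingly larger receptive fields concrete, the sketch below computes the receptive field of a stack of convolution and pooling layers using the standard recurrence. The function name and the layer list in the usage note are hypothetical and do not correspond to any specific network in this paper.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit.

    layers: list of (kernel_size, stride) pairs ordered from bottom to top.
    """
    field, jump = 1, 1  # jump = distance in input pixels between adjacent units
    for kernel, stride in layers:
        field += (kernel - 1) * jump  # each layer widens the field by (k-1) jumps
        jump *= stride                # strides compound across layers
    return field
```

For a hypothetical stack of a 5x5 convolution (stride 1), a 3x3 pooling layer (stride 2) and another 5x5 convolution (stride 1), the call `receptive_field([(5, 1), (3, 2), (5, 1)])` gives 15, compared with 5 for the first layer alone.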
In the following subsections, we introduce a new network architecture that also allows feedback connection among different layers. We elaborate how such an architecture with a feedback mechanism can learn more discriminative feature representation to better solve the object recognition problems – especially for those involving recognizing objects of small size or from fine-grained categories with subtle differences.
Throughout the paper, we use the following notations for the simplicity of explanation. The feature maps are represented by a tensor of dimension, where is the number of feature maps, and is the spatial dimension of each feature map. We use to denote the input feature maps for the -th layer and to denote the value of the -th feature map at the position , in the -th layer.
3.2 Feedback Layer
We introduce the feedback mechanism to the conventional CNN architecture through a new feedback layer. The feedback layer connects, in a top-down direction, two layers that need not be neighbors. When an input sample passes through all the layers, instead of immediately making a prediction based on the predicted posterior probability of the sample belonging to a specific category, a feedback layer is deployed to propagate the predicted posterior probability to the bottom layers to update the network. Intuitively, when a sample has similar posterior probabilities for two different categories, it is difficult to classify. Instead of outputting the final prediction directly, a wiser way is to guide the previous layers based on the current posterior probabilities of these confusable categories, such that the bottom layers can be strengthened or weakened to produce more discriminative features specifically for the categories that are difficult to distinguish.
Formally, suppose there are in total categories. Then for each sample, the network outputs posterior possibilities , each of which denotes the possibility of the sample belonging to the corresponding category. The feedback layer is a fully connected layer whose parameters are denoted as and for producing the -th emphasis vector. It takes posterior possibility as input and outputs emphasis vectors. The dimension of the -th emphasis vector (here ) is denoted as , which equals the number of channels in the corresponding bottom layer. The -th element in the -th emphasis vector used to re-weight the -th channel is computed as follows,
The initial emphasis vector computed in Equation (3) is then normalized and weighted by the total channel number in the emphasis vector in Equation (2). The emphasis vectors are then used to re-weight the feature maps in the layer connected to the feedback layer. Such normalization guarantees that coefficients in the emphasis vector have a mean value of such that the feature maps re-weighted by the emphasis vector have a magnitude at the same order as the feature maps without being augmented by the feedback and re-weighted.
The computational cost in time and space of the feedback layer is negligible. For an output emphasis vector with a length of , only extra parameters are introduced. Each emphasis vector is able to adaptively rectify the feature maps, by strengthening the contribution of some feature maps and weakening the effect of others, to produce more discriminative feature maps for the subsequent object recognition. In the next subsection, we explain the role of the emphasis vectors in more detail.
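The feedback layer described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions rather than the original formulation: the equations are not reproduced in this text, so the sketch assumes a plain affine map from the posterior vector to a raw emphasis vector, followed by normalization by the sum and scaling by the channel count. The names `feedback_layer`, `weight` and `bias` are hypothetical, and the sketch assumes the raw entries have a nonzero sum.

```python
import numpy as np

def feedback_layer(posterior, weight, bias):
    """Map a posterior vector to one emphasis vector.

    posterior: shape (K,), class posterior from the previous pass
    weight:    shape (C, K), assumed fully connected weights
    bias:      shape (C,),   assumed bias
    returns:   shape (C,),   emphasis coefficients, one per channel
    """
    raw = weight @ posterior + bias      # initial (unnormalized) emphasis vector
    channels = raw.shape[0]
    # Normalize and scale by the channel count; the coefficients then average to 1,
    # so re-weighted feature maps keep the same order of magnitude.
    return channels * raw / raw.sum()
```

Under these assumptions the layer holds `C * K + C` parameters per emphasis vector, and with zero weights and unit biases every coefficient equals 1, so the feature maps are left unchanged.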
3.3 Emphasis Layer
To adaptively re-weight different feature maps in a specific layer, an emphasis layer is introduced in our proposed Learning-with-Rethinking network. The emphasis layers take the emphasis vectors as well as the feature maps as inputs and outputs the re-weighted feature maps. More concretely, the -th channel in the -th layer is weighted by the corresponding emphasis coefficient by multiplying with : with , the -th channel is enhanced; and the channel is suppressed with . All the emphasizing coefficients in the -th layer form an emphasizing vector . The intuition of this emphasis layer comes from the human visual mechanism. It is believed that the feedback connections primarily play a modulatory role. This structural augmentation enables the Learning-with-Rethinking network to selectively emphasize some discriminative features, and suppress the feature maps causing confusion in the recognition. The emphasis procedure can be formally written as
Figure 3 illustrates such operation conducted by the emphasis layer.
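The channel-wise re-weighting performed by the emphasis layer amounts to a broadcast multiplication. The following NumPy sketch illustrates it; the function name and the `(C, H, W)` layout are assumptions made for this example.

```python
import numpy as np

def emphasis_layer(feature_maps, emphasis):
    """Re-weight each feature map by its emphasis coefficient.

    feature_maps: shape (C, H, W)
    emphasis:     shape (C,), one coefficient per channel
    """
    # Broadcasting multiplies every spatial position of channel c by emphasis[c].
    return feature_maps * emphasis[:, None, None]
```

A coefficient above one enhances the corresponding channel, a coefficient below one suppresses it, and a vector of ones returns the input unchanged.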
3.4 Architecture of the Learning with Rethinking Network
With the feedback layer and the emphasis layer, we build the proposed Learning-with-Rethinking network by augmenting an existing neural network architecture. Here we take the network-in-network (NIN) network as a basic CNN structure and illustrate how we can build and train the corresponding Learning-with-Rethinking NIN network, LR-NIN. Figure 1 provides the overall pipeline of the LR-NIN in a time-unfolded manner for illustration. The LR-NIN network is constructed as follows.
First, we pre-train the NIN model without feedback connections to obtain an initial model of LR-NIN. Then, we build three emphasis layers that are connected to three different convolution layers. Each emphasis layer takes its corresponding emphasis vector from the feedback layer as input (see Section 3.2), and uses it to re-weight the feature maps produced by its convolution layer (see Section 3.3). Such information feedback in the LR-NIN is repeated for times in total to train the overall network.
In our implementation, all the coefficients in the emphasis vector are initialized as , and the emphasis layer does not change the feature maps at the initial stage. In the training phase, LR-NIN recursively feeds back the posterior possibility at the -th step to guide the operation at the -th step.
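The recursion described here can be sketched as a time-unfolded loop on a toy model. This is an illustration under assumptions, not the LR-NIN implementation: a single block of fixed feature maps stands in for the convolution layers, a linear classifier on globally averaged channels stands in for the top of the network, and the feedback layer is assumed to be an affine map followed by sum-normalization. All names are hypothetical.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max()      # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def rethink_forward(features, cls_w, cls_b, fb_w, fb_b, passes):
    """Run `passes` feedforward passes, feeding each posterior back to the next.

    features: (C, H, W) feature maps of one bottom layer (toy stand-in)
    cls_w, cls_b: (K, C) and (K,) stand-in classifier on pooled channels
    fb_w, fb_b:   (C, K) and (C,) assumed feedback-layer parameters
    returns: list of `passes` posterior vectors, one per pass
    """
    channels = features.shape[0]
    emphasis = np.ones(channels)         # neutral emphasis for the first pass
    posteriors = []
    for _ in range(passes):
        weighted = features * emphasis[:, None, None]   # emphasis layer
        pooled = weighted.mean(axis=(1, 2))             # global average pooling
        posterior = softmax(cls_w @ pooled + cls_b)
        posteriors.append(posterior)
        raw = fb_w @ posterior + fb_b                   # feedback layer (assumed affine)
        emphasis = channels * raw / raw.sum()           # coefficients average to 1
    return posteriors
```

With zero feedback weights and unit feedback biases the emphasis stays neutral, so every pass reproduces the first prediction; this mirrors the statement that the emphasis layer does not change the feature maps at the initial stage.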
It is obvious that increasing the number of recursive steps for information feedback allows the bottom layers to receive richer top-down information, but the training time cost will also be increased accordingly. We observe from experiments that after the performance improvement is incremental. In order to trade off the time cost and the final performance, we empirically set in the training phase of LR-NIN.
LR-NIN only introduces very few extra parameters compared with NIN: LR-NIN adds three emphasis layers after the three convolution layers whose kernel size is greater than . In this case, only 58k extra parameters are introduced on CIFAR-100, amounting to only increase in the total number of network parameters.
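The reported parameter overhead can be checked with a short calculation. The sketch below rests on two assumptions that are not stated in this text: that each feedback layer is an affine map (weights plus biases) from the class posterior to one emphasis vector, and that each of the three NIN convolution layers in question has 192 output channels, as in the publicly available NIN configuration for CIFAR. The function name is hypothetical.

```python
def feedback_param_count(num_classes, channels_per_layer):
    """Extra parameters of affine feedback layers: K*C weights plus C biases each."""
    return sum(num_classes * c + c for c in channels_per_layer)

# Assumed channel counts of the three convolution layers (not given in the text).
ASSUMED_NIN_CHANNELS = [192, 192, 192]

cifar100_extra = feedback_param_count(100, ASSUMED_NIN_CHANNELS)
cifar10_extra = feedback_param_count(10, ASSUMED_NIN_CHANNELS)
```

Under these assumptions `cifar100_extra` is 58,176, which is consistent with the 58k figure quoted above, and `cifar10_extra` is 6,336.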
Similar to other recurrent neural networks, the LR-NIN network is trained by unfolding it into a very deep feedforward network, and is optimized by the backpropagation through time (BPTT) algorithm rumelhart1985learning with Stochastic Gradient Descent (SGD). Due to the vanishing gradient problem lee2015deeply ; hochreitergradient ; graves2012supervised , the error signals propagated back tend to either blow up or vanish. This sometimes leads to failure in learning long-term dependencies (i.e., in training deep networks). Inspired by the Deeply Supervised Network (DSN), we provide intermediate supervision for the intermediate predictions. As shown in Figure 1, there are cross-entropy losses corresponding to the feedforward passes. The gradients of these losses are summed to form the final gradients.
Formally, the loss to be optimized is a combination of the per-iteration cross-entropy loss ,
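One form consistent with this description is written out below. The symbols are assumptions introduced for this sketch: $T$ is the number of feedforward passes, $K$ the number of categories, $y$ the one-hot label, and $p^{(t)}$ the posterior predicted at the $t$-th pass. The sum is taken as unweighted because the text states that the gradients of the losses are summed.

```latex
\mathcal{L} \;=\; \sum_{t=1}^{T} \mathcal{L}_t ,
\qquad
\mathcal{L}_t \;=\; -\sum_{k=1}^{K} y_k \,\log p^{(t)}_k .
```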
During feedforward propagation, the coefficients in the emphasis vectors are produced by the feedback layer, as explained in Section 3.2. During back propagation, the gradients of the input feature maps and the emphasis vectors
can be calculated via the chain rule:
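For a channel-wise multiplication, the chain rule takes the following form. The notation is an assumption made for this sketch: $X_{c,i,j}$ is the input feature map value at channel $c$ and position $(i,j)$, $e_c$ is the emphasis coefficient of channel $c$, $Y$ is the output of the emphasis layer, and $\mathcal{L}$ is the loss.

```latex
Y_{c,i,j} = e_c \, X_{c,i,j}
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial X_{c,i,j}} = e_c \, \frac{\partial \mathcal{L}}{\partial Y_{c,i,j}},
\qquad
\frac{\partial \mathcal{L}}{\partial e_c} = \sum_{i,j} X_{c,i,j} \, \frac{\partial \mathcal{L}}{\partial Y_{c,i,j}} .
```

The gradient with respect to $e_c$ sums over all spatial positions because one coefficient is shared by the whole channel.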
The training and testing speed is roughly times slower than that of the original model. However, when training LR models, we can initialize the LR model with a well-trained baseline model and do not need to train the model from scratch. This reduces the total training time.
With these structural augmentations, LR enables the bottom layers in an existing model to be aware of the current classification prediction in the top layers. Then features are emphasized adaptively in the following iteration. In this way, the network is able to distinguish between confusing categories and yield better classification performance.
4.1 Overall Settings
We evaluate the performance of the LR algorithm on four benchmark datasets for image classification: CIFAR-100 krizhevsky2009learning , CIFAR-10 krizhevsky2009learning , MNIST-background-image larochelle2007empirical and ILSVRC-2012 ILSVRC15 . Four pre-trained CNN models are employed as the baseline models which include NIN DBLP:journals/corr/LinCY13 , R-CNN liang2015recurrent , LeNet lecun1998gradient , VGG-Net simonyan2014very . We implement the LR algorithm on the Caffe platform jia2014caffe .
Throughout the experiments, we fix the step of recursive feedback as . On CIFAR-100, we also report the performance of the LR algorithm with in order to investigate the effect of on the final performance. Batch size is fixed as on CIFAR-10, CIFAR-100, MNIST-background-image and on ILSVRC-2012. The initial learning rate is set to on CIFAR-100, CIFAR-10, MNIST-background-image, and on ILSVRC-2012. The momentum is set as in all the experiments. L2 weight decay is used for regularization. The weight decay coefficient is set to in all experiments. No weight decay is applied to any bias term. None of these hyperparameters is particularly tuned, and the dropout rates stay the same as in the three publicly released models.
CIFAR-100 is a widely used benchmark dataset for image classification. There are in total color images of categories in the dataset. All the samples are split into for training and for testing. The size of images in CIFAR-100 is . NIN has been proven to be a successful CNN structure on CIFAR-100 DBLP:journals/corr/LinCY13 ; lee2015deeply . We follow the same image pre-processing procedure used in NIN, i.e. global contrast normalization and ZCA whitening.
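The two pre-processing steps named above can be sketched in NumPy as follows. This is a generic illustration, not the exact NIN pipeline: implementations of global contrast normalization differ in their scaling constants, and the `eps` values and function names here are assumptions.

```python
import numpy as np

def global_contrast_normalize(images, eps=1e-8):
    """Per-image normalization. images: (N, D) array of flattened images."""
    centered = images - images.mean(axis=1, keepdims=True)      # zero mean per image
    scale = np.sqrt((centered ** 2).mean(axis=1, keepdims=True)) # per-image std
    return centered / np.maximum(scale, eps)

def fit_zca(data, eps=1e-5):
    """Estimate the ZCA whitening transform from training data of shape (N, D)."""
    mean = data.mean(axis=0)
    centered = data - mean
    cov = centered.T @ centered / data.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)       # cov is symmetric
    # Rotate to the eigenbasis, rescale each direction, rotate back.
    whitener = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return mean, whitener

def apply_zca(data, mean, whitener):
    return (data - mean) @ whitener
```

The whitening statistics would be estimated on the training split only and then applied unchanged to the test split.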
We conduct three sets of experiments with different settings to evaluate the proposed LR algorithm. In the first experiment, we train a vanilla NIN model without data augmentation as the baseline. There are convolution layers with kernel size , each followed by convolution layers with kernel size . Then we train an LR-CNN with the Learning-with-rethinking algorithm. The overall pipeline is the same as the one shown in Figure 1. Three emphasis layers are added after each convolution layer with the kernel size of . And corresponding feedback layers are added to produce emphasis vectors.
In the second experiment, we train a new baseline model termed as LNIN. LNIN differs from NIN in the non-linear rectification unit. LNIN uses the leaky-ReLU to replace ReLU in NIN. In this setting, we train the LNIN model without data augmentation. It turns out that Leaky-ReLU is a more effective non-linearity function on CIFAR-100 dataset than ReLU. We then train an LR-LNIN with the LR algorithm based on this LNIN baseline model, following the same procedure as in the first experiment.
In the third experiment, we use the pre-trained LNIN with data augmentation as the baseline model, named LNIN-aug. Comparison with this baseline validates the effectiveness of our algorithm in further boosting models that already perform better. As for data augmentation, instead of using the heavy data augmentation used in sparse-cnn graham2014spatially , we only use horizontal reflection. During training, we randomly flip the input image. In the test phase, the model makes predictions on both the original image and its mirror. The final classification is given by simply averaging the two predicted posterior probabilities.
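The test-time procedure described above can be sketched as follows. The function name, the `predict_fn` callback and the `(C, H, W)` layout are assumptions made for this illustration; `predict_fn` stands for any model that maps an image to a posterior vector.

```python
import numpy as np

def predict_with_flip(predict_fn, image):
    """Average the posteriors of an image and its horizontal mirror.

    predict_fn: callable mapping a (C, H, W) image to a posterior vector
    image:      array of shape (C, H, W)
    """
    mirrored = image[:, :, ::-1]   # reverse the width axis (horizontal reflection)
    return 0.5 * (predict_fn(image) + predict_fn(mirrored))
```

Since both inputs yield valid posteriors, their average is again a valid probability vector.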
|Model||No. of Param.||Error (%)|
|ReLU, without data augmentation|
|Leaky ReLU, without data augmentation|
|Leaky ReLU, with data augmentation|
Table 1 shows the comparison results of the three experiments. For a fair comparison, all baseline models are further trained for the same number of epochs as the LR models, and the corresponding models are referred to with the prefix “ft-”. We pre-trained each baseline model for epochs, and fine-tuned with Learning-with-Rethinking for another epochs. As shown in Table 1, in all three experiments, models trained with the LR algorithm outperform not only the pre-trained baseline models but also the further trained baseline models. Comparing the baseline models with their corresponding further trained models, we can see that further training alone brings only minor improvements, whereas further training with LR effectively improves the classification accuracy by nearly .
We have also compared the performance of LR-LNIN-aug models with different values. As shown in Figure 4 (a), with a larger the training procedure still works as expected, and the training loss is decreasing with the same . However, we can observe from Figure 4 (b) that the performance converges after on the CIFAR-100 dataset, and a larger is not really necessary. Besides, a large makes training more costly, since we need times more computation than the baseline model.
|Model||Input Size||No. of Param.||Testing Error(%)|
|without data augmentation|
|with data augmentation|
|LR-RCNN-128 liang2015recurrent||32||1.19M +0.05M||30.72|
We then compare our model with some state-of-the-art models of similar depth and model size. Table 2 shows the comparison results. Among the existing models in Table 1, NIN DBLP:journals/corr/LinCY13 , DSN lee2015deeply and LR-LNIN have comparable network depth and parameter number. RCNN liang2015recurrent models are much deeper. Maxout goodfellow2013maxout and dasNet stollenga2014deep employ more parameters. As for the DeepCNet and DeepCNiN graham2014spatially , input images are padded with zeros to ones, and the deep network models have million and million parameters respectively, which are many more than those in our model.
We have also evaluated our LR algorithm with RCNN-128 liang2015recurrent as a baseline model. Four weight layers are inserted after every recurrent convolution layer. Because Liang and Hu liang2015recurrent did not release their Caffe implementation of the RCNN-128 model on the CIFAR-100 dataset, we cannot reproduce the reported accuracy and have to use our re-implemented model (marked with ). The RCNN-128 model is further trained for the same number of epochs as the LR-RCNN-128 to make a fair comparison. As shown in Table 2, our proposed model surpasses all the other models with moderate network depth and number of parameters. Our LR-LNIN-aug model is only slightly inferior to DeepCNiN, which employs many more parameters. It is worth noting that our model beats DeepCNet when data augmentation is used, although our model only uses of the parameters of DeepCNet and only horizontal flip data augmentation.
CIFAR-10 is a dataset with the same image size and number of images as CIFAR-100. But its images are only from 10 categories. With fewer categories, the number of extra parameters introduced along with the feedback layer is only of the number in CIFAR-100. We evaluate the LR algorithm both with and without data augmentation. is fixed to 2. All baseline models are further trained for the same epochs as training the LR models, and corresponding models are named with a prefix “ft-”. As can be seen from Table 3, the experimental results show that models trained with the LR algorithm not only outperform the pre-trained baseline model, but also outperform the further trained baseline models.
|Model||No. of Param.||Error(%)|
|Leaky ReLU, without data augmentation|
|Leaky ReLU, with data augmentation|
|Model||No. of Param.||Testing Error(%)|
|without data augmentation|
|with data augmentation|
We compare our model with some state-of-the-art models on CIFAR-10 as shown in Table 4. Our LR algorithm has improved the baseline RCNN-128 model with and without data augmentation. And our LR-RCNN-128 model outperforms the state-of-the-art models with a similar number of parameters and depth.
MNIST-background-image is a variant of the popular MNIST digits dataset larochelle2007empirical . The gray digit image is surrounded with a gray image patch as background. It is more challenging than the original MNIST dataset. There are training images and testing images. The baseline LeNet model consists of two convolutional layers and one fully connected layer. The detailed structure of the LeNet model is shown in Table 5. We then train an LR-LeNet with the “Learning with Rethinking” algorithm. Two emphasis layers are added, one after each convolution layer, and the corresponding feedback layers are added to produce the emphasis vectors. We train the baseline model for epochs, and then train an LR-LeNet model for another epochs. Training the baseline LeNet model for another epochs does not give a better performance. The comparison results with other algorithms are shown in Table 6. The baseline LeNet model outperforms other algorithms by a large margin, and our LR-LeNet model reduces the error rate by nearly . To the best of our knowledge, LR-LeNet achieves the highest accuracy among all the reported results on the MNIST-background-image dataset.
Benefiting from the small number of feature channels and categories, we are able to make a qualitative analysis of the LR algorithm. We conduct visualization analysis on samples in categories 7 and 9, which are commonly confused with each other. Several typical confusing images are shown in the left part of Figure 5. In Figure 6, we visualize their emphasis vectors in the second emphasis layer. As shown in Figure 6 (a), when the confidence of the top-1 candidate is higher, there are clear patterns in the emphasis vectors. When the top-1 confidence is lower, i.e. the predicted confidences of categories 7 and 9 are at the same level, the emphasis vectors of both categories are similar, as shown in Figure 6 (b). Emphasis vectors mainly enhance those features that increase the prediction confidence of the correct category. Therefore, in Figure 6 (a), the enhanced features differ for different categories. When the predicted confidences of categories 7 and 9 are at the same level as shown in Figure 6 (b), the emphasis vectors focus on enhancing the features which are the most beneficial in distinguishing these two categories, i.e. the th channel, and suppressing the features with weaker discriminability, i.e. the th channel. We further visualize the corresponding feature maps in the middle row of Figure 7. The receptive field of the maximum value in the th feature map is marked with red rectangles in the input images. It seems that the th feature map mostly responds to the left part of a “blob”. We also visualize the feature maps of the th channel in the bottom row of Figure 7, which shows that this channel responds to the top part of a “blob”, or rather, a curved horizontal line.
4.5 ILSVRC 2012
The ILSVRC 2012 dataset is much larger than CIFAR-100, CIFAR-10 and MNIST-background-image. There are over 1.2 million color images in the training set, and 50k color images in the validation set. Top-5 accuracy on the validation set is used as the evaluation metric. The VGG-Net is one of the top-performing models on this dataset. The pre-trained VGG models are the de facto basic component in many papers.
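For reference, the top-k accuracy metric can be computed as in the following minimal sketch: a sample counts as correct if its true label is among the k classes with the highest predicted scores. The function name `top_k_accuracy` and the toy scores are illustrative assumptions, not part of the evaluation code used in the experiments.

```python
def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score; keep the first k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy example with 3 samples and 6 classes (made-up scores).
scores = [
    [0.05, 0.10, 0.50, 0.15, 0.12, 0.08],  # true class 2 ranks first
    [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],  # true class 5 ranks last
    [0.10, 0.40, 0.05, 0.30, 0.10, 0.05],  # true class 3 ranks second
]
labels = [2, 5, 3]

top1 = top_k_accuracy(scores, labels, k=1)
top5 = top_k_accuracy(scores, labels, k=5)
print(top1, top5)
```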
There are two released VGG-Net models, one with 16 layers and the other with 19 layers, referred to here as VGG16 and VGG19 respectively. Both of them are pre-trained on the training data of ILSVRC 2012 with data augmentation. We use VGG16 as our baseline model because it costs less training time than VGG19 while achieving similar performance on ILSVRC 2012. An emphasis layer is added after each of the convolution layers. Also, corresponding feedback layers are added into the VGG16 network. This leads to extra parameters, amounting to only of the total number of parameters. We trained the LR-VGG16 model for epochs. VGG16 is a well-trained model, and further training the baseline VGG16 model barely makes any improvement.
Detailed comparisons are shown in Table 7. This result on ImageNet validates the effectiveness of the LR algorithm on further boosting state-of-the-art models and shows its potential in large scale object classification tasks.
[Table 7: column groups Single Crop | Multi Crop]
We here provide some analyses on the classification results of the original VGG16 and LR-VGG16. As shown in Figure 8, the average posterior probabilities of the top-1 prediction increase by on both the training set and the validation set. The improvement of the top-k posterior probabilities demonstrates that LR-CNN is more “certain” of its prediction. This shows that our algorithm can boost the model to make it fit the training data better, and thus learn more information from training samples. By distinguishing confusing categories, the LR algorithm can improve the performance. The analysis of top-k accuracy in Figure 8 also supports this observation. The top-1 accuracy of LR-VGG16 surpasses that of the VGG16 model by more than on the training set, and a consistent improvement of is shown on the validation set. These statistical observations validate the effectiveness of “Learning with Rethinking” on further boosting state-of-the-art models. Some examples of image classification results are shown in Figure 9 with a comparison between VGG16 and LR-VGG16. The results show that LR is more “certain” of its prediction (i.e. the entropy of the final predicted probabilities is much smaller).
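The two notions of “certainty” used in this analysis, the average top-1 posterior probability and the entropy of the predicted distribution, can be made concrete with a small sketch. The function names and the two toy sets of predictions below are illustrative assumptions; they are not outputs of VGG16 or LR-VGG16.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_top1_posterior(batch):
    """Average of the largest predicted probability over a batch."""
    return sum(max(probs) for probs in batch) / len(batch)


def mean_entropy(batch):
    """Average prediction entropy over a batch."""
    return sum(entropy(probs) for probs in batch) / len(batch)


# Made-up predictions over 3 classes for the same 2 images.
less_certain = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2]]
more_certain = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]]

print(mean_top1_posterior(less_certain), mean_top1_posterior(more_certain))
print(mean_entropy(less_certain), mean_entropy(more_certain))
```

A model that is more “certain” in the sense used here shows a higher average top-1 posterior and a lower average entropy; a uniform prediction over C classes attains the maximum entropy, log C.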
In this paper, we propose a “Learning with Rethinking” algorithm for image recognition. The Learning with Rethinking algorithm feeds back posterior probability information from top layers to guide the bottom layers in their feature learning. Experiments on four benchmark datasets show that the Learning with Rethinking algorithm is able to further boost the well-established models with only a few parameters introduced. Particularly, experiments on benchmark datasets MNIST-background-image and ImageNet clearly demonstrate the advantage of the Learning with Rethinking algorithm in recognizing objects or categories with large inter-class similarity. Besides, our work also demonstrates that recurrently improving performance with feedback information is a promising direction.
This work was supported by the National Basic Research Program of China (973 program) under Grant No. 2013CB329403 and the National Natural Science Foundation of China under Grant No. 61471214.
- (1) R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, IEEE, 2014, pp. 580–587.
- (2) J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, IEEE, 2015, pp. 3431–3440.
- (3) Y. Sun, X. Wang, X. Tang, Deep learning face representation from predicting 10,000 classes, in: Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, IEEE, 2014, pp. 1891–1898.
- (4) P. C. Ng, S. Henikoff, Sift: Predicting amino acid changes that affect protein function, Nucleic acids research 31 (13) (2003) 3812–3814.
- (5) D.-C. He, L. Wang, Texture unit, texture spectrum, and texture analysis, Geoscience and Remote Sensing, IEEE Transactions on 28 (4) (1990) 509–512.
- (6) N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Computer Vision and Pattern Recognition (CVPR), 2005 IEEE Conference on, IEEE, 2005, pp. 886–893.
- (7) S. Marčelja, Mathematical description of the responses of simple cortical cells*, JOSA 70 (11) (1980) 1297–1300.
- (8) B. B. Le Cun, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, Handwritten digit recognition with a back-propagation network, in: Advances in Neural Information Processing Systems, Citeseer, 1990.
- (9) Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
- (10) D. Ciresan, U. Meier, J. Schmidhuber, Multi-column deep neural networks for image classification, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 3642–3649.
- (11) F. Yin, Q.-F. Wang, X.-Y. Zhang, C.-L. Liu, Icdar 2013 chinese handwriting recognition competition, in: Document Analysis and Recognition (ICDAR), 2013 12th International Conference on, IEEE, 2013, pp. 1464–1470.
- (12) A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
- (13) R. Girshick, Fast r-cnn, in: Computer Vision (ICCV), 2015 IEEE International Conference on, IEEE, 2015, pp. 1440–1448.
- (14) S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems, 2015, pp. 91–99.
- (15) Y. Taigman, M. Yang, M. Ranzato, L. Wolf, Deepface: Closing the gap to human-level performance in face verification, in: Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, IEEE, 2014, pp. 1701–1708.
- (16) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, IEEE, 2015, pp. 1–9.
- (17) N. Zhang, J. Donahue, R. Girshick, T. Darrell, Part-based r-cnns for fine-grained category detection, in: Computer Vision–ECCV 2014, Springer, 2014, pp. 834–849.
- (18) T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, Z. Zhang, The application of two-level attention models in deep convolutional neural network for fine-grained image classification, in: Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, IEEE, 2015, pp. 842–850.
- (19) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: Computer Vision and Pattern Recognition (CVPR), 2009 IEEE Conference on, IEEE, 2009, pp. 248–255.
- (20) M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: Computer Vision–ECCV 2014, Springer, 2014, pp. 818–833.
- (21) M. Lin, Q. Chen, S. Yan, Network in network, CoRR abs/1312.4400.
- (22) K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556.
- (23) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, IEEE, 2016, pp. 770–778.
- (24) D. Cireşan, U. Meier, J. Masci, J. Schmidhuber, A committee of neural networks for traffic sign classification, in: Neural Networks (IJCNN), The 2011 International Joint Conference on, IEEE, 2011, pp. 1918–1921.
- (25) P. Sermanet, Y. LeCun, Traffic sign recognition with multi-scale convolutional networks, in: Neural Networks (IJCNN), The 2011 International Joint Conference on, IEEE, 2011, pp. 2809–2813.
- (26) C. Barat, C. Ducottet, String representations and distances in deep convolutional neural networks for image classification, Pattern Recognition 54 (2016) 104–115.
- (27) L. Cao, X. Zhang, W. Ren, K. Huang, Large scale crowd analysis based on convolutional neural network, Pattern Recognition 48 (10) (2015) 3016–3024.
- (28) Y. Gong, L. Wang, R. Guo, S. Lazebnik, Multi-scale orderless pooling of deep convolutional activation features, in: Computer Vision–ECCV 2014, Springer, 2014, pp. 392–407.
- (29) J. Yim, H. Jung, B. Yoo, C. Choi, D. Park, J. Kim, Rotating your face using multi-task deep neural network, in: Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, IEEE, 2015, pp. 676–684.
- (30) B. Graham, Spatially-sparse convolutional neural networks, arXiv preprint arXiv:1409.6070.
- (31) K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, in: Computer Vision (ICCV), 2015 IEEE International Conference on, IEEE, 2015, pp. 1026–1034.
- (32) K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, in: Computer Vision–ECCV 2014, Springer, 2014, pp. 346–361.
- (33) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15 (1) (2014) 1929–1958.
- (34) L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, R. Fergus, Regularization of neural networks using dropconnect, in: Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 1058–1066.
- (35) J. Tompson, R. Goroshin, A. Jain, Y. LeCun, C. Bregler, Efficient object localization using convolutional networks, in: Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, 2015, pp. 648–656.
- (36) W. Yang, L. Jin, D. Tao, Z. Xie, Z. Feng, Dropsample: A new training method to enhance deep convolutional neural networks for large-scale unconstrained handwritten chinese character recognition, Pattern Recognition 58 (2016) 190–203.
- (37) E. A. Buffalo, P. Fries, R. Landman, H. Liang, R. Desimone, A backward progression of attentional effects in the ventral stream, Proceedings of the National Academy of Sciences 107 (1) (2010) 361–365.
- (38) C. D. Gilbert, M. Sigman, Brain states: top-down influences in sensory processing, Neuron 54 (5) (2007) 677–696.
- (39) J. Hupe, A. James, B. Payne, S. Lomber, P. Girard, J. Bullier, Cortical feedback improves discrimination between figure and background by v1, v2 and v3 neurons, Nature 394 (6695) (1998) 784–787.
- (40) A. Bastos, V. Litvak, R. Moran, C. Bosman, P. Fries, K. Friston, A dcm study of spectral asymmetries in feedforward and feedback connections between visual areas v1 and v4 in the monkey, NeuroImage 108 (2015) 460–475.
- (41) V. A. Lamme, P. R. Roelfsema, The distinct modes of vision offered by feedforward and recurrent processing, Trends in neurosciences 23 (11) (2000) 571–579.
- (42) D. Wyatte, T. Curran, R. O’Reilly, The limits of feedforward vision: Recurrent processing promotes robust object recognition when objects are degraded, Journal of Cognitive Neuroscience 24 (11) (2012) 2248–2261.
- (43) A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, J. Schmidhuber, A novel connectionist system for unconstrained handwriting recognition, Pattern Analysis and Machine Intelligence, IEEE Transactions on 31 (5) (2009) 855–868.
- (44) J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, T. Darrell, Long-term recurrent convolutional networks for visual recognition and description, in: Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, IEEE, 2015, pp. 2625–2634.
- (45) D. Eigen, J. Rolfe, R. Fergus, Y. Lecun, Understanding deep architectures using a recursive convolutional network, in: International Conference on Learning Representations (ICLR2014), CBLS, April 2014, 2014.
- (46) M. Liang, X. Hu, Recurrent convolutional neural network for object recognition, in: Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, IEEE, 2015, pp. 3367–3375.
- (47) J. Kim, J. Kwon Lee, K. Mu Lee, Deeply-recursive convolutional network for image super-resolution, in: Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, IEEE, 2016, pp. 1637–1645.
- (48) Y. Sun, X. Wang, X. Tang, Deep convolutional network cascade for facial point detection, in: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, IEEE, 2013, pp. 3476–3483.
- (49) H. Li, Z. Lin, X. Shen, J. Brandt, G. Hua, A convolutional neural network cascade for face detection, in: Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, IEEE, 2015, pp. 5325–5334.
- (50) R. Timofte, R. Rothe, L. Van Gool, Seven ways to improve example-based single image super resolution, in: Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, IEEE, 2016, pp. 1865–1873.
- (51) H. Qin, J. Yan, X. Li, X. Hu, Joint training of cascaded cnn for face detection, in: Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, IEEE, 2016, pp. 3456–3465.
- (52) M. F. Stollenga, J. Masci, F. Gomez, J. Schmidhuber, Deep networks with internal selective attention through feedback connections, in: Advances in Neural Information Processing Systems, 2014, pp. 3545–3553.
- (53) D. E. Rumelhart, G. E. Hinton, R. J. Williams, Learning internal representations by error propagation, Tech. rep., DTIC Document (1985).
- (54) C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, Z. Tu, Deeply-supervised nets, in: AISTATS, Vol. 2, 2015, p. 6.
- (55) S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.
- (56) A. Graves, et al., Supervised sequence labelling with recurrent neural networks, Vol. 385, Springer, 2012.
- (57) A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images, (2009).
- (58) H. Larochelle, D. Erhan, A. Courville, J. Bergstra, Y. Bengio, An empirical evaluation of deep architectures on problems with many factors of variation, in: Proceedings of the 24th International Conference on Machine Learning, ACM, 2007, pp. 473–480.
- (59) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision (IJCV) (2015) 1–42, doi:10.1007/s11263-015-0816-y.
- (60) Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast feature embedding, in: Proceedings of the 22nd ACM international conference on Multimedia, ACM, 2014, pp. 675–678.
- (61) I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. C. Courville, Y. Bengio, Maxout networks, Proceedings of the 30th International Conference on Machine Learning (ICML-13) 28 (2013) 1319–1327.
- (62) Q. Wang, J. Zhang, S. Song, Z. Zhang, Attentional neural network: Feature selection using cognitive feedback, in: Advances in Neural Information Processing Systems, 2014, pp. 2033–2041.