I. Introduction
Most conventional object detectors, such as the cascade detector [1] and deformable part-based models [2, 3], adopt the sliding-window strategy, which means classifiers are applied at all possible locations over the entire image. They are obviously inefficient, because computational resources are distributed equally between the salient regions and the vast majority of background regions. Recently, neural network methods have achieved great success in object detection. Region-proposal-based methods
[4, 5, 6] first extract many salient regions, from which the bounding boxes are predicted. But too many proposals slow down the detector. Moreover, the precision is limited by the performance of the region proposal method. Single-shot approaches [7, 8] have achieved state-of-the-art precision and speed by directly utilizing regression and grid strategies. However, they find it difficult to detect small objects or those with unusual aspect ratios. Besides, the precision of the one-shot process cannot easily be improved by recurrent operations, whereas humans adopt the simple strategy of glimpsing for more steps to obtain more accurate locations.

Humans localize and recognize an object through several glimpses, each of which contains only the details of a small region plus rough contextual information [9], instead of regularly scanning the scene. In addition, contextual information is significant for efficiently localizing and recognizing objects in real-world environments [10, 11]. In particular, the detection of small objects lacking appearance information depends more on context [12]; for example, a distant forward car on a highway can be localized with the help of the sky, the ground, the lane lines, and other cars. Multi-fixation glimpses are an effective and flexible way to model the relationships between objects and their context.
We thus propose a novel structure named the Recurrent Attention to Detection and Classification Network (RADCN), shown in Fig. 1. It detects and classifies the object by recurrent glimpses, and contains three modules at each step: an attentional representation extractor (ARE), an information fusion network (IFN), and multi-task actions (MTA). Given the input image and the fixation point, the ARE mimics the retina to extract multi-scale local regions of detailed appearance and rough context. Following this, it applies Convolutional Neural Network (CNN) learners, acting as the V1 area of the brain, to learn effective appearance representations. In the IFN, appearance and attentional location information are fused into a local observation by fully-connected layers. Then 3-layer LSTMs are used to simulate the memory ability of the cerebral cortex, which can fuse multi-fixation local observations and generate a vector of hidden states. The MTA receives this vector and executes three tasks: the detection and classification of the object, and the prediction of the next fixation point. These tasks are carried out by fully-connected layers together with limitations on the value range.
For training, we propose a multi-task loss function to optimize the network end-to-end, and combine stochastic and object-awareness (SA) strategies for learning fixation prediction. The stochastic strategy (S) samples the next fixation randomly from a distribution, which enlarges the diversity of samples and yields better results as early as possible via a policy-reward method. The object-awareness strategy (A) constrains the last fixation to be close to the object under an L2 criterion, which is beneficial to the stability of optimization and prediction. Besides, we build a real-world dataset named FCAR, which provides annotations of forward cars and can be utilized to train and evaluate detectors of objects of interest. We will open-source the code of our proposed method.
II. Related Work
Deep learning methods are popular in object detection and have achieved great success recently. To avoid sliding windows, Girshick et al. proposed to select salient regions and classify them via a CNN [4], and the region-based method was later improved in terms of speed and precision [5, 13, 6]. Nevertheless, the speed is still slow due to the huge number of regions, and the precision is limited by the performance of the region proposal method. One-shot approaches take advantage of regression and grid strategies to directly predict the locations and categories of objects, which is faster than the region-based methods and achieves higher precision [7, 8]. But they have difficulty locating small objects without making good use of contextual information. Furthermore, the rough localization leads to instability in object detection and classification.
The significance of context for localizing and recognizing objects has been studied in neural computing and cognitive science [10, 11]. Bell et al. showed that context and multi-scale representations provide great improvement in detecting small objects [12]. To use computational resources rationally and extract context clues effectively, Larochelle et al. proposed fovea-like image capture [14], which is similar to the fovea but too complicated for practical use. Besides, stochastic latent variables have been utilized in neural networks by Tang et al. and Rezende et al. [15, 16], which can imitate the fixation strategy of human attention. Other researchers optimized networks with stochastic latent variables by reinforcement learning methods [17, 18]. More complex recurrent networks have been applied to simulate the attention of the brain [19]. Many works use recurrent or multi-stage networks to learn the attentional position [20, 21, 22, 23]. Mnih et al. proposed the Recurrent Attention Model (RAM) and presented the stochastic fixation strategy [18]. The Deep Recurrent Attentive Writer (DRAW) was proposed by Gregor et al. to learn the latent distribution of samples, and they also applied DRAW to classification tasks [24]. However, none of these methods detects and classifies objects jointly with a recurrent network, and the hybrid strategy of stochastic and object-awareness fixation has not been discussed. Caicedo et al. utilized a reinforcement learning method to localize objects with a 9-action policy [25], but frequent resizing causes too much deformation of objects and the locations are not precise; meanwhile, repeatedly processing the large image makes the method slow.

III. Recurrent Attention to Detection and Classification Network
We propose a novel recurrent network to jointly predict the category and bounding box of an object with related local observations. The overview structure named RADCN is illustrated in Fig. 1. Three main modules in the network are: Attentional Representations Extractor (ARE), Information Fusion Network (IFN) and Multitask Actions (MTA).
III-A. Attentional Representation Extractor
A rich collection of contextual associations is important for searching for objects in real-world environments [11]. Simulating the retina and the V1 area, we combine retina-like capture and CNN learners, which together provide attentional representations with abundant contextual information and significant spatial structure. The network of the proposed Attentional Representation Extractor is shown in Fig. 2.
Retina-like Capture: Retina-like capture [9] is the first module of our proposed network, and it controls the source of the appearance information flow. Centering on the given fixation point l_t, it generates several scaled patches from the entire image I at step t with specific parameters. Given the number of scales S, the size w_0 of the minimum scale, and the scale factor k, we crop the i-th patch with size k^{i-1} w_0, and these patches are resized to a unified size. The mapping can be presented as

x_t^i = Retina(I, l_t; S, w_0, k),  i = 1, ..., S,  (1)

in which the parameter set (S, w_0, k) is hand-crafted with a little prior knowledge of the input size. The fixation point at the first step is sampled from a uniform distribution.
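As a concrete illustration, the retina-like capture can be sketched in numpy as follows. The function name, the default parameter values, and the nearest-neighbour "resize" by striding are our own stand-ins, not the paper's exact settings:

```python
import numpy as np

def retina_capture(image, fixation, num_scales=3, min_size=24, scale_factor=2):
    """Crop num_scales square patches centered on the fixation point.

    The i-th patch has side min_size * scale_factor**i; every patch is then
    resized (here: nearest-neighbour subsampling) to min_size x min_size, so
    the innermost patch keeps fine detail and outer patches keep coarse context.
    """
    cy, cx = fixation
    patches = []
    for i in range(num_scales):
        side = min_size * scale_factor ** i
        half = side // 2
        # Pad so crops near the border stay square.
        padded = np.pad(image, ((half, half), (half, half)), mode="constant")
        patch = padded[cy:cy + side, cx:cx + side]
        # Resize to the unified size by striding (a stand-in for interpolation).
        step = scale_factor ** i
        patches.append(patch[::step, ::step])
    return patches
```

All patches share one unified size, so a single CNN-learner architecture can process every scale.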
CNN Learners: A group of CNN learners is applied to learn effective representations of the scaled patches x_t^i. Each CNN learner consists of three convolutional layers with parameter set Θ_i and a fully-connected layer with weight W_i and bias b_i, where each convolutional layer involves a convolution operation, max-pooling (except for the second layer), and ReLU activation. The output of the i-th CNN learner is

o_t^i = FC(CNN(x_t^i; Θ_i); W_i, b_i).  (2)
The parameters of the CNN learners can be trained for the hybrid task of detection and classification, which is a flexible way to fit various tasks.
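A single CNN-learner stage (convolution, ReLU, max-pooling) can be sketched in plain numpy; the kernel count and sizes below are illustrative, not the paper's configuration:

```python
import numpy as np

def conv_relu_pool(x, kernels, pool=2):
    """One CNN-learner stage: valid convolution, ReLU, then max-pooling.

    x: (H, W) single-channel patch; kernels: (K, kh, kw) filter bank.
    """
    K, kh, kw = kernels.shape
    H, W = x.shape
    oh, ow = H - kh + 1, W - kw + 1
    out = np.empty((K, oh, ow))
    for k in range(K):
        for i in range(oh):
            for j in range(ow):
                out[k, i, j] = np.sum(x[i:i + kh, j:j + kw] * kernels[k])
    out = np.maximum(out, 0.0)                      # ReLU activation
    # Non-overlapping max-pool (truncating ragged edges).
    ph, pw = oh // pool, ow // pool
    pooled = out[:, :ph * pool, :pw * pool]
    return pooled.reshape(K, ph, pool, pw, pool).max(axis=(2, 4))
```

Stacking three such stages and a fully-connected layer would mirror the learner described above.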
III-B. Information Fusion Network
Similar to the human visual system, the position of fixation is fused with appearance features in our proposed method. We concatenate the attentional representations of multiple scales and map them into an appearance descriptor a_t via a fully-connected layer. Meanwhile, the coordinates of the fixation point are also projected into a location descriptor p_t with the same dimension as a_t.
Appearance & Location Fusion: Several operations followed by a fully-connected layer can be utilized to fuse the appearance and location descriptors, such as addition (a_t + p_t), element-wise multiplication (a_t ⊙ p_t), and concatenation ([a_t, p_t]). The fusion descriptor is formulated as

r_t = ReLU(FC(a_t ⊕ p_t; θ_f)),  (3)

in which θ_f is the set of parameters, including the weight and bias of the fully-connected layer.
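The three fusion variants can be written compactly as below; all weight names are ours, and the projections to a common dimension are assumed rather than taken from the paper:

```python
import numpy as np

def fuse(appearance, location, W_a, W_l, W_f, b_f, mode="add"):
    """Fuse an appearance vector and a 2-D fixation location (Eq. (3) style).

    Both inputs are first projected to a common dimension; the fused vector
    is a fully-connected layer over their combination.
    """
    a = W_a @ appearance          # appearance descriptor
    p = W_l @ location            # location descriptor, same dimension as a
    if mode == "add":
        z = a + p
    elif mode == "mul":
        z = a * p                 # element-wise product
    else:                         # "concat" (W_f must accept 2x the width)
        z = np.concatenate([a, p])
    return np.maximum(W_f @ z + b_f, 0.0)   # ReLU(FC(z))
```

The experiments in Section V-C find the three modes comparable, which is why addition, the cheapest, is a reasonable default.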
Multi-Fixation Fusion: It is essential to combine the local observation vectors r_1, ..., r_T over multiple steps, where T is the number of fixations. We thus take 3-layer LSTMs to learn the dependencies and contributions of these observations, which perform similarly to the cerebral cortex remembering many general rules. The hidden state of the last layer is taken as the output h_t of the LSTMs. Specifically, the fusion over multiple steps is represented as

h_t = LSTM(r_t, H_{t-1}; θ_h),  (4)

where θ_h denotes the parameters and H_{t-1} is the hidden state of all LSTM layers at the previous step. Different from the conventional method [3] and the Fast R-CNN detector [5], the multi-fixation fusion provides a flexible way to recognize objects from local to global.
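A minimal single-layer numpy LSTM (the paper stacks three) shows how the per-fixation observations are folded into one summary state; gate layout and shapes are our own convention:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_fuse(observations, W, b, hidden):
    """Fold a sequence of per-fixation observation vectors into one state.

    W maps the concatenation [h_prev, x_t] to the four stacked gates
    (input, forget, output, candidate); b is the gate bias. The final
    hidden state h_T summarises all fixations, as in Eq. (4).
    """
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in observations:
        gates = W @ np.concatenate([h, x]) + b
        i, f, o, g = np.split(gates, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)          # memory-cell update
        h = o * np.tanh(c)                  # new hidden state
    return h
```

Because the cell state carries over between glimpses, later fixations can refine, rather than overwrite, the evidence gathered earlier.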
III-C. Multi-task Actions
Our proposed network performs three main tasks: detection, classification, and prediction of the next fixation point, all of which take the multi-fixation fusion vector h_t as their input.
Detection & Classification: The object geometric parameters include the left-up position and the size, along with a confidence score. They are regressed by a fully-connected layer with parameters W_d, b_d, and clip operations with hyper-parameters are used to ensure the numerical range. The probability vector over all categories is obtained by a fully-connected layer with parameters W_c, b_c and normalized by a softmax operation. Specifically, these can be expressed as

(x, y, w, h, s) = clip(FC(h_T; W_d, b_d)),  p = softmax(FC(h_T; W_c, b_c)).  (5)
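The two output heads reduce to a clipped linear map and a softmax; the numerically stable softmax and the clipping range below are our assumptions:

```python
import numpy as np

def multitask_heads(h, W_box, b_box, W_cls, b_cls, box_min=0.0, box_max=1.0):
    """Detection and classification heads on the fused state h (Eq. (5) style).

    The box head regresses (x, y, w, h, score) and is clipped to a valid
    range; the class head is a softmax over categories.
    """
    box = np.clip(W_box @ h + b_box, box_min, box_max)   # left-up corner, size, score
    logits = W_cls @ h + b_cls
    logits = logits - logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()        # softmax
    return box, probs
```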
Prediction of the Fixation Point: We sample the next fixation point l_{t+1} from a Gaussian distribution with mean μ_t and standard deviation σ. A fully-connected layer with parameters W_μ, b_μ and sigmoid activation is utilized to estimate the mean. The sampling process can be presented as

l_{t+1} ~ N(μ_t, σ²),  μ_t = sigmoid(FC(h_t; W_μ, b_μ)),  (6)

where N denotes the Gaussian distribution and sigmoid is the logistic function. The weight parameters of the whole network can be trained end-to-end by a hybrid loss function.
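The stochastic fixation step is a few lines; the standard deviation is treated as a fixed hyper-parameter, and the weight names are ours:

```python
import numpy as np

def next_fixation(h, W_mu, b_mu, std, rng):
    """Sample the next fixation point as in Eq. (6).

    The mean is a sigmoid-activated linear map of the hidden state; the
    point is drawn from an isotropic Gaussian around that mean.
    """
    mu = 1.0 / (1.0 + np.exp(-(W_mu @ h + b_mu)))   # mean, squashed by sigmoid
    return rng.normal(mu, std), mu
```

Returning the mean alongside the sample is convenient for the REINFORCE-style loss in Section IV-B, which needs the sampled point's log-density.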
IV. Hybrid Loss Function
To complete the multiple tasks, we minimize a hybrid loss function

L = L_det + L_cls + L_fix,  (7)

where L_det is the detection loss, L_cls denotes the cross-entropy loss of classification, and L_fix represents the loss of the fixation points.
IV-A. Detection and Classification Loss
Detection: We suppose that an image contains only one object of interest, with true left-up position P* and size (w*, h*). The prediction P̂ of the position at the last step is regarded as a random vector conditioned on the ground truth, which can be optimized by maximizing the logarithmic likelihood. In addition, a loss term is required to optimize the prediction of the size. We take the Intersection over Union (IoU) between the predicted bounding box B̂ and the ground truth B* to construct the loss related to the size prediction. Experimentally, we limit the range of the IoU mapping so that its logarithm stays bounded. The loss function of the detection task is given by

L_det = -log p(P* | P̂) - log IoU(B̂, B*).  (8)

The error of the predicted position vector is modeled as a Gaussian with zero mean, namely

p(P* | P̂) = N(x* - x̂; 0, σ_d²) N(y* - ŷ; 0, σ_d²),  (9)

where N represents the Gaussian distribution, the x and y coordinates are assumed to be independent, and σ_d is the standard deviation. The maximum-likelihood term guides the adjustment of the weight parameters when the overlap between the prediction and the ground truth is zero; once the overlap is non-zero, maximizing the IoU term improves the precision of the predicted bounding box.
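The IoU computation and the two-term detection loss can be sketched as follows; the weighting, the variance σ, and the clipping floor are our reading of the text, not the paper's exact constants:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two (x, y, w, h) boxes (left-up corner)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def detection_loss(pred, truth, sigma=0.1, eps=1e-6):
    """Gaussian position log-likelihood plus an IoU term (Eq. (8) style).

    The likelihood term drives learning when boxes do not overlap; once
    they do, maximising the log of the (clipped) IoU sharpens the fit.
    """
    dx, dy = pred[0] - truth[0], pred[1] - truth[1]
    nll = (dx * dx + dy * dy) / (2.0 * sigma * sigma)   # -log Gaussian, up to a constant
    overlap = np.clip(iou(pred, truth), eps, 1.0)       # keep log bounded
    return nll - np.log(overlap)
```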
Classification: We use cross-entropy to measure the discrepancy between the predicted distribution p and the true distribution q over categories. The loss of the classification task is formulated as L_cls = -Σ_{c=1}^{C} q_c log p_c, where C is the number of categories, and p_c and q_c represent the c-th components of p and q respectively.
IV-B. Fixation Process Loss
Besides the losses for detection and classification, a loss related to the fixation points is also required to constrain the fixation prediction and to extract more effective local observations efficiently.
The policy-reward mechanism is utilized to optimize a decision process in reinforcement learning. In our method, the fixation policy is a Gaussian distribution with a learned mean. We define the cumulative reward R as the overlap between the predicted bounding box and the true one under the IoU mapping. The decision process over fixation points can thus be optimized by a policy-reward method [26], whose loss term can be represented as

L_s = -Σ_t log π(l_t | h_{t-1}) (R - b),  (10)

where π is the distribution of the sampled point l_t and b is a baseline estimating the expectation of R [27]. Minimizing this loss with the baseline yields a lower variance of the gradient estimation [18].

Humans focus their attention on the object they intend to recognize. To mimic this mechanism, we take an L2 loss to constrain the mean of the last fixation to be close to the center of the object. The whole loss of the fixation process is presented as

L_fix = L_s + ||μ_T - c*||² + (b - R)²,  (11)

where the baseline score b is optimized to estimate the reward under the L2 criterion. The total fixation loss contains both the stochastic and object-awareness strategies, and is called SA. The stochastic strategy (S) enlarges the diversity of samples and can enhance generalization ability, while the reward term is used to improve attentional efficiency. The object-awareness strategy (A) makes the last fixation close to the object, which leads to stability of prediction and optimization.
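A sketch of the hybrid SA loss, combining the REINFORCE surrogate, the L2 object-awareness term, and the baseline regression; the term weight lam and the policy std are our placeholders:

```python
import numpy as np

def fixation_loss(points, means, reward, baseline, last_mu, obj_center,
                  std=0.1, lam=1.0):
    """Hybrid stochastic + object-awareness fixation loss (Eqs. (10)-(11) style).

    The REINFORCE term scales the log-probability of the sampled fixations
    by the advantage (reward minus the learned baseline); the L2 term pulls
    the last fixation mean toward the object centre.
    """
    advantage = reward - baseline
    logp = sum(-np.sum((p - m) ** 2) / (2.0 * std ** 2)
               for p, m in zip(points, means))      # Gaussian log-density, up to a constant
    reinforce = -logp * advantage                   # policy-gradient surrogate
    awareness = lam * np.sum((last_mu - obj_center) ** 2)
    baseline_fit = (baseline - reward) ** 2         # score regressed to the reward
    return reinforce + awareness + baseline_fit
```

In practice the advantage would be treated as a constant (no gradient through it), which plain numpy cannot express; an autodiff framework would use `stop_gradient` there.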
V. Experiments
V-A. Datasets
Our experiments are carried out on three datasets: two variants of the MNIST dataset and a real-world dataset for car detection.
MSO & MSNO Datasets: We build a new dataset named the MNIST Scaled Object (MSO) dataset, in which each image contains a bounding box and a label for a single digit object. Images of the MSO dataset are generated in two steps: (1) resize the images of the MNIST dataset by random scales sampled from a uniform distribution over [0.3, 1.5]; (2) insert the resized digits into a dark background at random locations, ensuring that the digits are not truncated. To test the sensitivity to noise, we build another dataset named MNIST Scaled and Noised Object (MSNO) by adding 6 noise patches to each image of MSO. The noise patches are cropped randomly from the corresponding subset of the MNIST dataset.
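The two-step MSO recipe can be sketched directly; the canvas size and the nearest-neighbour scaling are our stand-ins for unspecified details:

```python
import numpy as np

def make_mso_image(digit, canvas=60, rng=None):
    """Place one randomly scaled digit on a dark canvas (MSO-style sample).

    digit: (h, w) grayscale array (e.g. a 28x28 MNIST image). Returns the
    canvas image and the ground-truth (x, y, w, h) box.
    """
    rng = rng or np.random.default_rng()
    scale = rng.uniform(0.3, 1.5)                      # step (1): random scale
    h = max(1, int(digit.shape[0] * scale))
    w = max(1, int(digit.shape[1] * scale))
    rows = np.arange(h) * digit.shape[0] // h          # nearest-neighbour resize
    cols = np.arange(w) * digit.shape[1] // w
    scaled = digit[np.ix_(rows, cols)]
    # Step (2): random location that keeps the digit fully inside the canvas.
    y = rng.integers(0, canvas - h + 1)
    x = rng.integers(0, canvas - w + 1)
    image = np.zeros((canvas, canvas), dtype=digit.dtype)
    image[y:y + h, x:x + w] = scaled
    return image, (x, y, w, h)
```

The MSNO variant would additionally paste small crops of other digits at random positions as clutter.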
FCAR Dataset: We further build a real-world dataset named the FCAR dataset for car detection. An image may contain more than one car, but the object of interest is only the one in the same lane as our car; therefore, the annotation of an image is the bounding box of the attentional car. The real-world images, taken by a fixed camera, contain abundant structural and contextual information about the scene, which is significant for the detection of the attentional object, especially small objects. The FCAR dataset consists of a training set and a test set.
V-B. Implementation Details
Initial Fixation Point: The initial fixation point is essential for the proposed network. We normalize the image coordinates to [-1, 1] and place the origin (0, 0) at the center of the image. The initial point is randomly sampled from a uniform distribution over a given range.
Parameter Settings: The structural parameters of our proposed RADCN are listed in Table I. The parameters of the ARE are adjusted according to the size of the input image, and all the CNN learners have the same structure but different parameters.
SubNet    ARE    CNN learners    Fusion Network    Actions
Params
V-C. Effects of the Structure Parameters
On the MSO and MSNO datasets, we explore the effects of the structure of the proposed RADCN with four groups of experiments, each of which changes one of the default structure parameters.
Initial range: Randomly selecting the initial point is a way to augment the dataset, because different initial points yield different local observations. We sample the initial point from a uniform distribution over a given range. Fig. 3 (1) shows the mAP values under different ranges. Without randomness, the precision on the MSNO dataset is clearly lower than that on the MSO dataset, which indicates that random initialization is beneficial for learning a model robust to cluttered backgrounds. Results with different initial points and fixation paths are shown in Supplementary 2.
Number of scales: Multiple scales provide more contextual information, which can guide the prediction of the next fixation point. We vary the number of scales to explore the effects of the multi-scale method. Fig. 3 (2) shows that 3-scale local regions are more effective at capturing context, achieving a 0.977 mAP value on the MSO dataset.
Glimpse steps: Glimpsing over multiple steps is another way to mine abundant contextual clues; the number of steps T is varied in our experiments. As shown in Fig. 3 (3), a single random glimpse (T = 1) is a bad choice. As the number of glimpses grows, the results improve, and the proposed RADCN achieves a high mAP value on the MSO dataset with enough steps. In conclusion, taking multiple steps is an essential way to improve performance. The attentional processes are illustrated in Supplementary 1.
Fusion methods: Three methods are provided to fuse appearance and location information, namely addition, element-wise production, and concatenation. Fig. 3 (4) shows that the three methods obtain similar mAP values on both the MSO and MSNO datasets. For computational simplicity, we choose addition as the fusion method.
V-D. Significance of the Fixation Point Strategy
Experiments on the MSO and MSNO datasets are carried out to analyze the significance of the stochastic and object-awareness strategies. The stochastic strategy (S) is trained by a variant of the REINFORCE rule, and the object-awareness strategy (A), without random sampling, is constrained by the L2 loss with the true location of the object. The hybrid method (SA) combining the two is also considered in the experiments.
Fig. 4 (a) indicates that the L2 constraint effectively guides the object-awareness strategy in predicting the next fixation, while the stochastic strategy is more robust to cluttered backgrounds and achieves a clear gain on the MSNO dataset. However, we also observe that the stochastic method occasionally fails to converge. The hybrid method (SA) obtains the highest mAP of 0.905 and is very stable in convergence. The strategy without either S or A (None) obtains poor results on the MSO and MSNO datasets, which illustrates the necessity of the stochastic method and the L2 loss. Several instances generated by the hybrid method are illustrated in Fig. 4 (b); even small and corrupted objects are detected and classified accurately.
We also compare our RADCN with other networks, as shown in Table II. On the cluttered translated (CT100) dataset [18], the results suggest that it is beneficial to jointly localize and recognize objects with our fixation strategy. On the MSNO dataset, our RADCN achieves the best precision, and the proposed method is better at detecting and classifying small objects against cluttered backgrounds.
Method                    Dataset   mean IoU   error rate   mAP
RADCN (ours)              MSNO      0.879      6.8%         0.905
RADCN (ours)              CT100     0.918      3.72%        0.940
LeNet+Regression [28]     MSNO      0.643      42.1%        0.395
RAM [18]                  MSNO      –          19.7%        –
RAM [18]                  CT100     –          8.11%        –
DRAW [24]                 MSNO      –          19.5%        –
DRAW [24]                 CT100     –          3.36%        –
V-E. Attentional Object Detection on Real-world Images
We apply the proposed RADCN to the FCAR dataset to detect the forward car in the same lane as our car, and evaluate the precision of detection by mean IoU. During training, the cross-entropy classification loss is removed due to the absence of label information.
As shown in Fig. 5 (a), increasing the number of glimpse steps is a simple and effective way to improve precision. Fig. 5 (b) illustrates that the proposed RADCN effectively detects both small and large objects; in particular, it can ignore cars in other lanes and localize the attentional car. With a fixed viewpoint, the images imply the spatial distribution of the objects. Besides, rich contextual information, such as the sky, the ground, lane lines, and other cars, guides the network to localize the forward car. Therefore, our proposed model is a flexible and efficient way to learn effective features, mine abundant environmental information, and fuse multi-fixation partial observations to detect the attentional object.
Compared with convolving over the entire image, the approach of multi-scale local regions is much faster. The local regions are resized to a small unified size, while the entire image is much larger. The ratio of convolution cost on the global image to that on the local regions depends on the number of scales S, the number of steps T, and the patch and image sizes; on the FCAR dataset this ratio is large, which illustrates that the local-region method costs far less computation than processing the entire image. The proposed RADCN achieves about 30 fps on the TensorFlow framework with an M40 GPU.
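The cost comparison is simple arithmetic if convolution cost is taken as proportional to input area (ignoring pooling and channel growth); the concrete sizes below are hypothetical, not the paper's elided figures:

```python
def conv_cost_ratio(image_hw, patch_hw, num_scales, num_steps):
    """Rough ratio of conv cost: whole image vs. all glimpsed patches.

    Treats convolution cost as proportional to input area only, so this
    is an order-of-magnitude estimate, not a FLOP count.
    """
    global_cost = image_hw[0] * image_hw[1]
    local_cost = num_scales * num_steps * patch_hw[0] * patch_hw[1]
    return global_cost / local_cost

# e.g. a hypothetical 1280x720 frame vs. 3 scales over 4 steps of 32x32 patches
ratio = conv_cost_ratio((1280, 720), (32, 32), 3, 4)
```

Even with several scales and steps, the glimpsed patches cover far less area than one full-image pass, which is the source of the speed advantage.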
We will open-source the code of our proposed method.

VI. Discussion
Our RADCN jointly localizes and recognizes objects effectively and efficiently. The hybrid multi-task loss function can be used to learn the parameters of the network end-to-end. In particular, the combined stochastic and object-awareness fixation strategy makes our network stable and accurate. Experiments on the FCAR dataset demonstrate that our method can extract useful contextual information to detect attentional objects, especially small ones.
Our current model detects only one object per image. In future work, we will extend it to multi-object detection and classification.
References

[1] P. Viola and M. J. Jones, “Robust real-time face detection,” IJCV, vol. 57, no. 2, pp. 137–154, 2004.
[2] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, et al., “Object detection with discriminatively trained part-based models,” TPAMI, vol. 32, no. 9, pp. 1627–1645, 2010.
[3] M. A. Sadeghi and D. Forsyth, “30Hz object detection with DPM v5,” in ECCV. Springer, 2014, pp. 65–79.
[4] R. Girshick, J. Donahue, T. Darrell, et al., “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, 2014, pp. 580–587.
[5] R. Girshick, “Fast R-CNN,” in ICCV, 2015, pp. 1440–1448.
[6] S. Ren, K. He, R. Girshick, et al., “Faster R-CNN: Towards real-time object detection with region proposal networks,” in NIPS, 2015, pp. 91–99.
[7] J. Redmon, S. Divvala, R. Girshick, et al., “You only look once: Unified, real-time object detection,” in CVPR, 2016, pp. 779–788.
[8] W. Liu, D. Anguelov, D. Erhan, et al., “SSD: Single shot multibox detector,” in ECCV. Springer, 2016, pp. 21–37.
[9] J. Schmidhuber and R. Huber, “Learning to generate artificial fovea trajectories for target detection,” IJNS, vol. 2, pp. 125–134, 1991.
[10] A. Torralba, A. Oliva, M. S. Castelhano, et al., “Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search,” Psychological Review, vol. 113, no. 4, p. 766, 2006.
[11] A. Oliva and A. Torralba, “The role of context in object recognition,” Trends in Cognitive Sciences, vol. 11, no. 12, pp. 520–527, 2007.
[12] S. Bell, C. Lawrence Zitnick, K. Bala, et al., “Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks,” in CVPR, 2016, pp. 2874–2883.
[13] K. He, X. Zhang, S. Ren, et al., “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in ECCV. Springer, 2014, pp. 346–361.
[14] H. Larochelle and G. E. Hinton, “Learning to combine foveal glimpses with a third-order Boltzmann machine,” in NIPS, 2010, pp. 1243–1251.
[15] Y. Tang and R. R. Salakhutdinov, “Learning stochastic feedforward neural networks,” in NIPS, 2013, pp. 530–538.
[16] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropagation and approximate inference in deep generative models,” arXiv preprint arXiv:1401.4082, 2014.
[17] S. Shalev-Shwartz, N. Ben-Zrihem, A. Cohen, et al., “Long-term planning by short-term prediction,” arXiv preprint arXiv:1602.01580, 2016.
[18] V. Mnih, N. Heess, A. Graves, et al., “Recurrent models of visual attention,” in NIPS, 2014, pp. 2204–2212.
[19] A. Graves, G. Wayne, M. Reynolds, et al., “Hybrid computing using a neural network with dynamic external memory,” Nature, vol. 538, no. 7626, pp. 471–476, 2016.
[20] M. Ranzato, “On learning where to look,” arXiv preprint arXiv:1405.5488, 2014.
[21] M. Denil, L. Bazzani, H. Larochelle, et al., “Learning where to attend with deep architectures for image tracking,” Neural Computation, vol. 24, no. 8, pp. 2151–2184, 2012.
[22] K. Xu, J. Ba, R. Kiros, et al., “Show, attend and tell: Neural image caption generation with visual attention,” in ICML, vol. 14, 2015, pp. 77–81.
[23] L. Bazzani, H. Larochelle, V. Murino, et al., “Learning attentional policies for tracking and recognition in video with deep networks,” in ICML, 2011, pp. 937–944.
[24] K. Gregor, I. Danihelka, A. Graves, et al., “DRAW: A recurrent neural network for image generation,” in ICML, 2015.
[25] J. C. Caicedo and S. Lazebnik, “Active object localization with deep reinforcement learning,” in ICCV, 2015, pp. 2488–2496.
[26] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3–4, pp. 229–256, 1992.
[27] R. S. Sutton, D. A. McAllester, S. P. Singh, et al., “Policy gradient methods for reinforcement learning with function approximation,” in NIPS, vol. 99, 1999, pp. 1057–1063.
[28] Y. LeCun, B. Boser, J. S. Denker, et al., “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.