1 Introduction
Estimating similarity across image patches is one of the most fundamental components of computer vision and pattern recognition. In most cases, a precise similarity measure lays a solid foundation for several challenging tasks, including structure from motion, wide baseline matching, object recognition, segmentation, classification and image retrieval [1]. Among visual object tracking applications, video surveillance, human-computer interaction and robotics have been a prime focus of the computer vision community for many years. Although great progress has been made within the last few years, there is still a need to improve performance in terms of accuracy, robustness and speed.
Over the years, many tracking algorithms have been proposed [12] where the main goal is to localize an object in a series of frames with the sole supervision of a bounding box given only in the first frame of the sequence. In these scenarios, the object's appearance model is learnt online from the previous frames of a sequence, and the learnt model is then cross-correlated with the search image to localize the target object in the next frame. Several state-of-the-art methods build on this similarity rationale with certain modifications, such as KCF [2], SRDCF [4] and CCOT [6], which are widely accepted by the tracking community. Owing to the tremendous achievements of deep neural networks in diverse computer vision applications, researchers have turned their attention to bringing the best out of deep convolutional nets in the tracking paradigm. However, the scarcity of large sets of supervised data and extremely slow learning have made deep trackers infeasible for real-life applications [7]. Despite this severe speed constraint, deep trackers such as DeepSRDCF [5], TCNN [9] and MDNet [10] have proved their effectiveness on a wide variety of challenging sequences. Some recent trackers learn a detector per video by fine-tuning multiple layers of a pre-trained deep network with stochastic gradient descent [10, 11]. But the need for high frame rates in real-world applications leaves online adaptive deep convolutional networks a step behind the other state-of-the-art trackers [12, 13]. A possible solution to these shortcomings is to use a pre-trained deep similarity network such as a Siamese network [14] to discriminate the target from its background. The objective of the network is then to learn from a single exemplar image in one branch and predict the essential parameters of the other branch, which assist in identifying instances of the same object in the upcoming frames [14]. Thus, the deep network generalizes a similarity function from annotated pairs of raw image patches without relying on hand-crafted features [16].
Most of these trackers have achieved appealing results in both accuracy and robustness. Several consistency techniques, such as scale adaptive KCF [17], windowing [7], bounding box regression [10] and online adaptation of the appearance model [10], have also turned out to be extremely effective on numerous tough sequences. Even though immunity to minor scale changes has brought a radical advancement, the bounding box still needs to be oriented according to the target object. In real-life applications, estimating the orientation of the target object, and not only a proper bounding box, plays a vital role. This is a key factor in increasing the overlap ratio and in anticipating the trajectory of the target object more efficiently. A major advantage of using orientation is that the appearance model can be updated, and more sophisticated features extracted, from the rotated version of the cropped exemplar image in the subsequent frames.
Therefore, one of the important aspects of this paper is to propose a scheme to determine the orientation of the target object and to analyse its impact on visual object tracking. We further extend our work and propose a better approach to updating the target position, which takes into account the distance and direction of motion as well as a Gaussian weighted average response map over various scales. Our proposed method is generic in the sense that it can be applied to any state-of-the-art tracker, which in turn improves its performance. To establish this proposition, we have integrated our algorithm with SiameseFC [7] and CFnet [8]. The main reason for choosing these two trackers is that they achieve performance comparable to the state of the art while operating at extremely high frame rates. The proposed algorithm has been evaluated on popular tracking benchmarks such as VOT [12] and OTB [13].
2 Related Works
Most correlation filter based algorithms rest on two key elements: how the target object is represented, and how this object of interest is localized in the subsequent frames. Object representation models have developed gradually from histogram-based approaches [21] to more advanced generative [22, 23] or discriminative [2, 24] approaches. For target object localization, methods such as elliptical head tracking [25], probabilistic color and adaptive multi-feature tracking [26], robust visual tracking [27] and learning to track with multiple observers [28] have gained a lot of attention. Recently, the widespread success of object detection algorithms has given rise to an advanced approach to localization, known as tracking-by-detection. Owing to the outstanding performance of tracking-by-detection algorithms on evaluation benchmarks [12, 13], this paradigm has gained popularity in the tracking community. It usually employs a binary classifier to discriminate the target object from its background. S. Hare et al. discuss Struck [29], a discriminative tracker which employs a kernelized structured output Support Vector Machine (SVM) to provide adaptive tracking. M. Danelljan et al. have proposed SRDCF [4], which uses a spatially regularized correlation filter that helps in learning from a large set of negative samples without corrupting the positive samples. Y. Li et al. have proposed a scale adaptive scheme in [17] which has a strong impact on determining the size of the target efficiently. G. Koch et al. explore a method to train a Siamese neural network that ranks the similarity between its input image patches [30]. J. Valmadre et al. take a step forward to investigate the influence of a modified AlexNet on the Siamese network [7] and propose an end-to-end training model of the correlation filter in CFnet [8].
Sequences such as glove [12] from the VOT challenge, shown in Figure 1, undergo severe deformation which leads to significant changes in aspect ratio as well as rotation of the bounding box. Because of this, the ill-equipped axis-aligned bounding box used in most state-of-the-art tracking methods fails to capture more detailed information from the object of interest. Y. Hua et al. [20] have addressed this issue by generating suitable candidates which capture more detailed information by estimating the transformations undergone by the object. H. A. Rowley et al. have carried out extensive research on designing a rotation invariant neural network based face detection system [31]. Several face detection systems can only detect upright or frontal faces [32], which is far from reality, where images of faces can be much more complicated than just upright or frontal. The router network introduced in [31] detects the angle of orientation of the face, so that rotated faces can be converted back to frontal before they are passed through the face detector. M. Jaderberg et al. have introduced a spatial transformer network (STN) [33], where the model learns invariance to rotation, scale, translation and other warping of the input data. The use of a localization network, a sampling grid generator and a sampler in an STN has brought radical achievements on challenging object classification datasets such as the CUB-200-2011 birds dataset [34]. In this paper we explore extensively the impact of rotation invariance in the tracking paradigm, as well as several new consistency techniques which have outperformed the original tracking algorithm by a large margin. In section 3 and section 4, we explain the detailed modifications and the experimental results of our evaluation on the OTB [13] and VOT [12] datasets respectively.
3 Proposed Methodology
In this section, we detail our contributions: a generic approach for incorporating rotation invariance (RI) into object tracking, and motion consistencies guided by the laws governing the physical motion of objects. For the sake of experimentation and analysis, we have incorporated the proposed modifications into SiameseFC and CFnet. We first briefly discuss the architecture of SiameseFC in 3.1 and of CFnet in 3.2, followed by the consistencies termed Displacement Consistency in 3.3, Scale Consistency in 3.4 and RI in 3.5.
3.1 Siamese Fully Convolutional Network
A deep similarity learner learns the parameters of a function $f(z, x)$ which takes two images as input and generates a response score. If the two images depict the same object, it generates a high score, otherwise a low score. A convolutional neural net is used as this learning function $f$. Siamese architectures have been quite successful in deep similarity measurement. The network learns the parameters from the first frame of each sequence, and then all possible candidates are tested exhaustively to measure their similarity with the exemplar image. A simple architecture of the Siamese fully convolutional network is shown in Figure 2. In this architecture, the appearance model of the exemplar image is not updated at all. Following the notation of [7], the Siamese network applies the same transformation $\varphi$, a five-layer CNN, to both of its inputs, the exemplar $z$ and the search image $x$. The transformed inputs $\varphi(z)$ and $\varphi(x)$ are cross-correlated to obtain a response map. The location of the maximum response score indicates the position of the object of interest. Thus, the overall similarity function becomes $f(z, x) = g(\varphi(z), \varphi(x))$. One of the major advantages of a fully convolutional architecture is that a larger search region can be fed into the network without resizing it to the size of the exemplar. The similarity function then yields the response score at all translated sub-windows of the large search region. The response map computed by this architecture is given by equation 1.
$$f(z, x) = \varphi(z) \star \varphi(x) + b\,\mathbb{1} \tag{1}$$
where $b\,\mathbb{1}$ is a bias signal which takes the value $b$ at every location.
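As a minimal sketch of the response map computation described above, the following NumPy code slides an exemplar feature map over a larger search feature map and adds a scalar bias. It assumes the features have already been extracted by the shared CNN; the function names `response_map` and `locate_peak` and the `(height, width, channels)` layout are illustrative choices, not part of SiameseFC.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def response_map(exemplar_feat, search_feat, bias=0.0):
    """Cross-correlate an exemplar feature map with a larger search feature map.

    exemplar_feat: array of shape (h, w, c)   (hypothetical layout)
    search_feat:   array of shape (H, W, c), with H >= h and W >= w
    Returns an array of shape (H - h + 1, W - w + 1).
    """
    h, w, _ = exemplar_feat.shape
    # windows has shape (H - h + 1, W - w + 1, c, h, w)
    windows = sliding_window_view(search_feat, (h, w), axis=(0, 1))
    # Inner product of the exemplar with every translated sub-window
    scores = np.einsum("ijchw,hwc->ij", windows, exemplar_feat)
    return scores + bias


def locate_peak(scores):
    """Return the (row, col) of the maximum response score."""
    idx = np.unravel_index(np.argmax(scores), scores.shape)
    return tuple(int(i) for i in idx)


search = np.arange(9, dtype=float).reshape(3, 3, 1)
exemplar = np.ones((2, 2, 1))
scores = response_map(exemplar, search, bias=1.0)
peak = locate_peak(scores)
```

Because the exemplar is smaller than the search region, a single pass scores every translated sub-window, which is the property the fully convolutional design relies on.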
3.2 Correlation Filter Network
The correlation filter network is a modification of the baseline Siamese network given in equation 1. The correlation filter network is shown in Figure 3. With $z$ the exemplar, $x$ the search image and $\varphi$ the shared convolutional embedding, the new similarity measure, following [8], is given in equation 2.
$$h(z, x) = s\,\omega(\varphi(z)) \star \varphi(x) + b \tag{2}$$
where $s$ and $b$ are scale and bias respectively, and $\omega$ denotes the correlation filter block. The correlation filter block uses the feature map $\varphi(z)$ to learn a template by solving a ridge regression problem in the Fourier domain [2]. The effect of circular boundaries is mitigated by multiplying with a cosine window and cropping the final template. Unlike SiameseFC, the appearance model in CFnet is updated after each frame using a rolling average in order to avoid abrupt transitions from frame to frame. The rest of the procedure is similar to that of the baseline Siamese network described in section 3.1. A deeper insight into CFnet, including back-propagation, can be found in [8].
3.3 Displacement Consistency
To keep the target position from deviating much from its previous position, most trackers [7, 8] employ a target centroid update strategy as shown in Figure 4. Mathematically, this update strategy can be written as in equation 3. It is usually employed to enforce smoothness on the object motion. However, we find that this is not sufficient to enforce smoothness on the displacement (direction and distance) of the object in motion, which can contribute to more reliable tracking. This can be seen in Figure 10. Owing to the incremental angular deviation, the target centroid keeps drifting away from the actual centroid, which decreases the overlap ratio. Table II shows that minimizing this deviation as proposed increases both success and precision.
(3) 
We have integrated a new displacement consistency approach on top of the conventional approach to enhance the degree of smoothness. The angle consistency is illustrated in Figure 5. In distance consistency, the algorithm remembers the previous distance encountered and updates the new distance using a rolling average. A pictorial representation of distance consistency is given in Figure 6. After displacement consistency, the new centroid of the target is computed using equation 6. The accuracy and robustness after these integrations, along with the original values, are provided in section 4 and demonstrate the efficacy of the scheme.
(4) 
(5) 
(6) 
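The idea of smoothing both the distance and the direction of motion can be sketched as follows. This is only an illustration under stated assumptions and not the paper's equations 4 to 6: it assumes a single rolling-average rate `lr` for both quantities, measures angles in radians, and rebuilds the centroid from the smoothed distance and angle. The names `wrap_angle` and `displacement_consistent_update` are hypothetical.

```python
import math


def wrap_angle(a):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (a + math.pi) % (2.0 * math.pi) - math.pi


def displacement_consistent_update(prev_centroid, raw_centroid,
                                   prev_dist, prev_angle, lr=0.5):
    """Smooth the displacement between two frames (illustrative assumption).

    prev_centroid, raw_centroid: (x, y) tuples
    prev_dist, prev_angle: smoothed distance and direction from the last frame
    lr: rolling-average rate, a hypothetical hyperparameter
    Returns (new_centroid, new_dist, new_angle).
    """
    dx = raw_centroid[0] - prev_centroid[0]
    dy = raw_centroid[1] - prev_centroid[1]
    dist = math.hypot(dx, dy)
    angle = math.atan2(dy, dx)

    # Distance consistency: rolling average of the travelled distance
    new_dist = (1.0 - lr) * prev_dist + lr * dist
    # Angle consistency: move part of the way along the shortest angular path
    new_angle = wrap_angle(prev_angle + lr * wrap_angle(angle - prev_angle))

    new_centroid = (prev_centroid[0] + new_dist * math.cos(new_angle),
                    prev_centroid[1] + new_dist * math.sin(new_angle))
    return new_centroid, new_dist, new_angle


centroid, dist, angle = displacement_consistent_update(
    (0.0, 0.0), (0.0, 10.0), prev_dist=10.0, prev_angle=0.0, lr=0.5)
```

In this example the raw estimate turns by 90 degrees in one frame, while the smoothed direction turns by only 45 degrees, which is the kind of damping of angular deviation that the section argues for.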
3.4 Scale Consistency
The conventional approach to estimating the size of the target object is to form a scale pyramid and compute a response map for each of its images [17]. The scale of the response map with the maximum response score among all these response maps determines the size of the target object, and that particular response map is then used to obtain the target centroid. In this standard approach, only the winning response map, i.e. the map with the maximum response score among all maps, decides the size of the object. In real scenarios, however, the scale of the object does not change drastically from frame to frame, because the scale change depends on the distance of the object from the camera and objects tend to move smoothly. Although there are methods which add a penalty factor to the new target size, this applies only to the size of the target object. If the position of the target centroid itself is corrupted because only the winning response map is used, the error persists in subsequent frames. In the standard scenario, the response maps that correspond to the other scales are not used in determining the centroid. Therefore, we propose to use a Gaussian weighted average response map centred at the winning map, with the variance as an additional hyperparameter. In this way we can incorporate the response maps that correspond to the various scales in the scale pyramid. Our approach has enhanced the accuracy as well as the robustness of the considered base trackers. The results of this experiment are described in detail in section 4. The pseudo code for the Gaussian weighted average response map is provided in Algorithm I.
Algorithm I : Scale Consistency using Gaussian weights

Input parameters :
Let responseMaps represent the stack of response maps, one at each scale.
$w$ represents the index of the winning response map.
$\sigma$ represents the standard deviation of the Gaussian weights.
scaleBins numerically represents each scale, i.e. scaleBins(1) represents the first scale, scaleBins(2) represents the second scale and so on.
Let $N$ represent the total number of scales used in the scale pyramid.

Computation of scale weights and update of responseMap :
Define the weight of each scale $i = 1, \dots, N$ as
scaleWeights(i) = $\exp\left(-\dfrac{(\mathrm{scaleBins}(i) - w)^2}{2\sigma^2}\right)$
responseMap = $\dfrac{\sum_{i=1}^{N} \mathrm{scaleWeights}(i)\,\mathrm{responseMaps}(i)}{\sum_{i=1}^{N} \mathrm{scaleWeights}(i)}$

Output response map : The output of this algorithm is the Gaussian weighted average responseMap.
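The Gaussian weighted averaging of Algorithm I can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the response maps are stacked along the first axis, that scales are indexed from zero, and that the weights are normalized to sum to one; the function names `winning_index` and `gaussian_weighted_response` are hypothetical.

```python
import numpy as np


def winning_index(response_maps):
    """Index of the map whose peak response score is the highest."""
    maps = np.asarray(response_maps, dtype=float)
    peaks = maps.reshape(maps.shape[0], -1).max(axis=1)
    return int(np.argmax(peaks))


def gaussian_weighted_response(response_maps, winner, sigma):
    """Average the per-scale maps with Gaussian weights centred at `winner`.

    response_maps: array of shape (num_scales, height, width)
    winner: index of the winning response map (zero-based, an assumption)
    sigma: standard deviation of the Gaussian weights (hyperparameter)
    """
    maps = np.asarray(response_maps, dtype=float)
    scale_bins = np.arange(maps.shape[0])
    weights = np.exp(-0.5 * ((scale_bins - winner) / sigma) ** 2)
    weights /= weights.sum()
    # Weighted sum over the scale axis gives one (height, width) map
    return np.tensordot(weights, maps, axes=(0, 0))


maps = np.stack([np.full((2, 2), 0.1),
                 np.full((2, 2), 0.9),
                 np.full((2, 2), 0.3)])
winner = winning_index(maps)
sharp = gaussian_weighted_response(maps, winner, sigma=1e-2)
flat = gaussian_weighted_response(maps, winner, sigma=1e6)
```

A very small `sigma` reduces to the conventional winner-takes-all map, while a very large `sigma` approaches a plain average over all scales, so the hyperparameter interpolates between the two behaviours.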
3.5 Rotation Invariance
In this section, we discuss two different ways of incorporating rotation adaptiveness into tracking algorithms: the proposed rotation invariant SiameseFC (3.5.1) and rotation invariant CFnet (3.5.2). The former can be used where the target object is not updated after each frame, and the latter where the object is updated after each frame.
3.5.1 Rotation Invariant SiameseFC
When an object is in motion, it can assume any rotated form or view from frame to frame. Conventionally, however, only a base template with a fixed (zero) orientation is employed to measure similarity. To incorporate robust RI tracking, we propose to augment the exemplar with various possible rotated images of the object and to measure similarity with all of these rotated images. The working of the rotation invariant Siamese fully convolutional network is illustrated in Figure 8. Since the appearance model is not updated during tracking, the features of the rotated exemplars can be extracted once per sequence. Since the angle of rotation does not change drastically from frame to frame, only the 5 nearest neighbour response maps are used in computing the response map. The mean of the Gaussian weights is taken to be the index of the winning response map, and the variance is tuned as an additional hyperparameter in the same manner as explained for different scales in 3.4. To avoid false alarms, we compute three Gaussian weighted average response maps centred at the top three maps according to their response scores. This gives three most probable target centroids, out of which the final centroid is selected based on the highest score-to-displacement ratio, on the grounds that the object would not have travelled far from its previous location. In this approach, as the path with the dominant direction is detected, the bounding box can be rotated accordingly to increase the overlap ratio. A comparison between SiameseFC and rotation invariant SiameseFC is shown in Figure 7. We have evaluated our rotation invariant SiameseFC on the VOT datasets [12], and the results are provided in section 4.
3.5.2 Rotation Invariant CFNet
Unlike rotation invariant SiameseFC, in rotation invariant CFnet the exemplar is updated after every frame [8]. By the second frame, the object itself would already have undergone some rotation. So there is no need to extract features from all the rotated exemplars beforehand; only a forward and a backward rotation after each model update suffice, and there is no need to feed back the angle of rotation. Thus, there are three response maps, one for each of the three rotations. Since we have only three response maps corresponding to the three nearest neighbour angles, and each response map is itself a Gaussian weighted average map computed by the SCorr block, there is no need to average the three rotated maps again. The final centroid is computed from the map with the highest response score among the three. In fact, applying a Gaussian weighted average after scale correction does not seem to improve the performance much, though it would be useful when more nearest neighbour rotated exemplars are used. This is a general approach which can be integrated into any state-of-the-art tracker to enhance its performance further. The rotation invariant CFnet is illustrated in Figure 9. The results of our proposed CFnet DS and CFnet DSR can be found in section 4. In the rest of the sections, we refer to CFnet-conv2 [8] as the original CFnet and apply our consistency strategy to this model. We have used CFnet-conv2 in our experiments because it has less than 4% of the parameters of the five-layer baseline and outperforms the rest of the series [8].
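The two selection rules described in sections 3.5.1 and 3.5.2 can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it assumes response maps are 2-D NumPy arrays keyed by rotation angle, and it adds a small constant `eps` to the displacement so that the ratio stays finite when a candidate coincides with the previous centroid. All function names and `eps` are hypothetical.

```python
import numpy as np


def peak(response_map):
    """Return (score, (row, col)) of the maximum of a 2-D response map."""
    idx = np.unravel_index(np.argmax(response_map), response_map.shape)
    loc = tuple(int(i) for i in idx)
    return float(response_map[loc]), loc


def select_rotation(maps_by_angle):
    """Rule of 3.5.2: keep the rotation whose map has the highest score.

    maps_by_angle: dict mapping a rotation angle to its response map
    Returns (angle, score, (row, col)).
    """
    best_angle = max(maps_by_angle, key=lambda a: maps_by_angle[a].max())
    score, loc = peak(maps_by_angle[best_angle])
    return best_angle, score, loc


def select_by_score_to_displacement(candidates, prev_centroid, eps=1.0):
    """Rule of 3.5.1: pick the candidate with the highest score/displacement.

    candidates: list of (score, (row, col)) for the top response maps
    eps: assumed regularizer that avoids division by zero
    """
    def ratio(candidate):
        score, (row, col) = candidate
        dist = np.hypot(row - prev_centroid[0], col - prev_centroid[1])
        return score / (dist + eps)
    return max(candidates, key=ratio)


backward = np.zeros((4, 4))
upright = np.zeros((4, 4))
upright[1, 2] = 0.9
forward = np.zeros((4, 4))
forward[3, 3] = 0.5
chosen = select_rotation({-8: backward, 0: upright, 8: forward})

picked = select_by_score_to_displacement(
    [(0.9, (50, 50)), (0.8, (11, 10))], prev_centroid=(10, 10))
```

In the second call the slightly weaker candidate wins because it lies next to the previous centroid, which mirrors the argument that the object would not have travelled far between frames.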
4 Experimental Details and Analysis
We have evaluated the original CFnet and our modified CFnet on 43 sequences out of OTB-50, with 3 repetitions for each sequence, in order to get a rough estimate of the performance. These particular sequences were selected based on the severity of the deformations incurred by the object of interest. For the evaluation on the 43 sequences, we have used the OTB-TRE function provided in the original CFnet [8] code repository. The tracking performance varies from machine to machine depending on whether GPU support is enabled, because numerical effects accumulate over time. As a result, most trackers show a slight variation in results when re-evaluated on different machines. To avoid this effect, we have evaluated the original tracker and all our modifications under exactly the same circumstances, i.e. same sequences, same system and same evaluation function. The results are given in Table II. Table II shows that our proposed displacement (3.3) and scale (3.4) correction schemes improve the performance of the original CFnet [8]. The success and precision values for different angles of rotation are shown in Table I. Table I shows that the tracker achieves its best performance for an angle of rotation of 8 and that its performance decreases as the angle of rotation increases further. This is consistent with practice, as orientation does not change drastically between two subsequent frames.
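Table I: Success and precision for different angles of rotation.
Angle of Rotation  5  7  8  10  20  30
Success (AUC)  56.97  58.11  58.78  57.01  53.80  48.17
Precision (threshold)  67.05  68.30  69.13  67.91  67.18  63.88

Table II: Success and precision of the original CFnet and the proposed modifications on 43 OTB sequences.
Tracker  Success (AUC)  Precision (threshold)
CFnet  57.3447  65.0869
CFnet D  59.0531  69.3120
CFnet DS  59.0531  70.3673
CFnet DSR (angle of rotation = 8)  58.7865  69.1349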
The final evaluation is done using the OTB toolkit on all 50 sequences [13], and only the important results are shown owing to space constraints. The success rate of the original CFnet could not be plotted because its final bounding boxes were unavailable in the OTB results database at the time of writing. A comparison between our modified CFnet and current state-of-the-art trackers is shown in Figure 10. Because rotated bounding box results are not available for the other trackers, we had to use axis-aligned bounding boxes during accuracy assessment. As a result, only a slight improvement in accuracy and precision is observed, as shown in Table III. We expect this performance to improve further if compared against rotated bounding box results, which unfortunately are not available for most of the trackers. The success (AUC) and precision plots obtained using the fully integrated OTB toolkit are shown in Figure 11 and Figure 12 respectively.
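Table III: Success and precision on all 50 OTB sequences using axis-aligned bounding boxes.
Tracker  Success (AUC)  Precision (threshold)
CFnet  52.7  70.2
CFnet DS  52.9  70.0
CFnet DSR (angle of rotation = 8)  52.8  71.5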
According to the results obtained using the fully integrated vot-toolkit, shown in Table IV, an improvement of 15.57% in accuracy rank and 14.3% in robustness rank has been observed with no degradation in overlap ratio. Since the absolute values of the accuracy and robustness ranks vary with the set of trackers used for evaluation, the relative improvements of the proposed methods over the originals are used to showcase the efficiency. The ranking plot for the baseline experiment is shown in Figure 13, which depicts the improvement in accuracy and robustness of the proposed Siamese DSR over the original SiameseFC.
Table IV: VOT toolkit results.
Tracker  Accuracy rank  Robustness rank  Overlap
MDNet  1.00  1.00  0.57
CCOT  1.17  1.67  0.54
SiameseFC  1.17  6.50  0.52
Siamese DSR  1.00  5.67  0.52
CFnet-conv2  1.33  5.00  0.52
CFnet DSR  1.50  4.50  0.52
5 Conclusion
The principal focus of this paper has been to investigate the consequences of rotation adaptiveness in object tracking. The proposed consistency techniques have outperformed the original tracking algorithm. The success rate has improved by 4.6% and the precision by 6.75%, as given in Table II. According to the evaluation of the proposed Siamese DSR on VOT [12], an improvement in robustness rank of 15.7% and in accuracy rank of 14.3% has been achieved with no degradation in overlap ratio, as shown in Table IV. Above all, detecting the orientation of the target object is a significant boost to the tracking paradigm. According to our analysis, rotation adaptive tracking with the aforementioned motion consistencies has been highly effective in determining the target centroid in most of the tough sequences of popular tracking benchmarks. From our investigation of rotation adaptive Siamese and CFnet, we conclude that rotation adaptiveness has enhanced the robustness. Regarding the overlap ratio, we expect rotation invariance to be more effective when compared against rotated bounding box results. Our future research may include replacing the simple CNN present in both the Siamese and CFnet architectures with a very deep CNN. We will also investigate the improvement achieved by integrating the proposed motion consistency into top trackers such as MDNet [10], TCNN [9] and CCOT [6].
References
 [1] R. Tao, E. Gavves, and A. W. M. Smeulders. Siamese instance search for tracking. In CVPR, pages 1420–1429, 2016.
 [2] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE TPAMI, 37(3):583–596, 2015.
 [3] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. PAMI, 34(7):1409–1422, 2012.
 [4] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg. Learning spatially regularized correlation filters for visual tracking. In ICCV, pages 4310–4318, 2015.
 [5] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg. Convolutional features for correlation filter based visual tracking. In ICCV Workshops, pages 58–66, 2015.
 [6] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In ECCV, pages 472–488, 2016.
 [7] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr. Fully-convolutional Siamese networks for object tracking. In ECCV Workshops, pages 850–865, 2016.
 [8] J. Valmadre, L. Bertinetto, J. F. Henriques, A. Vedaldi, and P. H. S. Torr. End-to-end representation learning for Correlation Filter based tracking. arXiv preprint arXiv:1704.06036v1, 2017.
 [9] K. Kang, W. Ouyang, H. Li, and X. Wang. Object Detection from Video Tubelets with Convolutional Neural Networks. In CVPR, 2016.
 [10] H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In CVPR, pages 4293–4302, 2016.
 [11] L. Wang, W. Ouyang, X. Wang, and H. Lu. Visual tracking with fully convolutional networks. In ICCV 2015.
 [12] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Čehovin, T. Vojíř, G. Häger, A. Lukežič, G. Fernández, et al. The Visual Object Tracking VOT2016 challenge results. In ECCV Workshops. Springer, 2016.
 [13] Y. Wu, J. Lim, and M.-H. Yang. Online Object Tracking: A Benchmark. In CVPR, 2013.
 [14] L. Bertinetto, J. F. Henriques, J. Valmadre, P. H. S. Torr, and A. Vedaldi. Learning feed-forward one-shot learners. In NIPS, pages 523–531, 2016.
 [15] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In NIPS, pages 3630–3638, 2016.
 [16] S. Zagoruyko, and N. Komodakis. Learning to Compare Image Patches via Convolutional Neural Networks. arXiv preprint arXiv:1504.03641v1, 2015.
 [17] Y. Li and J. Zhu. A Scale Adaptive Kernel Correlation Filter Tracker with Feature Integration. In L. Agapito, M. Bronstein, and C. Rother (eds), Computer Vision - ECCV 2014 Workshops.
 [18] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv preprint arXiv:1506.01497v3, 2016.
 [19] A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual Tracking: an Experimental Survey, In Proceeding of IEEE Transaction on Pattern Analysis and Machine Intelligence (PAMI), 2013.
 [20] Y. Hua, K. Alahari, and C. Schmid. Online Object Tracking with Proposal Selection. arXiv preprint arXiv:1509.09114v1, 2015.
 [21] D. Comaniciu, V. Ramesh, and P. Meer. Kernel-based object tracking. PAMI, 25(5):564–577, 2003.
 [22] K. Lee, J. Ho, M. Yang, and D. Kriegman. Visual tracking and recognition using probabilistic appearance manifolds. CVIU, 99(3):303–331, 2005.
 [23] X. Mei and H. Ling. Robust visual tracking and vehicle classification via sparse representation. PAMI, 2011.
 [24] R. T. Collins, Y. Liu, and M. Leordeanu. Online selection of discriminative tracking features. PAMI, 2005.
 [25] S. Birchfield. Elliptical head tracking using intensity gradients and color histograms. In CVPR, 1998.
 [26] V. Badrinarayanan, P. Pérez, F. Le Clerc, and L. Oisel. Probabilistic color and adaptive multi-feature tracking with dynamically switched priority between cues. In ICCV, 2007.
 [27] D. Park, J. Kwon, and K. Lee. Robust visual tracking using autoregressive hidden Markov model. In CVPR, 2012.
 [28] B. Stenger, T. Woodley, and R. Cipolla. Learning to track with multiple observers. In CVPR, 2009.
 [29] S. Hare, A. Saffari, and P. H. S. Torr. Struck: Structured Output Tracking with Kernels. In ICCV, 2011.
 [30] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese Neural Networks for One-shot Image Recognition. In ICML 2015 Deep Learning Workshop, 2015.
 [31] H. A. Rowley, S. Baluja, and T. Kanade. Rotation Invariant Neural Network-Based Face Detection. In CVPR, 1998.
 [32] G. Burel, and D. Carel. Detection and localization of faces on digital images. Pattern Recognition Letters, 15:963–967, October 1994.
 [33] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial Transformer Networks. arXiv preprint arXiv:1506.02025v3, 2016.
 [34] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.