Object tracking is a major task in computer vision applications. It is mostly used to exploit the information shared between several frames. Given the importance of this subject, numerous efforts have been made to find an appropriate approach. Some of these methods are based on detecting objects and associating unique IDs with the same objects across a sequence of frames, while others focus exclusively on the tracking task . Looking into the details of tracking-by-detection approaches, researchers commonly aim to solve the multiple object tracking problem. In this regard, the method in  uses tracklets in a deep association scheme within a tracking-by-detection paradigm. Another approach, in , suggests a reciprocative learning algorithm that exploits visual attention. In , adversarial learning is employed to find the most robust features of the targets in the form of a mask. Single object tracking, on the other hand, is of high importance, and plenty of research has been conducted in this area . In , a single object tracker is presented based on a fuzzy least squares support vector machine. Another method is based on an online fusion of trackers. Some approaches are developed to deal with specific problems in a tracking scenario, such as occlusions [12, 11], while others extend the tracking task to a specific application [5, 6, 13].
Extensive research has also been conducted on performing tracking without detection. To start with, a meta-updater is proposed in , by which an online-update-based tracker is designed to accomplish the local tracking. This is done by the use of a cascaded LSTM module. This idea helps long-term tracking benefit from high-performing short-term trackers. Another method, presented in , suggests a baseline for performing a global search (over a very large area). By this means, there is no need for online learning or for constraints on scale and trajectory changes. Besides, this tracker is mainly developed based on two-stage detectors. In , on the other hand, a tracking framework is presented based on the proposed skimming and perusal modules. The latter is a module consisting of two main parts: first, a bounding box regressor generates several candidates; then, a verifier is applied to identify the optimal candidate while taking its confidence score into account to infer the object presence. Note, all the mentioned trackers employ the similar approach of using a global search along with a verifier module responsible for recognizing the intended object. The verifier module is often based on the ResNet-50 architecture and acts as a matching network. They also utilize the SiamRPN-based network  to produce region candidates for the subsequent tasks. On the other hand, unlike other computer vision problems, the object tracking task requires an online prediction of the object appearance model, which should be embedded into the tracker's architecture to realize end-to-end training. In this regard, a popular approach is the Siamese paradigm [1, 10], where a feature template of the target is predicted regardless of the background information. In , however, the authors suggest considering both the target and the background appearance information to predict the required model, based on defining a discriminative learning loss.
Prior to this method, the same backbone had been employed by , which basically proposed a ResNet-18 architecture along with online discriminative learning.
As can be seen, a vital issue in object tracking is to elicit exclusive features of the target in order to distinguish it from both the background and other similar objects. In this work, our focus is on presenting an appropriate architecture based on a novel approach while addressing challenging issues, mainly the presence of similar objects. The proposed object tracker takes advantage of a light CNN model with four input branches to track an object while taking both its template and its size into account. Besides, it is suggested to take only a small part of the image containing the specified object. This part is then updated and displaced by the proposed method. Based on this approach, the mentioned region of interest is enough to track the object even in a long-term tracking scenario. The challenge, then, is to determine an appropriate new RoI for each incoming frame. Assuming an initial RoI with a size twice that of the object, we present a new object tracker to accomplish the tracking task. In brief, an object localizer takes the RoI as its input and produces a matrix representing the object location. This module is a fully convolutional network with several input branches to process each object according to its specific size, while also taking the intended object template from the first frame into account. To clarify, in each forward pass, two of the four branches are activated: one is the template-related branch, and the other is one of the three branches corresponding to the object size. The object template branch helps to integrate the exclusive features of the target in order to distinguish it from similar objects. Fig. 1 presents some sample outputs of the proposed model trained on the GOT-10k dataset. The methodology is presented in detail in the next section. Then, experimental results are reported while comparing the performance of the proposed tracker with state-of-the-art approaches. Finally, the last section concludes the paper.
This section focuses on the methodology of the presented approach. There are several points to present regarding the proposed method. In what follows, first, the main model used to extract the RoI matrix from an image is investigated. This model is inspired by , and it is shown to be applicable to obtaining an accurate object location from an RoI in a single image. Then, the other parts are presented and explained precisely.
In , a new approach is proposed to localize an object based on a single multiple-model convolutional neural network with a specific training algorithm. In brief, in contrast to most related works that use a fixed-size input, a flexible architecture is developed that provides the capability of having several branches, each activated for a different object size. This specifically enhances the performance of the model in terms of speed and consistency across different object sizes. Besides, a major challenge is to obtain an integrated architecture that contains all the branches and is trained in an end-to-end fashion. Compared to the model in , the center point matrix is eliminated, and a higher-resolution version of the RoI matrix is utilized instead. Besides, a new branch is added which takes the intended object template as its input and outputs feature maps which are then concatenated with those of one of the other three branches. This way, the exclusive features of the target are included in the model. Fig. 2
shows our architecture, in which each of the branches is illustrated with a different color. To train such a model, the mean squared error is utilized as the loss function. Moreover, during training, a batch of ordered images is employed, for which the template is given based on only the first image in each batch. The RoI matrix mainly represents the super-pixel classification result of the input image to determine the object location. This concept is shown in Fig. 3, where the input image is presented on the left side while the corresponding RoI matrix is given on the right. This approach is capable of localizing objects while not only realizing a consistent performance across different object sizes but also distinguishing the target from other similar objects in the same area.
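To make the two-of-four branch activation concrete, the following PyTorch sketch shows one way such a model could be wired; the layer widths, the single-convolution stems, and the head are illustrative assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class FourBranchLocalizer(nn.Module):
    """Illustrative sketch of the four-branch RoI localizer: three
    size-specific branches plus a template branch whose feature maps are
    concatenated with the active size branch. All layer shapes here are
    assumptions, not the paper's exact design."""

    def __init__(self):
        super().__init__()
        # One convolutional stem per object-size category.
        self.size_branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
            for _ in range(3)
        ])
        # Template branch: encodes the first-frame target appearance
        # into a per-channel descriptor.
        self.template_branch = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Head mixes the two activated branches and emits the RoI matrix
        # through a Sigmoid, matching the remark in the text.
        self.head = nn.Sequential(nn.Conv2d(32, 1, 1), nn.Sigmoid())

    def forward(self, roi: torch.Tensor, template: torch.Tensor,
                size_idx: int) -> torch.Tensor:
        feats = self.size_branches[size_idx](roi)
        tmpl = self.template_branch(template).expand_as(feats)
        return self.head(torch.cat([feats, tmpl], dim=1)).squeeze(1)
```

During training, the template input would come from the first image of each ordered batch, and the mean squared error between the predicted and ground-truth RoI matrices would serve as the loss.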
A critical point here is how to use the RoI matrix to move the region of interest to a proper place for the next frame. In this regard, an efficient approach is pursued in this paper based on applying average-pooling over the RoI matrix. Using a kernel size and a stride equal to half the RoI matrix dimensions for the average-pooling layer, the resulting 2×2 matrix represents information about the object presence in each of the four quadrants. We call this matrix the "Direction matrix". Taking the sample presented in Fig. 3 into consideration, the corresponding "Direction matrix" can be computed accordingly.
Based on this matrix, the amount of movement in each of the four directions is determined. The whole concept is illustrated in Fig. 4.
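A minimal NumPy sketch of this step, assuming the Direction matrix is the 2×2 average-pooled RoI matrix and that the displacement is proportional to the quadrant imbalance (the step size is our assumption):

```python
import numpy as np

def direction_matrix(roi_matrix: np.ndarray) -> np.ndarray:
    """Average-pool the RoI matrix into a 2x2 'Direction matrix'.
    Each cell summarizes how much object mass lies in the corresponding
    quadrant of the region of interest."""
    h, w = roi_matrix.shape
    kh, kw = h // 2, w // 2  # kernel = stride = half the matrix size
    return roi_matrix[:2 * kh, :2 * kw].reshape(2, kh, 2, kw).mean(axis=(1, 3))

def roi_shift(direction: np.ndarray, step: int = 4) -> tuple:
    """Derive a (dy, dx) displacement for the next frame's RoI from the
    quadrant imbalance. The step size is an illustrative assumption,
    not a value from the paper."""
    dy = step * (direction[1, :].sum() - direction[0, :].sum())
    dx = step * (direction[:, 1].sum() - direction[:, 0].sum())
    return int(round(dy)), int(round(dx))
```

For example, an object occupying only the bottom-right quadrant yields a Direction matrix with all mass in its bottom-right cell, so the RoI moves down and to the right.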
Remark: To obtain the RoI matrix, the "Sigmoid" activation function is applied.
Another paramount issue to be covered is the scenario in which the object size changes over time. To address this common pattern in tracking, the RoI size (i.e., the height and the width) is updated with respect to the object size. This is performed for the height and the width of the RoI separately. There are two main conditions under which the RoI size is updated: first, the case in which the object height/width exceeds a predefined fraction of the RoI one, and second, the case in which it falls below a smaller fraction of it. In the former case the height/width of the RoI is enlarged by a fixed factor, while in the latter it is shrunk accordingly. The main idea behind this is not only to maintain the object in the center but also to keep the RoI-to-object size ratio approximately equal to 2, i.e., a region of interest twice the object size in each direction. To obtain the object height/width, the RoI matrix is investigated by calculating a summation over its columns/rows and determining the maximum value among them. This simple and fast approach yields a good approximation of the object size.
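The size-update rule can be sketched as follows; since the exact thresholds and rescaling factors are not recoverable from the text, the constants below are illustrative assumptions chosen to keep the RoI roughly twice the object size:

```python
import numpy as np

# Assumed constants: object-to-RoI ratio bounds and rescaling factors.
UPPER, LOWER = 0.6, 0.4
GROW, SHRINK = 1.2, 0.8

def object_extent(roi_matrix: np.ndarray) -> tuple:
    """Approximate object (height, width) from the RoI matrix by summing
    over columns/rows and taking the maximum, as described in the text."""
    height = roi_matrix.sum(axis=0).max()  # tallest column of object cells
    width = roi_matrix.sum(axis=1).max()   # widest row of object cells
    return height, width

def update_roi_size(roi_h, roi_w, obj_h, obj_w):
    """Rescale each RoI dimension independently to keep the
    RoI-to-object size ratio near 2."""
    if obj_h > UPPER * roi_h:
        roi_h *= GROW
    elif obj_h < LOWER * roi_h:
        roi_h *= SHRINK
    if obj_w > UPPER * roi_w:
        roi_w *= GROW
    elif obj_w < LOWER * roi_w:
        roi_w *= SHRINK
    return roi_h, roi_w
```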
To sum up, a critical issue to be addressed in object tracking is handling challenging situations, especially when a similar object is present in the scene and is highly likely to be misclassified as the intended object. To deal with such problems, a special approach is proposed based on incorporating both the object localization and the object recognition tasks into a light architecture. The light structure of the suggested model is of high importance, since in a real-time object tracking scenario it is essential to perform the task quickly while maintaining an acceptable accuracy.
III. Experimental Results
This section presents the results of the proposed architecture compared to the SOTA approaches, using the GOT-10k dataset to train the tracker. The average overlap rate (AO) is a metric commonly used to evaluate trackers on this dataset. Note, previously, in , a similar approach was evaluated against classic object trackers along with the SOTA presented in . Besides, a failure analysis was reported to further demonstrate the performance of the proposed model in object localization. Here, we compare the modified and empowered architecture with several SOTA approaches on the GOT-10k dataset. The output of the proposed method is a heat-map in the form of an RoI matrix, by which the object tracking is performed and the location of the object is identified. Hence, we define a metric with slight changes compared to the AO. Considering the RoI matrix presented in Fig. 3, the dot product of this matrix with the ground truth represents a value for the true positive outcomes. Besides, doing the same with the complement of the ground truth matrix yields a value for the false positive results. Based on this concept, the following equation is employed to evaluate the performance of the proposed tracker:
metric = ⟨R̂, R⟩ / (⟨R, J⟩ + ⟨R̂, J − R⟩),

where R and R̂ are the true and the estimated RoI matrices, ⟨·, ·⟩ denotes the element-wise dot product (the sum of the element-wise products), and J is a matrix of ones. It can easily be seen that this metric is a good approximation of the intersection over union (IoU); indeed, for binary matrices the numerator is the intersection and the denominator is the union. To compare the results of the proposed tracker with the SOTA, the AO is converted to the IoU, so any given overlap value maps to a corresponding IoU value. First, let us evaluate the proposed RoI extractor on the validation data provided for GOT-10k, applying it to all of the validation videos. To further investigate the capabilities of the proposed object tracker, we assume that the template of the target may be updated every N frames. Taking various values for N, Fig. 5 presents the average of the mentioned metric over all the videos for each value of N. Besides, instead of a continuous tracking scenario, the target is randomly located in the images during the tracking.
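In code, the TP/FP construction described above can be sketched as follows; the exact combination is one plausible reading, chosen so that the score reduces to the standard IoU for binary matrices:

```python
import numpy as np

def roi_metric(gt: np.ndarray, est: np.ndarray) -> float:
    """Soft-IoU-style score between ground-truth and estimated RoI
    matrices. True positives come from the dot product with the ground
    truth, false positives from the dot product with its complement."""
    tp = (est * gt).sum()               # heat-map mass inside the ground truth
    fp = (est * (1.0 - gt)).sum()       # heat-map mass outside it
    return float(tp / (gt.sum() + fp))  # ~ intersection / union
```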
| Speed (Hz) | 32.85 (1080Ti) | 26.28 (RTX) | 23.14 (2080Ti) | 23.02 (V100) | 40120 (1080Ti) | 22.67 (RTX 2060) | 8.84 (1080Ti) |
Fig. 5 indicates the capability of the proposed approach to accomplish the tracking task with no template update. In other words, the template given in the first frame is adequate to accomplish the tracking acceptably. Note, although for each video in GOT-10k the appearance varies from one frame to another, almost all of the videos contain a limited number of frames. To better illustrate the promising results of the suggested architecture, results of some SOTA approaches on GOT-10k are reported in Table I. As can be seen, these results are taken from the GOT-10k benchmark page. Comparing these results reveals the applicability of the proposed model in fulfilling the task of single object tracking. Note, according to the results presented in Fig. 5, and as mentioned earlier, the interval reported in Table I corresponds to the interval obtained for the defined metric. Considering this result, the presented object tracker accomplishes the object tracking task close to the SOTAs while having a superior speed compared to them. Note, the result of our tracker is reported on the validation dataset, as the labels are provided for the validation data. Since this is a real-time, robust approach to single object tracking, it has various applications in robotics, especially when the target template does not change significantly over time.
Now, let us investigate the tracking performance of the proposed architecture for some challenging scenarios, shown in Fig. 6. These samples better illustrate the applicability of the proposed method. The outcomes are given for the test data, except for the first scenario, which corresponds to a video from the validation dataset. The first scenario contains different views of an object along with various object deformations on a cluttered background. It is reported to indicate the capability of the presented approach in localizing the object accurately while extracting the target size. The important point to ponder is that the upper images are the RoIs that the tracker itself produces and applies to the incoming frames. In other words, the main policy of the tracker, maintaining the object in the center, is realized carefully. The second scenario, however, is more challenging: the background is similar to the target, and the intended object features are weak. Investigating the result of the CNN for each frame, the CNN apparently attends to the target properly. Note, the template is set based only on the first frame given by the user (there is no template update). The third scenario is the most challenging one, where not only are similar objects present in the scene, but several issues also arise, such as fast movement, a cluttered background, and occlusion. To better illustrate the performance of the tracker, both the RoI and the direction matrices are presented. The results clearly reveal the effectiveness of the proposed tracker. As can be seen, even the occlusion could not make the proposed tracker fail. The last frame shows that if an exact duplicate of the target is present in the scene, the tracker may have trouble distinguishing the intended object from it.
However, note that since there is motion in the tracking scenario and a limited region of interest is employed for each frame, it is possible to handle even such situations. Finally, the direction matrices guide the RoI where to move for the next frame.
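Putting the pieces together, the per-frame loop could look like the following sketch, where `localizer` stands in for the trained CNN (a hypothetical callable mapping an image crop to an RoI matrix) and the step size is an assumption:

```python
import numpy as np

def track(frames, init_box, localizer, step=4):
    """Sketch of the per-frame loop: crop an RoI around the previous
    estimate, run the localizer to get the RoI matrix, average-pool it
    into the 2x2 Direction matrix, and shift the RoI for the next frame."""
    x, y, w, h = init_box  # top-left corner and size of the initial object box
    # RoI starts at twice the object size, centred on the target.
    rx, ry, rw, rh = x - w // 2, y - h // 2, 2 * w, 2 * h
    boxes = []
    for frame in frames:
        crop = frame[ry:ry + rh, rx:rx + rw]       # limited region of interest
        roi_matrix = localizer(crop)               # object heat-map for the crop
        qh, qw = roi_matrix.shape[0] // 2, roi_matrix.shape[1] // 2
        direction = roi_matrix[:2 * qh, :2 * qw].reshape(2, qh, 2, qw).mean((1, 3))
        # Shift toward the heavier quadrants (step size is an assumption).
        ry += int(step * (direction[1].sum() - direction[0].sum()))
        rx += int(step * (direction[:, 1].sum() - direction[:, 0].sum()))
        boxes.append((rx, ry, rw, rh))
    return boxes
```

For brevity this sketch omits the size-update rule; in a full tracker, `rw` and `rh` would also be rescaled each frame from the object extent estimated on the RoI matrix.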
In conclusion, a novel approach to object tracking is proposed, including a specific architecture. The proposed CNN model contains four input branches, three of which are responsible for localizing the object according to its size, while the last one integrates the target template into the architecture. Considering the intended object template helps the RoI extractor better identify the target, while the other branches make it possible to process the input according to its specific size. This, of course, enhances the performance of the object tracker in terms of both accuracy and speed. The proposed model realizes robust object tracking in critical situations, such as the presence of similar objects or a cluttered background. Finally, the comparison with the SOTA on the GOT-10k dataset illustrates the effectiveness of the suggested approach.
-  (2016) Fully-convolutional siamese networks for object tracking. In European conference on computer vision, pp. 850–865. Cited by: §I.
-  (2019) Learning discriminative model prediction for tracking. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6182–6191. Cited by: §I.
-  (2020) High-performance long-term tracking with meta-updater. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6298–6307. Cited by: §I.
-  (2019) Atom: accurate tracking by overlap maximization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4660–4669. Cited by: §I.
-  (2019) Visdrone-sot2019: the vision meets drone single object tracking challenge results. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §I.
-  (2018) Single object tracking applied to an aircraft. In 2018 15th International Multi-Conference on Systems, Signals & Devices (SSD), pp. 1441–1446. Cited by: §I.
-  (2019) Lasot: a high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5374–5383. Cited by: §I.
-  (2020) Globaltrack: a simple and strong baseline for long-term tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 11037–11044. Cited by: §I.
-  (2018) On-line fusion of trackers for single-object tracking. Pattern Recognition 74, pp. 459–473. Cited by: §I.
-  (2018) High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8971–8980. Cited by: §I.
-  (2018) Context-aware three-dimensional mean-shift with occlusion handling for robust object tracking in rgb-d videos. IEEE Transactions on Multimedia 21 (3), pp. 664–677. Cited by: §I.
-  (2018) Robust object tracking based on recurrent neural networks. In 2018 6th RSI International Conference on Robotics and Mechatronics (IcRoM), pp. 507–511. Cited by: §I.
-  (2020) Surgical instrument tracking for vitreo-retinal eye surgical procedures using aras-eye dataset. In 2020 28th Iranian Conference on Electrical Engineering (ICEE), pp. 1–6. Cited by: §I.
-  (2021) Object localization through a single multiple-model convolutional neural network with a specific training approach. arXiv preprint arXiv:2103.13339. Cited by: Fig. 2, §II, §II, §III.
-  (2020) Multiple object tracking: a literature review. Artificial Intelligence, pp. 103448. Cited by: §I.
-  (2018) Deep attentive tracking via reciprocative learning. In Advances in neural information processing systems, pp. 1931–1941. Cited by: §I.
-  (2018) Vital: visual tracking via adversarial learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8990–8999. Cited by: §I.
-  (2019) 'Skimming-Perusal' tracking: a framework for real-time and robust long-term tracking. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2385–2393. Cited by: §I, §III.
-  (2006) Object tracking: a survey. Acm computing surveys (CSUR) 38 (4), pp. 13–es. Cited by: §I.
-  (2015) Single object tracking with fuzzy least squares support vector machine. IEEE Transactions on Image Processing 24 (12), pp. 5723–5738. Cited by: §I.
-  (2020) Long-term tracking with deep tracklet association. IEEE Transactions on Image Processing 29, pp. 6694–6706. Cited by: §I.
-  (2018) Learning regression and verification networks for long-term visual tracking. arXiv preprint arXiv:1809.04320. Cited by: §I.