Visual tracking of objects in a scene is a very important component of a unified robotic vision system. Robots need to track objects in order to interact with them. As robots and other autonomous vehicles move closer to operating in our everyday environment, they will have to avoid other moving objects, humans, and animals.
The object-tracking performance of the human visual system is currently unsurpassed by engineered systems, so our research tries to take inspiration from and reverse-engineer the known principles of cortical processing during visual tracking. Visual tracking is a complex task, and neuroscience studies of cortical processing paint an incomplete picture, so they can only partially guide the design of a synthetic solution. Nevertheless, a few key features arise from studying the human visual system and its tracking abilities: (1) the human visual system is not limited to conventional three-dimensional objects in space, but rather is able to track a set of visual features [Blaser et al., 2000]; thus "object" in this paper refers to a distinct group of features in two-dimensional space. (2) Humans do not need knowledge of the object class before visual tracking, and (3) humans can track an object after a very brief presentation. Even though the human visual system does not operate with frames, it is common to require that synthetic systems be able to track from a single frame, or just a few (tens).
Table 1: Benchmark videos selected from the TLD dataset and their lengths.

| Video | 1. David | 2. Jumping | 3. Pedestrian1 | 4. Pedestrian2 | 5. Pedestrian3 | 6. Car |
|---|---|---|---|---|---|---|
| Number of frames | 761 | 313 | 140 | 338 | 184 | 945 |

Table 2: Number of correctly tracked frames per tracker.

| Tracker | 1. David | 2. Jumping | 3. Pedestrian1 | 4. Pedestrian2 | 5. Pedestrian3 | 6. Car |
|---|---|---|---|---|---|---|
| Number of frames | 761 | 313 | 140 | 338 | 184 | 945 |
| [Lim et al., 2004] | 17 | 75 | 11 | 33 | 50 | 163 |
| [Collins et al., 2005] | n/a | 313 | 6 | 8 | 5 | n/a |
| [Babenko et al., 2009] | 135 | 313 | 101 | 37 | 49 | 45 |
| [Kalal et al., 2010b] | 761 | 170 | 140 | 97 | 52 | 510 |
| SMR (this work) | 761 | 313 | 140 | 236 | 66 | 510 |
Visual tracking in artificial systems has been studied for decades, with laudable results [Yilmaz et al., 2006]. In this paper we focus on bio-inspired visual tracking systems that can be part of a unified neurally-inspired vision system. Ideally, a unified visual model would be able to parse a scene and detect an object in every frame, but currently no bio-inspired model can do this in real time [DiCarlo et al., 2012, LeCun et al., 2004, Serre et al., 2007]. Deep neural networks come close to this performance when trained to look for a single object on a large collection of images [Sermanet et al., 2011].
When we think of visual tracking we often have in mind a familiar object in space. But humans are able to track any localized variation in a 2D field, such as a set of features [Blaser et al., 2000]. It is a high-SNR peak detector that allows us to track a puff of smoke or a cloud, for example. A bio-inspired synthetic visual tracker is generally thought of as having two outputs of the same unified stream: one is a deep neural network classifier capable of categorizing objects; the other is a shallower classifier that can group features into objectness. The deep system makes it possible to continue tracking an object as it disappears and reappears in the scene, while the shallow system provides rapid grouping of local features by tracking local maxima in the retinal space. Such a distinction might be necessary because a deep system needs 100-200 ms to process one visual scene [Thorpe et al., 1996], while tracking without predicting object movement, such as that required for the oculo-motor control of smooth pursuit [Wilmer and Nakayama, 2007], requires faster processing of the visual stream.
Inspired by recent findings on shallow feature extractors of the visual cortex [Vintch et al., 2010], we postulate that simple tracking processes are based on a shallow neural network that can quickly identify similarities between object features repeated in time. We propose an algorithm that tracks and extracts the motion of an object based on the similarity between local features observed in subsequent frames. The local features are initially defined by a bounding box around the object to track.
Traditional template matching algorithms define the tracking problem as follows: we are given two images $I$ and $T$, which represent the pixel values at each location $\mathbf{x} = (x, y)$. We want to find the distance vector $\mathbf{h}$ that minimizes some measure of the difference between $T(\mathbf{x})$ and $I(\mathbf{x} + \mathbf{h})$ [Lucas and Kanade, 1981]. The measure can be cross-correlation, image intensity, color features, image gradients, or color histograms. However, this traditional definition of tracking suffers from outliers or regions that drastically change their appearance or disappear from the scene.
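As a concrete baseline, here is a minimal sketch of this traditional formulation: an exhaustive search that minimizes the sum of absolute differences (SAD) over a small neighborhood. The function name `sad_match` and the search parameters are our illustrative assumptions, not from the paper.

```python
import numpy as np

def sad_match(frame, template, search_center, radius):
    """Classic template matching: find the displacement that MINIMIZES
    the sum of absolute differences (SAD) between template and patch."""
    th, tw = template.shape
    cy, cx = search_center
    best, best_pos = np.inf, search_center
    for y in range(max(0, cy - radius), cy + radius + 1):
        for x in range(max(0, cx - radius), cx + radius + 1):
            patch = frame[y:y + th, x:x + tw]
            if patch.shape != template.shape:
                continue  # patch falls off the frame border
            sad = np.abs(patch.astype(float) - template.astype(float)).sum()
            if sad < best:
                best, best_pos = sad, (y, x)
    return best_pos, best
```

Every pixel contributes its full difference to the cost, which is exactly why a few outlying pixels can dominate the decision.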
In our work we change this definition of tracking and propose a novel approach, the Similarity Matching Ratio (SMR). Instead of trying to minimize some measure of difference between $T(\mathbf{x})$ and $I(\mathbf{x} + \mathbf{h})$, we want to find the $\mathbf{h}$ that gives the best match ratio between $T(\mathbf{x})$ and $I(\mathbf{x} + \mathbf{h})$. To do this, we turn the difference at each pixel into a probability value and accumulate it over every pixel that has a good match. If there is no good match between $T(\mathbf{x})$ and $I(\mathbf{x} + \mathbf{h})$, the difference contributes zero probability, because we are not interested in how badly two pixels match. This approach is more robust to appearance change, disappearance, and outliers. The method is tested on challenging benchmark video sequences that include camera movement, partial and full occlusion, illumination change, scale change, and similar objects. State-of-the-art performance is achieved on these sequences.
2 Previous Work
Most popular trackers based on the traditional definition of the tracking problem (e.g., Sum-of-Squared-Differences (SSD), Sum-of-Absolute-Differences (SAD), the Lucas-Kanade tracker) try to find the distance vector $\mathbf{h}$ that minimizes the difference between $T(\mathbf{x})$ and $I(\mathbf{x} + \mathbf{h})$, either on the grayscale or the color image. However, the template may include outliers or parts that dramatically change or disappear, which causes tracking failure. The common approach to overcome these failures is that trackers should not treat all pixels uniformly but instead eliminate outliers from the computation.
Some studies [Comaniciu et al., 2003, Shi and Tomasi, 1994] proposed using a weighted histogram as the measure to minimize for tracking. Assuming that pixels close to the center are the most reliable, these methods weigh them more heavily, since occlusions and interferences tend to occur close to the boundaries. However, a dramatic change in appearance can occur even at the center, which this method cannot handle.
There are studies that aim to detect outliers and suppress them from the computation. [Hager and Belhumeur, 1998] uses the common approach that outliers produce large image differences that can be detected by the estimation process [Black and Jepson, 1998]. Residuals are calculated iteratively, and if the variations of the residual are larger than a user-defined threshold, they are considered outliers and suppressed. [Ishikawa et al., 2002] uses the spatial-coherence property of outliers: outliers tend to form a spatially coherent group rather than being randomly distributed across the template. In that work the template is divided into blocks and a constant weight is assigned to each block. If the image differences of a block between frames are large, the block contains a significant amount of outliers, and the method excludes such blocks from the minimization. These methods are robust to outliers; however, they are computationally expensive.
[Kalal et al., 2010b] tracks points from the template back and forth between the previous and current frames and validates the detection. This enables the tracker to avoid points that disappear from the camera view or change appearance drastically. Before our work, Kalal's tracker was the state of the art.
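A minimal sketch of this forward-backward idea, using a plain SAD block search in both directions; the function names, patch size, search radius, and tolerance are illustrative assumptions, not Kalal et al.'s implementation:

```python
import numpy as np

def sad_track(src, dst, pos, size, radius):
    """Locate the size x size patch of `src` at `pos` inside `dst` via an
    exhaustive SAD search over a (2*radius+1)^2 neighborhood."""
    y0, x0 = pos
    tmpl = src[y0:y0 + size, x0:x0 + size].astype(float)
    best, best_pos = np.inf, pos
    for y in range(max(0, y0 - radius), y0 + radius + 1):
        for x in range(max(0, x0 - radius), x0 + radius + 1):
            patch = dst[y:y + size, x:x + size].astype(float)
            if patch.shape != tmpl.shape:
                continue  # patch falls off the frame border
            d = np.abs(patch - tmpl).sum()
            if d < best:
                best, best_pos = d, (y, x)
    return best_pos

def forward_backward_valid(prev, curr, pos, size=8, radius=4, tol=1.0):
    """Forward-backward check: track prev -> curr, then track the result
    back curr -> prev; accept the detection only if the round trip
    returns close to the starting position."""
    fwd = sad_track(prev, curr, pos, size, radius)
    back = sad_track(curr, prev, fwd, size, radius)
    err = np.hypot(back[0] - pos[0], back[1] - pos[1])
    return err <= tol, fwd
```

Points that have disappeared or changed drastically fail the round trip and can be excluded from the motion estimate.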
3 Similarity Matching Ratio (SMR) Tracker
The SMR tracker uses a modified template-matching algorithm that looks for similarity between a template $T$ and patches $P_{\mathbf{h}}$ of a new video frame $I$. The SMR computes the difference between the template and a patch at each pixel. The template is slid convolutionally over the new video frame with a step of one pixel. If the pixel difference is lower than a threshold, it is added to the output after a negative-exponential distance conversion. This thresholding eliminates outlying pixels, so that they do not appear in the final output. The SMR algorithm is as follows:
1. The search area is limited to the neighborhood of the target's previous position.
2. For each pixel $\mathbf{x}$ in the template $T$, the method checks whether the condition $|T(\mathbf{x}) - P_{\mathbf{h}}(\mathbf{x})| < \theta$ is satisfied, where $\theta$ is the dynamic threshold defined in step 6.
3. If it is satisfied, we are interested in how close the match is, so the pixel difference is converted into a probability value by $e^{-|T(\mathbf{x}) - P_{\mathbf{h}}(\mathbf{x})|}$. Otherwise the pixel is ignored.
4. The probability values are summed up for each patch. The algorithm finds the $\mathbf{h}$ that gives the highest similarity matching ratio.
5. The winning patch is extracted at every detection and assigned as the new template.
6. The dynamic threshold $\theta$ is recomputed at each frame and scaled by a constant determined experimentally.
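The steps above can be sketched as follows. For simplicity this version uses a fixed threshold `theta` instead of the dynamic one of step 6, and the function name and parameters are ours, not the paper's:

```python
import numpy as np

def smr_match(frame, template, search_center, radius, theta):
    """SMR matching: per-pixel absolute differences below `theta` are
    converted to probabilities with a negative exponential and SUMMED;
    pixels at or above `theta` (outliers) contribute nothing.  The
    displacement MAXIMIZING the sum wins."""
    th, tw = template.shape
    cy, cx = search_center
    tmpl = template.astype(float)
    best, best_pos = -1.0, search_center
    for y in range(max(0, cy - radius), cy + radius + 1):
        for x in range(max(0, cx - radius), cx + radius + 1):
            patch = frame[y:y + th, x:x + tw].astype(float)
            if patch.shape != tmpl.shape:
                continue  # patch falls off the frame border
            d = np.abs(patch - tmpl)
            score = np.exp(-d[d < theta]).sum()  # only good matches count
            if score > best:
                best, best_pos = score, (y, x)
    return best_pos, best
```

Note the inversion relative to SAD: the score is maximized, and large differences are silently dropped instead of dominating the cost.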
The biggest advantage of the SMR is that pixel differences above the threshold do not contribute to the matching-similarity output. These pixels may be outliers or points that dramatically change appearance, and thus should not affect the matching similarity. Outlying pixels usually only increase the error and cause failure, so we chose to ignore them. This way only reliably matching pixels contribute to the output of each matching step.
We tested this approach on a challenging benchmark: the TLD dataset [Kalal et al., 2010a]. From this dataset six videos with different properties were selected, as displayed in Table 1. Each video contains only one target. The metric used is the number of correctly tracked frames. For this test, color videos were converted to grayscale. State-of-the-art performance was achieved; results are presented in Table 2.
4 Comparison with the SAD Tracker

To illustrate how the SMR tracker's qualitatively different definition of the tracking problem provides better results than the traditional approach, in this section we compare the SMR tracker with the SAD tracker.
Figure 1 shows the detections from the SAD tracker and the SMR tracker when using the same template. Points that dramatically changed appearance cause the SAD tracker to fail, whereas the SMR tracker correctly detects the object. For illustration, the per-pixel differences between the template and the patches detected by the SAD and SMR trackers are mapped in Figure 1. The patch the SMR tracker detected has a larger sum of absolute differences, but only because of the region that dramatically changed appearance; that patch has many close matches with the template, as can be seen in Figure 2, so the SMR tracker is able to detect it. By the same principle, the SMR tracker can track the object when it is leaving the scene, as shown in Figure 3.
The SMR tracker is more robust to outliers than the traditional approach. As can be seen in Figure 4, outliers cause the SAD tracker to drift away from the object, whereas the SMR tracker (Figure 4) finds the target. Ideally the bounding box should be entirely filled with the target. However, during long-term tracking the object may move back and forth and rotate, which causes some background pixels to be included in the next template. A tracker does not know which pixels belong to the object and which belong to the background. The SMR tracker, however, has a higher probability of rejecting background pixels, as they tend to change more.
From the 2nd frame to the 3rd in Figure 4 (bottom), the SAD tracker drifts away from the object because background pixels have become included in the bounding box and propagate to the template. When the face moves right, the SAD tracker does not move, because the high-contrast background produces large differences if the bounding box shifts to a new position. The traditional approach thus gives priority to preventing large differences when making a decision, even if the responsible pixels are not the majority of the template. The SMR tracker, in contrast, focuses on the number of pixels that have small differences with the template, which in this case is the face (Figure 4, top).
5 Failure Mode
Even though the SMR tracker updates the template at every frame, drifts caused by the accumulation of small errors at each detection were not observed when applying the method to the benchmark dataset. However, when an object becomes occluded very slowly, updating the template at every frame causes the template to include foreground pixels that do not belong to the object. An example can be seen in Figure 5. A better template-update mechanism would prevent this kind of failure, but it would most probably require a classifier, which is out of the scope of this paper.
6 Conclusion

This paper proposes a novel approach to tracking: the Similarity Matching Ratio (SMR). The SMR tracker is more robust to outliers than traditional approaches because it does not accumulate the differences between the template and the frame at every pixel; instead, it accumulates probabilities from the pixels that have small differences from the template. The SMR tracker tries to find the region that maximizes the number of good matches rather than minimizing the differences over the whole template, and this proves to be a superior approach. The SMR tracker was tested on challenging video sequences and achieves state-of-the-art performance (see Table 2).
- [Avidan, 2007] Avidan, S. (2007). Ensemble tracking. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29(2):261–271.
- [Babenko et al., 2009] Babenko, B., Yang, M., and Belongie, S. (2009). Visual tracking with online multiple instance learning. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 983–990. IEEE.
- [Black and Jepson, 1998] Black, M. and Jepson, A. (1998). Eigentracking: Robust matching and tracking of articulated objects using a view-based representation. International Journal of Computer Vision, 26(1):63–84.
- [Blaser et al., 2000] Blaser, E., Pylyshyn, Z., Holcombe, A., et al. (2000). Tracking an object through feature space. Nature, 408(6809):196–198.
- [Collins et al., 2005] Collins, R., Liu, Y., and Leordeanu, M. (2005). Online selection of discriminative tracking features. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(10):1631–1643.
- [Comaniciu et al., 2003] Comaniciu, D., Ramesh, V., and Meer, P. (2003). Kernel-based object tracking. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 25(5):564–577.
- [DiCarlo et al., 2012] DiCarlo, J., Zoccolan, D., and Rust, N. (2012). How does the brain solve visual object recognition? Neuron, 73(3):415–434.
- [Hager and Belhumeur, 1998] Hager, G. and Belhumeur, P. (1998). Efficient region tracking with parametric models of geometry and illumination. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(10):1025–1039.
- [Ishikawa et al., 2002] Ishikawa, T., Matthews, I., and Baker, S. (2002). Efficient image alignment with outlier rejection. Citeseer.
- [Kalal et al., 2010a] Kalal, Z., Matas, J., and Mikolajczyk, K. (2010a). P-N Learning: Bootstrapping Binary Classifiers by Structural Constraints. Conference on Computer Vision and Pattern Recognition.
- [Kalal et al., 2010b] Kalal, Z., Mikolajczyk, K., and Matas, J. (2010b). Forward-backward error: Automatic detection of tracking failures. In Pattern Recognition (ICPR), 2010 20th International Conference on, pages 2756–2759. IEEE.
- [LeCun et al., 2004] LeCun, Y., Huang, F., and Bottou, L. (2004). Learning methods for generic object recognition with invariance to pose and lighting. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–97. IEEE.
- [Lim et al., 2004] Lim, J., Ross, D., Lin, R., and Yang, M. (2004). Incremental learning for visual tracking. Advances in neural information processing systems, 17:793–800.
- [Lucas and Kanade, 1981] Lucas, B. and Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. Proceedings of the 7th International Joint Conference on Artificial Intelligence.
- [Sermanet et al., 2011] Sermanet, P., Kavukcuoglu, K., and LeCun, Y. (2011). Traffic signs and pedestrians vision with multi-scale convolutional networks. Snowbird Machine Learning Workshop.
- [Serre et al., 2007] Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., and Poggio, T. (2007). Robust object recognition with cortex-like mechanisms. IEEE Trans. Pattern Anal. Mach. Intell., 29:411–426.
- [Shi and Tomasi, 1994] Shi, J. and Tomasi, C. (1994). Good features to track. In Computer Vision and Pattern Recognition, 1994. Proceedings CVPR’94., 1994 IEEE Computer Society Conference on, pages 593–600. IEEE.
- [Thorpe et al., 1996] Thorpe, S., Fize, D., Marlot, C., et al. (1996). Speed of processing in the human visual system. Nature, 381(6582):520–522.
- [Vintch et al., 2010] Vintch, B., Movshon, J. A., and Simoncelli, E. P. (2010). Characterizing receptive field structure of macaque v2 neurons in terms of their v1 afferents. Annual meeting in Neuroscience.
- [Wilmer and Nakayama, 2007] Wilmer, J. and Nakayama, K. (2007). Two distinct visual motion mechanisms for smooth pursuit: Evidence from individual differences. Neuron, 54(6):987–1000.
- [Yilmaz et al., 2006] Yilmaz, A., Javed, O., and Shah, M. (2006). Object tracking: A survey. Acm Computing Surveys (CSUR), 38(4):13.