A multi-feature tracking algorithm enabling adaptation to context variations

by   Duc Phu Chau, et al.

We propose in this paper a tracking algorithm which is able to adapt itself to different scene contexts. A feature pool is used to compute the matching score between two detected objects. This feature pool includes 2D and 3D displacement distances, 2D sizes, color histogram, histogram of oriented gradients (HOG), color covariance and dominant color. An offline learning process is proposed to search for useful features and to estimate their weights for each context. In the online tracking process, a temporal window is defined to establish the links between the detected objects. This enables the tracker to find object trajectories even if the objects are misdetected in some frames. A trajectory filter is proposed to remove noisy trajectories. The proposed tracker has been tested on videos belonging to three public datasets and to the Caretaker European project. The experimental results demonstrate the effectiveness of the proposed feature weight learning, and the robustness of the proposed tracker compared to several state-of-the-art methods. The contributions of our approach over state-of-the-art trackers are: (i) a robust tracking algorithm based on a feature pool, (ii) a supervised learning scheme to learn feature weights for each context, (iii) a new method to quantify the reliability of the HOG descriptor, and (iv) a combination of color covariance and dominant color features with a spatial pyramid distance to manage object occlusion.




1 Introduction

Many approaches have been proposed to track mobile objects in a scene [Yilmaz]. The challenge is to obtain tracking algorithms which perform well under different scene conditions (e.g. different people density levels, different illumination conditions) and which are able to tune their parameters accordingly. The idea of automatic control for adapting an algorithm to context variations has already been studied [monique, hall, prost]. In [monique], the authors have presented a framework which integrates knowledge and uses it to control image processing programs. However, the construction of a knowledge base requires a lot of time and data, and their study is restricted to static image processing (no video). In [hall], the author has presented an architecture for a self-adaptive perceptual system in which the "auto-criticism" stage plays the role of an online evaluation process. To do that, the system computes a trajectory goodness score based on clusters of typical trajectories. Therefore, this method can only be applied to scenes where mobile objects move on well-defined paths. In [prost], the authors have presented a tracking framework which is able to control a set of different trackers to get the best possible performance. The approach is interesting, but the authors do not describe how to evaluate the tracking quality online, and the execution of three trackers in parallel is very expensive in terms of processing time.

In order to overcome these limitations, we propose a tracking algorithm that is able to adapt itself to different contexts. The notion of context used in this paper covers a set of scene properties: density of mobile objects, frequency of occlusion occurrences, illumination intensity, contrast level and the depth of the scene. These properties have a strong effect on the tracking quality. In order to track object movements in different contexts, we first define a feature pool in which each weighted feature combination can help the system to improve its performance in a given context. However, configuring these features (i.e. determining the feature weight values) is a hard task because the user has to quantify correctly the importance of each feature in the considered context. To facilitate this task, we propose an offline learning algorithm based on Adaboost [adaboost] to compute feature weight values for each context. In this work, we make two assumptions. First, each video has a stable context. Second, for each context, there exists a training video set.

The paper is organized as follows: The next section presents the feature pool and explains how to use it to compute link similarity between the detected objects. Section 3 describes the offline learning process to tune the feature weights for each scene context. Section 4 shows in detail the different stages of the tracking process. The results of the experimentation and validation can be found in section 5. A conclusion as well as future work are given in the last section.

2 Feature pool and link similarity

2.1 Feature pool

The principle of the proposed tracking algorithm is based on the coherence of mobile object features over time. In this paper, we define a set of 8 different features to compute a link similarity between two mobile objects o_i and o_j within a temporal window (see figure 1).

2.1.1 2D and 3D displacement distance similarity

Depending on the object type (e.g. car, bicycle, walker), the object speed cannot exceed a fixed threshold. Let D_max be the possible maximal 3D displacement of a mobile object for one frame in a video and d be the 3D distance between the two considered objects. We define the similarity LS_1 between these two objects using the 3D displacement distance feature as follows:

LS_1 = max(0, 1 - d / (D_max * n))

where n is the temporal difference (in frames) between the two considered objects.

Similarly, we also define a similarity LS_2 between two objects using the displacement distance feature in the 2D image coordinate system.
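As a minimal Python sketch of the displacement distance similarity (function and parameter names are ours; the linear form clipped at zero follows the description above, with d_max the maximal per-frame displacement and n_frames the temporal difference between the two detections):

```python
def displacement_similarity(d, d_max, n_frames):
    """Displacement distance similarity: 1.0 when the two detections
    coincide, decreasing linearly with distance, and clipped at 0 once
    the distance exceeds the maximal displacement over n_frames frames."""
    return max(0.0, 1.0 - d / (d_max * n_frames))
```

The same function serves for the 2D variant by passing 2D distances and a 2D maximal displacement.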

2.1.2 2D shape ratio and area similarity

Let W_i and H_i be the width and height of the 2D bounding box of object o_i. The 2D shape ratio and area of this object are respectively defined as W_i / H_i and W_i * H_i. If no occlusion occurs and mobile objects are well detected, the shape ratio and area of a mobile object do not vary much within a temporal window, even if the lighting and contrast conditions are poor. The similarity LS_3 between the 2D shape ratios of objects o_i and o_j is defined as follows:

LS_3 = min(W_i / H_i, W_j / H_j) / max(W_i / H_i, W_j / H_j)

Similarly, the similarity LS_4 between the 2D areas of objects o_i and o_j is defined as follows:

LS_4 = min(W_i * H_i, W_j * H_j) / max(W_i * H_i, W_j * H_j)
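Both similarities are min/max ratios in (0, 1], equal to 1 for identical values; a minimal sketch with hypothetical names:

```python
def ratio_similarity(a, b):
    """min/max ratio of two positive values: 1.0 when equal, tending to 0
    as they diverge."""
    return min(a, b) / max(a, b)

def shape_ratio_similarity(w_i, h_i, w_j, h_j):
    """Similarity of the 2D shape ratios W/H of two bounding boxes."""
    return ratio_similarity(w_i / h_i, w_j / h_j)

def area_similarity(w_i, h_i, w_j, h_j):
    """Similarity of the 2D areas W*H of two bounding boxes."""
    return ratio_similarity(w_i * h_i, w_j * h_j)
```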
2.1.3 Color histogram similarity

In this work, the color histogram of a mobile object is defined as a normalized RGB color histogram of the moving pixels inside its bounding box. We define the link similarity LS_5 between two objects o_i and o_j for the color histogram feature as the intersection of their histograms:

LS_5 = sum_{k=1}^{3Q} min(H_i(k), H_j(k))

where Q is a parameter representing the number of histogram bins for each color channel, and H_i(k), H_j(k) are respectively the histogram values of objects o_i and o_j at bin k.
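A sketch of such a similarity (assuming the histogram-intersection form described above, applied to two normalized histograms with identical binning):

```python
def histogram_similarity(h_i, h_j):
    """Histogram intersection of two normalized histograms with the same
    binning: the sum of per-bin minima, giving a value in [0, 1]."""
    assert len(h_i) == len(h_j), "histograms must share the same binning"
    return sum(min(a, b) for a, b in zip(h_i, h_j))
```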

2.1.4 HOG similarity

In case of occlusion, the system may fail to detect the full appearance of mobile objects, and the above features then become unreliable. In order to address this issue, we propose to use the HOG descriptor to track interest points locally on mobile objects and to compute the trajectories of these points. The HOG similarity between two objects is defined as a value proportional to the number of pairs of tracked points belonging to both objects. In [piotr], the authors propose a method to track FAST points based on their HOG descriptors. However, the authors do not compute the reliability level of the obtained point trajectories. In this work, we define a method to quantify the reliability of the trajectory of each interest point by considering the coherence of the frame-to-frame (F2F) distance, the direction and the HOG similarity of the points belonging to a same trajectory. We assume that the variation of these features follows a Gaussian distribution.

Let {p_1, ..., p_n} be the trajectory of a point, where p_n is on the currently tracked object and p_1 is on an object detected previously. We define the coherence score of the F2F distance of point p_n as follows:

c_dist(p_n) = exp( - (d_n - mu_d)^2 / (2 * sigma_d^2) )

where d_n is the 2D distance between p_n and p_{n-1}, and mu_d and sigma_d are respectively the mean and standard deviation of the F2F distance distribution formed by the set of points {p_1, ..., p_{n-1}}.

In the same way, we compute the direction coherence score and the HOG similarity coherence score of each interest point. Finally, for each interest point on the tracked object, we define an overall coherence score as the mean value of these three coherence scores.

Let P be the set of interest point pairs whose trajectories pass through the two considered objects o_i and o_j, and let c_u (respectively c_v) be the coherence score of point u (respectively v) on object o_i (respectively o_j) belonging to set P. We define the HOG similarity between these two objects as follows:

LS_6 = ( sum_{(u,v) in P} (c_u + c_v) ) / (M_i + M_j)

where M_i and M_j are the total numbers of interest points detected on objects o_i and o_j.
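The coherence scoring can be sketched as follows (names are ours; the Gaussian form follows the stated Gaussian assumption, and the summed pair scores are normalized by the total interest point counts of the two objects):

```python
import math

def f2f_coherence(d, mu, sigma):
    """Gaussian coherence of a frame-to-frame distance d against the mean
    and standard deviation of the point trajectory's own F2F distances."""
    if sigma == 0:
        return 1.0 if d == mu else 0.0
    return math.exp(-((d - mu) ** 2) / (2 * sigma ** 2))

def hog_similarity(pair_coherences, m_i, m_j):
    """pair_coherences: the summed coherence scores (c_u + c_v) of each
    interest point pair whose trajectory crosses both objects; m_i and m_j
    are the total interest point counts detected on the two objects."""
    return sum(pair_coherences) / (m_i + m_j)
```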

2.1.5 Color covariance similarity

Color covariance is a very useful feature to characterize the appearance model of an image region. In particular, the color covariance matrix enables the comparison of regions of different sizes and is invariant to identical shifting of color values. This becomes an advantageous property when objects are tracked under varying illumination conditions. In [sbak], for a point k in a given image region R, the authors define a covariance matrix corresponding to 11 descriptors: d_k = [x, y, R_xy, G_xy, B_xy, M_R, O_R, M_G, O_G, M_B, O_B], where (x, y) is the pixel location, R_xy, G_xy and B_xy are the RGB channel values, and M_c, O_c correspond to the gradient magnitude and orientation of channel c at position (x, y).

We use the distance defined in [forstner] to compare two covariance matrices:

rho(C_i, C_j) = sqrt( sum_{k=1}^{N} ln^2 lambda_k(C_i, C_j) )

where N is the number of considered image descriptors (11 in this case) and lambda_k(C_i, C_j) is the k-th generalized eigenvalue of C_i and C_j.
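This metric can be sketched with NumPy (names are ours; the generalized eigenvalues of the pair (C_i, C_j) are obtained as the eigenvalues of C_j^{-1} C_i):

```python
import numpy as np

def covariance_distance(c_i, c_j):
    """Foerstner distance between two SPD covariance matrices: the square
    root of the sum of squared logarithms of their generalized eigenvalues."""
    # eigenvalues of inv(c_j) @ c_i solve det(c_i - lambda * c_j) = 0
    lam = np.linalg.eigvals(np.linalg.solve(c_j, c_i))
    return float(np.sqrt(np.sum(np.log(lam.real) ** 2)))
```

The distance is zero for identical matrices and grows with the spread of the generalized eigenvalues away from 1.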

In order to take into account the spatial coherence of the color covariance distance and also to manage occlusion cases, we propose to use the spatial pyramid distance defined in [Grauman]. The main idea is to divide the image region of a considered object into a set of sub-regions. At each level l (0 <= l <= L), the considered region is divided into 2^l x 2^l sub-regions. We then compute the local color covariance distance for each pair of corresponding sub-regions. This per-pair computation helps to evaluate the spatial structure coherence between the two considered objects. In the case of occlusion, the color covariance distance between two regions corresponding to occluded parts is very high. Therefore, we keep only the lower half of the color covariance distances (i.e. the highest similarities) at each level to compute the final color covariance distance.
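A sketch of this occlusion-tolerant pyramid matching (the names, the tiling helper and the averaging over levels are our assumptions; local_dist is any region-level distance, such as the color covariance distance):

```python
import numpy as np

def split(region, n):
    """Split a 2D array into n x n roughly equal tiles, row-major."""
    h, w = region.shape[:2]
    return [region[r * h // n:(r + 1) * h // n, c * w // n:(c + 1) * w // n]
            for r in range(n) for c in range(n)]

def spatial_pyramid_distance(local_dist, region_i, region_j, levels=2):
    """At each level l the two regions are split into 2**l x 2**l aligned
    sub-regions; corresponding sub-regions are compared, and only the lower
    half of the distances (the best-matching parts) is kept, so occluded
    parts with very high distances do not dominate the final score."""
    total = 0.0
    for l in range(levels + 1):
        n = 2 ** l
        d = sorted(local_dist(a, b)
                   for a, b in zip(split(region_i, n), split(region_j, n)))
        kept = d[: max(1, len(d) // 2)]   # lowest half of the distances
        total += sum(kept) / len(kept)
    return total / (levels + 1)
```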

The similarity LS_7 of this feature is defined as a function of the spatial pyramid distance:

LS_7 = max(0, 1 - rho / rho_max)

where rho is the spatial pyramid distance of the color covariance between the two considered objects, and rho_max is the maximum distance for two color covariance matrices to be considered as similar.

2.1.6 Dominant color similarity

Dominant color descriptor (DCD) has been proposed by MPEG-7 and is extensively used for image retrieval [Yang]. It is a reliable color feature because it takes into account only the important colors of the considered image region. The DCD of an image region is defined as F = {(c_i, p_i), i = 1, ..., A}, where A is the total number of dominant colors in the considered image region, c_i is a 3D RGB color vector and p_i is its occurrence percentage, with sum_i p_i = 1.

Let F_1 and F_2 be the DCDs of the two image regions of the considered objects. The dominant color distance between these two regions is defined using the similarity measure proposed in [Yang]. As for the color covariance feature, in order to take into account the spatial coherence and occlusion cases, we use the spatial pyramid distance for the dominant color feature. The similarity LS_8 of this feature is defined as a function of the spatial pyramid distance as follows:

LS_8 = 1 - rho_DC

where rho_DC is the spatial pyramid distance (normalized to [0, 1]) of the dominant colors between the two considered objects.

2.2 Link similarity

Using the eight features described above, the link similarity LS(o_i, o_j) is defined as a weighted combination of the feature similarities between objects o_i and o_j:

LS(o_i, o_j) = ( sum_{k=1}^{8} w_k LS_k(o_i, o_j) ) / ( sum_{k=1}^{8} w_k )

where w_k is the weight of feature k (corresponding to its effectiveness) and at least one weight is non-null.
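A minimal sketch of this weighted combination (names are ours; the weights come from the learning stage of section 3, with at least one non-null):

```python
def link_similarity(feature_sims, weights):
    """Weighted mean of the per-feature similarities between two objects."""
    assert any(w > 0 for w in weights), "at least one feature weight must be non-null"
    return sum(w * s for w, s in zip(weights, feature_sims)) / sum(weights)
```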

3 Learning feature weights

Each feature described above is effective in some particular context conditions. However, how can the user correctly quantify the significance of a feature for a given context? In order to address this issue, we propose in this paper an offline supervised learning process using the Adaboost algorithm [adaboost]. First, a weak classifier is defined per feature. Then a strong classifier combining these eight weak classifiers (corresponding to the eight features) with their weights is learnt.

For each context, we select a learning video sequence representative of this context. First, each object pair in two consecutive frames (called a training sample and denoted x_i) is classified into one of two classes {+1, -1}: y_i = +1 if the pair belongs to the same tracked object and y_i = -1 otherwise. For each feature k, we define a weak classifier h_k for a pair x_i as follows:

h_k(x_i) = +1 if LS_k(x_i) >= Th_k, and -1 otherwise

where LS_k(x_i) is the similarity score of feature k (defined in section 2.1) between the two objects of the pair, and Th_k is a predefined threshold representing the minimum feature similarity for the pair to be considered as similar.

The loss function of the Adaboost algorithm at iteration t for each feature k is defined as:

epsilon_{t,k} = sum_i D_t(i) [ y_i != h_k(x_i) ]

where D_t(i) is the weight of training sample x_i at iteration t. At each iteration t, the goal is to find the feature k whose loss function is minimum. The corresponding weak classifier and loss value are denoted h_t and epsilon_t. The weight of this weak classifier, denoted alpha_t, is computed as follows:

alpha_t = (1/2) ln( (1 - epsilon_t) / epsilon_t )

We then update the weights of the samples:

D_{t+1}(i) = D_t(i) exp( -alpha_t y_i h_t(x_i) ) / Z_t

where Z_t is a normalization factor so that sum_i D_{t+1}(i) = 1.

At the end of the Adaboost algorithm, the feature weights are determined for the learning context and allow the system to compute the link similarity defined in formula 10.
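The learning loop can be sketched as follows (a simplified discrete Adaboost with fixed per-feature thresholds; all names are ours, and the per-feature weight is accumulated over the iterations in which that feature's weak classifier is selected):

```python
import math

def learn_feature_weights(samples, labels, thresholds, n_iters=8):
    """Discrete Adaboost over per-feature threshold classifiers.
    samples: list of per-feature similarity vectors; labels: +1/-1.
    Returns the accumulated (un-normalized) weight per feature."""
    n, n_feat = len(samples), len(samples[0])
    d = [1.0 / n] * n                       # sample weights D_t
    w = [0.0] * n_feat
    for _ in range(n_iters):
        # pick the feature whose weak classifier has minimum weighted error
        errs = []
        for k in range(n_feat):
            h = [1 if s[k] >= thresholds[k] else -1 for s in samples]
            errs.append(sum(d[i] for i in range(n) if h[i] != labels[i]))
        k_best = min(range(n_feat), key=errs.__getitem__)
        eps = max(min(errs[k_best], 1 - 1e-9), 1e-9)   # avoid log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)
        w[k_best] += alpha
        # re-weight the samples: boost the misclassified ones, then normalize
        h = [1 if s[k_best] >= thresholds[k_best] else -1 for s in samples]
        d = [d[i] * math.exp(-alpha * labels[i] * h[i]) for i in range(n)]
        z = sum(d)
        d = [x / z for x in d]
    return w
```

Features whose weak classifiers are never selected keep a zero weight, which is why unused features can be skipped entirely at tracking time.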

4 The proposed tracking algorithm

The proposed tracking algorithm takes as input a list of objects detected in a temporal window. The size of this temporal window (denoted T) is a parameter. The proposed tracker is composed of three stages. First, the system computes the link similarity between any two detected objects appearing in a given temporal window in order to establish possible links. Second, the trajectories, which include sets of consecutive links resulting from the previous stage, are computed so that the system obtains the highest possible total of global similarities (see section 4.3). Finally, a filter is applied to remove noisy trajectories.

4.1 Establishment of object links

For each detected object pair in a given temporal window of size T, the system computes the link similarity (i.e. instantaneous similarity) defined in formula 10. A temporal link is established between these two objects when their link similarity is greater than or equal to a predefined threshold (presented in equation 11). At the end of this stage, we obtain a weighted graph whose vertices are the objects detected in the considered temporal window and whose edges are the temporally established links associated with the object similarities (see figure 1).

Figure 1: The graph representing the established links of the detected objects in a temporal window of size T frames.

4.2 Long term similarity

In this section, we study the similarity score between an object o_i detected at instant t and the trajectory of an object o_j detected previously, called the long term similarity (to distinguish it from the link similarity score between two objects). By assuming that the variations of the 2D area, shape ratio, color histogram, color covariance and dominant color features of a mobile object follow a Gaussian distribution, we can use the Gaussian probability density function (PDF) to compute this score. Also, the longer the trajectory of o_j is, the more reliable this similarity is. Therefore, for each feature k among these features, we define a long term similarity score between object o_i and the trajectory of o_j as follows:

LTS_k(o_i, o_j) = exp( - (s_k - mu_k)^2 / (2 * sigma_k^2) ) * min(1, t_j / Q)

where s_k is the value of feature k for object o_i; mu_k and sigma_k are respectively the mean and standard deviation of feature k over the last Q objects belonging to the trajectory of o_j (Q is a predefined parameter); and t_j is the time length (number of frames) of the trajectory. Thanks to the selection of the last Q objects, the long term similarity can take into account the latest variations of the trajectory.

For the remaining features (2D and 3D displacement distances and HOG), the long term similarity is set to the same value as the link similarity.

4.3 Trajectory determination

The goal of this stage is to determine the trajectories of the mobile objects. For each object o_i detected at instant t, we consider all its matched objects (i.e. objects with temporally established links) in previous frames that do not yet have official links (i.e. trajectories) to any object detected at instant t. For such an object pair (o_i, o_j), we define a global score GS as follows:

GS(o_i, o_j) = ( sum_{k=1}^{8} w_k GS_k(o_i, o_j) ) / ( sum_{k=1}^{8} w_k )

where w_k is the weight of feature k (resulting from the learning phase, see section 3) and GS_k is the global score of feature k between o_i and o_j, defined as a function of the link similarity and the long term similarity of feature k:

GS_k(o_i, o_j) = (1 - beta) LS_k(o_i, o_j) + beta LTS_k(o_i, o_j)

where LS_k is the link similarity of feature k between the two objects o_i and o_j, LTS_k is their long term similarity defined in section 4.2, and beta is the weight of the long term similarity, defined as follows:

beta = beta_max * min(1, t_j / Q)

where t_j and Q are presented in section 4.2, and beta_max is the maximum expected weight for the long term similarity.

The object o_j with the highest global similarity is considered as a temporal father of object o_i. After considering all objects detected at instant t, if more than one object takes o_j as a father, only the pair with the highest global score is kept, and the link between this pair becomes official (i.e. officially a trajectory segment). An object is no longer tracked if it cannot establish any official link in T consecutive frames.
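A one-pass sketch of this father selection (names are ours; when several current objects claim the same father, the highest-scoring pair is kept and the losing objects are not re-matched in this sketch):

```python
def select_official_links(candidate_scores):
    """candidate_scores: {(prev_obj, curr_obj): global_score}. Each current
    object first picks its best previous object as a temporal father; when
    several current objects claim the same father, only the highest-scoring
    pair becomes an official trajectory link."""
    best_father = {}
    for (prev, curr), score in candidate_scores.items():
        if curr not in best_father or score > best_father[curr][1]:
            best_father[curr] = (prev, score)
    official = {}
    for curr, (prev, score) in best_father.items():
        if prev not in official or score > official[prev][1]:
            official[prev] = (curr, score)
    return {(prev, curr) for prev, (curr, _) in official.items()}
```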

4.4 Trajectory filtering

Noise usually appears when a wrong detection or a misclassification (e.g. due to low image quality) occurs. Hence a static object (e.g. a chair, a machine) or some image regions (e.g. window shadows, merged objects) can be detected as mobile objects. However, such noise usually appears only in a few frames or has no real motion. We thus use temporal and spatial filters to remove potential noise. A trajectory is considered as noise if one of the two following conditions is satisfied:

t < Th_t    or    d < Th_d

where t is the time length (number of frames) of the considered trajectory, d is the maximum spatial length of this trajectory, and Th_t, Th_d are predefined thresholds.
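The filter can be sketched as follows (names are ours; a trajectory is dropped when it is too short in time or when its maximum spatial extent stays below a threshold):

```python
def is_noise(trajectory, min_frames, min_extent, dist):
    """trajectory: chronological list of object positions; dist: a 2D/3D
    distance function. Returns True for trajectories that are too short
    in time or that exhibit almost no real motion."""
    if len(trajectory) < min_frames:
        return True
    max_extent = max(dist(p, q) for p in trajectory for q in trajectory)
    return max_extent < min_extent
```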

5 Experimentation and Validation

The objective of this experimentation is to demonstrate the effect of feature weight learning and to compare the performance of the proposed tracker with other trackers in the state of the art. To this end, in the first part, we test the proposed tracker on two complex videos (many moving people, high occlusion frequency) which are respectively provided by the Caretaker European project (http://cordis.europa.eu/ist/kct/caretaker_synopsis.htm) and the TRECVid dataset [trecvid]. These two videos are tested in both cases: without and with feature weight learning. In the second part, five videos belonging to two public datasets, ETISEO (http://www-sop.inria.fr/orion/ETISEO/) and Caviar (http://homepages.inf.ed.ac.uk/rbf/CAVIARDATA1/), are used, and the tracking result (with feature learning) is compared with several other approaches in the state of the art.

In order to evaluate the tracking performance, we use the three tracking evaluation metrics defined in the ETISEO project [atnghiem]. The first tracking evaluation metric measures the percentage of time during which a reference object (ground truth data) is correctly tracked. The second metric computes, throughout time, how many tracked objects are associated with one reference object. The third metric computes the number of reference object IDs per tracked object. These metrics must be used together to obtain a complete performance evaluation. Therefore, we also define a tracking metric taking the average value of these three tracking metrics. The four metric values are defined in the interval [0, 1]. The higher the metric value, the better the tracking algorithm performance.

In this experimentation, we use the people detection algorithm based on the HOG descriptor of the OpenCV library. So we focus the experimentation on the sequences containing people movements. However the principle of the proposed tracking algorithm is not dependent on the tracked object type. For learning feature weights, we use video sequences that are different from the tested videos but which have a similar context.

The first tested video (provided by the Caretaker project) depicts people moving in a subway station; it lasts 5 min (see image 2a). We have learnt the feature weights on a sequence of 2000 frames. The learning algorithm selects the color histogram and HOG features.

The second tested sequence (belonging to the TRECVid dataset) depicts the movements of people in an airport (see image 2b). It contains 5000 frames and lasts 3 min 20 sec. We have learnt the feature weights on a sequence of 5000 frames. The learning algorithm selects the 3D displacement distance, 2D area and color histogram features.

Table 1 presents the tracking results in both cases: without and with feature weight learning. With the proposed learning scheme, the tracker performance increases on both tested videos. The processing time of the tracker also decreases significantly because many features are not used.

                  Without learning          With learning
                  M1    M2    M3    Avg     M1    M2    M3    Avg
Caretaker video   0.62  0.16  0.99  0.59    0.47  0.83  0.80  0.70
TRECVid video     0.60  0.82  0.90  0.77    0.70  0.93  0.84  0.82

Table 1: Summary of tracking results in both cases: without and with feature weight learning (M1, M2, M3 denote the three ETISEO tracking metrics; Avg is their average).

The two following tested videos belong to the ETISEO dataset. The first tested ETISEO video, denoted ETI-VS1-BE-18-C4, shows a building entrance. It contains 1108 frames and its frame rate is 25 fps. In this sequence, there is only one person moving (see image 2c). We have learnt the feature weights on a sequence of 950 frames. The learning algorithm has selected the 3D displacement distance feature as the unique feature for tracking in this context. The result of the learning phase is reasonable since there is only one moving person.

Figure 2: Illustration of the five tested videos: a. Caretaker b. TRECVid c. ETI-VS1-BE-18-C4 d. ETI-VS1-MO-7-C1 e. Caviar

The second tested ETISEO video, denoted ETI-VS1-MO-7-C1, shows an underground station with occlusions. The difficulty of this sequence lies in its low contrast and bad illumination, and the scene depth is quite large (see image 2d). This video sequence contains 2282 frames and its frame rate is 25 fps. We have learnt the feature weights on a sequence of 500 frames. The color covariance feature is selected as the unique feature for tracking in this context. This is a good choice because the dominant color and HOG features do not seem to be effective due to the bad illumination. Also, the size and displacement distance features are not reliable because their measurements are not discriminative for people moving far away from the camera.

In these two experiments, the tracking results of seven different teams (denoted by numbers) in ETISEO are presented: 1, 8, 11, 12, 17, 22, 23. Because the names of these teams are hidden, we cannot determine their tracking approaches. Table 2 presents the performance results of the considered trackers. The tracking evaluation metrics of the proposed tracker obtain the highest values in most cases compared to the other teams.

       Our tracker    Team 1        Team 8        Team 11
       BE    MO       BE    MO      BE    MO      BE    MO
M1     0.50  0.79     0.48  0.77    0.49  0.58    0.56  0.75
M2     1.00  1.00     0.80  0.78    0.80  0.39    0.71  0.61
M3     1.00  1.00     0.83  1.00    0.77  1.00    0.77  0.75
Avg    0.83  0.93     0.70  0.85    0.69  0.66    0.68  0.70

       Team 12        Team 17       Team 22       Team 23
       BE    MO       BE    MO      BE    MO      BE    MO
M1     0.19  0.58     0.17  0.80    0.26  0.78    0.05  0.05
M2     1.00  0.39     0.61  0.57    0.35  0.36    0.46  0.61
M3     0.33  1.00     0.80  0.57    0.33  0.54    0.39  0.42
Avg    0.51  0.66     0.53  0.65    0.31  0.56    0.30  0.36

Table 2: Summary of tracking results for the two ETISEO videos (M1, M2, M3 denote the three ETISEO tracking metrics; Avg is their average). BE denotes the ETI-VS1-BE-18-C4 sequence, MO denotes the ETI-VS1-MO-7-C1 sequence.

The last three tested videos belong to the Caviar dataset (see image 2e). In this dataset, we have selected the same sequences as experimented in [snidaro] to enable a comparison: OneStopEnter2cor, OneStopMoveNoEnter1cor and OneStopMoveNoEnter2cor. In these three sequences, 9 persons walk in a corridor. The proposed approach can track all of them. However, there are three noisy trajectories in the last sequence because wrong detections occurred over a long period. Table 3 presents the result summary for these videos. TP (True Positive) refers to the number of correctly tracked trajectories, FN (False Negative) is the number of lost trajectories, and FP (False Positive) represents the number of noisy trajectories. Compared to [snidaro], our proposed tracker has better values for all three indexes.

                        # trajectories   TP   FN   FP
Proposed tracker        9                9    0    3
Approach of [snidaro]   9                8    1    7

Table 3: Summary of tracking results for three Caviar videos

6 Conclusion and Future work

We have presented in this paper an approach which combines a large set of appearance features and learns tracking parameters. The quantification of the HOG descriptor reliability and the combination of color covariance and dominant color with the spatial pyramid distance help to increase the robustness of the tracker when managing occlusion cases. The learning of feature significance for different video contexts also helps the tracking algorithm to adapt itself to the context variation problem. The experimentation demonstrates the effect of the feature weight learning, as well as the robustness of the proposed tracker compared to several other state-of-the-art approaches. In future work, we propose an automatic context detection to increase the auto-control capacity of the system and to remove the two assumptions given in this paper (presented in section 1).


This work is supported by The PACA region, The General Council of Alpes Maritimes province, France as well as The ViCoMo, Vanaheim and Support projects.