visual tracker benchmark results
In this paper we introduce a fully end-to-end approach for visual tracking in videos that learns to predict the bounding box locations of a target object at every frame. An important insight is that the tracking problem can be considered as a sequential decision-making process and historical semantics encode highly relevant information for future decisions. Based on this intuition, we formulate our model as a recurrent convolutional neural network agent that interacts with a video over time, and our model can be trained with reinforcement learning (RL) algorithms to learn good tracking policies that pay attention to continuous, inter-frame correlation and maximize tracking performance in the long run. The proposed tracking algorithm achieves state-of-the-art performance on an existing tracking benchmark and operates at frame-rates faster than real-time. To the best of our knowledge, our tracker is the first neural-network tracker that combines convolutional and recurrent networks with RL algorithms.
Given some object of interest marked in one frame of a video, the goal of single-object tracking is to locate this object in subsequent video frames, despite object movement, changes in the camera’s viewpoint and other incidental environmental variations such as lighting and shadows. Single-object tracking finds immediate applications in many important scenarios such as autonomous driving, unmanned aerial vehicles, and security surveillance.
Despite the success of traditional trackers based on low-level, hand-crafted features [2, 8, 23], models based on deep convolutional neural networks (CNNs) have dominated recent visual tracking research [20, 9, 3]. The success of all these models largely depends on the capability of CNNs to learn a good feature representation for the tracking target. In order to predict the target location in a new frame, either a search-and-classify or a crop-and-regress [9, 3] approach is applied. In that sense, although the representation power of CNNs is exploited to capture spatial features, only limited manual temporal constraints are added in these frameworks, for example, that the new target lies in the spatial vicinity of the previous prediction. Unfortunately, for a busy scene with multiple occluding objects, short-term cues of correlating temporally close objects can often fail to account for multiple targets and mutual occlusion. Hence, how to harness the power of deep-learning models to automatically learn both spatial and temporal constraints, especially with longer-term information aggregation and disambiguation, should be fully explored.
[4, 19]. We explore and investigate a more general strategy to develop a novel visual tracking approach based on recurrent convolutional networks. The major intuition behind our method is that the historical visual semantics and tracking proposals encode pertinent information for future predictions and can be modeled as a recurrent convolutional network. However, unlike video classification or visual attention, where only high-level semantic or single-step predictions are needed, visual tracking requires continuous and accurate predictions in both the spatial and temporal domains over a long period of time, and thus requires a novel network architecture design as well as proper training algorithms.
In this work, we formulate the visual tracking problem as a sequential decision-making process and propose a novel framework, referred to as Deep RL Tracker (DRLT), which processes video frames as a whole and directly outputs location predictions of the target in each frame. Our model integrates a convolutional network with a recurrent network (Figure 1), and builds up a spatial-temporal representation of the video. It fuses past recurrent states with current visual features to make predictions of the target object’s location over time. We describe an end-to-end RL algorithm that allows the model to be trained to maximize tracking performance in the long run. This procedure uses backpropagation to train the neural-network components and the REINFORCE algorithm to train the policy network.
Our algorithm augments a traditional CNN with a recurrent convolutional model that learns spatial-temporal representations, and uses RL to maximize long-term tracking performance. The main contributions of our work are:
We propose and develop a novel convolutional recurrent neural network model for visual tracking. The proposed method directly leverages the power of deep-learning models to automatically learn both spatial and temporal constraints.
Our framework is trained end-to-end with deep RL algorithms, in which the model is optimized to maximize a tracking performance measure in the long run.
Our model is trained fully off-line. When applied to online tracking, only a single forward pass is computed and no online fine-tuning is needed, allowing us to run at frame-rates beyond real-time.
Our extensive experiments demonstrate the outstanding performance of our tracking algorithm compared to state-of-the-art techniques on the OTB public tracking benchmark.
We claim that a recurrent convolutional network combined with an RL algorithm is another useful deep-learning framework apart from CNN-based trackers. It has the potential to develop into a much more robust and accurate tracker, given that it pays explicit attention to temporal correlation and to a long-term reward mechanism through RL.
The rest of the paper is organized as follows. We first review related work in Section 2, and discuss our RL approach for visual tracking in Section 3.1. Section 3.2 describes our end-to-end optimization algorithm, and Section 4 demonstrates the experimental results using a standard tracking benchmark.
Visual Tracking is a fundamental problem in computer vision that has been actively studied for decades. Many methods have been proposed for single-object tracking. For a systematic review and comparison, we refer the readers to a recent benchmark and a tracking challenge report [29, 16].
Classification-based trackers. Trackers for generic object tracking often follow a tracking-by-classification methodology [14, 26]. A tracker will sample "foreground" patches near the target object and "background" patches farther away from the target. These patches are then used to train a foreground-background classifier, and this classifier is used to score potential patches in the next frame to estimate the new target location. Usually, the classifier is first trained off-line and fine-tuned during online tracking. Many neural-network trackers following this approach [12, 20, 25, 32] have surpassed traditional trackers [2, 8, 23], and achieved state-of-the-art performance [20, 16]. Unfortunately, these trackers are inefficient at run-time since neural networks are very slow to train in an online fashion. Another drawback of such a design is that it does not fully utilize all video information, particularly explicit temporal correlation.
Regression-based trackers. Some recent works [9, 3] have attempted to treat tracking as a regression problem instead of a classification problem. Held et al. trained a CNN to regress directly from two images to the location in the second image of the object shown in the first image. Bertinetto et al. proposed a fully-convolutional siamese network to track objects in videos. These deep-learning methods can run at frame-rates beyond real time while maintaining state-of-the-art performance. However, they extract features independently from each video frame and only compare two consecutive frames, which prevents them from fully utilizing longer-term contextual and temporal information.
Recurrent-neural-network trackers. Several recent works [13, 6] have sought to train recurrent neural networks for the problem of visual tracking. Gan et al. trained an RNN to predict the absolute position of the target in each frame, and Kahou et al. similarly trained an RNN for tracking using the attention mechanism. Although they brought good intuitions from RNNs, these methods have not yet demonstrated competitive results on modern benchmarks.
Another closely related work proposed a spatially supervised recurrent convolutional neural network in which a YOLO network is applied on each frame to produce object detections and a recurrent neural network is used to directly regress the YOLO detections. Our framework does not need any supervision from a separate detection module and is more general and flexible.
RL is a learning method based on trial and error, where an agent does not necessarily have prior knowledge of the correct action to take. It learns interactively from rewards fed back from the environment. In order to maximize the expected rewards in the long term, the agent learns the best policy.
We draw inspiration from recent approaches that have used REINFORCE to learn task-specific policies. Mnih et al. and Ba et al. learned spatial attention policies for image classification, and Xu et al. for image caption generation. Our work is similar to these attention models, but we designed our own network architecture specially tailored for solving the visual tracking problem by combining CNN, RNN and RL algorithms.
Our proposed framework directly applies an RNN on top of frame-level CNN features, paying direct attention to both spatial and temporal constraints, and the full framework is trained off-line with the REINFORCE algorithm in an end-to-end manner. Due to its run-time simplicity, our tracker runs at frame-rates beyond real-time while maintaining state-of-the-art performance. We will describe our framework in detail in Section 3.
Our goal is to take a sequence of video frames and output target object locations at each frame. We formulate our tracking algorithm as a sequential decision-making process of a goal-oriented agent interacting with the visual environment. Figure 1 shows our model structure. At each point in time, the agent extracts representative features from a video frame, integrates information over time, and decides how to take actions accordingly. The agent receives a scalar reward signal at each timestep, and the goal of the agent is to maximize the total long-term rewards. Hence, it must learn to effectively utilize these temporal observations to reason on the moving trajectory of the object.
The model consists of two major components: an observation network (Section 3.1.1), and a recurrent network (Section 3.1.2). The observation network encodes representations of video frames. The recurrent network integrates these observations over time and predicts the bounding box location in each frame. We now describe each of these in more detail. Later in Section 3.2, we explain how we use a combination of backpropagation and REINFORCE to train the model in an end-to-end fashion.
As shown in Figure 1, the observation network $f_o$, parameterized by $W_o$, observes a single video frame $x_t$ at each timestep. It encodes the frame into a feature vector $i_t$, concatenates a location vector $s_t$, and provides the feature and location combination (denoted as $o_t$) as input to the recurrent network.
The feature vector $i_t$ is typically computed with a sequence of convolutional, pooling, and fully connected layers to encode information about what was seen in this frame. The importance of $s_t$ is twofold. When the ground-truth bounding box location is known, such as in the first frame of a given sequence, $s_t$ is directly set to the normalized location coordinates $(x, y, w, h) \in [0, 1]^4$, serving as a strong supervising guide for further inferences. Otherwise, $s_t$ is padded with zeros and only the feature information $i_t$ is incorporated by the recurrent network.
The concatenation of $i_t$ and $s_t$ allows the recurrent network to directly encode image features as well as location predictions, and it also makes location regression easier.
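A minimal Python sketch of how such an observation vector could be assembled, assuming the CNN features of one frame are already available as a flat list. The function name `build_observation` and the toy 3-dimensional feature vectors are illustrative assumptions, not part of the original implementation.

```python
def build_observation(features, location=None):
    """Concatenate frame features with a 4-element location slot.

    features: list of floats standing in for the CNN features of one frame.
    location: normalized (x, y, w, h) when the ground truth is known,
              or None when it is not (the slot is then zero-padded).
    """
    if location is None:
        location = (0.0, 0.0, 0.0, 0.0)  # zero padding, no location supervision
    if len(location) != 4:
        raise ValueError("location must contain exactly (x, y, w, h)")
    return list(features) + list(location)


# First frame: the ground-truth box is known and is appended to the features.
first_frame = build_observation([0.2, 0.7, 0.1], (0.5, 0.5, 0.25, 0.4))
# Later frames: only the visual features carry information.
later_frame = build_observation([0.3, 0.6, 0.2])
```

Keeping the location slot at a fixed position means the recurrent network always receives an input of the same size, whether or not a ground-truth box is supplied.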
The recurrent network $f_r$, parameterized by $W_r$, forms the core of the learning agent. As can be seen in Figure 1, at each timestep $t$, the observation feature vector $o_t$ is fed into the recurrent network, which updates its internal hidden state $h_t$ based on the previous hidden state $h_{t-1}$ and the current observation feature vector $o_t$:
$$h_t = f_r(h_{t-1}, o_t; W_r) \tag{1}$$
where $f_r$ is a recurrent transformation function; we use an LSTM in our network.
Importantly, the network’s hidden state $h_t$ models temporal hypotheses about target object locations. Since $o_t$ is a concatenation of the image feature and the location signal, $h_t$ directly encodes information about both where in the frame an object was located and what was seen.
As the agent reasons on a video, it outputs the location $l_t = (x, y, w, h)$ of the target object at each timestep $t$, where $(x, y)$ are the coordinates of the bounding box center relative to the width and height of the image, respectively. The width $w$ and height $h$ of the bounding box are also relative to those of the image; consequently, $l_t \in [0, 1]^4$.
The location is predicted directly from the last four elements of $h_t$, denoted as $\mu_t$, such that the agent’s decision is a function of its past observations and their predicted locations. At training time, $l_t$ is sampled from a multi-variate Gaussian distribution with a mean of $\mu_t$ and a fixed variance; at test time, the mean $\mu_t$ is used directly as the prediction.
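The location policy described above can be sketched as follows, assuming the hidden state is a plain list whose last four entries hold the mean of the location distribution. The name `select_location`, the toy hidden state, and the value of `sigma` are assumptions made for illustration.

```python
import random


def select_location(hidden_state, sigma, training, rng=random):
    """Return a location (x, y, w, h) from the recurrent hidden state.

    The last four elements of the hidden state are used as the mean.
    During training each coordinate is sampled from a Gaussian with a
    fixed standard deviation; at test time the mean is returned as is.
    """
    mean = list(hidden_state[-4:])
    if not training:
        return mean
    return [rng.gauss(m, sigma) for m in mean]


hidden = [0.9, -0.3, 0.5, 0.5, 0.2, 0.3]  # toy hidden state
test_time = select_location(hidden, sigma=0.01, training=False)
train_time = select_location(hidden, sigma=0.01, training=True,
                             rng=random.Random(0))
```

Sampling independently per coordinate corresponds to a Gaussian with a diagonal covariance, which is the simplest reading of a fixed-variance policy.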
Figure 1 further illustrates the roles of each component as well as the corresponding inputs and outputs with an example of a forward pass through the network.
Training this network to maximize the overall tracking performance is a non-trivial task, and we leverage the REINFORCE algorithm from the RL community to solve this problem.
During training, the agent receives a reward signal $r_t$ from the environment after executing an action at time $t$. In this work, we explore two different reward definitions in different training phases. One is
$$r_t = -\mathrm{avg}(|l_t - g_t|) - \max(|l_t - g_t|) \tag{2}$$
where $l_t$ is the location output by the recurrent network, $g_t$ is the target ground truth at time $t$, and $\mathrm{avg}(\cdot)$ and $\max(\cdot)$ compute the pixel-wise mean and maximum. The other reward is
$$r_t = \frac{|l_t \cap g_t|}{|l_t \cup g_t|} \tag{3}$$
where the reward is computed as the intersection area divided by the union area (IoU) between $l_t$ and $g_t$.
The training objective is to maximize the sum of the reward signals: $R = \sum_{t=1}^{T} r_t$. By definition, the rewards in Equation 2 and Equation 3 both measure the closeness between the predicted location $l_t$ and the ground-truth location $g_t$. We use the reward definition in Equation 2 in the early stage of training, and the reward definition in Equation 3 in the late stage of training to directly maximize the IoU between the prediction $l_t$ and the ground truth $g_t$.
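The two rewards can be sketched in Python as below. This is an illustrative reading rather than the original code: boxes are assumed to be `(cx, cy, w, h)` tuples, and the first reward is assumed to be the negated sum of the mean and maximum absolute coordinate error, so that a closer prediction earns a higher reward.

```python
def distance_reward(pred, truth):
    """Negated mean plus maximum absolute coordinate error (higher is better)."""
    diffs = [abs(p - g) for p, g in zip(pred, truth)]
    return -sum(diffs) / len(diffs) - max(diffs)


def iou_reward(pred, truth):
    """Intersection over union of two boxes given as (cx, cy, w, h)."""
    def corners(box):
        cx, cy, w, h = box
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    ax0, ay0, ax1, ay1 = corners(pred)
    bx0, by0, bx1, by1 = corners(truth)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = pred[2] * pred[3] + truth[2] * truth[3] - inter
    return inter / union if union > 0 else 0.0


truth_box = (0.5, 0.5, 0.2, 0.2)
shifted_box = (0.6, 0.5, 0.2, 0.2)
far_box = (0.1, 0.1, 0.1, 0.1)
```

The distance-based reward stays informative even when the boxes do not overlap, whereas the IoU reward is zero for every non-overlapping prediction; this is one plausible reason to start training with the former and switch to the latter.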
Our network is parameterized by $W = \{W_o, W_r\}$, and we aim to learn these parameters to maximize the total tracking reward the agent can expect in the long run. More specifically, the objective of the agent is to learn a policy function $\pi(l_t \mid z_{1:t}; W)$ with parameters $W$ that, at each step $t$, maps the history of past interactions with the environment, $z_{1:t}$ (a sequence of past observations and actions taken by the agent), to a distribution over actions for the current timestep. Here, the policy $\pi$ is defined by our neural network architecture, and the history of interactions $z_{1:t}$ is summarized in the hidden state $h_t$. For simplicity, we will use $z_t = z_{1:t}$ to indicate all histories up to time $t$; thus, the policy function can be written as $\pi(l_t \mid z_t; W)$.
More formally, the policy of the agent induces a distribution over possible interactions, and we aim to maximize the total reward under this distribution. The objective is therefore defined as:
$$G(W) = \mathbb{E}_{p(z_{1:T}; W)}\left[\sum_{t=1}^{T} r_t\right] = \mathbb{E}_{p(z_T; W)}[R] \tag{4}$$
where $p(z_{1:T}; W)$ is the distribution over possible interactions parameterized by $W$.
This formulation involves an expectation over high-dimensional interactions, which is hard to solve in a traditional supervised manner. Here, we bring techniques from the RL community to solve this problem. Following the REINFORCE derivation, the gradient can first be simplified by taking the derivative over the log-probability of the policy function $\pi$:
$$\nabla_W G = \sum_{t=1}^{T} \mathbb{E}_{p(z_T; W)}\left[\nabla_W \ln \pi(l_t \mid z_t; W)\, R\right] \tag{5}$$
and the expectation can be further approximated by an episodic algorithm: since the action is drawn from probabilistic distributions, one can execute the same policy for many episodes and approximate the expectation by taking the average, thus
$$\nabla_W G \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_W \ln \pi(l_t^i \mid z_t^i; W)\, R^i \tag{6}$$
where the $R^i$ are cumulative rewards obtained by running the current policy $\pi$ for $N$ episodes, $i = 1, \ldots, N$.
The above training rule is known as the episodic REINFORCE algorithm. It involves running the agent with its current policy to obtain samples of interactions and then updating the parameters of the agent such that the log-probability of chosen actions that have led to high overall rewards is increased.
In practice, although Equation 6 computes a good estimate of the gradient $\nabla_W G$, the training process is hard to converge when it is applied to train the deep RL tracker, due to the high variance of this gradient estimate. Thus, in order to obtain an unbiased low-variance gradient estimate, a common method is to subtract a reinforcement baseline from the cumulative rewards $R^i$:
$$\nabla_W G \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_W \ln \pi(l_t^i \mid z_t^i; W)\, (R^i - b) \tag{7}$$
where $b$ is called the reinforcement baseline in the RL literature. It is natural to select $b = \mathbb{E}_{\pi}[R]$, and this form of baseline is known as the value function. This estimate has the same expectation as Equation 6 while substantially reducing the variance.
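One way to see why subtracting a baseline leaves the expected gradient unchanged is the short derivation below. It assumes a baseline $b$ that does not depend on the sampled action $l_t$; the symbols $\pi$, $l_t$, $z_t$, $W$ and $R$ are our notation for the policy, the sampled location, the interaction history, the network parameters and the cumulative reward.

```latex
\mathbb{E}_{l_t \sim \pi}\!\left[ \nabla_W \ln \pi(l_t \mid z_t; W)\, b \right]
  = b \int \pi(l_t \mid z_t; W)\, \nabla_W \ln \pi(l_t \mid z_t; W)\, \mathrm{d}l_t
  = b\, \nabla_W \!\int \pi(l_t \mid z_t; W)\, \mathrm{d}l_t
  = b\, \nabla_W 1
  = 0 .
```

The subtracted term therefore contributes nothing on average, while a baseline close to the typical reward shrinks the magnitude of the weights $(R - b)$ and with it the variance of the estimate.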
The only remaining part needed to compute the gradient in Equation 7 is the gradient over the log-probability of the policy function, $\nabla_W \ln \pi(l_t \mid z_t; W)$. To simplify notation, we focus on one single timestep and omit the usual unit index subscript throughout. In our network design, the policy function $\pi(l \mid z; W)$ outputs the target location $l$, which is drawn from a Gaussian distribution centered at $\mu$ with fixed variance $\sigma^2$, and $\mu$ is the output of the deep RL tracker parameterized by $W$. The density function $g(l, \mu, \sigma)$ determining the output $l$ on any single trial is given by:
$$g(l, \mu, \sigma) = \frac{1}{(2\pi)^{1/2}\sigma} \exp\left(-\frac{(l - \mu)^2}{2\sigma^2}\right) \tag{8}$$
Based on the REINFORCE algorithm, the gradient of the policy function with respect to $\mu$ is given by the gradient of the log-density function:
$$\nabla_\mu \ln g = \frac{\partial \ln g}{\partial \mu} = \frac{l - \mu}{\sigma^2} \tag{9}$$
Since $\mu$ is the output of the deep RL tracker parameterized by $W$, the gradients with respect to the network weights $W$ can be easily computed by standard backpropagation.
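The following sketch illustrates the gradient computation for a single scalar coordinate, under the assumption of a Gaussian policy with fixed standard deviation and a baseline equal to the mean reward over the sampled episodes. All function names and numbers are illustrative; a real implementation would backpropagate this signal through the network rather than stop at the mean.

```python
import math


def gaussian_log_density(l, mu, sigma):
    """Log of a univariate Gaussian density with mean mu and std sigma."""
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (l - mu) ** 2 / (2 * sigma ** 2))


def grad_log_density_wrt_mu(l, mu, sigma):
    """Analytic derivative of the log-density with respect to the mean."""
    return (l - mu) / sigma ** 2


def reinforce_grad_mu(samples, mu, sigma, returns):
    """Episodic REINFORCE estimate of dG/dmu with a mean-reward baseline."""
    n = len(samples)
    baseline = sum(returns) / n
    total = sum(grad_log_density_wrt_mu(l, mu, sigma) * (r - baseline)
                for l, r in zip(samples, returns))
    return total / n


# Finite-difference check of the analytic derivative.
eps = 1e-6
numeric = (gaussian_log_density(0.52, 0.5 + eps, 0.1)
           - gaussian_log_density(0.52, 0.5 - eps, 0.1)) / (2 * eps)
analytic = grad_log_density_wrt_mu(0.52, 0.5, 0.1)

# Two episodes: the sample at 0.6 earned more reward than the one at 0.4,
# so the estimated gradient pushes the mean upwards.
grad = reinforce_grad_mu([0.6, 0.4], mu=0.5, sigma=0.1, returns=[1.0, 0.0])
```

A positive `grad` here means that gradient ascent moves the mean towards the sample that received the higher reward, which is the behavior the update rule is meant to produce.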
The overall procedure of our training algorithm is presented in Algorithm 1. The network parameters $W$ are first randomly initialized to define our initial policy. Then, we take the first $T$ frames from one training video as the input of our network. We execute the current policy $N$ times, compute gradients, and update the network parameters. Next, we take $T$ consecutive frames from the same video and apply the same training procedure. We repeat this for all training videos in our dataset, and we stop when we reach the maximum number of epochs or the cumulative reward ceases to increase.
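The way a video is cut into training segments can be sketched as below. The text does not say whether successive segments overlap, so the hypothetical `stride` parameter covers both readings: the default gives back-to-back segments, and a stride of one gives a sliding window. The helper name `frame_windows` is also an assumption.

```python
def frame_windows(num_frames, window, stride=None):
    """Return (start, end) index pairs of fixed-length training segments.

    end is exclusive. By default segments are back-to-back; pass a
    smaller stride for overlapping (sliding-window) segments.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if stride is None:
        stride = window
    if stride <= 0:
        raise ValueError("stride must be positive")
    return [(start, start + window)
            for start in range(0, num_frames - window + 1, stride)]


back_to_back = frame_windows(10, 4)        # non-overlapping segments
sliding = frame_windows(6, 4, stride=1)    # overlapping segments
```

For each segment, a training loop in the spirit of the procedure above would run the current policy several times, score every run with the reward, and apply one parameter update before moving to the next segment.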
During testing, the network parameters are fixed and no online fine-tuning is needed. The procedure at test time is as simple as computing one forward pass of our algorithm, i.e., given a test video, the deep RL tracker predicts the location of target object in every single frame by sequentially processing the video data.
We evaluated the proposed visual object tracking approach on the Object Tracking Benchmark and compared its performance with state-of-the-art trackers. Our algorithm was implemented in Python using the TensorFlow toolbox (https://www.tensorflow.org/) and ran at around 45 fps on an NVIDIA GTX 1080 GPU.
We followed the evaluation protocols of the benchmark, where the performance of trackers was measured with two different metrics: success rate plots and precision plots. In both metrics, the ratio of successfully tracked frames was measured over a set of thresholds, where the bounding box overlap ratio and the center location error were used in the success rate plot and the precision plot, respectively. We ranked the tracking algorithms by the Area-Under-Curve (AUC) of the success rate plot and by the precision at a center location error threshold of 20 pixels, again following the benchmark. We also compared the average bounding box overlap ratio for each tracking sequence, as well as the run-time tracking speed.
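The two ranking scores can be sketched as follows, given per-frame overlap ratios and center location errors. The choice of 21 evenly spaced overlap thresholds and the approximation of the AUC by the mean success rate are assumptions made for this illustration, as are the function names.

```python
def success_auc(overlaps, num_thresholds=21):
    """Approximate the area under the success plot.

    For each overlap threshold in [0, 1], the success rate is the
    fraction of frames whose overlap exceeds it; the AUC is taken as
    the mean success rate over evenly spaced thresholds.
    """
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(o > t for o in overlaps) / len(overlaps)
             for t in thresholds]
    return sum(rates) / len(rates)


def precision_at(center_errors, threshold=20.0):
    """Fraction of frames whose center error is within the threshold (pixels)."""
    return sum(e <= threshold for e in center_errors) / len(center_errors)


perfect = success_auc([1.0] * 5)
lost = success_auc([0.0] * 3)
precision = precision_at([5.0, 19.9, 20.0, 35.0])
```

Because the success score integrates over all overlap thresholds, it rewards tight boxes, while the 20-pixel precision only checks that the predicted center stays close to the target.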
Here, we describe the design choices of our observation network and recurrent network as well as the network learning procedure in detail.
Observation network: We used a YOLO network fine-tuned on the PASCAL VOC dataset to extract visual features from observed video frames, as YOLO was both accurate and time-efficient. The first Fc-layer features were extracted and concatenated with the location vector into a 5000-dimensional vector. Since the pre-trained YOLO weights were fixed during training, we added one more Fc-layer, with 5000 neurons, on top of the concatenated vector, and provided the final observation vector as the input to the recurrent network.
Recurrent network: We used a 1-layer LSTM network with 5000 hidden units. At each timestep $t$, the last four elements of the LSTM output were directly taken as the mean value $\mu_t$ of the location policy. During training, the location $l_t$ was sampled from a Gaussian distribution with mean $\mu_t$ and variance $\sigma^2$, with $\sigma$ chosen to balance randomness and certainty in our experiments. During testing, we directly used the output mean value $\mu_t$ as the prediction, which was the same as setting $\sigma = 0$.
Network learning: The training algorithm was the same as Algorithm 1, with the number of episodes $N$ and the number of frames $T$ set to the values that provided the best tracking performance. We kept the pre-trained YOLO weights unchanged, as they were proven to encode good information for both semantic prediction and localization, while the weights of the Fc-layer and the LSTM were updated using the ADAM algorithm. The learning rate was exponentially annealed from its initial value over the course of training. We trained the model for up to 500 epochs, or until the cumulative tracking reward stopped increasing. The first 300 epochs were trained with the reward defined in Equation 2, while the last 200 epochs were trained with the reward defined in Equation 3.
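The training schedule can be sketched as below. The switch at epoch 300 and the total of 500 epochs come from the text; the start and end learning rates are hypothetical placeholders, and the helper names are assumptions.

```python
def annealed_learning_rate(epoch, total_epochs, lr_start, lr_end):
    """Exponentially interpolate from lr_start (epoch 0) to lr_end (last epoch)."""
    fraction = epoch / total_epochs
    return lr_start * (lr_end / lr_start) ** fraction


def reward_phase(epoch, switch_epoch=300):
    """Name of the reward used at a given epoch (0-indexed)."""
    return "equation_2" if epoch < switch_epoch else "equation_3"


# Hypothetical learning-rate endpoints, for illustration only.
LR_START, LR_END, TOTAL_EPOCHS = 1e-3, 1e-5, 500

halfway_lr = annealed_learning_rate(250, TOTAL_EPOCHS, LR_START, LR_END)
```

With these placeholder endpoints the rate falls by a constant factor per epoch, so the halfway value is the geometric mean of the two endpoints rather than their average.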
The trained model was directly applied to the test sequences with no online fine-tuning. During testing, only the video frames and the ground-truth location in the first frame were input to the network, and the predictions were directly generated through a single forward pass of the recurrent network.
We conducted extensive experiments comparing the performance of our algorithm with eight other distinct trackers on a suite of 30 challenging and publicly available video sequences. Specifically, the one-pass evaluation (OPE) was employed to compare our algorithm with seven top trackers included in the benchmark suite: STRUCK, TLD, CSK, OAB, VTD, VTS, and SCM. DLT was another tracking algorithm based on deep neural networks, which provided a baseline for tracking algorithms adopting deep learning. Since the YOLO weights were pre-trained on the ImageNet dataset and fine-tuned on PASCAL VOC, and were therefore capable of detecting objects of 20 classes, we picked a subset of 30 videos from the benchmark where the targets belonged to these classes (Table 2). According to our evaluation results, this subset was harder than the full benchmark.
As a generic object tracker, our algorithm was trained off-line and no online fine-tuning mechanisms were applied. Thus, training data with similar dynamics were needed to capture both categorical and motion information. We split the dataset and used the first frames of each sequence, with ground truth, for off-line training, while the algorithm was tested on the whole sequence, including unseen frames. This property made our algorithm especially useful in surveillance environments, where models could be trained off-line with pre-captured data.
Figure 2 illustrates the precision and success plots based on the center location error and bounding box overlap ratio, respectively. It clearly presented the superiority of our algorithm over other trackers. The higher success and precision scores indicated that our algorithm hardly missed targets while maintaining good tracking of tight bounding boxes to targets. The superior performance was probably because the CNN captured representative features for localization and RNN was trained to force the long-term consistency of the tracking trajectory.
To gain more insight into the proposed algorithm, we evaluated the performance of trackers on individual sequences in the benchmark. Table 2 summarizes the average bounding box overlap ratio for each sequence. Our algorithm achieved the best results for 12 sequences and the second best results for 4 sequences. We also achieved the best overall performance, beating the second best by 0.085 in average overlap ratio (0.562 vs. 0.477). Unlike other trackers, where catastrophic failures were observed for certain sequences, our algorithm performed consistently well across all 30 sequences. This further illustrated that the spatial representations and temporal constraints learned by our algorithm were general, robust, and well-suited for tracking a large variety of targets.
We compared our tracker qualitatively with two distinct benchmark methods as well as DLT in Figure 3. It demonstrated that our tracker effectively handled different kinds of challenging situations that often required high-level semantic and temporal understanding, such as motion blur, illumination variation, rotation, and deformation. Compared with other trackers, our tracker hardly drifted to the background and predicted more accurate and reasonable bounding box locations.
To verify the importance of the RNN in our algorithm, we ran further experiments with varying RNN step sizes. The step size denoted the number of frames considered each time for training the network, referred to as $T$ in Algorithm 1. Success plots for three different RNN step sizes are shown in Figure 4, and we found that larger step sizes allowed us to model longer and more complicated temporal constraints, thus resulting in better accuracy. This analysis demonstrated the importance of incorporating temporal information in tracking and the effectiveness of our RL formulation.
Table 1 analyzed different trackers in terms of speed and accuracy. Our original model already operated at frame-rates beyond real-time by getting rid of online searching and fine-tuning mechanisms. Furthermore, pre-computing frame-level YOLO features off-line allowed us to only perform LSTM computation during online tracking, resulting in processing speed at 270 fps. In all, although our implementation was based on deep CNN and RNN, the proposed DRLT method was very efficient due to its extreme run-time simplicity, while preserving accurate tracking performance.
In this paper, we proposed a novel neural network tracking model based on a recurrent convolutional network trained with a deep RL algorithm. To the best of our knowledge, we are the first to bring RL into CNN and RNN to solve the visual tracking problem. The entire network is end-to-end trainable off-line, allowing it to run at frame-rates faster than real-time. The deep RL algorithm directly optimizes a long-term tracking performance measure which depends on the whole tracking video sequence. Other than CNN-based trackers, our paper aims to develop a new paradigm for solving the visual tracking problem by bringing in RNN and RL to explicitly exploit temporal correlation in videos. We achieved state-of-the-art performance on the OTB public tracking benchmark.
We believe that our initial work sheds light on many potential research possibilities along this direction. Not only can better training and design of recurrent convolutional networks further boost the efficiency and accuracy of visual tracking, but a broad new way of solving vision problems with artificial neural networks and RL can also be further explored.