Pixel-wise object tracking

11/20/2017 ∙ by Yilin Song, et al. ∙ 0

In this paper, we propose a novel pixel-wise visual object tracking framework that can track any anonymous object in a noisy background. The framework consists of two submodels, a global attention model and a local segmentation model. The global model generates a region of interest (ROI) in which the object may lie in the new frame based on the past object segmentation maps, while the local model segments the new image in the ROI. Each model uses an LSTM structure to model the temporal dynamics of the motion and appearance, respectively. To circumvent the dependency of the training data between the two models, we use an iterative update strategy. Once the models are trained, there is no need to refine them to track specific objects, making our method efficient compared to online learning approaches. We demonstrate our real-time pixel-wise object tracking framework on a challenging VOT dataset.




1 Introduction

Provided with an object of interest in the first frame, visual object tracking is the problem of building a computational model that can predict the location of the object in consecutive frames. A robust tracking algorithm should be able to tackle common issues including target deformation, motion blur, illumination change, partial occlusion and background clutter. Many existing algorithms use online learning, building a discriminative model to separate the object from the background. The feature extractor is the most important component of a tracker, and using appropriate features could drastically boost tracking performance. Many recent tracking-by-detection approaches [7, 27, 18] are inspired by methods for object detection [6, 21, 22] and fully embrace the features learnt from deep convolutional neural networks. We recognize that existing CNN-based feature extractors increase the performance and robustness of the tracking system, yet how to extend deep neural networks for visual object tracking has not been fully investigated. In our work, we tackle object tracking as a time-series prediction problem; in particular, we want to give a pixel-wise foreground-background label for consecutive frames.

Segmentation-based tracking algorithms [1, 2, 25, 32, 9] have an advantage over detection-based algorithms in handling a target that undergoes substantial non-rigid motion. Many of them [2, 25, 1] rely only on pixel-level information and hence fail to consider the semantic structure of the target. [32] uses a Markov chain on a superpixel graph, but information propagation through a graph could be slow depending on the structure. [9] uses an encoder-decoder structure which shares some similarity with ours, but it relies on optical flow and a Markov random field, which limits the segmentation speed to around 1 fps and restricts it to datasets with high image quality such as DAVIS [19]. The encoder-decoder structure is widely used in deep learning systems [15, 11, 20, 30]. [20, 30] use deconvolution for image segmentation and contour detection. In our work, we use a decoder to directly perform pixel-wise classification (object or not) on a video sequence.

To account for target appearance variation in object tracking, several recent trackers embed CNNs into their frameworks. Specifically, [26, 3] modify the Siamese network structure for visual tracking. [28] trains a CNN using ImageNet data and transfers the rich features learnt to a new object sequence by updating the network in an online manner. [17] trains a multi-domain network with separate branches for different domain sequences; the domain-specific layer needs to be refined for each sequence. One major limitation of the aforementioned methods is that they lack a mechanism to jointly model the spatial-temporal traits of the object. [5, 10, 31] propose to solve the tracking problem as sequential position prediction by training an RNN to model the time series. [5, 10] use an RNN to model the temporal relationship among frames, but they only conducted experiments on synthesized data and did not demonstrate competitive results on challenging datasets like VOT [12]. [23] uses a convolutional LSTM [29] (convLSTM) to perform instance segmentation on a single image. By spatial inhibition with an attention mechanism, they demonstrate compelling results on the Pascal VOC dataset [4]. [31] uses a convLSTM to model object feature variations, and their object detection mechanism uses a convolution structure similar to [3]. Because it performs a convolution between an exemplar frame and a region of interest (ROI), the output of their system is too coarse for fine-grained pixel labeling.

We propose a novel object tracking framework consisting of two models. The global model learns the global motion pattern of the object and predicts the object's likely location in a new frame from its past locations. The local model performs object segmentation in a ROI identified by the global model, based primarily on the appearance features of the object in the new frame. The local model uses a convLSTM-based structure whose memory state evolves to learn the essential appearance features of the object, enabling segmentation of the object even under significant appearance shifts and occlusion. The LSTM output further goes through a deconvolution layer to generate the segmentation map. The global model also employs a convLSTM structure to generate the latent feature characterizing the object motion, which is fed to a spatial transformer network to determine the location and size of the ROI in the new frame. The proposed framework has demonstrated promising performance on a very challenging dataset (VOT 2016 [12]), where some objects are very small relative to the image size and testing videos often contain objects unseen in the training videos (in our cross-validation study).

2 Framework

Figure 1: Pixel-wise object tracking framework. The network consists of two sub-modules, a local and a global model, working in a closed loop. At time step t, a resized full-resolution binary image is fed into the global model. At inference time, this binary image is the predicted segmentation map acquired from the local model at frame t-1. The global model then roughly predicts where the object will appear in frame t based on past segmentation maps and generates a region of interest (ROI). The cropped image in the ROI at frame t is then fed into the local model for segmentation.

Our goal is to build a pixel-wise object tracking framework for all possible image resolutions and aspect ratios. Models are trained in an offline supervised setting. Once the offline training is finished, there is no need to retrain the network. Segmenting the full-resolution image would require a large amount of computation for real-time application. In addition, scaling the original image to a fixed size could destroy the semantic information and appearance features of an object that is small relative to the image size. To overcome these difficulties, a global model is used to predict the rough location of the object based on the past object segmentation maps. We then crop a region of interest (ROI) from the original image and perform segmentation on the ROI. The network structure is shown in Fig. 1.

The model runs in a closed loop at inference time. At time step t, the global model takes a fixed-size segmentation map as input, which is the resized version of the predicted full-resolution segmentation map derived at t-1. We use several layers of convolution and pooling to reduce the dimensionality of the image. The resulting features are fed into a convolutional LSTM [29] to fully exploit the temporal variation characteristics of the past segmentation maps. To allow different ROI sizes for the local model (necessary to handle different object sizes and object size variation due to motion in the depth direction), a fully connected layer takes the convolutional LSTM output as its input to estimate the spatial transformation parameters (including translation and scaling) for the ROI locator, which applies the transformation to a reference anchor box to generate the ROI in the raw frame. As the input segmentation map to the global model is resized to a fixed size, we inject the aspect ratio of the original image into the last fully connected layer of the global model, so that it generalizes across video sequences with different aspect ratios.

At time step t, the local model receives a ROI image cropped from the full-resolution image. A pretrained VGG is used to extract features from the ROI image. These features are then fed into a convLSTM to model appearance shift. The output of the convLSTM then goes through a deconvolution layer to generate the local segmentation map (a grayscale image in which the value at each pixel is proportional to the estimated likelihood that the pixel belongs to the tracked object). Based on how the ROI is cropped from the full-resolution image, the full-resolution segmentation map is interpolated accordingly from the ROI segmentation map. Here we assume the ROI encloses the entire object, hence all pixels outside the ROI are set to zero.
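The mapping from the ROI segmentation map back to the full-resolution map can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `(top, left, size)` box layout and the function name `paste_roi_map` are hypothetical, and nearest-neighbour lookup stands in for whatever interpolation the paper uses.

```python
def paste_roi_map(roi_map, box, full_h, full_w):
    """Place a ROI probability map into a zero-filled full-resolution map.

    roi_map: 2D list of floats (rows x cols), e.g. the local model output.
    box:     (top, left, size) of the square ROI in full-resolution pixels
             (hypothetical layout; the box may extend past the image border).
    """
    top, left, size = box
    rows, cols = len(roi_map), len(roi_map[0])
    # Pixels outside the ROI stay at zero, as assumed in the text.
    full = [[0.0] * full_w for _ in range(full_h)]
    for y in range(max(top, 0), min(top + size, full_h)):
        # Nearest-neighbour lookup of the source row (an assumption).
        src_y = min((y - top) * rows // size, rows - 1)
        for x in range(max(left, 0), min(left + size, full_w)):
            src_x = min((x - left) * cols // size, cols - 1)
            full[y][x] = roi_map[src_y][src_x]
    return full


roi = [[0.1, 0.9],
       [0.8, 0.7]]
full_map = paste_roi_map(roi, box=(1, 2, 4), full_h=6, full_w=8)
```

Clipping the loop ranges to the image bounds lets the ROI hang over the frame border without special cases.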

The global model and local model are trained alternately in an end-to-end supervised manner. Once the offline training is finished, there is no need to fine-tune the network online based on the appearance of the target object, as in some prior work [16, 17, 28].

Figure 2: Local model for object segmentation in a ROI image. The M-CNN and F-CNN are feature normalization layers. The hidden and memory states of the ConvLSTM are also shown.

3 Local Segmentation Network

3.1 Framework

The network structure of our local model is shown in Fig. 2. We consider pixel-wise object tracking as a time series prediction problem. At each time step t, an input ROI is first processed by a pretrained convolutional network. As [33] has shown, high-layer features of a trained CNN bear more semantic information, whereas low-layer outputs bear more appearance information. For generic visual tracking, the features should be robust enough to work with many different object categories, and also be able to discriminate object instances from the same object class. Lower-level features could be more helpful for such a task. However, using very low-level features from a pretrained network could drastically increase the computation cost of the following layers. Based on these considerations, we use pool4 features from a VGG network [24] pretrained on an image segmentation dataset. The weights of this feature extractor are kept fixed during training.

The features are then fed into convolutional LSTMs. However, we have found that the resulting network is hard to train because pool4 features are not confined to a fixed range. Therefore, we use another small network consisting of two convolutional layers to normalize the VGG features, which uses tanh as its last activation function. This part is denoted as F-CNN in Fig. 2. The normalized features then go through a two-layer convLSTM. Intuitively, the first convLSTM layer models the dynamics of the foreground object as well as the background, and the second convLSTM layer mostly addresses the appearance shift of the target object. The equations we use for the ConvLSTM are shown in Eq. (1). To get the segmentation map, the output features of the second convLSTM layer are then fed into a deconvolution layer.
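For reference, the ConvLSTM formulation introduced in [29] is reproduced below. It is given here on the assumption that Eq. (1) follows it; whether the authors keep the peephole terms (the weights W_c) cannot be confirmed from this text. Here * denotes convolution, ∘ the Hadamard product, X_t the input features, H_t the hidden state and C_t the memory cell.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f\right) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o\right) \\
H_t &= o_t \circ \tanh\left(C_t\right)
\end{aligned}
```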


In Eq. (1), the Hadamard product is crucial for learning long-term dependencies. It restricts cross-channel information exchange and overcomes the vanishing gradient problem. Replacing the Hadamard product with convolution would not achieve similar performance for a time sequence model. On the other hand, ConvLSTM is not equivariant to translation, particularly because of the Hadamard product. This means a spatially shifted version of the input image may not lead to an equally shifted segmentation map. As the global model may not always generate ROIs centered on the object at different frame times, it would be preferable for the local segmentation network to have a certain degree of translation equivariance. Although this is one major drawback of using ConvLSTM for object tracking, we have found that with the ROI chosen by the global model, the object tends to fall near the center of the ROI in all frames, and our local model can perform well even with a small spatial shift between consecutive frames for unseen objects. The detailed numbers of parameters are shown in Tab. 1.


Local model     filter size    channels    stride
M-CNN                          1024        1
F-CNN                          512         1
ConvLSTM                       256         1
Deconv-1                       128         2
Deconv-2                       64          2
Deconv-3                       32          2
Deconv-4                       1           2

Global model    filter size    channels    stride
layer 1                        8           1
layer 2                        16          1
layer 3                        32          1
layer 4                        64          1
ConvLSTM                       64          1
full 1                         1024
full 2                         3
Table 1: Number of filters for each module in the local segmentation network and the global attention network. Notation represents two identical layers that are connected. Local model: in M-CNN and F-CNN, the internal activation functions are rectified linear units (ReLU), whereas the output activation function is tanh. The internal activation function in the deconvolution layers is leaky ReLU, and the last activation function is sigmoid. Global model: every two convolution layers are followed by a pooling operation to reduce the spatial dimensionality. The input to the first fully connected layer is the vectorized output of the convolutional LSTM. For the second fully connected layer the input dimension is 1025, because we concatenate the feature from the previous layer with the aspect ratio of the current video clip.
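The translation-equivariance argument of Sec. 3.1 can be checked numerically on a toy 1D signal. The sketch below assumes the elementwise weights are tied to fixed spatial positions (as the peephole weights are in the formulation of [29]); the signal, kernel and weights are made-up values, and circular boundaries are used only to keep the comparison exact.

```python
def shift(x, k):
    """Circularly shift a 1D signal to the right by k samples."""
    k %= len(x)
    return x[-k:] + x[:-k] if k else list(x)


def circ_conv(x, kernel):
    """Circular 1D correlation of x with a small kernel."""
    n = len(x)
    return [sum(kernel[j] * x[(i + j) % n] for j in range(len(kernel)))
            for i in range(n)]


def hadamard(x, w):
    """Elementwise product with weights tied to fixed positions."""
    return [a * b for a, b in zip(x, w)]


x = [0, 1, 3, 2, 0, 0]
kernel = [1, 2, 1]
w = [1, 2, 3, 4, 5, 6]  # position-dependent weights (toy values)

# Convolution commutes with the shift; the position-tied product does not.
conv_commutes = circ_conv(shift(x, 2), kernel) == shift(circ_conv(x, kernel), 2)
hadamard_commutes = hadamard(shift(x, 2), w) == shift(hadamard(x, w), 2)
```

Shifting the input and then applying the operation gives the same result as applying it first only for the convolution, which is the property the local model partly loses.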

3.2 Memory Initialization

To start the ConvLSTM, we need to initialize the memory and hidden states. Initializing the memory cell to zeros is one option, but a major drawback of this approach is that the memory cell of a recurrent network needs multiple time steps to converge. During this time, its hidden state is also drastically different from its true distribution, and segmentation could easily fail because the deconvolution is applied directly to the hidden state. A wrongly predicted local segmentation map would further affect the global model. Moreover, within the first ROI there could be more than one salient object. Without differentiating between these salient objects, the tracking system would not know which object to track and is likely to fail.

Instead of arbitrarily initializing the memory with zeros, we train an initialization module that takes the object mask and the image in a manually chosen ROI in the first frame, and generates the initial memory cell state and hidden state, which ideally should capture the appearance features of the object. To overcome boundary artifacts, we use a dilated mask to generate the masked image. In our experiments, we find that instead of applying the object mask in the image domain, applying the mask on the layer right before the pool1 layer in the VGG network gives better performance. We then regress the initial memory and hidden states of the ConvLSTM using the concatenated feature. This is done by another two convolution layers, denoted M-CNN in Fig. 2. Similar to [31], we find that using tanh as the last activation function for M-CNN stabilizes the memory, even though the numerical value of the memory cell could go beyond the range of [-1, 1]. Ideally, we want the memory cell to adapt slowly to appearance drift while being able to ignore false objects. In Fig. 3 we show the memory state evolution under different training strategies. The training strategies are discussed in the following subsection.

(a) Sequential segmentation result visualization without randomly inserting noisy frames during training.
(b) Sequential segmentation result visualization with randomly inserting noisy frames during training.
Figure 3: In each subfigure, the rows from top to bottom are: the segmentation result overlaid on the raw image, the first-layer convolutional LSTM memory cells, and the second-layer convolutional LSTM memory cells. The displayed images are downsampled by 2. Row 1: both the true sequence and the inserted frames come from the testing set. Rows 2 and 3: we show the top 16 activations out of 256 cells. Note: (i) For ConvLSTM, even with memory regression there is still a burn-in time for the memory to converge. (ii) Memory cells get far noisier in subfigure (a) compared to subfigure (b) after several steps. (iii) Some memory cells co-adapt with the noisy sequences and act as action detectors (enclosed in a red rectangle in subfigure (b)).

3.3 Training

The Visual Object Tracking (VOT) [12] dataset is considered one of the hardest datasets for object tracking, because it contains videos of varying resolutions, some of the target objects (e.g. a football) are very small relative to the image size, and some objects undergo significant appearance shifts. The dataset contains 60 video sequences with more than 200 frames per sequence on average. To deal with the limited number of videos, we use 10-fold cross-validation and randomly distribute the 60 sequences into 10 data folds. Each fold contains 54 videos in the training set and the other 6 videos in the testing set. Testing videos often contain objects not seen in the training set. For all models, training is done only on the training set, and we report the average accuracy on the testing set.
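The 10-fold split described above can be sketched as follows. The sequence names, the seed and the function name `make_folds` are placeholders for illustration; only the 60/10/54/6 arithmetic comes from the text.

```python
import random


def make_folds(sequence_names, n_folds=10, seed=0):
    """Randomly distribute sequences into n_folds (train, test) splits."""
    names = list(sequence_names)
    random.Random(seed).shuffle(names)
    fold_size = len(names) // n_folds
    folds = []
    for k in range(n_folds):
        # Each fold holds out a disjoint block of sequences for testing.
        test = names[k * fold_size:(k + 1) * fold_size]
        train = [n for n in names if n not in test]
        folds.append((train, test))
    return folds


sequences = [f"seq_{i:02d}" for i in range(60)]  # placeholder names
folds = make_folds(sequences)
```

With 60 sequences and 10 folds, every sequence is used for testing exactly once.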

Each minibatch of sequences is prepared by the following steps:

  1. Manually select a random frame from a sequence as the initial frame. The initial frame must not contain artifacts such as occlusion or motion blur.

  2. Crop this and all subsequent frames to generate ground truth ROI images. The width of the square ROI is twice the longer length of the object along the horizontal and vertical directions. The ROI width is further truncated to within the range of . In order to train the model to deal with the potential error of the global model, the location is set according to the object mask at frame . Resize all ROI images to , equal to the input image size for the VGG network.

  3. Perturb the resulting ROIs in both positions and size randomly. Random scaling is set in the range of and spatial shift pixels. We denote the resulting sequences of ROI images for all training videos (each video contains only one object) as , and the sequences of ground truth segmentation masks within the ROI as .

  4. For each training video , replace the ROI image at a randomly chosen time with the ROI image for another randomly chosen video at another time . The ground truth segmentation maps for such ROI images are set to all zero. Motivation for this step is explained in Sec. 3.4.
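Step 2 above can be sketched as follows: a square ROI whose width is twice the longer side of the object's bounding box, clamped to a range. The truncation range is not recoverable from this text, so `min_width` and `max_width` are hypothetical parameters, and centring the ROI on the bounding box is also an assumption.

```python
def roi_from_mask(mask, min_width, max_width):
    """Return (center_y, center_x, width) of a square ROI around the object.

    mask: 2D list of 0/1 values. min_width and max_width are hypothetical
    truncation limits standing in for the range used in the paper.
    """
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        return None  # no object pixels in this frame
    top, bottom, left, right = min(ys), max(ys), min(xs), max(xs)
    height = bottom - top + 1
    width = right - left + 1
    # ROI width is twice the longer side of the object, then truncated.
    roi_width = 2 * max(height, width)
    roi_width = max(min_width, min(roi_width, max_width))
    return ((top + bottom) / 2, (left + right) / 2, roi_width)


mask = [[0] * 8 for _ in range(8)]
for y in (2, 3):
    for x in (1, 2, 3, 4):
        mask[y][x] = 1
roi = roi_from_mask(mask, min_width=4, max_width=100)
```

The random scaling and shifting of step 3 would then be applied to the returned centre and width.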

After these steps, each training sample is a pair of video clips (the ROI image sequence and the ground truth ROI mask sequence of a training video). We then solve the optimization problem in Eq. (2) over the parameters of the local segmentation network, where the objective combines an element-wise cross-entropy loss and an image total variation loss. We use the image total variation loss to discourage segmentation maps that contain multiple small isolated components. We intentionally avoid more complicated post-processing of the segmentation map, such as a Markov random field (MRF), both to reduce the computational complexity at inference time and to enable end-to-end training. Eq. (2) also includes a thresholding term that stabilizes the training procedure, especially at the beginning. The hyperparameters of Eq. (2) were set to the values found to achieve the best performance.
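A minimal sketch of an objective combining the two loss terms named above is given here. It is illustrative only: the anisotropic form of the total variation, the weight `tv_weight` and the toy maps are assumptions, and the thresholding term of Eq. (2) is omitted because its form is not given in this text.

```python
import math


def cross_entropy(pred, target, eps=1e-7):
    """Mean element-wise binary cross entropy between two 2D maps."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            count += 1
    return total / count


def total_variation(pred):
    """Anisotropic total variation: sum of absolute neighbour differences."""
    rows, cols = len(pred), len(pred[0])
    tv = 0.0
    for y in range(rows):
        for x in range(cols):
            if y + 1 < rows:
                tv += abs(pred[y + 1][x] - pred[y][x])
            if x + 1 < cols:
                tv += abs(pred[y][x + 1] - pred[y][x])
    return tv


def local_loss(pred, target, tv_weight):
    """Cross entropy plus weighted total variation (tv_weight is assumed)."""
    return cross_entropy(pred, target) + tv_weight * total_variation(pred)


target = [[1, 1], [1, 1]]
smooth = [[0.9, 0.9], [0.9, 0.9]]
speckled = [[0.9, 0.1], [0.9, 0.9]]
loss_smooth = local_loss(smooth, target, tv_weight=0.1)
loss_speckled = local_loss(speckled, target, tv_weight=0.1)
```

The speckled map is penalised twice, once for the wrong pixel and once for the jump to its neighbours, which is how the total variation term discourages small isolated components.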


3.4 Comparison and Analysis

We found step 4 in the data preparation is crucial for the success of the local segmentation network. Without step 4, the convolution LSTM merely learns a frame by frame saliency detection. In Fig. 3, we compare the memory state evolution for two networks with and without step 4 on unseen sequences. The module learnt with step 4 is much more stable especially when there are multiple salient objects in the same ROI.
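Step 4 of the data preparation can be sketched as below. Frames are represented by placeholder strings, the function name `insert_distractor` is hypothetical, and keeping the initial frame untouched is an assumption made so that memory initialization still sees the true object.

```python
import random


def insert_distractor(rois, masks, video_idx, rng):
    """Replace one frame of a clip with a ROI image from another video.

    rois:  list of clips, each a list of ROI images (any objects).
    masks: list of clips, each a list of 2D 0/1 ground truth masks.
    Returns copies of the modified clip and masks plus the replaced index.
    """
    roi_seq = list(rois[video_idx])
    mask_seq = list(masks[video_idx])
    others = [i for i in range(len(rois)) if i != video_idx]
    other_idx = rng.choice(others)
    t = rng.randrange(1, len(roi_seq))  # never the initial frame (assumption)
    t_other = rng.randrange(len(rois[other_idx]))
    roi_seq[t] = rois[other_idx][t_other]
    # The inserted frame contains no target, so its mask is all zero.
    height, width = len(mask_seq[t]), len(mask_seq[t][0])
    mask_seq[t] = [[0] * width for _ in range(height)]
    return roi_seq, mask_seq, t


rois = [["a0", "a1", "a2", "a3"], ["b0", "b1", "b2", "b3"]]
masks = [[[[1, 1], [0, 1]] for _ in range(4)] for _ in range(2)]
new_rois, new_masks, t = insert_distractor(rois, masks, 0, random.Random(0))
```

Because the target mask for the inserted frame is empty, the network is pushed to rely on its memory of the tracked object rather than on per-frame saliency.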

We further conducted another experiment to demonstrate the benefit of using convLSTM. In this experiment, we fine-tune a pretrained segmentation network with a fully convolutional network (FCN) [14] structure for the local segmentation task. The FCN is pretrained on the COCO dataset [13]. The feature extraction part of our local model uses the same model up to pool4. When using the FCN segmentation network on a testing video, we fine-tune it on the first frame of the testing video clip with a small learning rate and few iterations, and apply the refined model to subsequent frames. We compare the segmentation accuracy over the following 32 frames in all testing video clips. For the convLSTMs trained with and without step 4, we do not fine-tune on the first frame of the testing video. We report the ROC curve and framewise IOU curve for 1200 randomly sampled video clips in the testing set in Fig. 4. True positive rate and false positive rate are defined at the pixel level. Framewise IOU is defined in Eq. (3). ConvLSTMs trained under both strategies achieve higher AUC for the ROC curves, and the FCN with refinement at the testing stage could not adapt to appearance shift, as demonstrated in Fig. 4.
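A framewise IOU of the usual intersection-over-union form can be sketched as follows. This assumes the predicted probability map is binarized with a threshold before comparison (Tables 2 and 3 mention thresholds of 0.4 and 0.7); the exact form of Eq. (3) is not recoverable from this text, and returning 1.0 for two empty masks is a convention chosen here.

```python
def framewise_iou(prob_map, gt_mask, threshold=0.4):
    """Intersection over union between a thresholded map and a 0/1 mask."""
    inter = union = 0
    for p_row, g_row in zip(prob_map, gt_mask):
        for p, g in zip(p_row, g_row):
            pred = 1 if p >= threshold else 0
            truth = 1 if g else 0
            inter += pred & truth
            union += pred | truth
    # Both masks empty: treated as a perfect match (a convention, assumed).
    return inter / union if union else 1.0


prob = [[0.9, 0.5],
        [0.2, 0.0]]
gt = [[1, 0],
      [1, 0]]
iou_low = framewise_iou(prob, gt, threshold=0.4)
iou_high = framewise_iou(prob, gt, threshold=0.7)
```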


The better performance of ConvLSTM for the local model comes at a price, as analyzed in Sec. 3.1: ConvLSTM is not shift equivariant, so a large spatial drift between consecutive frames could cause loss of tracking. In our observation, a spatial shift larger than 30 pixels in the ROI could cause instability in our tracking system. To circumvent this problem, we predict the ROI using a global attention network.

Figure 4: Comparison between ConvLSTM and framewise segmentation. Left: ConvLSTM 1 and 2 represent training strategies with and without randomly replaced frames, respectively. Right: IOU comparison per frame.


4 Global Attention Network

To predict where the ROI should be located in the current frame based on the history of predicted segmentation maps, one naive way is to use a weighted average of the past predicted locations directly to decide where the ROI should be cropped. However, at test time the local predictor might make mistakes caused by lighting conditions, drastic appearance change, motion blur, etc. Such mistakes could then cause the global model to locate a wrong ROI for the next frame. Over time, the ROI could drift away from the correct object location. Therefore, we need a robust global model that can handle such problems. The ROI is specified by a spatial transform acting on a fixed anchor (a square region). We apply an LSTM to the past global segmentation maps to generate features that are then fed to a spatial transformer network to determine the transform parameters. Our spatial transformer network is a special form of [8]: the transformation is applied not to the feature map but to a fixed anchor. The training framework of the global attention network is shown in Fig. 6.

During training, at each time step a fixed-size segmentation map is fed into the global attention model. The network generates the parameters of a restricted affine transform. The spatial transform is applied to the anchor so that the transformed anchor maximally overlaps with the ground truth segmentation map of the corresponding frame. We want the transformed anchor to enclose as many foreground pixels as possible, so we use a weighted loss between the transformed anchor and the ground truth segmentation map. We further add a loss term between the transform parameters of consecutive frames so that the transformer is temporally smooth. The parameterization constrains the transform to allow only spatial shift and resizing. The resizing operation takes the image aspect ratio into account, so that the aspect ratio is not distorted when cropping in the image domain (the ROI on the real image is always a square, but with varying size). The overall loss function is defined as:


The detailed number of parameters of our global model is shown in Tab. 1. During training, we observe that the recurrent model needs a burn-in time to accurately predict the spatial transform; otherwise it would not utilize the full history of the observations. We therefore compute the loss only after an initial number of burn-in frames, chosen empirically relative to the total sequence length. To let our model converge faster, in practice we apply a dilation kernel to our input sequence and gradually shrink the size of the dilation kernel until convergence.
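The restricted transform described in this section, with one scale and two shift parameters (matching the three outputs of the last fully connected layer in Tab. 1), can be sketched as below. The normalized coordinate convention in [-1, 1], the anchor layout `(cx, cy, half_size)` and the choice to measure the ROI size against the image width are all assumptions for illustration.

```python
def transform_anchor(anchor, scale, shift_x, shift_y):
    """Apply A = [[s, 0, tx], [0, s, ty]] to a square anchor.

    anchor: (cx, cy, half_size) in normalized coordinates (assumed layout).
    Only isotropic scaling and translation are allowed, so a square anchor
    stays square.
    """
    cx, cy, half = anchor
    return (scale * cx + shift_x, scale * cy + shift_y, scale * half)


def to_pixel_box(norm_box, img_w, img_h):
    """Map a normalized box to a square pixel ROI (left, top, width).

    Coordinates in [-1, 1] span the image on each axis. The size is measured
    against the image width (an assumption), which keeps the ROI square in
    pixels whatever the aspect ratio.
    """
    cx, cy, half = norm_box
    px = (cx + 1) / 2 * img_w
    py = (cy + 1) / 2 * img_h
    half_px = half / 2 * img_w
    return (px - half_px, py - half_px, 2 * half_px)


anchor = (0.0, 0.0, 0.5)
moved = transform_anchor(anchor, scale=0.5, shift_x=0.5, shift_y=-0.5)
box = to_pixel_box(moved, img_w=640, img_h=480)
```

Using a single scale for both axes is what prevents the transform from distorting the aspect ratio of the cropped region.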

However, at the inference stage, since the model can only utilize the masks predicted by the local model, there is a distribution difference between testing sequences and training sequences. Fig. 5 demonstrates the difference between the training set and the testing set. To handle the distribution gap, we iteratively adapt our global model and local model. We describe the way to update our models in Sec. 5.

Figure 5: Demonstration of observation difference between testing set observation and ground truth. Each row shows a sequence temporally downsampled by 4. From top to bottom: input to the global model in testing sequence, ground truth mask and predicted ROI location.
Figure 6: Training framework for global attention model

5 Experiment

5.1 Iterative Optimization

In addition to preparing local samples as described in Sec. 3.3, to handle the observation difference mentioned in Sec. 4, we perform training on each data fold as follows:

  1. Evenly separate the video sequences of each training set into two subsets. On each subset, use the ground truth bounding boxes to prepare a training set for the local model (see Sec. 3.3). Train one local model on each subset with early termination, using the loss function in Eq. (2).

  2. Train the initial global model using sequences of ground truth segmentation maps with the loss function in Eq. (4). To increase convergence speed, we apply a dilation operation to the input segmentation maps and shrink the dilation kernel size every ten thousand iterations until convergence.

  3. Use the trained local model 1 from step 1 and the global model from step 2 to generate predicted segmentation maps, ROI images and ROI segmentation maps for the training data in subset 2. The procedure is described in Algorithm 1. Use local model 2 to do the same on subset 1.

  4. Update the global model with the modified input sequences generated in step 3 using Eq. (4).

  5. Train the local model with Eq. (2), using the ROI image sequences and segmentation map sequences generated by the updated global model for the entire training set.

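Algorithm 1: Two-stage tracking algorithm
Input: raw images, initial segmentation map, global model, local model
Output: predicted segmentation maps, ROI image sequence and ROI segmentation map sequence
  Crop the initial ROI with the initial spatial parameter
  Initialize the memory of the local model using the initial ROI image and mask
  for each subsequent frame do
    Estimate the spatial parameter using the global model
    Update the ROI as the transformed anchor
    Get the ROI image from the frame using the ROI
    Estimate the ROI segmentation map using the local model
    Use the ROI segmentation map to fill in the full-resolution segmentation map by interpolation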
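The closed loop of Algorithm 1 can be sketched in Python as below. The five callables (`predict_roi`, `init_memory`, `segment`, `crop`, `paste`) are hypothetical stand-ins for the global model, the memory initialization module, the local model and the cropping and interpolation steps; the stubs in the usage lines exist only to show the data flow.

```python
def track(frames, first_mask, first_roi,
          predict_roi, init_memory, segment, crop, paste):
    """Run the two-stage loop and return (segmentation maps, ROIs)."""
    # Initialize the local model memory from the first ROI image and mask.
    memory = init_memory(crop(frames[0], first_roi),
                         crop(first_mask, first_roi))
    maps, rois = [first_mask], [first_roi]
    for frame in frames[1:]:
        # Global model: predict the ROI from the past segmentation maps.
        roi = predict_roi(maps)
        # Local model: segment the cropped ROI and update its memory.
        roi_map, memory = segment(crop(frame, roi), memory)
        # Fill the full-resolution map from the ROI segmentation map.
        maps.append(paste(roi_map, roi, frame))
        rois.append(roi)
    return maps, rois


# Toy stubs that only record what was passed around.
maps, rois = track(
    frames=["f0", "f1", "f2"],
    first_mask="m0",
    first_roi="r0",
    predict_roi=lambda history: len(history),
    init_memory=lambda roi_img, roi_mask: 0,
    segment=lambda roi_img, mem: (("seg", roi_img), mem + 1),
    crop=lambda image, roi: (image, roi),
    paste=lambda roi_map, roi, frame: ("full", frame, roi),
)
```

Each predicted map is appended to the history before the next ROI is estimated, which is what makes the loop closed: errors of the local model feed back into the global model.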
Figure 7: Tracking results for 8 videos. The predicted segmentation maps are overlaid on the original images. The results are obtained with the trained 2-stage model. The failure cases in sequence 2 (top right) and sequence 8 (bottom right) are mostly due to large camera motions.

5.2 Time Complexity

We evaluate our tracking algorithm on the VOT2016 [12] segmentation dataset. We implement our algorithm in TensorFlow and test it on a single NVIDIA Tesla K80 with 24 GB of memory. The inference times of the local and global models are 24 ms and 6 ms per frame, respectively. With memory initialization and ROI interpolation included, the entire framework still runs at more than 20 fps.
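As a quick check of these figures, assuming the two models run one after the other on each frame, the model time alone bounds the frame rate and leaves the following per-frame budget for the remaining steps:

```python
local_ms, global_ms = 24.0, 6.0  # per-frame inference times from the text

model_ms = local_ms + global_ms         # sequential execution (assumed)
model_fps = 1000.0 / model_ms           # upper bound from the models alone
budget_ms_at_20fps = 1000.0 / 20.0      # total time allowed per frame
overhead_budget_ms = budget_ms_at_20fps - model_ms
```

Under that assumption, the models alone allow roughly 33 fps, and running above 20 fps means memory initialization and ROI interpolation together take less than 20 ms per frame.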

5.3 Quantitative Analysis

First, we compare our local segmentation network using LSTM with the FCN segmentation network. Both models are provided with the same ROI sequences. The only difference is that the FCN network is further fine-tuned on the first frame of each sequence. We follow the same procedure as in Sec. 3.3, using the ground truth labels to prepare sequences. The only exception is that we use an exponential weighting on the ROI center location, with the decay rate set to 0.8. Here, we use the ground truth location of the object at an earlier frame to crop the local ROI, so that the object is reasonably far from the center but still in the ROI. We admit that the test could still favor our convolutional LSTM, as the object is registered to be close to the center of the ROI. On the other hand, this test demonstrates the upper bound of the proposed object tracking approach, achievable when the global model can accurately locate the ROI. The evaluation using the FCN segmentation network, in turn, is meant to evaluate the tracking performance achievable by a CNN-based segmentation network when equipped with a near-perfect global tracker. The result is shown in Tab. 2.

Next, we evaluate our global model and local model jointly. The inference for the two-stage model (denoted 2-stage ConvLSTM) follows Algorithm 1. We also evaluate a benchmark 1-stage model, which replaces the global model with a simple predictor for the ROI center, described in Eq. (5). Eq. (5) uses the estimated probability of each pixel belonging to the foreground, given by the local model for the previous frame. The ROI size is fixed. FCN-based tracking uses the same approach for determining the ROI with its own segmentation map. Table 3 compares the performance of our 2-stage model, the benchmark 1-stage model and the FCN segmentation network.
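One plausible form of such a simple ROI-center predictor is a probability-weighted centroid of the previous segmentation map. The sketch below is an assumption about the shape of Eq. (5), which is not recoverable from this text, and the function name is hypothetical.

```python
def weighted_center(prob_map):
    """Probability-weighted centroid (row, col) of a 2D foreground map."""
    total = sum(sum(row) for row in prob_map)
    if total == 0:
        return None  # no foreground evidence in the previous frame
    cy = sum(y * p for y, row in enumerate(prob_map) for p in row) / total
    cx = sum(x * p for row in prob_map for x, p in enumerate(row)) / total
    return cy, cx


prev_map = [[0, 0, 0],
            [0, 1, 1],
            [0, 0, 0]]
center = weighted_center(prev_map)
```

A fixed-size ROI would then be cropped around this center in the next frame, which is what the 1-stage benchmark is described as doing.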

Results in Tables 2 and 3 demonstrate that the proposed 2-stage convLSTM architecture is better than a CNN fine tuned on the first frame. Even when provided with nearly correct location of the ROI in each new frame, the CNN-based segmentation network could not handle appearance shift as well as our 2-stage ConvLSTM. Furthermore, our global model performs better than a naive location predictor. Sample visual results are shown in Fig. 7.

Test sequence length (threshold at 0.4)    20        40        80        160
ConvLSTM                                   0.4690    0.4405    0.3962    0.3342
FCN                                        0.3785    0.3582    0.3157    0.2304
Table 2: Local segmentation network evaluation: average IOU at different sequence length using ROIs that are close to ground truth location. For sequences shorter than the preset length, we upsample the testing sequences to the fixed length.
Test sequence length (threshold at 0.4)    20        40        80        160
2-stage ConvLSTM                           0.3992    0.3606    0.3201    0.2564
1-stage ConvLSTM                           0.380     0.34601   0.3046    0.2419
FCN                                        0.2275    0.2058    0.1678    0.1437

Test sequence length (threshold at 0.7)    20        40        80        160
2-stage ConvLSTM                           0.2485    0.2302    0.202     0.1854
1-stage ConvLSTM                           0.2080    0.1926    0.1679    0.1518
FCN                                        0.1030    0.093     0.0711    0.058
Table 3: Overall network evaluation: average IOU at different sequence length.

6 Acknowledgement

This work was funded by National Science Foundation award CCF-1422914.

7 Conclusion

In this work, we tackle the tracking problem at the pixel level. Given the first frame and the corresponding segmentation map, we model the appearance shift as a time series. We propose a novel two-stage model that handles micro-scale appearance change and macro-scale object motion separately. The local segmentation model performs far better than a CNN fine-tuned on the first frame. The global model can accurately predict the rough location and size of the object from frame to frame. We demonstrate our novel approach on a very challenging VOT dataset. Finally, our model performs pixel-wise object tracking at reasonable accuracy in real time.