Provided with an object of interest in the first frame, visual object tracking is the problem of building a computational model that can predict the location of the object in subsequent frames. A robust tracking algorithm should handle common challenges including target deformation, motion blur, illumination change, partial occlusion, and background clutter. Many existing algorithms use online learning, building a discriminative model to separate the object from the background. The feature extractor is the most important component of a tracker; using appropriate features can drastically boost tracking performance. Many recent tracking-by-detection approaches [7, 27, 18] are inspired by methods for object detection [6, 21, 22] and fully embrace features learned from deep convolutional neural networks. We recognize that existing CNN-based feature extractors increase the performance and robustness of tracking systems, yet how to extend deep neural networks for visual object tracking has not been fully investigated. In our work, we treat object tracking as a time-series prediction problem; in particular, we predict a pixel-wise foreground-background label for consecutive frames.
Segmentation-based tracking algorithms [1, 2, 25, 32, 9] have an advantage over detection-based algorithms in handling a target that undergoes substantial non-rigid motion. Many of them [2, 25, 1] rely only on pixel-level information and hence fail to consider the semantic structure of the target.
uses a Markov chain on a superpixel graph, but information propagation through a graph can be slow depending on its structure. uses an encoder-decoder structure that shares some similarity with ours, but relies on optical flow and a Markov random field, which limits the segmentation speed to around 1 fps and to high-quality datasets such as DAVIS. The encoder-decoder structure is widely used in deep learning systems [15, 11, 20, 30]. [20, 30] use deconvolution for image segmentation and contour detection. In our work, we use a decoder to directly perform pixel-wise classification (object or not) on a video sequence.
To handle target appearance variation in object tracking, several recent trackers embed a CNN into their frameworks. Specifically, [26, 3] modify the siamese network structure for visual tracking.
trains a CNN using ImageNet data and transfers the rich features learned to a new object sequence by updating the network online. trained a multi-domain network with separate branches for different domain sequences; the domain-specific layer needs to be refined for each sequence. One major limitation of the aforementioned methods is that they lack a mechanism to jointly model the spatial-temporal traits of the object. [5, 10, 31] propose to solve tracking as sequential position prediction by training an RNN to model the time series. [5, 10] use RNNs to model the temporal relationship among frames, but they only conducted experiments on synthesized data and did not demonstrate competitive results on challenging datasets such as VOT. uses convolutional LSTM (convLSTM) to perform instance segmentation on a single image; by spatial inhibition with an attention mechanism, they demonstrate compelling results on the Pascal VOC dataset. uses convLSTM to model object feature variations, and their object detection mechanism uses a convolution structure similar to . By performing a convolution between an exemplar frame and a region of interest (ROI), the output of their system is too coarse for fine-grained pixel labeling.
We propose a novel object tracking framework consisting of two models. The global model learns the global motion pattern of the object and predicts the object’s likely location in a new frame from its past locations. The local model performs object segmentation in a ROI identified by the global model, based primarily on the appearance features of the object in the new frame. The local model uses a convLSTM based structure whose memory state evolves to learn the essential appearance features of the object, enabling the segmentation of the object even under significant appearance shifts and occlusion. The LSTM output further goes through a deconvolution layer to generate the segmentation map. The global model also employs a convLSTM structure to generate the latent feature characterizing the object motion, which is fed to a spatial transformer network to determine the location and size of the ROI in the new frame. The proposed framework has demonstrated promising performance on a very challenging dataset (VOT 2016), where some objects are very small relative to the image size, and the testing videos often contain objects not seen in the training videos (in our cross-validation study).
Our goal is to build a pixel-wise object tracking framework for all possible image resolutions and aspect ratios. Models are trained in an offline supervised setting; once offline training is finished, there is no need to retrain the network. Segmentation of the full-resolution image would require a large amount of computational resources for real-time applications. In addition, scaling the original image to a fixed size could destroy the semantic information and appearance features of an object that is small relative to the image size. To overcome these difficulties, a global model is used to predict the rough location of the object based on the past object segmentation maps. We then crop a region of interest (ROI) from the original image and perform segmentation on the ROI. The network structure is shown in Fig. 1.
The model runs in a closed loop at inference time. At time step the global model takes a fixed-size segmentation map as input, which is the resized version of the predicted full-resolution segmentation map derived at . We use several layers of convolution and pooling to reduce the dimensionality of the image. The resulting features are fed into a convolutional LSTM
to fully exploit the temporal variation characteristics of the past segmentation maps. To allow different ROI sizes for the local model (necessary to handle different object sizes and object size variation due to motion in the depth direction), another fully connected layer takes the convLSTM output as its input and estimates the spatial transformation parameters (including translation and scaling) for the ROI locator, which applies the transformation to a reference anchor box to generate the ROI in the raw frame . As the input segmentation map to the global model is resized to , we inject the aspect ratio of the original image into the last fully connected layer of the global model to generalize across video sequences with different aspect ratios.
At time step the local model receives an ROI image cropped from the full-resolution image. A pretrained VGG is used to extract features from the ROI image. These features are then fed into a convLSTM to model appearance shift. The output of the convLSTM then goes through a deconvolution layer to generate the local segmentation map (a grayscale image whose value at each pixel is proportional to the estimated likelihood that the pixel belongs to the tracked object). Based on how the ROI is cropped from the full-resolution image, the full-resolution segmentation map is interpolated accordingly from the ROI segmentation map. Here we assume the ROI encloses the entire object, so all pixels outside the ROI are set to zero.
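The last step above, interpolating the ROI segmentation map back into the full-resolution frame, can be sketched in a few lines (a minimal numpy sketch with nearest-neighbour resizing; the function and argument names are illustrative, not the paper's notation):

```python
import numpy as np

def paste_roi_map(roi_map, roi_box, full_shape):
    """Paste an ROI segmentation map back into a full-resolution map.

    roi_map:    (h, w) float array of local segmentation likelihoods
    roi_box:    (y0, x0, H, W) placement of the ROI in the full frame
    full_shape: (rows, cols) of the full-resolution frame
    Pixels outside the ROI are set to zero, following the assumption
    that the ROI encloses the entire object.
    """
    y0, x0, H, W = roi_box
    full = np.zeros(full_shape, dtype=roi_map.dtype)
    # nearest-neighbour resize of the ROI map to the ROI's size in the frame
    ys = np.arange(H) * roi_map.shape[0] // H
    xs = np.arange(W) * roi_map.shape[1] // W
    full[y0:y0 + H, x0:x0 + W] = roi_map[np.ix_(ys, xs)]
    return full
```

Any resampling scheme (e.g. bilinear) could replace the nearest-neighbour indexing; the key property is that every pixel outside the ROI stays zero.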
3 Local Segmentation Network
The network structure of our local model is shown in Fig. 2. We consider pixel-wise object tracking as a time-series prediction problem. At each time , an input ROI is first processed by a pretrained convolutional network. As  has shown, high-layer features of a trained CNN bear more semantic information, whereas low-layer outputs bear more appearance information. For generic visual tracking, the features should be robust enough to work with many different object categories, and also able to discriminate object instances from the same object class. Lower-level features could be more helpful for such a task. However, using very low-level features from a pretrained network could drastically increase the computation cost of the following layers. Based on these considerations, we use pool4 features from a VGG network  pretrained on an image segmentation dataset. The weights of this feature extractor are kept fixed during training.
The features are then fed into convolutional LSTMs. However, we found the resulting network hard to train because the pool4 features are not confined to a certain range. Therefore, we use another small network consisting of two convolutional layers to normalize the VGG features, which uses
as the last activation function. These parts are denoted F-CNN in Fig. 2. The normalized features then go through a two-layer convLSTM. Intuitively, the first convLSTM layer models the dynamics of the foreground object as well as the background, and the second convLSTM layer mostly addresses the appearance shift of the target object. The equations we use for the convLSTM are shown in Eq. (1). To obtain the segmentation map, the output features of the second convLSTM layer are fed into a deconvolution layer.
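Since the equations of Eq. (1) are not reproduced here, the following single-channel numpy sketch shows the standard convLSTM step that the local model builds on, with convolutional input/state terms and Hadamard peephole terms on the memory cell (a sketch for intuition only, not our exact multi-channel configuration):

```python
import numpy as np

def conv2d(x, k):
    """'same'-padded 2-D convolution (cross-correlation) of a
    single-channel map x with kernel k, in pure numpy."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h, c, W):
    """One convLSTM step on single-channel maps.

    W holds 3x3 kernels W['x*'], W['h*'] and peephole maps W['c*'].
    The peephole terms use the Hadamard product (not convolution),
    which is what restricts cross-channel information exchange.
    """
    i = sigmoid(conv2d(x, W['xi']) + conv2d(h, W['hi']) + W['ci'] * c)
    f = sigmoid(conv2d(x, W['xf']) + conv2d(h, W['hf']) + W['cf'] * c)
    c_new = f * c + i * np.tanh(conv2d(x, W['xc']) + conv2d(h, W['hc']))
    o = sigmoid(conv2d(x, W['xo']) + conv2d(h, W['ho']) + W['co'] * c_new)
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

The sketch omits biases and multi-channel tensors for brevity; in the real model each gate is a bank of filters over many channels.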
In Eq. (1), the Hadamard products between and
are crucial for learning long-term dependencies. They restrict cross-channel information exchange and help overcome the vanishing gradient problem; replacing the Hadamard product with convolution would not achieve similar performance for time-sequence modeling. On the other hand, the convLSTM is not equivariant to translation, particularly because of the Hadamard product: a spatially shifted version of the input image may not lead to an equally shifted segmentation map. As the global model may not always generate ROIs centered around the object at different frame times, it would be preferable for the local segmentation network to have a certain degree of translation equivariance. Although this is one major drawback of using convLSTM for object tracking, we have found that with the ROI chosen by the global model, the object tends to fall near the center of the ROI in all frames, and our local model performs well even with small spatial shifts between consecutive frames for unseen objects. The detailed numbers of parameters are shown in Tab. 1.
|Local model||filter size||channels||stride|
|Global model||filter size||channels||stride|
: every two convolution layers are followed by a pooling operation to reduce the spatial dimensionality. The input to the fully connected layer is the vectorized output of the convolutional LSTM. For the second fully connected layer, the input dimension is 1025, where we concatenate the feature from the last layer with the aspect ratio of the current video clip.
3.2 Memory Initialization
To start the convLSTM, we need to initialize the memory and hidden state. Initializing the memory cell to zeros is one option, but a major drawback of this approach is that the memory cell of a recurrent network needs multiple time steps to converge. During this time its hidden state is also drastically different from its true distribution, and segmentation could easily fail because the deconvolution is applied directly on . A wrongly predicted local segmentation map would further affect the global model. Moreover, within the first ROI there could be more than one salient object; without differentiating between these salient objects, the tracking system would not know which object to track and is likely to fail.
Instead of arbitrarily initializing the memory with zeros, we train an initialization module that takes the object mask and the image in a manually chosen ROI in the first frame, and generates the initial memory cell state and hidden state, which ideally should capture the appearance features of the object. To overcome boundary artifacts, we use a dilated mask to generate the masked image. In our experiments, we find that instead of applying the object mask in the image domain, applying the mask on the layer right before the pool1 layer of the VGG network renders better performance. We then regress the initial memory and hidden states of the convLSTM from the concatenated feature, using another two convolutional layers denoted M-CNN in Fig. 2. Similar to , we find that using a function as the last activation of M-CNN stabilizes the memory, even though the numerical value of the memory cell can go beyond the range of . Ideally we want the memory cell to slowly adapt to appearance drift while being able to ignore false objects. In Fig. 3 we show the memory state evolution under different training strategies; the training strategies are discussed in the following subsection.
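The dilated-mask step can be sketched with a pure-numpy binary dilation (3×3 structuring element; the number of dilation iterations is a hypothetical parameter, since the kernel size used in the paper is not given here):

```python
import numpy as np

def dilate(mask, iters=2):
    """Binary dilation of a 2-D mask with a 3x3 structuring element."""
    m = mask.astype(bool)
    for _ in range(iters):
        p = np.pad(m, 1)
        # OR together the nine 3x3-neighbourhood shifts
        m = (p[:-2, :-2] | p[:-2, 1:-1] | p[:-2, 2:] |
             p[1:-1, :-2] | p[1:-1, 1:-1] | p[1:-1, 2:] |
             p[2:, :-2] | p[2:, 1:-1] | p[2:, 2:])
    return m

def masked_image(image, mask, iters=2):
    """Mask the first-frame ROI image with a dilated object mask so that
    boundary pixels of the object are not cut off by a tight mask."""
    return image * dilate(mask, iters)[..., None]
```

In the actual model the (dilated) mask is applied to the feature map just before pool1 rather than to the raw image, but the masking operation itself is the same elementwise product.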
The Visual Object Tracking (VOT)  dataset is considered one of the hardest datasets for object tracking, because it contains videos of varying resolutions, some target objects (e.g. a football) are very small relative to the image size, and some objects undergo significant appearance shifts. The dataset contains 60 video sequences with more than 200 frames per sequence on average. To deal with the limited number of videos, we use 10-fold cross validation and randomly distribute the 60 sequences into 10 folds. Each fold contains 54 videos for training and the other 6 videos for testing. Testing videos often contain objects not seen in the training set. For all models, training is done only on the training set, and we report the average accuracy on the testing set.
The minibatches of sequences are prepared by the following steps:
Select a frame from a sequence at random as the initial frame; the initial frame should not contain artifacts such as occlusion or motion blur.
Crop this and all subsequent frames to generate ground-truth ROI images. The width of the square ROI is twice the longer extent of the object along the horizontal and vertical directions. The ROI width is further truncated to within the range of . To train the model to deal with the potential error of the global model, the location is set according to the object mask at frame . Resize all ROI images to , the input image size of the VGG network.
Randomly perturb the resulting ROIs in both position and size. Random scaling is set in the range of and spatial shifts to pixels. We denote the resulting sequences of ROI images for all training videos (each video contains only one object) as , and the sequences of ground-truth segmentation masks within the ROI as .
For each training video , replace the ROI image at a randomly chosen time with the ROI image from another randomly chosen video at another time . The ground-truth segmentation maps for such ROI images are set to all zeros. The motivation for this step is explained in Sec. 3.4.
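This last step (the "step 4" discussed in Sec. 3.4) can be sketched as follows (a minimal numpy sketch; function and argument names are illustrative):

```python
import numpy as np

def mix_negative_frame(rois, masks, other_rois, rng):
    """Swap in a distractor ROI from another video at a random time step
    and zero its ground-truth mask, so the network must rely on its
    memory of the target rather than frame-wise saliency.

    rois, masks: (T, h, w) arrays for one training video
    other_rois:  (T', h, w) ROI images from a different video
    rng:         a numpy random Generator
    """
    t = rng.integers(len(rois))
    s = rng.integers(len(other_rois))
    rois, masks = rois.copy(), masks.copy()
    rois[t] = other_rois[s]
    masks[t] = np.zeros_like(masks[t])  # distractor frame has no target
    return rois, masks
```

The all-zero target mask at the swapped frame is what teaches the convLSTM to output an empty map when the remembered object is absent.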
After these steps, each training sample is a pair of video clips (the ROI image sequence and the ground-truth ROI mask sequence for a training video). We then solve the optimization problem in Eq. (2), where and are the element-wise cross-entropy loss and the image total-variation loss, respectively. defines the local segmentation network and denotes its parameters. We use the total-variation loss to discourage the resulting segmentation map from containing multiple small isolated components. We intentionally avoid applying more complicated post-processing to the segmentation map, using approaches such as a Markov random field (MRF), both to reduce computational complexity at inference time and to enable end-to-end training. is a thresholding term that stabilizes the training procedure, especially at the beginning stage. and were found to achieve the best performance.
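The two loss terms can be sketched as follows (element-wise cross entropy plus a simple total-variation penalty over neighbouring-pixel differences; the weight `lam` is a hypothetical value, and the thresholding term mentioned above is omitted since its form is elided):

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Element-wise binary cross entropy, averaged over pixels."""
    p = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))

def total_variation(pred):
    """Sum of absolute horizontal and vertical differences; penalizes
    segmentation maps with many small isolated components."""
    dv = np.abs(np.diff(pred, axis=0)).sum()
    dh = np.abs(np.diff(pred, axis=1)).sum()
    return dv + dh

def local_loss(pred, target, lam=0.1):
    # lam is an illustrative weight; the paper's value is not given here
    return bce(pred, target) + lam * total_variation(pred)
```

Both terms are differentiable almost everywhere, which is what allows the post-processing-free, end-to-end training described above.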
3.4 Comparison and Analysis
We found that step 4 in the data preparation is crucial for the success of the local segmentation network. Without step 4, the convLSTM merely learns frame-by-frame saliency detection. In Fig. 3, we compare the memory state evolution for networks trained with and without step 4 on unseen sequences. The module learned with step 4 is much more stable, especially when there are multiple salient objects in the same ROI.
We further conducted another experiment to demonstrate the benefit of using convLSTM. In this experiment, we fine-tune a pretrained segmentation network with a fully convolutional network (FCN) structure for the local segmentation task. The FCN is pretrained on the COCO dataset
. The feature-extraction part of our local model uses the same model up to pool4. When using the FCN segmentation network on a testing video, we fine-tune it on the first frame of the testing video clip with a small learning rate for a few iterations, and apply the refined model to subsequent frames. We compare the segmentation accuracy over the following 32 frames in all testing video clips. For convLSTMs trained with and without step 4, we do not fine-tune on the first frame of the testing video. We report the ROC curve and frame-wise IOU curve for 1200 randomly sampled video clips from the testing set in Fig. 4. The true positive rate and false positive rate are defined at the pixel level. Frame-wise IOU is defined in Eq. (3). The convLSTM trained under either strategy obtains a higher AUC for the ROC curves, and the FCN network with refinement at the testing stage could not adapt to appearance shift, as Fig. 4 shows.
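Frame-wise IOU as in Eq. (3) can be computed per frame as follows (the 0.5 threshold on the likelihood map is an illustrative choice):

```python
import numpy as np

def frame_iou(pred, gt, thresh=0.5):
    """Frame-wise IOU between a thresholded likelihood map and the
    ground-truth binary mask."""
    p = pred >= thresh
    g = gt.astype(bool)
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    # empty prediction and empty ground truth count as a perfect match
    return inter / union if union > 0 else 1.0
```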
The better performance of convLSTM for the local model comes at a price, as analyzed in subsection 3.1: convLSTM is not shift equivariant, so a large spatial drift between consecutive frames could cause loss of tracking. In our observation, a spatial shift larger than 30 pixels in the ROI can destabilize our tracking system. To circumvent this problem, we predict the ROI using a global attention network.
4 Global Attention Network
To predict where the ROI should be located in the current frame based on the predicted segmentation maps in the history, one naive way is to use a weighted average of the past predicted locations to decide where the ROI should be cropped. However, at test time the local predictor may make mistakes caused by lighting conditions, drastic appearance change, motion blur, etc. Such mistakes could then cause the global model to locate a wrong ROI for the next frame; over time, the ROI could drift away from the correct object location. Therefore, we need a robust global model that can handle such problems. The ROI is specified by a spatial transform acting on a fixed anchor (a square region) . We apply an LSTM to the past global segmentation maps to generate features that are then fed to a spatial transformer network to determine the transform parameters. Our spatial transformer network is a special form of , but the transformation is applied not to the feature map but to a fixed anchor . The training framework of the global attention network is shown in Fig. 6.
During the training stage, at each time a fixed-size segmentation map is fed into the global attention model . The network generates a special form of affine transform parameters . The spatial transform is applied to , so that the transformed anchor maximally overlaps the ground-truth segmentation map of frame , . We want the transformed anchor to enclose as many foreground pixels as possible, and we use a weighted loss between and . We further add a loss term between and so that the transform is temporally smooth. Parameter
constrains the transform to allow only spatial shift and resizing. The resizing operation takes the image aspect ratio into consideration, so that the aspect ratio is not distorted when cropping in the image domain (the ROI on the real image is always a square, but with varying sizes). The overall loss function is defined as:
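The restricted transform on the anchor can be sketched as follows (normalized box coordinates and all parameter names are our own illustrative convention, not the paper's notation; only a shift and a single scale are allowed, with the scale divided by the aspect ratio along x so the cropped ROI stays square in frame pixels):

```python
def transform_anchor(anchor, params, aspect):
    """Apply a shift-and-scale transform to a square anchor box.

    anchor: (cx, cy, half_size) in coordinates normalized to [0, 1]
    params: (tx, ty, s) translation and isotropic scale
    aspect: width / height of the raw frame
    Returns (cx, cy, half_w, half_h) of the transformed box.
    """
    cx, cy, hs = anchor
    tx, ty, s = params
    # dividing the x half-extent by the aspect ratio keeps the ROI
    # square when mapped back to frame pixels
    return (cx + tx, cy + ty, hs * s / aspect, hs * s)
```

With identity parameters the anchor is returned unchanged; a frame twice as wide as it is tall (aspect = 2) halves the normalized x extent, which corresponds to the same pixel width as the y extent.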
The detailed numbers of parameters of our global model are shown in Tab. 1. During training, we observe that the recurrent model needs burn-in time to accurately predict the spatial transform; otherwise it would not utilize the full history of observations. So we only compute the loss after the th frame. In our experiments, we find that setting works best for a total sequence length of . To make our model converge faster, in practice we apply a dilation kernel to the input sequence and gradually shrink the size of the dilation kernel until convergence.
However, during the inference stage, since the model can only utilize the masks predicted by the local model , there is a distribution difference between testing and training sequences. Fig. 5 illustrates this difference between the training and testing sets. To handle the distribution gap, we iteratively adapt our global and local models. We describe how we update our models in Sec. 5.
5.1 Iterative Optimization
Train the initial global model using sequences of ground-truth segmentation maps with the global loss function defined in Sec. 4. To increase convergence speed, we apply a dilation operation on and shrink the dilation kernel size every ten thousand iterations until convergence.
Use the trained local model 1 from step 1 and the global model from step 2 to generate predicted segmentation maps, ROI images, and ROI segmentation maps for the training data in subset 2. The procedure is described in Algorithm 1. Use local model 2 to do the same on subset 1.
Update the global model with the modified input sequences generated in step 3, using the global loss function defined in Sec. 4.
Train the local model using the ROI image sequences and segmentation map sequences generated by the updated global model for the entire training set, with Eq. (2).
5.2 Time Complexity
We evaluate our tracking algorithm on the VOT2016
segmentation dataset. We implement our algorithm in TensorFlow and test it on a single NVIDIA Tesla K80 with 24 GB RAM. The inference speeds of the local and global models are 24 ms and 6 ms per frame, respectively. With memory initialization and ROI interpolation included, the entire framework still runs at more than 20 fps.
5.3 Quantitative Analysis
First, we compare our local segmentation network using convLSTM with the FCN segmentation network. Both models are provided with the same ROI sequences; the only difference is that the FCN network is further fine-tuned on the first frame of each sequence. We follow the same procedure as in Sec. 3.3, using the ground-truth labels to prepare sequences. The only exception is that we use an exponential weighting on the ROI location center by . The decay rate is set to 0.8. Here, we use the object's ground-truth location at frame to crop the local ROI, so that the object is reasonably far from the center but still inside the ROI. We admit that this test may still favor our convLSTM, as the location of the object is registered close to the center of the ROI. On the other hand, this test demonstrates the upper bound of the proposed object tracking approach, achievable when the global model accurately locates the ROI. The evaluation using the FCN segmentation network, in turn, measures the tracking performance achievable by a CNN-based segmentation network equipped with a near-perfect global tracker. The results are shown in Tab. 2.
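The exponential weighting of the ROI center can be sketched as follows (the exact weighting formula is elided above, so this recursive exponential-moving-average form with decay 0.8 is an assumption):

```python
import numpy as np

def ema_center(centers, decay=0.8):
    """Exponentially weighted estimate of the ROI center from a list of
    past (y, x) centers; recent frames dominate, older frames are
    discounted geometrically by `decay`."""
    est = np.asarray(centers[0], dtype=float)
    for c in centers[1:]:
        est = decay * est + (1.0 - decay) * np.asarray(c, dtype=float)
    return est
```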
Next, we evaluate our global and local models jointly. Inference for the two-stage model (denoted 2-stage convLSTM) follows Algorithm 1. We also evaluate a benchmark 1-stage model, which replaces the global model with a simple predictor of the ROI center, described in Eq. (5). Here is the probability, estimated by the local model for the previous frame, that pixel belongs to the foreground. The ROI size is fixed. FCN-based tracking uses the same approach for determining the ROI with its own segmentation map. Table 3 compares the performance of our 2-stage model, the benchmark 1-stage model, and the FCN segmentation network.
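The naive 1-stage ROI-center predictor of Eq. (5) can be sketched as the probability-weighted centroid of the previous frame's foreground map (our reading of the description, since the equation itself is elided):

```python
import numpy as np

def roi_center(prob_map):
    """Expected foreground location: the centroid of the previous
    frame's foreground-probability map. Falls back to the image center
    when no foreground probability mass is present."""
    total = prob_map.sum()
    if total == 0:
        return (prob_map.shape[0] / 2.0, prob_map.shape[1] / 2.0)
    ys, xs = np.mgrid[:prob_map.shape[0], :prob_map.shape[1]]
    return ((ys * prob_map).sum() / total, (xs * prob_map).sum() / total)
```

A fixed-size ROI is then cropped around this center, which is exactly what makes the 1-stage baseline fragile under scale change, since the ROI cannot adapt its size.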
Results in Tables 2 and 3 demonstrate that the proposed 2-stage convLSTM architecture is better than a CNN fine-tuned on the first frame. Even when provided with the nearly correct location of the ROI in each new frame, the CNN-based segmentation network could not handle appearance shift as well as our 2-stage convLSTM. Furthermore, our global model performs better than a naive location predictor. Sample visual results are shown in Fig. 7.
|Test sequence length threshold at 0.4||20||40||80||160|
|Test sequence length threshold set at 0.4||20||40||80||160|
|Test sequence length threshold at 0.7||20||40||80||160|
This work was funded by National Science Foundation award CCF-1422914.
In this work, we tackle the tracking problem at the pixel level. Given the first frame and its corresponding segmentation map, we model the appearance shift as a time series. We propose a novel two-stage model that handles micro-scale appearance change and macro-scale object motion separately. The local segmentation model has far better performance than a CNN fine-tuned on the first frame, and the global model can accurately predict the rough location and size of the object from frame to frame. We demonstrate our approach on the very challenging VOT dataset. Finally, our model performs pixel-wise object tracking at reasonable accuracy in real time.
-  C. Aeschliman, J. Park, and A. C. Kak. A probabilistic framework for joint segmentation and tracking. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1371–1378. IEEE, 2010.
-  V. Belagiannis, F. Schubert, N. Navab, and S. Ilic. Segmentation based particle filtering for real-time 2d object tracking. Computer Vision–ECCV 2012, pages 842–855, 2012.
-  L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr. Fully-convolutional siamese networks for object tracking. In European Conference on Computer Vision, pages 850–865. Springer, 2016.
-  M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International journal of computer vision, 111(1):98–136, 2015.
-  Q. Gan, Q. Guo, Z. Zhang, and K. Cho. First step toward model-free, anonymous object tracking with recurrent neural networks. arXiv preprint arXiv:1511.06425, 2015.
-  R. Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
-  S. Hong, T. You, S. Kwak, and B. Han. Online tracking by learning discriminative saliency map with convolutional neural network. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 597–606, Lille, France, 07–09 Jul 2015. PMLR.
-  M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. CoRR, abs/1506.02025, 2015.
-  W.-D. Jang and C.-S. Kim. Online video object segmentation via convolutional trident network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5849–5858, 2017.
-  S. E. Kahou, V. Michalski, and R. Memisevic. Ratm: recurrent attentive tracking model. arXiv preprint arXiv:1510.08660, 2015.
-  D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
-  M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Čehovin, T. Vojír, G. Häger, A. Lukežič, G. Fernandez Dominguez, A. Gupta, A. Petrosino, A. Memarmoghadam, A. Garcia-Martin, A. Solís Montero, A. Vedaldi, A. Robinson, A. Ma, A. Varfolomieiev, and Z. Chi. The visual object tracking vot2016 challenge results, 10 2016.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
-  A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
-  H. Nam, M. Baek, and B. Han. Modeling and propagating cnns in a tree structure for visual tracking. arXiv preprint arXiv:1608.07242, 2016.
-  H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4293–4302, 2016.
-  G. Ning, Z. Zhang, C. Huang, X. Ren, H. Wang, C. Cai, and Z. He. Spatially supervised recurrent convolutional neural networks for visual object tracking. In Circuits and Systems (ISCAS), 2017 IEEE International Symposium on, pages 1–4. IEEE, 2017.
-  F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 724–732, 2016.
-  P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In Advances in Neural Information Processing Systems, pages 1990–1998, 2015.
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
-  B. Romera-Paredes and P. H. S. Torr. Recurrent instance segmentation. In European Conference on Computer Vision, pages 312–329. Springer, 2016.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  J. Son, I. Jung, K. Park, and B. Han. In Proceedings of the IEEE International Conference on Computer Vision, pages 3056–3064, 2015.
-  R. Tao, E. Gavves, and A. W. Smeulders. Siamese instance search for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1420–1429, 2016.
-  L. Wang, W. Ouyang, X. Wang, and H. Lu. Stct: Sequentially training convolutional networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1373–1381, 2016.
-  N. Wang, S. Li, A. Gupta, and D.-Y. Yeung. Transferring rich feature hierarchies for robust visual tracking. arXiv preprint arXiv:1501.04587, 2015.
-  S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-k. Wong, and W.-c. Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802–810, 2015.
-  J. Yang, B. Price, S. Cohen, H. Lee, and M.-H. Yang. Object contour detection with a fully convolutional encoder-decoder network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 193–202, 2016.
-  T. Yang and A. B. Chan. Recurrent filter learning for visual tracking. arXiv preprint arXiv:1708.03874, 2017.
-  D. Yeo, J. Son, B. Han, and J. Hee Han. Superpixel-based tracking-by-segmentation using markov chains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1812–1821, 2017.
-  M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. CoRR, abs/1311.2901, 2013.