Effects of Blur and Deblurring to Visual Object Tracking

Intuitively, motion blur may hurt the performance of visual object tracking. However, we lack a quantitative evaluation of tracker robustness to different levels of motion blur. Meanwhile, while image deblurring methods can produce visually clearer videos for pleasing human eyes, it is unknown whether visual object tracking can benefit from image deblurring or not. In this paper, we address these two problems by constructing a Blurred Video Tracking benchmark, which contains a variety of videos with different levels of motion blur, as well as ground-truth target annotations for evaluating trackers. We extensively evaluate 23 trackers on this benchmark and observe several new interesting results. Specifically, we find that light blur may improve the performance of many trackers, but heavy blur always hurts the tracking performance. We also find that image deblurring may help to improve tracking performance on heavily blurred videos but hurts the performance on lightly blurred videos. Based on these observations, we propose a new GAN-based scheme to improve tracker robustness to motion blur. In this scheme, a fine-tuned discriminator is used as an adaptive assessor to selectively deblur frames during the tracking process. We use this scheme to successfully improve the accuracy and robustness of 6 trackers.


1 Introduction

Figure 1: Results of ECO [6] and Staple_CA [37, 3] on sharp, blurred, and deblurred videos captured from one scene. The first subfigure shows that ECO locates the target accurately on sharp frames while losing it on blurred ones. In contrast, Staple_CA can capture the ball in both cases. Such a situation is not considered by existing benchmarks, e.g. OTB [57, 58], in which ECO has much higher accuracy than Staple_CA. The bottom subfigure shows that ECO misses the target on blurred frames while locating it accurately when we selectively deblur the frames.

Motion blur caused by camera shake and object movement not only reduces visual perception quality, but may also severely degrade the performance of video analysis tasks, e.g. single object tracking [58]. In recent years, numerous tracking benchmarks have been proposed to evaluate how well current trackers handle motion blur by comparing their accuracy on videos containing blurred frames [57, 58, 33, 12]. However, such benchmarks do not exclude the influence of other possible interferences, e.g. the limitations of the algorithms of the time, and thus lead to incomplete conclusions about a tracker, since its accuracy may be underestimated due to other issues. In addition, with current blur-related tracking benchmarks, we cannot quantitatively evaluate a tracker's robustness to different levels of motion blur, and thus cannot deeply explore how motion blur affects tracking performance.

As shown in Fig. 1, given a sharp video and its blurred version of the same scene, ECO [6] can locate the white billiard ball accurately in the sharp video but fails to do so when motion blur occurs. In contrast, Staple_CA [3, 37] tracks the ball accurately in both videos. This situation cannot be thoroughly evaluated on blurred videos captured from different scenes. For example, the OTB benchmark [57, 58] shows that ECO has much higher accuracy than Staple_CA on its motion blur subsets, which certainly does not consider the above situation. A comprehensive benchmark that fairly measures the blur robustness of trackers is necessary and will encourage the development of blur-robust trackers.

A naive solution for blur-robust tracking is to first deblur the frames of a video and then apply trackers to the deblurred video. However, it is known that such a naive deblurring strategy may introduce ringing artifacts due to the Gibbs phenomenon, which corrupt the features of raw frames and easily cause tracking failures [10, 35, 23, 55, 56]. Instead of direct deblurring, many recent blur-aware trackers add different kinds of blur to the target template to form an augmented template set, and then locate the target in subsequent frames by matching candidates against all of the blur-augmented templates [55, 35]. Although such trackers are effective, they incur high memory and computational costs. Besides, how to perform effective blur augmentation remains unknown.

Note that the negative effects of deblurring on visual tracking were concluded mainly from early deblurring algorithms. Recently, numerous successful deblurring methods have been developed via deep learning, with significantly improved performance, fewer artifacts, and much faster speed [28, 41, 44, 50, 59]. Whether they are helpful for visual object tracking, however, still remains an open question.

In this paper, we aim to analyze the effects of motion blur and deblurring methods on current trackers, and to explore an effective way of using existing deep deblurring methods to achieve blur-robust tracking. Our main contributions are three-fold:

  • We construct a Blurred Video Tracking (BVT) benchmark with a dataset containing 500 videos for 100 scenes. Each scene consists of 5 videos with different levels of motion blur. We use three metrics to evaluate the accuracy and blur robustness of trackers.

  • We extensively evaluate 23 trackers on the BVT benchmark and find that light motion blur improves most of the trackers, while heavy blur hurts their accuracy significantly. We also find that deblurring methods can improve tracking performance on heavily-blurred videos, while having negative effects on the ones with light blur.

  • We propose a new GAN-based tracking scheme that adopts the fine-tuned discriminator of DeblurGAN as an adaptive blur assessor to selectively deblur frames during the tracking process and improve the accuracy of 6 state-of-the-art trackers.

2 Related Work

2.1 Tracking benchmarks

In recent years, numerous tracking benchmarks have been proposed for general performance evaluation or specific issues [48, 57, 58, 33, 26, 27, 40, 30, 25, 39, 12, 22]. The OTB [57, 58], ALOV++ [48], VOT [27, 26, 25], TrackingNet [39], LaSOT [12], and GOT-10K [22] benchmarks provide unified platforms to compare state-of-the-art trackers. More recent ones, e.g. TrackingNet, LaSOT, and GOT-10K, contain a large number of videos and cover a wide range of classes, which makes training high-performance deep learning based trackers feasible. Other benchmarks focus on specific applications or problems. For example, the NfS [15] benchmark consists of 100 high frame rate videos and analyzes the influence of appearance variation on deep and correlation filter-based trackers respectively.

Among these benchmarks, the OTB-2013 [57], OTB-2015 [58], TC-128 [33], and LaSOT [12] datasets contain motion blur subsets that can be used to evaluate the ability of trackers to handle motion blur. Nevertheless, the evaluation results are incomplete, since other interferences that also affect tracking accuracy are not excluded.

A better solution is to compare trackers on videos that are captured from the same scene but have different levels of motion blur, to see whether a tracker can maintain the same performance. In this paper, we construct a dataset for motion blur evaluation by averaging the frames of high frame rate videos over windows of different lengths, thus generating test videos that have the same content but different levels of motion blur. By doing this, we are able to score the robustness of trackers and study the effects of motion blur.

2.2 Motion blur-aware trackers

Numerous works have studied the relationship between motion blur and object tracking [35, 47, 23, 5, 38, 56, 55]. Jin et al. [23] observed that matching between blurred images helps realize effective object tracking. In [5, 38, 56], how to accurately estimate the blur kernel during object tracking is carefully studied. Ma et al. [35] and Wu et al. [55] propose to integrate visual object tracking with the motion blur problem through sparse representation and realize blur-robust trackers.

The above works are based on the observation that deblurring methods can introduce negative effects into frames and corrupt their features. However, deblurring methods have achieved great progress in recent years, and whether the latest works are helpful for object tracking remains an open question. A recent work [47] finds that motion blur is helpful and provides additional motion information about the target. However, it does not discuss the effects of different levels of motion blur on object tracking.

2.3 Other state-of-the-art trackers

The latest tracking works focus on constructing powerful appearance models to realize high-performance tracking. We can coarsely split recent works into three categories: correlation filter (CF) based [31, 6, 3, 34, 14, 16], classification-and-updating based [42, 49, 24], and Siamese network or matching based [4, 19, 63, 54, 53, 13] trackers.

Although these trackers have achieved great performance improvement on benchmarks, there is no specific benchmark that can evaluate their ability to handle different levels of motion blur.

2.4 GAN based methods

Generative adversarial networks (GANs) [17] train two competitors, i.e. the discriminator and the generator. The generator produces fake samples that aim to fool the discriminator, while the discriminator separates fake samples from real ones. With recent studies [1, 18] alleviating the training problems of GANs [46], they have helped achieve great progress in deblurring [28], super-resolution [29], image inpainting [60, 11], and other related problems.

Nevertheless, most GAN-based methods just regard the discriminator as part of the loss function used to train the generator and discard it at testing time. In this paper, we find that the discriminator trained for DeblurGAN [28] can score the level of motion blur and thus helps realize selective deblurring for blur-robust tracking.

3 Blurred Video Tracking (BVT) Benchmark

3.1 Dataset

Figure 2: Examples of frames blurred at 5 levels. The lowest level corresponds to the raw frames captured at 240 fps, which have the least serious blur.

Galoogahi et al. [15] proposed the NfS dataset that consists of 100 videos captured at 240 fps. Since frames in such high frame rate videos are sharp, we can generate realistic motion blur of different levels by averaging these sharp frames, as done in deblurring methods [41, 44]. Given a video in the NfS dataset, we produce a blurred video each frame of which is the average of several successive frames of the original. The length of the averaging window decides the level of motion blur, that is, a longer window leads to more serious blur. The ground truth of the target in the blurred video is set as the average of the annotations of the frames being averaged.

The blurred video is still at the high frame rate, and the difference between neighboring frames is small. This would affect the blur robustness evaluation, since even a simple tracker can obtain high accuracy on high frame rate videos [15]. We therefore temporally sample the blurred video at every 8 frames and obtain a new video whose frame rate is 30 fps. Note that, to avoid motion blur in the initial target template, we borrow the first frame from the original high frame rate video and set it as the first frame of the sampled video.

Following the above setup, for each video in the NfS dataset, we generate 5 blurred videos by setting the averaging window to five increasing lengths; the shortest window keeps the raw frames and thus yields the least serious blur. All these videos make up a new dataset that contains 500 videos and consists of 5 subsets corresponding to the 5 levels of motion blur. Fig. 2 shows 3 cases of frames blurred at the 5 different levels. Clearly, through temporal averaging of high frame rate frames, we obtain realistic blurred videos in which the blur is directly related to the object and camera motion patterns. When the camera is fixed, the moving object is heavily blurred while the background remains sharp. Such results are not easily achieved by synthetic blurring techniques.
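To make the construction concrete, below is a minimal sketch of the averaging-and-subsampling procedure described above. The function name, the window lengths in `blur_levels`, and the assumption that the averaging window slides frame by frame are illustrative placeholders, not the benchmark's exact configuration.

```python
import numpy as np

def make_blurred_video(frames_240fps, window, subsample=8):
    """Synthesize a motion-blurred, ~30 fps video from sharp 240 fps frames.

    Each blurred frame is the average of `window` successive raw frames
    (window=1 keeps the raw frames); the sequence is then subsampled every
    `subsample` frames, and the first frame is replaced by the sharp raw
    frame so that the initial target template is blur-free."""
    raw = [f.astype(np.float32) for f in frames_240fps]
    blurred = [
        np.mean(raw[i:i + window], axis=0).astype(np.uint8)
        for i in range(len(raw) - window + 1)
    ]
    sampled = blurred[::subsample]
    sampled[0] = frames_240fps[0]  # blur-free initialization frame
    return sampled

# Illustrative window lengths for the five blur levels (the benchmark's
# actual values are not reproduced here).
blur_levels = {1: 1, 2: 4, 3: 8, 4: 12, 5: 16}
```

The target annotation of each blurred frame would be generated analogously, by averaging the box coordinates of the frames inside the same window.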

3.2 Metrics

Figure 3: Evaluation results of 23 trackers on the BVT benchmark. The left subfigure shows the blur robustness plot of each tracker. The middle subfigure presents the normalized robustness curves of all trackers. The right subfigure displays the normalized robustness score (NRS), the average AUC, and its standard deviation over the 5 subsets for each tracker.

We define three metrics for the blur robustness evaluation based on the success metric of [58]. Specifically, we first calculate the intersection over union (IoU) between the predicted and annotated bounding boxes at each frame of a subset. We then draw a success plot, which presents the percentage of bounding boxes whose IoU is larger than a given threshold. The area under the curve (AUC) of the success plot is used to compare different trackers on the subset. Given a tracker, we can thus obtain 5 AUC scores for the 5 subsets and draw a blur robustness plot whose X-axis represents the subsets and whose Y-axis gives the AUC scores. We can rank the compared trackers according to the average and the standard deviation of the AUC scores respectively: the average of the 5 AUC scores measures the absolute accuracy of a tracker on differently blurred videos, while the standard deviation reflects its robustness.

In addition, we propose a new metric named the normalized robustness score to make the blur robustness independent of the absolute accuracy. Specifically, we first evaluate a tracker on the sharp video subset and obtain the set of frames, denoted as $\mathcal{F}_1$, on which the tracker locates the target accurately, i.e. with an IoU larger than 0.5. Each frame of $\mathcal{F}_1$ has corresponding blurred versions on the other subsets, which we denote as $\mathcal{F}_i$ with $i \in \{2, \ldots, 5\}$. We then run the tracker on the blurred subsets and calculate the average IoU on each $\mathcal{F}_i$, denoted as $\bar{o}_i$. We finally get a normalized vector

$\mathbf{v} = \left[ \dfrac{\bar{o}_2}{\bar{o}_1}, \dfrac{\bar{o}_3}{\bar{o}_1}, \dfrac{\bar{o}_4}{\bar{o}_1}, \dfrac{\bar{o}_5}{\bar{o}_1} \right],$   (1)

where $\bar{o}_1$ is the average IoU on $\mathcal{F}_1$, and $\mathbf{v}$ corresponds to a normalized robustness curve (NRC). The average of all elements in $\mathbf{v}$ is denoted as the normalized robustness score (NRS). If the NRS of a tracker approximates 1, the tracker is not affected by motion blur and can still locate the target on the blurred versions of $\mathcal{F}_1$.
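For clarity, here is a compact sketch of how the three metrics could be computed, assuming axis-aligned (x, y, w, h) boxes and per-frame IoU lists that are aligned across subsets; the threshold grid and helper names are illustrative, and the NRS normalization follows Eq. (1) as reconstructed above.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def success_auc(ious, thresholds=np.linspace(0, 1, 101)):
    """AUC of the success plot: mean fraction of frames whose IoU
    exceeds each overlap threshold."""
    ious = np.asarray(ious)
    return float(np.mean([(ious > t).mean() for t in thresholds]))

def normalized_robustness_score(ious_per_subset):
    """NRS: average IoU on the blurred versions of reliably tracked
    sharp frames, normalized by the average IoU on those sharp frames.
    `ious_per_subset[0]` holds per-frame IoUs on the sharp subset, and
    the other entries hold IoUs of the corresponding blurred frames."""
    sharp = np.asarray(ious_per_subset[0])
    keep = sharp > 0.5                       # frames tracked accurately on the sharp video
    o1 = sharp[keep].mean()
    nrc = [np.asarray(ious)[keep].mean() / o1 for ious in ious_per_subset[1:]]
    return float(np.mean(nrc)), nrc          # NRS and normalized robustness curve
```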

4 Evaluation Results

With the proposed BVT benchmark, we evaluate 23 trackers and analyze their blur robustness. Meanwhile, we use two state-of-the-art deep deblurring methods to handle the blurred subsets of the BVT benchmark and discuss how these methods can help improve tracking performance.

4.1 Effects of blur to tracking

Figure 4: Normalized robustness score (NRS) and average AUC of the 23 trackers. The legend is the same as that of Fig. 3.

Trackers. We evaluate 23 trackers on the proposed benchmark and categorize them into 4 classes according to the representations they use: trackers using intensity-based features (here, the intensity-based features include the templates used by IVT [45], L1APG [2], CSK [20], and STC [62], and the Haar-like features used by CT [61]), i.e. IVT [45], L1APG [2], CT [61], CSK [20], STC [62], and MBT [35]; trackers based on HoG features, i.e. BT [52], DSST [7], KCF [21], SAMF [32], SRDCF [8], fDSST [9], BACF [16], and STRCF [31]; trackers with deep features, i.e. HCF [36], ECO [6], MDNet [43], SiamFC [4], and VITAL [49]; and trackers using mixed features, i.e. Staple [3], Staple_CA [37], ECO_HC [6], and CSRDCF [34].

Overall results. We present the evaluation results in Fig. 3 and 4. In general, the accuracy of trackers decreases as the motion blur level increases. In terms of the average AUC, ECO achieves the highest accuracy on the BVT benchmark while VITAL [49] is in second place, since these trackers employ deep features as object representations and are equipped with sophisticated online learning strategies. Among trackers based on hand-crafted features, STRCF [31], ECO_HC [6], and CSRDCF [34] take the first, second, and third places respectively according to the average AUC. Moreover, these trackers are better than MDNet [43] and HCF [36], which use deep features. The trackers using intensity-based features have much lower accuracy than the others due to their lower discriminative power.

In terms of the robustness evaluation, we observe that trackers using intensity-based features are generally more robust to motion blur, since they obtain similar accuracy on both heavily-blurred and sharp videos. Specifically, IVT [45] has the highest normalized robustness score (NRS) and the smallest standard deviation of AUC, while L1APG [2] gets the second highest NRS. STC [62] and CT [61] have poor NRSs although their standard deviations of AUC are very small.

As shown in Fig. 4, considering both the average AUC and the NRS, we find that Staple_CA [37] achieves a good balance between accuracy and blur robustness. Although VITAL [49] is slightly worse than ECO [6] in average AUC, it has a much higher NRS than ECO. According to the blur robustness plots, we find that the ranking of trackers differs greatly across the 5 subsets. For example, VITAL [49] obtains a smaller AUC score than ECO [6] and STRCF [31] on the less blurred subsets while being the best one on the most heavily-blurred subset. We find similar results for BACF [16], CSRDCF [34], DSST [7], and CSK [20].

In summary, we have the following observations: simply comparing trackers on a single subset is not enough to conclude their ability to handle motion blur; the accuracy and blur robustness of trackers depend on the features they use; trackers using intensity-based features obtain low accuracy while usually being robust to motion blur; deep features help track accurately but are somewhat sensitive to severe blur. It is therefore necessary to explore combination strategies that take advantage of both.

Benefits of light motion blur. According to the blur robustness plots in Fig. 3, many trackers obtain higher AUC on the lightly-blurred subsets than on the sharp one, which implies that light motion blur has positive effects on tracking performance. To better understand this observation, for each tracker we calculate the AUC gain of each blurred subset over the sharp one, i.e. the difference between the AUC on the blurred subset and the AUC on the sharp subset; a positive gain means a tracker has higher accuracy on the blurred subset than on the sharp one. As shown in Fig. 5, on the two lightly-blurred subsets, 17 and 14 trackers respectively have positive gains. These numbers reduce to 7 and 2 on the two heavily-blurred subsets. Hence, light motion blur does help most of the compared trackers obtain higher accuracy. This is because the lightly-blurred videos generated by averaging neighboring high frame rate frames contain more effective information for separating the target from the background.

For some specific methods, we find that ECO, which uses deep features, always obtains negative gains on all subsets with gradually increasing magnitude. We have similar observations for VITAL and SiamFC, although they obtain higher AUCs on the heavily-blurred subsets. In contrast, trackers with intensity-based features, e.g. IVT and CT, have positive gains on all subsets, which further demonstrates the importance of features in handling motion blur. We also note that the motion blur-aware tracker, i.e. MBT [35], achieves positive gains on all but one of the subsets and obtains its highest gain on a heavily-blurred one.

Figure 5: AUC gains of the blurred subsets over the sharp video subset for all compared trackers.

In summary, we have the following observation: light motion blur helps most of the trackers achieve higher accuracy, while heavy blur significantly reduces the performance of almost all trackers.

4.2 Effects of deblurring to tracking

Figure 6: Evaluation results of 7 typical trackers and their four variants. DeblurGAN [28] and Scale-recurrent network (SRN) [51] are used to cope with the blurred frames respectively. ‘*_gan’ and ‘*_srn’ denote trackers deblurring each frame via DeblurGAN and SRN respectively. ‘*_ganslt’ and ‘*_srnslt’ are methods that selectively deblur frames according to the localization error of trackers.

In the following, we study whether state-of-the-art deep deblurring methods can help improve the accuracy of trackers under motion blur.

Methods. Early deblurring methods run slowly and are not suitable for real-time tracking. We select two deep deblurring methods, i.e. DeblurGAN [28] and SRN [51], which run much faster on the GPU (DeblurGAN takes 0.05 s on average to deblur search regions that are about 5 times larger than targets) and achieve state-of-the-art deblurring performance. Given a tracker, we use a deblurring method to derive two variants. The first deblurs all frames before tracking, and we call it the full deblurring based method. The second selectively deblurs frames during the tracking process according to the center localization error, i.e. the distance between the predicted bounding box and the ground truth.

With the two deblurring methods, we get four variants for each tracker and denote them as '*_gan' and '*_srn' for the full deblurring based ones, and '*_ganslt' and '*_srnslt' for the selective deblurring based ones, where '*' represents the name of the tracker. We test these variants on the four blurred video subsets.

To achieve a comprehensive study, we select 7 representative trackers, including the ones that achieve the best accuracy on the BVT benchmark, i.e. STRCF [31] and ECO [6], a Siamese network based tracker, i.e. SiamFC [4], CF trackers using hand-crafted features, i.e. fDSST [9] and Staple_CA [3, 37], a typical classification based tracker, i.e. BT [52], and a motion blur-aware tracker, i.e. MBT [35].

Cons of full deblurring. As shown in Fig. 6, when we deblur all frames during the tracking process via DeblurGAN, we get lower accuracy than using the blurred frames most of the time. The performance decline shrinks as the motion blur level becomes more severe. For example, the AUC of fDSST_gan is much smaller than that of fDSST on the lightly-blurred subsets while becoming slightly better on the most heavily-blurred one. This observation suggests that, when using DeblurGAN to improve blur robustness, we should deblur heavily-blurred frames and skip the ones containing only light blur.

In terms of the SRN method, deblurring all frames slightly improves most of the trackers. Specifically, STRCF_srn and ECO_srn achieve 2.4% and 1.8% relative improvements over STRCF and ECO, while the performance gains for the other trackers are very small or even negative. Similar to DeblurGAN, SRN helps trackers obtain higher improvement on heavily-blurred videos while making their accuracy drop on videos with light blur. For example, STRCF_srn has a similar or even worse AUC score than STRCF on the two lightly-blurred subsets while obtaining great improvement on the two heavily-blurred ones. We have similar observations for BT and SiamFC.

In summary, we have the following observation: state-of-the-art deep deblurring methods, i.e. DeblurGAN [28] and SRN [51], usually decrease tracking accuracy on lightly-blurred videos while having positive effects on the ones containing heavy motion blur.

Pros of selective deblurring.

Figure 7: AUC gains of the selective deblurring based trackers, i.e. '*_ganslt' and '*_srnslt', over the original ones on the blurred video subsets.

According to the observations in Sections 4.1 and 4.2, selective deblurring should help improve tracking performance. To validate this assumption, we selectively deblur each incoming frame according to localization errors during the tracking process, as sketched below. Specifically, for an incoming frame, we first use DeblurGAN or SRN to obtain a deblurred image. We then predict the target position from the raw and deblurred frames respectively and obtain two bounding boxes, whose center localization errors are calculated against the ground truth. The result with higher precision is saved as the final output. We denote the above methods as '*_ganslt' and '*_srnslt'.
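A minimal sketch of this oracle selection rule follows, assuming a tracker object with a `track(frame)` method that returns an (x, y, w, h) box and a `deblur` callable wrapping DeblurGAN or SRN; these interfaces are placeholders, and the handling of the tracker's internal state when evaluating both candidates is omitted.

```python
import numpy as np

def center_error(box, gt_box):
    """Distance between the centers of a predicted and a ground-truth
    (x, y, w, h) box."""
    cx, cy = box[0] + box[2] / 2.0, box[1] + box[3] / 2.0
    gx, gy = gt_box[0] + gt_box[2] / 2.0, gt_box[1] + gt_box[3] / 2.0
    return float(np.hypot(cx - gx, cy - gy))

def selective_deblur_step(tracker, frame, gt_box, deblur):
    """Run the tracker on the raw and the deblurred frame and keep the
    prediction with the smaller center localization error (an oracle
    selection, since it uses the ground-truth box)."""
    box_raw = tracker.track(frame)
    box_deb = tracker.track(deblur(frame))
    if center_error(box_deb, gt_box) < center_error(box_raw, gt_box):
        return box_deb
    return box_raw
```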

Fig. 6 shows that selective deblurring via DeblurGAN and SRN improves the tracking performance of all trackers significantly. Furthermore, we notice that the selective deblurring based methods generally have higher gains over their original versions on heavily-blurred videos than on lightly-blurred ones. As shown in Fig. 6, the performance improvements of STRCF_*slt, ECO_*slt, SiamFC_*slt, and fDSST_*slt w.r.t. their original versions gradually increase and reach their maximum on the most heavily-blurred subset. The other trackers, e.g. BT, Staple_CA, and MBT, show a similar trend, although their highest gains appear at a different blur level.

In summary, we have the following observations: selective deblurring improves tracking performance significantly; the accuracy gains increase with the motion blur level and generally reach their maximum on the most heavily-blurred video subset.

5 Blur-Robust Tracking via DeblurGAN-D

5.1 DeblurGAN-D as blur assessor

Figure 8: Outputs of the discriminator of DeblurGAN on bird sequences that contain 4 levels of motion blur.

DeblurGAN [28] uses a critic network as the discriminator (D) to output scores for sharp and restored images and calculates their Wasserstein distance as the loss to train the generator (G) and the discriminator itself. D only works during the training process and is discarded at testing time. In the training stage, G outputs deblurred images whose quality gradually improves; we can regard these images as blurred ones having different blur levels. From the view of training D, it is tuned to distinguish between sharp images and the ones generated by G, which have different blur levels. As a result, the discriminator acquires the ability to distinguish sharp images from blurred ones.

As shown in Fig. 8, we calculate the discriminator outputs on frames of four videos with different blur levels. Clearly, the most heavily-blurred video has the smallest output, while the sharpest one has the highest score. Hence, the discriminator of DeblurGAN is able to score the blur level of frames, which helps decide when we should deblur during the tracking process.
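Obtaining such a score is a single forward pass. The sketch below assumes a pretrained DeblurGAN critic `netD` loaded in PyTorch that maps a normalized 1x3xHxW image tensor to a (possibly patch-wise) output map; loading and preprocessing details are omitted.

```python
import torch

@torch.no_grad()
def blur_score(netD, frame_tensor):
    """Score the sharpness of a frame with the DeblurGAN critic by
    averaging its output into one scalar. Higher values indicate sharper
    content, lower values heavier motion blur (cf. Fig. 8)."""
    return netD(frame_tensor).mean().item()
```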

5.2 Fine-tuning DeblurGAN-D

Although we have shown that DeblurGAN-D can score the blur degree of a frame, it easily fails to discriminate between motion blur degrees when their visual difference is small. As shown in Fig. 9, DeblurGAN-D cannot rank the blur degrees of frames properly. This is because DeblurGAN-D is originally designed to compare sharp and deblurred images, which leaves a gap to the task of assessing blur degrees.

To alleviate the above problem, we propose to fine-tune DeblurGAN-D with blur & deblur image pairs. Specifically, we select 20 scenes, including 80 blurred videos, from the dataset of the BVT benchmark and obtain 32304 frames. Each scene contains 4 videos corresponding to 4 blur degrees. We use the generator of DeblurGAN to deblur these frames and get 32304 blur & deblur image pairs. Using these pairs as training data, we fine-tune only the discriminator via the same adversarial loss as DeblurGAN while keeping the generator fixed.

As shown in Fig. 9, compared with the original discriminator, the fine-tuned one can not only sort blur degrees properly but also reflect the distance between different motion blurs. In practice, to avoid the influence of non-blur information in the image, we calculate the discriminator difference between the deblurred and the blurred version of a frame, i.e. the difference between the score of the deblurred frame and that of the raw frame. A larger difference corresponds to heavier motion blur in the frame.
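The fine-tuning step could be sketched as below in PyTorch, assuming `netD` and `netG` are the DeblurGAN critic and generator and `loader` yields batches of blurred frames already on `device`. The WGAN-GP style critic loss follows DeblurGAN's adversarial loss; treating the deblurred output as the higher-scored side (so that deblurred frames score above blurred ones, consistent with Fig. 9) is our assumption, since the paper does not spell out this assignment.

```python
import torch

def finetune_critic(netD, netG, loader, epochs=5, lr=1e-4, gp_weight=10.0, device="cuda"):
    """Fine-tune the critic on blur/deblur pairs with a WGAN-GP critic loss
    while the generator stays frozen (a sketch under stated assumptions)."""
    netG.eval()
    netD.train()
    opt = torch.optim.Adam(netD.parameters(), lr=lr, betas=(0.5, 0.9))
    for _ in range(epochs):
        for blurred in loader:                       # batch of blurred frames
            blurred = blurred.to(device)
            with torch.no_grad():
                deblurred = netG(blurred)            # fixed generator
            # critic should score deblurred frames higher than blurred ones
            loss = netD(blurred).mean() - netD(deblurred).mean()
            # gradient penalty on interpolates between the two sets
            eps = torch.rand(blurred.size(0), 1, 1, 1, device=device)
            mix = (eps * deblurred + (1 - eps) * blurred).requires_grad_(True)
            grad = torch.autograd.grad(netD(mix).sum(), mix, create_graph=True)[0]
            loss = loss + gp_weight * ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return netD
```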

Note that we could also use blur & sharp image pairs to fine-tune the discriminator. However, in real applications, sharp images are not given and we would have to take extra cost to collect suitable images for fine-tuning. In contrast, the proposed strategy does not need extra data and is also suitable for other deblurring methods. Please find more visualization results in the supplementary material.

5.3 Selective deblurring for blur-robust tracking

Figure 9: Comparing the fine-tuned discriminator with the original one on airplane sequences.

Given a video, we formulate a tracker within a Bayesian framework in which the maximum a posteriori estimation of the target state at frame $t$, i.e. a bounding box $\mathbf{b}_t$, is computed from the posterior

$p(\mathbf{b}_t \mid \mathcal{F}_{1:t}) = \sum_{s_t \in \{0,1\}} p(\mathbf{b}_t \mid s_t, \mathcal{F}_{1:t}) \, p(s_t \mid \mathcal{F}_{1:t}),$   (2)

where $\mathcal{F}_{1:t}$ is the set of observed frames, and $s_t$ is a selector whose values 1 and 0 represent using the deblurred frame $\mathrm{G}(\mathbf{F}_t)$ and the raw frame $\mathbf{F}_t$ to estimate $\mathbf{b}_t$ respectively. $p(s_t \mid \mathcal{F}_{1:t})$ estimates $s_t$ from the observed frames and is calculated by

$p(s_t \mid \mathcal{F}_{1:t}) = \dfrac{1}{\alpha} \, p(\mathbf{F}_t \mid s_t) \, p(s_t \mid s_{1:t-1}),$   (3)

where $\alpha$ is a normalization factor, $p(s_t \mid s_{1:t-1})$ is a motion model for the selector that considers historical selection results, and $p(\mathbf{F}_t \mid s_t)$ measures the necessity to deblur:

$p(\mathbf{F}_t \mid s_t = 1) \propto \mathrm{D}(\mathrm{G}(\mathbf{F}_t)) - \mathrm{D}(\mathbf{F}_t),$   (4)

where $\mathrm{G}(\mathbf{F}_t)$ is the deblurred $\mathbf{F}_t$ and $\mathrm{D}(\cdot)$ is the fine-tuned discriminator. Instead of directly using $\mathrm{D}(\mathbf{F}_t)$ for $p(\mathbf{F}_t \mid s_t)$, we calculate the difference between $\mathrm{D}(\mathrm{G}(\mathbf{F}_t))$ and $\mathrm{D}(\mathbf{F}_t)$ to remove the influence of non-blur information.

In Eq. (2), the posterior probability of $\mathbf{b}_t$ being the target given the selector and previous frames, i.e. $p(\mathbf{b}_t \mid s_t, \mathcal{F}_{1:t})$, can be rewritten as

$p(\mathbf{b}_t \mid s_t, \mathcal{F}_{1:t}) = \dfrac{1}{\beta} \left[ s_t \, p_{\mathrm{d}}(\mathbf{b}_t \mid \mathrm{G}(\mathbf{F}_t)) + (1 - s_t) \, p_{\mathrm{r}}(\mathbf{b}_t \mid \mathbf{F}_t) \right] p(\mathbf{b}_t \mid \mathbf{b}_{t-1}),$   (5)

where $\beta$ is a normalization factor, $p_{\mathrm{d}}(\cdot)$ and $p_{\mathrm{r}}(\cdot)$ are observation models that compute the likelihood of $\mathbf{b}_t$ belonging to the target with inputs being the deblurred frame $\mathrm{G}(\mathbf{F}_t)$ and the raw frame $\mathbf{F}_t$ respectively, and $p(\mathbf{b}_t \mid \mathbf{b}_{t-1})$ represents the motion model.

For an existing tracker, we can use its observation and motion models to calculate $p(\mathbf{b}_t \mid s_t, \mathcal{F}_{1:t})$ via Eq. (5) and locate the target by solving

$\hat{\mathbf{b}}_t = \operatorname{arg\,max}_{\mathbf{b}_t} \, \sum_{s_t \in \{0,1\}} p(\mathbf{b}_t \mid s_t, \mathcal{F}_{1:t}) \, p(s_t \mid \mathcal{F}_{1:t}).$   (6)

In practice, given a tracker and an incoming frame, we crop a search region and deblur it with the DeblurGAN generator. We then obtain two bounding boxes and their object likelihoods by feeding the tracker with the raw and the deblurred search regions. When the discriminator difference in Eq. (4) exceeds a fixed threshold shared by all trackers, the search region is regarded as heavily blurred and the bounding box from the deblurred search region is saved as the final result; otherwise, the one with the largest object likelihood is saved. Currently, we set $p(s_t \mid s_{1:t-1})$ as a discrete uniform distribution that ignores the historical selection results; we will discuss other possible choices in the future.
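A minimal per-frame sketch of this selection rule is given below, assuming `netG`/`netD` are the DeblurGAN generator and the fine-tuned critic, and that the tracker exposes a `predict` method returning a box together with its object likelihood; both the interface and the threshold `tau` are illustrative placeholders, not the paper's exact setting.

```python
import torch

@torch.no_grad()
def blur_robust_step(tracker, search_region, netG, netD, tau=0.0):
    """One step of the selective-deblurring scheme: deblur the cropped
    search region, score its blur level with the critic difference
    D(G(x)) - D(x), and choose which prediction to keep."""
    deblurred = netG(search_region)
    blur_level = (netD(deblurred).mean() - netD(search_region).mean()).item()

    box_raw, score_raw = tracker.predict(search_region)
    box_deb, score_deb = tracker.predict(deblurred)

    if blur_level > tau:   # heavily blurred: trust the deblurred region
        return box_deb
    # otherwise keep the prediction with the larger object likelihood
    return box_deb if score_deb > score_raw else box_raw
```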

The proposed scheme can be applied to a wide range of existing trackers. In the following, we validate it on 7 trackers, including STRCF [31], ECO [6], SiamFC [4], fDSST [9], Staple_CA [3, 37], BT [52], and MBT [35].

5.4 Comparative results

Figure 10: Comparing the proposed blur-robust trackers ('*_ganbrt') with the full deblurring ('*_gan') and non-deblurring ('*') based trackers on blurred videos.

Since we have used 20 scenes, i.e. 80 blurred videos, of the BVT benchmark to fine-tune DeblurGAN-D in Section 5.2, the remaining 80 scenes form new blurred subsets, each of which consists of 80 videos. We use them to validate the proposed blur-robust tracking scheme on 7 trackers. We run the original trackers, the full deblurring versions ('*_gan'), and the versions based on the proposed scheme ('*_ganbrt') on these subsets and report the average AUC and its standard deviation as evaluation results. AUC scores of the original trackers on the sharp subset are also calculated for a comprehensive comparison.

As shown in Fig. 10, according to the average AUCs, all trackers except fDSST are improved by the proposed blur-robust tracking scheme. In particular, BT_ganbrt achieves a 9.3% relative improvement over its original version. Moreover, the accuracy of BT_ganbrt on the blurred subsets is much higher than that of BT on the sharp subset. STRCF_ganbrt, ECO_ganbrt, and SiamFC_ganbrt outperform STRCF, ECO, and SiamFC on all subsets respectively. Staple_CA_ganbrt achieves a 2.5% relative improvement over Staple_CA. The accuracy increase of MBT_ganbrt w.r.t. MBT is small, since MBT is specifically designed for tracking under motion blur. fDSST_ganbrt obtains slightly worse accuracy than fDSST on average while being better on the most heavily-blurred subset. More results are presented and discussed in the supplementary material.

6 Conclusion

In this paper, we have proposed the Blurred Video Tracking (BVT) benchmark to explore how motion blur affects visual object tracking and whether state-of-the-art deblurring can benefit state-of-the-art trackers under different levels of motion blur. The proposed BVT benchmark contains 500 videos from 100 scenes, each of which has 5 videos with different levels of motion blur. According to the evaluation results of 23 recent trackers on the BVT benchmark, we find that slight motion blur may have positive effects on visual tracking, while severe blur certainly harms the performance of most trackers. Using two state-of-the-art deblurring methods, DeblurGAN [28] and SRN [51], to handle the blurred videos in our BVT benchmark, we study the effects of deblurring on 7 typical trackers. We observe that current deblurring algorithms can improve tracking performance on severely blurred videos, while harming the accuracy on videos with slight motion blur. Accordingly, we propose a general blur-robust tracking scheme that adopts the fine-tuned discriminator of DeblurGAN as an assessor to adaptively determine whether or not to deblur the current frame. This method successfully improves the accuracy of 6 state-of-the-art trackers. In the future, we want to study how to generalize such an adaptive deblurring strategy to further boost the robustness of visual tracking to blur.

References

  • [1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv:1701.07875, 2017.
  • [2] C. Bao, Y. Wu, H. Ling, and H. Ji. Real time robust l1 tracker using accelerated proximal gradient approach. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1830–1837, 2012.
  • [3] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. S. Torr. Staple: Complementary learners for real-time tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [4] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr. Fully-convolutional siamese networks for object tracking. arXiv preprint arXiv:1606.09549, 2016.
  • [5] S. Dai, M. Yang, Y. Wu, and A. K. Katsaggelos. Tracking motion-blurred targets in video. In Proceedings of IEEE International Conference on Image Processing, 2006.
  • [6] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg. Eco: Efficient convolution operators for tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [7] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg. Accurate scale estimation for robust visual tracking. In Proceedings of the British Machine Vision Conference, 2014.
  • [8] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg. Learning spatially regularized correlation filters for visual tracking. In Proceedings of IEEE International Conference on Computer Vision, pages 4310–4318, 2015.
  • [9] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg. Discriminative scale space tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(8):1561–1575, 2018.
  • [10] J. Ding, Y. Huang, W. Liu, and K. Huang. Severely blurred object tracking by learning deep image representations. IEEE Transactions on Circuits and Systems for Video Technology, 26(2):319–331, 2016.
  • [11] B. Dolhansky and C. Canton Ferrer. Eye in-painting with exemplar generative adversarial networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [12] H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, and H. Ling. Lasot: A high-quality benchmark for large-scale single object tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [13] H. Fan and H. Ling. Siamese cascaded region proposal networks for real-time visual tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [14] W. Feng, R. Han, Q. Guo, J. K. Zhu, and S. Wang. Dynamic saliency-aware regularization for correlation filter based object tracking. IEEE Transactions on Image Processing, pages 1–1, 2019.
  • [15] H. K. Galoogahi, A. Fagg, C. Huang, D. Ramanan, and S. Lucey. Need for speed: A benchmark for higher frame rate object tracking. In Proceedings of IEEE International Conference on Computer Vision, 2017.
  • [16] H. K. Galoogahi, A. Fagg, and S. Lucey. Learning background-aware correlation filters for visual tracking. In Proceedings of IEEE International Conference on Computer Vision, pages 1144–1152, 2017.
  • [17] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. In Advances in Neural Information Processing Systems, 2014.
  • [18] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of wasserstein gans. arXiv:1704.00028, 2017.
  • [19] Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, and S. Wang. Learning dynamic Siamese network for visual object tracking. In Proceedings of IEEE International Conference on Computer Vision, 2017.
  • [20] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In Proceedings of the European Conference on Computer Vision, pages 702–715, 2012.
  • [21] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583–596, 2015.
  • [22] L. Huang, X. Zhao, and K. Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. arXiv preprint arXiv:1810.11981, 2018.
  • [23] H. Jin, P. Favaro, and R. Cipolla. Visual tracking in the presence of motion blur. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 18–25, 2005.
  • [24] I. Jung, J. Son, M. Baek, and B. Han. Real-time mdnet. In Proceedings of the European Conference on Computer Vision, 2018.
  • [25] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Čehovin Zajc, T. Vojir, G. Häger, A. Lukežič, and G. Fernandez. The visual object tracking vot2017 challenge results. In Proceedings of IEEE International Conference on Computer Vision Workshop, 2017.
  • [26] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fernandez, T. Vojir, G. Hager, G. Nebehay, R. Pflugfelder, A. Gupta, A. Bibi, A. Lukezic, A. Garcia-Martin, A. Saffari, A. Petrosino, and A. S. Montero. The visual object tracking VOT2015 challenge results. In Proceedings of IEEE International Conference on Computer Vision Workshop, pages 564–586, 2015.
  • [27] M. Kristan, J. Matas, A. Leonardis, T. Vojir, R. Pflugfelder, G. Fernandez, G. Nebehay, F. Porikli, and L. Cehovin. A novel performance evaluation methodology for single-target trackers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(11):2137–2155, 2016.
  • [28] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas. DeblurGAN: Blind motion deblurring using conditional adversarial networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [29] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 105–114, 2017.
  • [30] A. Li, M. Lin, Y. Wu, M.-H. Yang, and S. Yan. Nus-pro:a new visual tracking challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):335–349, 2016.
  • [31] F. Li, C. Tian, W. Zuo, L. Zhang, and M.-H. Yang. Learning spatial-temporal regularized correlation filters for visual tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 4904–4913, 2018.
  • [32] Y. Li and J. Zhu. A scale adaptive kernel correlation filter tracker with feature integration. In Proceedings of the European Conference on Computer Vision Workshop, 2014.
  • [33] P. Liang, E. Blasch, and H. Ling. Encoding color information for visual tracking: Algorithms and benchmark. IEEE Transactions on Image Processing, 24(12):5630–5644, 2015.
  • [34] A. Lukežič, T. Vojíř, L. Čehovin, J. Matas, and M. Kristan. Discriminative correlation filter with channel and spatial reliability. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [35] B. Ma, L. Huang, J. Shen, L. Shao, M.-H. Yang, and F. Porikli. Visual tracking under motion blur. IEEE Transactions on Image Processing, 25(12):5867–5876, 2016.
  • [36] C. Ma, J. B. Huang, X. Yang, and M. H. Yang. Hierarchical convolutional features for visual tracking. In Proceedings of IEEE International Conference on Computer Vision, 2015.
  • [37] M. Mueller, N. Smith, and B. Ghanem. Context-aware correlation filter tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [38] C. Mei and I. Reid. Modeling and generating complex motion blur for real-time tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008.
  • [39] M. Mueller, A. Bibi, S. Giancola, S. Al-Subaihi, and B. Ghanem. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European Conference on Computer Vision, 2018.
  • [40] M. Mueller, N. Smith, and B. Ghanem. A benchmark and simulator for UAV tracking. In Proceedings of the European Conference on Computer Vision, 2016.
  • [41] S. Nah, T. H. Kim, and K. M. Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [42] H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 4293–4302, 2016.
  • [43] H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [44] M. Noroozi, P. Chandramouli, and P. Favaro. Motion deblurring in the wild. In Pattern Recognition, pages 65–77, 2017.
  • [45] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang. Incremental learning for robust visual tracking. International Journal of Computer Vision, 77(1):125–141, 2007.
  • [46] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. arXiv:1606.03498, 2016.
  • [47] C. Seibold, A. Hilsmann, and P. Eisert. Model-based motion blur estimation for the improvement of motion tracking. Computer Vision and Image Understanding, 160:45–56, 2017.
  • [48] A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1442–1468, 2014.
  • [49] Y. Song, C. Ma, X. Wu, L. Gong, L. Bao, W. Zuo, C. Shen, R. W. Lau, and M.-H. Yang. Vital:visual tracking via adversarial learning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [50] J. Sun, W. Cao, Z. Xu, and J. Ponce. Learning a convolutional neural network for non-uniform motion blur removal. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • [51] X. Tao, H. Gao, X. Shen, J. Wang, and J. Jia. Scale-recurrent network for deep image deblurring. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [52] N. Wang, J. Shi, D. Y. Yeung, and J. Jia. Understanding and diagnosing visual tracking systems. In Proceedings of IEEE International Conference on Computer Vision, pages 3101–3109, 2015.
  • [53] Q. Wang, L. Zhang, L. Bertinetto, W. Hu, and P. H. Torr. Fast online object tracking and segmentation: A unifying approach. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [54] X. Wang, C. Li, B. Luo, and J. Tang. Sint++:robust visual tracking via adversarial positive instance generation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [55] Y. Wu, H. Ling, J. Yu, F. Li, X. Mei, and E. Cheng. Blurred target tracking by blur-driven tracker. In Proceedings of IEEE International Conference on Computer Vision, pages 1100–1107, 2011.
  • [56] Y. Wu, J. Hu, F. Li, E. Cheng, J. Yu, and H. Ling. Kernel-based motion-blurred target tracking. In Advances in Visual Computing, pages 486–495, 2011.
  • [57] Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: a benchmark. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 2411–2418, 2013.
  • [58] Y. Wu, J. Lim, and M.-H. Yang. Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1834–1848, 2015.
  • [59] L. Xu, J. S. J. Ren, C. Liu, and J. Jia. Deep convolutional neural network for image deconvolution. In Advances in Neural Information Processing Systems, pages 1790–1798, 2014.
  • [60] R. A. Yeh, C. Chen, T. Y. Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do. Semantic image inpainting with deep generative models. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 6882–6890, 2017.
  • [61] K. Zhang, L. Zhang, and M.-H. Yang. Real-time compressive tracking. In Proceedings of the European Conference on Computer Vision, pages 864–877, 2012.
  • [62] K. Zhang, L. Zhang, M.-H. Yang, and D. Zhang. Fast tracking via dense spatio-temporal context learning. In Proceedings of the European Conference on Computer Vision, pages 127–141, 2014.
  • [63] Z. Zhu, Q. Wang, B. Li, W. Wei, J. Yan, and W. Hu. Distractor-aware siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision, 2018.