Code for our WACV paper "Unsupervised Adversarial Visual Level Domain Adaptation for Learning Video Object Detectors from Images"
Deep learning based object detectors require thousands of diversified bounding box and class annotated examples. Though image object detectors have shown rapid progress in recent years with the release of multiple large-scale static image datasets, object detection on videos still remains an open problem due to the scarcity of annotated video frames. A robust video object detector is an essential component for video understanding and for curating large-scale automated annotations in videos. The domain difference between images and videos makes the transferability of image object detectors to videos sub-optimal. The most common solution is to use weakly supervised annotations, where a video frame is tagged for the presence/absence of object categories; this still requires manual effort. In this paper we take a step forward by adapting the concept of unsupervised adversarial image-to-image translation to perturb static high quality images so that they are visually indistinguishable from a set of video frames. We assume the presence of a fully annotated static image dataset and an unannotated video dataset. The object detector is trained on the adversarially transformed image dataset using the annotations of the original dataset. Experiments on the Youtube-Objects and Youtube-Objects-Subset datasets with two contemporary baseline object detectors reveal that such unsupervised pixel level domain adaptation boosts the generalization performance on video frames compared to direct application of the original image object detector. We also achieve competitive performance compared to recent weakly supervised baselines. This paper can be seen as an application of image translation for cross domain object detection.
We consider the problem of object detection in unconstrained videos with close to zero supervision for the video domain. Object detection in videos is a crucial component in several downstream vision applications such as video anomaly detection, autonomous driving, and tracking. In this work, we assume that we only have access to a fully annotated dataset of still images (which we refer to as the source domain), while there is no annotation (not even any form of weak supervision) for the video domain (which we refer to as the target domain). Intuitively, an object detector trained on still images performs worse on video frames primarily due to significant appearance disparities between the two domains. Specifically, still images retain many more high frequency components and are less cluttered/occluded, whereas objects in videos often suffer from motion blur and poor resolution. Thus, even though large-scale annotated image datasets such as ImageNet, MS-COCO, and PASCAL are available, image object detectors perform worse on videos than detectors trained on manually annotated video datasets [11, 28]. An immediate direction of effort would be to annotate video datasets. However, annotating large-scale video datasets demands an enormous amount of manual labor, time, and cost.
Though collecting annotated videos seems a daunting task, there is an abundance of unlabeled natural videos available publicly from sources such as Youtube and Flickr. The aim of this paper is to exploit such unlabeled videos to learn an end-to-end trainable network that transforms images sampled from a static image dataset to ‘appear’ as if they were sampled from a video dataset. Following this, if we train an object detector on such a ‘transformed image dataset’, we expect a boost in the generalization capability of the detector on videos (see Fig. 1 for an exemplary success). Specifically, our contributions in this paper can be summarized as follows:
We apply the concept of cycle consistent image-to-image translation with generative adversarial networks (GANs) for learning a completely unsupervised pixel-level transformation from images to video frames. To the best of our knowledge, this work is the first demonstration of the applicability of GAN-based pixel level domain adaptation for adapting object detectors from images to videos.
We empirically show the importance of a cyclic network architecture for training an unsupervised GAN-based image translation framework.
Evaluations on the recently released Youtube-Objects and Youtube-Objects-Subset datasets reveal that our domain adaptation approach improves upon two contemporary state-of-the-art image object detectors. We also obtain competitive performance compared to recent weakly supervised methods.
Object detection in still images is one of the traditional genres of computer vision research [4, 8, 10, 22]. These methods require bounding box annotations on a large sample of diversified images. Although annotating boxes for large datasets is tedious, it is necessary, especially for deep learning frameworks, in which the complexity of the neural network makes it vulnerable to overfitting if not trained with sufficiently large labeled datasets. To circumvent this requirement, there are two genres of approach closely related to our current effort. One line of approach is weakly supervised object localization [1, 17, 23, 24, 30, 3], wherein we only have meta information such as the presence/absence of an object category. The majority of these algorithms are based on the multiple instance learning (MIL) framework. In this formulation, an image is represented as a bag of regions. It is assumed that if an object category is marked as ‘present’ in an image, then one of the regions in the positive bag tightly bounds the object, while a tag of ‘absent’ means no region contains the object. MIL training alternates between learning an appearance model and selecting, using the learnt appearance model, the proper region in positive bags which contains the object. However, weakly supervised training is very difficult and its performance is still not on par with fully supervised methods [11, 28].
The second line of effort is to exploit information from both videos and images [23, 26, 18]. Prest et al. presented a framework for adapting detectors from videos to images (which is the opposite direction of ours). First, they learn an automated object localizer on videos. This is followed by training a video object detector under a domain adaptation setting with fully annotated image data and weakly annotated video data. Sharma and Nevatia present a framework for detecting humans in videos by online refinement of a baseline pre-trained detector. The idea is to apply the pre-trained detector at a high recall rate so that it outputs a large number of true positive regions; the false positive regions are discarded by online refinement. Clearly, this method is not suited for real-time applications, whereas our detector module, Faster R-CNN, runs in near real time. Also, we experiment on a wide variety of moving objects (rather than only humans) under more unconstrained environments.
In a generative adversarial network (GAN), a noise vector $z$ is drawn from a prior noise distribution $p_z(z)$ (a uniform distribution in the original formulation), and the generator maps it onto an image, $G(z)$. The discriminator classifies samples between the true data distribution $p_{data}$ and the generated distribution $p_g$. Specifically, the generator and discriminator play the following game on the value function $V(G, D)$:

$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))].$$

With sufficient capacity for both generator and discriminator, this min-max game attains its global optimum when $p_g = p_{data}$. Empirically, it has been observed that it is prudent for the generator to maximize $\log D(G(z))$ instead of minimizing $\log(1 - D(G(z)))$.
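To make this objective concrete, here is a minimal PyTorch sketch of the two losses, including the non-saturating generator trick mentioned above. It is an illustrative snippet, not the paper's released code: `G` and `D` stand for any compatible generator/discriminator modules, and the discriminator is assumed to output raw logits.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, real_images, fake_images):
    # D maximizes log D(x) + log(1 - D(G(z))); equivalently, minimize the
    # binary cross-entropy against labels 1 (real) and 0 (fake).
    real_logits = D(real_images)
    fake_logits = D(fake_images.detach())  # do not backprop into G here
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits, torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake

def generator_loss(D, fake_images):
    # Non-saturating trick: maximize log D(G(z)) instead of minimizing
    # log(1 - D(G(z))), which gives stronger gradients early in training.
    fake_logits = D(fake_images)
    return F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))
```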
Our core motivation is to transform a static image to appear as if it were sampled from a dataset of video frames. This can be seen as an appearance mapping in pixel space, which is nowadays studied under the umbrella of adversarial image-to-image translation. GANs have shown the potential to approximate an unknown static distribution, a property recently leveraged by several authors for pixel level domain adaptation [27, 14, 2]. The basic idea is to have image samples in a source and a target domain and learn a conditional generator network (conditioned on an image sampled from the source domain) that maps it to the target domain. The discriminator's task is to identify whether an image is sampled from the target domain or is transformed from the source domain. Shrivastava et al. and Bousmalis et al. had the common motivation of designing a ‘refiner’ network to transform synthetic images to appear like real images and then training discriminative models on these transformed synthetic datasets. This is helpful because labelled data in a rendered/synthetic domain is often free of cost. Though promising, these methods were only applied for inference on very small objects, such as estimating gaze from a properly cropped human eye sample of resolution 35×55. In the latter work, the authors presented results of recognition and pose estimation from centre-cropped small objects such as ‘phone’, ‘lamp’, etc., from the Cropped LineMod dataset. Our application case is much more difficult, as we work with uncropped images/video frames in the wild and must detect an arbitrary number of instances of a given class in an image. The first success of applying GANs for high resolution (256×256) image-to-image translation was reported by Isola et al. Their framework, for example, transforms the sketch of a shoe into a real textured consumer shoe or converts an aerial map into an actual city image. Though promising, this method is restricted in applicability due to the requirement of paired examples across both domains. This is specifically restrictive in our case because it is not possible to have an object with the same scale and orientation present in both the image and video datasets. To mitigate this restriction, Zhu et al. proposed to incorporate a cycle consistency loss so that a forward transform, $F(x)$, followed by a backward transform, $G(F(x))$, gives back the starting sample $x$; the same constraint is applied for the other domain. The cycle consistency loss is a key component for learning in the absence of paired data and is thus well suited for our application use case.
Our aim is to learn an unsupervised mapping between two domains of data: high-quality static images, $I$, and video frames, $V$. We have unpaired training samples $\{i_m\}_{m=1}^{M} \in I$ and $\{v_n\}_{n=1}^{N} \in V$, with data distributions $i \sim p_{data}(i)$ and $v \sim p_{data}(v)$. In Fig. 2 we show the two components of adversarial transformation and cycle consistency loss between the image and video domains. There are two transformation networks, $G$ and $F$, which map $G: I \rightarrow V$ and $F: V \rightarrow I$ respectively.
The domain discriminator, $D_V$, discriminates between images transformed from the static images and frames sampled from videos. The adversarial loss for this forward transformation is defined as
$$\mathcal{L}_{GAN}(G, D_V, I, V) = \mathbb{E}_{v \sim p_{data}(v)}[\log D_V(v)] + \mathbb{E}_{i \sim p_{data}(i)}[\log(1 - D_V(G(i)))].$$
Similarly, the discriminator $D_I$ discriminates between video frames transformed to the image domain and images sampled from the static image database. The corresponding adversarial loss is
$$\mathcal{L}_{GAN}(F, D_I, V, I) = \mathbb{E}_{i \sim p_{data}(i)}[\log D_I(i)] + \mathbb{E}_{v \sim p_{data}(v)}[\log(1 - D_I(F(v)))].$$
Theoretically, with enough capacity, $G$ and $F$ can learn to generate samples from $V$ and $I$ respectively. However, without any additional constraint, a given sample from $I$ can be mapped to any random point in $V$ and still be indistinguishable from real samples of $V$. For example, if we provide an image of a car from the static dataset, $G$ can map it to look like a car from the video dataset but change the scale and pose of the car. Though this might not be an issue from an artistic point of view, it is a point of concern for training object detectors, because we reuse the bounding box annotations from the original static image. Thus we cannot afford any structural change during the mapping from $I$ to $V$. To restrict the domain of possible transformations, a cycle loss is introduced such that the learned mappings respect the following sequence of transformation constraints.
This ensures that the learned cyclic mapping can start from a given image $i$ and recover $i$ after the two transformations of the cycle, i.e., $i \rightarrow G(i) \rightarrow F(G(i)) \approx i$. This is termed the forward cycle consistency constraint, and the corresponding loss is
$$\mathcal{L}_{cyc}^{fwd}(G, F) = \mathbb{E}_{i \sim p_{data}(i)}\big[\lVert F(G(i)) - i \rVert_1\big].$$
A similar consistency is also imposed for frames being converted to static images with $F$ and back to video frames with $G$. The constraint $v \rightarrow F(v) \rightarrow G(F(v)) \approx v$ needs to be maintained, and the corresponding backward consistency loss is
$$\mathcal{L}_{cyc}^{bwd}(G, F) = \mathbb{E}_{v \sim p_{data}(v)}\big[\lVert G(F(v)) - v \rVert_1\big].$$
The complete objective function can be written as
$$\mathcal{L}(G, F, D_I, D_V) = \mathcal{L}_{GAN}(G, D_V, I, V) + \mathcal{L}_{GAN}(F, D_I, V, I) + \lambda\big(\mathcal{L}_{cyc}^{fwd}(G, F) + \mathcal{L}_{cyc}^{bwd}(G, F)\big),$$
where $\lambda$ controls the relative importance of the cycle consistency loss over the adversarial loss. The task is to find the optimum $G^*$ and $F^*$ such that
$$G^*, F^* = \arg\min_{G, F}\ \max_{D_I, D_V}\ \mathcal{L}(G, F, D_I, D_V).$$
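As a reference, the sketch below assembles the generator-side objective above in PyTorch, assuming $G: I \rightarrow V$ and $F: V \rightarrow I$ (written as `F_net` to avoid clashing with `torch.nn.functional`) and logit-producing discriminators `D_V`, `D_I`. It is an illustrative re-implementation under those assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def adversarial_g_loss(D, fake):
    # Generator side of the vanilla GAN loss (non-saturating form).
    logits = D(fake)
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

def cyclegan_generator_objective(G, F_net, D_I, D_V, real_i, real_v, lam=10.0):
    fake_v = G(real_i)      # static image -> "video-like" image
    fake_i = F_net(real_v)  # video frame  -> "image-like" frame

    # Adversarial terms: fool the video-domain and image-domain discriminators.
    adv = adversarial_g_loss(D_V, fake_v) + adversarial_g_loss(D_I, fake_i)

    # Cycle-consistency terms: F(G(i)) ~ i and G(F(v)) ~ v, as L1 penalties.
    cyc = torch.mean(torch.abs(F_net(fake_v) - real_i)) \
        + torch.mean(torch.abs(G(fake_i) - real_v))

    return adv + lam * cyc  # lambda weighs cycle consistency vs. adversarial loss
```

The discriminators are trained separately with the usual real-vs-fake cross-entropy, as in the earlier GAN sketch.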
Our entire framework consists of two phases of training. In the first phase, we train the CycleGAN network to transform high-quality static images to appear as video frames. So, a given image dataset is transformed into a pseudo video dataset. The next phase is training a standard object detector on the previously transformed image dataset using the same annotations of the original static image dataset. CycleGAN is not required during object detection testing.
In this paper we have used two contemporary object detectors, viz., Faster R-CNN (implementation available at https://github.com/smallcorgi/Faster-RCNN_TF) and RFB Net (implementation available at https://github.com/lzx1413/PytorchSSD). We have used the default settings from the respective papers for training on all the different variants of the training datasets.
We have trained CycleGAN on unaligned images and video frames. During training, images are resized to 286×286 and then randomly cropped to 256×256 to increase the robustness of the model. During inference, the transformation is applied to the 256×256 resized image, which is then rescaled to the original resolution before object detection. We model the generator with 9 ResNet blocks as implemented in the reference implementation (https://github.com/junyanz/CycleGAN) and the discriminator with a PatchGAN classifier [14, 19]. The GAN loss is the vanilla GAN objective, while the cycle loss is an L1 loss. We set $\lambda = 10$, use a batch size of 1, and optimize with Adam with a momentum of 0.5 and an initial learning rate of 0.0002.
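The snippet below captures these preprocessing and optimizer settings using torchvision/PyTorch. The data pipeline of the released CycleGAN code differs in detail; this is only a hedged sketch of the stated hyper-parameters, with the generator and discriminator modules assumed to be defined elsewhere (9-block ResNet generator, PatchGAN discriminator).

```python
import torch
from torchvision import transforms

# Resize to 286x286, then take a random 256x256 crop during CycleGAN training.
train_transform = transforms.Compose([
    transforms.Resize((286, 286)),
    transforms.RandomCrop(256),
    transforms.ToTensor(),
])

def make_optimizers(G, F_net, D_I, D_V, lr=2e-4, beta1=0.5):
    # Adam with beta1 = 0.5 (the "momentum" above) and learning rate 0.0002.
    g_params = list(G.parameters()) + list(F_net.parameters())
    d_params = list(D_I.parameters()) + list(D_V.parameters())
    opt_g = torch.optim.Adam(g_params, lr=lr, betas=(beta1, 0.999))
    opt_d = torch.optim.Adam(d_params, lr=lr, betas=(beta1, 0.999))
    return opt_g, opt_d
```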
All computations are performed on an NVIDIA Tesla K40C 12GB GPU. Training of CycleGAN with 5000 images each in the source and target domains takes 2 days. Faster R-CNN training on the image dataset takes 12 hours, while RFB Net training takes 10 hours. During testing, on average, Faster R-CNN and RFB Net run at about 7-8 frames per second.
In the first part, we show the visual effects of applying the adversarial transformation on static images, and in the second part we visually and numerically show the benefit of pixel level domain adaptation for object detection in videos.
We use a fully annotated dataset of static images as the source domain and a dataset of unannotated video frames as the target domain. Please note that in this paper we perform object detection on stand-alone video frames and not on video sequences. Exploiting temporal information for efficient object detection is a separate genre of research and is not the main focus of this paper.
For the source domain we use images from the PASCAL VOC 2007 dataset, a standard dataset for object detection, which consists of about 5000 training and 5000 testing images over 20 object categories. We train the object detector on all 20 object categories. For the target domain we use video frames from the YouTube-Objects dataset, which consists of about 4300 training and 1800 testing images over 10 object categories that form a subset of the PASCAL VOC categories. Testing is done on the annotated test set of Youtube-Objects (YTO). For testing, we also consider another dataset, Youtube-Objects-Subset, which is derived from the videos of Youtube-Objects but has more ground truth annotations. Please note that we never make use of bounding box annotations on the video training set at any stage of our framework.
After completion of CycleGAN training, we would expect a high quality static image to be transformed to visually look like a video frame. In general, after the transformation, static images lose high frequency components, colors tend to get desaturated, and discriminative parts get blended with their surroundings. We show some exemplary transformations in Fig. 3.
To appreciate the benefit of such a cyclic structure in our adversarial network, we also trained a simple forward-transform GAN with only the video domain discriminator, which we term ForwardGAN. This is similar to the methods of [27, 2]. The discriminator discriminates between video frames and forward-transformed static images. As we can see in Fig. 4, ForwardGAN adds implausible structural perturbations to the images; consequently, object detectors trained on these images gave poor test results.
Table 1: Faster R-CNN tested on the Youtube-Objects (YTO) test set. The first ten columns report per-class AP over the 10 YTO categories; the last column reports mAP.

|Original VOC (Lower Bound)|75.0|89.9|35.1|68.9|56.7|46.2|45.7|39.8|62.6|46.2|56.6|
|Conditioned CycleGAN VOC (Proposed)|79.6|90.4|37.5|71.9|61.1|67.4|47.7|46.3|64.8|52.8|62.0|

Table 2: RFBNet300 tested on the Youtube-Objects-Subset test set. The first ten columns report per-class AP over the 10 YTO categories; the last column reports mAP.

|Original VOC (Lower Bound)|75.2|80.9|45.1|74.5|45.5|40.1|33.7|37.4|61.8|58.8|55.3|
|Conditioned CycleGAN VOC (Proposed)|83.0|83.8|49.2|76.1|51.0|56.7|35.2|42.7|61.9|72.1|61.2|
Table 3: Comparison with weakly supervised baselines on the Youtube Objects and Youtube Objects Subset datasets. Methods compared: Proposal Only, Proposal + Transfer, Teh et al., Chanda et al., Proposed, and Proposed (Conditioned).
We show some examples from the original VOC and YTO datasets. Indeed, we can see that objects in YTO suffer from defocusing, blurriness, and color desaturation effects. Our framework learns this distribution difference automatically instead of relying on manually hand-crafted dataset augmentation techniques. In fact, in our initial experiments, we augmented original VOC with Gaussian noise of standard deviation 0.01, 0.05, 0.1, 0.5, and 1; the mAP of Faster R-CNN trained on these augmented datasets and tested on YTO was 56.4, 57.1, 57.2, 50.3, and 44.1 respectively. We also tried Gaussian blurring of VOC images with kernel sizes of 5, 9, and 13 at a standard deviation of 2; the mAPs on YTO were 56.4, 57.0, and 56.8. With a standard deviation of 4, the mAPs were 56.3, 57.2, and 55.9. Thus, it is safe to say that simple dataset augmentations do not help in the current domain adaptation problem.
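For reference, the simple augmentation baselines described above can be reproduced with a few lines of NumPy/OpenCV; this snippet is a sketch of those baselines (assuming images normalized to [0, 1]), not part of our training pipeline.

```python
import numpy as np
import cv2

def add_gaussian_noise(img, std):
    # img: float32 array in [0, 1]; std in {0.01, 0.05, 0.1, 0.5, 1.0} above.
    noisy = img + np.random.normal(0.0, std, img.shape).astype(np.float32)
    return np.clip(noisy, 0.0, 1.0)

def gaussian_blur(img, ksize, sigma):
    # ksize in {5, 9, 13} and sigma in {2, 4}, as in the experiments above.
    return cv2.GaussianBlur(img, (ksize, ksize), sigma)
```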
A commonly used metric for evaluating object detection is mean Average Precision (mAP). According to the Pascal criterion, a detected bounding box $b_d$ is considered a true positive if its Intersection over Union with an annotated box $b_{gt}$, $\text{IoU}(b_d, b_{gt}) = \frac{|b_d \cap b_{gt}|}{|b_d \cup b_{gt}|}$, is at least 0.5; otherwise it is considered a false positive. mAP is defined as the mean of average precision over all classes. Another popular measure is CorLoc, given by $\frac{TP}{TP + FP}$, which can be interpreted as the proportion of detected bounding boxes that satisfy the Pascal criterion.
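The Pascal criterion amounts to the following check; below is a minimal, self-contained helper (boxes given as (x1, y1, x2, y2)), included purely as an illustration of the metric.

```python
def iou(box_a, box_b):
    # Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2).
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(detected_box, annotated_box, threshold=0.5):
    # Pascal criterion: a detection is a true positive if IoU >= 0.5.
    return iou(detected_box, annotated_box) >= threshold
```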
Special care on the Youtube-Objects-Subset dataset: the annotations released for this dataset consist of pixel-wise segmentation maps instead of bounding boxes as in Youtube-Objects. We convert the segmentation maps to bounding boxes by enclosing each segmentation map in the smallest possible rectangle. We also followed the previously proposed conversion strategy of converting a detected bounding box to a segmentation map using the grab-cut algorithm (see https://docs.opencv.org/trunk/d8/d83/tutorial_py_grabcut.html); under this formulation, IoU is measured with respect to the overlap of segmentation maps. The numerical results following both paradigms are almost comparable, and thus, following prior work, we stick to the latter framework.
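Converting a pixel-wise segmentation map to its tightest enclosing bounding box is straightforward; here is a small NumPy sketch of that conversion step (the grab-cut direction uses OpenCV's cv2.grabCut and is not shown).

```python
import numpy as np

def mask_to_bbox(mask):
    # mask: 2-D array, nonzero where the object is present.
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None  # no object pixels in this frame
    # Smallest enclosing rectangle as (x1, y1, x2, y2).
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```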
As a baseline, we train standard object detectors on the PASCAL VOC dataset and test on video frames without any adaptation. This gives us the lower bound of performance. The upper bound is achieved by training the detectors on annotated video frames. To show the robustness of the approach, we report the observed effect of domain adaptation using two different object detectors, viz., Faster R-CNN and RFBNet300, both with a VGG16 backbone. In Table 1 we compare the performance of Faster R-CNN trained on different versions of the training set and tested on the Youtube-Objects (YTO) test set. In Table 2 we report the performance of RFBNet300 tested on Youtube-Objects-Subset. For both detectors, we see an appreciable boost in performance when training on the CycleGAN-transformed image dataset compared to the original VOC dataset. Of course, training on labelled YTO frames still gives the upper bound of performance, but our proposed method reduces the performance gap appreciably. It is also to be noted that training on ForwardGAN outputs deteriorates performance even below training on original VOC. Tables 1 and 2 strongly bolster our hypothesis that visual domain adaptation is a viable approach for cross-domain learning of object detectors with close to zero supervision in the unlabeled domain. Next, in Table 3, we compare our model with some contemporary weakly supervised baselines. ‘Proposal only’ refers to learning an appearance model based on object proposals on weakly annotated frames, while ‘proposal + transfer’ refers to a combination of the appearance model from ‘novel’ objects (video) and a transferred appearance model from ‘familiar’ annotated objects (static images). The method of Chanda et al. is based on a 2-stream network, where one stream performs regular fully supervised image object detection and the other performs frame-level classification on the weakly annotated videos; the two streams share parameters to counter domain shift factors. Teh et al. proposed an attention network so that the scores of the region proposal network on weakly annotated objects mimic those of a strong fully supervised object detector. Please note that all these competing methods assume the presence of meta information, such as the presence/absence of object categories, on each training video frame, whereas we assume no information to be associated with the video training dataset. It is evident from Table 3 that our model achieves competitive performance (better in the majority of cases) compared to these methods.
One of the drawbacks of CycleGAN (see the discussion at https://github.com/junyanz/CycleGAN#failure-cases) is that it is tough to train if the two domains differ structurally. In our case, this bottleneck arises because, without a priori knowledge of labels, a given mini-batch can consist of different object categories. To mitigate this drawback, one can train a separate CycleGAN conditioned on each category of YTO. This requires weak label information for each frame. So, during offline CycleGAN training, we train 10 different networks to individually transform each category. However, we still need to train only a single object detector on this conglomerated transformed dataset. Such use of weak labels further boosts the performance of our framework, as reported in Tables 1, 2 and 3. This observation indicates that class-conditioned CycleGANs are better at capturing the appearance diversities across the two datasets.
In this paper, we mainly focused on unsupervised pixel level domain adaptation for transferring image object detectors to videos. In contrast to the contemporary trend of weakly supervised learning on videos, which still requires manual intervention, our framework requires no supervision. The core idea is to pose the problem as adversarial image-to-image translation, converting annotated static images to be visually indistinguishable from video frames. We also showed that the inclusion of class labels on videos improves our framework further. A straightforward application of our model would be to automatically annotate large video datasets for object detection. Currently, our method is focused on detecting objects in standalone video frames; an immediate extension would be to leverage temporal information in videos for enhanced detection performance. More generally, the ideas of this paper should encourage researchers towards other interesting visual domain adaptation applications, such as emotion recognition from 3D face avatars, learning pose estimation in a virtual world, robotic navigation in simulated environments, and eventually real-world deployments.
N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), volume 1, pages 886-893. IEEE, 2005.
P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.