Unsupervised Adversarial Visual Level Domain Adaptation for Learning Video Object Detectors from Images

by   Avisek Lahiri, et al.

Deep learning based object detectors require thousands of diversified examples with bounding box and class annotations. Though image object detectors have shown rapid progress in recent years with the release of multiple large-scale static image datasets, object detection on videos still remains an open problem due to the scarcity of annotated video frames. A robust video object detector is an essential component for video understanding and for curating large-scale automated annotations in videos. The domain difference between images and videos makes the transfer of image object detectors to videos sub-optimal. The most common solution is to use weakly supervised annotations, where each video frame is tagged for the presence/absence of object categories; this still requires manual effort. In this paper we take a step forward by adapting the concept of unsupervised adversarial image-to-image translation to perturb static high-quality images so that they become visually indistinguishable from a set of video frames. We assume the presence of a fully annotated static image dataset and an unannotated video dataset. An object detector is then trained on the adversarially transformed image dataset using the annotations of the original dataset. Experiments on the Youtube-Objects and Youtube-Objects-Subset datasets with two contemporary baseline object detectors reveal that such unsupervised pixel-level domain adaptation boosts generalization performance on video frames compared to directly applying the original image object detector. We also achieve competitive performance compared to recent weakly supervised baselines. This paper can be seen as an application of image translation for cross-domain object detection.





Code Repositories


Code for our WACV paper "Unsupervised Adversarial Visual Level Domain Adaptation for Learning Video Object Detectors from Images"


1 Introduction

We consider the problem of object detection in unconstrained videos with close to zero supervision for the video domain. Object detection in videos is a crucial component in several downstream vision applications such as video anomaly detection, autonomous driving, and tracking. In this work, we assume that we only have access to a fully annotated dataset of still images (which we refer to as the source domain), while there is no annotation (not even any form of weak supervision) for the video domain (which we refer to as the target domain). Intuitively, an object detector trained on still images performs worse on video frames, primarily due to significant appearance disparities between the two domains. Specifically, still images retain much more high frequency content and are less cluttered/occluded. Conversely, objects in videos often suffer from motion blur and poor resolution. Thus, even though we have large-scale annotated image datasets such as ImageNet [5], MS-COCO [20], and PASCAL [7], the performance of image object detectors on videos is worse than that of detectors trained on manually annotated video datasets [11, 28]. An immediate direction of effort would be to annotate video datasets. However, annotating large-scale video datasets demands enormous manual labor, time, and cost.

Though collecting annotated video seems a daunting task, there is an abundance of unlabeled natural videos publicly available from sources such as Youtube and Flickr. The aim of this paper is to exploit such unlabeled videos to learn an end-to-end trainable network that transforms images sampled from a static image dataset to ‘appear’ as if sampled from a video dataset. If we then train an object detector on such a ‘transformed image dataset’, we expect a boost in the generalization capability of the detector on videos (see Fig. 1 for exemplary successes). Specifically, our contributions in this paper can be summarized as:

  1. We apply the concept of cycle consistent image-to-image translation [32] with generative adversarial networks (GAN) [12] to learn a completely unsupervised transformation from image to video in the pixel domain. To the best of our knowledge, this work is the first demonstration of the applicability of GAN-based pixel-level domain adaptation for adapting object detectors from images to video.

  2. We empirically show the importance of cyclic network architecture for training an unsupervised GAN based image translation framework.

  3. Evaluations on the recently released Youtube-Objects [15] and Youtube-Objects-Subset [29] datasets reveal that our approach of domain adaptation improves upon two contemporary state-of-the-art baseline image object detectors. We also obtain competitive performance compared to recent weakly supervised methods.

Figure 3: Effect of applying learnt transformations on high-quality static VOC images to appear as if sampled from a video sequence. Note (left) how the legs of the bird get almost blended with the nearby surroundings after transformation. We also show a case (right) where the discriminative fur colors of a cat get desaturated and details such as eyes, whiskers and ears become indistinguishable. Training object detectors on such adversarially perturbed images helps improve test performance on actual video frames. Best appreciated when viewed in color and zoomed in.

2 Related Works

2.1 Object Detection

Object detection in still images is one of the traditional genres of computer vision research [4, 8, 10, 22]. These methods require bounding box annotations on a large sample of diversified images. Although annotating boxes for large datasets is tedious, it is necessary, especially for deep learning frameworks in which the complexity of the neural net makes it vulnerable to overfitting if not trained with sufficiently large labeled datasets. To circumvent this requirement, there are two genres of approach closely related to our current effort. One line of approach is weakly supervised object localization [1, 17, 23, 24, 30, 3], wherein we only have meta-information such as the presence/absence of an object category. The majority of these algorithms are based on the multiple instance learning (MIL) framework. In this formulation, an image is represented as a bag of regions. It is assumed that if an object category is marked as ‘present’ in an image then one of the regions in the positive bag tightly bounds the object, while a tag of ‘absent’ means no region contains the object. MIL training alternates between learning an appearance model and selecting, using the learnt appearance model, the proper region in positive bags that contains the object. However, weakly supervised training is very difficult and its performance is still not at par with fully supervised methods [11, 28].

The second line of effort is to exploit information from both videos and images [23, 26, 18]. In [23], Prest et al. presented a framework for adapting detectors from videos to images (the opposite direction to ours). First, they learn an automated object localizer in videos. This is followed by training a video object detector under a domain adaptation setting with fully annotated image data and weakly annotated video data. In [26], Sharma and Nevatia present a framework for detecting humans in videos by online refinement of a baseline pre-trained detector. The idea is to apply the pre-trained detector at a high recall rate so that it outputs many true positive regions; false positive regions are then discarded by an online refinement. Clearly, this method is not suited for real-time applications. Our detector modules consist of Faster R-CNN [9], which runs in near real time. Also, we experiment on a wide variety of moving objects (rather than only humans) under more unconstrained environments.

2.2 Generative Adversarial Networks

A generative adversarial network [12] has two parametrized models, a discriminator D and a generator G, pitted against each other in a zero-sum game. The generator network’s input is a noise vector, z, drawn from a prior noise distribution, p_z(z). Following [12], p_z(z) is a uniform distribution and the generator maps z onto an image, G(z). The discriminator classifies samples between the true data distribution, p_data, and the generated distribution, p_g. Specifically, generator and discriminator play the following game on V(D, G):

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

With sufficient capacity for both generator and discriminator, this min-max game has a global optimum when p_g = p_data [12]. Empirically, it has been observed that for the generator it is prudent to maximize log D(G(z)) instead of minimizing log(1 - D(G(z))).
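The practical benefit of the non-saturating generator objective (maximizing log D(G(z)) rather than minimizing log(1 - D(G(z)))) can be illustrated by comparing the gradient magnitudes of the two losses with respect to the discriminator score; a minimal numeric sketch (the function names are illustrative, not from the paper):

```python
import math

# Generator loss variants as a function of the discriminator score d = D(G(z)).
def saturating_loss(d):
    """Original minimax objective the generator minimizes: log(1 - d)."""
    return math.log(1.0 - d)

def nonsaturating_loss(d):
    """Heuristic objective: minimize -log(d), i.e. maximize log D(G(z))."""
    return -math.log(d)

# Early in training the discriminator confidently rejects fakes, so d is near 0.
d = 1e-3
grad_saturating = abs(-1.0 / (1.0 - d))  # |d/dd log(1 - d)| ~ 1: weak learning signal
grad_nonsaturating = abs(-1.0 / d)       # |d/dd -log(d)| ~ 1000: strong learning signal
```

When the discriminator wins early (d near 0), the saturating loss gives the generator almost no gradient, while the non-saturating variant does, which is why the latter is preferred in practice.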

2.3 Image to Image Translation with Adversarial Learning

Our core motivation is to transform a static image to appear as if sampled from a dataset of video frames. This can be seen as an appearance mapping in pixel space, which is nowadays studied under the umbrella of adversarial image-to-image translation. GANs have shown the potential to approximate an unknown static distribution, a property recently leveraged by several authors for pixel-level domain adaptation [27, 14, 2]. The basic idea is to have paired image samples in two domains, X and Y, and then learn a conditional generator network (conditioned on an image sampled from X) to map X to Y. The discriminator’s task is to identify whether an image is sampled from Y or transformed from X. Shrivastava et al. [27] and Bousmalis et al. [2] had the common motivation of designing a ‘refiner’ network to transform synthetic images to appear like real images and then train discriminative models on these transformed synthetic datasets. This is helpful because labelled data in a rendered/synthetic domain is often free of cost. Though promising, these methods were only applied for inference on very small objects, such as estimating gaze from a properly cropped human eye sample of resolution 35×55. In [2], the authors presented results of recognition and pose estimation on specifically centre-cropped small objects such as ‘phone’, ‘lamp’, etc., from the Cropped LineMod dataset [31]. Our application case is much more difficult, as we work with non-cropped images/video frames in the wild, with the problem of detecting an arbitrary number of instances of a given class in an image. The first success in applying GANs for high resolution (256×256) image-to-image translation was reported by Isola et al. [14]. Their framework, for example, transforms a sketch of a shoe to a real textured consumer shoe or converts an aerial map to an actual city image. Though promising, this method is restricted in applicability by the requirement of paired examples across both domains. This is specifically restrictive in our case because it is not possible to have an object with the same scale and orientation present in both the image and video datasets. To mitigate this restriction, Zhu et al. [32] proposed to incorporate a cycle consistency loss so that a forward transform, F(x), followed by a backward transform, G(F(x)), gives back the starting sample, x; the same constraint is applied for the other domain. The cycle consistency loss is a key component for learning in the absence of paired data and is thus well suited for our application use case.

3 Our Approach

Our aim is to learn an unsupervised mapping between two domains of data: high-quality static images, I, and video frames, V. We have unpaired training samples {i_m} ∈ I and {v_n} ∈ V, with data distributions i ~ p_data(i) and v ~ p_data(v). In Fig. 2 we show the two components of adversarial transformation and cycle consistency loss between the image and video domains. There are two transformation networks, G and F, which map G: I → V and F: V → I respectively.

3.1 Adversarial Transformation

The domain discriminator, D_V, discriminates between images transformed from the static images and frames sampled from videos. The adversarial loss for this forward transformation is defined as

$$\mathcal{L}_{GAN}(G, D_V) = \mathbb{E}_{v \sim p_{data}(v)}[\log D_V(v)] + \mathbb{E}_{i \sim p_{data}(i)}[\log(1 - D_V(G(i)))]$$

Similarly, D_I discriminates between video frames transformed to the image domain and images sampled from the static image database. The corresponding adversarial loss is

$$\mathcal{L}_{GAN}(F, D_I) = \mathbb{E}_{i \sim p_{data}(i)}[\log D_I(i)] + \mathbb{E}_{v \sim p_{data}(v)}[\log(1 - D_I(F(v)))]$$
3.2 Cycle Consistency Loss

Theoretically, with enough capacity, G and F can learn to generate samples from V and I respectively. However, without any additional constraint, a given sample from I can be mapped to any random point in V and still be indistinguishable from real samples of V. For example, if we provide an image of a car from the static dataset, G can map it to look like a car from the video dataset, but may change the scale and pose of the car. Though this might not be an issue from an artistic point of view, it is a point of concern for training object detectors because we will reuse the bounding box annotations from the original static image. Thus we cannot afford any structural change during the mapping from I to V. To restrict the domain of possible transformations, a cycle loss is introduced such that the learned mapping respects the following transformation constraint:

$$i \rightarrow G(i) \rightarrow F(G(i)) \approx i$$

This ensures that the learned cyclic mapping can start from a given image i and get i back after the two transformations of the cycle. This is termed the forward cycle consistency constraint, and the corresponding loss is

$$\mathcal{L}_{cyc}^{fwd}(G, F) = \mathbb{E}_{i \sim p_{data}(i)}[\lVert F(G(i)) - i \rVert_1]$$

A similar consistency is also imposed for frames converted to static images with F and back to video frames with G. The following constraint needs to be maintained:

$$v \rightarrow F(v) \rightarrow G(F(v)) \approx v$$

and the corresponding backward consistency loss is

$$\mathcal{L}_{cyc}^{bwd}(G, F) = \mathbb{E}_{v \sim p_{data}(v)}[\lVert G(F(v)) - v \rVert_1]$$
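The cycle constraint can be sketched numerically. Here G and F are hypothetical stand-ins for the learned networks (in the paper they are deep generators), chosen to be exact inverses so the round trip reconstructs the input and both cycle losses vanish:

```python
import numpy as np

def l1_cycle_loss(x, x_cycled):
    """Mean absolute error between an input and its round-trip reconstruction."""
    return float(np.mean(np.abs(x - x_cycled)))

# Hypothetical stand-ins for G: I -> V and F: V -> I (exact inverses here).
G = lambda i: 0.9 * i + 0.05
F = lambda v: (v - 0.05) / 0.9

i = np.random.rand(2, 32, 32, 3)            # a small batch of static "images"
forward_cycle = l1_cycle_loss(i, F(G(i)))   # i -> G(i) -> F(G(i)) should return i
v = np.random.rand(2, 32, 32, 3)            # a small batch of "video frames"
backward_cycle = l1_cycle_loss(v, G(F(v)))  # v -> F(v) -> G(F(v)) should return v
```

With real networks the two mappings are only approximately inverse, and these L1 terms are what the optimizer drives toward zero.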
3.3 Complete Objective

The complete objective function can be written as

$$\mathcal{L}(G, F, D_I, D_V) = \mathcal{L}_{GAN}(G, D_V) + \mathcal{L}_{GAN}(F, D_I) + \lambda\left(\mathcal{L}_{cyc}^{fwd}(G, F) + \mathcal{L}_{cyc}^{bwd}(G, F)\right)$$

where λ controls the relative importance of the cycle consistency loss over the adversarial loss. The task is to find the optima G* and F* such that

$$G^*, F^* = \arg\min_{G, F}\ \max_{D_I, D_V}\ \mathcal{L}(G, F, D_I, D_V)$$
4 Implementation Details

Our entire framework consists of two phases of training. In the first phase, we train the CycleGAN network to transform high-quality static images to appear as video frames; a given image dataset is thus transformed into a pseudo video dataset. The next phase trains a standard object detector on this transformed image dataset using the annotations of the original static image dataset. CycleGAN is not required at object detection test time.
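The two phases above can be sketched as follows; `transform_dataset`, `DummyDetector`, and `fit_step` are illustrative stand-ins, not the authors' code:

```python
def transform_dataset(images, generator):
    """Phase 1: map every labeled static image into the video-like domain
    with a (frozen) CycleGAN generator; annotations are left untouched."""
    return [generator(img) for img in images]

class DummyDetector:
    """Stand-in for Faster R-CNN / RFB Net: just counts training steps."""
    def __init__(self):
        self.steps = 0

    def fit_step(self, img, boxes):
        self.steps += 1

def train_detector(images, annotations, detector):
    """Phase 2: ordinary detector training on (transformed image, original box) pairs."""
    for img, boxes in zip(images, annotations):
        detector.fit_step(img, boxes)
    return detector

identity_generator = lambda img: img   # placeholder for the trained CycleGAN generator
images = [[0.1, 0.2], [0.3, 0.4]]
annotations = [[(0, 0, 10, 10)], [(5, 5, 20, 20)]]
pseudo_video = transform_dataset(images, identity_generator)
detector = train_detector(pseudo_video, annotations, DummyDetector())
```

The key design point is that only the images are transformed; the original bounding box annotations are reused verbatim, which is why the mapping must not alter object geometry.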

4.1 Object Detector

In this paper we use two contemporary object detectors, viz., Faster R-CNN [9] (code available at https://github.com/smallcorgi/Faster-RCNN_TF) and RFB Net [21] (code available at https://github.com/lzx1413/PytorchSSD). We use the default settings from the respective papers for training on all the different variants of the training datasets.

Figure 4: Benefit of cycle consistent GAN for image-to-image translation over a simple forward transform. In each tuple, left column: original VOC image; middle column: ForwardGAN-transformed image; right column: image transformed by CycleGAN. ForwardGAN by itself is not able to maintain structural information in this high dimensional space. The degradations it introduces are not representative of frames from videos, and thus training object detectors on these transformed images yields inferior results. Visually, CycleGAN does not degrade the essential structural information of the transformed images but still incorporates the perturbations necessary to become indistinguishable from YTO frames. Training object detectors on CycleGAN-transformed images thus results in better performance than training on ForwardGAN-transformed images.

4.2 CycleGAN

We train CycleGAN on unaligned images and video frames. During training, images are resized to 286×286 and then randomly cropped to 256×256 to increase the robustness of the model. During inference, the transformation is applied to the 256×256 resized image, which is then rescaled to the original resolution before object detection. We model the generator with 9 ResNet blocks as implemented in [13] and the discriminator with a PatchGAN classifier [14, 19] (implementation: https://github.com/junyanz/CycleGAN). The GAN loss is implemented with the vanilla GAN [12] objective, while the cycle loss is an L1 loss. We set λ = 10, use a batch size of 1, and use the Adam optimizer [16] with a momentum of 0.5 and an initial learning rate of 0.0002.
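The resize-then-crop augmentation is straightforward; a sketch of the random 256×256 crop (`random_crop` is an illustrative helper, not the authors' code):

```python
import numpy as np

def random_crop(img, size=256, rng=None):
    """Random spatial crop applied after images are resized to 286x286."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)    # random top-left corner such that
    left = rng.integers(0, w - size + 1)   # the crop stays inside the image
    return img[top:top + size, left:left + size]

resized = np.zeros((286, 286, 3))          # stands in for a resized training image
crop = random_crop(resized, size=256)
```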

4.3 Timings

All computations are performed on an NVIDIA Tesla K40C 12GB GPU. Training CycleGAN with 5000 images each in the source and target domains takes 2 days. Faster R-CNN training on the image dataset takes 12 hours, while RFB Net training takes 10 hours. During testing, on average, Faster R-CNN and RFB Net run at about 7 to 8 frames per second.

5 Experiments

In the first part, we show the visual effects of applying the adversarial transformation on static images; in the second part we show, visually and numerically, the benefit of pixel-level domain adaptation for object detection in videos.

Figure 5: Exemplary detections on the Youtube-Objects dataset using Faster R-CNN trained on CycleGAN-transformed VOC (proposed) and original VOC (baseline). Training on the CycleGAN-transformed dataset enables an object detector to deal with visual characteristics of videos, such as cluttered backgrounds, motion blur, and small object sizes, and thus improves detection performance on a test set sampled from video sequences. Please note how small objects such as distant persons, bikes, and animals are detected by our model. Also note how our model is robust to distant blurry objects and cluttered/occluded groups of objects. All of this is achieved with zero video supervision.

5.1 Datasets

We use a fully annotated dataset of static images as the source domain and a dataset of unannotated video frames as the target domain. Please note that in this paper we perform object detection on stand-alone video frames and not on video sequences. Exploiting temporal information for efficient object detection is a separate genre of research and is not the main interest of this paper.

For the source domain we use the images from the PASCAL VOC 2007 dataset [7], a standard dataset for object detection. PASCAL VOC 2007 consists of about 5000 training and 5000 testing images over 20 object categories; we train the object detector on all 20 categories. For the target domain we use video frames from the YouTube-Objects dataset [23], which consists of about 4300 training and 1800 testing images over 10 object categories that form a subset of the PASCAL VOC categories. Testing is done on the annotated test set of Youtube-Objects (YTO). For testing, we also consider another dataset, Youtube-Objects-Subset [29], which is derived from the videos of Youtube-Objects but has more ground truth annotations. Please note that we never make use of the bounding box annotations of the video training set at any stage of our framework.

5.2 CycleGAN Training

5.2.1 Effect of adversarial image transformation

After CycleGAN training completes, we expect a high-quality static image to be transformed to look visually like a video frame. In general, after the transformation, static images lose high frequency components, colors tend to get desaturated, and discriminative parts get blended with the surroundings. We show some exemplary transformations in Fig. 3.

Figure 6: Visualizing domain differences between images from the VOC (top row) and YTO (bottom row) datasets. Images in VOC are sharper, rich in color representation, and properly focused. Objects in YTO are usually out of focus, blurred, and manifest color desaturation.

5.2.2 ForwardGAN

To appreciate the benefit of such a cyclic structure in our adversarial network, we also trained a simple forward-transform GAN with only the video domain discriminator; we term this transformation ForwardGAN. This is similar to the methods of [27, 2]. The discriminator discriminates between video frames and forward-transformed static images. As we can see in Fig. 4, ForwardGAN adds implausible structural perturbations to the images. Hence, training object detectors on these images gave poor test results.

                 Train Set aero bird boat car cat cow dog horse bike train mAP
Original VOC(Lower Bound) 75.0 89.9 35.1 68.9 56.7 46.2 45.7 39.8 62.6 46.2 56.6
ForwardGAN VOC 75.3 83.1 31.4 68.3 51.0 55.8 37.6 41.9 61.5 47.3 55.3
CycleGAN VOC(Proposed) 77.8 89.6 34.9 70.9 58.3 65.6 45.7 44.7 62.6 51.4 60.2
Conditioned CycleGAN VOC(Proposed) 79.6 90.4 37.5 71.9 61.1 67.4 47.7 46.3 64.8 52.8 62.0
Youtube-Object(Upper Bound) 78.1 91.3 46.7 72.3 62.1 69.6 46.6 48.0 63.1 53.7 63.1
Table 1: Comparison of detection performance (mean Average Precision, mAP) of the Faster R-CNN object detector (trained on different variants of the training set) on the Youtube-Objects dataset.
                Train Set aero bird boat car cat cow dog horse bike train mAP
Original VOC(Lower Bound) 75.2 80.9 45.1 74.5 45.5 40.1 33.7 37.4 61.8 58.8 55.3
ForwardGAN VOC 72.7 67.8 41.1 67.7 38.0 58.8 35.7 34.5 59.6 64.7 54.0
CycleGAN VOC(Proposed) 82.0 82.6 48.4 74.7 49.2 56.3 34.1 42.0 60.4 71.0 60.1
Conditioned CycleGAN VOC(Proposed) 83.0 83.8 49.2 76.1 51.0 56.7 35.2 42.7 61.9 72.1 61.2
Youtube-Object(Upper Bound) 83.5 84.6 48.9 75.2 51.6 58.0 36.9 45.3 62.7 74.1 62.1
Table 2: Comparison of detection performance (mean Average Precision, mAP) of the RFBNet object detector (trained on different variants of the training set) on the Youtube-Objects-Subset dataset.
Youtube-Objects dataset
  Proposal Only [24]: 51.5 | Proposal + Transfer [24]: 55.3 | Teh et al. [30]: 56.7 | Chanda et al. [3]: 61.9 | Proposed: 58.8 | Proposed (Conditioned): 61.6
Youtube-Objects-Subset dataset
  Proposal Only [24]: 36.0 | Proposal + Transfer [24]: 41.6 | Teh et al. [30]: 48.5 | Chanda et al. [3]: 51.1 | Proposed: 50.8 | Proposed (Conditioned): 52.1
Table 3: Comparison of CorLoc on the Youtube-Objects and Youtube-Objects-Subset datasets. 'Proposed' refers to Faster R-CNN models trained on CycleGAN and class-conditioned CycleGAN transformed PASCAL VOC datasets.

5.2.3 Dataset adaptation or regularization?

Fig. 3 might tempt one to believe that the proposed model succeeds simply by adding noise to the training data, thereby making training harder. In Fig. 6 we show some examples from the original VOC and YTO datasets. Indeed, objects in YTO suffer from defocusing, blurriness, and color desaturation. Our framework learns this distribution difference automatically instead of relying on hand-crafted dataset augmentation techniques. In fact, in our initial experiments, we augmented the original VOC with Gaussian noise of standard deviation 0.01, 0.05, 0.1, 0.5 and 1. The mAPs of Faster R-CNN trained on these augmented datasets and tested on YTO were 56.4, 57.1, 57.2, 50.3 and 44.1, respectively. We also tried Gaussian blurring of VOC images with kernel sizes of 5, 9 and 13 at a standard deviation of 2; the mAPs on YTO were 56.4, 57.0 and 56.8. With a standard deviation of 4, the mAPs were 56.3, 57.2 and 55.9. Thus, it is safe to say that simple dataset augmentations do not help in the current domain adaptation problem.
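The noise baseline in the ablation above is a one-liner; a sketch of the augmentation we compare against (`gaussian_noise_augment` is an illustrative name, not the paper's code):

```python
import numpy as np

def gaussian_noise_augment(img, sigma, seed=0):
    """Additive Gaussian noise with the standard deviations tried in the
    ablation (0.01, 0.05, 0.1, 0.5, 1.0), clipped back to valid intensities."""
    rng = np.random.default_rng(seed)
    return np.clip(img + rng.normal(0.0, sigma, size=img.shape), 0.0, 1.0)

img = np.full((16, 16), 0.5)                    # stands in for a VOC image in [0, 1]
noisy = gaussian_noise_augment(img, sigma=0.05)
```

Unlike the learned CycleGAN perturbation, such hand-picked noise is domain-agnostic, which is consistent with the flat-to-worse mAPs reported above.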

5.3 Object Detection

5.3.1 Evaluation Metrics

A commonly used metric for evaluating object detection is mean Average Precision (mAP) [7]. According to the Pascal criterion, a detected bounding box, b_d, is considered a true positive if its Intersection over Union (IoU) with an annotated box, b_gt, is at least 0.5; otherwise it is considered a false positive. mAP is defined as the mean of the average precision over all classes. Another popular measure is CorLoc [6], given by the fraction of images in which an object is correctly localized according to the Pascal criterion. CorLoc can thus be interpreted as the proportion of detected bounding boxes which satisfy the Pascal criterion.
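The Pascal criterion reduces to a few lines of code; a minimal sketch with corner-format boxes (the helper names are ours):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union

def is_true_positive(detected, annotated, threshold=0.5):
    """Pascal criterion: a detection counts as a true positive if IoU >= 0.5."""
    return iou(detected, annotated) >= threshold
```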
Special care on the Youtube-Objects-Subset dataset: The annotations released for this dataset consist of pixel-wise segmentation maps instead of bounding boxes, as in Youtube-Objects. We convert the segmentation maps to bounding boxes by enclosing each segmentation map in the smallest possible rectangle. We also followed the conversion strategy proposed in [30]: converting a detected bounding box to a segmentation map using the GrabCut algorithm [25] (available at https://docs.opencv.org/trunk/d8/d83/tutorial_py_grabcut.html). Under this formulation, IoU is measured with respect to the overlap of the segmentation maps. The numerical results under both paradigms are almost comparable and thus, following [30], we stick to the latter.
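The segmentation-to-box conversion described above is a tight enclosing rectangle over the mask's foreground pixels; a sketch (`mask_to_bbox` is our illustrative helper):

```python
import numpy as np

def mask_to_bbox(mask):
    """Smallest axis-aligned box (x1, y1, x2, y2) enclosing a binary mask."""
    ys, xs = np.nonzero(mask)               # coordinates of foreground pixels
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

mask = np.zeros((10, 10), dtype=bool)
mask[2:5, 3:8] = True                       # object occupies rows 2..4, columns 3..7
box = mask_to_bbox(mask)
```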

5.3.2 Comparing Baselines

As a baseline, we train standard object detectors on the PASCAL VOC dataset and test on video frames without any adaptation; this gives us the lower bound on performance. The upper bound is obtained by training the detectors on annotated video frames. To show the robustness of the approach, we report the effect of domain adaptation using two different object detectors, viz., Faster R-CNN and RFBNet300, both with a VGG16 backbone. In Table 1 we compare the performance of Faster R-CNN trained on different versions of the training set and tested on the Youtube-Objects (YTO) test set. In Table 2 we report performance with the RFBNet300 network tested on Youtube-Objects-Subset. For both detectors, we see an appreciable boost in performance when training on the CycleGAN-transformed image dataset compared to the original VOC dataset. Of course, training on labelled YTO frames still gives the upper bound of performance, but our proposed method reduces the performance gap appreciably. It is also to be noted that training on ForwardGAN output deteriorates performance to below that of training on the original VOC. Tables 1 and 2 strongly bolster our hypothesis that visual domain adaptation is a viable approach for cross-domain learning of object detectors with close to zero supervision in the unlabeled domain. Next, in Table 3, we compare our model with some contemporary weakly supervised baselines. In [24], 'proposal only' refers to learning an appearance model based on object proposals on weakly annotated frames, while 'proposal + transfer' refers to a combination of an appearance model for 'novel' objects (video) and a transferred appearance model from 'familiar' annotated objects (static images).
The method of Chanda et al. [3] is based on a 2-stream network: in one stream they perform regular fully supervised image object detection, while in the other they perform frame-level classification on the weakly annotated videos. The two streams share parameters to counter domain shift factors. Teh et al. [30] proposed an attention network to make the scores of a region proposal network on weakly annotated objects mimic those of a strong fully supervised object detector. Please note that all these competing methods assume the presence of meta-information, such as the presence/absence of object categories, for each training video frame, whereas we assume no information is associated with the video training dataset. It is evident from Table 3 that our model achieves competitive performance (better in the majority of cases) compared to these methods.

5.4 Can YTO class labels help CycleGAN?

One of the drawbacks of CycleGAN (see the discussion at https://github.com/junyanz/CycleGAN#failure-cases) is that it is tough to train when the two domains differ structurally. In our case, this bottleneck arises because, without a priori knowledge of labels, a given mini-batch can contain different object categories. To mitigate this, one can train a separate CycleGAN conditioned on each category in YTO. This requires weak label information for each frame. So, during offline CycleGAN training, we train 10 different networks to individually transform each category. However, we still need to train only a single object detector on the conglomerated transformed dataset. Such use of weak labels further boosts the performance of our framework, as reported in Tables 1, 2 and 3. This observation indicates that class-conditioned CycleGANs are better at capturing the appearance diversities across the two datasets.

6 Conclusion

In this paper, we focused on unsupervised pixel-level domain adaptation for transferring image object detectors to videos. In contrast to the contemporary trend of weakly supervised learning on videos, which still requires manual intervention, our framework requires no supervision. The core idea is to pose the problem as adversarial image-to-image translation, converting annotated static images to be visually indistinguishable from video frames. We also showed that the inclusion of class labels on videos improves our framework further. A straightforward application of our model is to automatically annotate large video datasets for object detection. Currently, our method detects objects on standalone video frames; an immediate extension would be to leverage temporal information in videos for enhanced detection performance. More generally, the ideas of this paper should encourage researchers towards other interesting visual domain adaptation applications, such as emotion recognition from 3D face avatars, learning pose estimation in a virtual world, robotic navigation in simulated environments, and, finally, application in real-world frameworks.


  • [1] H. Bilen, M. Pedersoli, and T. Tuytelaars. Weakly supervised object detection with posterior regularization. In Proceedings BMVC 2014, pages 1–12, 2014.
  • [2] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR, volume 1, page 7, 2017.
  • [3] O. Chanda, E. W. Teh, M. Rochan, Z. Guo, and Y. Wang. Adapting object detectors from images to weakly labeled videos. In The 28th British Machine Vision Conference (BMVC).
  • [4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.
  • [5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
  • [6] T. Deselaers, B. Alexe, and V. Ferrari. Weakly supervised localization and learning with generic knowledge. International journal of computer vision, 100(3):275–293, 2012.
  • [7] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
  • [8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence, 32(9):1627–1645, 2010.
  • [9] R. Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
  • [10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
  • [11] R. Gokberk Cinbis, J. Verbeek, and C. Schmid. Multi-fold mil training for weakly supervised object localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2409–2416, 2014.
  • [12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [14] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • [15] V. Kalogeiton, V. Ferrari, and C. Schmid. Analysing domain shift factors between videos and images for object detection. IEEE Trans. Pattern Anal. Mach. Intell., 38(11):2327–2334, 2016.
  • [16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [17] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pages 1189–1197, 2010.
  • [18] C. Leistner, M. Godec, S. Schulter, A. Saffari, M. Werlberger, and H. Bischof. Improving classifiers with unlabeled weakly-related videos. In CVPR, 2011.
  • [19] C. Li and M. Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. In European Conference on Computer Vision, pages 702–716. Springer, 2016.
  • [20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [21] S. Liu, D. Huang, and Y. Wang. Receptive field block net for accurate and fast object detection. arXiv preprint arXiv:1711.07767, 2017.
  • [22] T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of exemplar-svms for object detection and beyond. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 89–96. IEEE, 2011.
  • [23] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari. Learning object class detectors from weakly annotated video. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3282–3289. IEEE, 2012.
  • [24] M. Rochan and Y. Wang. Weakly supervised localization of novel objects using appearance transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4315–4324, 2015.
  • [25] C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. In ACM transactions on graphics (TOG), volume 23, pages 309–314. ACM, 2004.
  • [26] P. Sharma and R. Nevatia. Efficient detector adaptation for object detection in a video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3254–3261, 2013.
  • [27] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 3, page 6, 2017.
  • [28] H. O. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, and T. Darrell. On learning to localize objects with minimal supervision. arXiv preprint arXiv:1403.1024, 2014.
  • [29] K. Tang, R. Sukthankar, J. Yagnik, and L. Fei-Fei. Discriminative segment annotation in weakly labeled video. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2483–2490, 2013.
  • [30] E. W. Teh, Z. Guo, and Y. Wang. Object localization in weakly labeled data using regularized attention networks. In Visual Communications and Image Processing (VCIP), 2017 IEEE, pages 1–4. IEEE, 2017.
  • [31] P. Wohlhart and V. Lepetit. Learning descriptors for object recognition and 3d pose estimation. In CVPR, pages 3109–3118, 2015.
  • [32] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.