While humans seem to learn from minimal supervision, existing machine learning techniques often require a tremendous amount of labeled data. Although huge progress has been made in the field, with recent weakly-supervised training methods Arun et al. (2019, to appear) coming close to the results achieved by fully-supervised approaches Singh et al. (2018), the size, quality, and availability of labeled data are becoming a major bottleneck. One way to break free from this limitation is the self-supervised learning paradigm. Examples of self-supervised learning are reinforcement learning Nachum et al. (2018) and pretext-based learning such as image colorization Goyal et al. (2019), yet these methods require a reward function or introduce rare tasks.
The increasing amount of online videos brings several opportunities for training deep neural networks in the self-supervised regime. Large-scale video data sets such as YouTube-8M Abu-El-Haija et al. (2016) and the How2 data set Sanabria et al. (2018) can be leveraged for this purpose. In this paper, we explore new ways for self-supervised learning, tackling the challenging task of visual object detection. To this end, we exploit the How2 data set by taking advantage of the multi-modal information it provides (video and automatic closed captions). Fig. 1 describes the targeted problem in this paper.
Given unlabeled training videos, the audio channel can be used as a "free" source of weak labels, allowing a convolutional network to learn objects and scenes. For instance, by seeing and hearing many frames where the word "guitar" is mentioned, it should be possible to detect the guitar due to its shared characteristics over different frames. Yet, self-supervised learning from the videos themselves is quite hard when performed in the wild, as the audio and the visual contents may often appear completely unrelated. Nevertheless, the results in Section 4 show that our approach is able to fairly successfully reduce the level of this source of noise, detecting frames that contain a desired object, and localizing the objects in the relevant frames – all this in hard scenarios of large variation in object appearance, typical motion blur in video frames and in the presence of strong label noise.
We pose the self-supervised object detection problem as a noisy, weakly-labeled learning task. In order to ground a certain category of object, we consider the large corpus of How2 Sanabria et al. (2018) instructional videos, covering a wide variety of topics across 13,000 clips (about 300 hours total duration) with word-level time-aligned subtitles. We extract candidate frames from the time intervals corresponding to the subtitles that contain the object's name (synced with the speech mentioning the name of the object). This set of frames comprises our positive set, yet it is noisily labeled, as the object might not appear in all of the selected frames. Creating a negative set (that most likely lacks the object of interest) allows discriminating between the object and the background, essentially solving the object detection problem. The task is similar to weakly supervised object detection (WSOD), yet is distinct from it in the high level of label noise.
The key contribution of this paper is three-fold: Firstly, we introduce a methodology capable of detecting objects, learned from continuous videos without any manual labeling. Secondly, we pose the object detection problem as a noisy weakly-labeled learning task and propose a novel training scheme for accomplishing it. Lastly, our scheme incorporates a novel cluster scoring that distills the detected regions separating them from the surrounding clutter.
2 Related Work
Multimodal learning. Synchronization between the visual and audio or text modalities as a source for self-supervised learning has been studied before Harwath et al. (2016); Suris et al. (2018). In Harwath et al. (2016), the authors suggest a method to learn the correspondence between audio captions and images for the task of image retrieval. Cross-modal learning in Suris et al. (2018) is used to create links between audio and video frames, in order to retrieve audio samples that fit a given silent video. Another class of works suggests training a model on a "pretext" task for which the ground truth is available for free, such as solving a jigsaw puzzle or image colorization Goyal et al. (2019); Zhang et al. (2016). In this paper we address a different problem of self-supervised object detection from unlabeled and unconstrained videos.
Audio-visual alignment. These approaches target learning from a large number of unlabeled videos Arandjelovic and Zisserman (2017); Gao et al. (2018); Korbar et al. (2018); Song et al. (2016); Sun et al. (2016); Yu and Siskind (2013); Zhao et al. (2018), capitalizing on the natural synchronization of the visual and audio modalities to learn models that discover audio-visual correspondence. In particular, Arandjelovic and Zisserman (2017); Korbar et al. (2018); Zhao et al. (2018) use video and sound (not speech) to discover the relevant audio track for a certain region in a frame. In Arandjelovic and Zisserman (2017) the authors also suggest object localization (in both modalities), yet through low-resolution activation heat maps and not as detection. The computer vision and NLP communities have begun to leverage deep learning to create multimodal models of images and text. Grounded language learning from video has been studied in Song et al. (2016); Sun et al. (2016); Yu and Siskind (2013) by learning joint representations of text sentences and visual cues from videos. For instance, in Yu and Siskind (2013) words and objects are related using a trained object detector from an external source, and the method is therefore not self-supervised. Sun et al. (2016) target a different problem of speech recognition, and Song et al. (2016) study the correspondence between words and concepts in human actions. The work in Naim et al. (2014) presents an unsupervised alignment method for natural language instructions in videos, with the specific goal of automatically aligning the video segments to the corresponding protocol sentences, tracking hands in the video and detecting the blobs touched. The closest work to our study is Harwath and Glass (2017), which attempts to map word-like acoustic units in continuous speech to semantically relevant regions in the image. However, this method is not self-supervised, as the captions are manually created for the image samples.
Weakly-supervised object detection. Weakly supervised learning, and weakly supervised object detection in particular, has attracted high interest due to the reduced annotation labour involved in creating the weak labels. Specifically, weakly supervised object detection (WSOD) Bilen and Vedaldi (2016); Jie et al. (2017); Tang et al. (2018); Wan et al. (2018) uses only image-level annotations to train object detectors. While weakly supervised methods accept well curated and "clean" labels (correct labels for all the images in the training set), our problem involves a "noisy" labeled data set, where many of our selected and self-labeled frames are falsely labeled. In this study we suggest a novel approach for the weakly and noisily labeled object detection problem. We compare our method to the PCL weakly supervised object detection method Tang et al. (2018) as a baseline under noisy labels, showing superior results.
Noisy labels. Many methods for learning with noisy labels target the training of deep neural networks on large-scale weakly-supervised web images, which are crawled from the internet using text queries, without any human annotation Guo et al. (2018); Zhuang et al. (2017). Although our selected frames are expected to have fairly clean labels by construction, they still contain considerable noise, e.g., many extracted frames lack the object in the scene. Deep learning with noisy labels is practically challenging, as the capacity of deep models is so high that they can totally overfit the noisy data Guo et al. (2018); Han et al. (2018); Zhuang et al. (2017). Handling extreme noise levels for image classification has been shown in Han et al. (2018) (tested on MNIST and CIFAR) for noise levels of up to 50%. However, our problem involves a higher noise level of up to 68%, for the more challenging task of detection with weak labels.
The proposed self-supervised scheme is described in Fig. 2, with the pipeline summarized in Algorithm 1. Our input comprises a large set of unlabeled videos with speech transcription from the How2 instructional video corpus Sanabria et al. (2018). In the following, we detail the key steps of our method.
Extraction of positive key frames. For a given object name, say guitar, we extract a single key frame from the center of each temporal interval in which the object was mentioned. While this is not ideal, it works fairly well for selecting frames that contain the object. This selection approach further drives us toward relevant videos. We dub these selected images our (noisy) positive set. We also construct a balanced negative set containing frames randomly selected from disparate videos in which the object was not mentioned. These frames will most likely be without the object of interest, but will include elements contained in the surroundings of our object instances in the positive frames, such as faces, hands, tables and chairs.
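The key-frame selection can be illustrated with a minimal sketch. It assumes the time-aligned subtitles are available as (word, start, end) tuples in seconds, which is a hypothetical format and not necessarily how the How2 release stores its alignment; the function name and the case-insensitive, punctuation-stripped matching are likewise illustrative choices.

```python
import string

def keyframe_times(aligned_words, object_name):
    """Return the midpoint (seconds) of each interval whose word matches object_name.

    aligned_words: iterable of (word, start_sec, end_sec) tuples (assumed format).
    """
    target = object_name.lower()
    times = []
    for word, start, end in aligned_words:
        # Normalize case and strip surrounding punctuation before matching.
        token = word.lower().strip(string.punctuation)
        if token == target:
            # One key frame per mention, taken at the center of the interval.
            times.append((start + end) / 2.0)
    return times

subtitles = [("this", 0.0, 0.2), ("Guitar,", 1.0, 1.5), ("is", 1.5, 1.6), ("guitar", 4.0, 4.4)]
print(keyframe_times(subtitles, "guitar"))
```

Each returned timestamp would then be used to grab one frame from the video; frames for the negative set would be drawn from videos where this list is empty.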
For each image (positive and negative) we extract region proposals using Selective Search Uijlings et al. (2013). Regions are labeled as positive or negative according to the corresponding frame label (similar to the bag and instance labels in the multi-instance learning paradigm Maron and Lozano-Pérez (1997)). Using a pre-trained backbone such as the Inception-ResNet-v2 CNN Szegedy et al. (2017) or a trained auto-encoder, we map each candidate region to a feature space.
Potential score. The purpose of our learning approach is to find a common theme across positive regions that is less likely to exist in negative counterparts. To this end, we cluster the regions in the embedded space. Clusters with a dense population of positive regions are likely to contain the object of interest. We therefore associate a positive-ratio score with each cluster, defined as the ratio between the number of positive samples and the total number of samples in the cluster (note that regions are labeled according to their corresponding frame). Yet, high positive-ratio clusters are noisy, so real object clusters are not always distinguishable. Specifically, we search for a target cluster satisfying the following properties: (1) a high positive ratio $P_k$; (2) a low cluster variance $V_k$, for a tendency to include a single object type; and (3) cluster members that come from a wide variety of videos $U_k$, since we expect the object to have common characteristics among various videos. The latter property also copes with the high temporal correlation in a single video, which may create dense clusters. We formalize these constraints using the following softmax function, to which we refer as the potential score, i.e., the score of cluster $k$ containing the object:

$$S_k = \sigma\left(\tau \, \frac{P_k^2 \, \log U_k}{V_k}\right), \quad k = 1, \dots, K \qquad (1)$$

Here, $\sigma(\cdot)$ is the softmax function, $K$ denotes the total number of clusters, $\tau$ is the softmax temperature, $P_k$ is the positive ratio (according to the raw weak labels, since the ground truth labels are not accessible), $V_k$ is the cluster distance variance, and $U_k$ denotes the number of unique videos. All parameters are normalized to unit sum. Our observations showed the following order of importance among the potential score components: the positive ratio $P$, the cluster variance $V$, and lastly the number of unique videos $U$. For this reason $P$ is squared and we take the $\log$ of $U$.
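To make the scoring concrete, the following is a minimal sketch of a potential score with the stated ingredients: a squared positive ratio, a damped count of unique videos, a penalty on cluster variance, and a softmax with temperature over all clusters. The way the three terms are combined (product over variance), the use of log(1 + count) before normalization, and the default temperature are assumptions made for illustration, not the authors' exact implementation.

```python
import math

def unit_sum(values):
    """Scale a list of non-negative numbers so that it sums to one."""
    total = sum(values)
    return [v / total for v in values]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def potential_scores(pos_ratio, variance, unique_videos, tau=1.0):
    """Score each cluster; higher means more likely to contain the object."""
    p = unit_sum(pos_ratio)
    v = unit_sum(variance)
    # Assumption: damp the video count with log(1 + count), then normalize.
    u = unit_sum([math.log1p(n) for n in unique_videos])
    # Squared positive ratio dominates; low variance and many videos raise the logit.
    logits = [tau * (pk ** 2) * uk / vk for pk, vk, uk in zip(p, v, u)]
    return softmax(logits)

scores = potential_scores(
    pos_ratio=[0.9, 0.5, 0.4],
    variance=[1.0, 2.0, 3.0],
    unique_videos=[40, 10, 5],
    tau=10.0,
)
print(scores)
```

In this toy input the first cluster has the highest positive ratio, the lowest variance and the most videos, so it receives the largest score; raising `tau` sharpens the distribution further.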
Deep embedded clustering. Following feature extraction, we cluster our region proposals using a variation of deep embedded clustering (DEC) Xie et al. (2016). Following Xie et al. (2016), we use a weighted Student's t-distribution as the similarity measure (2) between each region embedding and each cluster centroid, indexed by sample and cluster, respectively. Here and in the following, we drop the frame index for simplicity. The newly added term acts as selective weights, which we set according to the region label. We use the new measure in (2) to drive the clustering toward the target distribution using the Kullback-Leibler divergence loss (see Xie et al. (2016) for more details). This weighted DEC focuses the clustering toward positive regions. In practice, we apply the weighting for clusters with a positive ratio above a threshold. We further re-initialize our DEC by weighted K-means at a fixed interval of epochs, with the new weights normalized by the number of positive samples in the cluster, defined by DSD (see Fig. 2). The clusters, potential score and DSD are refined iteratively. Only cluster centroids are optimized, while the embeddings remain fixed.
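As an illustration of the weighted soft assignment, the sketch below starts from the standard DEC Student's t kernel and target distribution of Xie et al. (2016) and adds a per-pair weight. The specific weighting rule (down-weighting negative regions in clusters whose positive ratio exceeds a threshold), the `neg_weight` value and all function names are assumptions for illustration; the exact weights used in the method are not reproduced here.

```python
def soft_assignments(embeddings, centroids, labels, heavy_clusters,
                     neg_weight=0.5, alpha=1.0):
    """Weighted Student's t soft assignment of each region to each cluster.

    labels: 1 for regions from positive frames, 0 for negative frames.
    heavy_clusters: set of cluster indices with a high positive ratio (assumed given).
    """
    q = []
    for z, y in zip(embeddings, labels):
        row = []
        for j, mu in enumerate(centroids):
            dist2 = sum((a - b) ** 2 for a, b in zip(z, mu))
            kernel = (1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0)
            # Assumed rule: negative regions count less in positive-heavy clusters.
            w = neg_weight if (y == 0 and j in heavy_clusters) else 1.0
            row.append(w * kernel)
        total = sum(row)
        q.append([r / total for r in row])
    return q

def target_distribution(q):
    """Sharpened DEC target: p_ij proportional to q_ij^2 / sum_i q_ij."""
    n_clusters = len(q[0])
    freq = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        sharpened = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        total = sum(sharpened)
        p.append([s / total for s in sharpened])
    return p

centroids = [[0.0, 0.0], [4.0, 0.0]]
embeddings = [[1.0, 0.0], [1.0, 0.0]]  # same location, different frame labels
q = soft_assignments(embeddings, centroids, labels=[1, 0], heavy_clusters={0})
p = target_distribution(q)
print(q, p)
```

With identical embeddings, the positive region is assigned to the positive-heavy cluster more strongly than the negative one, which is the intended effect of focusing the clustering on positive regions. Training would then minimize the KL divergence between `p` and `q` with respect to the centroids only.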
Dense subgraph discovery. A frequent shortcoming of weakly-supervised approaches is their inability to distinguish between candidates with high and low object overlap. However, for training a high-performance object detector, regions with tight spatial coverage of the object are required, i.e., high Intersection-over-Union (IoU). To address this issue, we use the Dense Subgraph Discovery (DSD) algorithm Jie et al. (2017). This model defines an undirected unweighted graph for a set of region proposals in a given image. The nodes correspond to region proposals and the edges are formed by connecting each proposal (node) to its multiple neighbors, which have mutual IoU larger than a pre-defined threshold. For our use case, we found that simply extracting the top 10% of the most connected nodes works well. Unlike Jie et al. (2017), we further make use of the remaining regions as "hard negative" examples.
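The simplified selection rule described here (keep the most connected proposals in the IoU graph) can be sketched as follows. Boxes are assumed to be (x1, y1, x2, y2) tuples; the IoU threshold value, the rounding of the 10% cut-off and the function names are illustrative assumptions rather than the settings used in the experiments.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def most_connected(boxes, iou_thresh=0.5, keep_frac=0.1):
    """Split proposals of one image into (selected, rejected) index lists.

    An edge links two proposals whose IoU exceeds iou_thresh; the proposals
    with the highest degree are selected, the rest can serve as hard negatives.
    """
    n = len(boxes)
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if iou(boxes[i], boxes[j]) > iou_thresh:
                degree[i] += 1
                degree[j] += 1
    k = max(1, int(round(keep_frac * n)))  # keep at least one proposal
    order = sorted(range(n), key=lambda i: degree[i], reverse=True)
    return order[:k], order[k:]

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (8, 8, 48, 48), (200, 200, 240, 240)]
selected, rejected = most_connected(boxes, iou_thresh=0.7, keep_frac=0.25)
print(selected, rejected)
```

In this toy example the first box overlaps strongly with two others and is therefore the only one selected; the remaining boxes are returned separately so they can be reused as hard negatives.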
Sampling and training of the detector. Each cluster is assigned a potential score as defined in (1). This score is likely to correlate with cluster purity, i.e., the ratio of regions in a cluster that contain instances of the object. We then train a detector fed by the following samples: for positive samples, we consider the regions selected by DSD and sample regions with a high potential score, using the normalized score as our sampling distribution. Note that sample scores are associated with their corresponding cluster, and this sampling strategy allows sampling from several clusters. This sampling regime continuously reduces the noise level in the positive set, which is necessary to reach a high-accuracy detector (a region classifier). Negative samples are sampled uniformly from the negative frames and are combined with the rejected regions from DSD (used as hard negatives). Our detector is a multilayer perceptron with three fully connected layers, trained to separate object from background using the cross-entropy loss. In every training cycle, we initialize the detector training with the weights from the previous iteration.
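A minimal sketch of the score-driven positive sampling is given below. It assumes each candidate region carries the index of its cluster and that cluster scores are already computed; sampling with replacement, the fixed seed and the function name are illustrative assumptions.

```python
import random

def sample_positive_regions(region_clusters, cluster_scores, n_samples, seed=0):
    """Draw region indices with probability proportional to their cluster's score.

    region_clusters: cluster index of each candidate region (e.g. DSD-selected).
    cluster_scores: potential score of each cluster.
    """
    rng = random.Random(seed)
    weights = [cluster_scores[c] for c in region_clusters]
    total = sum(weights)
    probs = [w / total for w in weights]  # normalized sampling distribution
    # Sampling with replacement; regions from several clusters can be drawn.
    return rng.choices(range(len(region_clusters)), weights=probs, k=n_samples)

region_clusters = [0, 0, 1, 2]
cluster_scores = [0.7, 0.3, 0.0]
picked = sample_positive_regions(region_clusters, cluster_scores, n_samples=1000)
print({i: picked.count(i) for i in range(len(region_clusters))})
```

Regions in high-scoring clusters are drawn more often, while regions in lower-scoring clusters still appear occasionally, so the positive set is not restricted to a single cluster.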
Data set. Our evaluation is based on the How2 data set Sanabria et al. (2018), which includes 300 hours of instructional videos with synchronized closed captions in English. Processing the caption text of the whole data set, we extract all object nouns and choose the 11 top-referenced objects, avoiding objects with dual verb and noun meanings (such as "chip", which could be a "poker chip" or a "chip shot" in golf) and objects with extreme appearance variability (such as bag or hat). However, we include in our corpus challenging nouns such as "Bike", which corresponds to both "Bicycle" and "Motorbike", gun, which refers to pistol, rifle and glue gun, or even cup, which corresponds to a glass or porcelain cup, a paper cup or a measuring cup. This results in a total of 5,120 frames from the videos, with an average of 465 frames per object. Our transcript-based frame selection introduces 52% noisily labeled frames on average (i.e., only 48% of frames are correctly labeled and contain the object of interest), leaving 2,457 frames with true weak labels. The statistics of our data set and the Selective Search (SS) region proposal recall are shown in Table 1. For validation of our self-supervised scheme, we manually annotated every object instance in the corresponding selected frames and performed detection separately for each object category.
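Table 1: Data set statistics per object category (11 columns); the last column gives the total number of frames, the overall noise level and the mean SS recall.
| No. of frames | 419 | 583 | 1243 | 457 | 442 | 182 | 597 | 351 | 341 | 200 | 305 | 5120 |
| Noise level % | 28.4 | 54.4 | 59.5 | 53.4 | 35.1 | 61.5 | 47.1 | 68.4 | 67.7 | 58.5 | 22.6 | 51.5 |
| SS recall % | 85 | 85 | 97 | 98 | 94 | 95 | 99 | 94 | 97 | 83 | 83 | 91.8 |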
Baseline and upper bound. As our baseline, we opted for the PCL weakly-supervised object detection method Tang et al. (2018), currently achieving the highest mAP of 19.6% for weakly supervised object detection on ImageNet-Det. We used the PCL code from the authors' GitHub. Similarly to our self-supervised scheme, we fed PCL with the noisy positive labels. For each object category, PCL was trained on two classes: the positive class with noisy labels, and the negative class labeled as background. As an upper bound reference, we report the performance of the fully-supervised version of our method, where the detector (object region classifier) is trained with ground truth labels. These comparisons also emphasize the challenge of learning object detection from unconstrained videos manifesting motion blur, extreme views (far-field and close-ups), and large occlusions. Some examples are shown in Fig. 3.
Evaluation. We randomly split our data into 80%-20% train-test sets containing mutually exclusive frames, and evaluate the performance on randomized folds. The training and test sets contain on average 372 and 93 frames per object, respectively. Since our task is self-supervised, we allowed frames from the same video to participate in both the train and test sets. For quantitative evaluation, we manually annotated the bounding boxes of the object categories. Note that annotations were used only for testing and were not available at training, to keep the method annotation-free. As the evaluation criterion, we use the standard detection mean average precision (mAP) with different thresholds on bounding box overlap, measured as intersection-over-union (IoU). Training and testing were performed for each object category on the selected frames (i.e., the noisy positive set), addressing the problem of self-supervised learning. Note that since the positive set is noisy (see Table 1), the evaluation was also applied to frames without the objects (background).
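For reference, the following sketch computes average precision for one object category at a given IoU threshold. It uses greedy matching of score-ranked detections to ground-truth boxes and an all-point interpolated precision-recall area, which is one common convention and an assumption here; the data layout (frame identifiers, tuples) is hypothetical. Frames without the object simply have an empty ground-truth list, so any detection on them counts as a false positive.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(detections, ground_truth, iou_thresh=0.5):
    """detections: list of (frame_id, score, box); ground_truth: frame_id -> list of boxes."""
    n_gt = sum(len(boxes) for boxes in ground_truth.values())
    matched = {f: [False] * len(boxes) for f, boxes in ground_truth.items()}
    flags = []  # True for a true positive, False for a false positive
    for frame_id, _, box in sorted(detections, key=lambda d: d[1], reverse=True):
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(ground_truth.get(frame_id, [])):
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx >= 0 and best_iou >= iou_thresh and not matched[frame_id][best_idx]:
            matched[frame_id][best_idx] = True  # each ground-truth box matches once
            flags.append(True)
        else:
            flags.append(False)
    precisions, recalls, tp = [], [], 0
    for rank, flag in enumerate(flags, start=1):
        tp += int(flag)
        precisions.append(tp / rank)
        recalls.append(tp / n_gt if n_gt else 0.0)
    # Interpolate: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

gt = {"a": [(0, 0, 10, 10)], "b": []}  # frame "b" is background only
dets = [("b", 0.9, (0, 0, 5, 5)), ("a", 0.8, (0, 0, 10, 10))]
print(average_precision(dets, gt, iou_thresh=0.5))
```

Averaging this quantity over object categories gives the mAP; evaluating it at thresholds of 0.5 and 0.3 corresponds to the two IoU settings reported in the results.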
Results. To the best of our knowledge, this is the first self-supervised object detection method trained and evaluated using standard evaluation practices. The results for our benchmarks are summarized in Table 2, for IoU over 0.5 and 0.3. Our method attains an overall mAP of 14.4% and 23.9% at IoU 0.5 and 0.3, respectively. We observe higher mAP for rather large objects such as Bike, Drum, Horse and Tire, which reach on average 22.6% and 31.5% mAP at IoU 0.5 and 0.3, respectively. These objects are also associated with a lower label-noise level of 37.9% on average (see Table 1). Smaller objects such as Cup, Gun, Plate and Scissors yield lower mAP of 7.7% and 14.7% at IoU 0.5 and 0.3, respectively. However, these objects are also associated with a higher average noise level of 60.5%. For several objects such as Dog, Gun and Plate, we observe nearly double the mAP at IoU over 0.3, showing that our self-supervised detector still succeeds in localizing these objects fairly well. Fig. 3 shows several detection examples on the test set.
Applying the weakly-supervised PCL yields inferior mAP on all objects except "Plate", due to the lack of robustness of PCL to label noise. In fact, we faced convergence problems in PCL due to the noisy labels, with no successful convergence in any test fold for Bike and Drum. The low performance of the weakly-supervised PCL in the Gun case can be related to the small object size, high object variability (see examples in Fig. 3), and high noise level of 61.5%. Yet, our self-supervised method performed favorably, reaching 33% mAP at IoU 0.3 (compared to 4.1% for PCL). Overall, for the 9 objects successfully learned by PCL, PCL obtained 6.7% mAP compared to our 12.2% at IoU 0.5, and 14.4% against 22.1% at IoU 0.3, both in favor of our approach.
As the upper bound, we present the results from our trained region classifier (see Fig. 2) when fed with true region labels (starting with the SS Recall as in Table 1). Although we are still far from this extreme labeling scenario, we believe that our results and labeled data can motivate others to tackle the challenging task of self-supervised detection.
Cluster analysis. In this section we demonstrate the effectiveness of our potential score function (1) in producing semantically meaningful clusters. In our self-supervised approach, common features that are unique to the positive set are learned. Often these common themes include "side" objects or certain scenery. In Fig. 4, we show an example of such a case for Gun and Guitar. In the How2 data set, videos addressing gun handling mention the word gun most frequently in their captions. These videos most likely show the operation of weapons outdoors. Interestingly, ranking our clusters according to the potential score yields objects in secondary clusters (ranked below the maximum score), such as disk shelves for Guitar and sand for Gun, that are semantically related to the detected objects.
Implementation details. For clustering, we use , and set in (1). We set the positive ratio threshold as
. In our region classifier we use 3 FC layers (1024, 1024, 2) with a ReLU activation in layers 1-2 and a softmax activation for the output layer. Dropout is used for the two hidden layers with a probability of 0.8. The classifier is trained with the cross-entropy loss function. We use ADAM for optimization, and the learning rate is decreased by a factor of 0.6 every 6 epochs. We train our model for 35 epochs for all objects. All experiments were done on a Tesla K80 GPU. After initial feature extraction, a single epoch (DEC, DSD and detector training) takes around 15 minutes, amounting to nearly 9 hours per object.
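The step-wise learning rate decay described above can be written as a one-line schedule. In this sketch the initial learning rate is a hypothetical argument supplied by the caller, and epochs are assumed to be counted from zero.

```python
def step_decay_lr(base_lr, epoch, factor=0.6, step=6):
    """Learning rate after multiplying by `factor` once every `step` epochs."""
    return base_lr * factor ** (epoch // step)

# Schedule over the 35 training epochs, for an illustrative base rate of 1.0.
schedule = [step_decay_lr(1.0, epoch) for epoch in range(35)]
print(schedule[0], schedule[6], schedule[34])
```

Over 35 epochs the rate is reduced five times, so the final epochs run at 0.6 to the power of 5 (under 8%) of the initial value.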
We have presented a model for the challenging task of self-supervised object detection from unlabeled videos. Considering a large corpus of instructional videos with closed captions, we select frames that correspond to the transcript where the object name is mentioned. We pose the problem as weakly and noisily labeled supervised learning. Our object detection is based on a model that captures regions with a common theme across the selected frames, distinguished from frames from disparate videos. This new region-level approach shows promising results in the detection of objects with high appearance variability and multiple sub-classes arising from language ambiguities. We evaluate our method in terms of detection mean average precision, show an upper bound on performance, and compare favorably with a top-performing weakly-supervised approach. Our method handles noisy labels in the weak setting, and is capable of detecting objects in challenging scenarios without any human labeling.
- Abu-El-Haija et al. (2016) Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Apostol (Paul) Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. In arXiv:1609.08675, 2016.
- Arandjelovic and Zisserman (2017) Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In ICCV, 2017.
- Arun et al. (2019, to appear) Aditya Arun, C. V. Jawahar, and M. Pawan Kumar. Dissimilarity coefficient based weakly supervised object detection. In CVPR, 2019, to appear.
- Bilen and Vedaldi (2016) Hakan Bilen and Andrea Vedaldi. Weakly supervised deep detection networks. In CVPR, 2016.
- Gao et al. (2018) Ruohan Gao, Rogerio Feris, and Kristen Grauman. Learning to separate object sounds by watching unlabeled video. In CVPR, 2018.
- Goyal et al. (2019) Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. In arXiv:1905.01235, 2019. URL https://arxiv.org/abs/1905.01235.
- Guo et al. (2018) Sheng Guo, Weilin Huang, Haozhi Zhang, Chenfan Zhuang, Dengke Dong, Matthew R. Scott, and Dinglong Huang. Curriculumnet: Weakly supervised learning from large-scale web images. In ECCV, 2018.
- Han et al. (2018) Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, 2018.
- Harwath and Glass (2017) David Harwath and James Glass. Learning word-like units from joint audio-visual analysis. In ACL, 2017.
- Harwath et al. (2016) David Harwath, Antonio Torralba, and James R. Glass. Unsupervised learning of spoken language with visual context. In NIPS, 2016.
- Jie et al. (2017) Zequn Jie, Yunchao Wei, Xiaojie Jin, Jiashi Feng, and Wei Liu. Deep self-taught learning for weakly supervised object localization. In CVPR, 2017.
- Korbar et al. (2018) Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative learning of audio and video models from self-supervised synchronization. In NeurIPS, 2018.
- Maron and Lozano-Pérez (1997) Oded Maron and Tomás Lozano-Pérez. A framework for multiple-instance learning. In NIPS, pages 570–576, 1997.
- Nachum et al. (2018) Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. In NeurIPS, 2018.
- Naim et al. (2014) Iftekhar Naim, Young Chol Song, Qiguang Liu, Henry Kautz, Jiebo Luo, and Daniel Gildea. Unsupervised alignment of natural language instructions with video segments. In AAAI, 2014.
- Sanabria et al. (2018) Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. How2: a large-scale dataset for multimodal language understanding. In Proceedings of the Workshop on Visually Grounded Interaction and Language (ViGIL). NeurIPS, 2018.
- Singh et al. (2018) Bharat Singh, Mahyar Najibi, and Larry S. Davis. SNIPER: efficient multi-scale training. In NeurIPS, 2018.
- Song et al. (2016) Young Chol Song, Iftekhar Naim, Abdullah Al Mamun, Kaustubh Kulkarni, Parag Singla, Jiebo Luo, Daniel Gildea, and Henry Kautz. Unsupervised alignment of actions in video with text descriptions. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI'16, pages 2025–2031. AAAI Press, 2016. ISBN 978-1-57735-770-4.
- Sun et al. (2016) Felix Sun, David Harwath, and James Glass. Look, listen, and decode: Multimodal speech recognition with images. In IEEE Spoken Language Technology Workshop, 2016.
- Suris et al. (2018) Didac Suris, Amanda Duarte, Amaia Salvador, Jordi Torres, and Xavier Giro i Nieto. Cross-modal embeddings for video and audio retrieval. In arXiv:1801.02200, 2018.
- Szegedy et al. (2017) Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 4278–4284, 2017.
- Tang et al. (2018) Peng Tang, Xinggang Wang, Song Bai, Wei Shen, Xiang Bai, Wenyu Liu, and Alan Loddon Yuille. PCL: Proposal cluster learning for weakly supervised object detection. IEEE transactions on pattern analysis and machine intelligence, 2018.
- Uijlings et al. (2013) J. R. Uijlings, K. E. Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. Int. J. Comput. Vision, 104(2):154–171, September 2013. ISSN 0920-5691. doi: 10.1007/s11263-013-0620-5.
- Wan et al. (2018) Fang Wan, Pengxu Wei, Zhenjun Han, Jianbin Jiao, and Qixiang Ye. Min-entropy latent model for weakly supervised object detection. In CVPR, 2018.
- Xie et al. (2016) Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages 478–487, 2016.
- Yu and Siskind (2013) Haonan Yu and Jeffrey Mark Siskind. Grounded language learning from video described with sentences. In ACL, 2013.
- Zhang et al. (2016) R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, 2016.
- Zhao et al. (2018) Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba. The sound of pixels. In ECCV, 2018.
- Zhuang et al. (2017) Bohan Zhuang, Lingqiao Liu, Yao Li, Chunhua Shen, and Ian Reid. Attend in groups: a weakly-supervised deep learning framework for learning from web data. In CVPR, 2017.