1 Introduction
Unsupervised learning is one of the most difficult and interesting problems in computer vision and machine learning today. Many researchers believe that learning from large collections of unlabeled videos could help answer hard questions regarding the nature of intelligence and learning. Moreover, since unlabeled videos are easy to collect at relatively low cost, unsupervised learning could be of real practical value in many computer vision and robotics applications. In this article we propose a novel approach to unsupervised learning that successfully tackles many of the challenges associated with this task. We present a system composed of two main pathways: one that performs unsupervised object discovery in videos or large image collections along the teacher branch, and another, the student branch, which learns from the teacher to detect foreground objects in single images. Our approach is general in the sense that neither the student nor the teacher pathway depends on a specific neural network architecture or implementation. It also allows the unsupervised learning process to continue over several generations of students and teachers. In Algorithm 1 we present a high-level description of our method. Throughout the paper we use the terms "generation" and "iteration" of Algorithm 1 interchangeably. A preliminary version of this work, without the possibility of learning over several generations and with fewer experimental results, appeared at ICCV 2017 (Croitoru et al (2017)).
In Figure 1 we present a graphic overview of our full system. In the unsupervised training stage the student network (module A) learns, frame by frame, from an unsupervised teacher pathway (modules B and C) to produce similar object masks in single images. The student branch tries to imitate, for each frame, the output of the teacher, while having as input only a single image - the current frame. The teacher, on the other hand, has access to an entire video sequence. The method presented in Algorithm 1 follows the main steps of the system as it learns from one iteration (generation) to the next. These steps are discussed in more detail in Section 3.
During the first iteration of Algorithm 1, the unsupervised teacher pathway has access to information over time - a video. In contrast, the student is deeper in structure, but it has access only to a single image - the current video frame. Thus, the information discovered by the teacher over time is captured by the student in added depth, over neural layers of abstraction. Several student nets with different architectures are trained at the first iteration. In order to use only good quality masks as supervisory signal, an unsupervised mask selection procedure is applied, as explained in Section 4. Once several student nets are trained, their output is combined to form the teacher at the next iteration. Then, at the next generation, we run the newly formed teacher on a larger set of unlabeled videos, to produce the supervisory signal for the next generation of students. Note that while at the first iteration the teacher pathway is required to receive video sequences as input, from the second generation on it could receive large image collections as well. Due to the very high computational and storage costs required during training, we limit our experiments to learning over two generations, but our algorithm is general and could run over many iterations. We show in extensive experiments that even two generations are sufficient to significantly outperform the current state of the art on object discovery in video and images. We also demonstrate a solid improvement from one generation to the next. Now we enumerate the main contributions of our approach:

1) We introduce a novel approach to unsupervised learning from videos to detect foreground objects in images. An overview of our system and algorithm is presented in Figure 1 and Algorithm 1. The system has two main pathways: one that acts as a teacher and discovers objects in videos or large collections of images, and another that acts as a student and learns from the teacher to detect the foreground objects in single input images. We provide a general algorithm for unsupervised learning over several generations of students and teachers. We experiment with different types of student nets and show how they collectively work together to form the teacher at the next generation. This is done in conjunction with a novel unsupervised soft-mask selection scheme. We demonstrate experimentally that within a generation the students are more powerful than their teachers, while both pathways improve significantly from one generation to the next.
2) At the higher level, our proposed algorithm is sufficiently general to accommodate different implementations and neural network architectures. In this paper, we also provide a specific implementation, which we describe in detail. We demonstrate its performance on three recent datasets, namely YouTube Objects (Prest et al (2012)), Object Discovery in Internet Images (Rubinstein et al (2013)) and Pascal-S (Li et al (2014)), on which we obtain state of the art results. To the best of our knowledge, this is the first system that learns to detect and segment foreground objects in images in an unsupervised fashion, with no pre-trained features or manual labeling, while requiring only a single image at test time.
2 Scientific context
The literature on unsupervised learning follows two main directions. 1) One is to learn powerful features in an unsupervised way and then use them for transfer learning, within a supervised scheme and in combination with different classifiers, such as SVMs or CNNs (
Radenović et al (2016); Misra et al (2016); Li et al (2016)). 2) The second direction is to discover, at test time, common patterns in unlabeled data, using clustering, feature matching or data mining formulations (Jain et al (1999); Cho et al (2015); Sivic et al (2005)). Belonging to the first category and closely related to our work, the approach in Pathak et al (2017)
proposes a system in which a deep neural network learns to produce soft object masks from an unsupervised module that uses optical flow cues in video. The deep features learned in this manner are then applied to several transfer learning tasks. Different from their work, we provide a more general approach that could learn in an unsupervised manner over several generations. From an experimental point of view, while
Pathak et al (2017) tests their work on a supervised transfer learning task, we evaluate ours on specific unsupervised foreground object detection and segmentation tasks and demonstrate state of the art performance, often by a large margin. Recently, researchers have started to use the natural spatial and temporal structure in images and videos as supervisory signal, in unsupervised learning approaches that are considered to follow a self-supervised learning paradigm. Methods that fall into this category include those that learn to estimate the relative patch positions in images (
Doersch et al (2015)), predict color channels (Larsson et al (2016)), solve jigsaw puzzles (Noroozi and Favaro (2016)) and inpaint (Pathak et al (2016)). One trend is to use as supervisory signal spatial and appearance information collected from raw single images. In such single-image cases the amount of information that can be learned is limited to a single moment in time, as opposed to the case of learning from video sequences. Using unlabeled videos as input is more closely related to our work and includes learning to predict the temporal order of frames (
Lee et al (2017)), generate the future frame (Finn et al (2016); Xue et al (2016); Goroshin et al (2015)) or learn from optical flow (Wang and Gupta (2015b)). For most of these papers, the unsupervised learning scheme is only an intermediate step to train features that are eventually used on classic supervised learning tasks, such as object classification, object detection or action recognition. Such pre-trained features perform better than randomly initialized ones, as they contain valuable semantic information implicit in the natural structure of the world used as supervisory signal. In our work, we focus mostly on specific unsupervised tasks, on which we perform extensive evaluations, but we also show some results on transfer learning experiments.
The second main approach to unsupervised learning includes methods for image co-segmentation (Joulin et al (2010); Kim et al (2011); Rubinstein et al (2013); Joulin et al (2012); Kuettel et al (2012); Vicente et al (2011); Rubio et al (2012); Leordeanu et al (2012)) and weakly supervised localization (Deselaers et al (2012); Nguyen et al (2009); Siva et al (2013)). Earlier methods are based on local feature matching and detection of their co-occurrence patterns (Stretcu and Leordeanu (2015); Sivic et al (2005); Leordeanu et al (2005); Parikh and Chen (2007); Liu and Chen (2007)), while more recent ones (Joulin et al (2014); Rochan and Wang (2014)) discover object tubes by linking candidate bounding boxes between frames with or without refining their location. Traditionally, the task of unsupervised learning from image sequences has been formulated as a feature matching or data clustering optimization problem, which is computationally very expensive due to its combinatorial nature.
There are also other papers (Lee et al (2011); Cheng et al (2017); Dutt Jain et al (2017); Tokmakov et al (2017)
) that tackle unsupervised learning tasks but are not fully unsupervised, using powerful features that are pre-trained in supervised fashion on large datasets, such as ImageNet (
Russakovsky et al (2015)) or VOC2012 (Everingham et al (2015)). Such works take advantage of the rich source of supervised information learned from other datasets, through features trained to respond to general object properties over tens or hundreds of object categories. With respect to the end goal, our work is more related to the second research direction, on unsupervised discovery in video. However, unlike that research, we do not discover objects at test time, but during the unsupervised training process, when the student pathway learns to detect foreground objects. Therefore, from the learning perspective, our work is more related to the first research direction, based on self-supervised training.
3 Overall approach
We propose a genuine unsupervised learning algorithm for foreground object detection that offers the possibility to improve over several iterations. Our method combines, in complementary ways, multiple modules that are well suited for this task. It starts with a teacher pathway that discovers objects in unlabeled videos and produces a soft mask of the foreground object in each frame. The resulting soft-masks of lower quality are then filtered out automatically. Next, the remaining ones are passed to a student ConvNet, which learns to predict object masks in single images. Once several student nets of different architectures are trained, they form a new teacher for the next generation and the whole process is repeated. At the next iteration we bring in more unlabeled data, we learn in an unsupervised fashion a better data selection mechanism and ultimately train more powerful student networks. In Algorithm 1 we enumerate concisely the main steps of our approach.
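To make the structure of this loop concrete, the following Python sketch shows one possible way to organize the generations. All function names here (train_student, make_ensemble_teacher, collect_more_data, etc.) are placeholders we introduce for illustration; they are not part of any released code, and the real implementation may organize the steps differently.

```python
from typing import Callable, List, Sequence

def unsupervised_generations(
    initial_teacher: Callable,        # e.g. VideoPCA at iteration 1 (module B)
    initial_selector: Callable,       # e.g. the mean-of-nonzero-pixels filter (module C)
    train_student: Callable,          # trains one student net on selected masks (module A)
    train_mask_evaluator: Callable,   # trains a mask evaluator (EvalSeg-Net) from the students
    make_ensemble_teacher: Callable,  # turns the trained students into the next teacher
    collect_more_data: Callable,      # enlarges the unlabeled training set
    architectures: Sequence[str],
    unlabeled_data: List,
    num_generations: int = 2,
):
    """Sketch of the outer loop of Algorithm 1; all components are injected."""
    teacher, selector, data = initial_teacher, initial_selector, unlabeled_data
    students = []
    for _ in range(num_generations):
        soft_masks = teacher(data)                                      # Step 1: discover objects
        selected = selector(soft_masks)                                 # Step 2: filter soft-masks
        students = [train_student(a, selected) for a in architectures]  # Step 3: train students
        selector = train_mask_evaluator(students, data)                 # Step 4: new module C
        teacher = make_ensemble_teacher(students)                       #         new module B
        data = data + collect_more_data()                               # Step 5: more unlabeled data
    return students
```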
Now we present the main algorithm in more detail. At Step 1 we start with an object discoverer in video sequences. There are several methods for object discovery in video available in the literature, with good performance (Borji et al (2012); Cheng et al (2015); Barnich and Van Droogenbroeck (2011)). We chose the VideoPCA algorithm introduced as part of the system in Stretcu and Leordeanu (2015) because it is very fast (50-100 fps), uses very simple features (individual pixel colors) and is completely unsupervised, with no use of supervised pre-trained features. It learns how to separate the foreground from the background by exploiting the spatio-temporal consistency in appearance, shape, movement and location of objects, common in video shots, along with the contrasting properties, in size, shape, motion and location, between the main object and the background scene. Note that it would be much harder, at this first stage, to discover objects in collections of unrelated images, where there is no smooth variation in shape, appearance and location over time. Only at the second iteration of the algorithm is the simpler VideoPCA replaced with a more powerful ensemble of student nets, which is able to discover objects in collections of images as well.
The teacher branch produces soft foreground masks, one per frame, which are not always of good quality. Thus, at Step 2, during the first iteration, we use a simple and effective way to filter out poor masks. Only at the second iteration are we able to learn a more powerful soft-mask selector (see Section 4.2.1). The soft-masks that pass the filtering phase are then used (Algorithm 1, Step 3) to train the student pathway. As we want the student branch to learn general visual properties of objects in images, we limit its access to a single input image.
Our approach offers the possibility of improving performance by training a next generation of object detectors. In experiments, we found three key aspects that are effective at improving generalization at the next iteration: 1) we need to train several student nets (at module A), preferably of different architectures, which are stronger in combination than separately; they then become the teacher (module B) at the next iteration; 2) we train, also in an unsupervised fashion, a better soft-mask selector (module C); 3) it is preferable to increase the unlabeled training set at the next iteration, for improved generalization.
Having access to the complete training set at the very first iteration could be useful, but it is not optimal. At that stage, the teacher is still weak and imposes a limit on how much can be learned from the data, no matter how large that data is. Getting access to a larger unlabeled training dataset is more effective at the second iteration, when the teacher pathway is significantly stronger. The idea of gradually increasing the complexity of the training set is also related to curriculum learning (Bengio et al (2009)), in which we start with simpler cases and then add more difficult ones. Increasing the strength of the teacher pathway improves the quality of the supervisory signal, while introducing more unlabeled data increases variety. Both act together to improve generalization.
4 System architecture
We now detail the architecture and training process of our system, module by module, as seen in Figure 1. We first present the student pathway (module A in Figure 1), which takes as input an individual image (e.g. the current frame in the video) and learns to predict foreground soft-masks from an unsupervised teacher. The teacher pathway (represented by modules B and C in Figure 1) is explained in detail in Section 4.2.

4.1 Student path: single-image segmentation
The student processing pathway (module A in Figure 1) consists of a deep convolutional network. We test different neural network architectures, some of which are commonly used in the recent literature on semantic image segmentation. We create a small pool of relatively diverse architectures, presented next.
The first convolutional network architecture for semantic segmentation that we test is based on a more traditional CNN design. We term it LowRes-Net (see Figure 2) due to its low resolution soft-mask output. It has ten layers (seven convolutional, two pooling and one fully connected) and skip connections. Skip connections have proved to offer a boost in performance, as shown in the literature (Raiko et al (2012); Pinheiro et al (2016)), and we observed a similar improvement in our experiments. The LowRes-Net takes as input an RGB image (along with its hue, saturation and derivatives w.r.t. x and y) and produces a soft segmentation of the main objects present in the image. Because LowRes-Net has a fully connected layer at the top, we reduced the output resolution of the soft-segmentation mask to limit the memory cost. While the derivatives w.r.t. x and y are in principle not needed (as they could be learned by appropriate filters during training), in our tests explicitly providing the derivatives along with HSV, together with the skip connections, noticeably boosted the accuracy. The LowRes-Net has a total of 78M parameters, most of them in the last, fully connected layer.
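As an illustration of the input described above, the sketch below builds the stacked input channels (RGB, hue, saturation and x/y derivatives) with OpenCV. The exact input resolution, normalization and derivative filter are our assumptions, not the published configuration.

```python
import cv2
import numpy as np

def lowres_net_input(image_bgr, size=128):
    # Stack the channels LowRes-Net receives: RGB, hue, saturation and the
    # x / y image derivatives (size and Sobel-based derivatives are assumptions).
    img = cv2.resize(image_bgr, (size, size))
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    dx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)   # derivative w.r.t. x
    dy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)   # derivative w.r.t. y
    hue = hsv[..., 0:1] / 179.0                       # OpenCV hue range is [0, 179]
    sat = hsv[..., 1:2] / 255.0
    return np.concatenate([rgb, hue, sat, dx[..., None], dy[..., None]], axis=2)
```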
The second CNN architecture tested, termed FConv-Net, is fully convolutional (Long et al (2015)), as also presented in Figure 2. It has a higher resolution output of 128x128, with input size 256x256. Its main structure is derived from the basic LowRes-Net model. Different from LowRes-Net, it is missing the fully connected layer at the end and has more parameters in the convolutional layers, for a total of 13M parameters.
We also tested three different nets based on the U-Net (Ronneberger et al (2015)) architecture, which has proved very effective in the semantic segmentation literature. Our U-Net variants are: 1) BasicU-Net, 2) DilateU-Net - similar to BasicU-Net but using atrous (dilated) convolutions (Yu and Koltun (2015)) in the center module, and 3) DenseU-Net - with dense connections in the down and up modules (Jégou et al (2017)).
The BasicU-Net has 5 down modules with 2 convolutional layers each, with 32, 64, 128, 256 and 512 feature maps, respectively. In the center module the BasicU-Net has two convolutional layers with 1024 feature maps each. The up modules have 3 convolutional layers and the same number of feature maps as the corresponding down modules. The only difference between BasicU-Net and DilateU-Net is that the latter replaces the center module with 6 atrous convolutions with 512 feature maps each. DenseU-Net has 4 down modules with 4 corresponding up modules. Each down and up module has 4 convolutions with skip connections (as presented in Figure 2). The modules have 12, 24, 48 and 64 feature maps, respectively. The transition layer is a convolution whose role is to reduce the number of feature maps output by each module. The BasicU-Net has 34M parameters, while the DilateU-Net has 18M parameters. DenseU-Net has only 3M parameters, but uses skip connections inside the up and down blocks in order to make up for the difference in the number of parameters. All three U-Nets take a 256x256 input and produce an output of the same resolution. All networks use ReLU activation functions. Please see Figure 2 for more specific details regarding the architectures of the different models.
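For concreteness, the following is a minimal tf.keras sketch in the spirit of BasicU-Net, using the feature map counts listed above. Padding, the exact pooling/upsampling operations and the sigmoid output are our assumptions and may differ from the actual implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, n_convs=2):
    # A down/up module: a stack of 3x3 ReLU convolutions.
    for _ in range(n_convs):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def basic_unet(input_shape=(256, 256, 3), down_filters=(32, 64, 128, 256, 512)):
    inputs = layers.Input(shape=input_shape)
    x, skips = inputs, []
    for f in down_filters:                      # 5 down modules: 2 convs + max-pool each
        x = conv_block(x, f)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    x = conv_block(x, 1024)                     # center module: 2 convs with 1024 maps
    for f, skip in zip(reversed(down_filters), reversed(skips)):
        x = layers.UpSampling2D(2)(x)           # up modules mirror the down modules
        x = layers.Concatenate()([x, skip])
        x = conv_block(x, f, n_convs=3)
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)  # soft-mask output in [0, 1]
    return tf.keras.Model(inputs, outputs)
```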
Given the current setup, the student nets do not learn to identify specific object classes. They learn to softly segment the main foreground objects present, regardless of their particular category. The main difference in their performance lies in their ability to produce fine object segmentations. While the LowRes-Net tends to provide good support for estimating the object's bounding box due to its simpler output, the other ConvNets (especially the U-Nets), with higher resolution outputs, are better at finely segmenting objects. Because the particular models make mistakes in different ways, they are always stronger when forming an ensemble. In experiments we also show that they outperform their teacher and are able to detect objects from categories that were not seen during training.
4.1.1 Student networks ensemble
The pool of student networks with different architectures produces varied results that differ qualitatively. While the bounding boxes computed from their soft-masks have similar accuracy, the actual soft-segmentation outputs look different. The networks have different strengths and make different kinds of mistakes. This observation immediately suggests that they should be stronger in combination, so we experimented with combining them into an ensemble. We propose two types of ensembles.
The first one, termed Multi-Net, outputs a soft-mask that is obtained by multiplying pixel-wise the soft-masks produced by each individual student net. Thus, only positive pixels, on which all nets agree, survive to the final segmentation. Multi-Net offers robust masks of significantly higher quality. In Section 4.2.1 we show how Multi-Net can be effectively used to learn in an unsupervised fashion, a network (EvalSeg-Net) for evaluating the goodness of a specific segmentation. That network is an important part of the next generation teacher pathway and replaces module C at the next iteration.
The second approach to forming an ensemble is to use EvalSeg-Net to select the best soft-mask from the pool of masks generated by the student nets. We term this ensemble system MultiSelect-Net. Quantitatively, MultiSelect-Net and Multi-Net perform similarly, but Multi-Net tends to produce fuzzier masks due to the multiplication of the students' soft-masks.
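Both ensembles are simple to express. The sketch below assumes the student soft-masks are aligned arrays with values in [0, 1] and that the EvalSeg-Net scores for the candidate masks have already been computed.

```python
import numpy as np

def multi_net_mask(student_masks):
    # Multi-Net: pixel-wise product of the students' soft-masks, so only pixels
    # on which all nets agree survive to the final segmentation.
    return np.prod(np.stack(student_masks, axis=0), axis=0)

def multiselect_net_mask(student_masks, evalseg_scores):
    # MultiSelect-Net: keep the single student mask rated best by EvalSeg-Net.
    best = int(np.argmax(evalseg_scores))
    return student_masks[best]
```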
4.1.2 Training the student ConvNets
We treat foreground object segmentation as a multidimensional regression problem, where the soft mask given by the unsupervised video segmentation system acts as the desired output. Let $I$ be the input RGB image (a video frame) and $Y$ be the corresponding 0-255 valued soft segmentation given by the unsupervised teacher for that particular frame. The goal of our network is to predict a soft segmentation mask $\hat{Y}$ of width $w$ and height $h$ ($w = h = 128$ for the fully convolutional architecture, $w = h = 256$ for the U-Net architectures, and a lower resolution for the basic LowRes-Net architecture), that approximates as well as possible the mask $Y$. For each pixel in the output image, we predict a 0-255 value, so that the total difference between $\hat{Y}$ and $Y$ is minimized. Thus, given a set of $N$ training examples, let $I_i$ be the $i$-th input image (a video frame), $\hat{Y}_i$ the predicted output mask for $I_i$, $Y_i$ the soft segmentation mask (corresponding to $I_i$) and $\mathbf{w}$ the network parameters. $Y_i$ is produced by the video discoverer after processing the video that $I_i$ belongs to. Then, our loss is:
$$L(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \sum_{p=1}^{w \cdot h} \left( \hat{Y}_i^{(p)} - Y_i^{(p)} \right)^2 \qquad (1)$$

where $\hat{Y}_i^{(p)}$ and $Y_i^{(p)}$ denote the $p$-th pixel of $\hat{Y}_i$ and $Y_i$, respectively.
We observed that in our tests, the L2 loss performed better than the cross-entropy loss, due to the fact that the soft-masks used as labels have real values, not discrete ones. Also, they are not perfect, so the idea of thresholding them for training does not perform as well as directly predicting their real values. We train our network using the Tensorflow (
Abadi et al (2015)) framework with the Adam optimizer (Kingma and Ba (2014)). All models are trained end-to-end using a fixed learning rate of 0.001 for 10 epochs. The training time for any given model is about 3-5 days on a Nvidia GeForce GTX 1080 GPU, for the first iteration and about 2 weeks for the second iteration students.
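A minimal sketch of this training objective and optimizer setup in tf.keras is given below. It assumes the soft-masks have been rescaled to [0, 1] to match a sigmoid output, and it omits data loading and augmentation.

```python
import tensorflow as tf

def soft_mask_loss(y_true, y_pred):
    # Pixel-wise L2 loss between the teacher's soft-mask and the prediction (Eq. 1).
    return tf.reduce_mean(tf.square(y_pred - y_true))

def compile_student(model):
    # Adam with the fixed learning rate reported above (0.001).
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss=soft_mask_loss)
    return model
```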
Post-processing. The student CNN outputs a soft mask. In order to compare our models fairly with other methods, we use two different post-processing steps: 1) bounding box fitting and 2) segmentation refinement. For fitting a box around the soft mask, we first up-sample the output to the original size of the image, then threshold the mask (with a threshold validated on a small subset), determine the connected components and fit a tight box around each component. We perform segmentation refinement (step 2) in a single case, on the Object Discovery in Internet Images dataset, as specified in the experiments section. For that, we use the OpenCV implementation of GrabCut (Rother et al (2004)) to refine our soft mask, up-sampled to the original size. In all other tests we use the original output of the networks.
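The sketch below illustrates both post-processing steps with OpenCV. The threshold values are assumptions on our part (the paper validates its threshold on a small subset), and the soft-mask is assumed to be in [0, 1].

```python
import cv2
import numpy as np

def boxes_from_soft_mask(soft_mask, image_shape, threshold=0.5):
    # Up-sample the soft-mask to the original image size, binarize it and fit a
    # tight box around each connected component.
    h, w = image_shape[:2]
    mask = cv2.resize(soft_mask.astype(np.float32), (w, h))
    binary = (mask >= threshold).astype(np.uint8)
    num_labels, labels = cv2.connectedComponents(binary)
    boxes = []
    for label in range(1, num_labels):           # label 0 is the background
        ys, xs = np.where(labels == label)
        boxes.append((int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())))
    return boxes

def refine_with_grabcut(image_bgr, soft_mask, threshold=0.5, iterations=5):
    # Initialize GrabCut from the soft-mask and return the refined binary mask.
    mask = np.where(soft_mask >= threshold, cv2.GC_PR_FGD, cv2.GC_PR_BGD).astype(np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, None, bgd_model, fgd_model, iterations, cv2.GC_INIT_WITH_MASK)
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)
```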
4.2 Teacher path: unsupervised discovery in video
There are several methods
available for discovering objects and salient regions in images and videos (Borji et al (2012); Cheng et al (2015); Hou and Zhang (2007); Jiang et al (2013); Cucchiara et al (2003); Barnich and Van Droogenbroeck (2011))
with reasonably good performance. More recent methods for foreground object discovery, such as Papazoglou and Ferrari (2013), are both relatively fast and accurate, with a runtime of around 4 seconds per frame (see Table 6). However, that runtime is still prohibitive for training the student CNNs, which require millions of images. For that reason we used at the first generation (Iteration 1 of Algorithm 1), for module B in Figure 1, the VideoPCA algorithm, which is a part of the whole system introduced in Stretcu and Leordeanu (2015). It has lower accuracy than the full system, but it is much faster, running at 50-100 fps. At this speed we can produce one million unsupervised soft segmentations in a reasonable time of about 5-6 hours.
VideoPCA.
The main idea behind VideoPCA is to model the background in video frames with Principal Component Analysis. It finds initial foreground regions as parts of the frames that are not reconstructed well by the PCA model. Foreground objects are smaller than the background, have contrasting appearance and more complex movements. They can be seen as outliers within the larger background scene, which makes them less likely to be captured well by the first PCA components. Thus, for each frame, an initial soft-mask is produced from an error image, which is the difference between the original image and its PCA reconstruction. These error images are first smoothed with a large Gaussian filter and then thresholded. The binary masks obtained are used to learn color models of foreground and background, based on which individual pixels are classified as belonging to the foreground or not. The object masks obtained are further multiplied with a large centered Gaussian, based on the assumption that foreground objects are often closer to the image center. These are the final masks used in our system. For more technical details, the reader is invited to consult Stretcu and Leordeanu (2015). In this work, we use the method exactly as found online (https://sites.google.com/site/multipleframesmatching/), without any parameter tuning.
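For intuition, here is a simplified sketch of only the initial VideoPCA stage (the PCA reconstruction error image). The foreground/background color models and the centered Gaussian prior described above are omitted, and the number of components and blur width are our assumptions.

```python
import cv2
import numpy as np
from sklearn.decomposition import PCA

def videopca_initial_soft_masks(frames, n_components=8, blur_sigma=15):
    # frames: list of HxWx3 uint8 images from a single video shot.
    h, w = frames[0].shape[:2]
    X = np.stack([f.reshape(-1) for f in frames]).astype(np.float32) / 255.0
    pca = PCA(n_components=n_components).fit(X)          # background model
    recon = pca.inverse_transform(pca.transform(X))      # PCA reconstruction
    masks = []
    for x, r in zip(X, recon):
        err = np.abs(x - r).reshape(h, w, 3).mean(axis=2)   # reconstruction error image
        err = cv2.GaussianBlur(err, (0, 0), blur_sigma)      # smooth with a large Gaussian
        masks.append(err / (err.max() + 1e-8))               # normalized initial soft-mask
    return masks
```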
Teacher pathway at the next generation: At the next iteration of Algorithm 1, VideoPCA (in module B) is replaced by the student nets trained at the previous iteration, in the following way. While we could use as the new module B either of the two ensembles, Multi-Net or MultiSelect-Net, we preferred a simpler and more efficient approach. For each unlabeled training image we run all student nets and obtain multiple soft-masks, without combining them into a single output per image. The new module B is therefore the collection of all student nets acting in parallel. Their soft-masks are then filtered independently (using a given threshold) by the new module C in Figure 1, which at the second iteration is represented by EvalSeg-Net. Note that in this manner it is possible to obtain one, several or no soft segmentations for a given training image. This approach is fast and offers the advantage of processing data in parallel over multiple GPUs, without having to wait for all student nets to finish for every input image. As our experiments demonstrate, the approach is also effective, with significantly better results at the second generation.
4.2.1 Unsupervised soft masks selection


The performance of the student nets is influenced by the quality of the soft masks provided as labels by the teacher branch. The cleaner the masks, the better the chances the student has to learn to segment objects well in images. VideoPCA tends to produce good results if the object present in the video stands out well against the background scene, in terms of motion and appearance. However, if the object is occluded at some point, does not move w.r.t. the scene or has an appearance similar to its background, the resulting soft masks may be poor. In the first generation, we used a simple measure of mask quality to select only the good soft-masks for training the student pathway, based on the following observation: when VideoPCA masks are close to the ground truth, the average of their nonzero values is usually high. Thus, when the discoverer is confident, it is more likely to be right. The average value of the non-zero pixels in the soft mask is therefore used as a quality score for each segmented frame. Only masks of sufficient quality according to this indicator are selected and used for training the student nets. This represents module C in Figure 1 at the first generation of Algorithm 1. While effective at iteration 1, the simple average value over all pixels cannot capture the goodness of a segmentation at the higher level of overall shape. At subsequent iterations we therefore explore new ways to improve it.
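A minimal sketch of this first-generation selection rule follows: the score is the mean of the non-zero soft-mask values, and only the highest-scoring masks are kept (the 10% keep fraction used here matches the top-10-percentile policy mentioned later in this section; the mask value range is an assumption).

```python
import numpy as np

def mask_confidence(soft_mask):
    # Average of the non-zero soft-mask values; higher means more confident.
    nonzero = soft_mask[soft_mask > 0]
    return float(nonzero.mean()) if nonzero.size else 0.0

def select_top_percentile(soft_masks, keep_fraction=0.10):
    # Keep only the masks whose confidence is in the top percentile.
    scores = np.array([mask_confidence(m) for m in soft_masks])
    threshold = np.quantile(scores, 1.0 - keep_fraction)
    return [m for m, s in zip(soft_masks, scores) if s >= threshold]
```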
Consequently, at subsequent iterations we propose an unsupervised way of training a network, EvalSeg-Net, to estimate segmentation quality. As mentioned previously, Multi-Net provides masks of higher quality, as it cancels out errors of the individual student nets. Thus, we use the cosine similarity between a given individual segmentation and the ensemble Multi-Net mask as a measure of segmentation "goodness". Having this unsupervised segmentation score, we train the EvalSeg-Net deep network to predict it. As previously mentioned, this net acts as an automatic mask evaluation procedure, which in subsequent iterations becomes module C in Figure 1, replacing the simple mask average value used at Iteration 1. Only masks that pass a certain threshold are used for training the student path.
The architecture of EvalSeg-Net is similar to LowRes-Net (Figure 2), with the differences that the input channel containing image derivatives is replaced by the actual soft-segmentation that requires evaluation, and that it has no skip connections. Also, after the last fully connected layer (of size 512) we add a final one-neuron layer that predicts the segmentation quality score, a single real-valued number.
Let $I$ be an input RGB image and $M$ an input soft-mask, and let $P = \prod_k S_k(I)$ be the output of our Multi-Net, where $S_k(I)$ denotes the output of student network $k$. We treat the segmentation "goodness" evaluation task as a regression problem in which we want to predict the cosine similarity between $M$ and $P$. So, our loss for EvalSeg-Net is defined as follows:

$$L_{eval}(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \left( E(I_i, M_i) - \cos(M_i, P_i) \right)^2 \qquad (2)$$

where $N$ represents the number of training examples, $E(I_i, M_i)$ represents the output of EvalSeg-Net for image $I_i$ and soft mask $M_i$, and $\cos(M_i, P_i)$ is the cosine similarity between $M_i$ and the Multi-Net output $P_i$.
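In code, the regression target and loss can be sketched as follows (TensorFlow, batched masks flattened per example; the small eps constant is an assumption for numerical stability):

```python
import tensorflow as tf

def cosine_similarity(mask_a, mask_b, eps=1e-8):
    # Flatten each mask in the batch and compute the cosine similarity per example.
    a = tf.reshape(mask_a, [tf.shape(mask_a)[0], -1])
    b = tf.reshape(mask_b, [tf.shape(mask_b)[0], -1])
    dot = tf.reduce_sum(a * b, axis=1)
    return dot / (tf.norm(a, axis=1) * tf.norm(b, axis=1) + eps)

def evalseg_loss(predicted_score, individual_mask, ensemble_mask):
    # Regress the EvalSeg-Net output onto the cosine similarity between an
    # individual student mask and the Multi-Net ensemble mask (Eq. 2).
    target = cosine_similarity(individual_mask, ensemble_mask)
    return tf.reduce_mean(tf.square(predicted_score - target))
```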
Given a certain metric for segmentation evaluation (depending on the learning iteration), we keep only the soft masks above a threshold for each dataset (e.g. VID (Russakovsky et al (2015)), YTO (Prest et al (2012)), YouTube Bounding Boxes (Real et al (2017))). In the first iteration this threshold was obtained by sorting the VideoPCA soft-masks by their score and keeping only the top 10 percentile, while at the second iteration we validate a threshold on a small dataset and select each mask independently by applying this threshold to the single-value output of EvalSeg-Net.
Mask selection evaluation. In Figure 3 we present the dependency of segmentation performance w.r.t. ground truth object boxes (used only for evaluation) on the percentile of masks kept after the automatic selection, for both generations. We notice a strong correlation between the percentage of frames kept and the quality of segmentations. It is also evident that EvalSeg-Net is vastly superior to the simpler procedure used at iteration 1. EvalSeg-Net is able to correctly evaluate soft segmentations even in more complex cases (see Figure 4).
Even though we can expect to improve the quality of the unsupervised masks by drastically pruning them (e.g. keeping a smaller percentage), the fewer masks we are left with, the less training data we get, increasing the chance of overfitting. We make up for the loss in training data by augmenting the set of training masks and by enlarging the actual unlabeled training set at the second generation. There is a trade-off between the level of selectivity and training data size: the more selective we are about which masks we accept for training, the more videos we need to collect and process through the teacher pathway to obtain a sufficient amount of training data.
Data augmentation.
A drawback of the teacher at the first learning iteration (VideoPCA) is that it can only detect the main object if it is close to the center of the image. The assumption that the foreground is close to the center is often true and indeed helps that method, which has no deep learned knowledge, to produce soft masks with relatively high precision. Not surprisingly, it often fails when the object is not in the center, so its recall is relatively low. Our data augmentation procedure addresses this limitation and can be concisely described as follows: randomly crop patches of the input image covering 80% of the original image and scale the patch up to the expected input size. This produces slightly larger objects at locations that cover the whole image area, not just the center. As experiments show, the student nets are able to see objects at different locations in the image, unlike their raw teacher (VideoPCA at iteration 1), which is strongly biased towards the image center.
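A sketch of this augmentation is given below, assuming the crop covers 80% of each image side and that both the frame and its soft-mask are resized back to the network input size; the exact cropping policy of the original implementation may differ.

```python
import random
import cv2

def random_crop_augment(image, soft_mask, crop_fraction=0.8, out_size=256):
    # Crop a random window covering ~80% of the image and rescale both the
    # image and its soft-mask label to the expected network input size.
    h, w = image.shape[:2]
    ch, cw = int(h * crop_fraction), int(w * crop_fraction)
    y0 = random.randint(0, h - ch)
    x0 = random.randint(0, w - cw)
    img_crop = cv2.resize(image[y0:y0 + ch, x0:x0 + cw], (out_size, out_size))
    mask_crop = cv2.resize(soft_mask[y0:y0 + ch, x0:x0 + cw], (out_size, out_size))
    return img_crop, mask_crop
```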
At the second generation, the teacher branch is significantly better at detecting objects at various locations and scales in the image. Therefore, while artificial data augmentation remains useful (as it is usually the case in deep learning), its importance diminishes at the second iteration of learning (Algorithm 1).
4.3 Implementation pipeline
Now that we have presented in technical detail all major components of our system, we concisely present the actual steps taken in our experiments, in sequential order, and show how they relate to our general Algorithm 1 for unsupervised learning to detect foreground objects in images.
- Run VideoPCA on the input videos from the VID and YouTube Objects datasets (Algorithm 1, Iteration 1, Step 1).
- Select VideoPCA masks using the first generation selection procedure (Algorithm 1, Iteration 1, Step 2).
- Train the first generation student ConvNets on the selected masks, namely LowRes-Net, FConv-Net, BasicU-Net, DilateU-Net and DenseU-Net (Algorithm 1, Iteration 1, Step 3).
- Create the first generation student ensemble Multi-Net by multiplying the outputs of all students and train EvalSeg-Net to predict the similarity between a particular mask and the Multi-Net mask. Create the second ensemble MultiSelect-Net by using EvalSeg-Net in combination with the students' masks (Algorithm 1, Iteration 1, Step 4).
- Add new data from YouTube Bounding Boxes (Algorithm 1, Iteration 1, Step 5).
- Return to Step 1, the teacher pathway: predict multiple soft-masks per input image on the enlarged unlabeled video set, using the student nets from Iteration 1 (module B, Iteration 2), which are then selected with EvalSeg-Net at module C (Algorithm 1, Iteration 2, Step 1).
- Select only sufficiently good masks, as evaluated by EvalSeg-Net (Algorithm 1, Iteration 2, Step 2).
- Train the second generation students on the newly selected masks, using the same architectures as in Iteration 1 (Algorithm 1, Iteration 2, Step 3).
- Create the second generation student ensembles Multi-Net and MultiSelect-Net (Algorithm 1, Iteration 2, Step 4).
The method presented in the introduction (Algorithm 1) is a general algorithm for unsupervised learning from video to detect objects in single images. It presents a sequence of high-level steps followed by the different modules of an unsupervised learning system. The modules are complementary to each other and function in tandem, each focusing on a specific aspect of the unsupervised learning process. Thus, we have a module for generating data, where soft-masks are produced, a module that selects good quality masks, and a module for training the next generation of classifiers. While our concept is first presented in high-level terms, we also present a specific implementation that covers the first two iterations of the algorithm.
While our implementation is costly during training, in terms of storage and computation time, at test time it is very fast - 0.02 sec per student net
and 0.15 sec per student ensemble.
Computation and storage costs. During training, the computation time for passing through the teacher pathway during the first iteration of Algorithm 1 is about 2-3 days: it requires processing data from VID and YTO datasets, including running the VideoPCA module. Afterwards, training the first iteration students, with access to 6 GPUs, takes about 5 days - 6 GPUs are needed for training the 5 different student architectures, since training FConv-Net requires two GPUs in parallel. Next, training the EvalSeg-Net requires 4 additional days on one GPU. At the second iteration, processing the data through the teacher pathway takes about 3 weeks on 6 GPUs in parallel - it is more costly due to the larger training set from which only a small percent (about 10 percent) is selected with EvalSeg-Net. Finally, training the second generation students takes 2 additional weeks. In conclusion, the total computation time required for training, with full access to 6 GPUs is about 7 weeks, when everything is optimized. The total storage cost is about 4TB. At test time the student nets are fast, taking 0.02 sec per image, while the ensemble nets take around 0.15 sec per image.
5 Experimental analysis
In the first set of experiments we evaluate the impact of the different components of our system. We experimentally verify that at each iteration the students perform better than their teachers. Then we test the ability of the system to improve from one generation to the next. We also test the effects of data selection and increasing training data size. Then, we compare the performances of each individual network and their combined ensembles.
In Section 5.2, we compare our algorithm to state of the art methods on object discovery in videos and images. We perform tests on three datasets: YouTube Objects (Prest et al (2012)), Object Detection in Internet images (Rubinstein et al (2013)) and Pascal-S (Li et al (2014)). In Section 5.3 we verify that our unsupervised deep features are also useful in different transfer learning tasks.
Datasets. Unsupervised learning requires large quantities of unlabeled video data. We have chosen as training data videos from three large datasets: the ImageNet VID dataset (Russakovsky et al (2015)), YouTube Objects (Prest et al (2012)) and YouTube Bounding Boxes (Real et al (2017)). VID is one of the largest video datasets publicly available and is fully annotated with ground truth bounding boxes. The dataset consists of about 4000 videos, with a total of about 1.2M frames. The videos contain objects belonging to 30 different classes. Each frame can have zero, one or multiple objects annotated. The benchmark challenge associated with this dataset focuses on the supervised object detection and recognition problem, which is different from the one we tackle here. Our system is not trained to identify different object categories, so we do not report results against the state of the art on object class recognition and detection on this dataset.
YouTube Objects (YTO) is a challenging video dataset with objects undergoing significant changes in appearance, scale and shape, going in and out of occlusion against a varying, often cluttered background. YTO is at its second version now and consists of about 2500 videos, having a total of about 700K frames. It is specifically created for unsupervised object discovery, so we perform comparisons to state of the art on this dataset.
For the unsupervised training of our system we used approximately 190k frames from videos chosen from each dataset (120k from VID and 70k from YTO) at learning iteration 1 - those frames which survived the data selection module. At the second learning iteration, besides improving the classifier, it is important to have access to larger quantities of new unlabeled data. Therefore, for training the second generation of classifiers we added to the unlabeled training set an additional 1 million soft-masks, as follows: 600k frames from VID and 400k from the YouTube Bounding Boxes dataset - again, those frames which survived the filtering with the EvalSeg-Net data selection module. Before data selection, videos were randomly chosen from each set, VID or YouTube Bounding Boxes, until the total of 1M was reached. We did not add more frames due to heavy computation and storage limitations.
Evaluation metrics. We use different kinds of metrics in our experiments, depending on whether the specific task requires bounding box fitting or fine segmentation (a short illustrative sketch of some of these metrics follows the list):
- CorLoc - for evaluating the detection of bounding boxes, the most commonly used metric is CorLoc. It is defined as the percentage of images correctly localized according to the PASCAL criterion $\frac{|B_p \cap B_{gt}|}{|B_p \cup B_{gt}|} \geq 0.5$, where $B_p$ is the predicted bounding box and $B_{gt}$ is the ground truth bounding box.
- $F_\beta$ - for evaluating the segmentation score on the Pascal-S dataset, defined as $F_\beta = \frac{(1+\beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}$, where $P$ and $R$ denote precision and recall. We use the official evaluation code when reporting results and, as in all previous works, the same value of $\beta$.
- P-J metric - P refers to the precision per pixel, while J is the Jaccard similarity (the intersection over union between the output mask and the ground truth segmentation). We use this metric only on Object Discovery in Internet Images. For computing the reported results we use the official evaluation code.
- MAE - Mean Absolute Error, defined as the average pixel-wise difference between the predicted mask and the ground truth. Different from the other metrics, a lower value is better here.
- mean IoU - the mean intersection over union score, defined as $\frac{|M \cap G|}{|M \cup G|}$, where $G$ represents the ground truth and $M$ the predicted mask.
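For reference, minimal NumPy sketches of CorLoc, MAE and mean IoU are shown below (the official evaluation codes are used for all reported numbers; these are only illustrative, and boxes are assumed to be given as (x1, y1, x2, y2)).

```python
import numpy as np

def box_iou(box_a, box_b):
    # Intersection over union of two boxes given as (x1, y1, x2, y2).
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, xb - xa) * max(0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def corloc(pred_boxes, gt_boxes):
    # Percentage of images whose predicted box matches the ground truth with IoU >= 0.5.
    hits = [box_iou(p, g) >= 0.5 for p, g in zip(pred_boxes, gt_boxes)]
    return 100.0 * np.mean(hits)

def mae(pred_mask, gt_mask):
    # Mean absolute pixel-wise difference between masks in [0, 1]; lower is better.
    return np.abs(pred_mask.astype(np.float32) - gt_mask.astype(np.float32)).mean()

def mask_iou(pred_mask, gt_mask):
    # Jaccard similarity between binary masks.
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / float(union) if union else 0.0
```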
5.1 Evaluation of different system components


| | LowRes-Net | FConv-Net | DenseU-Net | BasicU-Net | DilateU-Net | Avg (single) | Multi-Net | MultiSelect-Net | Avg (ensembles) |
|---|---|---|---|---|---|---|---|---|---|
| Iteration 1 | 62.1 | 57.6 | 54.6 | 59.1 | 61.8 | 59.0 | 65.3 | 62.4 | 63.9 |
| Iteration 2 | 63.5 | 61.3 | 59.4 | 65.2 | 65.8 | 63.0 | 67.0 | 67.3 | 67.2 |
| Gain | +1.4 | +3.7 | +4.8 | +6.1 | +4.0 | +4.0 | +1.7 | +4.9 | +3.3 |
| | LowRes-Net | FConv-Net | DenseU-Net | BasicU-Net | DilateU-Net | Avg (single) | Multi-Net | MultiSelect-Net | Avg (ensembles) |
|---|---|---|---|---|---|---|---|---|---|
| Iteration 1 | 85.8 | 79.8 | 83.3 | 86.8 | 85.6 | 84.3 | 85.8 | 86.7 | 86.3 |
| Iteration 2 | 86.7 | 85.6 | 86.7 | 87.1 | 87.9 | 86.8 | 86.4 | 88.2 | 87.3 |
| Gain | +0.9 | +5.8 | +3.4 | +0.3 | +2.3 | +2.5 | +0.6 | +1.5 | +1.0 |
| | LowRes-Net | FConv-Net | DenseU-Net | BasicU-Net | DilateU-Net | Avg (single) | Multi-Net | MultiSelect-Net | Avg (ensembles) |
|---|---|---|---|---|---|---|---|---|---|
| Iteration 1 | 64.6 | 51.5 | 65.2 | 65.4 | 65.8 | 62.5 | 67.8 | 67.1 | 67.5 |
| Iteration 2 | 66.9 | 61.7 | 68.4 | 68.0 | 67.5 | 66.5 | 69.1 | 68.5 | 68.8 |
| Gain | +2.3 | +10.2 | +3.2 | +2.6 | +1.7 | +4.0 | +1.3 | +1.4 | +1.3 |
Student vs. teacher. In Figure 8 we present qualitative results on the VID dataset compared to VideoPCA. We can see that the masks produced by VideoPCA are of lower quality, often having holes, non-smooth boundaries and strange shapes. In contrast, the students learn more general shape and appearance characteristics of objects in images, reminiscent of the grouping principles at the basis of visual perception studied by the Gestalt psychologists (Rock and Palmer (1990)) and of the more recent work on the concept of "objectness" (Alexe et al (2010)). The object masks produced by the students are simpler, with very few holes, have nicer and smoother shapes and capture well the foreground-background contrast and organization. Another interesting observation is that the students are able to detect multiple objects, something the teacher achieves less often.
In Figure 5 we show comparative results between the average of the individual models, the ensembles formed from them and the teacher. Note that the teacher reported at the next generation is the MultiSelect-Net ensemble from the first. We observe that the students at both iterations outperform their respective teachers, which is an interesting and positive outcome. It suggests that we can repeat the process over several iterations and continue to improve. It is also encouraging that the individual nets, which see a single image, are able to generalize and detect objects that are discovered by the teacher in sequences of images.
First vs. next generation.
As seen in Tables 1, 2, 3 and Figure 7, at the second generation we obtain a clear gain over the first, on all experiments and datasets. This result proves the value of our proposed algorithm, which starts from a completely unsupervised object discoverer in video (VideoPCA) and is able to train neural nets for foreground object segmentation while improving their accuracy over two generations. It uses the students from iteration 1 as teachers at iteration 2. At the second iteration, it also uses more unlabeled training data and is better at automatically filtering out poor quality segmentations.

Impact of data selection. Data selection is important as seen in Figure 6. The more selective we are when we accept or reject soft-masks used for training, the better the end result. Also note that being more selective means decreasing the training set. There is a trade-off between selectivity and training data size.
Neural architecture vs. data.
As seen in Tables 1, 2 and 3, different network architectures yield different results, while ensembles always outperform individual models. While the actual CNN architecture plays a certain role in performance, an equally important aspect is data size. The more data we have, the more selective we can afford to be and the better we can generalize. It is important to increase the data from one generation to the next in order to avoid simply imitating the ensemble of the previous generation. In Tables 4 and 5 we show additional tests with our baseline architecture, LowRes-Net, when trained with training sets of different sizes. It is clear that adding new unlabeled data has a positive effect on performance. The idea of increasing the data in stages is also related to approaches in curriculum learning (Bengio et al (2009)), where we first learn from easy cases and then move to more complex ones.
Analysis of different ConvNets. Our experiments show that different architectures are better at different tasks. LowRes-Net, for example, performs well on the task of box fitting, since that does not require a fine, sharp object mask. On the other hand, when evaluating the exact segmentation, nets with higher resolution output, which are more specialized for this task, perform better. Overall, at the second generation, the best single net on average for box fitting is DilateU-Net and the top ensemble is MultiSelect-Net. However, when it comes to evaluating the actual segmentation, the winner is DenseU-Net among single models and Multi-Net among ensembles. In our qualitative results we find that DenseU-Net produces masks with fewer "holes" than DilateU-Net after thresholding, and is thus better suited for segmentation evaluation. When evaluating the bounding box, these holes do not affect the box and the best model is DilateU-Net. Also, DenseU-Net tends to output a mask with high confidence over the whole object, as opposed to BasicU-Net and DilateU-Net, which output masks with lower confidence around some regions of the object (such as the eyes or wheels). This could be another reason why DenseU-Net produces better segmentations. The model that struggles most during the first iteration is FConv-Net, with a significant improvement at the second iteration, when the unsupervised training masks are closer to the correct ones. Also note that the baseline LowRes-Net is a top model on box fitting at the first iteration. The quantitative differences between architectures are shown in Tables 1, 2 and 3, while the qualitative differences can be seen in Figure 7.
| Model | Training data | CorLoc | Testing dataset |
|---|---|---|---|
| LowRes-Net | VID | 56.1 | YTO |
| LowRes-Net | VID + YTO | 62.2 | YTO |
| Model | Training data | mean P | mean J |
|---|---|---|---|
| LowRes-Net | VID | 87.73 | 61.25 |
| LowRes-Net | VID + YTO | 88.36 | 62.33 |
5.2 Comparisons with state of the art

Object discovery in video.
We first performed comparisons with methods specifically designed for object discovery in video. For that, we choose the YouTube Objects dataset and compare it to the best methods on this dataset in the literature (Table 6). Evaluations are conducted on both
versions of YouTube Objects dataset,
YTOv1 (Prest et al (2012)) and YTOv2.2 (Kalogeiton et al (2016)). On YTOv1 we follow the same experimental setup as Jun Koh et al (2016); Prest et al (2012), running experiments only on the training videos. We have not included in Table 6 the results reported by Stretcu and Leordeanu (2015), because they use a different setup, testing on all videos from YTOv1. It is important to stress, again, that while the methods presented here for comparison have access to whole video shots, ours only needs a single image at test time. Despite this limitation, our method outperforms the others on 7 out of 10 classes and has the best overall average performance. Note that even our baseline LowRes-Net at the first iteration achieves top performance. The feed-forward CNN processes each image in 0.02 sec, being at least one to two orders of magnitude faster than all other methods (see Table 6). We also mention that in all our comparisons, while our system is faster at test time, it takes much longer during its unsupervised training phase and requires large quantities of unsupervised training data.
| Method | Aero | Bird | Boat | Car | Cat | Cow | Dog | Horse | Mbike | Train | Avg | Time | Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Prest et al (2012) | 51.7 | 17.5 | 34.4 | 34.7 | 22.3 | 17.9 | 13.5 | 26.7 | 41.2 | 25.0 | 28.5 | N/A | v1 |
| Papazoglou and Ferrari (2013) | 65.4 | 67.3 | 38.9 | 65.2 | 46.3 | 40.2 | 65.3 | 48.4 | 39.0 | 25.0 | 50.1 | 4s | v1 |
| Jun Koh et al (2016) | 64.3 | 63.2 | 73.3 | 68.9 | 44.4 | 62.5 | 71.4 | 52.3 | 78.6 | 23.1 | 60.2 | N/A | v1 |
| Haller and Leordeanu (2017) | 76.3 | 71.4 | 65.0 | 58.9 | 68.0 | 55.9 | 70.6 | 33.3 | 69.7 | 42.4 | 61.1 | 0.35s | v1 |
| LowRes-Net (iter. 1) | 77.0 | 67.5 | 77.2 | 68.4 | 54.5 | 68.3 | 72.0 | 56.7 | 44.1 | 34.9 | 62.1 | 0.02s | v1 |
| LowRes-Net (iter. 2) | 79.7 | 67.5 | 68.3 | 69.6 | 59.4 | 75.0 | 78.7 | 48.3 | 48.5 | 39.5 | 63.5 | 0.02s | v1 |
| DilateU-Net (iter. 2) | 85.1 | 72.7 | 76.2 | 68.4 | 59.4 | 76.7 | 77.3 | 46.7 | 48.5 | 46.5 | 65.8 | 0.02s | v1 |
| MultiSelect-Net (iter. 2) | 84.7 | 72.7 | 78.2 | 69.6 | 60.4 | 80.0 | 78.7 | 51.7 | 50.0 | 46.5 | 67.3 | 0.15s | v1 |
| Haller and Leordeanu (2017) | 76.3 | 68.5 | 54.5 | 50.4 | 59.8 | 42.4 | 53.5 | 30.0 | 53.5 | 60.7 | 54.9 | 0.35s | v2.2 |
| LowRes-Net (iter. 1) | 75.7 | 56.0 | 52.7 | 57.3 | 46.9 | 57.0 | 48.9 | 44.0 | 27.2 | 56.2 | 52.2 | 0.02s | v2.2 |
| LowRes-Net (iter. 2) | 78.1 | 51.8 | 49.0 | 60.5 | 44.8 | 62.3 | 52.9 | 48.9 | 30.6 | 54.6 | 53.4 | 0.02s | v2.2 |
| DilateU-Net (iter. 2) | 74.9 | 50.7 | 50.7 | 60.9 | 45.7 | 60.1 | 54.4 | 42.9 | 30.6 | 57.8 | 52.9 | 0.02s | v2.2 |
| BasicU-Net (iter. 2) | 82.2 | 51.8 | 51.5 | 62.0 | 50.9 | 64.8 | 55.5 | 45.7 | 35.3 | 55.9 | 55.6 | 0.02s | v2.2 |
| MultiSelect-Net (iter. 2) | 81.7 | 51.5 | 54.1 | 62.5 | 49.7 | 68.8 | 55.9 | 50.4 | 33.3 | 57.0 | 56.5 | 0.15s | v2.2 |
Object discovery in images. We compare our system against other methods that perform object discovery in images. We use two different datasets for this comparison: Object Discovery in Internet Images and Pascal-S. We report results using metrics that are commonly used for these tasks, as presented at the beginning of the experimental section.
Object Discovery in Internet Images is a representative benchmark for foreground object detection in single images. This set contains internet images and is annotated with highly detailed segmentation masks. In order to enable comparison with previous methods, we use the 100-image subsets provided for each of the three categories: airplane, car and horse. The methods evaluated on this dataset in the literature aim to discover either the bounding box of the main object in a given image or its fine segmentation mask. We evaluate our system on both. Note that, different from other works, we do not need a collection of images at test time, since each image can be processed independently by our system. Therefore, unlike other methods, our performance is not affected by the structure of the image collection or the number of classes of interest present in the collection.
In Table 7 we present the performance of our method compared to other unsupervised object discovery methods, in terms of CorLoc on the Object Discovery dataset. We compare our predicted box against the tight box fitted around the ground-truth segmentation, as done in Cho et al (2015); Tang et al (2014). Our system can be considered in the mixed-class category: it does not depend on the structure of the image collection and treats each image independently. The performance of the other algorithms degrades as the number of main categories in the collection increases (some are not even tested by their authors on the mixed-class case), which is not the case with our approach.
We obtain state of the art results on all classes, improving by a significant margin over the method of Cho et al (2015). When the method in Cho et al (2015) is allowed to see a collection of images limited to a single majority class, its performance improves and equals ours on one class. However, our method needs no information besides the input image at test time.
| Method | Airplane | Car | Horse | Avg |
|---|---|---|---|---|
| Kim et al (2011) | 21.95 | 0.00 | 16.13 | 12.69 |
| Joulin et al (2010) | 32.93 | 66.29 | 54.84 | 51.35 |
| Joulin et al (2012) | 57.32 | 64.04 | 52.69 | 58.02 |
| Rubinstein et al (2013) | 74.39 | 87.64 | 63.44 | 75.16 |
| Tang et al (2014) | 71.95 | 93.26 | 64.52 | 76.58 |
| Cho et al (2015) | 82.93 | 94.38 | 75.27 | 84.19 |
| Cho et al (2015) mixed | 81.71 | 94.38 | 70.97 | 82.35 |
| LowRes-Net (iter. 1) | 87.80 | 95.51 | 74.19 | 85.83 |
| LowRes-Net (iter. 2) | 93.90 | 92.13 | 74.19 | 86.74 |
| DilateU-Net (iter. 2) | 95.12 | 95.51 | 73.12 | 87.92 |
| MultiSelect-Net (iter. 2) | 93.90 | 95.51 | 75.27 | 88.22 |
| Method | Airplane P | Airplane J | Car P | Car J | Horse P | Horse J |
|---|---|---|---|---|---|---|
| Kim et al (2011) | 80.20 | 7.90 | 68.85 | 0.04 | 75.12 | 6.43 |
| Joulin et al (2010) | 49.25 | 15.36 | 58.70 | 37.15 | 63.84 | 30.16 |
| Joulin et al (2012) | 47.48 | 11.72 | 59.20 | 35.15 | 64.22 | 29.53 |
| Rubinstein et al (2013) | 88.04 | 55.81 | 85.38 | 64.42 | 82.81 | 51.65 |
| Chen et al (2014) | 90.25 | 40.33 | 87.65 | 64.86 | 86.16 | 33.39 |
| LowRes-Net (iter. 1) | 91.41 | 61.37 | 86.59 | 70.52 | 87.07 | 55.09 |
| LowRes-Net (iter. 2) | 90.61 | 60.19 | 87.05 | 71.52 | 88.73 | 55.31 |
| DenseU-Net (iter. 2) | 91.03 | 64.46 | 85.71 | 72.51 | 87.14 | 55.44 |
| Multi-Net (iter. 2) | 91.13 | 66.02 | 87.67 | 73.98 | 88.83 | 55.23 |


We also tested our method on the task of fine foreground object segmentation and compared it to the best performers in the literature on the Object Discovery dataset in Table 8. To refine our soft masks we apply the GrabCut method, as available in OpenCV. We evaluate based on the same P, J metrics as described by Rubinstein et al (2013) - the higher P and J, the better. In Figures 9 and 10 we present some qualitative results for each class. As mentioned previously, these experiments on Object Discovery in Internet Images are the only ones in which we apply GrabCut as a post-processing step, as also used by all competing methods presented in Table 8.
Another important dataset, used for the evaluation of the related task of salient object detection, is the Pascal-S dataset, consisting of 850 images. As seen in Table 9, we achieve top results on all three metrics among methods that do not use any supervised pre-trained features. Being a foreground object detection method, our approach is usually biased towards the main object in the image, even though it can also detect multiple ones. Images in Pascal-S usually contain more objects, so we consider our results very encouraging, being close to approaches that use features pre-trained in a supervised manner. Also note that we did not use GrabCut for these experiments.
In the single-image experiments, our system was trained, as discussed before, on other video datasets (VID, YTO and YTB). It had not previously seen any of the images in the Pascal-S or Object Discovery datasets during training.
| Method | $F_\beta$ | MAE | mean IoU | pre-trained supervised features? |
|---|---|---|---|---|
| Wei et al (2012) | 56.2 | 22.6 | 41.6 | no |
| Li et al (2015) | 56.8 | 19.2 | 42.4 | no |
| Zhu et al (2014) | 60.0 | 19.7 | 43.9 | no |
| Yang et al (2013) | 60.7 | 21.7 | 43.8 | no |
| Zhang et al (2015) | 60.8 | 20.2 | 44.3 | no |
| Tu et al (2016) | 60.9 | 19.4 | 45.3 | no |
| Zhang et al (2017) | 68.0 | 14.1 | 54.9 | init VGG |
| LowRes-Net (iter. 1) | 64.6 | 19.6 | 48.7 | no |
| LowRes-Net (iter. 2) | 66.9 | 18.3 | 51.4 | no |
| DenseU-Net (iter. 2) | 68.4 | 17.6 | 51.6 | no |
| Multi-Net (iter. 2) | 69.1 | 19.2 | 53.0 | no |
5.3 Transfer learning experiments
While the focus of the paper is foreground object detection in the unsupervised learning setup, we also want to verify the usefulness of our approach in transfer learning experiments. We design experiments that test two aspects of our system: the unsupervised features actually learned and the final output foreground mask. We perform tests on the YouTube Objects v1 dataset, in a relatively standard supervised classification setup, by learning to classify individual video frames with the class given by their parent video shot - for a total of ten classes.
We use the frames from the YTO training videos for training and the ones from the YTO test videos for testing. We test on a frame-by-frame basis and report the average multiclass classification rate - how often the correct class is chosen out of the ten classes. This problem is difficult for several reasons: 1) the training and testing frames come from different videos in YTO, which vary significantly in appearance and background scene; 2) the object of interest is not present in every frame, so classification must often rely on the contextual scene; 3) many frames contain multiple objects against a cluttered background, while the object of interest undergoes changes in scale, viewpoint and pose.
We have two experimental setups for this task, one focused on the pre-trained features and the other on the foreground masks. In the first setup, we replace the last fully connected layer of our baseline model LowRes-Net with a classification part and freeze the network up to a given depth, using as pre-trained features the ones learned on the unsupervised task. Then, we fine-tune the remaining part on the supervised classification task. In the second setup we extract features from the VGG network pre-trained on ImageNet (Simonyan and Zisserman (2014)) over different subwindows of the image, one of them being the bounding box given by the unsupervised LowRes-Net. Both tests, presented next in more detail, show that our approach is useful for transfer learning tasks.
Using the unsupervised features. In this experimental setup, we replace the last fully connected layer with a classification part, composed of a reduction convolutional layer with four filters and a final fully connected layer with 10 neurons. We test several cases by freezing different parts of the LowRes-Net and fine-tuning the rest on the supervised classification task. The results are presented in Figure 11.
They strongly suggest that the features learned in an unsupervised way in the middle of the network are best suited for semantic classification. The result clearly demonstrates the usefulness of the unsupervised features on the supervised classification task: in all cases when these features are used ("concat", "conv2_2", "init pre-trained") the results improve, with a single exception, "conv3_3". In that case the pre-trained features come from the top level, where the final segmentation is produced and the semantic information is already lost. In contrast, the best results are obtained when features are frozen in the middle of the network.
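This setup can be sketched as follows, assuming a generic convolutional backbone that holds the unsupervised weights; the exact student architecture, feature sizes and learning rate below are placeholders rather than our actual configuration.

```python
import torch
import torch.nn as nn

def build_classifier(pretrained_backbone, freeze_up_to, feat_channels, feat_hw):
    """Attach a small classification head to an unsupervised-pretrained
    backbone: a reduction conv with 4 filters followed by a fully
    connected layer with 10 outputs (one per YTO class).

    `pretrained_backbone` is assumed to be an nn.Sequential of conv blocks;
    its structure, `feat_channels` and `feat_hw` are placeholders that
    depend on the student architecture.
    """
    # Freeze the layers that keep their unsupervised weights.
    for layer in list(pretrained_backbone.children())[:freeze_up_to]:
        for p in layer.parameters():
            p.requires_grad = False

    head = nn.Sequential(
        nn.Conv2d(feat_channels, 4, kernel_size=1),   # reduction conv, 4 filters
        nn.ReLU(inplace=True),
        nn.Flatten(),
        nn.Linear(4 * feat_hw * feat_hw, 10),         # 10 YTO classes
    )
    model = nn.Sequential(pretrained_backbone, head)

    # Only the unfrozen parameters are fine-tuned.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=1e-4)
    return model, optimizer
```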

Using the detected foreground bounding box. In these experiments we extract 'fc7' VGG19 features, pre-trained on ImageNet, by passing different subwindows of the image, rescaled appropriately, through VGG19: the whole image, the center box with height and width equal to half the original image size, and the window cropped according to the bounding box produced by LowRes-Net. We concatenate the features from these windows in different combinations and pass them through a last fully connected layer with 10 neurons, which we train on the given classification task. We then test the different combinations, as shown in Table 10.
When using features extracted from the bounding box fitted by LowRes-Net (alone or in combination with the whole image), we obtain significantly better results than when windows are extracted from fixed locations only (middle box, whole image or their combination). These results confirm that the foreground segmentation mask detected by our models is, as expected, directly related to the main video class and constitutes a valuable source of information for image classification tasks.
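A minimal sketch of this feature extraction is given below, assuming a torchvision VGG19 and taking 'fc7' to be the activation after the second 4096-dimensional fully connected layer; the preprocessing details and crop handling are assumptions, not our exact pipeline.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# VGG19 pre-trained on ImageNet, used as a fixed 'fc7' feature extractor.
vgg = models.vgg19(pretrained=True).eval()
fc7_extractor = nn.Sequential(vgg.features, vgg.avgpool, nn.Flatten(),
                              *list(vgg.classifier.children())[:5]).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def fc7(pil_image):
    """4096-d 'fc7' descriptor of one window."""
    with torch.no_grad():
        return fc7_extractor(preprocess(pil_image).unsqueeze(0)).squeeze(0)

def frame_descriptor(pil_image, lowres_net_box):
    """Concatenate fc7 features of the whole frame and of the crop given by
    the LowRes-Net bounding box (x0, y0, x1, y1), obtained externally.
    A linear layer with 10 outputs is then trained on this descriptor."""
    whole = fc7(pil_image)
    cropped = fc7(pil_image.crop(lowres_net_box))
    return torch.cat([whole, cropped])  # 2 x 4096 features
```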
Overall, the classification experiments presented in this Section indicate that the features learned in an unsupervised manner with our algorithm contain relevant semantic information about object classes and could be useful for related supervised learning tasks.
Region of extracted features | Multiclass recognition rate |
---|---|
Whole image | 69.1 |
Middle crop image | 64.9 |
Cropped image by LowRes-Net | 70.2 |
Whole + middle crop | 67.2 |
Whole + cropped by LowRes-Net | 72.7 |
6 Short discussion on unsupervised learning
The ultimate goal of unsupervised learning might not be to match the performance of the supervised case, but rather to reach beyond the capabilities of the classical supervised scenario. An unsupervised system should be able to learn and recognize different object classes, such as animals, plants and man-made objects, as they evolve and change over time, from the past and into the unknown future. It should also be able to learn about new classes that might be formed in relation to other, maybe known ones. We see this case as fundamentally different from the supervised one, in which the classifier is forced to learn from a distribution of samples that is fixed and limited to a specific period of time - the time when the human labeling was performed.
Therefore, in the supervised learning paradigm a car from the future should not be classified as a car, because it is not one according to the distribution of cars given at present training time, when the human annotations are collected. On the other hand, a system that learns by itself should be able to track how cars change over time and still recognize such objects as "cars" - with no step-by-step human intervention.
From a temporal perspective, unsupervised learning is about continuous learning and adaptation to huge quantities of data that are perpetually changing. Human annotation is extremely limited in such an ocean of data and cannot provide the so-called "ground truth" continuously. Therefore, unsupervised learning will soon become a core part of artificial intelligence, larger than the supervised one.
7 Conclusions and future work
In this article we present a novel and effective approach to learning from video, in an unsupervised fashion, to detect foreground objects in single images. We propose a relatively general algorithm for this task, which offers the possibility of learning over several generations of students and teachers. We demonstrate in practice that the system improves its performance over the course of two generations. We also test the impact of the different system components on performance and show state of the art results on three different datasets. To the best of our knowledge, this is the first system that learns to detect and segment foreground objects in images in an unsupervised fashion, with no pre-trained features given or manual labeling, while requiring only a single image at test time.
The convolutional networks trained along the student pathway are able to learn general ”objectness” characteristics, which include good form, closure, smooth contours, as well as contrast with the background. What the simpler initial VideoPCA teacher discovers over time, the deep, complex student is able to learn across several layers of image features at different levels of abstraction. Our results on transfer learning experiments are also encouraging and show additional cases in which such a system could be useful. In future work we plan to further grow our computational and storage capabilities to demonstrate the power of our unsupervised learning algorithm along many generations of student and teacher networks. We believe that our approach, tested here in extensive experiments, will bring a valuable contribution to computer vision research.
Acknowledgements.
This work was supported by UEFISCDI, under projects PN-III-P4-ID-ERC-2016-0007, PN-III-P2-2.1-PED-2016-1842 and PN-III-P1-1.2-PCCDI-2017-0734.

References
- Abadi et al (2015) Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, et al (2015) Tensorflow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org
- Alexe et al (2010) Alexe B, Deselaers T, Ferrari V (2010) What is an object? In: CVPR
- Barnich and Van Droogenbroeck (2011) Barnich O, Van Droogenbroeck M (2011) Vibe: A universal background subtraction algorithm for video sequences. IEEE Transactions on Image processing 20(6):1709–1724
- Bengio et al (2009) Bengio Y, Louradour J, Collobert R, Weston J (2009) Curriculum learning. In: Proceedings of the 26th annual international conference on machine learning, ACM, pp 41–48
- Borji et al (2012) Borji A, Sihite D, Itti L (2012) Salient object detection: A benchmark. In: ECCV
- Chen et al (2014) Chen X, Shrivastava A, Gupta A (2014) Enriching visual knowledge bases via object discovery and segmentation. In: CVPR
- Cheng et al (2017) Cheng J, Tsai YH, Wang S, Yang MH (2017) Segflow: Joint learning for video object segmentation and optical flow. In: The IEEE International Conference on Computer Vision (ICCV)
- Cheng et al (2015) Cheng M, Mitra N, Huang X, Torr P, Hu S (2015) Global contrast based salient region detection. PAMI 37(3)
- Cho et al (2015) Cho M, Kwak S, Schmid C, Ponce J (2015) Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals. In: CVPR
- Croitoru et al (2017) Croitoru I, Bogolin SV, Leordeanu M (2017) Unsupervised learning from video to detect foreground objects in single images. In: Computer Vision (ICCV), 2017 IEEE International Conference on, IEEE, pp 4345–4353
- Cucchiara et al (2003) Cucchiara R, Grana C, Piccardi M, Prati A (2003) Detecting moving objects, ghosts, and shadows in video streams. PAMI 25(10)
- Deselaers et al (2012) Deselaers T, Alexe B, Ferrari V (2012) Weakly supervised localization and learning with generic knowledge. IJCV 100(3)
- Doersch et al (2015) Doersch C, Gupta A, Efros AA (2015) Unsupervised visual representation learning by context prediction. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1422–1430
- Dutt Jain et al (2017) Dutt Jain S, Xiong B, Grauman K (2017) Fusionseg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- Everingham et al (2015) Everingham M, Eslami SMA, Van Gool L, Williams CKI, Winn J, Zisserman A (2015) The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision 111(1):98–136
- Finn et al (2016) Finn C, Goodfellow I, Levine S (2016) Unsupervised learning for physical interaction through video prediction. In: Advances in neural information processing systems, pp 64–72
- Goroshin et al (2015) Goroshin R, Mathieu MF, LeCun Y (2015) Learning to linearize under uncertainty. In: Advances in Neural Information Processing Systems, pp 1234–1242
- Haller and Leordeanu (2017) Haller E, Leordeanu M (2017) Unsupervised object segmentation in video by efficient selection of highly probable positive features. In: The IEEE International Conference on Computer Vision (ICCV)
- Hou and Zhang (2007) Hou X, Zhang L (2007) Saliency detection: A spectral residual approach. In: CVPR
- Jain et al (1999) Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM computing surveys 31(3):264–323
- Jégou et al (2017) Jégou S, Drozdzal M, Vazquez D, Romero A, Bengio Y (2017) The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, IEEE, pp 1175–1183
- Jiang et al (2013) Jiang H, Wang J, Yuan Z, Wu Y, Zheng N, Li S (2013) Salient object detection: A discriminative regional feature integration approach. In: CVPR
- Joulin et al (2010) Joulin A, Bach F, Ponce J (2010) Discriminative clustering for image co-segmentation. In: CVPR
- Joulin et al (2012) Joulin A, Bach F, Ponce J (2012) Multi-class cosegmentation. In: CVPR
- Joulin et al (2014) Joulin A, Tang K, Fei-Fei L (2014) Efficient image and video co-localization with Frank-Wolfe algorithm. In: ECCV
- Jun Koh et al (2016) Jun Koh Y, Jang WD, Kim CS (2016) Pod: Discovering primary objects in videos based on evolutionary refinement of object recurrence, background, and primary object models. In: CVPR
- Kalogeiton et al (2016) Kalogeiton V, Ferrari V, Schmid C (2016) Analysing domain shift factors between videos and images for object detection. PAMI 38(11)
- Kim et al (2011) Kim G, Xing E, Fei-Fei L, Kanade T (2011) Distributed cosegmentation via submodular optimization on anisotropic diffusion. In: ICCV
- Kingma and Ba (2014) Kingma D, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
- Kuettel et al (2012) Kuettel D, Guillaumin M, Ferrari V (2012) Segmentation propagation in imagenet. In: ECCV
- Larsson et al (2016) Larsson G, Maire M, Shakhnarovich G (2016) Learning representations for automatic colorization. In: European Conference on Computer Vision, Springer, pp 577–593
- Lee et al (2017) Lee HY, Huang JB, Singh M, Yang MH (2017) Unsupervised representation learning by sorting sequences. In: 2017 IEEE International Conference on Computer Vision (ICCV), IEEE, pp 667–676
- Lee et al (2011) Lee YJ, Kim J, Grauman K (2011) Key-segments for video object segmentation. In: Computer Vision (ICCV), 2011 IEEE International Conference on, IEEE, pp 1995–2002
- Leordeanu et al (2005) Leordeanu M, Collins R, Hebert M (2005) Unsupervised learning of object features from video sequences. In: CVPR
- Leordeanu et al (2012) Leordeanu M, Sukthankar R, Hebert M (2012) Unsupervised learning for graph matching. Int J Comput Vis 96:28–45
- Li et al (2016) Li D, Hung WC, Huang JB, Wang S, Ahuja N, Yang MH (2016) Unsupervised visual representation learning by graph-based consistent constraints. In: ECCV
- Li et al (2015) Li N, Sun B, Yu J (2015) A weighted sparse coding framework for saliency detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5216–5223
- Li et al (2014) Li Y, Hou X, Koch C, Rehg JM, Yuille AL (2014) The secrets of salient object segmentation. Georgia Institute of Technology
- Liu and Chen (2007) Liu D, Chen T (2007) A topic-motion model for unsupervised video object discovery. In: CVPR
- Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440
- Misra et al (2016) Misra I, Zitnick CL, Hebert M (2016) Shuffle and learn: unsupervised learning using temporal order verification. In: ECCV
- Nguyen et al (2009) Nguyen M, Torresani L, la Torre FD, Rother C (2009) Weakly supervised discriminative localization and classification: a joint learning process. In: CVPR
- Noroozi and Favaro (2016) Noroozi M, Favaro P (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In: European Conference on Computer Vision, Springer, pp 69–84
- Papazoglou and Ferrari (2013) Papazoglou A, Ferrari V (2013) Fast object segmentation in unconstrained video. In: ICCV
- Parikh and Chen (2007) Parikh D, Chen T (2007) Unsupervised identification of multiple objects of interest from multiple images: discover. In: Asian Conference on Computer Vision
- Pathak et al (2016) Pathak D, Krahenbuhl P, Donahue J, Darrell T, Efros AA (2016) Context encoders: Feature learning by inpainting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2536–2544
- Pathak et al (2017) Pathak D, Girshick R, Dollar P, Darrell T, Hariharan B (2017) Learning features by watching objects move. In: CVPR
- Pinheiro et al (2016) Pinheiro PO, Lin TY, Collobert R, Dollár P (2016) Learning to refine object segments. In: ECCV
- Prest et al (2012) Prest A, Leistner C, Civera J, Schmid C, Ferrari V (2012) Learning object class detectors from weakly annotated video. In: CVPR, IEEE, pp 3282–3289
- Radenović et al (2016) Radenović F, Tolias G, Chum O (2016) Cnn image retrieval learns from bow: Unsupervised fine-tuning with hard examples. In: ECCV
- Raiko et al (2012) Raiko T, Valpola H, LeCun Y (2012) Deep learning made easier by linear transformations in perceptrons. In: AISTATS, vol 22, pp 924–932
- Raina et al (2007) Raina R, Battle A, Lee H, Packer B, Ng AY (2007) Self-taught learning: transfer learning from unlabeled data. In: Proceedings of the 24th international conference on Machine learning, ACM, pp 759–766
- Real et al (2017) Real E, Shlens J, Mazzocchi S, Pan X, Vanhoucke V (2017) Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, pp 7464–7473
- Rochan and Wang (2014) Rochan M, Wang Y (2014) Efficient object localization and segmentation in weakly labeled videos. In: Advances in Visual Computing, Springer, pp 172–181
- Rock and Palmer (1990) Rock I, Palmer S (1990) Gestalt psychology. Sci Am 263:84–90
- Ronneberger et al (2015) Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention, Springer, pp 234–241
- Rother et al (2004) Rother C, Kolmogorov V, Blake A (2004) Grabcut: Interactive foreground extraction using iterated graph cuts. In: ACM Transactions on Graphics, vol 23, pp 309–314
- Rubinstein et al (2013) Rubinstein M, Joulin A, Kopf J, Liu C (2013) Unsupervised joint object discovery and segmentation in internet images. In: CVPR
- Rubio et al (2012) Rubio J, Serrat J, López A (2012) Video co-segmentation. In: ACCV
- Russakovsky et al (2015) Russakovsky O, et al (2015) Imagenet large scale visual recognition challenge. IJCV 115(3)
- Simonyan and Zisserman (2014) Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
- Siva et al (2013) Siva P, Russell C, Xiang T, Agapito L (2013) Looking beyond the image: Unsupervised learning for object saliency and detection. In: CVPR
- Sivic et al (2005) Sivic J, Russell B, Efros A, Zisserman A, Freeman W (2005) Discovering objects and their location in images. In: ICCV
- Stretcu and Leordeanu (2015) Stretcu O, Leordeanu M (2015) Multiple frames matching for object discovery in video. In: BMVC
- Tang et al (2014) Tang K, Joulin A, Li LJ, Fei-Fei L (2014) Co-localization in real-world images. In: CVPR
- Tokmakov et al (2017) Tokmakov P, Alahari K, Schmid C (2017) Learning motion patterns in videos. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- Tu et al (2016) Tu WC, He S, Yang Q, Chien SY (2016) Real-time salient object detection with a minimum spanning tree. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2334–2342
- Vicente et al (2011) Vicente S, Rother C, Kolmogorov V (2011) Object cosegmentation. In: CVPR
- Wang and Gupta (2015a) Wang X, Gupta A (2015a) Unsupervised learning of visual representations using videos. arXiv preprint arXiv:1505.00687
- Wang and Gupta (2015b) Wang X, Gupta A (2015b) Unsupervised learning of visual representations using videos. In: The IEEE International Conference on Computer Vision (ICCV)
- Wei et al (2012) Wei Y, Wen F, Zhu W, Sun J (2012) Geodesic saliency using background priors. In: European conference on computer vision, Springer, pp 29–42
- Xue et al (2016) Xue T, Wu J, Bouman K, Freeman B (2016) Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In: Advances in Neural Information Processing Systems, pp 91–99
- Yang et al (2013) Yang C, Zhang L, Lu H, Ruan X, Yang MH (2013) Saliency detection via graph-based manifold ranking. In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, IEEE, pp 3166–3173
- Yu and Koltun (2015) Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122
- Zhang et al (2017) Zhang D, Han J, Zhang Y (2017) Supervision by fusion: Towards unsupervised learning of deep salient object detector. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4048–4056
- Zhang et al (2015) Zhang J, Sclaroff S, Lin Z, Shen X, Price B, Mech R (2015) Minimum barrier salient object detection at 80 fps. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1404–1412
- Zhu et al (2014) Zhu W, Liang S, Wei Y, Sun J (2014) Saliency optimization from robust background detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2814–2821