Deep convolutional models are a core instrument for visual understanding problems, including object localization Zhou et al. (2016); Choe et al. (2020), saliency detection Wang et al. (2019), segmentation Long et al. (2015), and others. Deep models, however, require a large amount of high-quality training data to fit a huge number of learnable parameters. In practice, obtaining groundtruth pixel-level labeling is expensive, since it requires labor-intensive human effort. Therefore, much recent research has focused on weakly-supervised and unsupervised approaches for challenging pixel-level tasks, such as segmentation Xia and Kulis (2017); Ji et al. (2019); Chen et al. (2019); Bielski and Favaro (2019).
An emerging line of research on unsupervised segmentation exploits generative models as a tool for image decomposition. Namely, recent works Chen et al. (2019); Bielski and Favaro (2019) have designed training protocols that include generative adversarial networks (GANs) to solve foreground object segmentation without human labels. Given the promising results and the fact that GAN performance is steadily improving, this research direction will likely develop further.
In practice, however, training high-quality generative models is challenging. This is especially the case for GANs, whose training can be both time-consuming and unstable. Moreover, the models in Chen et al. (2019); Bielski and Favaro (2019) typically include a large number of hyperparameters that can be tricky to tune, especially in the completely unsupervised scenario, when a labeled validation set is not available.
To this end, we propose an alternative way to exploit GANs for unsupervised segmentation, which does not train a separate generative model for each task. Instead, we use a publicly available pretrained GAN to generate synthetic images equipped with segmentation masks, which can be obtained automatically. In more detail, we explore the latent space of the publicly available BigBiGAN model Donahue and Simonyan (2019), which is an unsupervised GAN trained on Imagenet Deng et al. (2009). With the recent unsupervised technique Voynov and Babenko (2020), we demonstrate that manipulations in the BigBiGAN latent space make it possible to distinguish object/background pixels in the generated images, providing decent segmentation masks. These masks are then used to supervise a discriminative U-Net model Ronneberger et al. (2015), which is much easier to train. As another advantage, our approach also provides a straightforward way to tune hyperparameters. Since the amount of synthetic data is unlimited, a hold-out subset of it can be used for validation.
Our work confirms the promise of using GANs to produce synthetic training data, which is a long-standing goal of research on generative modeling. In extensive experiments, we show that the approach often outperforms the existing unsupervised alternatives for object segmentation and saliency detection. Furthermore, our approach performs on par with weakly-supervised methods for object localization, despite being completely unsupervised.
The main contributions of our paper are the following:
We introduce an alternative line of research on using GANs for unsupervised object segmentation. In a nutshell, we advocate the usage of high-quality synthetic data produced by BigBiGAN, which can provide high-quality saliency masks for generated images.
We compare our method to existing approaches and achieve a new state-of-the-art in most operating points. Given its simplicity, the method can serve as a useful baseline in the future.
We demonstrate a novel unsupervised scenario, where GAN-produced imagery becomes a useful source of training data for supervised computer vision models.
2 Related work
In this paper, we address the binary object segmentation problem, i.e., for each image pixel we aim to predict whether it belongs to the object or to the background. In the literature this setup is typically referred to as saliency detection Wang et al. (2019) and foreground object segmentation Chen et al. (2019); Bielski and Favaro (2019). While most prior works operate in fully-supervised or weakly-supervised regimes, we focus on the most challenging unsupervised scenario, for which only a few approaches have been developed.
Existing unsupervised approaches.
Before the rise of deep learning models, a large number of “shallow” unsupervised techniques were developed Zhu et al. (2014b); Jiang et al. (2013); Peng et al. (2016); Cong et al. (2017); Cheng et al. (2014); Wei et al. (2012). These earlier techniques were mostly based on hand-crafted features and heuristics, e.g., color contrast Cheng et al. (2014) or certain background priors Wei et al. (2012). Often these approaches also utilize traditional computer vision routines, such as super-pixels Yang et al. (2013); Wang et al. (2016), object proposals Guo et al. (2017), and CRFs Krähenbühl and Koltun (2011). These heuristics, however, are not completely learned from data, and the corresponding methods are inferior to the more recent “deep” approaches.
Regarding unsupervised deep models, several works have recently been proposed by the saliency detection community Wang et al. (2017b); Zhang et al. (2018, 2017); Nguyen et al. (2019). Their main idea is to combine or fuse the predictions of several heuristic saliency methods, typically using them as a source of noisy groundtruth for deep CNN models. However, these methods are not completely unsupervised, since they typically rely on pretrained classification or segmentation networks. In contrast, in this work, we focus on the methods that do not require any source of external supervision.
Generative models for object segmentation. The recent line of completely unsupervised methods Chen et al. (2019); Bielski and Favaro (2019) employs generative modeling to decompose the image into the object and the background. In a nutshell, these methods exploit the idea that the object location or appearance can be perturbed without affecting the image realism. This inductive bias is formalized in the training protocols Chen et al. (2019); Bielski and Favaro (2019), which include learning of GANs. Therefore, for each new segmentation task, one has to perform adversarial learning, which is known to be unstable, time-consuming, and sensitive to hyperparameters.
In contrast, our approach avoids these disadvantages, being much simpler and easier to reproduce. In essence, we propose to use the “inner knowledge” of the off-the-shelf large-scale GAN to produce the saliency masks for synthetic images and use them as a supervision for discriminative models.
Latent spaces of large-scale GANs. Our study is partially inspired by the recent findings from Voynov and Babenko (2020). This work introduces an unsupervised technique that discovers the directions in the GAN latent space corresponding to interpretable image transformations. Among its findings, Voynov and Babenko (2020) demonstrates that the large-scale conditional GAN (BigGAN Brock et al. (2018)) possesses a “background removal” direction that can be used to obtain saliency masks. However, this direction was discovered only for BigGAN, which was trained under supervision from the image class labels. For unconditional GANs, such a direction was not discovered in Voynov and Babenko (2020); hence, it is not clear whether supervision from the class labels is necessary for the GAN latent space “to understand” which pixels belong to the object and which to the background. In this paper, we demonstrate that this supervision is not necessary; therefore, even completely unsupervised GANs can serve as an excellent source of synthetic data for object segmentation.
3.1 Exploring the BigBiGAN latent space.
The main component of our method is the recent BigBiGAN model Donahue and Simonyan (2019). BigBiGAN is the state-of-the-art generative adversarial network trained on Imagenet Deng et al. (2009) without labels, and its parameters are available online (https://tfhub.dev/deepmind/bigbigan-resnet50/1). The BigBiGAN generator maps the samples from the latent space into the image space. BigBiGAN is also equipped with an encoder that was trained jointly with the generator and maps images to the latent space. In this section, we explore the BigBiGAN latent space to investigate if its properties can be useful for downstream tasks.
A very recent paper Voynov and Babenko (2020) has introduced an unsupervised technique that identifies interpretable directions in the latent space of a pretrained GAN. By moving a latent code in these directions, one can achieve different image transformations, such as image zooming or translation. Formally, given an image corresponding to a latent code, one can modify it by shifting the code in an interpretable direction and generating a modified image from the shifted code. Importantly, a direction operates consistently over the whole latent space, i.e., for all latent codes, shifting results in the same type of transformation. As the first step of our study, we apply the technique from Voynov and Babenko (2020) to the BigBiGAN generator to explore the potential of its latent space. In a nutshell, Voynov and Babenko (2020) seeks to learn directions in the latent space such that the effects of the corresponding image transformations are “disentangled”. More formally, the sets of pairs of original and shifted images for different directions are easy to distinguish from each other by a CNN classifier, which is trained jointly with the directions.
We use the authors’ implementation (https://github.com/anvoynov/GANLatentDiscovery) with default hyperparameters and the number of directions . After learning has converged, we inspect the directions manually and keep only the directions that are interpretable. Several directions revealed by the procedure are provided in Figure 1. Compared to the results from Voynov and Babenko (2020) for the “supervised” conditional BigGAN, the BigBiGAN latent space does not possess any directions that have a clear “background removal” effect. However, one of the directions has an effect that can be used to distinguish between object and background pixels. The corresponding transformation “Saliency lighting” is presented in Figure 1 and we refer to this direction as . As one can see, moving in this direction makes the object pixels lighter, while the background pixels become darker. Therefore, although BigBiGAN is completely unsupervised, its latent space can be used to obtain saliency masks for generated images. Technically, we produce a binary saliency mask for an image by comparing its intensity with the “shifted” image after grayscale conversion. As a shift magnitude, we always use .
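The mask computation described above can be sketched as follows. This is a minimal illustration and not the authors' code: images are plain nested lists of RGB tuples to keep the sketch dependency-free, the BT.601 luma weights are an assumption (the text only says "grayscale conversion"), and the function names `to_gray` and `saliency_mask` are hypothetical. A pixel is marked as foreground when it becomes lighter after the shift, following the observation that object pixels get lighter and background pixels get darker.

```python
def to_gray(image):
    """Convert an H x W image of (r, g, b) tuples to H x W luma values."""
    # BT.601 luma weights are an assumption; the text only mentions
    # "grayscale conversion" without naming the weights.
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in image]


def saliency_mask(original, shifted):
    """Mark a pixel as foreground (1) when the shifted image is brighter there."""
    gray_orig = to_gray(original)
    gray_shift = to_gray(shifted)
    return [
        [1 if s > o else 0 for o, s in zip(row_o, row_s)]
        for row_o, row_s in zip(gray_orig, gray_shift)
    ]


# Toy 2 x 2 example: the left column gets lighter (object),
# the right column gets darker (background).
original = [[(100, 100, 100), (100, 100, 100)],
            [(50, 60, 70), (200, 200, 200)]]
shifted = [[(180, 180, 180), (20, 20, 20)],
           [(90, 100, 110), (120, 120, 120)]]
mask = saliency_mask(original, shifted)
```

In a real pipeline the two inputs would be the generator outputs for a latent code and for the same code shifted along the saliency direction.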
3.2 Improving saliency masks.
Here we describe a few tricks that increase the quality of the masks for the particular segmentation task.
Adaptation to the particular segmentation task.
In the scheme above the latent codes are sampled from the standard Gaussian distribution. To make the distribution of generated images closer to the particular dataset at hand , we aim to sample from the latent space regions that are close to the latent codes of . To this end, we use the BigBiGAN encoder to compute the latent representations and sample the codes from the neighborhood of these representations. Formally, the samples have the form . Here denotes the neighborhood size and it should be larger for small to prevent overfitting. In particular, we use for Imagenet and for all other cases.
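A rough sketch of this sampling scheme is given below. It assumes that a sample is an encoder embedding of a real image plus isotropic Gaussian noise scaled by the neighborhood size; this additive form, the parameter name `alpha`, and the function name `sample_near_embeddings` are assumptions made for illustration. The toy 3-dimensional vectors stand in for BigBiGAN encoder outputs.

```python
import random


def sample_near_embeddings(embeddings, alpha, n_samples, seed=0):
    """Draw latent codes of the form e + alpha * noise, noise ~ N(0, I).

    The additive Gaussian form is an assumption of this sketch.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        base = rng.choice(embeddings)  # latent code of a random real image
        samples.append([v + alpha * rng.gauss(0.0, 1.0) for v in base])
    return samples


# Hypothetical embeddings standing in for encoder outputs on the dataset at hand.
embeddings = [[0.0, 1.0, -1.0], [2.0, 0.5, 0.0]]
codes = sample_near_embeddings(embeddings, alpha=0.2, n_samples=4)
# With alpha = 0 the samples coincide with the embeddings themselves.
exact = sample_near_embeddings(embeddings, alpha=0.0, n_samples=4)
```

A larger `alpha` spreads the samples further from the embeddings, which is the mechanism the text relies on to increase diversity for small datasets.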
Mask size filtering. Since some of the BigBiGAN-produced images are low-quality and do not contain clear objects, the corresponding masks can result in a very noisy supervision. To avoid this, we apply a simple filtering that excludes the images where the ratio of foreground pixels exceeds .
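The mask size filter amounts to a single ratio check, sketched below. The cutoff `max_ratio=0.5` in the usage lines is a placeholder chosen for illustration, not a value taken from the text, and the function names are hypothetical.

```python
def foreground_ratio(mask):
    """Fraction of pixels labeled as foreground in a binary mask."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def keep_by_mask_size(mask, max_ratio):
    """Keep a sample only if its foreground ratio does not exceed max_ratio."""
    return foreground_ratio(mask) <= max_ratio


small_object = [[1, 0], [0, 0]]   # 25% foreground
mostly_object = [[1, 1], [1, 0]]  # 75% foreground
# max_ratio=0.5 is a placeholder cutoff, not the paper's setting.
keep_small = keep_by_mask_size(small_object, max_ratio=0.5)
keep_large = keep_by_mask_size(mostly_object, max_ratio=0.5)
```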
Histogram filtering. Since should have mostly dark and light pixels, we filter out the images that are not contrastive enough. Formally, we compute the intensity histogram with bins for the grayscaled . Then we smooth it by taking the moving average with a window and filter out the samples that have local maxima outside the first/last buckets of the histogram.
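One possible reading of the histogram filter is sketched below. Several details are assumptions: intensities are taken to lie in [0, 1], the moving average is centered and truncated at the borders, and a plateau counts as a local maximum when it rises above at least one neighbor. The values `n_bins=10` and `window=3` in the usage lines are placeholders, and all function names are hypothetical.

```python
def intensity_histogram(gray, n_bins):
    """Histogram of intensities in [0, 1] with n_bins equal-width buckets."""
    hist = [0] * n_bins
    for row in gray:
        for v in row:
            idx = min(int(v * n_bins), n_bins - 1)  # v == 1.0 goes to the last bucket
            hist[idx] += 1
    return hist


def moving_average(values, window):
    """Centered moving average; the window is truncated at the borders."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def is_contrastive(gray, n_bins, window):
    """Return False if the smoothed histogram peaks outside the first/last bucket."""
    smooth = moving_average(intensity_histogram(gray, n_bins), window)
    for i in range(1, n_bins - 1):  # interior buckets only
        left, mid, right = smooth[i - 1], smooth[i], smooth[i + 1]
        # A peak (or the edge of a plateau) at mid intensity rejects the sample.
        if mid >= left and mid >= right and (mid > left or mid > right):
            return False
    return True


bimodal = [[0.05, 0.02, 0.95, 0.98]]  # only very dark and very light pixels
flat_gray = [[0.5, 0.5, 0.5, 0.5]]    # mid-gray everywhere
keep_bimodal = is_contrastive(bimodal, n_bins=10, window=3)
keep_flat = is_contrastive(flat_gray, n_bins=10, window=3)
```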
Connected components filtering. For each generated mask we group the foreground pixels into connected (by edges) groups forming clusters . Assuming that is the cluster with the maximal area, we exclude all the clusters with . This technique makes it possible to remove visual artifacts from the synthetic data.
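The connected components filter can be sketched with a breadth-first search over 4-connected (edge-sharing) neighbors. The exclusion rule used here, dropping clusters whose area is below a fraction `min_fraction` of the largest cluster, is an assumption about the form of the criterion; the value 0.5 in the usage lines is a placeholder and the function names are hypothetical.

```python
from collections import deque


def connected_components(mask):
    """Group foreground pixels into 4-connected (edge-sharing) clusters."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    clusters = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                seen[y][x] = True
                queue, cluster = deque([(y, x)]), []
                while queue:
                    cy, cx = queue.popleft()
                    cluster.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                clusters.append(cluster)
    return clusters


def drop_small_clusters(mask, min_fraction):
    """Zero out clusters smaller than min_fraction times the largest cluster."""
    clusters = connected_components(mask)
    cleaned = [[0] * len(row) for row in mask]
    if not clusters:
        return cleaned
    largest = max(len(c) for c in clusters)
    for cluster in clusters:
        if len(cluster) >= min_fraction * largest:
            for y, x in cluster:
                cleaned[y][x] = 1
    return cleaned


mask = [[1, 1, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
# min_fraction=0.5 is a placeholder, not the paper's setting.
cleaned = drop_small_clusters(mask, min_fraction=0.5)
```

In the toy mask the isolated pixel on the right is removed, while the 2 x 2 block is kept.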
3.3 Training model on synthetic data
Given a large amount of synthetic data, one can train one of the existing image-to-image CNN architectures in the fully-supervised regime. The whole pipeline is schematically presented in Figure 4. In all our experiments we employ a simple U-net architecture Ronneberger et al. (2015). We train U-net on the synthetic dataset with the Adam optimizer and the binary cross-entropy objective applied on the pixel level. We perform steps with batch 95. The initial learning rate equals and is decreased by on step . During inference, we rescale an input image to have a size 128 along its shorter side and scale the color channels to . Compared to existing unsupervised alternatives, the training of our model is extremely simple and does not involve a large number of hyperparameters. The only hyperparameters in our protocol are the batch size, the learning rate schedule, and the number of optimizer steps, and we tune them on the hold-out validation set of synthetic data. Training with on-line synthetic data generation takes approximately seven hours on two Nvidia 1080Ti cards.
The goal of this section is to confirm that the usage of GAN-produced synthetic data is a promising direction for unsupervised saliency detection and object segmentation. To this end, we extensively compare our approach to the existing unsupervised counterparts on the standard benchmarks.
Evaluation metrics. All the methods are compared in terms of the three measures described below.
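F-measure is an established measure in the saliency detection literature. It is defined as $F_\beta = \frac{(1+\beta^2)\,\mathrm{Precision}\cdot\mathrm{Recall}}{\beta^2\,\mathrm{Precision}+\mathrm{Recall}}$, with $\mathrm{Precision} = \frac{TP}{TP+FP}$ and $\mathrm{Recall} = \frac{TP}{TP+FN}$, where TP, TN, FP, FN denote true-positive, true-negative, false-positive, and false-negative, respectively. We compute F-measure for 255 uniformly distributed binarization thresholds and report its maximum value, max $F_\beta$. We use for consistency with existing works.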
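As an illustration, the maximum F-measure over 255 thresholds could be computed as in the sketch below. The standard weighted F-measure form is used, predictions are assumed to lie in [0, 1], and the default `beta2=0.3` is an assumption based on common practice in the saliency literature rather than a value taken from the text. The function names are hypothetical.

```python
def f_measure(pred, gt, threshold, beta2):
    """Weighted F-measure of a soft prediction binarized at the given threshold."""
    tp = fp = fn = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            positive = p > threshold
            if positive and g:
                tp += 1
            elif positive and not g:
                fp += 1
            elif not positive and g:
                fn += 1
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)


def max_f_measure(pred, gt, beta2=0.3, n_thresholds=255):
    """Maximum F-measure over uniformly spaced binarization thresholds."""
    # beta2=0.3 is an assumed default, not a value confirmed by the text.
    thresholds = [i / n_thresholds for i in range(n_thresholds)]
    return max(f_measure(pred, gt, t, beta2) for t in thresholds)


gt = [[1, 1], [0, 0]]
good_pred = [[0.9, 0.8], [0.2, 0.1]]
bad_pred = [[0.1, 0.2], [0.9, 0.8]]
best_good = max_f_measure(good_pred, gt)
best_bad = max_f_measure(bad_pred, gt)
```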
IoU (intersection over union) is calculated on the binarized predicted masks and groundtruth as , where denotes the area. The binarization threshold is set to .
Accuracy measures the proportion of pixels that have been correctly assigned to the object/background. The binarization threshold for masks is set to .
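The IoU and accuracy measures can be sketched in the same style. The binarization threshold of 0.5 in the usage lines is a placeholder for illustration, and returning 1.0 for two empty masks is a convention chosen here, not one stated in the text.

```python
def binarize(pred, threshold):
    """Turn a soft prediction into a binary mask."""
    return [[1 if p > threshold else 0 for p in row] for row in pred]


def iou(pred_mask, gt_mask):
    """Intersection over union of two binary masks."""
    inter = union = 0
    for row_p, row_g in zip(pred_mask, gt_mask):
        for p, g in zip(row_p, row_g):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks agree perfectly; this convention is an assumption.
    return inter / union if union else 1.0


def accuracy(pred_mask, gt_mask):
    """Proportion of pixels assigned to the correct class."""
    correct = total = 0
    for row_p, row_g in zip(pred_mask, gt_mask):
        for p, g in zip(row_p, row_g):
            correct += 1 if bool(p) == bool(g) else 0
            total += 1
    return correct / total


pred = [[0.9, 0.6], [0.4, 0.1]]
gt = [[1, 0], [1, 0]]
pred_mask = binarize(pred, threshold=0.5)  # 0.5 is a placeholder threshold
score_iou = iou(pred_mask, gt)
score_acc = accuracy(pred_mask, gt)
```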
Since the existing literature uses different benchmark datasets for saliency detection and object segmentation, we perform a separate comparison for each task below.
4.1 Object segmentation.
Datasets. We use the following two datasets from the literature on segmentation with generative models.
Caltech-UCSD Birds 200-2011 Wah et al. (2011) contains 11788 photographs of birds with segmentation masks. We follow Chen et al. (2019), and use 10000 images for our training subset and 1000 for the test subset from splits provided by Chen et al. (2019). Unlike Chen et al. (2019), we do not use any images for validation and simply omit the remaining 788 images.
Flowers Nilsback and Zisserman (2007) contains 8189 images of flowers equipped with saliency masks generated automatically via the method developed for flowers. In experiments with the Flowers dataset, we do not apply the mask area filter in our method, since it rejects most of the samples.
On these two datasets we compare the following methods:
BigBiGAN is our method where the latent codes are sampled from .
E-BigBiGAN (w/o -noising) is our method where the latent codes of synthetic data are sampled from the outputs of the encoder applied to the train images of the dataset at hand.
E-BigBiGAN (with -noising) same as above with latent codes sampled from the vicinity of the embeddings with the neighborhood size set to .
The comparison results are provided in Table 1, which demonstrates the significant advantage of our scheme. Note that since both datasets in this comparison are small-scale, -noising considerably improves the performance by increasing the diversity of training images.
Table 1: The comparison of unsupervised object segmentation methods. For our model we report the performance averaged over ten runs. For the best model we also report the standard deviation values.
4.2 Saliency detection.
Datasets. We use the following established benchmarks for saliency detection. For all the datasets groundtruth pixel-level saliency masks are available.
Baselines. While there is a large number of papers on unsupervised deep saliency detection, all of them employ pretrained supervised models in their training protocols. Therefore, we use the most recent “shallow” methods HS Yan et al. (2013), wCtr Zhu et al. (2014a), and WSC Li et al. (2015) as the baselines. These three methods were chosen based on their state-of-the-art performance reported in the literature and publicly available implementations. The results of the comparison are reported in Table 2. In this table, BigBiGAN denotes the version of our method where the latent codes of synthetic images are sampled from . In turn, in E-BigBiGAN, the codes are sampled from the latent codes of Imagenet-train images, for all three datasets. Since the Imagenet dataset is large enough, we do not employ -noising in this comparison.
As one can see, our method mostly outperforms the competitors by a considerable margin, which confirms the promise of using synthetic imagery in the unsupervised scenario. Several qualitative segmentation samples are provided in Figure 5.
4.3 Weakly-supervised object localization (WSOL)
A problem closely related to segmentation is object localization, where for a given image one has to provide a bounding box instead of a segmentation mask. In this section, we demonstrate that our unsupervised method performs on par with the weakly-supervised state of the art. To compare with the previous literature, we use the numbers from the very recent evaluation paper Choe et al. (2020), which reviews a large number of existing WSOL methods and reports the current state of the art. We employ exactly the same evaluation protocols as in Choe et al. (2020) and compare the prior works with our E-BigBiGAN method, which samples from the latent codes of Imagenet-train images, as described in Section 3.2. The comparison results are provided in Table 3.
Evaluation metrics. For the WSOL problem we use the following metrics Choe et al. (2020):
MaxBoxAcc Russakovsky et al. (2015); Zhou et al. (2016). For an image , let us have a predicted mask and a set of ground truth bounding boxes for (some datasets can provide several bounding boxes per image). Let us select a threshold and denote the largest (in terms of the area) connected component of the mask binarized with threshold . Let us denote with the minimal bounding box containing the set . Then we define
where corresponds to the ground truth bounding box with the maximal IoU with and denotes the number of images. Then the final metric MaxBoxAcc is the maximum of over all thresholds .
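The box-level part of this metric can be sketched as follows, taking the predicted boxes as given (their extraction from the largest connected component of the binarized mask is omitted). The sketch assumes that an image counts as correctly localized when the best IoU with any ground truth box reaches a cutoff `delta`; both this criterion and the value 0.5 are assumptions, and the function names are hypothetical. Boxes are `(x0, y0, x1, y1)` tuples with exclusive upper coordinates.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1) with exclusive x1, y1."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def box_accuracy(pred_boxes, gt_boxes_per_image, delta):
    """Share of images whose predicted box matches some ground truth box."""
    hits = 0
    for pred, gts in zip(pred_boxes, gt_boxes_per_image):
        best = max(box_iou(pred, gt) for gt in gts)  # best-matching ground truth
        if best >= delta:
            hits += 1
    return hits / len(pred_boxes)


def max_box_accuracy(pred_boxes_per_threshold, gt_boxes_per_image, delta):
    """Maximum box accuracy over mask binarization thresholds."""
    return max(box_accuracy(boxes, gt_boxes_per_image, delta)
               for boxes in pred_boxes_per_threshold)


gts = [[(0, 0, 10, 10), (20, 20, 30, 30)], [(2, 2, 10, 10)]]
# Predicted boxes for two hypothetical binarization thresholds.
preds_low = [(0, 0, 10, 10), (0, 0, 4, 4)]
preds_high = [(0, 0, 10, 10), (2, 2, 10, 10)]
acc_low = box_accuracy(preds_low, gts, delta=0.5)  # delta=0.5 is assumed
best_acc = max_box_accuracy([preds_low, preds_high], gts, delta=0.5)
```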
PxAP Achanta et al. (2009). Let us have a predicted mask and ground truth mask . For a threshold we define a pixel precision and recall
We average both values over all images and then PxAP is defined as the area under curve of the pixel precision-recall curve.
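A simplified sketch of this computation is shown below. It assumes soft predictions in [0, 1], averages per-image precision and recall at each threshold, and integrates the curve with a rectangle rule while sweeping thresholds from high to low so that recall is non-decreasing. The integration rule, the convention that precision is 1.0 when nothing is predicted as foreground, and the function names are assumptions of the sketch.

```python
def pixel_precision_recall(pred, gt, threshold):
    """Pixel precision and recall of one prediction binarized at a threshold."""
    tp = fp = fn = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            positive = p > threshold
            if positive and g:
                tp += 1
            elif positive and not g:
                fp += 1
            elif not positive and g:
                fn += 1
    # Conventions for empty denominators are assumptions of this sketch.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pxap(preds, gts, thresholds):
    """Area under the image-averaged pixel precision-recall curve."""
    area, prev_recall = 0.0, 0.0
    # Lowering the threshold can only increase recall, so sweep high to low.
    for t in sorted(thresholds, reverse=True):
        pairs = [pixel_precision_recall(p, g, t) for p, g in zip(preds, gts)]
        precision = sum(pr for pr, _ in pairs) / len(pairs)
        recall = sum(rc for _, rc in pairs) / len(pairs)
        area += precision * (recall - prev_recall)  # rectangle rule
        prev_recall = recall
    return area


preds = [[[0.9, 0.1], [0.8, 0.2]]]  # one image, perfectly ranked pixels
gts = [[[1, 0], [1, 0]]]
score = pxap(preds, gts, thresholds=[0.0, 0.25, 0.5, 0.75, 0.95])
```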
Datasets. We use the following benchmarks for weakly-supervised object localization.
| Method | Imagenet (MaxBoxAcc) | CUB (MaxBoxAcc) | OpenImages (PxAP) |
|---|---|---|---|
| Previous SOTA Choe et al. (2020) | 0.654 | 0.781 | 0.630 |
In Table 4 we demonstrate the impacts of individual components in our method. First, we start with a saliency detection model trained on the synthetic data pairs with . Then we add one by one the components listed in Section 3.2. The most significant performance impact comes from using the latent codes of the real images from the Imagenet.
In our paper, we continue the line of works on unsupervised object segmentation with the aid of generative models. While the existing unsupervised techniques require adversarial training, we introduce an alternative research direction, based on the high-quality synthetic data from the off-the-shelf GAN. Namely, we utilize the images produced by the BigBiGAN model, which is trained on the Imagenet dataset. Exploring BigBiGAN, we have discovered that its latent space semantics make it possible to produce saliency masks for synthetic images automatically via latent space manipulations. As shown in experiments, this synthetic data is an excellent source of supervision for discriminative computer vision models. The main feature of our approach is its simplicity and reproducibility, since our model does not rely on a large number of components/hyperparameters. On several common benchmarks, we demonstrate that our method achieves superior performance compared to existing unsupervised competitors.
We also highlight the fact that the state-of-the-art generative models, such as BigBiGAN, can be successfully used to generate training data for yet another computer vision task. We expect that other problems such as semantic segmentation can also benefit from the usage of GAN-produced data in the weakly-supervised or few-shot regimes. Since the quality of GANs will likely improve in the future, we expect that the usage of synthetic data will become increasingly widespread.
References
- Frequency-tuned salient region detection. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1597–1604.
- Large-scale interactive object segmentation with human annotators. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11700–11709.
- Emergence of object segmentation in perturbed generative models. In Advances in Neural Information Processing Systems, pp. 7254–7264.
- Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
- Unsupervised object segmentation by redrawing. In Advances in Neural Information Processing Systems, pp. 12705–12716.
- Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (3), pp. 569–582.
- Evaluating weakly supervised object localization methods right. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Co-saliency detection for RGBD images based on multi-constraint feature matching and cross label propagation. IEEE Transactions on Image Processing 27 (2), pp. 568–579.
- ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
- Large scale adversarial representation learning. In Advances in Neural Information Processing Systems, pp. 10541–10551.
- Video saliency detection using object proposals. IEEE Transactions on Cybernetics 48 (11), pp. 3159–3170.
- Invariant information clustering for unsupervised image classification and segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9865–9874.
- Salient object detection: a discriminative regional feature integration approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2083–2090.
- Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems, pp. 109–117.
- A weighted sparse coding framework for saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5216–5223.
- Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- DeepUSPS: deep robust unsupervised saliency prediction via self-supervision. In Advances in Neural Information Processing Systems, pp. 204–214.
- Delving into the whorl of flower segmentation. In BMVC, Vol. 2007, pp. 1–10.
- Salient object detection via structured matrix decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4), pp. 818–832.
- U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
- Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
- Hierarchical image saliency detection on extended CSSD. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (4), pp. 717–729.
- Unsupervised discovery of interpretable directions in the GAN latent space. arXiv preprint arXiv:2002.03754.
- The Caltech-UCSD Birds-200-2011 dataset. California Institute of Technology.
- Learning to detect salient objects with image-level supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 136–145.
- Learning to detect salient objects with image-level supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 136–145.
- Salient object detection in the deep learning era: an in-depth survey. arXiv preprint arXiv:1904.09146.
- Correspondence driven saliency transfer. IEEE Transactions on Image Processing 25 (11), pp. 5025–5034.
- Geodesic saliency using background priors. In IEEE, ICCV.
- W-net: a deep model for fully unsupervised image segmentation. arXiv preprint arXiv:1711.08506.
- Sun database: large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3485–3492.
- Hierarchical saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1155–1162.
- Saliency detection via graph-based manifold ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3166–3173.
- Supervision by fusion: towards unsupervised learning of deep salient object detector. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4048–4056.
- Deep unsupervised saliency detection: a multiple noisy labeling perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9029–9038.
- Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Saliency optimization from robust background detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2814–2821.
- Saliency optimization from robust background detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2814–2821.