A crucial part of human intelligence is scene understanding, which means decomposing a scene into objects and discovering their relationships. Here we focus on object segmentation, an important method for scene decomposition. In the supervised setting, recent methods Long_2015_CVPR ; Novotny_2018_ECCV mainly rely on convolutional neural networks (CNNs) or their variants to minimize the discrepancy between generated object masks and the ground truth. In the unsupervised setting, traditional pixel clustering methods delong2008scalable have led to more sophisticated CNN-based image clustering methods and losses kkanezaki2018_unsupervised_segmentation . Other models achieve state-of-the-art performance by learning object masks and image reconstruction burgess2019monet through one network IODINE or two separate networks. Inspired by the observation that typical unsupervised models learn to reconstruct relatively simple backgrounds in early epochs and more complicated details in later epochs, we believe that different parts of an image have different levels of reconstruction difficulty. We argue that the higher-level details are the objects, which usually do not share characteristics with the background and are thus harder to reconstruct during the first few epochs.
We approach this problem through pixel-level clustering and iterative object segmentation, similar to MONet burgess2019monet . We propose a network that removes MONet’s attention network and segments objects one by one through a reconstruction quality mask. MONet’s attention network does not consult the reconstructed image before the network parameters are updated, which may lead to inconsistencies between the reconstruction and the attention mask. Our model instead uses the reconstructed image to decide which area to focus on, which yields a consistent approach: “where the network reconstructs is where it focuses.” In our method, the first step segments the background, and later steps “fill in” the image details missing from the “first impression.” Objects are considered higher-level details of the scene that cannot be easily reconstructed by the “first impression” of the scene, i.e., the background. Moreover, a group of pixels should be considered an object only if its parts move as a whole (except for deformable objects). Thus, objects should be identified through a clear and explicit localization mask.
Our contributions are as follows. We propose a new algorithm that localizes the areas needed for object segmentation by directly measuring reconstruction quality. This coarse-grained estimate of the focused area is then refined by a Gaussian mixture model with very few components, which clusters the pixels in that area to obtain a detailed boundary for the object. More importantly, compared with models that rely on attention masks output by a network, our model has an explicit localization module that guarantees that object masks are localized. We show that this explicit localization module is necessary for more complicated datasets, such as Montezuma’s Revenge, where some objects share similar colors but differ in modeling difficulty.
A review of MONet is in Section 2.1, and the motivation for our model design is in Section 2.2. Section 2.3 discusses how we measure reconstruction quality, and Section 2.4 explains how we conduct local clustering with a GMM. We observe that the reconstruction quality of the background and that of the objects are adversarial to each other, so we propose a method that mitigates this effect in Section 2.5. An overview of our model and algorithm is in Section 2.6.
2.1 MONet Overview
As a recent work on unsupervised scene decomposition, MONet is an important framework on which our model is based. The main idea of MONet is to learn to reconstruct the input image by identifying one object at a time. To achieve this goal, an attention model is trained to output an attention mask that is intended to focus on a single object in a given image. A VAE is trained to reconstruct the object covered by this attention mask. Furthermore, the VAE also tries to recover the attention mask produced by the attention model in order to stabilize training. The network is tuned so that the “background” is always given attention in the first step.
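To be specific, let $s_k$ denote the “unexplained ratio” of each pixel after the $k$-th step, $m_k$ the “explained ratio” at the $k$-th step, and $x$ the original input image. The idea of MONet can then be written as:

$$s_0 = \mathbf{1}, \qquad m_k = s_{k-1} \odot \alpha_\psi(x; s_{k-1}), \qquad s_k = s_{k-1} \odot \bigl(1 - \alpha_\psi(x; s_{k-1})\bigr) \quad (k = 1, \dots, K-1), \qquad m_K = s_{K-1},$$

where $\alpha_\psi$ is a trainable attention network parameterized by $\psi$ and $K$ is the total number of segmentation steps. We adopt this framework but use a different method, detailed in the following subsections, to find the mask at each step.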
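The step-by-step bookkeeping of unexplained and explained ratios described above can be sketched in a few lines of NumPy. This is a minimal illustration, not MONet's implementation: `toy_attention` is a hypothetical stand-in for the trained attention network, and the names `scope` (the unexplained ratio) and `masks` (the explained ratios) are chosen here for readability.

```python
import numpy as np

def decompose(image, attention, num_steps):
    """Split an image into `num_steps` soft masks that sum to 1 at every pixel."""
    scope = np.ones(image.shape[:2])      # unexplained ratio before the first step
    masks = []
    for _ in range(num_steps - 1):
        alpha = attention(image, scope)   # per-pixel attention in [0, 1]
        masks.append(scope * alpha)       # explained ratio at this step
        scope = scope * (1.0 - alpha)     # what is still unexplained
    masks.append(scope)                   # the last step takes the remainder
    return np.stack(masks)

def toy_attention(image, scope):
    # Hypothetical placeholder for a trained network: favor bright pixels.
    return np.where(image.mean(axis=-1) > 0.5, 0.95, 0.05)

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
masks = decompose(image, toy_attention, num_steps=3)
```

Because each step only redistributes the remaining scope, the masks sum to one at every pixel regardless of what the attention function returns.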
2.2 Model Motivation
The idea of MONet, which is essentially to use attention as masks and cover the image step by step, is natural. In practice, however, we found it very hard to tune the hyperparameters to learn a good attention model. Specifically, it is hard to make sure that the CNN-based attention model finds masks for objects. This motivates us to propose a new method with an explicit nonparametric localization module that helps find objects. To produce better object boundaries, we considered simpler clustering methods such as GMM or KNN and used traditional features such as color or location. In addition, instead of directly finding an accurate attention mask for each object, we first find a larger area that contains the object and then, in a second step, separate the object from this local area. The smaller area makes it possible to use a simple clustering algorithm to detect the object among the remaining not-yet-covered parts.
2.3 Reconstruction Quality Estimates and Coarse Grained Attention
We first measure the quality of the reconstruction by the pixel-wise squared error between the reconstruction and the input image. The pixel-wise reconstruction quality is computed from the RGB vectors of the reconstructed image and of the original image at each pixel location, with a hyperparameter for the variance.

To locate an object roughly by reconstruction quality, both the reconstruction of a pixel and that of its neighboring pixels matter. To evaluate this, we apply a non-trainable convolution kernel with SAME padding to the pixel-wise quality map. Relying on non-trainable kernels lets us avoid trivial solutions found by neural networks. The output of this convolution indicates how well the network reconstructs each local region; partially overlapping regions are allowed. We then find the location with the highest output value. This pair of coordinates indicates the center of the region with the best reconstruction quality, and our model generates a rough attention mask around it. Given our rough assumption that attention peaks at 1 in the center and decays gradually, we choose a Butterworth filter ctx3916707340003821 to model attention in a region. Mathematically, a Butterworth filter is

$$B(d) = \frac{1}{1 + (d / d_0)^{2n}}.$$

Here $d$ is the distance from a point to the center, and $d_0$ and $n$ are hyperparameters. With the center point determined as above, we can easily obtain the local mask from this filter.

In practice we treat the horizontal and vertical coordinates independently when computing the Butterworth filter attention at each pixel.
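The coarse attention step described in this subsection can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the exact implementation: the Gaussian-shaped quality score, the box (mean) kernel, and the parameter names and defaults `sigma`, `ksize`, `d0`, and `order` are all hypothetical, and the two axes are combined by a product.

```python
import numpy as np

def pixel_quality(recon, image, sigma=0.1):
    # Assumed form: squared RGB error mapped to (0, 1]; 1 means a perfect match.
    err = np.sum((recon - image) ** 2, axis=-1)
    return np.exp(-err / (2.0 * sigma ** 2))

def box_filter_same(q, ksize=5):
    # Fixed (non-trainable) mean kernel with zero "SAME" padding; ksize must be odd.
    pad = ksize // 2
    padded = np.pad(q, pad, mode="constant")
    out = np.zeros_like(q)
    for di in range(ksize):
        for dj in range(ksize):
            out += padded[di:di + q.shape[0], dj:dj + q.shape[1]]
    return out / ksize ** 2

def butterworth(d, d0, order):
    # Equals 1 at the center (d = 0) and decays smoothly with distance.
    return 1.0 / (1.0 + (d / d0) ** (2 * order))

def coarse_attention(recon, image, d0=4.0, order=2):
    q = pixel_quality(recon, image)
    region_score = box_filter_same(q)
    # Center of the best-reconstructed local region.
    ci, cj = np.unravel_index(np.argmax(region_score), region_score.shape)
    rows = np.arange(q.shape[0])
    cols = np.arange(q.shape[1])
    # Treat the two axes independently and combine them by a product (assumption).
    mask = np.outer(butterworth(np.abs(rows - ci), d0, order),
                    butterworth(np.abs(cols - cj), d0, order))
    return (ci, cj), mask

# Toy input: the reconstruction is poor everywhere except a 4x4 patch.
image = np.full((16, 16, 3), 0.5)
recon = image + 1.0
recon[8:12, 3:7] = image[8:12, 3:7]
(ci, cj), mask = coarse_attention(recon, image)
```

In this toy input the selected center falls inside the well-reconstructed patch, and the mask decays from 1 at that center toward 0 far away from it.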
2.4 Local Clustering through GMM
A Gaussian mixture model (GMM) can recover components when the samples come from a mixture of separate Gaussian distributions. In this segmentation task, we can approximate the distribution of pixels by treating them as 5-dimensional Gaussian random variables, whose dimensions are the three RGB channels and the two pixel coordinates.

For a general GMM with a known number of groups and some initialization of the group means, each sample is softly assigned to the groups through a kernel function, and the means are updated from these assignments. Since we assume each group follows a Gaussian distribution, we choose the multivariate Gaussian density function as our kernel.

For our problem, the pixels in which we are interested might have been partially or totally explained by previous steps, so they should have less or no impact on the clustering process in later steps. We therefore give each pixel a “weight” that indicates its importance. The weight is easy to find: we simply use the Butterworth filter weight. In practice we found that applying a hard threshold gives clearer segmentation, so we use the thresholded weight.
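A weighted clustering step of this kind can be sketched as follows. This is a textbook EM loop for a diagonal-covariance GMM with per-pixel weights, offered as an assumption-laden illustration and not necessarily the update rule used here; the function names, the diagonal covariance, and the variance floor are all hypothetical choices. In the setting described above, `weights` would come from the thresholded Butterworth mask; the toy example simply uses all ones.

```python
import numpy as np

def pixel_features(image):
    # One 5-d vector per pixel: RGB plus (row, col) coordinates scaled to [0, 1].
    h, w, _ = image.shape
    rows, cols = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
    feats = np.concatenate([image, rows[..., None], cols[..., None]], axis=-1)
    return feats.reshape(-1, 5)

def weighted_gmm(feats, weights, means, n_iter=20, var_floor=1e-4):
    # EM in which pixel i contributes to the parameter updates in proportion to weights[i].
    k = means.shape[0]
    means = means.astype(float).copy()
    variances = np.ones_like(means)
    mix = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: log Gaussian density of every pixel under every component.
        diff = feats[:, None, :] - means[None, :, :]
        log_p = -0.5 * np.sum(diff ** 2 / variances + np.log(2 * np.pi * variances), axis=-1)
        log_p += np.log(mix)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: zero-weight pixels have no influence on the parameters.
        wr = resp * weights[:, None]
        total = wr.sum(axis=0) + 1e-12
        means = (wr.T @ feats) / total[:, None]
        diff = feats[:, None, :] - means[None, :, :]
        variances = np.einsum("nk,nkd->kd", wr, diff ** 2) / total[:, None] + var_floor
        mix = total / total.sum()
    return means, resp.argmax(axis=1)

# Toy image: left half red, right half blue.
image = np.zeros((8, 8, 3))
image[:, :4, 0] = 1.0
image[:, 4:, 2] = 1.0
feats = pixel_features(image)
weights = np.ones(len(feats))
means, labels = weighted_gmm(feats, weights, means=feats[[0, 7]])
labels = labels.reshape(8, 8)
```

With only two components and a small local area, a simple clustering step like this is enough to separate the two regions in the toy image.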
2.5 Adversary Between Background and Object Reconstruction
A problem we observe when using reconstruction quality as masks is that the more detail the background captures during training, the worse the segmentation result becomes, because less room is left for reconstructing objects in later iterations. To mitigate this problem, we generate a mask for the background by subtracting the attention masks of all intermediate objects. The portion of the image that should be captured by the background in the first iteration is then the input image masked by this background mask.
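The background mask described above can be sketched as follows. This is a minimal illustration that assumes the subtraction starts from an all-ones mask and that the result is clipped to [0, 1]; `background_target` and its arguments are hypothetical names.

```python
import numpy as np

def background_target(image, object_masks):
    # object_masks: (num_objects, H, W) attention masks with values in [0, 1].
    bg_mask = np.clip(1.0 - object_masks.sum(axis=0), 0.0, 1.0)
    # The background is only asked to explain the part no object has claimed.
    return bg_mask, image * bg_mask[..., None]

image = np.full((4, 4, 3), 0.8)
object_masks = np.zeros((2, 4, 4))
object_masks[0, 0, 0] = 1.0      # first object fully owns pixel (0, 0)
object_masks[1, 3, 3] = 0.5      # second object half owns pixel (3, 3)
bg_mask, target = background_target(image, object_masks)
```

Pixels fully claimed by an object are removed from the background target, so the background has no incentive to model them.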
2.6 Model Overview
Our model uses a VAE kingma2013auto for simple datasets or an auto-encoder with skip connections for complicated datasets. We assume there are $K$ objects, including the background, and our model keeps track of the not-yet-explained areas with a remaining-mask. As an alternative, we can use a stick-breaking process to find the right value of $K$. In every iteration, our model tries to reconstruct the unexplained part, and we evaluate the reconstruction quality. A basic observation is that VAEs and auto-encoders learn the representation of frequently observed items first, so the reconstruction quality of these items improves faster than that of other areas. Thus, except for the first iteration, where the background is found, we can locate an object roughly by finding the area where the reconstruction quality is high. We then look into that area and decide whether each pixel belongs to the object. In the first iteration, we directly use the reconstruction quality of each pixel as its ratio explained by the background.

In the following iterations, the image and the remaining-mask are input to the network again for reconstruction. Unlike in the first iteration, a location-sensitive mask is derived from the location where the reconstruction quality is best. The object component mask is then computed, the remaining-mask is updated, and the next iteration starts. For the first iteration, which handles the background, no location-sensitive mask is applied.
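The mask bookkeeping across iterations can be sketched as follows. This is an illustration under an assumed multiplicative form, in which each component mask is the product of the remaining-mask, the location-sensitive mask, and a per-pixel score; the actual update rule may differ, and `step_masks` and its arguments are hypothetical names.

```python
import numpy as np

def step_masks(remaining, pixel_score, location_mask=None):
    # Iteration 0 handles the background: no location-sensitive mask is applied.
    if location_mask is None:
        location_mask = np.ones_like(remaining)
    component = remaining * location_mask * pixel_score
    return remaining - component, component

rng = np.random.default_rng(0)
remaining = np.ones((6, 6))
components = []
for k in range(3):
    score = rng.random((6, 6))                      # stand-in for a per-pixel score
    loc = None if k == 0 else rng.random((6, 6))    # stand-in for the location mask
    remaining, component = step_masks(remaining, score, loc)
    components.append(component)
```

Under this form, the component masks and the final remaining-mask always add up to one at every pixel, so no part of the image is explained twice.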
The quantities used by the algorithm are the number of channels of the input image, the trainable weights of the encoder-decoder, the constant weights of the convolution layer that evaluates area reconstruction quality from the pixel-wise reconstruction quality mask, the constant variance for the pixel-wise reconstruction quality mask, and the constant variance for the location mask. The algorithm we use is summarized in Algorithm 2 and its flowchart is in Figure 1.
The only trainable parameters, the encoder-decoder weights, are updated by minimizing a loss with three parts: a reconstruction term; a prior term, weighted by one hyperparameter, that uses a constant prior for pixels that are masked out; and a KL-divergence term, weighted by a second hyperparameter, between the embedding at each step and the prior for the VAE embedding. For the auto-encoder, which has no latent prior, the KL term is not used. Lastly, redefining the object mask used in Equ. 1 helps, since the reconstruction then focuses more on the region found by the mask. This allows the mask to contribute to the loss, although the mask itself is strictly constant.
3 Related Work
Much of the recent progress has come from convolutional neural networks (CNNs) and their variants. Much of the progress on object segmentation has been based on supervised learning, where people label the images using their prior knowledge and train the network accordingly. Fully Convolutional Networks Long_2015_CVPR are a paradigmatic architecture for semantic segmentation, and more advanced results were recently achieved with a semi-convolutional operator Novotny_2018_ECCV .
Unsupervised object segmentation with neural networks has also been studied. Unsupervised object discovery hsu2018co with a pre-trained model, such as VGG, has been proposed, but it depends heavily on the performance of the pre-trained model, so the model is not end-to-end unsupervised. Another object segmentation method characterizes pixel similarities with a CNN. Recently, a generative adversarial network (GAN) for object segmentation chen2019unsupervised was successfully applied to real-world datasets. However, this model relies on the network itself to find attention masks, which works well on datasets where the objects are salient and big enough.
The most closely related work is MONet burgess2019monet , which segments objects iteratively through a scope mask. Its performance relies on the interaction between the attention mask and the VAE used for reconstruction. A similar idea is used in IODINE IODINE , where each independent embedding tries to recover an object and its mask. All recovered objects are then combined with a normalized mask to reconstruct the original image. IODINE uses special techniques to learn the joint posterior embedding, overcoming a shortcoming of the VAE, which can only learn independent posterior embeddings given the input. In practice, tuning the parameters so that the CNN-based attention model masks exactly over objects is very difficult, because MONet has no feedback loop between the reconstructed image and the attention module at each scene-decomposition step. What is decomposed at each step is therefore determined purely by the attention model, and it is difficult to guarantee that the attention model masks a localized region and that this region happens to contain an object. Compared to MONet, our model computes an attention mask directly from the reconstructed images, and it has an explicit localization module that ensures the model focuses on local regions.
Lastly, for object discovery from a sequence of frames, a network pathak2017learning based on optical flow can learn moving objects. Neural Expectation Maximization NEM learns embeddings of objects from a sequence of frames through EM and learns the transformation from frames to embeddings through training. Building on NEM, Relational NEM van2018relational extracts object relationships through the embeddings and achieves better performance than NEM. Object discovery through a sequence of frames provides more information than independent frames. Since our model currently focuses only on object segmentation in still images, this direction is interesting future work.
4 Experiments
We test our model on three different datasets, focusing mainly on object segmentation performance.
Multi-dSprites multiobjectdatasets19 : This dataset has a colored background and a random number of objects. We use separate sets of training and testing samples and the ARI score provided with the dataset. We conduct an ablation study on this dataset.
Montezuma’s Revenge: Montezuma’s Revenge in OpenAI Gym 1606.01540 is a game where object discovery plays an important role in hierarchical deep reinforcement learning kulkarni2016hierarchical and in goal-driven/symbolic planning for reinforcement learning lyu2019sdrl . We collect training samples with a random policy. For the test set, we use a pre-trained policy with which the agent successfully solves the first stage. We manually label 100 samples of nine objects: the agent, the skull, the rope, the key (which may be missing once the agent obtains it), two doors, and three ladders. Since the current GAN-based segmentation chen2019unsupervised supports only simple scenes (one foreground object), and official implementations of MONet and IODINE are not available online, we only report our AMI score on Montezuma’s Revenge as a benchmark result. The AMI score is calculated as in NEM NEM .
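Both ARI and AMI compare a predicted pixel labeling with a reference labeling while correcting for chance agreement. As an illustration of the ARI mentioned for Multi-dSprites, the sketch below computes the adjusted Rand index for two flat label sequences using only the standard library. It is not the implementation shipped with the dataset, which operates on masks, and returning NaN in the degenerate case is a choice made here for illustration.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    # Both inputs are equal-length sequences with at least two items.
    n = len(true_labels)
    pairs = Counter(zip(true_labels, pred_labels))
    sum_cells = sum(comb(c, 2) for c in pairs.values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        # Degenerate case (e.g. both labelings put everything in one cluster):
        # the ratio is 0/0, so return NaN rather than a score.
        return float("nan")
    return (sum_cells - expected) / (max_index - expected)

same_up_to_renaming = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
degenerate = adjusted_rand_index([0, 0, 0], [0, 0, 0])
```

The score is invariant to how clusters are named, which matters for unsupervised segmentation because object slots have no fixed order.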
4.1 Results and Discussion
|Dataset||Multi-dSprites||Category Flower Dataset||Montezuma’s Revenge|
We summarize our results in Table 1. We also analyze the results by different datasets in the following paragraphs.
Multi-dSprites: Our model does not yet match MONet, because this dataset contains samples in which objects are partially occluded by other objects, a situation that GMM cannot handle easily. For the ablation study, where the whole location mask is removed and only the reconstruction quality remains, we found that the provided segmentation metric multiobjectdatasets19 does not apply to our case, because images with one object lead to NaN when our model incorrectly decomposes that object into different object slots. We see similarly poor performance when we keep only the GMM module, without training the network.
Category Flower Dataset: This is a real-world dataset where the background is more complicated and thus requires more sophisticated background removal techniques. Similar to the results obtained when IODINE IODINE is applied to a real-world dataset, we observe a noticeable gap between our model and the benchmark, because the assumption of a 5-D feature vector in the GMM module (RGB and x-y coordinates) is too simple for real-world datasets.
Montezuma’s Revenge: Our model achieves an AMI score of 0.375. Figure 3 shows a reconstruction of the objects in a typical frame, and Figure 4 in Appendix A provides more reconstruction results. Most of the important objects that can serve as goals are found by our model, including the three ladders, the skull, the doors, and even the key. More strikingly, our model successfully finds objects, such as the ladders at the bottom, that share the same color as the walls. Since the wall is easier to reconstruct, it is captured by the background, whereas the ladders, with their more complicated details, are left to be captured by later object slots. This experiment supports our argument that objects can be extracted iteratively according to their reconstruction difficulty, which we believe can be a new method for object discovery.
4.2 Object Location Extractor
|Object||Vertical||Horizontal|
|bottom right ladder||0.595||0.827|
|bottom left ladder||0.595||0.180|
One of the benefits of our model is that object locations are extracted automatically. We train our model for only 9 epochs and extract the locations of the objects from the coordinate means computed by the GMM. A randomly chosen reconstructed frame is shown in Figure 3, and the corresponding objects and their locations are listed in Table 2. Locations are scaled to 0–1 along the vertical and horizontal axes, with $(0, 0)$ at the top left of the image. Most of the objects are found by our model, visually near the locations provided.
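Reading a location off a segmentation result can be sketched as follows. As a simple stand-in for the GMM coordinate means, this illustration uses the mask-weighted mean of the pixel coordinates; `object_location` is a hypothetical helper, and the scaling places (0, 0) at the top-left pixel and (1, 1) at the bottom-right pixel.

```python
import numpy as np

def object_location(mask):
    # Mask-weighted mean of pixel coordinates, returned as (vertical, horizontal).
    h, w = mask.shape
    rows, cols = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
    total = mask.sum()
    return (rows * mask).sum() / total, (cols * mask).sum() / total

mask = np.zeros((11, 11))
mask[5, 10] = 1.0                      # a single pixel on the right edge, mid height
vertical, horizontal = object_location(mask)
```

Locations in this normalized form can be compared across frames of different resolution.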
5 Discussion and future work
In this paper, we propose a new unsupervised object segmentation algorithm with an explicit localization module. The localization module serves as an attention mask derived from the reconstruction quality. By iteratively segmenting the objects, our method finds objects one-by-one, filling in the details of the image missing from previous iterations. We empirically confirm our beliefs that those details correspond to objects.
As for future work, it is promising to extend this work to sequences of frames, a context where objects are mostly consistent between frames. We also notice that GMM does not always lead to good results, and more sophisticated (local) segmentation algorithms could lead to better results. Lastly, our model still has a considerable amount of prior knowledge injected through hyperparameters; simplifying the model would be helpful in future work.
References
- (1) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
- (2) Christopher P Burgess, Loic Matthey, Nicholas Watters, Rishabh Kabra, Irina Higgins, Matt Botvinick, and Alexander Lerchner. Monet: Unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390, 2019.
- (3) S Butterworth. On the theory of filter amplifiers. Experimental Wireless and the Wireless Engineer, 7:536–541, 1930.
- (4) Mickaël Chen, Thierry Artières, and Ludovic Denoyer. Unsupervised object segmentation by redrawing. arXiv preprint arXiv:1905.13539, 2019.
- (5) A Delong and Y Boykov. A scalable graph-cut algorithm for N-D grids, 2008.
- (6) Klaus Greff, Raphaël Lopez Kaufman, Rishabh Kabra, Nick Watters, Christopher Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, and Alexander Lerchner. Multi-object representation learning with iterative variational inference. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2424–2433, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
- (7) Klaus Greff, Sjoerd van Steenkiste, and Jürgen Schmidhuber. Neural expectation maximization. In Advances in Neural Information Processing Systems, pages 6691–6701, 2017.
- (8) Kuang-Jui Hsu, Yen-Yu Lin, and Yung-Yu Chuang. Co-attention CNNs for unsupervised object co-segmentation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 748–756. AAAI Press, 2018.
- (9) Rishabh Kabra, Chris Burgess, Loic Matthey, Raphael Lopez Kaufman, Klaus Greff, Malcolm Reynolds, and Alexander Lerchner. Multi-object datasets. https://github.com/deepmind/multi-object-datasets/, 2019.
- (10) Asako Kanezaki. Unsupervised image segmentation by backpropagation. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.
- (11) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- (12) Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pages 3675–3683, 2016.
- (13) Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
- (14) Daoming Lyu, Fangkai Yang, Bo Liu, and Steven Gustafson. Sdrl: interpretable and data-efficient deep reinforcement learning leveraging symbolic planning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 2970–2977, 2019.
- (15) M-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008.
- (16) David Novotny, Samuel Albanie, Diane Larlus, and Andrea Vedaldi. Semi-convolutional operators for instance segmentation. In The European Conference on Computer Vision (ECCV), September 2018.
- (17) Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2701–2710, 2017.
- (18) Sjoerd van Steenkiste, Michael Chang, Klaus Greff, and Jürgen Schmidhuber. Relational neural expectation maximization: Unsupervised discovery of objects and their interactions. In International Conference on Learning Representations, 2018.