Automatically removing unrelated foreground objects from an image to “clean” it is an intriguing and still-open area of research. Images tend to lose their naturalness or visually-pleasing quality due to the presence of such objects. For instance, a scene of a landscape may contain an artificial object in the foreground, or a picturesque attraction may be crowded, making it impossible to capture its untainted beauty. Such complications give rise to the requirement of automatic detection and removal of undesired or unrelated objects in images. We refer to such objects as occlusions. In a general sense, this task is challenging since the occlusions are dependent on the image context. For instance, an occlusion in one image may be perfectly natural in another.
The process of acquiring an occlusion-free image involves two subtasks: identifying unrelated things, and removing them coherently. Image context defines which objects are unrelated. Yet, due to high complexity of natural images, proper interpretation of the image context is difficult [1, 2]. Being unrelated is usually subjective in human perception. However, in a computer vision system, this should be captured objectively. Object detection coupled with scene understanding and pixel generation can potentially address these subtasks.
In this paper, we propose a novel approach for occlusion removal using an architecture which fuses both image and language processing. First, we process the input image to extract background and foreground object classes separately, with pixel-wise annotations of foreground objects. Then, our language model decides the relation of each foreground object to the image context and thereby identifies occlusions. Finally, we mask the pixels of occlusions and feed the image into an inpainting model which produces an occlusion-free image. This task separation allows us to tackle the lack of object-background relationships in datasets, since our system relies only on semantic segmentations and image captions for training. An example result of our system is shown in Fig. 1.
Contributions of this paper are as follows:
We propose a novel system to automatically detect occlusions based on image context and remove them, producing a visually-pleasing occlusion-free image.
We present the first approach which makes intelligent decisions on the relation of objects to image context, based on foreground- and background-object detection.
2 Related Work
Previous work related to context-aware occlusion removal includes pre-annotated occlusion removal, occlusion removal in a specific context, and image adjustment for producing visually-pleasing images.
Following the proposal of GANs, extensive research has been carried out in GAN-based image modeling. GAN-based inpainting is one such application, where output images are conditioned on occluded input images [8, 9]. Conventional approaches for image inpainting, which do not use deep-network-based learning [10, 11], perform reasonably well for homogeneous image patches, but they fail in natural images. Yu et al. recently proposed contextual attention in a dilated CNN, which performs well for non-homogeneous patches. Moreover, several approaches for inpainting in a specific context, for instance, face, head or eye, and for occlusion removal in a specific context, for instance, rain or shadow [5, 14], have been proposed recently. Existing inpainting and occlusion removal techniques provide visually-pleasing results, but require manually annotating occluded regions and are not generic.
The recently proposed SeGAN is partially aligned with our work. It relies on a semantic segmentator and a GAN to regenerate the obscured content of objects in images, completing their appearance. Shotton et al. present a novel approach for automatic semantic segmentation and visual understanding of images, which involves learning a discriminative model of object classes, interactive semantic segmentation and interactive image editing. Other related work includes object recognition in cluttered environments and occlusion-aware layer-wise modeling of images.
These methods of enhancing occluded images either reconstruct images based on manually-annotated occluded regions or remove a specific type of occlusion in a particular context. However, none of these approaches declares occlusions based on image context. In other words, they do not address occlusion removal as a context-aware generic problem. In contrast, our approach involves making intelligent decisions on occlusions: detecting objects as occlusions depending on the image context, characterized by foreground and background objects, and regenerating occluded patches coherently. Thus, what we propose is an architecture for context-aware automatic occlusion removal in a generic domain, based on a fusion of image and language processing.
3 System Architecture
Our system consists of four interconnected sub-networks, as shown in Fig. 2: a foreground segmentator, a background extractor, a relation predictor, and an inpainter. An input image is fed into both the foreground segmentator and the background extractor. The foreground segmentator outputs a pixel-wise association of foreground objects, which we refer to as thing classes, whereas the background extractor predicts the background objects present in the image, commonly referred to as stuff classes. These two components are followed by the relation predictor, which utilizes both the previously extracted thing and stuff classes. The relation of each thing class to the image context is evaluated, and unrelated thing classes, i.e., occlusions, are provided as the output of the relation predictor. Finally, the image inpainter network exploits the relation predictions together with the pixel associations of the thing classes to mask and regenerate the pixels which belong to the occluding thing classes, generating an occlusion-free image. The following sub-sections elaborate on the sub-networks in our proposed system.
Our foreground segmentator is a semantic segmentation network based on DeepLabv3+, which achieves state-of-the-art performance on the PASCAL VOC 2012 semantic image segmentation dataset with an mIoU of . Due to the highly optimized nature of this network, we do not consider suggesting further improvements. Instead, we train it on the COCO-Stuff dataset, which is more challenging due to the generality of context and the higher number of classes, compared to the VOC dataset.
The background extractor is a CNN based on the ResNet-101 model, which consists of 100 convolutional layers and a single fully-connected layer. Since the task of the background extractor is to predict the background classes, i.e., stuff, present in an image, this can be modeled as a multi-task classification problem. Thus, when training this component, we optimize an error given by the summation of class-wise negative log-likelihoods.
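As an illustration of the training objective described above, the following is a minimal sketch of a summed class-wise negative log-likelihood. It assumes that the network emits one independent probability per stuff class (for example through a sigmoid), which the text does not state; the function name `multilabel_nll` and the clamping constant `eps` are hypothetical.

```python
import math


def multilabel_nll(probs, targets, eps=1e-12):
    """Sum of class-wise negative log-likelihoods for one image.

    probs   -- predicted probability that each stuff class is present
               (assumption: independent per-class outputs in [0, 1])
    targets -- 1 if the class is present in the image, else 0
    """
    total = 0.0
    for p, y in zip(probs, targets):
        # Clamp to avoid log(0) for saturated predictions.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total


# Toy usage: two stuff classes, the first present and the second absent.
good = multilabel_nll([0.9, 0.1], [1, 0])
bad = multilabel_nll([0.1, 0.9], [1, 0])
```

Under this assumption, a prediction that agrees with the labels (`good`) receives a lower loss than one that contradicts them (`bad`).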
The relation predictor provides the intelligence on occlusions based on image context. The relation between objects is estimated from vector embeddings of class labels trained with a language model. To this end, we adopt the model proposed by Mikolov et al. to represent COCO labels as vectors. Our training requirement differs from the conventional one, where the model is trained on a large corpus of text, essentially learning linguistic relationships and syntax. What we have is image captions, five per image from MS COCO, and what we want is to learn relations between objects as they appear in images, not linguistic relationships as in conventional models. Thus, we initially create a training corpus by concatenating image captions. Importantly, the model should not learn relations between objects in separate images. Thus, we insert an end-of-paragraph (EOP) character at the end of each set of captions corresponding to a single image. The number of EOP characters inserted between two sets of captions depends on the window size used in the model. Moreover, since we require the relation between object classes, not in a linguistic sense, it is logical to modify the corpus by removing all words except object classes. This allows the model to learn stronger relations between classes. Therefore, we train a separate model on this modified corpus. To visualize the 128-dimensional embedding vectors in a 2D space, we use the t-SNE algorithm.
We use the word embeddings generated by the word-to-vector model trained on the modified captions of COCO-Stuff to predict the relation between objects. What we want to quantify is how related each thing class in an image is to its context. We capture this measure objectively from the sets of thing and stuff classes of the image: each class is represented by its embedding vector, and the relation of a thing class is computed from the cosine similarity between its embedding vector and those of the remaining classes in the image, both thing and stuff.
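A minimal sketch of such a relation score is given below. The text specifies cosine similarity between a thing class and the remaining classes in the image, but the aggregation over those classes is assumed here to be a plain mean; the function names and the toy two-dimensional embeddings are hypothetical and serve only as illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def relation_scores(thing_classes, stuff_classes, embeddings):
    """Score each thing class against the remaining classes in the image.

    Assumption: the score is the mean cosine similarity to all other
    thing and stuff classes. A thing class with no other class to compare
    against is left out of the result.
    """
    # Remove duplicate labels while preserving order.
    context = list(dict.fromkeys(list(thing_classes) + list(stuff_classes)))
    scores = {}
    for t in dict.fromkeys(thing_classes):
        others = [c for c in context if c != t]
        if not others:
            continue
        sims = [cosine_similarity(embeddings[t], embeddings[c]) for c in others]
        scores[t] = sum(sims) / len(sims)
    return scores


# Toy embeddings (hypothetical values, 2-D instead of 128-D).
toy_embeddings = {
    "surfboard": (1.0, 0.0),
    "sea": (1.0, 0.0),
    "laptop": (0.0, 1.0),
}
toy_scores = relation_scores(["surfboard", "laptop"], ["sea"], toy_embeddings)
```

In this toy case the class whose embedding is orthogonal to the rest of the scene receives the lowest score, which is the behaviour the relation predictor relies on to flag occlusions.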
We base our inpainter on a recently proposed generative image inpainter, which is attentive to contextual information of images. It follows a two-stage, coarse-to-fine architecture. The first stage coarsely fills the mask and is trained with a spatially discounted reconstruction loss. The second stage refines the generated pixels and is trained with additional local and global WGAN-GP adversarial losses. Contextual attention is incorporated into the generation of missing pixels through a special fully-convolutional encoder, which compares the similarity of patches around each generated pixel with patches in the unmasked region, and draws contextual information from the region with the highest weighted similarity. Instead of training the inpainter with simulated square masks, we generate masks of random shapes and sizes using the segmentation ground truth masks of thing classes. This provides better convergence for the network at the task of removing thing classes based on image context.
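One way to derive such training masks from segmentation ground truth is sketched below. The sampling strategy (picking one thing class present in the label map uniformly at random) is an assumption, since the text does not specify how masks are drawn; `sample_thing_mask` and the class ids in the example are hypothetical.

```python
import random


def sample_thing_mask(label_map, thing_ids, rng=random):
    """Build a binary inpainting mask from a ground-truth label map.

    label_map -- 2D list of class ids, one per pixel
    thing_ids -- set of ids that count as thing (foreground) classes
    Returns a 2D list with 1 where the sampled thing class lies, or None
    if the image contains no thing class.
    """
    present = sorted({c for row in label_map for c in row if c in thing_ids})
    if not present:
        return None
    chosen = rng.choice(present)
    return [[1 if c == chosen else 0 for c in row] for row in label_map]


# Toy usage: id 5 is a thing class, ids 0 and 7 are stuff classes.
toy_labels = [
    [0, 5, 5],
    [0, 5, 7],
    [7, 7, 7],
]
toy_mask = sample_thing_mask(toy_labels, {5}, random.Random(0))
```

Because the masks follow object outlines, their shapes and sizes vary with the annotated objects rather than being fixed squares.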
3.2 Implementation Details
Our system depends on image context to detect and remove occlusions. Thus, the dataset on which we train our system should contain rich information about image context, which is why we choose the COCO-Stuff dataset. In addition to the annotations of 91 foreground classes in MS COCO, it contains annotations of 91 background classes called stuff. This dataset includes 118K training samples and 5K validation samples. The large number of annotated images over different contexts enables better training of the system to extract contextual information. We utilize the semantic segmentations to train the foreground segmentator and the inpainter, the stuff class labels to train the background extractor, and the image captions to train the relation predictor.
In the foreground segmentator, we use a DeepLabv3+ model pre-trained on PASCAL VOC, excluding the last layer from the initialization to accommodate the change in the number of classes. We train the segmentator for 125K iterations on COCO-Stuff. The background extractor based on ResNet-101 is trained from scratch for 50 epochs with a multi-class classification loss and an input resolution of . The inpainter network is trained from scratch for 1000 epochs with random masks. For the relation predictor, we use the skip-gram model with a window size of 3 and train for 100K iterations on the corpus of image captions to learn 128-dimensional word embeddings.
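The corpus preparation for the relation predictor, in which captions are grouped per image, reduced to class labels and separated by EOP tokens, can be sketched as follows. This is an illustrative sketch under several assumptions: the EOP token string, the choice of inserting exactly `window_size` EOP tokens between images, and the simple whitespace tokenization (which ignores multi-word labels such as "traffic light") are not taken from the source.

```python
EOP = "<eop>"  # hypothetical end-of-paragraph token
PUNCTUATION = ".,;:!?\"'()"


def build_corpus(captions_per_image, class_labels, window_size):
    """Flatten captions into a token stream containing only class labels.

    captions_per_image -- list of caption lists, one list per image
    class_labels       -- iterable of single-word thing/stuff labels
    window_size        -- skip-gram window; assumed here to equal the
                          number of EOP tokens placed between two images
    """
    labels = set(class_labels)
    tokens = []
    for captions in captions_per_image:
        for caption in captions:
            for word in caption.lower().split():
                word = word.strip(PUNCTUATION)
                if word in labels:
                    tokens.append(word)
        # Padding keeps classes of different images out of one window.
        tokens.extend([EOP] * window_size)
    return tokens


toy_corpus = build_corpus(
    [["A dog on the grass.", "Dog runs"], ["A boat at sea"]],
    {"dog", "grass", "boat", "sea"},
    window_size=2,
)
```

The resulting token stream could then be passed to any skip-gram implementation configured with the window size and embedding dimension stated above.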
To compensate for the aggregation of errors, we implement a few measures when interconnecting the networks (source and trained models are available on GitHub). For instance, we handle inconsistent masks by dilation, and false positives which are small in area by discarding the masks less than of the image. When considering the output of the relation predictor to identify occlusions, we consider a similarity threshold, which is set to be . We compare the cosine similarity score of each thing class, normalized to a range of 0 to 1, and declare an occlusion when the similarity is less than this threshold.
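These interconnection measures can be sketched as below. The linear mapping of cosine similarity from [-1, 1] to [0, 1], the 3x3 dilation neighbourhood, and all threshold values used in the example are assumptions chosen for illustration; the function names are hypothetical.

```python
def normalize_similarity(score):
    """Map a cosine similarity from [-1, 1] to [0, 1] (assumed linear)."""
    return (score + 1.0) / 2.0


def find_occlusions(scores, threshold):
    """Return thing classes whose normalized score is below the threshold."""
    return [c for c, s in scores.items() if normalize_similarity(s) < threshold]


def mask_area_fraction(mask):
    """Fraction of pixels set to 1 in a binary 2D mask."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total if total else 0.0


def dilate(mask, iterations=1):
    """Binary dilation with a 3x3 neighbourhood (assumed structuring element)."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    out = [row[:] for row in mask]
    for _ in range(iterations):
        src = [row[:] for row in out]
        for y in range(height):
            for x in range(width):
                if src[y][x]:
                    continue
                # Set the pixel if any in-bounds neighbour is set.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < height and 0 <= nx < width and src[ny][nx]:
                            out[y][x] = 1
    return out


# Toy usage with placeholder values (not the values used in the paper).
toy_occlusions = find_occlusions({"surfboard": 0.8, "laptop": -0.6}, threshold=0.5)
toy_dilated = dilate([[0, 0, 0], [0, 1, 0], [0, 0, 0]])
```

A mask whose `mask_area_fraction` falls below a chosen minimum would be discarded as a likely false positive before the remaining masks are dilated and passed to the inpainter.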
Our system automatically detects and removes unrelated objects, i.e., occlusions in input images. Thus, to evaluate the performance of the system, we raise two questions: which objects are removed by the system, and how good are the reconstructed images. Answering these evaluates the system for its utility in the application scenario. However, the evaluation of the system as a whole is challenging. This is because there is no publicly available dataset which annotates objects not related to an image context, or which contains images both with and without such occlusions. Thus, we choose to evaluate the system intuitively as presented below.
4.1 Effectiveness of Word-Embeddings
The relation predictor is responsible for making intelligent decisions on occlusions in images based on word embeddings. Thus, it is useful to evaluate the effectiveness of the trained embedding vectors. To evaluate the embeddings qualitatively, we map the 128-dimensional vectors to a 2-dimensional space using t-SNE and visualize them in a graph. Fig. 4 shows the mapped embedding vectors in a 2D plane, separately trained on the original corpus of image captions and the modified corpus. The embeddings on the right show a more meaningful mapping, with strongly related objects being mapped closer together. For instance, they show more distinction between some related indoor and outdoor object clusters. This is because the modified corpus contains only the thing and stuff class labels, enabling stronger relations to be learned between classes.
We consider the relation between each pair of classes, calculated as a cosine similarity, to quantitatively evaluate the effectiveness of the embeddings by comparing against a set of ground truth values. However, such ground truth data is not available, and generating it in the wild would not be unbiased due to its subjectiveness. Thus, we intuitively consider a measure of relation using the count of occurrences of each object in the dataset, which can be represented as
$$R_{c_i, c_j} = \frac{|I_{c_i \cap c_j}|}{|I_{c_i \cup c_j}|},$$
where $I_{c_i \cap c_j}$ denotes the set of images which contain both the $c_i$ and $c_j$ classes, and $I_{c_i \cup c_j}$, the set of images which contain either of the classes. To capture the similarity between the relations from learned embeddings and counts, we evaluate their Pearson correlation, which yields a value of 0.527. This shows that our approach of estimating the relation has captured the co-existence of classes to a certain extent.
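The count-based comparison can be sketched as follows, assuming the relation of two classes is the number of images containing both divided by the number of images containing either. The helper names and the toy data are hypothetical, and the Pearson correlation is written out by hand so the example has no external dependencies.

```python
import math


def cooccurrence_relation(image_classes, a, b):
    """Ratio of images containing both classes to images containing either."""
    both = sum(1 for classes in image_classes if a in classes and b in classes)
    either = sum(1 for classes in image_classes if a in classes or b in classes)
    return both / either if either else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0.0 or std_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_x * std_y)


# Toy dataset: the set of classes present in each image.
toy_images = [{"dog", "grass"}, {"dog"}, {"boat", "sea"}, {"grass"}]
toy_relation = cooccurrence_relation(toy_images, "dog", "grass")
```

In practice, `pearson` would be applied to two aligned lists over all class pairs: the cosine similarities from the learned embeddings and the count-based relations.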
4.2 User Study
We conduct two user studies to evaluate the performance of the complete system, involving 1245 image pairs: original and occlusion-free reconstructed images from the validation set of COCO-Stuff. The two studies evaluate the visually-pleasing nature of the reconstructed images and compare the occlusions identified by the users with those identified by the system. The former study shows users only the occlusion-free images and allows them to choose between two options, visually-pleasing or not, whereas the latter shows users only the original images together with the thing class labels and allows them to record their opinion on the unrelated objects in the image context. Results of these user studies are presented in Table 1.
The majority of our reconstructed images have been rated visually-pleasing in the user study. This shows the effectiveness of the foreground segmentator and the inpainter in accurately predicting the masks and reconstructing with attention to contextual details. Although the precision and recall of the objects that our system detected as occlusions, in comparison with the user preferences, are not high, they establish a baseline as the first of its kind. This shows the deviation of our learning algorithm from actual human perception in some cases. In our proposed method, we learn relations from scratch, based only on image captions without any human annotations on relation. In contrast, humans have an intuition of natural relation that comes from experience, which can differ from what is learned by the system. However, as shown in Fig. 3, a diverse set of objects has been detected as occlusions based on different image contexts.
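For a single image, precision and recall of the detected occlusions against user-marked occlusions can be computed as in the sketch below. How the scores are aggregated over images and users is not described in the text, so this per-image set comparison is an assumption; `precision_recall` is a hypothetical helper.

```python
def precision_recall(predicted, annotated):
    """Precision and recall of predicted occlusions against user annotations.

    predicted -- thing classes the system declared as occlusions
    annotated -- thing classes the users marked as unrelated
    """
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    return precision, recall


# Toy usage with hypothetical labels.
toy_precision, toy_recall = precision_recall({"car", "person"}, {"person", "kite"})
```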
We proposed a novel methodology for automatic detection and removal of occluding unrelated objects based on image context. To this end, we utilize vector embeddings of foreground and background objects, trained on modified image captions, to capture the image context by predicting the relation between objects. Although our approach learns meaningful relations between object classes and utilizes a hand-designed algorithm to decide on occlusions, human perception of it can be different. However, we establish a baseline for context-aware automatic occlusion removal in a generic domain, even without a publicly available dataset on relation. As future work, we hope to develop a dataset that captures human annotations on object relations, which will enable end-to-end training of such networks, improving our baseline on context-aware occlusion removal in images.
Acknowledgments: K. Kahatapitiya was supported by the University of Moratuwa Senate Research Committee Grant no. SRC/LT/2016/04 and D. Tissera, by QBITS Lab, University of Moratuwa. The authors thank the Faculty of Information Technology of the University of Moratuwa, Sri Lanka for providing computational resources.
- [1] Jamie Shotton, John Winn, Carsten Rother, and Antonio Criminisi, “TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context,” IJCV, vol. 81, no. 1, pp. 2–23, 2009.
- [2] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang, “Generative Image Inpainting with Contextual Attention,” in CVPR, June 2018.
- [3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep Residual Learning for Image Recognition,” in CVPR, 2016, pp. 770–778.
- [4] Brian Dolhansky and Cristian Canton Ferrer, “Eye In-Painting With Exemplar Generative Adversarial Networks,” in CVPR, 2018, pp. 7902–7911.
- [5] Jiaying Liu, Wenhan Yang, Shuai Yang, and Zongming Guo, “Erase or Fill? Deep Joint Recurrent Rain Removal and Reconstruction in Videos,” in CVPR, June 2018.
- [6] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros, “Image-to-Image Translation with Conditional Adversarial Networks,” in CVPR, 2017, pp. 1125–1134.
- [7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative Adversarial Nets,” in NeurIPS, 2014, pp. 2672–2680.
- [8] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros, “Context Encoders: Feature Learning by Inpainting,” in CVPR, 2016, pp. 2536–2544.
- [9] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa, “Globally and Locally Consistent Image Completion,” ACM Transactions on Graphics, vol. 36, no. 4, pp. 107, 2017.
- [10] Coloma Ballester, Marcelo Bertalmio, Vicent Caselles, Guillermo Sapiro, and Joan Verdera, “Filling-In by Joint Interpolation of Vector Fields and Gray Levels,” IEEE Transactions on Image Processing, vol. 10, no. 8, pp. 1200–1211, 2001.
- [11] Alexei A Efros and William T Freeman, “Image Quilting for Texture Synthesis and Transfer,” in Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. ACM, 2001, pp. 341–346.
- [12] Yijun Li, Sifei Liu, Jimei Yang, and Ming-Hsuan Yang, “Generative Face Completion,” in CVPR, 2017, vol. 1, p. 3.
- [13] Qianru Sun, Liqian Ma, Seong Joon Oh, Luc Van Gool, Bernt Schiele, and Mario Fritz, “Natural and Effective Obfuscation by Head Inpainting,” in CVPR, 2018, pp. 5050–5059.
- [14] Jifeng Wang, Xiang Li, and Jian Yang, “Stacked Conditional Generative Adversarial Networks for Jointly Learning Shadow Detection and Shadow Removal,” in CVPR, 2018, pp. 1788–1797.
- [15] Kiana Ehsani, Roozbeh Mottaghi, and Ali Farhadi, “SeGAN: Segmenting and Generating the Invisible,” in CVPR, 2018, pp. 6144–6153.
- [16] Duofan Jiang, Hesheng Wang, Weidong Chen, and Ruimin Wu, “A Novel Occlusion-Free Active Recognition Algorithm for Objects in Clutter,” in IEEE International Conference on Robotics and Biomimetics, 2016, pp. 1389–1394.
- [17] Jonathan Huang and Kevin Murphy, “Efficient Inference in Occlusion-Aware Generative Models of Images,” arXiv preprint arXiv:1511.06362, 2015.
- [18] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam, “Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation,” in ECCV, 2018, pp. 801–818.
- [19] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman, “The PASCAL Visual Object Classes (VOC) Challenge,” IJCV, vol. 88, no. 2, pp. 303–338, 2010.
- [20] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari, “COCO-Stuff: Thing and Stuff Classes in Context,” CoRR, abs/1612.03716, vol. 5, pp. 8, 2016.
- [21] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean, “Distributed Representations of Words and Phrases and their Compositionality,” in NeurIPS, 2013, pp. 3111–3119.
- [22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick, “Microsoft COCO: Common Objects in Context,” in ECCV. Springer, 2014, pp. 740–755.
- [23] Laurens van der Maaten and Geoffrey Hinton, “Visualizing Data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.