Semantic Aware Attention Based Deep Object Co-segmentation
Object co-segmentation is the task of segmenting the same objects from multiple images. In this paper, we propose an Attention Based Object Co-Segmentation model that utilizes a novel attention mechanism in the bottleneck layer of a deep neural network to select semantically related features. Furthermore, we take advantage of the attention learner and propose an algorithm that segments multiple input images in linear time complexity. Experimental results demonstrate that our model achieves state-of-the-art performance on multiple datasets, with a significant reduction in computational time.
Image segmentation is a fundamental computer vision problem which aims to segment images into semantically different regions. Recently, remarkable success has been achieved thanks to the rapid development of deep learning [22, 2, 37, 4, 41]. First proposed by Rother et al., object co-segmentation, which aims at extracting similar objects from multiple inputs, utilizes joint information from two images and achieves higher accuracy than segmenting the objects independently. It can be used in various applications such as image retrieval and object discovery.
While considerable attention has been paid to object segmentation, only a limited number of previous works focus on object co-segmentation [20, 23, 26, 31, 9, 18, 6, 25, 39, 16, 17]. Intuitively, the advantage of co-segmentation over single-image segmentation is that information from both images is used jointly to segment better. In particular, this information includes 1) appearance similarity, 2) background similarity, and 3) semantic similarity. Previous works leverage 1) and 2) [9, 18], but these cues are not general, since appearance or background is not always similar. Recently, a deep learning based method focused on semantic similarity: it can co-segment objects of the same semantic class even with different appearance and background, and it outperformed conventional methods by a large margin. It uses a correlation layer to compute localized correlations between the semantic features of two input images and then predicts the mask of the common objects. However, since the correlation is computed in a pair-wise manner, the method is hard to extend to co-segmentation of more than two images. If the number of images in the group is large, the time complexity of co-segmentation with multiple (i.e., more than two) images increases drastically when all different pairs in the group are considered.
In this work, we aim to co-segment objects of the same semantic class in multiple input images, even when their appearance and background differ. We also aim to enable instant group co-segmentation that avoids the extra time complexity of considering all possible image pairs.
In a deep neural network, more abstract semantic information is encoded in deeper layers, and different channels correspond to different semantic meanings. Figure 1 visualizes the channel activations of the conv5_3 layer of VGG16 for different input images. Strong activations are observed in channels 120 and 223 for the class Bull Mastiff, while the same channels have only very weak activations for the class Tiger Cat. Motivated by this observation, we argue that by applying attention to deep features (i.e., emphasizing channels whose activation is strong in both images' features and suppressing the other channels), semantic information can be selected and enhanced, and thus co-segmentation can be performed. Thanks to this disentanglement property, when dealing with multiple inputs, we can regard attentions as semantic selectors which can be applied globally, instead of handling the semantic relationships between images pair by pair. Also, the attention learner can be implemented efficiently with fully connected layers and average pooling layers, which makes it faster than the correlation layer in one forward operation.
In this paper, we propose a novel attention based co-segmentation model that leverages attention in the bottleneck layers of a deep neural network. The proposed model is mainly composed of three modules: an encoder, a semantic attention learner, and a decoder. The encoder encodes the images into highly abstract features; in this work we take the convolutional layers of VGG16 as the encoder. The semantic attention learner takes the encoded features and learns to pay attention to the co-existing objects. We propose three mechanisms for this attention learner and describe them in detail in Section 3. The decoder then uses the attended deep features to output the final segmentation mask.
We summarize the main contributions of this work as follows:
We propose a simple yet efficient deep learning architecture for object co-segmentation: the Attention Based Deep Object Co-segmentation model. In our model, we use a semantic attention learner to spotlight feature channels that have high activation in all input images and suppress other, irrelevant feature channels. To the best of our knowledge, this is the first work that leverages an attention mechanism for deep object co-segmentation.
Compared with previous works that perform co-segmentation in quadratic time complexity, our proposed model can co-segment multiple images in linear time complexity.
Our model achieves state-of-the-art performance on multiple object co-segmentation datasets and is able to generalize to unseen objects absent from the training dataset.
The term object co-segmentation, which segments "objects" instead of "stuff", was first proposed by Rother et al. in 2011. Rubinstein et al. captured the sparsity and visual variability of the common object over the entire database using dense correspondences between images, in order to avoid noise while finding saliency. Using a clustering approach, Joulin et al. pointed out that discriminative clustering is well adapted to the co-segmentation problem and extended its formulation to fit the co-segmentation task.
With the assistance of deep learning, Mukherjee et al. generated object proposals from two images and turned them into vectors with a Siamese network. During training, they used the Annoy (Approximate Nearest Neighbor) library to measure the Euclidean or cosine distance between two vectors. Recently, DOCS turned the PASCAL VOC dataset into a co-segmentation dataset, producing more paired data. They applied a correlation layer to find similar features. They proposed group co-segmentation, tested it on several datasets, and demonstrated state-of-the-art results. However, their pairwise scheme makes testing very expensive when performing group co-segmentation, so testing on the whole Internet dataset becomes computationally intractable, and they only tested on a subset of it. In this paper, our method runs in linear time and at the same time achieves state-of-the-art performance. Our proposed model has a much simpler structure, yet still achieves better performance.
The Pixel Objectness work revealed that the feature map of a model pretrained on ImageNet can be used for the task called object discovery. Similar to our model, they attached a decoder after the last convolutional layer of VGG16 to produce the segmentation mask. Although they only trained the model on the PASCAL VOC dataset, their model can segment objects that do not exist in the training dataset.
Our model extends their work with a novel attention learner. By this means, we can not only perform the image co-segmentation task but also enhance the model's ability for object discovery.
Since visual attention was first proposed, many attention based models have been used in computer vision tasks such as VQA [36, 34, 42, 10] and image captioning [38, 5, 32, 1]. Recently, attention models have also been widely used in many other research domains [21, 40, 33, 13, 11, 12]. Attention can be considered as placing weights on the channels of a feature map to enhance some semantic information and, at the same time, remove other unwanted semantic information.
In our paper, we utilize the channel-wise attention model as semantic selectors, since specific channels contribute to the specific semantic class. In this object co-segmentation task, we generate channel-wise attention from one image to decide which semantic information should be removed in the other. To the best of our knowledge, this is the first work to use channel-wise attention model in the task of object co-segmentation.
Fig. 2 presents the overview of our model. For simplicity, we demonstrate our model using two inputs, which is the typical case of co-segmentation. We will show in Section 3.4 that our model can be extended to multiple inputs at test time. Our model is composed of an encoder, a semantic attention learner, and a decoder. The encoder is identical to the convolutional layers of VGG16, so its output is a 512-channel feature map. The feature maps are forwarded into the semantic attention learner. We propose three different architectures for the semantic attention learner: the channel-wise attention model (CA), the fused channel-wise attention model (FCA), and the channel-spatial-wise attention model (CSA). The semantic attention learner produces the attended feature map, which is forwarded to the decoder. The decoder is built from up-sampling layers, and we add a dropout layer after each layer to avoid overfitting. For the last layer, we use a convolution layer to output two channels, representing the foreground and the background respectively. Note that, unlike previous work, we do not concatenate the original feature with the processed feature, since the original feature tends to segment every "object-like stuff" and ignore the semantic selector, which is contrary to the goal of this task.
It is known that each channel contains different semantic information with respect to different semantic classes, as illustrated in Fig. 1. We construct the channel-wise attention (CA) to enhance the semantic information we need and suppress the rest. Specifically, we first apply global average pooling to the encoder outputs $f_A$ and $f_B$ and forward the results to a learnable fully connected (FC) layer to get two weight vectors $\alpha_A$ and $\alpha_B$. $\alpha_A$ and $\alpha_B$ contain the semantic information of the two inputs: if the $c$-th entry of $\alpha_A$ is large, channel $c$ has high activation, and thus image $A$ contains the semantic represented by channel $c$. By performing channel-wise multiplication of $\alpha_B$ with $f_A$, and of $\alpha_A$ with $f_B$, we get the attended feature maps $f'_A$ and $f'_B$. Here $f_A$ receives the attention weights computed from $f_B$, so if $f_A$ has high activation in channel $c$ while $f_B$ contains no such semantic (the $c$-th entry of $\alpha_B$ is small), this channel will be suppressed. On the other hand, if $f_A$ does not have high activation in channel $c$ while the $c$-th entry of $\alpha_B$ is large, channel $c$ will not be activated, since its activation is initially small. The same happens to $f_B$. As a consequence, by applying this channel-wise attention mechanism, only the channels with high activations in both inputs are preserved and relatively enhanced. Other channels, which do not contain semantics present in both inputs, are suppressed. We also show the channel-wise disentanglement property in our supplementary material. We present details in Equations (1)-(4). In all of the equations below, $*$ represents matrix multiplication, and $\odot$ represents element-wise multiplication with broadcasting.
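The channel-wise attention described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `channel_attention` and `co_attend` are hypothetical, a single FC layer is assumed to be shared by both inputs, and the FC weights are hand-picked (not learned) so that the suppression effect is easy to see on a toy 3-channel feature map.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, weight, bias):
    """Global average pooling -> FC layer -> sigmoid.

    feat: (C, H, W) feature map; returns a (C,) attention vector.
    """
    pooled = feat.mean(axis=(1, 2))           # global average pooling
    return sigmoid(weight @ pooled + bias)    # FC layer followed by sigmoid

def co_attend(feat_a, feat_b, weight, bias):
    """Each feature map is re-weighted by the attention of the OTHER image."""
    alpha_a = channel_attention(feat_a, weight, bias)
    alpha_b = channel_attention(feat_b, weight, bias)
    attended_a = alpha_b[:, None, None] * feat_a   # broadcast over H and W
    attended_b = alpha_a[:, None, None] * feat_b
    return attended_a, attended_b

# Toy features: channel 0 is active in both images,
# channel 1 only in image A, channel 2 only in image B.
feat_a = np.zeros((3, 2, 2))
feat_a[[0, 1]] = 1.0
feat_b = np.zeros((3, 2, 2))
feat_b[[0, 2]] = 1.0

# Hypothetical FC parameters (assumption): scaled identity, so an active
# channel maps to a weight near 1 and an inactive channel to a weight near 0.
weight = 10.0 * np.eye(3)
bias = -5.0 * np.ones(3)

attended_a, attended_b = co_attend(feat_a, feat_b, weight, bias)
print(attended_a.mean(axis=(1, 2)))  # per-channel mean activation of image A
```

In this toy case, only channel 0, which is active in both inputs, keeps a high activation after attention; channels 1 and 2 are suppressed in both attended feature maps.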
Note that we use a sigmoid function instead of the commonly used softmax function, because our approach uses attention to retain or remove semantic information. If two objects with different semantics appear in both images, then under our assumption the weights of the channels related to both objects need to be close to 1 in order to preserve their information. With a softmax layer, however, the weights are normalized to sum to 1 across channels, so the weights of both related and irrelevant channels may fall below 0.5, which would hurt the performance of attention-based co-segmentation.
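A small numerical illustration of this argument is given below. The pre-activation scores are hypothetical values chosen for the example: the first two channels stand for two different objects that both appear in both images, and the last two stand for irrelevant channels.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())   # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical scores (assumption): two relevant channels, two irrelevant ones.
scores = np.array([4.0, 4.0, -4.0, -4.0])

sigmoid_weights = sigmoid(scores)   # each channel is gated independently
softmax_weights = softmax(scores)   # channels compete; weights sum to 1

print("sigmoid:", sigmoid_weights)
print("softmax:", softmax_weights)
```

With sigmoid, both relevant channels receive weights close to 1. With softmax, the two relevant channels must share a total mass of 1, so neither can exceed 0.5, which is the effect the paragraph above warns about.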
Since we aim to find the same semantic information in both inputs, the attention weights output by the semantic attention learner should be the same for both input images. Therefore, we propose another way of finding the common semantic information in $f_A$ and $f_B$: using one FC layer to fuse the two attention weights. Fig. 2(b) shows the architecture of our fused channel-wise attention (FCA). We follow the same procedure as in CA to generate $\alpha_A$ and $\alpha_B$. The fused attention $\alpha$ is the combined result of both attentions, obtained from the fusing FC layer, and it is applied to both feature maps.
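One way to read the fusion step is sketched below. How the two attention vectors enter the fusing FC layer is not spelled out in the text, so this sketch assumes they are concatenated into a vector of length 2C, mapped back to C channels, and passed through a sigmoid; the function name `fused_attention`, the concatenation, and the final sigmoid are all assumptions made for illustration, and the weights are random placeholders rather than learned values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fused_attention(alpha_a, alpha_b, fuse_weight, fuse_bias):
    """Fuse two (C,) attention vectors into one shared (C,) attention vector.

    Assumption: the vectors are concatenated before the fusing FC layer.
    """
    stacked = np.concatenate([alpha_a, alpha_b])        # shape (2C,)
    return sigmoid(fuse_weight @ stacked + fuse_bias)   # shape (C,)

rng = np.random.default_rng(0)
C, H, W = 512, 16, 16

# Placeholder inputs standing in for encoder features and CA attentions.
feat_a = rng.random((C, H, W))
feat_b = rng.random((C, H, W))
alpha_a = rng.random(C)
alpha_b = rng.random(C)

# Placeholder parameters of the fusing FC layer (not learned here).
fuse_weight = rng.normal(scale=0.01, size=(C, 2 * C))
fuse_bias = np.zeros(C)

alpha = fused_attention(alpha_a, alpha_b, fuse_weight, fuse_bias)

# The same fused attention is applied to both feature maps.
attended_a = alpha[:, None, None] * feat_a
attended_b = alpha[:, None, None] * feat_b
```

The design point this illustrates is that, unlike CA, both images are attended with one common weight vector, so the two branches cannot disagree about which channels to keep.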
While global average pooling can extract global information from the feature map, spatial information is lost in the pooling operation. To further improve CA, we propose to add spatial information to the attention. Our spatial attention produces reasonable heat maps on objects in different pictures. We show visualizations of the spatial-wise attention in the supplementary material.
For this channel-spatial attention architecture, we first generate the channel-wise attention in the same way as in CA. To generate the spatial-wise attention, we calculate the mean value at each spatial location across all channels, which gives the spatial attention maps.
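The sketch below illustrates why a spatial map complements global average pooling. It only computes the channel-wise mean at each location, as described above; how the resulting map is combined with the channel-wise attention is not shown, and the function name `spatial_attention_map` and the toy feature maps are assumptions for illustration.

```python
import numpy as np

def spatial_attention_map(feat):
    """Mean over channels at each spatial location: (C, H, W) -> (H, W)."""
    return feat.mean(axis=0)

C, H, W = 4, 3, 3

# Two toy feature maps with the same activations at different locations.
top_left = np.zeros((C, H, W))
top_left[:, 0, 0] = 1.0
bottom_right = np.zeros((C, H, W))
bottom_right[:, 2, 2] = 1.0

# Global average pooling cannot tell the two maps apart ...
gap_top_left = top_left.mean(axis=(1, 2))
gap_bottom_right = bottom_right.mean(axis=(1, 2))

# ... while the spatial maps keep the location of the activation.
map_top_left = spatial_attention_map(top_left)
map_bottom_right = spatial_attention_map(bottom_right)

print(np.allclose(gap_top_left, gap_bottom_right))   # pooled vectors match
print(map_top_left)
print(map_bottom_right)
```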
With the previously proposed group co-segmentation, all possible image pairs have to be matched separately, so co-segmenting $N$ images needs quadratic computation time. Thus, it becomes computationally intractable to test on a large-scale dataset. Moreover, because of their fixed structure, existing deep learning models show weaknesses when co-segmenting more than two images at the same time. For example, a correlation-based model can hardly discover and segment the most frequently appearing object among several (more than two) images, since it is hard to measure frequency without a global semantic understanding in the correlation modules.
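To make the quadratic cost concrete, the following sketch counts how many pairwise co-segmentation runs a group of images requires, next to a scheme that touches each image once. It assumes every unordered pair is processed exactly once, which is an assumption about the pairing protocol; the function names are illustrative, and the group sizes are the class sizes of the Internet dataset quoted in this paper.

```python
def pairwise_runs(n_images):
    """Number of unordered image pairs among n_images images."""
    return n_images * (n_images - 1) // 2

def per_image_runs(n_images):
    """Number of runs when each image is processed once."""
    return n_images

for n in (100, 561, 879, 1306):
    print(f"{n:5d} images: {pairwise_runs(n):7d} pairs vs {per_image_runs(n):5d} runs")
```

The gap grows with the group size: doubling the number of images roughly quadruples the number of pairs but only doubles the number of per-image runs.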
To address this problem, our attention mechanism lets us accomplish co-segmentation in a single shot by controlling the generated attentions. Without loss of generality, we use CA (the first proposed model) to demonstrate the procedure.
Although our model is trained end to end on pairs of input images, at test time it can be seen as a composition of two parts: an attention-generating module and a segmentation module. When forwarding images into our model, each image $I_i$ generates an attention weight $\alpha_i$. Each $\alpha_i$ represents the disentangled semantic information of $I_i$. To get the common semantic information of all input images, we take the averaged attention weight $\bar{\alpha}$ and use $\bar{\alpha}$ to attend to the features of each image. We name this procedure Group Average Attention and show it in detail in Algorithm 1. In this way, we reduce the time complexity to linear time, $O(N)$. Depending on the specific task and dataset, the group average attention can also be changed to a group minimum attention, by replacing the averaging operation with a channel-wise minimum over all $\alpha_i$. By this means we can strictly co-segment objects that appear in all input images.
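The group procedure described above can be sketched as follows. This is a hedged illustration, not a reproduction of Algorithm 1: the encoder and decoder are omitted, the toy feature maps stand in for encoder outputs, the function names are hypothetical, and the FC weights are hand-picked (not learned) so that an active channel maps to a weight near 1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def image_attention(feat, weight, bias):
    """Per-image channel attention: global average pooling -> FC -> sigmoid."""
    return sigmoid(weight @ feat.mean(axis=(1, 2)) + bias)

def group_attention(feats, weight, bias, mode="average"):
    """Combine the per-image attentions of a whole group into one vector."""
    alphas = np.stack([image_attention(f, weight, bias) for f in feats])  # (N, C)
    if mode == "average":
        return alphas.mean(axis=0)
    if mode == "minimum":
        return alphas.min(axis=0)
    raise ValueError(f"unknown mode: {mode}")

def attend_group(feats, weight, bias, mode="average"):
    """One pass per image: the shared attention re-weights every feature map."""
    alpha = group_attention(feats, weight, bias, mode)
    return [alpha[:, None, None] * f for f in feats]

def make_feat(active_channels, channels=3, size=2):
    feat = np.zeros((channels, size, size))
    feat[list(active_channels)] = 1.0
    return feat

# Channel 0 is active in all three images, channel 1 in two, channel 2 in one.
feats = [make_feat((0, 1, 2)), make_feat((0, 1)), make_feat((0,))]

# Hypothetical FC parameters (assumption), chosen for readability.
weight = 10.0 * np.eye(3)
bias = -5.0 * np.ones(3)

avg_alpha = group_attention(feats, weight, bias, mode="average")
min_alpha = group_attention(feats, weight, bias, mode="minimum")
attended = attend_group(feats, weight, bias, mode="average")

print("average:", avg_alpha)
print("minimum:", min_alpha)
```

In this toy group, the averaged attention keeps a moderate weight on the channel shared by two of the three images, while the minimum keeps only the channel that is active in every image, matching the stricter behaviour described above.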
We use the PyTorch library to implement our model. For training, we rescale every image to a spatial size of 512×512. For the decoder we adopt the architecture upsample → conv(512,256) → upsample → conv(256,128) → upsample → conv(128,64) → upsample → conv(64,32) → upsample → conv(32,2), in which conv(a,b) indicates a convolution layer with a input channels and b output channels. All convolution layers use kernel size 3 and stride 1 and are followed by a Rectified Linear Unit (ReLU) layer and a batch normalization layer. We compute the loss using cross entropy and back-propagate gradients to all layers. We use the Adam optimizer with learning rate 1e-5 for optimization.
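As a quick sanity check of the decoder layout, the sketch below traces channel counts and spatial sizes through the five upsample/conv stages. Several details are assumptions rather than stated facts: each upsample is taken to double the resolution, each convolution is taken to use padding 1 so that a 3×3 kernel with stride 1 preserves the size, and the encoder is taken to halve a 512×512 input five times (as the five max-pooling stages of VGG16 would). The function names are illustrative.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def trace_decoder(size, channels=512):
    """Return (channels, size) after the input and after every decoder stage."""
    plan = [(512, 256), (256, 128), (128, 64), (64, 32), (32, 2)]
    trace = [(channels, size)]
    for c_in, c_out in plan:
        assert channels == c_in, "channel mismatch between stages"
        size = size * 2           # upsample (assumed factor of 2)
        size = conv_out(size)     # 3x3 conv, stride 1, assumed padding 1
        channels = c_out
        trace.append((channels, size))
    return trace

encoder_out = 512 // 2 ** 5       # assumed: five 2x poolings in the encoder
trace = trace_decoder(encoder_out)
for channels, size in trace:
    print(f"{channels:4d} channels, {size}x{size}")
```

Under these assumptions, the final stage yields a two-channel map at the input resolution, which is consistent with a per-pixel foreground/background prediction.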
For training, we use the same dataset as DOCS, where the image pairs are extracted from the PASCAL VOC 2012 training set. The total number of pairs is over 160k.
For validating and testing our model, we randomly split the PASCAL VOC 2012 validation set into 724 validation images and 725 test images, as in previous work; pairing the images gives 46,973 and 37,700 pairs, respectively.
We also test our model using other datasets commonly used in object co-segmentation. They include:
Internet sub-dataset. Each of the three classes (car, horse, and airplane) contains 100 images.
Internet dataset. The three classes car, horse, and airplane contain 1306, 879, and 561 images, respectively. Note that the MSRC dataset and the two Internet datasets all contain classes that appear in the training data. We refer to these object classes as Seen Objects.
ICoseg sub-dataset. This dataset contains 8 classes, each with a different number of images. Unlike the previous datasets, the classes in the ICoseg dataset differ from those in the training dataset. We adopt this dataset to test the generalizability of our model. We refer to the objects in the ICoseg dataset as Unseen Objects.
For result comparison, we use the following baselines. One is a conventional method which first extracts features from image pairs and then trains a Random Forest regressor on these features. [26, 16, 17] perform co-segmentation based on saliency. [18, 6] use clustering methods to find the similarity between images. Another method uses conditional random fields to find the relationship between pixels. Another connects the superpixel nodes located on the image boundaries of the entire image group and then infers via the proposed GO-FMR algorithm. Yet another first induces affinities between image pairs and then co-segments objects from co-occurring regions; its results are further improved with Consensus Scoring. Beyond these conventional, pre-deep-learning methods, [23, 20] both use deep learning for object co-segmentation. Mukherjee et al. found the objects with object proposals, encoded them into feature vectors with VGG16, and then compared the similarity of the feature vectors to decide which to segment. Recently, DOCS proposed an end-to-end deep learning model with VGG16, a correlation layer, and a decoder, which made great progress and achieved state of the art in this area.
Pixel Objectness is the state-of-the-art method for object discovery, and it is capable of segmenting all existing objects in a single input image. Its network architecture is a VGG16 encoder directly followed by a decoder, so architecturally our model can be viewed as Pixel Objectness with an attention learner added in the middle. Since the task is different, it is unfair to compare this method with co-segmentation methods quantitatively. We therefore present qualitative results to show that our model inherits (and even improves on) the object discovery ability of Pixel Objectness.
We first show the performance comparison for conventional pairwise co-segmentation. For the PASCAL VOC dataset, we test our models on 37,700 image pairs. We obtain Jaccard accuracies of 59.24%, 59.41%, and 59.76% for CA, FCA, and CSA, respectively. Tables 1 and 2 show quantitative results on the MSRC and Internet sub-datasets. Our model achieves state-of-the-art performance when segmenting Seen Objects. Among the three proposed attention mechanisms, CSA is the best at co-segmentation of Seen Objects.
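For reference, the Jaccard index of a predicted mask and a ground-truth mask is the size of their intersection divided by the size of their union. A minimal sketch is shown below; the function name `jaccard`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions, and the averaging protocol used for the reported numbers is not reproduced here.

```python
import numpy as np

def jaccard(pred, gt):
    """Intersection over union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return np.logical_and(pred, gt).sum() / union

# Toy 4x4 masks: two 2x2 blocks that overlap in one column.
pred = np.zeros((4, 4), dtype=int)
pred[:2, :2] = 1
gt = np.zeros((4, 4), dtype=int)
gt[:2, 1:3] = 1

score = jaccard(pred, gt)   # 2 shared pixels out of 6 covered pixels
print(score)
```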
Table 3 shows quantitative results for Unseen Objects from the ICoseg sub-dataset. Our model (especially FCA) outperforms all baseline methods. In particular, we obtain a 1.8% performance gain over the previous deep learning based method.
In Fig. 3, we visualize some co-segmentation results from three models. These demonstration images are selected from the datasets containing Seen Objects: MSRC, Internet, PASCAL VOC, and MSCOCO. Pixel Objectness tends to over-segment, since it outputs the segmentation of all possible objects. Our method performs comparably to the state-of-the-art deep learning based method. We further demonstrate in Section 4.4 that our method has much lower time complexity than that method.
In Fig. 3(b), the comparison between our method and Pixel Objectness demonstrates that our approach has a stronger object discovery ability, thanks to the attention learner, which helps to reduce semantic noise and enhance some of the semantic information in the feature map. For example, in the first row of Fig. 3(b), Pixel Objectness detects no object, but assisted by the reference attention from another image, our model detects and segments the pyramid precisely. Our method also achieves better co-segmentation performance than the compared methods. FCA is the best architecture for co-segmentation of Unseen Objects. By design, CSA retains the most information from the original images because of the spatial attention. However, it is unclear whether spatial attention also has the disentanglement property, and its absence may lead to mis-segmentation of unseen objects. On the other hand, FCA aims at finding the common attention between the two inputs, so noise in the two generated attentions is suppressed.
Algorithm 1 presents each step of instant group co-segmentation. With this method, we reduce the time complexity of group co-segmentation to linear. All previous work has quadratic time complexity, which makes it computationally intractable to test on the whole Internet dataset. In Table 4, we show the results of instant group co-segmentation on the whole Internet dataset, which contains 1306 car, 879 horse, and 561 airplane labeled images. We reach state-of-the-art performance without performing co-segmentation pair by pair. Figure 4 shows some qualitative results of our instant group co-segmentation.
| Avg Jaccard (test time) | DOCS | CA | CA-instant |
|---|---|---|---|
| MSRC (seen) | 79.9 (203) | 76.5 (51) | 73.9 (17) |
| Internet (seen) | 70.3 (9531) | 72.8 (2179) | 70.9 (63) |
| ICoseg (unseen) | 84.2 (1077) | 85.2 (268) | 87.1 (43) |
To show that our instant group co-segmentation method does not sacrifice accuracy compared with previous methods based on pairwise co-segmentation, we carry out the same quantitative experiment on the same datasets as in Section 4.3. From Table 5, we can see that instant group co-segmentation achieves higher accuracy on the Unseen Objects of the ICoseg dataset, while performing comparably on the Seen Objects of the MSRC and Internet datasets, compared with the previous deep learning based method and our CA model used pair-wise. This demonstrates that the group average attention helps us filter out some semantic noise in the irrelevant channels.
There are three typical failure cases in the co-segmentation task: loss of segmentation accuracy, over-segmentation, and under-segmentation. For example, in Fig. 4(b) the 7th image is over-segmented, and in Fig. 4(c) the 2nd image is under-segmented. Over-segmentation and under-segmentation can be easily avoided for Seen Objects as long as the model converges. However, Unseen Objects are beyond control during training.
According to Pixel Objectness, the pretrained model is well capable of segmenting unseen objects. Our models find a balance between the attention learner and the pretrained model during training. If the pretrained model dominates, the model tends to segment all objects in the image, resulting in over-segmentation. On the other hand, if the attention learner overfits, unseen objects are ignored, leading to under-segmentation.
Group Average Attention in Section 3.4 is not the only meaningful procedure in our model. For other tasks, such as segmenting the common objects that exist in all images, we can choose the minimum weight for each channel from the generated attentions and obtain Group Minimum Attention. Fig. 5 shows an example comparing Group Average Attention and Group Minimum Attention. Group Average Attention can segment the most common object in multiple images, while Group Minimum Attention strictly finds the objects that appear in all images.
In this paper, we propose three different architectures of the attention learner as semantic selectors. With the proposed instant group co-segmentation, we can co-segment multiple images in linear time. We show that our results achieve state-of-the-art performance in the object co-segmentation task and outperform Pixel Objectness in object discovery ability. We also visualize the channel-wise attention and the spatial-wise attention to show the correctness of our proposed models.
However, if the inputs do not contain the same object, our model will output wrong results, since there is no common semantic class. We leave this problem, together with applying zero-shot learning to improve the segmentation of unseen objects, for future work.
To verify that the channel-wise attention obeys our assumption that it enhances specific channels and suppresses others, we generate channel-wise attention for five different classes: Monitor, Indigo Bird, Sheep, Jellyfish, and Shopping Cart. Each class contains 1300 images from ImageNet.
Fig. 6 shows the results. Images in the same class generate similar attention, while images in different classes generate different attention. This indicates that channel-wise attention enhances some channels and suppresses others according to the classes in the images.
In CSA, we improve our model with an additional spatial attention. After global average pooling (GAP), some spatial information is lost. We therefore use spatial attention to enhance important spatial locations in the images. Fig. 7 shows a visualization of the generated spatial attention maps.
Huang, Y., Cai, M., Kera, H., Yonetani, R., Higuchi, K., Sato, Y.: Temporal localization and spatial segmentation of joint attention in multiple first-person videos. In: ICCVW (2017)
Jerripothula, K.R., Cai, J., Meng, F., Yuan, J.: Automatic image co-segmentation using geometric mean saliency. In: ICIP (2014)