Image segmentation is a fundamental computer vision task. In recent years, with the rapid development of deep learning, several convolutional neural network based methods, such as FCN and DeepLab v3, have greatly improved segmentation performance. However, these methods rely heavily on a large number of annotations. To overcome this shortcoming, the few-shot segmentation task was proposed to segment a new class from only a few manual annotations, such as one annotation (1-shot segmentation) or five annotations (5-shot segmentation). Few-shot segmentation is challenging because of the asymmetry between the training data and the testing data.
In the few-shot segmentation task, the annotated and unlabeled images are called support images and query images, respectively. Existing models usually consist of three components: 1) the support branch, which extracts features from the support images; 2) the query branch, which extracts features from the query images; and 3) the transformation module, which transfers features between the support branch and the query branch to facilitate the segmentation of the query image. The transformation module is the essential component, and the main challenge is to design it to be class-agnostic so that it generalizes efficiently to new classes. Existing methods use the global cues of the support image to model the transformation process, which ignores the geometric relationships of the local features. This paper demonstrates that these local geometric relationships are very useful to the transformation module.
This paper proposes a new transformation module based on local cues, in which the relationships between local features are used to accomplish the transformation. Our idea is to apply a linear transformation to the relationship matrix in a high-dimensional metric embedding space. To this end, we first map the local features into an embedding space, where cosine similarity is used to obtain the relationship matrix of the local features. Then, the relationship matrix is linearly transformed by the generalized inverse of the annotated mask matrix of the support image. The result of the linear transformation is regarded as an attention map containing high-level semantic information, on which we build a new attention transformation module. We verify the effectiveness of our transformation module on the Pascal VOC 2012 dataset. The mIoU reaches 57.0% in 1-shot and 60.6% in 5-shot, which outperforms the state-of-the-art method by 1.6% and 3.5%, respectively.
2 Proposed Method
2.1 Problem Definition
Few-shot segmentation is a task that uses a few annotations to segment unseen images of new classes. Let $S$ be a set of support images and the corresponding manual annotations. Let $Q$ be the set of query images that need to be segmented. The images in $S$ and $Q$ belong to the same new class $c$. Let $D$ be the training dataset of known classes that already exist, and $c$ is not among the classes of $D$. The goal of few-shot segmentation is to build a model from $D$ that outputs a binary mask for each query image in $Q$ based on $S$.
Similar to existing few-shot segmentation networks, the proposed framework includes a support branch, a query branch and a transformation module, as shown in Fig. 1. To make the network generalize better to unseen classes, our feature extractor uses only relatively shallow layers (the first three layers of the pre-trained backbone network). In addition, the support branch and the query branch share the feature extraction backbone.
After obtaining the deep features $F_s$ and $F_q$ from the support image and the query image, $F_s$ is weighted spatially by the annotated mask $G_s$ to get the features $\tilde{F}_s$, which guarantees that the features only contain the corresponding foreground regions. This process can be represented as
$$\tilde{F}_s^{(i,j)} = F_s^{(i,j)} \cdot G_s^{(i,j)}, \tag{1}$$
where $(i, j)$ is the spatial location in the feature map $F_s$ or the annotated mask $G_s$.
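The spatial weighting described above can be sketched in a few lines. This is a minimal illustration, assuming the feature map is stored as nested lists of shape C x H x W and the mask as an H x W binary list; the function name `mask_features` is hypothetical and not part of the original implementation.

```python
def mask_features(features, mask):
    """Multiply every channel of a C x H x W feature map by an H x W mask.

    Locations where the mask is 0 (background) are zeroed in all channels,
    so only foreground features survive.
    """
    return [
        [[value * mask[i][j] for j, value in enumerate(row)]
         for i, row in enumerate(channel)]
        for channel in features
    ]


# Toy example: 2 channels, 2 x 2 spatial grid.
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[5.0, 6.0], [7.0, 8.0]]]
mask = [[1, 0],
        [0, 1]]
masked = mask_features(features, mask)
```

In a tensor library the same operation is a broadcast element-wise product between the feature tensor and the mask.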
Then, the masked support features $\tilde{F}_s$ and the query features $F_q$ are mapped into a high-dimensional embedding space by further convolution operations to get the corresponding embedding features $E_s$ and $E_q$, respectively, so that cosine similarity can be used to calculate the relationship between the feature pixels in this space. Simultaneously, the query feature $F_q$ is passed through a convolution operation to get $\tilde{F}_q$.
Based on the embeddings $E_s$ and $E_q$ and the groundtruth mask of the support image $G_s$, the proposed transformation module applies a linear transformation of the relationship matrix to obtain the attention map $A$ with high-level semantic cues. A detailed description is given in Section 2.3. The attention map $A$ finally filters the deep query features $\tilde{F}_q$ to $\hat{F}_q$ by
$$\hat{F}_q^{(i,j)} = \tilde{F}_q^{(i,j)} \cdot A^{(i,j)}, \tag{2}$$
where $(i, j)$ is the spatial location in the feature maps or the attention map $A$. The attention map thus indicates a rough object area to be segmented, and this coarse object area provides high-level semantic information for the subsequent sub-network.
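We next use a segmentation sub-network to generate the segmentation mask from the filtered features $\hat{F}_q$. The sub-network is shown in Fig. 2. Specifically, in order to better handle the scale changes of the object, we introduce the multi-scale feature fusion module ASPP [18] into the sub-network. The output of the sub-network is the probability map $P$ with the same size as the query image, and the cross entropy loss in Eq. (3) is used to supervise the training of the model, i.e.,
$$L_{seg} = -\frac{1}{N}\sum_{i,j}\left[G_q^{(i,j)} \log P^{(i,j)} + \left(1 - G_q^{(i,j)}\right) \log\left(1 - P^{(i,j)}\right)\right], \tag{3}$$
where $G_q$ is the groundtruth mask of the query image, $P$ is the predicted probability map of our few-shot segmentation model, $(i, j)$ is the spatial location in $G_q$ or $P$, and $N$ is the number of spatial locations.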
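As an illustration of the pixel-wise cross entropy supervision, the following sketch computes the loss for a foreground probability map and a binary groundtruth mask. It assumes the loss is averaged over all pixels and that probabilities are clipped for numerical stability; both choices, and the name `cross_entropy`, are assumptions of this sketch rather than details stated in the paper.

```python
import math


def cross_entropy(prob_map, gt_mask, eps=1e-7):
    """Mean binary cross entropy between an H x W foreground probability
    map and an H x W binary groundtruth mask."""
    total, count = 0.0, 0
    for prob_row, gt_row in zip(prob_map, gt_mask):
        for p, g in zip(prob_row, gt_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= g * math.log(p) + (1 - g) * math.log(1.0 - p)
            count += 1
    return total / count


# An uninformative prediction (0.5 everywhere) costs log(2) per pixel.
uncertain_loss = cross_entropy([[0.5, 0.5]], [[1, 0]])
# A perfect prediction costs almost nothing.
perfect_loss = cross_entropy([[1.0, 0.0]], [[1, 0]])
```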
2.3 Transformation module
Existing methods often establish the transformation between the support image and the query image from the global features of the support image, and thus lose local geometric information that is also important to the transformation. The proposed transformation module is instead designed on the relationships between local pixels (represented by the relationship matrix of the Non-local model), and more accurate segmentation can be achieved by propagating these local relationships.
The detailed steps are illustrated in Fig. 3. The transformation is achieved by a linear transformation of the relationship matrix. Specifically, the relationship matrix between the embedding $E_s$ (for the support image) and the embedding $E_q$ (for the query image) is linearly transformed by the generalized inverse of $G_s$ (the groundtruth mask matrix of the support image), which overcomes the difficulty of converting a local relationship matrix into high-level semantic information. The result of the linear transformation is the attention map of the query image.
2.3.1 Relationship Matrix
For the few-shot segmentation task, it is very important to model the relationship between each pair of deep local features of the support image and the query image. Because the convolution operation is local, the relationships between distant pixels cannot be established directly. The Non-local structure was proposed to overcome this: the feature tensor is reshaped into a matrix, and a relationship matrix is established by a matrix product. This relationship matrix contains the relationship between each pair of local deep features. We follow the relationship matrix of the Non-local model to establish the relationships in few-shot segmentation.
In the Non-local model, the aim is only to establish long-distance constraints, and the matrix product merely describes the relationship between two local features. For the few-shot segmentation task, this is a rough description of feature similarity. Therefore, in our proposed transformation module, the features are first mapped into an embedding space, in which cosine similarity can be used to calculate the relationship between local features. This process is represented as
$$R^{(i,j)} = \frac{\langle e_q^{i}, e_s^{j} \rangle}{\lVert e_q^{i} \rVert \, \lVert e_s^{j} \rVert}, \tag{4}$$
where $e_q^{i}$ represents the $i$-th local feature in the query embedding $E_q$, and $e_s^{j}$ represents the $j$-th local feature in the support embedding $E_s$. We have thus established a relationship matrix $R$ between the query image and the support image.
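The construction of the relationship matrix can be sketched as follows. This is a minimal illustration, assuming each embedding has been reshaped into a list of per-location vectors (one vector per spatial position); the function name `cosine_relationship` and the small `eps` guard against zero-length vectors are assumptions of this sketch.

```python
import math


def cosine_relationship(query_emb, support_emb, eps=1e-8):
    """Return R with R[i][j] = cosine similarity between the i-th query
    vector and the j-th support vector."""
    def normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / max(norm, eps) for x in vector]

    q = [normalize(v) for v in query_emb]
    s = [normalize(v) for v in support_emb]
    # After normalization the dot product equals the cosine similarity.
    return [[sum(a * b for a, b in zip(qi, sj)) for sj in s] for qi in q]


# Three query locations and two support locations, embedding dimension 2.
query_emb = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
support_emb = [[3.0, 0.0], [0.0, 1.0]]
R = cosine_relationship(query_emb, support_emb)
```

The result has one row per query location and one column per support location, and it does not depend on the magnitude of the embedding vectors, only on their direction.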
2.3.2 Linear transformation based on generalized inverse matrix
Given the above relationship matrix $R$, how to convert it into high-level semantic information of the query image becomes another key point. Let $G_s$ and $G_q$ be the binary groundtruth masks of the support image and the query image, respectively. Based on Eq. (4), the true relationship matrix between the query image and the support image can be simplified to the matrix product of $G_q$ and $G_s$, i.e.,
$$R_{gt} = G_q \otimes G_s, \tag{5}$$
where $\otimes$ denotes the matrix product. The original size of $G_s$ and $G_q$ is $h \times w$. After reshaping, $G_q$ has size $hw \times 1$ and $G_s$ has size $1 \times hw$. The size of $R_{gt}$ is $hw \times hw$, and it contains the relationship information of each pair of local feature pixels of the query image and the support image. Our target is to obtain $G_q$ based on Eq. (5).
We suppose that the learned matrix $R$ is approximately equal to the true relationship matrix between the query image and the support image. Furthermore, we relax the binary groundtruth mask $G_q$ to a soft attention map $A$, which provides high-level semantic information. Since $G_s$ is known in the few-shot segmentation task, the problem becomes obtaining the attention map $A$ from
$$R \approx A \otimes G_s. \tag{6}$$
Moreover, since $G_s$ is not a square matrix, its inverse does not exist. However, it can be regarded as a matrix of full row rank. According to generalized inverse matrix theory, the transformation problem can be represented by
$$A = R \otimes G_s^{+}, \tag{7}$$
where $\otimes$ denotes the matrix product and $G_s^{+}$ is the right inverse (one type of generalized inverse) of $G_s$, which satisfies $G_s \otimes G_s^{+} = I$. This is simply a process that applies the generalized inverse of $G_s$ to linearly transform the relationship matrix $R$. Finally, the attention map $A$ can be obtained directly by Eq. (7).
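The linear transformation by the right inverse can be sketched for the case described above, where the reshaped support mask is a single row vector. For a non-zero row vector $g$ the right inverse $g^{T}(g g^{T})^{-1}$ reduces to $g^{T}$ divided by the squared norm of $g$, so for a binary mask the resulting attention value of a query location is the average of its relationship values over the support foreground pixels. The function names below are hypothetical, and the sketch only covers this row-vector case.

```python
def right_inverse_row(g):
    """Right inverse of a non-zero 1 x N row vector g, returned as a list
    of N entries: g^T (g g^T)^{-1} = g^T / sum(g_j^2)."""
    squared_norm = sum(x * x for x in g)
    if squared_norm == 0:
        raise ValueError("empty mask: the row vector does not have full row rank")
    return [x / squared_norm for x in g]


def attention_from_relationship(relationship, support_mask_flat):
    """Multiply the (N_q x N_s) relationship matrix by the right inverse
    of the flattened support mask to get one attention value per query
    location."""
    g_inv = right_inverse_row(support_mask_flat)
    return [sum(r * p for r, p in zip(row, g_inv)) for row in relationship]


# Sanity check with an ideal relationship matrix built from known masks:
# the transformation recovers the query mask exactly.
query_mask_flat = [1, 0, 1]
support_mask_flat = [0, 1, 1, 0]
ideal_relationship = [[q * s for s in support_mask_flat] for q in query_mask_flat]
attention = attention_from_relationship(ideal_relationship, support_mask_flat)
```

With a learned relationship matrix the recovered values are soft rather than binary, which is why the result is treated as an attention map.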
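In order to ensure that the learned relationship matrix $R$ is consistent with the true relationship matrix $R_{gt} = G_q \otimes G_s$ during training, the mean square error loss is used to supervise it, i.e.,
$$L_{mse} = \frac{1}{(hw)^2} \sum_{i,j} \left(R^{(i,j)} - R_{gt}^{(i,j)}\right)^2. \tag{8}$$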
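The supervision of the relationship matrix can be illustrated as follows. The sketch builds the target as the outer product of the flattened query and support masks and compares it with a learned matrix by the mean of squared differences; the function names and the toy values are assumptions made only for illustration.

```python
def true_relationship(query_mask_flat, support_mask_flat):
    """Outer product of the flattened binary masks: entry (i, j) is 1 only
    if query location i and support location j are both foreground."""
    return [[q * s for s in support_mask_flat] for q in query_mask_flat]


def mse(matrix_a, matrix_b):
    """Mean of squared element-wise differences between two matrices of
    the same shape."""
    diffs = [(x - y) ** 2
             for row_a, row_b in zip(matrix_a, matrix_b)
             for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


target = true_relationship([1, 0], [1, 1])   # [[1, 1], [0, 0]]
learned = [[0.5, 1.0], [0.0, 0.5]]
relation_loss = mse(learned, target)
```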
2.3.3 Attention Map
By linearly transforming the relationship matrix with Eq. (7), the attention map $A$ of the query image is obtained and reshaped to size $h \times w$. Moreover, we normalize it to $[0, 1]$ (Eq. (9)). For convenience of expression, we do not distinguish the normalized map from the original one and denote both by $A$.
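In order to ensure the accuracy of the attention map $A$, we regard it as the foreground probability map of the segmentation result, and $1 - A$ as the background probability map. The two maps are concatenated and resized to the size of the original query image by bilinear interpolation. The combined map $P_A$ can be regarded as a segmentation result and is supervised by the cross entropy loss
$$L_{att} = -\frac{1}{N}\sum_{i,j}\left[G_q^{(i,j)} \log P_A^{(i,j)} + \left(1 - G_q^{(i,j)}\right) \log\left(1 - P_A^{(i,j)}\right)\right], \tag{10}$$
where $G_q$ is the groundtruth mask of the query image, $P_A^{(i,j)}$ is the foreground probability at location $(i, j)$ of the map $P_A$ derived from the attention map $A$, $(i, j)$ is the spatial location in $G_q$ and $P_A$, and $N$ is the number of spatial locations.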
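The step from a raw attention map to a foreground and background probability pair can be sketched as below. The normalization formula is not recoverable from the text, so this sketch assumes a simple min-max normalization; that choice, the `eps` guard, and the function names are assumptions. The bilinear resizing to the query image size is omitted.

```python
def min_max_normalize(attention, eps=1e-8):
    """Rescale a flattened attention map to the range [0, 1].
    Min-max scaling is an assumed choice, not the paper's stated formula."""
    lowest, highest = min(attention), max(attention)
    return [(a - lowest) / max(highest - lowest, eps) for a in attention]


def foreground_background(attention):
    """Return the normalized map as foreground probabilities and its
    complement as background probabilities."""
    foreground = min_max_normalize(attention)
    background = [1.0 - a for a in foreground]
    return foreground, background


# Raw attention values derived from cosine similarities lie in [-1, 1].
fg, bg = foreground_background([-1.0, 0.0, 1.0])
```

At every location the two probabilities sum to one, so the concatenated pair can be fed to a standard two-class cross entropy loss.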
For 5-shot, there are five support images. In order to combine the attention maps $A_1, \dots, A_5$ provided by the five different support images, we simply average them by
$$A = \frac{1}{5}\sum_{k=1}^{5} A_k. \tag{11}$$
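Averaging the attention maps obtained from several support images can be sketched as follows, assuming each map has already been flattened to a list of the same length. The function name `average_attention` is hypothetical.

```python
def average_attention(attention_maps):
    """Element-wise mean of K flattened attention maps of equal length."""
    k = len(attention_maps)
    return [sum(values) / k for values in zip(*attention_maps)]


# Five support images, each giving an attention map over two query locations.
maps = [
    [1.0, 0.0],
    [0.0, 0.0],
    [1.0, 1.0],
    [1.0, 0.0],
    [1.0, 0.0],
]
combined = average_attention(maps)
```

A location is kept with high confidence only when most support images agree on it, which makes the combined map less sensitive to a single unrepresentative support image.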
Our method is implemented with the PyTorch library. The Adam optimizer is adopted to update the parameters, and all experiments are run on a machine equipped with a Titan XP GPU. We set the initial learning rate to 1e-4. The feature extraction backbone is pre-trained on the ImageNet dataset, the parameters of its early layers are frozen, and we use only its first three layers as our backbone.
Table 1: Classes in the subsets of the Pascal VOC 2012 split.
|bus, car, cat, chair, cow|
|diningtable, dog, horse, motorbike, person|
|potted plant, sheep, sofa, train, tv/monitor|
3.1 Implementation Details
We validate the proposed method on the Pascal VOC 2012 dataset and its extended dataset SDS. Similar to existing methods [3, 4, 5, 6, 7], we split the images of the 20 classes into four subsets, each of which contains the images of five classes; the detailed split can be found in Table 1. Three of the four subsets are selected as the training set, and the remaining one is used as the test set to validate the effectiveness of the proposed method. In the training stage, we randomly select two images of each class, one as the support image and the other as the query image, until all images of the training classes have been selected. In the testing stage, to make a fair comparison with existing methods, we use the same random seed as the existing methods to sample the same 1000 pairs of images as the test data for each evaluation subset.
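The evaluation protocol above can be sketched as follows. This is a minimal illustration, assuming the 20 classes are indexed 0 to 19 and grouped into consecutive subsets of five, and that test pairs are drawn with a seeded generator; the exact sampling routine of the existing methods is not given in the text, so `sample_test_pairs`, its default seed, and the toy image names are assumptions.

```python
import random


def make_folds(num_classes=20, num_folds=4):
    """Group class indices into consecutive subsets of equal size."""
    size = num_classes // num_folds
    return [list(range(f * size, (f + 1) * size)) for f in range(num_folds)]


def train_test_classes(folds, test_fold):
    """Use one subset for testing and the remaining ones for training."""
    train = [c for f, fold in enumerate(folds) if f != test_fold for c in fold]
    return train, folds[test_fold]


def sample_test_pairs(images_by_class, test_classes, num_pairs=1000, seed=0):
    """Draw (class, support image, query image) triples reproducibly.
    The two images of a pair are distinct images of the same class."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        cls = rng.choice(test_classes)
        support, query = rng.sample(images_by_class[cls], 2)
        pairs.append((cls, support, query))
    return pairs


folds = make_folds()
train_classes, test_classes = train_test_classes(folds, test_fold=1)
# Hypothetical image identifiers, four per test class.
images_by_class = {c: [f"img_{c}_{k}" for k in range(4)] for c in test_classes}
pairs = sample_test_pairs(images_by_class, test_classes)
```

Fixing the seed is what makes the 1000 test pairs identical across methods, so differences in the reported scores come from the models rather than from the sampled data.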
We use the mean intersection over union of foreground (mIoU) to measure the performance of our proposed method, which is widely used in few-shot segmentation. In addition, the FB-IoU proposed in co-FCN is also considered, which includes mean intersection over union of foreground and background.
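The two metrics can be illustrated with the sketch below. It assumes predictions and groundtruth masks are flattened binary lists, that intersections and unions are accumulated over all test pairs before dividing, and that mIoU averages the foreground IoU over classes while FB-IoU averages the foreground and background IoU regardless of class. The accumulation protocol and the function names are assumptions of this sketch.

```python
def intersection_and_union(pred, gt, label):
    """Count pixels where both masks have `label` and where either does."""
    inter = sum(1 for p, g in zip(pred, gt) if p == label and g == label)
    union = sum(1 for p, g in zip(pred, gt) if p == label or g == label)
    return inter, union


def accumulated_iou(pairs, label):
    """IoU of `label` accumulated over a list of (pred, gt) pairs."""
    inter = union = 0
    for pred, gt in pairs:
        i, u = intersection_and_union(pred, gt, label)
        inter += i
        union += u
    return inter / union if union else 0.0


def mean_iou(results_by_class):
    """Average of the per-class foreground IoU."""
    ious = [accumulated_iou(pairs, 1) for pairs in results_by_class.values()]
    return sum(ious) / len(ious)


def fb_iou(pairs):
    """Average of foreground IoU and background IoU, ignoring classes."""
    return 0.5 * (accumulated_iou(pairs, 1) + accumulated_iou(pairs, 0))


results_by_class = {
    "cat": [([1, 1, 0, 0], [1, 0, 0, 0])],
    "dog": [([0, 1, 1, 0], [0, 1, 1, 0])],
}
all_pairs = [pair for pairs in results_by_class.values() for pair in pairs]
miou_value = mean_iou(results_by_class)
fb_iou_value = fb_iou(all_pairs)
```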
3.2 Comparison with Benchmarks
In order to verify the effectiveness of our method, we compare it with existing methods in the 1-shot and 5-shot settings. We follow CA-Net in adopting DenseCRF and a multi-scale evaluation strategy to improve the performance, both of which are commonly employed in existing semantic segmentation methods. The detailed results can be found in Table 2, Table 3 and Table 4. The mIoU of our method reaches 57.0% in 1-shot and 60.6% in 5-shot, which outperforms the state-of-the-art few-shot segmentation method CA-Net by 1.6% and 3.5%, respectively. The larger improvement in 5-shot indicates the advantage of our method when more annotations are available. In addition, the FB-IoU reaches 71.8% in 1-shot and 74.6% in 5-shot, which also clearly outperforms the compared methods.
In the training stage, the weights of the three loss functions are all set to 1. In order to validate the effectiveness of our three loss functions, an ablation experiment is conducted. The detailed results can be found in Table 5. We can see that the two losses ablated in Table 5 improve the performance by 1.1% and 1.0%, respectively. In addition, we follow CA-Net in employing DenseCRF and multi-scale evaluation in our test stage. An ablation of these two strategies is also conducted, and the detailed results can be found in Table 6. We can see that DenseCRF and the multi-scale evaluation strategy improve the mIoU by 0.4% and 0.3%, respectively.
3.4 Subjective Result
The subjective results of the proposed method are shown in Fig. 4. The support image, the ground-truth mask of the support image, the query image, the ground-truth mask of the query image and the segmentation result are displayed from left column to right column, respectively. It is seen that the proposed method segments objects from these images successfully.
This paper proposes a new transformation module for few-shot segmentation. Rather than focusing on global cues, the module uses the relationships between local features to form the transformation. These relationships are represented by a local feature relationship matrix computed with cosine similarity, and the relationship matrix is linearly transformed by the generalized inverse of the groundtruth mask matrix. We also map the features into a high-dimensional metric embedding space to enhance the generalization of the proposed module. We build a new few-shot segmentation network on this transformation module, and it obtains better results in terms of both mIoU and FB-IoU.
This work was supported in part by the National Natural Science Foundation of China under Grant 61871087, Grant 61502084, Grant 61831005, and Grant 61601102, and supported in part by Sichuan Science and Technology Program under Grant 2018JY0141.
-  Long, J., Shelhamer, E., Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431-3440).
-  Chen, L. C., Papandreou, G., Schroff, F., Adam, H. (2017). Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
-  Shaban, A., Bansal, S., Liu, Z., Essa, I., Boots, B. (2017). One-shot learning for semantic segmentation. BMVC.
-  Rakelly, K., Shelhamer, E., Darrell, T., Efros, A. A., Levine, S. (2018). Conditional networks for few-shot semantic segmentation. In ICLR workshop.
-  Zhang, X., Wei, Y., Yang, Y., Huang, T. (2018). Sg-one: Similarity guidance network for one-shot semantic segmentation. arXiv preprint arXiv:1810.09091.
-  Hu, T., Yang, P., Zhang, C., Yu, G., Mu, Y., Snoek, C. G. (2019). Attention-based Multi-Context Guiding for Few-Shot Semantic Segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence.
-  Zhang, C., Lin, G., Liu, F., Yao, R., Shen, C. (2019). CANet: Class-Agnostic Segmentation Networks with Iterative Refinement and Attentive Few-Shot Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5217-5226).
-  Wang, X., Girshick, R., Gupta, A., He, K. (2018). Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7794-7803).
-  Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. International journal of computer vision, 111(1), 98-136.
-  Caelles, S., Maninis, K. K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., Van Gool, L. (2017). One-shot video object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 221-230).
-  Dong, N., Xing, E. (2018). Few-Shot Semantic Segmentation with Prototype Learning. In BMVC (Vol. 1, p. 6)
-  Hariharan, B., Arbelaez, P., Bourdev, L. D., Maji, S., Malik, J. (2011). Semantic contours from inverse detectors. In IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011. IEEE.
-  Faktor, A., Irani, M. (2013). Co-segmentation by composition. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1297-1304).
-  Krähenbühl, P., Koltun, V. (2011). Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in neural information processing systems (pp. 109-117).
-  Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A. L. (2017). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4), 834-848.
-  Kingma, D. P., Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
-  Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., Fei-Fei, L. (2009, June). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248-255). IEEE.
-  He, K., Zhang, X., Ren, S., Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
-  Rao, C. R., Mitra, S. K. (1972). Generalized inverse of a matrix and its applications. Icams Conference (Vol.1).