A New Local Transformation Module for Few-shot Segmentation

10/14/2019 · Yuwei Yang et al.

Few-shot segmentation segments object regions of new classes with only a few manual annotations. Its key step is to establish a transformation module between support images (annotated images) and query images (unlabeled images), so that the segmentation cues of the support images can guide the segmentation of the query images. Existing methods build the transformation on global cues, which ignores the local cues that this paper shows to be very important for the transformation. This paper proposes a new transformation module based on local cues, where the relationships of local features are used for the transformation. To enhance the generalization ability of the network, the relationship matrix is calculated with the cosine distance in a high-dimensional metric embedding space. In addition, to handle the challenging mapping from low-level local relationships to high-level semantic cues, we propose to linearly transform the relationship matrix by the generalized inverse of the annotation matrix of the support image, which is non-parametric and class-agnostic. The result of this matrix transformation can be regarded as an attention map with high-level semantic cues, on which a transformation module can be built simply. The proposed transformation module is a general module that can replace the transformation module in existing few-shot segmentation frameworks. We verify the effectiveness of the proposed method on the Pascal VOC 2012 dataset. The mIoU reaches 57.0%, outperforming the state-of-the-art method by 1.6%.


1 Introduction

Image segmentation is a basic computer vision task [1]. In recent years, with the rapid development of deep learning, several convolutional neural network based segmentation methods, such as FCN [1] and DeepLab v3 [2], have greatly improved segmentation performance. However, these methods rely heavily on large amounts of annotations. To overcome this shortcoming, the few-shot segmentation task [3] was proposed to segment new classes with only a few manual annotations, such as one annotation (1-shot segmentation) or five annotations (5-shot segmentation). Few-shot segmentation is challenging because the classes seen during training and testing do not overlap.

In the few-shot segmentation task, the annotated and unlabeled images are called support images and query images, respectively [3]. Existing models [3, 4, 5, 6, 7] usually consist of three components: 1) a support branch that extracts features from the support images; 2) a query branch that extracts features from the query images; 3) a transformation module that transfers features from the support branch to the query branch to facilitate the segmentation of the query branch. The essential component is the transformation module, and the most challenging obstacle is how to design a transformation module that is class-agnostic, so that it can generalize to new classes efficiently. Existing methods [5, 7] use the global cues of the support image to model the transformation process, which however ignores the geometric relationships of the local features. This paper demonstrates that these local geometric relationships are very useful to the transformation module.

This paper proposes a new transformation module based on local cues, where the relationships of local features are used to accomplish the transformation. Our idea is to transform the relationship matrix linearly in a high-dimensional metric embedding space. To this end, we first map the local features into an embedding space, where the cosine distance is used to obtain the relationship matrix of local features. Then, the relationship matrix is transformed linearly by the generalized inverse of the annotation matrix of the support image. After the linear transformation, the result is regarded as an attention map containing high-level semantic information, from which we build a new attention-based transformation module. We verify the effectiveness of our transformation module on the Pascal VOC 2012 dataset [9]. The mIoU reaches 57.0% in 1-shot and 60.6% in 5-shot, which outperforms the state-of-the-art method by 1.6% and 3.5%, respectively.

2 Proposed Method

2.1 Problem Definition

Few-shot segmentation uses a few annotations to segment images of new classes. Let $S = \{(I_s^i, M_s^i)\}_{i=1}^{k}$ denote the set of support images and their corresponding manual annotations, and let $Q$ denote the set of query images that need to be segmented. The images in $S$ and $Q$ belong to the same new class $c$. Let $D_{train}$ be the training dataset of known classes $C_{train}$, with $c \notin C_{train}$. The goal of few-shot segmentation is to build a model from $D_{train}$ that, given the support set $S$, outputs a binary mask for each query image in $Q$.
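
The episodic structure of the task can be summarized by a small data container; below is a minimal PyTorch-style sketch, where the class and field names (FewShotEpisode, support_images, and so on) are illustrative and not taken from the paper:

```python
from dataclasses import dataclass
from typing import List

import torch


@dataclass
class FewShotEpisode:
    """One k-shot episode for a single class c that is unseen during training."""
    support_images: List[torch.Tensor]  # k images of class c, each 3 x H x W
    support_masks: List[torch.Tensor]   # k binary annotation masks, each H x W
    query_image: torch.Tensor           # one image of class c to be segmented, 3 x H x W
    query_mask: torch.Tensor            # ground truth; used for training and evaluation only
```

During training, episodes are sampled from the known classes $C_{train}$; at test time, they are sampled from the new class $c$.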

2.2 Overview

Similar to existing few-shot segmentation networks, the proposed framework includes a support branch, a query branch and a transformation module, as shown in Fig. 1. In order to generalize better to unseen classes, our feature extraction backbone adopts relatively shallow layers, namely the first three layers of ResNet [18]. In addition, the support branch and the query branch share the feature extraction backbone.

Figure 1: The pipeline of the proposed method. The feature extractor produces the deep features $F_s$ (support) and $F_q$ (query). $F_s$ is weighted spatially by the ground-truth mask $M_s$ to obtain $F_s'$. The features $F_s'$ and $F_q$ are mapped into an embedding space to obtain $E_s$ and $E_q$, respectively. Simultaneously, $F_q$ is processed by a further convolution to obtain $F_q'$. Based on $E_s$, $E_q$ and $M_s$, the proposed transformation module outputs the attention map $A$, which weights $F_q'$ spatially. The segmentation result is finally obtained through the segmentation sub-network.

After obtaining the deep features $F_s$ and $F_q$ from the support image and the query image, $F_s$ is weighted spatially by the annotation mask $M_s$ to obtain the feature $F_s'$, which guarantees that the feature only contains the corresponding foreground regions. This process can be represented as

$$F_s'(x, y) = F_s(x, y) \cdot M_s(x, y), \qquad (1)$$

where $(x, y)$ is the spatial location in the feature map $F_s$ or the annotation mask $M_s$ (resized to the feature resolution).

Then, the features $F_s'$ and $F_q$ are mapped into a high-dimensional embedding space by further convolution operations to obtain the corresponding embedding features $E_s$ and $E_q$, respectively, so that the cosine distance can be used to measure the relationship between feature pixels in this space. Simultaneously, $F_q$ is processed by an additional convolution operation to obtain $F_q'$.
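
The following PyTorch sketch illustrates this stage under the notation above; the channel sizes, layer shapes and the use of a shared projection are our assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embed = nn.Conv2d(256, 256, kernel_size=1)                   # maps deep features into the embedding space
query_head = nn.Conv2d(256, 256, kernel_size=3, padding=1)   # extra convolution that turns F_q into F'_q


def mask_support_feature(f_s: torch.Tensor, m_s: torch.Tensor) -> torch.Tensor:
    """Eq. (1): weight the support feature spatially by its annotation mask."""
    m = F.interpolate(m_s[:, None].float(), size=f_s.shape[-2:], mode='nearest')
    return f_s * m                                            # keeps only foreground responses


# Toy shapes: one episode, 256-channel features on a 41x41 grid, mask at image resolution.
f_s, f_q = torch.randn(1, 256, 41, 41), torch.randn(1, 256, 41, 41)
m_s = torch.rand(1, 321, 321) > 0.5

f_s_masked = mask_support_feature(f_s, m_s)                   # F'_s in the text
e_s, e_q = embed(f_s_masked), embed(f_q)                      # embedding features E_s, E_q
f_q_prime = query_head(f_q)                                   # feature later filtered by the attention map, Eq. (2)
```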

Based on the embeddings $E_s$ and $E_q$ and the ground-truth mask $M_s$ of the support image, the proposed transformation module applies a linear matrix transformation to obtain the attention map $A$ with high-level semantic cues; the details are given in Section 2.3. The attention map finally filters the deep feature $F_q'$ into $F_q''$ by

$$F_q''(x, y) = F_q'(x, y) \cdot A(x, y), \qquad (2)$$

where $(x, y)$ is the spatial location in the feature maps or the attention map $A$. The attention map indicates a rough area of the object to be segmented; this coarse area provides high-level semantic information for the subsequent sub-network.

We next use a segmentation sub-network to generate the segmentation mask from $F_q''$; its structure is shown in Fig. 2. Specifically, in order to better handle scale changes of the object, we introduce the multi-scale feature fusion module ASPP [15] and residual connections [18] into this sub-network. Its output is the probability map $P$ with the same size as the query image, and the cross-entropy loss in Eq. (3) is used to supervise the training of the model, i.e.,

$$L_{seg} = -\sum_{(x, y)} \big[ M_q(x, y) \log P(x, y) + (1 - M_q(x, y)) \log\big(1 - P(x, y)\big) \big], \qquad (3)$$

where $M_q$ is the ground-truth mask of the query image, $P$ is the predicted probability map of our few-shot segmentation model, and $(x, y)$ is the spatial location in $M_q$ or $P$.

Figure 2: The structure of the segmentation sub-network. The residual connection [18] and the multi-scale ASPP module [15] are used to handle scale variations of the object.

2.3 Transformation module

Existing methods often establish the transformation between the support image and the query image through the global features of the support image, and thus lose local geometric information, which is also important for the transformation. The proposed transformation module is instead built on the relationships between local pixels (represented by a relationship matrix, as in the Non-Local model [8]), and more accurate segmentation can be achieved by propagating these local relationships.

The detailed steps are illustrated in Fig. 3. The transformation is achieved by a linear transformation of the relationship matrix. Specifically, the relationship matrix $R$ between $E_s$ (support image) and $E_q$ (query image) is linearly transformed by the generalized inverse of the reshaped ground-truth mask $M_s^T$ of the support image, which overcomes the difficulty of converting the low-level local relationship matrix into high-level semantic information. The result of this linear transformation is the attention map of the query image.

2.3.1 Relationship Matrix

For the few-shot segmentation task, it is very important to model the relationship between each pair of deep local features of the support image and the query image. Due to the local nature of the convolution operation, relationships between long-distance pixels cannot be established directly. The Non-Local structure [8] was proposed to overcome this: the feature tensor is reshaped into a matrix, and a relationship matrix is built by matrix product, which contains the relationship between each pair of local deep features. We follow the relationship matrix of Non-Local [8] to establish the relationships for few-shot segmentation.

In Non-Local [8], the goal is only to establish long-distance constraints, and the matrix product merely gives a rough description of the similarity between two local features. For the few-shot segmentation task this description is too coarse. Therefore, in our transformation module, the features are first mapped into an embedding space, in which the cosine distance is used to measure the relationship between local features. This process is represented as

$$R(i, j) = \frac{E_q(i) \cdot E_s(j)}{\lVert E_q(i) \rVert \, \lVert E_s(j) \rVert}, \qquad (4)$$

where $E_q(i)$ represents the $i$-th local feature in the embedding $E_q$ and $E_s(j)$ represents the $j$-th local feature in the embedding $E_s$. In this way we obtain the relationship matrix $R$ between the query image and the support image.
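
A minimal PyTorch sketch of Eq. (4), assuming the reshaped-matrix notation used above (the function name and shapes are ours):

```python
import torch
import torch.nn.functional as F


def relationship_matrix(e_q: torch.Tensor, e_s: torch.Tensor) -> torch.Tensor:
    """Eq. (4): cosine similarity between every query location and every support location.

    e_q, e_s: embedding features of shape (B, C, H, W).
    Returns R of shape (B, H*W, H*W), where R[b, i, j] compares the i-th query
    location with the j-th support location.
    """
    b, c, h, w = e_q.shape
    q = F.normalize(e_q.reshape(b, c, h * w), dim=1)   # unit-norm local query features
    s = F.normalize(e_s.reshape(b, c, h * w), dim=1)   # unit-norm local support features
    return torch.bmm(q.transpose(1, 2), s)             # (B, HW, HW) cosine similarities
```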

Figure 3: The detailed structure of the proposed transformation module. The embeddings $E_q$ and $E_s$ are reshaped into matrices. Then the relationship matrix $R$ is obtained from $E_q$ and $E_s$ by cosine similarity. Finally, $R$ is transformed linearly by the generalized inverse of $M_s^T$. After a reshape operation, the attention map $A$ is obtained.

2.3.2 Linear transformation based on generalized inverse matrix

With the relationship matrix $R$ at hand, the remaining key point is how to convert it into high-level semantic information about the query image. Let $M_s$ and $M_q$ be the binary ground-truth masks of the support image and the query image, respectively. Following Eq. (4), the true relationship matrix between the query image and the support image can be written as the matrix product of $M_q$ and $M_s^T$, i.e.,

$$R^* = M_q \otimes M_s^T, \qquad (5)$$

where $\otimes$ denotes matrix product. The original size of both $M_q$ and $M_s$ is $h \times w$; after reshaping, $M_q$ and $M_s^T$ have sizes $hw \times 1$ and $1 \times hw$, respectively. The size of $R^*$ is therefore $hw \times hw$, and it contains the relationship of each pair of local feature pixels of the query and support images. Our target is to obtain $M_q$ based on Eq. (5).

We suppose that the learned relationship matrix $R$ is approximately equal to the true relationship matrix $R^*$ between the query image and the support image. Furthermore, we relax the binary ground-truth mask $M_q$ to a soft attention map $A$, which provides high-level semantic information. Since $M_s$ is known in the few-shot segmentation task, the problem becomes solving for the attention map $A$ in

$$R = A \otimes M_s^T. \qquad (6)$$

Since $M_s^T$ is not a square matrix, its inverse does not exist. However, it can be regarded as a matrix of full row rank. According to generalized inverse matrix theory [19], the transformation can be represented by

$$A = R \otimes (M_s^T)_r^{-1}, \qquad (7)$$

where $\otimes$ denotes matrix product and $(M_s^T)_r^{-1}$ is the right inverse (one type of generalized inverse) of $M_s^T$. This is exactly the process of applying the generalized inverse of $M_s^T$ to linearly transform the relationship matrix $R$. Finally, the attention map $A$ is obtained from Eq. (7) directly.
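
A short PyTorch sketch of Eqs. (5)-(7) as we read them; the variable names and shapes are our own. For a row vector $B$ of full row rank, the right inverse is $B^T (B B^T)^{-1}$, which is what the code computes for $M_s^T$:

```python
import torch


def attention_from_relationship(r: torch.Tensor, m_s: torch.Tensor) -> torch.Tensor:
    """Eqs. (5)-(7): transform the relationship matrix by the right inverse of M_s^T.

    r:   relationship matrix of shape (B, HW, HW) (query rows, support columns)
    m_s: support mask downsampled to the feature grid, shape (B, HW), values in {0, 1}
    Returns the attention map A of shape (B, HW).
    """
    m = m_s.float().unsqueeze(-1)                              # reshaped mask M_s, (B, HW, 1)
    # Right inverse of the 1 x HW row vector M_s^T:  M_s (M_s^T M_s)^{-1}
    right_inv = m / (m.transpose(1, 2) @ m).clamp_min(1e-6)    # (B, HW, 1)
    a = r @ right_inv                                          # R x (M_s^T)_r^{-1}, (B, HW, 1)
    return a.squeeze(-1)
```

Because $M_s$ is binary, $M_s^T M_s$ is simply the number of support foreground pixels, so each entry of $A$ reduces to the mean cosine similarity between that query location and all foreground locations of the support image.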

In order to ensure that the learned relationship matrix $R$ is consistent with $R^*$ during training, a mean square error loss is used to supervise it, i.e.,

$$L_{mse} = \frac{1}{(hw)^2} \sum_{i, j} \big( R(i, j) - R^*(i, j) \big)^2. \qquad (8)$$

2.3.3 Attention Map

By linearly transforming the relationship matrix through Eq. (7), the attention map $A$ of the query image is obtained and reshaped to size $h \times w$. Moreover, we normalize it to $[0, 1]$ by

$$A \leftarrow \frac{A - \min(A)}{\max(A) - \min(A)}, \qquad (9)$$

where, for convenience of expression, we do not distinguish $A$ from its normalized counterpart.

The deep feature $F_q'$ of the query image is filtered by the normalized attention map to obtain $F_q''$ through Eq. (2). Then $F_q''$ is processed by the segmentation sub-network (shown in Fig. 2) to obtain the segmentation result.

In order to ensure the accuracy of the attention map $A$, we regard it as the foreground probability map of the segmentation result and $1 - A$ as the background probability map. The two maps are concatenated and resized to the size of the original query image by bilinear interpolation. The combined map $P_A$ can be regarded as a segmentation result and is supervised by the cross-entropy loss

$$L_{att} = -\sum_{(x, y)} \big[ M_q(x, y) \log P_A(x, y) + (1 - M_q(x, y)) \log\big(1 - P_A(x, y)\big) \big], \qquad (10)$$

where $M_q$ is the ground-truth mask of the query image, $P_A$ is the foreground probability map derived from the attention map $A$, and $(x, y)$ is the spatial location in $M_q$ or $P_A$.
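
The sketch below ties together Eqs. (9), (2) and (10) in PyTorch; the min-max normalization, tensor shapes and names follow our reading of the text and are not the authors' code:

```python
import torch
import torch.nn.functional as F


def apply_attention(a: torch.Tensor, f_q_prime: torch.Tensor, m_q: torch.Tensor, img_size):
    """a: raw attention map (B, 1, h, w); f_q_prime: query feature (B, C, h, w);
    m_q: query ground-truth mask (B, H, W) with values in {0, 1}; img_size: (H, W)."""
    # Eq. (9): normalize the attention map to [0, 1].
    a_min = a.amin(dim=(2, 3), keepdim=True)
    a_max = a.amax(dim=(2, 3), keepdim=True)
    a = (a - a_min) / (a_max - a_min + 1e-6)

    # Eq. (2): filter the query feature spatially with the attention map.
    f_filtered = f_q_prime * a

    # Eq. (10): treat (1 - A, A) as background/foreground probabilities at image
    # resolution and supervise them with the cross-entropy (negative log-likelihood) loss.
    prob = F.interpolate(torch.cat([1.0 - a, a], dim=1), size=img_size,
                         mode='bilinear', align_corners=False)
    loss_att = F.nll_loss(torch.log(prob.clamp_min(1e-6)), m_q.long())
    return f_filtered, loss_att
```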

For 5-shot segmentation there are five support images. In order to combine the attention maps provided by the five different support images, we simply average them:

$$A = \frac{1}{5} \sum_{i=1}^{5} A_i. \qquad (11)$$

In the training stage, we combine the three losses in Eqs. (3), (10) and (8) to supervise the learning of our model, i.e.,

$$L = \lambda_1 L_{seg} + \lambda_2 L_{att} + \lambda_3 L_{mse}, \qquad (12)$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the weights of the corresponding loss terms.
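
A compact sketch of how the total objective of Eq. (12) could be assembled in PyTorch; the function signatures are our assumptions, and the weights follow Section 3.3, where all three are set to 1:

```python
import torch
import torch.nn.functional as F


def true_relationship(m_q_ds: torch.Tensor, m_s_ds: torch.Tensor) -> torch.Tensor:
    """Eq. (5): R* = M_q M_s^T, with both masks reshaped to (B, HW) vectors."""
    return m_q_ds.float().unsqueeze(-1) @ m_s_ds.float().unsqueeze(1)   # (B, HW, HW)


def total_loss(pred_logits, m_q, r, r_true, loss_att,
               lam_seg=1.0, lam_att=1.0, lam_mse=1.0):
    """pred_logits: (B, 2, H, W) output of the segmentation sub-network;
    m_q: (B, H, W) query mask; r, r_true: (B, HW, HW) relationship matrices."""
    loss_seg = F.cross_entropy(pred_logits, m_q.long())   # Eq. (3): segmentation loss
    loss_mse = F.mse_loss(r, r_true)                       # Eq. (8): relationship-matrix loss
    return lam_seg * loss_seg + lam_att * loss_att + lam_mse * loss_mse   # Eq. (12)
```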

3 Experiment

Our method is implemented with the PyTorch library; the Adam optimizer [16] is adopted to update the parameters, and all experiments are executed on a machine equipped with a Titan XP GPU. We set the initial learning rate to 1e-4. The feature extraction backbone is pre-trained on the ImageNet dataset [17], the parameters of these backbone layers are frozen, and we use the first three layers of ResNet [18] as our backbone.

sub-dataset    corresponding classes
split-0        aeroplane, bicycle, bird, boat, bottle
split-1        bus, car, cat, chair, cow
split-2        diningtable, dog, horse, motorbike, person
split-3        potted plant, sheep, sofa, train, tv/monitor
Table 1: The class split used to evaluate few-shot segmentation. There are 4 sub-datasets, and split-$i$ denotes the $i$-th subset, $i \in \{0, 1, 2, 3\}$. When the $i$-th sub-dataset is selected for evaluation, the remaining three are used for training.

3.1 Detail of Implementation

We validate the proposed method on the Pascal VOC 2012 dataset [9] with its augmented annotations from SDS [12]. Similar to existing methods [3, 4, 5, 6, 7], we split the images of the 20 classes into four subsets, each containing five classes; the detailed split is given in Table 1. Three of the four subsets are used as the training set, and the remaining one is used as the test set. In the training stage, we randomly select two images of the same class, one as the support image and the other as the query image, until all images of the training classes have been selected. In the testing stage, in order to make a fair comparison with existing methods, we use the same random seed as the existing methods to sample the same 1000 pairs of images as the test data for each evaluation sub-dataset.
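
The split protocol can be sketched as follows; the class list is taken from Table 1, while the helper names and the simple pair sampler are illustrative rather than the authors' code:

```python
import random

VOC_CLASSES = [
    'aeroplane', 'bicycle', 'bird', 'boat', 'bottle',
    'bus', 'car', 'cat', 'chair', 'cow',
    'diningtable', 'dog', 'horse', 'motorbike', 'person',
    'potted plant', 'sheep', 'sofa', 'train', 'tv/monitor',
]


def split_classes(test_split: int):
    """Return (train_classes, test_classes) for the chosen sub-dataset in {0, 1, 2, 3}."""
    test = VOC_CLASSES[5 * test_split: 5 * (test_split + 1)]
    train = [c for c in VOC_CLASSES if c not in test]
    return train, test


def sample_training_pair(images_by_class: dict, train_classes: list):
    """Pick one training class and two of its images: one support, one query."""
    cls = random.choice(train_classes)
    support, query = random.sample(images_by_class[cls], 2)
    return cls, support, query
```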

Methods split-0 split-1 split-2 split-3 Mean
1-NN 25.3 44.9 41.7 18.4 32.6
LogReg 26.9 42.9 37.1 18.4 31.4
Siamese 28.1 39.9 31.8 25.8 31.4
OSVOS[10] 24.9 38.8 36.5 30.1 32.6
OSLSM[3] 33.6 55.3 40.9 33.5 40.8
co-FCN[4] 36.7 50.6 44.9 32.4 41.1
SG-One[5] 40.2 58.4 48.4 38.4 46.3
CA-Net[7] 52.5 65.9 51.3 51.9 55.4
Ours 52.8 69.6 53.2 52.3 57.0
Table 2: The comparison results (mIoU value) on four evaluation sub-datasets in 1-shot. The best results are in bold.
Methods split-0 split-1 split-2 split-3 Mean
1-NN 34.5 53.0 46.9 25.6 40.0
LogReg 25.9 51.6 44.5 25.6 39.3
Co-segmentation[13] 25.1 28.9 27.7 26.3 27.1
OSLSM[3] 35.9 58.1 42.7 39.1 43.9
co-FCN[4] 37.5 50.0 44.1 33.9 41.4
SG-One[5] 41.9 58.6 48.6 39.4 47.1
CA-Net[7] 55.5 67.8 51.9 53.2 57.1
Ours 57.9 69.9 56.9 57.5 60.6
Table 3: The comparison results (mIoU value) on four evaluation sub-datasets in 5-shot. The best results are in bold.
Figure 4: The subjective results of the proposed method. From left to right: query image, ground-truth mask of the query image, the support image, ground-truth mask of the support image and the segmentation result, respectively.
Methods OSLSM[3] co-FCN[4] PL[11] SG-One[5] A-MCG[6] CA-Net[7] Ours
1-shot 61.3 60.1 61.2 63.1 61.2 66.2 71.8
5-shot 61.5 60.2 62.3 65.9 62.2 69.6 74.6
Table 4: The comparison results (FB-IoU value) on four evaluation sub-datasets in 1-shot and 5-shot. The best results are in bold.
k-shot mIoU FB-IoU
1-shot 55.3 70.2
1-shot 55.9 70.9
1-shot 56.0 71.1
1-shot 57.0 71.8
Table 5: Ablation results for the three loss functions. Each row uses a different combination of the loss terms; the last row uses all three.

We use the mean intersection over union of the foreground (mIoU) to measure the performance of the proposed method, which is widely used in few-shot segmentation. In addition, the FB-IoU proposed in co-FCN [4] is also reported, which averages the intersection over union of the foreground and the background.
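
The two metrics can be computed as in the following sketch; this is a common formulation rather than the authors' evaluation code, and the helper names are ours:

```python
import numpy as np


def foreground_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU of the foreground between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0


def fb_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """FB-IoU: average of foreground IoU and background IoU for one prediction."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 0.5 * (foreground_iou(pred, gt) + foreground_iou(~pred, ~gt))


# mIoU averages the foreground IoU over the five test classes of a split,
# with each class score accumulated over all of its test episodes.
```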

3.2 Comparison with Benchmarks

In order to verify the effectiveness of our method, we compare it with existing methods in both 1-shot and 5-shot settings. Following CA-Net [7], we adopt DenseCRF [14] and a multi-scale evaluation strategy to improve the performance, which are commonly employed in semantic segmentation methods [15]. The detailed results can be found in Table 2, Table 3 and Table 4. The mIoU of our method reaches 57.0% in 1-shot and 60.6% in 5-shot, which outperforms the state-of-the-art few-shot segmentation method CA-Net [7] by 1.6% and 3.5%, respectively. The larger improvement in 5-shot indicates the superiority of our method when more annotations are available. In addition, the FB-IoU values in 1-shot and 5-shot reach 71.8% and 74.6%, respectively, which also clearly outperform the comparison methods.

k-shot   Dense-CRF   Multi-scale   mIoU   FB-IoU
1-shot                             56.1   71.0
1-shot   ✓                         56.7   71.5
1-shot               ✓             56.6   71.3
1-shot   ✓           ✓             57.0   71.8
Table 6: The effectiveness of Dense-CRF post-processing and the multi-scale evaluation strategy. A tick indicates that the strategy is used.

3.3 Ablation

In the training stage, the weights $\lambda_1$, $\lambda_2$ and $\lambda_3$ are all set to 1. In order to validate the effectiveness of the three loss functions, an ablation experiment is conducted, with detailed results in Table 5. The two auxiliary losses improve the performance by 1.1% and 1.0%, respectively. In addition, we follow CA-Net [7] and employ DenseCRF [14] and multi-scale evaluation in the test stage. The ablation of these two strategies is reported in Table 6: DenseCRF [14] and the multi-scale evaluation strategy improve the mIoU value by 0.4% and 0.3%, respectively.

3.4 Subjective Result

The subjective results of the proposed method are shown in Fig. 4. The support image, the ground-truth mask of the support image, the query image, the ground-truth mask of the query image and the segmentation result are displayed from left column to right column, respectively. It is seen that the proposed method segments objects from these images successfully.

4 Conclusion

This paper proposes a new transformation module for few-shot segmentation. Rather than relying on global cues, the relationships of local features are used to form the transformation. A relationship matrix calculated by cosine similarity represents these local relationships, and a linear transformation based on the generalized inverse of the support annotation matrix converts the relationship matrix into an attention map. We also map the features into a high-dimensional metric embedding space to enhance the generalization ability of the proposed module. Based on this transformation module, we build a new few-shot segmentation network and obtain better results in terms of both mIoU and FB-IoU.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under Grant 61871087, Grant 61502084, Grant 61831005, and Grant 61601102, and supported in part by Sichuan Science and Technology Program under Grant 2018JY0141.

References

  • [1] Long, J., Shelhamer, E., Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3431-3440).

  • [2] Chen, L. C., Papandreou, G., Schroff, F., Adam, H. (2017). Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
  • [3] Shaban, A., Bansal, S., Liu, Z., Essa, I., Boots, B. (2017). One-shot learning for semantic segmentation. In BMVC.
  • [4] Rakelly, K., Shelhamer, E., Darrell, T., Efros, A., Levine, S. (2018). Conditional networks for few-shot semantic segmentation. In ICLR Workshop.
  • [5] Zhang, X., Wei, Y., Yang, Y., Huang, T. (2018). Sg-one: Similarity guidance network for one-shot semantic segmentation. arXiv preprint arXiv:1810.09091.
  • [6] Hu, T., Yang, P., Zhang, C., Yu, G., Mu, Y., Snoek, C. G. (2019). Attention-based multi-context guiding for few-shot semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence.

  • [7] Zhang, C., Lin, G., Liu, F., Yao, R., Shen, C. (2019). CANet: Class-Agnostic Segmentation Networks with Iterative Refinement and Attentive Few-Shot Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5217-5226).
  • [8] Wang, X., Girshick, R., Gupta, A., He, K. (2018). Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7794-7803).
  • [9] Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. International journal of computer vision, 111(1), 98-136.
  • [10] Caelles, S., Maninis, K. K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., Van Gool, L. (2017). One-shot video object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 221-230).
  • [11] Dong, N., Xing, E. (2018). Few-Shot Semantic Segmentation with Prototype Learning. In BMVC (Vol. 1, p. 6)
  • [12] Hariharan, B., Arbelaez, P., Bourdev, L. D., Maji, S., Malik, J. (2011). Semantic contours from inverse detectors. In IEEE International Conference on Computer Vision (ICCV 2011), Barcelona, Spain.
  • [13] Faktor, A., Irani, M. (2013). Co-segmentation by composition. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1297-1304).
  • [14] Krähenbühl, P., Koltun, V. (2011). Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in neural information processing systems (pp. 109-117).
  • [15] Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A. L. (2017). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4), 834-848.
  • [16] Kingma, D. P., Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [17] Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., Fei-Fei, L. (2009, June). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248-255). Ieee.
  • [18] He, K., Zhang, X., Ren, S., Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
  • [19] Rao, C. R., Mitra, S. K. (1972). Generalized inverse of a matrix and its applications. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1).