One-Shot Texture Retrieval with Global Context Metric

05/16/2019 · Kai Zhu et al. · USTC

In this paper, we tackle one-shot texture retrieval: given an example of a new reference texture, detect and segment all the pixels of the same texture category within an arbitrary image. To address this problem, we present an OS-TR network that encodes both the reference and the query image and segments the query towards the reference category. Unlike existing texture encoding methods that integrate a CNN with orderless pooling, we propose a directionality-aware module that captures texture variations along each direction, resulting in a spatially invariant representation. To segment new categories given only a few examples, we incorporate a self-gating mechanism into a relation network, exploiting global context information to adjust the per-channel modulation weights of local relation features. Extensive experiments on benchmark texture datasets and real scenarios demonstrate the strong segmentation performance and robust cross-domain generalization of our method.


1 Introduction

Texture refers to the fundamental microstructures of natural images, and humans have a strong visual perception of it: they can not only acquire a description of a new texture from a small number of training samples (few-shot learning) [Sung et al.2018], but also mark regions of that texture in other cluttered scenes (texture segmentation) [Cimpoi et al.2015]. This suggests that texture features provide a powerful visual prior for comprehensive scene understanding [Krishna et al.2017].

To learn such a texture prior, we present the problem of one-shot texture retrieval: given an example of a new reference texture, detect and segment all the pixels of the same texture category within an arbitrary image (see Figure 1). This task differs from one-shot segmentation of general objects [Shaban et al.2017]: the learned texture representation should be invariant to spatial layout while preserving rough semantic concepts. An adaptable and robust texture encoding model is therefore needed to finely discriminate orderless texture details. In addition, global context is an important cue for texture segmentation, since scene surfaces are usually not completely orderless. A similarity metric should thus balance local spatial details and global scene context for pixel-wise segmentation.


Figure 1: Examples of our task. Given a reference texture of a new category, we segment all pixels of the same texture category within a query image.

In this paper, we present a One-Shot Texture Retrieval (OS-TR) network that learns a texture representation and models the similarity between the reference and the query image, achieving one-shot texture segmentation towards the reference category. Specifically, a Siamese network [Koch et al.2015] first embeds the reference and query image into an encoded representation space through feature learning and parametric prior encoding. Unlike existing texture encoding methods [Xue et al.2018] that integrate a CNN with orderless pooling, a directionality-aware module is proposed to perceive local texture variations along each direction, resulting in a spatially invariant representation. We then incorporate global context into the relation network [Sung et al.2018] by aggregating feature maps across their spatial dimensions. Different from previous approaches that only consider the similarity of local features or semantic concepts, our method exploits global context information to adjust local relation features through per-channel modulation weights. The key idea is to use a self-gating mechanism that generates a global distribution of local relation features with an aggregation unit. Evaluation on a synthetic dataset demonstrates the superiority of our model over state-of-the-art methods [Shaban et al.2017, Ustyuzhaninov et al.2018], and results on natural scenes are also promising.

Our main contributions are as follows:

1. We introduce a novel one-shot texture segmentation network with a global context metric, achieving texture detection and segmentation from a single example of a reference texture.

2. We propose a directionality-aware module to perceive local texture variations along each direction, resulting in a spatially invariant representation.

3. We present a global context metric for one-shot texture segmentation, which extends the relation network with a self-gating mechanism to adjust local relation features.

2 Related Work

Our work focuses on one-shot learning, texture modeling and one-shot segmentation task, so in this section we mainly review the research status of these areas.

One-shot learning: In the computer vision community, one-shot learning has recently received a lot of attention, and substantial progress has been made based on metric learning using Siamese neural networks [Koch et al.2015, Snell et al.2017]. In addition, there is work that builds upon meta-learning [Finn et al.2017, Ren et al.2018], information retrieval techniques [Triantafillou et al.2017] and generative models [Lake et al.2015] to achieve one-shot learning.

Texture representation: Texture representation is an important research area in computer vision, with applications in classification, segmentation and synthesis. Research on texture representation mainly falls into two classes: traditional methods [Kumar et al.2011] and CNN-based methods [Cimpoi et al.2015]. Different from object recognition, where spatial order is critical for feature representation, texture recognition usually uses an orderless component to provide invariance to spatial layout.

One-shot segmentation: While the work on one-shot learning is quite extensive, research on one-shot segmentation [Dong and Xing2018, Wu et al.2018] has emerged only recently, including one-shot image segmentation [Shaban et al.2017] and one-shot video segmentation [Caelles et al.2018]. The work most closely related to ours is [Ustyuzhaninov et al.2018], whose task is to segment an input image containing multiple textures given a patch of a reference texture. Different from their setup, in which the reference patch is interactively selected from the input image, our work targets the more complex problem of one-shot texture retrieval: given an example of a new reference texture, detect and segment all the pixels of the same texture category within an arbitrary image.


Figure 2: Overview of the OS-TR network. The texture features of the query image and the reference are first extracted by the proposed Siamese texture encoder. The global context metric then computes a relation score with the help of global information, which is combined with the encoder features to obtain the final segmentation result.

3 One-shot Texture Segmentation

3.1 Problem Setup

We define a triple $(Q, R_j, Y_j)$ and a relation function $F$, where $Q$ is the query image containing multiple classes of textures, $Y_j$ is the pixel-wise label map corresponding to the class-$j$ texture in $Q$, $R_j$ is a reference texture image of the same class $j$, $\hat{Y}_j = F(Q, R_j; \theta)$ is the actual segmentation result, and $\theta$ collects all parameters to be optimized in $F$. Our task is to randomly sample triples from the dataset and to train and optimize $\theta$ so as to minimize the loss function $\mathcal{L}$:

$$\theta^{*} = \arg\min_{\theta} \; \mathbb{E}_{(Q, R_j, Y_j)}\big[\mathcal{L}\big(F(Q, R_j; \theta),\, Y_j\big)\big] \qquad (1)$$

We expect the relation function to segment the same-class texture region in another target image each time it sees a reference texture image of a new class; this is the essence of one-shot segmentation. Note that the texture classes sampled in the test set are not present in the training set, i.e., $\mathcal{C}_{test} \cap \mathcal{C}_{train} = \emptyset$. The relation function in this problem is implemented by the model detailed in Section 3.2.
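To make the episodic objective of Equation (1) concrete, the following is a minimal PyTorch-style sketch of the training loop it implies; the triple sampler, the model interface and all names here are hypothetical placeholders, not the authors' released code.

```python
def train_one_shot(model, dataset, optimizer, criterion, num_steps):
    """Minimal sketch of the episodic training implied by Eq. (1).
    `dataset.sample_triple`, the model interface and `criterion`
    (the weighted BCE of Eq. (6)) are hypothetical placeholders."""
    model.train()
    for _ in range(num_steps):
        # Sample (Q, R_j, Y_j) for a random class j from the training classes.
        query, reference, label = dataset.sample_triple()
        pred = model(query, reference)   # \hat{Y}_j = F(Q, R_j; theta)
        loss = criterion(pred, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```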

3.2 Model Architecture

In this section we detail the overall framework of the proposed OS-TR network. As shown in Figure 2, the network is based on a classic encoder-decoder architecture [Ronneberger et al.2015] to perform the segmentation task. It consists of a texture encoder that captures the characteristics of the texture and a global context metric that better determines matching pixels. These two modules are explained further in Sections 3.3 and 3.4.

The network uses the texture encoder to transform the inputs of the two branches into their respective embeddings. The first branch takes the reference texture $R$ from the triple as input, while the second branch takes the synthetic query image $Q$ as input during training (in real scenarios the query may come from a wide range of sources). We use the texture encoder $E$ to perceive local texture features along each direction, and the corresponding feature map pair $(f_q, f_r)$ can be expressed as follows:

$$f_q = E(Q; \theta_e) \qquad (2)$$
$$f_r = E(R; \theta_e) \qquad (3)$$

Here $\theta_e$ denotes the learnable parameters of our texture encoder.

Different from existing work, the relation network for one-shot learning [Sung et al.2018] proposes a deep nonlinear metric. However, this nonlinear comparator is limited to local relation features. Instead, we learn a global context metric that compares the pixel information of the query image with the reference texture while taking global context into consideration, providing important features for subsequent pixel-level segmentation. In our network,

$$S = G(f_q, f_r; \theta_g) \qquad (4)$$

where $S$ refers to the element-wise relation score and $\theta_g$ denotes the parameters to be optimized in the global context metric $G$.

To generate an output of the same size as the original image, we combine the relation score with encoder features at multiple stages. We make full use of the features extracted at different scales during encoding to form a refined decoding layer $D$. The final segmentation result $\hat{Y}$ at the original image resolution is obtained through a sigmoid function:

$$\hat{Y} = \sigma\big(D(S, f_q; \theta_d)\big) \qquad (5)$$

Similarly, $\theta_d$ stands for the parameters of the decoding part.
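The decoder can be sketched as below; the skip-connection channel widths (taken here from ResNet-50 stages), the number of fusion stages and the exact layer composition are assumptions based only on the description above and Table 1, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderSketch(nn.Module):
    """Sketch of the decoder D in Eq. (5): the relation score S is repeatedly
    upsampled, fused with encoder features of matching scale, and a sigmoid
    yields the per-pixel prediction. Channel widths are assumptions."""
    def __init__(self, score_ch, skip_chs=(1024, 512, 256, 64)):
        super().__init__()
        self.blocks = nn.ModuleList()
        in_ch = score_ch
        for skip_ch in skip_chs:
            self.blocks.append(nn.Sequential(
                nn.Conv2d(in_ch + skip_ch, skip_ch, 3, padding=1),
                nn.BatchNorm2d(skip_ch),
                nn.ReLU(inplace=True)))
            in_ch = skip_ch
        self.head = nn.Conv2d(in_ch, 1, 1)

    def forward(self, score, skips, out_size):
        # `skips`: encoder features ordered coarse to fine; `out_size`: (H, W).
        x = score
        for block, skip in zip(self.blocks, skips):
            x = F.interpolate(x, size=skip.shape[-2:], mode='bilinear',
                              align_corners=False)
            x = block(torch.cat([x, skip], dim=1))
        x = F.interpolate(self.head(x), size=out_size, mode='bilinear',
                          align_corners=False)
        return torch.sigmoid(x)
```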

The numbers of positive and negative samples in the training set are unbalanced (i.e., the foreground and background of query images). To ensure their equal contribution to the optimization objective, we use a weighted binary cross-entropy loss, that is

$$\mathcal{L} = -\sum_{p}\Big[\frac{y_p}{|P_{+}|}\log \hat{Y}_p + \frac{1 - y_p}{|P_{-}|}\log\big(1 - \hat{Y}_p\big)\Big] \qquad (6)$$

where $y_p$ stands for the ground truth of the corresponding pixel $p$, and $P_{+}$ and $P_{-}$ denote the positive and negative sample sets in the training images, respectively. The parameter set $\theta$ in the loss function is the collection of $\theta_e$, $\theta_g$ and $\theta_d$ described earlier; minimizing the loss function optimizes the parameters of the corresponding modules.
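A minimal sketch of the class-balanced loss in Equation (6), assuming the positive and negative pixel sets are simply averaged separately and that `pred` already contains sigmoid probabilities:

```python
import torch

def weighted_bce_loss(pred, target, eps=1e-6):
    """Class-balanced BCE of Eq. (6): foreground and background pixels are
    averaged separately so both sets contribute equally regardless of size.
    `pred` is assumed to hold sigmoid probabilities in [0, 1]."""
    pos = target > 0.5                 # P+ : foreground pixels
    neg = ~pos                         # P- : background pixels
    loss = pred.new_zeros(())
    if pos.any():
        loss = loss - torch.log(pred[pos].clamp(min=eps)).mean()
    if neg.any():
        loss = loss - torch.log((1.0 - pred[neg]).clamp(min=eps)).mean()
    return loss
```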

Module                        Type
Feature Extractor             ResNet-50
Directionality-Aware Module   DirConv unit: Conv; Join (×8); Cat
Global Context Metric         Join (×2): Cat + Conv; Aggregation: Max pooling + Conv; Weighting: Channel-wise Multiplication
Decoder                       Upsample (×4): Bilinear Interpolation + Conv + Sigmoid
Table 1: Details of our network. '(×4)' denotes four upsampling operations; each DirConv unit takes one additional input channel as the directional map.

3.3 Texture Encoder

To perceive local texture variations along each direction, we propose a texture encoder consisting of a feature extractor and a directionality-aware module. As shown in Figure 2, ResNet (in this paper, ResNet refers to ResNet-50 [He et al.2016] with the last fully connected layer removed) is used to extract preliminary features $x_q$ and $x_r$ from the query image and the reference texture. The DirConv unit then captures eight directional texture features under the guidance of the corresponding directional feature maps $M_i$. We only detail the first branch, as the parameters and structure of the two branches are identical. Specifically,

$$x_q^{i} = \mathrm{conv}\big(\mathrm{cat}(x_q, M_i)\big), \quad i = 1, \dots, 8 \qquad (7)$$

where $\mathrm{conv}$ denotes a convolution block containing a convolution, a batch normalization and a ReLU activation layer, and $\mathrm{cat}$ refers to concatenation along the channel dimension. In this paper, the eight directions are top, bottom, left, right, top left, bottom left, top right and bottom right. Each directional map $M_i$ is a generated trend map that decreases along the corresponding direction. Finally, the output features of the different branches are concatenated to form the whole spatially invariant feature $f_q$, that is,

$$f_q = \mathrm{conv}\big(\mathrm{cat}(x_q^{1}, x_q^{2}, \dots, x_q^{8})\big) \qquad (8)$$

where $\mathrm{conv}$ again represents a set of standard convolution blocks. Since the proposed DirConv unit is sensitive to local variations of the image along each direction, it provides the network with good adaptability to spatial distortion and scale changes.
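The directionality-aware module can be sketched as follows. The ramp range of the directional maps (a 1 to 0 decay is assumed here), the kernel sizes and the channel widths are assumptions; the original values are not specified in the text above.

```python
import torch
import torch.nn as nn

def directional_maps(h, w, device=None):
    """Eight single-channel ramps that decay along each direction (left,
    right, top, bottom and the four diagonals). A 1 -> 0 range is assumed."""
    x = torch.linspace(1.0, 0.0, w, device=device).view(1, 1, 1, w).expand(1, 1, h, w)
    y = torch.linspace(1.0, 0.0, h, device=device).view(1, 1, h, 1).expand(1, 1, h, w)
    maps = [x, x.flip(-1), y, y.flip(-2),
            (x + y) / 2, (x.flip(-1) + y) / 2,
            (x + y.flip(-2)) / 2, (x.flip(-1) + y.flip(-2)) / 2]
    return torch.cat(maps, dim=1)                      # (1, 8, h, w)

class DirConvSketch(nn.Module):
    """Sketch of the DirConv unit of Eqs. (7)-(8): each branch sees the
    backbone features plus one directional map, and the eight branch outputs
    are concatenated and fused. Channel and kernel sizes are assumptions."""
    def __init__(self, in_ch=2048, branch_ch=256):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(in_ch + 1, branch_ch, 3, padding=1),
                          nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True))
            for _ in range(8)])
        self.fuse = nn.Sequential(nn.Conv2d(8 * branch_ch, branch_ch, 3, padding=1),
                                  nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True))

    def forward(self, feat):                           # feat: (B, C, H, W)
        b, _, h, w = feat.shape
        dmaps = directional_maps(h, w, feat.device).expand(b, -1, -1, -1)
        outs = [branch(torch.cat([feat, dmaps[:, i:i + 1]], dim=1))
                for i, branch in enumerate(self.branches)]
        return self.fuse(torch.cat(outs, dim=1))
```

In this sketch the directional map is simply appended as an extra input channel to each branch, which is one straightforward reading of the description and of the extra channel noted in Table 1.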

Test set   Classes
i=0        perforated, pitted, pleated, polka-dotted, porous
i=1        stained, stratified, striped, studded, swirly
i=2        veined, waffled, woven, wrinkled, zigzagged
Table 2: Specific class names of the three test sets defined in Section 4.1.

3.4 Global Context Metric

To achieve pixel-wise segmentation, we incorporate global context into the local relation features to adjust per-channel modulation weights. First, we use a non-linear function to capture the local relation features $L$:

$$L = g\big(\mathrm{cat}(f_q, f_r)\big), \quad L \in \mathbb{R}^{H \times W \times C} \qquad (9)$$

where $g$ represents two sets of standard convolution blocks, and $H \times W$ and $C$ refer to the spatial and channel dimensions of the feature map, respectively. This two-layer convolution module represents a richer space of metrics than fixed ones such as cosine or Euclidean distance [Vinyals et al.2016, Snell et al.2017]. However, it only considers local feature similarity, which is a limitation. Instead, we take global context [Qiao et al.2019] into consideration by aggregating feature maps across their spatial dimensions, similar to SENet [Hu et al.2018].

Let $L = [l_1, l_2, \dots, l_C]$ denote the local relation features of the different channels. In this paper, we simply obtain the global information $z_c$ of each channel through maximum pooling. The aggregation unit can be represented as follows:

$$z_c = \max_{(i, j)} \; l_c(i, j), \quad c = 1, \dots, C \qquad (10)$$

Next, we use the collected global context to balance the relation features: two simple convolutional blocks produce per-channel modulation weights. Finally, the obtained weights are normalized to the range 0-1 and used as multiplication coefficients for the corresponding channels of the original feature $L$. The re-weighting of the local relation features can be formulated as follows:

$$S = L \otimes \sigma\big(w(z)\big) \qquad (11)$$

where $S$ is the same relation score as in Equation 4, $w$ denotes the two convolutional blocks, $\sigma$ is the sigmoid normalization, and $\otimes$ represents channel-wise multiplication. The visualization in the experimental part shows that the global context metric realizes a fine adjustment of different texture matching pairs, which benefits the optimization of the final segmentation result.
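Putting Equations (9)-(11) together, the global context metric can be sketched as follows; the channel widths, the reduction ratio of the gating branch and the assumption that the two embeddings share the same spatial size are ours, not the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalContextMetricSketch(nn.Module):
    """Sketch of Eqs. (9)-(11): local relation features from two conv blocks,
    per-channel global max pooling, a two-layer self-gating branch producing
    0-1 weights, and channel-wise re-weighting."""
    def __init__(self, feat_ch=256, rel_ch=256, reduction=4):
        super().__init__()
        self.local = nn.Sequential(                        # Eq. (9)
            nn.Conv2d(2 * feat_ch, rel_ch, 3, padding=1),
            nn.BatchNorm2d(rel_ch), nn.ReLU(inplace=True),
            nn.Conv2d(rel_ch, rel_ch, 3, padding=1),
            nn.BatchNorm2d(rel_ch), nn.ReLU(inplace=True))
        self.gate = nn.Sequential(                         # self-gating branch
            nn.Conv2d(rel_ch, rel_ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(rel_ch // reduction, rel_ch, 1), nn.Sigmoid())

    def forward(self, f_query, f_ref):
        rel = self.local(torch.cat([f_query, f_ref], dim=1))  # L: (B, C, H, W)
        ctx = F.adaptive_max_pool2d(rel, 1)                    # Eq. (10): z_c
        return rel * self.gate(ctx)                            # Eq. (11): S
```

The gating branch mirrors the squeeze-and-excitation design referenced above, except that the aggregation uses max pooling rather than average pooling, as stated in the text.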

Method       i=0    i=1    i=2    mean
Baseline     56.4   47.2   44.9   49.5
+Tex         59.3   50.3   46.4   52.0
+Glo         59.4   51.5   45.4   52.1
+Tex & Glo   60.7   52.8   44.8   52.8
Table 3: Results of the ablation study. 'Tex' and 'Glo' represent the texture encoder and the global context metric, respectively. The middle three columns report the mean IoU (%) on the three test sets, and the last column is the average over the three test sets.

Figure 3: Verification of spatial invariance. Parts (A) and (B) show the results under affine transformation and scale change, respectively. From left to right: query image, results of OSTC compared with ours, and ground truth. Reference textures are shown in the lower-right corner of each result image.

4 Experiments

To validate the superiority of our model on the one-shot texture segmentation task, we design a series of experiments based on the Describable Textures Dataset (DTD) [Cimpoi et al.2014]. In this section, we first introduce the preprocessing of the dataset and then perform ablation experiments on the main modules of the model. Besides demonstrating its superiority with objective metrics, we also give qualitative visualization results. Finally, we compare our model to the state-of-the-art models in the one-shot segmentation field: One-Shot Learning for Semantic Segmentation (OSLSM) [Shaban et al.2017] and One-shot Texture Segmentation (OSTC) [Ustyuzhaninov et al.2018]. Expanded experimental content is provided in the supplementary materials (https://github.com/zhukaii/ijcai2019).

4.1 Dataset and Setting

To address the proposed one-shot texture segmentation task, we re-divide the DTD dataset into training and test sets. We split the last 15 classes of DTD evenly into three test subsets, and the remaining classes are used as the training set; the specific class names are shown in Table 2. During training, we randomly sample 2-5 texture images from the training set to synthesize query images (these textures may come from the same class) and generate a label for one of the texture regions. The texture synthesis technique comes from [Ustyuzhaninov et al.2018]. Finally, we randomly sample a reference texture image of the labeled class from the training set to form the triple defined in Section 3.1. In the test phase, we synthesize query images (960 in total) from the three test subsets in the same way, feed them to the trained model to compute the evaluation metric, and then average the results so that the different texture classes in the test set are treated fairly. The standard IoU metric is used in our task.
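For illustration only, the sketch below builds a query/label pair by pasting random texture strips into a collage; the actual synthesis follows [Ustyuzhaninov et al.2018] and produces more irregular regions, and every helper name here is hypothetical.

```python
import random
import numpy as np

def synthesize_query(textures, size=256):
    """Hypothetical episode construction: paste 2-5 random texture crops into
    a collage and keep the mask of one of them as the label. `textures` is
    assumed to be a list of HxWx3 uint8 arrays at least `size` pixels wide."""
    k = random.randint(2, 5)
    chosen = random.sample(range(len(textures)), k)
    query = np.zeros((size, size, 3), dtype=np.uint8)
    masks = np.zeros((k, size, size), dtype=bool)
    # Simple vertical strips stand in for the real (irregular) collage regions.
    bounds = sorted(random.sample(range(1, size), k - 1))
    edges = [0] + bounds + [size]
    for i, cls in enumerate(chosen):
        x0, x1 = edges[i], edges[i + 1]
        crop = textures[cls][:size, :size]
        query[:, x0:x1] = crop[:, x0:x1]
        masks[i, :, x0:x1] = True
    target = random.randrange(k)          # the class to be retrieved
    return query, masks[target], chosen[target]
```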

Our model uses the SGD optimizer during the training process. The initial learning rate is set to and the attenuation rate is set to . The model stops training after epochs, where each epoch synthesizes query images. All images are resized to size and the batch size is set to .

4.2 Ablation Study

We conduct several ablation experiments to verify the effectiveness of the directionality-aware module and the global context metric of our model in Table 3. We first train a baseline without the two modules, and then add them separately to compare the results. First of all, we can see that the global context metric improves the baseline in mean IoU. This is due to its use of global information to help the model adjust the overall feature distribution. To illustrate this more vividly, we visualize the per-channel modulation weight distribution learned by the global context metric in Section 4.3. The directionality-aware module also yields a mean IoU boost; it takes the spatial distribution characteristics of texture into account, which is very helpful for texture encoding. The corresponding analytical experiment is also shown in Section 4.3. Finally, we add both modules to form our OS-TR network, which improves over the baseline.

4.3 Evaluation

Spatial Invariance: To evaluate the spatial invariance of our texture representation, we conduct the following experiments. First, to demonstrate the adaptability of our model to spatial distortion, we examine its segmentation quality after applying affine transformations to the reference texture image. As shown in Figure 3 (A), we take several query images as examples and randomly choose the parameters of the affine transformation to compare the segmentation results; the results are similar to those obtained with the original reference texture. As shown in Figure 3 (B), we select reference textures at three scales to evaluate the sensitivity of the model to scale change, and the segmentation quality remains excellent. Because the proposed DirConv unit is sensitive to local variations of the image along each direction, the spatial arrangement of the texture can be accurately extracted.
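The two robustness checks can be reproduced with standard torchvision transforms, e.g. as sketched below; the transform parameter ranges and scale factors are assumptions rather than the paper's exact values.

```python
from PIL import Image
import torchvision.transforms as T

# Sketch of the spatial-robustness checks of Figure 3; parameter ranges and
# scale factors below are assumptions, not the paper's exact settings.
affine_reference = T.Compose([
    T.RandomAffine(degrees=30, translate=(0.1, 0.1), scale=(0.8, 1.2), shear=10),
    T.ToTensor()])

def rescaled_reference(img: Image.Image, scale: float, out_size: int = 256):
    # Resize by `scale` and back, so the texture pattern appears coarser or
    # finer while the tensor fed to the encoder keeps the same shape.
    side = max(1, int(out_size * scale))
    return T.Compose([T.Resize((side, side)),
                      T.Resize((out_size, out_size)),
                      T.ToTensor()])(img)
```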

Figure 4: Illustration of the role of global context. In the two plots on the left, the abscissa is the channel index and the ordinate the corresponding weight between 0 and 1. The plot on the right is a t-SNE visualization. See the detailed explanation in Section 4.3.

Figure 5: Comparison of our segmentation results with OSTC. The three rows show three groups of results. From left to right: query image, results of OSTC compared with ours, and ground truth. Reference textures are shown in the upper-left corner of each result image. Reference images with red borders are taken from the original query image, while those with black borders come from same-class textures in the DTD dataset.
Method   i=0    i=1    i=2    mean
OSTC
OSLSM
Ours     60.7   52.8   44.8   52.8
Table 4: Mean IoU (%) of our model and other state-of-the-art methods.

Figure 6: Qualitative results of our model in practical applications. The reference image for each picture is shown as a small inset, and the corresponding segmentation result is marked in red.

Effectiveness of Global Context Metric: To demonstrate the function of this module, we visualize the per-channel modulation weights for different reference images, taking five categories from our test set as examples. As shown in the two weight plots of Figure 4, the curves of references from the same class (left plot) are almost identical, while the curves of the five different classes (right plot) clearly differ. To further illustrate this, we reduce the dimensionality of the weights of these five classes and visualize them with t-SNE [Maaten and Hinton2008]; the five classes are clearly separated in the relation feature space. We believe the module has learned an adaptive per-channel weight adjustment in which intra-class weights are similar and inter-class weights are inconsistent. As analyzed in Section 3.4, this module combines global information to further separate different matching pairs in feature space.

Comparison with state-of-the-art: To better assess the overall performance of our network, we compare it to the OSLSM and OSTC models. Because the tasks and model backbones differ, we set all backbones to ResNet for a fair comparison and reproduce both methods in PyTorch according to the two papers. All models are trained and tested with the same procedure so that they are adapted to our task. OSLSM is the first solution to the one-shot semantic segmentation task, which is similar to ours, so after modifying its backbone we train it directly on our task without changing any dataset settings. To compare with OSTC, we change the reference image to a patch for training on our task.

As can be seen from Table 4, our model obtains more than three points of mean IoU boost over the best-performing baseline, OSLSM. Since OSTC is essentially designed for the interactive texture segmentation task, it does not work well for one-shot segmentation, so we also reproduce the model they use for their interactive setting, following the settings of their paper. In our model, the reference texture belongs to the same class as a texture region in the query image (and may of course be the very same texture image), whereas OSTC is not necessarily able to segment textures of the same class. As shown in Figure 5, the reference texture in our model achieves a better segmentation result whether it is the original image of a texture region of the query image or a different image of the same class. This benefits from the texture characteristics and the discrimination of different texture features acquired by our texture encoder and global context metric, respectively.

Evaluation on real scenarios: Our model not only performs well on synthetic texture images but also scales to practical applications. To demonstrate this, we replace the query images with ones taken from real scenarios. We download high-quality indoor scene pictures with rich texture information from the OpenSurfaces dataset [Bell et al.2013], and select animal and plant pictures with describable texture features from the Internet. For reference images, we randomly select images with similar texture information from the DTD dataset and the Internet according to the query image content; for example, a striped image from DTD is randomly selected as the reference for zebras. When testing on these natural images, we do not pre-train or adjust the model; the parameters remain fixed in the state trained on synthetic texture images. As shown in Figure 6, we still obtain good segmentation results. This shows that our model can quickly learn new texture properties, which is helpful for comprehensive scene understanding.

5 Conclusion

In this paper, a novel one-shot texture segmentation network, OS-TR, is proposed. By using a directionality-aware module to perceive texture variations along each direction and adjusting local features with global context information, OS-TR extracts the pixel information related to a given texture in query images. In addition, our model generalizes well from synthetic images to real scenarios without any adjustment. Experimental results show that our model is superior to existing methods in both performance and adaptability.

References