FakeMix Augmentation Improves Transparent Object Detection

03/24/2021
by   Yang Cao, et al.

Detecting transparent objects in natural scenes is challenging due to the low contrast in texture, brightness, and color. Recent deep-learning-based works reveal that it is effective to leverage boundaries for transparent object detection (TOD). However, these methods usually suffer from a boundary-related imbalance problem, which limits their generalization. Specifically, a kind of boundary in the background, which shares the same characteristics with the boundaries of transparent objects but appears in much smaller amounts, usually hurts performance. To conquer the boundary-related imbalance problem, we propose a novel content-dependent data augmentation method termed FakeMix. Since collecting these troublesome boundaries in the background is hard without corresponding annotations, we elaborately generate them by appending the boundaries of transparent objects from other samples into the current image during training, which adjusts the data space and improves the generalization of the models. Further, we present AdaptiveASPP, an enhanced version of ASPP that can capture multi-scale and cross-modality features dynamically. Extensive experiments demonstrate that our methods clearly outperform the state-of-the-art. We also show that our approach transfers well to related tasks in which models meet similar troubles, such as mirror detection, glass detection, and camouflaged object detection. Code will be made publicly available.



1 Introduction


Figure 1: Visual result comparisons between TransLab [22] and our FANet. In the top two rows, the boundaries, labeled by the pink rectangle, have similar content on their two sides. In the bottom two rows, the boundaries, labeled by the pink rectangle, have obvious reflection and refraction. These boundaries are similar to the boundaries of transparent objects, leading to false detection labeled by the orange rectangle. Our approach performs better due to FakeMix.

Transparent Object Detection (TOD) [22, 23, 1] aims at detecting transparent objects, e.g., windows, utensils, and glass doors, which widely exist in natural scenes. It is a relatively new and challenging task in the vision community. Unlike opaque objects, transparent objects often share similar textures, brightness, and colors with their surroundings, making them hard to detect for vision systems and even humans. Because transparent objects are widespread in our lives, they also influence other tasks, such as depth estimation [21], segmentation [15], and saliency detection [9, 6]. Therefore, accurately detecting transparent objects is essential for many vision applications.

Benefiting from boundary clues, recent deep-learning-based work [22] has made great progress in TOD. However, deep-learning-based methods usually suffer from data imbalance problems, resulting in limited generalization. When it comes to TOD, existing methods [22] encounter a serious boundary-related imbalance problem. Specifically, boundary-guided methods pay too much attention to the Boundaries of Transparent objects (T-Boundaries). Thus, they tend to regard background regions surrounded by Fake T-Boundaries (boundaries sharing the same characteristics with T-Boundaries but belonging to the background) as transparent objects. For example, the boundaries labeled by the pink rectangles in the top two rows of Fig. 1 have similar content on their two sides. In the bottom two rows of Fig. 1, the boundaries show obvious refraction or reflection. These boundaries cause their surroundings to be falsely predicted. Therefore, distinguishing T-Boundaries from Fake T-Boundaries is key to addressing the boundary-related imbalance problem in TOD.

To improve the generalization ability of models, several data augmentation methods have been proposed [5, 29, 28]. Unfortunately, they are all content-agnostic and ignore boundary clues, leading to weak improvements for the boundary-related imbalance problem. In this paper, we propose a novel data augmentation method termed FakeMix. FakeMix is content-dependent and combats the boundary-related imbalance problem by balancing the data distribution of boundaries. Concretely, we increase the proportion of Fake T-Boundaries in the training set. Notably, it is hard to collect Fake T-Boundaries directly from the background without corresponding annotations, so we design a novel and efficient method to generate them. Based on our observation, Fake T-Boundaries share the following characteristics with T-Boundaries: (1) the appearances on both sides are similar, and (2) there is obvious refraction or reflection. The main difference between the two kinds of boundaries is the appearance of the regions they surround. Thus, we generate Fake T-Boundaries by blending the background with T-Boundaries. As the data distribution is balanced, the model's capability of discriminating Fake T-Boundaries from T-Boundaries is improved. In fact, as shown in Fig. 4, we find that FakeMix drives the model to explore the appearances inside, which are the key difference between the two kinds of boundaries.

Furthermore, we improve ASPP [2] in an attention manner [10] and obtain the AdaptiveASPP module. It inherits the ability of ASPP to extract multi-scale features, but, more importantly, it also benefits from the attention mechanism, dynamically enhancing cross-modality (segmentation and boundary) features. Our exploration indicates that adopting multi-scale and cross-modality features dynamically is effective for TOD.

By adopting both FakeMix and AdaptiveASPP, our FANet clearly outperforms state-of-the-art TOD and semantic segmentation methods. Besides, we verify that our method also transfers well to related detection tasks, such as mirror detection [25, 13], glass detection [16], and camouflaged object detection [7], showing the robustness of the proposed methods. Extensive experiments on three corresponding real-world datasets demonstrate that our FANet achieves state-of-the-art performance.

In summary, our contributions are three-fold:


  • We propose a novel content-dependent data augmentation method, called FakeMix, for transparent object detection. It balances the data distribution of boundaries and alleviates the boundary-related imbalance problem in TOD. Specifically, we generate Fake T-Boundaries by blending the background with T-Boundaries, which improves the model's discrimination between the two kinds of boundaries by inspecting the appearances inside, as shown in Fig. 4.

  • We improve ASPP in an attention manner and propose AdaptiveASPP to extract features adaptively at cross-modality and multi-scale levels. Experiments validate its effectiveness.

  • Without bells and whistles, our model, named FANet, outperforms state-of-the-art TOD methods. We further find more applications in related “confusing region” detection tasks, i.e., mirror detection, glass detection, and camouflaged object detection, where FANet also achieves state-of-the-art performance.


Figure 2: Comparison with other data augmentation methods and pipeline of FakeMix. Best viewed in color. $I$ is the image to which we add the Fake T-Boundaries. $\tilde{I}$ is the image from which we extract the T-Boundaries. $\tilde{B}$ denotes the boundary label of $\tilde{I}$. Firstly, as “step 1”, we extract the T-Boundaries $\tilde{T}$ from $\tilde{I}$, which are labeled by $\tilde{B}$. Secondly, we translate $\tilde{B}$ and $\tilde{T}$ randomly to the same position as “step 2”. Then we get $\tilde{B}_t$ and $\tilde{T}_t$. Finally, we combine $I$, $\tilde{B}_t$ and $\tilde{T}_t$ to get $\hat{I}$ as “step 3”.

2 Related Work

Data augmentation. To improve generalization and prevent models from focusing too much on certain regions of the input image, several data augmentation methods have been proposed [5, 29, 28]. As shown in Fig. 2, Mixup [29] combines two images by linear interpolation. Cutout [5] randomly removes some regions of the input image. CutMix [28] randomly replaces some regions with a patch from another image. These methods are simple and effective. However, all of them are content-agnostic and ignore boundary clues, resulting in limited improvements for TOD. FakeMix, in contrast, combats the boundary-related imbalance problem by adjusting the data distribution of boundaries, as sketched below.
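For reference, the three content-agnostic baselines can be summarized with a few tensor operations. The sketch below is a simplified illustration under our own choices of patch size and mixing coefficient (the label-mixing weights and box-sampling details of the original methods are omitted); it is not the reference implementation of any of them.

```python
import torch

def mixup(img_a, img_b, lam=0.5):
    # Mixup: blend two images by linear interpolation.
    return lam * img_a + (1.0 - lam) * img_b

def cutout(img, size=64):
    # Cutout: zero out a randomly located square region.
    _, h, w = img.shape
    y = torch.randint(0, h - size, (1,)).item()
    x = torch.randint(0, w - size, (1,)).item()
    out = img.clone()
    out[:, y:y + size, x:x + size] = 0.0
    return out

def cutmix(img_a, img_b, size=64):
    # CutMix: paste a randomly located patch of img_b into img_a.
    _, h, w = img_a.shape
    y = torch.randint(0, h - size, (1,)).item()
    x = torch.randint(0, w - size, (1,)).item()
    out = img_a.clone()
    out[:, y:y + size, x:x + size] = img_b[:, y:y + size, x:x + size]
    return out
```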

Transparent object detection. Early work [23] proposed a model based on LF-linearity and occlusion detection from 4D light-field images. [1] treats transparent object matting as a refractive flow estimation problem. Recently, [22] proposed a large-scale dataset for TOD, which consists of 10,428 images, and designed a boundary-aware segmentation method named TransLab. TransLab adopts boundary clues to improve the segmentation of transparent regions. Our method, named FANet, also follows the boundary-aware paradigm. However, we find a hidden weakness of boundary-aware methods: some boundaries that are similar to the boundaries of transparent objects may hurt detection. We therefore propose a novel data augmentation method called FakeMix. Besides, rather than directly adopting the ASPP used in [22], we design an AdaptiveASPP to extract features adaptively for the segmentation and boundary branches, respectively. In addition, several topics focusing on specific region detection have been proposed recently, such as mirror detection [25], glass detection [16], and camouflaged object detection [7]. Considering that boundary clues are also important for distinguishing mirrors, glass, and camouflaged objects, these tasks may face similar problems to TOD. We apply our FANet to the three tasks above to measure the potential of our method from more perspectives.

3 Methods

Facing the challenges of Transparent Object Detection (TOD), we propose a method named FANet, which contains FakeMix and AdaptiveASPP. FakeMix is a data augmentation method proposed to drive the model to exploit appearance clues; it is introduced in Sec. 3.1. The AdaptiveASPP module is designed to capture features of multiple fields-of-view adaptively for the segmentation branch and the boundary branch, respectively; its details are formulated in Sec. 3.3.

3.1 FakeMix

TransLab [22] proposes to exploit boundary clues to improve transparent object detection. However, we observe that some boundaries in the background may hurt performance, including (1) boundaries with similar content on both sides and (2) boundaries with obvious refraction or reflection. We define these boundaries as Fake T-Boundaries. Fake T-Boundaries share the same characteristics with the boundaries of transparent objects (T-Boundaries) but appear in much smaller amounts in the natural world. Due to this boundary-related imbalance problem, the model tends to regard background regions surrounded by Fake T-Boundaries as transparent regions. Thus, we propose a novel content-dependent data augmentation method named FakeMix. Considering that it is hard to collect Fake T-Boundaries without corresponding annotations, we elaborately manufacture them by appending T-Boundaries from other samples into the current image during training. As the data distribution is balanced, the model's ability to discriminate between Fake T-Boundaries and T-Boundaries is improved during training.


Figure 3: Overview architecture of FANet. The AdaptiveASPP modules are plugged in at four stages of the backbone and capture features of multiple fields-of-view adaptively for the segmentation branch and the boundary branch, respectively. The features are then integrated from bottom to top in each branch. In the segmentation branch, we follow [22] to fuse boundary features in an attention manner. In the AdaptiveASPP, the transform function generates adaptive enhancement scores for the segmentation stream and the boundary stream, respectively, and the enhancement function enhances the features with these scores adaptively for the two modalities.

Formally, let $I$ denote the input image during training, $S$ the segmentation label, and $B$ the boundary label generated from $S$ as in [22]. FakeMix combines the input image $I$ with Fake T-Boundaries taken from another, randomly selected training sample $(\tilde{I}, \tilde{S}, \tilde{B})$. Firstly, we extract the T-Boundaries from $\tilde{I}$ as:

$\tilde{T} = \tilde{I} \odot \tilde{B}$ (1)

where $\odot$ denotes pixel-wise multiplication. Then we apply the same affine transformation to the boundary mask $\tilde{B}$ and the T-Boundaries $\tilde{T}$, which translates them to a random location. The formulation can be written as:

$\tilde{B}_t = \mathcal{T}(\tilde{B}, \mathbf{v}), \qquad \tilde{T}_t = \mathcal{T}(\tilde{T}, \mathbf{v})$ (2)

where $\mathcal{T}$ is the translation function and $\mathbf{v}$ denotes the translation vector:

$\mathbf{v} = (t_x, t_y)$ (3)

$t_x$ and $t_y$ are uniformly sampled for every training sample according to:

$t_x \sim U(-\alpha W, \alpha W), \qquad t_y \sim U(-\alpha H, \alpha H)$ (4)

where $W$ and $H$ are the width and height of the corresponding image, and $\alpha$ is the parameter that controls the range of translation. We conduct ablation studies of $\alpha$ in Sec. 4.3.3.

According to Eqn. (2), we get the randomly translated boundary mask $\tilde{B}_t$ and T-Boundaries $\tilde{T}_t$. Then we combine them with the input training sample as:

$\hat{I} = I \odot (\mathbf{1} - \tilde{B}_t) + \tilde{T}_t$ (5)

Since the translated T-Boundaries are separated from the transparent appearances they originally enclosed, $\hat{I}$ obtains more Fake T-Boundaries.

Further, we consider the choice of the input sample as one Bernoulli trial in each training iteration. The trial results in one of two possible outcomes: either $\hat{I}$ or $I$. The probability mass function is:

$P(X = \hat{I}) = p, \qquad P(X = I) = 1 - p$ (6)

Finally, we obtain the training sample $X$ produced by the FakeMix data augmentation method. The pipeline is visualized in Fig. 2.
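Putting Eqns. (1)-(6) together, FakeMix can be sketched in a few lines of PyTorch-style code. This is a minimal sketch under our notation: the tensor layout, the wrap-around translation via `torch.roll`, and the function name `fakemix` are assumptions made for illustration, not the authors' released implementation.

```python
import torch

def fakemix(img, other_img, other_boundary, alpha=0.5, p=0.5):
    """Append Fake T-Boundaries from another sample onto `img`.

    img, other_img : (3, H, W) float tensors
    other_boundary : (1, H, W) binary boundary mask of `other_img`
    alpha          : range-of-translation parameter (Eqn. 4)
    p              : probability of applying FakeMix (Eqn. 6)
    """
    # Bernoulli trial: keep the original image with probability 1 - p (Eqn. 6).
    if torch.rand(1).item() > p:
        return img

    _, h, w = img.shape
    # Eqn. (1): extract T-Boundaries by pixel-wise multiplication.
    t_boundary = other_img * other_boundary

    # Eqns. (3)-(4): sample a translation vector uniformly in [-alpha*W, alpha*W] x [-alpha*H, alpha*H].
    tx = int(torch.empty(1).uniform_(-alpha * w, alpha * w).item())
    ty = int(torch.empty(1).uniform_(-alpha * h, alpha * h).item())

    # Eqn. (2): translate boundary mask and T-Boundaries to the same random location.
    # (torch.roll wraps around; a zero-padded shift would be an equally valid choice.)
    mask_t = torch.roll(other_boundary, shifts=(ty, tx), dims=(1, 2))
    t_boundary_t = torch.roll(t_boundary, shifts=(ty, tx), dims=(1, 2))

    # Eqn. (5): paste the translated T-Boundaries onto the current image.
    return img * (1.0 - mask_t) + t_boundary_t
```

In this reading, the segmentation and boundary labels of the current sample are left unchanged, since the pasted strips are intended to act as background Fake T-Boundaries.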

3.2 Architecture Overview

Following [22], ResNet50 [8] is used as the encoder, and a boundary branch is included in the decoder to help detect transparent regions. An AdaptiveASPP is designed to extract features of multiple fields-of-view for both the segmentation branch and the boundary branch.

As shown in Fig. 3, our AdaptiveASPP is plugged in at four stages of the encoder and extracts features for the segmentation branch and the boundary branch, respectively. Let $F_i^m$ denote the features extracted by AdaptiveASPP at the $i$-th ($i \in \{1, 2, 3, 4\}$) stage for branch $m$ ($m = s$ for the segmentation branch and $m = b$ for the boundary branch). The features are then integrated from bottom to top in each branch. We formulate the features at the $i$-th stage of branch $m$ of the decoder as $D_i^m$. The cross-modality feature fusion, which is not the direction we delve into, simply follows [22] to apply boundary information in an attention manner as:

$D_i^s = \mathcal{C}\big(\mathrm{Up}(D_{i+1}^s) + F_i^s + F_i^s \odot D_i^b\big)$ (7)

where $\mathrm{Up}$ denotes the interpolation operation that keeps the same scale between features from different stages and $\mathcal{C}$ denotes a convolutional function. For the boundary branch, the integration is:

$D_i^b = \mathcal{C}\big(\mathrm{Up}(D_{i+1}^b) + F_i^b\big)$ (8)

The segmentation loss and the boundary loss supervise the two branches separately.
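To make the bottom-up decoding concrete, below is one plausible PyTorch sketch of a single decoder stage following our reconstruction of Eqns. (7) and (8). The sigmoid normalization of the boundary cue, the channel width, and the module name `DecoderStage` are assumptions; the exact fusion used in TransLab/FANet may differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    """One decoder stage: boundary-attention fusion (segmentation) and plain fusion (boundary)."""

    def __init__(self, channels=256):
        super().__init__()
        self.conv_seg = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_bnd = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, f_seg, f_bnd, d_seg_up, d_bnd_up):
        # Up(.): resize the coarser decoder features to the current stage's resolution.
        d_seg_up = F.interpolate(d_seg_up, size=f_seg.shape[-2:], mode='bilinear', align_corners=False)
        d_bnd_up = F.interpolate(d_bnd_up, size=f_bnd.shape[-2:], mode='bilinear', align_corners=False)

        # Boundary branch (Eqn. 8): plain bottom-up integration.
        d_bnd = self.conv_bnd(f_bnd + d_bnd_up)

        # Segmentation branch (Eqn. 7): boundary features act as a spatial attention map
        # (the sigmoid normalization here is an assumed choice).
        attn = torch.sigmoid(d_bnd)
        d_seg = self.conv_seg(d_seg_up + f_seg + f_seg * attn)
        return d_seg, d_bnd
```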

3.3 AdaptiveASPP

To collect information from multiple fields-of-view, [22] adopts the Atrous Spatial Pyramid Pooling (ASPP) module [2]. However, as shown in [14, 32], boundary detection and region detection focus on different targets and pay attention to different characteristics. Thus, we argue that richer multiple-fields-of-view features, weighted appropriately at the cross-modality and multi-stage levels, offer more room for improvement. Motivated by existing attention mechanisms [10], we carefully design an AdaptiveASPP to capture features of multiple fields-of-view adaptively for the boundary branch and the segmentation branch.

As shown in Fig. 3, AdaptiveASPP first extracts features of multiple fields-of-view using convolution kernels with different dilation rates, which follows ASPP [2] and can be formulated as:

$f_k = \mathcal{C}_k(F)$ (9)

where $F$ denotes the input backbone features of AdaptiveASPP, $\mathcal{C}_k$ denotes a convolutional function, and the subscript $k$ represents the $k$-th dilation rate setting. Let the feature maps extracted by the different kernels be denoted by $\{f_1, f_2, \dots, f_K\}$, where $K$ is the number of dilation rate settings (Fig. 3 shows a small $K$ for illustration). Given $\{f_k\}_{k=1}^{K}$, we concatenate them into $f$ and conduct average pooling. We then obtain a $C$-dimensional vector $v$ whose $c$-th element is calculated as follows:

$v_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} f_c(i, j)$ (10)

where $H$ and $W$ are the spatial dimensions of $f$ and $f_c$ is its $c$-th channel. Then two transform functions, one for the segmentation branch and one for the boundary branch, are adopted to generate adaptive importances. The formulation can be written as:

$A^m = \delta\big(\mathcal{F}^m(v)\big), \quad m \in \{s, b\}$ (11)

where $m$ represents the modality, i.e., boundary ($b$) or segmentation ($s$), $\mathcal{F}^m$ denotes the FC-ReLU-FC block for the corresponding modality, and $\delta$ indicates the normalization function, which maps the scores to $[0, 1)$ with $\tanh$ as the activation function. The setting follows [12] as:

$\delta(x) = \max\big(0, \tanh(x)\big)$ (12)

As we can see in Fig. 3, the transform functions (refer to Eqn. (11)) generate adaptive importances for the boundary and segmentation modalities, respectively. Given the importance vectors $A^s$ and $A^b$, we adopt the enhancement function shown in Fig. 3 to enhance the features of multiple fields-of-view for the two modalities. We then obtain modality-specific features as:

$\hat{f}^m = f + A^m \odot f$ (13)

In Eqn. (13), the residual connection is added to preserve the original features $f$ in the enhancement function, and $A^m$ is broadcast over spatial locations. Following the enhancement function, a convolutional block is used to squeeze the channel numbers.
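Putting Eqns. (9)-(13) together, AdaptiveASPP can be sketched as the module below. This is a minimal sketch under our notation: the specific dilation rates, the FC bottleneck width, and the 1x1 squeeze convolutions are placeholder choices, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class AdaptiveASPP(nn.Module):
    def __init__(self, in_ch, out_ch=256, dilations=(1, 2, 4, 8, 12, 18, 24), reduction=4):
        super().__init__()
        # Eqn. (9): parallel convolutions with different dilation rates (ASPP-style).
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d) for d in dilations
        ])
        cat_ch = out_ch * len(dilations)
        # Eqn. (11): one FC-ReLU-FC transform per modality (segmentation / boundary).
        self.transforms = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(cat_ch, cat_ch // reduction),
                             nn.ReLU(inplace=True),
                             nn.Linear(cat_ch // reduction, cat_ch))
            for m in ('seg', 'bnd')
        })
        # Convolutional block that squeezes the enhanced features back to `out_ch` channels.
        self.squeeze = nn.ModuleDict({
            m: nn.Conv2d(cat_ch, out_ch, 1) for m in ('seg', 'bnd')
        })

    def forward(self, x):
        # Concatenate the multi-field-of-view features.
        f = torch.cat([branch(x) for branch in self.branches], dim=1)
        # Eqn. (10): global average pooling over spatial locations.
        v = f.mean(dim=(2, 3))
        outs = {}
        for m in ('seg', 'bnd'):
            # Eqn. (12): importance scores activated with max(0, tanh(.)) as in [12].
            a = torch.clamp(torch.tanh(self.transforms[m](v)), min=0.0)
            a = a.unsqueeze(-1).unsqueeze(-1)  # broadcast over spatial dims
            # Eqn. (13): channel-wise enhancement with a residual connection.
            f_m = f + a * f
            outs[m] = self.squeeze[m](f_m)
        return outs['seg'], outs['bnd']
```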

4 Experiments

4.1 Implementation Details

The proposed model is implemented in PyTorch [18]. In the encoder, we choose ResNet50 [8] as the backbone, following [22]. In the decoder, the channel number of the convolutional layers is set to 256, and the convolution type is set to separable convolution [3] as in [22]. The number of dilation rates in AdaptiveASPP is set to 7 experimentally, and the decoder is randomly initialized.

Training and Testing. We train our model for 400 epochs with the stochastic gradient descent (SGD) optimizer. The momentum and weight decay are set to 0.9 and 0.0005, respectively. The learning rate is initialized to 0.01 and decayed with a poly strategy with a power of 0.9. We use 8 V100 GPUs for our experiments with a batch size of 4 per GPU. Random flipping of the input image is also applied during training. Following [22], we use the dice loss [4, 17, 31] for the boundary branch and the cross-entropy loss for the segmentation branch. Besides, images are resized to the same resolution as in [22] during training and testing.
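For completeness, the poly learning-rate schedule and the dice loss mentioned above can be written compactly as follows. This is a generic sketch (the smoothing constant and the placeholder module are our choices), not the authors' training script.

```python
import torch

def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    # Poly schedule: lr = base_lr * (1 - iter / max_iter) ** power.
    return base_lr * (1.0 - cur_iter / max_iter) ** power

def dice_loss(pred, target, eps=1.0):
    # Soft dice loss for the boundary branch; `pred` and `target` are (N, 1, H, W) in [0, 1].
    pred = pred.flatten(1)
    target = target.flatten(1)
    inter = (pred * target).sum(dim=1)
    union = pred.sum(dim=1) + target.sum(dim=1)
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

# Optimizer configuration stated in the paper: SGD, momentum 0.9, weight decay 5e-4, base lr 0.01.
model = torch.nn.Conv2d(3, 1, 3)  # placeholder module for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0005)
```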

4.2 Datasets

We introduce the datasets adopted in our experiments here. The ablation experiments are conducted on Trans10K, a challenging Transparent Object Detection (TOD) dataset. When comparing with the state of the art, we also apply our method to related topics for specific region detection, i.e., mirror detection [25], glass detection [16], and camouflaged object detection [7], which demonstrates the potential of our method. We keep exactly the same dataset and evaluation metric settings as the original papers.

4.3 Ablation Study

4.3.1 Alternatives of FakeMix.

We compare our FakeMix with three popular data augmentation methods, i.e., Mixup [29], Cutout [5] and CutMix [28], on the latest deep-learning-based TOD method, TransLab [22]. According to Tab. 1, our FakeMix gains the best performance on all four metrics. FakeMix is the only method that takes boundaries into consideration and combats the boundary-related imbalance problem, leading to stable superiority.

Methods Acc mIoU MAE mBer
TransLab 83.04 72.10 0.166 13.30
TransLab + Cutout 80.36 72.27 0.160 13.37
TransLab + CutMix 81.32 72.21 0.162 12.94
TransLab + Mixup 77.06 72.33 0.158 14.63
TransLab + FakeMix 83.07 73.01 0.152 12.76
Table 1: Comparison among different data augmentation methods.
Methods Acc mIoU MAE mBer
Convs 79.32 69.87 0.184 13.42
ASPPs 82.17 70.56 0.170 13.74
AdaptiveASPP 85.81 76.54 0.139 10.59
Table 2: Ablation for the alternative methods of AdaptiveASPP. “Convs” denotes replacing AdaptiveASPP with convolutional layers. “ASPPs” denotes replacing AdaptiveASPP with ASPP modules.

4.3.2 AdaptiveASPP

As described in Sec. 3.3, AdaptiveASPP enhances features adaptively at the multiple-fields-of-view and cross-modality levels. Here we delve into the effectiveness and details of AdaptiveASPP. Specifically, we study (1) the effect of AdaptiveASPP, (2) the positions at which to adopt AdaptiveASPP, (3) the activation function, and (4) other specific settings. Due to space constraints, the last three experiments are presented in the supplementary materials.

Effectiveness of AdaptiveASPP. To explore the effect of AdaptiveASPP, we replace the AdaptiveASPP in Fig. 3 with two convolutional layers or two ASPPs [2] to generate features for the segmentation branch and the boundary branch separately. As we can see in Tab. 2, the results demonstrate the superiority of AdaptiveASPP. Compared with Convs and ASPPs, which treat all features with the same importance, our AdaptiveASPP enhances features adaptively, leading to better performance.


Figure 4: Comparison of features w/o FakeMix and w/ FakeMix. Best viewed in color and zoomed in. Features and predictions in the 2nd row, labeled “w/o FakeMix”, show that the model trained without FakeMix is confused by the boundaries labeled by the pink rectangle, leading to failed detection labeled by the orange rectangle. Features and predictions in the 3rd row, labeled “w/ FakeMix”, show that the model focuses on the appearances of transparent objects, which are the key differences between T-Boundaries and Fake T-Boundaries, resulting in better predictions labeled by the orange rectangle.

4.3.3 Delving into FakeMix.

As demonstrated in Sec. 3.1, FakeMix enhances the discrimination ability of the model during training. Here we study the effectiveness and different settings of FakeMix. Specifically, we first visualize (1) the features of our model trained w/o and w/ FakeMix to observe how FakeMix affects the model. Then we conduct ablation experiments on different settings, including (2) the range of translation, (3) the probability of adding Fake T-Boundaries, (4) the content of Fake T-Boundaries, and (5) the number of Fake T-Boundaries.

Look deeper into the features. To study how FakeMix works, we look deeper into FANet and visualize the features in a similar way to [14]. In detail, the visualized features are from the first stage of the decoder, namely $D_1^s$ in Eqn. (7). The max function is applied along the channel dimension to obtain the visualization. In Fig. 4, the features and predictions in the 2nd row are from FANet trained without FakeMix. As we can see, these features clearly attend to the Fake T-Boundaries, resulting in failed predictions in nearby regions. The features and predictions in the 3rd row are from FANet trained with FakeMix. These features pay more attention to the transparent appearances, which are the key differences between T-Boundaries and Fake T-Boundaries, leading to better predictions. Research on human perception of transparent objects [20] shows that optical phenomena in transparent appearances, e.g., refraction and reflection, may provide useful clues, which might be exploited by our FANet trained with FakeMix.
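The channel-max visualization described above reduces to a couple of tensor operations; a small sketch (the min-max rescaling for display is our choice):

```python
import torch

def visualize_features(feat):
    """Collapse a (C, H, W) feature map to a (H, W) heat map by taking the max over channels,
    then rescale to [0, 1] for display."""
    heat, _ = feat.max(dim=0)
    heat = heat - heat.min()
    return heat / (heat.max() + 1e-8)
```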

$\alpha$ Acc mIoU MAE mBer
- 85.81 76.54 0.139 10.59
0 87.23 76.99 0.129 10.22
1/3 86.99 76.89 0.130 10.32
1/2 87.36 77.62 0.128 9.93
2/3 87.14 77.54 0.129 10.20
Table 3: Ablation for the range of translation, namely $\alpha$ in Eqn. (4). “-” represents that the model is trained without FakeMix.
$p$ Acc mIoU MAE mBer
0 85.81 76.54 0.139 10.59
1/3 86.93 77.60 0.126 10.03
1/2 87.36 77.62 0.128 9.93
2/3 86.98 77.40 0.129 10.17
Table 4: Ablation for the probability of adding Fake T-Boundaries, namely $p$ in Eqn. (6). “0” represents that the model is trained without FakeMix.
Content Acc mIoU MAE mBer
- 85.81 76.54 0.139 10.59
zero 86.80 76.71 0.130 10.21
mean 86.75 77.08 0.132 10.23
random 86.22 76.62 0.132 10.33
boundary 87.36 77.62 0.128 9.93
Table 5: Ablation for the content of Fake T-Boundaries, namely $\tilde{T}$ in Eqn. (1). “-” represents that the model is trained without FakeMix. “zero” denotes that the values of the Fake T-Boundaries are set to 0. “mean” denotes that the values are set to the mean value of Trans10K. “random” denotes that the values are set to a random region of other images. “boundary” means the values are set to the content of the boundaries of transparent objects, as computed in Eqn. (1).
Numbers Acc mIoU MAE mBer
0 85.81 76.54 0.139 10.59
1 87.12 78.00 0.127 9.96
2 87.14 77.71 0.126 9.99
3 87.36 77.62 0.128 9.93
4 86.61 77.87 0.127 10.21
Table 6: Ablation for the number of Fake T-Boundaries we put on the input image. To obtain “n” Fake T-Boundaries, we repeat the process described in Sec. 3.1 “n” times.

Different settings. We first analyze the range of translation, namely $\alpha$ in Eqn. (4). As shown in Tab. 3, all the settings with FakeMix bring improvements compared with the baseline trained without FakeMix, which validates the effectiveness and practicability of our FakeMix. Experimentally, we set $\alpha$ to 1/2.

Method ACC MAE IoU BER Computation
Stuff Things Stuff Things Params/M FLOPs/G
ICNet [30] 52.65 0.244 47.38 53.90 29.46 19.78 8.46 10.66
BiSeNet [26] 77.92 0.140 70.46 77.39 17.04 10.86 13.30 19.95
DenseASPP [24] 81.22 0.114 74.41 81.79 15.31 9.07 29.09 36.31
FCN [15] 83.79 0.108 74.92 84.40 13.36 7.30 34.99 42.35
UNet[19] 51.07 0.234 52.96 54.99 25.69 27.04 13.39 124.62
OCNet [27] 80.85 0.122 73.15 80.55 16.38 8.91 35.91 43.43
DUNet [11] 77.84 0.140 69.00 79.10 15.84 10.53 31.21 123.35
PSPNet [31] 86.25 0.093 78.42 86.13 12.75 6.68 50.99 187.27
DeepLabv3+ [3] 89.54 0.081 81.16 87.90 10.25 5.31 28.74 37.98
TransLab [22] 92.69 0.063 84.39 90.87 7.28 3.63 40.15 61.27
TransLab [22] + FakeMix 93.14 0.057 85.62 91.91 6.68 3.28 40.15 61.27
FANet 94.00 0.052 87.01 92.75 6.08 2.65 35.39 77.57
FANet + FakeMix 94.93 0.046 88.29 93.42 5.43 2.36 35.39 77.57
Table 7: Comparison between stuff set and thing set of Trans10K. Note that FLOPs is computed with one image.
Method mIoU Acc MAE mBER
Hard Easy All Hard Easy All Hard Easy All Hard Easy All
ICNet [30] 33.44 55.48 50.65 35.01 58.31 52.65 0.408 0.200 0.244 35.24 21.71 24.63
BiSeNet [26] 56.37 78.74 73.93 62.72 82.79 77.92 0.282 0.102 0.140 24.85 10.83 13.96
DenseASPP [24] 60.38 83.11 78.11 66.55 86.25 81.22 0.247 0.078 0.114 23.71 8.85 12.19
FCN [15] 62.51 84.53 79.67 68.93 88.55 83.79 0.239 0.073 0.108 20.47 7.36 10.33
UNet[19] 37.08 58.60 53.98 37.44 55.44 51.07 0.398 0.191 0.234 36.80 23.40 26.37
OCNet [27] 59.75 81.53 76.85 65.96 85.63 80.85 0.253 0.087 0.122 23.69 9.43 12.65
DUNet [11] 55.53 79.19 74.06 60.50 83.41 77.84 0.289 0.100 0.140 25.01 9.93 13.19
PSPNet [31] 66.35 86.79 82.38 73.28 90.41 86.25 0.211 0.062 0.093 20.08 6.67 9.72
DeepLabv3+ [3] 69.04 89.09 84.54 78.07 93.22 89.54 0.194 0.050 0.081 17.27 4.91 7.78
TransLab [22] 72.10 92.23 87.63 83.04 95.77 92.69 0.166 0.036 0.063 13.30 3.12 5.46
TransLab [22] + FakeMix 73.01 93.19 88.76 83.07 96.37 93.14 0.152 0.032 0.057 12.76 2.71 4.98
FANet 76.54 93.77 89.88 85.81 96.62 94.00 0.139 0.029 0.052 10.59 2.43 4.37
FANet + FakeMix 77.62 94.70 90.86 87.36 97.36 94.93 0.128 0.024 0.046 9.93 2.03 3.89
Table 8: Comparison between hard set and easy set of Trans10K.

Besides, we study the probability of adding Fake T-Boundaries, i.e., $p$ in Eqn. (6). The results in Tab. 4 demonstrate that all non-zero probability settings boost performance. We choose $p = 1/2$.

The content of Fake T-Boundaries. Furthermore, we explore the content of Fake T-Boundaries. As shown in Tab. 5, the boundaries of transparent objects provide the best performance. This is because the boundaries of transparent objects have two characteristics: (1) similar appearances on their two sides and (2) obvious refraction or reflection. Such boundaries are the most likely to cause failed detection, so choosing them as the content of Fake T-Boundaries best helps the model learn to distinguish them, which is why “boundary” in Tab. 5 achieves the best performance.

The number of Fake T-Boundaries. We also try different numbers of Fake T-Boundaries in FakeMix. As shown in Tab. 6, the performance with any number greater than 0 is generally better than that of the model trained without FakeMix.


Figure 5: Visual comparison on Trans10K [22].

4.4 Compare with the State-of-the-art.

This section compares our FANet with alternative methods on the large-scale transparent object detection dataset Trans10K. Besides, we apply our FANet to related confusing region detection tasks, i.e., mirror detection, glass detection, and camouflaged object detection, to measure the potential of our method on more topics. Note that we retrain and evaluate our model following the same train/test settings as the original papers for a fair comparison. The comparisons on mirror detection, glass detection, and camouflaged object detection can be found in the supplementary materials.

We compare our FANet with the state-of-the-art TOD method TransLab [22] and mainstream semantic segmentation methods on the TOD dataset Trans10K. Tab. 7 reports the quantitative results on the four metrics for the stuff/things sets, and Tab. 8 reports the results for the easy/hard sets. As we can see, benefiting from our FakeMix and AdaptiveASPP, FANet outperforms the alternative methods significantly. Furthermore, we compare the qualitative results of TOD methods. In particular, we summarize several challenging scenes in TOD: complex scenes, scenes with occlusion, and scenes with small objects. More hard scenes, i.e., scenes with multi-category objects and scenes with unnoticeable objects, are shown in the supplementary materials. As shown in Fig. 5, the 1st row shows a simple example in which most methods perform well. In the 2nd-3rd rows, we sample images with complex scenes. Scenes with occlusion are shown in the 4th-5th rows. As we can see, in complex and occluded scenes, our model avoids the bad influence of boundaries from non-transparent regions and gains more complete results with better details, benefiting from our FakeMix. We then show other challenging situations in which the transparent objects are small. As we can see in the last three rows of Fig. 5, since AdaptiveASPP captures features of multiple fields-of-view adaptively for the segmentation and boundary branches, our model locates the small transparent objects well and segments them finely.

5 Conclusion

In this paper, we propose a novel content-dependent data augmentation method termed FakeMix. FakeMix weakens the boundary-related imbalance problem in the natural world and strengthens the model's ability to discriminate between Fake T-Boundaries and T-Boundaries. Besides, we design an AdaptiveASPP module to capture features of multiple fields-of-view adaptively for the segmentation and boundary branches, respectively. Benefiting from FakeMix and AdaptiveASPP, our FANet surpasses state-of-the-art TOD methods significantly. Further, FANet also achieves state-of-the-art performance in related confusing region detection tasks, i.e., mirror detection, glass detection, and camouflaged object detection.

References

  • [1] Guanying Chen, Kai Han, and Kwan-Yee K Wong. Tom-net: Learning transparent object matting from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9233–9241, 2018.
  • [2] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
  • [3] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801–818, 2018.
  • [4] Ruoxi Deng, Chunhua Shen, Shengjun Liu, Huibing Wang, and Xinru Liu. Learning to predict crisp boundaries. In Proceedings of the European Conference on Computer Vision (ECCV), pages 562–578, 2018.
  • [5] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
  • [6] Deng-Ping Fan, Ming-Ming Cheng, Jiang-Jiang Liu, Shang-Hua Gao, Qibin Hou, and Ali Borji. Salient objects in clutter: Bringing salient object detection to the foreground. In Proceedings of the European conference on computer vision (ECCV), pages 186–202, 2018.
  • [7] Deng-Ping Fan, Ge-Peng Ji, Guolei Sun, Ming-Ming Cheng, Jianbing Shen, and Ling Shao. Camouflaged object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2777–2787, 2020.
  • [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [9] Qibin Hou, Ming-Ming Cheng, Xiaowei Hu, Ali Borji, Zhuowen Tu, and Philip HS Torr. Deeply supervised salient object detection with short connections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3203–3212, 2017.
  • [10] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
  • [11] Qiangguo Jin, Zhaopeng Meng, Tuan D Pham, Qi Chen, Leyi Wei, and Ran Su. Dunet: A deformable network for retinal vessel segmentation. Knowledge-Based Systems, 178:149–162, 2019.
  • [12] Yanwei Li, Lin Song, Yukang Chen, Zeming Li, Xiangyu Zhang, Xingang Wang, and Jian Sun. Learning dynamic routing for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8553–8562, 2020.
  • [13] Jiaying Lin, Guodong Wang, and Rynson WH Lau. Progressive mirror detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3697–3705, 2020.
  • [14] Jiang-Jiang Liu, Qibin Hou, and Ming-Ming Cheng. Dynamic feature integration for simultaneous detection of salient object, edge and skeleton. arXiv preprint arXiv:2004.08595, 2020.
  • [15] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
  • [16] Haiyang Mei, Xin Yang, Yang Wang, Yuanyuan Liu, Shengfeng He, Qiang Zhang, Xiaopeng Wei, and Rynson WH Lau. Don’t hit me! glass detection in real-world scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3687–3696, 2020.
  • [17] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 fourth international conference on 3D vision (3DV), pages 565–571. IEEE, 2016.
  • [18] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
  • [19] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • [20] Nick Schlüter and Franz Faul. Visual shape perception in the case of transparent objects. Journal of vision, 19(4):24–24, 2019.
  • [21] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In European conference on computer vision, pages 746–760. Springer, 2012.
  • [22] Enze Xie, Wenjia Wang, Wenhai Wang, Mingyu Ding, Chunhua Shen, and Ping Luo. Segmenting transparent objects in the wild. arXiv preprint arXiv:2003.13948, 2020.
  • [23] Yichao Xu, Hajime Nagahara, Atsushi Shimada, and Rin-ichiro Taniguchi. Transcut: Transparent object segmentation from a light-field image. In Proceedings of the IEEE International Conference on Computer Vision, pages 3442–3450, 2015.
  • [24] Maoke Yang, Kun Yu, Chi Zhang, Zhiwei Li, and Kuiyuan Yang. Denseaspp for semantic segmentation in street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3684–3692, 2018.
  • [25] Xin Yang, Haiyang Mei, Ke Xu, Xiaopeng Wei, Baocai Yin, and Rynson WH Lau. Where is my mirror? In Proceedings of the IEEE International Conference on Computer Vision, pages 8809–8818, 2019.
  • [26] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 325–341, 2018.
  • [27] Yuhui Yuan and Jingdong Wang. Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916, 2018.
  • [28] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6023–6032, 2019.
  • [29] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
  • [30] Hengshuang Zhao, Xiaojuan Qi, Xiaoyong Shen, Jianping Shi, and Jiaya Jia. Icnet for real-time semantic segmentation on high-resolution images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 405–420, 2018.
  • [31] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.
  • [32] Jia-Xing Zhao, Jiang-Jiang Liu, Deng-Ping Fan, Yang Cao, Jufeng Yang, and Ming-Ming Cheng. Egnet: Edge guidance network for salient object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 8779–8788, 2019.
