Hard Pixels Mining: Learning Using Privileged Information for Semantic Segmentation

06/27/2019 · Zhangxuan Gu et al. · Shanghai Jiao Tong University

Semantic segmentation has achieved significant progress but remains challenging due to complex scenes, object occlusion, and other factors. Some works have attempted to use extra information, such as depth, to assist RGB-based semantic segmentation. However, such extra information is usually unavailable for test images. Inspired by learning using privileged information, in this paper we leverage the depth information of training images only, as privileged information in the training stage. Specifically, we rely on depth information to identify the hard pixels which are difficult to classify, using our proposed Depth Prediction Error (DPE) and Depth-dependent Segmentation Error (DSE). By paying more attention to the identified hard pixels, our approach achieves state-of-the-art results on two benchmark datasets and even outperforms methods which use the depth information of test images.


I Introduction

Semantic segmentation is a fundamental problem whose goal is to classify each pixel in an image; it has a wide range of real-world applications including autonomous driving, visual scene understanding, and image editing. Recently, abundant RGB-based semantic segmentation methods [39, 8, 13, 3, 35, 59, 26, 51, 2] have been developed. However, these methods still exhibit clear limitations due to the hard pixels induced by complicated scenes, poor lighting conditions, confusing object appearances, and so on.

To address the issues in RGB-based semantic segmentation, many attempts [20, 52, 21, 6, 30] have been made to exploit depth information for semantic segmentation, because the depth map provides complementary 3D information that can be helpful for the segmentation task. For example, Wang et al. [52] proposed to learn RGB-specific features and depth-specific features besides the common features, while Lee et al. [30] suggested fusing RGB features and depth features at different scales. These methods use extra depth information for both training and test images, which falls into the scope of multi-view learning. Specifically, each training or test sample consists of two views, i.e., RGB and depth. The results of these methods show that extra depth information can benefit the segmentation task.

However, depth information is often unavailable for test images, which limits the applicability of multi-view learning methods. Inspired by Learning Using Privileged Information (LUPI) [50], we propose to use depth information as privileged information, which means that we only use the depth information of training images in the training stage and do not require the depth information of test images in the testing stage. It has been shown in [14, 16, 15, 54] that using privileged information only for training samples can still help to learn a more robust model. One research line of LUPI uses privileged information to identify hard training samples. For example, the method in [45] distinguishes hard training samples from easy ones and assigns higher weights to the training losses of hard training samples, which can guide the training of a better classifier.

Figure 1: Illustration of hard pixels in semantic segmentation. For the entire image and two cropped regions, we show, from left to right, their ground-truth (GT) segmentation masks, the segmentation results of the RGB-based baseline [34] and our method, their ground-truth (GT) depth maps, and the depth maps generated by our depth prediction branch. The second row shows a subregion with a large depth prediction error (DPE), in which the chair and the table have a huge depth gap, yet the chair is misclassified as the table by the baseline. The third row shows a subregion with a large depth-dependent segmentation error (DSE), in which the cushion is misclassified as the pillow in the same depth bin by the baseline due to their similar visual appearance. Best viewed in color.

In this paper, similarly, we use depth information as privileged information to identify the hard pixels in the segmentation task and assign higher weights to the segmentation losses of these hard pixels. The remaining problem is how to use depth information to identify and mine hard pixels. Considering segmentation and depth prediction as two joint tasks, if two neighboring regions from two different categories have a huge depth gap, accurate detection of the depth boundary between these two regions might be highly correlated with accurate segmentation of these two regions. In other words, inaccurate depth prediction of these two regions, which leads to the failure of detecting the depth boundary, might be highly correlated with segmentation error. For example, as illustrated in the second row of Figure 1, a chair is placed next to a glass table. The chair is misclassified as the table by the RGB-based baseline [34], probably because they are visually similar. In fact, the chair and the table have a large depth gap, which is not captured by our depth prediction branch. Therefore, we conjecture that Depth Prediction Error (DPE) could be used as a measurement of segmentation difficulty, which means that the pixels with large DPE are hard pixels.

However, when two neighboring regions from two different categories have similar depths (i.e., lie in the same depth bin), accurate depth prediction may not be highly correlated with accurate segmentation of these two regions, in which case DPE becomes less effective in identifying hard pixels. For a better explanation, we define a local region in the same depth bin with multiple categories as a Depth-dependent Local Region (DLR). In a DLR, if the categories of different subregions are confused with each other due to similar visual appearance, this region becomes a hard region. For example, as illustrated in the third row of Figure 1, a neighboring pillow and cushion form a hard region, in which the cushion is misclassified as a pillow by the RGB-based baseline [34] although the depth prediction results are basically correct. In some other circumstances, the subregions in a hard region could also be misclassified as a wrong category irrelevant to the neighboring subregions. Without loss of generality, we simply compute the segmentation error rate in each DLR as the Depth-dependent Segmentation Error (DSE) and assume that DSE can be used as another measurement of segmentation difficulty, that is, the DLRs with large DSE are hard regions.

In our method, we use both Depth Prediction Error (DPE) and Depth-dependent Segmentation Error (DSE) to mine the hard pixels. With identified hard pixels, we also explore two different training strategies. The first strategy is training with easy and hard pixels at the same time. The second strategy is starting with easy pixels and gradually including hard pixels, similar to curriculum learning [1]. Our main contributions are as follows:

  • This is the first work to use depth information as privileged information for mining hard pixels in the semantic segmentation task.

  • We propose two measurements of hard pixels: Depth Prediction Error (DPE) and Depth-dependent Segmentation Error (DSE), which can be easily integrated into any semantic segmentation network. We also explore different training strategies with hard pixels.

  • Extensive experiments demonstrate that our method achieves state-of-the-art performance on the SUNRGBD and NYU-v2 datasets, and even surpasses the baselines which use depth information for test images.

II Related Works

In this section, we will discuss related works from the following three aspects: RGB-based semantic segmentation, learning using privileged information, and semantic segmentation with depth information.

RGB-based semantic segmentation:

Recent deep learning methods [39, 8, 13, 3, 35, 59, 26, 51, 2] have shown impressive results in semantic segmentation. Most of them are based on the encoder-decoder architecture first proposed in Fully Convolutional Networks (FCN) [36]. Extensions of FCN can be grouped into two directions: capturing contextual information at multiple scales and designing more sophisticated decoders. In the first direction, some works [4, 58] combined feature maps generated by different dilated convolutions and pooling operations. For example, PSPNet [58] adopts Spatial Pyramid Pooling, which pools the feature maps into different sizes for detecting objects of different scales. DeepLab v3 and v3+ [4, 5] proposed Atrous Spatial Pyramid Pooling, which uses dilated convolutions to keep a large receptive field. In the second direction, some works [34, 43, 10] proposed to construct better decoder modules to fuse mid-level and high-level features. For example, RefineNet-152 [34] is a multi-path refinement network which fuses features at multiple levels of the encoder and decoder. However, all the above methods are RGB-based segmentation methods, while our approach can utilize depth information to facilitate semantic segmentation.

Figure 2: An overview of our method. Our overall network is built upon RefineNet-152. RefineNet-152 consists of four encoders marked with "CNN", four decoders marked with "RefineNet", and five convolutional layers to adjust the number of channels. Based on RefineNet-152, we add five extra convolutional layers, denoted by pink boxes, to predict depths. Blue and pink rectangles are the feature maps, with their sizes and channel numbers denoted at the bottom of the figure, while the green diamonds are our losses $\mathcal{L}_{seg}$, $\mathcal{L}_{depth}$, $\mathcal{L}_{dpe}$, and $\mathcal{L}_{dse}$. Best viewed in color.

Learning using privileged information: Learning Using Privileged Information (LUPI) was first introduced by Vapnik and Vashist [50], who extended SVM to SVM+ by using privileged information to control the training loss. Besides classification [45, 44, 32], privileged information has also been used for clustering [14], verification [16, 15, 54], hashing [60], random forest [55], etc.

Recently, privileged information has also been integrated into deep learning methods [37, 24, 17, 28, 56] to distill knowledge or control the training process. More recently, SPIGAN [29] proposed to use privileged depth information in semantic segmentation. However, its main contribution is exploiting depth information to assist domain adaptation from the synthetic image domain to the real image domain, so the motivation and solution of that method are intrinsically different from ours. Distinct from all the above methods, this is the first work to use depth information as privileged information to mine hard pixels for semantic segmentation.

Semantic segmentation using depth information: Compared to traditional RGB-based segmentation, recent RGBD-based segmentation methods exploit depth information under the framework of multi-view learning or multi-task learning. Under the framework of multi-view learning, some works [33, 52, 21, 30] fuse depth information and RGB information in various ways. For example, STD2P [23] proposed a spatio-temporal pooling layer to model indoor videos, while Cheng et al. [6] proposed a gated fusion layer for better boundary segmentation. More recently, CFN [9] is a neural network with multiple branches of context-aware receptive fields, which learns better contextual information. RDFNet-152 [30] captures multi-level RGBD features by using the proposed multi-modal feature fusion blocks, which combine residual RGB and depth features to fully exploit the depth information. All the above methods require depth information for test images in the testing stage, which is not needed by our method.

Under the framework of multi-task learning, Eigen et al. [11] predict depths, surface normals, and segmentation in one unified network. Hoffman et al. [25] proposed to learn an extra branch to hallucinate mid-level depth features. Although our method also employs another branch to predict depths under the multi-task learning framework, our focus is using the Depth Prediction Error (DPE) to mine hard pixels, which contributes more to the segmentation task than merely predicting depths (see Section IV-B).

III Methodology

In our method, we aim to mine the hard pixels in the segmentation task by using depth information as privileged information. Then, we assign higher weights to the training losses of hard pixels to learn a better network. In the following sections, we first introduce how to mine hard pixels based on our proposed Depth Prediction Error (DPE) and Depth-dependent Segmentation Error (DSE), and then describe how to train with the identified hard pixels.

III-A Hard Pixels Mining

Given a training image $I$, we denote its predicted segmentation mask and ground-truth mask as $\hat{S}$ and $S$ respectively. Similarly, we use $\hat{D}$ and $D$ to denote its predicted depth map and ground-truth depth map respectively. Our network is built upon RefineNet-152 with improved residual pooling [34], which has recently achieved compelling results in the semantic segmentation task. RefineNet-152 follows the standard encoder-decoder architecture illustrated in Figure 2 (the blue feature maps and gray convolutional layers), in which the encoders are the first four pretrained ResNet-152 aggregated layers (conv1 and pool1 are included in the first "CNN") [22], progressively scaling the input down to 1/32 of its original resolution, and the decoders are four RefineNet modules proposed in [34]. RefineNet-152 refines the coarse-grained segmentation result with fine-grained features in a recursive manner to generate the final segmentation mask.

For RGB-based segmentation, we adopt a cross-entropy segmentation loss with label smoothing:

\mathcal{L}_{seg} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log p_{i,c} \qquad (1)

where $N$ is the number of pixels in image $I$, $C$ is the number of categories, and $p_{i,c}$ is the predicted probability of the $i$-th pixel for the $c$-th category based on our network. Following [49], we use smooth labels for better performance, that is, $y_{i,c} = 1-\epsilon$ if the $i$-th pixel belongs to the $c$-th category and $y_{i,c} = \epsilon/(C-1)$ otherwise, where $\epsilon$ is a small smoothing constant. In analogy to $\mathcal{L}_{seg}$, the losses in the remainder of this paper are all defined on a single image $I$.
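For concreteness, below is a minimal PyTorch sketch of the per-pixel label-smoothed cross-entropy behind Eq. (1); the function name, the smoothing constant, and the ignore_index handling are illustrative choices rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def smoothed_ce_per_pixel(logits, target, eps=0.1, ignore_index=255):
    """logits: (B, C, H, W); target: (B, H, W) with class indices.
    Returns per-pixel losses of shape (B, H, W), as in Eq. (1) before averaging."""
    num_classes = logits.shape[1]
    log_probs = F.log_softmax(logits, dim=1)                       # (B, C, H, W)
    valid = (target != ignore_index)
    safe_target = target.clone()
    safe_target[~valid] = 0                                        # avoid invalid indices
    one_hot = F.one_hot(safe_target, num_classes).permute(0, 3, 1, 2).float()
    # label smoothing: 1 - eps for the true class, eps / (C - 1) for the rest
    smooth = one_hot * (1.0 - eps) + (eps / (num_classes - 1)) * (1.0 - one_hot)
    per_pixel = -(smooth * log_probs).sum(dim=1)                   # (B, H, W)
    return per_pixel * valid.float()

# L_seg is then the mean of these per-pixel losses over the valid pixels.
```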

III-A1 Segmentation Loss Weighted by Depth Prediction Error ($\mathcal{L}_{dpe}$)

As mentioned in Section I, accurate detection of the depth boundary between two regions from different categories might be highly correlated with successful segmentation of these two regions. In other words, inaccurate depth prediction of these two regions could cause the failure of detecting the depth boundary, which may imply the difficulty of segmenting these two regions. So we conjecture that Depth Prediction Error (DPE) could be a measurement of segmentation difficulty. Specifically, we predict the depth map of a given RGB image and use the DPE to identify hard pixels.

To predict the depth map from an RGB image, a natural idea is to add an extra branch [25] on top of the standard segmentation network, since depth estimation and semantic segmentation are highly related. As illustrated in Figure 2, we add five more convolutional layers (the pink boxes) to RefineNet-152, producing another set of fine-grained depth feature maps (the pink rectangles). Then we fuse these depth feature maps with the feature maps from the encoders (the blue rectangles) to predict the final depth map. Compared with the original RefineNet-152, our network only introduces five more convolutional layers, while the rest of the model parameters are all shared with the segmentation task. Thus, our method has only slightly more computational cost than RefineNet-152. Moreover, our method can be easily adapted to any other segmentation network with an encoder-decoder architecture.
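To make the description concrete, here is a heavily simplified sketch of what such a depth branch could look like; the channel sizes, the layer arrangement, and the fusion scheme are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthHead(nn.Module):
    """A small depth-prediction head on top of shared decoder features,
    fused with an encoder feature map, producing a 1-channel log-depth map."""
    def __init__(self, dec_channels=256, enc_channels=256):
        super().__init__()
        self.refine = nn.Sequential(                          # extra conv layers
            nn.Conv2d(dec_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.fuse = nn.Conv2d(enc_channels, 256, 1)           # project encoder features
        self.predict = nn.Conv2d(256, 1, 3, padding=1)        # 1-channel log-depth

    def forward(self, dec_feat, enc_feat):
        enc_feat = F.interpolate(enc_feat, size=dec_feat.shape[-2:],
                                 mode="bilinear", align_corners=False)
        x = self.refine(dec_feat) + self.fuse(enc_feat)       # simple additive fusion
        return self.predict(x)
```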

For the training loss of depth prediction, inspired by [12], we compute the mean squared error of the differences between all pixel pairs based on the logarithm of depths. Concretely, given the predicted log-depth map $\hat{Z} = \log \hat{D}$ and the ground-truth log-depth map $Z = \log D$, we use $\hat{z}_i$ and $z_i$ to denote the values at pixel $i$ in $\hat{Z}$ and $Z$ respectively; since depth values lie in a bounded positive range, the log-depths are bounded as well. By denoting $d_i = \hat{z}_i - z_i$, the depth prediction loss can be written as

\mathcal{L}_{pair} = \frac{1}{2N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\big((\hat{z}_i-\hat{z}_j)-(z_i-z_j)\big)^2 = \frac{1}{N}\sum_{i=1}^{N} d_i^2 - \frac{1}{N^2}\Big(\sum_{i=1}^{N} d_i\Big)^2 \qquad (2)

However, based on the analysis in [12], minimizing $\mathcal{L}_{pair}$ could possibly lead to a trivial constant solution. To avoid such a trivial solution, they add a hyper-parameter $\lambda$ to $\mathcal{L}_{pair}$, resulting in the following cost function

\mathcal{L}_{depth} = \frac{1}{N}\sum_{i=1}^{N} d_i^2 - \frac{\lambda}{N^2}\Big(\sum_{i=1}^{N} d_i\Big)^2 \qquad (3)

When $\lambda = 0$, $\mathcal{L}_{depth}$ reduces to the element-wise $\ell_2$ loss (MSE). When $\lambda = 1$, $\mathcal{L}_{depth}$ is equal to $\mathcal{L}_{pair}$. In our experiments, we fix $\lambda$, which generally produces good results.
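Below is a minimal PyTorch sketch of the depth loss in Eqs. (2)-(3); the function name, the optional validity mask, and the default value of lam (the value used by Eigen et al. [12], not necessarily the one used here) are our own choices.

```python
import torch

def depth_loss(pred_log_depth, gt_log_depth, lam=0.5, valid_mask=None):
    """pred_log_depth, gt_log_depth: (B, H, W) log-depth maps.
    lam is the hyper-parameter of Eq. (3): 0 gives plain MSE, 1 gives Eq. (2)."""
    d = pred_log_depth - gt_log_depth
    if valid_mask is not None:                       # ignore pixels without GT depth
        d = d * valid_mask.float()
        n = valid_mask.float().sum(dim=(1, 2)).clamp(min=1.0)
    else:
        n = torch.tensor(float(d.shape[1] * d.shape[2]), device=d.device)
    mse_term = (d ** 2).sum(dim=(1, 2)) / n
    shift_term = (d.sum(dim=(1, 2)) ** 2) / (n ** 2)
    return (mse_term - lam * shift_term).mean()
```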

By integrating the depth prediction task with semantic segmentation, our multi-task learning loss can be written as $\mathcal{L}_{seg} + \lambda_d\,\mathcal{L}_{depth}$, in which $\lambda_d$ is a hyper-parameter.

With the predicted depths of all pixels in image $I$, we use the Depth Prediction Error (DPE) $e_i = |d_i| = |\hat{z}_i - z_i|$ to identify hard pixels, and the pixels with large $e_i$ are treated as hard pixels. To learn a more robust segmentation network, we pay more attention to the hard pixels in the training process. Hence, we use $e_i$ to weigh the pixel-wise classification losses of different pixels, leading to our DPE loss:

\mathcal{L}_{dpe} = -\frac{1}{N}\sum_{i=1}^{N} e_i \sum_{c=1}^{C} y_{i,c}\,\log p_{i,c} \qquad (4)
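A minimal sketch of Eq. (4) is given below, reusing the per-pixel cross-entropy from the earlier sketch; detaching the depth error so it acts purely as a weight is our own assumption.

```python
import torch

def dpe_loss(per_pixel_ce, pred_log_depth, gt_log_depth):
    """per_pixel_ce: (B, H, W), e.g. from smoothed_ce_per_pixel above.
    The absolute log-depth error re-weights the per-pixel segmentation losses."""
    dpe = (pred_log_depth - gt_log_depth).abs().detach()   # weights only, no gradient
    return (dpe * per_pixel_ce).mean()
```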

III-A2 Segmentation Loss Weighted by Depth-dependent Segmentation Error ($\mathcal{L}_{dse}$)

Figure 3: The cushion (subregion A) is misclassified as a pillow (subregion B) by the baseline RefineNet-152. We assign higher weights to the whole region AB instead of only the misclassified subregion A.

As discussed in Section I, when neighboring regions from different categories are in the same depth bin, accurate depth prediction may not be highly correlated with correct segmentation of these neighboring regions. In this case, the Depth Prediction Error (DPE) becomes less effective at measuring the segmentation difficulty. As a supplement to DPE, we propose another measurement named Depth-dependent Segmentation Error (DSE).

For a better explanation, we define a local region in the same depth bin as a Depth-dependent Local Region (DLR). In some DLRs, the categories of different subregions could be easily confused with each other due to visual resemblance. We refer to such DLRs with confusing subregions as hard regions. One example is illustrated in Figure 3, in which the cushion (subregion A) and the pillow (subregion B) have similar depths, and the region AB forms a DLR. The cushion is misclassified as a pillow by the baseline RefineNet-152 because the cushion looks very similar to the pillow. Thus, we regard region AB as a hard region. In some other cases, one subregion in a hard region could also be misclassified into an arbitrary wrong category instead of the category of the neighboring subregions. Considering more general cases, we simply calculate the segmentation error rate in a DLR to measure the segmentation difficulty of this region, leading to our second measurement, Depth-dependent Segmentation Error (DSE). For DSE, we only use the ground-truth depth map to obtain DLRs, without using the predicted depth map.

To locate Depth-dependent Local Regions (DLRs), for ease of implementation, we divide an image into two sets of regions based on two criteria. On one hand, we uniformly divide image $I$ into $K \times K$ cells, leading to one set of regions $\{a_k\}$. On the other hand, we divide the range of depth values into bins of equal size (the bin size is a hyper-parameter), leading to another set of regions $\{b_l\}$, in which each region contains the pixels whose depth values belong to one depth bin. With the two sets of regions, we calculate the intersection between $a_k$ and $b_l$, yielding a DLR $r = a_k \cap b_l$. Then, we simply calculate the segmentation error rate within this DLR as $s_r = \frac{1}{|r|}\sum_{i \in r} m_i$, where $m_i = 1$ if pixel $i$ is misclassified and $m_i = 0$ otherwise. We use $s_r$ to measure the segmentation difficulty of region $r$, and the regions with large $s_r$ are regarded as hard regions.
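As a concrete illustration, the following sketch computes DLRs and broadcasts each region's error rate $s_r$ back to its pixels for a single image; the grid size, the depth bin size, and all names are illustrative choices, not the authors' implementation. The DSE loss defined below (Eq. (5)) can then be approximated by multiplying these weights with the per-pixel cross-entropy and averaging.

```python
import torch

def dse_weights(pred_label, gt_label, gt_depth, K=8, bin_size=0.5):
    """pred_label, gt_label: (H, W) class indices; gt_depth: (H, W) depths.
    Returns an (H, W) map where each pixel carries the error rate of its DLR."""
    H, W = gt_label.shape
    ys = (torch.arange(H) * K // H).view(H, 1).expand(H, W)    # cell row index
    xs = (torch.arange(W) * K // W).view(1, W).expand(H, W)    # cell column index
    bins = (gt_depth / bin_size).long()                        # depth bin index
    # A DLR is the intersection of one spatial cell and one depth bin.
    region_id = ((ys * K + xs) * (bins.max() + 1) + bins).view(-1)
    err = (pred_label != gt_label).float().view(-1)
    num_regions = int(region_id.max()) + 1
    err_sum = torch.zeros(num_regions).scatter_add_(0, region_id, err)
    cnt = torch.zeros(num_regions).scatter_add_(0, region_id, torch.ones_like(err))
    rate = err_sum / cnt.clamp(min=1.0)                        # per-region error rate s_r
    return rate[region_id].view(H, W)                          # broadcast back to pixels
```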

Similar to Section III-A1, we use $s_r$ to weigh the training losses of different regions to learn a more robust segmentation network. Note that we assign weights to the entire region, which may include correctly classified pixels, for the following reason. Recall that in Figure 3, subregion A from category $c_1$ is misclassified as the category $c_2$ of subregion B. If we only assigned higher weights to subregion A, the classifier might be biased towards category $c_1$, which means that a subregion from category $c_2$ bearing visual resemblance to subregion A may be prone to be misclassified as category $c_1$. Therefore, we assign higher weights to both subregions A and B to better distinguish between category $c_1$ and category $c_2$.

By using $s_r$ to weigh different Depth-dependent Local Regions (DLRs), we obtain our Depth-dependent Segmentation Error (DSE) loss:

\mathcal{L}_{dse} = -\frac{1}{N}\sum_{r}\sum_{i \in r} s_r \sum_{c=1}^{C} y_{i,c}\,\log p_{i,c} \qquad (5)
| Method | Type | SUNRGBD IoU | SUNRGBD pixel acc. | SUNRGBD mean acc. | NYU-v2 IoU | NYU-v2 pixel acc. | NYU-v2 mean acc. |
| RefineNet-101 [34] | S | 47.93 | 77.80 | 65.61 | 45.69 | 75.59 | 59.83 |
| RefineNet-152 [34] | S | 48.54 (45.90*) | 78.03 (80.60*) | 66.18 (58.50*) | 45.89 (46.50*) | 76.02 (73.60*) | 60.02 (58.90*) |
| PSPNet [58] | S | 47.10 | 76.64 | 63.67 | 45.91 | 75.31 | 60.18 |
| DeepLabv3+ [5] | S | 47.43 | 77.17 | 64.49 | 45.52 | 75.58 | 59.70 |
| FCN [36] | D | 31.76 | 61.73 | 49.74 | 34.00* | 65.40* | 46.10* |
| STD2P [23] | D | - | - | - | 40.10* | 70.10* | 53.80* |
| 3DGNN [42] | D | 40.20* | - | 52.50* | 39.90* | - | 54.00* |
| LS-DeconvNet [6] | D | - | - | 58.00* | 45.90* | 71.90* | 60.70* |
| CFN [9] | D | 48.10* | - | - | 47.70* | - | - |
| Wang et al. [53] | D | 48.40* | - | 61.10* | 42.00* | 72.90* | 53.50* |
| RDFNet-152 [30] | D | 50.20 (47.70*) | 79.07 (81.50*) | 67.40 (60.10*) | 47.01 (50.10*) | 76.77 (76.00*) | 61.32 (62.80*) |
| LW-RefineNet [40] | P | 47.89 | 76.87 | 65.10 | 44.87 | 74.01 | 58.65 |
| RDFNet-152 [30] | P | 48.73 | 78.42 | 66.41 | 46.03 | 75.97 | 59.92 |
| Ours | P | 50.47 | 79.56 | 68.45 | 49.40 | 78.23 | 64.14 |
Table I: Segmentation results (%) on SUNRGBD and NYU-v2. The results of our method are listed in the last row ("Ours"). The results marked with * are taken from the corresponding papers. The three types S, D, and P stand for RGB only, RGBD, and depth as privileged information, respectively.

One remaining problem is how to determine the number of cells $K$. Two special cases are $K = 1$ and $K \times K = W \times H$, in which $W$ and $H$ are the width and height of image $I$. When $K = 1$, the image is divided into depth-dependent global regions, in which distant subregions in the same depth bin could be grouped into one region. Considering two distant disjoint subregions in the same depth-dependent global region, the segmentation difficulty of one subregion is actually little affected by the other distant subregion, so it may be unhelpful to assign the same weight to these two subregions. When $K \times K = W \times H$, each cell only contains one pixel. In this case, $\mathcal{L}_{dse}$ is equivalent to only assigning higher weights to the misclassified pixels instead of hard regions, which goes against our motivation of identifying hard regions. In our experiments, we use an intermediate $K$ located in the middle ground between the above two special cases. We also show the performance variation with different choices of $K$ in Section IV-B.

III-B Training Strategies with Hard Pixels

Combining the above components, our total training loss for hard pixel mining can be written as

\mathcal{L} = \mathcal{L}_{seg} + \lambda_d\,\mathcal{L}_{depth} + \alpha\,\mathcal{L}_{dpe} + \beta\,\mathcal{L}_{dse} \qquad (6)

where $\alpha$ and $\beta$ are two trade-off parameters. By using $\mathcal{L}_{dpe}$ and $\mathcal{L}_{dse}$, we place more emphasis on hard pixels to learn a better segmentation network. By minimizing (6), we apply $\mathcal{L}_{dpe}$ and $\mathcal{L}_{dse}$ to all pixels. An alternative strategy is to first apply them to easy pixels and then gradually extend to hard pixels, similar to curriculum learning [1]. Curriculum learning [1] has achieved great success in a variety of applications [57, 38, 31, 18, 41], in which the training process starts with easy training samples and gradually includes hard ones. The key problem in curriculum learning is how to define easy and hard training samples, corresponding to how to identify easy and hard pixels in our problem. One popular curriculum learning approach is Self-Paced Learning (SPL) [27], in which the training samples with small training losses are regarded as easy training samples. Specifically, SPL jointly learns the model parameters $\mathbf{w}$ and a binary indicator vector $\mathbf{v} = [v_1, \ldots, v_n]^\top$ indicating easy training samples, in which $n$ is the total number of training samples. The objective function of SPL can be written as

\min_{\mathbf{w},\,\mathbf{v}\in\{0,1\}^n}\ \sum_{i=1}^{n} v_i\,\ell_i(\mathbf{w}) - \lambda\sum_{i=1}^{n} v_i \qquad (7)

in which $\ell_i(\mathbf{w})$ is the training loss of the $i$-th training sample and $\lambda$ is the learning pace. The problem in (7) can be solved by updating $\mathbf{w}$ and $\mathbf{v}$ alternatingly. When fixing $\mathbf{w}$, each $v_i$ can be updated in closed form as $v_i = \mathbb{1}(\ell_i(\mathbf{w}) < \lambda)$, which is equal to $1$ if $\ell_i(\mathbf{w}) < \lambda$ and $0$ otherwise. Then, the model parameters $\mathbf{w}$ can be updated by solving

\min_{\mathbf{w}}\ \sum_{i=1}^{n} v_i\,\ell_i(\mathbf{w}) \qquad (8)
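For concreteness, here is a minimal PyTorch-style sketch of one alternating SPL update following Eqs. (7)-(8); the model, optimizer, losses_fn, and batch are placeholders of our own, not part of the paper.

```python
import torch

def spl_step(model, optimizer, losses_fn, batch, pace):
    """One alternating SPL update.
    losses_fn(model, batch) -> per-sample training losses, shape (n,)."""
    with torch.no_grad():
        v = (losses_fn(model, batch) < pace).float()   # closed-form update of v (Eq. 7)
    optimizer.zero_grad()
    loss = (v * losses_fn(model, batch)).sum()          # Eq. (8): train only on easy samples
    loss.backward()
    optimizer.step()
    return loss.item()
```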

Inspired by (8), we extend $\mathcal{L}_{dpe}$ and $\mathcal{L}_{dse}$ to $\tilde{\mathcal{L}}_{dpe}$ and $\tilde{\mathcal{L}}_{dse}$ respectively. In particular, $\tilde{\mathcal{L}}_{dpe}$ is formulated as

\tilde{\mathcal{L}}_{dpe} = -\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}(e_i < \lambda_{dpe})\, e_i \sum_{c=1}^{C} y_{i,c}\,\log p_{i,c} \qquad (9)

where the learning pace $\lambda_{dpe}$ grows from the median to the maximum of the DPE values $e_i$ as training proceeds, depending on the current epoch $t$ and the total number of epochs $T$. In a similar way, we can extend $\mathcal{L}_{dse}$ to $\tilde{\mathcal{L}}_{dse}$ with a similarly defined pace $\lambda_{dse}$. Then, our new total loss function is

\tilde{\mathcal{L}} = \mathcal{L}_{seg} + \lambda_d\,\mathcal{L}_{depth} + \alpha\,\tilde{\mathcal{L}}_{dpe} + \beta\,\tilde{\mathcal{L}}_{dse} \qquad (10)

The training strategy based on (10) can be explained as follows. At first, only easy pixels are selected for training with $\tilde{\mathcal{L}}_{dpe}$ and $\tilde{\mathcal{L}}_{dse}$. Then, more and more hard pixels contribute to the training as the paces grow. Note that when setting $\lambda_{dpe}$ and $\lambda_{dse}$ to sufficiently large constants, $\tilde{\mathcal{L}}$ reduces to $\mathcal{L}$. In our experiments, $\alpha$, $\beta$, and $\lambda_d$ are kept fixed for the first training epochs; after that, $\lambda_d$ is changed since the depth prediction branch has become stable.
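A minimal sketch of the self-paced DPE term in Eqs. (9)-(10) is given below; the linear growth of the pace from the median toward the maximum DPE value is an illustrative assumption about the schedule.

```python
import torch

def spl_dpe_loss(per_pixel_ce, dpe, epoch, total_epochs):
    """per_pixel_ce, dpe: (B, H, W); epoch in [0, total_epochs].
    Pixels whose DPE exceeds the current pace are excluded, then gradually admitted."""
    e_med, e_max = dpe.median(), dpe.max()
    pace = e_med + (epoch / total_epochs) * (e_max - e_med)   # grows over epochs
    easy = (dpe < pace).float()                               # binary indicator v per pixel
    return (easy * dpe * per_pixel_ce).mean()
```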

IV Experiments

In this section, we first evaluate our method on SUNRGBD [48] and NYU-v2 [46], both of which provide depth information, and then present comprehensive ablation studies and qualitative results.

IV-A Datasets and Implementation Details

SUNRGBD [48]: This dataset consists of 10335 pairs of RGB and depth images. We use the standard split of 5285 pairs for training and 5050 pairs for testing, with 37 categories.

NYU-v2 [46]: This dataset contains 1449 densely labeled pairs of RGB and depth images. Following the standard train/test split, we use 795 training images and 654 test images. We evaluate our network for 40 categories by using the labels provided by [19].

In the training stage, we train with momentum and weight decay, and the learning rate is divided by 10 when the loss stops decreasing. The training batch size and the input image size are fixed. In the testing stage, we apply multi-scale evaluation [7] for all methods. We report results based on three evaluation metrics, i.e., pixel accuracy, mean accuracy, and mean Intersection over Union (IoU).
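For reference, the three metrics can be computed from a class confusion matrix as in the following sketch (standard definitions assumed; this is not the authors' evaluation code).

```python
import torch

def segmentation_metrics(conf):
    """conf: (C, C) confusion matrix; conf[i, j] = #pixels of GT class i predicted as class j."""
    conf = conf.float()
    tp = conf.diag()                                   # correctly classified pixels per class
    gt = conf.sum(dim=1)                               # GT pixels per class
    pred = conf.sum(dim=0)                             # predicted pixels per class
    pixel_acc = tp.sum() / conf.sum()
    mean_acc = (tp / gt.clamp(min=1)).mean()
    iou = tp / (gt + pred - tp).clamp(min=1)
    return pixel_acc, mean_acc, iou.mean()             # pixel acc., mean acc., mean IoU
```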

IV-B Experimental Results

Figure 4: Qualitative results of our method compared with RefineNet-152 on the NYU-v2 dataset. From left to right for each sample: input image, ground-truth segmentation mask, segmentation mask predicted by RefineNet-152, and segmentation mask predicted by our method. The bounding boxes highlight the improvements of our method. Best viewed in color.

We compare our method with three groups of baselines: 1) S group: baselines without using depth information; 2) D group: baselines using depth information for both training and test images; 3) P group: baselines using depth information only for training images.

For the S group, we compare with RefineNet [34], PSPNet [58], and DeepLabv3+ [5]. Since our method is built upon RefineNet-152, RefineNet-152 can be treated as our backbone network. We did not exactly reproduce the results reported in [34], probably due to mismatches in some experimental details, so we additionally report the results we obtained for RefineNet-152 for a more meaningful comparison.

For the D group, we compare with FCN [36], STD2P [23], 3DGNN [42], LS-DeconvNet [6], CFN [9], Wang et al. [53], and RDFNet-152 [30]. RDFNet-152 [30] is also built upon RefineNet-152 and closely related to our method, so we additionally report our implemented results.

For the P group, we compare with LW-RefineNet [40] and RDFNet-152 [30], in which we remove their depth branches and retrain the remaining networks for several epochs so that they use depth information only as privileged information. For the other methods from the D group, some cannot be easily modified to use depth information as privileged information, while the modified versions of others are much worse than their original versions, so we omit their results here.

Experimental results are summarized in Table I. We observe that RefineNet-152 achieves very competitive results within the S group. We also observe that the recent methods CFN, Wang et al., and RDFNet-152 from the D group are generally better than the methods in the S group (except the performance of Wang et al. on NYU-v2, where VGG [47] was used as the backbone), which shows that depth information is beneficial for semantic segmentation. Our method achieves significant improvements over our backbone network RefineNet-152, i.e., 2% and 3.5% IoU improvement on the SUNRGBD and NYU-v2 datasets respectively. Moreover, our method not only beats the baselines in the P group, but also generally outperforms the baselines in the D group, which demonstrates the effectiveness of our method in using depth information as privileged information.

SUNRGBD NYU-v2
$\mathcal{L}_{seg}$ 48.54 45.89
$\mathcal{L}_{seg}$ (with SPL) +0.31 +0.27
+0.20 +0.18
+0.68 +0.96
+0.87 +1.02
+0.89 +2.45
+1.14 +2.98
+1.35 +3.38
+1.92 +3.51
Table II: Ablation studies on our backbone network RefineNet-152 (trained with $\mathcal{L}_{seg}$ only) w.r.t. IoU (%) on the SUNRGBD and NYU-v2 datasets. Rows below the first report the IoU improvement over the $\mathcal{L}_{seg}$ baseline for different loss configurations.

Ablation studies: We validate the effectiveness of each component in our method, and the results are reported in Table II. $\mathcal{L}_{seg}$ is RefineNet-152 as reported in Table I, which only uses the segmentation loss. We then report the IoU improvement of our special cases over $\mathcal{L}_{seg}$. Concretely, $\mathcal{L}_{seg}$ (with SPL) employs self-paced learning for RefineNet-152. $\mathcal{L}_{seg} + \lambda_d\mathcal{L}_{depth}$ is a multi-task learning baseline. We also report the results obtained after further adding the DPE loss and the DSE loss. Moreover, we tune the number of cells $K$ and observe the performance variation of $\mathcal{L}_{dse}$.

From Table II, we observe that the performance gain brought by the DPE loss and the DSE loss is more significant than that brought by multi-task learning ($\mathcal{L}_{seg} + \lambda_d\mathcal{L}_{depth}$ vs. $\mathcal{L}_{seg}$). We also observe that when using different $K$ for $\mathcal{L}_{dse}$, an intermediate $K$ achieves the best result, which verifies that it is more beneficial to assign higher weights to hard local regions instead of hard global regions or only misclassified pixels. Another observation is that by adopting the training strategy similar to self-paced learning, $\tilde{\mathcal{L}}$ achieves better results than $\mathcal{L}$, which indicates the effectiveness of our proposed training strategy.

Qualitative analyses: We also provide some segmentation results of the baseline RefineNet-152 and our method on the NYU-v2 dataset in Figure 4. It can be seen that our method is adept at segmenting objects with a sharp depth boundary between them and neighboring regions (f, h, i, l), which cannot be recognized by RefineNet-152. For example, our method can successfully segment the upper part of the chair back in (f), the table under the TV set in (h), and the attic sloped ceiling in (l), which is probably attributed to our Depth Prediction Error (DPE) loss. Moreover, our method also demonstrates a strong ability in classifying confusing objects in the same depth bin (a, b, c, d, e, g, j, k), which are misclassified by RefineNet-152 as wrong categories with similar visual appearance. For example, our method can successfully segment the bookshelf in (b), the towel clinging to the refrigerator in (c), the sink in (e), and the curtain in (k), which is partially credited to our Depth-dependent Segmentation Error (DSE) loss.

V Conclusions

In this work, we have proposed to mine hard pixels by using depth information as privileged information in the semantic segmentation task. Specifically, we have designed the Depth Prediction Error (DPE) and the Depth-dependent Segmentation Error (DSE) to mine hard pixels. We have also explored different training strategies with the identified hard pixels. Quantitative and qualitative results on two benchmark datasets have demonstrated the superiority of our method.

References

  • [1] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
  • [2] S. Chandra and I. Kokkinos. Fast, exact and multi-scale inference for semantic image segmentation with deep gaussian CRFs. In ECCV, 2016.
  • [3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. PAMI, 40(4):834–848, 2018.
  • [4] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
  • [5] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611, 2018.
  • [6] Y. Cheng, R. Cai, Z. Li, X. Zhao, and K. Huang. Locality-sensitive deconvolution networks with gated fusion for RGB-D indoor semantic segmentation. In CVPR, 2017.
  • [7] J. Dai, K. He, and J. Sun. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV, 2015.
  • [8] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In CVPR, 2015.
  • [9] L. Di, G. Chen, D. Cohen-Or, P. A. Heng, and H. Hui. Cascaded feature network for semantic segmentation of RGB-D images. In ICCV, 2017.
  • [10] H. Ding, X. Jiang, B. Shuai, A. Qun Liu, and G. Wang. Context contrasted feature and gated multi-scale aggregation for scene segmentation. In CVPR, 2018.
  • [11] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, 2015.
  • [12] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014.
  • [13] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. PAMI, 35(8):1915–1929, 2013.
  • [14] J. Feyereisl and U. Aickelin. Privileged information for data clustering. Information Sciences, 194:4–23, 2012.
  • [15] S. Fouad and P. Tiňo. Ordinal-based metric learning for learning using privileged information. In IJCNN, 2013.
  • [16] S. Fouad, P. Tino, S. Raychaudhury, and P. Schneider. Incorporating privileged information through metric learning. IEEE transactions on neural networks and learning systems, 24(7):1086–1098, 2013.
  • [17] N. C. Garcia, P. Morerio, and V. Murino. Modality distillation with multiple stream networks for action recognition. In ECCV, 2018.
  • [18] A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu. Automated curriculum learning for neural networks. In ICML, 2017.
  • [19] S. Gupta, P. Arbelaez, and J. Malik. Perceptual organization and recognition of indoor scenes from RGB-D images. In CVPR, 2013.
  • [20] S. Gupta, P. Arbeláez, R. Girshick, and J. Malik. Indoor scene understanding with RGB-D images: Bottom-up segmentation, object detection and semantic segmentation. International Journal of Computer Vision, 112(2):133–149, 2015.
  • [21] C. Hazirbas, L. Ma, C. Domokos, and D. Cremers. Fusenet: Incorporating depth into semantic segmentation via fusion-based CNN architecture. In ACCV, 2016.
  • [22] K. He, X. Zhang, S. Ren, and S. Jian. Deep residual learning for image recognition. In CVPR, 2016.
  • [23] Y. He, W. C. Chiu, M. Keuper, and M. Fritz. RGBD semantic segmentation using spatio-temporal data-driven pooling. In CVPR, 2017.
  • [24] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [25] J. Hoffman, S. Gupta, and T. Darrell. Learning with side information through modality hallucination. In CVPR, 2016.
  • [26] V. Jampani, M. Kiefel, and P. V. Gehler. Learning sparse high dimensional filters: Image filtering, dense CRFs and bilateral neural networks. In CVPR, 2016.
  • [27] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In NIPS, 2010.
  • [28] J. Lambert, O. Sener, and S. Savarese. Deep learning under privileged information using heteroscedastic dropout. In CVPR, 2018.
  • [29] K.-H. Lee, G. Ros, J. Li, and A. Gaidon. SPIGAN: Privileged adversarial learning from simulation. ICLR, 2019.
  • [30] S. Lee, S. J. Park, and K. S. Hong. RDFNet: RGB-D multi-level residual feature fusion for indoor semantic segmentation. In ICCV, 2017.
  • [31] S. Li, X. Zhu, Q. Huang, H. Xu, and C.-C. J. Kuo. Multiple instance curriculum learning for weakly supervised object detection. arXiv preprint arXiv:1711.09191, 2017.
  • [32] W. Li, L. Niu, and D. Xu. Exploiting privileged information from web data for image categorization. In ECCV, 2014.
  • [33] Z. Li, Y. Gan, X. Liang, Y. Yu, H. Cheng, and L. Lin. Lstm-cf: Unifying context modeling and fusion with lstms for RGB-D scene labeling. In ECCV, 2016.
  • [34] G. Lin, A. Milan, C. Shen, and I. Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In CVPR, 2017.
  • [35] Z. Liu, X. Li, P. Luo, C.-C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. In ICCV, 2015.
  • [36] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [37] D. Lopez-Paz, L. Bottou, B. Schölkopf, and V. Vapnik. Unifying distillation and privileged information. arXiv preprint arXiv:1511.03643, 2015.
  • [38] W. Lotter, G. Sorensen, and D. Cox. A multi-scale cnn and curriculum learning strategy for mammogram classification. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 169–177. 2017.
  • [39] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In CVPR, 2015.
  • [40] V. Nekrasov, T. Dharmasiri, A. Spek, T. Drummond, C. Shen, and I. Reid. Real-time joint semantic segmentation and depth estimation using asymmetric annotations. arXiv preprint arXiv:1809.04766, 2018.
  • [41] A. Pentina, V. Sharmanska, and C. H. Lampert. Curriculum learning of multiple tasks. In CVPR, 2015.
  • [42] X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun. 3d graph neural networks for RGBD semantic segmentation. In ICCV, 2017.
  • [43] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
  • [44] N. Sarafianos, M. Vrigkas, and I. A. Kakadiaris. Adaptive svm+: Learning with privileged information for domain adaptation. In CVPR, 2017.
  • [45] V. Sharmanska, N. Quadrianto, and C. H. Lampert. Learning to rank using privileged information. In ICCV, 2014.
  • [46] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
  • [47] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [48] S. Song, S. P. Lichtenberg, and J. Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In CVPR, 2015.
  • [49] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
  • [50] V. Vapnik and A. Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22(5):544–557, 2009.
  • [51] R. Vemulapalli, O. Tuzel, M.-Y. Liu, and R. Chellapa. Gaussian conditional random field network for semantic segmentation. In CVPR, 2016.
  • [52] J. Wang, Z. Wang, D. Tao, S. See, and G. Wang. Learning common and specific features for RGB-D semantic segmentation with deconvolutional networks. In ECCV, 2016.
  • [53] W. Wang and U. Neumann. Depth-aware CNN for RGB-D segmentation. In ECCV, 2018.
  • [54] X. Xu, W. Li, and D. Xu. Distance metric learning using privileged information for face verification and person re-identification. IEEE transactions on neural networks and learning systems, 26(12):3150–3162, 2015.
  • [55] H. Yang and I. Patras. Privileged information-based conditional regression forest for facial feature detection. In FG, 2013.
  • [56] H. Yang, J. Tianyi Zhou, J. Cai, and Y. Soon Ong. MIML-FCN+: Multi-instance multi-label learning via fully convolutional networks with privileged information. In CVPR, 2017.
  • [57] Y. Zhang, P. David, and B. Gong. Curriculum domain adaptation for semantic segmentation of urban scenes. In ICCV, 2017.
  • [58] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017.
  • [59] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.
  • [60] J. T. Zhou, X. Xu, S. J. Pan, I. W. Tsang, Z. Qin, and R. S. M. Goh. Transfer hashing with privileged information. arXiv preprint arXiv:1605.04034, 2016.