DDNet: Dual-path Decoder Network for Occlusion Relationship Reasoning

11/26/2019 · by Panhe Feng, et al.

Occlusion relationship reasoning based on convolutional neural networks consists of two subtasks: occlusion boundary extraction and occlusion orientation inference. Due to the essential differences between the two subtasks in feature expression at the higher and lower stages, it is challenging to carry them out simultaneously in one network. To address this issue, we propose a novel Dual-path Decoder Network, which uniformly extracts occlusion information at the higher stages and then separates into two paths to recover the boundary and the occlusion orientation respectively at the lower stages. Besides, considering how the representation of occlusion orientation restricts its learning, we design a new orthogonal representation for occlusion orientation and propose the Orthogonal Orientation Regression loss, which removes the mismatch between occlusion representation and learning and further promotes occlusion orientation learning. Finally, we apply a multi-scale loss together with our proposed orientation regression loss to guide the learning of the boundary path and the orientation path respectively. Experiments demonstrate that our proposed method achieves state-of-the-art results on the PIOD and BSDS ownership datasets.


1 Introduction

Figure 1: A ground truth from the PIOD dataset. Given an image (a), PIOD provides two annotated maps, namely (b) the occlusion orientation map and (c) the object boundary map. (d) The object occlusion relationship is represented by red arrows, each indicating an occlusion orientation. By the "left" rule, the left side of each arrow indicates the foreground.

Occlusions occur when a 3D scene is projected onto the image plane and are commonly present in 2D images. Occlusion relationship reasoning recovers the occlusion relationship between objects from monocular images, which is important for a variety of computer vision research areas, such as object detection [4, 1], scene understanding [17], segmentation [4, 23, 3] and 3D reconstruction [15].

Previously, a number of traditional methods extract occlusion boundaries by image segmentation and then recover the occlusion relationship on the basis of the boundaries by designing hand-crafted features or adding extra prior information, e.g. [8, 16, 12, 14]. Recently, driven by Convolutional Neural Networks (CNNs), occlusion boundary extraction and occlusion relationship inference can be accomplished simultaneously. DOC [19] uses two networks for occlusion boundary pixel classification and occlusion orientation regression respectively. DOOBNet [18] designs a unified end-to-end multi-task deep object occlusion boundary detection network that simultaneously predicts object boundaries and occlusion orientations by sharing convolutional features. OFNet [10] considers the importance of occlusion cues for the two tasks and designs two sub-networks to share occlusion cues. In general, the result of occlusion relationship reasoning is the combination of the boundary map and the orientation map, which are obtained separately from boundary pixel classification and occlusion orientation regression, as shown in Fig. 1.

The boundary map and the orientation map are both partial expressions of the occlusion relationship, but boundary extraction and orientation regression are essentially different tasks for the network: occlusion boundary extraction determines where the occlusion occurs, while occlusion orientation inference reasons about the occlusion relationship. Therefore, it is challenging to perform the two tasks simultaneously in one network. Previous methods, such as DOOBNet [18] and OFNet [10], first use the backbone network to extract features at different levels and then use a single, entirely shared decoder to extract information for both tasks. This improves occlusion reasoning but ignores the different effects of different feature levels on the boundary and the orientation: the occlusion information extracted at the higher stages is consistent for both tasks, while the lower stages tend to contain more specific spatial information that each task prefers. Processing the information of the higher and lower stages through an entirely shared decoder probably restricts the learning flexibility and the representation of low-level features for the two tasks. The second issue is the representation and learning of the occlusion orientation in the network. Previous methods [19, 18, 10] infer the occlusion orientation by regressing a continuous orientation value. However, due to the periodicity of the occlusion orientation itself, predicting the occlusion orientation from a continuous orientation value is difficult for the network.

To address these two issues, we propose a novel path-splitting decoder structure for predicting the boundary and the occlusion orientation, called the Dual-path Decoder Network (DDNet). Our decoder uniformly extracts occlusion information at the higher stages and separates into two paths to recover the boundary and the occlusion orientation respectively at the lower stages. For the boundary path, a layer-by-layer decoder structure and multi-scale supervision are designed to strengthen the use of spatial information at the lower stages. For the occlusion orientation path, we use a decoder with skip connections. Besides, we propose a novel method that uses a pair of orthogonal vectors to represent the occlusion orientation, which avoids the problem caused by the periodicity of the occlusion orientation. Based on the orthogonal orientation representation, we design the Orthogonal Orientation Regression (OOR) loss to make the network learn the occlusion orientation. Experiments show that we achieve state-of-the-art results on both the PIOD [19] and BSDS ownership [13] datasets.

The main contributions of this paper are as follows:

  • We propose the DDNet for occlusion relationship reasoning, which decodes the high-stage features jointly and decodes the low-stage features for the two tasks in two separate paths. A multi-scale supervision structure is designed for the boundary path, which strengthens the use of multi-scale information.

  • We propose using orthogonal vectors to represent the occlusion orientation in the network and design the OOR loss, which better expresses the occlusion relationship and improves the results of occlusion relationship reasoning.

  • We achieve new state-of-the-art results on two challenging benchmarks, the PIOD [19] and BSDS ownership [13] datasets.

2 Related Work

Encoder-Decoder: In occlusion relationship reasoning, an encoder-decoder can effectively utilize the semantic information and occlusion cues of high-level features to guide the network in recovering the spatial location of occlusions from low-level features. Deng et al. [2] use a bottom-up/top-down architecture to predict crisp boundaries. Based on HED [20], CED [22] implements edge enhancement by adding a decoder structure. CEDN [21] utilizes a fully convolutional encoder-decoder network to detect higher-level object contours. DOOBNet [18] adopts an encoder-decoder structure with skip connections to recover occlusion boundaries and occlusion relationships. The encoder encodes different levels of features, and the decoder is used to recover spatial dimensions and details.

Multi-scale Supervision:

Multi-scale supervision is a deep learning technique used to perceive features at different scales and obtain results at different scales. For example, HED [20] obtains multi-scale boundaries by supervising the outputs of different side layers. BDCN [5] applies different supervision to different side layers to optimize the side outputs. With multi-scale supervision, the network perceives multi-scale features and obtains multi-scale results. However, previous methods all directly supervise the side-layer outputs of the backbone network, which lacks the guidance of high-level features when recovering spatial detail at the lower stages.

Representation and prediction of occlusion orientation: To recover the occlusion relationship between objects, DOC [19] proposes using a pixel-wise orientation variable to represent the occlusion relationship and regression to estimate this variable. The main problem with this method is that the orientation value is periodic with period 2π, and it is difficult for the network to regress a periodic value. DOOBNet [18] limits the predicted occlusion orientation to the range (−π, π] and regresses the orientation value within this range, but it cannot handle values near the two ends of the interval well.

3 Method

(a) Dual-path Decoder Network
(b) FBR: Feature Boost Residual Block
(c) FFS: Feature Fusion and Separation Block
Figure 2: The overall architecture and details of the Dual-path Decoder Network. (a) Dual-path Decoder Network: the green path is the boundary path and the red path is the orientation path; upsampling, Batch Normalization and ReLU operations are omitted from the figure; OSM stands for Occlusion Shared Module; the bottom side is the high stage and the upper side is the low stage. (b) Components of the Feature Boost Residual (FBR) block. (c) Components of the Feature Fusion and Separation (FFS) block. The circled c symbol represents the concat operation.

In this section, we first introduce the proposed DDNet, which contains a dual-path decoder structure and a multi-scale supervision structure for the boundary path. Then, we propose using a pair of orthogonal vectors to represent the occlusion orientation in the network and introduce our OOR loss. Finally, we describe the composition of our overall loss, including the occlusion boundary loss and the occlusion orientation loss.

3.1 Dual-path Decoder Network

In the two tasks of occlusion relationship reasoning, occlusion boundary extraction determines where the occlusion occurs and occlusion orientation inference recovers the occlusion relationship. Both tasks are partial expressions of occlusion relationship reasoning, but they are very different tasks for the network. Therefore, it is challenging to predict both simultaneously by sharing features in one network. In particular, since the prediction of the occlusion orientation is valid only when the occlusion position is correctly determined, the accuracy of the boundary position is the basis of the whole occlusion relationship reasoning.

In CNN-based methods of occlusion relationship reasoning, the backbone network is used to extract features at different levels and is divided into multiple stages according to the size of the feature maps. These features are then decoded to extract occlusion boundaries and infer occlusion orientations. At the higher stages, the network carries strong occlusion information due to its large receptive field. Because occlusion information guides both the inference of the occlusion orientation and the localization of the object boundary, it is consistent for the two tasks and can be shared well by both. At the lower stages, the network encodes finer spatial information, which is essential for restoring spatial positions. Compared with occlusion orientation inference, boundary pixel classification requires more and stronger spatial information, so recovering spatial information jointly for the two tasks would seriously interfere with the boundary pixel classification results. Therefore, recovering spatial locations for the two tasks needs to be handled separately. Overall, the two tasks should share the occlusion information that the higher stages contain, while they should be separated at the lower stages to recover spatial information independently. Based on this observation, we propose the DDNet, which shares the occlusion information of the higher stages and recovers the spatial locations for the two tasks separately at the lower stages.

In our network, as shown in Fig. 2, we use ResNet [6] as the base feature extraction model as in previous works [18, 11, 10]. According to the size of the feature maps, this model is divided into five stages (see Fig. 2, Encoder) and has five corresponding sets of side-layer features. Side-layer features at different stages contain different information. In the dual-path decoder, we first use the occlusion-shared module (shown in Fig. 3) to uniformly extract occlusion information from the higher-stage features. At the lower stages, we separate the decoder into the boundary path and the occlusion orientation path, which use spatial information to recover resolution for the two tasks respectively. The two decoder paths at the lower stages use similar structures, which combine the lower-stage features to recover the resolution gradually. The difference is that, because the occlusion orientation is not sensitive to precise spatial information, we use skip connections in the orientation path.

Furthermore, to make full use of multi-scale features and spatial information, we design a multi-scale supervision method for the boundary path. Different from the encoder-side supervision widely used for edge detection [20, 22, 9, 5], we output and supervise the side-layer results of the decoder, so that the lower stages are guided by occlusion information and thereby recover more accurate boundary results. As shown in Fig. 2 (c), the Feature Fusion and Separation module combines the side-layer features of the encoder stage and outputs a set of decoder side-layer features in the boundary path. Finally, we use a 1x1 convolution layer to fuse the multiple results.
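To make the share-then-split structure concrete, the following is a minimal PyTorch-style sketch of how such a dual-path decoder could be wired. The toy five-stage encoder, channel widths, and the choice of sharing the two highest stages are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(cin, cout, k=3):
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class DualPathDecoderSketch(nn.Module):
    def __init__(self, widths=(16, 32, 64, 128, 256)):
        super().__init__()
        # toy 5-stage encoder standing in for the ResNet backbone
        self.stages = nn.ModuleList(
            [conv_bn_relu(3 if i == 0 else widths[i - 1], widths[i]) for i in range(5)])
        # occlusion-shared decoding of the two highest stages
        self.osm = conv_bn_relu(widths[4] + widths[3], 64)
        # boundary path: fuse lower-stage features, one side output per step
        self.b_fuse = nn.ModuleList([conv_bn_relu(64 + widths[i], 64) for i in (2, 1, 0)])
        self.b_side = nn.ModuleList([nn.Conv2d(64, 1, 1) for _ in range(3)])
        self.b_out = nn.Conv2d(3, 1, 1)  # fuses the side outputs
        # orientation path: same decoder shape, two-channel orthogonal output
        self.o_fuse = nn.ModuleList([conv_bn_relu(64 + widths[i], 64) for i in (2, 1, 0)])
        self.o_out = nn.Conv2d(64, 2, 1)

    def forward(self, x):
        feats, h = [], x
        for i, stage in enumerate(self.stages):
            h = stage(F.max_pool2d(h, 2) if i > 0 else h)
            feats.append(h)
        up = F.interpolate(feats[4], size=feats[3].shape[-2:], mode='bilinear',
                           align_corners=False)
        shared = self.osm(torch.cat([up, feats[3]], dim=1))  # shared occlusion cues
        # boundary path with multi-scale side outputs
        b, sides = shared, []
        for fuse, side, idx in zip(self.b_fuse, self.b_side, (2, 1, 0)):
            b = fuse(torch.cat([F.interpolate(b, size=feats[idx].shape[-2:],
                                              mode='bilinear', align_corners=False),
                                feats[idx]], dim=1))
            sides.append(side(b))
        sides = [F.interpolate(s, size=x.shape[-2:], mode='bilinear',
                               align_corners=False) for s in sides]
        boundary = self.b_out(torch.cat(sides, dim=1))
        # orientation path, decoded separately from the same shared features
        o = shared
        for fuse, idx in zip(self.o_fuse, (2, 1, 0)):
            o = fuse(torch.cat([F.interpolate(o, size=feats[idx].shape[-2:],
                                              mode='bilinear', align_corners=False),
                                feats[idx]], dim=1))
        orientation = F.interpolate(self.o_out(o), size=x.shape[-2:],
                                    mode='bilinear', align_corners=False)
        return sides, boundary, orientation

The key point of the sketch is that the shared feature is computed once from the highest stages and consumed by both paths, while every lower-stage fusion step is duplicated per path.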

Occlusion-shared module: The higher-stage features are aggregated by the Occlusion Shared Module and passed to the next stage. As shown in Fig. 3, we use a convolutional layer to extract the top-level features and increase the feature resolution by upsampling. The intermediate feature maps of each stage in the feature network all go through the Feature Boost Residual block. Then, the feature map of the up path and the intermediate feature map are fused by a concat operation and a convolution layer.

Figure 3: Occlusion-shared Module
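A minimal sketch of such an occlusion-shared module, assuming the boost block is approximated by a single convolution and the fusion is a concatenation followed by a 3x3 convolution; channel sizes are placeholders rather than the paper's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class OcclusionSharedModule(nn.Module):
    def __init__(self, top_channels, mid_channels, out_channels=64):
        super().__init__()
        # convolution on the top-level features before upsampling
        self.top_conv = nn.Sequential(
            nn.Conv2d(top_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))
        # stand-in for the Feature Boost Residual block described below
        self.boost = nn.Sequential(
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True))
        # concat + convolution fuses the upsampled and intermediate features
        self.fuse = nn.Sequential(
            nn.Conv2d(out_channels + mid_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))

    def forward(self, top_feat, mid_feat):
        up = F.interpolate(self.top_conv(top_feat), size=mid_feat.shape[-2:],
                           mode='bilinear', align_corners=False)
        return self.fuse(torch.cat([up, self.boost(mid_feat)], dim=1))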

Feature Boost Residual Block: The feature maps in the encoder network go through the Feature Boost Residual Block. As shown in Fig. 2 (b), the first component of the block is a residual block, which boosts the feature map. It is followed by a 1x1 convolution layer, which reduces the number of channels by half. This block strengthens the recognition ability of each stage, benefiting from the architecture of ResNet [7].
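One possible realization of the FBR block under this description; the exact layer configuration inside the residual block is an assumption.

import torch.nn as nn

class FBR(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # residual "boost": two 3x3 convolutions with an identity shortcut
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)
        self.reduce = nn.Conv2d(channels, channels // 2, 1)  # halve the channels

    def forward(self, x):
        x = self.relu(x + self.residual(x))
        return self.reduce(x)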

Feature Fusion and Separation Block: We use the Feature Fusion and Separation module to connect the stages of the decoder. First, two 3x3 convolution layers are used to fuse the features. Then two 1x1 convolutions output features upward and to the right: one for the next stage of the decoder and one for the side output. Both 1x1 convolutions reduce the number of channels accordingly. As shown in Fig. 2 (c), FFS-x indicates that the input/output corresponding to x is missing; we use the variants FFS-b, FFS-c, FFS-d and FFS-bd.
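A corresponding sketch of the FFS block, assuming the two 1x1 convolutions simply separate the fused feature into a next-stage branch and a side-output branch; channel counts are illustrative, and the FFS-x variants would drop the corresponding input or output.

import torch
import torch.nn as nn

class FFS(nn.Module):
    def __init__(self, in_channels, next_channels, side_channels=1):
        super().__init__()
        # two 3x3 convolutions fuse the concatenated decoder/encoder features
        self.fuse = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, in_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True))
        self.to_next = nn.Conv2d(in_channels, next_channels, 1)  # "up" output
        self.to_side = nn.Conv2d(in_channels, side_channels, 1)  # side output

    def forward(self, decoder_feat, encoder_feat):
        x = self.fuse(torch.cat([decoder_feat, encoder_feat], dim=1))
        return self.to_next(x), self.to_side(x)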

3.2 Orthogonal Orientation Regression

The goal of occlusion orientation regression is to infer the foreground/background on both sides of the occlusion boundary. However, due to the periodicity of the occlusion orientation itself, recovering the foreground/background relationship from a continuous orientation value is difficult for the network.

Figure 4: Left: the occlusion orientation represented by an orientation value. Right: the orientation decomposed into a horizontal and a vertical orthogonal vector.

We define a new orthogonal occlusion orientation representation, which decomposes the orientation orthogonally into a horizontal vector and a vertical vector, as shown in Fig. 4. The orientation θ can then be represented by the pair of orthogonal components

(cos θ, sin θ)    (1)

Our representation not only conveys the occlusion orientation smoothly and completely, but also avoids the abnormal occlusion orientation loss near the two ends of the orientation interval.
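As a concrete illustration, assuming the horizontal and vertical components are the cosine and sine of the orientation angle, the conversion in both directions is straightforward:

import torch

def orientation_to_vectors(theta):
    # decompose an orientation angle (radians) into orthogonal components
    return torch.cos(theta), torch.sin(theta)

def vectors_to_orientation(v_h, v_v):
    # recover the orientation from the predicted component pair;
    # atan2 handles the wrap-around, so the interval endpoints need no special care
    return torch.atan2(v_v, v_h)

Because atan2 absorbs the periodicity, an orientation near one end of the interval and the equivalent angle near the other end map to almost identical component pairs, which is exactly what a continuous regression target needs.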

Based on our representation, we propose a novel loss called the Orthogonal Orientation Regression (OOR) loss, which is defined as:

(2)

in which:

(3)
(4)

where the two components are the values predicted by the orientation path of the network.

Compared with the former orientation regression methods [19, 18], our OOR loss has the following advantages: 1) it fundamentally addresses the periodicity problem of the occlusion orientation representation and avoids the harmful effect of improper and inaccurate loss computation; 2) unlike directly regressing the orientation value, it lets the network learn the orthogonal vector pair and their relation, from which the occlusion orientation is recovered indirectly.

3.3 Network Training

In CNN-based methods of occlusion relationship reasoning, object boundary extraction and occlusion orientation inference are trained simultaneously. Thus the overall loss is formulated as:

L = L_B + λ · L_O    (5)

where λ is the weight for the orientation loss, and L_B and L_O are the losses of the boundary part and the orientation part, respectively.

For an input image, the ground truth is represented as a pair consisting of an object boundary map and an occlusion orientation map, where the boundary label of each pixel is 0 or 1 and the orientation label is valid only at boundary pixels. In the boundary path, to make full use of multi-scale features and spatial information, each FFS block is connected to a specific side supervision. Besides that, we fuse the intermediate boundary predictions with a fusion layer to obtain the final result. We formulate the loss of the boundary part as:

(6)
(7)

where separate weights are applied to the side losses and the fusion loss, and the predicted boundary maps are the side outputs and the fused output.

The boundary loss is computed at each pixel with respect to its boundary annotation. Because the distribution of boundary/non-boundary pixels is heavily biased, we employ a class-balanced cross-entropy loss, treating the non-boundary and boundary ground-truth labels in a batch of images as the negative and positive sample sets, respectively. The loss is defined as:

(8)

where one coefficient balances the boundary/non-boundary pixels and another controls the weight of positive over negative samples.

The orientation loss is calculated only on the boundary pixels and is defined as:

(9)
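The sketch below shows one plausible way to assemble the training objective described in this section: a class-balanced binary cross-entropy on each boundary output plus an L1-style regression of the orthogonal orientation components evaluated only at ground-truth boundary pixels. The balancing coefficients, the L1 form of the orientation term, and the weighting scheme are illustrative assumptions rather than the paper's exact formulas.

import torch
import torch.nn.functional as F

def class_balanced_bce(pred_logits, gt_boundary, gamma=1.0):
    # gt_boundary is a float {0,1} map; each class is weighted by the
    # frequency of the other class, gamma scales positives over negatives
    pos = gt_boundary.sum()
    neg = gt_boundary.numel() - pos
    weights = torch.where(gt_boundary > 0.5,
                          gamma * neg / (pos + neg), pos / (pos + neg))
    return F.binary_cross_entropy_with_logits(pred_logits, gt_boundary, weight=weights)

def orientation_loss(pred_vec, gt_theta, gt_boundary):
    # OOR-style term (assumed L1 form): regress (cos, sin) of the ground-truth
    # orientation, counted only at ground-truth boundary pixels
    target = torch.cat([torch.cos(gt_theta), torch.sin(gt_theta)], dim=1)
    mask = (gt_boundary > 0.5).float()
    per_pixel = torch.abs(pred_vec - target).sum(dim=1, keepdim=True)
    return (per_pixel * mask).sum() / mask.sum().clamp(min=1.0)

def total_loss(side_logits, fused_logits, pred_vec, gt_boundary, gt_theta,
               side_w=1.0, fuse_w=1.0, orient_w=0.5):
    # overall objective: weighted side + fused boundary losses plus the
    # orientation term; the weights here are placeholders
    l_b = side_w * sum(class_balanced_bce(s, gt_boundary) for s in side_logits)
    l_b = l_b + fuse_w * class_balanced_bce(fused_logits, gt_boundary)
    l_o = orientation_loss(pred_vec, gt_theta, gt_boundary)
    return l_b + orient_w * l_o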

4 Experiment

4.1 Dataset and Implementation Details

We evaluate the proposed approach on two public datasets: PIOD [19] and BSDS ownership dataset [13].

PIOD dataset: It contains 9,175 images for training and 925 images for testing. Each image is annotated with a ground-truth object instance boundary map and its corresponding orientation map.

BSDS ownership dataset: It includes 100 training images and 100 testing images of natural scenes, each annotated with a ground-truth boundary map and its corresponding orientation map.

Our network is implemented in PyTorch using ResNet-50 pretrained on ImageNet as the backbone. The AdamW optimizer is adopted to train the network, with an initial learning rate of 3e-6. Training images are randomly cropped to 320x320 and formed into batches of size 8. For the PIOD dataset, we set the hyper-parameters to 0.5, 1.1, 1.1, 2.1 and 1.7, respectively, and train for 40k iterations. For the BSDS ownership dataset, one hyper-parameter is changed to 1.1 and the network is trained for 20k iterations; the other hyper-parameters are the same as for PIOD.
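A rough training-loop sketch matching the reported setup (AdamW, learning rate 3e-6, 320x320 crops in batches of eight, 40k or 20k iterations); the model, data loader, and loss function are the hypothetical components sketched earlier in this section rather than the authors' code.

import torch

def train(model, train_loader, loss_fn, device='cuda', lr=3e-6, max_iters=40_000):
    # train_loader is assumed to yield (images, gt_boundary, gt_theta) batches
    # of eight random 320x320 crops
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    it = 0
    while it < max_iters:
        for images, gt_boundary, gt_theta in train_loader:
            side_logits, fused_logits, pred_vec = model(images.to(device))
            loss = loss_fn(side_logits, fused_logits, pred_vec,
                           gt_boundary.to(device), gt_theta.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            it += 1
            if it >= max_iters:
                break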

For the evaluation metrics [18, 10], we use the boundary precision-recall (BPR) and the occlusion orientation precision-recall (OPR) with three measures: the fixed contour threshold (ODS), the best threshold per image (OIS), and the average precision (AP). Note that non-maximum suppression is applied to the final boundary map during evaluation, and the OPR is calculated only at the correctly detected boundary pixels.
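As a rough illustration of the last point, the per-image accounting for OPR might look like the snippet below; matching predicted to ground-truth boundary pixels is simplified here to exact pixel agreement with a pi/2 orientation tolerance, whereas the standard benchmark uses a tolerance-based boundary correspondence.

import math
import torch

def opr_counts(pred_boundary, pred_theta, gt_boundary, gt_theta, thresh=0.5):
    # orientation is scored only where a boundary pixel is correctly detected
    detected = (pred_boundary > thresh) & (gt_boundary > 0.5)
    # wrap the angular difference into (-pi, pi] before comparing
    diff = torch.atan2(torch.sin(pred_theta - gt_theta),
                       torch.cos(pred_theta - gt_theta)).abs()
    correct = detected & (diff < math.pi / 2)
    return correct.sum().item(), detected.sum().item()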

4.2 Comparison with Other Works

We compare our approach with recent methods, including SRF-OCC [16], DOC-HED [19], DOC-DMLFOV [19], DOOBNet [18] and OFNet [10].

Methods                        BPR: ODS / OIS / AP      OPR: ODS / OIS / AP
SRF-OCC [16]                   .345 / .369 / .207       .268 / .286 / .152
DOC-HED [19]                   .509 / .532 / .468       .460 / .479 / .405
DOC-DMLFOV [19]                .669 / .684 / .677       .601 / .611 / .585
DOOBNet [18]                   .736 / .746 / .723       .702 / .712 / .683
OFNet [10] (boundary only)     .739 / .750 / .685       -
OFNet [10] (orientation only)  -                        .705 / .716 / .674
OFNet [10]                     .751 / .762 / .770       .718 / .728 / .729
Ours (boundary only)           .786 / .796 / .795       -
Ours (orientation only)        -                        .761 / .770 / .761
Ours                           .790 / .800 / .813       .766 / .776 / .779
Table 1: Comparison with recent works on PIOD. "Boundary only" and "orientation only" indicate models trained on a single task.
Figure 5: Precision-recall curves of our method and compared works on PIOD: (a) BPR, (b) OPR.

Performance on PIOD: The precision-recall curve of the boundary is shown in Fig. 5 (a). Even when the boundary path is trained alone, our method outperforms the other methods (higher PR curve), profiting from the multi-scale supervision, which provides more spatial information to guide the boundary extraction. When the occlusion orientation regression is trained alone, our model also outperforms all the other methods, which shows that our orthogonal orientation regression predicts the occlusion orientation well.

Performance on BSDS ownership: The BSDS ownership dataset is difficult to train on due to the small number of training samples. Nonetheless, our method still obtains better performance than the other models in both BPR and OPR, as shown in Fig. 6. Specifically, as shown in Table 2, the OPR of our model is about 5% higher than the previous state of the art, benefiting from our OOR loss, which fits the occlusion orientation reasonably. Note that this loss also promotes the precision of the boundary map and makes our BPR higher than that of the other models.

Methods            BPR: ODS / OIS / AP      OPR: ODS / OIS / AP
SRF-OCC [16]       .511 / .544 / .442       .419 / .448 / .337
DOC-HED [19]       .658 / .685 / .602       .522 / .545 / .428
DOC-DMLFOV [19]    .579 / .609 / .519       .463 / .491 / .369
DOOBNet [18]       .647 / .668 / .539       .555 / .570 / .440
OFNet [10]         .662 / .689 / .585       .583 / .607 / .501
Ours               .678 / .704 / .604       .632 / .658 / .549
Table 2: Comparison with recent works on BSDS ownership.
Figure 6: Precision-recall curves of our method and compared works on BSDS ownership: (a) BPR, (b) OPR.

Finally, we visualize the boundary maps and occlusion relationship maps of our model and of others, including DOOBNet [18] and OFNet [10], in Fig. 7. Our method extracts more complete and clearer boundaries on multiple images, especially the first and fourth. In complex and hard-to-distinguish scenes, such as the second and third images, we are still ahead of the other two methods, because our network has stronger learning and generalization ability. In the last image, although the boundary precision is similar, our model can still discern the occlusion orientation of the legs. In summary, our method outperforms the others by a large margin in both boundary and occlusion orientation, which proves the effectiveness of our network.

Figure 7: Example results on PIOD (first three rows) and the BSDS ownership dataset (last two rows). 1st column: input images; 2nd column: visualization results and boundary map of the ground truth. The next columns show the visualization results and boundary maps of DOOBNet, OFNet and DDNet. Notably, "red" pixels with arrows: correctly labeled occlusion boundaries; "cyan": correctly labeled boundaries but mislabeled occlusion; "green": false negative boundaries; "orange": false positive boundaries (best viewed in color).

4.3 Ablation Study

In this section, we conduct experiments on the PIOD dataset to study the impact of parameters and to verify each component of our network.

Comparison of the number of shared stages: We compare sharing the top-1, top-2, top-3, and top-4 stages. As shown in Fig. 2, as the number of stages included in the occlusion-shared module changes, the lengths of the boundary path and the occlusion orientation path change accordingly. As shown in Table 3, the best results appear when sharing the top-2 stages, which is also the structure we use. When the sharing extends to the lower stages, the results begin to deteriorate. This confirms that the two tasks should share the occlusion information that the higher stages contain, while they should be separated at the lower stages to recover spatial information independently.

Shared stages      BPR: ODS / OIS / AP      OPR: ODS / OIS / AP
top-1              .787 / .798 / .808       .762 / .771 / .771
top-2              .790 / .800 / .813       .766 / .776 / .779
top-3              .776 / .786 / .806       .754 / .763 / .775
top-4              .770 / .784 / .797       .748 / .760 / .764
Table 3: Comparison of the number of shared stages on PIOD.

Comparison of orientation supervision methods: We change the output of our occlusion orientation path so that it can use the previous method of regressing a continuous orientation value, and compare predicting a continuous orientation value with predicting orthogonal vectors on our network structure. As shown in Table 4, inferring the occlusion orientation by predicting orthogonal vectors leads to a large increase in both the boundary results and the occlusion inference results, which proves the rationality of the orthogonal orientation representation and the OOR loss.

Methods                               BPR: ODS / OIS / AP      OPR: ODS / OIS / AP
DDNet (continuous orientation value)  .778 / .790 / .806       .747 / .758 / .759
DDNet (orthogonal vectors)            .790 / .800 / .813       .766 / .776 / .779
Table 4: Comparison of orientation supervision methods on PIOD.

Necessity of multi-scale supervision: As shown in Fig. 2, the boundary (green) path can output five side results through the FFS blocks. We compare supervising only the lowest side output (no multi-scale supervision) with supervising the lowest two, three, four, and all five side outputs. As shown in Table 5, the performance when supervising only the lowest side output is significantly lower than in the other settings; simply adding side supervision to the decoder already improves the performance greatly. This result demonstrates the importance of multi-scale supervision and multi-result fusion.

Supervised sides   BPR: ODS / OIS / AP      OPR: ODS / OIS / AP
1                  .754 / .765 / .774       .731 / .740 / .741
2                  .786 / .795 / .826       .763 / .764 / .789
3                  .781 / .798 / .820       .764 / .773 / .786
4                  .786 / .795 / .811       .761 / .769 / .776
5                  .790 / .802 / .813       .766 / .776 / .780
Table 5: Validation of the necessity of multi-scale supervision by varying the number of supervised side outputs on PIOD.

5 Conclusions

This paper proposes the DDNet for occlusion relationship reasoning, which uses a dual-path decoder structure to decode the high-stage features jointly and the low-stage features separately for occlusion boundary extraction and occlusion orientation inference. A multi-scale supervision structure is designed for the boundary path, which provides more and stronger spatial information for boundary extraction. To better predict the occlusion orientation in the network, we use an orthogonal orientation vector to represent the occlusion relationship and propose the corresponding OOR loss.

References

  • [1] Alper Ayvaci and Stefano Soatto. Detachable object detection: Segmentation and depth ordering from short-baseline video. IEEE transactions on pattern analysis and machine intelligence, 34(10):1942–1951, 2011.
  • [2] Ruoxi Deng, Chunhua Shen, Shengjun Liu, Huibing Wang, and Xinru Liu. Learning to predict crisp boundaries. In Proceedings of the European Conference on Computer Vision (ECCV), pages 562–578, 2018.
  • [3] Doron Feldman and Daphna Weinshall. Motion segmentation and depth ordering using an occlusion detector. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(7):1171–1185, 2008.
  • [4] Tianshi Gao, Benjamin Packer, and Daphne Koller. A segmentation-aware object detection model with occlusion handling. In CVPR 2011, pages 1361–1368. IEEE, 2011.
  • [5] Jianzhong He, Shiliang Zhang, Ming Yang, Yanhu Shan, and Tiejun Huang. Bi-directional cascade network for perceptual edge detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3828–3837, 2019.
  • [6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [8] Derek Hoiem, Alexei A Efros, and Martial Hebert. Recovering occlusion boundaries from an image. International Journal of Computer Vision, 91(3):328–346, 2011.
  • [9] Yun Liu, Ming-Ming Cheng, Xiaowei Hu, Kai Wang, and Xiang Bai. Richer convolutional features for edge detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3000–3009, 2017.
  • [10] Rui Lu, Feng Xue, Menghan Zhou, Anlong Ming, and Yu Zhou. Occlusion-shared and feature-separated network for occlusion relationship reasoning. arXiv preprint arXiv:1908.05898, 2019.
  • [11] Rui Lu, Menghan Zhou, Anlong Ming, and Yu Zhou. Context-constrained accurate contour extraction for occlusion edge detection. 2019 IEEE International Conference on Multimedia and Expo (ICME), pages 1522–1527, 2019.
  • [12] Xiaofeng Ren, Charless C Fowlkes, and Jitendra Malik. Figure/ground assignment in natural images. In European Conference on Computer Vision, pages 614–627. Springer, 2006.
  • [13] Xiaofeng Ren, Charless C Fowlkes, and Jitendra Malik. Figure/ground assignment in natural images. In European Conference on Computer Vision, pages 614–627. Springer, 2006.
  • [14] Ashutosh Saxena, Sung H Chung, and Andrew Y Ng. Learning depth from single monocular images. In Advances in neural information processing systems, pages 1161–1168, 2006.
  • [15] Qi Shan, Brian Curless, Yasutaka Furukawa, Carlos Hernandez, and Steven M Seitz. Occluding contours for multi-view stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4002–4009, 2014.
  • [16] Ching Teo, Cornelia Fermuller, and Yiannis Aloimonos. Fast 2d border ownership assignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5117–5125, 2015.
  • [17] Joseph Tighe, Marc Niethammer, and Svetlana Lazebnik. Scene parsing with object instances and occlusion ordering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3748–3755, 2014.
  • [18] Guoxia Wang, Xiaochuan Wang, Frederick WB Li, and Xiaohui Liang. Doobnet: Deep object occlusion boundary detection from an image. In Asian Conference on Computer Vision, pages 686–702. Springer, 2018.
  • [19] Peng Wang and Alan Yuille. Doc: Deep occlusion estimation from a single image. In European Conference on Computer Vision, pages 545–561. Springer, 2016.
  • [20] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In Proceedings of the IEEE international conference on computer vision, pages 1395–1403, 2015.
  • [21] Jimei Yang, Brian Price, Scott Cohen, Honglak Lee, and Ming-Hsuan Yang. Object contour detection with a fully convolutional encoder-decoder network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 193–202, 2016.
  • [22] Yupei Wang, Xin Zhao, Yin Li, and Kaiqi Huang. Deep crisp boundaries: From boundaries to higher-level tasks. TIP, 2018.
  • [23] Ziyu Zhang, Alexander G Schwing, Sanja Fidler, and Raquel Urtasun. Monocular object instance segmentation and depth ordering with cnns. In Proceedings of the IEEE International Conference on Computer Vision, pages 2614–2622, 2015.