Occlusion-shared and Feature-separated Network for Occlusion Relationship Reasoning

08/16/2019 · by Rui Lu, et al. · Huazhong University of Science & Technology · Lenovo

Occlusion relationship reasoning demands a closed contour to express the object, and an orientation at each contour pixel to describe the order relationship between objects. Current CNN-based methods neglect two critical issues of the task: (1) the simultaneous existence of relevance and distinction between the two elements, i.e., occlusion edge and occlusion orientation; and (2) inadequate exploration of the orientation features. For the reasons above, we propose the Occlusion-shared and Feature-separated Network (OFNet). On one hand, considering the relevance between edge and orientation, two sub-networks are designed to share the occlusion cue. On the other hand, the whole network is split into two paths to learn the high-level semantic features separately. Moreover, a contextual feature for orientation prediction is extracted, which represents the bilateral cue of the foreground and background areas. The bilateral cue is then fused with the occlusion cue to precisely locate the object regions. Finally, a stripe convolution is designed to further aggregate features from the surrounding scenes of the occlusion edge. The proposed OFNet remarkably advances the state-of-the-art approaches on the PIOD and BSDS ownership datasets. The source code is available at https://github.com/buptlr/OFNet.


1 Introduction

Reasoning the occlusion relationship of objects from a monocular image is fundamental in computer vision and mobile robot applications [11, 2, 24, 17, 29]. Furthermore, it can be regarded as a crucial element for scene understanding and visual perception [40, 42, 43, 39, 18], such as object detection, image segmentation and 3D reconstruction [6, 1, 37, 7, 26, 34]. From the perspective of the observer, the occlusion relationship reflects the relative depth difference between objects in the scene.

Figure 1: (a) visualization result of DOC-HED, (b) visualization result of DOOBNet, (c) visualization result of ours, (d) the occlusion cue, (e) the bilateral feature, (f) visualization result of ground truth. Occlusion relationship (the red arrows) is represented by orientation (tangent direction of the edge), using the "left" rule where the left side of the arrow means foreground area. Notably, "red" pixels with arrows: correctly labeled occlusion boundaries; "cyan": correctly labeled boundaries but mislabeled occlusion; "green": false negative boundaries; "orange": false positive boundaries (best viewed in color).
Figure 2: The schematic demonstration of the high-level feature propagation process of the state-of-the-art methods and ours. (a) indicates the two-separate-stream networks employing side-outputs of various layers. (b) presents the single-stream network sharing decoder features. (c) shows our network, which captures contextual features for specific tasks and shares decoder features.

Previously, a number of influential studies inferred the occlusion relationship by designing hand-crafted features, e.g. [5, 22, 25, 16, 38, 41]. Recently, driven by Convolutional Neural Networks (CNNs), several deep learning based approaches have outperformed traditional methods by a large margin. DOC [21] specifies a new representation for occlusion relationship, which decomposes the task into occlusion edge classification and occlusion orientation regression, and utilizes two networks for these two sub-tasks, respectively. DOOBNet [31] employs an encoder-decoder structure to obtain multi-scale and multi-level features. It shares backbone features with two sub-networks and acquires the two predictions simultaneously.

In occlusion relationship reasoning, the closed contour is employed to express the object, and the orientation values of the contour pixels are employed to describe the order relationship between the foreground and background objects. We observe that two critical issues have rarely been discussed. Firstly, the two elements, i.e., occlusion edge and occlusion orientation, have relevance and distinction simultaneously. They both need the occlusion cue, which describes the location of the occluded background, as shown in Fig.1 (d). Secondly, the high-level features for orientation prediction are not fully exploited; orientation needs additional cues from the foreground and background areas (shown in Fig.1 (e)). Consequently, existing methods are limited in reasoning accuracy. Compared with our approach (shown in Fig.1 (c)), previous works [21, 31] (shown in Fig.1 (a)(b)) suffer from false positive and false negative edge detections, as well as false positive orientation predictions.

Aiming to address the two issues above and boost occlusion relationship reasoning, a novel Occlusion-shared and Feature-separated Network (OFNet) is proposed. As shown in Fig.2 (c), considering the relevance and distinction between edge and orientation, our network is different from the other works (shown in Fig.2 (a)(b)). Two separate network paths share the occlusion cue and encode different high-level features. Furthermore, a contextual feature for orientation prediction is extracted, which is called the bilateral feature. To learn the bilateral feature, a Multi-rate Context Learner (MCL) is proposed. The learner has different scales of receptive field so that it can fully sense the two objects, i.e., the foreground and background objects, fundamentally assisting the occlusion relationship reasoning. To extract the feature more accurately, the Bilateral Response Fusion (BRF) is proposed to fuse the occlusion cue with the bilateral feature from the MCL, which can precisely locate the areas of foreground and background. To effectively infer the occlusion relationship from the specific orientation features, a stripe convolution is designed to replace the traditional plain convolution, which elaborately integrates the bilateral feature to distinguish the foreground and background areas. Experiments prove that we achieve state-of-the-art performance on both the PIOD [21] and BSDS ownership [22] datasets.

The main contributions of our approach lie in:

  • The relevance and distinction between occlusion edge and occlusion orientation are re-interpreted. The two sub-tasks share the occlusion cues, but separate the contextual features.

  • The bilateral feature is proposed, and two particular modules are designed to obtain the specific features, i.e., the Multi-rate Context Learner (MCL) and the Bilateral Response Fusion (BRF).

  • To elaborately infer the occlusion relationship, a stripe convolution is designed to further aggregate the feature from surrounding scenes of the contour.

2 Related Work

Contextual Learning plays an important role in scene understanding and perception [4, 32]. At first, Mostajabi et al. [19] utilize multi-level, zoom-out features to promote feedforward semantic labeling of superpixels. Meanwhile, Liu et al. [13] propose a simple FCN architecture that adds global context for semantic segmentation. Afterwards, Chen et al. [3] apply the Atrous Spatial Pyramid Pooling to extract dense features and encode image context at multiple scales.

Multi-level Features are extracted from different layers and are widely used in image detection [14, 23, 27, 33]. Peng et al. [20] fuse feature maps from multiple layers with refined details. Shrivastava et al. [28] adopt lateral connections to leverage top-down context and bottom-up details.

Occlusion Relationship Representation has evolved over time from the triple-point and junction representation [10, 22] to the pixel-based representation [30, 21]. The latest representation [21] applies a binary edge classifier to determine whether a pixel belongs to an occlusion edge, and a continuous-valued orientation variable to indicate the occlusion relationship by the left-hand rule [21].

Figure 3: Illustration of our proposed network architecture. The length of the block expresses the map resolution and the thickness of the block indicates the channel number.

3 OFNet

The two elements of occlusion relationship reasoning, i.e., edge and orientation, both require the occlusion cue but differ in the specific contextual features they utilize. In this section, a novel Occlusion-shared and Feature-separated Network (OFNet) is proposed. Fig.3 illustrates the pipeline of the proposed OFNet, which consists of a single-stream backbone and two parallel paths, i.e., the edge path and the orientation path.

Specifically, for the edge path (see Sec.3.1), a structure similar to [15] is employed to extract a consistent and accurate occlusion edge, which is fundamental for occlusion reasoning. For the orientation path (see Sec.3.2), to learn richer cues near the boundary for occlusion reasoning, the high-level bilateral feature is obtained, and a Multi-rate Context Learner (MCL) is proposed to extract this feature (see Sec.3.2.1). To enable the learner to locate the foreground and background areas precisely, a Bilateral Response Fusion (BRF) module is proposed to fuse the bilateral feature and the occlusion cue (see Sec.3.2.2). Furthermore, a stripe convolution is proposed to infer the occlusion relationship elaborately (see Sec.3.2.3).

3.1 Edge Path

The occlusion edge expresses the position of objects and defines the boundary location between the bilateral regions. It requires the preserved resolution of the original image to provide accurate localization, and a large receptive field to perceive the mutual constraints of pixels on the boundary.

We adopt the module proposed in [15], which has a high capability to capture the accurate location cue and a sensitive perception of the entire object. In [15], the low-level cue from the first three side-outputs preserves the original size of the input image and encodes abundant spatial information. Without losing resolution, a large receptive field is achieved via dilated convolution [36] after ResNet-50 [9]. The Bilateral Response Fusion (BRF) shown in Fig.3 is presented to compensate high-level features with precise position information and to suppress the clutter of non-occluded pixels in low-level features. Different from [15], we employ an additional convolution block to refine the contour and integrate the task-specific features provided by diverse channels. Besides, this well-designed convolution block eliminates the gridding artifacts [35] caused by the dilated convolution in high-level layers.

The resulting edge map embodies both low-level and high-level features, which guarantees the consistency and accuracy of the occlusion edge. Specifically, the edge path provides a complete and continuous contour that delineates the object region.
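The fusion-and-refine step of the edge path can be summarized in a brief sketch. The paper's implementation is in Caffe; the PyTorch-style snippet below is only illustrative, with the class name EdgeHead and the channel sizes chosen as assumptions rather than taken from the paper, while the overall structure (low/high-level fusion followed by an extra refining conv block) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeHead(nn.Module):
    """Illustrative sketch of the edge path head: fuse the full-resolution low-level
    cue (first three side-outputs) with the upsampled high-level occlusion cue from
    the dilated backbone/decoder, then refine with an extra conv block that also
    mitigates gridding artifacts [35]. Channel sizes are assumptions."""
    def __init__(self, low_ch=16, high_ch=16, mid_ch=16):
        super().__init__()
        self.fuse = nn.Conv2d(low_ch + high_ch, mid_ch, 3, padding=1)   # BRF-style fusion
        self.refine = nn.Sequential(                                    # extra conv block
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, 1, 1))                                    # edge logits

    def forward(self, low_feat, high_feat):
        # Bring the coarse high-level cue back to the full image resolution.
        high_feat = F.interpolate(high_feat, size=low_feat.shape[2:],
                                  mode='bilinear', align_corners=False)
        return self.refine(self.fuse(torch.cat([low_feat, high_feat], dim=1)))
```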

3.2 Orientation Path

For the orientation path, we innovatively introduce the bilateral feature, which is conducive to describing the order relationship. Specifically, the bilateral feature represents information of the surrounding scenes, which includes sufficient ambient context to deduce whether a region belongs to the foreground or the background area.

3.2.1 Multi-rate Context Learner

The bilateral feature characterizes the relationship between the foreground and background areas. To infer the occlusion relationship between objects, a receptive field sufficient for objects of different sizes is essential.
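As a back-of-the-envelope check (standard dilated-convolution arithmetic, not stated in the paper), the effective extent of a $k \times k$ kernel with dilation rate $r$ is

$k_{\mathrm{eff}} = k + (k-1)(r-1),$

so the 3×3 kernels with rates 6, 12 and 18 used below span 13, 25 and 37 positions on the high-level feature map, and correspondingly larger regions in image coordinates after the backbone's downsampling.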

To perceive the object at various ranges and learn the bilateral feature, the Multi-rate Context Learner (MCL) is designed, which consists of three components, as shown in Fig.4. Firstly, the high-level semantic cue is convolved by multiple dilated convolutions, which allows the pixels on the edge to perceive the foreground and background objects as completely as possible. The dilated convolutions have a kernel size of 3×3 with various dilation rates. With these different rates, the learner is able to perceive the scene cue at different scales from the foreground and background areas, which is beneficial for deducing which side of the region is in front. Secondly, an element-wise convolution module, i.e., a 1×1 conv, is used to integrate the scene cue between various channels and promote cross-channel bilateral feature aggregation at the same location. Compared to the dilated convolutions, the element-wise convolution module retains the local cues near the contour. Besides, it greatly clarifies the occlusion cue and the bilateral cue in occlusion reasoning. Let $D_r(\cdot)$ denote a dilated convolution with rate $r$, $C(\cdot)$ denote a 1×1 conv, $X$ be the input of the convolutions, and $W$ be the convolution layer parameters to be learned. Thirdly, a 1×1 conv is once again applied to normalize the values near the contour, where the bilateral cue is further enhanced and other irrelevant cues are suppressed. The MCL thus learns the cues of the foreground and background objects. The feature map of the bilateral cue, $F_{bi}$, is denoted as:

$F_{bi} = C\big(\big[\,D_{6}(X; W),\ D_{12}(X; W),\ D_{18}(X; W),\ C(X; W)\,\big]\big)$   (1)
Figure 4: Illustration of our proposed Multi-rate Context Learner (MCL). The MCL module includes 3 dilated convolutions with a kernel size of 3×3 and dilation rates of 6, 12 and 18, respectively.
Figure 5: Illustration of our proposed Bilateral Response Fusion (BRF).

Difference with ASPP: Notably, our MCL module is inspired by the Atrous Spatial Pyramid Pooling (ASPP) [3], but there exist several differences. Firstly, we add a parallel element-wise convolution module, which additionally gains local cues of the specific region. It compensates for the deficiency that dilated convolution is not sensitive to nearby information. Secondly, the convolution blocks after each branch remove the gridding artifacts [35] caused by the dilated convolution. Thirdly, the 1×1 conv can adjust channel numbers and explore the relevance between channels.
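To make the structure concrete, a minimal PyTorch-style sketch of such a learner is given below; the paper's implementation is in Caffe, and the class name, channel sizes and ReLU placement are assumptions for illustration, while the dilation rates (6, 12, 18), the parallel 1×1 branch, the post-branch convolutions and the final 1×1 fusion follow the description above.

```python
import torch
import torch.nn as nn

class MCL(nn.Module):
    """Illustrative sketch of the Multi-rate Context Learner (channel sizes assumed)."""
    def __init__(self, in_ch=2048, branch_ch=256, out_ch=64):
        super().__init__()
        # Three dilated 3x3 branches (rates 6, 12, 18) perceive foreground/background
        # context at different scales; each is followed by a plain 3x3 conv that
        # reduces gridding artifacts. The 1x1 branch keeps local cues near the contour.
        self.branches = nn.ModuleList(
            [nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, 3, padding=r, dilation=r), nn.ReLU(inplace=True),
                nn.Conv2d(branch_ch, branch_ch, 3, padding=1), nn.ReLU(inplace=True))
             for r in (6, 12, 18)]
            + [nn.Sequential(nn.Conv2d(in_ch, branch_ch, 1), nn.ReLU(inplace=True))])
        # Final 1x1 conv fuses the branches and suppresses irrelevant responses (Eq. (1)).
        self.fuse = nn.Conv2d(4 * branch_ch, out_ch, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```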

3.2.2 Bilateral Response Fusion

The bilateral cue obtained by the method in Sec.3.2.1 discriminates which side of the contour belongs to the foreground area, while the occlusion cue obtained through the decoder represents the location information of the boundary. As shown in Fig.6, after bilinear upsampling, the bilateral feature alone can hardly locate the exact position of the contour. Hence, to sufficiently learn the feature for occlusion relationship reasoning, a more precise location of the object region is demanded, which is provided by the occlusion cue from the decoder. Thus, it is necessary to introduce a clear contour to delimit the areas of the foreground and background objects, thereby extracting the object features more accurately.

The Bilateral Response Fusion (BRF), shown in Fig.5, is proposed to fuse these two disparate streams of features, i.e. the bilateral map $F_{bi}$ and the occlusion map $F_{oc}$. A unified orientation fused map with ample bilateral response and emphatic occlusion is formed, denoted as $F_{fuse}$, where $\varphi(\cdot)$ represents the 3×3 conv:

$F_{fuse} = \varphi\big(\big[\,F_{bi},\ F_{oc}\,\big]\big)$   (2)

$F_{fuse}$ denotes the set of feature maps generated by the BRF module, and each element of the set is a feature map. Subsequently, $F_{fuse}$ has a 224×224 spatial resolution and is taken as the input of the Occlusion Relationship Reasoning module (Sec.3.2.3), as shown in Fig.6. Through BRF, the occlusion feature is effectively combined with the bilateral feature. For occlusion relationship reasoning, the fused orientation map not only possesses the boundary location between two objects with an occlusion relationship, but also owns contextual information of each object. The BRF module provides adequate cues for the following feature learning module to infer the foreground and background relationship. Besides, by integrating the bilateral feature, the scene cue near the contour is enhanced.
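A minimal PyTorch-style sketch of this fusion is given below (the paper uses Caffe). The 64:16 channel split follows the ratio reported best in Table 5, the bilinear upsampling mirrors Fig.6, and the module and argument names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BRF(nn.Module):
    """Illustrative sketch of Bilateral Response Fusion: concatenate the bilateral
    map (from the MCL) with the occlusion map (from the decoder) and fuse them with
    a 3x3 conv, as in Eq. (2)."""
    def __init__(self, bilateral_ch=64, occlusion_ch=16, out_ch=64):
        super().__init__()
        self.fuse = nn.Conv2d(bilateral_ch + occlusion_ch, out_ch, 3, padding=1)

    def forward(self, f_bilateral, f_occlusion):
        # Upsample the coarse bilateral feature to the resolution of the occlusion
        # map (224x224 in the paper) before concatenation.
        f_bilateral = F.interpolate(f_bilateral, size=f_occlusion.shape[2:],
                                    mode='bilinear', align_corners=False)
        return self.fuse(torch.cat([f_bilateral, f_occlusion], dim=1))
```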

Figure 6: A test example of the generation of the orientation fused map. We acquire the fused map by adopting BRF to complement the bilateral feature with the occlusion feature.

3.2.3 Occlusion Relationship Reasoning

By utilizing the MCL and BRF, the bilateral feature is learned and fused; an inference module that makes full use of this feature is then needed to determine the order of the foreground and background areas. The existing method [31] utilizes a 3×3 conv to learn the features. This small convolution kernel only extracts cues from a local pixel patch, which is not suitable for inferring the occlusion relationship, because the tiny receptive field is unable to perceive the learned object cue. Thus, a large convolution kernel, able to perceive the surrounding regions near the contour, is necessary to utilize the bilateral feature.

Nevertheless, large convolution kernels are computation-demanding and memory-consuming. Instead, two stripe convolutions are proposed, which are orthogonal to each other. Compared to the 3×3 conv, which captures only nine pixels around the center (shown in Fig.7(a)), the vertical and horizontal stripe convolutions have 11×3 and 3×11 receptive fields, as shown in Fig.7(b). Specifically, for a contour pixel with arbitrary orientation, its tangent direction can be decomposed into vertical and horizontal components. Contexts along the two orthogonal directions contribute different amounts to the orientation representation. Thus, the tendency of the extending contour and the occlusion relationship of the bilateral scenes are recognized.

In addition, two main advantages are achieved. First, the large receptive field aggregates contextual information of the object to determine the depth order, without large memory consumption. Second, even if the edge is not exactly vertical or horizontal, at least one of the stripe convolutions can successfully perceive the foreground and background objects. After concatenating the outputs of the two orthogonal convolution modules, we apply a 3×3 conv to refine the features.
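A minimal PyTorch-style sketch of this reasoning head follows; the 3×11 / 11×3 kernel sizes match Table 6, while the intermediate channel widths and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StripeReasoning(nn.Module):
    """Illustrative sketch of the occlusion-reasoning head: two orthogonal stripe
    convolutions aggregate context along the horizontal and vertical components of
    the contour, and a 3x3 conv refines the concatenated features."""
    def __init__(self, in_ch=64, mid_ch=32, out_ch=1):
        super().__init__()
        self.horizontal = nn.Conv2d(in_ch, mid_ch, kernel_size=(3, 11), padding=(1, 5))
        self.vertical = nn.Conv2d(in_ch, mid_ch, kernel_size=(11, 3), padding=(5, 1))
        self.refine = nn.Conv2d(2 * mid_ch, out_ch, 3, padding=1)

    def forward(self, x):
        # Each stripe perceives one decomposed direction of the contour's tangent.
        return self.refine(torch.cat([self.horizontal(x), self.vertical(x)], dim=1))
```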

3.3 Loss Function

Occlusion Edge: The occlusion edge characterizes the depth discontinuity between regions, appearing as the boundary between objects. Given a set of training images $\{I_i\}$, the corresponding ground truth edge of the $i$-th input image at pixel $p$ is $e^{i}_{p} \in \{0, 1\}$, and we denote $\hat{e}^{i}_{p}$ as its network output, indicating the computed edge probability.

Occlusion Orientation: The occlusion orientation indicates the tangent direction of the edge using the left rule (i.e. the foreground area is on the left side of the background area). Following the definition above, for the $i$-th input image, its orientation ground truth at pixel $p$ is $\theta^{i}_{p}$, and the regression prediction of the orientation path is $\hat{\theta}^{i}_{p}$.

Occlusion Relationship: During the testing phase, we first refine the predicted edge map by conducting non-maximum suppression (NMS). The nonzero pixels of the sharpened edge map form a binary mask $B$. We then perform the element-wise product of $B$ and the orientation map to obtain the refined orientation map. Finally, following [11], we adjust each retained orientation to the tangent direction of the edge and gain the final occlusion boundary map.
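A compact sketch of this test-time procedure is given below. The NMS step itself is not specified in this paper, so `edge_nms` is a hypothetical helper (e.g. the structured-edges NMS used by DOOBNet [31]); the remaining steps follow the description above, except for the final tangent-direction adjustment, which is only indicated in a comment.

```python
import torch

def refine_orientation(edge_prob, orientation, edge_nms):
    """Test-time refinement sketch: thin the edge map, then mask the orientation map.
    `edge_nms` is a hypothetical NMS helper, not defined in this paper."""
    thin_edge = edge_nms(edge_prob)          # sharpened (thinned) edge probability map
    mask = (thin_edge > 0).float()           # binary mask B of surviving edge pixels
    refined_ori = mask * orientation         # keep orientation only on the thin edges
    # A final step (not sketched) snaps each retained orientation to the local
    # tangent direction of the edge, following [11].
    return thin_edge, refined_ori
```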

Loss Function: Following [31], we use the following loss function to supervise the training of our network:

$L(W) = \sum_{i=1}^{M} \Big( \sum_{p} L_{AL}\big(\hat{e}^{i}_{p},\, e^{i}_{p}\big) + \sum_{p} L_{S}\big(\hat{\theta}^{i}_{p},\, \theta^{i}_{p}\big) \Big)$   (3)

The notation includes: the collection of all standard network layer parameters ($W$), the predicted edge value at pixel $p$ ($\hat{e}^{i}_{p}$), the mini-batch size ($M$), the image index within a mini-batch ($i$), the Attention Loss ($L_{AL}$), and the Smooth Loss ($L_{S}$) [31].
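For concreteness, a hedged sketch of such a supervision signal follows, with a class-balanced binary cross-entropy standing in for the Attention Loss of [31] (whose exact form is not reproduced here) and Smooth L1 for the orientation term, evaluated only at ground-truth edge pixels; the equal weighting of the two terms is also an assumption.

```python
import torch
import torch.nn.functional as F

def ofnet_loss(edge_logits, edge_gt, ori_pred, ori_gt):
    """Sketch of the joint edge + orientation loss (stand-in edge term, see lead-in)."""
    # Edge term: class-balanced BCE as a simple surrogate for the Attention Loss [31].
    pos = edge_gt.sum()
    total = float(edge_gt.numel())
    w_pos = (total - pos) / total            # weight positives by negative frequency
    w_neg = pos / total                      # and vice versa, to balance the classes
    weights = torch.where(edge_gt > 0.5, w_pos, w_neg)
    edge_loss = F.binary_cross_entropy_with_logits(edge_logits, edge_gt, weight=weights)

    # Orientation term: Smooth L1 regression at ground-truth edge pixels only.
    mask = edge_gt > 0.5
    ori_loss = F.smooth_l1_loss(ori_pred[mask], ori_gt[mask]) if mask.any() \
        else ori_pred.sum() * 0.0
    return edge_loss + ori_loss
```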

Figure 7: The schematic illustration of how orientation information propagates in the feature learning phase. (a) the plain convolution. (b) the stripe convolution.

4 Experiments

In this section, extensive experiments are presented to validate the performance of the proposed OFNet. Further, we provide ablation analyses to discuss the network design choices.

4.1 Implementation Details

Dataset: Our method is evaluated on two challenging datasets: PIOD [21] and BSDS ownership [22]. The PIOD dataset is composed of 9,175 training images and 925 testing images. Each image is annotated with a ground truth object instance edge map and its corresponding orientation map. The BSDS ownership dataset includes 100 training images and 100 testing images of natural scenes. Following [31], all images in the two datasets are randomly cropped to 320×320 during training, while retaining their original sizes during testing.
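The cropping step above can be sketched in a few lines; this assumes NumPy-style H×W(×C) arrays no smaller than the crop size, and note that cropping leaves the orientation values themselves unchanged.

```python
import random

def random_crop(image, edge_map, orientation_map, size=320):
    """Crop the image and its edge / orientation annotations at a random location."""
    h, w = image.shape[:2]
    top = random.randint(0, h - size)
    left = random.randint(0, w - size)
    window = (slice(top, top + size), slice(left, left + size))
    return image[window], edge_map[window], orientation_map[window]
```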

Initialization: Our network is implemented in Caffe [12] and finetuned from an initial pretrained ResNet-50 model. All added convolution layers are initialized with the msra initialization [8].

Evaluation Criteria: Following [21], we compute the precision and recall of the estimated occlusion edge maps (i.e., OPR) using three standard evaluation metrics: fixed contour threshold (ODS), best threshold of each image (OIS) and average precision (AP). Notably, the orientation recall is only calculated at the correctly detected edge pixels. Besides, the same three metrics are also used to evaluate the edge map after NMS (EPR).

                  PIOD                   BSDS ownership
Method            ODS    OIS    AP       ODS    OIS    AP
④ DOOBNet [31]    .702   .712   .683     .555   .570   .440
⑤ Ours            .718   .728   .729     .583   .607   .501

Table 1: OPR results on PIOD (left) and the BSDS ownership dataset (right). ①-⑤ represent SRF-OCC [30], DOC-HED [21], DOC-DMLFOV [21], DOOBNet [31] and ours, respectively.
                  PIOD                   BSDS ownership
Method            ODS    OIS    AP       ODS    OIS    AP
–                 .658   .685   .602     –      –      –
④ DOOBNet [31]    .736   .746   .723     –      –      –
⑤ Ours            .751   .762   .773     .662   .689   .585

Table 2: EPR results on PIOD (left) and the BSDS ownership dataset (right).

4.2 Evaluation Results

Quantitative Performance: We evaluate our approach with comparisons to the state-of-the-art algorithms including SRF-OCC [30], DOC-HED [21], DOC-DMLFOV [21] and DOOBNet [31].

Figure 8: OPR results on two datasets.
Figure 9: EPR results on two datasets.
Figure 10: Example results on PIOD (first four rows) and the BSDS ownership dataset (last four rows). 1st column: input images; 2nd-4th columns: visualization results of ground truth, baseline and ours; 5th-6th columns: edge maps and orientation maps of ours. Notably, "red" pixels with arrows: correctly labeled occlusion boundaries; "cyan": correctly labeled boundaries but mislabeled occlusion; "green": false negative boundaries; "orange": false positive boundaries (best viewed in color).

As shown in Table 1 and Fig.8, our approach outperforms all other state-of-the-art methods in terms of OPR. Specifically, on the PIOD dataset, our method performs the best, outperforming the baseline DOOBNet by 4.6% AP. This is due to the efficiency of extracting the high-level semantic features of the two paths separately: the edge path succeeds in enhancing the contour response, and the orientation path manages to perceive the foreground and background relationship. Splitting the two tasks into two paths thus enables the improvement over previous algorithms. For the BSDS ownership dataset, which is difficult to train on due to the small number of training samples, the proposed OFNet obtains gains of 2.8% ODS, 3.7% OIS and 6.1% AP compared with the baseline DOOBNet. Specifically, our approach strengthens the bilateral cue between the foreground and background objects, and fuses it with high-level semantic features to introduce a clear contour, which better describes the areas of the foreground and background. Besides, the stripe convolution in our network plays an important role in harvesting the surrounding scenes of the contour. The improvement in orientation proves the effectiveness of the module.

EPR results are presented in Table 2 and Fig.9. On the PIOD dataset, our approach performs favorably against the other evaluated methods, surpassing DOOBNet by 5.0% AP. We take the distinction between edge and orientation into consideration and extract specific features for each sub-network, respectively. For the edge path, by utilizing the contextual features, which reflect the mutual constraints of pixels on the occlusion edge, our network outputs edge maps with an augmented contour and less surrounding noise. With the location cue extracted from the low-level layers, the predicted edge in our method fits the contour better, thus avoiding false positive detections compared to the others. For the BSDS ownership dataset, our approach achieves the highest ODS as well.

Qualitative Performance: Fig.10 shows the qualitative results on the two datasets. The top four rows show the results of the PIOD dataset [21], and the bottom four rows represent the BSDS ownership dataset [22]. The first column to the sixth column show the original RGB image from datasets, ground truth, the result predicted by DOOBNet [31], the result predicted by the proposed OFNet, the detected occlusion edge and the predicted orientation, respectively. In the resulting image, the right side of the arrow direction is the background, and the left side corresponds to the foreground area.

In detail, the two occluded buses in the first row have similar appearances. Thus, it is hard to detect the dividing line between them, and our baseline DOOBNet indeed fails; in contrast, our method detects the occlusion edge consistently. In the second row, the occlusion relationship between the wall and the sofa is prone to be predicted incorrectly. Unlike a small receptive field, which can hardly perceive objects with large areas of uniform color, our method, equipped with a sufficient receptive field, correctly predicts the relationship. The third scene is similar to the second row: compared with the baseline, our method predicts the relationship between the sofa and the ground correctly. In the fourth row, the color of the cruise ship is similar to the hill behind it, so the baseline fails to detect the ship. By using the low-level edge cues, our method accurately locates the contour of the ship. The fifth row shows people under a wall, where the orientation cannot be correctly inferred from the low-level features in the textureless areas. Our method correctly infers the relationship by using the high-level bilateral feature. The last three scenes have the same problem as the third row, i.e., an object with a large region of uniform color. Our method outperforms the others in this situation by a large margin, which proves the effectiveness of our designed modules.

4.3 Ablation Analysis

One-branch or Multi-branch Sub-networks: To evaluate the benefit of providing different high-level features to different sub-tasks, an existing method [31], which adopts a single-flow architecture by sharing high-level features, is compared with our method. As shown in Table 3, separate high-level features for the two paths improve the correctness of the occlusion relationship. In addition, each path is individually trained for comparison, validating the benefit of the occlusion cue for orientation prediction in our method.

                            EPR (PIOD)             OPR (PIOD)
Methods                      ODS    OIS    AP       ODS    OIS    AP
Baseline                     –      –      –        –      –      –
Baseline (split decoder)     –      –      –        –      –      –
Single edge stream           –      –      –        –      –      –
Single ori stream            –      –      –        –      –      –
Ours                         .751   .762   .773     .718   .728   .729

Table 3: Experimental results of the baseline DOOBNet [31], the baseline with split decoder, the baseline with single-stream sub-networks and our approach. The experiments are conducted on the PIOD dataset (the same below).
                              EPR (PIOD)             OPR (PIOD)
Methods                        ODS    OIS    AP       ODS    OIS    AP
Ours (w/o low-cues)            –      –      –        –      –      –
Ours (w/o edge high-cues)      –      –      –        –      –      –
Ours (w/o ori high-cues)       –      –      –        –      –      –
Ours                           .751   .762   .773     .718   .728   .729

Table 4: Experimental results of our model without low-cues, without edge high-cues, without orientation high-cues, and the full model.

Necessity for Each Feature: In order to verify the role of the various low-level and high-level features, each feature is removed to construct an independent variant for evaluation, as shown in Table 4. Intuitively, if the low-level features for the edge path are removed, the occlusion edge is difficult to locate accurately. If the high-level features for the edge path are removed, the occlusion edge fails to be detected consistently. Furthermore, if the high-level features for the orientation path are removed, although the occlusion edge can still be detected accurately and consistently, the ability to reason about the occlusion relationship drops sharply. The intrinsic reason is that the MCL perceives the bilateral cue around the contour and affirms the foreground and background relationship. The bilateral feature plays an important role in occlusion relationship reasoning.

Proportion of Bilateral and Contour Features: The bilateral feature provides relative depth along the edge, and the occlusion cue supplies the location of the boundary. We fuse them with various channel ratios to best refine the extent of the foreground and background. The proportion of the bilateral and occlusion features determines the effectiveness of the fusion. Table 5 reports the results with different proportions of the two features. The experiments show that fusing the bilateral feature and the occlusion feature with a 64:16 channel ratio in the BRF outperforms the other ratios. This reveals that the bilateral feature plays the more important role in the fusion operation, while the occlusion cue mainly plays an auxiliary role, distinguishing the foreground region from the background. However, when the bilateral feature occupies an excessive proportion, the boundary between foreground and background becomes ambiguous and blurred, which negatively impacts performance.

                 EPR (PIOD)             OPR (PIOD)
Scale             ODS    OIS    AP       ODS    OIS    AP
scale = 16:16     –      –      –        –      –      –
scale = 32:16     –      –      –        –      –      –
scale = 48:16     –      –      –        –      –      –
scale = 64:16     .751   .762   .773     .718   .728   .729
scale = 80:16     –      –      –        –      –      –

Table 5: Experimental results of fusing the bilateral feature and the occlusion feature with various channel ratios in the BRF module.
                 EPR (PIOD)             OPR (PIOD)
Scale             ODS    OIS    AP       ODS    OIS    AP
conv = 3×3        –      –      –        –      –      –
conv = 3×5        –      –      –        –      –      –
conv = 3×7        –      –      –        –      –      –
conv = 3×9        –      –      –        –      –      –
conv = 3×11       .751   .762   .773     .718   .728   .729

Table 6: Experimental results of stripe convolutions with different aspect ratios.

Plain or Stripe Convolution: To evaluate the effect of the stripe convolution on occlusion relationship reasoning, stripe-convolution variants with different aspect ratios are employed for comparison. As shown in Table 6, even if the edge is not oriented horizontally or vertically, the stripe kernels possess a large receptive field and learn the cues along both directions. Nevertheless, an even larger convolution kernel incurs excessive computation cost and increases the number of parameters. Consequently, the stripe convolutions in orthogonal directions extract the tendency of the edges and the bilateral cue around the contour at a moderate cost.

5 Conclusion

In this paper, we present a novel OFNet, which shares the occlusion cue from the decoder and separately acquires the contextual features for the specific tasks. Our algorithm builds on top of an encoder-decoder structure and side-output utilization. For learning the bilateral feature, an MCL is proposed. Besides, a BRF module is designed to apply the occlusion cue to precisely locate the object regions. In addition, we utilize a stripe convolution to further aggregate features from the surrounding scenes of the contour. Significant improvements over the state of the art in extensive experiments on the PIOD and BSDS ownership datasets demonstrate the effectiveness of our network.

Acknowledgement. This work was supported by the National Natural Science Foundation of China Nos. 61703049, 61876022, 61872047, and the Fundamental Research Funds for the Central Universities No. 2019kfyRCPY001.

References

  • [1] A. Alper and S. Stefano (2012) Detachable object detection: segmentation and depth ordering from short-baseline video. TPAMI 34 (10), pp. 1942–1951. Cited by: §1.
  • [2] A. Ayvaci and S. Soatto (2011) Detachable object detection with efficient model selection. In CVPR, Cited by: §1.
  • [3] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI 40 (4), pp. 834–848. Cited by: §2, §3.2.1.
  • [4] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. CoRR abs/1706.05587. External Links: Link, 1706.05587 Cited by: §2.
  • [5] H. Derek, E. Alexei, and H. Martial (2007) Recovering occlusion boundaries from an image. In ICCV, Cited by: §1.
  • [6] T. Gao, B. Packer, and D. Koller (2011) A segmentation-aware object detection model with occlusion handling. In CVPR, Cited by: §1.
  • [7] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. TPAMI. Cited by: §1.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In ICCV, Cited by: §4.1.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §3.1.
  • [10] D. Hoiem, A. Efros, and M. Hebert (2011) Recovering occlusion boundaries from an image. IJCV 91 (3), pp. 328–346. Cited by: §2.
  • [11] N. Jacobson, Y. Freund, and T. Q. Nguyen (2012) An online learning approach to occlusion boundary detection. TIP. Cited by: §1.
  • [12] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell (2014) Caffe: convolutional architecture for fast feature embedding. Vol. abs/1408.5093. External Links: Link Cited by: §4.1.
  • [13] W. Liu, A. Rabinovich, and A. C. Berg (2015) ParseNet: looking wider to see better. Computer Science. Cited by: §2.
  • [14] J. Long, E. Shelhamer, and T. Darrell (2014) Fully convolutional networks for semantic segmentation. TPAMI 39 (4), pp. 640–651. Cited by: §2.
  • [15] R. Lu, M. Zhou, A. Ming, and Y. Zhou (2019) Context-constrained accurate contour extraction for occlusion edge detection. In ICME, Vol. abs/1903.08890. External Links: Link, 1903.08890 Cited by: §3.1, §3.
  • [16] J. Ma, A. Ming, Z. Huang, X. Wang, and Y. Zhou (2017) Object-level proposals. In ICCV, Cited by: §1.
  • [17] J. Marshall, C. Burbeck, D. Ariely, J. Rolland, and K. Martin (1996) Occlusion edge blur: a cue to relative visual depth. Journal of the Optical Society of America A Optics Image Science & Vision 13 (4), pp. 681–8. Cited by: §1.
  • [18] A. Ming, T. Wu, J. Ma, F. Sun, and Y. Zhou (2016) Monocular depth ordering reasoning with occlusion edge detection and couple layers inference. IEEE Intelligent Systems 31 (2), pp. 54–65. Cited by: §1.
  • [19] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich (2015) Feedforward semantic segmentation with zoom-out features. In CVPR, Cited by: §2.
  • [20] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun (2017) Large kernel matters – improve semantic segmentation by global convolutional network. In CVPR, Cited by: §2.
  • [21] W. Peng and A. Yuille (2016) DOC: deep occlusion estimation from a single image. In ECCV, Cited by: §1, §1, §1, §2, §4.1, §4.1, §4.2, §4.2, Table 1.
  • [22] X. Ren, C. Fowlkes, and J. Malik (2006) Figure/ground assignment in natural images. In ECCV, Cited by: §1, §1, §2, §4.1, §4.2.
  • [23] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing & Computer-assisted Intervention, Cited by: §2.
  • [24] M. E. Sargin, L. Bertelli, B. Manjunath, and K. Rose (2009) Probabilistic occlusion boundary detection on spatio-temporal lattices. In ICCV, Cited by: §1.
  • [25] A. Saxena, S. Chung, and A. Ng (2005) Learning depth from single monocular images. In NIPS, Cited by: §1.
  • [26] W. Shen, X. Wang, Y. Wang, X. Bai, and Z. Zhang (2015) DeepContour: a deep convolutional feature learned by positive-sharing loss for contour detection. In CVPR, Cited by: §1.
  • [27] W. Shen, K. Zhao, Y. Jiang, Y. Wang, Z. Zhang, and X. Bai (2016) Object skeleton extraction in natural images by fusing scale-associated deep side outputs. In CVPR, Cited by: §2.
  • [28] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta (2016) Beyond skip connections: top-down modulation for object detection. CoRR abs/1612.06851. External Links: Link, 1612.06851 Cited by: §2.
  • [29] A. Stein and M. Hebert (2009) Occlusion boundaries from motion: low-level detection and mid-level reasoning. IJCV 82 (3), pp. 325. Cited by: §1.
  • [30] C. Teo, C. Fermuller, and Y. Aloimonos (2015) Fast 2d border ownership assignment. In CVPR, Cited by: §2, §4.2, Table 1.
  • [31] G. Wang, X. Liang, and F. Li (2018) DOOBNet: deep object occlusion boundary detection from an image. In ACCV, Cited by: §A.1, §1, §1, §3.2.3, §3.3, §3.3, §4.1, §4.2, §4.2, §4.3, Table 1, Table 3.
  • [32] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell (2018) Understanding convolution for semantic segmentation. In IEEE Winter Conference on Applications of Computer Vision, Cited by: §2.
  • [33] Y. Wang, Y. Xu, S. Tsogkas, X. Bai, S. Dickinson, and K. Siddiqi (2019) DeepFlux for skeletons in the wild. In CVPR, Cited by: §2.
  • [34] F. Xue, A. Ming, M. Zhou, and Y. Zhou (2019) A novel multi-layer framework for tiny obstacle discovery. In ICRA, Cited by: §1.
  • [35] F. Yu, V. Koltun, and T. Funkhouser (2017) Dilated residual networks. In CVPR, Cited by: §3.1, §3.2.1.
  • [36] F. Yu and V. Koltun (2016) Multi-scale context aggregation by dilated convolutions. In ICLR, Cited by: §3.1.
  • [37] Z. Zhang, A. Schwing, S. Fidler, and R. Urtasun (2015) Monocular object instance segmentation and depth ordering with cnns. In ICCV, Cited by: §1.
  • [38] M. Zhou, J. Ma, A. Ming, and Y. Zhou (2018) Objectness-aware tracking via double-layer model. In ICIP, Cited by: §1.
  • [39] Y. Zhou, X. Bai, W. Liu, and L. J. Latecki (2012) Fusion with diffusion for robust visual tracking. In NIPS, Cited by: §1.
  • [40] Y. Zhou, X. Bai, W. Liu, and L. J. Latecki (2016) Similarity fusion for visual tracking. IJCV 118 (3), pp. 337–363. Cited by: §1.
  • [41] Y. Zhou, J. Ma, A. Ming, and X. Bai (2018) Learning training samples for occlusion edge detection and its application in depth ordering inference. In ICPR, Cited by: §1.
  • [42] Y. Zhou and A. Ming (2016) Human action recognition with skeleton induced discriminative approximate rigid part model. PRL. Cited by: §1.
  • [43] Y. Zhou, Y. Yang, Y. Meng, X. Bai, W. Liu, and L. J. Latecki (2014) ONLINE multiple targets detection and tracking from mobile robot in cluttered indoor environments with depth camera. IJPRAI. Cited by: §1.

A Appendix

Figure 11: Occlusion relationship results of various approaches. The occlusion relationship (the red arrows) is represented by orientation (tangent direction of the edge), using the "left" rule where the left side of the arrow means foreground area. Notably, "red" pixels with arrows: correctly labeled occlusion boundaries; "cyan": correctly labeled boundaries but mislabeled occlusion; "green": false negative boundaries; "orange": false positive boundaries (best viewed in color). First column: input image. Middle columns: outputs of the split decoder, DOOBNet, the single edge / single orientation streams, and OFNet. Last column: ground truth.

In this appendix, we provide the full qualitative analysis for the ablation study. The experiments are conducted on the PIOD dataset.

A.1 One-branch or Multi-branch Sub-networks

The previous approach DOOBNet [31] adopts a single-flow architecture by sharing decoder features, which represent high-level features. The shared decoder features reflect the contour cues, which are necessary for both edge and orientation estimation. However, edge detection and orientation estimation differ in their choice of feature extraction, especially in the high-level semantic layers. We innovatively split the features produced by the side-outputs while sharing the decoder features, to fit both tasks respectively. Fig.11 reveals the effectiveness of our design.

A.2 Necessity for Each Feature

To verify the role of the various low-level and high-level features, each feature is removed to construct an independent variant for evaluation. If the low-level features for the edge path are removed, the occlusion edge is difficult to locate accurately, leading to a decrease in the accuracy of occlusion relationship reasoning (shown in Fig.12 (w/o low-cues)). If the high-level features for the edge path are removed, the occlusion edge fails to be detected consistently, which decreases the accuracy by a large margin (shown in Fig.12 (w/o high-cues)). By capturing spatial and contextual cues from the side-outputs respectively, the network is able to explore specific features for the individual predictions.

Figure 12: Edge maps of various approaches. First column: input image. Middle columns: OFNet without high-cues, OFNet without low-cues, and OFNet. Last column: ground truth.

A.3 Proportion of Bilateral-Contour Features

Previous works utilize inappropriate feature maps to predict the orientation, ones that mainly reflect the characteristics of the edge outline. The orientation cues on both sides of the contour are gradually filtered out, adversely affected by the edge prediction. We take advantage of the MCL to perceive the bilateral cues around the contours and affirm the foreground and background relationship. As shown in Fig.13, fusing the bilateral feature and the occlusion feature with a 64:16 channel ratio in the BRF outperforms the other ratios.

Figure 13: Occlusion relationship results of various approaches. First column: input image. Middle columns: fusing the bilateral feature and the occlusion feature with 16:16, 32:16, 48:16, 80:16 and 64:16 channel ratios, respectively. Last column: ground truth.

A.4 Plain or Stripe Convolution

Plain convolutions perceive information only about small surrounding areas. To extract the tendency of edges to extend and the bilateral cues around contours, we employ stripe convolutions in orthogonal directions. The stripe kernels possess a large receptive field and learn the cues along the two directions, respectively. We test stripe convolution kernels with different aspect ratios, which are exhibited in Fig.14. A larger convolution layer incurs too much computation cost and increases the number of parameters. We evaluate the performance of the model with an 11×11 conv on the PIOD and BSDS datasets; the EPR (left) and OPR (right) are reported in Table 7. Compared with the 3×11 conv, the model with the 11×11 conv achieves only a limited improvement, while it increases GPU memory usage by about 50% (from 10031 MB to 14931 MB).

Figure 14: Edge maps of various approaches. First column: input image. Middle columns: conv kernel sizes of 3×3, 3×5, 3×7, 3×9, 3×11 and 11×11. Last column: ground truth.
                         EPR                    OPR
Dataset   Scale          ODS    OIS    AP       ODS    OIS    AP
PIOD      conv = 3×11    –      –      –        –      –      –
          conv = 11×11   –      –      –        –      –      –
BSDS      conv = 3×11    –      –      –        –      –      –
          conv = 11×11   –      –      –        –      –      –

Table 7: Results of our model with different conv kernel sizes.