Object contour detection extracts information about object shape in images. Reliable detectors distinguish desired object contours from background edges. The resulting object contour maps are useful for supporting and/or improving various computer vision applications, such as semantic segmentation [5, 31, 35], object proposal and object flow estimation [25, 31], or generative image inpainting [22, 33].
Holistically-Nested Edge Detection (HED) has shown that it is beneficial to use features of a pre-trained classification network to capture desired image boundaries and suppress undesired edges. Khoreva et al. have trained HED specifically on object contour detection and demonstrated its potential for this task. Yang et al. have used a Fully Convolutional Encoder-Decoder Network (CEDN) to produce contour maps in which the object contours of certain object classes are highlighted and other edges are suppressed more effectively than before. Convolutional Oriented Boundaries (COB) outperforms these results by combining multi-scale oriented contours derived from a HED-like network architecture with an efficient hierarchical image segmentation algorithm. A common feature of all this work is the use of a Very Deep Convolutional Network for Large-Scale Image Recognition (VGG) and its classification ability as a backbone network. This backbone and its effective use are clearly the major keys to the results achieved, yet the methods mentioned here do not use the latest classification networks, such as Deep Residual Learning for Image Recognition (ResNet), which shows a higher classification ability than VGG. We use a ResNet as backbone and propose a strategy that prioritizes the effective utilization of its high-level abstraction capability for object contour detection. Accordingly, we choose a fitting architecture and a customized training procedure. We outperform the methods mentioned above and achieve a very robust detector with excellent performance on the validation data of a refined PASCAL VOC. High-level edge detection is closely related to object contour detection, because object contours are often an important subset of the desired detections. In the following, we introduce the edge detection task and show that, unlike for object contour detection, there is unexploited potential in using the high abstraction capability of classification networks.
Classical low-level detectors, such as Canny and Sobel, or the recently applied edge detection with the Phase Stretch Transform, filter the entire image and do not distinguish between semantic edges and the rest. Edge detection is no longer limited to low-level computer vision problems. Even the evaluation benchmark established to date, the Berkeley Segmentation Dataset and Benchmark 500 (BSDS500), requires high-level image processing algorithms for good results. Before Convolutional Neural Networks (CNNs) became popular, algorithms like gPb, which combines contour detection with hierarchical image segmentation, reached impressive results. In recent years, edge detectors such as DeepNet and N4-Fields have begun to use operations from CNNs to reach higher-level detection. DeepEdge and DeepContour are CNN applications that use more high-level features to extract contours and show that this capability improves the detection of certain edges. HED uses higher abstraction abilities than previous methods by combining multi-scale and multi-level features extracted from a pre-trained CNN, and improves edge detection. The latest edge detectors, such as the Crisp Edge Detector (CED), Richer Convolutional Features (RCF), COB and the High-for-Low features algorithm (HFL), make use of their backbone classification nets for edge detection in different ways. But some of these networks are based on older backbone CNNs like VGG and/or use simple HED-like skip-layer architectures. As for object contour detection, we assume that recent work has unexploited potential in the utilization of pre-trained classification abilities, in terms of architecture, backbone network, training procedure and datasets. We contribute a simple but decisive strategy, a network architecture chosen according to this strategy, and unconventional training methods to reach state-of-the-art results.
Section 2 briefly summarizes the closest related work, Section 3 contains the main contributions, i.e. the concept, realization and special training procedures for the proposed detector, Section 4 compares the method with other relevant methods, and Section 5 concludes the paper.
2 Related Work
AlexNet was a breakthrough for image classification and has been extended to solve other computer vision tasks, such as image segmentation, object contour detection, and edge detection. The step from image classification to image segmentation with the Fully Convolutional Network (FCN)
has favored new edge detection algorithms such as HED, as it allows a pixel-wise classification of an image. HED has successfully used the logistic loss function for the edge/non-edge binary classification. Our approach uses the same loss function, but differs in its weighting factor, network architecture, and backbone network. Another image segmentation network, Learning Deconvolution Network for Semantic Segmentation, has favored the development of the CEDN, demonstrating the strong relationship between image segmentation, object contour detection and edge detection. The good results of the CEDN inspired us to consider recent image segmentation networks for our task. Yang et al. created a new contour dataset using a Conditional Random Fields (CRF) refining method. The CEDN and edge detector networks such as COB and HFL are built on an older backbone net and are outperformed by RCF, which is based on a ResNet and improves edge detection. RCF uses the same backbone network as our approach, but differs in its network architecture: like HED, it relies on a skip-layer structure for feature concatenation. We argue that this simple concatenation is not effective enough for edge detection, and we propose a more advanced network structure. Our hypothesis is that an effective network architecture for edge detection should prioritize the high abstraction capability itself. As the deepest feature maps are the ones closest to the classification layer, we propose to use them as the starting point and to refine them layer by layer with features of a lower level until the level of classical edge detection algorithms is reached. These required properties are combined in RefineNet, which is why we have used the publicly available code from Guosheng Lin et al. as the basis of our approach. In parallel with the implementation of our method, the CED from Learning to Predict Crisp Boundaries by Deng et al.
has used a similar bottom-up architecture, surpassing RCF and achieving state-of-the-art results. Deep Crisp Boundaries by Wang et al. further develops this method and improves the state-of-the-art results. Our approach mainly differs from theirs in its conceptualization: they focus on producing "crisp" (thinned) boundaries, as they have shown that this benefits their results. We assume, in contrast, that by focusing on the effective utilization of the high abstraction capability of a backbone network, we can achieve better results.
Detecting edges with classical low-level methods reveals the large number of edges in many images. To distinguish between meaningful and undesired edges, a semantic context is required. Our selected context is the object contours of the 20 classes of the PASCAL VOC dataset. If the context is clear and some low-level vision functions are available, the most important ability for an object contour detector is the high-level abstraction capability, so that edges can be distinguished in the sense of the context. For this reason, our concept focuses on the effective use of the high-level abstraction ability of a modern classification network for object contour detection. This strategy guides our choice of architecture, backbone network, training procedure and datasets.
We hypothesize that an effective edge detection network architecture should prioritize the above-mentioned capability. To this end, we propose to give preference to the deepest feature maps of the backbone network and to use them as the starting point of a refinement architecture. To connect the high-level classification ability with the pixel-wise detection stage, we assume that a step-by-step refinement, in which deep features are fused with features of the next shallower level until the shallowest level is reached, is more effective than skip layers with a simple feature concatenation architecture. In most classification networks, features of different abstraction levels have different resolutions, so merging them requires a multi-resolution fusion. The RefineNet from Lin et al. provides the desired multi-path refinement; we base our application upon it and, in reference to this network, name ours RefineContourNet (RCN).
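The coarse-to-fine schedule described above can be sketched in a few lines. This is only an illustrative skeleton, not the paper's implementation: the names `fuse` and `refine` are hypothetical, and the placeholder fusion stands in for the actual RCU/MRF/CRP blocks described in Section 3.3.

```python
# Sketch of the step-by-step refinement: start from the deepest backbone
# features and fuse them with the next-shallower level until the shallowest
# level is reached.

def fuse(deeper, shallower):
    # Placeholder fusion: in RCN this would be RCU -> multi-resolution
    # fusion -> chained residual pooling; here it is a simple element-wise sum.
    return [d + s for d, s in zip(deeper, shallower)]

def refine(backbone_features):
    """backbone_features: list of feature maps ordered deepest -> shallowest."""
    estimate = backbone_features[0]          # start from the deepest feature map
    for shallower in backbone_features[1:]:  # fuse level by level
        estimate = fuse(estimate, shallower)
    return estimate

# Toy 1-D "feature maps" at four abstraction levels:
levels = [[1, 1], [2, 2], [3, 3], [4, 4]]
result = refine(levels)   # [10, 10]
```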
The training procedure has to accomplish two main goals: to effectively use the pre-trained features to form a specific abstraction capability for identifying desired object contours learned from data, and to connect this capability to the pixel-wise detection stage. Because training data for object contours is limited, both training goals benefit from data augmentation. For a similar reason, we experiment with a modified Microsoft Common Objects in Context (COCO) dataset to create an additional object contour dataset usable for pre-training. For fine-training on edge detection, we propose a simple and unconventional training method that takes the individuality of BSDS500’s hand-drawn labels into account.
3.2 Image Segmentation Network for Contour Detection
The main difference between an image segmentation network and a contour detection network lies in the definition of the objective function. Instead of defining a multi-label segmentation, an object contour can be defined as a binary classification. We use the logistic regression loss function

L(θ) = −∑_j [ β · y_j · log(p_j) + (1 − y_j) · log(1 − p_j) ],

with p_j ∈ [0, 1] and y_j ∈ {0, 1}, where p_j is the prediction for a pixel j with the corresponding binary label y_j, θ symbolizes the learned parameters, and β is a weighting factor for enhancing the contour detection due to the large imbalance between the contour and the non-contour pixels. Changing the loss function entails changing the last layer of the RefineNet accordingly: the 21 feature map layers previously used to segment the 20 PASCAL VOC classes plus the background class are replaced by a single feature map, which is sufficient for the binary classification of contours.
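A minimal sketch of such a class-balanced logistic loss, assuming flat lists of per-pixel predictions and labels (the function name and the clamping constant are illustrative, not from the paper):

```python
import math

def weighted_logistic_loss(preds, labels, beta):
    """Binary logistic loss in which positive (contour) pixels are
    up-weighted by beta to counter the contour / non-contour imbalance."""
    total = 0.0
    for p, y in zip(preds, labels):
        # clamp the prediction for numerical stability of the logarithm
        p = min(max(p, 1e-7), 1 - 1e-7)
        total += -(beta * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total

# A confident, correct prediction costs little; a confidently wrong one is
# penalized heavily, with the contour pixel weighted by beta:
low = weighted_logistic_loss([0.99, 0.01], [1, 0], beta=10.0)
high = weighted_logistic_loss([0.01, 0.99], [1, 0], beta=10.0)
```

Raising `beta` makes missed contour pixels more expensive than false alarms, which is the intended effect given how rare contour pixels are.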
3.3 Network Architecture
Figure 1 shows the RefineContourNet. For clarity, the connections between the blocks specify the resolution of the feature maps and the size of the feature channel dimension. The Residual Blocks (RB) are part of the ResNet-101. RefineNet has introduced three different refinement path blocks: the Residual Convolution Unit (RCU), the Multi-Resolution Fusion (MRF) and the Chained Residual Pooling (CRP). They are arranged in a row so that the higher-level features serve as input and are combined with the lower-level features of the RB at the same level. The RCU in Fig. 5 (a) has residual convolutional layers and enriches the network with more parameters; it can adjust and modify the input for the MRF. The MRF block first performs a convolution operation to adapt the channel dimension of the higher-level features to that of the lower-level ones. Then the lower-resolution feature maps are upsampled to the same tensor dimensions as the larger ones, after which they are added, as shown in Fig. 5 (b). The goal of the CRP is to gather more context from the feature maps than a normal max-pooling layer. Several pooling blocks are chained, each consisting of a max-pooling operation with a large stride and a convolution. A CRP with two max-pooling blocks is illustrated in Fig. 5 (c). In the final refinement step, we use the original image as input for an extra path with three RCUs, which improves the results.
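The MRF step can be sketched numerically under strong simplifications: the 1×1 convolution is reduced to a channel-mixing matrix, upsampling is nearest-neighbour repetition, and all weights and names (`conv1x1`, `mrf`) are illustrative rather than the paper's actual layers.

```python
import numpy as np

def conv1x1(x, w):
    # x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W); a 1x1 convolution
    # is exactly a per-pixel channel mixing.
    return np.einsum('oc,chw->ohw', w, x)

def upsample2x(x):
    # nearest-neighbour upsampling of (C, H, W) by a factor of 2
    return x.repeat(2, axis=1).repeat(2, axis=2)

def mrf(deep, shallow, w_deep, w_shallow):
    a = upsample2x(conv1x1(deep, w_deep))   # coarse path: adjust channels, upsample
    b = conv1x1(shallow, w_shallow)         # fine path: adjust channels only
    return a + b                            # element-wise fusion

deep = np.ones((4, 2, 2))       # deeper features: more channels, lower resolution
shallow = np.ones((2, 4, 4))    # shallower features: fewer channels, higher resolution
w_deep = np.ones((2, 4))        # hypothetical 1x1 kernels
w_shallow = np.ones((2, 2))
fused = mrf(deep, shallow, w_deep, w_shallow)   # shape (2, 4, 4)
```

The fused output lives at the resolution and channel width of the shallower path, which is what allows the refinement to proceed level by level toward full resolution.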
We have experimented with various combinations of refinement blocks per multi-path and have consistently observed the best results when placing the three blocks sequentially in a row, as shown in Fig. 1.
For each epoch, 1000 random images are selected from the training set, and the data is augmented by random cropping, vertical flipping, and scaling between 0.7 and 1.3. To find an optimal training method, we have examined the following training variants:
RCN-VOC is trained only on the CRF-refined object contour dataset proposed by Yang et al.
RCN-COCO is pre-trained on a modified COCO dataset, in which we consider only the 20 PASCAL VOC classes and generate contours from the segmentation masks. COCO segmentation masks, and the contours generated from them, are not accurate, so we enrich them with additional contours. For this we use our own object contour detector RCN-VOC with a high threshold, so that only confident contour detections are added.
RCN is pre-trained on the modified COCO and trained on the refined PASCAL VOC.
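The label-enrichment step used for RCN-COCO above can be sketched as a simple union of the COCO-derived contours with the detector's confident responses. The threshold value and the data layout here are illustrative; the paper does not specify them at this point.

```python
import numpy as np

def enrich_contours(coco_contours, detector_soft_map, threshold=0.9):
    """Union of existing contour labels with detector outputs above a high
    confidence threshold, so only confident detections are added."""
    confident = detector_soft_map >= threshold
    return np.logical_or(coco_contours, confident)

coco = np.array([[1, 0], [0, 0]], dtype=bool)     # inaccurate mask-derived contours
soft = np.array([[0.2, 0.95], [0.5, 0.1]])        # detector's soft contour map
enriched = enrich_contours(coco, soft)            # adds only the 0.95 detection
```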
Training the network for edge detection involves fine-training on the validation and train sets of the BSDS500 dataset. The BSDS contains multiple individual hand-drawn contour annotations of the same images, created by different people. We take into account the subjective decision of which edges are desired and which are not by simply using all individual labels and letting the CNN form a compromise. To give an indication of how such training affects the results, we fine-train one of the networks only on the drawings of a single person, calling it RCN-VOC-1. All training and modifications are done in MatConvNet.
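The "all individual labels" set-up above amounts to expanding each image into one training pair per annotator, so the network repeatedly sees the same image with conflicting targets. A minimal sketch, assuming a hypothetical layout of image identifiers mapped to their annotation lists:

```python
def expand_annotations(dataset):
    """dataset: dict mapping an image to its list of hand-drawn annotations.
    Returns one (image, label) training pair per individual annotator,
    so the CNN is driven toward a compromise between the annotators."""
    pairs = []
    for image, annotations in dataset.items():
        for label in annotations:
            pairs.append((image, label))
    return pairs

toy = {"img_a": ["ann1", "ann2", "ann3"], "img_b": ["ann1"]}
pairs = expand_annotations(toy)   # 4 training pairs in total
```

Restricting each list to a single annotator reproduces the RCN-VOC-1 variant used for comparison.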
4.2 Object Contour Detection Evaluation
For evaluation, we use Piotr’s Computer Vision Matlab Toolbox, the included Non-Maximum Suppression (NMS) algorithm for thinning the soft object contour maps, and a subset of 1103 images of a CRF-refined PASCAL val2012. We calculate the Precision and Recall (PR) curves for the RCN models, CEDN, HED and COB in Fig. 13. In Tab. 13, the Optimal Dataset Scale (ODS), Optimal Image Scale (OIS) and Average Precision (AP) for the methods are listed. The quantitative analysis reveals that the RCN models perform significantly better than the other methods on all three metrics. This is also reflected in the visual results, cf. Fig. 11. RCN-VOC and RCN have the upper hand in suppressing undesired edges, such as inner contours of the objects, while also recognizing object contours more clearly. A disadvantage is that the contour predictions are thicker than those of the CEDN, which is due to the halved output resolution of the network architecture. Nevertheless, the detection is very robust and the NMS can effectively compute 1-pixel-thin object contours.
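The ODS and OIS summary scores used above can be illustrated under a simplifying assumption: per-image precision/recall values are already available for a shared set of candidate thresholds, and F-measures are averaged per image rather than computed from pooled match counts as the benchmark toolboxes do.

```python
def f1(p, r):
    # harmonic mean of precision and recall (the F-measure)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def ods_ois(per_image_pr):
    """per_image_pr: list of {threshold: (precision, recall)} dicts."""
    thresholds = per_image_pr[0].keys()
    # ODS: one threshold fixed for the whole dataset (best mean F-measure)
    ods = max(
        sum(f1(*img[t]) for img in per_image_pr) / len(per_image_pr)
        for t in thresholds
    )
    # OIS: the best threshold is chosen separately for each image
    ois = sum(max(f1(*pr) for pr in img.values()) for img in per_image_pr) / len(per_image_pr)
    return ods, ois

pr = [
    {0.3: (0.9, 0.8), 0.5: (0.95, 0.6)},
    {0.3: (0.5, 0.9), 0.5: (0.8, 0.85)},
]
ods, ois = ods_ois(pr)
```

By construction OIS can never be lower than ODS, since the per-image choice subsumes the fixed dataset-wide threshold.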
4.3 Edge Detection Evaluation
The results of a quantitative evaluation of the RCN on unseen test images of BSDS500 are presented as PR curves in Fig. 20. ODS, OIS and AP for methods such as RCN, CED, RCF, COB and HED are listed in Tab. 20. The proposed RCN achieves state-of-the-art performance with a higher ODS than recent methods, closely followed by CED. In Fig. 18, results of CED and RCN are visualized for some test images. Careful analysis reveals that RCN detects some relevant edges, cf. the inner contours of the snowshoes (1st row), the face of the young man (2nd row) and the snout of the llama (4th row), which CED no longer recognizes. As for the object contour detection task, the disadvantage of thicker edge predictions persists for the RCN. However, the NMS works more precisely on edge prediction maps from RCN, as the bit depth per pixel is increased from 8 to 16 bits. The difference is evident in the results for the background edges in the image of the young man (2nd row): since an absolute maximum of the CED prediction could not be clearly distinguished, RCN edges are thinned more effectively.
The strategy of using the high abstraction capability for object contour and edge detection more effectively than previous methods has given us very good results in object contour detection and state-of-the-art results in edge detection. These results support our concept that RefineNet provides a very useful bottom-up multi-path refinement architecture for edge detection. With the unconventional training methods, such as pre-training on a modified COCO dataset or simply using all individual labels for fine-training on BSDS500, we have been able to improve performance on the respective tasks.
-  Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(5), 898–916 (May 2011). https://doi.org/10.1109/TPAMI.2010.161
-  Asghari, M.H., Jalali, B.: Physics-inspired image edge detection. In: 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP). pp. 293–296 (Dec 2014). https://doi.org/10.1109/GlobalSIP.2014.7032125
-  Bertasius, G., Shi, J., Torresani, L.: Deepedge: A multi-scale bifurcated deep network for top-down contour detection. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4380–4389 (June 2015). https://doi.org/10.1109/CVPR.2015.7299067
-  Bertasius, G., Shi, J., Torresani, L.: High-for-low and low-for-high: Efficient boundary detection from deep object features and its applications to high-level vision. In: 2015 IEEE International Conference on Computer Vision (ICCV). pp. 504–512 (Dec 2015). https://doi.org/10.1109/ICCV.2015.65
-  Bertasius, G., Shi, J., Torresani, L.: Semantic segmentation with boundary neural fields. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3602–3610 (June 2016). https://doi.org/10.1109/CVPR.2016.392
-  Canny, J.: A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence (6), 679–698 (1986)
-  Chen, L., Barron, J.T., Papandreou, G., Murphy, K., Yuille, A.L.: Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4545–4554 (June 2016). https://doi.org/10.1109/CVPR.2016.492
-  Deng, R., Shen, C., Liu, S., Wang, H., Liu, X.: Learning to predict crisp boundaries. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. pp. 570–586. Springer International Publishing, Cham (2018)
-  Dollár, P.: Piotr’s Computer Vision Matlab Toolbox (PMT). https://github.com/pdollar/toolbox
-  Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html
-  Ganin, Y., Lempitsky, V.: N4-Fields: Neural network nearest neighbor fields for image transforms. In: Cremers, D., Reid, I., Saito, H., Yang, M.H. (eds.) Computer Vision – ACCV 2014. pp. 536–551. Springer International Publishing, Cham (2015)
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778 (June 2016). https://doi.org/10.1109/CVPR.2016.90
-  Khoreva, A., Benenson, R., Omran, M., Hein, M., Schiele, B.: Weakly supervised object boundaries. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 183–192 (June 2016). https://doi.org/10.1109/CVPR.2016.27
-  Kivinen, J., Williams, C., Heess, N.: Visual Boundary Prediction: A Deep Neural Prediction Network and Quality Dissection. In: Kaski, S., Corander, J. (eds.) Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, vol. 33, pp. 512–521. PMLR, Reykjavik, Iceland (22–25 Apr 2014), http://proceedings.mlr.press/v33/kivinen14.html
-  Krähenbühl, P., Koltun, V.: Efficient inference in fully connected crfs with gaussian edge potentials. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 24, pp. 109–117. Curran Associates, Inc. (2011), http://papers.nips.cc/paper/4296-efficient-inference-in-fully-connected-crfs-with-gaussian-edge-potentials.pdf
-  Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25, pp. 1097–1105. Curran Associates, Inc. (2012), http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
-  Lei, P., Li, F., Todorovic, S.: Boundary flow: A siamese network that predicts boundary motion without training on motion. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3282–3290 (June 2018). https://doi.org/10.1109/CVPR.2018.00346
-  Lin, G., Milan, A., Shen, C., Reid, I.: Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5168–5177 (July 2017). https://doi.org/10.1109/CVPR.2017.549
-  Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision – ECCV 2014. pp. 740–755. Springer International Publishing, Cham (2014)
-  Liu, Y., Cheng, M., Hu, X., Bian, J., Zhang, L., Bai, X., Tang, J.: Richer convolutional features for edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence pp. 1–1 (2018). https://doi.org/10.1109/TPAMI.2018.2878849
-  Maninis, K., Pont-Tuset, J., Arbeláez, P., Gool, L.V.: Convolutional oriented boundaries: From image segmentation to high-level tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 40(4), 819–833 (2017)
-  Nazeri, K., Ng, E., Joseph, T., Qureshi, F.Z., Ebrahimi, M.: Edgeconnect: Generative image inpainting with adversarial edge learning. CoRR abs/1901.00212 (2019), http://arxiv.org/abs/1901.00212
-  Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: 2015 IEEE International Conference on Computer Vision (ICCV). pp. 1520–1528 (Dec 2015). https://doi.org/10.1109/ICCV.2015.178
-  Pont-Tuset, J., Arbeláez, P., Barron, J., Marques, F., Malik, J.: Multiscale combinatorial grouping for image segmentation and object proposal generation. In: arXiv:1503.00848 (March 2015)
-  Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: Epicflow: Edge-preserving interpolation of correspondences for optical flow. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1164–1172 (June 2015). https://doi.org/10.1109/CVPR.2015.7298720
-  Shelhamer, E., Long, J., Darrell, T.: Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(4), 640–651 (April 2017). https://doi.org/10.1109/TPAMI.2016.2572683
-  Shen, W., Wang, X., Wang, Y., Bai, X., Zhang, Z.: Deepcontour: A deep convolutional feature learned by positive-sharing loss for contour detection. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3982–3991 (June 2015). https://doi.org/10.1109/CVPR.2015.7299024
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
-  Sobel, I.: Camera models and machine perception. Tech. rep., Stanford Univ Calif Dept of Computer Science (1970)
-  Vedaldi, A., Lenc, K.: Matconvnet – convolutional neural networks for matlab. In: Proceeding of the ACM Int. Conf. on Multimedia (2015)
-  Wang, Y., Zhao, X., Li, Y., Huang, K.: Deep crisp boundaries: From boundaries to higher-level tasks. IEEE Transactions on Image Processing 28(3), 1285–1298 (March 2019). https://doi.org/10.1109/TIP.2018.2874279
-  Xie, S., Tu, Z.: Holistically-nested edge detection. In: 2015 IEEE International Conference on Computer Vision (ICCV). pp. 1395–1403 (Dec 2015). https://doi.org/10.1109/ICCV.2015.164
-  Xu, S., Liu, D., Xiong, Z.: Edge-guided generative adversarial network for image inpainting. In: 2017 IEEE Visual Communications and Image Processing (VCIP). pp. 1–4 (Dec 2017). https://doi.org/10.1109/VCIP.2017.8305138
-  Yang, J., Price, B., Cohen, S., Lee, H., Yang, M.: Object contour detection with a fully convolutional encoder-decoder network. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 193–202 (June 2016). https://doi.org/10.1109/CVPR.2016.28
-  Zhang, H., Jiang, K., Zhang, Y., Li, Q., Xia, C., Chen, X.: Discriminative feature learning for video semantic segmentation. In: 2014 International Conference on Virtual Reality and Visualization. pp. 321–326 (Aug 2014). https://doi.org/10.1109/ICVRV.2014.65