Richer Convolutional Features for Edge Detection
In this paper, we propose an accurate edge detector using richer convolutional features (RCF). Since objects in natural images have various scales and aspect ratios, the rich hierarchical representations learned automatically by CNNs are critical for detecting edges and object boundaries. The convolutional features also become gradually coarser as receptive fields grow. Based on these observations, our proposed network architecture makes full use of multiscale and multi-level information to perform image-to-image edge prediction by combining all of the useful convolutional features into a holistic framework. It is the first attempt to adopt such rich convolutional features in computer vision tasks. Using the VGG16 network, we report results on several available datasets. On the well-known BSDS500 benchmark, we achieve an ODS F-measure of .811 while retaining a fast speed (8 FPS). In addition, our fast version of RCF achieves an ODS F-measure of .806 at 30 FPS.
Edge detection, which aims to extract visually salient edges and object boundaries from natural images, has remained one of the main challenges in computer vision for several decades. It is usually considered a low-level technique, and a variety of high-level tasks [27, 8] have greatly benefited from its development, such as object detection [57, 18], object proposal [56, 64, 10, 63, 62] and image segmentation [1, 3, 9, 58].
Typically, traditional methods first extract local cues of brightness, color, gradient and texture, or other manually designed features such as Pb, gPb, and Sketch tokens; sophisticated learning paradigms [59, 15] are then used to classify edge and non-edge pixels. Although edge detection approaches using low-level features have improved greatly in recent years, their limitations are also obvious. For example, edges and boundaries are often defined to be semantically meaningful, yet it is difficult to represent object-level information with low-level cues. Under these circumstances, gPb and Structured Edges try to use complex strategies to capture global features as much as possible.
In the past few years, convolutional neural networks (CNNs) have become popular in the computer vision community by substantially advancing the state-of-the-art of various tasks, including image classification [33, 52, 54], object detection [22, 21, 45, 36] and semantic segmentation [40, 7]. Since CNNs have a strong capability to learn high-level representations of natural images automatically, there is a recent trend of using convolutional networks to perform edge detection. Some well-known CNN-based methods have pushed this field forward significantly, such as DeepEdge, N-Fields, CSCNN, DeepContour, and HED. Our algorithm falls into this category as well.
To see the information obtained by different convolution (i.e. conv) layers in edge detection, we build a simple network to produce side outputs of intermediate layers using VGG16, which has five conv stages. Fig. 1 shows an example. We find that convolutional features become gradually coarser and that intermediate layers contain many useful fine details. On the other hand, since richer convolutional features are highly effective for many vision tasks, many researchers strive to develop deeper networks. However, it is difficult to get networks to converge as they grow deeper because of vanishing/exploding gradients and a shortage of training data (e.g. for edge detection). So why not make full use of the CNN features we already have? Our motivation is based on these observations. Unlike previous CNN methods, the proposed network uses the CNN features of all the conv layers to perform pixel-wise prediction in an image-to-image fashion, and is thus able to obtain accurate representations for objects or object parts at different scales. Concretely, we utilize the CNN features from all the conv layers in a unified framework that can potentially be generalized to other vision tasks. By carefully designing a universal strategy to combine hierarchical CNN features, our system performs very well in edge detection.
When evaluating the proposed method on the BSDS500 dataset, we achieve the best trade-off between effectiveness and efficiency, with an ODS F-measure of 0.811 at a speed of 8 FPS. It even outperforms the result of human perception (ODS F-measure 0.803). In addition, a fast version of RCF is also presented, which achieves an ODS F-measure of 0.806 at 30 FPS.
Edge detection is one of the most fundamental problems in computer vision; researchers have worked on it for nearly 50 years, and a large body of literature has emerged. Broadly speaking, we can categorize these approaches into three groups: early pioneering ones, learning based ones using handcrafted features, and deep learning based ones. Here we briefly review some representative approaches developed in the past few decades.
Early pioneering methods mainly focused on the utilization of intensity and color gradients. Robinson discussed a quantitative measure in choosing color coordinates for the extraction of visually significant edges and boundaries. [41, 55] presented zero-crossing theory based algorithms. Sobel proposed the famous Sobel operator to compute the gradient map of an image, and then yielded edges by thresholding the gradient map. An extended version of Sobel, named Canny, added Gaussian smoothing as a preprocessing step and used two thresholds to get edges. In this way, Canny is more robust to noise. In fact, it is still very popular across various tasks because of its notable efficiency. However, these early methods have poor accuracy and are thus difficult to adapt to today's applications.
Later, researchers tended to manually design features using low-level cues such as intensity, gradient, and texture, and then employ sophisticated learning paradigms to classify edge and non-edge pixels [14, 46]. Konishi et al. proposed the first data-driven methods by learning the probability distributions of responses that correspond to two sets of edge filters. Martin et al. formulated changes in brightness, color, and texture as Pb features, and trained a classifier to combine the information from these features. Arbeláez et al. developed Pb into gPb by using standard Normalized Cuts to combine the above local cues into a globalization framework. Lim proposed novel features, Sketch tokens, that can be used to represent mid-level information. Dollár et al. employed random decision forests to represent the structure present in local image patches. Taking color and gradient features as input, the structured forests output high-quality edges. However, all the above methods are built on handcrafted features, which have limited ability to represent high-level information for semantically meaningful edge detection.
With the vigorous development of deep learning in recent years, a series of deep learning based approaches have been proposed. Ganin et al. proposed N-Fields, which combines CNNs with nearest neighbor search. Shen et al. partitioned contour data into subclasses and fit each subclass by learning model parameters. Hwang et al. considered contour detection as a per-pixel classification problem. They employed DenseNet to extract a feature vector for each pixel, and then an SVM classifier was used to classify each pixel into the edge or non-edge class. Xie et al. recently developed an efficient and accurate edge detector, HED, which performs image-to-image training and prediction. This holistically-nested architecture connects its side output layers, each composed of one conv layer with kernel size 1, one deconv layer and one softmax layer, to the last conv layer of each stage in VGG16. More recently, Liu et al. used relaxed labels generated by bottom-up edges to guide the training process of HED, and achieved some improvement. Li et al. proposed a complex model for unsupervised learning of edge detection, but its performance is worse than that of models trained on the limited BSDS500 dataset.
The aforementioned CNN-based models have advanced the state-of-the-art significantly, but all of them lose some useful hierarchical CNN features when classifying pixels into the edge or non-edge class. These methods usually adopt CNN features only from the last layer of each conv stage. To address this, we propose a fully convolutional network that combines features from every CNN layer efficiently. We detail our method below.
Inspired by previous literature in deep learning [21, 45, 40, 60], we design our network by modifying the VGG16 network. VGG16, which is composed of 13 conv layers and 3 fully connected layers, has achieved state-of-the-art results in a variety of tasks, such as image classification and object detection [22, 21, 45]. Its conv layers are divided into five stages, with a pooling layer connected after each stage. The useful information captured by each conv layer becomes coarser as its receptive field size increases. Detailed receptive field sizes of different layers can be seen in Tab. 1. We expect the use of this rich hierarchical information to help considerably. This is the starting point of our network design.
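The growth of receptive fields across the five stages can be reproduced with the standard receptive-field recurrence. The following is a minimal sketch, assuming the unmodified VGG16 layout (3×3 conv layers with stride 1, grouped 2-2-3-3-3, each stage followed by a 2×2 pooling layer with stride 2); the function name `vgg16_receptive_fields` is hypothetical and chosen only for illustration.

```python
# Number of 3x3 conv layers in each of the five VGG16 stages (13 in total).
VGG16_STAGES = [2, 2, 3, 3, 3]


def vgg16_receptive_fields():
    """Return {layer name: receptive field size in input pixels}."""
    fields = {}
    rf = 1    # receptive field of a single input pixel
    jump = 1  # spacing, in input pixels, between neighbouring outputs
    for stage, n_convs in enumerate(VGG16_STAGES, start=1):
        for i in range(1, n_convs + 1):
            rf += (3 - 1) * jump      # 3x3 conv, stride 1
            fields[f"conv{stage}_{i}"] = rf
        rf += (2 - 1) * jump          # 2x2 pooling window
        jump *= 2                     # pooling stride 2 doubles the spacing
        fields[f"pool{stage}"] = rf
    return fields


for name, size in vgg16_receptive_fields().items():
    print(f"{name:8s} {size}")
```

Under these assumptions the receptive field grows from 3 pixels at conv1_1 to 196 pixels at conv5_3, which illustrates why deeper layers respond to larger structures.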
Our proposed network is shown in Fig. 2. Compared with VGG16, our modifications can be described as follows:
We cut all the fully connected layers and the pool5 layer. On the one hand, we remove the fully connected layers because they do not fit our fully convolutional design. On the other hand, adding the pool5 layer would double the stride, which is harmful for edge localization.
Each conv layer in VGG16 is connected to a conv layer with kernel size 1×1 and channel depth 21. The resulting layers in each stage are accumulated using an eltwise layer to attain hybrid features.
A 1×1 conv layer with a single output channel follows each eltwise layer. Then, a deconv layer is used to up-sample this feature map.
A cross-entropy loss / sigmoid layer is connected to the up-sampling layer in each stage.
All the up-sampling layers are concatenated. Then a 1×1 conv layer is used to fuse feature maps from each stage. Finally, a cross-entropy loss / sigmoid layer follows to produce the fusion loss / output.
Hence, we combine hierarchical features from all the conv layers into a holistic framework, in which all of the parameters are learned automatically. Since the receptive field sizes of conv layers in VGG16 differ from each other, our network can learn multiscale information, both low-level and object-level, that is helpful to edge detection. We show the intermediate results from each stage in Fig. 3. From top to bottom, the edge response becomes coarser while obtaining strong responses at the boundaries of larger objects or object parts. This is consistent with our expectation that conv layers learn to detect larger objects as the receptive field size increases. Since our RCF model combines all the accessible conv layers to employ richer features, it is expected to achieve a boost in accuracy.
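The data flow of the modifications listed above can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not the released Caffe model: the reduction and scoring convolutions are assumed to use 1×1 kernels, nearest-neighbour up-sampling stands in for the learned deconv layer, weights are random, and only two small stages are built instead of five. All function names (`conv1x1`, `eltwise_sum`, `rcf_stage`, `fuse`) are hypothetical.

```python
import math
import random


def conv1x1(fmap, weights, bias):
    """1x1 conv: per-pixel linear map. fmap [C_in][H][W] -> [C_out][H][W]."""
    h, wd = len(fmap[0]), len(fmap[0][0])
    return [[[b + sum(wk * fmap[c][y][x] for c, wk in enumerate(w_row))
              for x in range(wd)] for y in range(h)]
            for w_row, b in zip(weights, bias)]


def eltwise_sum(fmaps):
    """Element-wise sum of equally shaped feature maps (the eltwise layer)."""
    c, h, wd = len(fmaps[0]), len(fmaps[0][0]), len(fmaps[0][0][0])
    return [[[sum(f[k][y][x] for f in fmaps) for x in range(wd)]
             for y in range(h)] for k in range(c)]


def upsample_nearest(plane, factor):
    """Nearest-neighbour up-sampling (stand-in for the learned deconv layer)."""
    return [[plane[y // factor][x // factor]
             for x in range(len(plane[0]) * factor)]
            for y in range(len(plane) * factor)]


def rcf_stage(conv_outputs, reduce_params, score_params, factor):
    """1x1 conv per conv layer -> eltwise sum -> 1x1 conv to 1 channel -> up-sample."""
    reduced = [conv1x1(f, w, b) for f, (w, b) in zip(conv_outputs, reduce_params)]
    score = conv1x1(eltwise_sum(reduced), *score_params)[0]
    return upsample_nearest(score, factor)


def fuse(stage_maps, fuse_weights, fuse_bias=0.0):
    """1x1 conv across the concatenated stage maps, then a sigmoid."""
    h, wd = len(stage_maps[0]), len(stage_maps[0][0])
    return [[1.0 / (1.0 + math.exp(-(fuse_bias + sum(
        wk * m[y][x] for wk, m in zip(fuse_weights, stage_maps)))))
        for x in range(wd)] for y in range(h)]


# Toy example: two stages, two conv layers each, 3 input channels per layer.
rng = random.Random(0)


def rand_fmap(c, h, wd):
    return [[[rng.random() for _ in range(wd)] for _ in range(h)] for _ in range(c)]


def rand_params(c_out, c_in):
    return ([[rng.uniform(-0.1, 0.1) for _ in range(c_in)] for _ in range(c_out)],
            [0.0] * c_out)


stage1 = rcf_stage([rand_fmap(3, 4, 4) for _ in range(2)],
                   [rand_params(21, 3) for _ in range(2)], rand_params(1, 21), factor=1)
stage2 = rcf_stage([rand_fmap(3, 2, 2) for _ in range(2)],
                   [rand_params(21, 3) for _ in range(2)], rand_params(1, 21), factor=2)
edge_map = fuse([stage1, stage2], fuse_weights=[0.5, 0.5])
```

The point of the sketch is the wiring: every conv layer contributes a 21-channel reduction, reductions within a stage are summed, and only then is a single-channel score map produced and brought back to input resolution for fusion.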
Edge datasets in this community are usually labeled by several annotators using their knowledge about the presence of objects and object parts. Though humans vary in cognition, these human-labeled edges for the same image share high consistency. For each image, we average all the ground truth to generate an edge probability map, which ranges from 0 to 1. Here, 0 means no annotator labeled this pixel, and 1 means all annotators labeled this pixel. We consider the pixels with edge probability higher than a threshold η as positive samples and the pixels with edge probability equal to 0 as negative samples. Otherwise, if a pixel is marked by fewer than η of the annotators, it may be semantically controversial as an edge point, and treating it as either a positive or a negative sample may confuse the network. So we ignore pixels in this category.
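The labeling rule described above can be sketched as follows. This is a minimal illustration in which the threshold is passed as a parameter named `eta` (a name assumed here), annotations are binary maps stored as nested lists, and `make_labels` is a hypothetical helper; the value 0.5 in the example is for illustration only.

```python
def make_labels(annotations, eta):
    """Average binary annotator maps and split pixels into three categories.

    annotations: list of H x W maps with 1 where an annotator marked an edge.
    Returns (edge probability map, label map of 'positive'/'negative'/'ignored').
    """
    n = len(annotations)
    h, wd = len(annotations[0]), len(annotations[0][0])
    prob = [[sum(a[y][x] for a in annotations) / n for x in range(wd)]
            for y in range(h)]

    def category(p):
        if p == 0:
            return "negative"   # no annotator marked this pixel
        if p > eta:
            return "positive"   # enough annotators agree
        return "ignored"        # controversial pixel, excluded from the loss

    labels = [[category(p) for p in row] for row in prob]
    return prob, labels


# Three annotators, one row of three pixels.
annotators = [[[1, 1, 0]], [[1, 0, 0]], [[1, 0, 0]]]
prob, labels = make_labels(annotators, eta=0.5)
```

Here the first pixel is marked by all three annotators and becomes positive, the second is marked by only one of three and is ignored, and the third is negative.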
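We compute the loss at every pixel with respect to the pixel label as

$$
l(X_i; W) =
\begin{cases}
-\alpha \cdot \log\left(1 - P(X_i; W)\right) & \text{if } y_i = 0, \\
0 & \text{if } 0 < y_i \le \eta, \\
-\beta \cdot \log P(X_i; W) & \text{otherwise},
\end{cases}
$$

in which

$$
\alpha = \lambda \cdot \frac{|Y^+|}{|Y^+| + |Y^-|}, \qquad
\beta = \frac{|Y^-|}{|Y^+| + |Y^-|}.
$$

$Y^+$ and $Y^-$ denote the positive sample set and the negative sample set, respectively. The hyper-parameter $\lambda$ balances positive and negative samples. The activation value (CNN feature vector) and ground truth edge probability at pixel $i$ are denoted by $X_i$ and $y_i$, respectively. $P(X)$ is the standard sigmoid function, and $W$ denotes all the parameters that will be learned in our architecture. Therefore, our improved loss function can be formulated as

$$
L(W) = \sum_{i=1}^{|I|} \left( \sum_{k=1}^{K} l\left(X_i^{(k)}; W\right) + l\left(X_i^{\text{fuse}}; W\right) \right),
$$

where $X_i^{(k)}$ is the activation value from stage $k$ while $X_i^{\text{fuse}}$ is from the fusion layer. $|I|$ is the number of pixels in image $I$, and $K$ is the number of stages (equal to 5 here).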
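A small numerical sketch of this loss may help. It assumes the class-balanced cross-entropy form, written as a quantity to minimize, with the annotator threshold called `eta` and the balancing hyper-parameter called `lam`; the weights are taken to be alpha = lam · |Y+| / (|Y+| + |Y−|) for negative pixels and beta = |Y−| / (|Y+| + |Y−|) for positive pixels. Pixels are flattened into plain lists, at least one labeled pixel is assumed, and all function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def balance_weights(probs, eta, lam):
    """Class-balancing weights from the counts of positive and negative pixels."""
    n_pos = sum(1 for y in probs if y > eta)
    n_neg = sum(1 for y in probs if y == 0)
    total = n_pos + n_neg          # assumed non-zero
    return lam * n_pos / total, n_neg / total


def pixel_loss(x, y, alpha, beta, eta):
    """Loss of one activation x against ground-truth edge probability y."""
    p = sigmoid(x)
    if y == 0:
        return -alpha * math.log(1.0 - p)   # negative sample
    if y <= eta:
        return 0.0                          # controversial pixel, ignored
    return -beta * math.log(p)              # positive sample


def image_loss(stage_acts, fuse_acts, probs, eta, lam):
    """Sum the per-pixel loss over every stage output and the fusion output."""
    alpha, beta = balance_weights(probs, eta, lam)
    total = 0.0
    for i, y in enumerate(probs):
        for acts in stage_acts:
            total += pixel_loss(acts[i], y, alpha, beta, eta)
        total += pixel_loss(fuse_acts[i], y, alpha, beta, eta)
    return total


probs = [0.0, 0.3, 1.0]                    # negative, ignored, positive
good = [-5.0, 0.0, 5.0]                    # activations that agree with labels
bad = [5.0, 0.0, -5.0]                     # activations that disagree
loss_good = image_loss([good], good, probs, eta=0.5, lam=1.1)
loss_bad = image_loss([bad], bad, probs, eta=0.5, lam=1.1)
```

With one positive and one negative pixel, alpha is 0.55 and beta is 0.5; the ignored pixel contributes nothing regardless of its activation, and `loss_good` is much smaller than `loss_bad`.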
In single-scale edge detection, we input an original image into our fine-tuned RCF network, and the output is an edge probability map. To further improve the quality of edges, we use image pyramids during testing. Specifically, we resize an image to construct an image pyramid, and each of these images is input to our single-scale detector separately. Then, all resulting edge probability maps are resized to the original image size using bilinear interpolation. Finally, these maps are averaged to get the final prediction map. Fig. 4 shows a visualized pipeline of our multiscale algorithm. We also tried a weighted sum, but found that the simple average works very well. Considering the trade-off between accuracy and speed, we use three scales, 0.5, 1.0, and 1.5, in this paper. When evaluating on the BSDS500 dataset, this simple multiscale strategy improves the ODS F-measure from 0.806 to 0.811, though the speed drops from 30 FPS to 8 FPS. See Sec. 4 for details.
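The multiscale test procedure can be sketched as below. This is a simplified illustration: images are single-channel nested lists, `resize_bilinear` is a hand-written bilinear resize, and `detector` is a hypothetical callable standing in for the single-scale RCF network (it takes an image and returns an edge probability map of the same size).

```python
def resize_bilinear(img, out_h, out_w):
    """Bilinear resize of a 2-D list, aligning pixel centres."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for oy in range(out_h):
        sy = min(max((oy + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        fy = sy - y0
        row = []
        for ox in range(out_w):
            sx = min(max((ox + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            fx = sx - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def multiscale_edges(image, detector, scales=(0.5, 1.0, 1.5)):
    """Run the detector on an image pyramid and average the resized outputs."""
    h, w = len(image), len(image[0])
    averaged = [[0.0] * w for _ in range(h)]
    for s in scales:
        scaled = resize_bilinear(image, max(1, round(h * s)), max(1, round(w * s)))
        pred = resize_bilinear(detector(scaled), h, w)  # back to original size
        for y in range(h):
            for x in range(w):
                averaged[y][x] += pred[y][x] / len(scales)
    return averaged


# Usage with an identity function as a placeholder detector.
image = [[0.25] * 4 for _ in range(4)]
result = multiscale_edges(image, detector=lambda im: im)
```

Because each scale requires a separate forward pass, three scales cost roughly the sum of three single-scale runs, which is consistent with the reported drop in speed.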
Our RCF differs from HED in three main ways. First, HED only considers the last conv layer in each stage of VGG16, so much information that is helpful to edge detection is missed. In contrast, RCF uses richer features from all the conv layers, and can thus capture more object or object part boundaries accurately across a larger range of scales. Second, a novel loss function is proposed to treat training examples properly. We only consider the edge pixels that most annotators labeled as positive samples, since these edges are highly consistent and thus easy to train on. In addition, we ignore edge pixels that are marked by only a few annotators because of their ambiguity. Third, we use a multiscale hierarchy to enhance edges. Our evaluation results demonstrate the strengths (2.3% improvement in ODS F-measure over HED) of these choices. See Sec. 4 for details.
We implement our network using the publicly available Caffe, which is well-known in this community. The VGG16 model pre-trained on ImageNet is used to initialize our network. We change the stride of the pool4 layer to 1 and use the atrous algorithm to fill the holes. In RCF training, the weights of the 1×1 conv layer in the fusion stage are initialized to 0.2 and the biases are initialized to 0. Each stochastic gradient descent (SGD) minibatch samples 10 images randomly. For other SGD hyper-parameters, the global learning rate is set to 1e-6 and is divided by 10 after every 10k iterations. The momentum and weight decay are set to 0.9 and 0.0002, respectively. We run SGD for 40k iterations in total. The threshold η and balancing weight λ in the loss function are set depending on the training data. All experiments in this paper were run on an NVIDIA TITAN X GPU.
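The step-wise learning rate schedule described here is easy to state in code. The sketch below collects the hyper-parameters given in the text in a plain dictionary and computes the learning rate at a given iteration; the names `SOLVER` and `learning_rate` are hypothetical and are not taken from the authors' solver files.

```python
# Hyper-parameters as stated in the text (key names are illustrative).
SOLVER = {
    "base_lr": 1e-6,
    "step": 10_000,        # iterations between learning-rate drops
    "gamma": 0.1,          # divide by 10 at every step
    "momentum": 0.9,
    "weight_decay": 0.0002,
    "batch_size": 10,
    "max_iter": 40_000,
}


def learning_rate(iteration, cfg=SOLVER):
    """Step schedule: base_lr * gamma ** floor(iteration / step)."""
    return cfg["base_lr"] * cfg["gamma"] ** (iteration // cfg["step"])


for it in (0, 10_000, 20_000, 30_000):
    print(it, learning_rate(it))
```

Over the 40k iterations the rate therefore takes four values, from 1e-6 down to roughly 1e-9.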
Given an edge probability map, a threshold is needed to produce the edge image. There are two choices for setting this threshold. The first is referred to as optimal dataset scale (ODS), which employs a fixed threshold for all images in the dataset. The second is called optimal image scale (OIS), which selects an optimal threshold for each image. We use the F-measure ($\frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$) of both ODS and OIS in our experiments.
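The difference between ODS and OIS can be illustrated with a toy computation. The sketch below is a deliberate simplification: predictions and ground truth are flat lists, a predicted pixel counts as correct only if it coincides exactly with a ground-truth pixel (the benchmark instead matches edges within a localization tolerance after thinning), and counts are pooled over images before computing the F-measure. The function names are hypothetical.

```python
def counts(pred, gt, thr):
    """Return (true positives, predicted positives, ground-truth positives)."""
    tp = sum(1 for p, g in zip(pred, gt) if p >= thr and g)
    n_pred = sum(1 for p in pred if p >= thr)
    n_gt = sum(1 for g in gt if g)
    return tp, n_pred, n_gt


def f_measure(tp, n_pred, n_gt):
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gt
    return 2 * precision * recall / (precision + recall)


def ods_ois(preds, gts, thresholds):
    # ODS: a single threshold shared by all images.
    ods = 0.0
    for t in thresholds:
        per_image = [counts(p, g, t) for p, g in zip(preds, gts)]
        pooled = [sum(c[k] for c in per_image) for k in range(3)]
        ods = max(ods, f_measure(*pooled))
    # OIS: the best threshold is picked separately for each image.
    best = [max((counts(p, g, t) for t in thresholds),
                key=lambda c: f_measure(*c))
            for p, g in zip(preds, gts)]
    ois = f_measure(*[sum(c[k] for c in best) for k in range(3)])
    return ods, ois


preds = [[0.9, 0.2], [0.4, 0.1]]
gts = [[1, 0], [1, 0]]
ods, ois = ods_ois(preds, gts, thresholds=[0.15, 0.5])
```

In this example no single threshold suits both images, so ODS is 0.8, while choosing 0.5 for the first image and 0.15 for the second gives an OIS of 1.0.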
BSDS500 is a widely used dataset in edge detection. It is composed of 200 training, 100 validation and 200 test images, and each image is labeled by 4 to 9 annotators. We utilize the training and validation sets for fine-tuning, and the test set for evaluation. Data augmentation follows prior work. Inspired by previous work [39, 61, 31], we mix the augmented BSDS500 data with the flipped PASCAL VOC Context dataset as training data. When training, we set the loss parameters η and λ to 0.5 and 1.1, respectively. When evaluating, standard non-maximum suppression (NMS) is applied to thin detected edges. We compare our method with some non-deep-learning algorithms, including Canny, EGB, gPb-UCM, ISCRA, MCG, MShift, NCut, SE, and OEF, and some recent deep learning based approaches, including DeepContour, DeepEdge, HED, HFL, and MIL+G-DSN+MS+NCuts.
Fig. 5 shows the evaluation results. Human performance in edge detection on this benchmark is reported as 0.803 ODS F-measure. Both the single-scale and multiscale (MS) versions of RCF achieve better results than humans. Compared with HED, the ODS F-measures of our RCF-MS and RCF are 2.3% and 1.8% higher, respectively. The precision-recall curves of our methods are also higher than HED's. These significant improvements demonstrate the effectiveness of our richer convolutional features. All the conv layers contain helpful hierarchical information, not only the last one in each convolution stage.
|Method||ODS||OIS||FPS|
|Sketch Tokens||.727||.746||1|
We show a statistical comparison in Tab. 2. From RCF to RCF-MS, the ODS F-measure increases from 0.806 to 0.811, though the speed drops from 30 FPS to 8 FPS. This supports the validity of our multiscale strategy. We also observe an interesting phenomenon in which the RCF curves are not as long as those of other methods when evaluated using the default parameters of the BSDS500 benchmark. This may suggest that RCF tends to retain only very confident edges. Our methods also achieve better results than recent edge detectors such as RDS and CEDN. RDS uses relaxed labels and extra training data to retrain the HED network, and it improves the ODS F-measure by 0.4% compared with HED. In contrast, the ODS F-measure of our RCF method is 1.4% higher than that of RDS. This demonstrates that our improvement is not trivial or ad hoc.
We can see that RCF achieves the best trade-off between effectiveness and efficiency. Although MIL+G-DSN+MS+NCuts achieves slightly better accuracy than our methods, our RCF and RCF-MS are much faster. The single-scale RCF achieves 30 FPS, and RCF-MS still achieves 8 FPS. Note that our RCF network only adds some conv layers to HED, so its time consumption is almost the same as HED's. In addition, starting from HED, Kokkinos added several useful components to it, such as Multiple Instance Learning (MIL), G-DSN, multiscale, external training data from the PASCAL Context dataset, and Normalized Cuts. Our proposed methods are much simpler than that approach. Since our edge detectors are simple and efficient, it is easy to apply them in various high-level vision tasks.
The NYUD dataset is composed of 1449 densely labeled pairs of aligned RGB and depth images. Recently many works have conducted edge evaluation on it, such as [15, 59]. Gupta et al. split the NYUD dataset into 381 training, 414 validation and 654 testing images. We follow their settings and train our RCF network using the training and validation sets in full resolution, as in HED.
We utilize depth information by using HHA, in which depth information is encoded into three channels: horizontal disparity, height above ground, and angle with gravity. Thus HHA features can be represented as a color image. Then, two models for RGB images and HHA feature images are trained separately. We rotate the images and corresponding annotations to 4 different angles (0, 90, 180 and 270 degrees) and flip them at each angle. In the training process, the balancing parameter λ is set to 1.2 for both RGB and HHA. Since NYUD only has one ground truth for each image, the annotator threshold η is unused here. Other network settings are the same as those used for BSDS500. At test time, the final edge predictions are obtained by averaging the outputs of the RGB model and the HHA model. When evaluating, we increase the localization tolerance, which controls the maximum allowed distance in matches between predicted edges and ground truth, from 0.0075 to 0.011, because images in the NYUD dataset are larger than images in the BSDS500 dataset.
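The rotation-and-flip augmentation can be sketched as follows, assuming single-channel images stored as nested lists and a horizontal flip (the flip direction is not specified in the text). The helper names are hypothetical.

```python
def rotate90(img):
    """Rotate a 2-D list by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip_horizontal(img):
    return [row[::-1] for row in img]


def augment(image, annotation):
    """Yield 8 (image, annotation) pairs: 4 rotations, each with and without a flip."""
    for _ in range(4):
        yield image, annotation
        yield flip_horizontal(image), flip_horizontal(annotation)
        # The same transform is applied to both so they stay aligned.
        image, annotation = rotate90(image), rotate90(annotation)


pairs = list(augment([[1, 2], [3, 4]], [[0, 1], [0, 0]]))
```

Each training pair thus yields eight variants, and the annotation always undergoes exactly the same transform as its image.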
We only compare our single-scale version of RCF with some well-known competitors. OEF and gPb-UCM only use RGB images, while the other methods employ both depth and RGB information. The precision-recall curves are shown in Fig. 6. RCF achieves the best performance on the NYUD dataset, followed by HED. Tab. 3 shows a statistical comparison. We can see that RCF achieves better results than HED not only on separate HHA or RGB data, but also on the merged HHA-RGB data. For HHA and RGB data, the ODS F-measure of RCF is 2.4% and 1.2% higher than HED, respectively. For the merged HHA-RGB data, RCF is 1.6% higher than HED. Furthermore, HHA edges perform worse than RGB edges, but averaging HHA and RGB edges achieves much higher results. This suggests that combining different types of information is very useful for edge detection, and it may be the reason why OEF and gPb-UCM perform worse than other methods.
Recently, the Multicue dataset was proposed by Mély et al. to study psychophysics theory for boundary detection. It is composed of short binocular video sequences of 100 challenging natural scenes captured by a stereo camera. Each scene contains left-view and right-view short (10-frame) color sequences. The last frame of the left images for each scene is labeled with two kinds of annotations: object boundaries and low-level edges. Unlike people who usually use boundary and edge interchangeably, they strictly define boundary and edge according to visual perception at different stages. Boundaries refer to the boundary pixels of meaningful objects, and edges are pixels at which the luminance, color or stereo changes abruptly. In this subsection, we use boundary and edge as defined by Mély et al., whereas in previous sections the two terms have the same meaning.
As done in Mély et al. and HED, we randomly split these human-labeled images into 80 training and 20 test samples, and average the scores of three independent trials as final results. When training on Multicue, the balancing parameter λ is set to 1.1, and the annotator threshold η is set to 0.4 for the boundary task and 0.3 for the edge task. For the boundary detection task, we use learning rate 1e-6 and run SGD for 2k iterations. For the edge detection task, we use learning rate 1e-7 and run SGD for 4k iterations. We augment the training data as we do on the NYUD dataset. Since the image resolution of Multicue is very high, we randomly crop patches from the original images.
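The evaluation protocol of repeated random splits can be sketched as below. The seed, the helper names and the example scores are assumptions made for illustration; the actual splits used in the experiments are not specified in the text.

```python
import random
import statistics


def split_trials(n_images=100, n_train=80, n_trials=3, seed=0):
    """Yield (train_ids, test_ids) for each independent random split."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        ids = list(range(n_images))
        rng.shuffle(ids)
        yield ids[:n_train], ids[n_train:]


def summarize(scores):
    """Mean and sample standard deviation over the trials."""
    return statistics.mean(scores), statistics.stdev(scores)


splits = list(split_trials())
# Placeholder scores, one per trial, standing in for measured F-measures.
mean_score, spread = summarize([0.80, 0.82, 0.81])
```

Averaging over several splits reduces the dependence of the reported score on any single choice of the 20 test images.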
|Method||ODS||OIS|
|Human-Boundary ||.760 (.017)||–|
|Multicue-Boundary ||.720 (.014)||–|
|HED-Boundary ||.814 (.011)||.822 (.008)|
|RCF-Boundary||.817 (.004)||.825 (.005)|
|RCF-MS-Boundary||.825 (.008)||.836 (.007)|
|Human-Edge ||.750 (.024)||–|
|Multicue-Edge ||.830 (.002)||–|
|HED-Edge ||.851 (.014)||.864 (.011)|
|RCF-Edge||.857 (.004)||.862 (.004)|
|RCF-MS-Edge||.860 (.005)||.864 (.004)|
We show evaluation results in Tab. 4. Our proposed RCF achieves a higher ODS F-measure than HED on both tasks. For the boundary task, RCF-MS is 1.1% higher in ODS F-measure and 1.4% higher in OIS F-measure than HED. For the edge task, RCF-MS is 0.9% higher in ODS F-measure than HED. Note that the fluctuation of RCF is also smaller than that of HED, which suggests RCF is more robust over different kinds of images. Some qualitative results are shown in Fig. 7.
To further explore the effectiveness of our network architecture, we implement some mixed networks using VGG16 by connecting our richer feature side outputs to some convolution stages while connecting side outputs of HED to the other stages. With training only on the BSDS500 dataset and testing at a single scale, evaluation results of these mixed networks are shown in Tab. 5. The last two rows of this table correspond to HED and RCF, respectively. We can observe that all of these mixed networks perform better than HED and worse than RCF, in which every stage uses RCF side outputs. This clearly demonstrates the importance of our strategy of richer convolutional features.
|RCF Stage||HED Stage||ODS||OIS|
|1, 2||3, 4, 5||.792||.810|
|2, 4||1, 3, 5||.795||.812|
|4, 5||1, 2, 3||.790||.810|
|1, 3, 5||2, 4||.794||.810|
|3, 4, 5||1, 2||.796||.812|
|–||1, 2, 3, 4, 5||.788||.808|
|1, 2, 3, 4, 5||–||.798||.815|
To investigate whether including additional nonlinearity helps, we connect a ReLU layer after the 1×1 conv layers (21-channel or single-channel) in each stage. However, the network performs worse. In particular, when we add nonlinear layers to the single-channel 1×1 conv layers, the network cannot converge properly.
In this paper, we propose a novel CNN architecture, RCF, that makes full use of semantic and fine detail features to carry out edge detection. We carefully design it as an extensible architecture. The resulting RCF method can produce high-quality edges very efficiently, which makes it promising for application in other vision tasks. The RCF architecture can be seen as a development direction of fully convolutional networks, like FCN and HED. It would be interesting to explore the usefulness of our network architecture in other hot topics, such as salient object detection and semantic segmentation. Source code is available at https://github.com/yun-liu/rcf.
We would like to thank the anonymous reviewers for their useful feedback. This research was supported by NSFC (NO. 61572264, 61620106008), Huawei Innovation Research Program (HIRP), and CAST young talents plan.