Line as object: datasets and framework for semantic line segment detection

09/14/2019 ∙ by Yi Sun, et al.

In this work, we propose a learning-based approach to the task of detecting semantic line segments in outdoor scenes. Semantic line segments are salient edges enclosed by two endpoints on an image with apparent semantic information, e.g., the boundary between a building roof and the sky (see Fig. 1). Semantic line segments can be efficiently parameterized and fill the gap between dense feature points and sparse objects, acting as effective landmarks in applications such as large-scale High Definition Mapping (HDM). With no existing benchmarks, we have built two new datasets, carefully labeled by humans, that contain over 6,000 images of semantic line segments. Semantic line segments have different appearance and layout patterns that are challenging for existing object detectors. We propose a Semantic Line Segment Detector (SLSD) together with a unified representation and a modified evaluation metric to better detect semantic line segments. SLSD trained on our proposed datasets is shown to perform effectively and efficiently. We have conducted extensive experiments to demonstrate the semantic line segment detection task as a valid and challenging research topic.




I Introduction

Object detection has been an active research topic for decades. Deep learning models have dominated the object detection task during the last few years. Deep learning approaches are also being applied to image feature keypoint detection and descriptor extraction. While objects convey sparse, high-level semantic information, feature keypoints contain much denser yet low-level information. Semantic line segments lie in between. For example, given an image of a city scene, a semantic line segment can be the edge of a traffic light pole or the boundary between a building roof and the sky. Semantic line segments are more numerous than objects and carry richer semantic information than feature keypoints. They can also be easily vectorized in 2D and 3D space, saving considerable storage compared with dense feature keypoints. Semantic line segments are therefore ideal components for applications such as large-scale High Definition Mapping (HDM), which is an essential ingredient of emerging autonomous driving techniques.

We find little prior work that addresses the semantic line segment detection problem. There are some very recent works dealing with the wireframe detection problem [6, 19, 22, 25]. Wireframes consist of junctions and connected line segments. They can be as dense as feature points, yet carry no explicit semantic information. The line segment is another related element [18]. Similar to wireframes, line segments tend to be dense and fragmented, without semantics. A recent work has proposed semantic lines [9], which normally capture fewer than three major line structures, without explicit semantics, to parse the scene layout of a given image. In contrast, semantic line segments are those edges and boundaries of major salient contours with apparent semantic meaning. Semantic line segments are not anchored at junction points and vary more significantly in length. Consider two specific examples. The shadow projected onto a wall by a pole should be detected as a wireframe [25], but it is not taken as a semantic line segment. A road curb partially blocked by pedestrians and vehicles will be divided into several wireframes, while a semantic line segment detector should recover that curb as one single line segment. Different from objects, vertical and horizontal line segments have extreme width-to-height ratios, which puts anchor-box-based detectors at a disadvantage. Fig. 1 shows more examples of semantic line segments, line segments [18], wireframes [19], and semantic lines [9] on city scene images from the KAIST urban dataset [7]. Line segment and wireframe detectors tend to recover more fragmented line segments.

No existing open-source dataset is found suitable for the semantic line segment detection problem. We present two new benchmark datasets, namely, KITTI-SLS and KAIST-SLS. Both datasets contain outdoor scenes. KITTI-SLS covers mainly rural areas in Europe, while KAIST-SLS is recorded in Korean cities.

To this end, we propose an anchor-free Semantic Line Segment Detector (SLSD). We present a general representation which unifies the semantic line segment and object detection tasks. A Bi-Atrous module is devised to better handle vertical and horizontal lines. We also propose a gradient-based refinement step to further improve localization accuracy. We empirically show that Intersection over Union (IoU) is not discriminative enough for different line segment overlapping situations. A new metric, namely, Angle Center Length (ACL), is presented to achieve the desired measurement.

We conduct extensive experiments on the two benchmarks. Results show that SLSD is effective in solving the semantic line segment detection task. Ablation experiments have been conducted to evaluate gains from different modifications. Last but not least, with our proposed general representation, we attempt to expand SLSD into a unified detector that solves the semantic line segment and object detection tasks simultaneously. Unified SLSD provides performance comparable to single-task detectors and solves the two tasks with no computational overhead. This benefits applications in intelligent robotics, where real-time processing of input signals (i.e., images and videos) on embedded devices is a must.

Our main contributions are fourfold:

  • To the best of our knowledge, we are the first to address the semantic line segment detection problem. We provide two new benchmark datasets KITTI-SLS and KAIST-SLS.

  • We propose SLSD as an effective semantic line segment detector with a Bi-Atrous module and a modified evaluation metric.

  • We thoroughly evaluate the effectiveness and efficiency of SLSD and demonstrate semantic line segment as a challenging and valid research topic.

  • We make an initial attempt to build a unified SLSD for object and semantic line segment detection simultaneously and show its potential.

Fig. 1: Illustration of different line-shaped elements in city-scene images. 1st row: semantic line segments. 2nd row: LSD. 3rd row: wireframes by [19]. 4th row: semantic lines by [9]

II Related work

II-A Edge detection and line detection

Line detection from images can be traced back to the early 1960s. The Hough Transform was designed to recognize complex patterns of particle tracks on photographs from bubble chambers. A decade later, a modified Hough Transform was proposed to locate lines in images. The Hough Transform, however, tends to detect lines rather than line segments. Compared with line detection, edge detection aims to extract structural information that satisfies certain criteria such as sharp gradient changes. Edges can be lines, curves, circles, etc. The Canny edge detector [1], proposed in the late 1980s, has been one of the most popular edge detectors owing to its robustness and simplicity. An edge detector provides dense and detailed detections from an image without any semantics. Line segment detection extracts higher-level information than edge detection. It can be achieved by an edge detector followed by a Hough Transform to filter out non-straight structures. The Line Segment Detector (LSD) is a more recent detector specialized for line segments. It has several characteristics that favor implementation, such as few parameters, sub-pixel accuracy, and low computational complexity. However, detected line segments are still dense and without semantic meaning.

Our proposed semantic line segment detector is a learning-based detector. It will ignore those small line segments with ambiguous semantics and provide detections with selected and interpretable information.

II-B Semantic line detection

The semantic line detection problem is proposed in [9]. The authors define semantic lines as those “significant lines” that separate major semantic regions in a given image. All semantic lines are end-to-end, i.e., both end points are located on the image boundaries. Semantic lines are not necessarily obvious lines but rather arbitrary boundaries, e.g., the horizon of a city scene. A semantic line detector, SLNet, is proposed to accomplish the task. SLNet contains a VGG16 [17] backbone to extract features from the input image. A line pooling layer is presented to get interpolated features from the surrounding region of each line candidate. Fully connected layers follow to classify whether each line candidate is a semantic line or not. Note that the semantics of such a “semantic” line are only implicit: SLNet only tells whether a line candidate is a semantic line, but outputs no interpretable meaning. Candidate lines are generated by enumerating two end point locations with a certain step size. Semantic lines are extremely sparse. One image has only about one to three lines, which limits the semantics to the scene level rather than the component level. This also prevents semantic lines from being used as a type of image feature. Since semantic lines are arbitrary lines, their parameterization has no physical meaning and cannot be projected into 3D space. Such a property restricts their application to coarse image layout understanding and makes them unsuitable for our applications. Our semantic line segments, on the contrary, are all real physical edges or boundaries. They are denser, with valid parameterization. Semantic line segments in 2D space convey mid-level semantic information, acting as an important complement to feature points and objects. Each semantic line segment has an explicit semantic meaning, e.g., pole, curb, etc. Our proposed detector works in an end-to-end style to output the locations and categories of all semantic line segments in a single forward pass.

II-C Wireframe parsing

Wireframe parsing has been an emerging topic during the past few years. It was first proposed in [6] and is designed to locate “wireframes”, consisting of junctions and connected line segments, in images of man-made environments. [19, 22] propose a two-stage approach that first predicts an intermediate heat map and then applies a heuristic algorithm to extract line segments. In [25], the authors propose L-CNN as an end-to-end solution to directly output vectorized wireframes containing geometrically salient junctions and line segments. Wireframe detection is mainly used for 3D reconstruction of environments. It focuses on finding all possible line segments satisfying pixel-level criteria without providing an interpretable semantic meaning for each segment. An image can have a large number of wireframes, and many are fragmented and easily occluded as the viewpoint moves. Wireframes are demonstrated to be effective elements for 3D reconstruction, yet their density and lack of semantics prevent them from being a good choice for applications such as city-scale High Definition Mapping (HDM).

II-D Object detection

Object detection aims at detecting objects in images or videos. Depending on the application, objects of interest can vary significantly. In robotic mapping, object detection is normally used to separate regions with dynamic and salient objects in order to, for example, assist in selecting reliable feature points. On the other hand, objects are three-dimensional and hard to parameterize accurately from image-level detections. Moreover, salient objects in city scenes can be sparse or highly repetitive, making them infeasible to use as landmarks on their own in mapping applications such as loop detection.

In terms of representation, both an object and a semantic line segment are enclosed by an external rectangle, or bounding box. We leverage this similarity to design a unified representation that allows efficient semantic line segment detection while being compatible with the object detection task.

III New semantic line segment datasets

To support training of our learning-based semantic line segment detector, we propose two new datasets, KITTI-SLS and KAIST-SLS. As semantic line segments are efficient landmarks for outdoor mapping, especially of city scenes, KITTI [4] and KAIST URBAN [7] are two comprehensive and relevant source datasets. We have selected 13 video sequences of city scenes from KITTI, containing 2,324 frames in total, as KITTI-SLS. As for KAIST-SLS, we have picked sequence 39, whose recording route is in Pangyo, Korea. The sequence was recorded at 10 fps and we extracted 2 frames every second to collect 3,729 images. Fig. 2 shows examples from KAIST-SLS (left column) and KITTI-SLS (right column). Statistics are shown in Table I. We define 14 different categories of semantic line segment, and the four most representative ones are listed in the table. There are 45,403 and 77,779 labels in total in KITTI-SLS and KAIST-SLS, respectively. KITTI-SLS has around 20 labels per image on average while KAIST-SLS has 22. As KITTI-SLS covers mostly rural areas in Europe and KAIST-SLS is from Korean city scenes, their labels vary significantly. For example, there are only 2,237 building labels in KITTI-SLS, about one-tenth of those in KAIST-SLS.

Fig. 2: Examples from KAIST-SLS (Left column) and KITTI-SLS (Right column) datasets.
Category      KITTI-SLS  KAIST-SLS
building      2,237      22,898
pole          18,888     24,622
curb          2,690      6,894
grass         4,771      1,174
Total         45,403     77,779
labels/image  19.54      22.71
TABLE I: Statistics of KITTI-SLS and KAIST-SLS

For each semantic line segment, we labeled its two endpoints, the line enclosed, and its category. We make no assumptions, such as the Manhattan world assumption, on our labels. For pillar-shaped poles and trees, there are two parallel or nearly parallel segments. We did our best to label both of them. But for those poles or trees that are too far away and hard to distinguish, we labeled only one segment.

IV Semantic Line Segment Detection

Fig. 3: General workflow of our proposed semantic line segment detector.
Fig. 4: An illustration of the atrous module. Left: demonstration of different convolutions. First row: a normal convolution (left) and a regular atrous convolution (right). Second row: our proposed VerAtrous (left) and HorAtrous (right). Right: multi-path atrous module.

Deep Convolutional Neural Networks (CNNs) have shown convincing performance in the object detection task. Recent works on anchor-free detectors have pushed the speed/performance trade-off to a new level. In this paper, we propose SLSD to accomplish the semantic line segment detection task. In contrast to existing object detectors, we design a general representation which is compatible with both semantic line segments and objects.

Inspired by [24], we devise an anchor-free CNN with Bi-Atrous module for better performance. A line gradient based post-processing mechanism is implemented to provide more accurate segment localization. We also design a new metric called Angle Center Length (ACL) as a replacement of IoU to better measure different line segment overlapping situations.

IV-A General representation

Existing wireframe parsing networks normally predict junction positions and line heat maps, then combine them to get the wireframe predictions [6, 25]. Such a junction-segment representation is specialized for wireframes and is not suitable for our task. Semantic line segments do not necessarily intersect at junction points. Their end points may even be ambiguous under some circumstances. Therefore, instead of focusing on extreme points, we need to design a representation that puts more stress on the length and angle of line segments.

To accomplish the design target, we follow a simple and straightforward observation: each semantic line segment can be taken as the diagonal, with a designated direction, of its boundary rectangle. A boundary rectangle can also be taken as a tightly enclosing bounding box. Without loss of generality, boundary rectangle and bounding box will be used interchangeably. Let $L$ denote a semantic line segment on an image and $B$ denote its bounding box. We have $B = (p, c)$, where $p$ denotes the location and size of $B$. In our implementation, we use $p = (x, y, w, h)$, where $(x, y)$ is the bounding box center coordinate, and $w$ and $h$ are the width and height of $B$, respectively. $c$ is the index of the semantic category, out of $C$ categories in total. Since $L$ shares the same $p$ and $c$ with $B$, we define $L$ as $L = (p, c, d)$, where $d$ represents its direction. There are two directions, left-top to right-bottom or left-bottom to right-top, and $d$ takes one value for the direction from left-top to right-bottom and another value for the other.

The only difference between $B$ and $L$ is the direction $d$. By expanding $B$ with an additional, arbitrary direction value, so that $d$ becomes an extra attribute of $B$, we further define a general representation $(p, c, d)$ for semantic line segments and objects. Hence, we can use this single representation to express both an object $B$ and a semantic line segment $L$.
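As a minimal sketch of this representation, the following Python snippet encodes a segment given by two endpoints into a center, size, category, and direction tuple, and decodes it back. The names `GeneralRep`, `encode_segment`, and `decode_segment`, and the direction codes 0 and 1, are assumptions made for illustration; the values the authors assign to the direction are not recoverable from the text. Image coordinates are assumed, with y growing downward.

```python
from typing import NamedTuple, Tuple

Point = Tuple[float, float]

# Hypothetical direction codes (the paper's actual values are not known here).
DIR_TL_BR = 0  # left-top to right-bottom
DIR_BL_TR = 1  # left-bottom to right-top


class GeneralRep(NamedTuple):
    x: float        # bounding box center x
    y: float        # bounding box center y
    w: float        # bounding box width
    h: float        # bounding box height
    category: int   # semantic category index
    direction: int  # meaningful for segments, arbitrary for objects


def encode_segment(p1: Point, p2: Point, category: int) -> GeneralRep:
    """Encode a segment as the directed diagonal of its bounding box."""
    # Sort by x (then y) so that (x1, y1) is the left endpoint.
    (x1, y1), (x2, y2) = sorted([p1, p2])
    w, h = x2 - x1, abs(y2 - y1)
    # y grows downward: a left endpoint with smaller y is the left-top corner.
    direction = DIR_TL_BR if y1 <= y2 else DIR_BL_TR
    return GeneralRep((x1 + x2) / 2, (y1 + y2) / 2, w, h, category, direction)


def decode_segment(rep: GeneralRep) -> Tuple[Point, Point]:
    """Recover (left endpoint, right endpoint) from the representation."""
    left, right = rep.x - rep.w / 2, rep.x + rep.w / 2
    top, bottom = rep.y - rep.h / 2, rep.y + rep.h / 2
    if rep.direction == DIR_TL_BR:
        return (left, top), (right, bottom)
    return (left, bottom), (right, top)
```

An object would use the same tuple with an arbitrary direction value, which is what makes the representation shared between the two tasks.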

IV-B Semantic Line Segment Detector (SLSD)

IV-B1 Network design

Upsampling before prediction heads has been widely adopted in semantic segmentation since U-Net [16]. The work on stacked hourglass networks [13] further promoted the adoption of such hourglass-like structures in object detectors [8, 23, 10]. With each feature map position representing one candidate object keypoint (a selected corner or center), anchor-free detectors such as [8, 24] benefit especially.

The overall structure of SLSD is depicted in Fig. 3. As anchor-based detectors support only a limited range of width-height ratios, which does not fit vertical and horizontal line segments, we build SLSD on a recent anchor-free detector, CenterNet [24]. We select CenterNet for its structural simplicity, state-of-the-art performance, and verified flexibility for multiple tasks.

The hourglass-like structure of SLSD includes a backbone for spatial feature extraction and a feature refinement network to a) aggregate multi-scale features and b) upsample intermediate feature maps to provide a final feature map for the prediction heads. Given an input image $I$ of width $W$ and height $H$, the final feature map $F$ has spatial size $\frac{W}{R} \times \frac{H}{R}$ and $D$ channels, where $R$ is the output stride. We set $R$ to be the same as in the literature [2, 14, 13]. We omit the batch size from the dimensions for simplicity.

IV-B2 Bi-Atrous Module

Atrous convolution was originally proposed for the semantic segmentation task [20]. By adjusting the dilation rate, it expands the receptive field without introducing extra computation.

In semantic line segment detection, we have found that vertical and horizontal line segments are typically challenging due to their extreme aspect ratios. To deal with such segments, we propose Vertical Atrous (VerAtrous) and Horizontal Atrous (HorAtrous) convolution. As shown in Fig. 4, we use different dilation rates in the vertical and horizontal directions. VerAtrous uses a larger rate along the y-axis than along the x-axis, whilst HorAtrous is implemented in the reverse way. VerAtrous is designed to collect a larger receptive field along the y-axis and a smaller one along the x-axis to better fit vertical or near-vertical line segments. HorAtrous is designed similarly for horizontal segments.

We further build a Bi-Atrous module, which consists of a pair of VerAtrous and HorAtrous convolutions together with a deformable convolution [3]. The Bi-Atrous module is designed to extract and aggregate multi-scale and multi-aspect-ratio features without changing the feature map dimension. It is placed right before the upsampling layer, i.e., bilinear upsampling or transpose convolution, in the feature refinement network.

Specifically, we use two types of Bi-Atrous module, namely, BAM51 and BAM33, which differ in the kernel sizes and dilation rates of their VerAtrous and HorAtrous convolutions. In BAM51, VerAtrous and HorAtrous use different kernel sizes, whereas in BAM33 both use the same kernel size. In both types, the padding of each convolution is chosen to keep the input dimension. The deformable convolution uses stride 1 and padding 1.
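The padding that keeps the feature map size unchanged follows from the standard convolution output-size formula. The sketch below computes it per axis for a stride-1 convolution with an odd kernel. The kernel sizes and dilation rates in the usage note are hypothetical placeholders, not the settings of BAM51 or BAM33, and the function names are chosen for illustration.

```python
from typing import Tuple


def same_padding(kernel: int, dilation: int) -> int:
    """Padding per side that preserves size for stride 1 and an odd kernel."""
    if kernel % 2 == 0:
        raise ValueError("kernel size must be odd")
    return dilation * (kernel - 1) // 2


def output_size(size: int, kernel: int, dilation: int, padding: int,
                stride: int = 1) -> int:
    """Standard convolution output-size formula along one axis."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


def axis_padding(kernel_hw: Tuple[int, int],
                 dilation_hw: Tuple[int, int]) -> Tuple[int, int]:
    """Return (padding_h, padding_w) for per-axis kernels and dilation rates."""
    return tuple(same_padding(k, d) for k, d in zip(kernel_hw, dilation_hw))
```

For example, a hypothetical vertical-style convolution with kernel (5, 1) and dilation (2, 1) needs padding (4, 0): `output_size(128, 5, 2, 4)` is 128 along the height and `output_size(128, 1, 1, 0)` is 128 along the width.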

Each sub-path has a batch normalization and a ReLU activation. To aggregate features from the three sub-paths in a Bi-Atrous module, we apply either regular pixel-wise summation or a trainable weighted summation. Let $F_v$, $F_h$, and $F_d$ denote the output feature maps from VerAtrous, HorAtrous, and the deformable convolution. The module output feature map $F_{out}$ is calculated as follows:

$$F_{out} = w_v F_v + w_h F_h + w_d F_d$$

where the weights are fixed under regular summation mode, so that the output is the plain pixel-wise sum. For trainable weighted mode, $w_v$, $w_h$, and $w_d$ are trainable variables that are subject to constraints during training.

IV-B3 Prediction heads

We introduce four prediction heads, namely, the heat map head, the width and height head, the offset head, and the direction head. The first three heads work similarly to those in [24], and the direction head is specialized for SLSD.

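The heat map head predicts a segment center point heat map $\hat{Y} \in [0, 1]^{\frac{W}{R} \times \frac{H}{R} \times C}$ for an input image of width $W$ and height $H$, where $R$ is the output stride and the number of channels $C$ is also the number of semantic categories. For each ground truth center point $p$, its equivalent position on $\hat{Y}$ is calculated by $\tilde{p} = \lfloor p / R \rfloor$. The ground truth heat map $Y$ is generated by assigning values according to a Gaussian kernel, $Y_{xyc} = \exp\left(-\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2}\right)$, where $\sigma_p$ is a self-adaptive standard deviation with respect to the size of the segment. Only at the center coordinate of a ground truth line segment with category $c$ do we have $Y_{xyc} = 1$.

The width and height head predicts the width and height of bounding boxes centered at each feature map location.

The offset head tries to recover the error introduced by the output stride $R$ and the floor operation between $p$ and $\tilde{p}$.

The direction head is designed specifically for semantic line segment detection. It predicts the direction of all semantic line segments.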
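The ground truth heat map generation described above can be sketched in plain Python as follows. This is an illustration under stated assumptions: a single fixed `sigma` is used instead of the self-adaptive standard deviation, overlapping Gaussians are merged with an element-wise maximum (as is done in CenterNet [24]), and the function name and argument layout are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def gaussian_heatmap(height: int, width: int,
                     centers: Sequence[Tuple[float, float]],
                     stride: int, sigma: float) -> List[List[float]]:
    """Build one category channel of a ground truth center heat map.

    centers: (x, y) segment centers in input-image pixels.
    stride:  output stride between the input image and the heat map.
    sigma:   fixed standard deviation (the paper adapts it per segment).
    """
    heat = [[0.0] * width for _ in range(height)]
    for cx, cy in centers:
        # Map the center to feature-map coordinates with a floor operation.
        px, py = int(cx // stride), int(cy // stride)
        for y in range(height):
            for x in range(width):
                dist_sq = (x - px) ** 2 + (y - py) ** 2
                value = math.exp(-dist_sq / (2 * sigma ** 2))
                # Keep the larger value where Gaussians overlap.
                heat[y][x] = max(heat[y][x], value)
    return heat
```

With `stride=4`, a center at (18.0, 10.0) lands on heat map cell (x=4, y=2), which receives the value 1, and the values decay with distance from that cell.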

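IV-B4 Loss function

The loss function consists of four parts:

$$L = L_{k} + \lambda_{s} L_{s} + \lambda_{o} L_{o} + \lambda_{d} L_{d}$$

where $L_k$, $L_s$, $L_o$, and $L_d$ are the losses from the heat map, width and height, offset, and direction heads, respectively. $\lambda_s$, $\lambda_o$, and $\lambda_d$ are the corresponding weights.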

As ground truth can be only or , calculates a per-pixel logarithmic loss. For simplicity, we define and :


is then given by:


where and are focusing parameters [11]. We set and [8] throughout our experiments.
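For reference, the penalty-reduced focal loss used for center heat maps in CornerNet [8] and CenterNet [24] can be sketched as below. Whether SLSD uses exactly this form is an assumption; the defaults `alpha=2` and `beta=4` are the settings used in CornerNet [8]. Heat maps are passed as flattened lists, and the function name is hypothetical.

```python
import math
from typing import Sequence


def focal_loss(pred: Sequence[float], target: Sequence[float],
               alpha: float = 2.0, beta: float = 4.0,
               eps: float = 1e-12) -> float:
    """Penalty-reduced pixel-wise focal loss over a flattened heat map.

    pred:   predicted scores in [0, 1].
    target: Gaussian ground truth values in [0, 1]; exactly 1 at centers.
    """
    pos_loss = 0.0
    neg_loss = 0.0
    num_pos = 0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        if y == 1.0:
            pos_loss += (1.0 - p) ** alpha * math.log(p)
            num_pos += 1
        else:
            # (1 - y)^beta reduces the penalty near ground truth centers.
            neg_loss += (1.0 - y) ** beta * p ** alpha * math.log(1.0 - p)
    # Normalize by the number of center points (at least 1).
    return -(pos_loss + neg_loss) / max(num_pos, 1)
```

A false positive close to a true center (target 0.9) is penalized far less than the same score in the background (target 0), which is the purpose of the `beta` term.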

IV-B5 From representation to semantic line segment

During inference, a feature map position is said to be a center point for a semantic line segment if and only if . Given as detected center point, the correspondent detection results are , . We define :


Semantic line segment end points can then be recovered by:
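One way to sketch this recovery step in Python, assuming the offset is expressed in feature-map cells, the width and height are expressed in input-image pixels, and direction code 0 means left-top to right-bottom (all three are assumptions, as is the function name):

```python
from typing import Tuple

Point = Tuple[float, float]


def decode_endpoints(ix: int, iy: int,
                     offset: Tuple[float, float],
                     size: Tuple[float, float],
                     direction: int,
                     stride: int) -> Tuple[Point, Point]:
    """Recover segment endpoints from a detected center on the feature map.

    ix, iy:    integer center location on the feature map.
    offset:    predicted sub-cell offset (assumed to be in feature-map cells).
    size:      predicted (width, height) (assumed to be in image pixels).
    direction: 0 for left-top to right-bottom, 1 otherwise (assumed codes).
    stride:    output stride between the input image and the feature map.
    """
    # Undo the floor operation with the offset, then scale back to the image.
    cx = (ix + offset[0]) * stride
    cy = (iy + offset[1]) * stride
    w, h = size
    left, right = cx - w / 2, cx + w / 2
    top, bottom = cy - h / 2, cy + h / 2
    if direction == 0:
        return (left, top), (right, bottom)
    return (left, bottom), (right, top)
```

For instance, a center at feature-map cell (4, 2) with offset (0.5, 0.5), size (20, 8), direction 1, and stride 4 maps to the image center (18, 10) and the endpoints (8, 14) and (28, 6).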


IV-C ACL

IoU is widely adopted as a metric to measure the overlap of two object bounding boxes. As noted in Section IV-A, a semantic line segment can also be considered as the diagonal of its bounding box. Therefore, we may directly apply IoU to calculate the overlap of two semantic line segments. However, the overlapping situations of two line segments are more complicated than those of bounding boxes, and we empirically show that IoU is not discriminative enough for line segments. The first row of Fig. 5 (best viewed in color) shows an example of two overlapping line segments to illustrate the relationship between line segment and bounding box. The second row lists three different overlapping cases with the same IoU value (0.6). Clearly, we would like the overlap measurement to differ among these three cases: Case 3 should have the highest value, while Case 2 is the least overlapped.
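A small Python sketch makes the limitation concrete. It computes the IoU of the bounding boxes of two segments; the two diagonals of the same box then obtain a perfect IoU of 1 even though they cross at an angle. The function names are hypothetical, and the snippet only illustrates bounding box IoU, not the proposed metric.

```python
from typing import Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


def segment_bbox(seg: Segment) -> Tuple[float, float, float, float]:
    """Tight axis-aligned bounding box (x_min, y_min, x_max, y_max)."""
    (x1, y1), (x2, y2) = seg
    return min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)


def segment_iou(seg_a: Segment, seg_b: Segment) -> float:
    """IoU of the bounding boxes of two segments."""
    ax1, ay1, ax2, ay2 = segment_bbox(seg_a)
    bx1, by1, bx2, by2 = segment_bbox(seg_b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Perfectly vertical or horizontal segments have zero-area boxes.
    return inter / union if union > 0 else 0.0


diag_a = ((0.0, 0.0), (10.0, 10.0))   # left-top to right-bottom
diag_b = ((0.0, 10.0), (10.0, 0.0))   # left-bottom to right-top
same_box_iou = segment_iou(diag_a, diag_b)
```

Here `same_box_iou` equals 1.0 although the two segments are perpendicular, so IoU alone cannot tell them apart.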

We design ACL as a replacement for IoU to achieve desired overlapping measurement. IoU calculation is same as bounding box IoU and we will omit details. ACL is calculated based on the difference between angle, center coordinate, and length of two line segments. Given two semantic line segments represented by , where , denote the coordinate of two end points. denotes the semantic category index.

ACL similarity is defined as follows:


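where the three terms are the center coordinate similarity, length similarity, and angle similarity, respectively.

The center coordinate similarity is given by:


where the two points involved are the center point coordinates of the two line segments.

The length similarity is given by:


where the two values involved are the lengths of the two line segments.

The angle similarity is given by: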




To further demonstrate the effectiveness of ACL, we show seven different overlapping cases of two line segments in Fig. 6 (best viewed in color). Taking angle (a), center (c), and length (l) as the three elements that describe the position of a line segment, each case has a unique combination of shared and differing elements. While all cases share the same IoU value (0.6), they return different ACL values. The ACL value tends to be higher when more elements are the same and lower when fewer are the same. This empirically verifies that ACL is a more desirable metric than IoU for measuring semantic line segment overlap.
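The three elements can be computed directly from the endpoints. The sketch below extracts the angle, center, and length of a segment and reports the raw differences between two segments. It deliberately stops short of combining them into a similarity score, since the exact ACL formulas are not reproduced here; the function names and the choice of degrees in [0, 180) for the angle are assumptions.

```python
import math
from typing import Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


def segment_elements(seg: Segment) -> Tuple[float, Point, float]:
    """Return (angle in degrees within [0, 180), center point, length)."""
    (x1, y1), (x2, y2) = seg
    center = ((x1 + x2) / 2, (y1 + y2) / 2)
    length = math.hypot(x2 - x1, y2 - y1)
    # A segment has no orientation sign, so fold the angle into [0, 180).
    angle = math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0
    return angle, center, length


def element_differences(seg_a: Segment,
                        seg_b: Segment) -> Tuple[float, float, float]:
    """Return (angle difference, center distance, length difference)."""
    angle_a, center_a, length_a = segment_elements(seg_a)
    angle_b, center_b, length_b = segment_elements(seg_b)
    d_angle = abs(angle_a - angle_b)
    d_angle = min(d_angle, 180.0 - d_angle)  # smallest angle between lines
    d_center = math.hypot(center_a[0] - center_b[0],
                          center_a[1] - center_b[1])
    d_length = abs(length_a - length_b)
    return d_angle, d_center, d_length
```

For the two diagonals of the same square box, the center and length differences are zero while the angle difference is 90 degrees, which is exactly the kind of case a bounding box overlap cannot distinguish.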

Fig. 5: First row: an example of two semantic line segments used to calculate IoU and ACL. For simplicity, we draw the angle, center point, and length of only one line segment. Second row: three cases of two overlapping line segments whose IoU values are the same (0.6).
Fig. 6: Different overlapping cases of two line segments. All have the same IoU value (0.6). The ACL value is shown to discriminate better.

V Experiment results

V-1 Evaluation metric

To better fit the semantic line segment detection task and the proposed datasets, we design two metrics based on MS COCO mAP [12]: an all-category metric and a three-category metric.

Similar to mAP, we average precisions over all categories. But the all-category metric adopts ACL instead of IoU, since IoU has been demonstrated in Sec. IV-C to be ineffective in distinguishing different line segment overlaps. We calculate precision over ten ACL thresholds. A restriction on the confidence score is imposed by the application and the datasets. Semantic line segments are particularly useful in mapping-related applications, where detections are required to have a certain level of confidence. Moreover, since both KITTI-SLS and KAIST-SLS have around 20 labels per image, there is no need to keep a large number of detections per image. The three-category metric considers only the three categories with the most occurrences, namely, pole, building, and curb.

V-A Dataset arrangement

We conduct all experiments on KITTI-SLS and KAIST-SLS datasets. We split all sequences (13 of KITTI-SLS and 1 of KAIST-SLS) from both datasets into training and testing with a ratio.

As for evaluating SLSD as a unified detector for object and semantic line segment, we have labeled images from KAIST-SLS with the following objects: car, bus, person, and traffic light.

V-B Implementation and results

SLSD effectiveness We first demonstrate the effectiveness of SLSD on the semantic line segment detection task. Table II shows the all-category and three-category results on the KITTI-SLS dataset. SLSD with a DLA-34 [21] backbone is implemented as a universal baseline. Compared with LSD [18], our baseline approach raises the all-category metric from 0.4375 to 0.7174. In terms of the three-category metric, as pole, building, and curb all have more regular and obvious edges, LSD performs much better at 0.7667. Our baseline still outperforms LSD by over 0.06, at 0.8297. Note that all LSD detections are without semantics. The Bi-Atrous module is also shown to further boost SLSD performance. dla34_BAM51 and dla34_BAM33 apply the two proposed types of Bi-Atrous module, respectively. BAM51 provides a gain of 0.0160 in the all-category metric and 0.0297 in the three-category metric, whereas BAM33 reaches 0.7643 in the all-category metric, nearly 0.05 higher than the baseline. With trainable weighted summation as feature aggregation in the Bi-Atrous module (_wSum), there are marginal improvements in both metrics. Some detection results from LSD and SLSD with different Bi-Atrous modules are shown in Fig. 7. Failures are found to be mostly due to inaccurate prediction of end point locations. These results demonstrate SLSD with the Bi-Atrous module as an effective solution to the challenging semantic line segment detection task.

Model       All-category metric  Three-category metric
LSD         0.4375               0.7667
baseline    0.7174               0.8297
BAM51       0.7334               0.8594
BAM51_wSum  0.7349               0.8630
BAM33       0.7643               0.8631
TABLE II: Results from SLSD with different Bi-Atrous modules and a dla34 backbone on the KITTI-SLS dataset.
Fig. 7: Sample results from LSD and different SLSD with Bi-Atrous modules. Row 1 - ground truth labels; Row 2 - LSD results; Row 3 - baseline model results; Row 4 - BAM51; Row 5 - BAM33

SLSD efficiency

In order to be deployed in real applications for online robotic mapping, SLSD needs to be not only accurate but also fast. We evaluate SLSD runtime on a server with a single Nvidia GTX 1080Ti and an i7 CPU. SLSD is implemented in PyTorch [15]. Results are shown in Table III. Three backbones are tested on KITTI-SLS, namely, resnet-18 [5], dla34, and hourglass-104 [13]. resnet-18 has the lowest computational complexity but also the lowest scores. It runs at about 164 fps (6.1 ms/image). The hourglass-104 backbone provides the best performance at the cost of running at only 36 fps (27.6 ms/image). dla34_TX2 is a version compatible with the Jetson TX2 platform, with the deformable convolution replaced by an ordinary convolution. It runs at 75 fps, with a loss of over 0.07 in the all-category metric compared with dla34. We ported dla34_TX2 to the TX2 platform in ONNX format, and with TensorRT acceleration it remains fast enough for near-real-time processing. SLSD is hence demonstrated to be effective and efficient for the semantic line segment detection task.

Models        All-category metric  Three-category metric  fps
resnet-18     0.4918               0.7755                 164
dla34         0.7174               0.8297                 51
dla34_TX2     0.6406               0.8390                 75
hourglass104  0.7353               0.8788                 36
TABLE III: SLSD with different backbones tested on the KITTI-SLS dataset.

Unified detector Last but not least, we go one step further and try to directly implement SLSD as a unified detector for both objects and semantic line segments. The dataset is the first obstacle. We need data with both semantic line segment and object labels to train SLSD. Currently we have labeled a subset of images from KAIST-SLS with both kinds of labels, part of which is used for training and the rest for testing. A larger amount of labeled data is definitely required to draw any solid conclusions. With the current amount of data, we present the results to conceptually verify the feasibility of unified SLSD rather than to evaluate it quantitatively. We show our results in Table IV.
“Train: obj” uses only object labels to train an object detection model, with each object represented by its bounding box (Sec. IV-A).
“Train: line” uses only semantic line segment labels for an SLSD model, with each segment represented by its bounding box plus direction.
“Train: obj-line” uses both types of labels to train a unified SLSD, with the general representation used for both objects and semantic line segments. Only the all-category metric is used, since object labels do not include the three selected categories of the three-category metric. The unified detector is tested on all types of labels: obj-line, obj, and line. We mainly compare unified SLSD with the single-task detectors to see whether it is able to retain a similar level of performance. In terms of semantic line segment detection, the original SLSD achieves 0.2872, whereas unified SLSD slightly outperforms it at 0.3061. On the other hand, the object detector surpasses unified SLSD by over 0.10 (0.3281 versus 0.2257). Extra object information helps unified SLSD to better detect semantic line segments. But the additional direction attribute in the object representation seems to deteriorate the object detection results of unified SLSD.

Train labels  Test labels  All-category metric
obj-line      obj-line     0.2643
obj-line      obj          0.2257
obj           obj          0.3281
obj-line      line         0.3061
line          line         0.2872
TABLE IV: Test of SLSD as a unified detector for both objects and semantic line segments.

Vi Conclusion

In this work, we propose a systematic solution for semantic line segment detection problem.

KITTI-SLS and KAIST-SLS with over images and labels are proposed to fill the blank of semantic line segment datasets. We design an object-compatible representation for semantic line segment so that it is possible to solve both detections by a single unified detector. SLSD is proposed as an anchor-free semantic line segment detector. To further improve its performance, we device Bi-Atrous module which consists of a VerAtrous, HorAtrous, and a deformable convolution. IoU is demonstrated to be insufficient in discriminating various line segment overlaps. We propose ACL to achieve desired overlapping measurement. We utilize two modified metrics and to demonstrate the effectiveness and efficiency of SLSD in semantic line segment detection. Unified SLSD with the general representation on solving object and semantic line segment detection simultaneously is also conceptually validated.

There are still defections. First of all, SLSD tends to provide more accurate predictions at line segment center regions than end points. The reason can be that the SLSD focuses on predicting center coordinates. End points are calculated indirectly and we impose no extra loss on them. Secondly, the down ratio restricts the resolution of center coordinates. When two semantic line segments of the same category have their center point coordinates difference smaller than , the resolution of final feature map will force them to the same centers. Temporal consistency is another issue. Semantic line segments have different layout patterns and pixel-level features from objects. As static edges or boundaries, their displacement model can be rather simple. Existing literature on video object detection may not directly fit for video semantic line segment detection.

Our attempts demonstrate that semantic line segment detection is a challenging and valid research topic.


  • [1] John Canny. A computational approach to edge detection. In Readings in computer vision, pages 184–203. Elsevier, 1987.
  • [2] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7291–7299, 2017.
  • [3] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 764–773, 2017.
  • [4] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
  • [5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [6] Kun Huang, Yifan Wang, Zihan Zhou, Tianjiao Ding, Shenghua Gao, and Yi Ma. Learning to parse wireframes in images of man-made environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 626–635, 2018.
  • [7] Jinyong Jeong, Younggun Cho, Young-Sik Shin, Hyunchul Roh, and Ayoung Kim. Complex urban dataset with multi-level sensors from highly diverse urban environments. The International Journal of Robotics Research, page 0278364919843996, 2019.
  • [8] Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pages 734–750, 2018.
  • [9] Jun-Tae Lee, Han-Ul Kim, Chul Lee, and Chang-Su Kim. Semantic line detection and its applications. In Proceedings of the IEEE International Conference on Computer Vision, pages 3229–3237, 2017.
  • [10] Yuxi Li, Jiuwei Li, Weiyao Lin, and Jianguo Li. Tiny-dsod: Lightweight object detection for resource-restricted usages. arXiv preprint arXiv:1807.11013, 2018.
  • [11] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
  • [12] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [13] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European conference on computer vision, pages 483–499. Springer, 2016.
  • [14] George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, and Kevin Murphy. Towards accurate multi-person pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4903–4911, 2017.
  • [15] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
  • [16] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • [17] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [18] Rafael Grompone Von Gioi, Jeremie Jakubowicz, Jean-Michel Morel, and Gregory Randall. Lsd: A fast line segment detector with a false detection control. IEEE transactions on pattern analysis and machine intelligence, 32(4):722–732, 2008.
  • [19] Nan Xue, Song Bai, Fudong Wang, Gui-Song Xia, Tianfu Wu, and Liangpei Zhang. Learning attraction field representation for robust line segment detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1595–1603, 2019.
  • [20] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
  • [21] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2403–2412, 2018.
  • [22] Ziheng Zhang, Zhengxin Li, Ning Bi, Jia Zheng, Jinlei Wang, Kun Huang, Weixin Luo, Yanyu Xu, and Shenghua Gao. Ppgnet: Learning point-pair graph for line segment detection. arXiv preprint arXiv:1905.03415, 2019.
  • [23] Qijie Zhao, Tao Sheng, Yongtao Wang, Zhi Tang, Ying Chen, Ling Cai, and Haibin Ling. M2det: A single-shot object detector based on multi-level feature pyramid network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9259–9266, 2019.
  • [24] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points, 2019.
  • [25] Yichao Zhou, Haozhi Qi, and Yi Ma. End-to-end wireframe parsing. arXiv preprint arXiv:1905.03246, 2019.