A semantic line [lee2017, jin2020] is defined as a meaningful line, separating different semantic regions in a scene, which is approximated by an end-to-end straight line. A group of semantic lines in an image can be regarded as optimal, when they convey the composition of the image harmoniously, as shown in Figure 1(e). Thus, in an optimal set, the lines should harmonize with one another.
Semantic lines provide important visual cues in high-level image understanding [Freeman2007, krages2012, lee2018photographic, guo2012, hillel2014, lee2017_vpgnet, zhou2019_nips]. In photography, semantic lines, such as horizontal, vertical, and symmetric ones, are essential composition components. The harmony of such lines is closely related to the subjective quality of a photograph [Freeman2007, krages2012, lee2018photographic]. In autonomous driving systems [guo2012, hillel2014, hou2020_inter], boundaries of road lanes and sidewalks, which can also be described by semantic lines, should be detected reliably to control vehicle maneuvers. Moreover, dominant parallel lines intersect at vanishing points [lee2017_vpgnet, zhou2019_nips] under perspective projection, conveying an impression of depth. They are also semantic lines [jin2020]. However, it is challenging to detect semantic lines, which are often unobvious and implied by complex boundaries of semantic regions.
Many techniques have been developed to detect line segments in a scene by exploiting hand-crafted features [matas2000, von2008, desolneux200, akinlar2011] or deep features [huang2018, xue2019, zhou2019_line, lin2020]. However, they may extract redundant short line segments or focus on identifying obvious line structures in man-made environments. Recently, several attempts have been made to detect semantic lines [workman2016, zhai2016, lee2017, jin2020, han2020]. Horizon lines, which are a specific type of semantic line, have been estimated by CNN-based methods [workman2016, zhai2016]. In [lee2017, jin2020, han2020], semantic line detectors have been proposed. They have two stages: line detection and refinement. In the detection stage, deep line features are extracted to classify line candidates, but implied lines may be undetected or the computational cost for extracting discriminative features can be too high. In the refinement stage, redundant lines are removed through non-maximum suppression (NMS) or pairwise comparison. Although these techniques provide promising results, they may fail to consider the harmony between detected lines and thus may yield sub-optimal results, as shown in Figure 1(d).
In this paper, a novel algorithm to detect an optimal set of harmonious semantic lines is proposed based on maximal weight clique selection (MWCS). We formulate the detection as finding a maximal weight clique in a complete graph [Graph1998, chartrand2019]. To this end, we design two networks: selection network (S-Net) and harmonization network (H-Net). Given an image and a set of line candidates, S-Net first computes the classification probability and regression offsets of each candidate. Second, we filter out irrelevant lines by performing a selection-and-removal process. Third, we construct a complete graph, in which the node set contains the selected lines. H-Net computes its edge weights. Finally, we determine a maximal weight clique representing harmonious semantic lines. Experimental results demonstrate that the proposed algorithm can detect harmonious semantic lines accurately and efficiently.
This work has the following major contributions:
We formulate semantic line detection as finding a maximal weight clique in a complete graph.
We develop two networks, S-Net and H-Net, to construct the complete graph.
We introduce a novel metric, called HIoU, to assess the overall harmony of semantic lines, which is more reasonable than the existing metrics in [lee2017, han2020].
The proposed algorithm yields semantic line detection performance competitive with the state-of-the-art DRM technique [jin2020], while running about 20 times faster.
2 Related Work
2.1 Line segment detection
Line segments give important visual cues for image semantics. In line segment detection [matas2000, von2008, desolneux200, akinlar2011], many short segments are detected using low-level features, such as image gradients. This approach, however, may not discriminate meaningful lines from noisy ones. To utilize higher-level features, deep learning methods have been proposed [huang2018, xue2019, zhou2019_line, lin2020]. In [huang2018], a line heat map and junctions were predicted by networks. Then, a wireframe was obtained by connecting the junctions based on the heat map. In [zhou2019_line], a line candidate was generated by connecting two junctions and then was classified as either salient or not. In [xue2019], attraction field maps were computed by a network to deal with local ambiguity and class imbalance in line segment detection. In [lin2020], a network was trained with a Hough transform block to combine local information with global line priors. These methods [huang2018, xue2019, zhou2019_line, sun2019] focus on detecting obvious lines in man-made environments.
2.2 Semantic line detection
Semantic lines, located near the boundaries of semantic regions, represent the layout and composition of images. Several methods [workman2016, zhai2016, diaz2019, lee2017, han2020, jin2020] have been developed to detect implied but semantically meaningful lines. In [workman2016, zhai2016, diaz2019], horizon lines were detected by CNNs and then refined by exploiting vanishing points or using soft labels of line parameters. In [lee2017], Lee et al. proposed the first semantic line detector. They devised a line pooling layer to extract local features along each line candidate. Those features were fed into classification and regression layers to detect semantic lines. Then, an NMS scheme was performed to remove redundant lines, based on the edge detector [xie_2015_ICCV]. In [jin2020], Jin et al. extracted more discriminative line features by designing a region pooling layer and the mirror attention module. Then, they selected the most semantic lines and removed redundant lines alternately through pairwise ranking and matching. In [han2020], Han et al. transformed line features into a Hough parametric space to facilitate parallel processing of multiple line candidates. Then, they trained a network to predict a line probability map, which was used to determine semantic lines by computing the centroids of connected components.
2.3 Road lane detection
In autonomous driving systems, it is important to reliably detect the boundaries of road lanes, sidewalks, or crosswalks. Early methods [he2004, aly2008, hillel2014, zhou2010] used hand-crafted low-level features to extract lanes. Recently, to cope with complicated road scenes, attempts have been made to detect road lanes using deep semantic segmentation frameworks [pan2018, hou2019_road, hou2020_inter, qin2020]. In [pan2018], Pan et al. proposed a network to learn the spatial relationships of lanes through message passing between convolution layers. In [hou2019_road], a network was designed to generate attention maps at different layers, which were used to refine the output of deeper ones. In [hou2020_inter], the inter-region affinity graph was constructed to transfer the structural relationships between lanes from teacher to student networks. In [qin2020], to achieve a high processing speed, a network was developed to identify the location of each lane on a predefined set of rows only.
3 Proposed Algorithm
Figure 2 is an overview of the proposed algorithm, which contains S-Net and H-Net. First, given an image and a set of line candidates, S-Net computes the line probability and the regression offsets of each candidate. Second, irrelevant candidates are filtered out through a selection-and-removal process. Third, a complete graph, whose node set consists of the selected lines, is constructed and its edge weights are computed by H-Net. Finally, a maximal weight clique, representing harmonious semantic lines, is determined.
3.1 Problem formulation
Semantic lines in an image can be regarded as optimal if they convey the composition of the image harmoniously. In other words, in an optimal set, every pair of semantic lines should harmonize with each other. As in Figure 3(b), a pair of semantic lines should direct visual attention to meaningful regions. In contrast, in Figure 3(c), two lines are redundant or inharmonious. Based on this observation, we formulate the semantic line detection as finding a maximal weight clique in a complete graph [Graph1998, chartrand2019]. In the complete graph, detected lines form the node set, and each edge weight represents how harmonious the associated two lines are. Thus, by finding a maximal weight clique, we find an optimal set of harmonious semantic lines.
3.2 Node selection: filtering line candidates
It is computationally infeasible to construct a complete graph for all line candidates. Therefore, we select reliable nodes only by filtering line candidates.
Line candidate generation: A line candidate, which is an end-to-end straight line in an image, can be parameterized by polar coordinates in the Hough space [kiryati1991, han2020, lin2020]. Let $l = (\rho, \theta)$ denote a line, where $\rho$ is its distance from the center of the image and $\theta$ is its angle from the $x$-axis. Then, we generate $N$ line candidates, denoted by $l_n$, $n = 1, \ldots, N$, by quantizing $\rho$ and $\theta$ uniformly.
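The uniform quantization of the Hough parameters can be sketched as follows. This is only an illustration under stated assumptions: the function name `generate_line_candidates`, the 32 x 32 grid, the use of bin centers, and the ranges of the distance and angle are hypothetical choices, not values taken from the paper.

```python
import math

def generate_line_candidates(width, height, num_rho=32, num_theta=32):
    """Uniformly quantize (rho, theta) to produce line candidates.

    rho is the signed distance from the image center and theta is an angle
    measured from the x-axis. Grid sizes and ranges are assumptions.
    """
    max_rho = 0.5 * math.hypot(width, height)  # half of the image diagonal
    rho_step = 2.0 * max_rho / num_rho
    theta_step = math.pi / num_theta
    candidates = []
    for i in range(num_rho):
        rho = -max_rho + (i + 0.5) * rho_step      # bin center in (-max_rho, max_rho)
        for j in range(num_theta):
            theta = (j + 0.5) * theta_step         # bin center in (0, pi)
            candidates.append((rho, theta))
    return candidates

lines = generate_line_candidates(400, 300)
```

With this assumed grid, each candidate has a fixed (row, column) position in the Hough space, which is what later makes neighborhood-based removal of overlapping candidates straightforward.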
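S-Net: For each line candidate, we compute its classification probability and regression offsets. To this end, we develop S-Net based on the conventional line detectors [lee2017, han2020, jin2020]. Figure 4(a) shows the architecture of S-Net. From an image, S-Net extracts a convolutional feature map $X \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ denote the feature height, the feature width, and the number of channels. Then, the line feature map $Y = [y_1, \ldots, y_N] \in \mathbb{R}^{C \times N}$ is obtained by averaging the features of pixels along each line candidate $l_n$:

$$y_n^c = \frac{1}{|l_n|} \sum_{p \in l_n} X_p^c \qquad (1)$$

for $c = 1, \ldots, C$ and $n = 1, \ldots, N$, where $N$ is the number of line candidates and $|l_n|$ denotes the number of pixels along $l_n$. We then obtain the probability vector $P = [p_1, \ldots, p_N]^\top$ and the line offset matrix $O = [o_1, \ldots, o_N]^\top$ by

$$P = \sigma(f_1(Y)), \qquad O = f_2(Y),$$

where $f_1$ and $f_2$ are fully-connected layers of sizes $C \times 1$ and $C \times 2$ for classification and regression, respectively, and $\sigma(\cdot)$ is the sigmoid activation function. For the $n$th line candidate $l_n$, $p_n$ indicates the probability that it is semantic, and $o_n$ is the offset vector for line refinement in Section 3.4.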
The architecture and training process of S-Net are described in detail in the supplemental document.
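The line pooling and classification steps can be illustrated with a toy example. It is a minimal sketch in plain Python rather than the actual network: the feature map is a nested list, the pixel list of a candidate is assumed to be given, and `line_pool`, `classify`, and the weights are hypothetical names and values.

```python
import math

def line_pool(feature_map, pixels):
    """Average the C-dimensional features over the pixels on one line."""
    channels = len(feature_map[0][0])
    pooled = [0.0] * channels
    for row, col in pixels:
        for c in range(channels):
            pooled[c] += feature_map[row][col][c]
    return [value / len(pixels) for value in pooled]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def classify(pooled, weights, bias):
    """One fully-connected unit followed by a sigmoid (illustrative only)."""
    return sigmoid(sum(w * v for w, v in zip(weights, pooled)) + bias)

# Toy 2x3 feature map with C = 2 channels.
feature_map = [
    [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]],
    [[0.0, 4.0], [0.0, 5.0], [0.0, 6.0]],
]
diagonal = [(0, 0), (1, 1)]  # hypothetical pixels on one line candidate
pooled = line_pool(feature_map, diagonal)
prob = classify(pooled, weights=[0.0, 0.0], bias=0.0)
```

In a real detector the pooling would run over all candidates at once on a GPU; the loop form here is only meant to make the averaging explicit.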
Selection and removal: In the conventional algorithms [lee2017, han2020, jin2020], to detect semantic lines, only the line candidates with probabilities higher than a threshold are selected and then post-processed by non-maximum suppression. However, this may cause false negatives: implied lines, which are semantic nonetheless, tend to have low probabilities and are thus discarded. To reduce such false negatives, instead of thresholding, we perform the selection-and-removal process in Figure 2(b). We select the most reliable line $l_{n^*}$ by

$$n^* = \arg\max_n p_n,$$

where $p_n$ is the classification probability of the $n$th remaining candidate, and then remove the lines overlapping with the selected one. Specifically, we remove the 24 lines within the $5 \times 5$ grid centered at $l_{n^*}$ in the Hough space [han2020, lin2020]. We perform this process $K$ times to compose the node set $\mathcal{V}$ of $K$ selected lines. Figure 5(b) and (e) show such selected lines on the image and Hough spaces, respectively.
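The selection-and-removal loop can be sketched as below, assuming the candidate probabilities are stored on a 2D Hough grid. The function name `select_and_remove` and the grid layout are hypothetical, and angular wrap-around at the grid border is ignored for brevity.

```python
def select_and_remove(prob, num_select=8, half_window=2):
    """Repeatedly pick the most probable candidate and suppress its neighbors.

    prob is indexed as prob[rho_index][theta_index]. half_window=2 gives a
    5x5 window, i.e. the selected cell plus its 24 neighbors.
    """
    rows, cols = len(prob), len(prob[0])
    alive = [[True] * cols for _ in range(rows)]
    selected = []
    for _ in range(num_select):
        best = None
        for r in range(rows):
            for c in range(cols):
                if alive[r][c] and (best is None or prob[r][c] > prob[best[0]][best[1]]):
                    best = (r, c)
        if best is None:  # every candidate has been selected or removed
            break
        selected.append(best)
        for r in range(max(0, best[0] - half_window), min(rows, best[0] + half_window + 1)):
            for c in range(max(0, best[1] - half_window), min(cols, best[1] + half_window + 1)):
                alive[r][c] = False
    return selected

prob = [[0.1] * 8 for _ in range(8)]
prob[2][2], prob[3][3], prob[6][6] = 0.9, 0.8, 0.7
nodes = select_and_remove(prob, num_select=2)
```

Note that the second-best cell (3, 3) is never selected because it falls inside the window of (2, 2), whereas the low-probability but distant cell (6, 6) survives. This is the behavior that distinguishes the process from plain thresholding.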
3.3 Edge weighting: harmony score estimation
Inter-region correlation: To tell positive pairs in Figure 3(b) from negative pairs in Figure 3(c), we design the inter-region-correlation (IRC) module that analyzes the regions separated by a pair of lines.
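Let $R_m$, $m = 1, \ldots, M$, denote the regions separated by two lines. There can be three or four regions, $M = 3$ or $M = 4$. We extract the regional feature vector $r_m$ of $R_m$ by

$$r_m = \frac{1}{|R_m|} \sum_{p \in R_m} X_p,$$

where $X_p$ is the convolutional feature vector at pixel $p$ and $|R_m|$ is the area of $R_m$. We compute the softmax probability $a_m$ of the area $|R_m|$ to scale the regional feature vectors, and then concatenate the scaled vectors into

$$Z = [a_1 r_1, a_2 r_2, a_3 r_3, a_4 r_4]$$

of size $C \times 4$. If $M = 3$, we fill in the rightmost vector with zeros. Then, $Z$ is fed into a fully connected layer to yield the IRC feature.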
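The area-based scaling and zero padding of the regional features can be sketched as follows. The names `softmax` and `irc_input` are hypothetical, and the sketch assumes that region areas are given as fractions of the image area so that the softmax does not saturate; the paper's exact normalization is not specified here.

```python
import math

def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def irc_input(region_features, region_areas, max_regions=4):
    """Scale each regional feature by the softmax of the region areas,
    concatenate the results, and zero-pad up to max_regions vectors."""
    channels = len(region_features[0])
    scales = softmax(region_areas)
    flat = []
    for feature, scale in zip(region_features, scales):
        flat.extend(scale * value for value in feature)
    flat.extend([0.0] * (channels * (max_regions - len(region_features))))
    return flat

# Three equally sized regions with C = 2 channels each.
z = irc_input([[3.0, 6.0], [9.0, 12.0], [0.0, 3.0]], [1 / 3, 1 / 3, 1 / 3])
```

The fixed output length (here 4 x C) is what allows a single fully connected layer to handle both the three-region and four-region cases.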
H-Net: We develop H-Net using the IRC module. It takes an image and a pair of lines, indexed by $i$ and $j$, to yield the harmony score $h_{ij}$ ranging from 0 to 1. Figure 4(b) shows the H-Net architecture. The convolution layers of VGG16 [Simonyan2015] are used as the feature extractor, which is followed by three parallel branches: the IRC module and two line pooling layers. We employ the line pooling layers to perform the pooling in (1) for lines $l_i$ and $l_j$, respectively. We use two types of regression layers: one for yielding the IRC score of the two lines (Reg1), and the other for computing the unary reliability of each line (Reg2). Finally, we compute the harmony score by multiplying the IRC score with the average of the unary reliability levels.
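The final combination step is simple enough to write out directly. In this small sketch, `harmony_score` is a hypothetical helper, and its three inputs stand in for the outputs of the Reg1 and Reg2 layers, each assumed to lie in [0, 1].

```python
def harmony_score(irc_score, reliability_i, reliability_j):
    """Multiply the pairwise IRC score by the mean unary reliability."""
    return irc_score * 0.5 * (reliability_i + reliability_j)

score = harmony_score(0.8, 0.9, 0.7)  # 0.8 * mean(0.9, 0.7)
```

Because all three factors lie in [0, 1], the product also stays in [0, 1], and a pair scores high only when the two lines are both individually reliable and jointly well arranged.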
We configure the training data for H-Net as follows. It is assumed that every pair of ground-truth semantic lines in an image harmonize with each other. Thus, we declare such pairs as positive and the others as negative. In other words, a line pair is positive only if both lines $l_i$ and $l_j$ are semantic. Then, the harmony score is annotated as 1 or 0 depending on whether the pair is positive or not. However, this strict definition of a positive pair causes a class imbalance: there are too few positive pairs. Thus, we disturb the line locations of each positive pair and annotate the corresponding harmony score according to the disturbances of lines $l_i$ and $l_j$. Also, the loss function for training H-Net is defined in terms of the ground-truth harmony score and its estimate. The supplemental document describes the training process and architecture of H-Net in more detail.
3.4 Graph optimization: finding harmonious lines
Graph construction: We construct a complete graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, in which the node set $\mathcal{V}$ represents the lines selected using S-Net in Section 3.2. Every pair of lines is connected by an edge in the edge set $\mathcal{E}$. Each edge $(i, j)$ is assigned a weight $h_{ij}$, the harmony score computed by H-Net in Section 3.3. Figure 5(f) visualizes a complete weighted graph.
MWCS: As mentioned earlier, a set of semantic lines is optimal, if any two lines in the set are harmonious with each other. Thus, finding such an optimal set is equivalent to finding a clique of nodes [Graph1998], which are mutually connected and have a maximal sum of weights (harmony scores).
Let $\theta$ denote a clique, represented by the index set of member nodes. Then, we define the harmonization energy of clique $\theta$ as

$$E(\theta) = \sum_{i \in \theta} \sum_{j \in \theta,\, j > i} h_{ij},$$

which is the sum of all edge weights in $\theta$, where $h_{ij}$ is the harmony score between lines $i$ and $j$. Finding the clique that maximizes this energy is NP-hard [feremans2003generalized]. However, in this work, the number $K$ of nodes is set to be a small number; the default is 8. Then, there are about $2^K = 256$ possible cliques, which is manageable. Thus, exhaustive search is adopted to find a maximal weight clique. First, we generate the set $\Theta$ of possible cliques in the graph, where each clique consists of more than two nodes. Then, we select the maximal weight clique $\theta^*$ that maximizes the harmonization energy:

$$\theta^* = \arg\max_{\theta \in \Theta} E(\theta)$$

subject to a constraint on the selected clique, in which $\tau$ is a threshold. If there is no clique satisfying the constraint, we select the maximal single-node clique by

$$\theta^* = \{\arg\max_i h_{ii}\}.$$

The self-harmony score $h_{ii}$ is obtained by applying the same line as duplicated input to H-Net.

After obtaining the set of harmonious semantic lines, we refine each line by

$$\hat{l}_i = l_i + o_i,$$

where $o_i$ denotes the offset vector, generated by the regression layer of S-Net. Figure 5(c) and (g) show the set of harmonious semantic lines on the image and Hough spaces.
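The exhaustive search over cliques can be sketched with `itertools.combinations`. Several details here are assumptions made for illustration only: the constraint is taken to be "every edge weight in the clique exceeds the threshold", the default threshold is 0.5, and `min_size` is a hypothetical parameter (set to 2 here) for the smallest multi-node clique considered. The diagonal of the weight matrix holds the self-harmony scores.

```python
from itertools import combinations

def clique_energy(clique, weight):
    """Sum of all edge weights among the member nodes."""
    return sum(weight[i][j] for i, j in combinations(clique, 2))

def select_max_weight_clique(weight, threshold=0.5, min_size=2):
    """Exhaustively search for the clique with the largest energy.

    ASSUMPTION: a clique is feasible only if all of its edge weights exceed
    `threshold`; the paper's actual constraint is not reproduced here.
    min_size must be at least 2.
    """
    n = len(weight)
    best, best_energy = None, float("-inf")
    for size in range(min_size, n + 1):
        for clique in combinations(range(n), size):
            if min(weight[i][j] for i, j in combinations(clique, 2)) <= threshold:
                continue
            energy = clique_energy(clique, weight)
            if energy > best_energy:
                best, best_energy = clique, energy
    if best is None:
        # Fall back to the single node with the largest self-harmony score.
        best = (max(range(n), key=lambda i: weight[i][i]),)
    return best

w = [
    [0.90, 0.80, 0.70, 0.10],
    [0.80, 0.60, 0.90, 0.20],
    [0.70, 0.90, 0.50, 0.10],
    [0.10, 0.20, 0.10, 0.95],
]
chosen = select_max_weight_clique(w)
```

For a graph with 8 nodes, the search visits 2^8 - 1 - 8 = 247 subsets of two or more nodes, so brute force is inexpensive at this scale.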
4 Experimental Results
4.1 Datasets
SEL [lee2017]: It is the first semantic line dataset, containing 1,750 outdoor images, which are split into 1,575 training and 175 testing images. Each semantic line is annotated by the coordinates of two end points on an image boundary.
SEL_Hard: It is a more challenging dataset for testing semantic line detectors. It contains 300 test images, selected from the ADE20K segmentation dataset [zhou2017_ade]. Its semantic lines are less obvious and more severely occluded in more cluttered scenes.
SL5K [kai2020]: It is a rich and diverse dataset in terms of the number of lines and scene categories. It is composed of 4,000 training and 1,000 testing images.
CULane [pan2018]: It is a dataset for road lane detection, containing 88,000 training images. Its 34,680 test images are classified into 9 categories. For each image, the pixel-wise mask for up to 4 road lanes is provided. The proposed algorithm is tested on 3,911 test images in the ‘no lane’ category, in which each lane is highly implied or even invisible.
4.2 Evaluation metrics
Conventional metrics: There are two existing metrics to assess semantic line detection results: mIoU [lee2017] and EA-score [han2020]. In the mIoU metric, a detected line is regarded as correct if its mIoU score with the ground-truth semantic line is greater than a threshold $\tau$, as illustrated in Figure 6(a). In the EA-score, a detected line is regarded as correct if its similarity with the ground-truth is greater than the threshold $\tau$, as shown in Figure 6(b). The similarity is composed of two factors, which are based on the Euclidean distance between the midpoints of the lines and the angular distance of the lines, respectively. In both metrics, the precision and the recall are computed by

$$\text{Precision} = \frac{N_c}{N_c + N_f}, \qquad \text{Recall} = \frac{N_c}{N_c + N_m},$$

where $N_c$ is the number of correctly detected semantic lines, $N_f$ is the number of false positives, and $N_m$ is the number of false negatives. Then, the F-measure is computed by

$$\text{F-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$

The area under curve (AUC) performances of the precision, recall, and F-measure curves are measured in the entire range of the threshold $\tau$, which are denoted by AUC_P, AUC_R, and AUC_F, respectively [lee2017].
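As a worked example of the three scores at a single threshold, the following sketch computes them from raw counts. The function name `precision_recall_f` and the zero-division convention (returning 0.0) are assumptions added for illustration.

```python
def precision_recall_f(num_correct, num_false_pos, num_false_neg):
    """Precision, recall, and F-measure from detection counts."""
    detected = num_correct + num_false_pos
    annotated = num_correct + num_false_neg
    precision = num_correct / detected if detected else 0.0
    recall = num_correct / annotated if annotated else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# 6 correct detections, 2 false positives, 6 missed ground-truth lines.
p, r, f = precision_recall_f(6, 2, 6)
```

Here the precision is 6/8 and the recall is 6/12, so the F-measure is 2(0.75)(0.5)/(1.25) = 0.6. Repeating this over a sweep of thresholds yields the curves whose areas are reported as AUC scores.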
However, these metrics measure only the positional accuracy of each detected line. They do not consider how harmonious multiple detected lines are with one another in a scene. Hence, they may yield misleading scores, as exemplified in Figure 7.
HIoU metric: We propose the harmony-based intersection-over-union (HIoU) metric to assess the overall harmony of detected lines. Detected lines tend to convey a harmonious impression of the composition of an image when their division of the image is consistent with the division by the ground-truth. Suppose that the set of detected lines and the set of ground-truth lines divide the image into regions $S = \{s_1, \ldots, s_N\}$ and $T = \{t_1, \ldots, t_M\}$, respectively. Then, we define HIoU as

$$\text{HIoU} = \frac{\sum_{i=1}^{N} \max_{k} \text{IoU}(s_i, t_k) + \sum_{j=1}^{M} \max_{k} \text{IoU}(t_j, s_k)}{N + M}.$$

In other words, for each $s_i$, we find the matching $t_k$ and measure their IoU. Similarly, for each $t_j$, we find its IoU with the matching $s_k$. Then, the average of these bi-directional matching IoUs becomes the HIoU score. Figure 6(c) illustrates how to compute an HIoU score. Figure 7 shows that HIoU assesses detected lines more reasonably than the existing metrics do, by considering the harmony among the detected lines.
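A small sketch of the bi-directional region matching is given below. It assumes that "matching" means the region with the highest IoU and that the final score averages over all regions of both partitions; the helpers `split_regions`, `iou`, and `hiou`, and the line representation a*x + b*y + c = 0, are hypothetical choices for illustration.

```python
def split_regions(width, height, lines):
    """Partition pixel centers by the side of each line a*x + b*y + c = 0."""
    regions = {}
    for y in range(height):
        for x in range(width):
            key = tuple(a * (x + 0.5) + b * (y + 0.5) + c > 0 for a, b, c in lines)
            regions.setdefault(key, set()).add((x, y))
    return list(regions.values())

def iou(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def hiou(detected_regions, truth_regions):
    """Average of the best-match IoUs in both directions."""
    forward = sum(max(iou(s, t) for t in truth_regions) for s in detected_regions)
    backward = sum(max(iou(t, s) for s in detected_regions) for t in truth_regions)
    return (forward + backward) / (len(detected_regions) + len(truth_regions))

truth = split_regions(4, 4, [(0.0, 1.0, -2.0)])     # horizontal line y = 2
detected = split_regions(4, 4, [(0.0, 1.0, -1.0)])  # horizontal line y = 1
score = hiou(detected, truth)
```

On this 4 x 4 toy image, the best-match IoUs are 1/2 and 2/3 in each direction, so the score is (1/2 + 2/3 + 1/2 + 2/3) / 4 = 7/12. A detection identical to the ground truth scores exactly 1.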
4.3 Comparative assessment
We compare semantic line detection results of the proposed algorithm with those of the conventional SLNet [lee2017], DHT [han2020], and DRM [jin2020].
Comparison on SEL: Table 1 reports the AUC performances of the precision, recall, and F-measure curves on the SEL dataset. The proposed algorithm provides a poorer recall but a better precision than the conventional algorithms. F-measure is the harmonic mean of recall and precision. Note that the proposed algorithm outperforms all conventional algorithms in terms of F-measure and HIoU.
Comparison on SEL_Hard: Table 1 also compares the results on SEL_Hard. For this comparison as well, we use the same algorithms that are trained using the training images in the SEL dataset. As mentioned previously, SEL_Hard images are much more complicated than SEL images. Also, many SEL images contain only one semantic line. Thus, it is challenging to use only SEL images to learn the harmony between lines in more complicated SEL_Hard images. Nevertheless, the proposed algorithm yields results competitive with DRM, which performs the best but demands an excessively high computational cost. Note that the proposed algorithm is about 20 times faster than DRM. Moreover, the proposed algorithm outperforms DRM in terms of AUC_P.
Figure 9 compares detection results on the SEL and SEL_Hard datasets. The conventional algorithms detect redundant lines near object boundaries or fail to detect implied semantic lines. In contrast, the proposed algorithm detects implied as well as obvious semantic lines more reliably, while ensuring the harmony between detected lines.
| Method | Precision | Recall | F-measure | HIoU |
|---|---|---|---|---|
| Zhao et al. [kai2020] | 70.3 | 74.5 | 72.3 | - |
Comparison on SL5K: Table 2 compares the performances on the SL5K dataset. Zhao et al. [kai2020] report the performances of their algorithm in the EA-score metric only, and their training codes or model parameters are not available. Thus, we compare the results in the EA-score metric only, as done in [kai2020]. We see that the proposed algorithm outperforms Zhao et al. by significant margins of 9.1, 6.9, and 8.0 in terms of precision, recall, and F-measure, respectively. Also, the proposed algorithm yields an HIoU score of 74.1. Figure 10 shows some detection results.
Comparison on CULane: We compare the proposed algorithm with the conventional road lane detectors [hou2019_road, qin2020] on the ‘no lane’ category in CULane, in which lanes are implicit or invisible. Conventional techniques are based on the segmentation framework and the ground-truth is also given as a binary mask for each lane. Thus, for comparison, we declare the line that overlaps the most with the segmentation mask of each lane as a semantic line. The experimental settings are described in detail in the supplemental document. Figure 11 shows some ground-truth semantic lines and compares the detection results. Although the lines are extremely unobvious, the proposed algorithm detects them more reliably than the conventional detectors. Table 3 compares the AUC and HIoU scores. Note that, unlike the conventional detectors, the proposed algorithm does not use the information of the maximum number of lanes in a scene. The conventional algorithms recall implied or invisible lanes poorly. The proposed algorithm is slightly less precise, but provides significantly higher recall and F-measure scores than the conventional detectors. Also, the proposed algorithm yields a better HIoU score than the conventional detectors, by exploiting the harmonious properties of road lanes, such as parallelism and equal width between adjacent lanes.
Running time analysis: Table 1 also compares the running times. We use a PC with an Intel Core i5-8500 CPU and an NVIDIA RTX 2080 Ti GPU. Note that SLNet and DRM require a lot of time to extract discriminative line features. In particular, DRM is the slowest method at 1.05 fps, because its mirror attention module and iterative ranking-and-matching process are too demanding. The proposed algorithm and DHT are much faster. Although DHT is the fastest, its recall performance is not competitive.
4.4 Ablation studies
We conduct ablation studies to analyze the efficacy of the proposed S-Net, H-Net, and MWCS process on the SEL dataset. Table 4 compares several ablated methods. Method I uses S-Net only to detect semantic lines, in which the selection-and-removal process is performed iteratively until the maximum probability becomes lower than 0.5. Method II uses H-Net and the MWCS process as well, but H-Net is trained without employing the IRC module. In Method III, line offsets are not used to refine detection results. Method IV is the full proposed algorithm. Method I is significantly inferior to the other methods, indicating that both H-Net and MWCS are essential for detecting harmonious semantic lines. Also, by comparing II with IV, we see that the inter-region correlation feature is effective for estimating the harmony between two lines. Finally, by comparing III with IV, we see that the performance is improved by refining detected lines using regression offsets.
5 Conclusions
We proposed a novel semantic line detector. First, we developed S-Net to compute the line probabilities and offsets of line candidates. Second, we filtered out irrelevant lines through a selection-and-removal process. Third, we constructed a complete graph, whose edge weights were computed by H-Net. Finally, we determined a maximal weight clique representing a group of harmonious semantic lines. Also, to assess the overall harmony of detected lines, we proposed a novel metric called HIoU. It was experimentally demonstrated that the proposed algorithm can detect harmonious semantic lines effectively and efficiently.
This work was supported in part by the National Research Foundation of Korea (NRF) through the Korea Government (MSIT) under grant NRF-2018R1A2B3003896 and in part by 42dot Inc.