FloorLevel-Net: Recognizing Floor-Level Lines with Height-Attention-Guided Multi-task Learning

by   Mengyang Wu, et al.
The Chinese University of Hong Kong

The ability to recognize the position and order of the floor-level lines that divide adjacent building floors can benefit many applications, for example, urban augmented reality (AR). This work tackles the problem of locating floor-level lines in street-view images, using a supervised deep learning approach. Unfortunately, very little data is available for training such a network: current street-view datasets contain either semantic annotations that lack geometric attributes, or rectified facades without perspective priors. To address this issue, we first compile a new dataset and develop a new data augmentation scheme to synthesize training samples by harnessing (i) the rich semantics of existing rectified facades and (ii) perspective priors of buildings in diverse street views. Next, we design FloorLevel-Net, a multi-task learning network that associates explicit features of building facades and implicit floor-level lines, along with a height-attention mechanism to help enforce a vertical ordering of floor-level lines. The generated segmentations are then passed to a second-stage geometry post-processing to exploit self-constrained geometric priors for plausible and consistent reconstruction of floor-level lines. Quantitative and qualitative evaluations conducted on assorted facades in existing datasets and street views from Google demonstrate the effectiveness of our approach. Also, we present context-aware image overlay results and show the potential of our approach in enriching AR-related applications.






I Introduction

Floor-level lines are line segments that separate adjacent floors on a building facade; see Fig. 1 (middle). Being able to recognize them in city-wide street views can benefit various applications, e.g., urban 3D reconstruction [28], building topology analysis [42], and urban vitality study [48]. Intrinsically, floor-level lines for the same building are parallel, through which we can reconstruct the homography of the facade in a perspective view [8] and support floor-aware augmented reality (AR) applications; see Fig. 1 (right).

This work considers the problem of inferring floor-level lines in street-view images, which requires the recovery of not only geometric priors (e.g., positions and vanishing directions) in the image view, but also semantic information (e.g., floor orders) relevant to the floor-level lines. The task is relatively intuitive for humans but very challenging for computers. So far, we are not aware of any work that can robustly detect and recognize floor-level lines.

Related prior works on 3D reconstruction (e.g., [2, 14]) and facade parsing (e.g., [26, 44]) typically rely on various assumptions about the structural regularity of building facades, e.g., repetitive windows and balconies (e.g., [30, 41]). Though these extra constraints help disambiguate the problem, and the results might help infer floor-level lines, such methods still have various limitations, such as being error-prone when handling perspective-oriented facades (Fig. 1 top-left) and scenes with occlusions (Fig. 1 bottom-left). Alternatively, we may infer floor-level lines by locating line segments with roughly the same direction. However, existing line detectors (e.g., [13, 33]) easily generate a huge number of irrelevant line segments in cluttered street views, and offer no semantic information for inferring the floor order. Recently, Lee et al. [21] proposed the concept of semantic lines that characterize the spatial scene structure in images. Their method first finds candidate lines using line features at multi-scale pooling layers, then filters out semantic lines using local line features. However, the filtering process (semantic or not) is a binary classification problem, whereas this work requires recovering not only a multi-class label per line but also a plausible floor order for lines on the same facade, i.e., order 1, order 2, etc.

Fig. 1: Left column shows two example street-view images in London (top) and Hong Kong (bottom), where the camera views are side- and front-facing relative to the building, respectively. Note the occlusions introduced by the advertisement billboard and light post circled in red on bottom left. Middle column shows floor-level lines recognized by our method with geometric positions and semantic order labels. Right column shows potential floor-aware image-overlay results to aid shopping (top) and navigation (bottom).

Recent attempts at using deep neural networks for semantic scene segmentation (e.g., [6]) and planar surface recognition (e.g., [45, 49]) exploit the possibility of jointly learning the semantic and geometric attributes in street views. These methods achieve superior performance over previous deep learning methods that use solely the semantic features. This work also leverages a supervised deep learning approach that can jointly infer geometric priors and semantic information of floor-level lines. This is nevertheless a challenging task. First, a key requirement for network training is the availability of large, labeled street-view images with annotations of floor-level lines. However, existing street-view datasets contain either semantic annotations that lack geometric attributes, or rectified facades without perspective priors. To address this issue, we devise a new data augmentation scheme and compile a new dataset by effectively combining the rich context of floor-level lines marked on an existing facade dataset with perspective priors easily extracted from street-view images. In this way, we can largely reduce the manual workload in the dataset construction, while promoting the generalizability of our network model.

Second, we are not aware of any network architecture that can fulfill the requirement of recognizing and locating floor-level lines in street-view images. To fill the gap, we design FloorLevel-Net, a new deep learning framework with two main components inspired by the characteristics of our floor-level dataset. (i) FloorLevel-Net leverages multi-task learning that associates geometric properties of facade orientation relative to the camera with semantic features of facade appearance, such as windows, doors, and balconies. (ii) FloorLevel-Net incorporates a height-attention mechanism to enforce the vertical ordering of floor-level lines, as the floor order naturally increases from bottom to top in image space. We further exploit various geometric properties, including the vanishing point (VP) and floor order constraints, to enforce a consistent reconstruction of floor-level lines from the piecewise segmentations produced by FloorLevel-Net.

We evaluate the performance of our approach on both building images from existing facade datasets and self-collected street-view images from Google Street View (GSV) [1]. Both qualitative and quantitative analyses demonstrate the effectiveness of the individual components in our approach. The main contributions of this work include:

  • We propose a data augmentation scheme to integrate rich semantics of rectified facades and geometric priors of buildings in diverse street views, and compile a new dataset for recognizing floor-level lines (Sec. IV).

  • We design FloorLevel-Net, a multi-task CNN architecture with height attention to simultaneously predict a facade segmentation map and a floor-level distribution map for an input street-view image (Sec. V-A to V-C). We further develop a post-processing framework to extract floor-level parameters (including the endpoints and vanishing points), and refine the parameters with self-constrained geometric priors (Sec. V-D).

  • We evaluate the effectiveness of our approach in recognizing floor-level lines, and demonstrate the applicability in enriching context-aware urban AR (Sec. VI).

II Related Work

The recognition of geometric structures in urban scenes, e.g., surface layout [17] and driving lanes [50], has been gaining attention in image processing and vision research. This work targets floor-level lines that separate adjacent floors on building facades. A closely-related topic is reconstructing 3D building models from a monocular image. Conventional approaches typically adopt a two-stage pipeline: first parse a facade into piecewise regions like windows, balconies, etc. [26, 44], then split the regions into regular layouts for subsequent modeling, e.g., by shape grammars [27, 34]. Existing methods, however, rely heavily on detecting repetitive [41, 30], symmetric, or rectangular structures in the facade layout [35, 39], and thus face significant challenges in parsing general urban scenes in the wild. Particular obstacles include perspective-oriented facades (e.g., see Fig. 1 (top) and the red inset in Fig. 2) and scene occlusions by billboards and traffic lights (e.g., see Fig. 1 (bottom) and the blue inset in Fig. 2), which cause failures in matching repetition/symmetry/rectangularity constraints. It is hard to identify common geometric assumptions that fit the diverse styles of building facades in different cities.

Recent studies on urban scene recognition focus on deep-learning-based approaches, benefiting from new datasets of urban scenes and advancements in neural network architectures, e.g., [25, 51, 4]. Specifically, DeepFacade [23, 22] parses facades using a fully convolutional network to produce pixel-wise semantics. Some works also formulate the geometric understanding of urban scenes as an image segmentation problem. For instance, Haines and Calway [15] infer surface orientations using spatiograms of gradients and colors. However, the results are piecewise segments that are typically discontinuous, with coarse and irregular boundaries, hindering the recognition of geometric priors like the endpoints of floor-level lines. To further recover geometric priors, a second-stage post-processing framework can be employed. Zeng et al. [49] employ a vanishing-point-constrained optimization to enhance piecewise segmentations of planar building facades.

Similarly, we adopt a two-stage process to recognize floor-level lines on building facades. First, we predict a segmentation mask of pixel-wise line labels (the floor-level distribution) using a multi-task CNN architecture. Second, we refine the piecewise floor-level distribution into line parameters using self-constrained geometric priors. Nevertheless, this task faces two severe challenges: the lack of a training dataset and the lack of a suitable network architecture. We tackle these challenges from the following perspectives:

  • Data augmentation is a vital technique to improve the diversity of the training data and the network generalizability. There are several publicly available datasets for urban scenes (e.g., KITTI [11] and Cityscapes [7]) and facades (e.g., CMP [37] and LabelMeFacade [10]). Yet, we cannot use them directly in this work. In contrast to explicit building structures such as windows and balconies that have an obvious visual appearance, floor-level lines are rather implicit, requiring contextual information for their recognition. To fill the gap, we propose a new data augmentation scheme that leverages the detailed facade semantics of front-facing facades in CMP [37] and the diverse building perspectives in GSV images. In this way, the synthesized dataset efficiently captures both the rich context of floor-level lines and the geometric priors of building facades in reality.

    Fig. 2: Our two-stage approach: (i) FloorLevel-Net is a multi-task learning network that segments the input image into building-facade-wise semantic regions (top) and floor-level distributions (bottom); and (ii) our method further fits and refines the pixel-wise network outputs into polylines with geometric parameters. Further, we can take the reconstructed floor-level lines to support and enrich urban AR applications with floor-aware image overlay.
  • Multi-task learning helps to boost the performance of many deep-learning-based vision tasks by learning features from relevant knowledge domains. In the literature, various efforts have been devoted to exploiting multi-task learning for the geometric understanding of urban scenes. For example, Liu et al. [24] tackle scene recognition and reconstruction tasks in a tightly-coupled framework; besides, there are assorted works on indoor scene understanding (e.g., [38, 29, 9]) that fuse features of depth, surface normals, and objects. Human experience in locating floor-level lines typically relies on recognizing a facade and its enclosed regions, e.g., doors and windows. Therefore, we divide our task into the subtasks of recognizing facades and floor-level lines, and design a multi-task learning network to address the two subtasks simultaneously.

  • Channel-wise attention exploits the inter-channel relationship of features and scales the feature map according to the importance of each channel. The mechanism was first proposed in SENet [18] and has been widely adopted in image classification (e.g., [40]) and segmentation (e.g., [46]). Recent attention approaches take further advantage of contextual information inside the image domain, e.g., the criss-cross attention module by Huang et al. [19] and the recursive context routing (ReCoR) mechanism by Chen et al. [5]. Choi et al. [6] utilize the imbalanced class distribution at varying vertical locations in urban scene images, and design a height-attention module to exploit this imbalance. Floor-level lines in our dataset exhibit a similar property, since the floor order on the same facade always increases from bottom to top in the image space. This drives us to adopt a similar configuration for the recognition of floor-level lines.

III Overview

In this work, our goal is to recognize and locate the floor-level lines on each building facade (that is close to the camera) in the image space of street-view images.

The input to our method is a single RGB image I of width W and height H. Given I, we aim to predict a set of line segments {l_i} per facade that marks the separations between adjacent floors; see Fig. 1 (middle) for two examples. In the following, we refer to line segments as lines for simplicity. Each line l_i contains a pair of endpoints (x1, y1) and (x2, y2) in the image space of I. Also, we aim to recognize a floor order (denoted as o_i) per line l_i, such that the line that separates the ground floor and the first floor has a floor order value of one, the next line above has a floor order value of two, etc. So, each l_i is specified as a 5-tuple (x1, y1, x2, y2, o_i). As a visualization, we use the same coloring scheme (orange for order 1, green for order 2, etc.) to reveal the floor orders in all our results.
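As a concrete illustration, the per-line representation above can be sketched as a small data structure; the class and field names below are our own, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class FloorLevelLine:
    """One floor-level line: endpoints (x1, y1), (x2, y2) in image space
    plus a floor order, where order 1 separates the ground and first floors."""
    x1: float
    y1: float
    x2: float
    y2: float
    order: int

def sort_by_order(lines):
    """Return the lines of one facade sorted bottom-up by floor order."""
    return sorted(lines, key=lambda l: l.order)

lines = [FloorLevelLine(10, 80, 200, 90, 2), FloorLevelLine(12, 160, 205, 170, 1)]
assert [l.order for l in sort_by_order(lines)] == [1, 2]
```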

Fig. 3: Illustration of our data augmentation scheme. We take advantage of the rich context of facades in the CMP dataset [37] (a1 & a2) and the perspective-oriented building facades readily extracted from GSV (Google Street View) images (b1). From them, we can efficiently obtain simplified semantics (a3) and annotate floor-level lines (a4), and further generate a very large number of augmented image samples, i.e., an augmented image (b2) with its associated semantic image (b3) and floor-level-lines image (b4), by pairing up different geometric priors with different CMP facades.

There are three main parts in our approach.

  1. We compile a set of facade images from the CMP dataset [37], and employ them to augment building facades in GSV images based on the geometric prior of facade perspective (Sec. IV). By our new data augmentation scheme, we can efficiently generate a large amount of image samples to train our network.

  2. We develop a multi-task learning network, FloorLevel-Net (Sec. V-A to V-C), to segment the input image into piecewise regions of facades and to detect candidate pixels possibly associated with floor-level lines; see the multi-task learning module in Fig. 2. Particularly, FloorLevel-Net encapsulates semantic (e.g., windows, doors, shops, etc.) and geometric (e.g., floor orders, facade orientation, etc.) information related to floor-level lines by articulating a fused loss function that is differentiable by means of pixel-wise convolutions.

  3. We infer per-line 5-tuple parameters from the piecewise segmentations in the geometry post-processing stage (Sec. V-D). Here, we locate facade regions, group floor-level lines per facade, then regress a polyline per floor-level line; see the geometry processing module in Fig. 2. Further, we refine the polylines based on self-constrained geometric priors, i.e., the facade boundary, floor order, and vanishing points.
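The polyline regression in step 3 can be sketched as a least-squares fit over the candidate pixels of one floor-level line. This is a minimal illustration, assuming the pixels of one facade and one floor order have already been grouped; the function name is ours, not the authors' code:

```python
import numpy as np

def fit_floor_line(xs, ys, deg=1):
    """Least-squares fit of a (poly)line to the candidate pixels of one
    floor-level line; deg=1 yields a straight line y = a*x + b."""
    coeffs = np.polyfit(xs, ys, deg)
    return np.poly1d(coeffs)

# Noisy pixels around y = 0.1*x + 50 (an illustrative floor-level line).
rng = np.random.default_rng(0)
xs = np.arange(0, 200, 5, dtype=float)
ys = 0.1 * xs + 50 + rng.normal(0, 0.5, xs.size)
line = fit_floor_line(xs, ys)
# The recovered slope (line.coeffs[0]) should be close to 0.1.
```

Higher `deg` values allow piecewise-curved polylines where a facade is not perfectly planar in the image.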

So far, we are not aware of any work on recognizing floor-level lines in street-view images. Existing related works focus on recognizing building facades as a whole or explicit objects, such as windows, doors, and balconies. Hence, we show the effectiveness of our approach through comparisons with ablation techniques (Sec. VI-C) and DeepFacade, which is a closely-related work on facade parsing (Sec. VI-D), and demonstrate the potential of our work to support and enrich urban AR applications (Sec. VI-F).

IV Dataset Preparation & Analysis

An immediate challenge in recognizing floor-level lines is the lack of properly-annotated street-view images. On the one hand, existing city-wide datasets such as Cityscapes [7] provide mainly urban scene segmentations, e.g., roads, buildings, vehicles, etc., while lacking details of facades, not to mention floor-level lines. On the other hand, building facade datasets such as CMP [37] provide detailed semantics of facades, e.g., windows, doors, balconies, etc. However, the images exhibit mostly rectified views of front-facing facades, which cannot reflect real-world facades with perspective orientations.

IV-A Data Augmentation

To avoid the enormous workload of labeling a new dataset from scratch, we propose a data augmentation scheme (Fig. 3) that combines the facade semantics in the CMP dataset [37] with the perspective-oriented building facades extracted from GSV images.

To start, we first analyze the facade images (Fig. 3 (a1)) with their semantics (Fig. 3 (a2)) provided by the CMP dataset. Here, we make use of the simplified semantics (Fig. 3 (a3)), i.e., window, shop, and door, which offer contextual information for floor-level lines: windows usually appear on every floor, whereas shops and doors usually appear on the ground floor. There may not be obvious lines between adjacent floors, yet humans can infer them based on the associated context, e.g., windows. Thanks to the regular front-facing structures in these inputs, manually labeling floor-level lines (Fig. 3 (a4)) is very fast, compared with labeling general street-view images under perspective. We extract a total of 150 rectified facade images from the CMP dataset.

Next, we employ the GSV API [12] to collect street-view photos. Here, we randomize the camera headings to obtain perspective-oriented facades, and set the pitch parameter in the API to zero, so the camera view directions are always horizontal, i.e., parallel to the ground. Also, for each GSV photo, we identify nearby planar facades that are feasible for overlaying CMP facade images, i.e., with an aspect ratio similar to those in the CMP dataset. Then, we manually annotate a quad to mark the region of each feasible planar facade (Fig. 3 (b1)) and also label its orientation relative to the camera view direction. The facade priors serve as reference locations for overlaying facade images and semantics. For instance, in Fig. 3 (b1), the blue quad marks the facade region of the left building with orientation categorized as front (i.e., front-facing the camera), whereas the green quad marks the facade region of the right building with right orientation. By doing so, we can derive an affine transformation matrix for each facade, and use it to generate an augmented street-view image (Fig. 3 (b2)) with its associated semantic image (Fig. 3 (b3)) and labeled floor-level-lines image (Fig. 3 (b4)).
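Deriving such a transform can be sketched as follows: solve a least-squares affine map from the rectified facade's corners to the annotated street-view quad. The text mentions an affine matrix; a full perspective warp would instead use a four-point homography. All names and coordinates here are illustrative:

```python
import numpy as np

def affine_from_quads(src, dst):
    """Least-squares 2x3 affine matrix A mapping rectified-facade corners
    `src` to the annotated street-view quad `dst`; both are (4, 2) arrays."""
    src = np.asarray(src, float)
    dst = np.asarray(dst, float)
    # Build the [x, y, 1] design matrix and solve for the 6 affine parameters.
    X = np.hstack([src, np.ones((4, 1))])
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)
    return A.T  # 2x3, so that dst ≈ A @ [x, y, 1]

src = [(0, 0), (300, 0), (300, 400), (0, 400)]      # rectified CMP facade
dst = [(50, 60), (260, 90), (250, 380), (45, 350)]  # quad marked in a GSV photo
A = affine_from_quads(src, dst)
warped = (A @ np.hstack([np.asarray(src, float), np.ones((4, 1))]).T).T
```

The same matrix A is then applied to the facade image, its semantic map, and its floor-level line labels, keeping the three aligned.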

To enrich the data sample diversity and to improve the network generalizability, we collect 200 GSV images in Hong Kong and London, with both high-rise and relatively low-rise buildings, respectively. Together with the facade images from the CMP dataset, we compile a new dataset with 3,000 pairs of augmented street-view images, each associated with facade semantics and floor-level line labels. We train our FloorLevel-Net using the dataset, and compare it with an ablation model trained on the CMP dataset of only rectified facades. The results show a significant boost by the data augmentation scheme (Sec. VI-C).

Fig. 4: Characteristics of the data samples in our dataset: distributions of (a) the facade orientation, (b) the highest floor order, and (c) average number of pixels of each floor order in a single image.

Iv-B Dataset Characterization

We conduct a preliminary analysis of our augmented dataset to show its strong generalizability, and later use it to derive feasible components for our network. The following characteristics of the dataset determine some hyperparameters, e.g., the maximum floor order.

  • Facade orientation: Fig. 4 (a) reports the distribution of facade orientations. As mentioned in the previous subsection, three orientations are considered based on the facade orientation relative to the camera. The three orientations share similar proportions, without biasing towards any of the three.

  • Highest floor order: Fig. 4 (b) plots the distribution of the highest floor order in the data samples. Most facades have a highest floor order of five. Also, there is no single-floor building, since we omit single-floor buildings when compiling the data. Few high-rise buildings have a highest floor order above 10.

  • Average pixel number per floor order: Fig. 4 (c) presents the average number of pixels belonging to each floor order in a single image of size 480×360. Since the street-view images are taken from the ground, low floor orders have more pixels, and the number of pixels decreases as the floor order increases. Floor-level lines of orders above 10 occupy too few pixels for overlaying AR images.

In consideration of the distribution of highest floor order and average pixel number per floor order in the augmented dataset, we set the maximum floor order to be 10 in our floor-level line detection module (Sec. V-A).

Fig. 5: Overview of our FloorLevel-Net framework. We consume each input street-view image by a multi-task learning network to jointly segment the facades and detect floor-level lines. In the module of our line feature decoder, we incorporate height-attention layers (see bottom left) along with the original convolutional layers, to help enforce the vertical ordering of the floor-level lines, where “multi” denotes element-wise multiplication.

V FloorLevel-Net Approach

In this section, we first present the framework of FloorLevel-Net (see Fig. 5), which is a multi-task learning network (Sec. V-A) that fully utilizes the two-stream semantics of facades and floor-level lines, with a height-attention mechanism (Sec. V-B) to enforce the vertical ordering of floor-level lines. Then, we present the implementation details of FloorLevel-Net (Sec. V-C), followed by the add-on geometry post-processing module to generate the final parameters for each floor-level line (Sec. V-D).

V-A Multi-task Learning

Based on our augmented dataset, each street-view image for training is coupled with two label classes: (i) facade semantics Y_f and (ii) floor-level distributions Y_l. Y_f consists of {window, door, shop, left, right, front, other}, where the first three labels provide context, the next three indicate the facade orientation, and other marks the non-facade pixels. Y_l consists of {1, ..., N, other}, where N denotes the highest floor order (i.e., 10) as suggested by the data characteristics (Sec. IV-B), and other marks the non-floor-level-line pixels. The two-stream semantics complement each other for the goal: Y_f provides rich context for separating floor-level lines in each facade, whereas Y_l helps to estimate the order and position of each floor-level line. This observation on our training data inspires us to apply a multi-task learning process so that each task benefits the other.

As illustrated in Fig. 5, FloorLevel-Net first employs a shared encoder to learn a feature map, then applies two-branch decoders to gradually upsample the feature map via deconvolutional layers. The facade feature decoder in the upper branch outputs a segmentation mask P_f, such that P_f(i, c) indicates the probability of pixel i having label c. P_f is computed by

P_f(i, c) = softmax(W x_i)_c ,

where x_i is the feature vector of pixel i and W denotes the parameters learned by the network. We use the loss function L_f = CE(P_f, G_f) to supervise the network training, where G_f is the ground-truth map of facades and CE is a standard softmax cross-entropy function.

On the other hand, the line feature decoder in the lower branch of FloorLevel-Net outputs a floor-level distribution map P_l for floor-level line detection. Besides the softmax cross entropy CE(P_l, G_l), where G_l denotes the associated ground truth, we also fuse L_f into the loss function for supervising the network training:

L_l = CE(P_l, G_l) + L_f .

Incorporating L_f in L_l contributes to the floor-level line predictions, since facade context such as windows can help infer the floor levels. Experimental results also confirm that the fused loss helps to predict more continuous segmentations and more accurate floor orders (Sec. VI-C).
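A minimal NumPy sketch of such a fused objective, using per-pixel softmax cross-entropy; the fusion weight `lam` and the flattened (pixels × classes) shapes are our assumptions, not details from the paper:

```python
import numpy as np

def softmax_ce(logits, labels):
    """Mean per-pixel softmax cross-entropy.
    logits: (N, C) class scores; labels: (N,) integer class ids."""
    z = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(labels.size), labels].mean()

def fused_line_loss(line_logits, line_gt, facade_logits, facade_gt, lam=1.0):
    """Line-branch loss with the facade loss fused in; `lam` is an
    illustrative weight, not taken from the paper."""
    return softmax_ce(line_logits, line_gt) + lam * softmax_ce(facade_logits, facade_gt)
```

With near-perfect predictions on both branches, the fused loss approaches zero; a wrong facade prediction raises the line loss as well, which is the intended coupling.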

Fig. 6: Height-related feature: Pixel distribution of floor-level lines in vertical bounds of the image space.

V-B Height-Attention Mechanism

Observing that “cars can’t fly in the sky”, i.e., pixels of cars are generally below those of the sky, Choi et al. [6] suggest a height-attention mechanism based on an analysis of the Cityscapes [7] dataset, which indicates that the class distribution depends significantly on vertical position in urban scenes: the lower part of an image is mainly composed of road, while buildings and sky dominate the upper part. We hypothesize a similar pattern for the vertical pixel distribution of floor-level lines in our case, i.e., pixels of order-i floor-level lines are generally below those of order-j floor-level lines, given i < j. Here we simplify the analysis by dividing the image space (from bottom to top) into four equal-sized vertical bounds, i.e., low, mid-low, mid-high, and high; see the inset in Fig. 6. For each floor-level line in our data samples, we count its pixels in each vertical bound and compute its distribution ratios over the four bounds. Then, we sum and normalize the distribution ratios for floor-level lines of the same order across the whole dataset, obtaining the distribution ratios per floor-level line order. From Fig. 6, we can see that pixels of low-order floor-level lines appear mostly in the lower vertical bounds (e.g., 73% of floor-order-1 pixels are in the low bound), whilst pixels of high-order floor-level lines are more likely to locate in the upper vertical bounds (e.g., 53% of floor-order-6 pixels are in the high bound).

Next, we analyze the probabilities of floor-level line pixels in the whole image and in each vertical bound. Table I presents the probability distributions of the six dominant (lowest) floor orders, which account for over 97% of all floor orders (see Fig. 4 (c)). Here p_k denotes the probability that an arbitrary pixel is assigned to the k-th floor order. The results further confirm the hypothesis on the vertical pixel distribution. For example, the probability p_1 of floor order 1 is 9.75% in the whole image, and it drops from 27.7% to 1.8% in vertical bounds from low to high. On the contrary, p_6 increases from 0.03% to 3.09%, which matches the distribution in Fig. 6. We further measure the uncertainty of the pixel distribution probabilities in each bound by calculating the entropy H = -Σ_k p_k log p_k. The overall entropy of the entire image is 0.436, and the low bound has the smallest entropy of 0.298 due to the dominant probability of floor order 1. This result indicates that if a pixel falls in the low bound, it will most probably be predicted as floor order 1 rather than the other floor orders.

Given      p_1     p_2     p_3     p_4     p_5     p_6     Entropy
Image      9.75    11.8    9.56    4.90    2.83    1.17    0.436
Low        27.7    10.1    1.71    0.35    0.12    0.03    0.298
Mid-low    6.75    20.1    8.72    2.97    1.09    0.32    0.386
Mid-high   2.79    10.3    14.8    7.07    3.86    1.20    0.429
High       1.80    6.66    13.0    9.20    6.22    3.09    0.442
TABLE I: Probability distributions of pixels (in percentage) of the lowest six floor orders in the whole image and in each vertical bound.
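The entropy column of Table I can be reproduced from the probability rows; judging from the reported values, the logarithm appears to be base 10 (our inference, as the text does not state the base):

```python
import math

def entropy10(probs_percent):
    """Entropy -sum(p * log10 p) over per-floor-order pixel probabilities
    given in percent; base 10 is inferred from the values in Table I."""
    ps = [p / 100.0 for p in probs_percent if p > 0]
    return -sum(p * math.log10(p) for p in ps)

image_row = [9.75, 11.8, 9.56, 4.90, 2.83, 1.17]  # "Image" row of Table I
low_row = [27.7, 10.1, 1.71, 0.35, 0.12, 0.03]    # "Low" row
# entropy10(image_row) ≈ 0.436 and entropy10(low_row) ≈ 0.298,
# matching the rightmost column of Table I.
```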

These observations suggest that vertical position in image space can serve as a good indicator of floor order. Hence, we are motivated to leverage height-attention layers in the design of FloorLevel-Net. In detail, we include an HA layer between adjacent convolutional layers in the line feature decoder; see Fig. 5 (bottom left) for the illustration.

Specifically, the HA layer bridges a lower-level feature map F_l of floor-level lines and a higher-level feature map F_h (subscripts l and h denote lower-level and higher-level, respectively) in the following ways: (i) it adopts a width-wise pooling to extract vertical features from F_l; (ii) it employs a 1-D convolutional layer with a bilinear interpolation to generate attention map A, which shares the same channel and height size as F_h; and (iii) it generates a refined higher-level feature map by an element-wise multiplication of A and F_h. In doing so, A enriches the per-channel scaling factors for each individual row of vertical positions, and the refined higher-level feature map embodies the height-wise contextual information for locating and also ordering the floor-level lines.

To study the effectiveness of the HA mechanism, we extract the attention weights learned by the last HA layer and compare them with the floor-level distributions in street-view images. Fig. 7 shows a typical comparison example using the input image on the left. Here, we plot two heatmaps with floor orders as the horizontal axis and image-space height as the vertical axis. The heatmaps are constructed in the following way. Horizontal axes denote floor orders from 1 to 10, whereas vertical axes denote vertical position (height) sub-divided into 10 vertical bounds for visualization. For each cell (i, j), we count the pixels of the floor-level lines of order i (middle) or extract the attention weight of the i-th channel (right), inside the j-th vertical bound. Also, we normalize the intensities (in red) in each map. Comparing the two heatmaps, we can see that they exhibit very similar patterns. Particularly, if we mark the most frequent height(s) per floor order (see the blue boxes) in both plots, we can see that the attention weights reveal the heights of the pixels in floor-level lines of different orders. This comparison, together with the ablation analysis presented in Sec. VI-C, demonstrates the effectiveness of the HA mechanism.

Fig. 7: Left is an example input image with annotated floor orders. Middle plots the floor-level line pixel distributions in the input image, whereas right plots the attention weights in the last height-attention layer.
Fig. 8: Geometry-constrained post-processing for retrieving line parameters: grouping floor-level lines based on facade segmentation (a), fitting polylines according to line predictions (b), refining vanishing point using geometry constraints (c), and the final output (d).

V-C Implementation of FloorLevel-Net

FloorLevel-Net adopts an encoder-decoder structure based on DeepLabV3+ [4] with ResNet-101 [16] as the backbone. Specifically, we employ four ResNet stages and one atrous spatial pyramid pooling (ASPP) layer to generate the shared features, and use five convolutional layers in each of the decoders for facade segmentation and floor-level line detection. In the line feature decoder, we arrange one height-attention layer between each pair of adjacent convolutional layers, and append a residual layer at the end to fuse the features from the facade feature decoder. We use ReLU for all layers, except for the final prediction layers, where softmax is applied. We implement FloorLevel-Net using PyTorch and train it from scratch on a single NVIDIA GeForce GTX 1080 Ti GPU. During training, we adopt the momentum optimizer with a learning rate of 1e-3 and a small batch size of four. Convergence is reached at about 100K iterations.
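As a minimal sketch of the stated training setup (momentum optimizer, learning rate 1e-3, batch size four), one training step might look like the following; the toy two-layer model, the momentum value of 0.9, and the cross-entropy loss are assumptions for illustration only:

```python
import torch
import torch.nn as nn

# Hypothetical tiny model standing in for FloorLevel-Net's encoder-decoders.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 8, 1))  # 8 output classes (assumed)

# Momentum optimizer with the stated learning rate of 1e-3 (momentum assumed 0.9).
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# One step on a dummy batch of four images (the batch size used in the paper).
images = torch.randn(4, 3, 64, 64)
labels = torch.randint(0, 8, (4, 64, 64))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(float(loss) > 0)  # True
```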

V-D Geometry Post-Processing

FloorLevel-Net predicts two pixel-wise segmentation masks, one for facades and one for floor-level lines, from which we further extract line parameters by considering various geometric constraints:

  • Facade constraint: Floor-level lines are attached to facades, so the detected lines are valid only if they lie inside a facade region. Since the facade segmentation may contain multiple facades, we process the floor-level line segmentation and generate only one line per floor order per facade.

  • Vanishing point (VP) constraint: Floor-level lines of the same facade are intrinsically parallel in 3D, so they should meet at a common VP, which is at a finite location for perspective-oriented facades.

  • Order constraint: Assuming that the up direction in the input image roughly matches the up direction in the real world, the orders of floor-level lines on the same facade should strictly increase from bottom to top.

To enforce these constraints, we formulate a second-stage geometry post-processing to generate floor-level line parameters with the following steps (as illustrated in Fig. 8):

  • Line grouping (Fig. 8 (a)): First, we group relevant labels in the facade segmentation to form a super-segment per facade. Then we group the floor-level line segmentation and identify a group of lines per facade.

  • Polyfit (Fig. 8 (b)): We fit a polyline (in the form y = a_i x + b_i) to the pixels of each floor-level line (say the i-th line) using a least-squares regression:

      min_{a_i, b_i}  Σ_{(x,y) ∈ P_i}  (a_i x + b_i − y)²,

    where P_i denotes the set of pixels predicted for the i-th line. Note that on the same facade, there could be disjoint predicted regions of the same floor order; we merge these regions before the regression.

  • Refinement (Fig. 8 (c)): Polylines derived for the same facade may not perfectly meet at a common VP, so we regress a location (x_v, y_v) for the common VP and refine each polyline as y = a_i (x − x_v) + y_v by taking the VP as an anchor point. To do so, we formulate a gradient-based optimization that updates the parameters a_i, x_v, and y_v with a global least squares:

      min_{a_i, x_v, y_v}  Σ_i Σ_{(x,y) ∈ P_i}  (a_i (x − x_v) + y_v − y)².
  • Final result (Fig. 8 (d)): Last, we take the horizontal and vertical ranges of each facade to derive the two endpoints (x_s, y_s) and (x_e, y_e) of each polyline. Together with the floor order o, we then obtain the five-tuple (x_s, y_s, x_e, y_e, o) for each floor-level line.
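The line-fitting and VP-refinement steps above can be sketched as follows; for simplicity, this sketch replaces the paper's gradient-based joint optimization with a closed-form two-step alternative (least-squares VP intersection, then re-fitting each slope through the VP), and all function names are hypothetical:

```python
import numpy as np

def fit_line(xs, ys):
    """Least-squares fit y = a*x + b to the pixels of one floor-level line;
    disjoint regions of the same floor order are merged by concatenation."""
    A = np.stack([xs, np.ones_like(xs)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, ys, rcond=None)
    return a, b

def refine_with_vp(lines, pixel_sets):
    """Estimate a common vanishing point from the initial fits, then re-fit
    each line anchored at the VP (a closed-form stand-in for the paper's
    gradient-based joint optimization)."""
    # Each line y = a_i*x + b_i gives one equation a_i*xv - yv = -b_i.
    A = np.array([[a, -1.0] for a, _ in lines])
    rhs = np.array([-b for _, b in lines])
    (xv, yv), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    refined = []
    for xs, ys in pixel_sets:
        dx, dy = xs - xv, ys - yv
        a = float(np.dot(dx, dy) / np.dot(dx, dx))   # slope through the VP
        refined.append((a, yv - a * xv))             # back to y = a*x + b form
    return (xv, yv), refined

# toy facade: two floor-level lines whose pixels converge at VP (100, 0)
xs = np.linspace(0.0, 50.0, 30)
pixel_sets = [(xs, -0.5 * (xs - 100.0)), (xs, -1.0 * (xs - 100.0))]
lines = [fit_line(*p) for p in pixel_sets]
vp, refined = refine_with_vp(lines, pixel_sets)
print(np.allclose(vp, (100.0, 0.0)))  # True
```

On noisy real predictions, the two lstsq solves act as the two halves of the global least squares: the VP gathers evidence from all lines of a facade, and each slope is then constrained to pass through it.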

VI Experiment

To the best of our knowledge, this work is the first attempt at recognizing floor-level lines in street-view images. Hence, there are no benchmark datasets, evaluation metrics, and existing methods for the task. We fill the gaps by preparing a new dataset (Sec. VI-A), proposing quantitative metrics for evaluation (Sec. VI-B), and comparing with ablated techniques (Sec. VI-C) and some closely-related techniques (Sec. VI-D and Sec. VI-E). Further, we present some AR-related applications to demonstrate the applicability of this work (Sec. VI-F).

VI-A Testing Dataset

Besides the new training dataset prepared using our data augmentation scheme (see Sec. IV), we prepare a new testing dataset for evaluating the effectiveness and generalizability of our approach. Here, we choose street-view images by considering the following criteria: (i) the image should contain at least one (nearby) facade that is not fully occluded; (ii) the facades should contain at least one floor-level line; (iii) the facades can be perspective-oriented in view but cannot be curved in the physical world; and (iv) the facades should contain some semantic elements, e.g., windows or balconies. We collect 600 street-view images for the testing dataset from the following four sources:

  • We randomly select 150 street views from the eTrims [20] and arcDataset [43]. Each image includes only one single facade, and the buildings are mostly low-rise.

  • We randomly select 150 more images from the LabelMeFacade [10], TSG-20 [36] and ZuBuD+ [31] datasets. These images have more high-rise buildings, and some have multiple facades.

  • We randomly download 200 GSV images in London and Glasgow (UK), which feature typical European-style buildings, similar to those in the CMP dataset.

  • Lastly, we randomly download another 100 GSV images in Hong Kong (HK). These images feature high-rise buildings, and many facades are partially occluded by billboards, advertisements, etc.

On each test image, we manually label the position and order of each floor-level line on the facades in the form of a quadrilateral region, e.g., see the “GT” row in Fig. 9. It takes around 2 minutes to label one image and around 20 hours in total for all the 600 images. The datasets are available on the project website: https://wumengyangok.github.io/Project/FloorLevelNet/.

VI-B Evaluation Metrics

As a two-stage method, our results include (i) the predicted pixel-wise segmentations from FloorLevel-Net, and (ii) the regressed endpoints and order of each floor-level line from the geometry post-processing. We employ the following metrics to evaluate the two results separately:

  • Pixel-wise accuracy. For each street-view image, we denote the sets of pixels in ground-truth and in predicted line regions as G and P, respectively. We employ the F1 score for quantitative comparisons, measured from the numbers of true positives (TP = |G ∩ P|), false positives (FP = |P \ G|), and false negatives (FN = |G \ P|):

      F1 = 2 TP / (2 TP + FP + FN).    (5)
  • Line-wise accuracy. We join the endpoints of each floor-level line to form a three-pixel-wide straight line, yielding a bag of pixels B_i. Then, we compute the confidence c_i of the i-th line being correctly recognized as

      c_i = (1 / |B_i|) Σ_{p ∈ B_i} 1[p has ground-truth label i],

    where 1[·] is an indicator function for counting the pixels in B_i that appear on the ground-truth image with label i. If c_i is above a threshold, we regard the line prediction as a true positive. For each street-view image, we again count the number of floor-level lines that are correctly detected (TP), missed (FN), and wrongly recognized (FP), then employ the F1 score (like Eq. (5)) to compute the line-wise accuracy.
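A minimal sketch of the two metrics, with hypothetical helper names and toy inputs:

```python
import numpy as np

def line_confidence(line_pixels, gt_label_map, order):
    """Fraction of a predicted line's pixels that land on ground-truth
    pixels with the same floor order (the indicator-function average)."""
    rows, cols = zip(*line_pixels)
    hits = (gt_label_map[np.array(rows), np.array(cols)] == order)
    return hits.mean()

def f1(tp, fp, fn):
    """F1 score from true-positive, false-positive, and false-negative counts."""
    return 2 * tp / (2 * tp + fp + fn)

# toy ground truth: floor order 1 occupies row 5 of a 10x10 label map
gt = np.zeros((10, 10), dtype=int)
gt[5, :] = 1
pred_line = [(5, c) for c in range(10)]            # perfectly aligned line
print(line_confidence(pred_line, gt, 1))            # 1.0
print(round(f1(tp=8, fp=2, fn=1), 3))               # 0.842
```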

Fig. 9: Qualitative comparison results in the ablation analysis on the unseen testing dataset. The results illustrate the effects of (i) our data augmentation scheme by comparing DeeplabV3+ models trained on CMP and on our training data (green background), (ii) multi-task learning with and without the height-attention mechanism (red background), and (iii) our full method further with geometry post-processing (yellow background), vs. the ground truths (GT).

VI-C Ablation Analysis

We carry out an ablation analysis to evaluate the major components of our approach, including (i) the data augmentation scheme, (ii) the multi-task learning, (iii) the height-attention mechanism in FloorLevel-Net, and (iv) the geometry post-processing. In the analysis, we employ the testing dataset, whose images do not appear in the training data.

Ablated techniques. To evaluate components in FloorLevel-Net, we take a stepwise testing strategy by adding the components one by one to the baseline DeepLabV3+ [4] model until we have our full architecture. The difference between consecutive tests indicates the performance enhancement made by the added component. To start, we evaluate our data augmentation scheme by the following pair of experiments:

  • Baseline+CMP: We train a DeepLabV3+ model (baseline) using the facade dataset CMP [37] with rectified facades (Fig. 3 (a1)) and our labels of floor-level lines (Fig. 3 (a4)).

  • Baseline+DataAugm: We train another DeepLabV3+ model on the training dataset of 3,000 pairs of augmented street-view images (Fig. 3 (b2)) and labels of floor-level lines (Fig. 3 (b4)), which are produced by our data augmentation scheme, as described in Sec. IV-A.

Next, we evaluate the multi-task learning framework and height-attention mechanism in FloorLevel-Net by adding these components to Baseline+DataAugm, as follows:

  • Multi-task: This refers to the multi-task learning network that we build to simultaneously predict facades and floor-level lines, as described in Sec. V-A.

  • Multi-task+HA: We further incorporate a set of height-attention layers, as described in Sec. V-B, to form the full architecture of FloorLevel-Net.

Last, we arrive at Our Full Method by including the second-stage geometry post-processing to regress floor-level lines.

Qualitative analysis. Fig. 9 shows the qualitative comparison results. Here, we select one or two street-view images from each constituent source of our testing dataset: eTrims (1st column), arcDataset (2nd column), TSG-20 (3rd column), LabelMeFacade (4th column), ZuBuD+ (5th column), GSVs in UK (6th & 7th columns), and GSVs in Hong Kong (8th & 9th columns). The top row shows the input images, the bottom row shows the ground-truth images, and the rows in between show, from top to bottom, the prediction results obtained by the ablated methods, i.e., Baseline+CMP, Baseline+DataAugm, Multi-task, Multi-task+HA, and Our Full Method.

Overall, Baseline+DataAugm generates better predictions than Baseline+CMP (comparing 2nd & 3rd rows in Fig. 9), especially for street views with multiple facades and for facades that are perspective-oriented, e.g., the GSV images in UK & Hong Kong. The results indicate the effectiveness of our data augmentation scheme in generating augmented street-view images for network training.

Next, comparing the 3rd & 4th rows of Fig. 9, we can see that Multi-task produces more continuous and distinctive segmentations of floor-level lines than Baseline+DataAugm. This shows that exploring the rich semantics in facades indeed facilitates the recognition of floor-level lines. Further, the full architecture of FloorLevel-Net with the height-attention mechanism (5th row in Fig. 9) produces even better results, especially in the prediction of floor orders. Particularly, it successfully predicts the floor orders even for the challenging street-view images shown in the 4th to 9th columns, whilst the method fails on these images in the absence of the height-attention mechanism. Better predictions help simplify the line generation and refinement in the subsequent geometry post-processing stage. Benefiting from the fine predictions by FloorLevel-Net, the post-processing stage (6th row in Fig. 9) is able to deduce satisfactory floor-level line parameters, subject to the various geometric constraints discussed earlier.

                         Lower floors       Upper floors       Overall
                         Pixel    Line      Pixel    Line      Pixel    Line
Facades & buildings
  Baseline + CMP         0.374    0.725     0.136    0.676     0.289    0.702
  Baseline + DataAugm    0.599    0.887     0.479    0.828     0.560    0.859
  Multi-task             0.631    0.912     0.498    0.845     0.586    0.880
  Multi-task + HA        0.676    0.926     0.586    0.860     0.644    0.894
GSV in UK & HK
  Baseline + CMP         0.152    0.592     0.114    0.486     0.133    0.524
  Baseline + DataAugm    0.530    0.781     0.449    0.708     0.505    0.742
  Multi-task             0.555    0.819     0.491    0.734     0.535    0.774
  Multi-task + HA        0.636    0.854     0.531    0.749     0.605    0.798

TABLE II: Quantitative comparison with the ablated techniques in terms of pixel-wise accuracy on network prediction results (Pixel), and line-wise accuracy on geometry post-processing results (Line). The Multi-task + HA rows correspond to our full method.

Quantitative analysis. Further, we conduct a quantitative evaluation of each ablated technique. Here, we categorize the testing dataset into two groups: images from datasets focusing on buildings or facades (eTrims, LabelMeFacade, etc.) are relatively easy for recognizing floor-level lines, since these images were prepared for facade parsing or building classification tasks and feature regular facade elements, whereas GSVs in UK & HK, with complex environments and dynamic attributes, are harder for recognition. We would also like to evaluate the performance of our approach on different floor-level lines, as many AR applications need only floor information close to the camera. To do so, we categorize floors 1-3 as lower floors, since most of their pixels lie in the lower part of the image space, and the remaining floors as upper floors.

Fig. 10: Comparison between our approach and DeepFacade-Variant on orthogonal (left two columns) and perspective (right two columns) facades. Both methods produce satisfactory results for facades that are orthogonal to camera view, whilst our approach is more robust to facades in perspective view, and multiple facades in one image (last row).
Fig. 11: The potential of our approach to support and enrich various AR scenarios, e.g., navigation, advertisement, etc.

We measure the pixel-wise accuracy on the ablation network outputs, and the line-wise accuracy on the line parameters after geometry post-processing. The quantitative results are presented in Table II, in which our full method corresponds to Multi-task + HA with geometry post-processing. Based on the results, we can derive some interesting observations. First, the overall accuracy drops as scene complexity increases from the simple street views in facade datasets to the GSVs in UK and HK. There are more facades and higher floors in GSVs, especially those in HK, resulting in less accurate predictions. Second, predictions for lower floors have better accuracy than those for higher floors. This is probably due to relatively weaker supervision for upper floor-level lines, as they are typically further away from the camera than the lower ones and carry less contextual information. Third, the data augmentation scheme, multi-task learning, and height-attention layers gradually contribute to the performance enhancement, yet there is still much room for improvement in terms of pixel-wise accuracy. This indicates that floor-level line recognition is more challenging than general urban scene segmentation tasks. Nevertheless, our predictions provide enough information for the geometry post-processing, which can correctly recognize and locate most floor-level lines in terms of line-wise accuracy.

VI-D Comparison with a Variant of DeepFacade

As there are no previous works on locating floor-level lines, we take as competitors works on facade parsing, i.e., segmenting a building facade into windows, doors, etc., which can then serve as references for locating floor-level lines. Here, we compare with DeepFacade [23], a state-of-the-art deep-learning-based method for facade parsing. In detail, we extract floor-level lines from its results as follows: (i) use RANSAC [3] to rectify a street-view image and record the associated homography matrix; (ii) apply DeepFacade [23] to the rectified image to produce a facade segmentation map, and extract only the window, door, and shop regions as positive regions in the map; (iii) mark a floor-level line in-between each pair of adjacent (piecewise) positive regions; and (iv) apply the inverse of the homography matrix to map the detected lines back to their original positions and obtain the final results. Hereafter, we refer to the above steps as DeepFacade-Variant.
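Step (iv) amounts to mapping line endpoints through the inverse homography; a minimal numpy sketch (the function name and the toy translation-only homography are illustrative):

```python
import numpy as np

def map_points(H, pts):
    """Apply a 3x3 homography to an array of (x, y) points; with the inverse
    matrix, this brings lines detected in the rectified image back to the
    original view."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])   # homogeneous coordinates
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]              # de-homogenize

# toy homography: pure translation by (10, 20)
H = np.array([[1.0, 0.0, 10.0],
              [0.0, 1.0, 20.0],
              [0.0, 0.0,  1.0]])
endpoints = np.array([[0.0, 0.0], [100.0, 50.0]])      # endpoints of one line
rectified = map_points(H, endpoints)
restored = map_points(np.linalg.inv(H), rectified)     # back to original view
print(np.allclose(restored, endpoints))  # True
```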

This experiment compares Our Full Method with DeepFacade-Variant. However, we find that DeepFacade-Variant can hardly produce satisfactory results for GSVs in UK and HK (see supplemental material for examples) for two reasons: (i) RANSAC rectification is constrained to street views with only a single facade, but street views in UK and HK often contain multiple facades, leading to multiple detected vanishing points and unstable rectification results; and (ii) buildings in modern cities may contain facades without obvious line segments, making it hard to deduce the vanishing points and perform the rectification. Hence, we report the results only on the 300 testing images from the eTrims, arcDataset, LabelMeFacade, TSG-20, and ZuBuD+ datasets, most of which contain only one building facade.

Fig. 10 shows typical results produced by our approach and DeepFacade-Variant. From the results shown on the left (1st & 2nd columns), we can see that both methods produce good results for facades that are orthogonal to the camera view. Yet, DeepFacade-Variant is prone to fail when RANSAC infers an improper homography matrix (see row 2 for an example), whilst our approach overcomes this limitation through the refinement made by the geometry constraints. More failure cases of DeepFacade-Variant are presented in the 3rd & 4th columns, where the facades are in perspective view. The input image in row 1 is not rectified properly, hence the interpolated lines point in wrong directions. In row 2, windows on upper floors have little saliency due to distortion, so the results miss floor-level lines on high levels. The building in row 3 has partially irregular facade textures, such as lintels and stairs, and the method wrongly recognizes adjacent floors. In the last row, there are two facades, which cannot be rectified by one single homography matrix. In all these cases, our approach produces good predictions.

                      Line-wise   Rectification   Network     Post-         Sum
                      accuracy                    inference   processing
DeepFacade-Variant    0.721       1.08s           0.06s       0.04s         1.18s
Ours                  0.876       -               0.05s       0.14s         0.19s

TABLE III: Quantitative comparison between our approach and DeepFacade-Variant in terms of accuracy and time efficiency.

Table III quantitatively compares the results of DeepFacade-Variant and Our Full Method. From the line-wise accuracies, we can see an obvious advancement of our approach over the competitor. Also, our approach outperforms DeepFacade-Variant in terms of execution time, with a total running time of only 0.19 seconds vs. 1.18 seconds. In DeepFacade-Variant, a major bottleneck is the heavy time cost of RANSAC rectification, whilst our approach directly predicts pixel-wise segmentations without requiring rectification.

VI-E Comparison with Semantic Segmentation

As the pixel-wise floor segmentation can be formulated as an image segmentation task, we also present a comparison with semantic segmentation methods on the pixel accuracy of the intermediate results. Here we choose PSPNet [52] and OCR [47] as baselines, since PSPNet is widely used as a baseline for semantic segmentation tasks, and OCR is a state-of-the-art method with among the highest pixel accuracies on some urban semantic segmentation datasets. For each method, we train the model with both the CMP dataset and our augmented dataset, and test on the 600 real images following the same settings as the ablation study in Sec. VI-C.

Table IV presents quantitative results in terms of pixel-wise accuracy. Similar to the ablation analysis, the low overall pixel accuracy of the comparison methods trained on the CMP dataset indicates the difficulty of inferring floor-level lines, especially without a proper dataset. Our data augmentation significantly improves the prediction accuracy to an acceptable level for both segmentation methods, showing the good generalizability of the baseline network architectures. Moreover, our method with multi-task learning and the height-attention mechanism achieves the best performance, on both the facades & buildings datasets and the GSVs in UK & HK.

Datasets               Segmentation methods    Lower   Upper   Overall
Facades & buildings    PSPNet + CMP            0.258   0.059   0.197
                       PSPNet + DataAugm       0.649   0.528   0.609
                       OCR + CMP               0.379   0.254   0.330
                       OCR + DataAugm          0.610   0.492   0.570
                       Ours                    0.676   0.586   0.644
GSV in UK & HK         PSPNet + CMP            0.081   0.001   0.067
                       PSPNet + DataAugm       0.369   0.205   0.321
                       OCR + CMP               0.083   0.016   0.062
                       OCR + DataAugm          0.626   0.474   0.581
                       Ours                    0.636   0.531   0.605

TABLE IV: Quantitative comparison between our method and image segmentation baselines in terms of pixel-wise accuracy.

VI-F Applicability

Fig. 11 shows some example usage scenarios that demonstrate the potential of our method in enriching AR-related applications. Enabled by the recognized floor-level lines, we can augment real-world scenes with virtual contents in a floor-aware manner. First, from the floor-level line results, we can identify the bounds of each facade, i.e., the lowest and highest floor-level lines (and their locations), and overlay virtual contents that cover the whole facade, while skipping the real contents on the ground level (Fig. 11 (top)). By this means, we can deliver AR contents without obscuring necessary real contents around the user on the street.

More importantly, with the floor-level lines, we can place context-aware information on specific floors. For example, in the bottom-row result in the 2nd column of Fig. 11, we specifically mark the location of the target room for convenient navigation; and in the middle-row result in the 3rd column of Fig. 11, we put shopping directory information over the corresponding floors to help users find their target items in the department store. Besides, we may mark specific apartments in a building that are for sale or rent to aid property agents. On the other hand, we may also employ our approach to present floor-related data (e.g., floor size, occupancy information, etc.) on buildings, and extend it to estimate the number of floor levels and infer building heights, which could be helpful for land-usage analysis and urban planning [32, 42, 48].

Fig. 12: Limitations of our current approach: obstacle interference (left), glass reflection (middle), and facades too close to camera (right).

VII Conclusion, Discussion, and Future Work

We presented a new approach for recognizing floor-level lines in street-view images. As a first attempt at this challenging task, we contribute (i) a new data augmentation scheme that leverages an existing facade dataset to efficiently generate data samples for network training; and (ii) FloorLevel-Net, a new multi-task learning framework that associates explicit features of facades and implicit floor-level lines, and incorporates a height-attention mechanism to enforce a vertical ordering of floor-level lines. The pixel-wise semantic segmentations produced by FloorLevel-Net are further refined through a geometry post-processing module for plausible and consistent reconstruction of floor-level lines. Quantitative and qualitative comparisons with existing methods demonstrate the effectiveness of our proposed method. Further, we demonstrate the potential of our approach in supporting various AR scenarios.

Discussion. This work targets inferring floor-level lines in street-view images. We proposed the data augmentation scheme and FloorLevel-Net to overcome the lack of a training dataset and of a suitable network architecture. Though the methods are dedicated to this specific task, we believe the ideas of augmenting a dataset and tuning a network based on data characteristics can feasibly be extended to other applications where labeling is expensive and the data exhibit certain patterns. For instance, in autonomous driving, a key requirement is to detect driving lanes and pedestrians simultaneously. In such scenarios, one could possibly improve network performance by carefully modeling the relationship between driving lanes and pedestrians.

Limitations. First, our method could be disturbed by obstacles in critical regions; e.g., a cab that occludes the ground floor of a building (Fig. 12 (left)) may cause a wrongly-recognized ground floor. Second, our method cannot handle facades that are reflective (Fig. 12 (middle)). Third, when the camera is too close to the facades (Fig. 12 (right)), our method cannot get a wide view of the facades, so it may wrongly recognize the upper and lower bounds of a floor-level line region as two successive floor-level lines.

Future work. We plan to focus on improving the prediction precision and efficiency of our approach. First, a possible direction is to work on short video clips instead of single images, so that we can obtain temporal information, e.g., structure-from-motion (SfM) features, to improve the prediction precision. Second, we plan to develop strategies to estimate the distance from the camera to the facades, to address the last limitation mentioned above. Third, our current pipeline takes around 0.19 seconds (see Table III), which is fast compared with DeepFacade-Variant but still insufficient for real-time use. At present, the main bottleneck is the post-processing module, which can be accelerated through a multiprocessing implementation, since line grouping and polyline fitting can be processed in parallel. Also, we plan to explore lightweight networks to further improve the pipeline efficiency.


We thank Google for the street-view images provided through the Google Street View service. This work is supported partially by the Research Grants Council of the Hong Kong Special Administrative Region (Project no. CUHK 14206320) and the Guangdong Basic and Applied Basic Research Foundation (2021A1515011700).


  • [1] D. Anguelov, C. Dulong, D. Filip, C. Frueh, S. Lafon, R. Lyon, A. Ogale, L. Vincent, and J. Weaver (2010) Google street view: capturing the world at street level. Computer 43 (6), pp. 32–38. Cited by: §I.
  • [2] O. Barinova, V. Konushin, A. Yakubenko, K. Lee, H. Lim, and A. Konushin (2008) Fast automatic single-view 3-D reconstruction of urban scenes. In

    Proceedings of the European Conference on Computer Vision (ECCV)

    pp. 100–113. Cited by: §I.
  • [3] K. Chaudhury, S. DiVerdi, and S. Ioffe (2014) Auto-rectification of user photos. In Proceedings of the IEEE International Conference on Image Processing (ICIP), pp. 3479–3483. Cited by: §VI-D.
  • [4] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 801–818. Cited by: §II, §V-C, §VI-C.
  • [5] Z. Chen, J. Zhang, and D. Tao (2021) Recursive context routing for object detection. International Journal of Computer Vision 129 (1), pp. 142–160. Cited by: 3rd item.
  • [6] S. Choi, J. T. Kim, and J. Choo (2020) Cars can’t fly up in the sky: improving urban-scene segmentation via height-driven attention networks. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    pp. 9373–9383. Cited by: §I, 3rd item, §V-B.
  • [7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3213–3223. Cited by: 1st item, §IV, §V-B.
  • [8] A. Criminisi, I. Reid, and A. Zisserman (2000) Single view metrology. International Journal of Computer Vision 40 (2), pp. 123–148. External Links: ISSN 1573-1405, Document, Link Cited by: §I.
  • [9] D. Eigen and R. Fergus (2015) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2650–2658. Cited by: 2nd item.
  • [10] B. Fröhlich, E. Rodner, and J. Denzler (2010) A fast approach for pixelwise labeling of facade images. In Proceedings of the International Conference on Pattern Recognition (ICPR), pp. 3029–3032. Cited by: 1st item, 2nd item.
  • [11] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3354–3361. Cited by: 1st item.
  • [12] Google’s street view static API. Web Page. Note: Accessed: 2020-10-01https://developers.google.com/maps/documentation/streetview Cited by: §IV-A.
  • [13] R. Grompone von Gioi, J. Jakubowicz, J. Morel, and G. Randall (2012) LSD: a Line Segment Detector. Image Processing On Line 2, pp. 35–55. External Links: Document Cited by: §I.
  • [14] A. Gupta, A. A. Efros, and M. Hebert (2010) Blocks world revisited: image understanding using qualitative geometry and mechanics. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 482–496. Cited by: §I.
  • [15] O. Haines and A. Calway (2015) Recognising planes in a single image. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (9), pp. 1849–1861. External Links: ISSN 0162-8828, Document Cited by: §II.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §V-C.
  • [17] D. Hoiem, A. A. Efros, and M. Hebert (2007) Recovering surface layout from an image. International Journal of Computer Vision 75 (1), pp. 151–172. External Links: ISSN 1573-1405, Document, Link Cited by: §II.
  • [18] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7132–7141. Cited by: 3rd item.
  • [19] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu (2019) CCNet: criss-cross attention for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 603–612. Cited by: 3rd item.
  • [20] F. Korč and W. Förstner (2009-04) eTRIMS Image Database for interpreting images of man-made scenes. Technical report Technical Report TR-IGG-P-2009-01. External Links: Link Cited by: 1st item.
  • [21] J. Lee, H. Kim, C. Lee, and C. Kim (2017) Semantic line detection and its applications. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 3229–3237. Cited by: §I.
  • [22] H. Liu, Y. Xu, J. Zhang, J. Zhu, Y. Li, and C. H. S. Hoi (2020) DeepFacade: a deep learning approach to facade parsing with symmetric loss. IEEE Transactions on Multimedia (), pp. 1–1. Cited by: §II.
  • [23] H. Liu, J. Zhang, J. Zhu, and S. C. H. Hoi (2017) DeepFacade: a deep learning approach to facade parsing. In

    Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI)

    pp. 2301–2307. Cited by: §II, §VI-D.
  • [24] X. Liu, Y. Zhao, and S. Zhu (2018) Single-view 3D scene reconstruction and parsing by attribute grammar. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (3), pp. 710–725. Cited by: 2nd item.
  • [25] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440. Cited by: §II.
  • [26] A. Martinović, M. Mathias, and L. Van Gool (2016) ATLAS: a three-layered approach to facade parsing. International Journal of Computer Vision 118 (1), pp. 22–48. Cited by: §I, §II.
  • [27] P. Mueller, G. Zeng, P. Wonka, and L. Van Gool (2007) Image-based procedural modeling of facades. ACM Trans. Graph. (SIGGRAPH) 26 (3), pp. 85:1–85:10. Cited by: §II.
  • [28] P. Musialski, P. Wonka, D. G. Aliaga, M. Wimmer, L. Van Gool, and W. Purgathofer (2013) A survey of urban reconstruction. Computer Graphics Forum (Eurographics) 32 (6), pp. 146–177. Cited by: §I.
  • [29] Z. Ren and Y. J. Lee (2018) Cross-domain self-supervised multi-task feature learning using synthetic imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 762–771. Cited by: 2nd item.
  • [30] G. Schindler, P. Krishnamurthy, R. Lublinerman, Y. Liu, and F. Dellaert (2008) Detecting and matching repeated patterns for automatic geo-tagging in urban environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–7. Cited by: §I, §II.
  • [31] H. Shao, T. Svoboda, and L. Van Gool (2003) ZuBuD: Zurich buildings database for image based recognition. Computer Vision Lab, Swiss Federal Institute of Technology, Switzerland, Tech. Rep 260 (20), pp. 6. Cited by: 2nd item.
  • [32] Q. Shen, W. Zeng, Y. Ye, S. Mueller Arisona, S. Schubiger, R. Burkhard, and H. Qu (2018) StreetVizor: visual exploration of human-scale urban forms based on street views. IEEE Transactions on Visualization and Computer Graphics 24 (1), pp. 1004–1013. External Links: Document Cited by: §VI-F.
  • [33] G. Simon, A. Fond, and M. Berger (2018) A-contrario horizon-first vanishing point detection using second-order grouping laws. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 323–338. Cited by: §I.
  • [34] O. Teboul, I. Kokkinos, L. Simon, P. Koutsourakis, and N. Paragios (2011) Shape grammar parsing via reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2273–2280. Cited by: §II.
  • [35] O. Teboul, L. Simon, P. Koutsourakis, and N. Paragios (2010) Segmentation of building facades using procedural shape priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3105–3112. Cited by: §II.
  • [36] Tourist sights graz image database. Web Page. Note: http://dib.joanneum.at/cape/TSG-20/, accessed 2021-04-12. Cited by: 2nd item.
  • [37] R. Tyleček and R. Šára (2013) Spatial pattern templates for recognition of objects with regular structure. In Proceedings of German Conference on Pattern Recognition (GCPR), pp. 364–374. Cited by: 1st item, Fig. 3, item (i), §IV-A, §IV, 1st item.
  • [38] X. Wang, D. Fouhey, and A. Gupta (2015) Designing deep networks for surface normal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 539–547. Cited by: 2nd item.
  • [39] J. Weissenberg, H. Riemenschneider, M. Prasad, and L. Van Gool (2013) Is there a procedural logic to architecture?. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 185–192. Cited by: §II.
  • [40] S. Woo, J. Park, J. Lee, and I. S. Kweon (2018) CBAM: convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19. Cited by: 3rd item.
  • [41] C. Wu, J. Frahm, and M. Pollefeys (2011) Repetition-based dense single-view reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3113–3120. Cited by: §I, §II.
  • [42] Y. Wu, L. S. Blunden, and A. S. Bahaj (2018) City-wide building height determination using light detection and ranging data. Environment and Planning B: Urban Analytics and City Science 46 (9), pp. 1741–1755. External Links: ISSN 2399-8083, Document, Link Cited by: §I, §VI-F.
  • [43] Z. Xu, D. Tao, Y. Zhang, J. Wu, and A. C. Tsoi (2014) Architectural style classification using multinomial latent logistic regression. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 600–615. Cited by: 1st item.
  • [44] C. Yang, T. Han, L. Quan, and C. Tai (2012) Parsing façade with rank-one approximation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1720–1727. Cited by: §I, §II.
  • [45] F. Yang and Z. Zhou (2018) Recovering 3D planes from a single image via convolutional neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 87–103. Cited by: §I.
  • [46] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang (2018) Learning a discriminative feature network for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1857–1866. Cited by: 3rd item.
  • [47] Y. Yuan, X. Chen, and J. Wang (2020) Object-contextual representations for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 173–190. Cited by: §VI-E.
  • [48] W. Zeng and Y. Ye (2018-09) VitalVizor: a visual analytics system for studying urban vitality. IEEE Computer Graphics and Applications 38, pp. 38–53. External Links: Document Cited by: §I, §VI-F.
  • [49] Z. Zeng, M. Wu, W. Zeng, and C. Fu (2020) Deep recognition of vanishing-point-constrained building planes in urban street views. IEEE Transactions on Image Processing 29, pp. 5912–5923. Cited by: §I, §II.
  • [50] J. Zhang, Y. Xu, B. Ni, and Z. Duan (2018) Geometric constrained joint lane segmentation and lane boundary detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 502–518. Cited by: §II.
  • [51] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2881–2890. Cited by: §II.
  • [52] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §VI-E.