Reconstruction of building façades is one of the key steps towards the complete reconstruction of a LOD-3 (Level of Detail 3) model in the CityGML standard [groger2012citygml]. Semantic objects such as windows, doors, and balconies are important components of a building façade. Extracting them [hoegner2015building] and arranging them in a regularized manner [hensel2019facade] are two important steps towards structured LOD-3 reconstruction [zhu2020interactive]. Street-view images are arguably the best data source for these objectives because they are publicly available and efficient to collect, e.g. Google Street View [anguelov2010google].
For the detection of semantic objects in street-view images, classical methods include projected histograms [lee2004extraction, kostelijk2012semantic], gradient projection, K-means clustering [recky2010windows], correlation coefficients [mayer2007building], and perceptual grouping [sirmacek2011detection]. Such methods do not consider the structure and spatial distribution of the semantic objects. Recently, methods based on deep learning [mathias2016atlas, liu2017deepfacade] have been widely used to extract semantic objects on building façades and have achieved impressive results on images with projective distortion and scale differences; however, the regularities of the semantic objects have not yet been considered.
In general, these semantic objects should conform to certain regularities, such as aligned locations and consistent sizes. However, due to projective distortion and complex backgrounds, the geometric attributes of the primitives extracted from façade images generally deviate slightly from the expected values. Although the regularization of 2D boundaries, such as building edges, has been widely studied [xie2018hierarchical], these approaches cannot be directly adopted. In addition, the regular arrangements of façades can also be learned for specific scenarios [dehbi2011learning, dehbi2017statistical]; however, the learned models are tied to those scenarios and do not generalize to unseen data.
Recently, a general and promising approach to align different objects of building façades using Mixed Integer Linear Programming (MILP) was proposed [hensel2019facade]. However, in our practice the MILP is too complex to solve and requires prohibitively long runtimes. Because we aim to integrate the pipeline into an interactive reconstruction environment, at least near real-time response of the solver is required. To solve this issue, we reformulate the problem as a Binary Integer Programming (BIP) problem, with all the unknowns in the binary space $\{0, 1\}$, so that the objective can be expressed explicitly as logical operations on the binary variables. Compared with MILP, BIP is handled more efficiently by state-of-the-art optimization routines [gleixner2018scip, gurobi2014gurobi].
In summary, this paper proposes a fast and regularized reconstruction method for building façades from street-view images. First, we extract typical façade primitives using a real-time object detection pipeline, i.e. the YOLOv3 architecture [redmon2016you, redmon2018yolov3]. Second, the positions and sizes of the primitives are clustered using BIP by balancing the two competing goals of data fitness and regularity, for which no extra information about the façades is required. Finally, the clustered primitives are reconstructed in an interactive environment, e.g. SketchUp, by substituting each clustered primitive with a pre-built component model or by interactively sketching the component on the street-view images.
2 Related Works
Many works have been devoted to the extraction and segmentation of building façades in the communities of photogrammetry, computer vision and computer graphics. With regard to detecting façade objects from images, various deep learning architectures, such as CNN [krizhevsky2012imagenet] and RNN [graves2008novel], have in recent years achieved impressive results for computer vision tasks such as image classification [chan2015pcanet] and object detection [girshick2015region]. Although earlier CNN architectures greatly improve the accuracy of object detection, detection is very slow because several separate steps [girshick2015region] are used, including the generation of proposals and the classification of regions. For this reason, their use in applications requiring real-time responses is limited. The YOLO (You Only Look Once) network [redmon2016you, redmon2018yolov3], as the name suggests, requires only a single integrated forward pass in the testing stage and achieves real-time detection rates for off-the-shelf video sensors. The incrementally upgraded YOLOv3 [redmon2018yolov3] integrates ResNet [he2016deep], FPN (Feature Pyramid Network) [lin2017feature], and a binary cross-entropy loss, which greatly improves both detection speed and detection accuracy. It also improves the performance on small targets, which makes it suitable for detecting semantic objects with complex repeating structures on building façades. Therefore, this paper adopts YOLOv3 as the backbone for the detection of the primitives.
With regard to the regular arrangement of objects, explicit or implicit procedural methods infer the structure of a façade through grammatical rules, including stochastic grammars [alegre2004probabilistic], syntax trees [ripperda2006reconstruction], and hybrid bottom-up and top-down approaches [han2008bottom]. They all require the correct parameters of the shape grammar to be set in advance. Although these methods have achieved good results, they assume that the image is composed of a fairly regular grid; in addition, fixed grammar expressions cannot cover the diversity of real-world applications. Procedural grammars are also cumbersome to edit and compile, which requires considerable expert knowledge, and human intervention is needed to select the appropriate grammar for a particular building. Although style classifiers [mathias2016atlas] that automatically recognize architectural styles from low-level image features were developed to alleviate these issues, the style grammars must still be defined in advance, which limits this approach.
Recent approaches based on mixed integer programming are arguably the most flexible and powerful tools for the regular arrangement of objects. They have been used for the arrangement of 2D boundaries and 3D planes [monszpart2015rapter], the reconstruction of surface meshes [boulch2014piecewise, nan2017polyfit], and the modeling of roof structures of LOD-2 models [goebbels2019beautification] and of façades [hensel2019facade]. However, most of them formulate the optimization problem as MILP [goebbels2019beautification, hensel2019facade] or even mixed integer non-linear programming [monszpart2015rapter], with unknowns in both integer and real spaces. Unfortunately, this kind of problem from operations research has no efficient solvers at large scale, even with state-of-the-art commercial libraries [gurobi2014gurobi]. A practical remedy is to reformulate the problem as BIP [nan2017polyfit, kelly2017bigsur, kelly2018simplifying], which only considers binary variables and linear energies; the regularities can still be modeled explicitly through logical operations on the binary variables, and relatively more efficient solvers exist for these simpler problems. Therefore, we use BIP to model the regularization problem of the façade objects.
3 Detection of façade primitives using YOLOv3
We use YOLOv3 [redmon2018yolov3] to detect axis-aligned bounding boxes of primitives because of its real-time performance. For completeness, we briefly introduce the architecture and implementation details of YOLOv3 here. Unlike region-based CNN methods [girshick2015region], YOLO [redmon2016you] uses regression to process the entire image directly, and obtains the categories and positions of the targets in a single forward propagation. YOLO implements an end-to-end detection pipeline by dividing the image into $S \times S$ grid cells. If the center of a semantic component falls in a grid cell, that cell is responsible for predicting the target. Each grid cell generates $B$ bounding boxes, and each bounding box predicts its confidence $c$, which is defined as the product of the probability that the bounding box contains a target, $\Pr(\text{Object})$, and the accuracy $\text{IOU}$, as $c = \Pr(\text{Object}) \times \text{IOU}$. If the grid cell contains semantic objects, then $\Pr(\text{Object}) = 1$, otherwise $\Pr(\text{Object}) = 0$. $\text{IOU}$ represents the intersection over union of the labeled box in the training samples and the predicted box. When $\text{IOU} = 1$, the labeled box and the predicted box coincide perfectly.
If a grid cell contains semantic components, which correspond to $K$ classes, the conditional class probability is represented by $\Pr(\text{Class}_k \mid \text{Object})$ for each class. Therefore, we can obtain the intermediate score of each grid cell and each class as $\Pr(\text{Class}_k \mid \text{Object}) \times \Pr(\text{Object}) \times \text{IOU}$. The scores are truncated at a threshold, and non-maximum suppression is used to remove bounding boxes with a large overlap. In the end, each bounding box retains only the objects with positive confidence scores and the highest-scoring category. In YOLOv3, in order to improve the accuracy of target detection, the residual network [he2016deep] is used as the backbone. The features before entering a residual block and the features output by the residual block are combined to extract deeper feature information. On a building façade, even semantic objects of the same type differ in size and pose. YOLOv3 uses multi-scale fusion [lin2017feature] to detect objects and adapts well to scale changes of objects.
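The scoring and suppression steps described above can be illustrated with a minimal sketch. This is not the YOLOv3 implementation used in the paper: the function names, the box layout `(x, y, w, h)` with the upper left corner, the greedy suppression strategy, and both threshold values are assumptions chosen for illustration only.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def class_score(p_class_given_obj, p_obj, iou_value):
    """Class-specific score: Pr(Class | Object) * Pr(Object) * IOU."""
    return p_class_given_obj * p_obj * iou_value


def non_max_suppression(detections, score_thresh=0.5, iou_thresh=0.45):
    """Greedy NMS over (box, score) pairs; thresholds are hypothetical."""
    # Truncate low scores, then visit the rest from best to worst.
    candidates = sorted(
        (d for d in detections if d[1] >= score_thresh),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score in candidates:
        # Keep a box only if it does not overlap strongly with a better one.
        if all(iou(box, k[0]) <= iou_thresh for k in kept):
            kept.append((box, score))
    return kept


detections = [
    ((0, 0, 10, 10), 0.9),   # strong detection
    ((1, 1, 10, 10), 0.8),   # near-duplicate of the first box
    ((50, 50, 10, 10), 0.7), # separate object
    ((80, 80, 5, 5), 0.2),   # below the score threshold
]
kept = non_max_suppression(detections)
```

In this toy input the near-duplicate box and the low-score box are discarded, so two detections remain.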
4 Regular arrangements of façade primitives using binary integer programming
After the initial extraction of the bounding boxes on the building façade, we use BIP to restore the spatial regularity of the windows, doors and balconies, inspired by previous work [hensel2019facade]. Although the MILP method has been successfully used in many studies [boulch2014piecewise, hensel2019facade], we are aiming at an interactive reconstruction pipeline, so the runtime must be kept reasonably low. In the following, we describe our reformulated problem setup using BIP instead of MILP.
4.1 Problem setup using binary integer programming
After extracting the initial primitives, we have $N$ bounding boxes for each image, and each bounding box is uniquely determined by a set of four parameters $(x, y, w, h)$, where $(x, y)$ and $(w, h)$ are the coordinates of the upper left corner and the size of the bounding box, respectively (Figure 1a). Instead of directly optimizing these real-valued parameters using MILP [hensel2019facade], we cast the problem as model selection using BIP.
Specifically, we first establish a model space for each attribute of the bounding box, e.g. $X = \{\hat{x}_1, \dots, \hat{x}_{M}\}$ for the attribute of the $x$ coordinate. The size $M$ of $X$ could be the number of bounding boxes $N$, but we choose to compress it by pre-clustering the model space using a very confident lower bound $\delta_l$, as described later. We then assign a binary variable $s_{ij} \in \{0, 1\}$ to represent the state of the selection, i.e. $s_{ij} = 1$ if the model $\hat{x}_j$ is selected for the attribute $x$ of the $i$-th bounding box. In addition, we use the one-hot vector $\sigma_i = (s_{i1}, \dots, s_{iM})^\top$ to represent the whole state of the bounding box. (We omit the superscript for the attribute when it is not ambiguous; Greek symbols are used for one-hot vectors and Roman symbols for variables.)
In fact, the model spaces of the attributes of the primitives for a single façade are generally quite limited in urban environments. That is, the ratio $N / M$ is generally quite large, so using one model per bounding box would lead to unnecessarily many parameters. Therefore, we pre-cluster all the attributes separately using the mean shift approach [cheng1995mean], with the threshold set to the lower bound $\delta_l$. The values in the model space $X$ are determined by the centers of the clusters, as shown in Figure 1b. To ensure the accuracy of the results, the lower bound in the mean shift clustering should be as small as possible to avoid aggregating parameters of different properties into the same category. It should be noted that, although the same threshold is used for all the attributes, the numbers of clusters $M^x$, $M^y$, $M^w$, and $M^h$ are generally different.
In summary, the purpose is to optimize all the selection vectors $\sigma_i^a$, $a \in \{x, y, w, h\}$, under the energy functions and constraints described below. The total number of explicit unknowns is $N (M^x + M^y + M^w + M^h)$.
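As an illustration of the pre-clustering step described above, the following sketch builds a model space for one attribute with a simplified one-dimensional mean shift. The flat kernel, the mode-merging rule, the function name `mean_shift_1d`, and the sample values are all assumptions made for this example; the paper only states that mean shift [cheng1995mean] is used with the lower bound as the threshold.

```python
def mean_shift_1d(values, bandwidth, max_iter=100, tol=1e-6):
    """Return cluster centers of 1-D values using a flat-kernel mean shift.

    `bandwidth` plays the role of the lower bound used as the clustering
    threshold (an assumption about how the threshold enters).
    """
    modes = []
    for v in values:
        m = float(v)
        for _ in range(max_iter):
            # Average all samples that fall inside the current window.
            window = [u for u in values if abs(u - m) <= bandwidth]
            new_m = sum(window) / len(window)
            if abs(new_m - m) < tol:
                break
            m = new_m
        modes.append(m)

    # Merge converged modes that are closer than the bandwidth.
    centers = []
    for m in sorted(modes):
        if not centers or m - centers[-1] > bandwidth:
            centers.append(m)
    return centers


# Hypothetical x coordinates (in pixels) of six detected boxes.
x_values = [10, 11, 12, 50, 51, 100]
model_space_x = mean_shift_1d(x_values, bandwidth=5)
```

Here six detected values are compressed into three candidate models, which reduces the number of binary variables for this attribute from 6 x 6 to 6 x 3.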
4.2 Energy functions to be optimized
Our loss function consists of a data term and a regularity term. First of all, our goal is to keep the total change of the bounding boxes with respect to their initial locations as small as possible after the regularization. Therefore, for each bounding box $i$ we first calculate the residual vector $\mathbf{r}_i^a$, which collects the errors of the different selections, as
$$\mathbf{r}_i^a = \left( |a_i - \hat{a}_1|, \dots, |a_i - \hat{a}_{M^a}| \right)^\top, \quad a \in \{x, y, w, h\}, \tag{1}$$
where the superscript $a$ denotes the different attributes, $a_i$ is the detected value of the attribute for the $i$-th bounding box, and $\hat{a}_j$ is the $j$-th value in the corresponding model space of size $M^a$.
In this way, the total energy for attribute $a$ caused by the one-hot selection vectors $\sigma_i^a$, e.g. offsets for the coordinates of the upper left corners and differences for the sizes of the rectangles, can be briefly expressed as
$$E^a = \sum_{i=1}^{N} (\sigma_i^a)^\top \mathbf{r}_i^a. \tag{2}$$
Equation 2 means that, for each bounding box, we only account for the error of the selected value in the model space, i.e. the element for which the binary selection variable equals 1. The final data term of the energy function is therefore the summation over all the attributes,
$$E_{\text{data}} = \sum_{a \in \{x, y, w, h\}} E^a. \tag{3}$$
With only the data term, there is always a trivial solution with the best fit, i.e. choosing the nearest center of the mean shift clustering. Therefore, we introduce a regularity term. The intuition behind this term is that higher regularity generally means fewer categories; fortunately, the number of selected categories is easy to model, as illustrated in Figure 2. For each attribute $a$, the total number of selected categories, i.e. the regularity term $R^a$, can be explicitly expressed as the following logical expression,
$$R^a = \left\| \sigma_1^a \vee \sigma_2^a \vee \cdots \vee \sigma_N^a \right\|_1, \tag{4}$$
where $\|\cdot\|_1$ is the $L_1$ norm, i.e. the sum of the absolute values of all the elements of a vector, which for binary variables simply counts the number of non-zero variables; the binary operator $\vee$ is the element-wise logical or of the one-hot vectors. Similar to Equation 3, the final regularity term is a weighted summation across all the attributes,
$$E_{\text{reg}} = \sum_{a \in \{x, y, w, h\}} \lambda^a R^a, \tag{5}$$
where $\lambda^a$ denotes the weights of the different attributes. The final energy function is
$$E = E_{\text{data}} + E_{\text{reg}}. \tag{6}$$
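To make the trade-off between the data term and the regularity term concrete, the sketch below evaluates both for a single attribute. It assumes absolute differences as residuals and uses hypothetical helper names, sample values, and a hypothetical weight; it only evaluates the energy of given selections and does not solve the optimization.

```python
def residuals(values, models):
    """Residual vectors: |value_i - model_j| for every box i and model j."""
    return [[abs(v - m) for m in models] for v in values]


def one_hot(j, size):
    """One-hot selection vector with a 1 at position j."""
    return [1 if k == j else 0 for k in range(size)]


def data_term(selection, res):
    """Sum over boxes of the residual of the selected model."""
    return sum(
        s * r
        for sel_i, res_i in zip(selection, res)
        for s, r in zip(sel_i, res_i)
    )


def regularity_term(selection):
    """L1 norm of the element-wise OR of all one-hot vectors."""
    used = [max(column) for column in zip(*selection)]  # OR per model
    return sum(used)


def energy(selection, res, weight):
    return data_term(selection, res) + weight * regularity_term(selection)


values = [10.0, 12.5, 30.0]   # hypothetical detected x coordinates
models = [10.0, 13.0, 30.0]   # hypothetical model space
res = residuals(values, models)

best_fit = [one_hot(j, 3) for j in (0, 1, 2)]   # nearest model per box
regular = [one_hot(j, 3) for j in (0, 0, 2)]    # two boxes share a model

e_fit = energy(best_fit, res, weight=5.0)
e_regular = energy(regular, res, weight=5.0)
```

With the weight used here, the selection that shares one model between the first two boxes has the lower total energy, even though its data term is larger; with a small weight the nearest-model selection wins instead.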
4.3 Constraints of the binary integer programming
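The variables cannot be adjusted freely. Because each bounding box can only choose one state for each attribute, we have the following constraint for each bounding box,
$$\sum_{j=1}^{M} s_{ij} = 1, \quad i = 1, \dots, N. \tag{7}$$
Another practical constraint is that we can very confidently exclude a model for a bounding box if its residual $r_{ij}$, i.e. the $j$-th element of the residual vector, exceeds an upper bound $\delta_u$,
$$s_{ij} = 0 \quad \text{if } r_{ij} > \delta_u. \tag{8}$$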
It seems the additional constraints may increase the complexity of the problem, but interestingly, in practice, we find that the additional constraints significantly reduce the runtime, with almost no differences in the final results.
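The effect of the two constraints can be sketched for a single attribute as follows. The exhaustive search below is only a stand-in for a real BIP solver and is feasible only for toy inputs; the function name, the absolute-difference residual, and all numeric values are assumptions for illustration.

```python
from itertools import product


def solve_attribute(values, models, weight, upper_bound):
    """Exhaustively minimize data term + weight * number of used models.

    Choosing exactly one model index per box enforces the one-hot
    constraint; models whose residual exceeds `upper_bound` are pruned.
    Returns (None, inf) if some box has no admissible model.
    """
    candidates = [
        [j for j, m in enumerate(models) if abs(v - m) <= upper_bound]
        for v in values
    ]
    best_choice, best_energy = None, float("inf")
    for choice in product(*candidates):
        data = sum(abs(v - models[j]) for v, j in zip(values, choice))
        total = data + weight * len(set(choice))
        if total < best_energy:
            best_choice, best_energy = choice, total
    return best_choice, best_energy


values = [10.0, 12.5, 30.0]   # hypothetical detected x coordinates
models = [10.0, 13.0, 30.0]   # hypothetical model space
choice, total = solve_attribute(values, models, weight=5.0, upper_bound=5.0)

# Number of assignments examined with and without pruning.
pruned_size = 2 * 2 * 1
full_size = len(models) ** len(values)
```

In this toy case the upper bound reduces the search space from 27 to 4 assignments without changing the optimum, which mirrors the runtime observation above on a very small scale.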
4.4 Implementation details
The implementation of Equation 4 needs some care, because it involves logical operations. For two binary variables $p$ and $q$, the result of the logical or, $z = p \vee q$, can be modeled by adding the following constraints,
$$z \ge p, \quad z \ge q, \quad z \le p + q, \quad z \in \{0, 1\}. \tag{9}$$
In fact, such routine reformulations are handled efficiently and gracefully by state-of-the-art solvers [chinneck2007feasibility, gurobi2014gurobi]. For the parameters, fixed values are used for the lower bound $\delta_l$ (in pixels) and the upper bound $\delta_u$, and the weights $\lambda^a$ are chosen empirically. In this way, all the energy functions and constraints are linear, and the problem is solved using the Mosek library [mosek2010mosek].
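The linear encoding of the logical or can be checked by enumeration. The sketch below uses a hypothetical helper, `feasible_or_values`, and extends the two-variable case to several inputs (an assumption that follows the same pattern); it confirms that the inequalities leave exactly one feasible binary value, namely the logical or of the inputs.

```python
from itertools import product


def feasible_or_values(bits):
    """Binary z with z >= b for every input b and z <= sum of inputs."""
    return [
        z for z in (0, 1)
        if all(z >= b for b in bits) and z <= sum(bits)
    ]


# Two-variable case: the constraints admit exactly z = p or q.
two_var_ok = all(
    feasible_or_values([p, q]) == [p | q]
    for p, q in product((0, 1), repeat=2)
)

# Same pattern for three inputs, as needed when more than two boxes
# contribute to the same model.
three_var_ok = all(
    feasible_or_values(list(bits)) == [max(bits)]
    for bits in product((0, 1), repeat=3)
)
```

Because the regularity term is minimized with positive weights, the upper inequality is what prevents the auxiliary variable from being set to 1 unnecessarily only in a feasibility sense; in a minimization the lower inequalities alone already force the correct value, which is one reason such encodings stay cheap.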
5 Experimental evaluations
5.1 Evaluation of detections of façade primitives
This paper uses the CMP façade database [tylecek2012cmp] as the training data set, which contains a total of 606 building façade images from around the world. These images are manually labeled with 12 semantic object classes on the façade. We choose three typical primitives: window, door and balcony. We built the YOLOv3 model with Keras [gulli2017deep] and trained it on the above data set. In addition, we took 30 typical building façade images from Google Street View [anguelov2010google] for testing and manually labeled them for evaluation. To verify the effectiveness of this method, we adopted the same evaluation method as in [rahmani2018high]. For windows, doors, and balconies, our average extraction accuracy reached 0.917, 0.856, and 0.852, respectively, all of which are higher than the baseline of 0.84 [rahmani2018high]. Therefore, it is feasible to extract the primitives on building façades with YOLOv3.
5.2 Evaluation and comparisons of the regular arrangements of the primitives
We selected three typical building façade images of three cities in the United States (US), United Kingdom (UK), and Canada (CA) to evaluate the performance of the regularization. Both qualitative and quantitative evaluations are conducted and we also compare the runtime performance against the MILP approach [hensel2019facade].
Figures 3 through 5 compare the extracted and regularized bounding boxes for the US, UK and CA datasets, respectively. The black frame represents the extracted primitives and the red frame indicates the regularized results. It can be noticed that after regularization, the semantic objects on the building façade are arranged more neatly and consistently and still fit well enough to the original bounding boxes. In addition, Figure 6 demonstrates the reconstructed façades for the three datasets in off-the-shelf modeling solutions.
We counted the number of models used before and after the regularization to measure the regularity of the results. Table 1 reports the results: after regularization, only a fraction of the candidate models remains selected, for both the coordinates of the corners and the sizes.
Comparisons of runtime.
To verify the efficiency of the proposed method, we tested six building façades with complex structures and numerous parameters, and compared the proposed BIP approach against the MILP approach [hensel2019facade]. The results are shown in Table 2: the runtime of the proposed BIP approach is only a fraction of that of the MILP approach. The explicit unknowns of the MILP approach [hensel2019facade] include real-valued parameters, whereas those of the proposed approach are all binary. Although the proposed method has slightly fewer parameters, the numbers are of the same order of magnitude. Therefore, it is the reformulation of the problem that accounts for the performance difference.
Table 2. Runtime comparison; columns: MILP (s) | BIP (s).
This paper proposed an approach for the regular arrangement of primitives on building façades using BIP. Compared to the MILP approach, BIP is considerably faster and achieves near real-time performance with a similar level of data fitness and regularity. The detected and rearranged bounding boxes of the primitives can be directly used for the modeling of façade features, which is a key step towards LOD-3 reconstruction. However, the current approach can only detect axis-aligned objects; future work may explore the reconstruction of more complex façade features. Code is available at https://github.com/saedrna/Ranger.