1. Introduction
Building height plays an essential role in many applications, such as 3D city reconstruction (Pan et al., 2015; Armagan et al., 2017), urban planning (Ng, 2009), navigation (Grabler et al., 2008; Rousell and Zipf, 2017), and geographic knowledge bases (Zhang, 2017). For example, in navigation, knowing the height of buildings helps identify those that stand out in a city block, which can then be used to facilitate navigation by generating instructions such as "turn left before the 18-meter-high (five-story) building".
Previous studies on building height estimation are mainly based on high-resolution optical data (Izadi and Saeedi, 2012; Zeng et al., 2014), synthetic aperture radar (SAR) images (Wang et al., 2015; Brunner et al., 2010), and Light Detection and Ranging (LiDAR) data (Sampath and Shan, 2010; WordPress and HitMag, 2018). Such data, however, are expensive to obtain, and hence these approaches are difficult to apply at a large scale, e.g., to all the buildings on Earth. Moreover, such data are usually proprietary and not available to the public and the research community. Recently, street scene images (together with 2D maps) have been used for building height estimation (Yuan and Cheriyadat, 2016; Díaz and Arguello, 2016); they can be easily obtained at large scale (e.g., via open mapping applications such as Google Street View (Anguelov et al., 2010) and OpenStreetMap (Haklay and Weber, 2008)). Estimating building height from street scene images relies on accurate detection of building rooflines in the images, which then enables building height computation using camera projection. However, existing methods for roofline detection check the roofline segments only, which may confuse the rooflines of overlapping buildings. As shown in Fig. 1, the roofline of building B may be detected as the roofline of building A, because the rooflines of different buildings may be parallel to each other, and the buildings may have similar colors and positions in the street scene images.
In this paper, we present a novel algorithm named Corner-based Height Estimation (CBHE) to estimate building height from complex street scene images (where buildings may be blocked by other buildings and trees). Our key idea for handling overlapping buildings is to detect not only the rooflines but also the building corners. We obtain the coordinates of building corners from building footprints in a 2D map (e.g., OpenStreetMap). We then map the corner coordinates into the street scene image to detect the building corners in the image. Corners of different buildings do not share the same coordinates, and it is easier to associate them with different buildings, as shown in Fig. 1.
CBHE works as follows. It starts with an upper-bound estimate of the height of a building, computed as the maximum height that can be captured by the camera. It then tries to locate a line (i.e., a roofline candidate) at this height, and repeats this process by iteratively reducing the height estimate. Following a similar procedure, CBHE also locates a set of building corner candidates. Next, CBHE filters the roofline candidates with the help of the corner candidates (i.e., a roofline candidate needs to connect to a corner of the same building to be a true roofline). Once the true roofline is identified, CBHE uses the pinhole camera model to compute the building height.
In the process above, when locating the roofline and corner candidates, we fetch two sets of image segments that may contain building rooflines or corners, respectively. To filter each set and identify the true roofline and corner images, we propose a deep neural network model named BuildingNet. The key idea of BuildingNet is as follows. A building corner has a limited number of patterns (four types, one for each combination of corner position and building side, as detailed in Section 4.2), while non-corner images may have any pattern. The same applies to building rooflines. Thus, we model building corner (roofline) identification as an open set image classification problem, where each corner (roofline) pattern forms a class, while non-corner (non-roofline) images should be differentiated from all of these classes when building the classifier. To do so, BuildingNet learns embeddings of the input images that minimize the intra-class distance and maximize the inter-class distance, and then differentiates the classes using a Support Vector Classification (SVC) model on the learned embeddings. When a new image comes, the trained SVC model determines whether it falls into any corner (roofline) class. If it does not, it is a non-corner image and can be safely discarded.
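The open-set decision can be sketched with a simpler stand-in: a nearest-centroid rule with a distance threshold in place of the SVC, and random vectors in place of learned embeddings (all dimensions and thresholds here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 64-d embeddings: four tight clusters, one per corner class.
centers = rng.normal(size=(4, 64))
train = {k: centers[k] + 0.05 * rng.normal(size=(50, 64)) for k in range(4)}
centroids = np.array([train[k].mean(axis=0) for k in range(4)])
# Reject anything farther from every centroid than the largest distance
# observed from a training point to its own class centroid.
radius = max(np.linalg.norm(train[k] - centroids[k], axis=1).max()
             for k in range(4))

def classify_open_set(embedding):
    """Return a corner class in {0..3}, or -1 for non-corner (rejected)."""
    dists = np.linalg.norm(centroids - embedding, axis=1)
    k = int(dists.argmin())
    return k if dists[k] <= radius else -1
```

An embedding near a known cluster is assigned that class; an embedding far from all clusters (an unseen pattern) is rejected, which is the behavior the SVC provides over the learned embeddings.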
When estimating building height via the pinhole camera model, the result relies heavily on the accuracy of the camera location, which suffers from GPS errors. Therefore, CBHE calibrates the camera location before roofline detection. To calibrate the camera location, CBHE detects candidate regions of all building corners in street scene images by matching buildings in street scene images with their footprints in 2D maps, based on the imprecise camera location from GPS. Then, it uses BuildingNet to classify the corner candidates and remove those classified as non-corner. From the corners remaining after the classification through BuildingNet, CBHE selects the two corners with the highest scores (detailed in Section 4.3) to calibrate the camera location via the pinhole camera model. We summarize our contributions as follows:


We model building corner and roofline detection as an open set classification problem and propose a novel deep neural network named BuildingNet to solve it. BuildingNet learns embeddings of the input images that minimize the intra-class distance and maximize the inter-class distance. Experimental results show that BuildingNet achieves higher accuracy on building corner and roofline identification compared with state-of-the-art open set classifiers.

We propose a corner-based building height estimation algorithm named CBHE, which uses an entropy-based algorithm to select the true roofline among the candidates from BuildingNet. The entropy-based algorithm considers both building corner and roofline features, and yields higher robustness to the overlapping problem in complex street scene images. Experiments on real-world datasets show that CBHE outperforms the baseline method by over 10% in terms of the proportion of estimates with an error within two meters.

We propose a camera location calibration method with an analytical solution given the locations of two building corners in a 2D map, which means that a highly accurate result can be guaranteed when valid building corners are provided by BuildingNet.
We organize the rest of this paper as follows. We review related work in Section 2 and give an overview of CBHE in Section 3. The BuildingNet model and the entropy-based ranking algorithm are presented in Section 4, and the building height estimation method is detailed in Section 5. We report experimental results in Section 6 and conclude the paper in Section 7.
2. Related Work
In this section, we review studies on camera location calibration and building height estimation. We also detail our baseline method (Yuan and Cheriyadat, 2016).
2.1. Camera Location Calibration
Camera location calibration aims to refine the camera location of the taken images, given rough camera position information from GPS devices or image localization (Agarwal et al., 2015; Liu et al., 2017).
Existing work uses 2.5D maps (2D maps with height information) to calibrate camera locations. Arth et al. (Arth et al., 2015) present a mobile device localization method that calibrates the camera location by matching building facades in street scene images with their footprints in 2.5D maps. Armagan et al. (Armagan et al., 2017) train a convolutional neural network (CNN) to predict the camera location based on a semantic segmentation of the building facades in input images. Their method iteratively applies the CNN to compute the camera's position and orientation until it converges to a location that yields the best match between the building facades and the 2.5D maps. Camera location calibration using 2.5D maps can produce good results; the hurdle is the requirement of building height information for generating the 2.5D maps, which may not be available for every building. Chu et al. (Chu et al., 2014) extract the position features of building corner lines (the vertical line of a corner) and then find the camera location and orientation by matching the extracted position features with building footprints in 2D maps. However, their method cannot handle buildings that overlap with each other or that have non-uniform patterns on their facades.
2.2. Building Height Estimation
Building height estimation has been studied using geographical data such as highresolution images, synthetic aperture radar (SAR) images, and Light Detection and Ranging (LiDAR) data.
Studies (Liasis and Stavrou, 2016; Izadi and Saeedi, 2012; Zeng et al., 2014; Tack et al., 2012; Qi et al., 2016) based on high-resolution images (such as satellite or optical stereo images) estimate building height via methods such as elevation comparison and shadow detection, which may be impacted by the lighting and weather conditions when the images are taken. Similarly, height estimation methods based on synthetic aperture radar (SAR) images (Wang et al., 2015; Brunner et al., 2010; Sportouche et al., 2011) mainly rely on shadow or layover analysis. Methods based on aerial images and aerial LiDAR data (Sohn et al., 2008; Sampath and Shan, 2010) usually segment, cluster, and then reconstruct building rooftop planar patches according to predefined geometric structures or shapes (Zeng et al., 2014). LiDAR data is expensive to analyze and has a limited operating altitude because the pulses are only effective between 500 and 2,000 meters (WordPress and HitMag, 2018). A common limitation shared by the methods above is that the data they use are expensive to collect, which significantly constrains their scalability.
Method based on street scene images and 2D maps. Yuan and Cheriyadat propose a method for building height estimation that uses street scene images facilitated by 2D maps (Yuan and Cheriyadat, 2016). Street scene images are widely available from Google Street View (Anguelov et al., 2010), Bing StreetSide (Kopf et al., 2010), and Baidu Map (Baidu, 2018), which makes building height estimation based on such data easier to scale. Yuan and Cheriyadat's method has four main steps: (i) Match buildings in a street scene image with their footprints in a 2D map via camera projection, based on the camera location that comes with the image; here, the camera location may be imprecise due to GPS errors (Zandbergen and Barbeau, 2011; Grammenos et al., 2018). (ii) Calibrate the camera location via camera projection with the building corner lines extracted from the street scene image. (iii) Re-match buildings from the 2D map with those in the street scene image based on the calibrated camera location, and then detect building rooflines through edge detection methods. (iv) Compute building height via camera projection using the camera parameters, the calibrated camera location, the height of the building rooflines in the street scene image, and the building footprint in the 2D map.
Our proposed CBHE differs from Yuan and Cheriyadat's method in two aspects: (A) In Step (ii) of their method, they calibrate the camera location by matching building corner lines in the street scene image with building footprints in the 2D map. Such a method cannot handle images of urban areas where the corner lines of different buildings are too close to be differentiated, or where buildings have non-uniform patterns/colors on their facades that make corner lines difficult to recognize. CBHE uses building corners instead of corner lines, which imposes stronger constraints on the references for camera location calibration and thus yields more accurate results. (B) In Step (iv) of their method, they use a local spectral histogram representation (Liu and Wang, 2002) as the edge indicator to capture building rooflines, which can be ineffective when buildings overlap with each other. CBHE uses the proposed deep neural network named BuildingNet to learn a latent representation of building rooflines, which is shown to be more accurate in our experiments.
3. Overview of CBHE
We present the overall procedure of our proposed CBHE in this section. We also briefly present the process of camera projection, which forms the theoretical foundation of building height estimation using street scene images.
3.1. Solution Overview
We assume a given street scene image of buildings that comes with the geo-coordinates and orientation angle of the camera by which the image was taken. Here, the geo-coordinates may be imprecise due to GPS errors. Google Street View images are examples of such images. We aim to compute the height of each building in the image. As illustrated in Fig. 2, CBHE contains three stages:


Stage 1 – Preprocessing: In the first stage, we preprocess the input image by recognizing the buildings and computing their sketches. There are many methods for these purposes; we use two existing models, RefineNet (Lin et al., 2017) and Structured Forest (Dollár and Zitnick, 2013), to identify the buildings and compute their sketches, respectively. After this step, the input image is converted into a grayscale edge map in which each pixel is valued from 0 to 255 and the building sketches are preserved; this enables identifying rooflines and computing building height via camera projection.

Stage 2 – Camera location calibration: Before computing building height by camera projection, in the second stage, we calibrate the camera location. This is necessary because camera projection requires a precise camera location, while the geo-coordinates that come with street scene images are imprecise due to GPS errors. To calibrate the camera location, we first detect building corner candidates in the street scene image according to their footprints in 2D maps and their relative positions to the camera. Then, by comparing the 2D-map locations of two building corners with their projected positions in the image, we calibrate the camera location via camera projection. In this stage, we propose a deep neural network named BuildingNet to determine whether an image segment contains a valid building corner. The BuildingNet model and the process of selecting the two building corners for calibration are detailed in Section 4.

Stage 3 – Building height computation: In this stage, we obtain the roofline candidates of each building via a weighted Hough transform and filter out invalid roofline candidates via BuildingNet. Then we rank the valid rooflines with an entropy-based ranking algorithm that considers both corner and roofline features, and select the best one for computing building height via camera projection. The detailed process is provided in Section 5.
3.2. Camera projection
We use Fig. 3 to illustrate the idea of camera projection and the corresponding symbols. In this figure, there are two coordinate systems, i.e., the camera coordinate system and the image plane coordinate system. Specifically, O, X, Y, and Z represent the camera coordinate system, where the origin O is the location of the camera. The camera is set horizontal to the sea level, which means that plane YOZ is perpendicular to the building facades while the X axis is parallel to the building facades. We use o, x, and y to represent the image plane coordinate system, where the origin o is the center of the image, and plane xoy is parallel to plane XOY.
In Fig. 3, there are two buildings, A and B, that have been projected onto the image. For each building, we use l_r, l_f, and l_c to represent the roofline, the floor, and the line on the building projected to the x axis (center line) of the image plane, respectively. Corners C_1, C_2, and C_3 are the corner nearest to the camera, the corner farthest from the y axis of the image plane when projected onto the image plane (along the Z axis), and the corner closest to the y axis of the image plane when projected onto the image plane (along the Z axis), respectively. The height of a building is the sum of the distance between l_r and l_c and the distance between l_c and l_f. These two distances are denoted as h_1 and h_2, and the projected length of h_1 in the image plane is denoted by y_1. Since the camera is set horizontal to the sea level, the height h_2 is the same as the height of the car or the human being who captured the street scene image, which can be regarded as a constant.
Let d be the distance from the camera to corner C_1, d_z be the projected length of d onto the Z axis, and f be the focal length of the camera (i.e., the distance between the image center o and the camera center O). Based on the pinhole camera projection, the height h of a building can be computed as follows:

(1)  h = h_1 + h_2 = y_1 · d_z / f + h_2

In this equation, the focal length f comes with the input image as its metadata. The distance d_z is computed based on the geo-coordinates of the building and the camera, as well as the orientation of the camera. The geo-coordinates of the building are obtained from an open-sourced digital map, OpenStreetMap, while the geo-coordinates and orientation of the camera come with the input image from Google Street View. Due to GPS errors, we describe how to calibrate the location of the camera in Section 4. The height h_1 is computed based on the position of the roofline, which is discussed in Section 5. Table 1 summarizes the symbols that are frequently used in the rest of the discussion.
Notation | Description
h_1 | the height of a building above the image's center line
h_2 | the height of a building below the image's center line
d | the distance from the camera to corner C_1 of a building
d_z | the projected length of d onto the Z axis
f | the focal length of the camera
l_r | a building roofline
C_1 | the corner nearest to the camera
C_2 | the corner farthest from the y axis in the image plane
C_3 | the corner closest to the y axis in the image plane
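Equation (1) is straightforward to evaluate. The sketch below assumes a camera height h_2 of 2.5 meters and illustrative pixel values:

```python
def building_height(y1, d_z, f, h2=2.5):
    """Eq. (1): h = h1 + h2, with h1 / d_z = y1 / f under the pinhole model.
    y1: roofline height above the image center line (pixels),
    d_z: camera-to-corner distance projected onto the Z axis (meters),
    f: focal length (pixels), h2: assumed camera height (meters)."""
    return y1 * d_z / f + h2

# A roofline 400 px above the center line, 30 m away, f = 1000 px:
# 400 * 30 / 1000 + 2.5 = 14.5 m
```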
4. Camera Location Calibration
When applying camera projection for building height estimation, we need the distance between the building and the camera. Computing this distance requires the locations of both the building and the camera. Due to GPS errors, we calibrate the camera location in this section.
4.1. Key Idea
We use two building corners in the street scene image with known real-world locations for camera location calibration. To illustrate the process, we project Fig. 3 onto a 2D plane, as shown in Fig. 4a, and assume that corner C_1 of building A and corner C_1 of building B are the two reference corners.
We consider a coordinate system with corner C_1 of building A as the origin and the camera orientation as the Y axis. Let α_1 and α_2 be the angles of corner C_1 of A and corner C_1 of B from the orientation of the camera, respectively; the ratio of tan α_1 to tan α_2 is determined by the positions of these two reference corners in the image. Let θ be the angle between the Y axis and the line connecting corner C_1 of A and corner C_1 of B; it can be computed from the camera's orientation and the relative locations of the two reference corners in 2D maps. Therefore, we can compute the coordinates (x_B, y_B) of corner C_1 of building B in this coordinate system. With α_1, α_2, and (x_B, y_B), we compute the coordinates (x_c, y_c) of the camera as follows:

(2)  x_c = -t · sin α_1,  y_c = -t · cos α_1,  where t = (x_B · cos α_2 - y_B · sin α_2) / sin(α_2 - α_1)

Since the coordinates of the camera are expressed relative to corner C_1 of building A, we obtain the relative position of the camera to that corner. Thus, camera location calibration becomes the problem of matching two building corners with their positions in the image.
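Under the conventions above (corner C_1 of building A at the origin, camera orientation along the Y axis), the camera position can be recovered by intersecting the two sight rays; a minimal sketch, with function and argument names of our own choosing:

```python
import math
import numpy as np

def calibrate_camera(xB, yB, alpha1, alpha2):
    """Camera position from two reference corners: corner A at the origin,
    corner B at (xB, yB), and alpha1/alpha2 the angles (radians) of A and B
    from the camera orientation (the Y axis). The camera lies on the sight
    ray through each corner; intersecting the two rays gives (xc, yc)."""
    d1 = np.array([math.sin(alpha1), math.cos(alpha1)])  # ray toward A
    d2 = np.array([math.sin(alpha2), math.cos(alpha2)])  # ray toward B
    # Solve p + t1*d1 = (0, 0) and p + t2*d2 = (xB, yB) for t1, t2.
    A = np.column_stack([d1, -d2])
    t1, _ = np.linalg.solve(A, -np.array([xB, yB], dtype=float))
    return -t1 * d1  # p = (xc, yc)
```

For example, a camera at (1, -5) facing along +Y sees corner A at angle atan2(-1, 5) and a corner B at (4, 3) at angle atan2(3, 8); intersecting the rays recovers (1, -5).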
The real-world locations of the building corners can be obtained from 2D maps, and we need to locate their corresponding positions in the street scene image based on the (inaccurate) geo-coordinates of the camera. For a pinhole camera, matching a 3D point in the real world to a 2D point in the image is determined by the camera projection matrix as follows:
(3)  s · [u, v, 1]^T = K · [R | t] · [X, Y, Z, 1]^T

where a real-world point P = (X, Y, Z) is projected to its position p = (u, v) in the image plane; s is the parameter that transfers pixel scale to millimeter scale (Meyer and Meyer, 2010); K is the camera matrix determined by the focal length f; R is the camera rotation matrix, while t is a 3-dimensional translation vector that describes the transformation from the real-world coordinates to the camera coordinates.
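A sketch of this projection, assuming the principal point at the image center (consistent with the image plane coordinate system above) and identity rotation with zero translation as defaults:

```python
import numpy as np

def project(P, f, R=np.eye(3), t=np.zeros(3)):
    """Pinhole projection s*[u, v, 1]^T = K [R | t] [X, Y, Z, 1]^T,
    with K = diag(f, f, 1): focal length f in pixels, principal point
    at the image center. Returns the image coordinates (u, v)."""
    K = np.array([[f, 0.0, 0.0],
                  [0.0, f, 0.0],
                  [0.0, 0.0, 1.0]])
    uvs = K @ (R @ np.asarray(P, dtype=float) + t)
    return uvs[:2] / uvs[2]  # divide out the scale s

# A point 10 m in front of the camera, 3 m right and 2 m up, f = 1000 px,
# projects to (300, 200) in image coordinates.
```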
Since the camera geo-coordinates may be inaccurate, we can only compute rough locations of the building corners. Based on the rough position of each corner, we then iteratively assume a height for each building to obtain the gradient of its rooflines, as shown in Fig. 4b. We use sub-images centered at the horizontal position and the assumed height of each corner for building corner detection. A building corner consists of two rooflines, or a roofline and a building corner line, as shown in Fig. 4b. For each building, we only consider corners C_1 and C_2. There are three types of formation for corner C_1, as illustrated by the red lines on the left-hand side building in Fig. 4b, and there is one type of formation for corner C_2, as illustrated by the blue lines. Based on the detected building corner candidates, we use BuildingNet, described in Section 4.2, to filter out non-corner image segments, and then select the two reference corners, as discussed in Section 4.3.
We assume the camera location error from the Google Street View API to be less than three meters due to its camera location optimization (Klingner et al., 2013). If the camera location we compute is more than three meters away from the one provided by the Google Street View API, we use the API's camera location directly. We further improve the estimation accuracy by a multi-sampling strategy, which uses the median height among the results from different images of the same building taken at different distances.
4.2. BuildingNet
We formulate building corner detection as an object classification problem: we first detect candidate corner regions for a specific building by a heuristic method, and then classify them into different types of corners or non-corners.
We classify images that may contain building corners into five classes. The first four classes correspond to images containing one of the four types of building corners, i.e., corners C_1 and C_2 of the left-hand side buildings and corners C_1 and C_2 of the right-hand side buildings. The last class corresponds to non-corner images, which may contain any pattern except the above four types of corners (e.g., they could contain trees, lamps, or power lines) and should be filtered out. Such a classification problem is an open set problem in the sense that the non-corner images do not have a unified pattern, and the classifier will encounter unseen patterns. To solve this classification problem, we build a classifier that only requires samples of the first four classes in the training stage (it can also take advantage of non-corner images), while handling all five classes in the testing stage. To enable such a classifier, we first propose the BuildingNet model based on LeNet-5 (LeCun et al., 1998) and a triplet relative loss function, which learns embeddings that map potential corner region image segments to a Euclidean space where the embeddings have small intra-class distances and large inter-class distances.
4.2.1. Triplet Relative Loss Function
As shown in Fig. 5, an input of BuildingNet contains three images. Two of them contain the same type of corner, and we name them the target (x^t) and the positive (x^p), respectively. The third image contains another type of corner (or a non-corner image, if available), and we name it the negative (x^n). BuildingNet maps its inputs to n-dimensional embeddings based on a triplet relative loss function inspired by Triplet-Center Loss and FaceNet (Schroff et al., 2015; Wen et al., 2016; He et al., 2018; Wang et al., 2018), which minimizes the distances within the same type of corners and maximizes the distances between different types of corners as follows:
(4)  L = (1/N) · Σ_{i=1}^{N} [ α · ||f(x_i^t) - f(x_i^p)||² + β · ||f(x_i^t) - f(x_i^p)||² / ||f(x_i^t) - f(x_i^n)||² ]

where α is the weight of the intra-class distance in the n-dimensional Euclidean space; β is the weight of the ratio between the intra-class distance and the inter-class distance, which aims to separate different classes in the n-dimensional Euclidean space; N is the cardinality of all input triplets. Function f(·) computes the n-dimensional embedding of an input image, and we normalize it to unit length. Different from existing loss functions based on triplet selection (Weinberger et al., 2006; Schroff et al., 2015), the triplet relative loss function minimizes the intra-class distance and maximizes the inter-class distance by means of their relative distance.
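A reconstruction of this loss consistent with the description above (α weights the intra-class distance, β weights the ratio of intra-class to inter-class distance, averaged over the batch) can be sketched as follows; the exact form in the paper may differ:

```python
import numpy as np

def triplet_relative_loss(f_t, f_p, f_n, alpha=1.0, beta=1.0, eps=1e-12):
    """f_t, f_p, f_n: (N, n) arrays of target/positive/negative embeddings.
    Penalizes the intra-class distance and the ratio of intra-class to
    inter-class distance, averaged over the N triplets."""
    d_tp = np.sum((f_t - f_p) ** 2, axis=1)  # intra-class distances
    d_tn = np.sum((f_t - f_n) ** 2, axis=1)  # inter-class distances
    return float(np.mean(alpha * d_tp + beta * d_tp / (d_tn + eps)))
```

Note that the ratio term keeps pushing negatives away even when the absolute intra-class distance is already small, which is the "relative distance" effect described above.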
4.2.2. Hard Triplet Selection
Generating all possible image triplets for each batch during the training process would result in a large amount of unnecessary training data (e.g., triplets in which x^t and x^p are very similar while x^n is very different contribute little). It is crucial to select triplets that contribute more to the training phase. In BuildingNet, we accelerate convergence by assigning a higher selection probability to triplets that may contribute more to the training process. The probability of selecting a negative image x^n to form a training triplet is:

(5)

Here, n is the total number of negative images in a batch. After randomly choosing x^t and x^p for a triplet, we compute the Euclidean distance between f(x^t) and f(x^p), as well as the distances between f(x^t) and each negative image in the batch. Let d_min be the minimum, over all negatives, of the difference between the target-negative distance and the target-positive distance, which can be positive or negative. Negative images close to the target in the embedding space then have a higher probability of being selected.
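A plausible instantiation of this sampling scheme (negatives closer to the target get exponentially higher selection probability; the exact form of Eq. (5) may differ):

```python
import numpy as np

def negative_selection_probs(d_target_neg):
    """Map target-to-negative embedding distances to selection
    probabilities that decrease with distance, so hard negatives are
    sampled more often. d_target_neg: distances to the n negatives
    in the batch."""
    d = np.asarray(d_target_neg, dtype=float)
    w = np.exp(-(d - d.min()))  # hardest negative gets weight 1
    return w / w.sum()

# negative_selection_probs([0.3, 1.2, 2.0]) puts the largest probability
# on the first (closest, i.e., hardest) negative.
```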
After the training process, we obtain an n-dimensional embedding for each input image. We then learn a support vector classifier (Chang and Lin, 2011) on these embeddings for corner region image classification.
4.3. Entropy-based Ranking
BuildingNet can filter out non-corner images. Among the remaining corner candidates, we select the two corners with the highest scores as the reference corners. Reference corner selection relies on multiple factors: the length and edgeness (detailed in Section 5) of the lines forming the corner, the number of other corner candidates (C_1, C_2, and C_3) of the same building with the same assumed height, and the position of the corner in the image. We take the position of the corner into consideration because, empirically, corners close to one-quarter or three-quarters (horizontally) of the image yield more accurate matching between their positions in the image and their footprints in 2D maps. We also consider their real-world locations, because a corner close to the camera appears clearer and leads to higher accuracy when matching it to its footprint in 2D maps. Therefore, we define the score of each corner candidate as:
(6)  s_i = w_1 · l_i + w_2 · e_i + w_3 · n_i + w_4 · d_i^q + w_5 · d_i^c,  i ∈ [1, n_c]

where n_c is the number of corner candidates from all buildings and i indexes the candidates; s_i is the score of the i-th corner candidate; l_i is the detected length of the two lines that form the corner, while e_i is the sum of the edgeness of the two lines; n_i is the number of other corner candidates of the same building with the same assumed height; d_i^q is the minimum distance from the corner to the one-quarter or three-quarters position of the image, and d_i^c is the distance from the corner to the camera; w_1 to w_5 are the weights of these parameters. Parameters l_i, e_i, and n_i correlate with the score positively, while d_i^q and d_i^c correlate with the score negatively (the min-max scaling below accounts for the direction of correlation).
We use an entropy-based ranking method to compute the weights of these parameters. Shannon entropy is a commonly used measurement of uncertainty in information theory (Sun et al., 2017). The main idea of the entropy-based ranking algorithm is to compute the objective weights of different parameters according to their data distribution: if the samples of a parameter vary greatly, the parameter is considered a more important feature.
For building corner classification, there are m parameters and n samples. We denote the decision matrix as D = (d_ij)_{n×m}, where d_ij is the value of the i-th sample under the j-th parameter. Before applying the entropy-based ranking algorithm, we preprocess these parameters by min-max scaling as follows:

(7)  r_ij = (d_ij - min_i d_ij) / (max_i d_ij - min_i d_ij)  for positive parameters,
     r_ij = (max_i d_ij - d_ij) / (max_i d_ij - min_i d_ij)  for negative parameters,

where positive and negative mean that the j-th parameter is positively/negatively correlated with the value of the score. After min-max scaling, the entropy of each parameter based on the normalized decision matrix is defined as:

(8)  E_j = -(1 / ln n) · Σ_{i=1}^{n} p_ij · ln p_ij,  where p_ij = r_ij / Σ_{i=1}^{n} r_ij

where p_ij is the standardized r_ij. Based on the entropy of each parameter, the weight of each parameter is computed by:

(9)  w_j = (1 - E_j) / Σ_{k=1}^{m} (1 - E_k)
After computing the weight of each parameter, we apply the weights to all corner candidates and rank the candidates by their scores, selecting the best two as the reference corners.
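The weighting scheme of Eqs. (7)-(9) can be sketched as follows, with each column of the decision matrix holding one parameter and `positive[j]` marking its direction of correlation:

```python
import numpy as np

def entropy_weights(D, positive):
    """Entropy-based objective weights for an (n x m) decision matrix D.
    Min-max scale each parameter (flipping negatively correlated ones),
    compute its Shannon entropy, and weight parameters by (1 - entropy),
    so parameters whose samples vary more get larger weights."""
    D = np.asarray(D, dtype=float)
    lo, hi = D.min(axis=0), D.max(axis=0)
    R = np.where(positive, (D - lo) / (hi - lo), (hi - D) / (hi - lo))
    P = R / R.sum(axis=0)  # standardize each column to sum to 1
    plogp = np.where(P > 0, P * np.log(np.where(P > 0, P, 1.0)), 0.0)
    E = -plogp.sum(axis=0) / np.log(len(D))  # entropy per parameter, in [0, 1]
    return (1 - E) / (1 - E).sum()
```

For instance, a parameter whose three samples are spread widely receives a larger weight than one whose samples are nearly identical.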
5. Roofline Detection
Building height estimation requires detecting the roofline of each building. In this section, we present our method for roofline candidate detection in Section 5.1 and our method for true roofline selection in Section 5.2. We further present a strategy for handling tall buildings (over 100 meters) in Section 5.3.
5.1. Roofline Candidate Detection
We consider, for each building, the rooflines extending from the roof corner nearest to the camera toward the adjacent roof corners: one along the positive direction of the X axis of the camera coordinate system and one along the negative direction. We assume that the relevant corners are adjacent to each other, as shown in Fig. 3, to simplify the explanation.
Similar to corner candidate detection, as shown in Fig. 6a, we find all roofline candidates of each building by a heuristic method, which projects the rooflines of each building according to the building's relative location to the camera in the real world, together with the camera's parameters. To do so, we first assume the height of a building to be the maximum height that can be captured, which means that at least one roof corner (C_1, C_2, or C_3) is visible in the image. If corner C_1 is visible, the maximum height computed via camera projection is:

(10)  h_max = (H/2) · d_z / f + h_2

where H is the height of the street scene image and d_z is the distance from corner C_1 to the camera projected onto the Z axis of the camera coordinate system. If corner C_1 is invisible, we use C_2 as the reference corner and compute the maximum height of the building in the same way. With the maximum height of the building, we compute the positions of corners C_1, C_2, and C_3 in the image. We then apply the Hough transform to the input edge map in Fig. 2 to detect roofline candidates; each roofline candidate needs to match the computed positions of its two endpoint corners. Instead of using the typical Hough transform for line detection, which takes binarized images as input, we sum the values of all pixels (valued from 0 to 255) along a line as its weight, and we name this sum the edgeness of a roofline candidate, which reflects the visibility of the line in the edge map. We iteratively reduce the assumed height with a step length of 0.5 meters and collect all candidate rooflines. Similar to reference corner detection, we formulate true building roofline detection as an open set classification and ranking problem.
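The edgeness of a candidate line can be computed by summing grey values along the rasterized segment; a minimal sketch over an assumed 0-255 edge map:

```python
import numpy as np

def edgeness(edge_map, p0, p1):
    """Sum the grey values (0-255) of the edge-map pixels along the
    segment from p0 = (x0, y0) to p1 = (x1, y1): the weight of the
    line in the weighted Hough transform."""
    (x0, y0), (x1, y1) = p0, p1
    n = max(abs(x1 - x0), abs(y1 - y0)) + 1  # one sample per pixel step
    xs = np.rint(np.linspace(x0, x1, n)).astype(int)
    ys = np.rint(np.linspace(y0, y1, n)).astype(int)
    return int(edge_map[ys, xs].sum())

# A toy edge map with one bright horizontal roofline at row 5:
em = np.zeros((10, 10), dtype=int)
em[5, :] = 255
# edgeness(em, (0, 5), (9, 5)) sums ten pixels of value 255.
```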
5.2. Roofline Classification and Ranking
There are three types of rooflines, distinguished by which pair of roof corners they connect and by whether the building is on the left-hand or the right-hand side of the street, as shown in Fig. 4c. We use BuildingNet to filter these candidates and find the true roofline, similar to the corner candidate validation process in Section 4.2. For the valid roofline candidates from BuildingNet, we weight each candidate by its detected length l, its edgeness e, and the number n of corners of the same building with the same assumed height. We rank all roofline candidates via the entropy-based ranking algorithm in Section 4.3, as follows:

(11)  s_j = w_1 · l_j + w_2 · e_j + w_3 · n_j,  j ∈ [1, n_r]

where n_r is the number of candidates for a specific roofline and j indexes the candidates; s_j is the score of the j-th roofline candidate; w_1, w_2, and w_3 are the weights of the parameters, computed from all candidates of a specific roofline of a building, and all three parameters are positively correlated with the score s_j. The value of n_j depends on the number of corners (C_1, C_2, and C_3) with the same assumed height as the roofline candidate, and it takes a value in {0, 1, 2, 3}. We discussed how to detect reference corners in Section 4.2; the difference when detecting the corners of a specific building is that we do not consider d^q and d^c in Equation 6, and all corner candidates here belong to a specific building corner.
Different from building corners, which can only be visible or invisible, rooflines can also be partially blocked by other objects (trees in particular). Therefore, before we apply the ranking algorithm, we preprocess the length and edgeness values that are affected by such blocking via Algorithm 1, as follows:
When estimating building height, we first separate buildings into two classes: (i) with at least one valid corner; (ii) without any valid corner. Then, we process buildings in class (i) according to their distance to the camera. After all the buildings in class (i) have been processed, we process the buildings in class (ii) according to their distance to the camera. After we obtain the height of a building, we mark the scope of the building in the street scene image, as shown in Fig. 6b (i.e., height has been obtained).
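The two-pass processing order above can be sketched as follows; the record fields ('has_corner', 'distance') are hypothetical names for illustration:

```python
def processing_order(buildings):
    """Order buildings as described: those with at least one valid corner
    first, then those without, each group sorted by distance to the camera.

    buildings: list of dicts with hypothetical keys 'id', 'has_corner',
    and 'distance' (distance to the camera in meters).
    """
    with_corner = sorted((b for b in buildings if b["has_corner"]),
                         key=lambda b: b["distance"])
    without = sorted((b for b in buildings if not b["has_corner"]),
                     key=lambda b: b["distance"])
    return with_corner + without
```

Processing nearer buildings first lets their marked scopes constrain the roofline search for buildings farther away.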
After detecting the roofline candidates of a building, we refine the of each roofline candidate using the following equation:
(12) 
where is the detected length of a roofline candidate . checks whether a pixel within a roofline belongs to the scope of a building in the street scene image that has been processed and is closer to the camera, or to the roofline of another building that has been processed but is farther from the camera. We remove a pixel from a roofline if is true. checks whether a pixel , which lies on the extended line of but within the projected scope of the roofline, is blocked by trees. If such pixels exist and they connect to the detected roofline segment, we add them to the roofline. Accordingly, we update the edgeness of a roofline as:
(13) 
where is the input edge map of the original image, and and are the initial and prolonged lengths of the roofline, respectively.
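A sketch of this refinement step, with the occlusion predicates (blocked_by_nearer, tree_blocked), the pixel representation, and the adjacency test all being our own assumptions rather than the paper's definitions:

```python
def adjacent(p, q):
    """Two pixels are connected if they are within one step of each other."""
    return abs(p[0] - q[0]) <= 1 and abs(p[1] - q[1]) <= 1

def refine_roofline(pixels, extension, blocked_by_nearer, tree_blocked):
    """Refine a roofline candidate's pixel set.

    pixels: detected roofline pixels in order along the line.
    extension: pixels on the roofline's extended line, still within the
        roofline's projected scope.
    blocked_by_nearer(p): True if p falls in the scope of an already
        processed, nearer building (or a processed, farther building's
        roofline) -- such pixels are removed.
    tree_blocked(p): True if p is hidden by trees -- such extension pixels
        are appended when they connect to the detected segment.
    """
    kept = [p for p in pixels if not blocked_by_nearer(p)]
    for p in extension:
        if tree_blocked(p) and kept and adjacent(p, kept[-1]):
            kept.append(p)
    return kept
```

The refined length is then the size of the returned pixel set, and the edgeness is recomputed over those pixels.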
5.3. Tall Building Preprocessing
Height estimation for tall buildings (over 100 meters) requires the camera to be placed far away from the buildings with an upward-looking view to capture the building roof. In images with an upward-looking view, all building corner lines become slanted. We take an upward-looking view of 25 degrees as an example to show the strategy that we use for handling tall buildings.
We first compute a plane-to-plane homography (Zhang, 2000), which maps an image with an upward-looking view to the corresponding image with a horizontal view. Here, we use the homogeneous estimation method (Criminisi, 1997), which solves for a homogeneous matrix that matches a point in an upward-looking image (Fig. 7a) to the corresponding point in a horizontal-view image (Fig. 7b) using Equation 14:
(14) 
where the homogeneous matrix is represented in vector form as ; is the number of point pairs, which should be no fewer than four for the homogeneous equation to be solvable; () represents a point in the upward-looking image and () represents the corresponding point in the resultant image with a horizontal view. The vector that minimizes the algebraic residuals, subject to unit norm, is the eigenvector corresponding to the least eigenvalue of ; this is a standard result of linear algebra, and the eigenvector can be obtained from the singular value decomposition (SVD).
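This homogeneous estimation corresponds to the standard direct linear transform (DLT); a minimal NumPy sketch under the assumption of at least four point correspondences:

```python
import numpy as np

def homography_dlt(src_pts, dst_pts):
    """Estimate the 3x3 homography H mapping src points to dst points.

    Each correspondence contributes two rows to the design matrix; the
    solution vector h (unit norm) is the right singular vector of the
    smallest singular value, i.e. the eigenvector of A^T A with the least
    eigenvalue. Needs at least four point pairs.
    """
    rows = []
    for (x, y), (u, v) in zip(src_pts, dst_pts):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    A = np.array(rows, dtype=float)
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]  # normalize the scale ambiguity

def apply_h(H, pt):
    """Map a 2D point through H using homogeneous coordinates."""
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return (x / w, y / w)
```

In practice, libraries such as OpenCV provide this estimation directly; the sketch only shows the mechanics behind Equation 14.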
6. Experiments
In this section, we first evaluate our proposed BuildingNet model for building corner and roofline classification and then evaluate our proposed CBHE algorithm for building height estimation.
6.1. Datasets
In our experiments, we obtain building footprints (geo-coordinates) from OpenStreetMap and building images from Google Street View. For the experiments on building height estimation, we use two datasets:
(i) City Blocks, which contains 128 buildings in San Francisco. We collect all Google Street View images (640 × 640 pixels) with the camera orientation along the street. We set the field of view of the camera to 90 degrees, and the focal length can be derived from the camera parameters provided by Google. We do not need to consider the camera rotation matrix and the translation vector due to the image preprocessing done by Google Street View. We obtain the building height ground truth from high-resolution aerial maps (e.g., NearMap (Google, 2018)).
(ii) Tall Buildings, which contains 37 buildings taller than 100 meters in San Francisco, Melbourne, and Sydney, collected by us via the Google Street View API. We set the camera with an upward-looking view (25 degrees) to capture their rooflines. The building height ground truth comes from the Wikipedia pages of these buildings or is derived from NearMap (Google, 2018).
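For instance, with a 90-degree field of view and 640-pixel-wide images, the focal length in pixels follows directly from the pinhole model; the height relation below is a simplified sketch of camera projection, not the paper's full formulation:

```python
import math

def focal_length_px(image_width_px, fov_deg):
    """Focal length in pixels from the horizontal field of view:
    f = (W / 2) / tan(FOV / 2)."""
    return (image_width_px / 2) / math.tan(math.radians(fov_deg) / 2)

def height_from_projection(pixel_height, depth_m, focal_px):
    """Simplified pinhole relation: an object's physical height from its
    height in the image and its distance along the camera axis."""
    return pixel_height * depth_m / focal_px
```

With a 90-degree field of view, tan(45°) = 1, so the focal length equals half the image width (320 pixels for a 640-pixel image).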
For building corner classification, we crop images from the City Blocks dataset. We generate the corner dataset semi-automatically: we crop image segments of pixels from street scene images, and then manually label whether an image segment contains a building corner (and the type of corner). The training dataset that we collected contains 10,400 images, including 1,300 images of each type of building corner (i.e., a total of 5,200 building corner images) and 5,200 non-corner images. The testing dataset contains 1,280 images, including 160 images for each type of building corner and 640 non-corner images. The training data and testing data come from different city blocks.
Following a similar approach, we collect a roofline dataset. For each roofline candidate, we extend the upper and lower 10 pixels of the roofline to obtain a image segment, where is the length of the roofline, and we further resize (rotating first if the roofline is not horizontal) the image to to generate same-size inputs for BuildingNet. The training dataset includes 7,800 images, including 1,300 images for each type of roofline (i.e., a total of 3,900 building roofline images) and 3,900 non-roofline images. The testing dataset contains 960 images, including 160 images for each type of roofline and 480 non-roofline images.
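The resizing to same-size inputs can be done with, e.g., nearest-neighbour sampling; a minimal sketch on 2D lists (the paper's actual target size and interpolation method are not reproduced here):

```python
def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2D list of pixel values, so that
    roofline crops of varying length become fixed-size network inputs."""
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)]
            for r in range(out_h)]
```

In practice an image library (e.g., PIL or OpenCV) would be used, typically with bilinear interpolation for smoother results.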
6.2. Effectiveness of BuildingNet
Building corner and roofline classification is an open set classification problem where the invalid corner or roofline candidates do not have consistent features. To test the effectiveness of BuildingNet, we use two open set classifiers as the baselines: SROSR (Zhang and Patel, 2017) and OpenMax (Bendale and Boult, 2016). SROSR uses reconstruction errors for classification. It simplifies the open set classification problem into testing and analyzing a set of hypotheses based on the matched and non-matched error distributions. OpenMax handles the open set classification problem by estimating the probability of whether an input comes from unknown classes based on the last fully connected layer of a neural network. Further, we use two loss functions based on triplet selection to illustrate the effectiveness of our proposed triplet relative loss function. The loss function in FaceNet (Schroff et al., 2015) makes the intra-class distance smaller than the inter-class distance by adding a margin, and the one in MSML (Xiao et al., 2017) optimizes the triplet selection process towards selecting hard triplets in each batch during training.
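For reference, the FaceNet-style margin loss mentioned above has the following shape (the triplet *relative* loss proposed in this paper modifies this comparison; its exact form is not reproduced here):

```python
def triplet_margin_loss(d_ap, d_an, margin=0.5):
    """FaceNet-style triplet loss on embedding distances.

    d_ap: distance between anchor and positive (same class);
    d_an: distance between anchor and negative (different class).
    The loss pushes the anchor-positive distance below the anchor-negative
    distance by at least a fixed margin; the 0.5 default here mirrors the
    margin-like parameter value reported for BuildingNet's training.
    """
    return max(0.0, d_ap - d_an + margin)
```

The loss is zero once the negative is sufficiently farther than the positive, so only violating triplets contribute gradients.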
For OpenMax, we use the LeNet-5 model to train on the building corner and roofline dataset for 10K iterations with the default settings in Caffe
(Jia et al., 2014). We then apply the last fully connected layer to OpenMax for classification. For our BuildingNet, we pre-train LeNet-5 with the MNIST dataset and fine-tune it with our collected building corner and roofline images. Further, since BuildingNet can also take advantage of unlabeled data (known unknown (Scheirer et al., 2014)) during training, we also pre-train a LeNet-5 model on the MNIST dataset (0 to 4 as the labeled data and 5 to 9 as the unlabeled data) and fine-tune it with our data. We set the learning rate to 0.1 with a decay rate of 0.95 after every 1K iterations (50K iterations in total). The batch size is 30 images for each class, the embeddings that BuildingNet learns are 128-dimensional, and the in the triplet relative loss function is 0.5. We perform 10-fold cross-validation on the models tested, and then compute the accuracy, precision, recall, and F score of each model, which are summarized in Figure 9.

On the corner dataset, BuildingNet achieves an accuracy of 94.34%, and its recall, precision, and F score are all over 91% when using both labeled and unlabeled data for training. Compared with SROSR and OpenMax, BuildingNet improves the accuracy and F score by more than 6% and 10%, respectively. When trained with labeled data only, BuildingNet still has the highest accuracy and F score (i.e., 88.72% and 81.8%), which are 1.1% and 2% higher than those of OpenMax, respectively. Compared with the two loss functions in MSML and FaceNet, which are also based on triplet selection, our proposed loss function improves the accuracy and F score by more than 0.4% and 0.5%, respectively. On the roofline dataset, the proposed BuildingNet again outperforms the baseline models consistently. These results confirm the effectiveness of BuildingNet.
Fig. 10. 2D embeddings of four types of corners and the unlabeled data after 100 epochs, learned by the loss functions in (a) FaceNet, (b) MSML, and (c) the proposed triplet relative loss (best viewed in color).
To further illustrate the effectiveness of BuildingNet, we visualize the embeddings generated by the three triplet-based loss functions on the corner dataset, as shown in Fig. 10. Compared with random triplet selection with a margin (FaceNet) and hard triplet selection with a margin (MSML), our triplet relative loss function obtains better classification results, with a smaller average intra-class distance and a larger average inter-class distance after the same number of epochs.
6.3. Effectiveness of CBHE
We evaluate the performance of CBHE on City Blocks and Tall Buildings in this subsection.
6.3.1. Building height estimation on City Blocks
Figure 11 shows the building height estimation errors of the baseline method (Yuan and Cheriyadat, 2016) and CBHE on the City Blocks dataset. It shows the percentage of buildings whose height estimation error is greater than 2, 3, and 4 meters, respectively. In both city blocks, CBHE achieves a smaller percentage of such buildings than the baseline.
In particular, in the first city block (Fig. 11a, which has also been used in (Yuan and Cheriyadat, 2016)), CBHE has 10.4%, 5.5%, and 1.2% fewer buildings than the baseline with height estimation errors greater than 2, 3, and 4 meters, respectively. Note that the results of the baseline method are taken from their paper (Yuan and Cheriyadat, 2016) since we are unable to obtain their source code. Also, even though CBHE is run on the same city block as the baseline in this set of experiments, the images that we use are more challenging to handle, as the trees in the street scenes have grown larger and block the buildings (cf. Fig. 12).
Fig. 11b shows the results on a second city block (which was not used in (Yuan and Cheriyadat, 2016)). As we are unable to obtain the source code of the baseline method, its result is based on our implementation of their method. CBHE again outperforms the baseline: it has 11.5%, 4.8%, and 5% fewer buildings than the baseline with height estimation errors greater than 2, 3, and 4 meters, respectively.


6.3.2. Building height estimation on Tall Buildings.
For tall buildings, the camera needs to be placed far away with an upward-looking view to capture the building roofline. We capture the building images 250 meters away from the buildings via the Google Street View API. Fig. 13 presents examples of the street scene images for tall building height estimation. For each street scene image, we first rotate it to the horizontal view according to Equation 14, and then compute the height of the buildings as described in Section 5.
The baseline method (Yuan and Cheriyadat, 2016) cannot be applied to tall buildings, so here we only show the results of CBHE. As shown in Table 2, more than 53% of the tall buildings have a height estimation error of less than five meters, and 73% of the tall buildings have an error of less than 10 meters.
Table 2. Height estimation errors of CBHE on Tall Buildings.

Absolute error | Percentage | Relative error | Percentage
>5m            | 45.9%      | >5%            | 40.5%
>10m           | 27.0%      | >10%           | 13.5%
The errors for tall buildings may seem larger due to the camera projection (i.e., the errors are multiplied by a larger factor for tall buildings). However, we emphasize that the relative errors are still quite low: since the tall buildings are taller than 100 meters, even a 10-meter error is less than 10% and is barely noticeable in reality.
6.4. Error Analysis
We summarize the challenging cases for CBHE in this section. These challenging scenarios will be explored in future work.
For buildings whose rooflines are entirely blocked by other objects such as trees, CBHE will ignore them or output a wrong estimation. Take Fig. 14a as an example: the trees on the left-hand side of the image heavily block the roof of the green-colored building, resulting in a line below the roof being identified as the roofline. Additionally, if the corners of a building are not detectable, lines from other buildings behind this building may also impact the result. As illustrated in Fig. 14b, the roofline of a building behind the blue-colored building was detected as its roofline.
In dense city areas, buildings may overlap with each other, and it is difficult to accurately match all buildings with their boundaries in a 2D map. Take Fig. 14c as an example: building is blocked by building , and building 's corners have a similar horizontal position to building . Therefore, CBHE regards the rooflines of building as the rooflines of building , which results in an estimated height of 77.41 m for building , although its real height is 24.53 m. Moreover, the height of building is also wrong because the incorrect rooflines of building block the rooflines of building . In Fig. 14d, the blue-shaded building mask on the right-hand side is wrongly assigned to the building (between building and building ) because it is closer to the camera and has a similar position to building .
7. Conclusions
We proposed a corner-based algorithm named CBHE to estimate building height from complex street scene images. CBHE consists of camera location calibration and building roofline detection as its two main steps. To calibrate the camera location, CBHE performs camera projection by matching two building corners in street scene images with their physical locations obtained from a 2D map. To identify building rooflines, CBHE first detects roofline candidates according to the building footprints in 2D maps and the calibrated camera location. Then, it uses our proposed deep neural network, BuildingNet, to check whether a roofline candidate is indeed a building roofline. Finally, CBHE ranks the valid rooflines with an entropy-based ranking algorithm, which also uses building corner information as an essential indicator, and then computes the building height through camera projection. Experimental results show that the proposed BuildingNet model consistently outperforms two state-of-the-art classifiers, SROSR and OpenMax, and that CBHE outperforms the baseline algorithm by over 10% in building height estimation accuracy.
8. Acknowledgments
We thank the anonymous reviewers for their feedback. We appreciate the valuable discussions with Bayu Distiawan Trsedya, Weihao Chen, and Jungmin Son. Yunxiang Zhao is supported by the Chinese Scholarship Council (CSC). This work is supported by Australian Research Council (ARC) Discovery Project DP180102050, a Google Faculty Research Award, and the National Science Foundation of China (Project No. 61402155).
References
 Agarwal et al. (2015) Pratik Agarwal, Wolfram Burgard, and Luciano Spinello. 2015. Metric Localization using Google Street View. In IEEE/RSJ International Conference on Intelligent Robots and Systems. 3111–3118.
 Anguelov et al. (2010) Dragomir Anguelov, Carole Dulong, Daniel Filip, Christian Frueh, Stéphane Lafon, Richard Lyon, Abhijit Ogale, Luc Vincent, and Josh Weaver. 2010. Google Street View: Capturing the World at Street Level. Computer 43, 6 (2010), 32–38.

 Armagan et al. (2017) Anil Armagan, Martin Hirzer, Peter M. Roth, and Vincent Lepetit. 2017. Learning to Align Semantic Segmentation and 2.5D Maps for Geolocalization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4590–4597.
 Arth et al. (2015) Clemens Arth, Christian Pirchheim, Jonathan Ventura, Dieter Schmalstieg, and Vincent Lepetit. 2015. Instant Outdoor Localization and SLAM Initialization from 2.5D Maps. IEEE Transactions on Visualization and Computer Graphics 21, 11 (2015), 1309–1318.
 Baidu (2018) Baidu. 2018. Baidu Map. Retrieved Oct 18, 2018 from https://map.baidu.com/#
 Bendale and Boult (2016) Abhijit Bendale and Terrance E. Boult. 2016. Towards Open Set Deep Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1563–1572.
 Brunner et al. (2010) Dominik Brunner, Guido Lemoine, Lorenzo Bruzzone, and Harm Greidanus. 2010. Building Height Retrieval from VHR SAR Imagery based on an Iterative Simulation and Matching Technique. IEEE Transactions on Geoscience and Remote Sensing 48, 3 (2010), 1487–1504.

 Chang and Lin (2011) Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2, 3 (2011), 27.
 Chu et al. (2014) Hang Chu, Andrew Gallagher, and Tsuhan Chen. 2014. GPS Refinement and Camera Orientation Estimation from a Single Image and a 2D Map. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 171–178.
 Criminisi (1997) Antonio Criminisi. 1997. Computing the Plane to Plane Homography.
 Díaz and Arguello (2016) Elkin Díaz and Henry Arguello. 2016. An Algorithm to Estimate Building Heights from Google Streetview Imagery using Single View Metrology across a Representational State Transfer System. In Dimensional Optical Metrology and Inspection for Practical Applications V, Vol. 9868. 98680A.
 Dollár and Zitnick (2013) Piotr Dollár and C. Lawrence Zitnick. 2013. Structured Forests for Fast Edge Detection. In IEEE International Conference on Computer Vision (ICCV). 1841–1848.
 Google (2018) Google. 2018. NearMap. Retrieved Nov 4, 2018 from http://maps.au.nearmap.com/
 Grabler et al. (2008) Floraine Grabler, Maneesh Agrawala, Robert W. Sumner, and Mark Pauly. 2008. Automatic Generation of Tourist Maps. ACM Transactions on Graphics (TOG) 27, 3 (2008), 100:1–100:11.
 Grammenos et al. (2018) Andreas Grammenos, Cecilia Mascolo, and Jon Crowcroft. 2018. You Are Sensing, but Are You Biased?: A User Unaided Sensor Calibration Approach for Mobile Sensing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2, 1 (2018), 11.
 Haklay and Weber (2008) Mordechai Haklay and Patrick Weber. 2008. OpenStreetMap: User-Generated Street Maps. IEEE Pervasive Computing 7, 4 (2008), 12–18.
 He et al. (2018) Xinwei He, Yang Zhou, Zhichao Zhou, Song Bai, and Xiang Bai. 2018. Triplet-Center Loss for Multi-View 3D Object Retrieval. arXiv preprint arXiv:1803.06189 (2018).
 Izadi and Saeedi (2012) Mohammad Izadi and Parvaneh Saeedi. 2012. Three-Dimensional Polygonal Building Model Estimation from Single Satellite Images. IEEE Transactions on Geoscience and Remote Sensing 50, 6 (2012), 2254–2272.
 Jia et al. (2014) Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22nd ACM international conference on Multimedia. 675–678.
 Klingner et al. (2013) Bryan Klingner, David Martin, and James Roseborough. 2013. Street View Motion-from-Structure-from-Motion. In IEEE International Conference on Computer Vision (ICCV). 953–960.
 Kopf et al. (2010) Johannes Kopf, Billy Chen, Richard Szeliski, and Michael Cohen. 2010. Street Slide: Browsing Street Level Imagery. In ACM Transactions on Graphics (TOG), Vol. 29. 96.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
 Liasis and Stavrou (2016) Gregoris Liasis and Stavros Stavrou. 2016. Satellite Images Analysis for Shadow Detection and Building Height Estimation. ISPRS Journal of Photogrammetry and Remote Sensing 119 (2016), 437–450.
 Lin et al. (2017) Guosheng Lin, Anton Milan, Chunhua Shen, and Ian D. Reid. 2017. RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5168–5177.
 Liu et al. (2017) Liu Liu, Hongdong Li, and Yuchao Dai. 2017. Efficient Global 2D-3D Matching for Camera Localization in a Large-Scale 3D Map. In IEEE International Conference on Computer Vision (ICCV). 2391–2400.
 Liu and Wang (2002) Xiuwen Liu and DeLiang Wang. 2002. A Spectral Histogram Model for Texton Modeling and Texture Discrimination. Vision Research 42, 23 (2002), 2617–2634.

 Maaten and Hinton (2008) Laurens Van Der Maaten and Geoffrey Hinton. 2008. Visualizing Data Using t-SNE. Journal of Machine Learning Research 9, Nov (2008), 2579–2605.
 Meyer and Meyer (2010) Trish Meyer and Chris Meyer. 2010. Creating Motion Graphics with After Effects. Taylor & Francis.
 Ng (2009) Edward Ng. 2009. Policies and Technical Guidelines for Urban Planning of High-Density Cities–Air Ventilation Assessment (AVA) of Hong Kong. Building and Environment 44, 7 (2009), 1478–1488.
 Pan et al. (2015) Jiyan Pan, Martial Hebert, and Takeo Kanade. 2015. Inferring 3D Layout of Building Facades from a Single Image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2918–2926.
 Qi et al. (2016) Feng Qi, John Z Zhai, and Gaihong Dang. 2016. Building Height Estimation using Google Earth. Energy and Buildings 118 (2016), 123–132.
 Rousell and Zipf (2017) Adam Rousell and Alexander Zipf. 2017. Towards a Landmark-Based Pedestrian Navigation Service using OSM Data. ISPRS International Journal of Geo-Information 6, 3 (2017), 64.
 Sampath and Shan (2010) Aparajithan Sampath and Jie Shan. 2010. Segmentation and Reconstruction of Polyhedral Building Roofs from Aerial Lidar Point Clouds. IEEE Transactions on Geoscience and Remote Sensing 48, 3 (2010), 1554–1567.
 Scheirer et al. (2014) Walter J Scheirer, Lalit P Jain, and Terrance E Boult. 2014. Probability Models for Open Set Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 36, 11 (2014), 2317–2324.

 Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A Unified Embedding for Face Recognition and Clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 815–823.
 Sohn et al. (2008) Gunho Sohn, Xianfeng Huang, and Vincent Tao. 2008. Using a Binary Space Partitioning Tree for Reconstructing Polyhedral Building Models from Airborne Lidar Data. Photogrammetric Engineering & Remote Sensing 74, 11 (2008), 1425–1438.
 Sportouche et al. (2011) Hélène Sportouche, Florence Tupin, and Léonard Denise. 2011. Extraction and Three-Dimensional Reconstruction of Isolated Buildings in Urban Scenes from High-Resolution Optical and SAR Spaceborne Images. IEEE Transactions on Geoscience and Remote Sensing 49, 10 (2011), 3932–3946.
 Sun et al. (2017) Liyan Sun, Chenglin Miao, and Li Yang. 2017. Ecological-Economic Efficiency Evaluation of Green Technology Innovation in Strategic Emerging Industries based on Entropy Weighted TOPSIS Method. Ecological Indicators 73 (2017), 554–558.
 Tack et al. (2012) Frederik Tack, Gurcan Buyuksalih, and Rudi Goossens. 2012. 3D Building Reconstruction based on Given Ground Plan Information and Surface Models Extracted from Spaceborne Imagery. ISPRS Journal of Photogrammetry and Remote Sensing 67 (2012), 52–64.
 Wang et al. (2018) Xiaojie Wang, Rui Zhang, Yu Sun, and Jianzhong Qi. 2018. KDGAN: Knowledge Distillation with Generative Adversarial Networks. In Advances in Neural Information Processing Systems (NIPS). 783–794.
 Wang et al. (2015) Zhuang Wang, Libing Jiang, Lei Lin, and Wenxian Yu. 2015. Building Height Estimation from High Resolution SAR Imagery via Model-Based Geometrical Structure Prediction. Progress In Electromagnetics Research 41 (2015), 11–24.
 Weinberger et al. (2006) Kilian Q Weinberger, John Blitzer, and Lawrence K Saul. 2006. Distance Metric Learning for Large Margin Nearest Neighbor Classification. In Advances in Neural Information Processing Systems (NIPS). 1473–1480.
 Wen et al. (2016) Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. 2016. A Discriminative Feature Learning Approach for Deep Face Recognition. In European Conference on Computer Vision (ECCV). 499–515.
 WordPress and HitMag (2018) WordPress and HitMag. 2018. LIDAR and RADAR Information. Retrieved Aug 9, 2018 from http://lidarradar.com/category/info
 Xiao et al. (2017) Qiqi Xiao, Hao Luo, and Chi Zhang. 2017. Margin Sample Mining Loss: A Deep Learning Based Method for Person Re-identification. arXiv preprint arXiv:1710.00478 (2017).
 Yuan and Cheriyadat (2016) Jiangye Yuan and Anil M. Cheriyadat. 2016. Combining Maps and Street Level Images for Building Height and Facade Estimation. In ACM SIGSPATIAL Workshop on Smart Cities and Urban Analytics. 8:1–8:8.
 Zandbergen and Barbeau (2011) Paul A. Zandbergen and Sean J. Barbeau. 2011. Positional Accuracy of Assisted GPS Data from High-Sensitivity GPS-Enabled Mobile Phones. The Journal of Navigation 64, 3 (2011), 381–399.
 Zeng et al. (2014) Chuiqing Zeng, Jinfei Wang, Wenfeng Zhan, Peijun Shi, and Autumn Gambles. 2014. An Elevation Difference Model for Building Height Extraction from Stereo-Image-Derived DSMs. International Journal of Remote Sensing 35, 22 (2014), 7614–7630.
 Zhang and Patel (2017) He Zhang and Vishal M. Patel. 2017. Sparse Representation-Based Open Set Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 39, 8 (2017), 1690–1696.
 Zhang (2017) Rui Zhang. 2017. Geographic Knowledge Base (2017): http://www.ruizhang.info/GKB/gkb.htm.
 Zhang (2000) Zhengyou Zhang. 2000. A Flexible New Technique for Camera Calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 22, 11 (2000), 1330–1334.