1 Introduction
Automatic reconstruction of surface models from airborne and spaceborne imagery is a longstanding problem in photogrammetry and computer vision. For imagery collected by airborne platforms, stateoftheart algorithms allow for countryscale reconstructions with an impressive level of detail and a high degree of robustness, as demonstrated for instance by the city models now included in virtual online globes. In many applications builtup areas are most interesting and thus mapped on a regular basis, often with centimeter resolution. The most common product are still 2.5D digital surface models (DSMs) generated from airborne nadir imagery. More recent camera systems can also collect oblique views and enable the reconstruction of true 3D structures, in particular facade elements. Compared to airborne imagery, traditional satellite sensors acquired images with lower resolution, less redundancy, and only nadir views. Consequently, 2.5D DSMs are still the predominant representation for satellitebased reconstructions. The launch of new satellites with steerable highresolution sensors, such as for instance WordView3, has led to a renewed interest in detailed reconstruction from spaceborne imagery. With these new systems, datasets with down to 0.3m GSD and high redundancy can be collected, see for example recent benchmarks
Bosch et al. (2016); Patil et al. (2019). Given such data, the extraction of real 3D geometry, like balconies and other facade structures, seems to be in reach. Most existing algorithms, however, only produce 2.5D height maps or surfaces Kuschk (2013); Wohlfeil et al. (2012); d’Angelo and Reinartz (2011); Shean et al. (2016); Gong and Fritsch (2019) and do not even attempt to recover 3D details.Besides the 2.5D scene representation, conventional reconstruction pipelines have further drawbacks. They are often based on pairwise (dense) stereo and subsequent fusion of stereo models, which does not fully exploit the multiview redundancy. Moreover, the dominant binocular stereo algorithm in those pipelines is SemiGlobal Matching (SGM)
Hirschmüller (2008), due to its good compromise between quality and computational cost. The price to pay is that SGM and its variants are, by construction, subject to modeling errors such as frontoparallel bias (caused by rectangular matching windows) and a preference for areas of constant disparity Roth and Mayer (2019); Scharstein et al. (2017). It is also welldocumented that methods that estimate subpixel disparities in discrete disparity space introduce further systematic errors
Shimizu and Okutomi (2002); Szeliski and Scharstein (2004); Gehrig and Franke (2007). The fusion of individual stereo models into a single, consistent height field is most often done with heuristic rules, which certainly improve robustness and accuracy, but are nevertheless suboptimal. In particular, visibility and occlusions are often handled poorly, or not at all.
To sidestep the mentioned limitations, we propose a novel reconstruction approach that uses a 3D mesh representation. Our method is a local optimisation starting from an initial mesh, i.e., it refines an existing surface model, for instance a conventional 2.5D stereo result. Following Delaunoy et al. (2008); Vu et al. (2012), we assume that the coarse, initial mesh is topologically correct and we refine it by iteratively moving its vertices in the direction that most reduces the texture transfer error across all views. Technically, this is implemented as a variational energy minimisation, subject to a surface smoothness prior. Here, we formulate the corresponding energy function for the RPC model, the dominant sensor model for satellite images. We demonstrate, for the first time, the reconstruction of full 3D surface structure, by incorporating satellite views with large offnadir angles. Moreover, we show that the refinement also tends to improve the accuracy of the 2.5D elevation values.
2 Related Work
The majority of reconstruction algorithms for optical satellite imagery employ the scheme of pairwise epipolar rectification and dense stereo matching, followed by the fusion of depth maps Kuschk (2013); Wohlfeil et al. (2012); d’Angelo and Reinartz (2011); Shean et al. (2016); Gong and Fritsch (2019). Epipolar rectification warps an image pair such that corresponding pixels share the same row index. This reduces the correspondences search space to 1D and accelerates pixelwise matching. In contrast to pinhole cameras, the objecttoimage space mapping for satellite sensors is usually described with the rational function model (RFM), also known as rational polynomial coefficients (RPC) model Kratky (1989); Tao and Hu (2001); Dial and Grodecki (2002); Kim and Dowman (2006); Poli and Toutin (2012); Hartley and Saxena (1997). Since standard epipolar geometry is not valid for this projection model, rectification algorithms like Fusiello et al. (2000); Loop and Zhang (1999); Pollefeys et al. (1999) cannot be applied. However, Kim (2000) have shown that, locally, correspondences are located on a pair of epipolar curves across the images. This finding enables rectification of complete satellite images by resampling the original images along the epipolar curves Wang et al. (2010, 2011); Oh (2011). Due to computational efficiency and low memory consumption, many reconstruction pipelines employ some variant of SGM for dense stereo matching. The classical SGM is implemented in d’Angelo and Reinartz (2011) and they show stateoftheart performance on the ISPRS benchmark Reinartz et al. (2010). d’Angelo (2016) investigate the compensation of overcounting Drory et al. (2014) in the context of SGM for satellite imagery and observe improved density but decreased precision. A hierarchical version of SGM is employed by Gong and Fritsch (2019), Rothermel et al. (2012) to limit the disparity search range and reduce memory footprint and computation time, while also reducing matching ambiguities. De Franchis et al. (2014) and the topranked competitors showed stateoftheart performance on the IARPA satellite benchmark Bosch et al. (2016)
with offtheshelf SGM implementations from open source libraries
Bradski (2000) and with NASA’s open stereo reconstruction software Shean et al. (2016). Beside vanilla SGM, the latter also implements the More Global Matching Facciolo et al. (2015), using a modified cost aggregation scheme that aims for globally more consistent disparities. In order to avoid costly energy minimisation altogether, Wang and Frahm (2017)construct dense correspondence maps from a sparse set of tie points, via edge aware interpolation.
To obtain a consistent representation of the surface, individual stereo models have to eventually be fused . This is typically realized during DSM generation, by binning 3D points from multiple models into a planimetric 2D grid, followed by various filtering strategies to derive a single elevation value per grid cell – accomplished often by simple median filtering Kuschk (2013); d’Angelo and Kuschk (2012); Wang and Frahm (2017). Facciolo et al. (2017) account for changing surface modes (e.g, vegetation) caused by different acquisition dates. They propose median clustering for each grid cell and favour observations from lower clusters. In the spirit of bilateral filtering, Qin (2017) apply median filtering on heights of multiple neighbouring grid cells that share the same intensities. Kuschk et al. (2017) cast the fusion of multiple stereo models or elevation maps as a convex energy minimization problem and solve it with a primaldual algorithm, including an additional planarity prior in the form of a TVL1 and TGVL1 penalty.
Some approaches circumvent the somewhat cumbersome pairwise processing and subsequent fusion. Notably, early approaches of dense surface refinement were published in Helava (1988); Wrobel (1987). Rasterized surface elevations and the corresponding objectspace appearance are simultaneously estimated using iterative nonlinear least squares. Thereby the surface elevations are refined such that gray or RGB values across multiple views (linked by the elevation values) are consistent. Beside the 2.5D representation, and thus the inability to reconstruct 3D surfaces, those early approaches lack proper occlusion handling or any form of geometric regularization. d’Angelo and Kuschk (2012) directly estimate a DSM by assigning photometric similarity costs to a regular 3D cost structure in object space. The final elevation map is derived by semi global optimization. However, the authors find only limited gains compared to the more prevalent late fusion of binocular stereo models. Similarly, Wang et al. (2016) estimate an elevation grid and additionally fuse semantic information from a set of satellite images and corresponding semantic segmentation maps. To estimate surface elevations, PatchMatch Belief Propagation Besse (2013) is employed to maximize an energy function that encourages consistency of appearance and semantics across several images. Additionally, smoothness of semantics, height values and surface orientations are enforced. Pollard and Mundy (2007); Pollard et al. (2010) propose a probabilistic voxelbased model to jointly reconstruct surface voxels and their corresponding colours. To our knowledge, this is the only published method in the satellite domain which is capable of extracting real 3D geometry. It does, however, not include any explicit surface prior, and Ozcanli et al. (2015) found that both urban and rural reconstructions are less accurate than those from pairwise matching and late fusion.
Much less research exists for the satellite domain when compared to the conventional pinhole camera model, presumably because of the limited availability of highresolution data. We thus review relevant work on mesh refinement in the close range and airborne domains. Typical approaches start by reconstructing depth maps, or point clouds with normals, using Multi View Stereo (MVS), for example Schönberger et al. (2016); Furukawa and Ponce (2010); Gösele et al. (2007); Galliani et al. (2016). Next, they employ a volumetric approach to bootstrap a topologically correct mesh representation (e.g., Kazhdan and Hoppe (2013); Jancosek and Pajdla (2011); Ummenhofer and Brox (2015); Labatut et al. (2009); Zach et al. (2007); Fuhrmann and Gösele (2014)). For the latest results we refer the interested reader to one of the active MVS and reconstruction benchmarks Seitz et al. (2006); Schöps et al. (2017); Knapitsch et al. (2017). In order to recover fine details and improve precision, the vertex positions of such surface meshes can be further refined Delaunoy et al. (2008); Vu et al. (2012). An energy composed of the (multiview) texture transfer error and a smoothness prior is minimised with gradient descent. Li et al. (2016)
accelerate the process by limiting the refinement to regions that feature geometric variances. An alternate minimisation of the reprojection error and mesh denoising is given by
Li et al. (2015). A guideline for the mesh refinement by semantic information is pursued in Blaha et al. (2017); Romanoni et al. (2017): semantic consistency across views is enforced and smoothness priors for each class are adapted individually. To the best of our knowledge, all refinement algorithms were formulated for pinhole camera models and cannot be directly applied to satellite data.3 Methodology
The basic idea of photometric mesh refinement is to position the vertices of a mesh surface, such that texture transferred from an image to any other image via the surface should match the original texture of the target image . As long as the textures are not in correspondence, they give rise to a gradient, which can be propagated through the sensor model to obtain a gradient per vertex, defining the direction in which it should be displaced to increase the texture similarity. Iterative gradient decent yields a refined mesh with maximal similarity, respectively minimal photometric reprojection error. For clarity, we first review the computation of gradients for the pinhole camera model in section 3.1, before extending it to the RFM in section 3.2. Additionally, section 3.3 discusses implementation details.
3.1 Mesh Refinement with Frame Sensors
As a background for the subsequent adaption to satellite imagery we review the computation of gradients induced by photometric (dis)similarities under the pinhole model. Let denote the (infinite) set of admissible 2D surface manifolds in . The overall photoconsistency is composed of individual terms that measure the photoconsistency between image to image , when projected onto each other via the surface :
(1) 
Here and are the image regions in and that see the same surface area on . Let be the projections that map object coordinates to image coordinates of , respectively , and let and denote the reprojection from the respective image to the surface . The transfer function relating image coordinates in the two views is given by , such that the pairwise photoconsistency becomes
(2) 
where is a measure of photoconsistency. In our case we seek to minimize the negative zeronormalised crosscorrelation
. Using the chain rule, the variation of
with respect to an infinitesimal variation of the surface is given by(3) 
where denotes the derivative of the similarity measure w.r.t. the second argument . is the gradient in image and is the derivative of the objecttoimage space mapping w.r.t. an object point on the surface. Let be the ray from the projection center of view through pixel coordinate . The term
(4) 
represents the change along when moving the surface by .
This change can be computed geometrically using trigonometric functions and the intercept theorem (see figure 0(a)), which leads to
(5) 
with the surface normal . The change in image coordinates caused by an infinitesimal displacement of the object coordinates (see Figure 0(b)) is computed as
(6) 
Here represents the zcomponent of a surface point in the camera coordinate system of view . By substituting (5) in (3), and using (6) to change the integration domain from image space to the surface, we obtain
(7) 
It has been shown Delaunoy and Prados (2011); Solem and Overgaard (2005) that for the variation of the photoconsistency and the variation
of the surface, the gradient vector field
fulfills(8) 
and the gradient descent flow is given by . Consequently, by comparing (7) and (8), the gradient of the matching function is
(9) 
The flag accounts for visibility, evaluating to 1 if the surface is visible in both views , and to 0 otherwise. Note that this continuous formulation can be directly used to compute gradients of discrete surfaces, for example triangular meshes. In practice a translation of each single vertex is computed by weighted integration of gradients over its onering neighborhood of triangles. For more details the interested reader is referred to Delaunoy and Prados (2011).
3.2 Mesh Refinement with Spaceborne Pushbroom (Line) Sensors
The defacto standard to model the objecttoimage space mapping for satellite imagery is the RFM. In the following, we derive a mesh refinement scheme like the one above under the RFM (sections 3.2.13.2.4). Implementation details for that new model will be given in section 3.3.
3.2.1 Derivatives of the RFM
Since the projection function from object to image space is different from the pinhole model, we first need to adapt the Jacobian . Let , and be normalised geographic lat/lon/height coordinates of an object point, and let be a 20dimensional vector holding the 3rddegree polynomial expansion of those normalised coordinates. With the four vectors each holding 20 rational polynomial coefficients (RPC), the image coordinates of the projected point are
(10) 
with , the scales and offsets between pixels and normalised image coordinates. The RFM parameters shipped with an image tend to only be correct up to a small, global bias. That error is often compensated with an additional affine transformation Fraser and Yamakawa (2004), or with only a translation Fraser et al. (2006). We follow the latter and apply only a simple shift correction . The 23 Jacobian matrix w.r.t the geographic coordinates reads
(11) 
3.2.2 QuasiCartesian Coordinate System
The RFM relates Cartesian image coordinates to polar geographic coordinates. For the mesh refinement, it is not only easier, but also numerically advantageous to operate in local Cartesian coordinates. Hence, we transform geographic coordinates into a "QuasiCartesian" local coordinate system. This can be achieved by scaling latitude and longitude to the (metric) unit of the height component . Let be a point located at the center of the area of interest and its UTM coordinates. Furthermore, let be the corresponding geographic coordinates of . Then the transformation to a local, quasiCartesian frame can be expressed as
(12) 
This transformation mimics a Cartesian coordinate system only locally, but the approximation is valid for a large enough area (refinement of large areas will in practice always be done for local tiles, in order to parallelise the computation). The quality of the approximation (nonorthogonality and scale anisotropy of the coordinate axes) is shown in Table
1. Using the chain rule, the Jacobian of the transformation from image space to quasiCartesian coordinates reads(13) 
with denoting elementwise multiplication.
3.2.3 Independence of Projection Center and Approximation of Depth
The geometric derivations of (5) and (6) are based on the depth , and on the ray connecting the projection center and a surface point. The RFM does not model a single projection centre. Fortunately, (5) is in fact independent of the absolute length . This can be seen by rescaling it to an arbitrary length, e.g., to the unit vector, . Equation (5) then becomes
(14) 
In contrast, for the variation of image space coordinates ((6)) the length of does not cancel out easily. Instead, we use the fact that for large focal lengths . Thus, the denominator is approximately . In other words, a stereo model contributes to the gradient inversely proportional to its distance from the surface. We account for this weighting by scaling with the GSD at the average terrain height. As , this corresponds to a change of variables
(15) 
and the final gradient is calculated as
(16) 
3.2.4 Approximation of Straight Rays With Virtual Cameras
Computing the derivative of the similarity measure requires mapping the image into the view via the surface, which involves ray casting. Viewing rays modeled by the RFM are in general not straight lines which prevents efficient ray casting. In practice one can assume that the curvature of the rays is low enough to represent them by straight rays in the vicinity of the surface (see Table 2). Therefore, we define a virtual camera in the following way: We define two virtual planes in object space with constant heights and above ground. Then each image pixel is projected to those two planes using the RFM (and mapped to quasiCartesian coordinates), to obtain two virtual points and . The set of line segments , together with the corresponding pixel intensities, forms a virtual camera that observes the surface along straight rays.
3.3 Implementation Details
The overall pipeline for surface refinement proceeds as follows: First, the input mesh is transformed to quasiCartesian coordinates. Next, two virtual cameras , are set up for each stereo pair. Now the iterative refinement starts; the image intensities of are projected onto the current surface and back into . There, we densely compute the similarity (in our implementation ZNCC) between the original and the projected images, as well as its derivative. From we read out the ray direction . The remaining components needed to compute the gradient (16) are obtained as discussed in sections 3.2.13.2.4. Pervertex gradients are obtained by integrating (16) over all faces in a onering neighborhood. For each vertex of the input mesh, the resulting gradients are summed over all stereo models and scaled with the step size to obtain a field of vertex displacements. To these photometric gradients, we add displacement vectors corresponding to the thinplate energy Kobbelt et al. (1998) to regularise the surface smoothness Vu et al. (2012). The final displacement field is applied to the mesh vertices to obtain an updated surface, which then serves as input to the next iteration. Formally, the overall energy is given as
(17) 
where , denote the principal curvatures of the surface. The weight balances the photometric term and the smoothness. Homogenisation of smoothness and photometric energies is achieved with an additional parameter that account for different scales across datasets and mesh resolutions.
In all experiments we run 20 iterations of gradient descent, after which the energy barely decreases any more. More advanced stopping criteria are of course possible. The weight factor and the step size for gradient decent were derived by grid search.
We observed convergence problems for input meshes with vertices located too far from the correct surface. We found that convergence can be improved by a hierarchical processing scheme, similar to Li et al. (2016). To that end, we convert the input mesh to a cloud of oriented points and extract a lowresolution mesh via Poisson reconstruction Kazhdan and Hoppe (2013). Thereby we choose the minimal voxel resolution (respectively, octree leaf dimension) to GSD. The lowresolution mesh is refined using downscaled versions of the original images (factor ) for 20 iterations. Then the mesh is densified by splitting each triangle face into four smaller ones, and refinement is repeated with image scale , and so forth until the full resolution is reached. Note that the densification factor is the same as the increase in the number of pixels from one pyramid level to the next, hence the (average) number of pixels per triangle remains the same. In our current implementation a triangle covers 2 pixels in average. For additional implementation details we refer the interested reader to our publicly accessible implementation of mesh refinement for pinhole models published in Blaha et al. (2017).
scale s  length [m]  length [m]  angle , [deg] 

100  100.002  100.002  90.000 
200  200.003  200.003  90.001 
500  500.009  500.008  90.001 
1000  1000.018  1000.016  90.003 
2000  2000.038  2000.031  90.005 
5000  5000.110  5000.067  90.014 
h[m]  1  100  500  1000 

[]  30.131  30.130  30.128  30.126 
4 Results
4.1 Test Sites
We test the proposed algorithm on the publicly available benchmark Brown et al. (2018), which provides multiview collections of 16bit panchromatic WorldView3 images with 0.3m GSD (at nadir, the actual GSD in offnadir views can be up to a factor of 1.5 lower). Two different test sites were evaluated: Downtown Jacksonville (JAX), FL, USA and University of California San Diego (UCSD), CA, USA. The JAX test site features 26 images of an urban area of ca. 750m750m, collected between October 2014 and February 2016. The UCSD test site consists of 35 images covering an area of ca. 600m600m, collected between October 2015 and August 2017. A 2.5D LiDAR DSM with 0.5m GSD is provided and serves as ground truth for both test sites, see Figure 2. The lack of full3D ground truth makes it impossible to quantitatively evaluate 3D elements, such as indentations on facades, in our reconstructed mesh, see Figure 3(c). Nevertheless, since the gist of our approach is its ability to reconstruct true 3D geometry, we first present a qualitative evaluation of such 3D areas in the following section, while the quantitative evaluation based on 2.5D elevation maps is discussed in section 4.4.
4.2 Generation of Initial Meshes, PreProcessing and Parameters
The proposed mesh refinement needs an initial surface mesh to start from. We generate that mesh by running conventional dense stereo and meshing the resulting 2.5D elevation model by Poisson reconstruction (Section 3.3). The MVS system requires subpixel accurate relative alignment for good performance, which is not reached by the image provider’s RPCs. We use RFM parameters that have been refined with the method of Patil et al. (2019), which implements biascorrection via feature matching and subsequent bundle block adjustment.
The image collections feature very high redundancy, almost all images of a site have significant overlap. It would not be meaningful to use possible pairings for surface reconstruction: not only would it be computationally expensive, it can also be harmful to include images with baselines too short for accurate triangulation or too long for robust matching. According to our experimental experience and related research d’Angelo and Kuschk (2012); Facciolo et al. (2017); Qin (2017), we manually remove images with overly poor illumination conditions and select 80 suitable image pairs for JAX and 86 for UCSD, with intersection angles of the viewing directions between and . The parameter which controls the contribution of the unary term was set to for UCSD and for JAX respectively. The parameter , which steers the smoothness was set to 0.05 for both datasets. The average size of projected triangles is 2 pixels. Meshes were refined using the coarsetofine scheme starting at resolution and stopping at the full resolution images. With the current, unoptimised implementation the runtime for the fullresolution refinement is 65mins for JAX and 39mins for the samller UCSD. The most time consuming part is ray casting, which consumes 70% of the computation time. We note that the ray casting is suitable for GPU implementation, as is the computation of , which furthermore could be reused and updated only periodically after several iterations. Together with stricter mechanisms for stereopair selection and masking of regions outside the stereo overlap, a 10 speedup is almost certainly possible.
4.3 Qualitative Evaluation  Reconstruction of 3D Structure
In this section we qualitatively assess the reconstructions obtained with the proposed 3D mesh refinement algorithm. In particular, we illustrate its ability to recover 3D shape details that are not representable in a 2.5D height field, and its superior treatment of sharp discontinuities on manmade structures. Figure 4 shows 2.5D tSGM models (3(a), 3(d), 3(g), 3(j)), the refined 3D models (3(b), 3(e), 3(h), 3(k)) and Google Maps snapshots (3(c), 3(f), 3(i), 3(l)) of four example regions of interest (ROIs). Facades of the building displayed in figure 3(e) are clearly smoother than in the 2.5D version displayed in 3(d)
. For 2.5D models, the facade geometry is defined by roof and ground elevations. Errors in such elevation estimates are propagated over the whole facade. 3D refinement in facade regions is supported by image similarity in oblique views, leading to smoother surfaces without raster artifacts, and without destroying highfrequency crease edges. Thus, the refined models have visibly crisper crease edges. On the building in the first row, there are improvements of the roof geometry (little steps). Such shapes are inherently difficult for 2.5D approaches, where these elements are represented only by very few pixels of the elevation map. Moreover, the facades feature indentations and vertical edges that are, by construction, not representable in a 2.5D heightfield. Similar observations hold for the buildings in the second and third rows: Roof substructures are crisper and more geometric detail is extracted on facades. However, the facades are overly bumpy in places, presumably due to repetitive structures, specular materials and insufficient evidence in the image set due to the uneven distribution of viewing directions. For the challenging case in the fourth row vertical roof structures are only partly reconstructed, apparently due to weak data support. Also the large, dark glass surfaces again impair the facade reconstruction.
depict surfaces of vegetation, bridges, thin roof structures and railway tracks. Compared to tSGM, refined surfaces offer more detail, feature less outliers and appear less noisy. The same holds true for reconstructed vegetation. The street under the bridge (Figure
4(e)) is not reconstructed correctly – while one can see the attempt to "carve out" the empty space under the bridge, the refinement is limited by the faulty topology of the initial surface, which cannot be changed by the algorithm.4.4 Quantitative Evaluation of MultiView Refinement
To test the sensitivity of our method w.r.t. the initialisation, we generate DEMs with three different stateoftheart satellite MVS systems: (a) Gong and Fritsch (2019) using a hierarchical version of SGM Rothermel et al. (2012), (b) the S2P pipeline Facciolo et al. (2017) based on MGM Facciolo et al. (2015), and (c) NASA’s Stereo Pipeline (ASP) Moratto et al. (2010) based on SemiGlobalBlockMatching Bradski (2000). The three DEMs also serve as baselines to compare with. For a fair comparison we use the refined RFM parameters in all MVS systems and in the actual mesh refinement.
The LiDAR ground truth is provided in the form of a 2.5D gridded DEM. Consequently, the refined 3D meshes must be converted back to 2.5D elevation maps. To accomplish this process, we align the mesh to the LiDAR DEM Bosch et al. (2016), cast a vertical ray through the centre of each grid cell, and extract the highest intersection point with the reconstructed mesh.
4.4.1 Evaluation of Refined Surfaces in Benchmark Regions
Table 3 displays error statistics for the MVS results and the corresponding refinement results, for both test sites. The comparisons to ground truth were carried out with the test suite provided by Bosch et al. (2016), where we add additional, robust error metrics: namely, a truncated root mean square error RMSE, computed from only those residuals that are 3m (ca. 10GSD), and the corresponding completeness (percentage of residuals below that threshold). Furthermore, we list the normalised median absolute residuals and the 68% percentile of absolute residuals, denoted by perc68. This value is calculated such, that 68% of the residuals are perc68, corresponding to of a Gaussian error distribution. Note that the framework evaluates only on selected building roofs, see Figure 2, to rule out error sources like seasonal changes, extreme height discontinuities, temporal changes and moving objects. Brown et al. (2018) provide a building mask for the Jacksonville test site. Since for the UCSD site the building mask is not publicly available, we manually created one. As shown in table 3, our mesh refinement significantly improves the 2.5D DEMs generated with both the SGM and ASP methods. There are no significant quantitative differences to S2P (for JAX the refined version is insignificantly better, for UCSD insignificantly worse). We note that the error metrics do not fully reflect the visual quality of the reconstructions. The mesh refinement does bring out additional 3D structure and suppress noise (see Sec. 4.4.2), but it appears that in terms of average 2.5D roof accuracy the S2P method is already close to the achievable limit (normalised median absolute deviation 1.5GSD), so that no further improvement is possible. In general, the accuracy of the DEMs, at least on roof surfaces, is in the submetre range, which we find very encouraging. The majority of the residuals are 1.5GSD. Note that the mesh refinement exhibits little sensitivity to the initialisation, its results are quantitatively very similar in all cases, independent of which stereo method was used to generate the input surface.
tSGM  tSGM ref  S2P  SP2 ref  ASP  ASP ref  

JAX 
Compl. [%]  88.46  88.77  88.32  88.39  84.22  87.14 
RMSE [m]  0.86  0.80  0.79  0.78  1.0  0.83  
NMAD [m]  0.51  0.39  0.42  0.40  0.74  0.46  
perc68 [m]  0.88  0.73  0.72  0.71  1.25  0.80  
UCSD 
Compl. [%]  95.69  96.09  96.78  96.12  93.23  96.44 
RMSE [m]  0.97  0.86  0.82  0.85  1.16  0.86  
NMAD [m]  0.58  0.47  0.45  0.46  0.79  0.47  
perc68 [m]  0.95  0.77  0.74  0.75  1.28  0.77 
4.4.2 Indepth Evaluation of Refined Surfaces on Individual Buildings
To characterise the reconstruction in more detail and gain further insights into its behaviour, we examine four building roofs for each test site in detail. As explained in section 4.4.1 we exclude building edges, since aliasing at large height jumps makes a meaningful evaluation impossible.
tSGM  tSGM ref  S2P  SP2 ref  ASP  ASP ref  

ROI 1 
Compl. [%]  92.69  93.35  93.82  92.83  89.76  90.57 
RMSE [m]  0.80  0.79  0.67  0.77  0.85  0.76  
NMAD [m]  0.31  0.24  0.22  0.23  0.46  0.25  
perc68 [m]  0.70  0.64  0.51  0.56  0.85  0.61  
ROI 2 
Compl. [%]  94.69  94.71  95.06  94.54  89.71  93.49 
RMSE [m]  0.77  0.75  0.69  0.74  1.07  0.75  
NMAD [m]  0.37  0.28  0.27  0.30  0.67  0.30  
perc68 [m]  0.71  0.64  0.60  0.62  1.22  0.65  
ROI 3 
Compl. [%]  79.31  80.54  83.79  80.86  63.26  80.37 
RMSE [m]  1.26  1.18  1.15  1.13  1.56  1.16  
NMAD [m]  1.18  1.00  0.88  0.87  2.15  0.92  
quant68 [m]  1.93  1.75  1.52  1.68  3.32  1.78  
ROI 4 
Compl. [%]  93.68  95.52  94.63  94.86  88.81  92.64 
RMSE [m]  0.89  0.76  0.82  0.74  1.05  0.75  
NMAD [m]  0.46  0.36  0.36  0.35  0.67  0.36  
perc68 [m]  0.84  0.66  0.68  0.63  1.20  0.68 
tSGM  tSGM ref  S2P  SP2 ref  ASP  ASP ref  
ROI 1 
Compl. [%]  99.70  99.20  98.50  99.0  94.40  99.04 
RMSE [m]  0.97  0.87  0.86  0.88  1.21  0.86  
NMAD [m]  0.61  0.50  0.48  0.52  0.90  0.51  
perc68 [m]  0.90  0.74  0.74  0.75  1.34  0.73  
ROI 2 
Compl. [%]  92.46  92.22  94.17  92.33  86.15  91.80 
RMSE [m]  1.12  0.97  0.96  0.97  1.24  0.98  
NMAD [m]  0.68  0.52  0.48  0.52  0.90  0.54  
perc68 [m]  1.22  0.93  0.92  0.92  1.67  0.96  
ROI 3 
Compl. [%]  96.00  98.67  97.66  98.33  85.21  98.31 
RMSE [m]  1.10  1.07  1.09  1.10  1.65  1.12  
NMAD [m]  0.69  0.59  0.67  0.61  1.60  0.69  
perc68 [m]  1.09  1.06  1.10  1.10  2.27  1.17  
ROI 4 
Compl. [%]  99.14  99.44  99.53  99.52  96.44  99.31 
RMSE [m]  0.95  0.95  0.88  0.94  1.35  0.96  
NMAD [m]  0.57  0.46  0.46  0.45  0.70  0.43  
perc68 [m]  0.92  0.85  0.83  0.84  1.48  0.87  

LiDAR  tSGM  tSGM ref  S2P  SP2 ref  ASP  ASP ref.  

ROI1 

ROI2 

ROI3 

ROI4 
LiDAR  tSGM  tSGM ref  S2P  SP2 ref  ASP  ASP ref  

ROI1 

ROI2 

ROI3 

ROI4 
Figures 6 and 7 display exemplary surface reconstructions of the different tested methods for both ROIs. Tables 4 and 5 list the corresponding quantitative results. Again, mesh refinement generally improves the accuracy over SGM and ASP, at both sites. With S2P we get mixed results. Its accuracy is already very high (for several buildings at, or below GSD), so the results after refinement are quantitatively almost the same. Still, visual inspection makes clear that the quantitative metrics do not fully characterise the model quality: The refined surfaces are crisper and more correct on complicated roof structures. The numbers do not reflect this, because after refinement the surface tends to be slightly noisier on flat, frontoparallel areas that play to the strength of the constantheight prior built into most MVS and 2.5D fusion methods. Overall, we observe crisper reconstructions of roof details after refinement. These details are recovered even when starting from the roughest initial surface (generated with NASA’s ASP), where they are completely missing. The meshes after refinement are also visually comparable, independent of the initialisation, indicating favourable convergence properties of the hierarchical optimisation. We note in passing that the 68percentile metric is comparable with the numbers published in Brown et al. (2018).
5 Conclusion
We have presented an algorithm to refine 3D surface meshes by minimising the photometric transfer error between multiple pairs of satellite images, whose sensor poses are specified through the RFM.
In experiments on highresolution data from WorldView3, we have shown that, compared to the input from stateoftheart MVS pipelines, the refinement extracts additional 3D surface details and better reproduces crisp edges and discontinuities. It also decreases the residual errors of the recovered surface in most cases. Being based on image similarity, it is however still challenged by repetitive texture and nonLambertian surfaces. To further diminish artefacts of our method in such areas, one important direction is to endow 3D surface models with more apriori knowledge about the reconstructed scene. In particular, our current implementation includes only a simple thinplate regulariser to favour low curvature. Elementary priors like the preference for piecewise constant heights (extensively used in dense matching and 2.5D model fusion), or a preference for right angles, are still missing and need to be ported to the world of 3D meshbased reconstruction.
A major difference to existing reconstruction pipelines is that our algorithm is able to handle true 3D geometry. We have shown that, when including oblique satellite views, the method produces cleaner facade structures and recovers facade details. This property could be advantageous for downstream building modelling, surface structuring and analysis based on geometric features, and of course visual display from offnadir viewpoints. We believe that with the growing availability of highresolution, oblique satellite images the field could move beyond 2.5D elevation models and the quest for full 3D models will gain momentum. We hope that our work triggers further research in that direction.
References
 Besse (2013) Besse, F.O., 2013. PatchMatch Belief Propagation for Correspondence Field Estimation and its Applications. Ph.D. thesis. University College London (UCL).
 Blaha et al. (2017) Blaha, M., Rothermel, M., Oswald, M.R., Sattler, T., Richard, A., Wegner, J.D., Pollefeys, M., Schindler, K., 2017. Semantically informed multiview surface refinement, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 3819–3827.

Bosch et al. (2016)
Bosch, M., Kurtz, Z.,
Hagstrom, S., Brown, M.,
2016.
A multiple view stereo benchmark for satellite imagery, in: Proceedings of the IEEE Applied Imagery Pattern Recognition Workshop (AIPR), pp. 1–9.
 Bradski (2000) Bradski, G., 2000. The OpenCV Library. Dr. Dobb’s Journal of Software Tools .
 Brown et al. (2018) Brown, M., Goldberg, H., Foster, K., Leichtman, A., Wang, S., Hagstrom, S., Bosch, M., Almes, S., 2018. Largescale public lidar and satellite image data set for urban semantic labeling, in: Proceedings of Laser Radar Technology and Applications XXIII, pp. 154–167.
 d’Angelo (2016) d’Angelo, P., 2016. Improving semiglobal matching: Cost aggregation and confidence measure. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 41, 299–304.
 d’Angelo and Kuschk (2012) d’Angelo, P., Kuschk, G., 2012. Dense multiview stereo from satellite imagery, in: Proceedings of 2012 IEEE International Geoscience and Remote Sensing Symposium, pp. 6944–6947.
 De Franchis et al. (2014) De Franchis, C., MeinhardtLlopis, E., Michel, J., Morel, J.M., Facciolo, G., 2014. An automatic and modular stereo pipeline for pushbroom images. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences 2, 49–56.
 Delaunoy and Prados (2011) Delaunoy, A., Prados, E., 2011. Gradient flows for optimizing triangular meshbased surfaces: Applications to 3d reconstruction problems dealing with visibility. International journal of computer vision 95, 100–123.
 Delaunoy et al. (2008) Delaunoy, A., Prados, E., Piracés, P.G.I., Pons, J.P., Sturm, P., 2008. Minimizing the multiview stereo reprojection error for triangular surface meshes, in: Proceedings of BMVC 2008British Machine Vision Conference, pp. 1–10.
 Dial and Grodecki (2002) Dial, G., Grodecki, J., 2002. Block adjustment with rational polynomial camera models, in: Proceedings of ASPRS 2002 Conference, Washington, DC, pp. 22–26.
 Drory et al. (2014) Drory, A., Haubold, C., Avidan, S., Hamprecht, F.A., 2014. Semiglobal matching: a principled derivation in terms of message passing, in: Proceedings of German Conference on Pattern Recognition, pp. 43–53.
 d’Angelo and Reinartz (2011) d’Angelo, P., Reinartz, P., 2011. Semiglobal matching results on the isprs stereo matching benchmark. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 38, 79–84.
 Facciolo et al. (2015) Facciolo, G., De Franchis, C., Meinhardt, E., 2015. Mgm: A significantly more global matching for stereovision, in: Proceedings of BMVC 2015British Machine Vision Conference, pp. 1–12.
 Facciolo et al. (2017) Facciolo, G., De Franchis, C., MeinhardtLlopis, E., 2017. Automatic 3d reconstruction from multidate satellite images, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 57–66.
 Fraser et al. (2006) Fraser, C., Dial, G., Grodecki, J., 2006. Sensor orientation via RPCs. ISPRS Journal of Photogrammetry and Remote Sensing 60, 182–194.
 Fraser and Yamakawa (2004) Fraser, C., Yamakawa, T., 2004. Insights into the affine model for highresolution satellite sensor orientation. ISPRS Journal of Photogrammetry and Remote Sensing 58, 275–288.
 Fuhrmann and Gösele (2014) Fuhrmann, S., Gösele, M., 2014. Floating scale surface reconstruction. ACM Transactions on Graphics (ToG) 33, 46:1–46:11.
 Furukawa and Ponce (2010) Furukawa, Y., Ponce, J., 2010. Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 1362–1376.
 Fusiello et al. (2000) Fusiello, A., Trucco, E., Verri, A., 2000. A compact algorithm for rectification of stereo pairs. Machine Vision and Applications 12, 16–22.
 Galliani et al. (2016) Galliani, S., Lasinger, K., Schindler, K., 2016. Gipuma: Massively parallel multiview stereo reconstruction. Publikationen der Deutschen Gesellschaft für Photogrammetrie, Fernerkundung und Geoinformation e. V 25, 361–369.
 Gehrig and Franke (2007) Gehrig, S.K., Franke, U., 2007. Improving stereo subpixel accuracy for long range stereo, in: Proceedings of 2007 IEEE 11th International Conference on Computer Vision, pp. 1–7.
 Gong and Fritsch (2019) Gong, K., Fritsch, D., 2019. DSM generation from high resolution multiview stereo satellite imagery. Photogrammetric Engineering & Remote Sensing 85, 379–387.
 Gösele et al. (2007) Gösele, M., Snavely, N., Curless, B., Hoppe, H., Seitz, S.M., 2007. Multiview stereo for community photo collections, in: Proceedings of 2007 IEEE 11th International Conference on Computer Vision, pp. 1–8.
 Hartley and Saxena (1997) Hartley, R.I., Saxena, T., 1997. The cubic rational polynomial camera model, in: Proceedings of the DARPA Image Understanding Workshop, pp. 649–653.
 Helava (1988) Helava, U., 1988. Objectspace leastsquares correlation, in: (ACSM and American Society for Photogrammety and Remote Sensing, Annual Convention, Saint Louis, MO, Mar. 1418, 1988) Photogrammetric Engineering and Remote Sensing,, pp. 711–714.
 Hirschmüller (2008) Hirschmüller, H., 2008. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on pattern analysis and machine intelligence 30, 328–341.
 Jancosek and Pajdla (2011) Jancosek, M., Pajdla, T., 2011. Multiview reconstruction preserving weaklysupported surfaces, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 3121–3128.
 Kazhdan and Hoppe (2013) Kazhdan, M., Hoppe, H., 2013. Screened poisson surface reconstruction. ACM Transactions on Graphics (ToG) 32, 29:1–29:13.
 Kim (2000) Kim, T., 2000. A study on the epipolarity of linear pushbroom images. Photogrammetric Engineering & Remote Sensing 62, 961–966.
 Kim and Dowman (2006) Kim, T., Dowman, I., 2006. Comparison of two physical sensor models for satellite images: position–rotation model and orbit–attitude model. The Photogrammetric Record 21, 110–123.
 Knapitsch et al. (2017) Knapitsch, A., Park, J., Zhou, Q.Y., Koltun, V., 2017. Tanks and temples: Benchmarking largescale scene reconstruction. ACM Transactions on Graphics 36, 78:1–78:13.
 Kobbelt et al. (1998) Kobbelt, L., Campagna, S., Vorsatz, J., Seidel, H.P., 1998. Interactive multiresolution modeling on arbitrary meshes, in: Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, pp. 105–114.
 Kratky (1989) Kratky, V., 1989. Online aspects of stereophotogrammetric processing of spot images. Photogrammetric Engineering and Remote Sensing 55, 311–316.
 Kuschk (2013) Kuschk, G., 2013. Large scale urban reconstruction from remote sensing imagery. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 40, 139–146.
 Kuschk et al. (2017) Kuschk, G., d’Angelo, P., Gaudrie, D., Reinartz, P., Cremers, D., 2017. Spatially regularized fusion of multiresolution digital surface models. IEEE Transactions on Geoscience and Remote Sensing 55, 1477–1488.
 Labatut et al. (2009) Labatut, P., Pons, J.P., Keriven, R., 2009. Robust and efficient surface reconstruction from range data. Computer graphics forum 28, 2275–2290.
 Li et al. (2016) Li, S., Siu, S.Y., Fang, T., Quan, L., 2016. Efficient multiview surface refinement with adaptive resolution control, in: Proceedings of European Conference on Computer Vision, pp. 349–364.
 Li et al. (2015) Li, Z., Wang, K., Zuo, W., Meng, D., Zhang, L., 2015. Detailpreserving and contentaware variational multiview stereo reconstruction. IEEE Transactions on Image Processing 25, 864–877.
 Loop and Zhang (1999) Loop, C., Zhang, Z., 1999. Computing rectifying homographies for stereo vision, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 125–131.
 Moratto et al. (2010) Moratto, Z.M., Broxton, M.J., Beyer, R.A., Lundy, M., Husmann, K., 2010. Ames stereo pipeline, nasa’s open source automated stereogrammetry software, in: Proceedings of Lunar and Planetary Science Conference, p. 2364.
 Oh (2011) Oh, J., 2011. Novel Approach to Epipolar Resampling of HRSI and Satellite Stereo Imagerybased Georeferencing of Aerial Images. Ph.D. thesis. The Ohio State University.
 Ozcanli et al. (2015) Ozcanli, O.C., Dong, Y., Mundy, J.L., Webb, H., Hammoud, R., Tom, V., 2015. A comparison of stereo and multiview 3d reconstruction using crosssensor satellite imagery, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 17–25.
 Patil et al. (2019) Patil, S., Comandur, B., Prakash, T., Kak, A.C., 2019. A new stereo benchmarking dataset for satellite images. Computing Research Repository abs/1907.04404. URL: http://arxiv.org/abs/1907.04404.
 Poli and Toutin (2012) Poli, D., Toutin, T., 2012. Review of developments in geometric modelling for high resolution satellite pushbroom sensors. The Photogrammetric Record 27, 58–73.
 Pollard and Mundy (2007) Pollard, T., Mundy, J.L., 2007. Change detection in a 3d world, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–6.
 Pollard et al. (2010) Pollard, T.B., Eden, I., Mundy, J.L., Cooper, D.B., 2010. A volumetric approach to change detection in satellite images. Photogrammetric Engineering & Remote Sensing 76, 817–831.
 Pollefeys et al. (1999) Pollefeys, M., Koch, R., Van Gool, L., 1999. A simple and efficient rectification method for general motion, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 496–501.
 Qin (2017) Qin, R., 2017. Automated 3d recovery from very high resolution multiview satellite images. Computing Research Repository abs/1905.07475. URL: https://arxiv.org/abs/1905.07475.
 Reinartz et al. (2010) Reinartz, P., d’Angelo, P., Krauß, T., Poli, D., Jacobsen, K., Buyuksalih, G., 2010. Benchmarking and quality analysis of dem generated from high and very high resolution optical stereo satellite data, in: Proceedings of The 2010 Canadian Geomatics Conference and Symposium of Commission I, ISPRS Convergence in Geomatics – Shaping Canada’s Competitive Landscape, pp. 1–6.
 Romanoni et al. (2017) Romanoni, A., Ciccone, M., Visin, F., Matteucci, M., 2017. Multiview stereo with singleview semantic mesh refinement, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 706–715.
 Roth and Mayer (2019) Roth, L., Mayer, H., 2019. Reduction of the frontoparallel bias for widebaseline semiglobal matching. ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences 4, 69–76.
 Rothermel et al. (2012) Rothermel, M., Wenzel, K., Fritsch, D., Haala, N., 2012. SURE: Photogrammetric surface reconstruction from imagery, in: Proceedings of LowCost 3D Workshop Berlin, pp. 1–9.
 Scharstein et al. (2017) Scharstein, D., Taniai, T., Sinha, S.N., 2017. Semiglobal stereo matching with surface orientation priors, in: Proceedings of International Conference on 3D Vision (3DV), pp. 215–224.
 Schönberger et al. (2016) Schönberger, J.L., Zheng, E., Pollefeys, M., Frahm, J.M., 2016. Pixelwise view selection for unstructured multiview stereo, in: Proceedings of European Conference on Computer Vision, pp. 501–518.
 Schöps et al. (2017) Schöps, T., Schönberger, J.L., Galliani, S., Sattler, T., Schindler, K., Pollefeys, M., Geiger, A., 2017. A multiview stereo benchmark with highresolution images and multicamera videos, in: Proceedings of IEEE conference on Computer Vision and Pattern Recognition, pp. 3260–3269.
 Seitz et al. (2006) Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R., 2006. A comparison and evaluation of multiview stereo reconstruction algorithms, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 519–528.
 Shean et al. (2016) Shean, D.E., Alexandrov, O., Moratto, Z.M., Smith, B.E., Joughin, I.R., Porter, C., Morin, P., 2016. An automated, opensource pipeline for mass production of digital elevation models (dems) from veryhighresolution commercial stereo satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 116, 101–117.
 Shimizu and Okutomi (2002) Shimizu, M., Okutomi, M., 2002. Precise subpixel estimation on areabased matching. Systems and Computers in Japan 33, 1–10.
 Solem and Overgaard (2005) Solem, J.E., Overgaard, N.C., 2005. A geometric formulation of gradient descent for variational problems with moving surfaces, in: Proceedings of International Conference on ScaleSpace and PDE Methods in Computer Vision, pp. 419–430.
 Szeliski and Scharstein (2004) Szeliski, R., Scharstein, D., 2004. Sampling the disparity space image. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 419–425.
 Tao and Hu (2001) Tao, C.V., Hu, Y., 2001. A comprehensive study of the rational function model for photogrammetric processing. Photogrammetric engineering and remote sensing 67, 1347–1358.
 Ummenhofer and Brox (2015) Ummenhofer, B., Brox, T., 2015. Global, dense multiscale reconstruction for a billion points, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 1341–1349.
 Vu et al. (2012) Vu, H.H., Labatut, P., Pons, J.P., Keriven, R., 2012. High accuracy and visibilityconsistent dense multiview stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 889–901.
 Wang and Frahm (2017) Wang, K., Frahm, J.M., 2017. Fast and accurate satellite multiview stereo using edgeaware interpolation, in: Proceedings of International Conference on 3D Vision (3DV), pp. 365–373.
 Wang et al. (2016) Wang, K., Stutts, C., Dunn, E., Frahm, J.M., 2016. Efficient joint stereo estimation and land usage classification for multiview satellite data, in: Proceedings of IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9.
 Wang et al. (2010) Wang, M., Hu, F., Li, J., 2010. Epipolar arrangement of satellite imagery by projection trajectory simplification. The Photogrammetric Record 25, 422–436.
 Wang et al. (2011) Wang, M., Hu, F., Li, J., 2011. Epipolar resampling of linear pushbroom satellite imagery by a new epipolarity model. ISPRS Journal of Photogrammetry and Remote Sensing 66, 347–355.
 Wohlfeil et al. (2012) Wohlfeil, J., Hirschmüller, H., Piltz, B., Börner, A., Suppa, M., 2012. Fully automated generation of accurate digital surface models with submeter resolution from satellite imagery. International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences 39, 75–80.
 Wrobel (1987) Wrobel, B., 1987. Facets stereo vision (fast vision)—a new approach to computer stereo vision and to digital photogrammetry, in: ISPRS Intercommission Conf. Fast Processing of Photogrammetric Data, pp. 231–258.
 Zach et al. (2007) Zach, C., Pock, T., Bischof, H., 2007. A globally optimal algorithm for robust tvl 1 range image integration, in: Proceedings of IEEE International Conference on Computer Vision, pp. 1–8.
Comments
There are no comments yet.