Learning Dense Stereo Matching for Digital Surface Models from Satellite Imagery

11/08/2018 · Wayne Treible et al. · Vision Systems Inc. and University of Delaware

Digital Surface Model generation from satellite imagery is a difficult task that has been largely overlooked by the deep learning community. Stereo reconstruction techniques developed for terrestrial systems, including self-driving cars, do not translate well to satellite imagery, where image pairs vary considerably. In this work we present a neural network tailored for Digital Surface Model generation, a ground-truthing and training scheme that maximizes available hardware, and a comparison to existing methods. The resulting models are smooth, preserve boundaries, and enable further processing. This represents one of the first attempts at leveraging deep learning in this domain.


1 Introduction

A Digital Surface Model (DSM) is a 3D representation of the surface of a region of terrain. In this work we focus on modelling the surface of Earth, although DSMs may be generated for other planets or astronomical bodies such as asteroids. Digital Surface Models capture the height of man-made structures and foliage, unlike Digital Terrain Models (DTMs), which represent the height of the bare geologic surface without these structures. DSMs are widely used for resource exploitation, architectural and civil engineering, and a variety of geospatial information tasks [3]. DSMs and DTMs also enable a range of current and developing research topics, including applications in change detection [8, 12, 18, 19], crop measurement [5], weather analysis [6], and scene understanding such as mapping dense historical cities [11].

DSMs and DTMs can be produced by applying stereo reconstruction techniques to sets of satellite image pairs of a scene. For stereo matching, commercial products often use a patch-based matching cost together with variations of semi-global block matching (SGM). A recent study compared two commercial products, OrthoEngine and RPC Stereo Processor (RSP) [1], and found RSP's modified hierarchical SGM method to be superior to OrthoEngine's hierarchical subpixel mean normalized cross-correlation method. For a comparison of multi-view reconstruction using satellite images we refer the reader to [14].

Satellites typically do not capture wide-baseline, synchronized stereo imagery due to a number of physical and practical constraints. As a result, image pairs are captured by different satellites (or by multiple passes of the same satellite) at different times. Additionally, if arbitrary views can be used, more images become available. This drastically complicates matching image points, because illumination, environmental, and even large-scale scene changes can occur. Differences in time of day and season change solar illumination angle and intensity. Cloud cover, snow, fog, and recent rainfall can all affect image quality. Vegetation also undergoes seasonal change, and over the course of multiple years structures may be constructed, painted, or destroyed, changing the overall scene geometry and disrupting matching.

(a) October 18 2014
(b) December 21 2015
Figure 1: Two satellite images of the same location showing a difference in viewing angle and scene lighting.

While terrestrial stereo systems use a pinhole camera model and a defined world coordinate system, typically consisting of around 16 parameters, satellite image projection is modeled using the Rational Polynomial Camera (RPC) model. An RPC model consists of 80 polynomial coefficients and 10 scale and offset coefficients defining the projection of points at a specific latitude, longitude, and elevation to image coordinates. Commercially available satellite images ship with RPC parameters; however, better stereo reconstruction results can be achieved by further geocorrection [15].
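To make the RPC parameterization concrete, the sketch below evaluates the standard ratio-of-cubic-polynomials projection: four 20-coefficient polynomials (80 coefficients total) plus 10 scale and offset values. The dictionary keys and the monomial ordering are illustrative assumptions modeled on the common RPC00B convention, not the paper's code.

```python
import numpy as np

def cubic_poly(coeffs, L, P, H):
    """Evaluate a 20-term RPC cubic polynomial.

    The monomial ordering below follows the common RPC00B convention;
    verify against your provider's metadata before relying on it.
    """
    terms = np.array([
        1, L, P, H, L*P, L*H, P*H, L*L, P*P, H*H,
        P*L*H, L**3, L*P*P, L*H*H, L*L*P, P**3, P*H*H,
        L*L*H, P*P*H, H**3,
    ])
    return float(np.dot(coeffs, terms))

def rpc_project(rpc, lon, lat, h):
    """Project geodetic (lon, lat, h) to image (col, row) with an RPC model.

    `rpc` is assumed to hold the 10 scale/offset values and the four
    20-coefficient polynomials as NumPy arrays (hypothetical key names).
    """
    # Normalize ground coordinates with the ground offsets/scales.
    L = (lon - rpc["lon_off"]) / rpc["lon_scale"]
    P = (lat - rpc["lat_off"]) / rpc["lat_scale"]
    H = (h - rpc["height_off"]) / rpc["height_scale"]

    # Ratios of cubic polynomials give normalized image coordinates.
    r = cubic_poly(rpc["line_num"], L, P, H) / cubic_poly(rpc["line_den"], L, P, H)
    c = cubic_poly(rpc["samp_num"], L, P, H) / cubic_poly(rpc["samp_den"], L, P, H)

    # De-normalize with the image offsets/scales.
    row = r * rpc["line_scale"] + rpc["line_off"]
    col = c * rpc["samp_scale"] + rpc["samp_off"]
    return col, row
```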

Recent works utilize neural networks to regress disparity from rectified stereo images. These methods formulate disparity calculation as a data-driven, end-to-end differentiable problem. Most of these methods comprise four main stages: unary feature extraction, cost volume aggregation, regularization and smoothing, and finally the disparity regression itself. On the KITTI stereo benchmark, a street-view dataset supporting autonomous driving, methods of this form occupy many of the top spots in all of the metrics.

Traditionally, block-matching cost metrics such as SAD, census, and gradients are used to obtain correspondences for disparity calculation; for an in-depth summary of such methods, see [2]. Currently, the state of the art for stereo image matching relies on deep convolutional neural networks. End-to-end deep learning through training on large color image datasets was performed by Zagoruyko et al. [21]. Dense stereo matching with deep neural networks was also performed by Luo et al. [10]. At the time of writing, PSMNet was ranked second on the KITTI 2015 leaderboard by leveraging end-to-end learning and spatial pyramid pooling to increase the network's receptive field [4].

In this work, we utilize disparity-generating neural networks to solve the challenging problem of constructing DSMs from satellite imagery. Our contributions are as follows:

  1. Adapting Deep Neural Networks for disparity calculation on a new image domain (satellite imagery)

  2. A method for generating ground truth disparity for satellite image pairs

  3. A training scheme for combining satellite imagery and traditional stereo datasets

2 Acknowledgement

Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior / Interior Business Center (DOI/IBC) contract number D17PC00288. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/IBC, or the U.S. Government. WorldView satellite images provided courtesy DigitalGlobe.

Figure 2: A pictorial representation of the network architecture. In one configuration, ResNet-like branches share weights, and are each aggregated into a larger context with a Spatial Pyramid Pooling stage.
Layer | Name | Description (conv shape, num filters, stride) | Output Dimensions (H x W x C)
1 | Input | - | H x W x 1C
2 | FeX_InitialA | 3x3, 32, stride=2 | H x W x 32C
3 | FeX_InitialB | 3x3, 32 | H x W x 32C
4 | FeX_InitialC | 3x3, 32 | H x W x 32C
5 | FeX_BlockStack0 | [3x3, 32] x 6 | H x W x 32C
6 | FeX_BlockStack1 | [3x3, 64] x 32 | H x W x 64C
7 | FeX_BlockStack2 | [3x3, 128] x 6 | H x W x 128C
8 | FeX_BlockStack3 | [3x3, 128] x 6, dilation=2 | H x W x 128C
9 | FeX_SPP0 | 64x64 average pooling; 3x3, 32; linear/cubic upsampling | H x W x 32C
10 | FeX_SPP1 | 32x32 average pooling; 3x3, 32; linear/cubic upsampling | H x W x 32C
11 | FeX_SPP2 | 32x32 average pooling; 3x3, 32; linear/cubic upsampling | H x W x 32C
12 | FeX_SPP3 | 32x32 average pooling; 3x3, 32; linear/cubic upsampling | H x W x 32C
13 | FeX_Concat | [FeX_BlockStack1, FeX_BlockStack3, FeX_SPP0, FeX_SPP1, FeX_SPP2, FeX_SPP3] | H x W x 32C
14 | FeX_LastConvA | 3x3, 128 | H x W x 32C
15 | FeX_LastConvB | 1x1, 32 | H x W x 32C
16 | CostVolume | Concatenate disparity levels | D x H x W x 32C
17 | PreHourglassBlock | [3x3, 32] x 4 | D x H x W x 32C
18 | Hourglass0_ConvA | 3x3, 64, stride=2 | D x H x W x 64C
19 | Hourglass0_ConvB | 3x3, 64 | D x H x W x 32C
20 | Hourglass0_ConvC | 3x3, 64, stride=2 | D x H x W x 64C
21 | Hourglass0_ConvD | 3x3, 64 | D x H x W x 64C
22 | Hourglass0_DeconvE | 3x3, 64, stride=2 | D x H x W x 64C
23 | Hourglass0_DeconvF | 3x3, 32, stride=2 | D x H x W x 32C
24 | Hourglass1 | (same structure as Hourglass0) | D x H x W x 32C
30 | Hourglass2 | (same structure as Hourglass0) | D x H x W x 32C
36 | DispReg | Linear/cubic upsampling; disparity regression | D x H x W -> H x W
Table 1: Network description. This architecture builds upon PSMNet for the satellite image domain, using dilated convolution at the end of the initial feature extraction stage and four levels of SPP.

3 Network Architecture

In this work, we present a deep neural network architecture for disparity regression from pairs of rectified images, adapted to the domain of satellite imagery. It builds upon the Pyramid Stereo Matching Network (PSMNet) architecture [4], a top-performing method on the KITTI benchmark at the time of writing. A full description of the base architecture is given in Table 1.

3.1 Feature Extraction Modules (FeX)

During development we experimented with multiple feature extraction (FeX) modules: a ResNet + Spatial Pyramid Pooling (SPP) module, a ResNeXt + SPP module, and a ResNet + Crosshair Pooling module.

The first feature extraction module uses ResNet-like branches with shared weights, coupled with a Spatial Pyramid Pooling (SPP) stage to aggregate large regions of context. Feature extraction begins with cascaded 3x3 convolutional layers that downsample the input image by a factor of two. This is followed by basic residual blocks (as in ResNet) and dilated convolution for extracting unary features across a very large receptive field. Satellite imagery usually contains a very diverse set of man-made structures and foliage, which presents an interesting challenge in how to aggregate context information for the feature maps. In the first module, feature context is combined with Spatial Pyramid Pooling.

In SPP, average pooling over multiple scales captures a spatially hierarchical context of the feature maps. Feature channels are compressed using 1x1 convolutions, and each pooled feature map is then upsampled to the input feature-map dimensions with a user-defined choice of bilinear or cubic interpolation. The feature maps from each of the pyramid levels are then concatenated as the final output of this feature extraction module.
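As an illustration of the SPP aggregation described above, here is a minimal PyTorch-style sketch. It is not the authors' implementation; the pooling sizes default to the 64/32/16/8 levels used by PSMNet, and the branch and channel counts are assumptions loosely based on Table 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPPBranch(nn.Module):
    """One pyramid level: average pool, 1x1 channel compression, upsample."""
    def __init__(self, in_ch, out_ch, pool_size, mode="bilinear"):
        super().__init__()
        self.pool_size = pool_size
        self.mode = mode
        self.compress = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        h, w = x.shape[2:]
        y = F.avg_pool2d(x, kernel_size=self.pool_size, stride=self.pool_size)
        y = self.compress(y)
        # Upsample back to the input feature-map size (bilinear or bicubic).
        return F.interpolate(y, size=(h, w), mode=self.mode, align_corners=False)

class SPPModule(nn.Module):
    """Concatenates skip features with four pooled-context branches."""
    def __init__(self, in_ch=128, branch_ch=32, pool_sizes=(64, 32, 16, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            [SPPBranch(in_ch, branch_ch, s) for s in pool_sizes]
        )

    def forward(self, skip_feats, x):
        # skip_feats: list of earlier feature maps (e.g. BlockStack1/BlockStack3).
        outs = list(skip_feats) + [b(x) for b in self.branches]
        return torch.cat(outs, dim=1)
```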

The second FeX module operates similarly to the first; however, it uses the improved residual blocks dubbed ResNeXt [20]. Unary features are obtained from these residual blocks and then aggregated using SPP in the same fashion as in the first FeX module.

The final FeX module uses the same unary feature extraction as the first FeX module, but introduces a new type of context pooling to capture different spatial context. We call this pyramid-style pooling "Crosshair" pooling, due to its vertical and horizontal pooling bands. The motivation for this type of pooling is the epipolar geometry constraint of stereo matching: horizontal bands are pooled to capture vertical feature relationships perpendicular to the epipolar lines, and vertical bands are pooled to capture horizontal relationships along the epipolar lines.
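The paper describes crosshair pooling only briefly, so the following sketch is one plausible reading rather than the authors' implementation: anisotropic average pooling over horizontal (1 x k) and vertical (k x 1) bands, each compressed with a 1x1 convolution and upsampled back to the input size. The band sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrosshairPooling(nn.Module):
    """Hypothetical sketch of 'crosshair' band pooling over feature maps."""
    def __init__(self, in_ch, out_ch, band_sizes=(8, 16, 32)):
        super().__init__()
        self.band_sizes = band_sizes
        self.compress = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
             for _ in range(2 * len(band_sizes))]
        )

    def forward(self, x):
        h, w = x.shape[2:]
        outs, idx = [], 0
        for k in self.band_sizes:
            for kernel in ((1, k), (k, 1)):   # horizontal band, vertical band
                y = F.avg_pool2d(x, kernel_size=kernel, stride=kernel)
                y = self.compress[idx](y)
                y = F.interpolate(y, size=(h, w), mode="bilinear",
                                  align_corners=False)
                outs.append(y)
                idx += 1
        # Concatenate the original features with all band-pooled contexts.
        return torch.cat([x] + outs, dim=1)
```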

Of these feature extraction modules, the best qualitative performance came from the standard ResNet + SPP module, so experimental results are presented using this feature extraction module.

3.2 Cost Volume Aggregation

Each output of a FeX module is a 3-dimensional volume of feature maps that encode context information from Spatial Pyramid Pooling. These two volumes are aggregated into a 4-dimensional cost volume: at each disparity level, the right feature map is shifted and concatenated with the left (reference) feature map. However, since the inputs have been downsampled, this volume is 1/4 the size in each spatial dimension and the newly added disparity dimension is 1/4 of the maximum disparity. The final 4D cost volume therefore has shape 1/4 Height x 1/4 Width x 1/4 Max Disparity x Number of Filters.
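A minimal sketch of this shift-and-concatenate construction, following the PSMNet convention (the channel count doubles because left and right features are concatenated at every disparity level); variable names are our own.

```python
import torch

def build_cost_volume(left_feat, right_feat, max_disp):
    """Shift-and-concatenate cost volume (PSMNet-style).

    left_feat, right_feat: (B, C, H/4, W/4) unary features from the shared
    FeX branches. max_disp is already divided by 4 to match the downsampled
    feature maps. Returns a (B, 2C, max_disp, H/4, W/4) volume.
    """
    b, c, h, w = left_feat.shape
    volume = left_feat.new_zeros(b, 2 * c, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, :c, d] = left_feat
            volume[:, c:, d] = right_feat
        else:
            # Pair each left pixel x with the right pixel x - d.
            volume[:, :c, d, :, d:] = left_feat[:, :, :, d:]
            volume[:, c:, d, :, d:] = right_feat[:, :, :, :-d]
    return volume
```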

3.3 Smoothing and Regularization

The smoothing and regularization stage is performed by multiple stacked 3D hourglass CNNs. The encoder-decoder nature of an hourglass CNN captures additional spatial context information at varying scales. The use of 3D convolutions serves to aggregate feature information not only along the spatial dimensions, but also along the newly created disparity dimension of the cost volume.

In our architecture, three 3D CNNs are stacked in a cascade, with each outputting a disparity map (via the disparity regression module) for intermediate supervision. This provides a type of iterative refinement to the smoothing and regularization aspects of this module.
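A compact sketch of one such 3D hourglass follows. The layer sizes loosely mirror rows 18-23 of Table 1; the skip and residual connections follow PSMNet's stacked-hourglass design and are an assumption, since the table does not spell them out.

```python
import torch.nn as nn

def conv3d_bn(in_ch, out_ch, stride=1):
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1,
                  bias=False),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
    )

class Hourglass3D(nn.Module):
    """Minimal 3D encoder/decoder over the (D, H, W) cost volume.

    Assumes D, H, and W are divisible by 4 so the transposed convolutions
    return exactly to the input resolution.
    """
    def __init__(self, ch=32):
        super().__init__()
        self.conv_a = conv3d_bn(ch, 2 * ch, stride=2)
        self.conv_b = conv3d_bn(2 * ch, 2 * ch)
        self.conv_c = conv3d_bn(2 * ch, 2 * ch, stride=2)
        self.conv_d = conv3d_bn(2 * ch, 2 * ch)
        self.deconv_e = nn.ConvTranspose3d(2 * ch, 2 * ch, kernel_size=3,
                                           stride=2, padding=1,
                                           output_padding=1, bias=False)
        self.deconv_f = nn.ConvTranspose3d(2 * ch, ch, kernel_size=3,
                                           stride=2, padding=1,
                                           output_padding=1, bias=False)

    def forward(self, x):
        skip = self.conv_b(self.conv_a(x))        # 1/2 resolution
        bottom = self.conv_d(self.conv_c(skip))   # 1/4 resolution
        up = self.deconv_e(bottom) + skip         # skip connection
        return self.deconv_f(up) + x              # back to input size
```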

During training, all three hourglass outputs are fed to the disparity regression and used in the loss calculation, each weighted by the empirically derived factors from PSMNet [4]. During testing and prediction, only the final disparity map is used, as it is the most refined output.

3.4 Disparity Regression

To regress disparity, we calculate the disparity value of each individual pixel by taking the softmax of the (negated) cost along the disparity dimension at that pixel, and then summing each candidate disparity weighted by this softmaxed cost along the disparity axis of the 4D volume. This operation is typically called a soft-attention mechanism, or soft argmin. The disparity regression is formalized as

\hat{d} = \sum_{d=0}^{D_{\max}} d \cdot \mathrm{softmax}(-c_d)

That is, each pixel in the disparity map is the sum of each disparity level multiplied by its softmaxed cost at the corresponding pixel, taken across all disparity levels. There is strong evidence that this disparity regression is more robust than other methods for stereo networks [9].
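A minimal sketch of this soft-argmin regression over a (B, D, H, W) cost volume, assuming lower cost means a better match:

```python
import torch
import torch.nn.functional as F

def disparity_regression(cost, max_disp):
    """Soft-argmin disparity regression over a (B, D, H, W) cost volume.

    Costs are negated so that low matching cost maps to high probability,
    then each disparity level is weighted by its softmax probability.
    """
    prob = F.softmax(-cost, dim=1)                            # (B, D, H, W)
    disp_levels = torch.arange(max_disp, dtype=cost.dtype,
                               device=cost.device).view(1, max_disp, 1, 1)
    return torch.sum(prob * disp_levels, dim=1)               # (B, H, W)
```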

Unlike many other implementations, we do not completely squeeze the original 3D volume during the regression stage; we only fully regress a 2D map for loss calculation and prediction output. This allows us to use the original disparity costs in the cost volume for downstream calculations and to better resolve ambiguities across multiple disparity maps that cover overlapping regions of interest.

3.5 Network Modularity

To facilitate matching in urban areas with tall buildings, we trained the network with a large disparity range (up to a maximum disparity of 352). In practical terms this means splitting the network into multiple pieces to enable training across multiple GPUs. Computing a 1 meter resolution DSM with the appropriate disparity range required approximately 20 GB of VRAM, and the network was trained using two Nvidia Titan X GPUs on a workstation. Training at the full resolution of the input imagery (i.e., no initial downsampling convolution and no half-resolution cost volume aggregation) would require approximately 60 GB of VRAM.

4 Dataset Overview

The panchromatic grayscale images taken by the WorldView-3 satellite have a higher spatial resolution than the RGB imagery, so there is a small trade-off between 3-channel color information and better spatial resolution. We elect to use the panchromatic grayscale imagery because high-resolution DSMs are the end goal and previous work with SGM-based methods has shown grayscale imagery to be sufficient. Therefore, all image data is first converted to grayscale so that 1-channel input image features can be learned. Our model is trained using a combination of existing terrestrial and synthetic stereo datasets as well as a custom satellite dataset.

4.1 Existing Stereo Datasets

In many works on stereo matching, training is done with a combination of the four most popular stereo datasets: Sceneflow [13], KITTI [7], ETH3D [17], and Middlebury [16]. Part of the training routine for our model uses both the Sceneflow and KITTI datasets.

The Sceneflow dataset is a collection of stereo pairs of synthetic scenes with dense, high-accuracy disparity maps. It also has the benefit of being one of the only datasets containing top-down views of a scene, which are the most similar to the perspective of satellite imagery. The dataset contains over 35,000 images with disparity values upwards of 500 pixels, which is useful for training the network to cover a large range of disparities.

The KITTI datasets (2012 and 2015) are popular stereo datasets containing almost 400 images of street-view scenes with sparse ground-truth depth obtained from LiDAR; the disparity maps in KITTI are therefore also sparse across the image. KITTI data is useful because our satellite dataset also uses sparsely projected LiDAR points for ground truth and presents similar difficulties.

4.2 Satellite Stereo Dataset

Figure 3: Sample stereo images and corresponding disparity map.

Satellite imagery is a difficult domain for a variety of reasons. Images of the same scene are often captured at vastly different times, so image pairs often vary greatly in lighting and shadow. Satellite images are also modeled using the RPC camera model and often need to be approximated by an affine camera to perform tasks such as rectification. Because of this unique domain and the associated cost, there are no sufficient training datasets of satellite image pairs with ground truth for training our network. Therefore, we created our own satellite stereo dataset for training our models.

We used a large set of DigitalGlobe WorldView-3 satellite imagery with an approximate ground resolution of 31 cm, capturing a region of downtown Jacksonville, Florida. The imagery spans a temporal range of approximately one year, with captures in multiple seasons and at different times of day. For ground truth, an aircraft-mounted LiDAR was used to obtain high-accuracy depth for a subset of the imaged area at approximately 1 meter resolution, sampled at 50 cm.

Calculating ground truth disparity involved multiple camera model conversions, coordinate system conversions, transformations, and projections. Many of these operations leverage the suite of tools in VXL (the Vision-something-Libraries), a collection of C++ libraries and corresponding Python bindings for computer vision research and applications. A short summary of the ground truth data generation is given below.

To generate the rectified grayscale imagery, affine cameras are approximated for each of the RPC cameras. These approximated affine cameras are generated with an iterative optimization process and allow us to perform downstream tasks such as rectification. The "scene," or region of interest, is cropped out of the larger satellite imagery. From there, random points on an approximated ground plane are projected into all pairs of cameras to find image correspondences. A homography is then calculated for each image in a pair, and the images are warped with bilinear interpolation to produce sets of rectified images.

For ground truth, the LiDAR GeoTIFF was converted from UTM (Universal Transverse Mercator) coordinates to WGS84 using the GDAL library. WGS84 (World Geodetic System 1984) is a latitude, longitude, altitude format that is updated over time in response to geologic events and serves as a standard coordinate system for the Earth. These coordinates are then projected into each of the approximately 20 RPC camera models (more specifically, the affine approximations of those camera models) corresponding to passes of the satellite over the scene. The generated pairs of homographies are then used to transform the LiDAR images into rectified image space. Each individual LiDAR point serves as the correspondence between the two rectified LiDAR images, and disparity is calculated by taking the difference along the epipolar line.

Due to the sparseness of the LiDAR data compared to the satellite imagery, the ground truth disparity maps are also sparse and lack strong edges, which presented some issues during training. The viewing angle of the satellite images also caused a "bleed-through" effect, with occluded ground points projecting between points on the roof of a building. To limit this impact, points with higher relative altitude were dilated, and points with an altitude less than the dilated mask (within a one meter tolerance) were removed. This further increased the jaggedness of the warped points, so bilinear interpolation was used to generate smoother, denser disparity maps at the cost of slight error in the ground truth. Both the sparse disparity and the interpolated disparity were saved for the different training procedures described in Section 5.1.
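A minimal sketch of the bleed-through filter described above, using a grayscale dilation as the local "roof" mask. The dilation window size is an assumption; the one meter tolerance follows the text.

```python
import numpy as np
from scipy.ndimage import grey_dilation

def filter_bleed_through(disparity, altitude, dilation_px=5, tol_m=1.0):
    """Suppress occluded ground points that project between rooftop points.

    altitude and disparity are sparse rasters in rectified image space with
    NaN where no LiDAR point landed (an assumed convention).
    """
    filled = np.where(np.isnan(altitude), -np.inf, altitude)
    # Grow the higher surfaces so they cast a local "roof" mask.
    roof = grey_dilation(filled, size=(dilation_px, dilation_px))
    # Drop points sitting more than tol_m below the dilated roof.
    occluded = filled < (roof - tol_m)
    out = disparity.copy()
    out[occluded] = np.nan
    return out
```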

We used these rectified pairs and associated ground truth to create training images of size 256H x 512W by jointly sliding a window across all three images with a step size of 64 pixels in both the x and y directions. Image pairs containing less than 40% density in their ground truth images were not selected; this most commonly occurred on sub-images containing a large portion of the side of a tall building (the top-down LiDAR ground truth contains no building sides) or along the edges of the ground truth region.
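The tiling step can be sketched as follows; the 40% density threshold, 256 x 512 crop size, and 64-pixel step follow the text, while the NaN convention for missing ground truth is an assumption.

```python
import numpy as np

def tile_training_crops(left, right, disp, tile_h=256, tile_w=512, step=64,
                        min_density=0.40):
    """Slide a window jointly over left/right images and the disparity map.

    Crops whose ground-truth disparity is less than 40% valid (non-NaN)
    are discarded, as described in Section 4.2.
    """
    crops = []
    h, w = disp.shape
    for y in range(0, h - tile_h + 1, step):
        for x in range(0, w - tile_w + 1, step):
            d = disp[y:y + tile_h, x:x + tile_w]
            density = np.count_nonzero(~np.isnan(d)) / d.size
            if density >= min_density:
                crops.append((left[y:y + tile_h, x:x + tile_w],
                              right[y:y + tile_h, x:x + tile_w],
                              d))
    return crops
```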

An additional challenge for assembling training data was the uncertainty in relative satellite positioning. In traditional stereo camera setups, the relative position of the reference and target images is known (i.e., a "left camera" and a "right camera"). To ensure the canonical setup, only images with "positive" disparity – where a disparity value of d represents the shift of a matching pixel of the form x_target = x_reference - d – were selected. Due to the direction and angle at which satellite images are captured, there were some instances where small "negative" disparities existed in the ground truth. If this amount was small (<10 px), we simply shifted the target image by that amount and added that amount to all disparity values. If it was larger, we either flipped the images and checked again, or discarded that pair.

After all filtering, the final satellite stereo dataset used for training and testing contained over 30k stereo image pairs with corresponding ground truth disparity maps.

(a) SGM Model
(b) Neural Network Model
Figure 4: A shaded point cloud representation of the DSM created using the SGM process and Neural Network implementation.

5 Experiments

5.1 Training Procedures

Multiple training procedures were tested with different combinations of datasets, briefly summarized below. All training procedures use normalized (zero-mean, unit-variance) grayscale imagery with a vertical resolution of 256 pixels and a horizontal resolution of 512 pixels (256H x 512W x 1C). The Adam optimizer was used with the Keras parameter defaults of β1 = 0.9 and β2 = 0.999. The network was trained with a maximum disparity of 352 to account for the large disparity ranges caused by the mix of very tall buildings and smaller ground structures present in satellite imagery. All images were shuffled before training, and in each instance the networks were trained from scratch with the described procedures.

In two of the training procedures, we adopted the general training strategy used by most of the top networks: we first train the network on Sceneflow for 20 epochs at a learning rate of 0.001, followed by fine-tuning on a mix of KITTI and satellite imagery.

In the first procedure, fine-tuning was performed on KITTI at a learning rate of 0.001 for 100 epochs, followed by 10 epochs of training on satellite imagery at a learning rate of 0.001. In this iteration, only a small set of the satellite imagery was used (approx. 15k image pairs). We incorporated KITTI into our training procedure because it contained higher-resolution ground truth than our satellite imagery. The set of satellite imagery contained multiple elevated roadways, a mix of small to moderately tall buildings, and some parks and vegetation. We call this training regimen "traditional transfer," as this approach is commonly used when adapting a network to a new domain.

In the second procedure we used a similar transfer learning style, but removed the KITTI imagery in favor of more satellite images containing denser urban regions with taller buildings. We tried this method because the disparity generated by the traditional transfer method appeared to be biased toward smaller disparity predictions when tall structures were present in the testing data. After the initial 20 epochs of Sceneflow training, 20 additional epochs at a learning rate of 0.001 were performed with both sets of satellite images shuffled together. We call this procedure "full transfer."

Due to the sparse ground truth of the satellite images, many disparity maps fine-tuned on satellite imagery had very soft edges and rounded corners at the edges of buildings. We were also concerned about over-fitting to urban scenes, because our training data came largely from urban areas. To maintain the desired sharp edges in the final disparity map and increase network robustness, training was performed with a mix of the Sceneflow dataset (to learn hard boundaries between disparity levels) and the two satellite image sets (to learn the domain-specific features and matching costs). In this procedure, 10k images were chosen from each set, for a total of 30k images. The images were shuffled together and trained at a constant learning rate of 0.001 for 20 epochs. We call this the "mixed" procedure.

5.2 Evaluation of Training Procedures

Evaluation of the training procedures was performed by calculating accuracy on a small test set of satellite imagery compiled from a section of a scene containing Wright-Patterson Air Force Base near Dayton, OH. The evaluation set contained approximately 10,000 image pairs generated in the same fashion as described in Section 4.2. All three training procedures are evaluated using the same smooth L1 metric used in training; results are presented in Table 2.

We find that the "full" procedure generalizes best to new regions of satellite imagery. This is not particularly surprising, as more unique satellite imagery in the training set should regularize the network better.

Procedure | Regression Loss 1 | Regression Loss 2 | Regression Loss 3 | Weighted Loss
Traditional | 10.814 | 10.706 | 10.662 | 23.563
Full | 10.139 | 9.940 | 9.916 | 21.943
Mixed | 12.120 | 11.692 | 11.208 | 25.452
Table 2: Evaluation of training procedures. The smooth L1 metrics for each of the three disparity regressions, as well as the final weighted value, are reported on the test set. "Full" transfer is the best-generalizing training procedure for a new region of satellite imagery.
Figure 5: Reconstructed DSM using SGM (Left) and neural network implementation (Right) displayed using QT Modeler.

5.3 Digital Surface Model (DSM) Comparison

DSMs were also constructed by merging pairwise disparity maps. Each disparity map is triangulated and reprojected to form a pairwise DSM. These pairwise DSMs are then combined using a fusion process that finds the most likely elevation for each point. This process was carried out for pairwise DSMs generated using both the network-generated disparity and an SGM process. Results are presented in Figures 4 and 5.
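The fusion step is described only as selecting the most likely elevation per point; the sketch below uses a per-cell median as a simple stand-in for that selection and is not the authors' fusion algorithm.

```python
import numpy as np

def fuse_pairwise_dsms(dsm_stack):
    """Fuse co-registered pairwise DSMs into one surface.

    dsm_stack: (N, H, W) array with NaN in cells a pairwise DSM does not
    cover. A per-cell median over the valid observations serves as a
    simple, robust proxy for the "most likely" elevation; cells with no
    observations remain NaN.
    """
    return np.nanmedian(dsm_stack, axis=0)
```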

Note the sharp edges in the network-generated DSM, which preserves straight edges on buildings. Figure 5 highlights another advantage of this approach: its performance in shaded regions. The SGM approach often leaves large amorphous regions, particularly on the shadowed side of buildings (upper right in Figure 5). While building geometry and edges are preserved, the network does suffer from a lack of detail in some places; for example, the ventilation systems and piping on the roofs of these structures are over-smoothed, potentially due to the downsampling imposed by the network's memory constraints.

6 Conclusion

In this work we have utilized a neural network for disparity regression from satellite images. Matching pairs of satellite images is a difficult task because of significant image variation. The architecture of this network has been developed for Digital Surface Model generation and trained using a scheme that enables dense matching and domain-specific features. The resulting DSMs are smooth, have sharp edges, and preserve structural detail.

The network was trained using a combination of existing, widely used stereo datasets and a custom dataset of approximately 30k satellite image pairs. We generated ground truth disparity maps for each image pair by projecting LiDAR point correspondences. Comparing the trained network to an existing robust Semi-Global Block Matching implementation on an area never seen during training shows that the resulting DSMs are detailed, smooth, and maintain sharp boundaries.

References

  • [1] M. A. Aguilar, A. Nemmaoui, F. J. Aguilar, and R. Qin. Quality assessment of digital surface models extracted from worldview-2 and worldview-3 stereo pairs over different land covers. GIScience & Remote Sensing, pages 1–21, 2018.
  • [2] A. Awad and M. Hassaballah. Image Feature Detectors and Descriptors: Foundations and Applications, volume 630. 2016.
  • [3] A. Balasubramanian. Digital elevation model (dem) in gis, 09 2017.
  • [4] J.-R. Chang and Y.-S. Chen. Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5410–5418, 2018.
  • [5] N. Demir, N. K. Sönmez, T. Akar, and S. Ünal. Automated measurement of plant height of wheat genotypes using a dsm derived from uav imagery. Proceedings, 2(7), 2018.
  • [6] M. B. Ek, K. E. Mitchell, Y. Lin, E. Rogers, P. Grunmann, V. Koren, G. Gayno, and J. D. Tarpley. Implementation of noah land surface model advances in the national centers for environmental prediction operational mesoscale eta model. Journal of Geophysical Research: Atmospheres, 108(D22).
  • [7] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [8] A. D. Gilliam, T. B. Pollard, A. Neff, Y. Dong, S. Sorensen, R. Wagner, S. Chew, T. V. Rovito, and J. L. Mundy. Sattel: A framework for commercial satellite imagery exploitation. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), volume 00, pages 278–286, Mar 2018.
  • [9] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry. End-to-end learning of geometry and context for deep stereo regression. CoRR, abs/1703.04309, 2017.
  • [10] W. Luo, A. G. Schwing, and R. Urtasun. Efficient deep learning for stereo matching. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5695–5703, June 2016.
  • [11] E. Maltezos, E. Protopapadakis, N. Doulamis, A. Doulamis, and C. Ioannidis. Understanding historical cityscapes from aerial imagery through machine learning. In M. Ioannides, E. Fink, R. Brumana, P. Patias, A. Doulamis, J. Martins, and M. Wallace, editors, Digital Heritage. Progress in Cultural Heritage: Documentation, Preservation, and Protection, pages 200–211, Cham, 2018. Springer International Publishing.
  • [12] Y. Maruyama, A. Tashiro, and F. Yamazaki. Detection of collapsed buildings due to earthquakes using a digital surface model constructed from aerial images. Journal of Earthquake and Tsunami, 08(01):1450003, 2014.
  • [13] N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2016. arXiv:1512.02134.
  • [14] O. C. Ozcanli, Y. Dong, J. L. Mundy, H. Webb, R. Hammoud, and V. Tom. A comparison of stereo and multiview 3-d reconstruction using cross-sensor satellite imagery. In 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 17–25, June 2015.
  • [15] O. C. Ozcanli, Y. Dong, J. L. Mundy, H. Webb, R. Hammoud, and V. Tom. Automatic geolocation correction of satellite imagery. International Journal of Computer Vision, 116(3):263–277, Feb 2016.
  • [16] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nesic, X. Wang, and P. Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In X. Jiang, J. Hornegger, and R. Koch, editors, GCPR, volume 8753 of Lecture Notes in Computer Science, pages 31–42. Springer, 2014.
  • [17] T. Schöps, J. L. Schönberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [18] J. Tao and S. Auer. Simulation-based building change detection from multiangle sar images and digital surface models. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 9(8):3777–3791, Aug 2016.
  • [19] J. Tian, S. Cui, and P. Reinartz. Building change detection based on satellite stereo imagery and digital surface models. IEEE Transactions on Geoscience and Remote Sensing, 52(1):406–417, Jan 2014.
  • [20] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431, 2016.
  • [21] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.