Estimating 3D geometry is an essential part of many robotic tasks such as navigation, recognition, manipulation, and planning. Many sensor systems, including LIDAR, structured-light 3D scanners, and stereo cameras, have been developed and used to this end. Among these, camera-based stereo systems have many benefits: they operate in a passive mode (not emitting any active signal), and they are compact, lightweight, and mechanically robust. Moreover, thanks to recent advances in GPU processors and deep learning algorithms, camera-based methods have become feasible and promising.
A conventional stereo setup uses two cameras looking in the same direction, separated by a horizontal baseline, to estimate a disparity map. However, in real-world environments where obstacles surround the robot, it is often necessary to estimate an omnidirectional depth map. Popular methods [shimamura2000construction, wang2012stereo] use multiple stereo cameras, estimate disparity maps from rectified image pairs, and then merge them into one panoramic image. The reconstruction results are mostly favorable, but the size and cost of the system can be problematic in some cases. Algorithmically, the distortion at the boundaries of the rectified images can cause incorrect depth estimates, and the discontinuities in overlapping areas make fusing multiple disparity maps difficult. To reduce the number of cameras and the rig size, researchers have proposed two vertically-mounted cameras with wide-FOV fisheye lenses, 360° catadioptric lenses, or reflective mirrors to capture a pair of omnidirectional images. In this setup the rectified images and disparity maps have low resolution, and the vertical epipolar lines make it hard to estimate the depth of vertical structures. Moreover, when long-range sensing is needed, as in autonomous driving, the short baseline of these systems limits the effective sensing distance, which is proportional to the baseline between the cameras.
In this paper, we propose a novel wide-baseline omnidirectional stereo vision system which uses only four cameras but estimates a full and continuous omnidirectional depth map. Each camera is equipped with a 220° FOV fisheye lens and faces one of the four cardinal directions, as shown in Fig. 1. The proposed system can generate an omnidirectional depth map with 360° horizontal FOV and up to 180° vertical FOV (a full sphere). This new camera and lens configuration yields a much larger overlapped region of the view frustums, which enables robots or cars to sense nearby obstacles in narrow or crowded environments.
In multi-camera stereo systems, the plane sweep method [collins1996space] which sweeps parallel virtual planes, and projects the input images onto the planes to find stereo correspondences has been used. Gallup et al. [gallup2007real] propose a more robust method that uses multiple sweeping directions, and it is further extended for the fisheye images [hane2014real]. In this paper, we propose the spherical sweeping method similar to Im et al. [im2016all]. Instead of parallel planes, the concentric virtual spheres centered at the rig coordinate system are swept for the predefined range of inverse depths. The input fisheye images are individually projected onto the spheres and the matching costs are computed from these projected spherical images. In this way, continuous omnidirectional depth maps can be generated without artificially dividing views and stitching them later, and different camera configurations (numbers and positions) can be seamlessly handled.
In our setup, depending on the ray direction, the minimum estimable depth ranges from 0.8 to 1.7 times the distance between the cameras, which is extremely wide-baseline in the stereo literature. Image patches at near distances suffer severe appearance changes due to large geometric and radiometric variations between views. Since conventional patch-based cost metrics use only local information, the generated cost volumes are noisy and often miss correct matches even after cost aggregation. We observe that even recent deep learning-based models [zbontar2016stereo] suffer from a similar problem. Therefore, we propose a novel neural network-based approach that considers global context information in cost computation. The proposed network takes the spherical images of two views and outputs a whole omnidirectional cost map. Extensive experiments show that the proposed network outperforms the conventional local matching methods.
Our contributions are summarized as:
We propose a novel omnidirectional wide-baseline stereo system which can estimate 360° dense depth maps up to very close distance. Both the hardware configuration of a small number of cameras with ultra-wide FOV lenses and the software system for depth estimation are new, flexible, and effective in accomplishing the proposed goal.
We design a deep neural network for computing the matching cost map for a pair of spherical images. The sphere sweeping at the rig coordinate system effectively normalizes the captured images into a uniform input to the network, and by using the whole spherical images, the network learns global contexts for more accurate stereo matching.
Realistic synthetic urban datasets are rendered for training and testing the deep neural network. With these datasets, the proposed algorithm is compared with previous algorithms through extensive quantitative evaluation. Further, real-world datasets are collected to show the performance of the proposed system. All datasets, as well as the trained network, will be made public upon publication of the paper.
II Related Work
Omnidirectional Stereo There have been three major approaches to omnidirectional stereo vision: spinning a camera, using mirrors, and using wide-FOV fisheye lenses. Kang et al. [kang19973] and Peleg et al. [peleg2001omnistereo] compute one panoramic depth map from multiple images captured while spinning an arm with a camera at its end. Despite the advantage of using one camera, the long capture time and the rotating arm make the method difficult to use outside the lab. Using mirrors with a few cameras is popular in omnidirectional stereo systems [bunschoten2003robust, yi2006omnidirectional, shimamura2000construction]. Geyer et al. [geyer2003conformal] and Schönbein et al. [schonbein2014omnidirectional] use two horizontally-mounted cameras with 360° FOV catadioptric lenses. Although they can generate an omnidirectional depth map, there exist two blind spots along the epipole direction, and the depth estimates around them are unstable or missing. Gao and Shen [gao2017dual] propose a system with two vertically-mounted cameras with ultra-wide FOV fisheye lenses. It performs omnidirectional depth estimation by projecting the input fisheye images onto four virtual planes parallel to the baseline. However, the disparity maps have low resolution due to the limited sensor resolution and the high distortion of the fisheye lenses, and the depth estimates of vertical structures parallel to the baseline are often unavailable. Meanwhile, Im et al. [im2016all] present an omnidirectional motion stereo algorithm that computes a 360° depth map of a static scene from a short video clip captured by a moving omnidirectional camera. Their sphere sweeping method allows the images to be captured at any known poses, thus lifting the fixed-configuration restriction. While they address motion stereo in a very short-baseline setup where appearance variations across views are minimal, we solve a more challenging, extremely wide-baseline problem.
Stereo Matching Cost According to Scharstein et al. [scharstein2002taxonomy], there are four steps in stereo depth estimation: initial matching cost computation, cost aggregation, disparity computation with optimization, and disparity refinement. Among them, computing the matching costs from the input images is the most demanding and difficult part. Typical intensity-based matching costs include the sum of absolute differences (SAD), normalized cross-correlation (NCC), and the rank and census transforms [zabih1994non]. Hirschmuller et al. [hirschmuller2007evaluation] compare and evaluate these matching cost functions.
Instead of finding the minimum values of locally aggregated matching costs, the depth map can be computed by global optimization by graph cuts [kolmogorov2001computing] or belief propagation [klaus2006segment, bleyer2011patchmatch], but they require high computational cost. Semi-global matching (SGM) [hirschmuller2008stereo] is an efficient way of aggregating costs globally using dynamic programming.
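To make the aggregation idea concrete, the sketch below implements a single-direction SGM pass over one scanline, with penalty P1 for ±1 disparity changes and P2 for larger jumps. The function name, penalty values, and the per-row cost-slice layout are illustrative assumptions, not the exact implementation of [hirschmuller2008stereo].

```python
import numpy as np

def sgm_aggregate_1d(cost, P1=0.5, P2=2.0):
    """Aggregate a cost slice along one scan direction, SGM-style.

    cost: (W, D) array of matching costs for one image row and D candidates.
    Returns the aggregated costs L of the same shape.
    """
    W, D = cost.shape
    L = np.empty_like(cost, dtype=np.float64)
    L[0] = cost[0]
    for x in range(1, W):
        prev = L[x - 1]
        min_prev = prev.min()
        # candidate transitions: same disparity (no penalty),
        # +/-1 disparity (penalty P1), or any larger jump (penalty P2)
        up = np.roll(prev, 1); up[0] = np.inf       # from d-1 to d
        down = np.roll(prev, -1); down[-1] = np.inf  # from d+1 to d
        best = np.minimum(prev, np.minimum(up + P1, down + P1))
        best = np.minimum(best, min_prev + P2)
        # subtract min_prev so L stays bounded (standard SGM normalization)
        L[x] = cost[x] + best - min_prev
    return L
```

A full SGM would run this recurrence along several directions (horizontal, vertical, diagonal) and sum the aggregated volumes before the winner-takes-all step.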
Recently, thanks to large-scale stereo datasets with ground-truth depths [geiger2012we, menze2015object, mayer2016large], deep learning-based algorithms with much improved performance have been developed. After Zagoruyko et al. [zagoruyko2015learning] proposed a deep convolutional neural network for patch comparison, the MC-CNN by Zbontar and LeCun [zbontar2016stereo] was trained for stereo matching cost computation. While [zbontar2016stereo] uses conventional cost aggregation methods, Kendall et al. [kendall2017end] propose an end-to-end network performing all steps in 3D convolutional layers.
Our network is the first neural network-based stereo algorithm that learns omnidirectional cost maps from spherical input images. As shown in the experiments, it generates much cleaner cost volumes than the conventional intensity-based costs and covers 360° at once.
III Omnidirectional Stereo
III-A Fisheye Projection Model and Extrinsic Calibration
We use the omnidirectional camera model [scaramuzza2006flexible, urban2015improved], which models the lens distortion with a polynomial function. The projection function $\mathbf{m} = \Pi(\mathbf{P}; \boldsymbol{\xi})$ maps a 3D point $\mathbf{P}$ to a 2D point $\mathbf{m}$ on the normalized image plane, where $\boldsymbol{\xi}$ denotes the fisheye intrinsic parameters. The normalized image coordinate $\mathbf{m}$ is transformed to the pixel coordinate by an affine transformation $A(\mathbf{m})$, as in Fig. 1(a). The details of the projection model are described in [scaramuzza2006flexible].
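As a simplified illustration of such a polynomial fisheye projection, the sketch below maps the incidence angle to an image radius with an odd polynomial and then applies an affine transform. The function `project_fisheye` and its coefficient parameterization are illustrative assumptions, not the exact model of [scaramuzza2006flexible].

```python
import numpy as np

def project_fisheye(P, coeffs, A, c):
    """Project 3D points with a generic polynomial fisheye model (sketch).

    P: (N, 3) points in the camera frame.
    coeffs: polynomial coefficients mapping the incidence angle theta to the
            radius r on the normalized plane: r = coeffs[0]*theta + coeffs[1]*theta**3 + ...
    A, c: 2x2 affine matrix and 2-vector mapping normalized to pixel coordinates.
    """
    X, Y, Z = P[:, 0], P[:, 1], P[:, 2]
    theta = np.arctan2(np.hypot(X, Y), Z)       # angle from the optical axis
    r = sum(k * theta ** (2 * i + 1) for i, k in enumerate(coeffs))
    phi = np.arctan2(Y, X)                      # azimuth in the image plane
    m = np.stack([r * np.cos(phi), r * np.sin(phi)], axis=1)  # normalized coords
    return m @ A.T + c                          # affine transform to pixels
```

With a single linear coefficient this reduces to the equidistant fisheye model; higher-order terms absorb the lens distortion.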
We follow the conventional camera rig calibration procedure using a checkerboard: for each camera the lens intrinsic parameters and the relative poses of the checkerboards are computed, then the rig is initialized using the relative poses, and finally all extrinsic and intrinsic parameters are jointly optimized. A large checkerboard is used to ensure sufficient overlap between views. The extrinsic parameters are represented as $\mathbf{w} = (\mathbf{r}; \mathbf{t})$, where $\mathbf{r}$ is an axis-angle rotation vector and $\mathbf{t}$ is a translation vector ($\mathbf{w} \in \mathbb{R}^6$). The rigid transformation matrix is given as $T(\mathbf{w}) = [R(\mathbf{r})\ |\ \mathbf{t}]$, where $R(\mathbf{r})$ is the rotation matrix corresponding to $\mathbf{r}$. From the checkerboard images of a camera $i$, we compute its lens intrinsics $\boldsymbol{\xi}_i$ and $A_i$, as well as the checkerboard poses $\mathbf{w}_{i,k}$ relative to the camera, where $k$ is the capture index. The relative pose from camera $i$ to $j$ can be computed as $\mathbf{w}_{j,k} \circ \mathbf{w}_{i,k}^{-1}$ from a pair of simultaneously-taken images, where $\circ$ and $^{-1}$ denote the composition and inverse operations. For extrinsic calibration, all camera poses $\mathbf{w}_i$ and checkerboard poses $\mathbf{w}_k$ are initialized in the first camera's coordinate system (Fig. 1(b)), and we minimize the reprojection error of the corner points on the checkerboards
$$E = \sum_{(i,k) \in \mathcal{O}} \sum_{x} \left\| \hat{\mathbf{x}}_{i,k,x} - A_i\!\left(\Pi\!\left(T(\mathbf{w}_i)^{-1}\, T(\mathbf{w}_k)\, \mathbf{X}_x;\ \boldsymbol{\xi}_i\right)\right) \right\|^2,$$
where $\mathcal{O}$ is the set of observations of the checkerboard poses by the cameras, $\mathbf{X}_x$ is the coordinate of the $x$-th corner point in the checkerboard frame, and $\hat{\mathbf{x}}_{i,k,x}$ is the observed pixel coordinate of $\mathbf{X}_x$ in the image of checkerboard pose $k$ taken by camera $i$. The Ceres solver [ceres-solver] is used for the optimization.
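The composition and inverse operations on $(\mathbf{r}, \mathbf{t})$ poses used above can be sketched with Rodrigues' formula; the helper names below are illustrative, and the results are returned in rotation-matrix form for simplicity.

```python
import numpy as np

def rot(r):
    """Axis-angle vector r -> 3x3 rotation matrix via Rodrigues' formula."""
    th = np.linalg.norm(r)
    if th < 1e-12:
        return np.eye(3)
    k = r / th
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(th) * K + (1.0 - np.cos(th)) * (K @ K)

def compose(w1, w2):
    """Compose two (r, t) poses (w2 applied first); returns (R, t) matrix form."""
    (r1, t1), (r2, t2) = w1, w2
    return rot(r1) @ rot(r2), rot(r1) @ t2 + t1

def inverse(w):
    """Inverse of an (r, t) pose, returned in (R, t) matrix form."""
    r, t = w
    Rt = rot(r).T
    return Rt, -Rt @ t
```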
III-B Spherical Sweep
The plane-sweep algorithm [collins1996space] enables dense stereo matching among multi-view images. However, it is difficult to apply the algorithm to fisheye images with more than 180° FOV. Hane et al. [hane2014real] use multiple planes with different normals and distances, and later Im et al. [im2016all] exploit local spheres centered at the reference camera to estimate depth from a spherical panoramic camera.
The proposed system combines wide-FOV images with a wide-baseline setup, which cannot be handled by the existing algorithms. To estimate omnidirectional depth in our wide-baseline system, we propose a global spherical sweep algorithm. The center of the sweeps can be anywhere, but to minimize the distortion in the spherical images, we choose the rig center as the origin and align the $xy$-plane to be close to the camera centers. In this spherical coordinate system, a ray with azimuth $\theta$ and elevation $\phi$ corresponds to the unit vector $\mathbf{p}(\theta, \phi) = (\cos\phi \cos\theta,\ \cos\phi \sin\theta,\ \sin\phi)^\top$. Let the transformed camera extrinsic parameters in the rig coordinate system be $\mathbf{w}_i$. Also, for notational simplicity, we denote the projection of a point into the pixel coordinates of camera $i$ with $\rho_i(\cdot) = A_i(\Pi(\cdot\,; \boldsymbol{\xi}_i))$.
We now warp the input images onto the global spheres. Each pixel $(\theta, \phi)$ of a warped spherical image represents the ray $\mathbf{p}(\theta, \phi)$. The spherical image has a fixed resolution, and $\theta$ varies from $-\pi$ to $\pi$. $\phi$ can span up to $-\pi/2$ to $\pi/2$, but we use a smaller range in our experiments as the ceiling (sky) and ground are of less interest. $N$ spheres are sampled so that their inverse depths are uniform, i.e., when the minimum depth is $d_{min}$, the inverse depth of the $n$-th sphere is $d_n^{-1} = \frac{n}{N-1}\, d_{min}^{-1}$, $n = 0, \ldots, N-1$. In other words, the radii of the spheres are $d_n = \frac{N-1}{n}\, d_{min}$, except $d_0$, which corresponds to the sphere at infinity. As shown in Fig. 3, the pixel value of the $n$-th spherical image of camera $i$ is determined as
$$S_{i,n}(\theta, \phi) = I_i\!\left(\rho_i\!\left(T(\mathbf{w}_i)^{-1}\, d_n\, \mathbf{p}(\theta, \phi)\right)\right), \qquad (1)$$
where $I_i$ is the input image captured by camera $i$. For $d_0 = \infty$, we project the ray direction $\mathbf{p}(\theta, \phi)$ itself, applying only the rotation, since a point at infinity is unaffected by translation. When the projected pixels are not in the visible region of the input image, we do not consider them in the further processing.
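The inverse-depth-uniform sphere sampling and the ray parameterization above can be sketched as follows; the helper names and the axis convention for $\mathbf{p}(\theta, \phi)$ are assumptions for illustration.

```python
import numpy as np

def sweep_spheres(d_min, N):
    """Radii of N spheres with uniform inverse depths:
    d_n = (N-1)/n * d_min, with d_0 = infinity (zero inverse depth)."""
    n = np.arange(N, dtype=np.float64)
    return np.where(n == 0, np.inf, (N - 1) / np.maximum(n, 1.0) * d_min)

def unit_rays(W, H, phi_min, phi_max):
    """Unit ray p(theta, phi) for every pixel of a W x H spherical image."""
    theta = np.linspace(-np.pi, np.pi, W, endpoint=False)
    phi = np.linspace(phi_min, phi_max, H)
    T, P = np.meshgrid(theta, phi)
    return np.stack([np.cos(P) * np.cos(T),
                     np.cos(P) * np.sin(T),
                     np.sin(P)], axis=-1)   # (H, W, 3) unit vectors
```

Scaling each ray by a sphere radius and projecting it with the calibrated camera model gives the warped spherical images of (1).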
In spherical sweep algorithms, we need to compute the matching cost volume for all ray directions and inverse depths. Suppose that we are given a pairwise matching cost function $C(\cdot, \cdot)$ which takes two images and computes a cost map of the same size. The integrated cost map is the average of all possible (and valid) pairwise cost maps, and the cost volume is the collection of the integrated cost maps, i.e., the cost of the ray $(\theta, \phi)$ at the $n$-th sphere is
$$\mathcal{C}_n(\theta, \phi) = \frac{1}{|\mathcal{P}|} \sum_{(i,j) \in \mathcal{P}} C\!\left(S_{i,n}, S_{j,n}\right)(\theta, \phi), \qquad (2)$$
where $(i, j) \in \mathcal{P}$ is an unordered index pair of spherical images. As the raw cost volume is often noisy and contains incorrect estimates, we take advantage of SGM [hirschmuller2008stereo], which refines the cost volume by minimizing an energy function with dynamic programming. Finally, the inverse depth of a ray is determined by the winner-takes-all strategy as $d_{\hat{n}}^{-1}$, where $\hat{n}(\theta, \phi) = \arg\min_n \mathcal{C}_n(\theta, \phi)$. The overall procedure is illustrated in Fig. 4.
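The pairwise averaging and winner-takes-all steps above can be sketched as follows; the function name and data layout are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def integrate_cost_volume(sphere_images, pair_cost):
    """Average pairwise cost maps into a volume and pick the WTA index.

    sphere_images: list over spheres; each entry is a list of warped spherical
                   images (one per camera, H x W), with None for invisible views.
    pair_cost: function(img_a, img_b) -> H x W cost map.
    Returns the (N, H, W) cost volume and the per-ray argmin index map.
    """
    volume = []
    for imgs in sphere_images:
        valid = [im for im in imgs if im is not None]
        costs = [pair_cost(a, b) for a, b in combinations(valid, 2)]
        volume.append(np.mean(costs, axis=0))   # average over valid pairs
    volume = np.stack(volume)
    return volume, volume.argmin(axis=0)        # winner-takes-all per ray
```

In the full pipeline, SGM refinement would be applied to the volume before the argmin.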
As a baseline cost function, we use the zero-mean normalized cross-correlation (ZNCC), i.e., the covariance of two patches divided by the product of their standard deviations. ZNCC is one of the most popular cost functions [faugeras1993real, gallup2007real] since it is robust to radiometric changes. However, in our challenging setup it does not generate good cost maps, which motivates our neural network cost function.
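A minimal patch-level ZNCC and its cost transform can be written as below; mapping the correlation to a cost in [0, 1] via $(1 - \mathrm{ZNCC})/2$ is an assumption consistent with the 0-to-1 cost range used in our evaluation.

```python
import numpy as np

def zncc(p, q, eps=1e-8):
    """Zero-mean normalized cross-correlation of two same-size patches."""
    p = p - p.mean()
    q = q - q.mean()
    return (p * q).mean() / (np.sqrt((p * p).mean() * (q * q).mean()) + eps)

def zncc_cost(p, q):
    """Matching cost in [0, 1]: 0 for identical, 1 for anti-correlated patches."""
    return (1.0 - zncc(p, q)) / 2.0
```

Because both patches are centered and normalized, the cost is invariant to affine intensity changes, which is why ZNCC tolerates radiometric differences between cameras.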
TABLE I: The SweepNet architecture. SweepNet has 20 convolutional layers and a transposed convolutional layer, followed by 5 fully connected layers. The properties (s, p) denote the (stride, padding) of each convolutional block.

Layer    | Properties
input    | add circular column padding
conv1    | 32 ch, s 2, p 2
conv2    | 32 ch, s 1, p 1
conv3    | 32 ch, s 1, p 1, add conv1
...      | ...
conv18   | 32 ch, s 1, p 1
conv19   | 128 ch, s 1, p 1
deconv1  | 128 ch, s 2, p 1
conv20   | 128 ch, s 1, p 1
output   | 1 ch, no ReLU
TABLE II: Quantitative evaluation on the synthetic test sets (five error measures per dataset; lower is better).

Matching cost func.                 | Sunny                 | Cloudy                | Sunset
ZNCC + SGM [hirschmuller2008stereo] | 24.0 9.9 6.3 1.5 4.5  | 25.6 9.9 6.3 1.6 4.5  | 23.4 9.9 6.4 1.6 4.6
MC-CNN + SGM                        | 19.3 7.6 5.1 1.4 4.5  | 21.2 7.2 4.7 1.4 4.4  | 18.8 7.6 5.1 1.4 4.6
SweepNet + SGM                      | 15.4 6.8 4.8 1.1 3.8  | 19.6 7.2 4.9 1.2 3.8  | 14.8 7.0 4.9 1.2 3.9
Local patch-based approaches, including the deep learning-based MC-CNN [zbontar2016stereo], fail on hard negative samples such as two identical patches from different objects, since they do not consider holistic visual information. Moreover, in the wide-baseline setting, patches of the same surface may look different due to foreshortening or radiometric differences caused by viewing-direction changes. To handle this, we propose SweepNet, which utilizes the global context in the images.
The architecture of the proposed network is detailed in Table I. As shown in Fig. 4, the input of the network is a pair of grayscale spherical images acquired from (1). To ensure that the horizontal ends are connected, we add circular column padding to the input spherical images. The conv1-18 layers are Siamese residual blocks [he2016deep] for learning unary feature extraction. We reduce the size of the input image by half for a larger receptive field, which helps the network learn from global context. The output feature maps are concatenated, and the features are upsampled using transposed convolution. Finally, the network outputs the cost map, which ranges from 0 to 1, through fully connected layers and a sigmoid layer.
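The circular column padding can be implemented with a wrap-mode pad so that only the horizontal ($\theta$) axis wraps around; the helper name below is illustrative.

```python
import numpy as np

def circular_pad_columns(img, pad):
    """Pad only the column (theta) axis with wrap-around values, so the
    theta = -pi and theta = +pi ends of the spherical image stay connected."""
    return np.pad(img, ((0, 0), (pad, pad)), mode='wrap')
```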
To train our network, we use the following approach. Given a set of ground-truth depth maps $D^*$, the ground-truth inverse depth index is given as $n^*(\theta, \phi) = \mathrm{round}\!\left((N-1)\, d_{min} / D^*(\theta, \phi)\right)$. Each position $(\theta, \phi)$ on the $n$-th sphere is labeled as
$$y_n(\theta, \phi) = \begin{cases} 1 & \text{if } n = n^*(\theta, \phi), \\ 0 & \text{otherwise.} \end{cases} \qquad (3)$$
We use the binary cross-entropy loss [robert2014machine], $\ell(y, \hat{y}) = -\left(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right)$. For the labeled training set $\mathcal{T}$, the loss is defined as
$$L = \sum_{(\theta, \phi, n) \in \mathcal{T}} \ell\!\left(y_n(\theta, \phi),\ \hat{y}_n(\theta, \phi)\right), \qquad (4)$$
where $\hat{y}_n(\theta, \phi)$ is the label predicted at $(\theta, \phi)$ with the input spherical images corresponding to the $n$-th sphere. The loss is minimized by stochastic gradient descent with momentum.
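Assuming the index and label definitions above, the ground-truth index computation and the per-sample loss can be sketched as follows; the helper names are illustrative.

```python
import numpy as np

def gt_index(depth, d_min, N):
    """Ground-truth inverse-depth index n* = round((N-1) * d_min / depth),
    consistent with inverse-depth-uniform sphere sampling."""
    return np.rint((N - 1) * d_min / depth).astype(int)

def bce_loss(y, y_hat, eps=1e-7):
    """Mean binary cross-entropy between labels y and predicted scores y_hat."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
```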
IV Experimental Results
System Configuration We use four CCD cameras with 220° FOV fisheye lenses (Pointgrey CM3-U3-31S4C and Entaniya M12-220). Images can be captured at 30 Hz, and they are synchronized by a software trigger. For indoor experiments we use a square-shaped rig, and for outdoor experiments the cameras are installed at the four corners of the roof of a minivan as shown in Fig. 1. A large checkerboard is used for calibration.
Training In order to train the network, we create synthetic urban datasets with Blender, as shown in Fig. 5. Following [zhang2016benefit], we virtually implement a camera rig similar to our outdoor setting, as well as buildings, cars, and roads, in Blender, and render each frame as four images. The Sunny dataset consists of 1000 sequential frames of sunny city landscapes, which we split into two parts: the first 700 frames for training and the last 300 for testing. We also create separate test datasets with varying weather (Cloudy and Sunset) to test on different photometric conditions.
The input images are converted to grayscale, the intensity values are normalized to zero mean and unit variance, and they are warped to spherical images for training. The range of $\phi$ is restricted for the synthetic datasets. Among the training data, we randomly select 350 frames and train our network for 14 epochs. The learning rate is set to an initial value for the first 11 epochs and lowered for the remaining epochs. We sample 192 inverse depths, and the corresponding ground-truth labels are acquired by (3). Inverse depths with too few positive labels are discarded, and the same number of positive and negative labels are used for training. In total, 92 million labels from the Sunny training dataset are used to train our network.
(Fig. 6 caption: SweepNet classifies negative points more precisely.)
Evaluation We evaluate our method quantitatively on the synthetic datasets. The error of the inverse depth index is defined as the absolute difference $e(\theta, \phi) = |\hat{n}(\theta, \phi) - n^*(\theta, \phi)|$ between the estimated and ground-truth indices. The number of inverse depths and the size of the spherical images are kept fixed in testing.
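The index error and simple threshold statistics over a depth map can be computed as below; the threshold values are illustrative, not the exact measures reported in Table II.

```python
import numpy as np

def index_error_stats(n_hat, n_star, thresholds=(1, 3, 5)):
    """Absolute inverse-depth-index error: its mean and the fraction of rays
    whose error exceeds each threshold (threshold values are illustrative)."""
    e = np.abs(np.asarray(n_hat, dtype=int) - np.asarray(n_star, dtype=int))
    return float(e.mean()), {t: float((e > t).mean()) for t in thresholds}
```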
We compare the proposed SweepNet with other matching cost functions, ZNCC and MC-CNN [zbontar2016stereo], on the Sunny, Cloudy, and Sunset test sets. The ZNCC is computed over local patches of a fixed window size, and since high ZNCC values indicate matching patches, we use the negative ZNCC cost $(1 - \mathrm{ZNCC})/2$, which ranges from 0 to 1. We train MC-CNN on the Sunny dataset with millions of pairs of local patches from the spherical images, for 14 epochs with batch size 256, following the original literature [zbontar2016stereo]. We compare the accuracy of the depth maps with and without cost volume refinement by SGM [hirschmuller2008stereo]. Table II shows that SweepNet outperforms the other methods in all metrics. In particular, SweepNet with SGM gives the best and most robust results on all datasets.
Fig. 6 shows a cross section of the raw cost volumes of the matching cost functions (without SGM). The cost maps by ZNCC and MC-CNN have many false positives (green to blue colors outside the ground-truth depths), whereas SweepNet generates a much cleaner cost map.
In Fig. 7, the estimated inverse depth map is shown with the ground truth. The buildings, cars, and thin structures like traffic signs and poles are reconstructed successfully, and the sky and ground plane are also accurately estimated. Thin structures are especially challenging in the wide-baseline setting because the cost volume can contain multiple true matches for one ray direction: one at the thin object and another at the object behind it. Fig. 9 shows an example of full spherical depth estimation, with $\phi$ spanning the full $-90°$ to $90°$ range, which is useful for drones that can move freely in all 6 DOF. One can verify that the scene, including the sky and ground, is precisely reconstructed even when the $xy$-plane of the rig is not aligned with the ground plane.
In addition, we qualitatively evaluate the proposed method with the real world data captured by our indoor and outdoor rigs. Fig. 8 shows the input images, the omnidirectional inverse depth map, the reprojected panorama image, and the 3D rendering of the point cloud.
In the indoor examples, one can see that the regions very close to the rig (such as the floor) are accurately reconstructed, and the walls with little texture are also well recovered. The outdoor scenes are quite challenging since far objects appear very small in the input images, whereas near objects cover significant portions of the view. The estimated inverse depth maps show that SweepNet can reconstruct both far and near objects successfully. The panorama images at the bottom-left corners are constructed by projecting the estimated 3D points into the input images, which shows how accurate the estimated depths are. Aside from the radiometric variations between cameras, it is hard to find any mismatches in the panorama images. These experiments show that the proposed method can effectively handle the wide-baseline omnidirectional depth estimation problem.
V Conclusion
We present a novel hardware system and stereo algorithm for omnidirectional depth estimation. The proposed hardware configuration consists of multiple widely placed cameras with wide-FOV fisheye lenses. After the intrinsic and extrinsic parameters are calibrated, the input images are warped into spherical images by projection onto virtual spheres positioned at the rig center with predefined radii. The proposed SweepNet considers the holistic visual content of the spherical images at each radius to build the cost map. With the training data from the synthetically rendered city dataset, SweepNet can be successfully trained. The extensive experiments show that SweepNet outperforms the local patch-based methods and robustly generates accurate depth maps in challenging situations.
This research was supported by Samsung Research Funding & Incubation Center for Future Technology under Project Number SRFC-TC1603-05, the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (NRF-2017M3C4A7069369), and the NRF grant funded by the Korea government (MSIP) (NRF-2017R1A2B4011928).