I Introduction
Estimating 3D geometry is an essential part of many robotic tasks such as navigation, recognition, manipulation, and planning. Many sensor systems, including LIDAR, structured-light 3D scanners, and stereo cameras, have been developed and used to this end. Among these, camera-based stereo systems have many benefits: they operate in a passive mode (not emitting any active signal), and they are compact, lightweight, and mechanically robust. Moreover, thanks to recent advances in GPU processors and deep learning algorithms, camera-based methods have become feasible and promising.
A conventional stereo setup uses two cameras looking in the same direction at a horizontal interval to estimate the disparity map. However, in a real-world environment where obstacles exist all around the robot, it is often necessary to estimate an omnidirectional depth map. Popular methods [shimamura2000construction, wang2012stereo] use multiple stereo cameras, estimate disparity maps from rectified image pairs, and then merge them into one panoramic image. The reconstruction results are mostly favorable, but the size and cost of the system can be problematic in some cases. Algorithmically, the distortion at the boundary of the rectified images can cause incorrect depth estimates, and the discontinuity in overlapping areas makes fusing multiple disparities difficult. To reduce the number of cameras and the rig size, researchers have proposed two vertically-mounted cameras with wide-FOV fisheye lenses, 360° catadioptric lenses, or reflective mirrors to obtain a pair of omnidirectional images. In this setup, the rectified images and disparity maps are in low resolution, and the vertical epipolar lines make it hard to estimate the depth of vertical structures. Also, when long-range sensing is needed, as in autonomous driving, the short baseline of the above systems can limit the effective sensing distance, as it is proportional to the baseline between the cameras.
In this paper, we propose a novel wide-baseline omnidirectional stereo vision system which uses only four cameras but estimates a full and continuous omnidirectional depth map. Each camera is equipped with a 220° FOV fisheye lens and faces one of the four cardinal directions, as shown in Fig. 1. The proposed system can generate an omnidirectional depth map of 360° horizontal FOV and up to 180° vertical FOV (a full sphere). This new camera and lens configuration yields a much larger overlapping region of view frustums, which enables robots or cars to sense nearby obstacles in narrow or crowded environments.
In multi-camera stereo systems, the plane sweep method [collins1996space], which sweeps parallel virtual planes and projects the input images onto the planes to find stereo correspondences, has been widely used. Gallup et al. [gallup2007real] propose a more robust method that uses multiple sweeping directions, and it is further extended to fisheye images [hane2014real]. In this paper, we propose a spherical sweeping method similar to Im et al. [im2016all]. Instead of parallel planes, concentric virtual spheres centered at the rig coordinate system are swept over a predefined range of inverse depths. The input fisheye images are individually projected onto the spheres, and the matching costs are computed from these projected spherical images. In this way, continuous omnidirectional depth maps can be generated without artificially dividing views and stitching them later, and different camera configurations (numbers and positions) can be handled seamlessly.
In our setup, depending on the ray direction, the minimum estimable depth ranges from 0.8 to 1.7 times the distance between cameras, which is extremely wide-baseline in the stereo literature. Image patches at near distances suffer severe appearance changes due to large geometric and radiometric variations between views. Since conventional patch-based cost metrics use only local information, the generated cost volumes are noisy and often miss correct matches even after cost aggregation. We observe that even recent deep learning-based models [zbontar2016stereo] suffer from a similar problem. Therefore, we propose a novel neural network-based approach that considers global context information in cost computation. The proposed network takes the spherical images of two views and outputs a whole omnidirectional cost map. Extensive experiments show that the proposed network outperforms conventional local matching methods.
Our contributions are summarized as:

(a) We propose a novel omnidirectional wide-baseline stereo system which can estimate 360° dense depth maps down to very close distances. Both the hardware configuration, a small number of cameras with ultra-wide FOV lenses, and the software system for depth estimation are new, flexible, and effective in accomplishing the proposed goal.

(b) We design a deep neural network for computing the matching cost map for a pair of spherical images. The sphere sweeping at the rig coordinate system effectively normalizes the captured images into a uniform input to the network, and by using the whole spherical images, the network learns global contexts for more accurate stereo matching.

(c) Realistic synthetic urban datasets are rendered for training and testing the deep neural network. With these datasets, the proposed algorithm is compared with previous algorithms in extensive quantitative evaluations. Further, real-world datasets are collected to show the performance of the proposed system. All datasets, as well as the trained network, will be made public when the paper is published.
II Related Work
Omnidirectional Stereo There have been three major approaches to omnidirectional stereo vision: spinning a camera, using mirrors, and using wide-FOV fisheye lenses. Kang et al. [kang19973] and Peleg et al. [peleg2001omnistereo] compute one panoramic depth map from multiple images captured while spinning an arm with a camera at its end. Despite the advantage of using one camera, the long capture time and the rotating arm make it difficult to use outside the lab. Using mirrors with a few cameras is popular in omnidirectional stereo systems [bunschoten2003robust, yi2006omnidirectional, shimamura2000construction]. Geyer et al. [geyer2003conformal] and Schönbein et al. [schonbein2014omnidirectional] use two horizontally-mounted cameras with 360° FOV catadioptric lenses. Although they can generate an omnidirectional depth map, there exist two blind spots along the epipole direction, and the depth estimates around them are unstable or missing. Gao and Shen [gao2017dual] propose a system with two vertically-mounted cameras with ultra-wide FOV fisheye lenses. It performs omnidirectional depth estimation by projecting the input fisheye images onto four virtual planes parallel to the baseline. However, the disparity maps are in low resolution due to the limited sensor resolution and the high distortion of the fisheye lenses, and depth estimates of vertical structures parallel to the baseline are often unavailable. Meanwhile, an omnidirectional motion stereo algorithm by Im et al. [im2016all] computes a 360° depth map of a static scene from a short video clip captured by a moving omnidirectional camera. The sphere sweeping method allows the images to be captured at any known poses, thus lifting the fixed-configuration restriction. While they address motion stereo in a very short-baseline setup where appearance variations across views are minimal, we try to solve a more challenging, extremely wide-baseline problem.
Stereo Matching Cost According to Scharstein and Szeliski [scharstein2002taxonomy], there are four steps in stereo depth estimation: initial matching cost computation, cost aggregation, disparity computation with optimization, and disparity refinement. Among them, computing matching costs from the input images is the most demanding and difficult part. Typical intensity-based matching costs include the sum of absolute differences (SAD), normalized cross-correlation (NCC), and rank or census transforms [zabih1994non]. Hirschmuller and Scharstein [hirschmuller2007evaluation] compare and evaluate these matching cost functions.
Instead of finding the minimum values of locally aggregated matching costs, the depth map can be computed by global optimization with graph cuts [kolmogorov2001computing] or belief propagation [klaus2006segment, bleyer2011patchmatch], but these require high computational cost. Semi-global matching (SGM) [hirschmuller2008stereo] is an efficient way of aggregating costs globally using dynamic programming.
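SGM aggregates matching costs with a dynamic-programming recurrence evaluated along several scan directions. As a rough illustration (not the paper's implementation), the recurrence along a single direction can be sketched as follows, where `p1` and `p2` are the usual small- and large-jump smoothness penalties:

```python
import numpy as np

def sgm_path(cost, p1=0.5, p2=2.0):
    """Aggregate a (X, D) cost slice along one scan direction using the SGM
    recurrence: L(x, d) = C(x, d) + min(L(x-1, d), L(x-1, d±1) + p1,
    min_k L(x-1, k) + p2) - min_k L(x-1, k).
    The full method sums such aggregations over multiple directions."""
    X, D = cost.shape
    L = np.empty((X, D), dtype=float)
    L[0] = cost[0]
    for x in range(1, X):
        prev = L[x - 1]
        m = prev.min()
        # candidate predecessors: same index, index +/- 1, or any index + p2
        shift_up = np.concatenate(([np.inf], prev[:-1])) + p1
        shift_dn = np.concatenate((prev[1:], [np.inf])) + p1
        best = np.minimum.reduce([prev, shift_up, shift_dn,
                                  np.full(D, m + p2)])
        L[x] = cost[x] + best - m  # subtract m to keep values bounded
    return L
```

With a constant cost slice the aggregation is the identity, and a consistently cheap index stays the per-position minimum after aggregation.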
Recently, thanks to large-scale stereo datasets with ground-truth depths [geiger2012we, menze2015object, mayer2016large], deep learning-based algorithms with much improved performance have been developed. After Zagoruyko and Komodakis [zagoruyko2015learning] proposed deep convolutional neural networks for patch comparison, the MC-CNN by Zbontar and LeCun [zbontar2016stereo] was trained for stereo matching cost computation. While [zbontar2016stereo] uses conventional cost aggregation methods, Kendall et al. [kendall2017end] propose an end-to-end network performing all steps with 3D convolutional layers.
Our network is the first neural network-based stereo algorithm that learns omnidirectional cost maps from spherical input images. As shown in the experiments, it generates much cleaner cost volumes than conventional intensity-based costs and covers 360° at once.
III Omnidirectional Stereo
III-A Fisheye Projection Model and Extrinsic Calibration
We use the omnidirectional camera model [scaramuzza2006flexible, urban2015improved], which models the lens distortion with a polynomial function. The projection function maps a 3D point X to a 2D point x on the normalized image plane, x = Π(X; f), where f denotes the fisheye intrinsic parameters. The normalized image coordinate is transformed to the pixel coordinate by an affine transformation A, as in Fig. 1(a). The details of the projection model are described in [scaramuzza2006flexible].
We follow the conventional camera rig calibration procedure using a checkerboard: for each camera, the lens intrinsic parameters and the relative poses of the checkerboards are computed; then the rig is initialized using the relative poses; and finally all extrinsic and intrinsic parameters are optimized. A large checkerboard is used to ensure sufficient overlaps between views. The extrinsic parameters of camera i are represented as (r_i, t_i), where r_i is an axis-angle rotation vector and t_i is a translation vector (both in R^3). The rigid transformation matrix is given as T_i = [R_i t_i; 0 1], where R_i is the rotation matrix corresponding to r_i. From the checkerboard images of a camera i, we obtain its lens intrinsics f_i and A_i, as well as the checkerboard poses T_{i,k} relative to the camera, where k is the capture index. The relative pose from camera j to camera i can be computed as T_{i,k} ∘ T_{j,k}^{-1} from a pair of simultaneously-taken images I_{i,k} and I_{j,k}, where ∘ and ^{-1} denote the composition and inverse operations. For extrinsic calibration, all camera poses and the checkerboard poses are initialized in the first camera's coordinate system, Fig. 1(b), and we minimize the reprojection error of the corner points on the checkerboards,

E = Σ_{(i,k)∈O} Σ_m ‖ p_{i,k,m} − A_i Π(T_i^{-1} T_k X_m; f_i) ‖²,

where O is the set of observations of checkerboard pose k by camera i, X_m is the coordinate of the m-th corner point on the checkerboard, and p_{i,k,m} is the pixel coordinate of X_m in the image I_{i,k}. The Ceres solver [ceressolver] is used for the optimization.
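The pose composition used to initialize the rig can be sketched with 4x4 rigid transforms. This is a minimal numpy illustration (function names are ours, not the paper's): a rotation matrix is recovered from the axis-angle vector via Rodrigues' formula, and the relative pose from camera j to camera i follows from two views of the same checkerboard capture.

```python
import numpy as np

def rodrigues(r):
    """Axis-angle vector -> 3x3 rotation matrix (Rodrigues' formula)."""
    theta = np.linalg.norm(r)
    if theta < 1e-12:
        return np.eye(3)
    k = r / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def rigid(r, t):
    """Build the 4x4 rigid transform T = [R t; 0 1] from (r, t)."""
    T = np.eye(4)
    T[:3, :3] = rodrigues(np.asarray(r, dtype=float))
    T[:3, 3] = t
    return T

def relative_pose(T_i_board, T_j_board):
    """Relative pose from camera j to camera i, given both cameras' poses
    of the same checkerboard capture k: T_{i,k} composed with T_{j,k}^{-1}."""
    return T_i_board @ np.linalg.inv(T_j_board)
```

Chaining such relative poses over simultaneously captured checkerboard views places every camera in the first camera's coordinate system before the joint refinement.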
III-B Spherical Sweep
The plane-sweep algorithm [collins1996space] enables dense stereo matching among multi-view images. However, it is difficult to apply the algorithm to fisheye images with more than 180° FOV. Hane et al. [hane2014real] use multiple planes with different normals and distances, and later Im et al. [im2016all] exploit local spheres centered at the reference camera to estimate depth from a spherical panoramic camera.
The proposed system works with wide-FOV images in a wide-baseline setup, which cannot be handled by the existing algorithms. To estimate omnidirectional depth in our wide-baseline system, we propose a global spherical sweep algorithm. The center of the sweeps can be anywhere, but to minimize the distortion in the spherical images, we choose the rig center as the origin and align the equatorial plane of the spheres to be close to the camera centers. In this spherical coordinate system, a ray corresponds to a pair of angles (θ, φ). Let the transformed extrinsic parameters of camera i in the rig coordinate system be T_i. Also, for notational simplicity, we denote the full projection to pixel coordinates of camera i by Π_i(·) = A_i Π(·; f_i).
We now warp the input images onto the global spheres. Each pixel (θ, φ) in a warped spherical image represents a ray direction p(θ, φ). The spherical image has resolution H × W; θ varies from −180° to 180°, and φ can range from −90° to 90°, but we use a smaller vertical range in our experiments as the ceiling (sky) and ground are of less interest. N spheres are sampled so that their inverse depths are uniform, i.e., when the minimum depth is d_min, the inverse depth of the n-th sphere is n / ((N−1) d_min), n = 0, …, N−1. In other words, the radii of the spheres are r_n = (N−1) d_min / n, except r_0 = ∞, which corresponds to the sphere at infinity. As shown in Fig. 3, the pixel value of the spherical image S_{i,n} is determined as

S_{i,n}(θ, φ) = I_i( Π_i( T_i^{-1} (r_n p(θ, φ)) ) ),   (1)

where I_i is the input image captured by camera i. For n = 0, we use the ray direction only. When the projected pixels are not in the visible region of the input image, we do not consider them in further processing.
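The sphere sampling and the ray grid described above can be sketched in a few lines of numpy. This is an illustrative sketch, not the paper's code; the axis convention in `ray_directions` (y up, θ around the vertical axis) is our assumption:

```python
import numpy as np

def sphere_radii(d_min, N):
    """Radii of N concentric sweep spheres with uniformly sampled inverse
    depth: 1/d_n = n / ((N - 1) * d_min), n = 0..N-1.
    The n = 0 sphere is the sphere at infinity."""
    inv = np.arange(N) / ((N - 1) * d_min)
    with np.errstate(divide="ignore"):
        return np.where(inv > 0, 1.0 / inv, np.inf)

def ray_directions(H, W, phi_range=(-np.pi / 4, np.pi / 4)):
    """Unit ray directions p(theta, phi) for an H x W spherical image.
    theta spans the full 360 degrees; phi spans a reduced vertical range,
    as the sky and ground are of less interest."""
    theta = np.linspace(-np.pi, np.pi, W, endpoint=False)
    phi = np.linspace(phi_range[0], phi_range[1], H)
    th, ph = np.meshgrid(theta, phi)
    # assumed convention: y is up, theta rotates about the vertical axis
    return np.stack([np.cos(ph) * np.cos(th),
                     np.sin(ph),
                     np.cos(ph) * np.sin(th)], axis=-1)
```

Scaling each unit ray by a sphere radius gives the 3D point that is transformed into a camera frame and projected through the fisheye model to fetch the pixel value in (1).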
In spherical sweep algorithms, we need to compute the matching cost volume over all ray directions and inverse depths. Suppose that we are given a pairwise matching cost function ρ(·, ·) which takes two images and computes a cost map of the same size. The integrated cost map is the average of all possible (and valid) pairwise cost maps, and the cost volume is the collection of integrated cost maps, i.e., the cost at (θ, φ) on the n-th sphere is

C_n(θ, φ) = (1 / |P|) Σ_{(i,j)∈P} ρ(S_{i,n}, S_{j,n})(θ, φ),   (2)

where P is the set of unordered index pairs (i, j) of spherical images. As the raw cost volume is often noisy and contains incorrect estimates, we take advantage of SGM [hirschmuller2008stereo], which refines the cost volume by minimizing an energy function with dynamic programming. Finally, the inverse depth of a ray is determined by the winner-takes-all strategy as 1/d(θ, φ) = n*(θ, φ) / ((N−1) d_min), where n*(θ, φ) = argmin_n C_n(θ, φ). The overall procedure is illustrated in Fig. 4.
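The cost integration over camera pairs and the winner-takes-all selection can be sketched as follows. The absolute-difference cost here is only a toy stand-in for a real pairwise cost such as ZNCC or the learned SweepNet cost:

```python
import numpy as np
from itertools import combinations

def integrated_cost(spherical_imgs, pair_cost):
    """Average the pairwise cost maps over all unordered pairs (Eq. 2).
    spherical_imgs: list of (H, W) arrays, one per camera, for one sphere."""
    costs = [pair_cost(spherical_imgs[i], spherical_imgs[j])
             for i, j in combinations(range(len(spherical_imgs)), 2)]
    return np.mean(costs, axis=0)

def winner_takes_all(cost_volume):
    """cost_volume: (N_spheres, H, W). Returns, per ray, the index of the
    minimum-cost sphere, i.e. the inverse depth index n*."""
    return np.argmin(cost_volume, axis=0)

def abs_diff_cost(a, b):
    """Toy pairwise cost standing in for ZNCC or a learned cost."""
    return np.abs(a - b)
```

In the full pipeline, SGM refinement sits between building the cost volume and the winner-takes-all step; pixels that fall outside a camera's visible region would simply be excluded from the pairwise average.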
As a baseline cost function, we use zero-mean normalized cross-correlation (ZNCC), which is the covariance of two patches divided by the product of their standard deviations. ZNCC is one of the most popular cost functions [faugeras1993real, gallup2007real], since it is robust to radiometric changes. However, in our challenging setup it does not generate good cost maps, which motivates our neural network cost function.

Layer     |  Property                            |  Output Dim.
input     |  add circular column padding         |
conv1     |  32, s 2                             |
conv2     |  32, s 1, p 1                        |
conv3     |  32, s 1, p 1, add conv1             |
conv4-17  |  repeat conv2-3                      |
conv18    |  32, s 1, p 1                        |
concat    |  concatenate the two feature maps    |
conv19    |  128, s 1, p 1                       |
deconv1   |  128, s 2, p 1                       |
conv20    |  128, s 1, p 1                       |
fc1-4     |  256                                 |
fc5       |  1, no ReLU                          |
sigmoid   |                                      |

SweepNet has 20 convolutional layers and a transposed convolutional layer, followed by 5 fully connected layers. The properties (s, p) denote the (stride, padding) of each convolutional block.
Matching cost func.                  |        Sunny              |        Cloudy             |        Sunset
                                     | >1    >3    >5   MAE  RMS | >1    >3    >5   MAE  RMS | >1    >3    >5   MAE  RMS
ZNCC                                 | 40.7  28.0  25.2 10.0 23.0| 44.9  31.0  27.9 10.9 23.9| 39.5  26.8  24.0  9.7 22.9
MC-CNN [zbontar2016stereo]           | 42.1  32.7  30.2 13.3 27.8| 46.1  33.5  30.6 12.8 26.5| 42.9  33.0  30.5 13.8 28.4
SweepNet                             | 20.7  13.2  11.2  4.1 14.0| 28.5  16.4  14.0  5.0 15.0| 20.5  13.6  11.7  4.6 15.6
ZNCC + SGM [hirschmuller2008stereo]  | 24.0   9.9   6.3  1.5  4.5| 25.6   9.9   6.3  1.6  4.5| 23.4   9.9   6.4  1.6  4.6
MC-CNN + SGM                         | 19.3   7.6   5.1  1.4  4.5| 21.2   7.2   4.7  1.4  4.4| 18.8   7.6   5.1  1.4  4.6
SweepNet + SGM                       | 15.4   6.8   4.8  1.1  3.8| 19.6   7.2   4.9  1.2  3.8| 14.8   7.0   4.9  1.2  3.9
III-C SweepNet
Local patch-based approaches, including the deep learning-based MC-CNN [zbontar2016stereo], fail on hard negative samples such as two identical-looking patches from different objects, since they do not consider holistic visual information. Moreover, in the wide-baseline setting, patches of the same surface may look different due to foreshortening or radiometric differences caused by viewing direction changes. To handle this, we propose SweepNet, which utilizes the global context in the images.
The architecture of the proposed network is detailed in Table I. As shown in Fig. 4, the input of the network is a pair of grayscale spherical images acquired from (1). To ensure that the horizontal ends are connected, we add circular column padding to the input spherical images. The conv1-18 layers are Siamese residual blocks [he2016deep] for learning unary feature extraction. We reduce the size of the input image by half for a larger receptive field, which helps the network learn from global context. The output feature maps are concatenated, and then the features are upsampled using transposed convolution. Finally, the network outputs the cost map, which ranges from 0 to 1, through fully connected layers and a sigmoid layer.

To train our network, we use the following approach. Given a set of ground-truth depth maps, the ground-truth inverse depth index n*(θ, φ) is obtained by quantizing the ground-truth inverse depth onto the N sampled spheres. Each position (θ, φ) on the n-th sphere is labeled as

l_n(θ, φ) = 1 if n = n*(θ, φ), and 0 otherwise.   (3)

We use the negative binary cross-entropy loss [robert2014machine]. For the labeled training set, the loss is defined as

L = − Σ [ l_n(θ, φ) log ŷ_n(θ, φ) + (1 − l_n(θ, φ)) log(1 − ŷ_n(θ, φ)) ],   (4)

where ŷ_n(θ, φ) is the predicted label at (θ, φ) with the input images corresponding to the n-th sphere. The loss is minimized by stochastic gradient descent with momentum.
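The labeling and loss above can be sketched in numpy (an illustrative sketch with our own helper names; the optional tolerance in `make_labels` is an assumption, the strict case being `tol=0`):

```python
import numpy as np

def make_labels(gt_index, n, tol=0):
    """Label map for the n-th sphere (Eq. 3): 1 where the ground-truth
    inverse depth index matches n (within an optional tolerance), else 0."""
    return (np.abs(np.asarray(gt_index) - n) <= tol).astype(np.float32)

def bce_loss(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy over sampled positions (Eq. 4).
    y_true: labels in {0, 1}; y_pred: sigmoid outputs in (0, 1)."""
    p = np.clip(y_pred, eps, 1.0 - eps)  # guard against log(0)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))
```

Since almost every position on a given sphere is a negative, balancing the number of positive and negative samples, as done in training, keeps this loss from being dominated by the negatives.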
IV Experimental Results
System Configuration We use four CCD cameras with 220° FOV fisheye lenses (Pointgrey CM3-U3-31S4C and Entaniya M12 220). Images can be captured at 30 Hz, and they are synchronized by a software trigger. For indoor experiments, we use a square-shaped rig ( mm), and for outdoor experiments, the cameras are installed at the four corners of the roof of a minivan, as shown in Fig. 1. For calibration, a checkerboard with grids of mm each is used.
Training In order to train the network, we create synthetic urban datasets with Blender, as shown in Fig. 5. Following [zhang2016benefit], we virtually implement a camera rig similar to our outdoor setting, as well as buildings, cars, and roads, in Blender, and render each frame as four images. The Sunny dataset consists of 1000 sequential frames of sunny city landscapes, and we split them into two parts: the first 700 frames for training and the latter 300 for testing. We also create separate test datasets with varying weather (Cloudy and Sunset) to test under different photometric conditions.
The input images are converted to grayscale, the intensity values are normalized to zero mean and unit variance, and they are warped to spherical images for training. We set the minimum depth d_min accordingly for the synthetic datasets. Among the training data, we randomly select 350 frames and train our network for 14 epochs. The learning rate is initially set for the first 11 epochs and reduced for the remaining epochs. We sample 192 inverse depths, and the corresponding ground-truth labels are acquired by (3). The inverse depths with too few positive labels are discarded, and the same number of positive and negative labels are used for training. In total, 92 million labels from the Sunny training dataset are used to train our network.
Evaluation We evaluate our method quantitatively on the synthetic datasets. The error of the inverse depth index is defined as

e(θ, φ) = | n̂(θ, φ) − n*(θ, φ) |,   (5)

where n̂ is the estimated index and n* is the ground-truth index. The number of inverse depths and the size of the spherical image are fixed for all methods in testing.
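The error metrics reported in Table II can be sketched as follows; interpreting the >1/>3/>5 columns as percentages of rays whose index error exceeds the threshold is our assumption:

```python
import numpy as np

def depth_index_errors(n_est, n_gt):
    """Per-ray inverse-depth-index error e = |n_est - n_gt| (Eq. 5), plus
    summary metrics: percentage of rays with e > k, mean absolute error,
    and root-mean-square error."""
    e = np.abs(np.asarray(n_est, dtype=float) - np.asarray(n_gt, dtype=float))
    return {
        ">1": float(np.mean(e > 1) * 100.0),
        ">3": float(np.mean(e > 3) * 100.0),
        ">5": float(np.mean(e > 5) * 100.0),
        "MAE": float(np.mean(e)),
        "RMS": float(np.sqrt(np.mean(e ** 2))),
    }
```

Because the spheres are sampled uniformly in inverse depth, an index error translates to an inverse-depth error, so errors at far range correspond to larger metric depth deviations than the same index error at near range.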
We compare the proposed SweepNet with other matching cost functions, ZNCC and MC-CNN [zbontar2016stereo], on the Sunny, Cloudy, and Sunset test sets. The ZNCC window size for local patches is fixed, and since high ZNCC values indicate well-matched patches, we negate and rescale the ZNCC score into a cost that ranges from 0 to 1. We train MC-CNN on the Sunny dataset with millions of pairs of local patches from the spherical images for 14 epochs with batch size 256, following the original literature [zbontar2016stereo]. We compare the accuracy of the depth maps with and without cost volume refinement by SGM [hirschmuller2008stereo]. Table II shows that SweepNet outperforms the other methods in all metrics. In particular, SweepNet with SGM gives the best and most robust results on all datasets.
Fig. 6 shows a cross section of the raw cost volumes of the matching cost functions (without SGM). The cost maps by ZNCC and MC-CNN have many false positives (green to blue colors outside the ground-truth depths), whereas SweepNet generates a much cleaner cost map.
In Fig. 7, the estimated inverse depth map is shown alongside the ground truth. The buildings, cars, and thin structures like traffic signs and poles are reconstructed successfully, and the sky and ground plane are also accurately estimated. In the wide-baseline setting, thin structures are especially challenging because the cost volume can contain multiple true matches for one ray direction: one at the thin object and another at the object behind it. Fig. 9 shows an example of full spherical depth estimation (φ varies from −90° to 90°), which is useful for drones that can move freely with all six degrees of freedom. One can verify that the scene, including the sky and ground, is precisely reconstructed, even when the equatorial plane of the sweep spheres is not aligned with the ground plane.
In addition, we qualitatively evaluate the proposed method with real-world data captured by our indoor and outdoor rigs. Fig. 8 shows the input images, the omnidirectional inverse depth map, the reprojected panorama image, and a 3D rendering of the point cloud.
In the indoor examples, one can see that the regions very close to the rig (such as the floor) are accurately reconstructed, and walls with little texture are also well recovered. The outdoor scenes are quite challenging since far objects appear very small in the input images, whereas near objects cover significant portions of the view. The estimated inverse depth maps show that SweepNet can reconstruct both far and near objects successfully. The panorama images at the bottom-left corners are constructed by projecting the estimated 3D points onto the input images, which shows how accurate the estimated depths are. Aside from the radiometric variations between cameras, it is hard to find any mismatches in the panorama images. The experiments show that the proposed method can effectively handle the wide-baseline omnidirectional depth estimation problem.
V Conclusions
We present a novel hardware system and stereo algorithm for omnidirectional depth estimation. The proposed hardware configuration consists of multiple widely placed cameras with wide-FOV fisheye lenses. After the intrinsic and extrinsic parameters are calibrated, the input images are warped into spherical images by projection onto virtual spheres positioned at the rig center with predefined radii. The proposed SweepNet considers the holistic visual content of the spherical images at each radius to build the cost map. With the training data from the synthetically rendered city dataset, SweepNet can be successfully trained. Extensive experiments show that SweepNet outperforms the local patch-based methods and robustly generates accurate depth maps in challenging situations.
Acknowledgment
This research was supported by the Samsung Research Funding & Incubation Center for Future Technology under Project Number SRFC-TC1603-05, the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (NRF-2017M3C4A7069369), and the NRF grant funded by the Korea government (MSIP) (NRF-2017R1A2B4011928).