Papers with code. Sorted by stars. Updated weekly.
In this paper, we consider the problem of reconstructing a dense 3D model using images captured from different views. Recent methods based on convolutional neural networks (CNN) allow learning the entire task from data. However, they do not incorporate the physics of image formation such as perspective geometry and occlusion. Instead, classical approaches based on Markov Random Fields (MRF) with ray-potentials explicitly model these physical processes, but they cannot cope with large surface appearance variations across different viewpoints. In this paper, we propose RayNet, which combines the strengths of both frameworks. RayNet integrates a CNN that learns view-invariant feature representations with an MRF that explicitly encodes the physics of perspective projection and occlusion. We train RayNet end-to-end using empirical risk minimization. We thoroughly evaluate our approach on challenging real-world datasets and demonstrate its benefits over a piece-wise trained baseline, hand-crafted models as well as other learning-based approaches.
Code for "RayNet: Learning Volumetric 3D Reconstruction with Ray Potentials", CVPR 2018
Passive 3D reconstruction is the task of estimating a 3D model from a collection of 2D images taken from different viewpoints. This is a highly ill-posed problem due to large ambiguities introduced by occlusions and surface appearance variations across different views.
Several recent works have approached this problem by formulating the task as inference in a Markov random field (MRF) with high-order ray potentials that explicitly model the physics of the image formation process along each viewing ray [35, 33, 19]. The ray potential encourages consistency between the pixel recorded by the camera and the color of the first visible surface along the ray. By accumulating these constraints from each input camera ray, these approaches estimate a 3D model that is globally consistent in terms of occlusion relationships.
While this formulation correctly models occlusion, the complex nature of inference in ray potentials restricts these models to pixel-wise color comparisons, which leads to large ambiguities in the reconstruction. Instead of using images as input, Savinov et al. utilize pre-computed depth maps using zero-mean normalized cross-correlation in a small image neighborhood. In this case, the ray potentials encourage consistency between the input depth map and the depth of the first visible voxel along the ray. While considering a large image neighborhood improves upon pixel-wise comparisons, our experiments show that such hand-crafted image similarity measures cannot handle complex variations of surface appearance.
In contrast, recent learning-based solutions to motion estimation [15, 24, 10], stereo matching [21, 42, 20] and 3D reconstruction [5, 37, 6, 16, 9] have demonstrated impressive results by learning feature representations that are much more robust to local viewpoint and lighting changes. However, existing methods exploit neither the physical constraints of perspective geometry nor the resulting occlusion relationships across viewpoints, and therefore require a large model capacity as well as an enormous amount of labelled training data.
This work aims at combining the benefits of a learning-based approach with the strengths of a model that incorporates the physical process of perspective projection and occlusion relationships. Towards this goal, we propose an end-to-end trainable architecture called RayNet which integrates a convolutional neural network (CNN) that learns surface appearance variations (e.g. across different viewpoints and lighting conditions) with an MRF that explicitly encodes the physics of perspective projection and occlusion. More specifically, RayNet uses a learned feature representation that is correlated with nearby images to estimate surface probabilities along each ray of the input image set. These surface probabilities are then fused using an MRF with high-order ray potentials that aggregates occlusion constraints across all viewpoints. RayNet is learned end-to-end using empirical risk minimization. In particular, errors are backpropagated to the CNN based on the output of the MRF. This allows the CNN to specialize its representation to the joint task while explicitly considering the 3D fusion process.
Unfortunately, naïve backpropagation through the unrolled MRF is intractable due to the large number of messages that need to be stored during training. We propose a stochastic ray sampling approach which allows efficient backpropagation of errors to the CNN. We show that the MRF acts as an effective regularizer and improves both the output of the CNN as well as the output of the joint model for challenging real-world reconstruction problems. Compared to existing MRF-based  or learning-based methods [16, 14], RayNet improves the accuracy of the 3D reconstruction by taking into consideration both local information around every pixel (via the CNN) as well as global information about the entire scene (via the MRF).
Our code and data are available on the project website: https://avg.is.tue.mpg.de/research_projects/raynet.
3D reconstruction methods can be roughly categorized into model-based and learning-based approaches, which learn the task from data. As a thorough survey on 3D reconstruction techniques is beyond the scope of this paper, we discuss only the most related approaches and refer to [13, 7, 29] for a more thorough review.
Ray-based 3D Reconstruction: Pollard and Mundy  propose a volumetric reconstruction method that updates the occupancy and color of each voxel sequentially for every image. However, their method lacks a global probabilistic formulation. To address this limitation, a number of approaches have phrased 3D reconstruction as inference in a Markov random field (MRF) by exploiting the special characteristics of high-order ray potentials [35, 33, 28, 19]. Ray potentials allow for accurately describing the image formation process, yielding 3D reconstructions consistent with the input images. Recently, Ulusoy et al.  integrated scene specific 3D shape knowledge to further improve the quality of the 3D reconstructions. A drawback of these techniques is that very simplistic photometric terms are needed to keep inference tractable, e.g., pixel-wise color consistency, limiting their performance.
In this work, we integrate such a ray-based MRF with a CNN that learns multi-view patch similarity. This results in an end-to-end trainable model that is more robust to appearance changes due to viewpoint variations, while tightly integrating perspective geometry and occlusion relationships across viewpoints.
[Figure: RayNet architecture. The forward pass is illustrated in green and the backpropagation pass is highlighted in red. The caption also defines the set of all cameras, the image dimensions and the maximum number of voxels along each ray.]
As most of the aforementioned methods solve the 3D reconstruction problem via recognizing the scene content, they are only applicable to object reconstruction and do not generalize well to novel object categories or full scenes. Towards a more general learning based model for 3D reconstruction, [17, 16] propose to unproject the input images into 3D voxel space and process the concatenated unprojected volumes using a 3D convolutional neural network. While these approaches take projective geometry into account, they do not explicitly exploit occlusion relationships across viewpoints, as proposed in this paper. Instead, they rely on a generic 3D CNN to learn this mapping from data. We compare to  in our experimental evaluation and obtain more accurate 3D reconstructions and significantly better runtime performance. In addition, the lightweight nature of our model’s forward inference allows for reconstructing scenes up to voxels resolution in a single pass. In contrast,  is limited to voxels and  requires processing large volumes using a sliding window, thereby losing global spatial relationships.
A major limitation of all aforementioned approaches is that they require full 3D supervision for training, which is quite restrictive. Tulsiani et al.  relax these assumptions by formulating a differentiable view consistency loss that measures the inconsistency between the predicted 3D shape and its observation. Similarly, Rezende et al.  propose a neural projection layer and a black box renderer for supervising the learning process. Yan et al.  and Gwak et al.  use 2D silhouettes as supervision for 3D reconstruction from a single image. While all these methods exploit ray constraints inside the loss function, our goal is to directly integrate the physical properties of the image formation process into the model via unrolled MRF inference with ray potentials. Thus, we are able to significantly reduce the number of parameters in the network and our network does not need to acquire these first principles from data.
The input to our approach is a set of images and their corresponding camera poses, which are obtained using structure-from-motion . Our goal is to model the known physical processes of perspective projection and occlusion, while learning the parameters that are difficult to model, e.g., those describing surface appearance variations across different views. Our architecture utilizes a CNN to learn a feature representation for image patches that are compared across nearby views to estimate a depth distribution for each ray in the input images. For all our experiments, we use nearby views. Due to the small size of each image patch, these distributions are typically noisy. We pass these noisy distributions to our MRF which aggregates them into an occlusion-consistent 3D reconstruction. We formulate inference in the MRF as a differentiable function, hence allowing end-to-end training using backpropagation.
We first specify our CNN architecture which predicts depth distributions for each input pixel/ray. We then detail the MRF for fusing these noisy measurements. Finally, we discuss appropriate loss functions and show how our model can be efficiently trained using stochastic ray sampling. The overall architecture is illustrated in Fig. 2.
CNNs have been proven successful for learning similarity measures between two or more image patches for stereo [40, 41, 20] as well as multi-view stereo . Similar to these works, we design a network architecture that estimates a depth distribution for every pixel in each view.
The network used is illustrated in Fig. 1(a) (left). The network first extracts a -dimensional feature per pixel in each input image using a fully convolutional network. The weights of this network are shared across all images. Each layer comprises convolution, spatial batch normalization and a ReLU non-linearity. We follow common practice and remove the ReLU from the last layer in order to retain information encoded both in the negative and positive range.
We then use these features to calculate per ray depth distributions for all input images. More formally, let denote a voxel grid which discretizes the 3D space. Let denote the complete set of rays in the input views. In particular, we assume one ray per image pixel. Thus, the cardinality of equals the number of input images times the number of pixels per image. For every voxel along each ray , we compute the surface probability by projecting the voxel center location into the reference view (i.e., the image which comprises pixel ) and into all adjacent views ( views in our experiments) as illustrated in Fig. 1(b). For clarity, Fig. 1(b) shows only out of adjacent cameras considered in our experimental evaluation. We obtain as the average inner product between all pairs of views. Note that is high if all views agree in terms of the learned feature representation. Thus, reflects the probability of a surface being located at voxel along ray . We abbreviate the surface probabilities for all rays of a single image with .
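The pairwise feature comparison described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `surface_probabilities`, the array layout, and the softmax normalization along the ray are assumptions made here so that the scores form a distribution. The text itself only states that the score is the average inner product between all pairs of views.

```python
import numpy as np
from itertools import combinations

def surface_probabilities(features):
    """features: array of shape (n_views, n_voxels, dim) holding, for one ray,
    the feature sampled in each view at the projection of each voxel center.
    Returns an array of shape (n_voxels,) that sums to 1."""
    n_views = features.shape[0]
    pairs = list(combinations(range(n_views), 2))
    # Average inner product over all unordered pairs of views, per voxel.
    scores = np.mean(
        [np.sum(features[a] * features[b], axis=-1) for a, b in pairs], axis=0
    )
    # Assumed normalization: softmax along the ray (numerically stabilized).
    scores = scores - scores.max()
    probs = np.exp(scores)
    return probs / probs.sum()

# Toy example: 3 views, 4 voxels, 2-D features; the views agree only at voxel 2.
rng = np.random.default_rng(0)
feats = rng.normal(scale=0.1, size=(3, 4, 2))
feats[:, 2, :] = np.array([2.0, 2.0])
s = surface_probabilities(feats)
```

In the toy example the score peaks at the voxel where all views carry the same strong feature, which is the behavior the text describes for a voxel lying on a surface.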
Due to the local nature of the extracted features as well as occlusions, the depth distributions computed by the CNN are typically noisy. Our MRF aggregates these depth distributions by exploiting occlusion relationships across all viewpoints, yielding significantly improved depth predictions. Occlusion constraints are encoded using high-order ray potentials [35, 28, 19]. We differ from [35, 19] in that we do not reason about surface appearance within the MRF but instead incorporate depth distributions estimated by our CNN. This allows for a more accurate depth signal and also avoids the costly sampling-based discrete-continuous inference in . We compare against  in our experimental evaluation and demonstrate improved performance.
We associate each voxel with a binary occupancy variable , indicating whether the voxel is occupied () or free (). For a single ray , let denote the ordered set of occupancy variables associated with the voxels which intersect ray . The order is defined by the distance to the camera.
The joint distribution over all occupancy variables factorizes into unary and ray potentials
where Z denotes the partition function. The corresponding factor graph is illustrated in Fig. 3.
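Written out, the factorization into unary and ray potentials would take the following form. The symbol names are assumptions chosen here: $o_i$ is the occupancy of voxel $i$, $\mathbf{o}_r$ the ordered occupancy variables along ray $r$, $\mathcal{X}$ the voxel grid, $\mathcal{R}$ the set of rays, and $\varphi_i$, $\psi_r$ the unary and ray potentials.

```latex
p(\mathbf{o}) = \frac{1}{Z} \prod_{i \in \mathcal{X}} \varphi_i(o_i) \prod_{r \in \mathcal{R}} \psi_r(\mathbf{o}_r)
```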
Unary Potentials: The unary potentials encode our prior belief about the state of the occupancy variables. We model
using a Bernoulli distribution
where is the probability that the th voxel is occupied.
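A Bernoulli unary potential of this kind can be written as follows, assuming $o_i \in \{0, 1\}$ denotes the occupancy of voxel $i$ and $\gamma_i$ the prior probability that it is occupied (both symbol names are chosen here for illustration):

```latex
\varphi_i(o_i) = \gamma_i^{\,o_i} \, (1 - \gamma_i)^{1 - o_i}
```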
Ray Potentials: The ray potentials encourage the predicted depth at pixel/ray to coincide with the first occupied voxel along the ray. More formally, we have
where is the probability that the visible surface along ray is located at voxel . This probability is predicted by the neural network described in Section 3.1. Note that the product over the occupancy variables in Eq. (3) equals if and only if is the first occupied voxel (i.e., if and for ). Thus is large if the surface is predicted at the first occupied voxel in the model.
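To make the "first occupied voxel" property concrete, the sketch below evaluates a ray potential of the form psi(o) = sum_i o_i * prod_{j<i} (1 - o_j) * s_i for a binary occupancy vector. The formula is reconstructed from the description above, and the names `ray_potential`, `occupancy` and `surface_prob` are illustrative rather than taken from the paper.

```python
import numpy as np

def ray_potential(occupancy, surface_prob):
    """occupancy: binary values ordered by distance to the camera.
    surface_prob: per-voxel surface probabilities along the same ray."""
    occupancy = np.asarray(occupancy, dtype=float)
    surface_prob = np.asarray(surface_prob, dtype=float)
    # free_before[i] = prod_{j<i} (1 - o_j): 1 while all closer voxels are free.
    free_before = np.concatenate(([1.0], np.cumprod(1.0 - occupancy)[:-1]))
    # Indicator that voxel i is the first occupied voxel along the ray.
    first_occupied = occupancy * free_before
    return float(np.sum(first_occupied * surface_prob))

s = [0.1, 0.2, 0.6, 0.1]
psi_good = ray_potential([0, 0, 1, 1], s)   # first occupied voxel is index 2
psi_bad = ray_potential([0, 1, 1, 0], s)    # first occupied voxel is index 1
psi_empty = ray_potential([0, 0, 0, 0], s)  # no occupied voxel on the ray
```

Only the first occupied voxel contributes, so the potential is largest when that voxel coincides with the peak of the predicted surface distribution.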
Inference: Having specified our MRF, we detail our inference approach in this model. Provided noisy depth measurements from the CNN (), the goal of the inference algorithm is to aggregate these measurements using ray-potentials into a 3D reconstruction and to estimate globally consistent depth distributions at every pixel.
Let denote the distance of the th voxel along ray to the respective camera center as illustrated in Fig. 3. Let further
be a random variable representing the depth along ray. In other words, denotes that the depth at ray is the distance to the th voxel along ray . We associate the occupancy and depth variables along a ray using the following equation:
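One relation consistent with this description selects the distance of the first occupied voxel along the ray. The notation is assumed here: $d_r$ is the depth along ray $r$, $d_i^r$ the distance of the $i$th voxel along that ray, $o_i^r$ its occupancy, and $N_r$ the number of voxels the ray intersects.

```latex
d_r = \sum_{i=1}^{N_r} o_i^r \prod_{j<i} \left(1 - o_j^r\right) d_i^r
```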
Our inference procedure estimates a probability distribution for each pixel/ray in the input views. Unfortunately, computing the exact solution in a loopy high-order model such as our MRF is NP-hard . We use loopy sum-product belief propagation for approximate inference. As demonstrated by Ulusoy et al. , belief propagation in models involving high-order ray potentials is tractable as the factor-to-variable messages can be computed in linear time due to the special structure of the ray potentials. In practice, we run a fixed number of iterations interleaving factor-to-variable and variable-to-factor message updates. We observed that convergence typically occurs after iterations and thus fix the iteration number to when unrolling the message passing algorithm. We refer to the supplementary material for message equation derivations.
We utilize empirical risk minimization to train RayNet. Let denote a loss function which measures the discrepancy between the predicted depth and the ground truth depth at pixel . The most commonly used metric for evaluating depth maps is the absolute depth error. We therefore use the ℓ1 loss to train RayNet. In particular, we seek to minimize the expected loss, also referred to as the empirical risk :
with respect to the model parameters . Here, denotes the set of ground truth pixels/rays in all images of all training scenes and is the depth distribution predicted by the model for ray of the respective image in the respective training scene. The parameters comprise the occupancy prior as well as parameters of the neural network.
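As a rough illustration of this objective, the snippet below computes the expected absolute depth error of each ray under a discrete depth distribution and averages it over rays. It assumes the per-ray risk has the form sum_i p(d_r = d_i) * |d_i - d*|, which matches the description above. The averaging over rays and all names are choices made here.

```python
import numpy as np

def expected_l1_loss(depth_probs, voxel_depths, gt_depth):
    """depth_probs, voxel_depths: shape (n_rays, n_voxels).
    gt_depth: shape (n_rays,). Returns the mean expected absolute error."""
    depth_probs = np.asarray(depth_probs, dtype=float)
    voxel_depths = np.asarray(voxel_depths, dtype=float)
    gt_depth = np.asarray(gt_depth, dtype=float)
    # |d_i - d*| for every voxel hypothesis on every ray.
    per_voxel_error = np.abs(voxel_depths - gt_depth[:, None])
    # Expectation of the error under the predicted depth distribution.
    per_ray_risk = np.sum(depth_probs * per_voxel_error, axis=1)
    return float(per_ray_risk.mean())

probs = [[0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
depths = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
risk = expected_l1_loss(probs, depths, gt_depth=[2.0, 1.0])
```

Because the loss is an expectation over the whole distribution rather than the error of a single arg-max depth, it stays differentiable with respect to the predicted probabilities.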
In order to train RayNet in an end-to-end fashion, we need to backpropagate the gradients through the unrolled MRF message passing algorithm to the CNN. However, a naïve implementation of backpropagation is not feasible due to memory limitations. In particular, backpropagation requires storing all intermediate messages from all belief-propagation iterations in memory. For a modest dataset of 50 images with pixel resolution and a voxel grid of size , this would require GB GPU memory, which is intractable using current hardware.
To tackle this problem, we perform backpropagation using mini-batches where each mini-batch is a stochastically sampled subset of the input rays. In particular, each mini-batch consists of rays randomly sampled from a subset of 10 consecutive input images. Our experiments show that learning with rays from neighboring views leads to faster convergence, as the network can focus on a small portion of the scene at a time. After backpropagation, we update the model parameters and randomly select a new set of rays for the next iteration. This approximates the true gradient of the mini-batch. The gradients are obtained using TensorFlow’s AutoDiff functionality.
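The sampling scheme can be sketched as follows. The batch size, the image dimensions and the function name `sample_ray_batch` are placeholders chosen for illustration. Only the window of 10 consecutive images is taken from the text, and rays are drawn with replacement for simplicity.

```python
import numpy as np

def sample_ray_batch(n_images, height, width, batch_size, window=10, rng=None):
    """Return (image_idx, row, col) arrays describing one mini-batch of rays
    drawn from `window` consecutive images starting at a random position."""
    rng = np.random.default_rng() if rng is None else rng
    window = min(window, n_images)
    # First image of the window; the upper bound of integers() is exclusive.
    start = rng.integers(0, n_images - window + 1)
    image_idx = rng.integers(start, start + window, size=batch_size)
    rows = rng.integers(0, height, size=batch_size)
    cols = rng.integers(0, width, size=batch_size)
    return image_idx, rows, cols

# Hypothetical sizes: 50 images of 480 x 640 pixels, 2000 rays per batch.
img, r, c = sample_ray_batch(n_images=50, height=480, width=640,
                             batch_size=2000, rng=np.random.default_rng(0))
```

A training loop would call this once per iteration, run the forward and backward pass only for the selected rays, and then draw a fresh batch.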
While training RayNet in an end-to-end fashion is feasible, we further speed it up by pretraining the CNN followed by fine-tuning the entire RayNet architecture. For pretraining, we randomly pick a set of pixels from a randomly chosen reference view for each mini-batch. We discretize the ray corresponding to each pixel according to the voxel grid and project all intersected voxel centers into the adjacent views as illustrated in Fig. 1(b). For backpropagation we use the same loss function as during end-to-end training, cf. Eq. (3.3).
In this Section, we present experiments evaluating our method on two challenging datasets. The first dataset consists of two scenes, BARUS&HOLLEY and DOWNTOWN, both of which were captured in urban environments from an aerial platform. The images, camera poses and LIDAR point cloud are provided by Restrepo et al. . Ulusoy et al. triangulated the LIDAR point cloud to achieve a dense ground truth mesh . In total the dataset consists of roughly 200 views with an image resolution of pixels. We use the BARUS&HOLLEY scene as the training set and reserve DOWNTOWN for testing.
Our second dataset is the widely used DTU multi-view stereo benchmark , which comprises 124 indoor scenes of various objects captured from 49 camera views under seven different lighting conditions. We evaluate RayNet on two objects from this dataset: BIRD and BUDDHA. For all datasets, we down-sample the images such that the largest dimension is 640 pixels.
We compare RayNet to various baselines both qualitatively and quantitatively. For the quantitative evaluation, we compute accuracy, completeness and per pixel mean depth error, and report the mean and the median for each. The first two metrics are estimated in 3D space, while the latter is defined in image space and averaged over all ground truth depth maps. In addition, we also report the Chamfer distance, which is a metric that expresses jointly the accuracy and the completeness. Accuracy is measured as the distance from a point in the reconstructed point cloud to its closest neighbor in the ground truth point cloud, while completeness is calculated as the distance from a point in the ground truth point cloud to its closest neighbor in the predicted point cloud. For additional information on these metrics we refer the reader to . We generate the point clouds for the accuracy and completeness computation by projecting every pixel from every view into the 3D space according to the predicted depth and the provided camera poses.
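These point-cloud metrics can be written compactly with brute-force nearest-neighbor search. The sketch below is illustrative only: the function names are invented here, and the dense distance matrix is practical only for small clouds. The Chamfer distance is taken to be the average of mean accuracy and mean completeness, an assumption that is consistent with the values reported in Table 1.

```python
import numpy as np

def nearest_distances(src, dst):
    """Distance from every point in src (n, 3) to its closest point in dst (m, 3)."""
    diff = src[:, None, :] - dst[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

def reconstruction_metrics(pred, gt):
    acc = nearest_distances(pred, gt)    # predicted -> ground truth
    comp = nearest_distances(gt, pred)   # ground truth -> predicted
    return {
        "accuracy_mean": float(acc.mean()),
        "accuracy_median": float(np.median(acc)),
        "completeness_mean": float(comp.mean()),
        "completeness_median": float(np.median(comp)),
        # Assumed definition: average of the two mean distances.
        "chamfer": float(0.5 * (acc.mean() + comp.mean())),
    }

gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])  # misses the third point
m = reconstruction_metrics(pred, gt)
```

In this toy case every predicted point lies on the ground truth, so accuracy is perfect, while the missing third point shows up only in the completeness term.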
In this Section, we validate our technique on the aerial dataset of Restrepo et al.  and compare it to various model-based and learning-based baselines.
Quantitative evaluation: We pretrain the neural network using the Adam optimizer  with a learning rate of and a batch size of 32 for 100K iterations. Subsequently, for end-to-end training, we use a voxel grid of size , the same optimizer with learning rate and batches of randomly sampled rays from consecutive images. RayNet is trained for K iterations. Note that due to the use of the larger batch size, RayNet is trained on approximately 10 times more rays compared to pretraining. During training we use a voxel grid of size to reduce computation. However, we run the forward pass with a voxel grid of size for the final evaluation.
We compare RayNet with two commonly used patch comparison measures: the sum of absolute differences (SAD) and the zero-mean normalized cross correlation (ZNCC). In particular, we use SAD or ZNCC to compute a cost volume per pixel and then choose the depth with the lowest cost to produce a depthmap. We use the publicly available code distributed by Haene et al.  for these two baselines. We follow  and compute the max score among all pairs of patch combinations, which allows some robustness to occlusions. In addition, we also compare our method with the probabilistic 3D reconstruction method by Ulusoy et al. , again using their publicly available implementation. We further compare against the learning based approach of Hartmann et al. , which we reimplemented and trained based on the original paper .
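For reference, the two hand-crafted measures can be sketched as below. This is not the code distributed by Haene et al.; the helper names and the toy patches are made up for illustration. SAD is a cost (lower is better), whereas ZNCC is a similarity in [-1, 1], so it is negated before picking the lowest-cost depth.

```python
import numpy as np

def sad(p, q):
    """Sum of absolute differences between two patches of equal shape."""
    return float(np.abs(p - q).sum())

def zncc(p, q, eps=1e-8):
    """Zero-mean normalized cross-correlation between two patches."""
    p = p - p.mean()
    q = q - q.mean()
    return float((p * q).sum() / (np.linalg.norm(p) * np.linalg.norm(q) + eps))

def winner_take_all(costs):
    """Index of the depth hypothesis with the lowest cost."""
    return int(np.argmin(costs))

ref = np.array([[1.0, 2.0], [3.0, 4.0]])
# One candidate patch per depth hypothesis: a flipped patch, a brightness and
# contrast change of the reference, and a constant patch.
candidates = [ref[::-1], 2.0 * ref + 5.0, np.ones((2, 2))]
sad_costs = [sad(ref, c) for c in candidates]
zncc_costs = [-zncc(ref, c) for c in candidates]
best_sad = winner_take_all(sad_costs)
best_zncc = winner_take_all(zncc_costs)
```

The toy patches show one way the two measures differ: ZNCC is unaffected by the gain and offset applied to the second candidate and selects it, while SAD is misled by the intensity change and prefers the constant patch.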
|Methods||Accuracy (mean)||Accuracy (median)||Completeness (mean)||Completeness (median)||Mean Depth Error (mean)||Mean Depth Error (median)||Chamfer|
|Ulusoy et al.||0.0790||0.0167||0.0088||0.0065||0.1143||0.1050||0.0439|
|Hartmann et al.||0.0907||0.0285||0.0209||0.0209||0.1648||0.1222||0.0558|
Table 1 summarizes accuracy, completeness and per pixel mean depth error for all implemented baselines. In addition to the aforementioned baselines, we also compare our full RayNet approach Ours (CNN+MRF) with our CNN frontend in isolation, denoted Ours (CNN). We observe that joint optimization of the full model improves upon the CNN frontend. Furthermore, RayNet outperforms both the classic as well as learning-based baselines in terms of accuracy, mean depth error and Chamfer distance while performing on par with most baselines in terms of completeness.
Qualitative Evaluation: We visualize the results of RayNet and the baselines in Fig. 4. Both the ZNCC baseline and the approach of Hartmann et al.  require a large receptive field size for optimal performance, yielding smooth depth maps. However, this large receptive field also causes bleeding artefacts at object boundaries. In contrast, our baseline CNN and the approach of Ulusoy et al.  yield sharper boundaries, while exhibiting a larger level of noise. By combining the advantages of learning-based descriptors with a small receptive field and MRF inference, our full RayNet approach (CNN+MRF) results in significantly smoother reconstructions while retaining sharp object boundaries. Additional results are provided in the supplementary material.
In this section, we provide results on the BIRD and BUDDHA scenes of the DTU dataset. We use the provided splits to train and test RayNet. We evaluate SurfaceNet  at two different resolutions: the original high resolution variant, which requires more than hours to reconstruct a single scene, and a faster variant that uses approximately the same resolution () as our approach. We refer to the first one as SurfaceNet (HD) and to the latter as SurfaceNet (LR). We test SurfaceNet with the pretrained models provided by .
We evaluate both methods in terms of accuracy and completeness and report the mean and the median in Table 2. Our full RayNet model outperforms nearly all baselines in terms of completeness, while it performs worse in terms of accuracy. We believe this is due to the fact that both  and  utilize the original high resolution images, while our approach operates on downsampled versions of the images. We observe that some of the fine textural details that are present in the original images are lost in the downsampled versions. Besides, our current implementation of the 2D feature extractor uses a smaller receptive field size ( pixels) compared to SurfaceNet () and  (). Finally, while our method aims at predicting a complete 3D reconstruction inside the evaluation volume (resulting in occasional outliers), and  prune unreliable predictions from the output and return an incomplete reconstruction (resulting in higher accuracy and lower completeness).
Fig. 5 shows a qualitative comparison between SurfaceNet (LR) and RayNet. It can be clearly observed that SurfaceNet often cannot reconstruct large parts of the object (e.g., the wings in the BIRD scene or part of the head in the BUDDHA scene) which it considers as unreliable. In contrast, RayNet generates more complete 3D reconstructions. RayNet is also significantly better at capturing object boundaries compared to SurfaceNet.
|Methods||Accuracy (mean)||Accuracy (median)||Completeness (mean)||Completeness (median)|
|SurfaceNet (HD)||0.738||0.574||0.677||0.505|
|SurfaceNet (LR)||2.034||1.676||1.453||1.141|
|Hartmann et al.||0.637||0.206||1.057||0.475|
|Ulusoy et al.||4.784||3.522||0.953||0.402|

|Methods||Accuracy (mean)||Accuracy (median)||Completeness (mean)||Completeness (median)|
|SurfaceNet (HD)||1.493||1.249||1.612||0.888|
|SurfaceNet (LR)||2.887||2.468||2.330||1.556|
|Hartmann et al.||1.881||0.271||4.167||1.044|
|Ulusoy et al.||6.024||4.623||2.966||0.898|
The runtime and memory requirements of RayNet depend mainly on three factors: the number of voxels, the number of input images and the number of pixels/rays per image, which is typically equal to the image resolution. All our experiments were computed on an Intel i7 computer with an Nvidia GTX Titan X GPU. We train RayNet end-to-end, which takes roughly day and requires GB per mini-batch update for the DTU dataset. Once the network is trained, it takes approximately minutes to obtain a full reconstruction of a typical scene from the DTU dataset. In contrast, SurfaceNet (HD)  requires more than hours for this task. SurfaceNet (LR), which operates at the same voxel resolution as RayNet, requires hours.
We propose RayNet, which is an end-to-end trained network that incorporates a CNN that learns multi-view image similarity with an MRF with ray potentials that explicitly models perspective projection and enforces occlusion constraints across viewpoints. We directly embed the physics of multi-view geometry into RayNet. Hence, RayNet is not required to learn these complex relationships from data. Instead, the network can focus on learning view-invariant feature representations that are very difficult to model. Our experiments indicate that RayNet improves over learning-based approaches that do not incorporate multi-view geometry constraints and over model-based approaches that do not learn multi-view image matching.
Our current implementation precludes training models with a higher resolution than voxels. While this resolution is finer than most existing learning-based approaches, it does not allow capturing high resolution details in large scenes. In future work, we plan to adapt our method to higher resolution outputs using octree-based representations [31, 12, 35]. We also would like to extend RayNet to predict a semantic label per voxel in addition to occupancy. Recent works show such a joint prediction improves over reconstruction or segmentation in isolation [28, 30].
This research was supported by the Max Planck ETH Center for Learning Systems.
Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
Learning a predictable and generative vector representation for objects. In Proc. of the European Conf. on Computer Vision (ECCV), 2016.
Efficient deep learning for stereo matching. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016.
Journal of Machine Learning Research (JMLR), 17(65):1–32, 2016.