Photometric Mesh Optimization for Video-Aligned 3D Object Reconstruction (CVPR 2019)
In this paper, we address the problem of 3D object mesh reconstruction from RGB videos. Our approach combines the best of multi-view geometric and data-driven methods for 3D reconstruction by optimizing object meshes for multi-view photometric consistency while constraining mesh deformations with a shape prior. We pose this as a piecewise image alignment problem for each mesh face projection. Our approach allows us to update shape parameters from the photometric error without any depth or mask information. Moreover, we show how to avoid a degeneracy of zero photometric gradients via rasterizing from a virtual viewpoint. We demonstrate 3D object mesh reconstruction results from both synthetic and real-world videos with our photometric mesh optimization, which is unachievable with either naïve mesh generation networks or traditional pipelines of surface reconstruction without heavy manual post-processing.
The choice of 3D representation plays a crucial role in 3D reconstruction problems from 2D images. Classical multi-view geometric methods, most notably structure from motion (SfM) and SLAM, recover point clouds as the underlying 3D structure of RGB sequences, often with very high accuracy [10, 30]. Point clouds, however, lack inherent 3D spatial structure that is essential for efficient reasoning. In many scenarios, mesh representations are more desirable – they are significantly more compact since they have inherent geometric structures defined by point connectivity, while they also represent continuous surfaces necessary for many applications such as robotics (e.g., accurate localization for autonomous driving), computer graphics (e.g., physical simulation, texture synthesis), and virtual/augmented reality.
Another drawback of classical multi-view geometric methods is their reliance on hand-designed features, which makes them fragile when their assumptions are violated. This happens especially in textureless regions or when there are changes in illumination. Data-driven approaches [5, 15], on the other hand, learn priors to tackle ill-posed 3D reconstruction problems and have recently been widely applied to 3D prediction tasks from single images. However, they can only reliably reconstruct from the space of training examples they learn from, resulting in limited ability to generalize to unseen data.
In this work, we address the problem of 3D mesh reconstruction from image sequences by bringing together the best attributes of multi-view geometric methods and data-driven approaches (Fig. 1). Focusing on object instances, we use shape priors (specifically, neural networks) to reconstruct geometry with incomplete observations as well as multi-view geometric constraints to refine mesh predictions on the input sequences. Our approach allows dense reconstruction with object semantics from learned priors, which is not possible from the traditional pipelines of surface meshing from multi-view stereo (MVS). Moreover, our approach generalizes to unseen objects by utilizing multi-view geometry to enforce observation consistency across viewpoints.
Given only RGB information, we achieve mesh reconstruction from image sequences by photometric optimization, which we pose as a piecewise image alignment problem of individual mesh faces. To avoid degeneracy, we introduce a novel virtual viewpoint rasterization to compute photometric gradients with respect to mesh vertices for 3D alignment, allowing the mesh to deform to the observed shape. A main advantage of our photometric mesh optimization is its non-reliance on any a-priori known depth or mask information [20, 35, 38] – a necessary condition to be able to reconstruct objects from real-world images. With this, we take a step toward practical usage of prior-based 3D mesh reconstruction aligned with RGB sequences.
In summary, we present the following contributions:
We incorporate multi-view photometric consistency with data-driven shape priors for optimizing 3D meshes using 2D photometric cues.
We propose a novel photometric optimization formulation for meshes and introduce a virtual viewpoint rasterization step to avoid gradient degeneracy.
Finally, we show 3D object mesh reconstruction results from both synthetic and real-world sequences, unachievable with either naïve mesh generators or traditional MVS pipelines without heavy manual post-processing.
Our work on object mesh reconstruction touches several areas, including multi-view object reconstruction, mesh optimization, deep shape priors, and image alignment.
Multi-view object reconstruction. Multi-view calibration and reconstruction is a well-studied problem. Most approaches begin by estimating camera coordinates using 2D keypoint matching, a process known as SLAM [10, 29] or SfM [12, 32], followed by dense reconstruction methods such as MVS and meshing. More recent works using deep learning have explored 3D reconstruction from multiple-view consistency between various forms of 2D observations [24, 34, 35, 38, 41]. These methods all utilize forms of 2D supervision that are easier to acquire than 3D CAD models, which are relatively limited in quantity. Our approach uses both geometric and image-based constraints, which allows it to overcome common multi-view limitations such as missing observations and textureless regions.
Mesh optimization. Mesh optimization dates back to the classical works of Active Shape Models and Active Appearance Models [6, 28], which use 2D meshes to fit facial landmarks. In this work, we optimize for 3D meshes using 2D photometric cues, a significantly more challenging problem due to the inherent ambiguities in the task. Similar approaches for mesh refinement have also been explored [8, 9]; however, they require a sufficiently good initialization and allow only very small vertex perturbations. As we show in our experiments, we are able to handle larger amounts of noise perturbation by optimizing over a latent shape code instead of mesh vertices, making our approach more suitable for practical use.
Several recent methods have addressed learning 3D reconstruction with mesh representations. AtlasNet  and Pixel2Mesh  are examples of learning mesh object reconstructions from 3D CAD models. Meanwhile, Neural Mesh Renderer  suggested a method of mesh reconstruction via approximate gradients for 2D mask optimization, and Kanazawa et al.  further advocated learning mesh reconstruction from 2D supervision of textures, masks, and 2D keypoints. Our approach, in contrast, does not assume any availability of masks or keypoints and operates purely via photometric cues across viewpoints.
Shape priors. The use of neural networks as object priors for reconstruction has recently been explored with point clouds . However, it requires object masks as additional constraints during optimization. We eliminate the need for mask supervision by regularizing the latent code. Shape priors have also been explored for finding shape correspondences , where the network learns the deformation field from a template shape to match 3D observations. In our method, we directly optimize the latent shape code to match 2D cues from multiple viewpoints and do not require a known shape template for the object. A plane and primitive prior has been used for the challenging task of multi-view scene reconstruction . Although the primitive prior does not need to be learned from an object dataset, the resulting reconstruction can differ significantly from the target geometry when it is not well represented by the chosen primitives.
Image alignment. The most generic form of image alignment refers to prediction of inherent geometric misalignment between a pair of images. Image alignment using simple warping functions can be dated back to the seminal Lucas-Kanade algorithm  and its recent variants [1, 26]. Recent work has also explored learning a warp function to align images from neural networks for applications such as novel view synthesis [39, 40] and learning invariant representations [19, 25]. In this work, we pose our problem of mesh optimization as multiple image alignment problems of mesh faces, and solve it by optimizing over a latent code from a deep network rather than the vertices themselves.
We seek to reconstruct a 3D object mesh from an RGB sequence, where each frame is associated with a camera matrix. In this work, we assume that the camera matrices can be readily obtained from off-the-shelf SfM methods. Fig. 2 provides an overview: we optimize for object meshes that maximize multi-view photometric consistency over a shape prior, for which we use a pretrained mesh generator. We focus on triangular meshes here, although our method is applicable to any mesh type.
Direct optimization on a 3D mesh with N vertices involves solving for 3N degrees of freedom (DoFs) and typically becomes underconstrained when N is large. Therefore, reducing the allowed DoFs is crucial to ensure mesh deformations are well-behaved during optimization. We wish to represent the mesh as a differentiable function of a reduced vector representation.
We propose to use an off-the-shelf generative neural network as the main part of and reparameterize the mesh with an associated latent code , where . The network serves as an object shape prior whose efficacy comes from pretraining on external shape datasets. Shape priors over point clouds have been previously explored ; here, we extend to mesh representations. We use AtlasNet  here although other mesh generators are also applicable. The shape prior allows the predicted 3D mesh to deform within a learned shape space, avoiding many local minima that exist with direct vertex optimization. To utilize RGB information from the given sequence for photometric optimization, we further add a 3D similarity transform to map the generated mesh to world cameras recovered by SfM (see Sec. 3.4).
We define our optimization problem as follows: given the RGB image sequence and cameras , we optimize a regularized cost consisting of a photometric loss for all pairs of frames over the representation , formulated as
where is a regularization term on . This objective allows the generated mesh to deform with respect to an effective shape prior. We describe each term in detail next.
Optimizing the mesh with the photometric loss is based on the assumption that a dense 2D projection of the individual triangular faces of a 3D mesh should be globally consistent across multiple viewpoints. Therefore, we cast the problem of 3D mesh alignment to the input views as a collection of piecewise 2D image alignment subproblems of each projected triangular face (Fig. 2).
To perform piecewise 2D image alignment between and , we need to establish pixel correspondences. We first denote as the 3D vertices of triangle in mesh , defined as column vectors. From triangle , we can sample a collection of 3D points that lie within triangle , related via through the barycentric coordinates . For a camera , let be the projection function mapping a world 3D point to 2D image coordinates. The pixel intensity error between the two views and can be compared at the 2D image coordinates corresponding to the projected sampled 3D points. We formulate the photometric loss as the sum of distances between pixel intensities at these 2D image coordinates over all triangular faces,
As such, we can optimize the photometric loss with pixel correspondences established as a function of .
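The piecewise alignment described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes grayscale images, 3x4 camera matrices, an L1 distance between intensities, and a regular barycentric sampling grid, and it ignores visibility. The function names (`sample_triangle`, `project`, `bilinear_sample`, `photometric_loss`) are hypothetical.

```python
import numpy as np

def sample_triangle(vertices, n=4):
    """Sample 3D points inside a triangle via barycentric coordinates.

    vertices: (3, 3) array, one row per vertex.
    Returns the sampled points and their barycentric weights.
    """
    # Regular barycentric grid: weights are non-negative and sum to 1.
    weights = np.array([(i, j, n - i - j)
                        for i in range(n + 1)
                        for j in range(n + 1 - i)], dtype=float) / n
    return weights @ vertices, weights

def project(camera, points):
    """Project world points (M, 3) to pixels (M, 2) with a 3x4 camera matrix."""
    homog = np.hstack([points, np.ones((len(points), 1))])
    uvw = homog @ camera.T
    return uvw[:, :2] / uvw[:, 2:3]

def bilinear_sample(image, pixels):
    """Bilinearly sample a grayscale image (H, W) at float (x, y) coordinates."""
    h, w = image.shape
    x = np.clip(pixels[:, 0], 0, w - 1)
    y = np.clip(pixels[:, 1], 0, h - 1)
    x0 = np.minimum(np.floor(x).astype(int), w - 2)
    y0 = np.minimum(np.floor(y).astype(int), h - 2)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * image[y0, x0] + fx * image[y0, x0 + 1]
    bottom = (1 - fx) * image[y0 + 1, x0] + fx * image[y0 + 1, x0 + 1]
    return (1 - fy) * top + fy * bottom

def photometric_loss(triangles, image_a, cam_a, image_b, cam_b, n=4):
    """Sum of L1 intensity differences over points sampled on every face.

    Assumption: every sample is visible in both views (no occlusion test).
    """
    loss = 0.0
    for tri in triangles:
        points, _ = sample_triangle(tri, n)
        ia = bilinear_sample(image_a, project(cam_a, points))
        ib = bilinear_sample(image_b, project(cam_b, points))
        loss += np.abs(ia - ib).sum()
    return loss
```

Because the sampled points are linear combinations of the triangle vertices, the loss is a function of the vertices, which is what lets a gradient-based optimizer move the mesh.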
Visibility. As a 3D point may not be visible in a given view due to possible object self-occlusion, we handle visibility by restricting the samples in each triangle to those whose projections are visible in both views. We achieve this by returning a mesh index map using mesh rasterization, a standard operation in computer graphics, at each optimization step. The photometric gradients of each sampled point in turn backpropagate to the vertices through the chain rule: we obtain the image gradient through differentiable image sampling, the projection Jacobian by taking the derivative of the projection, and the dependence on the vertices through the barycentric coordinates. We note that the entire process is differentiable and does not resort to approximate gradients.
We can efficiently sample a large number of 3D points in triangle by rendering the depth of from a given view using mesh rasterization (Sec. 3.2). If the depth were rasterized from either input view or , however, we would obtain zero photometric gradients. This degeneracy arises due to the fact that ray-casting from one view and projecting back to the same view results in .
To elaborate, we first note that depth rasterization of triangle is equivalent to back-projecting regular grid coordinates to triangle . We can express each depth point from camera as , where is the inverse projection function realized by solving for ray-triangle intersection with . Combining with the projection equation, we have
becoming the identity mapping and losing the dependency of on , which in turn leads to . This insight is in line with the recent observation from Ham et al. .
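The degeneracy can be checked numerically with a small sketch. It assumes a simplified axis-aligned pinhole camera and treats the triangle as its supporting plane; `make_camera`, `project` and `backproject` are hypothetical helpers, not part of any library. Ray-casting a pixel from a camera and projecting back into the same camera returns the same pixel wherever the triangle is, whereas ray-casting from a different camera gives a pixel that moves with the triangle.

```python
import numpy as np

def make_camera(center, focal=100.0, principal=32.0):
    """Axis-aligned pinhole camera looking along +z from `center` (toy setup)."""
    return {"center": np.asarray(center, dtype=float), "f": focal, "c": principal}

def project(cam, point):
    """World 3D point -> 2D pixel."""
    p = point - cam["center"]
    return cam["f"] * p[:2] / p[2] + cam["c"]

def backproject(cam, pixel, triangle):
    """Pixel -> 3D point on the triangle's plane via ray-plane intersection."""
    direction = np.append((np.asarray(pixel, dtype=float) - cam["c"]) / cam["f"], 1.0)
    normal = np.cross(triangle[1] - triangle[0], triangle[2] - triangle[0])
    t = normal @ (triangle[0] - cam["center"]) / (normal @ direction)
    return cam["center"] + t * direction

cam_a = make_camera([0.0, 0.0, 0.0])      # an input view
cam_other = make_camera([0.5, 0.0, 0.0])  # a different viewpoint
tri = np.array([[-1.0, -1.0, 5.0], [1.0, -1.0, 5.0], [0.0, 1.0, 5.0]])
moved = tri + np.array([0.0, 0.0, 1.0])   # push the triangle one unit deeper
pixel = np.array([30.0, 34.0])

# Same view: the round trip is the identity, so the vertices have no effect.
same_before = project(cam_a, backproject(cam_a, pixel, tri))
same_after = project(cam_a, backproject(cam_a, pixel, moved))

# Different view: the projected pixel now depends on the vertex positions.
other_before = project(cam_a, backproject(cam_other, pixel, tri))
other_after = project(cam_a, backproject(cam_other, pixel, moved))
```

In this toy setup `same_before` and `same_after` both equal `pixel`, while `other_before` and `other_after` differ, which is the dependence on the vertices that a non-zero photometric gradient requires.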
To overcome this degeneracy, we rasterize the depth from a third virtual viewpoint . This step allows correct gradients to be computed in both viewpoints and , which is essential to maintain stability during optimization. We can form the photometric loss by synthesizing the image appearance at using the pixel intensities from both and (Fig. 3). We note that can be arbitrarily chosen. In practice, we choose to be the bisection between and by applying Slerp  on the rotation quaternions and averaging the two camera centers.
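A minimal sketch of the bisection step follows, assuming camera rotations are given as unit quaternions in (w, x, y, z) order and camera centers as 3-vectors. The function names are hypothetical, and the near-parallel fallback threshold is an implementation choice rather than something stated in the text.

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions (w, x, y, z)."""
    q0 = np.asarray(q0, dtype=float)
    q1 = np.asarray(q1, dtype=float)
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = float(q0 @ q1)
    if dot < 0.0:
        # q and -q encode the same rotation; flip to follow the shorter arc.
        q1, dot = -q1, -dot
    if dot > 0.9995:
        # Nearly identical rotations: normalized linear interpolation is stable.
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1.0 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

def virtual_viewpoint(q_a, center_a, q_b, center_b):
    """Bisect two cameras: Slerp at t = 0.5 and the mean of the camera centers."""
    q_v = slerp(q_a, q_b, 0.5)
    c_v = 0.5 * (np.asarray(center_a, dtype=float) + np.asarray(center_b, dtype=float))
    return q_v, c_v
```

For example, bisecting an identity rotation and a 90 degree rotation about the z-axis yields a 45 degree rotation about the same axis.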
Coordinate systems. Mesh predictions from a generative network typically lie in a canonical coordinate system [15, 36] independent of the world cameras recovered by SfM. Therefore, we need to account for an additional 3D similarity transform applied to the mesh vertices. For each 3D vertex from the prediction, we define the similarity transform as
where are the parameters and is a 3D rotation matrix parameterized with the Lie algebra. We optimize for together, where is the latent code associated with the generative network.
Since automated registration of noisy 3D data with unknown scales is still an open problem, we assume a coarse alignment of the coordinate systems can be computed from minimal annotation of rough correspondences (see Sec. 4.3 for details). We optimize for the similarity transform to more accurately align the meshes to the RGB sequences.
Regularization. Although neural networks are effective priors, the latent space is only spanned by the training data. To prevent meshes from reaching a degenerate solution, we impose an extra penalty on the latent code to ensure it stays within a trust region of the initial code (extracted from a pretrained image encoder), defined as . We also add a scale penalty that encourages the mesh to expand, since the mesh shrinking to an infinitesimal size is a trivial solution with zero photometric error. The regularization in cost (1) is written as
where and are the penalty weights.
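The two penalties can be sketched as follows. The exact functional forms are not recoverable from the text, so this sketch assumes a squared Euclidean distance to the initial code and a penalty equal to the negative of the scale parameter (which is exponentiated before being applied to the mesh); both forms, and all names, are assumptions for illustration.

```python
import numpy as np

def regularization(z, z0, log_scale, weight_code, weight_scale):
    """Weighted sum of a code trust-region penalty and a scale penalty.

    Assumed forms (not confirmed by the text):
      code penalty  = ||z - z0||^2  keeps the code near its initialization
      scale penalty = -log_scale    rewards larger meshes, discouraging collapse
    """
    code_penalty = np.sum((np.asarray(z) - np.asarray(z0)) ** 2)
    scale_penalty = -log_scale
    return weight_code * code_penalty + weight_scale * scale_penalty
```

Under these assumptions the cost grows as the code drifts from its initialization and falls as the mesh expands, which counteracts the trivial shrink-to-a-point solution.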
Data preparation. We create datasets of 3D CAD model renderings for training a mesh generation network and evaluating our optimization framework. Our rendering pipeline aims to create realistic images with complex backgrounds so they could be applied to real-world video sequences. We use ShapeNet  for the object dataset and normalize all objects to fit an origin-centered unit sphere. We render RGB images of each object using perspective cameras at 24 equally spaced azimuth angles and 3 elevation angles.
To simulate realistic backgrounds, we randomly warp and crop spherical images from the SUN360 database to create background images of the same scene taken at different camera viewpoints. By compositing the foreground and background images together at corresponding camera poses, we obtain RGB sequences of objects composited on realistic textured backgrounds (Fig. 4). Note that we do not keep any mask information that was accessible in the rendering and compositing process, as such information is typically not available in real-world examples. All images are rendered/cropped at a resolution of 224×224.
Shape prior. We use AtlasNet  as the base network architecture for mesh generation, which we retrain on our new dataset. We use the same 80%-20% training/test split from Groueix et al.  and additionally split the SUN360 spherical images with the same ratio. During training, we augment background images at random azimuth angles.
Initialization. We initialize the code by encoding an RGB frame with the AtlasNet encoder. For ShapeNet sequences, we choose frames with objects facing sideways. For real-world sequences, we manually select frames where objects are center-aligned to the images as much as possible to match our rendering settings. We initialize the similarity transform parameters to (identity transform).
Evaluation criteria. We evaluate the result by measuring the 3D distances between the sampled 3D points from the predicted meshes and the ground-truth point clouds . We follow Lin et al.  by reporting the 3D error between the predicted and ground-truth point clouds as for some source and target point sets and , respectively. This metric measures the prediction shape accuracy when is the prediction and is the ground truth, while it indicates the prediction shape coverage when vice versa. We report quantitative results in both directions separately averaged across all instances.
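The directional nature of this metric can be illustrated with a short sketch. It assumes the error from a source set to a target set is the mean Euclidean distance from each source point to its nearest target point; the exact normalization used in the evaluation is not recoverable from the text, and `directed_error` is a hypothetical name.

```python
import numpy as np

def directed_error(source, target):
    """Mean distance from each source point to its nearest target point.

    source: (M, 3) array, target: (N, 3) array.
    Brute-force O(M * N) pairwise distances; fine for small point sets.
    """
    diffs = source[:, None, :] - target[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    return dists.min(axis=1).mean()

pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0]])

accuracy = directed_error(pred, gt)  # prediction -> ground truth
coverage = directed_error(gt, pred)  # ground truth -> prediction
```

Here every predicted point lies on the ground truth, so the accuracy error is zero, but the prediction misses the ground-truth point at x = 5, so the coverage error is non-zero. This asymmetry is why both directions are reported separately.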
We start by evaluating our mesh alignment in a category-specific setting. We select the car, chair, and plane categories from ShapeNet, consisting of 703, 1356, and 809 objects in our test split, respectively. For each object, we create an RGB sequence by overlaying its rendering onto a randomly paired SUN360 scene with the cameras in correspondence. We retrain each category-specific AtlasNet model on our new dataset using the default settings for 500 epochs. During optimization, we use the Adam optimizer with a constant learning rate of for 100 iterations. We manually set the penalty factors to be and .
One challenge is that the coordinate system for a mesh generated by AtlasNet is independent of the recovered world cameras for a real-world sequence. Determining such a coordinate system mapping (defined by a 3D similarity transform) is required to relate the predicted mesh to the world. For the synthetic sequences, on the other hand, we know the exact mapping, as we can render the views for AtlasNet and the input views in the same coordinate system.
For our first experiment, we simulate the possibly incorrect mapping estimates by perturbing the ground-truth 3D similarity transform by adding Gaussian noise to its parameters, pre-generated per sequence for evaluation. We evaluate the 3D error metrics under such perturbations. Note that our method utilizes no additional information other than the RGB information from the given sequences.
We compare our mesh reconstruction approach against three baseline variants of AtlasNet: (a) mesh generations from a single-image feed-forward initialization, (b) generation from the mean latent code averaged over all frames in the sequence, and (c) the mean shape where vertices are averaged from the mesh generation across all frames.
We show qualitative results in Fig. 5 (compared under perturbation ). Our method is able to take advantage of multi-view geometry to resolve large misalignments and optimize for more accurate shapes. The high photometric error from the background between views discourages mesh vertices from staying in such regions. This error serves as a natural force to constrain the mesh within the desired 3D regions, eliminating the need for depth or mask constraints during optimization. We further visualize our mesh reconstruction with textures that are estimated from all images (Fig. 6). Note that the fidelity of the mean textures increases while the variance in textures decreases after optimization.
We evaluate quantitatively in Fig. 7, where we plot the average 3D error over mapping noise. This result demonstrates how our method handles inaccurate coordinate system mappings to successfully match the meshes against RGB sequences. We also ablate optimizing the latent code, showing that allowing shape deformation improves reconstruction quality over optimizing a 3D similarity transform alone (“fixed code” in Fig. 7). Note that our method is slightly worse in shape coverage error (GT→pred.) when evaluated at the ground-truth mapping. We attribute this result to a limitation of photometric optimization, which opts for degenerate solutions when objects are insufficiently textured.
We extend beyond a model that reconstructs a single object category by training a single model to reconstruct multiple object categories. We take 13 commonly chosen CAD model categories from ShapeNet [5, 11, 15, 24]. We follow the same settings as in Sec. 4.1 except we retrain AtlasNet longer for 1000 epochs due to a larger training set.
We show visual results in Fig. 8 on the efficacy of our method for multiple object categories (under perturbation ). Our results show how we can reconstruct a shape that better matches our RGB observations (e.g., refining hollow regions, as in the bench backs and table legs). We also show category-wise quantitative results in Table 1, compared under perturbation noise . We find photometric optimization to perform effectively across most categories except lamps, which consist of many examples where optimizing for thin structures is hard for photometric loss.
Finally, we demonstrate the efficacy of our method on challenging real-world video sequences orbiting an object. We use a dataset of RGB-D object scans , where we use the chair model to evaluate on the chair category. We select the subset of video sequences that are 3D-reconstructible using traditional pipelines  and where SfM extracts at least 20 reliable frames and 100 salient 3D points. We retain 82 sequences with sufficient quality for evaluation. We rescale the sequences to and skip every 10 frames.
We compute the camera extrinsic and intrinsic matrices using off-the-shelf SfM with COLMAP . For evaluation, we additionally compute a rough estimate of the coordinate system mapping by annotating 3 corresponding points between the predicted mesh and the sparse points extracted from SfM (Fig. 9), which allows us to fit a 3D similarity transform. We optimize using Adam with a learning rate of 2e-3 for 200 iterations, and we manually set the penalty factors to be and .
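The text does not state which algorithm fits the 3D similarity transform to the annotated correspondences. One common closed-form choice is the least-squares solution of Umeyama (1991), sketched below as an assumption; `fit_similarity` is a hypothetical name. Three non-collinear correspondences are the minimum needed for a unique solution.

```python
import numpy as np

def fit_similarity(source, target):
    """Least-squares scale, rotation, translation with target ~ s * R @ source + t.

    source, target: (N, 3) arrays of corresponding points, N >= 3, non-collinear.
    """
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    src, tgt = source - mu_s, target - mu_t
    # Cross-covariance between the centered point sets.
    cov = tgt.T @ src / len(source)
    U, D, Vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_s = (src ** 2).sum() / len(source)
    scale = np.trace(np.diag(D) @ S) / var_s
    t = mu_t - scale * R @ mu_s
    return scale, R, t
```

With only three annotated points the estimate is coarse, which is consistent with treating it as an initialization that the photometric optimization then refines.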
We demonstrate how our method is applicable to real-world datasets in Fig. 10. Our method is able to refine shapes such as armrests and office chair legs. Note that our method is sensitive to the quality of mesh initialization from real images, mainly due to the domain mismatch between synthetic and real data during the training/test phases of the shape prior. Despite this, it is still able to straighten and align to the desired 3D location. In addition, we report the average pixel reprojection error in Table 12 and metric depth error in Fig. 12 to quantify the effect of photometric optimization, which shows further improvement over coarse initializations.
Finally, we note that surface reconstruction is a challenging post-processing procedure for traditional pipelines. Fig. 10 shows sample results for SfM, PatchMatch Stereo, stereo fusion, and Poisson mesh reconstruction from COLMAP. In addition to requiring accurate object segmentation, the dense meshing problem with traditional pipelines typically yields noisy results without laborious manual post-processing.
We have demonstrated a method for reconstructing a 3D mesh from an RGB video by combining data-driven deep shape priors with multi-view photometric consistency optimization. We also show that mesh rasterization from a virtual viewpoint is critical for avoiding degenerate photometric gradients during optimization. We believe our photometric mesh optimization technique has merit for a number of practical applications. It enables the generation of more accurate models of real-world objects for computer graphics and potentially allows automated object segmentation from video data. It could also benefit 3D localization for robot navigation and autonomous driving, where accurate object location, orientation, and shape from real-world cameras are crucial for more efficient understanding.
We use AtlasNet as the base network architecture for our experiments. Following Groueix et al., the image encoder is the ResNet-18 architecture where the last fully-connected layer is replaced with one with an output dimension of 1024, which is the size of the latent code. We use the 25-patch version of the AtlasNet mesh decoder, where each deformable patch is an open triangular mesh with triangles on a regular grid. We refer readers to Groueix et al. for more details.
In the stage of pretraining AtlasNet on ShapeNet with textured backgrounds from SUN360, we train all networks using the Adam optimizer with a constant learning rate. We set the batch size for all experiments to 32. We initialize the AtlasNet encoder with a ResNet-18 pretrained on ImageNet, except for the last modified layer (before the latent code), and we initialize the decoder from the point cloud autoencoder pretrained by Groueix et al.
We parameterize the rotation component of 3D similarity transformations with the Lie algebra. Given a warp parameter vector $\omega = [\omega_1, \omega_2, \omega_3]^\top$, the rotation matrix can be written as
$$R(\omega) = \exp\left([\omega]_\times\right), \qquad [\omega]_\times = \begin{bmatrix} 0 & -\omega_3 & \omega_2 \\ \omega_3 & 0 & -\omega_1 \\ -\omega_2 & \omega_1 & 0 \end{bmatrix},$$
where $\exp$ is the exponential map (i.e., the matrix exponential). $R$ is the identity transformation when $\omega$ is an all-zeros vector. The exponential map is Taylor-expandable as
$$\exp(X) = \sum_{k=0}^{\infty} \frac{X^k}{k!}.$$
We implement the parameterization using a truncated form of this Taylor expansion. We have also tried parameterizing the 3D similarity transformations with the self-contained Lie group Sim(3), where the scale is incorporated into the exponential map; we find it to yield almost identical results. We also take the exponential of the scale parameter to ensure positivity; the resulting scale does not change when the scale parameter is zero.
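The parameterization can be sketched as below. The truncation order is not recoverable from the text, so `order=20` is a hypothetical default, and the function names are illustrative. The sketch also applies the exponentiated scale, so an all-zeros parameter vector gives the identity transform.

```python
import numpy as np

def skew(w):
    """3-vector -> 3x3 skew-symmetric matrix (an element of the Lie algebra so(3))."""
    wx, wy, wz = w
    return np.array([[0.0, -wz, wy],
                     [wz, 0.0, -wx],
                     [-wy, wx, 0.0]])

def rotation_from_lie(w, order=20):
    """Rotation matrix from a truncated Taylor series of the matrix exponential.

    `order` is a hypothetical default; the source's truncation order is unknown.
    """
    X = skew(w)
    R = np.eye(3)
    term = np.eye(3)
    for k in range(1, order + 1):
        term = term @ X / k  # accumulates X^k / k!
        R = R + term
    return R

def similarity_transform(vertices, log_scale, w, translation):
    """Apply exp(log_scale) * R(w) @ v + translation to each row of `vertices`."""
    R = rotation_from_lie(w)
    return np.exp(log_scale) * vertices @ R.T + translation
```

A practical property of this form is that the transform is smooth in all of its parameters and unconstrained, so a generic first-order optimizer such as Adam can update it without re-projecting onto the rotation group.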