(Arxiv 2021) NeRF--: Neural Radiance Fields Without Known Camera Parameters
This paper tackles the problem of novel view synthesis (NVS) from 2D images without known camera poses and intrinsics. Among various NVS techniques, Neural Radiance Fields (NeRF) has recently gained popularity due to its remarkable synthesis quality. Existing NeRF-based approaches assume that the camera parameters associated with each input image are either directly accessible at training time, or can be accurately estimated with conventional correspondence-based techniques, such as Structure-from-Motion. In this work, we propose an end-to-end framework, termed NeRF--, for training NeRF models given only RGB images, without pre-computed camera parameters. Specifically, we show that the camera parameters, including both intrinsics and extrinsics, can be automatically discovered via joint optimisation during the training of the NeRF model. On the standard LLFF benchmark, our model achieves novel view synthesis results comparable to the baseline trained with COLMAP pre-computed camera parameters. We also conduct extensive analyses to understand the model behaviour under different camera trajectories, and show that in scenarios where COLMAP fails, our model still produces robust results.
The ability to fly through our three-dimensional world has been the dream of human beings for thousands of years — from the 3500-year-old story of Daedalus and Icarus in ancient Greek mythology, to the earliest scientific attempts of Leonardo da Vinci to build flying machines in the late 1400s (Niccoli, 2006). Thanks to the recent advances in virtual reality (VR) technology, it is now possible to capture a digital version of our world and generate arbitrary views, allowing us to traverse the world through a virtual lens.
To generate photo-realistic views of a real-world scene from any viewpoint, it not only requires to understand the 3D scene geometry, but also to model complex viewpoint-dependent appearance resulting from sophisticated light transport phenomena. One way to achieve this is by constructing a so-called 5D plenoptic function that directly models the light passing through each point in space (Adelson and Bergen, 1991) (or a 4D light field (Gortler et al., 1996; Levoy and Hanrahan, 1996) if we restrict ourselves outside the convex hull of the objects of interest). Unfortunately, it is not feasible in practice to physically measure a densely sampled plenoptic function. As an alternative, Novel View Synthesis (NVS) aims to approximate such a dense light field from only sparse observations, such as a small set of images captured from diverse viewpoints.
In literature, a large amount of research effort has been devoted to developing methods for novel view synthesis. One group aims to explicitly reconstruct the surface geometry and the appearance on the surface from the observed sparse views (Debevec et al., 1996; Zitnick et al., 2004; Chaurasia et al., 2013; Waechter et al., 2014; Hedman et al., 2017; Wiles et al., 2020). For the purpose of reconstructing 3D geometry from 2D images, techniques like Structure-from-Motion (SfM) (Faugeras and Luong, 2001) establish correspondences and simultaneously estimate the camera parameters if they are not directly available. However, these methods often struggle to synthesise high-fidelity images due to imperfect surface reconstruction and limited capacity for modelling complex view-dependent effects, such as specularity, transparency and global illumination.
Another group of approaches adopts volume-based representations to directly model the appearance of the entire space (Penner and Zhang, 2017; Zhou et al., 2018; Mildenhall et al., 2019; Sitzmann et al., 2019; Mildenhall et al., 2020), and uses volumetric rendering techniques to generate images. This enables smooth gradients for photometry-based optimisation, and is capable of modelling highly complex shapes and materials with sophisticated view-dependent effects. Among these approaches, Neural Radiance Fields (NeRF) has recently gained popularity due to its exceptional simplicity and performance in synthesising high-quality images of complex real-world scenes. The key idea in NeRF is to represent the entire volume space with a continuous function, parameterised by a multi-layer perceptron (MLP), bypassing the need to discretise the space into voxel grids, which usually suffer from resolution constraints.
In both groups of research, camera calibration is often assumed to be a prerequisite, while in practice, this information is rarely accessible, and needs to be pre-computed with conventional techniques, such as SfM. In particular, NeRF (Mildenhall et al., 2020) and its variants (Zhang et al., 2020; Martin-Brualla et al., 2020; Park et al., 2020) use COLMAP (Schonberger and Frahm, 2016) to estimate the camera parameters (both intrinsics and extrinsics) associated with each input image. Apart from introducing additional complexity, this pre-processing step also struggles with dynamic scenes (Kopf et al., 2020; Park et al., 2020) or in the presence of significant view-dependent appearance changes, making NeRF training dependent on the robustness and accuracy of the camera parameter estimation.
In this paper, we ask the question: do we really need to pre-compute camera parameters when training a view synthesis model such as a NeRF? We show that the answer is no. The NeRF model is in fact able to automatically discover the camera parameters by itself during training. Specifically, we propose NeRF--, which jointly optimises the 3D scene representation and the camera parameters (both extrinsics and intrinsics). On the standard LLFF benchmark, we demonstrate novel view synthesis results comparable to the baseline NeRF trained with COLMAP pre-computed camera parameters. Additionally, we analyse the model behaviour under different camera trajectories, showing that in scenarios where COLMAP fails, our model still produces robust results. This suggests that the joint optimisation can lead to more robust reconstruction, echoing Bundle Adjustment (BA) in classical SfM pipelines (Triggs et al., 2000).
There is a vast literature on novel view synthesis. It can be roughly divided into two categories: one with explicit surface modelling, and the other with dense volume-based representations.
The first group of approaches aims to explicitly reconstruct the surface geometry and model its appearance for novel view rendering. To reconstruct 3D geometry from 2D images, traditional techniques, such as SfM (Faugeras and Luong, 2001; Hartley and Zisserman, 2003) and Simultaneous Localisation and Mapping (SLAM), jointly solve for the 3D geometry and the associated camera parameters, by establishing feature correspondences (e.g. MonoSLAM (Davison et al., 2007), ORB-SLAM (Mur-Artal et al., 2015), Bundler (Snavely et al., 2006), COLMAP (Schonberger and Frahm, 2016)), or by minimising photometric errors, e.g. DTAM (Newcombe et al., 2011) and LSD-SLAM (Engel et al., 2014). However, many of these methods assume diffuse surface texture and do not recover view-dependent appearance, hence resulting in unrealistic novel view rendering. Multi-view photometric stereo methods (Zhou et al., 2013), on the other hand, aim to explain view-dependent appearance with sophisticated hand-crafted material BRDF models, but suffer from the trade-off between quality and complexity. Recent works such as (Riegler and Koltun, 2020a, b) integrate meshes and image features to handle such view-dependent appearance synthesis. Ultimately, even though explicit geometry reconstruction facilitates camera parameter estimation, modelling photo-realistic appearance for novel views remains a challenging task.
As an alternative, volume-based representations have been proposed to directly model the appearance of the 3D space (Seitz and Dyer, 1999; Flynn et al., 2016; Penner and Zhang, 2017; Zhou et al., 2018; Mildenhall et al., 2019; Sitzmann et al., 2019; Mildenhall et al., 2020). In recent years, researchers have proposed various volume-based representations of this kind, such as Soft3D (Penner and Zhang, 2017), Multi-Plane Images (MPI) (Zhou et al., 2018; Tucker and Snavely, 2020; Mildenhall et al., 2019; Choi et al., 2019; Flynn et al., 2019), Scene Representation Networks (SRN) (Sitzmann et al., 2019), Occupancy Networks (Mescheder et al., 2019; Yariv et al., 2020) and Neural Radiance Fields (NeRF) (Mildenhall et al., 2020). These dense volumetric representations enable smooth gradients for photometry-based optimisation and have been shown to be promising for photo-realistic novel view synthesis of highly complex shapes and appearance.
One common assumption in both groups of research is that camera parameters for all input images are accessible, or can be accurately estimated by traditional SfM or SLAM techniques, such as COLMAP (Schonberger and Frahm, 2016), Bundler (Snavely et al., 2006) and ORB-SLAM (Mur-Artal et al., 2015). This results in a two-stage system, where view synthesis depends on accurate camera parameter estimation. In this work, we propose an end-to-end framework, jointly optimising the camera parameters and a NeRF representation given only RGB images, while maintaining the capability to produce comparable view synthesis results.
In particular, a concurrent work, iNeRF (Yen-Chen et al., 2020), is closely related to ours. It shows that given a well-trained NeRF model, the 6DOF camera poses for novel views can be estimated by simply minimising the photometric rendering error. However, they assume a well-trained NeRF model to begin with, whereas our method is able to automatically discover the camera parameters from only RGB images in a fully unsupervised fashion.
Apart from novel view synthesis using multiple images, there are also learning-based approaches (Zhou et al., 2016; Niklaus et al., 2019; Tucker and Snavely, 2020; Wiles et al., 2020; Shih et al., 2020; Wu et al., 2020), which allow for single-image novel view synthesis at inference time by learning a prior over a collection of training data. These methods, however, are either restricted to small camera motions, or produce low-quality images, due to the limited information in a single input image.
Given a set of images captured from sparse viewpoints of a scene, with their associated camera parameters, including both intrinsics and extrinsics, the goal of novel view synthesis is to come up with a scene representation that enables the generation of realistic images from novel, unseen viewpoints. In this paper, we follow the approach proposed in Neural Radiance Fields (NeRF) (Mildenhall et al., 2020).
In NeRF, the authors adopt a continuous function for constructing a volumetric representation of the scene from sparse input views. In essence, it models the view-dependent appearance of the 3D space using a continuous function F_Θ, parameterised by a multi-layer perceptron (MLP). The function maps a location x = (x, y, z) in 3D space together with a viewing direction d to a radiance colour c = (r, g, b) and a density value σ.
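As a concrete illustration, the mapping (x, d) → (c, σ) can be sketched as a small PyTorch module. This is a minimal stand-in, not the authors' implementation: the positional encoding and skip connections of the real architecture are omitted, and the class name and layer widths are illustrative:

```python
import torch
import torch.nn as nn

class TinyRadianceField(nn.Module):
    """Minimal sketch of the NeRF mapping (x, d) -> (c, sigma)."""

    def __init__(self, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma = nn.Linear(hidden, 1)          # density depends on position only
        self.rgb = nn.Sequential(                  # colour also sees the view direction
            nn.Linear(hidden + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, x, d):
        h = self.trunk(x)
        sigma = torch.relu(self.sigma(h))          # non-negative density
        rgb = self.rgb(torch.cat([h, d], dim=-1))  # colour in [0, 1]
        return rgb, sigma
```

Making the density depend only on position while the colour also depends on the viewing direction is what lets the model capture view-dependent effects with a consistent geometry.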
To render an image from a NeRF model, the colour Î(p) at each pixel p on the image plane is obtained by a rendering function, aggregating the radiance along a ray shooting from the camera position, passing through the pixel into the volume (Max, 1995; Gortler et al., 1996):

Î(p) = ∫_{h_n}^{h_f} T(h) σ(r(h)) c(r(h), d) dh,   (1)

where

T(h) = exp( −∫_{h_n}^{h} σ(r(s)) ds )   (2)

denotes the accumulated transmittance along the ray, i.e. the probability of the ray travelling from h_n to h without hitting any other particle, and r(h) = o + h d denotes the camera ray that starts from the camera origin o and passes through p, controlled by the camera parameters π, with near and far bounds h_n and h_f. In practice, the integral in Eq. 1 is approximated by accumulating radiance and densities of a set of sampled points along a ray.
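The quadrature approximation of Eq. 1 can be sketched in a few lines of PyTorch. This is a minimal stand-alone version of the standard NeRF compositing step; the function and variable names are illustrative, and hierarchical sampling is not shown:

```python
import torch

def render_ray(rgb, sigma, h_vals):
    """Approximate the volume-rendering integral (Eq. 1) for a batch of rays.

    rgb:    (R, S, 3) radiance at S samples per ray
    sigma:  (R, S)    densities at the samples
    h_vals: (R, S)    sample depths between the near and far bounds
    """
    deltas = h_vals[:, 1:] - h_vals[:, :-1]                      # spacing between samples
    deltas = torch.cat([deltas, torch.full_like(deltas[:, :1], 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * deltas)                     # opacity of each segment
    # transmittance T_i = prod_{j<i} (1 - alpha_j), via a shifted cumulative product
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[:, :-1]
    weights = alpha * trans                                      # contribution of each sample
    return (weights[..., None] * rgb).sum(dim=-2)                # (R, 3) pixel colours
```

Because every operation here is differentiable, gradients of a photometric loss flow back through the weights to both the radiance field and, as described later, the camera parameters that generated the rays.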
With this implicit scene representation and a differentiable renderer, NeRF can be trained by minimising the photometric error between the observed views I = {I_1, ..., I_N} and the synthesised ones under known camera parameters:

L = Σ_{i=1}^{N} ‖ I_i − Î_i ‖²₂,   (3)

where Î = {Î_1, ..., Î_N} denotes the set of synthesised images.
To summarise, NeRF represents a 3D scene as a radiance field parameterised by an MLP, which is trained with a set of sparsely observed images via a photometric reconstruction loss. Note that the camera parameters for these images are required for training, and are usually estimated by SfM packages such as COLMAP (Schonberger and Frahm, 2016). For more details of NeRF, we refer the readers to (Mildenhall et al., 2020).
In this paper, we show that the pre-processing step of estimating the camera parameters of the input images is in fact unnecessary. Unlike the training setup of the original NeRF, here we only assume a set of RGB images as inputs, without known camera parameters. We seek to jointly optimise the camera parameters and the scene representation during training. Mathematically, this can be written as:

Θ*, Π* = argmin_{Θ, Π} L(Î, I | Π),   (4)
where the camera parameters Π include both the camera intrinsics and the camera extrinsics. Apart from simplifying the original two-stage approach, another motivation for such a joint optimisation comes from bundle adjustment in classical SfM pipelines (Triggs et al., 2000) and SLAM systems (Davison et al., 2007; Newcombe et al., 2011; Engel et al., 2014), which is a key step in obtaining globally consistent reconstruction results.
In the following sections, we first introduce the representations for the camera parameters and then describe the process of the joint optimisation.
Assuming a pinhole camera model, the camera intrinsic parameters can be expressed by a 3×3 matrix:

K = [ f_x  0    c_x
      0    f_y  c_y
      0    0    1  ],

where f_x and f_y denote the camera focal lengths along the width and the height of the sensor respectively, and c_x and c_y denote the principal point in the image plane.
We assume that the camera principal point is located at the sensor centre, i.e. c_x = W/2 and c_y = H/2, where H and W denote the height and the width of the image, and that all input images are taken by the same camera. As a result, camera intrinsics estimation reduces to estimating two values, the focal lengths f_x and f_y, which can be directly optimised as trainable parameters during training.
The camera extrinsic parameters determine the position and orientation of the camera, expressed as a transformation matrix T = [R | t] in SE(3), where R ∈ SO(3) denotes the camera rotation and t ∈ ℝ³ denotes the translation. Since the translation vector t is defined in Euclidean space, it can be directly optimised as a trainable parameter during training.
As for the camera rotation, which is defined on SO(3), we adopt the axis-angle representation φ := α ω ∈ ℝ³, where a rotation is represented by a normalised rotation axis ω and a rotation angle α. This can be converted to a rotation matrix using Rodrigues' formula:

R = I + (sin α / α) φ^ + ((1 − cos α) / α²) (φ^)²,   (7)

where the skew operator (·)^ converts a vector φ = (φ₀, φ₁, φ₂) to a skew-symmetric matrix:

φ^ = [  0   −φ₂   φ₁
        φ₂   0   −φ₀
       −φ₁   φ₀   0  ].
With this parameterisation, we can optimise the camera extrinsics for each input image through the trainable parameters φ and t during training.
To summarise, the set of camera parameters that we directly optimise in our model comprises the camera intrinsics f_x and f_y, shared by all input images, and the camera extrinsics, parameterised by φ_i and t_i, specific to each image I_i.
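Rodrigues' formula (Eq. 7) is straightforward to implement; below is a minimal sketch. The small epsilon guarding the α → 0 case is an implementation detail assumed here, not taken from the text:

```python
import torch

def rodrigues(phi, eps=1e-8):
    """Axis-angle vector phi (3,) -> 3x3 rotation matrix via Rodrigues' formula."""
    alpha = phi.norm() + eps                 # rotation angle (eps avoids 0/0 at identity)
    k = torch.zeros(3, 3)                    # skew-symmetric matrix [phi]^
    k[0, 1], k[0, 2] = -phi[2], phi[1]
    k[1, 0], k[1, 2] = phi[2], -phi[0]
    k[2, 0], k[2, 1] = -phi[1], phi[0]
    return (torch.eye(3)
            + torch.sin(alpha) / alpha * k
            + (1 - torch.cos(alpha)) / alpha ** 2 * (k @ k))
```

Since every step is differentiable in phi, gradients from the photometric loss can flow directly into the per-image rotation parameters.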
Our goal is to train a NeRF model given only RGB images as input, without known camera parameters. In other words, we need to find out the camera parameters associated with each input image while training the NeRF model.
Recall that NeRF is trained by minimising the photometric reconstruction error on the input views. Specifically, for each training image I_i, we randomly select a set of pixel locations which we would like to reconstruct from the NeRF model. To render the colour of each pixel p, we shoot a ray from the camera position through the pixel into the radiance field, using the current estimates of the camera parameters (f_x, f_y, φ_i, t_i), where the rotation matrix R_i is computed from φ_i using Eq. 7.
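The ray construction under the current camera estimates can be sketched as follows. This assumes the NeRF-style camera convention (y up, camera looking along −z); the function name and argument layout are illustrative:

```python
import torch

def get_rays(uv, c2w, fx, fy, W, H):
    """Back-project pixel coordinates into world-space rays.

    uv:     (N, 2) pixel coordinates (u right, v down)
    c2w:    (3, 4) camera-to-world matrix [R | t]
    fx, fy: focal lengths (trainable scalars); the principal point is
            assumed fixed at the image centre, as in the text.
    """
    u, v = uv[:, 0], uv[:, 1]
    # ray directions in the camera frame (y up, looking along -z)
    dirs = torch.stack(
        [(u - W / 2) / fx, -(v - H / 2) / fy, -torch.ones_like(u)], dim=-1)
    rays_d = dirs @ c2w[:, :3].T          # rotate into the world frame
    rays_o = c2w[:, 3].expand_as(rays_d)  # ray origin = camera position
    return rays_o, rays_d
```

Because fx, fy and c2w all enter this computation differentiably, the photometric loss on the rendered pixel colours provides gradients for both the intrinsics and the extrinsics.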
We then sample a number of 3D points along the ray and evaluate the radiance colours and the density values at these locations via the NeRF network. The rendering function in Eq. 1 is then applied to obtain the colour of that pixel by aggregating the predicted radiance and densities along the ray.
For each reconstructed pixel, we compute the photometric loss using Eq. 3 by comparing its predicted colour against the ground-truth colour sampled from the input image. Since the entire pipeline is fully differentiable, we can jointly optimise both the parameters of the NeRF model and the camera parameters by minimising the reconstruction loss. The pipeline is summarised in Algorithm 1.
For initialisation, the cameras for all input images are placed at the origin looking along the −z axis, i.e. all rotations are initialised with identity matrices and all translations with zero vectors, and the focal lengths f_x and f_y are initialised to the width and the height of the image respectively, corresponding to a field of view of roughly 53°.
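The initialisation and joint optimisation described above can be sketched as follows. The NeRF network and the renderer are replaced by stand-ins; only the structure of the trainable camera parameters and the shared gradient step reflect the described pipeline:

```python
import torch

H, W = 120, 160   # illustrative image size
N = 4             # number of input images

# Trainable camera parameters, initialised as described above:
phi = torch.nn.Parameter(torch.zeros(N, 3))   # axis-angle rotations -> identity
t = torch.nn.Parameter(torch.zeros(N, 3))     # translations -> cameras at origin
focal = torch.nn.Parameter(torch.tensor([float(W), float(H)]))  # shared (f_x, f_y)

nerf = torch.nn.Linear(3, 4)                  # stand-in for the NeRF MLP

# Separate optimisers for the network, the poses and the focal lengths:
opt_nerf = torch.optim.Adam(nerf.parameters(), lr=1e-3)
opt_pose = torch.optim.Adam([phi, t], lr=1e-3)
opt_focal = torch.optim.Adam([focal], lr=1e-3)

def train_step(render_pixels, gt_pixels):
    """One joint step: the photometric loss back-propagates into the NeRF
    weights, the per-image poses and the shared focal lengths at once."""
    loss = ((render_pixels(nerf, phi, t, focal) - gt_pixels) ** 2).mean()
    for opt in (opt_nerf, opt_pose, opt_focal):
        opt.zero_grad()
    loss.backward()
    for opt in (opt_nerf, opt_pose, opt_focal):
        opt.step()
    return loss.item()
```

The key point is that a single scalar loss drives all three parameter groups; no alternating or staged optimisation is required.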
Although the above joint optimisation of both the camera parameters and the NeRF model from scratch produces reasonable results, the model could fall into local minima where the optimised camera parameters are sub-optimal, resulting in slightly blurry synthesised images. Thus, we introduce an optional refinement step to further improve the quality of the synthesised images. Specifically, after the first training process is completed, we drop the trained NeRF model and re-initialise it with random parameters while keeping the pre-trained camera parameters. We then repeat the joint optimisation using the pre-trained camera parameters as initialisation. We find this additional refinement step generally leads to sharper images and improves the synthesis results, as evidenced by the comparison in Table 1.
Additionally, the camera parameters can also be initialised with estimated values from external toolboxes, where they are available, and jointly refined during the training of the NeRF model. We conduct experiments to refine the camera parameters estimated using COLMAP during NeRF training, and find the novel view results slightly improved through the joint refinement, as shown in Table 1.
We conduct experiments on diverse scenes and compare with the original baseline NeRF, where camera parameters of input images are estimated with COLMAP. In the following sections, we describe the experiment setup, followed by various results and analyses. We also include a discussion on the limitations of the current method at the end of the section.
We first conduct experiments on the same forward-facing dataset as that in NeRF, namely LLFF-NeRF (Mildenhall et al., 2019), which has 8 forward-facing scenes captured by mobile phones or consumer cameras, each containing 20-62 images. In all experiments, we follow the official pre-processing procedures and train/test splits, i.e. the training images are downsampled following the official procedure, and every 8th image is held out as a test image.
To understand the behaviour of NVS under different camera motion scenarios, such as rotation, traversal (horizontal motion) and zoom-in, we additionally collected a number of diverse scenes extracted from the short video segments in the RealEstate10K (Zhou et al., 2018) and Tanks&Temples (Knapitsch et al., 2017) datasets, as well as a few more clips captured by ourselves. In particular, we only select video sequences with the desired motion type from the corresponding datasets. The image resolution varies across these sequences, and the frame rate ranges from 24 fps to 60 fps. We sub-sample the frames to reduce the frame rate to 3-6 fps, and each sequence contains 7-40 images.
We evaluate the proposed framework from two aspects. First, to measure the quality of novel view rendering, we use the common metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM) (Wang et al., 2004) and Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al., 2018). Second, apart from perceptual quality, we also evaluate the accuracy of the optimised camera parameters. However, as ground-truth camera parameters are not accessible for real scenes, we can only evaluate the accuracy by computing the difference between our optimised cameras and the estimates obtained from COLMAP. For focal length evaluation, we report the absolute error in pixels. For the camera poses, we follow the evaluation protocol of the Absolute Trajectory Error (ATE) (Zhang and Scaramuzza, 2018; Sturm et al., 2012), which first aligns the two sets of pose trajectories globally using a similarity transformation Sim(3), and reports the rotation angle between paired rotations and the absolute distance between paired translation vectors.
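As a sketch of the rotation part of this protocol, the per-pose error is the geodesic angle between two rotation matrices (the preceding Sim(3) trajectory alignment is not shown, and the function name is illustrative):

```python
import numpy as np

def rotation_angle_deg(R1, R2):
    """Geodesic angle (degrees) between two 3x3 rotation matrices.

    Uses the identity trace(R1^T R2) = 1 + 2 cos(theta); the clip guards
    against floating-point values slightly outside [-1, 1].
    """
    cos_theta = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
```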
We implement our framework in PyTorch following the same architecture as the original baseline NeRF, except that, for computational efficiency, we: (a) do not use the hierarchical sampling strategy; (b) reduce the hidden layer dimension from 256 to 128; and (c) sample only 128 points along each ray. We use Kaiming initialisation (He et al., 2015) for the NeRF model, and initialise all cameras to be at the origin looking along the −z axis, with focal lengths f_x and f_y set to the width and the height of the image. We use three separate Adam optimisers for the NeRF model, the camera poses and the focal lengths respectively, all with the same initial learning rate, except that we lower the initial NeRF learning rate for the Fortress scene. The learning rate of the NeRF model is decayed exponentially at regular intervals, and the learning rates of the pose and focal length parameters are decayed on their own schedule. For each training epoch, we randomly sample a batch of pixels from every input image and 128 points along each ray to synthesise the colour of the pixels. All models are trained for the same number of epochs unless otherwise specified. More technical details are included in the supplementary material. We will release the code.
In this section, we present the experimental results and in-depth analyses of the proposed framework, i.e. NeRF--. In Section 5.2.1, we demonstrate the results for novel view synthesis in terms of perceptual quality. In Section 5.2.2, we show the evaluation of the optimised camera parameters. Lastly, in Section 5.2.3, to understand the model behaviour under different camera motion patterns, we present qualitative results and discussion for sequences captured under controlled camera motions, e.g. rotational, traversal, and zoom-in. More results and visualisations are provided in the supplementary material.
In this section, we compare the perceptual quality of novel views rendered by the baseline NeRF (where camera parameters are estimated with COLMAP) and our proposed NeRF-- model, which jointly optimises the camera parameters and the 3D scene representation from only RGB images.
Since our optimised camera parameters might lie in a different space from the ones estimated using COLMAP, for evaluation we first align the two trajectories globally with a Sim(3) transformation using an ATE toolbox (Zhang and Scaramuzza, 2018), followed by a more fine-grained gradient-driven camera pose alignment that minimises the photometric error on the synthesised image while keeping the NeRF model fixed. Finally, we compute the metrics between the test image and our synthesised image rendered from the best possible viewpoint. Simply put, all of the above processing aims to eliminate the effect of camera misalignment and enable a fair comparison of the quality of the 3D scene representation.
We report the quantitative evaluations in Table 1 and visual results in Figure 3. Overall, our joint optimisation model, which does not require camera parameters as inputs, achieves similar NVS quality to the baseline NeRF model. This confirms that jointly optimising the camera parameters and the 3D scene representation is indeed possible. Nevertheless, we observe that for both Orchids and Room, our NeRF-- model produces slightly worse results than the baseline NeRF. We also notice from Table 2 that the differences between the optimised camera focal lengths and the COLMAP estimates are most noticeable for these two scenes. This suggests that the optimisation might have fallen into local minima with sub-optimal intrinsics. More discussion can be found in Section 5.3.
In Table 1, we also show the results with the additional refinement step, which is shown to slightly improve the NVS quality for both the baseline NeRF and our proposed NeRF-- model.
We evaluate the accuracy of the camera parameter estimation on the LLFF-NeRF dataset. As explained in Section 5.1.2, the ground-truth camera parameters for these sequences are not available; we therefore treat the COLMAP estimates as references, and report the difference between our optimised camera parameters and theirs on the training images.
In Table 2, we show the L1 difference on the estimated focal lengths, and metrics on camera rotation and translation computed with the ATE toolbox (Zhang and Scaramuzza, 2018), which accounts for global scale ambiguity. In the first set of columns (Focal + Pose + NeRF), the camera poses obtained from our model are close to those estimated by COLMAP, confirming the effectiveness of the joint optimisation pipeline. This can also be seen from the aligned camera trajectories visualised in Figure 3. The error on the camera intrinsics is, however, much larger. This is due to the notorious ambiguity between camera intrinsics and the scale of the camera translation (Pollefeys and Van Gool, 1997), especially for these forward-facing scenes.
We conduct another two sets of experiments: 1) We fix the camera poses to be the same as those from COLMAP, and only optimise the camera focal lengths jointly with the NeRF model. We then measure the difference between the optimised focal lengths and the COLMAP-estimated ones. As indicated by the second set of columns (Focal + NeRF) in Table 2, with the camera extrinsics fixed, the focal lengths are recovered accurately. 2) We fix the focal lengths to be the same as the COLMAP estimates, and only the camera extrinsics are jointly optimised with the NeRF model. The results, reported in the same table under Pose + NeRF, show similar performance to the full joint optimisation.
[Table 2 layout: for each scene, the columns report (E1) Focal + Pose + NeRF: focal error, rotation (deg), translation, PSNR; (E2) Focal + NeRF: focal error, PSNR; (E3) Pose + NeRF: rotation (deg), translation, PSNR.]
For a better understanding of the optimisation process, we provide a visualisation of the camera poses at various training epochs for the scene Flower from the LLFF-NeRF dataset (Figure 5). The pose estimates are initialised to identity matrices at the beginning, and converge after about 1000 epochs, up to a similarity transformation between the optimised camera parameters and those estimated by COLMAP.
To inspect how our system performs under different camera motions, we pick a number of sequences from the additional datasets (RealEstate10K, Tanks & Temples) with camera motions following the desired patterns, such as rotation, traversal (horizontal motion) and zoom-in. To give an overview of the experimental results: both the baseline NeRF and our joint training approach work well for zoom-in camera motions, whereas for rotational and traversal movements, we find that COLMAP sometimes produces incorrect camera poses or simply fails to converge. We provide more discussion of each motion pattern in the following sections.
Despite being a common camera motion in hand-held video capture, rotational motion is notoriously difficult to model in SfM or SLAM systems, as no 3D points can be triangulated under such a motion (Svoboda et al., 1998; Szeliski and Shum, 1997; Pirchheim et al., 2013). In the literature, numerous approaches have been proposed to deal with this problem, for example, through rotation averaging (Hartley et al., 2013; Bustos et al., 2019).
In Figure 6, we show the NVS results of a sequence from the RealEstate10K dataset, where the camera motion is dominated by a rotation. In this case, COLMAP produces incorrect camera poses with extreme outliers, leading to a failure in training the baseline NeRF. After manually correcting two extreme outlier poses by assigning them the poses of their closest neighbours, we re-train the NeRF model, shown in the last row of Figure 6. Even with such manually corrected poses, the baseline NeRF still produces blurry synthesis results and fails to model the geometry correctly. In comparison, our joint optimisation model recovers much more accurate geometry and consequently higher-quality view synthesis results.
Apart from picking video sequences from the public datasets, we also show results on a sequence recorded by ourselves (Figure 7), where the camera motion is almost purely rotational. This is akin to shooting a panorama image, where the frames can simply be stitched together by homography transformations (Svoboda et al., 1998; Szeliski and Shum, 1997; Pirchheim et al., 2013). In this case, COLMAP fails to estimate the camera parameters, and thus no results can be produced by the baseline NeRF. Our joint optimisation pipeline, in contrast, still produces reasonable camera estimates and novel view synthesis results. Note that the geometry in this case is poorly reconstructed, as shown in the rendered depth map, since no disparity information is available under such purely rotational camera motion.
Traversal motion refers to the pattern where the camera moves along a horizontal trajectory. We show two examples from Tanks&Temples and RealEstate10K in Figures 9 and 8, where the camera roughly follows a traversal pattern. Our approach produces reasonable camera parameter estimates and synthesis results on both sequences, whereas COLMAP fails on the second scene due to a near-critical plane configuration (Luong and Faugeras, 1994; Torr et al., 1998).
In Figure 10, we show an example captured with a zooming-in camera; both COLMAP and our system recover reasonable camera trajectories and view synthesis results.
Although the proposed framework for jointly optimising camera parameters and 3D scene representation demonstrates promising results, we still observe a few limitations.
Firstly, as with other photometry-based reconstruction methods, it often struggles to reconstruct scenes with large texture-less regions or in the presence of significant photometric inconsistency across frames, such as motion blur or changes in brightness or colour. For example, the joint optimisation struggles to converge on the Fortress scene from the LLFF-NeRF dataset (although it works well with a lower learning rate on the NeRF model). This is likely caused by large areas of repeated textures, which could potentially be mitigated by incorporating feature-level losses or explicitly attending to distinctive feature points during training.
Secondly, jointly optimising camera parameters and scene reconstruction is notoriously challenging and can fall into local minima. For instance, as discussed in Section 5.2.1, our joint optimisation pipeline produces inferior synthesis results on Orchids and Room compared to the baseline NeRF, as shown in Table 1, largely due to sub-optimal optimisation of the camera intrinsics (as indicated by the significant difference between our optimised focal lengths and those from COLMAP reported in Table 2). Incorporating additional components for explicit geometric matching might help guide the optimisation process.
Lastly, the proposed framework is limited to roughly forward-facing scenes and relatively short camera trajectories, since the NeRF model still struggles to model real scenes under 360° captures or large camera displacements (Zhang et al., 2020). As future work, exploiting the temporal information in sequences could provide effective regularisation for longer trajectories.
In this work, we present an end-to-end NeRF-based pipeline, called NeRF--, for novel view synthesis from sparse input views, which does not require any information about the camera parameters for training. Specifically, our model jointly optimises the camera parameters for each input image while simultaneously training the NeRF model. This eliminates the need for pre-computing the camera parameters using potentially erroneous SfM methods (e.g. COLMAP), while still achieving view synthesis results comparable to the COLMAP-based NeRF baseline. We present extensive experimental results and demonstrate the effectiveness of this joint optimisation framework under different camera trajectory patterns, even when the baseline COLMAP fails to estimate the camera parameters. Despite the limitations discussed above, our proposed joint optimisation pipeline has demonstrated promising results on this highly challenging task, and presents a step towards novel view synthesis on more general scenes with an end-to-end approach.
Shangzhe Wu is supported by Facebook Research. The authors would like to thank Tim Yuqing Tang for insightful discussions and proofreading.
Hartley, R. and Zisserman, A. (2003). Multiple View Geometry in Computer Vision. Cambridge University Press.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In ICCV.
Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. (2018). The unreasonable effectiveness of deep features as a perceptual metric. In CVPR.
Zitnick, C. L., Kang, S. B., Uyttendaele, M., Winder, S., and Szeliski, R. (2004). High-quality video view interpolation using a layered representation. In SIGGRAPH.
We employ a smaller NeRF network (Mildenhall et al., 2020) than that proposed in the original NeRF paper, without a hierarchical structure. Specifically, our NeRF implementation shrinks all hidden layer dimensions by half and follows the same positional encoding and skip connections as implemented in the original NeRF. The network architecture is presented in Fig. 11.
As mentioned in Sec. 4.2, we initialise the focal lengths f_x and f_y to the image width W and height H. We parameterise this in our implementation by introducing two scale factors s_x and s_y:

f_x = s_x · W,   f_y = s_y · H,

and initialise with s_x = s_y = 1.0. This parameterisation avoids the network predicting f_x and f_y in pixel units directly, whose values are often large and pose numerical difficulties in optimisation.
In practice, we found that optimising the square roots of s_x and s_y, denoted by s'_x and s'_y respectively, leads to slightly better results, i.e.

f_x = (s'_x)² · W,   f_y = (s'_y)² · H,

where s'_x and s'_y are initialised to 1.0 too. Table 3 shows a quantitative comparison of the novel view rendering quality using these two parameterisations.
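A minimal sketch of the square-root parameterisation described above (the image size is illustrative):

```python
import torch

H, W = 520, 780  # illustrative image size

# Optimise scale factors near 1.0 instead of focal lengths in pixel units;
# the focal lengths are recovered as f = s^2 * size.
s_x = torch.nn.Parameter(torch.ones(()))
s_y = torch.nn.Parameter(torch.ones(()))

def focals():
    """Return (f_x, f_y) computed from the square-root scale factors."""
    return s_x ** 2 * W, s_y ** 2 * H
```

With s initialised to 1.0, the optimiser starts exactly at the FOV-based initialisation (f_x = W, f_y = H) and adjusts values of order one rather than hundreds of pixels.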
We select two sequences from RealEstate10K (Zhou et al., 2018) and one video from the Tanks&Temples dataset (Knapitsch et al., 2017). The details of the videos and pre-processing procedures are listed in Table 4.
| Dataset | Video ID/name | Original fps | Original res. | Training fps | Training res. |
|---|---|---|---|---|---|
| Our data | Globe - rotation | 4 | 6240x4160 | 4 | 780x520 |
| Our data | Cauliflower - zoom-in | 4 | 6240x4160 | 4 | 780x520 |