The cell phone of the 1990s was a phone; the modern cell phone is a handheld computational imaging platform [delbracio2021mobile] capable of acquiring high-quality images, pose, and depth. Recent years have witnessed explosive advances in passive depth imaging, from single-image methods that leverage large data priors to predict structure directly from image features [ranftl2021vision, ranftl2019towards] to efficient multi-view approaches grounded in principles of 3D geometry and epipolar projection [tankovich2021hitnet, shamsafar2021mobilestereonet]. Alongside, progress has been made in the miniaturization and cost reduction [callenberg2021low] of active depth systems such as LiDAR and correlation time-of-flight sensors [lange2001solid]. This has culminated in their leap from industrial and automotive applications [schwarz2010mapping, dong2017lidar] to the space of mobile phones. Nestled in the intersection of high-resolution imaging and miniaturized LiDAR we find modern smartphones, such as the iPhone 12 Pro, which offer access to high-frame-rate, low-resolution depth and high-quality pose estimates.
As applications of mixed reality grow, particularly in industry [li2018critical] and healthcare [gerup2020augmented] settings, so does the demand for convenient systems to extract 3D information from the world around us. Smartphones fit this niche well, as they boast a wide array of sensors – e.g., cameras, magnetometer, accelerometer, and the aforementioned LiDAR system – while remaining portable and affordable, and consequently ubiquitous. Image, pose, and depth data from mobile phones can drive novel problems in view synthesis [mildenhall2020nerf, park2021nerfies], portrait relighting [pandey2021total, sun2019single], and video interpolation [bao2019depth] that either implicitly or explicitly rely on depth cues, as well as more typical 3D understanding tasks concerning salient object detection [zhang2021bts, fan2020rethinking], segmentation [schwarz2018rgb], localization [zhuang2021semantic], and mapping [schops2019bad, mur2017orb].
Although 3D scene information is essential for a wide array of 3D vision applications, today’s mobile phones do not offer accurate high-resolution depth from a single snapshot. While RGB image data is available at more than 100 megapixels (e.g., Samsung ISOCELL HP1), the most successful depth sensors capture at least three orders of magnitude fewer measurements, with pulsed time-of-flight sensors [morimoto2020megapixel] and modulated correlation time-of-flight imagers [lange20003d, hansard2012time, kolb2010time] offering kilopixel resolutions. Passive approaches can offer higher spatial resolution by exploiting RGB data; however, existing methods relying on stereo [chen20173d, Chang2018, Kendall2017] depth estimation require large baselines, monocular depth methods [chen2016monocular, ranftl2021vision] suffer from scale ambiguity, and structure-from-motion methods [schonberger2016structure] require diverse poses that are not present in a single snapshot. Accurate high-resolution snapshot depth remains an open challenge.
For imaging tasks, align-and-merge computational photography approaches have long exploited subtle motion cues during a single snapshot capture. These take advantage of the photographer’s natural hand tremor during viewfinding to capture a sequence of slightly misaligned images, which are fused into one super-resolved image [wronski2019handheld, tsai1984multiframe]. These misaligned frames can also be seen as mm-baseline stereo pairs, and works such as [yu20143d, joshi2014micro] find that they contain enough parallax information to produce coarse depth estimates. Unfortunately, this micro-baseline depth alone is not enough to fuel mixed reality applications, as it lacks the ability to segment clear object borders or detect cm-scale depth features. In tandem with high-quality poses from phone-based SLAM [durrant2006simultaneous] and low-resolution LiDAR depth maps, however, we can use high-resolution micro-baseline depth cues to guide the reconstruction of a refined high-resolution depth map. We develop a pipeline for recording LiDAR depth, image, and pose bundles at 60 Hz, with which we can conveniently record 120-frame bundles of measurements during a single snapshot event.
With this hand-shake data in hand, we take a test-time optimization approach to distill a high-fidelity depth estimate from hand tremor measurements. To do so, we learn an implicit neural representation of the scene from a bundle of measurements. Representing depth with a coordinate multilayer perceptron (MLP) allows us to query depth at floating-point coordinates, which matches our measurement model, as we effectively traverse a continuous path of camera coordinates during the movement of the photographer’s hand. During training, we can likewise conveniently incorporate parallax and LiDAR information as photometric and geometric loss terms, respectively. In this way we search for an accurate depth solution that is consistent with low-resolution LiDAR data, aggregates depth measurements across frames, and matches visual features between camera poses, similar to a multi-view stereo approach. Specifically, we make the following contributions:
A smartphone app with a point-and-shoot user interface for easily capturing synchronized RGB, LiDAR depth, and pose bundles in the field.
An implicit depth estimation approach that aggregates this data bundle into a single high-fidelity depth map.
Quantitative and qualitative evaluations showing that our depth estimation method outperforms existing single and multi-frame techniques.
We will make the smartphone app, training code, experimental data, and trained models available at:
2 Related Work
Active Depth Imaging. Active depth methods emit a known illumination pattern into the scene and measure the returned signal to reconstruct depth. Structured light approaches rely on this illumination to improve local image contrast [zhang2018high, scharstein2003high] and simplify the stereo-matching process. Time-of-Flight (ToF) technology instead uses the travel time of the light itself to measure distances. Indirect ToF achieves this through measuring the phase differences in the returned light [lange20003d], whereas direct ToF methods time the departure and return of pulses of light via avalanche photodiodes [cova1996avalanche] or single-photon detectors (SPADs) [mccarthy2009long]. The low spatial-resolution depth stream we use in this work is sourced from a LiDAR direct ToF sensor. While its mobile implementation comes with the caveats of low-cost SPADs [callenberg2021low] and vertical-cavity surface-emitting lasers [warren2018low], with limited spatial resolution and susceptibility to surface reflectance, this sensor can provide robust metric depth estimates, without scale ambiguity, on visually textureless surfaces such as blank walls. We use this kilo-pixel depth data to ensure our depth solution does not stray too far from the underlying measured depth data.
Multi-View Stereo. Multi-view stereo (MVS) algorithms are passive depth estimation methods that infer the 3D shape of a scene from a bundle of RGB views and, optionally, associated camera poses. COLMAP [schonberger2016structure] estimates both poses and sparse depths by matching visual features across frames. The apparent motion of each feature in image space is uniquely determined by its depth and camera pose. Thus there exists an important relationship that, for a noiseless system, any pixel movement (disparity) not caused by a change in pose must be caused by a change in depth. While classical approaches typically formulate this as an explicit photometric cost optimization [sinha2007multi, furukawa2009accurate, galliani2015massively], more recent learning-based approaches bend the definition of cost. These approaches construct cost volumes with learned visual features [yao2018mvsnet, tankovich2021hitnet, lipson2021raft], which aid in dense matching as they incorporate non-local information into otherwise textureless regions and are more robust to variations in lighting and noise that distort raw RGB values. In our scenario with virtually no variation in lighting or appearance, and free access to reliable LiDAR-based depth estimates in textureless regions, we look towards photometric MVS to extract parallax information from our images with poses.
Monocular Depth Prediction. Single-image monocular approaches [ranftl2019towards, ranftl2021vision, godard2019digging] offer visually reasonable depth maps, where foreground and background objects are clearly separated but may not be at a correct scale, with minimal data requirements – just a single image. Video-based methods such as [fonder2021m4depth, luo2020consistent, watson2021temporal] leverage structure-from-motion cues [ullman1979interpretation] to extract additional information on scene scale and geometry. Works such as [ha2016high, im2018accurate] use video data with small (decimeter-scale) motion, and [joshi2014micro, yu20143d] explore micro-baseline (mm-scale) measurements. As the baseline decreases so does the scale of parallax motion cues, and the depth estimation problem gradually devolves from MVS to single-image prediction. Our work investigates how we can leverage monocular information readily available during a single snapshot – high-resolution images synchronized with 3D pose and coarse depth – to distill high-quality depth.
3 Neural Micro-baseline Depth
When capturing a “snapshot photograph” on a modern smartphone, the simple interface hides a significant amount of complexity. The photographer typically composes a shot with the assistance of an electronic viewfinder, holding steady before pressing the shutter. During composition, a modern smartphone streams the recent past – synchronized RGB, depth, and six-degree-of-freedom (6DoF) pose frames – into a circular buffer at 60 Hz.
In this setting, we make the following observations: (1) A few seconds is sufficient for a typical snapshot of a static object. (2) During composition, the amount of hand shake is small (mm-scale). (3) Under small pose changes, view-dependent lighting effects are minor. (4) Our data shows that current commercial devices have excellent pose estimation, likely due to well-calibrated sensors (IMU, LiDAR, RGB camera) collaborating to solve a smooth low-dimensional problem. Concretely, at each shutter press, we capture a “data bundle” of time-synchronized frames, each consisting of an RGB image, a 6DoF camera pose, camera intrinsics, and a depth map. In our experiments, the high-resolution RGB is 1920×1440, while the LiDAR depth is 256×192.¹ To save memory, we restrict bundles to 120 frames (two seconds) in all our experiments.

¹Though we refer to this as LiDAR depth, the iPhone’s depth stream appears to also rely on monocular depth cues, and likely uses the sparse LiDAR samples to help resolve issues of scale ambiguity. It unfortunately does not offer direct access to raw LiDAR measurements.
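To make the bundle structure concrete, here is a minimal sketch of one time-synchronized frame; the field names and exact array shapes are our own illustrative choices, not an API exposed by the phone:

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical container for one time-synchronized frame in a data bundle.
@dataclass
class Frame:
    rgb: np.ndarray          # (1920, 1440, 3) high-resolution color image
    depth: np.ndarray        # (256, 192) low-resolution LiDAR-derived depth
    pose: np.ndarray         # (4, 4) 6DoF camera-to-world transform
    intrinsics: np.ndarray   # (3, 3) pinhole intrinsics matrix
    timestamp_ns: int        # capture time in nanoseconds

# A two-second bundle at 60 Hz is then simply a list of 120 such frames.
```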
Micro-Baseline parallax. We specialize classical multi-view stereo [10.5555/861369] for our small motion scenario. Without loss of generality, we denote the first frame in our bundle the reference (r) and represent all other query (q) poses using the small angle approximation relative to the reference:
Letting $P = (x, y, z, 1)^\top$ denote the homogeneous coordinates of a 3D point, the geometric consistency constraint is:
$$P_r = T_q \, P_q , \tag{2}$$
That is, the known pose should transform any 3D point in the query frame to its corresponding 3D location in the reference. Next, given the per-frame camera intrinsics
$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix},$$
perspective projection yields continuous pixel coordinates via:
$$u = f_x \frac{x}{z} + c_x, \qquad v = f_y \frac{y}{z} + c_y .$$
Using these pixel coordinates to index into the images $I_q$ and $I_r$, we arrive at our second constraint:
$$I_q(p_q) = I_r(p_r) . \tag{5}$$
Corresponding 3D points should be photometrically consistent: with small motion, they should have the same color in both views. These two constraints are visualized in Fig. 2. Recall that a depth map $D$ determines at each pixel $(u, v)$ its 3D point by “unprojection”:
$$P(u, v) = D(u, v) \, K^{-1} (u, v, 1)^\top .$$
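Concretely, these projection and unprojection relations can be sketched in a few lines of numpy; the intrinsics, pose, and sample coordinates below are illustrative values, not the device's calibration:

```python
import numpy as np

def unproject(uv, depth, K):
    """Lift continuous pixel coordinates (u, v) with depth z to a 3D point."""
    u, v = uv
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.array([x, y, depth])

def project(P, K):
    """Perspective-project a 3D point to continuous pixel coordinates."""
    uv = K @ P
    return uv[:2] / uv[2]

# Geometric consistency: a query-frame point, moved by the (small) relative
# pose, lands at its reference-frame pixel location.
K = np.array([[1500.0, 0.0, 720.0],
              [0.0, 1500.0, 960.0],
              [0.0, 0.0, 1.0]])       # assumed pinhole intrinsics
R = np.eye(3)                         # small-angle rotation ~ identity here
t = np.array([0.003, 0.0, 0.0])       # 3 mm translation, a micro-baseline
P_q = unproject((800.0, 1000.0), 0.4, K)  # point at 40 cm depth
P_r = R @ P_q + t
uv_r = project(P_r, K)                # shifted by f * t_x / z = 11.25 px
```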
Implicit Depth Representation. There are numerous ways in which one can represent scene depth. For example, we can represent it explicitly with a discrete depth map from the reference view, or as a large 3D point cloud. While explicit representations have many advantages (fast data retrieval, existing processing tools), they are also challenging to optimize. Depth maps are discrete arrays, and merging multiple views at continuous coordinates requires resampling. This blurs fine-scale information and is non-trivial at occlusion boundaries. Point cloud representations trade off this adaptive filtering problem with one of scale. Not only is a two-second sequence with 120 million points unwieldy for conventional tools [arun1987least], its points are almost entirely redundant.
Thus, we choose an implicit depth representation in the form of a coordinate multi-layer perceptron (MLP) [hornik1989multilayer], whose learnable parameters automatically adapt to the scene structure. Recent works have used MLPs to great success in neural rendering [mildenhall2020nerf, chen2021mvsnerf, park2021nerfies] and depth estimation, where continuous sampling is of interest [zhang2021consistent]. In our application, the MLP is a differentiable function returning a continuous depth given a continuous encoding of position, camera pose, color, and other features. In our implementation, the input is a positionally encoded 3D colored point:
We follow the encoding of [mildenhall2020nerf] with:
$$\gamma(x) = \big(\sin(2^0 \pi x), \cos(2^0 \pi x), \ldots, \sin(2^{L-1} \pi x), \cos(2^{L-1} \pi x)\big),$$
where $L$ is the selected number of encoding functions, and $(r, g, b)$ are the point's corresponding color values scaled to $[0, 1]$.
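As a concrete sketch, the NeRF-style encoding might be implemented as follows; the packing of encoded position and raw color into a single MLP input vector is our own illustrative assumption:

```python
import numpy as np

def positional_encoding(x, L=8):
    """Map each coordinate to (sin(2^l * pi * x), cos(2^l * pi * x)), l < L."""
    out = []
    for l in range(L):
        out.append(np.sin(2.0 ** l * np.pi * x))
        out.append(np.cos(2.0 ** l * np.pi * x))
    return np.concatenate(out)

def encode_point(xyz, rgb, L=8):
    """Pack an encoded 3D position and its [0, 1] color into one vector."""
    return np.concatenate([positional_encoding(np.asarray(xyz, dtype=float), L),
                           np.asarray(rgb, dtype=float)])
```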
As the size of this MLP is fixed, the large dimensionality of our measurements does not affect the cost of evaluating it; the measurements instead form a large training dataset from which to sample. Translating (2) and (5) into a regularized loss function, and backpropagating through it, our implicit depth representation can be optimized using modern stochastic gradient descent.
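The coordinate MLP itself is small. Below is a plain-numpy sketch of the architecture described later in the text (4 fully connected layers, hidden size 256, ReLU activations); in practice it would live in an autodiff framework so the losses can be backpropagated, and the initialization scheme here is an illustrative assumption:

```python
import numpy as np

def init_mlp(in_dim, hidden=256, out_dim=1, layers=4, seed=0):
    """Initialize weights and biases for a fully connected ReLU network."""
    rng = np.random.default_rng(seed)
    dims = [in_dim] + [hidden] * (layers - 1) + [out_dim]
    return [(rng.normal(0.0, np.sqrt(2.0 / m), (m, n)), np.zeros(n))
            for m, n in zip(dims[:-1], dims[1:])]

def mlp_forward(params, x):
    """Forward pass; ReLU on all hidden layers, linear output."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x
```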
Backward-Forward Projection Model. Fig. 3 illustrates how we combine our geometric and photometric constraints to optimize our MLP to produce a refined depth. At each training step we sample a query view, from which we draw randomly sampled colored points via Eq. 6:
where $H$ and $W$ are the image height and width, respectively. Here, pixel coordinates are continuous, and images are sampled with a bilinear kernel. Following (2), we transform these points to the reference frame as
Rather than directly predicting a refined depth, we ask our MLP to predict a depth correction:
As we show in Section 5, this parameterization allows us to avoid local minima in poorly textured regions. We transform these points back to the query frame and resample the query image at updated coordinates:
Finally, our photometric loss is:
which attempts to satisfy (5) by encouraging the colors of the refined 3D points to be the same in both the query and reference frames. While (5) works well in well-textured areas, it is fundamentally underconstrained in flat regions. Therefore, we augment our loss with a weighted geometric regularization term based on (2) that pushes the solution towards an interpolation of the LiDAR depth in these regions:
Our final loss is a weighted combination of these two terms
where by tuning the regularization weight we adjust how strongly our reconstruction adheres to the LiDAR depth initialization.
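The combined objective can be sketched as below: an L1 photometric term on resampled colors plus a weighted geometric term pulling the refined depth toward the LiDAR depth. The choice of L1 norms and the default weight value are assumptions for illustration, not the paper's exact settings:

```python
import numpy as np

def photometric_loss(colors_query, colors_ref):
    # color difference between resampled query and reference samples
    return np.abs(colors_query - colors_ref).mean()

def geometric_loss(depth_refined, depth_lidar):
    # pull the solution toward the (interpolated) LiDAR depth
    return np.abs(depth_refined - depth_lidar).mean()

def total_loss(colors_query, colors_ref, depth_refined, depth_lidar,
               geo_weight=0.1):
    # weighted combination of the two terms
    return (photometric_loss(colors_query, colors_ref)
            + geo_weight * geometric_loss(depth_refined, depth_lidar))
```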
Patch Sampling. In practice, we cannot rely on single-pixel samples as written in Eq. (6) for photometric optimization. Our two-megapixel input images will almost certainly contain color patches that are larger than the depth-induced motion of pixels within them. With single-pixel samples, there are many incorrect depth solutions that yield a photometric loss of zero. To combat this, we replace each single-pixel sample in Eq. (3) with Gaussian-weighted patches
Fig. 4 illustrates this: the increased receptive field discourages false color matches. By adjusting the patch size, we trade off the ability to reconstruct fine features for robustness to noise and low-contrast textures.
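The Gaussian weights for such a patch can be generated as follows; the kernel size and sigma here are illustrative values rather than the paper's tuned settings:

```python
import numpy as np

def gaussian_patch_weights(k=3, sigma=1.0):
    """Normalized k-by-k Gaussian weights centered on the sampled pixel."""
    ax = np.arange(k) - (k - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    w = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return w / w.sum()  # normalize so the patch acts as a weighted average
```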
Explicit Confidence. As a further augmentation, we introduce a learned explicit confidence map to weigh the MLP outputs. That is, we replace the predicted correction with its confidence-weighted counterpart in (12). This additional degree of freedom allows the network to push the correction toward zero in color-ambiguous regions, rather than forcing it to first learn a positional mapping of where these regions are located in the image. As the confidence map only adds an additional sampling step during point generation, the overhead is minimal. Once per epoch we apply an optional median filter to the confidence map to minimize the effects of sampling noise during training.
Final reconstruction. After training, we recover a refined depth map by first reprojecting all low-resolution depth maps to the reference frame following (2). We then average and bilinearly resample this data to produce a low-resolution depth estimate. We query the MLP at grid-spaced points, using this estimate and the reference image to generate colored points as in (3). Finally, we extract and re-grid the depth channel from the MLP outputs to produce the refined depth map.
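Bilinear resampling at continuous pixel coordinates appears throughout the pipeline, both when drawing colored points and in this final re-gridding step. A minimal sketch, without the boundary handling a full implementation would need:

```python
import numpy as np

def bilinear_sample(img, u, v):
    """Sample img (H, W) or (H, W, C) at continuous coordinates (u, v)."""
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    du, dv = u - u0, v - v0
    # weighted average of the four neighboring grid samples
    return ((1 - du) * (1 - dv) * img[v0, u0]
            + du * (1 - dv) * img[v0, u0 + 1]
            + (1 - du) * dv * img[v0 + 1, u0]
            + du * dv * img[v0 + 1, u0 + 1])
```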
4 Data Collection
Recording a Bundle. We built a smartphone application for recording bundles of synchronized image, pose, and depth maps. Our app, running on an iPhone 12 Pro using ARKit 5, provides a real-time viewfinder with previews of RGB and depth (Fig. 5). The user can select bundle sizes of [15, 30, 60, 120] frames ([0.25, 0.5, 1, 2] seconds of recording time), and we save all data, including nanosecond-precision timestamps, to disk. We will publish the code for both the app and our offline processing pipeline (which performs color-space conversion, coordinate-space transforms, etc.).
Natural Hand Tremor Analysis. To analyze hand motion during composition, we collected fifty 2-second pose-only bundles from 10 volunteers. Each was instructed to act as if they were capturing photos of objects around them, to hold the phone naturally in their dominant hand, and to keep focus on an object in the viewfinder. We illustrate our aggregate findings in Fig. 6 and individual measurements in the supplemental material.
We focus on in-plane displacements, as they are the dominant contribution to observed parallax. We find that natural hand tremor appears similar to paths traced by 2D Brownian motion, with some paths traveling far from the initial camera position, as in Fig. 6(b), and others forming circles around the initial position. Consequently, while the median effective baseline from a two-second sequence is just under 6mm, some recordings exhibit nearly 1cm of displacement, while others are almost static. We suspect that the effects of breathing and involuntary muscle twitches are greatly responsible for this variance. Herein lies the definition of a good hand shake: it is one that produces a useful micro-baseline.
Fig. 7 (c) illustrates, given our smartphone’s optics, what a 6mm baseline translates to in pixel disparity, and therefore the attainable depth feature precision. We intentionally limit ourselves to a depth range of approximately 50cm, beyond which image noise and errors in pose estimation overpower our ability to estimate subpixel displacement.
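A back-of-envelope version of this relation is disparity = f · b / z for focal length f in pixels, baseline b, and depth z. The ~1500 px focal length below is an assumed value for illustration, not the device's calibrated optics:

```python
def disparity_px(focal_px, baseline_m, depth_m):
    """Pixel disparity induced by a translation baseline at a given depth."""
    return focal_px * baseline_m / depth_m

# A 6 mm tremor baseline at 50 cm depth yields roughly 18 px of parallax,
# while at 2 m it drops below 5 px, motivating the ~50 cm working range.
near = disparity_px(1500.0, 0.006, 0.5)
far = disparity_px(1500.0, 0.006, 2.0)
```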
5 Assessment
Implementation Details. We use 120-frame bundles for our main experiments. Images are recorded in portrait orientation at 1920×1440 resolution. Our MLP is a 4-layer fully connected network with ReLU activations and a hidden layer size of 256. For training we sample batches of colored points with a fixed Gaussian kernel size and number of positional encoding functions. We use the Adam optimizer [kingma2014adam] with an initial learning rate that is exponentially decayed over the training epochs, and apply the geometric regularization with a fixed weight. We provide ablation experiments on the effects of many of these parameters in the supplement. Training takes about 45 minutes on an NVIDIA Tesla P100 GPU, or 180 minutes on an Intel Xeon Gold 6148 CPU.
Comparisons. We compare our method (Proposed) to the input LiDAR data and several recent baselines. Namely, we reproject all the depth maps in our bundle to the reference frame and resample them to produce a LiDAR baseline. For the depth reconstruction methods we look to Consistent Video Depth Estimation [luo2020consistent] (CVD), which similarly uses a photometric loss between frames in a video to refine a consistent depth; Depth Supervised NeRF [deng2021depth] (DSNeRF), which also features an MLP for depth prediction; and Structure from Small Motion [ha2016high] (SfSM), which investigates a closed-form solution to depth estimation from micro-baseline stereo. Both DSNeRF and CVD rely on COLMAP [schonberger2016structure] for poses or depth inputs; however, when given our micro-baseline data, COLMAP fails to converge, returning neither. For a fair comparison, we substitute our poses and LiDAR depth.
Table 1: Photometric error (PE) per scene for CVD [luo2020consistent], DSNeRF [deng2021depth], SfSM [ha2016high], and our proposed method.
Experimental Results. We present our results visually in Fig. 8 and quantitatively in Table 1 in the form of photometric error (PE). To compute PE, we take the final depth map output by each method, use the phone’s poses and intrinsics to project each colored point in the reference frame to all other frames, and compare their 8-bit RGB values:
We exclude points that transform outside the image bounds. Like traditional camera calibration or stereo methods, in the absence of ground truth depth, PE serves as a measure of how consistent our estimated depth is with the observed RGB parallax.
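A sketch of this metric: project reference colors into another frame, then average the absolute 8-bit RGB difference over the points that remain inside the image bounds. The exact norm used in the paper's PE is an assumption here:

```python
import numpy as np

def photometric_error(colors_projected, colors_observed, in_bounds):
    """Mean absolute 8-bit RGB difference over in-bounds points.

    colors_projected / colors_observed: (N, 3) uint8 color arrays;
    in_bounds: (N,) boolean mask of points that stay inside the image.
    """
    diff = np.abs(colors_projected.astype(np.float64)
                  - colors_observed.astype(np.float64))
    return diff[in_bounds].mean()
```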
Table 1 summarizes the relative performance of these methods on 8 geometrically diverse scenes. Our method achieves the lowest PE for all scenes. Note that neither CVD nor DSNeRF achieves significantly lower PE than the LiDAR depth, even though both contain explicit photometric loss terms in their objectives. We speculate that our micro-baseline data is out of distribution for these methods, and that the large loss gradients induced by small changes in pose result in unstable reconstructions. DSNeRF also has the added complexity of being a novel view synthesis method and is therefore encouraged to overfit to the scene RGB content in the presence of only small motion. We see this confirmed in Fig. 8, as DSNeRF produces an edge-aligned depth map but also hallucinates image texture. SfSM successfully reconstructs textured regions close to the camera (20cm), but fails for smaller-disparity regions and the textureless spaces around objects. The reprojected LiDAR depth produces well-aligned edges but lacks intra-object structure, as it relies on ambiguous mono-depth cues. In contrast, our proposed method reconstructs the castle’s towers, the hand under the thinker’s head, the depth disparity between stones in the rocks object, and the smooth undulations of the gourd.
Frame Ablation. Though the average max baseline in a recorded bundle is 6mm, the baseline between neighboring frames is on the order of 0.1mm. This means that we need not use every frame for effective photometric refinement. This tradeoff between frame count, training time, and reconstruction detail is illustrated in Fig. 9. We retain most of the castle detail by skipping every other frame, but as we discard more data, the reconstruction is progressively blurred instead of completely failing.
Role of LiDAR Supervision. To determine the contribution of low-resolution LiDAR data in our pipeline, we perform an ablation in which we remove the LiDAR depth input and disable our geometric regularizer by setting its weight to zero. Fig. 10 shows that, as expected, the pipeline then fails to reconstruct the textureless background, simply adding noise. However, it does correctly reconstruct the gourd, producing a result close to the pipeline with full supervision. This demonstrates that our technique does indeed extract depth details from micro-baseline RGB images. LiDAR depth, however, is an effective regularizer: in regions without strong visual features, our learned confidence map is nearly zero and our reconstruction gracefully degrades to the LiDAR data. Finally, we note that even though this ablation discards depth supervision, the LiDAR sensor is still used by the phone to determine pose.
Comparison to Dedicated ToF Camera. For additional qualitative validation, we record several scenes with a high-resolution time-of-flight depth camera (LucidVision Helios Flex). Given the differences in optics and effective operating range, the viewpoints and metric depths are not exactly the same. We offset (but do not rescale) and crop the depth maps for qualitative comparison. Fig. 11 illustrates that although our technique can reconstruct centimeter-scale features matching those of the ToF sensor, it oversmoothes finer details corresponding to subpixel disparities. Since we rely on passive RGB rather than direct illumination, our technique can reconstruct regions the ToF camera cannot, such as specular surfaces on the back of the frog and areas of the gourd with high subsurface scattering. Note that in these two cases the amplitude-modulated ToF measurements report incorrect depth due to multi-path interference, even merging object depth with the background.
Offsets over Direct Depth. Rather than directly learning depth, we opt to learn offsets to the collected LiDAR depth data. In this way we start from a coarse, albeit smoothed, depth estimate and, for each location in space, effectively search the local neighborhood for a more photometrically consistent depth solution. This allows us to avoid local-minima solutions that overpower the regularization – e.g., the accidental matching of similar-looking high-contrast patches. This proves essential for objects with repetitive textures, as demonstrated in Fig. 12.
6 Discussion and Future Work
We show that with a modern smartphone, one can obtain a high-fidelity megapixel depth map on any snapshot of a nearby static object. We quantitatively validated that our technique outperforms several recent baselines and qualitatively compared to a dedicated depth camera.
Although our training time is practical for offline processing and opens the potential for the fast and easy collection of a potentially large-scale training corpus with accurate object depth maps, our method could be further accelerated with an adaptive sampling scheme. This might take into account pose and LiDAR information to select the most useful samples for network training. We also hope in the future to get access to the raw samples recorded by the phone’s LiDAR sensor, rather than filtered depth maps. Raw measurements may open the door for future end-to-end methods which use photon time tags to provide an additional sparse high-trust supervision signal, as well as potential low-light applications where we use the LiDAR measurements to help aggregate photometric information in noisy RGB images.