Very Power Efficient Neural Time-of-Flight

12/19/2018
by Yan Chen, et al.

Time-of-Flight (ToF) cameras require active illumination to obtain depth information, so the power of the illumination directly affects the performance of ToF cameras. Traditional ToF imaging algorithms are very sensitive to illumination, and the depth accuracy degrades rapidly as the illumination power decreases. Therefore, the design of a power efficient ToF camera always creates a painful dilemma between illumination power and performance. In this paper, we show that despite the weak signals in many areas under extremely short exposure settings, these signals as a whole can be well utilized through a learning process that directly translates the weak and noisy ToF camera raw to a depth map. This creates an opportunity to tackle the aforementioned dilemma and makes a very power efficient ToF camera possible. To enable the learning, we collect a comprehensive dataset under a variety of scenes and photographic conditions with a specialized ToF camera. Experiments show that our method is able to robustly process ToF camera raw with an exposure time one order of magnitude shorter than that used in conventional ToF cameras. In addition to evaluating our approach both quantitatively and qualitatively, we also discuss its implications for designing the next generation of power efficient ToF cameras. We will make our dataset and code publicly available.


1 Introduction

(a) Conventional depth map under extremely short exposure. (b) Our result from the ToF raw of (a). (c) Conventional depth map under regular exposure. (d) Our result from the ToF raw of (c).
Figure 1: We propose an end-to-end pipeline that translates the weak and noisy ToF camera raw to a high quality depth map. (a) Depth image produced by the ToF camera’s default imaging pipeline with a 200us exposure time; the quality is very poor. (b) Depth image produced by our method applied to the ToF camera raw from (a). (c) Depth image produced by the ToF camera’s default imaging pipeline with a regular exposure time; some depth information is still lost due to objects with low reflectivity or at long distances. (d) Depth image produced by our method applied to the ToF camera raw from (c).

Depth sensing is one of the core components of many computer vision tasks. Amplitude-modulated continuous-wave (AMCW) time-of-flight (ToF) imaging recovers scene depth based on a simple and well-defined physical principle, and it has therefore attracted substantial commercial attention, e.g., the Kinect V2. It is also widely used in computer vision research [10, 28], including human tracking [27], 3D scene reconstruction [15], robotics [14], object detection, gesture recognition [21, 31], and scene understanding [29, 12]. Unlike traditional RGB cameras, ToF cameras compute depth by emitting a periodic amplitude-modulated illumination signal and demodulating the signal reflected by the objects. A higher active illumination power enables the ToF sensor to receive the signal with a higher signal-to-noise ratio (SNR) and a higher level of confidence. Therefore, the power of the illumination directly influences the performance of ToF cameras.

Traditional ToF imaging algorithms are very sensitive to illumination, and the depth accuracy degrades rapidly as the illumination power decreases. One way to obtain more accurate depth information is to increase the intensity of the received active illumination signal. Other than increasing the illumination power, an alternative is to increase the physical size of the pixels on the sensor so that they collect more light; however, this significantly decreases the depth map resolution. According to the inverse-square law, one can also reduce the depth sensing range of the camera, but this obviously decreases the usability of the camera in many applications. Therefore, to build a ToF camera with satisfactory depth quality as well as a reasonable resolution and sensing range, the designer faces a painful trade-off between illumination power and performance if the conventional imaging pipeline is used.

Such a dilemma can be tackled if there is a way to recover high quality depth information from weak signals. A number of recent studies show that it is plausible to recover high-SNR natural images from very noisy data using deep learning [25, 13, 2]. Chen et al. [2] showed impressive results on recovering high quality color images from camera Bayer-pattern raw captured under extremely low light with a short exposure. Inspired by this line of research, we show for the first time that for ToF cameras, despite the weak signals in many areas under an extremely short exposure setting, these signals as a whole can be well utilized through a learning process that directly translates the weak and noisy ToF camera raw to a high quality depth map. This creates an opportunity to address the aforementioned dilemma and makes it possible to design a very power efficient ToF camera, possibly with higher resolution and a longer sensing range. To enable the learning, we collect a comprehensive dataset under a variety of scenes and photographic conditions with a specialized ToF camera. The dataset contains ToF raw measurements and depth maps collected under extremely short exposure settings and long exposure settings respectively. We show in the experiments that our proposed method is able to robustly process ToF raw measurements with an exposure time that is one order of magnitude shorter than that used in a conventional ToF camera.

The contributions of our work can be summarized as follows.

  • We show for the first time that our proposed method is able to recover high quality depth information from very weak ToF raw data (one order of magnitude shorter exposure time).

  • We introduce a real-world dataset for training and validating this learning task. We will make the code and dataset publicly available.

  • We shed light on the design of the next generation ToF camera by providing an effective alternative to optimize the performance and power consumption trade-off.

2 Related Work

Depth reconstruction based on ToF cameras. ToF cameras face many challenging problems when extracting depth from raw measurements that are phase-shifted with respect to the emitted modulated infrared signal. Dorrington et al. [4] established a two-component, dual-frequency approach to resolving phase ambiguity, achieving significant improvements in accuracy when the distortion is caused by multipath interference (MPI). Several methods have been proposed to deal with MPI distortions, including adding or modifying hardware [32, 11, 24], employing multiple modulation frequencies [4, 5, 1, 9], and estimating light transport through an approximation of depth [6, 7]. Marco et al. [20] correct MPI errors with a two-stage training strategy: they first train an encoder to represent MPI-corrupted depth images using a captured dataset and then use synthetic scenes to train a decoder to correct the depth. However, the above pipelines are based on the assumption that no cumulative error or information loss is introduced in the previous stage, so the final results of these methods are likely to contain accumulated errors from multiple stages.

Vijayanagar et al. [33] filled in missing depth pixels by using a color-aware Gaussian-weighted averaging filter to estimate depth values. However, its performance is limited by the similarity between the neighborhood pixels and the target pixels, and the information in the target region is wasted. The end-to-end ToF image processing framework presented by Su et al. [30] can efficiently reduce noise, correct MPI and resolve phase ambiguity. However, its training data are not realistic; therefore, depth reconstruction may fail when the scene contains low-reflectivity materials and objects. To the best of our knowledge, none of the existing depth reconstruction methods is able to obtain a high quality depth map from weak and noisy ToF camera raw measurements.

Image enhancement under low light. For conventional RGB cameras, photography in low light is challenging. Several techniques have been proposed to increase the SNR of the recovered image [8, 23, 17, 18, 3]. Chen et al. [2] established a pipeline by training a fully convolutional neural network that directly translates very noisy and dark Bayer-pattern camera raw to high quality color images. Despite the impressive results of the aforementioned studies, deep learning and data-driven approaches have not yet been adopted to recover high quality depth information from weak and noisy ToF raw, and it remains unclear whether such a methodology is effective for ToF imaging. The aim of this paper is to demonstrate its feasibility.


Depth datasets. Although many depth map datasets have been proposed recently, most of them consist of synthetic data, such as transient images generated via time-resolved rendering. A dataset of ToF measurements [20] was produced by simulating 25 different scenes with a physically-based, time-resolved renderer. Su et al. [30] offer a large-scale synthetic dataset of raw correlation time-of-flight measurements with ground truth labels. However, ToF raw with artificial distortions and Gaussian noise is not realistic enough to support generalization to real scenes, especially in areas with large noise caused by low reflectivity. The NYU-Depth V2 dataset [22] provides only raw RGB data, depth maps and accelerometer data; the ToF raw measurements are missing, so this dataset cannot be used to train ToF-raw-to-depth-map conversion. Furthermore, most existing depth datasets concentrate on images captured under appropriate illumination or ideal environments, so they are not suitable for evaluating imaging with low active illumination power or a weak reflected signal. In this paper, we propose a comprehensive dataset to fill these gaps and enable the training and validation of our proposed model.

3 Method and Analysis

3.1 Imaging with Time-of-Flight Sensors

Distance measurement. In distance measurement mode, a ToF camera uses an on-chip driver and an external LED/LD to project modulated light onto the target. The period of the modulation control signal is generally programmable. The modulator generates the signals that drive the external LED/LD and, simultaneously, the demodulation signals for the pixel field. The emitted optical signal with angular frequency ω can be written as

s(t) = cos(ωt)    (1)

where the amplitude is normalized. After being reflected by the object, the modulated optical signal returns to the sensor with a certain amplitude attenuation and phase shift, so the received signal can be expressed as

r(t) = b + a·cos(ωt − φ)    (2)

where b is the offset, a is the amplitude after attenuation, and φ is the phase shift. To achieve demodulation, the original emission signal is used as a reference and correlated with the received signal:

C(τ) = (a/2)·cos(ωτ + φ) + B    (3)

where the correlation signal is denoted as C(τ) and B is a constant offset. ToF cameras sample the correlation signal four times in one cycle, i.e., sampling is performed at ωτ = 0, π/2, π and 3π/2, yielding the samples A0, A1, A2 and A3. Considering that the received signal is mainly superimposed on the background image, the offset B must also be taken into account; it cancels out when differences of the samples are used. The phase shift and the amplitude can then be obtained from the four sample values as

φ = arctan( (A3 − A1) / (A0 − A2) )    (4)
a = √( (A3 − A1)² + (A0 − A2)² )    (5)

and we can find the distance value from the phase shift as

d = c·φ / (4π·f_mod)    (6)

where c is the speed of light and f_mod = ω/(2π) is the modulation frequency.
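To make the four-bucket demodulation concrete, the following NumPy sketch recovers phase, amplitude and distance from four correlation samples using Eqs. (4)-(6) as reconstructed above; the 6 MHz modulation frequency matches the dataset capture setting later in the paper, but the function itself is an illustrative reconstruction, not the authors' code, and the sampling convention (which determines whether a 1/2 factor appears in the amplitude formula) follows Eq. (3).

```python
import numpy as np

C = 3.0e8  # speed of light (m/s)

def four_bucket_depth(a0, a1, a2, a3, f_mod=6e6):
    """Recover phase shift, amplitude and distance from the four correlation
    samples taken at phase offsets 0, pi/2, pi and 3*pi/2 (Eqs. 4-6)."""
    phase = np.mod(np.arctan2(a3 - a1, a0 - a2), 2 * np.pi)   # Eq. (4), wrapped to [0, 2*pi)
    amplitude = np.sqrt((a3 - a1) ** 2 + (a0 - a2) ** 2)      # Eq. (5)
    distance = C * phase / (4 * np.pi * f_mod)                # Eq. (6)
    return phase, amplitude, distance

# Synthetic pixel at 3 m, received amplitude a = 0.4, background offset B = 0.5
true_d, a, B, f_mod = 3.0, 0.4, 0.5, 6e6
phi = 4 * np.pi * f_mod * true_d / C
samples = [B + 0.5 * a * np.cos(phi + k * np.pi / 2) for k in range(4)]  # Eq. (3) sampled
print(four_bucket_depth(*samples))  # ~ (0.754 rad, 0.4, 3.0 m)
```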

Figure 2: Samples of the received correlation signal, taken every quarter period (phase steps of π/2).
Network Architecture

Name       Layer              Kernel  Stride  I/O       Input
D1         conv + LeakyReLU   4×4     2       4/64      ToF raw
D2         conv + LeakyReLU   4×4     2       64/128    D1
D3         conv + LeakyReLU   4×4     2       128/256   D2
D4         conv + LeakyReLU   4×4     2       256/256   D3
Res1-Res9  ResBlock           3×3     1       256/256   D4
U1         deconv + ReLU      4×4     2       256/256   Res9
U2         deconv + ReLU      4×4     2       512/128   D3+U1
U3         deconv + ReLU      4×4     2       256/64    D2+U2
U4         deconv + Tanh      4×4     2       128/3     D1+U3

Table 1: Architecture of our network. "conv" denotes a convolutional layer and "deconv" a fractionally-strided convolutional layer. We use leaky ReLUs with a negative slope of 0.2. "ResBlock" denotes a residual block that contains two 3×3 convolutional layers with the same number of filters and a ReLU layer in the middle.

Quality of the measurement result. Raw ToF measurements contain the distance information as well as the quality and validity (confidence level) of the received optical signal. A higher amplitude of the measured signal indicates a more accurate distance measurement, and in ToF cameras the depth data of each pixel has its own validity and quality. The amplitude of the modulated light received by the ToF sensor is the primary quality indicator for the measured distance data, and it can be calculated as in Eq. (5). However, excessive active illumination makes the amplitude of the raw measurements very large, which leads to errors in the depth value due to over-exposure of the ToF sensor.

Problems of the traditional pipeline. In order to recover high-quality depth maps from imperfect ToF raw measurements, traditional ToF camera imaging methods often require a series of specialized processing steps, such as denoising, correction of multipath distortion and nonlinear compensation. However, these components are independent of each other and often rely on the assumption that the previous stages introduce no cumulative error or information loss. In practice, this assumption almost never holds, which may cause large errors in the final depth map. To alleviate the overall error, a distance calibration process adjusts the offset value to a selected calibration plane and sets the Fixed Pattern Noise (FPN) on that plane to zero. However, this technique cannot be generalized to weak-signal scenarios.

As mentioned above, the amplitude of the modulated signal received by the ToF sensor is the primary quality indicator for the measured distance data. When the amplitude is lower than a certain threshold, the traditional ToF imaging method is unable to calculate a reliable depth value at such a low SNR, so the depth information is missing in these areas (appearing as black holes in the depth map). Our experiments show that the traditional ToF camera imaging pipeline becomes invalid at a pixel when

a < a_th    (7)

where a_th is a threshold amplitude defined as a fraction of a_max, the maximum amplitude that can be imaged by the ToF sensor chip.

3.2 Learning from imperfect ToF camera raw

In this section, we present our depth reconstruction approach in detail. We first describe the advantages of our method of recovering high-quality depth images from weak and noisy ToF camera raw measurements over traditional ToF imaging methods. Then, we briefly describe our whole pipeline, which learns a mapping from ToF measurements acquired under low power illumination to the corresponding high-quality depth map, and introduce the network architecture of our method, shown in Tab. 1. Finally, we present how we train the model and the implementation details.

Comparison to the traditional pipeline. When the active illumination signal received by the ToF sensor is very weak, the raw ToF measurements have a very low signal-to-noise ratio (SNR) and amplitude. In this case, conventional edge-aware filtering methods such as the bilateral filter tend to fail. Traditional methods for denoising ToF measurements are based on hand-crafted rules and assumptions, but these rules and assumptions often become invalid as the scene and the intensity of the received signals change; this is particularly true for weak input signals. Therefore, it is very difficult to select parameters for all the image processing components that achieve good results in all scenarios. In contrast, the proposed method adopts an end-to-end learning and inference approach that translates the weak and noisy ToF camera raw to a high quality depth map, which avoids the highly complex parameter tuning required for such noisy and weak input signals.

Figure 3: Dataset overview: sample groups of measurements, the ToF camera, and depth statistics. We use the EPC660 ToF camera from ESPROS to collect a dataset of multiple pairs of short-exposure and corresponding long-exposure depth measurements. Diverse indoor scenes are collected in the dataset, including office rooms, restaurants, bedrooms and laboratories. The depth range is reasonable for indoor scenes: depth values range from 0cm to 591cm with a mean of 236.88cm.

Our pipeline. To build intuition for this end-to-end approach, we analyzed several previous works on image-to-image mapping. Most of them adopt an encoder-decoder network with or without skip connections [26], consisting of down-sampling layers, residual blocks and up-sampling layers. Compared with RGB images, the pixel values of a ToF depth map are closely related to the camera settings and to the scene structure and layout. In addition, the scene geometry must be consistent between the depth map and the raw measurements. These specific characteristics of ToF raw measurements should be combined with the prior work on image translation when designing the network architecture.

For the above reasons, we select an encoder-decoder with skip connections as our network architecture. The spatial size of the input progressively decreases as it passes through four down-sampling layers until it reaches the residual blocks; after passing through nine residual blocks and four up-sampling layers, it is restored to its original size. The strided convolution layers combined with activation layers serve as the encoder, and the fractionally-strided convolution layers combined with activation layers serve as the decoder. Residual blocks without normalization are adopted in the bottleneck. Moreover, following the U-net, we add skip connections between each pair of layer i and layer n-i to improve the accuracy of the results.
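For concreteness, the architecture in Tab. 1 can be written as the following PyTorch sketch. A padding of 1 for every 4×4 stride-2 (de)convolution is an assumption made so that each layer exactly halves or doubles the spatial size; the sketch is an illustrative reconstruction from the table rather than the authors' released code.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two 3x3 convolutions with a ReLU in between, no normalization (Table 1)."""
    def __init__(self, ch=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, 1, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, 1, 1),
        )
    def forward(self, x):
        return x + self.body(x)

class ToFDepthNet(nn.Module):
    """Encoder-decoder with skip connections following Table 1.
    Input: 4-channel ToF raw correlation measurements; output: 3-channel map."""
    def __init__(self):
        super().__init__()
        lrelu = lambda: nn.LeakyReLU(0.2, inplace=True)
        self.d1 = nn.Sequential(nn.Conv2d(4, 64, 4, 2, 1), lrelu())
        self.d2 = nn.Sequential(nn.Conv2d(64, 128, 4, 2, 1), lrelu())
        self.d3 = nn.Sequential(nn.Conv2d(128, 256, 4, 2, 1), lrelu())
        self.d4 = nn.Sequential(nn.Conv2d(256, 256, 4, 2, 1), lrelu())
        self.res = nn.Sequential(*[ResBlock(256) for _ in range(9)])
        self.u1 = nn.Sequential(nn.ConvTranspose2d(256, 256, 4, 2, 1), nn.ReLU(inplace=True))
        self.u2 = nn.Sequential(nn.ConvTranspose2d(512, 128, 4, 2, 1), nn.ReLU(inplace=True))
        self.u3 = nn.Sequential(nn.ConvTranspose2d(256, 64, 4, 2, 1), nn.ReLU(inplace=True))
        self.u4 = nn.Sequential(nn.ConvTranspose2d(128, 3, 4, 2, 1), nn.Tanh())

    def forward(self, x):
        d1 = self.d1(x)
        d2 = self.d2(d1)
        d3 = self.d3(d2)
        r = self.res(self.d4(d3))
        u1 = self.u1(r)
        u2 = self.u2(torch.cat([d3, u1], dim=1))    # skip connection D3 + U1
        u3 = self.u3(torch.cat([d2, u2], dim=1))    # skip connection D2 + U2
        return self.u4(torch.cat([d1, u3], dim=1))  # skip connection D1 + U3

# 128x128 crops with 4 raw channels, as used during training
out = ToFDepthNet()(torch.randn(1, 4, 128, 128))
print(out.shape)  # torch.Size([1, 3, 128, 128])
```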

To obtain high-quality depth reconstruction results, an L1 loss is adopted to train our network:

L_L1 = E[ || G(x) − y ||_1 ]    (8)

where x is the short-exposure ToF raw input, y is the corresponding ground-truth depth map, and G denotes our network.

Training details. Our network is implemented in PyTorch. During training, the inputs of the network are the ToF raw measurements captured under short exposure, and the ground truth is the corresponding depth map captured under regular or long exposure. We randomly crop 128×128 patches from the original 320×240 images for data augmentation; this strategy effectively improves the robustness of the model. We train our network using the Adam optimizer [16] with an initial learning rate of 0.0002 for the first 200 epochs, before linearly decaying it to 0 over another 1800 epochs.
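The training recipe (L1 loss, Adam with an initial learning rate of 2e-4 held for 200 epochs, then linear decay to zero over a further 1800 epochs) can be sketched as follows; `train_loader`, assumed to yield aligned (raw, depth) crop pairs, is not part of the original text and is only a placeholder.

```python
import torch
import torch.nn as nn

def lr_lambda(epoch, warm=200, decay=1800):
    """Constant LR for the first 200 epochs, then linear decay to 0 over 1800 more."""
    if epoch < warm:
        return 1.0
    return max(0.0, 1.0 - (epoch - warm) / float(decay))

def train(model, train_loader, epochs=2000, device="cuda"):
    model = model.to(device)
    criterion = nn.L1Loss()                                    # Eq. (8): L1 between prediction and ground truth
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)  # initial learning rate 0.0002
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    for epoch in range(epochs):
        for raw, depth in train_loader:                        # 128x128 random crops of raw / depth pairs
            raw, depth = raw.to(device), depth.to(device)
            optimizer.zero_grad()
            loss = criterion(model(raw), depth)
            loss.backward()
            optimizer.step()
        scheduler.step()                                       # learning-rate schedule advances per epoch
```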

4 Dataset

To enable the learning, we collect a comprehensive dataset under a variety of scenes and photographic conditions with a specialized ToF camera that provides raw data access. Due to hardware limitations, it is difficult to change the intensity of the received signal by directly changing the physical size of the pixels on the ToF sensor or the power of the infrared LED illumination of the development kit. However, we can modify the intensity of the received signal by changing the exposure time of the ToF camera, since the exposure time is directly proportional to the intensity of the received signal.

We use the EPC660 ToF camera from ESPROS to collect a dataset of multiple pairs of short-exposure and corresponding long-exposure depth measurements for training the proposed architecture. ToF raw measurements, an amplitude image and a depth map at 320×240 resolution are collected for each scene at each exposure time. We captured 200 groups of measurements with 200us and 400us exposure times respectively, together with 200 groups of corresponding long-exposure images, from a variety of scenes with varying materials. During the experiments, we use 150 groups for training and 50 groups for testing.
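A minimal PyTorch Dataset for pairing short-exposure raw frames with their long-exposure depth maps, with the aligned 128×128 random crop used for augmentation, might look like the sketch below; the file layout and the use of .npy files are assumptions for illustration, not the paper's actual data format.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class ToFPairDataset(Dataset):
    """Pairs of (short-exposure ToF raw, long-exposure depth) at 320x240,
    with aligned random 128x128 crops for data augmentation."""
    def __init__(self, raw_paths, depth_paths, crop=128):
        assert len(raw_paths) == len(depth_paths)
        self.raw_paths, self.depth_paths, self.crop = raw_paths, depth_paths, crop

    def __len__(self):
        return len(self.raw_paths)

    def __getitem__(self, idx):
        raw = np.load(self.raw_paths[idx]).astype(np.float32)      # (4, 240, 320) correlation samples
        depth = np.load(self.depth_paths[idx]).astype(np.float32)  # (1, 240, 320) ground-truth depth
        _, h, w = raw.shape
        top = np.random.randint(0, h - self.crop + 1)               # same crop applied to both tensors
        left = np.random.randint(0, w - self.crop + 1)
        raw = raw[:, top:top + self.crop, left:left + self.crop]
        depth = depth[:, top:top + self.crop, left:left + self.crop]
        return torch.from_numpy(raw), torch.from_numpy(depth)
```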

Diverse indoor scenes are collected in the dataset, including office rooms, restaurants, bedrooms and laboratories. We adopt ideal sinusoidal modulation functions to avoid the wiggling effect. The images are generally captured at night in rooms without infrared monitoring, to avoid the influence of solar radiation and of infrared light emitted by particular machines. Note that our scenes contain a variety of hard cases such as distant objects, fine structures, irregular shapes and various materials, including fabric, metals with low reflectivity and dark objects with high absorptivity.

We mount the ToF camera on a sturdy tripod to avoid camera shake and other vibration during capture. Because of the continuous modulation, a modulation frequency of 6MHz was selected for measuring depth in our scenes, whose range is 0-6 meters, to prevent roll-over from being observed. The exposure time is then adjusted to obtain high-quality raw data. After the long-exposure ToF measurements are captured, we decrease the exposure time to 200us and 400us respectively via software on a computer, so that the data are collected without touching the camera.
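As a sanity check on the 6MHz choice (a standard consequence of Eq. (6) with the phase wrapping at 2π, stated here as a verification rather than a claim from the paper), the unambiguous range is well above the 0-6 m scene range:

```latex
d_{\max} \;=\; \frac{c\,\varphi_{\max}}{4\pi f_{\mathrm{mod}}}
        \;=\; \frac{c}{2 f_{\mathrm{mod}}}
        \;=\; \frac{3\times 10^{8}\ \mathrm{m/s}}{2\times 6\times 10^{6}\ \mathrm{Hz}}
        \;=\; 25\ \mathrm{m} \;\gg\; 6\ \mathrm{m}.
```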

                        200us               400us
                        MAE      SSIM       MAE     SSIM
Traditional pipeline    179.79   0.1615     93.39   0.5162
Ours                    10.13    0.9156     7.94    0.9342
Table 2: Mean absolute error (MAE, cm) and structural similarity (SSIM) for 200us and 400us exposure times. Under the low exposure setting the traditional method can recover depth information only in some local regions, so its overall error is very large.

A mask for evaluating the quality of the ToF measurements is also included in our dataset. The quality and validity of the received signal are present in the raw data collected by ToF cameras: the signal amplitude as well as the ratio of ambient light to modulated light (AMR) indicate the quality and validity of the received signal. We combine these two features of the received signal in a certain proportion to generate a quantitative criterion for evaluating the quality of each pixel in the measurements. A threshold on this criterion can be defined to produce a per-pixel mask, which can be adopted in network training and depth map generation. For instance, unconfident pixels in the labels can be ignored during the computation of error gradients in training.
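One possible implementation of such a mask, combining amplitude and the ambient-to-modulated ratio and using it to mask the L1 loss, is sketched below. The weights, the threshold and the exact way the two features are combined are illustrative placeholders; the paper does not specify the proportion it uses.

```python
import torch

def confidence_mask(amplitude, ambient, w_amp=1.0, w_amr=1.0, threshold=0.5, eps=1e-6):
    """Combine normalized amplitude and the (inverted) ambient-to-modulated-light
    ratio into a per-pixel confidence score, then threshold it into a binary mask.
    Weights and threshold here are illustrative, not the paper's values."""
    amp_score = amplitude / (amplitude.max() + eps)   # higher amplitude -> more reliable
    amr = ambient / (amplitude + eps)                 # ambient-to-modulated ratio
    amr_score = 1.0 / (1.0 + amr)                     # lower AMR -> more reliable
    score = (w_amp * amp_score + w_amr * amr_score) / (w_amp + w_amr)
    return (score > threshold).float()

def masked_l1(pred, target, mask):
    """Ignore unconfident ground-truth pixels when computing the training loss."""
    return (mask * (pred - target).abs()).sum() / mask.sum().clamp(min=1.0)
```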

Fig. 3 shows a quantitative analysis of the depth-range distribution of the ToF measurements in our dataset. The depth range is reasonable for indoor scenes: depth values range from 0cm to 591cm with a mean of 236.88cm. Under short exposure there are some regions with no depth value or with heavy noise, due to the few reflected photons detected. The long-exposure ToF measurements are sufficient to serve as ground truth, although some noise still exists.

5 Experiments and Results

5.1 Quantitative results

We first quantify the depth error with the mean absolute error (MAE) and the structural similarity (SSIM) [34] of the predicted depth map compared to the ground truth. We then analyze the impact of different network structures on our results. Finally, we quantitatively analyze how the error of our method varies with the detection distance.

Effect of exposure time. Our dataset contains raw data acquired under 200us and 400us exposure times and the corresponding depth maps collected under a regular exposure time. We trained two models on the ToF raw measurements under 200us and 400us exposure respectively, and tested the accuracy of each model on the corresponding test set. We then calculate the mean absolute error (MAE) and the structural similarity (SSIM) [34] and compare the results of the traditional ToF camera pipeline with those of our proposed method on the test set. Note that the result is calculated over the whole test set, in which the object distance varies between 0 and 591cm as indicated in the previous section.
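Evaluation with MAE and SSIM [34] can be reproduced with standard tooling; the snippet below uses scikit-image's SSIM implementation, which may differ in detail from the implementation used for the numbers reported in this paper.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def evaluate(pred_depths, gt_depths):
    """Mean absolute error (in the depth maps' units, here cm) and SSIM,
    averaged over a list of predicted / ground-truth depth map pairs."""
    maes, ssims = [], []
    for pred, gt in zip(pred_depths, gt_depths):
        maes.append(np.mean(np.abs(pred - gt)))
        ssims.append(ssim(gt, pred, data_range=gt.max() - gt.min()))
    return float(np.mean(maes)), float(np.mean(ssims))
```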

As shown in Tab. 2, our method achieves an overall depth error of 7.94cm with raw captured under a 400us exposure time and 10.13cm with raw captured under a 200us exposure time. Although the accuracy of the depth map produced by our method decreases as the exposure time is reduced, the experimental results of both models greatly exceed those of the traditional pipeline.

          200us               400us
          MAE      SSIM       MAE     SSIM
LSGAN     12.00    0.8938     9.03    0.9149
U-net     10.13    0.9156     7.94    0.9342
Table 3: Mean absolute error (MAE, cm) and structural similarity (SSIM) when the U-net [26] (our default architecture) is replaced by the least-squares GAN [19]. Depth maps produced by the U-net have higher SSIM and lower MAE.

Effect of network structures. Since the network structure of the least-squares GAN [19] was used by [30] to recover depth maps from ToF raw measurements, we also evaluate the impact of applying this structure within our framework. Tab. 3 reports the result of replacing the U-net [26] (our default architecture) with the least-squares GAN [19]. The results show that although the least-squares GAN [19] has achieved great success in image translation, in the task of recovering depth maps from ToF raw measurements the depth maps produced by the U-net have higher SSIM and lower MAE.

Figure 4: A complex scene designed for evaluating the performance of our method at different distances. The distance between the ToF camera and the objects in the scene ranges from 100cm to 200cm.
Figure 5: Experiment results. From left to right: the 200us amplitude image, the traditional pipeline and our result at 200us, the traditional pipeline and our result at 400us, and the 4000us ground truth. In order to verify the effectiveness of our proposed method, we validated it on our test set for exposure times of 200us and 400us respectively. The results show that our method is able to robustly process ToF camera raw with an exposure time one order of magnitude shorter than that of conventional ToF cameras.

Depth stability over distance. In order to evaluate the depth measurement stability of our method over different distances, we conducted several case studies. We tested our method on a number of complex scenes, one of which is shown in Fig. 4. The same scene is observed by moving the ToF camera across 10 different viewing distances. Note that among these 10 captures the distance between the ToF camera and the objects in the scene stays within 200cm. The collected data consist of ToF raw measurements captured under 200us, 400us and 4000us exposure. The collected ToF raw then serves as the input of our trained network to generate depth maps, and the depth maps generated under short exposure are compared with the one generated under 4000us. We observed that the variance of the error among these comparisons is small, and the average MAE between the 4000us depth map and the 200us/400us depth maps is 4.8cm and 2.5cm respectively. This indicates that both the precision and the accuracy of our method, when applied to very short exposure ToF raw, are comparable to a strong pipeline with a 10 times longer exposure time. In many applications of ToF in consumer products, e.g. face recognition and photography in smartphones, the most widely used depth measuring distance is 30cm to 200cm. The results in this experiment therefore show the applicability of a very power efficient ToF design in this area.

5.2 Qualitative results on our dataset

(a) Amplitude. (b) ToF depth map with a suitable exposure time. (c) Our result from the ToF raw of (b).
Figure 6: The traditional pipeline fails to recover the depth values in the regions marked in the first group, since the black chair in the scene strongly absorbs the signal emitted by the ToF camera. In the regions marked in the second group, the large distance results in few photons being received by the ToF sensor. In contrast, our method is able to obtain high-quality depth maps for both types of regions.

We then present the results of our method and of the traditional ToF camera imaging pipeline in extreme cases on our test dataset. In this section, we verify that our proposed end-to-end solution can still reconstruct accurate depth values in extreme cases. Moreover, compared with the traditional method, even when the exposure time is set to a regular value our method is more robust to scenes containing objects with high absorptivity or distant regions. In addition, we also explored settings of exposure time, or equivalently of active illumination power, under which our method may fail. We note that other work [30] also implements end-to-end imaging for ToF cameras, but their model training requires a large-scale synthetic dataset, making it difficult to compare directly on our dataset.

Qualitative results with different exposure times. We have shown that the amplitude value is an important indicator for evaluating the raw data quality of the ToF camera. Considering that the effect of the power level of the ToF camera's active illumination on the amplitude values is equivalent to the effect of the exposure time, we simulate the power level of the active illumination by controlling the exposure time. In order to verify the effectiveness of our proposed method, we used the ToF raw measurements acquired under exposure times of 200us and 400us as the input of the network to predict the corresponding depth maps. As shown in Fig. 5, our results perform better than the depth maps generated by the traditional ToF pipeline. The experiments show that our method is able to robustly process ToF camera raw with an exposure time one order of magnitude shorter than that used in conventional ToF cameras.

Robustness under regular exposure. Since a scene may contain objects with low reflectivity or at a large distance, choosing an appropriate exposure time or a strong active illumination power does not guarantee that the depth map of the entire scene is of high quality. Our proposed method performs better on the depth estimation of these objects than the traditional ToF process, owing to its ability to translate the weak and noisy ToF camera raw to a depth map directly. As shown in Fig. 6, we deliberately collected scenes with dark objects and scenes with large distances (such as black stools and computer screens, glass doors with specular reflections, and objects with particularly large depth differences in the scene) to demonstrate the robustness of our method in these cases.

Failure cases. Our proposed method aims to solve the depth mapping problem under low-power active illumination. However, it may fail if the illumination power is too low. For instance, when we reduce the exposure time to 100us, as shown in Fig. 7, the quality of the generated depth map is not satisfactory enough for some applications.

Amplitude image and our result with 100us exposure.
Figure 7: Failure case. Although the goal of our proposed method is to solve the depth mapping problem under low-power active illumination, our approach is likely to fail for extremely low active illumination power, e.g., when we reduce the exposure time to 100us.

6 Discussion and Conclusion

6.1 Implication to ToF camera design

Using a neural network to robustly process ToF camera raw with a very short exposure time (raw data with low SNR) is a novel alternative for optimizing the power efficiency of the whole ToF system. Despite the involvement of neural network computation, the advent of many recent low-power neural network hardware platforms makes it a practical solution. In addition to lowering the power consumption of the ToF system, the results of this paper also provide a few extra design choices. First, a higher depth frame rate may be achievable because the exposure time can be significantly reduced. Second, with the proposed method a much smaller pixel size may be considered even though the SNR of the sensor raw could be low; higher depth resolution can thus be obtained with reasonable power consumption. Such possibilities pave the way for new innovation in ToF camera design.

6.2 Concluding remarks

In this paper, we show that it is possible to devise a deep learning model that recovers high quality depth information from very weak and noisy ToF raw measurements. To realize the learning process, we collected a comprehensive dataset using a real-world ToF camera. We show in the experiments that our proposed method is able to robustly process ToF camera raw with an exposure time one order of magnitude shorter than that used in conventional ToF cameras. While this neural network approach forms a key building block of a very power efficient ToF camera, it also sheds new light on innovations in ToF camera design. We will make our code and dataset publicly available.

For future research, we will continue to improve the quality of our dataset. Specifically, we would adopt HDR imaging to improve the quality and precision of the ground truth depth maps. Another opportunity for future work is to explicitly model the correction of MPI errors in an end-to-end trainable model to further enhance the accuracy of the results.

References

  • [1] A. Bhandari, A. Kadambi, R. Whyte, C. Barsi, M. Feigin, A. Dorrington, and R. Raskar. Resolving multipath interference in time-of-flight imaging via modulation frequency diversity and sparse regularization. Optics letters, 39(6):1705–1708, 2014.
  • [2] C. Chen, Q. Chen, J. Xu, and V. Koltun. Learning to see in the dark. arXiv preprint arXiv:1805.01934, 2018.
  • [3] X. Dong, G. Wang, Y. Pang, W. Li, J. Wen, W. Meng, and Y. Lu. Fast efficient algorithm for enhancement of low lighting video. IEEE, 2011.
  • [4] A. A. Dorrington, J. P. Godbaz, M. J. Cree, A. D. Payne, and L. V. Streeter. Separating true range measurements from multi-path and scattering interference in commercial range cameras. In Three-Dimensional Imaging, Interaction, and Measurement, volume 7864, page 786404. International Society for Optics and Photonics, 2011.
  • [5] D. Freedman, Y. Smolin, E. Krupka, I. Leichter, and M. Schmidt. Sra: Fast removal of general multipath for tof sensors. In European Conference on Computer Vision, pages 234–249. Springer, 2014.
  • [6] S. Fuchs. Multipath interference compensation in time-of-flight camera images. In Pattern Recognition (ICPR), 2010 20th International Conference on, pages 3583–3586, 2010.
  • [7] S. Fuchs, M. Suppa, and O. Hellwich. Compensation for multipath in tof camera measurements supported by photometric calibration and environment integration. In International Conference on Computer Vision Systems, pages 31–41, 2013.
  • [8] X. Guo, Y. Li, and H. Ling. Lime: Low-light image enhancement via illumination map estimation. IEEE Transactions on Image Processing, 26(2):982–993, 2017.
  • [9] M. Gupta, S. K. Nayar, M. B. Hullin, and J. Martin. Phasor imaging: A generalization of correlation-based time-of-flight imaging. ACM Transactions on Graphics (ToG), 34(5):156, 2015.
  • [10] F. Heide, W. Heidrich, M. Hullin, and G. Wetzstein. Doppler time-of-flight imaging. ACM Transactions on Graphics (ToG), 34(4):36, 2015.
  • [11] F. Heide, M. B. Hullin, J. Gregson, and W. Heidrich. Low-budget transient imaging using photonic mixer devices. ACM Transactions on Graphics (ToG), 32(4):45, 2013.
  • [12] S. Hickson, S. Birchfield, I. Essa, and H. Christensen. Efficient hierarchical graph-based segmentation of rgbd videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 344–351, 2014.
  • [13] Z. Hu, S. Cho, J. Wang, and M.-H. Yang. Deblurring low-light images with light streaks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3382–3389, 2014.
  • [14] S. Hussmann and T. Liepert. Robot vision system based on a 3d-tof camera. In Instrumentation and Measurement Technology Conference Proceedings, 2007. IMTC 2007. IEEE, pages 1–5, 2007.
  • [15] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, et al. Kinectfusion: real-time 3d reconstruction and interaction using a moving depth camera. In Proceedings of the 24th annual ACM symposium on User interface software and technology, pages 559–568, 2011.
  • [16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [17] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. volume 1, pages 541–551. MIT Press, 1989.
  • [18] H. Malm, M. Oskarsson, E. Warrant, P. Clarberg, J. Hasselgren, and C. Lejdfors. Adaptive enhancement and noise reduction in very low light-level video. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
  • [19] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2813–2821, 2017.
  • [20] J. Marco, Q. Hernandez, A. Munoz, Y. Dong, A. Jarabo, M. H. Kim, X. Tong, and D. Gutierrez. Deeptof: off-the-shelf real-time correction of multipath interference in time-of-flight imaging. ACM Transactions on Graphics (ToG), 36(6):219, 2017.
  • [21] A. Memo and P. Zanuttigh. Head-mounted gesture controlled interface for human-computer interaction. Multimedia Tools and Applications, 77(1):27–53, 2018.
  • [22] P. K. Nathan Silberman, Derek Hoiem and R. Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.
  • [23] S. Park, S. Yu, B. Moon, S. Ko, and J. Paik. Low-light image enhancement using variational optimization-based retinex model. IEEE Transactions on Consumer Electronics, 63(2):178–184, 2017.
  • [24] C. Peters, J. Klein, M. B. Hullin, and R. Klein. Solving trigonometric moment problems for fast transient imaging. ACM Transactions on Graphics (TOG), 34(6):220, 2015.
  • [25] T. Remez, O. Litany, R. Giryes, and A. M. Bronstein. Deep convolutional denoising of low-light images. arXiv preprint arXiv:1701.01687, 2017.
  • [26] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241, 2015.
  • [27] L. A. Schwarz, A. Mkhitaryan, D. Mateus, and N. Navab. Human skeleton tracking from depth data using geodesic distances and optical flow. Image and Vision Computing, 30(3):217–226, 2012.
  • [28] S. Shrestha, F. Heide, W. Heidrich, and G. Wetzstein. Computational imaging with multi-camera time-of-flight systems. ACM Transactions on Graphics (ToG), 35(4):33, 2016.
  • [29] S. Song, S. P. Lichtenberg, and J. Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 567–576, 2015.
  • [30] S. Su, F. Heide, G. Wetzstein, and W. Heidrich. Deep end-to-end time-of-flight imaging. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6383–6392, 2018.
  • [31] M. Van den Bergh and L. Van Gool. Combining rgb and tof cameras for real-time 3d hand gesture interaction. In Applications of Computer Vision (WACV), 2011 IEEE Workshop on, pages 66–72, 2011.
  • [32] A. Velten, D. Wu, A. Jarabo, B. Masia, C. Barsi, C. Joshi, E. Lawson, M. Bawendi, D. Gutierrez, and R. Raskar. Femto-photography: capturing and visualizing the propagation of light. ACM Transactions on Graphics (ToG), 32(4):44, 2013.
  • [33] K. R. Vijayanagar, M. Loghman, and J. Kim. Refinement of depth maps generated by low-cost depth sensors. IEEE International SoC Design Conference (ISOCC), pages 533–536, 2012.
  • [34] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.