Computational Models for Multiview Dense Depth Maps of Dynamic Scene

12/08/2015 ∙ by Qifei Wang, et al.

This paper reviews recent progress in depth map generation for dynamic scenes and the corresponding computational models. It mainly covers homogeneous ambiguity models in depth sensing, resolution models in depth processing, and consistency models in depth optimization. We also summarize future work in depth map generation.







I Introduction

With recent rapid advances in three-dimensional (3D) vision in both academia and industry, dense depth computation, especially multiview depth for complex dynamic scenes, has become a new paradigm in traditional computer vision. Obtaining dense multiview depth maps is a prerequisite for many challenging problems, such as dynamic scene modeling and reconstruction, 3D object recognition and tracking [1], 3D reconstruction [2], 3D video coding [3][4][5] and streaming [6][7][8], etc.

To meet these challenges, several problems must be treated properly.

(1) Low temporal and inter-view stability: robust performance in vision-based research and applications depends heavily on the stability of multiview depth maps in the temporal, spatial, and inter-view domains. However, this basic requirement often fails to be satisfied, and one of the reasons is homogeneous ambiguity: interference caused by homogeneous modulation frequencies among multiple time-of-flight (TOF) sensors, or homogeneous structured-light patterns among multiple RGB-D sensors.

(2) Resolution mismatch: depth maps record the 3D coordinates of the visible pixels in a color image. However, since the spatial resolution of depth sensors is usually lower than that of CCD sensors, the mismatch between pixels in depth maps and pixels in texture images may cause errors in processing steps that use both. Moreover, the temporal resolution mismatch between depth and CCD sensors is also a serious problem in many 3D applications.

(3) Low precision: with high-precision depth information, the performance on many traditional vision problems that use only texture images can be significantly improved. However, capability limitations [9] of depth sensors, such as the noise level and the difficulties that phase-based structured-light systems face in dynamic scenes, make it hard to obtain an accurate dense depth map.

In this paper, we present recent progress in the field of dense depth computational models for dynamic scenes, with emphasis on the difficulties listed above.

II Stability models

Recent progress in depth sensing, such as RGB-D and TOF sensors, has spurred the development of dynamic 3D scene modeling, markerless motion capture, and 3D object motion tracking. However, multiple units with overlapping views cause prominent ambiguities, resulting in holes, noise, and interference in the computed depth maps. Examples of homogeneous ambiguities are given in Fig. 1 (a) and (b), which correspond to multiple TOF and RGB-D sensors, respectively. As can be seen, the artifacts due to homogeneous ambiguity are unpredictable in the captured depth maps when compared with the results of a single depth sensor.

It is challenging to model the homogeneous ambiguity among multiple depth sensors, because the homogeneous frequencies or the structured-light pattern modules are tightly coupled. Decoupling models have been proposed in both the spatial and temporal domains. In [10], a spatial decoupling method was proposed via hierarchical De Bruijn binary modules. The method includes encoding and decoding stages, where hierarchical modules are set for encoding. Fig. 2 shows an example of two coupled modules. Fig. 2(a) gives the binary codes for these two modules, which form the lowest level of the hierarchy. These binary codes are then organized by rows and columns satisfying De Bruijn rules, yielding the hierarchical modules shown in Fig. 2(b).
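The combinatorial tool behind such patterns is the De Bruijn property: every fixed-length window of the sequence is unique, so a local observation suffices to identify its position in the pattern. As an illustration of that property only (not the hierarchical construction of [10] itself), a binary De Bruijn sequence can be generated with the standard recursive (FKM) algorithm:

```python
def de_bruijn(k, n):
    """Generate a De Bruijn sequence B(k, n): a cyclic sequence over an
    alphabet of size k in which every length-n word occurs exactly once."""
    a = [0] * k * n
    sequence = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return sequence

# A binary De Bruijn sequence B(2, 3) has length 2^3 = 8 and contains
# every 3-bit window exactly once when read cyclically.
seq = de_bruijn(2, 3)
print(seq)
```

Because each window is unique, a decoder that sees only a small patch of the projected pattern can still determine which part of which module it is looking at.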

The model proposed in [10] is able to decouple the homogeneous ambiguity of multiple RGB-D sensors without degrading the depth map resolution, which brings great convenience to 3D applications. Besides this model, other methods have been proposed in the temporal domain for multiple RGB-D sensors. For example, cyclic motion with a different phase was applied to each RGB-D camera, so that each camera captures its own structured-light pattern sharply while the coupled patterns from the other RGB-D cameras are blurred by the induced relative motion [11]. Post-processing to separate the blurred pixels is needed for this method. In addition, a time-multiplexed system was designed with a steerable hardware shutter to simulate different cycles for the corresponding RGB-D cameras [12]. In this case, the depth maps from different RGB-D cameras lie in different time slots, so further temporal calibration is needed for temporal resolution up-conversion.

As for TOF sensors, coprime modulation frequencies have so far been utilized in different sensors to avoid homogeneous ambiguities. However, the number of available frequency bands is at most three in many TOF cameras, which restricts multiview dense depth map capture.
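The coprime-frequency constraint can be sketched in a few lines; the frequency values below are hypothetical, chosen only to illustrate the pairwise-coprimality check:

```python
from math import gcd
from itertools import combinations

def pairwise_coprime(freqs):
    """Check that every pair of modulation frequencies is coprime, so the
    periodic signals of any two TOF sensors never stay phase-aligned."""
    return all(gcd(a, b) == 1 for a, b in combinations(freqs, 2))

# Hypothetical modulation frequencies in MHz (illustrative values only).
print(pairwise_coprime([29, 30, 31]))  # True: no two share a common factor
print(pairwise_coprime([20, 30, 31]))  # False: 20 and 30 share the factor 10
```

With only a handful of frequency bands available per camera, such a pairwise constraint quickly exhausts the valid assignments as the number of sensors grows, which is exactly the restriction noted above.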

(a) Temporal interference in multiple TOF sensor systems. Part (A) is captured by dual TOF sensors, and (B) by a single sensor.
(b) Spatial interference in a multiple RGB-D sensor system. Part (A) is captured by a single RGB-D camera, and (B) by a dual-camera system [11].
Fig. 1: Homogeneous ambiguities in multiple depth sensors.
Fig. 2: Binary codes and the hierarchical modules for spatial decoupling in multiple RGB-D depth sensing.

III Resolution models

A high spatial resolution depth map is usually the result of stereo matching on high-resolution color image pairs rather than of direct depth sensing. The temporal resolution, on the other hand, depends on the frame rate of the depth sensor. For example, an RGB-D camera can work at video frame rates, while a TOF camera can work at up to 60 fps but only at a low pixel resolution. These capacities fall far short of many practical requirements.

An iterative filter was proposed in [13] to up-sample the multiview depth maps simultaneously. This filter was originally proposed for multiview depth video coding to improve the rate-distortion performance. At the encoder side, the multiview depth maps are down-sampled by an odd-even interlaced extraction pattern. The depth maps can then be up-sampled via cross-view references. The reference relationship is described in Fig. 3, and the iterative filter can be written in the general form

d^(k+1) = W_1 d^(k)_{S_1} + W_2 d^(k)_{S_2} + W_3 d^(k)_{S_3},

where k is the iteration number, S_1, S_2, and S_3 are the pixel sets in Fig. 3(b), and W_1, W_2, and W_3 are coefficient matrices obtained through selected up-sampling filters. The iterative filter is convergent, and a unique result is obtained.
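A minimal sketch of such a fixed-point up-sampling iteration, assuming a 1D depth row and a simple neighbour-averaging update in place of the actual coefficient matrices of [13]:

```python
import numpy as np

# Illustrative sketch only: a depth row is down-sampled by keeping the
# even-indexed pixels, and the odd-indexed pixels are recovered by a
# fixed-point iteration (each missing pixel becomes the average of its
# two horizontal neighbours; the kept pixels stay fixed).
def iterative_upsample(row, n_iters=20):
    full = row.astype(float).copy()
    n = len(full)
    missing = [i for i in range(n) if i % 2 == 1]   # discarded pixel set
    for _ in range(n_iters):
        for i in missing:
            if i + 1 < n:
                full[i] = 0.5 * (full[i - 1] + full[i + 1])
            else:
                full[i] = full[i - 1]               # boundary: copy neighbour
    return full

truth = np.linspace(10.0, 17.0, 8)    # smooth ground-truth depth row
observed = truth.copy()
observed[1::2] = 0.0                  # odd-indexed samples were dropped
recovered = iterative_upsample(observed)
print(np.round(recovered, 2))         # close to truth on this smooth signal
```

On a smooth signal the iteration recovers the interior samples exactly; only the boundary pixel, which lacks a right neighbour, deviates from the truth.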

Other than iterative filters, depth map super-resolution filters have been proposed in two manners, depending on whether or not they refer to the corresponding color-view image. For color-image-assisted super-resolution, filter parameters can be learned from the corresponding color image references [14][15][16]. Sharp edges can also be preserved during depth up-sampling by aligning to the corresponding color image [17]. Alternatively, filter parameters can be fetched directly from the low-resolution depth maps [18][19], or a joint bilateral up-sampling (JBU) filter can be applied simply and directly [20].
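A minimal 1D sketch of joint bilateral up-sampling: the high-resolution color signal guides the interpolation of the low-resolution depth, so depth edges stay aligned with color edges. The bandwidths sigma_s and sigma_r and all signal values below are illustrative:

```python
import numpy as np

def jbu_1d(depth_lo, color_hi, scale, sigma_s=2.0, sigma_r=10.0):
    """Up-sample depth_lo to the resolution of color_hi: each output pixel
    is a weighted average of low-res depth samples, with weights combining
    spatial distance and colour (range) similarity to the guidance."""
    n_hi = len(color_hi)
    out = np.empty(n_hi)
    lo_pos = np.arange(len(depth_lo)) * scale       # low-res sample positions
    for p in range(n_hi):
        ds = lo_pos - p                             # spatial distances
        dr = color_hi[lo_pos] - color_hi[p]         # guidance differences
        w = (np.exp(-ds**2 / (2 * sigma_s**2))
             * np.exp(-dr**2 / (2 * sigma_r**2)))
        out[p] = np.sum(w * depth_lo) / np.sum(w)
    return out

# A step edge: colour and true depth change together at index 8.
color = np.array([0.0] * 8 + [100.0] * 8)
depth_lo = np.array([1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0])  # half resolution
depth_hi = jbu_1d(depth_lo, color, scale=2)
print(np.round(depth_hi, 2))   # sharp 1 -> 5 transition aligned with the edge
```

The range term suppresses depth samples from across the color edge, which is why JBU avoids the blurred transitions that plain bilinear up-sampling would produce.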

As for temporal resolution up-sampling, depth propagation is an effective approach. This method assumes that the variation of a given dynamic scene is identical for the depth and color information of one viewpoint. Motion vectors are widely utilized to describe motion in dynamic scenes, and propagation can be realized with these vectors. The main difficulty in depth propagation is obtaining accurate vectors for occluded or low-texture regions. To address motion vector accuracy, a rectification method was proposed in [21] that learns from the surrounding features in the color image. The rectification decision is made by a Heaviside step function H(x), which equals 1 for x >= 0 and 0 otherwise. Rectifying the motion vectors improves the propagation performance significantly. The experimental results in [21] showed that the temporal resolution of the proposed method can be eight times the sensing rate, or even higher.
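A minimal sketch of forward depth propagation gated by a Heaviside decision; the confidence map and threshold tau below are hypothetical placeholders for the features learned in [21], and "rectification" is simplified to falling back to zero motion:

```python
import numpy as np

def heaviside(x):
    """Heaviside step function: 1 for x >= 0, else 0."""
    return 1.0 if x >= 0 else 0.0

def propagate_depth(depth_t, mv, confidence, tau=0.5):
    """Forward-warp depth from frame t along per-pixel motion vectors,
    rectifying vectors whose confidence falls below the threshold tau."""
    h, w = depth_t.shape
    depth_next = np.zeros_like(depth_t)
    for y in range(h):
        for x in range(w):
            dy, dx = mv[y, x]
            if heaviside(confidence[y, x] - tau) == 0.0:
                dy, dx = 0, 0                     # unreliable vector: rectify
            ty = min(max(y + dy, 0), h - 1)       # clamp to the frame
            tx = min(max(x + dx, 0), w - 1)
            depth_next[ty, tx] = depth_t[y, x]    # forward-warp the depth
    return depth_next

depth = np.arange(16.0).reshape(4, 4)
mv = np.zeros((4, 4, 2), dtype=int)
mv[:, :, 1] = 1                                   # uniform 1-pixel shift right
conf = np.ones((4, 4))
shifted = propagate_depth(depth, mv, conf)
```

With all confidences below tau, every vector is rectified to zero motion and the depth map is copied unchanged; in a real system the fallback would instead be a vector learned from the surrounding color features.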

Recently, it was found that the up-conversion of temporal resolution for multiview depth computation is an energy minimization problem [22], and the traditional computational models can be utilized for this purpose. In [22], the computation of a depth value is a selection from multiple candidate sets, where the elements of each set come from different temporal and inter-view references, selected by the motion vectors. These candidates share the same value range but have different temporal or inter-view properties, i.e., they form different label sets. Therefore, the best selection of depth value is an energy minimization over multiple label sets, rather than over the single label set of the traditional model, and it takes the general form

E(l) = Σ_p D_p(l_p) + Σ_(p,q) V(l_p, l_q),   l_p ∈ L_1 ∪ … ∪ L_m,

where L_1, …, L_m are the label sets and l_p, l_q are the labels assigned to pixels p and q. The subsequent computation is candidate selection via the multi-set graph model shown in Fig. 4.
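The multi-label energy can be illustrated on a toy 3-pixel chain; all label values and costs below are made up, and the minimization is done by brute force for clarity rather than by the graph model of [22]:

```python
import itertools

# Toy instance of E(l) = sum_p D_p(l_p) + sum_(p,q) V(l_p, l_q) on a
# 3-pixel chain.  The merged label set mimics the union of temporal and
# inter-view candidate sets; all numbers are illustrative.
labels = [10, 11, 12, 13]                    # merged candidate depth values
D = {0: {10: 0, 11: 2, 12: 5, 13: 6},        # data cost per pixel and label
     1: {10: 4, 11: 1, 12: 1, 13: 5},
     2: {10: 6, 11: 3, 12: 0, 13: 2}}

def V(a, b):
    return abs(a - b)                        # smoothness: penalise depth jumps

def energy(l):
    return (sum(D[p][l[p]] for p in range(3))
            + sum(V(l[p], l[p + 1]) for p in range(2)))

best = min(itertools.product(labels, repeat=3), key=energy)
print(best, energy(best))  # (10, 11, 12) with energy 3
```

Brute force is exponential in the number of pixels; practical systems minimize such energies with graph cuts or belief propagation, which is the role the graph model of Fig. 4 plays in [22].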

(a) The cross-view depth up-sampling procedure.
(b) The iterative filter for Upscale in (a).
Fig. 3: Diagram of the cross-view depth up-sampling procedure for multiview depth.
Fig. 4: Graph models for energy minimization. Part (a) is the traditional model with one label set, and (b) is the multiple label set graph model.

IV Precision models

Although depth sensors provide invaluable information for much 3D research, their imaging capabilities are very limited in terms of noise level. Consistency among depth maps in the temporal, spatial, and inter-view domains remains a challenge in depth map optimization.

Depth map processing satisfies the Markov random field criterion, and the consistency problem can be modeled in a stochastic field. In [23], the inconsistencies were optimized by a risk function of Bayes-risk form,

R(d) = Σ_d' C(d, d') P(d' | O),

where d is a possible depth value and R(d) is the risk of selecting d. The posterior P(d | O) can be obtained through Bayesian modeling as

P(d | O) ∝ P(O | d) P(d),

where O is the observed condition, and P(O | d) and P(d) are the conditional and prior probabilities, respectively. Consequently, P(d) can be learned from the initial depth maps, and the optimization model can be refined iteratively. The model has shown satisfactory performance in optimizing the temporal, spatial, and inter-view consistency of multiview dense depth maps.
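The Bayes-risk selection idea can be sketched with made-up numbers: form a posterior from a likelihood and a prior, then pick the depth value that minimizes the expected cost under that posterior. All probabilities and the absolute-difference cost below are illustrative:

```python
import numpy as np

depths = np.array([10.0, 11.0, 12.0])
likelihood = np.array([0.2, 0.7, 0.1])      # P(O | d), e.g. from matching costs
prior = np.array([0.3, 0.4, 0.3])           # P(d), learned from initial maps
posterior = likelihood * prior              # Bayes: P(d | O) ∝ P(O | d) P(d)
posterior /= posterior.sum()

cost = np.abs(depths[:, None] - depths[None, :])   # C(d, d') = |d - d'|
risk = cost @ posterior                     # R(d) = sum_d' C(d, d') P(d' | O)
d_star = depths[np.argmin(risk)]
print(d_star)  # 11.0: the minimum-risk depth value
```

Unlike simply taking the posterior maximum, the risk minimization accounts for how costly each wrong choice would be, which is what makes it suitable for enforcing consistency across neighbouring estimates.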

Beyond the above models, the traditional energy minimization model can also be utilized with a properly defined smoothness term. For example, the gradients in the spatial and temporal domains can be measured to optimize consistency in the corresponding domains.
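Such a gradient-based smoothness term can be sketched as follows, assuming the depth video is stored as a (frames, rows, cols) array; the domain weights are illustrative:

```python
import numpy as np

def smoothness_energy(depth_video, w_spatial=1.0, w_temporal=1.0):
    """Sum of squared gradients of a depth video, split into spatial
    (within-frame) and temporal (across-frame) components."""
    gt, gy, gx = np.gradient(depth_video.astype(float))  # per-axis gradients
    spatial = np.sum(gy ** 2 + gx ** 2)
    temporal = np.sum(gt ** 2)
    return w_spatial * spatial + w_temporal * temporal

flat = np.full((3, 4, 4), 5.0)     # flat, static scene: zero energy
noisy = flat + np.random.default_rng(0).normal(0.0, 1.0, flat.shape)
print(smoothness_energy(flat), smoothness_energy(noisy))
```

A consistency-optimized depth video should drive this term down while the data term keeps the depths faithful to the observations.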

The capability of a depth sensing system can be improved significantly with the help of all the above computational models. A hybrid camera system was built in [24], and the dense depth maps it obtained for dynamic scenes were selected by MPEG as standard test sequences.

V Conclusion

In this paper we presented recent progress in depth computational models for dynamic scenes; the models cover the main processing chain for obtaining high-quality dense depth maps. We discussed homogeneous ambiguity models in depth sensing, resolution models in depth processing, and consistency models in depth optimization. Although there is still a long way to go toward high-quality depth sensing, the models reviewed here provide a starting point for further progress.


  • [1] C. Zhang, “Multiview imaging and 3dtv,” IEEE Signal Processing Magazine, vol. 1053, no. 5888/07, 2007.
  • [2] X. Cao, Q. Wang, X. Ji, and Q. Dai, “3d spatial reconstruction and communication from vision field,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 5445–5448.
  • [3] Q. Wang, M.-T. Sun, G. J. Sullivan, and J. Li, “Complexity-reduced geometry partition search and high efficiency prediction for video coding,” in IEEE International Symposium on Circuits and Systems (ISCAS), 2012, pp. 133–136.
  • [4] Q. Wang, J. Li, G. J. Sullivan, and M.-T. Sun, “Reduced-complexity search for video coding geometry partitions using texture and depth data,” in IEEE Visual Communications and Image Processing (VCIP), 2011, pp. 1–4.
  • [5] Q. Wang, X. Ji, M.-T. Sun, G. J. Sullivan, J. Li, and Q. Dai, “Complexity reduction and performance improvement for geometry partitioning in video coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 2, pp. 338–352, 2013.
  • [6] X. Ji, Q. Wang, B.-W. Chen, S. Rho, C. J. Kuo, and Q. Dai, “Online distribution and interaction of video data in social multimedia network,” Multimedia Tools and Applications, pp. 1–14, 2014.
  • [7] Q. Wang, X. Ji, Q. Dai, and N. Zhang, “Free viewpoint video coding with rate-distortion analysis,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 6, pp. 875–889, 2012.
  • [8] ——, “Region based rate-distortion analysis for 3d video coding,” in Data Compression Conference (DCC), 2010, pp. 555–555.
  • [9] Q. Wang, G. Kurillo, F. Ofli, and R. Bajcsy, “Evaluation of pose tracking accuracy in the first and second generations of microsoft kinect.”
  • [10] Z. Yan, L. Yu, Y. Yang, and Q. Liu, “Beyond the interference problem: hierarchical patterns for multiple-projector structured light system,” Applied optics, vol. 53, no. 17, pp. 3621–3632, 2014.
  • [11] A. Maimone and H. Fuchs, “Reducing interference between multiple structured light depth sensors using motion,” in IEEE Virtual Reality Short Papers and Posters (VRW), 2012, pp. 51–54.
  • [12] K. Berger, K. Ruhl, Y. Schroeder, C. Bruemmer, A. Scholz, and M. A. Magnor, “Markerless motion capture using multiple color-depth sensors,” in Vision, Modeling, and Visualization (VMV), 2011, pp. 317–324.
  • [13] Q. Liu, Y. Yang, R. Ji, Y. Gao, and L. Yu, “Cross-view down/up-sampling method for multiview depth video coding,” IEEE Signal Processing Letters, vol. 19, no. 5, pp. 295–298, 2012.
  • [14] Q. Yang, R. Yang, J. Davis, and D. Nistér, “Spatial-depth super resolution for range images,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1–8.
  • [15] D. Ferstl, C. Reinbacher, R. Ranftl, M. Rüther, and H. Bischof, “Image guided depth upsampling using anisotropic total generalized variation,” in IEEE International Conference on Computer Vision (ICCV), 2013, pp. 993–1000.
  • [16] Y. Yang, J. Cai, Z. Zha, M. Gao, and Q. Tian, “A stereo-vision-assisted model for depth map super-resolution,” in IEEE International Conference on Multimedia and Expo (ICME), 2014, pp. 1–6.
  • [17] K.-H. Lo, K.-L. Hua, and Y.-C. F. Wang, “Depth map super-resolution via markov random fields without texture-copying artifacts,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 1414–1418.
  • [18] S. Ikehata, J.-H. Cho, and K. Aizawa, “Depth map inpainting and super-resolution based on internal statistics of geometry and appearance,” in IEEE International Conference on Image Processing (ICIP), 2013, pp. 938–942.
  • [19] D. Kim and K.-j. Yoon, “High-quality depth map up-sampling robust to edge noise of range sensors,” in IEEE International Conference on Image Processing (ICIP), 2012, pp. 553–556.
  • [20] F. Li, J. Yu, and J. Chai, “A hybrid camera for motion deblurring and depth map super-resolution,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR).   IEEE, 2008, pp. 1–8.
  • [21] Y. Yang, Q. Liu, R. Ji, and Y. Gao, “Dynamic 3d scene depth reconstruction via optical flow field rectification,” 2012.
  • [22] Y. Yang, X. Wang, Q. Liu, M. Xu, and L. Yu, “A bundled-optimization model of multiview dense depth map synthesis for dynamic scene reconstruction,” Information Sciences, 2014.
  • [23] Q. Liu, Y. Yang, Y. Gao, R. Ji, and L. Yu, “A bayesian framework for dense depth estimation based on spatial–temporal correlation,” Neurocomputing, vol. 104, pp. 1–9, 2013.
  • [24] E.-K. Lee and Y.-S. Ho, “Generation of high-quality depth maps using hybrid camera system for 3-d video,” Journal of Visual Communication and Image Representation, vol. 22, no. 1, pp. 73–84, 2011.