I Introduction
With recent rapid advances in three-dimensional (3D) vision in both academia and industry, dense depth computation, especially multiview depth for complex dynamic scenes, has become a new paradigm in traditional computer vision. Obtaining dense multiview depth maps is a prerequisite for many challenging problems, such as dynamic scene modeling and reconstruction, 3D object recognition and tracking
[1], 3D reconstruction [2], and 3D video coding [3][4][5] and streaming [6][7][8]. To meet these challenges, several problems must be addressed properly.
(1) Low temporal and inter-view stability: robust performance in vision-based research and applications depends heavily on the stability of multiview depth maps in the temporal, spatial, and inter-view domains. However, this basic requirement often fails to be satisfied, one major reason being homogeneous ambiguity: interference caused by identical modulation frequencies among multiple time-of-flight (TOF) sensors, or by identical structured-light patterns among multiple RGBD sensors.
(2) Resolution mismatch: depth maps record the 3D coordinates of the pixels visible in the color image. However, since the spatial resolution of depth sensors is usually lower than that of CCD sensors, the mismatch between pixels in the depth maps and pixels in the texture images can introduce errors into processing steps that use both. The temporal resolution mismatch between depth and CCD sensors is also a serious problem in many 3D applications.
(3) Low precision: with high-precision depth information, the performance of traditional vision tasks that use only texture images can be improved significantly. However, capability limitations of depth sensors [9], such as their noise level and the restrictions of phase-based structured-light systems in dynamic scenes, make accurate dense depth maps difficult to obtain.
In this paper, we present recent progress in dense depth computational models for dynamic scenes, organized around the criteria and difficulties above.
II Stability Models
Recent progress in depth sensing, such as RGBD and TOF sensors, has spurred the development of dynamic 3D scene modeling, markerless motion capture, and 3D object motion tracking. However, multiple units with overlapping views cause prominent ambiguities, resulting in holes, noise, and interference in the computed depth maps. Examples of homogeneous ambiguities are given in Fig. 1(a) and (b), which correspond to multiple TOF and RGBD sensors, respectively. As can be seen, the artifacts due to homogeneous ambiguity are not predictable in the captured depth maps when compared with the results of a single depth sensor.
Modeling the homogeneous ambiguity between multiple depth sensors is challenging because the homogeneous frequencies or structured-light pattern modules are tightly coupled. Decoupling models have been proposed in either the spatial or the temporal domain. In [10], a spatial decoupling method was proposed via hierarchical De Bruijn binary modules. The method consists of encoding and decoding stages, with hierarchical modules used for encoding. Fig. 2 shows an example for two coupled modules: Fig. 2(a) gives the binary codes for the two modules, which form the lowest level of the hierarchy; these binary codes are then arranged by rows and columns according to De Bruijn rules, yielding the hierarchical modules shown in Fig. 2(b).
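The De Bruijn property — every length-n window occurs exactly once — is what makes each local window of such a pattern uniquely decodable. As an illustrative sketch (the standard Lyndon-word construction, not the hierarchical scheme of [10]), a binary De Bruijn sequence can be generated as follows:

```python
def de_bruijn(k, n):
    """Generate a De Bruijn sequence over alphabet {0..k-1}: every length-n
    window of the (cyclic) sequence occurs exactly once, so any observed
    window identifies its position in the pattern uniquely."""
    a = [0] * (k * n)
    seq = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return seq

# binary pattern with unique 3-bit windows: 2^3 = 8 symbols long
pattern = de_bruijn(2, 3)
```

In a structured-light setting, the row/column codes of the projected pattern would be drawn from such sequences so that any small observed window can be localized within the projection.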
The model in [10] decouples the homogeneous ambiguity of multiple RGBD sensors without degrading depth map resolution, which is of great convenience to 3D applications. Other methods address multiple RGBD sensors in the temporal domain. For example, cyclic motion with a different phase was applied to each RGBD camera, so that each camera captures its own structured-light pattern sharply while the coupled patterns from the other cameras are blurred by the induced relative motion [11]; post-processing to separate the blurred pixels is needed for this method. In another approach, a time-multiplexed system was built with a steerable hardware shutter to assign different cycles to the corresponding RGBD cameras [12]; the depth maps from different cameras then fall into different time slots, so further temporal calibration is needed for temporal resolution upconversion.
As for TOF sensors, coprime modulation frequencies have so far been assigned to different sensors to avoid homogeneous ambiguities. In practice, many TOF cameras offer at most three available frequency bands, which restricts multiview dense depth map capture.
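The assignment problem can be sketched as follows. The candidate frequencies below are illustrative values (not taken from any specific camera), and treating coprimality of the MHz values as the interference criterion is a simplification:

```python
from math import gcd

def assign_coprime_frequencies(candidates_mhz, n_sensors):
    """Greedily assign pairwise-coprime modulation frequencies (in MHz) so
    that no two sensors share a common divisor in their modulation rates."""
    chosen = []
    for f in candidates_mhz:
        if all(gcd(f, g) == 1 for g in chosen):
            chosen.append(f)
            if len(chosen) == n_sensors:
                return chosen
    raise ValueError("not enough pairwise-coprime candidates")
```

With at most three usable bands per camera, such a search runs out of pairwise-coprime candidates quickly as the number of sensors grows, which is exactly the restriction noted above.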
III Resolution Models
High spatial resolution in a depth map is usually the result of stereo matching on high-resolution color image pairs rather than of direct depth sensing. Temporal resolution, on the other hand, depends on the sensing rate of the depth sensor: RGBD cameras work at video rate, while TOF cameras can reach up to 60 fps but only at much lower spatial resolution. These capabilities fall far short of many practical requirements.
An iterative filter was proposed in [13] to upsample multiview depth maps simultaneously. The filter was originally designed for multiview depth video coding to improve rate-distortion performance. At the encoder side, the multiview depth maps are downsampled with an odd-even interlaced extraction pattern; they can then be upsampled via cross-view references. The reference relationship is described in Fig. 3, and the iterative filter is modeled as

(1)

where the iteration index, the pixel sets shown in Fig. 3(b), and the coefficient matrices obtained through selected upsampling filters enter the update. The iterative filter is convergent, so a unique result is obtained.
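The convergence claim can be illustrated with a toy linear iteration. The matrix C and vector b below are random stand-ins for the coefficient matrices and the known-sample contributions of the actual filter in [13]; the point is only that a contraction (spectral radius below 1) converges to the unique fixed point:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Stand-ins for the filter's coefficient matrix and the contribution of the
# known (downsampled) samples; scaling keeps the spectral radius below 1,
# which is what guarantees convergence to a unique result.
C = rng.random((n, n))
C = C / (2 * np.abs(np.linalg.eigvals(C)).max())
b = rng.random(n)

depth = np.zeros(n)
for _ in range(500):
    depth = C @ depth + b  # one pass of the iterative filter

# The unique limit solves (I - C) D = b
fixed_point = np.linalg.solve(np.eye(n) - C, b)
```

Because the iteration is a contraction, the result is independent of the initial estimate, which matches the uniqueness property stated for the filter in [13].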
Besides the iterative filter, depth map super-resolution filters have been proposed in two flavors: with or without reference to the corresponding-view color image. For color-image-assisted super-resolution, filter parameters can be learned from the corresponding color images [14][15][16], and sharp edges can be preserved during depth upsampling by aligning to the corresponding color image [17]. Alternatively, filter parameters can be fetched directly from the low-resolution depth maps [18][19], or a joint bilateral upsampling (JBU) filter can be applied simply and directly [20].

As for temporal resolution upsampling, depth propagation is an effective approach. It assumes that, for a given dynamic scene, the variation is identical for the depth and color information of one viewpoint. Motion vectors are widely used to describe the motion in a dynamic scene, and propagation is realized through these vectors. The main difficulty is that accurate vectors are very hard to obtain in occluded or weakly textured regions. To address motion vector accuracy, a rectification method was proposed in [21] that learns from the surrounding features in the color image. The decision in rectification is settled by a Heaviside step function

(2)

where the argument of the step function measures the reliability of the motion vector. The rectification of motion vectors improves propagation performance significantly: the experiments in [21] showed that the temporal resolution can reach eight times the sensing rate or even higher.
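A minimal sketch of this idea follows, assuming a dense flow field and using single-pixel color similarity as the rectification feature; the threshold `tau` and the color test are illustrative stand-ins for the learned features of [21]:

```python
import numpy as np

def propagate_depth(depth_prev, flow, color_curr, color_prev, tau=10.0):
    """Propagate depth along per-pixel motion vectors; a Heaviside-style
    decision keeps a vector only when the color values it links are close.
    `tau` and the single-pixel color test are illustrative stand-ins for
    the learned rectification features of [21]."""
    h, w = depth_prev.shape
    depth_curr = depth_prev.copy()  # fallback: keep previous depth
    for y in range(h):
        for x in range(w):
            dy, dx = flow[y, x]
            sy, sx = int(round(y - dy)), int(round(x - dx))
            if 0 <= sy < h and 0 <= sx < w:
                err = abs(float(color_curr[y, x]) - float(color_prev[sy, sx]))
                # Heaviside decision: 1.0 -> trust the vector, 0.0 -> reject
                if np.heaviside(tau - err, 1.0) == 1.0:
                    depth_curr[y, x] = depth_prev[sy, sx]
    return depth_curr
```

Rejected vectors simply fall back to the previous depth here; a full system would instead re-estimate them from the surrounding features.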
Recently, the temporal resolution upconversion of multiview depth computation was shown to be an energy minimization problem [22], so traditional computational models can be applied for this purpose. In [22], a depth value is computed by selecting from multiple candidate sets, whose elements come from different temporal and inter-view references; the element is selected by the motion vector. The candidates share the same value range but differ in their temporal or inter-view properties, i.e., they form different label sets. The best depth value is therefore obtained by energy minimization over multiple label sets, rather than over the single label set of the traditional model:

(3)

where the energy is defined over the label sets and their labels. The subsequent computation is candidate selection via a multi-set graph model, illustrated in Fig. 4.
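A 1-D sketch of the candidate-selection idea: each pixel chooses one value from its candidate set (e.g. temporal and inter-view proposals) by minimizing a data term plus a pairwise smoothness term. The dynamic program below is a deliberate simplification of the multi-set graph model in [22]:

```python
import numpy as np

def select_depth_row(candidates, data_cost, lam=1.0):
    """Choose one depth candidate per pixel along a 1-D row by minimizing
    data_cost[i, j] plus lam * |depth difference| between neighbours,
    using dynamic programming over the candidate (label) sets."""
    n, k = data_cost.shape
    cost = data_cost[0].copy()
    back = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        # pairwise smoothness between neighbouring pixels' candidates
        pair = lam * np.abs(candidates[i][None, :] - candidates[i - 1][:, None])
        total = cost[:, None] + pair            # indexed [prev_label, cur_label]
        back[i] = np.argmin(total, axis=0)
        cost = data_cost[i] + total.min(axis=0)
    labels = np.empty(n, dtype=int)
    labels[-1] = int(np.argmin(cost))
    for i in range(n - 1, 0, -1):               # backtrack the optimal path
        labels[i - 1] = back[i, labels[i]]
    return candidates[np.arange(n), labels]
```

On a 2-D grid with multiple label sets the exact optimum is no longer a chain DP, which is why [22] resorts to a graph model over the candidate sets.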
IV Precision Models
Although depth sensors provide invaluable information for many 3D research tasks, their imaging capabilities are very limited in terms of noise level, and the consistency of depth maps across the temporal, spatial, and inter-view domains remains a challenge in depth map optimization.
Processing a depth map satisfies the Markov random field criterion, so the consistency problem can be modeled in a stochastic field. In [23], the inconsistencies were optimized with a risk function
(4) 
where the risk of selecting each possible depth value is evaluated. This risk can be obtained through Bayesian modeling as

(5)

where the conditional likelihood and the prior probability, given the observed conditions, can be learned from the initial depth maps, and the optimization model is then refined iteratively. The model has shown satisfactory performance in optimizing the temporal, spatial, and inter-view consistency of multiview dense depth maps.

Beyond this model, the traditional energy minimization model can also be used with a properly defined smoothness term; for example, gradients in both the spatial and temporal domains can be measured to optimize consistency in the corresponding domains.
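The risk-minimization idea can be sketched as follows, assuming a discrete depth range and an absolute-difference loss (both illustrative choices, not necessarily those of [23]):

```python
import numpy as np

def bayes_depth(depth_values, likelihood, prior):
    """Pick the depth with minimum Bayes risk: build the posterior from the
    likelihood and prior, then minimize the expected absolute-difference
    loss over the discrete depth range."""
    posterior = likelihood * prior
    posterior = posterior / posterior.sum()
    # risk of committing to depth d = expected loss under the posterior
    risks = [np.sum(np.abs(d - depth_values) * posterior) for d in depth_values]
    return depth_values[int(np.argmin(risks))]
```

With an absolute-difference loss this selects the posterior median rather than the posterior mode, which is one way a risk function can differ from plain maximum a posteriori estimation.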
The capability of depth sensing systems can be improved significantly with the help of the computational models above. A hybrid camera system was built in [24], and the dense depth maps it obtained for dynamic scenes were selected by MPEG as standard test sequences.
V Conclusion
In this paper we have presented recent progress in depth computational models for dynamic scenes, covering the main processing chain for obtaining high-quality dense depth maps: homogeneous ambiguity models in depth sensing, resolution models in depth processing, and consistency models in depth optimization. Although high-quality depth sensing still has a long way to go, the models discussed here set up a new starting point for further progress.
References
 [1] C. Zhang, “Multiview imaging and 3DTV,” IEEE Signal Processing Magazine, 2007.
 [2] X. Cao, Q. Wang, X. Ji, and Q. Dai, “3d spatial reconstruction and communication from vision field,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 5445–5448.
 [3] Q. Wang, M.-T. Sun, G. J. Sullivan, and J. Li, “Complexity-reduced geometry partition search and high efficiency prediction for video coding,” in IEEE International Symposium on Circuits and Systems (ISCAS), 2012, pp. 133–136.
 [4] Q. Wang, J. Li, G. J. Sullivan, and M.-T. Sun, “Reduced-complexity search for video coding geometry partitions using texture and depth data,” in IEEE Visual Communications and Image Processing (VCIP), 2011, pp. 1–4.
 [5] Q. Wang, X. Ji, M.-T. Sun, G. J. Sullivan, J. Li, and Q. Dai, “Complexity reduction and performance improvement for geometry partitioning in video coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 2, pp. 338–352, 2013.
 [6] X. Ji, Q. Wang, B.-W. Chen, S. Rho, C. J. Kuo, and Q. Dai, “Online distribution and interaction of video data in social multimedia network,” Multimedia Tools and Applications, pp. 1–14, 2014.
 [7] Q. Wang, X. Ji, Q. Dai, and N. Zhang, “Free viewpoint video coding with rate-distortion analysis,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 6, pp. 875–889, 2012.
 [8] ——, “Region-based rate-distortion analysis for 3d video coding,” in Data Compression Conference (DCC), 2010, p. 555.
 [9] Q. Wang, G. Kurillo, F. Ofli, and R. Bajcsy, “Evaluation of pose tracking accuracy in the first and second generations of microsoft kinect.”
 [10] Z. Yan, L. Yu, Y. Yang, and Q. Liu, “Beyond the interference problem: hierarchical patterns for multiple-projector structured light system,” Applied Optics, vol. 53, no. 17, pp. 3621–3632, 2014.
 [11] A. Maimone and H. Fuchs, “Reducing interference between multiple structured light depth sensors using motion,” in IEEE Virtual Reality Short Papers and Posters (VRW), 2012, pp. 51–54.
 [12] K. Berger, K. Ruhl, Y. Schroeder, C. Bruemmer, A. Scholz, and M. A. Magnor, “Markerless motion capture using multiple color-depth sensors,” in Vision, Modeling, and Visualization (VMV), 2011, pp. 317–324.
 [13] Q. Liu, Y. Yang, R. Ji, Y. Gao, and L. Yu, “Crossview down/upsampling method for multiview depth video coding,” IEEE Signal Processing Letters, vol. 19, no. 5, pp. 295–298, 2012.
 [14] Q. Yang, R. Yang, J. Davis, and D. Nistér, “Spatial-depth super resolution for range images,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1–8.
 [15] D. Ferstl, C. Reinbacher, R. Ranftl, M. Rüther, and H. Bischof, “Image guided depth upsampling using anisotropic total generalized variation,” in IEEE International Conference on Computer Vision (ICCV), 2013, pp. 993–1000.
 [16] Y. Yang, J. Cai, Z. Zha, M. Gao, and Q. Tian, “A stereo-vision-assisted model for depth map super-resolution,” in IEEE International Conference on Multimedia and Expo (ICME), 2014, pp. 1–6.
 [17] K.-H. Lo, K.-L. Hua, and Y.-C. F. Wang, “Depth map super-resolution via markov random fields without texture-copying artifacts,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 1414–1418.
 [18] S. Ikehata, J.-H. Cho, and K. Aizawa, “Depth map inpainting and super-resolution based on internal statistics of geometry and appearance,” in IEEE International Conference on Image Processing (ICIP), 2013, pp. 938–942.
 [19] D. Kim and K.-J. Yoon, “High-quality depth map upsampling robust to edge noise of range sensors,” in IEEE International Conference on Image Processing (ICIP), 2012, pp. 553–556.
 [20] F. Li, J. Yu, and J. Chai, “A hybrid camera for motion deblurring and depth map super-resolution,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.
 [21] Y. Yang, Q. Liu, R. Ji, and Y. Gao, “Dynamic 3d scene depth reconstruction via optical flow field rectification,” 2012.
 [22] Y. Yang, X. Wang, Q. Liu, M. Xu, and L. Yu, “A bundledoptimization model of multiview dense depth map synthesis for dynamic scene reconstruction,” Information Sciences, 2014.
 [23] Q. Liu, Y. Yang, Y. Gao, R. Ji, and L. Yu, “A bayesian framework for dense depth estimation based on spatial–temporal correlation,” Neurocomputing, vol. 104, pp. 1–9, 2013.
 [24] E.-K. Lee and Y.-S. Ho, “Generation of high-quality depth maps using hybrid camera system for 3d video,” Journal of Visual Communication and Image Representation, vol. 22, no. 1, pp. 73–84, 2011.