Performance capture methods enable the reconstruction of the motion, the dynamic surface geometry, and the appearance of real-world scenes from multiple video recordings, for example the deforming geometry of an actor's body and apparel, or the actor's facial expressions [5, 7, 2, 18]. Many methods that capture space-time coherent surfaces reconstruct a coarse-to-medium scale 4D model of the scene in a first step, e.g. by deforming a mesh or a rigged template such that it aligns with the images [5, 18]. Finer-scale shape detail is then added in a second refinement step. In this second step, some methods align the surface to a combination of silhouette constraints and sparse image features. Such approaches, however, recover only medium-scale detail and may suffer from erroneous feature correspondences between images and shape. Photo-consistency constraints can also be used to compute smaller-scale deformations via stereo-based refinement [5, 14]. However, existing approaches that follow this path often resort to discrete sampling of local displacements, since phrasing dense stereo-based refinement as a continuous optimization problem has been more challenging. Some recent methods resort to shading-based techniques, such as shape-from-shading or photometric stereo, to capture small-scale displacements [22, 21, 18]. However, these methods either require controlled and calibrated lighting, or a complex inverse estimation of lighting and appearance when they are applied under uncontrolled recording conditions.
In this paper, we contribute a new, effective solution to the refinement step using multi-view photo-consistency constraints. As input, our method expects synchronized and calibrated multi-view video of a scene and a reconstructed coarse mesh animation, as can be obtained with previous methods from the literature. Background subtraction or image silhouettes are not required for refinement.
Our first contribution is a new shape representation that models the mesh surface with a dense collection of 3D Gaussian functions centered at each vertex and each having an associated color. A similar decomposition into 2D Gaussian functions is applied to each input video frame.
This scene representation enables our second contribution, namely the formulation of dense photo-consistency-based surface refinement as a global optimization problem in the position of each vertex on the surface. Unlike previous performance capture methods, we are able to phrase the model-to-image photo-consistency energy that guides the deformation as a closed form expression, and we can compute its analytic derivatives. Our problem formulation has the additional advantage that it enables implicit handling of occlusions, as well as spatial and temporal coherence constraints, while preserving a smooth consistency energy function. We can effectively minimize this function in terms of dense local surface displacements with standard gradient-based solvers. In addition to these advantages, unlike many previous methods, our framework does not require a potentially error-prone sparse set of feature correspondences or discrete sampling and testing of surface displacements, and thus provides a new way of continuous optimization of the dense surface deformation.
We used our approach to reconstruct full-body performances of human actors wearing loose clothing and performing different motions. Initial coarse reconstructions of the scene were obtained with the approaches by Gall et al. and Starck and Hilton. Our results (Fig. 1 and Sect. 6) show that our approach reconstructs more of the fine-scale detail present in the input video sequences than the baseline methods, for instance the wrinkles in a skirt. We also demonstrate these improvements quantitatively.
2 Related Work
Marker-less performance capture methods are able to reconstruct dense dynamic surface geometry of moving subjects from multi-view video, for instance of people in loose clothing, possibly along with pose parameters of an underlying kinematic skeleton. Most of them use data recorded with dense multi-camera systems in controlled studio environments. Some methods employ variants of shape-from-silhouette or active or passive stereo [23, 11, 14, 20, 17], which usually results in temporally incoherent reconstructions. Space-time coherence is easier to achieve with model-based approaches that deform a static shape template (obtained by a laser scan or image-based reconstruction) such that it matches the subject, e.g. a person [4, 5, 18, 1, 7] or a person's apparel. Some of them jointly track a skeleton and the non-rigidly deforming surface [18, 1, 6]; multi-person reconstruction has also been demonstrated. Other approaches use a generally deformable template without an embedded skeleton to capture 4D models, e.g. an elastically deformable surface or volume [5, 13], or a patch-based surface representation. Most of the approaches mentioned so far either reconstruct only coarse dynamic surface models that lack fine-scale detail, or treat coarse reconstruction as a first stage, after which fine-scale detail is added to the coarse result in a second refinement step.
Some methods use a combination of silhouette constraints and sparse feature correspondences to estimate, at best, medium-scale non-rigid 4D surface detail. Other approaches use stereo-based photo-consistency constraints in addition to silhouettes to achieve denser estimates of finer-scale deformations [14, 5]. Phrasing dense stereo-based surface refinement as a continuous optimization problem, as is done in variational approaches, is an involved problem. Thus, stereo-based refinement in performance capture often resorts to discrete sampling of surface displacements, which is less efficient and makes globally smooth and coherent solutions harder to achieve.
In this paper, we propose a new formulation of stereo-based surface refinement as a continuous optimization problem, which is based on a new surface representation with Gaussian functions. In addition, our refinement method also succeeds if silhouettes are not available, making the approach more generally applicable.
An alternative way to recover fine-scale deforming surface detail is to use shading-based methods, e.g. shape-from-shading or photometric stereo. Many of these approaches require controlled and calibrated lighting [8, 19], which reduces their applicability. More recently, shading-based refinement of dynamic scenes captured under more general lighting has been shown, but these approaches are computationally challenging, as they require solving an inverse rendering problem to obtain estimates of illumination, appearance and shape at the same time.
The method we propose has some similarity to the work of Sand et al., who capture skin deformation as a displacement field on a template mesh; however, they require marker-based skeleton capture and only fit the surface to match the silhouettes in multi-view video. Our problem formulation is inspired by the work of Stoll et al., who used a collection of Gaussian functions in 3D and 2D for marker-less skeletal pose estimation. Estimation of surface detail was not the goal of that work. Our paper extends their basic concept to the different problem of dense stereo-based surface estimation, using continuous optimization of a smooth energy that can be formulated in closed form and that has analytic derivatives.
3 Overview
An overview of our approach is shown in Fig. 2. The input to our algorithm is a calibrated and synchronized multi-view video sequence showing images of the human subject. In addition, we assume as input a spatio-temporally coherent coarse animated mesh sequence, reconstructed from the multi-view video with related approaches [7, 14].
Our method refines the initial coarse animation such that the fine dynamic surface details are incorporated into the meshes. First, we create an implicit representation of the input mesh using a dense collection of 3D Gaussian functions on the surface with associated colors. The input images are also represented as a set of 2D Gaussians associated with image patches in each camera view. Thereafter, continuous optimization is performed to maximize the color consistency between the collection of 3D surface Gaussians and the set of 2D image Gaussians. The optimization displaces each 3D Gaussian along the associated vertex normal of the coarse mesh, which yields the required vertex displacement.
Our optimization scheme has a smooth energy function that, thanks to our Gaussian-based model, can be expressed in closed form. It further allows us to compute derivatives analytically, which enables the use of efficient gradient-based solvers.
4 Implicit Model
Our framework converts the input coarse animation and input multi-view images into implicit representations using a collection of Gaussians: 3D surface Gaussians on the mesh surface with associated colors and 2D image Gaussians, with associated colors, assigned to image patches in each camera view.
4.1 3D Surface Gaussian
Our implicit model for the input mesh is obtained by placing a 3D Gaussian at each mesh vertex. An un-normalized isotropic 3D Gaussian function on the surface is defined simply by a mean, which coincides with the vertex location, and a standard deviation, which is set to the same value for all 3D Gaussians on the surface. Note that although such a Gaussian has infinite support, for visualization purposes we represent its projection as a square that is centered at the projected mean (i.e. at the intersection of its diagonals) and whose side length is proportional to its standard deviation (see Fig. 3).
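For reference, an un-normalized isotropic Gaussian with a given mean and standard deviation is conventionally written as below. The symbols $\hat{\boldsymbol{\mu}}_s$ (mean, located at vertex $s$) and $\hat{\sigma}_s$ (standard deviation) are notation assumed for this sketch and are not taken from the original equation.

```latex
\hat{G}_s(\mathbf{x}) \;=\; \exp\!\left( -\,\frac{\lVert \mathbf{x} - \hat{\boldsymbol{\mu}}_s \rVert^{2}}{2\,\hat{\sigma}_s^{2}} \right),
\qquad \mathbf{x} \in \mathbb{R}^{3}
```

Being un-normalized, the function attains the value 1 at its mean regardless of the standard deviation.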
We further assign an HSV color value to each surface Gaussian. To derive the colors, we choose a reference frame in which the initial coarse reconstruction is as close as possible to the real shape; this is typically the first frame of each sequence. For each vertex of the input mesh, we first choose the camera view that sees the vertex best, i.e. the view in which the vertex normal and the camera viewing direction align best. The 3D Gaussian associated with the vertex is then projected into the image of this best camera view, and the average color of the underlying pixels is assigned to it as a color attribute.
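The choice of the best camera view can be sketched as follows. This is a minimal illustration under stated assumptions: alignment is measured as the cosine between the vertex normal and the direction from the vertex to the camera center, occlusion is ignored, and the function names `normalize` and `best_camera` are hypothetical.

```python
import math

def normalize(v):
    """Scale a 3D vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def best_camera(vertex, normal, camera_centers):
    """Return (index, cosine) of the camera that faces the vertex normal most directly.

    Assumption: 'best' means the largest cosine between the normal and the
    unit direction from the vertex to the camera center.
    """
    n = normalize(normal)
    best_index, best_cosine = -1, -2.0
    for index, center in enumerate(camera_centers):
        to_camera = normalize(tuple(c - p for c, p in zip(center, vertex)))
        cosine = sum(a * b for a, b in zip(n, to_camera))
        if cosine > best_cosine:
            best_index, best_cosine = index, cosine
    return best_index, best_cosine
```

A practical implementation would additionally need a visibility test, since a camera can face a vertex that is hidden by another part of the surface.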
4.2 2D Image Gaussian
Our implicit model for the input images of all cameras is obtained by assigning a 2D Gaussian function to each image patch of every camera view. Similar to Stoll et al., we decompose each input frame into square regions of coherent color by means of a quad-tree decomposition (with maximal depth set to 8). A 2D Gaussian is assigned to each patch (Fig. 4), such that its mean corresponds to the patch center and its standard deviation to half of the side length of the square patch. The underlying average HSV color is also assigned to each 2D Gaussian as an additional attribute.
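The quad-tree decomposition into 2D image Gaussians can be sketched as follows. The splitting criterion (maximum per-channel deviation from the patch mean, compared with a tolerance `tol`), the naive averaging of HSV channels, and the restriction to square images with a power-of-two side length are assumptions made for this illustration; only the maximal depth of 8 and the rule "mean at the patch center, standard deviation equal to half the side length" come from the text.

```python
def quadtree_gaussians(img, x0, y0, size, depth=0, max_depth=8, tol=0.05):
    """Decompose a square image region into 2D Gaussians.

    img[y][x] is an (h, s, v) tuple. Returns a list of
    (mu_x, mu_y, sigma, mean_color) tuples, one per leaf patch.
    """
    pixels = [img[y][x] for y in range(y0, y0 + size) for x in range(x0, x0 + size)]
    count = len(pixels)
    mean = tuple(sum(p[c] for p in pixels) / count for c in range(3))
    # Assumed coherence measure: largest per-channel deviation from the mean.
    spread = max(abs(p[c] - mean[c]) for p in pixels for c in range(3))
    if spread <= tol or depth == max_depth or size == 1:
        # Leaf: mean at the patch center, sigma equal to half the side length.
        return [(x0 + size / 2.0, y0 + size / 2.0, size / 2.0, mean)]
    half = size // 2
    gaussians = []
    for dy in (0, half):
        for dx in (0, half):
            gaussians += quadtree_gaussians(img, x0 + dx, y0 + dy, half,
                                            depth + 1, max_depth, tol)
    return gaussians
```

Uniformly colored regions thus yield few, large Gaussians, while textured regions are covered by many small ones.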
4.3 Projection of 3D Surface Gaussians
In order to evaluate the similarity between the 3D surface Gaussians and the 2D image Gaussians, we project each surface Gaussian into 2D image space. The mean of a 3D surface Gaussian is projected using the camera projection matrix, like any other 3D point in space: the mean is expressed in homogeneous coordinates, multiplied by the projection matrix, and dehomogenized to obtain the image coordinates of the projected mean. The 3D standard deviation is projected accordingly, using the camera focal length.
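The projection step can be sketched as follows. The sketch assumes a 3x4 projection matrix of the form K[R|t], so that the third component of the projected homogeneous point is the depth along the optical axis, and it assumes the standard perspective scaling of the standard deviation by focal length over depth. The function name `project_gaussian` and its parameters are hypothetical.

```python
def project_gaussian(P, mu, sigma, focal_length):
    """Project an isotropic 3D Gaussian (mean mu, std sigma) into the image.

    P is a 3x4 projection matrix given as nested lists. Returns
    ((u, v), sigma_2d). Assumption: sigma_2d = sigma * focal_length / depth.
    """
    homogeneous = (mu[0], mu[1], mu[2], 1.0)  # fourth coordinate set to 1
    px, py, pz = [sum(P[r][c] * homogeneous[c] for c in range(4)) for r in range(3)]
    if pz <= 0.0:
        raise ValueError("point lies behind the camera")
    return (px / pz, py / pz), sigma * focal_length / pz
```

For example, with a focal length of 500 pixels, a Gaussian of standard deviation 0.01 at depth 2 projects to a 2D standard deviation of 2.5 pixels.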
5 Surface Refinement
We employ an analysis-by-synthesis approach to refine the input coarse mesh animation, at every frame, by optimizing an energy with respect to the collection of 3D surface Gaussian means. The energy combines two terms. The similarity term measures the color similarity of the projected collection of 3D surface Gaussians with the 2D image Gaussians obtained from each camera view. The additional regularization term is used to keep the distribution of the 3D surface Gaussians geometrically smooth; it is weighted by a user-defined smoothness weight, typically set to 1. Since we constrain each 3D Gaussian to move along the (normalized) normal direction of its vertex, aiming at maintaining a regular distribution of 3D Gaussians on the surface, we only need to optimize a single scalar displacement per Gaussian.
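One plausible way to write the energy and the normal-constrained parameterization described above is given below. The symbols $E_{\mathrm{sim}}$, $E_{\mathrm{reg}}$, $w_{\mathrm{reg}}$, $k_s$, $\mathbf{n}_s$ and $\hat{\boldsymbol{\mu}}_s^{\mathrm{init}}$ are notation assumed for this sketch. The sign convention (similarity maximized, regularization subtracted as a penalty) is also an assumption, chosen to be consistent with the gradient ascent solver used later.

```latex
E \;=\; E_{\mathrm{sim}} \;-\; w_{\mathrm{reg}}\, E_{\mathrm{reg}},
\qquad
\hat{\boldsymbol{\mu}}_s \;=\; \hat{\boldsymbol{\mu}}_s^{\mathrm{init}} + k_s\, \mathbf{n}_s,
\quad \lVert \mathbf{n}_s \rVert = 1
```

Under this parameterization the unknowns per frame are the scalars $k_s$, one per surface Gaussian, rather than three coordinates per vertex.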
5.1 Similarity Term
We exploit the implicit Gaussian representation of both the input images and the surface to derive a closed-form analytical formulation of our similarity term. In principle, a pair consisting of an image Gaussian and a projected surface Gaussian should have a high similarity when the two have similar colors and are spatially close. This measure can be formulated as the integral of the product of the projected surface Gaussian and the image Gaussian, weighted by their color similarity. The color similarity is obtained by applying the Wendland radial basis function to the Euclidean distance between the two colors; the parameter of this function was set experimentally to a single value for all test sequences. The main advantage of using a Gaussian representation is that the integral in Eq. 6 has a closed-form solution, namely another Gaussian with combined properties.
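The closed-form overlap can be sketched as follows. The sketch assumes un-normalized isotropic 2D Gaussians of the form exp(-||x - mu||^2 / (2 sigma^2)), for which the integral of the product over the image plane is 2*pi*s_i^2*s_j^2/(s_i^2+s_j^2) * exp(-||mu_i - mu_j||^2 / (2 (s_i^2+s_j^2))). It further assumes that the Wendland function is the common C2 variant (1 - r)^4 (4r + 1) on [0, 1). The function names and the parameter `delta` are hypothetical.

```python
import math

def wendland(distance, delta):
    """Compactly supported Wendland C2 function; zero for distance >= delta."""
    if distance >= delta:
        return 0.0
    r = distance / delta
    return (1.0 - r) ** 4 * (4.0 * r + 1.0)

def gaussian_overlap(mu_i, sigma_i, mu_j, sigma_j):
    """Integral over the plane of the product of two un-normalized 2D Gaussians."""
    var_sum = sigma_i ** 2 + sigma_j ** 2
    dist_sq = (mu_i[0] - mu_j[0]) ** 2 + (mu_i[1] - mu_j[1]) ** 2
    scale = 2.0 * math.pi * sigma_i ** 2 * sigma_j ** 2 / var_sum
    return scale * math.exp(-dist_sq / (2.0 * var_sum))

def similarity(mu_i, sigma_i, color_i, mu_j, sigma_j, color_j, delta):
    """Spatial overlap weighted by the color similarity of the two Gaussians."""
    color_distance = math.dist(color_i, color_j)
    return wendland(color_distance, delta) * gaussian_overlap(mu_i, sigma_i, mu_j, sigma_j)
```

Under these assumptions the overlap of a Gaussian with itself reduces to pi * sigma^2, which is the maximum obtainable overlap for that Gaussian.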
We first calculate the similarity for all components of the two models for each camera view. Then, we normalize the result by the maximum obtainable overlap, i.e. the overlap of an image Gaussian with itself, and by the number of cameras. In the resulting expression, an inner minimization implicitly handles occlusions on the surface, as it prevents occluded Gaussians that project onto the same image location from contributing multiple times to the energy. This is an elegant way of handling occlusion while preserving the smoothness of the energy; exact occlusion detection and handling algorithms, in contrast, are non-smooth or hard to express in closed form.
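The normalization and the inner minimization can be sketched as follows. This is one plausible reading of the description, not the original equation: for every image Gaussian, the summed overlap with all projected surface Gaussians is clamped to the self-overlap of that image Gaussian; the clamped values are summed per camera, divided by the total self-overlap of that camera, and averaged over cameras. The function and argument names are hypothetical.

```python
def similarity_energy(overlaps_per_camera, self_overlaps_per_camera):
    """Normalized similarity energy in [0, 1] under the assumptions stated above.

    overlaps_per_camera[c][i][s]: weighted overlap of image Gaussian i with
        projected surface Gaussian s in camera c.
    self_overlaps_per_camera[c][i]: overlap of image Gaussian i with itself.
    """
    total = 0.0
    for overlaps, self_overlaps in zip(overlaps_per_camera, self_overlaps_per_camera):
        clamped = 0.0
        for row, self_overlap in zip(overlaps, self_overlaps):
            # Several surface Gaussians landing on the same image Gaussian
            # cannot contribute more than that Gaussian's self-overlap.
            clamped += min(sum(row), self_overlap)
        total += clamped / sum(self_overlaps)
    return total / len(overlaps_per_camera)
```

The clamp is what keeps stacked projections, e.g. from a front and a back surface layer, from being counted twice for the same image region.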
In order to improve computational efficiency, we evaluate the similarity term only for the surface Gaussians that are visible from each camera view. The Gaussian overlap is then computed between the visible projected Gaussians and the 2D image Gaussians in a local neighborhood.
5.2 Regularization Term
Our regularization term constrains each 3D surface Gaussian with respect to the Gaussians in its local neighborhood, such that the final reconstructed surface is sufficiently smooth. The term is minimized during optimization. For each surface Gaussian, it is evaluated over the set of surface Gaussians that are its neighbors; the distance between two Gaussians is the geodesic surface distance measured in number of edges, and it enters the term through the weighting function defined in Eq. 7.
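A regularizer of this kind can be sketched as follows. The specific form, a sum of squared differences between the scalar displacements of neighboring Gaussians weighted by a Wendland function of their edge distance, is an assumption made for illustration and is not taken from the original equation. The names `geodesic_neighbors`, `regularization`, `radius` and `delta` are hypothetical.

```python
from collections import deque

def wendland(distance, delta):
    """Compactly supported Wendland C2 function; zero for distance >= delta."""
    if distance >= delta:
        return 0.0
    r = distance / delta
    return (1.0 - r) ** 4 * (4.0 * r + 1.0)

def geodesic_neighbors(adjacency, source, radius):
    """Breadth-first search over mesh edges.

    Returns {vertex: edge_distance} for all vertices with 0 < distance <= radius.
    """
    distance = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if distance[u] == radius:
            continue
        for v in adjacency[u]:
            if v not in distance:
                distance[v] = distance[u] + 1
                queue.append(v)
    del distance[source]
    return distance

def regularization(k, adjacency, radius, delta):
    """Assumed smoothness penalty on the per-vertex scalar displacements k."""
    total = 0.0
    for s in range(len(k)):
        for j, d in geodesic_neighbors(adjacency, s, radius).items():
            total += wendland(d, delta) * (k[s] - k[j]) ** 2
    return total
```

The penalty is zero when all displacements in a neighborhood are equal, and closer neighbors are weighted more strongly than distant ones.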
Our formulation allows us to compute analytic derivatives of the energy (Eq. 4) with respect to each scalar displacement; we provide the complete derivation in an additional document. The derivative of the similarity term is obtained from the derivatives of the individual overlaps. These, in turn, depend on the projection matrix of the respective camera, on the vertex normal associated with the surface Gaussian, expressed in homogeneous coordinates, and on the z-component of the projected mean. The derivative of the regularization term is likewise available in closed form.
We efficiently optimize our energy function using a conditioned gradient ascent approach. General gradient ascent is a first-order optimization procedure that aims at finding local maxima by taking steps proportional to the energy gradient. The conditioner is a scalar factor applied to the analytical derivatives that increases (resp. decreases) step by step when the gradient sign is constant (resp. fluctuating). The use of the conditioner brings three main advantages: it allows for faster convergence to the final solution, it prevents the typical zig-zagging while approaching local maxima, and it bounds the size of the analytical derivative at the same time.
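The idea of the conditioner can be sketched as follows. The exact update rule is not given in the text, so this is an illustrative variant in the spirit of sign-based step adaptation: each parameter has its own conditioner, which grows while the gradient sign stays constant and shrinks when it flips. The growth and shrink factors, the upper bound and the function name are hypothetical.

```python
def conditioned_gradient_ascent(gradient, x0, steps=200, initial=0.1,
                                grow=1.2, shrink=0.5, max_conditioner=10.0):
    """Maximize a function given its gradient, with per-parameter conditioners.

    gradient(x) returns a list of partial derivatives at x.
    """
    x = list(x0)
    conditioner = [initial] * len(x)
    previous = [0.0] * len(x)
    for _ in range(steps):
        g = gradient(x)
        for i, gi in enumerate(g):
            if gi * previous[i] > 0.0:
                # Constant sign: accelerate, but keep the factor bounded.
                conditioner[i] = min(conditioner[i] * grow, max_conditioner)
            elif gi * previous[i] < 0.0:
                # Sign flip: we overshot a maximum, so damp the step.
                conditioner[i] *= shrink
            x[i] += conditioner[i] * gi
            previous[i] = gi
    return x
```

On a concave quadratic such as -(x0 - 1)^2 - 2 (x1 + 2)^2, starting from the origin, this scheme first accelerates towards the maximum and then damps the oscillation around it.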
6 Results
We tested our approach on three different datasets. The input multi-view video sequences, as well as the camera settings and the initial coarse mesh reconstructions, were provided by Gall et al. and Starck and Hilton. All sequences were recorded with 8 synchronized and calibrated cameras, with the number of frames ranging between 250 and 721 (see Table 1). The provided coarse input meshes were obtained with low-quality refinement techniques based on sparse feature matching, shape-from-silhouette and multi-view 3D reconstruction, and therefore lack surface detail.
In order to refine the input mesh sequences, we first subdivide the coarse input topology by inserting additional triangles and vertices, so that a finer level of detail can be represented. Then we generate a collection of Gaussians on the surface as explained in Sect. 4.1. Since most of the fine-scale deformations in the input sequences happen on the clothing, we decided to focus on the refinement of those areas and generate surface Gaussians only for the corresponding vertices. Table 1 shows the number of 3D surface Gaussians created for each sequence.
When rendering the final mesh sequences, we added an extra offset, equal to the standard deviation of the surface Gaussians used, to the computed vertex displacements. This compensates for the small surface bias (a shrinkage along the normal during optimization) that is due to the spatial extent of the Gaussians.
Evaluation. Our results (Fig. 1, Fig. 5 and the accompanying video) show that our approach is able to plausibly reconstruct more fine-scale details, e.g. the wrinkles and folds in the skirt, and produces closer model alignment to the images than the baseline methods ([7, 14]).
In order to verify the quantitative performance of our approach, we textured the model by assigning the surface Gaussian colors to the corresponding mesh vertices. Then, we used optical flow to generate displacement flow vectors between the input images and the reprojected textured mesh models (original and refined) for all time steps. Fig. 6 plots the difference in average optical flow displacement error between the input and the resulting animation sequences over time for a single camera view. As shown in the graphs, our method decreases the average flow displacement error, leading to quantitatively more accurate results.
We created an additional experiment to verify the performance of our refinement framework. For this experiment, we first spatially smooth the input mesh sequence in order to eliminate most of the baked-in surface details, if any. The smooth mesh animation is then used as input to our system. As we show in Fig. 7 and in the accompanying video, our approach is able to plausibly refine the smooth input mesh animation, reconstructing fine-scale details in the skirt, t-shirt and shorts. A quantitative evaluation for the smooth input sequence is provided in an additional document.
We evaluated the performance of our system on an Intel Xeon E5-1620 quad-core processor with Hyperthreading and 16GB of RAM. Table 1 summarizes the performance we obtained for the three tested sequences. We believe we can further reduce the computation time by parallelizing independent steps and implementing our method on the GPU.
Limitations. Our approach is subject to a few limitations. We assume the input mesh sequence to be sufficiently accurate, such that smaller details can be captured correctly by simply displacing vertices along their corresponding vertex normals. In cases where the input reconstructed meshes are misaligned with the images, or where stronger deformations need to be reconstructed, our method is unable to perform adequately. To handle such cases, our refinement would have to be reformulated to allow more complex displacements, e.g. without any normal constraint. However, such a weaker prior on vertex motion requires a more complex regularization in order to maintain a smooth surface and to handle unwanted self-intersections and collapsing vertices. On top of that, the increased number of parameters to optimize (three times as many when optimizing all three coordinates of each vertex) would reduce computational efficiency and raise the probability of getting stuck in local maxima. The risk of converging to a local maximum remains high when employing local solvers (e.g. gradient ascent) on non-convex problems such as ours. A possible solution is to use more advanced solvers, e.g. global solvers, when computational efficiency is not a requirement.
Another limitation of our approach is its inability to densely refine plain-colored surfaces with little texture. A possible solution is to employ a more complex color model that takes into account, e.g., illumination and shading effects, at the cost of increased computational expense. We would like to investigate these limitations in future work.
7 Conclusions
We presented a new, effective framework for performance capture of deforming meshes with fine-scale time-varying surface detail from multi-view video recordings. Our approach captures the fine-scale deformation of the mesh vertices by maximizing photo-consistency over all vertex positions. This can be done efficiently by densely optimizing a new model-to-image consistency energy function that uses our proposed implicit representation of the deformable mesh, with a collection of 3D Gaussians for the surface and a set of 2D Gaussians for the input images. Our proposed formulation enables a smooth closed-form energy with implicit occlusion handling and analytic derivatives. We qualitatively and quantitatively evaluated our refinement strategy on three input sequences, showing that we are able to capture and model finer-scale details.
References
- [1] L. Ballan and G. M. Cortelazzo. Marker-less motion capture of skinned models in a four camera set-up using optical flow and silhouettes. In 3DPVT, June 2008.
- [2] D. Bradley, T. Popa, A. Sheffer, W. Heidrich, and T. Boubekeur. Markerless garment capture. ACM Trans. Graph., 27(3):1–9, 2008.
- [3] C. Cagniart, E. Boyer, and S. Ilic. Free-form mesh tracking: a patch-based approach. In Proc. IEEE CVPR, 2010.
- [4] J. Carranza, C. Theobalt, M. Magnor, and H.-P. Seidel. Free-viewpoint video of human actors. In ACM TOG (Proc. SIGGRAPH ’03), page 22, 2003.
- [5] E. de Aguiar, C. Stoll, C. Theobalt, N. Ahmed, H.-P. Seidel, and S. Thrun. Performance capture from sparse multi-view video. ACM Trans. Graph., 27(3), 2008.
- [6] J. Gall, C. Stoll, E. Aguiar, C. Theobalt, B. Rosenhahn, and H.-P. Seidel. Motion capture using joint skeleton tracking and surface estimation. In Proc. IEEE CVPR, pages 1746–1753, 2009.
- [7] J. Gall, C. Stoll, E. de Aguiar, C. Theobalt, B. Rosenhahn, and H. Seidel. Motion capture using joint skeleton tracking and surface estimation. In , 2009.
- [8] C. Hernandez, G. Vogiatzis, G. J. Brostow, B. Stenger, and R. Cipolla. Non-rigid photometric stereo with colored lights. In Proc. ICCV, pages 1–8, 2007.
- [9] K. Kolev, M. Klodt, T. Brox, and D. Cremers. Continuous global optimization in multiview 3d reconstruction. International Journal of Computer Vision, 84(1):80–96, 2009.
- [10] Y. Liu, J. Gall, C. Stoll, Q. Dai, H.-P. Seidel, and C. Theobalt. Markerless motion capture of multiple characters using multiview image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2720–2735, 2013.
- [11] W. Matusik, C. Buehler, R. Raskar, S. J. Gortler, and L. McMillan. Image-based visual hulls. In SIGGRAPH, pages 369–374, 2000.
- [12] P. Sand, L. McMillan, and J. Popović. Continuous capture of skin deformation. ACM TOG, 22(3):578–586, July 2003.
- [13] Y. Savoye. Iterative cage-based registration from multi-view silhouettes. In Proceedings of the 10th European Conference on Visual Media Production, CVMP ’13, pages 8:1–8:10. ACM, 2013.
- [14] J. Starck and A. Hilton. Surface capture for performance-based animation. IEEE Computer Graphics and Applications, 27(3):21–31, 2007.
- [15] C. Stoll, N. Hasler, J. Gall, H.-P. Seidel, and C. Theobalt. Fast articulated motion tracking using a sums of gaussians body model. In D. N. Metaxas, L. Quan, A. Sanfeliu, and L. J. V. Gool, editors, ICCV, pages 951–958. IEEE, 2011.
- [16] C. Theobalt, E. de Aguiar, C. Stoll, H.-P. Seidel, and S. Thrun. Performance capture from multi-view video. In R. Ronfard and G. Taubin, editors, Image and Geometry Processing for 3D-Cinematography, page 127ff. Springer, 2010.
- [17] T. Tung, S. Nobuhara, and T. Matsuyama. Complete multi-view reconstruction of dynamic scenes from probabilistic fusion of narrow and wide baseline stereo. In Proc. IEEE ICCV, pages 1709–1716, 2009.
- [18] D. Vlasic, I. Baran, W. Matusik, and J. Popović. Articulated mesh animation from multi-view silhouettes. ACM TOG (Proc. SIGGRAPH), 2008.
- [19] D. Vlasic, P. Peers, I. Baran, P. Debevec, J. Popović, S. Rusinkiewicz, and W. Matusik. Dynamic shape capture using multi-view photometric stereo. In ACM TOG (Proc. SIGGRAPH Asia ’09), 2009.
- [20] M. Waschbüsch, S. Würmlin, D. Cotting, F. Sadlo, and M. Gross. Scalable 3D video of dynamic scenes. In Proc. Pacific Graphics, pages 629–638, 2005.
- [21] C. Wu, Y. Liu, Q. Dai, and B. Wilburn. Fusing multiview and photometric stereo for 3d reconstruction under uncalibrated illumination. IEEE TVCG, 17(8):1082–1095, 2011.
- [22] C. Wu, K. Varanasi, Y. Liu, H.-P. Seidel, and C. Theobalt. Shading-based dynamic shape refinement from multi-view video under general illumination. In Proc. ICCV, ICCV ’11, pages 1108–1115. IEEE, 2011.
- [23] C. L. Zitnick, S. B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski. High-quality video view interpolation using a layered representation. ACM Trans. Graph., 23(3):600–608, Aug. 2004.