Polarization images are known to capture polarized reflected light that preserves rich geometric cues of an object, which has motivated recent applications in reconstructing detailed surface normals of objects of interest. Meanwhile, inspired by recent breakthroughs in human shape estimation from a single color image, we investigate the new question of whether the geometric cues from a polarization camera can be leveraged to estimate detailed human body shapes. This has led to the curation of the Polarization Human Shape and Pose Dataset (PHSPD)\footnote{Our PHSPD dataset will be released soon, for academic purposes only.}, our home-grown polarization image dataset of various human shapes and poses.
Our PHSPD dataset synchronizes four cameras: one polarization camera and three Kinects v2 in three different views (each Kinect v2 has a depth and a color camera). The depth and color images from the three-view Kinects v2 are used to obtain more accurate shape and pose annotations in 3D space. We propose an economical yet effective approach to annotating shape and pose in 3D space. Compared with Human3.6M, which uses an expensive motion-capture system to annotate human poses, we do not require subjects to wear special tight clothes and numerous sensors, which would make the acquired images restrictive and impractical.
We show some of our annotated shapes and poses in Fig. 1, where the shapes are rendered on the image plane and the poses are shown in 3D space. Our annotated shapes and poses align well with the subjects in the image plane across the four camera views.
2 Data Acquisition
Our acquisition system synchronizes four cameras: one polarization camera and three Kinects v2 in three different views (each Kinect v2 has a depth and a color camera). The layout is shown in Fig. 2. The main task in data acquisition is multi-camera synchronization. As one desktop can only control one Kinect v2, we developed a soft synchronization method. Specifically, each camera is connected to a desktop (the desktop with the polarization camera is the master, and the other three, each with a Kinect, are clients). We use sockets to send messages to each desktop. After receiving a certain message, each client captures the most recent frame from its Kinect into the desktop memory. At the same time, the master desktop sends a software trigger to the polarization camera to capture one frame into the buffer. In practice, our synchronization system runs as fast as 15 frames per second (fps). Fig. 2 shows the synchronization performance of the system: we let a bag fall down and compare the position of the bag in the same frame from the four views. The positions of the bag captured by the four cameras are almost identical in terms of their distance to the ground.
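The master's side of this soft synchronization can be sketched as follows. This is a minimal illustration, not the paper's actual code: the message content, the use of UDP, and the function names are assumptions; the paper only states that socket messages trigger each client desktop.

```python
import socket

# Illustrative capture message; the real protocol is not specified in the paper.
CAPTURE_MSG = b"CAPTURE"

def broadcast_capture_trigger(sock, client_addrs, msg=CAPTURE_MSG):
    """Send the capture message to every Kinect client desktop.

    On receipt, each client grabs the most recent Kinect frame, while
    the master simultaneously fires the polarization camera's software
    trigger. Returns the number of clients triggered.
    """
    for addr in client_addrs:
        sock.sendto(msg, addr)  # UDP datagram keeps per-frame trigger latency low
    return len(client_addrs)
```

In the real system the three client addresses would be the desktops driving the Kinects; since broadcasting the trigger is the only per-frame work on the master besides the camera trigger itself, a loop like this is consistent with the reported 15 fps.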
| Group | Actions |
| 1 | warming-up, walking, running, jumping, drinking, lifting dumbbells |
| 2 | sitting, eating, driving, reading, phoning, waiting |
| 3 | presenting, boxing, posing, throwing, greeting, hugging, shaking hands |
Our dataset has 12 subjects, 9 male and 3 female. Each subject is required to perform 3 different groups of actions (18 different actions in total) 4 times each, plus one free-style group. Details are shown in Tab. 1. Each subject therefore has 13 short videos, and the total number of frames per subject is around 22K. Overall, our dataset has 287K frames, with each frame including one polarization image, three color images, and three depth images. Quantitative details of our dataset are shown in Tab. 2.
3 Annotation Process
3.1 Shape and Pose Representation
We represent the 3D human body shape (mesh) using the SMPL model, a differentiable function that outputs a triangular mesh with 6890 vertices given 82 parameters. The shape parameters are the linear coefficients of a PCA shape space that mainly determines individual body features such as height, weight, and body proportions. The shape space is learned from a large dataset of body scans. The pose parameters mainly describe the articulated pose, consisting of one global rotation of the body and the relative rotations of 23 joints in axis-angle representation. The final body mesh is produced by first applying shape-dependent and pose-dependent deformations to the template body, then using forward kinematics to articulate the body, and finally deforming the surface with linear blend skinning. The 3D joint positions can be obtained by linear regression from the output mesh vertices.
3.2 Annotation of Shape and Pose
We use a multi-camera system to acquire image data because it provides much more information than a single-camera system, making the annotation of human shape and pose in 3D more reliable.
After camera calibration and plane segmentation of the human body in the depth images, we have a point cloud of the human surface fused from the three-view depth images, as well as the noisy 3D pose from the Kinect SDK. The annotation of SMPL human shape and 3D joint positions has three main steps, as follows.
3.2.1 Initial guess of 3D pose
As the 3D pose given by the Kinect SDK is noisy, we use the 2D pose predicted by OpenPose as the criterion to decide which joint positions given by the Kinect SDK are correct. We select the 14 joints that OpenPose and Kinect have in common. For joint $j$ in view $v$, let $p_j^v$ denote the 2D joint position from OpenPose, $P_j^v$ the 3D joint position from Kinect, and $\hat{p}_j^v$ its projected 2D joint position. Since we cannot tell directly which joints are detected correctly by Kinect, we use the OpenPose position as the criterion and treat a Kinect joint as valid when
$$\|\hat{p}_j^v - p_j^v\|_2 < \tau,$$
where $\|\cdot\|_2$ denotes the pixel distance and $\tau$ is a distance threshold. Then, we obtain the initial guess of the 3D position of joint $j$ by averaging the valid positions given by the three-view Kinects:
$$\bar{P}_j = \frac{1}{|V_j|} \sum_{v \in V_j} P_j^v,$$
where $V_j$ is the set of views in which joint $j$ is valid.
If none of the three-view joint positions given by Kinect is valid, we consider it a missing joint. We discard frames with more than 2 of the 14 joints missing.
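The validity check and averaging above can be sketched in a few lines. This is an illustrative implementation under assumed array shapes (views × joints); the threshold value and the NaN convention for missing joints are our choices, not the paper's.

```python
import numpy as np

def initial_pose_guess(openpose_2d, kinect_3d, kinect_proj_2d, tau=20.0):
    """Average Kinect 3D joints over the views where OpenPose agrees.

    openpose_2d:    (V, J, 2) 2D joints from OpenPose
    kinect_3d:      (V, J, 3) 3D joints from the Kinect SDK (world frame)
    kinect_proj_2d: (V, J, 2) Kinect 3D joints projected to each image
    tau:            pixel-distance threshold (an assumed value)
    Returns (J, 3) averaged joints (NaN where missing) and a (J,) validity mask.
    """
    dist = np.linalg.norm(kinect_proj_2d - openpose_2d, axis=-1)  # (V, J)
    valid = dist < tau                                            # per view/joint
    counts = valid.sum(axis=0)                                    # (J,)
    summed = (kinect_3d * valid[..., None]).sum(axis=0)           # (J, 3)
    joints = np.where(counts[:, None] > 0,
                      summed / np.maximum(counts, 1)[:, None],
                      np.nan)                                     # missing -> NaN
    return joints, counts > 0
```

A frame would then be discarded when the returned mask has more than 2 missing joints, matching the rule above.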
3.2.2 Fitting shape to pose
The next step is similar to SMPLify, but instead of fitting to 2D joints, which have inherent depth ambiguity, we fit the SMPL model to the initial guess of the 3D pose.
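The data term of such a fit can be sketched as a masked 3D joint error. This is only the objective, not the full optimization (the real step would minimize it over the SMPL pose and shape parameters, with priors as in SMPLify); the function name and masking convention are ours.

```python
import numpy as np

def pose_fit_loss(pred_joints, target_joints, valid_mask):
    """Sum of squared 3D joint errors over the non-missing joints.

    pred_joints:   (J, 3) joints regressed from the current SMPL fit
    target_joints: (J, 3) initial 3D pose guess
    valid_mask:    (J,) boolean, False for missing joints
    """
    err = ((pred_joints - target_joints) ** 2).sum(-1)  # per-joint squared error
    return err[valid_mask].sum()                        # ignore missing joints
```

Fitting to 3D targets removes the depth ambiguity of SMPLify's 2D reprojection term, which is the point of this step.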
3.2.3 Fine-tuning shape to the point cloud
The final step fine-tunes the shape to the point cloud of the human surface so that the annotated SMPL shape parameters are more accurate. We iteratively optimize the SMPL parameters by minimizing the distance between the vertices of the SMPL shape and their nearest points in the cloud. Finally, we obtain the annotated SMPL shape parameters and 3D pose.
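The quantity being minimized in this step can be sketched as a vertex-to-nearest-point loss. This evaluates the loss for given vertices only; the actual pipeline would minimize it over the SMPL parameters, and a KD-tree would replace the brute-force nearest-point search at real mesh and cloud sizes.

```python
import numpy as np

def vertex_to_cloud_loss(vertices, cloud):
    """Mean squared distance from each SMPL vertex to its nearest cloud point.

    vertices: (N, 3) SMPL mesh vertices
    cloud:    (M, 3) fused point cloud of the human surface
    """
    # (N, M) pairwise squared distances; brute force is fine for illustration
    d2 = ((vertices[:, None, :] - cloud[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean()  # nearest point per vertex, averaged
```

Because the point cloud carries full surface information rather than a skeleton, driving the SMPL parameters with this loss is what lets the final shape align with the subject's actual body surface.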
Besides, we render the boundary of the SMPL shape on the image to obtain the background mask, and calculate the target normals from the three depth images following GeoNet. Although the target normals are noisy, our experimental results show that our model can still learn to predict good and smooth normal maps.
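One common way to obtain such a target normal map from depth, in the spirit of GeoNet's depth-to-normal step, is to back-project the depth image into a camera-space point map and take the cross product of its image-plane gradients. This is a generic sketch, not the paper's exact procedure; the intrinsics here are placeholder values.

```python
import numpy as np

def depth_to_normals(depth, fx=500.0, fy=500.0, cx=None, cy=None):
    """Per-pixel surface normals from a depth image (pinhole camera).

    depth: (H, W) depth in meters; fx, fy, cx, cy: placeholder intrinsics.
    Returns (H, W, 3) unit normals.
    """
    h, w = depth.shape
    cx = (w - 1) / 2 if cx is None else cx
    cy = (h - 1) / 2 if cy is None else cy
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # back-project each pixel to a 3D camera-space point
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts = np.stack([x, y, depth], axis=-1)  # (H, W, 3) point map
    du = np.gradient(pts, axis=1)           # surface tangent along u
    dv = np.gradient(pts, axis=0)           # surface tangent along v
    n = np.cross(du, dv)                    # normal = tangent cross product
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-12
    return n
```

Finite-difference gradients amplify depth sensor noise, which is consistent with the observation above that the target normals are noisy while the learned predictions come out smooth.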
The annotation process is shown in Fig. 3. Starting from the initial guess of the 3D pose, we fit the SMPL shape to the initial 3D pose and further fine-tune it to the point cloud of the human surface. Finally, we obtain the annotated human shape and pose for each frame. As Fig. 3 shows, the third step is critical for making the annotated shape align better with the subject in the image, since the point cloud of the human surface provides much more information than a skeleton-based pose; this step can therefore adjust the shape to improve the alignment of body parts. Besides, we also show our annotated shape on the multi-view images (one polarization image and three-view color images) and the human pose in 3D coordinate space in Fig. 1.
- Keep it SMPL: automatic estimation of 3D human pose and shape from a single image. In European Conference on Computer Vision, pp. 561–578. Cited by: §3.2.2.
- OpenPose: realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1. Cited by: §3.2.1.
- (2014) Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(7), pp. 1325–1339. Cited by: §1.
- (2015) SMPL: a skinned multi-person linear model. ACM Transactions on Graphics (TOG) 34(6), pp. 248. Cited by: §3.1.
- GeoNet: geometric neural network for joint depth and surface normal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 283–291. Cited by: §3.2.3.