Polarization Human Shape and Pose Dataset

04/30/2020 ∙ by Shihao Zou, et al. ∙ University of Alberta

Polarization images are known to capture polarized reflected light that preserves rich geometric cues of an object, which has motivated recent applications in reconstructing the detailed surface normals of objects of interest. Meanwhile, inspired by recent breakthroughs in human shape estimation from a single color image, we investigate the new question of whether the geometric cues from a polarization camera can be leveraged to estimate detailed human body shapes. This has led to the curation of the Polarization Human Shape and Pose Dataset (PHSPD), our home-grown polarization image dataset of various human shapes and poses.


1 Introduction

Polarization images are known to capture polarized reflected light that preserves rich geometric cues of an object, which has motivated recent applications in reconstructing the detailed surface normals of objects of interest. Meanwhile, inspired by recent breakthroughs in human shape estimation from a single color image, we investigate the new question of whether the geometric cues from a polarization camera can be leveraged to estimate detailed human body shapes. This has led to the curation of the Polarization Human Shape and Pose Dataset (PHSPD)¹, our home-grown polarization image dataset of various human shapes and poses.

¹Our PHSPD dataset will be released soon, for academic purposes only.

Our PHSPD dataset synchronizes four cameras: one polarization camera and three Kinect v2 devices in three different views (each Kinect v2 has a depth and a color camera). The depth and color images from the three-view Kinects are used to obtain more accurate annotations of shape and pose in 3D space. We propose an economical yet effective approach to annotating shape and pose in 3D space. Compared with Human3.6M [3], which uses an expensive motion-capture system to annotate human poses, we do not require subjects to wear special tight clothes and many sensors, a requirement that makes the acquired images restrictive and impractical.

We show some of our annotated shapes and poses in Fig. 1, where the shapes are rendered on the image plane and the poses are shown in 3D space. The annotated shapes and poses align well with the subjects in the image plane across all four camera views.

Figure 1: The figure shows our annotated shapes and poses. The first column is the polarization image for reference. The second to the fifth columns show the annotated shape rendered on the polarization image and three-view color images. The sixth column shows the annotated pose in 3D space.

2 Data Acquisition

Our acquisition system synchronizes four cameras: one polarization camera and three Kinect v2 devices in three different views (each Kinect v2 has a depth and a color camera). The layout is shown in Fig. 2. The main challenge in data acquisition is multi-camera synchronization. As one desktop can control only one Kinect v2, we developed a soft synchronization method. Specifically, each camera is connected to a desktop: the desktop with the polarization camera is the master, and the other three, with the three Kinects, are clients. We use sockets to send messages to each desktop. Upon receiving a capture message, each client grabs the most recent frame from its Kinect into the desktop's memory. At the same time, the master desktop sends a software trigger to the polarization camera to capture one frame into its buffer. In practice, our synchronization system runs at up to 15 frames per second (fps). Fig. 2 shows the synchronization performance of the system: we let a bag fall and compare the position of the bag in the same frame from the four views. The positions of the bag captured by the four cameras are almost identical in terms of distance to the ground.
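As a rough illustration, a minimal sketch of this soft-synchronization loop in Python follows. The host names, port, message format, and the polarization-camera trigger call are all hypothetical stand-ins; the actual SDK calls on the master and client desktops are not described in the paper.

```python
import socket
import time

# Hypothetical client desktops, one per Kinect v2 (host names and port are
# stand-ins; the real network layout is not specified in the paper).
CLIENTS = [("kinect-client-1", 9000),
           ("kinect-client-2", 9000),
           ("kinect-client-3", 9000)]
FPS = 15  # the system's measured upper bound

def trigger_polarization_camera():
    """Placeholder for the polarization camera SDK's software-trigger call."""
    pass

def main():
    # The master connects to every client desktop once, up front.
    socks = [socket.create_connection(addr) for addr in CLIENTS]
    period = 1.0 / FPS
    while True:
        t0 = time.time()
        for s in socks:
            s.sendall(b"CAPTURE\n")      # each client grabs its newest Kinect frame
        trigger_polarization_camera()    # master grabs one polarization frame
        # Keep a fixed frame rate.
        time.sleep(max(0.0, period - (time.time() - t0)))

if __name__ == "__main__":
    main()
```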

Figure 2: Left: the layout of our multi-camera system. Three Kinects are placed around a circular motion area together with one polarization camera. Right: the synchronization result of our multi-camera system; the same frame from the three-view color images and the one-view polarization image is displayed. Note that the layout of our multi-camera system has since been changed to the one in the left figure, but the other settings are the same.
Group   Actions
1       warming-up, walking, running, jumping, drinking, lifting dumbbells
2       sitting, eating, driving, reading, phoning, waiting
3       presenting, boxing, posing, throwing, greeting, hugging, shaking hands
Table 1: The actions in each group. Subjects are required to perform each group of actions four times, but the order of the actions within each repetition is random.
Subject   Gender   # of original frames   # of annotated frames   # of discarded frames
1         female   22561                  22241                   320 (1.4%)
2         male     24325                  24186                   139 (0.5%)
3         male     23918                  23470                   448 (1.8%)
4         male     24242                  23906                   336 (1.4%)
5         male     24823                  23430                   1393 (5.6%)
6         male     24032                  23523                   509 (2.1%)
7         female   22598                  22362                   236 (1.0%)
8         male     23965                  23459                   506 (2.1%)
9         male     24712                  24556                   156 (0.6%)
10        female   24040                  23581                   459 (1.9%)
11        male     24303                  23795                   508 (2.1%)
12        male     24355                  23603                   752 (3.1%)
Total     -        287874                 282112                  5762 (2.0%)
Table 2: The detailed number of frames for each subject, including the number of frames with SMPL shape and 3D joint annotations and the number of discarded frames.

Our dataset has 12 subjects, 9 male and 3 female. Each subject is required to perform 3 different groups of actions (18 different actions in total) 4 times each, plus one free-style group. Details are shown in Tab. 1. Each subject thus has 13 short videos, and the total number of frames per subject is around 22K. Overall, our dataset has 287K frames, with each frame including one polarization image, three color images and three depth images. Quantitative details of our dataset are shown in Tab. 2.

3 Annotation Process

3.1 Shape and Pose Representation

We represent the 3D human body shape (mesh) using the SMPL model [4], a differentiable function that outputs a triangular mesh with 6890 vertices given 82 parameters $(\beta, \theta)$. The shape parameters $\beta \in \mathbb{R}^{10}$ are the linear coefficients of a PCA shape space that mainly determines individual body features such as height, weight and body proportions; the shape space is learned from a large dataset of body scans [4]. The pose parameters $\theta \in \mathbb{R}^{72}$ mainly describe the articulated pose, which consists of one global rotation of the body and the relative rotations of 23 joints in axis-angle representation. The final body mesh is produced by first applying shape-dependent and pose-dependent deformations to the template body, then using forward kinematics to articulate the body, and finally deforming the surface with linear blend skinning. The 3D joint positions can be obtained by linear regression from the output mesh vertices.
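As a concrete illustration, the following is a minimal sketch of this parameterization using the open-source smplx Python package (an assumption on our part; the paper does not specify an implementation). The model-file path is hypothetical, and the SMPL model files must be obtained separately.

```python
import torch
import smplx  # pip install smplx; SMPL model files are downloaded separately

# Load a neutral SMPL body model ("models/" is a hypothetical path).
model = smplx.create("models/", model_type="smpl", gender="neutral")

betas = torch.zeros(1, 10)         # shape coefficients beta (PCA space)
body_pose = torch.zeros(1, 69)     # relative rotations of 23 joints, axis-angle
global_orient = torch.zeros(1, 3)  # global rotation of the body

out = model(betas=betas, body_pose=body_pose, global_orient=global_orient)
vertices = out.vertices            # (1, 6890, 3) mesh vertices
joints = out.joints                # 3D joints regressed from the vertices
```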

3.2 Annotation of Shape and Pose

The reason we use a multi-camera system to acquire image data is that it provides much more information than a single-camera system, so the annotation of human shape and pose in 3D is more reliable.

After camera calibration and plane segmentation of the human body in the depth images, we have a point cloud of the human surface fused from the three-view depth images, as well as a noisy 3D pose given by the Kinect SDK. The annotation of SMPL human shape and 3D joint positions has three main steps, as follows.
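The sketch below shows one way the three depth views could be fused into a single point cloud. It assumes per-camera intrinsics K and calibrated extrinsics (R, t) mapping each camera frame to a common world frame, and that the body has already been segmented in each depth image; this is our illustration, not the authors' released code.

```python
import numpy as np

def depth_to_points(depth, K):
    """Back-project a segmented depth image (meters) into camera-space points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]  # drop pixels with no valid depth

def fuse_views(depths, Ks, Rs, ts):
    """Map each view's points into the common world frame and stack them."""
    clouds = [depth_to_points(d, K) @ R.T + t
              for d, K, R, t in zip(depths, Ks, Rs, ts)]
    return np.concatenate(clouds, axis=0)
```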

3.2.1 Initial guess of 3D pose

As the 3D pose given by the Kinect SDK is noisy, we use the 2D pose predicted by OpenPose [2] as the criterion to decide which joint positions given by the Kinect SDK are correct. We select the 14 aligned joints that both OpenPose and Kinect share. For joint $j$ in view $i$, the 2D joint position by OpenPose is denoted by $p^{op}_{i,j}$, the 3D joint position by Kinect by $P^{kn}_{i,j}$, and its projected 2D joint position by $p^{kn}_{i,j}$. Since we cannot tell directly which joints are detected correctly by Kinect, we use the joint position by OpenPose as the criterion to decide whether a joint is correctly estimated by Kinect, that is,

$$d\big(p^{op}_{i,j},\, p^{kn}_{i,j}\big) < \epsilon, \qquad (1)$$

where $d(\cdot,\cdot)$ denotes the pixel distance and $\epsilon$ is a distance threshold. Then we get the initial guess $\hat{P}_j$ of the 3D position of joint $j$ by averaging the valid positions given by the three-view Kinects,

$$\hat{P}_j = \frac{1}{|V_j|} \sum_{i \in V_j} P^{kn}_{i,j}, \qquad (2)$$

where $V_j$ is the set of views in which joint $j$ passes the check in Eq. (1). If none of the three-view joint positions by Kinect is correct, we consider the joint missing. We discard any frame with more than 2 joints missing (out of 14 in total).
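The joint-validity check of Eq. (1) and the averaging of Eq. (2) could be implemented as follows; the pixel-distance threshold is a hypothetical value, since the paper does not state the one used.

```python
import numpy as np

PIXEL_THRESH = 20.0  # hypothetical pixel-distance threshold for Eq. (1)
MAX_MISSING = 2      # frames with more missing joints are discarded

def initial_3d_pose(op_2d, kn_2d, kn_3d):
    """op_2d, kn_2d: (views, 14, 2) OpenPose and projected Kinect 2D joints;
    kn_3d: (views, 14, 3) Kinect 3D joints in the common world frame."""
    n_views, n_joints, _ = op_2d.shape
    pose = np.full((n_joints, 3), np.nan)
    missing = 0
    for j in range(n_joints):
        # Eq. (1): a Kinect joint is valid if its projection agrees with OpenPose.
        dist = np.linalg.norm(op_2d[:, j] - kn_2d[:, j], axis=-1)
        valid = dist < PIXEL_THRESH
        if valid.any():
            # Eq. (2): average the valid positions over views.
            pose[j] = kn_3d[valid, j].mean(axis=0)
        else:
            missing += 1
    # Discard the frame if more than MAX_MISSING of the 14 joints are missing.
    return None if missing > MAX_MISSING else pose
```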

3.2.2 Fitting shape to pose

The next step is similar to SMPLify [1], but instead of fitting to 2D joints, which have inherent depth ambiguity, we fit the SMPL model to the initial guess of the 3D pose.
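A minimal sketch of such a fitting loop in PyTorch appears below. This is our illustration under stated assumptions: SMPLify additionally uses pose and shape priors, which are omitted here, and the index map from the 14 annotated joints to the model's joint output is hypothetical.

```python
import torch

# Hypothetical index map from the 14 annotated joints to the SMPL joint output.
JOINT_MAP = list(range(14))

def fit_smpl_to_pose(model, target_joints, n_iters=300, lr=0.01):
    """Optimize SMPL parameters so the regressed joints match the initial
    3D pose guess (target_joints: (14, 3)); missing joints (NaN) are ignored."""
    betas = torch.zeros(1, 10, requires_grad=True)
    pose = torch.zeros(1, 72, requires_grad=True)   # global orient + 23 joints
    opt = torch.optim.Adam([betas, pose], lr=lr)
    mask = ~torch.isnan(target_joints).any(dim=-1)  # skip missing joints
    for _ in range(n_iters):
        opt.zero_grad()
        out = model(betas=betas, body_pose=pose[:, 3:], global_orient=pose[:, :3])
        joints = out.joints[0, JOINT_MAP]
        loss = ((joints[mask] - target_joints[mask]) ** 2).sum()
        loss.backward()
        opt.step()
    return betas.detach(), pose.detach()
```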

3.2.3 Fine-tuning shape to the point cloud

The final step fine-tunes the shape to the point cloud of the human surface so that the annotated SMPL shape parameters are more accurate. We iteratively optimize the SMPL parameters by minimizing the distance between the vertices of the SMPL shape and their nearest points in the point cloud. Finally, we obtain the annotated SMPL shape parameters and 3D pose.
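The sketch below illustrates one way to run this vertex-to-nearest-point refinement, using a k-d tree for the nearest-neighbor queries; the learning rate and iteration count are hypothetical.

```python
import torch
from scipy.spatial import cKDTree

def finetune_to_cloud(model, betas, pose, cloud, n_iters=100, lr=0.005):
    """Pull SMPL vertices toward their nearest neighbors in the fused
    human-surface point cloud (cloud: (N, 3) numpy array)."""
    betas = betas.clone().requires_grad_(True)
    pose = pose.clone().requires_grad_(True)
    opt = torch.optim.Adam([betas, pose], lr=lr)
    tree = cKDTree(cloud)
    for _ in range(n_iters):
        opt.zero_grad()
        out = model(betas=betas, body_pose=pose[:, 3:], global_orient=pose[:, :3])
        verts = out.vertices[0]                      # (6890, 3)
        # Nearest neighbors are looked up outside the autograd graph.
        _, idx = tree.query(verts.detach().numpy())
        target = torch.as_tensor(cloud[idx], dtype=verts.dtype)
        loss = ((verts - target) ** 2).sum()
        loss.backward()
        opt.step()
    return betas.detach(), pose.detach()
```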

In addition, we render the boundary of the SMPL shape on the image to get the background mask, and we compute the target normal map from the three depth images based on [5]. Although the target normals are noisy, our experimental results show that our model can still learn to predict good and smooth normal maps.
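For illustration, one standard way to estimate normals from a depth image is to back-project the depth to 3D and cross the local tangent vectors, as sketched below. Note this is not necessarily the exact procedure of [5], which formulates normal estimation as a least-squares problem.

```python
import numpy as np

def normals_from_depth(depth, K):
    """Estimate per-pixel surface normals from one depth image by crossing
    the horizontal and vertical tangent vectors of the back-projected surface."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    pts = np.stack([x, y, depth], axis=-1)           # (h, w, 3) 3D surface
    du = np.gradient(pts, axis=1)                    # tangent along columns
    dv = np.gradient(pts, axis=0)                    # tangent along rows
    n = np.cross(du, dv)
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8
    return n
```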

The annotation process is shown in Fig. 3. Starting from the initial guess of the 3D pose, we fit the SMPL shape to the initial 3D pose and further fine-tune it to the point cloud of the human surface. Finally, we get the annotated human shape and pose for each frame. As Fig. 3 shows, the third step is critical to making the annotated shape align better with the subject in the image, since the point cloud of the human surface gives much more information than a skeleton-based pose; this step can thus adjust the shape to improve the alignment of individual body parts. Besides, we also show our annotated shape on the multi-view images (one polarization image and three-view color images) and the human pose in 3D coordinate space in Fig. 1.

Figure 3: The figure shows our three-step annotation process. The first column shows the initial guess of 3D pose, which is projected on the polarization image. After fitting the SMPL shape to the initial pose, we show the initial fitted shape with the point cloud of human surface (black points) in the second column and the rendered shape on the image in the third column. The fourth and fifth columns show the annotated shape after fine-tuning the shape to the point cloud of human surface. The sixth column shows the corresponding annotated 3D pose projected on the polarization image.

References

  • [1] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black (2016) Keep it SMPL: automatic estimation of 3D human pose and shape from a single image. In European Conference on Computer Vision, pp. 561–578.
  • [2] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh (2019) OpenPose: realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1.
  • [3] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu (2014) Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(7), pp. 1325–1339.
  • [4] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black (2015) SMPL: a skinned multi-person linear model. ACM Transactions on Graphics (TOG) 34(6), pp. 248.
  • [5] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia (2018) GeoNet: geometric neural network for joint depth and surface normal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 283–291.