[CVPR 2022] Versatile Multi-Modal Pre-Training for Human-Centric Perception
Human-centric perception plays a vital role in vision and graphics, but its data annotations are prohibitively expensive. Therefore, it is desirable to have a versatile pre-train model that serves as a foundation for data-efficient downstream task transfer. To this end, we propose the Human-Centric Multi-Modal Contrastive Learning framework HCMoCo, which leverages the multi-modal nature of human data (e.g. RGB, depth, 2D keypoints) for effective representation learning. The objective comes with two main challenges: dense pre-training for multi-modality data and efficient usage of sparse human priors. To tackle the challenges, we design the novel Dense Intra-sample Contrastive Learning and Sparse Structure-aware Contrastive Learning targets by hierarchically learning a modal-invariant latent space featured with continuous and ordinal feature distribution and structure-aware semantic consistency. HCMoCo provides pre-training for different modalities by combining heterogeneous datasets, which allows efficient usage of existing task-specific human data. Extensive experiments on four downstream tasks of different modalities demonstrate the effectiveness of HCMoCo, especially under data-efficient settings (7.16% and 12% improvement on DensePose estimation and human parsing). Moreover, we demonstrate the versatility of HCMoCo by exploring cross-modality supervision and missing-modality inference, validating its strong ability in cross-modal association and reasoning.
As a long-standing problem, human-centric perception has been studied for decades, ranging from sparse prediction tasks, such as human action recognition [shahroudy2016ntu, liu2019ntu, yan2018spatial, chen2021channel], 2D keypoints detection [lin2014microsoft, andriluka14cvpr, sun2019deep, xiao2018simple] and 3D pose estimation [h36m_pami, reddy2021tessetrack, martinez2017simple], to dense prediction tasks, such as human parsing [gong2017look, li2017multiple, gong2018instance, chen2014detect] and DensePose prediction [guler2018densepose]. Unfortunately, to train a model with reasonable generalizability and robustness, an enormous amount of labeled real data is necessary, which is extremely expensive to collect and annotate. Therefore, it is desirable to have a versatile pre-train model that can serve as a foundation for all the aforementioned human-centric perception tasks.
With the development of sensors, the human body can be more conveniently perceived and represented in multiple modalities, such as RGB, depth, and infrared. In this work, we argue that the multi-modality nature of human-centric data can induce effective representations that transfer well to various downstream tasks, due to three major advantages: 1) Learning a modal-invariant latent space through pre-training helps efficient task-relevant mutual information extraction. 2) A single versatile pre-train model on multi-modal data facilitates multiple downstream tasks using various modalities. 3) Our multi-modal pre-train setting bridges heterogeneous human-centric datasets through their common modality, which benefits the generalizability of pre-train models.
We mainly explore two groups of modalities as shown in Fig. 1 a): dense representations (e.g. RGB, depth, infrared) and sparse representations (e.g. 2D keypoints, 3D pose). Dense representations can provide rich texture and/or 3D geometry information. But they are mostly low-level and noisy. On the contrary, sparse representations obtained by off-the-shelf tools [8765346, mmpose2020] are semantic and structured. But the sparsity results in insufficient details. We highlight that it is non-trivial to integrate these heterogeneous modalities into a unified pre-training framework for the following two main challenges: 1) learning representations suitable for dense prediction tasks in the multi-modality setting; 2) using weak priors from sparse representations effectively for pre-training.
Challenge 1: Dense Targets. Existing methods [liu2020p4contrast, hou2021pri3d] perform contrastive learning densely on pixel-level features to achieve view-invariance for dense prediction tasks. However, those methods require multiple views of a static 3D scene [dai2017scannet], which is inapplicable to human-centric applications with only a single view. Furthermore, it is preferable to learn representations that are continuously and ordinally distributed over the human body. In light of this, we generalize the widely used InfoNCE [oord2018representation] and propose a dense intra-sample contrastive learning objective that applies a soft pixel-level contrastive target, which can facilitate learning ordinal and continuous dense feature distributions.
Challenge 2: Sparse Priors. To employ priors in contrastive learning, previous works [khosla2020supervised, wei2020can, assran2020supervision] mainly use the supervision to generate semantically positive pairs. However, these methods only focus on the sample-level contrastive learning, which means each sample is encoded to a global embedding. It is not optimal for human dense prediction tasks. To this end, we propose a sparse structure-aware contrastive learning target, which uses semantic correspondences across samples as positive pairs to complement positive intra-sample pairs. Particularly, leveraging sparse human priors leads to an embedding space where semantically corresponding parts are aligned more closely.
To sum up, we propose HCMoCo, a Human-Centric multi-Modal Contrastive learning framework for versatile multi-modal pre-training. To fully leverage multi-modal observations, HCMoCo effectively utilizes both dense measurements and sparse priors using the following three-level hierarchical contrastive learning objectives: 1) sample-level modality-invariant representation learning; 2) dense intra-sample contrastive learning; 3) sparse structure-aware contrastive learning. As an effort towards establishing a comprehensive multi-modal human parsing benchmark dataset, we label human segments for RGB-D images from the NTU RGB+D dataset [shahroudy2016ntu], and contribute the NTURGBD-Parsing-4K dataset. To evaluate HCMoCo, we transfer our pre-train model to four human-centric downstream tasks using different modalities, including DensePose estimation (RGB) [guler2018densepose], human parsing using RGB [h36m_pami] or depth frames, and 3D pose estimation (depth) [haque2016towards]. Under both full-set and data-efficient training settings, HCMoCo consistently achieves better performance than training from scratch or pre-training on ImageNet. To name a few, as shown in Fig. 1 b), we achieve 7.16% improvement in terms of GPS AP on 10% of the training data of DensePose estimation, and 12% improvement in terms of mIoU on a subset of the training data of Human3.6M human parsing. Moreover, we evaluate the modal-invariance of the latent space learned by HCMoCo for dense prediction on NTURGBD-Parsing-4K with two settings: cross-modality supervision and missing-modality inference. Compared against conventional contrastive learning targets, our method improves the segmentation mIoU by 29% and 24% for the two settings, respectively. To the best of our knowledge, we are the first to study multi-modal pre-training for human-centric perception.
The main contributions are summarized below: 1) As the first endeavor, we provide an in-depth analysis of human-centric pre-training, which is formulated as a challenging multi-modal contrastive learning problem. 2) Together with the novel hierarchical contrastive learning objectives, a comprehensive framework HCMoCo is proposed for effective pre-training for human-centric tasks. 3) Through extensive experiments, HCMoCo achieves better performance than existing methods, and meanwhile shows promising modal-invariance properties. 4) To benefit multi-modal human-centric perception, we contribute an RGB-D human parsing dataset, NTURGBD-Parsing-4K.
Human-Centric Perception. Many efforts have been put into human-centric perception over the past decades. Much work in 2D keypoint detection [lin2014microsoft, andriluka14cvpr, sun2019deep, xiao2018simple] has achieved robust and accurate performance. 3D pose estimation has long been a challenging problem and is approached from two directions: lifting from 2D keypoints [h36m_pami, reddy2021tessetrack, martinez2017simple] and predicting from depth maps [haque2016towards, xiong2019a2j]. Human parsing can be defined in two ways. The first one parses garments together with visible body parts [gong2017look, li2017multiple, gong2018instance]. The second one only focuses on parsing human parts [chen2014detect, h36m_pami, hong2021garmentd]. In this work, we focus on the second setting because depth and 2D keypoints do not contain the texture information needed for garment parsing. There are a few works [hernandez2012graph, nishi2017generation] about human parsing on depth maps. However, the data and annotations are too coarse or unavailable. To further push the accuracy of human-centric perception, DensePose [guler2018densepose, tan2021humangps] is proposed to densely model each human body surface point. The cost of DensePose annotation is enormous. Therefore, we also explore data-efficient learning of DensePose.
Multi-Modal Contrastive Learning. Multi-modality naturally provides different views of the same sample, which fits well into the contrastive learning framework. CMC [tian2020contrastive] proposes the first multi-view contrastive learning paradigm, which takes any number of views. CLIP [radford2021learning] learns a joint latent space from a large-scale paired image-language dataset. Extensive studies [patrick2021compositions, hazarika2020misa, han2020self, rouditchenko2020avlnet, patrick2020multi, alayrac2020self] focus on video-audio contrastive learning. Recently, 2D-3D contrastive learning [hou2021pri3d, liu2020p4contrast, liu2021contrastive] has also been studied with the development of 3D computer vision. In this work, aside from commonly used modalities, we also explore the potential of 2D keypoints in human-centric contrastive learning.
In this section, we first introduce the general paradigm of HCMoCo (3.1). Following the design principles (3.2), hierarchical contrastive learning targets are formally introduced (3.3). Next, an instantiation of HCMoCo is introduced (3.4). Finally, we propose two applications of HCMoCo to show the versatility (3.5).
As shown in Fig. 2, HCMoCo takes multiple modalities of the perceived human body as input. The target is to learn human-centric representations that can be transferred to downstream tasks. The input modalities can be categorized into dense and sparse representations. Dense representations are the direct output of imaging sensors, e.g. RGB, depth, infrared. They typically contain rich information but are low-level and noisy. Sparse representations are structured abstractions of the human body, e.g. 2D keypoints, 3D pose, which can be formulated as a graph. Different representations of the same view of a human should be spatially aligned, which means intra-sample correspondences can be obtained for dense contrastive learning. HCMoCo aims to pre-train multiple encoders that produce embeddings of dense representations and sparse representations for downstream task transfer.
To support dense downstream tasks, other than the usual sample-level global embeddings used in [he2020momentum, chen2020simple, chen2020improved, grill2020bootstrap, liu2020self, tian2020contrastive], we propose to consider three levels of embeddings: global embeddings, sparse embeddings, and dense embeddings. (Footnote 1: for easier understanding of the notation, superscripts stand for the kind of embedding and subscripts for the kind of representation; ‘g’ for ‘global’, ‘d’ for ‘dense’, ‘s’ for ‘sparse’.) They are defined as follows: 1) For dense representations, the global embedding is obtained by applying a mapper network to the mean pooling of the corresponding feature map. Similarly, for sparse representations, the global embedding is obtained from the pooled sparse features. 2) Sparse embeddings have the same size as the sparse representations. Formally, for sparse representations, each node feature is passed through a mapper network to produce the corresponding sparse embedding. For dense representations, the corresponding sparse features are pooled from the dense feature map using the correspondences between the dense and sparse representations, and are then mapped to sparse embeddings. 3) Dense embeddings are only defined on dense representations and are computed from the dense feature map. With the three levels of embeddings defined, we formulate the overall learning objective as a weighted combination of the hierarchical contrastive learning targets (Eq. 1), which is analyzed and explained as follows.
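As a sketch of the overall objective (Eq. 1), written with assumed notation: $\mathcal{L}_{g}$, $\mathcal{L}_{d}$ and $\mathcal{L}_{s}$ stand for the sample-level, dense intra-sample and sparse structure-aware targets, and $\lambda_{g}$, $\lambda_{d}$, $\lambda_{s}$ for the balancing weights. These symbol names are illustrative and may differ from the authors' own.

```latex
\mathcal{L} \;=\; \lambda_{g}\,\mathcal{L}_{g} \;+\; \lambda_{d}\,\mathcal{L}_{d} \;+\; \lambda_{s}\,\mathcal{L}_{s}
```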
In this subsection, we analyze the intuitions behind the design of the learning targets, which leads to the following three principles. 1) Mutual Information Maximization: Inspired by [wu2018unsupervised, poole2019variational], we propose to maximize the lower bound on mutual information, which has been shown by many previous works [he2020momentum, chen2020simple, chen2020improved, tian2020contrastive] to produce strong pre-train models. 2) Continuous and Ordinal Feature Distribution: Inspired by the property of human-centric perception, it is desirable for the feature maps of the human body to be continuous and ordinal. The human body is a structural and continuous surface. The dense predictions, e.g. human parsing [gong2017look, li2017multiple, gong2018instance] and DensePose [guler2018densepose], are also continuous. Therefore, such a property should also be reflected in the learned representations. Besides, for an anchor point on the human surface, closer points have higher probabilities of sharing similar semantics with the anchor point than far away points. Therefore, the learned dense representations should also align with such an ordinal relationship. 3) Structure-Aware Semantic Consistency: Sparse representations are abstractions of the human body, which contain valuable structural semantics about the human body. Instead of identity information, human pose and structure understanding are the keys to our target downstream tasks. Therefore, it is reasonable to eliminate the identity information and enhance the structure information by enforcing structure-aware semantic consistency, where semantically close embeddings (e.g. embeddings of left hands from different samples) are pulled close and vice versa.
Based on the above three principles, we formally define hierarchical contrastive learning targets in this subsection.
Sample-level modality-invariant representation learning aims at learning a joint latent space at the sample level using global embeddings, which fulfills the first principle. Inspired by [tian2020contrastive], the learning target is formulated as a contrastive loss between the global embeddings of every pair of modalities.
Here, each modality provides a set of global embeddings. For an embedding in one modality, the positive is the embedding of its paired view in the other modality, and a temperature scales the similarities. Note that the global embeddings can be sampled from either dense or sparse representations.
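The sample-level target described above can be sketched as a cross-modal InfoNCE loss. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: negatives are taken from the same batch rather than from a memory pool, the function names are hypothetical, and the temperature value 0.07 is a placeholder.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def sample_level_loss(global_embeddings, temperature=0.07):
    """Cross-modal InfoNCE over every ordered pair of modalities.

    global_embeddings: dict mapping a modality name to an (N, C) array.
    Row k of every array must describe the same sample (the paired view).
    The temperature is a placeholder value (assumption).
    """
    names = list(global_embeddings)
    total, num_pairs = 0.0, 0
    for a in names:
        for b in names:
            if a == b:
                continue
            za = l2_normalize(global_embeddings[a])
            zb = l2_normalize(global_embeddings[b])
            logits = za @ zb.T / temperature  # (N, N) similarity matrix
            # The positive of sample k in modality a is sample k in modality b.
            total += -np.mean(np.diag(log_softmax(logits, axis=1)))
            num_pairs += 1
    return total / num_pairs
```

Because the dictionary may hold global embeddings of dense and sparse modalities alike, the same function covers RGB, depth and keypoint embeddings.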
Dense intra-sample contrastive learning operates on paired dense representations. For any two paired dense embeddings, to simultaneously satisfy the first and the second principles, the dense intra-sample contrastive learning target between them is defined in a ‘soft’ way: every pair of coordinates on the two dense embeddings contributes an InfoNCE-like term that is scaled by a weight, and a temperature scales the similarities.
This target is a generalized version of InfoNCE [oord2018representation]. InfoNCE is the special case in which the weight is set to 1 if the two coordinates coincide and 0 otherwise. We use the normalized distances between coordinates as the weights.
For each pair of dense representations, the above learning target is calculated between each pair of dense embeddings. Therefore, the whole learning target aggregates it over all paired dense embeddings drawn from the sets of dense embeddings of the different modalities.
Note that the ‘soft’ learning target cannot guarantee an ordinal feature distribution. Instead, it serves as a computationally efficient relaxation of the requirement of ordinal distribution.
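The soft dense target can be illustrated with a small NumPy sketch. It rests on assumptions: the exact form of the normalized-distance weights is not recoverable here, so `distance_weights` uses a softmax over negative normalized Euclidean distances as a stand-in, and `sharpness`, the temperature value and all function names are hypothetical. Passing an identity matrix as the weights gives the ‘hard’ InfoNCE special case.

```python
import numpy as np


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def distance_weights(height, width, sharpness=5.0):
    """(H*W, H*W) weights; row p is a distribution favouring locations near p.

    Assumed form: softmax over negative normalized Euclidean distances.
    """
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    dist = dist / max(dist.max(), 1e-12)  # normalize distances to [0, 1]
    weights = np.exp(-sharpness * dist)
    return weights / weights.sum(axis=1, keepdims=True)


def dense_intra_sample_loss(feat_a, feat_b, weights, temperature=0.07):
    """Weighted pixel-level contrastive loss between two aligned modalities.

    feat_a, feat_b: (H, W, C) L2-normalized dense embeddings of one sample.
    weights: (H*W, H*W) soft targets; np.eye(H*W) recovers plain InfoNCE.
    """
    channels = feat_a.shape[-1]
    fa = feat_a.reshape(-1, channels)
    fb = feat_b.reshape(-1, channels)
    logits = fa @ fb.T / temperature  # similarity of every pixel pair
    per_pixel = -(weights * log_softmax(logits, axis=1)).sum(axis=1)
    return float(per_pixel.mean())
```

With soft weights, a pixel is rewarded for being similar to its spatial neighbours in the other modality as well, which is one way to encourage the continuous and ordinal feature distribution discussed above.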
Sparse structure-aware contrastive learning takes two sparse representations as inputs. The paired features of the two (i.e. the features of the same joint) should be pulled close, while unpaired features are pushed away. The two sparse representations can be sampled from the same or different modalities, intra- or inter-sample. The intra-sample alignment satisfies the first principle. The inter-sample alignment follows the third principle. The sparse structure-aware contrastive learning target is formulated as a joint-level contrastive loss between two sparse embeddings.
Here, each modality provides a set of sparse embeddings, the two sparse embeddings are sampled from the union of these sets, and a temperature scales the similarities. To conclude, the overall learning target is formulated as Eq. 1, where the weights balance the three targets.
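A joint-level version of this target can be sketched as follows. The sketch assumes every sparse embedding is a (J, C) array with one L2-normalized row per joint; the function names and the temperature value are hypothetical and the code is not taken from the authors' release.

```python
import numpy as np


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def sparse_pair_loss(sparse_a, sparse_b, temperature=0.07):
    """Contrast joint k of one skeleton with all joints of another.

    sparse_a, sparse_b: (J, C) L2-normalized sparse embeddings. They may come
    from the same sample in two modalities or from two different samples.
    """
    logits = sparse_a @ sparse_b.T / temperature  # (J, J)
    # Positive pair: the same joint index; all other joints act as negatives.
    return float(-np.mean(np.diag(log_softmax(logits, axis=1))))


def sparse_structure_loss(embeddings, temperature=0.07):
    """Average the pairwise loss over a pool of sparse embeddings.

    embeddings: (M, J, C) array pooled from all modalities and samples of a
    mini-batch; M must be at least 2.
    """
    count = len(embeddings)
    losses = [
        sparse_pair_loss(embeddings[i], embeddings[j], temperature)
        for i in range(count)
        for j in range(count)
        if i != j
    ]
    return float(np.mean(losses))
```

Since identities differ across samples while joint indices do not, minimizing this loss pulls, for example, all left-hand embeddings together regardless of who is in the image.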
In this section, we introduce an instantiation of HCMoCo. As shown in Fig. 3, for dense representations, we use RGB and depth. Large-scale paired human RGB and depth data is easy to obtain with affordable sensors e.g. Kinect. These two modalities are the most commonly encountered in human-centric tasks [chen2014detect, h36m_pami, gong2017look, li2017multiple, gong2018instance]. Moreover, a proper pre-train model for depth is highly desired. Therefore, RGB and depth are reasonable choices of human dense representations, both of which are easy to acquire and important to downstream tasks. For sparse representations, 2D keypoints are used, which provide positions of human body joints in the image coordinate. Off-the-shelf tools [8765346, mmpose2020] are available to quickly and robustly extract human 2D keypoints given RGB images. Using 2D keypoints as the sparse representation is a good balance between the amount of human prior and acquisition difficulty.
For RGB inputs, an image encoder [sun2019deep] is applied to obtain feature maps. Similarly, for depth inputs, an image encoder [sun2019deep] or a 3D encoder [qi2017pointnet, qi2017pointnet++] can be applied to extract feature maps. 2D keypoints are encoded by a GCN-based encoder [zhao2019semantic] to produce sparse features. Each mapper network comprises a single linear layer and a normalization operation.
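To make the three embedding levels concrete, here is a hedged NumPy sketch of how they could be derived from encoder outputs. The `Mapper` class follows the description of a linear layer plus a normalization, where L2 normalization is assumed; the nearest-pixel lookup at keypoint locations is one possible reading of pooling via correspondences. The names `Mapper`, `embed_dense_modality` and `embed_sparse_modality` are hypothetical.

```python
import numpy as np


class Mapper:
    """A single linear layer followed by L2 normalization (assumed form)."""

    def __init__(self, in_dim, out_dim, rng):
        self.weight = rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim))
        self.bias = np.zeros(out_dim)

    def __call__(self, x):
        y = x @ self.weight + self.bias
        return y / (np.linalg.norm(y, axis=-1, keepdims=True) + 1e-12)


def embed_dense_modality(feature_map, keypoints_xy, mappers):
    """Global, sparse and dense embeddings of one dense feature map.

    feature_map: (H, W, C) encoder output for RGB or depth.
    keypoints_xy: (J, 2) integer pixel coordinates (x, y) of the 2D keypoints.
    mappers: dict with "global", "sparse" and "dense" Mapper instances.
    """
    height, width, _ = feature_map.shape
    xs = np.clip(keypoints_xy[:, 0], 0, width - 1)
    ys = np.clip(keypoints_xy[:, 1], 0, height - 1)
    global_emb = mappers["global"](feature_map.mean(axis=(0, 1)))  # (D,)
    sparse_emb = mappers["sparse"](feature_map[ys, xs])            # (J, D)
    dense_emb = mappers["dense"](feature_map)                      # (H, W, D)
    return global_emb, sparse_emb, dense_emb


def embed_sparse_modality(joint_features, mappers):
    """Global and sparse embeddings of (J, C) keypoint-encoder features."""
    global_emb = mappers["global"](joint_features.mean(axis=0))    # (D,)
    sparse_emb = mappers["sparse"](joint_features)                 # (J, D)
    return global_emb, sparse_emb
```

In practice each modality would have its own set of mappers; a shared dictionary is used here only to keep the sketch short.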
As for the implementation of contrastive learning targets, we choose to use a memory pool to store all the global embeddings which are updated in a momentum way. Sparse and dense embeddings cannot all fit in memory. Therefore, for the last two types of contrastive learning targets, the negative samples are sampled within a mini-batch.
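A momentum-updated memory pool for global embeddings can be sketched as below. This is an illustrative NumPy version under assumptions, not the adapted open-source code: the class name `MemoryPool`, the momentum value 0.5 and the uniform negative sampling are all placeholders.

```python
import numpy as np


class MemoryPool:
    """Stores one global embedding per training sample."""

    def __init__(self, num_samples, dim, momentum=0.5, seed=0):
        self.rng = np.random.default_rng(seed)
        bank = self.rng.normal(size=(num_samples, dim))
        self.bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)
        self.momentum = momentum  # placeholder value (assumption)

    def update(self, indices, embeddings):
        """Blend fresh embeddings into their slots, then re-normalize."""
        mixed = (self.momentum * self.bank[indices]
                 + (1.0 - self.momentum) * embeddings)
        norms = np.linalg.norm(mixed, axis=1, keepdims=True) + 1e-12
        self.bank[indices] = mixed / norms

    def sample_negatives(self, num_negatives, exclude):
        """Draw stored embeddings whose indices are not in `exclude`."""
        candidates = np.setdiff1d(np.arange(len(self.bank)), exclude)
        chosen = self.rng.choice(candidates, size=num_negatives, replace=False)
        return self.bank[chosen]
```

Only global embeddings are stored this way; as noted above, sparse and dense embeddings are too large to keep for the whole dataset, so their negatives come from the current mini-batch.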
On top of the pre-train framework HCMoCo, we propose to further extend it on two direct applications: cross-modality supervision and missing-modality inference. The extensions are based on the key design of HCMoCo: dense intra-sample contrastive learning target. With the feature maps of different modalities aligned, it is straightforward to implement the two extensions, which are shown in Fig. 4.
Cross-Modality Supervision is a novel task where we train the network on the source modality and test on the target modality. This is a practical scenario where people transfer the knowledge of some single-modality dataset to other modalities. At training time, an additional downstream task head (e.g. a segmentation head) is attached to the backbone of the source modality. The hierarchical contrastive learning targets together with the downstream task loss are used for end-to-end training. At inference time, the same task head is attached to the backbone of the target modality. The extracted feature maps of the target modality are passed to the head for prediction.
Missing-Modality Inference is another novel task where we train the network using multi-modal data and perform inference on a single modality. Multi-modal data collection in practice would inevitably result in data with incomplete modalities, which brings the requirement of missing-modality inference. At training time, the feature maps of multiple modalities are fused using max-pooling and fed to a downstream task head. Similarly, the hierarchical contrastive learning targets and the downstream task loss are used for co-training. At inference time, the feature map of a single modality is passed to the task head for missing-modality inference.
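The fusion and inference steps of missing-modality inference can be sketched in a few lines of NumPy. The linear per-pixel head and the function names are hypothetical simplifications; a real segmentation head would be a small network.

```python
import numpy as np


def fuse_max(feature_maps):
    """Element-wise max over a list of aligned (H, W, C) feature maps."""
    return np.maximum.reduce(list(feature_maps))


def predict(head_weight, feature_map):
    """Per-pixel class prediction with a linear head of shape (C, K)."""
    logits = feature_map @ head_weight  # (H, W, K)
    return logits.argmax(axis=-1)


# Training time: the head sees the fused features of all modalities, e.g.
#   fused = fuse_max([rgb_features, depth_features])
# Inference time: only one modality is available, so its features go to the
# same head directly, e.g. predict(head_weight, depth_features).
```

The closer the dense contrastive targets bring the feature maps of the modalities, the less the head's input changes when a modality is missing. The same reuse of a head on another modality's features underlies cross-modality supervision.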
Table 1 (DensePose estimation), header:
|Method|Pre-train Datasets|Full Data: BBox AP, GPS AP, GPSM AP, IoU AP|10% Data: BBox AP, GPS AP, GPSM AP, IoU AP|
Table 2 (RGB human parsing on Human3.6M), header:
|Method|Full Data|20% Data|10% Data|1% Data|
Although RGB human parsing has been well studied [chen2014detect, gong2017look, li2017multiple, gong2018instance], human parsing on depth [hernandez2012graph, nishi2017generation] or RGB-D data has not been fully addressed due to the lack of labeled data. Therefore, we contribute the first RGB-D human parsing dataset: NTURGBD-Parsing-4K. The RGB and depth frames are uniformly sampled from NTU RGB+D (60/120) [shahroudy2016ntu, liu2019ntu]. As shown in Fig. 5, we annotate 24 human parts for paired RGB-D data. The part partition protocol follows that of [h36m_pami]. The train and test sets contain the same number of samples. Hopefully, by contributing this dataset, we can promote the development of both human perception and multi-modality learning.
Implementation Details. The default RGB and depth encoders are HRNet-W18 [sun2019deep]. The default datasets for pre-training are NTU RGB+D [liu2019ntu] and MPII [andriluka14cvpr]. The former provides paired indoor human RGB, depth, and 2D keypoints. The latter provides in-the-wild human RGB and 2D keypoints. Mixing human data from different domains helps our pre-train models adapt to a wider domain.
Downstream Tasks. We test our pre-train models on four different human-centric downstream tasks, two on RGB images and two on depth. 1) DensePose estimation on COCO [guler2018densepose]: DensePose aims at mapping pixels of the observed human body to the surface of a 3D human body, which is a highly challenging task. 2) RGB human parsing on Human3.6M [h36m_pami]. Human3.6M provides pure human part segmentation, which aligns with our objectives. We uniformly sample 2fps of the video for training and evaluation. 3) Depth human parsing on NTURGBD-Parsing-4K. 4) 3D pose estimation from depth maps on ITOP [haque2016towards] (only side view). For all the above downstream tasks, we use the pre-train backbones for end-to-end fine-tune.
Comparison Methods. Since there are few previous human-centric multi-modal pre-train methods, we propose to use general multi-modal contrastive learning methods CMC [tian2020contrastive] and MMV [alayrac2020self] as the baselines. Although there are other multi-modal contrastive learning works, they either require the multi-view calibration [hou2021pri3d] or focus on multi-modal downstream tasks [liu2021contrastive, hazarika2020misa] and therefore are not suitable for comparison. In addition, for RGB tasks, we also experiment under two settings, one initializes encoders with supervised ImageNet [krizhevsky2012imagenet] (IN) pre-train while the other does not.
Table 3 (ablation study), rows for the dense intra-sample learning target:
|Hard Dense Intra-sample|49.40|49.14|52.49|57.30|56.43|54.05|55.36|68.43|31.26|51.54|
|Soft Dense Intra-sample|50.21|50.25|53.42|57.70|62.33|51.50|56.35|69.26|32.20|51.06|
DensePose Estimation. As shown in Tab. 1, we test DensePose estimation [guler2018densepose] under two settings: full and 10% of the training data. The trained models are tested on the full validation set of DensePose. Firstly, without IN pre-train, our pre-train model significantly outperforms both ‘From Scratch’ and the two baseline methods. Especially under 10% of the training data, a 12.7% improvement in terms of GPS AP is observed, and our pre-train model even outperforms the one using IN pre-train by 4.13% in terms of GPS AP. When we use IN pre-train as initialization, which is a common practice for 2D tasks, our method still outperforms all the baselines. Our method surpasses IN pre-train by 7.2% and 5.4% in terms of GPS/GPSM AP under the 10% setting. To further test the performance of in-domain transfer, we also pre-train models using the training sets of NTU RGB+D and COCO. The performance gain under the same setting further improves to 9.7% and 7.5% in terms of GPS/GPSM AP.
RGB Human Parsing. As shown in Tab. 2, we test four settings on Human3.6M [h36m_pami]: full, 20%, 10%, and 1% of the training data. In all settings, our method outperforms all baselines in all metrics. On full training data, we outperform IN pre-train by 5.6% in terms of mIoU. The performance gain increases as the amount of training data decreases. It is worth noticing that with only a subset of the training data, our method outperforms IN pre-train with full training data.
Table 4 (depth human parsing on NTURGBD-Parsing-4K), header:
|Method|Full Data|20% Data|
Depth Human Parsing. As shown in Tab. 4, we test the pre-train depth backbone on our proposed dataset NTURGBD-Parsing-4K with all training data and 20% of the training data. We outperform all baselines on both settings. Especially, using only 20% of the training data, we surpass IN pre-train by 6.4% and MMV [alayrac2020self] by 4.6% in terms of mIoU.
3D Pose Estimation. As shown in Tab. 5, we test the pre-train depth backbone on ITOP [haque2016towards] with six different ratios of training data. Our pre-train model outperforms all baselines on most settings. With only a fraction of the training data, the accuracy of our method exceeds that of IN pre-train with all training data. It is also worth noticing that the smallest ratio leaves so few training samples that it becomes a few-shot learning setting. With such limited training data, IN pre-train barely produces meaningful results, while our method improves the accuracy by 48.2%.
In this subsection, we perform a thorough ablation study on HCMoCo to justify the design choices. As shown in Tab. 3, we firstly report the results of only applying sample-level modality-invariant representation learning. Then we add dense intra-sample contrastive learning and sparse structure-aware contrastive learning in order. To further demonstrate the effect of the ‘soft’ design in dense intra-sample contrastive learning, we also report results of the ‘hard’ learning target, which takes the form of a classic InfoNCE [oord2018representation]. We report the results of the ablation study on all four downstream tasks under data-efficient settings.
For DensePose estimation, it is important to learn feature maps that are continuously and ordinally distributed, which is the expected result of soft dense intra-sample contrastive learning. The performance gain of the soft learning target over the hard counterpart justifies the observation and the learning target design. The dense intra-sample contrastive learning also shows superiority on three other downstream tasks, which shows the importance of fine-grained contrastive learning targets for dense prediction tasks.
Explicitly injecting human prior into the network through sparse structure-aware contrastive learning also proves its effectiveness by further improving the performance on DensePose. Thanks to the strong hints provided by 2D keypoints, the performance of 3D pose estimation is improved. Moreover, the sparse structure-aware contrastive learning boosts the performance of human parsing both on RGB and depth maps by 1.9% and 2.8% respectively in terms of mIoU. Although 2D keypoints are sparse priors, they still provide the rough location of each part of the human body, which facilitate the feature alignment of same body parts. To summarize, the sparse and dense learning targets both contribute to the performance of our methods, which is in line with our analysis.
Cross-Modality Supervision. We test the cross-modality supervision pipeline on the task of human parsing on NTURGBD-Parsing-4K because it has two modalities and respective dense annotations. Two baseline methods are adopted: 1) using the CMC [tian2020contrastive] contrastive learning target; 2) no contrastive learning target. For a fair comparison, the backbones of all methods are initialized by CMC [tian2020contrastive] pre-train. At training time, the target modality of the training data is not available. We experiment on two settings where we supervise on RGB and test on depth (RGB → Depth), and vice versa (Depth → RGB). As shown in Tab. 6, our method outperforms both baselines under the two settings. Specifically, our method improves the mIoU of the two settings by 29.2% and 23.0%, respectively. Even compared to methods with direct supervision, we can achieve comparable results.
Table 6 (cross-modality supervision), header:
|Method|RGB → Depth|Depth → RGB|
Table 7 (missing-modality inference), header:
|Method|Only RGB|Only Depth|
Missing-Modality Inference. For missing-modality inference, we report the experiments on the same dataset and with the same baselines as above. As shown in Tab. 7, with no pixel-level alignment, the two baseline methods struggle in the two missing-modality settings, i.e. ‘Only RGB’ and ‘Only Depth’. In contrast, our method improves the segmentation mIoU by 24.3% and 19.6% on the two settings.
Faster Convergence. One of the advantages of pre-training is the fast convergence speed when transferred to downstream tasks. Our HCMoCo also shows superiority in this feature. We log the validation mIoU of Human3.6M human parsing at different training epochs. As shown in Fig. 6, compared with IN pre-train and CMC [tian2020contrastive], our pre-train model is able to converge within a few training epochs in both the full training data and data-efficient settings.
Changing Backbone. So far our experiments are all performed on HRNet-W18. To further demonstrate HCMoCo’s performance on other backbones, for the 2D backbone, we also experiment with HRNet-W32 [sun2019deep]. For the depth backbone, we choose to test with PointNet++ [qi2017pointnet++]. For the RGB pre-train model, we experiment on the DensePose estimation. For the depth pre-train model, we experiment on the NTURGBD-Parsing-4K. As shown in Tab. 8, our method outperforms its pre-train counterparts by a reasonable margin, which is in line with our previous experimental results.
In this work, we propose the first versatile multi-modal pre-training framework HCMoCo specifically designed for human-centric perception tasks. Hierarchical contrastive learning targets are designed based on the nature of human datasets and the requirements of human-centric downstream tasks. Extensive experiments on four different human downstream tasks of different modalities demonstrated the effectiveness of our pre-training framework. We contribute a new RGB-D human parsing dataset NTURGBD-Parsing-4K to support research of human perception on RGB-D data. Besides downstream task transfer, we also propose two novel applications of HCMoCo to show its versatility and ability in cross-modal reasoning.
Potential Negative Impacts & Limitations. Usage of large amounts of data and long training time might negatively impact the environment. Moreover, even though we did not collect any new human data in this work, human data collection could happen if our framework is used in other applications, which potentially raises privacy concerns. As for the limitations, due to limited resources, we could only experiment with one possible instantiation of HCMoCo. And for the same reason, even though the theoretical possibility exists, we do not have the chance to further scale up the amount of human dataset and network size.
Acknowledgments This work is supported by NTU NAP, MOE AcRF Tier 2 (T2EP20221-0033), and under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).
For the proposed instantiation of HCMoCo, we implement the sample-level modality-invariant representation learning target by maintaining a memory pool, which is adapted from an open-sourced implementation (https://github.com/HobbitLong/PyContrast). The memory pool is updated in a momentum style. For global embeddings, negative samples are drawn from the memory pool. The other hyper-parameters are the batch size, the learning rate, and a temperature shared by all three contrastive learning targets. For pre-training, 4 NVIDIA V100 GPUs are used. The training process is divided into two steps. The first step only pre-trains the model using the sample-level modality-invariant representation learning target. The second step adds the other two learning targets and continues training. The whole training process takes approximately 48 hours.
Mixing Heterogeneous Datasets. Since we mix several heterogeneous human datasets for pre-training, we need to mask out the missing modalities. For example, when we use NTU RGB+D and MPII for pre-training, the former has all the required modalities, while the latter lacks depth maps. Therefore, for the hierarchical contrastive learning targets, we mask out the missing depth embeddings of MPII when sampling positive pairs. With this masking technique, multiple heterogeneous datasets can be combined in this pre-training paradigm as long as they share at least two common modalities.
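The masking idea can be illustrated with a small sketch that lists which modality pairs of a sample may serve as positive pairs. The modality names, the dictionary layout, and the function name are hypothetical and only meant to show the principle.

```python
from itertools import combinations

# Hypothetical modality names, in a fixed order.
MODALITIES = ("rgb", "depth", "keypoints")


def valid_positive_pairs(available):
    """Return the modality pairs that can form positive pairs for one sample.

    available maps each modality name to True (present) or False (missing).
    Pairs that involve a missing modality are masked out.
    """
    present = [m for m in MODALITIES if available.get(m, False)]
    return list(combinations(present, 2))


# An NTU RGB+D-style sample has all modalities; an MPII-style sample has no depth.
ntu_sample = {"rgb": True, "depth": True, "keypoints": True}
mpii_sample = {"rgb": True, "depth": False, "keypoints": True}
```

Under these assumptions, the NTU RGB+D-style sample contributes three pairs, while the MPII-style sample contributes only the RGB and keypoints pair. A sample with fewer than two modalities contributes none, which matches the requirement of at least two common modalities.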
Datasets for Pre-train. For NTU RGB+D, we only use the version with 60 actions [shahroudy2016ntu]. From the provided RGB-D videos, we uniformly sample one frame from every 30 frames, which makes samples. The RGB and depth frames are calibrated using the correspondences provided by the 2D keypoint positions on the RGB frames and depth maps. For MPII and COCO, we use the full training sets for pre-training.
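The uniform frame sampling can be sketched as below. Taking the first frame of each 30-frame window is an assumption; the text only states that one frame is kept out of every 30. The function name is hypothetical.

```python
def sample_frame_indices(num_frames, stride=30):
    """Return the indices of the frames kept from a video of num_frames frames.

    One frame is kept per window of `stride` frames; here the first frame of
    each window is chosen (an assumption, the source does not specify which).
    """
    return list(range(0, num_frames, stride))
```

For a 95-frame clip this keeps frames 0, 30, 60 and 90.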
For the DensePose [guler2018densepose] estimation, we use the official open-source implementation (https://github.com/facebookresearch/detectron2). For the full training set, we train the network for iterations with a batch size of , a learning rate of on 4 NVIDIA V100 GPUs, which takes around hours to train. For the training set, we train the network for iterations with a learning rate of and other settings the same, which takes 9 hours to train. The training set is uniformly sampled from the default ordered training set.
For the RGB human parsing on Human3.6M [h36m_pami], we use the official HRNet [sun2019deep] semantic segmentation implementation (https://github.com/HRNet/HRNet-Semantic-Segmentation). Training subsets of different ratios are uniformly sampled from the default ordered full training set. For the full training set, we train the network for epochs with a learning rate of , a batch size of on 2 NVIDIA V100 GPUs. For other data-efficient settings, we train the network for epochs with other settings the same. We use the standard dataset split protocol, where the subjects are for training and the subjects and are for evaluation.
For the depth human parsing on NTURGBD-Parsing-4K, we use the same implementation as that of RGB human parsing. To encode depth maps with HRNet, we repeat the depth channel three times to fit the three-channel RGB input, which is also how HCMoCo deals with depth inputs. For all training settings, we train the network for epochs with a learning rate of , a batch size of on 2 NVIDIA V100 GPUs. Even though the encoder handles depth inputs, we still initialize it with ImageNet pre-training, since previous work [xiong2019a2j] suggests that this helps performance.
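Repeating the depth channel can be sketched with plain nested lists. The channel-first layout and the function name are assumptions; an array library would typically do this in a single call.

```python
def depth_to_three_channels(depth):
    """Copy a single-channel depth map into three identical channels.

    depth  : H x W nested list of depth values
    returns: 3 x H x W nested list (channel-first layout is an assumption)
    """
    return [[list(row) for row in depth] for _ in range(3)]
```

The result has the same shape as an RGB image, so an encoder built for RGB inputs can consume it without architectural changes.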
For the 3D pose estimation from depth maps on ITOP [haque2016towards], we adapt the official implementation (https://github.com/zhangboshen/A2J) of A2J [xiong2019a2j]. The original implementation uses ResNet as the backbone, which we replace with HRNet. Since the original implementation only provides validation scripts, we re-implement the whole training pipeline. We change the original normalization method, in which a global mean and variance are computed for a global normalization. Instead, we perform an online instance normalization in which we only center the depth pixels of each sample to zero mean but do not normalize their variance, since it is a better way to prevent over-fitting to the relatively small dataset. We train the network for epochs with a learning rate of and a batch size of on one NVIDIA V100 GPU. As for the dataset, we use the side view of ITOP since the depth maps in pre-training are side views. Following the official dataset split, there are samples for training and for testing. Following the practice of A2J [xiong2019a2j], we initialize the encoders using ImageNet pre-training.
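The per-sample centering described above can be sketched as follows. Computing the mean over every pixel of the depth map, rather than only over valid or foreground pixels, is an assumption, and the function name is hypothetical.

```python
def center_depth(depth):
    """Shift one depth map to zero mean without rescaling its variance.

    depth  : H x W nested list of depth values for a single sample
    returns: H x W nested list with the per-sample mean subtracted
    """
    flat = [v for row in depth for v in row]
    mean = sum(flat) / len(flat)
    # Only the offset changes; relative depth differences are preserved.
    return [[v - mean for v in row] for row in depth]
```

Because the variance is left untouched, the metric scale of the depth values is kept, which matters when regressing 3D joint positions.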
To experiment with cross-modality supervision, we choose the downstream task of human parsing on NTURGBD-Parsing-4K. The modalities to experiment with are RGB and depth. To make the comparison fair and help the networks converge faster, the backbones are initialized by CMC [tian2020contrastive] pre-training. The following descriptions are for the ‘RGB→Depth’ setting, where the source modality is RGB and the target modality is depth. To implement ‘Depth→RGB’, one can simply switch the source and target modalities. At training time, a randomly initialized segmentation head, which is the same one used for the human parsing experiments, is attached to the dense mapper network of RGB. Then the network is trained with both the hierarchical contrastive learning targets and a cross-entropy loss for the supervision of the segmentation. For the ‘No Contrastive’ baseline, we train with the segmentation loss only. As for the ‘CMC’ baseline, the network is supervised by both the learning target proposed by CMC [tian2020contrastive] and the segmentation loss. Note that, to better simulate the application scenario, the target modality of NTURGBD-Parsing-4K is never exposed during training, including the CMC pre-training. To build the connection between RGB and depth during training, we mix NTURGBD-Parsing-4K with NTU RGB+D, the same dataset used for our pre-training. At inference time, we attach the trained segmentation head to the mapper network of depth. Since the dense embeddings of RGB and depth are aligned by our hierarchical contrastive learning targets, the segmentation head can reasonably be expected to handle the dense embeddings of depth.
We also use human parsing on NTURGBD-Parsing-4K to experiment with our extension of missing-modality inference. The basic setup is the same as that of the cross-modality supervision experiments. At training time, we fuse the dense embeddings of RGB and depth at the feature level with a simple max pooling operation. The fused dense embedding is then passed to a segmentation head, which is the same one used in the human parsing experiments, to produce the segmentation prediction. The network is supervised with both the hierarchical contrastive learning targets and a cross-entropy loss for segmentation supervision. Similarly, the ‘No Contrastive’ baseline does not use any contrastive learning targets. The ‘CMC’ baseline uses the contrastive learning target proposed in CMC [tian2020contrastive]. At inference time, if one modality is missing, the dense embedding of the remaining one is passed to the trained segmentation head for prediction; for example, if RGB is missing, the dense embedding of depth is used. Since the dense embeddings of RGB and depth are aligned and the segmentation head is trained on the fusion of both embeddings, the network still produces reasonable predictions when one of them is missing.
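The max-pooling fusion and its fallback when a modality is missing can be sketched as below. Representing a dense embedding as a flat list and a missing modality as `None` are assumptions, as is the function name.

```python
def fuse_max(*embeddings):
    """Element-wise max over the embeddings that are present.

    Each embedding is a flat list of floats of equal length (a stand-in for a
    dense feature map); a missing modality is passed as None.
    """
    present = [e for e in embeddings if e is not None]
    if not present:
        raise ValueError("at least one modality is required")
    # With a single modality present, the result is that embedding unchanged.
    return [max(values) for values in zip(*present)]


rgb_embedding = [1.0, -2.0, 0.5]
depth_embedding = [0.0, 3.0, 0.5]
fused_both = fuse_max(rgb_embedding, depth_embedding)   # training-time fusion
fused_depth_only = fuse_max(None, depth_embedding)      # RGB missing at inference
```

The same head receives either the fused embedding or a single-modality embedding, which only makes sense if the two embedding spaces are aligned, as the contrastive targets are designed to ensure.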
DensePose Estimation. Due to the page limit, we could not report all metrics for the DensePose [guler2018densepose] estimation in the main paper, so we report them in this supplementary material. Tab. 9 lists detailed results of all settings mentioned in the main paper. Specifically, for the initialization of the network, we test with the network randomly initialized (‘From Scratch’) and the network initialized by ImageNet pre-training (‘IN Pre-train’). As for the ratio of training data, we test with the full training set and of the training set. As for the pre-training datasets, we test with two combinations: NTU RGB+D + MPII and NTU RGB+D + COCO. As for the backbone, we test with HRNet-W18 and HRNet-W32. Our method outperforms the baseline and two other state-of-the-art pre-training counterparts in most of the metrics. In particular, our method has advantages in GPS and GPSM, which are two critical metrics for DensePose quality. Additionally, we report full results of the ablation study. The detailed results further validate the analysis in the main paper.
RGB Human Parsing. We further report detailed RGB human parsing results on Human3.6M [h36m_pami] that could not fit into the main paper. As shown in Tab. 10, we report the per-class IoU for all the settings reported in the main paper. Similarly, for the initialization of the network, we test with the network randomly initialized (‘From Scratch’) and the network initialized by ImageNet pre-training (‘IN Pre-train’). As for the ratio of training data, we test with the full training set, , and of the training set. The pre-training datasets are NTU RGB+D + MPII. In most classes, our method outperforms the comparison methods. Moreover, we also report per-class IoU for the four settings in the ablation study, which are in line with our analysis in the main paper.
Depth Human Parsing. We report detailed depth human parsing results on NTURGBD-Parsing-4K. As shown in Tab. 11, we report the per-class IoU for all the settings reported in the main paper. We initialize the networks using ImageNet pre-training. Two ratios of the training set, i.e. full and , are tested. We also change the backbone to PointNet++ [qi2017pointnet++] (‘PN++’). Since it is a point-based backbone, the ‘background’ class is ignored and not included in the calculation of mIoU. The per-class IoU results also agree with the conclusion in the main paper that our method is superior to the other comparison methods.
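Excluding a class from mIoU can be sketched as follows. The label encoding (class 0 as background), the flat label lists, and the choice to skip classes that appear in neither prediction nor ground truth are assumptions for illustration, and the function name is hypothetical.

```python
def mean_iou(pred, target, num_classes, ignore_classes=()):
    """Mean intersection-over-union over classes, skipping ignored ones.

    pred, target   : flat lists of integer class labels of equal length
    ignore_classes : class ids left out of the average (e.g. background)
    Classes absent from both pred and target are skipped as well.
    """
    ious = []
    for c in range(num_classes):
        if c in ignore_classes:
            continue
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union == 0:
            continue
        ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 1, 1, 2]
target = [0, 1, 2, 2]
miou_all = mean_iou(pred, target, num_classes=3)
miou_no_bg = mean_iou(pred, target, num_classes=3, ignore_classes=(0,))
```

In this toy case the background class is predicted perfectly, so leaving it out lowers the average. This is why numbers with and without the background class are not directly comparable.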
Cross-Modality Supervision. As shown in Tab. 12, we report detailed per-class IoU for the experiments of cross-modality supervision. In both the ‘RGB→Depth’ and ‘Depth→RGB’ settings, our method outperforms the other baseline methods in all classes. In particular, the baseline methods barely make any correct predictions, while ours improves on them by a large margin.
Missing-Modality Inference. As shown in Tab. 12, we list detailed per-class IoU for the experiments of missing-modality inference. In both the ‘Only RGB’ and ‘Only Depth’ settings, our method outperforms the baseline methods in most classes. Therefore, the detailed results further validate the conclusions made in the main paper.
More qualitative results of RGB human parsing on Human3.6M [h36m_pami] and depth human parsing on NTURGBD-Parsing-4K are shown in Fig. 7, Fig. 8 and Fig. 9. We choose to visualize both the full training set and training set for RGB human parsing. The segmentation results produced by our pre-trained model are superior to those of the other comparison methods, especially in data-efficient settings. For challenging classes like hands and elbows, our method consistently produces correct predictions while other methods struggle. The depth map is a challenging modality for dense prediction tasks like semantic segmentation. Our method still manages to produce reasonable predictions that are better than those of the other comparison methods.
| Methods | Ratio | bg | right hip | right knee | right foot | left hip | left knee | left foot | left shoulder | left elbow | left hand | right shoulder | right elbow | right hand | crotch | right thigh | right calf | left thigh | left calf | lower spine | upper spine | head | left arm | left forearm | right arm | right forearm | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| From Scratch w/ PN++ | 20% | - | 21.43 | 30.84 | 67.53 | 21.52 | 29.85 | 66.76 | 32.82 | 23.17 | 38.37 | 36.74 | 23.42 | 36.23 | 25.19 | 55.25 | 60.25 | 55.11 | 60.37 | 53.34 | 65.85 | 88.22 | 50.77 | 47.05 | 52.70 | 45.88 | 45.36 |
| CMC* [tian2020contrastive] w/ PN++ | 20% | - | 24.12 | 32.89 | 73.11 | 23.84 | 32.67 | 73.20 | 33.43 | 27.55 | 44.62 | 38.40 | 27.59 | 42.11 | 26.55 | 57.82 | 65.60 | 57.73 | 65.02 | 54.53 | 66.31 | 89.16 | 55.04 | 51.00 | 56.66 | 50.82 | 48.74 |
| Ours* w/ PN++ | 20% | - | 23.96 | 32.90 | 73.30 | 24.16 | 32.44 | 73.10 | 34.81 | 29.54 | 45.43 | 37.79 | 28.16 | 42.89 | 27.83 | 58.25 | 66.16 | 58.64 | 65.51 | 55.60 | 66.92 | 89.51 | 56.33 | 53.05 | 57.87 | 52.29 | 49.43 |
| Methods | Setting | bg | right hip | right knee | right foot | left hip | left knee | left foot | left shoulder | left elbow | left hand | right shoulder | right elbow | right hand | crotch | right thigh | right calf | left thigh | left calf | lower spine | upper spine | head | left arm | left forearm | right arm | right forearm | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No Contrastive | Only RGB | 93.55 | 8.97 | 6.66 | 0.42 | 4.28 | 0.75 | 0.02 | 0.98 | 1.19 | 15.22 | 0.28 | 2.69 | 23.56 | 0.08 | 0.30 | 12.35 | 0.07 | 0.88 | 25.48 | 5.27 | 57.54 | 12.63 | 26.38 | 7.73 | 28.90 | 13.45 |
| CMC [tian2020contrastive] | Only RGB | 93.80 | 0.00 | 11.94 | 47.69 | 0.00 | 12.00 | 38.76 | 21.43 | 0.01 | 9.58 | 24.32 | 13.56 | 15.38 | 1.15 | 1.84 | 32.30 | 1.07 | 18.12 | 0.59 | 3.13 | 36.86 | 27.04 | 15.63 | 42.37 | 21.90 | 19.62 |
| No Contrastive | Only Depth | 96.46 | 27.81 | 7.16 | 1.46 | 33.36 | 10.60 | 2.30 | 28.64 | 4.72 | 1.27 | 11.49 | 0.11 | 7.86 | 26.16 | 33.67 | 39.55 | 33.21 | 27.40 | 46.05 | 63.16 | 47.50 | 25.38 | 9.99 | 19.89 | 5.04 | 24.41 |
| CMC [tian2020contrastive] | Only Depth | 94.81 | 7.25 | 1.08 | 0.06 | 6.82 | 0.05 | 0.10 | 25.47 | 3.11 | 2.95 | 20.80 | 0.39 | 2.73 | 26.79 | 17.11 | 17.84 | 33.96 | 4.58 | 10.64 | 11.32 | 37.01 | 41.41 | 11.08 | 33.69 | 3.52 | 16.58 |