RGB-based deep learning approaches such as [carreira2017quo, feichtenhofer2019slowfast, wu2021towards, kalfaoglu2020late, doughty2019pros, parmar2021piano, tang2020uncertainty, pan2019action] have shown impressive performance in human action recognition and performance assessment. However, as stated in [varol2021synthetic], the performance of such approaches drops significantly when they are applied to data from unseen viewpoints. A simple solution to this problem would be to train a network on data from multiple views [varol2021synthetic]; in practice, however, capturing a labelled multi-view dataset is cumbersome, and such datasets are rare. Two examples are the commonly used NTU [shahroudy2016ntu] and the recent health-related QMAR [sardari2020vi] datasets, where the granularity of the labelling is still coarse, at action-class level and overall performance-score level respectively. Ideally, a wholly view-invariant approach would be trained on data from as few views as possible and perform well on a single (unseen) view at inference time. We use the NTU and QMAR datasets in our downstream tasks to evaluate our proposed view-invariant pose representation method against this ideal.
Most current view-invariant action classification approaches are based on supervised learning, with both training and testing commonly carried out using skeleton data, such as [zhang2019view, zhang2020semantics, nie2021view, gedamu2021arbitrary, huang2021view]. Others, like [rahmani2018learning, sardari2019view, das2020vpn, das2019focus], train on both RGB and 3D joint annotations to facilitate testing on RGB images alone. However, all these approaches rely on a significant amount of annotations (and sometimes camera parameters) during training, which are expensive and difficult to provide in in-the-wild scenarios. Only very few works, such as [wang2018dividing, sardari2020vi, varol2021synthetic], deal with training from RGB images only. Moreover, to the best of our knowledge, there is no other unsupervised view-invariant RGB-only action classification or assessment study, and only one such work based on RGB and depth [li2018unsupervised].
In this paper, we propose a representation learning approach to disentangle canonical (view-invariant) 3D pose representations and view-dependent features from either an RGB-based 2D Densepose human representation map or a depth mask image, without using 3D skeleton annotations or camera parameters. We design an auto-encoder comprising two encoders and a decoder. The first encoder is a view-invariant 3D pose encoder that learns 3D canonical pose representations from an input image, and the second is a viewpoint encoder that extracts rotation and translation parameters, such that when these are applied to the canonical pose features, the result is a view-dependent 3D pose representation which is fed into the decoder to reconstruct the input image. To train the network, we impose geometrical and positional order consistency constraints on the pose representation features through novel view-invariance and equivariance losses respectively. The view-invariance loss is computed from the intrinsic view-invariant properties of pose features between simultaneous frames from different viewpoints, while the equivariance loss is computed using the equivariant properties between augmented frames from the same viewpoint. After training, the 3D canonical pose representations can be used for downstream tasks such as view-invariant action classification and human movement analysis. Fig. 1 shows the proposed view-invariant pose representation learning framework and its application to a view-invariant downstream task.
Our key contributions can be summarized as follows: (i) we propose a novel unsupervised method that learns view-invariant 3D pose representations from a 2D image without using 3D skeleton data or camera parameters. Our view-invariant features can be applied directly by downstream tasks to be resilient to human pose variations in unseen viewpoints, unlike unsupervised 3D pose estimation methods such as [rhodin2018unsupervised, chen2019unsupervised, chen2019weakly, tripathi2020posenet3d, honari2021unsupervised, dundar2021unsupervised] which obtain view-specific 3D pose features and require camera parameters and further steps to align their view-specific features in a canonical space; (ii) we introduce novel view-invariance and equivariance losses that compel the network to preserve the geometrical and positional order consistency of pose features - these losses can benefit the training process in other pretext tasks that exploit landmark representation; (iii) we evaluate the performance of the learned pose features on two downstream tasks that demand view-invariance, achieving state-of-the-art unsupervised cross-view action recognition accuracy on the standard NTU RGB+D benchmark dataset at 74.8% and 67.5% for RGB and depth images respectively, and, for the first time, we obtain unsupervised cross-view and cross-subject rank correlation results for human movement assessment scores on the QMAR dataset, exceeding its supervised state-of-the-art results; (iv) we perform ablation studies to explore the impact of our loss functions on our proposed model.
2 Related Works
We now consider the more recent related works that deal with unsupervised pose representation and view-invariance, in particular in relation to our chosen downstream tasks.
Unsupervised Pose Representation –
There are several recent examples of RGB-based unsupervised learning approaches to 3D pose estimation, such as [rhodin2018unsupervised, chen2019unsupervised, chen2019weakly, tripathi2020posenet3d, honari2021unsupervised, dundar2021unsupervised]. The works in [chen2019unsupervised, chen2019weakly, tripathi2020posenet3d] extract unsupervised pose features from 2D joints generated from RGB data. For example, Chen et al. [chen2019unsupervised] train a network through a 2D-3D consistency loss, computed after lifting 2D pose to 3D joints and reprojecting the 3D joints onto 2D. dundar2021unsupervised disentangle pose and appearance features from an RGB image by designing a self-supervised auto-encoder that reconstructs an input image into foreground and background, with the constraint that the appearance features remain consistent temporally while the pose features change. honari2021unsupervised also rely on temporal information and factorize the pose and appearance features in a contrastive learning manner. In [rhodin2018unsupervised], the authors design a network to encode 3D pose features by predicting a different viewpoint of the input image, but there is no restriction to generate the same pose representation for simultaneous frames. All these methods are view-specific and do not generate the same (i.e. canonical) 3D pose features for different viewpoints, so they cannot be applied to unseen-view downstream tasks, and camera parameters and extra steps are needed to map their view-specific output into a canonical view. Our proposed method learns view-invariant pose representations from the input image such that they can be applied directly to unseen-view tasks, such as action recognition. rhodin2018learning use both labelled and unlabelled data to estimate canonical 3D pose.
They train a network that maps multiple views into a canonical pose through mean square error, but as using only this constraint may generate random features without any positional order consistency, they also use a small subset of 3D pose annotations to enhance the output. However, our proposed approach achieves positional order consistency without utilizing any labels. Note, positional order consistency is important as it enables us to leverage temporal aspects of corresponding body joints to handle video-based downstream tasks.
Supervised View-Invariant Action Recognition and Performance Assessment – To deal with view-invariance, most action recognition methods are based on 3D skeleton joints, such as [rahmani2018learning, zhang2019view, li2019actional, ji2019attention, sardari2019view, zhang2020semantics, huang2021view, gedamu2021arbitrary]. For example, zhang2019view
present a dual-stream network, one LSTM and one CNN, and fuse the results to predict the action label. Both streams include a view adaptation network that estimates the transformation parameters of the skeleton data to a canonical view, followed by a classifier. In general, methods that rely on skeleton annotations require complete 3D joint representations, which are difficult to come by in in-the-wild scenarios. Recently, a few works have developed view-invariant action recognition or analysis approaches from RGB-D images, such as [varol2021synthetic, wang2018dividing, sardari2020vi, dhiman2020view]. varol2021synthetic deploy multi-view synthetic videos for training their network to perform action recognition on novel viewpoints, but still use 3D pose annotations to produce the synthetic data, while the newly generated videos would also have to be labelled by experts if they were to be used for specialist applications such as healthcare.
In view-invariant action performance assessment, we are aware of only one study where sardari2020vi investigated a supervised model to assess and score the quality of movement in subjects simulating Parkinson and Stroke symptoms by evaluating canonical spatio-temporal trajectories derived from body joint heatmaps.
Unsupervised View-Invariant Action Recognition – There are also relatively few unsupervised deep learning approaches that tackle view-invariant action recognition, e.g. [li2018unsupervised, cheng2021hierarchical]. For instance, Li et al. [li2018unsupervised] introduce an RGB-D based auto-encoder network that extracts unsupervised view-invariant spatio-temporal features from a video sequence. The proposed network is trained to reconstruct two simultaneous source and target view sequences from a given source view video; this method requires both RGB and depth data for training. cheng2021hierarchical propose a 3D skeleton-based unsupervised approach, using motion prediction as a pretraining task to learn temporal dependencies for long video representation with a transformer-based auto-encoder.
3 Proposed Method
Our aim is to learn view-invariant 3D pose representations from 2D RGB or depth images without relying on 3D skeleton annotations or camera parameters. Our method leverages the geometric transformations amongst different viewpoints and the equivariant property of human pose. The proposed auto-encoder includes a view-invariant pose encoder $E_p$, a viewpoint encoder $E_v$, and a decoder $D$, arranged as shown in Fig. 2. $E_p$ learns 3D canonical pose features from a given image, which can be either an RGB-based 2D Densepose human representation map [neverova2019correlated] or a depth mask image. As the extracted pose features are canonical, they are mapped into a specific viewpoint using the parameters obtained through $E_v$ before being passed to $D$, which reconstructs the input image. The network optimises four losses to generate its view-invariant representation.
Model Architecture and Formulation –
The view-invariant pose encoder $E_p$ learns 3D canonical pose features $P \in \mathbb{R}^{3 \times J}$ given an image $I$, where $J$ refers to the number of 3D pose features. $E_v$ estimates the viewpoint parameters, i.e. rotation $R \in \mathbb{R}^{3 \times 3}$ and translation $T \in \mathbb{R}^{3 \times 1}$. These viewpoint parameters are applied to the canonical pose features to transfer them into a specific viewpoint, such that $V = RP + T$, where $V \in \mathbb{R}^{3 \times J}$.
Then, the decoder reconstructs the input, $\hat{I} = D(V)$. The network's purpose is therefore to learn to extract the same canonical 3D pose features for simultaneous frames from different viewpoints, while maintaining equivariance of the pose features for augmented frames (shifted in position - see details under Equivariance Loss below) from the same viewpoint. We train the proposed network by combining four losses: a view-invariant loss $\mathcal{L}_{vi}$, an equivariance loss $\mathcal{L}_{eq}$, and two reconstruction losses $\mathcal{L}_{rec1}$ and $\mathcal{L}_{rec2}$.
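The mapping from canonical to view-specific pose features can be sketched as below. This is a minimal numpy illustration of the geometry only; the array names ($P$, $R$, $T$) and the toy values are illustrative assumptions, not the trained network's outputs:

```python
import numpy as np

def to_view_specific(P, R, T):
    """Map canonical 3D pose features P (3 x J) into a specific viewpoint
    using rotation R (3 x 3) and translation T (3 x 1): V = R @ P + T."""
    return R @ P + T

# Toy example: J = 4 canonical pose features, 90-degree rotation about z.
P = np.array([[1.0, 0.0, 0.5, -1.0],
              [0.0, 1.0, 0.5,  0.0],
              [0.0, 0.0, 1.0,  1.0]])
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
T = np.array([[0.1], [0.2], [0.0]])
V = to_view_specific(P, R, T)  # view-specific pose features, 3 x 4
```

In the full model, $V$ would then be decoded back into the input image, so the viewpoint information is forced through $R$ and $T$ rather than the canonical code.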
View-Invariant Loss – We start with two simultaneous frames $I^1_t$ and $I^2_t$ of the same scene, taken from different views of their corresponding video sequences at current frame $t$. These are passed to encoders $E_p$ and $E_v$ to extract the canonical 3D pose features $P^i_t$ and viewpoint parameters $(R^i, T^i_t)$, for $i \in \{1, 2\}$.
Each frame has a distinct translation parameter, while the rotation is the same for all frames of a sequence captured from the same viewpoint. Thus, if we estimate the rotation parameters from two random frames of the corresponding sequences and views instead, the network should still retrieve the view-specific pose features. We use this constraint to prevent the model leaking any pose information through $E_v$ and to force it to concentrate only on the viewpoint parameters. Hence, with a probability of 0.5, we randomly select the frame used to predict the rotation parameters for the two views, drawing on $U(0,1)$, a uniform distribution returning a number between 0 and 1. As we assume $E_p$ encodes the same canonical 3D pose features for $I^1_t$ and $I^2_t$, then after swapping their pose features while retaining their viewpoint features, the network must still be able to reconstruct them. Thus, the view-invariant loss is obtained by
$$\mathcal{L}_{vi} = \mathrm{MSE}\big(D(R^1 P^2_t + T^1_t),\, I^1_t\big) + \mathrm{MSE}\big(D(R^2 P^1_t + T^2_t),\, I^2_t\big),$$
where $\mathrm{MSE}(\cdot,\cdot)$ indicates the mean squared error.
However, computing only $\mathcal{L}_{vi}$ is not enough to learn view-invariant pose features: $D$ must still reconstruct the simultaneous frames even without swapping their canonical pose features, otherwise the network learns to assign arbitrary latent codes as canonical pose features. So we introduce $\mathcal{L}_{rec1}$ as
$$\mathcal{L}_{rec1} = \sum_{i=1}^{2} \mathrm{MSE}\big(\hat{I}^i_t,\, I^i_t\big),$$
where $\hat{I}^i_t = D(R^i P^i_t + T^i_t)$ for $i \in \{1, 2\}$.
Equivariance Loss – The effect of this loss is to help teach the network to preserve the positional order of the pose components. For example, if one dimension of the latent variable indicates the right shoulder of a subject, it should do so consistently for all images. We assume that the proposed network generates a consistent order of pose features, and that the $x$ and $y$ axes of the view-specific 3D pose space are aligned with the $x$ and $y$ directions of the 2D images, so when $I^1_t$ and $I^2_t$ shift by some pixels in the $x$ and $y$ directions, all components of the view-specific pose should shift similarly. Hence, we propose an equivariance loss computed from augmentations of $I^1_t$ and $I^2_t$, where the augmented images, $\bar{I}^1_t$ and $\bar{I}^2_t$, represent positional changes of the human subject in the scene, by $(u_1, v_1)$ and $(u_2, v_2)$ pixels respectively, i.e.
$$\mathcal{L}_{eq} = \sum_{i=1}^{2} \mathrm{MSE}\big(\bar{V}^i_t,\, V^i_t + \delta^i\big),$$
where $\bar{V}^i_t = R^i \bar{P}^i_t + T^i_t$ and $\delta^i = (u_i, v_i, 0)^{\top}$ for $i \in \{1, 2\}$.
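The positional-order constraint amounts to comparing the augmented frame's view-specific features against a shifted copy of the original's. A minimal sketch, assuming the first two rows of the feature array correspond to the image $x$ and $y$ directions (names and toy values are illustrative):

```python
import numpy as np

def equivariance_loss(V, V_aug, du, dv):
    """Equivariance constraint: shifting the subject by (du, dv) pixels
    should shift the x and y rows of the view-specific pose features V
    by the same amounts, leaving the depth row unchanged."""
    expected = V.copy()
    expected[0, :] += du   # x components follow the horizontal shift
    expected[1, :] += dv   # y components follow the vertical shift
    return float(np.mean((V_aug - expected) ** 2))

# Toy features for 2 pose components, and an ideally-shifted version.
V = np.array([[0.0, 1.0],
              [2.0, 3.0],
              [4.0, 5.0]])
V_shifted = V + np.array([[0.5], [0.25], [0.0]])  # ideal augmented features
loss_ok = equivariance_loss(V, V_shifted, 0.5, 0.25)
loss_bad = equivariance_loss(V, V_shifted, 1.0, 1.0)
```

A network whose augmented features match the shifted originals drives this loss to zero; any scrambling of the feature order makes it positive.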
$\mathcal{L}_{eq}$ is computed on the view-specific pose features, but we can also benefit from reconstructing the augmented frames to improve the pose representation, so we introduce $\mathcal{L}_{rec2}$ as
$$\mathcal{L}_{rec2} = \sum_{i=1}^{2} \mathrm{MSE}\big(D(\bar{V}^i_t),\, \bar{I}^i_t\big).$$
The total loss is computed as
$$\mathcal{L} = \lambda_{vi}\,\mathcal{L}_{vi} + \lambda_{eq}\,\mathcal{L}_{eq} + \lambda_{rec1}\,\mathcal{L}_{rec1} + \lambda_{rec2}\,\mathcal{L}_{rec2},$$
where the weights $\lambda_{vi}$, $\lambda_{eq}$, $\lambda_{rec1}$, and $\lambda_{rec2}$ are determined empirically.
After training the proposed network to learn the 3D canonical pose features, the view-invariant pose encoder $E_p$ is used for our example view-invariant downstream tasks. Next, in Section 4, we outline our proposed method to model the temporal aspects of the canonical pose features for each downstream task.
Datasets – Capturing multi-view datasets requires elaborate set-ups and is inevitably time-consuming and potentially quite expensive. Hence, very few multi-view action classification or movement assessment datasets exist. We present results on two existing datasets, NTU RGB+D and QMAR.
NTU RGB+D [shahroudy2016ntu] is the main benchmark dataset for view-invariant action recognition, including 60 action classes performed by 40 subjects. NTU (for short) contains 17 different environmental settings captured by three cameras from three viewpoints. Two standard protocols are used to evaluate the performance of view-invariant action recognition methods on NTU, cross-view (CV) and cross-subject (CS). In CV, different views are adopted for training and testing, while in CS different subjects are engaged for training and testing. We followed both protocols by using the same training and testing sets as in [shahroudy2016ntu] for both pretext and downstream tasks.
QMAR [sardari2020vi] is the only RGB+D multi-view dataset known to the authors for quality of movement assessment. It comprises 38 subjects captured from 6 different views, three frontal and three side. The subjects were trained by an expert to simulate four Parkinson's and Stroke movement tests: walking with Parkinson's (W-P), walking with Stroke (W-S), sit-to-stand with Parkinson's (SS-P), and sit-to-stand with Stroke (SS-S), and the movements were annotated with scores reflecting the severity of the abnormality. For evaluation, we followed [sardari2020vi, parmar2019and] and obtained Spearman's rank correlation (SRC) results under the CV and CS protocols. For CS, we used the same training and testing sets as in [sardari2020vi], and for CV, the data from one frontal and one side view were used for training while the remaining viewpoints were used for inference (see Supplementary Materials for details).
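SRC is the Pearson correlation computed on the ranks of the two score lists. A minimal pure-Python version (in practice `scipy.stats.spearmanr` is the usual choice); the function name here is our own:

```python
def spearman_rank_correlation(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks,
    with average ranks assigned to tied values."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            # Extend the group while values are tied.
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average 1-based rank for the tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

A perfectly monotonic relationship between predicted and ground-truth scores gives an SRC of 1.0, which is why SRC is preferred over accuracy for ordinal quality scores.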
Implementation Details and Hyper-Parameter Settings – Our auto-encoder is inspired by the U-Net encoder/decoder [ronneberger2015u, esser2018variational, rhodin2018unsupervised, dorkenwald2020unsupervised]. The U-Net is a convolutional or spatial latent auto-encoder with skip connections between the encoder and the decoder parts, whereas we require a dense one [baur2018deep] without the skip connections to encode the 3D pose features, so we adapted it to our problem. Table 1 details the architecture. We trained with Adam [kingma2014adam] with a fixed learning rate of 0.0002 and batch size 5. During training, we applied random horizontal flipping for data augmentation. The depth mask images of NTU used in our experiments contain the bounding boxes of subjects as released by [shahroudy2016ntu].
Table 1: The layer configuration of our auto-encoder (the layers use max pooling and FC layers with the stated numbers of outputs).
To select the 3D canonical pose feature size $J$, we used cross-validation and evaluated the total loss in Eq. 8 for $J$ in the range 40 to 190 with a step-size of 30. The lower bound was inspired by motion capture systems that use 39 markers, and the upper bound was selected based on the latent code size used by rhodin2018unsupervised. As shown in Table 2, the average cross-validation results on the NTU dataset for both the CV and CS protocols indicate the best-performing value of $J$, at which our 3D canonical pose feature size is set.
Action Classification –
Our proposed auto-encoder can learn unsupervised 3D pose representations without using any action labels. To encapsulate the temporal element of the action recognition downstream task, we added a two-layer bidirectional gated recurrent unit (GRU) followed by one FC layer after our view-invariant pose encoder, and trained it on fixed-size 16-frame input sequences with the cross-entropy loss function. Similar to [zolfaghari2017chained], we subsampled each sequence by dividing it into 16 segments and selecting one random frame from each segment.
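The segment-based subsampling above can be sketched as follows; the function name and seeding are our own illustrative choices:

```python
import random

def subsample_frames(num_frames, num_segments=16, seed=0):
    """Divide a sequence of num_frames into num_segments equal parts and
    pick one random frame index from each part, yielding a fixed-size
    input for the recurrent head (cf. [zolfaghari2017chained])."""
    rng = random.Random(seed)
    bounds = [round(i * num_frames / num_segments)
              for i in range(num_segments + 1)]
    # max(...) guards against empty segments in very short sequences.
    return [rng.randrange(bounds[i], max(bounds[i] + 1, bounds[i + 1]))
            for i in range(num_segments)]
```

This keeps temporal order while covering the whole sequence, so long and short actions both map to the same 16-frame input.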
All the representation learning methods on NTU compared to ours here can operate on either RGB or depth data for training and inference, except li2018unsupervised, which requires both RGB and depth during training. Providing like-for-like evaluations against these methods is difficult since each technique's formulation dictates the nature of its backbone architecture; for example, [vyas2020multiview] extracts spatio-temporal features with 3D CNNs, whereas we learn pose representations with an integrally 2D design. In the case of [li2018unsupervised], which applies a 2D ResNet with an added ConvLSTM [shi2015convolutional], we provide results with the closest possible backbone, comprising a 2D ResNet and an LSTM.
| Method | Venue | Backbone | Input | Sup. scratch CV (%) | Sup. scratch CS (%) | Sup. fine-tuned CV (%) | Sup. fine-tuned CS (%) | Unsup. CV (%) | Unsup. CS (%) |
|---|---|---|---|---|---|---|---|---|---|
| Shuffle & Learn [misra2016shuffle] | ECCV 2016 | AlexNet | Depth | - | - | - | - | 40.9 | 46.2 |
| Luo et al. [luo2017unsupervised] | CVPR 2017 | VGG + ConvLSTM | Depth | - | - | - | - | 53.2 | 61.4 |
| Vyas et al. [vyas2020multiview] ✓ | ECCV 2020 | 3D CNN + LSTM | Depth | - | - | 78.7 | 71.8 | - | - |
| Li et al. [li2018unsupervised] ✓ | NeurIPS 2018 | 2D ResNet + ConvLSTM | Depth | 37.7 | 42.3 | 63.9 | 68.1 | 53.9 | 60.8 |
| Ours ✓ | | 2D ResNet + LSTM | Depth | 60.4 | 63.1 | 75.5 | 72.7 | 58.3 | 58.0 |
| Ours ✓ | | 2D CNN + GRU | Depth | 76.7 | 75.9 | 82.5 | 78.8 | 67.5 | 64.7 |
| Luo et al. [luo2017unsupervised] | CVPR 2017 | VGG + ConvLSTM | RGB | - | - | - | - | - | 56.0 |
| Vyas et al. [vyas2020multiview] ✓ | ECCV 2020 | 3D CNN + LSTM | RGB | - | - | 86.3 | 82.3 | - | - |
| Li et al. [li2018unsupervised] ✓ | NeurIPS 2018 | 2D ResNet + ConvLSTM | RGB | 29.2 | 36.6 | 49.3 | 55.5 | 40.7 | 48.9 |
| Ours ✓ | | 2D ResNet + LSTM | RGB | 66.5 | 66.7 | 78.2 | 73.8 | 62.1 | 63.0 |
| Ours ✓ | | 2D CNN + GRU | RGB | 77.0 | 70.3 | 83.6 | 78.1 | 74.8 | 68.3 |
Table 3 shows that for the unsupervised scenario, our 2D CNN and GRU backbone significantly improves on the state-of-the-art across the CV and CS tests, at 74.8% and 68.3% for RGB, and 67.5% and 64.7% for depth data, respectively. The 2D ResNet + LSTM incarnation of our method also exceeds the state-of-the-art unsupervised results on NTU across the board, for example achieving 62.1% in almost direct comparison to [li2018unsupervised]'s 40.7% for cross-view RGB inference.
For the supervised learning case, we improve on all other works with depth data, whether training from scratch or fine-tuning our network, with best results at 82.5% and 78.8% on the CV and CS protocols respectively, and attain very competitive results using RGB in comparison to the 3D CNN-based [vyas2020multiview].
In Table 4, we report the results of recent state-of-the-art unsupervised pose representation methods that operate on 3D skeleton data. yao2021recurrent perform better than our method in CV mode, and cheng2021hierarchical's result is marginally better than ours in CS mode. These results confirm our approach as a viable alternative to skeleton-based methods, which are altogether more cumbersome to deal with in real-world applications than RGB or depth derived data.
| Method | Venue | Backbone | CV (%) | CS (%) |
|---|---|---|---|---|
| lin2020ms2l | ACM Multimedia 2020 | GRU | - | 52.5 |
| yao2021recurrent | ICME 2021 | GRU + GCN | 79.2 | 54.4 |
| cheng2021hierarchical ✓ | ICME 2021 | Transformer | 72.8 | 69.3 |
| rao2021augmented ✓ | Information Sciences 2021 | LSTM | 64.8 | 58.5 |
Ablation Study – We ablate our losses to examine their impact on the learning of our pose features. Table 5 shows the unsupervised action classification accuracy on NTU as we drop either or both of the view-invariant and equivariance losses.
Table 5 shows that removing the equivariance loss from the training process causes our results for both CV and CS, in both RGB and depth, to deteriorate. This verifies that positional order consistency is essential in both cases. We also observe that eliminating the view-invariant loss causes our method's performance to drop in all cases, except for the cross-subject case with depth as the input modality. The increase in performance in this scenario may be attributed to the removal of the extra geometrical constraints that the simultaneous frames impose on the features through the view-invariant loss computation.
Human Movement Analysis – Here, we study the efficiency of the representation learned on NTU for quality-of-movement scoring on QMAR. As in the action recognition task, we added a two-layer bidirectional GRU followed by one FC layer on top of the view-invariant pose encoder to deal with temporal analysis. The size of the FC layer is equal to the maximum score for a movement type. However, for movement quality assessment we need to analyse every single frame of a sequence, so we cannot apply any subsampling strategy to this task. We followed [sardari2020vi] and divided each video sequence into non-overlapping 16-frame clips. Our network was trained on a random 16-frame clip with the cross-entropy loss function, and for inference, all 16-frame clips of a video sequence were processed. The score for a sequence was estimated by averaging the outputs of the last FC layer, as in [sardari2020vi]. Our work offers the first unsupervised results on QMAR. sardari2020vi, who introduced QMAR, present the only other supervised view-invariant results on this dataset. We also show the performance of two other architectures taken from [sardari2020vi].
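The clip-splitting and score-averaging procedure above can be sketched as follows; the function names are illustrative, and the per-clip outputs would come from the FC layer in practice:

```python
def non_overlapping_clips(num_frames, clip_len=16):
    """Frame indices of consecutive non-overlapping clips; trailing
    frames that cannot fill a whole clip are dropped."""
    return [list(range(s, s + clip_len))
            for s in range(0, num_frames - clip_len + 1, clip_len)]

def sequence_score(clip_outputs):
    """Average the per-clip FC outputs over all clips of a sequence and
    take the arg-max as the predicted movement-quality score."""
    n = len(clip_outputs)
    num_scores = len(clip_outputs[0])
    avg = [sum(c[k] for c in clip_outputs) / n for k in range(num_scores)]
    return max(range(num_scores), key=avg.__getitem__)
```

Averaging logits over all clips lets every frame contribute to the sequence-level score, which matters when the abnormality is only visible in parts of the movement.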
Table 6 shows our unsupervised human movement analysis results for the CV and CS protocols on QMAR. These are already broadly competitive with the supervised results in Table 6, particularly when compared against the supervised, Kinetics-400 [kay2017kinetics] pretrained, deep I3D network. Finally, with the supervised version of our method, where we fine-tune our network weights after transferring the weights learnt on NTU, our performance exceeds sardari2020vi on average for both CV and CS.
| Protocol | Supervision | Method | Training | W-P | W-S | SS-P | SS-S | Avg |
|---|---|---|---|---|---|---|---|---|
| CV | Supervised | C3D (after [parmar2019and]) | custom-trained | 0.65 | 0.37 | 0.21 | 0.45 | 0.42 |
| CV | Supervised | VI-Net [sardari2020vi] ✓ | scratch | 0.92 | 0.81 | 0.46 | 0.61 | 0.70 |
| CS | Supervised | C3D (after [parmar2019and]) | custom-trained | 0.50 | 0.37 | 0.25 | 0.54 | 0.41 |
| CS | Supervised | VI-Net [sardari2020vi] ✓ | scratch | 0.87 | 0.52 | 0.58 | 0.69 | 0.66 |
Most current view-invariant action recognition and performance assessment approaches are based on supervised learning and rely on a large number of 3D skeleton annotations. In this paper, we addressed these issues with an unsupervised method that learns view-invariant 3D pose representations from a 2D image. Our experiments show that not only can our learned pose representations be applied to unseen-view videos from the same training data, but they can also be used in different domains. Our unsupervised approach is particularly helpful in applications where the use of multi-view data is essential and capturing 3D skeletons is challenging, e.g. in healthcare rehabilitation monitoring at home or in the clinic.
In the pretext stage of our model, we require synchronised multi-view frames to learn view-invariant 3D pose representations. For future work, we will investigate extracting view-invariant pose features from a single view or from non-synchronised frames, simplifying training and allowing application to any suitable dataset.