Human-Centered Prior-Guided and Task-Dependent Multi-Task Representation Learning for Action Recognition Pre-Training

Recently, much progress has been made in self-supervised action recognition. Most existing approaches emphasize the contrastive relations among videos, including appearance and motion consistency. However, two main issues remain for existing pre-training methods: 1) the learned representation is neutral and not informative for a specific task; 2) multi-task learning-based pre-training sometimes leads to sub-optimal solutions due to inconsistent domains of different tasks. To address the above issues, we propose a novel action recognition pre-training framework, which exploits human-centered prior knowledge to generate more informative representations and avoids the conflict between multiple tasks by using task-dependent representations. Specifically, we distill knowledge from a human parsing model to enrich the semantic capability of the representation. In addition, we combine knowledge distillation with contrastive learning to constitute a task-dependent multi-task framework. We achieve state-of-the-art performance on two popular benchmarks for the action recognition task, i.e., UCF101 and HMDB51, verifying the effectiveness of our method.

1 Introduction

Action recognition is a hot topic in the computer vision community, with many practical applications such as intelligent surveillance, human-computer interaction, and behavior analysis [12]. The critical challenge of video-based action recognition is to model the complex spatial-temporal information in videos, which is more difficult than understanding static images [4]. Early works usually follow a two-stream paradigm [26, 7, 8, 34] or exploit 3D convolutional neural networks (3D CNNs) [14, 30, 3, 31] to explore the visual appearances and temporal dynamics. However, these supervised methods require great annotation effort [33]. Recently, many works have focused on action recognition pre-training with self-supervised learning frameworks that do not use categorical labels. These approaches usually rely on the design of pretext tasks for representation learning, such as temporal shuffling [19], future frame prediction [1], video-based space-time cubic puzzle completion [16], and contrastive learning-based tasks [10, 28, 4].

Figure 1: The two rows show examples of human-centered actions and the corresponding segmentation maps from human parsing models, respectively. The human parsing prior provides useful knowledge for action recognition tasks.

Although great effort has been made for video representation learning, two main issues remain for existing pre-training methods. First, the commonly used contrastive learning pre-training paradigm emphasizes instance-level similarity [39], which can hardly capture the abundant semantic information in videos, resulting in neutral and less informative representations. This further causes severe performance degradation on downstream tasks whose objectives differ greatly from those of the pre-training stage. Second, many multi-task learning-based pre-training frameworks fail to consider the potential conflicts among the different and inconsistent objectives of multiple tasks, leading to sub-optimal solutions [15, 25].

To address the above issues, in this paper we propose a novel prior-guided and task-dependent multi-task representation learning framework for video-based action recognition pre-training. First, we incorporate the human-centered prior by distilling informative knowledge from a human parsing teacher model with an encoder-decoder network. This is based on the intuition that human parsing knowledge reflects human actions and is therefore well aligned with the downstream action recognition objectives; an example is shown in Fig. 1. In addition, we combine contrastive learning with both appearance and motion consistency into a multi-task learning framework. To avoid potential conflicts between the multiple tasks, task-dependent models are employed, and the generated task-dependent representations are combined for downstream tasks. The overall framework of our proposed method is illustrated in Fig. 2. We conduct extensive experiments for action recognition on the UCF101 and HMDB51 datasets and achieve state-of-the-art (SOTA) performance, verifying the effectiveness of our proposed method.

The main contributions of this paper are summarized as follows: 1) we present a novel framework that incorporates the human-centered prior for representation learning via knowledge distillation (KD) from a human parsing teacher model; 2) we employ a multi-task learning framework with a task-dependent representation learning strategy; 3) we conduct experiments on the action recognition task, demonstrating the effectiveness of the proposed multi-task learning framework for video representation learning.

2 Related Work

Self-Supervised Learning for Visual Representation. Self-supervised learning aims at learning discriminative representations by leveraging information from unlabeled data. Most works have explored self-supervised visual representation learning through the design of pretext tasks, such as image inpainting [24], permutation [23], predicting jigsaw puzzles [17], and contrastive learning strategies. Recently, the extension from image to video representation learning has become increasingly popular due to the richer temporal information of videos. Like representation learning for images, self-supervised video representation learning also focuses on the design of pretext tasks, extended to consider temporal consistency in video clips. For example, [9, 19, 36] attempt to shuffle the frame and clip order along the temporal dimension; [10] proposes a pretext task to predict future frames; [16] learns video features by designing space-time cubic puzzles; and [2] proposes SpeedNet to predict the motion speed in videos. In addition, contrastive learning is widely used in the design of pretext tasks. Specifically, [10] proposes a dense predictive coding method with a contrastive loss for video frames; [28] exploits contrastive multi-view video coding, inspired by [29] for image coding; and [4] proposes a relative contrastive speed perception task to learn the motion information of videos. However, the representation learned by self-supervised training may not be well aligned with the objectives of downstream tasks, leading to severe performance degradation.

Video-Based Action Recognition. Some works on action recognition follow a two-stream architecture to model appearance and temporal information separately. For example, [26] exploits both a spatial 2D CNN and a temporal CNN to extract spatial and motion information from RGB frames and optical flow. [21] utilizes a shift module to better capture temporal information. [6] proposes a slow-fast network to extract spatial semantics and temporal motion information from video clips. However, the 2D CNN-based methods adopted in the two-stream architecture have limited ability to capture the dynamics of visual tempos [37]. To address this issue, some recent works on action recognition are based on 3D CNNs and their variants, which extract appearance and temporal information jointly. Specifically, [30] first employs 3D convolutions on adjacent frames to model the spatial and temporal information of video. [3] inflates pre-trained 2D convolutions into 3D convolution kernels. [31] decomposes 3D convolution kernels into a 2D+1D paradigm to improve the performance of 3D CNNs. [35] proposes a self-attention mechanism to model long-range temporal dynamics of videos. Despite the great progress made in recent works, how to generate more informative representations with a self-supervised pre-training framework for action recognition remains underexplored and needs further study.

Figure 2: Overview of our proposed framework. Video clips are first augmented and then sent to a shared low-level encoder to generate common low-level feature maps across multiple tasks. The low-level feature maps are then sent to two distinct tasks, namely the human-centered prior knowledge distillation task and the video contrastive learning task, following a multi-task learning framework. The knowledge distillation branch is supervised by a pre-trained human parsing teacher model, while the contrastive learning branch combines both motion and appearance contrastive relationships. Task-dependent embeddings are learned from two distinct high-level encoders and concatenated as the final representation.

3 Proposed Method

In this section, we present our prior-guided and task-dependent multi-task representation learning framework for video-based action recognition pre-training. We first describe how to incorporate the human-centered prior information by distilling informative knowledge from a human parsing teacher model with an encoder-decoder network, enriching the semantic capability of the representation. Then, we combine knowledge distillation (KD) with video contrastive learning into a multi-task learning framework to boost discrimination with the fused task-dependent representations.

3.1 Human-Centered Prior Knowledge Distillation

Since most action recognition videos are related to human actions, we incorporate the human-centered prior into representation learning with a pre-trained human parsing guided teacher network. Denote $E_l$ and $E_h^{p}$ as the low-level encoder and the high-level encoder of the human prior representation module, respectively, where the low-level encoder $E_l$ shares its weights across multiple tasks. Given the input video clip $x$, the human prior representation $z_p$ can be obtained by

$$z_p = E_h^{p}\big(E_l(x)\big). \tag{1}$$

To guide the learning of the human prior representation, we employ a pre-trained human parsing network $T$ as the teacher model. Given the middle frame $x_m$ of the input video clip $x$, the teacher model can generate the human parsing segmentation feature map $S^{t}$ as follows,

$$S^{t} = T(x_m). \tag{2}$$

In the meanwhile, the prior representation $z_p$ is fed into a decoder $D$ to generate a parsing segmentation $S^{s}$ as follows,

$$S^{s} = D(z_p). \tag{3}$$

We assume both $S^{t}$ and $S^{s}$ are normalized feature maps after the softmax operation. $S^{s}$ is supervised by the teacher model with the KL-divergence loss as follows,

$$\mathcal{L}_{kd} = \frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} S^{t}_{i,c}\,\log\frac{S^{t}_{i,c}}{S^{s}_{i,c}}, \tag{4}$$

where $i$ represents the spatial index, $c$ represents the segmentation class index, $C$ is the number of segmentation classes, and $N$ is the number of spatial positions. With human parsing guided learning, the prior representation $z_p$ should contain rich semantic information related to human-centered actions.
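As a concrete illustration, the distillation objective of Eq. (4) can be written as a short routine. This is a minimal sketch assuming PyTorch; the function and tensor names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def parsing_kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL divergence between teacher and student parsing maps.

    student_logits, teacher_logits: (B, C, H, W) raw scores over C parsing classes.
    """
    # Softmax-normalize over the class dimension, as assumed in Eq. (4).
    log_p_student = F.log_softmax(student_logits, dim=1)
    p_teacher = F.softmax(teacher_logits, dim=1)
    # KL(teacher || student) per spatial position, then averaged over batch and positions.
    kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=1)  # (B, H, W)
    return kl.mean()
```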

3.2 Task-Dependent Multi-Task Learning

To learn discriminative representations among videos, inspired by [4], we also combine a video contrastive learning module into the multi-task learning framework. From the output of the low-level encoder $E_l$, we stack another high-level encoder $E_h^{c}$, which is different from the encoder $E_h^{p}$ used in the human prior module, to generate the contrastive representation $z_c$ as follows,

$$z_c = E_h^{c}\big(E_l(x)\big). \tag{5}$$

Two more projection heads, i.e., $\phi_m$ and $\phi_a$, are further employed to learn motion contrastive and appearance contrastive embeddings with the following margin ranking loss and InfoNCE loss [11],

$$\mathcal{L}_{m} = \max\big(0,\ \gamma + d^{+} - d^{-}\big), \qquad \mathcal{L}_{a} = -\log\frac{\exp\!\big(\mathrm{sim}(z_a, z_a^{+})/\tau\big)}{\exp\!\big(\mathrm{sim}(z_a, z_a^{+})/\tau\big) + \sum_{j}\exp\!\big(\mathrm{sim}(z_a, z_a^{j})/\tau\big)}, \tag{6}$$

where $d^{+}$ and $d^{-}$ represent the distance between the anchor motion embedding and the embedding with the same/different playback speed [4], $\mathrm{sim}(z_a, z_a^{+})$ and $\mathrm{sim}(z_a, z_a^{j})$ represent the similarity between the anchor appearance embedding and the embedding from the same/different video clips, and $\gamma$ and $\tau$ denote the margin and the temperature, respectively.
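The two contrastive terms of Eq. (6) can be sketched as follows, again assuming PyTorch. The margin value, temperature, and the specific distance/similarity functions are assumptions chosen for illustration (the paper cites [4] and [11] for these choices).

```python
import torch
import torch.nn.functional as F

def motion_margin_loss(anchor, positive, negative, margin=0.5):
    """Margin ranking loss: the anchor should be closer to the embedding with the
    same playback speed than to the one with a different speed (margin is assumed)."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)   # distance to same-speed embedding
    d_neg = (anchor - negative).pow(2).sum(dim=1)   # distance to different-speed embedding
    return F.relu(d_pos - d_neg + margin).mean()

def appearance_infonce_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE loss: the anchor should be most similar to the clip from the same video.

    anchor, positive: (B, D); negatives: (B, K, D) embeddings from other videos.
    """
    anchor = F.normalize(anchor, dim=1)
    positive = F.normalize(positive, dim=1)
    negatives = F.normalize(negatives, dim=2)
    l_pos = (anchor * positive).sum(dim=1, keepdim=True)           # (B, 1)
    l_neg = torch.bmm(negatives, anchor.unsqueeze(2)).squeeze(2)   # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)                         # positive is index 0
```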

With the human prior knowledge distillation module and the video contrastive module, we form a multi-task learning framework with the following total loss,

$$\mathcal{L} = \lambda_{kd}\mathcal{L}_{kd} + \lambda_{m}\mathcal{L}_{m} + \lambda_{a}\mathcal{L}_{a}, \tag{7}$$

where $\lambda_{kd}$, $\lambda_{m}$ and $\lambda_{a}$ are the weights of the individual losses. Instead of the widely adopted multi-task learning approach in which different tasks share a common representation, we use the task-dependent and uncorrelated representations $z_p$ and $z_c$ to learn different semantic information from videos, which better avoids conflicts across different tasks. We use the concatenated representation $[z_p, z_c]$ as the final representation of each video clip.
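To make the task-dependent design concrete, the sketch below wires the shared low-level encoder, the two high-level encoders, and the total loss of Eq. (7) together. Module names, weight values, and the flatten/concatenate details are hypothetical placeholders, assuming PyTorch.

```python
import torch
import torch.nn as nn

class TaskDependentModel(nn.Module):
    """Shared low-level encoder feeding two task-dependent high-level encoders."""

    def __init__(self, low_encoder, prior_encoder, contrast_encoder, parsing_decoder):
        super().__init__()
        self.low_encoder = low_encoder            # E_l, shared across both tasks
        self.prior_encoder = prior_encoder        # E_h^p, high-level encoder for the KD task
        self.contrast_encoder = contrast_encoder  # E_h^c, high-level encoder for contrastive learning
        self.parsing_decoder = parsing_decoder    # D, decodes z_p into a parsing map

    def forward(self, clip):
        feat = self.low_encoder(clip)             # common low-level feature maps
        z_p = self.prior_encoder(feat)            # task-dependent prior representation
        z_c = self.contrast_encoder(feat)         # task-dependent contrastive representation
        parsing_logits = self.parsing_decoder(z_p)
        # Concatenated representation used for downstream tasks.
        z_final = torch.cat([z_p.flatten(1), z_c.flatten(1)], dim=1)
        return parsing_logits, z_c, z_final

def total_loss(l_kd, l_motion, l_appearance, w_kd=1.0, w_m=1.0, w_a=1.0):
    """Weighted sum of Eq. (7); the weight values here are placeholders."""
    return w_kd * l_kd + w_m * l_motion + w_a * l_appearance
```

Keeping `prior_encoder` and `contrast_encoder` as separate modules is what makes the representations task-dependent: only the low-level encoder is shared, so the two objectives do not compete for the same high-level features.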

4 Experiment

4.1 Implementation Details

We adopt the Kinetics-400 [3] dataset for self-supervised pre-training. We employ the SOTA human parsing model SCHP [20] as the pre-trained teacher model for human-centered prior knowledge distillation. After the pre-training process, we finetune the model on the UCF101 [27] and HMDB51 [18] datasets for the downstream action recognition task. We evaluate the performance based on top-1 and top-5 accuracy (Acc@1 and Acc@5, respectively). More implementation details are provided in the Supplementary Material.
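For reference, a typical top-1/top-5 accuracy computation looks like the following. This is a generic sketch in PyTorch, not the authors' evaluation script.

```python
import torch

def topk_accuracy(logits, labels, ks=(1, 5)):
    """logits: (N, num_classes) classifier outputs; labels: (N,) ground-truth classes.

    Returns the accuracy for each k in ks."""
    max_k = max(ks)
    _, pred = logits.topk(max_k, dim=1)        # (N, max_k) top predicted class indices
    correct = pred.eq(labels.unsqueeze(1))     # (N, max_k) boolean matches
    return [correct[:, :k].any(dim=1).float().mean().item() for k in ks]
```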

Method                UCF101 Acc@1   UCF101 Acc@5   HMDB51 Acc@1   HMDB51 Acc@5
C3D Architecture
VCP [22]              68.5           -              32.5           -
MAS [32]              61.2           -              33.4           -
RTT [13]              69.9           -              39.6           -
RSPNet [4]            76.7           -              44.6           -
RSPNet* [4]           77.6           93.7           45.4           75.7
Ours                  80.4           95.7           46.1           78.4
R(2+1)D Architecture
VCP [22]              66.3           -              32.2           -
PSP [5]               74.8           -              36.8           -
ClipOrder [36]        72.4           -              30.9           -
PRP [38]              72.4           -              35.0           -
Pace [33]             77.1           -              36.6           -
RSPNet [4]            81.1           -              41.8           -
RSPNet* [4]           79.4           94.3           43.0           74.5
Ours                  81.6           95.3           46.1           74.8
Table 1: Top-1 and top-5 accuracy on the UCF101 and HMDB51 datasets compared with SOTA methods. RSPNet* denotes the 100-epoch finetuning result (see Section 4.2). Best performance is marked in bold for each type of architecture.
Method                UCF101 Acc@1   UCF101 Acc@5   HMDB51 Acc@1   HMDB51 Acc@5
C3D Architecture
w/o KD                77.6           93.7           45.4           75.7
TI                    79.3           92.1           44.8           67.5
Full model            80.4           95.7           46.1           78.4
R(2+1)D Architecture
w/o KD                79.4           94.3           43.0           74.5
TI                    80.1           94.3           44.6           74.4
Full model            81.6           95.3           46.1           74.8
Table 2: Ablation studies on the action recognition task on the UCF101 and HMDB51 datasets. Best performance is marked in bold for each type of architecture.
Figure 3: Examples of visualization results. The first and second rows show RSPNet [4] and our proposed method, respectively. As expected, our model focuses more on the human actions than on background regions.

4.2 Evaluation on Video Action Recognition Task

Comparison with SOTA Methods. We compare our method with other SOTA methods on the UCF101 and HMDB51 datasets. We report top-1 and top-5 accuracy with the C3D [30] and R(2+1)D [31] architectures in Table 1. RSPNet* denotes that a 100-epoch finetuning is used instead of the result reported in the original paper. Our method outperforms all the other SOTA methods by a large margin, verifying its effectiveness.

Ablation Study. We conduct ablation studies on the effectiveness of each component of our proposed method on the UCF101 and HMDB51 datasets, with the results shown in Table 2. We denote the removal of the knowledge distillation module as “w/o KD”, and the use of a task-independent (shared) representation across multiple tasks as “TI”. Both variants show a significant degradation compared with the full model, further demonstrating the effectiveness of each individual component of our proposed method.

Visualization. To better reveal the effectiveness of our framework, we show some visualization examples of the regions of interest using the class-activation map (CAM) technique [40] in Fig. 3. The produced heatmap is overlaid on the original video frames for each example in the figure. The first row shows the results of RSPNet [4], while the second row shows the results of our proposed method. Our method concentrates more on the human action, while RSPNet is distracted by uncorrelated background regions. The visualization results demonstrate the validity of our motivation of incorporating human-centered prior knowledge into multi-task pre-training to learn discriminative representations. More visualization examples are included in the Supplementary Material.
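For context, a CAM heatmap of this kind is typically produced by weighting the last convolutional feature maps with the classifier weights of the predicted class, roughly as sketched below in PyTorch, following [40]. The time-pooling of video features and all names are assumptions, not the authors' visualization code.

```python
import torch
import torch.nn.functional as F

def class_activation_map(feature_maps, fc_weight, class_idx, frame_size):
    """feature_maps: (C, H, W) last-layer conv features (time-pooled for a video clip);
    fc_weight: (num_classes, C) weights of the final linear classifier."""
    cam = torch.einsum("c,chw->hw", fc_weight[class_idx], feature_maps)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1]
    # Upsample to the frame resolution so it can be blended with the original frame.
    cam = F.interpolate(cam[None, None], size=frame_size,
                        mode="bilinear", align_corners=False)[0, 0]
    return cam
```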

5 Conclusion

In this paper, we present a novel prior-guided and task-dependent multi-task representation learning framework for video-based action recognition pre-training. First, we incorporate the human-centered prior information by knowledge distillation from a human parsing teacher model to enrich the learned representation. Moreover, we follow a multi-task learning framework with a task-dependent representation learning strategy to resolve the conflicts of the multi-task training paradigm. Experimental results on the action recognition task demonstrate the effectiveness of the proposed multi-task learning framework for video representation learning.

Acknowledgement. This work was supported by the National Natural Science Foundation of China under Grant 62106219.

References

  • [1] N. Behrmann, J. Gall, and M. Noroozi (2021) Unsupervised video representation learning by bidirectional feature prediction. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 1670–1679. Cited by: §1.
  • [2] S. Benaim, A. Ephrat, O. Lang, I. Mosseri, W. T. Freeman, M. Rubinstein, M. Irani, and T. Dekel (2020) SpeedNet: learning the speediness in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9922–9931. Cited by: §2.
  • [3] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308. Cited by: §1, §2, §4.1.
  • [4] P. Chen, D. Huang, D. He, X. Long, R. Zeng, S. Wen, M. Tan, and C. Gan (2021) RSPNet: relative speed perception for unsupervised video representation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 1045–1053. Cited by: §1, §2, §3.2, Figure 3, §4.2, Table 1.
  • [5] H. Cho, T. Kim, H. J. Chang, and W. Hwang (2020) Self-supervised spatio-temporal representation learning using variable playback speed prediction. arXiv preprint arXiv:2003.02692. Cited by: Table 1.
  • [6] C. Feichtenhofer, H. Fan, J. Malik, and K. He (2019) Slowfast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6202–6211. Cited by: §2.
  • [7] C. Feichtenhofer, A. Pinz, and R. P. Wildes (2017) Spatiotemporal multiplier networks for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4768–4777. Cited by: §1.
  • [8] C. Feichtenhofer, A. Pinz, and A. Zisserman (2016) Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1933–1941. Cited by: §1.
  • [9] B. Fernando, H. Bilen, E. Gavves, and S. Gould (2017) Self-supervised video representation learning with odd-one-out networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3636–3645. Cited by: §2.
  • [10] T. Han, W. Xie, and A. Zisserman (2019) Video representation learning by dense predictive coding. In Proceedings of the IEEE International Conference on Computer Vision Workshops. Cited by: §1, §2.
  • [11] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9729–9738. Cited by: §3.2.
  • [12] D. Huang, V. Ramanathan, D. Mahajan, L. Torresani, M. Paluri, L. Fei-Fei, and J. C. Niebles (2018) What makes a video a video: analyzing temporal information in video understanding models and datasets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7366–7375. Cited by: §1.
  • [13] S. Jenni, G. Meishvili, and P. Favaro (2020) Video representation learning by recognizing temporal transformations. In European Conference on Computer Vision, pp. 425–442. Cited by: Table 1.
  • [14] S. Ji, W. Xu, M. Yang, and K. Yu (2012) 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (1), pp. 221–231. Cited by: §1.
  • [15] A. Kendall, Y. Gal, and R. Cipolla (2018) Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7482–7491. Cited by: §1.
  • [16] D. Kim, D. Cho, and I. S. Kweon (2019) Self-supervised video representation learning with space-time cubic puzzles. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8545–8552. Cited by: §1, §2.
  • [17] D. Kim, D. Cho, D. Yoo, and I. S. Kweon (2018) Learning image representations by completing damaged jigsaw puzzles. In 2018 IEEE Winter Conference on Applications of Computer Vision, pp. 793–802. Cited by: §2.
  • [18] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre (2011) HMDB: a large video database for human motion recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2556–2563. Cited by: §4.1.
  • [19] H. Lee, J. Huang, M. Singh, and M. Yang (2017) Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision, pp. 667–676. Cited by: §1, §2.
  • [20] P. Li, Y. Xu, Y. Wei, and Y. Yang (2020) Self-correction for human parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §4.1.
  • [21] J. Lin, C. Gan, and S. Han (2019) Temporal shift module for efficient video understanding. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7082–7092. Cited by: §2.
  • [22] D. Luo, C. Liu, Y. Zhou, D. Yang, C. Ma, Q. Ye, and W. Wang (2020) Video cloze procedure for self-supervised spatio-temporal learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 11701–11708. Cited by: Table 1.
  • [23] I. Misra and L. v. d. Maaten (2020) Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6707–6717. Cited by: §2.
  • [24] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544. Cited by: §2.
  • [25] O. Sener and V. Koltun (2018) Multi-task learning as multi-objective optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 525–536. Cited by: §1.
  • [26] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pp. 568–576. Cited by: §1, §2.
  • [27] K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. Cited by: §4.1.
  • [28] L. Tao, X. Wang, and T. Yamasaki (2020) Self-supervised video representation learning using inter-intra contrastive framework. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 2193–2201. Cited by: §1, §2.
  • [29] Y. Tian, D. Krishnan, and P. Isola (2020) Contrastive multiview coding. In European Conference on Computer Vision, pp. 776–794. Cited by: §2.
  • [30] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497. Cited by: §1, §2, §4.2.
  • [31] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri (2018) A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6450–6459. Cited by: §1, §2, §4.2.
  • [32] J. Wang, J. Jiao, L. Bao, S. He, Y. Liu, and W. Liu (2019) Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4006–4015. Cited by: Table 1.
  • [33] J. Wang, J. Jiao, and Y. Liu (2020) Self-supervised video representation learning by pace prediction. In European Conference on Computer Vision, pp. 504–521. Cited by: §1, Table 1.
  • [34] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2016) Temporal segment networks: towards good practices for deep action recognition. In European Conference on Computer Vision, pp. 20–36. Cited by: §1.
  • [35] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803. Cited by: §2.
  • [36] D. Xu, J. Xiao, Z. Zhao, J. Shao, D. Xie, and Y. Zhuang (2019) Self-supervised spatiotemporal learning via video clip order prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10334–10343. Cited by: §2, Table 1.
  • [37] C. Yang, Y. Xu, J. Shi, B. Dai, and B. Zhou (2020) Temporal pyramid network for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 591–600. Cited by: §2.
  • [38] Y. Yao, C. Liu, D. Luo, Y. Zhou, and Q. Ye (2020) Video playback rate perception for self-supervised spatio-temporal representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6548–6557. Cited by: Table 1.
  • [39] L. Zhang, Q. She, Z. Shen, and C. Wang (2021) How incomplete is contrastive learning? An inter-intra variant dual representation method for self-supervised video recognition. arXiv preprint. Cited by: §1.
  • [40] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929. Cited by: §4.2.