
Cross-modal knowledge distillation for action recognition

In this work, we address the problem of how a network for action recognition that has been trained on one modality, such as RGB videos, can be adapted to recognize actions in another modality, such as sequences of 3D human poses. To this end, we extract the knowledge of the trained teacher network for the source modality and transfer it to a small ensemble of student networks for the target modality. For the cross-modal knowledge distillation, we do not require any annotated data. Instead, we use pairs of sequences of both modalities as supervision, which are straightforward to acquire. In contrast to previous work on knowledge distillation that uses a KL-loss, we show that the cross-entropy loss together with mutual learning of a small ensemble of student networks performs better. In fact, the proposed approach for cross-modal knowledge distillation nearly achieves the accuracy of a student network trained with full supervision.





1 Introduction

Action recognition is addressed in many works and in particular deep learning methods have been proposed for various modalities like RGB videos [20, 23, 3, 5] or skeleton data [6, 24, 16, 15]. Deep learning methods for action recognition, however, require large annotated datasets. This poses a problem if the modality required for an application differs from the modality of an already annotated dataset. While acquiring data is usually not a bottleneck, annotating a dataset is very time consuming. It is therefore desirable to transfer the knowledge a network has learned from the already annotated dataset to a network for the new modality.

For the cross-modal knowledge transfer, we assume that we have already trained a deep learning model for action recognition on the source modality. This model is called the teacher network, and we aim to distill [2, 11, 4] its knowledge and transfer it to the student network for the target modality. For the transfer, we assume that we have paired videos of both modalities that are not annotated. This assumption is not a constraint for most applications since acquiring videos with two different sensors at the same time is straightforward.

In this work, we focus on the knowledge transfer from RGB videos to sequences of 3D skeleton poses [19] since skeleton and RGB data are very different modalities in terms of data structure. To transfer the knowledge from the teacher network to the student network, we propose a different loss than the Kullback–Leibler (KL) divergence loss, which was used in [11]. Instead of the KL-loss, we propose the cross-entropy for the transfer from the teacher to the student and train not one student network, but multiple student networks. Using an additional mutual loss for the student networks regularizes the transfer and increases the action recognition accuracy on the target modality.

We evaluate the approach on the NTU RGB+D dataset [19] using ST-GCN [24] and HCN [16] as network architectures for the student network. The experimental evaluation shows that the proposed approach outperforms an approach that uses the KL-loss as distillation and that it nearly achieves the accuracy of a student network trained with full supervision.

2 Related Works

There is a large body of work on action recognition from 3D human pose data [25]. More recently, most approaches use either recurrent neural networks to learn spatio-temporal features from sequences of skeleton data [21, 7, 19] or convolutional neural networks for classifying the skeleton sequences [14, 15, 13]. In [24], a spatio-temporal graph convolutional network has been proposed to learn both spatial and temporal features directly from the skeleton data. A convolutional neural network is also used in [16] to learn co-occurrence features. It combines different levels of contextual information for learning co-occurrence features in a hierarchical manner. Both raw skeleton coordinates and their temporal differences are used within a two-stream framework.

Knowledge distillation was originally proposed to compress ensemble classifiers into a smaller network without any significant loss of performance [2, 11]. In [4], the approach has been extended to compress large networks and they showed that softening the softmax predictions of a network by a high temperature conveys important information, also called dark knowledge. Recently, knowledge distillation has been proposed for multi-modal action recognition. For instance, [18] use a graph-based distillation method for action recognition that is able to distill information from multiple modalities during training. Similarly, [9] proposed a multi-modal action recognition framework that uses multiple data modalities at training time. While these works analyze whether networks can be trained better with full supervision if additional modalities, including the modality of the test data, are available during training, we address the case where the modality of the annotated training set differs from the modality of the test set. In [5], a 3D convolutional neural network is initialized by transferring the knowledge of a pre-trained 2D CNN. Cross-modal distillation has also been used for other tasks such as object detection [10], emotion recognition [1], or human pose estimation [28].


3 Cross-Modal Action Recognition

For cross-modal action recognition, we assume that a teacher network has been already trained on RGB videos. We now aim to train the student network for another modality, namely sequences of 3D human poses. For training the student network, we use pairs of RGB videos and human pose sequences. The pairs are not annotated and were therefore not part of the training data for the teacher network.

3.1 Teacher-Student Network

Figure 1: (a) The teacher network, which has been previously trained on RGB videos, provides the supervision for the student network for skeleton data. For training the student network, unlabeled pairs of both modalities and the cross-entropy loss are used. (b) Instead of one student network, two or more student networks can be trained together using mutual learning such that each student learns from the supervision of the teacher as well as from the other student. The red dashed lines denote back-propagation for the corresponding loss functions.

The training of the student network is illustrated in Fig. 1(a). For a training pair, the trained teacher network predicts the class probabilities from the source modality, where the vector of all class probabilities is denoted by $p_t$. The parameters of the student network are then optimized such that the class probabilities $p_s$ estimated by the student for the target modality match $p_t$. In [11], the Kullback–Leibler (KL) divergence has been proposed as loss for knowledge transfer between two networks of the same modality:

$$\mathcal{L}_{KL}(q_t \,\|\, q_s) = \sum_{i=1}^{N} q_t^i \log \frac{q_t^i}{q_s^i}, \tag{1}$$

where $N$ is the number of classes and $q_s$ and $q_t$ are softmax predictions of the student and teacher networks both softened with temperature $T$:

$$q^i = \frac{\exp(z^i / T)}{\sum_{j=1}^{N} \exp(z^j / T)}, \tag{2}$$

where $z^i$ is the score of the respective network for class $i$. A temperature value of $T > 1$ produces a softer probability distribution over the classes and has been proposed to avoid overfitting.
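The effect of the temperature can be illustrated with a minimal sketch in plain Python. It assumes the standard formulation of [11], i.e. a softmax over class scores divided by a temperature and a KL divergence that takes the softened teacher prediction as target. The function names and the example scores are illustrative and not taken from the paper.

```python
import math

def softened_softmax(logits, temperature=1.0):
    """Softmax over class scores that are divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, estimate, eps=1e-12):
    """KL(target || estimate) for two discrete distributions of equal length."""
    return sum(
        t * math.log((t + eps) / (e + eps))
        for t, e in zip(target, estimate)
        if t > 0
    )

# Hypothetical class scores of a teacher and a student for three classes.
teacher_logits = [4.0, 1.0, 0.5]
student_logits = [2.0, 1.5, 0.2]

for T in (1.0, 10.0):
    q_t = softened_softmax(teacher_logits, T)
    q_s = softened_softmax(student_logits, T)
    # A larger temperature flattens both distributions.
    print(T, [round(q, 3) for q in q_t], round(kl_divergence(q_t, q_s), 4))
```

With a larger temperature, the probability mass of the most likely class shrinks and the remaining classes carry more of the signal that the student is asked to match.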


3.1.1 Loss Function

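In our experimental evaluation, we show that the KL-loss (1) is not optimal for cross-modal knowledge transfer. In particular, finding an optimal temperature $T$ is difficult and it strongly depends on the student network. Instead, we propose to use the cross-entropy loss

$$\mathcal{L}_{CE}(p_s, p_t) = -\sum_{i=1}^{N} \delta(i, \hat{c}) \log p_s^i = -\log p_s^{\hat{c}}, \tag{3}$$

where $\hat{c} = \arg\max_i p_t^i$, $p_s$ and $p_t$ are the class probabilities predicted by the student and the teacher, respectively, and $\delta(i, \hat{c})$ is 1 if $i = \hat{c}$ and 0 otherwise. This means that the teacher makes a hard decision and we use the class label estimated by the teacher as supervision for the student network.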
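As a minimal sketch of this hard-label supervision, the following plain-Python snippet takes the most likely class of the teacher as pseudo-label and evaluates the cross-entropy of the student with respect to it. The function names and the probability vectors are hypothetical and only serve as illustration.

```python
import math

def teacher_hard_label(teacher_probs):
    """Index of the class with the highest teacher probability."""
    return max(range(len(teacher_probs)), key=lambda i: teacher_probs[i])

def cross_entropy_from_teacher(student_probs, teacher_probs, eps=1e-12):
    """Cross-entropy of the student w.r.t. the teacher's hard decision."""
    c_hat = teacher_hard_label(teacher_probs)
    # With a one-hot target only the term of the pseudo-label remains.
    return -math.log(student_probs[c_hat] + eps)

# Hypothetical predictions for one unlabeled pair (RGB video, pose sequence).
teacher_probs = [0.1, 0.7, 0.2]   # teacher output on the RGB video
student_probs = [0.2, 0.5, 0.3]   # student output on the pose sequence

print(teacher_hard_label(teacher_probs))
print(cross_entropy_from_teacher(student_probs, teacher_probs))
```

No temperature appears in this loss, which is why it avoids the hyper-parameter search needed for the KL-loss.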

3.2 Mutual Learning

In the context of fully supervised image classification, [27] proposed a deep mutual learning strategy. Instead of learning a single network with full supervision, an ensemble of networks is learned collaboratively and the networks teach each other throughout the training process.

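We show that mutual learning is also useful for cross-modal knowledge transfer. In this case, we train an ensemble of $K$ student networks together such that each network learns to mimic the probability distribution of the teacher network, as well as to match the probability estimates of its peers. Our approach for cross-modal knowledge transfer with mutual learning is shown in Fig. 1(b) for $K = 2$.

Since the students are applied to the same modality, we can apply the KL-loss with softened temperature (1) between them. The loss functions $\mathcal{L}_{\theta_1}$ and $\mathcal{L}_{\theta_2}$ for the student networks with parameters $\theta_1$ and $\theta_2$, respectively, are then given by

$$\mathcal{L}_{\theta_1} = \mathcal{L}_{CE}(p_{s_1}, p_t) + \mathcal{L}_{KL}(q_{s_2} \,\|\, q_{s_1}), \tag{4}$$

$$\mathcal{L}_{\theta_2} = \mathcal{L}_{CE}(p_{s_2}, p_t) + \mathcal{L}_{KL}(q_{s_1} \,\|\, q_{s_2}), \tag{5}$$

where $\mathcal{L}_{CE}$ is the cross-entropy loss with respect to the class label estimated by the teacher, $p_{s_k}$ are the class probabilities of the $k$-th student, and $q_{s_k}$ are its predictions softened with temperature $T$.

The proposed approach can be extended to more student networks. For $K$ students, the loss function for optimizing the $k$-th student network is given by

$$\mathcal{L}_{\theta_k} = \mathcal{L}_{CE}(p_{s_k}, p_t) + \frac{1}{K-1} \sum_{l=1, l \neq k}^{K} \mathcal{L}_{KL}(q_{s_l} \,\|\, q_{s_k}). \tag{6}$$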
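The loss of a single student within the ensemble can be sketched as follows. The sketch assumes that the loss consists of the cross-entropy with respect to the teacher's hard label plus the KL divergence to the softened predictions of the peers, averaged over the peers as in deep mutual learning [27]. The averaging, the function names, and the example scores are assumptions made for illustration.

```python
import math

def softened_softmax(logits, temperature=1.0):
    """Softmax over class scores that are divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, estimate, eps=1e-12):
    """KL(target || estimate) for two discrete distributions."""
    return sum(t * math.log((t + eps) / (e + eps))
               for t, e in zip(target, estimate) if t > 0)

def student_loss(k, student_logits, teacher_probs, temperature):
    """Loss of student k: teacher cross-entropy plus mean KL to its peers."""
    # Supervision by the teacher's hard decision (no temperature).
    probs_k = softened_softmax(student_logits[k], 1.0)
    c_hat = max(range(len(teacher_probs)), key=lambda i: teacher_probs[i])
    ce_term = -math.log(probs_k[c_hat])

    # Mutual term: match the softened predictions of all other students.
    peers = [l for l in range(len(student_logits)) if l != k]
    if not peers:  # a single student has no mutual term
        return ce_term
    soft = [softened_softmax(z, temperature) for z in student_logits]
    mutual_term = sum(kl_divergence(soft[l], soft[k]) for l in peers) / len(peers)
    return ce_term + mutual_term

# Hypothetical scores of K = 3 students for one pose sequence.
student_logits = [[2.0, 1.0, 0.1], [1.5, 1.2, 0.3], [2.2, 0.4, 0.6]]
teacher_probs = [0.7, 0.2, 0.1]
for k in range(len(student_logits)):
    print(k, student_loss(k, student_logits, teacher_probs, temperature=10.0))
```

If all students produce identical predictions, the mutual term vanishes and each loss reduces to the cross-entropy with respect to the teacher.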


4 Experiments

We evaluate our approach on the large-scale multi-modal action recognition dataset NTU RGB+D [19]. The videos are collected from 40 distinct subjects and contain 60 different action classes. We use the RGB videos as source modality for the teacher network and the skeleton data as target modality. We adopt the cross-subject evaluation protocol with 40,320 samples from 20 subjects for training and 16,560 samples from the remaining 20 subjects for testing. To evaluate the knowledge transfer, we divide the 20 training subjects into two groups of 10 subjects each, resulting in the Teacher-Train set for training the teacher network and the Student-Train set for training the student networks. While the RGB videos with class labels are used for the Teacher-Train set, the Student-Train set comprises pairs of RGB videos and sequences of 3D human poses, but no class labels. We evaluate the accuracy of the student networks on the pose data of the Test set. We use Temporal Segment Networks (TSN) [23] as our teacher network and use optical flow as the teacher modality. We use the same hyper-parameters as in [23]. For the student networks, we use the Spatial Temporal Graph Convolutional Network (ST-GCN) [24] and the Hierarchical Co-occurrence Network (HCN) [16], which both use the skeleton modality as their input data. We train the ST-GCN model using two GPUs with a batch size of 16 for a total of 200 epochs. All other hyper-parameters are the same as in [24] and [16].

| Noise (%) | 0 | 5 | 10 | 14 | 20 | 25 |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy (%) | 78.50 | 73.20 | 72.58 | 71.51 | 69.70 | 68.01 |

Table 1: Impact of noisy labels during training on the classification accuracy of the ST-GCN model. The Student-Train set is used for training and the Test set for evaluation.

In order to analyze how much knowledge we can extract from the teacher network, we evaluate the action recognition accuracy of the teacher network, which has been trained on the Teacher-Train set. On the Student-Train set, we obtain an accuracy of 86%, i.e., the teacher network will produce around 14% wrong labels during the knowledge transfer to the student networks.

Next, we study the effect of noisy labels on the performance of the ST-GCN network. To conduct this experiment, we randomly assign wrong labels to a percentage of the training videos in the Student-Train set. We then train the ST-GCN model on Student-Train in a fully supervised manner using the noisy labels as ground-truth. Table 1 reports the action recognition accuracy on the Test set for different percentages of noisy labels during training. 78.5% is the upper bound that can be achieved by cross-modal knowledge transfer if the teacher network is perfect since it corresponds to training the student network with full supervision. The accuracy drops from 78.5% to 73.2% if 5% of the videos are wrongly labelled. Given that the teacher network misclassifies 14% of the videos on Student-Train, we can expect to achieve 71.51% accuracy using cross-modal knowledge transfer.
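The label corruption used in such a study can be sketched as follows. The snippet is only an illustration of the idea: the function name, the seed handling, and the toy label list are assumptions, and the paper does not specify how the wrong labels were drawn beyond being random.

```python
import random

def corrupt_labels(labels, num_classes, noise_fraction, seed=0):
    """Replace a fraction of the labels by a randomly chosen wrong class."""
    rng = random.Random(seed)
    noisy = list(labels)
    num_noisy = round(noise_fraction * len(labels))
    # Pick the samples to corrupt without replacement.
    for idx in rng.sample(range(len(labels)), num_noisy):
        wrong_classes = [c for c in range(num_classes) if c != labels[idx]]
        noisy[idx] = rng.choice(wrong_classes)
    return noisy

# Toy example with 60 classes, mimicking a 14% error rate of the teacher.
labels = [i % 60 for i in range(1000)]
noisy_labels = corrupt_labels(labels, num_classes=60, noise_fraction=0.14)
changed = sum(a != b for a, b in zip(labels, noisy_labels))
print(changed / len(labels))
```

Since every selected sample receives a class that differs from its original label, the fraction of wrong labels equals the requested noise level up to rounding.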

| $T$ | 1 | 2 | 5 | 10 | 20 | 50 |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy (%) | 51.05 | 52.00 | 70.80 | 71.17 | 68.90 | 64.00 |

Table 2: Accuracy of the ST-GCN student network on the Test set using the KL-loss with different values for the softmax temperature $T$.

Given some bounds for the accuracy that we can expect, we now analyze the impact of the loss functions for the task of cross-modal knowledge transfer. For the rest of the experiments, we train the student networks on Student-Train using the teacher network as supervision and evaluate the action recognition accuracy of the student networks on the Test set. We first analyze the impact of the temperature $T$ for the KL-loss (1). Table 2 shows that for small temperatures ($T = 1$ and $T = 2$), the accuracy is very low since ST-GCN overfits on the Student-Train set.

| #Students ($K$) | Method (supervision) | Accuracy (Max) | Accuracy (Average) |
| --- | --- | --- | --- |
| 1 | Full Supervision | - | 78.50 |
| 1 | Teacher-Student | - | 71.17 |
| 2 | Ensemble without mutual | 71.93 | 72.32 |
| 2 | Mutual Learning | 73.20 | 73.60 |
| 3 | Mutual Learning | 73.60 | 74.22 |
| 4 | Mutual Learning | 73.30 | 73.50 |

Table 3: Impact of mutual learning and the number of student networks $K$. In case of multiple student networks, we combine the predictions of the student networks during inference either by averaging the class probabilities or taking the maximum probability of each class. For the experiments, the KL-loss with $T = 10$ is used.

We keep the KL-loss with $T = 10$, but evaluate the benefit of using more than one student network for mutual learning. Table 3 reports the accuracy for mutual learning with multiple student networks (last three rows). In this case, we obtain an ensemble of student networks where the predictions are combined by averaging the class probabilities (average). We also report the results if for each class the highest probability among all student networks is taken (max). The results show that averaging performs better than taking the maximum. $K = 3$ gives the best accuracy and mutual learning increases the accuracy compared to a teacher-student setup as proposed in [11] by 3%. To analyze if the improvement stems from the ensemble model or the mutual learning, we also trained two student networks without the mutual loss (ensemble without mutual). The result shows that roughly 50% of the improvement is due to the ensemble and the rest due to the mutual learning. It is interesting to note that mutual learning already achieves a higher accuracy than training the network with 5% of randomly assigned wrong labels (Table 1).
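The two ways of combining the student predictions at inference time can be illustrated with a small sketch. The function names and the probability vectors are hypothetical; the example is chosen such that both strategies lead to different decisions.

```python
def combine_average(all_probs):
    """Average the class probabilities over all students."""
    num_students = len(all_probs)
    return [sum(per_class) / num_students for per_class in zip(*all_probs)]

def combine_max(all_probs):
    """Take for each class the highest probability among all students."""
    return [max(per_class) for per_class in zip(*all_probs)]

def predict(scores):
    """Index of the class with the highest combined score."""
    return max(range(len(scores)), key=lambda i: scores[i])

# Hypothetical class probabilities of two students for one pose sequence.
student_a = [0.50, 0.45, 0.05]
student_b = [0.10, 0.45, 0.45]

print(predict(combine_average([student_a, student_b])))
print(predict(combine_max([student_a, student_b])))
```

Averaging favors the class on which both students agree, whereas taking the maximum follows the single most confident student.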

| Loss | # of students | Accuracy |
| --- | --- | --- |
| Full supervision | - | 78.50 |
| KL | 1 | 71.17 |
| Cross-entropy | 1 | 74.91 |
| KL + Mutual | 2 | 73.60 |
| Cross-entropy + Mutual | 2 | 77.83 |

Table 4: Results for the cross-entropy loss. For mutual learning, we average over the student networks. The cross-entropy loss outperforms the KL loss reported in Table 3.

So far we have only used the KL-loss, but we have not evaluated the proposed approach using the cross-entropy loss (6). We report the results with the cross-entropy loss in Table 4. Compared to the KL-loss, the accuracy increases from 71.17% to 74.91% for one student network and from 73.6% to 77.83% for two student networks. While the second term in (6) uses the KL-loss with temperature $T$ for mutual learning, we observed that the accuracy decreases if cross-entropy is used for both terms. Compared to [11], the proposed approach improves the accuracy by 6.66%. Note that the proposed approach nearly achieves the accuracy of ST-GCN trained with full supervision.

| Loss | # of students | Accuracy |
| --- | --- | --- |
| Full supervision | - | 80.60 |
| KL | 1 | 74.40 |
| KL | 1 | 74.90 |
| Cross-entropy | 1 | 77.40 |
| Cross-entropy + Mutual | 2 | 79.00 |
| Cross-entropy + Mutual | 3 | 79.50 |

Table 5: Accuracy of the HCN student network on the Test set using different loss functions and varying number of student networks.

In order to demonstrate that the proposed approach is insensitive to the student network architecture, we also evaluated the accuracy of cross-modal knowledge transfer if we use HCN [16] as student network. Table 5 reports the results for the HCN model. For the KL-loss (1), we had to adjust the temperature $T$. While ST-GCN performs better for a large value of $T$ as reported in Table 2, it is the other way around for HCN since HCN is a smaller network which suffers less from overfitting. For HCN, a small value of $T$ performs best and larger values of $T$ actually decrease the accuracy. This shows that it is very difficult to choose the hyper-parameter $T$ [11] in the context of cross-modal action recognition. If we use the proposed cross-entropy loss, this problem does not occur and it outperforms the KL-loss. If we use mutual learning with two or three students, the action recognition accuracy is improved by 4.1% or 4.6%, respectively, compared to the best result of the KL-loss [11] in Table 5 (74.90%). Note that we still use the KL-loss with temperature $T$ in (6) and we found that (6) is not sensitive to the parameter $T$ since the mutual loss is computed only for the student networks, which have the same network architecture applied to the same modality.

Finally, we compare our approach with the current state-of-the-art methods for the skeleton modality on the NTU RGB+D dataset in Table 6. Although our student networks are trained on less data and with less supervision, they achieve a higher accuracy than many other approaches that are trained with full supervision on the entire training set.

| Method | Full Train | Student-Train |
| --- | --- | --- |
| Skeletal Quads [8] | 38.62 | |
| Lie Group [22] | 50.08 | |
| HBRNN-L [7] | 59.07 | |
| Dynamic Skeletons [12] | 60.23 | |
| PA-LSTM [19] | 62.90 | |
| STA-LSTM [21] | 73.40 | |
| ST-LSTM+TS [17] | 69.20 | |
| Temporal Conv [14] | 74.30 | |
| VA-LSTM [26] | 79.20 | |
| ST-GCN [24] | 81.60 | 78.50 |
| Two-stream CNN [15] | 83.20 | |
| HCN [16] | 86.50 | 80.60 |
| Cross-modal ST-GCN | | 77.83 |
| Cross-modal HCN | | 79.50 |

Table 6: Comparison with the state-of-the-art for the cross-subject protocol. Note that the numbers are not directly comparable since the other approaches are trained with full supervision on the entire training set, while our approach is trained only on Student-Train and with less supervision.

5 Conclusion

We have presented an approach that uses knowledge distillation for cross-modal action recognition. The approach is able to transfer knowledge from one modality to another modality without the need of any additional annotations. Instead, pairs of sequences of both modalities are sufficient for the knowledge transfer. We evaluated our approach on a large-scale multi-modal dataset using two different student networks. In both cases, we showed that cross-modal knowledge transfer achieves an action recognition accuracy that is very close to fully supervised learning.


  • [1] S. Albanie, A. Nagrani, A. Vedaldi, and A. Zisserman (2018) Emotion recognition in speech using cross-modal transfer in the wild. In ACM Multimedia, Cited by: §2.
  • [2] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil (2006) Model compression. In KDD, Cited by: §1, §2.
  • [3] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR, Cited by: §1.
  • [4] Y. Cheng, D. Wang, P. Zhou, and T. Zhang (2017) A survey of model compression and acceleration for deep neural networks. CoRR abs/1710.09282. Cited by: §1, §2.
  • [5] A. Diba, M. Fayyaz, V. Sharma, M. M. Arzani, R. Yousefzadeh, J. Gall, and L. Van Gool (2018) Spatio-temporal channel correlation networks for action classification. In ECCV, Cited by: §1, §2.
  • [6] Y. Du, W. Wang, and L. Wang (2015) Hierarchical recurrent neural network for skeleton based action recognition. In CVPR, Cited by: §1.
  • [7] Y. Du, W. Wang, and L. Wang (2015) Hierarchical recurrent neural network for skeleton based action recognition. In CVPR, Cited by: §2, Table 6.
  • [8] G. Evangelidis, G. Singh, and R. Horaud (2014) Skeletal quads: human action recognition using joint quadruples. In ICPR, Cited by: Table 6.
  • [9] N. C. Garcia, P. Morerio, and V. Murino (2018) Modality distillation with multiple stream networks for action recognition. In ECCV, Cited by: §2.
  • [10] S. Gupta, J. Hoffman, and J. Malik (2016) Cross modal distillation for supervision transfer. In CVPR, Cited by: §2.
  • [11] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. In NIPS Workshops, Cited by: §1, §1, §2, §3.1, §4, §4, §4.
  • [12] J. Hu, W. Zheng, J. Lai, and J. Zhang (2015) Jointly learning heterogeneous features for rgb-d activity recognition. In CVPR, Cited by: Table 6.
  • [13] Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid (2017) A new representation of skeleton sequences for 3d action recognition. In CVPR, Cited by: §2.
  • [14] T. S. Kim and A. Reiter (2017) Interpretable 3d human action analysis with temporal convolutional networks. In CVPR Workshops, Cited by: §2, Table 6.
  • [15] C. Li, Q. Zhong, D. Xie, and S. Pu (2017) Skeleton-based action recognition with convolutional neural networks. In ICME Workshops, Cited by: §1, §2, Table 6.
  • [16] C. Li, Q. Zhong, D. Xie, and S. Pu (2018) Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. In IJCAI, Cited by: §1, §1, §2, Table 6, §4, §4.
  • [17] J. Liu, A. Shahroudy, D. Xu, and G. Wang (2016) Spatio-temporal LSTM with trust gates for 3d human action recognition. In ECCV, Cited by: Table 6.
  • [18] Z. Luo, J. Hsieh, L. Jiang, J. C. Niebles, and L. Fei-Fei (2018) Graph distillation for action detection with privileged modalities. In ECCV, Cited by: §2.
  • [19] A. Shahroudy, J. Liu, T. Ng, and G. Wang (2016) NTU RGB+D: A large scale dataset for 3d human activity analysis. In CVPR, Cited by: §1, §1, §2, Table 6, §4.
  • [20] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In NIPS, Cited by: §1.
  • [21] S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu (2017) An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In AAAI, Cited by: §2, Table 6.
  • [22] R. Vemulapalli, F. Arrate, and R. Chellappa (2014) Human action recognition by representing 3d skeletons as points in a lie group. In CVPR, Cited by: Table 6.
  • [23] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2016) Temporal segment networks: towards good practices for deep action recognition. In ECCV, Cited by: §1, §4.
  • [24] S. Yan, Y. Xiong, and D. Lin (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, Cited by: §1, §1, §2, Table 6, §4.
  • [25] M. Ye, Q. Zhang, L. Wang, J. Zhu, R. Yang, and J. Gall (2013) A survey on human motion analysis from depth data. In Time-of-Flight and Depth Imaging, pp. 149–187. Cited by: §2.
  • [26] P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, and N. Zheng (2017) View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In ICCV, Cited by: Table 6.
  • [27] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu (2018) Deep mutual learning. In CVPR, Cited by: §3.2.
  • [28] M. Zhao, T. Li, M. Abu Alsheikh, Y. Tian, H. Zhao, A. Torralba, and D. Katabi (2018) Through-wall human pose estimation using radio signals. In CVPR, Cited by: §2.