Deep learning technologies have recently advanced various video understanding tasks, particularly human action recognition, where performance has improved considerably [27, 1, 28]. Intuitively, human action is a highly semantic concept that contains various semantic cues. For example, as shown in Figure 1, in the first column, one can identify a skiing action from the snow-field scene together with the clothing of the people in it. In the second column, one can still recognize the action as basketball playing by observing the basketball court and the players, even though the ball may be difficult to identify due to low resolution or motion blur. In the last column, we can easily recognize a push-up action from the human pose. Context knowledge is therefore critical to understanding human actions in videos, and learning such context knowledge in a meaningful way is of great importance to improving performance.
Past work commonly treated action recognition as a classification problem and attempted to learn action-related semantic cues directly from training videos [5, 26, 1]. These approaches assumed that action-related features can be learned implicitly by powerful CNN models using only video-level action labels. However, it has been shown that learning action and actor segmentation jointly can boost the performance of both tasks. Those experiments were conducted on the A2D dataset, where ground-truth actor masks and per-pixel action labels are provided. In practice, it is highly expensive to provide pixel-wise action labels for a large-scale video dataset, and such per-pixel annotations are not available in most action recognition benchmarks, such as Kinetics and UCF-101.
Deep learning methods have achieved impressive performance on various vision tasks, such as human parsing, pose estimation, semantic segmentation, and scene recognition [37, 25]. It is natural to use these existing technologies to strengthen an action model by learning context knowledge from action videos. This inspired us to design a knowledge distillation mechanism that learns the context knowledge of human and scene explicitly, by training action recognition jointly with human parsing and scene recognition. This allows the three tasks to work collaboratively, providing a more principled approach that learns rich context information for action recognition without additional manual annotations.
Contributions. In this work, we propose Knowledge Integration Networks, referred to as KINet, for video action recognition. KINet is capable of aggregating meaningful context features through a new three-branch network design with knowledge distillation. The main contributions of the paper are summarized as follows.
We propose KINet - a three-branch architecture for action recognition. KINet has a main branch for action recognition, and two auxiliary branches for human parsing and scene recognition which encourage the model to learn the knowledge of human and scene via knowledge distillation. This results in an end-to-end trainable framework where the three tasks can be trained collaboratively and efficiently, allowing the model to learn the context knowledge explicitly.
We design a two-level knowledge encoding mechanism. The auxiliary human and scene knowledge can be encoded into convolutional features directly by introducing a new Cross Branch Integration (CBI) module. An Action Knowledge Graph (AKG) is further designed for effectively modeling the high-level correlations between action and the auxiliary knowledge of human and scene.
2 Related Work
We briefly review recent work on video action recognition and action recognition with external knowledge.
Various CNN architectures have been developed for action recognition, which can be roughly categorized into three groups. First, a two-stream architecture was introduced, where one stream learns from RGB images and the other models optical flow. The results produced by the two CNN streams are then fused at later stages, yielding the final prediction. Two-stream CNNs and their extensions have achieved impressive results on various video recognition tasks. Second, 3D CNNs have recently been proposed in [24, 1], which consider a video as a stack of frames. The spatio-temporal features of an action can be learned by 3D convolution kernels. However, 3D CNNs often involve a larger number of model parameters, which require more training data to achieve high performance. Recent results on the Kinetics dataset, as reported in [33, 28, 5], show that 3D CNNs can obtain competitive performance on action recognition. Third, recurrent networks, such as LSTM [20, 4], have been explored for temporal modeling, where a video is considered as a temporal sequence of 2D frames.
Recently, Xu et al. created the A2D dataset, where pixel-wise annotations of actors and actions are provided. Other authors attempted to handle a similar problem by using probabilistic graphical models. A number of deep learning approaches have been developed for actor-action learning on the A2D dataset [16, 2, 15], demonstrating that jointly learning actor and action representations can improve action recognition and understanding. However, these approaches are built on the dense pixel-wise labelling of actors and actions provided in the A2D dataset, which is highly expensive and difficult to obtain for a large-scale action recognition dataset, such as Kinetics and UCF-101, where only video-level action labels are provided.
An action can be defined by multiple elements, features or context information, and how to effectively incorporate various external knowledge into action recognition is still an open problem. Earlier work tried to combine object, scene and action recognition by using a multiple instance learning framework. Recently, with deep learning approaches, Jain et al. introduced object features to action recognition by discovering the relations between actions and objects, and Wu et al. further explored the relations among object, scene and action by designing a more advanced and robust discriminative classifier.
In one approach, an external object detector was used to provide the locations of objects and persons; in another, an object detection network was incorporated to provide object proposals, which are used with spatio-temporal graph convolutional networks for action recognition.
However, all these methods rely on external networks to extract semantic cues, and such external networks were trained independently and kept fixed when applied to action recognition. This inevitably limits their capability for learning meaningful action representations. Our method learns the additional knowledge of human and scene via knowledge distillation, allowing us to learn action recognition jointly with human parsing and scene recognition in a single model, which provides a more efficient way to encode context knowledge for action recognition.
3 Knowledge Integration Networks
In this section, we describe details of the proposed Knowledge Integration Networks (KINet) which is able to distill the knowledge of human and scene from two teacher networks. KINet contains three branches, and has a new knowledge integration mechanism by introducing a Cross Branch Integration (CBI) module for encoding the auxiliary knowledge into the intermediate convolutional features, and an Action Knowledge Graph (AKG) for effectively integrating high-level context information.
Overview. The proposed KINet aims to explicitly incorporate scene context and human knowledge into human action recognition, while the annotations of scene categories or human masks are highly expensive, and thus are not available in many existing action recognition datasets. In this work, we attempt to tackle this problem by introducing two external teacher networks able to distill the extra knowledge of human and scene, providing additional supervision for KINet. This allows KINet to learn action, scene and human concepts simultaneously, and enables the explicit learning of multiple semantic concepts without additional manual annotations.
Specifically, as shown in Figure 2, KINet employs two external teacher networks to guide the main network. The two teacher networks aim at providing pseudo ground truth for scene recognition and human parsing. The main network is composed of three branches, where the fundamental branch is used for action recognition. The other two branches are designed for two auxiliary tasks - scene recognition and human representation, respectively. The intermediate representations of three branches are integrated by the proposed CBI module. Finally, we aggregate the action, human and scene features by designing an Action Knowledge Graph (AKG). The three branches are jointly optimized during training, allowing us to directly encode the knowledge of human and scene into KINet for action recognition.
We also tried to incorporate an object branch by distillation, but this led to a performance drop with unstable results. We conjecture that there are two main reasons: (1) due to low resolution and motion blur, it is very hard to identify the exact object in the video; (2) the categories covered by object detection and segmentation are usually rather limited, while the objects involved in action recognition are much more diverse. Although object proposals have been used in prior work to overcome these problems, it is very hard to distill such noisy proposals. We therefore just use the ImageNet-pretrained model to initialize the framework instead of forcing it to "remember" everything from it.
3.1 The Teacher Networks
We explore two teacher networks pre-trained for human parsing and scene recognition.
Human parsing network. We use the LIP dataset to train the human parsing network. LIP is a human parsing dataset created specifically for semantic segmentation of multiple parts of the human body. We choose this dataset because it includes training data in which only certain parts of the human body, such as a hand, are visible, and such partial views are common in action videos.
Furthermore, the original LIP dataset contains 19 semantic parts. Due to the relatively low resolution of the frames, pseudo labels generated for fine-grained human parsing may contain a certain amount of noisy pixel labels. We therefore merge all 19 human parts into a single human segmentation, which makes the segmentation results much more robust. Finally, we employ PSPNet with DenseNet-121 as its backbone for the human parsing teacher network.
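The part-merging step can be illustrated with a small sketch. It assumes the parsing output is an integer label map in which one id denotes background and every other id denotes a body part; the function name and the background id of 0 are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def merge_parts_to_human_mask(part_labels: np.ndarray, background_id: int = 0) -> np.ndarray:
    """Collapse a per-pixel part-label map into a binary human mask."""
    # Every pixel that is not background becomes 1 (human); background stays 0.
    return (part_labels != background_id).astype(np.uint8)

# Toy 2x3 label map: 0 = background (assumed), other ids = different body parts.
parts = np.array([[0, 2, 2],
                  [13, 0, 5]])
human_mask = merge_parts_to_human_mask(parts)
```

Collapsing the labels this way trades part-level detail for a pseudo label that is less sensitive to confusion between neighbouring parts.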
3.2 The Main Networks
The proposed KINet has three branches: one main branch for action recognition, and two branches for the auxiliary tasks of scene recognition and human parsing, which are guided by the teacher networks. We use the Temporal Segment Network (TSN) structure as the action recognition branch, with a 2D network as its backbone. TSN models long-range temporal information by sparsely sampling a number of segments along the whole video and then averaging the representations of all segments. In addition, TSN can also be applied in a two-stream architecture, with the second stream modelling motion information from optical flow. We keep the number of training segments fixed in our experiments, both for efficient training and for a fair comparison against previous state-of-the-art methods.
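As a rough illustration of TSN-style sparse sampling and segment averaging, the sketch below picks one random frame index per segment and averages per-segment scores. The function names, the 90-frame video and the choice of 3 segments are illustrative assumptions only.

```python
import numpy as np

def sample_segment_frames(num_frames: int, num_segments: int, rng: np.random.Generator) -> np.ndarray:
    """Pick one random frame index from each of `num_segments` equal-length segments.

    Assumes num_frames >= num_segments so that no segment is empty.
    """
    # Segment boundaries: num_segments + 1 evenly spaced cut points over [0, num_frames].
    edges = np.linspace(0, num_frames, num_segments + 1).astype(int)
    # Draw one index uniformly inside each half-open interval [start, end).
    return np.array([rng.integers(start, end) for start, end in zip(edges[:-1], edges[1:])])

def segment_consensus(segment_scores: np.ndarray) -> np.ndarray:
    """Average per-segment class scores (num_segments, num_classes) into one video-level score."""
    return segment_scores.mean(axis=0)

rng = np.random.default_rng(0)
indices = sample_segment_frames(num_frames=90, num_segments=3, rng=rng)
video_score = segment_consensus(np.array([[1.0, 3.0], [3.0, 5.0]]))
```

Because each segment contributes exactly one frame, the cost per video does not grow with video length, while the sampled frames still span the whole clip.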
The three branches of KINet share the low-level layers of the backbone. There are two reasons for this design. First, low-level features generalize across the three tasks, and sharing them allows the three tasks to be trained more collaboratively with fewer parameters.
The higher-level layers form three individual branches that do not share parameters but still exchange information through the integration mechanisms introduced in the following sections.
3.3 Knowledge Integration Mechanism
Our goal is to develop an efficient feature integration method able to encode context knowledge at different levels. We propose a two-level knowledge integration mechanism, consisting of a Cross Branch Integration (CBI) module and an Action Knowledge Graph (AKG), for encoding the knowledge of human and scene into different levels of the features learned for action recognition.
Cross Branch Integration (CBI)
The proposed CBI module aims to aggregate the intermediate features learned by the two auxiliary branches into the action recognition branch, which enables the model to encode the knowledge of human and scene. Specifically, we use the feature maps of the auxiliary branches as a gated modulation of the main action features, implemented by element-wise multiplication, as shown in Figure 3. We apply a residual-like connection with batch normalization and ReLU activation so that the feature maps of the auxiliary branches can directly interact with the action features.
Finally, the feature maps from the three branches are concatenated along the channel dimension, followed by a convolution layer that reduces the number of channels. In this way, the numbers of input and output channels are guaranteed to be identical, so that the CBI module can be applied at any stage in the network.
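The data flow of the CBI module can be sketched in NumPy as follows. This is only one plausible reading of the description: the exact placement of batch normalization, ReLU and the residual connection, the use of a 1x1 convolution for channel reduction, and all names (`cbi`, `w_reduce`) are assumptions made for illustration.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each channel over batch and spatial axes (learned scale/shift omitted).
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def relu(x):
    return np.maximum(x, 0.0)

def cbi(action, human, scene, w_reduce):
    """Sketch of Cross Branch Integration.

    action, human, scene: feature maps of shape (B, C, H, W).
    w_reduce: weights of a 1x1 convolution, shape (C, 3C) (assumed kernel size).
    """
    # Gated modulation: auxiliary features multiply the action features element-wise.
    gated_human = relu(batch_norm(action * human))
    gated_scene = relu(batch_norm(action * scene))
    # Residual-like connection (assumed form): keep the original action features.
    fused_action = action + gated_human + gated_scene
    # Concatenate the three branches along the channel axis: (B, 3C, H, W).
    stacked = np.concatenate([fused_action, human, scene], axis=1)
    # A 1x1 convolution is a per-pixel linear map over channels: back to (B, C, H, W).
    return np.einsum("oc,bchw->bohw", w_reduce, stacked)

rng = np.random.default_rng(0)
B, C, H, W = 2, 4, 5, 5
action, human, scene = (rng.standard_normal((B, C, H, W)) for _ in range(3))
out = cbi(action, human, scene, rng.standard_normal((C, 3 * C)))
```

The output has the same shape as each input, which is the property that lets such a module be dropped in at any stage of the backbone.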
Action Knowledge Graph (AKG)
In the final stages, we apply global average pooling individually on each branch, obtaining three groups of representation vectors with the same size. Each group contains one feature vector per input frame, that is, as many vectors as there are segments in TSN. We then build an Action Knowledge Graph on these high-level features of the three tasks to explicitly model the pair-wise correlations among them, and apply the recent Graph Convolutional Networks on the AKG to further integrate high-level semantic knowledge.
Graph Definition. The goal of our knowledge graph is to model the relationship among action, scene and human segments. Specifically, the graph contains one node for every segment in each of the three branches, and each node is a feature vector whose length is the channel dimension of the last convolutional layer in the backbone. The graph matrix represents the pair-wise relationships among the nodes, with each edge indicating the relationship between a pair of nodes.
For the proposed KINet framework, our goal is to build the correlations between the main action recognition task and the auxiliary scene recognition and human parsing tasks. Therefore, it is not necessary to construct a fully-connected knowledge graph. We only activate the edges that are directly connected to an action node, and set the others to 0. We implement AKG by computing the element-wise product of the graph matrix and an edge mask matrix. The mask is a 0 or 1 matrix with the same size as the graph matrix, where the edges between human nodes and scene nodes are set to 0, and all other entries are 1.
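A minimal sketch of such an edge mask is given below. It follows the explicit mask description (only human-scene edges are zeroed) and assumes the nodes are ordered as action, then scene, then human, with N nodes per branch; the ordering and the function name are illustrative assumptions.

```python
import numpy as np

def build_edge_mask(num_segments: int) -> np.ndarray:
    """0/1 mask over a (3N x 3N) graph with nodes ordered [action | scene | human]."""
    n = num_segments
    mask = np.ones((3 * n, 3 * n), dtype=np.float32)
    scene = slice(n, 2 * n)
    human = slice(2 * n, 3 * n)
    # Switch off every edge that links a scene node with a human node (both directions).
    mask[scene, human] = 0.0
    mask[human, scene] = 0.0
    return mask

mask = build_edge_mask(num_segments=2)
```

With this layout every action node keeps its edges to all other nodes, while scene and human nodes never exchange information directly.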
Relation function. We describe different forms of the relation function for computing the relationship between knowledge nodes.
(1) Dot product and embedded dot product. The dot product is a frequently used function for modelling the similarity between two vectors. It is simple, effective and parameter-free. An extension is the embedded dot product, which first projects the two input vectors onto a subspace using two learnable weight matrices and then applies the dot product.
(2) Concatenation. We simply adopt an existing concatenation-based relation module: the two node vectors are concatenated, and a learnable weight matrix projects the concatenated vector into a scalar.
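For reference, the relation functions described above can be written as follows. The notation is assumed rather than taken from the paper: $x_a$ and $x_b$ are two node vectors, $W_\theta$ and $W_\phi$ are the two learnable projection matrices, $w$ is the learnable projection to a scalar, and $\Vert$ denotes concatenation. The ReLU in the third form is also an assumption, and the original formulation may additionally embed the inputs before concatenation.

```latex
\begin{align}
f_{\text{dot}}(x_a, x_b) &= x_a^{\top} x_b, \\
f_{\text{e-dot}}(x_a, x_b) &= (W_\theta x_a)^{\top} (W_\phi x_b), \\
f_{\text{concat}}(x_a, x_b) &= \operatorname{ReLU}\!\left( w^{\top} \left[\, x_a \,\Vert\, x_b \,\right] \right).
\end{align}
```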
Normalization. The sum of all edges pointing to the same node must be normalized to 1, after which the graph convolution can be applied to the normalized knowledge graph. In this work, we apply the softmax function to implement this normalization.
Applying softmax to a dot product essentially turns it into a Gaussian relation function, so we do not use the Gaussian or embedded Gaussian function directly for learning the relations.
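The normalization step can be sketched as a row-wise softmax over a dot-product relation matrix. The convention that row a holds the edges pointing to node a, and the function names, are assumptions made for this illustration.

```python
import numpy as np

def dot_product_relation(nodes: np.ndarray) -> np.ndarray:
    """Pairwise dot products between node features of shape (num_nodes, dim)."""
    return nodes @ nodes.T

def softmax_normalize(relation: np.ndarray) -> np.ndarray:
    """Row-wise softmax so that the edges pointing to each node sum to 1."""
    # Subtract the row maximum for numerical stability; the softmax value is unchanged.
    shifted = relation - relation.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
nodes = rng.standard_normal((6, 8))
graph = softmax_normalize(dot_product_relation(nodes))
```

Since each entry is an exponentiated dot product divided by a row sum, this is the same form as a normalized Gaussian similarity, which matches the remark that a separate Gaussian relation function is unnecessary.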
3.4 Relation Reasoning on the Graph
We apply the recent Graph Convolutional Network (GCN) to the constructed Action Knowledge Graph to aggregate high-level semantic knowledge of human and scene into the action recognition branch. Formally, a graph convolution layer takes the element-wise product of the edge mask matrix described above and the matrix of the constructed knowledge graph, multiplies it with the input to the GCN and with a learnable weight matrix, and then applies an activation function. In practice, we found that a deeper GCN did not improve performance, so we apply only one graph convolution layer, which proved sufficient for modelling rich high-level context information. The output of the GCN has the same size as its input. We only use the vectors from the action branch for final classification.
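A single masked graph convolution layer of this kind can be sketched as below. The ReLU activation, the node ordering (action, scene, human), the uniform placeholder graph and all variable names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def graph_conv_layer(graph, mask, features, weight):
    """One graph convolution: activation((mask * graph) @ features @ weight)."""
    masked_graph = mask * graph               # element-wise product removes disabled edges
    aggregated = masked_graph @ features      # each node gathers features from its neighbours
    return np.maximum(aggregated @ weight, 0.0)  # ReLU used here as the activation (assumption)

rng = np.random.default_rng(0)
num_segments, dim = 2, 8
num_nodes = 3 * num_segments                  # action, scene and human nodes
features = rng.standard_normal((num_nodes, dim))
graph = np.full((num_nodes, num_nodes), 1.0 / num_nodes)  # placeholder normalized graph
mask = np.ones((num_nodes, num_nodes))
mask[num_segments:2 * num_segments, 2 * num_segments:] = 0.0  # scene-human edges off
mask[2 * num_segments:, num_segments:2 * num_segments] = 0.0  # human-scene edges off
output = graph_conv_layer(graph, mask, features, rng.standard_normal((dim, dim)))
action_output = output[:num_segments]         # only action-node vectors go to the classifier
```

Using a square weight matrix keeps the output the same size as the input, and slicing out the action rows mirrors the statement that only action-branch vectors are used for classification.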
3.5 Joint Learning
The proposed three-branch architecture enables end-to-end joint learning of action recognition, human parsing and scene recognition. The multi-task loss is a weighted sum of three terms: the action recognition and scene recognition losses are cross-entropy losses for classification, and the human parsing loss is a cross-entropy loss for semantic segmentation. For scene recognition and human parsing, the loss of each segment is calculated individually and then averaged.
The ground truth for action recognition is provided by the training dataset, while the ground truth for scene recognition and human parsing is provided by the two teacher networks as pseudo labels for knowledge distillation. The loss weights for the main task and for the two auxiliary tasks are set empirically.
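In assumed notation, the multi-task objective described above can be written as a weighted sum. The symbols $\lambda_1, \lambda_2, \lambda_3$ (loss weights), $N$ (number of segments) and the superscript $(i)$ (segment index) are introduced here for illustration; the specific weight values are not reproduced.

```latex
\begin{align}
\mathcal{L} &= \lambda_{1}\,\mathcal{L}_{\text{action}} + \lambda_{2}\,\mathcal{L}_{\text{scene}} + \lambda_{3}\,\mathcal{L}_{\text{human}}, \\
\mathcal{L}_{\text{scene}} &= \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}_{\text{scene}}^{(i)}, \qquad
\mathcal{L}_{\text{human}} = \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}_{\text{human}}^{(i)}.
\end{align}
```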
Learnable parameters for auxiliary tasks. Notice that previous works, such as [32, 10, 30], encode the extra knowledge by directly using teacher networks whose parameters are fixed, while our three-branch framework with knowledge distillation enables joint learning of the three tasks. This allows the three tasks to be trained more collaboratively, providing a more principled approach to knowledge integration. Our experiments presented in the next section verify this claim.
4 Experiments
To verify the effectiveness of our KINet, we conduct experiments on the large-scale action recognition dataset Kinetics-400, which contains 400 action categories, with about 240k videos for training and 20k videos for validation. We then examine the generalization ability of our KINet by transferring the learned representation to a smaller dataset, UCF-101, which contains 101 action categories with 13,320 videos in total. Following the standard protocol, we divide the videos into 3 training/testing splits and average the results of the three splits to obtain the final result.
4.2 Implementation Details
Training. We use ImageNet-pretrained weights to initialize the framework. Following the TSN sampling strategy, we uniformly divide the video into segments and randomly select one frame from each segment. We first resize every frame to a fixed size and then apply multiscale cropping for data augmentation. For Kinetics, we use the SGD optimizer with momentum and weight decay, and an initial learning rate of 0.01, which is divided by 10 at epochs 20, 40 and 60. The model is trained for 70 epochs in total. For UCF-101, we fine-tune the weights pretrained on Kinetics with all but the first batch normalization layer frozen, and train the model for 80 epochs.
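The Kinetics learning-rate schedule described above (0.01, divided by 10 at epochs 20, 40 and 60, for 70 epochs) can be sketched as a small step function. Treating epochs as 0-indexed and the function name are assumptions made here.

```python
def step_learning_rate(epoch: int, base_lr: float = 0.01,
                       milestones=(20, 40, 60), factor: float = 0.1) -> float:
    """Learning rate for a 0-indexed epoch under a step schedule."""
    # Count how many milestones have been reached and shrink the rate once per milestone.
    drops = sum(epoch >= m for m in milestones)
    return base_lr * factor ** drops

# One learning rate per training epoch, 70 epochs in total.
schedule = [step_learning_rate(e) for e in range(70)]
```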
Inference. For a fair comparison, we also uniformly sample 25 segments from each video and select one frame from each segment. We crop the 4 corners and the center of each frame and then flip the crops, so that 10 images are obtained per frame. In total, there are 25 × 10 = 250 images for each video. We apply a sliding window over the 25 test segments, and the results are averaged to produce the video-level prediction. Note that during inference, the decoder of the human parsing branch and the classifier (fully connected layer) of the scene recognition branch can be removed, since our main task is action recognition. This makes it very efficient to transfer the learned representation to other datasets.
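The corner, center and flip cropping used at test time can be sketched as follows. The crop size, the toy frame and the function name are illustrative assumptions; only the count of views (10 per frame, 25 frames per video) comes from the text.

```python
import numpy as np

def ten_crop(frame: np.ndarray, crop: int) -> np.ndarray:
    """Four corner crops, one centre crop, and the horizontal flip of each.

    frame: (H, W, C) image with H >= crop and W >= crop. Returns (10, crop, crop, C).
    """
    h, w = frame.shape[:2]
    tops_lefts = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop),
                  ((h - crop) // 2, (w - crop) // 2)]
    crops = [frame[t:t + crop, l:l + crop] for t, l in tops_lefts]
    flips = [c[:, ::-1] for c in crops]   # mirror along the width axis
    return np.stack(crops + flips)

frame = np.arange(8 * 10 * 3).reshape(8, 10, 3)
views = ten_crop(frame, crop=6)
num_views_per_video = 25 * views.shape[0]   # 25 sampled frames, 10 views each
```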
|Group||Setting||Top-1 (%)||Gain|
| ||1 CBI (res4) + 1 CBI (res5)||71.8||+2.3|
| ||2 CBI (res4) + 1 CBI (res5)||71.5||+2.0|
|AKG+Multitask||AKG + dot product||71.7||+2.2|
| ||AKG + E-dot product||71.6||+2.1|
| ||AKG + concatenation||71.2||+1.7|
4.3 Ablation Study on Kinetics
In this subsection, we conduct extensive experiments on the large-scale Kinetics dataset to study our framework. In this study, we use TSN-ResNet50 as the baseline.
Multitask Learning with Knowledge Distillation. First, to show that distilling external knowledge does help action recognition, we incorporate human parsing and scene recognition into the action recognition network by jointly learning the three tasks via knowledge distillation, without applying the CBI module or the Action Knowledge Graph. As shown in Table 1, multitask learning with knowledge distillation outperforms the baseline. When action recognition and human parsing are jointly trained, the top-1 accuracy increases by 0.8%. When action recognition and scene recognition are jointly trained, the top-1 accuracy increases by 0.5%. When all three tasks are jointly trained, the top-1 accuracy increases by 1.1%.
Cross Branch Integration Module. Instead of simple multitask learning, we apply CBI Module to enable intermediate feature exchange. As shown in Table 1, aggregating human and scene knowledge into action branch strengthens the learning ability of action branch. We further employ multiple CBI Modules at different stages, showing that higher accuracy can be obtained. According to experiment results, we finally apply 1 CBI at res4 and 1 CBI at res5 for a balance between accuracy and efficiency.
Action Knowledge Graph. The AKG is applied at the late stage of the framework, with the 3 possible relation functions described in Section 3.3. We compare their performance in Table 1. The AKG boosts performance by aggregating multiple branches and modelling the relations among action, human and scene representations. We find that dot product and embedded dot product are comparable, and both are slightly better than ReLU concatenation. We simply use dot product as the relation function in the remaining experiments.
|Backbones||TSN top-1||KINet top-1||Gain|

|Method||Top-1 (%)|
|Fixed auxiliary branches, KINet-ResNet50||70.5|
|Learnable auxiliary branches, KINet-ResNet50||72.4|

|Group||Method||Backbone||Top-1 (%)||Top-5 (%)|
|2D Backbones||TSN||ResNet50||69.5||88.9|
| ||TSN||Inception V3||72.5||90.2|
| ||Two-stream TSN||Inception V3||76.6||92.4|
|3D Backbones||ARTNet||ARTNet ResNet18 + TSN||70.7||89.3|
| ||ECO||ECO 2D+3D||70.0||89.4|
| ||S3D-G||S3D Inception||74.7||93.4|
| ||Nonlocal Network||I3D ResNet101||77.7||93.3|
| ||SlowFast||SlowFast 3D ResNet101+NL||79.8||93.9|
| ||I3D||I3D Inception||71.1||89.3|
| ||Two-stream I3D||I3D Inception||74.2||91.3|
| ||Two-stream KINet (Ours)||Inception V3||77.8||93.1|

|Method||Backbone||Accuracy (%)|
|TS TSN||BNInception||97.0|
|TS TSN||Inception V3||97.3|
|TS I3D (Carreira et al. 2017)||I3D Inception||98.0|
|TS KINet||Inception V3||97.8|
Entire KINet Framework. We combine all previously mentioned components with the baseline, TSN-ResNet50, for RGB-based action recognition with the entire Knowledge Integration Network. As shown in Table 1, the top-1 accuracy is boosted to 72.4%, compared with 69.5% for the baseline. This improvement of 2.9% on a video action recognition benchmark demonstrates the effectiveness of our proposed framework.
Effective Parameters. As shown in Table 2, although our method introduces more parameters due to the multi-branch setting, the overall number of parameters is still lower than that of TSN-ResNet200, while the accuracy is higher. This comparison indicates that the gain comes from the design of our framework rather than merely from the extra parameters introduced.
Different Backbones. We implement KINet with different backbones, to verify the generalization ability. The results in Table 3 show that our KINet can consistently improve the performance with different backbones.
Learnable parameters. To verify the impact of joint learning, we directly use the two teacher networks, with their weights fixed, to provide auxiliary information. The results are shown in Table 4. KINet significantly outperforms the fixed setting. We attribute this to the importance of pseudo-label-guided learning. With KINet, the auxiliary branches are jointly trained with the action recognition branch using the pseudo labels, so that the intermediate features of scene and human can be fine-tuned to better suit action recognition. In the fixed setting used in previous works, the auxiliary representation cannot be fine-tuned. Although the fixed auxiliary networks provide more accurate scene recognition and human parsing results than KINet, their improvement on the main task, action recognition, is smaller: 70.5% versus 72.4% top-1 accuracy, a gap of 1.9%.
4.4 Comparison with State-of-the-Art Methods
Here we compare our 2D KINet with state-of-the-art methods for action recognition, including 2D and 3D methods, on the action recognition benchmark Kinetics-400. We also include a two-stream version of KINet, where the RGB stream CNN is our KINet and the optical flow stream CNN is a standard TSN structure. As shown in Table 5, our method achieves state-of-the-art results on Kinetics. Although our network is based on 2D backbones, its performance is on par with state-of-the-art 3D CNN methods.
4.5 Transfer Learning on UCF-101
We further transfer the representation learned on Kinetics to the smaller dataset UCF-101 to check the generalization ability of our framework. Following the standard TSN protocol, we report the average of three train/test splits in Table 6. The results show that our framework pretrained on Kinetics has strong transfer learning ability. Our model also obtains state-of-the-art results on UCF-101.
As shown in Figure 5, our KINet has a clearer understanding of human and scene concepts than the baseline TSN. This integration of multiple kinds of domain-specific knowledge enables our KINet to recognize complex actions involving various high-level semantic cues.
We have presented new Knowledge Integration Networks (KINet) able to incorporate external semantic cues into action recognition via knowledge distillation. Furthermore, a two-level knowledge encoding mechanism is proposed by introducing a Cross Branch Integration (CBI) module for integrating the extra knowledge into medium-level convolutional features, and an Action Knowledge Graph (AKG) for learning meaningful high-level context information.
- [1] (2017) Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR.
- [2] (2018) Actor-action semantic segmentation with region masks. In BMVC.
- [3] (2009) ImageNet: A large-scale hierarchical image database. In CVPR.
- [4] (2015) Long-term recurrent convolutional networks for visual recognition and description. IEEE-PAMI.
- [5] (2018) SlowFast networks for video recognition. CoRR abs/1812.03982.
- [6] (2017) ActivityNet challenge 2017 summary. CoRR abs/1710.08011.
- [7] (2017) Look into person: self-supervised structure-sensitive learning and a new benchmark for human parsing. In CVPR.
- [8] (2019) StNet: local and global spatial-temporal modeling for action recognition. In AAAI.
- [9] (2016) Deep residual learning for image recognition. In CVPR.
- [10] (2017) SCC: semantic context cascade for efficient action detection. In CVPR.
- [11] (2015) Distilling the knowledge in a neural network. CoRR abs/1503.02531.
- [12] (2017) Densely connected convolutional networks. In CVPR.
- [13] (2010) Object, scene and actions: combining multiple features for human action recognition. In ECCV.
- [14] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML.
- [15] (2018) End-to-end joint semantic segmentation of actors and actions in video. In ECCV.
- [16] (2017) Joint learning of object and action detectors. In ICCV.
- [17] (2017) Semi-supervised classification with graph convolutional networks. In ICLR.
- [18] (2019) Temporal bilinear networks for video action recognition. In AAAI.
- [19] (2010) Rectified linear units improve restricted Boltzmann machines. In ICML.
- [20] (2015) Beyond short snippets: deep networks for video classification. In CVPR.
- [21] (2017) A simple neural network module for relational reasoning. In NIPS.
- [22] (2014) Two-stream convolutional networks for action recognition in videos. In NIPS.
- [23] (2012) UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR abs/1212.0402.
- [24] (2015) Learning spatiotemporal features with 3D convolutional networks. In ICCV.
- [25] (2017) Knowledge guided disambiguation for large-scale scene classification with multi-resolution CNNs. IEEE Transactions on Image Processing 26 (4), pp. 2055–2068.
- [26] (2018) Appearance-and-relation networks for video classification. In CVPR.
- [27] (2016) Temporal segment networks: towards good practices for deep action recognition. In ECCV.
- [28] (2018) Non-local neural networks. In CVPR.
- [29] (2018) Videos as space-time region graphs. In ECCV.
- [30] (2016) Two-stream SR-CNNs for action recognition in videos. In BMVC.
- [31] (2019) Geometric pose affordance: 3D human pose with scene constraints. arXiv preprint arXiv:1905.07718.
- [32] (2016) Harnessing object and scene semantics for large-scale video understanding. In CVPR.
- [33] (2018) Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification.
- [34] (2016) Actor-action semantic segmentation with grouping process models. In CVPR.
- [35] (2015) Can humans fly? Action understanding with multiple classes of actors. In CVPR.
- [36] (2017) Pyramid scene parsing network. In CVPR.
- [37] (2017) Places: a 10 million image database for scene recognition. IEEE-PAMI.
- [38] (2018) ECO: efficient convolutional network for online video understanding. In ECCV.