1 Introduction
Analyzing human activities in natural scenes is a fundamental task for many potential applications such as video surveillance [2018Real], key-event retrieval [2017ER3], social behavior interpretation [2017Social] and sports analysis [2018stagNet]. Abundant techniques have been developed for human activity recognition (HAR, where the goal is to assign an activity label to each image or video) [choi2011learning, patron:2012structured, ji20133d, Annane2014Two, KongInteractive, wu2019learning, 2020empowerRN], and they have made impressive progress in recognition accuracy. However, the task of human interaction understanding (HIU) is much less successful, mainly because current methods learn human interactive relations via shallow graphical representations [WangICCV2019, Wang2016A, wang2018understanding, patron:2012structured, choi2011learning, Yu2012Learning], which are inadequate to model complicated human activities.
Serving as an effective way of integrating CNNs for local feature extraction with graphical representations for relational-inductive-bias learning [battaglia2018relational], graph networks have recently achieved impressive success on multiple vision tasks such as collective activity recognition (CAR) [2020empowerRN, wu2019learning], skeleton-based human action recognition [sklActionRecogCVPR2019], gaze-communication understanding [liFengICCV19], feature matching [deepGraphNetFeatMatchingICCV2019] and one-shot semantic segmentation [pyGraphNetSemSegICCV2019]. Unfortunately, we find that a straightforward implementation of such graph networks for HIU can yield inconsistent predictions. Taking the prediction “” of the scene depicted by Figure 1 as an example, clearly “” contradicts “”. Another issue is that the predicted action labels for two interacting people can be incompatible, e.g., handshake versus kick. Our key observation is that these graph networks are designed to enhance local representations by a soft aggregation of non-local features (i.e., the graph attention mechanism), while they are unaware of the underlying logic, which is essential to achieve consistent HIU predictions.
As commonly done in the literature [WangICCV2019, wang2018understanding, Wang2016A], we decompose HIU into two subtasks: 1) the individual action prediction task, which assigns each participant an action label; 2) the pairwise interactive prediction task, which determines whether each pair of participants is interacting. Figure 1 gives an example of HIU, where targets and naturally form two groups of concurrent human activities. Solving the two subtasks provides a way to disentangle concurrent human activities with multiple participants, as well as a comprehensive understanding of surveillance scenes. We then present a logic-aware graph network (LAGNet) for HIU. As shown in Figure 1, LAGNet consists of a backbone CNN to extract image features, a graph network to learn relations among participants, and a logic-aware module to make consistent action and interaction predictions. All components of LAGNet can be trained jointly and efficiently with GPU acceleration. We empirically validate that these three components complement each other, and that their combination always delivers the best results.
Our contributions include four aspects. First, we propose a logic-aware graph network for HIU to overcome the inconsistent predictions made by recent graph networks. Second, we present an efficient mean-field inference algorithm to solve the logic-aware reasoning problem. Third, we create a new challenging benchmark for HIU, which will be made publicly available to the community. Finally, our proposed LAGNet outperforms state-of-the-art methods by clear margins on the four evaluated benchmarks.
2 Related Work
Action/Activity Recognition Since the invention of the two-stream network [Annane2014Two], numerous works on HAR (predicting an action class for each image or video) have been proposed [ji20133d, wang2015action, adascan, wang2016temporal, 2017I3D, li2019actional, yan2019pa3d]. These methods take the video as a whole and focus on extracting powerful features to represent human motions. Though they can be used to recognize the collective activity (CAR) of multiple participants, an increasing number of works justify the importance of modeling the spatiotemporal correlations among the action variables of different people [choi2011learning, lan2012discriminative, 2014WongUnderstanding, 2016StructureInferenceM, 2017Social, 2017GernShu, 2018Mostafa, 2018stagNet, wu2019learning, 2020empowerRN]. Early works in this vein explore conditional random fields (CRFs) [choi2011learning, lan2012discriminative, 2014WongUnderstanding], while recent efforts focus mostly on the joint learning of image features and human relations with RNNs [2016StructureInferenceM, 2017Social, 2017GernShu, 2018stagNet, shu2019hierarchical] or deep graphical models [2018Mostafa, wu2019learning, 2020empowerRN]. These approaches are designed to predict one activity category per input, leaving the HIU task largely unsolved.
Human Interaction Understanding In order to understand human interactions, abundant conditional random field (CRF)-based models have been proposed [Yu2012Learning, kong2015close, KongInteractive, patron:2010high, patron:2012structured, Wang2016A, wang2018understanding, WangICCV2019] to model the interactive relations in both spatial and temporal domains. The main drawback is that these CRFs are shallow graphical representations, which are neither effective at learning complicated human interactions nor efficient when solving the associated maximum a posteriori inference [WangICCV2019]. Moreover, they perform deep feature learning and relational reasoning separately, which typically leads to suboptimal HIU results. Our LAGNet addresses these issues by incorporating a graph network, which benefits from both the representative power of deep architectures and the attentive capability of graph convolution.
Graph Networks have become a popular choice for many vision tasks that involve modeling and reasoning about relations among components within a system [battaglia2018relational, kipf2018neural, zhenZhangICCV19, guoShengICCV19, liFengICCV19, ZhangNIPS2020]. Graph networks share the computational efficiency of deep architectures while being more powerful and flexible in modeling relations in non-grid structures, for instance, the correspondences between two sets of points in a matching problem [zhenZhangICCV19], the correlations between query and support pixels in one-shot semantic segmentation [guoShengICCV19], human gaze communication [liFengICCV19], and the inter-person relations for CAR [wu2019learning]. Essentially, these works implement graph attention, which enhances the feature representation of a node by aggregating features from other relevant nodes. We improve such graphical representations by incorporating a logic-aware reasoning module, which helps to reduce the chance of obtaining inconsistent HIU predictions.
Logical Reasoning As a way toward high-level intelligence, logical reasoning has seen a renaissance in recent years. Since traditional logical reasoning has relied on methods and tools that are very different from deep learning models, such as the Prolog language, SMT solvers and discrete algorithms, a key problem is how to bridge logic and deep models effectively and efficiently. Recent works view graph networks as a general tool to make such a connection. For example, [barcelo2019logical, battaglia2018relational] use graph networks to explicitly incorporate a logic reasoning bias, [mao2019neuro] builds a neuro-symbolic reasoning module to connect scene representation and symbolic programs, and [amizadeh2020neuro] introduces a differentiable first-order logic formalism for visual question answering. In contrast, we use well-established oracles to formalize the logic system, and fuse logic and graphical representations in an optimization problem whose solution delivers logic-compatible predictions for HIU.
3 Our Approach
Task Description and Notations Given an input image and the bounding boxes (RoIs) of detected human bodies, the HIU task decomposes into two subtasks: 1) predicting the action category for every individual where , and 2) predicting all pairwise interactive relations for each pair of people. Here the binary variable represents whether the th participant and the th participant are interacting () or not. All vectors in this paper are column vectors unless otherwise stated.
3.1 Model Overview
Figure 2 shows an overview of the proposed LAGNet, which consists of three key components: a base model, a graph network and a logic-aware reasoning module. Given an input image and the detected human bodies, the base model uses a backbone CNN to extract features from the input, which are then processed by a RoIAlign module [maskRCNNICCV2017] to generate local features for each individual. The local features are then passed to one FC layer to generate the initial node features for our graph network. We also compute a matrix beforehand based on the position of each participant, where represents the Euclidean distance between the centers of the associated bounding boxes. Afterwards, taking as inputs the local features and , we build a graph network to capture the attentive relations among people based on their interactive relations. Each node in the graph network represents the action label of the associated person, and each edge carries a weight (learned from data) encoding how likely the associated people are to be interacting. Next, we implement graph attention for each node by performing a weighted aggregation of features from neighboring nodes to update the node feature (Section 3.2). Though graph attention is able to enhance the feature representation of each node, the labeling consistency among nodes is largely neglected. To alleviate this, we introduce a logic-aware module, which essentially performs deductive reasoning using the logical system designed for HIU (Section 3.3). In practice the reasoning is achieved by solving a surrogate mean-field inference problem with high-order energy functions, such that all modules of the proposed LAGNet can be trained in an end-to-end manner with GPU acceleration.
3.2 The Graph Network
The graph network can be seen as an attention mechanism with a global receptive field, which augments the feature representation of each variable through a weighted aggregation of features from relevant variables. We closely follow the formulation of the graph network (GN) proposed by [wu2019learning], and make straightforward modifications to facilitate the subsequent logic-aware reasoning.
Graph Definition Our graph network uses a complete graph , which takes as its node set and as the edge set. As stated, each node is associated with an action variable to denote the action category performed by the th individual, and each edge takes a binary variable to represent whether the related people are interacting () or not ().
Relation Reasoning In order to evaluate the existence of interactive relations, we learn a confidence score for each edge through
(1) 
where , and are initial node features extracted by the base model. Here and denote two linear transformations that map local feature vectors to the space, and . is an indicator function which outputs if the Euclidean distance between the th target and the th target is smaller than a threshold , otherwise it gives . With this function, the graph network is able to filter out distant people (the trifling relations) before the later attentive aggregation.
Interactive Score For each , we compute a score vector using
(2) 
where the first entry represents the confidence of assigning to , while the second entry measures the confidence of assigning to . This score vector is taken as input by the subsequent logic-aware reasoning module.
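To make the relation reasoning step concrete, the following is a minimal sketch in plain Python. It assumes the embedded dot-product form used by [wu2019learning]: features are projected by two linear maps, their scaled dot product is masked by a distance threshold, and the surviving scores are normalized with a softmax over each person's neighbors. The names `relation_scores`, `theta`, `phi` and `mu` are hypothetical, and excluding self-relations is an assumption; none of this is taken from the authors' implementation.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relation_scores(feats, centers, theta, phi, mu):
    """Return an n x n matrix G of relation scores (illustrative only).

    feats      : list of n node feature vectors
    centers    : list of n (cx, cy) bounding-box centers
    theta, phi : linear maps (lists of rows) into a d_k-dimensional space
    mu         : distance threshold; pairs at distance >= mu get score 0
    """
    n = len(feats)
    q = [matvec(theta, x) for x in feats]
    k = [matvec(phi, x) for x in feats]
    d_k = len(q[0])
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        logits = {}
        for j in range(n):
            if i == j:
                continue  # assumption: no self-relation
            if math.dist(centers[i], centers[j]) < mu:
                dot = sum(a * b for a, b in zip(q[i], k[j]))
                logits[j] = dot / math.sqrt(d_k)
        if not logits:
            continue  # person i has no neighbor within the threshold
        m = max(logits.values())  # subtract the max for numerical stability
        exps = {j: math.exp(v - m) for j, v in logits.items()}
        total = sum(exps.values())
        for j, e in exps.items():
            G[i][j] = e / total
    return G
```

In this sketch a person with no neighbor inside the threshold simply receives an all-zero row, so distant people contribute nothing to later aggregation.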
Graph Attention Given the initial node features and the learned relations , we now update the node features via graph attention. Specifically, for any node , its new feature is computed through
(3) 
Above, we first aggregate messages (weighted by ) sent to from its neighboring nodes. Afterwards a linear transformation is applied, which maps the aggregated feature vector to the same feature space as that of . The resulting feature then undergoes a ReLU operation, which gives the updated node feature .
Action Score Finally, we calculate the classification scores (denoted by ) for individual actions by applying two linear transformations in succession to the updated node features. Specifically,
(4) 
where and are linear transformations which project their inputs to and spaces respectively. These scores are taken as inputs (together with the interactive scores) by the subsequent logic-aware reasoning module.
Parallel Computation In implementation, all linear transformations, , and can be applied to all nodes concurrently using matrix multiplication, as can the feature aggregation in Equation (3). For the relation reasoning in Equation (1), all confidence scores can likewise be computed simultaneously using matrix multiplication, element-wise multiplication and the softmax operation.
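The graph attention update and the action scoring described above can be sketched in a few lines of plain Python. This follows the textual description only (weighted aggregation, one linear map, ReLU, then two linear maps in succession); the function names and the convention that each weight matrix is stored as a list of output rows are assumptions made for illustration, and biases are omitted.

```python
def linear(X, W):
    """Apply y = W x to every row x of X; W is a list of output rows."""
    return [[sum(w * v for w, v in zip(row, x)) for row in W] for x in X]

def graph_attention(G, X, W):
    """Update node features: ReLU(W * sum_j G[i][j] * X[j]) for each node i.

    G : n x n relation scores, X : n x d node features, W : d x d linear map.
    """
    n, d = len(X), len(X[0])
    # Weighted aggregation of neighbor features (a matrix product G X).
    aggregated = [[sum(G[i][j] * X[j][c] for j in range(n)) for c in range(d)]
                  for i in range(n)]
    # Linear map back to the node feature space, followed by ReLU.
    return [[max(0.0, v) for v in row] for row in linear(aggregated, W)]

def action_scores(X_new, W1, W2):
    """Two linear transformations in succession give per-class action scores."""
    return linear(linear(X_new, W1), W2)
```

Because every step is a matrix product or an element-wise operation, the same computation maps directly onto batched tensor operations on a GPU.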
Comparison with Existing GN [wu2019learning] Our modified GN differs from the original GN [wu2019learning] in three aspects. First, the original GN is designed for CAR, which aims at assigning a single label to each image to describe the collective activity occurring there, whereas the modified version is designed for HIU. Second, the original GN treats as latent variables which are implicitly supervised by ground-truth labels of individual actions during training, whereas the modified version explicitly supervises the parameters of relation reasoning with annotated interactive relations (see Section 3.4). We find that such a straightforward modification offers a boost in HIU performance over the original GN. Third, we add the computation of interactive scores with Equation (2), which prepares inputs for the subsequent logic-aware module.
3.3 Logic-Aware Reasoning
We first present two deductive oracles for HIU (with logic symbols defined in Figure 1):

The compatibility oracle: For any pair of people who are interacting (), their action categories must be compatible (). In logical words, this rule is represented by .

The transitivity oracle: Considering the interactive relations among a triplet of people , we have .
Typical compatible examples include (handshake, handshake), (pass, receive) and (punch, fall), and typical incompatible examples are (handshake, hug), (punch, pass) and (high-five, handshake). Examples that obey or violate transitivity are provided in Figure 3. Though this oracle only considers triplets of people, it is straightforward to prove that the higher-order transitivity for any clique within is simply a consequence of the third-order transitivity. Intuitively, by enforcing transitivity across all triplets, participants in the scene are split into different groups, such that individuals in the same group are interacting with each other, while people in different groups have no interaction.
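The two oracles can be checked mechanically on a set of predictions. The sketch below is illustrative: the compatibility table contains only the example pairs listed above, and the function names and the representation of interacting pairs as a set of `frozenset` objects are assumptions. A triplet violates transitivity exactly when two of its three pairs are predicted as interacting and the third is not.

```python
from itertools import combinations

# Illustrative compatibility table built from the examples in the text;
# a real table would cover every action pair of the dataset.
COMPATIBLE = {("handshake", "handshake"), ("pass", "receive"), ("punch", "fall")}

def compatible(a, b):
    """Compatibility is treated as symmetric in the two participants."""
    return (a, b) in COMPATIBLE or (b, a) in COMPATIBLE

def violations(actions, interacting):
    """Return (compatibility violations, transitivity violations).

    actions     : list of predicted action labels, one per person
    interacting : set of frozenset({i, j}) pairs predicted as interacting
    """
    def z(i, j):
        return frozenset((i, j)) in interacting

    n = len(actions)
    # Compatibility oracle: interacting people must have compatible actions.
    comp = [(i, j) for i, j in combinations(range(n), 2)
            if z(i, j) and not compatible(actions[i], actions[j])]
    # Transitivity oracle: exactly two interacting pairs in a triplet is a violation.
    trans = [(i, j, k) for i, j, k in combinations(range(n), 3)
             if z(i, j) + z(j, k) + z(i, k) == 2]
    return comp, trans
```

For example, predicting that persons 0 and 1 interact and persons 1 and 2 interact, while persons 0 and 2 do not, is reported as a transitivity violation for the triplet (0, 1, 2).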
With such oracles, predictions of the graph network in Section 3.2 could be refined by applying the traditional logical reasoning algorithms. Unfortunately, it is well known that such reasoning algorithms are typically brittle and inefficient. Instead we resort to a workaround which first embeds the knowledge into an energy function defined by
(5) 
where and are data terms (computed by the graph network) that penalize particular label and label assignments respectively based on the learned deep representations from . The functions and are Potts models [kohli2007p3] defined by
(6) 
(7) 
Here the notation represents the action label is incompatible with the action label , is a set that includes all cases violating the transitivity oracle, and are penalties incurred by predictions which violate the compatibility and transitivity oracles. It is easy to check that when and are sufficiently large, minimizing the energy (5) delivers desirable and predictions which satisfy the compatibility and transitivity oracles. In this paper, instead of predesignating suitable and values, we learn them from training data in conjunction with other parameters of LAGNet.
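For readers who want an explicit form, the description above is consistent with an energy of the following shape. This is a reconstruction under assumptions rather than the authors' exact notation: the symbols $y_i$ (action of person $i$), $z_{ij}$ (interaction indicator), $\theta_i$, $\phi_{ij}$ (data terms), $\psi_c$, $\psi_t$ (compatibility and transitivity potentials), $\lambda_c$, $\lambda_t$ (penalties), $\not\sim$ (incompatibility) and $\mathcal{T}$ (the set of transitivity-violating configurations) are all chosen here for illustration.

```latex
E(\mathbf{y}, \mathbf{z}) =
    \sum_{i} \theta_i(y_i)
  + \sum_{i<j} \phi_{ij}(z_{ij})
  + \sum_{i<j} \psi_c(y_i, y_j, z_{ij})
  + \sum_{i<j<k} \psi_t(z_{ij}, z_{jk}, z_{ik}),

\psi_c(y_i, y_j, z_{ij}) =
  \begin{cases}
    \lambda_c & \text{if } z_{ij} = 1 \text{ and } y_i \not\sim y_j, \\
    0         & \text{otherwise,}
  \end{cases}
\qquad
\psi_t(z_{ij}, z_{jk}, z_{ik}) =
  \begin{cases}
    \lambda_t & \text{if } (z_{ij}, z_{jk}, z_{ik}) \in \mathcal{T}, \\
    0         & \text{otherwise,}
  \end{cases}

\mathcal{T} = \{(1,1,0),\ (1,0,1),\ (0,1,1)\}.
```

Under this form, any labeling that satisfies both oracles pays only the data terms, which is why sufficiently large penalties force the minimizer to be consistent.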
Mean-Field Inference Minimizing (5) is NP-complete. We derive an efficient mean-field inference algorithm by first approximating the joint distribution with a product of independent marginal distributions: . Then we derive the mean-field updates of all marginal distributions using the techniques described in [vineet2014filter], which gives
(8)
(9)
where is an indicator function and is a normalization constant. The marginal distributions on variables are
(10)  
(11) 
Here , and is a normalization constant. We initialize the marginal distributions , by applying the softmax function to the scores output by the graph network. The inference is summarized by Algorithm 1. Note that we can perform the updates of all expectations (Equations (8) and (10)) and marginal probabilities (Equations (9) and (11)) in parallel, which yields very efficient inference.
Table 1: Ablation results (%) on UT, BIT and TVHI.
Method | UT F1 | UT Accuracy | UT mean IoU | BIT F1 | BIT Accuracy | BIT mean IoU | TVHI F1 | TVHI Accuracy | TVHI mean IoU
VGG19 [Simonyan15] | 85.69 | 91.68 | 69.03 | 85.22 | 89.60 | 67.03 | 70.68 | 76.90 | 52.30
ResNet50 [He2016Deep] | 90.62 | 94.64 | 76.70 | 87.12 | 91.20 | 71.41 | 81.18 | 82.61 | 66.33
Inception V3 [szegedy2016rethinking] | 92.20 | 95.86 | 80.30 | 87.84 | 91.61 | 72.00 | 83.00 | 86.91 | 71.53
GN [wu2019learning] | 89.25 | 91.78 | 78.24 | 70.52 | 78.52 | 65.89 | 80.57 | 82.76 | 66.82
Modified GN (Section 3.2) | 93.38 | 96.39 | 84.13 | 89.95 | 93.38 | 76.42 | 84.18 | 87.86 | 71.31
Base model + LAR | 92.81 | 96.26 | 81.51 | 88.72 | 92.23 | 73.99 | 83.07 | 87.23 | 72.29
LAGNetC | 94.28 | 97.06 | 85.11 | 90.58 | 93.72 | 77.54 | 86.82 | 90.10 | 74.41
LAGNetT | 94.08 | 96.95 | 84.66 | 90.01 | 93.44 | 76.69 | 85.68 | 89.26 | 74.06
LAGNetFull | 94.38 | 97.06 | 85.27 | 90.73 | 93.83 | 77.86 | 87.59 | 90.30 | 75.07
Table 2: Comparison of CI with existing benchmarks. ANP and ANI are the average numbers of people and of concurrent interaction groups per frame.
Name | # Video | Resolution () | ANP | ANI
UT | 120 | | 2.0 | 1.0
TVHI | 300 | | 2.1 | 1.0
BIT | 400 | | 2.9 | 1.0
CI | 340 | | 7.8 | 2.5
As mentioned, Algorithm 1 is a surrogate of the logical reasoning task that takes the two oracles as its knowledge base. This algorithm forms the last layer of our network, which outputs updated action scores and updated interactive scores . Our experimental results in Section 4 demonstrate that these updated scores indeed deliver much better HIU results.
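To illustrate how such inference can repair inconsistent predictions, the following is a generic naive mean-field sketch in plain Python with synchronous (parallel) updates. It is not a transcription of Algorithm 1 or of Equations (8) to (11): it assumes data terms equal to negated network scores, a pairwise compatibility penalty `lam_c` and a triplet transitivity penalty `lam_t`, and all function and parameter names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    t = sum(e)
    return [v / t for v in e]

def mean_field(act_scores, int_scores, incompatible, lam_c, lam_t, iters=10):
    """Naive mean-field inference over action and interaction variables.

    act_scores[i][a]   : score of action a for person i (higher is better)
    int_scores[(i, j)] : [score for z=0, score for z=1], keys with i < j
    incompatible[a][b] : True if actions a and b cannot co-occur in an interaction
    Returns the marginals (Qy, Qz).
    """
    n, A = len(act_scores), len(act_scores[0])
    Qy = [softmax(s) for s in act_scores]              # initial action marginals
    Qz = {e: softmax(s) for e, s in int_scores.items()}  # initial edge marginals
    key = lambda i, j: (i, j) if i < j else (j, i)
    for _ in range(iters):
        new_Qy = []
        for i in range(n):
            logits = []
            for a in range(A):
                # Expected compatibility penalty of assigning action a to person i.
                pen = sum(Qz[key(i, j)][1] * Qy[j][b]
                          for j in range(n) if j != i
                          for b in range(A) if incompatible[a][b])
                logits.append(act_scores[i][a] - lam_c * pen)
            new_Qy.append(softmax(logits))
        new_Qz = {}
        for (i, j) in Qz:
            # Probability that the two current action marginals are incompatible.
            p_inc = sum(Qy[i][a] * Qy[j][b]
                        for a in range(A) for b in range(A) if incompatible[a][b])
            v0 = v1 = 0.0  # expected transitivity violations if z_ij = 0 / 1
            for k in range(n):
                if k in (i, j):
                    continue
                p, q = Qz[key(i, k)][1], Qz[key(j, k)][1]
                v0 += p * q                        # both other edges on
                v1 += p * (1 - q) + (1 - p) * q    # exactly one other edge on
            s0, s1 = int_scores[(i, j)]
            new_Qz[(i, j)] = softmax([s0 - lam_t * v0,
                                      s1 - lam_c * p_inc - lam_t * v1])
        Qy, Qz = new_Qy, new_Qz  # synchronous update of all marginals
    return Qy, Qz
```

Every update uses only sums and products of the current marginals, so the whole loop is differentiable with respect to the scores and the penalties, which is the property that allows such a layer to be trained jointly with the rest of a network.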
3.4 EndtoEnd Learning
The mean-field inference algorithm allows the backpropagation of the error signals to all parameters of LAGNet (including those of the base model, the graph network and the logic-aware reasoning module), which enables the joint training of all parameters from scratch. In practice, we resort to a two-stage training scheme due to limited computational resources. The first stage learns a base model with the backbone CNN initialized by a model pretrained on ImageNet. The second stage trains the graph network, and jointly with fixed backbone parameters. Both stages compute losses by summing the cross-entropy losses computed on and predictions.
Implementation Details Our implementation is based on the PyTorch deep learning toolbox and a workstation with three NVIDIA GeForce GTX 1080 Ti GPUs. We test several backbone CNNs including VGG19 [Simonyan15], ResNet50 [He2016Deep] and Inception V3 [szegedy2016rethinking]. We use the official implementation of RoIAlign by PyTorch, which outputs feature maps with a size of (using Inception V3). We add dropout (with a ratio of 0.3) followed by layer normalization to every FC layer of LAGNet except for the ones computing final classification scores. We set and for initialization. We adopt mini-batch training with the Adam optimizer to learn the network parameters, and train all models for 200 epochs. We augment training data with random combinations of scaling, cropping, horizontal flipping and color jittering. For the scaling and flipping operations, the bounding boxes are scaled and flipped as well.
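The training objective of Section 3.4, a sum of cross-entropy losses on the action and interaction predictions, can be sketched as follows. The function names are hypothetical, and averaging each term over its instances before summing is an assumption, since the text does not state the normalization.

```python
import math

def cross_entropy(scores, label):
    """Negative log-softmax probability of the true label."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[label]

def hiu_loss(act_scores, act_labels, int_scores, int_labels):
    """Sum of the action loss and the interaction loss.

    act_scores : per-person action score vectors
    int_scores : per-pair [score for non-interacting, score for interacting]
    Averaging within each term is an assumption of this sketch.
    """
    l_act = sum(cross_entropy(s, y)
                for s, y in zip(act_scores, act_labels)) / len(act_labels)
    l_int = sum(cross_entropy(s, z)
                for s, z in zip(int_scores, int_labels)) / len(int_labels)
    return l_act + l_int
```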
4 Dataset, Experiment and Result
Existing Datasets Our experiments use three existing datasets: UT [UTInteractionData], BIT-Interaction [Yu2012Learning] and TVHI [patron:2010high]. UT contains 120 short videos of 6 action classes: handshake, hug, kick, punch, push and no-action. As done by [wang2018understanding], we extend the original action classes by introducing a passive class for each of the three asymmetrical action classes kick, punch and push (be-kicked, be-punched and be-pushed). As a result, we have 9 action classes in total. Following [wang2018understanding], we split the samples of UT into 2 subsets for training and testing. BIT-Interaction covers 9 interaction classes including box, handshake, high-five, hug, kick, pat, bend, push and others, where each class contains 50 short videos. For each class, 34 videos are chosen for training and the rest for testing, as recommended by [Yu2012Learning]. TVHI contains 300 short videos of television shows, which cover 5 action classes including handshake, high-five, hug, kiss and no-action. As suggested by [patron:2010high], we split the samples of TVHI into two parts for training and testing.
4.1 The New Dataset
To the best of our knowledge, existing datasets for HIU (such as UT, BIT and TVHI) are not challenging enough, mainly because each frame contains only a few people and one category of human activity. We address this by proposing a new dataset, namely CampusInteraction (CI), captured in a campus environment; see the supplementary material for a few examples. CI contains 340 short videos and 10 typical action classes including kick, steal, high-five, pass, handshake, hug, support, talk, punch and others. Here others indicates any action category beyond the first 9 classes. All interactive classes contain exactly 2 participants except for talk, which might include more people.
Each CI video is captured in either an open square or a sports ground on campus. The videos may include heavy occlusions, and the appearances, poses and heights of different individuals can vary a lot, which makes the dataset more realistic. Moreover, CI contains both symmetric interactions (such as handshake, which involves two participants performing the same action) and asymmetric interactions (such as kick, which involves one person performing kick and the other performing be-kicked).
Table 2 compares CI with existing benchmarks. Each frame of the CI videos contains an average of 7.8 people, which significantly surpasses UT, BIT and TVHI. Moreover, frames of CI that contain human interactions include 2.5 clusters of concurrent human interactions on average, while the existing benchmarks have at most one group of human interaction per frame.
CI offers frame-by-frame annotations, including the bounding box and the action label of each individual, as well as the pairwise interactive relations among people. Using Vatic [ijcvVatic], we annotate 438,452 bounding boxes in total. The annotation takes around 340 hours, as each video takes about an hour.
We split CI into training and testing sets with 255 and 85 videos respectively. To avoid overfitting, there is no source overlap among videos in different sets. In this paper, we sample each video every 10 frames for experiments.
4.2 Ablation Study
Evaluation Metric Since the numbers of instances across different classes are significantly imbalanced, we use F1-score, overall accuracy and mean IoU as the evaluation metrics. Specifically, we calculate the macro-averaged F1 scores on and predictions respectively (using the f1_score function in the sklearn package), and present the mean of the two F1 scores. Likewise, overall accuracy is the mean of the action-classification accuracy and the interactive-relation-classification accuracy. To obtain mean IoU, we first compute the IoU value on each class, then average all IoU values. Next we validate the capabilities of the different components in LAGNet, using the results in Table 1.
Choice of Backbone CNN. Here we evaluate base models (see Figure 2) that take different backbone CNNs to extract image features. We test three popular backbones: VGG19 [Simonyan15], ResNet50 [He2016Deep] and Inception V3 [szegedy2016rethinking], and the results correspond to the first three rows (from top to bottom) in Table 1. Inception V3 performs notably better than the other backbones on all benchmarks. The reason might be that Inception V3 is able to learn multi-scale feature representations, which stack into a feature pyramid to better capture the appearance of human motions. Hence we use Inception V3 as the backbone for all subsequent experiments.
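The evaluation metrics described above can be written out in plain Python as follows. This is a sketch: the per-class IoU is taken to be TP / (TP + FP + FN), which is an assumption about the intended definition, a class with no instances and no predictions is scored as 0, and the helper names are hypothetical. The macro-averaged F1 here is meant to mirror macro averaging as offered by sklearn's f1_score.

```python
def class_counts(y_true, y_pred, c):
    """True positives, false positives and false negatives for class c."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == c and p == c for t, p in pairs)
    fp = sum(t != c and p == c for t, p in pairs)
    fn = sum(t == c and p != c for t, p in pairs)
    return tp, fp, fn

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in classes:
        tp, fp, fn = class_counts(y_true, y_pred, c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def mean_iou(y_true, y_pred, classes):
    """Unweighted mean of the per-class IoU values (assumed TP / (TP + FP + FN))."""
    scores = []
    for c in classes:
        tp, fp, fn = class_counts(y_true, y_pred, c)
        denom = tp + fp + fn
        scores.append(tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Each reported number would then be the mean of the metric computed on the action predictions and the same metric computed on the interaction predictions.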
Effect of Graph Attention. A graph network (GN) is typically viewed as an attention mechanism which selectively collects relevant local information from non-local regions to perform inductive inference [battaglia2018relational]. For the HIU task, GN is expected to enhance the action representation of each individual by aggregating features from other relevant participants. Somewhat surprisingly, GN [wu2019learning] (designed for CAR) performs even worse than the base model on all evaluated benchmarks. The situation reverses with the Modified GN, which surpasses the base model on all benchmarks in terms of F1 and accuracy (e.g., the improvements on BIT are 2.11, 1.77 and 4.42 points in terms of F1, accuracy and mean IoU respectively, according to Table 1). The reason is probably that, without supervision from annotated human interactive relations, the graph network tends to learn spurious attention, which in turn pollutes the original feature representations.
Power of Logic-Aware Reasoning. Here we compare four models: 1) Base model + LAR, which consists of a base model followed by the proposed logic-aware reasoning module; 2) LAGNetC, the proposed LAGNet without the transitivity oracle; 3) LAGNetT, the proposed LAGNet without the compatibility oracle; 4) the full LAGNet. We can draw two conclusions based on the results. First, the combination of the modified graph network and both oracles (LAGNetFull) yields the best results, which validates our observation that graph attention and logic-aware reasoning complement each other in modeling human interactions. Second, the proposed LAGNet outperforms the second best (i.e., the Modified GN) on all evaluated benchmarks. Specifically, according to Table 1, the gains in F1, accuracy and mean IoU are 3.41, 2.44 and 3.76 points on TVHI, 1.00, 0.67 and 1.14 points on UT, and 0.78, 0.45 and 1.44 points on BIT, which demonstrates the power of the proposed logic-aware reasoning module.
Table 3: Comparison with recent models on TVHI.
Method | F1 (%) | Accuracy (%) | mean IoU (%)
GN [wu2019learning] | 80.57 | 82.76 | 66.82
Joint + AS [wang2018understanding] | 83.50 | 87.33 | 71.64
QP + CCCP [WangICCV2019] | 83.42 | 87.25 | 71.61
LAGNet (ours) | 87.59 | 90.30 | 75.07
Table 4: Comparison with recent models on UT.
Method | F1 (%) | Accuracy (%) | mean IoU (%)
GN [wu2019learning] | 89.25 | 91.78 | 78.24
Joint + AS [wang2018understanding] | 92.20 | 95.86 | 80.30
QP + CCCP [WangICCV2019] | 89.71 | 93.23 | 80.35
LAGNet (ours) | 94.38 | 97.06 | 85.27
Table 5: Comparison with recent models on BIT.
Method | F1 (%) | Accuracy (%) | mean IoU (%)
GN [wu2019learning] | 70.52 | 78.52 | 65.89
Joint + AS [wang2018understanding] | 88.61 | 91.77 | 72.12
QP + CCCP [WangICCV2019] | 88.80 | 91.92 | 72.46
LAGNet (ours) | 90.73 | 93.83 | 77.86
Table 6: Comparison with recent models on CI.
Method | F1 (%) | Accuracy (%) | mean IoU (%)
Base model | 65.44 | 84.14 | 39.58
GN [wu2019learning] | 49.50 | 79.00 | 35.36
Modified GN | 66.63 | 85.72 | 42.76
Joint + AS [wang2018understanding] | 65.88 | 84.23 | 40.57
QP + CCCP [WangICCV2019] | 65.92 | 84.47 | 40.86
LAGNet (ours) | 67.80 | 86.79 | 44.61
4.3 Comparison with Recent Models
We consider three recent approaches. Joint + AS [wang2018understanding] first extracts motion features of individual actions with a backbone CNN. Afterwards the deep and contextual features of human interactions are fused by a structured SVM. This method is able to predict and in a joint manner. QP + CCCP [WangICCV2019] also uses a structured model to represent the correlations between and variables, and develops a new inference algorithm (namely QP + CCCP) to solve the related inference problem. GN [wu2019learning] is a recent state-of-the-art method for recognizing collective human activities. This model is empowered by both the representative ability of deep CNNs and the attention mechanism of graph networks. Note that GN does not yield predictions. We fix this by setting () if the learned confidence score (). For fair comparison, all methods take Inception V3 as the backbone to extract image features. Results on the four datasets are provided in Table 3 to Table 6. LAGNet outperforms GN and the shallow structured models (Joint + AS and QP + CCCP) on all evaluated benchmarks. Compared with LAGNet, GN lacks both supervision on interactive relation prediction and a logic-aware reasoning module, and consequently performs much worse on HIU. Although they share the same feature extractor (Inception V3) with LAGNet, Joint + AS and QP + CCCP learn human interactive relations via shallow structured models without attention or logic-aware reasoning, hence they perform worse than LAGNet. Note that the performance of all methods on CI is much worse than on the existing datasets, which makes CI a new challenge for future HIU study.
To provide a qualitative analysis of different models, we visualize a few predictions in Figure 4. Although the base model and the modified GN predict incompatible action classes or interactive relations that violate the transitivity oracle, LAGNet makes almost perfect predictions thanks to the logic-aware reasoning module. Also note that the last example includes three concurrent human interactions (talk, handshake and steal), which are correctly identified by LAGNet.
5 Conclusion
We have presented a Logic-Aware Graph Network for human interaction understanding, which requires predicting individual actions as well as interactive relations among people. Compared with previous methods based solely on reasoning about attentive relations, our logic-aware graphical model leverages both attention and well-established human knowledge, which helps to reduce the chance of predicting incompatible action labels as well as inconsistent interactive relations. We further proposed a mean-field-style inference algorithm such that all modules within our network can be trained in an end-to-end manner. We have also presented a challenging dataset for HIU. Experiments on existing benchmarks and the new dataset show that our proposed method outperforms the baseline models and achieves new state-of-the-art performance.