Manipulation-skill Assessment from Videos with Spatial Attention Network

01/09/2019 ∙ by Zhenqiang Li, et al. ∙ The University of Tokyo

Recent advances in computer vision have made it possible to automatically assess, from videos, the manipulation skills of humans performing a task, which has many important applications in domains such as health rehabilitation and manufacturing. However, previous methods used the entire video appearance as input and did not consider the attention mechanism humans use when assessing videos, which may limit their performance since only some video regions are critical for skill assessment. Our motivation here is to model human attention in videos so that the model focuses on the most relevant video regions for better skill assessment. In particular, we propose a novel deep model that learns spatial attention automatically from videos in an end-to-end manner. We evaluate our approach on a newly collected dataset of an infant grasping task and four existing datasets of hand manipulation tasks. Experimental results demonstrate that state-of-the-art performance can be achieved by considering attention in automatic skill assessment.




1 Introduction

Skill assessment is a type of evaluation often used to determine the skills and abilities a person has. Among the many kinds of skill, the assessment of manipulation skill is of particular interest and is widely used in health rehabilitation and in professional environments such as surgery and manufacturing. For example, a pediatrician assesses the motor skills of infants to diagnose their developmental progress, and a factory owner assesses the skills of employees in order to improve their work performance. While a professional supervisor may find it easy to assess the skills of an apprentice, skill assessment becomes difficult where no professional supervisor is available, e.g., in rural areas.

With recent advances in computer vision, automatic skill assessment from videos is believed to have many potential applications and has begun to attract research interest in recent years [23, 30, 35, 2, 12]. In particular, the effectiveness of deep learning techniques in automatic skill assessment has recently been demonstrated. However, in such an approach, the skill level is determined based on low-level features extracted globally from the whole image, which may be limited in capturing fine details of motion. Skill assessment requires careful observation of the details of the action, which usually take place in a relatively small region. For example, when experts assess the shooting skill of a basketball player, they should pay more attention to the hand region than to the player's face or clothing.

Moreover, the region of attention is not determined solely from the current frame of the video, as knowledge accumulated from previous observations also plays a significant role. Consider how a human expert assesses skill from a video, which provides useful inspiration for network design. Spatial attention depends on past observation of the video: an expert cannot find an appropriate region to focus on based only on the current video frame. Instead, the expert first considers the history of the video and then locates the critical region in the current frame. For instance, when assessing the shooting skill of a basketball player, attention is first paid to the pose of the two hands holding the ball; when the player shoots, attention moves away from the hands to the trajectory of the ball towards the basket. Therefore, a network should determine its attention based not only on the low-level information of the current frame, but also on high-level knowledge of the process observed so far.

In this paper, we propose a deep model for skill assessment, in which a novel recurrent neural network (RNN) based module is developed for estimating spatial attention. At each time step, our model first encodes the appearance and motion information of the input frame with fully convolutional layers. The proposed spatial attention module then estimates a spatial attention map by jointly considering the encoded features of the current frame and the skill-related knowledge constructed from previous observations. The encoded features are filtered via weighted average pooling using the estimated attention map. The attention-filtered features are used as input to a temporal aggregation module (implemented with an RNN), which updates the skill-related knowledge over time. After all frames have been processed, the accumulated information is used to assess the skill shown in the video.

To evaluate our approach, we use existing public datasets of hand manipulation tasks as well as a newly collected dataset that records the visuomotor skills of infants at different ages (called the “Infant Grasp Dataset”). To reduce the difficulty of annotation, we annotate the videos in pairs and use a pairwise deep ranking technique to train our model. Pairwise ranking also serves as a form of data augmentation. The Infant Grasp Dataset contains 4371 video pairs formed from 94 videos of an object-grasping task. Experimental results show that our proposed approach not only achieves state-of-the-art performance but also learns meaningful attention for video-based skill assessment.

Main contributions of this paper are summarized as follows:

  • We propose a novel attention-based deep learning approach for skill assessment from videos. To the best of our knowledge, this is the first work to consider an attention mechanism in skill assessment.

  • We collect and annotate a new dataset for skill assessment, which records an object-grasping task performed by infants at different ages.

  • Our method has achieved state-of-the-art performance in skill assessment.

2 Related work

In this section, we review prior work on video-based skill assessment, covering both task-specific methods (e.g., for sports or surgical tasks) and general skill assessment methods. We also introduce the attention mechanism in video representation. Finally, we relate this work to deep ranking approaches, on which our method is based.

2.1 Skill assessment

Only a few works address skill assessment, and most of them focus on surgical tasks, probably due to the strong demand in this area [30, 37, 47, 48, 49]. For example, Zia et al. utilize the repetitive nature of surgical tasks and rely on the entropy of repeated motions to identify different skill levels. In [30], a combination of video and kinematic data is used to rank skill in two surgical tasks. Sharma et al. [37] use motion textures to predict a measure of skill specific to the surgical domain called “the OSATS criteria”. There are also some works on sports skill assessment [2, 4, 23, 25, 35, 33]. For instance, [2] proposed a pairwise deep ranking model to assess the skill of a basketball player from first-person videos. However, the method proposed in that paper is specifically designed for basketball and is hard to generalize to other tasks. [4, 33] focus on the quality of motion, and [35] uses a regression framework and appearance features to learn to assess the quality of human actions from videos. These works are also limited in generalizability, since either appearance (or pose) information or motion information is lost. Neither quality of motion nor appearance on its own is sufficient to determine skill level [12].

A number of previous methods treat the skill level only in a coarse manner: in [49], skill is split only into novice and expert. The works [15, 49] determine skill labels from participants' previous experience rather than from their performance in individual videos. In this work, we aim to rank the skill shown in each video instead of classifying videos as expert or novice.

Perhaps the work most similar to ours is [12], in which a two-stream pairwise deep ranking model with a newly designed loss function is used for skill assessment. However, their method is purely bottom-up and does not use high-level information related to the task or skill to guide the bottom-up feed-forward process. This may result in worse performance since too much redundant information is observed. In this work, we use an attention mechanism to guide the bottom-up feed-forward process. Our attention is learned from both low-level information extracted globally from each frame and skill-related information accumulated from previous observations. It dynamically generates spatial attention maps to guide the bottom-up features, thus achieving better performance.

Figure 1: Illustration of our skill assessment framework. At every time step, the network takes an RGB image and the corresponding stacked optical flow images as input and first represents them as deep features. The spatial attention module is then used to generate an attention map by integrating the global information from the deep features and the skill-related information from the top part of the framework. Meanwhile, the temporal evolution of spatial attention is also incorporated implicitly in the module. An attended feature vector is then generated by weighted pooling of the deep features according to the estimated attention map. The feature vector is forwarded to an RNN for temporal aggregation. The output of this RNN at the final time step is used to yield a ranking score.

2.2 Attention mechanism in video representation

Evidence from the human perception process shows the importance of the attention mechanism [31], which uses top-down information to guide the bottom-up feed-forward process [41]. Recently, tentative efforts have been made towards applying attention in deep neural networks for various tasks such as image recognition [19, 41, 43], visual question answering [9, 29], image and video captioning [1, 6, 7, 14] and visual attention prediction [20, 36, 21, 22].

In video representation, the majority of works utilize the attention mechanism in action recognition [28, 27, 26, 44] and action localization [5]. In [39], an end-to-end spatiotemporal attention model is used to recognize actions from skeleton data. They use an LSTM-like structure and construct joint selection gates and frame selection gates for modeling spatial and temporal attention. Girdhar et al. [16] propose an attentional pooling method based on second-order pooling for action recognition. Both saliency-based and class-specific attention are considered in their model; however, the attention is learned statically from each frame, and no temporal information between frames is considered. [13] incorporates a pose-based attention mechanism into recurrent networks to learn complex temporal motion structures for action recognition. Although the temporal relationship of attention is modeled by a recurrent structure, the model cannot generalize to situations where the pose cannot be obtained from appearance.

In this work, we propose a new attention module for skill assessment, which makes use of skill-related knowledge to guide the model to focus on regions that are highly correlated with the task. Moreover, as the attention in a frame is not independent of the attention in neighboring frames, we incorporate recurrent networks into our model to leverage the relationship between attention in consecutive frames.

2.3 Deep ranking

The most widely used ranking formulation is pairwise ranking. It was originally designed to learn search engine retrieval functions from click-through data, but it has since been adopted in other ranking applications such as relative attributes in images [32]. Ranking aims to minimize the number of incorrectly ordered pairs of elements by training a binary classifier to decide which element in a pair should be ranked higher. This formulation was first used by Joachims et al. in RankSVM [24], where a linear SVM is used to learn a ranking. Ranking was first integrated into neural network learning frameworks in [3]. [45] uses a pairwise deep ranking model for highlight detection in egocentric videos. In this work, we use pairwise deep ranking as the training scheme, not only to ease the difficulty of ground truth labeling but also to provide augmented data (pairs) for training our model.

3 Approach

In this section, we first describe the overview of the framework for skill assessment. Then we introduce the details of each part, especially the proposed spatial attention module which integrates temporal evolution patterns into the estimation of the spatial attention. We also describe the pairwise ranking scheme for training our model.

3.1 Model architecture

Our goal is to learn models for skill assessment in different tasks. Given a video recording the whole process of completing a certain task, our model estimates a score that assesses the skill shown in the video. Figure 1 depicts the architecture of our model.

As is done in [12], we split the video into segments and select one frame randomly in each segment to form a sparse representation of the whole video. At every time step, the feature encoding module extracts deep features from a single RGB image and the corresponding stacked optical flow image, as in [38]. The spatial attention module estimates an attention map based on the low-level deep features and a high-level skill-related vector generated by the top temporal aggregation module. An attended vector at each time step is then obtained by pooling the deep features with weights derived from the attention map. This vector is fed into the top temporal aggregation module, which aggregates vectors over time and outputs a final score. We describe the details of each module in the following subsections.
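The sparse sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the segment count of 25 and the train/test rule (random frame for training, last frame of each segment for testing) are taken from Section 3.5, and frames are represented only by their indices.

```python
import random


def segment_bounds(num_frames, num_segments=25):
    """Split range(num_frames) into near-equal, non-empty [start, end) segments."""
    if num_frames < num_segments:
        raise ValueError("video is shorter than the number of segments")
    # Integer floor division keeps every segment at least one frame long.
    edges = [(k * num_frames) // num_segments for k in range(num_segments + 1)]
    return [(edges[k], edges[k + 1]) for k in range(num_segments)]


def sample_frames(num_frames, num_segments=25, training=True, rng=random):
    """Pick one frame index per segment.

    Training: a random frame inside each segment.
    Testing: the last frame of each segment.
    """
    picks = []
    for start, end in segment_bounds(num_frames, num_segments):
        picks.append(rng.randrange(start, end) if training else end - 1)
    return picks


if __name__ == "__main__":
    # A 100-frame video split into 25 segments of 4 frames each.
    print(sample_frames(100, training=False))
```

Because one index is drawn per segment, the sampled sequence always covers the full duration of the video regardless of its length.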

Figure 2: The details of the spatial attention module. To estimate the skill-related significance of the regional vectors at different locations, the module incorporates not only low-level information extracted globally from the deep feature maps, but also information for skill assessment accumulated by the high-level RNN (the top RNN in Figure 1). We also use a second RNN to learn the evolution pattern of attention; its output is used to estimate attention weights for all locations in the deep feature map.

3.2 Feature encoding

From each segment, the module takes an RGB image and the corresponding stacked optical flow images as input, since appearance and motion are both important for skill assessment. The feature encoding module first extracts two deep feature maps, one from the RGB image and one from the stacked optical flow images, by feeding them into two separate ResNet101 networks. The ResNet101 networks are pre-trained on ImageNet [11] and then fine-tuned on UCF101 [40] in a two-stream framework for action recognition [38]. As in [21, 17], we extract the deep features from the last convolution layer of the 5th convolutional block.

Then we build a two-layer convolution network for spatiotemporal feature fusion, which takes the two feature maps as input and outputs the fused spatiotemporal deep representation.


3.3 Attention pooling

We aim to apply attention to guide the bottom-up feed-forward process. In our work, this is done by the proposed attention pooling module, which dynamically adjusts spatial attention based on both the low-level information and the high-level knowledge of the skill. Specifically, at each time step, the attention pooling module accepts two inputs: the frame's deep feature maps as the low-level information, and the skill-related vector as the high-level information. The two inputs are explained in detail below.

To extract a compact low-level representation vector from the deep feature maps, following [43], we first squeeze the spatial information of the deep feature map by performing average-pooling and max-pooling over all spatial locations. The two resulting features are summed to form a highly abstract low-level representation vector for the current frame.
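The squeeze step described above can be illustrated with a small sketch. The function name and the list-of-lists layout (one channel vector per spatial location) are assumptions made here for illustration; in practice the same operation would be applied to a tensor.

```python
def squeeze_feature_map(feature_map):
    """Collapse a feature map into one vector by summing average- and max-pooling.

    feature_map: list of spatial locations, each a list of C channel values.
    Returns a list of C values: per-channel average plus per-channel maximum.
    """
    num_locations = len(feature_map)
    num_channels = len(feature_map[0])
    pooled = []
    for c in range(num_channels):
        column = [feature_map[i][c] for i in range(num_locations)]
        avg_value = sum(column) / num_locations  # spatial average-pooling
        max_value = max(column)                  # spatial max-pooling
        pooled.append(avg_value + max_value)
    return pooled


if __name__ == "__main__":
    # Two locations, two channels: averages are [2, 2], maxima are [3, 4].
    print(squeeze_feature_map([[1.0, 0.0], [3.0, 4.0]]))  # [5.0, 6.0]
```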


The high-level representation vector and the low-level representation vector both serve as a basis for estimating the attention map. We call this part the spatial attention module and show its details in Figure 2. Briefly, we concatenate the two vectors to obtain a single vector for estimating the attention map.


A recurrent neural network (RNN) [8] is adopted to learn the transition pattern of attention across time steps. The output of this attention RNN at each time step is a vector integrating the current low-level information and the skill-related high-level information.


Given this output vector and the deep feature maps, the model generates an attention weight for every spatial location of the feature maps and normalizes the weights with an activation function. With this procedure, the attention at each spatial location is guided by the accumulated information, which leads to a better decision on the importance of each specific location.


We call this part the Attend part of the attention pooling module (Figure 2).

The attended image feature, which is used as input to the final RNN for skill determination, is calculated as a convex combination of the feature vectors at all locations, with the attention weights as coefficients.
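The normalization and pooling steps above can be sketched as follows. Two points are assumptions made here: the normalizing activation is taken to be a softmax (which yields non-negative weights summing to one, as a convex combination requires), and the unnormalized per-location scores are passed in directly rather than computed from the attention RNN output. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of unnormalized scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(feature_map, scores):
    """Weighted average of location vectors using softmax-normalized scores.

    feature_map: list of spatial locations, each a list of C channel values.
    scores: one unnormalized attention score per location (assumed given).
    Returns (weights, attended_vector).
    """
    weights = softmax(scores)
    num_channels = len(feature_map[0])
    attended = [0.0] * num_channels
    for weight, vector in zip(weights, feature_map):
        for c in range(num_channels):
            attended[c] += weight * vector[c]
    return weights, attended


if __name__ == "__main__":
    # Equal scores give uniform weights, so the result is the plain average.
    weights, attended = attention_pool([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
    print(weights, attended)
```

With strongly peaked scores the attended vector approaches the feature vector of the single highest-scoring location, which is how the map suppresses task-unrelated regions.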


3.4 Temporal aggregation

In [12, 38, 42], at every time step, after the high-level representation of the actions in one snippet is obtained, an MLP-based network is used to estimate a score for skill determination or action recognition. However, in skill assessment, the skill level is determined by the temporal evolution of actions, so it is hard to judge from short video clips captured at a single moment. For this reason, we use a recurrent neural network to aggregate the changing action information over time and estimate the skill score at the final step.

We aggregate the feature vectors temporally using an RNN. The output of this RNN at the final time step is passed to a fully connected layer (FC) to obtain the final output score.
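The aggregation described above can be written compactly as below. The notation is an assumption introduced here for illustration: f_t is the attended feature vector at time step t, h_t is the state of the top RNN, T is the number of sampled frames, and S is the final score.

```latex
h_t = \mathrm{RNN}_{\mathrm{top}}\left(f_t,\, h_{t-1}\right), \quad t = 1, \dots, T,
\qquad
S = \mathrm{FC}\left(h_T\right)
```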


Here the skill-related vector is the high-level information about the video’s skill level. Besides being used to predict the final output score of the video, the skill-related vector is also fed into the attention pooling module mentioned previously.

Figure 3: Image examples of two videos showing different skill levels in our Infant Grasp Dataset. The skill level in the bottom row is better than in the top row because the action of putting is not continuous in the top row. Infants’ faces are blurred for privacy.

3.5 Training and implementation details

We use a pairwise ranking framework [45, 12] for training, which requires only pairwise annotations to assess the skill of videos showing the same action. More precisely, given the pairs of videos formed from a set of videos, the annotation only needs to rank the relative skill level of the two videos in each pair rather than give an exact score to each video.


Since swapping the two videos in a pair simply reverses the preference, we only need one annotation per pair of videos. In our pairwise ranking framework, the two videos in a pair are fed into a Siamese architecture consisting of two identical models with shared weights.

Consider the score output by our model for a video, and consider all pairs of videos in a training set that carry a skill preference. Over these pairs, the model learns to minimize a margin-based pairwise ranking loss.


In this loss, the two score terms are the predicted skill measures for videos 1 and 2 of a pair, respectively. The margin is incorporated to adjust the distance between the predicted scores of the two videos. The same margin is used in all experiments.
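A margin-based pairwise ranking loss of the kind described above is commonly written in hinge form, and the sketch below uses that standard form as an assumption; it is not necessarily the exact expression used by the authors. The function names are hypothetical, and the margin of 1.0 in the usage lines is an arbitrary illustrative value, not the one used in the experiments.

```python
def pairwise_ranking_loss(score_better, score_worse, margin):
    """Hinge ranking loss for one pair.

    Zero when the better video outscores the worse one by at least `margin`,
    and growing linearly as the ordering is violated.
    """
    return max(0.0, margin - score_better + score_worse)


def total_ranking_loss(scored_pairs, margin):
    """Sum the hinge loss over (score_better, score_worse) tuples."""
    return sum(pairwise_ranking_loss(b, w, margin) for b, w in scored_pairs)


if __name__ == "__main__":
    margin = 1.0  # illustrative value only
    print(pairwise_ranking_loss(2.0, 0.5, margin))  # 0.0, correctly ordered
    print(pairwise_ranking_loss(0.5, 2.0, margin))  # 2.5, wrongly ordered
```

In a Siamese setup both scores come from the same network with shared weights, so a single gradient step pushes the two scores apart in the annotated direction.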

We use PyTorch [34] to implement our framework. The optical flow images for motion input are extracted by the TV-L1 algorithm [46]. For the Infant Grasp dataset, the optical flow images are extracted at the original frame rate; for the other datasets, we use a frame rate of 10 fps since motion is slow in these videos. The deep features in Figure 1 are extracted from the output of the 5th convolutional block of ResNet101 [18]. The input images are resized to a fixed resolution before being fed to ResNet. The conv-fusion module consists of 2 convolution layers, of which the first is followed by a ReLU activation. The first layer has 512 kernels, and the second convolution layer has a kernel size of 1 with 256 output channels. The RNNs for both attention pattern learning and temporal aggregation are implemented with a 1-layer Gated Recurrent Unit (GRU) whose hidden state size is set to 128. We uniformly split each video into 25 segments and sample one frame randomly from each segment during training. The last frame of each segment is used for testing our model. We use stochastic gradient descent with a momentum of 0.9 to optimize our model. We set the learning rate to 5e-4 for the Infant Grasp dataset and to 1e-3 for the other datasets. All weight decays are set to 1e-3. As in [12], our model is trained and tested separately on each dataset.
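For reference, the hyperparameters stated above can be gathered into a small configuration object. This is only a convenience sketch: the class name, field names, and the dataset key "infant_grasp" are hypothetical, and only values given in the text are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    num_segments: int = 25           # uniform segments per video
    gru_num_layers: int = 1          # both RNNs are 1-layer GRUs
    gru_hidden_size: int = 128
    fusion_conv1_kernels: int = 512  # first conv-fusion layer
    fusion_out_channels: int = 256   # second conv-fusion layer (kernel size 1)
    momentum: float = 0.9            # SGD momentum
    weight_decay: float = 1e-3
    learning_rate: float = 1e-3      # default for the public datasets


def config_for(dataset_name: str) -> TrainingConfig:
    """Return the settings for a dataset; the name key is a hypothetical label."""
    if dataset_name == "infant_grasp":
        return TrainingConfig(learning_rate=5e-4)
    return TrainingConfig()


if __name__ == "__main__":
    print(config_for("infant_grasp"))
```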

4 Experiments

We evaluate our method on our newly collected dataset as well as four public datasets. Similarly to [12], we report the results of four-fold cross-validation, and for each fold, we use ranking accuracy as the evaluation metric. Ranking accuracy is defined as the percentage of correctly ranked pairs among all pairs in the validation set.
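The metric defined above can be computed as in the following sketch. The function name and the data layout are assumptions for illustration: scores are stored per video identifier, and each validation pair is given as (better, worse) according to the annotation, so pairs without a skill preference are assumed to be excluded beforehand.

```python
def ranking_accuracy(scores, preference_pairs):
    """Percentage of annotated pairs whose predicted order matches the label.

    scores: dict mapping video id to predicted skill score.
    preference_pairs: list of (better_id, worse_id) tuples.
    """
    if not preference_pairs:
        raise ValueError("no pairs to evaluate")
    correct = sum(
        1 for better, worse in preference_pairs if scores[better] > scores[worse]
    )
    return 100.0 * correct / len(preference_pairs)


if __name__ == "__main__":
    scores = {"a": 0.9, "b": 0.4, "c": 0.1, "d": 0.6}
    pairs = [("a", "b"), ("b", "c"), ("a", "d"), ("c", "d")]
    print(ranking_accuracy(scores, pairs))  # 75.0: three of four pairs correct
```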

4.1 Datasets

Figure 4: Visualization of attention maps learned by our model on the Infant Grasp dataset.

4.1.1 Infant Grasp Dataset

Since the related public datasets are either small in size (e.g., up to 40 videos [12, 15]) or unsuitable for manipulation skill assessment (e.g., comparing skill between different diving actions [35]), we construct a larger dataset for infant grasp skill assessment. The dataset consists of 94 videos, and each video contains a whole process of an infant grasping a transparent block and putting it into a specified hole. The videos were originally captured for analyzing visuomotor skill development of infants at different ages. Figure 3 shows representative frames selected from a pair of videos. The length of each video ranges from 80 to 530 frames with a frame rate of 60fps. This dataset is expected to be of great importance not only to the computer vision community but also to the developmental psychology community.

To annotate the dataset, we asked 5 annotators from the field of developmental psychology to label each video pair by deciding which video in the pair shows the better skill, or whether there is no obvious difference in skill. We form 4371 pairs out of the 94 videos, among which 3318 pairs have a skill preference (76%).

4.1.2 Public datasets

We also evaluate our method using public datasets of four other manipulation tasks: Chopstick-using, Dough-Rolling, Drawing [12], and Surgery [15]. The Chopstick-using dataset contains 40 videos with 780 total pairs, of which 538 pairs have a skill preference (69% of total pairs). The Dough-Rolling dataset selects 33 segments showing the task of pizza dough rolling from the kitchen-based CMU-MMAC dataset [10], and 538 pairs of videos are annotated with skill preference (69%). The Drawing dataset consists of two sub-datasets and 40 videos in total, from which 380 pairs are formed and 247 pairs show a skill preference (65%). The Surgery dataset contains three sub-datasets for three different kinds of surgical tasks: 36 videos of Knot-Tying, 28 videos of Needle-Passing and 39 videos of Suturing. The sub-datasets contain a maximum of 630, 378 and 701 pairs respectively, and since the annotation is given by a surgery expert using a standard and structured method, more than 90% of the pairs contain a difference in skill level. Following [12], we train and test the 3 sub-datasets of the Surgery dataset together using one model. The same is done for the Drawing dataset.
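The pair counts reported for the Infant Grasp and Chopstick-using datasets follow from forming every unordered pair of videos, which the short check below reproduces. The helper name is hypothetical; the video counts and preference counts are the ones stated in the text.

```python
from math import comb


def pair_stats(num_videos, pairs_with_preference):
    """Return (number of unordered video pairs, percent of pairs with preference)."""
    total_pairs = comb(num_videos, 2)
    share = 100.0 * pairs_with_preference / total_pairs
    return total_pairs, share


if __name__ == "__main__":
    for name, videos, preferred in [
        ("Infant Grasp", 94, 3318),
        ("Chopstick-using", 40, 538),
    ]:
        total, share = pair_stats(videos, preferred)
        print(f"{name}: {total} pairs, {share:.0f}% with preference")
    # Infant Grasp: 4371 pairs, 76% with preference
    # Chopstick-using: 780 pairs, 69% with preference
```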

Figure 5: Qualitative comparison of the output attention maps of our method and CBAM [43] on the Surgery, Drawing and Chopstick-using datasets. Our method successfully finds the skill-related regions, while the attention maps obtained by CBAM tend to highlight only the salient regions.

4.2 Baseline methods

We compare with several baseline methods to validate the effectiveness of our proposed approach. Because previous works on skill assessment are scarce, two ranking-based approaches were introduced as baseline methods for skill assessment in [12]. The first baseline is RankSVM [24], which has been commonly used in ranking problems [32]. The second baseline is the pairwise deep ranking method of Yao et al. [45] for video ranking, although it was originally developed for a different purpose (highlight detection). Moreover, we compare our method with Doughty et al. [12], which is the work most relevant to ours. They use TSN [42] with a modified ranking loss function for skill assessment. Since no previous work adopts an attention mechanism for skill assessment, we also validate the effectiveness of our spatial attention module by constructing a competitive baseline that replaces it with a state-of-the-art attention model [43] (CBAM Atten.). We implement the spatial and channel attention modules of [43], and the attention map is obtained by feeding the encoded feature into the channel attention module and the spatial attention module successively.

4.3 Performance comparison

Method                 Chopstick-using   Surgery   Drawing   Dough-Rolling   Infant Grasp
RankSVM [24]           76.6              65.2      71.5      72.0            N/A
Yao et al. [45]        70.3              66.1      71.5      78.1            N/A
Doughty et al. [12]    71.5              70.2      83.2      79.4            80.3
CBAM Atten. [43]       82.0              68.6      84.1      80.8            83.8
Ours                   85.5              73.1      85.3      82.7            86.1

Table 1: Performance comparison with baseline methods. Ranking accuracy (%) is used as the evaluation metric.

We compare our method with the baseline methods, and the quantitative results are shown in Table 1. Our method achieves the best performance on all datasets and outperforms the state-of-the-art method [12] by 2.1 to 14.0 points of ranking accuracy, demonstrating the importance of adopting an attention mechanism in video-based skill assessment. Our method also outperforms CBAM Atten., which also uses an attention mechanism, by 1.2 to 4.5 points. This validates the effectiveness of our proposed spatial attention module, which considers not only the appearance of the current frame but also the past high-level knowledge about the whole task process.

We visualize the attention maps generated by our model. Figure 4 shows attention maps on our Infant Grasp dataset. The spatial attention clearly focuses on the critical regions of all images. In the first row, the spatial attention falls on the infant's hand at first, and when the infant accidentally drops the block, our model successfully moves attention to the dropped block instead of staying on the hand. In the second row, the infant first grabs the block with her right hand, then passes the block to her left hand to put it at the destination. Our attention module successfully locates the correct task-related hand to emphasize. We think our attention model adapts to the shifting attention because the previously accumulated knowledge about the task is leveraged to focus on the critical events.

We also compare the attention maps obtained on three different datasets by our proposed model and by the baseline attention model CBAM [43] in Figure 5. With our model, attention is consistently paid to the discriminative skill-related region, while CBAM often locates salient but task-unrelated regions. In the first example, the skill level of the chopstick-using task is determined by whether the bean can be picked up with the chopsticks, and the relevant regions are successfully highlighted by our model. In the second example, on the Drawing dataset, our model focuses on both the drawing hand and the picture drawn by the subject (last image). In the last example, on the Surgery dataset, our method locates both robotic arms performing the task as well as the suture being manipulated.

4.4 Ablation study

To validate the effectiveness of each component of our model, we conduct an ablation study on each dataset with the following baselines:

  • No attention RNN: The attention RNN designed for learning attention evolution patterns is replaced by one fully-connected layer, which also accepts the low-level representation vector and the high-level skill-related vector as input.

  • Low-level-based Atten.: The attention RNN accepts only the low-level representation vector as input, without the concatenation with the high-level skill-related vector. We build this baseline to see the influence of the high-level skill-related knowledge in skill assessment.

  • High-level-based Atten.: The attention RNN accepts only the high-level skill-related vector as input. We build this baseline to see the influence of low-level features.

Method                     Chopstick-using   Surgery   Drawing   Dough-Rolling   Infant Grasp
No attention RNN           84.1              70.8      82.4      82.0            85.1
Low-level-based Atten.     84.1              70.1      83.4      81.8            85.3
High-level-based Atten.    82.8              69.1      84.8      81.6            84.7
Full model                 85.5              73.1      85.3      82.7            86.1

Table 2: Ablation study for different components of our model. Ranking accuracy (%) is used as the evaluation metric.

Table 2 shows quantitative results for the ablated variants and our full model. The full model performs best on every dataset, which supports our view that the evolution patterns of attention maps (the attention RNN), the low-level appearance-based information (the low-level representation vector) and the high-level task-related knowledge (the skill-related vector) are all important for manipulation-skill assessment.

We also train our model with randomized labels to examine whether our model can learn meaningful skill-related attention. As shown in Figure 6, when the model is trained with randomized labels, the spatial attention maps become meaningless compared with the one trained with real labels. This indicates that our model is able to discover meaningful skill-related regions rather than just highlighting the salient regions regardless of the underlying skills.

Figure 6: Attention map visualization between our model trained with real ground truth labels and randomized labels. The attention maps tend to be chaotic when the model is trained with randomized labels, which shows our model’s ability to discover the correct skill-related regions as attention maps.

5 Conclusion

In this study, we present a novel method for skill assessment which uses attention learned from both low-level information and high-level skill-related information in a pairwise deep ranking framework. We show that powerful visual features for skill assessment could be learned by our model through pairs of weakly labeled videos even without annotations of exact scores. A dataset was also collected and annotated, which contains the largest number of videos performing the same action compared with existing datasets for skill assessment. Experiments demonstrate that our proposed approach achieves state-of-the-art performance on both our new dataset and existing datasets.

As for our future work, we plan to collect a larger dataset for skill assessment since all existing datasets are still small in size. We will also delve deeper into the attention mechanism of skill assessment, e.g., temporal attention in videos.


References

  • [1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [2] G. Bertasius, H. Soo Park, S. X. Yu, and J. Shi. Am i a baller? basketball performance assessment from first-person videos. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [3] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning, pages 89–96. ACM, 2005.
  • [4] O. Çeliktutan, C. B. Akgul, C. Wolf, and B. Sankur. Graph-based analysis of physical exercise actions. In Proceedings of the 1st ACM international workshop on Multimedia indexing and information retrieval for healthcare, pages 23–32. ACM, 2013.
  • [5] L. Chen, M. Zhai, and G. Mori. Attending to distinctive moments: Weakly-supervised attention models for action localization in video. In Computer Vision Workshop (ICCVW), 2017 IEEE International Conference on, pages 328–336. IEEE, 2017.
  • [6] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6298–6306. IEEE, 2017.
  • [7] S. Chen and Q. Zhao. Boosted attention: Leveraging human attention for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 68–84, 2018.
  • [8] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
  • [9] A. Das, H. Agrawal, L. Zitnick, D. Parikh, and D. Batra. Human attention in visual question answering: Do humans and deep networks look at the same regions? Computer Vision and Image Understanding, 163:90–100, 2017.
  • [10] F. De la Torre, J. Hodgins, A. Bargteil, X. Martin, J. Macey, A. Collado, and P. Beltran. Guide to the carnegie mellon university multimodal activity (cmu-mmac) database. Robotics Institute, page 135, 2008.
  • [11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. Ieee, 2009.
  • [12] H. Doughty, D. Damen, and W. Mayol-Cuevas. Who’s better, who’s best: Skill determination in video using deep ranking. arXiv preprint arXiv:1703.09913, 2017.
  • [13] W. Du, Y. Wang, and Y. Qiao. Rpan: An end-to-end recurrent pose-attention network for action recognition in videos. In IEEE International Conference on Computer Vision, volume 2, 2017.
  • [14] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen. Video captioning with attention-based lstm and semantic consistency. IEEE Transactions on Multimedia, 19(9):2045–2055, 2017.
  • [15] Y. Gao, S. S. Vedula, C. E. Reiley, N. Ahmidi, B. Varadarajan, H. C. Lin, L. Tao, L. Zappella, B. Béjar, D. D. Yuh, et al. Jhu-isi gesture and skill assessment working set (jigsaws): A surgical activity dataset for human motion modeling. In MICCAI Workshop: M2CAI, volume 3, page 3, 2014.
  • [16] R. Girdhar and D. Ramanan. Attentional pooling for action recognition. In Advances in Neural Information Processing Systems, pages 34–45, 2017.
  • [17] R. Girdhar, D. Ramanan, A. Gupta, J. Sivic, and B. Russell. Actionvlad: Learning spatio-temporal aggregation for action classification. In CVPR, volume 2, page 3, 2017.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [19] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 7, 2017.
  • [20]

    Y. Huang, M. Cai, H. Kera, R. Yonetani, K. Higuchi, and Y. Sato.

    Temporal localization and spatial segmentation of joint attention in multiple first-person videos. In Computer Vision Workshop (ICCVW), 2017 IEEE International Conference on, pages 2313–2321. IEEE, 2017.
  • [21] Y. Huang, M. Cai, Z. Li, and Y. Sato. Predicting gaze in egocentric video by learning task-dependent attention transition. arXiv preprint arXiv:1803.09125, 2018.
  • [22] Y. Huang, M. Cai, Z. Li, and Y. Sato. Mutual context network for jointly estimating egocentric gaze and actions. arXiv preprint arXiv:1901.01874, 2019.
  • [23] W. Ilg, J. Mezger, and M. Giese. Estimation of skill levels in sports based on hierarchical spatio-temporal correspondences. In Joint Pattern Recognition Symposium, pages 523–531. Springer, 2003.
  • [24] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 133–142. ACM, 2002.
  • [25] M. Jug, J. Perš, B. Dežman, and S. Kovačič. Trajectory based assessment of coordinated human activity. In International Conference on Computer Vision Systems, pages 534–543. Springer, 2003.
  • [26] D. Li, T. Yao, L. Duan, T. Mei, and Y. Rui. Unified spatio-temporal attention networks for action recognition in videos. IEEE Transactions on Multimedia, 2018.
  • [27] Z. Li, K. Gavrilyuk, E. Gavves, M. Jain, and C. G. Snoek. Videolstm convolves, attends and flows for action recognition. Computer Vision and Image Understanding, 166:41–50, 2018.
  • [28] J. Liu, G. Wang, P. Hu, L.-Y. Duan, and A. C. Kot. Global context-aware attention lstm networks for 3d action recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 7, page 43, 2017.
  • [29] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems, pages 289–297, 2016.
  • [30] A. Malpani, S. S. Vedula, C. C. G. Chen, and G. D. Hager. Pairwise comparison-based objective score for automated skill assessment of segments in a surgical task. In International Conference on Information Processing in Computer-Assisted Interventions, pages 138–147. Springer, 2014.
  • [31] V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In Advances in neural information processing systems, pages 2204–2212, 2014.
  • [32] D. Parikh and K. Grauman. Relative attributes. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 503–510. IEEE, 2011.
  • [33] P. Parmar and B. T. Morris. Learning to score olympic events. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 76–84. IEEE, 2017.
  • [34] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
  • [35] H. Pirsiavash, C. Vondrick, and A. Torralba. Assessing the quality of actions. In European Conference on Computer Vision, pages 556–571. Springer, 2014.
  • [36] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, et al. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626, 2017.
  • [37] Y. Sharma, V. Bettadapura, T. Plötz, N. Hammerla, S. Mellor, R. McNaney, P. Olivier, S. Deshmukh, A. McCaskie, and I. Essa. Video based assessment of osats using sequential motion textures. Georgia Institute of Technology, 2014.
  • [38] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pages 568–576, 2014.
  • [39] S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In AAAI, volume 1, pages 4263–4270, 2017.
  • [40] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • [41] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6450–6458. IEEE, 2017.
  • [42] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, pages 20–36. Springer, 2016.
  • [43] S. Woo, J. Park, J.-Y. Lee, and I. So Kweon. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
  • [44] S. Yan, J. S. Smith, W. Lu, and B. Zhang. Hierarchical multi-scale attention networks for action recognition. Signal Processing: Image Communication, 61:73–84, 2018.
  • [45] T. Yao, T. Mei, and Y. Rui. Highlight detection with pairwise deep ranking for first-person video summarization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 982–990, 2016.
  • [46] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime tv-l 1 optical flow. In Joint Pattern Recognition Symposium, pages 214–223. Springer, 2007.
  • [47] Q. Zhang and B. Li.

    Video-based motion expertise analysis in simulation-based surgical training using hierarchical dirichlet process hidden markov model.

    In Proceedings of the 2011 international ACM workshop on Medical multimedia analysis and retrieval, pages 19–24. ACM, 2011.
  • [48] Q. Zhang and B. Li. Relative hidden markov models for video-based evaluation of motion skills in surgical training. IEEE transactions on pattern analysis and machine intelligence, 37(6):1206–1218, 2015.
  • [49] A. Zia, Y. Sharma, V. Bettadapura, E. L. Sarin, M. A. Clements, and I. Essa. Automated assessment of surgical skills using frequency analysis. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 430–438. Springer, 2015.