The Pros and Cons: Rank-aware Temporal Attention for Skill Determination in Long Videos

12/13/2018 ∙ by Hazel Doughty, et al.

We present a new model to determine relative skill from long videos, through learnable temporal attention modules. Previous work formulates skill determination for common tasks as a ranking problem, yet measures skill from randomly sampled video segments. We believe this approach to be limiting since many parts of the video are irrelevant to assessing skill, and there may be variability in the skill exhibited throughout a video. Assessing skill from a single section may not reflect the overall skill in the video. We propose to train rank-specific temporal attention modules, learned with only video-level supervision, using a novel rank-aware loss function. In addition to attending to task-relevant video parts, our proposed loss jointly trains two attention modules to separately attend to video parts which are indicative of higher (pros) and lower (cons) skill. We evaluate the approach on the public EPIC-Skills dataset and additionally collect and annotate a larger dataset for skill determination with five previously unexplored tasks. Our method outperforms previous approaches and classic softmax attention on both datasets by over 4%. We also demonstrate our model's ability to attend to rank-aware parts of the video.


1 Introduction

Video-based skill determination is the problem of assessing how well a subject performs a given task. Automatic skill assessment will enable us to explore the wealth of online videos capturing daily tasks, such as crafts and cooking, as well as videos capturing surgery or sports, for training humans and intelligent agents - which video should a robot imitate to prepare your scrambled eggs for breakfast?

For long videos, previous approaches make a naive assumption: the same level of skill is exhibited throughout the video, and thus skill can be determined in any (or all) of its parts [4, 19, 25, 34, 36]. Take for example the task of ‘tying a tie’; draping the tie around the neck or straightening the tie may be uninformative when determining a subject’s skill, however the way the subject crosses one side over and pushes the tie into the loop is key. Additionally, there may be variation in skill across the video: when comparing two videos, one subject may perform better at neatly crossing the tie but worse at pulling through the loop.

Figure 1: Rank-aware attention for skill ranking. We determine a video’s rank by using high (green) and low (red) skill attention modules, which determine each segment’s influence on the rank. Line opacity indicates the attention value for a segment and line thickness indicates the score. Sometimes a frame is relevant for all levels of skill, so it is scored by both branches.

Instead, we consider skill determination to be a fine-grained video understanding problem, where it is important to first localize relevant temporal regions to distinguish between instances [21]. We target skill determination for common tasks, where ranking of videos is more suitable than computing per-video scores [1, 4, 17]. We therefore devise a Siamese CNN over temporal segments, including attention modules adapted from [18], which we train to be rank-aware using a novel loss function, since relevance may differ depending on the skill displayed in the video - e.g. mistakes may not appear in higher-ranked videos. When trained with our proposed loss, these modules specialize so they separately attend to parts of the video informative for high skill or sub-standard performance (see Fig. 1).

While temporal attention has previously been used to indicate relevance in long videos [18, 21], no prior work has proposed to learn rank-aware temporal attention. Our main contribution is that we address the challenges of fine-grained video ranking by demonstrating the need for rank-aware temporal attention and proposing a model to learn this effectively. We additionally contribute a new skill determination dataset, by collecting and annotating 5 tasks from YouTube, each containing 100 videos. In total, our dataset is 26 hours of video, twice the size of existing skill determination datasets, with videos up to 10 minutes in length. We outperform previous works on published and newly collected datasets and present a comprehensive evaluation of the contribution of rank-aware attention.

The rest of the paper is organized as follows. Section 2 reviews the related work. We introduce our proposed method in Section 3 and our new dataset in Section 4. Section 5 presents quantitative and qualitative results of our method, followed by our conclusion in Section 6.

2 Related Work

In this section, we first review skill determination works in video, both task-specific and widely applicable methods. We then review works proposing attention modules, specifically temporal attention, for a variety of problems.

Skill Determination. Several seminal works attempted skill determination in video [10, 11, 33]. Gordon [10] was the first to explore the viability of automated skill assessment from videos and addressed issues such as appropriate tasks for analysis, with a case study automatically assessing skill in the gymnastic vault from skeleton trajectories. Despite the importance of automatic skill assessment from video for training and guidance, subsequent works remain limited [1, 4, 19, 23, 25, 31, 34, 36, 37]. These works demonstrate good performance by focusing on features specific to the task, such as skeleton trajectory in diving [23] or entropy between repeated sutures in surgery [36]. A few recent works perform skill determination using kinematic data from non-visual sensors such as inertial measurement units (IMUs) [5, 6, 17, 29, 35].

These works introduce and evaluate on several skill assessment datasets [4, 8, 19, 23, 31]. MIT Dive [23] and the UNLV datasets [19] only include short video clips, whilst the remaining three [8, 4, 23] are small-scale datasets. Xu et al. present the Fis-V dataset [31], containing 500 figure skating videos; however, this is not yet publicly available. We test on EPIC-Skills [4] as this includes the JIGSAWS [8] dataset, re-annotated for ranking alongside 3 other tasks. We also present a new dataset consisting of 500 videos across 5 daily-living tasks.

To assess skill in long videos, different approaches have been proposed. One is to first localise pre-selected events specific to the task [1], such as shooting or passing the ball in a basketball game. Alternatively, global features from the entire video have been used [23, 25, 34, 36], such as skeleton trajectories [23], features averaged across the video [19], or features from randomly sampled segments in our previous work [4]. The only work to use attention in long videos is [31] for figure skating. They use a self-attentive LSTM and a multi-scale skip LSTM to learn local (technical movements) and global (performance of players) scores respectively. They use a regression framework specifically for predicting the components of figure skating scores, which is not appropriate for common tasks.

We differ from all previous works in that we train a model to attend to skill-relevant parts of the video; this attention is learnable for, and thus applicable to, any task. We use a convolutional network with temporal segments and propose a novel ‘rank-aware’ loss function. We do not use LSTMs due to the reported issues with maintaining information over longer videos [26, 28], and their inferior performance compared to non-recurrent networks in many sequence-based tasks [2, 9, 28].

Figure 2: Rank-Aware Attention Network. Given a ranked pair of videos $(p_i, p_j)$ where $p_i$ exhibits higher skill, each video is uniformly split into segments. Extracted features (I3D) are passed into a pair of attention modules to produce video-level features, followed by a ranking function (FC layer) to produce ranking scores $s^+$ (green) and $s^-$ (red) per video. We include a third branch which weights the segments uniformly and produces score $s^u$ (blue). Three types of losses are defined: the ranking loss computes a pairwise loss (green-to-green, red-to-red, blue-to-blue), the disparity loss ensures attention outperforms uniform weighting (green-to-blue, red-to-blue), and the final rank-aware loss optimizes the attention modules to pick rank-aware segments (green-to-red).

Attention Modules. Attention is increasingly used in fine-grained recognition, as intelligently weighting input is key to distinguishing between similar categories. This is a common problem in image recognition [7, 27], where attention can localize discriminative attributes in the object of interest. For instance, Fu et al. [7] present RA-CNN to recursively zoom into the most discriminative image region with an inter-scale ranking loss. Singh et al. [27] adapt the spatial transformer network [12] into a Siamese network to perform relative attribute ranking. Similarly, in person re-identification from video, attention [13, 16, 30] is utilized to select the frames with the best view of identifying attributes.

Attention has also been adopted in the video domain for action recognition [21, 22] and action localization [14, 24, 18, 20], including weakly supervised localization from video-level labels [18, 20]. Pei et al. [21] combine an attention module with a gated recurrent network to classify untrimmed video. Piergiovanni et al. [22] present temporal attention filters to discover latent sub-events in activities. Nguyen et al. [18] use attention filters in a CNN to identify a sparse set of video segments which minimize a video’s classification loss. They use this class-agnostic attention with the class activations to localize target actions. We build on the attention filters used in this work for our rank-aware attention (Sec. 3.3). Paul et al. [20] also use class-specific attention, with multiple instance learning, to localize actions.

Using class-specific attention is a common technique in existing temporal attention works [18, 20]. In this work, we propose the first model to train rank-specific (which we call rank-aware) attention, and demonstrate that it outperforms rank-agnostic attention and existing methods.

3 Rank-Aware Attention Network

In this section, we re-formulate the skill determination problem in long videos. We then detail the combination of training losses that form our rank-aware loss.

3.1 Problem Formulation

We follow the pairwise supervised learning approach to skill determination for common tasks, introduced in our previous work [4], where the training set $P$ comprises annotated pairs of videos. Each pair $(p_i, p_j) \in P$ has been annotated such that $p_i$ displays more skill than $p_j$. Such pairwise annotations can be acquired for any task, unlike skill scores which are only applicable to certain domains [19, 23, 36]. The aim is then to learn a ranking function $f$ for an individual task such that

$f(p_i) > f(p_j), \quad \forall\, (p_i, p_j) \in P$   (1)

For long videos, previously we assumed these pairwise skill annotations can be propagated to any part of the video [4]. Given that $p_i^m$ is the $m$-th segment of video $p_i$, skill annotations were propagated so that

$f(p_i^m) > f(p_j^n), \quad \forall\, m, n$   (2)

Another approach to deal with long videos [19, 32] is to use a uniform weighting of segment feature vectors to learn a video-level ranking. This assumes all parts of the video are equally important for skill assessment:

$\bar{x}_i = \frac{1}{M} \sum_{m=1}^{M} x_i^m$   (3)

$f(\bar{x}_i) > f(\bar{x}_j)$   (4)

where $x_i^m$ is the feature vector of segment $p_i^m$ and $M$ is the number of segments.

In this work, we believe these assumptions do not hold. First, some parts of the video may not exhibit any difference in skill, or may even show reversed ranking - where the overall better video has segments exhibiting less skill. Second, non-uniform pooling should better represent the video’s overall skill by increasing the weight of segments more pertinent to a subject’s skill. Third, comparing corresponding video chunks assumes tasks are performed in a set order, at the same speed. We deviate from these assumptions, and instead aim to jointly learn temporal attention $a(\cdot)$ alongside the ranking function $f$ such that

$\hat{x}_i = \sum_{m=1}^{M} a(x_i^m)\, x_i^m$   (5)

$f(\hat{x}_i) > f(\hat{x}_j)$   (6)

While $a$ is a standard attention module for relevance, we observe that the segments most crucial to determining skill may differ depending on the subject’s skill, i.e. a low-skill subject may perform certain actions not performed by a high-skill subject and vice-versa. Therefore, we propose to train two general attention modules to produce scores $s^+_i$, $s^-_i$ for all pairs $(p_i, p_j) \in P$, such that:

$s^+_i > s^+_j \quad \text{and} \quad s^-_i > s^-_j$   (7)

Following training, the two attention modules diverge to become rank-aware, such that one attends to segments which display high skill ($a^+$) and the other to low skill ($a^-$), along with differing ranking functions $f^+$, $f^-$:

$s^+_i = f^+\big(\textstyle\sum_{m} a^+(x_i^m)\, x_i^m\big)$   (8)

$s^-_i = f^-\big(\textstyle\sum_{m} a^-(x_i^m)\, x_i^m\big)$   (9)
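For concreteness, below is a minimal PyTorch-style sketch of a single branch; the use of one attention filter, the feature dimension (1024, matching I3D) and the hidden size are illustrative assumptions rather than the released implementation, and the full multi-filter module is described in Sec. 3.3.

```python
import torch
import torch.nn as nn

class RankingBranch(nn.Module):
    """One branch (high- or low-skill): attention-weighted pooling + linear ranking score."""
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        # simplified single attention filter (the full module uses F filters, Sec. 3.3)
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.rank_fc = nn.Linear(feat_dim, 1)   # ranking function f^+ or f^-

    def forward(self, segments):                # segments: (M, feat_dim) I3D features
        weights = torch.softmax(self.attention(segments), dim=0)   # attention over M segments
        video_feat = (weights * segments).sum(dim=0)               # weighted video-level feature
        return self.rank_fc(video_feat).squeeze()                  # scalar score s^+_i or s^-_i

high, low = RankingBranch(), RankingBranch()
segments = torch.randn(8, 1024)                 # dummy features for 8 segments
score = high(segments) + low(segments)          # test-time ranking score (Eq. 16)
```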

3.2 Rank-Aware Attention and Overall Network

We show our overall architecture in Fig. 2. The Siamese network takes a video pair and splits each video into $M$ segments of uniform length. We then obtain a video-level feature from all segments weighted by our learned attention functions $a^+$ and $a^-$ (Sec. 3.3), or through uniform weighting. The network then learns three ranking functions using fully connected (FC) layers in the branches containing $a^+$, $a^-$ and uniform weighting respectively, and outputs three corresponding scores per video: $s^+_i$, $s^-_i$ and $s^u_i$. These FC layers are separate for each weighting function, but are shared by both sides of the Siamese network. For each branch, a margin ranking loss function ensures $p_i$ is ranked higher than $p_j$,

$\mathcal{L}^+_{rank} = \max\big(0,\; m_r - (s^+_i - s^+_j)\big)$   (10)

where $s^+_i$ is the final score of video $p_i$ from the high-skill attention module and $m_r$ is a constant margin. The ranking loss is defined similarly for the low-skill and uniform weighting branches:

$\mathcal{L}^-_{rank} = \max\big(0,\; m_r - (s^-_i - s^-_j)\big)$   (11)

$\mathcal{L}^u_{rank} = \max\big(0,\; m_r - (s^u_i - s^u_j)\big)$   (12)
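As a sketch, these three terms are standard margin (hinge) ranking losses; the function below assumes scalar per-pair scores and an illustrative margin value.

```python
import torch

def margin_ranking_loss(s_i, s_j, margin=1.0):
    """Eq. 10-12 sketch: the higher-skill video's score s_i should exceed
    the lower-skill video's score s_j by at least the margin m_r (value illustrative)."""
    return torch.clamp(margin - (s_i - s_j), min=0.0)
```

The same function is applied to the high-skill, low-skill and uniform branch scores of each pair; torch.nn.MarginRankingLoss with target +1 gives equivalent behaviour.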

While the need for uniform weighting might not be obvious, when using an attention module alone to learn a ranking we find it is likely to fall into a local minimum during training; the learned attention weights for such a local minimum perform worse than uniform weighting. We avoid this by introducing an attention disparity loss, which explicitly encourages the attention branch to outperform uniform weighting:

$\mathcal{L}^+_{disp} = \max\big(0,\; m_d - (s^+_i - s^+_j) + (s^u_i - s^u_j)\big)$   (13)

Here, $m_d$ is a margin separate from $m_r$, specific to this loss. For a video pair $(p_i, p_j)$, this loss encourages the difference between scores $s^+_i$ and $s^+_j$ to be greater than the difference between scores $s^u_i$ and $s^u_j$, thereby encouraging the attention module to produce video-level features better at distinguishing between the skill displayed in the two videos than the uniform branch. This loss alone could instead cause the performance of the uniform branch to degrade, however by jointly optimizing with Eq. 12 this is avoided. A similar loss $\mathcal{L}^-_{disp}$ is defined for the low-skill branch with uniform weighting.
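A minimal sketch of this disparity term, under the same assumptions as the ranking loss sketch above (scalar per-pair scores, illustrative margin):

```python
import torch

def disparity_loss(s_att_i, s_att_j, s_uni_i, s_uni_j, margin=1.0):
    """Eq. 13 sketch: the attention branch's score gap on the pair should exceed
    the uniform branch's gap by the margin m_d (value illustrative)."""
    gap_att = s_att_i - s_att_j     # separation achieved with learned attention
    gap_uni = s_uni_i - s_uni_j     # separation achieved with uniform weighting
    return torch.clamp(margin - (gap_att - gap_uni), min=0.0)
```

The same form is used for both the high-skill and the low-skill branch against the uniform branch.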

So far, the two learnt attention modules are indistinguishable: they attend to task-relevant segments to form video-level features and perform the ranking. We finally optimize these filters to achieve the desired response with our proposed rank-aware loss:

$\mathcal{L}_{ra} = \max\big(0,\; m_a - (s^+_i - s^-_j)\big)$   (14)

With Eq. 14, we ensure $a^+$ attends to higher skill parts of the better video $p_i$ while $a^-$ attends to video parts with lower skill from $p_j$. To optimize for rank-aware attention, we use a larger margin $m_a$ compared to the single-branch margin $m_r$. The overall training is then conducted by combining the losses:

$\mathcal{L} = \mathcal{L}^+_{rank} + \mathcal{L}^-_{rank} + \mathcal{L}^u_{rank} + \mathcal{L}^+_{disp} + \mathcal{L}^-_{disp} + \mathcal{L}_{ra}$   (15)
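A minimal sketch of the rank-aware term, assuming scalar per-pair scores and an illustrative margin value (with $m_a$ larger than $m_r$):

```python
import torch

def rank_aware_loss(s_plus_i, s_minus_j, margin_a=2.0):
    """Eq. 14 sketch: push the high-skill score of the better video (s_plus_i) above
    the low-skill score of the worse video (s_minus_j) by the larger margin m_a."""
    return torch.clamp(margin_a - (s_plus_i - s_minus_j), min=0.0)
```

Eq. 15 then simply sums this term with the three ranking losses and the two disparity losses for each training pair.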

As training iterates through pairs in $P$, the same video $p_i$ will be considered higher skill in one pair and lower in another (e.g. pairs $(p_i, p_j)$ and $(p_k, p_i)$). The network accordingly optimizes the shared weights so as to learn rank-aware attention modules.

When testing the network, a single video is evaluated and its rank is assigned through its ranking score:

$f(p_i) = s^+_i + s^-_i$   (16)

Note that in training we learn $f^+$ and $f^-$ such that $s^+_i > s^+_j$ and $s^-_i > s^-_j$, which implies $s^+_i + s^-_i > s^+_j + s^-_j$. Although $a^-$ attends to low-skill segments, the overall score reflects the correct ranking of the videos. We do not include $s^u_i$ as the attention branches alone should be sufficient (shown in Fig. 5).

Figure 3: The attention module consists of $F$ attention filters, each outputting a scalar weight per segment, used to produce the weighted video-level feature.

3.3 Multi-filter Attention Module

Our attention modules $a^+$ and $a^-$ each take a set of video segments and learn a weighting of these segments informative for skill ranking. As the attention modules have the same structure, we refer to a generic attention module $a$ for simplicity. We show the architecture of the attention module in Fig. 3. The attention module consists of $F$ filters, each comprised of two FC layers, the first followed by a ReLU layer, the second followed by a softmax. This is based on the attention filter used in STPN [18], with a softmax activation instead of a sigmoid. Filters are combined to achieve segment-level attention:

$a(x_i^m) = \frac{1}{F} \sum_{f=1}^{F} a_f(x_i^m)$   (17)

where $a_f$ refers to the $f$-th attention filter of the attention module, and importantly $\sum_{m} a_f(x_i^m) = 1$ for each of the $F$ filters. We include multiple attention filters to encourage a module to attend to multiple skill-relevant sub-tasks in the long videos. We assess the importance of multiple filters and the number of filters in Section 5.
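A minimal sketch of the multi-filter module; the hidden size, feature dimension and the averaging of filter outputs are our assumptions based on Eq. 17, not the released code.

```python
import torch
import torch.nn as nn

class MultiFilterAttention(nn.Module):
    """F attention filters, each two FC layers with a softmax over the M segments (Eq. 17)."""
    def __init__(self, feat_dim=1024, hidden=256, num_filters=3):
        super().__init__()
        self.filters = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(num_filters)])

    def forward(self, segments):                           # segments: (M, feat_dim)
        # each filter's weights sum to 1 over the segments (softmax)
        per_filter = [torch.softmax(f(segments).squeeze(-1), dim=0) for f in self.filters]
        A = torch.stack(per_filter)                        # (F, M) attention matrix (Eq. 18)
        weights = A.mean(dim=0)                            # combined segment attention (Eq. 17)
        video_feat = weights @ segments                    # weighted video-level feature
        return video_feat, A

video_feat, A = MultiFilterAttention()(torch.randn(12, 1024))   # 12 dummy segments
```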

To regularize the filters, we use a diversity loss. We define the $F \times M$ attention matrix $A_i$ relating to video $p_i$, whose $(f, m)$-th entry is the weight filter $f$ assigns to segment $m$:

$A_i = \big[ a_f(x_i^m) \big]_{f=1,\dots,F;\; m=1,\dots,M}$   (18)

and use the following diversity loss:

$\mathcal{L}_{div} = \big\| A_i A_i^{\top} - I \big\|_F^2$   (19)

where $I$ is the identity matrix and $\|\cdot\|_F$ denotes the Frobenius norm. Similar diversity losses have been used successfully in other applications, such as text embedding [15] - here we use it to regularize temporal attention in video. In our network, this loss encourages each filter to learn a different aspect of the video. Without such a loss, all filters would attend to the same most discriminative part in the video, rendering any more than one filter redundant. This loss also encourages filters to be sparse and to pick the few most informative segments.

Note that the diversity loss is applied within an attention module; diversity is not enforced between modules. Attentions are allowed to overlap, and do so when a segment is relevant at all skill levels. This loss is added to training as follows:

$\mathcal{L}_{total} = \mathcal{L} + \lambda \big( \mathcal{L}^+_{div} + \mathcal{L}^-_{div} \big)$   (20)
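A minimal sketch of the diversity term, assuming the attention matrix of one module for one video is given as an $F \times M$ tensor (as produced by the attention sketch above):

```python
import torch

def diversity_loss(A):
    """Eq. 19 sketch: A is the (F x M) attention matrix of one module for one video,
    with rows holding each filter's weights over the segments."""
    num_filters = A.shape[0]
    gram = A @ A.t()                                       # (F, F) filter similarity
    return torch.linalg.norm(gram - torch.eye(num_filters), ord='fro') ** 2
```

As in Eq. 20, this is computed for both attention modules and added to the overall loss with a small weight (0.1 in Sec. 5.1).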

4 Tasks and Datasets

We evaluate our model on the publicly available EPIC-Skills 2018 dataset [4]. It consists of four distinct tasks: surgery (knot-tying, needle passing, and suturing), drawing (two drawings), dough-rolling, and chopstick-using. Every (sub-)task consists of up to 40 videos, with pairwise annotations indicating the ranking of videos in a pair. A limitation of this dataset is that each task is collected in a single environment with the same perspective and only minor variations in the background. We therefore collect and annotate a new skill determination dataset over twice as large, from online videos and thus with a variety of individuals, environments, and viewpoints.

Task             #Videos  #Pairs  %Pairs  Av. Length (s)

EPIC-Skills
Chopstick Using       40     536     69%       46 ± 17
Dough Rolling         33     181     34%      102 ± 29
Drawing               40     247     65%      101 ± 47
Surgery              103    1659     95%       92 ± 41

YouTubeSkill
Scramble Eggs        100    2112     43%      170 ± 113
Tie Tie              100    3843     77%       81 ± 47
Apply Eyeliner       100    3743     76%      122 ± 105
Braid Hair           100    3847     78%      179 ± 91
Origami              100    3237     65%      386 ± 193

Table 1: Number of videos, number and percentage of annotated pairs, and average (± standard deviation) video length for our new tasks, in comparison to EPIC-Skills.

4.1 YouTubeSkill Dataset

We collect and annotate a new dataset consisting of five skill tasks with 100 videos per task, publicly available at https://github.com/hazeld/rank-aware-attention-network. This dataset gives us an opportunity to test on a larger variety of skill tasks with more and longer videos per task from varied environments.

Video Collection. We selected five tasks which can be completed using various methods and may be challenging for novices: scrambling eggs, braiding hair, tying a tie, making an origami crane, and applying eyeliner. The tasks selected are deliberately varied in their content and also differ from the tasks in the EPIC-Skills dataset as this allows a more thorough testing of the proposed model.

To obtain 100 videos per task, we first retrieve the top-400 videos from YouTube using the task name as a query. We then ask AMT workers to answer questions about each video to determine its suitability for our dataset. These ensure the selected videos contain the relevant task, are good quality videos, contain a clear view of the task and contain the complete performance of the task with minimal edits. We also ask AMT workers for their initial opinion of the skill of the person performing the task from ‘Beginner’, ‘Intermediate’, ‘Expert’. This initial labelling ensures we get a distribution of skill levels before pairwise annotations.

As only a portion of the YouTube video may contain the desired task, we annotate the start and end of the relevant activity via AMT, using the same approach for annotations from [3]. We use the agreement of 4 workers.

Pairwise Annotation. We perform skill annotation similarly to our previous work [4]. We ask AMT workers to watch videos in a pair simultaneously and select the video which displays more skill. The pair is taken as ground-truth only if all four workers agree on a pair’s ordering. We have many more videos than EPIC-Skills [4] and therefore considerably more possible video pairings. It is unnecessary to annotate all possible pairs. Instead, we annotate 40% of the possible pairings, where each video appears in an equal number of pairs. This removes the need for exhaustive annotation as we utilize the transitive nature of skill ranking to obtain pairs outside of the original 40%. We perform a second round of annotations for pairs of a similar rank, to ensure our dataset contains challenging pairs with marginal difference in skill.
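A naive illustration of how additional ordered (better, worse) pairs follow from transitivity; the actual selection and balancing of pairs per video is a separate step, and the function below is only a sketch.

```python
from itertools import product

def transitive_closure(pairs):
    """Expand annotated (better, worse) pairs by transitivity:
    (a, b) and (b, c) imply (a, c). Repeats until no new pairs appear."""
    closure = set(pairs)
    while True:
        new = {(a, d) for (a, b), (c, d) in product(closure, closure) if b == c and a != d}
        if new <= closure:
            return closure
        closure |= new

print(transitive_closure({("v1", "v2"), ("v2", "v3")}))   # adds ("v1", "v3")
```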

The number and percentage of pairs per task is shown in Table 1, along with the average video length per task. Our dataset is considerably larger than EPIC-Skills in terms of both videos and annotated pairs.

5 Experiments

We first describe the implementation details of our network. We then present results on the two datasets alongside baselines and analyze the contribution of the various components in our method with an ablation study.

5.1 Implementation Details

We uniformly sample 400 stacks of 16 frames, at 10fps, for each video. Images are rescaled to have a shortest side of 256 then centre cropped to 224×224. We extract features using I3D, pre-trained on Kinetics [2]. To prevent overfitting we augment the features by adding noise per dimension as in [18]. All models are trained using the Adam optimizer with a batch size of 128 for 2000 epochs. For stable training, we iteratively optimise the network’s parameters: we first fix the attention module parameters and optimise the ranking FC layer weights using the ranking losses (Eq. 10, 11, 12); we then fix the ranking FC layer weights and optimise the attention module weights using the remaining losses (the disparity, rank-aware and diversity losses). In all experiments, we set the weight of the diversity loss (Eq. 20) to 0.1 and use fixed values for the margins $m_r$ (Eq. 10), $m_d$ (Eq. 13) and $m_a$ (Eq. 14).
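A structural sketch of this alternating scheme; the helper names (`model`, `ranking_losses`, `remaining_losses`, `loader`) and the grouping of parameters into two optimizers are placeholders, not the released training code.

```python
import torch

def train_epoch(model, loader, opt_rank, opt_attn, ranking_losses, remaining_losses):
    """opt_rank updates only the ranking FC layers; opt_attn updates only the attention modules."""
    for feats_i, feats_j in loader:     # I3D features of a ranked pair (p_i displays more skill)
        # step 1: attention fixed, optimise ranking layers with the ranking losses (Eq. 10-12)
        loss = ranking_losses(model(feats_i, feats_j))
        opt_rank.zero_grad(); loss.backward(); opt_rank.step()

        # step 2: ranking layers fixed, optimise attention with the disparity, rank-aware
        # and 0.1-weighted diversity losses (Eq. 13, 14, 20)
        loss = remaining_losses(model(feats_i, feats_j))
        opt_attn.zero_grad(); loss.backward(); opt_attn.step()
```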

Figure 4: Ablation study of loss functions on all tasks. In general, each additional loss term gives an improvement; the most significant is the rank-aware loss, which gives an average 5% improvement on YouTubeSkill.
Figure 5: Contribution of different branches in the network. The addition of the disparity and rank-aware losses causes both the high- and low-skill branches to perform better than uniform weighting in most tasks. These branches offer complementary information, improving our final result.

5.2 Quantitative Results

Evaluation Metric

We evaluate tasks individually for both datasets. We report pairwise accuracy (% of correctly ordered pairs) and mean task accuracy for an overall dataset metric. For EPIC-Skills we use the four-fold cross-validation training and test splits provided with the dataset [4]. For YouTubeSkill we use a single 75%:25% split for each task, as the number of pairs per task is larger. Our test set consists exclusively of pairs where neither video is present in the training set.
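A minimal illustration of this pairwise accuracy metric (not the authors' evaluation script); ties are counted as incorrect.

```python
def pairwise_accuracy(scores, test_pairs):
    """scores: dict mapping video id -> predicted rank score f(p) = s_plus + s_minus.
    test_pairs: list of (better, worse) video id pairs from the annotations."""
    correct = sum(scores[better] > scores[worse] for better, worse in test_pairs)
    return 100.0 * correct / len(test_pairs)

print(pairwise_accuracy({"a": 0.9, "b": 0.4, "c": 0.1},
                        [("a", "b"), ("b", "c"), ("c", "a")]))   # ~66.67
```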

Method                        EPIC-Skills   YouTubeSkill
Who’s Better [4]                   76.0          75.8
Last Segment                       76.8          61.0
Uniform Weighting                  78.8          73.6
Softmax Attention                  74.5          72.3
STPN [18]                          74.3          70.0
Ours (Rank-Aware Attention)        80.3          81.2

Table 2: Results of our method in comparison to baselines. Our final method outperforms every baseline on both datasets.

Baselines and Attention. In Table 2 we show the results of our method in comparison with different baselines. We use our previous work ‘Who’s Better, Who’s Best’ [4] as the current state of the art for skill determination in generic tasks. On average we outperform this baseline by over 4% on each dataset. We also use four baselines representing various temporal attention approaches. The first is to attend only to the last segment of the video. It could be argued that this segment, displaying the final result of the task, is sufficiently informative across tasks; however, this baseline performs particularly poorly on YouTubeSkill.

We also show our method in comparison to uniform weighting and softmax attention. For softmax attention we use the same basic structure as our method, with a single attention branch optimized only by the ranking loss. Importantly, our proposed method shows an improvement over both uniform weighting and standard softmax attention, particularly for YouTubeSkill with its longer videos. Interestingly, we see the inclusion of softmax attention decreases the overall accuracy for both datasets compared to a naive uniform weighting of segments (-4.3% and -0.7%). Although softmax attention achieves higher accuracy than uniform weighting for several tasks, we found it to be highly inconsistent. To compare to existing temporal attention methods, we adapt the class-agnostic attention from the Sparse Temporal Pooling Network (STPN) [18] into a pairwise ranking framework. While this approach works well for action localization, when adapted to a ranking framework it performs worse than both our method and uniform weighting.

Ablation Study. In Fig. 4 we perform a per-task ablation study, testing the individual contributions of the diversity loss (Eq. 19), the disparity loss (Eq. 13) and our rank-aware loss (Eq. 14). The inclusion of the diversity loss increases the result by 2% for both datasets. It is particularly useful for Drawing (+7.3%) and Tie Tie (+6%), as videos in these tasks consistently have many skill-relevant segments.

Figure 6: We test the effect of the number of attention filters on all tasks. Multiple filters give a clear increase in many tasks, with the majority of tasks peaking at three filters.

From Fig. 4 we see that training the attention module alongside the uniform weighting with the disparity loss improves the results further. The disparity loss encourages the network to learn attention weights that are better at discriminating between videos than the uniform distribution, and decreases the sensitivity to initialization seen with softmax attention. In tasks like Chopstick Using and Scramble Eggs, where softmax attention performs similarly to uniform weighting, this helps significantly.

Our final rank-aware loss further improves the results, particularly for YouTubeSkill (average improvement of 5%). This is particularly true for tasks such as Scramble Eggs and Apply Eyeliner (which have an increase of 10.4% and 8.8% respectively). These tasks contain more instances of subtasks only performed by subjects with higher or lower skill as can be seen in Section 5.3.

We note three exceptions to this trend: Drawing, Surgery and Origami. Surgery maintains a similar score throughout the ablation test and has the lowest final score of all tasks. We believe this is because the I3D features are not able to capture the difference between the fine-grained surgical motions of different abilities. Drawing and Origami both drop with the addition of the disparity loss. In Drawing, the attention branch struggles to separate videos better than the uniform branch, indicating most segments are relevant for determining skill. In Origami, the uniform weighting has poor performance due to the visual subtlety of placing neat folds in the paper; therefore, optimizing the attention branch to be better than uniform does not improve training.

Figure 7: We test correlation of high and low skill filters for all tasks, to check they attend to different video segments.
Figure 8: Attention values of the high-skill (green) and low-skill (red) modules with the corresponding video segments for examples from ‘Scramble Eggs’ and ‘Tie Tie’. The intensity of the color indicates the attention value. We show the predicted ranking from both branches.

Branch Contribution. In the final model we also test the contribution of the different branches producing scores $s^+$, $s^-$ and $s^u$ (Fig. 5). From Fig. 5 we see that we are able to learn high- and low-skill branches which are both more informative than uniform weighting in the majority of tasks. This is particularly true for tasks such as Chopstick Using and Scramble Eggs, which see little improvement with attention until the disparity loss is introduced (Fig. 4). Within tasks, the performances of the high- and low-skill branches can vary greatly. We can see this for Tie Tie, where the low-skill branch performs best (+4.3%). Here (as discussed in Sec. 5.3), the presence of hesitation proves effective for ranking the videos.

The fusion of high and low skill branches further improves the result (EPIC-Skills +2.9% and YouTubeSkill +3.2%). In many tasks the branches offer complementary information, as each branch can attend to separate video segments, specific to either high or low skill (see Sec 5.3).

Number of Filters. In Fig. 6 we test the effect of the number of filters in our attention modules (Sec. 3.3). The previous sections report results using 3 filters. This shows an improvement over a single filter in all but the Chopstick Using task. However, more filters do not further increase the accuracy, as additional less-informative segments are included.

We also ran a separate test to assess how our two rank-aware attention modules, with three filters each, compare to a single module with six filters. These might be thought of as comparable. However, our test shows a clear drop in overall accuracy for YouTubeSkill from 81.2% to 75.0%. Per-task results are included in the supplementary material.

Filter Correlation. To ensure our high- and low-skill filters are attending to different video segments, we plot the correlation of pairs of filters between the high and low attention modules, averaged over all videos in YouTubeSkill. From Fig. 7 we can see most filter pairs have low correlation and are therefore attending to different segments. There are some cases where filters have a higher correlation (e.g. in Braid Hair), as it can be helpful for at least one of the high- and low-skill filters to attend to the same segments when they are relevant at all levels of skill.

5.3 Qualitative Results

In Fig. 8 we show attention weights with corresponding frames for the Scramble Eggs and Tie Tie tasks. Firstly, the figure shows we are able to filter out irrelevant segments using attention, for instance turning on the stove-top and opening the cupboard in ‘Scramble Eggs’. Secondly, we can see our rank-aware attention allows the modules to focus on different aspects of the video. In the Scramble Eggs task the high-skill module consistently focuses on whisking the eggs and stirring in the pan, while the low-skill module attends to adding milk/cream to the eggs and pouring. For ‘Tie Tie’ the high skill module gives a strong weighting to segments displaying a tight inner knot and straightening the tie before folding across, while the low-skill module focuses mainly on hesitation and repetition. However, there are cases where the attention will attend to segments not necessarily correlated to the rank; in Scramble Eggs the low-skill module attends to segments containing bread. Video results are included in the supplementary material.

6 Conclusion

In this paper we have presented a new model for rank-aware attention, trained using a novel loss function. We use the disparity loss to directly optimize the attention to pick more informative segments than the uniform distribution, solving the poor performance of the standard softmax attention in ranking. Our rank-aware loss enables us to learn the most informative segments to attend to in relation to the skill shown in the video. We have tested this method on two datasets, one of which we introduce in this paper, and show our method achieves state-of-the-art results for skill determination, with an average performance of over 80% in both datasets. Future work involves exploring applications of the attention segments to improve people’s skill in a task, as well as attempting to transfer learned ranks to unseen tasks.

References

  • [1] G. Bertasius, H. Soo Park, S. X. Yu, and J. Shi. Am I a Baller? Basketball Performance Assessment From First-Person Videos. In The IEEE International Conference on Computer Vision (ICCV), October 2017.
  • [2] J. Carreira and A. Zisserman. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, July 2017.
  • [3] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray. Scaling Egocentric Vision: The EPIC-KITCHENS Dataset. In European Conference on Computer Vision (ECCV), September 2018.
  • [4] H. Doughty, D. Damen, and W. Mayol-Cuevas. Who’s Better? Who’s Best? Pairwise Deep Ranking for Skill Determination. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [5] M. J. Fard, S. Ameri, R. Darin Ellis, R. B. Chinnam, A. K. Pandya, and M. D. Klein. Automated robot-assisted surgical skill evaluation: Predictive analytics approach. The International Journal of Medical Robotics and Computer Assisted Surgery, 14(1), 2018.
  • [6] G. Forestier, F. Petitjean, P. Senin, F. Despinoy, and P. Jannin. Discovering Discriminative and Interpretable Patterns for Surgical Motion Analysis. In Conference on Artificial Intelligence in Medicine in Europe, pages 136–145. Springer, 2017.
  • [7] J. Fu, H. Zheng, and T. Mei. Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [8] Y. Gao, S. S. Vedula, C. E. Reiley, N. Ahmidi, B. Varadarajan, H. C. Lin, L. Tao, L. Zappella, B. Béjar, D. D. Yuh, et al. JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS): A Surgical Activity Dataset for Human Motion Modeling. In MICCAI Workshop: M2CAI, volume 3, page 3, 2014.
  • [9] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional Sequence to Sequence Learning. arXiv preprint arXiv:1705.03122, 2017.
  • [10] A. S. Gordon. Automated Video Assessment of Human Performance. In Proceedings of AI-ED, pages 16–19, 1995.
  • [11] W. Ilg, J. Mezger, and M. Giese. Estimation of Skill Levels in Sports Based on Hierarchical Spatio-Temporal Correspondences. In Joint Pattern Recognition Symposium, pages 523–531. Springer, 2003.
  • [12] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial Transformer Networks. In Advances in Neural Information Processing Systems (NIPS), pages 2017–2025, 2015.
  • [13] S. Li, S. Bak, P. Carr, and X. Wang. Diversity Regularized Spatiotemporal Attention for Video-Based Person Re-Identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2018.
  • [14] Z. Li, K. Gavrilyuk, E. Gavves, M. Jain, and C. G. Snoek. VideoLSTM Convolves, Attends and Flows for Action Recognition. Computer Vision and Image Understanding, 166:41–50, 2018.
  • [15] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio. A Structured Self-Attentive Sentence Embedding. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
  • [16] X. Liu, H. Zhao, M. Tian, L. Sheng, J. Shao, S. Yi, J. Yan, and X. Wang. HydraPlus-Net: Attentive Deep Features for Pedestrian Analysis. In The IEEE International Conference on Computer Vision (ICCV), October 2017.
  • [17] A. Malpani, S. S. Vedula, C. C. G. Chen, and G. D. Hager. Pairwise Comparison-Based Objective Score for Automated Skill Assessment of Segments in a Surgical Task. In International Conference on Information Processing in Computer-Assisted Interventions, pages 138–147. Springer, 2014.
  • [18] P. Nguyen, T. Liu, G. Prasad, and B. Han. Weakly Supervised Action Localization by Sparse Temporal Pooling Network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [19] P. Parmar and B. T. Morris. Learning to Score Olympic Events. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 76–84. IEEE, 2017.
  • [20] S. Paul, S. Roy, and A. K. Roy-Chowdhury. W-TALC: Weakly-supervised Temporal Activity Localization and Classification. In European Conference on Computer Vision (ECCV), September 2018.
  • [21] W. Pei, T. Baltrusaitis, D. M. Tax, and L.-P. Morency. Temporal attention-gated model for robust sequence classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, July 2017.
  • [22] A. Piergiovanni, C. Fan, and M. S. Ryoo. Learning Latent Sub-events in Activity Videos Using Temporal Attention Filters. In Proceedings of the AAAI Conference on Artificial Intelligence, 2017.
  • [23] H. Pirsiavash, C. Vondrick, and A. Torralba. Assessing the Quality of Actions. In European Conference on Computer Vision, pages 556–571. Springer, 2014.
  • [24] S. Sharma, R. Kiros, and R. Salakhutdinov. Action Recognition using Visual Attention. arXiv preprint arXiv:1511.04119, 2015.
  • [25] Y. Sharma, V. Bettadapura, T. Plötz, N. Hammerla, S. Mellor, R. McNaney, P. Olivier, S. Deshmukh, A. McCaskie, and I. Essa. Video Based Assessment of OSATS Using Sequential Motion Textures. In International Workshop on Modelling and Monitoring of Computer Assisted Interventions (M2CAI) workshop, 2014.
  • [26] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In International Conference on Learning Representations (ICLR), 2017.
  • [27] K. K. Singh and Y. J. Lee. End-to-End Localization and Ranking for Relative Attributes. In European Conference on Computer Vision (ECCV), pages 753–769. Springer, 2016.
  • [28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention Is All You Need. In Advances in Neural Information Processing Systems (NIPS), pages 5998–6008, 2017.
  • [29] Z. Wang and A. M. Fey. Deep Learning with Convolutional Neural Network for Objective Skill Evaluation in Robot-assisted Surgery. arXiv preprint arXiv:1806.05796, 2018.
  • [30] L. Wu, Y. Wang, J. Gao, and X. Li. Where-and-When to Look: Deep Siamese Attention Networks for Video-based Person Re-Identification. arXiv preprint arXiv:1808.01911, 2018.
  • [31] C. Xu, Y. Fu, Z. Cheng, B. Zhang, Y.-G. Jiang, and X. Xue. Learning to Score Figure Skating Sport Videos. arXiv preprint arXiv:1802.02774, 2018.
  • [32] T. Yao, T. Mei, and Y. Rui. Highlight Detection with Pairwise Deep Ranking for First-Person Video Summarization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [33] Q. Zhang and B. Li. Video-based Motion Expertise Analysis in Simulation-based Surgical Training Using Hierarchical Dirichlet Process Hidden Markov Model. In Proceedings of the 2011 International ACM Workshop on Medical Multimedia Analysis and Retrieval, pages 19–24. ACM, 2011.
  • [34] Q. Zhang and B. Li. Relative Hidden Markov Models for Video-based Evaluation of Motion Skills in Surgical Training. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(6):1206–1218, 2015.
  • [35] A. Zia, Y. Sharma, V. Bettadapura, E. L. Sarin, M. A. Clements, and I. Essa. Automated Assessment of Surgical Skills Using Frequency Analysis. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 430–438. Springer, 2015.
  • [36] A. Zia, Y. Sharma, V. Bettadapura, E. L. Sarin, and I. Essa. Video and Accelerometer-Based Motion Analysis for Automated Surgical Skills Assessment. International journal of computer assisted radiology and surgery, 13(3):443–455, 2018.
  • [37] A. Zia, Y. Sharma, V. Bettadapura, E. L. Sarin, T. Ploetz, M. A. Clements, and I. Essa. Automated video-based assessment of surgical skills for training and evaluation in medical schools. International Journal of Computer Assisted Radiology and Surgery, 11(9):1623–1636, 2016.