Detecting events and key actors in multi-person videos

11/09/2015 ∙ by Vignesh Ramanathan, et al. ∙ Google, Stanford University

Multi-person event recognition is a challenging task, often with many people active in the scene but only a small subset contributing to an actual event. In this paper, we propose a model which learns to detect events in such videos while automatically "attending" to the people responsible for the event. Our model does not use explicit annotations regarding who or where those people are during training and testing. In particular, we track people in videos and use a recurrent neural network (RNN) to represent the track features. We learn time-varying attention weights to combine these features at each time-instant. The attended features are then processed using another RNN for event detection/classification. Since most video datasets with multiple people are restricted to a small number of videos, we also collected a new basketball dataset comprising 257 basketball games with 14K event annotations corresponding to 11 event classes. Our model outperforms state-of-the-art methods for both event classification and detection on this new dataset. Additionally, we show that the attention mechanism is able to consistently localize the relevant players.


1 Introduction

Event recognition and detection in videos have benefited greatly from the introduction of recent large-scale datasets [21, 53, 22, 40, 13] and models. However, this progress is mainly confined to the domain of single-person actions, where the videos contain one actor performing a primary activity. Another equally important problem is event recognition in videos with multiple people. In our work, we present a new model and dataset for this specific setting.

Figure 1: Looking at the wrong people in a multi-person event can be very uninformative as seen in the basketball video in the first row. However, by observing the correct people in the same video, we can easily identify the event as a “2-pointer success” based on the shooter and the player throwing the ball into play. We use the same intuition to recognize the key players for event recognition.

Videos captured in sports arenas, market places or other outdoor areas typically contain multiple people interacting with each other. Most people are doing “something”, but not all of them are involved in the main event. The main event is dominated by a smaller subset of people. For instance, a “shot” in a game is determined by one or two people (see Figure 1). In addition to recognizing the event, it is also important to isolate these key actors. This is a significant challenge which differentiates multi-person videos from single-person videos.

Identifying the people responsible for an event is thus an interesting task in its own right. However, acquiring such annotations is expensive, and it is therefore desirable to use models that do not require annotations for identifying these key actors during training. This can also be viewed as a problem of weakly supervised key person identification. In this paper, we propose a method to classify events by using a model that is able to “attend” to this subset of key actors. We do this without ever explicitly telling the model who or where the key actors are.

Recently, several papers have proposed to use “attention” models for aligning elements from a fixed input to a fixed output. For example, [3] translate sentences in one language to another language, attending to different words in the input; [68] generate an image caption, attending to different regions in the image; and [70] generate a video caption, attending to different frames within the video.

In our work, we use attention to decide which of several people is most relevant to the action being performed; this attention mask can change over time. Thus we are combining spatial and temporal attention. Note that while the person detections vary from one frame to another, they can be associated across frames through tracking. We show how to use a recurrent neural network (RNN) to represent information from each track; the attention model is tasked with selecting the most relevant track in each frame. In addition to being able to isolate the key actors, we show that our attention model results in better event recognition.

In order to evaluate our method, we need a large number of videos illustrating events involving multiple people. Most prior activity and event recognition datasets focus on actions involving just one or two people. Multi-person datasets like [45, 38, 6] are usually restricted to fewer videos. Therefore we collected our own dataset. In particular, we propose a new dataset of basketball events with time-stamp annotations for all occurrences of different events across 257 videos, each hours long. This dataset is comparable to the THUMOS [21] detection dataset in terms of number of annotations, but contains longer videos in a multi-person setting.

In summary, the contributions of our paper are as follows. First, we introduce a new large-scale basketball event dataset with 14K dense temporal annotations for long video sequences. Second, we show that our method outperforms state-of-the-art methods for the standard tasks of classifying isolated clips and of temporally localizing events within longer, untrimmed videos. Third, we show that our method learns to attend to the relevant players, despite never being told which players are relevant in the training set.

Figure 2: We densely annotate every instance of 11 different basketball events in long basketball videos. As shown here, we collected both event time-stamps and event labels through an AMT task.

2 Related Work

Action recognition in videos Traditionally, well engineered features have proved quite effective for video classification and retrieval tasks [7, 17, 20, 29, 36, 37, 30, 39, 41, 47, 48, 63, 64]. The improved dense trajectory (IDT) features [64] achieve competitive results on standard video datasets. In the last few years, end-to-end trained deep network models [19, 22, 52, 51, 60] were shown to be comparable and at times better than these features for various video tasks. Other works like [66, 69, 72] explore methods for pooling such features for better performance. Recent works using RNN have achieved state-of-the-art results for both event recognition and caption-generation tasks [8, 35, 54, 70]. We follow this line of work with the addition of an attention mechanism to attend to the event participants.

Another related line of work jointly identifies the region of interest in a video while recognizing the action. Gkioxari et al. [10] and Raptis et al. [43] automatically localize a spatio-temporal tube in a video. Jain et al. [18] merge super-voxels for action localization. While these methods perform weakly-supervised action localization, they target single actor videos in short clips where the action is centered around the actor. Other methods like [27, 42, 58, 65] require annotations during training to localize the action.

Multi-person video analysis Activity recognition models for events with well defined group structures such as parades have been presented in [61, 14, 33, 23]. They utilize the structured layout of participants to identify group events. More recently, [28, 6, 24] use context as a cue for recognizing interaction-based group activities. While they work with multi-person events, these methods are restricted to smaller datasets such as UT-Interaction [46], Collective activity [6] and Nursing home [28].

Attention models Itti et al. [16] explored the idea of saliency-based attention in images, with other works like [49] using eye-gaze data as a means for learning attention. Mnih et al. [32] attend to regions of varying resolutions in an image through a RNN framework. Along similar lines, attention has been used for image classification [5, 12, 67] and detection [2, 4, 71] as well.

Bahdanau et al. [3] showed that attention-based RNN models can effectively align input words to output words for machine translation. Following this, Xu et al. [68] and Yao et al. [70] used attention for image-captioning and video-captioning respectively. In all these methods, attention aligns a sequence of input features with words of an output sentence. However, in our work we use attention to identify the most relevant person to the overall event during different phases of the event.

Action recognition datasets Action recognition in videos has evolved with the introduction of more sophisticated datasets, starting from the smaller KTH [48] and HMDB [26] to the larger UCF101 [53], TRECVID-MED [40] and Sports-1M [22] datasets. More recently, THUMOS [21] and ActivityNet [13] also provide a detection setting with temporal annotations for actions in untrimmed videos. There are also fine-grained datasets in specific domains such as MPII cooking [44] and breakfast [25]. However, most of these datasets focus on single-person activities with hardly any need for recognizing the people responsible for the event. On the other hand, publicly available multi-person activity datasets like [46, 6, 38] are restricted to a very small number of videos. One of the contributions of our work is a multi-player basketball dataset with dense temporal event annotations in long videos.

Person detection and tracking. There is a very large literature on person detection and tracking. There are also specific methods for tracking players in sports videos [50]. Here we just mention a few key methods. For person detection, we use the CNN-based multibox detector from [57]. For person tracking, we use the KLT tracker from [62]. There is also work on player identification (e.g., [31]), but in this work, we do not attempt to distinguish players.

Event             # Videos Train (Test)  Avg. # people
3-point succ.     895 (188)              8.35
3-point fail.     1934 (401)             8.42
free-throw succ.  552 (94)               7.21
free-throw fail.  344 (41)               7.85
layup succ.       1212 (233)             6.89
layup fail.       1286 (254)             6.97
2-point succ.     1039 (148)             7.74
2-point fail.     2014 (421)             7.97
slam dunk succ.   286 (54)               6.59
slam dunk fail.   47 (5)                 6.35
steal             1827 (417)             7.05

Table 1: The number of videos per event in our dataset (training count, with the test count in parentheses) along with the average number of people per video corresponding to each of the events. The number of people is higher than existing datasets for multi-person event recognition.

3 NCAA Basketball Dataset

A natural choice for collecting multi-person action videos is team sports. In this paper, we focus on basketball games, although our techniques are general purpose. In particular, we use a subset of the NCAA games available from YouTube (https://www.youtube.com/user/ncaaondemand). These games are played in different venues over different periods of time. We only consider the most recent games, since older games used slightly different rules than modern basketball. The videos are typically hours long. We manually identified key event types listed in Tab. 1. In particular, we considered 5 types of shots, each of which could be successful or failed, plus a steal event.

Next we launched an Amazon Mechanical Turk task, where the annotators were asked to annotate the “end-point” of these events if and when they occur in the videos; end-points are usually well-defined (e.g., the ball leaves the shooter’s hands and lands somewhere else, such as in the basket). To determine the starting time, we assumed that each event was 4 seconds long, since it is hard to get raters to agree on when an event started. This gives us enough temporal context to classify each event, while still being fairly well localized in time.

The videos were randomly split into training, validation and test videos. We split each of these videos into 4 second clips (using the annotation boundaries), and subsampled these to 6fps. We filter out clips which are not profile shots (such as those shown in Figure 3) using a separately trained classifier; this excludes close-up shots of players, as well as shots of the viewers and instant replays. This resulted in a total of 11436 training and 2256 test clips (the sums of the per-event counts in Tab. 1), along with a separate set of validation clips; each clip has one of 11 labels. Note that this is comparable in size to the THUMOS’15 detection challenge (150 trimmed training instances for each of the classes and untrimmed validation instances). The distribution of annotations across all the different events is shown in Tab. 1. To the best of our knowledge, this is the first dataset with dense temporal annotations for such long video sequences.
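As a rough illustration of the clip construction described above, the sketch below takes an annotated end-point, assumes a fixed 4 second duration, and picks source frames at 6fps. The function name `clip_frame_indices`, the `source_fps` argument and the rounding scheme are assumptions made for this example; the text does not specify how frames were selected.

```python
def clip_frame_indices(end_time_s, source_fps, clip_len_s=4.0, target_fps=6.0):
    """Return source-frame indices for a clip that ends at the annotated end-point.

    The clip start is assumed to be clip_len_s seconds before the end-point
    (clamped at the beginning of the video).
    """
    start_time_s = max(0.0, end_time_s - clip_len_s)
    n_frames = int(round((end_time_s - start_time_s) * target_fps))
    # Sample times uniformly at the target rate, then map each to a source frame.
    return [int(round((start_time_s + k / target_fps) * source_fps))
            for k in range(n_frames)]


# Hypothetical example: an event ending at t = 10 s in a 30 fps video.
indices = clip_frame_indices(end_time_s=10.0, source_fps=30.0)
```

With these settings each clip contains 24 frames, which is the sequence length the recurrent models would then see per clip.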

In addition to annotating the event label and start/end time, we collected AMT annotations on video clips in the test set, where the annotators were asked to mark the position of the ball on the frame where the shooter attempts a shot.

We also used AMT to annotate the bounding boxes of all the players in a subset of 9000 frames from the training videos. We then trained a Multibox detector [56] with these annotations, and ran the trained detector on all the videos in our dataset. We retained all detections above a confidence of 0.5 per frame; this resulted in 6–8 person detections per clip, as listed in Tab. 1. The multibox model achieves an average overlap of at a recall of with ground-truth bounding boxes in the validation videos.

We plan to release our annotated data, including time stamps, ball location, and player bounding boxes.

4 Our Method

All events in a team sport are performed in the same scene by the same set of players. The only basis for differentiating these events is the action performed by a small subset of people at a given time. For instance, a “steal” event in basketball is completely defined by the action of the player attempting to pass the ball and the player stealing from him. To understand such an event, it is sufficient to observe only the players participating in the event.

This motivates us to build a model (overview in Fig. 3) which can reason about an event by focusing on specific people during the different phases of the event. In this section, we describe our unified model for classifying events and simultaneously identifying the key players.

Figure 3: Our model, where each player track is first processed by the corresponding BLSTM network (shown in different colors). $P_i$-BLSTM corresponds to the $i$’th player. The BLSTM hidden-states are then used by an attention model to identify the “key” player at each instant. The thickness of the BLSTM boxes shows the attention weights, and the attended person can change with time. The variables in the model are explained in the methods section. BLSTM stands for “bidirectional long short term memory”.

4.1 Feature extraction

Each video frame is represented by a feature vector $f_t$, which is the activation of the last fully connected layer of the Inception7 network [15, 55]. In addition, we compute spatially localized features for each person in the frame. In particular, we compute a feature vector $p_{ti}$ which contains both appearance and spatial information for the $i$’th player bounding box in frame $t$. Similar to the RCNN object detector [9], the appearance features were extracted by feeding the cropped and resized player region from the frame through the Inception7 network and spatially pooling the response from a lower layer. The spatial feature corresponds to a spatial histogram, combined with a spatial pyramid, to indicate the bounding box location at multiple scales. While we have only used static CNN representations in our work, these features can also be easily extended with flow information as suggested in [51].
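One plausible reading of the spatial feature is a binary occupancy histogram computed over grids of several resolutions. The sketch below follows that reading; the function name `box_location_feature`, the grid sizes in `levels` and the binary (rather than area-weighted) encoding are all assumptions for illustration and are not taken from the paper.

```python
def box_location_feature(box, frame_w, frame_h, levels=(2, 4)):
    """Binary occupancy histogram of a bounding box over grids of several sizes.

    box is (x_min, y_min, x_max, y_max) in pixels. For each pyramid level n,
    the frame is divided into an n x n grid and every cell that the box
    overlaps is marked with 1.0.
    """
    x0, y0, x1, y1 = box
    feature = []
    for n in levels:
        cell_w, cell_h = frame_w / n, frame_h / n
        for row in range(n):
            for col in range(n):
                cx0, cy0 = col * cell_w, row * cell_h
                overlaps = (x0 < cx0 + cell_w and x1 > cx0 and
                            y0 < cy0 + cell_h and y1 > cy0)
                feature.append(1.0 if overlaps else 0.0)
    return feature


# Hypothetical example: a small box near the top-left corner of a 100 x 100 frame.
feat = box_location_feature((10, 10, 30, 30), frame_w=100, frame_h=100)
```

Coarse levels tell the model roughly which part of the court a player occupies, while finer levels distinguish nearby players.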

4.2 Event classification

Given the frame feature $f_t$ and the player features $p_{ti}$ for each frame $t$, our goal is to train the model to classify the clip into one of 11 categories. As a side effect of the way we construct our model, we will also be able to identify the key player in each frame.

First we compute a global context feature for each frame, $h^f_t$, derived from a bidirectional LSTM applied to the frame-level feature $f_t$, as shown by the blue boxes in Fig. 3. This is a concatenation of the hidden states from the forward and reverse LSTM components of a BLSTM and can be compactly represented as:

$$h^f_t = \mathrm{BLSTM}_{\mathrm{frame}}(h^f_{t-1}, h^f_{t+1}, f_t) \tag{1}$$

Please refer to Graves et al. [11] for details about the BLSTM.

Next we use a unidirectional LSTM to represent the state of the event at time $t$:

$$h^e_t = \mathrm{LSTM}(h^e_{t-1}, h^f_t, a_t) \tag{2}$$

where $a_t$ is a feature vector derived from the players, as we describe below. From this, we can predict the class label for the clip using $w_k^\top h^e_t$, where the weight vector corresponding to class $k$ is denoted by $w_k$. We measure the squared-hinge loss as follows:

$$L = \frac{1}{2} \sum_{t=1}^{T} \sum_{k=1}^{K} \max\left(0,\, 1 - y_k\, w_k^\top h^e_t\right)^2 \tag{3}$$

where $y_k$ is $1$ if the video belongs to class $k$, and is $-1$ otherwise; $T$ is the number of frames in the clip and $K$ is the number of classes.
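To make the squared-hinge loss concrete, here is a minimal plain-Python sketch. It takes per-time-step class scores as given (in the model these would come from the event LSTM state and the class weight vectors) and uses labels of +1 for the true class and -1 for all others. Summing over time steps and scaling by 1/2 are assumptions of this sketch, and `squared_hinge_loss` is a name chosen for the example.

```python
def squared_hinge_loss(scores, true_class):
    """Squared-hinge loss over all time steps and classes.

    scores[t][k] is the score of class k at time step t.
    The label is +1 for the true class and -1 for every other class.
    """
    loss = 0.0
    for step_scores in scores:
        for k, score in enumerate(step_scores):
            y = 1.0 if k == true_class else -1.0
            loss += max(0.0, 1.0 - y * score) ** 2
    return 0.5 * loss


# Hypothetical example: two time steps, two classes, true class 0.
example_loss = squared_hinge_loss([[2.0, -2.0], [0.5, 0.0]], true_class=0)
```

Scores that already clear the margin (the first time step here) contribute nothing, so only under-confident or wrong predictions are penalized.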

4.3 Attention models

Unlike past attention models [3, 68, 70] we need to attend to a different set of features at each time-step. There are two key issues to address in this setting.

First, although we have different detections in each frame, they can be connected across the frames through an object tracking method. This could lead to better feature representation of the players.

Second, player attention depends on the state of the event and needs to evolve with the event. For instance, during the start of a “free-throw” it is important to attend to the player making the shot. However, towards the end of the event the success or failure of the shot can be judged by observing the person in possession of the ball.

With these issues in mind, we first present our model which uses player tracks and learns a BLSTM based representation for each player track. We then also present a simple tracking-free baseline model.

Attention model with tracking. We first associate the detections belonging to the same player into tracks using a standard method. We use a KLT tracker combined with bipartite graph matching [34] to perform the data association.

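The player tracks can now be used to incorporate context from adjacent frames while computing their representation. We do this through a separate BLSTM which learns a latent representation for each player at a given time-step. The latent representation of player $i$ in frame $t$ is given by the hidden state $h^p_{ti}$ of the BLSTM across the player-track:

$$h^p_{ti} = \mathrm{BLSTM}_{\mathrm{track}}(h^p_{t-1,i}, h^p_{t+1,i}, p_{ti}) \tag{4}$$

At every time-step we want the most relevant player at that instant to be chosen. We achieve this by computing $a_t$ as a convex combination of the player representations at that time-step:

$$a^{\mathrm{track}}_t = \sum_{i=1}^{N_t} \gamma^{\mathrm{track}}_{ti}\, h^p_{ti}, \qquad \gamma^{\mathrm{track}}_{ti} = \mathrm{softmax}\left(\phi(h^f_t, h^p_{ti}, h^e_{t-1});\, \tau\right) \tag{5}$$

where $N_t$ is the number of detections in frame $t$, and $\phi$ is a multi-layer perceptron, similar to [3], applied to the frame context $h^f_t$, the player representation and the previous event state $h^e_{t-1}$. $\tau$ is the softmax temperature parameter. This attended player representation is input to the unidirectional event recognition LSTM in Eq. 2. This model is illustrated in Figure 3.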
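The convex combination at the heart of the attention step can be sketched as follows. The relevance scores are taken as given here (in the model they are produced by the multi-layer perceptron), and dividing the scores by the temperature before the softmax is an assumed convention, since the text does not state how the temperature enters. The name `attend` is chosen for the example.

```python
import math


def attend(player_feats, relevance_scores, temperature=1.0):
    """Combine per-player feature vectors using softmax attention weights.

    player_feats: list of equal-length feature vectors, one per detection.
    relevance_scores: one scalar score per detection.
    Returns (weights, attended_feature); the weights are non-negative and
    sum to 1, so the attended feature is a convex combination of the inputs.
    """
    scaled = [s / temperature for s in relevance_scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(player_feats[0])
    attended = [sum(w * feat[d] for w, feat in zip(weights, player_feats))
                for d in range(dim)]
    return weights, attended


# Hypothetical example: two players with 2-dimensional features.
feats = [[1.0, 0.0], [0.0, 1.0]]
soft_weights, _ = attend(feats, [1.0, 0.0], temperature=1.0)
sharp_weights, _ = attend(feats, [1.0, 0.0], temperature=0.25)
```

Under this convention a lower temperature concentrates the weight on the highest-scoring player, which pushes the model toward selecting a single key actor per frame.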

Attention model without tracking. Often, tracking people in a crowded scene can be very difficult due to occlusions and fast movements. In such settings, it is beneficial to have a tracking-free model. This could also allow the model to be more flexible in switching attention between players as the event progresses. Motivated by this, we present a model where the detections in each frame are considered to be independent from other frames.

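We compute the (no track) attention based player feature as shown below:

$$a^{\mathrm{notrack}}_t = \sum_{i=1}^{N_t} \gamma^{\mathrm{notrack}}_{ti}\, p_{ti}, \qquad \gamma^{\mathrm{notrack}}_{ti} = \mathrm{softmax}\left(\phi(h^f_t, p_{ti}, h^e_{t-1});\, \tau\right) \tag{6}$$

Note that this is similar to the tracking based attention equations except for the direct use of the player detection feature $p_{ti}$ in place of the BLSTM representation $h^p_{ti}$.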

Event           IDT [64]  IDT [64] player  C3D [60]  MIL [1]  LRCN [8]  Only player  Avg. player  Our no track  Our track
3-point succ.   0.370     0.428            0.117     0.237    0.462     0.469        0.545        0.583         0.600
3-point fail.   0.501     0.481            0.282     0.335    0.564     0.614        0.702        0.668         0.738
fr-throw succ.  0.778     0.703            0.642     0.597    0.876     0.885        0.809        0.892         0.882
fr-throw fail.  0.365     0.623            0.319     0.318    0.584     0.700        0.641        0.671         0.516
layup succ.     0.283     0.300            0.195     0.257    0.463     0.416        0.472        0.489         0.500
layup fail.     0.278     0.311            0.185     0.247    0.386     0.305        0.388        0.426         0.445
2-point succ.   0.136     0.233            0.078     0.224    0.257     0.228        0.255        0.281         0.341
2-point fail.   0.303     0.285            0.254     0.299    0.378     0.391        0.473        0.442         0.471
sl. dunk succ.  0.197     0.171            0.047     0.112    0.285     0.107        0.186        0.210         0.291
sl. dunk fail.  0.004     0.010            0.004     0.005    0.027     0.006        0.010        0.006         0.004
steal           0.555     0.473            0.303     0.843    0.876     0.843        0.894        0.886         0.893
Mean            0.343     0.365            0.221     0.316    0.469     0.452        0.489        0.505         0.516

Table 2: Mean average precision for event classification given isolated clips.

Event             IDT [64]  IDT [64] player  C3D [60]  LRCN [8]  Only player  Avg. player  Attn no track  Attn track
3-point succ.     0.194     0.203            0.123     0.230     0.251        0.268        0.263          0.239
3-point fail.     0.393     0.376            0.311     0.505     0.526        0.521        0.556          0.600
free-throw succ.  0.585     0.621            0.542     0.741     0.777        0.811        0.788          0.810
free-throw fail.  0.231     0.277            0.458     0.434     0.470        0.444        0.468          0.405
layup succ.       0.258     0.290            0.175     0.492     0.402        0.489        0.494          0.512
layup fail.       0.141     0.200            0.151     0.187     0.142        0.139        0.207          0.208
2-point succ.     0.161     0.170            0.126     0.352     0.371        0.417        0.366          0.400
2-point fail.     0.358     0.339            0.226     0.544     0.578        0.684        0.619          0.674
slam dunk succ.   0.137     0.275            0.114     0.428     0.566        0.457        0.576          0.555
slam dunk fail.   0.007     0.006            0.003     0.122     0.059        0.009        0.005          0.045
steal             0.242     0.255            0.187     0.359     0.348        0.313        0.340          0.339
Mean              0.246     0.273            0.219     0.400     0.408        0.414        0.426          0.435

Table 3: Mean average precision for event detection given untrimmed videos.

5 Experimental evaluation

In this section, we present three sets of experiments on the NCAA basketball dataset: 1. event classification, 2. event detection and 3. evaluation of attention.

5.1 Implementation details

We used a hidden state dimension of for all the LSTM and BLSTM RNNs, an embedding layer with ReLU non-linearity and dimensions for embedding the player features and frame features before feeding to the RNNs. We used bins with spatial pyramid pooling for the player location feature. All the event video clips were four seconds long and subsampled to 6fps. The value was set to for the attention softmax weighting. We used a batch size of , and a learning rate of which was reduced by a factor of every iterations with RMSProp [59]. The models were trained on a cluster of GPUs for iterations over one day. The hyperparameters were chosen by cross-validating on the validation set.

5.2 Event classification

In this section, we compare the ability of methods to classify isolated video-clips into 11 classes. We do not use any additional negatives from other parts of the basketball videos. We compare our results against different control settings and baseline models explained below:

  • IDT[64] We use the publicly available implementation of dense trajectories with Fisher encoding.

  • IDT[64] player We use IDT along with averaged features extracted from the player bounding boxes.

  • C3D [60] We use the publicly available pre-trained model for feature extraction with an SVM classifier.

  • LRCN [8] We use an LRCN model with frame-level features. However, we use a BLSTM in place of an LSTM. We found this to improve performance. Also, we do not back-propagate into the CNN extracting the frame-level features to be consistent with our model.

  • MIL [1] We use a multi-instance learning method to learn bag (frame) labels from the set of player features.

  • Only player We only use our player features from Sec. 4.1 in our model without frame-level features.

  • Avg. player We combine the player features by simple averaging, without using attention.

  • Attention no track Our model without tracks (Eq. 6).

  • Attention with track Our model with tracking (Eq. 5).

The mean average precision (mAP) for each setting is shown in Tab. 2. We see that the method that uses both global information and local player information outperforms the model only using local player information (“Only player”) and only using global information (“LRCN”). We also show that combining the player information using a weighted sum (i.e., an attention model) is better than uniform averaging (“Avg. player”), with the tracking based version of attention slightly better than the track-free version. Also, a standard weakly-supervised approach such as MIL seems to be less effective than any of our modeling variants.
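For readers unfamiliar with the metric, the sketch below shows one common way to compute average precision per class and then average over classes. The paper does not state which AP variant was used, so this non-interpolated form and the function names are assumptions for illustration.

```python
def average_precision(scores, labels):
    """Non-interpolated AP for one class.

    scores: classifier scores for each clip.
    labels: 1 if the clip belongs to the class, 0 otherwise.
    AP is the mean of the precision values measured at each positive clip
    when clips are ranked by decreasing score.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precisions = []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_class):
    """per_class: list of (scores, labels) pairs, one pair per event class."""
    return sum(average_precision(s, l) for s, l in per_class) / len(per_class)


# Hypothetical example: four clips, two of which are positives.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

Because AP is computed per class before averaging, rare classes such as “slam dunk fail” weigh as much in the mean as frequent ones.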

The performance varies by class. In particular, performance is much poorer (for all methods) for classes such as “slam dunk fail” for which we have very little data. However, performance is better for shot-based events like “free-throw”, “layups” and “3-pointers”, where attending to the shot-making person or defenders can be useful.

5.3 Event detection

In this section, we evaluate the ability of methods to temporally localize events in untrimmed videos. We use a sliding window approach, where we slide a 4-second window (the same length as the event clips) through all the basketball videos and try to classify the window into a negative class or one of the 11 event classes. We use a stride length of seconds. We treat all windows which do not overlap more than second with any of the annotated events as negatives. We use the same setting for training, test and validation. This leads to negative examples across all the videos. We compare with the same baselines as before. However, we were unable to train the MIL model due to computational limitations.
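The window labeling scheme can be sketched as below. The window, stride and overlap values used in the example call are illustrative placeholders rather than the paper's settings, and assigning a non-negative window the label of the event it overlaps most is an assumption, since the text only defines which windows count as negatives.

```python
def label_windows(video_len_s, events, window_s, stride_s, max_neg_overlap_s):
    """Slide a fixed-length window over a video and label each position.

    events: list of (start_s, end_s, label) annotations.
    A window whose overlap with every event is at most max_neg_overlap_s is
    labeled "negative"; otherwise it takes the label of the event with the
    largest overlap (an assumption of this sketch).
    """
    windows = []
    start = 0.0
    while start + window_s <= video_len_s:
        end = start + window_s
        overlaps = [(min(end, e_end) - max(start, e_start), label)
                    for e_start, e_end, label in events]
        best_overlap, best_label = max(overlaps, default=(0.0, None))
        if best_overlap <= max_neg_overlap_s:
            windows.append((start, end, "negative"))
        else:
            windows.append((start, end, best_label))
        start += stride_s
    return windows


# Hypothetical example: a 12 s video with one annotated event from 4 s to 8 s.
labeled = label_windows(12.0, [(4.0, 8.0, "steal")],
                        window_s=4.0, stride_s=2.0, max_neg_overlap_s=1.0)
```

Since most of a game contains no annotated event, this procedure yields far more negative windows than positive ones, which is what makes detection harder than classifying isolated clips.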

The detection results are presented in Tab. 3. We see that, as before, the attention models beat previous state of the art methods. Not surprisingly, all methods are slightly worse at temporal localization than for classifying isolated clips. We also note a significant difference in classification and detection performance for “steal” in all methods. This can be explained by the large number of negative instances introduced in the detection setting. These negatives often correspond to players passing the ball to each other. The “steal” event is quite similar to a “pass” except that the ball is passed to a player of the opposing team. This makes the “steal” detection task considerably more challenging.

5.4 Analyzing attention

We have seen above that attention can improve the performance of the model at tasks such as classification and detection. Now, we evaluate how accurate the attention models are at identifying the key players. (Note that the models were never explicitly trained to identify key players).

Figure 4: We highlight (in cyan) the “attended” player at the beginning of different events. The position of the ball in each frame is shown in yellow. Each column shows a different event. In these videos, the model attends to the person making the shot at the beginning of the event.
Figure 5: We visualize the distribution of attention over different positions of a basketball court as the event progresses. This is shown for 3 different events. These heatmaps were obtained by first transforming all videos to a canonical view of the court (shown in the background of each heatmap). The top row shows the sample frames which contributed to the “free-throw” success heatmaps. It is interesting to note that the model focuses on the location of the shooter at the beginning of an event and later the attention disperses to other locations.

To evaluate the attention models, we labeled the player who was closest (in image space) to the ball as the “shooter”. (The ball location is annotated in 850 test clips.) We used these annotations to evaluate if our “attention” scores were capable of classifying the “shooter” correctly in these frames.
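The “shooter” labeling rule can be sketched as a nearest-neighbor lookup. Measuring distance from the center of each bounding box is an assumption of this sketch (the text only says closest in image space), and `label_shooter` is a name chosen for the example.

```python
import math


def label_shooter(player_boxes, ball_xy):
    """Return the index of the player box whose center is nearest to the ball.

    player_boxes: list of (x_min, y_min, x_max, y_max) in pixels.
    ball_xy: (x, y) position of the ball in the same frame.
    """
    def distance_to_ball(box):
        x0, y0, x1, y1 = box
        center_x, center_y = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        return math.hypot(center_x - ball_xy[0], center_y - ball_xy[1])

    distances = [distance_to_ball(box) for box in player_boxes]
    return min(range(len(distances)), key=distances.__getitem__)


# Hypothetical example: two players, with the ball near the second one.
shooter_index = label_shooter([(0, 0, 10, 20), (50, 0, 60, 20)], ball_xy=(52, 5))
```

The resulting index gives a binary label per detection, against which the per-player attention weights can be scored as a ranking.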

Event             Chance  Attn. with track  Attn. no track
3-point succ.     0.333   0.445             0.519
3-point fail.     0.334   0.391             0.545
free-throw succ.  0.376   0.416             0.772
free-throw fail.  0.346   0.387             0.685
layup succ.       0.386   0.605             0.627
layup fail.       0.382   0.508             0.605
2-point succ.     0.355   0.459             0.554
2-point fail.     0.346   0.475             0.542
slam dunk succ.   0.413   0.347             0.686
slam dunk fail.   0.499   0.349             0.645
Mean              0.377   0.438             0.618

Table 4: Mean average precision for attention evaluation.

The mean AP for this “shooter” classification is listed in Tab. 4. The results show that the track-free attention model is quite consistent in picking the shooter for several classes like “free-throw succ./fail”, “layup succ./fail.” and “slam dunk succ.”. This is a very promising result which shows that attention on player detections alone is capable of localizing the player making the shot. This could be a useful cue for providing more detailed event descriptions including the identity and position of the shooter as well.

Figure 6: The distribution of attention for our model with tracking, at the beginning of “free-throw success”. Unlike Fig. 5, the attention is concentrated at a specific defender’s position. Free-throws have a distinctive defense formation, and observing the defenders can be helpful as shown in the sample images in the top row.

In addition to the above quantitative evaluation, we also visualize the attention masks. Figure 4 shows sample videos. In order to make results comparable across frames, we annotated 5 points on the court and aligned all the attended boxes for an event to one canonical image. Fig. 5 shows a heatmap visualizing the spatial distributions of the attended players with respect to the court. It is interesting to note that our model consistently focuses under the basket for a layup, at the free-throw line for free-throws and outside the 3-point ring for 3-pointers.

Another interesting observation is that the attention scores for the tracking based model are less selective in focusing on the shooter. We observed that the tracking model is often reluctant to switch attention between frames and focuses on a single player throughout the event. This biases the model towards players who are present throughout the video. For instance, in free-throws (Fig. 6) the model always attends to the defender at a specific position, who is visible throughout the entire event unlike the shooter.

6 Conclusion

We have introduced a new attention based model for event classification and detection in multi-person videos. Apart from recognizing the event, our model can identify the key people responsible for the event without being explicitly trained with such annotations. Our method can generalize to any multi-person setting. However, for the purpose of this paper we introduced a new dataset of basketball videos with dense event annotations and compared our performance with state-of-the-art methods on this dataset. We also evaluated the ability of our model to recognize the “shooter” in the events with visualizations of the spatial locations attended by our model.

References