Paying More Attention to Motion: Attention Distillation for Learning Video Representations

04/05/2019 ∙ by Miao Liu, et al. ∙ 0

We address the challenging problem of learning motion representations using deep models for video recognition. To this end, we make use of attention modules that learn to highlight regions in the video and aggregate features for recognition. Specifically, we propose to leverage output attention maps as a vehicle to transfer the learned representation from a motion (flow) network to an RGB network. We systematically study the design of attention modules, and develop a novel method for attention distillation. Our method is evaluated on major action benchmarks, and consistently improves the performance of the baseline RGB network by a significant margin. Moreover, we demonstrate that our attention maps can leverage motion cues in learning to identify the location of actions in video frames. We believe our method provides a step towards learning motion-aware representations in deep models.




1 Introduction

Action recognition in videos has emerged as a key challenge for modern deep models. This task requires a sophisticated understanding of the contributions of spatial and temporal cues and the best methods for extracting and fusing them. The two-stream architecture [40], exemplified by two-stream I3D [3], has proven to be a highly effective framework for investigating these issues. Combining the two modalities of appearance and motion is conceptually appealing, delivers good performance, and has been shown to be capable of learning complex spatio-temporal features [6]. Fig. 1 illustrates an attention-based approach to understanding the visual cues utilized in two-stream action recognition. On the left, the Grad-Cam method [37] is used to produce attention maps from the appearance (RGB) and motion (flow) streams of the I3D model. The attention maps from the two streams are qualitatively different. The appearance modality focuses on the subject’s body and part of the flute, while the motion modality highlights the moving fingers. Intuitively, both object properties and motion patterns are needed to recognize actions.

We ask basic questions in this context: “Does a deep model need an explicit flow channel to capture motion patterns? Can the model infer the same information from the RGB channel alone?” These questions connect to the more general question of whether single stream architectures can be competitive with two-stream architectures [13]. Several previous works have addressed the challenge of learning video representations that encode motion information using a single RGB stream [44, 48, 42, 5]. Our work shares the same motivation, yet pursues a vastly different approach.

Figure 1: Left: Attention maps from the RGB and flow streams of I3D [3], produced by Grad-Cam [37]. RGB and flow networks attend to different aspects of an action, yet both are essential for recognition. Right: Our proposed attention distillation model learns to predict both motion attention and appearance attention from RGB frames. Note that the attention maps on the right are explicitly trained with a probabilistic attention model, and so will not agree completely with the Grad-Cam result.

To this end, we present a novel representation learning method, called attention distillation. Our method makes use of an explicit probabilistic attention model, and leverages motion information available at training time to predict the motion-sensitive attention features from a single RGB stream. In addition to their utility in visualizing and understanding learned feature representations, we argue that attention models provide an attractive vehicle for mapping between sensing modalities in a task-sensitive way. Once learned, our model jointly predicts appearance and motion attention maps from a single RGB stream at testing time, as illustrated in Fig. 1. The attention maps from our method (right) encode different aspects of the action. The appearance channel captures the gist of the action while the motion channel “zooms in” on the moving region. This ability to identify moving regions from RGB frames demonstrates an exciting possibility for modeling motion in videos.

We summarize our contributions into three parts.

  • We provide the first systematic study of attention mechanisms for action recognition. We demonstrate that modeling attention as probabilistic variables can better facilitate the learning of deep models.

  • We propose a novel method for learning motion-aware video representations from RGB frames. Our method learns an RGB network that mimics the attention map of a flow network, thereby distilling important motion knowledge into the representation learning.

  • Our method achieves consistent improvements of more than 1% across major datasets (UCF101 [41], HMDB51 [22] and 20BN-Something-Something [11, 28]) with almost no extra computational cost.

2 Related Works

2.1 Action Recognition

Action recognition is well studied in computer vision. A recent survey can be found in [35]. We focus on methods using deep models. Simonyan and Zisserman [40] proposed two-stream convolutional networks. Their key idea is to factorize the learning of spatial and temporal features into two networks: an RGB network using video frames and a flow network using optical flow maps. Spatiotemporal features can also be learned from video frames using recurrent networks. Donahue et al. [4] proposed to model a sequence of frames using LSTMs [16]. A similar idea was discussed in [50]. More recently, Tran et al. [43] proposed to use 3D convolutional networks for action recognition. This idea was further studied by [13, 3]. For example, two stream 3D networks were used for learning video representations [3]. Both recurrent networks and 3D convolutional networks should be able to capture motion beyond a single frame. However, their performance using video frames alone still falls far behind their two stream versions [50, 3]. Our work seeks to address this challenge of learning motion-aware video representations from RGB frames.

There are several recent attempts in this direction. For instance, Bilen et al. [1] proposed the dynamic image network, a compact representation of video frames. This representation makes use of the parameters of a ranking machine that captures the temporal evolution of video frames. Another example is Ng et al. [31], who proposed to jointly predict action labels and flow maps from video frames using multi-task learning. This idea is extended by Fan et al. [5], who fold the TV-L1 flow estimation [34] into their TVNet. Without using flow, Tran et al. [44] demonstrated that factorized 3D convolution (a 2D spatial convolution and a 1D temporal convolution) can facilitate the learning of spatiotemporal features and achieve higher recognition accuracy. A similar finding was also presented by Xie et al. [48]. Our method shares the same motivation as these approaches yet takes a different route. We explore attention mechanisms for video recognition, and propose to distill the predicted attention from a flow network to an RGB network.

Figure 2: Overview of our method. Our model (c) takes multiple RGB frames as inputs and feeds them into a backbone 3D convolutional network. Our model outputs two attention maps using the attention module (b), based on which the action labels are predicted. The motion map is learned by mimicking the attention from a reference flow network (a). And the appearance map is learned to highlight discriminative regions for recognition. These two maps are used to create spatiotemporal feature representations from video frames for action recognition.

2.2 Knowledge Distillation

Our attention distillation is inspired by knowledge distillation, first proposed by Caruana et al. [2] for model compression and further popularized by Hinton et al. [15]. The most relevant work comes from Zagoruyko et al. [51], who used attention to transfer knowledge from a teacher network to a student network. However, they did not consider cross-modal learning. We compare our method to [51] in our experiments. There are several recent attempts at knowledge distillation across modalities. Gupta et al. [12] proposed to transfer representations learned from labeled RGB images to unlabeled depth images (or flow maps). Garcia et al. [9] proposed to distill depth information into the appearance stream for action recognition by minimizing the distance between the depth and appearance features. More recently, Luo et al. [26] considered knowledge distillation from a source domain with multiple modalities to a target domain with a subset of modalities for action detection. Our method shares the same intuition of cross-modal knowledge distillation with those previous works.

However, our method differs from [12, 9, 26, 51] in two key aspects. First, our work focuses on the challenge of motion-aware video representation learning, which none of the previous works addressed: [51, 12, 26] did not consider video representation learning, and [9] did not consider the modality of motion. More importantly, we propose to distill attention maps, which indicate important regions for recognition, instead of directly matching the features. This design comes from the key challenge of video representation learning: motion is substantially different from appearance, and both modalities are important for recognition. In this case, we can no longer assume that different modalities share similar structural cues [9], or that a teacher model using one modality can better represent the data [51, 26].

2.3 Attention for Recognition

Attention has been widely used for visual recognition. Top-down task-driven attention can guide the search of objects [32], select local descriptors for object or action recognition [8, 29], or localize actions [38]. More recently, attention has been explored in deep models for object recognition [30] and image captioning [49]. Attention enables these models to “fixate” on image regions, where the decision is made based on a sequence of fixations. This definition is different from self-similarity as in [45]. Several attention mechanisms have been proposed for deep models. For example, Sharma et al. [39] integrated soft attention in LSTMs for action recognition. Li et al. [24] further extended [49] to videos. Specifically, they combined LSTMs with motion-based attention to infer the location of the actions. Girdhar and Ramanan [10] modeled top-down and bottom-up attention using bilinear pooling. Wang et al. [46] proposed a residual architecture for soft attention. Li et al. [23] considered attention as a probabilistic distribution. In this paper, we demonstrate that a prior distribution from human gaze is not necessary for modeling attention as a probabilistic variable. We also provide a systematic study of these methods for action recognition.

3 Distilling Motion Attention for Actions

In this section, we present our method of attention distillation for action recognition. We start with an overview of the key ideas, followed by a detailed description of the components of our method. Finally, we describe our network architecture and discuss the implementation details.

3.1 Overview

For simplicity, we consider a single input video with a fixed number of frames. Our method easily generalizes to multiple videos, e.g., for mini-batch training. The input video is a sequence of frames, each of the same resolution and indexed by its frame number. Given the video, our goal is to predict a video-level action label. We use the intermediate output of a 3D convolutional network to represent the video. This output is a 4D tensor: a 3D (temporal, height, width) grid of feature vectors, whose last dimension is the feature dimension. Our method consists of three key components.

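• Attention Generation. The model first predicts an attention map from the feature tensor using an attention mapping function. The attention map is a 3D tensor defined over the temporal and spatial grid of the features. Moreover, it is normalized within each temporal slice, i.e., its values sum to one over the spatial positions of each slice. The attention map is thus a sequence of 2D attention maps, one per temporal step.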

• Attention Guided Recognition. Based on the attention map and the feature map, the model further applies a recognition module to predict the action label. Specifically, this module uses the attention map to selectively pool features from the feature map, followed by a classifier that maps the resulting feature vector to the action label.


• Attention Distillation. To regularize the learning, we assume that the predicted attention map receives supervision from a teacher model that outputs a reference attention map. Our teacher model comes from a different modality and is equipped with the same attention mechanism for recognition.

Fig. 2 presents an overview of our method. Our model takes as input a video clip with multiple frames, and learns to predict two attention maps from its features: one for motion attention and one for appearance attention. Based on these two maps, the model further aggregates visual features that are passed into the final recognition sub-network. During training, we match the motion attention map to the attention map from the reference flow network. For testing, only the input video is required for recognition. Our model also outputs two attention maps that can be used to diagnose recognition results. We now detail the design of our key components.

3.2 Attention Generation

We explore two different approaches for generating an attention map from the features: soft attention [46] and its probabilistic version [23].

Soft Attention. Attention maps can be created by applying a linear function to the feature map, followed by a softmax. The linear function is a 1x1 convolution on the 3D feature grid. Softmax is applied on every time slice to normalize each 2D map.
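The soft attention described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `soft_attention`, the `(T, H, W, C)` layout, and the toy tensor sizes are all assumptions. A bias-free 1x1 convolution is written as a per-location dot product with a weight vector.

```python
import numpy as np

def soft_attention(features, w):
    """features: (T, H, W, C) feature grid (assumed layout).
    w: (C,) weights of a bias-free 1x1 convolution (hypothetical name)."""
    logits = features @ w                    # (T, H, W): one score per grid location
    t, h, wd = logits.shape
    flat = logits.reshape(t, h * wd)
    flat = flat - flat.max(axis=1, keepdims=True)   # subtract the max for numerical stability
    exp = np.exp(flat)
    attn = exp / exp.sum(axis=1, keepdims=True)     # softmax within each time slice
    return attn.reshape(t, h, wd)

rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 7, 7, 16))       # toy sizes, chosen only for illustration
attn = soft_attention(feats, rng.normal(size=16))
zero_attn = soft_attention(feats, np.zeros(16))     # all-zero weights give a uniform map
```

Each of the three 2D slices of `attn` sums to one, and all-zero weights produce a uniform map, which is the trivial solution that a good initialization is meant to steer away from.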

Probabilistic Soft Attention. An alternative approach is to further model the distribution of the linear mapping outputs, as discussed in [23]. Here we model the distribution of the attention map, with the softmax-normalized outputs of the linear mapping as its parameters. During training, an attention map can be sampled from this distribution using the Gumbel-Softmax trick [20, 27]. We follow [23] to regularize the learning by adding a loss term (Eq. 3): the Kullback-Leibler (KL) divergence between each time slice of the attention map and the 2D uniform distribution. This term matches each time slice of the attention map to the prior distribution. It is derived from variational learning and accounts for (1) the prior of attention maps and (2) additional regularization by spatial dropout [23]. During testing, we directly plug in the expected value of the attention map.
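A rough NumPy sketch of the two ingredients described above, sampling with the Gumbel-Softmax trick and the KL regularizer toward a uniform prior, is given below. The function names, the temperature parameter, and the direction of the KL divergence (attention map first, uniform prior second) are assumptions made for illustration; the text does not specify them.

```python
import numpy as np

def slice_softmax(logits):
    """Softmax over the spatial positions of each time slice; logits: (T, H, W)."""
    t, h, w = logits.shape
    flat = logits.reshape(t, h * w)
    flat = flat - flat.max(axis=1, keepdims=True)
    exp = np.exp(flat)
    return (exp / exp.sum(axis=1, keepdims=True)).reshape(t, h, w)

def sample_attention(logits, rng, temperature=1.0):
    """Training-time sample: add Gumbel noise to the logits, then normalize."""
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    return slice_softmax((logits + gumbel) / temperature)

def kl_to_uniform(attn, eps=1e-10):
    """Sum over time slices of KL(attn_t || uniform); the direction is an assumption."""
    t, h, w = attn.shape
    p = attn.reshape(t, h * w)
    log_uniform = np.log(1.0 / (h * w))
    return float(np.sum(p * (np.log(p + eps) - log_uniform)))

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 7, 7))          # toy linear-mapping outputs
sampled = sample_attention(logits, rng)      # used during training
expected = slice_softmax(logits)             # plugged in at test time
reg_uniform = kl_to_uniform(np.full((3, 7, 7), 1.0 / 49))
reg_peaked = kl_to_uniform(expected)
```

The regularizer is close to zero for a uniform map and grows as the map becomes more peaked, which is consistent with the smoother attention maps reported later for the probabilistic variant.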

Note that for both approaches, we restrict the attention mapping to a linear mapping without a bias term. In practice, this design avoids the trivial solution of generating a uniform attention map from an all-zero mapping. This all-zero solution almost never occurs during our training when the mapping is properly initialized.

3.3 Attention Guided Recognition

Our recognition module makes use of an attention map to select features from the feature map. Again, we consider two different models for the attention guided recognition.

Attention Pooling. We follow the attention mechanism in [46, 25] and design the recognition function (Eq. 4) as a tiled multiplication of the attention map with the feature map, followed by a classifier. This operation is equivalent to weighted average pooling over the feature map, followed by a linear classifier with softmax normalization. Specifically, the weights used for pooling (the attention map) are shared across all channels.
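Attention pooling as described above can be sketched as follows. The names `attention_pool` and `classify` are hypothetical, and averaging the pooled vectors over time slices is an assumption, since the text does not say how the temporal axis is aggregated.

```python
import numpy as np

def attention_pool(features, attn):
    """features: (T, H, W, C); attn: (T, H, W), normalized within each time slice."""
    weighted = features * attn[..., None]      # tile the attention map across channels
    per_slice = weighted.sum(axis=(1, 2))      # (T, C): weighted average over space
    return per_slice.mean(axis=0)              # (C,): average over time (assumption)

def classify(pooled, weights):
    """Linear classifier followed by softmax; weights: (C, num_classes)."""
    logits = pooled @ weights
    logits = logits - logits.max()
    exp = np.exp(logits)
    return exp / exp.sum()

rng = np.random.default_rng(0)
T, H, W, C, K = 3, 7, 7, 16, 5                 # toy sizes for illustration
feats = rng.normal(size=(T, H, W, C))
uniform = np.full((T, H, W), 1.0 / (H * W))
pooled = attention_pool(feats, uniform)
probs = classify(pooled, rng.normal(size=(C, K)))
```

With a uniform attention map the pooled vector reduces to plain average pooling, which makes the "weighted average pooling" reading of the operation easy to check.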

Residual Connection. Using the attention map to re-weight features helps to filter out background noise, yet also risks missing important foreground regions. This drawback was discussed in [46]. We follow their solution of adding a residual connection to the attention map: a 3D tensor of all ones is added to the attention map before it is multiplied with the feature map. Intuitively, this operation further adds average-pooled features to the representation before the linear classifier. By adding the residual term, features learned by the network are preserved.
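The effect of the residual connection can be checked numerically with a small sketch. It assumes the residual is applied literally by adding an all-ones tensor to the attention map before pooling; the helper names and the temporal averaging are illustrative assumptions. Under this literal reading, the residual term equals the unweighted spatial sum of the features, which matches average pooling up to a constant scale that a linear classifier can absorb.

```python
import numpy as np

def pool(features, weight_map):
    """Multiply features (T, H, W, C) by a (T, H, W) map, sum over space, average over time."""
    return (features * weight_map[..., None]).sum(axis=(1, 2)).mean(axis=0)

def residual_attention_pool(features, attn):
    """Pool with (attention + all-ones tensor); a literal reading of the residual design."""
    return pool(features, attn + np.ones_like(attn))

rng = np.random.default_rng(0)
T, H, W, C = 3, 7, 7, 16                       # toy sizes for illustration
feats = rng.normal(size=(T, H, W, C))
attn = rng.random((T, H, W))
attn /= attn.sum(axis=(1, 2), keepdims=True)   # normalize each time slice

res = residual_attention_pool(feats, attn)
attended = pool(feats, attn)                   # attention-pooled features
unweighted = pool(feats, np.ones_like(attn))   # residual term on its own
```

By linearity, `res` is the sum of `attended` and `unweighted`, so the attended representation is kept and the unweighted features are added back on top of it.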

3.4 Attention Distillation

The key to our method lies in the attention distillation during training. Specifically, we assume a reference flow network is given as the teacher network. The teacher model also uses an attention mechanism for recognition, and its motion attention map is used as an additional supervisory signal for training our RGB network. This RGB network is thus the student model that mimics the motion attention map. With probabilistic attention modeling, the imitation of the attention maps is enforced by a KL divergence loss between the teacher's attention map and the student's motion attention map (Eq. 6). This loss minimizes the distance between the attention maps at every time step. In our implementation, our teacher flow network is trained with the same attention mechanism. Once trained, the weights of the teacher model remain fixed during the learning of the student model, and only the student model (RGB network) is used for inference.
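One way to write such a per-time-step loss is sketched below. The direction of the KL divergence (teacher first, student second) and the summation over time steps are assumptions; the function name `distillation_loss` is hypothetical.

```python
import numpy as np

def distillation_loss(teacher, student, eps=1e-10):
    """Sum over time steps of KL(teacher_t || student_t).
    Both maps are (T, H, W) and normalized within each time slice.
    The KL direction is an assumption, not taken from the text."""
    t = teacher.shape[0]
    p = teacher.reshape(t, -1)
    q = student.reshape(t, -1)
    per_step = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)
    return float(per_step.sum())

def normalize(raw):
    """Turn a positive (T, H, W) array into per-slice distributions."""
    return raw / raw.sum(axis=(1, 2), keepdims=True)

rng = np.random.default_rng(0)
teacher_map = normalize(rng.random((3, 7, 7)))   # stands in for the fixed flow network's map
student_map = normalize(rng.random((3, 7, 7)))   # stands in for the RGB network's motion map

loss_same = distillation_loss(teacher_map, teacher_map)
loss_diff = distillation_loss(teacher_map, student_map)
```

The loss is zero when the student reproduces the teacher's map and positive otherwise. In training, gradients would flow only into the student, since the teacher's weights are kept fixed.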

3.5 Our Full Model

Putting everything together, we summarize our full model with probabilistic soft attention and attention distillation. Specifically, our model estimates two probabilistic attention maps, one for motion and one for appearance. These maps are further used to predict the action labels, where each map is used for attention pooling following Eq. 4. We assume equal weights for the motion and appearance terms. Further tuning the weights barely affects the performance in practice.

Loss Function. Our training loss is the weighted sum of three terms: the cross entropy loss between the predicted labels and the ground truth, and two KL terms. The first, cross entropy term minimizes the classification error. The second, KL term (from Eq. 6) enforces that the motion attention mimics the attention map from the reference flow network. The third, KL term (from Eq. 3) regularizes the learning of the appearance attention. Two coefficients are used to balance the three terms. We choose as .
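The three-term structure of the loss can be sketched as below. The coefficient names `alpha` and `beta` are hypothetical, and no default values are given because the values used in the paper are not available in this copy. The KL directions follow the same assumptions as a teacher-to-student distillation term and an attention-to-uniform regularizer.

```python
import numpy as np

def cross_entropy(probs, label, eps=1e-10):
    """Negative log-probability of the ground-truth class."""
    return float(-np.log(probs[label] + eps))

def kl(p, q, eps=1e-10):
    """Sum over time slices of KL(p_t || q_t) for (T, H, W) maps."""
    p = p.reshape(p.shape[0], -1)
    q = q.reshape(q.shape[0], -1)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def total_loss(probs, label, motion, teacher, appearance, alpha, beta):
    """Cross entropy + alpha * distillation KL + beta * uniform-prior KL.
    alpha and beta are hypothetical names for the two balancing coefficients."""
    uniform = np.full_like(appearance, 1.0 / appearance[0].size)
    return (cross_entropy(probs, label)
            + alpha * kl(teacher, motion)
            + beta * kl(appearance, uniform))

probs = np.array([0.7, 0.2, 0.1])
uniform_map = np.full((2, 4, 4), 1.0 / 16)
peaked_map = uniform_map.copy()
peaked_map[:, 0, 0] += 0.5
peaked_map /= peaked_map.sum(axis=(1, 2), keepdims=True)

# Student matches the teacher and the appearance map is uniform: only cross entropy remains.
loss_min = total_loss(probs, 0, uniform_map, uniform_map, uniform_map, alpha=1.0, beta=1.0)
# A teacher map that differs from the student's motion map adds a positive penalty.
loss_more = total_loss(probs, 0, uniform_map, peaked_map, uniform_map, alpha=1.0, beta=1.0)
```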

3.6 Implementation Details

Network Architecture. Our model uses the I3D network [3] as the backbone. I3D has five 3D convolution blocks, and three of them are composed of multiple Inception Modules. For all attention modules, the intermediate feature is obtained from the outputs of the 4th convolution block. The attention map is used to select the final network feature from the last Inception module of the 5th block.

Data Preparation. We down-sample all frames to a fixed resolution at a frame rate of 24. For training, we compute optical flow using TV-L1 [34]. We apply several data augmentation techniques, including random flipping, cropping and color perturbation, to prevent over-fitting. Our model takes 24 consecutive frames as inputs, and all input frames are cropped to a fixed size for training. For testing, we evaluate our model on full resolution clips and aggregate scores from all clips to produce the video-level results.
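The clip-to-video aggregation at test time might look like the sketch below. Both the non-overlapping clip layout and the use of a plain average over clip scores are assumptions; the text only states that scores from all clips are aggregated. The function names are hypothetical.

```python
import numpy as np

def clip_starts(num_frames, clip_len=24):
    """Start indices of non-overlapping clips (the stride is an assumption)."""
    return list(range(0, num_frames - clip_len + 1, clip_len))

def video_prediction(clip_scores):
    """clip_scores: (num_clips, num_classes). Averaging is an assumed aggregation rule."""
    video_scores = np.mean(clip_scores, axis=0)
    return int(np.argmax(video_scores)), video_scores

starts = clip_starts(100)                      # a 100-frame video yields four 24-frame clips
clip_scores = np.array([[0.6, 0.3, 0.1],
                        [0.2, 0.5, 0.3],
                        [0.1, 0.7, 0.2]])      # made-up class scores for three clips
label, scores = video_prediction(clip_scores)
```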

Training and Inference Details. All our models are trained using SGD with a momentum of 0.9. Their weights are initialized from Kinetics pre-trained models provided by the authors of [3]. Our models are trained with a batch size of 64 on 4 GPUs. The initial learning rate is 0.01 and is decayed by a factor of 10 when the loss starts to saturate. We set weight decay to 4e-5 and enable batch norm [19]. We also apply dropout on the output of the attention modules before the recognition network (dropout rate=0.7). Our model is implemented in TensorFlow and the code will be made publicly available.

During testing, our model does not need optical flow and runs at the same speed as the RGB network.

4 Experiments

We now present our experiments and results. Our results are summarized into three parts. First, we provide a systematic evaluation of attention guided action recognition. Second, we benchmark our attention distillation and compare our results to the state-of-the-art methods on several public datasets. Finally, we further investigate the predicted attention maps and the learned features of our model.

4.1 Attention Guided Action Recognition

We start from an ablation study of attention guided action recognition. Specifically, we evaluate different combinations of attention generation and attention based recognition. And we compare their results to those from models without attention. Our experiments show that a proper design of the attention mechanism can consistently improve the performance of action recognition across datasets. We now present our benchmark, baselines and results.

Benchmark. We use two public action recognition datasets for this experiment: UCF101 and HMDB51. UCF101 [41] contains 13,320 videos from 101 action categories. HMDB51 [22] includes 6,766 videos from 51 action categories. We evaluate mean class accuracy and report the results using the first split of these two datasets.
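Mean class accuracy averages the per-class accuracies, so every action category counts equally regardless of how many test videos it has. A minimal sketch with made-up labels (the function name is hypothetical):

```python
from collections import defaultdict

def mean_class_accuracy(labels, predictions):
    """Average of per-class accuracies over the classes present in labels."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p in zip(labels, predictions):
        total[y] += 1
        correct[y] += int(y == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Toy example: class 0 has four videos (three correct), class 1 has two (one correct).
labels = [0, 0, 0, 0, 1, 1]
predictions = [0, 0, 0, 1, 1, 0]
mca = mean_class_accuracy(labels, predictions)
plain = sum(int(y == p) for y, p in zip(labels, predictions)) / len(labels)
```

Here the per-class accuracies are 0.75 and 0.5, so the mean class accuracy is 0.625, while plain accuracy over all six videos is 4/6.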

Mean class accuracy:

Stream  Method            UCF101  HMDB51
RGB     I3D (backbone)    94.8    70.9
RGB     Soft-Atten        94.7    70.8
RGB     Soft-Res          94.9    70.1
RGB     Prob-Atten        95.1    71.3
Flow    I3D (backbone)    94.0    73.9
Flow    Soft-Atten        94.7    74.1
Flow    Soft-Res          95.2    74.4
Flow    Prob-Atten        94.9    74.2

Table 1: Evaluations of attention modules. We compare three design choices with the RGB/flow stream on UCF101/HMDB51. Adding attention to the backbone I3D slightly improves the performance of the RGB and flow streams, and Prob-Atten provides a consistent performance boost on both streams and across datasets.

Baselines. We consider the different combinations of how the model generates attention maps (Soft vs. Probabilistic Attention) and how the attention maps are used for recognition (Attention Pooling vs. Residual Connection). In addition, we also show how combining motion attention and appearance attention affects the recognition performance. The valid combinations include the following.

  • Soft-Atten combines soft attention and attention pooling for recognition. This is used in [25].

  • Soft-Res is the residual attention in [46] that further adds residual connection to Soft-Atten.

  • Prob-Atten is the attention module in [23] that combines probabilistic attention with attention pooling.

We note that the combination of Prob+Res is invalid as it violates the probabilistic modeling of attention. In practice, we also found its training unstable. Therefore, we report the results of three valid designs for both RGB and flow stream on UCF101 and HMDB51 datasets. We also include results of the vanilla I3D models (our backbone) using the same input sequence length (24 frames) and the models that use both motion attention and appearance attention for feature pooling. These results are summarized in Table 1.

Results. Adding attention to the backbone recognition network almost always improves the performance by a small margin, with the exception of Soft-Atten. The performance boost from the attention module is larger for the flow stream than for the RGB stream. For both UCF101 and HMDB51, the best performing method is Prob-Atten for the RGB stream (+0.3%/+0.4% on UCF101/HMDB51) and Soft-Res for the flow stream (+1.2%/+0.5%). Prob-Atten also outperforms the I3D baseline for the flow stream, yet Soft-Res decreases the performance of the RGB stream on HMDB51. Across the modalities and datasets, the Prob-Atten design consistently improves the recognition accuracy, even without human gaze as a supervisory signal as in [23].

4.2 Attention Distillation for Action Recognition

We now evaluate our method of attention distillation. In this setting, we assume a reference flow network with an attention module is given at training time. We attach two attention modules, both following the same attention module design as the reference network, to our RGB backbone. The motion attention is asked to mimic the attention map from the reference flow network. We present our benchmarks and results on action recognition, and contrast our method with the feature matching method [51].

Benchmark. While Kinetics [21] is without question the state-of-the-art dataset, its size is a significant practical barrier to experimentation. For our scientific questions relating to the ability to learn motion-sensitive representations, it is not necessary to tackle the full size of Kinetics. Instead, we report results of action recognition on UCF101 [41], HMDB51 [22] and a large scale dataset–20BN-V2 [28]. For UCF101 and HMDB51, we report mean class accuracy on the first splits, and compare our results with latest methods. Moreover, we conduct experiments on the challenging 20BN Something-Something-v2 (20BN-V2) [28] dataset. 20BN-V2 has over 220K videos from 174 fine-grained action categories, with the number of samples following a long-tailed distribution. We use their training and validation split, report top-1 and top-5 accuracy following [28, 52], and compare our results to strong baselines.
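Top-1 and top-5 accuracy count a prediction as correct when the ground-truth class is among the 1 or 5 highest-scoring classes. A small sketch with made-up scores (the function name is hypothetical; it uses k = 1 and k = 2 because the toy example has only three classes):

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, y in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += int(y in ranked[:k])
    return hits / len(labels)

scores = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2],
          [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

In this toy case only the first sample is correct at top-1, and the first two are correct at top-2.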

Mean class accuracy:

Input        Method                    UCF101  HMDB51
RGB + Flow   Two Stream [40]           88.0    59.4
RGB + Flow   Two Stream LSTM [50]      88.6    -
RGB + Flow   Joint Two stream [7]      92.5    65.4
RGB + Flow   TSN [47]                  94.0    68.5
RGB + Flow   Dynamic Image [1]         95.0    71.5
RGB + Flow   Two Stream I3D* [3]       96.8    76.1
RGB Only     VGG16 [40]                73.0    40.5
RGB Only     RGB LSTM [50]             82.6    -
RGB Only     RGB TSN [47]              84.5    -
RGB Only     Dynamic Image [1]         90.6    61.3
RGB Only     P3D ResNet [36]           88.6    -
RGB Only     ActionFlowNet [31]        83.9    56.4
RGB Only     TVNet [5]                 94.5    71.0
RGB Only     I3D RGB* (backbone) [3]   94.8    70.9
RGB Only     Ours (Soft-Distill)       95.2    71.4
RGB Only     Ours (Prob-Distill)       95.7    72.0
More Frames  I3D RGB (64f) [3]         95.6    74.8
More Frames  R(2+1)D RGB (32f) [44]    96.8    74.5
More Frames  S3D (64f) [48]            96.8    75.9
More Frames  Two Stream I3D (64f) [3]  98.0    80.7

Table 2: Results of action recognition on UCF101 and HMDB51. We compare the results of our model with several previous methods. Our model outperforms state-of-the-art results that use a single RGB stream and the same input sequence length. *For fair comparison, we report results of I3D models that use 24 frames (24f) as inputs, the same as our model.

Method                    Top-1 / Top-5 Acc  Temporal Footprint
TRN RGB [52]              48.8 / 77.6        5 sec
TRN RGB+Flow [52]         55.5 / 83.1        5 sec
I3D RGB+Atten (backbone)  48.1 / 77.8        1 sec
I3D Flow+Atten (ref)      48.3 / 77.9        1 sec
Ours (Prob-Distill)       49.9 / 79.1        1 sec

Table 3: Action recognition results on the 20BN-V2 dataset [28]. We report top-1/top-5 accuracy and the temporal footprints of the inputs. Our model achieves the best performance among networks that use RGB frames, yet falls behind the two stream networks that use both frames and flow maps.

Comparison to RGB Networks. Table 2 compares our results with previous methods on UCF101/HMDB51. We denote our models using Prob-Atten and Soft-Atten for distillation as Prob-Distill and Soft-Distill, respectively. Prob-Distill slightly outperforms Soft-Distill, with a mean class accuracy of 95.7%/72.0% on UCF101/HMDB51. Prob-Distill also outperforms state-of-the-art methods of motion representation learning. Specifically, our results are at least 1.2%/1.0% better than our direct competitors in learning motion-aware video representations from RGB frames, including Dynamic Image [1], ActionFlowNet [31] and TVNet [5]. Part of this boost is due to our strong I3D RGB backbone. However, Prob-Distill further improves our backbone by 0.9%/1.1%. More importantly, both Prob-Distill and Soft-Distill consistently improve the performance of Prob-Atten and Soft-Atten. We conjecture that this gap is a result of our attention distillation method. It is worth noting that this performance boost is significant for action recognition. In contrast, with 50 more layers, ResNet101 is only slightly better than ResNet50 on HMDB51 [14]. The performance of our method is on par with state-of-the-art action recognition results [3, 48, 44], even though these methods require many more input frames. As future work, we plan to experiment with using more frames for our model.

Comparison to Two Stream Networks. We have to admit that our results still lag behind the two stream networks when using the same input sequence length (Two Stream I3D*). Our model is 1.1% worse on UCF101 and 4.1% worse on HMDB51. This gap suggests that our model does not fully capture the concepts of motion that are encoded in two stream networks. Nonetheless, we believe that our model provides a key step forward for learning motion-aware representations from RGB frames.

Results on 20BN-V2. We report the results of our method on 20BN-V2 in Table 3. With 1/5 of the temporal receptive field of the latest TRN [52], our model with RGB frames outperforms TRN RGB by 1.1%/1.5% in top-1/top-5 accuracy. In fact, our backbone network (I3D RGB) is slightly worse than TRN in top-1 accuracy, and our method improves the backbone by 1.8%/1.3% in top-1/top-5. Our model with RGB frames also outperforms the reference flow network used for attention distillation by 1.6%/1.2% in top-1/top-5. The ranking of results remains consistent with that on UCF101/HMDB51. Again, our results lag behind the two stream networks.

Mean class accuracy:

Method              UCF101  HMDB51
I3D RGB (backbone)  94.8    70.9
FeatMatch [51]      94.3    70.7
Ours                95.7    72.0

Table 4: Comparison between our attention distillation and attention transfer in [51]. Matching the feature activation across modalities decreases the performance of the base network. In contrast, our method of matching attention maps improves the baseline.

Comparison to Feature Matching [51]. We also contrast our model with the feature matching method of [51]. Their method seeks to match the maximum activation across feature channels for knowledge distillation. The key differences between our model and this feature matching method have been discussed in Sec. 2. Our experiment highlights the performance gap produced by these differences. Concretely, we implement [51] on our I3D RGB backbone. Besides the classification task, the FeatMatch network is trained to mimic the “features” from the flow network.

Unlike our model, FeatMatch decreases the performance of the base network (see Table 4). The performance gap between our method and FeatMatch [51] is even larger (+1.4% on UCF101 and +1.3% on HMDB51). This supports our argument that matching features does not work for motion knowledge distillation, since the features learned from flow can be drastically different from those learned from RGB frames. In contrast, our method matches attention maps for knowledge transfer, and is thus more robust.

Figure 3: Visualization of attention maps from our method and Soft-Atten, all using the same backbone network (I3D RGB). For each video clip, we re-interpolate the attention maps and plot them on the first and last frame. Red regions indicate higher values of attention. Our model produces qualitatively different appearance and motion attention maps. These attention maps index key regions of actions.

4.3 Analysis of Attention Distillation

We provide extensive analysis to understand what has been learned by our model. Specifically, we visualize the attention maps of our model. And we show that these attention maps help to locate the spatial extents of actions. Finally, we study different approaches to evaluate whether the learned representation is sensitive to motion.

Visualization of Attention Maps. To better understand our model, we visualize both motion and appearance attention maps from our model. We also compare these maps with attention maps created by our Soft-Atten models from the RGB and flow streams in Fig. 3. We notice that these two attention maps are qualitatively different across all methods. The appearance attention is likely to cover foreground objects or actors, while the motion attention focuses on the moving parts. Moreover, the appearance attention from our model can better localize the foreground regions of actions than that of Soft-Atten from the RGB stream. The motion attention from our model does remain quite similar to that of Soft-Atten from the flow stream. We also find that the attention maps from our model tend to be more “diffused”. This is because the regularization by a uniform distribution in Prob-Atten leads to “smoother” attention maps.

Method                   Prec  Rec   F1
Gaussian (center prior)  52.6  20.6  29.6
Saliency Map (DSS [17])  51.2  47.7  49.4
Soft-Atten (RGB)         33.8  40.5  36.9
Soft-Atten (Flow)        39.2  50.0  44.0
Our Appearance           31.5  52.1  39.2
Our Motion               36.3  62.6  46.0

Table 5: Results of action localization using attention maps on the THUMOS’13 localization test set [18]. We report the best F1 score and its precision and recall. Our motion attention outperforms all baselines that are trained with only action labels.

Does the attention help to localize actions? We evaluate our output attention for spatiotemporal action localization using the THUMOS’13 localization dataset [18], a subset of UCF101 with bounding box annotations for actions. We present our evaluation metric and discuss our results.

  • Evaluation Metric. We consider action localization as binary labeling of pixels and report the F1 score from the Precision-Recall (PR) curve. Specifically, we first rescale both attention maps and video frames to a fixed resolution. We then enumerate all thresholds and binarize the attention map. Each threshold defines a point on the PR curve. Given a binary attention map, a positive pixel is considered a true positive if it is inside the bounding box or within a 10-pixel “tolerance zone” of the box. This tolerance is added to compensate for the reduced resolution of the attention map, as in [33]. We report the best F1 score on the curve and its corresponding precision and recall.

  • Results. We compare attention maps from our model to a set of baseline methods, including a fixed Gaussian distribution (center prior), a recent deep saliency model (DSS [17]), and our Soft-Atten (RGB/Flow). The results are shown in Table 5. Our appearance attention beats the center prior and Soft-Atten (RGB) baselines, but is worse than Soft-Atten (Flow). Our motion attention achieves the highest score among all methods that only receive action labels as supervision, and only under-performs DSS. We emphasize that directly comparing our results to DSS is unfair. DSS is trained with pixel-level annotations using external data and runs at the original video resolution, while our attention maps are trained using clip-level action labels and are down-sampled both spatially (32x) and temporally (8x). These results suggest that our attention maps help to locate the spatial extent of actions.
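The evaluation metric described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' evaluation code: the function names `box_mask` and `best_f1`, the `(x0, y0, x1, y1)` box convention, the uniform threshold grid, and the definition of recall as the fraction of box pixels covered are all assumptions made for the example.

```python
import numpy as np


def box_mask(box, height, width, pad=0):
    """Boolean mask of a box (x0, y0, x1, y1), optionally grown by `pad` pixels."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=bool)
    mask[max(y0 - pad, 0):min(y1 + pad, height),
         max(x0 - pad, 0):min(x1 + pad, width)] = True
    return mask


def best_f1(attention, box, tol=10, num_thresholds=100):
    """Return (precision, recall, f1) at the threshold with the highest F1."""
    height, width = attention.shape
    inside = box_mask(box, height, width)             # ground-truth box
    tolerant = box_mask(box, height, width, pad=tol)  # box plus tolerance zone
    best = (0.0, 0.0, 0.0)
    # Assumption: a uniform grid of thresholds stands in for "all thresholds".
    thresholds = np.linspace(attention.min(), attention.max(),
                             num_thresholds, endpoint=False)
    for t in thresholds:
        pred = attention > t  # binarized attention map
        n_pred = pred.sum()
        if n_pred == 0:
            continue
        # A positive pixel is a true positive if it lies in the box or its tolerance zone.
        precision = (pred & tolerant).sum() / n_pred
        # Assumption: recall is the fraction of box pixels covered by the prediction.
        recall = (pred & inside).sum() / inside.sum()
        if precision + recall == 0:
            continue
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[2]:
            best = (float(precision), float(recall), float(f1))
    return best
```

In this sketch the tolerance only affects precision: attention that spills slightly outside the box is not penalized, which is the stated purpose of the tolerance zone for low-resolution attention maps.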

Dataset | Method   | Mean Class Accuracy (Original) | Mean Class Accuracy (Reverted) | Drop
UCF101  | I3D RGB  | 94.8 | 94.7 | 0.1
UCF101  | I3D flow | 94.0 | 89.9 | 4.1
UCF101  | Ours     | 95.7 | 95.1 | 0.6
HMDB51  | I3D RGB  | 70.9 | 70.2 | 0.7
HMDB51  | I3D flow | 73.9 | 66.0 | 7.9
HMDB51  | Ours     | 72.0 | 70.6 | 1.4

Table 6: Inverting the arrow of time for action recognition. We train the models on normal samples, yet test them on videos with reversed temporal order. A large performance drop indicates that the model relies on motion information for recognition.

Does our method learn better motion representation? We further study how the temporal order of the input video frames impacts recognition performance. We conduct an experiment of classifying reverted videos as in [48, 52]. Specifically, we invert the frame order for all testing videos of UCF101 and HMDB51 and compare the recognition results with those from the normal temporal order. If a model truly relies on motion representation for recognition, this inversion will significantly decrease its performance. We test the vanilla I3D RGB and flow models, as well as our model. The results are presented in Table 6. Not surprisingly, the I3D flow model has the largest performance drop. In contrast, I3D RGB is barely affected by the reverted arrow of time. Our model has a performance drop that is larger than I3D RGB yet much smaller than I3D flow. This is consistent with our results on action recognition. Our model does not capture the same level of motion information as the flow network.
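As a rough illustration of this protocol, the sketch below reverses the frame order of each test clip and reports mean class accuracy before and after the inversion. It is a minimal sketch under stated assumptions, not the authors' evaluation code: `model` is a hypothetical callable that maps one clip of shape (T, H, W, C) to a predicted class index, and the helper names are made up for the example.

```python
import numpy as np


def reverse_clip(clip):
    """Invert the temporal order of a clip with shape (T, H, W, C)."""
    return clip[::-1]


def mean_class_accuracy(labels, preds, num_classes):
    """Average of per-class accuracies over the classes present in `labels`."""
    labels = np.asarray(labels)
    preds = np.asarray(preds)
    per_class = []
    for c in range(num_classes):
        members = labels == c
        if members.any():
            per_class.append((preds[members] == c).mean())
    return float(np.mean(per_class))


def reversal_drop(model, clips, labels, num_classes):
    """Return (original accuracy, reverted accuracy, drop) for a classifier.

    `model` is a hypothetical callable: clip -> predicted class index.
    """
    original = [model(clip) for clip in clips]
    reverted = [model(reverse_clip(clip)) for clip in clips]
    acc_original = mean_class_accuracy(labels, original, num_classes)
    acc_reverted = mean_class_accuracy(labels, reverted, num_classes)
    return acc_original, acc_reverted, acc_original - acc_reverted
```

A classifier that depends only on per-frame appearance would show a drop near zero under this test, while one that depends on the direction of change over time would show a large drop, which is the contrast Table 6 is designed to expose.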

How is the motion encoded? It is also possible that our model simply copies the motion attention map without encoding motion in the network. To rule out this hypothesis, we trained an RGB network that directly combines a reference motion attention map with its own appearance attention map for action recognition. The reference motion attention is produced by a flow network during both training and testing, and the rest of this network follows exactly the same architecture as our model. This model has an accuracy of on UCF101/HMDB51, under-performing our model by -/- on UCF101/HMDB51. These results indicate that the distillation process not only generates motion attention maps, but also learns a motion-aware representation.

What has been learned? Our visualization and action localization experiments suggest that our model learns to locate moving regions in video frames. However, when we invert the temporal order of frames, our learned features are not as sensitive as those from the flow network. These results illustrate a key challenge for learning motion-aware representations: learning to identify moving regions does not necessarily yield the right representation for encoding motion. This is the same pitfall faced by our work and many previous works [31, 25], and the challenge remains open.

5 Conclusions

In this paper, we presented a novel method of attention distillation for action recognition in videos. We provided extensive experiments to evaluate our method. Our results demonstrate that a properly designed attention module helps to improve recognition performance. More importantly, attention maps from RGB and flow networks are qualitatively different, suggesting that these networks capture different aspects of the video. We also showed that our attention distillation learns to locate moving regions and achieves competitive action recognition results across datasets. We believe our work provides valuable insights into attention-based recognition, as well as a solid step towards learning spatiotemporal features in deep models.


  • [1] H. Bilen, B. Fernando, E. Gavves, and A. Vedaldi. Action recognition with dynamic image networks. TPAMI, 2018.
  • [2] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil. Model compression. In SIGKDD, 2006.
  • [3] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
  • [4] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
  • [5] L. Fan, W. Huang, C. Gan, S. Ermon, B. Gong, and J. Huang. End-to-end learning of motion representation for video understanding. In CVPR, 2018.
  • [6] C. Feichtenhofer, A. Pinz, R. P. Wildes, and A. Zisserman. What have we learned from deep representations for action recognition? In CVPR, 2018.
  • [7] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
  • [8] D. Gao, S. Han, and N. Vasconcelos. Discriminant saliency, the detection of suspicious coincidences, and applications to visual recognition. TPAMI, 2009.
  • [9] N. Garcia, P. Morerio, and V. Murino. Modality distillation with multiple stream networks for action recognition. In ECCV, 2018.
  • [10] R. Girdhar and D. Ramanan. Attentional pooling for action recognition. In NIPS, 2017.
  • [11] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, and M. Mueller-Freitag. The “something something” video database for learning and evaluating visual common sense. In ICCV, 2017.
  • [12] S. Gupta, J. Hoffman, and J. Malik. Cross modal distillation for supervision transfer. In CVPR, 2016.
  • [13] K. Hara, H. Kataoka, and Y. Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In CVPR, 2018.
  • [14] K. Hara, H. Kataoka, and Y. Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet. In CVPR, 2018.
  • [15] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 1997.
  • [17] Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, and P. Torr. Deeply supervised salient object detection with short connections. In CVPR, 2017.
  • [18] H. Idrees, A. R. Zamir, Y.-G. Jiang, A. Gorban, I. Laptev, R. Sukthankar, and M. Shah. The THUMOS challenge on action recognition for videos “in the wild”. CVIU, 2017.
  • [19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [20] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. In ICLR, 2017.
  • [21] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  • [22] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011.
  • [23] Y. Li, M. Liu, and J. M. Rehg. In the eye of beholder: Joint learning of gaze and actions in first person video. In ECCV, 2018.
  • [24] Z. Li, K. Gavrilyuk, E. Gavves, M. Jain, and C. G. Snoek. Videolstm convolves, attends and flows for action recognition. CVIU, 2018.
  • [25] S. Liu, E. Johns, and A. J. Davison. End-to-end multi-task learning with attention. arXiv preprint arXiv:1803.10704, 2018.
  • [26] Z. Luo, J.-T. Hsieh, L. Jiang, J. C. Niebles, and L. Fei-Fei. Graph distillation for action detection with privileged modalities. In ECCV, 2018.
  • [27] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR, 2017.
  • [28] F. Mahdisoltani, G. Berger, W. Gharbieh, D. Fleet, and R. Memisevic. Fine-grained video classification and captioning. arXiv preprint arXiv:1804.09235, 2018.
  • [29] S. Mathe and C. Sminchisescu. Dynamic eye movement datasets and learnt saliency models for visual action recognition. In ECCV, 2012.
  • [30] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In NIPS, 2014.
  • [31] J. Y.-H. Ng, J. Choi, J. Neumann, and L. S. Davis. Actionflownet: Learning motion representation for action recognition. In WACV, 2018.
  • [32] A. Oliva, A. Torralba, M. S. Castelhano, and J. M. Henderson. Top-down control of visual attention in object detection. In ICIP, 2003.
  • [33] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In CVPR, 2015.
  • [34] J. S. Pérez, E. Meinhardt-Llopis, and G. Facciolo. TV-L1 optical flow estimation. IPOL, 2013.
  • [35] R. Poppe. A survey on vision-based human action recognition. Image and vision computing, 2010.
  • [36] Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In ICCV, 2017.
  • [37] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
  • [38] N. Shapovalova, M. Raptis, L. Sigal, and G. Mori. Action is in the eye of the beholder: Eye-gaze driven model for spatio-temporal action localization. In NIPS, 2013.
  • [39] S. Sharma, R. Kiros, and R. Salakhutdinov. Action recognition using visual attention. In ICLR Workshop, 2016.
  • [40] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
  • [41] K. Soomro, A. Roshan Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. In CRCV-TR-12-01, 2012.
  • [42] S. Sun, Z. Kuang, L. Sheng, W. Ouyang, and W. Zhang. Optical flow guided feature: A fast and robust motion representation for video action recognition. In CVPR, 2018.
  • [43] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
  • [44] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
  • [45] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
  • [46] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. In CVPR, 2017.
  • [47] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
  • [48] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV, 2018.
  • [49] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
  • [50] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
  • [51] S. Zagoruyko and N. Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.
  • [52] B. Zhou, A. Andonian, and A. Torralba. Temporal relational reasoning in videos. In ECCV, 2018.