Deformable Tube Network for Action Detection in Videos

07/03/2019 ∙ by Wei Li, et al. ∙ 2

We address the problem of spatio-temporal action detection in videos. Existing methods commonly either ignore temporal context in action recognition and localization, or fail to model the flexible shapes of action tubes. In this paper, we propose a two-stage action detector called Deformable Tube Network (DTN), which is composed of a Deformable Tube Proposal Network (DTPN) and a Deformable Tube Recognition Network (DTRN), arranged similarly to the Faster R-CNN architecture. In DTPN, a fast tube linking algorithm (FTL) is introduced to connect region proposals across frames and generate multiple deformable action tube proposals. To perform action detection, we design a 3D convolution network with skip connections for tube classification and regression. Modelling action proposals as deformable tubes, rather than 3D cuboids, explicitly accounts for the shape of action tubes. Moreover, the 3D-convolution-based recognition network can learn the temporal dynamics needed for action detection. Our experimental results show that we significantly outperform methods based on 3D cuboids and obtain state-of-the-art results on both the UCF-Sports and AVA datasets.




1 Introduction

Action detection aims to localize actions in a long video in both space and time. In other words, given a video, we should not only recognize the actions but also find their starting times, ending times and the spatial bounding boxes of their subjects. As a key ingredient of video understanding, it can readily be applied in practical scenarios such as surveillance and content-based retrieval. This problem has therefore been widely explored in recent years [23, 1, 36, 18, 17, 25]. In particular, as deep learning techniques demonstrated remarkable success in object detection [34, 33, 30], researchers followed similar pipelines (e.g., Faster R-CNN [34]) and made considerable progress in both localization and recognition. However, accurate action detection remains an open problem because it poses new challenges compared with object detection and action recognition. Crucially, bounding boxes in action detection must be associated with their corresponding actions, which have to be inferred over time. In addition, the locations of the same action should be consistent across adjacent frames so that each detection covers a complete action.

These challenges require that actions be represented at a fine granularity rather than at the level of the entire video, so that they can be accurately localized. Several previous methods [13, 31] processed each frame independently and recognized actions over regions generated by selective search [44], a region proposal network [34] and so on. Per-frame detections were then linked to maintain temporal consistency. Because each frame is classified in isolation, these methods fail to model the temporal dynamics of actions, which would benefit both localization and recognition. More recently, in order to include temporal information, researchers extended the 2D region proposal network to generate 3D action cuboid proposals, which are then used to recognize actions and further refine their locations in each frame. However, although anchor cuboids (the counterpart of anchor boxes in object detection) are introduced to represent volumes of different scales and aspect ratios, regular 3D cuboids cannot model the flexibility of actions, in which the spatial positions and scales of humans may vary significantly over time. In short, action proposals are deformable along time.

To overcome these shortcomings, we propose a Deformable Tube Network (DTN) for accurate action detection in videos. The network is composed of a Deformable Tube Proposal Network (DTPN) and a Deformable Tube Recognition Network (DTRN), which are arranged in the same way as in the standard object detection framework Faster R-CNN and correspond to its RPN and recognition network, respectively. In DTPN, action proposals are generated online by linking per-frame region proposals to form deformable region tubes. Based on these deformable tube proposals, a novel recognition network, DTRN, then performs action recognition and location regression. DTRN consists of several 3D convolution layers that extract spatio-temporal representations. In addition, skip connections are interleaved to preserve the original spatial information and sharpen location boundaries. Our proposed network has two main advantages: 1) deformable action proposals offer enough flexibility to model the variations of actions along time, and 2) the recognition network based on 3D convolution enables us to include temporal context to improve action detection.

In summary, our main contributions are threefold:

  • We introduce a novel deformable tube network to perform action detection based on deformable action proposals.

  • We propose a fast linking algorithm so that action proposals can be generated quickly.

  • We carefully design a deformable proposal recognition network to recognize actions and regress their positions.

Our experiments show that the proposed DTN model significantly outperforms other methods based on 3D cuboid proposals. Moreover, with a single model and a single feature, DTN achieves state-of-the-art results on the UCF-Sports [35, 41] and AVA [14] datasets.

The rest of the paper is organized as follows. In Section 2, we review related work on object detection, action recognition and action detection. In Section 3, we introduce our network structure, including DTPN and DTRN, in detail. In Section 4, we present experimental results on the UCF-Sports and AVA datasets to validate the effectiveness of our proposed model. In Section 5, we give a further analysis of our fast tube linking algorithm, and we conclude the paper in the final section.

2 Related Work

In this section, we introduce several previous works on action detection. Action detection is also closely related to object detection and action recognition: much research is inspired by the methodologies of object detection [7, 34, 29, 26, 10] and by advances in action recognition [42, 39, 50, 49, 4, 43]. We therefore walk through these three directions.

2.1 Object Detection

Object detection has recently benefited greatly from deep learning techniques. Girshick et al. first proposed Region CNN (R-CNN) [12] for object detection, which achieves a significant improvement over traditional methods [7, 46, 8] by introducing a convolutional neural network. Ren et al. [34] then overcame the speed bottleneck of generating object proposals by introducing a Region Proposal Network (RPN). RPN utilizes densely sampled anchor boxes to generate top-scored object proposals. Object features extracted by ROI Pooling are then input to a recognition network for further classification and location regression. To accelerate this process, one-stage detectors such as SSD [30] and YOLO [33] directly classify and regress anchor boxes in one pass, without an RPN to propose potential bounding boxes. To model the scale variations of objects, SSD assigns anchor boxes to feature maps at different levels. Similarly, FPN [28] builds feature pyramids with a top-down architecture and lateral connections to integrate multi-scale representations. Li et al. [27] further integrated the feature pyramid structure with a backbone specially designed for object detection. To address the imbalance between foreground and background instances, Lin et al. [29] designed a novel focal loss that automatically reduces the contribution of easy examples and focuses on hard examples. Later, Mask R-CNN [16] extended Faster R-CNN with a mask prediction branch to train object detection and instance segmentation in parallel.

In this paper, we follow the two-stage detection framework, and adapt it to action detection by extending both the region proposal network and the recognition network.

2.2 Action Recognition

Early approaches to action recognition mainly rely on hand-crafted features such as HOF [22] and IDT [45]. However, despite their complex designs, these features are still unable to represent the abundant information in videos. To increase the capacity of features, several convolutional neural networks were introduced to learn video representations in an end-to-end way. Notably, Simonyan et al. introduced a two-stream architecture [39] with RGB frames and optical flow as the inputs to its two streams. The architecture processes motion information and spatial appearance in parallel and fuses the classification scores of both streams to obtain a final prediction. However, extracting optical flow is time-consuming and the network still has limited capacity to capture temporal information. Tran et al. proposed a 3D convolutional neural network, in which several 3D convolution operators are stacked to learn spatio-temporal features directly. Although 3D convolution avoids the explicit extraction of optical flow, it introduces more parameters and computation. Many variants have therefore been proposed to tackle these challenges, such as Pseudo-3D [32] and R(2+1)D [43]. On the other hand, Donahue et al. leveraged an LSTM to integrate CNN features along time [4]. In our work, we adopt a carefully designed convolutional neural network with skip connections for action detection.

2.3 Action Detection

Compared to action recognition, action detection requires accurate boundary regression. A natural way is to follow the standard sliding window strategy in object detection [23, 1, 36]. The methods differ mainly in their feature selection and in how they generate candidate action proposals. For example, Rohrbach et al. [36] generated multiple candidate segments by sliding windows and performed recognition over dense trajectories and human pose features. Lan et al. [20] made use of figure-centric visual word features to represent actions.

As the performance of object detection improved, most recent approaches turned to linking frame-level object detections to form action tubes. Based on visual and motion cues from a two-stream network, Gkioxari and Malik [13] classified region proposals generated by selective search, which are then linked into action tubes along time for temporal consistency. Weinzaepfel et al. [47] proposed to track high-scoring proposals using a tracking-by-detection approach. Saha et al. [38] fused appearance and motion detection boxes based on estimated action scores and their mutual spatial overlaps, and constructed spatio-temporal action tubes with a two-pass dynamic programming method. Peng and Schmid [31] replaced selective search with a region proposal network and embedded a multi-region scheme into their two-stream classification network. Singh et al. [40] introduced a real-time action localization method with an SSD object detector and an online linking algorithm. All these methods rely heavily on frame-level human detections. However, distinguishing actions from single frames is difficult without considering temporal dynamics.

To further include temporal dynamics for action recognition and localization, Kalogeiton et al. [19] came up with anchor cuboids to generate action proposals directly, which encode enough spatio-temporal information for action recognition. Hou et al. [17] further generalized the Region-of-Interest (RoI) pooling layer to a 3D Tube-of-Interest (ToI) pooling layer. Although these approaches built on anchor cuboids can integrate temporal dynamics for action recognition, unlike those that connect frame-level human detections, action tubes are in practice too deformable to be modelled by regular 3D volumes. In contrast, we propose a novel detection network, which generates deformable action tube candidates by a fast linking algorithm and performs recognition and regression over these proposals. RTPR [25] is probably the most similar to our work in spirit: it also links region proposals into action tubes, and performs action recognition with an LSTM. However, in their work, action locations are regressed recurrently, without considering global temporal information to enhance the detections in each frame. Instead, we design a fully convolutional neural network that performs recognition and regression as a whole over the generated deformable action tubes.

3 Overview of the approach

Our proposed deformable tube network is composed of a deformable tube proposal network (DTPN) and a deformable tube recognition network (DTRN). An overview of our framework is shown in Figure 1. We decompose the generation of deformable actor-centric tube proposals into two separate steps. First, we obtain high-quality region proposals for each frame using a standard RPN. Then, we leverage a fast tube linking algorithm (FTL) to link per-frame proposals into deformable candidate tubes. Compared with anchor cuboids, deformable actor-centric tubes capture actors more flexibly while still preserving the spatio-temporal information. Finally, DTRN is carefully designed to classify and refine the candidate tube proposals.

Figure 1: The overall architecture of our proposed two-stage action localization model with DTPN and DTRN. In DTPN, we link per-frame proposals into deformable candidate tubes, which capture actors more flexibly. DTRN captures spatio-temporal correlations with 3D convolution to perform recognition and bounding box regression.

3.1 Deformable Tube Proposal Network

The tube proposal network aims to generate deformable spatio-temporal action proposals. The whole network is composed of a standard RPN and one linking layer. RPN generates a set of candidate actor region proposals for each frame. In the following linking layer, we employ a fast proposal linking algorithm to generate deformable action tubes.

3.1.1 Region Proposals Generation

In our experiments, RPN takes video frames as input and generates 1k actor proposals per frame. Note that RPN does not consider the temporal context across frames and generates region proposals independently for each frame. We follow the standard setting during training. Specifically, several anchor bounding boxes are first generated according to predefined aspect ratios and scales. Then, for every anchor, the overlap score with each groundtruth box is estimated. Anchors are considered positive if their maximum overlap score is higher than 0.7, regardless of action labels, and the rest are assigned negative labels. Besides classification, positive anchors are also used to regress their corresponding groundtruth locations.
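The anchor labeling rule can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `iou` and `label_anchors`, the `(x1, y1, x2, y2)` box format and the toy boxes are assumptions made for the example; only the 0.7 threshold comes from the text.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, iw) * max(0.0, ih)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, pos_thresh=0.7):
    """Label each anchor 1 (positive) or 0 (negative), ignoring action classes.

    An anchor is positive when its maximum IoU over all groundtruth boxes
    exceeds pos_thresh; every other anchor is negative.
    """
    labels = []
    for anchor in anchors:
        best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
        labels.append(1 if best > pos_thresh else 0)
    return labels


# Toy example (hypothetical boxes): IoUs with the groundtruth are 1.0, 0.8 and 0.0.
anchors = [(0, 0, 10, 10), (0, 0, 10, 8), (20, 20, 30, 30)]
groundtruth = [(0, 0, 10, 10)]
labels = label_anchors(anchors, groundtruth)
```

Because the labels ignore the action class, the RPN stage only separates actors from background; the action category is decided later by the recognition network.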

3.1.2 Deformable Tube Proposals

For every frame in a video clip, RPN outputs a set of region proposals, each with a confidence score representing how likely it is to contain an actor. Inspired by [25], we link the generated regions that have large overlaps across frames into deformable action tubes by iteratively running a dynamic programming algorithm; the tubes are then fed to the following recognition network for classification and further regression. Instead of using overlap ratios during tube construction as in [25], we adopt a hard thresholding strategy to accelerate the process. Nevertheless, generating proposals serially is still expensive within the entire detection pipeline. Hence, we substantially modify the linking algorithm to further reduce the computational cost of generation.

Given a video clip of T consecutive frames, with RPN generating a set of proposals in every frame, our goal is to generate the deformable tube proposals with the largest action scores.

For any tube, its action score represents how likely the tube is to contain an action. It combines the objectness score that RPN outputs for each region in the tube with a hard-thresholding score on the overlap between regions in adjacent frames.

We can see that tubes with high action scores favour region proposals with large overlaps across frames and high confidence scores.

As is well known, finding the tube proposal with the maximum score can be cast as a typical dynamic programming problem and solved by the Viterbi algorithm [9]. To find the tube proposals with the largest action scores, we adopt a greedy strategy. First, the tube with the largest score is generated. Then we remove all of its region proposals and find the next path with the maximum action score among the remaining proposals. The whole process terminates when all tube proposals are found or no legal region candidates can be linked. Since every Viterbi pass compares all pairs of candidate regions in adjacent frames, and one pass is needed per tube, the cost grows quickly with the number of candidate regions per frame. In our experiments, we found that this greedy generation took too long because the Viterbi algorithm in each step is costly. Therefore, we improve the dynamic programming algorithm to speed up the deformable tube generation.
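The greedy procedure can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' code: the action score is assumed to be the sum of per-frame objectness scores, with any link whose IoU falls below `iou_thresh` treated as illegal; the names `best_tube` and `greedy_tubes`, the default threshold of 0.5 and the toy data are hypothetical.

```python
NEG_INF = float("-inf")


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, iw) * max(0.0, ih)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def best_tube(frames, iou_thresh):
    """One Viterbi pass; frames[t] is a list of (box, objectness) pairs.

    Returns (score, per-frame region indices), or None if no legal tube exists.
    """
    if not frames or any(not regions for regions in frames):
        return None
    best = [score for _, score in frames[0]]  # best partial score ending at each region
    backptrs = []
    for t in range(1, len(frames)):
        cur_best, cur_back = [], []
        for box, score in frames[t]:
            prev_score, prev_idx = NEG_INF, -1
            for j, (prev_box, _) in enumerate(frames[t - 1]):
                # Every pair of regions in adjacent frames is examined here.
                if best[j] > prev_score and iou(prev_box, box) >= iou_thresh:
                    prev_score, prev_idx = best[j], j
            cur_best.append(prev_score + score)  # stays -inf when nothing links
            cur_back.append(prev_idx)
        best = cur_best
        backptrs.append(cur_back)
    end = max(range(len(best)), key=best.__getitem__)
    if best[end] == NEG_INF:
        return None
    path = [end]
    for back in reversed(backptrs):
        path.append(back[path[-1]])
    return best[end], path[::-1]


def greedy_tubes(frames, max_tubes, iou_thresh=0.5):
    """Repeatedly extract the best tube, then remove its regions."""
    frames = [list(regions) for regions in frames]  # work on a copy
    tubes = []
    while len(tubes) < max_tubes:
        found = best_tube(frames, iou_thresh)
        if found is None:
            break
        score, path = found
        tubes.append((score, [frames[t][i][0] for t, i in enumerate(path)]))
        for t, i in enumerate(path):
            del frames[t][i]
    return tubes


# Toy clip with two frames and two actors (hypothetical boxes and scores).
clip = [
    [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.8)],
    [((1, 1, 11, 11), 0.7), ((50, 50, 60, 60), 0.95)],
]
tubes = greedy_tubes(clip, max_tubes=5)
```

The innermost loop over all region pairs, repeated once per extracted tube, is the cost that the two modifications described next are meant to cut.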

Our main modifications lie in two algorithm details.

Hard Thresholding On the one hand, since there are a large number of region proposals in each frame and actors usually move smoothly, we only link proposals to those with large overlaps in the following frame, which greatly reduces the search space and thus increases efficiency. Specifically, given an incomplete path and the proposals in the next frame, we only choose region proposals whose IoU with the last region of the path is higher than a threshold to extend the tube. As a result, each step considers only the legal proposals in the next frame rather than all of them, which reduces the time complexity accordingly. Most importantly, the tube proposal generated in this way is still the globally optimal solution under our hard thresholding strategy.

Top-K Selection On the other hand, maintaining and updating the maximum action scores of incomplete proposals ending at every region proposal in intermediate frames is unnecessary and expensive. Based on the observation that RPN performs relatively reliably in each frame and that the tube with the largest confidence always scores high at every step, we only select the top-K incomplete paths as candidate tubes to extend at any step (see Algorithm 1). We will discuss the influence of K in the experiment section. Consequently, we further reduce the time complexity of the whole generation process, which significantly accelerates action detection.

Input : Region Proposals in each frame , adjustable parameter .
Output : Tube proposals .
for  to  do
       for  to  do
             if  then
                   top- by
                   for  in frame  do
                   end for
                   by in Eq 1
       end for
end for
Algorithm 1 Fast tube linking algorithm. represent incomplete tubes ending at frame. aims to select tubes with the largest action score.
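One plausible reading of Algorithm 1, combining hard thresholding with top-K selection, is sketched below. It is an assumption-laden illustration, not the authors' implementation: the additive score, the function name `fast_tube_linking`, the defaults `k=3` and `iou_thresh=0.5`, and the toy clip are all hypothetical, and in this simplified version different tubes may share regions.

```python
import heapq


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, iw) * max(0.0, ih)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fast_tube_linking(frames, k=3, iou_thresh=0.5):
    """Link per-frame regions into at most k tubes.

    frames[t] is a list of (box, objectness) pairs.
    Returns a list of (score, [box per frame]) sorted by decreasing score.
    """
    if not frames:
        return []
    # Start an incomplete tube at every region of the first frame, keep the k best.
    tubes = [(score, [box]) for box, score in frames[0]]
    tubes = heapq.nlargest(k, tubes, key=lambda tube: tube[0])
    for regions in frames[1:]:
        extended = []
        for score, boxes in tubes:
            for box, objectness in regions:
                # Hard thresholding: only regions overlapping the tube's last box are legal.
                if iou(boxes[-1], box) >= iou_thresh:
                    extended.append((score + objectness, boxes + [box]))
        # Top-K selection: only the k best incomplete tubes survive this step.
        tubes = heapq.nlargest(k, extended, key=lambda tube: tube[0])
    return tubes


# Toy clip with two frames and two actors (hypothetical boxes and scores).
clip = [
    [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.8)],
    [((1, 1, 11, 11), 0.7), ((50, 50, 60, 60), 0.95)],
]
top1 = fast_tube_linking(clip, k=1)
top2 = fast_tube_linking(clip, k=2)
```

With only K incomplete tubes kept per frame, each step compares K tubes against the regions of the next frame instead of comparing every pair of regions, which is where the speed-up comes from in this sketch.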

3.2 Deformable Tube Proposal Recognition

When deformable tube proposals are available, the tube proposal recognition network aims to classify them correctly and refine their locations in each frame. Similar to Fast R-CNN, the network takes video frames and deformable tube proposals as input. The network first processes the video frames with a backbone convolutional neural network to obtain multi-scale feature maps for each frame. A feature pooling layer then outputs a 3D feature volume for each deformable tube proposal, which is further fed into our tube recognition network for action classification and location regression.

Tube Feature Pooling To form the representation of a tube, we first apply ROI Pooling to extract region features independently from the preceding feature maps, and then concatenate these representations along the tube. Specifically, given any tube spanning T frames, described by 4 coordinates per frame that denote the region position as in Faster R-CNN [34], ROI Pooling outputs a feature volume with a fixed spatial extent for each region. By stacking these features along the T frames, we obtain the tube representation.

Tube Recognition Network Our recognition network takes tube representations as input and performs recognition and bounding box regression. The overall architecture follows the design of U-Net [37] with extra skip connections, as illustrated in Table 1: the inputs undergo a series of spatial and temporal convolutions that enhance the interaction along tubes and model motion dynamics. In the temporal convolution layers in particular, we apply the kernels without padding to reduce the temporal dimension. The same number of deconvolution layers follow to recover the original spatial and temporal resolutions, which are later used for bounding box regression and classification. Note that we also adopt skip connections between the convolution layers and their symmetric deconvolution layers, which helps regression by adding low-level features.
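The effect of unpadded temporal convolutions and their mirrored deconvolutions on the temporal dimension can be checked with a few lines of arithmetic. The sketch below uses the standard output-size formulas; the temporal kernel size of 3, the two layers and the input length of 5 are assumptions chosen for illustration, not values taken from Table 1.

```python
def conv_out_len(length, kernel, stride=1, padding=0):
    """Output length of a 1D convolution along the temporal axis."""
    return (length + 2 * padding - kernel) // stride + 1


def deconv_out_len(length, kernel, stride=1, padding=0):
    """Output length of a 1D transposed convolution along the temporal axis."""
    return (length - 1) * stride - 2 * padding + kernel


def temporal_lengths(t_in, kernel, num_layers):
    """Temporal size after each unpadded conv, then after each mirrored deconv."""
    sizes = [t_in]
    for _ in range(num_layers):
        sizes.append(conv_out_len(sizes[-1], kernel))
    for _ in range(num_layers):
        sizes.append(deconv_out_len(sizes[-1], kernel))
    return sizes


# Assumed setting: 5 input frames, temporal kernel 3, two conv and two deconv layers.
sizes = temporal_lengths(t_in=5, kernel=3, num_layers=2)
```

Under these assumptions the temporal size shrinks and is then restored symmetrically, so each convolution output has the same temporal size as the input of its mirrored deconvolution, which is what makes the skip connections shape-compatible.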

To get final predictions, two sibling layers follow to estimate action classes and positions respectively. The first outputs, for each tube, a probability distribution over the action categories. For each region of the tube, the second sibling layer outputs bounding box regression offsets for each of the action classes. The corresponding regression target is defined with respect to the groundtruth box assigned to that region for the given class.
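As an illustration of per-region regression targets, the sketch below assumes the usual Faster R-CNN box parameterisation (centre shifts normalised by the proposal size, log-scale width and height ratios). This choice is an assumption, since the text only states that targets are defined against the assigned groundtruth; the names `encode_box`, `decode_box` and `encode_tube` are hypothetical.

```python
import math


def encode_box(proposal, gt):
    """Offsets (tx, ty, tw, th) mapping a proposal onto its groundtruth box.

    Boxes are (x1, y1, x2, y2); the parameterisation is the assumed
    Faster R-CNN one.
    """
    pw, ph = proposal[2] - proposal[0], proposal[3] - proposal[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    px, py = proposal[0] + 0.5 * pw, proposal[1] + 0.5 * ph
    gx, gy = gt[0] + 0.5 * gw, gt[1] + 0.5 * gh
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))


def decode_box(proposal, offsets):
    """Inverse of encode_box: apply predicted offsets to a proposal."""
    pw, ph = proposal[2] - proposal[0], proposal[3] - proposal[1]
    px, py = proposal[0] + 0.5 * pw, proposal[1] + 0.5 * ph
    cx, cy = px + offsets[0] * pw, py + offsets[1] * ph
    w, h = pw * math.exp(offsets[2]), ph * math.exp(offsets[3])
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


def encode_tube(tube, gt_tube):
    """One target per frame: the tube is regressed region by region."""
    return [encode_box(p, g) for p, g in zip(tube, gt_tube)]


# Toy tube of two frames (hypothetical boxes).
tube = [(0, 0, 10, 10), (2, 2, 12, 12)]
gt_tube = [(0, 0, 10, 10), (4, 2, 14, 14)]
targets = encode_tube(tube, gt_tube)
recovered = decode_box(tube[1], targets[1])
```

Each frame of a tube gets its own target, so the network can correct the box in every frame while still sharing temporal context through the 3D convolutions.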

Given these definitions, our multi-task loss for each tube combines a cross-entropy loss on the true class with a bounding box regression loss, the latter being gated by an indicator function.

name kernel stride output
Input - -
Skip3 - -
Skip2 - -
Skip1 - -
Flatten-Reg - -
Flatten-Cls - -
Table 1: Detailed network architecture of DTRN.

4 Experiments

4.1 Datasets

Unlike image datasets for object detection, large-scale video datasets for spatio-temporal localization are much harder to collect. We evaluate our proposed model on two action localization datasets: UCF-Sports [35, 41] and AVA [14]. The UCF-Sports dataset consists of 150 short videos covering 10 different actions. Bounding box annotations are available for all frames. We follow the dataset split of [21], with 103 videos for training and 47 videos for testing. AVA is a recently published, challenging dataset. It densely annotates 80 atomic visual actions in 430 15-minute video clips; nearly 900 key frames are annotated per clip, based on 3-second video segments centered on these key frames. Each bounding box in a key frame has multiple action labels. We follow the experimental setup of [14], which only uses classes that have at least 25 instances, resulting in a total of 210,634 training and 57,371 validation examples over 60 classes.

4.2 Experiment Setup

4.2.1 Implementation Details

We implement our method with the MXNet toolbox [2]. When training RPN, we assign positive labels to anchor boxes that have an IoU higher than 0.7 with any groundtruth box or that have the highest IoU with a groundtruth box. All remaining anchors are considered negatives. We keep the top-scoring RPN proposals after the NMS operation as candidate region proposals for tube linking. Our FTL links these region proposals into tube proposals under a hard IoU threshold. A tube proposal is assigned a positive label if the average IoU of its boxes with the groundtruth over all frames is higher than 0.5; the rest are given negative labels. To train DTRN, we sample 40 tube proposals per video as inputs for backpropagation, keeping a fixed ratio between positive and negative tubes. At test time, non-maximum suppression (NMS) with a threshold of 0.7 is applied in each frame to obtain the final action detection results.

To prevent overfitting in training, we apply random flipping to the whole sequence of frames. We train with the SGD optimizer of [24] with a momentum of 0.9 and a clipnorm of 5. The learning rate is initialized to 0.0004.

We experiment with a 2D backbone for the UCF-Sports dataset and with both 2D and 3D backbones for the AVA dataset. For UCF-Sports, we sample 5 frames with a sampling stride of S frames. For AVA with the 2D ResNet-101 backbone, we densely sample 5 frames in the 1-second video segment centered at each key frame. For the 3D ResNet-50 backbone [6], we generate our deformable tubes from the region proposals of consecutive key frames. For each key frame, in contrast to the single-frame input of the 2D backbone, the input to the 3D ResNet-50 backbone is 32 frames sampled from a 64-frame raw clip with a temporal stride of 2. We sample 5 such video clips, each centered at a key frame, with a fixed sampling stride in seconds. For our DTN with the 3D backbone, we follow a two-step training procedure. We first train a baseline model that only uses the center video clip to predict the action categories and regress the box offsets of the corresponding key frame, as in [14]. Then, owing to the GPU memory limit, we extract the feature volumes and finetune with weights loaded from the baseline model. Finally, we use mean pooling to reduce the temporal dimension of the output feature volume of each clip to 1, and stack the feature maps after ROI Pooling together as the input to our DTRN.
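The tube labeling rule used when training DTRN can be sketched as follows. This is a minimal illustration, assuming that tube proposals and groundtruth tubes cover the same frames; the function names `tube_iou` and `label_tube` and the toy boxes are hypothetical, and only the 0.5 threshold comes from the text.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, iw) * max(0.0, ih)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def tube_iou(tube, gt_tube):
    """Average per-frame IoU between a tube proposal and a groundtruth tube."""
    return sum(iou(p, g) for p, g in zip(tube, gt_tube)) / len(tube)


def label_tube(tube, gt_tubes, pos_thresh=0.5):
    """Index of the best-matching groundtruth tube, or -1 for a negative tube.

    A tube is positive only if its average IoU with some groundtruth tube
    is higher than pos_thresh.
    """
    best_idx, best_overlap = -1, pos_thresh
    for idx, gt_tube in enumerate(gt_tubes):
        overlap = tube_iou(tube, gt_tube)
        if overlap > best_overlap:
            best_idx, best_overlap = idx, overlap
    return best_idx


# Toy example with one groundtruth tube of two frames (hypothetical boxes).
gt_tubes = [[(0, 0, 10, 10), (0, 0, 10, 10)]]
good = [(0, 0, 10, 10), (0, 0, 10, 8)]         # average IoU 0.9
bad = [(20, 20, 30, 30), (20, 20, 30, 30)]     # average IoU 0.0
labels = [label_tube(good, gt_tubes), label_tube(bad, gt_tubes)]
```

Averaging over frames means a tube can still be positive when one frame is poorly localized, as long as the other frames overlap the groundtruth well.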

Figure 2: Per-category AP on AVA dataset: baseline model, baseline-multi model and Our DTN. Categories are sorted by the number of training examples. Our DTN achieves 0.75 points performance improvement compared with baseline-multi model. The highlighted categories are the 5 largest absolute gains (black).

4.2.2 Evaluation Metrics

To evaluate our model, we adopt frame-mAP and video-mAP as our evaluation metrics. An action tube or a frame detection is considered positive if its IoU with the groundtruth annotation is greater than a threshold and the action prediction is correct. Specifically, we use both frame-mAP and video-mAP as evaluation metrics for UCF-Sports. Since the AVA dataset is annotated with multiple labels for each bounding box, we only use frame-mAP for AVA.
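For readers unfamiliar with the metric, the sketch below computes average precision from ranked detections and averages it over classes. It uses the plain (non-interpolated) definition as an assumption; the official UCF-Sports and AVA evaluation protocols may differ in detail, and the function names and toy detections are hypothetical. Each detection is assumed to have already been marked as a true or false positive using the IoU and class criteria described above.

```python
def average_precision(detections, num_gt):
    """Non-interpolated AP for one class.

    detections: list of (score, is_true_positive) pairs.
    num_gt: number of groundtruth instances of this class.
    """
    if num_gt == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_positives += 1
            precision_sum += true_positives / rank  # precision at this recall step
    return precision_sum / num_gt


def mean_average_precision(per_class):
    """per_class maps a class name to (detections, num_gt)."""
    aps = [average_precision(dets, n) for dets, n in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0


# Toy example (hypothetical detections for two classes).
per_class = {
    "run": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "walk": ([(0.6, True)], 1),
}
ap_run = average_precision(*per_class["run"])
m_ap = mean_average_precision(per_class)
```

For frame-mAP the detections are per-frame boxes, while for video-mAP they are whole tubes; the ranking and averaging steps are the same in both cases.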

                     N=300    N=500    N=700    N=1000
Viterbi              8.27     34.04    70.4     122.7
Viterbi + HT         2.63     5.3      7.72     12.0
Viterbi + HT + TS    0.461    0.560    0.641    0.669
Table 2: Runtime (s) comparison of different linking algorithms.

                     frame-mAP    video-mAP
Viterbi              91.9         99.0
Viterbi + HT         92.3         98.3
Viterbi + HT + TS    93.08        98.8
Table 3: Comparison of variants with different tube linking algorithms on the UCF-Sports dataset.

4.3 Ablation Study

4.3.1 Runtime Analysis

As mentioned in Section 3.1, our modifications to the linking algorithm lie in two details: hard thresholding (HT) and top-K selection (TS), which together reduce its time complexity. Table 2 shows the actual runtime of the different linking algorithms with N region proposals. The frame count of each linked tube is fixed to 5 for all linking algorithms.

With our two modifications (HT and TS), we achieve over 180x acceleration when N=1000 and nearly 18x acceleration when N=300.
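The acceleration factors follow directly from the runtimes in Table 2, as the short calculation below shows. The dictionary layout and variable names are only for illustration; the numbers are copied from the table.

```python
# Runtimes in seconds from Table 2, keyed by the number of region proposals N.
runtimes = {
    "Viterbi": {300: 8.27, 500: 34.04, 700: 70.4, 1000: 122.7},
    "Viterbi + HT": {300: 2.63, 500: 5.3, 700: 7.72, 1000: 12.0},
    "Viterbi + HT + TS": {300: 0.461, 500: 0.560, 700: 0.641, 1000: 0.669},
}

baseline = runtimes["Viterbi"]
fast = runtimes["Viterbi + HT + TS"]

# Speed-up of the full fast linking algorithm over plain Viterbi for each N.
speedup = {n: baseline[n] / fast[n] for n in baseline}

for n, factor in speedup.items():
    print(f"N={n}: {factor:.1f}x faster")
```

The gain grows with N because the plain Viterbi runtime increases much faster with the number of proposals than the runtime of the modified algorithm does.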

To further evaluate the quality of our linked tubes, we compare the performance of the different linking algorithms on the UCF-Sports dataset in Table 3. With our two modifications, video-mAP is only slightly lower (98.8 vs. 99.0) while frame-mAP is even higher (93.08 vs. 91.9), which verifies the effectiveness of our fast linking algorithm. We will further quantify the effect of TS in Section 5.2.

4.3.2 Linked Tubes & Anchor Cuboids

As discussed in Section 3.1, our DTN generates deformable actor-centric tube proposals. Compared with anchor cuboids, our deformable candidate tubes capture actors more flexibly. We give a qualitative visualization of the advantage of our linked tubes in Section 5.1. In this section, we quantitatively compare linked tubes with anchor cuboids on both the UCF-Sports and AVA datasets.

For the UCF-Sports dataset, we choose the top-200 region proposals of the center frame and replicate them 5 times to form anchor cuboids. As shown in Table 4, our DTN with linked tubes outperforms its counterpart with anchor cuboids in most categories, especially in Swing2 (Swing-SideAngle), Walk, Run and Riding. These are the categories with relatively fast actor motion.

                Diving  Golf   Kicking  Lifting  Riding  Run    SkateB.  Swing1  Swing2  Walk   frame-mAP  video-mAP
Anchor Cuboids  100.00  99.62  88.56    100.00   92.14   81.31  98.13    89.40   81.55   72.52  90.32      97.8
Linked Tubes    100.00  95.20  88.62    100.00   98.53   90.03  96.98    77.44   99.36   84.61  93.08      98.8
Table 4: Comparison of model variants with anchor cuboids or linked tubes on the UCF-Sports dataset.

Compared with the UCF-Sports dataset, the movement of actors in the AVA dataset is relatively slow. We run all comparisons with the 3D backbone and relatively large sampling intervals. As mentioned in Section 4.2, we first train a baseline model that only uses the center video clip, and then add a finetuning stage that integrates information over a span of multiple seconds with deformable linked tubes. To evaluate the advantage of our deformable tubes over anchor cuboids, we replace the linked tubes with the top-200 region proposals of the center key frame and refer to the result as the baseline-multi model. The comparison results are shown in Table 5. Our DTN obtains consistent improvements over the baseline and baseline-multi models: it outperforms the baseline model by up to 1.84 points and the baseline-multi model by up to 0.75 points, which verifies the effectiveness of our linked tubes compared with anchor cuboids on the AVA dataset.

baseline          24.0     24.0
baseline-multi    24.7     25.2
our DTN           25.45    25.8
Table 5: Comparisons with baseline models on the AVA dataset with frame-mAP. The IoU threshold is fixed to 0.5.

As for the per-category AP shown in Figure 2, our DTN outperforms the baseline model in 44 of 60 categories and the baseline-multi model in 50 of 60 categories. Compared with the baseline-multi model, the largest absolute gains appear in categories such as “martial art” (+5.03), “cut” (+4.59), “play musical instrument” and “dance” (+4.08). Our deformable tubes play a key role for such categories with large spatial movements.

4.4 Variants of Proposed Model

4.4.1 Sampling Interval

For the UCF-Sports dataset, we sample 5 frames with a sampling interval of S frames as the input to our DTRN. The sampling interval plays a key role in our DTN, as it controls the diversity of the input frames. To investigate its effect on the performance of our DTN, we increase the sampling interval from 1 to 4 to obtain 4 variants of our model. The results for the different sampling intervals are shown in Table 6. With a larger sampling interval, our DTN covers a longer stretch of video content, which helps to distinguish similar action categories. We achieve consistent improvements when changing the sampling interval from 1 to 3. However, performance drops when it is increased from 3 to 4, which can be explained by the fact that a large sampling interval can lead to inaccurate linking of region proposals under fast motion. Based on this analysis, we fix the sampling interval to 3 for all other experiments on the UCF-Sports dataset.
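The relation between the sampling interval and the stretch of video covered by one input can be sketched as follows. The function names, the choice to start sampling at a given frame and the clamping at the end of the video are assumptions made for illustration; only the 5 frames and the intervals 1 to 4 come from the text.

```python
def sample_frame_indices(start, num_frames=5, interval=3, video_len=None):
    """Indices of num_frames frames taken every `interval` frames from `start`.

    If video_len is given, indices are clamped to the last frame
    (an assumption about boundary handling).
    """
    indices = [start + i * interval for i in range(num_frames)]
    if video_len is not None:
        indices = [min(i, video_len - 1) for i in indices]
    return indices


def temporal_span(num_frames=5, interval=3):
    """Number of raw video frames covered by one sampled input."""
    return (num_frames - 1) * interval + 1


indices = sample_frame_indices(start=10)                      # interval 3
spans = {s: temporal_span(interval=s) for s in (1, 2, 3, 4)}  # frames covered per interval
```

A larger interval widens the temporal context seen by DTRN, but it also increases the displacement of an actor between two sampled frames, which makes IoU-based linking less reliable for fast motion.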

       frame-mAP    video-mAP
S=1    91.2         97.3
S=2    92.64        98.0
S=3    93.08        98.8
S=4    92.2         98.3
Table 6: Comparison of variants with different sampling intervals on the UCF-Sports dataset.

4.4.2 Combination with LFB

LFB [48] can be used to augment 3D convolution networks with supportive information extracted over the whole span of a video. It extends vanilla 3D convolution networks with an external long-term feature bank and a feature bank operator (e.g., a non-local operator) that computes interactions between short-term and long-term features. We argue that our deformable tubes, which integrate middle-term video content, are complementary to LFB. For each key frame, we choose the top-2 linked tubes with the highest action scores (the maximum class score over all 60 classes after sigmoid activation). The feature volume before the classification stage, vectorized by flattening or global average pooling, is used as the representation of each linked tube. We replace the precalculated long-term feature bank in LFB with our extracted representations. The comparison results of our DTN with and without LFB are shown in Table 7. We can see a consistent improvement when DTN is combined with LFB.

Method                     frame-mAP
Our DTN                    25.8
Our DTN + LFB (average)    27.2
Our DTN + LFB (flatten)    27.7
Table 7: Comparison of our DTN with and without LFB on the AVA dataset in terms of frame-mAP. The IoU threshold is fixed to 0.5.

Method                     frame-mAP   video-mAP
Gkioxari et al. [13]       68.1
Weinzaepfel et al. [47]    71.9        -
Peng et al. [31]           84.51       94.8
Kalogeiton et al. [18]     87.7        92.7
Hou et al. [17]            86.7        95.2
He et al. [15]             -           96.0
Li et al. [25]             -           98.6
Duarte et al. [5]          83.9
Our DTN                    93.08       98.8
Table 8: Comparison with state-of-the-art methods on the UCF-Sports dataset in terms of frame-mAP and video-mAP. The IoU threshold is set to 0.5 and 0.2, respectively.

Figure 3: Visualization of five detection examples from the UCF-Sports dataset. Blue boxes indicate model detections and red boxes denote ground truths. The predicted label is shown in the top-left corner of each detection box.

Method                     Diving   Golf    Kicking   Lifting   Riding   Run     SkateB.   Swing1   Swing2   Walk    frame-mAP
Gkioxari et al. [13]       75.8     69.3    54.6      99.1      89.6     54.9    29.8      88.7     74.5     44.7    68.1
Weinzaepfel et al. [47]    60.71    77.55   65.26     100.00    99.53    52.60   47.14     88.88    62.86    64.44   71.9
Peng et al. [31]           96.12    80.47   73.78     99.17     97.56    82.37   57.43     83.64    98.54    75.99   84.51
Hou et al. [17]            84.38    90.79   86.48     99.77     100.00   83.65   68.72     65.75    99.62    87.79   86.7
Our DTN                    100.00   95.20   88.62     100.00    98.53    90.03   96.98     77.44    99.36    84.61   93.08
Table 9: Frame-AP of each class on the UCF-Sports dataset with an IoU threshold of 0.5.

4.5 Comparison with Other Methods

4.5.1 Performance Comparison on UCF-Sports Dataset

Table 9 shows our per-class results on the UCF-Sports dataset. Our approach significantly outperforms other methods in overall performance and in most categories, e.g., Diving, Kicking, Lifting and SkateBoarding. In particular, our actor-centric tubelet feature helps action recognition for hard action classes such as SkateBoarding. We also compare with recent methods and report the results in Table 8. Overall, we outperform the state-of-the-art methods [18, 25] in both frame-mAP (93.08 vs. 87.7 for [18]) and video-mAP (98.8 vs. 98.6 for [25]). The results verify the effectiveness of our deformable actor-centric tube proposals, which benefit the localization of actions with significant spatial motion.
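As a quick consistency check on the reported numbers, frame-mAP is the mean of the per-class frame-AP values, and the margins over prior work follow directly from Table 8. The short sketch below recomputes both from the table entries; the variable names are illustrative.

```python
# Per-class frame-AP of "Our DTN", copied from Table 9.
dtn_ap = {
    "Diving": 100.00, "Golf": 95.20, "Kicking": 88.62, "Lifting": 100.00,
    "Riding": 98.53, "Run": 90.03, "SkateB.": 96.98, "Swing1": 77.44,
    "Swing2": 99.36, "Walk": 84.61,
}

# frame-mAP is the unweighted mean over the 10 classes.
frame_map = sum(dtn_ap.values()) / len(dtn_ap)
print(round(frame_map, 2))  # 93.08

# Absolute margins over the best prior entries in Table 8.
frame_margin = round(93.08 - 87.7, 2)  # vs. Kalogeiton et al. [18]
video_margin = round(98.8 - 98.6, 2)   # vs. Li et al. [25]
print(frame_margin, video_margin)  # 5.38 0.2
```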

To further validate the advantage of our method, we visualize five detection examples from the UCF-Sports dataset in Figure 3. Our method works well in complex cases such as those with multiple persons (first row) and large pose variations (second row). The false detections in the fourth row can be explained by high visual similarity with the category Riding. Moreover, our tube detections have high IoU overlaps with the ground truths, which is mainly due to actor-centric tube proposals with precise spatio-temporal context.

Figure 4: Visualization examples on AVA dataset. Blue boxes indicate model predictions and red boxes denote ground truths. The ground-truth labels are in red font.

Method                          Pretrained      frame-mAP
Gu et al. [14] (2D)             ImageNet        14.2
Li et al. [25]                  ImageNet        18.2
Our DTN (2D)                    ImageNet        19.7
Gu et al. [14] (3D)             Kinetics-400    15.8
Sun et al. [3]                  Kinetics-400    17.4
Girdhar et al. [11]             Kinetics-400    25.0
Wu et al. [48]                  Kinetics-400    26.8
Feichtenhofer et al. [6]        Kinetics-600    27.3
Our DTN (3D)                    Kinetics-400    25.8
Our DTN (3D) + LFB              Kinetics-400    27.7
Table 10: Comparison with state-of-the-art methods on the AVA dataset in terms of frame-mAP. The IoU threshold is fixed to 0.5.

4.5.2 Performance Comparison on AVA Dataset

AVA is a recently proposed dataset, and only a handful of studies have reported results on it. We compare our results with the state-of-the-art methods [14, 3, 25, 6, 48, 11] in Table 10. The rows of Table 10 are split into two parts: methods with a 2D backbone and methods with a 3D backbone. Among methods with a 2D backbone, Gu et al. [14] duplicate the RPN detections of the key frame across all adjacent frames to form action tubes with limited spatial extent, whereas we generate actor-centric tube proposals with our novel DTPN. The results indicate that we outperform [14, 3] by a large margin. For a fair comparison, we also compare with the version of [25] that uses the same features as ours; our method (2D) outperforms [25] in frame-mAP (19.7 vs. 18.2).

Figure 5: Visualization of linked tube examples with our DTPN. The green boxes represent region proposals of linked tubes and the red boxes denote ground truths. The blue boxes indicate the fake anchor boxes.

For methods with a 3D backbone, Feichtenhofer et al. [6] propose a two-pathway network which treats spatial structures and temporal events separately. However, it can only integrate the content of short clips of 2-5 seconds. Wu et al. [48] address this with a long-term feature bank holding supportive information extracted over the entire span of a video. They introduce a feature bank operator to compute interactions between long-term and short-term features. However, such long-term information may not be relevant to the annotation of the current key frame. Instead, our DTN (3D) integrates a middle-term representation through our deformable linked tubes, which was verified to be complementary to LFB. Since the test augmentation strategies of existing methods are not always reported, we only compare with results obtained without test augmentation. It is worth mentioning that [6] pretrained their model on the larger Kinetics-600 dataset, which further boosts their performance. We achieve the state-of-the-art result on the AVA dataset when combined with LFB.

Figure 4 shows three detection examples from the AVA dataset. In the first row, for example, our approach precisely localizes the actor and assigns the correct label walk. Our approach also handles multi-label cases well, such as sit and talk in the second row. To conclude, our system obtains the state-of-the-art result on AVA, and we believe that fusing more scene features can further improve our detection performance.

5 Discussion

In this section, we take a deeper look at our system to validate the rationality of our fast tube linking algorithm.

5.1 Visualization of Linked Tubes & Anchor Cuboids

Kalogeiton et al. [18] proposed to integrate temporal context by anchor cuboids with limited spatial extent. Our DTPN generates actor-centric action tubes, which contribute to precise action recognition and location regression. Table 4 shows that our DTPN with linked tubes outperforms its counterpart in most categories, especially for actors with fast motion. To further validate the effectiveness of our DTPN for actions with large spatial motion, we select hard cases from the UCF-Sports dataset for visualization.

Figure 5 shows five examples of tube proposals and the corresponding ground-truth tubes at the frame level. A fake anchor box, defined as the average position of the ground-truth tube, is visualized in each frame to show the drawback of anchor cuboids with a fixed spatial extent. The top row shows a running man with large spatial movement. Since our deformable tube proposal network adaptively links region proposals, the resulting tubes have high IoUs with the ground truths in each frame. In contrast, the fake anchor box has a rather low IoU overlap with the ground truths, especially at the start and at the end. The same problem also exists in the second row with a fast-moving actor. Through this visualization, we verify the effectiveness of DTPN for cases with large spatial motion compared with anchor cuboids.
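The drawback of a fixed anchor box can be reproduced on a toy example. The sketch below builds a hypothetical ground-truth tube of a box moving steadily to the right, takes the fake anchor box as the coordinate-wise mean of the ground-truth boxes (our reading of "average position"), and computes its IoU with the ground truth in each frame. The box sizes and motion are made up for illustration.

```python
import numpy as np


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


# Hypothetical ground-truth tube: a 20x40 box moving 15 px right per frame.
gt_tube = np.array([[15.0 * t, 0.0, 15.0 * t + 20.0, 40.0] for t in range(5)])

# Fake anchor box: average position of the ground-truth tube, fixed for all frames.
fake_anchor = gt_tube.mean(axis=0)

per_frame_iou = [box_iou(fake_anchor, box) for box in gt_tube]
print([round(v, 3) for v in per_frame_iou])  # [0.0, 0.143, 1.0, 0.143, 0.0]
```

The anchor matches the actor only in the middle frame and misses it entirely at the start and the end, which mirrors the behaviour described for the running example.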

5.2 Experiments on Top-K Selection

Based on the observations that the RPN generates rather reliable proposals and that each region proposal of the best linked tube normally has a high objectness score, we modify the Viterbi algorithm with top-K selection to accelerate the tube linking process. We argue that the linked tubes with and without top-K selection are intuitively similar. To quantify the effect of top-K selection, we propose a coselection rate between the tube proposal sets generated with and without top-K selection:

coselection rate = TP / n

We select the top n tube proposals generated with top-K selection for comparison. A tube proposal is assigned a positive label if its IoU with any tube generated by the method without top-K selection is higher than a threshold. TP is the number of positive tubes among the n tube proposals. The coselection rate is averaged over the whole dataset. This criterion measures the similarity between the distributions of the two tube proposal sets. Table 11 shows the coselection rate for different values of n and of the IoU threshold.
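The coselection rate for a single video can be sketched as below. Two points are assumptions made for this illustration: tube IoU is taken as the mean per-frame box IoU of two tubes of equal length, and the function names `tube_iou` and `coselection_rate` are hypothetical. The input list is assumed to be sorted by score and to contain at least n tubes; averaging the result over all videos gives the dataset-level number.

```python
import numpy as np


def tube_iou(tube_a, tube_b):
    """Mean per-frame IoU of two tubes of shape (T, 4), boxes as (x1, y1, x2, y2)."""
    a = np.asarray(tube_a, dtype=float)
    b = np.asarray(tube_b, dtype=float)
    x1 = np.maximum(a[:, 0], b[:, 0])
    y1 = np.maximum(a[:, 1], b[:, 1])
    x2 = np.minimum(a[:, 2], b[:, 2])
    y2 = np.minimum(a[:, 3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return float(np.mean(inter / (area_a + area_b - inter)))


def coselection_rate(tubes_topk, tubes_full, n, threshold):
    """Fraction of the top-n tubes (with top-K selection) that overlap some
    tube generated without top-K selection by more than `threshold`."""
    candidates = tubes_topk[:n]
    tp = sum(
        any(tube_iou(c, ref) > threshold for ref in tubes_full)
        for c in candidates
    )
    return tp / n
```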

0.998 0.995 0.986 0.948
0.996 0.989 0.973 0.904
0.992 0.977 0.942 0.831
0.987 0.960 0.901 0.755
Table 11: Comparisons of coselection rate on UCF-Sports dataset.

We can see that the tube set with top-K selection has a high overlap with the tube set without top-K selection. In particular, even under a strict IoU threshold, we can still obtain a rather high coselection rate. According to our statistics, our improved linking algorithm generates 141 tubes on average. The remaining tubes are linked based only on their objectness scores. The significant drop in coselection rate for larger n is due to the lack of IoU constraints on those complementary tubes. The experiments validate the effectiveness of our fast linking algorithm, which produces similar tube proposals with much lower time complexity.
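To make the role of top-K selection concrete, here is a minimal sketch of Viterbi-style linking with per-frame pruning. It is not the authors' FTL algorithm: the path score (sum of objectness scores plus a weight `lam` times the IoU between consecutive boxes) is a common linking formulation assumed here, and reading top-K selection as "keep the K highest-objectness proposals in each frame" is also an assumption. All names are hypothetical.

```python
import numpy as np


def pairwise_iou(a, b):
    """IoU between every box in a (P, 4) and every box in b (C, 4)."""
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)


def link_best_tube(boxes, scores, k=None, lam=1.0):
    """Return, for each frame, the index of the proposal on the best path.

    boxes:  list of (N_t, 4) arrays, one per frame
    scores: list of (N_t,) objectness arrays, one per frame
    k:      if set, keep only the k highest-scoring proposals per frame
    """
    # Top-K pruning: the dynamic program below only sees the kept proposals.
    kept = [np.argsort(-s)[:k] if k is not None else np.arange(len(s))
            for s in scores]
    value = scores[0][kept[0]].astype(float)
    backpointers = []
    for t in range(1, len(boxes)):
        overlap = pairwise_iou(boxes[t - 1][kept[t - 1]], boxes[t][kept[t]])
        total = value[:, None] + lam * overlap  # (prev, cur) transition scores
        backpointers.append(total.argmax(axis=0))
        value = total.max(axis=0) + scores[t][kept[t]]
    # Backtrack from the best final proposal.
    j = int(value.argmax())
    path = [j]
    for bp in reversed(backpointers):
        j = int(bp[j])
        path.append(j)
    path.reverse()
    return [int(kept[t][path[t]]) for t in range(len(boxes))]
```

With N proposals per frame, each transition step costs on the order of N squared comparisons without pruning and K squared with it, which is where the speed-up of this kind of pruning comes from.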

6 Conclusion

We propose the Deformable Tube Network (DTN), a two-stage action localization architecture with a Deformable Tube Proposal Network (DTPN) and a Deformable Tube Recognition Network (DTRN). Compared with methods based on anchor cuboids, DTPN generates deformable tube proposals by linking per-frame region proposals with a fast tube linking algorithm. With actor-centric candidate action tubes, we use DTRN to perform action recognition and location regression with a 3D convolutional network with skip connections that integrates spatio-temporal context. Our experiments validate the effectiveness of our method. Moreover, we achieve state-of-the-art results on both the UCF-Sports and AVA datasets.


  • [1] L. Cao, Z. Liu, and T. S. Huang. Cross-dataset action detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [2] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems.

  • [3] Chen Sun, Abhinav Shrivastava, Carl Vondrick, Kevin Murphy, Rahul Sukthankar, and Cordelia Schmid. Actor-centric relation network. In European Conference on Computer Vision (ECCV), 2018.
  • [4] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [5] Kevin Duarte, Yogesh Singh Rawat, and Mubarak Shah. Videocapsulenet: A simplified network for action detection. arXiv preprint arXiv:1805.08162, 2018.
  • [6] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. arXiv preprint arXiv:1812.03982, 2018.
  • [7] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
  • [8] Sanja Fidler, Roozbeh Mottaghi, Alan Yuille, and Raquel Urtasun. Bottom-up segmentation for top-down detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.
  • [9] G. D. Forney. The viterbi algorithm. Proceedings of the IEEE, 61(3):268–278, 1973.
  • [10] K. Fu, Q. Zhao, and I. Y. Gu. Refinet: A deep segmentation assisted refinement network for salient object detection. IEEE Transactions on Multimedia, 21(2):457–469, Feb 2019.
  • [11] Rohit Girdhar, João Carreira, Carl Doersch, and Andrew Zisserman. Video action transformer network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [12] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [13] Georgia Gkioxari and Jitendra Malik. Finding action tubes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [14] Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, and Jitendra Malik. Ava: A video dataset of spatio-temporally localized atomic visual actions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [15] Jiawei He, Mostafa S. Ibrahim, Zhiwei Deng, and Greg Mori. Generic tubelet proposals for action localization. The IEEE Winter Conference on Applications of Computer Vision (WACV), 2018.
  • [16] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [17] Rui Hou, Chen Chen, and Mubarak Shah. Tube convolutional neural network (t-cnn) for action detection in videos. In The IEEE International Conference on Computer Vision (ICCV), 2017.
  • [18] Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, and Cordelia Schmid. Action tubelet detector for spatio-temporal action localization. In The IEEE International Conference on Computer Vision (ICCV), 2017.
  • [19] Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, and Cordelia Schmid. Action Tubelet Detector for Spatio-Temporal Action Localization. In The IEEE International Conference on Computer Vision (ICCV), 2017.
  • [20] Tian Lan, Yang Wang, and G. Mori. Discriminative figure-centric models for joint action localization and recognition. In The International Conference on Computer Vision (ICCV), 2011.
  • [21] Tian Lan, Yang Wang, and G. Mori. Discriminative figure-centric models for joint action localization and recognition. In The International Conference on Computer Vision (ICCV), 2011.
  • [22] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
  • [23] I. Laptev and P. Perez. Retrieving actions in movies. In The IEEE International Conference on Computer Vision (ICCV), 2007.
  • [24] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
  • [25] Dong Li, Zhaofan Qiu, Qi Dai, Ting Yao, and Tao Mei. Recurrent tubelet proposal and recognition networks for action detection. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, European Conference on Computer Vision (ECCV), 2018.
  • [26] J. Li, X. Liang, J. Li, Y. Wei, T. Xu, J. Feng, and S. Yan. Multistage object detection with group recursive learning. IEEE Transactions on Multimedia, 20(7):1645–1655, July 2018.
  • [27] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. Detnet: Design backbone for object detection. In The European Conference on Computer Vision (ECCV), 2018.
  • [28] Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [29] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [30] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision (ECCV), 2016.
  • [31] Xiaojiang Peng and Cordelia Schmid. Multi-region two-stream r-cnn for action detection. In European Conference on Computer Vision (ECCV), 2016.
  • [32] Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [33] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [34] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
  • [35] M. D. Rodriguez, J. Ahmed, and M. Shah. Action mach a spatio-temporal maximum average correlation height filter for action recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.
  • [36] Marcus Rohrbach, Anna Rohrbach, Michaela Regneri, Sikandar Amin, Mykhaylo Andriluka, Manfred Pinkal, and Bernt Schiele. Recognizing fine-grained and composite activities using hand-centric features and script data. International Journal of Computer Vision (IJCV), 119(3):346–373, Sep 2016.
  • [37] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi, editors, Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.
  • [38] Suman Saha, Gurkirt Singh, Michael Sapienza, Philip H. S. Torr, and Fabio Cuzzolin. Deep learning for detecting multiple space-time action tubes in videos. 2016.
  • [39] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (NIPS), 2014.
  • [40] Gurkirt Singh, Suman Saha, Michael Sapienza, Philip Torr, and Fabio Cuzzolin. Online real time multiple spatiotemporal action localisation and prediction. 2017.
  • [41] Khurram Soomro and Amir R. Zamir. Action Recognition in Realistic Sports Videos, pages 181–208. Springer International Publishing, Cham, 2014.
  • [42] M. A. Tahir, F. Yan, P. Koniusz, M. Awais, M. Barnard, K. Mikolajczyk, A. Bouridane, and J. Kittler. A robust and scalable visual category and action recognition system using kernel discriminant analysis with spectral regression. IEEE Transactions on Multimedia, 15(7):1653–1664, Nov 2013.
  • [43] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [44] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. International Journal of Computer Vision (IJCV), 104(2):154–171, Sep 2013.
  • [45] Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In The IEEE International Conference on Computer Vision (ICCV), 2013.
  • [46] Xiaoyu Wang, Ming Yang, Shenghuo Zhu, and Yuanqing Lin. Regionlets for generic object detection. In The IEEE International Conference on Computer Vision (ICCV), December 2013.
  • [47] Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. Learning to track for spatio-temporal action localization. In The IEEE International Conference on Computer Vision (ICCV), 2015.
  • [48] Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krähenbühl, and Ross Girshick. Long-Term Feature Banks for Detailed Video Understanding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [49] S. Zhang, C. Gao, J. Zhang, F. Chen, and N. Sang. Discriminative part selection for human action recognition. IEEE Transactions on Multimedia, 20(4):769–780, April 2018.
  • [50] X. Zhen, F. Zheng, L. Shao, X. Cao, and D. Xu. Supervised local descriptor learning for human action recognition. IEEE Transactions on Multimedia, 19(9):2056–2065, Sep. 2017.