Spatio-temporal action detection is an important research area in computer vision that aims to classify the actions present in a video and localize them in both space and time. It has wide applications in many scenarios, such as video surveillance [17, 10] and video captioning [27, 32]. Most current approaches [6, 28, 18, 21, 29] apply an action detector to each frame independently and then link the resulting frame-wise detections across time using dynamic programming [6, 22] or tracking. These methods fail to capture temporal information during frame-level detection, and are thus less effective at detecting action tubes in realistic videos. To address this issue, some approaches [12, 9, 31] perform action detection at the clip level by exploiting short-term temporal information. These methods take a sequence of frames as input and directly output detected tubelets (i.e., short sequences of bounding boxes). This clip-level detection scheme yields a more principled and effective solution for video-based action detection, and has shown promising results on standard benchmarks.
Existing action detection methods [6, 18, 12] are closely related to current mainstream object detectors such as Faster R-CNN and SSD, which operate on a huge number of pre-defined anchor boxes. Although anchor-based object detectors have achieved success in the image domain, they still suffer from critical issues: they are sensitive to hyper-parameters (e.g., box size, aspect ratio, and box number) and are less efficient due to densely placed bounding boxes. These practical issues become more serious when adapting the anchor-based detection framework from images to videos. The number of possible tubelet anchors grows dramatically with clip duration, which poses a great challenge for training and inference. In addition, a simple tubelet anchor design cannot leverage the temporal coherence and correlation of adjacent frame-level bounding boxes. Detecting action tubelets with anchors would therefore require a more sophisticated anchor box placement that accounts for temporal evolution.
Inspired by recent advances in anchor-free object detection [19, 13, 3, 36, 26], we take a different perspective on action tubelet detection from clips than existing anchor-based action detection frameworks. Intuitively, movement is a natural phenomenon in videos and describes an essential property of human behavior, so human action detection can be greatly simplified by casting it as movement detection. Based on this analysis, as shown in Figure 1, we present a new view of human actions that treats each action instance as a trajectory of moving points. In this view, an action tubelet is represented by a center point in the middle frame and the offsets of the other frames with respect to this center point. To determine the spatial extent of an action instance, we directly regress the bounding box size at the detected moving point on each frame. This new detection scheme decouples tubelet detection into three separate components: center detection, offset estimation, and box regression. The decomposition leverages the temporal coherence of action tubes to divide the complex tubelet detection task into simpler sub-tasks, which not only makes the whole detection framework more compact and easier to optimize, but also increases its detection efficiency.
Specifically, we present an anchor-free action detector for localizing action tubelets in short clips, termed the MovingCenter detector (MOC-detector). First, frames are fed into an efficient 2D backbone network for feature extraction. Then, we devise three separate branches: (1) Center Branch: detecting the action instance center and category; (2) Movement Branch: estimating the offsets of the current frame with respect to the center; (3) Box Branch: predicting the bounding box size at the detected center point of each frame. This design enables the three branches to cooperate with each other to generate the tubelet detection results. We also empirically investigate a series of good practices for designing these branches. Finally, we link the detected action tubelets across frames to yield long-range detection results, following common practice. We perform experiments on two challenging action tube detection benchmarks, UCF101-24 and JHMDB. Our MOC-detector outperforms the existing state-of-the-art approaches in both frame-mAP and video-mAP on these two datasets, with a higher detection efficiency (around 25 fps).
2 Related Work
Recent state-of-the-art spatio-temporal action detection methods [31, 7, 12, 29] are closely related to current anchor-based object detectors, such as Faster R-CNN and SSD. In this paper, we propose an efficient anchor-free action tubelet detector. We review recent object detection approaches and spatio-temporal action detection methods below.
2.1 Object Detection
Anchor-based Object Detectors. Traditional one-stage [16, 19, 15] and two-stage object detectors [5, 8, 4, 20] relied heavily on predefined anchor boxes. Two-stage object detectors such as Faster R-CNN and Cascade R-CNN devised an RPN to generate RoIs from a set of anchors in the first stage and handled the classification and regression of each RoI in the second stage. By contrast, typical one-stage detectors, such as SSD, YOLO and RetinaNet, utilized class-aware anchors and jointly predicted the categories and relative spatial offsets of objects.
Anchor-free Object Detectors. Some recent works [26, 36, 13, 3, 37] have shown that anchor-free methods can be competitive with anchor-based detectors while avoiding computation-intensive anchors and region-based CNNs. CornerNet detected an object bounding box as a pair of corners and grouped them to form the final detection. CenterNet modeled an object as the center point of its bounding box and regressed its width and height to build the final result.
2.2 Spatio-temporal Action Detection
Frame-level Detector. Many efforts have been made to extend image object detectors to action detection in the form of frame-level action detectors [6, 28, 18, 21, 22, 29]. After obtaining the frame detections, a linking algorithm is applied to generate the final tubes. Although optical flow is used to capture motion information, frame-level detection fails to fully utilize the temporal information of the video.
Clip-level Detector. To model temporal information for detection, several clip-level approaches, or action tubelet detectors [12, 9, 31, 14, 7], have been proposed. ACT took a short sequence of frames and output tubelets regressed from anchor cuboids. Gu et al. used longer clips and took advantage of I3D pre-trained on Kinetics to boost the results. STEP proposed a progressive method that refines the proposals over a few steps to handle large displacements and to utilize longer temporal information.
These approaches all build on anchor-based object detectors, which are sensitive to anchor design and computationally costly due to the large number of anchor boxes. Instead, we design an anchor-free action tubelet detector by treating each action instance as a trajectory of moving points. Experimental results demonstrate that our proposed action tubelet detector is effective for spatio-temporal action detection, particularly at high video IoU thresholds.
Action tubelet detection involves localizing a short sequence of bounding boxes from an input clip. We present a new tubelet detector, called the MovingCenter detector (MOC-detector), which treats each action instance as a trajectory of moving points. As shown in Figure 2, our MOC-detector takes a set of consecutive frames as input and feeds them into an efficient 2D backbone to extract frame-level features. We then design three head branches to perform tubelet detection in an anchor-free manner. The first is the Center Branch, which is defined on the center (key) frame and aims to localize the tubelet center and recognize the action category. The second is the Movement Branch, which is defined over all frames and relates adjacent frames to predict the center motion. The estimated movement propagates the center point from the key frame to the other frames. The third is the Box Branch, which is defined on the detected or estimated center points of all frames and determines the spatial extent of the detected action instance by directly regressing the height and width of the bounding box. These three branches collaborate to yield tubelet detections from a short clip, which are further linked to form action tubes in a long untrimmed video by following a common matching strategy. We first give a short description of the backbone design and then provide technical details on the three branches and the linking algorithm in the following subsections.
Backbone. In our MOC-detector, the input is K frames, each with the same fixed spatial resolution. The K frames are first fed into a backbone network to generate a feature volume whose spatial resolution is reduced by the spatial downsampling ratio. To keep the full temporal structure for subsequent detection, we do not perform any downsampling over the temporal dimension. Specifically, we choose the DLA-34 architecture as the feature backbone of our MOC-detector due to its good balance between detection accuracy and efficiency. We also experiment with ResNet-18 and ResNet-101 in Section 4.2.4. This architecture employs an encoder-decoder design to extract features for each frame independently, and its spatial downsampling ratio is 4. The extracted features are shared by the three head branches, each of which contains a convolution, a ReLU and another convolution that transform the features for the branch's specific target. Next, we present the technical details of how these branches generate tubelet detections.
3.1 Center Branch: Detect Center at Key Frame
The Center Branch aims to detect the action instance center in the key frame (i.e., the center frame) and to recognize its category based on the extracted video features. Temporal information is important for action recognition, so we design a temporal module to estimate the action center. Specifically, based on the video feature representation, we estimate a center heatmap for the key frame with one channel per action class. The value at a given location and channel represents the likelihood of detecting an action instance of that class at that location, and a higher value indicates a stronger possibility. To capture temporal structure, we employ a 3D convolution followed by a sigmoid non-linearity to estimate the center heatmap in a fully convolutional manner.
Training. We train the Center Branch following the common dense prediction setting [13, 36]. For each action instance, we represent its center by the center of its key frame bounding box and use this position, in the channel of the instance's action category, as the groundtruth label. We generate the soft groundtruth heatmap by placing a Gaussian kernel at each groundtruth center. The kernel width is adaptive to the instance size, and we take the maximum when two Gaussians of the same category overlap. The training objective is a variant of focal loss that is normalized by the number of groundtruth instances and controlled by two focal-loss hyper-parameters, which are kept fixed in all our experiments. The experimental results indicate that this loss deals effectively with the training imbalance issue.
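The heatmap construction and loss described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: it assumes the penalty-reduced focal loss in the form used by CornerNet and CenterNet (the works cited for the dense prediction setting), the default values `alpha=2.0` and `beta=4.0` are those detectors' defaults and not values confirmed for this paper, and the function names are hypothetical.

```python
import numpy as np

def gaussian_heatmap(height, width, centers, sigmas):
    """Soft groundtruth heatmap for ONE action class.

    centers: list of (x, y) key-frame centers on the feature grid.
    sigmas:  one Gaussian width per center (assumed to be derived from box size).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float64)
    for (cx, cy), sigma in zip(centers, sigmas):
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        # Overlapping Gaussians of the same class keep the element-wise maximum.
        heatmap = np.maximum(heatmap, g)
    return heatmap

def center_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-12):
    """Penalty-reduced focal loss (CornerNet/CenterNet form, assumed here).

    pred, target: arrays of the same shape with values in [0, 1].
    Locations where target == 1 are positives; all others are negatives whose
    penalty is reduced by (1 - target) ** beta near a groundtruth center.
    """
    pred = np.clip(pred, eps, 1.0 - eps)
    positive = target == 1.0
    pos_term = ((1.0 - pred) ** alpha) * np.log(pred)
    neg_term = ((1.0 - target) ** beta) * (pred ** alpha) * np.log(1.0 - pred)
    # Normalize by the number of positive locations (one per instance and class).
    n = max(int(positive.sum()), 1)
    return -(pos_term[positive].sum() + neg_term[~positive].sum()) / n
```

A prediction that matches the soft target yields a much smaller loss than one that misses the centers, which is the behavior the Center Branch training relies on.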
Inference. Once trained, the Center Branch is deployed in tubelet detection to localize action instance centers and recognize their categories. Specifically, we first apply a max pooling operation to extract the peaks (local maxima) of the estimated heatmap for each class independently, and then keep the top N peaks across all categories. The kept peaks are the candidate detection centers, and each provides a tubelet score. In our experiments, we set N to 100.
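The peak extraction step can be sketched as below. This is an illustrative NumPy version under assumptions: the pooling window is taken to be 3×3 with stride 1 (the text does not state the kernel size), the heatmap layout is `(classes, height, width)`, and `extract_top_peaks` is a hypothetical name.

```python
import numpy as np

def extract_top_peaks(heatmap, top_n=100):
    """Return up to top_n local maxima as (score, class, y, x) tuples.

    heatmap: float array of shape (C, H, W), one channel per action class.
    """
    c, h, w = heatmap.shape
    padded = np.pad(heatmap, ((0, 0), (1, 1), (1, 1)), constant_values=-np.inf)
    # 3x3 max pooling with stride 1, applied to each class independently.
    pooled = np.full_like(heatmap, -np.inf)
    for dy in range(3):
        for dx in range(3):
            pooled = np.maximum(pooled, padded[:, dy:dy + h, dx:dx + w])
    # A location is a peak if it equals the maximum of its 3x3 neighborhood.
    is_peak = heatmap == pooled
    scores = np.where(is_peak, heatmap, -np.inf).ravel()
    n = min(top_n, int(is_peak.sum()))
    order = np.argsort(-scores)[:n]  # highest scores first, over all classes
    cls, ys, xs = np.unravel_index(order, heatmap.shape)
    return [(float(scores[i]), int(k), int(y), int(x))
            for i, k, y, x in zip(order, cls, ys, xs)]
```

Comparing the heatmap with its pooled version acts as a cheap substitute for box-level non-maximum suppression: a weaker response next to a stronger one is discarded without computing any overlaps.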
3.2 Movement Branch: Move Center Temporally
The Movement Branch relates adjacent frames to predict the movement of the action instance center. Like the Center Branch, the Movement Branch employs temporal information to regress the center offsets along the temporal dimension. Specifically, the Movement Branch takes the stacked features as input and outputs a movement prediction map with 2K channels, which represent the center movements from the key frame to each of the K frames in the X and Y directions. Given a key frame center, the map values at that location encode the center movement from the key frame to each frame.
In practice, to find an efficient way to capture center movements, we implement the Movement Branch in several different ways. The first is the Accumulated Movement strategy, which predicts the center movement between consecutive frames instead of with respect to the key frame. During inference, we find that this accumulates error and harms accuracy. The second strategy, Cost Volume Movement, directly computes the movement offset by constructing a cost volume between the key frame and the current frame, but this explicit computation fails to yield better results and is slower due to the construction of the cost volume. The third strategy, Center Movement, employs a 3D convolution to directly regress the offsets of the current frame with respect to the key frame. The experimental results in Section 4.2.2 indicate that this simple implementation, Center Movement, is effective and efficient for movement prediction, so we adopt it in our MOC-detector.
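Training. The groundtruth tubelet of an action instance is a sequence of K bounding boxes, where the box in frame j is (x_tl^j, y_tl^j, x_br^j, y_br^j) and the subscripts tl and br denote the top-left and bottom-right points of the bounding box, respectively. Let k be the key frame index. The action instance center at the key frame is the midpoint of the key frame box: (x_k, y_k) = ((x_tl^k + x_br^k)/2, (y_tl^k + y_br^k)/2).
We compute the instance bounding box center at frame j in the same way: (x_j, y_j) = ((x_tl^j + x_br^j)/2, (y_tl^j + y_br^j)/2).
Then, the movement groundtruth for frame j is the displacement of this center from the key frame center: m_j = (x_j - x_k, y_j - y_k).
We then optimize the movement map only at the key frame center location, using an L1 loss between the predicted and groundtruth movements.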
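The groundtruth construction for the Movement Branch can be illustrated with a short NumPy sketch. Several details are assumptions made for the example: the key frame is taken to be the middle frame of the tubelet, the 2K channels are laid out as (x, y) pairs per frame, the key frame center is floored to an integer grid location, and the loss is shown for a single instance without normalization. All function names are hypothetical.

```python
import numpy as np

def box_centers(tubelet):
    """tubelet: (K, 4) boxes as (x_tl, y_tl, x_br, y_br). Returns (K, 2) centers."""
    tubelet = np.asarray(tubelet, dtype=np.float64)
    cx = (tubelet[:, 0] + tubelet[:, 2]) / 2.0
    cy = (tubelet[:, 1] + tubelet[:, 3]) / 2.0
    return np.stack([cx, cy], axis=1)

def movement_targets(tubelet, key_index=None):
    """Displacement of every frame's center from the key frame center, shape (K, 2)."""
    centers = box_centers(tubelet)
    k = len(centers) // 2 if key_index is None else key_index
    return centers - centers[k]

def movement_l1_loss(pred_map, tubelet, key_index=None):
    """L1 loss evaluated only at the key frame center location.

    pred_map: (H, W, 2K) movement prediction map on the feature grid.
    tubelet:  (K, 4) groundtruth boxes in feature-grid coordinates.
    """
    centers = box_centers(tubelet)
    k = len(centers) // 2 if key_index is None else key_index
    x_k, y_k = np.floor(centers[k]).astype(int)
    target = (centers - centers[k]).reshape(-1)  # (x0, y0, x1, y1, ...)
    return float(np.abs(pred_map[y_k, x_k] - target).sum())
```

Because the target for the key frame itself is always (0, 0), only the other frames carry a real regression signal.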
Inference. Once the Movement Branch is trained, and given the centers generated by the Center Branch, we gather the set of movement vectors for each detected center. This branch generates a set of trajectories: for each detected center, its moving trajectory is obtained by adding the predicted movement vector of each frame to the key frame center.
3.3 Box Branch: Determine Spatial Extent
The Box Branch is the last step of tubelet detection and determines the spatial extent of the action instance. Unlike the Center Branch and Movement Branch, we assume that box detection depends only on the current frame, since we find that temporal information does not benefit the class-agnostic bounding box regression and only adds computational burden. This branch can therefore be applied in a frame-wise manner. Specifically, the Box Branch directly regresses the bounding box size (i.e., width and height) and generates a size prediction map for each frame. At each location, the map value represents the size of the bounding box centered at that location in that frame.
Training. The groundtruth bounding box size of an action instance at frame j is given by its width and height: (x_br^j - x_tl^j, y_br^j - y_tl^j).
With this groundtruth bounding box size, we optimize the Box Branch at the center points of all frames for each tubelet with an L1 loss between the predicted and groundtruth sizes.
Note that the loss is evaluated at the groundtruth instance center of each frame. The overall training objective of our MOC-detector is a weighted sum of the three branch losses: the Center Branch loss, plus a times the Movement Branch loss, plus b times the Box Branch loss,
where we set a=1 and b=0.1 in all our experiments.
Inference. Finally, we generate the tubelet detection results from the center trajectories produced by the Movement Branch and the size prediction map produced by this branch for every location. For a point in a trajectory, we use (x, y) to denote its coordinates and (w, h) to denote the Box Branch size output at that location. The bounding box for this point is then (x - w/2, y - h/2, x + w/2, y + h/2).
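Putting the three branch outputs together, tubelet decoding for one detected center can be sketched as follows. This is a hedged illustration: the array layouts, the rounding of trajectory points to the nearest grid cell when reading the size map, and the name `decode_tubelet` are assumptions for the example, not details given in the text.

```python
import numpy as np

def decode_tubelet(key_center, movements, size_maps):
    """Turn one detected key-frame center into K bounding boxes.

    key_center: (x, y) of the detected center on the feature grid.
    movements:  (K, 2) predicted center movements from the key frame to each frame.
    size_maps:  (K, H, W, 2) per-frame size predictions as (width, height).
    Returns a (K, 4) array of boxes as (x_tl, y_tl, x_br, y_br).
    """
    key_center = np.asarray(key_center, dtype=np.float64)
    trajectory = key_center + np.asarray(movements, dtype=np.float64)  # (K, 2)
    k, h, w, _ = size_maps.shape
    boxes = np.zeros((k, 4))
    for j, (x, y) in enumerate(trajectory):
        # Read the size at the moved center of frame j, not at the key-frame center.
        xi = int(np.clip(np.rint(x), 0, w - 1))
        yi = int(np.clip(np.rint(y), 0, h - 1))
        bw, bh = size_maps[j, yi, xi]
        boxes[j] = (x - bw / 2, y - bh / 2, x + bw / 2, y + bh / 2)
    return boxes
```

Multiplying the resulting boxes by the spatial downsampling ratio would map them back to input-image coordinates.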
3.4 Tubelet Linking
Having introduced the technical details of the MOC-detector, we now describe how to combine its results to obtain tube detections over a long video. In principle, our MOC-detector could perform tube detection on a whole video directly, as we would only need to set the parameter K to the number of video frames. In practice, however, the GPU memory limit prevents us from setting K too large. A common approach is therefore to detect action tubelets over a limited temporal duration (i.e., K less than 10) and to use a temporal linking algorithm to merge nearby tubelets into an action tube.
As our main goal is to propose a new tubelet detector, and to compare fairly with previous methods, we use the same linking algorithm as previous work. First, in the initialization step, it keeps the highest-scoring tubelets for each class after non-maximum suppression (NMS) for each sequence of frames. Then, in the linking step, it performs tubelet extension in a greedy way by assigning the highest-scoring tubelet to one of the tubelet candidates starting at this frame. To extend a link with a tubelet candidate, three conditions must be met: (1) the tubelet is not selected by another link, (2) the candidate has the highest score, and (3) the overlap between the link and the tubelet is greater than a threshold. Finally, in the termination step, a link stops when these criteria are not met for more than a given number of consecutive frames. The detection score of a link is the average score of all its linked tubelets, and box coordinates are averaged over overlapping frames. More details are described in the work that introduced this linking algorithm.
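A simplified sketch of such a greedy linking step is given below. It is not the linking algorithm of the cited prior work: it assumes tubelets start at every frame (stride 1), measures overlap as the mean IoU over the frames two consecutive tubelets share, terminates a link at the first frame where no candidate qualifies (the description above tolerates several consecutive misses), and omits NMS, per-class handling and box averaging. The threshold value and all names are hypothetical.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as (x_tl, y_tl, x_br, y_br)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def tubelet_overlap(prev_boxes, next_boxes):
    """Mean IoU over the frames shared by two tubelets that start one frame apart."""
    pairs = zip(prev_boxes[1:], next_boxes[:-1])
    return float(np.mean([box_iou(p, q) for p, q in pairs]))

def link_tubelets(candidates, iou_threshold=0.5):
    """Greedily link tubelets of one class into longer tubes.

    candidates[t]: list of (score, boxes) for tubelets starting at frame t,
                   where boxes has shape (K, 4) with K >= 2.
    Returns a list of links, each {"members": [(t, index), ...], "score": float}.
    """
    links, finished = [], []
    for t, frame_candidates in enumerate(candidates):
        used = set()
        still_active = []
        # Higher-scoring links choose their extension first.
        links.sort(key=lambda link: -link["score"])
        for link in links:
            last_t, last_i = link["members"][-1]
            prev_boxes = candidates[last_t][last_i][1]
            best = None
            for i, (score, boxes) in enumerate(frame_candidates):
                if i in used:
                    continue  # already selected by another link
                if tubelet_overlap(prev_boxes, boxes) <= iou_threshold:
                    continue  # overlap must exceed the threshold
                if best is None or score > frame_candidates[best][0]:
                    best = i  # keep the highest-scoring valid candidate
            if best is None:
                finished.append(link)  # simplified termination: stop at first miss
                continue
            used.add(best)
            link["members"].append((t, best))
            scores = [candidates[a][b][0] for a, b in link["members"]]
            link["score"] = float(np.mean(scores))  # link score = average tubelet score
            still_active.append(link)
        # Candidates not claimed by any link start new links.
        for i, (score, _) in enumerate(frame_candidates):
            if i not in used:
                still_active.append({"members": [(t, i)], "score": float(score)})
        links = still_active
    return finished + links
```

With two well-separated actors, each frame's candidates are split between two links that run the full length of the clip.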
In this section, we present the experimental results of our MOC-detector on action detection. First, we describe the evaluation datasets and provide the implementation details. Second, we perform an ablation study to verify the effectiveness of our branch design. Then, we provide an error analysis of the MOC-detector for different tubelet lengths and input modalities. After that, we compare our MOC-detector with the existing state-of-the-art methods on two challenging datasets. Finally, we show some visualization examples of action detection results.
4.1 Datasets and Implementation Details
To verify the effectiveness of the MOC-detector for video-based action detection, we perform experiments on two challenging benchmarks: UCF101-24 and JHMDB. These two are the existing action detection datasets suitable for evaluating tubelet-based action detectors. The AVA dataset is a larger action detection dataset, but it contains only a single-frame action instance annotation for each 3s clip. AVA is therefore not suitable for verifying the effectiveness of a tubelet-based action detector.
UCF101-24. The UCF101 dataset is a common benchmark for action recognition, and it contains spatio-temporal action instance annotations for 3207 videos from 24 sports classes. We refer to this detection version of the UCF101 dataset as UCF101-24. The videos are untrimmed, which makes action detection more challenging. Following the common setting [18, 12], we report the action detection performance for the first split only.
JHMDB. The HMDB51 dataset is another, smaller action recognition benchmark with 51 action classes. JHMDB is a subset of HMDB51, containing 928 videos from 21 daily-life action classes. It is worth noting that these video clips are trimmed to the action instance. Action detection on JHMDB therefore focuses mainly on spatial detection, and we report results averaged over three splits following the common setting [18, 12].
Evaluation Metrics. To evaluate the detection performance of the MOC-detector, we use the metrics of frame AP and video AP. Frame-level AP focuses on bounding box detection in each frame and is thus better suited to evaluating tubelet detection. This measure is independent of the tube linking algorithm and reflects the original tubelet detection results. Video-level AP focuses on tube detection over the whole video, so it depends not only on the tubelet detection results but also on the linking algorithm. To better demonstrate the effectiveness of our MOC-detector on tubelet detection, we use the same linking algorithm as the previous method.
Both the frame-level and the video-level AP calculations are based on the Intersection-over-Union (IoU). Frame AP computes the IoU on frame-level bounding boxes, while video AP computes the IoU on clip-level tubes. For frame AP, we set the IoU threshold to 0.5, and for video AP, we use IoU thresholds from 0.2 to 0.95. A detected instance is considered correct only if its IoU with a groundtruth is larger than the threshold and the predicted class label is correct. For each action class, we compute the average precision (AP) and report the mean AP (mAP) by averaging over classes.
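The two IoU notions can be made concrete with a small sketch. The frame-level IoU is standard; for tubes, the code assumes the common convention of multiplying the temporal IoU by the mean spatial IoU over the frames both tubes cover, which the text does not spell out. Tubes are represented as dictionaries from frame index to box, and all names are illustrative.

```python
def box_iou(a, b):
    """Frame-level IoU of two boxes given as (x_tl, y_tl, x_br, y_br)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def tube_iou(tube_a, tube_b):
    """Spatio-temporal IoU of two tubes (dicts: frame index -> box).

    Assumed convention: temporal IoU times the mean spatial IoU over shared frames.
    """
    shared = sorted(set(tube_a) & set(tube_b))
    if not shared:
        return 0.0
    temporal_iou = len(shared) / len(set(tube_a) | set(tube_b))
    spatial_iou = sum(box_iou(tube_a[f], tube_b[f]) for f in shared) / len(shared)
    return temporal_iou * spatial_iou

def is_correct(iou, pred_label, gt_label, threshold=0.5):
    """A detection counts only if the class matches and the IoU exceeds the threshold."""
    return pred_label == gt_label and iou > threshold
```

Under this convention a tube with perfect boxes can still fail a high video IoU threshold if its temporal extent is off, which is why video AP on untrimmed videos is sensitive to both linking and temporal localization.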
Implementation Details. Following previous work [18, 12], we use two-stream inputs to investigate our MOC-detector on these two challenging datasets. We choose the DLA-34 architecture as our backbone, and each frame is resized to a fixed resolution. The spatial downsampling ratio is set to 4, so the resulting feature map is 4 times smaller than the input frame along each spatial dimension. During training, we apply the same data augmentation as previous work to the whole video: photometric transformation, scale jittering and location jittering. We use Adam with a learning rate of 5e-4 to optimize the overall objective. The learning rate is decreased by a factor of 10 when performance on the validation set saturates. Training runs for at most 8 epochs on UCF101-24 and 20 epochs on JHMDB.
4.2 Ablation Study
In this subsection, we perform an ablation study of our proposed MOC-detector on the UCF101-24 dataset. Specifically, we study four aspects: (1) analysis of movement, (2) Movement Branch design, (3) tubelet length K, (4) backbone exploration. For efficient exploration, we perform experiments using only the RGB input modality. Unless otherwise specified, we use exactly the same training strategy throughout this subsection.
4.2.1 Study on Tubelet Detection Design
To demonstrate the effectiveness of our proposed MOC-detector, we compare MOC with two other basic tubelet detection designs, called No Movement and Box Movement, as shown in Figure 3. We set the tubelet length K=5 in these exploration experiments and use the same training strategy for all tubelet detection designs.
No Movement: directly remove the Movement Branch and generate the bounding box for each frame at the same location as the key frame center. In this way, all bounding boxes share the same center location and differ only in size. This design is based on the assumption that the actor's movement within a local temporal window is relatively small compared with the actor's pose variation.
Box Movement: first generate the bounding box for each frame at the same location as the key frame center, and then move the generated box in each frame according to the Movement Branch prediction. This design assumes that bounding box size regression is not very sensitive to the box center location and can be performed at the same location for all frames.
Center Movement (MOC): first move the key frame center to the current frame center according to the Movement Branch prediction, and then let the Box Branch generate the bounding box for each frame at its own center. Center Movement and Box Movement differ in where the bounding box size is regressed: the former at the actual center of each frame, the latter at the fixed key frame center.
Quantitative Result. The results of the three movement strategies are summarized in Table 1. Box Movement performs slightly worse than Center Movement (MOC) for frame mAP@0.5 (70.05% vs. 70.39%). We conclude that our Box Branch is robust, as the slight center localization error incurred within a 5-frame clip does not seriously harm frame detection. However, there is a 1.74% gap for video mAP@0.5 (46.82% vs. 48.56%), since the subtle box size estimation errors in each frame accumulate and gradually deteriorate video-level detection. No Movement trails Center Movement by 2.16% for frame mAP@0.5 (68.23% vs. 70.39%) and by 10.00% for video mAP@0.5 (38.56% vs. 48.56%), which shows that estimating movement is important in our MOC-detector even though the input is only 5 consecutive frames. Moreover, this comparison shows that our Movement Branch successfully estimates the movement between the key frame and the other frames.
|Cost Volume Movement||70.11||72.81||44.01||22.59||22.51|
Qualitative Result. We provide some visualization examples to intuitively compare the three movement strategies. As shown in Figure 4, No Movement and Center Movement both detect the actor accurately in the key frame, which shows that our MOC-detector can localize the key frame center and regress the height and width of the actor's bounding box accurately. Center Movement (MOC) moves the action center from the key frame to the non-key frames according to the Movement Branch prediction and regresses the height and width of the bounding box at the respective frame center. Adjusting both box location and size leads to better detection in non-key frames. In contrast, No Movement only adjusts the box size at the same location (the key frame center). To enclose the actor, the boxes predicted by No Movement in non-key frames tend to be larger than the ground truth. Moreover, this visualization example shows a negligible difference between Box Movement and Center Movement (MOC), which coincides with the quantitative results in Table 1. These examples illustrate the effect of movement and the effectiveness of the movement strategy applied in our MOC-detector.
4.2.2 Study on Movement Branch Design
As described in Section 3.2, we consider three implementations of the Movement Branch: (1) the Accumulated Movement strategy, (2) the Cost Volume Movement strategy, and (3) the Center Movement strategy. We first conduct an exploration study on these three Movement Branch designs. In these experiments, we set the tubelet length K=5 and use the same training strategy. The results are reported in Table 2.
First, we observe that the Accumulated Movement strategy performs worse than Center Movement. We attribute this to error accumulation and to a greater sensitivity to inconsistency between training and inference. During training, the groundtruth movement is computed at the real bounding box center, whereas during inference, the current frame center is estimated by the Movement Branch and may be imprecise, so Accumulated Movement can drift far from the groundtruth. Second, we notice that the cost volume based Movement Branch design is slightly worse than directly employing a 3D convolution to regress movement (70.11% vs. 70.39% for frame mAP@0.5). This can be explained by two reasons: (1) In our cost volume based Movement Branch design, we explicitly calculate the correlation of the current frame with respect to the key frame, so the regressed movement of the current frame depends only on the current correlation map. In contrast, when directly regressing movement with 3D convolutions, the movement information of each frame depends on all frames, which might contribute to a more accurate estimation. (2) As cost volume calculation and offset aggregation involve a correlation without extra parameters, we observe that convergence is much harder than for Center Movement. Therefore, we choose Center Movement as the Movement Branch design in our MOC-detector.
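| Method | JHMDB F-mAP@0.5 (%) | JHMDB Video-mAP@0.2 (%) | JHMDB Video-mAP@0.5 (%) | JHMDB Video-mAP@0.75 (%) | JHMDB Video-mAP@0.5:0.95 (%) | UCF101-24 F-mAP@0.5 (%) | UCF101-24 Video-mAP@0.2 (%) | UCF101-24 Video-mAP@0.5 (%) | UCF101-24 Video-mAP@0.75 (%) | UCF101-24 Video-mAP@0.5:0.95 (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| Saha et al. | - | 72.6 | 71.5 | 43.3 | 40.0 | - | 66.7 | 35.9 | 7.9 | 14.4 |
| Peng et al. | 58.5 | 74.3 | 73.1 | - | - | 65.7 | 73.5 | 32.1 | 2.7 | 7.3 |
| Singh et al. | - | 73.8 | 72.0 | 44.5 | 41.6 | - | 73.5 | 46.3 | 15.0 | 20.4 |
| Hou et al. (C3D) | 61.3 | 78.4 | 76.9 | - | - | 41.4 | 47.1 | - | - | - |
| Kalogeiton et al. | 65.7 | 74.2 | 73.7 | 52.1 | 44.8 | 69.5 | 76.5 | 49.2 | 19.7 | 23.4 |
| Yang et al. | - | - | - | - | - | 75.0 | 76.6 | - | - | - |
| Song et al. | 65.5 | 74.1 | 73.4 | 52.5 | 44.8 | 72.1 | 77.5 | 52.9 | 21.8 | 24.1 |
| Zhao et al. | - | - | 74.7 | 53.3 | 45.0 | - | 78.5 | 50.3 | 22.2 | 24.5 |
| Gu et al. (I3D) | 73.3 | - | 78.6 | - | - | 76.3 | - | 59.9 | - | - |
| Sun et al. (S3D-G) | 77.9 | - | 80.1 | - | - | - | - | - | - | - |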
4.2.3 Study on Input Sequence Duration
The temporal length K of the input clip is another important parameter in our MOC-detector. In this study, we report the performance of the MOC-detector when varying K from 1 to 9, and the experimental results are summarized in Table 3. We reduce the training batch size for K=7 and K=9 due to GPU memory limitations. First, we notice that when K=1, our MOC-detector reduces to a frame-level detector, which obtains the worst performance. This confirms the common assumption that a frame-level action detector lacks temporal information for action recognition and is thus worse than tubelet detectors, which agrees with our basic motivation for designing an action tubelet detector. Second, we see that detection performance increases and speed decreases as we vary K from 3 to 7, with the performance gap shrinking as K grows. At the largest values of K, the gap becomes smaller still, and performance even decreases on some evaluation metrics. To balance detection efficiency and accuracy, we set K=7 in our MOC-detector for all datasets. Excluding flow computation, the two-stream MOC-detector, which uses DLA-34 as its backbone and runs the temporal and spatial CNNs sequentially, runs at around 25 fps.
Note that we evaluate speed on a single NVIDIA TITAN Xp with a batch size of 64 and report the average frames per second (tubelets per second). The speed evaluation consists of two parts: extracting frame features and generating tubelets. Since all frames share the same backbone weights, we extract the features of each frame only once, which avoids redundant feature extraction for consecutive tubelets. The extracted features are then fed into the three branches of MOC to generate tubelets. IO time is excluded, as it depends heavily on hardware.
4.2.4 Study on Backbone
Each ResNet backbone is a standard residual network augmented with three up-convolutional networks to generate a higher-resolution output with stride 4 for dense prediction. Following previous work, we use deformable convolution layers to modify both the ResNets and DLA-34. The different backbones are not trained for exactly the same number of epochs, due to the difference in their number of parameters. We use the speed evaluation described in Section 4.2.3. As shown in Table 4, the performance gaps between different backbones are marginal, showing that our framework is insensitive to backbone variations. To balance accuracy and efficiency, we choose DLA-34 as the backbone of our MOC-detector for all datasets.
4.3 Comparison with the State of the Art
Finally, we compare the detection results of our MOC-detector with the existing state-of-the-art methods on both the trimmed JHMDB dataset and the untrimmed UCF101-24 dataset in Table 5. We report detection performance with both frame-mAP and video-mAP and give detailed two-stream results in Table 6 and Table 7. Note that all methods in Table 5 utilize both RGB and optical flow information except for Hou et al., which only takes RGB as input and uses C3D to model temporal information. For frame mAP, our method significantly outperforms previous methods pre-trained on an image-level dataset (e.g., ImageNet), by around 9% on JHMDB and 5% on UCF101-24. We also compare with other 3D CNN based action detectors [7, 25], and our performance is comparable to these strong detectors. It is worth noting that these methods employ more computationally expensive architectures and are pre-trained on a large-scale video-level dataset (e.g., Kinetics). Our method therefore cannot be compared with them directly, due to the different backbone architectures and pre-training datasets. For video mAP, our MOC-detector is also significantly superior to the previous state-of-the-art approaches pre-trained on an image-level dataset and comparable to the Kinetics pre-trained 3D action detectors. For video mAP@[0.5:0.95], our method advances state-of-the-art performance by around 15% on JHMDB and 3% on UCF101-24. These results demonstrate the effectiveness of our proposed MOC-detector for action tubelet detection.
In Figure 5, we give some qualitative examples comparing tubelet durations K=1 and K=7. Comparing the second and third rows shows that our tubelet detector yields fewer missed detections and localizes actions more accurately, owing to the offset constraint within the same tubelet. Moreover, comparing the fifth and sixth rows shows that our tubelet detector reduces classification errors, because some actions cannot be discriminated by looking at just one frame. All examples in Figure 5 are from the UCF101-24 dataset, and we set the visualization threshold to 0.4.
5 Conclusion and Future Work
In this paper, we have presented a clip-level action detector, termed MOC-detector, which treats each action instance as a trajectory of moving points and directly regresses the bounding box size at the estimated center point of each frame. As demonstrated on two challenging datasets, the MOC-detector achieves a new state of the art in terms of both frame mAP and video mAP while maintaining a reasonable computational cost. The superior performance is largely ascribed to the unique design of the three branches and their cooperative modeling ability for tubelet detection. In the future, building on the proposed MOC-detector, we will try to extend the framework to longer-term modeling and to model action boundaries in the temporal dimension, thereby contributing to spatio-temporal action detection in longer continuous video streams.
Appendix A: Error Breakdown Analysis
In this section, following prior work, we conduct an error analysis of frame mAP to better understand our proposed MOC-detector. In particular, we investigate five kinds of tubelet detection error: (1) classification error: the detection has an IoU greater than 0.5 with the ground-truth box of another action class. (2) localization error: the detection class is correct in a frame, but the bounding box IoU with the ground truth is less than 0.5. (3) time error: the detection in an untrimmed video covers a frame that does not belong to the temporal extent of the current action instance. (4) missed detection error: a ground-truth box is not detected. (5) other error: the detection appears in a frame without the class and has an IoU less than 0.5 with the ground-truth bounding boxes of other classes.
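To make the five categories concrete, the sketch below assigns a single frame-level detection to one of them and counts missed ground-truth boxes. It is an illustrative reading of the definitions above, not the analysis code behind the reported figures: the function names, the `(box, label)` input format, the order in which the rules are tested, and the use of 0.5 as the IoU threshold for every rule are assumptions of this sketch, and duplicate detections of the same ground-truth box are not handled.

```python
from typing import List, Set, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Labeled = Tuple[Box, str]                # hypothetical format: (box, class label)


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def categorize_detection(det: Labeled, frame_gts: List[Labeled],
                         video_labels: Set[str], thr: float = 0.5) -> str:
    """Assign one detection in one frame to 'correct' or an error category.

    frame_gts holds the ground-truth boxes of this frame; video_labels holds
    the classes annotated anywhere in the video (both are assumed inputs).
    """
    box, label = det
    same = [box_iou(box, g) for g, l in frame_gts if l == label]
    other = [box_iou(box, g) for g, l in frame_gts if l != label]
    if same:
        # The class is present in this frame: either correct or badly localized.
        return "correct" if max(same) >= thr else "localization"
    if other and max(other) > thr:
        # Overlaps a ground-truth box of another class.
        return "classification"
    if label in video_labels:
        # The class occurs in the video, but not in this frame.
        return "time"
    return "other"


def count_missed(dets: List[Labeled], frame_gts: List[Labeled],
                 thr: float = 0.5) -> int:
    """Count ground-truth boxes with no same-class detection at IoU >= thr."""
    missed = 0
    for g_box, g_label in frame_gts:
        hit = any(l == g_label and box_iou(b, g_box) >= thr for b, l in dets)
        missed += 0 if hit else 1
    return missed
```

For example, a detection whose class occurs elsewhere in the video but has no ground-truth box in the current frame is reported as a time error, which is the case that only arises for untrimmed videos.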
We first present the error analysis on UCF101-24 with respect to varying input sequence duration, using RGB as input; the results are summarized in Figure 6. From these results, we see that the major difference between the frame-level detector (K=1) and the tubelet detector lies in three kinds of error: (1) classification error, (2) localization error, and (3) missed detection error. For localization error, the frame-level detector performs better, since it performs frame-wise detection more precisely. For classification error, the tubelet detector performs better, as it exploits temporal structure for classification. For missed detection error, the tubelet detector is capable of detecting more action instances by exploiting rich temporal context. We also notice that time error is the largest error type, because our MOC-detector does not model the starting and ending points of actions in the current framework.
Then, we also visualize the error analysis with two-stream fusion on UCF101-24; the results are reported in Figure 7. Note that we set the tubelet length to 7. First, we find that the temporal-stream MOC-detector obtains lower localization and time error rates than the spatial-stream MOC-detector. This result indicates that optical flow is able to detect action instances with high precision but low recall; thus, its missed detection error is very high. Second, combining the temporal MOC-detector with the spatial MOC-detector mainly improves detection performance in terms of localization error and classification error.
Finally, we present the error analysis on the untrimmed dataset UCF101-24 and the trimmed dataset JHMDB (split 1 only) with tubelet length K=7 and two-stream fusion. As shown in Figure 8, the major error is time error (10.1%) for the untrimmed dataset UCF101-24 and classification error (23.66%) for the trimmed dataset JHMDB. Although our MOC-detector has achieved state-of-the-art results on both datasets, we will try to extend this framework to model longer temporal information to improve classification accuracy, and to model action boundaries in the temporal dimension to eliminate time error.
References
- [1] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6154–6162, 2018.
- [2] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
- [3] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 6569–6578, 2019.
- [4] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
- [5] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
- [6] Georgia Gkioxari and Jitendra Malik. Finding action tubes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 759–768, 2015.
- [7] Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6047–6056, 2018.
- [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1904–1916, 2015.
- [9] Rui Hou, Chen Chen, and Mubarak Shah. Tube convolutional neural network (t-cnn) for action detection in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 5822–5831, 2017.
- [10] Weiming Hu, Tieniu Tan, Liang Wang, and Steve Maybank. A survey on visual surveillance of object motion and behaviors. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 34(3):334–352, 2004.
- [11] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J Black. Towards understanding action recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 3192–3199, 2013.
- [12] Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, and Cordelia Schmid. Action tubelet detector for spatio-temporal action localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 4405–4413, 2017.
- [13] Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pages 734–750, 2018.
- [14] Dong Li, Zhaofan Qiu, Qi Dai, Ting Yao, and Tao Mei. Recurrent tubelet proposal and recognition networks for action detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 303–318, 2018.
- [15] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
- [16] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
- [17] Sangmin Oh, Anthony Hoogs, Amitha Perera, Naresh Cuntoor, Chia-Chih Chen, Jong Taek Lee, Saurajit Mukherjee, JK Aggarwal, Hyungtae Lee, Larry Davis, et al. A large-scale benchmark dataset for event recognition in surveillance video. In CVPR 2011, pages 3153–3160. IEEE, 2011.
- [18] Xiaojiang Peng and Cordelia Schmid. Multi-region two-stream r-cnn for action detection. In European Conference on Computer Vision, pages 744–759. Springer, 2016.
- [19] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
- [20] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
- [21] Suman Saha, Gurkirt Singh, Michael Sapienza, Philip HS Torr, and Fabio Cuzzolin. Deep learning for detecting multiple space-time action tubes in videos. arXiv preprint arXiv:1608.01529, 2016.
- [22] Gurkirt Singh, Suman Saha, Michael Sapienza, Philip HS Torr, and Fabio Cuzzolin. Online real-time multiple spatiotemporal action localisation and prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 3637–3646, 2017.
- [23] Lin Song, Shiwei Zhang, Gang Yu, and Hongbin Sun. Tacnet: Transition-aware context network for spatio-temporal action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11987–11995, 2019.
- [24] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
- [25] Chen Sun, Abhinav Shrivastava, Carl Vondrick, Kevin Murphy, Rahul Sukthankar, and Cordelia Schmid. Actor-centric relation network. In ECCV, pages 335–351, 2018.
- [26] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: Fully convolutional one-stage object detection. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
- [27] Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. Sequence to sequence-video to text. In Proceedings of the IEEE International Conference on Computer Vision, pages 4534–4542, 2015.
- [28] Limin Wang, Yu Qiao, Xiaoou Tang, and Luc Van Gool. Actionness estimation using hybrid fully convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2708–2717, 2016.
- [29] Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. Learning to track for spatio-temporal action localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 3164–3172, 2015.
- [30] Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pages 466–481, 2018.
- [31] Xitong Yang, Xiaodong Yang, Ming-Yu Liu, Fanyi Xiao, Larry S Davis, and Jan Kautz. Step: Spatio-temporal progressive learning for video action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 264–272, 2019.
- [32] Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision, pages 4507–4515, 2015.
- [33] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2403–2412, 2018.
- [34] Jiaojiao Zhao and Cees GM Snoek. Dance with flow: Two-in-one stream action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9935–9944, 2019.
- [35] Yue Zhao, Yuanjun Xiong, and Dahua Lin. Recognize actions by disentangling components of dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6566–6575, 2018.
- [36] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
- [37] Xingyi Zhou, Jiacheng Zhuo, and Philipp Krähenbühl. Bottom-up object detection by grouping extreme and center points. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 850–859, 2019.