E^2TAD: An Energy-Efficient Tracking-based Action Detector

by   Xin Hu, et al.
The University of Texas at Austin

Video action detection (spatio-temporal action localization) is usually the starting point for human-centric intelligent analysis of videos. It has high practical impact for many applications across robotics, security, healthcare, etc. The two-stage paradigm of Faster R-CNN in object detection inspires a standard paradigm of video action detection, i.e., first generating person proposals and then classifying their actions. However, none of the existing solutions can provide fine-grained action detection at the "who-when-where-what" level. This paper presents a tracking-based solution to accurately and efficiently localize predefined key actions spatially (by predicting the associated target IDs and locations) and temporally (by predicting the time in exact frame indices). This solution won first place in the UAV-Video Track of the 2021 Low-Power Computer Vision Challenge (LPCVC).



1 Introduction

The 2021 Low-Power Computer Vision Challenge (LPCVC) UAV-Video Track [24, 25] requires contestants to spatio-temporally localize the key action of ball-catching in videos captured by drones. The evaluation metric takes both accuracy and efficiency into account. The solution is expected to precisely detect the key action temporally (by predicting frame indices) and spatially (by predicting the IDs of the associated persons and balls), using as little energy as possible. This task poses three unique challenges: lack of training data, robustness, and efficiency.

Lack of Training Data:

Unlike the ubiquitous “person” object found in benchmarks for detection, segmentation, pose estimation, or tracking tasks, the “ball” object can only be found in COCO’s sports ball category, which has a large semantic domain gap to the challenge videos.

Robustness:

The irregular motion of the actors and the drone makes tracking extremely difficult. The resulting occlusion and varying view angles cause many detection and association errors.

Efficiency:

Detecting target persons/balls, extracting their ReID features, and localizing key actions spatio-temporally are computationally intensive. Considering the limited computation power and memory capacity of the Raspberry Pi 3B+, running these modules on it in real time is difficult.

This paper presents our submitted solution, dubbed Energy-Efficient Tracking-based Action Detector (E^2TAD). Our contributions are as follows:


  • a tracking-based vision system to spatio-temporally localize key actions from videos. It has three core components: ball-person detection, deep association, and action detection;

  • a harmonization-aware image composition module to generate synthetic but realistic balls with homogeneous color on pedestrian datasets;

  • two adaptive inference strategies and a cache-friendly pipeline to save energy cost;

  • a shape-texture debiased training and a domain-invariant adversarial training to improve the robustness of detection and ReID feature;

  • a morphology-based and learning-free action detector that only depends on bounding box trajectories to localize key actions spatio-temporally.

This solution won first place in the UAV-Video Track of the 2021 Low-Power Computer Vision Challenge (LPCVC).

2 Related Works

2.1 Video Action Detection

Video action detection (a.k.a. spatio-temporal video action localization) requires localizing persons and recognizing their actions in space and time from video sequences.

STEP [72] is a progressive learning framework that consists of spatial refinement and temporal extension; it iteratively refines coarse cuboid proposals toward action locations and incorporates longer-range temporal information to improve action classification. STAGE [56] learns relations between actors and objects by self-attention operations over a spatio-temporal graph representation of the video. LFB [64] introduces a long-term feature bank that stores long-term supportive memory extracted over the entire video. SlowFast [13] has a slow pathway operating at low frame rates to capture spatial semantic information, and a lightweight fast pathway operating at high frame rates to better capture rapid motion. Context-Aware RCNN [65] enlarges the resolution of small actor boxes by cropping and resizing instead of RoI-Pooling, and also extracts scene context information aided by LFB. AIA [54] leverages an interaction aggregation structure and an asynchronous memory update algorithm to efficiently model very long-term interaction dynamics. ACAR-Net [43] learns to reason about high-order relations (a.k.a. actor-context-actor relations) with an actor-context feature bank that preserves spatial contexts. Collaborative memory [71] shares the long-range context among sampled clips in a computationally lightweight and memory-efficient way.

2.2 Multi-Object Tracking


Object Detection:

In the deep learning era, object detectors can be grouped into two genres: “two-stage detectors” and “one-stage detectors”. One-stage detectors classify and localize objects in a single shot using dense sampling. Two-stage detectors have a separate module to generate region proposals. Compared with one-stage detectors, two-stage detectors usually achieve better accuracy but lower speed. The RCNN series (RCNN [17], Fast-RCNN [16], Faster-RCNN [49], R-FCN [10], and Mask-RCNN [18]) are the most representative two-stage detectors. YOLO [45, 46, 47, 2, 14], SSD [40], and RetinaNet [35] are the most representative one-stage detectors. Recently, anchor-free one-stage detectors have gained popularity, including CenterNet [87], FCOS [55], etc. An object detector’s neck, which is composed of several bottom-up and top-down paths, collects feature maps from different stages (scales). The most representative necks include FPN [34], PAN [39], BiFPN [53], and NAS-FPN [15].


Person Re-Identification:

A standard closed-world Re-ID system has three main components: feature representation learning, deep metric learning, and ranking optimization. Feature representation learning includes person-level global features [81], part-level local features [79, 73, 52], and features enhanced by auxiliary information (e.g., semantic attributes, viewpoint, and domain) [51, 4, 74]. Deep metric learning aims to design loss functions that guide the feature representation learning. Verification loss [57, 12, 31], identity loss [81], and triplet loss [22, 5, 75] are three widely used loss functions. Ranking optimization improves the retrieval performance in the testing stage.


Tracking-by-Detection:

As the standard approach in multi-object tracking (MOT), tracking-by-detection [1, 63, 62, 86, 78, 77] has four stages: detection, feature extraction/motion prediction, affinity, and association. Given the raw frames, a typical workflow starts with an object detector that returns bounding boxes, followed by a feature extractor and a motion predictor that give appearance and motion cues, an affinity calculator that computes the confidence of two objects belonging to the same target, and an object associator that assigns IDs to detected objects (if not excluded). MOT has two modes [8]: batch and online. Since exploiting global information often results in better tracking, batch tracking uses future information when assigning IDs to the objects detected in a certain frame. In comparison, online tracking, which often runs at real-time speed, cannot fix past errors using future information.

2.3 Efficient Neural Network


Pruning:

Pruning exploits the redundancy in model weights by removing uncritical weights. It can be done in a one-shot or an iterative way. One-shot methods prune weights at once based on some importance metric and require finetuning to compensate for the accuracy loss. Iterative methods prune weights during optimization following a progressively increasing sparsity rate. From a granularity perspective, ordered by decreasing sparsity rate, pruning can be done element-wise, channel-wise, column-wise, filter-wise, or layer-wise. Channel-wise pruners include Network Slimming [41], AMC [19], and ADMM [76]. Filter-wise pruners include FPGM [20], magnitude-based pruner [29], AGP [88], Network Trimming [23], and Taylor FO [42]. Layer-wise pruners include NetAdapt [70]. AutoCompress [38] supports combined filter and column pruning.
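To make the "progressively increasing sparsity rate" of iterative pruning concrete, the sketch below evaluates the cubic schedule commonly used by gradual pruners such as AGP. The function name, the step counts, and the 0.5 target sparsity are illustrative assumptions, not settings reported in this paper.

```python
def agp_sparsity(step, initial_sparsity, final_sparsity, start_step, end_step):
    """Target sparsity at a training step under a cubic gradual schedule.

    Sparsity rises quickly at first, when many weights are redundant,
    and flattens as it approaches the final target.
    """
    if step <= start_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - start_step) / (end_step - start_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


# Hypothetical run: prune from 0% to 50% sparsity over 100 steps.
schedule = [agp_sparsity(t, 0.0, 0.5, 0, 100) for t in (0, 25, 50, 75, 100)]
print(schedule)
```

Half of the pruning budget is already spent well before the midpoint of the schedule, which leaves the later steps for recovering accuracy at nearly constant sparsity.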


Quantization:

Quantization compresses a model by reducing the number of bits required to represent weights or activations, which reduces computation and inference time. From a granularity perspective, quantization can be done layer-wise, group-wise, channel-wise (the standard method for quantizing convolutional kernels), or sub-channel-wise. The dominant numerical representation in deep neural networks is 32-bit floating point (FP32). Many research works have demonstrated that weights and activations can be represented using 8-bit integers [27, 85] without significant loss in accuracy. Quantization to even lower bit-widths [9, 84] is an active field of research.

Dynamic Neural Networks:

Unlike a static network, whose computation graph and parameters are fixed, a dynamic network can adapt its structure or parameters to the input during inference, and its primary advantage is efficiency. A dynamic network allocates computation on demand by selectively activating components (e.g., channels, layers, or stages) conditioned on the input [26, 33, 50, 69, 6]. For example, it spends less computation on easy samples or on less informative spatial areas/temporal locations of an input. For image [60, 32] or video-related [30, 59] tasks, sample-wise, spatial-wise, or temporal-wise adaptive inference can be achieved by formulating the recognition or detection task as a sequential decision problem and allowing early exiting during inference.

Algorithm and Hardware Codesign:

Algorithm and hardware codesign enables deep neural networks to break through the performance wall when parallelism and data reuse are exhausted. Efficient data operation and model compression are the two widely used approaches. Stochastic data operation (e.g., data mapping, matrix decomposition) increases hardware utilization and accelerates computation on hardware [7, 11]. Compressed architectures (e.g., via quantization and pruning) substantially speed up inference on accelerators with negligible accuracy loss [28, 21]. Ren et al. [48] proposed the first algorithm-hardware co-design framework that combines weight pruning and quantization to reduce the performance overhead due to irregular sparsity. Wang et al. [61] extended the Roofline model to deep learning and incorporated computational complexity and run time into the model, which makes it possible to systematically analyze code performance for deep learning applications.

3 Approach

3.1 System Design

The proposed Energy-Efficient Tracking-based Action Detector (E^2TAD) has three core components: ball-person detection, deep association, and action detection.

Ball-Person Detection:

Figure 1: Proposed YOLO variants. The top figure shows YOLO-MobileV1 and the bottom figure shows YOLO-MobileV2.

Existing datasets are available for detecting pedestrians. However, there is no dedicated ball dataset in the literature. We adopt harmonization-aware image compositing [37] to generate synthetic but realistic balls on pedestrian datasets, e.g., VisDrone [3], the COCO ball-person subset, and PANDA [58]. To improve efficiency, we design lightweight models based on YOLOv3 and YOLOv5. Among the three branches that detect objects at multiple scales (small, medium, and large), only the middle one is retained, to detect medium-scale objects. Our two proposed YOLO variants are dubbed YOLO-MobileV1 and YOLO-MobileV2. Derived from YOLOv5, YOLO-MobileV1 modifies the PANet in YOLOv5 and cuts off one upsample module to directly output the medium-size feature map. As a descendant of YOLOv3, YOLO-MobileV2 replaces the convolution with depth-wise convolution and removes the upsampling modules. The architecture details of YOLO-MobileV1 and YOLO-MobileV2 are shown in Fig. 1.

Deep Association:

As shown in Fig. 2, this module tags the detected persons and balls with their corresponding identity labels.

Figure 2: Proposed deep association that tags the detected persons and balls with their IDs.

First, an identity-aware feature extractor (e.g., ResNet-18 trained on the person ReID task) obtains the feature embeddings of the detected persons/balls, given the bounding boxes together with the raw image. Second, Deep Association maintains the person tracklets and ball tracklets in a gallery. Each tracklet is a queue that stores the ReID features of the latest tracks for the corresponding person/ball. Last, using the ReID features of newly detected persons/balls as queries and defining the cost of matching a query with a specific tracklet in the gallery as the distance between their ReID features, Deep Association formulates identity tagging as a minimum-cost assignment problem that can be solved in polynomial time by the Hungarian algorithm. To reduce association errors, any distance larger than a predefined threshold is marked as infinity in the cost matrix to rule out the possibility of assignment.
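The gated minimum-cost assignment described above can be sketched as follows. For readability, this sketch finds the optimum by brute-force enumeration, which is only practical for a handful of targets; the Hungarian algorithm (available, for example, as `scipy.optimize.linear_sum_assignment`) solves the same problem in polynomial time. The function name, the dummy "unmatched" columns, and the example distances are assumptions made for illustration.

```python
from itertools import permutations

INF = float("inf")


def associate(dist, threshold):
    """Match queries (rows) to tracklets (columns) at minimum total cost.

    dist[q][t] is the ReID feature distance between query q and tracklet t.
    Returns {query_index: tracklet_index}; unmatched queries are omitted.
    """
    n_q = len(dist)
    n_t = len(dist[0]) if n_q else 0
    # Gating: a distance above the threshold becomes infinite, so that
    # pairing can never be part of the chosen assignment.
    gated = [[d if d <= threshold else INF for d in row] for row in dist]
    # Assumption: one dummy column per query, priced at the threshold,
    # lets a query stay unmatched (e.g., a newly appearing person).
    padded = [row + [threshold] * n_q for row in gated]

    best_cost, best_cols = INF, ()
    # Brute-force search over one-to-one assignments (small inputs only).
    for cols in permutations(range(n_t + n_q), n_q):
        total = sum(padded[q][c] for q, c in enumerate(cols))
        if total < best_cost:
            best_cost, best_cols = total, cols
    return {q: c for q, c in enumerate(best_cols) if c < n_t}


# Hypothetical distances between 3 new detections and 3 gallery tracklets.
distances = [
    [0.2, 0.9, 0.8],
    [0.7, 0.3, 0.95],
    [0.9, 0.85, 0.9],
]
matches = associate(distances, threshold=0.5)
print(matches)  # {0: 0, 1: 1}; query 2 is too far from every tracklet
```

In this example the third detection exceeds the threshold for every tracklet, so it stays unmatched rather than being forced onto the remaining identity.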

Video Action Detection:

Free of learnable parameters, this module (Fig. 3) depends only on bounding box (bbox) trajectories to spatio-temporally localize key actions (e.g., catch and throw), and is therefore extremely energy-efficient. A ball has two states: collision or non-collision. A collision happens when the center of a ball’s bbox falls inside a person’s bbox. For each ball, the method records its collision history with the persons as a gated (binary) signal spanning all frames of the video. The start and the end of any connected part in the gated signal signify a catch action and a throw action, respectively.
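A minimal sketch of the collision test and of reading catch/throw events off the gated signal is given below. The `(x1, y1, x2, y2)` box format and the function names are assumptions for illustration.

```python
def collides(ball_box, person_box):
    """True when the center of the ball box lies inside the person box."""
    cx = (ball_box[0] + ball_box[2]) / 2
    cy = (ball_box[1] + ball_box[3]) / 2
    x1, y1, x2, y2 = person_box
    return x1 <= cx <= x2 and y1 <= cy <= y2


def catch_throw_events(signal):
    """Turn a per-frame 0/1 collision signal into (action, frame) events.

    The first frame of each connected run of 1s is a catch and the last
    frame of the run is a throw. A run still open at the end of the
    video yields no throw.
    """
    events, prev = [], 0
    for frame, state in enumerate(signal):
        if state and not prev:
            events.append(("catch", frame))
        if prev and not state:
            events.append(("throw", frame - 1))
        prev = state
    return events


signal = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
events = catch_throw_events(signal)
print(events)  # [('catch', 2), ('throw', 4), ('catch', 7), ('throw', 8)]
```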

Figure 3: Proposed heuristic approach for video action detection. The dilation/erosion parameter is the prior of minimum flying time for any ball.

3.2 Efficiency


Pruning:

Pruning can be done in an unstructured or a structured way. Unstructured pruning removes individual weights at the kernel level, while structured pruning removes groups of weight connections such as channels or layers. Unstructured pruning leads to higher sparsity but requires specific hardware support for acceleration, and is thus not applicable to the Raspberry Pi.
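As an illustration of structured pruning, the sketch below ranks the filters of one layer by L1 norm and removes the weakest ones as whole units, so the remaining layer stays dense and needs no special hardware. This is a generic magnitude-based criterion shown under assumed names and toy weights; it is not necessarily the exact pruner configuration used in our experiments.

```python
def prune_filters(filters, sparsity):
    """Drop the lowest-L1-norm filters of a layer.

    filters: list of flattened filters (each a list of weights).
    sparsity: fraction of filters to remove, in [0, 1).
    Returns the kept indices (in original order) and the kept filters.
    """
    n_prune = int(len(filters) * sparsity)
    norms = [sum(abs(w) for w in f) for f in filters]
    # Rank filter indices from largest to smallest L1 norm.
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[: len(filters) - n_prune])
    return keep, [filters[i] for i in keep]


# Toy layer with four flattened filters.
layer = [[0.1, -0.2], [1.0, 2.0], [0.0, 0.05], [-0.5, 0.5]]
kept_idx, kept = prune_filters(layer, sparsity=0.5)
print(kept_idx)  # [1, 3]
```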


Quantization:

There are two types of quantization: post-training quantization (PTQ) and quantization-aware training (QAT). In PTQ, a pre-trained model is quantized based on the clipping ranges and the scaling factors computed from calibration data. In QAT, a pre-trained model is quantized and then finetuned using training data to recover the accuracy loss. Unlike PTQ, QAT thus supports concurrent training and quantization. Consequently, this simulation of quantization errors during training gives minimal accuracy loss when the representation bitwidth is converted from FP32 to INT8.
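The following sketch shows the arithmetic behind both variants, assuming a simple per-tensor affine (asymmetric) INT8 scheme: `calibrate` derives the scaling factor and zero point from observed values as in PTQ, and `fake_quantize` applies the quantize-dequantize round trip that QAT inserts into the forward pass to simulate quantization error. All names and the calibration values are illustrative assumptions.

```python
def calibrate(values, num_bits=8):
    """Derive (scale, zero_point, qmin, qmax) from calibration values."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # The clipping range always includes 0 so that zero is representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point, qmin, qmax


def quantize(x, scale, zero_point, qmin, qmax):
    """Map a float to an integer code, clipped to the INT range."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an integer code back to its float approximation."""
    return (q - zero_point) * scale


def fake_quantize(x, scale, zero_point, qmin, qmax):
    """Quantize then dequantize: the error a QAT forward pass simulates."""
    return dequantize(quantize(x, scale, zero_point, qmin, qmax), scale, zero_point)


calibration = [-1.0, 0.0, 0.5, 2.0]
params = calibrate(calibration)
errors = [abs(fake_quantize(v, *params) - v) for v in calibration]
print(params[0], max(errors))
```

For values inside the calibrated range, the round-trip error stays within about half a quantization step, which is the error the finetuning in QAT learns to tolerate.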

Adaptive Inference:

Considering the temporal coherency of videos and the infrequent occurrence of key actions, we propose two adaptive inference strategies, shown in Fig. 4: Activity Region Cropping (ARC) and Collision Inspection (CI). Because videos are temporally coherent, the locations of persons and balls do not change abruptly between frames. ARC therefore predicts the next frame’s activity region from the coordinates of the current frame’s bboxes and crops out the non-activity region after adding some residual space around the boxes. ARC also improves the signal-to-noise ratio (SNR) of the input image for detection. Since ball-person collision is a necessary condition for throwing or catching a ball, CI skips the succeeding Deep Association and Action Detection modules if no ball-person collision is detected.
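Both strategies can be sketched in a few lines. The exact ARC formula is not reproduced here; this sketch assumes the activity region is the union of the current boxes, enlarged by a horizontal and a vertical margin and clipped to the frame. The function names, the `(x1, y1, x2, y2)` box format, and the margin values are assumptions.

```python
def activity_region(boxes, frame_w, frame_h, margin_x, margin_y):
    """Predict the next frame's crop window from the current boxes.

    The margins are the residual space added around the union of all
    boxes; their values here are assumed, not taken from the paper.
    """
    if not boxes:
        # Nothing detected: fall back to the full frame.
        return (0, 0, frame_w, frame_h)
    x1 = max(0, min(b[0] for b in boxes) - margin_x)
    y1 = max(0, min(b[1] for b in boxes) - margin_y)
    x2 = min(frame_w, max(b[2] for b in boxes) + margin_x)
    y2 = min(frame_h, max(b[3] for b in boxes) + margin_y)
    return (x1, y1, x2, y2)


def any_collision(ball_boxes, person_boxes):
    """Collision Inspection: is any ball center inside any person box?"""
    for bx1, by1, bx2, by2 in ball_boxes:
        cx, cy = (bx1 + bx2) / 2, (by1 + by2) / 2
        for px1, py1, px2, py2 in person_boxes:
            if px1 <= cx <= px2 and py1 <= cy <= py2:
                return True
    return False


persons = [(100, 120, 160, 260)]
balls = [(300, 200, 330, 230)]
region = activity_region(persons + balls, 640, 480, margin_x=40, margin_y=30)
run_downstream = any_collision(balls, persons)
print(region, run_downstream)  # (60, 90, 370, 290) False
```

Here the detector would only see the cropped window in the next frame, and because no ball center lies inside a person box, association and action detection would be skipped for the current frame.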

Figure 4: Proposed two data-dependent adaptive inference strategies: Activity Region Cropper and Collision Inspector.

Cache-friendly Pipeline:

Figure 5: Proposed cache-friendly pipeline that trades memory for less data movement. For a given video, we store the intermediate decoded frames, person/ball crops, and person/ball boxes in the “Image Queue”, “Crop Records”, and “Box Records”, respectively. Each crop record also stores the corresponding box coordinates. Note that each box entry in “Box Records” is ordered by its corresponding person/ball ID.

There are three computation bottlenecks in E^2TAD (in descending order according to profiling): video decoding, ReID feature extraction, and ball-person detection. The computation pipeline of E^2TAD can be accelerated by improving spatial locality and temporal locality for a higher cache hit rate. As shown in Fig. 5, we store the intermediate results in an image queue (a fixed-size buffer of decoded frames), crop records (person/ball crops over all frames), and bbox records (boxes of all persons and balls), to reduce the Memory Access Cost (MAC) for both model weights and data tensors.




To improve the temporal locality of ReID feature extraction, crop records and box records are used, since the layer-wise computation fits in the L1 cache (32KB) and the weights after pruning and quantization fit in the L2 cache (512KB). To improve the spatial and temporal locality of video decoding, an image queue is used so that temporally consecutive frames can be decoded without interruption (video codecs exploit temporal redundancy for a high compression rate, so the encoded signals of consecutive frames are closely packed together in storage). As a result, the E^2TAD pipeline turns from memory-access-intensive to computation-intensive. Since the computation cost is fixed for a given video, the latency is significantly reduced by the lower memory access.
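The staging idea can be sketched as below: frames are decoded consecutively into a bounded queue, the detector then runs back-to-back over the buffered frames, and ReID runs in one pass over all recorded crops, so each model's weights stay hot in cache while it is in use. The function names, the callback interface, and the stub models are hypothetical; this is a schematic of the scheduling order, not our released implementation.

```python
from collections import deque


def run_pipeline(frames, detect, embed, queue_size):
    """Stage-wise processing that trades memory for less data movement.

    frames: iterable of decoded frames.
    detect(frame) -> list of (box, crop) pairs (hypothetical interface).
    embed(crop) -> ReID feature (hypothetical interface).
    """
    image_queue = deque()
    crop_records, box_records = [], []

    def flush():
        # Run the detector consecutively over all buffered frames.
        while image_queue:
            idx, frame = image_queue.popleft()
            for box, crop in detect(frame):
                box_records.append((idx, box))
                crop_records.append((idx, crop))

    # Decode consecutive frames without interleaving other work.
    for idx, frame in enumerate(frames):
        image_queue.append((idx, frame))
        if len(image_queue) == queue_size:
            flush()
    flush()

    # Extract ReID features for all crops in a single pass.
    features = [(idx, embed(crop)) for idx, crop in crop_records]
    return box_records, features


# Stub models that only log the order in which they are called.
call_log = []


def stub_detect(frame):
    call_log.append("detect")
    return [((frame, frame, frame + 1, frame + 1), frame * 10)]


def stub_embed(crop):
    call_log.append("embed")
    return crop + 1


boxes, feats = run_pipeline(range(5), stub_detect, stub_embed, queue_size=2)
print(call_log)
```

With the stubs above, the log shows all detector calls grouped before all ReID calls, instead of the two models alternating on every frame.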

3.3 Robustness


Ball Detection:

Although shape and texture are two critical and complementary cues for object recognition, CNN-learned features are biased towards either shape or texture, depending on the training dataset. For example, the proposed YOLO-MobileV1 and YOLO-MobileV2 frequently mistake homogeneous color patches for balls. To alleviate the bias towards texture, we propose a texture-shape debiased training strategy that composites homogeneous color patches onto the training images as hard negative samples. Moreover, we propose a geometry-aware circular anchor that takes the circular shape of balls into account to detect them better.

Figure 6: Proposed adversarial training in minimax game to learn robust ReID feature. After training, the identity classifier and the view classifier are discarded.

ReID Feature:

The two major challenges to robust ReID feature learning are varying camera views and occlusion. We therefore propose the following two solutions. (1) Occlusion-Aware Data Augmentation: Ball-person occlusion is mimicked by a circular binary mask overlaid on the person images. We also use pseudo masks predicted by a pretrained person segmentation model to extract a person foreground and overlay it on the target person image to simulate person-person occlusion. Both strategies are used as data augmentation to train the ReID model. (2) Domain-Invariant Adversarial Feature Learning: Given the training images, the goal is to learn a ReID feature embedding that is discriminative across identities while invariant to camera views. This goal can be expressed as a minimax game [68, 66, 67, 82].

In this game, a feature extractor produces the ReID feature embedding from the input images. An identity classifier and a view classifier predict identity labels and view labels from the embedding, respectively. The two losses are a triplet loss and a cross-entropy loss, and a coefficient balances them. More details are shown in Fig. 6.
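One plausible way to write this minimax game is sketched below. The notation is an assumption introduced here for illustration: $X$ denotes the training images, $f_E$ the feature extractor, $f_I$ and $f_V$ the identity and view classifiers, $\mathcal{L}_{T}$ and $\mathcal{L}_{CE}$ the triplet and cross-entropy losses, $Y_I$ and $Y_V$ the identity and view labels, and $\lambda$ the balancing coefficient. The pairing of each loss with each branch is likewise assumed.

```latex
\min_{f_E,\, f_I} \; \max_{f_V} \;
  \mathcal{L}_{T}\big(f_I(f_E(X)),\, Y_I\big)
  \;-\; \lambda \, \mathcal{L}_{CE}\big(f_V(f_E(X)),\, Y_V\big)
```

Under this form, the view classifier $f_V$ minimizes its own cross-entropy loss, while the feature extractor $f_E$ is rewarded when camera views become hard to predict, which pushes the embedding towards view invariance.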


Action Detection:

As the downstream task, action detection is vulnerable to errors propagated from the upstream tasks (e.g., detection and association). Among detection errors, false negatives may be caused by occlusion, and false positives may result from patches with homogeneous colors. Association errors arise from non-discriminative ReID features. Inspired by morphology-based denoising [44], we propose the heuristic approach in Fig. 3 to eliminate these upstream errors. The dilation/erosion parameter encodes a prior: the minimum flying time of any ball. Given the collision history with different persons as a gated signal for each ball, we first dilate the signal by this parameter. Then, we vote within each connected part and select the majority as the person holding the ball, so that the preceding detection and association errors are eliminated. Last, the dilated gated signal is eroded by the same parameter.
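The dilate-vote-erode procedure can be sketched on a one-dimensional collision history as follows. The input encoding (one person ID or `None` per frame), the function names, and the radius of one frame are assumptions for illustration; in practice the radius would be derived from the minimum flying time.

```python
from collections import Counter


def dilate(sig, k):
    """1-D binary dilation with a window of radius k."""
    n = len(sig)
    return [int(any(sig[max(0, t - k):min(n, t + k + 1)])) for t in range(n)]


def erode(sig, k):
    """1-D binary erosion with a window of radius k."""
    n = len(sig)
    return [int(all(sig[max(0, t - k):min(n, t + k + 1)])) for t in range(n)]


def runs(sig):
    """Return (start, end) frame indices of each connected run of 1s."""
    out, start = [], None
    for t, s in enumerate(list(sig) + [0]):
        if s and start is None:
            start = t
        elif not s and start is not None:
            out.append((start, t - 1))
            start = None
    return out


def denoise(holder, k):
    """Clean a ball's collision history.

    holder[t] is the ID of the person colliding with the ball at frame t,
    or None. Gaps shorter than the flying-time prior are closed and each
    connected part is relabeled with its majority person ID.
    """
    # Pad with k empty frames so the borders behave like the interior.
    padded = [None] * k + list(holder) + [None] * k
    dilated = dilate([int(h is not None) for h in padded], k)
    voted = [None] * len(padded)
    for a, b in runs(dilated):
        ids = [h for h in padded[a:b + 1] if h is not None]
        winner = Counter(ids).most_common(1)[0][0]
        for t in range(a, b + 1):
            voted[t] = winner
    closed = erode(dilated, k)
    cleaned = [voted[t] if closed[t] else None for t in range(len(padded))]
    return cleaned[k:len(padded) - k]


# Frame 3 is a missed detection and frame 5 a wrong association (ID 7).
history = [None, 3, 3, None, 3, 7, 3, None, None, None, None, 5, 5, None]
cleaned = denoise(history, k=1)
print(cleaned)
```

In this toy history, the one-frame gap is filled and the stray ID 7 is outvoted by ID 3, while the four-frame gap (longer than the assumed flying-time prior) is kept as a genuine throw followed by a catch by person 5.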

4 Experiments

4.1 Detection

Datasets and Evaluation Protocols:

We evaluate the proposed ball-person detector on the sample videos provided by the LPCVC21 UAV-Video Track, dubbed LPCVC21. Among the five sample videos, the two most challenging ones, 7p3b and 5p5b, are reserved for testing and dubbed LPCVC21-test; the remaining three videos, 5p4b, 5p2b, and 4p1b, are used for training and dubbed LPCVC21-train. The bounding boxes for persons are refined from the pseudo labels produced by a pretrained person detector, and the bounding boxes for balls are manually labeled at a fixed frame interval. To increase the training set size, we mix PANDA [58], the VisDrone pedestrian subset [3], and the COCO ball-person subset [36] with LPCVC21-train to obtain a large-scale mixed dataset for ball-person detection, dubbed MixDet-train. MixDet-test is the testing set of LPCVC21, namely LPCVC21-test.

We adopt the COCO evaluation protocol’s mAP@0.5 and mAP@0.5:0.95 to evaluate detection performance.

Ablation Studies:

1. Harmonization-Aware Image Composition. Due to the lack of training data for the ball class, we propose a harmonization-aware image composition method to embed synthetic balls into images randomly selected from VisDrone and PANDA. Fig. 7 shows the two steps of harmonization-aware image composition. First, synthetic ball images are “copy-pasted” onto the background image, guided by a generated, randomly placed mask. Second, the composited images are harmonized by RainNet [37], which takes the lighting and background into account.

Figure 7: Overview of Harmonization-Aware Image Composition. Balls are composited into background images guided by randomly-placed masks. The composited images are harmonized by RainNet with region-aware instance normalization.

2. Shape-Texture Debiased Training. To alleviate the ball detector’s bias towards texture, we add random-shaped patches with homogeneous texture, which serve as negative samples for ball detection. Debiased training images are shown in Fig. 8.

Figure 8: Shape-texture debiased training samples. Random patches with homogeneous color are marked by red rectangles.

3. Geometry-Aware Anchor. The traditional rectangular bounding box (R-box) is not suitable for ball detection, especially for heavily occluded balls. The background information included in an R-box introduces unexpected noise for ReID feature extraction. Considering the geometric difference between persons and balls, we use the R-box to detect persons and propose a circular bounding box (C-box) to detect balls. Fig. 9 shows samples of different anchors.

We conduct ablation studies of the three proposed methods in Tab. 1. Image composition improves detection performance, and debiased training and the circular anchor improve it further. With all three strategies enabled, recall rises from 0.865 to 0.890 while precision stays at about 0.997, which shows that the proposed methods effectively reduce false negatives (FN) for ball detection without introducing additional false positives (FP).

Figure 9: Samples of geometry-aware anchor for “ball” class. The left shows traditional anchors and the right shows geometry-aware anchor.
ImgComp  DebiasTr  CircAnchor  Precision  Recall  mAP@0.5  mAP@0.5:0.95
-        -         -           0.998      0.865   0.932    0.647
✓        -         -           0.996      0.872   0.936    0.656
✓        ✓         -           0.995      0.877   0.938    0.658
✓        ✓         ✓           0.997      0.890   0.947    0.661
Table 1: Ablation studies of proposed strategies to improve ball detection using YOLOv5-S.

Results and Analyses:

We train YOLOv5-S as the baseline and the proposed YOLO variants on MixDet-train, and test on MixDet-test. Detection results are shown in Tab. 2. Compared to YOLOv5-S (640), YOLO-MobileV1 (416) needs roughly 14× fewer FLOPs (1.2 G vs. 16.4 G) at the cost of about 24% relative mAP loss (0.485 vs. 0.634). Furthermore, YOLO-MobileV1 (416) consumes roughly 9× less energy (0.142 kJ vs. 1.273 kJ). We also propose three methods (Tab. 1) to advance detection performance.

Model          Input  FLOPs(G)  Params(M)  FPS  mAP    Energy(kJ)
YOLOv5-S       640    16.4      7.07       0.3  0.634  1.273
YOLOv5-S       416    6.9       7.07       0.7  0.575  0.579
YOLO-MobileV1  640    2.7       1.08       5    0.535  0.242
YOLO-MobileV1  416    1.2       1.08       10   0.485  0.142
YOLO-MobileV2  416    0.8       0.241      14   0.356  0.146
Table 2: Comparison of Different Detection Models.

4.2 Re-Identification

Datasets and Evaluation Protocols:

We train and evaluate the ReID feature extractor on a mixed dataset from three person ReID datasets: CUHK03 [31], Market1501 [80], and DukeMTMC [83], dubbed as MixReID. MixReID-train includes the training set of CUHK03/Market1501/DukeMTMC and the testing set of CUHK03/DukeMTMC. MixReID-test is the testing set of Market1501.

For each query, we calculate the area under the Precision-Recall curve, known as average precision (AP). Then, the mean value of APs of all queries, i.e., mAP, is calculated.
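The metric can be computed as in the sketch below, which uses the common step-wise approximation of the area under the Precision-Recall curve (precision averaged over the ranks where a correct match appears). The function names and the toy ranking lists are assumptions; benchmark toolkits may use a slightly different interpolation.

```python
def average_precision(ranked_matches):
    """AP for one query.

    ranked_matches: booleans for the gallery sorted by ascending ReID
    distance; True marks an item with the same identity as the query.
    """
    hits, precisions = 0, []
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(all_ranked_matches):
    """mAP: the mean of the per-query APs."""
    aps = [average_precision(r) for r in all_ranked_matches]
    return sum(aps) / len(aps) if aps else 0.0


# Two hypothetical queries with their ranked gallery results.
queries = [
    [True, False, True, False],  # AP = (1/1 + 2/3) / 2
    [False, True],               # AP = 1/2
]
print(round(mean_average_precision(queries), 4))  # 0.6667
```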

SparsityPruner Baseline FPGM AGP
0 0.6550 - - - -
0.1 - 0.6468 0.6389 0.6457 0.6542
0.2 - 0.6451 0.6321 0.6405 0.6541
0.3 - 0.6412 0.6257 0.6378 0.6539
0.5 - 0.6385 0.6138 0.6316 0.6538
0.6 - 0.5517 0.5523 0.5730 0.5823
0.7 - 0.4781 0.4699 0.4926 0.5145
0.8 - 0.3948 0.3896 0.4126 0.4367
0.9 - 0.3395 0.3287 0.3578 0.3601
Table 3: Accuracy of different one-shot and iterative pruners under different sparsity rates.
Metric / Sparsity  0       0.25    0.5     0.75    0.875
FLOPS              11.84   9.80    3.15    0.90    0.27
Params             11.16   7.94    2.79    2.00    0.17
mAP                0.6550  0.6540  0.6538  0.5830  0.4390
Table 4: Tradeoff between accuracy and efficiency for AGP iterative pruner under different sparsity rates.

Results and Analyses:

In Tab. 3 and Tab. 4, the ReID model is jointly scaled in its input resolution, depth, and width towards a desired efficiency-accuracy tradeoff. When scaling, structured pruning is applied in an iterative (progressive) and adaptive (layer-wise) way. The FPGM [20] and AGP [88] pruners achieve a better efficiency-accuracy tradeoff than the remaining filter pruners. Moreover, the input resolution of both the person crops and the ball crops is reduced by downsampling, and the ReID model is finetuned on the resized images. Lastly, reducing the bitwidth from FP32 to INT8 by QAT gives faster computation and lower memory usage for ReID, i.e., an inference speed-up and a model size reduction with minor accuracy loss.

4.3 Video Action Detection

Datasets and Evaluation Protocols:

We evaluate the video action detection performance on the two reserved videos for object detection testing, namely 7p3b and 5p5b in LPCVC21-test.

The accuracy metric measures how well the action detector spatio-temporally predicts the ball-catching action. A True Positive (TP) predicts the person and ball IDs correctly and the frame index within a tolerance threshold of 10 frames. False positives and false negatives may arise from object detection and association errors.
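A minimal sketch of this matching rule is shown below. It assumes each event is a `(frame, person_id, ball_id)` tuple, interprets "within 10 frames" as an absolute difference of at most 10, and matches predictions to ground truth one-to-one in a greedy manner; the official challenge scorer may differ in these details.

```python
def match_events(predictions, ground_truth, tolerance=10):
    """Count (TP, FP, FN) for predicted catch events.

    A prediction is a true positive when an unmatched ground-truth event
    has the same person and ball IDs and a frame index within tolerance.
    """
    unmatched = list(ground_truth)
    tp = 0
    for frame, person, ball in sorted(predictions):
        best = None
        for gt in unmatched:
            same_ids = gt[1] == person and gt[2] == ball
            if same_ids and abs(gt[0] - frame) <= tolerance:
                # Prefer the temporally closest ground-truth event.
                if best is None or abs(gt[0] - frame) < abs(best[0] - frame):
                    best = gt
        if best is not None:
            unmatched.remove(best)
            tp += 1
    return tp, len(predictions) - tp, len(unmatched)


truth = [(100, 1, 2), (250, 3, 2)]
preds = [(104, 1, 2), (265, 3, 2), (400, 1, 1)]
print(match_events(preds, truth))  # (1, 2, 1)
```

In the example, the second prediction has the right IDs but is 15 frames off, so it counts as a false positive and leaves its ground-truth event as a false negative.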

ReID  Det  Action  AdpInf  Pipeline  Accuracy  Energy  Score
-     -    -       -       -         0.657     0.261   2.517
✓     -    -       -       -         0.661     0.215   3.074
✓     ✓    -       -       -         0.734     0.135   5.437
✓     ✓    ✓       -       -         0.771     0.135   5.711
✓     ✓    ✓       ✓       -         0.794     0.112   7.089
✓     ✓    ✓       ✓       ✓         0.794     0.091   8.725
Table 5: Ablation studies of proposed strategies to improve robustness or efficiency. Score is computed as the ratio of accuracy and energy, i.e., Accuracy/Energy.
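As a quick check of the score definition, the snippet below recomputes Accuracy/Energy from the accuracy and energy values listed in Tab. 5. The function name is an assumption; the numbers are copied from the table.

```python
def score(accuracy, energy):
    """Challenge-style score: accuracy divided by energy consumption."""
    return accuracy / energy


# (accuracy, energy) pairs from the rows of Tab. 5, top to bottom.
rows = [
    (0.657, 0.261),
    (0.661, 0.215),
    (0.734, 0.135),
    (0.771, 0.135),
    (0.794, 0.112),
    (0.794, 0.091),
]
scores = [round(score(a, e), 3) for a, e in rows]
print(scores)  # [2.517, 3.074, 5.437, 5.711, 7.089, 8.725]
```

The recomputed values agree with the Score column, and they make explicit that the last row gains purely from lower energy at unchanged accuracy.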

Results and Analyses:

ReID means robust ReID feature extractor improved by the proposed occlusion-aware data augmentation and domain-invariant adversarial feature learning. Det means robust ball detector improved by the proposed texture-shape debiased training. Action means robust action detector improved by the proposed morphology-based denoising to eliminate preceding association and detection errors. AdpInf means the two adaptive inference strategies implemented with an Activity Region Cropper and a Collision Inspector. Pipeline means the cache-friendly pipeline that trades memory for less data movement cost. Tab. 5 presents the ablation study results, from which we can draw a conclusion that incrementally adding robust ReID features, robust ball detections, robust action detections, adaptive inference, and the cache-friendly pipeline could consistently improve the accuracy-efficiency tradeoff.

5 Conclusion

This paper is a tech report for our submitted solution to the UAV-Video Track of LPCVC-2021. Our proposed Energy-Efficient Tracking-based Action Detector (E^2TAD) is a tracking-based vision system that can spatio-temporally localize key actions from videos. E^2TAD has three core components: ball-person detection, deep association, and action detection. First, we propose a harmonization-aware image composition module to generate synthetic but realistic balls. Second, we present two adaptive inference strategies, a cache-friendly pipeline, and pruning and quantization strategies to address the energy-efficiency concern. Third, we put forward a shape-texture debiased training and a domain-invariant adversarial training to improve the robustness of ball detection and ReID feature extraction. Last, inspired by morphology-based denoising, we develop a learning-free action detector that depends only on bounding box trajectories to spatio-temporally localize key actions.


  • [1] Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime tracking. In 2016 IEEE international conference on image processing (ICIP), pages 3464–3468. IEEE, 2016.
  • [2] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.
  • [3] Yaru Cao, Zhijian He, Lujia Wang, Wenguan Wang, Yixuan Yuan, Dingwen Zhang, Jinglin Zhang, Pengfei Zhu, Luc Van Gool, Junwei Han, et al. Visdrone-det2021: The vision meets drone object detection challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2847–2854, 2021.
  • [4] Tianlong Chen, Shaojin Ding, Jingyi Xie, Ye Yuan, Wuyang Chen, Yang Yang, Zhou Ren, and Zhangyang Wang. Abd-net: Attentive but diverse person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8351–8361, 2019.
  • [5] Weihua Chen, Xiaotang Chen, Jianguo Zhang, and Kaiqi Huang. Beyond triplet loss: a deep quadruplet network for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 403–412, 2017.
  • [6] Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong Chen, Lu Yuan, and Zicheng Liu. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11030–11039, 2020.
  • [7] Yu-Hsin Chen, Tien-Ju Yang, Joel Emer, and Vivienne Sze. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 9(2):292–308, 2019.
  • [8] Gioele Ciaparrone, Francisco Luque Sánchez, Siham Tabik, Luigi Troiano, Roberto Tagliaferri, and Francisco Herrera. Deep learning in video multi-object tracking: A survey. Neurocomputing, 381:61–88, 2020.
  • [9] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to+ 1 or-1. arXiv preprint arXiv:1602.02830, 2016.
  • [10] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully convolutional networks. Advances in neural information processing systems, 29, 2016.
  • [11] Chunhua Deng, Fangxuan Sun, Xuehai Qian, Jun Lin, Zhongfeng Wang, and Bo Yuan. Tie: Energy-efficient tensor train-based inference engine for deep neural network. In Proceedings of the 46th International Symposium on Computer Architecture, pages 264–278, 2019.
  • [12] Weijian Deng, Liang Zheng, Qixiang Ye, Guoliang Kang, Yi Yang, and Jianbin Jiao. Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 994–1003, 2018.
  • [13] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In ICCV, 2019.
  • [14] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021.
  • [15] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7036–7045, 2019.
  • [16] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
  • [17] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
  • [18] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
  • [19] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. Amc: Automl for model compression and acceleration on mobile devices. In Proceedings of the European conference on computer vision (ECCV), pages 784–800, 2018.
  • [20] Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang.

    Filter pruning via geometric median for deep convolutional neural networks acceleration.

    In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4340–4349, 2019.
  • [21] Kartik Hegde, Jiyong Yu, Rohit Agrawal, Mengjia Yan, Michael Pellauer, and Christopher Fletcher. Ucnn: Exploiting computational reuse in deep neural networks via weight repetition. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pages 674–687. IEEE, 2018.
  • [22] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
  • [23] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
  • [24] Xiao Hu, Ming-Ching Chang, Yuwei Chen, Rahul Sridhar, Zhenyu Hu, Yunhe Xue, Zhenyu Wu, Pengcheng Pi, Jiayi Shen, Jianchao Tan, et al. The 2020 low-power computer vision challenge. In

    2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS)

    , pages 1–4. IEEE, 2021.
  • [25] Zhenyu Hu, Pengcheng Pi, Zhenyu Wu, Yunhe Xue, Jiayi Shen, Jianchao Tan, Xiangru Lian, Zhangyang Wang, and Ji Liu. E2vts: Energy-efficient video text spotting from unmanned aerial vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 905–913, 2021.
  • [26] Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens Van Der Maaten, and Kilian Q Weinberger. Multi-scale dense networks for resource efficient image classification. arXiv preprint arXiv:1703.09844, 2017.
  • [27] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2704–2713, 2018.
  • [28] Jinmook Lee, Dongjoo Shin, and Hoi-Jun Yoo.

    A 21mw low-power recurrent neural network accelerator with quantization tables for embedded deep learning applications.

    In 2017 IEEE Asian Solid-State Circuits Conference (A-SSCC), pages 237–240. IEEE, 2017.
  • [29] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
  • [30] Hengduo Li, Zuxuan Wu, Abhinav Shrivastava, and Larry S Davis. 2d or not 2d? adaptive 3d convolution selection for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6155–6164, 2021.
  • [31] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. Deepreid: Deep filter pairing neural network for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 152–159, 2014.
  • [32] Zhichao Li, Yi Yang, Xiao Liu, Feng Zhou, Shilei Wen, and Wei Xu. Dynamic computational time for visual attention. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 1199–1209, 2017.
  • [33] Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. Runtime neural pruning. Advances in neural information processing systems, 30, 2017.
  • [34] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
  • [35] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
  • [36] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [37] Jun Ling, Han Xue, Li Song, Rong Xie, and Xiao Gu. Region-aware adaptive instance normalization for image harmonization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9361–9370, 2021.
  • [38] Ning Liu, Xiaolong Ma, Zhiyuan Xu, Yanzhi Wang, Jian Tang, and Jieping Ye. Autocompress: An automatic dnn structured pruning framework for ultra-high compression rates. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4876–4883, 2020.
  • [39] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8759–8768, 2018.
  • [40] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
  • [41] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE international conference on computer vision, pages 2736–2744, 2017.
  • [42] Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance estimation for neural network pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11264–11272, 2019.
  • [43] Junting Pan, Siyu Chen, Zheng Shou, Jing Shao, and Hongsheng Li. Actor-context-actor relation network for spatio-temporal action localization. arXiv, 2020.
  • [44] Richard Alan Peters. A new algorithm for image noise reduction using mathematical morphology. IEEE transactions on Image Processing, 4(5):554–568, 1995.
  • [45] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
  • [46] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017.
  • [47] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
  • [48] Ao Ren, Tianyun Zhang, Shaokai Ye, Jiayu Li, Wenyao Xu, Xuehai Qian, Xue Lin, and Yanzhi Wang. Admm-nn: An algorithm-hardware co-design framework of dnns using alternating direction methods of multipliers. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 925–938, 2019.
  • [49] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. TPAMI, 2016.
  • [50] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
  • [51] Chi Su, Shiliang Zhang, Junliang Xing, Wen Gao, and Qi Tian. Deep attributes driven multi-camera person re-identification. In European conference on computer vision, pages 475–491. Springer, 2016.
  • [52] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European conference on computer vision (ECCV), pages 480–496, 2018.
  • [53] Mingxing Tan, Ruoming Pang, and Quoc V Le. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10781–10790, 2020.
  • [54] Jiajun Tang, Jin Xia, Xinzhi Mu, Bo Pang, and Cewu Lu. Asynchronous interaction aggregation for action detection. arXiv, 2020.
  • [55] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9627–9636, 2019.
  • [56] Matteo Tomei, Lorenzo Baraldi, Simone Calderara, Simone Bronzin, and Rita Cucchiara. Stage: Spatio-temporal attention on graph entities for video action detection. arXiv, 2019.
  • [57] Rahul Rama Varior, Bing Shuai, Jiwen Lu, Dong Xu, and Gang Wang.

    A siamese long short-term memory architecture for human re-identification.

    In European conference on computer vision, pages 135–153. Springer, 2016.
  • [58] Xueyang Wang, Xiya Zhang, Yinheng Zhu, Yuchen Guo, Xiaoyun Yuan, Liuyu Xiang, Zerun Wang, Guiguang Ding, David Brady, Qionghai Dai, et al. Panda: A gigapixel-level human-centric video dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3268–3278, 2020.
  • [59] Yulin Wang, Zhaoxi Chen, Haojun Jiang, Shiji Song, Yizeng Han, and Gao Huang. Adaptive focus for efficient video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16249–16258, 2021.
  • [60] Yulin Wang, Kangchen Lv, Rui Huang, Shiji Song, Le Yang, and Gao Huang. Glance and focus: a dynamic approach to reducing spatial redundancy in image classification. Advances in Neural Information Processing Systems, 33:2432–2444, 2020.
  • [61] Yunsong Wang, Charlene Yang, Steven Farrell, Yan Zhang, Thorsten Kurth, and Samuel Williams. Time-based roofline for deep learning performance analysis. In 2020 IEEE/ACM Fourth Workshop on Deep Learning on Supercomputers (DLS), pages 10–19. IEEE, 2020.
  • [62] Zhongdao Wang, Liang Zheng, Yixuan Liu, Yali Li, and Shengjin Wang. Towards real-time multi-object tracking. In European Conference on Computer Vision, pages 107–122. Springer, 2020.
  • [63] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In 2017 IEEE international conference on image processing (ICIP), pages 3645–3649. IEEE, 2017.
  • [64] Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krahenbuhl, and Ross Girshick. Long-term feature banks for detailed video understanding. In CVPR, 2019.
  • [65] Jianchao Wu, Zhanghui Kuang, Limin Wang, Wayne Zhang, and Gangshan Wu. Context-aware rcnn: A baseline for action detection in videos. arXiv, 2020.
  • [66] Zhenyu Wu, Karthik Suresh, Priya Narayanan, Hongyu Xu, Heesung Kwon, and Zhangyang Wang. Delving into robust object detection from unmanned aerial vehicles: A deep nuisance disentanglement approach. In ICCV, pages 1201–1210, 2019.
  • [67] Zhenyu Wu, Haotao Wang, Zhaowen Wang, Hailin Jin, and Zhangyang Wang. Privacy-preserving deep action recognition: An adversarial learning framework and a new dataset. TPAMI, 2020.
  • [68] Zhenyu Wu, Zhangyang Wang, Zhaowen Wang, and Hailin Jin. Towards privacy-preserving visual recognition via adversarial training: A pilot study. In ECCV, pages 606–624, 2018.
  • [69] Brandon Yang, Gabriel Bender, Quoc V Le, and Jiquan Ngiam. Condconv: Conditionally parameterized convolutions for efficient inference. Advances in Neural Information Processing Systems, 32, 2019.
  • [70] Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze, and Hartwig Adam. Netadapt: Platform-aware neural network adaptation for mobile applications. In Proceedings of the European Conference on Computer Vision (ECCV), pages 285–300, 2018.
  • [71] Xitong Yang, Haoqi Fan, Lorenzo Torresani, Larry S Davis, and Heng Wang. Beyond short clips: End-to-end video-level learning with collaborative memories. In CVPR, 2021.
  • [72] Xitong Yang, Xiaodong Yang, Ming-Yu Liu, Fanyi Xiao, Larry S Davis, and Jan Kautz. Step: Spatio-temporal progressive learning for video action detection. In CVPR, 2019.
  • [73] Hantao Yao, Shiliang Zhang, Richang Hong, Yongdong Zhang, Changsheng Xu, and Qi Tian. Deep representation learning with part loss for person re-identification. IEEE Transactions on Image Processing, 28(6):2860–2871, 2019.
  • [74] Ye Yuan, Wuyang Chen, Tianlong Chen, Yang Yang, Zhou Ren, Zhangyang Wang, and Gang Hua. Calibrated domain-invariant learning for highly generalizable large scale re-identification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3589–3598, 2020.
  • [75] Ye Yuan, Wuyang Chen, Yang Yang, and Zhangyang Wang. In defense of the triplet loss again: Learning robust person re-identification with fast approximated triplet loss and label distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 354–355, 2020.
  • [76] Tianyun Zhang, Shaokai Ye, Kaiqi Zhang, Jian Tang, Wujie Wen, Makan Fardad, and Yanzhi Wang. A systematic dnn weight pruning framework using alternating direction method of multipliers. In Proceedings of the European Conference on Computer Vision (ECCV), pages 184–199, 2018.
  • [77] Yifu Zhang, Peize Sun, Yi Jiang, Dongdong Yu, Zehuan Yuan, Ping Luo, Wenyu Liu, and Xinggang Wang. Bytetrack: Multi-object tracking by associating every detection box. arXiv preprint arXiv:2110.06864, 2021.
  • [78] Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenjun Zeng, and Wenyu Liu. Fairmot: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision, 129(11):3069–3087, 2021.
  • [79] Liming Zhao, Xi Li, Yueting Zhuang, and Jingdong Wang. Deeply-learned part-aligned representations for person re-identification. In Proceedings of the IEEE international conference on computer vision, pages 3219–3228, 2017.
  • [80] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In Proceedings of the IEEE international conference on computer vision, pages 1116–1124, 2015.
  • [81] Liang Zheng, Hengheng Zhang, Shaoyan Sun, Manmohan Chandraker, Yi Yang, and Qi Tian. Person re-identification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1367–1376, 2017.
  • [82] Yuli Zheng, Zhenyu Wu, Ye Yuan, Tianlong Chen, and Zhangyang Wang. Pcal: A privacy-preserving intelligent credit risk modeling framework based on adversarial learning. arXiv preprint arXiv:2010.02529, 2020.
  • [83] Zhedong Zheng, Liang Zheng, and Yi Yang. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In Proceedings of the IEEE international conference on computer vision, pages 3754–3762, 2017.
  • [84] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless cnns with low-precision weights. arXiv preprint arXiv:1702.03044, 2017.
  • [85] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
  • [86] Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. Tracking objects as points. In European Conference on Computer Vision, pages 474–490. Springer, 2020.
  • [87] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
  • [88] Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, 2017.