The emergence of various large-scale video datasets, along with the continuous development of deep neural networks, has vastly promoted the development of video-based machine vision tasks, with action recognition (AR) being one of the spotlights. Recently, automatic AR has seen increasing application in diverse fields, e.g., security surveillance (chen2010real; zou2019wifi; ullah2021efficient), autonomous driving (royer2007monocular; cao2019bypass; chen2020survey), and smart homes (fahad2015integration; feng2017smart; yang2018device). As a result, effective AR models that are robust to different environments are required to cope with diverse real-world scenarios. The performance of AR models has indeed improved significantly, reaching superior accuracies across various datasets (wang2019hallucinating; ghadiyaram2019large; gowda2021smart).
Despite the rapid progress made by current AR research, most research aims to improve model performance on existing AR datasets that are constrained by several factors. One of these factors is that videos in existing datasets are shot in non-challenging environments, with adequate illumination and contrast. Such constraints could lead to observable fragility in the proposed methods, which are incapable of generalizing well to adverse environments, including dark environments with low illumination. Take security surveillance as an example: automated AR models could play a vital role in anomaly detection. However, anomalous actions are more common at night and in dark environments, where current AR models are hampered by darkness and are unable to recognize actions effectively. Autonomous systems are another example, where darkness has hampered the effectiveness of onboard cameras so severely that most vision-based autonomous driving systems are strictly prohibited at night (brown2019autonomous), while those that do allow night operation could cause severe accidents (boudette2021it).
The degraded performance of AR models in dark environments is closely related to the construction and collection of current large-scale AR datasets (e.g., HMDB51 (kuehne2011hmdb), UCF101 (soomro2012ucf101), and Kinetics (kay2017kinetics)), which rely heavily on online video platforms (e.g., YouTube and Flickr). While online video platforms contain a large variety of videos and support the collection of large-scale video datasets, videos on these platforms are generally shot under normal illumination. In reality, however, poor illumination (darkness) is quite common in many scenarios, such as surveillance at night, yet it is not covered properly in any existing dataset for training AR models. Meanwhile, current AR models are predominantly data-driven and are sensitive to unseen conditions. Given the lack of low-illumination videos in current AR datasets, the degradation of current AR models is not unexpected.
To mitigate the performance degradation of AR models in dark environments, one intuitive method is to pre-process dark videos to improve their visibility. Such a method is indeed effective from the human vision perspective. Over the past decade, various visual enhancement techniques (guo2016lime; ying2017new; zhang2019kindling; guo2020zero; li2021learning) have been proposed to improve the visibility of degraded images and videos, ranging from dehazing and de-raining to illumination enhancement. Given the effectiveness of deep neural networks in related tasks such as image reconstruction, deep-learning-based illumination enhancement methods have also been developed with the introduction of various illumination enhancement datasets (e.g., SID (chen2018learning), ReNOIR (anaya2018renoir), and the LOL dataset (liu2021benchmarking)). The results are reportedly promising from a human vision viewpoint, given their capability of improving the visual quality of low-illumination images and videos.
In spite of their capability of generating visually enhanced images and videos, prior research has shown that a majority of illumination enhancement methods are incapable of consistently improving AR performance on dark videos. This is caused by two factors. First, most illumination enhancement methods are developed upon low-illumination images, which are static and do not contain motion information. For the few illumination enhancement video datasets (e.g., DRV (chen2019seeing)), the videos collected are also mostly static, with the “ground truth” of the dark videos shot with long exposures. In contrast, actions in videos are closely correlated with motion information, which is not generally included in current datasets for illumination enhancement. Second, current illumination enhancement datasets predominantly target human vision, with methods evaluated not only quantitatively but also with rather subjective qualitative measures (e.g., US (guo2020zero) and PI scores (mittal2012making; ma2017learning; blau2018perception)). Quantitative evaluation of illumination enhancement methods is also based mostly on the quality of the image/video (e.g., PSNR) instead of the understanding of the image/video (e.g., classification and segmentation). Because of this misalignment between the goal of applying illumination enhancement to dark videos for AR and the goal of the illumination enhancement datasets, these datasets are unable to guide illumination enhancement methods towards improving AR accuracy on dark videos.
To apply AR models in real-world practical applications, the models are expected to be robust to videos shot in all environments, including challenging dark environments. In view of the inability of current solutions to address AR in dark environments, it is highly desirable to conduct comprehensive research on effective methods to cope with such challenging environments. Such research could enable models to handle real-world dark scenarios, and benefit fields such as security and autonomous driving.
To bridge the gap between the lack of research on AR models robust to dark environments and the wide real-world applicability of such research, we propose the UG2+ Challenge Track 2 in IEEE CVPR 2021. The UG2+ Challenge Track 2 (UG2-2) aims to evaluate and advance the robustness of AR models in poor visibility environments, focusing on dark environments. Specifically, UG2-2 is structured into two sub-challenges, featuring different actions and diverse training protocols. UG2-2 is built on top of a novel AR dataset, ARID, which is a collection of realistic dark videos dedicated to AR. UG2-2 further expands the original ARID, strengthening its capability of guiding models in recognizing actions in dark environments. More specific dataset details and evaluation protocols are illustrated in Section 3.1. Compared with previous works and challenges, UG2-2 and its relevant datasets include the following novelties:
Addressing Videos from Dark Environments: The dataset utilized in UG2-2 is the first video dataset dedicated to action recognition in the dark. The original dataset with its expansion is collected from real-world scenarios. It provides much-needed resources to research actions captured in the challenging dark environments, and to design effective recognition methods robust towards dark environments.
Covering Fully and Semi-Supervised Learning: The two sub-challenges in UG2-2 are structured to cover both fully supervised learning (UG2-2.1) and semi-supervised learning (UG2-2.2). To the best of our knowledge, this is the first challenge that involves semi-supervised learning of dark videos. While our dataset provides resources for AR in dark environments, a more feasible and efficient strategy to learn robust AR models is to adapt or generalize models learnt in non-challenging environments (whose datasets are usually of larger scale) to dark environments. In this sense, our challenge promotes research into leveraging current datasets to boost performance on dark videos.
Greatly Challenging: Compared with conventional AR datasets (e.g., UCF101), the dataset utilized in the fully supervised sub-challenge is of small scale. Yet the winning solution of this sub-challenge achieves a performance inferior to that achieved on UCF101. Meanwhile, even though the cross-domain video dataset used in the semi-supervised sub-challenge is comparable to a conventional cross-domain video dataset (i.e., UCF-HMDB (sultani2014human)), the performance of the winning solution is also inferior to that achieved on UCF-HMDB. The second runner-up solutions of the semi-supervised sub-challenge also trail the winning solution by a large gap. The results show that our datasets are greatly challenging, with large room for further improvement.
The rest of this article is organized as follows: Section 2 reviews previous action recognition and dark visual datasets, as well as various action recognition methods. Section 3 introduces the details of the UG2-2 challenge, with its dataset, evaluation protocol and baseline results. Further, Section 4 illustrates the results of the competition and related analysis, while briefly discussing the reflected insights as well as possible future developments. The article is concluded in Section 5.
2 Related Works
2.1 Large-Scale Datasets
Various datasets have been proposed to advance the development of video action recognition (AR). Earlier datasets (e.g., KTH (schuldt2004recognizing), Weizmann (gorelick2007actions), and IXMAS (weinland2007action)) comprise a relatively small number of action classes. The videos in these datasets were recorded offline, with actions performed by several actors under limited scenarios. For example, KTH (schuldt2004recognizing) includes six different action classes performed by 25 actors under 4 different scenarios. With the advancing performance of deep-learning-based methods, there has been an urgent demand for larger and more complicated datasets. To address this issue, subsequent datasets, such as HMDB51 (kuehne2011hmdb) and UCF101 (soomro2012ucf101), have been proposed by collecting videos from more action classes and more diverse scenarios. Specifically, HMDB51 (kuehne2011hmdb) is constructed with videos of 51 action classes collected from a variety of sources, from movies to online video platforms, while UCF101 (soomro2012ucf101) is a larger dataset, consisting of 101 different actions collected from user-uploaded videos.
Both HMDB51 (kuehne2011hmdb) and UCF101 (soomro2012ucf101) have served as standard benchmarks of AR, but they possess insufficient data variation to train deep models, mainly because they contain multiple clips sampled from the same video. To address this issue, larger datasets with more variation have been proposed. One of the most representative examples is Kinetics-400 (kay2017kinetics). Kinetics-400 incorporates 306,245 clips from 306,245 videos (i.e., each clip is from a different video) in 400 action classes. There are at least 400 clips within each class, which guarantees more intra-class variety compared to other datasets. The subsequent versions of the Kinetics dataset, including Kinetics-600 (carreira2018short) and Kinetics-700 (carreira2019short), have also been collected following a similar protocol. In addition to the Kinetics datasets, many large-scale datasets have been presented to increase the variety of samples from different perspectives, such as Something-Something (goyal2017something) for human-object interactions, AVA (gu2018ava) for localized actions, and Moments-in-Time (monfort2019moments) for both visual and auditory information. While the emerging large-scale datasets push the performance limit of deep models, most of them are collected from the internet or shot under normal illumination.
2.2 Dark Visual Datasets
There has been emerging research interest in high-level tasks in low-illumination environments in the field of computer vision. This increasing attention has led to a number of image-based datasets in dark environments. The earlier datasets were mainly designed for image enhancement or restoration, and include LOL (wei2018deep), SID (chen2018learning), ExDARK (loh2019getting), and DVS-Dark (zhang2020DVS). Specifically, LOL (wei2018deep) and SID (chen2018learning) consist of pairs of images shot under different exposure times or ISO, while ExDARK (loh2019getting) contains images collected from various online platforms. DVS-Dark consists of event images instead of RGB images, which can respond to changes in brightness, and the recent work (lv2021attention) proposed to further extend the scale of the dataset by introducing synthetic low-light images. This research interest has also expanded to the video domain. Several video datasets, such as DRV (chen2019seeing) and SMOID (jiang2019learning), have been proposed specifically for low-light video enhancement, and include raw videos captured in dark environments and corresponding noise-free videos obtained by using long exposure. However, these datasets mainly encompass static scenes with trivial dynamic motion and are therefore not suitable for AR, which relies significantly on motion information. Furthermore, both datasets are of small scale (e.g., 179 samples for DRV (chen2019seeing) and 202 samples for SMOID (jiang2019learning)). In this paper, we introduce the ARID dataset, containing more samples of various actions, as our evaluation benchmark.
2.3 Action Recognition Methods
In the era of deep learning, early state-of-the-art AR methods are fully supervised methods mainly based on either 3D CNN (ji20123d) or 2D CNN (karpathy2014large). 3D CNN (ji20123d) attempts to jointly extract spatio-temporal features by expanding the 2D convolution kernel to the temporal dimension, but this expansion suffers from high computational cost. To alleviate this side effect, subsequent works, such as P3D (qiu2017learning) and R(2+1)D (tran2018closer), improve the efficiency by replacing 3D convolution kernels with pseudo 3D kernels. As for 2D CNN, due to the lack of temporal features, early 2D-based methods (simonyan2014twostream) usually require additional hand-crafted features as input (e.g., optical flow) to represent the temporal information. More recent methods attempt to model the temporal information in a learnable manner. For example, TSN (wang2016temporal) proposed to extract more abundant temporal information by utilizing a sparse temporal sampling strategy. SlowFast networks (feichtenhofer2019slowfast) proposed to utilize dual pathways with low or high temporal resolutions to extract spatial or temporal features, respectively.
The outstanding performance of fully supervised methods mainly relies on large-scale labeled datasets, whose annotation is resource-intensive. Moreover, networks trained in the fully supervised manner suffer from poor transferability and generalization. To increase the efficiency and generalization of extracted features, some works proposed to utilize semi-supervised approaches, such as self-supervised learning (fernando2017self; xu2019self; yao2020video; wang2020self) and Unsupervised Domain Adaptation (UDA) (pan2020adversarial; munro2020multi; choi2020shuffle; xu2021partial). Self-supervised learning is designed to extract effective video representations from unlabeled data. The core of self-supervised learning is to design a pretext task that generates supervision signals from the characteristics of videos, such as frame orders (fernando2017self; xu2019self) and play rates (yao2020video; wang2020self). On the other hand, UDA aims to extract representations that are transferable between the labeled data in the source domain and the unlabeled data in the target domain. Compared to image-based UDA methods (ganin2015unsupervised; ganin2016domain; busto2018open), there exist fewer works in the field of video-based UDA (VUDA). The work of (pan2020adversarial) is one of the primary works focusing on VUDA, which attempts to address temporal misalignment by introducing a co-attention module across the temporal dimension. (munro2020multi) further leverages the multi-modal input of video to tackle the VUDA problem. SAVA (choi2020shuffle) proposed an attention mechanism to attend to the discriminative clips of videos, and PATAN (xu2021partial) further expanded the UDA problem to a more general partial domain adaptation problem. In this work, we structure the sub-challenges to cover both fully supervised and semi-supervised learning, to inspire novel AR methods in poor visibility environments.
3 Introduction of UG2+ Challenge Track 2
The UG2+ Challenge Track 2 (UG2-2) aims to evaluate and advance the robustness of AR methods in dark environments. In this section, we detail the datasets and evaluation protocols used in UG2-2, as well as the baseline results for both sub-challenges. The datasets of UG2-2 for both sub-challenges are built based on the Action Recognition In the Dark (ARID) dataset. We begin this section with a brief review of the ARID dataset.
3.1 The ARID Dataset
The ARID dataset (xu2021arid) is the first video dataset dedicated to action recognition in dark environments. The dataset is a collection of videos shot by commercial cameras in dark environments, with actions performed by 11 volunteers. In total, it comprises 11 action classes, including both Singular Person Actions (i.e., jumping, running, turning, walking, and waving) as well as Actions with Objects (i.e., drinking, picking, pouring, pushing, sitting, and standing). The dark videos are shot in both indoor and outdoor scenes with varied lighting conditions. The dataset consists of a total of 3,784 video clips, with the minimum action class containing 205 video clips. The clips of every action class are divided into clip groups according to the different actors and scenes. Similar to previous action recognition datasets (e.g., HMDB51 (kuehne2011hmdb) and UCF101 (soomro2012ucf101)), three train/test splits are selected, with each split partitioned according to the clip groups, with a ratio of . The splits are selected to maximize the possibility that each clip group is presented in either the training or testing partition. All video clips in ARID are fixed to a 30 FPS frame rate, and a unified resolution of . The overall duration of all video clips combined is 8,721 seconds.
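The group-wise partitioning described above can be illustrated with a minimal sketch. It assumes that each clip carries a clip-group identifier and that whole groups are assigned to one partition; the function name `split_by_clip_group`, the `test_fraction` parameter, and the random assignment are illustrative assumptions, not the procedure actually used to build the ARID splits.

```python
import random
from collections import defaultdict


def split_by_clip_group(clips, test_fraction, seed=0):
    """Split (clip_id, group_id) pairs so that no clip group spans both partitions.

    `test_fraction` is a hypothetical parameter: the share of clip groups
    (not clips) assigned to the test partition.
    """
    groups = defaultdict(list)
    for clip_id, group_id in clips:
        groups[group_id].append(clip_id)

    # Shuffle the group identifiers reproducibly, then cut at the group level.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)
    n_test = max(1, round(len(group_ids) * test_fraction))
    test_groups = set(group_ids[:n_test])

    train = [c for g in group_ids if g not in test_groups for c in groups[g]]
    test = [c for g in group_ids if g in test_groups for c in groups[g]]
    return train, test
```

Calling the function with different seeds would yield different splits; keeping each group intact avoids near-duplicate clips of the same actor and scene appearing on both sides of a split.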
Though the ARID dataset pioneers the investigation of action recognition methods in dark environments, it has its own limitations. Compared with current SOTA benchmarks such as Kinetics (kay2017kinetics) and Moments-in-Time (monfort2019moments), ARID is of limited scale, especially in terms of the number of videos per class. The limited scale of ARID prevents complex deep learning methods from being trained on it, owing to a higher risk of overfitting. Increasing the dataset scale is an effective solution to such a constraint, given that conventional action recognition datasets have followed the same development path. However, the collection and annotation of dark videos are costly, given that few dark videos are available on public video platforms. Therefore, the strategy of increasing the dataset scale could only bring limited improvement to the dataset. Given the vast availability of videos shot in non-challenging environments, such videos should be fully utilized to train transferable models that could generalize to dark videos. To this end, we introduce a comprehensive extension of the ARID dataset, ARID-plus, to address the issues of the original ARID dataset and to serve as the dataset for the two sub-challenges of UG2-2.
3.2 Fully Supervised Action Recognition in the Dark
To equip AR models with the ability to cope with dark environments for applications such as night surveillance, the most intuitive method is to train action models in a fully supervised manner with videos shot in the dark, which motivates the construction of Sub-Challenge 1. The first component of ARID-plus serves as the dataset of Sub-Challenge 1 of UG2-2 (UG2-2.1), where participants are given annotated dark videos for fully supervised action recognition. A total of 1,937 real-world dark video clips capturing actions by volunteers are adopted as the training and/or validation sets, with the recommended train/validation split provided to participants. The video clips contain six categories of actions, i.e., run, sit, stand, turn, walk, and wave. For testing, a hold-out set with 1,289 real-world dark video clips is provided, collected with similar methods as the training/validation video clips, with the same classes. In total, there are a minimum of 456 clips for each action. A detailed distribution of train(validation)/test video clips is shown in Fig. 1(b). During training, participants can optionally use pre-trained models (e.g., models pre-trained on ImageNet (deng2009imagenet) or Kinetics), and/or external data, including self-synthesized or self-collected data. If any pre-trained model or external data is used, participants must state this explicitly in their submissions. The participants are ranked by the top-1 accuracy on the hold-out test set, while all the solutions of candidate winners are tested for their reproducibility.
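For reference, the ranking metric can be sketched as follows. This is a minimal illustration of top-1 accuracy, assuming each submission provides one list of class scores per test clip; the function name `top1_accuracy` and the input layout are assumptions for illustration and do not describe the official evaluation script.

```python
def top1_accuracy(scores, labels):
    """Fraction of clips whose highest-scoring class equals the ground-truth label.

    scores: list of per-clip class-score lists (hypothetical input layout).
    labels: list of integer class indices, one per clip.
    """
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and of equal length")

    correct = 0
    for clip_scores, label in zip(scores, labels):
        # Index of the class with the largest score (first index on ties).
        predicted = max(range(len(clip_scores)), key=clip_scores.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)
```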
The video clips adopted for training and testing in UG2-2.1 include those in the original ARID dataset, as well as new video clips. Several changes are adopted during the collection of the new video clips. Firstly, the new video clips are shot in a number of new scenes, whose visibility is even lower. This is justified statistically by lower RGB mean values and standard deviation (std) values as depicted in Fig. 1(a). Secondly, videos collected in the original ARID dataset follow a aspect ratio, which matches standard 320p or 480p videos. Meanwhile, the currently more common High-Definition (HD) videos have a aspect ratio, with larger view angles. Following the aspect ratio of HD videos, the new video clips are fixed to a resolution of . We have also extended the length of each video clip, from an average of 2.3 seconds per clip in the original ARID to an average of 4 seconds per clip for the new video clips. Sampled frames from the train/validation set and the hold-out test set are displayed in Fig. 2(a) and Fig. 2(b).
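The visibility statistics mentioned above can be computed along the following lines. This is a minimal sketch, assuming frames are given as nested lists of (R, G, B) tuples with values in [0, 255] and that the statistics are pooled over all pixels; the function name `rgb_mean_std` and the pooling choice are assumptions, and the exact procedure behind Fig. 1(a) may differ.

```python
import math


def rgb_mean_std(frames):
    """Per-channel mean and standard deviation pooled over all pixels of all frames.

    frames: iterable of frames, each a list of rows of (r, g, b) tuples.
    """
    count = 0
    sums = [0.0, 0.0, 0.0]
    square_sums = [0.0, 0.0, 0.0]
    for frame in frames:
        for row in frame:
            for pixel in row:
                for channel in range(3):
                    sums[channel] += pixel[channel]
                    square_sums[channel] += pixel[channel] ** 2
                count += 1
    if count == 0:
        raise ValueError("no pixels provided")

    means = [s / count for s in sums]
    # Population std via E[x^2] - E[x]^2, clamped at zero against rounding error.
    stds = [math.sqrt(max(q / count - m * m, 0.0))
            for q, m in zip(square_sums, means)]
    return means, stds
```

Lower means indicate darker footage, and lower standard deviations indicate lower contrast, which is the sense in which the new clips are described as less visible.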
3.3 Semi-Supervised Action Recognition in the Dark
While fully supervised training on dark videos allows models to cope with dark environments directly, publicly available dark videos are scarce compared with the vast amount of normally illuminated videos, which can be obtained with ease. Due to the high cost of both video collection and annotation, simply increasing the scale of dark video datasets to improve the effectiveness of fully supervised learning would not be a feasible strategy. Alternatively, the large amount of normally illuminated videos in previous public datasets should be utilized to train transferable models that could be generalized to dark videos. Such transfer may be further boosted with certain frame enhancements. The above strategy could be regarded as a semi-supervised learning strategy for action recognition in dark videos, which motivates the design of Sub-Challenge 2. Sub-Challenge 2 of UG2-2 (UG2-2.2) is designed to guide participants to tackle action recognition in dark environments in a semi-supervised manner, achieved by generalizing models learnt in non-challenging environments to the challenging dark environments. To this end, the participants are provided with a subset of the labeled HMDB51 (kuehne2011hmdb) that includes 643 videos from five action classes (i.e., drink, jump, pick, pour, and push), for the training of models in non-challenging environments. Meanwhile, to facilitate the transfer of models, the second component of ARID-plus, with a total of 1,613 dark video clips, is provided to the participants without labels, and can be optionally used at the participants’ discretion for training and validation. The 1,613 clips contain the same five categories of actions. Similar to UG2-2.1, a hold-out set containing 722 real-world dark video clips with the same classes is provided for testing. Overall, there are at least 297 clips for each action class. The detailed distribution of train(validation)/test dark video clips is shown in Fig. 1(c).
During training, participants can also optionally use pre-trained models, and/or external data, including self-synthesized or self-collected data. However, the 1,613 dark video clips provided during the training/validation phase are not allowed to be manually labeled for training (i.e., they must remain unlabeled). Participants are to state explicitly if any pre-trained model or external data is used, and are ranked by the top-1 accuracy on the hold-out test set, with reproducibility subject to testing if the relevant solution’s testing accuracy stands out. The changes applied to the extra data in UG2-2.1 have also been applied to the extra dark video clips for UG2-2.2. Such changes result in a similar degradation of clip visibility (as depicted in Fig. 1(a)), and an increase in view angles and average clip length. We show the sampled frames from the labeled HMDB51 train set, the unlabeled dark train/validation set, as well as the hold-out test set in Fig. 2(c) and Fig. 2(d).
3.4 Baseline Results and Analysis
For both sub-challenges, we report baseline results utilizing off-the-shelf enhancement methods with fine-tuning of several popular pre-trained action recognition models and domain adaptation methods. It should be noted that these enhancement methods, pre-trained models and domain adaptation methods are not designed specifically for dark videos, hence they are by no means very competitive, and performance boosts are expected from participants.
3.4.1 Fully Supervised UG2-2.1 Baseline Results
For UG2-2.1, we report baseline results from a total of six AR models: I3D (carreira2017quo), 3D-ResNet-50 (hara2018can), 3D-ResNeXt-101 (hara2018can), TSM (lin2019tsm), SlowOnly (feichtenhofer2019slowfast), and X3D-M (feichtenhofer2020x3d). RGB frames are utilized as the input for all methods, while we also report the results utilizing optical flow obtained through TV-L1 (zach2007duality) for the I3D and SlowOnly methods, along with the results of class score fusion (simonyan2014twostream) of both RGB frames and optical flow. Meanwhile, applying enhancement methods which improve the visibility of dark videos is an intuitive way to improve AR accuracies. Therefore, we also evaluate the above methods using RGB input with four enhancement methods: Gamma Intensity Correction (GIC), LIME (guo2016lime), Zero-DCE (guo2020zero), and StableLLVE (zhang2021learning). All AR models and enhancement methods adopt the officially released versions when applicable, where all learning-based methods are written with the PyTorch (paszke2019pytorch) framework. All AR models are fine-tuned from their models pre-trained on Kinetics-400 (kay2017kinetics), and trained for a total of 30 epochs. Due to constraints in computation power, the batch size is unified for all models and set to 8 per GPU. All experiments are conducted with two NVIDIA RTX 2080Ti GPUs. The reported results are an average of five experiments. The detailed results are found in Table 1 and Table 2.
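Two of the simpler components in this baseline setup, Gamma Intensity Correction and class score fusion, can be sketched as below. The sketch assumes the common power-law form of gamma correction, in which each 8-bit intensity v is mapped to 255 * (v / 255) ** (1 / gamma), and a weighted average for score fusion; the function names, the `gamma` and `flow_weight` parameters, and the equal default weighting are illustrative assumptions rather than the exact baseline configuration.

```python
def gamma_intensity_correction(frame, gamma):
    """Brighten a frame with power-law gamma correction (gamma > 1 brightens).

    frame: list of rows of (r, g, b) tuples with integer values in [0, 255].
    gamma: hypothetical parameter; the value used for the baselines is not assumed here.
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    # Precompute a lookup table so each pixel value is mapped only once.
    lut = [round(255 * (value / 255) ** (1 / gamma)) for value in range(256)]
    return [[tuple(lut[channel] for channel in pixel) for pixel in row]
            for row in frame]


def fuse_class_scores(rgb_scores, flow_scores, flow_weight=0.5):
    """Late fusion: weighted average of per-class scores from the two streams."""
    if len(rgb_scores) != len(flow_scores):
        raise ValueError("both streams must score the same classes")
    return [(1 - flow_weight) * r + flow_weight * f
            for r, f in zip(rgb_scores, flow_scores)]
```

With this form, dark intensities are stretched more than bright ones, which raises visibility but also amplifies sensor noise in dark regions.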
Overall, with the training settings introduced above, current AR models perform poorly without any enhancements in UG2-2.1. The best performance is achieved by using optical flow input with the SlowOnly model, i.e., an accuracy of . In comparison, the evaluated models could achieve at least accuracy on the large-scale Kinetics dataset, and over accuracy on the HMDB51 dataset. It is worth noting that newer models (e.g., X3D-M) which produce SOTA results on large-scale datasets may perform worse than previous models (e.g., 3D-ResNeXt-101). Therefore, novel AR models may not be more generalizable than prior AR models.
Meanwhile, the results after applying enhancements show that the evaluated enhancements may not bring consistent improvements in action recognition accuracy. The evaluated enhancements all produce visually clearer videos, where actions are more recognizable by humans, as shown in Fig. 3. The actor who is running can be seen visually in all sampled frames with enhancements, while the actor is almost unrecognizable in the original dark video. However, at least three AR models produce inferior performance when applying any enhancement. The best result is obtained with 3D-ResNeXt-101 while applying Zero-DCE enhancement. In general, Zero-DCE results in the best average improvement of . Meanwhile, the susceptibility of each model varies greatly. 3D-ResNet-50 gains the most positive effect of average accuracy gain with enhancements applied, while TSM is most susceptible to negative effects with an average loss of accuracy.
We argue that the negative effect of applying enhancements results from the noise brought by enhancements. Though enhanced videos are clearer from human perspectives, some enhancements break the original data distribution, and can therefore be regarded as artifacts or adversarial attacks for videos. The change in data distribution and the addition of noise could result in a notable decrease in performance for AR models. The deficiencies of the examined enhancements suggest that simple integration of frame enhancements may not be sufficient. Instead, other techniques such as domain adaptation or self-supervision could be further employed to improve the effectiveness of frame enhancements.
3.4.2 Semi-Supervised UG2-2.2 Baseline Results
For UG2-2.2, we report baseline results with three AR models: I3D, 3D-ResNet-50, and 3D-ResNeXt-101. To transfer networks from the labeled normal videos to unlabeled dark videos, we employ and evaluate three different domain adaptation methods: the adversarial-based DANN (ganin2015unsupervised), and the discrepancy-based MK-MMD (long2015learning) and MCD (saito2018maximum). We also examine both the source-only scenario (i.e., training on the labeled source videos without any domain adaptation method) and the target-only scenario (i.e., fully supervised learning on the dark videos). Similar to the baseline experiments in UG2-2.1, all models are pre-trained on Kinetics-400, with the whole training process set to 30 epochs. For all AR models, we freeze the first three convolutional layers, and the batch size is set to 8 per GPU. The experiments are conducted with the same hardware and framework as those of the UG2-2.1 baselines. No enhancement method is employed when conducting the baseline experiments for UG2-2.2. The reported results are an average of five experiments. Detailed results are shown in Table 3.
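To make the idea behind discrepancy-based adaptation concrete, the following is a minimal sketch of a multi-kernel maximum mean discrepancy estimate between source and target feature vectors. It assumes Gaussian kernels with fixed, uniformly weighted bandwidths and the biased estimator; the function names and bandwidth values are illustrative assumptions, and the sketch is a simplification rather than the MK-MMD implementation of (long2015learning), which also learns the kernel weights.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * bandwidth ** 2))


def multi_kernel(x, y, bandwidths):
    """Uniformly weighted mixture of Gaussian kernels (weights are an assumption)."""
    return sum(gaussian_kernel(x, y, b) for b in bandwidths) / len(bandwidths)


def mk_mmd_squared(source, target, bandwidths=(0.5, 1.0, 2.0)):
    """Biased estimate of the squared MMD between two sets of feature vectors."""
    def mean_kernel(first, second):
        total = sum(multi_kernel(x, y, bandwidths) for x in first for y in second)
        return total / (len(first) * len(second))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2 * mean_kernel(source, target))
```

In a discrepancy-based adaptation setup, a term of this kind computed on intermediate features would be added to the classification loss on the labeled source videos, so that minimizing it pulls the source and target feature distributions together.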
The results in Table 3 imply that though all three adaptation methods can improve the generalizability of the respective AR models, scoring higher than the source-only scenarios, all have a large gap towards the target-only accuracies, which are the upper bounds of the networks’ performances. The large performance gap towards the upper bound also confirms that there exists a large domain gap between videos shot in non-challenging environments and videos shot in dark environments. Among the three adaptation methods, DANN produces the best performance in general, resulting in an average performance gain of over the models’ source-only performances. The best baseline result is obtained with 3D-ResNeXt while applying DANN as the domain adaptation method. It should be noted that no enhancements or other training tricks are applied when obtaining the baseline results for UG2-2.2. Therefore, it is expected that participants could score higher than the target-only accuracies in Table 3.
4 Results and Analysis
A total of 34 teams registered for the UG2+ Challenge Track 2 (UG2-2) at CVPR, among which 25 and 12 teams submitted their results to the fully supervised sub-challenge (UG2-2.1) and the semi-supervised sub-challenge (UG2-2.2), respectively. For each sub-challenge, the team with the highest performance is selected as the winner. In this section, we summarize the technical details of some outstanding performers and compare them with our baseline results. The full leaderboards can be found on the challenge website (https://cvpr2021.ug2challenge.org/leaderboard21_t2.html).
4.1 UG2-2.1: Fully Supervised Learning
Among the 25 teams that successfully participated in this sub-challenge, 11 teams proposed novel models that outperform our baseline results. Among them, 7 teams are included in our leaderboard, where the winning team AstarTrek achieved the best performance of . While all teams constructed their models based on complex backbones, some interesting observations are as follows: (i) besides RGB, 3 out of 6 teams in the leaderboard utilized optical flow as an additional input, yet this extra modality did not bring a solid improvement over teams using pure RGB input; (ii) teams achieving top performance utilized low-light enhancement methods; (iii) except for the winning team, all teams trained their models from scratch with a large number of epochs (more than 200) rather than utilizing pre-trained models, surpassing our baseline results by at least .
The winning team AstarTrek adopted a two-stream structure, as shown in Fig. (b). The team first utilized Gamma Intensity Correction (GIC) with to enhance the illumination level of videos. Subsequently, both RGB and optical flow were generated as the input of the two-stream structure. Specifically, the SlowFast Network (feichtenhofer2019slowfast) (based on ResNet-50 (he2016deep)) pretrained on Kinetics-400 (K400) was adopted as the RGB stream to extract spatial features from raw RGB input. For the flow stream, the team utilized a ResNet-50-based I3D (carreira2017quo) pretrained on K400 to extract temporal information from optical flow. During the training process, the team adopted a two-stage procedure, where each stream was trained independently to ensure that each of them could provide reliable predictions by itself. Each stream was trained with stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.0001. The batch size was set to 32 and the initial learning rate was 0.001, decayed by a factor of 0.1 at epochs 60 and 100 (with 800 epochs in total). Each input (RGB or optical flow) was first resized to a square whose side length was randomly sampled from [224, 288], then randomly cropped into a square of size 224 × 224, followed by a horizontal flip with a probability of 0.5. During inference, each input (RGB or optical flow) was resized to 240 × 300. The final prediction is the average of the results from both streams.
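The three numerical ingredients of this pipeline (gamma correction, the step learning-rate schedule, and late fusion by averaging) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the team's code: the function names are hypothetical, the gamma convention out = 255 · (in / 255)^(1/γ) is an assumption (the gamma value used by the team is not reproduced here, so it is a required argument), and the decay is assumed to take effect at the start of the listed epochs.

```python
def gamma_correct(pixel, gamma):
    """Brighten one 8-bit intensity (assumed convention: exponent 1/gamma)."""
    return round(255.0 * (pixel / 255.0) ** (1.0 / gamma))


def step_lr(epoch, base_lr=0.001, milestones=(60, 100), factor=0.1):
    """Learning rate at a given epoch: multiplied by `factor` at each milestone."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops


def late_fusion(rgb_scores, flow_scores):
    """Average the per-class scores of the RGB and optical flow streams."""
    return [(r + f) / 2.0 for r, f in zip(rgb_scores, flow_scores)]


if __name__ == "__main__":
    # gamma > 1 lifts dark intensities under the assumed convention
    print(gamma_correct(64, 2.0))
    print(step_lr(0), step_lr(60), step_lr(100))
    print(late_fusion([0.2, 0.8], [0.6, 0.4]))
```

Because each stream is trained independently in the two-stage procedure, the fusion step needs nothing beyond the two score vectors at inference time.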
On the other hand, the runner-up team Artificially Inspired adopted different backbones and strategies, achieving a competitive performance of . As shown in Fig. (a), taking pure RGB as input, the team utilized Zero-DCE (guo2020zero) as their enhancement method and R(2+1)D-Bert (kalfaoglu2020late) as their single-stream backbone. In fact, they are the only participants in the leaderboards that utilized a deep-model-based method to improve the quality of dark videos. Moreover, noticing that samples in ARID contain a relatively small number of frames, the team utilized a Delta Sampling strategy that constructed each input sample with various sampling rates while avoiding loop sampling. The team utilized 4,500 different images to train the Zero-DCE model, where 2,500 images were randomly sampled from the ARID dataset and the others, of different illumination levels, were collected from other datasets. During the training process, videos were first enhanced by the frozen Zero-DCE model to raise their light levels and then resized to 112 × 112. The team also included a random horizontal flip and rotation to increase the variation of input samples. According to their ablation studies, the utilization of Zero-DCE can bring an improvement of and the proposed sampling strategy surpassed other alternatives. More technical details can be found in their report (hira2021delta).
Besides AstarTrek, there are two teams in the leaderboard, Cogvengers and MKZ5, which attempted to leverage optical flow as an additional input to improve the performance. However, their performance is surpassed by most of the teams taking pure RGB input by more than , mainly because they utilized inferior strategies for extracting and processing optical flow. Specifically, MKZ5 utilized a different two-stream structure, as shown in Fig. (c), which directly extracted optical flow from dark videos, while AstarTrek extracted optical flow from enhanced videos. The direct extraction may result in optical flow of worse quality, since most optical flow estimation methods perform poorly on low-light data (zheng2020optical). As for Cogvengers, while they adopted the structure in Fig. (b), similar to the winning team, they followed a one-stage training strategy to jointly optimize the two-stream model, which might be the reason for their performance gap compared to others. In addition to the two-stream models based on RGB and optical flow, the team White give proposed another interesting two-stream structure based on pure RGB input (chen2021darklight). Specifically, adopting a structure similar to that in Fig. (c), the team replaced the flow stream with a shared-weight RGB stream taking the original dark clips as input. The features from enhanced clips and dark clips were then combined by a self-attention module (wang2018non) to extract effective spatio-temporal features from the dual streams.
4.2 UG2-2.2: Semi-Supervised Learning
A total of 12 teams submitted their results in the semi-supervised challenge. Among the participants, the winning team Artificially Inspired achieved the best performance of . Similar to UG2-2.1, there is a noticeable gap between leaderboard performance and our baseline results. This is mainly because our baseline evaluation simply adopts existing domain adaptation methods without any other pre-processing or enhancement techniques to further boost the performance. Also, in order to achieve state-of-the-art performance, all teams utilized a much larger number of epochs (e.g., 425 epochs in total for the winning team) and more complicated networks.
The winning team Artificially Inspired adopted the same backbone and enhancement method as in UG2-2.1. To fully leverage the unlabeled data from ARID, the team adopted a pseudo-label strategy to create pseudo-labels for the unlabeled data, as shown in Fig. (a). Specifically, in the first run, the team trained the model with labeled data from HMDB51 and generated the pseudo-labels of the unlabeled data by inference. Samples with confident pseudo-labels were then filtered based on their confidence scores and joined the supervised training process together with the data from HMDB51. The team initially chose a relatively high threshold of 0.99 and further increased it, up to 0.999999, from the fourth run to the tenth run. At the end of each run, the checkpoint of the trained model was saved. During the testing process, the final prediction was generated as the average of the predictions from the models saved at the end of each run.
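The confidence-based filtering and the checkpoint averaging described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the team's implementation: the confidence score is assumed to be the maximum softmax probability, the helper names `select_pseudo_labels` and `ensemble_average` are hypothetical, and the exact per-run threshold schedule between 0.99 and 0.999999 is not specified in the text, so the threshold is passed in explicitly.

```python
def select_pseudo_labels(probabilities, threshold):
    """Keep (sample_index, label) pairs whose top class probability
    reaches the confidence threshold."""
    selected = []
    for idx, probs in enumerate(probabilities):
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            selected.append((idx, label))
    return selected


def ensemble_average(per_run_probs):
    """Average the class probabilities predicted for one test sample
    by the checkpoints saved at the end of each run."""
    n_runs = len(per_run_probs)
    return [sum(column) / n_runs for column in zip(*per_run_probs)]


if __name__ == "__main__":
    unlabeled = [[0.995, 0.005], [0.6, 0.4], [0.001, 0.999]]
    print(select_pseudo_labels(unlabeled, threshold=0.99))
    print(select_pseudo_labels(unlabeled, threshold=0.999999))
    print(ensemble_average([[0.2, 0.8], [0.4, 0.6]]))
```

Raising the threshold in later runs shrinks the set of admitted samples, trading pseudo-label coverage for pseudo-label purity.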
As for the runner-up team, DeepBlueAI achieved a competitive result of with only a minor gap of compared with the best result. The team utilized CSN (tran2019video) based on ResNet-152, which is a more complex backbone than that of the winning team. While they also adopted a pseudo-label strategy similar to that of Artificially Inspired, as in Fig. (a), they designed a different set of filtering rules. Specifically, they designed a four-run training process, where all samples with pseudo-labels generated in the first run were included in the supervised training process of the second run. In the remaining runs, pseudo-labels of two classes, “Drink” and “Pick”, were changed to the class “Pour” if either of the following two rules was satisfied: (i) the confidence score of “Pour” is larger than 2.0, or (ii) the confidence score of “Pour” is larger than of the highest score. While the team does not reveal the rationale for this design, it might be the similarity between these three actions that motivates it. Also different from the winning team, DeepBlueAI generated their prediction based only on the model of the final run.
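The two relabeling rules can be written as a small function. This is only a sketch of the rule as described: the function name `relabel_to_pour` and the dictionary layout are hypothetical, and the fraction of the highest score used in rule (ii) is not given in the text, so `ratio` is a required argument rather than a default.

```python
def relabel_to_pour(scores, pseudo_label, ratio, abs_threshold=2.0):
    """Apply the described rules to one pseudo-labeled sample.

    scores: dict mapping class name to confidence score (unnormalized).
    ratio: fraction of the highest score used in rule (ii); the value
           used by the team is not stated here, so the caller supplies it.
    """
    if pseudo_label not in ("Drink", "Pick"):
        return pseudo_label  # only these two classes are ever relabeled
    pour = scores["Pour"]
    rule_i = pour > abs_threshold
    rule_ii = pour > ratio * max(scores.values())
    return "Pour" if (rule_i or rule_ii) else pseudo_label


if __name__ == "__main__":
    sample = {"Drink": 5.0, "Pour": 2.5, "Pick": 0.1}
    print(relabel_to_pour(sample, "Drink", ratio=0.9))
```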
Other teams also provided interesting solutions to generate supervision signals from the unlabeled ARID data. For example, team Cogvengers (rank No. 3, Top1 ), which utilized R(2+1)D-Bert as their base model, adopted Temporal Contrastive Learning (TCL) (singh2021semi) for semi-supervised learning, as shown in Fig. (b). Specifically, after performing GIC enhancement, the team adopted two different instance-level contrastive losses to maximize the mutual information between clips from the same video under different frame rates. For unlabeled samples with the same pseudo-label, a group-level contrastive loss was utilized to minimize the feature distance within the group. As for AstarTrek, as shown in Fig. (c), they adopted the adversarial-based unsupervised domain adaptation method DANN (ganin2016domain) to adapt the features learned from the labeled HMDB51 data to the unlabeled data. However, they adopted a shallow backbone, ResNet-18, which might be the reason for their inferior performance (Top1 ) compared with others.
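The idea behind an instance-level contrastive loss across frame rates can be illustrated with a generic InfoNCE-style objective. The sketch below is an assumption-laden illustration, not the exact TCL formulation or the team's code: the function names, the cosine similarity, and the temperature of 0.5 are all choices made for this example. Features of the same video at a fast and a slow frame rate form the positive pair, and the slow-rate features of the other videos act as negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(fast_feats, slow_feats, temperature=0.5):
    """Mean InfoNCE loss; fast_feats[i] and slow_feats[i] come from the
    same video sampled at two different frame rates."""
    losses = []
    for i, anchor in enumerate(fast_feats):
        logits = [cosine(anchor, other) / temperature for other in slow_feats]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    fast = [[1.0, 0.0], [0.0, 1.0]]
    print(info_nce(fast, [[1.0, 0.0], [0.0, 1.0]]))  # matched pairs: low loss
    print(info_nce(fast, [[0.0, 1.0], [1.0, 0.0]]))  # swapped pairs: high loss
```

Minimizing this loss pulls the two frame-rate views of each video together while pushing apart views of different videos, which is one common way to maximize a lower bound on their mutual information.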
4.3 Analysis and Discussion
As presented above, participants have provided various solutions to tackle action recognition in dark videos for the UG2-2 challenge. While all winning solutions improved substantially over the baseline results, the best results are still lower than results on datasets of comparable scale (e.g., IXMAS (weinland2007action)). Moreover, there is a significant gap among the winning solutions. Both observations confirm the difficulty of the challenge and show that much room for improvement remains.
In summary, advancements have been made by the various challenge solutions. All winning solutions utilize deep-learning-based methods with complex backbones, trained from scratch with a long training process. Such a strategy carries a high risk of overfitting given the scale of ARID-plus, and also requires large computational resources. Therefore, though achieving notable performances, such a strategy may not be the ultimate solution for AR in dark videos. Meanwhile, though domain adaptation approaches have been popular for semi-supervised action recognition where dark videos are unlabeled, domain adaptation solutions are not the leading ones in the challenge, due to the unique characteristics of dark videos. This observation suggests that there are limitations in applying domain adaptation to dark videos directly. To further improve AR accuracy, an intuitive strategy is to apply low-light enhancement methods. However, empirical results go against such intuition.
Are image enhancement methods effective? While some low-light enhancement methods do bring improvements in accuracy, results show that the improvements are erratic. The negative effects of enhancement could be explained by its disruption of the original data distribution as well as the introduction of noise. Interestingly, the few enhancements adopted in the winning solutions may not produce the best visual enhancement results. Instead, it could be observed that these methods either preserve the character of the original data distribution or introduce less noise. Therefore, it could be argued that for any enhancement to bring substantial improvement in AR accuracy, at least one of these conditions should be met. Since less noise could contribute towards AR accuracy, denoising methods (tassano2020fastdvdnet; sheth2021unsupervised) could be examined along with the various low-light enhancement methods to suppress noise, mitigating the possible negative effects. Meanwhile, current solutions only exploit a single enhancement. To go beyond this, enhancement-invariant methods may be developed to capture underlying distributions that are not influenced by enhancement methods, which could be the key to understanding dark videos. This strategy could be implemented with various enhancement methods applied simultaneously to the dark videos, with the invariant features trained by contrastive learning (qian2021spatiotemporal; pan2021videomoco) over the enhanced results. The final classification would be performed on the extracted enhancement-invariant features.
How to reduce model complexity? To overcome the risk of overfitting and the requirement for large computational resources due to the use of large-scale deep learning methods, multiple alternative strategies could be considered. One of them is few-shot learning (kumar2019protogan; bo2020few), which enables models to be trained with limited data while generalizing to unseen test data, and has been gaining research interest for action recognition. This fits the task of AR in dark environments, and should therefore be considered a feasible alternative to the fully supervised strategy. Further, due to the insufficient number of classes in ARID-plus, winning solutions may not be capable of generalizing to videos in the wild, where most actions are unseen in ARID-plus. To overcome this shortcoming, zero-shot learning (xu2017transductive; mishra2018generative; liu2019generalized) endows AR methods with the capacity to predict unseen actions, which could better cope with real-world scenarios. Techniques such as self-supervised learning would also boost model capacity by exploiting extra information within videos, such as video speed (wang2020self) and video coherence (cao2021self). Meanwhile, to apply models in areas such as surveillance, models should be deployed on edge devices (e.g., embedded systems such as Jetson). These devices possess limited computation resources but can be deployed at scale. These attributes prevent large-scale models from being applied directly. One possible solution would be model compression (he2018amc; pan2019compressing), which aims to deploy models on low-power and resource-limited devices without a significant drop in accuracy. The ability of the compressed model to run on edge devices could help to expand the application of AR solutions in scenarios such as nighttime autonomous driving systems, where conventional hardware (i.e., GPUs and TPUs) could not be installed.
Does domain adaptation help a lot? Applying domain adaptation approaches directly to semi-supervised AR of dark videos is ineffective, largely due to the large domain gap between normal videos and videos in dark environments. Domain adaptation approaches would therefore be unable to minimize the discrepancies between different domains, or to extract domain-invariant features for transfer. Currently, most domain adaptation approaches align high-level features (ganin2016domain; ganin2015unsupervised; long2015learning; saito2018maximum), which is in accord with the fact that high-level features are utilized for the final classification task. However, large discrepancies would exist between the low-level features of normal and dark videos, given the large differences in the mean and standard deviation values of video frames. The discrepancies between low-level features would escalate the discrepancies between high-level features, therefore undermining the effort of current domain adaptation approaches to obtain transferable features from normal videos. In view of this observation, low-level features should be aligned jointly with high-level features when designing domain adaptation approaches for semi-supervised AR in dark videos.
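The gap in low-level statistics can be made concrete with a toy example. The sketch below is not a method from the challenge or from the cited works; it only illustrates, with made-up pixel values and a hypothetical `standardize` helper, how frames from the two domains differ strongly in raw mean and standard deviation, and how standardizing each domain with its own statistics removes that first-order and second-order mismatch at the input level.

```python
import statistics


def standardize(pixels):
    """Shift and scale intensities to zero mean and unit standard deviation,
    using the statistics of the given domain only."""
    mean = statistics.fmean(pixels)
    std = statistics.pstdev(pixels)
    if std == 0:
        raise ValueError("constant input cannot be standardized")
    return [(p - mean) / std for p in pixels]


if __name__ == "__main__":
    normal_frame = [100, 150, 200]  # illustrative intensities, normal video
    dark_frame = [5, 10, 15]        # illustrative intensities, dark video
    print(statistics.fmean(normal_frame), statistics.fmean(dark_frame))
    print(standardize(normal_frame))
    print(standardize(dark_frame))
```

Matching input statistics is far weaker than aligning learned low-level features, but it shows why a network that only aligns high-level features starts from very different low-level activations in the two domains.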
How to leverage multi-modality information? Besides the techniques mentioned above, it is observed that optical flow could bring performance improvement. Optical flow can be viewed as an additional modality embedded in videos, and could provide more effective information because it is essentially computed as the correlation of spatiotemporal pixels between successive frames, which is highly related to motion. However, in solutions utilizing optical flow, it is extracted with hand-crafted methods, such as TVL1, which incur a large computation cost. Hand-crafted optical flow also prohibits end-to-end training due to the need to store optical flow before subsequent training. Advances have been made in optical flow estimation with deep learning methods (ranjan2017optical; hui2018liteflownet; sun2019models) that allow optical flow estimation to be performed along with the training of feature extractors and classifiers in an end-to-end manner. However, these advances are made with normally illuminated videos, and it is worth exploring whether these models could also be applied to videos shot in dark environments. Meanwhile, with optical flow as an additional modality of information, current solutions tend to utilize optical flow independently from RGB features, with the results obtained in a late fusion fashion. Since both modalities are obtained from the same set of data, it would be worth exploring how to train with both modalities jointly through approaches such as cross-modality self-supervision (khowaja2020hybrid; sayed2018cross), which can be applied in both supervised training and cross-domain semi-supervised training (munro2020multi). Such an approach enables the network to learn features with high semantic meaning, which could lead to further improvements in AR effectiveness.
In this work, we dive deeper into the challenging yet under-explored task of action recognition (AR) in dark videos, with the introduction of a novel UG2+ Challenge Track 2 (UG2-2). UG2-2 aims to promote research on AR in challenging dark environments in both fully supervised and semi-supervised settings, improving the generalizability of AR models in dark environments. Our baseline analysis confirms the difficulty of the challenges, with poor results obtained from current AR models, enhancement methods, and domain adaptation methods. While solutions in UG2-2 have introduced promising progress, there remains large room for improvement. We hope this challenge and the current progress could draw more interest from the community to tackle AR in dark environments.
Funding: No funding was received to assist with the preparation of this manuscript.
Competing interests: The authors have no competing interests to declare that are relevant to the content of this article.