Thanks to the increasing application of automatic action recognition in various fields, such as surveillance and smart homes, action recognition tasks have received considerable attention in recent years. Although much progress has been made, current research has mostly focused on videos shot under normal illumination. This is partly because current datasets for action recognition are normally collected from web videos, which are shot mostly under normal illumination. Yet videos shot in the dark are useful in many cases, such as night surveillance and self-driving at night. Additional sensors, such as infrared or thermal imaging sensors, could be utilized for recognizing actions in the dark. However, such sensors are costly and cannot be deployed on a large scale. Hence we focus on action recognition in the dark without additional sensors. To this end, we collected a new dataset, the Action Recognition In the Dark (ARID) dataset, dedicated to the task of action recognition in dark videos. To the best of our knowledge, it is the first dataset focusing on human actions in the dark.
A large number of videos shot under normal illumination already exist in various datasets. It is intuitive to make use of these videos by creating synthetic dark videos from them. In this paper, we demonstrate the necessity of a dataset with real dark videos through a detailed analysis and comparison with synthetic dark videos. We observe distinct characteristics of real dark videos that cannot be replicated by synthetic dark videos.
Recently, neural networks, especially convolutional neural network (CNN) based solutions, have proven to be effective for various computer vision tasks. For action recognition, state-of-the-art results on previous action recognition datasets are mostly achieved through 3D-CNN based networks. To gain further understanding of the challenges of action recognition in dark videos, we analyze how dark videos affect current action recognition models. Additionally, we explore potential solutions for substantial improvements in action recognition accuracy utilizing current models.
In summary, we explore the task of action recognition in dark videos. The contribution of this work is threefold: 1) we propose a new ARID dataset, dedicated to the task of recognizing actions in dark videos; 2) we verify the importance of our ARID dataset through statistical and visual analysis and comparison with synthetic dark videos; 3) we benchmark the performance of current 3D-CNN based action recognition models on our dataset while exploring potential methods to improve accuracy with current models, and reveal challenges in the task of action recognition in dark videos.
2 Related Works
Action Recognition Datasets. There are a number of benchmark datasets in the action recognition domain. Earlier datasets, such as KTH and Weizmann, contain a relatively small number of action classes. As the performance of proposed methods on these smaller datasets rapidly increased, larger and more challenging datasets were introduced. These include HMDB51, UCF101 and Kinetics. In particular, the Kinetics dataset, with 400 action classes and more than 160,000 clips in total, has become the primary choice. Though these datasets cover a wide range of actions, the clips are mostly collected from web videos recorded under normal illumination. Hence, to study action recognition performance in dark videos, we collected a new video dataset dedicated to videos shot in the dark.
Dark Visual Datasets. Recently, there has been a rise of research interest in computer vision tasks in dark environments, such as face recognition in the dark. Research on visual tasks in dark environments is partly supported by the various dark visual datasets that have been introduced. Most of these datasets focus on image enhancement and denoising tasks, where the goal is to visually enhance dark images for a clearer view. These include the LOL Dataset and SID. More recently, this enhancement task has been extended to the video domain, with new datasets including DRV and SMOID. Although both datasets contain dark videos, their focus is on enhancing the visibility of video frames. The scenes are randomly shot and may not include specific human actions. In contrast, our ARID dataset focuses on classifying different human actions in dark videos.
3 Action Recognition In the Dark Dataset
Although a small number of videos taken in the dark do exist in current action recognition datasets, such as Kinetics and HMDB51, the task of human action recognition in dark environments has rarely been studied. This is partly due to the very low proportion of dark videos in current benchmark datasets, and a lack of datasets dedicated to action analysis in the dark. To address this lack of dark video data, we introduce a new Action Recognition In the Dark (ARID) dataset. In this section, we give an overview of the dataset from three perspectives: the action classes, the process of data collection, and some basic statistics of our ARID dataset.
Action Classes. The ARID dataset includes a total of 11 common human action classes. The list of action classes can be categorized into two types: Singular Person Actions, which includes jumping, running, turning, walking and waving; and Person Actions with Objects, which includes drinking, picking, pouring, pushing, sitting and standing. Figure 1 shows the sample frames for each of the 11 action classes in the ARID dataset.
Data Collection. The video clips in the ARID dataset are collected using 3 different commercial cameras. The clips are shot strictly during night hours. All clips are collected from a total of 11 volunteers, of whom 8 are male and 3 are female. We collected the clips in 9 outdoor scenes and 9 indoor scenes, such as carparks, corridors and playing fields for outdoor scenes, and classrooms and laboratories for indoor scenes. The lighting condition of each scene is different, with no direct light cast on the actor in almost all videos. In many cases, it is challenging even for the naked eye to recognize the human action without tuning the raw video clips.
Basic Statistics. The ARID dataset contains a total of 3,784 video clips, with each class containing at least 110 clips. The clips of a single action class are divided into 12-18 groups, with each group containing no fewer than 7 clips. The clips in the same group share some similar features, such as being shot under similar lighting conditions or with the same actor. Figure 2 shows the distribution of the number of clips among all the classes. The training and testing sets are partitioned by splitting the clip groups, with a ratio of 7:3. We selected three train/test splits, such that each group would have an equal chance to be present in either the train partition or the test partition.
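The group-wise partition described above can be sketched as follows. This is a minimal illustration rather than the split procedure actually used for ARID: the function name `split_by_group`, the `(clip_id, group_id)` input layout and the seeded shuffle are all assumptions made for the example. What it shows is that whole groups, not individual clips, are assigned to a partition, so clips sharing an actor or lighting condition never appear in both train and test.

```python
import random
from collections import defaultdict


def split_by_group(clips, train_ratio=0.7, seed=0):
    """Split (clip_id, group_id) pairs so that each group lands in one partition.

    Hypothetical helper: the names and the seeded shuffle are illustrative only.
    """
    groups = defaultdict(list)
    for clip_id, group_id in clips:
        groups[group_id].append(clip_id)

    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    # Assign roughly train_ratio of the groups (not of the clips) to training.
    n_train = round(train_ratio * len(group_ids))
    train_groups = set(group_ids[:n_train])

    train = [c for g in group_ids if g in train_groups for c in groups[g]]
    test = [c for g in group_ids if g not in train_groups for c in groups[g]]
    return train, test, train_groups


# Toy data: 10 groups of 7 clips each for a single action class.
toy = [(f"g{g:02d}_c{c}", g) for g in range(10) for c in range(7)]
train, test, train_groups = split_by_group(toy)
```

Calling the function with three different seeds would give three splits in the spirit of the protocol above, although the exact selection rule used for ARID is not specified here.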
The video clips are fixed to a frame rate of 30 FPS with a resolution of . The minimum clip length is 1.2 seconds with 36 frames, and the duration of the whole dataset is 8,721 seconds. The videos are saved in .avi format and are compressed using the DivX codec.
4 Experiments and Discussions
In this section, we gain further understanding of our proposed dataset through a detailed analysis of the ARID dataset. The main objectives are twofold: 1) validate the necessity of a video dataset collected in a real dark environment and 2) provide a benchmark for current action recognition models while revealing the challenges of action recognition in dark videos. In the following, we first introduce the experiment settings along with the construction of a synthetic dark video dataset. We then introduce the methods used to enhance dark video frames in ARID in an effort to improve action recognition accuracy. Finally, we analyze our ARID dataset in detail from three perspectives: statistical and visual analysis of ARID, analysis of ARID classification results, and visualization of features extracted from ARID.
4.1 Experimental Settings
To obtain the action recognition results on our ARID dataset, we utilize 3D-CNN based models implemented in PyTorch. For all experiments, the inputs to our 3D-CNN based models are sequences of 16 sampled frames with each frame resized to . To accelerate training, we initialize the 3D-CNN based models with weights pretrained on the Kinetics dataset when available. Due to constraints in computation power, a batch size of 16 is applied to all experiments. The action recognition results are reported as the average top-1 and average top-5 accuracies over the three splits.
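The input sampling and the reported metrics can be illustrated with a short sketch. It is not the training pipeline used in the experiments: uniform temporal sampling is an assumption (the sampling strategy is not stated above), and the helper names `sample_frame_indices` and `topk_accuracy` are hypothetical. NumPy arrays stand in for model outputs.

```python
import numpy as np


def sample_frame_indices(num_frames, clip_len=16):
    """Pick clip_len evenly spaced frame indices (uniform sampling is an assumption)."""
    return np.linspace(0, num_frames - 1, clip_len).round().astype(int)


def topk_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = (topk == labels[:, None]).any(axis=1)
    return float(hits.mean())


# A 36-frame clip (the shortest clip length in ARID) sampled down to 16 frames.
indices = sample_frame_indices(36)

# Toy scores for 4 clips over 11 classes, standing in for one split's model outputs.
rng = np.random.default_rng(0)
scores = rng.random((4, 11))
labels = scores.argmax(axis=1)          # make every top-1 prediction correct
labels[0] = scores[0].argsort()[-2]     # ...except clip 0, whose label is ranked second

top1 = topk_accuracy(scores, labels, k=1)   # 3 of 4 clips correct
top5 = topk_accuracy(scores, labels, k=5)   # 4 of 4 clips correct
```

The figures reported for a model would then be the mean of `top1` and of `top5` over the three train/test splits.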
Compared to collecting a new dataset for the dark environment, it is more intuitive to synthesize dark videos from current video datasets, which mainly consist of videos shot under normal illumination. To showcase the necessity of a real dark video dataset, we compare a synthetic dark video dataset with our ARID. The synthetic dark video dataset is constructed from HMDB51 and is denoted as HMDB51-dark. We synthesize dark videos by gamma intensity correction, formulated as:
$$D(t, x, y) = I(t, x, y)^{1/\gamma}$$
where $D(t, x, y)$ is the value of the pixel in the synthetic dark video located at spatial location $(x, y)$ in the $t$-th frame, and $I(t, x, y)$ is the value of the corresponding pixel in the original video. Both $D(t, x, y)$ and $I(t, x, y)$ are in the range of $[0, 1]$. $\gamma$ is the parameter that controls the degree of darkness in the synthetic dark video, typically in the range of , where a smaller value results in lower pixel values, producing darker synthetic videos.
We note that the dark videos collected in our ARID are shot under different illumination conditions. To mimic the differences in illumination, we apply different $\gamma$ values when synthesizing dark videos. More specifically, the $\gamma$ value is drawn randomly from a normal distribution $\mathcal{N}(\mu, \sigma^2)$ with the constraint of . Here the mean $\mu$ is set to 0.2 and the standard deviation $\sigma$ is set to 0.07. Figure 3 compares sample frames of videos from the original HMDB51 dataset with sample frames from the corresponding synthetic dark videos.
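The synthesis step can be sketched in a few lines of NumPy. This is a sketch under stated assumptions, not the script used to build HMDB51-dark: it assumes the power-law form D = I^(1/gamma) on pixel values in [0, 1], the lower bound `gamma_min` is a placeholder chosen for the example, and enforcing the constraint by redrawing (rather than clipping) is likewise an assumption.

```python
import numpy as np


def sample_gamma(rng, mean=0.2, std=0.07, gamma_min=0.1):
    """Draw gamma from N(mean, std^2), redrawing until it is at least gamma_min.

    gamma_min is a placeholder lower bound for this sketch, not a confirmed value.
    """
    while True:
        gamma = rng.normal(mean, std)
        if gamma >= gamma_min:
            return gamma


def darken(video, gamma):
    """Apply D = I ** (1 / gamma) to a float video with values in [0, 1]."""
    return np.power(video, 1.0 / gamma)


rng = np.random.default_rng(0)
video = rng.random((16, 8, 8, 3))      # toy clip: frames x height x width x RGB
gamma = sample_gamma(rng)              # one gamma per video mimics varied illumination
dark = darken(video, gamma)
```

Because every value lies in [0, 1] and the sampled gamma is below 1, the exponent 1/gamma exceeds 1 and each pixel can only stay the same or become darker.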
4.2 Frame Enhancement Methods
For humans to better recognize actions in dark videos, an intuitive method is to enhance each dark video frame. In this paper, we investigate the effect of applying different frame enhancement methods to ARID on current action recognition models. We applied five frame enhancement methods: Histogram Equalization (HE), Gamma Intensity Correction (GIC), LIME, BIMEF and KinD. Among them, HE and GIC are traditional image enhancement methods. HE produces higher contrast images, whereas GIC is used to adjust the luminance of images. Both LIME and BIMEF are based on the Retinex theory, which assumes that images are composed of reflection and illumination. LIME estimates the illumination map of dark images while imposing a structure prior on the initial illumination map, while BIMEF proposes a multi-exposure fusion algorithm. KinD is a deep neural network-based method utilizing a two-stream structure for reflectance restoration and illumination adjustment. KinD is implemented with weights pretrained on the LOL Dataset. The results of applying the above methods to the ARID dataset are denoted as ARID-HE, ARID-GIC, ARID-LIME, ARID-BIMEF, and ARID-KinD respectively. GIC is also applied to the synthetic dark dataset HMDB51-dark, and the result is denoted as HMDB51-dark-GIC.
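The two traditional methods are simple enough to sketch directly. The code below is an illustration only and makes several assumptions: GIC is taken to be the power law frame^(1/gamma) on values in [0, 1], and HE is shown as the textbook single-channel variant that remaps intensities through the normalized cumulative histogram, which may differ from the colour HE variant used in the experiments. The function names are hypothetical.

```python
import numpy as np


def gamma_intensity_correction(frame, gamma=5.0):
    """Brighten a float frame in [0, 1] via frame ** (1 / gamma), with gamma > 1."""
    return np.power(frame, 1.0 / gamma)


def histogram_equalization(channel):
    """Equalize one uint8 channel by remapping intensities through the normalized CDF."""
    hist = np.bincount(channel.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    lut = np.round((cdf - cdf_min) / max(cdf[-1] - cdf_min, 1) * 255)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[channel]


# Toy dark frame: intensities confined to the lowest tenth of the range.
rng = np.random.default_rng(0)
dark_u8 = rng.integers(0, 26, size=(32, 32), dtype=np.uint8)

gic = gamma_intensity_correction(dark_u8 / 255.0)   # raises luminance
he = histogram_equalization(dark_u8)                # stretches contrast to 0..255
```

On this toy frame GIC raises the mean intensity, while HE spreads the occupied intensities across the full range, which is the qualitative difference between the two methods described above.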
| Dataset | RGB Mean Values | RGB Standard Deviations |
|---|---|---|
| ARID | [0.0796, 0.0739, 0.0725] | [0.1005, 0.0971, 0.0899] |
| ARID-GIC | [0.5473, 0.5418, 0.5391] | [0.1101, 0.1102, 0.1022] |
| HMDB51 | [0.4248, 0.4082, 0.3676] | [0.2695, 0.2724, 0.2779] |
| HMDB51-dark | [0.0979, 0.0884, 0.0818] | [0.1836, 0.1840, 0.1789] |
| HMDB51-dark-GIC | [0.4904, 0.4816, 0.4588] | [0.3593, 0.3600, 0.3486] |
4.3 Statistical and Visual Analysis of ARID
To better understand real dark videos and why they are necessary, we compute and compare the statistics of the ARID dataset with those of the HMDB51 dataset and the synthetic HMDB51-dark dataset. Table 1 presents the mean and standard deviation values of the datasets ARID, ARID-GIC, HMDB51, HMDB51-dark and HMDB51-dark-GIC. The gamma values for ARID-GIC and HMDB51-dark-GIC are both set to 5.
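A minimal sketch of how such per-channel statistics might be computed is given below. It assumes the statistics are pooled over every pixel of every frame with values scaled to [0, 1]; whether the reported numbers were pooled this way or averaged per video is not stated, and the helper name `rgb_mean_std` is hypothetical.

```python
import numpy as np


def rgb_mean_std(frames):
    """Per-channel mean and standard deviation over all pixels of all frames.

    frames: float array of shape (N, H, W, 3) with values in [0, 1].
    """
    pixels = frames.reshape(-1, 3)
    return pixels.mean(axis=0), pixels.std(axis=0)


rng = np.random.default_rng(0)
normal = rng.random((4, 16, 16, 3))    # toy stand-in for normally lit frames
dark = normal ** 5                     # darkened copy: I ** (1 / gamma) with gamma = 0.2

normal_mean, normal_std = rgb_mean_std(normal)
dark_mean, dark_std = rgb_mean_std(dark)
```

The mean serves as a proxy for brightness and the standard deviation as a proxy for contrast, which is how the two columns of Table 1 are read in the analysis that follows.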
The mean and standard deviation values of ARID shown in Table 1 depict the characteristics of videos in our ARID dataset. The RGB mean and standard deviation values of the ARID dataset are both lower than those of the original HMDB51 dataset. This indicates that video frames in ARID are lower in brightness and contrast compared to video frames in HMDB51. This is further supported by the comparison of sampled frames and their RGB and Y histograms between the ARID and HMDB51 datasets, as shown in Figure 4(a) and (c). The lower brightness and lower contrast of video frames in ARID make it challenging even for the naked eye to identify the actions.
We observe that our real dark dataset ARID and the synthetic dark dataset HMDB51-dark are very similar in terms of the RGB mean values. This shows, in part, that our synthesis operation mimics the real dark environment well. However, further comparison in terms of RGB standard deviation values indicates that the real dark dataset ARID is still lower in contrast. This matches the comparison between the sampled frames of ARID and HMDB51-dark, as shown in Figure 4(a) and (d), where we observe that videos from HMDB51-dark are visually more distinguishable. We argue that this is because bright pixels in the original HMDB51 dataset produce corresponding output pixels with relatively high values in the synthetic dark videos. This raises the standard deviation of HMDB51-dark, which in turn is reflected as frames with higher contrast.
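A small worked example may make this argument concrete. It assumes the power-law form of gamma intensity correction, $D = I^{1/\gamma}$, with $\gamma = 0.2$ (the mean of the sampling distribution in Section 4.1); the three input intensities are chosen purely for illustration.

```latex
\begin{align*}
I = 0.9 &\;\Rightarrow\; D = 0.9^{5} \approx 0.590 \\
I = 0.5 &\;\Rightarrow\; D = 0.5^{5} \approx 0.031 \\
I = 0.2 &\;\Rightarrow\; D = 0.2^{5} \approx 0.0003
\end{align*}
```

Under these assumptions, mid-tones and shadows collapse toward zero while a bright pixel remains clearly visible, so a frame containing bright regions keeps a wide spread of pixel values after darkening.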
As mentioned in Section 4.2, the GIC method can enhance frames by adjusting their luminance. By setting the gamma value greater than 1, the resulting pixel value after applying the GIC method should be larger than the input pixel value. This is confirmed by the larger RGB mean values of ARID-GIC and HMDB51-dark-GIC compared to the ARID and HMDB51-dark datasets. The sampled frames shown in Figure 4(a) and (b) also confirm that GIC enhancement greatly increases the visibility of each video frame. The person seen running cannot be clearly observed by the naked eye in Figure 4(a), whereas the person becomes more visible in Figure 4(b).
Though the comparison of sampled frames across Figure 4(a)(b) and (d)(e) shows the effectiveness of GIC enhancement in increasing the luminance of dark videos, there is still a significant difference between ARID-GIC and HMDB51-dark-GIC. The most significant difference is that the standard deviation of ARID-GIC is much smaller than that of HMDB51-dark-GIC. This indicates that videos in ARID-GIC are still low in contrast after the GIC enhancement. This is confirmed by comparing the sampled frames shown in Figure 4(b) and (e), where we observe that the sampled frame from ARID-GIC looks pale compared to that from HMDB51-dark-GIC.
From the above observations, we can summarize the main characteristics of the real dark videos collected in our ARID dataset: low brightness and low contrast. Though the characteristic of low brightness can be mimicked by synthetic dark videos, the characteristic of low contrast cannot be easily mimicked. This is partly due to the bright backgrounds and pixels that commonly exist in videos shot under normal illumination. The above analysis confirms that real dark videos are irreplaceable for the task of action recognition in a dark environment.
|Method||Top-1 Accuracy||Top-5 Accuracy|
4.4 Classification Results on ARID
In this section, we illustrate how current action recognition models perform in the task of action recognition in the dark on our ARID dataset. We further explore potential ways to improve the performance of action recognition in real dark videos, and reveal some challenges of action recognition in dark videos. The performance of current competitive 3D-CNN-based action recognition models is presented in Table 2, which includes C3D, 3D-ShuffleNet, 3D-SqueezeNet, 3D-ResNet-18, Pseudo-3D-199, Res50-I3D and 3D-ResNext-101.
The results in Table 2 show that among the current action recognition models, 3D-ResNext-101 performs the best with a top-1 accuracy of . We notice that the top-5 accuracy is relatively high for all methods, which is partly because of the small number of classes in our dataset. We also notice that although our dataset is relatively small and has fewer classes than current normal illumination video datasets, there is plenty of room for improvement in accuracy. To explore potential ways of further improving accuracy for dark videos, we choose 3D-ResNext-101 as the baseline for experiments. An intuitive method for improving accuracy is the use of the frame enhancement methods introduced in Section 4.2. To test whether frame enhancement methods could improve accuracy, we first employ the GIC method on the synthetic HMDB51-dark dataset, owing to its larger data size and the ease of obtaining dark data from current datasets. The performance of 3D-ResNext-101 on the synthetic dataset HMDB51-dark and its GIC enhanced counterpart HMDB51-dark-GIC is presented in Table 3.
The results as presented in Table 3 show a sharp decrease in classification accuracy when the same network is utilized for the dark data. The decrease is expected, given that dark videos contain fewer details as shown in Figure 3. Besides this, we also notice a significant increase of in accuracy when the GIC method is applied to enhance the dark video frames. As the synthetic data is darkened with random gamma values while the GIC enhancement utilizes a fixed gamma value, it is nearly impossible to recover the original videos. Despite this, the GIC operation still brings a significant amount of accuracy improvement.
The success of applying a frame enhancement method to increase classification accuracy in synthetic dark videos gives us a hint on potential ways to improve accuracy for action recognition in real dark videos. To verify whether the same GIC method could also improve action recognition accuracy on our ARID dataset, we perform experiments on the GIC enhanced ARID dataset, ARID-GIC, utilizing 3D-ResNext-101. The result is presented in Table 4.
The results in Table 4 illustrate that the action recognition accuracy on our ARID improves through GIC enhancement, thanks to the increase in the illumination of each video frame as presented in Figure 4. The increase in accuracy is consistent with the findings on the synthetic dark dataset HMDB51-dark. However, we also notice that the improvement of performance by using GIC is only , which is rather limited compared to the improvement on the synthetic dark dataset. As GIC is based on a simple exponential calculation, we further examine whether more sophisticated frame enhancement methods could further improve action recognition accuracy. We thus examine the accuracy on the datasets ARID-HE, ARID-LIME, ARID-BIMEF and ARID-KinD, which are the outputs of the frame enhancement methods HE, LIME, BIMEF and KinD respectively. The results are also presented in Table 4.
Interestingly, Table 4 illustrates that not all frame enhancement methods result in improvements in action recognition accuracy in dark videos. Of all the frame enhancement methods, the largest improvement is achieved by the GIC method, whereas the accuracy drops the most with the recent deep learning-based method KinD. To gain a better understanding of the differences between the outcomes of the different enhancement methods, we visualize the frame output of each enhancement method. Figure 5 presents sampled frames of the output of the above enhancement methods for the same input ARID video frame.
Figure 5 clearly shows that, visually, the outputs of all frame enhancement methods improve the visibility of the video. The actor who is running can be seen clearly in all sampled frames except the sampled frame from the original video in ARID. However, the sampled frame of ARID-GIC does not appear to be the best enhancement visually, as it is still low in contrast. In comparison, all other methods produce higher contrast images, as confirmed by the RGB histograms in Figure 5. This indicates that frame enhancement which clearly improves dark video frames visually may not bring improvement in action recognition accuracy for dark videos. We argue that some enhancements can be regarded as artifacts or adversarial attacks on videos. Though enhanced frames are clearer visually, some enhancements break the original distribution of videos and introduce noise. The change in distribution and introduction of noise could lead to a decrease in performance for action recognition models.
4.5 Feature Visualization with ARID
To further understand the performance of current action recognition models on ARID and analyze the effect of dark videos on current models, we extract and visualize features at the last convolution layer using 3D-ResNext-101. The features are visualized as Class Activation Maps (CAM), which depict the focus of the model with respect to the given prediction. Figure 7 and Figure 6 compare the sampled frames from the ARID and HMDB51 datasets, with the corresponding CAMs. We observe that for the frames in HMDB51 with normal illumination as shown in Figure 7, the 3D-ResNext-101 model is able to focus on the actors, whereas for the dark video, the model focuses more on the background. For example, for the action shown in Figure 6(a)(left), the network classifies the action as "Jumping" by focusing on the background, whose details are uncovered as the person jumps backward. The CAM therefore shows that the network focuses on a narrow beam in the background. The focus on the background instead of the actor could be partly due to the fact that clear outlines of actors rarely exist in dark videos.
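The CAM computation itself can be sketched compactly. The code below follows the standard CAM formulation (feature maps of the last convolution layer weighted by the linear classifier weights of the chosen class) and is not taken from the experiments: it assumes the network ends in global average pooling followed by a single linear layer, uses random NumPy arrays in place of real 3D-ResNext-101 activations, and omits the upsampling of the map to frame resolution.

```python
import numpy as np


def class_activation_map(features, fc_weights, class_idx):
    """Weight the final conv feature maps by the classifier weights of one class.

    features:   (C, T, H, W) activations of the last convolution layer
    fc_weights: (num_classes, C) weights of the linear layer after global average pooling
    Returns a (T, H, W) map rescaled to [0, 1].
    """
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))
    cam = cam - cam.min()
    peak = cam.max()
    return cam / peak if peak > 0 else cam


rng = np.random.default_rng(0)
features = rng.random((8, 2, 4, 4))    # toy activations: 8 channels, 2 temporal steps
fc_weights = rng.normal(size=(11, 8))  # 11 classes, as in ARID

# Logits from global average pooling followed by the linear layer (bias omitted).
logits = fc_weights @ features.mean(axis=(1, 2, 3))
cam = class_activation_map(features, fc_weights, int(logits.argmax()))
```

High values in `cam` mark the spatio-temporal locations that contribute most to the predicted class, which is what the heat maps over the sampled frames visualize.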
As shown in Table 4, certain frame enhancement methods can positively affect the final classification accuracy. To gain further understanding of how the different frame enhancement methods actually affect the action recognition models, we compare the CAMs for the same sampled frame from the five frame enhanced ARID datasets, as shown in Figure 6. Compared with the original video frame, the outline of the actor is much clearer in all enhanced frames. We observe that the focus area of the network is more concentrated than in the CAM of the original frame. Additionally, we observe some offset between the focus of the network on the frame enhanced sample frames and the actual actor. In comparison, the CAMs of HMDB51 video frames show that the network focus centers on the actors. This may partly explain the inability of frame enhancement methods to improve action recognition accuracy even though the network focuses on a more concentrated area of each video frame.
From the results and analysis presented above, we can draw three major conclusions about the task of action recognition in the dark. First, videos taken in a dark environment are characterized by their low brightness and low contrast. As the characteristic of low contrast cannot be fully synthesized, synthetic dark videos cannot be directly applied to action recognition in the dark. Second, though current frame enhancement methods can produce visually clearer video frames, the accuracy improvements obtained for current action recognition models after enhancing dark video frames are rather limited. Some frame enhancement methods even deteriorate classification accuracy, since some enhancements can be regarded as artifacts or adversarial attacks on videos. Breaking the original distribution of videos might decrease the performance of a statistical model. Better frame enhancement methods may be helpful in improving action recognition accuracy in dark videos. Third, in many dark videos, current action recognition models fail to focus on the actor for classification. This might be caused by the unclear outlines of actors. For frame enhanced dark videos, action recognition models tend to focus more on the actors, but the focus still contains offsets. We believe that better action recognition models with a better ability to focus on actors, especially those with unclear outlines, could be a critical part of improving action recognition accuracy in dark videos. These conclusions contribute to exploring more effective solutions for ARID.
In this work, we introduced the Action Recognition In the Dark (ARID) dataset, which is, as far as we are aware, the first dataset dedicated to the task of action recognition in the dark. ARID includes 3,784 video clips in 11 action categories. To understand the challenges behind real dark videos, we analyzed our ARID dataset from three perspectives: statistics, classification results, and feature visualization. We discovered distinct characteristics of real dark videos that are different from synthetic dark videos. Our analysis shows that current action recognition models and frame enhancement methods are not effective enough in recognizing actions in dark videos. We hope this study will draw more interest to the task of action recognition in the dark.
- Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308.
- (2019) Seeing motion in the dark. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3185–3194.
- (2018) Learning to see in the dark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3291–3300.
- (2007) Actions as space-time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (12), pp. 2247–2253.
- (2016) LIME: low-light image enhancement via illumination map estimation. IEEE Transactions on Image Processing 26 (2), pp. 982–993.
- Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6546–6555.
- (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360.
- (2019) Learning to see moving objects in the dark. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7324–7333.
- (2019) Resource efficient 3D convolutional neural networks. arXiv preprint arXiv:1904.02422.
- (2011) HMDB: a large video database for human motion recognition. In 2011 International Conference on Computer Vision, pp. 2556–2563.
- (1977) The retinex theory of color vision. Scientific American 237 (6), pp. 108–129.
- (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
- (2017) Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5533–5541.
- (2004) Recognizing human actions: a local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), Vol. 3, pp. 32–36.
- (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
- (1992) Color image enhancement through 3-D histogram equalization. In Proceedings, 11th IAPR International Conference on Pattern Recognition, Vol. III, Conference C: Image, Speech and Signal Analysis, pp. 545–548.
- (2015) Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497.
- (2018) A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6450–6459.
- (2018) Deep retinex decomposition for low-light enhancement. In British Machine Vision Conference.
- (2018) Device-free occupant activity sensing using WiFi-enabled IoT devices for smart homes. IEEE Internet of Things Journal 5 (5), pp. 3991–4002.
- (2017) A new image contrast enhancement algorithm using exposure fusion framework. In International Conference on Computer Analysis of Images and Patterns, pp. 36–46.
- (2019) Kindling the darkness: a practical low-light image enhancer. In Proceedings of the 27th ACM International Conference on Multimedia, MM ’19, New York, NY, USA, pp. 1632–1640.
- Learning deep features for discriminative localization. In Computer Vision and Pattern Recognition.
- (2019) WiFi and vision multimodal learning for accurate and robust device-free human activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.