
HARDVS: Revisiting Human Activity Recognition with Dynamic Vision Sensors

by Xiao Wang, et al.

Mainstream human activity recognition (HAR) algorithms are developed for RGB cameras, which suffer from sensitivity to illumination, motion blur under fast motion, privacy concerns, and high energy consumption. Meanwhile, biologically inspired event cameras have attracted great interest due to their unique features, such as high dynamic range, dense temporal but sparse spatial resolution, low latency, and low power. Because the sensor is new, there is not yet a realistic large-scale dataset for HAR. Given its practical value, in this paper we propose a large-scale benchmark dataset to bridge this gap, termed HARDVS, which contains 300 categories and more than 100K event sequences. We evaluate and report the performance of multiple popular HAR algorithms, which provides extensive baselines for future work. More importantly, we propose a novel spatial-temporal feature learning and fusion framework, termed ESTF, for event-stream-based human activity recognition. It first projects the event streams into spatial and temporal embeddings using StemNet, then encodes and fuses the dual-view representations using Transformer networks. Finally, the dual features are concatenated and fed into a classification head for activity prediction. Extensive experiments on multiple datasets validate the effectiveness of our model. Both the dataset and source code will be released on <>.


1 Introduction

With the rapid development of smart cities, recognizing human behavior (i.e., Human Activity Recognition, HAR) accurately and efficiently is becoming an urgent task. Most researchers develop HAR algorithms [kong2018humanARSurvey, ahmad2021graph] based on RGB cameras, which are widely deployed and make data collection easy. With the help of large-scale benchmark datasets [gu2018ava, kay2017kinetics, caba2015activitynet, kuehne2011hmdb, monfort2019moments, soomro2012ucf101, sigurdsson2016hollywood] and deep learning, HAR in regular scenarios has been studied to some extent. However, the storage, transmission, and analysis of surveillance videos limit practical systems that rely on RGB sensors. In more detail, standard RGB cameras have a limited frame rate (e.g., 30 FPS), which makes it hard to capture fast-moving objects and makes the video prone to motion blur. The low dynamic range (60 dB) makes RGB sensors work poorly in low illumination. RGB video also suffers from high redundancy between nearby frames, which requires more storage and energy. Privacy protection also greatly limits its deployment. Therefore, a natural question is: do we have to recognize human activities using RGB sensors?

Figure 1: Comparison between existing datasets and our proposed HARDVS dataset for event based video classification.

Recently, biologically inspired sensors (called event cameras), such as DAVIS [brandli2014240], CeleX [chen2019celexV], ATIS [posch2010qvga], and PROPHESEE, have drawn more and more attention from researchers. Different from RGB cameras, which record light in a synchronous way (i.e., as video frames), event cameras output events (or spikes) asynchronously, and these events correspond to illumination variation. In other words, each pixel of an event camera independently records a binary value only when the change in light exceeds a threshold. Events for the increase and decrease of illumination are called ON and OFF events, respectively. Due to this unique sampling mechanism, the asynchronous events are spatially sparse but temporally dense. An event camera is less affected by motion blur and is therefore suitable for capturing fast-moving human actions, such as a magician's fast-moving palm or the movements of sports players. It has a higher dynamic range (120 dB) and lower latency, which enables it to work well even in low illumination compared with standard RGB cameras. In addition, the storage and energy consumption are also significantly reduced [gallegoevent, wang2021visevent, Li2022vidardvsDet, zhu2022eventsnn, zhu2021neuspike]. Event streams highlight contour information and protect personal privacy to a large extent. Based on these observations, we are inspired to address human activity recognition in the wild using event cameras. A comparison of the imaging principles of the color frame and event camera is illustrated in Fig. 2.
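The sampling mechanism described above can be illustrated with a minimal sketch for a single pixel. It assumes an idealized sensor model: the function name `generate_events`, the contrast threshold of 0.2, and the rule of emitting one event per threshold crossing of the log intensity are illustrative assumptions, not the behavior of any specific camera.

```python
import math

def generate_events(intensities, timestamps, threshold=0.2):
    """Emit (t, polarity) events for one pixel from sampled intensities.

    Assumption: an idealized pixel that fires whenever the log intensity
    moves by `threshold` away from the level of the last emitted event.
    """
    events = []
    ref = math.log(intensities[0])  # log intensity at the last emitted event
    for value, t in zip(intensities[1:], timestamps[1:]):
        diff = math.log(value) - ref
        # One event per threshold crossing: +1 is an ON event, -1 an OFF event.
        while abs(diff) >= threshold:
            polarity = 1 if diff > 0 else -1
            events.append((t, polarity))
            ref += polarity * threshold
            diff = math.log(value) - ref
    return events

samples = [1.0, 1.0, 1.5, 1.5, 0.9]  # hypothetical brightness samples
times = [0, 1, 2, 3, 4]
print(generate_events(samples, times))
```

No event is produced while the brightness is constant, which is why the resulting stream is spatially sparse and carries mostly information about moving contours.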

Figure 2: Comparison between the imaging principles of the color frame and event stream.

Several benchmark datasets have already been proposed for classification [bi2020graph, amir2017low, li2017cifar10, serrano2015poker, kuehne2011hmdb, soomro2012ucf101, kliper2011action]. However, most of them are simulated/synthetic datasets that are transformed from RGB videos with a simulator. Some researchers obtain the event data by recording a screen while it displays RGB videos. Such datasets can hardly reflect the features of event cameras in real-world scenarios, especially fast-motion and low-light scenarios. ASL-DVS [bi2020graph], proposed by Bi et al., can only be used for hand gesture recognition with 24 classes. DvsGesture [amir2017low] is also limited by its scale and number of categories in the deep learning era. In addition, some datasets have become saturated in performance; for example, the accuracy reported by Wang et al. [wang2019steventclouds] on the DvsGesture [amir2017low] dataset is already very high. Therefore, the research community still has a pressing need for a large-scale HAR benchmark dataset recorded in the wild.

In this paper, we propose a large-scale benchmark dataset, termed HARDVS, to address the problem of the lack of real event data. Specifically, the HARDVS dataset contains more than 100K video clips recorded with a DAVIS346 camera, each of them lasting for about 5-10 seconds. It contains 300 categories of human activities in daily life, such as drinking, riding a bike, sitting down, washing hands. The following factors are taken into account to make our data more diverse, including multi-views, illuminations, motion speed, dynamic background, occlusion, flashing light, photographic distance. To the best of our knowledge, our proposed HARDVS is the first large-scale and challenging benchmark dataset for human activity recognition in the wild. A comparison between existing recognition datasets and our HARDVS is illustrated in Fig. 1.

Based on our newly proposed HARDVS dataset, we construct a novel event-based human action recognition framework, termed ESTF (Event-based Spatial-Temporal Transformer). As shown in Fig. 4, the ESTF transforms the event streams into spatial and temporal tokens and learns the dual features by employing SpatialFormer (SF) and TemporalFormer (TF) respectively. Further, we propose a FusionFormer to realize the message passing between the spatial and temporal features. The aggregated representation is added with features of dual branches as the input for subsequent learning blocks, respectively. The outputs will be concatenated and fed into two MLP layers for the final action prediction.

To sum up, the contributions of this paper can be summarized in the following three aspects:

We propose a large-scale neuromorphic dataset for human activity recognition, termed HARDVS. It contains more than 100K samples with 300 classes, and fully reflects the challenging factors in the real world. To the best of our knowledge, it is the first large-scale realistic neuromorphic dataset for HAR.

We propose a novel Event-based Spatial-Temporal Transformer (ESTF) approach for human action recognition by exploiting spatial and temporal feature learning and fusing them with Transformer networks. It is the first Transformer based spatial-temporal representation learning framework for event stream-based HAR.

We re-train and report the performance of multiple popular HAR algorithms, which provide extensive baselines for future works to compare on the HARDVS dataset. Extensive experiments on multiple event-based classification datasets fully demonstrate the effectiveness of our proposed ESTF approach.

2 Related Work

HAR with Event Sensors. Compared with RGB cameras, few researchers focus on event camera-based HAR [amir2017low, clady2017motion, chen2021novel, baby2017dynamic]. Amir et al. [amir2017low] propose the first gesture recognition system based on the TrueNorth neurosynaptic processor. Clady et al. [clady2017motion] propose an event-based luminance-free feature for local corner detection and global gesture recognition. Chen et al. [chen2021novel] propose a hand gesture recognition system based on DVS and also design a wearable glove with a high-frequency active LED marker that fully exploits the sensor's properties. A retinomorphic event-driven representation (EDR) is proposed by Chen et al. [chen2019fast], which realizes three important functions of the biological retina, i.e., the logarithmic transformation, ON/OFF pathways, and integration of multiple timescales. The authors of [lagorce2016hots] represent the recent temporal activity within a local spatial neighborhood, and utilize the rich temporal information provided by events to create contexts in the form of time-surfaces, termed HOTS, for the recognition task. Wu et al. first transform the event flow into images, then predict the human pose and combine it with event images for HAR [wu2020multipath]. Graph neural networks (GNNs) and spiking neural networks (SNNs) are also exploited for event-based recognition [george2020reservoir, mehr2019action, samadzadeh2020convsnn, li2018deepCNN, ceolini2020hand, panda2018learning, liu2020unsupervised, xing2020new, chen2020dyGCN, wang2021eventGNN]. Specifically, Chen et al. [chen2020dyGCN] treat the event flow as a 3D point cloud and use dynamic GNNs to learn the spatial-temporal features for gesture recognition. Wang et al. [wang2021eventGNN] adopt GNNs and CNNs for gait recognition. Xing et al. design a spiking convolutional recurrent neural network (SCRNN) architecture for event-based sequence learning [xing2020new]. According to our observations, these works are evaluated only on simple HAR datasets or simulated datasets. It is necessary and urgent to introduce a large-scale HAR dataset for evaluation.

Event Benchmark Datasets for HAR. As shown in Table 1, most of the existing event camera-based datasets for recognition are artificial datasets. Usually, the researchers display RGB HAR datasets on a large screen and record the activity with neuromorphic sensors. For example, N-Caltech101 [orchard2015converting] and N-MNIST [orchard2015converting] are recorded with an ATIS camera and contain 101 and 10 classes, respectively. Bi et al. [bi2020graph] also transform popular HAR datasets into simulated event flow, including HMDB-DVS [bi2020graph, kuehne2011hmdb], UCF-DVS [bi2020graph, soomro2012ucf101], and ASLAN-DVS [kliper2011action], which further expands the number of datasets available for HAR. However, these simulated event datasets hardly reflect the advantages of event cameras in conditions such as low light and fast motion. There are three realistic event datasets for classification, i.e., DvsGesture [amir2017low], N-CARS [sironi2018hats], and ASL-DVS [bi2020graph], but these benchmarks are limited by their scale, categories, and scenes. Specifically, these datasets contain only 11, 2, and 24 classes, respectively, and rarely take challenging factors like multi-view, motion, and glitter (flashing light) into consideration. Compared with existing datasets, our proposed HARDVS dataset is large-scale (100K samples) and category-wide (300 classes), which suits deep neural networks. Our sequences are recorded in the wild and fully reflect the aforementioned attributes. We believe our proposed benchmark dataset will greatly promote the development of event-based HAR.

Dataset | Year | Sensors | Scale | Class | Resolution | Real | M-VW | M-ILL | M-MO | DYB | OCC | DR | Link
ASLAN-DVS [bi2020graph, kliper2011action] | 2011 | DAVIS240c | | 432 | | | - | - | - | - | - | - | URL
MNIST-DVS [serrano2015poker] | 2013 | DAVIS128 | | 10 | | | - | - | - | - | - | - | URL
N-Caltech101 [orchard2015converting] | 2015 | ATIS | | 101 | | | - | - | - | - | - | - | URL
N-MNIST [orchard2015converting] | 2015 | ATIS | | 10 | | | - | - | - | - | - | - | URL
CIFAR10-DVS [li2017cifar10] | 2017 | DAVIS128 | | 10 | | | - | - | - | - | - | - | URL
HMDB-DVS [bi2020graph, kuehne2011hmdb] | 2019 | DAVIS240c | | 51 | | | - | - | - | - | - | - | URL
UCF-DVS [bi2020graph, soomro2012ucf101] | 2019 | DAVIS240c | | 101 | | | - | - | - | - | - | - | URL
 | 2021 | Samsung-Gen3 | | 1000 | | | - | - | - | - | - | - | URL
 | 2021 | - | | 1000 | | | - | - | - | - | - | - | URL
DvsGesture [amir2017low] | 2017 | DAVIS128 | | 11 | | | | | | | | - | URL
N-CARS [sironi2018hats] | 2018 | ATIS | | 2 | | | | | | | | - | URL
ASL-DVS [bi2020graph] | 2019 | DAVIS240 | | 24 | | | | | | | | 0.1s | URL
PAF [miao2019neuromorphic] | 2019 | DAVIS346 | | 10 | | | | | | | | 5s | URL
DailyAction [LiuXTM021ijcai] | 2021 | DAVIS346 | | 12 | | | | | | | | 5s | URL
HARDVS (Ours) | 2022 | DAVIS346 | | 300 | | | | | | | | 5s | URL
Table 1: Comparison of event datasets for human activity recognition. M-VW, M-ILL, M-MO, DYB, OCC, and DR denote multi-view, multi-illumination, multi-motion, dynamic background, occlusion, and duration of the action, respectively. Note that we only report these attributes for realistic DVS datasets for HAR.
Figure 3: Illustration of some representative samples of our proposed HARDVS dataset.

3 HARDVS Benchmark Dataset

3.1 Protocols

We aim to provide a good platform for the training and evaluation of DVS-based human activity recognition. When constructing the HARDVS benchmark dataset, we obey the following protocols:

1). Large-scale: Large-scale datasets play a very important role in the deep learning era. In this work, we collect more than 100K DVS event sequences to meet the needs of large-scale training and evaluation of HAR.
2). Wide varieties: Thousands of human activities exist in the real world, but existing DVS-based HAR datasets only contain limited categories. Therefore, they can hardly reflect the full classification and recognition ability of HAR algorithms. Our newly proposed HARDVS contains 300 classes, which is several times more than other DVS datasets.
3). Various challenges: Our dataset considers multiple challenging factors which may influence the results of HAR with the DVS sensor. The detailed introductions can be found below:
(a). Multi-view: We collect different views of the same behavior to mimic practical applications, including front, side, horizontal, top-down, and bottom-up views.
(b). Multi-illumination: High dynamic range is one of the most important features of DVS sensors; therefore, we collect the videos under strong, medium, and low light. Our dataset also contains many videos with glitter, because we find that the DVS sensor is sensitive to flashing lights, especially at night.
(c). Multi-motion: We also highlight the features of DVS sensors by recording many actions at various motion speeds, such as slow, moderate, and high speed.
(d). Dynamic background: As it is relatively easy to recognize actions without background objects, i.e., with a stationary DVS camera, we also collect many actions with a moving camera to make our dataset challenging enough.
(e). Occlusion: In the real world, human actions are commonly occluded. Thus, we also add occlusion to the HARDVS dataset using a hand or other objects.
4). Different capture distance: The HARDVS dataset is collected at various distances, i.e., 1-2, 3-4, and more than 5 meters.
5). Long-term: Most of the existing DVS-based HAR datasets are microsecond-level; in contrast, each video in our HARDVS dataset lasts for about 5 seconds.
6). Dual-modality: The DAVIS346 camera can output both RGB frames and event flow; therefore, our dataset can also be used for HAR by fusing video frames and events. In this work, we focus on HAR with DVS only, but the RGB frames will also be released to support research on dual-modality fusion-based HAR.

3.2 Data Collection and Statistic Analysis

The HARDVS dataset is collected with a DAVIS346 camera whose resolution is $346 \times 260$. We keep the aforementioned protocols in mind when recording videos. Therefore, our dataset fully reflects the unique features of DVS sensors in challenging scenarios, such as low illumination, high speed, and cluttered backgrounds. The subjects are also diverse: in total, five persons are involved in the data collection stage.

From a statistical perspective, our dataset contains more than 100K video sequences and 300 classes of common human activities. Each category is split into training, validation, and testing subsets. A direct comparison with existing classification benchmark datasets can be found in Table 1 and Fig. 1. With the aforementioned characteristics, we believe our HARDVS dataset will be a better evaluation platform for the neuromorphic classification problem, especially for the human activity recognition task.

4 Methodology

4.1 Overview

In this section, we devise a new Event-based Spatial-Temporal Transformer (ESTF) approach for event-stream data learning. As shown in Fig. 4, the proposed ESTF architecture contains three main learning modules, i.e., i) Initial Spatial and Temporal Embedding, ii) Spatial and Temporal Enhancement Learning, and iii) Spatial-Temporal Fusion Transformer. Specifically, given the input event-stream data, we first extract the initial spatial and temporal embeddings respectively. Then, a Spatial and Temporal Feature Enhancement Learning module is devised to further enrich the event-stream data representations by deeply capturing both spatial correlation and temporal dependence of event stream. Finally, an effective Fusion Transformer (FusionFormer) block is designed to integrate the spatial and temporal cues together for the final feature representation. The details of these modules are introduced below.

Figure 4: An overview of our proposed ESTF framework for event-based human action recognition. It transforms the event streams into spatial and temporal tokens and learns the dual features using multi-head self-attention layers. Further, a FusionFormer is proposed to realize message passing between the spatial and temporal features. The aggregated features are added with dual features as the input for subsequent TF and SF blocks, respectively. The outputs will be concatenated and fed into MLP layers for action prediction.

4.2 Initial Spatial and Temporal Embedding

Different from visible-light sensors, which capture a global image at each time step, event cameras asynchronously capture intensity variations in log scale. That is, each pixel independently outputs a discrete event (or spike) when the visual change exceeds a pre-defined threshold. Usually, we use a 4-tuple $(x, y, t, p)$ to represent the discrete event of a pixel captured with a DVS, where $(x, y)$ are spatial coordinates, $t$ is the timestamp, and $p$ is the polarity of the brightness variation. Following previous works [wang2019evGait, zhu2018evflownet, fang2021snnresnet, yao2021TASNN], we first transform the asynchronous event flow into synchronous event images by stacking the events in a time interval based on the exposure time. Let $\{E_1, E_2, \dots, E_T\}$ be the collection of the $T$ sampled input event frames. In our experiments, we set $T = 8$, as used in works [tran2015c3d]. For each event frame $E_i$, we adopt StemNet [he2016resnet] to extract an initial CNN feature descriptor, and we denote by $X$ the collection of the features of all event frames. Based on it, we extract spatial and temporal embeddings respectively. To be specific, for the temporal branch, we adopt a convolution layer to reduce the feature size and reshape the result into the matrix form $F^t \in \mathbb{R}^{T \times d}$, with one $d$-dim row per event frame. For the spatial branch, we first adopt a convolution layer to resize the features. Then, we conduct the merging/summation operation on the time dimension and reshape the result into the matrix form $F^s \in \mathbb{R}^{N \times d}$, with one $d$-dim row per spatial patch. Hence, both spatial and temporal embeddings have the same $d$-dim feature descriptors.
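The stacking step can be sketched as follows. This is a minimal illustration rather than the released implementation: it assumes events given as `(x, y, t, p)` tuples, splits the time span into equal-duration bins (the paper ties the window to the exposure time instead), and counts ON and OFF events in two separate channels. The function name `events_to_frames` is hypothetical.

```python
def events_to_frames(events, num_frames, height, width):
    """Accumulate (x, y, t, p) events into `num_frames` two-channel count images.

    Assumption: equal-duration time bins over the whole stream.
    """
    t_start = min(e[2] for e in events)
    t_end = max(e[2] for e in events)
    span = max(t_end - t_start, 1e-9)  # avoid division by zero for a single timestamp
    # frames[k][c][y][x]: channel 0 counts OFF events, channel 1 counts ON events.
    frames = [[[[0] * width for _ in range(height)] for _ in range(2)]
              for _ in range(num_frames)]
    for x, y, t, p in events:
        # Map the timestamp to a bin index; the last timestamp falls into the last bin.
        k = min(int((t - t_start) / span * num_frames), num_frames - 1)
        channel = 1 if p > 0 else 0
        frames[k][channel][y][x] += 1
    return frames

# Hypothetical toy stream on a 2x2 sensor, stacked into 2 frames.
toy_events = [(0, 0, 0.0, 1), (1, 0, 0.4, -1), (1, 1, 1.0, 1)]
toy_frames = events_to_frames(toy_events, num_frames=2, height=2, width=2)
```

Using more frames for the same stream spreads the same events over more bins, so each frame becomes sparser; this is the trade-off examined in the ablation on the number of input frames.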

4.3 Spatial and Temporal Enhancement Learning

Based on the above initial spatial embeddings $F^s \in \mathbb{R}^{N \times d}$ and temporal embeddings $F^t \in \mathbb{R}^{T \times d}$, we then devise our Spatial and Temporal Enhancement Learning (STEL) module to further enrich their representations. The proposed STEL module involves two blocks, i.e., a Spatial Transformer (SF) block and a Temporal Transformer (TF) block, which respectively capture the spatial correlations and temporal dependences of event data to learn context-enriched representations. The SF block includes a multi-head self-attention (MSA) module and an MLP module, with a LayerNorm (LN) used between the two modules. A residual connection is also employed, as shown in Fig. 4. To be specific, given the spatial embeddings $F^s$, we first incorporate the position encoding [dosovitskiy2020ViT] to obtain $X^s \in \mathbb{R}^{N \times d}$, which represents $N$ input tokens with $d$-dim feature descriptors. Then, the outputs of the SF block are summarized as follows,

$$Y^s = \mathrm{MSA}(X^s) + X^s, \tag{1}$$
$$\tilde{X}^s = \mathrm{MLP}(\mathrm{LN}(Y^s)) + Y^s. \tag{2}$$

In contrast to the input $X^s$, the output $\tilde{X}^s$ provides spatial-aware enhanced representations by employing the MSA mechanism to model the spatial relationships of different event patches. Similarly, given $X^t \in \mathbb{R}^{T \times d}$ representing the temporal tokens with position encoding, the outputs of the TF block are summarized as follows,

$$Y^t = \mathrm{MSA}(X^t) + X^t, \tag{3}$$
$$\tilde{X}^t = \mathrm{MLP}(\mathrm{LN}(Y^t)) + Y^t. \tag{4}$$

Compared with the input $X^t$, the output $\tilde{X}^t$ provides temporal-context enhanced representations for the $T$ frame tokens, thanks to the MSA mechanism that models the dependencies of different event frames.
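To make the self-attention step with a residual connection concrete, the following is a deliberately simplified sketch. It assumes a single head and identity query/key/value projections (a real MSA layer uses several heads with learned projections, followed by LayerNorm and an MLP), so it only illustrates how each token is updated with a similarity-weighted mixture of all tokens. The function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]

def self_attention(tokens):
    """Single-head self-attention plus residual.

    Assumption: identity Q/K/V projections, no LayerNorm, no MLP.
    `tokens` is a list of equal-length feature vectors.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product similarity between this token and every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Attention output: weighted sum of all token vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)]
        # Residual connection: add the attention output back to the input token.
        out.append([x + m for x, m in zip(q, mixed)])
    return out

enhanced = self_attention([[1.0, 0.0], [0.0, 1.0]])
```

The same routine applies to either branch: passing patch tokens mixes information across spatial locations, while passing frame tokens mixes information across time.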

4.4 Fusion Transformer

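In order to enable interaction between the above SF and TF blocks and to learn a unified spatio-temporal contextual representation, we also design a Fusion Transformer (FusionF) module. To be specific, let $\tilde{X}^s \in \mathbb{R}^{N \times d}$ and $\tilde{X}^t \in \mathbb{R}^{T \times d}$ denote the outputs of the previous SF and TF blocks, respectively. We first collect the spatial and temporal tokens together and feed them to a unified Transformer block which includes multi-head self-attention (MSA) and MLP submodules, i.e.,

$$Z = [\tilde{X}^s \,\|\, \tilde{X}^t], \qquad Y = \mathrm{MSA}(Z) + Z, \qquad \tilde{Z} = \mathrm{MLP}(\mathrm{LN}(Y)) + Y.$$

Afterward, we split $\tilde{Z}$ into $[\hat{X}^s, \hat{X}^t]$, where $\hat{X}^s \in \mathbb{R}^{N \times d}$ and $\hat{X}^t \in \mathbb{R}^{T \times d}$, add them to the features of the corresponding branch, and further employ the above SF (Eqs.(1,2)) and TF (Eqs.(3,4)) blocks to respectively enhance their representations as follows,

$$\bar{X}^s = \mathrm{SF}(\hat{X}^s + \tilde{X}^s), \qquad \bar{X}^t = \mathrm{TF}(\hat{X}^t + \tilde{X}^t).$$

Finally, we concatenate $\bar{X}^s$ and $\bar{X}^t$ and reshape the concatenated features into vector form. After that, we utilize a two-layer MLP to output the final class label prediction, as shown in Fig. 4.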
4.5 Loss Function

Our proposed ESTF framework can be optimized in an end-to-end way. The standard cross-entropy loss function is adopted to measure the distance between our model prediction and the ground truth:

$$\mathcal{L} = -\frac{1}{B} \sum_{b=1}^{B} \sum_{k=1}^{K} y_{bk} \log p_{bk},$$

where $B$ denotes the batch size and $K$ denotes the number of event classes. $y_{bk}$ and $p_{bk}$ represent the ground truth and predicted class labels of the $b$-th event sample for class $k$, respectively.
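As a small worked example of the cross-entropy loss, the sketch below assumes one-hot ground truth, in which case the inner sum over classes reduces to the log-probability of the true class. The function name `cross_entropy` and the clamping constant `eps` are illustrative choices, not taken from the paper.

```python
import math

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true class over a batch.

    probs:  list of per-sample probability vectors (each sums to 1).
    labels: list of integer class indices (one-hot ground truth assumed).
    """
    eps = 1e-12  # guard against log(0); an implementation detail of this sketch
    total = 0.0
    for p, y in zip(probs, labels):
        total -= math.log(max(p[y], eps))
    return total / len(labels)

# Two samples, two classes: the loss is (ln 2 + ln(4/3)) / 2.
loss = cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```

A confident and correct prediction contributes almost nothing to the loss, while a confident and wrong one is penalized heavily, which is what drives the classifier toward sharper correct predictions.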

5 Experiments

5.1 Dataset and Evaluation Metrics

In this work, three datasets are adopted for the evaluation of our proposed model, including N-Caltech101 [orchard2015converting], ASL-DVS [bi2020graph], and our newly proposed HARDVS. More details about these datasets can be found in Table 1. The widely used top-1 and top-5 accuracy are adopted as evaluation metrics.
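For reference, top-k accuracy counts a sample as correct when its true class is among the k highest-scoring predictions. The following is a minimal sketch with hypothetical scores; the function name `topk_accuracy` is an illustrative choice.

```python
def topk_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, y in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if y in ranked[:k]:
            hits += 1
    return hits / len(labels)

# Hypothetical scores for three samples over three classes.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
top1 = topk_accuracy(scores, labels, k=1)  # only the first sample is correct
top2 = topk_accuracy(scores, labels, k=2)  # the second sample is now also counted
```

Top-5 accuracy is never lower than top-1 accuracy, so the gap between the two indicates how often the correct class is ranked close to, but not at, the top.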

5.2 Implementation Details

Given the event streams, we stack them into image-like representations to make full use of CNNs. In more detail, the time window is set based on the exposure time of the color frames when generating the event images. The batch size is 60, and the initial learning rate is 0.01, which is reduced to 10% every 15 epochs. Stochastic gradient descent (SGD) is selected as the optimizer to train our network. Our code is implemented with Python 3.8 and PyTorch 1.10.2+cu113 [paszke2019pytorch], on a server with an RTX3090 GPU. The source code and pre-trained models will be released to help other researchers reproduce our experimental results.
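The learning rate schedule can be written out explicitly. The sketch below assumes a step decay, i.e., the rate is multiplied by 0.1 after every 15 epochs starting from 0.01; this is one reading of the description above, and it corresponds to what PyTorch's `torch.optim.lr_scheduler.StepLR` does with `step_size=15` and `gamma=0.1`. The helper name `step_lr` is hypothetical.

```python
def step_lr(epoch, base_lr=0.01, step=15, gamma=0.1):
    """Learning rate at a given epoch under step decay (assumed interpretation)."""
    return base_lr * gamma ** (epoch // step)

# Learning rate at the start of each decay stage.
for epoch in (0, 14, 15, 30):
    print(epoch, step_lr(epoch))
```

Under this reading, the rate stays at 0.01 for epochs 0 to 14, drops to about 0.001 for epochs 15 to 29, and so on.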

5.3 Comparison with SOTA Algorithms

Results on N-Caltech101 [orchard2015converting]. As shown in Table 2, our proposed method achieves 0.832 on the top-1 accuracy metric, which is better than the compared models by a large margin. For example, VMV-GCN achieves 0.778 on this benchmark dataset and ranks second, and our model outperforms it by 0.054. M-LSTM is an adaptive event representation learning model which obtains only 0.738 on this dataset. EV-VGCNN is a graph neural network based model which obtains 0.748 and is also worse than ours. These experimental results demonstrate the effectiveness of our proposed spatial-temporal feature learning for event-based pattern recognition.

0.425 0.196 0.657 0.778 0.748 0.753
0.637 0.687 0.738 0.694 0.642 0.832
Table 2: Results on N-Caltech101 [orchard2015converting] Dataset.

Results on ASL-DVS [bi2020graph]. As shown in Table 3, the performance on this dataset is already close to saturation, and most of the compared models achieve more than 0.95 on the top-1 accuracy, including EST [gehrig2019EST] (0.979), AMAE [deng2020amae] (0.984), M-LSTM [cannici2020mlstm] (0.980), and MVF-Net [deng2021mvfnet] (0.971). Note that VMV-GCN [xie2022vmvgcn] achieves 0.989 on this benchmark dataset and ranks second. It is very hard to beat these models. Thanks to our proposed spatial-temporal feature learning and fusion modules, we set a new state-of-the-art performance on this dataset, i.e., 0.999 on the top-1 accuracy. Therefore, we can draw the conclusion that our method almost completely solves the simple gesture recognition problem defined by ASL-DVS [bi2020graph].

0.979 0.984 0.980 0.971 0.886
0.833 0.901 0.983 0.989 0.999
Table 3: Results on the ASL-DVS [bi2020graph] dataset.

Results on HARDVS. From the experimental results reported on ASL-DVS [bi2020graph] and N-Caltech101 [orchard2015converting], we can find that existing event-based recognition datasets are almost saturated. The newly proposed HARDVS dataset can bridge this gap and further boost the development of event-based human action recognition. As shown in Table 4, we re-train and test multiple state-of-the-art models for future works to compare on the HARDVS benchmark dataset, including C3D [tran2015c3d], R2Plus1D [tran2018R2Plus1D], TSM [song2019TSM], ACTION-Net [wang2021actionnet], TAM [liu2021tam], Video-SwinTrans [liu2021videoSwin], TimeSformer [bertasius2021TimeSformer], and SlowFast [feichtenhofer2019slowfast]. These popular and strong recognition models still perform poorly on our newly proposed HARDVS dataset. To be specific, R2Plus1D [tran2018R2Plus1D], ACTION-Net [wang2021actionnet], and SlowFast [feichtenhofer2019slowfast] only achieve 49.06|56.43, 46.85|56.19, and 46.54|54.76 on the top-1 and top-5 accuracy, respectively. The recently proposed TAM [liu2021tam] (ICCV-2021), Video-SwinTrans [liu2021videoSwin] (CVPR-2022), and TimeSformer [bertasius2021TimeSformer] (ICML-2021) obtain 50.41|57.99, 51.91|59.11, and 50.77|58.70 on the two metrics, respectively. Our proposed spatial-temporal feature learning and fusion modules perform comparably to or better than most of these SOTA models, i.e., 51.22|57.53, although TSM and Video-SwinTrans score higher on both metrics. All in all, our proposed model is effective for the event-based human action recognition task and may be a good baseline for future works to compare.

No. Algorithm Publish Backbone Top1 Top5
01 ResNet18 [he2016resnet] CVPR-2016 ResNet18 49.20 56.09
02 C3D [tran2015c3d] ICCV-2015 CNN 50.52 56.14
03 R2Plus1D [tran2018R2Plus1D] CVPR-2018 ResNet-34 49.06 56.43
04 TSM  [lin2019tsm] ICCV-2019 ResNet-50 52.63 60.56
05 ACTION-Net [wang2021actionnet] CVPR-2021 ResNet-50 46.85 56.19
06 TAM [liu2021tam] ICCV-2021 ResNet-50 50.41 57.99
07 Video-SwinTrans [liu2021videoSwin] CVPR-2022 Swin Transformer 51.91 59.11
08 TimeSformer [bertasius2021TimeSformer] ICML-2021 VIT 50.77 58.70
09 SlowFast [feichtenhofer2019slowfast] ICCV-2019 ResNet-50 46.54 54.76
10 X3D [feichtenhofer2020x3d] CVPR-2020 ResNet 45.82 52.33
11 ESTF (Ours) - ResNet18 51.22 57.53
Table 4: Results on the newly proposed HARDVS dataset.

5.4 Ablation Study

To help researchers better understand our proposed module, in this subsection, we conduct extensive experiments to analyze the contributions of each key component and the influence of different settings for our model.

Component Analysis. As shown in Table 5, three main modules are analyzed on the N-Caltech101 dataset, including SpatialFormer (SF), TemporalFormer (TF), and FusionFormer. Our baseline method ResNet18 [he2016resnet] achieves 72.14 on the top-1 accuracy metric. When the TemporalFormer (TF) is introduced into the recognition framework, the overall performance improves by +9.40 and reaches 81.54. When the SpatialFormer (SF) is adopted for long-range global feature relation mining, the recognition result rises to 80.47, an improvement of +8.33 over the baseline. When both modules are utilized for joint spatial-temporal feature learning, a better result is obtained, i.e., 82.89. If the FusionFormer is adopted to achieve interactive feature learning and information propagation between the spatial and temporal Transformer branches, the best result is achieved, i.e., 83.17 on the top-1 accuracy. Based on the experimental analysis for Table 5 and Table 2, we can draw the conclusion that our proposed modules all contribute to the final recognition results.

No. ResNet TF SF FusionFormer Accuracy
1 ✓ - - - 72.14
2 ✓ ✓ - - 81.54
3 ✓ - ✓ - 80.47
4 ✓ ✓ ✓ - 82.89
5 ✓ ✓ ✓ ✓ 83.17
Table 5: Component Analysis on the N-Caltech101 Dataset.
Figure 5: Experimental results of different input frames.

Analysis on Number of Input Frames. In this paper, we transform the event streams into an image-like representation for classification. In our experiments, 8 frames are adopted for the evaluation of our model. Different numbers of event frames can be obtained by changing the interval of the time windows. In this part, we test our model with 4, 6, 8, 10, 12, and 16 frames on the N-Caltech101 dataset and report the results in Fig. 5. The mean accuracy is 73.67, 75.94, 77.37, 73.49, 75.11, and 73.03, respectively, and the highest mean accuracy is obtained when 8 frames are adopted. For the decrease in accuracy when more than 8 frames are used, we think this may be because the event streams are partitioned into more frames and each frame becomes sparser. This leads to sparse edge information, which is very important for recognition.

Figure 6: Experimental results of different partition patches (left) and Transformer layers (right).
Figure 7: Visualization of feature distribution of our baseline and newly proposed ESTF on HARDVS dataset (a, b) and confusion matrix of baseline ResNet and our model on N-Caltech101 dataset (c, d). Best viewed by zooming in.

Figure 8: Visualization of the top-5 predicted actions using our model.

Analysis on Split Patches of Spatial Data. In this paper, the spatial features are partitioned into non-overlapping patches. We test three patch scales in this subsection. As illustrated in Fig. 6 (left), the best setting achieves 83.17, 94.20, and 77.37 on the top-1, top-5, and mean accuracy, respectively.

Analysis on Number of Transformer Layers. The self-attention or Transformer layers can be stacked multiple times for more accurate recognition, as validated in many works. In this experiment, we also test different numbers of Transformer layers to check their influence on our model. As shown in Fig. 6 (right), four different settings are tested, i.e., 1, 2, 3, and 4 layers, and the corresponding mean accuracy is 77.37, 75.20, 76.05, and 74.64. The highest mean accuracy is obtained with a single Transformer layer, and stacking more layers does not improve it. Maybe a larger dataset is needed to train deeper Transformer layers.

Model Parameters and Running Efficiency. The storage space occupied by our checkpoint is 377.34 MB and the number of parameters is 46.71 M. The MAC score is 17.62 G, tested using the ptflops toolkit. Our model spends 25 ms for each video (8 frames used) in our proposed HARDVS dataset.

Figure 9: Visualization of confusion matrix on the HARDVS dataset.

5.5 Visualization

In the previous subsections, we conduct extensive experiments to validate the effectiveness of our model from a quantitative point of view. In this part, we resort to the visualization to help the readers better understand our proposed model.

Feature Visualization & Confusion Matrix. As shown in Fig. 7 (a, b), we select 10 classes of actions defined in the HARDVS dataset and visualize the features by projecting them onto a 2D plane using the t-SNE toolkit. Some data samples are not discriminated well by the baseline ResNet18, such as the regions highlighted with blue bounding boxes. In contrast, our proposed ESTF model learns a better feature representation, and more of the categories are classified well. For the confusion matrix on the N-Caltech101 dataset, as shown in Fig. 7 (c, d), our proposed ESTF achieves a significant improvement compared with our baseline ResNet18. All in all, we can draw the conclusion that our proposed spatial-temporal feature learning module works well for event-based action recognition.

Confusion Matrix. As shown in Fig. 9, we visualize the confusion matrix of our model based on the results predicted in the training, validation, and testing phases, respectively. One can note that our model achieves better results in the training phase, but the overall performance in the testing phase is still weak. This demonstrates that our proposed HARDVS dataset is challenging and that there is still plenty of room for further improvement.
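For readers who want to reproduce this kind of analysis on their own predictions, a confusion matrix can be computed as in the following minimal sketch. The toy labels are hypothetical, and the helper names `confusion_matrix` and `per_class_recall` are illustrative rather than taken from the released code.

```python
def confusion_matrix(labels, preds, num_classes):
    """matrix[i][j] counts samples of true class i predicted as class j."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for y, p in zip(labels, preds):
        matrix[y][p] += 1
    return matrix

def per_class_recall(matrix):
    """Diagonal entry divided by the row sum, i.e., recall for each true class."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]

# Hypothetical labels and predictions for a 3-class problem.
labels = [0, 0, 1, 2, 2]
preds = [0, 1, 1, 2, 0]
matrix = confusion_matrix(labels, preds, num_classes=3)
recall = per_class_recall(matrix)
```

A strong diagonal indicates well-separated classes, while large off-diagonal entries point to pairs of activities that the model confuses with each other.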

Recognition Results.  As shown in Fig. 8, we provide the top-5 predicted actions and corresponding confidence scores. The ground truth and top-1 results are highlighted in black and green. It is easy to find that our model can predict the human activities accurately.

6 Conclusion

In this paper, we propose a large-scale benchmark dataset for event-based human action recognition, termed HARDVS. It contains 300 categories of human activities and more than 100K event sequences captured with a DAVIS346 camera. These videos reflect various views, illuminations, motions, dynamic backgrounds, occlusion, etc. More than 10 popular and recent classification models are evaluated for future works to compare. In addition, we also propose a novel Event-based Spatial-Temporal Transformer (ESTF for short) that conducts spatial-temporal enhanced learning and fusion for accurate action recognition. Extensive experiments on multiple benchmark datasets validate the effectiveness of our proposed framework. It sets new SOTA performance on the N-Caltech101 and ASL-DVS datasets. We hope the proposed dataset and baseline approach will boost the further development of event camera based human action recognition. In our future work, we will consider combining color frames and event streams for high-performance action recognition.