ERA: A Dataset and Deep Learning Benchmark for Event Recognition in Aerial Videos

01/30/2020 · Lichao Mou et al. · DLR · Technische Universität München

Along with the increasing use of unmanned aerial vehicles (UAVs), large volumes of aerial videos have been produced. It is unrealistic for humans to screen such big data and understand their contents. Hence, methodological research on the automatic understanding of UAV videos is of paramount importance. In this paper, we introduce a novel problem of event recognition in unconstrained aerial videos to the remote sensing community and present a large-scale, human-annotated dataset, named ERA (Event Recognition in Aerial videos), consisting of 2,866 videos, each with a label from 25 different classes corresponding to an event unfolding over 5 seconds. The ERA dataset is designed to have significant intra-class variation and inter-class similarity and captures dynamic events in various circumstances and at dramatically different scales. Moreover, to offer a benchmark for this task, we extensively validate existing deep networks. We expect that the ERA dataset will facilitate further progress in automatic aerial video comprehension. The website is https://lcmou.github.io/ERA_Dataset/


I. Introduction

Unmanned aerial vehicles (UAVs), a.k.a. drones, get a bad reputation in the media. Most people associate them with negative news, such as flight delays caused by unauthorized drone activities or the use of drones as attack weapons. However, recent advances in remote sensing and computer vision show that the future of UAVs will actually be shaped by a wide range of practical applications [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. To name a few, in the aftermath of earthquakes [15] and floods [16], UAVs can be exploited to estimate damage [17, 18], deliver assistance, and locate victims. In addition to disaster relief, urban planners are able to better understand the environment of a city and implement data-driven improvements by using UAVs [19, 20, 21, 22, 23, 24, 25, 26]. In precision agriculture, agricultural workers can make use of UAVs to collect data [27, 28, 29], automate redundant procedures [30, 31, 32, 33], and generally maximize efficiency [34, 35]. In combination with geospatial information, UAVs are now used to monitor and track animals for the purpose of wildlife conservation [36, 37].

Fig. 1: Temporal cue matters in event understanding from an aerial perspective. What is taking place in (a) and (b)? Which of (c) and (d) depicts a traffic congestion scene? It is difficult to answer these questions from still images alone, while temporal context provides an important visual cue. (See videos and answers at https://lcmou.github.io/ERA_Dataset/gif_page/.)
Fig. 2: Overview of the ERA dataset. Overall, we have collected 2,866 labeled video snippets for 24 event classes and 1 normal class: post-earthquake, flood, fire, landslide, mudslide, traffic collision, traffic congestion, harvesting, ploughing, constructing, police chase, conflict, baseball, basketball, boating, cycling, running, soccer, swimming, car racing, party, concert, parade/protest, religious activity, and non-event. For each class, we show the first (left) and last (right) frames of a video. Best viewed zoomed in color.

Unlike satellites, UAVs are able to provide real-time, high-resolution videos at a very low cost. They usually have real-time streaming capabilities that enable quick decision-making. Furthermore, UAVs significantly reduce the dependence on weather conditions, e.g., cloud cover, and are available on demand, offering greater flexibility to cope with various problems.

Yet the more UAVs there are in the skies, the more video data they create. The Federal Aviation Administration (FAA) estimates that in the US alone, more than 2 million UAVs were registered in 2019 (https://www.faa.gov/data_research/aviation/aerospace_forecasts/media/Unmanned_Aircraft_Systems.pdf), and around 150 terabytes of data can easily be produced by a small drone fleet every day (https://www.bloomberg.com/news/articles/2017-05-10/airbus-joins-the-commercial-drone-data-wars). The era of big UAV data is here. It is unrealistic for humans to screen these massive volumes of aerial videos and understand their contents. Hence, methodological research on the automatic interpretation of such data is of paramount importance.

However, there is a paucity of literature on UAV video analysis, and it is for the most part concentrated on detecting and tracking objects of interest, e.g., vehicles and people, in relatively sanitized settings. Towards advancing aerial video parsing, this paper introduces a novel task, event recognition in unconstrained aerial videos, to the remote sensing community. We present the Event Recognition in Aerial videos (ERA) dataset, a collection of 2,866 videos, each with a label from 25 different classes corresponding to an event unfolding over 5 seconds (see Fig. 2). This temporal length (5 seconds) corresponds to the minimal duration of human short-term memory (5 to 20 seconds) (http://web.mnstate.edu/smithb/EdPsychWebs/classnotes/coursenotes/chapter_7.htm). This dataset enables training models for richly understanding events in the wild from a broader, aerial view, which is a crucial step towards building an automatic aerial video comprehension system. In addition, to offer a benchmark for this task, we extensively validate existing deep networks and report their results in two ways: single-frame classification and video classification (see Section IV).

II. Related Work

This section is dedicated to briefly reviewing related work on UAV image/video analysis.

II-A. Object Detection

Detecting objects of interest in UAV images/videos has been studied extensively in the remote sensing field. Early efforts [38, 39] sought out useful human-designed visual features, such as texture features, color features, and histograms of oriented gradients (HOG), or their combinations to represent objects. Recently, deep convolutional neural networks (CNNs) have been harnessed for detecting objects in UAV data and have shown promising results [40, 41]. For example, [42] proposes an R-Net that uses rotatable bounding boxes to simultaneously localize vehicles and identify their orientations in UAV images and videos. In [37], the authors study how to train CNNs effectively on a substantially imbalanced dataset for the purpose of detecting mammals in UAV images. To detect objects of different scales, [43] devises a hierarchical selective filtering layer and introduces it into the Faster R-CNN [44] architecture for ship detection in UAV and satellite images. Several works [45, 46, 47] focus on detecting and counting people in complex crowd and outdoor scenes from overhead images. In addition, there have recently been numerous datasets aiming at object detection from UAV data, e.g., UAVDT [48] and VisDrone [49].

Fig. 3: Categorization of event classes in the ERA dataset. All event categories are arranged in a two-level tree, with 25 leaf nodes connected to 7 first-level nodes, i.e., disaster, traffic, productive activity, security, sport, social activity, and non-event.

II-B. Tracking

Tracking objects in aerial videos is another promising topic. In [50], the authors make use of a correlation filter-based online learning framework to track single objects, e.g., pedestrians, vehicles, and buildings, in UAV videos. Moving from single-object to multi-object tracking, [51] views the latter task as a voting problem over trajectories from an optical flow field and achieves satisfactory multi-object tracking results in aerial videos. For real-time UAV applications, the authors of [52] present an onboard long-term object tracking algorithm that uses a reliable global-local object model and achieves real-time tracking performance. As to publicly available datasets in this direction, in [53], the authors render real-world scenarios and a wide range of moving objects commonly seen in aerial videos using a photo-realistic simulator (Unreal Engine 4), producing 123 videos with more than 110K frames that can be used to train models for long-term aerial tracking. [54] builds a highly diverse benchmark, consisting of 70 videos captured by UAVs, for object tracking tasks. Furthermore, the aforementioned two object detection datasets, i.e., UAVDT [48] and VisDrone [49], also provide annotations for tracking.

II-C. Semantic Segmentation

Semantic segmentation of UAV data refers to the process of assigning a category label to each pixel in an image or video. This task has received much attention recently, as it enables drones to perceive complex scenes. Several datasets, e.g., ICG-DroneDataset (http://dronedataset.icg.tugraz.at/), AeroScapes [55], UAVid [56], and Skyscapes [57], have been proposed to facilitate progress in this direction. Methodological research in this field, however, remains underexplored.

II-D. Beyond Perception Towards Content Understanding

The abovementioned perception tasks, namely object detection, tracking, and semantic segmentation, can answer the following question: what objects appear in a scene and where are they located? However, the visual understanding of UAV data goes well beyond this. With just one glance at a video clip, we are capable of imagining the world behind the pixels: for example, we can infer what happens and what may happen next, i.e., recognize events, activities, and actions. While this task is effortless for humans, it is enormously hard for vision algorithms. In this direction, a few pioneering works can be found in [58, 59], but these studies are conducted in well-controlled environments.

Fig. 4: User interface. An example for localizing a 5-second video snippet of flood and cutting it from a long video in our annotation process.
Fig. 5: Sample videos in several categories. Events in aerial videos can happen in various circumstances and at dramatically various scales. The ERA dataset has a significant intra-class variation and inter-class similarity. Here we display one frame (the middle frame) for each video. Best viewed zoomed in color.
Fig. 6: Sample distributions of all classes in the ERA. The red and blue bars represent the numbers of training and test samples in each category, respectively, and green bars denote the total number of instances in each category.

III. The ERA Dataset

This work aims to devise an aerial video dataset that covers an extensive range of events. The ERA dataset is designed to have significant intra-class variation and inter-class similarity and to capture dynamic events in various circumstances and at dramatically different scales.

III-A. Collection and Annotation

We start by creating our taxonomy (cf. Fig. 3), building a set of the 24 events most commonly seen in aerial scenes. The event categories are defined with reference to Wikipedia. Moreover, to investigate whether models can distinguish events from normal videos, we set up a category called non-event, which comprises videos not showing any specific event. The 25 classes of our dataset are as follows: post-earthquake, flood, fire, landslide, mudslide, traffic collision, traffic congestion, harvesting, ploughing, constructing, police chase, conflict, baseball, basketball, boating, cycling, running, soccer, swimming, car racing, party, concert, parade/protest, religious activity, and non-event. Fig. 2 shows an overview of our ERA dataset.

To collect candidate videos for labeling, we search YouTube, parsing video metadata and crawling the search engine to create a collection of candidates for each category. We then download all videos and send them to data annotators. Each annotator is asked to localize 5-second video snippets that depict specific events and to cut them from the long candidate videos. To improve the efficiency of this procedure, we make use of a toolbox; Fig. 4 shows its user interface.
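The snippet-cutting step can be illustrated with a short script. The following is a minimal sketch, not the actual annotation toolbox: it assumes ffmpeg is installed, and the file names and timestamps are hypothetical.

```python
import subprocess

def cut_snippet(src_path, start, dst_path, duration=5):
    """Cut a fixed-length snippet from a long candidate video with ffmpeg.

    Re-encoding (rather than stream copy) keeps the 5-second cut frame-accurate.
    """
    cmd = [
        "ffmpeg", "-y",
        "-ss", str(start),       # event start localized by the annotator
        "-i", src_path,
        "-t", str(duration),     # 5-second snippet, as in the ERA dataset
        "-an",                   # drop audio; only the visual content is annotated
        dst_path,
    ]
    subprocess.run(cmd, check=True)

# Hypothetical usage: the annotator marked a flood event starting at 01:23.
cut_snippet("candidate_flood_video.mp4", "00:01:23", "flood_0001.mp4")
```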

Three paid annotators are responsible for validating the cut videos during the annotation procedure. The first annotator generates the initial annotations. Annotations from this first round are then sent to the second annotator for refinement. Finally, the third annotator screens all generated 5-second videos and removes those with very similar content. Overall, the total annotation time is around 290 hours.

III-B. Dataset Statistics

The goal of this work is to collect a large, diverse dataset that can be used to train models for event recognition in UAV videos. As we gather aerial videos from YouTube, the largest video sharing platform in the world, we are able to include a large breadth of diversity (see Fig. 5), which is more challenging than making use of self-collected data [58, 59]. In total, we have gathered and annotated 2,866 videos for 25 classes. Each video sequence is recorded at 24 fps (frames per second), lasts 5 seconds, and has a spatial size of 640×640 pixels. The train/test split can be found in Fig. 6 and Section IV-A. Fig. 6 exhibits the distribution of all classes: the red and blue bars represent the numbers of training and test samples in each category, respectively, and the green bars denote the total number of instances in each category.

To build a diverse dataset, we collect not only high-quality UAV videos but also ones acquired in extreme conditions. By doing so, many challenging issues of event recognition in overhead videos in real-world scenarios, e.g., low spatial resolution, extreme illumination conditions, and bad weather, can be investigated. True aerial video parsing methods should be capable of recognizing events under such extreme conditions.

 

| Dataset | Type of Task | Data Source | Video | # Classes | # Samples | Year |
|---|---|---|---|---|---|---|
| UCLA Aerial Event dataset [58] | human-centric event recognition | self-collected (actor staged) | yes | 12 | 104 | 2015 |
| Okutama-Action dataset [59] | human action detection | self-collected (actor staged) | yes | 12 | - | 2017 |
| AIDER dataset [60] | disaster event recognition | Internet | no | 5 | 2,545 | 2019 |
| ERA dataset (Ours) | general event recognition | YouTube | yes | 25 | 2,866 | 2020 |

TABLE I: Comparison to existing UAV data understanding datasets. We offer various comparisons for each dataset.

III-C. Comparison with Other Aerial Data Understanding Datasets

The first significant effort to build a standard dataset for aerial video content understanding can be found in [58], in which the authors use a GoPro-equipped drone to collect video data at an altitude of 25 meters in a controlled environment, building the UCLA Aerial Event dataset. About 15 actors are involved in each video. The dataset covers two different sites at a park in Los Angeles, USA, and contains 104 event instances spanning 12 classes related to human-human and human-object interactions. The events are: exchange box, play frisbee, info consult, pick up, queue for vending machine, group tour, throw trash, sit on table, picnic, serve table, sell BBQ, and inspection hide.

The authors of [59] propose Okutama-Action dataset for understanding human actions from a bird’s eye view. Two UAVs (DJI Phantom 4) with a flying altitude of 10-45 meters above the ground are used to capture data, and all videos included in this dataset are gathered at a baseball field in Okutama, Japan. There are 12 actions, including handshaking, hugging, reading, drinking, pushing/pulling, carrying, calling, running, walking, lying, sitting, and standing.

In [60], the authors build an aerial image dataset, termed AIDER, for emergency response applications. This dataset only involves four disaster events, namely fire/smoke, flood, collapsed building/rubble, and traffic accident, plus a normal case. It contains 2,545 images in total, collected from multiple sources, e.g., Google/Bing Images, websites of news agencies, and YouTube.

Both the UCLA Aerial Event dataset [58] and the Okutama-Action dataset [59] are small in today's terms for aerial video understanding, and their data are gathered in well-controlled environments and focus only on a handful of human-centric events. The AIDER dataset [60] is an image dataset with only 5 classes for disaster event classification. In contrast, our ERA is a relatively large-scale UAV video content understanding dataset, aiming to recognize generic dynamic events from an aerial view. A comprehensive overview of these most important comparable datasets and their features is given in Table I.

III-D. Challenges

The proposed ERA dataset poses the following challenges:

  • Although the ERA dataset is the largest dataset for event recognition in aerial videos to date, its size is still relatively limited compared to video classification datasets in computer vision. Hence, model training faces a small-data challenge.

  • The imbalanced distribution across classes (cf. Fig. 6) poses the challenge of learning unbiased models from imbalanced data (a weighted-loss sketch is given after this list).

  • Unlike [58, 59], the existence of the non-event class in our dataset requires models not only to recognize different events but also to distinguish events from normal videos.

  • In this dataset, events happen in various environments and are observed at different scales, which leads to a significant intra-class variation and inter-class similarity (cf. Fig. 5).
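The paper does not prescribe a remedy for the class imbalance; one common mitigation, sketched below in PyTorch, is to weight the cross-entropy loss by inverse class frequency. The per-class counts used here are hypothetical placeholders, not the actual ERA statistics (those are plotted in Fig. 6).

```python
import torch
import torch.nn as nn

# Hypothetical training-set counts for the 25 ERA classes (placeholders only).
train_counts = torch.tensor(
    [60., 120., 95., 40., 45., 70., 110., 80., 55., 90., 50., 35., 65.,
     75., 85., 60., 40., 70., 65., 30., 55., 60., 75., 45., 150.])

# Inverse-frequency weights, normalized so that the mean weight is 1.
weights = train_counts.sum() / (len(train_counts) * train_counts)

criterion = nn.CrossEntropyLoss(weight=weights)

# logits: (batch, 25) network outputs; labels: (batch,) ground-truth class indices.
logits = torch.randn(8, 25)
labels = torch.randint(0, 25, (8,))
loss = criterion(logits, labels)
```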

| Model | post-earthquake | flood | fire | landslide | mudslide | traffic collision | traffic congestion | harvesting | ploughing | constructing | police chase | conflict | baseball | basketball | boating | cycling | running | soccer | swimming | car racing | party | concert | parade/protest | religious activity | non-event | OA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VGG-16 | 46.3 | 59.1 | 53.6 | 38.8 | 56.1 | 30.8 | 76.2 | 62.7 | 65.4 | 69.0 | 70.0 | 44.4 | 61.0 | 56.2 | 69.4 | 39.6 | 33.3 | 87.5 | 62.0 | 32.0 | 73.5 | 56.7 | 47.8 | 64.6 | 30.4 | 51.9 |
| VGG-19 | 45.5 | 56.4 | 70.2 | 48.1 | 47.1 | 33.3 | 50.0 | 57.1 | 58.1 | 65.2 | 80.9 | 7.4 | 66.7 | 55.9 | 66.7 | 35.8 | 57.1 | 67.3 | 55.6 | 26.7 | 53.3 | 54.4 | 43.6 | 50.0 | 31.1 | 49.7 |
| Inception-v3 | 62.9 | 76.1 | 88.0 | 44.7 | 54.7 | 48.0 | 55.4 | 64.6 | 77.3 | 73.7 | 76.5 | 50.0 | 72.0 | 61.2 | 73.7 | 70.2 | 90.0 | 80.0 | 61.7 | 60.0 | 66.7 | 47.7 | 52.2 | 62.2 | 45.5 | 62.1 |
| ResNet-50 | 65.5 | 69.8 | 77.4 | 40.0 | 51.9 | 40.6 | 50.0 | 77.4 | 72.9 | 63.8 | 68.6 | 62.5 | 83.3 | 52.2 | 71.4 | 77.4 | 28.6 | 73.5 | 54.3 | 50.0 | 61.5 | 49.4 | 46.0 | 48.9 | 38.9 | 57.3 |
| ResNet-101 | 59.6 | 82.9 | 79.2 | 34.5 | 43.8 | 18.8 | 48.7 | 65.8 | 78.0 | 69.5 | 64.6 | 55.0 | 76.1 | 57.7 | 82.2 | 90.5 | 61.5 | 73.3 | 58.2 | 31.6 | 51.2 | 49.5 | 47.1 | 64.7 | 36.2 | 55.3 |
| ResNet-152 | 67.3 | 68.2 | 78.8 | 45.2 | 46.4 | 38.9 | 58.5 | 61.9 | 75.6 | 58.0 | 59.3 | 57.1 | 79.5 | 56.9 | 77.8 | 63.4 | 75.0 | 74.4 | 56.1 | 30.8 | 61.9 | 44.7 | 48.6 | 52.8 | 37.0 | 56.1 |
| MobileNet | 72.0 | 70.8 | 78.0 | 57.5 | 61.0 | 43.6 | 52.6 | 66.2 | 66.7 | 67.2 | 70.6 | 50.0 | 74.5 | 59.7 | 76.4 | 54.7 | 72.0 | 64.8 | 52.9 | 56.2 | 65.0 | 44.4 | 54.5 | 61.5 | 52.5 | 61.3 |
| DenseNet-121 | 58.6 | 71.4 | 82.8 | 54.5 | 51.6 | 38.1 | 58.2 | 71.1 | 78.0 | 70.2 | 73.5 | 48.0 | 85.0 | 68.4 | 86.7 | 65.3 | 57.1 | 75.4 | 61.7 | 52.9 | 68.3 | 52.3 | 66.7 | 47.8 | 43.3 | 61.7 |
| DenseNet-169 | 70.0 | 82.9 | 71.9 | 45.2 | 40.2 | 36.7 | 59.5 | 71.6 | 87.2 | 80.4 | 76.6 | 53.8 | 91.4 | 65.0 | 67.7 | 76.9 | 63.6 | 75.0 | 63.2 | 57.1 | 59.1 | 60.0 | 55.4 | 60.9 | 39.7 | 60.6 |
| DenseNet-201 | 69.9 | 80.4 | 84.5 | 52.2 | 48.1 | 43.2 | 62.3 | 71.6 | 85.4 | 71.2 | 77.1 | 47.1 | 87.8 | 63.6 | 79.6 | 69.8 | 47.8 | 65.0 | 58.0 | 43.8 | 61.0 | 60.9 | 55.0 | 60.8 | 42.1 | 62.3 |
| NASNet-L | 60.0 | 50.0 | 77.2 | 41.0 | 50.9 | 46.9 | 50.0 | 68.0 | 77.8 | 82.7 | 78.0 | 61.5 | 82.6 | 74.5 | 78.0 | 75.0 | 62.2 | 69.0 | 54.5 | 70.0 | 69.2 | 44.6 | 58.7 | 55.9 | 41.7 | 60.2 |

  • All networks are initialized with weights pre-trained on the ImageNet dataset and trained on the ERA dataset.

TABLE II: Performance of single-frame classification models: we show the per-class precision and overall accuracy (OA, %) of baseline models on the test set.

| Model | post-earthquake | flood | fire | landslide | mudslide | traffic collision | traffic congestion | harvesting | ploughing | constructing | police chase | conflict | baseball | basketball | boating | cycling | running | soccer | swimming | car racing | party | concert | parade/protest | religious activity | non-event | OA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C3D (Sport1M init.) | 23.1 | 24.3 | 30.9 | 19.5 | 32.9 | 7.00 | 15.5 | 27.5 | 36.1 | 45.5 | 50.0 | 18.2 | 40.9 | 37.0 | 47.5 | 20.6 | 12.0 | 58.3 | 36.2 | 16.7 | 25.8 | 38.2 | 37.8 | 27.5 | 29.6 | 30.4 |
| C3D (UCF101 init.) | 27.9 | 56.5 | 32.7 | 10.2 | 23.9 | 8.30 | 38.5 | 42.3 | 31.1 | 40.0 | 51.9 | 11.1 | 45.7 | 48.9 | 41.9 | 13.6 | 9.30 | 41.9 | 38.2 | 18.2 | 17.4 | 32.0 | 28.1 | 35.8 | 28.5 | 31.1 |
| P3D-ResNet-199 (Kinetics init.) | 43.6 | 65.9 | 66.7 | 35.5 | 48.7 | 20.0 | 37.8 | 77.4 | 70.8 | 62.0 | 81.6 | 22.2 | 66.7 | 63.1 | 55.4 | 35.6 | 35.3 | 76.2 | 57.4 | 40.0 | 54.5 | 37.5 | 38.7 | 47.8 | 37.4 | 50.7 |
| P3D-ResNet-199 (Kinetics-600 init.) | 72.4 | 76.3 | 84.8 | 24.5 | 38.2 | 35.6 | 40.8 | 56.9 | 67.4 | 71.4 | 57.9 | 50.0 | 70.4 | 78.8 | 71.7 | 47.1 | 60.0 | 79.5 | 68.1 | 40.9 | 59.1 | 37.0 | 49.1 | 55.9 | 37.9 | 53.3 |
| I3D-Inception-v1 (Kinetics init.) | 40.4 | 63.5 | 68.9 | 22.6 | 46.3 | 17.6 | 55.0 | 61.5 | 50.0 | 53.3 | 73.2 | 50.0 | 75.0 | 69.4 | 60.7 | 61.9 | 53.3 | 70.8 | 52.5 | 50.0 | 57.1 | 50.7 | 40.3 | 49.0 | 35.8 | 51.3 |
| I3D-Inception-v1 (Kinetics+ImageNet init.) | 60.0 | 68.1 | 65.7 | 29.0 | 60.4 | 51.5 | 52.2 | 67.1 | 66.7 | 54.2 | 64.8 | 57.9 | 85.0 | 61.9 | 86.4 | 75.0 | 44.4 | 77.6 | 64.1 | 65.2 | 53.7 | 50.0 | 47.8 | 65.1 | 43.0 | 58.5 |
| TRN-BNInception | 84.8 | 71.4 | 82.5 | 51.2 | 50.0 | 46.8 | 66.7 | 68.1 | 77.4 | 52.4 | 70.5 | 75.0 | 64.5 | 67.7 | 84.0 | 56.1 | 55.2 | 83.3 | 72.9 | 61.1 | 62.0 | 48.9 | 44.6 | 62.8 | 51.1 | 62.0 |
| TRN-Inception-v3 | 69.2 | 87.8 | 88.9 | 65.8 | 60.0 | 44.1 | 58.3 | 78.1 | 90.7 | 70.8 | 73.3 | 28.6 | 83.3 | 72.7 | 73.7 | 60.0 | 66.7 | 73.6 | 70.6 | 63.6 | 65.1 | 47.7 | 42.7 | 65.1 | 47.9 | 64.3 |

  • C3D (Sport1M init.) uses pre-trained weights on the Sport1M dataset as initialization; C3D (UCF101 init.) uses pre-trained weights on the UCF101 dataset as initialization.

  • P3D-ResNet-199 (Kinetics init.) uses pre-trained weights on the Kinetics dataset as initialization; P3D-ResNet-199 (Kinetics-600 init.) uses pre-trained weights on the Kinetics-600 dataset as initialization.

  • I3D-Inception-v1 (Kinetics init.) uses pre-trained weights on the Kinetics dataset as initialization; I3D-Inception-v1 (Kinetics+ImageNet init.) uses pre-trained weights on Kinetics+ImageNet as initialization.

  • TRN-BNInception uses pre-trained weights on the Something-Something V2 dataset as initialization; TRN-Inception-v3 uses pre-trained weights on the Moments in Time dataset as initialization.

TABLE III: Performance of video classification models: we show the per-class precision and overall accuracy (OA, %) of baseline models on the test set.

IV. Experiments

IV-A. Experimental Setup

Data. As to the split of training and test sets, we follow two rules: 1) videos cut from the same long video are assigned to the same set, and 2) the numbers of training and test videos per class should be nearly equal. Because video snippets stemming from the same long video usually share similar properties (e.g., background, illumination, and resolution), this split strategy makes it possible to evaluate the generalization ability of a model. The statistics of training and test samples are exhibited in Fig. 6. During the training phase, 10% of the training instances are randomly selected as a validation set.
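A minimal sketch of rule 1 is given below: snippets that share a source (long) video are always kept in the same split. This is not the authors' split code; the per-class balancing of rule 2 is omitted, and the snippet records and source_id values are hypothetical.

```python
import random
from collections import defaultdict

def group_split(samples, test_ratio=0.5, seed=0):
    """Assign snippets to train/test so that snippets cut from the same
    long YouTube video (same source_id) always land in the same set."""
    groups = defaultdict(list)
    for s in samples:
        groups[s["source_id"]].append(s)

    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    n_test_groups = int(len(group_ids) * test_ratio)
    test_ids = set(group_ids[:n_test_groups])

    train = [s for gid, items in groups.items() if gid not in test_ids for s in items]
    test = [s for gid, items in groups.items() if gid in test_ids for s in items]
    return train, test

# Hypothetical snippet records: each carries its class label and the id of the
# long video it was cut from.
samples = [
    {"clip": "flood_0001.mp4", "label": "flood", "source_id": "yt_abc"},
    {"clip": "flood_0002.mp4", "label": "flood", "source_id": "yt_abc"},
    {"clip": "fire_0001.mp4", "label": "fire", "source_id": "yt_def"},
]
train_set, test_set = group_split(samples)
```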

Evaluation metric.

To compare models comprehensively, we make use of per-class precision and overall accuracy as evaluation metrics. More specifically, the per-class precision is calculated by the following equation:

$$\text{precision} = \frac{TP}{TP + FP} \qquad (1)$$

where TP and FP denote the numbers of true positives and false positives, respectively, measured with respect to each class. The overall accuracy (OA) is calculated by counting the number of correctly predicted samples and normalizing this number by the total number of test samples. We mainly consider the latter metric, as it comprehensively indicates the classification performance of a model across all classes.
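Both metrics are straightforward to compute from the predicted and ground-truth labels; a small NumPy sketch (not the authors' evaluation code) is given below.

```python
import numpy as np

def per_class_precision_and_oa(y_true, y_pred, num_classes=25):
    """Per-class precision = TP_c / (TP_c + FP_c); OA = #correct / #samples."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    precision = np.zeros(num_classes)
    for c in range(num_classes):
        predicted_c = y_pred == c
        tp = np.sum(predicted_c & (y_true == c))   # true positives for class c
        fp = np.sum(predicted_c & (y_true != c))   # false positives for class c
        precision[c] = tp / (tp + fp) if (tp + fp) > 0 else 0.0

    oa = float(np.mean(y_true == y_pred))
    return precision, oa

# Toy example with 3 classes to keep the output readable.
prec, oa = per_class_precision_and_oa([0, 1, 2, 2, 1], [0, 2, 2, 2, 1], num_classes=3)
```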

Fig. 7: Examples of event recognition results on the ERA dataset. We show the best two single-frame classification network architectures (i.e., Inception-v3 and DenseNet-201) and the best two video classification network architectures (i.e., I3D-Inception-v1 and TRN-Inception-v3). The ground truth label and top 3 predictions of each model are reported. For each example, we show the first (left) and last (right) frames. Best viewed zoomed in color.

IV-B. Baselines for Event Classification

Single-frame classification models. We first describe single-frame classification models, where only a single video frame (the middle frame in this paper) is selected from each video as the input to the networks. The single-frame models used are summarized as follows (a minimal fine-tuning sketch follows the list).

Fig. 8: Examples of misclassifications. We show several failure examples where the prediction is not in the top 3.
  • VGGNet [61]. VGGNet is mainly composed of five convolutional blocks and three fully-connected layers, and each block contains several convolutional layers and one max-pooling layer. The size of all convolutional filters is 3×3. We train a 16-layer VGGNet (VGG-16) and a 19-layer VGGNet (VGG-19) in our experiments.

  • Inception networks [62, 63, 64, 65]. Inception models aim to learn miscellaneous features by leveraging convolutional filters of various sizes simultaneously. The naive version of an Inception module consists of 1×1, 3×3, and 5×5 convolutions as well as a 3×3 max pooling. However, the computational cost of the model grows with the increasing number of filters/units, especially in deep layers. To tackle this problem, 1×1 convolutions are introduced to improve computational efficiency. Here we employ Inception-v3 [64] to perform single-frame classification.

  • ResNet [66]. ResNet is characterized by explicitly learning residual mapping functions via shortcut connections. In contrast to plain networks, such residual learning architecture can well address the degradation problem and enable networks to go deeper. In our experiments, we make use of three variations: a 50-layer ResNet (ResNet-50), a 101-layer ResNet (ResNet-101), and a 152-layer ResNet (ResNet-152).

  • MobileNet [67]. MobileNet is a light-weight CNN that facilitates the use of deep neural networks in resource-restricted applications, e.g., mobile and embedded vision applications. To reduce computational cost and model size, MobileNet employs depthwise separable convolutions, which are implemented by factorizing standard convolutions into depthwise and pointwise convolutions. Besides, two hyper-parameters, the width multiplier α and the resolution multiplier ρ, are defined to further shrink the network. Specifically, the former squashes the number of input channels of each layer, while the latter reduces the resolution of input images. Here, we set them to their default values: α = 1 and ρ = 1.

  • DenseNet [68]. DenseNet maximizes information flow between various layers by directly connecting all layers of equivalently-sized feature maps with each other. In addition, concatenation instead of element-wise addition is utilized to combine features from early and later layers so that information can be preserved. With this design, all feature maps are taken into consideration for making final predictions. To thoroughly explore the performance of DenseNet, we experiment with a 121-layer DenseNet (DenseNet-121), a 169-layer DenseNet (DenseNet-169), and a 201-layer DenseNet (DenseNet-201).

  • NASNet [69]. NASNet refers to network architectures that are learned on datasets of interest using a neural architecture search (NAS) framework [70]. This framework is designed to search for the best-performing network in a predefined search space. However, directly searching on a large dataset is expensive, and thus the authors of [69] propose to search for a transferable architecture on a small dataset, which can then be transferred to a relatively large dataset. In our experiments, we select the NASNet model that is searched on the CIFAR-10 dataset and achieves the best performance on the ImageNet dataset, namely NASNet-L, to perform single-frame classification.
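To make the single-frame protocol concrete, the sketch below extracts the middle frame of a clip with OpenCV and feeds it to an ImageNet-pretrained DenseNet-201 from torchvision whose classifier is replaced by a 25-way head. It is an illustrative sketch under common default settings, not the paper's training code; the clip path, input size, and preprocessing are assumptions.

```python
import cv2
import torch.nn as nn
from torchvision import models, transforms

def middle_frame(video_path):
    """Grab the middle frame of a clip (the single-frame input used here)."""
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, n_frames // 2)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"could not read {video_path}")
    return cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

# ImageNet-pretrained DenseNet-201 with a new 25-way classification head.
model = models.densenet201(pretrained=True)
model.classifier = nn.Linear(model.classifier.in_features, 25)

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Hypothetical clip path; a training loop would iterate over the whole train set.
x = preprocess(middle_frame("flood_0001.mp4")).unsqueeze(0)
logits = model(x)   # shape: (1, 25)
```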

Video classification models. These models take several video frames as input, so that they can learn temporal information from videos (a frame-sampling sketch follows the list).

Fig. 9: Confusion matrices of two models for the ERA dataset. (a) DenseNet-201; (b) TRN-Inception-v3.
  • C3D [71]. C3D (3D convolutional network) aims to extract spatiotemporal features with 3D convolutional filters and pooling layers. Compared to conventional 2D CNNs, the 3D convolutions and pooling operations in C3D preserve the temporal information of input signals and model motion as well as appearance simultaneously. Moreover, the authors of [71] demonstrate that the optimal size of 3D convolutional filters is 3×3×3. In our experiments, we train two C3D networks (https://github.com/tqvinhcs/C3D-tensorflow) with pre-trained weights on the Sport1M dataset [72] and the UCF101 dataset [73] (see C3D (Sport1M init.) and C3D (UCF101 init.) in Table III), respectively.

  • P3D ResNet [74]. P3D ResNet (pseudo-3D residual network) is composed of pseudo-3D convolutions, where conventional 3D convolutions are decoupled into 2D and 1D convolutions in order to learn spatial and temporal information separately. With such convolutions, the model size of a network can be significantly reduced, and the utilization of pre-trained 2D CNNs becomes feasible. Besides, inspired by the success of ResNet [66], P3D ResNet employs ResNet-like architectures to learn residuals in both the spatial and temporal domains. In our experiments, we train two 199-layer P3D ResNet (P3D-ResNet-199) models (https://github.com/zzy123abc/p3d) with pre-trained weights on the Kinetics dataset [75] and the Kinetics-600 dataset [76] (see P3D-ResNet-199 (Kinetics init.) and P3D-ResNet-199 (Kinetics-600 init.) in Table III), respectively.

  • I3D [77]. I3D (inflated 3D ConvNet) expands 2D convolution and pooling filters to 3D, which are then initialized with inflated pre-trained models. In particular, weights of 2D networks pre-trained on the ImageNet dataset are replicated along the temporal dimension. With this design, not only 2D network architectures but also pre-trained 2D models can be efficiently employed to increase the learning efficiency and performance of 3D networks. To assess the performance of I3D on our dataset, we train two I3D models (https://github.com/LossNAN/I3D-Tensorflow) whose backbones are both Inception-v1 [62] (I3D-Inception-v1), with pre-trained weights on the Kinetics dataset [75] and on Kinetics+ImageNet, respectively (see I3D-Inception-v1 (Kinetics init.) and I3D-Inception-v1 (Kinetics+ImageNet init.) in Table III).

  • TRN [78]. The temporal relation network (TRN) is proposed to recognize human actions by reasoning about multi-scale temporal relations among video frames. By leveraging its plug-and-play relational reasoning module, TRN can accurately predict human gestures and human-object interactions even from sparsely sampled frames. For our experiments, we train TRNs (https://github.com/metalbubble/TRN-pytorch) with 16 multi-scale relations and select the Inception architecture as the backbone. Notably, we experiment with two variants of the Inception architecture: BNInception [63] and Inception-v3 [64]. We initialize the former with weights pre-trained on the Something-Something V2 dataset [79] (TRN-BNInception in Table III) and the latter with weights pre-trained on the Moments in Time dataset [80] (TRN-Inception-v3 in Table III).
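All of these video models consume a short stack of frames rather than a single image. The sketch below uniformly samples 16 frames (a common choice, e.g., for C3D) from a 5-second, 24 fps ERA clip and packs them into a (C, T, H, W) tensor; the frame count, spatial size, and clip path are illustrative assumptions, not the exact settings used for each baseline.

```python
import cv2
import numpy as np
import torch

def sample_clip(video_path, num_frames=16, size=112):
    """Uniformly sample `num_frames` RGB frames from a clip and return a
    (C, T, H, W) tensor, the usual input layout for 3D CNNs such as C3D."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))   # ~120 for a 5 s, 24 fps clip
    indices = np.linspace(0, total - 1, num_frames).astype(int)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(cv2.resize(frame, (size, size)))
    cap.release()

    clip = np.stack(frames).astype(np.float32) / 255.0   # (T, H, W, C) in [0, 1]
    return torch.from_numpy(clip).permute(3, 0, 1, 2)    # (C, T, H, W)

# Hypothetical usage: one ERA clip becomes a (3, 16, 112, 112) tensor.
clip_tensor = sample_clip("flood_0001.mp4")
```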

IV-C. Baseline Results

Quantitative results of single-frame classification models and video classification models are reported in Table II and Table III, respectively. As we can see, DenseNet-201 achieves the best performance in the single-frame classification task, an OA of 62.3%, and marginally surpasses the second-best model, Inception-v3, by 0.2%. For the video classification task, TRN-Inception-v3 performs best, gaining an OA of 64.3%. Comparing Table II and Table III, it is interesting to observe that the best-performing video classification model obtains the highest OA overall, which demonstrates the significance of exploiting temporal cues in event recognition from aerial videos.

We further show some predictions of the best two single-frame classification network architectures (i.e., Inception-v3 and DenseNet-201) and the best two video classification network architectures (i.e., I3D-Inception-v1 and TRN-Inception-v3) in Fig. 7. As shown in the first row, frames/videos with discriminative event-relevant characteristics, such as collapsed buildings, congested traffic on a highway, and smoke rising from a residential area, can be accurately recognized by all baselines with high confidence scores. Besides, the high-scoring predictions of TRN in identifying parade/protest and concert illustrate that efficiently exploiting temporal information helps in distinguishing events with minor inter-class variance. Moreover, we observe that extreme conditions might disturb predictions; for instance, frames/videos of night and snow scenes (see Fig. 7) tend to be misclassified.

Despite successes achieved by these baselines, there are still some challenging cases as shown in Fig. 8. A common characteristic shared by these examples is that event-relevant attributes such as human actions are not easy to recognize, and this results in failures to identify these events. To summarize, event recognition in aerial videos is still a big challenge, and may benefit from better recognizing discriminative attributes as well as exploiting temporal cues.

In Fig. 9, we provide confusion matrices of the best single-frame classification model DenseNet-201 and the best video classification model TRN-Inception-v3.
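A confusion matrix like those in Fig. 9 can be accumulated directly from the predicted and ground-truth labels; a minimal NumPy sketch (not the authors' plotting code) is shown below.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes=25):
    """cm[i, j] counts test samples of true class i predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def normalize_rows(cm):
    """Row-normalize so the diagonal shows per-class recall, which is how
    such matrices are typically visualized."""
    return cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)
```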

V. Conclusion

We present ERA, a dataset for comprehensively recognizing events in the wild from UAV videos. Organized in a rich semantic taxonomy, the ERA dataset covers a wide range of events involving diverse environments and scales. We report the results of a broad set of deep networks in two settings: single-frame classification and video classification. The experimental results show that this is a hard task for the remote sensing field, and the proposed dataset serves as a new challenge for developing models that can understand what happens on the planet from an aerial view.

References

  • [1] T.-Z. Xiang, G.-S. Xia, and L. Zhang, “Mini-unmanned aerial vehicle-based remote sensing,” IEEE Geoscience and Remote Sensing Magazine, vol. 7, no. 3, pp. 29–63, 2019.
  • [2] A. S. Laliberte and A. Rango, “Texture and scale in object-based analysis of subdecimeter resolution unmanned aerial vehicle (UAV) imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 47, no. 3, pp. 761–770, 2009.
  • [3] T. Adão, J. Hruška, L. Pádua, J. Bessa, E. Peres, R. Morais, and J. Sousa, “Hyperspectral imaging: A review on UAV-based sensors, data processing and applications for agriculture and forestry,” Remote Sensing, vol. 9, no. 11, p. 1110, 2017.
  • [4] G. Pajares, “Overview and current status of remote sensing applications based on unmanned aerial vehicles (UAVs),” Photogrammetric Engineering & Remote Sensing, vol. 81, no. 4, pp. 281–330, 2015.
  • [5] A. Bhardwaj, L. Sam, F. Martín-Torres, and R. Kumar, “UAVs as remote sensing platform in glaciology: Present applications and future prospects,” Remote Sensing of Environment, vol. 175, pp. 196–204, 2016.
  • [6] C. Stöcker, R. Bennett, F. Nex, M. Gerke, and J. Zevenbergen, “Review of the current state of UAV regulations,” Remote Sensing, vol. 9, no. 5, p. 459, 2017.
  • [7] Y. Lin, J. Hyyppä, T. Rosnell, A. Jaakkola, and E. Honkavaara, “Development of a UAV-MMS-collaborative aerial-to-ground remote sensing system–a preparatory field validation,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 6, no. 4, pp. 1893–1898, 2013.
  • [8] S. Harwin and A. Lucieer, “Assessing the accuracy of georeferenced point clouds produced via multi-view stereopsis from unmanned aerial vehicle (UAV) imagery,” Remote Sensing, vol. 4, no. 6, pp. 1573–1599, 2012.
  • [9] S. Yahyanejad and B. Rinner, “A fast and mobile system for registration of low-altitude visual and thermal aerial images using multiple small-scale UAVs,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 104, pp. 189–202, 2015.
  • [10] Y. Chen, T. Hakala, M. Karjalainen, Z. Feng, J. Tang, P. Litkey, A. Kukko, A. Jaakkola, and J. Hyyppä, “UAV-borne profiling radar for forest research,” Remote Sensing, vol. 9, no. 1, p. 58, 2017.
  • [11] L. Wallace, A. Lucieer, and C. Watson, “Evaluating tree detection and segmentation routines on very high resolution UAV LiDAR data,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 12, pp. 7619–7628, 2014.
  • [12] N. Tijtgat, W. Van Ranst, T. Goedeme, B. Volckaert, and F. De Turck, “Embedded real-time object detection for a UAV warning system,” in IEEE International Conference on Computer Vision Workshop (ICCVW), 2017.
  • [13] L. Wallace, R. Musk, and A. Lucieer, “An assessment of the repeatability of automatic forest inventory metrics derived from UAV-borne laser scanning data,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 11, pp. 7160–7169, 2014.
  • [14] X. X. Zhu, D. Tuia, L. Mou, G.-S. Xia, L. Zhang, F. Xu, and F. Fraundorfer, “Deep learning in remote sensing: A comprehensive review and list of resources,” IEEE Geoscience and Remote Sensing Magazine, vol. 5, no. 4, pp. 8–36, 2017.
  • [15] Z. Xu, L. Wu, and Z. Zhang, “Use of active learning for earthquake damage mapping from UAV photogrammetric point clouds,” International Journal of Remote Sensing, vol. 39, no. 15-16, pp. 5568–5595, 2018.
  • [16] G. Milani, M. Volpi, D. Tonolla, M. Doering, C. Robinson, M. Kneubühler, and M. Schaepman, “Robust quantification of riverine land cover dynamics by high-resolution remote sensing,” Remote Sensing of Environment, vol. 217, pp. 491–505, 2018.
  • [17] A. Vetrivel, M. Gerke, N. Kerle, F. Nex, and G. Vosselman, “Disaster damage detection through synergistic use of deep learning and 3D point cloud features derived from very high resolution oblique aerial images, and multiple-kernel-learning,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 140, pp. 45–59, 2018.
  • [18] W. Lee, S. Kim, Y. Lee, H. Lee, and M. Choi, “Deep neural networks for wild fire detection with unmanned aerial vehicle,” in IEEE International Conference on Consumer Electronics (ICCE), 2017.
  • [19] Q. Guo, Y. Su, T. Hu, X. Zhao, F. Wu, Y. Li, J. Liu, L. Chen, G. Xu, G. Lin, Y. Zheng, Y. Lin, X. Mi, L. Fei, and X. Wang, “An integrated UAV-borne LiDAR system for 3D habitat mapping in three forest ecosystems across china,” International Journal of Remote Sensing, vol. 38, no. 8-10, pp. 2954–2972, 2017.
  • [20] S. Manfreda, M. McCabe, P. Miller, R. Lucas, V. Pajuelo Madrigal, G. Mallinis, E. Ben Dor, D. Helman, L. Estes, G. Ciraolo, J. Müllerová, F. Tauro, M. I. D. Lima, J. D. Lima, A. Maltese, F. Frances, K. Caylor, M. Kohv, M. Perks, G. Ruiz-Pérez, Z. Su, G. Vico, and B. Toth, “On the use of unmanned aerial systems for environmental monitoring,” Remote Sensing, vol. 10, no. 4, p. 641, 2018.
  • [21] K. Chiang, G. Tsai, Y. Li, and N. El-Sheimy, “Development of LiDAR-based UAV system for environment reconstruction,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 10, pp. 1790–1794, 2017.
  • [22] B. Kalantar, S. B. Mansor, M. Sameen, B. Pradhan, and H. Shafri, “Drone-based land-cover mapping using a fuzzy unordered rule induction algorithm integrated into object-based image analysis,” International Journal of Remote Sensing, vol. 38, no. 8-10, pp. 2535–2556, 2017.
  • [23] X. Zhu, Y. Hou, Q. Weng, and L. Chen, “Integrating UAV optical imagery and LiDAR data for assessing the spatial relationship between mangrove and inundation across a subtropical estuarine wetland,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 149, pp. 146–156, 2019.
  • [24] E. Semsch, M. Jakob, D. Pavlicek, and M. Pechoucek, “Autonomous UAV surveillance in complex urban environments,” in IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), 2009.
  • [25] A. Matese, P. Toscano, S. Di Gennaro, L. Genesio, F. Vaccari, J. Primicerio, C. Belli, A. Zaldei, R. Bianconi, and B. Gioli, “Intercomparison of UAV, aircraft and satellite remote sensing platforms for precision viticulture,” Remote Sensing, vol. 7, no. 3, pp. 2971–2990, 2015.
  • [26] T. Moranduzzo and F. Melgani, “Detecting cars in UAV images with a catalog-based approach,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 10, pp. 6356–6367, 2014.
  • [27] P. Tokekar, J. Vander Hook, D. Mulla, and V. Isler, “Sensor planning for a symbiotic UAV and UGV system for precision agriculture,” IEEE Transactions on Robotics, vol. 32, no. 6, pp. 1498–1511, 2016.
  • [28] C. Torresan, A. Berton, F. Carotenuto, S. Di Gennaro, B. Gioli, A. Matese, F. Miglietta, C. Vagnoli, A. Zaldei, and L. Wallace, “Forestry applications of UAVs in Europe: A review,” International Journal of Remote Sensing, vol. 38, no. 8-10, pp. 2427–2447, 2017.
  • [29] Q. Feng, J. Liu, and J. Gong, “UAV remote sensing for urban vegetation mapping using random forest and texture analysis,” Remote Sensing, vol. 7, no. 1, pp. 1074–1094, 2015.
  • [30] Y. Guo, X. Jia, D. Paull, J. Zhang, A. Farooq, X. Chen, and M. Islam, “A drone-based sensing system to support satellite image analysis for rice farm mapping,” in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2019.
  • [31] Y. Dai, L. Shen, Y. Cao, T. Lei, and W. Qiao, “Detection of vegetation areas attacked by pests and diseases based on adaptively weighted enhanced global and local deep features,” in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2019.
  • [32] L. Wei, M. Yu, Y. Zhong, J. Zhao, Y. Liang, and X. Hu, “Spatial–spectral fusion based on conditional random fields for the fine classification of crops in UAV-borne hyperspectral remote sensing imagery,” Remote Sensing, vol. 11, no. 7, p. 780, 2019.
  • [33] E. Honkavaara, M. Eskelinen, I. Pölönen, H. Saari, H. Ojanen, R. Mannila, C. Holmlund, T. Hakala, P. Litkey, T. Rosnell, N. Viljanen, and M. Pulkkanen, “Remote sensing of 3-D geometry and surface moisture of a peat production area using hyperspectral frame cameras in visible to short-wave infrared spectral ranges onboard a small unmanned airborne vehicle (UAV),” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 9, pp. 5440–5454, 2016.
  • [34] P. Lottes, R. Khanna, J. Pfeifer, R. Siegwart, and C. Stachniss, “UAV-based crop and weed classification for smart farming,” in IEEE International Conference on Robotics and Automation (ICRA), 2017.
  • [35] L. Deng, Z. Mao, X. Li, Z. Hu, F. Duan, and Y. Yan, “UAV-based multispectral remote sensing for precision agriculture: A comparison between different cameras,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 146, pp. 124–136, 2018.
  • [36] B. Kellenberger, D. Marcos, S. Lobry, and D. Tuia, “Half a percent of labels is enough: Efficient animal detection in UAV imagery using deep CNNs and active learning,” IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 12, pp. 9524–9533, 2019.
  • [37] B. Kellenberger, D. Marcos, and D. Tuia, “Detecting mammals in UAV images: Best practices to address a substantially imbalanced dataset with deep learning,” Remote Sensing of Environment, vol. 216, pp. 139–153, 2018.
  • [38] T. Moranduzzo and F. Melgani, “Detecting cars in UAV images with a catalog-based approach,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 10, pp. 6356–6367, 2014.
  • [39] G. Cheng and J. Han, “A survey on object detection in optical remote sensing images,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 117, pp. 11–28, 2016.
  • [40] L. Jiao, F. Zhang, F. Liu, S. Yang, L. Li, Z. Feng, and R. Qu, “A survey of deep learning-based object detection,” arXiv:1907.09408, 2019.
  • [41] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, “Object detection in optical remote sensing images: A survey and a new benchmark,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 159, pp. 296–307, 2020.
  • [42] Q. Li, L. Mou, Q. Xu, Y. Zhang, and X. X. Zhu, “R-Net: A deep network for multioriented vehicle detection in aerial images and videos,” IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 7, pp. 5028–5042, 2019.
  • [43] Q. Li, L. Mou, Q. Liu, Y. Wang, and X. X. Zhu, “HSF-Net: Multiscale deep feature embedding for ship detection in optical remote sensing imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 12, pp. 7147–7161, 2018.
  • [44] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems (NIPS), 2015.
  • [45] M. Küchhold, M. Simon, V. Eiselein, and T. Sikora, “Scale-adaptive real-time crowd detection and counting for drone images,” in IEEE International Conference on Image Processing (ICIP), 2018.
  • [46] R. Bahmanyar, E. Vig, and P. Reinartz, “MRCNet: Crowd counting and density map estimation in aerial and ground imagery,” in British Machine Vision Conference Workshop (BMVCW), 2019.
  • [47] L. Wen, D. Du, P. Zhu, Q. Hu, Q. Wang, L. Bo, and S. Lyu, “Drone-based joint density map estimation, localization and tracking with space-time multi-scale attention network,” arXiv:1912.01811, 2019.
  • [48] D. Du, Y. Qi, H. Yu, Y. Yang, K. Duan, G. Li, W. Zhang, Q. Huang, and Q. Tian, “The unmanned aerial vehicle benchmark: Object detection and tracking,” in European Conference on Computer Vision (ECCV), 2018.
  • [49] P. Zhu, L. Wen, X. Bian, H. Ling, and Q. Hu, “Vision meets drones: A challenge,” arXiv:1804.07437, 2018.
  • [50] C. Fu, F. Lin, Y. Li, and G. Chen, “Correlation filter-based visual tracking for UAV with online multi-feature learning,” Remote Sensing, vol. 11, no. 5, p. 549, 2019.
  • [51] T. Yang, D. Li, Y. Bai, F. Zhang, S. Li, M. Wang, Z. Zhang, and J. Li, “Multiple-object-tracking algorithm based on dense trajectory voting in aerial videos,” Remote Sensing, vol. 11, no. 19, p. 2278, 2019.
  • [52] C. Fu, R. Duan, D. Kircali, and E. Kayacan, “Onboard robust visual tracking for UAVs using a reliable global-local object model,” Sensors, vol. 16, no. 9, p. 1406, 2016.
  • [53] M. Mueller, N. Smith, and B. Ghanem, “A benchmark and simulator for UAV tracking,” in European Conference on Computer Vision (ECCV), 2016.
  • [54] S. Li and D. Yeung, “Visual object tracking for unmanned aerial vehicles: A benchmark and new motion models,” in AAAI Conference on Artificial Intelligence (AAAI), 2017.
  • [55] I. Nigam, C. Huang, and D. Ramanan, “Ensemble knowledge transfer for semantic segmentation,” in Winter Conference on Applications of Computer Vision (WACV), 2018.
  • [56] Y. Lyu, G. Vosselman, G. Xia, A. Yilmaz, and M. Y. Yang, “The UAVid dataset for video semantic segmentation,” arXiv:1810.10438, 2018.
  • [57] S. Azimi, C. Henry, L. Sommer, A. Schaumann, and E. Vig, “Skyscapes – fine-grained semantic understanding of aerial scenes,” in International Conference on Computer Vision (ICCV), 2019.
  • [58] T. Shu, D. Xie, B. Rothrock, S. Todorovic, and S. Zhu, “Joint inference of groups, events and human roles in aerial videos,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [59] M. Barekatain, M. Martí, H. Shih, S. Murray, K. Nakayama, Y. Matsuo, and H. Prendinger, “Okutama-Action: An aerial view video dataset for concurrent human action detection,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017.
  • [60] C. Kyrkou and T. Theocharides, “Deep-learning-based aerial image classification for emergency response applications using unmanned aerial vehicles,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019.
  • [61] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.
  • [62] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [63] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning (ICML), 2015.
  • [64] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the Inception architecture for computer vision,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [65] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, Inception-ResNet and the impact of residual connections on learning,” in AAAI Conference on Artificial Intelligence (AAAI), 2017.
  • [66] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [67] A. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv:1704.04861, 2017.
  • [68] G. Huang, Z. Liu, L. V. D. Maaten, and K. Weinberger, “Densely connected convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [69] B. Zoph, V. Vasudevan, J. Shlens, and Q. Le, “Learning transferable architectures for scalable image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [70] B. Zoph and Q. Le, “Neural architecture search with reinforcement learning,” in International Conference on Learning Representations (ICLR), 2017.
  • [71] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in IEEE International Conference on Computer Vision (ICCV), 2015.
  • [72] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F. Li, “Large-scale video classification with convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [73] K. Soomro, A. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv:1212.0402, 2012.
  • [74] Z. Qiu, T. Yao, and T. Mei, “Learning spatio-temporal representation with pseudo-3D residual networks,” in IEEE International Conference on Computer Vision (ICCV), 2017.
  • [75] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman, “The Kinetics human action video dataset,” arXiv:1705.06950, 2017.
  • [76] J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman, “A short note about Kinetics-600,” arXiv:1808.01340, 2018.
  • [77] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [78] B. Zhou, A. Andonian, A. Oliva, and A. Torralba, “Temporal relational reasoning in videos,” in European Conference on Computer Vision (ECCV), 2018.
  • [79] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fründ, P. Yianilos, M. Mueller-Freitag, F. Hoppe, C. Thurau, I. Bax, and R. Memisevic, “The ”Something Something” video database for learning and evaluating visual common sense,” in IEEE International Conference on Computer Vision (ICCV), 2017.
  • [80] M. Monfort, A. Andonian, B. Zhou, K. Ramakrishnan, S. Bargal, T. Yan, L. Brown, Q. Fan, D. Gutfruend, C. Vondrick, and A. Oliva, “Moments in time dataset: one million videos for event understanding,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.