Action Detection from a Robot-Car Perspective

07/30/2018 ∙ by Valentina Fontana, et al. ∙ Oxford Brookes University ∙ University of Naples Federico II

We present the new Road Event and Activity Detection (READ) dataset, designed and created from an autonomous vehicle perspective to take action detection challenges to autonomous driving. READ will give scholars in computer vision, smart cars and machine learning at large the opportunity to conduct research into exciting new problems such as understanding complex (road) activities, discerning the behaviour of sentient agents, and predicting both the label and the location of future actions and events, with the final goal of supporting autonomous decision making.




1 Introduction

With a rapid increase in the number of cars and other vehicles in the urban transportation system, autonomous driving (or robot-assisted driving) has emerged as one of the predominant research areas in artificial intelligence. Imagine a self-driving car allowing you to catch a bit of sleep on your way to the office or school, watch a movie with your family on a long road trip, or get back home after a night out at the bar. Work towards the development of such advanced autonomous cars has dramatically increased since the achievements of Stanley [25] in the 2005 DARPA Grand Challenge [2]. In recent years many large companies such as Toyota, Ford and Google have introduced their own versions of the robot car concept [10, 7, 13]. As a result, “self-driving cars” are increasingly considered to be the next big step in the development of personal use vehicles.
Society’s smooth acceptance of this new technology, however, depends on many factors such as safety, ethics, cost and reliability, to name a few. For example, from a safety perspective in a mixed scenario in which both robots and humans share the road, smart cars need to be able to spot children approaching a zebra crossing and pre-emptively adjust speed and course to cope with the children’s possible decision to cross the road. At the same time, their cost should be affordable for the average consumer. At present, though, the vast majority of these cars do not meet all these predefined standards to make them available to the public.

The latest generation of robot cars uses a range of different sensors (e.g. laser rangefinders, radar, cameras, GPS) to provide data on what happens on the road, and fuses the information extracted from all these modalities in a meaningful way to suggest how the car should maneuver [7].
A number of autonomous car datasets exist for 3D environment mapping [13], stereo reconstruction [15], or both [1], including optical flow estimation and self localisation [3]. Recently, Maddern et al. introduced a large scale dataset [12] for self localisation via LIDAR and vision sensors. All these benchmarks are designed to address interesting problems – nevertheless, none of them tackles the paramount issue of allowing the car to be aware of the actions performed by surrounding vehicles and humans, and in general of detecting, recognising and anticipating complex road events to support autonomous decision making.

Thus, in this paper, we consider the issue of vision-based autonomous driving, i.e., the problem of enabling cars to self-drive based on streaming videos captured by cameras mounted on them. In such a setting, which closely mimics how human drivers ‘work’, the car needs to reconstruct and understand the surrounding environment from the incoming video sequence(s). A crucial task of video understanding is to recognise and localise (in space and time) different actions or events appearing in the video: for instance, the vehicle needs to perceive the behaviour of pedestrians by identifying which kind of activities (e.g., ‘moving’ versus ‘stopping’) they are performing, and when and where [20] this is happening. In the computer vision literature [4, 26, 21, 9, 6, 22, 14, 23] this problem is termed spatio-temporal action localisation or, in short, action detection. Although a considerable amount of research has been undertaken in this area, most approaches perform offline video processing and are thus not suitable for self-driving cars, which require the online processing of streaming video frames at real-time speed. In contrast, Singh et al. [23] have recently proposed an online, real-time action detection approach. However, as is common practice in the action detection community, they evaluated their model on action detection datasets composed of YouTube video clips, which were not designed from a robot-car perspective.

Unlike current human action detection datasets [19] such as J-HMDB [8], UCF-101 [24], LIRIS-HARL [29], DALY [27] or AVA [5], the Road Event and Activity Detection (READ) dataset we introduce here is specially designed from the perspective of self-driving cars, and includes spatiotemporal actions performed not just by humans but by all road users, including cyclists, motor-bikers, drivers of vehicles large and small, and obviously pedestrians.
We strongly believe, a belief backed up by clear evidence, that an awareness of all the actions and events taking place, and their location within the road scene, is essential for inherently safe self-driving cars. To this purpose we introduce three different types of label for each such road event, namely: (i) the position of the road user relative to the autonomous vehicle perceiving the scene (e.g. in vehicle lane, on right pavement, in incoming lane, in outgoing lane); (ii) the type of the road user (e.g. pedestrian, small/large vehicle, cyclist); and (iii) the type of action being performed by the road user (e.g. moving away, moving towards, crossing the road, crossing the road illegally, and so on).
READ has been generated by providing additional annotation for a number of videos captured by the cameras mounted on the Oxford RobotCar platform, an autonomous Nissan LEAF, while driving in the streets of the city of Oxford, in the United Kingdom (§ 3). All such videos are part of the publicly available Oxford RobotCar Dataset [12] released in 2017 by the Oxford Robotics Institute. More specifically, ground truth labels and bounding box annotations (which indicate where the action/event of interest is taking place in each video frame) are provided for several actions/events taking place in the surroundings of the robot-car (§ 3).

To the best of our knowledge, READ is the first action detection dataset which can be fully exploited to train machine learning algorithms tailored for self-driving robot cars. Additionally, it significantly expands the range and scope of current action detection benchmarks, in terms of size, context, and specific challenges associated with the road scenario. Here we report quantitative action detection results produced on READ by what is currently the state-of-the-art online action detection approach [23].

2 Related work

Most current-generation visual datasets for autonomous driving address issues like 3D environment mapping [13] or stereo reconstruction [15, 1]. A large-scale dataset called KITTI was released in 2013 for optical flow estimation and self localisation [3]. Similarly, Maddern et al. introduced in 2017 a large scale dataset [12] for self localisation via LIDAR and vision sensors. The relevant effort closest to the scope of READ is due to Ramanishka et al. [16], who deal with actions and events in the car context. The authors, however, limit themselves to the behaviour of the driver rather than looking at events involving other cars. As stated above, we think, instead, that a full awareness of events or activities performed by other road users is necessary for an autonomous car to successfully navigate complex road situations. As a consequence, READ considers the problem of detecting road events and activities performed by other road users as well as the robot-car itself.

Inspired by the record-breaking performance of CNN-based object detectors [17, 18, 11], several scholars [23, 22, 4, 14, 26, 28, 31] have recently extended object detectors to videos for spatio-temporal action localisation. This includes, in particular, a recent work by Yang et al. [30] which uses features extracted from the proposals of the current frame to ‘anticipate’ region proposal locations at a later time, and uses them to generate future detections. None of these approaches, however, tackles spatial and temporal reasoning jointly at the network level, as spatial detection and temporal association are treated as two disjoint problems. More recent efforts try to address this problem by predicting ‘micro-tubes’ [21] or, alternatively, ‘tubelets’ [9, 6], for sets of frames taken together. However, to be applicable to the road event detection scenario, these methods need to run in real-time and in an online fashion – for this reason, here we select [23] by Singh et al. as a baseline to conduct experiments on READ.

3 Road Event and Activity Detection dataset

There are six cameras on the Oxford RobotCar [12], three of which are front-facing and used for stereo generation. In our annotation process, we use videos captured by the central camera of the stereo setup. The desired annotations are produced in four steps, as explained in the following subsections.

3.1 Multi-label concept

We consider all the possible agents, the actions they perform, their locations with respect to the ‘autonomous vehicle’ (AV) and the actions of the vehicle itself. Multiple agents may be present at any given time, and perform multiple actions simultaneously. We propose to label each agent using at least one label, and locate its position using a bounding box around it.

3.1.1 Agent labels

We consider three types of actors as main road users, as well as traffic lights as a class of objects that can perform actions able to influence the decisions of the AV. In READ we call them agents. The AV is considered as just another agent. The three road-user agent classes are: pedestrian, vehicle and cyclist. Further, the vehicle category is subdivided into six sub-classes: two-wheeler, car, bus, small-size vehicle, medium-size vehicle, large-size vehicle. Similarly, the ‘traffic light’ agent class is subdivided into two sub-classes (Table 1), one referring to traffic lights in the AV lane and the other to traffic lights in a different lane. Each traffic light class can be associated with three action classes: red, amber and green (see Table 2). Only one out of the 11 agent labels in Table 1 can be assigned to each agent present in the scene.

Autonomous Vehicle (AV)
Small vehicle
Medium vehicle
Large vehicle
Vehicle traffic light
Other traffic light
Table 1: Agent labels.
Figure 1: Illustration of location labelling. Sub-figure (a) shows a green car in front of the Autonomous Vehicle changing lanes, as depicted by the arrow symbol. The associated event will then carry the following labels: ‘In vehicle lane’, ‘Moving left’, ‘Merging’. Once the merging action is completed, the location label changes to ‘In outgoing lane’. In sub-figure (b), if the Autonomous Vehicle is to turn left from lane 6 to lane 4, then lane 4 is the ‘outgoing lane’, as the traffic there moves in the same direction as the AV will once it completes its turn. However, if the Autonomous Vehicle is to turn right from lane 6 to lane 4 (a wrong turn), then lane 4 will be the ‘incoming lane’, as the vehicle will be moving into the incoming traffic.
Moving away
Moving towards
Indicating left
Indicating right
Hazard lights on
Looking behind
Turning left
Turning right
Moving right
Moving left
Overtaking road user
Waiting to cross
Crossing road from left
Crossing road from right
Pushing object
Traffic light red
Traffic light amber
Traffic light green
Table 2: Action Labels.

3.1.2 Action labels

Each agent can carry one or more labels at any given time instant. For example, a traffic light can only carry a single action label - either red, amber or green, whereas a car can be associated with two action labels simultaneously, e.g., ‘turning right’ and ‘indicating right’.

Although some road agents are inherently multi-tasking, some multi-tasking combinations can be suitably described by a single label e.g. ‘pushing a trolley while walking on the footpath’ can be simply labelled as ‘pushing a trolley’. We list all the action labels considered in READ in Table 2.
AV actions. Each video frame is also labelled with the action label associated with the AV. In order to accomplish this, a bounding box is drawn on the bonnet of the AV and labelled. We assign to the AV one of six action labels: ‘moving’, ‘stopped’, ‘turning left’, ‘turning right’, ‘merging’, ‘overtaking road user’. These labels are similar to those used for the AV in [16], while being of a more abstract nature. In addition, as explained, READ covers many more events and actions, performed by other vehicles.

3.1.3 Agent location labels

Agent location is crucial in deciding what action the AV should take next. As the final objective is to assist autonomous decision making, we propose to label the location of each agent from the perspective of the AV. To understand this, Figure 1 illustrates two scenarios in which the location of the other vehicles sharing the road is depicted from the point of view of the AV. Table 3 shows all the possible locations an agent can assume, e.g., a pedestrian can be on the right or the left pavement, or in vehicle lane, or at the crossing or at a bus stop. The same applies to other vehicles as well. There is no location label for the traffic lights as they are not movable objects, but agents of a static nature.

In outgoing bus lane
In incoming bus lane
In outgoing cycle lane
In incoming cycle lane
In vehicle lane
In outgoing lane
In incoming lane
On left pavement
On right pavement
At junction
At traffic lights
At crossing
At bus stop
Table 3: Location Labels.
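
The multi-label concept of Section 3.1 can be illustrated with a minimal sketch of a per-agent annotation record. The class name `AgentAnnotation`, its field names and the `validate` helper are hypothetical and are not part of the READ release; only the constraints they encode (one agent label, one or more action labels, a single action and no location label for traffic lights) are taken from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AgentAnnotation:
    """One annotated agent in one video frame (hypothetical schema)."""
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    agent: str                              # exactly one agent label (Table 1)
    actions: List[str]                      # one or more action labels (Table 2)
    location: Optional[str] = None          # location label (Table 3), None for traffic lights

    def validate(self) -> None:
        """Raise ValueError if the record breaks the labelling rules of Section 3.1."""
        x_min, y_min, x_max, y_max = self.box
        if not (x_min < x_max and y_min < y_max):
            raise ValueError("degenerate bounding box")
        if not self.actions:
            raise ValueError("each agent needs at least one action label")
        if "traffic light" in self.agent.lower():
            # Traffic lights are static agents: a single colour, no location label.
            if len(self.actions) != 1:
                raise ValueError("a traffic light carries exactly one action label")
            if self.location is not None:
                raise ValueError("traffic lights have no location label")


car = AgentAnnotation(
    box=(120.0, 200.0, 340.0, 380.0),
    agent="Car",
    actions=["Turning right", "Indicating right"],
    location="In vehicle lane",
)
car.validate()  # passes: two simultaneous actions are allowed for a car
```

Keeping the three label types in separate fields, rather than one fused string, is what later allows event-level labels to be composed with or without the location.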

3.2 Annotation process

3.2.1 Video collection

In the setup by Maddern et al. [12] there are three front-facing cameras. As part of READ we only annotate the video sequences recorded by the central camera, downloaded from the Oxford RobotCar dataset website. Image sequences are first demosaiced to convert them into RGB image sequences and then encoded into video sequences using ffmpeg at a rate of 12 frames per second (fps). Although the original frame rate in the considered image sequences varies from 11 fps to 16 fps, we made it uniform to keep the annotation process consistent. Only a subset of all the available videos was selected based on content, by manually inspecting the videos. Annotators were asked to select videos so as to cover all types of labels and avoid a heavy-tailed label distribution as much as possible.
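
As a rough sketch of the encoding step, the function below assembles an ffmpeg command line that turns a numbered sequence of already demosaiced RGB images into a 12 fps video. The function name, the file naming pattern and the choice of the H.264 encoder (`libx264`) are assumptions made for illustration; the text above only states that ffmpeg was used at 12 fps.

```python
from typing import List


def build_ffmpeg_command(frame_pattern: str, output_path: str, fps: int = 12) -> List[str]:
    """Return an ffmpeg argument list encoding an image sequence at a fixed frame rate."""
    return [
        "ffmpeg",
        "-framerate", str(fps),  # read the input images at a uniform rate
        "-i", frame_pattern,     # printf-style pattern, e.g. frames/%06d.png (assumed naming)
        "-c:v", "libx264",       # assumed codec, not stated in the text
        "-pix_fmt", "yuv420p",   # widely supported pixel format for playback
        output_path,
    ]


command = build_ffmpeg_command("frames/%06d.png", "sequence.mp4")
print(" ".join(command))
```

The list can be passed to `subprocess.run(command, check=True)` on a machine where ffmpeg is installed.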

3.2.2 Annotation tool

Annotating tens of thousands of frames rich in content is a very intensive process, and calls for a tool which is both fast and user-friendly.
After trying multiple tools, such as the Matlab Autonomous Driving System toolbox and Vatic, we decided to use an open source tool made available on GitHub by Microsoft, called Visual Object Tagging Tool (VoTT). The most useful feature of VoTT is that it can copy annotations (bounding boxes and their labels) from the previous frame to the current frame, so that boxes across frames are automatically linked together. We used the most basic version of VoTT, without any detector or tracking; the annotation copy feature alone allows us to link the bounding boxes across time implicitly. VoTT also allows for multiple labels, as in our multi-label annotation concept which requires us to label location, agent and action simultaneously.

3.3 Final event label creation

Given annotations for actions and agents in the multi-label scenario as discussed above, we can generate event-level labels pertaining to the agents, e.g. ‘pedestrian moving towards the AV on the right pavement’, ‘cyclist overtaking in the vehicle lane’ etc. These labels can be any combinations of location, action and actor labels. If we ignore the location labels the resulting event labels become location invariant. When creating event-level labels a trade-off needs to be struck, depending on the final application and the number of instances available in the dataset for a particular event.
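
A minimal sketch of this composition step is given below. The function name `compose_event_label` and the exact string format are illustrative assumptions; the point is only that dropping the location argument yields a location-invariant event label.

```python
from typing import Optional


def compose_event_label(agent: str, action: str, location: Optional[str] = None) -> str:
    """Join agent, action and (optionally) location labels into one event label."""
    parts = [agent.strip(), action.strip().lower()]
    if location:
        parts.append(location.strip().lower())
    return " ".join(parts)


print(compose_event_label("Cyclist", "Moving away", "In vehicle lane"))
# prints "Cyclist moving away in vehicle lane"
print(compose_event_label("Cyclist", "Moving away"))
# prints "Cyclist moving away" (location-invariant)
```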

Figure 2: An ideal autonomous driving system in action.

3.4 Complex activity label creation

The purpose of READ is to go beyond the detection of simple actions (as is typical of current action detection datasets), to provide a benchmark test-bed for complex road activities, defined as ensembles of events and actions performed by more than one agent in a correlated/coordinated way. A complex activity is thus made up of simple actions. E.g., ‘Illegal crossing’ is composed of ‘Pedestrian crossing road’ + ‘Vehicle traffic light green’ + ‘Vehicle braking’ + ‘Vehicle stopped at traffic light’.
A list of READ complex activities is shown in Table 4.

Pedestrian crossing road legally
Pedestrian crossing road illegally
Vehicle stopping at traffic light
Vehicle stopping for crossing
Vehicle doing three point turn
Vehicle doing U-turn
Bus stopping at the bus stop
Vehicle parking
Cyclist riding legally
Cyclist riding illegally
Vehicle avoiding collision by slowing down
Vehicle avoiding collision by moving to another lane
Vehicle stopping temporary
Vehicle indicating hazard manoeuvre
Vehicle moving for emergency vehicle
Vehicle stopping for emergency vehicle
Vehicle avoiding a stationary object
Table 4: Complex activity labels.
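
To make the idea of a complex activity as an ensemble of simple events concrete, here is a deliberately naive sketch that flags a complex activity whenever all of its component events have been observed. The dictionary `COMPLEX_ACTIVITIES` and the function `detect_complex_activities` are hypothetical; the single definition reuses the ‘Illegal crossing’ example from the text. Temporal ordering and the correlation between agents are ignored here, so this is only a starting point and not a proposed detection method.

```python
from typing import Dict, FrozenSet, Iterable, List

# Component events of each complex activity (only the example given in the text).
COMPLEX_ACTIVITIES: Dict[str, FrozenSet[str]] = {
    "Illegal crossing": frozenset({
        "Pedestrian crossing road",
        "Vehicle traffic light green",
        "Vehicle braking",
        "Vehicle stopped at traffic light",
    }),
}


def detect_complex_activities(
    observed_events: Iterable[str],
    definitions: Dict[str, FrozenSet[str]] = COMPLEX_ACTIVITIES,
) -> List[str]:
    """Return the complex activities whose component events were all observed."""
    observed = set(observed_events)
    return sorted(name for name, parts in definitions.items() if parts <= observed)


window = [
    "Pedestrian crossing road",
    "Vehicle traffic light green",
    "Vehicle braking",
    "Vehicle stopped at traffic light",
    "Cyclist moving in lane",
]
print(detect_complex_activities(window))  # prints ['Illegal crossing']
```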

4 Current Action Detection Methods

Most action detection methods [4, 26, 21, 22, 14] follow an offline action tube generation approach. These methods build action tubes (i.e., sequences of detection bounding boxes around the action of interest, linked in time) by assuming that the entire video is available beforehand. Such methods are not suitable for self-driving cars because their action tube generation component is offline and slow, whereas for self-driving cars we want to process streaming videos online at real-time speed.
Figure 2 illustrates an ideal self-driving system in action with an example. A car is approaching a zebra crossing. The camera mounted on the car sends one or more streaming video sequences as input to an action detection module which processes, in real time, chunks of video, and outputs the class confidence and space-time locations (action tubes) of the various actions and events it perceives. Based on the action detection outputs, signals from other sensors mounted on the car and outputs from other modules (such as path planning, sign and object detection, self-localisation), an autonomous driving system sends control signals to the car. Note that, as far as the action detection system is concerned, (1) action tube generation has to be online: in the first instance action tubes need to be built based on the initial video chunk, whereas tubes are later incrementally updated in time as more and more chunks arrive. Furthermore, (2) the processing needs to take place in real-time to allow the driving system to swiftly react to new developments in its environment.

Figure 3: Online real-time action detection pipeline proposed by Singh et al. [23].

Most recently, Singh et al. [23] have proposed an action tube generation algorithm which incrementally builds action tubes in an online fashion and at real-time speed, and is thus suitable for our task. Following this work, other authors [9, 5] have used Singh et al.’s online action tube building algorithm, although without exhibiting real-time processing speed. In the following section we briefly recall the online real-time action detection approach of [23], which is used in this work to report action detection results on our new Road Event and Activity Detection dataset.

4.1 Online real-time action detection

Figure 3 shows the block diagram of the online real-time action detection pipeline proposed by Singh et al. [23]. The pipeline takes RGB and optical flow frames as inputs and processes them through their respective appearance and motion streams. The appearance and motion streams output frame-level detection bounding boxes and their associated softmax scores, which are then fused using a late fusion scheme. This allows the system to exploit the complementary aspects of appearance and motion information concerning the actions present in the video. The resulting frame-level detections are incrementally linked in time to build action tubes in an online fashion. Unlike previous approaches [4, 26, 21, 22, 14], Singh et al. [23] use a faster and more efficient SSD fully convolutional neural network architecture [11] to implement the appearance and motion streams. Furthermore, [23] proposes an elegant online action tube generation algorithm as opposed to the offline algorithms used by previous authors [4, 26, 21, 22, 14]. Rather than generating action tubes using a Viterbi forward and backward pass (thus assuming that frame-level detections are available for the entire video), [23] incrementally builds action tubes using only a Viterbi forward pass, starting by processing a smaller video snippet (a few initial frames) and incrementally updating the tubes as more and more frames are available to the system.
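
The flavour of incremental, forward-only tube building can be conveyed with a simplified greedy sketch. This is not the algorithm of [23]: it handles a single class, omits tube termination and temporal labelling, and the names `Tube`, `update_tubes` and the `min_iou` threshold value are assumptions chosen for illustration. At each new frame, existing tubes (best-scoring first) are extended with the highest-scoring detection that overlaps their last box, and unmatched detections start new tubes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Detection = Tuple[Box, float]            # (box, class score)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Tube:
    boxes: List[Box] = field(default_factory=list)
    scores: List[float] = field(default_factory=list)

    def mean_score(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0


def update_tubes(tubes: List[Tube], detections: List[Detection], min_iou: float = 0.1) -> List[Tube]:
    """Extend tubes with the detections of one new frame (forward pass only)."""
    unused = list(detections)
    # Tubes with a higher mean score get first pick of the new detections.
    for tube in sorted(tubes, key=Tube.mean_score, reverse=True):
        candidates = [d for d in unused if iou(tube.boxes[-1], d[0]) >= min_iou]
        if candidates:
            best = max(candidates, key=lambda d: d[1])
            tube.boxes.append(best[0])
            tube.scores.append(best[1])
            unused.remove(best)
    # Every detection left over starts a new tube.
    for box, score in unused:
        tubes.append(Tube(boxes=[box], scores=[score]))
    return tubes


tubes: List[Tube] = []
tubes = update_tubes(tubes, [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.8)])  # frame 1
tubes = update_tubes(tubes, [((1, 1, 11, 11), 0.7)])                           # frame 2
print([len(t.boxes) for t in tubes])  # prints [2, 1]
```

Because each call only looks at the last box of every tube, the cost per frame does not grow with video length, which is what makes this style of linking compatible with streaming input.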

5 Experiments

In this section, we present action detection results on the initial version of the READ dataset. We term this version READv1, which was created by annotating six days of videos from [12]. Each day’s video is divided into multiple videos, usually 20-40 minutes long. We followed the annotation process described in Section 3. However, the resulting dataset had a long tail, and some actions did not have many instances. We combined agents and actions to form event categories. If we also consider location labels, the number of final event classes increases, and hence the available instances get divided among more classes. As explained in Section 3.3, we picked 32 event classes, shown in Table 5 along with their number of instances. In this case, an event instance is an annotation of that particular event in a frame with a bounding box, and one frame can contain multiple instances of an event.

The READv1 dataset contains 11K annotated frames in total: 4343 of them are used as the test set, with the remaining ones used for training. These 11K frames are sampled, at the rate of 4 frames per second, from a broader set of frames coming from videos captured over 6 days.

5.1 Detector details

In the initial tests shown here, we train only the appearance stream of the action detector of Singh et al. [23] (§ 4) using READv1’s annotations. We train the action detection network for up to 40K iterations, using an initial learning rate of 0.001 for the first 30K iterations. We plan to add the training of the flow stream and the fusion strategy when running tests on the next version of the dataset, which we plan to release by October 2018.

5.2 Evaluation metric

In these tests evaluation is done on a frame-by-frame basis, rather than on the basis of action tube detections. Namely, we use frame-AP [4] as the evaluation metric, rather than video-mAP [4], as the temporal association of ground truth bounding boxes is not yet fully available. Nevertheless, we plan to make it available soon, in order to be able to evaluate this or any other model using video-mAP, which is the accepted, standard evaluation metric for action detection. As is standard practice, we computed frame-AP results at a detection threshold of 0.5, i.e., a detection counts as correct when the Intersection-over-Union (IoU) overlap between ground truth and predicted action bounding box is equal to or greater than 0.5.
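
For readers unfamiliar with the metric, the following sketch computes frame-AP for a single class at an IoU threshold of 0.5. It uses the non-interpolated form of average precision (the mean of the precision values at each true positive, divided by the number of ground-truth boxes); whether the reported results use this or an interpolated variant is not stated in the text, so this choice, like the function names and input layout, is an assumption.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def frame_average_precision(
    detections: List[Tuple[int, float, Box]],  # (frame_id, score, box) for one class
    ground_truth: Dict[int, List[Box]],        # frame_id -> ground-truth boxes of that class
    iou_threshold: float = 0.5,
) -> float:
    total_gt = sum(len(boxes) for boxes in ground_truth.values())
    if total_gt == 0:
        return 0.0
    matched = {frame: [False] * len(boxes) for frame, boxes in ground_truth.items()}
    true_positives = 0
    precision_sum = 0.0
    ranked = sorted(detections, key=lambda d: d[1], reverse=True)
    for rank, (frame_id, _score, box) in enumerate(ranked, start=1):
        best_iou, best_idx = 0.0, -1
        for idx, gt_box in enumerate(ground_truth.get(frame_id, [])):
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        # A detection is correct if it overlaps an unmatched ground-truth box enough.
        if best_idx >= 0 and best_iou >= iou_threshold and not matched[frame_id][best_idx]:
            matched[frame_id][best_idx] = True
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / total_gt


gt = {0: [(0, 0, 10, 10)], 1: [(0, 0, 10, 10)]}
dets = [
    (0, 0.9, (0, 0, 10, 10)),        # exact match in frame 0
    (1, 0.8, (100, 100, 110, 110)),  # false positive in frame 1
    (1, 0.7, (1, 1, 10, 10)),        # IoU 0.81 with the frame 1 ground truth
]
ap = frame_average_precision(dets, gt)
print(round(ap, 4))  # prints 0.8333
```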

5.3 Discussion

Table 5 shows the action detection results on the test set of READv1. We can see a clear correlation between the number of instances (in the second column) and the performance of each class (in the third column). It indicates that an increase in the number of instances per class should improve detection performance. The final performance, in terms of frame-mAP, is a modest 17.5%, as the number of classes is limited.
We are working to improve the dataset in a number of ways: i) by providing annotation in a multi-label format, as described in Section 3, to describe the different aspects of road events and activities; ii) by annotating additional instances for the classes which have fewer instances, to avoid imbalance; iii) by providing the temporal linking of detections across frames.

Event label instances AP@0.5
Car indicating right 51 20.1
Car turn right 235 12.7
Car indicating left 1 00.0
Car turn left 51 18.7
Car stopped at the traffic light 1638 06.1
Car moving in lane 3098 45.8
Car braking in lane 102 04.7
Car stopped in lane 1404 61.9
Car waiting at junction 6 00.0
Pedestrian waiting to cross 452 03.8
Pedestrian crossing road legally 468 18.0
Pedestrian walking on left pavement 1930 31.1
Pedestrian walking on right pavement 1930 30.9
Pedestrian walking on road left side 48 01.8
Pedestrian walking on road right side 48 02.4
Pedestrian crossing road illegally 16 00.0
Cyclist indicating left 16 01.8
Cyclist indicating right 16 00.0
Cyclist moving to left lane 17 06.9
Cyclist moving to right lane 17 00.9
Cyclist stopped at the traffic light 790 39.8
Cyclist turn right 56 00.1
Cyclist turn left 56 00.0
Cyclist crossing 58 23.9
Cyclist moving in lane 2306 44.5
Cyclist stopped in lane 16 00.0
Cyclist moving on pavement 40 00.0
Motorbike moving in lane 1 00.0
Bus moving in lane 456 38.3
Trafficlight red 1526 52.7
Trafficlight amber 324 32.7
Trafficlight green 852 58.6
total 18027 mAP=17.5
Table 5: Frame-mAP @ IoU 0.5 for event detection on the 4343 frames of test set.
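
As a quick sanity check on Table 5, frame-mAP can be recomputed as the unweighted mean of the 32 per-class AP values listed there. The variable names are illustrative; the numbers are copied from the table in row order.

```python
# Per-class AP@0.5 values from Table 5, in row order.
AP_PER_CLASS = [
    20.1, 12.7, 0.0, 18.7, 6.1, 45.8, 4.7, 61.9,
    0.0, 3.8, 18.0, 31.1, 30.9, 1.8, 2.4, 0.0,
    1.8, 0.0, 6.9, 0.9, 39.8, 0.1, 0.0, 23.9,
    44.5, 0.0, 0.0, 0.0, 38.3, 52.7, 32.7, 58.6,
]

frame_map = sum(AP_PER_CLASS) / len(AP_PER_CLASS)
print(f"{len(AP_PER_CLASS)} classes, frame-mAP = {frame_map:.2f}")
# prints "32 classes, frame-mAP = 17.44"
```

The result is close to the reported 17.5; the small gap is presumably due to the per-class values in the table being rounded to one decimal place.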

6 Conclusions

In this report we presented a new Road Event and Activity Detection (READ) dataset, as the first benchmark for road event detection in autonomous driving. READ has been constructed by providing extra annotation to a fraction of the recently released Oxford RobotCar dataset. The annotation provided follows a multi-label approach in which road agents (including the AV), their locations and the action they perform (possibly more than one) are labelled separately. Event-level labels can be generated by simply composing lower-level descriptions.

Here we showed preliminary tests conducted using the current state of the art in online action detection, using only frame-mAP as a metric, and on a small subset of the final dataset, using an initial set of event labels. In the upcoming months, prior to release of the full dataset, we will work towards (i) completing the multi-label annotation of around 40,000 frames coming from videos spanning a wide range of road conditions; (ii) providing the temporal association ground truth information necessary to compute video-mAP results; (iii) devising a novel, deep learning approach to detecting complex activities, such as those associated with common driving situations. All data, documentation and baseline code will be publicly released on GitHub.


  • [1] J.-L. Blanco-Claraco, F.-Á. Moreno-Dueñas, and J. González-Jiménez. The málaga urban dataset: High-rate stereo and lidar in a realistic urban scenario. The International Journal of Robotics Research, 33(2):207–214, 2014.
  • [2] M. Buehler, K. Iagnemma, and S. Singh. The 2005 DARPA grand challenge: the great robot race, volume 36. Springer, 2007.
  • [3] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
  • [4] G. Gkioxari and J. Malik. Finding action tubes. In IEEE Int. Conf. on Computer Vision and Pattern Recognition, 2015.
  • [5] C. Gu, C. Sun, S. Vijayanarasimhan, C. Pantofaru, D. A. Ross, G. Toderici, Y. Li, S. Ricco, R. Sukthankar, C. Schmid, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. arXiv preprint arXiv:1705.08421, 2017.
  • [6] R. Hou, C. Chen, and M. Shah. Tube convolutional neural network (t-cnn) for action detection in videos. In IEEE Int. Conf. on Computer Vision, 2017.
  • [7] G. Inc. Self-driving car project - google. Available at:
  • [8] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. Black. Towards understanding action recognition. 2013.
  • [9] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid. Action tubelet detector for spatio-temporal action localization. In IEEE Int. Conf. on Computer Vision, 2017.
  • [10] K. Korosec. Toyota is betting on this startup to drive its self-driving car plans forward. Available at:
  • [11] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. arXiv preprint arXiv:1512.02325, 2015.
  • [12] W. Maddern, G. Pascoe, C. Linegar, and P. Newman. 1 year, 1000 km: The oxford robotcar dataset. The International Journal of Robotics Research, 36(1):3–15, 2017.
  • [13] G. Pandey, J. R. McBride, and R. M. Eustice. Ford campus vision and lidar data set. The International Journal of Robotics Research, 30(13):1543–1552, 2011.
  • [14] X. Peng and C. Schmid. Multi-region two-stream R-CNN for action detection. In ECCV 2016 - European Conference on Computer Vision, Amsterdam, Netherlands, Oct. 2016.
  • [15] D. Pfeiffer, S. Gehrig, and N. Schneider. Exploiting the power of stereo confidences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 297–304, 2013.
  • [16] V. Ramanishka, Y.-T. Chen, T. Misu, and K. Saenko. Toward driving scene understanding: A dataset for learning driver behavior and causal reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7699–7707, 2018.
  • [17] J. Redmon and A. Farhadi. Yolo9000: Better, faster, stronger. arXiv preprint arXiv:1612.08242, 2016.
  • [18] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
  • [19] M. S. Ryoo and J. K. Aggarwal. UT-Interaction Dataset, ICPR contest on Semantic Description of Human Activities (SDHA)., 2010.
  • [20] S. Saha. Phd thesis - spatio-temporal human action detection and instance segmentation in videos. Available at:, 2018.
  • [21] S. Saha, G. Singh, and F. Cuzzolin. Amtnet: Action-micro-tube regression by end-to-end trainable deep architecture. In IEEE Int. Conf. on Computer Vision, 2017.
  • [22] S. Saha, G. Singh, M. Sapienza, P. H. S. Torr, and F. Cuzzolin. Deep learning for detecting multiple space-time action tubes in videos. In British Machine Vision Conference, 2016.
  • [23] G. Singh, S. Saha, M. Sapienza, P. Torr, and F. Cuzzolin. Online real-time multiple spatiotemporal action localisation and prediction. In IEEE Int. Conf. on Computer Vision, 2017.
  • [24] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human action classes from videos in the wild. Technical report, CRCV-TR-12-01, 2012.
  • [25] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, et al. Stanley: The robot that won the darpa grand challenge. Journal of field Robotics, 23(9):661–692, 2006.
  • [26] P. Weinzaepfel, Z. Harchaoui, and C. Schmid. Learning to track for spatio-temporal action localization. In IEEE Int. Conf. on Computer Vision and Pattern Recognition, June 2015.
  • [27] P. Weinzaepfel, X. Martin, and C. Schmid. Human action localization with sparse spatial supervision. arXiv preprint arXiv:1605.05197, 2016.
  • [28] P. Weinzaepfel, X. Martin, and C. Schmid. Towards weakly-supervised action localization. arXiv preprint arXiv:1605.05197, 2016.
  • [29] C. Wolf, J. Mille, E. Lombardi, O. Celiktutan, M. Jiu, E. Dogan, G. Eren, M. Baccouche, E. Dellandrea, C.-E. Bichot, C. Garcia, and B. Sankur. Evaluation of video activity localizations integrating quality and quantity measurements. In Computer Vision and Image Understanding, 127:14–30, 2014.
  • [30] Z. Yang, J. Gao, and R. Nevatia. Spatio-temporal action detection with cascade proposal and location anticipation. In BMVC, 2017.
  • [31] M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox. Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection. In IEEE Int. Conf. on Computer Vision, pages 2923–2932. IEEE, 2017.