Imagine a pedestrian on the sidewalk, and an autonomous car cruising on the nearby road. If the pedestrian stays on the sidewalk and continues walking, they are of no concern to the self-driving car. What if, instead, they start approaching the road, possibly attempting to cross it? Any prediction of the pedestrian’s future action and their possible position on or off the road would crucially help the autonomous car avoid a potential incident. Foreseeing the pedestrian’s action label and position even half a second early could suffice to avoid a major accident. As a result, awareness of surrounding human actions is essential for the robot-car.
We can formalise the problem as follows. We seek to predict both the class label and the future spatial location(s) of an action instance as early as possible, as shown in Figure 1. In essence, this translates into early spatio-temporal action detection, achieved by completing action instance(s) for the unobserved part of the video. As is common practice, action instances are here described by ‘tubes’ formed by linking bounding box detections in time.
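As a concrete illustration, an action tube can be represented as a class label plus a time-ordered list of bounding boxes. The following minimal Python sketch (names are illustrative, not from any released code) shows such a structure:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class ActionTube:
    """A class label plus per-frame boxes linked in time (illustrative)."""
    label: str
    boxes: List[Tuple[int, Box]] = field(default_factory=list)

    def add_box(self, frame: int, box: Box) -> None:
        self.boxes.append((frame, box))

    def span(self) -> Tuple[int, int]:
        """First and last frame covered by the tube."""
        frames = [f for f, _ in self.boxes]
        return min(frames), max(frames)

tube = ActionTube("crossing")
for t in range(1, 4):
    tube.add_box(t, (10.0 + t, 20.0, 50.0 + t, 120.0))
print(tube.span())  # (1, 3)
```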
In an existing relevant work by Singh , early label prediction and online action detection are performed jointly. The action class label for an input video is predicted early on just by observing a smaller portion (a few frames) of it, whilst the system incrementally builds action tubes in an online fashion. In contrast, the proposed approach can predict both the class label of an action and its future location(s) (i.e., the future shape of an action tube). In this work, by ‘prediction’ we refer to the estimation of both an action’s label and location in future, unobserved video segments. We term ‘detection’ the estimation of action labels/locations in the observed segment of video up to any given time, i.e., for present and past video frames.
The computer vision community is witnessing a rising interest in problems such as early action label prediction [31, 24, 9, 2, 32, 39, 40, 16, 42, 28, 20], online temporal action detection [23, 39, 6, 33], online spatio-temporal action detection [31, 32, 38], future representation prediction [34, 16] and trajectory prediction [1, 15, 21]. Although all these problems are interesting, and definitely encompass a broad scope of applications, they do not entirely capture the complexity involved in many critical scenarios, e.g., surgical robotics or autonomous driving. Unlike [31, 32], which can only perform early label prediction and online action detection, in this work we propose to predict both the future action location and the action label. A number of challenges make this problem particularly hard: the temporal structure of an action is obviously not completely observed; locating human actions is itself a difficult task; and the observed part can only provide clues about the future locations. In addition, camera movement can make it even harder to extrapolate an entire tube. We propose to address these problems by regressing the future locations from the present tube.
The ability to predict action micro-tubes (sets of temporally connected bounding boxes spanning video frames) from pairs of frames  or sets of frames [13, 10] provides a powerful tool to extend the single frame-based online approach by Singh  in order to cope with action location prediction, while retaining its incremental nature. Combining the basic philosophies of  and  has thus the potential to provide an interesting and scalable approach to action prediction.
Briefly, the action micro-tubes network (AMTnet, ) divides the action tube detection problem into a set of smaller sub-problems. Action ‘micro-tubes’ are produced by a convolutional neural network (a 3D region proposal network) processing two input frames that are a fixed temporal distance apart. Each micro-tube consists of two bounding boxes belonging to the two frames. When the network is applied to consecutive pairs of frames, it produces a set of consecutive micro-tubes which can finally be linked  to form complete action tubes. The detections forming a micro-tube can be considered implicitly linked, hence reducing the number of linking sub-problems. Whereas AMTnet was originally designed to generate micro-tubes using only appearance (RGB) inputs, here we augment it by introducing the feature-level fusion of flow and appearance cues, drastically improving its performance and, as a result, that of TPnet.
Concept: We propose to extend the action micro-tube detection architecture by Saha  to produce, at any given time, past, present, and future detection bounding boxes,
so that each (extended) micro-tube contains bounding boxes for both observed and not yet observed frames. All bounding boxes, spanning presently observed frames as well as past and future ones (in which case we call them predicted bounding boxes), are considered to be linked, as shown in blue in Figure 2.
We call this new deep network ‘TPnet’.
Once bounding boxes are regressed, the online tube construction method of Singh  can be incrementally applied to the observed part of the video to generate one or more ‘detected’ action tubes at any time instant .
Further, by virtue of TPnet and online tube construction, the temporally linked micro-tubes forming each currently detected action tube (spanning the observed segment of the video) also contain past and future estimated bounding boxes. As these predicted boxes are implicitly linked to the micro-tubes composing the presently detected tube, the problem of linking the future bounding boxes to the latter, yielding a whole action tube, is automatically addressed.
The proposed approach provides two main benefits: i) future bounding box predictions are implicitly linked to the present action tubes; ii) as the method relies only on two consecutive frames separated by a constant distance , it is efficient enough to be applicable to real-time scenarios.
Contributions: In summary, we present a Tube Predictor network (TPnet) which:
given a partially observed video, can predict video-long action tubes early, in terms of both their classes and their constituent bounding boxes;
demonstrates that training a network to make predictions also helps in improving action detection performance;
demonstrates that feature-level fusion works better than late fusion in the context of spatio-temporal action detection.
2 Related work
and Fisher vectors. Recently, Yeung [39, 40]
have proposed a variant of long short-term memory (LSTM) deep networks for modelling these temporal relations via multiple input and output connections. Kong
, instead, make use of variational auto-encoders to predict a representation for the whole video, and use it to determine the action category as early as possible. Probabilistic approaches based on Bayesian networks , Conditional Random Fields  or Gaussian processes  may help in activity anticipation. However, inference in such generative approaches is often expensive. None of these methods addresses the full online label and spatiotemporal location prediction setting considered here.
Online Action Detection. Soomro  have recently proposed an online method which can predict an action’s label and detect its location by observing a relatively small portion of the entire video sequence. They use segmentation to perform online detection via SVM models trained on fixed-length segments of the training videos. Singh  have extended online action detection to untrimmed videos with the help of an online tube construction algorithm built on top of frame-level action bounding box detections. Similarly, Behl  solve online detection with the help of a tracking formulation. However, these approaches [32, 31, 3] only perform action localisation for the observed part of the video, and adopt the label predicted for the currently detected tube as the label for the whole video.
To the best of our knowledge, no existing method generates predictions concerning both labels and action tube geometry. Interestingly, Yang  use features from current frame proposals to ‘anticipate’ region proposal locations in subsequent frames and to generate the corresponding detections, thus failing to take full advantage of the anticipation trick to predict the future spatiotemporal extent of the action tubes.
Advances in action recognition are always going to be helpful for action prediction, from a general representation learning point of view. For instance, Gu  have recently improved on [25, 13] by plugging in the inflated 3D network proposed by  as a base network applied to multiple frames. Although they use a very strong base network pre-trained on the large “Kinetics”  dataset, they do not handle the linking process within the network, as the AVA  dataset’s annotations are not temporally linked. Analogously, learning to predict future representations  can be useful for action prediction in general (cf., e.g., ).
Recently, inspired by the record-breaking performance of CNN-based object detectors [26, 27, 22], a number of scholars [31, 30, 7, 25, 35, 37, 41, 3] have tried to extend frame-level object detectors to videos for spatio-temporal action localisation. These approaches, however, fail to tackle spatial and temporal reasoning jointly at the network level, as spatial detection and temporal association are treated as two disjoint problems. More recent works have attempted to address this issue by reducing the amount of linking required, with the help of ‘micro-tubes’  or ‘tubelets’ [13, 10] computed over small sets of frames taken together, where micro-tube boxes from different frames are considered to be linked. AMTnet  by Saha is particularly interesting because of its compact (GPU memory-wise) and flexible nature: it can exploit pairs of successive frames several sampling intervals apart, and it can also leverage sparse annotations . For these reasons, in this work we build on AMTnet as our base network, improving its feature representation via feature-level fusion of motion and appearance cues.
In this section, we describe our tube prediction framework for the problem formulation described in § 3.1. Our approach has four main components. Firstly, we tie the future action tube prediction problem (§ 3.1) to action micro-tube  detection. Secondly, we devise our tube prediction network (TPnet) to predict future bounding boxes along with current micro-tubes, and describe its training procedure in § 3.3. Thirdly, we apply TPnet in a temporal sliding window fashion (§ 3.4), generating micro-tubes and their corresponding future predictions. Finally, these are fed to a tube prediction framework (§ 3.4) to generate the future of any current action tube being built from micro-tubes.
3.1 Problem Statement
We define an action tube as a connected sequence of detection boxes in time, without interruptions and associated with the same action class , starting at the first frame and ending at the last frame of a trimmed video. Tubes are constrained to span the entire video duration, as in . At any time point , a tube is divided into two parts: one needs to be detected up to the current time, while the other needs to be predicted/estimated for the remaining frames, along with its class . The observed part of the video is responsible for generating the detected tube segment (red in Fig 1), while we need to estimate the future section of the tube (blue in Fig 1) for the unobserved segment of the video. The first sub-problem, online detection, is explained in § 3.2. The second sub-problem (the estimation of the future tube segment) is tackled by a tube prediction network (TPnet, § 3.3) within a novel tube prediction framework (§ 3.4).
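The split of a tube into an observed (to-be-detected) part and a part to be predicted can be sketched as follows; this is a hypothetical helper, not the paper's code, operating on a tube's (frame, box) list and the current time:

```python
def split_tube(boxes, t):
    """Split a tube's (frame, box) list at time t: frames up to t form the
    observed part, later frames form the part to be predicted."""
    observed = [(f, b) for f, b in boxes if f <= t]
    future = [(f, b) for f, b in boxes if f > t]
    return observed, future

boxes = [(1, "b1"), (2, "b2"), (3, "b3"), (4, "b4")]
observed, future = split_tube(boxes, 2)
print(len(observed), len(future))  # 2 2
```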
3.2 From micro-tubes to full action tubes
Saha  introduced micro-tubes in their action micro-tube network (AMTnet) proposal, shown in Figure 3. AMTnet decomposes the problem of detecting an action tube into a set of smaller problems: micro-tubes are detected, along with their classification scores for the action classes, using two successive frames as input (Fig. 3(a)). Subsequently, the detected micro-tubes are linked up in time to form an action tube. As in , one background class is added to the class list, increasing the class count by one.
AMTnet employs two parallel CNN streams (Fig. 3(b)), one for each frame, which produce two feature maps (Fig. 3(c)). These feature maps are stacked together into one (Fig 3(d)). Finally, convolutional heads are applied in a (spatial) sliding window fashion over predefined anchor regions, which correspond to prior  or anchor  boxes. The convolutional heads produce a regression output per micro-tube (Fig. 3(f)) and the corresponding classification scores (Fig 3(g)). Each micro-tube comprises two sets of bounding box coordinates, one for the box in the first frame and one for the box in the second. As shown in Figure 3(f), the pair of boxes can be considered implicitly linked together, hence the name micro-tube.
Originally, Saha  employed FasterRCNN 
as base detection architecture. Here, however, we switch to Single Shot Detector (SSD)  as
a base detector for efficiency reasons.
Singh  used SSD to propose an online and real-time action tube generation algorithm, while
Kalogeiton  adapted SSD
to detect micro-tubes (or, in their terminology, ‘tubelets’) frames long.
More importantly, we make two essential changes to AMTnet. Firstly, we enhance its feature representation power by fusing appearance features (based on RGB frames) and flow features (based on optical flow) at the feature level (see the fusion step shown in Fig 4), unlike the late fusion approach of  and . Note that the original AMTnet framework does not make use of optical flow at all. We will show that feature-level fusion dramatically improves its performance. Secondly, the AMTnet-based tube detection framework proposed in  is offline, as micro-tube linking is done recursively in an offline fashion . Similarly to Kalogeiton , we adapt the online linking method of  to link micro-tubes into a tube.
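The two fusion options used at the feature level can be sketched on raw arrays as follows; NumPy stands in for the actual CNN feature tensors, and the shapes are illustrative:

```python
import numpy as np

def fuse_features(app, flow, mode="sum"):
    """Feature-level fusion of appearance and flow feature maps of shape
    (C, H, W): element-wise sum keeps C channels, concatenation yields 2C."""
    if mode == "sum":
        return app + flow
    if mode == "concat":
        return np.concatenate([app, flow], axis=0)
    raise ValueError(f"unknown fusion mode: {mode}")

app = np.random.rand(512, 38, 38)
flow = np.random.rand(512, 38, 38)
print(fuse_features(app, flow, "sum").shape)     # (512, 38, 38)
print(fuse_features(app, flow, "concat").shape)  # (1024, 38, 38)
```

Sum fusion leaves the channel count, and hence the downstream convolutional head sizes, unchanged, which is also why it is less memory-intensive than concatenation.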
Micro-tube linking details: Let  be the set of detection bounding boxes from frame , and  the corresponding set from , generated by a frame-level detector. Singh  associate boxes in one set to boxes in the other, whereas, in our case, we need to link micro-tubes from a pair of frames to micro-tubes from the next pair of frames. This happens by associating the last boxes of the former with the first boxes of the latter. Interestingly, this is a relatively easier sub-problem, as all such detections are generated from the same frame, unlike the across-frame association problem considered in . The association is achieved based on Intersection over Union (IoU) and class score, as tubes are built separately for each class in a multi-label scenario. For more details, we refer the reader to .
Since we adopt the online linking framework of Singh , we follow most of their linking settings, e.g.: linking is done for every class separately, and the non-maximum suppression threshold is set to . As shown in Figures 5(a) and 5(b), the last box of each micro-tube (red) is linked to the first box of the next micro-tube (red). The first set of micro-tubes is thus produced at the first frame pair, the following one at the next, and so on, with the last micro-tube generated so as to cover the observable video duration up to time . Finally, we solve the association problem as described above.
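The association step can be sketched as a greedy matching driven by IoU plus class score, linking the last box of each micro-tube to the first box of a micro-tube from the next frame pair. This is an illustrative simplification of the linking algorithm of Singh et al., not their exact procedure:

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def associate(prev_micro, next_micro):
    """Greedily link each previous micro-tube to the unused next micro-tube
    maximising IoU (its last box vs the next one's first box) + class score."""
    links, used = [], set()
    for i, p in enumerate(prev_micro):
        best, best_j = 0.0, None
        for j, n in enumerate(next_micro):
            if j in used:
                continue
            s = iou(p["boxes"][-1], n["boxes"][0]) + n["score"]
            if s > best:
                best, best_j = s, j
        if best_j is not None:
            used.add(best_j)
            links.append((i, best_j))
    return links

prev = [{"boxes": [(0, 0, 10, 10), (1, 0, 11, 10)], "score": 0.9}]
nxt = [{"boxes": [(2, 0, 12, 10), (3, 0, 13, 10)], "score": 0.8},
       {"boxes": [(50, 50, 60, 60), (51, 50, 61, 60)], "score": 0.9}]
print(associate(prev, nxt))  # [(0, 0)]
```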
3.3 Training the tube prediction network (TPnet)
AMTnet allows us to detect the current tube by generating a set of successive micro-tubes. However, our aim is to predict the future section of the tube from these linked micro-tubes, up to time .
To address this problem, we propose a tube prediction framework aimed at simultaneously estimating a micro-tube, a set of past and future detections, and the classification scores for the classes. One parameter measures how far into the past we look, whereas a future step size and a number of future steps control the prediction horizon. This is performed by a new Tube Prediction network (TPnet).
The underlying architecture of TPnet is shown in Figure 4. TPnet takes two successive frames from time and as input. The two input frames are fed to two parallel CNN streams, one for appearance and one for optical flow. The resulting feature maps are fused together, either by concatenating or by element-wise summing the given feature maps. Finally, three types of convolutional output heads are used for prior boxes as shown in Figure 4. The first one produces the classification outputs; the second one regresses the coordinates of the micro-tubes, as in AMTnet; the last one regresses coordinates, where coordinates correspond to the frame at , and the remaining are associated with the future steps. The training procedure of the new architecture is illustrated below.
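To make the three output heads concrete, their per-anchor output sizes can be sketched as below, assuming 4 coordinates per box and one added background class. The exact head dimensions are not stated in this excerpt, so treat the numbers as assumptions:

```python
def head_output_sizes(num_classes, n_past, n_future):
    """Per-anchor output sizes for the three TPnet heads (assumed layout):
    classification scores (classes + background), micro-tube regression
    (2 boxes x 4 coords), and past/future prediction boxes."""
    cls_out = num_classes + 1            # + background class
    micro_out = 2 * 4                    # two linked boxes per micro-tube
    pred_out = (n_past + n_future) * 4   # one box per past/future step
    return cls_out, micro_out, pred_out

# e.g. 21 J-HMDB-21 classes, 1 past step, 5 future steps (hypothetical)
print(head_output_sizes(21, 1, 5))  # (22, 8, 24)
```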
Multi-task learning TPnet is designed to strive for three objectives, for each prior box
. The first task (i) is to classify theprior boxes; the second task (ii) is to regress the coordinates of the micro-tubes; the last (iii) is to regress the coordinates of the past and future detections associated with each micro-tube.
Given a set of anchor boxes and the respective outputs we compute a loss following the training objective of SSD . Let be the indicator for matching the -th prior box to the -th ground truth box of category . We use the bipartite matching procedure described in  for matching the ground truth micro-tubes to the prior boxes, where is a ground truth box at time . The overlap is computed between a prior box and micro-tube as the mean IoU between and the ground truth boxes in . A match is defined as positive () if the overlap is more than or equal to .
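The mean-IoU matching criterion described above can be sketched as follows; an illustrative implementation, where the 0.5 threshold is the standard SSD value and an assumption here:

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def mean_iou_match(prior, gt_micro, thresh=0.5):
    """A prior box positively matches a ground-truth micro-tube when its
    mean IoU with the micro-tube's boxes is >= thresh."""
    m = sum(iou(prior, g) for g in gt_micro) / len(gt_micro)
    return m >= thresh, m

matched, overlap = mean_iou_match((0, 0, 10, 10),
                                  [(0, 0, 10, 10), (5, 0, 15, 10)])
print(matched, round(overlap, 3))  # True 0.667
```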
The overall loss function is the following weighted sum of the classification loss (), the micro-tube regression loss () and the prediction loss ():
where is the number of matches, is the ground truth class, is the predicted micro-tube, is the ground truth micro-tube, assembles the predictions for the future and the past, and is the ground truth of future and past bounding boxes associated with the ground truth micro-tube . The values of and are both set to in all of our experiments: different values might result in better performance.
The classification loss is a softmax cross-entropy loss; a hard negative mining strategy is also employed, as proposed in . The micro-tube loss is a Smooth L1 loss  between the predicted () and the ground truth () micro-tube. Similarly, the prediction loss is also a Smooth L1 loss between the predicted boxes () and the ground truth boxes (). As in [22, 27], we regress offsets with respect to the coordinates of the matched prior box, for both the micro-tube and the past/future predictions. We use the same offset encoding scheme as .
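Under these definitions, the per-match loss can be sketched in NumPy as below; this is a simplified single-anchor version with the weights set to 1, omitting hard negative mining and the prior-box offset encoding:

```python
import numpy as np

def smooth_l1(pred, gt):
    """Smooth L1: 0.5 x^2 for |x| < 1, |x| - 0.5 otherwise, summed."""
    d = np.abs(pred - gt)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum()

def softmax_ce(logits, cls):
    """Softmax cross-entropy for one anchor and a ground-truth class index."""
    z = logits - logits.max()
    return -(z[cls] - np.log(np.exp(z).sum()))

def tpnet_loss(cls_logits, gt_cls, micro, gt_micro, pred, gt_pred,
               n_matches, alpha=1.0, beta=1.0):
    """Weighted sum of classification, micro-tube regression and past/future
    prediction losses, normalised by the number of matched anchors."""
    return (softmax_ce(cls_logits, gt_cls)
            + alpha * smooth_l1(micro, gt_micro)
            + beta * smooth_l1(pred, gt_pred)) / n_matches

loss = tpnet_loss(np.zeros(2), 0,
                  np.zeros(8), np.full(8, 0.5),
                  np.zeros(4), np.full(4, 0.5), n_matches=1)
print(round(float(loss), 4))  # log(2) + 1.0 + 0.5 = 2.1931
```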
3.4 Tube prediction framework
Figure 2 shows TPnet at test time. As in the training setting, it observes only two frames that are a fixed temporal distance apart at any time point . The outputs of TPnet at any time are linked into a micro-tube, each micro-tube containing a set of bounding boxes which are considered as linked together.
As explained in § 3.2, given a set of micro-tubes we can construct the current tube by online linking  of the micro-tubes. As a result, we can use the predictions made up to the present time to generate the future of the tube, thus extending it further into the future as shown in Figure 5. More specifically, as indicated in Figure 5(a), each micro-tube is composed of bounding boxes linked together, with the last micro-tube generated from the most recent frame pair. In the same fashion, putting together the predictions associated with all the past micro-tubes yields a set of linked future bounding boxes for the current action tube, thus outputting a part of the desired future tube.
Now, we can generate the future tube from the set of linked future bounding boxes, complemented by simple linear extrapolation of the bounding boxes beyond the prediction horizon. Linear extrapolation is performed based on the average velocity of each coordinate over the last few frames; predictions falling outside the image are clipped to the image edges.
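The extrapolation step can be sketched as follows; the velocity window size `k` is an assumption, as the exact number of frames used is not given in this excerpt:

```python
import numpy as np

def extrapolate_boxes(boxes, n_future, img_w, img_h, k=5):
    """Linearly extrapolate (x1, y1, x2, y2) boxes using the average
    per-coordinate velocity over the last k boxes, clipping to the image."""
    boxes = np.asarray(boxes, dtype=float)
    recent = boxes[-k:]
    vel = (recent[-1] - recent[0]) / max(len(recent) - 1, 1)
    out = []
    for step in range(1, n_future + 1):
        b = boxes[-1] + step * vel
        b[[0, 2]] = np.clip(b[[0, 2]], 0, img_w)  # clip x coordinates
        b[[1, 3]] = np.clip(b[[1, 3]], 0, img_h)  # clip y coordinates
        out.append(b)
    return np.stack(out)

# a box drifting right by one pixel per frame
track = [(0, 0, 10, 10), (1, 0, 11, 10), (2, 0, 12, 10)]
future = extrapolate_boxes(track, 2, 100, 100)  # continues the drift
```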
We test our action tube prediction framework (§ 3) on four challenging problems: i) action localisation (§ 4.1), ii) early action prediction (§ 4.2), iii) online action localisation (§ 4.2), and iv) future action tube prediction (§ 4.3). Finally, evidence of real-time capability is quantitatively demonstrated in § 4.4.
J-HMDB-21. We evaluate our model on the J-HMDB-21  benchmark. J-HMDB-21  is a subset of the HMDB-51 dataset  with 21 action categories and 928 videos, each containing a single action instance and trimmed to the action’s duration. It contains atomic actions which are 20-40 frames long. Although the videos are of short duration, we consider this dataset because all tubes in a video belong to the same class, and we think it is a good starting point for the action prediction task.
We now define the evaluation metrics used in this paper. i) We use the standard mean average precision (mAP) metric to evaluate detection performance when the whole video is observed. ii) The early label prediction task is evaluated by video classification accuracy [32, 31], as early as when only 10% of the video frames are observed.
iii) Online action localisation (§ 4.2) follows the experimental setup of , and uses mAP (mean average precision) as the metric for online action detection, i.e., it evaluates the present tube built in an online fashion.
iv) The future tube prediction is a new task; we propose to evaluate its performance in two ways. Firstly, we evaluate the quality of the whole tube prediction, from the start of the video to its end, as early as when only 10% of the video is observed. The entire predicted tube (obtained by observing only a small portion of the video) is compared against the ground truth tube for the whole video. Based on a detection threshold, we can compute the mean average precision for the complete tubes; we call this metric completion-mAP (c-mAP). Secondly, we measure how well the future, predicted part of the tube localises. In this measure, we compare the predicted future tube with the corresponding ground truth future tube segment. Given the ground truth and predicted future tubes, we can compute the mean average precision for the predicted tubes; we call this metric prediction-mAP (p-mAP).
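Since J-HMDB-21 tubes span the whole (trimmed) video, the spatiotemporal overlap underlying both metrics reduces to a mean per-frame box IoU. A simplified sketch of that overlap computation:

```python
def box_iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def tube_overlap(pred, gt):
    """Mean per-frame IoU between a predicted and a ground-truth tube of
    equal length; a tube counts as correct when this exceeds the detection
    threshold, which is the basis of both p-mAP and c-mAP."""
    assert len(pred) == len(gt)
    return sum(box_iou(p, g) for p, g in zip(pred, gt)) / len(gt)

pred = [(0, 0, 10, 10), (0, 0, 10, 10)]
gt = [(0, 0, 10, 10), (5, 0, 15, 10)]
print(round(tube_overlap(pred, gt), 3))  # 0.667
```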
We report the performance on the latter three tasks (i.e., tasks ii to iv) as a function of Video Observation Percentage, i.e., the portion (%) of the entire video observed.
Baseline. We modified AMTnet to fuse flow and appearance features (§ 3.2). We treat it as a baseline for all of our tasks. Firstly, we show how feature fusion helps AMTnet in Section 4.1, and compare it with other action detection methods along with our TPnet. Secondly, in Section 4.3, we linearly extrapolate the detections from AMTnet to construct future tubes, and use them as a baseline for the tube prediction task. Implementation details. We train all of our networks, including TPnet and AMTnet, with the same set of hyper-parameters to ensure fair comparison and consistency. We use an initial learning rate of , with the learning rate dropping by a factor of  after  and  iterations. All the networks are trained up to
iterations. We implemented AMTnet in PyTorch (https://pytorch.org/). We initialise the AMTnet and TPnet models using an SSD network pretrained on the respective train splits of J-HMDB-21. The SSD network training is itself initialised with an ImageNet-trained VGG network. For the optical flow images, we used the optical flow algorithm of Brox . The optical flow output is packed into a three-channel image: two channels contain the flow vector components, and the third channel is the flow magnitude.
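The flow-to-image packing described above can be sketched as:

```python
import numpy as np

def flow_to_image(flow):
    """Pack a dense optical-flow field of shape (H, W, 2) into a 3-channel
    image: channels 0-1 are the flow components, channel 2 the magnitude."""
    mag = np.linalg.norm(flow, axis=2, keepdims=True)
    return np.concatenate([flow, mag], axis=2)

flow = np.zeros((4, 4, 2))
flow[..., 0], flow[..., 1] = 3.0, 4.0  # constant (3, 4) flow everywhere
img = flow_to_image(flow)
print(img.shape, img[0, 0, 2])  # (4, 4, 3) 5.0
```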
| Methods | IoU = 0.2 | IoU = 0.5 | IoU = 0.75 | IoU = 0.5:0.95 | Acc % |
| --- | --- | --- | --- | --- | --- |
| MR-TS Peng  | 74.1 | 73.1 | – | – | – |
| FasterRCNN Saha  | 72.2 | 71.5 | 43.5 | 40.0 | – |
| OJLA Behl  | – | 67.3 | – | 36.1 | – |
| SSD Singh  | 73.8 | 72.0 | 44.5 | 41.6 | – |
| AMTnet Saha  (RGB only) | 57.7 | 55.3 | – | – | – |
| ACT Kalogeiton  | 74.2 | 73.7 | 52.1 | 44.8 | 61.7 |
| T-CNN (offline) Hou  | 78.4 | 76.9 | – | – | 67.2 |
| MR-TS  + I3D  Gu  | – | 78.6 | – | – | – |

TPnet denotes our network in the corresponding parameter setting; online methods are marked as such.
TPnet. The training parameters of our TPnet are used to define the name of the setting in which we use our tube prediction network. The network name TPnet represents our TPnet where , and ; if the past-step parameter is set to zero, the network does not learn to predict past bounding boxes. In all of our settings, we use .
4.1 Action localisation performance
Table 1 shows the traditional action localisation results for whole action tube detection in the videos of the J-HMDB-21 dataset.
Feature-level fusion shows a remarkable improvement over the late fusion scheme in AMTnet (Table 1); as a result, it is able to surpass the performance of ACT , which relies on sets of frames, whereas AMTnet uses only two successive frames as input. Looking at the average mAP, we can see that the fused model improves considerably compared to the single-frame SSD model of Singh . Concatenation and sum fusion perform almost identically for AMTnet. Sum fusion is a little less memory-intensive on the GPU than concatenation fusion; as a result, we use sum fusion in our TPnet.
Results of TPnet for detection are shown in the last part of Table 1, where we only use the micro-tubes detected by TPnet to construct the action tubes (§ 3.2). We train TPnet to predict future and past as well as present micro-tubes. We believe that predicting bounding boxes for both the past and future video segments acts as a regulariser and helps improve the representation of the whole network, thus improving detection performance (Table 1, TPNet and TPNet). However, this does not mean that adding an extra prediction task always helps: when the network is asked to learn predictions far into the future, as is the case for TPNet and TPNet, we observe a drop in detection performance. We think there may be two possible reasons for this: i) the network may start to focus more on the prediction task; and ii) the videos in J-HMDB-21 are short, and the number of training samples decreases drastically, because we cannot use frames near the end of a video as training samples, since we need a ground truth bounding box that lies several frames in the future. However, in Section 4.3, we show that the TPNet model is the best at predicting the future very early.
4.2 Early label prediction and online localisation
One prior method also performs early label prediction on J-HMDB-21; however, its performance is deficient and would skew the plot (Figure 6(a)), so we omit it from the figure. For instance, by observing only the initial portion of the videos in J-HMDB-21, TPnet achieves a prediction accuracy higher than those of Singh  and Soomro , and in fact higher than the accuracy achieved by the latter after observing the entire video. As more and more of the video is observed, all methods improve, but TPnet shows the most gain, although the TPnet setting predicting furthest into the future gains the least among the TPnet settings shown. This is in line with the action localisation performance discussed in the previous section (§ 4.1). We can observe similar trends in the online action localisation performance shown in Figure 6(b). To reiterate, the furthest-predicting setting does not get to see training samples from the end portion of the videos, as it needs a ground truth bounding box from many frames ahead; so, the last frame it sees of any training video lies at almost half the length of the longest video in J-HMDB-21. This effect is magnified when online localisation performance is measured at higher detection thresholds; we provide evidence of this in the supplementary material.
4.3 Future action tube prediction
The main task of this paper is to predict the future of action tubes. We evaluate it using the two newly proposed metrics (p-mAP and c-mAP), explained at the start of this experiments section (§ 4). Results are shown in Figure 7, for future tube prediction with the p-mAP metric (Figure 7(a)) and for tube completion with the c-mAP metric (Figure 7(b)).
Although the furthest-predicting TPnet setting is the worst for early label prediction (Fig. 6(a)), online detection (Fig. 6(b)) and action tube detection (Table 1), as it predicts furthest into the future it is the best model for early future tube prediction (Fig. 7(a)). However, it does not see as much improvement as the other settings as more frames are observed, owing to the reduction in the number of training samples. On the other hand, the setting which also predicts the past shows a large improvement on the tube completion task as more of the video is observed (Fig. 7(b)), which strengthens our argument that predicting not only the future but also the past yields more regularised predictions.
Comparison with the baseline. As explained above, we use AMTnet as a baseline, and its results can be seen in all the plots and in Table 1. We can observe that our TPnet performs better than AMTnet in almost all cases, especially in our target task of early future prediction (Fig 7(a)), where TPnet shows a substantial improvement in p-mAP at low video observation percentages over AMTnet.
Discussion. Predicting further into the future is essential to produce meaningful predictions, but at the same time predicting the past helps improve overall tube completion performance. One reason for this behaviour could be that J-HMDB-21 tubes are short. We think that choosing training samples for a combination of the two settings uniformly over the whole video, while accounting for the absence of ground truth in the loss function, could give us the best of both settings. We show the result of such a combined setting under the current training regime in the supplementary material. The idea of regularising based on past prediction is similar to the one used by Ma .
4.4 Test Time Detection Speed
Singh  showcase their method’s online and real-time capabilities. Here we use their online tube generation method in our tube prediction framework, so as to inherit those properties. The only remaining question is TPnet’s forward pass speed. We thus measured the average time taken by a forward pass for a batch size of 1, as compared to 8 in . A single forward pass takes 46.8 milliseconds to process one test example, showing that TPnet can run in almost real time at 21 fps with two streams on a single 1080Ti GPU. One could improve the speed even further by testing TPnet with a larger frame gap. We currently use dense optical flow , which is slow; but, as in , we can always switch to real-time optical flow  with a small drop in performance.
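The fps figure follows directly from the measured latency; as a sanity check, a trivial conversion, assuming the latency covers both streams:

```python
def throughput_fps(latency_ms, batch_size=1):
    """Frames per second implied by a measured per-batch forward-pass latency."""
    return 1000.0 * batch_size / latency_ms

print(round(throughput_fps(46.8)))  # 21
```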
We presented TPnet, a deep learning framework for future action tube prediction in videos which, unlike previous online tube detection methods [31, 32], generates the future of action tubes as early as when only a small fraction of the video is observed. It can cope with future uncertainty better than the baseline methods while remaining state-of-the-art on the action detection task. Hence, we provide a scalable platform to push the boundaries of action tube prediction research; it is implicitly scalable to multiple action tube instances in a video, as a future prediction is made for each action tube separately. We plan to scale TPnet to action prediction in temporally untrimmed videos in the future.
-  Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., Savarese, S.: Social lstm: Human trajectory prediction in crowded spaces. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 961–971 (2016)
-  Aliakbarian, M.S., Saleh, F.S., Salzmann, M., Fernando, B., Petersson, L., Andersson, L.: Encouraging lstms to anticipate actions very early. In: IEEE International Conference on Computer Vision (ICCV). vol. 1 (2017)
-  Behl, H.S., Sapienza, M., Singh, G., Saha, S., Cuzzolin, F., Torr, P.H.: Incremental tube construction for human action detection. arXiv preprint arXiv:1704.01358 (2017)
-  Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical flow estimation based on a theory for warping (2004)
-  Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4724–4733. IEEE (2017)
-  De Geest, R., Gavves, E., Ghodrati, A., Li, Z., Snoek, C., Tuytelaars, T.: Online action detection. arXiv preprint arXiv:1604.06506 (2016)
-  Gkioxari, G., Malik, J.: Finding action tubes. In: IEEE Int. Conf. on Computer Vision and Pattern Recognition (2015)
-  Gu, C., Sun, C., Vijayanarasimhan, S., Pantofaru, C., Ross, D.A., Toderici, G., Li, Y., Ricco, S., Sukthankar, R., Schmid, C., et al.: Ava: A video dataset of spatio-temporally localized atomic visual actions. arXiv preprint arXiv:1705.08421 (2017)
-  Hoai, M., De la Torre, F.: Max-margin early event detectors. International Journal of Computer Vision 107(2), 191–202 (2014)
-  Hou, R., Chen, C., Shah, M.: Tube convolutional neural network (t-cnn) for action detection in videos. In: IEEE Int. Conf. on Computer Vision (2017)
-  Jhuang, H., Gall, J., Zuffi, S., Schmid, C., Black, M.: Towards understanding action recognition (2013)
-  Jiang, Y., Saxena, A.: Modeling high-dimensional humans for activity anticipation using gaussian process latent crfs. In: Robotics: Science and Systems, RSS (2014)
-  Kalogeiton, V., Weinzaepfel, P., Ferrari, V., Schmid, C.: Action tubelet detector for spatio-temporal action localization. In: IEEE Int. Conf. on Computer Vision (2017)
-  Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
-  Kitani, K.M., Ziebart, B.D., Bagnell, J.A., Hebert, M.: Activity forecasting. In: European Conference on Computer Vision. pp. 201–214. Springer (2012)
-  Kong, Y., Tao, Z., Fu, Y.: Deep sequential context networks for action prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1473–1481 (2017)
-  Koppula, H.S., Gupta, R., Saxena, A.: Learning human activities and object affordances from rgb-d videos. The International Journal of Robotics Research 32(8), 951–970 (2013)
-  Kroeger, T., Timofte, R., Dai, D., Van Gool, L.: Fast optical flow using dense inverse search. arXiv preprint arXiv:1603.03590 (2016)
-  Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: Hmdb: a large video database for human motion recognition. In: Computer Vision (ICCV), 2011 IEEE International Conference on. pp. 2556–2563. IEEE (2011)
-  Lan, T., Chen, T.C., Savarese, S.: A hierarchical representation for future action prediction. In: Computer Vision–ECCV 2014. pp. 689–704. Springer (2014)
-  Lee, N., Choi, W., Vernaza, P., Choy, C.B., Torr, P.H., Chandraker, M.: Desire: Distant future prediction in dynamic scenes with interacting agents. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 336–345 (2017)
-  Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot multibox detector. arXiv preprint arXiv:1512.02325 (2015)
-  Ma, S., Sigal, L., Sclaroff, S.: Learning activity progression in lstms for activity detection and early detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1942–1950 (2016)
-  Nazerfard, E., Cook, D.J.: Using bayesian networks for daily activity prediction. In: AAAI Workshop: Plan, Activity, and Intent Recognition (2013)
-  Peng, X., Schmid, C.: Multi-region two-stream R-CNN for action detection. In: ECCV 2016 - European Conference on Computer Vision. Amsterdam, Netherlands (Oct 2016), https://hal.inria.fr/hal-01349107
-  Redmon, J., Farhadi, A.: YOLO9000: Better, faster, stronger. arXiv preprint arXiv:1612.08242 (2016)
-  Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems. pp. 91–99 (2015)
-  Ryoo, M.S.: Human activity prediction: Early recognition of ongoing activities from streaming videos. In: IEEE Int. Conf. on Computer Vision. pp. 1036–1043. IEEE (2011)
-  Saha, S., Singh, G., Cuzzolin, F.: Amtnet: Action-micro-tube regression by end-to-end trainable deep architecture. In: IEEE Int. Conf. on Computer Vision (2017)
-  Saha, S., Singh, G., Sapienza, M., Torr, P.H.S., Cuzzolin, F.: Deep learning for detecting multiple space-time action tubes in videos. In: British Machine Vision Conference (2016)
-  Singh, G., Saha, S., Sapienza, M., Torr, P., Cuzzolin, F.: Online real-time multiple spatiotemporal action localisation and prediction. In: IEEE Int. Conf. on Computer Vision (2017)
-  Soomro, K., Idrees, H., Shah, M.: Predicting the where and what of actors and actions through online action localization (2016)
-  Mahmud, T., Hasan, M., Roy-Chowdhury, A.K.: Joint prediction of activity labels and starting times in untrimmed videos. In: IEEE Int. Conf. on Computer Vision. vol. 1 (2017)
-  Vondrick, C., Pirsiavash, H., Torralba, A.: Anticipating the future by watching unlabeled video. arXiv preprint arXiv:1504.08023 (2015)
-  Weinzaepfel, P., Harchaoui, Z., Schmid, C.: Learning to track for spatio-temporal action localization. In: IEEE Int. Conf. on Computer Vision and Pattern Recognition (June 2015)
-  Weinzaepfel, P., Martin, X., Schmid, C.: Human action localization with sparse spatial supervision. arXiv preprint arXiv:1605.05197 (2016)
-  Weinzaepfel, P., Martin, X., Schmid, C.: Towards weakly-supervised action localization. arXiv preprint arXiv:1605.05197 (2016)
-  Yang, Z., Gao, J., Nevatia, R.: Spatio-temporal action detection with cascade proposal and location anticipation. In: BMVC (2017)
-  Yeung, S., Russakovsky, O., Jin, N., Andriluka, M., Mori, G., Fei-Fei, L.: Every moment counts: Dense detailed labeling of actions in complex videos. arXiv preprint arXiv:1507.05738 (2015)
-  Yeung, S., Russakovsky, O., Mori, G., Fei-Fei, L.: End-to-end learning of action detection from frame glimpses in videos. CVPR (2016)
-  Zolfaghari, M., Oliveira, G.L., Sedaghat, N., Brox, T.: Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection. In: IEEE Int. Conf. on Computer Vision. pp. 2923–2932. IEEE (2017)
-  Zunino, A., Cavazza, J., Koul, A., Cavallo, A., Becchio, C., Murino, V.: Predicting human intentions from motion cues only: A 2d+3d fusion approach. In: Proceedings of the 2017 ACM on Multimedia Conference. pp. 591–599. ACM (2017)