The ability to foresee what may happen in the future is one of the factors that make humans intelligent. Predicting the future state of the environment conditioned on the past and current states requires a good perception and understanding of the environment as well as of its dynamics. This ability allows humans to plan ahead and choose actions that shape the environment in their interest.
In this paper, we focus on improving the model’s understanding of the environment’s dynamics by simply observing it. Due to the availability of unlabeled video data, self-supervised learning from observations is very attractive compared to approaches that require explicit labeling of large amounts of data.
In comparison to the literature on understanding the current state of the environment – works typically known under the terms semantic segmentation or action classification – there is limited work addressing the problem of predicting future states. In this paper, we are interested in predicting future activities. Our work differs from most prior works on video prediction, as they focus on predicting whole frames of the future [27, 42, 5]. In the context of decision making systems, pixel-wise future prediction is too detailed and cannot be expected to enable longer prediction horizons than just a few frames. The strategy of Luc et al. to predict the segmentation of future frames by forecasting the future semantics instead of raw RGB values appears much more promising. We follow a similar strategy to predict abstract features and even increase the level of abstraction by dealing with activities rather than pixel-wise labeling; see Figure 1.
This connects a large part of the problem with activity classification: before making predictions about future states, we must interpret the given video input and extract features that describe the current state of the environment. In order to cover not only the present state but also the context of the past, we build on the work by Zolfaghari et al. This work on activity classification samples frames from a large time span of the past and converts the context from these frames into a feature representation optimized for classifying the observed activity. We argue that this feature representation is a good basis for learning a representation of what is likely to happen in the future. While we keep the former untouched, we learn the latter from the time course of the videos. This even allows us to learn the dynamics in a self-supervised way. In this paper, we report results for both supervised (with action class labels) and self-supervised (unlabeled videos) training.
The predicted future state is provided as an activity class label or as a caption generated by a captioning module based on the predicted representation. Since the future is non-deterministic, forcing the network to predict a single possible outcome leads to contradictory learning signals and will hamper learning good representations. Therefore, we use a multi-hypotheses scheme that can represent multiple futures for each video observation.
Moreover, we decouple the prediction of the action and the object involved in an activity. This allows the model to generalize the same action across multiple objects and to learn from only a few examples, or even without observing all combinations during training.
Overall, we propose the first approach for Predicting Future Activity (PreFAct) over large time horizons. Our method involves four important components: (1) a future prediction module that transforms abstract features of an observation to a representation of the future; (2) decoupling of the future representation into object and action; (3) representation of multiple hypotheses of the future; (4) natural language caption of the future representation.
2 Related work
Future Image Prediction. Many existing approaches for future prediction focus on generating future frames [29, 39, 35]. Since predicting RGB pixel intensities is difficult and the future is ambiguous, these methods usually end up predicting a blurry average image. To cope with the non-determinism, Mathieu et al. suggest using a multi-scale architecture with adversarial training. Stochastic approaches [3, 24] use adversarial losses and latent variables to explicitly model the underlying ambiguities.
However, pixel-level prediction is still limited to a few frames into the future, especially when the scene is highly dynamic and visual cues change rapidly. Moreover, fine-grained pixel-level future prediction is not necessary for many decision making systems.
Han et al. introduced a stacked LSTM based method to learn the task grammar and predict the future using both RGB and flow cues. The key component of their method is the estimation of task progress, which uses separate networks for each level of granularity. This makes the approach not only inefficient but also very specific to each task, since the granularity levels differ across activities and environments. To predict the starting time and label of the future action, Mahmud et al. propose an LSTM to model long-term sequential relationships. More recently, Farha et al. proposed a deep network to predict future activity. These methods rely on partial observations of the future, and their predictions are limited to a fixed time horizon. Another very interesting future prediction task, required by autonomous driving, interactive agents, or surveillance systems, is forecasting the locations of objects or humans in the future [4, 9]. Fan et al. introduced a two-stream network that infers future representations to predict future locations. Their method is limited to 1 to 5 seconds into the future. Bhattacharyya et al. further addressed the multi-modality and the uncertainty of future prediction by modeling both data and model uncertainties.
Future Feature Prediction. Predicting the future at a semantic level is easier and more appealing for many applications such as autonomous driving. Vondrick et al. predicted the visual representation of a future frame. This approach is based on a single frame and is therefore limited with respect to the dynamics of the actions; it also considers only a short time horizon of 5 seconds into the future.
In contrast to these works, we only look at the current observation to infer the future activity, without limiting the time horizon of the future prediction. Inspired by Luc et al., we explicitly learn a translation from current features to future features. Moreover, we address the ambiguous nature of the future by predicting multiple possible future representations together with their uncertainties.
Uncertainty Estimation in CNNs. Modern CNNs have been shown to be overconfident about their predictions, which makes them less trusted than traditional, non-blackbox counterparts despite their high performance. Recently, well-calibrated uncertainty estimation for CNNs has therefore gained significant importance. One of the most popular uncertainty estimation methods for modern CNNs is MCDropout by Gal and Ghahramani [11, 18]. They show that by using dropout over the weights for sampling, it is possible to obtain easy and efficient estimates of the model uncertainty. Lakshminarayanan et al. propose network ensembles instead of dropout ensembles for better uncertainty predictions. A less resource-intensive alternative is snapshot ensembling over networks trained with Stochastic Gradient Descent with Warm Restarts. All these methods still cannot avoid the sampling cost. Ilg et al. propose multi-hypotheses networks (MHN) [25, 6, 36] for uncertainty estimation in order to avoid sampling altogether. They show that an MHN not only produces multiple samples in one forward pass but also provides state-of-the-art uncertainty estimates for optical flow. We adapt the MHN to classification for uncertainty estimation in future activity prediction.
3 Prediction of future activities
Given an observation at the current time segment t, future activity prediction aims at estimating the activity class of the video segment at t + Δ, where Δ is the prediction horizon. Rather than learning this mapping directly, which we show to be clearly inferior, we use the features from an activity classification network for the current time segment and learn a mapping from these current features to features in the future.
A coarse view of the network training is shown in Figure 2. Figure 3 shows a more detailed view of the overall model. As the base action classification network, we use ECO. Between the convolutional encoder and the fully connected layers of ECO, we add the future prediction module. We explore different designs for this module, shown in Figure 4; these include three different inception blocks. Moreover, we evaluate six different options for where to insert these modules into the ECO architecture, as illustrated in Figure 5. Experimental results with these different options are presented in Appendix C.
The weights of the ECO base network stay fixed (both the convolutional encoder and the fully connected activity classifier), while the future prediction module is trained. The training scheme is illustrated in Figure 2. Ground-truth features of the future are simply extracted by running ECO on the future time segment (lower part of Figure 2), i.e., training can work in a self-supervised manner as regression without annotation of class labels. The corresponding training objective is simply the mean squared error between the predicted future features and the features extracted from the future video segment.
Concurrently, the future prediction module can also take the class labels of a labeled video into account. In this case, the weights are optimized for the cross-entropy loss on the activity class labels of the future time segment, applied to the softmax output of the classifier computed from the predicted future features.
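As a hedged sketch (the symbols below are chosen for illustration and may differ from the paper's notation), let \hat{z}_{t+\Delta} denote the features predicted from the current segment and z_{t+\Delta} the ECO features extracted from the future segment; the two objectives can then be written as:

```latex
% Self-supervised regression objective: mean squared error between the
% predicted and the extracted future features.
\mathcal{L}_{\mathrm{reg}} = \left\lVert \hat{z}_{t+\Delta} - z_{t+\Delta} \right\rVert_2^2

% Supervised objective: cross-entropy on the future activity label y_{t+\Delta},
% where p(\cdot \mid \hat{z}_{t+\Delta}) is the softmax output of the fixed ECO
% classifier applied to the predicted future features.
\mathcal{L}_{\mathrm{cls}} = -\log p\!\left(y_{t+\Delta} \mid \hat{z}_{t+\Delta}\right)
```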
3.1 Class representation of objects and actions
One way to represent the result of future prediction is by activity classification. Human actions are characterized by the objects they interact with and the actions
they perform. In activity learning, one often optimizes for the activity class directly, which leads to a combinatorial explosion of possible activities. Training directly on these can result in bad representations. For example, if in the training set the action ’put’ is always combined with ’plate’, the network will not learn the action ’put’ but rather will recognize plates. Such a representation will not generalize to somebody putting a cup.
Understanding the relationship between actions and objects leads to a more comprehensive interpretation. For instance, if the model has already learned what a ’put’ action means, it can more easily generalize to various scenarios such as ’put butter’ or ’put spoon’. This enables us to extend the model to unseen object-action combinations by providing only a very small set of samples. Therefore, we propose to decouple the object and action classes. Treating them as separate label sets while learning them jointly still exploits the relationship between them. We show the advantage of this decoupling in Section 5.6.
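As a minimal sketch of this decoupling (module name, feature dimension, and class counts below are placeholders, not values from the paper), a shared future feature can simply feed two separate classification heads:

```python
import torch.nn as nn

class DecoupledActivityHead(nn.Module):
    """Hypothetical sketch: one shared future feature feeds two separate
    classifiers, one for the action (verb) and one for the object (noun),
    instead of a single head over all action-object combinations."""

    def __init__(self, feat_dim=512, num_actions=50, num_objects=300):  # placeholder sizes
        super().__init__()
        self.action_head = nn.Linear(feat_dim, num_actions)
        self.object_head = nn.Linear(feat_dim, num_objects)

    def forward(self, future_feat):
        # Both heads see the same feature, so the two tasks remain coupled
        # through the shared representation while their label spaces stay separate.
        return self.action_head(future_feat), self.object_head(future_feat)
```

Training then sums one cross-entropy loss per head, so ’put plate’ and ’put cup’ share the learning signal for the action ’put’.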
3.2 Video captioning
A richer way of representing the result of the prediction module is via language. A video caption usually carries more detail than an activity class label. For instance, the caption ‘put celery back into fridge’ conveys more details than the label ‘put celery’. We use an LSTM-based architecture, semantic compositional networks, for generating a caption describing the predicted future feature representation. The semantic concepts are trained separately from scratch for each video dataset. These concepts are used to extend the weight matrices of the caption-generating decoder, as described in the original work.
4 Multiple hypotheses and uncertainty
If the next action were deterministic and depended only on the previous action, learning the mapping from the present to the future would be almost trivial: it would reduce to a simple look-up table. However, the future action typically depends on subtle cues in the input and contains non-deterministic elements. Thus, multiple reasonable possibilities exist for the future activity. Therefore, we propose learning multiple hypotheses with their uncertainties, similar to multi-hypotheses networks (MHN) [16, 25, 6, 36]. In our setting, a multi-hypotheses network predicts multiple feature vectors corresponding to the various possible outcomes together with their uncertainties. Each hypothesis yields the object and action class together with their class uncertainties; see Figure 3. We have separate uncertainties for objects and actions because each task has a different uncertainty level. For instance, if a person is washing and there is a spoon, a plate, and a knife in the sink, the uncertainty about the chosen object will be much higher than about the action. The feature uncertainties allow the captioning LSTM to reason about which features it should rely on.
To model the data uncertainty (aleatoric), the network yields the parameters of a parametric distribution, e.g., a Gaussian or Laplacian. This enables learning not only the mean prediction but also its variance, which can be interpreted as uncertainty. To cover the model uncertainty (epistemic), however, sampling from the network parameters is needed to compute the variation inherent within the model. Multiple-hypotheses networks create multiple samples in one forward pass, which approximates sampling from the network in a very efficient way.
4.1 Feature uncertainties
Following Ilg et al., we model the posterior over the ground-truth feature by a Laplace distribution parameterized by a median and a scale. During training, we minimize its negative log-likelihood (NLL). As commonly done in the literature, we predict the logarithm of the scale instead of the scale itself for more stable training.
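Assuming \mu denotes the predicted median, b the predicted scale, and z^{gt} the ground-truth feature (symbols chosen here for illustration), the Laplace likelihood and its NLL can be sketched as:

```latex
% Laplace likelihood of the ground-truth feature given the prediction.
p\!\left(z^{gt} \mid \mu, b\right)
    = \frac{1}{2b}\exp\!\left(-\frac{\left|z^{gt}-\mu\right|}{b}\right)

% Negative log-likelihood minimized during training; in practice the network
% outputs \log b instead of b for numerical stability.
\mathcal{L}_{\mathrm{NLL}}(\mu, b) = \frac{\left|z^{gt}-\mu\right|}{b} + \log(2b)
```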
To also include the model uncertainty, we minimize a multi-hypotheses loss. For each training sample, only the best hypothesis among all hypotheses is penalized while the others stay untouched; the best hypothesis is defined as the one whose predicted feature is closest to the ground truth.
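Under the same assumed notation, with K hypotheses (\mu^{(k)}, b^{(k)}), the winner-takes-all multi-hypotheses loss can be sketched as:

```latex
% Select the hypothesis whose predicted median is closest to the ground truth ...
k^{*} = \arg\min_{k \in \{1,\dots,K\}} \left\lVert \mu^{(k)} - z^{gt} \right\rVert

% ... and penalize only that hypothesis with the Laplace NLL.
\mathcal{L}_{\mathrm{MH}} = \mathcal{L}_{\mathrm{NLL}}\!\left(\mu^{(k^{*})}, b^{(k^{*})}\right)
```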
4.2 Classification uncertainties
For the classification loss, we model the data uncertainty as a learned noise scale. In order to learn both the logits and their noise scale, we minimize the negative expected log-likelihood, in which the predicted logits are corrupted by Gaussian noise with the learned scale; both the logits and the noise scale are learned by the network. This objective can be interpreted as first corrupting the logits with noise several times (as many times as there are hypotheses), normalizing each corrupted sample with a softmax to obtain pseudo-probabilities, and averaging these to obtain the final pseudo-probabilities. Finally, the cross-entropy loss is applied to the averaged pseudo-probabilities.
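A minimal sketch of this loss in the spirit of Kendall and Gal (function and argument names below are hypothetical; the exact formulation in the paper may differ):

```python
import torch
import torch.nn.functional as F

def noisy_logit_classification_loss(logits, log_sigma, target, num_samples=10):
    """Corrupt the logits with Gaussian noise of learned scale, average the
    per-sample softmax outputs into pseudo-probabilities, and apply the
    cross-entropy loss to those pseudo-probabilities."""
    sigma = log_sigma.exp()                                    # learned noise scale
    noise = torch.randn(num_samples, *logits.shape, device=logits.device)
    corrupted = logits.unsqueeze(0) + sigma.unsqueeze(0) * noise
    probs = F.softmax(corrupted, dim=-1).mean(dim=0)           # averaged pseudo-probabilities
    return F.nll_loss(torch.log(probs.clamp_min(1e-8)), target)
```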
From variances to uncertainties. For both feature regression and object/action classification, we compute the final uncertainties as the entropy of the predicted distributions.
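Concretely, under the assumed notation from above, the entropies can be sketched as:

```latex
% Entropy of the Laplace distribution used for feature regression.
H_{\mathrm{feat}} = 1 + \log(2b)

% Entropy of the categorical distribution over object / action classes.
H_{\mathrm{cls}} = -\sum_{c} p_c \log p_c
```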
5 Experiments
5.1 Datasets
Since future activity prediction has received little attention so far, there is no dedicated dataset for this task. We conducted our experiments on the Epic-Kitchens dataset and the Breakfast dataset. Both show sequential activities for preparing meals with sufficient diversity. These two are the most suitable datasets for our task since they contain temporally meaningful actions that follow each other in a procedural way, e.g., ”Peel Potato” is followed by ”Cut Potato”.
The Epic-Kitchens dataset includes videos of people cooking in a kitchen from a first-person view. Each video is divided into multiple video segments, and the dataset provides separate sets of sequences with activity segments for training/validation and for testing. The segments are annotated with verb and noun classes. From the video sequences in the training/validation set we randomly choose 85% of the videos for training and 15% for validation.
The Breakfast dataset includes meal-preparation videos of common breakfast items from a third-person view; some videos show the same scene from multiple camera angles. All videos are divided into multiple video segments, which are annotated with predefined activity labels. We convert the activity classes into object and action classes, e.g., the activity ”take cup” becomes the action ”take” and the object ”cup”. The videos in this dataset are provided by 52 participants. We use the data from 39 of these participants for training and the data from the remaining 13 participants for testing.
5.2 Evaluation metrics
For classification, we use accuracy as the quantitative measure, i.e., the fraction of correctly predicted classes over all predictions.
For captioning, we use standard metrics. BLEU (B-1) calculates the geometric mean of n-gram precision scores weighted by a brevity penalty. ROUGE_L measures the longest common sub-sequence between the generated caption and the ground truth. METEOR is defined as the harmonic mean of precision and recall of matched uni-grams between the generated caption and its ground truth. CIDEr measures the consensus between the generated caption and the ground truth.
For evaluating the quality of the uncertainty predictions we use reliability diagrams. A reliability diagram plots the expected quality as a function of uncertainty. If the model is well calibrated, this plot should show a diagonal decrease.
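One possible way to compute such a diagram (a sketch under the assumption that per-sample uncertainties and a per-sample quality measure, e.g. correctness or negative error, are available):

```python
import numpy as np

def reliability_diagram(uncertainty, quality, num_bins=10):
    """Bin predictions by their estimated uncertainty and report the mean
    quality per bin; well-calibrated uncertainties yield a roughly diagonal
    decrease from the low- to the high-uncertainty bins."""
    edges = np.quantile(uncertainty, np.linspace(0.0, 1.0, num_bins + 1))
    mean_quality = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (uncertainty >= lo) & (uncertainty <= hi)
        mean_quality.append(quality[mask].mean() if mask.any() else np.nan)
    return edges, np.array(mean_quality)
```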
5.3 Implementation and training details
We base our feature extraction module on ECO. Following the original paper, we take the ECO which was pretrained on Kinetics 
and then further trained on Breakfast or Epic-Kitchens, depending on the dataset used in the experiments. When we retrain ECO for the baseline comparisons, we follow the design choices of the original paper unless mentioned otherwise; all details are provided in the supplementary material. Data augmentation is also applied as in the original work. Keeping the ECO feature extractor fixed, we train our randomly initialized future representation module. We use a mini-batch SGD optimizer with Nesterov momentum and weight decay, and we apply dropout after each fully connected layer. For the multi-hypotheses experiments we use a fixed number of hypotheses T.
We extract frames from the video segments following the sampling strategy explained in the original paper. In this sampling, each segment is split into 16 subsections of equal size, and from each subsection a single frame is randomly sampled. This sampling provides robustness to variations, lets the network exploit all frames, and enables us to predict arbitrary horizons into the future.
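A sketch of this sampling strategy (the helper below is illustrative, not the authors' code):

```python
import numpy as np

def sample_frame_indices(num_frames, num_subsections=16, rng=None):
    """Split a video segment into equally sized subsections and draw one
    random frame index from each, so segments of arbitrary length are
    covered by a fixed number of frames."""
    rng = rng or np.random.default_rng()
    edges = np.linspace(0, num_frames, num_subsections + 1).astype(int)
    return [int(rng.integers(lo, max(lo + 1, hi)))
            for lo, hi in zip(edges[:-1], edges[1:])]
```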
5.4 Comparison of feature translation modules
Table 1 compares different design choices for the feature translation module, as depicted in Figures 4 and 5. The architectures M3 and M6 provide the best performance. M3 corresponds to placing a grid inception block with 2D convolutions before the last two fully connected layers; M6 is the same with 3D convolutions. For the rest of the experiments, we use M3 as the feature translation module. In these experiments, we report the accuracy of the (composed) activity recognition for the Breakfast dataset and of the (single) action recognition for the Epic-Kitchens dataset. This differs from the results in the following sections, where we evaluate our decomposed action/object classes separately.
| Methods | Brk (A) | Brk (O) | Epic (A) | Epic (O) |
| Copy current label | 8.8 | 9.6 | 21.32 | 13.92 |
| Assoc. rule mining | 16.1 | 8.7 | 27.9 | 14.2 |
5.5 Results on future activity prediction
Due to the lack of previous work on this problem, we compare to some simple and some more in-depth baselines. A comparison of these baselines is shown in Table 2. As an upper bound, the table provides the classification accuracy of ECO when the time frame of interest is observed, i.e., when the problem is a standard classification problem without a future prediction component.
The ”largest class” baseline simply assigns the label of the most frequent class in the training data as the future activity label. This is the accuracy achieved by merely exploiting the class imbalance of the datasets.
The ”copy current label” baseline performs activity classification on the current observation and uses the predicted current label as the future activity label. This approach only works in cases where the action or the object does not change over time.
The ”association rule mining” baseline predicts, as the future activity label, the activity that most likely follows the currently observed one. See the supplementary material for details about each baseline.
As can be seen from the table, the results obtained with the future activity prediction network (PreFAct) are much better than these simple baselines. PreFAct-C denotes the network trained only in a supervised manner using the class labels, whereas PreFAct-R was trained only in a self-supervised manner without using class labels. PreFAct-R+C uses both losses jointly for training. As expected, supervised training works better than only self-supervised training, and using both losses works marginally better than only supervised training. Note that the self-supervised learning can leverage additional unlabeled video data. We explore this in more detail in Section 5.7.
The two most interesting baselines are the two state-of-the-art methods on video understanding - ECO and Epic-Kitchens - which we trained to predict future activities rather than the present activity. For a fair comparison, we modified the methods such that they provide both object and action classes rather than a single activity class. PreFAct clearly improves over these baselines, which shows that the future prediction module is advantageous over directly learning the mapping from the observation to the future activity.
PreFAct-MH shows results with our multi-hypotheses network. Generating multiple hypotheses has the potential to yield a significant performance improvement, as demonstrated by the Oracle selection, where the best hypothesis is selected based on the true label. However, automated selection of the best hypothesis via the uncertainty estimates does not lead to a significant difference over the single-hypothesis version. This is consistent with the findings of other works, which showed good uncertainty estimates but could not benefit from these hypotheses to select the best solution within a fusion approach.
5.6 Learning unseen combinations
The decomposition of activities into the action and the involved object allows us to generalize to new combinations not seen during training. Table 3 shows results for object-action pairs when all pairs of the specified object with the 5 actions in the top row were completely removed from the training data. The numbers in brackets correspond to the case where these activities were part of the training set. In most cases, the approach is able to compensate for the missing object-action pairs by using the information from a related object or from another action not among the 5 removed ones.
| Cupboard | - | - | 91 (94) | 83 (91) | - |
| Drawer | 3 (6) | 3 (8) | 94 (93) | 70 (74) | - |
| Fridge | - | - | 95 (97) | 94 (97) | - |
| Plate | 68 (70) | 69 (69) | - | - | 90 (90) |
| Knife | 69 (67) | 69 (69) | - | - | 97 (92) |
| Spoon | 65 (67) | 70 (59) | - | - | 100 (93) |
| Lid | 69 (80) | 55 (60) | 0 (50) | 0 (50) | 75 (80) |
| Pot | 57 (55) | 71 (70) | - | - | 80 (92) |
5.7 Self-supervised learning
While the self-supervised regression loss yields inferior results compared to supervised training on class labels, self-supervised learning has the advantage that it can be run effortlessly on unlabeled video data.
Table 4 shows how the self-supervised learning improves when adding extra unlabeled data S1 and S2 provided by the Epic-Kitchens dataset. S1 contains 8048 samples of seen kitchens, and S2 contains 2930 samples of unseen kitchens. The improvement is small but increases as more data is added.
5.8 Future captioning
We use semantic compositional networks  for captioning current and future video features. For each dataset, we obtain a separate semantic concept detector by training a multi-label classifier for the set of selected concepts from each dataset. For most experiments, we use the full vocabulary as the set of concepts.
Our feature representations provide multiple features and classes together with their uncertainties. We explored various options for how to fuse this information and feed it into the captioning module (Fig. 3(B)); they are tagged accordingly in Table 5. We use the class certainties to select features: the feature yielding the highest class certainty (BEST), the three features with the highest certainties (TOP3), and all features (ALL). For fusing the selected features, we considered concatenating them with their certainties (concat) and multiplying them by their certainties before concatenating (mult). We obtain the certainties by first normalizing the uncertainties to the (0, 1) range and then subtracting them from 1.
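The following sketch illustrates these selection and fusion variants (tensor shapes and names are assumptions for illustration):

```python
import torch

def fuse_hypotheses(features, class_uncertainty, select="ALL", fuse="mult"):
    """features: (K, D) hypothesis features; class_uncertainty: (K,) values.
    Certainty = 1 - uncertainty normalized to the (0, 1) range."""
    u = class_uncertainty
    certainty = 1.0 - (u - u.min()) / (u.max() - u.min() + 1e-8)
    order = torch.argsort(certainty, descending=True)
    if select == "BEST":
        idx = order[:1]
    elif select == "TOP3":
        idx = order[:3]
    else:  # "ALL"
        idx = order
    f, c = features[idx], certainty[idx]
    if fuse == "concat":
        # concatenate the selected features together with their certainties
        return torch.cat([f.flatten(), c])
    # "mult": weight each feature by its certainty, then concatenate
    return (f * c.unsqueeze(1)).flatten()
```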
Table 5 compares these different options. Using all feature hypotheses weighted by their certainties (ALL, mult) yields the best results by a large margin compared to the other alternatives. This suggests that capturing the future with its multi-modality and variation is key to representing future semantics.
Tables 6 and 7 show that the multi-hypotheses design is clearly superior to its single-prediction counterpart on both the Breakfast and the Epic-Kitchens dataset. While multiple hypotheses could not be exploited at the classification level in Section 5.5, they help considerably for the captioning task.
Figure 6 shows some qualitative results of future captioning on the Breakfast dataset. For each sample, future action/object classes of top-3 hypotheses are presented. In the top-left case, hypotheses are certain about the future object ”egg”, but for the action there is high uncertainty. In contrast, in the bottom-left case, uncertainty on the future object is higher than for the future action ”put”.
5.9 Uncertainty evaluation
In Figure 7, we provide the reliability diagram for the uncertainties of the feature hypotheses on the Epic-Kitchens dataset. The diagonal decrease suggests that our uncertainties are well calibrated with the errors of the features. In the supplementary material we provide more details about our method and more in-depth evaluations, as well as more qualitative results including failure cases.
6 Conclusion
We presented the problem of predicting future activities based on past observations. To this end, we leveraged a feature embedding used for action classification and extended it by learning a dynamics module that transfers these features into the future. The network predicts multiple hypotheses to model the uncertainty of the future state. While this had little effect on future activity classification, it helped substantially for future video captioning. Due to the decomposed representation into object and action, the approach generalizes well to unseen activities. The approach also allows for fully self-supervised training. Although its performance is still inferior to the supervised setting, it has great potential when applied to large-scale unlabeled videos. We believe there is promise in investigating further in this direction.
Appendix A Baseline: Association rule mining
”Association rule mining” discovers relations between activities in the dataset. For instance, in the Epic-Kitchens dataset the actions ”take” and ”put” frequently occur together. Therefore, by identifying these relations we can predict the future class label based on the class label of the current observation.
In the previous example, the rule would be:
If action ”take” is observed, then ”put” will be the next action.
Using this method, we find the most probable patterns between activities. We first obtain the activity label of the current observation and then, using association rule mining, predict the label of the future activity as the most frequently co-occurring consequent activity.
Table 8 shows frequently occurring action sequences in the Epic-Kitchens dataset. In this table, we provide three different measures: ”Support”, ”Confidence”, and ”Lift”. Support refers to the popularity of an action set, Confidence refers to the likelihood that an action ”B” happens given that action ”A” has already happened, and Lift measures the dependency between the actions.
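A minimal sketch of this baseline on sequences of action labels (the helper below is illustrative; names are not from the original work):

```python
from collections import Counter

def mine_next_action_rules(action_sequences):
    """Count consecutive (A, B) action pairs and derive support, confidence,
    and lift for each rule 'A -> B'. At test time, the predicted future action
    for an observed action A is the consequent B with the highest confidence."""
    pair_counts, single_counts, total = Counter(), Counter(), 0
    for seq in action_sequences:
        single_counts.update(seq)
        pair_counts.update(zip(seq[:-1], seq[1:]))
        total += len(seq)
    rules = {}
    for (a, b), n_ab in pair_counts.items():
        support = n_ab / total
        confidence = n_ab / single_counts[a]
        lift = confidence / (single_counts[b] / total)
        rules[(a, b)] = {"support": support, "confidence": confidence, "lift": lift}
    return rules
```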
Table 8 also reports the most frequent action set and the most frequent object set together with their occurrence counts. We utilize the Confidence to find the most probable action ”B” after observing action ”A”; for instance, if the currently observed action is ”Open”, the action ”put” is the most likely to follow.
Appendix B Implementation and training details
During training, we use the SGD optimizer with Nesterov momentum and weight decay. Training is performed with randomized mini-batches, where each sample contains the frames sampled from the current video segment. For both the Epic-Kitchens and the Breakfast dataset, the initial learning rate is decreased by a fixed factor whenever the validation error saturates for several epochs, and dropout is applied to the last fully connected layer. In addition, we apply data augmentation techniques similar to the original work: we resize the input frames and employ fixed-corner cropping and scale jittering with horizontal flipping. Afterwards, we apply per-pixel mean subtraction and resize the cropped regions to the network input size.
At inference time, we sample frames from the video, apply only center cropping, and feed them directly to the network to obtain the final future predictions. For captioning, we use the same approach but extract the features from the regression layer. We use these extracted features to train the LSTM that provides a caption for each video segment.
Appendix C Architecture details of feature translation modules
For the feature translation modules, we design several different architectures consisting of fully connected layers and convolutional layers; see Table 9. For the modules that contain inception blocks, we make use of the inception modules introduced by Szegedy et al. For simplicity, we present each layer of these modules in the following format (Table 9):
IN: Input from a specific layer of the ECO network.
Out: Output to the rest of the ECO network.
Appendix D Uncertainty evaluation
For evaluating the quality of the uncertainty predictions we use reliability diagrams and sparsification plots [2, 43, 19, 21, 16]. A reliability diagram plots the expected quality as a function of uncertainty; if the model is well calibrated, this plot should show a diagonal decrease. A sparsification plot shows how the quality improves as the samples with the highest uncertainties are gradually removed. In the best case, samples are removed according to the ground-truth error, which serves as the oracle for the sparsification plot.
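A sketch of how such a sparsification curve can be computed (assuming per-sample uncertainties and errors are available):

```python
import numpy as np

def sparsification_curve(uncertainty, error):
    """Remove the most uncertain samples first and track the mean error of
    the remaining ones; calling this with the true error in place of the
    uncertainty yields the oracle curve."""
    order = np.argsort(-uncertainty)          # most uncertain first
    errs = error[order]
    return np.array([errs[k:].mean() for k in range(len(errs))])
```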
In Figure 12 we show the sparsification plots of the best, the worst, and the average hypothesis for the classification of actions and objects on the Breakfast (first row) and Epic-Kitchens (second row) datasets. The plots tend to consistently increase as the most uncertain samples are removed. In order to assess the quality of these plots we also provide the oracle plot, which simply repeats the sparsification with the true error instead of the uncertainties and thus gives an upper bound. Ideally, the closer the sparsification plot is to its oracle, the better. One possible reason for the relatively larger gap in our plots is that activity prediction has not yet reached saturation, while image classification typically has.
In Figure 17 we report the reliability diagram per hypothesis on both the Breakfast and the Epic-Kitchens dataset for future action/object classification. The diagonal increase suggests that our uncertainties are well calibrated: accuracy tends to consistently increase as the confidence threshold for removing samples increases.
In Figure 18 we report the reliability diagram per hypothesis on the Breakfast dataset for feature reconstruction. For the Epic-Kitchens diagram, see Figure 7 of the main paper. The diagonal decrease suggests that our uncertainties are useful, as also supported by our captioning results.
In Figure 21 we show the sparsification plots per hypothesis for feature reconstruction on both datasets. The error tends to decrease as the highly uncertain features are removed. However, as in the classification case, there is a large gap to the oracle due to the difficulty of future prediction: when the predictions do not generalize, the uncertainties do not either.
Appendix E Qualitative results for the future prediction
Figure 22 shows representative results of our method. We input the current observation to the model and get the future captions.
The Epic-Kitchens dataset is egocentric, with the camera mounted on the person's head. Therefore, future prediction on this dataset is more challenging due to the quick changes in viewpoint and the motion blur. In the Breakfast dataset, the camera location is fixed throughout the video recording.
In the last row of Figure 22, for Epic-Kitchens the observation is ”put down washing liquid” and the prediction is ”turn on tap”, while the ground truth is ”take spoon”; from the observation alone it is not known whether the tap is on or off. Similarly, for Breakfast it is not known whether the pan has already been buttered. This implies that providing a longer history of previously performed actions would decrease the ambiguity of the future prediction.
References
-  R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, SIGMOD ’93, pages 207–216, New York, NY, USA, 1993. ACM.
-  O. M. Aodha, A. Humayun, M. Pollefeys, and G. J. Brostow. Learning a confidence measure for optical flow. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(5):1107–1120, May 2013.
-  M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine. Stochastic variational video prediction. CoRR, abs/1710.11252, 2017.
-  A. Bhattacharyya, M. Fritz, and B. Schiele. Long-term on-board prediction of people in traffic scenes under uncertainty. CoRR, abs/1711.09026, 2017.
-  W. Byeon, Q. Wang, R. K. Srivastava, and P. Koumoutsakos. ContextVP: Fully context-aware video prediction. In The European Conference on Computer Vision (ECCV), September 2018.
-  Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. In IEEE Int. Conference on Computer Vision (ICCV), 2017.
-  D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray. Scaling egocentric vision: The EPIC-KITCHENS dataset. CoRR, abs/1804.02748, 2018.
-  M. Denkowski and A. Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation, 2014.
-  C. Fan, J. Lee, and M. S. Ryoo. Forecasting hand and object locations in future frames. CoRR, abs/1705.07328, 2017.
-  Y. A. Farha, A. Richard, and J. Gall. When will you do what? - anticipating temporal occurrences of activities. CoRR, abs/1804.00892, 2018.
-  Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Int. Conference on Machine Learning (ICML), 2016.
-  Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng. Semantic compositional networks for visual captioning. In CVPR, 2017.
-  C. Guo, G. Pleiss, Y. Sun, and K. Weinberger. On calibration of modern neural networks. In Int. Conference on Machine Learning (ICML), 2017.
-  T. Han, J. Wang, A. Cherian, and S. Gould. Human action forecasting by learning task grammars. CoRR, abs/1709.06391, 2017.
-  G. Huang, Y. Li, and G. Pleiss. Snapshot ensembles: Train 1, get M for free. In Int. Conference on Learning Representations (ICLR), 2017.
-  E. Ilg, Ö. Çiçek, S. Galesso, A. Klein, O. Makansi, F. Hutter, and T. Brox. Uncertainty estimates and multi-hypotheses networks for optical flow. In European Conference on Computer Vision (ECCV), 2018. https://arxiv.org/abs/1802.07095.
-  W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
-  A. Kendall and Y. Gal. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? In Int. Conference on Neural Information Processing Systems (NIPS), 2017.
-  C. Kondermann, R. Mester, and C. Garbe. A Statistical Confidence Measure for Optical Flows, pages 290–301. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.
-  H. Kuehne, A. Arslan, and T. Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 780–787, June 2014.
-  J. Kybic and C. Nieuwenhuis. Bootstrap optical flow confidence and uncertainty measure. Computer Vision and Image Understanding, 115(10):1449 – 1462, 2011.
-  B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS workshop, 2016.
-  T. Lan, T.-C. Chen, and S. Savarese. A hierarchical representation for future action prediction. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, Computer Vision – ECCV 2014, pages 689–704, Cham, 2014. Springer International Publishing.
-  A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.
-  S. Lee, S. Purushwalkam, M. Cogswell, V. Ranjan, D. Crandall, and D. Batra. Stochastic multiple choice learning for training diverse deep ensembles. In Int. Conference on Neural Information Processing Systems (NIPS), 2016.
-  C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.
-  W. Liu, W. Luo, D. Lian, and S. Gao. Future frame prediction for anomaly detection – a new baseline. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  I. Loshchilov and F. Hutter. Sgdr: Stochastic gradient descent with warm restarts. In Int. Conference on Learning Representations (ICLR), 2017.
-  W. Lotter, G. Kreiman, and D. D. Cox. Deep predictive coding networks for video prediction and unsupervised learning. CoRR, abs/1605.08104, 2016.
-  P. Luc, C. Couprie, Y. LeCun, and J. Verbeek. Predicting future instance segmentations by forecasting convolutional features. CoRR, abs/1803.11496, 2018.
-  P. Luc, N. Neverova, C. Couprie, J. Verbeek, and Y. LeCun. Predicting deeper into the future of semantic segmentation. ICCV, 2017.
-  T. Mahmud, M. Hasan, and A. K. Roy-Chowdhury. Joint prediction of activity labels and starting times in untrimmed videos. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5784–5793, Oct 2017.
-  M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. CoRR, abs/1511.05440, 2015.
-  K. Papineni, S. Roukos, T. Ward, and W. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA., 2002.
-  M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: a baseline for generative models of natural videos. CoRR, abs/1412.6604, 2014.
-  C. Rupprecht, I. Laina, R. DiPietro, M. Baust, F. Tombari, N. Navab, and G. D. Hager. Learning in an uncertain world: Representing ambiguity through multiple hypotheses. In International Conference on Computer Vision (ICCV), 2017.
-  G. Singh, S. Saha, and F. Cuzzolin. Predicting action tubes. CoRR, abs/1808.07712, 2018.
-  K. Soomro, H. Idrees, and M. Shah. Predicting the where and what of actors and actions through online action localization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
-  N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using lstms. CoRR, abs/1502.04681, 2015.
-  C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016.
-  R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015.
-  C. Vondrick, H. Pirsiavash, and A. Torralba. Anticipating the future by watching unlabeled video. CoRR, abs/1504.08023, 2015.
-  A. S. Wannenwetsch, M. Keuper, and S. Roth. Probflow: Joint optical flow and uncertainty estimation. In IEEE Int. Conference on Computer Vision (ICCV), Oct 2017.
-  M. Zolfaghari, K. Singh, and T. Brox. ECO: efficient convolutional network for online video understanding. Computer Vision – ECCV 2018, pages 713–730, 2018.