Human action analysis is a central task in computer vision with an enormous impact on many applications, such as video content analysis [3, 23], video surveillance [22, 35], and automated driving vehicles [5, 29]. Systems interacting with humans also need the capability to promptly react to context changes and plan their actions accordingly. Most previous works focus on the tasks of action recognition [26, 33, 12] or early-action recognition [32, 6, 8], i.e., recognition of an action after it has been fully observed (it happened in the past) or recognition of an ongoing action from its partial observation (only part of the current action is available). A more challenging task is to predict the near future, i.e., to forecast actions that will be performed ahead in time. Predicting future actions before observing the corresponding frames [7, 14] is demanded by many applications which need to anticipate human behaviour. For example, intelligent surveillance systems may support human operators to avoid hazards, or assistive robotics may help non-self-sufficient people. Such a task requires analyzing significant spatio-temporal variations among actions performed by different people. For this reason, multiple modalities (e.g., appearance and motion) are typically considered to improve the identification of similar actions. Egocentric scenarios provide useful settings to study early-action recognition and action anticipation tasks. Indeed, wearable cameras offer an explicit point of view to capture human motion and object interaction.
In this work, we address the problem of anticipating egocentric human actions in an indoor scenario at several time steps. More specifically, we anticipate an action by leveraging the video segments that precede it. We disentangle the processing of the video into encoding and decoding stages: in the first stage, the model summarizes the video content, while in the second stage it predicts the next action at multiple anticipation times (see Fig. 1). We exploit a recurrent neural network (RNN) to capture temporal correlations between subsequent frames and consider three different modalities for representing the input: appearance (RGB), motion (optical flow) and object-based features.
An important aspect to consider when dealing with human action anticipation is that the future is uncertain, which means that different predictions of future actions may be equally likely to occur. For example, the actions “sprinkle over pumpkin seeds” and “sprinkle over sunflower seeds” may be equally plausible when preparing a recipe. For this reason, to deal with the uncertainty of future predictions, we propose to group similar actions by comparing several label smoothing techniques, in order to broaden the set of possible futures and reduce the uncertainty caused by one-hot encoded labels. Label smoothing was introduced as a form of regularization for classification models, since it moves a positive constant value into the wrong-class components of the one-hot target. A peculiar feature of this method is that it makes models robust to overfitting, especially when the labels in the dataset are noisy, e.g., when the targets are ambiguous. In our work, we extend label smoothing by using soft labels as a bridge for distilling knowledge into the model during training. Our experiments on the large-scale EPIC-Kitchens dataset show that label smoothing increases the performance of state-of-the-art models and yields better generalization on test data.
The main contributions of our work are as follows: 1) we generalize the label smoothing idea by extrapolating semantic priors from the action labels to capture the multi-modal future component of the action anticipation problem; we show that label smoothing, in this context, can be seen as a knowledge distillation process where the teacher provides semantic prior information to the action anticipation model; 2) we perform extensive experiments on egocentric videos, proving that our label smoothing techniques systematically improve the action anticipation results of state-of-the-art models.
II. Related Work
Action anticipation requires the interpretation of the current activity using a number of observations in order to foresee the most likely actions. For this reason, we briefly review three related research areas: action recognition, early-action recognition and action anticipation.
Action recognition is the task of recognizing the action contained in an observed trimmed video. Classic approaches to action recognition have leveraged hand-designed features coupled with machine learning algorithms to recognize actions from video [26, 40, 41]. More recent works have investigated the use of deep learning to obtain representations suitable for action recognition directly from video in an end-to-end fashion. Among these approaches, a line of works has investigated ways to exploit standard 2D CNNs for action recognition, often relying on optical flow as a means to represent motion [33, 11, 12, 42, 44, 27]. Other works have focused on the extension of 2D CNNs to 3D CNNs able to process spatio-temporal volumes [23, 37, 38]. Some approaches have used recurrent networks to model the temporal relationships between per-frame observations [10, 34, 14]. All of these works investigate in depth how to represent and leverage the input video, but little or no attention is given to the representation of the action labels.
Early-action recognition. Early-action recognition consists in recognizing ongoing actions from partial observations of streaming video. Classic works have addressed the task using integral histograms of spatio-temporal features, sparse coding, structured output SVMs, and sequential max-margin event detectors. Another line of research has leveraged LSTMs [2, 4, 9, 14] to account for the sequential nature of the task.
Action anticipation. Action anticipation deals with forecasting actions that will happen in the future. Previous studies have investigated different approaches, such as hierarchical representations, auto-regressive HMMs, regression of future representations, encoder-decoder LSTMs, and inverse reinforcement learning. Other approaches propose to perform long-term predictions focusing only on appearance features [28, 1]. However, differently from our work, very little or no attention has been paid either to the knowledge distillation of action semantics or to label smoothing for action anticipation.
Knowledge distillation and label smoothing. Knowledge distillation is the procedure of transferring the information extracted by a teacher network (with high learning capacity) to a student network (with low learning capacity) in order to allow the latter to reach similar performance. This is usually obtained by training the student via a distillation loss which takes into account both the ground truth and the prediction of the pre-trained teacher. Since this procedure can distill useful information from the teacher to the student, we perform semantic distillation via label smoothing.
Label smoothing is the procedure of softening the distribution of the target labels, reducing the most confident value of the one-hot vector and assigning a uniform value to all the zero components. Although this procedure improves results for classification problems by reducing overfitting, no previous work investigates design approaches other than uniform smoothing. In our work, we both generalize this idea and show a systematic improvement of state-of-the-art models.
III. Proposed Approach
Anticipating human actions is essential for developing intelligent systems able to avoid accidents or guide people to correctly perform their actions. We study the suitability of label smoothing techniques to address the issue.
III-A. Label Smoothing
As previously investigated in the literature, there is an inherent uncertainty in predicting future actions: starting from the current observation of an action, there can be multiple, yet still plausible, future scenarios. Hence, the problem can be reformulated as a multi-label task with missing labels where, among all the valid future realizations, only one is sampled in the dataset. All previous models designed for action anticipation are trained with cross-entropy on one-hot labels, leveraging only one of the possible future scenarios as the ground truth. A major drawback of using hard labels is that they favour the logits of the correct classes while weakening the importance of other plausible ones. In fact, given the one-hot encoded vector $y$ for a class $c$, the prediction $\hat{y}$ of the model and the logits $z$ of the model such that $\hat{y} = \mathrm{softmax}(z)$, the cross entropy is minimized only if $\hat{y}_c = 1$, i.e., if the logit $z_c$ dominates all the others. This encourages the model to be over-confident about its predictions, since during training it tries to focus all the energy on one single logit, leading to overfitting and scarce adaptivity. For this reason, we smooth the target distribution, giving negative classes a chance of being plausible. However, the usual label smoothing procedure introduces a uniform positive component among all the classes, without capturing the differences between actions. To this end, we propose several ways of designing the smoothing procedure by encoding semantic priors into the labels and weighting the actions according to their feature representation. We can connect our soft-label approach to the knowledge distillation framework, where the teacher provides useful information to the student model during training. What differs is that the teacher does not depend on the input data but solely on the target, i.e., it distills information using ground-truth data. Since the teacher's prediction is constant w.r.t. the input, such information can be encoded before training into the target via label smoothing.
As a form of regularization, label smoothing softens hard labels by averaging one-hot encoded targets with a constant vector as follows:

$$y^{LS} = (1 - \alpha)\, y + \frac{\alpha}{K}\,\mathbf{1} \quad (1)$$

where $y$ is the one-hot encoding, $\alpha$ is the smoothing factor ($0 \le \alpha \le 1$), $K$ represents the number of classes and $\mathbf{1}$ is the all-ones vector. Since the cross entropy $H(\cdot, \hat{y})$ is linear w.r.t. its first argument, it can be written as follows:

$$H(y^{LS}, \hat{y}) = (1 - \alpha)\, H(y, \hat{y}) + \alpha\, H(u, \hat{y})$$

where $u = \mathbf{1}/K$ is the uniform distribution over the classes.
The optimization based on the above loss can be seen as a knowledge distillation procedure where the teacher randomly predicts the output, i.e., its prediction coincides with the uniform distribution $u$. Hence, the connection with the distillation loss shows that the second term in Eq. (1) can be seen as prior knowledge, given by an agnostic teacher, for the target $y$. Although using an agnostic teacher seems an unusual choice, uniform label smoothing can be seen as a form of regularization and thus it can improve the model's generalization capability. Taking this into account, we extend the idea of smoothing labels by modeling the second term of Eq. (1), i.e., the prior knowledge of the targets, as follows:
$$y^{LS} = (1 - \alpha)\, y + \alpha\, p$$

where $p$ is the prior vector such that $p_i \ge 0$ and $\sum_{i=1}^{K} p_i = 1$. Therefore, the resulting cross entropy with soft labels is written as follows:

$$H(y^{LS}, \hat{y}) = (1 - \alpha)\, H(y, \hat{y}) + \alpha\, H(p, \hat{y})$$
This loss not only penalizes errors related to the correct class but also errors related to the positive entries of the prior. Starting from this formulation, we introduce Verb-Noun, GloVe and Temporal priors for smoothing labels in the knowledge distillation procedure. In the following, we detail our label smoothing techniques.
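As a minimal, self-contained sketch of the formulation above (toy values, NumPy only; the prior, the prediction and the smoothing factor below are made up for illustration):

```python
import numpy as np

def smooth_labels(one_hot, prior, alpha):
    """Blend a one-hot target with a prior distribution: (1 - alpha) * y + alpha * p."""
    return (1.0 - alpha) * one_hot + alpha * prior

def cross_entropy(target, pred, eps=1e-12):
    """H(target, pred) = -sum_i target_i * log(pred_i)."""
    return -np.sum(target * np.log(pred + eps))

# Toy example with K = 4 classes.
K = 4
y = np.eye(K)[2]                       # one-hot target for class 2
p = np.array([0.1, 0.2, 0.6, 0.1])     # an arbitrary prior over the classes
alpha = 0.3

y_soft = smooth_labels(y, p, alpha)
pred = np.array([0.05, 0.15, 0.7, 0.1])  # model prediction (softmax output)

# Linearity of H in its first argument:
# H(y_soft, pred) == (1 - alpha) * H(y, pred) + alpha * H(p, pred)
lhs = cross_entropy(y_soft, pred)
rhs = (1 - alpha) * cross_entropy(y, pred) + alpha * cross_entropy(p, pred)
assert np.isclose(lhs, rhs)
```

The assertion makes the decomposition of the soft-label loss into a ground-truth term and a prior (teacher) term explicit.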
Verb-Noun label smoothing. EPIC-KITCHENS contains action labels structured as verb-noun pairs, like “cut onion” or “dry spoon”. More formally, if we denote by $\mathcal{A}$ the set of actions, $\mathcal{V}$ the set of verbs, and $\mathcal{N}$ the set of nouns, then an action $a$ is represented by a tuple $(v, n)$ where $v \in \mathcal{V}$ and $n \in \mathcal{N}$. Let $A_v$ be the set of actions sharing the same verb $v$ and $A_n$ the set of actions sharing the same noun $n$, defined as follows:

$$A_v = \{(v', n') \in \mathcal{A} : v' = v\}, \qquad A_n = \{(v', n') \in \mathcal{A} : n' = n\}$$

where $v' \in \mathcal{V}$ and $n' \in \mathcal{N}$.
We define the prior of the ground-truth action class $a = (v, n)$ as

$$p_i = \frac{1}{Z}\, \mathbb{1}\left[a_i \in A_v \cup A_n\right]$$

where $a_i$ is the $i$-th action, $v$ and $n$ are the verb and the noun of the ground-truth action, $\mathbb{1}$ is the indicator function, and $Z$ is a normalization term. Using such an encoding rule, the cross entropy not only penalizes the error related to the correct class but also the errors with respect to all the other “similar” actions sharing either the same verb or the same noun. (It can be proved that, in terms of scalar product, two different classes having the same noun or verb and encoded with Verb-Noun label smoothing are closer than classes encoded with hard labels.)
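A small sketch of how the Verb-Noun prior can be built (the action vocabulary below is hypothetical; following the indicator definition, the prior includes the ground-truth class itself):

```python
import numpy as np

# Hypothetical action vocabulary: each action is a (verb, noun) pair.
actions = [("cut", "onion"), ("cut", "carrot"), ("dry", "spoon"), ("wash", "onion")]

def verb_noun_prior(gt_idx, actions):
    """Uniform prior over all actions sharing the verb or the noun of the
    ground-truth action (the indicator of A_v union A_n, normalized by Z)."""
    v, n = actions[gt_idx]
    mask = np.array([a[0] == v or a[1] == n for a in actions], dtype=float)
    return mask / mask.sum()

# For "cut onion": "cut carrot" shares the verb, "wash onion" shares the noun,
# so the prior mass is spread uniformly over these classes and the ground truth.
prior = verb_noun_prior(0, actions)
```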
GloVe-based label smoothing. An important aspect to consider when dealing with actions represented by verbs and/or nouns is their semantic meaning. In the Verb-Noun label smoothing, we define the prior considering a rough yet still meaningful notion of similarity in which actions sharing either the same verb or the same noun are considered similar. To extend this idea, we extrapolate the prior from the word embedding of the action. One of the most important properties of word embeddings is to place words with similar semantic meanings close together and dissimilar ones far apart, as opposed to hard labels, which cannot capture any similarity between classes since $y_i^{\top} y_j = 0$ for any two distinct classes $i \ne j$.
Using such an action representation, we enable the distillation of useful information into the model during training, since the cross entropy not only penalizes the error related to the correct class but also the error related to all other similar actions. In order to compute the word embeddings of the actions we use the GloVe model pretrained on the Wikipedia 2014 and Gigaword 5 datasets. We use the GloVe model since it does not rely just on local statistics of words, but incorporates global statistics to obtain word vectors. Since the model takes as input only single words, we encode the action $a = (v, n)$ as follows:

$$e_a = \frac{w(v) + w(n)}{2}$$

where $e_a$ is the obtained action representation of $a$ and $w(\cdot)$ is the output of the GloVe model. We finally compute the prior probability for smoothing the labels as the normalized similarity between action representations:

$$p_i = \frac{\mathrm{sim}(e_a, e_{a_i})}{\sum_{j=1}^{K} \mathrm{sim}(e_a, e_{a_j})} \quad (9)$$

Hence, $\mathrm{sim}(e_a, e_{a_i})$ in Eq. (9) represents the similarity between the ground-truth action $a$ and the $i$-th action.
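The embedding-based prior can be sketched as follows; random vectors stand in for the pretrained GloVe embeddings, and cosine similarity clipped at zero is one possible choice of similarity function (both are assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in word vectors; in practice these would come from a pretrained GloVe model.
vocab = ["cut", "dry", "wash", "onion", "carrot", "spoon"]
w = {word: rng.normal(size=50) for word in vocab}

def action_embedding(verb, noun):
    """Encode a (verb, noun) action as the mean of its two word vectors."""
    return (w[verb] + w[noun]) / 2.0

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def glove_prior(gt, actions):
    """Prior over classes proportional to similarity with the ground-truth action."""
    e_gt = action_embedding(*gt)
    sims = np.array([cosine(e_gt, action_embedding(*a)) for a in actions])
    sims = np.clip(sims, 0.0, None)   # keep mass non-negative -- an assumption
    return sims / sims.sum()

actions = [("cut", "onion"), ("cut", "carrot"), ("dry", "spoon"), ("wash", "onion")]
prior = glove_prior(("cut", "onion"), actions)
```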
Temporal label smoothing.
Some actions are more likely to co-occur than others, and only specific action sequences may be considered plausible. For this reason, it is reasonable to focus on the most frequent action sequences, since they may reveal valid paths in the action space. In this case, we build the prior by considering subsequent actions of length two, i.e., we estimate from the training set the transition probability from the $i$-th action to the ground-truth action $a$ as follows:

$$p_i = \frac{\#(a_i \rightarrow a)}{\sum_{j=1}^{K} \#(a_j \rightarrow a)}$$

where $\#(a_i \rightarrow a)$ is the number of times that the action $a_i$ is followed by the action $a$. Using such a representation, we reward both the correct class and the most frequent actions that precede the correct class.
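A possible sketch of the transition-count prior, using hypothetical training sequences of action indices:

```python
import numpy as np
from collections import Counter

def temporal_prior(sequences, num_actions, target):
    """Prior over classes proportional to how often each action precedes `target`
    in the training sequences (counts of bigrams a_i -> target, normalized)."""
    counts = Counter()
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            if nxt == target:
                counts[prev] += 1
    p = np.array([counts[i] for i in range(num_actions)], dtype=float)
    return p / p.sum() if p.sum() > 0 else p

# Toy training sequences over 4 action classes (hypothetical data).
seqs = [[0, 2, 1, 2], [3, 2, 0, 2], [1, 2]]
p = temporal_prior(seqs, num_actions=4, target=2)
```

Here class 2 is preceded twice by class 0, twice by class 1 and once by class 3, so the prior concentrates on those classes.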
III-B. Action Anticipation Architecture
For our experiments, we consider a learning architecture based on recurrent neural networks. Following previous work, our approach uses the protocol depicted in Fig. 2 for anticipating future actions. We process the frames preceding the action we want to anticipate, grouping them into fixed-length video snippets. Each video snippet is sampled at a regular time step and processed considering three different modalities: RGB features, computed using a Batch-Normalized Inception CNN trained for action recognition; object features, computed using Fast R-CNN; and optical flow, processed through a Batch-Normalized Inception CNN trained for action recognition. Our multi-modal architecture processes the above inputs and encompasses two building blocks: an encoder, which recognizes and summarizes past observations, and a decoder, which predicts future actions at different anticipation time steps. As shown in Fig. 3, during the encoding stage each modality is separately processed by an LSTM layer. During the decoding stage, the resulting streams are merged with late fusion and fed into a fully connected layer with softmax activation.
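A heavily simplified, illustrative sketch of this pipeline: plain recurrent updates stand in for the LSTM layers, random projections for the pre-extracted modality features, and all dimensions and weights are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 8, 10                                    # number of snippets, action classes
dims = {"rgb": 1024, "flow": 1024, "obj": 352}  # illustrative feature sizes
H = 64                                          # hidden size

def encode(seq, Wx, Wh):
    """Simple recurrent encoder: h_t = tanh(Wx x_t + Wh h_{t-1}) (LSTM stand-in)."""
    h = np.zeros(H)
    for x in seq:
        h = np.tanh(Wx @ x + Wh @ h)
    return h

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One branch per modality, then late fusion and a softmax classification head.
Wout = rng.normal(size=(K, H)) * 0.1
fused = np.zeros(H)
for name, d in dims.items():
    seq = rng.normal(size=(T, d))               # snippet features for this modality
    Wx = rng.normal(size=(H, d)) * 0.01
    Wh = rng.normal(size=(H, H)) * 0.01
    fused += encode(seq, Wx, Wh)
fused /= len(dims)                              # late fusion by averaging
scores = softmax(Wout @ fused)                  # distribution over next actions
assert scores.shape == (K,)
```

In the actual model, a decoder LSTM would additionally unroll the fused state over multiple anticipation time steps; this sketch only shows a single prediction for brevity.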
IV-A. Dataset and Evaluation Measures
Dataset. Our experiments are performed on the EPIC-KITCHENS dataset, a large-scale collection of egocentric videos with action annotations covering a vocabulary of unique actions, verbs, and nouns. We follow the same split as previous work, dividing the annotated segments into a training set and a validation set.
Evaluation Measures. To assess the quality of predictions and compare all methods, we use Top-k accuracy, i.e., a prediction is considered correct if the ground-truth label falls within the top-k predictions. As reported in [13, 24], this measure is one of the most appropriate given the uncertainty of future predictions. More specifically, we use Top-5 accuracy for method comparison. For the test set, we also report Top-1 accuracy, Macro Average Class Precision and Macro Average Class Recall; the last two metrics are computed only on many-shot nouns, verbs and actions.
IV-B. Models and Baselines
In the comparative analysis, we exploit the architecture proposed in Sec. III-B, employing the different label smoothing techniques defined in Sec. III-A. In our experiments we consider models trained using one-hot vectors (One Hot), uniform smoothing (Smooth Uniform), temporal soft labels (Smooth TE), Verb-Noun soft labels (Smooth VN), GloVe-based soft labels (Smooth GL) and GloVe + Verb-Noun soft labels (Smooth GL+VN). We design the last method (Smooth GL+VN) by smoothing hard labels with the average of the two related priors. We also evaluate the effect of the proposed label smoothing techniques on the state-of-the-art approach RU-LSTM; in this case, we choose GL+VN as the label smoothing technique.
Results. For each label smoothing method we select the best smoothing factor $\alpha$ with a grid search over a fixed range and step size. The smoothing factors used for the results discussed in this section are shown in Table I. We notice that our label smoothing procedures (Verb-Noun, GloVe and Temporal) perform well with higher smoothing factors compared to uniform label smoothing. This suggests that, when soft labels encode semantic information, the prior becomes more relevant, assuming an importance comparable to that of the ground truth. During training, we fix the selected smoothing factor for each method and, in order to obtain robust performance estimates, we repeat all the experiments ten times.
Table II reports our results on the EPIC-KITCHENS validation set. We notice that all our proposed label smoothing methods improve model performance compared with training on one-hot encoded labels. Soft labels based on GloVe + Verb-Noun attain the best performance, improving the Top-5 accuracy over hard labels. To validate our results, we trained RU-LSTM, the state-of-the-art model for action anticipation, with label smoothing, using the EPIC-KITCHENS validation and test sets. Table III reports results of the RU-LSTM architecture using both one-hot encoding (i.e., the baseline) and smoothed labels on the validation set. We select GloVe + Verb-Noun soft labels since they show the highest performance increase. As shown by the obtained results, label smoothing improves the Top-5 accuracy for all the anticipation times by a consistent margin. Such behavior highlights the systematic effect of smoothed labels on model performance.
In Table IV, we report the results obtained on the test set of EPIC-KITCHENS, where different approaches are considered for the comparison. It is worth noting that the method combining label smoothing with RU-LSTM improves the performance, obtaining the best Top-1 and Top-5 accuracy for anticipating verbs, nouns and actions. Label smoothing also helps to improve Precision for verbs and obtains comparable results in anticipating actions and nouns. Results in terms of Recall point out that label smoothing helps for noun anticipation while maintaining comparable results for the anticipation of verbs and actions. Finally, Fig. 4 shows some qualitative results obtained with our framework, whereas Fig. 5 depicts the prior components of the proposed label smoothing procedures.
This study proposed a knowledge distillation procedure via label smoothing for leveraging the multi-modal future component of the action anticipation problem. We generalized the idea of label smoothing by designing semantic priors of actions that are used during training as ground-truth labels. We implemented an LSTM baseline model that anticipates actions at multiple time steps starting from a multi-modal representation of the input video. Experimental results corroborate our findings compared to state-of-the-art models, highlighting that label smoothing systematically improves performance when dealing with future uncertainty.
Research at the University of Padova is partially supported by MIUR PRIN-2017 PREVUE grant. Authors of Univ. of Padova gratefully acknowledge the support of NVIDIA for their donation of GPUs, and the UNIPD CAPRI Consortium, for its support and access to computing resources. Research at the University of Catania is supported by Piano della Ricerca 2016-2018 linea di Intervento 2 of DMI and MIUR AIM - Attrazione e Mobilità Internazionale Linea 1 - AIM1893589 - CUP E64118002540007.
References

- When will you do what? - Anticipating temporal occurrences of activities. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2017) Encouraging LSTMs to anticipate actions very early. In IEEE International Conference on Computer Vision (ICCV).
- (2011) Event detection and recognition for semantic annotation of video. Multimedia Tools and Applications 51(1), pp. 279–302.
- (2017) Am I done? Predicting action progress in videos. arXiv preprint arXiv:1705.01781.
- (2018) Long-term on-board prediction of people in traffic scenes under uncertainty. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2013) Recognize human activities from partially observed videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2018) Scaling egocentric vision: the EPIC-KITCHENS dataset. In European Conference on Computer Vision (ECCV).
- (2016) Online action detection. In European Conference on Computer Vision (ECCV).
- (2018) Modeling temporal structure with LSTM for online action detection. In IEEE Winter Conference on Applications of Computer Vision (WACV).
- (2015) Long-term recurrent convolutional networks for visual recognition and description. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2017) Spatiotemporal multiplier networks for video action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2016) Convolutional two-stream network fusion for video action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Leveraging uncertainty to rethink loss functions and evaluation measures for egocentric action anticipation. In European Conference on Computer Vision Workshops (ECCVW).
- (2019) What would you expect? Anticipating egocentric actions with rolling-unrolling LSTMs and modality attention. In IEEE International Conference on Computer Vision (ICCV).
- (2017) RED: reinforced encoder-decoder networks for action anticipation. In British Machine Vision Conference (BMVC).
- (2015) Fast R-CNN. In IEEE International Conference on Computer Vision (ICCV).
- (2015) Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop.
- (2014) Max-margin early event detectors. International Journal of Computer Vision 107(2), pp. 191–202.
- (2014) Sequential max-margin event detectors. In European Conference on Computer Vision (ECCV).
- (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML).
- (2015) Car that knows before you do: anticipating maneuvers via learning temporal driving models. In IEEE International Conference on Computer Vision (ICCV).
- 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1), pp. 221–231.
- (2014) Large-scale video classification with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2016) Anticipating human activities using object affordances for reactive robotic response. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(1), pp. 14–29.
- (2014) A hierarchical representation for future action prediction. In European Conference on Computer Vision (ECCV), pp. 689–704.
- (2008) Learning realistic human actions from movies. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2019) TSM: temporal shift module for efficient video understanding. In IEEE International Conference on Computer Vision (ICCV).
- (2017) Joint prediction of activity labels and starting times in untrimmed videos. In IEEE International Conference on Computer Vision (ICCV).
- (2018) Event-based vision meets deep learning on steering prediction for self-driving cars. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2019) Leveraging the present to anticipate the future in videos. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
- GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP).
- (2011) Human activity prediction: early recognition of ongoing activities from streaming videos. In IEEE International Conference on Computer Vision (ICCV).
- (2014) Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (NeurIPS).
- (2018) LSTA: long short-term attention for egocentric action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Real-world anomaly detection in surveillance videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2016) Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2015) Learning spatiotemporal features with 3D convolutional networks. In IEEE International Conference on Computer Vision (ICCV).
- (2018) A closer look at spatiotemporal convolutions for action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2016) Anticipating visual representations from unlabeled video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2013) Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision 103(1), pp. 60–79.
- (2013) Action recognition with improved trajectories. In IEEE International Conference on Computer Vision (ICCV).
- (2016) Temporal segment networks: towards good practices for deep action recognition. In European Conference on Computer Vision (ECCV).
- (2017) Visual forecasting by imitating dynamics in natural sequences. In IEEE International Conference on Computer Vision (ICCV).
- (2018) Temporal relational reasoning in videos. In European Conference on Computer Vision (ECCV).