Thanks to advances in Artificial Intelligence, many intelligent systems (e.g., Amazon Echo, Google Home) have become available on the market. Despite their great ability to interact with humans through a speech interface, they are currently not good at proactively interacting with humans. We argue that the key to proactive interaction is to anticipate users’ intentions by observing their actions. Given the anticipated intention, an intelligent system can provide services that facilitate it. More specifically, the ability to anticipate a large number of daily intentions is the key to enabling a proactive intelligent system.
Many researchers have tackled tasks related to intention anticipation. [11, 28, 18] focus on early activity prediction, i.e., predicting actions before they are completed. However, the time-to-action-completion in this task is typically very short, so there are only a few scenarios in which intelligent systems can take advantage of the predicted activity. Kitani et al. propose to forecast human trajectories. Forecasting trajectories is very useful, but it does not directly reveal the “intention” behind a trajectory. [3, 12, 13] anticipate future events on the road, such as making a left turn or being involved in an accident. Although these events can be considered intentions, only a few intentions (at most five) are studied. Moreover, none of the works above leverages heterogeneous sensing modalities to reduce computation requirements.
In this work, we anticipate a variety of daily intentions (e.g., “go outside”, “charge cellphone” in Fig. 1) by sensing motion and visual observations of actions. Our method is unique in several ways. Firstly, we focus on on-wrist sensing: (1) an on-wrist camera (inspired by [24, 2]) is used to observe object interactions reliably, and (2) an on-wrist accelerometer is used to sense 3D hand motion efficiently. Since both on-wrist sensors are unconventional, we collect auxiliary object appearance and motion data to pre-train two encoders: (1) a Convolutional Neural Network (CNN) to classify daily objects, and (2) a 1D-CNN to classify common motions. Secondly, we leverage heterogeneous sensing modalities to reduce computation requirements. Note that visual data is very informative but costly to compute, whereas motion data is less informative but cheap to compute. We propose a Policy Network to determine when to peek at images: the network triggers the camera only at important moments while continuously analyzing the motions. We call this Motion-Triggered sensing. Finally, we propose to use a Recurrent Neural Network (RNN) to model the important long- and short-term dependencies of actions. Modeling these dependencies properly is the key to accurate anticipation, since daily action sequences are subtle and diverse. For instance, while multiple action sequences can lead to the same intention, the same subset of actions can also lead to different intentions (see “go exercise” and “go outside” in Fig. 1).
In order to evaluate our method, we collect the first daily intention dataset from on-wrist sensors. It consists of 2379 videos with 34 intentions and 164 unique action sequences. For pre-training the encoders, we collect an object dataset by manipulating 50 daily objects (including a hand-free class, which means that the hand is not interacting with any object) without any specific intention, and a 3D hand motion dataset with six motions performed by eight users. On the intention dataset, our method achieves high anticipation accuracy while processing only about 29% of the visual observations on average.
Our main contributions can be summarized as follows. (1) We adapt on-wrist sensors to reliably capture daily human actions. (2) We show that our policy network can effectively select important images while only slightly sacrificing anticipation accuracy. (3) We collected, and will release, one of the first daily intention datasets with a diverse set of action sequences and heterogeneous on-wrist sensory observations.
2 Related Work
We first describe works related to anticipation. Then, we mention other behavior analysis tasks. Finally, we describe a few works using wearable sensors for recognition.
2.1 Anticipation
The gist of anticipation is to predict the future. We group related works as follows.
Early activity prediction. Ryoo introduces a probability model for early activity prediction. Hoai et al. propose a max-margin model to handle partial observations. Lan et al. propose the hierarchical movemes representation for predicting future activities.
Event anticipation. [17, 13, 12, 3, 33] anticipate events before they occur. Jain et al. [13, 12] propose to fuse multiple visual sensors to anticipate a driver’s actions, such as turning left or right. Chan et al. further propose a dynamic soft-attention-based RNN model to anticipate accidents on the road captured in dashcam videos. Recently, Vondrick et al. propose to learn temporal knowledge from unlabeled videos for anticipating actions and objects. However, these early action recognition and anticipation approaches focus on activity categories and do not study risk assessment of objects and regions in videos. Bokhari and Kitani propose to forecast long-term activities from a first-person perspective.
Intention anticipation. Intention has been explored more in the robotics community [35, 17, 16, 22]. Wang et al. propose a latent-variable model for inferring human intentions. Koppula and Saxena address the problem by observing RGB-D data; a real robotic system has executed their method to assist humans in daily tasks. [16, 22] also propose to anticipate human activities for improving human-robot collaboration. Hashimoto et al. recently propose to sense intention in cooking tasks via knowledge of access to objects. Recently, Rhinehart and Kitani propose an online approach for first-person videos that anticipates intentions, including where to go and what to acquire.
Others. Kitani et al. propose to forecast human trajectories using the surrounding physical environment (e.g., roads, pavement), and show that the forecasted trajectory can be used to improve object tracking accuracy. Yuen and Torralba propose to predict motion from still images. Walker et al. propose a novel visual appearance prediction method based on mid-level visual elements with temporal modeling methods.
Despite many related works, to the best of our knowledge, this is the first work in computer vision that leverages a heterogeneous sensing system to anticipate daily intentions with low computation requirements.
2.2 High-level Behavior Analysis
Other than activity recognition, there are a few high-level behavior analysis tasks. Joo et al. propose to predict the persuasive motivation of the photographer who captured an image. Vondrick et al. propose to infer the motivation of actions in an image by leveraging text. Recently, many methods (e.g., [38, 25, 26, 40, 32, 37]) have been proposed to generate sentences or paragraphs describing the behavior of humans in a video.
2.3 Recognition from Wearable Sensors
Most wearable sensors used in computer vision are first-person (i.e., egocentric) cameras. [23, 31, 6, 19] propose to recognize activities, and [21, 7] propose to summarize daily activities. Recently, two works [24, 2] focus on recognition using on-wrist cameras and show that they outperform egocentric cameras. Inspired by them, we adopt a similar on-wrist sensor approach.
3 Our Approach
We first define the problem of intention anticipation. Next, we introduce our RNN model, which encodes sequential observations and fuses information from sensors on both hands. Then, we describe our novel motion-triggered process based on a policy network. Finally, we describe how we pre-train the representations from auxiliary data.
3.1 Problem Formulation
Observations. At frame $t$, the camera observes an image $I_t$, and the motion sensor observes the 3D acceleration of the hands $A_t$.
Representations. The image and the 3D acceleration are raw sensory values that are challenging to use directly for intention anticipation, especially when training data is scarce. Hence, we propose to learn visual object (referred to as object, $O_t$) and hand motion (referred to as motion, $M_t$) representations from other tasks with larger amounts of training data. Note that, for all variables, we use a superscript to specify the left or right hand when needed; for instance, $O_t^{l}$ indicates the left-hand object representation.
Goal. At frame $t$, our model predicts the future intention $g \in \mathcal{G}$ based on the observations, where $\mathcal{G}$ is the set of intention indices. Assuming the intention occurs at frame $T$, we not only want the prediction to be correct but also want to predict it as early as possible (i.e., we want $T - t$ to be large).
3.2 Our Recurrent and Fusion Model
Intention anticipation is a very challenging task. Intuitively, the order of observed objects and hand motions should be a very strong cue. However, most orders are not strict. Hence, learning composite orders from limited training data is critical.
Recurrent Neural Network (RNN) for encoding.
We propose to use an RNN with two layers of Long Short-Term Memory (LSTM) cells to handle the variation (Fig. 2-Top) as follows,

$$p_t = \mathrm{softmax}(W h_t), \qquad h_t = \mathrm{RNN}(h_{t-1}, x_t; \theta_r), \qquad x_t = \phi(O_t \oplus M_t; \theta_e),$$

where $p_t$ is the softmax probability of every intention in $\mathcal{G}$, $\theta_r$ is the model parameter to be learned, $h_t$ is the learned hidden representation, and $x_t$ is a fixed-dimension output of $\phi$. $\theta_e$ is the parameter of the embedding function $\phi$, $\oplus$ is the concatenation operation, and $W$ is a linear mapping (i.e., $p_t = \mathrm{softmax}(W h_t)$). An RNN has the advantage of learning both long- and short-term dependencies of observations, which is ideal for anticipating intentions.
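The per-frame update can be sketched in plain Python. This is a minimal stand-in, not the paper's implementation: a plain tanh recurrence replaces the two-layer LSTM, and all weight matrices are hypothetical toy values.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def embed(obj, mot, W_e):
    """x_t = phi(O_t (+) M_t; theta_e): linear embedding of the
    concatenated object and motion features."""
    return matvec(W_e, obj + mot)  # list '+' acts as concatenation

def rnn_step(h_prev, x_t, W_h, W_x):
    """Simplified recurrent update standing in for the 2-layer LSTM."""
    return [math.tanh(a + b)
            for a, b in zip(matvec(W_h, h_prev), matvec(W_x, x_t))]

def predict(h_t, W_out):
    """p_t = softmax(W h_t): a distribution over intentions."""
    return softmax(matvec(W_out, h_t))
```

At each frame the object and motion features are embedded, folded into the hidden state, and mapped to a probability over intentions.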
Fusing left and right hands. Since tasks in real life are typically not completed by only one hand, we allow our system to observe actions on both hands simultaneously. We concatenate the right-hand (i.e., the dominant hand) and left-hand observations in a fixed order to preserve the information of which hand performs certain actions more frequently. The fused observation is $x_t = \phi(O_t \oplus M_t; \theta_e)$, where $O_t = O_t^{r} \oplus O_t^{l}$ and $M_t = M_t^{r} \oplus M_t^{l}$.
Training for anticipation. Since our goal is to predict at any time before the intention happens, anticipation errors at different times should be penalized differently. We use an exponential loss to train our RNN-based model. The anticipation loss is defined as

$$L_a = \sum_{t=0}^{T} -e^{-(T-t)} \log p_t(g),$$

where $g$ is the ground-truth intention and $T$ is the time when the intention is reached. Based on this definition, the loss at the first frame ($t=0$) is only $e^{-T}$ times that at the last frame ($t=T$). This implies that an anticipation error is penalized less when it is early, and more when it is late, which encourages our model to anticipate the correct intention as early as possible.
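The exponentially weighted loss above is simple to compute; a short Python sketch (per-frame probabilities here are toy values):

```python
import math

def anticipation_loss(p_true, T):
    """Exponentially weighted cross-entropy over a sequence of T+1 frames.
    p_true[t] is p_t(g): the predicted probability of the ground-truth
    intention g at frame t. Early frames get weight e^{-(T-t)}, so an
    early mistake is penalized less than a late one."""
    return sum(-math.exp(-(T - t)) * math.log(p)
               for t, p in enumerate(p_true))
```

With a constant per-frame probability, the first frame contributes only $e^{-T}$ of what the last frame contributes, matching the definition above.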
The current RNN considers both motion and object representations as shown in Eq. 1. It is also straightforward to modify Eq. 1 such that RNN considers only motion or only object representation. However, the RNN needs to consider the same type of representation at all times. In the following section, we introduce the Motion-Triggered sensing process, where the RNN considers different representations at different frames depending on a learned policy.
3.3 RL-based Policy Network
We propose a policy network to determine when to process a raw image observation $I_t$ into an object representation $O_t$. The network continuously observes the motion $M_t$ and the hidden state $h_{t-1}$ of the RNN to parsimoniously trigger the computation of $O_t$ as follows,

$$a_t \sim \pi(a \mid M_t, h_{t-1}; \theta_p), \qquad (6)$$
$$\tilde{O}_t = a_t\, O_t + (1 - a_t)\, \tilde{O}_{t-1}, \qquad (7)$$
$$x_t = \phi(\tilde{O}_t \oplus M_t; \theta_e), \qquad (8)$$

where $a_t$ is the decision of our policy network to trigger ($a_t = 1$) or not trigger ($a_t = 0$), $\theta_p$ is the parameters of the policy network, the policy $\pi$ outputs a probability distribution over trigger and non-trigger, and $\tilde{O}_t$ is the modified object representation. As shown in Eq. 7, when $a_t = 1$, the visual observation at frame $t$ is updated ($\tilde{O}_t = O_t$) at a high CNN-inference cost. When $a_t = 0$, the previous representation is simply kept ($\tilde{O}_t = \tilde{O}_{t-1}$). The modified object representation influences the embedded representation as shown in Eq. 8.
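The gating of Eq. 7 amounts to caching the last computed representation; a minimal sketch (assuming the first frame is always computed):

```python
def motion_triggered(object_reps, decisions):
    """When a_t = 1, the (expensive) object representation O_t is
    recomputed; when a_t = 0, the previous representation is carried
    forward. object_reps[t] stands for O_t; decisions[t] is a_t."""
    out, prev = [], None
    for o_t, a_t in zip(object_reps, decisions):
        if a_t == 1 or prev is None:
            prev = o_t  # trigger: pay the CNN-inference cost
        out.append(prev)  # otherwise: reuse the cached representation
    return out
```

Only the triggered frames incur CNN inference; every other frame reuses the cached result.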
Reward. We set our reward to encourage fewer triggered operations while maintaining correct intention anticipation:

$$r = \begin{cases} \left(1 - \dfrac{N_{tr}}{T}\right) p\, R^{+}, & \hat{g} = g,\\[4pt] \dfrac{N_{tr}}{T}\, p\, R^{-}, & \hat{g} \neq g, \end{cases} \qquad (9)$$

where $g$ is the ground-truth intention, $\hat{g}$ is the predicted intention, $N_{tr}$ is the number of triggered operations in the first $T$ frames of the video, $p$ is the probability of the anticipated intention, $R^{+}$ is a positive reward for a correct intention anticipation, and $R^{-}$ is a negative reward for an incorrect one. Note that when the trigger ratio $N_{tr}/T$ is higher, the positive reward is reduced and the negative reward becomes more negative.
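To make the trade-off concrete, here is a small Python sketch of a reward with the behavior described above; the exact functional form is a reconstruction, not necessarily the paper's verbatim Eq. 9:

```python
def reward(correct, p, n_trigger, T, r_pos=100.0, r_neg=-100.0):
    """Reward sketch: as the trigger ratio n_trigger/T grows, the
    positive reward shrinks and the negative reward grows in magnitude.
    p is the probability of the anticipated intention; r_pos/r_neg
    follow the paper's +100/-100 setting."""
    ratio = n_trigger / T
    if correct:
        return (1.0 - ratio) * p * r_pos
    return ratio * p * r_neg
```

A correct prediction with few triggers earns close to the full positive reward; an incorrect prediction with many triggers is penalized most heavily.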
Policy loss. We follow the derivation of the policy gradient (REINFORCE) and define a policy loss function,

$$L_p = -\frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{T} r_t^{n} \log \pi(a_t^{n} \mid M_t, h_{t-1}; \theta_p),$$

where $\{a_t^{n}\}$ is the $n$-th sequence of triggered patterns sampled from $\pi$, $N$ is the number of sampled sequences, and $T$ is the time when the intention is reached. $r_t^{n}$ is the reward of the sampled sequence at time $t$ computed from Eq. 9. Please see Sec. 2 of the supplementary material for the derivation.
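A toy sketch of this surrogate loss in Python (the episode data here is hypothetical):

```python
import math

def policy_loss(episodes):
    """REINFORCE-style surrogate loss. episodes is a list of N sampled
    trigger sequences; each element is a list of (pi_prob, reward)
    pairs, where pi_prob is the policy's probability of the action
    actually sampled at time t. Minimizing this loss raises the
    probability of decisions that obtained high reward."""
    total = sum(-r * math.log(pi_prob)
                for episode in episodes
                for pi_prob, r in episode)
    return total / len(episodes)
```

Gradient descent on this quantity is equivalent to ascending the expected reward under the sampled trigger patterns.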
Joint training. The whole network (Fig. 2) consists of an RNN and a policy network. We randomly initialize the parameters of the policy network. The parameters of the RNN are initialized by the RNN encoder trained on both the object representation $O$ and the motion representation $M$. This initialization enables the training loss to converge faster. We define the joint loss for each training example as $L = L_a + \lambda L_p$, where $\lambda$ is the weight balancing the two losses. Following the standard training procedure in deep learning, we apply mini-batch stochastic gradient descent to minimize the total joint loss.
3.4 Learning Representations from Auxiliary Data
Due to the limited daily intention data, we propose to use two auxiliary datasets (object interaction and hand motion) to pre-train two encoders: an object Convolutional Neural Network (CNN) and a hand motion 1D-CNN. In this way, we can learn a suitable representation of object and motion.
Object CNN. It is well known that an ImageNet-pre-trained CNN performs well on classifying a variety of objects. However, Chan et al. show that images captured by an on-wrist camera are significantly different from images in ImageNet. Hence, we collect an auxiliary image dataset with 50 object categories captured by our on-wrist camera, and fine-tune an ImageNet-pre-trained ResNet-based CNN on it. After the model is pre-trained, we use it to extract the object representation from the last layer before the softmax.
Hand motion 1D-CNN. Our accelerometer captures acceleration in three axes with a sampling rate of 75 Hz. We calibrate the sensor so that the acceleration in all three axes is zero when it is placed on a flat, horizontal surface. We design a 1D-CNN to classify every 150 samples (2 seconds) into six motions: lift, pick up, put down, pull, stationary, and walking. The architecture of our model is shown in Fig. 3. Originally, we planned to mimic a prior 3-layer 2D-CNN model with 1 input channel. Since there is no stationarity across the three acceleration values of each sample, we instead use 3 input channels and define the 1D-CNN. To train the model, we collected auxiliary hand motion data with ground-truth motions (Sec. 4). After the model is trained, we use it to extract the motion representation at the FC4 layer (see Fig. 3).
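To make the 3-channel design concrete, here is a minimal multi-channel 1D convolution (valid padding, stride 1) of the kind the 1D-CNN's first layer applies to the raw (x, y, z) stream; the kernel values below are toy numbers, not the trained weights:

```python
def conv1d(signal, kernels, bias):
    """Minimal multi-channel 1D convolution.
    signal: [in_channels][time]; kernels: [out_ch][in_ch][k];
    bias: [out_ch]. Returns [out_ch][time - k + 1]."""
    C, T = len(signal), len(signal[0])
    k = len(kernels[0][0])
    out = []
    for o, kern in enumerate(kernels):
        row = []
        for t in range(T - k + 1):
            acc = bias[o]
            for c in range(C):          # sum over the 3 acceleration axes
                for j in range(k):      # slide the kernel over time
                    acc += kern[c][j] * signal[c][t + j]
            row.append(acc)
        out.append(row)
    return out
```

Treating the three axes as input channels (rather than a 1-channel 2D image) lets each filter mix x, y, and z at every time step.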
3.5 Implementation Details
Intention anticipation model. We design our intention anticipation model to make a prediction every half second. All of our models are trained with a constant learning rate of 0.001 and 256 hidden states.
Policy network. Our policy network is a neural network with two hidden layers. For joint training, we set the learning rate to 0.001 and the joint-loss weight $\lambda$ to 0.1. The rewards $R^{+}$ and $R^{-}$ are 100 and -100, respectively.
Object CNN. Our object CNN aims at processing 6 fps on an NVIDIA TX1. This frame rate is sufficient for daily actions: since most actions last a few seconds, it is unnecessary to process at 15 or 30 fps. We take the average over 6 object representations as the input of our model. Our on-wrist camera has a fisheye lens to ensure a wide field-of-view capturing most objects. For fine-tuning the CNN model on our dataset, we set the maximum number of iterations to 20000, the step size to 10000 (decaying the learning rate by a factor of 0.1 every 10000 iterations), momentum to 0.9, and the learning rate to 0.001. We also augment our dataset by horizontally flipping frames.
Hand motion 1D-CNN. The motion representation is extracted for a 2-second time segment. At every second, we process a 2-second segment that overlaps the previously processed segment by 1 second. For training from scratch, we set the base learning rate to 0.01 with step size 4000, momentum 0.9, and weight decay 0.0005. We adjust the base learning rate to 0.001 when fine-tuning.
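The overlapping segmentation above can be sketched as a sliding window over the 75 Hz sample stream:

```python
def sliding_windows(samples, window=150, hop=75):
    """Cut a 75 Hz accelerometer stream into 2-second windows
    (150 samples) that overlap the previous window by 1 second
    (hop = 75 samples)."""
    return [samples[i:i + window]
            for i in range(0, len(samples) - window + 1, hop)]
```

Each window is then fed to the 1D-CNN, yielding one motion representation per second after the first.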
4 Setting and Datasets
We introduce our setting of on-wrist sensors and describe details of our datasets.
4.1 Setting of On-wrist Sensors
Following similar settings in [24, 2], our on-wrist camera (a fisheye lens mounted on a noIR camera module with a CSI interface) and accelerometer (an MPU-6050) are mounted as shown in Fig. 4. Both the camera and the accelerometer are secured using velcro. We use the fisheye lens to ensure a wide field-of-view. Users follow two simple rules. First, the camera is under the arm, facing the palm. Second, the camera should roughly align with the center of the wrist. This ensures that the camera can easily record the state of the hand.
We collect three datasets (our datasets and code can be downloaded from http://aliensunmin.github.io/project/intent-anticipate) for the following purposes. (1) Daily Intention Dataset: for training our RNN model to anticipate an intention before it occurs. (2) Object Interaction Dataset: for pre-training a better object encoder to recognize common daily object categories. (3) Hand Motion Dataset: for pre-training a better motion encoder to recognize common motions.
| | User A | User B | User C |
| # of action sequences | 1540 | 358 | 481 |
| avg. per sequence | 9.4 | 2.2 | 2.9 |
4.2.1 Daily Intention Dataset
Inspired by Sigurdsson et al., we select 34 daily intentions such as “charge cellphone”, “go exercise”, etc. Note that each intention is associated with at least one action sequence, and each action consists of a motion and an object (e.g., pick up + wallet). We propose a two-step procedure to collect various action sequences fulfilling daily intentions.
Exploring stage. At this stage, we want to observe various ways to fulfill an intention (Fig. 1). Hence, we ask a user (referred to as user A) to perform each intention in as many different ways as possible. At this stage, we observed 164 unique action sequences.
Generalization stage. At this stage, we ask user A and other users (referred to as users B and C) to follow the action sequences and record multiple samples (10, 2, and 3 times per action sequence for users A, B, and C, respectively). This setting simulates an intelligent system that needs to serve other users. We show by experiment that our method performs similarly well on all three users.
In Table 1, we summarize our intention dataset. Note that the number of action sequences recorded by user A is much larger than the others, since we train and validate on user A to select the proper hyper-parameters (e.g., to design the reward function). We then apply the same setting to the training process of all users and evaluate the results, which examines the generalization of our method. The design of the reward function is described in Sec. 3 of the supplementary material.
4.2.2 Object Interaction Dataset.
We select 50 object categories (including a hand-free category) and collect a set of videos corresponding to unique object instances (not counting “free” as an instance). Each video records how an object instance is interacted with by a user’s hand. We sample frames from each video, resulting in an auxiliary dataset for pre-training our object encoder. Example frames of the dataset are shown in Fig. 6.
4.2.3 Hand Motion Dataset
Inspired by prior work, we select six motions. We ask eight users to collect motion sequences with the right hand and one user to collect motion sequences with the left hand. With the right-hand data collected by eight users, we test cross-user generalizability; with the left-hand data, we test cross-hand generalizability.
5 Experiments
We first conduct pilot experiments to pre-train the object and hand motion encoders, which helps us select the appropriate encoders. Next, we conduct intention anticipation experiments with the policy network and evaluate our method in various settings. Finally, we show typical examples to highlight the properties of our method.
[Table 2: CNN architectures compared by training accuracy, testing accuracy, and speed.]
5.1 Preliminary Experiments
Object pre-training. We evaluate multiple Convolutional Neural Network (CNN) architectures on classifying object categories in our object interaction auxiliary dataset. These architectures include VGG-16 and Residual Networks (ResNet) with 50, 101, and 152 layers. We separate the whole dataset into two parts: 80% of the object instances for training and 20% for testing. The testing accuracy is reported in Table 2. Our results show that deeper networks have slightly higher accuracy. Another critical consideration is speed on the embedded device; hence, we report the processed frames per second (fps) on an NVIDIA TX1 in the last column of Table 2. Considering both accuracy and speed, we choose ResNet-50, since we designed our system to process at 6 fps.
[Table 3: hand motion models compared by training and testing accuracy. Table 4: anticipation results for users A, B, and C.]
Hand motion pre-training. We describe two experiments to (1) select the best model for generalizing across users, and (2) select the pre-processing step for generalizing to the left hand.
Generalizing across users. Given our dataset collected by eight different users, we conduct a 4-fold cross-validation experiment and report the average accuracy. We compare a recent deep-learning-based method (a 1ch-3layer model, where “ch” denotes the number of input channels) with our 3ch models trained from scratch in Table 3. The results show that our 3ch-3layer model generalizes best across different users. Finally, to leverage more data, we pre-train our 3-layer model on external data (collected by a cellphone accelerometer while the cellphone is in the user’s pocket) and then fine-tune the model on our auxiliary data.
Generalizing across hands. We propose the following pre-processing to generalize our best model (3ch-3layer, trained on right-hand data) to the left hand: we flip the left-hand samples by negating all values in one channel (referred to as flip), which effectively makes left-hand samples look similar to right-hand samples. In the last two rows of Table 3, we show the accuracy on left-hand data. Our method with flip pre-processing achieves better performance. In the intention anticipation experiment, we use the 3ch-3layer model and apply flip pre-processing to the left hand.
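The flip pre-processing is a one-line transform; a sketch (which axis must be negated depends on how the sensor is mounted, so the default `axis` here is an assumption, not the paper's documented choice):

```python
def flip_left_hand(sample, axis=0):
    """Mirror a left-hand accelerometer sample so it resembles
    right-hand data by negating one axis. sample is a list of
    (x, y, z) readings."""
    return [tuple(-v if i == axis else v for i, v in enumerate(reading))
            for reading in sample]
```

After this transform, left-hand windows can be fed to the model trained on right-hand data without retraining.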
5.2 Motion Triggered Intention Anticipation
For intention anticipation, we evaluate different settings on all three users. In the following, we first introduce our setting variants and the evaluation metric. Then, we compare their performance in different levels of anticipation (e.g., observing only the beginning X percent of the action sequence).
(1) Object-only (OO): an RNN considering only the object representation.
(2) Motion-only (MO): an RNN considering only the motion representation.
(3) Concatenation (Con.): an RNN considering both object and motion representations.
(4) Motion-Triggered (MTr.): an RNN with the policy network, where the input of the RNN is determined by the policy network. In this setting, we also report the ratio of triggered moments (referred to as Ratio); the lower the ratio, the lower the computation requirement.
Metric. We report the intention prediction accuracy when observing only the beginning portion of the action sequence in a video.
Comparisons of the different variants on all users (A, B, and C) are shown in Table 4. We summarize our findings below. Object-only (OO) outperforms Motion-only (MO), which shows that the object representation is much more influential than the motion representation for intention anticipation. We also find that concatenating motion and object (Con.) does not consistently outperform Object-only (OO). Despite the inferior performance of MO, its accuracy under different percentages of observation is fairly steady, implying that there is still useful information in the motion representation. Indeed, MTr. can take advantage of the motion observation to reduce the cost of processing visual observations to nearly 29% while maintaining a high anticipation accuracy.
In Fig. 8, we control the ratio of triggered moments, and thereby the anticipation accuracy, by adjusting the threshold of the motion trigger. The results show that increasing the ratio of triggered moments leads to higher intention anticipation accuracy. Most interestingly, the accuracy decreases only slightly when the ratio is larger than 20%. Note that the default threshold is 0.5, meaning the policy decides to trigger when the probability of triggering is larger than that of not triggering. Additional results are described in Sec. 4 of the supplementary material.
5.3 Typical Examples
We show typical examples in Fig. 7. In the first example, our Policy Network (PN) efficiently peeks at various objects (e.g., keys, cellphone, backpack, etc.). In other examples, PN no longer triggers after some early peeks. Specifically, in the second example, once the cellphone is observed and the wire is plugged in, PN is confident enough to anticipate cellphone charging without any further triggered operation.
6 Conclusion
We propose an on-wrist motion-triggered sensing system for anticipating daily intentions. The core of the system is a novel RNN and policy network, jointly trained using a policy gradient and a cross-entropy loss, that anticipates intentions as early as possible. On our newly collected daily intention dataset with three users, our method achieves impressive anticipation accuracy while processing only 29% of the visual observations. In the future, we would like to develop an online learning method for intention anticipation in the wild.
We thank MOST 104-2221-E-007-089-MY2 and MediaTek for their support.
-  S. Z. Bokhari and K. M. Kitani. Long-term activity forecasting using first-person vision. In ACCV, 2016.
-  C.-S. Chan, S.-Z. Chen, P.-X. Xie, C.-C. Chang, and M. Sun. Recognition from hand cameras: A revisit with deep learning. In ECCV, 2016.
-  F.-H. Chan, Y.-T. Chen, Y. Xiang, and M. Sun. Anticipating accidents in dashcam videos. ACCV, 2016.
-  Y. Chen and Y. Xue. A deep learning approach to human activity recognition based on single accelerometer. In SMC, 2015.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
-  A. Fathi, A. Farhadi, and J. M. Rehg. Understanding egocentric activities. In ICCV, 2011.
-  J. Ghosh, Y. J. Lee, and K. Grauman. Discovering important people and objects for egocentric video summarization. In CVPR, 2012.
-  A. Hashimoto, J. Inoue, T. Funatomi, and M. Minoh. Intention-sensing recipe guidance via user accessing objects. International Journal of Human-Computer Interaction, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  M. Hoai and F. De la Torre. Max-margin early event detectors. In CVPR, 2012.
-  A. Jain, H. S. Koppula, B. Raghavan, S. Soh, and A. Saxena. Car that knows before you do: Anticipating maneuvers via learning temporal driving models. In ICCV, 2015.
-  A. Jain, A. Singh, H. S. Koppula, S. Soh, and A. Saxena. Recurrent neural networks for driver activity anticipation via sensory-fusion architecture. In ICRA, 2016.
-  J. Joo, W. Li, F. F. Steen, and S.-C. Zhu. Visual persuasion: Inferring communicative intents of images. In CVPR, 2014.
-  K. M. Kitani, B. D. Ziebart, J. A. D. Bagnell, and M. Hebert. Activity forecasting. In ECCV, 2012.
-  H. S. Koppula, A. Jain, and A. Saxena. Anticipatory planning for human-robot teams. In ISER, 2014.
-  H. S. Koppula and A. Saxena. Anticipating human activities using object affordances for reactive robotic response. PAMI, 38(1):14–29, 2016.
-  T. Lan, T.-C. Chen, and S. Savarese. A hierarchical representation for future action prediction. In ECCV, 2014.
-  Y. Li, Z. Ye, and J. M. Rehg. Delving into egocentric actions. In CVPR, 2015.
-  J. W. Lockhart, G. M. Weiss, J. C. Xue, S. T. Gallagher, A. B. Grosner, and T. T. Pulickal. Design considerations for the wisdm smart phone-based sensor mining architecture. In Proceedings of the Fifth International Workshop on Knowledge Discovery from Sensor Data, 2011.
-  Z. Lu and K. Grauman. Story-driven summarization for egocentric video. In CVPR, 2013.
-  J. Mainprice and D. Berenson. Human-robot collaborative manipulation planning using early prediction of human motion. In IROS, 2013.
-  M. Ma, H. Fan, and K. M. Kitani. Going deeper into first-person activity recognition. In CVPR, 2016.
-  K. Ohnishi, A. Kanehira, A. Kanezaki, and T. Harada. Recognizing activities of daily living with a wrist-mounted camera. In CVPR, 2016.
-  P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang. Hierarchical recurrent neural encoder for video representation with application to captioning. In CVPR, 2016.
-  Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui. Jointly modeling embedding and translation to bridge video and language. In CVPR, 2016.
-  N. Rhinehart and K. M. Kitani. First-person activity forecasting with online inverse reinforcement learning. In ICCV, 2017.
-  M. S. Ryoo. Human activity prediction: Early recognition of ongoing activities from streaming videos. In ICCV, 2011.
-  G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In ECCV, 2016.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
-  S. Singh, C. Arora, and C. V. Jawahar. First person action recognition using deep learned descriptors. In CVPR, 2016.
-  S. Venugopalan, M. Rohrbach, J. Donahue, R. J. Mooney, T. Darrell, and K. Saenko. Sequence to sequence - video to text. In ICCV, 2015.
-  C. Vondrick, D. Oktay, H. Pirsiavash, and A. Torralba. Predicting motivations of actions by leveraging text. In CVPR, 2016.
-  J. Walker, A. Gupta, and M. Hebert. Patch to the future: Unsupervised visual prediction. In CVPR, 2014.
-  Z. Wang, M. Deisenroth, H. Ben Amor, D. Vogt, B. Schölkopf, and J. Peters. Probabilistic modeling of human movements for intention inference. In RSS, 2012.
-  R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.
-  L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. In ICCV, 2015.
-  H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu. Video paragraph captioning using hierarchical recurrent neural networks. In CVPR, 2016.
-  J. Yuen and A. Torralba. A data-driven approach for event prediction. In ECCV, 2010.
-  K.-H. Zeng, T.-H. Chen, J. C. Niebles, and M. Sun. Title generation for user generated videos. In ECCV, 2016.