In recent years, a significant amount of research in the computer vision community has focused on human activity recognition. The objective of this research is to be able to automatically recognize and understand what humans depicted in a video are doing. In this work, human activity recognition is formulated as a classification problem (i.e., given a short video clip, which activity in a given set is depicted). This problem is important for several applications including human-robot teaming (helping robots to understand and interact with their environment and thus better react to it), surveillance (sift through a large number of video streams to detect abnormal behavior), and video-tagging (automatically tag videos to make them easier to find).
Most state-of-the-art approaches to this problem train deep convolutional neural networks (CNNs) to classify videos based on their raw pixels and/or extracted features. One prominent method, the two-stream approach [1], trains one network to classify single RGB frames and a second network to classify short snippets of optical flow features. Temporal Segment Networks [2] attempt to exploit longer-term temporal information by grouping frames from different portions of the video during training. Karpathy et al. [3] apply deep learning to a very large dataset. In addition to the standard convolutional neural networks, three-dimensional convolutional neural networks [4, 5, 6] have been used for activity recognition to great effect.
One of the main limitations of deep learning approaches to this problem is their dependence on the size and scope of the training dataset. A large labeled training set is needed to take full advantage of the power of deep neural networks. However, it may not be feasible to obtain a large enough dataset, as collecting and annotating the training videos requires excessive human effort. Although state-of-the-art methods achieve good results on benchmark datasets, the limited data prevents them from using recently proposed deeper network architectures such as ResNet without overfitting. Furthermore, regardless of the size of the dataset, it is virtually impossible to guarantee that all variations of the target activities are captured by a given video dataset.
In this work, we attempt to address these problems by introducing a novel CNN which incorporates text-guided object information within a multitask learning scheme. This approach helps the overall network training in two ways. First, it allows us to exploit a large object recognition dataset to boost the amount of training data we have. Second, it allows us to incorporate general knowledge from text about the target activities that may not be fully apparent from the training videos. Together, these result in an improvement over the baseline activity recognition performance.
More specifically, we train our network to perform object recognition simultaneously with activity recognition. Learning a single model which considers multiple related tasks is known as multitask learning [8, 9, 10, 11]. In our approach, we consider object recognition to be a task highly related to activity recognition. Using this approach allows us to leverage the ImageNet dataset, which provides us with significantly more training data. This enables us to use much deeper networks than current methods in the literature without overfitting and thus achieve higher recognition rates. We further improve upon the multitask learning approach by analyzing the relationship between the activities and the objects within a text-guided semantic space.
2 Our Method
2.1 Incorporating Object Recognition with Activity Recognition in Multitask Learning
Previous approaches have demonstrated that being able to detect or recognize objects within an image can improve recognition of relevant events and activities in that image [13, 14]. We take a similar approach in exploiting the object information but with two major novel aspects. First, we introduce a practical way of training and enhancing the activity recognition network by carrying out the multitask learning with the object recognition network. Moreover, unlike the previous approaches, we do not attempt to localize or identify the objects within the target domain (in our case, activity recognition) but train the network to perform the task of object recognition using a totally different dataset (ImageNet). This bolsters the amount of training data for the overall network, and at the same time, removes the need for manually annotating/detecting the relevant objects in the target videos. As shown in Figure 1, we share the weights in all the layers of the network between the two tasks except the task-specific softmax classifiers.
The training samples are each annotated for only a single task (i.e., video frames for activity recognition and ImageNet images for object recognition). Thus, we design the network so that each data sample is directly associated with the loss function for the corresponding task. However, as we ground our method in the relevance of the two tasks, all the layers except the softmax layer are shared between the two tasks.
We can view our multitask learning approach as an extension of the standard finetuning strategy (Figure 2). In training our network, we learn the parameter weights for both activity recognition (ActivityNet) and object recognition (ObjectNet) by finetuning from a network pretrained for object recognition. Continuing to incorporate gradients from the object recognition loss acts as a regularizer for the overall network parameters, preventing them from overfitting to the activity recognition task. As our pretrained ObjectNet, we use the network trained to classify the 1000 object classes of the ImageNet Challenge [12].
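The loss routing described above can be illustrated with a minimal sketch. It assumes a toy two-dimensional "trunk" and hand-written head weights (all hypothetical stand-ins, not the ResNet-based network used in this work); it only shows how each sample passes through shared layers while contributing to the loss of its own task.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability of the true label under a softmax classifier."""
    return -math.log(softmax(logits)[label])

def linear(weights, features):
    """Linear layer without bias: one row of weights per output class."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def multitask_loss(batch, shared_trunk, heads):
    """Average loss over a mixed batch of (input, task, label) samples.

    Every sample passes through the same shared trunk, but only the
    softmax head of its own task contributes to its loss.
    """
    total = 0.0
    for x, task, label in batch:
        features = shared_trunk(x)              # layers shared by both tasks
        logits = linear(heads[task], features)  # task-specific classifier
        total += cross_entropy(logits, label)
    return total / len(batch)

# Hypothetical toy stand-ins for the shared layers and the two classifiers.
trunk = lambda x: [max(0.0, x[0]), max(0.0, x[1])]
heads = {
    "activity": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],  # 3 activity classes
    "object": [[1.0, -1.0], [-1.0, 1.0]],              # 2 object classes
}
batch = [
    ([2.0, 0.1], "activity", 0),  # video frame with an activity label
    ([0.2, 1.5], "object", 1),    # ImageNet image with an object label
]
loss = multitask_loss(batch, trunk, heads)
```

Because the trunk is shared, gradients of both task losses would update the same parameters in a real training loop, which is the regularizing effect discussed in this section.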
2.2 Leveraging the text-guided semantic space
The object-incorporated activity recognition network introduced in Section 2.1 uses all the objects from the ImageNet dataset to learn the ObjectNet, and thus solely relies on the capability of the multitask network learning process to harvest the necessary information about the objects with respect to the activities. We seek to further improve upon our object-incorporated activity recognition network by exploring the following questions: Which objects are more important and indicative for certain activities? Would selecting this subset of objects help improve activity recognition?
Our strategy is to refine the original object dataset before proceeding to network training by selecting the set of objects most relevant to the activities in the target domain. To select the most relevant objects, we carry out what we call ‘Text-guided Relevance Analysis (TRA)’, where we compute the similarity between the textual labels of the activities and those of the ImageNet objects within a semantic vector space. We exploit the textual labels that are originally provided with both datasets (UCF101 and ImageNet).
In TRA, we use the Word2Vec [16] embedding to project the textual labels to the semantic vector space. Word2Vec embeds words and phrases into a vector space based on their usage in a large text corpus. Words that are used in similar contexts will be embedded closer together in the vector space. An illustration of the text-guided semantic space is shown in Figure 3, where the activity label “tennis swing” is closely embedded with the object labels “ball” and “racket”.
Assuming $\phi(\cdot)$ as the embedding learned by Word2Vec, we approximate the relevance between a target activity $a$ and an ImageNet class $o$ with the cosine similarity of their vector space representations as follows:
$$r(a, o) = \frac{\phi(a) \cdot \phi(o)}{\lVert \phi(a) \rVert \, \lVert \phi(o) \rVert}.$$
We then compute the overall relevance of an ImageNet class $o$ to the set of target activities $A$ as the sum of the relevances of $o$ to each activity $a \in A$,
$$R(o) = \sum_{a \in A} r(a, o).$$
Once we acquire $R(o)$ for all ImageNet classes, we select the $K$ most relevant classes (those whose relevance score is numerically highest) to be used for training the “text-guided, object-incorporated activity recognition network”. This overall process of TRA (see Figure 2) can be considered a dataset refinement procedure for the original object recognition dataset $D$:
$$D' = \{ (x, o) \in D : \mathrm{rank}(R(o)) \le K \},$$
where $(x, o)$ denotes an image $x$ with class label $o$, $\mathrm{rank}(R(o))$ indicates the rank of $R(o)$ in descending order among the scores of all ImageNet classes, each computed over all activities $a \in A$, and $K$ is the number of selected objects within $D'$. Based on an empirical analysis, we selected, for our image input dataset (the refined dataset $D'$, see Figure 2), the images that have text labels for 1000 objects ($K$ = 1000) for training the final version of the network. In Table 1, we introduce some samples of highly ranked object (ImageNet) classes with respect to the activity (UCF101) classes acquired by the TRA.
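The TRA procedure above (cosine similarity, summed relevance, top-ranked selection) can be sketched as follows. This is a minimal illustration under stated assumptions: the two-dimensional vectors are invented toy values standing in for Word2Vec embeddings, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def overall_relevance(object_vec, activity_vecs):
    """Sum of the cosine similarities of one object to every activity."""
    return sum(cosine(object_vec, a) for a in activity_vecs)

def select_objects(object_embeddings, activity_embeddings, k):
    """Return the k object labels with the highest overall relevance."""
    scores = {
        name: overall_relevance(vec, activity_embeddings.values())
        for name, vec in object_embeddings.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)  # descending order
    return ranked[:k], scores

# Toy 2-D vectors (assumed values, not real Word2Vec embeddings).
activities = {"tennis swing": [1.0, 0.1], "playing guitar": [0.1, 1.0]}
objects = {
    "racket": [0.9, 0.2],
    "guitar": [0.2, 0.9],
    "volcano": [-1.0, -1.0],
}
selected, scores = select_objects(objects, activities, k=2)
```

In this toy setting the object pointing away from both activities receives a negative score and is dropped, while the two activity-related objects are kept.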
3.1 Experimental details
Preprocessing the data. First, we subtract a mean pixel from each pixel in the image. Then we select a random window from the target frame. The window’s width and height are randomly and independently selected (from a uniform distribution) to be between 168 and 256 pixels. Once the width and height are selected, the location of the window within the image is selected at random (again, from a uniform distribution). Finally, the window is resized to 224×224 pixels and fed into the network. The random window selection process helps to generate more variation in the training data to reduce the risk of overfitting. For the ImageNet images, we still subtract the pixel mean, but select a sub-image by simply choosing a random 224×224 window from the image. We can use a simpler window selection with ImageNet because it contains many more images, which are uncorrelated, unlike the video frames, which are highly correlated.
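The random window selection for video frames can be sketched as below. The sketch assumes integer pixel sizes and a frame that is at least 256 pixels on each side (the 340×256 frame size in the usage line is a hypothetical example); resizing the window to 224×224 is left to an image library and is not shown.

```python
import random

def sample_training_window(frame_width, frame_height, rng,
                           min_side=168, max_side=256):
    """Sample a random crop window as (left, top, width, height).

    Width and height are drawn independently and uniformly from
    [min_side, max_side]; the location is then drawn uniformly from
    all positions that keep the window inside the frame.
    """
    if frame_width < max_side or frame_height < max_side:
        raise ValueError("frame must be at least max_side pixels per side")
    width = rng.randint(min_side, max_side)    # inclusive bounds
    height = rng.randint(min_side, max_side)
    left = rng.randint(0, frame_width - width)
    top = rng.randint(0, frame_height - height)
    return left, top, width, height

# Usage with a seeded generator for reproducibility.
window = sample_training_window(340, 256, random.Random(0))
```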
Network architecture setting. We use the ResNet [7] architectures (ResNet 50, 101, and 152), which have recently demonstrated state-of-the-art performance in various applications. This is in contrast to previous approaches, which use shallower networks. Our multitask approach acts as a regularization, enabling us to use the deeper, better-performing ResNet networks. All networks are initialized by pretraining on the 1000 ImageNet challenge classes.
We incorporate the Temporal Segment Network (TSN) [2] approach, which is known to capture long-term temporal information, in training our networks. We have empirically determined the optimal number of segments to be three, and thus the size of the activity recognition portion of the batch was set to be a multiple of three. For example, when training our ResNet 50 network, the total batch size is 64. Ideally, we would split it evenly between the two network streams (32 each). However, as 32 is not a multiple of three, we use 33 activity recognition samples and 31 ImageNet samples.
We train our networks with stochastic gradient descent on a single-GPU (NVIDIA TITAN X) system. Due to the depths of the networks used and the memory limitations of the GPU (12 GB), we were forced to use small batch sizes of 64, 48, and 32 frames/images for ResNet 50, 101, and 152, respectively. When training in the multitask setting, we split the batch between activity recognition frames and ImageNet images. We found that splitting the batch approximately evenly between the two (i.e., giving equal weight to the two objectives) provided the best performance.
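The batch split described above can be written as a small helper. The rounding rule used here (pick the multiple of the segment count closest to half the batch, preferring the lower one on a tie) is an assumption; the text only confirms the 33/31 split for a total batch size of 64.

```python
def split_batch(total, num_segments=3):
    """Split a batch into (activity, imagenet) sample counts.

    The activity portion is the multiple of num_segments closest to half
    of the batch; the remaining samples go to the ImageNet stream.
    """
    half = total / 2
    lower = int(half // num_segments) * num_segments
    upper = lower + num_segments
    # Assumed tie-breaking: prefer the lower multiple when equally close.
    activity = lower if (half - lower) <= (upper - half) else upper
    return activity, total - activity

resnet50_split = split_batch(64)  # (33, 31), matching the example in the text
```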
When training ResNet 50, we initialize the learning rate to 0.001. We divide it by 10 after 10k and 13k iterations and train for 15k iterations in total. Due to the smaller batch sizes, we initialize the learning rates for ResNet 101 and 152 to 0.0005. For ResNet 101, we divide it by 10 after 13k and 18k iterations and train for 20k iterations in total. For ResNet 152, we divide it by 10 after 28k and 36k iterations and train for 40k iterations in total. Weight decay is set to 0.0001. During training of all three architectures, we place dropout layers just before the final softmax classifiers. The dropout rate is set to 0.25.
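The step schedules above can be summarized in a short sketch. It assumes that the learning rate drops at the first iteration that reaches a milestone; the dictionary layout and names are hypothetical, while the numeric values are taken from the text.

```python
def learning_rate(iteration, base_lr, milestones, factor=10.0):
    """Step schedule: divide base_lr by `factor` once per milestone reached."""
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr / (factor ** drops)

# Values from the text; the structure of this table is illustrative only.
SCHEDULES = {
    "resnet50": {"base_lr": 0.001, "milestones": (10_000, 13_000), "total": 15_000},
    "resnet101": {"base_lr": 0.0005, "milestones": (13_000, 18_000), "total": 20_000},
    "resnet152": {"base_lr": 0.0005, "milestones": (28_000, 36_000), "total": 40_000},
}

cfg = SCHEDULES["resnet50"]
lr_start = learning_rate(0, cfg["base_lr"], cfg["milestones"])
lr_end = learning_rate(cfg["total"] - 1, cfg["base_lr"], cfg["milestones"])
```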
At test time, we use the standard approach of generating predictions for 25 evenly spaced frames. For each frame, we generate predictions from 10 different 224×224-pixel windows: one from each corner of the frame, one from the center of the frame, and then a horizontally flipped version of each of those. For each video, 250 probability predictions are made for each of the classes. We average them and predict the activity with the highest value. In this work, we use a pre-trained Word2Vec model which was trained on an internal Google dataset of news articles containing a billion words.
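The test-time procedure (25 frames, 10 windows per frame, averaged probabilities) can be sketched as follows. The frame size, video length, and constant per-crop probabilities are hypothetical placeholders for real network outputs, and the exact frame-index rule is an assumption.

```python
def evenly_spaced_indices(num_frames, num_samples=25):
    """Indices of num_samples frames spread evenly across a video."""
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

def ten_crop_boxes(width, height, size=224):
    """Four corner windows and a center window, each plain and flipped.

    Returns (left, top, flip) tuples; the crops themselves would be cut
    (and mirrored when flip is True) by an image library.
    """
    right, bottom = width - size, height - size
    corners_and_center = [
        (0, 0), (right, 0), (0, bottom), (right, bottom),
        (right // 2, bottom // 2),
    ]
    return [(left, top, flip)
            for flip in (False, True)
            for left, top in corners_and_center]

def predict_video(frame_crop_probs):
    """Average all per-crop class probabilities; return (best class, means)."""
    flat = [p for frame in frame_crop_probs for p in frame]
    num_classes = len(flat[0])
    mean = [sum(p[c] for p in flat) / len(flat) for c in range(num_classes)]
    best = max(range(num_classes), key=mean.__getitem__)
    return best, mean

# Placeholder outputs: every crop predicts the same 3-class distribution.
frames = evenly_spaced_indices(180)
boxes = ten_crop_boxes(340, 256)
probs = [[[0.2, 0.7, 0.1] for _ in boxes] for _ in frames]
best, mean = predict_video(probs)
```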
3.2 Performance Evaluation
We evaluate the performance of our approach on the UCF101 benchmark dataset [15]. We use ResNet to construct the baseline architecture for both the ActivityNet and the ObjectNet (see Figure 1). The experiments were carried out on three different ResNet networks (ResNet 50, 101, and 152) under three different settings (baseline, object-incorporated, text-guided + object-incorporated). The baseline approach is the standard method without multitask learning. For the object-incorporated multitask approach, we randomly selected 1000 ImageNet classes to learn the ObjectNet. The text-guided + object-incorporated approach uses Word2Vec to select the 1000 most relevant ImageNet classes as described in Section 2.2.
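[Table 2: Activity recognition performance on UCF101 for ResNet 50, 101, and 152. Columns: Baseline, object incorp., text-guided + object incorp.]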
From the results shown in Table 2, it is clear that using the ResNet networks with the baseline approach provides worse performance than the state-of-the-art method (TSN [2]). This is because the architecture used in [2] relies on shallower networks, which are not as prone to overfitting. When we incorporate the object information in a multitask learning scheme (object-incorporated), the performance increases close to the current state of the art. Finally, when we exploit the text-guided supervision on top of the object incorporation, we are able to outperform the state of the art.
We have introduced a novel way of constructing an object-incorporated and text-guided CNN to better handle the task of video-based human activity recognition. We do this by leveraging the text-guided semantic space to select the objects most commonly associated with the target activities. We then train the network to recognize the target activities as well as the selected set of objects by exploiting a shared network and a multitask learning approach. We have experimentally verified that both strategies, incorporating objects for activity recognition and text-guided object selection, are effective in improving human activity recognition performance. In the future, we plan to incorporate background scenes into our framework, as they also carry significant semantic information about the activities.
- [1] Karen Simonyan and Andrew Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in neural information processing systems (NIPS), 2014, pp. 568–576.
- [2] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in European Conference on Computer Vision (ECCV). Springer, 2016, pp. 20–36.
- [3] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2014, pp. 1725–1732.
- [4] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4489–4497.
- [5] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu, “3d convolutional neural networks for human action recognition,” IEEE transactions on pattern analysis and machine intelligence (PAMI), vol. 35, no. 1, pp. 221–231, 2013.
- [6] Lin Sun, Kui Jia, Dit-Yan Yeung, and Bertram E Shi, “Human action recognition using factorized spatio-temporal convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4597–4605.
- [7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2016, pp. 770–778.
- [8] Rich Caruana, “Multitask learning,” Mach. Learn., vol. 28, no. 1, pp. 41–75, July 1997.
- [9] Rajeev Ranjan, Swami Sankaranarayanan, Carlos D Castillo, and Rama Chellappa, “An all-in-one convolutional neural network for face analysis,” in Automatic Face & Gesture Recognition (FG), 2017 12th IEEE International Conference on. IEEE, 2017, pp. 17–24.
- [10] Tianzhu Zhang, Bernard Ghanem, Si Liu, and Narendra Ahuja, “Robust visual tracking via multi-task sparse learning,” in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2012, pp. 2042–2049.
- [11] Xiao-Tong Yuan, Xiaobai Liu, and Shuicheng Yan, “Visual classification with multitask joint sparse representation,” IEEE Transactions on Image Processing, vol. 21, no. 10, pp. 4349–4360, 2012.
- [12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2009, pp. 248–255.
- [13] Victor Escorcia and Juan Carlos Niebles, “Spatio-temporal human-object interactions for action recognition in videos,” in International Conference on Computer Vision Workshop, 2013.
- [14] Sungmin Eum, Hyungtae Lee, Heesung Kwon, and David Doermann, “IOD-CNN: Integrating object detection networks for event recognition,” in International Conference on Image Processing (ICIP), 2017.
- [15] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
- [16] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems (NIPS), 2013, pp. 3111–3119.