Deep Image-to-Video Adaptation and Fusion Networks for Action Recognition

11/25/2019 ∙ by Yang Liu, et al. ∙ Xidian University ∙ Sun Yat-sen University

Existing deep learning methods for action recognition in videos require a large number of labeled videos for training, which is labor-intensive and time-consuming. For the same action, the knowledge learned from different media types, e.g., videos and images, may be related and complementary. However, due to the domain shifts and heterogeneous feature representations between videos and images, the performance of classifiers trained on images may be dramatically degraded when directly deployed to videos. In this paper, we propose a novel method, named Deep Image-to-Video Adaptation and Fusion Networks (DIVAFN), to enhance action recognition in videos by transferring knowledge from images using video keyframes as a bridge. The DIVAFN is a unified deep learning model, which integrates domain-invariant representations learning and cross-modal feature fusion into a unified optimization framework. Specifically, we design an efficient cross-modal similarities metric to reduce the modality shift among images, keyframes and videos. Then, we adopt an autoencoder architecture, whose hidden layer is constrained to be the semantic representations of the action class names. In this way, when the autoencoder is adopted to project the learned features from different domains to the same space, more compact, informative and discriminative representations can be obtained. Finally, the concatenation of the learned semantic feature representations from these three autoencoders is used to train the classifier for action recognition in videos. Comprehensive experiments on four real-world datasets show that our method outperforms some state-of-the-art domain adaptation and action recognition methods.


I Introduction

Recent works [1, 2, 3, 4] show that deep convolutional neural networks (CNNs) are promising for action recognition in videos. Since a prerequisite for a well-trained deep model is the availability of large amounts of labeled training videos, the collection, preprocessing and annotation of such datasets (e.g. HMDB51 [5], UCF101 [6], ActivityNet [7], Something-Something [8], AVA [9], Moments in Time [10]) is often labor-intensive and time-consuming. Moreover, storing and training on such large amounts of video data can consume substantial computational resources [11]. Under some extreme conditions, it is often difficult or even infeasible to capture an adequate number of videos. Unlike videos, images are much easier and cheaper to collect and annotate, and many labeled image datasets already exist, such as BU101 [11], Stanford40 [12], DB [13], PPMI [14], Willow-Actions [15], Still DB [16], HII [17] and VOC 2012 [18]. Moreover, the computational cost of learning deep models on images is much lower than that on videos [19]. Therefore, video data are scarcer than image data due to their high acquisition and annotation costs [19, 20, 21].

Fig. 1: Illustration of still images and motion videos of categories horse riding and drinking.

In addition, images tend to focus on representative moments and offer more diverse examples of an action in terms of camera viewpoint, background, human appearance, posture, etc. In contrast, videos usually capture the whole temporal progression of an action and contain many redundant and uninformative frames (e.g. the action classes horse riding and biking in still images and motion videos, shown in Figure 1). For example, one may need to look through all, or most, of the frames to determine the action class in a video, while a single glance is enough to annotate the action in an image. This makes image data a good auxiliary source for enhancing video action recognition. Furthermore, semantically related images and videos may share similar appearance, posture and objects, as shown in Figure 1. If the inherent semantic relationship between images and videos can be used to fuse features from the two media types, the performance of action recognition in videos could be improved with fewer labeled videos, achieving performance comparable to traditional methods that use many more labeled training video samples.

However, directly classifying video data with a classifier trained on images results in degraded performance [22]. As shown in Figure 1, still images (left) differ from video frames (right) in viewpoint, background, appearance, posture, etc. In addition, video features are often represented in spatiotemporal form, which differs significantly from image representations in terms of feature dimensionality and physical meaning. Nevertheless, the action information in images and videos is complementary, as validated by previous works [11, 23, 21], and prior cross-domain action recognition works [24, 25, 26, 27] have shown that utilizing action knowledge from other domains can substantially improve the performance of action recognition in the target domain. Based on these observations, in this paper we focus on enhancing action recognition in videos with limited training samples by utilizing image data, while solving the aforementioned domain shift problem. In general, we propose a novel unified deep learning framework, named Deep Image-to-Video Adaptation and Fusion Networks (DIVAFN), which learns domain-invariant representations and fuses cross-modal features simultaneously. An overview of DIVAFN is presented in Figure 2. Since a video consists of a sequence of frames, which are essentially a collection of images, a video is correlated with its related keyframes [28, 29]. Therefore, we can explore the relationship between images and videos by using video keyframes as the bridge.

The DIVAFN consists of three deep neural networks, one for the image modality, one for the video keyframe modality, and the other for the video modality. To reduce the domain shift among images, keyframes and videos, we design a novel cross-modal similarities metric. Because data of different modalities that depict the same action share similar semantics, we utilize the semantic relationship among images, keyframes and videos as a guide to fuse the features learned from these domains. Inspired by the semantic autoencoder [30] proposed for zero-shot learning, we adopt this autoencoder architecture with the constraint that the latent representations from the hidden layer of the autoencoder should be equal to the semantic representations of the action class names (e.g. attribute representations [31] or word2vec representations [32]). With this constraint, the autoencoder can project the learned domain-invariant features to the semantic space, and the projected semantic representations can also be projected back to the domain-invariant feature space. Since the images, keyframes and videos of the same action class have the same semantic meaning, we can obtain more compact, informative and discriminative feature representations from the hidden layer of the autoencoder, which contain both the learned domain-invariant feature knowledge and the semantic relationship knowledge. Different from previous work [30], which uses semantic autoencoders to solve zero-shot learning problems, we integrate the semantic autoencoder and domain-invariant learning into a unified deep learning framework for the image-to-video adaptation problem. In more detail, we simultaneously project the learned domain-invariant features from keyframes, videos and their concatenations to the same semantic space by learning three semantic autoencoders.
Then, the concatenation of the learned semantic feature representations from these three autoencoders is used to train the classifier for action recognition in videos. Extensive experiments on four real-world datasets show that our approach significantly outperforms state-of-the-art approaches.

The main contributions of this paper are as follows:

  • To perform domain-invariant representations learning, we propose a novel unified learning framework with three deep neural networks, one for image modality, one for keyframe modality, and the other for video modality, and design a novel cross-modal similarities metric to reduce the modality shift among images, keyframes and videos.

  • Since the same action class in both image and video data shares the same semantic information, we utilize the semantic information as a guide to fuse the image, keyframe and video features after learning domain-invariant features. An autoencoder architecture is adopted, with the constraint that the representations from the hidden layer should be equal to the semantic representations of the action classes. In this way, we can obtain more compact, informative and discriminative representations, which contain both the learned domain-invariant feature knowledge and the semantic relationship knowledge.

  • To effectively fuse keyframe and video feature representations, we simultaneously project the learned domain-invariant keyframe features, video features and their concatenations to the same semantic space by learning three semantic autoencoders. The concatenation of the learned semantic representations from these three autoencoders is then used to train the classifier for action recognition in videos. Extensive experimental results validate the effectiveness of the newly designed representations.

  • A major advantage of the proposed method is that it explicitly considers the situation where the number of training videos is limited. We transfer knowledge from images to videos, and integrate domain-invariant representations learning and feature fusion into a unified deep learning framework, named Deep Image-to-Video Adaptation and Fusion Networks (DIVAFN).

This paper is organized as follows: Section II briefly reviews the related works. Section III introduces the proposed DIVAFN. Experimental results and related discussions are presented in Section IV. Finally, Section V concludes the paper.

II Related Work

II-A Action Recognition on Videos

Action recognition is a very active research field and has received great attention in recent years [33]. Conventional action recognition methods mainly consist of two steps: feature extraction and classifier training. Many hand-crafted features have been designed to capture spatiotemporal information in videos; representative works include space-time interest points [34], dense trajectories [35] and Improved Dense Trajectories (IDT) [36]. Advanced feature encoding methods such as Fisher vector (FV) encoding [37], Vector of Locally Aggregated Descriptors (VLAD) encoding [38] and Locality-constrained Linear Coding (LLC) encoding [39] can further improve the performance of these features. Some recent works use deep neural networks (particularly CNNs) to jointly learn the feature extractors and classifiers for action recognition in videos. Simonyan and Zisserman [2] proposed two-stream CNNs: one stream captures spatial information from video frames and the other captures motion information from stacked optical flows. Tran et al. [3] and Ji et al. [40] proposed 3D convolutional networks to learn spatiotemporal features. Wang et al. [41] extracted trajectory-pooled deep-convolutional descriptors (TDD) using deep architectures to learn discriminative convolutional feature maps, which are then aggregated into effective descriptors by trajectory-constrained pooling. However, these deep learning methods all require a large amount of annotated videos to avoid overfitting.

To address the problem of limited training videos, our approach enhances action recognition in videos by transferring knowledge from images, and integrates domain-invariant representations learning and cross-modal feature fusion into a unified deep learning framework.

II-B Image-to-Video Action Recognition

To harness the knowledge from images to enhance action recognition in videos, several recent works have focused on knowledge adaptation from images to videos and achieved good performance in image-to-video action recognition [42, 20, 17, 11, 19, 43]. Ma et al. [11] used both images and videos to jointly train shared CNNs and used the shared CNNs to map images and videos into the same feature space; they verified through extensive experimental evaluation that images are complementary to videos. Yu et al. [19] proposed a Hierarchical Generative Adversarial Networks (HiGANs) based image-to-video adaptation method by leveraging the relationship among images, videos and their corresponding video frames. Gan et al. [42] proposed a mutual voting strategy to filter noisy images and video frames and then jointly explored images and videos to address labeling-free video recognition. Zhang et al. [20] proposed a classifier-based image-to-video adaptation method by exploring and utilizing the common knowledge of both labeled videos and images. Li et al. [17] used class-discriminative spatial attention maps to transfer knowledge from images to videos. Yu et al. [43] proposed a symmetric generative adversarial learning approach (Sym-GANs) to learn domain-invariant augmented features with good transferability and distinguishability for heterogeneous image-to-video adaptation. However, these methods usually require sufficient labeled training data for both images and videos to learn domain-invariant features, and the semantic relationship between images and videos is ignored. In addition, the evaluation datasets used in these works are relatively simple and small compared to the datasets used in our work. For instance, BU101→UCF101 are the image and video datasets used in our study. The BU101 is the first action image dataset that has a one-to-one correspondence in action classes with UCF101, a large-scale action recognition video benchmark. To meet the challenges posed by the image-video datasets in our study, more specific and effective domain adaptation and cross-modal feature fusion methods must be designed to reduce the significantly larger modality shift and to fuse the complementary information between images and videos.

In this paper, by taking the semantic information shared between images and videos into consideration, we develop a new method to recognize actions in videos when the number of training videos is limited. To utilize the semantic relationship between images and videos, we adopt an autoencoder architecture with the constraint that the representations from the hidden layer should be equal to the semantic representations of the action class names. In this way, the semantic information is exploited, which reduces the number of labeled training samples required by our method.

II-C Domain Adaptation on Heterogeneous Features

Domain shift refers to the situation where heterogeneous feature representations exist between the source and target domains, which may cause a classifier learned on the source domain to perform poorly on the target domain. A large number of domain adaptation approaches have been proposed to address this problem [44, 45, 46, 47, 48]. Long et al. [44] proposed a distribution adaptation method to find a common subspace where the marginal and conditional distribution shifts between domains are reduced. Zhang et al. [45] learned two projections to map the source and target domains into their respective subspaces where both the geometrical and statistical distribution differences are minimized. Long et al. [46] built a deep adaptation network (DAN) that explores a multiple-kernel variant of maximum mean discrepancies (MK-MMD) to learn transferable features. Zhang et al. [47] explored the underlying complementary information from multiple views and proposed a novel subspace clustering model for multi-view data named Latent Multi-View Subspace Clustering (LMSC). Jiang et al. [48] proposed a deep cross-modal hashing method to map images and texts into a domain-invariant hash space, and then performed similarity search in multimedia retrieval applications. Huang et al. [49] jointly minimized the media-level and correlation-level domain discrepancies across texts and images to enrich the training information and boost the retrieval accuracy on the target domain. In most of these works [44, 45, 46, 47], data from the source and target domains belong to the same media type (videos, images or texts), while the others [48], [49] focus on the image-text retrieval problem.

In contrast, our method focuses on the heterogeneous image-to-video action recognition problem, where the video is represented by a spatiotemporal feature that totally differs from the image representation in terms of feature dimensionality and physical meaning. To perform domain-invariant representations learning, we propose a unified learning framework with three deep neural networks, one each for images, videos and keyframes, and design a new cross-modal similarities metric to reduce the modality shift among them.

III Deep Image-to-Video Adaptation and Fusion

Fig. 2: Framework of our proposed method DIVAFN.

III-A Framework Overview

The framework of DIVAFN is shown in Figure 2. It is a unified deep learning framework seamlessly composed of two parts: domain-invariant representations learning and cross-modal feature fusion. The domain-invariant representations learning part contains three deep neural networks, one for the image modality, one for the keyframe modality, and the other for the video modality. For computational efficiency, we extract keyframes from videos with a histogram-difference-based method [20]. The inputs of the image and keyframe networks are raw image pixels, while the input of the video network is video features such as the Improved Dense Trajectories (IDT) [36] or the C3D features [3]. To learn domain-invariant representations, we design a novel cross-modal similarities metric to reduce the domain shift among images, keyframes and videos. With these domain-invariant features, we utilize the semantic relationship among images, keyframes and videos to perform cross-modal feature fusion. Specifically, we adopt an autoencoder architecture with the additional constraint that the latent representations from the hidden layer should be equal to the semantic representations of the action class names (e.g. attribute representations [31] or word2vec representations [32]). Then, we simultaneously project the learned domain-invariant keyframe features, video features and their concatenations to the same semantic space by learning three semantic autoencoders. Finally, the concatenation of the learned semantic feature representations from these three autoencoders is used to train a Support Vector Machine (SVM) [50] classifier for action recognition in videos.
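The exact histogram-difference keyframe extraction algorithm of [20] is not specified here, so the following is a minimal numpy sketch of one plausible variant that scores each frame by the change in its grayscale histogram relative to the previous frame and keeps the highest-scoring frames. The function name and parameters are illustrative, not from the paper.

```python
import numpy as np

def keyframes_by_histogram_difference(frames, num_keyframes=3, bins=32):
    """Pick keyframes as the frames whose grayscale histogram differs most
    from the previous frame (a simple histogram-difference criterion).

    frames: array of shape (T, H, W) with pixel values in [0, 255].
    Returns the selected keyframe indices in temporal order.
    """
    hists = np.stack([
        np.histogram(f, bins=bins, range=(0, 255), density=True)[0]
        for f in frames
    ])
    # L1 distance between consecutive histograms; the first frame is always kept.
    diffs = np.abs(np.diff(hists, axis=0)).sum(axis=1)
    diffs = np.concatenate([[np.inf], diffs])
    order = np.argsort(-diffs)[:num_keyframes]
    return np.sort(order)
```

For example, on a clip of ten constant frames where only frame 5 changes abruptly, the function keeps the first frame plus the two frames adjacent to the abrupt change.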

III-B Notations

To fix notation, boldface lowercase letters such as x denote vectors, and boldface uppercase letters such as X denote matrices; X_{i*} and X_{*j} denote the i-th row and j-th column of X, respectively. Assume that we have n training samples and each training sample has image, keyframe and video modalities. We use X = {x_i}_{i=1}^n to denote the image modality, where x_i can be handcrafted features or the raw pixels of image i. We use Y = {y_i}_{i=1}^n to denote the keyframe modality, where y_i can be handcrafted features or the raw pixels of keyframe i. Moreover, we use Z = {z_i}_{i=1}^n to denote the video modality, where z_i ∈ R^{d_v} is the visual feature of video i and d_v is the dimension of the input video feature. In addition, we define three cross-modal similarity matrices S^1, S^2 and S^3: S^1_{ij} = 1 if image i and video j belong to the same action class, and S^1_{ij} = 0 otherwise; S^2_{ij} = 1 if image i and keyframe j belong to the same action class, and S^2_{ij} = 0 otherwise; S^3_{ij} = 1 if keyframe i and video j belong to the same action class, and S^3_{ij} = 0 otherwise. Let f_i, h_i, g_i ∈ R^c denote the learned domain-invariant features of the i-th image, i-th keyframe and i-th video, respectively, where c denotes the dimension of the features. The parameters of the CNNs for the image, keyframe and video modalities are defined as θ_x, θ_y and θ_z, respectively. The semantic representations of the action class names for the image, keyframe and video modalities are denoted as T_f, T_h and T_g ∈ R^{k×n}, respectively, where k is the dimension of the semantic representation.
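The binary cross-modal similarity matrices described above can be built directly from the class labels of the two modalities. A minimal numpy sketch (function name illustrative):

```python
import numpy as np

def cross_modal_similarity(labels_a, labels_b):
    """Binary similarity matrix: entry (i, j) is 1 when sample i of one
    modality and sample j of the other share the same action class."""
    a = np.asarray(labels_a)[:, None]
    b = np.asarray(labels_b)[None, :]
    return (a == b).astype(np.float64)
```

For example, image labels [0, 1, 2] against video labels [0, 0, 2] yield a 3x3 matrix whose first row is [1, 1, 0] and whose last row is [0, 0, 1].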

III-C Model Formulation

III-C1 Network Architecture

The deep neural network for both the image modality and the keyframe modality is the CNN-F convolutional neural network adapted from [51]. The CNN-F has been pre-trained on ImageNet and consists of five convolutional layers (conv1-conv5) and three fully connected layers (fc6, fc7, fc8). The input of our network is the raw image pixels. The first seven layers of our network are the same as those in CNN-F [51]. The eighth layer is a fully connected layer whose output is the learned domain-invariant features. The first seven layers all use the Rectified Linear Unit (ReLU) [52] as the activation function; for the eighth layer, we choose the identity function.

To perform domain-invariant feature learning for the video modality, we first represent each video as a vector using Locality-constrained Linear Coding (LLC) [39] of Improved Dense Trajectories (IDT) [36], or the Convolutional 3D (C3D) visual features provided by [3]. The extracted video vectors are then used as the input to a deep neural network with two fully connected layers. The activation function of the first layer is ReLU and that of the second layer is the identity function. The detailed configuration of the deep neural networks for the image, keyframe and video modalities can be found at https://yangliu9208.github.io/DIVAFN/. We choose these deep networks because their effectiveness has been validated in previous work [53].
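As a concrete illustration of the video branch described above, the following numpy sketch implements a two-layer fully connected network with ReLU after the first layer and the identity after the second. The layer sizes (a 4096-d input feature, 1024-d hidden layer, 512-d output) are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def video_branch(z, w1, b1, w2, b2):
    """Two fully connected layers: ReLU after the first, identity after the second."""
    hidden = np.maximum(0.0, w1 @ z + b1)    # fc1 + ReLU
    return w2 @ hidden + b2                  # fc2 + identity activation

# Hypothetical sizes: 4096-d video feature mapped to a 512-d domain-invariant feature.
d_in, d_hid, d_out = 4096, 1024, 512
w1 = rng.normal(scale=0.01, size=(d_hid, d_in)); b1 = np.zeros(d_hid)
w2 = rng.normal(scale=0.01, size=(d_out, d_hid)); b2 = np.zeros(d_out)
g_feat = video_branch(rng.normal(size=d_in), w1, b1, w2, b2)
```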

III-C2 Domain-invariant Representations Learning and Fusion

We first demonstrate how to exploit the label knowledge from training images, video keyframes and videos by designing a cross-modal similarities metric. Since the inner product has been validated as a good pairwise similarity measure in previous work [48, 54], we use the inner products F⊤G, F⊤H and H⊤G as similarity measures for domain-invariant representations learning, where F = [f_1, …, f_n] ∈ R^{c×n} is the output matrix of the image deep neural network, H = [h_1, …, h_n] ∈ R^{c×n} is the output matrix of the keyframe deep neural network, G = [g_1, …, g_n] ∈ R^{c×n} is the output matrix of the video deep neural network, c is the dimension of the learned domain-invariant features, and n is the number of training samples. A larger inner product indicates higher similarity, and vice versa.

The set of pairwise cross-modal similarity matrices S^1, S^2 and S^3 represents the similarity among images, keyframes and videos. The probability of similarity between image i and video j, given the corresponding domain-invariant feature vectors f_i and g_j, can be expressed as a likelihood function defined as follows:

p(S^1_{ij} | f_i, g_j) = σ(Θ^1_{ij})  if S^1_{ij} = 1,  and  1 − σ(Θ^1_{ij})  if S^1_{ij} = 0    (1)

where Θ^1_{ij} = (1/2) f_i⊤ g_j is the inner product between the domain-invariant feature vectors of image i and video j, and σ(x) = 1/(1 + e^{−x}) is the sigmoid function. As the inner product Θ^1_{ij} increases, the probability of S^1_{ij} = 1 also increases, i.e., image i and video j belong to the same action class. As Θ^1_{ij} decreases, the probability of S^1_{ij} = 1 also decreases, i.e., they belong to different action classes.

Similarly, the probability of similarity between image i and keyframe j, and the probability of similarity between keyframe i and video j, can be expressed as two likelihood functions defined as follows:

p(S^2_{ij} | f_i, h_j) = σ(Θ^2_{ij}) if S^2_{ij} = 1, and 1 − σ(Θ^2_{ij}) otherwise, with Θ^2_{ij} = (1/2) f_i⊤ h_j    (2)

p(S^3_{ij} | h_i, g_j) = σ(Θ^3_{ij}) if S^3_{ij} = 1, and 1 − σ(Θ^3_{ij}) otherwise, with Θ^3_{ij} = (1/2) h_i⊤ g_j    (3)

Therefore, the negative log likelihood of the similarity matrix S^1 given F and G can be written as follows:

J_1 = −log p(S^1 | F, G) = −Σ_{i,j} ( S^1_{ij} Θ^1_{ij} − log(1 + e^{Θ^1_{ij}}) )    (4)

Similarly, the negative log likelihood of the similarity matrix S^2 given F and H can be written as follows:

J_2 = −log p(S^2 | F, H) = −Σ_{i,j} ( S^2_{ij} Θ^2_{ij} − log(1 + e^{Θ^2_{ij}}) )    (5)

The negative log likelihood of the similarity matrix S^3 given H and G can also be written as follows:

J_3 = −log p(S^3 | H, G) = −Σ_{i,j} ( S^3_{ij} Θ^3_{ij} − log(1 + e^{Θ^3_{ij}}) )    (6)

It is easy to see that minimizing these negative log likelihoods is equivalent to maximizing the likelihoods, which simultaneously encourages the inner-product similarities between f_i and g_j, between f_i and h_j, and between h_i and g_j to be large when S^1_{ij} = 1, S^2_{ij} = 1 and S^3_{ij} = 1, and to be small when they equal 0. Therefore, optimizing the cross-modal similarities metric terms in Eqs. (4)-(6) preserves the cross-modal similarities in S^1, S^2 and S^3, and learns domain-invariant representations for the image, keyframe and video modalities.
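Assuming the likelihood takes the standard pairwise sigmoid form used in deep cross-modal hashing [48], the negative log likelihood in Eqs. (4)-(6) can be computed in a numerically stable way with `np.logaddexp`. A minimal numpy sketch (function and argument names illustrative; features are stored one sample per row here):

```python
import numpy as np

def pairwise_similarity_nll(feat_a, feat_b, sim):
    """Negative log likelihood of a binary similarity matrix under a sigmoid
    of the halved inner products.

    feat_a: (n, c) features from one modality, one row per sample;
    feat_b: (m, c) features from the other modality; sim: (n, m) binary matrix.
    """
    theta = 0.5 * feat_a @ feat_b.T
    # -sum_ij [ sim_ij * theta_ij - log(1 + exp(theta_ij)) ], stably via logaddexp
    return -np.sum(sim * theta - np.logaddexp(0.0, theta))
```

As expected, a pair with a large positive inner product incurs a much lower loss when labeled similar than when labeled dissimilar.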

Now we show how to utilize the semantic relationship among images, keyframes and videos to effectively fuse the learned domain-invariant keyframe and video representations. We adopt an autoencoder architecture whose hidden layer is constrained to be equal to the semantic representation of the action class names (e.g. attribute representations [31] or word2vec representations [32]). To realize this, we force the latent space to be the k-dimensional semantic representation space, i.e., each column of the semantic matrix is an attribute or word2vec vector of an action class name. Specifically, given the learned domain-invariant image representation matrix F, keyframe representation matrix H and video representation matrix G as inputs, we construct three semantic autoencoders, named Auto-F, Auto-H and Auto-G, for the image modality, keyframe modality and video modality, respectively. The objective functions are defined as follows:

min_{W_1, W_2} ‖F − W_1 W_2 F‖_F²   s.t.  W_2 F = T_f    (7)

min_{V_1, V_2} ‖H − V_1 V_2 H‖_F²   s.t.  V_2 H = T_h    (8)

min_{U_1, U_2} ‖G − U_1 U_2 G‖_F²   s.t.  U_2 G = T_g    (9)

where W_1 and W_2 are the decoder and encoder projection matrices for Auto-F, V_1 and V_2 are those for Auto-H, U_1 and U_2 are those for Auto-G, and T_f, T_h and T_g are the semantic representations for the image, keyframe and video modalities, respectively. The feature representations from the hidden layers of these three semantic autoencoders contain both the learned domain-invariant knowledge and the semantic knowledge. In this way, more informative and discriminative features can be obtained.

Since the semantic representations of the same action classes from images and videos are identical, we have T_f = T_h = T_g; thus, we uniformly use T to denote the semantic representations of these modalities. To further simplify the objective functions Eqs. (7)-(9), we consider tied weights [55] such that W_1 = W_2⊤, V_1 = V_2⊤ and U_1 = U_2⊤, and substitute W_2, V_2 and U_2 with W_f, W_h and W_g, respectively. In addition, we relax the hard constraints (e.g. W_f F = T), which are difficult to solve, into soft ones. Finally, the objective functions Eqs. (7)-(9) can be rewritten as follows:

min_{W_f}  ‖F − W_f⊤ T‖_F² + λ ‖W_f F − T‖_F²    (10)

min_{W_h}  ‖H − W_h⊤ T‖_F² + λ ‖W_h H − T‖_F²    (11)

min_{W_g}  ‖G − W_g⊤ T‖_F² + λ ‖W_g G − T‖_F²    (12)

where λ is a weighting coefficient controlling the relative importance of the decoder and encoder.
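Assuming the tied-weight semantic autoencoder takes the standard form of [30] (a decoder reconstruction term plus a relaxed encoder constraint), the objective in Eqs. (10)-(12) can be sketched in numpy as follows, with X standing for any of the three feature matrices (function and argument names illustrative):

```python
import numpy as np

def sae_objective(W, X, S, lam):
    """Tied-weight semantic autoencoder loss: reconstruction through the
    semantic space plus the relaxed encoder constraint.

    X: (d, n) domain-invariant features, one column per sample;
    S: (k, n) class-semantic vectors; W: (k, d) shared encoder/decoder projection.
    """
    decode = np.linalg.norm(X - W.T @ S, 'fro') ** 2   # decoder: W^T S should recover X
    encode = np.linalg.norm(W @ X - S, 'fro') ** 2     # encoder: W X should match S
    return decode + lam * encode
```

When d = k, W is the identity and S = X, both terms vanish and the loss is exactly zero, which is a quick sanity check on the implementation.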

Since the modality gap among images, video keyframes and videos has been reduced, the learned domain-invariant keyframe features contain knowledge from both the image and keyframe modalities. Thus, we can enhance action recognition performance in videos by fusing the learned domain-invariant keyframe features and video features in a simple yet effective way. Specifically, we construct a fourth semantic autoencoder, named Auto-E, whose input is the concatenation of the learned domain-invariant keyframe and video representations, denoted as C = [H; G] ∈ R^{2c×n}. The objective function of this semantic autoencoder is defined as follows:

min_{W_e}  ‖C − W_e⊤ T‖_F² + λ ‖W_e C − T‖_F²    (13)

where W_e is the projection matrix for Auto-E. In this way, the hidden unit W_e C is a good fused representation of the keyframe and video modalities, as the reconstruction process captures the intrinsic structure of the domain-invariant space and the semantic relationship between keyframes and videos.

To this end, we integrate the cross-modal feature fusion terms Eqs. (10)-(13) and the domain-invariant representations learning terms Eqs. (4)-(6) into a unified deep learning framework named Deep Image-to-Video Adaptation and Fusion Networks (DIVAFN), whose final objective function is defined as follows:

min  γ_1 J_1 + γ_2 J_2 + γ_3 J_3 + α ( ‖F − W_f⊤ T‖_F² + ‖H − W_h⊤ T‖_F² + ‖G − W_g⊤ T‖_F² + ‖C − W_e⊤ T‖_F² ) + λ ( ‖W_f F − T‖_F² + ‖W_h H − T‖_F² + ‖W_g G − T‖_F² + ‖W_e C − T‖_F² )    (14)

where the minimization is over θ_x, θ_y, θ_z, W_f, W_h, W_g and W_e; γ_1, γ_2 and γ_3 denote the weighting coefficients controlling the importance of the three negative log likelihood functions, and α and λ are weighting coefficients controlling the importance of the decoders and encoders, respectively. To make the objective function concise and converge more easily, the values of α and λ are shared across all semantic autoencoders. The network parameters of the CNNs for the image, keyframe and video modalities are θ_x, θ_y and θ_z, respectively. Since there are multiple variables in Eq. (14), we propose an efficient alternating optimization algorithm for DIVAFN.

III-D Optimization

We adopt an alternating learning strategy to learn θ_x, θ_y, θ_z, W_f, W_h, W_g and W_e. Each time we learn one variable with the other variables fixed.

III-D1 Learn θ_x with θ_y, θ_z, W_f, W_h, W_g and W_e fixed

When θ_y, θ_z, W_f, W_h, W_g and W_e are fixed, we learn the CNN parameters θ_x of the image modality by the back-propagation (BP) algorithm. As in most existing deep learning approaches [52], we adopt Stochastic Gradient Descent (SGD) to learn θ_x with the BP algorithm. In each iteration, a mini-batch of samples from the training set is used to update the variable. Specifically, for each training sample x_i from the image modality, we first compute the following gradient:

∂J/∂f_i = (γ_1/2) Σ_j (σ(Θ^1_{ij}) − S^1_{ij}) g_j + (γ_2/2) Σ_j (σ(Θ^2_{ij}) − S^2_{ij}) h_j + 2α (f_i − W_f⊤ t_i) + 2λ W_f⊤ (W_f f_i − t_i)    (15)

where t_i is the semantic vector of the action class of sample i. Then we can compute ∂J/∂θ_x from ∂J/∂f_i using the chain rule, based on which the BP algorithm can be used to update the variable θ_x.
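To illustrate the gradient computation in Eq. (15), the following numpy sketch implements only the pairwise-similarity term that involves a single image feature f (assuming the standard sigmoid likelihood form) and checks the analytic gradient against central finite differences. All names and sizes are illustrative.

```python
import numpy as np

def loss_f(f, G, s_row):
    """Similarity term involving one image feature f:
    -sum_j [ s_j * theta_j - log(1 + exp(theta_j)) ],  theta_j = 0.5 f^T g_j."""
    theta = 0.5 * G @ f
    return -np.sum(s_row * theta - np.logaddexp(0.0, theta))

def grad_f(f, G, s_row):
    """Analytic gradient: 0.5 * sum_j (sigmoid(theta_j) - s_j) g_j."""
    theta = 0.5 * G @ f
    return 0.5 * G.T @ (1.0 / (1.0 + np.exp(-theta)) - s_row)

rng = np.random.default_rng(1)
f = rng.normal(size=4)                          # one image feature (size illustrative)
G = rng.normal(size=(6, 4))                     # video features, one per row
s = rng.integers(0, 2, size=6).astype(float)    # similarity row for this image

# Central finite differences as a correctness check on the analytic gradient.
eps, num = 1e-6, np.zeros_like(f)
for idx in range(f.size):
    e = np.zeros_like(f); e[idx] = eps
    num[idx] = (loss_f(f + e, G, s) - loss_f(f - e, G, s)) / (2 * eps)
```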

III-D2 Learn θ_y with θ_x, θ_z, W_f, W_h, W_g and W_e fixed

When θ_x, θ_z, W_f, W_h, W_g and W_e are fixed, we also learn the CNN parameters θ_y of the keyframe modality using SGD with the back-propagation (BP) algorithm. Specifically, for each training sample y_i from the keyframe modality, we first compute the following gradient:

∂J/∂h_i = (γ_2/2) Σ_j (σ(Θ^2_{ji}) − S^2_{ji}) f_j + (γ_3/2) Σ_j (σ(Θ^3_{ij}) − S^3_{ij}) g_j + 2α (h_i − W_h⊤ t_i) + 2λ W_h⊤ (W_h h_i − t_i) + 2α (h_i − [W_e⊤ t_i]_h) + 2λ [W_e⊤ (W_e c_i − t_i)]_h    (16)

where c_i = [h_i; g_i] is the i-th column of C and [·]_h denotes the rows corresponding to the keyframe block of c_i. Then we can compute ∂J/∂θ_y from ∂J/∂h_i using the chain rule, based on which the BP algorithm can be used to update the variable θ_y.

III-D3 Learn θ_z with θ_x, θ_y, W_f, W_h, W_g and W_e fixed

When θ_x, θ_y, W_f, W_h, W_g and W_e are fixed, we also learn the CNN parameters θ_z of the video modality using SGD with the back-propagation (BP) algorithm. Specifically, for each training sample z_i from the video modality, we first compute the following gradient:

∂J/∂g_i = (γ_1/2) Σ_j (σ(Θ^1_{ji}) − S^1_{ji}) f_j + (γ_3/2) Σ_j (σ(Θ^3_{ji}) − S^3_{ji}) h_j + 2α (g_i − W_g⊤ t_i) + 2λ W_g⊤ (W_g g_i − t_i) + 2α (g_i − [W_e⊤ t_i]_g) + 2λ [W_e⊤ (W_e c_i − t_i)]_g    (17)

where [·]_g denotes the rows corresponding to the video block of c_i = [h_i; g_i]. Then we can compute ∂J/∂θ_z from ∂J/∂g_i using the chain rule, based on which the BP algorithm can be used to update the variable θ_z.

III-D4 Learn W_f with θ_x, θ_y, θ_z, W_h, W_g and W_e fixed

When θ_x, θ_y, θ_z, W_h, W_g and W_e are fixed, the objective function Eq. (14) can be rewritten as follows:

min_{W_f}  α ‖F − W_f⊤ T‖_F² + λ ‖W_f F − T‖_F²    (18)

Since Eq. (18) is quadratic, it is a convex function with a global optimal solution. To optimize it, we first reorganize Eq. (18) using the trace properties tr(AB) = tr(BA) and tr(A) = tr(A⊤):

min_{W_f}  α tr( (F − W_f⊤ T)(F − W_f⊤ T)⊤ ) + λ tr( (W_f F − T)(W_f F − T)⊤ )    (19)

We then set the derivative of Eq. (19) with respect to W_f to zero and get:

α T T⊤ W_f + λ W_f F F⊤ = (α + λ) T F⊤    (20)

If we denote A = α T T⊤, B = λ F F⊤ and Q = (α + λ) T F⊤, we have the following well-known Sylvester formulation:

A W_f + W_f B = Q    (21)

which can be solved efficiently by the Bartels-Stewart algorithm [56]. More importantly, the computational complexity of Eq. (21) depends only on the feature dimensions (k and c), not on the number of samples, and thus it scales to large-scale datasets.
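The Sylvester-equation step can be sketched with `scipy.linalg.solve_sylvester`, which implements the Bartels-Stewart algorithm. The matrix sizes below are illustrative; assuming the closed form derived above, the final check confirms the returned solution is a stationary point of the quadratic objective.

```python
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(2)
k, c, n = 3, 5, 40                        # semantic dim, feature dim, samples (illustrative)
T = rng.normal(size=(k, n))               # semantic vectors of the training samples
F = rng.normal(size=(c, n))               # domain-invariant features
alpha, lam = 1.0, 0.5                     # decoder / encoder weights (illustrative)

# A W + W B = Q with A = alpha T T^T, B = lam F F^T, Q = (alpha + lam) T F^T
A = alpha * T @ T.T
B = lam * F @ F.T
Q = (alpha + lam) * T @ F.T
W = solve_sylvester(A, B, Q)              # Bartels-Stewart under the hood

def objective(M):
    """Quadratic objective whose stationary point the Sylvester equation encodes."""
    return (alpha * np.linalg.norm(F - M.T @ T, 'fro') ** 2
            + lam * np.linalg.norm(M @ F - T, 'fro') ** 2)
```

Because the objective is convex, the stationary point is the global minimizer, so perturbing W can only increase the objective.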

III-D5 Learn W_h with θ_x, θ_y, θ_z, W_f, W_g and W_e fixed

When θ_x, θ_y, θ_z, W_f, W_g and W_e are fixed, the objective function Eq. (14) can be rewritten as follows:

min_{W_h}  α ‖H − W_h⊤ T‖_F² + λ ‖W_h H − T‖_F²    (22)

Similarly, Eq. (22) can also be solved efficiently by the Bartels-Stewart algorithm [56].

III-D6 Learn W_g with θ_x, θ_y, θ_z, W_f, W_h and W_e fixed

When θ_x, θ_y, θ_z, W_f, W_h and W_e are fixed, the objective function Eq. (14) can be rewritten as follows:

min_{W_g}  α ‖G − W_g⊤ T‖_F² + λ ‖W_g G − T‖_F²    (23)

Similarly, Eq. (23) can also be solved efficiently by the Bartels-Stewart algorithm [56].

III-D7 Learn W_e with θ_x, θ_y, θ_z, W_f, W_h and W_g fixed

When θ_x, θ_y, θ_z, W_f, W_h and W_g are fixed, the objective function Eq. (14) can be rewritten as follows:

min_{W_e}  α ‖C − W_e⊤ T‖_F² + λ ‖W_e C − T‖_F²    (24)

Similarly, Eq. (24) can also be solved efficiently by the Bartels-Stewart algorithm [56].

Iii-E Classification

When the loss of the objective function Eq. (14) drops to its minimum value after a certain number of iterations, we can learn variables , , , , , and . Then we use the concatenation of the learned semantic feature representations from the hidden layers of Auto-H, Auto-G and Auto-E as the final fused feature representations. For example, given the video sample , the final fused feature representation for action recognition in videos is defined as:

(25)

where the operator denotes matrix concatenation, the first term denotes the learned keyframe semantic representation from Auto-H for the video sample, the second denotes the learned video semantic representation from Auto-G, and the third denotes the fused semantic representation from Auto-E; together they form the final fused feature representation for the video sample. Finally, we use the final fused representation matrix over all training samples to perform action recognition in videos with the SVM classifier [50].
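As a sketch of this classification stage, the following code concatenates hypothetical semantic representations from the three autoencoders and trains a linear SVM. The dimensions, random data, and the LinearSVC choice are placeholder assumptions standing in for the paper's learned features and its SVM [50].

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical learned semantic representations from the three autoencoders
# (Auto-H: keyframes, Auto-G: videos, Auto-E: fused); shapes are assumed.
rng = np.random.default_rng(0)
n_train, n_test, s = 80, 20, 300        # s = semantic (e.g. word2vec) dimension
H = rng.standard_normal((n_train, s))   # keyframe semantic features
G = rng.standard_normal((n_train, s))   # video semantic features
E = rng.standard_normal((n_train, s))   # fused semantic features
y = rng.integers(0, 10, size=n_train)   # 10 action-class labels (toy)

# Eq. (25): the final representation is the concatenation of the three parts.
Z_train = np.concatenate([H, G, E], axis=1)

clf = LinearSVC(max_iter=2000).fit(Z_train, y)
Z_test = rng.standard_normal((n_test, 3 * s))
pred = clf.predict(Z_test)
```

The same concatenated-feature matrix is what the experiments below feed to the SVM for every dataset pair.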

IV Experiments

IV-A Datasets

To evaluate the performance of our method, we conduct experiments on four real-world image-video action recognition dataset pairs. The videos come from two large-scale and complex datasets, HMDB51 [5] and UCF101 [6]. The images come from various datasets: the BU101 dataset [11], the Stanford40 dataset [12], the Action dataset (DB) [13], the People Playing Musical Instruments (PPMI) dataset [14], the Willow-Actions dataset [15], Still DB [16], and the Human Interaction Image (HII) dataset [17]. Figure 3 shows sample images from each dataset.

IV-A1 Stanford40→UCF101 (SU)

Fig. 3: Sample images from each image dataset; the images with bounding boxes in blue, green, orange and red are from the image datasets of the Stanford40→UCF101, ASD→UCF101, EAD→HMDB51 and BU101→UCF101 pairs, respectively.

UCF101 [6] is a video action dataset collected from YouTube with 101 action classes. The Stanford40 dataset [12] contains diverse human action images with various poses, appearances and background clutter. We choose 10 common classes between these two datasets, each with samples in both the image and video modalities. Table I summarizes the chosen common classes of Stanford40→UCF101, where "→" denotes the direction of adaptation from the auxiliary modality to the target modality.

Datasets Stanford40 UCF101
Common classes shooting an arrow archery
brushing teeth brushing teeth
cutting vegetables cutting in kitchen
throwing frisby frisbee catch
rowing a boat rowing
walking with dog walking with dog
writing on board writing on board
clearing the floor mopping floor
climbing rock climbing indoor
playing guitar playing guitar
TABLE I: Chosen common classes of Stanford40→UCF101.
Datasets ASD (source dataset) UCF101
Common classes riding bike (Willow Actions) biking
cricket bowling (Action DB) cricket bowling
cricket batting (Action DB) cricket shot
riding horse (Willow Actions) horse riding
playing cello (PPMI) playing cello
playing flute (PPMI) playing flute
playing violin (PPMI) playing violin
tennis forehand (Action DB) tennis swing
volleyball smash (Action DB) volleyball spiking
throwing (Still DB) baseball pitch
TABLE II: Chosen common classes of ASD→UCF101.
Datasets EAD (source dataset) HMDB51
Common classes applauding (Stanford40) clap
climbing (Stanford40) climb
drinking (Stanford40) drink
hug (HII) hug
jumping (Stanford40) jump
kiss (HII) kiss
pouring liquid (Stanford40) pour
pushing a cart (Stanford40) push
riding a bike (Stanford40) ride bike
riding a horse (Stanford40) ride horse
running (Stanford40) run
shake hands (HII) shake hands
shooting an arrow (Stanford40) shoot bow
smoking (Stanford40) smoke
waving hands (Stanford40) wave
TABLE III: Chosen common classes of EAD→HMDB51.

IV-A2 ASD→UCF101 (AU)

To further evaluate whether images from different sources influence action recognition performance in videos, we select 10 common classes between UCF101 and an extended dataset named "actions from still datasets" (ASD), which consists of four publicly available datasets: the Action dataset (DB) [13], the People Playing Musical Instruments (PPMI) dataset [14], the Willow-Actions dataset [15], and Still DB [16]. Each action class has samples in both the image and video modalities. Table II shows the chosen common classes of ASD→UCF101.

IV-A3 EAD→HMDB51 (EH)

The HMDB51 [5] dataset consists of 51 action classes. We construct an "extensive action dataset" (EAD) consisting of categories chosen from the Stanford40 [12] and HII [17] datasets. The HII dataset contains web action images collected from search engines (Google, Bing and Flickr) with four types of interactions: handshake, highfive, hug, and kiss. We choose 15 common classes, each with samples in both the image and video modalities. Table III shows the chosen common classes of EAD→HMDB51.

IV-A4 BU101→UCF101 (BU)

To investigate the performance of our method on a large-scale dataset, we select BU101 [11] as the image dataset, the largest web action image dataset to date. It is more than double the size of the largest previous action image dataset, Stanford40 [12], both in the number of images and in the number of actions, and its action classes have a one-to-one correspondence to the 101 action classes of the UCF101 video dataset. For the video dataset, we select all videos from all classes of UCF101. We can therefore evaluate action recognition performance on UCF101 using the three public train/test splits and make a fair comparison with other state-of-the-art action recognition algorithms.

IV-B Experimental Setup

To study how performance varies with the amount of training data, we evaluate our method with different numbers of training samples. For the Stanford40→UCF101, ASD→UCF101 and EAD→HMDB51 datasets, the ratio of training samples is varied over several settings. For the BU101→UCF101 dataset, we use the three public train/test splits, randomly sample different numbers of videos from the training set of each split as training samples, and report results averaged over the three splits.

We exploit the CNN-F network [51] pre-trained on the ImageNet dataset [57] to initialize the first seven layers of the CNNs for the image and keyframe modalities. All other parameters of the deep neural networks in DIVAFN are randomly initialized. For the image and keyframe modalities, we use the raw image pixels as the input to the deep neural networks. For the video modality, we use two kinds of visual features, the hand-crafted Improved Dense Trajectories (IDT) [36] and the deeply-learned Convolutional 3D (C3D) features [3], to evaluate whether our method generalizes well to both hand-crafted and deeply-learned video features. For the IDT features, we adopt the Locality-constrained Linear Coding (LLC) [39] scheme to represent the IDTs by local bases. To reduce the complexity of constructing the codebook, only a subset of IDTs is randomly selected from each video; the effectiveness of this encoding scheme has been verified by [58]. For the C3D features, we extract the outputs of the fc6 layer of the model pre-trained on the Sports-1M dataset [1] and average them over the segments to form 4096-dimensional feature vectors. For the semantic representations of the action class names, we use two widely used class embeddings: human-labeled attributes [31] and automatically learned distributed semantic representations from word2vec [32]. We adopt the skip-gram model [32] trained on the Google News dataset (about 100 billion words) and represent each class name by an L2-normalized 300-dimensional vector. For any multi-word class name (e.g., 'walking with dog'), we generate its vector by accumulating the word vectors of each unique word [59]. For the HMDB51 dataset there are no publicly available attribute representations of the action classes, so only word2vec is used for EAD→HMDB51. For the UCF101 dataset, attribute vectors are available; therefore, both word2vec and class attributes are used on the Stanford40→UCF101, ASD→UCF101 and BU101→UCF101 datasets.
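The multi-word class-name embedding step can be sketched as follows. The toy word-vector table and its dimensionality are assumptions standing in for the skip-gram vectors, but the accumulate-then-L2-normalize logic mirrors the procedure described above.

```python
import numpy as np

# Toy word-vector table; real vectors come from a skip-gram model trained
# on Google News (300-dimensional), which we do not load here.
dim = 8
vocab = {
    "walking": np.arange(dim, dtype=float),
    "with":    np.ones(dim),
    "dog":     np.linspace(1.0, 2.0, dim),
}

def class_name_vector(name, table):
    """Accumulate the vectors of each unique word in a (possibly
    multi-word) class name, then L2-normalize the result."""
    words = dict.fromkeys(name.lower().split())  # unique, order-preserving
    v = np.sum([table[w] for w in words], axis=0)
    return v / np.linalg.norm(v)

vec = class_name_vector("walking with dog", vocab)
```

Normalizing the accumulated vector keeps class embeddings on a common scale regardless of how many words the class name contains.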

Our DIVAFN method has six free parameters. Their optimum values are determined by experiments, as discussed in detail in Section IV-D. One of them is the vector length of the domain-invariant representation from the image, video keyframe and video networks, which is evaluated over several values. We fix the mini-batch size, the number of iterations, and the learning rate throughout. All experiments are repeated five times and the average results are reported.

IV-C Experimental Results

We evaluate the performance of DIVAFN on the four datasets, with results shown in Tables IV-VII. In these tables, SVM means directly using an SVM to classify the target-domain videos, following the methods [35, 36] for action recognition in videos. We compare DIVAFN with this SVM to evaluate whether our method improves video action recognition using knowledge from the image modality. To evaluate whether directly adding examples from the image modality to the video modality is effective, we use the concatenation of image and video features to train an SVM classifier to recognize the target-domain videos; the image features are the outputs of the CNN-F network pre-trained on ImageNet, and the video features are the IDT or C3D features. We consider this method the baseline. To evaluate whether a deep model finetuned from ImageNet can help video recognition, we also compare DIVAFN with a simple transfer-learning method that finetunes an ImageNet-pretrained network and uses it to classify video keyframes directly. Specifically, we utilize the image network with parameters initialized on ImageNet and use the keyframes as input, an approach whose effectiveness has been verified by previous work [2]. For a fair comparison, we extract features from the eighth layer of the image network and use them to train an SVM classifier for action recognition in videos; we name this method finetuned. From Tables IV-VII, we observe the following.

The baseline method achieves slightly better performance than the SVM on the Stanford40→UCF101, ASD→UCF101 and EAD→HMDB51 datasets. On the larger-scale BU101→UCF101 dataset, however, the performance of the baseline drops significantly and becomes even lower than that of the SVM. This shows that adding action images to action videos improves recognition performance only to a limited extent, and only when the dataset is not too large. As the number of samples increases, the domain shift between the image and video modalities grows significantly, which degrades recognition performance. Therefore, effective domain adaptation and cross-modal feature fusion methods are needed to address these problems.

Ratio SVM baseline finetuned
TABLE IV: Average accuracies on the Stanford40→UCF101 dataset. The video feature is IDT and the semantic feature is word2vec.
Ratio SVM baseline finetuned
TABLE V: Average accuracies on the ASD→UCF101 dataset. The video feature is IDT and the semantic feature is word2vec.
Ratio SVM baseline finetuned
TABLE VI: Average accuracies on the EAD→HMDB51 dataset. The video feature is IDT and the semantic feature is word2vec.
Ratio SVM baseline finetuned
TABLE VII: Average accuracies over three train/test splits on the BU101→UCF101 dataset. The video feature is C3D and the semantic feature is word2vec.
Fig. 4: Sample images from BU101 and UCF101 datasets. The first row shows images from BU101 image dataset. The second row shows images from UCF101 video dataset.

The finetuned deep neural network only achieves performance comparable to the SVM on the Stanford40→UCF101 and ASD→UCF101 datasets. When addressing video action recognition on the EAD→HMDB51 and BU101→UCF101 datasets, however, its performance drops significantly, especially on BU101→UCF101. This is because a large modality difference between images and videos exists on these large-scale action recognition datasets. To illustrate the difference, Figure 4 shows example action images from the BU101 and UCF101 datasets. Many of the action images differ significantly from video frames in camera viewpoint, lighting, human pose, and background. Action images tend to focus on representative moments and offer more diverse examples of an action. In contrast, videos usually capture the whole temporal progression of an action and contain many redundant and uninformative frames. Therefore, simply using a model finetuned on image datasets, without effective domain adaptation and feature fusion, is not an effective way to improve video action recognition.

From the above evaluation of the baseline and finetuned methods, we can see that the proposed DIVAFN outperforms the other methods in different settings. This proves that effective domain adaptation and feature fusion methods can improve image-to-video action recognition. It also validates that our DIVAFN not only effectively reduces the modality gap between images and videos but also makes good use of the complementary information between them. Even when the number of training samples is limited, the performance improvement remains significant on all four datasets. This advantage is especially desirable in real-world scenarios where annotated videos are scarce.

Because higher feature dimensionality carries more informative and discriminative knowledge, the performance of DIVAFN increases as the dimensionality of the learned representation increases. This holds only while the dimensionality stays within a reasonable range; beyond that range, the discriminability of the feature degrades because of redundant information. Since the domain-invariant representation is the output of the image, keyframe and video networks, the last fully-connected layer of each deep network determines this range. In our work, the dimensionality of the domain-invariant representation is set to be the same for images, keyframes and videos, so it must not exceed the size of the smallest of these layers. Based on the above analysis, we choose a fixed dimension for the learned domain-invariant representations in the following experiments.

Although BU101→UCF101 is a large-scale image-video dataset with a significantly large domain shift between the image and video modalities, our method still significantly improves action recognition performance in videos across different numbers of training samples. This demonstrates that our method is effective on large-scale datasets.

IV-D Parameter Sensitivity Analysis

(a) Stanford40→UCF101
(b) ASD→UCF101
(c) EAD→HMDB51
(d) BU101→UCF101
Fig. 5: Differences in recognition accuracy with respect to the two autoencoder weights on the four datasets.

There are five regularization parameters in Eq. (14). To learn how they influence performance, we conduct a parameter sensitivity analysis on the four datasets with the representation dimension fixed. The video feature and semantic feature are C3D and word2vec, respectively. For the Stanford40→UCF101, ASD→UCF101 and EAD→HMDB51 datasets, we conduct experiments at a fixed ratio of training samples. For the BU101→UCF101 dataset, we conduct experiments on one train/test split. Results for other ratios of training samples and train/test splits are omitted because they are similar.

(a) Stanford40→UCF101
(b) ASD→UCF101
(c) EAD→HMDB51
(d) BU101→UCF101
Fig. 6: Comparison with the SVM on the four datasets in terms of different video and semantic features.

Specifically, we first determine the optimum values of the two autoencoder weights while fixing the remaining parameters. From Fig. 5, we can see that the best performance on all four datasets is achieved when the decoder weight is larger than the encoder weight; that is, the decoders should be attached more importance than the encoders. This is because the decoders of our autoencoders control the reconstruction process, which makes the learned hidden feature representations easy to project back to the original domain-invariant feature space. If we attach more importance to the decoders, the hidden units are encouraged to be good representations of the inputs, as the reconstruction process captures the intrinsic structure of the input domain-invariant features. Nevertheless, the importance of the encoders should not be neglected, since jointly learning encoders and decoders encourages the features in the semantic space to contain both domain-invariant information and semantic knowledge. We therefore fix this weighting in the following experiments.
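The encoder/decoder weighting discussed above can be made concrete with a tied linear autoencoder objective. The symbols, shapes, and weight values below are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

# Sketch of a weighted tied linear autoencoder objective: project
# domain-invariant features X onto semantic codes S with projection W,
# weighting the decoder (reconstruction) and encoder (semantic) terms
# separately. All names and values here are assumed for illustration.
rng = np.random.default_rng(0)
d, s, n = 64, 300, 50
X = rng.standard_normal((d, n))      # domain-invariant features
S = rng.standard_normal((s, n))      # semantic (e.g. word2vec) targets
W = rng.standard_normal((s, d))      # tied projection matrix

def objective(W, X, S, w_dec, w_enc):
    decoder_term = np.linalg.norm(X - W.T @ S) ** 2   # reconstruction loss
    encoder_term = np.linalg.norm(W @ X - S) ** 2     # semantic encoding loss
    return w_dec * decoder_term + w_enc * encoder_term

# Attaching more importance to the decoder than to the encoder,
# as the sensitivity analysis above suggests.
loss = objective(W, X, S, w_dec=1.0, w_enc=0.1)
```

Raising `w_dec` relative to `w_enc` prioritizes reconstructing the domain-invariant inputs, while a nonzero `w_enc` keeps the hidden codes anchored to the semantic space.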

In the extreme condition of setting both autoencoder weights to zero, our method reduces to the domain-invariant representation learning part of DIVAFN according to Eq. (14), which is equivalent to ignoring the cross-modal feature fusion part. In the Ablation Study (Section IV-F), we evaluate the performance of this variant, named DIVA.

SU AU EH BU
TABLE VIII: The relationship between the three similarity weighting parameters and the action recognition accuracy (%) on the four datasets. The three numbers in brackets are the parameter values.

Then, we determine the optimum values of the three similarity weights while fixing the autoencoder weights. Table VIII shows the performance of DIVAFN on the four datasets for different values of these parameters. They are the weighting coefficients controlling the importance of the three similarity-related functions: the similarity between the image and video modalities, the similarity between the image and keyframe modalities, and the similarity between the keyframe and video modalities. From Table VIII, we can see that the best accuracies on all four datasets are achieved at the same setting, which shows the following. 1) The optimum values of the image-video and image-keyframe similarity weights equal that of the semantic learning weight, meaning that domain-invariant feature learning and semantic learning should be attached the same importance; in this way, we obtain more discriminative and robust representations containing both domain-invariant information and semantic knowledge. 2) The optimum value of the keyframe-video similarity weight is larger than the others, meaning that more importance should be attached to the similarity between the keyframe and video modalities than to the other two similarities and to semantic learning. This is because keyframes are essentially images, so the modality gap between keyframes and images is smaller than that between keyframes and videos. Moreover, the keyframe modality is the bridge for exploiting the relationship between the image and video modalities. By putting more weight on reducing the modality shift between keyframes and videos, we obtain good domain-invariant representations in all modalities: images, keyframes and videos.

IV-E Influence of Different Video and Semantic Features

In this experiment, we evaluate the performance of our proposed DIVAFN with different types of video and semantic features. The video features are the hand-crafted Improved Dense Trajectories (IDT) [36] and the deeply-learned Convolutional 3D (C3D) features [3]. The semantic features of the action class names are the human-labeled attributes [31] and the automatically learned distributed semantic representations from word2vec [32]. This gives four feature combinations, i.e., IDT_A, IDT_W, C3D_A, and C3D_W, where A denotes the human-annotated class attribute vectors and W denotes the word2vec embedding. Since there are two kinds of video features, IDT and C3D, we have two baseline methods, named IDT_SVM and C3D_SVM. For an intuitive comparison with the SVM on the four datasets using the four feature combinations, we show the results in Fig. 6.

Generally, using C3D as the video feature performs better than using IDT, because C3D is a deeply-learned feature and may be more discriminative than the hand-crafted IDT. Using word2vec as the semantic feature is slightly better than using human-labeled attributes when the video features are the same, because the higher dimensionality of the word2vec features lets the learned hidden representations carry more informative and discriminative knowledge. In addition, the marginal performance difference between annotated attributes and word2vec indicates that both are effective in our proposed DIVAFN. All four feature combinations achieve significantly better performance than the SVM on the four datasets, which validates the effectiveness of the chosen video and semantic features. Most importantly, our DIVAFN effectively enhances recognition performance in videos by transferring knowledge from the image modality.

IV-F Ablation Study

To evaluate the importance of the domain-invariant representation learning part and the cross-modal feature fusion part, we construct three new algorithms based on the proposed DIVAFN method.

Keyframe-Video Concatenation (KVC), the simple concatenation of raw keyframe and video features. Since the input of the keyframe modality is the image pixels, KVC uses as the keyframe feature the output of the eighth layer of the keyframe network before it is trained with the cross-modal similarity metric, and the video feature is IDT or C3D. The simple concatenation of the keyframe and video features is then used to train an SVM classifier for action recognition in videos.

Deep Image-to-Video Adaptation (DIVA), the domain-invariant representation learning part of DIVAFN. DIVA uses as the keyframe feature the output of the eighth layer of the keyframe network after training with the cross-modal similarity metric, and as the video feature the output of the second layer of the video network after the same training. The simple concatenation of the keyframe and video features is then used as the final fused feature for action recognition in videos with an SVM classifier.

Deep Image-to-Video Fusion (DIVF), the cross-modal feature fusion part of DIVAFN. DIVF uses as the keyframe feature the output of the eighth layer of the keyframe network before it is trained with the cross-modal similarity metric. Because the video network is only randomly initialized and thus unsuitable for video feature extraction, DIVF uses IDT or C3D as the video feature. The keyframe feature, the video feature, and their concatenation are then projected into the same semantic space by learning the three autoencoders Auto-H, Auto-G and Auto-E, respectively. Finally, the concatenation of the learned semantic feature representations from these three autoencoders is used to train an SVM classifier for action recognition in videos.

We conduct experiments on the Stanford40→UCF101 and ASD→UCF101 datasets. For Stanford40→UCF101, we use IDT and word2vec as the video and semantic features, respectively; for ASD→UCF101, we use IDT and attributes. The results are shown in Tables IX and X. Compared with the SVM, KVC achieves slightly better performance on both datasets, demonstrating that a simple combination of image and video features can improve action recognition in videos thanks to their complementarity. However, the improvement is not significant, owing to the domain shift between the image and video modalities. Both DIVA and DIVF perform better than KVC and the SVM, which validates the effectiveness of our domain-invariant learning and cross-modal feature fusion methods. Across methods and ratios of training samples, DIVAFN achieves the best performance on both datasets. This verifies that, by integrating domain-invariant representation learning and cross-modal feature fusion into a unified deep learning framework, we obtain more compact, informative and discriminative feature representations that enhance action recognition in videos.

Ratio SVM KVC DIVA DIVF DIVAFN
TABLE IX: Average accuracies on the Stanford40→UCF101 dataset using the SVM, KVC, DIVA, DIVF and DIVAFN methods.
Ratio SVM KVC DIVA DIVF DIVAFN