PM-GANs: Discriminative Representation Learning for Action Recognition Using Partial-modalities

04/17/2018 · by Lan Wang, et al.

Data of different modalities generally convey complementary but heterogeneous information, and a more discriminative representation is often preferred, obtained by combining multiple data modalities such as RGB and infrared features. In reality, however, obtaining both data channels is challenging due to many limitations. For example, RGB surveillance cameras are often restricted from private spaces, which conflicts with the need for abnormal activity detection for personal security. As a result, using partial data channels to build a full representation of multiple modalities is clearly desirable. In this paper, we propose novel Partial-modal Generative Adversarial Networks (PM-GANs) that learn a full-modal representation using data from only partial modalities. The full representation is achieved by substituting a generated representation for the missing data channel. Extensive experiments are conducted to verify the performance of our proposed method on action recognition, compared with four state-of-the-art methods. Meanwhile, a new Infrared-Visible Dataset for action recognition is introduced, which will be the first publicly available action dataset that contains paired videos in the infrared and visible spectrum.




1 Introduction

Human action recognition [1, 2, 3, 4, 5, 6] aims to recognize the ongoing action in a video clip. As one of the most important tasks in computer vision, action recognition plays a significant role in many applications such as video surveillance [7, 8], human-computer interaction [9, 10] and content retrieval [11, 12], with great potential in artificial intelligence. As a result, massive attention has been dedicated to this area, which has made large progress over the past decades. Most state-of-the-art methods have addressed the task in visible imaging videos, and show saturated performance on the widely used benchmark datasets, including KTH [13] and UCF101 [14]. Generally speaking, the task of action recognition is quite well addressed and has already been applied to real-world problems.

Figure 1: The framework of the proposed Partial-modal Generative Adversarial Networks (PM-GANs). Infrared video clips are sent to the transferable generative net to produce fake feature representations of the visible spectrum, while the discriminator attempts to distinguish the generated features from the real ones. The predictor constructs a full representation from the generated features and the infrared features to conduct classification.

However, there are still many occasions where visible imaging is limited. First, RGB cameras rely heavily on lighting conditions, and perform poorly when light is insufficient or over-abundant. Action recognition from night-view RGB data remains a rather difficult task. Moreover, to protect a fundamental aspect of human dignity, privacy, RGB cameras are strictly restricted from most private areas, including personal residences and public washrooms, where abnormal human activities are likely to threaten personal security. Infrared cameras, which capture the heat radiation of objects, are excellent alternatives on these occasions [15]. Thermal imaging has been applied in military affairs and police surveillance for years, and has more potential beyond government use. With many advantages over RGB cameras, infrared cameras are predicted to become more common in public spaces such as hospitals, nursing centers for the elderly, and home security systems [16].

While infrared cameras can cover the blind spots of RGB cameras, many visible features are nevertheless lost in the infrared spectrum when objects are similar in temperature [17, 18]. Visible features such as color and texture are effective clues in activity representation. Since the two modalities are complementary, it is desirable to utilize both visible and infrared features for action recognition. It would be even more desirable to utilize both feature domains when ONLY infrared data is available. In the aforementioned cases, where the demand for abnormal action recognition conflicts with the demand for privacy, it would be ideal to obtain both infrared and visible features while using only the infrared data. The question is: how can one obtain visible features when the visible data is missing? The situation is not unique to action recognition. In fact, data of different modalities with complementary benefits widely exist in multimedia, such as systems with multiple sensors, or product details combining text descriptions and images [19]. Here we are inspired by intra-modal feature representations to make up for the missing data using adversarial learning on the available part of the data channels.

Recently, much attention has been given to cross-modal feature representations [20, 21, 22, 23] dealing with unpaired data, which map multiple feature spaces onto a common one, or generate a different representation via adversarial training. The basic model of generative adversarial networks (GANs) [24, 25, 26] consists of a generative model and a discriminative model. Many interesting image-to-image translations, such as genre translation and face and pose transformation, indicate the broader potential of GANs to explore the hidden correlations in cross-modal representations [16, 27]. Inspired by this, we seek an algorithm that can translate from the infrared representation to the visible domain, which allows us to further exploit the benefits of both feature spaces with only part of the data modalities. More generally, we aim at an architecture that learns a full representation for data of different modalities using partial modalities. Different from existing cross-modal works, which seek a common representation across different data spaces, our goal is to exploit the transferable ability among different modalities, which is further utilized to construct a full-modal representation when only partial data modalities are available.

With a completely different target, in this paper we propose novel Partial-modal Generative Adversarial Networks (PM-GANs), which aim to learn the transferable representation among data of heterogeneous modalities using a cross-modal adversarial mechanism, and to build a discriminative full-modal representation architecture using data of one or partial modalities. The main contributions are summarized as follows.

  • Partial-modal representation is proposed to deal with missing data modalities. Specifically, the partial-modal representation aims to obtain the transferable representation among data of different modalities, so that when only partial-modal representations are accessible, the model can still generate a comprehensive description, as if it were constructed with data of all modalities.

  • A Partial-modal GANs architecture is proposed that can exploit the complementary benefits of all data channels with heterogeneous features using only one or partial channels. The generative model learns to fit the transferable distribution that characterizes the feature representation of the specific data channels that are likely to be missing in practice. Meanwhile, the discriminative model learns to judge whether the translated distribution is representative enough of the full modalities. Extensive experimental results reveal the effectiveness of the PM-GANs architecture, which outperforms four state-of-the-art methods in the task of action recognition.

  • A partial-modal evaluation dataset is newly introduced, which provides paired data of two different modalities, the visible and infrared spectrum, of human actions. Researchers can evaluate the transferable ability of algorithms between the two modalities, as well as the discriminative ability of the generated representation, by comparing against the series of baselines provided in this paper. Meanwhile, the dataset can serve as a benchmark for bi-channel action recognition, since it is also carefully designed for this purpose. The dataset contains more than 2,000 videos and 12 different actions and, to the best of our knowledge, is the first publicly available action recognition dataset that contains both infrared and visible spectrum.

The rest of the paper is organized as follows. In Section 2, we review the background and related work. In Section 3, we elaborate the details of our proposed method. Section 4 presents the newly introduced dataset, its evaluations, and the experimental results on it. Finally, Section 5 draws the conclusion.

2 Related work

Transfer Learning and Cross-modal Representation: In classical pattern recognition and machine learning tasks, sufficient training data covering variations in modality is clearly desired but is an unrealistic goal [28, 29], which restricts the representative ability of models. Among the studies addressing this problem, transfer learning attempts to transfer the feature space from a source domain to a target domain, and to lessen adaptation conflicts via domain adaptation [30, 31, 32, 33]. The transferred knowledge is not restricted to feature representations or instances; it also includes modality correlations. With different aims, cross-dataset and cross-modal feature representation fall into feature-representation transfer, adapting the representations from different domains to a single common latent space, where features of multiple modalities are jointly learned and combined. Among these algorithms, Canonical Correlation Analysis (CCA) [34, 35] is widely used; it seeks to maximize the correlation between the projected vectors of two modalities. Another classical algorithm is Data Fusion Hashing (DFH) [36], which embeds the input data from two arbitrary spaces into a Hamming space in a supervised way. Differently, Cross-View Hashing (CVH) [37] maximizes the weighted cumulative correlation and can be viewed as a generalization of CCA.

In recent years, with the renaissance of neural networks, many deep-learning-based transfer learning and cross-modal representation methods have been proposed as well. The Bishifting Autoencoder Network [21] attempts to alleviate the discrepancy between the source and target datasets by mapping them to the same space. To further take feature alignment and auxiliary domain data into consideration, the aligned-to-generalized encoder (AGE) [16] maps aligned feature representations to the same generalized feature space with low intra-class variation and high inter-class variation. Since GANs were proposed by Goodfellow et al. [24] in 2014, a series of GAN-based methods have arisen for a wide variety of problems. Recently, Cross-modal Generative Adversarial Networks for Common Representation Learning (CM-GANs) [27] were proposed, which seek to unify the inconsistent distributions and representations of different modalities by bridging the heterogeneity of knowledge types such as image and text. In contrast, we have a completely different goal: to use only partial data modalities to obtain a full-modal representation. Our focus goes beyond the jointly learned representation of multiple feature spaces, taking one step further to achieve a discriminative partial-modality representation, which corresponds to our original aim of handling the problem of insufficient training data and data types.

Infrared Action Recognition and Dataset: Most previous contributions [38, 39] to the progress of action recognition have been made in the visible spectrum. Early approaches utilized hand-crafted representations followed by classifiers, such as the 3D Histogram of Gradient (HOG3D) [40], Histogram of Optical Flow (HOF) [41], Space-Time Interest Points (STIP) [42] and Trajectories [43]. Wang et al. [44] proposed the Improved Dense Trajectories (iDT) representation, a breakthrough among hand-crafted features. In the hand-crafted representation scheme, encoding methods such as Bag of Words (BoW) [45], Fisher vectors [46] and VLAD [47] are applied to aggregate the descriptors into a video-level representation. Benefiting from the success of Convolutional Neural Networks (CNNs) in image classification, several deep network architectures have been proposed for action recognition. Simonyan et al. [4] proposed a two-stream CNN architecture that simultaneously captures appearance and motion information through spatial and temporal nets. Tran et al. [7] investigated 3D ConvNets [48, 8] with large-scale supervised training datasets and effective deep architectures, achieving significant improvements. Carreira et al. [49] designed a two-stream inflated 3D ConvNet, inflating filters and pooling kernels into 3D to learn seamless spatio-temporal feature extractors.

Recently, increasing efforts have been devoted to infrared action recognition [15]. Corresponding to the classical methods employed in the visible spectrum, spatiotemporal representations for repetitive human action recognition have also been used in thermal imaging scenarios [50]. The combination of visible and thermal imaging to improve human silhouette detection was introduced by Han et al. [51]. However, the scenario where infrared data is available while the RGB channel is missing has not been studied. This scenario has great real-world potential for protecting privacy while benefiting action recognition, and is meaningful both to the study of pattern recognition and to the welfare of the community at large. Therefore, we are motivated to improve the situation by constructing robust and discriminative partial-modal representations, with action recognition as the case study in this paper.

There is no publicly available infrared action recognition dataset except the infrared action recognition dataset (InfAR) [15]. To the best of our knowledge, there remains no publicly available action recognition dataset that contains both infrared and visible videos. In this paper, we introduce a new dataset that provides paired data in the infrared and visible spectrum. The dataset contains a large variety of action classes and samples, taking multiple aims into consideration. Researchers can evaluate their methods on the visible or thermal data channel, or use the two combined. More importantly, the dataset can be used to evaluate the transferable ability between the two data modalities and the discriminative ability of the jointly learned representation. Our dataset will be made publicly available as part of this submission.

3 Proposed approach

The overall pipeline of the proposed PM-GANs for action recognition is shown in Fig. 1. Our goal is to generate a full-modal representation using only partial modalities. The framework learns the transferable representation among different data channels based on a conditional generative network. Based on the transferred representation, the framework builds a discriminative full-modal representation network using only part of the data channels.

3.1 Transferable basis for partial modality

The transferable ability of the PM-GANs architecture is the basis for the construction of a full-modal representation from a partial modality. We assume that there exists a mapping from an observed distribution y and an input distribution x, producing an output representation that shares the features of the observed y. Therefore, we attempt to learn a generator G to generate the feature distribution of the missing data channel from the partially available distribution, denoted x. Based on the scheme of conditional generative networks, the generator transforms the partially available distribution x and noise z to output the missing distribution via the following objective:

min_G max_D L_GAN(G, D) = E_y[log D(y)] + E_{x,z}[log(1 - D(G(x, z)))],

where G(x, z) denotes the output distribution. The input distribution x and observed distribution y denote the data of the infrared and RGB channels, respectively, in our action recognition task. The generator G is designed to minimize this objective so as to fake the generated distribution as well as possible, while the discriminator D over the real output features tries to maximize its accuracy in telling the real distribution from the fake one.

In this work, the discriminative net is also designed for pattern recognition. Thus, another prediction loss is explored:

L_pred = ℓ(p, c),

where c denotes the correct label of the partially available data samples, in the form of a one-hot vector, and ℓ is the log loss over the predicted class-confidence vector p and the ground-truth label. For convenience, we denote the discriminator part and the predictor part of the discriminative net as D_d and D_p, respectively. Finally, the objective function can be formulated as:

min_G max_{D_d} E_y[log D_d(y)] + E_{x,z}[log(1 - D_d(G(x, z)))] + L_pred(D_p),

where the prediction term is minimized with respect to the predictor D_p.
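As an illustrative, hedged sketch (not the paper's implementation), the adversarial and prediction terms above can be evaluated numerically as follows; all array values and function names are invented for illustration:

```python
import numpy as np

# Toy numerical sketch of the two loss terms, assuming a sigmoid discriminator
# score per feature map and softmax class confidences. Names and values here
# are illustrative, not from the paper.

def adversarial_value(d_real, d_fake):
    """GAN value function: E[log D(real)] + E[log(1 - D(fake))]."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

def prediction_loss(class_probs, one_hot_labels):
    """Log loss between predicted class confidences and one-hot labels."""
    return -np.mean(np.sum(one_hot_labels * np.log(class_probs), axis=1))

d_real = np.array([0.9, 0.8, 0.95, 0.85])   # D's scores on real RGB features
d_fake = np.array([0.1, 0.2, 0.15, 0.05])   # D's scores on generated features
probs = np.array([[0.7, 0.2, 0.1],          # predictor's class confidences
                  [0.1, 0.8, 0.1]])
labels = np.array([[1, 0, 0],               # one-hot ground-truth labels
                   [0, 1, 0]])

# The discriminator maximizes the adversarial term, the generator minimizes
# it, and the predictor minimizes the log loss on the labeled samples.
objective = adversarial_value(d_real, d_fake) + prediction_loss(probs, labels)
```

In a full implementation these scalars would be produced by the discriminator's sigmoid output and the predictor's softmax output, respectively.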
3.2 Transferable Net

Figure 2: The proposed transferable generative net is built upon the C3D network [7]. Video clips are sent to the 3D ConvNet to obtain feature maps for each clip, and all clip feature maps are fused to represent the whole action video. Then, residual blocks are added to this net to produce fake feature maps resembling those of the visible spectrum.

The transferable net simulates the target distribution from the convolutional feature maps of the partially available data distribution, which, as shown in Fig. 2, are then fed to the generator to obtain the feature maps of the missing distribution. Each input clip yields a feature map x^t of size W × H × C, where W, H and C denote the width, height and number of channels of the feature maps. To incorporate all feature maps into a high-level representation, the sum fusion model in [52] is applied to compute the sum of the feature maps at the same spatial location (i, j) and feature channel c:

x^{sum}_{i,j,c} = Σ_t x^t_{i,j,c},

where 1 ≤ i ≤ W, 1 ≤ j ≤ H, 1 ≤ c ≤ C, and t ranges over the sampled clips. The final feature map of the input distribution, denoted x̄, is computed as the average of the sum feature map at each location. Then the generator takes the final input feature map x̄ and generates the fake target feature map x̂ = G(x̄, z). The generator consists of two residual blocks [53] and produces a feature map of the same size as the infrared feature map. Thus, the generative loss is expressed as:

L_G = E_{x,z}[log(1 - D_d(G(x̄, z)))],

where D_d denotes the discriminator part of the discriminative net.
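As a minimal sketch (with assumed dimensions), the clip-level sum fusion and averaging described above amount to:

```python
import numpy as np

# Sketch of the sum-fusion step, assuming each of T sampled clips yields a
# W x H x C convolutional feature map; the sizes here are invented.
T, W, H, C = 5, 7, 7, 512
clip_maps = np.random.rand(T, W, H, C)

# Sum the feature maps element-wise at the same spatial location (i, j)
# and feature channel c over all clips...
sum_map = clip_maps.sum(axis=0)

# ...then average over the clips to obtain the video-level input feature
# map fed to the generator's residual blocks.
video_map = sum_map / T
```

Note that averaging the sum map over clips is equivalent to `clip_maps.mean(axis=0)`.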
3.3 Discriminative net using partial modality

To enable the generative net to produce a full-modal representation that incorporates the complementary benefits among data of different modalities, a two-part discriminative net is designed, as shown in Fig. 1. The discriminative net contains a discriminator part and a predictor part. The discriminator part, denoted D_d, follows the scheme of the conventional discriminator in GANs and is applied to distinguish between real and fake visible feature maps in order to boost the quality of the generated fake features. Specifically, the discriminator part consists of a fully-connected layer followed by a sigmoid function, which produces an adversarial loss. Thus, the adversarial loss is defined as:

L_adv = E_y[log D_d(y)] + E_{x,z}[log(1 - D_d(G(x̄, z)))],

where L_adv encourages the discriminator network to distinguish the generated target feature representation from the real one.

The predictor aims to boost the accuracy of assigning the right label to each feature distribution. It consists of a fully-connected layer followed by a softmax layer, which takes the fusion of the feature maps of both the partially available data channel and the generated missing channel, and finally outputs the category-level confidences. To fuse these two feature maps, the convolutional fusion model in [52] is applied to automatically learn the fusion weights:

y^{conv} = y^{cat} ∗ f + b,

where f are 1 × 1 filters mapping the 2C stacked channels to C output channels, b is a bias term, and y^{cat} denotes the stack of the two feature maps at the same spatial locations across the feature channels c:

y^{cat}_{i,j,2c-1} = x̄_{i,j,c},  y^{cat}_{i,j,2c} = x̂_{i,j,c},

where x̂ denotes the generated fake feature map G(x̄, z).
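The stacking and 1×1 convolutional fusion just described can be sketched as follows, treating the 1×1 convolution as a per-location channel mixing (shapes and initialization are assumptions for illustration):

```python
import numpy as np

# Sketch of the predictor's fusion step: stack the infrared feature map and
# the generated RGB feature map across channels, then mix with learned 1x1
# filters. All shapes here are assumed.
W, H, C = 7, 7, 512
x_ir = np.random.rand(W, H, C)     # feature map of the available channel
x_gen = np.random.rand(W, H, C)    # generated map of the missing channel

# Stack the two maps at the same spatial locations across feature channels.
y_cat = np.concatenate([x_ir, x_gen], axis=2)     # (W, H, 2C)

# A 1x1 convolution over channels reduces to a matrix product per location:
# filters of shape (2C, C) plus a bias learn the fusion weights.
filters = np.random.randn(2 * C, C) * 0.01
bias = np.zeros(C)
y_conv = y_cat @ filters + bias                   # (W, H, C) fused map
```

In a trained network the filter weights would be learned jointly with the rest of the discriminative net rather than drawn at random.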

Thus, the predictive loss can be formulated as:

L_pred = ℓ(D_p(y^{conv}), c),

where D_p is the predictor, ℓ the log loss, and c the ground-truth label. Thus, the final discriminative loss can be defined as the weighted sum of the adversarial loss and the predictive loss:

L_D = λ_1 L_adv + λ_2 L_pred,

where λ_1 and λ_2 are the loss weights.
In the training process, the transferable net and the full-modal discriminative net are trained alternately until the generated features of the missing channel become close to real and the discriminative net achieves precise recognition. Detailed training strategies are given in Section 4. In the testing process, we only need to feed one or part of the data modalities into the PM-GANs framework: the generative net automatically generates a transferred feature representation for the missing modality, and the predictor of the discriminative net constructs a full-modal representation and predicts the label.
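The alternating schedule can be caricatured with a toy loop in which each net updates while the other is held fixed; the scalar "parameters" and update rules below are purely illustrative, not the paper's optimization:

```python
# Toy caricature of alternating adversarial training: the discriminator and
# the generator take turns updating while the other is held fixed. The scalar
# "parameters" and update rules are illustrative only.
g_param, d_param = 0.0, 0.0
for step in range(100):
    # Discriminator step: fix G, move D toward its current optimum.
    d_param += 0.1 * (1.0 - d_param)
    # Generator step: fix D, move G to track the discriminator's signal.
    g_param += 0.1 * (d_param - g_param)
```

The point of the alternation is that each step optimizes one net against a temporarily frozen opponent, so the two converge together rather than chasing a moving target within a single update.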

4 Experiments

In this section, we first introduce our new dataset for partial-modality infrared action recognition, elaborating its specifications and a complete evaluation. For the experimental part, we introduce the configurations of the experiments and present the results and analyses of our method. Specifically, our experiments are threefold. First, we assess the effectiveness of the transferable net by comparing the generated feature representations with the real ones. Second, we evaluate the ability of the discriminative net constructed using partial data modalities. Finally, we compare our approach with four state-of-the-art methods to verify the effectiveness of the PM-GANs.

4.1 Cross-modal Infrared-Visible Dataset for Action Recognition

We introduce a new action recognition dataset constructed from paired videos of the RGB and infrared data channels. Each action class contains a single action type, and each video sample contains one action class. In total there are 12 classes of individual and person-person interactive actions. The individual actions are: one-hand wave (wave1), multiple-hands wave (wave2), handclap, walk, jog, jump and skip; the interactive actions are: handshake, hug, push, punch and fight. For each action class, there are 100 paired videos, with a frame rate of fps. The frame resolutions are for the infrared channel and for the RGB channel. Each action is performed by 50 different volunteers. Sample frames are illustrated in Fig. 3. The duration distribution of the dataset is listed in Table 1.

In order to simulate real-world variations, four scenario variables are considered: background complexity, season, occlusion, and viewpoint.
Background Complexity: In our newly introduced dataset, the background varies from relatively simple scenes (plain background) to complex ones (with moving objects). For simple backgrounds, there are only one or two people performing actions, as shown in Fig. 3 (c), while for complex backgrounds, interrupting pedestrian activities co-occur with the target action to different degrees, as shown in Fig. 3 (d).
Season: The infrared channel is heavily affected by the seasons, because it reflects the heat radiation of objects. In winter, when the ambient temperature is low, the imaging of the human body is salient and clear; in summer, the contrast between human and background is ambiguous. Thus, we divide the seasons into three categories: winter, spring/autumn and summer, as shown in Fig. 3 (e)-(h). The video proportions of these three seasons are , , and , respectively. All actions were performed in all three seasons.
Occlusion: Videos with occlusions ranging from 0% to over 50% are arranged in each action class to promote the diversity and complexity of the dataset, as shown in Fig. 3 (a)-(b).
Viewpoint: Variation in viewpoint is another important factor considered. Video clips under front, left-side and right-side views are all included in the dataset, as shown in Fig. 3 (e)-(h).

Figure 3: Example paired frames for the action "wave2" in the newly introduced multi-modal dataset for action recognition. The left frames are from the infrared channel and the right frames from the RGB channel.
Action Duration
0-5s 5-10s >10s
Fight 27 66 7
Handclapping 45 53 2
Handshake 35 61 4
Hug 26 72 2
Jog 88 12 0
Jump 73 27 0
Punch 55 45 0
Push 55 45 0
Skip 82 18 0
Walk 63 37 0
Wave1 40 59 1
Wave2 42 58 0
Table 1: Duration distribution of videos per class.

We split % of the paired video clips as the training set, and use the rest as the testing set.

To investigate suitable representations for each spectrum and the most complementary representation pairs, we select several effective representations and test their discriminative ability on the RGB and infrared channels individually and combined.

We feed the original video clips, the MHI image clips [54] and the optical flow clips [41], denoted as "Org", "MHI" and "Optical Flow", into the 3D-CNN [7] to obtain spatiotemporal features. The 3D-CNN takes a 16-frame clip as input and performs 3D convolution and 3D pooling, which capture appearance and motion information simultaneously. Specifically, we extract the output of the last fully connected layer and apply max pooling over all clip features of one video. In the case of two-modality fusion, we directly concatenate the features of the infrared and RGB channels. A linear SVM classifier is then trained to obtain the final recognition results. The 3D-CNN is fine-tuned on the corresponding maps of our training set.
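The video-level feature pipeline above can be sketched as follows; the feature dimension is an assumption for illustration, and the real clip features would come from the fine-tuned 3D-CNN rather than random arrays:

```python
import numpy as np

# Sketch of the evaluation pipeline: each 16-frame clip yields one
# fully-connected feature vector from the 3D-CNN (dimension assumed).
n_clips, feat_dim = 5, 4096
clip_feats_ir = np.random.rand(n_clips, feat_dim)    # infrared clip features
clip_feats_rgb = np.random.rand(n_clips, feat_dim)   # RGB clip features

# Max-pool over all clip features of one video to get a video-level vector.
video_ir = clip_feats_ir.max(axis=0)
video_rgb = clip_feats_rgb.max(axis=0)

# For two-modality fusion, directly concatenate the two channels before
# training the linear SVM classifier on the fused vectors.
fused = np.concatenate([video_ir, video_rgb])
```

The fused vectors (one per video) would then be paired with action labels to train the linear SVM.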

Channel Descriptor Accuracy (%)
Infrared Org 55.00
Infrared Optical Flow 69.67
Infrared MHI 61.00
RGB Org 49.00
RGB Optical Flow 78.66
RGB MHI 65.33
Fusion Org 55.33
Fusion Optical Flow 80.67
Fusion MHI 68.67
Table 2: Evaluation results of different features on each channel and on their fusion, on the proposed dataset.

As shown in Table 2, the performance of different representations on the infrared and RGB channels, and their combined results, are listed. It is clearly observed that for both channels, the 3D-CNN features on optical flow achieve the best performance. In two-modality fusion, the 3D-CNN optical-flow features of the RGB channel effectively boost the performance over using the infrared channel only. Thus, in the following experiments on the transferable and discriminative nets, optical flow is selected as the input for representation learning via PM-GANs.

4.2 Implementation Details

For the input data, we compute optical flow using the toolbox of Liu [55]. The 3D ConvNet in the transferable generative net is fine-tuned on the infrared optical flow of the training set, and the adversarial visible feature maps are extracted from a 3D ConvNet fine-tuned on the visible optical flow of the training set. The number of sampled clips is set to 5, and each clip has a duration of 16 frames. The loss weights and are set as and respectively. We set the initial learning rate at . The whole network is trained with the ADAM optimization algorithm [56] with and , with a batch size of 30, on a single NVIDIA GeForce GTX TITAN X GPU with 12 GB memory. The framework is implemented using the TensorFlow library and accelerated by CUDA 8.0.

Figure 4: Confusion matrices of the results obtained with the proposed method.
Data Modalities Accuracy (%)
Infrared channel 71.67%
RGB channel 79.33%
Generated RGB 76.67%
Infrared + RGB channels 82.33%
Infrared channel + Generated RGB 78%
Table 3: Evaluation results on the discriminative ability of transferable modality.

4.3 Transferable Net Evaluations

The PM-GANs model is evaluated on the proposed action recognition dataset. We present the results of five different modality settings, as shown in Table 3. For a single modality, we use the 3D ConvNet part and the predictor part without the fusion model for training and testing. For the fusion of the real infrared and RGB channels, we directly input the real feature map of the RGB channel to the fusion model instead of the generated one. From Table 3, we observe that the generated RGB representations perform better than the original infrared ones, which shows that the PM-GANs have indeed discovered useful information through modality transfer. Moreover, the fusion of infrared and generated RGB representations achieves an accuracy of 78%. Although it performs worse than the original RGB channel and the fusion of real infrared and RGB channels, it uses only infrared information in the testing process.

To analyze intra-class performance, confusion matrices are drawn in Fig. 4. As observed, the proposed method generally shows good classification performance: in most classes, the testing samples are assigned the correct label. However, we notice that "punch" and "skip" samples are likely to be classified as "push" and "jump", respectively. One likely reason is that the two pairs of actions are similar in both movement and process, sometimes hard to distinguish even for human eyes.

To gain insight into how effective the transferable ability of PM-GANs is, we rearrange the training and testing splits. Specifically, we use the spring/autumn and summer scenes for training, and the winter scenes for testing, to examine the generalization ability of the proposed model. As can be seen in Table 4, the generated fake RGB representations again outperform the original infrared ones, which shows the robust transferability of PM-GANs.

Modalities of a separate split Accuracy (%)
Infrared channel 74.17%
RGB channel 79.44%
Generated RGB 77.78%
Infrared + RGB channels 82.78%
Infrared channel + Generated RGB 80.28%
Table 4: Evaluation of the model's generalization ability using a separate split.

4.4 Comparisons with Other Methods

To evaluate the effectiveness of PM-GANs, we compare our method with four state-of-the-art methods, including the most effective hand-crafted feature, iDT [44], and a state-of-the-art deep architecture [7]. In addition, we compare with two state-of-the-art frameworks for infrared action recognition [15, 57]. For iDT features, Fisher vectors [58] are applied for encoding, and a linear SVM classifier [59] is then trained for action classification. For the C3D architecture, the network is fine-tuned on the proposed training set; then max pooling followed by an SVM classifier is applied, as in the evaluation of Table 2. For [15], we follow the original experimental settings provided by the authors. For [57], we implement and select the configuration with the optimal results based on the original submission: we apply the discriminative code layer and the second fusion strategy for feature extraction, and train a K-nearest-neighbor (KNN) classifier [60] with the provided Gaussian kernel function for classification. Note that all results are achieved using unified optical flow as the input.

Method Accuracy (%)
iDT [44] 72.33%
C3D [7] 69.67%
Two-Stream CNN [15] 68%
Two-Stream 3D-CNN [57] 74.67%
PM-GANs 78%
Table 5: Comparisons with four state-of-the-art approaches.

Table 5 presents the accuracy of the competing approaches. As observed, the hand-crafted iDT method achieves results comparable to some deep architectures. Methods using 3D-CNNs outperform the method with a 2D-CNN architecture; one explanation is that the 3D-CNN architecture is better at modeling temporal variations. The two-stream 3D-CNN outperforms the conventional iDT framework and the robust C3D model, showing the strength of its discriminative code layer. Our proposed PM-GANs achieve the highest accuracy, which shows the effectiveness of the transferred feature representation and the robustness of our model constructed using only part of the data modalities.

5 Conclusions

In this paper, we proposed novel Partial-modal Generative Adversarial Networks (PM-GANs) to construct a discriminative full-modal representation with only part of the data modalities available. Our method learns the transferable representation among heterogeneous data modalities using adversarial learning, and builds a discriminative net that represents all modalities. Our method is evaluated on the task of action recognition and outperforms four state-of-the-art methods on a newly introduced dataset. The dataset, which contains paired videos in both the infrared and visible spectrum, will be made the first publicly available visible-infrared dataset for action recognition.


  • [1] Varol, G., Laptev, I., Schmid, C.: Long-term temporal convolutions for action recognition. IEEE transactions on pattern analysis and machine intelligence (2017)
  • [2] Fernando, B., Gavves, E., Oramas, J., Ghodrati, A., Tuytelaars, T.: Rank pooling for action recognition. IEEE transactions on pattern analysis and machine intelligence 39(4) (2017) 773–787
  • [3] Li, Z., Gavrilyuk, K., Gavves, E., Jain, M., Snoek, C.G.: Videolstm convolves, attends and flows for action recognition. Computer Vision and Image Understanding 166 (2018) 41–50
  • [4] Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in neural information processing systems. (2014) 568–576
  • [5] Wang, L., Qiao, Y., Tang, X.: Action recognition with trajectory-pooled deep-convolutional descriptors. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2015) 4305–4314
  • [6] Yang, L., Gao, C., Meng, D., Jiang, L.: A novel group-sparsity-optimization-based feature selection model for complex interaction recognition. In: Asian Conference on Computer Vision, Springer (2014) 508–521
  • [7] Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE international conference on computer vision. (2015) 4489–4497
  • [8] Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. (2014) 1725–1732
  • [9] Rautaray, S.S., Agrawal, A.: Vision based hand gesture recognition for human computer interaction: a survey. Artificial Intelligence Review 43(1) (2015) 1–54
  • [10] Lindtner, S., Hertz, G.D., Dourish, P.: Emerging sites of hci innovation: hackerspaces, hardware startups & incubators. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM (2014) 439–448
  • [11] Bouwmans, T., Zahzah, E.H.: Robust pca via principal component pursuit: A review for a comparative evaluation in video surveillance. Computer Vision and Image Understanding 122 (2014) 22–34
  • [12] Yang, Y., Zha, Z.J., Gao, Y., Zhu, X., Chua, T.S.: Exploiting web images for semantic video indexing via robust sample-specific loss. IEEE Transactions on Multimedia 16(6) (2014) 1677–1689
  • [13] Veeriah, V., Zhuang, N., Qi, G.J.: Differential recurrent neural networks for action recognition. In: Computer Vision (ICCV), 2015 IEEE International Conference on, IEEE (2015) 4041–4049
  • [14] van Gemert, J.C., Jain, M., Gati, E., Snoek, C.G., et al.: Apt: Action localization proposals from dense trajectories. In: BMVC. Volume 2. (2015)  4
  • [15] Gao, C., Du, Y., Liu, J., Lv, J., Yang, L., Meng, D., Hauptmann, A.G.: Infar dataset: Infrared action recognition at different times. Neurocomputing 212 (2016) 36–47
  • [16] Liu, Y., Lu, Z., Li, J., Yao, C., Deng, Y.: Transferable feature representation for visible-to-infrared cross-dataset human action recognition. Complexity 2018 (2018)
  • [17] Zollhöfer, M., Nießner, M., Izadi, S., Rehmann, C., Zach, C., Fisher, M., Wu, C., Fitzgibbon, A., Loop, C., Theobalt, C., et al.: Real-time non-rigid reconstruction using an rgb-d camera. ACM Transactions on Graphics (TOG) 33(4) (2014) 156
  • [18] Wu, A., Zheng, W.S., Yu, H.X., Gong, S., Lai, J.: Rgb-infrared cross-modality person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017) 5380–5389
  • [19] Pereira, J.C., Coviello, E., Doyle, G., Rasiwasia, N., Lanckriet, G.R., Levy, R., Vasconcelos, N.: On the role of correlation and abstraction in cross-modal multimedia retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(3) (2014) 521–535
  • [20] Wei, Y., Zhao, Y., Lu, C., Wei, S., Liu, L., Zhu, Z., Yan, S.: Cross-modal retrieval with cnn visual features: A new baseline. IEEE transactions on cybernetics 47(2) (2017) 449–460
  • [21] Kang, C., Xiang, S., Liao, S., Xu, C., Pan, C.: Learning consistent feature representation for cross-modal multimedia retrieval. IEEE Transactions on Multimedia 17(3) (2015) 370–381
  • [22] Feng, F., Wang, X., Li, R.: Cross-modal retrieval with correspondence autoencoder. In: Proceedings of the 22nd ACM international conference on Multimedia, ACM (2014) 7–16
  • [23] Castrejon, L., Aytar, Y., Vondrick, C., Pirsiavash, H., Torralba, A.: Learning aligned cross-modal representations from weakly aligned data. arXiv preprint arXiv:1607.07295 (2016)
  • [24] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in neural information processing systems. (2014) 2672–2680
  • [25] Denton, E.L., Chintala, S., Fergus, R., et al.: Deep generative image models using a laplacian pyramid of adversarial networks. In: Advances in neural information processing systems. (2015) 1486–1494
  • [26] Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
  • [27] Peng, Y., Qi, J., Yuan, Y.: Cm-gans: Cross-modal generative adversarial networks for common representation learning. arXiv preprint arXiv:1710.05106 (2017)
  • [28] Shin, H.C., Roth, H.R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D., Summers, R.M.: Deep convolutional neural networks for computer-aided detection: Cnn architectures, dataset characteristics and transfer learning. IEEE transactions on medical imaging 35(5) (2016) 1285–1298
  • [29] Shao, L., Zhu, F., Li, X.: Transfer learning for visual categorization: A survey. IEEE transactions on neural networks and learning systems 26(5) (2015) 1019–1034
  • [30] Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In: International Conference on Machine Learning. (2015) 1180–1189
  • [31] Long, M., Cao, Y., Wang, J., Jordan, M.: Learning transferable features with deep adaptation networks. In: International Conference on Machine Learning. (2015) 97–105
  • [32] Patel, V.M., Gopalan, R., Li, R., Chellappa, R.: Visual domain adaptation: A survey of recent advances. IEEE signal processing magazine 32(3) (2015) 53–69
  • [33] Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. In: Computer Vision and Pattern Recognition (CVPR). Volume 1. (2017)  4
  • [34] Hardoon, D.R., Szedmak, S., Shawe-Taylor, J.: Canonical correlation analysis: An overview with application to learning methods. Neural computation 16(12) (2004) 2639–2664
  • [35] Yeh, Y.R., Huang, C.H., Wang, Y.C.F.: Heterogeneous domain adaptation and classification by exploiting the correlation subspace. IEEE Transactions on Image Processing 23(5) (2014) 2009–2018
  • [36] Bronstein, M.M., Bronstein, A.M., Michel, F., Paragios, N.: Data fusion through cross-modality metric learning using similarity-sensitive hashing. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE (2010) 3594–3601
  • [37] Kumar, S., Udupa, R.: Learning hash functions for cross-view similarity search. In: IJCAI proceedings-international joint conference on artificial intelligence. Volume 22. (2011) 1360
  • [38] Rahmani, H., Mian, A., Shah, M.: Learning a deep model for human action recognition from novel viewpoints. IEEE transactions on pattern analysis and machine intelligence 40(3) (2018) 667–681
  • [39] Liu, A.A., Xu, N., Nie, W.Z., Su, Y.T., Wong, Y., Kankanhalli, M.: Benchmarking a multimodal and multiview and interactive dataset for human action recognition. IEEE Transactions on cybernetics 47(7) (2017) 1781–1794
  • [40] Klaser, A., Marszałek, M., Schmid, C.: A spatio-temporal descriptor based on 3d-gradients. In: BMVC 2008-19th British Machine Vision Conference, British Machine Vision Association (2008) 275–1
  • [41] Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, IEEE (2008) 1–8
  • [42] Laptev, I.: On space-time interest points. International journal of computer vision 64(2-3) (2005) 107–123
  • [43] Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Dense trajectories and motion boundary descriptors for action recognition. International journal of computer vision 103(1) (2013) 60–79
  • [44] Wang, H., Oneata, D., Verbeek, J., Schmid, C.: A robust and efficient video representation for action recognition. International Journal of Computer Vision 119(3) (2016) 219–238
  • [45] Li, T., Mei, T., Kweon, I.S., Hua, X.S.: Contextual bag-of-words for visual categorization. IEEE Transactions on Circuits and Systems for Video Technology 21(4) (2011) 381–392
  • [46] Sánchez, J., Perronnin, F., Mensink, T., Verbeek, J.: Image classification with the fisher vector: Theory and practice. International journal of computer vision 105(3) (2013) 222–245
  • [47] Delhumeau, J., Gosselin, P.H., Jégou, H., Pérez, P.: Revisiting the vlad image representation. In: Proceedings of the 21st ACM international conference on Multimedia, ACM (2013) 653–656
  • [48] Ji, S., Yang, M., Yu, K.: 3d convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis & Machine Intelligence 35(1) (2012) 221–231
  • [49] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. arXiv preprint arXiv:1705.07750 (2017)
  • [50] Han, J., Bhanu, B.: Human activity recognition in thermal infrared imagery. In: Computer Vision and Pattern Recognition-Workshops, 2005. CVPR Workshops. IEEE Computer Society Conference on, IEEE (2005) 17–17
  • [51] Han, J., Bhanu, B.: Fusion of color and infrared video for moving human detection. Pattern Recognition 40(6) (2007) 1771–1784
  • [52] Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. (2016)
  • [53] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2016) 770–778
  • [54] Bobick, A.F., Davis, J.W.: The recognition of human movement using temporal templates. Pattern Analysis & Machine Intelligence IEEE Transactions on 23(3) (2001) 257–267
  • [55] Liu, C., et al.: Beyond pixels: exploring new representations and applications for motion analysis. PhD thesis, Massachusetts Institute of Technology (2009)
  • [56] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [57] Jiang, Z., Rozgic, V., Adali, S.: Learning spatiotemporal features for infrared action recognition with 3d convolutional neural networks. (2017)
  • [58] Perronnin, F., Sánchez, J., Mensink, T.: Improving the fisher kernel for large-scale image classification. In: European conference on computer vision, Springer (2010) 143–156
  • [59] Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local svm approach. In: Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on. Volume 3., IEEE (2004) 32–36
  • [60] Bui, D.T., Nguyen, Q.P., Hoang, N.D., Klempe, H.: A novel fuzzy k-nearest neighbor inference model with differential evolution for spatial prediction of rainfall-induced shallow landslides in a tropical hilly area using gis. Landslides 14(1) (2017) 1–17