One of the biggest challenges in recognizing activities in videos is obtaining large labeled video datasets. Annotating videos is both expensive and cumbersome due to variations in viewpoint, scale and appearance within a video. This suggests a need for semi-supervised approaches to recognizing actions in videos. One such approach is to use deep networks to learn a feature representation of videos without activity labels, using the temporal order of frames as ‘weak supervision’ [1, 2]. This approach still requires some supervision: deciding sampling strategies and related video encoding methods to input to neural networks (such as dynamic images) and designing ‘good questions’ of correct/incorrect orders as input to the deep network.
Generative models such as the recently introduced Generative Adversarial Networks (GANs) approximate high dimensional probability distributions, like those of natural images, using an adversarial process without requiring expensive labeling. To this end, our research question is: How can we use abundant video data without labels to train a generative model such as a GAN and use it to learn action representations in videos with little to no supervision?
GANs are conventionally used to learn a data distribution of images starting from random noise. Adversarial learning in GANs involves two networks: a discriminator network and a generator network. The discriminator network is trained on two kinds of inputs – one consisting of samples drawn from a high dimensional data source such as images and the other consisting of samples produced from random noise. Its goal is to distinguish between real and generated samples. The generator network uses the output of the discriminator to generate ‘better’ samples. This minimax game aims to converge to a setting where the discriminator is unable to distinguish between real and generated samples. We propose to use the discriminator, trained only to differentiate between real and generated samples, to learn a feature representation of actions in videos.
We use the GAN setup to train a discriminator network and use the learned discriminator representation as initialization weights, then fine-tune the discriminator on a labeled video dataset such as UCF101. Recent works have performed small-scale experiments along these lines, but to our knowledge no in-depth study exists that considers all the architecture and hyperparameter settings that yield good performance across datasets (we also perform well on HMDB51) using only appearance information in the video. This unsupervised pre-training step avoids any manual feature engineering, video frame encoding and searching for the best video frame sampling technique, and results in action recognition performance competitive with the state of the art using only appearance information.
Our key contributions and findings are:
We propose a systematic semi-supervised approach to learn action representations from videos using GANs.
We perform a comprehensive study of best practices to recognize actions from videos using the GAN training process as a good initialization step for recognition.
We find that appearance-based unsupervised pre-training for video action recognition performs comparably or superior to state-of-the-art semi-supervised multi-stream video action recognition approaches.
Our unsupervised pre-training step does not require weak supervision or computationally expensive steps in the form of video frame encoding, video stabilization and search for best sampling strategies.
2 Related Work
To date, action recognition is one problem in Computer Vision where deep Convolutional Neural Networks (CNNs) have not outperformed hand-crafted features. Action recognition from videos has come a long way from holistic feature learning such as Motion Energy Image (MEI) and Motion History Image (MHI), space-time volumes and Action Banks to local feature learning approaches such as space-time interest points, HOG3D, histograms of optical flow and tracking feature trajectories [13, 14, 15, 16].
The recent success of CNNs in image recognition has enabled many researchers to treat a video as a set of RGB images, perform image classification on the video frames and aggregate the network predictions to achieve video level classification. Our approach is also inspired by local appearance encoding methods for videos. 3D convolutional networks capture spatio-temporal features via 3D convolutions in both spatial and temporal domains. Various fusion techniques have been proposed to pool the temporal information to construct video descriptors [19, 20, 21, 22]. Using multiple networks to model appearance and motion was first introduced by Simonyan and Zisserman: the two-stream architecture, where the spatial stream is the standard VGG Net and the temporal stream network takes stacked optical flow fields as input. Wu added audio and LSTMs to the network to improve video classification performance. We do not experiment with multiple modalities in this paper as we use only RGB frames as input to the model for our proof of concept.
Generative models have been successfully used to avoid manual supervision in labeling videos, with the most common application being video frame prediction [25, 26, 27, 28, 29, 30, 31, 32]. Since appearance changes are smooth across videos, temporal consistency and other constraints are useful for learning video representations. Our work proposes a generative model as an unsupervised pre-training method for action recognition. While approaches that take temporal coherency into account such as [1, 28, 35, 36] are similar to our work, they differ in that enforcing temporal coherency still involves weak supervision, where good samples have to be pre-selected from a video. We do not use any weak supervision in our approach but only use generative adversarial training as an unsupervised pre-training step to recognize actions.
Fernando et al. train a network to predict the odd video out of a set of videos where the “odd one out” is a video with its frames in the wrong temporal order. The key difference between our work and theirs is that we do not require any weak supervision in terms of selecting the right video encoding method, sampling strategies or designing effective odd-one-out questions to improve accuracy. Another recent related approach is that of Lee et al., where a network is trained to sort a tuple of frames from videos. This sequence sorting task forms the “unsupervised pretraining” step and the network is finetuned on labelled datasets. Our approach does not use weak supervision (such as selecting the right tuple via optical flow, for example) for the unsupervised pretraining task and uses only appearance information in this work.
Radford et al. use the discriminator (pre-trained on ImageNet) to compute features on the CIFAR10 dataset for classification. Other works that use GANs for semi-supervised learning [40, 41, 42, 43, 44] are all designed for image recognition, not videos.
A recent related work is that of Vondrick et al., where the authors train GANs for tiny video generation. They fine-tune their trained discriminator model on UCF101 and show promising results. However, their model is significantly more complicated and requires stabilized videos, which involves SIFT and RANSAC computation per video frame, something that is not required by our method, which achieves comparable accuracy after finetuning.
We briefly review the main principles behind GAN models and describe in detail our methodology for recognizing actions by leveraging their unsupervised feature learning capability on videos.
3.1 Generative Adversarial Networks
GAN networks exploit game theoretic approaches to train two different networks: a generator and a discriminator. The generator, represented by a function G parameterized by θ_g, starts with an input noise vector z sampled from a normal distribution, up-samples this noise distribution and outputs an image G(z). The discriminator is a CNN (represented by a function D parameterized by θ_d) that takes as input an image (either a real image x or a generated, i.e. fake, image G(z)) and outputs the probability that the input image comes from the real distribution rather than the generated one. Training GANs involves a minimax game in which the generator attempts to ‘fool’ the discriminator into predicting a generated image as real, whereas the discriminator attempts to correctly identify which input images are fake. The discriminator cost function is a cross entropy loss, and the minimax objective is:

min_G max_D V(D, G) = E_{x ~ p_data(x)}[log D(x)] + E_{z ~ p_z(z)}[log(1 − D(G(z)))]    (1)

The minimax objective in Equation 1 can be optimized using gradient-based methods since both the discriminator and the generator are functions (D and G) that are differentiable with respect to their inputs and parameters θ_d and θ_g. The solution to this game is a Nash equilibrium, as both networks are trained to minimize their own cost while maximizing the other’s objective. GANs can be trained using Stochastic Gradient Descent (SGD) with any optimizer of choice.
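As a minimal numeric sketch of these losses (NumPy; the probability values below are hypothetical discriminator outputs, not measured ones):

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy cost for the discriminator (Equation 1).

    d_real: D(x) probabilities on real images (ideally near 1)
    d_fake: D(G(z)) probabilities on generated images (ideally near 0)
    """
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator loss: maximize log D(G(z))."""
    return -np.mean(np.log(d_fake))

# A discriminator that is confident on both reals and fakes has a low cost...
d_real = np.array([0.9, 0.95, 0.8])   # hypothetical D outputs on real frames
d_fake = np.array([0.1, 0.05, 0.2])   # hypothetical D outputs on fakes
confident = discriminator_loss(d_real, d_fake)

# ...while a maximally confused discriminator (D = 0.5 everywhere), the
# equilibrium of the minimax game, has cost 2*log(2).
confused = discriminator_loss(np.full(3, 0.5), np.full(3, 0.5))
print(confident, confused)
```

At equilibrium the discriminator cost is exactly 2 log 2 ≈ 1.386, which is the value the minimax game converges to when real and generated samples become indistinguishable.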
3.2 Training GANs with Video Frames
So far in the research community, GANs have primarily been used for sample generation. Thus, the focus has been on modifying generator parameters, network architecture and loss functions in order to generate higher resolution images with minimal artifacts. The discriminator network in all variants of GANs is trained with a binary cross entropy loss (see Equation 1). Since our focus is not image generation but learning useful features to transfer to the task of action recognition, we are motivated to train and use the discriminator network in GANs for action recognition. The discriminator network in a GAN learns a representation of local appearance features, thus modeling objects and scenes in video frames as context. Lastly, it does so in an unsupervised manner, i.e. we do not require explicit labels for objects, scenes or actions to pre-train our action recognition model.
Consider a set of videos V = {v_1, …, v_N}, where N is the number of videos in the dataset. Each video consists of a variable number of frames (sampled at the rate of one frame per second). We use all the frames in the training set of videos from two challenging video activity datasets, without any label information, to train the GAN model. Our approach is shown in Figure 1. We train GANs using a variety of techniques proposed in prior research to generate images. To compare with GANs pre-trained on an object recognition dataset, we also train a GAN model on ImageNet images. We use the same architecture as proposed in the DCGAN paper, since its authors demonstrated the transfer learning capability of the DCGAN model on the CIFAR10 dataset.
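The one-frame-per-second sampling can be sketched as a small helper (a hypothetical function, not the authors' exact pipeline):

```python
def sample_frame_indices(total_frames: int, fps: int) -> list:
    """Indices of the frames kept when sampling one frame per second
    from a video with the given frame rate."""
    step = max(1, fps)
    return list(range(0, total_frames, step))

# e.g. a 10-second clip at 25 fps yields 10 sampled frames
indices = sample_frame_indices(total_frames=250, fps=25)
print(len(indices), indices[:3])
```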
3.3 Unsupervised Pre-training
When dealing with small datasets, a CNN’s generalization performance decreases, so that test accuracy remains low even as training accuracy increases. This is why a common practice is to initialize the layer weights with ImageNet pre-trained CNN weights instead of training from scratch. This is referred to as supervised pre-training, since ImageNet labels have been used to determine the initial weights.
Our approach is different in that we perform unsupervised pre-training: determining the starting weights for a CNN model (the discriminator), which is pre-trained without label information using adversarial training. We compare this unsupervised pre-training setup with other ways of initializing the weights of the discriminator network and show that GAN-based initialization significantly outperforms other initialization strategies on the test set of UCF101.
3.4 Fine-tuning Discriminator Model
In this step of our approach, we initialize the network with the weights learned from adversarial training and fine-tune it on two video activity datasets. In the process of fine-tuning, we are faced with numerous choices of network architecture, learning rate schemes, optimization and data augmentation. We explore the space of these variations and report all results on the test split of the UCF101 dataset.
UCF101 is a benchmark action recognition dataset comprising 13320 YouTube videos of 101 action categories. Actions include human-object interactions such as ‘apply lipstick’, body motions such as ‘handstand walking’, human-human interactions, playing musical instruments and sports. The dataset is small but challenging in that the videos vary in viewpoint, illumination, camera motion and blur. The second dataset we experiment on is the HMDB51 dataset, which contains 6766 videos of 51 actions such as ‘chew’, ‘eat’ and ‘laugh’. Sample frames from both datasets are shown in Figures 3 and 4.
4.2 Unsupervised Pre-training
This section describes three experiments to determine: (a) whether GANs can generate action images, (b) the training protocol of GANs and (c) the data augmentation steps.
Can GANs Generate Action Images?
Since we consider a video as a set of RGB frames, we address the first question: are GANs, traditionally used for generating faces, objects and scenes, capable of generating an image representing an action? This question is crucial because it determines the validity of using the trained GAN discriminator as a CNN and fine-tuning it on a labelled video activity dataset. To answer this question, we use all the videos from train split 1 of UCF101 and sample one frame per second from each video. We train a DCGAN model with default parameters and, after 100 epochs, obtain the results shown in Figure 2. From visual inspection we can see that the vanilla DCGAN is able to learn a coarse representation of activities involving humans. The question now remains whether we can use the feature representation learned by the GAN’s discriminator as an unsupervised pre-training step to classify actions in labeled video action recognition datasets.
Training Protocol of GANs:
We use DCGAN’s public implementation in Torch and train three separate GAN models: one with UCF101 video frames, a second with ImageNet images and a third with frames from a subset of the Sports1M dataset. We train all three models for 100 epochs using the architectural guidelines proposed in the DCGAN paper, namely batch normalization in all layers of the discriminator, strided convolutions in the discriminator instead of pooling layers and fractional-strided convolutions in the generator. There are no fully-connected (FC) layers in the DCGAN architecture, as its authors report no loss in generator performance from omitting FC layers. Hence we also use the same architecture for training the GAN model.
The original DCGAN training protocol performs data augmentation by taking 64 x 64 random crops of the image as well as scaling the images to the range [-1, 1]. This scaling is done for the tanh activation function in the generator. We change that protocol and avoid random cropping: we only scale the frames of videos to the range [-1, 1] and resize them to 64 x 64. We avoid random cropping because action frames from videos are much larger and contain much more information than the images originally used for training DCGAN (bedrooms, faces and the like). Taking random crops from action frames would not result in a useful representation because too much information would be lost. Thus, we only scale the images to 64 x 64, as our aim is not just to generate action images but to learn an effective action representation for recognition.
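This preprocessing can be sketched as follows (NumPy; the nearest-neighbour resize here is a simple stand-in for a proper image resize):

```python
import numpy as np

def preprocess_frame(frame: np.ndarray, size: int = 64) -> np.ndarray:
    """Resize a uint8 HxWx3 frame to size x size (nearest neighbour,
    standing in for a real resize) and scale pixels to [-1, 1] to match
    the generator's tanh output range. No random cropping is applied."""
    h, w = frame.shape[:2]
    rows = np.linspace(0, h - 1, size).astype(int)
    cols = np.linspace(0, w - 1, size).astype(int)
    resized = frame[rows][:, cols]
    return resized.astype(np.float32) / 127.5 - 1.0

# hypothetical 240x320 RGB video frame
frame = np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8)
out = preprocess_frame(frame)
print(out.shape, float(out.min()), float(out.max()))
```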
4.3 Fine-tuning for Action Recognition
Here we describe the set of experiments conducted after the GAN model has been trained. We take the pre-trained discriminator network from our GAN model and fine-tune it on the two labeled video action datasets: UCF101 and HMDB51. We begin by replacing the last spatial convolutional layer (CONV5) with one that has the correct number of outputs (equal to the number of action classes); see Figure 5. This layer is initialized randomly and the network is trained again, with the previous layers initialized with the pre-trained discriminator’s weights.
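A dictionary-of-arrays sketch of this head replacement (layer names and tensor shapes are illustrative, not the actual DCGAN weights):

```python
import numpy as np

np.random.seed(0)
NUM_CLASSES = 101  # UCF101 action classes

def init_finetune_weights(discriminator_weights: dict) -> dict:
    """Keep all pre-trained discriminator layers; replace the last
    spatial convolutional layer ('conv5' here, a hypothetical name)
    with a randomly initialized layer whose output channel count
    equals the number of action classes."""
    weights = dict(discriminator_weights)     # keep pre-trained layers
    in_ch = weights["conv5"].shape[1]         # input channels unchanged
    k = weights["conv5"].shape[2]             # kernel size unchanged
    weights["conv5"] = np.random.randn(NUM_CLASSES, in_ch, k, k) * 0.01
    return weights

# hypothetical pre-trained shapes: (out_channels, in_channels, kH, kW)
pretrained = {
    "conv1": np.random.randn(64, 3, 4, 4),
    "conv5": np.random.randn(1, 512, 4, 4),   # 1 output: real vs fake
}
finetune = init_finetune_weights(pretrained)
print(finetune["conv5"].shape)
```

The earlier layers keep their adversarially learned weights, so only the new classification head starts from scratch.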
We perform a comprehensive experimental analysis of architectural choices, hyperparameter settings and other good practices and report the accuracy on the test set of UCF101 dataset.
Does Source Data Distribution Matter?
In this experiment, we determine whether the dataset we train GAN with (which we refer to as the source dataset) determines performance on the target dataset (the labeled dataset on which we fine-tune the discriminator network). To this end, we train DCGAN on three large scale datasets: ImageNet  images, UCF101  video frames and frames of 10,000 videos from Sports1M  dataset. We use the same sampling strategy of 1 frame per second for both video datasets and train all three GAN models separately for 100 epochs.
For each video v_i, there is a set of frames {f_1, …, f_{n_i}}, where n_i is the number of frames extracted for video v_i. Each video’s frames are passed through the trained GAN’s discriminator and we extract CONV4’s activations as features for each frame. We average frame-level features to obtain video-level features. We train a linear SVM classifier on top of these features using train/test split 1 provided by the dataset authors and obtain classification accuracy on the test set. We use the same setting for training all three GAN models as described in the training protocol earlier. Our results are shown in Table 1.
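The frame-to-video feature pooling can be sketched as (a minimal NumPy version; the 512-dimensional CONV4 features are hypothetical):

```python
import numpy as np

def video_feature(frame_features: np.ndarray) -> np.ndarray:
    """Average per-frame CONV4 activations over a video's frames to
    obtain one fixed-length video-level feature vector, regardless of
    how many frames the video has."""
    return frame_features.mean(axis=0)

# hypothetical video with 7 sampled frames, each a 512-d CONV4 feature
frames = np.random.randn(7, 512)
feat = video_feature(frames)
print(feat.shape)
```

A linear SVM would then be trained on one such vector per video.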
|Source Dataset||Destination Dataset (accuracy %)|
As can be seen from Table 1, training a GAN with UCF101 frames results in the best test accuracy on both UCF101 and HMDB51. The difference between training the GAN model with ImageNet or Sports1M frames and training it with UCF101 frames is significant. Note that we did not use all videos from the Sports1M dataset; we randomly selected 10,000 videos, extracted one frame per second from each video and used those frames to train the GAN model. For the HMDB51 dataset, the difference in test accuracy between using a GAN discriminator pre-trained on UCF101 and on the other datasets is not very large. But the superior performance of training a GAN model with video action frames is clearly demonstrated by this experiment. The features learned by the discriminator network are strong enough to transfer to other video datasets as well.
Choice of Architecture:
There are several ways of changing the architecture of the pre-trained discriminator network for fine-tuning. Note that the discriminator is just another CNN with spatial convolutional layers and no fully connected layers. For fine-tuning on the UCF101 dataset, we replace the last convolutional layer (CONV4) with one that has the correct number of outputs, initialize this layer randomly and fine-tune this network for 160 epochs. This fine-tuning experiment is called ‘CONV4’ in Table 2. Network depth determines a model’s performance both in theory and in practice. Hence we add another convolutional layer (CONV5) and a fully connected layer (FC), initialize them from scratch and retrain the network till convergence. We extract CONV4, CONV5 and FC features from the finetuned network. We test the performance of the concatenated CONV4 and CONV5 features, as well as the concatenation of CONV4, CONV5 and FC features. We do not freeze any layers before fine-tuning and keep a learning rate of 0.001 to fine-tune the network. We empirically found that freezing the earlier layers and finetuning only the last layer(s) did not increase performance. We train a linear SVM on top of the extracted features and compute results on UCF101’s test set. Our results are shown in Table 2. Our network architecture is shown in Figure 5.
|Architectural changes||Test Accuracy (%)|
|CONV4 + CONV5 + FC||49.30|
|CONV4 + CONV5||50.12|
Our results in Table 2 show that, with all other parameters kept the same, adding a convolutional layer and a fully connected layer to the discriminator network results in only a slight improvement in performance. This may seem counterintuitive, but the reason is that the added layers are initialized randomly before fine-tuning. Also, the UCF101 frame dataset is not very large, with 84,747 frames in the training set and 33,187 frames in the test set. This can lead to overfitting, resulting in only a slight increase in performance on the test set, especially when the fully connected layer is added.
To reduce overfitting, we add dropout after the additional convolutional and fully connected layers. We compare the performance with and without dropout by extracting CONV4 features from both networks (after finetuning) and training a linear SVM. Adding dropout regularizes the network, increasing performance on the test set of UCF101.
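As a reminder of the mechanism, inverted dropout can be sketched in a few lines of NumPy (a generic illustration, not the exact implementation used here):

```python
import numpy as np

np.random.seed(0)

def dropout(activations: np.ndarray, p: float = 0.5, training: bool = True):
    """Inverted dropout: zero each unit with probability p at training
    time and rescale the survivors by 1/(1-p) so the expected
    activation is unchanged. At test time it is the identity."""
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) >= p).astype(activations.dtype)
    return activations * mask / (1.0 - p)

x = np.ones((4, 8), dtype=np.float32)
y = dropout(x, p=0.5)          # surviving units become 2.0, dropped ones 0.0
print(y.shape)
```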
Fine-tuning vs Linear SVM:
Once we fine-tune the discriminator model on the datasets, we have a choice: extract CONV4’s activations and train a linear SVM on top of them, or fine-tune the last layers with a softmax classifier. We do both in our experiments and note that the outcome depends on the dataset. We find that when we fine-tune the discriminator network on UCF101, the test set accuracy using softmax is lower than that obtained by extracting CONV4 features and training a linear SVM to recognize actions. However, when using HMDB51, softmax classification on the test set results in higher accuracy than extracting CONV4 features and training a linear SVM classifier. This result is shown in Table 3.
|Accuracy (%) on test set|
|CONV4 + linear SVM||Softmax|
From Table 3 it is apparent that for UCF101, feature embedding plus a linear SVM results in better accuracy than softmax classification. The complete opposite is true for the HMDB51 dataset. We dug deeper to investigate why this happens. We find that the label distribution of the dataset on which the network is fine-tuned is key to determining which method yields better test accuracy. The label distribution of the UCF101 test set is shown in Figure 6. This distribution is not balanced, while that of HMDB51 is completely balanced in terms of the number of videos per action category. Hence it appears that when classes are unbalanced, since we have not used a weighted loss in the neural network, the linear SVM learns the features better, resulting in increased performance on the test set. In the case of HMDB51, all classes are balanced, leading to the superior performance of the softmax classifier over the feature embedding approach.
Unsupervised Pretraining vs Random Initialization
We validate our unsupervised pre-training approach by comparing it with a network that is initialized randomly. We initialize all layers of the network using ‘xavier’ initialization, whose authors recommend drawing weights from a zero-mean distribution with variance

Var(W) = 2 / (n_in + n_out),

where n_in is the number of neurons feeding into the layer and n_out is the number of output neurons of the layer. We initialize all layers with this scheme and train the network till convergence on UCF101. For HMDB51, we train a network for 50 epochs with xavier-initialized layers and compare it to our proposed discriminator-initialized network at 50 epochs. The results are shown in Table 4 and clearly validate the use of our unsupervised pretraining approach to initialize the network before finetuning. As a reference, a supervised ImageNet-pretrained network finetuned on UCF101 yields an accuracy of 67.1% and finetuned on HMDB51 yields an accuracy of 28.5%.
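To make the scheme concrete, here is a minimal NumPy sketch (the layer sizes are hypothetical) that draws xavier-initialized weights and checks the empirical variance against 2 / (n_in + n_out):

```python
import numpy as np

np.random.seed(0)

def xavier_init(n_in: int, n_out: int) -> np.ndarray:
    """Draw a weight matrix from a zero-mean Gaussian with variance
    2 / (n_in + n_out), following the Glorot initialization scheme."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return np.random.randn(n_out, n_in) * std

# hypothetical fully connected layer: 512 inputs, 256 outputs
w = xavier_init(n_in=512, n_out=256)
target = 2.0 / (512 + 256)
print(w.shape, w.var(), target)   # empirical variance is close to target
```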
|Initialization||UCF101 (%)||HMDB51 (%)|
|Xavier + finetuning||33.10||11.6|
|DiscrimNet (ours) + finetuning||49.30||20.4|
4.4 Comparison with the state-of-the-art
We compare our approach with several recent semi-supervised baselines which recognize actions in videos. The baselines are:
STIP features: Handcrafted Space Time Interest Point (STIP) features introduced by .
DrLim : This method uses temporal coherency by minimizing the L2 distance metric between features of neighboring frames in videos and enforcing a margin between far apart frames.
Shuffle : They use sequence verification as an unsupervised pre-training step for videos. The model is then fine-tuned on UCF101.
VideoGAN : They generate tiny videos using a two stream GAN network. Their model is fine-tuned on UCF101.
O3N : They use odd-one-out networks to predict the wrong temporal order from the right ones. Their model is then fine-tuned on UCF101.
OPN : They train a network to predict the order of 4-tuple frames. Their model is then fine-tuned on UCF101.
|STIP features ||43.9|
|Obj. Patch ||40.7|
|DiscrimNet (ours) CONV4 + linear SVM||49.33|
|DiscrimNet (ours) CONV5 + linear SVM||48.88|
|DiscrimNet (ours) (CONV4 + CONV5) + linear SVM||50.12|
Our comparison with several state-of-the-art semi-supervised approaches to recognize actions in videos yields important insights. Our results show competitive performance compared to the state-of-the-art approaches in semi-supervised learning given that:
We only use appearance features and do not experiment with motion content of the video. This is especially intriguing given that our method outperforms STIP features on this dataset. All methods in the results we compare to use temporal coherency as a signal and do motion encoding.
We do not do weak supervision in the form of temporal consistency and do not design temporal order based networks. The only supervision provided to the GAN is the difference between a real image and noise.
Our model outperforms several state-of-the-art approaches on HMDB51 given that no video from the dataset was used in the unsupervised pre-training step of this approach. This shows the domain adaptation capability of GAN discriminator networks and that they are able to capture enough information to learn useful representation of actions in video frames.
The methods that outperform our proposed approach are either computationally expensive or require much more supervision in the form of selecting sampling strategies, video encoding methods or in the case of O3N networks , designing effective odd-one-out questions for the network to learn feature representations for action recognition.
We propose an unsupervised pre-training method using GANs for action recognition in videos. Our method does not require weak supervision in the form of temporal coherency, sampling selection or video encoding methods. Purely on appearance information alone, our method performs either better than or comparable to the state-of-the-art semi-supervised action recognition methods.
-  Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pages 527–544. Springer, 2016.
-  Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould. Self-supervised video representation learning with odd-one-out networks. arXiv preprint arXiv:1611.06646, 2016.
-  Hakan Bilen, Basura Fernando, Efstratios Gavves, Andrea Vedaldi, and Stephen Gould. Dynamic image networks for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3034–3042, 2016.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
-  Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
-  Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. arXiv preprint arXiv:1609.02612, 2016.
-  Aaron F. Bobick and James W. Davis. The recognition of human movement using temporal templates. IEEE Transactions on pattern analysis and machine intelligence, 23(3):257–267, 2001.
-  Alper Yilmaz and Mubarak Shah. Actions sketch: A novel action representation. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 984–989. IEEE, 2005.
-  Sreemanananth Sadanand and Jason J Corso. Action bank: A high-level representation of activity in video. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1234–1241. IEEE, 2012.
-  Ivan Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123, 2005.
-  Alexander Klaser, Marcin Marszałek, and Cordelia Schmid. A spatio-temporal descriptor based on 3d-gradients. In BMVC 2008-19th British Machine Vision Conference, pages 275–1. British Machine Vision Association, 2008.
-  Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld. Learning realistic human actions from movies. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
-  Ross Messing, Chris Pal, and Henry Kautz. Activity recognition using the velocity histories of tracked keypoints. In 2009 IEEE 12th international conference on computer vision, pages 104–111. IEEE, 2009.
-  Pyry Matikainen, Martial Hebert, and Rahul Sukthankar. Trajectons: Action recognition through the motion analysis of tracked features. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on, pages 514–521. IEEE, 2009.
-  Yu-Gang Jiang, Qi Dai, Xiangyang Xue, Wei Liu, and Chong-Wah Ngo. Trajectory-based modeling of human actions with motion reference points. In European Conference on Computer Vision, pages 425–438. Springer, 2012.
-  Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, pages 3551–3558, 2013.
-  Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
-  Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence, 35(1):221–231, 2013.
-  Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
-  Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4694–4702, 2015.
-  Moez Baccouche, Franck Mamalet, Christian Wolf, Christophe Garcia, and Atilla Baskurt. Sequential deep learning for human action recognition. In International Workshop on Human Behavior Understanding, pages 29–39. Springer, 2011.
-  Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.
-  Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.
-  Zuxuan Wu, Yu-Gang Jiang, Xi Wang, Hao Ye, Xiangyang Xue, and Jun Wang. Fusing multi-stream deep networks for video classification. arXiv preprint arXiv:1509.06086, 2015.
-  Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
-  Yipin Zhou and Tamara L Berg. Temporal perception and prediction in ego-centric video. In Proceedings of the IEEE International Conference on Computer Vision, pages 4498–4506, 2015.
-  Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating the future by watching unlabeled video. arXiv preprint arXiv:1504.08023, 2015.
-  Ross Goroshin, Joan Bruna, Jonathan Tompson, David Eigen, and Yann LeCun. Unsupervised learning of spatiotemporally coherent metrics. In Proceedings of the IEEE International Conference on Computer Vision, pages 4086–4093, 2015.
-  Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using lstms. CoRR, abs/1502.04681, 2, 2015.
-  Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
-  Hossein Mobahi, Ronan Collobert, and Jason Weston. Deep learning from temporal coherence in video. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 737–744. ACM, 2009.
-  Graham W Taylor, Rob Fergus, Yann LeCun, and Christoph Bregler. Convolutional learning of spatio-temporal features. In European conference on computer vision, pages 140–153. Springer, 2010.
-  Zhang Zhang and Dacheng Tao. Slow feature analysis for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):436–450, 2012.
-  Dinesh Jayaraman and Kristen Grauman. Slow and steady feature analysis: higher order temporal coherence in video. arXiv preprint arXiv:1506.04714, 2015.
-  Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2015.
-  Xiaolong Wang, Ali Farhadi, and Abhinav Gupta. Actions ~ transformations. arXiv preprint arXiv:1512.00795, 2015.
-  Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 667–676, 2017.
-  Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
-  Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
-  Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. arXiv preprint arXiv:1606.03657, 2016.
-  Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. arXiv preprint arXiv:1606.03498, 2016.
-  Jost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015.
-  Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
-  Augustus Odena. Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583, 2016.
-  David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.
-  Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
-  Ian Goodfellow. Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
-  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
-  H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In Proceedings of the International Conference on Computer Vision (ICCV), 2011.
-  Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
-  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
-  Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
-  Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. Liblinear: A library for large linear classification. Journal of machine learning research, 9(Aug):1871–1874, 2008.
-  Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.
-  Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research, 15(1):1929–1958, 2014.
-  Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
-  Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In Computer vision and pattern recognition, 2006 IEEE computer society conference on, volume 2, pages 1735–1742. IEEE, 2006.