DiscrimNet: Semi-Supervised Action Recognition from Videos using Generative Adversarial Networks

01/22/2018 ∙ by Unaiza Ahsan, et al. ∙ Google ∙ Georgia Institute of Technology

We propose an action recognition framework using Generative Adversarial Networks. Our model involves training a deep convolutional generative adversarial network (DCGAN) using a large video activity dataset without label information. Then we use the trained discriminator from the GAN model as an unsupervised pre-training step and fine-tune the trained discriminator model on a labeled dataset to recognize human activities. We determine good network architectural and hyperparameter settings for using the discriminator from DCGAN as a trained model to learn useful representations for action recognition. Our semi-supervised framework using only appearance information achieves superior or comparable performance to the current state-of-the-art semi-supervised action recognition methods on two challenging video activity datasets: UCF101 and HMDB51.




1 Introduction

One of the biggest challenges in recognizing activities in videos is obtaining large labeled video datasets. Annotating videos is both expensive and cumbersome due to variations in viewpoint, scale and appearance within a video. This suggests a need for semi-supervised approaches to recognize actions in videos. One such approach is to use deep networks to learn a feature representation of videos without activity labels but with the temporal order of frames as a ‘weak supervision’ [1, 2]. This approach still requires some supervision in terms of deciding sampling strategies and related video encoding methods to input to neural networks (such as dynamic images [3]) and designing ‘good questions’ of correct/incorrect orders as input to the deep network.

Generative models such as the recently introduced Generative Adversarial Networks (GANs) [4] approximate high dimensional probability distributions like those of natural images using an adversarial process without requiring expensive labeling. To this end, our research question is:

How can we use abundant video data without labels to train a generative model such as a GAN and use it to learn action representation in videos with little to no supervision?

GANs are conventionally used to learn a data distribution of images starting from random noise. Adversarial learning in GANs involves two networks: a discriminator network and a generator network. The discriminator network is trained on two kinds of inputs: samples drawn from a high dimensional data source such as images, and samples that the generator produces from random noise. Its goal is to distinguish between real and generated samples. The generator network uses the output of the discriminator to generate ‘better’ samples. This minimax game aims to converge to a setting where the discriminator is unable to distinguish between real and generated samples. We propose to use the discriminator, trained only to differentiate between real and generated samples, for learning a feature representation of actions in videos.

We use the GAN setup to train a discriminator network and use the learned representation of the discriminator as the weight initialization. We then fine-tune that discriminator on a labeled video dataset such as UCF101 [5]. Recent works have done small experiments [6] but, to our knowledge, nobody has done an in-depth study that considers the architecture and hyperparameter settings that can yield good performance across datasets (we do well on HMDB51 too) using only appearance information in the video. This unsupervised pre-training step avoids manual feature engineering, video frame encoding and searching for the best video frame sampling technique, and results in action recognition performance competitive with the state-of-the-art using only appearance information.

Our key contributions and findings are:

  • We propose a systematic semi-supervised approach to learn action representations from videos using GANs.

  • We perform a comprehensive study of best practices to recognize actions from videos using the GAN training process as a good initialization step for recognition.

  • We find that appearance-based unsupervised pre-training for video action recognition performs better than or comparably to the state-of-the-art semi-supervised multi-stream video action recognition approaches.

  • Our unsupervised pre-training step does not require weak supervision or computationally expensive steps in the form of video frame encoding, video stabilization and search for best sampling strategies.

2 Related Work

To date, action recognition is one problem in Computer Vision where deep Convolutional Neural Networks (CNNs) have not outperformed hand-crafted features. Action recognition from videos has come a long way from holistic feature learning such as Motion Energy Image (MEI) and Motion History Image (MHI) [7], space-time volumes [8] and Action Banks [9] to local feature learning approaches such as space-time interest points [10], HOG3D [11], histogram of optical flow [12] and tracking feature trajectories [13, 14, 15, 16].

The recent success of CNNs in image recognition has enabled many researchers to treat a video as a set of RGB images, perform image classification on the video frames and aggregate the network predictions to achieve video level classification [17]. Our approach is also inspired by local appearance encoding methods for videos. 3D convolutional networks capture spatio-temporal features via 3D convolutions in both spatial and temporal domains [18]. Various fusion techniques are proposed to pool the temporal information to construct video descriptors [19, 20]. Recurrent Neural Networks (RNNs) and Long Short Term Memory (LSTM) networks have also been used to model videos for action recognition [21, 22]. Using multiple networks to model appearance and motion was first introduced by Simonyan and Zisserman [17] as the two-stream architecture, where the spatial stream is the standard VGG Net [23] and the temporal stream takes stacked optical flow fields as input. Wu et al. [24] added audio and LSTMs to the network to improve video classification performance. We do not experiment with multiple modalities in this paper as we use only RGB frames as input to the model for our proof of concept.

Generative models have been successfully used to avoid manual supervision in labeling videos, with the most common application being video frame prediction [25, 26, 27, 28, 29, 30, 31, 32]. Since appearance changes are smooth across videos, temporal consistency [33] and other constraints [34] are useful to learn video representations. Our work proposes a generative model as an unsupervised pre-training method for action recognition. Approaches that take temporal coherency into account, such as [1, 28, 35, 36], are similar to our work, but they differ in that enforcing temporal coherency still involves weak supervision [1], where good samples have to be pre-selected from a video. We do not use any weak supervision in our approach; we only use generative adversarial training as an unsupervised pre-training step to recognize actions.

Recently, the authors of [2] train a network to predict the odd video out of a set of videos, where the “odd one out” is a video with its frames in the wrong temporal order. The key difference between our work and theirs is that we do not require any weak supervision in terms of selecting the right video encoding method, sampling strategies or designing effective odd-one-out questions to improve accuracy. Another recent related approach is that of [37], where a network is trained to sort a tuple of frames from videos. This sequence sorting task forms the “unsupervised pretraining” step and the network is finetuned on labelled datasets. Our approach does not use weak supervision (such as selecting the right tuple via optical flow, for example) for the unsupervised pretraining task and uses only appearance information in this work.

Generative Adversarial Networks [4] have been used for semi-supervised feature learning, particularly after the introduction of Deep Convolutional GANs (or DCGANs) [38]. Radford et al. [38] use the discriminator (pre-trained on ImageNet) to compute features on the CIFAR10 dataset for classification. Other works that use GANs for semi-supervised learning [40, 41, 42, 43, 44] are all designed for image recognition, not videos.

A recent work is [6] where the authors train GANs for tiny video generation. They fine-tune their trained discriminator model on UCF101 and show promising results. However, their model is significantly more complicated and requires stabilized videos which involves SIFT [45] and RANSAC [46] computation per video frame, something that is not required by our method which achieves comparable accuracy after finetuning.

3 Approach

We briefly review the main principles behind GAN models and describe our methodology in detail to recognize actions by leveraging their unsupervised feature learning capability on videos.

3.1 Generative Adversarial Networks

GANs [4] exploit game theoretic approaches to train two different networks: a generator and a discriminator. The generator, represented by a function $G$ parameterized by $\theta_g$, starts with an input noise vector $z$ that is sampled from a normal distribution $p_z(z)$, up-samples this noise vector and outputs an image $G(z)$. The discriminator network is a CNN (represented by a function $D$) parameterized by $\theta_d$ that takes as input an image ($x$ (real image) or $G(z)$ (generated or fake image)) and outputs the probability that the input image is from the real distribution rather than the generated distribution. Training GANs involves a minimax game in which the generator attempts to ‘fool’ the discriminator into predicting a generated image as real whereas the discriminator attempts to identify correctly which input images are fake. The discriminator cost function is a cross entropy loss defined by:

$$J^{(D)}(\theta_d, \theta_g) = -\frac{1}{2}\,\mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big] - \frac{1}{2}\,\mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big] \tag{1}$$

The minimax objective from Equation 1 can be optimized using gradient-based methods since both discriminator and generator are functions ($D$ and $G$) that are differentiable with respect to their inputs and parameters [47]. The solution to this problem is a Nash equilibrium, as each network is trained to minimize its own cost while maximizing the other’s. GANs can be trained using Stochastic Gradient Descent (SGD) with any optimizer of choice.
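As a small numerical illustration of the cross-entropy discriminator cost described above, the sketch below evaluates it on hand-picked discriminator outputs. The function name `discriminator_loss`, the sample probabilities and the equal weights of 1/2 on the real and generated terms are assumptions made for illustration; this is not the training code used in this work.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Cross-entropy cost of a discriminator.

    d_real: probabilities D(x) assigned to real images (each in (0, 1)).
    d_fake: probabilities D(G(z)) assigned to generated images.
    Assumes equal weights of 1/2 on the real and generated terms.
    """
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return 0.5 * real_term + 0.5 * fake_term

# A confident, correct discriminator has a low cost ...
confident = discriminator_loss([0.9, 0.95], [0.1, 0.05])
# ... while a discriminator that cannot tell real from generated
# (it outputs 0.5 everywhere) has a cost of log(2).
confused = discriminator_loss([0.5, 0.5], [0.5, 0.5])
print(confident, confused)
```

The second case corresponds to the setting the minimax game aims for, where the discriminator can no longer separate real from generated samples.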

3.2 Training GANs with Video Frames

So far in the research community, GANs have been primarily used for sample generation. Thus, the focus has been on modifying generator parameters, network architecture and loss functions in order to generate higher resolution images with minimal artifacts. The discriminator network in all variants of GANs is trained with binary cross entropy loss (see Equation 1) [47]. Since our focus is not image generation but learning useful features to transfer to the task of action recognition, we are motivated to train and use the discriminator network in GANs for action recognition. The discriminator network in a GAN learns a representation of local appearance features, thus modeling objects and scenes in video frames as context. Lastly, it does so in an unsupervised manner, i.e. we do not require explicit labels for objects, scenes or actions to pre-train our action recognition model.

Consider a set of videos $V$ where $v_i \in V$, $i = 1, \dots, N$, and $N$ is the number of videos in the dataset. Each video consists of a variable number of frames (sampled at the rate of one frame per second). We use all the frames in the training set of videos from two challenging video activity datasets without any label information to train the GAN model. Our approach is shown in Figure 1. We train GANs using a variety of techniques proposed in prior research to generate images. To compare with GANs pre-trained on an object recognition dataset, we also train a GAN model on ImageNet [48] images. We use the same architecture as proposed in the DCGAN [38] paper since the authors have demonstrated the transfer learning capability of the DCGAN model on the CIFAR10 dataset.
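The one-frame-per-second sampling mentioned above can be sketched as follows. This is a minimal illustration; the helper name `sample_frame_indices` and its parameters are hypothetical, and the frame extraction tool actually used is not specified in the text.

```python
def sample_frame_indices(num_frames, fps, frames_per_second=1.0):
    """Return indices of frames sampled at a fixed rate.

    num_frames: total number of frames in the video.
    fps: native frame rate of the video.
    frames_per_second: desired sampling rate (1 frame per second here).
    """
    if num_frames <= 0 or fps <= 0 or frames_per_second <= 0:
        return []
    step = fps / frames_per_second  # native frames between two samples
    indices = []
    k = 0
    while int(k * step) < num_frames:
        indices.append(int(k * step))
        k += 1
    return indices

# A 4-second clip recorded at 25 frames per second yields 4 frames.
print(sample_frame_indices(100, 25))
```

Because videos differ in length, the number of sampled frames per video varies, which is why each video contributes a variable number of frames to GAN training.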

Figure 2: Results after 100 epochs of running DCGAN [38] on UCF101 video frames. The images in the top three rows are real while those on the bottom are generated by the model

3.3 Unsupervised Pre-training

When dealing with small datasets, a CNN’s generalization performance decreases, so that the test accuracy remains low even while training accuracy may increase. This is why a common practice is to initialize the weights of the layers with ImageNet pre-trained CNN weights instead of training from scratch. This is referred to as supervised pre-training since ImageNet labels have been used to determine the initial weights.

Our approach is different in that we are trying to do unsupervised pre-training - determining starting weights for a CNN model (discriminator) which is pre-trained without label information using adversarial training. This unsupervised pre-training setup is compared with initializing the weights in the discriminator network using other settings and we show that the GAN-based initialization significantly outperforms other initialization strategies on the test set of UCF101.

3.4 Fine-tuning Discriminator Model

In this step of our approach we initialize the network with the learned weights from adversarial training and fine-tune it on two video activity datasets. In the process of fine-tuning, we are faced with numerous choices of network architecture, learning rate schemes, optimization and data augmentation. We explore the space of these variations and report all results on the test split of the UCF101 dataset.

Figure 3: Sample frames from the UCF101 dataset [5] with action classes (from top to bottom): apply eye makeup, juggling balls and rowing

4 Experiments

4.1 Datasets

UCF101 [5] is a benchmark action recognition dataset comprising 13320 YouTube videos of 101 action categories. Actions include human-object interactions such as ‘apply lipstick’, body motion such as ‘handstand walking’, human-human interactions, playing musical instruments and sports. The dataset is small but challenging in that the videos vary in viewpoint, illumination, camera motion and blur. The second dataset we experiment on is the HMDB51 dataset [49], which contains 6766 videos of 51 actions such as chew, eat and laugh. Sample frames from both datasets are shown in Figures 3 and 4.

4.2 Unsupervised Pre-training

This section describes three experiments to determine: (a) whether GANs can generate action images, (b) the training protocol of GANs and (c) the data augmentation steps.

Figure 4: Sample frames from the HMDB51 dataset [49]

Can GANs Generate Action Images?

Since we consider a video as a set of RGB frames, we address the first question: Are GANs, traditionally used for generating faces, objects and scenes, capable of generating an image representing an action? This question is crucial to address because it determines the validity of using the trained GAN discriminator as a CNN and fine-tuning it on a labelled video activity dataset. To answer this question, we use all the videos from train split 1 of UCF101 [5] and sample 1 frame per second from each video. We train a DCGAN model with default parameters and after 100 epochs, obtain the results shown in Figure 2. From visual inspection we can see that vanilla DCGAN is able to learn a coarse representation of activities involving humans. The question now remains whether we can use the feature representation learned by the GAN’s discriminator as an unsupervised pre-training step to classify actions in labeled video action recognition datasets.

Training Protocol of GANs:

We use DCGAN’s public implementation in Torch and train three separate GAN models: one with UCF101 video frames, a second with ImageNet [48] images and a third with frames from a subset of the Sports1M dataset [50]. We train all three models for 100 epochs using the architectural guidelines proposed in [38], namely, batch normalization in the discriminator as well as the generator, leaky Rectified Linear Units (leaky ReLU) in all layers of the discriminator, strided convolutions in the discriminator instead of pooling layers and fractional-strided convolutions in the generator. There are no fully-connected (FC) layers in the DCGAN architecture as the authors of [38] report no loss in generator performance for not including FC layers. Hence we also use the same architecture for training the GAN model.
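To make the role of strided convolutions concrete, the sketch below traces how the spatial resolution shrinks through a DCGAN-style discriminator without any pooling layers. The layer settings (four 4 x 4 convolutions with stride 2 and padding 1, followed by one 4 x 4 convolution with stride 1 and no padding) are assumed here for illustration and are not stated in the text; the function names are hypothetical.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1

def discriminator_feature_sizes(input_size=64):
    """Trace the spatial size through an assumed DCGAN-style discriminator.

    Assumption: four 4x4 convolutions with stride 2 and padding 1,
    followed by one 4x4 convolution with stride 1 and no padding.
    """
    layers = [(4, 2, 1)] * 4 + [(4, 1, 0)]  # (kernel, stride, padding)
    sizes = [input_size]
    for kernel, stride, padding in layers:
        sizes.append(conv_output_size(sizes[-1], kernel, stride, padding))
    return sizes

print(discriminator_feature_sizes())  # [64, 32, 16, 8, 4, 1]
```

Under these assumptions each stride-2 convolution halves the resolution, so the downsampling that pooling layers would normally provide is learned by the convolutions themselves.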

Data Augmentation:

The main difference between our GAN training and the DCGAN [38] approach is that DCGAN [38] performs data augmentation by taking 64 x 64 random crops of the image as well as scaling the pixel values to the range [-1,1]. This scaling is done for the tanh activation function in the generator. We change that protocol and avoid random cropping. We only scale the pixel values of the video frames to the range [-1,1] and resize the frames to 64 x 64. We avoid random cropping because the action frames from videos are much larger and contain much more information than the original images used for training DCGAN (bedrooms, faces and the like). Taking random crops from action frames will not result in a useful representation because too much information will be lost. Thus, we only resize the images to 64 x 64, as our aim is not just to generate action images but to learn an effective action representation for recognition.
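The preprocessing described above (resize the whole frame, then scale intensities to [-1,1], with no cropping) could look like the following sketch. It operates on a single-channel frame stored as a list of rows; the helper names and the use of nearest-neighbour interpolation are assumptions, since the interpolation method is not stated.

```python
def scale_pixel(value):
    """Map an 8-bit intensity in [0, 255] to the range [-1, 1]."""
    return value / 127.5 - 1.0

def resize_nearest(image, out_h=64, out_w=64):
    """Resize a 2D list of pixel rows with nearest-neighbour sampling.

    The whole frame is kept (no cropping); only the resolution changes.
    """
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

def preprocess(image):
    """Resize a frame to 64 x 64 and scale its values to [-1, 1]."""
    return [[scale_pixel(v) for v in row] for row in resize_nearest(image)]

# A synthetic 240 x 320 single-channel frame with a horizontal gradient.
frame = [[(c * 255) // 319 for c in range(320)] for _ in range(240)]
small = preprocess(frame)
print(len(small), len(small[0]), min(small[0]), max(small[0]))
```

The [-1,1] range matches the output range of a tanh generator, so real and generated frames reach the discriminator on the same scale.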

4.3 Fine-tuning for Action Recognition

Here we describe the set of experiments conducted after the GAN model has been trained. We use the pre-trained discriminator network from our GAN model and fine-tune it on the two labeled video action datasets: UCF101 [5] and HMDB51 [49]. We begin by replacing the last spatial convolutional layer (CONV5) with one that has the correct number of outputs (equal to the number of action classes). See Figure 5. This layer is initialized randomly and the network is trained again with the previous layers initialized with the pre-trained discriminator’s weights.

We perform a comprehensive experimental analysis of architectural choices, hyperparameter settings and other good practices and report the accuracy on the test set of UCF101 dataset.

Does Source Data Distribution Matter?

In this experiment, we determine whether the dataset we train GAN with (which we refer to as the source dataset) determines performance on the target dataset (the labeled dataset on which we fine-tune the discriminator network). To this end, we train DCGAN on three large scale datasets: ImageNet [48] images, UCF101 [5] video frames and frames of 10,000 videos from Sports1M [19] dataset. We use the same sampling strategy of 1 frame per second for both video datasets and train all three GAN models separately for 100 epochs.

For each video $v_i$, there is a set of frames $F_i = \{f_{i,1}, \dots, f_{i,n_i}\}$, where $n_i$ is the number of frames extracted for video $v_i$. Each video’s frames are passed through the trained GAN’s discriminator and we extract CONV4’s activations as features on each frame. We average frame-level features to obtain video-level features. We train a linear SVM classifier [53] on top of these features using train/test split 1 provided by the dataset authors and obtain classification accuracy on the test set. We use the same setting for training all three GAN models as described in the training protocol earlier. Our results are shown in Table 1.
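The averaging of frame-level features into a video-level feature can be sketched as below. The function name `video_feature` and the toy three-dimensional vectors are illustrative assumptions; real CONV4 activations would be much longer vectors.

```python
def video_feature(frame_features):
    """Average per-frame feature vectors into one video-level vector."""
    if not frame_features:
        raise ValueError("a video needs at least one frame feature")
    dim = len(frame_features[0])
    if any(len(f) != dim for f in frame_features):
        raise ValueError("all frame features must have the same length")
    n = len(frame_features)
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]

# Three frames of one video, each described by a 3-dimensional feature.
frames = [[1.0, 0.0, 2.0], [3.0, 2.0, 2.0], [2.0, 4.0, 2.0]]
print(video_feature(frames))  # [2.0, 2.0, 2.0]
```

Averaging gives every video a fixed-length descriptor regardless of how many frames were sampled, which is what a linear SVM needs as input.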

| Source dataset | UCF101 accuracy (%) | HMDB51 accuracy (%) |
|---|---|---|
| ImageNet | 43.88 | 12.82 |
| UCF101 | 47.20 | 12.94 |
| Sports1M | 42.50 | 13.02 |

Table 1: Comparing the accuracy on the destination (target) datasets for three large scale source datasets used to train GAN models

As can be seen from Table 1, training a GAN with UCF101 frames results in the best test accuracy on UCF101 (47.20%), and the gap to training with ImageNet (43.88%) or Sports1M (42.50%) frames is significant. Note that we did not use all videos from the Sports1M dataset; we randomly selected 10,000 videos from the dataset, extracted 1 frame per second from each video and used those frames to train the GAN model. For the HMDB51 dataset, the difference in test accuracy between a GAN discriminator pre-trained on UCF101 and one pre-trained on the other datasets is small: all three sources fall between 12.82% and 13.02%. The benefit of training the GAN model on frames from the target action dataset is nevertheless demonstrated by the UCF101 results, and the features learned by the discriminator network are strong enough to transfer to other video datasets as well.

Figure 5: Our network architecture: DCGAN discriminator architecture on the left and our added layers on the right

Choice of Architecture:

There are several ways of changing the architecture of the pre-trained discriminator network for fine-tuning. Note that the discriminator is just another CNN with spatial convolutional layers and no fully connected layers. For fine-tuning on the UCF101 dataset, we replace the last convolutional layer (CONV4) with one that has the correct number of outputs, initialize this layer randomly and train this network (fine-tune) for 160 epochs. This fine-tuning experiment is called ‘CONV4’ in Table 2. Network depth determines the model’s performance both in theory and practice [54]. Hence we add another convolutional layer (CONV5) and a fully connected layer (FC), initialize them from scratch and retrain the network till convergence. We extract CONV4, CONV5 and FC features from the finetuned network. We concatenate CONV4 and CONV5 features and test the performance, and do the same for CONV4, CONV5 and FC features. We do not freeze any layers before fine-tuning and keep a learning rate of 0.001 to fine-tune the network. We empirically found that freezing the earlier layers and finetuning only the last layer(s) did not increase performance. We train a linear SVM on top of the extracted features and compute results on UCF101’s test set. Our results are shown in Table 2. Our network architecture is shown in Figure 5.

| Architectural changes | Test accuracy (%) |
|---|---|
| CONV4 | 48.35 |
| CONV4 + CONV5 + FC | 49.30 |
| CONV4 + CONV5 | 50.12 |

Table 2: Effect of making the network deeper: Adding more layers slightly improves action recognition performance

Our results in Table 2 show that, with all other parameters kept the same, adding a convolutional layer and a fully connected layer to the discriminator network architecture results in only a slight improvement in performance. This may seem counterintuitive, but it happens because we initialize the added network layers randomly before fine-tuning. Also, the UCF101 frame dataset is not very large, with 84,747 frames in the training set and 33,187 frames in the test set. This may lead to overfitting, resulting in only a slight increase in performance on the test set, especially when the fully connected layer is added.

To reduce overfitting, we add dropout [55] after the additional convolutional and fully connected layers. We note the performance with/without dropout by extracting CONV4 features from both networks (after finetuning) and training a linear SVM. Adding dropout regularizes the network more thus increasing the performance on test set of UCF101.

Fine-tuning vs Linear SVM:

Once we fine-tune the discriminator model on the datasets, we have a choice of whether to extract CONV4’s activations and train a linear SVM on top of them or to classify directly with the fine-tuned softmax layer. We do both in our experiments and note that the outcome depends on the dataset. We find that when we fine-tune the discriminator network on UCF101, the test set accuracy using softmax is lower than extracting CONV4 features and training a linear SVM to recognize actions. However, when using HMDB51, softmax classification on the test set results in a higher accuracy than extracting CONV4 features and training a linear SVM classifier. This result is shown in Table 3.

| Dataset | CONV4 + linear SVM accuracy (%) | Softmax accuracy (%) |
|---|---|---|
| UCF101 | 48.35 | 41.40 |
| HMDB51 | 14.40 | 21.04 |

Table 3: Comparing two ways of evaluating fine-tuned network performance on UCF101 and HMDB51 test sets

From Table 3 it is apparent that for UCF101, feature embedding and training a linear SVM results in better accuracy than softmax classification. The opposite is true for the HMDB51 dataset. We dig deeper to investigate why this happens. We find that the label distribution of the dataset on which a deep network is fine-tuned is key to determining which method yields better test accuracy. The label distribution of the UCF101 test set is shown in Figure 6. This distribution is not balanced, while that of HMDB51 is completely balanced in terms of the number of videos per action category. Hence it appears that when classes are unbalanced, and since we have not used a weighted loss in the neural network, the linear SVM makes better use of the features, resulting in higher performance on the test set. In the case of HMDB51, all classes are equally balanced, leading to the superior performance of the softmax classifier over the feature embedding approach.
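A weighted loss was not used in these experiments. For readers who want to see what such a weighting could look like, the sketch below computes inverse-frequency class weights, one common choice for unbalanced labels. The function name and the toy labels are hypothetical, and nothing here is claimed to change the reported results.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class weights proportional to 1 / class frequency.

    Weights are normalised so that a perfectly balanced dataset
    gives every class a weight of 1.0.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

balanced = inverse_frequency_weights(["a", "b", "a", "b"])
skewed = inverse_frequency_weights(["a", "a", "a", "b"])
print(balanced)  # {'a': 1.0, 'b': 1.0}
print(skewed)    # the rare class 'b' receives the larger weight
```

On a balanced dataset such as HMDB51 all weights equal 1.0, so the weighted and unweighted losses coincide; the weights only matter when class counts differ, as in UCF101.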

Figure 6: Label distributions of UCF101 test set. The HMDB51 dataset has uniform distribution of 30 videos per action class

Unsupervised Pretraining vs Random Initialization

We validate the use of our unsupervised pre-training approach by comparing it with a network that is initialized randomly. We initialize all the layers of the network using ‘xavier’ initialization. The authors of [56], who proposed this scheme, recommend initializing weights by drawing from a distribution with zero mean and variance given by:

$$\mathrm{Var}(W) = \frac{2}{n_{\mathrm{in}} + n_{\mathrm{out}}}$$

where $W$ is the distribution which the neuron is initialized with, $n_{\mathrm{in}}$ is the number of neurons feeding into the layer and $n_{\mathrm{out}}$ is the number of output neurons from this layer. We initialize all layers with this scheme and train the network till convergence on UCF101. For HMDB51, we train a network for 50 epochs with xavier initialized layers and compare that to our proposed discriminator initialized method at 50 epochs. The results are shown in Table 4 and clearly validate the use of our unsupervised pretraining approach to initialize the network before finetuning. As a reference, a supervised ImageNet pretrained network finetuned on UCF101 yields an accuracy of 67.1% and finetuned on HMDB51 yields an accuracy of 28.5% [1].
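A minimal sketch of the random-initialization baseline is given below, assuming the variance 2 / (n_in + n_out) proposed in [56]. Drawing from a uniform distribution is an assumption made here for illustration (a zero-mean normal with the same variance would serve equally well), and the helper name `xavier_uniform` and the layer sizes are hypothetical.

```python
import math
import random

def xavier_uniform(n_in, n_out, rng=random):
    """Draw an n_out x n_in weight matrix with variance 2 / (n_in + n_out).

    A uniform distribution on [-a, a] has variance a**2 / 3, so
    a = sqrt(6 / (n_in + n_out)) gives the desired variance.
    """
    a = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-a, a) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)  # fixed seed so the check is repeatable
weights = xavier_uniform(256, 128, rng)
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
var = sum((w - mean) ** 2 for w in flat) / len(flat)
print(var, 2.0 / (256 + 128))  # empirical vs. target variance
```

Using both the fan-in and the fan-out in the denominator keeps the scale of activations and gradients roughly constant across layers, which is the motivation for this scheme.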

| Initialization | UCF101 (%) | HMDB51 (%) |
|---|---|---|
| Xavier + finetuning | 33.10 | 11.6 |
| DiscrimNet (ours) + finetuning | 49.30 | 20.4 |

Table 4: Validating the use of our unsupervised pretraining approach vs training with random initialization

4.4 Comparison with the state-of-the-art

We compare our approach with several recent semi-supervised baselines which recognize actions in videos. The baselines are:

  • STIP features: Handcrafted Space Time Interest Point (STIP) features introduced by [10].

  • DrLim [57]: This method uses temporal coherency by minimizing the L2 distance metric between features of neighboring frames in videos and enforcing a margin between far apart frames.

  • TempCoh [31]: Enforces temporal coherence by using L1 distance instead of L2. Similar to DrLim [57].

  • Obj. Patch [35]: They extract similar object patches using videos and learn a representation of objects by tracking them through time. This model is used and fine-tuned on UCF101 by [1].

  • Shuffle [1]: They use sequence verification as an unsupervised pre-training step for videos. The model is then fine-tuned on UCF101.

  • VideoGAN [6]: They generate tiny videos using a two stream GAN network. Their model is fine-tuned on UCF101.

  • O3N [2]: They use odd-one-out networks to predict the wrong temporal order from the right ones. Their model is then fine-tuned on UCF101.

  • OPN [37]: They train a network to predict the order of 4-tuple frames. Their model is then fine-tuned on UCF101.

The results are shown in Table 5 and Table 6.

| Method | UCF101-split1 (%) |
|---|---|
| STIP features [12] | 43.9 |
| DrLim [57] | 45.7 |
| TempCoh [31] | 45.4 |
| Obj. Patch [35] | 40.7 |
| Shuffle [1] | 50.9 |
| VideoGAN [6] | 52.1 |
| O3N [2] | 60.3 |
| OPN [37] | 56.3 |
| DiscrimNet (ours) CONV4 + linear SVM | 49.33 |
| DiscrimNet (ours) CONV5 + linear SVM | 48.88 |
| DiscrimNet (ours) (CONV4 + CONV5) + linear SVM | 50.12 |

Table 5: Comparing our method to state-of-the-art semi-supervised approaches on UCF101

| Method | HMDB51 (%) |
|---|---|
| DrLim [57] | 16.3 |
| TempCoh [31] | 15.9 |
| Obj. Patch [35] | 15.6 |
| Shuffle [1] | 19.8 |
| O3N [2] | 32.5 |
| OPN [37] | 22.1 |
| DiscrimNet (ours) (fine-tuned) | 21.0 |

Table 6: Comparing our method to state-of-the-art semi-supervised approaches on HMDB51

5 Discussion

Our comparison with several state-of-the-art semi-supervised approaches to recognize actions in videos yields important insights. Our results show competitive performance compared to the state-of-the-art approaches in semi-supervised learning given that:

  • We only use appearance features and do not experiment with the motion content of the video. This is especially intriguing given that our method outperforms STIP features on UCF101. All of the methods we compare against use temporal coherency as a signal and encode motion.

  • We do not use weak supervision in the form of temporal consistency and do not design temporal order based networks. The only training signal provided to the GAN is the distinction between real images and images generated from noise.

  • Our model outperforms several state-of-the-art approaches on HMDB51 even though no video from that dataset was used in the unsupervised pre-training step of this approach. This shows the domain adaptation capability of GAN discriminator networks and that they are able to capture enough information to learn a useful representation of actions in video frames.

The methods that outperform our proposed approach are either computationally expensive or require much more supervision in the form of selecting sampling strategies, video encoding methods or in the case of O3N networks [2], designing effective odd-one-out questions for the network to learn feature representations for action recognition.

6 Conclusion

We propose an unsupervised pre-training method using GANs for action recognition in videos. Our method does not require weak supervision in the form of temporal coherency, sampling selection or video encoding methods. Using appearance information alone, our method performs either better than or comparably to the state-of-the-art semi-supervised action recognition methods.