DistInit: Learning Video Representations without a Single Labeled Video

01/26/2019 ∙ by Rohit Girdhar, et al. ∙ 0

Video recognition models have progressed significantly over the past few years, evolving from shallow classifiers trained on hand-crafted features to deep spatiotemporal networks. However, labeled video data required to train such models has not been able to keep up with the ever increasing depth and sophistication of these networks. In this work we propose an alternative approach to learning video representations that requires no semantically labeled videos, and instead leverages the years of effort in collecting and labeling large and clean still-image datasets. We do so by using state-of-the-art models pre-trained on image datasets as "teachers" to train video models in a distillation framework. We demonstrate that our method learns truly spatiotemporal features, despite being trained only using supervision from still-image networks. Moreover, it learns good representations across different input modalities, using completely uncurated raw video data sources and with different 2D teacher models. Our method obtains strong transfer performance, outperforming standard techniques for bootstrapping video architectures from image-based models and obtains competitive performance with state-of-the-art approaches for video action recognition.




1 Introduction

Figure 1: Learning video representations through transfer.

Traditional approaches to transfer learning follow the process on the left: train deep models on large, well-labeled datasets and fine-tune on a specific task or dataset of choice. This approach, while hugely popular, significantly limits the types of models we can use for our specific task, as they must be “compatible” with the model pre-trained on the large dataset for the learned weights to transfer. This problem is further accentuated in the case of videos, where datasets tend to be small or weakly labeled, and models tend to involve 3D/(2+1)D convolutions, making them “incompatible” with image models. We propose an approach, DistInit, to transfer image models to video, as shown on the right. DistInit starts from models pre-trained on well-labeled image datasets with object or scene labels, and uses them as “teachers” for supervising video models. Hence, the video model is able to learn spatiotemporal features for video understanding, without needing an explicit action label for that video.

Recent times have seen a significant boost in visual recognition performance [16, 15, 23], primarily fueled by the resurgence of convolutional neural networks (CNNs) and the success of deep learning [26], along with the availability of large and well-labeled datasets [34, 52, 28]. This has caused a paradigm shift in computer vision. While before the emergence of deep learning most of the community effort was focused on hand-designing better feature extractors [5, 29, 25], now the most prominent approaches train deep models end-to-end, learning features as part of this process. However, deep learning has been transformative not just because models started to work, but because models also transferred. The most dominant illustration of this is the use of ImageNet [34] pre-training for image understanding tasks. It is a near-ubiquitous practice these days, as it has been shown to yield strong improvements for a wide range of tasks, from image classification on small datasets [22] to pixel-labeling tasks like detection and segmentation [15]. Such pre-training is an empirically effective approach to knowledge transfer, where “knowledge” is manifested as labeled and curated datasets.

However, deep learning has not been quite as transformative for video understanding. One of the first deep-learning attempts at human action recognition [20] achieved only marginal improvements over the previous state-of-the-art hand-crafted features. Since then, various deep models [37, 42, 7] have been proposed. However, the performance improvements have been largely incremental, with some of the biggest gains coming from the recent introduction of a large-scale dataset [4, 21] that enables effective pre-training.

This raises a natural question: is time-consuming manual labeling of large-scale video datasets the most practical means to achieve big performance improvements in video understanding? We decompose this question into two parts:

1. How does one obtain the “right” labels for a video? Previous work has attempted to define ontologies and collect videos for each of the classes, either through Web search [39, 24, 21] or by asking humans to act and record videos [36]. More recent work has avoided predefining explicit ontologies, and has attempted to “discover” useful action classes from video logs of people's daily lives [11]. These datasets have used action classes ranging from broad and diverse, such as swimming or running [21], to fine-grained and nuanced, such as “snuggling with a pillow” [36]. Evidently, the right vocabulary is unclear and largely up for debate. In contrast, labels of objects seem to be much better understood, taking advantage of existing linguistic knowledge bases such as WordNet. Moreover, we humans are able to learn about behaviors and the dynamics of the world even without such explicit labels. In this work, we attempt to answer this question by learning video representations without action labels, relying solely on image-based models, or teachers, to supervise the video network.

2. How do we transfer the rich knowledge embedded in well-curated datasets [34, 52, 28] into our video models? Previous work has tackled this by initializing weights from models pre-trained on such datasets, a procedure popularly referred to as fine-tuning. Recent work has even done so for 3D CNNs, by “inflating” the 2D kernels to 3D [4, 9, 8]. However, all these approaches place severe restrictions on the video architectures, essentially forcing them to be 3D equivalents of still-image architectures. Instead, as Figure 1 shows, we advocate for a general approach of knowledge transfer by distillation, which allows us to transfer knowledge from arbitrary image-based teachers to any spatiotemporal architecture. We refer to our approach as DistInit. Previous work [14] has tackled a similar problem of cross-modal distillation to train architectures across different input modalities, such as depth or flow. However, the goal there is to transfer supervision from one modality to another for the same end task. In contrast, we transfer supervision from image models trained for object/scene recognition to video models for human action recognition. To our knowledge, image-to-video distillation has not been explored before. This task opens up a series of design choices that we thoroughly investigate through an extensive ablation study involving the distillation output, the loss, frame selection criteria, and the form of the still-image supervision.

DistInit leads to a significant 16% improvement over from-scratch training on the HMDB dataset, getting almost halfway to the improvement provided by pre-training on a fully-supervised dataset like Kinetics [21]. From-scratch training is the de facto standard for state-of-the-art architectures that cannot be initialized or inflated from image architectures [44]. While large-scale video datasets like Kinetics now provide an alternate path for pre-training, DistInit does so without requiring any video data curation. As we show in Section 4.3, it is able to learn competitive representations from an internal uncurated dataset of random web videos. This is in contrast to previous works [31, 6, 12] on unsupervised learning that use ImageNet without labels but still potentially benefit from the data curation.

2 Related Work

Video understanding, specifically for the task of human action recognition, is a well studied problem in computer vision. Analogously to the progress of image-based recognition methods, which have advanced from hand-crafted features [29, 5] to modern deep networks [41, 16, 38], video understanding methods have also evolved from hand-designed models [47, 46, 25] to deep spatiotemporal networks [42, 37]. However, while image based recognition has seen dramatic gains in accuracy, improvements in video analysis have been more modest. In the still-image domain, deep models have greatly benefited from the availability of well-labeled datasets, such as ImageNet [34] or Places [52].

Until recently, video datasets have either been well-labeled but small [24, 39, 36], or large but weakly-labeled [20, 1]. A recently introduced dataset, Kinetics [21], is currently the largest well-annotated dataset, with around 300K videos labeled into 400 categories (we note a larger version with 600K videos in 600 categories was recently released). It is nearly two orders of magnitude larger than previously established benchmarks in video classification [24, 39]. As expected, pre-training networks on this dataset has yielded significant gains in accuracy [4] on many standard benchmarks [24, 39, 36], and such networks have won the CVPR 2017 ActivityNet and Charades challenges. However, it is worth noting that this dataset was collected with significant curation and annotation effort [21].

The challenge in generating large-scale well-labeled video datasets stems from the fact that a human annotator has to spend much longer to label a video compared to a single image. Previous work has attempted to reduce this labeling effort through heuristics [51], but these methods still require a human annotator to clean up the final labels. There has also been some work on learning unsupervised video representations [35, 30]; however, this has typically led to inferior results compared to supervised features.

The question we pose is: since labeling images is faster, and since we already have large, well-labeled image datasets such as ImageNet, can we instead use these to bootstrap the learning of spatiotemporal video architectures? Unsurprisingly, various previous approaches have attempted this. The popular two-stream architecture [37] uses individual frames from the video as input. Hence it initializes the RGB stream of the network with weights pre-trained on ImageNet and then fine-tunes them for action classification on the action dataset. More recent variants of two-stream architectures have also initialized the flow stream [48] from weights pretrained on ImageNet by viewing optical flow as a grayscale image.

However, such initializations are only applicable to video models that use 2D convolutions, analogous to those applied in CNNs for still images. What about more complex, truly spatiotemporal models, such as 3D convolutional architectures [42]? Until recently, such models have largely been limited to pre-training on large but weakly-labeled video datasets, such as Sports1M [20]. Recent work [4, 8] proposed a nice alternative, consisting of inflating standard 2D CNN kernels to 3D by simply replicating the 2D kernels in time. While effective in getting strong performance on large benchmarks, on small datasets this approach tends to bias video models to be close to static replicas of the image models. Moreover, such initialization constrains the 3D architecture to be identical to the 2D CNN, except for the additional third dimension in kernels. This effectively restricts the design of video models to extensions of what works best in the still-image domain, which may not be the best architectures for video analysis.

This work leverages the intuition that strong improvements are possible if we can distill information from still-image models in order to train our deep video architecture. We propose an approach that achieves this goal by removing the restriction of a 1-to-1 mapping between the 2D and the 3D architectures. We build upon ideas from model [17] and data [33] distillation by proposing to use predictions of still-image models as the supervisory signal for 3D CNNs. One key difference, however, is that distillation has typically been used to train smaller models from larger ones, whereas in our case the students, i.e., the video models, tend to be larger than the teachers. Some previous works, such as cross-modal distillation [14], have pursued a similar approach of transferring supervision from RGB to flow or depth modalities, but this has been for the same end task, such as object detection. In contrast, our work can be seen more as task distillation, similar to [2], where we transfer supervision from RGB models trained for object or scene recognition to video models for human action recognition. Similar to [50], we explore different choices of pre-training tasks and their effect on the end task of action recognition, although we transfer supervision via distillation on the target data as opposed to fine-tuning. We report extensive experiments on standard benchmarks and show significant improvements over inflation and other previous approaches in learning video representations for action recognition.

3 Our Approach

Figure 2: DistInit network architecture. We use random frames from the input clip to generate soft labels for the video model, using an arbitrary number of image-based teacher networks. The student tries to match the targets provided by the teachers.

We now describe our approach in detail. To reiterate, our goal is to learn video representations without using any video annotations. We do so by leveraging pre-trained 2D networks, using them to supervise or “teach” the video models. Hence, we refer to the 2D pre-trained networks as “teachers” and our target video network as the “student”. We make no assumptions about the respective architectures of these models, i.e., we do not constrain the structure of the 3D network to be merely a 3D version of the 2D networks it learns from, or to have a structure compatible with them.

Figure 2 depicts the network architecture used to train the student network. We start with teacher networks trained on standard image-level tasks, such as image classification on ImageNet. While in this work we primarily focus on classification, our architecture is generic and can also benefit from teachers trained on spatial tasks such as detection, keypoint estimation and so on, with the only difference being the definition of the distillation loss function. Also, our architecture is naturally amenable to work with an arbitrary number of teachers, which can be used in a multi-task learning framework to distill information from multiple domains into the student. Throughout the training process, these teacher networks are kept fixed, in “test” mode, and are used to extract a feature representation from the video to be used as a “target” to supervise the student network.

Since teacher networks are designed to find objects in images, it is not obvious how to use them to extract features for actions in video. We propose a simple solution: pre-train the spatiotemporal action network to find objects in the frames of a video. However, our teacher networks are designed to work on images, so how do we apply them to a video? We experiment with standard approaches from the literature, including uniform or random sampling of frames [37], as well as averaging predictions from multiple different frames [48]. In this work we use the last-layer features in the form of normalized softmax predictions or (unnormalized) logits. In the case of multiple frames, we average the teacher logits before computing a normalized prediction target. The student network then takes the complete video clip as input. We train it to predict the features or probability distribution produced by the teacher. For this purpose, we define the last layer in the student network to be a linear layer that takes the final spatiotemporally averaged feature tensor and maps it to a number of units that matches the dimensionality of the output generated by the teacher. In the case of multiple teachers, we define a linear layer per teacher, and optimize all losses jointly.

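To formalize, let us denote a video as $x = (x_1, \ldots, x_N)$, where $x_t$ is the $t$-th frame. In our problem formulation, we have access to a teacher that reports a prediction label at each frame. For simplicity, we assume that the teacher returns a (softmax) distribution over $K$ classification labels. We generate a distribution over labels for a video by (1) averaging the per-frame logits $z_t(k)$ for the $k$-th class and (2) passing the average through a softmax function with temperature $T$ (typically $T = 1$):

$$y(k) = \frac{\exp\left(\bar{z}(k)/T\right)}{\sum_{j=1}^{K} \exp\left(\bar{z}(j)/T\right)}, \qquad \bar{z}(k) = \frac{1}{N} \sum_{t=1}^{N} z_t(k)$$

The resulting distribution $y$ is then used as soft targets for training the weights $\theta$ associated with a student network of arbitrary architecture by means of the following objective:

$$\mathcal{L}(\theta) = -\sum_{k=1}^{K} y(k) \log p_\theta(k)$$

where $p_\theta(k)$ is the student softmax distribution for label $k$. Finally, we explore multi-source knowledge distillation by adding together losses from different image-based teachers:

$$\mathcal{L}_{\text{total}}(\theta) = \sum_{i} \mathcal{L}_i(\theta)$$

where $\mathcal{L}_i$ is the loss computed against the $i$-th teacher.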
In our experiments, we explore teachers trained for object classification (ImageNet) and scene classification (Places).
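As a rough illustration of the formulation above, the following sketch computes a video-level soft target from per-frame teacher logits and the summed cross-entropy loss over teachers. It is not the authors' implementation: the function names, the toy logits and the use of plain Python lists are all assumptions made for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of a list of logits at a given temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def video_soft_target(frame_logits, temperature=1.0):
    """Average per-frame teacher logits, then apply a temperature softmax."""
    num_frames = len(frame_logits)
    num_classes = len(frame_logits[0])
    mean_logits = [
        sum(frame[k] for frame in frame_logits) / num_frames
        for k in range(num_classes)
    ]
    return softmax(mean_logits, temperature)

def cross_entropy(target, student_logits):
    """Cross entropy between a soft target and the student's softmax output."""
    probs = softmax(student_logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs))

def multi_teacher_loss(targets, student_heads):
    """Sum of per-teacher losses; each teacher is matched by its own student head."""
    return sum(cross_entropy(t, s) for t, s in zip(targets, student_heads))

# Toy example (hypothetical numbers): 2 sampled frames, 3 classes, one teacher.
teacher_frame_logits = [[2.0, 0.0, -1.0], [1.0, 1.0, 0.0]]
target = video_soft_target(teacher_frame_logits)
loss = multi_teacher_loss([target], [[1.5, 0.5, -0.5]])
```

In this toy case the student logits equal the averaged teacher logits, so the loss reduces to the entropy of the target, which is the smallest value the cross entropy can take for that target.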

We train the network using loss functions inspired by the network distillation literature [3, 17]. When using the teacher network to produce a probability distribution, we train the student to produce a label distribution as well and define the loss so as to match the teacher distribution. This can be formulated as a KL-divergence loss, which in the case of the softmax distribution is equivalent to computing the cross entropy. As suggested in [17], we also tried using different temperature values to scale the logits before computing softmax and cross entropy, but found a temperature of 1 to perform best in our experiments. We also experimented with a mean squared loss on the logits (before softmax normalization), as suggested in [3], and compare performance in Section 4.
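The equivalence mentioned above can be checked numerically: the KL divergence and the cross entropy differ only by the entropy of the teacher distribution, which does not depend on the student. The sketch below illustrates this, together with a mean squared error on raw logits; the helper names and toy logits are assumptions for the example, not values from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax of a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross entropy of student distribution q against teacher distribution p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL divergence KL(p || q) between teacher p and student q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_squared_error(a, b):
    """Mean squared error between two logit vectors (no softmax applied)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical teacher and student logits for a 3-class problem.
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.0, 1.0, 0.0]
y = softmax(teacher_logits)
p = softmax(student_logits)

# The gap equals the teacher entropy, a constant with respect to the student.
gap = cross_entropy(y, p) - kl_divergence(y, p)
logit_loss = mean_squared_error(teacher_logits, student_logits)
```

Because the gap is constant in the student parameters, minimizing either objective yields the same gradients for the student.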

3.1 Architecture Details

We use recent, state-of-the-art network architectures for all experiments and comparisons. For the still-image teacher networks, we use the ResNet-50 [16] architecture, trained on different image datasets such as ImageNet 1K [34] and Places 365 [52]. For the spatiotemporal (video) student architectures, we first experiment with a variant of the Res3D [43] architecture. Res3D is an improved version of the popular C3D [42] using residual connections. We denote an $N$-layer Res3D model as Res3D-$N$, which is compatible with the standard ResNet-$N$ [16] architecture. Since there is a one-to-one correspondence between such 2D and 3D models, the 3D models can also be initialized by inflating the learned weights from 2D models (e.g., for each channel, replicate the 2D filter weights along the temporal dimension to produce a 3D convolutional filter). Similar ideas of inflating 2D models to 3D have been proposed previously for Inception-style architectures [4], along with initialization techniques from corresponding 2D models [4, 9, 8]. The existence of a 1-to-1 mapping between the 2D and 3D models used in our experiments allows us to compare our approach to the method of inflation for initialization. However, we stress that unlike inflation, our method is applicable even in scenarios where such a 1-to-1 mapping does not hold.
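To make the inflation baseline concrete, here is a minimal sketch of replicating a single 2D filter along the temporal dimension. The function name and the nested-list representation are illustrative assumptions. The optional `rescale` flag, which divides by the temporal extent as done in the inflation procedure of [4], is not described in the text above and is included only as an option.

```python
def inflate_2d_kernel(kernel_2d, temporal_size, rescale=False):
    """Replicate a 2D kernel (list of rows) along time to form a 3D kernel.

    With rescale=True the weights are divided by temporal_size, so that a
    temporally constant input produces the same response as the 2D filter.
    """
    scale = 1.0 / temporal_size if rescale else 1.0
    return [
        [[w * scale for w in row] for row in kernel_2d]
        for _ in range(temporal_size)
    ]

# Hypothetical 2x2 spatial filter inflated to a temporal extent of 3.
kernel = [[1.0, 2.0], [3.0, 4.0]]
inflated = inflate_2d_kernel(kernel, temporal_size=3)
rescaled = inflate_2d_kernel(kernel, temporal_size=3, rescale=True)
```

Every temporal slice of `inflated` is identical, which is why filters initialized this way start out with no temporal variation.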

More recently, full 3D models have been superseded by (2+1)D architectures [44], where each 3D kernel is decomposed into a 2D spatial component followed by a 1D temporal filter. Similar models have also been proposed previously [40], and are also known as P3D [32] or S3D [49] architectures. These models have proven to be more efficient, with far fewer parameters, and more effective on various standard benchmarks [49, 44]. However, these models no longer conform to standard 2D architectures, as they contain additional conv and batch_norm layers, whose parameters do not exist in the corresponding 2D models and hence cannot be initialized from those 2D models. Nevertheless, our distillation remains applicable even in this scenario. In this work, we refer to such networks using the R(2+1)D-$N$ notation for an $N$-layer architecture.
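The decomposition can be illustrated with a simple weight count. The sketch below compares a full 3D convolution with a 2D spatial convolution followed by a 1D temporal convolution, ignoring biases and batch-norm parameters. The channel sizes and the number of intermediate channels `c_mid` are hypothetical choices for illustration; the actual count depends on how the intermediate width is chosen.

```python
def params_3d(c_in, c_out, t, d):
    """Weights in a full t x d x d 3D convolution (biases ignored)."""
    return t * d * d * c_in * c_out

def params_2plus1d(c_in, c_out, t, d, c_mid):
    """Weights in a 1 x d x d spatial conv followed by a t x 1 x 1 temporal conv."""
    spatial = d * d * c_in * c_mid
    temporal = t * c_mid * c_out
    return spatial + temporal

# Hypothetical layer: 64 -> 64 channels, 3x3x3 kernel, 64 intermediate channels.
full = params_3d(64, 64, t=3, d=3)                      # 110592 weights
factored = params_2plus1d(64, 64, t=3, d=3, c_mid=64)   # 49152 weights
```

The factored form also introduces an extra layer between the spatial and temporal convolutions, which is the part that has no counterpart in a 2D network.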

3.2 Implementation Details

For all experiments, we use the hyperparameter values described in [44]. For distillation pre-training, we use the hyperparameter setup for “Kinetics from-scratch training.” We use distributed Sync-SGD [13] across GPUs, starting with LR=0.01 and dropping it by a factor of 10 every 10 epochs, with weight decay as in [44]. We train for a total of 45 epochs, where each epoch is defined as a full pass over 1M examples. The video model is learned on 8-frame clips of 112 pixels. The network has a depth of 18, which enables faster experimentation compared to the best model reported in [44], which uses 32 frames and has a depth of 34 layers. The batch size used for Kinetics training is 32/GPU, which we reduce to 24/GPU to accommodate the additional memory requirements of the teacher networks. For the fine-tuning experiments on smaller datasets like HMDB, we use Sync-SGD across GPUs, starting with LR=0.002 and dropping it every 2 epochs, again with weight decay as in [44]. We train for 8 epochs, with each epoch defined as a full pass over 40K examples. When training from scratch, we use an initial LR of 0.01 with a step every 10 epochs, and train for a total of 45 epochs.
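The pre-training schedule described above can be sketched as a simple step function. This assumes the learning rate is divided by 10 at each step, which is our reading of the text; the function and parameter names are illustrative.

```python
def step_lr(epoch, base_lr=0.01, drop_factor=10.0, step_epochs=10):
    """Learning rate at a given 0-indexed epoch under a step schedule."""
    return base_lr / (drop_factor ** (epoch // step_epochs))

# Learning rate for each of the 45 pre-training epochs.
schedule = [step_lr(epoch) for epoch in range(45)]
```

Under these assumptions the learning rate is 0.01 for epochs 0 to 9, 0.001 for epochs 10 to 19, and so on, with four drops over the 45 epochs.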

4 Experiments

We now experimentally evaluate our system. We start by introducing the datasets and benchmarks used for training and evaluation in Section 4.1. We then compare DistInit with inflating 2D models for initialization in Section 4.2. Next we ablate the various design choices in DistInit in Section 4.3, and finally compare to previous state of the art on UCF-101 [39] and HMDB-51 [24] in Section 4.4.

4.1 Datasets and Evaluation

Our method involves two stages: pre-training on a large, unlabeled corpus of videos using still-image models as “teachers”, followed by fine-tuning on the training split of a labeled target dataset. After training, we evaluate the performance on the test set of the target dataset. We use HMDB-51 and UCF-101 as our target test beds, while videos taken from datasets such as Kinetics and Sports1M, or even random uncurated internal videos are used as unlabeled video corpuses for distillation. Note that while datasets like Kinetics and Sports1M do come with semantic (action) labels, we ignore such annotations in this work and only use the raw videos from these datasets.

Unlabeled video corpus: We experiment with a variety of different unlabeled video corpuses in Section 4.3, including Kinetics [21], Sports1M [20] and a set of internal videos. Kinetics and Sports1M contain about 300K and 1.1M videos, respectively. In this work, we drop any labels these datasets come with, and only use the videos as a large, unlabeled corpus to train video representations. The internal video set includes 1M videos without any semantic labels and randomly sampled from a larger collection. We use these diverse datasets to show that our method is not limited to any form of data curation, and can work with truly in-the-wild videos.

Test beds: HMDB-51 [24] contains 6766 realistic and varied video clips from 51 action classes. Evaluation is performed using average classification accuracy over three train/test splits from [19], each with 3570 train and 1530 test videos. UCF-101 [39] consists of 13320 sports video clips from 101 action classes, also evaluated using average classification accuracy over 3 splits. We use the HMDB split 1 for ablative experiments, and report the final performance on all splits for HMDB and UCF in Section 4.4.

Figure 3: Learned Res3D filters. We compare the first-layer representation learned using our distillation approach to that obtained by inflation. For each case, we show the 64 conv_1 filters at each time instance of the filter. As described in Section 4.2, our features vary over time, as opposed to ImageNet-inflated features that are simply copied over time. We also show the I3D first-layer filters as reported in Figure 4 of [4]. Note that the I3D conv_1 kernel has a larger temporal extent than that of Res3D, hence the 7 filter images in the I3D case.

| Model | Initialization | Per clip | Top 1 | Top 5 |
|---|---|---|---|---|
| Res3D-18 | Scratch | 24.6 | 25.4 | 55.2 |
| Res3D-18 | ImageNet inflated | 32.5 | 35.8 | 66.2 |
| Res3D-18 | PlaceNet inflated | 32.5 | 35.6 | 66.2 |
| Res3D-18 | DistInit (ours) | 36.6 | 39.9 | 73.5 |
| R(2+1)D-18 | Scratch | 22.0 | 24.1 | 53.1 |
| R(2+1)D-18 | DistInit (ours) | 37.8 | 40.3 | 74.4 |
| R(2+1)D-18 | Kinetics pre-training | - | 51.0 | - |

Table 1: Distillation vs Inflation. As described in Section 4.2, our distillation approach outperforms training video models from scratch or initializing them by inflating 2D models. We evaluate using percentage accuracy on the HMDB-51 dataset, split 1. The models used are 18-layer Res3D and R(2+1)D, over 8-frame input, trained with cross-entropy loss (described in Section 4.3). The DistInit training is done using a 2D network trained on ImageNet.
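For readers unfamiliar with the Top 1 and Top 5 columns, the sketch below shows one common way to compute video-level top-k accuracy. It assumes video-level scores are obtained by averaging the scores of the clips sampled from each video; this averaging step is an assumption of the example rather than something stated here, and all names and numbers are illustrative.

```python
def average_clip_scores(clip_scores):
    """Average class scores over all clips sampled from one video."""
    num_clips = len(clip_scores)
    num_classes = len(clip_scores[0])
    return [
        sum(clip[k] for clip in clip_scores) / num_clips
        for k in range(num_classes)
    ]

def top_k_correct(scores, label, k):
    """True if the ground-truth label is among the k highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return label in ranked[:k]

def top_k_accuracy(videos, k):
    """Percentage of videos whose label is in the top-k video-level predictions."""
    hits = sum(
        top_k_correct(average_clip_scores(clips), label, k)
        for clips, label in videos
    )
    return 100.0 * hits / len(videos)

# Hypothetical data: (per-clip class scores, ground-truth label) for two videos.
videos = [
    ([[0.6, 0.3, 0.1], [0.3, 0.5, 0.2]], 0),
    ([[0.1, 0.2, 0.7]], 1),
]
top1 = top_k_accuracy(videos, k=1)
top2 = top_k_accuracy(videos, k=2)
```

The reported benchmark number would then be the mean of this accuracy over the three train/test splits.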

4.2 DistInit vs Inflation

We first compare our proposed approach to inflation [4, 8], i.e., initializing video models from 2D models by inflating 2D kernels to 3D via replication over time. Note that inflation is constrained to operate on 3D models that have a one-to-one correspondence with the 2D model. Hence, we use a Res3D-18 model, which is compatible for direct inflation from ResNet-18 models. We experiment with publicly available ImageNet and PlaceNet models. We compare it with our distillation approach in Table 1, trained using an ImageNet pretrained model as the teacher. Distillation improves performance by 15% over a model trained from scratch, and 4% over a model trained with inflated weights (the current best-practice for training such models). More importantly, our approach can also be used to initialize specialized temporal architectures such as R(2+1)D [44], which do not have a natural 2D counterpart. In such a setting, the current best practice is to initialize such networks from scratch. Here, distillation improves performance by 16%. Finally, we also report the model trained using actual Kinetics labels, and as expected, that yields higher performance. Hence there is clear value to the explicit manual supervision provided in such large-scale datasets, but distillation appears to get us “half-way” there.

At this point, it is natural to ask why the distilled model outperforms current best practices such as inflation. We visualize the learned representation by plotting the first-layer conv filters in Figure 3. It can be seen that our distilled model learns truly spatiotemporal filters that vary in time, whereas inflation essentially copies the same filter over time. Such dynamic temporal variation is readily present in the videos used for distillation, even when they are not labeled with spatiotemporal action categories. Filters pre-trained with inflation initialization never see actual video data, and so cannot encode such variation. In Figure 4 we also compare the filters learned by our R(2+1)D model via distillation with those learned via fully-supervised training. Our filters look quite similar to those learned through supervised learning, showing the effectiveness of our approach. In some sense, the improved performance of distillation can be readily explained by more data: networks learning from scratch see no data for pre-training, inflation networks see ImageNet, while distilled networks see both ImageNet and unlabeled videos. Our practical observation is that one can use image-based teachers to pre-train on massively large, unlabeled video datasets.

Figure 4: Learned R(2+1)D filters. Similar to Figure 3, we show the first-layer conv filters for the R(2+1)D models. Note that a (2+1)D conv layer contains a 2D convolution in space followed by a 1D convolution in time, and in this visualization we only show the former, i.e., the 45 2D conv filters that operate on the RGB image. We observe that our distillation approach learns spatiotemporal representations that are relatively similar to those of the fully supervised model (compared to filters learned from ImageNet in Figure 3).

We can also analyze the effectiveness of distillation pre-training by visualizing how the representation we learn correlates with the classes in the action recognition task. As shown in Figure 5, the last-layer features for the same action class tend to cluster together when projected to 2D using tSNE [45].

Figure 5: Learned high-level representation. While the filter maps in Figure 3 and 4 can be used to interpret the low-level representation learned by our model, we now try to probe the high level representation by visualizing the last layer features. This figure shows tSNE [45] visualization of averaged last layer features from the model trained with DistInit, and trained with full Kinetics supervision. Each dot represents a video from HMDB training set, and is color coded by the class of that video. For ease of visualization, we picked 10 random classes to plot. Note that DistInit is already able to segregate many videos into clusters correlated with their action classes, without ever being trained on any action labels! The fully supervised model naturally does better as it has been trained on a large action dataset, Kinetics. This further suggests DistInit leads to useful representation for classifying actions.

4.3 Diagnostic Analysis

Design choices for the teacher network: We now ablate the design choices for the teacher networks. Teacher networks are required to generate a target label to supervise the video model being trained, by using images from the video clip. We experiment with picking the center frame, a random frame, or multiple random frames from the clip to compute the targets. In case of multiple frames, we average the logits before passing them through softmax to generate the target distribution. We compare these methods in Table 2, and observe higher performance when picking frames randomly. This improvement can be due to less overfitting through label augmentation. We use it in our final model.

| Model | Pick strategy | Per clip | Top 1 | Top 5 |
|---|---|---|---|---|
| R(2+1)D-18 | Center | 37.8 | 40.3 | 74.4 |
| R(2+1)D-18 | Random | 39.9 | 43.2 | 73.9 |
| R(2+1)D-18 | 2 Random | 39.6 | 44.0 | 73.5 |

Table 2: Video to Image. We compare different strategies of converting the video into image(s) for extracting the target label. We find strongest performance when picking random frames to generate the target distribution. Model used here is 18-layer R(2+1)D, over 8-frame input, trained with cross-entropy loss (Section 4.3); evaluated using percentage accuracy on HMDB-51 split 1.
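The three strategies compared in Table 2 can be sketched as a small frame-index selector. This is an illustrative sketch only: the function name is hypothetical, and sampling the random frames without replacement is an assumption.

```python
import random

def pick_frames(num_frames, strategy="random", num_random=1, rng=None):
    """Return the indices of the clip frames passed to the teacher network."""
    rng = rng or random.Random()
    if strategy == "center":
        return [num_frames // 2]
    if strategy == "random":
        # Assumption: frames are drawn without replacement.
        return sorted(rng.sample(range(num_frames), num_random))
    raise ValueError(f"unknown strategy: {strategy}")

# For an 8-frame clip: center frame, one random frame, two random frames.
center = pick_frames(8, "center")
one_random = pick_frames(8, "random", 1, rng=random.Random(0))
two_random = pick_frames(8, "random", 2, rng=random.Random(0))
```

With more than one frame selected, the teacher logits for those frames would be averaged before the softmax, as described in Section 3.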

Distillation loss: Next, we evaluate the different choices for the loss function in distillation. As already explained in Section 3, previous work has suggested different loss functions for distillation tasks. We compare two popular approaches: KL divergence over distributions and mean squared error over logits. In the case of the former, we compute the softmax distribution from the teacher networks, as well as from the student branch that attempts to match that teacher, and use the cross entropy between the two softmax distributions as the objective to optimize. We find this objective can be well optimized using the standard hyperparameter setup used for Kinetics training in [44]. In the case of the latter, we skip the softmax normalization step and directly compute the mean squared error between the outputs of the last linear layers as the objective. Since the initial loss values are much higher, we needed to drop the learning rate by a factor of 10 to optimize this model, with all the other parameters kept the same. As Table 3 shows, we observe similar downstream performance with both.

| Loss function | Per clip | Top 1 | Top 5 |
|---|---|---|---|
| cross-entropy (over softmax) | 37.8 | 40.3 | 74.4 |
| mean squared error (over logits) | 35.6 | 39.9 | 70.5 |

Table 3: Loss function for distillation. We compare different loss functions for distillation, and find that the performance was relatively stable with different choices. The model used here is an 18-layer R(2+1)D, over 8-frame input, evaluated using percentage accuracy on HMDB-51 split 1.

Selecting confident predictions: As some recent work [33] has shown, distillation techniques can benefit from using only the most confident predictions for training the student. We use the entropy of the teachers' predictions as a measure of their confidence. We implement this confidence thresholding by assigning zero weight to the loss on each example for which the teacher is not confident, i.e., has high-entropy predictions, effectively dropping the parts of the training data that confuse the teacher. We show the performance on dropping different amounts of data in Figure 6. The red curve shows a kernel density estimate (a PDF) of entropy values for an ImageNet teacher on the Kinetics data. At any given entropy value, it shows the relative likelihood of a data point having that entropy, and the area under the curve up to that value is the percentage of data with entropy at or below it. We experiment with setting different thresholds for dropping the low-confidence data points during DistInit, and show the downstream HMDB-51 split-1 performance in the line plots. We found slightly better performance, even after dropping nearly half the data, making training faster.

Figure 6: Accuracy variation with entropy. Our results suggest that if we use DistInit only on videos for which the teacher is sufficiently sure of its predictions (low entropy), we obtain slightly better performance while ignoring 50% of the input videos, making training faster.
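
The confidence thresholding described above amounts to a per-example 0/1 weight on the loss. A minimal sketch follows; the function names and the `max_entropy` parameter are hypothetical, and normalizing by the number of kept examples is an assumption of this sketch rather than a detail taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def confidence_weights(teacher_probs, max_entropy):
    """Weight 1.0 for confident (low-entropy) teacher predictions, else 0.0."""
    return [1.0 if entropy(p) <= max_entropy else 0.0 for p in teacher_probs]


def weighted_mean_loss(losses, weights):
    """Average the per-example losses over the examples that were kept."""
    kept = sum(weights)
    if kept == 0:
        return 0.0
    return sum(l * w for l, w in zip(losses, weights)) / kept
```

For example, with a threshold of 0.5 nats a one-hot teacher prediction is kept, while a uniform prediction over two classes (entropy of about 0.69 nats) is dropped.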

Varying the unlabeled dataset: We now evaluate whether our method depends on any specific video data source, and whether it can benefit from additional data sources. As Table 4 shows, we observe similar performance when using different sets of videos (without labels), such as Kinetics [21] and Sports1M [20]. We also experiment with an internal set of videos downloaded from the web, and still get strong DistInit performance. This shows that our method does not depend on any form of data curation, and can learn from truly in-the-wild videos.

Model Unlabeled set Size Per clip Top 1 Top 5
R(2+1)D- Kinetics [21] 0.3M 37.8 40.3 74.4
R(2+1)D- Sports1M [20] 1.1M 37.5 39.9 73.3
R(2+1)D- Kinetics+Sports1M 1.4M 38.0 41.8 75.3
R(2+1)D- Internal videos 1.0M 38.2 41.2 72.0
Table 4: Unlabeled data for distillation. This table shows that our model is not limited to any specific source of unlabeled data, and can also benefit from multiple sources of data. Size denotes the number of unlabeled videos used from that set. Performance reported on HMDB-51 split 1.

Using other teachers: Just as our model is capable of learning from more data, it is also capable of using diverse supervision. We experiment with replacing the ImageNet teacher with a model trained on Places [52], and obtain 36.8% HMDB fine-tuning performance, as opposed to 40.3% with the ImageNet teacher. Apart from showing that our model can learn from diverse sources of supervision, this result shows that objects → actions semantic transfer is more effective than scenes → actions. This makes sense, as human actions are typically informed more by the objects in the environment than by the environment itself.

Different input modalities: One of the biggest advantages of our method is that it can learn representations for an arbitrary input modality. We experiment with optical flow, which still contributes significant performance improvements on video tasks, even with modern video architectures, across different datasets [4]. Previous work [4, 48] has used ImageNet initialization for networks accepting flow as input. This is far from ideal, since flow has very different statistics from RGB images. DistInit, on the other hand, is agnostic to the input modality of the student network. We train the student network on the flow input, while the teacher uses a random RGB frame from the same clip to generate the distillation target. As we show in Table 5, DistInit still produces a strong initialization and improves over training from scratch or the ImageNet inflated initialization. However, due to the high computational cost of computing flow, we ignore this input modality for the final comparisons.

Model Initialization Per clip Top 1 Top 5
Res3D- Scratch 30.7 38.7 70.0
Res3D- ImageNet mean inflated 33.5 43.9 73.9
R(2+1)D- DistInit (ours) 42.6 49.2 81.2
Table 5: DistInit on Optical Flow. This table shows that our model is also applicable to other modalities, like optical flow. Note that the inflated initialization for the first layer (conv_1) was performed by averaging the kernel on channel dimension, and then replicating it two times. Reported on HMDB-51 split 1.
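
The first-layer conversion described in the Table 5 caption (average the kernel over the channel dimension, then replicate it twice) can be sketched for a single filter as below. The function name and the nested-list layout `[channels][height][width]` are assumptions made for illustration; any inflation along the temporal dimension is outside the scope of this sketch.

```python
def rgb_filter_to_flow(rgb_filter, flow_channels=2):
    """Convert one RGB conv filter into a filter for flow input.

    rgb_filter has layout [channels][height][width]. The result has layout
    [flow_channels][height][width], where every channel is the mean of the
    original input channels.
    """
    num_in = len(rgb_filter)
    height, width = len(rgb_filter[0]), len(rgb_filter[0][0])
    # Average over the input-channel dimension.
    mean = [
        [sum(rgb_filter[c][i][j] for c in range(num_in)) / num_in
         for j in range(width)]
        for i in range(height)
    ]
    # Replicate the averaged kernel once per flow channel (x and y).
    return [[row[:] for row in mean] for _ in range(flow_channels)]
```

Applying this to every filter of the first convolution yields a layer that accepts 2-channel flow while reusing the spatial structure learned from RGB images.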

4.4 Comparison with previous work

Finally, in Table 6 we compare our model to other reported standard models and initialization methods on HMDB and UCF. Here we only compare to RGB-based models for computational speed, even though our approach is applicable to flow or other modalities as shown in Section 4.3. We obtain strong performance compared to standard methods and to other unsupervised feature learning techniques [30]. In Figure 7 we also show which classes benefit the most from the initialization provided by DistInit compared to that computed by inflation.

Model Architecture #frames Pre-training Split 1 3-split avg
Misra et al. [30] AlexNet [23] 1 Scratch - 13.3
Misra et al. [30] AlexNet [23] 1 Tuple verify [30] - 18.1
Misra et al. [30] AlexNet [23] 1 ImageNet - 28.5
Two-stream (RGB) [37, 10] VGG-M [38] 1 ImageNet - 40.5
C3D [4] Custom 16 Scratch 24.3 -
LSTM [4] BN-Inception [18] - ImageNet 36.0 -
Two stream (RGB) [4] BN-Inception [18] 1 ImageNet 43.2 -
I3D (RGB) [4] BN-Inception [18] 64 ImageNet 49.8 -
Ours (RGB) R(2+1)D-18 [44] 32 DistInit 54.9 54.8

(a) HMDB-51

Model Architecture #frames Pre-training Split 1 3-split avg
Misra et al. [30] AlexNet [23] 1 Scratch - 38.6
Misra et al. [30] AlexNet [23] 1 Tuple verification [30] - 50.2
Two-stream (RGB) [37, 10] VGG-M [38] 1 ImageNet - 73.0
C3D [4] Custom 16 Scratch 51.6 -
LSTM [4] BN-Inception [18] - ImageNet 81.0 -
Two stream (RGB) [4] BN-Inception [18] 1 ImageNet 83.6 -
I3D (RGB) [4] BN-Inception [18] 64 ImageNet 84.5 -
Ours (RGB) R(2+1)D-18 [44] 32 DistInit 85.7 85.8

(b) UCF-101

Table 6: Comparison with previous work on HMDB and UCF. We split the tables based on the base architecture for fair comparison. In the first section, we report architectures with depth comparable to ours, and in the second we report other approaches using deeper architectures. Our model outperforms all these previous methods. Note that we do not compare to Kinetics pre-trained models. Using Kinetics for pre-training with I3D [4] gets 74.3% and 95.1% 3-split avg on HMDB and UCF, but is not comparable to our unsupervised approach, which does not use those labels.

Figure 7: HMDB classes with largest gain using DistInit instead of inflation. The plot shows the HMDB per-class accuracy difference between the two finetuned models. It suggests that our method is most useful for classes that require understanding motion, such as “catching” and “smiling.” On the other hand, classes like “shoot_ball” or “eat” are easy to recognize from single frames.
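
A per-class comparison like the one plotted in Figure 7 can be computed from the predictions of the two finetuned models. The sketch below uses hypothetical function names and assumes plain lists of ground-truth and predicted class labels; it is not the authors' evaluation code.

```python
from collections import defaultdict


def per_class_accuracy(labels, preds):
    """Percentage accuracy for each ground-truth class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p in zip(labels, preds):
        total[y] += 1
        correct[y] += int(y == p)
    return {c: 100.0 * correct[c] / total[c] for c in total}


def per_class_gain(labels, preds_a, preds_b):
    """Accuracy of model A minus model B per class, largest gain first."""
    acc_a = per_class_accuracy(labels, preds_a)
    acc_b = per_class_accuracy(labels, preds_b)
    gains = {c: acc_a[c] - acc_b[c] for c in acc_a}
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
```

Here model A would be the DistInit-initialized network and model B the inflation-initialized one, both evaluated on the same test videos.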

5 Conclusion

We describe a simple approach, DistInit, that transfers knowledge from image-based datasets labeled for object or scene recognition tasks to learn spatiotemporal video models for human action recognition tasks. Much previous work has addressed this problem by constraining spatiotemporal architectures to match 2D counterparts, limiting the choice of networks that can be explored. DistInit is instead based on distillation, and can be used to initialize any spatiotemporal architecture. It does so by making use of image-based teachers that can leverage considerable knowledge about objects, scenes, and potentially other semantics (e.g., attributes, pose) encoded in richly annotated image datasets. Unlike previous unsupervised learning works that depend on the curated ImageNet dataset, albeit without labels, we show that our model works even on truly in-the-wild, uncurated videos. We demonstrate significant improvements over standard best practices for initializing spatiotemporal models. That said, our results do not match the accuracy of models pretrained on recently introduced, large-scale supervised video datasets, though we note that these were collected and annotated with significant manual effort. Because our approach requires only unlabeled videos, it has the potential to make use of massively larger data for learning accurate video models.


This work was partly supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DOI/IBC) contract number D17PC00345. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/IBC or the U.S. Government.


  • [1] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.
  • [2] Y. Aytar, C. Vondrick, and A. Torralba. Soundnet: Learning sound representations from unlabeled video. In NIPS, 2016.
  • [3] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil. Model compression. In KDD, 2006.
  • [4] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
  • [5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
  • [6] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
  • [7] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
  • [8] C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, 2016.
  • [9] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spatiotemporal multiplier networks for video action recognition. In CVPR, 2017.
  • [10] C. Feichtenhofer, K. Simonyan, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition (online). http://www.robots.ox.ac.uk/~vgg/software/two_stream_action/.
  • [11] D. F. Fouhey, W. Kuo, A. A. Efros, and J. Malik. From lifestyle VLOGs to everyday interactions. In CVPR, 2018.
  • [12] S. Gidaris, P. Singh, and N. Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
  • [13] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  • [14] S. Gupta, J. Hoffman, and J. Malik. Cross modal distillation for supervision transfer. In CVPR, 2016.
  • [15] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [17] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [18] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [19] Y. Jiang, J. Liu, A. Roshan Zamir, I. Laptev, M. Piccardi, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. http://www.thumos.info/, 2013.
  • [20] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
  • [21] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  • [22] P. Krähenbühl, C. Doersch, J. Donahue, and T. Darrell. Data-dependent initializations of convolutional neural networks. In ICLR, 2016.
  • [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • [24] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011.
  • [25] I. Laptev. On space-time interest points. IJCV, 2005.
  • [26] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015.
  • [27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. IEEE, 1998.
  • [28] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [29] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
  • [30] I. Misra, C. L. Zitnick, and M. Hebert. Shuffle and Learn: Unsupervised Learning using Temporal Order Verification. In ECCV, 2016.
  • [31] M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash. Boosting self-supervised learning via knowledge transfer. In CVPR, 2018.
  • [32] Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In ICCV, 2017.
  • [33] I. Radosavovic, P. Dollár, R. Girshick, G. Gkioxari, and K. He. Data distillation: Towards omni-supervised learning. In CVPR, 2018.
  • [34] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li. Imagenet large scale visual recognition challenge. IJCV, 2015.
  • [35] P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, and S. Levine. Time-contrastive networks: Self-supervised learning from video. In ICRA, 2018.
  • [36] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In ECCV, 2016.
  • [37] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
  • [38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [39] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CRCV-TR-12-01, 2012.
  • [40] L. Sun, K. Jia, D.-Y. Yeung, and B. E. Shi. Human action recognition using factorized spatio-temporal convolutional networks. In ICCV, 2015.
  • [41] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. In ICLR Workshops, 2016.
  • [42] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
  • [43] D. Tran, J. Ray, Z. Shou, S.-F. Chang, and M. Paluri. Convnet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038, 2017.
  • [44] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
  • [45] L. van der Maaten and G. E. Hinton. Visualizing high-dimensional data using t-SNE. JMLR, 2008.
  • [46] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, 2011.
  • [47] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
  • [48] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
  • [49] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning for video understanding. arXiv preprint arXiv:1712.04851, 2017.
  • [50] A. R. Zamir, A. Sax, W. Shen, L. Guibas, J. Malik, and S. Savarese. Taskonomy: Disentangling task transfer learning. In CVPR, 2018.
  • [51] H. Zhao, Z. Yan, H. Wang, L. Torresani, and A. Torralba. SLAC: A sparsely labeled dataset for action classification and localization. arXiv preprint arXiv:1712.09374, 2017.
  • [52] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million Image Database for Scene Recognition. TPAMI, 2017.