, primarily fueled by the resurgence of convolutional neural networks (CNNs), the success of deep learning, and the availability of large and well-labeled datasets [34, 52, 28]. This has caused a paradigm shift in computer vision. While before the emergence of deep learning most of the community effort was focused on hand-designing better feature extractors [5, 29, 25], now the most prominent approaches train deep models end-to-end, learning features as part of this process. However, deep learning has been transformative not just because models started to work, but because models also transferred. The most dominant illustration of this is the use of ImageNet pre-training for image understanding tasks. It is a near-ubiquitous practice these days, as it has been shown to yield strong improvements for a wide range of tasks, from image classification on small datasets to localization tasks like detection and segmentation. Such pre-training is an empirically effective approach to knowledge transfer, where “knowledge” is manifested as labeled and curated datasets.
However, deep learning has not been quite as transformative for video understanding. One of the first deep-learning attempts at human action recognition achieved only marginal improvements over the previous state-of-the-art hand-crafted features. Since then, various deep models [37, 42, 7] have been proposed. However, the performance improvements have been largely incremental, with some of the biggest gains coming from the recent introduction of a large-scale dataset [4, 21] enabling effective pre-training.
This raises a natural question: is time-consuming manual labeling of large-scale video datasets the most practical means to achieve big performance improvements in video understanding? We decompose this question into two parts:
1. How does one obtain the “right” labels for a video? Previous work has attempted to define ontologies and collect videos for each of the classes, either through Web search [39, 24, 21] or by asking humans to act and record videos. More recent work has avoided predefining explicit ontologies, and has attempted to “discover” useful action classes from video-logs of peoples’ daily lives. These datasets have used action classes ranging from broad and diverse, such as swimming or running, to fine-grained and nuanced, such as “snuggling with a pillow”. Evidently, the right vocabulary is unclear and largely up for debate. In contrast, labels of objects seem to be much better understood, taking advantage of existing linguistic knowledge bases such as WordNet. Moreover, we humans are able to learn about behaviors and the dynamics of the world even without such explicit labels. In this work, we attempt to answer this question by learning video representations without action labels, relying solely on image-based models, or teachers, to supervise the video network.
2. How do we transfer the rich knowledge embedded in well-curated datasets [34, 52, 28] into our video models? Previous work has tackled this by initializing weights from models pre-trained on such datasets, a procedure popularly referred to as fine-tuning. Recent work has even done so for 3D CNNs, by “inflating” the 2D kernels to 3D [4, 9, 8]. However, all these approaches place severe restrictions on the video architectures, essentially forcing them to be 3D equivalents of still-image architectures. Instead, as Figure 1 shows, we advocate a general approach of knowledge transfer by distillation, which allows us to transfer knowledge from arbitrary image-based teachers to any spatiotemporal architecture. We refer to our approach as DistInit. Previous work has tackled the similar problem of cross-modal distillation, training architectures across different input modalities such as depth or flow. However, the goal there is to transfer supervision from one modality to another for the same end task. In contrast, we transfer supervision from image models trained for object/scene recognition to video models for human action recognition. To our knowledge, image-to-video distillation has not been explored before. This task opens up a series of design choices that we thoroughly investigate through an extensive ablation study involving the distillation output, the loss, frame-selection criteria, and the form of the still-image supervision.
DistInit leads to a significant 16% improvement over from-scratch training on the HMDB dataset, getting almost halfway to the improvement provided by pre-training on a fully-supervised dataset like Kinetics. From-scratch training is the de facto standard for state-of-the-art architectures that cannot be initialized or inflated from image architectures. While large-scale video datasets like Kinetics now provide an alternate path for pre-training, DistInit does so without requiring any video-data curation. As we show in Section 4.3, it is able to learn competitive representations from an internal uncurated dataset of random web videos. This is in contrast to previous works [31, 6, 12]
on unsupervised learning that use ImageNet without labels but still potentially benefit from the data curation.
2 Related Work
Video understanding, specifically for the task of human action recognition, is a well-studied problem in computer vision. Analogously to image-based recognition methods, which have advanced from hand-crafted features [29, 5] to modern deep networks [41, 16, 38], video understanding methods have also evolved from hand-designed models [47, 46, 25] to deep spatiotemporal networks [42, 37]. However, while image-based recognition has seen dramatic gains in accuracy, improvements in video analysis have been more modest. In the still-image domain, deep models have greatly benefited from the availability of well-labeled datasets, such as ImageNet or Places.
Until recently, video datasets have either been well-labeled but small [24, 39, 36], or large but weakly-labeled [20, 1]. A recently introduced dataset, Kinetics, is currently the largest well-annotated dataset, with around 300K videos labeled into 400 categories (we note a larger version with 600K videos in 600 categories was recently released). It is nearly two orders of magnitude larger than previously established benchmarks in video classification [24, 39]. As expected, pre-training networks on this dataset has yielded significant gains in accuracy on many standard benchmarks [24, 39, 36], and such pre-trained models won the CVPR 2017 ActivityNet and Charades challenges. However, it is worth noting that this dataset was collected with significant curation and annotation effort.
The challenge in generating large-scale, well-labeled video datasets stems from the fact that a human annotator has to spend much longer labeling a video than a single image. Previous work has attempted to reduce this labeling effort through heuristics, but these methods still require a human annotator to clean up the final labels. There has also been some work on learning unsupervised video representations [35, 30]; however, these have typically led to inferior results compared to supervised features.
The question we pose is: since labeling images is faster, and since we already have large, well-labeled image datasets such as ImageNet, can we instead use these to bootstrap the learning of spatiotemporal video architectures? Unsurprisingly, various previous approaches have attempted this. The popular two-stream architecture  uses individual frames from the video as input. Hence it initializes the RGB stream of the network with weights pre-trained on ImageNet and then fine-tunes them for action classification on the action dataset. More recent variants of two-stream architectures have also initialized the flow stream  from weights pretrained on ImageNet by viewing optical flow as a grayscale image.
However, such initializations are only applicable to video models that use 2D convolutions, analogous to those applied in CNNs for still images. What about more complex, truly spatiotemporal models, such as 3D convolutional architectures? Until recently, such models have largely been limited to pre-training on large but weakly-labeled video datasets, such as Sports1M. Recent work [4, 8] proposed a nice alternative, consisting of inflating standard 2D CNN kernels to 3D, by simply replicating the 2D kernels in time. While effective in getting strong performance on large benchmarks, on small datasets this approach tends to bias video models toward being close to static replicas of the image models. Moreover, such initialization constrains the 3D architecture to be identical to the 2D CNN, except for the additional third dimension in kernels. This effectively restricts the design of video models to extensions of what works best in the still-image domain, which may not be the best architectures for video analysis.
This work leverages the intuition that strong improvements are possible if we can distill information from still-image models in order to train our deep video architecture. We propose an approach that achieves this goal by removing the restriction of a 1-to-1 mapping between the 2D and 3D architectures. We build upon ideas from model and data distillation by proposing to use predictions of still-image models as a supervisory signal for 3D CNNs. One key difference, however, is that distillation has typically been used to train smaller models from larger ones, whereas in our case the student (video) models tend to be larger than the teachers. Some previous works, such as cross-modal distillation, have pursued a similar approach of transferring supervision from RGB to flow or depth modalities, but this has been for the same end task, such as object detection. In contrast, our work can be seen more as task distillation, where we transfer supervision from RGB models trained for object or scene recognition to video models for human action recognition. Similarly, we explore different choices of pre-training tasks and their effect on the end task of action recognition, although we transfer supervision via distillation on the target data as opposed to fine-tuning. We present extensive experiments on standard benchmarks and show significant improvements over inflation and other previous approaches to learning video representations for action recognition.
3 Our Approach
We now describe our approach in detail. To reiterate, our goal is to learn video representations without using any video annotations. We do so by leveraging pre-trained 2D networks, using them to supervise or “teach” the video models. Hence, we refer to the 2D pre-trained networks as “teachers” and our target video network as the “student”. We make no assumption about the respective architectures of these models, i.e., we do not constrain the structure of the 3D network to be merely a 3D version of the 2D networks it learns from or to have a structure compatible with them.
The network architecture used to train the student network is as follows. We start with teacher networks trained on standard image-level tasks, such as image classification on ImageNet. While in this work we primarily focus on classification, our architecture is generic and can also benefit from teachers trained on spatial tasks such as detection, keypoint estimation, and so on, with the only difference being the definition of the distillation loss function. Our architecture also naturally supports an arbitrary number of teachers, which can be used in a multi-task learning framework to distill information from multiple domains into the student. Throughout the training process, these teacher networks are kept fixed, in “test” mode, and are used to extract a feature representation from the video that serves as a “target” to supervise the student network.
Since teacher networks are designed to find objects in images, it is not obvious how to use them to extract features for actions in video. We propose a simple solution: pre-train the spatiotemporal action network to find objects in the frames of a video. However, our teacher networks are designed to work on images, so how do we apply them to a video? We experiment with standard approaches from the literature, including uniform or random sampling of frames, as well as averaging predictions from multiple different frames. In this work we use the last-layer features in the form of normalized softmax predictions or (unnormalized) logits. In the case of multiple frames, we average the teacher logits before computing a normalized prediction target. The student network then takes the complete video clip as input. We train it to predict the features or probability distribution produced by the teacher. For this purpose, we define the last layer in the student network to be a linear layer that takes the final spatiotemporally averaged feature tensor and maps it to a number of units matching the dimensionality of the output generated by the teacher. In the case of multiple teachers, we define one linear layer per teacher and optimize all losses jointly.
To formalize, let us denote a video as $V = (I_1, \ldots, I_N)$, where $I_n$ is the $n$-th frame. In our problem formulation, we have access to a teacher that reports a prediction at each frame. For simplicity, we assume that the teacher returns a (softmax) distribution over classification labels. We generate a distribution over labels for a video by (1) averaging the per-frame logits $z_c(I_n)$ for each class $c$ and (2) passing the average through a softmax function with temperature $T$ (typically $T = 1$):

$$q_c = \frac{\exp\big(\tfrac{1}{NT}\sum_{n} z_c(I_n)\big)}{\sum_{c'} \exp\big(\tfrac{1}{NT}\sum_{n} z_{c'}(I_n)\big)}$$
The resulting distribution $q$ is then used as soft targets for training the weights $w$ of a student network of arbitrary architecture by means of the following objective:

$$L(w) = -\sum_{c} q_c \log p_c(V; w)$$
where $p_c(V; w)$ is the student softmax distribution for label $c$. Finally, we explore multi-source knowledge distillation by adding together losses from different image-based teachers:

$$L(w) = \sum_{k} L_k(w)$$
In our experiments, we explore teachers trained for object classification (ImageNet) and scene classification (Places).
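The target-generation and loss computations described above are simple to sketch in code. The following is a minimal, dependency-free illustration (the function names are ours, not from the paper's implementation): it averages per-frame teacher logits over the clip, applies a temperature-scaled softmax to produce soft targets, and computes the cross-entropy objective, summed over teachers.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(z / T for z in logits)                      # subtract max for numerical stability
    exps = [math.exp(z / T - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def teacher_soft_targets(per_frame_logits, T=1.0):
    """Average per-frame teacher logits over the clip, then softmax -> soft targets q."""
    n = len(per_frame_logits)
    num_classes = len(per_frame_logits[0])
    avg = [sum(frame[c] for frame in per_frame_logits) / n for c in range(num_classes)]
    return softmax(avg, T)

def distillation_loss(q, p):
    """Cross-entropy between teacher soft targets q and student distribution p."""
    return -sum(qc * math.log(pc) for qc, pc in zip(q, p))

def multi_teacher_loss(targets_and_preds):
    """Sum of per-teacher losses (one student head per teacher, optimized jointly)."""
    return sum(distillation_loss(q, p) for q, p in targets_and_preds)
```

In practice the student distribution comes from a per-teacher linear head on the spatiotemporally pooled features; here both inputs are plain lists for clarity.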
We train the network using loss functions inspired by the network-distillation literature [3, 17]. When using the teacher network to produce a probability distribution, we train the student to produce a label distribution as well, and define the loss so as to match the teacher distribution. This can be formulated as a KL-divergence loss, which in the case of the softmax distribution is equivalent to computing the cross entropy. As suggested in prior distillation work, we also tried different values of the temperature $T$ to scale the logits before computing the softmax and cross entropy, but found a temperature of 1 to perform best in our experiments. We also experimented with a mean squared error loss on the logits (before softmax normalization), and compare performance in Section 4.
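The two objectives can be contrasted in a short sketch (again with illustrative names): the first computes cross-entropy between temperature-scaled softmaxes, while the second skips normalization and penalizes squared differences between raw logits. Note that the softmax variant is invariant to a constant shift of the logits, whereas the MSE variant is not.

```python
import math

def _softmax(logits, T=1.0):
    m = max(z / T for z in logits)
    exps = [math.exp(z / T - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ce_over_softmax(teacher_logits, student_logits, T=1.0):
    """Cross-entropy between teacher and student temperature-scaled softmaxes."""
    q = _softmax(teacher_logits, T)
    p = _softmax(student_logits, T)
    return -sum(qc * math.log(pc) for qc, pc in zip(q, p))

def mse_over_logits(teacher_logits, student_logits):
    """Mean squared error computed directly on the unnormalized logits."""
    n = len(teacher_logits)
    return sum((t - s) ** 2 for t, s in zip(teacher_logits, student_logits)) / n
```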
3.1 Architecture Details
We use recent, state-of-the-art network architectures for all experiments and comparisons. For the still-image teacher networks, we use the ResNet-50 architecture, trained on different image datasets such as ImageNet-1K and Places-365. For the spatiotemporal (video) student architectures, we first experiment with a variant of the Res3D architecture. Res3D is an improved version of the popular C3D architecture that uses residual connections. We denote an n-layer Res3D model as Res3D-n, which is compatible with the standard ResNet-n architecture. Since there is a one-to-one correspondence between such 2D and 3D models, the 3D models can also be initialized by inflating the learned weights from the 2D models (e.g., for each channel, replicate the 2D filter weights along the temporal dimension to produce a 3D convolutional filter). Similar ideas of inflating 2D models to 3D have been proposed previously for Inception-style architectures, along with initialization techniques from corresponding 2D models [4, 9, 8]. The existence of a one-to-one mapping between the 2D and 3D models used in our experiments allows us to compare our approach to inflation-based initialization. However, we stress that unlike inflation, our method is applicable even in scenarios where no such one-to-one mapping holds.
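Inflation itself is easy to state in code. The pure-Python sketch below (with our own helper names) replicates a 2D filter T times along the temporal axis and rescales by 1/T, so that on a static clip the inflated 3D filter reproduces the 2D filter's response; the "responses" here are single dot products rather than full convolutions, for brevity.

```python
def inflate_2d_filter(w2d, T):
    """Inflate a k x k 2D filter to T x k x k by replicating over time,
    scaled by 1/T so the response on a static clip matches the 2D response."""
    return [[[v / T for v in row] for row in w2d] for _ in range(T)]

def response_2d(w2d, patch2d):
    """Dot product of a 2D filter with a 2D image patch."""
    return sum(w * x for wrow, prow in zip(w2d, patch2d) for w, x in zip(wrow, prow))

def response_3d(w3d, clip):
    """Dot product of a 3D filter with a clip (list of 2D patches over time)."""
    return sum(response_2d(wt, frame) for wt, frame in zip(w3d, clip))
```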
More recently, full 3D models have been superseded by (2+1)D architectures, where each 3D kernel is decomposed into a 2D spatial component followed by a 1D temporal filter. Similar models have also been proposed previously, and are also known as P3D or S3D architectures. These models have proven to be more efficient, with far fewer parameters, and more effective on various standard benchmarks [49, 44]. However, these models no longer conform to standard 2D architectures, as they contain additional conv and batch_norm layers, with parameters that do not exist in corresponding 2D models, and hence cannot be initialized from those 2D models. Nevertheless, our distillation remains applicable even in this scenario. In this work, we refer to such networks using the R(2+1)D-n notation for an n-layer architecture.
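To make the parameter-count comparison concrete, here is a small sketch (helper names are ours) counting weights in a full t x k x k 3D convolution versus its (2+1)D factorization into a spatial 1 x k x k convolution followed by a temporal t x 1 x 1 convolution; `matched_mid` is the choice of intermediate width that roughly equates the two counts, as proposed in the R(2+1)D paper.

```python
def params_3d(c_in, c_out, t, k):
    """Weights in a full t x k x k 3D conv bank (biases ignored)."""
    return c_out * c_in * t * k * k

def params_2plus1d(c_in, c_out, t, k, mid):
    """(2+1)D: spatial 1 x k x k conv into `mid` channels, then temporal t x 1 x 1."""
    return mid * c_in * k * k + c_out * mid * t

def matched_mid(c_in, c_out, t, k):
    """Intermediate width that approximately matches the 3D parameter count."""
    return (t * k * k * c_in * c_out) // (k * k * c_in + t * c_out)
```

For example, with 64 input/output channels and a 3 x 3 x 3 kernel, naively setting mid = 64 less than halves the parameter count, while `matched_mid` restores parity (the usual setting, since the factorization is about adding nonlinearity and ease of optimization, not shrinking the model).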
3.2 Implementation Details
For all experiments, we use previously published hyperparameter values. For distillation pre-training, we use the hyper-parameter setup for “Kinetics from-scratch training”: distributed Sync-SGD over GPUs, starting with LR = 0.01 and dropping it by a factor of 10 every 10 epochs. We train for a total of 45 epochs, where each epoch is defined as a full pass over 1M examples. The video model is learned on 8-frame clips of 112 pixels. The network has a depth of 18 layers, which enables faster experimentation compared to the best reported model, which uses 32 frames and has a depth of 34 layers. The batch size used for Kinetics training is 32/GPU, which we reduce to 24/GPU to accommodate the additional memory requirements of the teacher networks. For the fine-tuning experiments on smaller datasets like HMDB, we use Sync-SGD, starting with LR = 0.002 and dropping it every 2 epochs. We train for 8 epochs, with each epoch defined as a full pass over 40K examples. When training from scratch, we use an initial LR of 0.01 with a step every 10 epochs, trained for a total of 45 epochs.
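The step schedule used for distillation pre-training can be written as a one-liner (values taken from the text above; the helper name is ours):

```python
def lr_at_epoch(epoch, base_lr=0.01, drop_every=10, factor=10.0):
    """Step LR schedule: divide the base LR by `factor` every `drop_every` epochs."""
    return base_lr / (factor ** (epoch // drop_every))
```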
4 Experiments

We now experimentally evaluate our system. We start by introducing the datasets and benchmarks used for training and evaluation in Section 4.1. We then compare DistInit with inflating 2D models for initialization in Section 4.2. Next, we ablate the various design choices in DistInit in Section 4.3, and finally compare to the previous state of the art on UCF-101 and HMDB-51 in Section 4.4.
4.1 Datasets and Evaluation
Our method involves two stages: pre-training on a large, unlabeled corpus of videos using still-image models as “teachers”, followed by fine-tuning on the training split of a labeled target dataset. After training, we evaluate the performance on the test set of the target dataset. We use HMDB-51 and UCF-101 as our target test beds, while videos taken from datasets such as Kinetics and Sports1M, or even random uncurated internal videos, are used as unlabeled video corpora for distillation. Note that while datasets like Kinetics and Sports1M do come with semantic (action) labels, we ignore such annotations in this work and only use the raw videos from these datasets.
Unlabeled video corpus: We experiment with a variety of different unlabeled video corpora in Section 4.3, including Kinetics, Sports1M, and a set of internal videos. Kinetics and Sports1M contain about 300K and 1.1M videos, respectively. In this work, we drop any labels these datasets come with, and only use the videos as a large, unlabeled corpus to train video representations. The internal video set includes 1M videos without any semantic labels, randomly sampled from a larger collection. We use these diverse datasets to show that our method does not rely on any form of data curation, and can work with truly in-the-wild videos.
Test beds: HMDB-51  contains 6766 realistic and varied video clips from 51 action classes. Evaluation is performed using average classification accuracy over three train/test splits from , each with 3570 train and 1530 test videos. UCF-101  consists of 13320 sports video clips from 101 action classes, also evaluated using average classification accuracy over 3 splits. We use the HMDB split 1 for ablative experiments, and report the final performance on all splits for HMDB and UCF in Section 4.4.
| Model | Initialization | Per clip | Top 1 | Top 5 |
| --- | --- | --- | --- | --- |
4.2 DistInit vs Inflation
We first compare our proposed approach to inflation [4, 8], i.e., initializing video models from 2D models by inflating 2D kernels to 3D via replication over time. Note that inflation is constrained to operate on 3D models that have a one-to-one correspondence with the 2D model. Hence, we use a Res3D-18 model, which is compatible for direct inflation from ResNet-18 models. We experiment with publicly available ImageNet and Places models. We compare these with our distillation approach in Table 1, trained using an ImageNet-pretrained model as the teacher. Distillation improves performance by 15% over a model trained from scratch, and 4% over a model trained with inflated weights (the current best practice for training such models). More importantly, our approach can also be used to initialize specialized temporal architectures such as R(2+1)D, which do not have a natural 2D counterpart. In such a setting, the current best practice is to train such networks from scratch. Here, distillation improves performance by 16%. Finally, we also report the model trained using actual Kinetics labels, and, as expected, that yields higher performance. Hence there is clear value to the explicit manual supervision provided in such large-scale datasets, but distillation appears to get us “half-way” there.
At this point, it is natural to ask: why does the distilled model outperform current best practices such as inflation? We visualize the learned representation by plotting the first-layer conv filters in Figure 3. It can be seen that our distilled model learns truly spatiotemporal filters that vary in time, whereas inflation essentially copies the same filter over time. Such dynamic temporal variation is readily present in the videos used for distillation, even when they are not labeled with spatiotemporal action categories. Filters pre-trained with inflation initialization never see actual video data, and so cannot encode such variation. In Figure 4 we also compare the filters learned by our R(2+1)D model via distillation vs via fully-supervised training. Our filters look quite similar to those learned through supervised learning, showing the effectiveness of our approach. In some sense, the improved performance of distillation can be readily explained by more data: networks trained from scratch see no data for pre-training, inflation networks see ImageNet, while distilled networks see both ImageNet and unlabeled videos. Our practical observation is that one can use image-based teachers to pre-train on massively large, unlabeled video datasets.
We can also analyze the effectiveness of distillation pre-training by visualizing how well the learned representation correlates with action classes. As shown in Figure 5, the last-layer features for the same action class tend to cluster together when projected to 2D using tSNE.
4.3 Diagnostic Analysis
Design choices for the teacher network: We now ablate the design choices for the teacher networks. Teacher networks are required to generate a target label to supervise the video model being trained, using images from the video clip. We experiment with picking the center frame, a random frame, or multiple random frames from the clip to compute the targets. In the case of multiple frames, we average the logits before passing them through the softmax to generate the target distribution. We compare these methods in Table 2, and observe higher performance when picking frames randomly. This improvement may be due to reduced overfitting through label augmentation. We use random picking in our final model.
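The three pick strategies compared here can be sketched as follows (an illustrative helper of our own, seeded for reproducibility):

```python
import random

def pick_frames(num_frames, strategy="random", k=3, rng=None):
    """Choose frame indices whose teacher logits are averaged into the target."""
    rng = rng or random.Random(0)
    if strategy == "center":
        return [num_frames // 2]
    if strategy == "random":
        return [rng.randrange(num_frames)]
    if strategy == "multi":                      # k distinct random frames
        return sorted(rng.sample(range(num_frames), k))
    raise ValueError(f"unknown strategy: {strategy}")
```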
| Model | Pick strategy | Per clip | Top 1 | Top 5 |
| --- | --- | --- | --- | --- |
Distillation loss: Next, we evaluate the different choices for the loss function in distillation. As already explained in Section 3, previous work has suggested different loss functions for distillation tasks. We compare two popular approaches: KL divergence over distributions and mean squared error over logits. In the case of the former, we compute the softmax distribution from the teacher networks, as well as from the student branch that attempts to match that teacher, and use the cross entropy between the two softmax distributions as the objective to optimize. We find this objective can be well optimized using the standard hyper-parameter setup used for Kinetics training. In the case of the latter, we skip the softmax normalization step and directly compute the mean squared error between the outputs of the last linear layers as the objective. Since the initial loss values are much higher, we needed to drop the learning rate by a factor of 10 to optimize this model, with all other parameters kept the same. As Table 3 shows, we observe similar downstream performance with both.
| Loss function | Per clip | Top 1 | Top 5 |
| --- | --- | --- | --- |
| cross-entropy (over softmax) | 37.8 | 40.3 | 74.4 |
| mean squared error (over logits) | 35.6 | 39.9 | 70.5 |
Selecting confident predictions: As some recent work has shown, distillation techniques can benefit from using only the most confident predictions for training the student. We use the entropy of the predictions from the teachers as a notion of their confidence. We implement this confidence thresholding by setting a zero weight on the loss for each example for which the teacher is not confident, i.e., has high-entropy predictions, effectively dropping parts of the training data that are confusing for the teacher. We show the performance when dropping different amounts of data in Figure 6. The red curve shows a kernel density estimate (a PDF) of entropy values for an ImageNet teacher on the Kinetics data. At any given entropy value e, it shows the relative likelihood of a data point having that entropy, and the area under the curve from 0 to e gives the percentage of data with entropy at most e. We experiment with setting different thresholds for dropping the low-confidence data points during DistInit, and show the downstream HMDB-51 split-1 performance in the line plots. We found slightly better performance even after dropping nearly half the data, which also makes training faster.
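A minimal sketch of this entropy-based filtering (names are ours; the actual thresholds in the paper are read off the density plot in Figure 6):

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def confidence_weights(teacher_dists, max_entropy):
    """Per-example loss weights: zero for high-entropy (unconfident) teacher predictions."""
    return [1.0 if entropy(d) <= max_entropy else 0.0 for d in teacher_dists]
```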
Varying the unlabeled dataset: We now evaluate whether our method is dependent on any specific video data source, and whether it can benefit from additional data sources. We evaluate this in Table 4 and observe similar performance when using different sets of videos (without labels), like Kinetics and Sports1M. We also experiment with an internal set of videos downloaded from the web, and still get strong DistInit performance. This shows that our method does not rely on any form of data curation, and can learn from truly in-the-wild videos.
| Model | Unlabeled set | Size | Per clip | Top 1 | Top 5 |
| --- | --- | --- | --- | --- | --- |
Using other teachers: Just as our model is capable of learning from more data, it is also capable of using diverse supervision. We experiment with replacing the ImageNet teacher with a model trained on Places, and obtain 36.8% HMDB fine-tuning performance, as opposed to 40.3% before with ImageNet. Apart from showing that our model can learn from diverse sources of supervision, this result suggests that objects-to-actions semantic transfer is more effective than scenes-to-actions transfer. This makes sense, as human actions are typically informed more by the objects in their environment than by the environment itself.
Different input modalities: One of the biggest advantages of our method is that it can learn representations for arbitrary input modalities. We experiment with optical flow, which still contributes significant performance improvements on video tasks, even with modern video architectures, across different datasets. Previous work [4, 48] has used ImageNet initialization for networks accepting flow as input. This is far from ideal, since flow has very different statistics than RGB images. DistInit, on the other hand, is agnostic to the input data modality of the student network. We train the student network on the flow modality, while the teacher uses a random RGB frame from the same clip to generate the distillation target. As we show in Table 5, DistInit still produces a strong initialization and improves over training from scratch or the ImageNet inflated initialization. However, due to the high computational cost of computing flow, we omit this input modality from the final comparisons.
| Model | Initialization | Per clip | Top 1 | Top 5 |
| --- | --- | --- | --- | --- |
| Res3D- | ImageNet mean inflated | 33.5 | 43.9 | 73.9 |
4.4 Comparison with previous work
Finally, in Table 6 we compare our model to other reported standard models and initialization methods on HMDB and UCF. Here we only compare to RGB-based models for computational speed, even though our approach is applicable to flow or other modalities as shown in Section 4.3. We obtain strong performance compared to standard methods, and other unsupervised feature learning techniques . Finally, in Figure 7 we show which classes benefit the most from the initialization provided by DistInit compared to that computed by Inflation.
| Model | Architecture | #frames | Pre-training | Split 1 | 3-split avg |
| --- | --- | --- | --- | --- | --- |
| Misra et al. | AlexNet | 1 | Scratch | - | 13.3 |
| Misra et al. | AlexNet | 1 | Tuple verify | - | 18.1 |
| Misra et al. | AlexNet | 1 | ImageNet | - | 28.5 |
| Two-stream (RGB) [37, 10] | VGG-M | 1 | ImageNet | - | 40.5 |
| LSTM | BN-Inception | - | ImageNet | 36.0 | - |
| Two stream (RGB) | BN-Inception | 1 | ImageNet | 43.2 | - |
| I3D (RGB) | BN-Inception | 64 | ImageNet | 49.8 | - |
| Ours (RGB) | R(2+1)D-18 | 32 | DistInit | 54.9 | 54.8 |
| Model | Architecture | #frames | Pre-training | Split 1 | 3-split avg |
| --- | --- | --- | --- | --- | --- |
| Misra et al. | AlexNet | 1 | Scratch | - | 38.6 |
| Misra et al. | AlexNet | 1 | Tuple verification | - | 50.2 |
| Two-stream (RGB) [37, 10] | VGG-M | 1 | ImageNet | - | 73.0 |
| LSTM | BN-Inception | - | ImageNet | 81.0 | - |
| Two stream (RGB) | BN-Inception | 1 | ImageNet | 83.6 | - |
| I3D (RGB) | BN-Inception | 64 | ImageNet | 84.5 | - |
| Ours (RGB) | R(2+1)D-18 | 32 | DistInit | 85.7 | 85.8 |
5 Conclusion

We have described a simple approach for transferring knowledge from image datasets labeled for object or scene recognition to spatiotemporal video models for human action recognition. Much previous work has addressed this problem by constraining spatiotemporal architectures to match their 2D counterparts, limiting the choice of networks that can be explored. Our approach, DistInit, is based on distillation and can be used to initialize any spatiotemporal architecture. It does so by making use of image-based teachers that leverage considerable knowledge about objects, scenes, and potentially other semantics (e.g., attributes, pose) encoded in richly annotated image datasets. Unlike previous unsupervised learning works that depend on the curated ImageNet dataset, albeit without labels, we show that our model works even on truly in-the-wild uncurated videos. We demonstrate significant improvements over standard best practices for initializing spatiotemporal models. That said, our results do not match the accuracy of models pretrained on recently introduced, large-scale supervised video datasets; we note, however, that these were collected and annotated with significant manual effort. Because our approach requires only unlabeled videos, it has the potential to exploit massively larger data for learning accurate video models.
This work was partly supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DOI/IBC) contract number D17PC00345. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/IBC, or the U.S. Government.
-  S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.
-  Y. Aytar, C. Vondrick, and A. Torralba. Soundnet: Learning sound representations from unlabeled video. In NIPS, 2016.
-  C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil. Model compression. In KDD, 2006.
-  J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
-  N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
-  C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
-  J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
-  C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, 2016.
-  C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spatiotemporal multiplier networks for video action recognition. In CVPR, 2017.
-  C. Feichtenhofer, K. Simonyan, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition (online). http://www.robots.ox.ac.uk/~vgg/software/two_stream_action/.
-  D. F. Fouhey, W. Kuo, A. A. Efros, and J. Malik. From lifestyle VLOGs to everyday interactions. In CVPR, 2018.
-  S. Gidaris, P. Singh, and N. Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
-  P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
-  S. Gupta, J. Hoffman, and J. Malik. Cross modal distillation for supervision transfer. In CVPR, 2016.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
-  Y. Jiang, J. Liu, A. Roshan Zamir, I. Laptev, M. Piccardi, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. http://www.thumos.info/, 2013.
-  A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
-  W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
-  P. Krähenbühl, C. Doersch, J. Donahue, and T. Darrell. Data-dependent initializations of convolutional neural networks. In ICLR, 2016.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
-  H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011.
-  I. Laptev. On space-time interest points. IJCV, 2005.
-  Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
-  D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
-  I. Misra, C. L. Zitnick, and M. Hebert. Shuffle and Learn: Unsupervised Learning using Temporal Order Verification. In ECCV, 2016.
-  M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash. Boosting self-supervised learning via knowledge transfer. In CVPR, 2018.
-  Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In ICCV, 2017.
-  I. Radosavovic, P. Dollár, R. Girshick, G. Gkioxari, and K. He. Data distillation: Towards omni-supervised learning. In CVPR, 2018.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li. Imagenet large scale visual recognition challenge. IJCV, 2015.
-  P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, and S. Levine. Time-contrastive networks: Self-supervised learning from video. In ICRA, 2018.
-  G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In ECCV, 2016.
-  K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
-  K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CRCV-TR-12-01, 2012.
-  L. Sun, K. Jia, D.-Y. Yeung, and B. E. Shi. Human action recognition using factorized spatio-temporal convolutional networks. In ICCV, 2015.
-  C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. In ICLR Workshops, 2016.
-  D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
-  D. Tran, J. Ray, Z. Shou, S.-F. Chang, and M. Paluri. Convnet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038, 2017.
-  D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
-  L. van der Maaten and G. E. Hinton. Visualizing high-dimensional data using t-SNE. JMLR, 2008.
-  H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action Recognition by Dense Trajectories. In CVPR, 2011.
-  H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
-  L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
-  S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning for video understanding. arXiv preprint arXiv:1712.04851, 2017.
-  A. R. Zamir, A. Sax, W. Shen, L. Guibas, J. Malik, and S. Savarese. Taskonomy: Disentangling task transfer learning. In CVPR, 2018.
-  H. Zhao, Z. Yan, H. Wang, L. Torresani, and A. Torralba. SLAC: A sparsely labeled dataset for action classification and localization. arXiv preprint arXiv:1712.09374, 2017.
-  B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million Image Database for Scene Recognition. TPAMI, 2017.