Rethinking Zero-shot Video Classification: End-to-end Training for Realistic Applications

Trained on large datasets, deep learning (DL) can accurately classify videos into hundreds of diverse classes. However, video data is expensive to annotate. Zero-shot learning (ZSL) proposes one solution to this problem. ZSL trains a model once, and generalizes to new tasks whose classes are not present in the training dataset. We propose the first end-to-end algorithm for ZSL in video classification. Our training procedure builds on insights from recent video classification literature and uses a trainable 3D CNN to learn the visual features. This is in contrast to previous video ZSL methods, which use pretrained feature extractors. We also extend the current benchmarking paradigm: Previous techniques aim to make the test task unknown at training time but fall short of this goal. We encourage domain shift across training and test data and disallow tailoring a ZSL model to a specific test dataset. We outperform the state-of-the-art by a wide margin. Our code, evaluation procedure and model weights are available at




1 Introduction

Figure 1: (Top) Our model is state-of-the-art (error computed on the UCF test dataset). (Bottom) Our e2e model is simple but powerful. URL [59], Action2Vec [14] and TARN [4] are state-of-the-art approaches. Gray blocks represent modules fixed during training. Colors (blue, red, orange, yellow) indicate modules trained in separate stages.

Training image and video classification algorithms requires large training datasets [17, 23, 46, 47, 48]. With no task-specific training data available, one may still attempt to train a model using related information and transfer the learned knowledge to classify previously unseen categories. This approach is called zero-shot learning (ZSL) [25, 30], and it is quite successful in the image domain [37, 39, 40, 42, 51].

We focus on ZSL for video action recognition, where data sourcing and annotation are particularly expensive. Since the set of possible human actions is huge, action recognition is a great ZSL testbed. Trained on large-scale academic datasets [11, 13, 20, 21, 24, 45], supervised 3D convolutional neural networks (CNNs) have proved successful in this domain [12, 46, 47]. How well modern deep networks can recognize human actions in the ZSL setting is, however, an open question.

To our knowledge, all current ZSL methods for video recognition use pretrained visual embeddings [1, 4, 14, 29, 31, 49, 50, 53, 54, 55, 56, 59]. This provides a good tradeoff between training efficiency and using prior knowledge. Shallow trainable models then convert the pretrained representations to ZSL embeddings, as shown in Fig. 1 (Bottom). Low training space complexity of shallow models allows them to benefit from long video sequences [46] and large feature extractors [17].

In contrast, state-of-the-art algorithms in the fundamental CV domains of image classification [17], object detection [32, 34, 44] and segmentation [8, 16, 58] all rely on end-to-end (e2e) training. Representation learning is at the core of deep networks’ success across machine learning domains [3], and deeper models can better utilize information available in large datasets [2, 17]. This poses a question: How can an e2e ZSL model compete with current methods?

Our contributions involve multiple aspects of ZSL video classification:

  1. We propose the first e2e-trained model for zero-shot action recognition. The training procedure is inspired by modern supervised video classification practices. Fig. 1 shows that our method is simple, yet outperforms previous work. Moreover, we devise a novel easy pretraining technique that targets the ZSL scenario for video recognition.

  2. We propose a novel ZSL training and evaluation protocol that enforces a realistic ZSL setting. Extending the work of Roitberg et al. [36], we test a single trained model on multiple test datasets, where sets of training and test classes are disjoint. In addition, we argue that training and test domains should not be identical.

  3. We perform an in-depth analysis of the e2e model and a pretrained baseline. In a series of guided experiments we explore the characteristics of good ZSL datasets.

Our model, training and evaluation code, are available at

Figure 2: Training and test classes, t-SNE [26] visualization of Word2Vec embeddings. Red dots represent training classes we used, and gray dots represent training classes we removed in order to separate training and test data. Crosses represent test classes. Pictures are actual dataset video frames.

2 Related work

We focus on inductive ZSL in which test data is fully unknown at training time. There exists a body of literature on transductive ZSL [1, 29, 49, 50, 54, 53, 55], where test images or videos are available during training but test labels are not. We do not discuss the transductive approach in this work.

Video classification: Modern, DL-based video classification methods fall largely into two categories: 2D networks [43, 48] that operate on 1-5 frame snippets and 3D networks [5, 6, 7, 12, 15, 27, 41, 46, 47] that operate on 16-128 frames. One of the earliest works of this type, Simonyan and Zisserman [43], trained with only 1-5 frames sampled randomly from the video. At inference many more frames were sampled and the classifier outputs were averaged across all samples taken for a video clip. This implied that looking at a large chunk of the video was important during inference but wasn’t strictly required during training. Wang et al. [48] showed that sampling multiple frames throughout the video during training could improve performance, opening the question of whether training also requires a large temporal context. However, a body of later work based on more powerful 3D networks [7, 12, 46] showed that for most datasets sampling 16 frames during training is sufficient. Increasing the training frame count from 16 to 128 improved performance only marginally.

In this work, we adapt the training-time sampling philosophy of state-of-the-art video classification to the ZSL setup. This allows us to train the visual embedding e2e. As a consequence, the overall architecture and inference procedure are very simple compared to previous work, and the results are state-of-the-art – as shown in Fig. 1.

Zero shot video classification: The common practice in zero-shot video classification is to first extract visual features from video frames using a pretrained network such as C3D [46] or ResNet [17], then train a temporal model that maps the visual embedding to a semantic embedding space [4, 14, 31, 56, 59]. Good generalization on semantic embeddings of class names means that the model can be applied to new videos where the possible output classes are not present in training data. Inference reduces to finding the test class whose embedding is the nearest-neighbor of the model’s output. Word2Vec [28] is commonly used to produce the ground-truth word embeddings. An alternative approach is to use manually crafted class attributes [19]. We decided not to pursue the manual approach as it is harder to apply in general scenarios.

Figure 3: Removing overlapping training and test classes. The y-axis shows the Kinetics classes closest to the test sets UCF and HMDB. The x-axis shows the distance (see Eq. 4) to the corresponding closest test class. In our experiments, we removed training classes closer to the test set than the similarity threshold (Sec. 3.3), i.e. those to the left of the red line in the figure.

Two effective recent methods, Hahn et al. [14] and Bishay et al. [4], extract C3D features from 52 clips of 16 frames from each video. They then learn a recurrent neural network [10, 18] to encode the result as a single vector. Finally, a fully connected layer maps the encoded video into the Word2Vec embedding space. Fig. 1 illustrates this approach. Both [14] and [4] use the same dataset for training and testing, after splitting the available dataset classes into two sets. Using a pretrained deep network is convenient because pre-extracted visual features easily fit in GPU memory, even for a large number of video frames. Alternative approaches use generative models to compensate for the gap between semantic and visual distributions [29, 57]. Unfortunately, performance is limited by the inability to fine-tune the visual embedding. We show that fine-tuning is crucial to generalize across datasets.

Our work is similar to Zhu et al. [59] in that both methods learn a universal action representation that generalizes across datasets. However, their proposed model does not leverage the potential of 3D CNNs. Instead, they utilize the very deep ResNet200 [17], pretrained on ImageNet [9, 38], which cannot utilize temporal information.

As pointed out by Roitberg et al. [36], previous works train their models on actions overlapping with those of the target dataset, violating ZSL assumptions. For example, Zhu et al. [59] train on the full ActivityNet [11] dataset. This makes their results difficult to fairly compare with ours. Under our definition of ZSL (Sec. 3.3), Zhu et al. have 23 classes in their training datasets that overlap with the test dataset. The situation is similar for all other methods to varying degrees.

3 Zero-shot action classification

We first carefully define ZSL in the context of video classification. This will allow us to propose not only a new ZSL algorithm, but also a clear evaluation protocol that we hope will direct future research towards practical ZSL solutions. We stay within the inductive setting, as described in Sec. 2.

3.1 Problem setting

A video classification task is defined by a training set (source) $D_s = \{(x_1, c_1), \ldots, (x_{N_s}, c_{N_s})\}$ consisting of pairs of videos $x$ and their class labels $c$, and a video-label test set $D_t$. In addition, previous work often uses pretraining datasets $D_p$, as explained in Sec. 2.

Intuitively, ZSL is any procedure for training a classification model on $D_s$ (and possibly $D_p$) and then testing on $D_t$, where $D_t$ does not overlap with $D_s \cup D_p$. How this overlap is defined varies. Sec. 3.3 proposes a definition that is more restrictive than those used by previous work, and forces the algorithms into a more realistic ZSL setting.

ZSL classifiers need to generalize to unseen test classes. One way to achieve this is using nearest-neighbor search in a semantic class embedding space.

Formally, given a video $x$, we infer the corresponding semantic embedding $z = g(x)$ and classify $x$ as the nearest neighbor of $z$ in the set of embeddings of the test classes. A trained classification model $M(\cdot)$ then outputs

$$M(x) = \operatorname*{argmin}_{c \in D_t} \cos\big(g(x), \mathrm{W2V}(c)\big) \qquad (1)$$

where $\cos$ is the cosine distance and the semantic embedding is computed using the Word2Vec function [28] $\mathrm{W2V}: \mathcal{C} \to \mathbb{R}^{300}$, with $\mathcal{C}$ the space of class names.

The function $g = f_s \circ f_v$ is a composition of a visual encoder $f_v$ and a semantic encoder $f_s$.
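The nearest-neighbor rule of Eq. 1 can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation: the function names `cosine_distance` and `classify` and the toy 3-dimensional class embeddings are assumptions made for the example, whereas the paper works with 300-dimensional Word2Vec vectors.

```python
import math

def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def classify(video_embedding, class_embeddings):
    """Return the test class whose embedding is nearest to the model output."""
    return min(
        class_embeddings,
        key=lambda name: cosine_distance(video_embedding, class_embeddings[name]),
    )

# Toy class embeddings (hypothetical values, for illustration only).
test_classes = {
    "archery": [1.0, 0.0, 0.0],
    "bowling": [0.0, 1.0, 0.0],
    "surfing": [0.0, 0.0, 1.0],
}
predicted = classify([0.1, 0.9, 0.2], test_classes)
```

Because the set of candidate classes is just a dictionary passed at inference time, the same trained model can be pointed at a different test dataset by swapping the dictionary.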

3.2 End-to-end training

In previous work, the visual embedding function $f_v$ is either hand-crafted [55, 59] or computed by a pretrained deep network [4, 14, 50, 59]. It is fixed during optimization, forcing model development to focus on improving the semantic encoder $f_s$. The resulting models need to learn to transform fixed visual embeddings into meaningful semantic features and can be very complex, as shown in Fig. 1 (Bottom).

Instead, we propose to optimize both $f_v$ and $f_s$ at the same time. Such e2e training offers multiple advantages:

  1. Since $f_v$ provides a complex computation engine, $f_s$ can be a simple linear layer (see Fig. 1).

  2. We can implement the full model using standard 3D CNNs.

  3. Pretraining the visual embedding on a classification task is not necessary.

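End-to-end optimization using the full video is infeasible due to GPU memory limitations. Our implementation is based on standard video classification methods, which are effective even when only a small snippet is used during training, as discussed in detail in Sec. 2. Formally, given a training video/class pair $(x, c)$ from the training set, we extract a snippet $x^t$ of 16 frames at a random time $t \le \mathrm{len}(x) - 16$. The network is optimized by minimizing the loss

$$\mathcal{L} = \sum_{(x, c)} \big\| \mathrm{W2V}(c) - (f_s \circ f_v)(x^t) \big\|^2 \qquad (2)$$

where the sum runs over the training pairs, $f_v$ is the visual encoder, $f_s$ is the semantic encoder and $\mathrm{W2V}(c)$ is the Word2Vec embedding of the class name.

The inference procedure is similar but pools information from multiple snippets following Wang et al. [48]. Sec. 4.4 details both our training and inference procedures.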
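As a rough illustration of the training step described above, the sketch below samples one 16-frame snippet at a random start time and computes a squared Euclidean distance between a predicted and a target embedding. The helper names `sample_snippet` and `squared_l2`, the list of integers standing in for decoded frames, and the toy embedding values are assumptions for this example; a real pipeline would pass the snippet through the visual and semantic encoders.

```python
import random

SNIPPET_LEN = 16  # frames per training snippet, as stated in the text

def sample_snippet(frames, rng, length=SNIPPET_LEN):
    """Pick `length` consecutive frames starting at a random time t."""
    if len(frames) < length:
        raise ValueError("video is shorter than one snippet")
    t = rng.randint(0, len(frames) - length)  # randint is inclusive on both ends
    return frames[t:t + length]

def squared_l2(predicted, target):
    """Squared Euclidean distance between two embeddings of equal length."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target))

rng = random.Random(0)
video = list(range(100))            # stand-in for 100 decoded frames
snippet = sample_snippet(video, rng)

# Toy 3-dimensional embeddings (hypothetical values).
loss = squared_l2([0.5, 0.0, 1.0], [1.0, 0.0, 0.0])
```

Sampling a single short snippet per video keeps the memory footprint of one training example constant, regardless of how long the source video is.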

To better understand our method’s performance under various experimental conditions, we implemented a baseline model that uses the identical visual encoder $f_v$, semantic encoder $f_s$ and training data, but fixes $f_v$’s weights to values pretrained on the classification task (available out-of-the-box in the most recent PyTorch implementation, see Sec. 4.4). This was necessary since we were not able to access implementations of any of the state-of-the-art methods [4, 14, 59]. Unfortunately, our own re-implementations achieved results far below the numbers reported by their authors, even with their assistance.

3.3 Towards realistic ZSL

To ensure that our ZSL setting is realistic, we extend the method of [36], which carefully separates training and test data. This is cumbersome to achieve in practice, and has not been attempted by most previous work. We hope that our clear formulation of the training and evaluation protocols will make it easy for future researchers to understand the performance of their models in true ZSL scenarios.

Non-overlapping training and test classes: Our first goal is to make sure that the training data $D_s \cup D_p$ (source and pretraining sets) and the test data $D_t$ have “non-overlapping classes”. The simple solution, removing source class names from target classes or vice versa, does not work, because two classes with slightly different names can easily refer to the same concept, as shown in Fig. 3. A distance between class names is needed. Equipped with such a metric, we can make sure training and test classes are not too similar. Formally, let $d: \mathcal{C} \times \mathcal{C} \to \mathbb{R}$ denote a distance metric on the space of all possible class names $\mathcal{C}$, and let $\tau \in \mathbb{R}$ denote a similarity threshold. A video classification task fully respects the zero-shot constraint if

$$\forall c_s \in D_s \cup D_p: \quad \min_{c_t \in D_t} d(c_s, c_t) > \tau \qquad (3)$$

A straightforward way to define $d$ is using semantic embeddings of class names. We define the distance between two classes to be simply

$$d(c_1, c_2) = \cos\big(\mathrm{W2V}(c_1), \mathrm{W2V}(c_2)\big) \qquad (4)$$

where $\cos$ indicates cosine distance and $\mathrm{W2V}$ is the Word2Vec embedding of a class name. This is consistent with the use of the cosine distance in the ZSL setting in Eq. 1. Fig. 2 shows an embedding of training and test classes after we removed from Kinetics the classes overlapping with test data using the procedure outlined above. Fig. 3 shows the distribution of distances between training and test classes in our datasets. There is a cliff between distances very close to zero and clearly larger distances. In our experiments we place the threshold $\tau$ at this cliff, as a natural, unbiased choice.
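The class-removal procedure described above can be sketched as follows. This is an illustrative example under stated assumptions: the function name `filter_training_classes`, the toy 2-dimensional embeddings and the threshold value `tau=0.1` are hypothetical and chosen only to make the example readable.

```python
import math

def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def filter_training_classes(train_emb, test_emb, tau):
    """Keep only training classes whose nearest test class is farther than tau."""
    kept = []
    for name, vec in train_emb.items():
        nearest = min(cosine_distance(vec, t) for t in test_emb.values())
        if nearest > tau:
            kept.append(name)
    return kept

# Toy class-name embeddings (hypothetical values).
train = {
    "playing guitar": [1.0, 0.0],
    "strumming guitar": [0.99, 0.05],
    "riding horse": [0.0, 1.0],
}
test = {"guitar playing": [1.0, 0.01]}

kept = filter_training_classes(train, test, tau=0.1)
```

In this toy setup both guitar classes sit almost on top of the test class and are removed, even though none of the names match exactly, which is the failure mode that plain string matching would miss.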

Different training and test video domains: We argue that the video domains of the training set $D_s$ and the test set $D_t$ should differ. In previous work, the standard evaluation protocol is to use one dataset for training and testing, using 10 random splits. This does not account for domain shifts that happen in real world scenarios due to data compression, camera artefacts, and so on. For this reason ZSL training and test datasets should ideally have disjoint video sources.

Method              VisualFeat    UCF    HMDB   Activity
URL [59]            ResNet200     42.5   51.8   -
DataAug [55]        -             18.3   19.7   -
InfDem [35]         I3D           17.8   21.3   -
Bidirectional [50]  IDT           21.4   18.9   -
FairZSL [36]        -             -      23.1   -
TARN [4]            C3D           19     19.5   -
Action2Vec [14]     C3D           22.1   23.5   -
Ours (605 classes)  C3D           41.5   25.0   24.8
Ours (664 classes)  C3D           43.8   24.7   -
Ours (605 classes)  R(2+1)D_18    44.1   29.8   26.6
Ours (664 classes)  R(2+1)D_18    48     32.7   -
Table 1: Comparison with the state-of-the-art on standard benchmarks. We evaluate on half test classes following Evaluation Protocol 1 (Sec. 4.3). Ours(605classes) indicates we removed all training classes that overlap with UCF, HMDB, or ActivityNet. Ours(664classes) indicates we removed only training classes overlapping with UCF and HMDB. We outperform previous work in both scenarios. Sec. 2 argues that URL’s results are not compatible with other works as their training and test sets overlap and their VisualFeat is an order of magnitude deeper.

Multiple test datasets: A single ZSL model should perform well on multiple test datasets. As outlined above, previous works train and test anew for each available dataset (typically UCF and HMDB). In our experiments, training happens only once on the Kinetics dataset [21], and testing on all of UCF [45], HMDB [24] and ActivityNet [11].

3.4 Easy pretraining for video ZSL

In a real-world scenario a model is trained once and then deployed on diverse unseen test datasets. A large and diverse training dataset is crucial to achieve good performance. Ideally, the training dataset would be tailored to the general domain of inference – for example, a strong ZSL surveillance model to be deployed at multiple unknown locations would require a large surveillance and action recognition dataset.

Sourcing and labeling domain-specific video datasets is, however, very expensive. On the other hand, annotating images is considerably faster. Therefore, we designed a simple dataset augmentation scheme which creates synthetic training videos from still images. Sec. 5 shows that pretraining our model using this dataset boosts performance, especially if available training data is small.

We convert images to videos using the Ken Burns effect: a sequence of crops moving around the image simulates video-like motion. Sec. 4.1 provides more details.

Our experiments focus on the action recognition domain. In action recognition (as well as in many other classification tasks), the location and scenery of the video are strongly predictive of the action category. Because of this we choose SUN [52], a standard scene recognition dataset. Fig. 2 shows the complete class embedding of the scene dataset’s class names.

4 Experimental setup

To facilitate reproducibility, we describe our training and evaluation protocols in detail. The protocols propose one way of training and evaluating ZSL models that is consistent with our definitions in Sec. 3.3.

4.1 Datasets

UCF101 [45] has 101 action classes primarily focused around sports, with 13320 videos sourced from YouTube.

HMDB51 [24] is divided into 51 human actions focused around sports and daily activities and contains 6767 videos sourced from commercial videos and YouTube.

ActivityNet [11] contains 27,801 untrimmed videos divided into 200 classes focusing on daily activities, with videos sourced using web search. We extracted only the labeled frames from each video.

Kinetics [21] is the largest currently available action recognition dataset, covering a wide range of human activity. The first version of the dataset contains over 200K videos divided into 400 categories. The newest version has 700 classes for a total of 541624 videos sourced from YouTube.

SUN397 [52] (see Sec. 3.4) is a scene understanding image dataset. It contains 397 scene categories for a total of over 100K high-resolution images. We converted it to a simulated video dataset using the Ken Burns effect: to create a 16-frame video from an image, we randomly choose “start” and “end” crop locations (and crop sizes) in the image, and linearly interpolate to obtain 16 crops. Each crop is then resized to a fixed size.
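The Ken Burns conversion described above amounts to interpolating crop boxes between a random start crop and a random end crop. The sketch below computes only the crop geometry and leaves image loading and resizing aside. The function name `ken_burns_boxes` and the `min_scale` parameter (the smallest allowed crop relative to the image) are hypothetical and not taken from the paper.

```python
import random

def ken_burns_boxes(img_w, img_h, rng, n_frames=16, min_scale=0.5):
    """Linearly interpolate n_frames crop boxes between a random start and end crop.

    Each box is (left, top, width, height) in pixels. `min_scale` is an
    assumed lower bound on the crop size relative to the full image.
    """
    def random_box():
        scale = rng.uniform(min_scale, 1.0)
        w, h = img_w * scale, img_h * scale
        # Choose the top-left corner so the crop stays inside the image.
        return (rng.uniform(0, img_w - w), rng.uniform(0, img_h - h), w, h)

    start, end = random_box(), random_box()
    boxes = []
    for i in range(n_frames):
        a = i / (n_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        boxes.append(tuple((1 - a) * s + a * e for s, e in zip(start, end)))
    return boxes

boxes = ken_burns_boxes(640, 480, random.Random(0))
```

Since both endpoint crops lie inside the image and the interpolation is linear, every intermediate crop also stays inside the image, so no padding is needed.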


Method        UCF              HMDB             Activity
              Top-1   Top-5    Top-1   Top-5    Top-1   Top-5
URL [59]      34.2    -        -       -        -       -
664 classes   37.6    62.5     26.9    49.8     -       -
605 classes   35.3    60.6     24.8    44.0     20.0    42.7
Table 2: Evaluation on all test classes. In contrast to Table 1, here we report results of our method applied to all three test datasets using Evaluation Protocol 2 (Sec. 4.3). We applied a single model trained on classes dissimilar from all of UCF, HMDB and ActivityNet. Nevertheless, we outperform URL [59] on UCF101. The URL authors do not report results on full HMDB51. The remaining previous works report results on neither full UCF101 nor full HMDB51.

4.2 Training protocol

Our experiments in Sec. 5 use two training methods:

Training Protocol 1: Remove from Kinetics 700 all the classes whose distance to any class in UCF or HMDB is smaller than the similarity threshold of Sec. 3.3 (see Eq. 4). This results in a subset of Kinetics with 664 classes, which we call Kinetics 664. As explained in Sec. 3.3, this setting is already more restrictive than that of the previous methods, which train new models for each test dataset.

Training Protocol 2: Remove from Kinetics 700 all the classes whose distance to any class in UCF, HMDB or ActivityNet is smaller than the similarity threshold of Sec. 3.3 (see Eq. 4). This results in a subset of Kinetics with 605 classes, which we call Kinetics 605. This setting is even more restrictive, but is closer to true ZSL. Our goal is to show that it is possible to train a single ZSL model that applies to multiple diverse test datasets.

Figure 2 shows a t-SNE projection of the semantic embeddings of all Kinetics 700 classes, as well as the 101 UCF classes and the classes we removed to obtain Kinetics 664.

4.3 Evaluation protocol

We tested our model using two protocols: the first is compatible with previous work, the second follows Sec. 3.3 to emulate a true ZSL setting. Both Evaluation Protocols apply the same model to multiple test datasets.

Evaluation Protocol 1: In order to make our results comparable with previous work, we use the following procedure: Randomly choose half of the test dataset’s classes, 50 for UCF and 25 for HMDB. Evaluate the classifier on that test set. Repeat ten times and average the results for each test dataset.

Evaluation Protocol 2: Previous work uses random training/test splits of UCF [45] and HMDB [24] to evaluate their algorithms. However, we train on a separate dataset Kinetics 664 / 605 and can test on full UCF and HMDB. This allows us to return more realistic accuracy scores. The evaluation protocol is simple: evaluate the classifier on all 101 UCF classes and all 51 HMDB classes.
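Evaluation Protocol 1 above (random half of the test classes, repeated ten times and averaged) can be sketched as follows. The function names `half_class_splits` and `mean_accuracy`, the placeholder class names and the dummy `evaluate` callback are assumptions for illustration; in practice `evaluate` would run the trained classifier restricted to the sampled classes.

```python
import random

def half_class_splits(class_names, rng, n_repeats=10):
    """Draw n_repeats random subsets containing half of the test classes."""
    half = len(class_names) // 2   # 101 // 2 == 50 for UCF, 51 // 2 == 25 for HMDB
    return [rng.sample(class_names, half) for _ in range(n_repeats)]

def mean_accuracy(splits, evaluate):
    """Average the accuracy returned by `evaluate(split)` over all splits."""
    return sum(evaluate(split) for split in splits) / len(splits)

ucf_classes = [f"class_{i}" for i in range(101)]   # placeholder class names
splits = half_class_splits(ucf_classes, random.Random(0))

# Dummy evaluation that returns a constant accuracy, for illustration only.
avg = mean_accuracy(splits, evaluate=lambda split: 0.5)
```

Evaluation Protocol 2 corresponds to calling the same `evaluate` once on the full list of classes, with no sampling.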

4.4 Implementation details

In our experiments, the visual encoder $f_v$ (see Sec. 3.1) is the PyTorch implementation of R(2+1)D_18 [47] or C3D [46]. In the pretrained setting, we use the out-of-the-box R(2+1)D_18 pretrained on Kinetics 400 [21], while C3D is pretrained on Sports-1M [20]. In the e2e setting, we initialize the model with the pretrained=False argument. The visual embedding is BxTx512, where B is the batch size and T is the number of clips per video. We use a single clip per video for training and multiple clips for evaluation in Tables 1 and 2. The clips are 16 frames long and we choose them following standard protocols established by Wang et al. [48]. We average across time (video snippets) similarly to previous approaches [46, 59]. The semantic encoder $f_s$ is a linear classifier with 512x300 weights. The output of $f_s$ is of shape Bx300.

We follow standard protocol in computing semantic embeddings of class names [4, 53, 59]. Word2Vec [28], in particular the gensim [33] Python implementation, encodes each word. We average multi-word class names. In rare cases of words not available in the pretrained W2V model (for example, “rubiks” or “photobombing”) we manually change the words (see the code for more details). Formally, for a class name $c$ consisting of $N$ words $c = [c_1, \ldots, c_N]$, we embed it as $\mathrm{W2V}(c) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{W2V}(c_i) \in \mathbb{R}^{300}$. We set the similarity threshold of Sec. 3.3 following the analysis in that section, based on Fig. 3.
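The averaging of multi-word class names can be illustrated with a small lookup table standing in for a pretrained Word2Vec model. The function name `embed_class_name` and the 2-dimensional toy vectors are assumptions for the example; a real model would supply 300-dimensional vectors, and out-of-vocabulary words would need the manual handling mentioned above.

```python
def embed_class_name(name, word_vectors):
    """Average the vectors of the words in a (possibly multi-word) class name."""
    words = name.lower().split()
    vectors = [word_vectors[w] for w in words]   # raises KeyError for unknown words
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Toy lookup table (hypothetical values) in place of a pretrained model.
toy_vectors = {"playing": [1.0, 0.0], "guitar": [0.0, 1.0]}

embedding = embed_class_name("playing guitar", toy_vectors)
single = embed_class_name("guitar", toy_vectors)
```

A single-word class name is returned unchanged by the average, so the same function covers both cases.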

To minimize the loss of Eq. 2 we use the Adam optimizer [22] with an initial learning rate that is decreased tenfold at epochs 60 and 120. Batch size is 22 snippets, with 16 frames each. The model is trained for 150 epochs. All experiments are performed on the Nvidia Tesla V100 GPU.
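The step schedule described above (a tenfold decrease at epochs 60 and 120 over 150 epochs) can be written as a small function. This is a sketch under assumptions: the function name `learning_rate` is hypothetical, the base rate of 1.0 is a placeholder rather than the value used in the experiments, and the choice to apply the drop from the milestone epoch onward (zero-indexed) is an assumption.

```python
def learning_rate(epoch, base_lr, milestones=(60, 120), factor=0.1):
    """Step schedule: multiply the rate by `factor` once per milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops

base = 1.0  # placeholder base rate, not the value used in the paper
schedule = [learning_rate(e, base) for e in (0, 59, 60, 119, 120, 149)]
```

With these milestones the rate takes three distinct values over training: the base rate, one tenth of it, and one hundredth of it.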

Following [46], we reshaped each frame’s shortest side to 128 pixels, and cropped a random 112x112 patch on training and the center patch on inference.
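The frame geometry described above (shortest side resized to 128 pixels, then a 112x112 crop that is random in training and centered at inference) can be sketched without any image library. The function names `resized_size` and `crop_box` and the 320x240 example frame are assumptions for illustration.

```python
import random

def resized_size(width, height, short_side=128):
    """Scale so that the shortest side equals `short_side`, keeping aspect ratio."""
    scale = short_side / min(width, height)
    return round(width * scale), round(height * scale)

def crop_box(width, height, size=112, rng=None):
    """Top-left corner of a size x size crop: random if rng is given, else centered."""
    if rng is None:
        return (width - size) // 2, (height - size) // 2
    return rng.randint(0, width - size), rng.randint(0, height - size)

w, h = resized_size(320, 240)                        # example 320x240 frame
center = crop_box(w, h)                              # inference: center patch
random_crop = crop_box(w, h, rng=random.Random(0))   # training: random patch
```

For consistent snippets, the same crop offset would normally be applied to all frames of a snippet, which is why the offset is computed once and returned rather than applied per frame.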

Figure 4: Number of training classes matters in ZSL. Orange curves show performance on subsets of Kinetics 664, as we keep all the training classes and increase the subset size. The blue curves, whose markers become progressively brighter, indicate a separate experiment where we increased the number of training classes starting from 2, all the way up to 664 (Sec. 5.2). For any given training dataset size, performance on test data is much better with more training classes. In addition, when few training classes are available the e2e model is not able to outperform the baseline.

Figure 5: Diverse training classes are good for ZSL. Here we trained our algorithm on subsets of 50 Kinetics 664 classes. (Top left) Training classes picked uniformly at random. (Top right) We clustered Word2Vec embeddings of classes into two clusters, then trained and evaluated separately using each cluster, and averaged the results. (Bottom) Here we averaged the results of training using three and six clusters. The figure shows that the more clusters, the less diverse the training classes were semantically. At the same time, less diversity caused higher errors.

Figure 6: Augmented pretraining with videos-from-images. We trained our algorithm on progressively smaller subsets of Kinetics 664 classes (Sec. 5.2). We compared the results to training on the same dataset, after pretraining the model on our synthetic SUN video dataset (Sec. 5.3). The pretraining procedure boosts performance by up to 10 percentage points.

5 Results

Our experiments have two goals: to compare our method to previous work, and to investigate our method’s performance against the baseline (see Sec. 3.2). The first is necessary to validate that e2e ZSL on videos can outperform more complex approaches that use pretrained features. The second allows us to understand under what conditions e2e training can be particularly beneficial.

5.1 Comparison to the state of the art

Table 1 compares our method to existing approaches. We followed our Training and Evaluation Protocol 1, as described in Sections 4.2 and 4.3. Our protocols are more restrictive than those of previous methods: we removed training classes that overlap with test classes, introduced domain shift, and applied one model to multiple test datasets. Despite this, we outperform previous video-based methods by a large margin. Furthermore, when testing on UCF we outperform URL [59], which uses a network an order of magnitude deeper than ours (200 vs 18 layers) and has 23 classes that overlap between training and testing (see Sec. 2).

5.2 Comparison to a baseline method

Our baseline method described in Sec. 3.2 uses a fixed, pretrained visual feature extractor but is otherwise identical to our e2e method. This allows us to study the benefits of e2e training under Evaluation Protocol 2 (see Sections 4.2 and 4.3). Using all test classes provides a more direct evaluation of the method.

Training dataset size: To investigate the effect of training set size on performance we subsampled Kinetics 664 uniformly at random, then re-trained and re-evaluated the model. Fig. 4 shows that the e2e algorithm consistently outperforms the baseline on both datasets. Both algorithms’ performance is worse with smaller training data. However, the baseline flattens out at about 100K training datapoints, whereas our method’s error keeps decreasing. This is expected, as the e2e model has more capacity.

Number of training classes: In many video domains diverse data is difficult to obtain. Small datasets might not only have few datapoints, but also contain only a few training classes. We show that the number of training classes can impact ZSL results as much as training dataset size.

To obtain Fig. 4 we subsampled Kinetics 664 class-wise. We first picked 2 Kinetics 664 classes at random, and trained the algorithm on those classes only. We repeated the procedure using 4, 10, 25, 50, 100, 200, 400 and all 664 classes. Naturally, the fewer classes, the fewer datapoints the training set contained. These results are compared in Fig. 4 with the procedure described above, where we removed Kinetics datapoints at random, independent of their classes.

The figure shows that it is better to have few training samples from a large number of classes than many samples from a very small number of classes. This effect is more pronounced for the e2e model than for the baseline.

Training dataset class diversity: We showed that ZSL works better with more training classes. If we have a limited budget for collecting classes and datapoints, how should we choose them? We investigated whether the set of training classes should emphasize fine differences (e.g. ”shooting basketball” vs ”passing basketball” vs ”shooting soccerball” and so on) or diversity.

In Fig. 5 we selected training classes in four ways: (Top Left) We randomly chose 50 classes from the whole Kinetics 664 dataset, trained the algorithm on these classes, and ran inference on the test set. We repeated this process ten times and averaged the inference error. (Top Right) We clustered the 664 classes into 2 clusters in the Word2Vec embedding space, chose 50 classes at random within one of the clusters, trained and ran inference. We then repeated the procedure ten times and averaged the result. (Bottom) Here we chose 50 classes in one of 3 clusters (Left) and one of 6 clusters (Right), trained, and averaged the inference results of 10 runs. The figure shows that test error for our method increases as class diversity decreases. This result is not obvious, since the task becomes harder with increasing class diversity.

5.3 Easy pretraining with images

The previous section showed that class count and diversity are important drivers of ZSL performance. This inspired us to develop the pretraining method described in Sec. 3.4: we pretrain our model on a synthetic video dataset created from still images from the SUN dataset. Fig. 6 shows that this simple procedure consistently decreases test errors by up to 10%. In addition, Fig. 7 shows that this initialization scheme makes the model more robust to large domain shift between train and test classes. The following section describes the latter finding in more detail.

Figure 7: Error as test classes move away from training. For each UCF101 test class, we computed its distance to 10 nearest neighbors in the training dataset. We arranged all such distance thresholds on the x-axis. For each threshold, we computed the accuracy of the algorithms on test classes whose distance from training data is larger than the threshold. In other words, as x-axis moves to the right, the model is evaluated on cumulatively smaller, but harder test sets.

5.4 Generalization and domain shift

A good ZSL model generalizes well to classes that differ significantly from training classes. To investigate the performance of our models under heavy domain shift, we computed the accuracy on subsets of test data with a growing distance from the training dataset. We first trained our model on Kinetics 664. Then, for a given distance threshold (see Sec. 3.3), we computed accuracy on the set of UCF classes whose mean distance from the closest 10 Kinetics 664 classes is larger than that threshold. Fig. 7 shows that the performance of the baseline model (not trained e2e) drops to zero as the threshold grows. Our method performs much better, never dropping to zero accuracy for high thresholds. Finally, using the SUN pretraining further increases performance.
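The thresholded evaluation described above can be sketched as two small helpers: one computes the mean distance from a test class to its nearest training classes, the other restricts accuracy to test classes beyond a given threshold. The function names and all per-class numbers below are hypothetical and serve only to show the mechanics.

```python
def mean_knn_distance(distances, k=10):
    """Mean of the k smallest distances from one test class to the training classes."""
    nearest = sorted(distances)[:k]
    return sum(nearest) / len(nearest)

def accuracy_beyond(threshold, class_distance, class_correct, class_total):
    """Accuracy restricted to test classes farther than `threshold` from training data."""
    selected = [c for c, d in class_distance.items() if d > threshold]
    total = sum(class_total[c] for c in selected)
    if total == 0:
        return None  # no test class lies beyond this threshold
    return sum(class_correct[c] for c in selected) / total

# Hypothetical per-class statistics: distance to training data, correct and total videos.
dist = {"a": 0.2, "b": 0.4, "c": 0.6}
correct = {"a": 8, "b": 5, "c": 1}
total = {"a": 10, "b": 10, "c": 10}

curve = [accuracy_beyond(t, dist, correct, total) for t in (0.1, 0.3, 0.5, 0.7)]
example_distance = mean_knn_distance([0.5, 0.1, 0.3], k=2)
```

Each point of the resulting curve is computed on a smaller and harder subset of test classes than the previous one, so a falling curve reflects increasing distance from the training data rather than noise alone.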

5.5 Ablation study

Table 3 studies the contributions of different elements of our model to its performance. The performance is low when the visual embedding is fixed. The e2e approach improves the performance by a large margin. Our class augmentation method further boosts performance. Finally, it helps to extract linearly spaced snippets from a video at test time and average their visual embeddings. Using multiple snippets considerably improves performance without affecting the training time of the model.

UCF101 accuracy            50 classes        101 classes
e2e   Augment   Multi      Top-1   Top-5     Top-1   Top-5
                           26.8    55.5      19.8    40.5
                           43.0    68.2      35.1    56.4
                           45.6    73.1      36.8    61.7
                           48.0    74.2      37.6    62.5
                           49.2    77.0      39.8    65.6
Table 3: Ablation study. Numbers represent classification accuracy. “50 classes” uses Evaluation Protocol 1 (Sec. 4.3); “101 classes” uses Evaluation Protocol 2. e2e: training the visual embedding, as opposed to using a fixed, pretrained baseline (Sec. 3.2). Augment: pretraining using the SUN augmentation scheme (Sec. 5.3). Multi: at test time, extracting multiple snippets from each video and averaging their visual embeddings (Sec. 4.4).
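The "Multi" setting and the Top-1/Top-5 metrics can be illustrated with the following sketch. It assumes that each snippet has already been mapped to a visual embedding, that classes are ranked by cosine similarity between the averaged video embedding and the class embeddings, and that a snippet is 16 frames long. The helper names, the snippet length, and the toy vectors are hypothetical choices made for this example.

```python
import math


def snippet_starts(num_frames, snippet_len, num_snippets):
    """Start indices of `num_snippets` linearly spaced snippets of `snippet_len` frames."""
    last_start = max(num_frames - snippet_len, 0)
    if num_snippets == 1:
        return [last_start // 2]  # a single, centered snippet
    step = last_start / (num_snippets - 1)
    return [round(i * step) for i in range(num_snippets)]


def average_embedding(embeddings):
    """Element-wise mean of the per-snippet visual embeddings."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def top_k_classes(video_emb, class_embs, k=5):
    """Class names ranked by cosine similarity to the video embedding, best first."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    ranked = sorted(class_embs, key=lambda name: cosine(video_emb, class_embs[name]),
                    reverse=True)
    return ranked[:k]


print(snippet_starts(num_frames=100, snippet_len=16, num_snippets=4))  # [0, 28, 56, 84]

# Toy usage: two snippet embeddings averaged, then ranked against three classes.
video_emb = average_embedding([[1.0, 0.0], [0.0, 1.0]])
class_embs = {"run": [1.0, 1.0], "swim": [1.0, 0.0], "sit": [-1.0, 0.0]}
print(top_k_classes(video_emb, class_embs, k=2))  # ['run', 'swim']
```

A prediction counts toward Top-1 accuracy when the true class is ranked first, and toward Top-5 accuracy when it appears among the first five. Because the averaging happens only at test time, it adds inference cost but leaves training unchanged.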

6 Conclusion

We followed practices from recent video classification literature to train the first e2e system for video recognition ZSL. Our evaluation protocol is stricter than that of existing work, and measures more realistic zero-shot classification accuracy. Even under this stricter protocol, our method outperforms previous works whose performance was measured with training and test sets overlapping and sharing domains. Through a series of directed experiments, we showed that a good ZSL dataset should have many diverse classes. Guided by this insight, we formulated a simple pretraining technique that boosts ZSL performance.

Our model is easy to understand and extend. Our training and evaluation protocols are easy to use with alternative approaches. We made our code available to encourage the community to build on our insights and create a strong foundation for future video ZSL research.

Acknowledgement. We thank Amazon for providing the computational means and Alina Roitberg for a productive discussion about the evaluation protocol.

References

  • [1] I. Alexiou, T. Xiang, and S. Gong (2016) Exploring synonyms as context in zero-shot action recognition. In 2016 IEEE International Conference on Image Processing (ICIP), pp. 4190–4194. Cited by: §1, §2.
  • [2] S. Beery, G. Van Horn, and P. Perona (2018) Recognition in terra incognita. In The European Conference on Computer Vision (ECCV). Cited by: §1.
  • [3] Y. Bengio, A. Courville, and P. Vincent (2013) Representation learning: a review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35 (8), pp. 1798–1828. Cited by: §1.
  • [4] M. Bishay, G. Zoumpourlis, and I. Patras (2019) TARN: temporal attentive relation network for few-shot and zero-shot action recognition. arXiv preprint arXiv:1907.09021. Cited by: Figure 1, §1, §2, §2, §3.2, §3.2, Table 1, §4.4.
  • [5] B. Brattoli, U. Büchler, A. Wahl, M. E. Schwab, and B. Ommer (2017) LSTM self-supervision for detailed behavior analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.
  • [6] U. Büchler, B. Brattoli, and B. Ommer (2018) Improving spatiotemporal self-supervision by deep reinforcement learning. In European Conference on Computer Vision (ECCV). Cited by: §2.
  • [7] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308. Cited by: §2.
  • [8] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 801–818. Cited by: §1.
  • [9] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §2.
  • [10] R. Dey and F. M. Salem (2017) Gate-variants of gated recurrent unit (GRU) neural networks. In 2017 IEEE 60th international midwest symposium on circuits and systems (MWSCAS), pp. 1597–1600. Cited by: §2.
  • [11] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles (2015) ActivityNet: a large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–970. Cited by: §1, §2, §3.3, §4.1.
  • [12] C. Feichtenhofer, H. Fan, J. Malik, and K. He (2018) Slowfast networks for video recognition. arXiv preprint arXiv:1812.03982. Cited by: §1, §2.
  • [13] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. (2017) The “something something” video database for learning and evaluating visual common sense. In ICCV, Vol. 1, pp. 3. Cited by: §1.
  • [14] M. Hahn, A. Silva, and J. M. Rehg (2019) Action2Vec: a crossmodal embedding approach to action learning. arXiv preprint arXiv:1901.00484. Cited by: Figure 1, §1, §2, §2, §3.2, §3.2, Table 1.
  • [15] K. Hara, H. Kataoka, and Y. Satoh (2018) Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6546–6555. Cited by: §2.
  • [16] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §1.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §1, §1, §1, §2, §2.
  • [18] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.
  • [19] H. Idrees, A. R. Zamir, Y. Jiang, A. Gorban, I. Laptev, R. Sukthankar, and M. Shah (2017) The thumos challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding 155, pp. 1–23. Cited by: §2.
  • [20] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei (2014) Large-scale video classification with convolutional neural networks. In CVPR. Cited by: §1, §4.4.
  • [21] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §1, §3.3, §4.1, §4.4.
  • [22] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.4.
  • [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
  • [24] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre (2011) HMDB: a large video database for human motion recognition. In 2011 International Conference on Computer Vision, pp. 2556–2563. Cited by: §1, §3.3, §4.1, §4.3.
  • [25] H. Larochelle, D. Erhan, and Y. Bengio (2008) Zero-data learning of new tasks. In AAAI, Vol. 1, pp. 3. Cited by: §1.
  • [26] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: Figure 2.
  • [27] B. Martinez, D. Modolo, Y. Xiong, and J. Tighe (2019) Action recognition with spatial-temporal discriminative filter banks. In The IEEE International Conference on Computer Vision (ICCV). Cited by: §2.
  • [28] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §2, §3.1, §4.4.
  • [29] A. Mishra, V. K. Verma, M. S. K. Reddy, S. Arulkumar, P. Rai, and A. Mittal (2018) A generative approach to zero-shot and few-shot action recognition. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 372–380. Cited by: §1, §2, §2.
  • [30] M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell (2009) Zero-shot learning with semantic output codes. In Advances in neural information processing systems, pp. 1410–1418. Cited by: §1.
  • [31] A. Piergiovanni and M. S. Ryoo (2018) Learning shared multimodal embeddings with unpaired data. CoRR. Cited by: §1, §2.
  • [32] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §1.
  • [33] R. Řehůřek and P. Sojka (2010) Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, pp. 45–50. Cited by: §4.4.
  • [34] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1.
  • [35] A. Roitberg, Z. Al-Halah, and R. Stiefelhagen (2018) Informed democracy: voting-based novelty detection for action recognition. arXiv preprint arXiv:1810.12819. Cited by: Table 1.
  • [36] A. Roitberg, M. Martinez, M. Haurilet, and R. Stiefelhagen (2018) Towards a fair evaluation of zero-shot action recognition using external data. In Proceedings of the European Conference on Computer Vision (ECCV). Cited by: item Evaluation Protocol:, §2, §3.3, Table 1.
  • [37] K. Roth, B. Brattoli, and B. Ommer (2019) MIC: mining interclass characteristics for improved metric learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8000–8009. Cited by: §1.
  • [38] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §2.
  • [39] A. Sanakoyeu, V. Khalidov, M. S. McCarthy, A. Vedaldi, and N. Neverova (2020) Transferring dense pose to proximal animal classes. In CVPR. Cited by: §1.
  • [40] A. Sanakoyeu, V. Tschernezki, U. Büchler, and B. Ommer (2019) Divide and conquer the embedding space for metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §1.
  • [41] N. Sayed, B. Brattoli, and B. Ommer (2018) Cross and learn: cross-modal self-supervision. In German Conference on Pattern Recognition (GCPR). Cited by: §2.
  • [42] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §1.
  • [43] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pp. 568–576. Cited by: §2.
  • [44] B. Singh, M. Najibi, and L. S. Davis (2018) SNIPER: efficient multi-scale training. In Advances in Neural Information Processing Systems, pp. 9310–9320. Cited by: §1.
  • [45] K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. Cited by: §1, §3.3, §4.1, §4.3.
  • [46] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 4489–4497. Cited by: §1, §1, §1, §2, §2, §4.4, §4.4.
  • [47] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri (2018) A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6450–6459. Cited by: §1, §1, §2, §4.4.
  • [48] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2016) Temporal segment networks: towards good practices for deep action recognition. In European conference on computer vision, pp. 20–36. Cited by: §1, §2, §3.2, §4.4.
  • [49] Q. Wang and K. Chen (2017) Alternative semantic representations for zero-shot human action recognition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 87–102. Cited by: §1, §2.
  • [50] Q. Wang and K. Chen (2017) Zero-shot visual recognition via bidirectional latent embedding. International Journal of Computer Vision 124 (3), pp. 356–383. Cited by: §1, §2, §3.2, Table 1.
  • [51] Y. Xian, B. Schiele, and Z. Akata (2017) Zero-shot learning-the good, the bad and the ugly. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4582–4591. Cited by: §1.
  • [52] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba (2010) Sun database: large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3485–3492. Cited by: §3.4, §4.1.
  • [53] X. Xu, T. Hospedales, and S. Gong (2015) Semantic embedding space for zero-shot action recognition. In 2015 IEEE International Conference on Image Processing (ICIP), pp. 63–67. Cited by: §1, §2, §4.4.
  • [54] X. Xu, T. Hospedales, and S. Gong (2017) Transductive zero-shot action recognition by word-vector embedding. International Journal of Computer Vision 123 (3), pp. 309–333. Cited by: §1, §2.
  • [55] X. Xu, T. M. Hospedales, and S. Gong (2016) Multi-task zero-shot action recognition with prioritised data augmentation. In European Conference on Computer Vision, pp. 343–359. Cited by: §1, §2, §3.2, Table 1.
  • [56] B. Zhang, H. Hu, and F. Sha (2018) Cross-modal and hierarchical modeling of video and text. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 374–390. Cited by: §1, §2.
  • [57] C. Zhang and Y. Peng (2018) Visual data synthesis via gan for zero-shot video classification. arXiv preprint arXiv:1804.10073. Cited by: §2.
  • [58] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §1.
  • [59] Y. Zhu, Y. Long, Y. Guan, S. Newsam, and L. Shao (2018) Towards universal representation for unseen action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9436–9445. Cited by: Figure 1, §1, §2, §2, §2, §3.2, §3.2, Table 1, §4.4, §4.4, Table 2, §5.1.