Language, or text, is probably the most common and natural way to describe the semantic content of a video, so the associated textual information can be easily acquired when collecting video datasets [45, 35]. For example, as shown in Figure 1, a movie clip comes with its script, and a web video is accompanied by a title. This abundant textual information has turned out to be a useful cue for learning a high-level visual-text embedding [49, 35], which can be deployed or fine-tuned for text-to-video retrieval or video captioning. We argue that this correlation between a clip and its associated text could be further investigated in a more fundamental way for visual representation learning. However, intuitively, it is more challenging to learn a general visual representation solely from noisy text information, owing to the lack of careful initialization, effective objectives, and well-designed training strategies.
In this paper, we address the challenging problem of learning effective spatiotemporal features from noisy and diverse textual information, which could serve as the basis for a variety of downstream tasks such as example-based recognition without any fine-tuning, action recognition on a smaller target dataset with fine-tuning, and zero-shot action classification. Basically, we learn a mapping of text and video into a shared embedding space and leverage their correlation as the supervision signal. The technical difficulty is how to design an effective objective function that is capable of modeling this complex visual-textual correlation and can also be easily optimized when training from scratch on noisy datasets. Inspired by unsupervised feature learning in images [63, 50], we present a general cross-modal pair discrimination (CPD) framework, which tries to recognize each video-text pair as its own class via a non-parametric classifier. To solve the computational issues imposed by the huge number of pair classes, we adopt the noise-contrastive estimation technique to approximate the original loss function.
Specifically, we investigate the proposed CPD framework with two sources of visual-textual pairs: (1) movie videos with the corresponding movie scripts or audio descriptions, which can be easily obtained in a semi-automatic way, and (2) web videos with their associated titles, which can be directly crawled from web platforms such as YouTube. As our main goal is to learn general spatiotemporal representations as a good initialization for downstream visual tasks, we utilize off-the-shelf language models such as BERT or Skip-Thoughts to extract textual features. For video modeling, we resort to recent 3D convolutional neural networks (3D CNNs) [52, 11]. We design a curriculum learning strategy to progressively train our CPD framework: first train the video model alone, and then jointly fine-tune the video and text networks. Experimental results indicate that this training scheme helps to relieve the training difficulty and improves the effectiveness of the learned CPD models.
We mainly demonstrate the effectiveness of CPD on the task of spatiotemporal representation learning. A main purpose of evaluating weakly-supervised representation learning is to test its generalization ability on a variety of tasks. First, without any further fine-tuning, we report the performance of action recognition on the Kinetics dataset by using shallow classifiers such as k-NN and a linear classifier, following a common protocol in unsupervised learning. The results show that our learned spatiotemporal features obtain promising results that are comparable to some supervised learning methods on the Kinetics dataset. Then, we investigate the generalization power of the learned spatiotemporal features of CPD by fine-tuning on the Kinetics, UCF101, and HMDB51 datasets, demonstrating that our method obtains superior performance to previous state-of-the-art self-supervised methods. In addition, we test CPD on learning a visual-textual embedding by reporting performance for zero-shot action classification, which demonstrates that CPD is able to yield a new state of the art on this challenging task.
2 Related Work
Motion, Audio, and Text. Multi-modal information in videos provides natural cues for learning deep models. Motion or temporal information has been used to design proxy tasks that assist cross-modal learning, such as optical flow or tracking [37, 59], frame prediction [7, 54], or high-level temporal structure [61, 64, 13]. As most videos contain synchronized audio and visual signals, audio information has served as another common modality to supervise visual learning [2, 1, 28]. However, both motion and audio are low-level signals and may lack high-level semantics for cross-modal learning. Speech or text has been widely studied as another cross-modal setting in video learning [49, 35, 10, 34, 40, 41]. These works mainly aimed to learn a joint video-text embedding in which visual and textual cues are adjacent if they are semantically related. However, they focused on learning high-level visual-textual relations and ignored the fundamental issue of visual representation learning by using off-the-shelf models as feature extractors. Instead, our proposed CPD framework aims at learning general and effective spatiotemporal features that could serve as the basis for a variety of downstream video tasks.
Supervised Video Representation. Since the breakthrough of AlexNet in image recognition, a huge number of video-based deep models have been developed for action recognition [25, 47, 51, 66, 12, 57, 56, 43, 9, 52, 58, 11]. Two-stream networks turned out to be the first successful deep architecture for video recognition by introducing optical flow for motion modeling, and follow-up works tried to improve the two-stream method in terms of fusion or speed. With the arrival of large-scale video datasets (e.g., Sports-1M, Kinetics), 3D convolutional neural networks (3D CNNs) became popular in video recognition, as they only require RGB frames as input to learn spatiotemporal features directly. Recent advanced architectures focused on improving 3D CNNs through spatial-temporal factorization [52, 43], relation modeling [56, 58], or sampling schemes. Another research line has shifted to modeling long-range temporal structures with longer temporal convolutions, sparse sampling and aggregation, or LSTMs [38, 9, 62]. All these deep models are trained on human-annotated datasets. Our CPD framework investigates an orthogonal direction by training deep architectures in a weakly-supervised manner.
Self/Weakly Supervised Video Representation. Self-supervised representation learning has been popular in both the image and video domains in the last few years; it trains a model on a carefully designed proxy task. In the image domain, for instance, these tasks include predicting the image context, counting objects, converting gray images to color ones, and keeping global and local consistency. In the video domain, typical examples include frame prediction [7, 54], optical flow estimation [37, 69, 23], instance tracking [59, 60], and temporal order or structure prediction [36, 13, 61, 64]. These learned representations may capture some aspects of low-level image or video structure, yet they might not be optimal for semantic tasks. Some cross-modal self-supervised tasks were proposed to enhance the power of single-modality representations, a typical example being audio-visual representation learning [2, 1, 28]. To further improve the descriptive power of self-supervised representations, some weakly-supervised methods were developed that utilize more semantic information obtained automatically, such as web search engines [4, 14] and hashtags. Different from these methods, our CPD framework explores a new instance-level discriminative training scheme to learn general spatiotemporal representations from the correlation between a clip and its associated text. CPD is inspired by the low-level instance discrimination framework [63, 50], but extends it to the video domain and uses more semantic pair (i.e., text and video) discrimination for spatiotemporal feature learning. We believe this semantic pair discrimination is more useful for representation learning than low-level instance discrimination.
3 Cross-Modal Pair Discrimination
In this section, we provide a detailed description of our proposed cross-modal pair discrimination (CPD) for weakly supervised spatiotemporal feature learning. First, we present the whole framework and analyze its important properties. Then, we describe the training strategy of the CPD framework. Finally, we introduce the text and video feature extraction networks.
3.1 Framework and analysis
Our goal is to propose a weakly supervised representation learning method that exploits the correlation between each video clip and its associated text information, which can be easily obtained from a variety of sources such as movie scripts, YouTube titles, and automatic speech recognition (ASR). It is generally assumed that this text information carries semantic information but may also be noisy and irrelevant. Therefore, from a technical perspective, we need to design an effective objective function and training strategy that capture this semantic correlation while suppressing the effect of noisy and irrelevant information. To this end, we devise a video-text pair discrimination objective and a curriculum learning strategy as follows.
More formally, as shown in Figure 2, we aim to learn modality-specific embedding functions $F_v$ and $F_t$ for the visual and textual information from a set of video clips $\{v_i\}_{i=1}^{N}$ and their associated textual information $\{t_i\}_{i=1}^{N}$. Let $f_i^v$ and $f_i^t$ denote $F_v(v_i)$ and $F_t(t_i)$, respectively. These embedding functions map the two modalities into a common space (i.e., $f_i^v \in \mathbb{R}^d$ and $f_i^t \in \mathbb{R}^d$), in which related visual and text information should be close to each other. The embedding functions can be implemented by neural networks, which will be clarified in Section 3.3. We first focus on how to devise an objective function to optimize these embedding functions. Inspired by the work on unsupervised learning in images, we design a cross-modal pair discrimination objective to learn these two embedding functions.

Self-instance discrimination. In the original instance-level discrimination framework, each image is treated as a distinct class, and a classifier is learned to categorize each image into its own class. This framework can be naturally extended to video-text pairs by directly using feature concatenation, and we call this extension self-instance discrimination. Formally, this video-text instance discrimination objective can be implemented with the following softmax criterion:

$$p(c_i^{vt} \mid v, t) = \frac{\exp\left(w_i^{v\top} f^v + w_i^{t\top} f^t\right)}{\sum_{j=1}^{N} \exp\left(w_j^{v\top} f^v + w_j^{t\top} f^t\right)}, \tag{1}$$

where the video-text pair $(v_i, t_i)$ defines a class $c_i^{vt}$, $(w_i^v, w_i^t)$ is the weight for class $i$, and the number of classes equals the number of training samples $N$. This class weight represents a class prototype for each video-text instance and is probably not easy to optimize, as we only have a single sample for each class. Thus, the above parametric classifier can be refined into the following non-parametric variant:

$$p(c_i^{vt} \mid v, t) = \frac{\exp\left(f_i^{v\top} f^v / \tau + f_i^{t\top} f^t / \tau\right)}{\sum_{j=1}^{N} \exp\left(f_j^{v\top} f^v / \tau + f_j^{t\top} f^t / \tau\right)}, \tag{2}$$

where $\tau$ is a temperature parameter that controls the class concentration level, and our training objective is to optimize the likelihood $\prod_{i=1}^{N} p(c_i^{vt} \mid v_i, t_i)$. This straightforward extension shares the advantage of instance-level discrimination by directly modeling in the joint video-text space. Yet the text modality carries higher-level semantic information than video pixels, and we aim at learning video features with the supervision of textual information. To meet this requirement, we propose a refined objective function from the perspective of conditional distributions.

Cross-pair discrimination. According to the above analysis, we design the objective function by considering the conditional distributions $p(c^t \mid v)$ and $p(c^v \mid t)$ rather than implicitly modeling the joint distribution $p(v, t)$. Specifically, we design the following conditional distribution:

$$p(c_i^{t} \mid v) = \frac{\exp\left(f_i^{t\top} f^v / \tau\right)}{\sum_{j=1}^{N} \exp\left(f_j^{t\top} f^v / \tau\right)}, \tag{3}$$

where text $t_i$ defines a text class $c_i^t$, and both $f^t$ and $f^v$ are subject to a unit-norm constraint. The conditional distribution $p(c_i^v \mid t)$ is defined in the same way. We call this framework cross-pair discrimination, and during the training phase the objective is to maximize the likelihood $\prod_{i=1}^{N} p(c_i^t \mid v_i) \prod_{i=1}^{N} p(c_i^v \mid t_i)$. The key difference between Equations (2) and (3) is that we use the cross-correlation term $f_i^{t\top} f^v$ to replace the self-correlation term $f_i^{v\top} f^v + f_i^{t\top} f^t$. This cross-correlation is more effective at capturing the mutual information between visual and textual information, and is therefore better at guiding spatiotemporal feature learning from video with text information as supervision.
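A minimal sketch of the cross-pair discrimination likelihood may help make the objective concrete. It computes a temperature-scaled softmax over all stored text embeddings for a given video (and symmetrically for a given text), using unit-normalized embeddings. The helper names, the toy 2-d embeddings, and the default temperature of 0.07 are assumptions made for illustration; they are not taken from the paper.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def class_probs(query_emb, bank, tau=0.07):
    """Softmax over the bank: probability that the query belongs to each stored class.

    tau=0.07 is an assumed default, not a value stated in the text.
    """
    q = l2_normalize(query_emb)
    logits = [dot(l2_normalize(b), q) / tau for b in bank]
    peak = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_pair_nll(video_embs, text_embs, tau=0.07):
    """Average negative log-likelihood of video->text and text->video matches."""
    n = len(video_embs)
    loss = 0.0
    for i in range(n):
        loss -= math.log(class_probs(video_embs[i], text_embs, tau)[i])
        loss -= math.log(class_probs(text_embs[i], video_embs, tau)[i])
    return loss / n

# Toy data: three videos with aligned texts, and a shuffled (misaligned) copy.
videos = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
texts_aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
texts_shuffled = [texts_aligned[1], texts_aligned[2], texts_aligned[0]]
probs = class_probs(videos[0], texts_aligned)
```

On this toy data the aligned pairing gives a much lower loss than the shuffled one, which is the behavior the objective is meant to reward.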
Ranking loss. Ranking losses are commonly used for cross-modal matching. To study the effectiveness of the proposed cross-modal pair discrimination objective, we also compare with a ranking-loss baseline, which is defined as follows:

$$\mathcal{L}(v_i) = \frac{1}{n-1} \sum_{j \neq i} \max\left(0,\ \delta + S(f_j^t, f_i^v) - S(f_i^t, f_i^v)\right), \tag{4}$$

where each video $v_i$ has an associated text $t_i$ and unrelated texts $t_j$ from the current batch, $f_i^v$ and $f_i^t$ are the corresponding video and text embeddings, $S(\cdot, \cdot)$ is the cosine similarity, $n$ is the batch size, and $\delta$ is a margin. In the experiments, we empirically compare this ranking loss with our cross-pair discrimination objective.
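The ranking-loss baseline can be sketched as a hinge loss over in-batch negatives. This is only an illustration under stated assumptions: the function names, the margin value of 0.2, and the choice to average over all (video, unrelated text) pairs are hypothetical and not specified in the text.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def batch_ranking_loss(video_embs, text_embs, margin=0.2):
    """Hinge ranking loss where every other text in the batch is a negative.

    margin=0.2 and the averaging over n*(n-1) pairs are assumptions.
    """
    n = len(video_embs)
    total = 0.0
    for i in range(n):
        positive = cosine(text_embs[i], video_embs[i])
        for j in range(n):
            if j != i:
                negative = cosine(text_embs[j], video_embs[i])
                total += max(0.0, margin + negative - positive)
    return total / (n * (n - 1))

batch_videos = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = batch_ranking_loss(batch_videos, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = batch_ranking_loss(batch_videos, [[0.0, 1.0], [1.0, 0.0]])
```

Because the negatives come only from the current batch, the number of contrasting pairs per update is bounded by the batch size.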
3.2 Training CPD
The training of CPD framework needs to address two technical issues: (1) large number of video-text pair classes; (2) optimization difficulty on noisy video-text datasets by training from scratch.
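Noise-contrastive estimation. In the training stage, we adopt the noise-contrastive estimation technique to approximate Equation (3) and thereby solve the computational issues caused by the huge number of pairs. The basic idea is to transform the multi-class classification problem in Equation (3) into a set of binary classification problems, in which the task is to distinguish between data samples and noise samples. The approximate training objective is to minimize the following loss function:

$$\mathcal{L} = -\mathbb{E}_{P(v)}\Big\{ \mathbb{E}_{P_d(c_i^t \mid v)}\big[\log h(c_i^t, v)\big] + m\, \mathbb{E}_{P_n(\hat{c}^t)}\big[\log\big(1 - h(\hat{c}^t, v)\big)\big] \Big\} - \mathbb{E}_{P(t)}\Big\{ \mathbb{E}_{P_d(c_i^v \mid t)}\big[\log h(c_i^v, t)\big] + m\, \mathbb{E}_{P_n(\hat{c}^v)}\big[\log\big(1 - h(\hat{c}^v, t)\big)\big] \Big\}, \tag{5}$$

where $h(c_i^t, v) = \frac{p(c_i^t \mid v)}{p(c_i^t \mid v) + m P_n(c_i^t)}$ with $p(c_i^t \mid v)$ the conditional probability that video $v$ belongs to the class of text $t_i$, $P_d$ is the actual data distribution, $P_n$ is the uniform distribution for noise, and $m$ denotes the noise frequency. To compute the loss efficiently and avoid large memory consumption, following prior work, we maintain a memory bank to store the visual and textual features of each training pair. The memory bank is updated dynamically during the training procedure.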
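The binary-classification view of noise-contrastive estimation, together with a memory bank, can be sketched for the video-to-text direction as follows. Several choices here are assumptions for illustration only: the normalizing constant is computed exactly over a tiny bank (at scale it would have to be approximated), the temperature default is 0.07, and the momentum-style bank update is one plausible reading of "updated dynamically".

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def partition_constant(video_emb, text_bank, tau):
    """Exact softmax normalizer; feasible only for a toy-sized bank."""
    return sum(math.exp(dot(t, video_emb) / tau) for t in text_bank)

def nce_loss_for_video(video_emb, index, text_bank, noise_indices, tau=0.07):
    """NCE loss for one video: one data sample plus m noise samples."""
    n = len(text_bank)
    m = len(noise_indices)
    z = partition_constant(video_emb, text_bank, tau)
    noise_mass = m / n  # m times the uniform noise probability 1/n

    def model_prob(j):
        return math.exp(dot(text_bank[j], video_emb) / tau) / z

    p_pos = model_prob(index)
    # Posterior that the true pair is data rather than noise.
    loss = -math.log(p_pos / (p_pos + noise_mass))
    for j in noise_indices:
        p_neg = model_prob(j)
        # Posterior that a sampled noise class is indeed noise.
        loss -= math.log(noise_mass / (p_neg + noise_mass))
    return loss

def update_memory(bank, index, new_emb, momentum=0.5):
    """Assumed update rule: blend old and new features, then re-normalize."""
    mixed = [momentum * old + (1.0 - momentum) * new
             for old, new in zip(bank[index], new_emb)]
    norm = math.sqrt(sum(x * x for x in mixed))
    bank[index] = [x / norm for x in mixed]

bank = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
video = [1.0, 0.0]
matched = nce_loss_for_video(video, 0, bank, noise_indices=[1, 2])
mismatched = nce_loss_for_video(video, 2, bank, noise_indices=[0, 1])
update_memory(bank, 1, [1.0, 0.0])
```

Only the true class and the sampled noise classes enter the loss, so the cost per video no longer grows with the total number of pairs, apart from the normalizer that this sketch computes exactly.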
Curriculum learning. To handle the optimization difficulty of training from scratch on noisy video-text datasets, we present a curriculum training strategy that builds on existing unsupervised pre-trained language models. Our curriculum learning strategy divides the training procedure into two stages. In the first stage, we fix the pre-trained language model and only update the parameters of the visual model and the embedding function. The motivation is that the language model is already well pre-trained on a corpus much larger than ours, whereas the video model is trained entirely from scratch. If we train both models simultaneously from the beginning, the random noise produced by the video model will destroy the parameters of the language model. In the second stage, once the video model is well initialized, we start to jointly train the visual-textual model with a smaller learning rate.
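The two-stage schedule can be summarized as a small function that returns per-module learning rates for a given epoch. The default values are the ones reported in the implementation details (0.2 in stage I, then 0.02 for the video side and 3e-5 for the language model). The function name, the dictionary keys, and the use of a fixed switch epoch are assumptions: in the text, the switch happens when validation retrieval performance saturates.

```python
def curriculum_learning_rates(epoch, switch_epoch,
                              stage1_video_lr=0.2,
                              stage2_video_lr=0.02,
                              stage2_text_lr=3e-5):
    """Return learning rates per module; 0.0 means the module is frozen."""
    if epoch < switch_epoch:
        # Stage I: language model fixed, only the video side is trained.
        return {"video": stage1_video_lr, "text": 0.0}
    # Stage II: joint training with smaller learning rates.
    return {"video": stage2_video_lr, "text": stage2_text_lr}

# Example with the switch point reported for 3D ResNet34 (170 epochs).
stage_one = curriculum_learning_rates(0, switch_epoch=170)
stage_two = curriculum_learning_rates(170, switch_epoch=170)
```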
3.3 Architecture design
Having presented the CPD framework and its training strategy, we now describe its network architectures. CPD is a general framework for weakly-supervised spatiotemporal feature learning that exploits the correlation between video and text pairs. To study its effectiveness, we instantiate CPD with different network architectures.
For video representation, we use 3D CNNs to extract spatiotemporal features from a video clip. Specifically, we randomly sample 8 frames from each video clip with a sampling stride of 4. Following the implementation of the slow stream in the recent SlowFast network, all filters from conv1 to res3 degenerate temporal convolutions into 2D convolution kernels, and 3D convolution kernels are retained only in res4 and res5, without temporal downsampling. We try two kinds of network architectures: (1) 3D ResNet34 trained on input volumes of lower frame resolution and (2) 3D ResNet50 trained on input volumes of higher frame resolution. The first, smaller network is efficient for ablation studies, and we then transfer its optimal setting to the larger backbone and frame resolution. We also add a mapping layer to transform the visual features into a 256-dimensional embedding space, and this 256-d vector is $\ell_2$-normalized.
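The clip sampling described above (8 frames with a stride of 4) can be illustrated with a small index helper. Choosing a random start frame and clamping indices for videos shorter than the clip span are assumptions of this sketch; the text only states the frame count and stride.

```python
import random

def sample_clip_indices(num_video_frames, clip_len=8, stride=4, rng=random):
    """Pick `clip_len` frame indices spaced `stride` apart from a random start."""
    span = (clip_len - 1) * stride + 1          # frames covered by one clip
    max_start = max(num_video_frames - span, 0)
    start = rng.randint(0, max_start)
    # Clamp so that short videos repeat their last frame instead of
    # indexing past the end.
    return [min(start + k * stride, num_video_frames - 1)
            for k in range(clip_len)]

indices = sample_clip_indices(100, rng=random.Random(0))
short_indices = sample_clip_indices(10, rng=random.Random(0))
```

With these settings each clip spans 29 consecutive frames of the source video.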
Text architecture. Our textual stream subnetwork is based on off-the-shelf language models. We choose Skip-Thoughts and DistilBERT [6, 46] as our textual encoders. Specifically, we extract sentence features of movie scripts with the Skip-Thoughts model, and encode textual features of YouTube titles with the DistilBERT model. Skip-Thoughts is an unsupervised sentence encoder, pre-trained by reconstructing the surrounding sentences of continuous text in books. We use the combine-skip vectors extracted from Skip-Thoughts as the text feature, which is 4800-dimensional. BERT encodes long sentences by predicting missing words given their bidirectional context, and DistilBERT achieves comparable performance with a faster and lighter model via knowledge distillation [22]. Additional layers are added to our textual encoder to obtain the textual feature in the common embedding space, which is also $\ell_2$-normalized.
4 Experiments
In this section, we present the experimental results of our proposed CPD framework. First, we describe the training and evaluation datasets along with implementation details. Then, we conduct an ablation study on the CPD framework. Finally, we verify the effectiveness of CPD from three aspects: weakly-supervised representation learning, representation transfer, and zero-shot classification.
4.1 Datasets
In our experiments, we pre-train our weakly supervised representation learning method on two video datasets: LSMDC and Kinetics. To evaluate the learned spatiotemporal features, we fine-tune the video model on two challenging human action datasets: UCF101 and HMDB51.
LSMDC-100K. LSMDC is a large-scale movie description dataset. Each clip has an automatically collected description taken from the movie script or audio description. We reserve 1k clips from the test split for validation and use the rest as the pre-training set, which contains 117k video-text pairs.
Kinetics-210K. The first version of Kinetics is a large-scale human action dataset that contains 400 action classes, with around 240k videos for training, 20k videos for validation, and 40k videos for testing. It is often called Kinetics-400, but we name it by the number of training videos, as we do not use any class information for weakly-supervised representation learning. Due to invalid URLs and data cleaning, the collected dataset contains around 210k video-text pairs for weakly supervised pre-training, and thus we call it Kinetics-210K. As with LSMDC, we reserve 1k video-text pairs for validation during CPD training and use the rest as the training set. To construct video-text pairs, we equip each clip with the video title directly crawled from YouTube, termed Kinetics-title. As the original titles may be very noisy, we pre-process the text information in two ways. First, we delete special symbols and characters such as non-English words and emoji, termed Kinetics-title-clean. Second, we use StanfordNLP to obtain the dependency tree of a sentence and keep only the verbs and nouns of the title, termed Kinetics-title-tree.
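The first cleaning step (removing special symbols, non-English characters, and emoji) could look roughly like the following. This is a hypothetical approximation: the exact filtering rules used to build Kinetics-title-clean are not given in the text, and the function name and the ASCII-only rule are assumptions.

```python
import re

def clean_title(title):
    """Keep ASCII letters and digits; drop emoji, symbols, and other scripts."""
    # Replace every character that is not an ASCII letter, digit, or
    # whitespace with a space, then collapse repeated whitespace.
    ascii_only = re.sub(r"[^A-Za-z0-9\s]", " ", title)
    return " ".join(ascii_only.split())

example = clean_title("Amazing 🏀 slam dunk!!! #NBA – 篮球")
```

A rule this blunt also strips apostrophes and hyphens, so a production pipeline would likely need a gentler filter.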
UCF101 and HMDB51. We evaluate the generalization of our pre-trained models by fine-tuning on two small human action datasets: UCF101 and HMDB51, which contain 13k videos of 101 classes and 7k videos of 51 classes, respectively. We report the ablation study on the first split and the average performance over three splits for fair comparison.
4.2 Implementation details
Weakly supervised learning of CPD. We train our CPD model on a video-text dataset and use video-text retrieval on 1k unseen video-text pairs for validation during training. To keep a balance between temporal receptive field and GPU memory consumption, 8 frames are sampled from each video clip with a sampling stride of 4. Following the procedure in [17, 16], we perform multi-scale cropping, random horizontal flipping, and color jittering for data augmentation. We use SGD to optimize our objective, with a momentum of 0.9 and a weight decay of 1e-4. The temperature parameter $\tau$ is set to a fixed value. In the beginning, we fix the pre-trained language model and set the learning rate to 0.2. When the retrieval performance on the validation set saturates (170 epochs for 3D ResNet34 and 110 epochs for 3D ResNet50), we start to update the language model with a learning rate of 3e-5 and decrease the remaining learning rate to 0.02. The maximum number of training epochs is 250. The mini-batch size is 64 clips per GPU for the smaller input size and 16 clips per GPU for the larger input size, and we use 8 GPUs for training.
Evaluation on representation learning. We first verify our CPD learned representation by employing a shallow classifier on frozen features, following a common protocol in self/weakly supervised representation learning. In this experiment, we pre-train CPD on the Kinetics-210K dataset and report performance on its validation set. Specifically, we use k-Nearest Neighbor (kNN) and linear classifiers on the extracted features for classification. For feature extraction, we sample 10 clips from each video, and each clip contains 8 frames with a sampling stride of 4. The 256-dimensional embedding feature and the output of global average pooling after res5 are extracted as feature representations. The extracted features of the 10 clips in a video are averaged to form the video representation. We choose cosine distance as the distance metric in kNN, with a fixed $k$. For the linear classifier, a fully connected layer after Batch Normalization is added and trained with cross-entropy loss. We adopt Adam with a learning rate of 1e-3, reduced by a factor of 10 every 10 epochs, and stop at 30 epochs.
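The kNN protocol on frozen features can be sketched as follows: clip-level features are averaged into one video feature, and the label is decided by the nearest training features under cosine similarity. Majority voting, the value k=3, and the toy 2-d features are assumptions for illustration; the text specifies only the cosine metric and the averaging over clips.

```python
import math
from collections import Counter

def mean_vector(vectors):
    """Average a list of equal-length feature vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def knn_predict(query, train_feats, train_labels, k=3):
    """Majority vote among the k most cosine-similar training features."""
    ranked = sorted(range(len(train_feats)),
                    key=lambda i: cosine_similarity(query, train_feats[i]),
                    reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

train_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["run", "run", "swim", "swim"]
clip_feats = [[0.8, 0.2], [1.0, 0.1]]        # features of two clips from one video
video_feat = mean_vector(clip_feats)
prediction = knn_predict(video_feat, train_feats, train_labels, k=3)
```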
Evaluation on representation transfer. A main goal of representation learning is to transfer the learned representation to downstream tasks. We fine-tune the learned spatiotemporal representation on UCF101, HMDB51, and a small fraction of Kinetics-400. During fine-tuning, 16 frames with a stride of 4 are sampled as input. We simply replace the embedding layer of the video model with a new fully connected layer and a multi-way softmax for action recognition. We adopt the same data augmentation procedure as in weakly supervised pre-training. The classifier is trained using the Adam optimizer with an initial learning rate of 1e-4 and a weight decay of 5e-4. The learning rate is decayed twice by a factor of 10 when the validation loss saturates. During testing, for each video, we uniformly sample 10 clips, each with 3 crops, following common practice.
Evaluation on zero-shot classification. Our learned visual-textual embedding can be used for zero-shot classification on Kinetics and UCF101 without any fine-tuning. We regard class labels as text information and pass them to our pre-trained textual subnetwork, transforming the class labels into class features in the embedding space. For each video, we uniformly sample 10 clips, each containing 8 frames with a sampling stride of 4, at the spatial size matching the input resolution of the backbone. The 10 clips are passed through the visual subnetwork and their embeddings are averaged to form the video feature. The video is assigned to the closest class under the cosine metric.
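The zero-shot procedure reduces to a nearest-class lookup in the shared embedding space. In the sketch below, the class embeddings stand in for the output of the text encoder applied to class names; the vectors, class names, and function names are hypothetical placeholders.

```python
import math

def mean_vector(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def zero_shot_classify(clip_embs, class_embs):
    """Average clip embeddings, then return the most similar class name."""
    video_emb = mean_vector(clip_embs)
    return max(class_embs,
               key=lambda name: cosine_similarity(video_emb, class_embs[name]))

# Hypothetical embeddings of class-name text and of clips from one video.
class_embs = {"archery": [1.0, 0.0], "bowling": [0.0, 1.0]}
clip_embs = [[0.7, 0.3], [0.9, 0.2]]
label = zero_shot_classify(clip_embs, class_embs)
```

No classifier is trained here: adding a new class only requires encoding its name with the text subnetwork.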
4.3 Ablation study
In this subsection, we perform an ablation study on CPD from four aspects: objective design, training strategy, pre-training datasets, and textual encoders. In this study, we evaluate on the task of representation transfer by fine-tuning on UCF101 split 1.
Objective function. We compare the three objective functions described in Section 3.1. In this study, we pre-train models on Kinetics-title-clean and use DistilBERT as the textual encoder without fine-tuning. The experimental results are reported in Table 1. From the results, we can see that self-instance discrimination contributes almost nothing to learning an effective representation, as it does not model any cross-modal correlation. Cross-pair discrimination performs better than the ranking loss, because cross-pair discrimination can construct negative video-text pairs from the entire dataset while the ranking loss is optimized only with negative pairs from the current batch. In the remaining experiments, we choose cross-pair discrimination as the objective function by default.
| Training strategy | Accuracy on UCF101 split 1 (%) |
|---|---|
| Curriculum learning stage I | 82.2 |
| Curriculum learning stage II | 84.2 |
Curriculum learning. We design different training strategies to handle the difficulty of optimizing on noisy video-text datasets from scratch. The first strategy is to fine-tune the pre-trained textual encoder directly from the beginning. We also evaluate stage I and stage II of the curriculum learning proposed in Section 3.2. All these strategies are pre-trained on Kinetics-title-clean. The numerical results are summarized in Table 2. From the results, we see that all of these strategies yield a performance gain compared to learning from scratch. Fixing the pre-trained language model gives better performance than direct fine-tuning from the beginning (by 0.9%). We ascribe this to the fact that the random noise produced by the video model destroys the well pre-trained textual encoder at the beginning. In addition, fine-tuning the language model after the video model is well initialized further boosts the accuracy by 2.0%.
Different datasets and textual encoders. An important component of our CPD model is the text information and its encoder. In this experiment, we compare text information from different datasets and different textual encoders. We choose video-text pairs from the LSMDC, Kinetics-title-tree, and Kinetics-title-clean datasets and use Skip-Thoughts and DistilBERT as textual extractors. The experimental results are reported in Table 3. From the results, we can see that models pre-trained on the Kinetics datasets outperform those learned on the LSMDC dataset (76.4% vs. 71.9% with Skip-Thoughts as the textual encoder), as the videos of LSMDC come from movies, which may have a different distribution from the web videos of UCF101. The smaller number of training samples in LSMDC (117k vs. 210k) is another reason for the lower performance. Regarding the textual encoder, a stronger language model such as DistilBERT is of great help in training a better video model (82.1% vs. 76.4% on the Kinetics-title-tree dataset). In addition, abundant and video-specific text information benefits the training of our CPD model, according to the performance difference between Kinetics-title-tree and Kinetics-title-clean: the textual information in the former contains only verbs and nouns, while the latter retains almost all of the original information. In the remaining experiments, we use the spatiotemporal features pre-trained on Kinetics-title-clean with DistilBERT as the textual encoder.
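Table 4: Top-1 accuracy (%) on the Kinetics validation set. For rows with Label supervision, the single reported number is the accuracy of end-to-end supervised training.

| Method | Supervision | Feature (dim) | kNN | LC |
|---|---|---|---|---|
| 3D ResNet34 | Label | - | - | 60.1 |
| 3D ResNet50 | Label | - | - | 61.3 |
| 3D ResNet50 (ours) | Label | - | - | 73.2 |
| 3D ResNet34 | Text | emb (256) | 49.9 | 50.8 |
| 3D ResNet34 | Text | res5 (512) | 50.1 | 53.3 |
| 3D ResNet50 | Text | emb (256) | 58.0 | 58.7 |
| 3D ResNet50 | Text | res5 (2048) | 58.2 | 63.1 |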
4.4 Evaluation on representation learning
To evaluate our learned representation, we report the classification performance on the validation set of Kinetics obtained by training shallow classifiers on frozen features, as shown in Table 4. We compare our method with networks [26, 17, 11] trained end-to-end with annotated action labels. For fair comparison, we also train the same 3D ResNet50 architecture from scratch by ourselves on the Kinetics dataset (denoted as 3D ResNet50-ours). For our CPD learned representations, we apply kNN classifiers and linear classifiers (LC) to the 256-dimensional embedding features or to the features from global average pooling after res5, which are 512-dimensional for 3D ResNet34 and 2048-dimensional for 3D ResNet50. In this shallow learning setting, we also compare with an ImageNet pre-trained representation (ResNet50) using the same classifier. The experimental results demonstrate the following. First, the 256-dimensional embedding feature of 3D ResNet50 outperforms previously proposed methods, yet our feature is highly compact, which makes it much cheaper for storage and inference. Second, the learned feature of 3D ResNet34 is comparable to previous fully supervised methods [26, 17]. We also note that there is still a performance gap between our CPD learned representation and fully supervised networks such as SlowFast. Finally, our ResNet50 feature achieves higher performance than fully supervised ImageNet pre-trained features under the same shallow classifier.
Table 5: Fine-tuning accuracy (%) on UCF101 and HMDB51.

| Method | Supervision | Backbone | Pre-training dataset | UCF101 | HMDB51 |
|---|---|---|---|---|---|
| Random Initialization | - | 3D ResNet18 | - | 42.4 | 17.1 |
| ImageNet Pretrained | Image label | VGGNet | ImageNet | 73.0 | 40.5 |
| Kinetics Pretrained | Action label | 3D ResNet34 | Kinetics | 87.7 | 59.1 |
| Kinetics Pretrained | Action label | 3D ResNet50 | Kinetics | 89.3 | 61.0 |
| Shuffle & Learn | Order verification | CaffeNet | UCF101/HMDB51 | 50.2 | 18.1 |
| OPN | Sequence order | VGGNet | UCF101/HMDB51 | 59.8 | 23.8 |
| CMC | Optical flow | CaffeNet | UCF101 | 55.3 | - |
| COP | Clip order | 3D ResNet10 | UCF101 | 64.9 | 29.5 |
| DPC | Predict feature | 3D ResNet34 | Kinetics | 75.7 | 35.7 |
| CPD (Ours) | Text | 3D ResNet34 | Kinetics | 83.5 | 53.2 |
| CPD (Ours) | Text | 3D ResNet50 | Kinetics | 88.7 | 57.7 |
4.5 Evaluation on representation transfer
Results on Kinetics. Our weakly-supervised pre-trained representation can serve as an efficient initialization when training the model with only a small amount of labeled data. We randomly choose a small fraction of the Kinetics-400 training set as labeled data and fine-tune the pre-trained model on it. In Table 6, we report top-1 and top-5 accuracy when training on labeled subsets of 1%, 10%, and 20% of the entire dataset, and compare our method with training from scratch as the baseline. Our method significantly surpasses the baseline on all proportions of labeled data, especially when the amount of labeled data is extremely small. When only 1% of the data is labeled, training from scratch learns almost nothing (0.3% top-1 accuracy), whereas our model achieves 14.7% top-1 accuracy. In addition, our model outperforms the baseline by nearly 30% top-1 accuracy (40.2% vs. 10.7%) when training on 10% of the labeled data.
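Table 6: Top-1 / top-5 accuracy (%) on Kinetics-400 for different amounts of labeled data.

| Method | 1% | 10% | 20% |
|---|---|---|---|
| From scratch | 0.3 / 1.3 | 10.7 / 28.5 | 33.3 / 60.0 |
| Ours | 14.7 / 30.2 | 40.2 / 67.8 | 47.8 / 74.0 |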
Results on UCF101 and HMDB51. Transferring the learned representation to downstream tasks is a main goal of representation learning. We transfer our representation to the action recognition task on small datasets, namely UCF101 and HMDB51. In Table 5, we compare our CPD model with a randomly initialized network, fully supervised methods, and several self-supervised methods, including Shuffle & Learn, CMC, OPN, O3N, MASN, COP, DPC, and AVTS, on UCF101 and HMDB51 over three splits. As shown in Table 5, our CPD model of 3D ResNet50 with the larger spatial input size performs better than the other self-supervised methods listed in the table on both UCF101 and HMDB51, and on UCF101 it even surpasses the 3D ResNet34 model pre-trained with manually annotated action labels on Kinetics (88.7% vs. 87.7%). Furthermore, our CPD model of 3D ResNet34 with the smaller input resolution achieves performance comparable to AVTS, which uses a strong backbone (I3D) and a large input resolution and leverages audio-visual temporal synchronization as its proxy task.
|Mettes et al. ||-||101||3||32.8|
|Mettes et al. ||-||50||10||40.4|
|Mettes et al. ||-||20||10||51.2|
4.6 Evaluation on zero-shot classification
We evaluate the visual-textual embedding of our CPD model with zero-shot classification on UCF101 and Kinetics-400 without any fine-tuning. We transform class labels and video clips into the same embedding space and assign each video clip to its closest class under cosine distance. In Table 7, we compare our method on UCF101 with Mettes et al., who perform zero-shot localization and classification of human actions in video via spatial-aware object embeddings. Following their protocol, we randomly select different sets of test classes 10 times and average the accuracies, except when the number of classes is 101. We outperform Mettes et al. for every number of test classes. For Kinetics-400, we achieve a top-1 accuracy of 43.7% without fine-tuning or training labels, as shown in Table 8. In addition, the top-1 accuracy on 20 random classes reaches 74.4%, which shows the strong capability of our visual-textual embedding.
In this paper, we have presented a general cross-modal pair discrimination (CPD) framework that captures the correlation between a video clip and its associated text, and we have adopted noise-contrastive estimation to approximate its objective. Without fine-tuning, the learned models obtain competitive results for action classification on the Kinetics dataset with a shallow classifier. Our visual models also provide an effective initialization for fine-tuning on the datasets of downstream tasks. In addition, our CPD model yields a new state of the art for zero-shot action recognition on UCF101 by directly utilizing the learned visual-textual embedding. In the future, we may consider designing more effective proxy tasks and more efficient training strategies for learning spatiotemporal representations from noisy video and text pairs.
-  Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 609–617, 2017.
-  Yusuf Aytar, Carl Vondrick, and Antonio Torralba. Soundnet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 892–900, 2016.
-  João Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 4724–4733, 2017.
-  Xinlei Chen and Abhinav Gupta. Webly supervised learning of convolutional networks. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 1431–1439, 2015.
-  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pages 248–255, 2009.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
-  Ali Diba, Vivek Sharma, Luc Van Gool, and Rainer Stiefelhagen. Dynamonet: Dynamic action and motion network. CoRR, abs/1904.11407, 2019.
-  Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 1422–1430, 2015.
-  Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, pages 2625–2634, 2015.
-  Jianfeng Dong, Xirong Li, Chaoxi Xu, Shouling Ji, Yuan He, Gang Yang, and Xun Wang. Dual encoding for zero-example video retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 9346–9355, 2019.
-  Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. CoRR, abs/1812.03982, 2018.
-  Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, pages 1933–1941, 2016.
-  Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould. Self-supervised video representation learning with odd-one-out networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 5729–5738, 2017.
-  Deepti Ghadiyaram, Du Tran, and Dhruv Mahajan. Large-scale weakly-supervised pre-training for video action recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 12046–12055, 2019.
-  Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, pages 297–304, 2010.
-  Tengda Han, Weidi Xie, and Andrew Zisserman. Video representation learning by dense predictive coding. CoRR, abs/1909.04656, 2019.
-  Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 6546–6555, 2018.
-  Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 2980–2988, 2017.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778, 2016.
-  Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015.
-  R. Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Philip Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019.
-  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 448–456, 2015.
-  Dinesh Jayaraman and Kristen Grauman. Learning image representations tied to egomotion from unlabeled video. International Journal of Computer Vision, 125(1-3):136–161, 2017.
-  Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell., 35(1):221–231, 2013.
-  Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732, 2014.
-  Will Kay, João Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
-  Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 3294–3302, 2015.
-  Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative learning of audio and video models from self-supervised synchronization. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 7774–7785, 2018.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, pages 1106–1114, 2012.
-  Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso A. Poggio, and Thomas Serre. HMDB: A large video database for human motion recognition. In IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011, pages 2556–2563, 2011.
-  Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 667–676, 2017.
-  Dhruv Mahajan, Ross B. Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part II, pages 185–201, 2018.
-  Pascal Mettes and Cees G. M. Snoek. Spatial-aware object embeddings for zero-shot localization and classification of actions. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
-  Antoine Miech, Ivan Laptev, and Josef Sivic. Learning a text-video embedding from incomplete and heterogeneous data. CoRR, abs/1804.02516, 2018.
-  Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. CoRR, abs/1906.03327, 2019.
-  Ishan Misra, C. Lawrence Zitnick, and Martial Hebert. Shuffle and learn: Unsupervised learning using temporal order verification. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I, pages 527–544, 2016.
-  Joe Yue-Hei Ng, Jonghyun Choi, Jan Neumann, and Larry S. Davis. Actionflownet: Learning motion representation for action recognition. In 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018, Lake Tahoe, NV, USA, March 12-15, 2018, pages 1616–1624, 2018.
-  Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, pages 4694–4702, 2015.
-  Mehdi Noroozi, Hamed Pirsiavash, and Paolo Favaro. Representation learning by learning to count. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 5899–5907, 2017.
-  Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, and Yong Rui. Jointly modeling embedding and translation to bridge video and language. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 4594–4602, 2016.
-  Bryan A. Plummer, Matthew Brown, and Svetlana Lazebnik. Enhancing video summarization via vision-language embedding. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1052–1060, 2017.
-  Peng Qi, Timothy Dozat, Yuhao Zhang, and Christopher D. Manning. Universal dependency parsing from scratch. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 160–170, Brussels, Belgium, October 2018. Association for Computational Linguistics.
-  Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In ICCV, pages 5534–5542, 2017.
-  Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 91–99, 2015.
-  Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Joseph Pal, Hugo Larochelle, Aaron C. Courville, and Bernt Schiele. Movie description. International Journal of Computer Vision, 123(1):94–120, 2017.
-  Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR, abs/1910.01108, 2019.
-  Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.
-  Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR, abs/1212.0402, 2012.
-  Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. CoRR, abs/1904.01766, 2019.
-  Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. CoRR, abs/1906.05849, 2019.
-  Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, pages 4489–4497, 2015.
-  Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 6450–6459, 2018.
-  Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action recognition. CoRR, abs/1604.04494, 2016.
-  Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating visual representations from unlabeled video. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 98–106, 2016.
-  Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Yunhui Liu, and Wei Liu. Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
-  Limin Wang, Wei Li, Wen Li, and Luc Van Gool. Appearance-and-relation networks for video classification. In CVPR, pages 1430–1439, 2018.
-  Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, pages 20–36, 2016.
-  Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 7794–7803, 2018.
-  Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 2794–2802, 2015.
-  Xiaolong Wang, Allan Jabri, and Alexei A. Efros. Learning correspondence from the cycle-consistency of time. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 2566–2576, 2019.
-  Donglai Wei, Joseph J. Lim, Andrew Zisserman, and William T. Freeman. Learning and using the arrow of time. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 8052–8060, 2018.
-  Zuxuan Wu, Xi Wang, Yu-Gang Jiang, Hao Ye, and Xiangyang Xue. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In Proceedings of the 23rd Annual ACM Conference on Multimedia Conference, MM ’15, Brisbane, Australia, October 26 - 30, 2015, pages 461–470, 2015.
-  Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 3733–3742, 2018.
-  Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, and Yueting Zhuang. Self-supervised spatiotemporal learning via video clip order prediction. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 10334–10343, 2019.
-  Bowen Zhang, Limin Wang, Zhe Wang, Yu Qiao, and Hanli Wang. Real-time action recognition with enhanced motion vector CNNs. In CVPR, pages 2718–2726, 2016.
-  Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III, pages 649–666, 2016.
-  Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. Temporal action detection with structured segment networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 2933–2942, 2017.
-  Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised learning of depth and ego-motion from video. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6612–6619, 2017.