A Survey on multimodal learning research.
The canonical approach to video action recognition requires a neural model to perform a classic and standard 1-of-N majority vote task. Such models are trained to predict a fixed set of predefined categories, which limits their transferability to new datasets with unseen concepts. In this paper, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them into numbers. Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to do zero-shot action recognition without any further labeled data or parameter requirements. Moreover, to handle the deficiency of label texts and make use of tremendous web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, prompt and fine-tune". This paradigm first learns powerful representations by pre-training on a large amount of web image-text or video-text data. Then it makes the action recognition task act more like the pre-training problems via prompt engineering. Finally, it fine-tunes end-to-end on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches top performance on the general action recognition task, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 as the backbone. Code is available at https://github.com/sallymmx/ActionCLIP.git
Video action recognition is the first step of video understanding, and it has been an active research area in recent years. We have observed that it mainly went through two stages, feature engineering and architecture engineering. Since there were not sufficient data for learning high-quality models before the birth of large datasets like Kinetics [carreira2017quo], early methods focused on feature engineering, where researchers considered the temporal information inside the videos and used their knowledge to design specific hand-crafted representations [dollar2005behavior, wang2013dense]. Then, with the advent of deep neural networks and large benchmarks, we are now in the second stage, architecture engineering. Lots of well-designed networks sprang up by reasonably absorbing the temporal dimension, like two-stream networks [wang2016temporal], 3D convolutional neural networks (CNN) [feichtenhofer2019slowfast], compute-efficient networks [jiang2019stm] and transformer-based networks [arnab2021vivit].
Though the features and network architectures have been well-studied in the last few years, they are trained to predict a fixed set of predefined categories within a unimodal framework as shown in Figure 1(a). This predetermined manner limits their generality and applicability, since additional labeled training data is required to transfer to any new and unseen concepts. Instead of directly mapping labels to numbers like traditional works, learning from raw text is a promising solution: text can be a much broader source of supervision and provide a more comprehensive representation. Much like humans do, we can recognize both known and unknown videos by associating the semantic information of the visual appearance with natural language sources rather than numbers. In this paper, we explore natural language supervision in a multimodal framework as shown in Figure 1(b) with two objectives: i) strengthening the representation of traditional action recognition with more semantic language supervision, and ii) enabling our model to realize zero-shot transfer without any further labeled data or parameter requirements. Our multimodal framework includes two separate unimodal encoders for videos and labels and a similarity calculation module. The training objective is to pull the pairwise video and label representations close to each other, so the learned representations will be more semantic than those of unimodal methods. In the inference phase, the task becomes a video-text matching problem rather than a classical 1-of-N majority vote task, and the model is capable of zero-shot prediction.
However, labels of existing fully-supervised action recognition datasets are always too succinct to construct rich sentences for language learning. Collecting and annotating new video datasets requires huge storage resources and enormous human effort and time. On the other hand, a sea of videos with noisy but rich text labels is stored and generated on the web every day. Is there a way to exploit the abundant web data for action recognition? Pre-training may be a solution, as demonstrated in ViViT [arnab2021vivit]. But it is not easy to pre-train with a large amount of web data. It is expensive in storage hardware, computational resources and experiment cycles ([dosovitskiy2020image] reports that pre-training a ViT-H/14 model on JFT takes 2.5k TPUv3-core-days). This triggers another motivation of this paper: could we directly adapt a pre-trained multimodal model to this task, avoiding the above dilemma? We find this is possible. Formally, we define a new paradigm “pre-train, prompt, and fine-tune” for video action recognition. Although it is appealing to pre-train the whole model end-to-end with large-scale video-text datasets such as HowTo100M [miech2019howto100m], we are restricted by the enormous computation cost. Luckily, we find that using a pre-trained model also works. Here we use the word “pre-train” rather than “pre-trained” in the new paradigm to keep the pre-training function. Then, instead of adapting the pre-trained model to specific benchmarks by substituting the final classification layers and objective functions, we reformulate our task to look more like those solved during the original pre-training procedure via prompt. Prompt-based learning [liu2021pre] is regarded as a sea change in natural language processing (NLP), but it is not active in vision tasks and, in particular, has not been exploited in action recognition. We believe it will have attractive prospects in many vision-text-related tasks and explore it in action recognition here. Finally, we fine-tune the whole model on target datasets. We implement an instantiation of this paradigm, ActionCLIP, which employs CLIP [radford2021learning] as the pre-trained model. It obtains a top performance of 83.8% top-1 accuracy on Kinetics-400. Our contributions can be summarized as follows:
We formulate the action recognition task as a multimodal learning problem rather than a traditional unimodal classification task. It strengthens the representations with more semantic language supervision and enlarges the generality and employment of the model in zero-shot/few-shot situations.
We propose a new paradigm for action recognition, which we dub “pre-train, prompt, and fine-tune”. In this paradigm, we could directly reuse powerful large-scale web data pre-trained models by designing appropriate prompts, significantly reducing the pre-training cost.
Comprehensive experiments demonstrate the potential and effectiveness of our method, which consistently outperforms the state-of-the-art methods on several public benchmark datasets.
We have observed that video action recognition mainly went through two stages, feature engineering and architecture engineering. In the first stage, lots of hand-crafted descriptors were designed for spatio-temporal representations, like Cuboids [dollar2005behavior], 3DHOG [klaser2008spatio] and Dense Trajectories [wang2013dense]. However, these features lack generalization since they are not learned end-to-end on large-scale datasets. Now we are in the second stage, architecture engineering. We coarsely classify these architectures into four categories: two-stream networks, 3D CNNs, compute-efficient networks and transformer-based networks. Two-stream-based methods [feichtenhofer2016convolutional, wang2016temporal, wang2017spatiotemporal] model appearance and dynamics separately with two networks and fuse the two streams in the middle or at the end. 3D CNNs [carreira2017quo, diba2018spatio, feichtenhofer2019slowfast, stroud2020d3d] learn spatiotemporal features directly from RGB frames by extending common 2D CNNs with an extra temporal dimension. Due to the heavy computational burden of 3D CNNs, many compute-efficient networks are designed to find a trade-off between precision and speed [tran2018closer, xie2018rethinking, zhou2018mict, jiang2019stm, kumawat2021depthwise, li2020tea]. Transformer-based networks [arnab2021vivit, bertasius2021space, neimark2021video, sharir2021image, fan2021multiscale] employ and modify recent strong vision transformers to jointly encode the spatial and temporal features. Yet, most works of both stages are unimodal and do not consider the semantic information contained in the labels. We propose a new paradigm “pre-train, prompt, and fine-tune” based on a video-text multimodal learning framework for action recognition, which sheds light on the language modeling of label words.
Vision-text multi-modality has recently been a hot topic in several vision-text related fields, like pre-training [lei2021less, li2021align], vision-text retrieval [fang2021clip2video, miech2018learning] and so on. Video action recognition could be interpreted as a text-insufficient video-to-text retrieval problem. Therefore, it may also be feasible to apply vision-text multi-modal learning to this task. However, to the best of our knowledge, there are no mature and effective methods from this perspective in general video action recognition. We do find several vision-text multi-modality works in self-supervised video representation learning [miech2020end, alayrac2020self] and zero-shot action recognition [piergiovanni2020learning, brattoli2020rethinking, zhang2018cross]. Yet, the former tends to just learn a strong pre-trained video representation with a large web dataset and still neglects the label text features when doing specific classification, simply attaching and learning a linear classifier on top of the learned vision representation. The latter mainly concentrates on the design of the embedding space with a pre-trained vision model and a simple text embedding like Word2Vec [mikolov2013efficient], paying less attention to the upstream general action recognition task. Different from them, in this paper we focus on vision-text multi-modality learning in general action recognition and build a bridge between it and zero-shot/few-shot action recognition.
Previous comprehensive video action recognition methods treat this task as a classic and standard 1-of-N majority vote problem, mapping labels into numbers. This pipeline completely ignores the semantic information contained in the label texts. We instead model this task as a video-text multimodal learning problem, in contrast to pure video modeling. We believe learning from the supervision of natural language could not only enhance the representation power but also enable flexible zero-shot transfer.
Formally, given an input video $x$ and a label $y$ from a predefined label set $\mathcal{Y}$, prior works usually train a model to predict the conditional probability $P(y|x,\theta)$ and turn $y$ into a number or a one-hot vector that indicates its index within the whole label set of length $|\mathcal{Y}|$. In the inference phase, the highest-scoring index of the prediction is regarded as the corresponding category. We try to break this routine and model the problem as $P(f(x,y)|\theta)$, where $y$ is the original words of the label and $f$ is a similarity function. Testing then becomes a matching process, in which the label words with the highest similarity score are the classification result:

$$\hat{y} = \arg\max_{y \in \mathcal{Y}} P(f(x,y)|\theta). \qquad (1)$$
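The matching-style inference described above can be sketched in a few lines. This is only an illustration under stated assumptions: the feature vectors are hand-made toy values standing in for encoder outputs, cosine similarity is assumed as the similarity function, and `cosine_similarity` and `match_label` are hypothetical helper names rather than part of any released code.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def match_label(video_feature, label_features):
    """Return the label text whose feature is most similar to the video feature."""
    return max(
        label_features,
        key=lambda text: cosine_similarity(video_feature, label_features[text]),
    )

# Toy features (assumed values; real ones would come from the video and text encoders).
label_features = {
    "playing guitar": [0.9, 0.1, 0.0],
    "riding a bike": [0.0, 0.8, 0.6],
    "swimming": [0.1, 0.0, 0.95],
}
video_feature = [0.2, 0.7, 0.5]

predicted = match_label(video_feature, label_features)
print(predicted)
```

Because the label set is just a dictionary of text features, adding an unseen category only requires encoding its words, which is what makes zero-shot prediction possible in this formulation.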
As shown in Figure 1(b), we learn separate unimodal encoders for video and label words inside a dual-stream framework. The video encoder $g_V$ extracts spatio-temporal features for the visual modality and could be any well-designed architecture. The language encoder $g_W$ is used to extract features of input label texts and could be any of a wide variety of language models. Then, to pull the pairwise video and label representations close to each other, we define symmetric similarities between the two modalities with cosine distances in the similarity calculation module:

$$s(x,y) = \frac{v \cdot w}{\|v\|\,\|w\|}, \qquad s(y,x) = \frac{w \cdot v}{\|w\|\,\|v\|}, \qquad (2)$$

where $v = g_V(x)$ and $w = g_W(y)$ are encoded features of $x$ and $y$, respectively. Then the softmax-normalized video-to-text and text-to-video similarity scores can be calculated as:

$$p^{x2y}_i(x) = \frac{\exp(s(x,y_i)/\tau)}{\sum_{j=1}^{N}\exp(s(x,y_j)/\tau)}, \qquad p^{y2x}_i(y) = \frac{\exp(s(y,x_i)/\tau)}{\sum_{j=1}^{N}\exp(s(y,x_j)/\tau)}, \qquad (3)$$

where $\tau$ is a learnable temperature parameter and $N$ is the number of training pairs. Let $q^{x2y}(x)$ and $q^{y2x}(y)$ indicate the ground-truth similarity scores, where a negative pair has a probability of 0 and a positive pair has a probability of 1. Since the number of videos is much larger than the fixed number of labels, multiple videos belonging to one label will inevitably appear in a batch. Therefore, more than one positive pair may exist in both $q^{x2y}(x)$ and $q^{y2x}(y)$, and it is not proper to regard the similarity score learning as a 1-of-N classification problem with cross-entropy loss. Instead, we define the Kullback–Leibler (KL) divergence as the video-text contrastive loss to optimize our framework:

$$\mathcal{L} = \frac{1}{2}\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\mathrm{KL}\big(p^{x2y}(x),\,q^{x2y}(x)\big) + \mathrm{KL}\big(p^{y2x}(y),\,q^{y2x}(y)\big)\right], \qquad (4)$$

where $\mathcal{D}$ is the whole training set. Based on the multimodal framework, we can simply carry out zero-shot prediction as the normal testing process in Equation 1.
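To make the training objective more concrete, the following is a minimal sketch of a symmetric, temperature-scaled softmax over similarities combined with a KL-divergence loss that tolerates several positives per batch. Several choices here are assumptions of the sketch and not confirmed details of the method: the targets are normalized to put equal mass on every positive, the divergence is computed from the target to the prediction, the temperature is a fixed constant instead of a learned parameter, and all function names are hypothetical.

```python
import math

def softmax(scores, temperature):
    """Temperature-scaled softmax over a list of similarity scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ground_truth(labels, anchor):
    """Target distribution: equal mass on every batch item sharing the anchor's label (assumed)."""
    hits = [1.0 if lab == labels[anchor] else 0.0 for lab in labels]
    total = sum(hits)
    return [h / total for h in hits]

def kl_divergence(target, predicted):
    """KL(target || predicted); entries with zero target mass contribute nothing."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0.0)

def contrastive_kl_loss(sim, labels, temperature=0.07):
    """Mean of the video-to-text and text-to-video KL terms over a batch.

    sim[i][j] is the cosine similarity between video i and the label text of pair j.
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        q = ground_truth(labels, i)
        p_v2t = softmax(sim[i], temperature)                          # video i against all texts
        p_t2v = softmax([sim[j][i] for j in range(n)], temperature)   # text i against all videos
        loss += 0.5 * (kl_divergence(q, p_v2t) + kl_divergence(q, p_t2v))
    return loss / n

# Two videos share the label "run", so each has two positives in the batch.
labels = ["run", "run", "swim"]
aligned = [[0.9, 0.9, 0.1], [0.9, 0.9, 0.1], [0.1, 0.1, 0.9]]
misaligned = [[0.1, 0.1, 0.9], [0.1, 0.1, 0.9], [0.9, 0.9, 0.1]]

loss_aligned = contrastive_kl_loss(aligned, labels)
loss_misaligned = contrastive_kl_loss(misaligned, labels)
print(loss_aligned, loss_misaligned)
```

With a one-hot cross-entropy target, the two "run" videos would be pushed apart from each other's label text even though that text is identical; the multi-positive target avoids this.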
When considering the above multimodal learning framework, we need to consider the deficiency of label words. The most intuitive way is to take advantage of vast web image-text or video-text data. To cater for this, we propose a new “pre-train, prompt and fine-tune” paradigm for action recognition.
Pre-train. As prior arts suggested, pre-training has a large impact on vision-language multimodal learning [lu2019vilbert, lei2021less, li2021align, kim2021vilt]. Since the training data is directly collected from the web, one of the hot topics is to design appropriate objectives to handle these noisy data during this procedure. We find there are mainly three upstream pre-training proxy tasks in the pre-training procedure: multimodal matching (MM), multimodal contrastive learning (MCL) and masked language modeling (MLM). MM predicts whether a pair of modalities is matched or not. MCL aims to draw pairwise unimodal representations close to each other. MLM utilizes the features of both modalities to predict the masked words. However, this paper does not focus on this step due to the restriction of enormous computation cost. We directly choose to apply a pre-trained model and make efforts on the following two steps.
Prompt. Prompt in NLP means the original input is modified using a template into a textual string prompt that has some unfilled slots to fill with expected results. Here we borrow the word “prompt” to mean adjusting and reformulating the downstream tasks to act more like the upstream pre-training tasks. Notably, the traditional practice is adapting the pre-trained model to the downstream classification task by attaching a new linear layer to the pre-trained feature extractor, which is the reverse of our approach. Here we make two kinds of prompts, textual prompt and visual prompt. The former is significant for label text extension. Given a label $y$, we first define a set of permissible values $\mathcal{Z}$, then the prompted textual input $y'$ is obtained by a filling function $y' = f_{fill}(y, z)$, where $z \in \mathcal{Z}$. There are three varieties of $f_{fill}$: prefix prompt, cloze prompt and suffix prompt. They are classified based on the filling locations. For the visual prompt, the design mainly depends on the pre-trained model. If the model is pre-trained on video-text data, almost no extra reformulation is needed for the visual part since the model is already trained to output video representations. If the model is pre-trained on image-text data, however, we should empower the model to learn the important temporal relationships of videos. Formally, given a video $x$, we introduce a prompt function $f_{tem}$ that works together with $h_I$, where $h_I$ is the visual encoding network of the pre-trained model. Similarly, $f_{tem}$ has three variants based on where it works relative to $h_I$: pre-network prompt, in-network prompt and post-network prompt. With an elaborate prompt design, we could even avoid the computationally prohibitive “pre-train” step above by keeping the learned ability of a pre-trained model. Note that in the new paradigm, the pre-trained model should not be largely modified due to catastrophic forgetting [mccloskey1989catastrophic], where the pre-trained model loses its ability to do things that it was able to do during pre-training. We also demonstrate this point in our experiments.
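A small sketch may help picture the filling function for textual prompts. The template strings below are illustrative placeholders chosen for this example, not the set of permissible values actually used, and `fill_prompt` is a hypothetical helper; the three keys simply mirror the prefix, cloze and suffix categories, which differ in where the label is placed.

```python
def fill_prompt(label, template):
    """Insert the label words into the single '{}' slot of a template sentence."""
    return template.format(label)

# Illustrative templates (assumed); the slot position determines the prompt type.
templates = {
    "prefix": "{}, a video of action",          # label at the start
    "cloze": "this is {}, a video of action",   # label in the middle
    "suffix": "human action of {}",             # label at the end
}

label = "archery"
prompted = {kind: fill_prompt(label, t) for kind, t in templates.items()}
for kind, sentence in prompted.items():
    print(kind, "->", sentence)
```

Each label can be expanded with every template, so a succinct class name yields several full sentences for the language encoder.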
Fine-tune. When there are sufficient downstream training datasets like Kinetics, there is no doubt that fine-tuning on specific datasets will dramatically improve the performance. Also, if the prompt introduces extra parameters, it is necessary to train these parameters and learn the whole framework end-to-end.
Each component of the new paradigm has a wide variety of choices. As presented in Figure 2, we show an instantiation example here and conduct all the experiments with this instantiation.
We employ a readily available pre-trained model, CLIP [radford2021learning], to avoid the enormous computational resources of the first pre-training step. This instantiation model is called ActionCLIP, as shown in Figure 2(a). CLIP is an efficient image-text representation trained with the MCL task, similar to our multimodal learning framework. Figure 2(b) shows concrete examples of the textual prompts used in the instantiation. We define the permissible values to be discrete manual sentences, which is the most natural way based on human introspection. Then the prompted input is fed into the language encoder, which is the same as the pre-trained language model. For the vision model, based on the pre-trained image encoder of CLIP, we employ three types of visual prompts as follows.
Pre-network Prompt. This type operates on the inputs before feeding them into the encoder, as shown in Figure 2(c). Given a video x, we simply forward all spatio-temporal tokens extracted from the video through the visual encoder to jointly learn spatio-temporal attention. In addition to the spatial positional embedding, an extra learnable temporal positional embedding is added to the token embedding to indicate the frame index. The video encoder could use the original pre-trained image encoder. We call this type Joint for short.
In-network Prompt. We attempt a parameter-free prompt, abbreviated as Shift, for this type, as shown in Figure 2(d). We introduce the temporal shift module [lin2018temporal], which shifts part of the feature channels along the temporal dimension and facilitates information exchange among neighboring input frames. We insert the module between every two adjacent layers of the video encoder. The architecture and pre-trained weights of the image encoder can be directly reused since this module brings no parameters.
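The channel-shifting idea can be illustrated on a toy sequence of per-frame feature vectors. This is a simplified sketch under assumptions: it ignores the spatial token dimension, uses zero padding at the sequence boundaries, and moves one fold of channels in each temporal direction, with `fold_div` as a hypothetical parameter controlling the shifted fraction.

```python
def temporal_shift(frames, fold_div=4):
    """Shift a fraction of channels along time; frames[t][c] is channel c of frame t.

    The first fold of channels is taken from the next frame, the second fold from
    the previous frame, and the remaining channels are left untouched.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div
    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1      # look one frame ahead
            elif c < 2 * fold:
                src = t - 1      # look one frame back
            else:
                src = t          # unshifted channels
            if 0 <= src < num_frames:
                shifted[t][c] = frames[src][c]  # out-of-range sources stay zero
    return shifted

# Three frames with four channels each; the value 10*t + c encodes its origin.
frames = [[10 * t + c for c in range(4)] for t in range(3)]
shifted = temporal_shift(frames)
for row in shifted:
    print(row)
```

The operation only moves existing values around, which is why it adds no parameters; it does, however, change the features each layer receives compared with what the image encoder saw during pre-training.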
Post-network Prompt. Given a video x with extracted frames, we sequentially encode spatial and temporal features with two separate encoders in this prompt. The first is a spatial encoder, which is responsible only for modeling interactions between tokens extracted from the same temporal index. We use the pre-trained image encoder as our spatial encoder. The extracted frame-level representations are then concatenated into a sequence and fed to a temporal encoder to model interactions between tokens from different temporal indices. We offer four choices for the temporal encoder, MeanP, Conv1D, LSTM and Transf, presented in Figure 2(e-g). MeanP is short for mean pooling on the temporal dimension. Conv1D is a 1D convolutional layer applied on the temporal dimension. LSTM is a recurrent neural network and Transf means a multi-layer temporal vision transformer encoder. Since the outputs of Conv1D, LSTM and Transf keep the same temporal dimension as the input, the subsequent operations are the same as MeanP.
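As a rough sketch of the post-network design, the snippet below pools per-frame features over time and optionally passes them through a sequence-preserving temporal encoder first. The `smooth` function is a stand-in chosen for this example (a fixed three-tap moving average with zero padding), not one of the actual learned encoders, and the function names are hypothetical.

```python
def mean_pool(frame_features):
    """Average per-frame feature vectors over the temporal dimension."""
    num_frames = len(frame_features)
    return [sum(column) / num_frames for column in zip(*frame_features)]

def smooth(frame_features):
    """Toy temporal encoder (assumed): 3-tap moving average, output length equals input length."""
    num_frames = len(frame_features)
    dim = len(frame_features[0])
    out = []
    for t in range(num_frames):
        window = [frame_features[s] for s in (t - 1, t, t + 1) if 0 <= s < num_frames]
        out.append([sum(f[c] for f in window) / 3.0 for c in range(dim)])
    return out

def post_network_prompt(frame_features, temporal_encoder=None):
    """Apply an optional temporal encoder, then pool to a single video feature."""
    if temporal_encoder is not None:
        frame_features = temporal_encoder(frame_features)
    return mean_pool(frame_features)

frame_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = post_network_prompt(frame_features)
pooled_smoothed = post_network_prompt(frame_features, temporal_encoder=smooth)
print(pooled, pooled_smoothed)
```

Because the spatial encoder is untouched and the temporal module only sits on top of its outputs, this design leaves the pre-trained image features intact.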
Then we end-to-end fine-tune the whole network with the training objective Equation 4.
Network architectures. Our textual encoder follows that of CLIP, which is a 12-layer, 512-wide Transformer with 8 attention heads, and the activations from the highest layer at [EOS] are treated as the feature representation. We use the ViT-B/32 and ViT-B/16 versions of CLIP’s visual encoder. Both are 12-layer vision transformers, with input patch sizes of 32 and 16 respectively. The [Class] token of their highest layers’ outputs is used. We use 18 permissible values for the textual prompt. For visual prompts, Conv1D and LSTM each have 1 layer, and Transf has 6 layers. Two versions of Transf are implemented, which differ in not using or using the [Class] token. We distinguish them as Transf and Transf$_{cls}$.
Training. We use the AdamW optimizer with separate base learning rates for the pre-trained parameters and for the new modules with learnable parameters. Models are trained for 50 epochs and the weight decay is 0.2. The learning rate is warmed up for 10% of the total training epochs and decayed to zero following a cosine schedule for the rest of the training. The spatial resolution of the input frames is 224×224. We use the same segment-based input frame sampling strategy as [wang2018temporal] with 8, 16 or 32 frames. Even the largest model of our method, ViT-B/16, could be trained with 4 NVIDIA GeForce RTX 3090 GPUs on Kinetics-400 when inputting 8 frames, and the training process takes about 2.5 days. Compared to X3D and SlowFast, both trained with 128 GPUs for 256 epochs, our training is much faster and requires far fewer GPUs (about 30× fewer).
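The learning rate schedule described above can be written as a short function. This is a sketch under assumptions: the warm-up is taken to be linear (the text only says the rate is warmed up), the schedule is expressed per step rather than per epoch, and the base rate passed in is a placeholder value, not the one used in the experiments.

```python
import math

def learning_rate(step, total_steps, base_lr, warmup_fraction=0.1):
    """Warm up (linearly, by assumption) for the first fraction of training, then cosine-decay to zero."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Placeholder base rate of 1.0 so the printed values read as fractions of the peak.
total_steps = 100
schedule = [learning_rate(s, total_steps, base_lr=1.0) for s in range(total_steps + 1)]
print(schedule[0], schedule[10], schedule[55], schedule[100])
```

With two parameter groups, as in the training setup, the same function would simply be evaluated once per group with its own base rate.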
In this section, we do extensive ablation experiments to demonstrate our method with the instantiation, ActionCLIP. Models in this section use 8-frame input, Transf for temporal modeling, ViT-B/32 as the backbone and single view testing on Kinetics-400, unless specified otherwise.
Is the “multimodal framework” helpful? To compare with the traditional video-unimodal 1-of-N classification model, we implement a variant called unimodality which has the same backbone, pre-trained weights and temporal modeling strategy (before the final linear layer) with our ActionCLIP. The results are shown in Table 1. When exploiting the semantic information of label texts with our multimodal learning framework, it dramatically improves the performance with 2.91% top-1 accuracy gains, demonstrating that the multimodal framework is helpful to learn powerful representations for action recognition.
Is the “pre-train” step important? In Table 2, we validate the impact of this step by experimenting with randomly initialized or CLIP pre-trained vision and language encoders. In particular, from the large gap (40.10% vs. 78.36%) between models V3 and V4, we find that the visual encoder needs proper initialization; otherwise the model fails to obtain strong performance. The language encoder has a smaller influence, since model V2 still achieves a comparable result (76.63%) to model V4 (78.36%). When both the visual and language encoders are randomly initialized, model V1 struggles to learn a good representation and drops by a large margin of 41.4% from model V4. Therefore, the final conclusion is that the “pre-train” step is important, especially for the visual encoder.
| ||only label||textual prompt|
Is the “prompt” step important? Table 3 shows the results of the textual prompt. Using only the label words drops accuracy by 0.54% compared with using the textual prompt, demonstrating the validity of this simple, discrete and human-comprehensible textual prompt. For the visual prompt, note that MeanP is the simplest temporal fusion method, and we compare the other visual prompts with it. As shown in Table 4, Joint and Shift clearly decrease the performance, by 2.74% and 5.38% respectively. We believe the reason is the catastrophic forgetting phenomenon, since the input pattern is changed in Joint and the features of the pre-trained image encoder are changed in Shift. These operations may break the originally learned strong representations and cause a performance drop. Post-network prompts are more suitable and safer options for keeping the learned characteristics. Specifically, LSTM and Conv1D cause a negligible top-1 drop, but both improve the top-5 accuracy. The two Transf variants improve the top-1 results by 1.01% and 1.25%. We choose Transf as our final visual prompt since it has the best results. In short, the design of the prompt is significant, since proper prompts can avoid catastrophic forgetting and maintain the representation power of existing pre-trained models, giving a shortcut to the usage of tremendous web data.
|Visual prompt||Top-1||Top-5|
Is the “fine-tune” step important? We examine this step by separately freezing the parameters of the pre-trained language encoder and image encoder. The results are presented in Table 5. When both encoders are frozen and only the visual prompt Transf is trained, the top-1 accuracy decreases by 6.15%. When all the parameters are fine-tuned end-to-end, we obtain the best result of 78.36%. Freezing either of the pre-trained encoders has a negative influence on the accuracy. Therefore, the “fine-tune” step is indeed crucial for specific datasets, which is consistent with our intuition.
Backbones and input frames. In Table 6, we experiment with ActionCLIP using different backbone and input frame configurations. The number of input frames is 8, 16 or 32. Two different backbones are used, ViT-B/32 and ViT-B/16. The conclusion is intuitive: larger models and more input frames yield better performance.
|Backbone||Input frames||Top-1||Top-5|
For different backbones and input frame configurations, we present their model sizes, FLOPs and inference speeds in Table 7. The textual encoder of all backbones has the same architecture, which has 37.8M parameters. We show the total parameter count in the table. Notably, ViT-B/32 has slightly more parameters than ViT-B/16, which comes from the linear projection layer applied before the vision transformer. Meanwhile, ViT-B/32 has a much faster inference speed (3.3×) and fewer FLOPs (4×) than ViT-B/16. Moreover, we provide two very recent methods for comparison, TimeSformer [bertasius2021space] and ViViT [arnab2021vivit], with their highest configurations, which obtain accuracy similar to ActionCLIP. Specifically, compared with the highest configuration of ActionCLIP, TimeSformer needs more input frames (3×) and many more FLOPs (12.7×) to obtain its best performance, which is still worse than that of ActionCLIP (ActionCLIP 82.3% vs. TimeSformer 80.7%). Similarly, ViViT needs many more FLOPs (7.1×) to obtain its best results with a larger input resolution of 320×320, while ActionCLIP’s input is 224×224; ActionCLIP surpasses ViViT by a 1% top-1 accuracy gap and runs faster (3.1×). In conclusion, ActionCLIP is a cost-effective and efficient method for action recognition.
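The parameter and FLOPs observations for the two backbones can be checked with simple arithmetic on the patch projection. This sketch assumes the standard ViT-B hidden width of 768, three input channels, a bias-free projection and a 224×224 input; it counts only the projection weights and the number of patch tokens, not the full model, and the function names are hypothetical.

```python
def patch_projection_params(patch_size, width=768, channels=3):
    """Weights in the linear patch projection (bias excluded, by assumption)."""
    return channels * patch_size * patch_size * width

def tokens_per_frame(patch_size, resolution=224):
    """Number of non-overlapping patches extracted from one square frame."""
    return (resolution // patch_size) ** 2

params_b32 = patch_projection_params(32)
params_b16 = patch_projection_params(16)
tokens_b32 = tokens_per_frame(32)
tokens_b16 = tokens_per_frame(16)

print("projection weights:", params_b32, "vs", params_b16)
print("tokens per frame:", tokens_b32, "vs", tokens_b16)
```

The larger patch makes the projection of ViT-B/32 about 1.8M weights bigger, while ViT-B/16 processes four times as many tokens per frame, which is roughly in line with the reported gap in FLOPs.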
In this section, we demonstrate the attractive zero-shot/few-shot recognition ability of our ActionCLIP (ViT-B/16). We implement two representative methods for comparison: STM [jiang2019stm], which is a well-designed temporal-encoded 2D network, and 3D-ResNet-50, which is the slow path of SlowFast [feichtenhofer2019slowfast]. We use 8-frame input and single-view inference in all models of this section. We first conduct the zero-shot/few-shot experiments on Kinetics-400. ActionCLIP uses the pre-trained model of CLIP with the MeanP visual prompt (since Transf has no pre-trained parameters), while STM and 3D-ResNet-50 are pre-trained on ImageNet. Then, we validate on UCF-101 and HMDB-51 with Kinetics-400 pre-trained models (ActionCLIP uses Transf here). As shown in Figure 3, the results demonstrate the strong transfer power of ActionCLIP under these data-poor conditions, while the traditional unimodal methods are not able to do zero-shot recognition and their few-shot performance is weak compared with ActionCLIP, even when pre-trained on the large Kinetics-400.
|I3D NL [wang2018non]||32||77.7||93.3|
In this section, we evaluate the performance of our method on a diverse set of action recognition datasets: Kinetics-400 [carreira2017quo], Charades [sigurdsson2016hollywood], UCF-101 [soomro2012ucf101] and HMDB-51 [Kuehne11]. ViT-B/16, the Transf prompt and multi-view testing are used in ActionCLIP. The results on UCF-101 and HMDB-51 are shown in the Appendix.
Kinetics-400. Table 8 compares our method to prior methods on Kinetics-400. There are four parts in this table, corresponding to 3D-CNN-based methods, 2D-CNN-based methods, transformer-based methods and our method. According to the table, the third part achieves better results with strong vision transformers than the first and second parts. Among them, the first four methods build on the 12-layer ViT-B/16 model for parameter-accuracy balance, as does our ActionCLIP. ViViT instead uses a larger model, the 24-layer ViT-L, for better results. It also introduces JFT for pre-training for further gain. Our ActionCLIP achieves 82.6% top-1 accuracy with only 16-frame input, which exceeds all the methods in the first and second parts of the table and most transformer-based methods, some of which use more input frames, such as the 250 frames of ViT-B-VTN. An interesting discovery is that our top-5 accuracy is always higher than that of other methods. We think this benefits from our multimodal framework’s different inference process, which calculates the similarity between the semantic representations of videos and all labels. ActionCLIP further reaches a leading performance of 83.8% when increasing the input to 32 frames. We believe that more input frames, larger models and larger input resolutions will yield better results and leave this to future work. The current performance of ActionCLIP already reveals the potential of the multimodal learning framework and the proposed new paradigm for action recognition.
|MultiScale TRN [zhou2018temporal]||-||25.2|
|SlowFast 50 [feichtenhofer2019slowfast]||8+32||38.0|
|X3D-XL (312) [feichtenhofer2020x3d]||16||43.4|
Charades. This is a dataset with longer-range activities and it has multiple actions inside every video. We show Kinetics-400 pre-trained models in Table 9. Mean Average Precision (mAP) is used for evaluation. ActionCLIP achieves the top performance of 44.3 mAP, which demonstrates its effectiveness on multi-label video classification.
This paper provides a new perspective for action recognition by regarding it as a video-text multimodal learning problem. Unlike the canonical approaches that model the task as a video unimodality classification problem, we propose a multimodal learning framework to exploit the semantic information of label texts. Then, we formulate a new paradigm, i.e., “pre-train, prompt, and fine-tune” to enable our framework to directly reuse powerful large-scale web data pre-trained models, greatly reducing the pre-training cost. We implement an instantiation of the new paradigm, ActionCLIP, which has a superior performance on both general and zero-shot/few-shot action recognition. We hope our work could provide a new perspective for this task, especially raising attention on language modeling.
We would like to thank Zeyi Huang for his constructive suggestions and comments on this work.