UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

02/15/2020 ∙ by Huaishao Luo, et al. ∙ Microsoft ∙ Southwest Jiaotong University ∙ Institute of Computing Technology, Chinese Academy of Sciences ∙ Beijing Institute of Technology

We propose UniViLM: a Unified Video and Language pre-training Model for multimodal understanding and generation. Motivated by the recent success of BERT-based pre-training techniques for NLP and image-language tasks, VideoBERT and CBT were proposed to exploit the BERT model for video and language pre-training using narrated instructional videos. Different from these works, which pre-train only for understanding tasks, we propose a unified video-language pre-training model for both understanding and generation tasks. Our model comprises four components, including two single-modal encoders, a cross encoder, and a decoder, all with the Transformer backbone. We first pre-train our model to learn universal representations for both video and language on a large instructional video dataset. Then we fine-tune the model on two multimodal tasks: an understanding task (text-based video retrieval) and a generation task (multimodal video captioning). Our extensive experiments show that our method improves the performance of both understanding and generation tasks and achieves state-of-the-art results.


1 Introduction

With the recent advances of self-supervised learning, pre-training techniques play a vital role in learning good representations for vision and language. The paradigm is to pre-train the model on large-scale unlabeled data and then fine-tune it on downstream tasks using task-specific labeled data. Inspired by the success of the BERT model Devlin et al. (2019) for NLP tasks, numerous multimodal image-language pre-training models Lu et al. (2019); Li et al. (2019a, b) have been proposed and have demonstrated their effectiveness on various vision-and-language tasks such as VQA (visual question answering) and image-text matching. Nevertheless, there are still few works on video-linguistic pre-training.

Figure 1: A showcase of our video and language pre-training based model for multimodal understanding (retrieval) and generation (captioning).

Videos contain rich visual, acoustic, and language information for people to acquire knowledge or learn how to perform a task. This motivates researchers to investigate whether AI agents can learn task completion from videos as humans do, using both low-level visual and high-level semantic language signals. Therefore, multimodal video-language tasks are of great importance for both research and applications. In this work, we first propose to pre-train a unified video-language model using video and automatic speech recognition (ASR) transcripts from instructional videos to learn a joint representation of both video and language. Then, we fine-tune this model on two typical multimodal tasks: text-based video retrieval for understanding and multimodal video captioning for generation. Figure 1 presents a showcase of our pre-training and fine-tuning flow; both tasks take video and language as input. Take multimodal video captioning as an example: the model takes the video and the ASR transcript as input and predicts a caption sentence.

VideoBERT and CBT Sun et al. (2019b, a) are the pioneering works that investigate video-language pre-training for video representation on instructional videos. They have demonstrated the effectiveness of BERT-based models for capturing video temporal and language sequential features. Our work differs from VideoBERT and CBT in two aspects: 1) previous works only pre-train the model on understanding tasks, while we pre-train on both understanding and generation tasks; 2) they fine-tune downstream tasks for a better video representation with only video as input, while our goal is to learn a joint video-language representation through downstream multimodal tasks.

In this paper, we propose UniViLM: a Unified Video and Language pre-training Model for multimodal understanding and generation. Our UniViLM model adopts the Transformer Vaswani et al. (2017) as its backbone and has four components: two single-modal encoders, a cross encoder, and a decoder. In detail, we first encode the text and the video separately with the two single-modal encoders. Then we adopt a Transformer based encoder-decoder model to perform understanding and generation pre-training with four tasks: 1) masked language model (MLM, for language corruption); 2) masked frame model (MFM, for video corruption); 3) video-text alignment; and 4) language reconstruction.

As shown in Figure 1, we fine-tune our pre-trained model on two typical video-language tasks: text-based video retrieval and multimodal video captioning. For the first task, we remove the decoder and fine-tune with the alignment objective. For the second task, we directly fine-tune the pre-trained encoder-decoder model.

We list our contributions below:

1) We propose a multimodal video-language pre-training model trained on a large-scale instructional video dataset; it is a unified model for both video-language understanding and generation tasks.

2) The pre-training stage consists of 4 tasks including MLM (masked language model), MFM (masked video frame model), video-text alignment, and language reconstruction.

3) We fine-tune our pre-trained model on two typical multimodal video-language tasks: text-based video retrieval and multimodal video captioning. Extensive experiments demonstrate the effectiveness of our unified pre-trained model on both understanding and generation tasks and show that it achieves state-of-the-art results.

2 Related Works

Single Modal Pre-Training

Self-supervised representation learning has been shown to be effective for sequential data including language and video. Language pre-training models including BERT Devlin et al. (2019), GPT Radford et al. (2018), RoBERTa Liu et al. (2019), XLNet Yang et al. (2019), MASS Song et al. (2019), UniLM Dong et al. (2019), and BART Lewis et al. (2019) have achieved great success on NLP tasks. BERT Devlin et al. (2019) is a denoising auto-encoder network using the Transformer, with MLM (masked language model) and NSP (next sentence prediction) as pre-training tasks, and has strong performance on understanding tasks. MASS Song et al. (2019) focuses on pre-training for generation tasks. UniLM Dong et al. (2019) and BART Lewis et al. (2019) further study unified pre-training models for both understanding and generation tasks.

Video representation learning mostly focuses on video sequence reconstruction or future frame prediction as pre-training (pretext) tasks. Early works such as Mathieu et al. (2015); Srivastava et al. (2015); Han et al. (2019) aim to synthesize video frames from image patches. Similarly, Wang and Gupta (2015) adopt a Siamese-triplet network to rank continuous patches as more similar than patches from different videos. Other works predict feature vectors in a latent space using auto-regressive models with noise contrastive estimation (NCE) Lotter et al. (2016); Oord et al. (2018). Sun et al. (2019a) adopt NCE to make predictions on a corrupted (masked) latent space using an auto-encoder model.

Multimodal Pre-Training

Recently, numerous visual-linguistic pre-training models Lu et al. (2019); Li et al. (2019b); Tan and Bansal (2019); Li et al. (2019a); Zhou et al. (2019); Sun et al. (2019b) have been proposed for multimodal tasks. For image and text pre-training, ViLBERT Lu et al. (2019) and LXMERT Tan and Bansal (2019) adopt two separate Transformers to encode the image and the text independently. Other models such as Unicoder-VL Li et al. (2019a), VL-BERT Lu et al. (2019), and UNITER Zhou et al. (2019) use one shared BERT model. These models employ MLM and image-text matching as pre-training tasks, which are effective for downstream multimodal tasks. VLP Zhou et al. (2019) proposes a unified image-language model for understanding and generation tasks. Different from these works, we focus on video and text pre-training for universal representations.

VideoBERT Sun et al. (2019b) and CBT Sun et al. (2019a) are the first video-language pre-training models and the works most similar to ours. Although VideoBERT and CBT pre-train on multimodal data, their downstream tasks only use the video representation for further prediction. We believe that video-language pre-training can learn a universal representation of video and text. Besides, previous works only pre-train the encoder and suffer from an uninitialized decoder for generation tasks. We further pre-train the decoder for generation tasks, and experimental results show that the pre-trained decoder is effective for generation.

Multimodal Retrieval and Captioning

Multimodal video and language learning is a nascent research area. In this work, we fine-tune and evaluate our pre-trained model on two multimodal tasks: text-based video retrieval and multimodal video captioning. The text-based video retrieval task is to predict whether a video and a text query match each other. Yu et al. (2018) densely align each token with each frame. Miech et al. (2019) embed text and video into the same latent space through a joint embedding network trained on 1.2 million videos. The multimodal video captioning task is to generate captions given an input video together with its ASR transcript. Different from works Sun et al. (2019b, a); Krishna et al. (2017); Zhou et al. (2018a, b) which only use the video signal, recent works Shi et al. (2019); Palaskar et al. (2019); Hessel et al. (2019) study multimodal captioning by taking both video and transcript as input, and show that incorporating the transcript can largely improve performance. Our model achieves state-of-the-art results on both tasks.

3 Method

The problem is defined as follows: given pairs of video and the corresponding ASR transcript, pre-train a model to learn a joint video-text representation and then fine-tune it on downstream tasks. In this section, we describe the details of the architecture and the pre-training tasks.

3.1 Model Architecture

Figure 2 presents the model structure, an encoder-decoder architecture. First, the model extracts representations of the input text tokens and the video frame sequence using separate feature extractors. Then a text encoder based on BERT embeds the text, and a video encoder based on the Transformer embeds the video frames. Next, we employ a Transformer based cross encoder to let the text and the video interact. Finally, another Transformer based decoder learns to reconstruct the input text.

Figure 2: The main structure of our pre-training model, which comprises four components: two single-modal encoders, a cross encoder, and a decoder, all with the Transformer backbone. $P_i$ represents the position embedding, $S_i$ is the segment embedding that distinguishes the text and video types, and $E_i$ denotes the embedding of each token.
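To make the data flow concrete, here is a minimal PyTorch sketch of the four-component layout (two single-modal encoders, a cross encoder, and a decoder). It is an illustration under simplifying assumptions, not the released implementation: the text encoder is a randomly initialized Transformer standing in for BERT-base, position/segment embeddings are omitted, and all class and variable names (e.g., UniViLMSketch) are hypothetical.

```python
import torch
import torch.nn as nn

def encoder_stack(layers, d_model=768, heads=12):
    """Plain Transformer encoder stack operating on [batch, length, d_model] tensors."""
    layer = nn.TransformerEncoderLayer(d_model, heads, dim_feedforward=4 * d_model,
                                       batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)

class UniViLMSketch(nn.Module):
    def __init__(self, vocab_size=30000, d_model=768, video_feat_dim=4096):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.video_proj = nn.Linear(video_feat_dim, d_model)  # 2D+3D features -> model size
        self.text_encoder = encoder_stack(layers=12)   # stands in for BERT-base
        self.video_encoder = encoder_stack(layers=1)   # 1-layer video Transformer
        self.cross_encoder = encoder_stack(layers=2)   # 2-layer cross encoder
        dec_layer = nn.TransformerDecoderLayer(d_model, 12, dim_feedforward=4 * d_model,
                                               batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=1)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, video_feats, dec_tokens):
        t_enc = self.text_encoder(self.token_emb(tokens))          # text encoding
        v_enc = self.video_encoder(self.video_proj(video_feats))   # video encoding
        fused = self.cross_encoder(torch.cat([t_enc, v_enc], 1))   # attended cross encoding
        dec = self.decoder(self.token_emb(dec_tokens), fused)      # decoded features
        return fused, self.lm_head(dec)

# Toy forward pass: batch of 2, 32 text tokens, 48 video frames, 16 decoder steps.
model = UniViLMSketch()
fused, logits = model(torch.randint(0, 30000, (2, 32)),
                      torch.randn(2, 48, 4096),
                      torch.randint(0, 30000, (2, 16)))
```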

Pre-processing

First we pre-process the video and language before feeding them to the model. For the input text, we tokenize all words with WordPieces Wu et al. (2016), following the pre-processing in BERT, to obtain the token sequence $\mathbf{t} = \{t_i \mid i \in [1, n]\}$, where $t_i$ is the $i$-th token and $n$ is the length of the token sequence. For each video clip, we sample a frame sequence $\mathbf{v} = \{v_j \mid j \in [1, m]\}$ to represent the clip, where $v_j$ is the $j$-th video frame and $m$ is the length of the frame sequence.
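As a concrete illustration of this pre-processing step, the snippet below tokenizes a transcript with a WordPiece tokenizer and uniformly samples a fixed number of frames from a clip; the tokenizer checkpoint, the uniform sampling strategy, and the length limits are assumptions made for the sketch.

```python
import numpy as np
from transformers import BertTokenizer  # WordPiece tokenizer, as used by BERT

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # 30,000-token vocabulary

def preprocess(transcript, clip_frames, max_tokens=32, max_frames=48):
    """Return token ids t_1..t_n and a sampled frame sequence v_1..v_m for one clip."""
    token_ids = tokenizer(transcript, max_length=max_tokens, truncation=True,
                          padding="max_length")["input_ids"]
    # Uniformly sample up to `max_frames` frames to represent the clip.
    idx = np.linspace(0, len(clip_frames) - 1,
                      num=min(max_frames, len(clip_frames)), dtype=int)
    return token_ids, clip_frames[idx]

tokens, frames = preprocess("place the chicken in the oven",
                            np.random.rand(120, 224, 224, 3))
```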

Single Modal Encoder

We encode the text and the video separately. First, we adopt the BERT-base model to encode the token sequence $\mathbf{t}$. The text encoding is $\mathbf{T} \in \mathbb{R}^{n \times d}$,

$\mathbf{T} = \text{BERT}(\mathbf{t})$   (1)

where $d$ is the hidden size of the text encoding.

Next, we adopt off-the-shelf image feature extractors to generate the input feature matrix for the video frame sequence before feeding it to the video encoder. While an image representation only considers spatial features, a video representation encodes both spatial and temporal features. We extract video features using 2D and 3D CNNs for spatial and spatio-temporal representation, respectively. Then, we concatenate the two features into one unified video feature $\mathbf{F}_v \in \mathbb{R}^{m \times d_v}$, where $d_v$ is the hidden size of the video feature. Finally, $\mathbf{F}_v$ is fed to the video encoder to embed the contextual information,

$\mathbf{V} = \text{Transformer}(\mathbf{F}_v)$   (2)

The dimension of $\mathbf{V}$ is $m \times d$.

Cross Encoder

To make the text and the video fully interact with each other, we design a cross encoder to fuse their features. We first combine the text encoding $\mathbf{T}$ and the video encoding $\mathbf{V}$ to get the encoding $\mathbf{M}$. Then, the Transformer based cross encoder takes $\mathbf{M}$ as input to generate the attended encoding $\mathbf{M}_{attn}$,

$\mathbf{M} = [\mathbf{T}\,;\mathbf{V}]$   (3)
$\mathbf{M}_{attn} = \text{Transformer}(\mathbf{M})$   (4)

where $[\,\cdot\,;\cdot\,]$ denotes the combination operation (concatenation along the sequence dimension).

Decoder

The decoder learns to reconstruct the input text during pre-training and to generate captions during fine-tuning and inference. Its input is the attended encoding $\mathbf{M}_{attn}$ of the text and video. We again exploit the Transformer to get the decoded features $\mathbf{D} \in \mathbb{R}^{l \times d}$ from $\mathbf{M}_{attn}$,

$\mathbf{D} = \text{Transformer}(\mathbf{M}_{attn})$   (5)

where $l$ is the decoder length.

3.2 Pre-training Objectives

We have four pre-training objectives: 1) masked language model (for text corruption); 2) masked frame model (for video corruption); 3) video-text alignment and 4) language reconstruction.

MLM: Masked Language Model

Following BERT, we randomly mask 15% of the tokens in the sentence with the special token [MASK], and the objective is to reproduce the masked tokens. Since the ASR transcript is automatically extracted from speech and is therefore noisy and of low quality, we additionally mask key concepts conditionally. Specifically, we conditionally mask 15% of the verbs or nouns in the sentences (we use the package scapy, https://scapy.net, to extract verbs and nouns automatically) to compel the encoder to learn these key concepts. The loss function is defined as:

$L_{MLM}(\theta) = -E_{t_m \sim \mathbf{t}} \log P_\theta(t_m \mid \mathbf{t}_{\neg m}, \mathbf{v})$   (6)

where $\mathbf{t}_{\neg m}$ denotes the contextual tokens surrounding the masked token $t_m$, and $\theta$ denotes the trainable parameters.
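A minimal sketch of this conditional masking is given below: it masks tokens at random and additionally masks nouns and verbs flagged by a part-of-speech tagger. spaCy is used here only as a readily available tagger, and the helper name and exact probabilities are illustrative assumptions.

```python
import random
import spacy

nlp = spacy.load("en_core_web_sm")  # any POS tagger would do; spaCy is an assumption here

def conditional_mask(sentence, p_random=0.15, p_concept=0.15, mask_token="[MASK]"):
    """Mask 15% of tokens at random plus 15% of nouns/verbs (key concepts)."""
    masked, targets = [], []
    for tok in nlp(sentence):
        is_concept = tok.pos_ in ("NOUN", "VERB")
        if random.random() < p_random or (is_concept and random.random() < p_concept):
            masked.append(mask_token)
            targets.append(tok.text)   # token the MLM head must reproduce
        else:
            masked.append(tok.text)
            targets.append(None)       # position ignored by the loss
    return masked, targets

print(conditional_mask("place the chicken in the oven and bake for ten minutes"))
```

The MLM loss is then a cross-entropy computed only at the masked positions.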

MFM: Masked Frame Model

Similarly, we also propose a masked frame model to predict the correct frames given the contextual frames. The loss function is NCE Sun et al. (2019a). We randomly mask 15% of the frame feature vectors (i.e., 15% of the frames) with zeros, and the objective is to identify the correct frame against negative distractors. The loss is defined as:

$L_{MFM}(\theta) = -E_{v_m \sim \mathbf{v}} \log \text{NCE}(v_m \mid \mathbf{v}_{\neg m}, \mathbf{t})$   (7)
$\text{NCE}(v_m \mid \mathbf{v}_{\neg m}, \mathbf{t}) = \frac{\exp(\mathbf{f}_m^{\top} \mathbf{h}_m)}{\exp(\mathbf{f}_m^{\top} \mathbf{h}_m) + \sum_{v_n \in \mathcal{N}(v_m)} \exp(\mathbf{f}_n^{\top} \mathbf{h}_m)}$   (8)
$\mathbf{h}_m = \text{Linear}(\mathbf{M}^{v}_{m})$   (9)

where $\mathbf{v}_{\neg m}$ means the surrounding frames except $v_m$, $\mathbf{h}_m$ is a linear output of $\mathbf{M}^{v}_{m}$, $\mathbf{f}_m$ is the real-valued vector of the video feature $\mathbf{F}_v$ for frame $v_m$, and $\mathbf{M}^{v}$ is the feature matrix of the video part in $\mathbf{M}_{attn}$. We take the other frames in the same batch as negative cases $\mathcal{N}(v_m)$.
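The following sketch shows one way to implement such an NCE-style masked-frame objective with in-batch negatives, consistent with the description above; the projection layer and the exact negative set used by the released model are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mfm_nce_loss(frame_feats, cross_out_video, mask, proj):
    """frame_feats:     [B, M, Dv] real-valued input features F_v
       cross_out_video: [B, M, D]  video part of the cross-encoder output
       mask:            [B, M]     True where the frame was zeroed out
       proj:            nn.Linear(D, Dv) mapping cross output to feature space"""
    pred = proj(cross_out_video)[mask]                       # predictions at masked spots [K, Dv]
    targets = frame_feats.reshape(-1, frame_feats.size(-1))  # every frame in the batch [B*M, Dv]
    logits = pred @ targets.t()                              # similarity to all frames; other
                                                             # frames act as in-batch negatives
    # index of each masked frame's own (positive) feature inside `targets`
    pos_index = mask.view(-1).nonzero(as_tuple=True)[0]
    return F.cross_entropy(logits, pos_index)

B, M, D, Dv = 2, 48, 768, 4096
loss = mfm_nce_loss(torch.randn(B, M, Dv), torch.randn(B, M, D),
                    torch.rand(B, M) < 0.15, nn.Linear(D, Dv))
```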

Video-Text Alignment

We use the fused representation that corresponds to the special token [CLS] to predict scores for the video-text alignment task. Specifically, a BertPooler layer and a linear layer project the first hidden state of $\mathbf{M}_{attn}$ to a score, similar to the BERT sentence-pair classification task. We also adopt the NCE loss to learn to discriminate the positive from negative video-text pairs. To enhance this capability, we not only randomly sample negative cases but also re-sample video clips from the same video Han et al. (2019), because frames inside the same video are more similar to each other than frames of different videos. The loss function is defined as follows,

$s(\mathbf{t}, \mathbf{v}) = g(\mathbf{M}_{attn}^{[CLS]})$   (10)
$L_{Align}(\theta) = -E_{(\mathbf{t},\mathbf{v})} \log \frac{\exp(s(\mathbf{t}, \mathbf{v}))}{\exp(s(\mathbf{t}, \mathbf{v})) + \sum_{\mathbf{v}' \in \mathcal{N}} \exp(s(\mathbf{t}, \mathbf{v}'))}$   (11)

where $g(\cdot)$ denotes the BertPooler layer and linear layer operations, and $\mathbf{M}_{attn}^{[CLS]}$ is the first hidden state of $\mathbf{M}_{attn}$. We take other video clips in the same batch as negative cases $\mathcal{N}$.
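Below is a small sketch of the alignment head, a BERT-style pooler over the [CLS] position followed by a linear scorer, trained with an NCE-style loss over in-batch negatives. The module layout, and the assumption that fused [CLS] states for all text-clip pairs are already available, are simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentHead(nn.Module):
    """BERT-style pooler (dense + tanh on the [CLS] state) followed by a linear scorer."""
    def __init__(self, d_model=768):
        super().__init__()
        self.pooler = nn.Sequential(nn.Linear(d_model, d_model), nn.Tanh())
        self.scorer = nn.Linear(d_model, 1)

    def forward(self, cls_states):                                 # [n_text, n_clip, d_model]
        return self.scorer(self.pooler(cls_states)).squeeze(-1)    # [n_text, n_clip] scores

# Fused [CLS] states for every text/clip pair in the batch; the diagonal entries are the
# matched pairs and the off-diagonal ones act as in-batch negatives (the paper additionally
# re-samples clips from the same video as harder negatives).
cls_states = torch.randn(4, 4, 768)
scores = AlignmentHead()(cls_states)
loss = F.cross_entropy(scores, torch.arange(scores.size(0)))
```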

Language Reconstruction

An auto-regressive decoder is also involved in our pre-training objective, and the loss function is,

$L_{Dec}(\theta) = -E_{t_i \sim \mathbf{t}} \log P_\theta(t_i \mid t_{<i}, \hat{\mathbf{t}}, \mathbf{v})$   (12)

Note that $\hat{\mathbf{t}}$ is the masked version of the ground-truth text during pre-training. As shown in BART Lewis et al. (2019), pre-training the decoder benefits generation tasks.

Loss Function

We jointly optimize our model with a weighted loss:

$L(\theta) = \alpha_1 L_{MLM} + \alpha_2 L_{MFM} + \alpha_3 L_{Align} + \alpha_4 L_{Dec}$   (13)

where the weights $\alpha_1$, $\alpha_2$, $\alpha_3$, and $\alpha_4$ are all set to 1 in this paper.
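In a training step the four objectives are simply summed with their weights; a trivial sketch with stand-in loss values:

```python
import torch

# Stand-in scalar losses; in real training these come from the four objectives above.
loss_mlm, loss_mfm, loss_align, loss_dec = (torch.rand((), requires_grad=True)
                                            for _ in range(4))
weights = dict(mlm=1.0, mfm=1.0, align=1.0, dec=1.0)   # all weights are set to 1 in the paper
total = (weights["mlm"] * loss_mlm + weights["mfm"] * loss_mfm
         + weights["align"] * loss_align + weights["dec"] * loss_dec)
total.backward()
```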

Figure 3: Two downstream tasks.

4 Downstream Tasks

Figure 3 presents the two downstream tasks: text-based video retrieval (left) and multimodal video captioning (right).

4.1 Text based Video Retrieval

Text-based video retrieval is defined as retrieving a relevant video or clip given an input text query. During inference, the model takes the input text query and each candidate video, calculates a similarity score, and then ranks the candidates to select the best matched video clip. The model encodes the query and the video through the text encoder and the video encoder respectively, feeds the embeddings to the cross encoder, and makes the final prediction from the fused representation corresponding to [CLS] via $g(\cdot)$ in Eq. (10). We use $L_{Align}$ as the loss during the fine-tuning stage.
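A sketch of this ranking procedure is shown below; `score_fn` stands for the fine-tuned model's fused-[CLS] alignment scorer and is a hypothetical interface.

```python
import torch

def retrieve(score_fn, query_tokens, candidate_clips, top_k=10):
    """Rank candidate clips for one text query by their alignment scores."""
    scores = torch.stack([score_fn(query_tokens, clip) for clip in candidate_clips])
    order = torch.argsort(scores, descending=True)
    return order[:top_k], scores[order[:top_k]]

# Toy run with a random scorer standing in for the fine-tuned model.
clips = [torch.randn(48, 4096) for _ in range(100)]
top_idx, top_scores = retrieve(lambda q, c: torch.randn(()),
                               torch.randint(0, 30000, (32,)), clips)
```

Recall@n then measures how often the ground-truth clip appears among the top n ranked candidates.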

4.2 Multimodal Video Captioning

Given a video, multimodal video captioning aims to generate a sequence of descriptive sentences. In this work, we focus on generating better captions and use the ground-truth segments in the experiments. Similarly, the model encodes the input video frames as well as the transcript inside the clip through the video encoder and the text encoder respectively, feeds the embeddings to the cross encoder to get a unified representation, and finally generates the token sequence with the decoder. We use $L_{Dec}$ as the loss during the fine-tuning stage.
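At inference time the decoder generates the caption autoregressively. A greedy-decoding sketch is shown below (the paper uses beam search with a beam size of 5); the `encode`/`decode_step` interface and the special-token ids are hypothetical.

```python
import torch

def greedy_caption(encode, decode_step, tokens, frames,
                   bos_id=101, eos_id=102, max_len=30):
    """encode(tokens, frames) -> cross-encoder memory;
       decode_step(memory, prefix_ids) -> next-token logits over the vocabulary."""
    memory = encode(tokens, frames)
    prefix = [bos_id]
    for _ in range(max_len):
        next_id = int(decode_step(memory, torch.tensor(prefix)).argmax())
        if next_id == eos_id:
            break
        prefix.append(next_id)
    return prefix[1:]   # generated caption token ids

# Toy run with random stand-ins for the fine-tuned encoder and decoder.
caption_ids = greedy_caption(lambda t, f: torch.randn(80, 768),
                             lambda m, p: torch.randn(30000),
                             torch.randint(0, 30000, (32,)),
                             torch.randn(48, 4096))
```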

5 Experiment

We first pre-train our model on the large-scale HowTo100M dataset Miech et al. (2019), then fine-tune the pre-trained model on two downstream multimodal tasks: text-based video retrieval and multimodal video captioning. Finally, we evaluate our model on both the in-domain Youcook2 Zhou et al. (2018a) dataset and the out-of-domain MSR-VTT Xu et al. (2016) dataset.

5.1 Dataset

HowTo100M Miech et al. (2019) (https://www.di.ens.fr/willow/research/howto100m/) is the pre-training dataset. We download the videos in the Food and Entertaining domain together with their ASR transcripts from the HowTo100M dataset. After filtering out the unavailable ones, we finally obtain 380K videos for pre-training our model. On average, the duration of each video is 6.5 minutes with 110 clip-text pairs.

Youcook2 Zhou et al. (2018a) (http://youcook2.eecs.umich.edu/) is the in-domain dataset for both downstream tasks. It contains 2,000 cooking videos on 89 recipes with 14K video clips. The overall duration is 176 hours (5.26 minutes on average), and each video clip is annotated with one caption sentence. We evaluate both the text-based video retrieval and the multimodal video captioning task on this dataset. For the first task, we follow the same experimental setting as Miech et al. (2019) and use the captions as input text queries to find the corresponding video clips. For the second task, we use the same setting as Shi et al. (2019). We filter the data to ensure there is no overlap between the pre-training and evaluation data. In all, we have 1,261 training videos and 439 test videos, that is, 9,776 training clip-text pairs and 3,369 test clip-text pairs.

MSR-VTT Xu et al. (2016)

is the out-of-domain dataset for the downstream task. It contains open-domain video clips, and each clip has 20 caption sentences labeled by humans. In all, there are 200K clip-text pairs from 10K videos in 20 categories including sports, music, etc. Following JSFusion Yu et al. (2018), we randomly sample 1,000 clip-text pairs as test data to evaluate our model on the text-based video retrieval task.

5.2 Experimental Details

Text encoding

For text encoding, we apply WordPiece embeddings Wu et al. (2016) with a 30,000-token vocabulary as input to the BERT model. We exploit the BERT-base model Devlin et al. (2019) with 12 layers of Transformer blocks. Each block has 12 attention heads and a hidden size of 768.

Video encoding

Similar to Miech et al. (2019), we extract both 2D and 3D features from the video clips. We use an off-the-shelf ResNet-152 He et al. (2016) pre-trained on the ImageNet dataset to extract 2D features. For 3D feature extraction, we employ a ResNeXt-101 Hara et al. (2018) pre-trained on Kinetics. The frame rates of the 2D and 3D feature extractors are 1 fps and 1.5 fps, respectively. We then directly concatenate the 2D and 3D features into one unified 4,096-dimensional vector. For video encoding, we employ a Transformer Vaswani et al. (2017) with 1 layer. Each block has 12 attention heads and a hidden size of 768.
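The 2D branch of this feature pipeline can be sketched with torchvision's ResNet-152; the ResNeXt-101 3D branch of Hara et al. (2018) is not bundled with torchvision, so it appears as a placeholder callable here, and the pooling and sampling details are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

# 2D branch: ResNet-152 pre-trained on ImageNet with the classifier removed,
# so it emits a 2048-d pooled feature per frame (frames sampled at ~1 fps in the paper).
weights = models.ResNet152_Weights.IMAGENET1K_V1
resnet2d = nn.Sequential(*list(models.resnet152(weights=weights).children())[:-1]).eval()

@torch.no_grad()
def clip_features(frames_2d, feat3d_fn):
    """frames_2d: [M, 3, 224, 224] sampled frames; feat3d_fn: 3D extractor -> [M, 2048]."""
    f2d = resnet2d(frames_2d).flatten(1)     # [M, 2048]
    f3d = feat3d_fn(frames_2d)               # [M, 2048], e.g. ResNeXt-101 trained on Kinetics
    return torch.cat([f2d, f3d], dim=-1)     # [M, 4096] unified video feature

feats = clip_features(torch.randn(8, 3, 224, 224),
                      lambda x: torch.randn(x.size(0), 2048))  # placeholder 3D branch
```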

Model setting

The model consumes clip-text pairs. The maximum number of input text tokens is 32 and the maximum number of video frames is 48. For short sentences and clips, we concatenate contextual tokens and frames. For the cross encoder and the decoder, we use a 2-layer Transformer encoder and a 1-layer Transformer decoder, each with 12 heads. For the generation task during the inference stage, we use beam search with a beam size of 5.

Training time

We pre-train our model on 4 NVIDIA Tesla V100 GPUs. The batch size is set to 96 and the model is trained for 12 epochs over 5 days. We use the Adam optimizer Kingma and Ba (2015) with an initial learning rate of 1e-4 and employ a linear learning rate decay schedule with a warm-up strategy. To speed up pre-training, we adopt a two-stage training fashion. In the first stage, we only keep the text BERT and the video Transformer and learn their weights with an alignment similarity objective, as in Miech et al. (2019). Then, we freeze the single-modal encoders with the learned weights and continue to pre-train the subsequent cross encoder and decoder.
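The second stage amounts to freezing the single-modal encoders learned in stage one and updating only the cross encoder and decoder; a sketch with a toy module (sub-module names mirror the earlier architecture sketch and are assumptions):

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Toy stand-in exposing the same sub-modules as the architecture sketch."""
    def __init__(self):
        super().__init__()
        self.text_encoder = nn.Linear(8, 8)
        self.video_encoder = nn.Linear(8, 8)
        self.cross_encoder = nn.Linear(8, 8)
        self.decoder = nn.Linear(8, 8)

def freeze(module):
    for p in module.parameters():
        p.requires_grad_(False)

model = TinyModel()
# Stage 2: freeze the single-modal encoders learned in stage 1 ...
freeze(model.text_encoder)
freeze(model.video_encoder)
# ... and continue pre-training only the cross encoder and decoder.
optimizer = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda step: min((step + 1) / 1000, max(0.0, 1 - step / 100000)))  # warm-up, then linear decay (illustrative)
```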

5.3 Task I: Text-based Video Retrieval

We fine-tune our pre-trained model for text-based video retrieval task on both Youcook2 and MSR-VTT datasets. The evaluation metrics are Recall@n (R@n) and Median R.

Method | PT data | FT data | R@1 | R@5 | R@10 | Median R
Random | 0 | 0 | 0.03 | 0.15 | 0.3 | 1675
HGLMM FV CCA Klein et al. (2015) | 0 | Youcook2 | 4.6 | 14.3 | 21.6 | 75
HowTo100M Miech et al. (2019) | 1.2M | 0 | 6.1 | 17.3 | 24.8 | 46
HowTo100M Miech et al. (2019) | 0 | Youcook2 | 4.2 | 13.7 | 21.5 | 65
HowTo100M Miech et al. (2019) | 1.2M | Youcook2 | 8.2 | 24.5 | 35.3 | 24
HowTo100M* | 380K | 0 | 6.50 | 19.73 | 27.77 | 35
HowTo100M* | 380K | Youcook2 | 7.45 | 22.60 | 33.34 | 25
Our model.1st | 380K | 0 | 5.52 | 17.74 | 27.41 | 42
Our model.2nd | 0 | Youcook2 | 3.35 | 10.79 | 17.76 | 76
Our model.3rd | 200K | Youcook2 | 7.53 | 22.00 | 32.77 | 28
Our model.4th | 380K | Youcook2 | 9.97 | 27.53 | 38.77 | 20
Table 1: Results of text-based video retrieval on the Youcook2 dataset. PT stands for pre-training and FT for fine-tuning. * means re-running the code of the HowTo100M model on our dataset.
Method | PT data | FT data | R@1 | R@5 | R@10 | Median R
Random | 0 | 0 | 0.1 | 0.5 | 1.0 | 500
C+LSTM+SA Klein et al. (2015) | 0 | MSR-VTT | 4.2 | 12.9 | 19.9 | 55
VSE Klein et al. (2015) | 0 | MSR-VTT | 3.8 | 12.7 | 17.1 | 66
SNUVL Klein et al. (2015) | 0 | MSR-VTT | 3.5 | 15.9 | 23.8 | 44
Kaufman Klein et al. (2015) | 0 | MSR-VTT | 4.7 | 16.6 | 24.1 | 41
CT-SAN Klein et al. (2015) | 0 | MSR-VTT | 4.4 | 16.6 | 22.3 | 35
JSFusion Klein et al. (2015) | 0 | MSR-VTT | 10.2 | 31.2 | 43.2 | 13
HowTo100M Miech et al. (2019) | 1.2M | 0 | 7.5 | 21.2 | 29.6 | 38
HowTo100M Miech et al. (2019) | 0 | MSR-VTT | 12.1 | 35.0 | 48.0 | 12
HowTo100M Miech et al. (2019) | 1.2M | MSR-VTT | 14.9 | 40.2 | 52.8 | 9
HowTo100M* | 380K | 0 | 5.40 | 13.40 | 19.70 | 66
HowTo100M* | 380K | MSR-VTT | 13.80 | 32.30 | 43.00 | 16
Our model.1st | 380K | 0 | 2.90 | 8.30 | 12.40 | 173
Our model.2nd | 0 | MSR-VTT | 14.60 | 39.00 | 52.60 | 10
Our model.3rd | 380K | MSR-VTT | 15.40 | 39.50 | 52.30 | 9
Table 2: Results of text-based video retrieval on the MSR-VTT dataset. PT stands for pre-training and FT for fine-tuning. * means re-running the code of the HowTo100M model on our dataset.

Youcook2

provides ground-truth video clip and caption pairs. We use the caption to retrieve the relevant video clip. Miech et al. (2019) report baseline methods including Random and HGLMM FV CCA Klein et al. (2015) together with their own model's results, which we directly use as our baselines. Table 1 lists the results of all baselines and our models. Our model improves the performance over all baseline methods and achieves state-of-the-art results. Since our 380K pre-training videos are all food-domain related, we investigate whether this domain-specific data biases the model performance. We therefore re-run the HowTo100M model on our 380K dataset and fine-tune it on Youcook2; its performance drops considerably, which demonstrates that the data alone does not bias the model. Comparing our model pre-trained on various data sizes, the performance increases as the amount of data grows.

MSR-VTT

Besides the food-domain videos, we also evaluate text-based video retrieval on the open-domain MSR-VTT dataset. We present several baseline methods with and without pre-training. On this out-of-domain dataset, our pre-trained model (our model.2nd vs. 3rd) shows generalization capability to other domains, although the gain is not as significant as on in-domain data. We also notice that without fine-tuning, our pre-trained model performs worse than the HowTo100M model, which shows that fine-tuning is a very important stage for our model. Our full model (3rd) achieves state-of-the-art results on the R@1 and Median R metrics. The best results on R@5 and R@10 are achieved by the HowTo100M model pre-trained on the 1.2M dataset, which contains more open-domain videos that could benefit the results on MSR-VTT. This motivates us to further examine the HowTo100M model pre-trained on our 380K dataset. The experimental results demonstrate that our model.3rd outperforms the HowTo100M model pre-trained on the same 380K dataset on all metrics.

According to our extensive experiments on text-based video retrieval, we find that: 1) our model can largely increase the performance of the video-language understanding task; 2) with more pre-training data, our model performs consistently better; 3) our model outperforms the baselines on both in-domain and out-of-domain data and achieves state-of-the-art results, with a more remarkable performance boost on in-domain data.

5.4 Task II: Multimodal Video Captioning

Methods | Input | Pre-training Data | B-3 | B-4 | M | R-L | CIDEr
Bi-LSTM Zhou et al. (2018a) | Video | 0 | - | 0.87 | 8.15 | - | -
EMT Zhou et al. (2018b) | Video | 0 | - | 4.38 | 11.55 | 27.44 | 0.38
VideoBERT Sun et al. (2019b) | Video | 312K | 6.80 | 4.04 | 11.01 | 27.50 | 0.49
VideoBERT (+S3D) Sun et al. (2019b) | Video | 312K | 7.59 | 4.33 | 11.94 | 28.80 | 0.55
CBT Sun et al. (2019a) | Video | 1.2M | - | 5.12 | 12.97 | 30.44 | 0.64
DPC Shi et al. (2019) | Video + Transcript | 0 | 7.60 | 2.76 | 18.08 | - | -
AT+Video Hessel et al. (2019) | Video + Transcript | 0 | - | 9.01 | 17.77 | 36.65 | 1.12
Our model.1st | Video | 380K | 10.16 | 6.06 | 12.47 | 31.48 | 0.6430
Our model.2nd | Video + Transcript | 0 | 13.57 | 8.67 | 15.38 | 35.18 | 1.0015
Our model.3rd | Video + Transcript | 200K | 14.97 | 9.92 | 16.24 | 37.07 | 1.1554
Our model.4th (no decoder) | Video + Transcript | 380K | 14.43 | 9.78 | 15.81 | 36.84 | 1.1043
Our model.5th | Video + Transcript | 380K | 15.52 | 10.42 | 16.93 | 38.02 | 1.1998
Table 3: Multimodal video captioning results on the Youcook2 dataset.

We adopt corpus-level generation evaluation metrics computed with an open-source tool (https://github.com/Maluuba/nlg-eval), including BLEU Papineni et al. (2002), METEOR Banerjee and Lavie (2005), ROUGE-L Lin and Och (2004), and CIDEr Vedantam et al. (2015).

First we compare our pre-trained model with several baseline methods. We classify the methods according to two settings: 1) with or without pre-training; 2) whether the input is video-only or video+transcript.

Zhou et al. (2018a) propose an end-to-end model for both procedure segmentation and captioning. Sun et al. (2019b, a) adopt the pre-training strategy and evaluate captioning with only video as input. Shi et al. (2019) and Hessel et al. (2019) discuss multimodal input with both video and transcript. Table 3 presents the results of the baseline models and the performance of our model in various settings. Among the video-only captioning models, our model (our model.1st) obtains results comparable to CBT. Furthermore, comparing our model with various pre-training data sizes (our model.2nd, 3rd, 5th), the performance improves as the pre-training data size increases. Moreover, comparing our models with and without a pre-trained decoder (our model.4th vs. 5th), pre-training the decoder improves the performance on the generation task, and our full model (our model.5th) pre-trained on the largest dataset achieves the best results.

According to our extensive experiments on multimodal video captioning, our key findings are: 1) our pre-trained model improves the performance of the generation task with the help of the pre-trained decoder; 2) our model outperforms the baseline models on the multimodal video captioning task and achieves state-of-the-art results.

6 Conclusion and Discussion

In this paper, we study self-supervised learning for video and language representation on large-scale videos and pre-train a multimodal model using videos and their corresponding ASR transcripts. We propose a unified pre-training model for both understanding and generation tasks. We then conduct extensive experiments to evaluate our model on two downstream tasks: text-based video retrieval and multimodal video captioning. From the experiments, we find that 1) our pre-trained model improves the performance over the baseline models to a large extent and achieves state-of-the-art results on two typical multimodal tasks; 2) the pre-trained decoder benefits generation tasks such as captioning. In future work, we will investigate the performance of our model on larger datasets and more downstream tasks.

References

  • S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72. Cited by: §5.4.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §1, §2, §5.2.
  • L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon (2019) Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197. Cited by: §2.
  • T. Han, W. Xie, and A. Zisserman (2019) Video representation learning by dense predictive coding. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §2, §3.2.
  • K. Hara, H. Kataoka, and Y. Satoh (2018) Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6546–6555. Cited by: §5.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §5.2.
  • J. Hessel, B. Pang, Z. Zhu, and R. Soricut (2019) A case study on combining asr and visual features for generating instructional video captions. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), Cited by: §2, §5.4, Table 3.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. International Conference on Learning Representations. Cited by: §5.2.
  • B. Klein, G. Lev, G. Sadeh, and L. Wolf (2015) Associating neural word embeddings with deep image representations using fisher vectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4437–4446. Cited by: §5.3, Table 1, Table 2.
  • R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles (2017) Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, pp. 706–715. Cited by: §2.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2019) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. Cited by: §2, §3.2.
  • G. Li, N. Duan, Y. Fang, D. Jiang, and M. Zhou (2019a) Unicoder-vl: a universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066. Cited by: §1, §2.
  • L. H. Li, M. Yatskar, D. Yin, C. Hsieh, and K. Chang (2019b) Visualbert: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557. Cited by: §1, §2.
  • C. Lin and F. J. Och (2004) Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, pp. 605. Cited by: §5.4.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §2.
  • W. Lotter, G. Kreiman, and D. Cox (2016) Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104. Cited by: §2.
  • J. Lu, D. Batra, D. Parikh, and S. Lee (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pp. 13–23. Cited by: §1, §2.
  • M. Mathieu, C. Couprie, and Y. LeCun (2015) Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440. Cited by: §2.
  • A. Miech, D. Zhukov, J. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic (2019) HowTo100M: learning a text-video embedding by watching hundred million narrated video clips. ICCV. Cited by: §2, §5.1, §5.1, §5.2, §5.2, §5.3, Table 1, Table 2, §5.
  • A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §2.
  • S. Palaskar, J. Libovickỳ, S. Gella, and F. Metze (2019) Multimodal abstractive summarization for how2 videos. arXiv preprint arXiv:1906.07901. Cited by: §2.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §5.4.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. URL https://s3-us-west-2. amazonaws. com/openai-assets/researchcovers/languageunsupervised/language understanding paper. pdf. Cited by: §2.
  • B. Shi, L. Ji, Y. Liang, N. Duan, P. Chen, Z. Niu, and M. Zhou (2019) Dense procedure captioning in narrated instructional videos. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pp. 6382–6391. Cited by: §2, §5.1, §5.4, Table 3.
  • K. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2019) Mass: masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450. Cited by: §2.
  • N. Srivastava, E. Mansimov, and R. Salakhudinov (2015) Unsupervised learning of video representations using lstms. In International conference on machine learning, pp. 843–852. Cited by: §2.
  • C. Sun, F. Baradel, K. Murphy, and C. Schmid (2019a) Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743. Cited by: §1, §2, §2, §2, §3.2, §5.4, Table 3.
  • C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid (2019b) Videobert: a joint model for video and language representation learning. Proceedings of the IEEE international conference on computer vision. Cited by: §1, §2, §2, §2, §5.4, Table 3.
  • H. Tan and M. Bansal (2019) Lxmert: learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490. Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §1, §5.2.
  • R. Vedantam, C. Lawrence Zitnick, and D. Parikh (2015) Cider: consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4566–4575. Cited by: §5.4.
  • X. Wang and A. Gupta (2015) Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802. Cited by: §2.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §3.1, §5.2.
  • J. Xu, T. Mei, T. Yao, and Y. Rui (2016) Msr-vtt: a large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5288–5296. Cited by: §5.1, §5.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: §2.
  • Y. Yu, J. Kim, and G. Kim (2018) A joint sequence fusion model for video question answering and retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 471–487. Cited by: §2, §5.1.
  • L. Zhou, H. Palangi, L. Zhang, H. Hu, J. J. Corso, and J. Gao (2019) Unified vision-language pre-training for image captioning and vqa. arXiv preprint arXiv:1909.11059. Cited by: §2.
  • L. Zhou, C. Xu, and J. J. Corso (2018a) Towards automatic learning of procedures from web instructional videos. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2, §5.1, §5.4, Table 3, §5.
  • L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong (2018b) End-to-end dense video captioning with masked transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8739–8748. Cited by: §2, Table 3.

7 Supplementary Material

Figure 4 presents two randomly selected case studies comparing our results with the ground-truth captions, from which we observe that most of the generated results are semantically aligned with the ground-truth sentences.

Figure 4: Case studies for multimodal video dense captioning