Video understanding has typically focused on action recognition and object tracking, as the temporal aspect of videos lends itself strongly to the task of representing motion, a key component of an action. Breaking video analysis down into simple tasks, such as action recognition, allows for efficient data annotation when building large datasets to train deep learning models [kay2017kinetics, monfortmoments, goyal2017something], an approach that has been extremely successful for images with object annotations [krizhevsky2012imagenet]. A main difficulty is that, in contrast to an image, a video often captures an interaction between agents and objects that evolves over time. These interactions can be as simple as “a person picking up a glass of water”, but even in this case three different objects (“person”, “glass” and “water”) are included in the interaction. Additionally, the video may also continue to depict the “person drinking from a glass” and the “person putting the glass back down on the table”. These sequential events present additional challenges for video datasets, where single annotations may not be sufficient to explain the events depicted. Multi-label approaches to video annotation have attempted to address this problem by labeling multiple actions in a video [monfort2019multimoments, Gu_2018_CVPR, zhang2018multi]. However, these methods focus on single-domain annotations, such as actions or objects, and do not capture additional contextual information, such as “person angrily putting down the dirty glass on a rusted table”, which can change the interpretation of an event and how it fits into a sequence of observations.
One solution for capturing the content of a video more fully is to annotate multiple actions or objects in each video [Gu_2018_CVPR, yeung2015every, monfort2019multimoments, real2017youtubeboundingboxes]. However, labels like “drinking” and “glass” only provide a portion of the information needed to interpret the event. Additional narratives may include intuitive descriptions and intentions, such as “an exhausted man picks up a dirty glass of water and drinks from it before angrily putting it down on a table”, which would dramatically change the event interpretation. The full lingual description combines these actions with adjectives and nouns (objects) that contextualize the events depicted, leading to a better understanding of the video. This is our goal in providing a new large-scale dataset for training models for full video understanding.
We introduce a large-scale video caption dataset, Spoken Moments in Time (S-MiT), to allow large deep learning models for video understanding to learn contextual information. Most existing video description datasets [xu2016msr-vtt, Sigurdsson2016HollywoodIH, krishna2017dense, gella-etal-2018-dataset, youcook2] are limited in size when compared to the large datasets for action recognition [kay2017kinetics, monfortmoments, goyal2017something]. A likely cause is the increased cost of collecting full text descriptions for videos compared to single-label annotations. Recent work in image captioning [david2020ijcv] addressed this problem by collecting audio descriptions for a large set of images from the Places dataset [zhouKLTO16]. Collecting spoken captions is faster and more efficient due to the low overhead of speaking compared to typing. In addition, recording spontaneous speech rather than typed text can produce more natural descriptions of an event. An automatic speech recognition (ASR) system was then used to transcribe the spoken descriptions into text captions. In that work, audio, text and video models were jointly trained via contrastive learning to learn joint cross-modal representations. We build on this approach and compare models that learn directly from the spoken captions to models that include a trained ASR model which feeds generated text transcriptions into an NLP language model. We then jointly train caption and visual models (based on concatenated video and image features) using a novel Adaptive Mean Margin (AMM) approach to contrastive learning to align the visual and caption representations. We evaluate our models on multiple datasets for video/caption retrieval and show that a model trained using AMM on S-MiT achieves the best general performance across four datasets.
Altogether, our novel contributions include:
The large-scale Spoken Moments in Time dataset (S-MiT) which includes 500k pairs of video clips and corresponding audio descriptions. This new dataset represents the largest video description dataset available and will serve as a new benchmark for the community.
Benchmark models with aligned spoken caption and video representations learned via contrastive learning. We compare approaches that learn directly from the spoken descriptions as well as approaches that include ASR transcriptions that feed into different language models to generate caption representations.
An Adaptive Mean Margin (AMM) approach to cross-entropy based contrastive learning.
2 Related Work
2.1 Video Understanding
The field of video understanding has recently seen fast progress, partly due to the availability of large-scale video datasets including ActivityNet [caba2015activitynet], Kinetics [kay2017kinetics], Moments in Time [monfortmoments, monfort2019multimoments] and YouTube-8M [youtube8m]. These large datasets are used to pretrain models that are fine-tuned on smaller action recognition datasets such as UCF101 [soomro2012ucf101] and HMDB [10.1007/978-3-642-33374-3_41]. With the increased availability of large-scale video datasets, many different models have been proposed to improve performance on a number of video understanding tasks. Two-stream convolutional neural networks (CNNs) combine optical flow with RGB frames to capture both temporal and spatial information [simonyan2014two]. I3D models [carreira2017quo] combine 3D CNNs [tran2015learning], which use a 3D kernel to learn temporal information from a frame sequence, with optical flow to form a two-stream 3D network “inflated” from 2D filters pre-trained on ImageNet [deng2009imagenet]. More recently, a temporal shift module has been proposed to integrate temporal information into 2D models by shifting frame representations across the temporal dimension [Lin_2019_ICCV].
Recently multi-modal visual understanding methods have received significant attention [david2020ijcv, Suris_2019_CVPR, merkx2019interspeech, vasudevan2018wacv, ilharco2019largescale, groundedWordsVideo]. The DAVEnet model [david2020ijcv] has been proposed for jointly learning aligned representations between images and spoken captions, and has been extended to align frame-wise video representations with synchronized audio narration for cross-modal audio-visual concept learning [groundedWordsVideo]. Here, we build on the motivation from this paper and learn aligned representations between videos and unsynchronized spoken descriptions using the S-MiT Dataset.
2.2 Caption Datasets
There have been a number of different datasets released for providing language descriptions of visual information. Flickr8k [Hodosh2013FramingID] and Flickr30k [Plummer_2015_ICCV] include 8k and 30k images respectively, each sourced from Flickr. Each image is associated with 5 text captions describing what is in the image. An additional set of 5 audio captions per image in both sets was recently collected for learning joint embeddings between speech and images [david2020ijcv]. The Visual Genome dataset [krishnavisualgenome] includes captions for multiple regions of more than 180k images, allowing for fine-grained descriptions of each image. The Places Audio Caption dataset [david2016nips] contains approximately 400k images from the Places 205 [NIPS2014_5349] image dataset with audio captions of people verbally describing each image. MS COCO [capeval2015] is a large image dataset for object recognition, segmentation, and captioning which includes roughly 1 million captions for 160k Flickr images. Conceptual Captions [sharma2018conceptual] contains 3.3M images with captions generated from HTML attributes associated with web-based images. The Stock3M dataset [Wang_2017_CVPR] includes 3.2 million images, each with a crowdsourced caption.
Beyond the numerous datasets available for image captioning [Hodosh2013FramingID, Plummer_2015_ICCV, krishnavisualgenome, capeval2015, sharma2018conceptual, Wang_2017_CVPR], including those that provide spoken descriptions [david2016nips, david2020ijcv], there are a variety of video caption datasets available. A number of these datasets are related to cooking [tacos:regnerietal:tacl, cispa1826, 10.1007/s11263-015-0851-8, Damen2018EPICKITCHENS, Damen2020RESCALING], including YouCook [DaXuDoCVPR2013] and YouCook II [youcook2], which include 2k videos from YouTube, each with multiple captions annotated at different segments of each video. The MPII Movie Description Corpus contains transcribed audio descriptions from 94 Hollywood movies split into 68k clips, where each clip is paired with a sentence from the movie scripts and an audio description of the visual content in each clip. Similarly, the Large Scale Movie Description Challenge (LSMDC) dataset [lsmdc] contains 200 movies with 120K sentence descriptions. VideoStory [gella-etal-2018-dataset] contains 20k social media videos, where each video contains a paragraph-length description. The ActivityNet Captions dataset [krishna2017dense] has 20k videos with 100k text descriptions. The Microsoft Video Description (MSVD) dataset [chen-dolan-2011-collecting] contains 2k YouTube clips with a 10-25 second duration and an average of 41 single-sentence descriptions per clip. MSR-Video to Text (MSR-VTT) [xu2016msr-vtt] contains 7k videos split into 10k clips with 20 captions per video.
HowTo100M [miech19howto100m] contains 136 million clips sourced from 1.22 million instructional videos, with narrations generated from the subtitles associated with each video. However, the subtitles are not human-verified captions and the content is constrained to instructional videos. Since the text associated with the clips in HowTo100M is a transcription of a narrator completing a task in the video, the short text phrases from the subtitles occasionally share only noisy associations with the reference clip. In Section 5 and Table 2, we therefore compare our contributions using strict caption datasets, as we are proposing a large-scale human-annotated caption dataset with full human-generated descriptions for each video.
VaTeX [Wang_2019_ICCV] contains 41k videos sourced from the Kinetics-600 dataset [kay2017kinetics, carreira2018short] annotated with 10 English captions and 10 Chinese captions for multilingual captioning. VaTeX is the most similar to our proposed dataset in that it is sourced from an existing video dataset for action recognition and the captions are directly annotated.
In this work, we present a new dataset, Spoken Moments in Time (S-MiT), which includes spoken audio captions for 500k unique three-second clips, each from a different source video in the Moments in Time dataset [monfortmoments, monfort2019multimoments]. In addition to the vast increase in scale over other video-caption datasets, a major contribution is that we use spoken descriptions rather than text. This allows us to train spoken caption models that align directly with video models, which is not possible with the other large video caption datasets, and allows spoken caption models to be analyzed with matching video information. We also show that models trained on our S-MiT dataset generalize much better in retrieval to the video-caption pairs in other datasets, due to the large coverage, diversity and scale of our proposed dataset.
2.3 Cross Modal Contrastive Learning
Cross-modal learning has been used to jointly self-supervise audio-visual models [NIPS2016_6146, owens2016, Zhao_2018_ECCV] with synchronized information, while NLP approaches have been leveraged to align joint representations for both visual and language modalities using spoken and text descriptions [alayrac:hal-01171193, ZhLoCoBMVC18]. This is typically done via contrastive learning, where the alignment between positive pairs (language and visual input) is trained to be stronger than that of non-positive pairs. For visual representations, a triplet-based max-margin loss is commonly used to discriminate representations between positive and negative pairs [zhang2016colorful, 8099559, 7410524]. Semi-hard negative mining [Schroff_2015_CVPR] and a dot-product-based similarity score have been used to jointly learn audio-visual embeddings between images and spoken captions [david2020ijcv], while batch-wise cross-entropy approaches to contrastive learning have been used to increase the amount of information utilized in learning by considering all negative examples in a mini-batch [DBLP:journals/corr/abs-1807-03748, chen2020simple]. Work on bidirectional speech/image retrieval using audio descriptions of images integrated ideas from max-margin contrastive learning and added a margin into the cross-entropy loss [ilharco2019largescale]. SimCLR [chen2020simple] added a non-linear projection head that maps paired representations into a common space, allowing for stronger representations.
A pretrained language model has recently been used to improve cross-modality learning with language and visual input pairs. VilBERT [NIPS2019_8297] added a pretrained BERT [DBLP:conf/naacl/DevlinCLT19] transformer to capture semantic language representations associated with object detection proposals from a pretrained faster RCNN network. VideoBERT [videoBert2019iccv] extended BERT to jointly learn the visual and linguistic domain by generating tokenized visual words. Inspired by this prior work, we propose adding a pretrained language model that maps word predictions from a trained ASR model to semantic language features in order to generate rich spoken caption representations. We then utilize an MLP to project these caption representations, and our video representations, to an aligned joint representation which can be used for video/caption retrieval (see Section 5).
2.3.1 Optimization Approaches
A common approach to optimization in contrastive learning settings is to use a similarity-based loss function. We formulate the contrastive loss as $\mathcal{L} = \mathcal{L}_{c \rightarrow v} + \mathcal{L}_{v \rightarrow c}$, where the goal is to maximize the discrimination between positive and negative paired captions $c$ and videos $v$. The loss is split into two tasks, where $\mathcal{L}_{v \rightarrow c}$ forms pairs from a fixed video and each caption in a sampled mini-batch, while $\mathcal{L}_{c \rightarrow v}$ fixes the caption and forms pairs with each video in the mini-batch. Below we discuss different approaches to $\mathcal{L}_{x \rightarrow y}$, where $x$ and $y$ are interchangeable with $c$ and $v$.
Semi-hard negative mining (SHN) [Schroff_2015_CVPR] has been used for learning aligned cross-modal embeddings using a triplet loss [david2020ijcv, 8461684]. This is an improvement over hard negative mining [DBLP:journals/corr/FaghriFKF17] since a sampled negative example is constrained to be less similar to the anchor than the positive sample while still being within the margin, and thus contributes a loss at each step with the margin $\delta$,
$$\mathcal{L}_{x \rightarrow y} = \max\big(0,\; \delta - S(x_i, y_i) + S(x_i, y_j)\big), \quad \text{with } S(x_i, y_j) < S(x_i, y_i),$$
where $S(x, y)$ is a similarity score for the representations of $x$ and $y$, with $x_i$ and $y_i$ forming a positive pair and $y_j$ a sampled semi-hard negative.
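The semi-hard sampling rule can be sketched directly on a batch similarity matrix. The NumPy helper below is an illustrative sketch (the function name and batch-matrix interface are our own, not the authors' implementation): for each anchor it keeps only negatives that are less similar than the positive yet still within the margin, and takes the hardest of them.

```python
import numpy as np

def shn_triplet_loss(S, delta=0.1):
    """Semi-hard negative triplet loss over a batch similarity matrix S.

    S[i, j] is the similarity between anchor i and candidate j; the
    diagonal holds the positive pairs. For each anchor we take the
    hardest semi-hard negative: less similar than the positive but
    still within the margin, so it contributes a non-zero loss.
    """
    n = S.shape[0]
    losses = []
    for i in range(n):
        pos = S[i, i]
        # semi-hard candidates: pos - delta < S[i, j] < pos, with j != i
        cand = [S[i, j] for j in range(n)
                if j != i and pos - delta < S[i, j] < pos]
        if cand:
            losses.append(delta - pos + max(cand))
    return float(np.mean(losses)) if losses else 0.0
```

Note that, unlike hard negative mining, anchors whose negatives all fall outside the semi-hard band contribute no loss at that step.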
Noise contrastive estimation (NCE) [pmlr-v9-gutmann10a] has been applied to contrastive learning [chen2020simple, DBLP:journals/corr/abs-1807-03748] by using a log-likelihood-based loss function that learns to discriminate between positive and negative pairs of feature embeddings,
$$\mathcal{L}_{x \rightarrow y} = -\log \frac{\exp\big(S(x_i, y_i)\big)}{\exp\big(S(x_i, y_i)\big) + \sum_{j} \mathbb{1}_{[j \neq i]} \exp\big(S(x_i, y_j)\big)}, \qquad (1)$$
where $\mathbb{1}_{[j \neq i]}$ is an indicator function ensuring that only negative pairs are considered in the denominator. This has been shown to improve feature alignment compared to SHN [chen2020simple].
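As a concrete sketch, the NCE objective above can be computed row-by-row from a batch similarity matrix; the helper below is illustrative (no temperature scaling is assumed, and the row sum already contains the positive term plus all negatives).

```python
import numpy as np

def nce_loss(S):
    """Batch-wise NCE / cross-entropy contrastive loss.

    For each anchor i the positive similarity S[i, i] is contrasted
    against all in-batch negatives S[i, j], j != i, via a softmax.
    """
    n = S.shape[0]
    exp_S = np.exp(S)
    loss = 0.0
    for i in range(n):
        denom = exp_S[i].sum()  # positive + all negatives in the row
        loss += -np.log(exp_S[i, i] / denom)
    return loss / n
```

Applying the same function to `S.T` gives the other retrieval direction, so the full loss is `nce_loss(S) + nce_loss(S.T)`.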
Masked Margin Softmax Loss (MMS) [ilharco2019largescale] and Large Margin Cosine Loss (LMCL) [Wang_2018_CVPR] incorporate a positive margin into the contrastive learning framework in order to improve feature discrimination among non-paired embeddings. MMS uses a monotonically increasing margin to allow initial learning to begin to converge before a large alteration to the loss is added. LMCL proposes a theoretical limit on the maximum margin size of $1 - \cos\frac{2\pi}{K}$, where $K$ refers to the number of classes being discriminated. For aligning captions to visual information, the class size can be considered unbounded, as each caption represents a slightly different representation that we want to discriminate, leading to a maximum margin size that approaches $0$ as $K$ grows. Concretely, MMS proposes adding a margin $\delta$ to Equation 1,
$$\mathcal{L}_{x \rightarrow y} = -\log \frac{\exp\big(S(x_i, y_i) - \delta\big)}{\exp\big(S(x_i, y_i) - \delta\big) + \sum_{j} \mathbb{1}_{[j \neq i]} \exp\big(S(x_i, y_j)\big)}, \qquad (2)$$
where the margin, $\delta$, starts at 0.001 and is exponentially increased by a factor of 1.002 every 1,000 training steps.
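A minimal sketch of the MMS schedule and the margined loss, using the stated schedule (initial margin 0.001, growth factor 1.002 every 1,000 steps); the function names are our own and the implementation is illustrative.

```python
import numpy as np

def mms_margin(step, init=0.001, factor=1.002, every=1000):
    """Monotonic MMS margin schedule: starts at `init` and is
    multiplied by `factor` every `every` training steps."""
    return init * factor ** (step // every)

def mms_loss(S, delta):
    """NCE loss with the positive similarity reduced by a margin,
    computed over a batch similarity matrix S."""
    n = S.shape[0]
    loss = 0.0
    for i in range(n):
        pos = np.exp(S[i, i] - delta)
        neg = sum(np.exp(S[i, j]) for j in range(n) if j != i)
        loss += -np.log(pos / (pos + neg))
    return loss / n
```

With `delta = 0` this reduces exactly to the NCE loss, which makes the effect of the growing margin easy to isolate.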
We propose extending the idea of an increasing margin in MMS to an adaptive setting that does not require setting the initial value of the margin or the growth rate. We refer to this approach as an Adaptive Mean Margin (AMM) where the margin is set as the mean distance between the positive pair and the set of negative pairs in a batch. We describe AMM in more detail in Section 4.3.
3 The Spoken Moments Dataset
We begin with the Moments in Time dataset [monfortmoments] as it includes over 1 million videos sourced from a number of different video hosting sites with strong inter- and intra-class variation in terms of the events depicted in each video. Further, the videos are all cut to 3 seconds, allowing a concise description to effectively capture the localized information of each event. Here, concise descriptions refers to descriptions that focus on the key events depicted in the video; it does not imply partial descriptions. In data collection, annotators may watch a video as many times as desired. During recording, we block the annotators from seeing/hearing the video to encourage descriptions of important memorable events rather than every specific detail. This approach does not preclude the annotators from describing sequential or simultaneous events, as shown in our qualitative examples (see Figure 1). We describe our annotation approach in more detail in the supplementary material.
3.1 Dataset Statistics
Our proposed Spoken Moments dataset contains 500k videos randomly chosen from the Multi-Moments in Time (M-MiT) training set and all of the 10k videos from the validation set. Each video in the training set contains at least one audio description. We transcribed each audio recording using the public Google Automatic Speech Recognition (ASR) engine to generate text captions for each video. When analyzing these transcriptions, we build a picture of the coverage and diversity of our captions. Table 2 (left) shows that our captions have an average length of 18 words with a unique vocabulary of 50,570 words consisting of 20,645 nouns, 12,523 adjectives and 7,436 verbs with a total word count of 5.6 million. Table 2 (right) shows a comparison of our Spoken Moments dataset to other existing datasets for video captioning. Our dataset will be the largest public dataset in terms of video clips, source videos, total number of captions, total words in the captions and the vocabulary set of unique words occurring in the captions. The increase in vocabulary size is important as it shows that our increase in the number of videos over previous datasets does not simply include repeated events but covers a novel breadth of information. We can see the opposite effect of this in YouCook II [youcook2] where the restricted domain of cooking videos results in a limited vocabulary used in the descriptions.
To understand how this vocabulary covers the class labels typically used for training computer vision models, we examined whether these labels exist in our vocabulary. Table 2 (right) shows that we have strong coverage of the two largest action recognition datasets for video understanding (Kinetics [kay2017kinetics] and M-MiT [monfort2019multimoments]). We expected a large coverage of the events in M-MiT, as we sourced our videos from this dataset and the action labels themselves are fairly general (e.g. “running” and “cooking”). For Kinetics, the labels commonly tie a noun to a preceding verb (e.g. “brushing hair”). We consider such a label to exist in our dataset if both the verb and the noun occur in the same caption. For example, “A boy is in a bathroom brushing his teeth” would cover the class “brushing teeth”. With this approach we see an 85.1% coverage of the classes in Kinetics and a 96.2% coverage of the classes in M-MiT, showing a strong level of event diversity. Similarly, we see a strong overlap of the object classes of MS-COCO [lin2014microsoft] (100%) and ImageNet [deng2009imagenet] (69.2%) with our captions. ImageNet coverage is likely lower due to the specific labels used for many of its classes (e.g. “coucal”). Still, 69.2% coverage means 692 ImageNet classes appear in our captions. Similarly, Places [NIPS2014_5349] scene labels are very specific and do not necessarily match the language used in our descriptions. For example, an “abbey” will typically be described as a “church” or “monastery” in our captions. We did not account for all possible synonyms and only consider direct matches in our captions. Even so, we find a 47.4% coverage of the scene labels of Places365 in our dataset.
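The verb-plus-noun coverage test described above amounts to a simple same-caption word check. The hypothetical helper below illustrates it; like the direct-match analysis in the text, it ignores inflection and synonyms.

```python
def covers_label(label, captions):
    """Return True if every word of a (possibly multi-word) class label,
    e.g. "brushing teeth", appears in at least one single caption.
    Single-word labels reduce to an exact word match. Illustrative
    helper; the actual matching pipeline is not specified in the text.
    """
    for caption in captions:
        words = set(caption.lower().split())
        if all(w in words for w in label.lower().split()):
            return True
    return False
```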
Here we provide information on some additional characteristics of our data that may be of interest. While we do not release demographic info of our annotators or captions, about 57% of the spoken captions were recorded by male voices and 43% female. For the audio streams of the videos, roughly 51% include natural sound, 5% have music as the audio and 44% have no audio. This is consistent with the M-MiT dataset [monfort2019multimoments] from which we source our videos. Additionally, we found that less than 3% of the videos contain captions that describe non-visible events (e.g. a car horn when no car is visible in the video frames). For this reason we have chosen to focus our approach on learning a strong visual model in Section 4.
4 Learning Audio-Visual Representations
In order to learn from the large set of spoken captions in the proposed S-MiT dataset, we adopt a cross-modal architecture used in prior work [miech19howto100m, david2020ijcv, AVLnet], which is composed of a video model and a caption model as depicted in Figure 3. Specifically, we take video-caption pairs as input and encode each modality into a $d$-dimensional feature vector. We do this by adding a multilayer perceptron (MLP) as a projection head on top of both the video and the caption model. This projection head is composed of two linear layers followed by gated linear units (GLU) [GLU]. We then compute the dot product between the video and caption representations to produce an $n \times n$ similarity matrix, $S$, which is used to compute our contrastive loss for training. In Section 4.3, we describe our modified approach to margined contrastive learning, which uses an Adaptive Mean Margin (AMM) that automatically adjusts itself during training to improve the optimization signal.
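A minimal NumPy sketch of the projection head and the similarity matrix; the GLU gating (sigmoid of one half applied to the other), the omission of biases, and the layer sizes are all illustrative assumptions rather than the exact architecture.

```python
import numpy as np

def glu(x):
    # Gated linear unit: split the features in half and gate one half
    # with the sigmoid of the other, halving the feature dimension.
    a, b = np.split(x, 2, axis=-1)
    return a / (1.0 + np.exp(-b))

def projection_head(x, W1, W2):
    # Two linear layers, each followed by a GLU, mapping an encoder
    # output into the shared embedding space (biases omitted).
    return glu(glu(x @ W1) @ W2)

def similarity_matrix(V, C):
    # S[i, j] = dot product of video i and caption j embeddings;
    # the diagonal holds the positive pairs.
    return V @ C.T
```

Each modality gets its own projection head; only the resulting embeddings interact, through the dot-product matrix fed to the contrastive loss.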
4.1 Video Model
Following prior work [miech19howto100m], we use two encoders to represent input videos: an image encoder and a video encoder. Specifically, we use a ResNet-152 [resNet] pretrained on ImageNet [krizhevsky2012imagenet] and a temporal shift module (TSM) ResNet-50 model [Lin_2019_ICCV] pretrained on M-MiT [monfort2019multimoments]. Each encoder outputs a $d$-dimensional feature vector after max-pooling over the temporal dimension (8 frames for the TSM (3 fps) and 3 frames for the image model (1 fps)). We concatenate the two $d$-dimensional vectors and feed the concatenated vector into an MLP projection head to get the final $d$-dimensional visual representation. We examine the effect of using the image and video encoders as well as different pretrained models in the supplementary material.
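The temporal max-pooling and concatenation step can be sketched as follows; the feature sizes in the example are illustrative placeholders, while the frame counts follow the text.

```python
import numpy as np

def encode_video(image_feats, video_feats):
    """Max-pool each encoder's frame-level features over time, then
    concatenate the two pooled vectors before the MLP projection head.
    image_feats: (3, d_img) frames from the image encoder at 1 fps;
    video_feats: (8, d_vid) frames from the TSM encoder at 3 fps.
    """
    img = image_feats.max(axis=0)   # (d_img,) pooled image features
    vid = video_feats.max(axis=0)   # (d_vid,) pooled video features
    return np.concatenate([img, vid])
```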
4.2 Caption Model
4.2.1 Language Caption Model
Prior work in learning joint representations between audio captions and visual models has shown that utilizing ASR transcriptions greatly improves results [david2020ijcv]. We build on this idea and use the predicted words from a pretrained ASR model (e.g. Google’s public ASR engine) to train our models. Concretely, we examine the effect of using different pretrained language models stacked on top of the ASR model predictions. We begin by comparing the results of using Fasttext [bojanowski2016enriching], BART [lewis2019bart] and BERT [DBLP:conf/naacl/DevlinCLT19] models to generate semantic and contextual word representations for our captions. During training, we randomly select 10 words from each caption to be included in training. In the case of the BART and BERT models, this selection happens after the full transformer model has been applied in order to avoid altering the results of the self-attention mechanisms. If fewer than 10 words occur in a caption, we allow words to be sampled multiple times in the random selection. This training augmentation allows a different subset of each caption's words to be represented at different training iterations. We examine the effect of this approach in the supplementary material. At test time, we use the full transcription as input to the language model. We average the word representations from the output of the language model to generate a single representation for each caption, which we align to the video representations described in the previous section.
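The 10-word sampling augmentation might be sketched as below; `sample_words` is a hypothetical helper operating on a caption's tokens (or, equivalently, on the per-word representations after the transformer has been applied).

```python
import random

def sample_words(tokens, k=10, rng=random):
    """Training-time augmentation: randomly keep k words per caption,
    sampling with replacement when the caption is shorter than k.
    A sketch of the sampling described in the text.
    """
    if len(tokens) >= k:
        return rng.sample(tokens, k)              # without replacement
    return [rng.choice(tokens) for _ in range(k)]  # with replacement
```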
4.2.2 Spoken Caption Model
We also train caption models on the raw spoken captions instead of the corresponding transcriptions. For each caption, we randomly sample 10 seconds of speech for training and compute the 40-dimensional log Mel spectrogram to serve as the input of the spoken caption model. The input is fed into a spoken caption model, where we consider ResDavenet [david2020ijcv] (which is designed specifically for speech) and two ImageNet ResNet [resNet] models (ResNet-34, ResNet-50). For the ResNet models, we modify the first convolutional layer to take the single-channel input so that the spectrogram can be processed. In addition, the wav2vec [wav2vec] model, which takes the raw waveform as input, is also included in our experiments. Spoken captions are first fed into the pre-trained wav2vec model, which produces a feature vector per 210 ms of audio. We then feed these into a learnable ResStack, taken from ResDavenet, to learn representations of spoken captions.
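The fixed-length sampling of speech can be sketched as a crop-or-pad step on the log-Mel spectrogram; the 10 ms frame hop (so 1,000 frames for 10 s) and the zero-padding of short captions are our assumptions for illustration, not stated in the text.

```python
import numpy as np

def crop_spectrogram(logmel, frames=1000, rng=None):
    """Randomly crop (or zero-pad) a (40, T) log-Mel spectrogram to a
    fixed-length window for training. 1000 frames corresponds to 10 s
    assuming a 10 ms frame hop (an illustrative assumption)."""
    rng = rng or np.random.default_rng()
    n_mels, T = logmel.shape
    if T >= frames:
        start = int(rng.integers(0, T - frames + 1))
        return logmel[:, start:start + frames]
    out = np.zeros((n_mels, frames), dtype=logmel.dtype)
    out[:, :T] = logmel
    return out
```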
[Table: caption-to-video and video-to-caption retrieval results (and their mean) for each loss, broken down by the dataset trained on and the dataset evaluated on.]
4.3 Adaptive Mean Margin
We train our model using the contrastive loss in a similar setting to MMS (Equation 2). The only difference is that we replace the margin, $\delta$, with an adaptive margin based on the difference between the similarity of the positive pair and the set of negative pairs in each batch.
The challenge in using the MMS margin for mini-batch sampled contrastive learning is that the initial margin and growth schedule are difficult to tune for a specific dataset and similarity metric. Additionally, depending on the sampled pairs in a mini-batch, the margin calculated may be too weak if the positive pair is much more similar than the sampled negative pairs and too strong if it is very similar to the negative pairs. The approach to monotonically increase the margin during training is meant to address this as the positive and negative pairs will share similar alignment early in training and begin to diverge closer to convergence. However, variable rates of convergence of different models on different datasets make this growth rate difficult to tune and this approach does not account for differences in the negative samples that appear in different mini-batches. To address this, we propose an adaptive margin based on relative batch-wise similarity scores.
Class labels have been used to generate adaptive margins based on the class similarity between positive and negative pairs [adaptivemargin2020cvpr, Liu_2019_CVPR]. Likewise, prior work explored a non-class-dependent approach to an adaptive similarity-based margin for human pose estimation [Li_2015_ICCV], where the mean joint error between a positive pose and a hard sampled negative pose was used as a margin with the triplet loss. This adaptively increases the margin when the sampled negative pair is dissimilar to the positive pair in order to maximize the learning signal on less aligned negative samples. We follow a similar intuition and simply replace $\delta$ in Equation 2 with
$$\delta_i = \alpha \Big( S(x_i, y_i) - \frac{1}{N-1} \sum_{j \neq i} S(x_i, y_j) \Big), \qquad (3)$$
where $\alpha$ is a dampening parameter that weights the strength of the margin and $N$ is the mini-batch size. When $\delta_i$ from Equation 3 is applied to Equation 2 with $\alpha = 1$, the margin removes the positive pair similarity from the optimization. Ablation studies on different $\alpha$ values can be found in the supplementary material; in practice, we use a dampened value ($\alpha < 1$) in our experiments.
This has the effect of increasing the margin as the difference between the true pair similarity and the similarity of the negative pairs increases. As the training progresses, and the learning approaches convergence, the margin generally increases with the increased separation between positive and negative pair-wise similarities. This also removes the need to tune the margin and growth rate which may have different optimal values for different similarity metrics, batch sizes and datasets.
We refer to this as an Adaptive Mean Margin (AMM) for contrastive learning and show in Section 5 the effect of applying this adaptive margin.
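Substituting the adaptive mean margin into the margined contrastive loss gives the following sketch; it is an illustrative NumPy version, with `alpha = 0.5` as an arbitrary dampened default rather than the tuned setting from our experiments.

```python
import numpy as np

def amm_loss(S, alpha=0.5):
    """Contrastive loss with an Adaptive Mean Margin. For each anchor i
    the margin is alpha times the gap between the positive similarity
    S[i, i] and the mean similarity of its in-batch negatives, so the
    margin grows as positives and negatives separate during training.
    """
    n = S.shape[0]
    total = 0.0
    for i in range(n):
        negs = [S[i, j] for j in range(n) if j != i]
        delta = alpha * (S[i, i] - np.mean(negs))
        pos = np.exp(S[i, i] - delta)
        total += -np.log(pos / (pos + sum(np.exp(s) for s in negs)))
    return total / n
```

With `alpha = 1.0` the margin exactly cancels the positive pair's advantage over the mean negative, matching the limiting case described above; no initial margin or growth rate needs to be tuned.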
5.1 Video/Caption Retrieval
In Tables 2–4 we show R@k recall scores (for $k \in \{1, 5, 10\}$) and mean average precision (mAP) on both caption-to-video and video-to-caption retrieval. Results are averaged over five random sets of 1k video-caption pairs from the test set. Each language-based model in these tables uses the output of a pretrained ASR model, the Google Cloud ASR engine, as input to a trained language model to generate a feature representation for each caption. Alternatively, the spoken caption models align visual representations directly with the audio signal, without pretrained modules.
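For reference, R@k can be computed directly from a similarity matrix as below; this is a sketch of the metric only (mAP and the averaging over five sampled sets are omitted).

```python
import numpy as np

def recall_at_k(S, k):
    """Caption-to-video R@k: S[i, j] is the similarity between caption i
    and video j, with the matching video at column i. Returns the
    fraction of captions whose true video ranks in the top k."""
    n = S.shape[0]
    hits = sum(int(i in np.argsort(-S[i])[:k]) for i in range(n))
    return hits / n
```

Video-to-caption recall is the same computation on the transposed matrix.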
Table 2 shows the result of using different language models to generate our caption representations from ASR text transcriptions. Each of these models was trained using the proposed AMM loss function described in Section 4.3. We evaluate the AMM loss by comparing results for the NCE, SHN, MMS and AMM loss functions described in Sections 2.3.1 and 4.3 on four different datasets: the proposed Spoken Moments in Time dataset (S-MiT) as well as Vatex-en [Wang_2019_ICCV], MSR-VTT [xu2016msr-vtt] and ActivityNet Captions [krishna2017dense] (for which we used the ground-truth timestamps to get the corresponding video clips). The proposed AMM loss function consistently achieves the best results across each dataset, and the BART language model provides the strongest representations for the retrieval task (Table 2).
Table 2 shows a comparison of our AMM approach to other methods for cross-modal contrastive learning. We use the BART language model [lewis2019bart] to generate representations of the words transcribed from the audio captions via a pretrained ASR model. Replacing the monotonically increasing margin used in MMS [ilharco2019largescale] with an adaptive margin that scales with the samples in a batch achieves the strongest results. We observed that, as training continues and the margin in MMS grows, training performance begins to degrade. This is likely due to the margin becoming too large for stable training, as described in prior work [Wang_2018_CVPR].
In Table 4, we show a comparison of different spoken caption models with different loss functions. The proposed AMM approach consistently outperforms the other loss functions.
5.2 Cross Dataset Evaluation
To further examine the strength of our proposed Spoken Moments in Time (S-MiT) dataset, we compare the generalization performance of models trained on four different datasets (S-MiT as well as Vatex-en [Wang_2019_ICCV], MSR-VTT [xu2016msr-vtt] and ActivityNet Captions [krishna2017dense]) for video/caption retrieval (see Table 2 (right) for comparisons of these datasets). We train each model on a single dataset using the approach described in Section 4.3 and evaluate on the test set of each of the other datasets. For example, a model trained on Vatex is evaluated on the test sets of ActivityNet Captions, MSR-VTT and S-MiT, in addition to its own. We sample five sets of 1k video-caption pairs from each test set, which allows us to fairly compare results across test sets of different sizes (see supplementary material for full test set results). Each model in this evaluation was trained using the BART [lewis2019bart] language model and the proposed AMM loss function, which were found to give the best results (see Table 2). We evaluate the models using the mean of the video-to-caption and caption-to-video retrieval results. We are not able to compare the spoken caption models from Table 4 here, as the other datasets only include text captions.
In Table 4, we can see that the S-MiT model generalizes better than the other models in spite of the additional noise introduced by the ASR model. Additionally, the restriction to 3-second videos in S-MiT does not hinder its ability to generalize to the much longer videos of the other datasets.
6 Conclusion
In this paper, we have introduced the Spoken Moments in Time dataset, which includes 500k pairs of video clips and corresponding spoken descriptions. This new dataset represents the largest video caption dataset available and will serve as a new benchmark for the community. We compared various benchmark models for learning joint representations between captions and videos, and evaluated our approaches on multiple datasets to highlight the strength of the models as well as the ability of models trained on our proposed dataset to generalize to tasks in other datasets. With these results, we are confident that the presented Spoken Moments dataset will have a positive impact on the fields of video understanding and cross-modal learning.
Acknowledgments
This work was supported by the MIT-IBM Watson AI Lab as well as the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DOI/IBC) contract number D17PC00341.
The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/IBC, or the U.S. Government.
A Data Collection
We follow the approach used to collect the Places Audio Caption dataset [david2016nips, david2020ijcv] and collect audio descriptions of each video in the dataset using Amazon Mechanical Turk (AMT). In order to ensure that we have a large and diverse dataset, we collect an audio description using AMT for each video in a set of 500k randomly selected videos from the training set and at least two unique descriptions for each video in the 10k videos used for both the validation and test sets. Each AMT worker is presented with a task of recording themselves describing 10 different videos. Each video is shown on the left of the screen while a video with an example text description is shown on the right. This example helps to show the workers the types of descriptions we are looking for and the amount of detail we expect from them. The example stays on the right side of the screen throughout the task while the target videos on the left cycle as the worker completes each description. Figure 4 shows an example of this interface with an example video and caption on the right and a target video on the left. Below each target description is a button that allows the worker to start recording their voice as they describe the video. Once they press this button, the video is removed from the screen and the recording is started. We block the worker from seeing the video while recording the description to ensure that the recordings are concise and pertain only to the important events highlighted in their memory. We use the Google Cloud ASR engine to verify the quality of each recorded description and flag AMT workers for poor performance. This is done by checking that the generated text has at least five words, is unique (some bots repeat pre-recorded audio to trick the system) and that the audio is at least three seconds long. If any of these checks fail, we do not let the worker continue to the next video until they record a new description that passes our checks.
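The automatic checks applied to each recording amount to a simple gate over the ASR transcript and audio length; a sketch (the function name and the exact uniqueness test are our own):

```python
def recording_passes_checks(transcript, audio_seconds, previous_transcripts):
    """Automatic quality gates applied to each spoken description:
    at least five transcribed words, a transcript not seen before
    (some bots replay pre-recorded audio), and at least three
    seconds of audio."""
    if len(transcript.split()) < 5:
        return False  # too short to describe the video
    if transcript in previous_transcripts:
        return False  # likely replayed or bot audio
    if audio_seconds < 3.0:
        return False  # recording too brief
    return True
```

A worker whose recording fails any gate is asked to re-record before moving on to the next video.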
Once the descriptions are recorded, we periodically sample videos to check the quality of the audio paired with the ASR to ensure they match the videos and have an appropriate level of detail. If these checks fail, we flag the workers that recorded the descriptions, don’t allow them to record in the future and recheck all of their recorded data. This process allows us to ensure a strong level of quality in our collected spoken captions. Examples of some of the videos and corresponding text transcriptions of the descriptions we collected can be seen in Figure 1.
B Implementation Details
We train each model on a server with eight 24GB Titan RTX GPUs using a mini-batch size of 2048 for 100 epochs. We examine the effect of the mini-batch size on learning in the next section. After each epoch, we take the best parameters as evaluated on the validation set of the training dataset. We repeat this process for two phases of training: first we freeze the visual backbone models and train only the projection heads (including the full caption model for the spoken models), and then, in a second round, we allow the full visual model to train as well. We keep the language and ASR components frozen for the language caption models and reserve fine-tuning these components for future work. For model training, we use the Adam [Kingma2015AdamAM] optimizer with a fixed learning rate set separately for the first and second rounds of training.
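The two-phase schedule amounts to switching which parameter groups receive gradients in each round; a schematic sketch (the group names are ours, not identifiers from the released code):

```python
def trainable_groups(phase, uses_spoken_captions):
    """Which parameter groups train in each phase. Phase 1 freezes the
    visual backbone and trains only the projection heads (and, for the
    spoken models, the full caption model); phase 2 unfreezes the
    visual backbone. Language and ASR components stay frozen in both."""
    groups = {"projection_heads"}
    if uses_spoken_captions:
        groups.add("caption_model")
    if phase == 2:
        groups.add("visual_backbone")
    return groups
```

In a framework like PyTorch this corresponds to setting `requires_grad` per group before building the optimizer for each round.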
C Ablation Studies
| Visual Base Model | Caption to Video | Video to Caption | Mean |
| --- | --- | --- | --- |
| ResNet-152 ImageNet (2D) | 24.2±2.4 / 53.6±1.8 / 66.5±2.1 / 37.9±2.0 | 32.9±2.1 / 61.7±1.6 / 71.6±1.0 / 45.9±1.8 | 28.5±2.2 / 57.7±1.7 / 69.1±1.5 / 41.9±1.9 |
| TSM Kinetics + 2D | 27.6±1.4 / 57.5±2.4 / 70.4±1.9 / 41.3±1.7 | 37.2±2.3 / 65.0±1.7 / 75.2±1.5 / 50.0±1.7 | 32.4±1.8 / 61.3±2.0 / 72.8±1.6 / 45.7±1.7 |
| TSM M-MiT + 2D | 29.8±2.5 / 60.6±2.4 / 72.2±1.9 / 44.0±2.2 | 39.4±2.1 / 68.0±2.0 / 77.5±1.8 / 52.3±2.0 | 34.6±2.1 / 64.3±2.2 / 74.9±1.8 / 48.2±2.0 |
The tables in this section present several ablation studies. Unless otherwise listed in a table, each experiment uses the proposed AMM loss function with the BART [lewis2019bart] language model as part of the language caption model described in Section 4.2.1. Results are averaged over five rounds, each using a single random batch of 1k caption-video pairs from the test set. Due to the increased computational demand of these studies, we freeze the base models and train only the projection heads for alignment. We use the best model settings found in this analysis to train the full models, with results reported in Section 5.
The first ablation examines the effect of two different pretrained temporal shift (TSM) [Lin_2019_ICCV] video models, pretrained on either Multi-Moments in Time (M-MiT) [monfort2019multimoments] or Kinetics [kay2017kinetics], across four different datasets in order to choose the most appropriate base model. Here we use the BART language model and the proposed AMM loss function as described in Section 4, as this combination gave us the best results on each dataset.
The visual base model ablation compares the video model (TSM) trained for action recognition with the 2D model trained for object recognition. Most captions reference both objects and actions, with an average of 4.37 nouns per caption compared to 1.58 verbs. The strength of the 2D object model makes sense given this prevalence of nouns in the captions. The combination of the TSM model trained on M-MiT [monfort2019multimoments] and the 2D model trained on ImageNet [krizhevsky2012imagenet] provides the best performance when used with the model described in Section 4.
We also compare the effect of the batch size and the projection size on the performance of the S-MiT model described in Section 4 in order to validate our choice of a 2048 batch size and a 4096-dimensional projection. Similarly, we examine the effect of the caption sampling approach for the transcription model described in Section 4.2.1, and explore different dampening parameters.
D Cross Dataset Generalization
Here we expand on Table 4 and compare the generalization performance of models trained on four different datasets (S-MiT as well as Vatex-en [Wang_2019_ICCV], MSR-VTT [xu2016msr-vtt] and ActivityNet Captions [krishna2017dense]) for video/caption retrieval on their full test sets. In Table 4 we ran the comparison on five samples of 1k video-caption pairs in order to evaluate consistently across test sets of different sizes; here we evaluate on the full test set of each dataset to provide a baseline for each. The strength of the model trained on S-MiT is even more evident here, as it achieves higher results on the test sets of both ActivityNet and MSR-VTT than the models trained on those datasets, and it comes very close to the performance of the Vatex model on the Vatex test set. This shows that the scale and diversity of the S-MiT dataset are highly beneficial for training robust models.
E Qualitative Results
In Tables 12 and 13, we show the top five retrieval results for some examples from the Spoken Moments dataset. For this analysis, we use the language caption model described in Section 4.2.1 with the BART [lewis2019bart] language model and the proposed AMM loss function. Table 12 shows the top five retrieved captions given a query video, while Table 13 shows the top five retrieved videos given a query caption. Blue boxes indicate the ground-truth results.
Our model retrieves results by recognizing key objects or environments in the videos. For example, in Table 12 (c), lettuce is distinguished from the other vegetables. In Table 13 (f), the model not only recognizes the planets in space but also understands that they are crashing into each other. In some of the examples, the top retrieval result is not the ground truth; however, the top predictions are typically still a strong match for the queries, as in (e) and (i) in Table 12 and (a) and (b) in Table 13.
For this demonstration, we use words transcribed from the audio captions by a pretrained ASR model, and noise in these transcriptions may contribute to some errors. In the future, we plan to investigate jointly training the pretrained ASR and language models with the video model to improve performance.
F Captions in the Spoken Moments Dataset
Table 14 shows some captions in the Spoken Moments dataset that capture motion and sequential events which would be difficult to represent with a single image.
| Example | Transcribed caption |
| --- | --- |
| (a) | a boy and a red white and blue shirt is sitting on a couch he is holding an infant life vest and picks it up to blow through the two |
| (b) | there’s a gauge or a lock thing turns from rides and then being turned to the left |
| (c) | a picture of a man drinking coffee and play with a cell phone in fast motion |
| (d) | in slow motion we see a collie jump into the air and catch a white frisbee in flight |
| (e) | these are track and field runners and it’s a relay race and they take off when they are handed the batons |
| (f) | there is water dripping off the edge of something all you can hear is the water dripping |