Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions

When people observe events, they are able to abstract key information and build concise summaries of what is happening. These summaries include contextual and semantic information describing the important high-level details (what, where, who and how) of the observed event and exclude background information that is deemed unimportant to the observer. With this in mind, the descriptions people generate for videos of different dynamic events can greatly improve our understanding of the key information of interest in each video. These descriptions can be captured in captions that provide expanded attributes for video labeling (e.g. actions/objects/scenes/sentiment/etc.) while allowing us to gain new insight into what people find important or necessary to summarize specific events. Existing caption datasets for video understanding are either small in scale or restricted to a specific domain. To address this, we present the Spoken Moments (S-MiT) dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events. We collect our descriptions using audio recordings to ensure that they remain as natural and concise as possible while allowing us to scale the size of a large classification dataset. In order to utilize our proposed dataset, we present a novel Adaptive Mean Margin (AMM) approach to contrastive learning and evaluate our models on video/caption retrieval on multiple datasets. We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.





1 Introduction

Video understanding has typically been focused on action recognition and object tracking, as the temporal aspect of videos lends itself strongly to the task of representing motion, a key component of an action. Breaking video analysis down into simple tasks, such as action recognition, allows for efficient data annotation when building large datasets to train deep learning models [kay2017kinetics, monfortmoments, goyal2017something], an approach that has been extremely successful for images with object annotations [krizhevsky2012imagenet]. A main difficulty is that, in contrast to an image, a video often captures an interaction between agents and objects that evolves over time. These interactions can be as simple as “a person picking up a glass of water”, but even in this case three different objects (“person”, “glass” and “water”) are included in the interaction. Additionally, the video may also continue to depict the “person drinking from a glass” and the “person putting the glass back down on the table”. These sequential events present additional challenges for video datasets, where single annotations may not be sufficient to explain the events depicted. Multi-label approaches to video annotation have attempted to address this problem by labeling multiple actions in a video [monfort2019multimoments, Gu_2018_CVPR, zhang2018multi]. However, these methods focus on single-domain annotations, such as actions or objects, and do not capture additional contextual information, such as “person angrily putting down the dirty glass on a rusted table”, which can change the interpretation of an event and how it fits into a sequence of observations.

A solution for capturing the content of video more fully is to annotate multiple actions or objects in each video [Gu_2018_CVPR, yeung2015every, monfort2019multimoments, real2017youtubeboundingboxes]. However, labels like “drinking” and “glass” only provide a portion of the information needed to interpret the event. Additional narratives may include intuitive descriptions and intentions, such as “an exhausted man picks up a dirty glass of water and drinks from it before angrily putting it down on a table”, which would dramatically change the event interpretation. The full language description combines these actions with adjectives and nouns (objects) that contextualize the events depicted, leading to a better understanding of the video. This is our goal in providing a new large-scale dataset for training models for full video understanding.

We introduce a large-scale video caption dataset, Spoken Moments in Time (S-MiT), to allow large deep learning models for video understanding to learn contextual information. Most existing video description datasets [xu2016msr-vtt, Sigurdsson2016HollywoodIH, krishna2017dense, gella-etal-2018-dataset, youcook2] are limited in size when compared to the large datasets for action recognition [kay2017kinetics, monfortmoments, goyal2017something]. A likely cause is the increased cost of collecting full text descriptions for videos compared to single-label annotations. Recent work in image captioning [david2020ijcv] addressed this problem by collecting audio descriptions for a large set of images from the Places dataset [zhouKLTO16]. Collecting spoken captions is faster and more efficient due to the low overhead of speaking compared to typing. In addition, recording spontaneous speech rather than typed text can produce more natural descriptions of an event. An automatic speech recognition (ASR) system was then used to transcribe the spoken descriptions to text captions. In that work, audio, text and video models were jointly trained via contrastive learning to learn joint cross-modal representations. We build on this approach and compare models that learn directly from the spoken captions to models that include a trained ASR model which feeds generated text transcriptions into an NLP language model. We then jointly train caption and visual models (based on concatenated video and image features) using a novel Adaptive Mean Margin (AMM) approach to contrastive learning to align the visual and caption representations. We evaluate our models on multiple datasets for video/caption retrieval and show that a model trained using AMM on S-MiT achieves the best general performance across four datasets.

Figure 1: Examples from the Spoken Moments Dataset: The dataset is composed of videos and the corresponding spoken captions. We show some examples of the text transcriptions, automatically generated using the public Google ASR engine.

Altogether, our novel contributions include:

  1. The large-scale Spoken Moments in Time dataset (S-MiT) which includes 500k pairs of video clips and corresponding audio descriptions. This new dataset represents the largest video description dataset available and will serve as a new benchmark for the community.

  2. Benchmark models with aligned spoken caption and video representations learned via contrastive learning. We compare approaches that learn directly from the spoken descriptions as well as approaches that include ASR transcriptions that feed into different language models to generate caption representations.

  3. An Adaptive Mean Margin (AMM) approach to cross-entropy based contrastive learning.

2 Related work

2.1 Video Understanding

The field of video understanding has recently seen fast progress, partly due to the availability of large-scale video datasets including ActivityNet [caba2015activitynet], Kinetics [kay2017kinetics], Moments in Time [monfortmoments, monfort2019multimoments] and YouTube-8M [youtube8m]. These large datasets are used to pretrain models that are fine-tuned on smaller action recognition datasets such as UCF101 [soomro2012ucf101] and HMDB [10.1007/978-3-642-33374-3_41]. With the increased availability of large-scale video datasets, many different models have been proposed to improve performance on a number of video understanding tasks. Two-stream convolutional neural networks (CNNs) combine optical flow with RGB frames to capture both temporal and spatial information [simonyan2014two]. I3D models [carreira2017quo] combine 3D CNNs [tran2015learning], which use a 3D kernel to learn temporal information from a frame sequence, with optical flow to form a two-stream 3D network “inflated” from 2D filters pre-trained on ImageNet [deng2009imagenet]. More recently, a temporal shift module has been proposed to integrate temporal information into 2D models by shifting frame representations across the temporal dimension [Lin_2019_ICCV].

Recently, multi-modal visual understanding methods have received significant attention [david2020ijcv, Suris_2019_CVPR, merkx2019interspeech, vasudevan2018wacv, ilharco2019largescale, groundedWordsVideo]. The DAVEnet model [david2020ijcv] was proposed for jointly learning aligned representations between images and spoken captions, and has been extended to align frame-wise video representations with synchronized audio narration for cross-modal audio-visual concept learning [groundedWordsVideo]. Here, we build on the motivation from that work and learn aligned representations between videos and unsynchronized spoken descriptions using the S-MiT dataset.

2.2 Caption Datasets

There have been a number of datasets released that provide language descriptions of visual information. Flickr8k [Hodosh2013FramingID] and Flickr30k [Plummer_2015_ICCV] include 8k and 30k images, respectively, each sourced from Flickr. Each image is associated with 5 text captions describing what is in the image. An additional set of 5 audio captions per image in both sets was recently collected for learning joint embeddings between speech and images [david2020ijcv]. The Visual Genome dataset [krishnavisualgenome] includes captions for multiple regions of more than 180k images, allowing for fine-grained descriptions of each image. The Places Audio Caption dataset [david2016nips] contains approximately 400k images from the Places 205 [NIPS2014_5349] image dataset with audio captions of people verbally describing each image. MS COCO [capeval2015] is a large image dataset for object recognition, segmentation and captioning which includes roughly 1 million captions for 160k Flickr images. Conceptual Captions [sharma2018conceptual] contains 3.3M images with captions generated from HTML attributes associated with web-based images. The Stock3M dataset [Wang_2017_CVPR] includes 3.2 million images, each with a crowdsourced caption.

Beyond the numerous datasets available for image captioning [Hodosh2013FramingID, Plummer_2015_ICCV, krishnavisualgenome, capeval2015, sharma2018conceptual, Wang_2017_CVPR], including those that provide spoken descriptions [david2016nips, david2020ijcv], there are a variety of video caption datasets available. A number of these datasets are related to cooking [tacos:regnerietal:tacl, cispa1826, 10.1007/s11263-015-0851-8, Damen2018EPICKITCHENS, Damen2020RESCALING], including YouCook [DaXuDoCVPR2013] and YouCook II [youcook2], which include 2k videos from YouTube, each with multiple captions annotated at different segments of the video. The MPII Movie Description Corpus [7298940] contains transcribed audio descriptions from 94 Hollywood movies split into 68k clips, where each clip is paired with a sentence from the movie scripts and an audio description of the visual content. Similarly, the Large Scale Movie Description Challenge (LSMDC) dataset [lsmdc] contains 200 movies with 120K sentence descriptions. VideoStory [gella-etal-2018-dataset] contains 20k social media videos, where each video has a paragraph-length description. The ActivityNet Captions dataset [krishna2017dense] has 20k videos with 100k text descriptions. The Microsoft Video Description (MSVD) dataset [chen-dolan-2011-collecting] contains 2k YouTube clips of 10-25 second duration with an average of 41 single-sentence descriptions per clip. MSR-Video to Text (MSR-VTT) [xu2016msr-vtt] contains 7k videos split into 10k clips with 20 captions per video.

HowTo100M [miech19howto100m] contains 136 million clips sourced from 1.22 million instructional videos, with narrations generated from the subtitles associated with each video. However, the subtitles are not human-verified captions and the content is constrained to instructional videos. Since the text associated with each clip in the HowTo100M dataset is a transcription of a narrator completing a task in the video, the short text phrases from the subtitles occasionally share only noisy associations with the reference clip. In Section 5 and Table 2, we therefore compare our contributions using strictly caption datasets, as we are proposing a large-scale human-annotated caption dataset with full human-generated descriptions for each video.

VaTeX [Wang_2019_ICCV] contains 41k videos sourced from the Kinetics-600 dataset [kay2017kinetics, carreira2018short] annotated with 10 English captions and 10 Chinese captions for multilingual captioning. VaTeX is the most similar to our proposed dataset in that it is sourced from an existing video dataset for action recognition and the captions are directly annotated.

In this work, we present a new dataset, Spoken Moments in Time (S-MiT), which includes spoken audio captions for 500k unique three-second clips, each with a different source video from the Moments in Time dataset [monfortmoments, monfort2019multimoments]. In addition to the vast increase in scale over other video-caption datasets, a major contribution is that we use spoken descriptions rather than text. This allows us to train spoken caption models to directly align with video models, which is not possible with the other large video caption datasets, and allows spoken caption models to be analyzed with matching video information. We also show that models trained on our S-MiT dataset generalize much better in retrieval to the video-caption pairs in other datasets, due to the large coverage, diversity and scale of our proposed dataset.

2.3 Cross Modal Contrastive Learning

Cross-modal learning has been used to jointly self-supervise audio-visual models [NIPS2016_6146, owens2016, Zhao_2018_ECCV] with synchronized information, while NLP approaches have been leveraged to align joint representations for visual and language modalities using spoken and text descriptions [alayrac:hal-01171193, ZhLoCoBMVC18]. This is typically done via contrastive learning, where the alignment between positive pairs (language and visual input) is trained to be stronger than that of non-positive pairs [1640964]. For visual representations, a triplet-based max-margin loss is commonly used to discriminate representations between positive and negative pairs [zhang2016colorful, 8099559, 7410524]. Semi-hard negative mining [Schroff_2015_CVPR] and a dot-product based similarity score have been used to jointly learn audio-visual embeddings between images and spoken captions [david2020ijcv], while batch-wise cross-entropy approaches to contrastive learning have been used to increase the amount of information utilized in learning by considering all negative examples in a mini-batch [DBLP:journals/corr/abs-1807-03748, chen2020simple]. Work on bidirectional speech/image retrieval using audio descriptions of images integrated ideas from max-margin contrastive learning and added a margin to the cross-entropy loss [ilharco2019largescale]. SimCLR [chen2020simple] added a non-linear projection head that maps paired representations into a common space, allowing for stronger representations.

Pretrained language models have recently been used to improve cross-modal learning with language and visual input pairs. ViLBERT [NIPS2019_8297] added a pretrained BERT [DBLP:conf/naacl/DevlinCLT19] transformer to capture semantic language representations associated with object detection proposals from a pretrained Faster R-CNN network. VideoBERT [videoBert2019iccv] extended BERT to jointly model the visual and linguistic domains by generating tokenized visual words. Inspired by this prior work, we propose adding a pretrained language model that maps word predictions from a trained ASR model to semantic language features in order to generate rich spoken caption representations. We then utilize an MLP to project these caption representations, and our video representations, into an aligned joint representation which can be used for video/caption retrieval (see Section 5).

2.3.1 Optimization Approaches

A common approach to optimization in contrastive learning settings is to use a similarity-based loss function. We formulate the contrastive loss as

$$\mathcal{L} = \mathcal{L}_{v \to c} + \mathcal{L}_{c \to v},$$

where the goal is to maximize the discrimination between positive and negative paired captions $c$ and videos $v$. The loss is split into two tasks, where $\mathcal{L}_{v \to c}$ forms pairs from a fixed video and each caption in a sampled mini-batch, while $\mathcal{L}_{c \to v}$ fixes the caption and forms pairs with each video in the mini-batch. Below we discuss different forms of $\mathcal{L}_{x \to y}$, where $x$ and $y$ are interchangeable with $v$ and $c$.

Semi-hard negative mining (SHN) [Schroff_2015_CVPR] has been used for learning aligned cross-modal embeddings using a triplet loss [david2020ijcv, 8461684]. This is an improvement over hard negative mining [DBLP:journals/corr/FaghriFKF17] since a sampled negative example is constrained to be less similar to the anchor than the positive sample while still being within the margin, and thus contributes a loss at each step. With margin $\delta$,

$$\mathcal{L}^{\mathrm{SHN}}_{x \to y} = \sum_{j=1}^{B} \max\big(0,\; \delta + S(x_j, \hat{y}_j) - S(x_j, y_j)\big),$$

where $S(x, y)$ is a similarity score for the representations of $x$ and $y$, with $x_j$ and $y_j$ forming a positive pair and $\hat{y}_j$ a sampled semi-hard negative.
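As a toy illustration of this semi-hard selection rule (our own sketch: the function name, the margin value, and the assumption of a precomputed $B \times B$ similarity matrix with positives on the diagonal are not from the paper):

```python
import numpy as np

def shn_loss(S, margin=0.1):
    """Semi-hard negative triplet loss over a toy batch similarity matrix.

    S[j, k] is the similarity of anchor j with candidate k; S[j, j] is the
    positive pair.  A semi-hard negative is less similar than the positive
    but still within the margin, so it always contributes a positive loss.
    """
    B = S.shape[0]
    total = 0.0
    for j in range(B):
        pos = S[j, j]
        # semi-hard candidates satisfy: pos - margin < S[j, k] < pos
        cands = [S[j, k] for k in range(B)
                 if k != j and pos - margin < S[j, k] < pos]
        if cands:
            total += max(0.0, margin + max(cands) - pos)
    return total / B
```

With `S = [[1.0, 0.95], [0.2, 0.9]]` and margin 0.1, only the 0.95 negative is semi-hard and contributes 0.1 + 0.95 - 1.0 = 0.05 for its anchor.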

Noise contrastive estimation (NCE) [pmlr-v9-gutmann10a] has been applied to contrastive learning [chen2020simple, DBLP:journals/corr/abs-1807-03748] by using a log-likelihood based loss function that learns to discriminate between positive and negative pairs of feature embeddings,

$$\mathcal{L}^{\mathrm{NCE}}_{x \to y} = -\frac{1}{B} \sum_{j=1}^{B} \log \frac{e^{S(x_j, y_j)}}{e^{S(x_j, y_j)} + \sum_{k=1}^{B} \mathbb{1}_{[k \neq j]}\, e^{S(x_j, y_k)}} \qquad (1)$$

where $\mathbb{1}_{[k \neq j]}$ is an indicator function ensuring that only negative pairs are added to the denominator. This has been shown to improve feature alignment compared to SHN [chen2020simple].
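A minimal NumPy sketch of this loss for one direction (our own toy code, assuming the diagonal of the $B \times B$ similarity matrix holds the positive pairs):

```python
import numpy as np

def nce_loss(S):
    """Batch-wise NCE loss for one direction (x -> y).

    Each anchor's positive similarity S[j, j] competes against every
    other pair in the mini-batch via a softmax over row j.
    """
    B = S.shape[0]
    logits = S - S.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    probs = exp[np.arange(B), np.arange(B)] / exp.sum(axis=1)
    return float(-np.log(probs).mean())
```

For an all-zero similarity matrix every pair is equally likely, so the loss is log B; raising the diagonal lowers the loss.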

Masked Margin Softmax Loss (MMS) [ilharco2019largescale] and Large Margin Cosine Loss (LMCL) [Wang_2018_CVPR] incorporate a positive margin into the contrastive learning framework in order to improve feature discrimination among non-paired embeddings. MMS uses a monotonically increasing margin to allow initial learning to begin to converge before a large alteration to the loss is added. LMCL proposes a theoretical limit on the maximum margin size of $\frac{C}{C-1}$, where $C$ refers to the number of classes being discriminated. For aligning captions to visual information, the class size can be considered unbounded, as each caption represents a slightly different event that we want to discriminate, leading to a max margin size of $\lim_{C \to \infty} \frac{C}{C-1} = 1$. Concretely, MMS proposes adding a margin $\delta$ to Equation 1,

$$\mathcal{L}^{\mathrm{MMS}}_{x \to y} = -\frac{1}{B} \sum_{j=1}^{B} \log \frac{e^{S(x_j, y_j) - \delta}}{e^{S(x_j, y_j) - \delta} + \sum_{k=1}^{B} \mathbb{1}_{[k \neq j]}\, e^{S(x_j, y_k)}} \qquad (2)$$

where the margin, $\delta$, starts at 0.001 and is exponentially increased by a factor of 1.002 every 1000 training steps.
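The schedule and the margined softmax can be sketched as follows (our own NumPy toy, not the reference implementation; the margin is subtracted only from the positive logits):

```python
import numpy as np

def mms_margin(step, base=1e-3, growth=1.002, interval=1000):
    """Monotonically increasing margin: `base` grown by a factor of
    `growth` every `interval` training steps, as described above."""
    return base * growth ** (step // interval)

def mms_loss(S, step):
    """MMS-style loss for one direction: NCE with the scheduled margin
    subtracted from each positive similarity on the diagonal."""
    B = S.shape[0]
    idx = np.arange(B)
    S = S.astype(float).copy()
    S[idx, idx] -= mms_margin(step)            # penalize the positives
    logits = S - S.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    probs = exp[idx, idx] / exp.sum(axis=1)
    return float(-np.log(probs).mean())
```

As training progresses the margin grows, so the same similarity matrix yields a larger loss and the positives are pushed further above the negatives.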

We propose extending the idea of an increasing margin in MMS to an adaptive setting that does not require setting the initial value of the margin or the growth rate. We refer to this approach as an Adaptive Mean Margin (AMM) where the margin is set as the mean distance between the positive pair and the set of negative pairs in a batch. We describe AMM in more detail in Section 4.3.
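Our reading of this description can be sketched as follows (a hypothetical NumPy toy, not the paper's implementation; clamping the margin at zero is our assumption):

```python
import numpy as np

def amm_loss(S):
    """Adaptive Mean Margin sketch for one direction.

    For each anchor, the margin is the mean gap between the positive
    similarity and the batch negatives, so no initial margin value or
    growth rate needs to be tuned by hand.
    """
    B = S.shape[0]
    idx = np.arange(B)
    S = S.astype(float).copy()
    pos = S[idx, idx]
    neg_mean = (S.sum(axis=1) - pos) / (B - 1)
    margin = np.maximum(pos - neg_mean, 0.0)   # adaptive per-anchor margin
    S[idx, idx] = pos - margin
    logits = S - S.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    probs = exp[idx, idx] / exp.sum(axis=1)
    return float(-np.log(probs).mean())
```

Because the margin tracks the current positive-negative gap, it grows automatically as the representations separate during training.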

3 The Spoken Moments Dataset

We begin with the Moments in Time dataset [monfortmoments], as it includes over 1 million videos sourced from a number of different video hosting sites, with strong inter- and intra-class variation in the events depicted in each video. Further, the videos are all cut to 3 seconds, allowing a concise description to effectively capture the localized information of each event. Here we refer to concise descriptions as those that focus on the key events depicted in the video; this does not imply partial descriptions. During data collection, annotators may watch a video as many times as desired. During recording, we block the annotators from seeing or hearing the video to encourage descriptions of important, memorable events rather than every specific detail. This approach does not preclude the annotators from describing sequential or simultaneous events, as shown in our qualitative examples (see Figure 1). We describe our annotation approach in more detail in the supplementary material.

3.1 Dataset Statistics

Caption statistics:

  Type        Total      Average  Unique
  Words       5,618,064  18.01    50,570
  Verbs         492,941   1.58     7,436
  Nouns       1,365,305   4.37    20,645
  Adjectives    386,039   1.24    12,523

Class-vocabulary coverage of other datasets:

  Type     Dataset          Coverage
  Objects  ImageNet         69.2%
           MS-COCO          100%
  Actions  Kinetics         85.1%
           Moments in Time  96.2%
  Scenes   Places365        47.4%

Comparison to existing video caption datasets:

  Dataset                                               Clips    Videos   Captions  Words      Vocab   Domain   Spoken
  TACoS [tacos:regnerietal:tacl]                        7,206    127      18,227    146,771    28,292  Cooking
  YouCook II [youcook2]                                 15,400   2,000    15,400    121,418    2,583   Cooking
  MSVD [chen-dolan-2011-collecting]                     1,970    1,970    70,028    607,339    13,010  General
  Charades [Sigurdsson2016HollywoodIH]                  10,000   10,000   27,800    645,636    32,804  General
  MPII-MD [7298940]                                     68,337   94       68,375    653,467    24,549  General
  MSR-VTT [xu2016msr-vtt]                               10,000   7,180    200,000   1,856,523  29,316  General
  ActivityNet Captions [krishna2017dense]               100,000  20,000   100,000   1,348,000  15,564  General
  VideoStory [gella-etal-2018-dataset]                  123,000  20,000   123,000   1,633,226  -       General
  Epic-Kitchens [Damen2018EPICKITCHENS, Damen2020RESCALING]  76,885  633  76,885    227,974    1,737   Cooking
  Vatex-en [Wang_2019_ICCV]                             41,300   41,300   413,000   4,994,768  44,103  General
  Spoken Moments                                        515,912  459,742  515,912   5,618,064  50,570  General  ✓
Figure 2: Dataset Statistics: On the top-left we show the total and average number of words, verbs, nouns and adjectives in our captions as well as the number of unique examples of each. On the bottom-left we show the percentage of the class vocabulary from different datasets that occur in our captions. On the right we compare our proposed Spoken Moments dataset to existing video caption datasets. The word count and vocabulary for S-MiT are generated using ASR transcriptions.

Our proposed Spoken Moments dataset contains 500k videos randomly chosen from the Multi-Moments in Time (M-MiT) training set and all of the 10k videos from the validation set. Each video in the training set contains at least one audio description. We transcribed each audio recording using the public Google Automatic Speech Recognition (ASR) engine to generate text captions for each video. Analyzing these transcriptions builds a picture of the coverage and diversity of our captions. Figure 2 (top-left) shows that our captions have an average length of 18 words, with a unique vocabulary of 50,570 words consisting of 20,645 nouns, 12,523 adjectives and 7,436 verbs, and a total word count of 5.6 million. Figure 2 (right) shows a comparison of our Spoken Moments dataset to other existing datasets for video captioning. Our dataset is the largest public dataset in terms of video clips, source videos, total number of captions, total words in the captions and the vocabulary of unique words occurring in the captions. The increase in vocabulary size is important, as it shows that our increase in the number of videos over previous datasets does not simply repeat events but covers a novel breadth of information. We can see the opposite effect in YouCook II [youcook2], where the restricted domain of cooking videos results in a limited vocabulary used in the descriptions.

To understand how this vocabulary covers the class labels typically used for training computer vision models, we examined whether these labels exist in our vocabulary. Figure 2 (bottom-left) shows that we have strong coverage of the two largest action recognition datasets for video understanding (Kinetics [kay2017kinetics] and M-MiT [monfort2019multimoments]). We expected a large coverage of the events in M-MiT, as we sourced our videos from this dataset and the action labels themselves are fairly general (e.g. “running” and “cooking”). For Kinetics, the labels commonly tie a noun to a preceding verb (e.g. “brushing hair”). We consider such a label to exist in our dataset if both the verb and the noun occur in the same caption. For example, “A boy is in a bathroom brushing his teeth” would cover the class “brushing teeth”. With this approach we see 85.1% coverage of the classes in Kinetics and 96.2% coverage of the classes in M-MiT, showing a strong level of event diversity. Similarly, we see a strong overlap with the object classes of MS-COCO [lin2014microsoft] (100%) and ImageNet [deng2009imagenet] (69.2%) in our captions. ImageNet coverage is likely lower due to the specific labels used for many of its classes (e.g. “coucal”). Still, 69.2% coverage means 692 ImageNet classes appear in our captions. Similarly, Places [NIPS2014_5349] scene labels are very specific and don’t necessarily match the language used in our descriptions. For example, an “abbey” will typically be described as a “church” or “monastery” in our captions. We did not account for all of the possible synonyms and only considered direct matches in our captions. Even so, we are able to find 47.4% coverage of the scene labels of Places365 in our dataset.
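The verb+noun matching rule described above can be sketched as follows (our own simplified matcher; the real analysis presumably handles tokenization and word inflection more carefully):

```python
def covers_label(captions, label):
    """Return True if every word of a class label (e.g. "brushing teeth")
    appears together in at least one caption."""
    words = label.lower().split()
    for cap in captions:
        tokens = set(cap.lower().split())
        if all(w in tokens for w in words):
            return True
    return False
```

A label like “brushing teeth” is counted as covered only when one caption contains both “brushing” and “teeth”, not when the words appear in separate captions.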

Here we provide some additional characteristics of our data that may be of interest. While we do not release demographic information for our annotators, about 57% of the spoken captions were recorded by male voices and 43% by female voices. For the audio streams of the videos, roughly 51% include natural sound, 5% have music as the audio and 44% have no audio. This is consistent with the M-MiT dataset [monfort2019multimoments] from which we source our videos. Additionally, we found that less than 3% of the videos have captions that describe non-visible events (e.g. a car horn when no car is visible in the video frames). For this reason we have chosen to focus our approach on learning a strong visual model in Section 4.

4 Learning Audio-Visual Representations

Figure 3: Architecture: Videos and captions are fed into the video/caption models where the outputs are used to compute a similarity matrix, $S$, which is used to compute a loss, $\mathcal{L}$.

In order to learn from the large set of spoken captions in the proposed S-MiT dataset, we adopt a cross-modal architecture used in prior work [miech19howto100m, david2020ijcv, AVLnet], which is composed of a video model and a caption model as depicted in Figure 3. Specifically, we take video-caption pairs as input and encode each modality into a fixed-dimensional feature vector. We do this by adding a multilayer perceptron (MLP) as a projection head on top of both the video and the caption model. This projection head is composed of two linear layers followed by gated linear units (GLU) [GLU]. We then compute the dot product between the video and caption representations to produce a $B \times B$ similarity matrix, $S$ (for batch size $B$), which is used to compute our contrastive loss for training. In Section 4.3, we describe our modified approach to margined contrastive learning, which uses an Adaptive Mean Margin (AMM) that automatically adjusts itself during learning to improve the optimization signal during training.
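A toy sketch of this projection-and-similarity computation (dimensions and weights are illustrative only; in the real model the video and caption branches have separately learned parameters and biases):

```python
import numpy as np

def glu(x):
    """Gated linear unit: a * sigmoid(b) for the two halves of x."""
    a, b = np.split(x, 2, axis=-1)
    return a * (1.0 / (1.0 + np.exp(-b)))

def projection_head(x, W1, W2):
    """Two linear layers, each followed by a GLU (biases omitted)."""
    return glu(glu(x @ W1) @ W2)

rng = np.random.default_rng(0)
B, d_in, d_out = 4, 16, 8                  # toy batch and feature sizes
video   = rng.standard_normal((B, d_in))   # stand-in encoder outputs
caption = rng.standard_normal((B, d_in))
W1 = rng.standard_normal((d_in, 2 * d_in))
W2 = rng.standard_normal((d_in, 2 * d_out))
zv = projection_head(video, W1, W2)
zc = projection_head(caption, W1, W2)      # separate weights in practice
S = zv @ zc.T                              # B x B similarity matrix
```

Each GLU halves the feature dimension (the gate consumes half the features), which is why the linear layers project to twice the intended output width.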

4.1 Video Model

Following prior work [miech19howto100m], we use two encoders to represent input videos: an image encoder and a video encoder. Specifically, we use a ResNet-152 [resNet] pretrained on ImageNet [krizhevsky2012imagenet] and a temporal shift module (TSM) ResNet-50 model [Lin_2019_ICCV] pretrained on M-MiT [monfort2019multimoments]. Each encoder outputs a fixed-dimensional feature vector after max-pooling over the temporal dimension (8 frames for the TSM (3 fps) and 3 frames for the image model (1 fps)). We concatenate the two vectors and feed the concatenated vector into an MLP projection head to get the final visual representation. We examine the effect of using the image and video encoders, as well as different pretrained models, in the supplementary material.
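The temporal pooling and concatenation step can be sketched as follows (the feature width of 32 is a toy value, not the real encoder output size):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in per-frame features: 8 frames for the TSM branch (3 fps),
# 3 frames for the image branch (1 fps), as described above.
tsm_frames = rng.standard_normal((8, 32))
img_frames = rng.standard_normal((3, 32))

tsm_feat = tsm_frames.max(axis=0)              # max-pool over time
img_feat = img_frames.max(axis=0)
visual = np.concatenate([tsm_feat, img_feat])  # input to the MLP head
```

Max-pooling over the frame axis keeps the strongest per-feature response across time, and the concatenated vector carries both motion (TSM) and appearance (image) information into the projection head.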

4.2 Caption Model

4.2.1 Language Caption Model

Prior work on learning joint representations between audio captions and visual models has shown that utilizing ASR transcriptions greatly improves results [david2020ijcv]. We build on this idea and use the predicted words from a pretrained ASR model (e.g. Google’s public ASR engine) to train our models. Concretely, we examine the effect of stacking different pretrained language models on top of the ASR model predictions. We begin by comparing the results of using Fasttext [bojanowski2016enriching], BART [lewis2019bart] and BERT [DBLP:conf/naacl/DevlinCLT19] models to generate semantic and contextual word representations for our captions. During training, we randomly select 10 words from each caption to include in training. In the case of the BART and BERT models, this selection happens after the full transformer model has been applied, to avoid altering the results of the self-attention mechanisms. If fewer than 10 words occur in a caption, we allow words to be sampled multiple times in the random selection. This training augmentation allows different subsets of words in each caption to be represented at different training iterations. We examine the effect of this approach in the supplementary material. At test time, we use the full transcription as input to the language model. We average the word representations from the output of the language model to generate a single representation for each caption, which we align with the video representations described in the previous section.
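The word-sampling augmentation described above can be sketched as follows (our own helper; the function name and seeding are illustrative):

```python
import random

def sample_caption_words(tokens, k=10, seed=0):
    """Pick k word (representations) from a caption for a training step.

    Captions with at least k words are sampled without replacement;
    shorter captions are sampled with replacement, matching the
    augmentation described above.
    """
    rng = random.Random(seed)
    if len(tokens) >= k:
        return rng.sample(tokens, k)
    return [rng.choice(tokens) for _ in range(k)]
```

In practice the seed would vary per iteration so each pass over a caption sees a different word subset.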

4.2.2 Spoken Caption Model

We also train caption models on the raw spoken captions instead of the corresponding transcriptions. For each caption, we randomly sample 10 seconds of speech for training and compute the 40-dimensional log Mel spectrogram to serve as the input of the spoken caption model. The input is fed into a spoken caption model, where we consider ResDavenet [david2020ijcv] (which is designed specifically for speech) and two ImageNet ResNet [resNet] models (ResNet-34 and ResNet-50). For the ResNet models, we modify the first convolutional layer to take a single-channel input so that the spectrogram can be processed. In addition, the wav2vec [wav2vec] model, which takes the raw waveform as input, is also included in our experiments. Spoken captions are first fed into the pre-trained wav2vec model, which produces a feature vector every 210 ms. We then feed these into a learnable ResStack, taken from ResDavenet, to learn representations of spoken captions.
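The 10-second sampling step can be sketched as follows (our own sketch: the zero-padding of short recordings and the 16 kHz sample rate are assumptions, and the log Mel spectrogram itself would be computed afterwards with a signal-processing library):

```python
import numpy as np

def sample_speech_window(wave, sr=16000, seconds=10, seed=0):
    """Randomly sample a fixed-length window of speech for training;
    recordings shorter than the window are zero-padded to length."""
    rng = np.random.default_rng(seed)
    n = sr * seconds
    if len(wave) <= n:
        out = np.zeros(n, dtype=wave.dtype)
        out[:len(wave)] = wave
        return out
    start = rng.integers(0, len(wave) - n + 1)
    return wave[start:start + n]
```

Fixing the window length gives every spectrogram the same time dimension, so a batch can be stacked and fed to the convolutional caption models.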

  Language                            Caption to Video                          Video to Caption                          Mean
  Caption Model                       R@1       R@5       R@10      mAP         R@1       R@5       R@10      mAP         R@1       R@5       R@10      mAP
  Fasttext [bojanowski2016enriching]  17.1±0.8  44.0±0.6  57.2±0.5  30.2±0.5    24.1±0.5  49.9±0.6  61.8±1.3  36.6±0.3    20.6±0.5  46.9±0.6  59.5±0.8  33.4±0.4
  BERT [DBLP:conf/naacl/DevlinCLT19]  25.9±0.6  55.5±1.2  67.0±1.1  39.7±0.7    33.3±1.4  62.1±1.0  72.0±0.6  46.5±1.2    29.6±0.8  58.8±1.0  69.5±0.8  43.1±0.8
  BART [lewis2019bart]                33.1±0.9  65.5±1.5  76.6±1.3  47.8±1.1    43.8±0.7  71.5±1.2  80.9±1.6  56.4±0.7    38.4±0.4  68.5±1.3  78.7±1.4  52.1±0.8
Table 1: Language Caption Model Comparison on Video/Caption Retrieval: Here we compare the video/caption retrieval results on the test set of the Spoken Moments dataset using models trained with three different language models.


Dataset Loss Caption to Video Video to Caption Mean
R@1 R@5 R@10 mAP R@1 R@5 R@10 mAP R@1 R@5 R@10 mAP


Vatex [Wang_2019_ICCV] NCE 43.6±1.4 77.4±1.4 86.5±1.4 58.4±1.2 39.4±1.3 74.3±1.0 84.7±0.8 54.7±1.0 41.5±1.2 75.8±1.1 85.6±1.1 56.5±1.0
SHN 19.6±1.4 50.2±1.5 63.9±0.6 33.8±1.1 22.9±1.0 54.0±0.9 68.8±1.2 37.6±0.9 21.3±0.9 52.1±0.8 66.3±0.8 35.7±0.7
MMS 46.2±1.5 79.7±0.8 88.1±0.8 60.7±1.0 42.0±0.7 77.7±0.7 86.8±0.3 57.5±0.6 44.1±1.1 78.7±0.7 87.4±0.5 59.1±0.7
AMM 48.7±1.4 82.0±0.9 89.3±1.1 63.0±1.0 43.0±0.7 77.4±1.1 85.8±0.7 58.3±0.6 45.9±1.0 79.7±0.4 87.5±0.8 60.7±0.6
ActivityNet [krishna2017dense] NCE 11.8±0.6 35.4±1.0 50.6±0.8 23.8±0.4 16.7±0.8 43.0±1.2 57.1±1.2 29.5±0.8 14.3±0.6 39.2±0.8 53.8±1.0 26.7±0.5
SHN 9.9±0.9 31.2±1.3 45.2±0.9 20.9±0.9 13.7±1.1 38.5±0.9 53.4±0.9 25.9±1.0 11.8±0.9 34.9±0.8 49.3±0.7 23.4±0.9
MMS 12.0±0.7 35.5±1.0 49.2±0.8 23.9±0.6 16.2±0.4 42.4±0.9 56.5±1.6 28.8±0.6 14.1±0.4 39.0±0.2 52.8±1.2 26.4±0.2
AMM 17.2±1.1 46.1±1.4 60.0±0.8 30.6±0.6 20.9±1.1 50.1±1.3 62.4±0.8 34.3±0.6 19.1±1.0 48.1±1.2 61.2±0.6 32.5±0.6
MSR-VTT [xu2016msr-vtt] NCE 20.7±0.9 51.0±0.7 66.6±1.2 35.0±0.4 30.7±1.4 65.1±0.7 78.2±1.3 46.1±1.2 25.7±1.0 58.1±0.6 72.4±1.2 40.6±0.7
SHN 11.3±0.2 32.0±1.0 44.9±1.4 21.9±0.3 22.1±0.9 54.5±1.6 68.9±1.4 37.0±1.1 16.7±0.5 43.3±0.5 56.9±0.9 29.5±0.5
MMS 17.6±1.1 46.5±0.9 61.6±0.9 31.5±0.6 28.3±1.1 63.1±1.4 76.1±0.9 43.8±1.1 23.0±0.9 54.8±0.6 68.9±0.7 37.6±0.6
AMM 25.7±0.8 61.0±0.8 75.6±0.7 41.6±0.6 32.5±1.5 67.5±1.7 80.1±1.4 48.0±1.2 29.1±0.8 64.2±1.0 77.9±1.0 44.8±0.8
S-MiT NCE 33.1±0.9 66.9±1.9 77.6±1.2 47.9±0.7 43.0±0.8 71.8±0.9 80.7±1.2 55.8±0.7 38.0±0.5 69.3±1.4 79.1±1.1 51.8±0.6
SHN 23.1±1.3 55.4±1.6 69.3±1.3 37.7±1.1 41.4±1.1 70.8±0.9 79.5±1.0 54.5±0.7 32.3±0.9 63.1±1.1 74.4±1.1 46.1±0.8
MMS 26.5±1.3 58.3±1.4 72.0±0.9 41.1±1.1 43.3±1.3 71.2±1.4 79.9±0.8 55.8±1.2 34.9±1.2 64.8±1.2 76.0±0.8 48.5±1.1
AMM 33.1±0.9 65.5±1.5 76.6±1.3 47.8±1.1 43.8±0.7 71.5±1.2 80.9±1.6 56.4±0.7 38.4±0.4 68.5±1.3 78.7±1.4 52.1±0.8


Table 2: Loss Function Comparison for Video/Caption Retrieval: Models trained on four datasets with different loss functions are compared. The proposed AMM loss function consistently achieves the best performance.
Spoken Caption Model Loss Caption to Video Video to Caption Mean
R@1 R@5 R@10 mAP R@1 R@5 R@10 mAP R@1 R@5 R@10 mAP
ResDavenet [david2020ijcv] NCE 30.7±0.6 57.1±0.6 67.6±1.0 42.9±0.8 29.3±1.0 55.8±1.2 66.2±1.4 41.8±0.9 30.0±0.8 56.4±0.9 66.9±1.2 42.3±0.8
SHN 30.2±1.1 56.9±0.8 66.8±0.5 42.6±1.0 31.0±1.2 57.2±0.8 67.1±0.9 43.2±1.0 30.6±1.1 57.0±0.8 67.0±0.7 42.9±1.0
MMS 32.1±1.1 58.9±1.0 68.6±1.5 44.4±0.8 32.3±1.3 57.9±1.1 68.1±1.5 44.3±1.2 32.2±1.2 58.4±1.0 68.4±1.5 44.3±1.0
AMM 34.8±1.1 62.0±1.1 70.4±1.2 47.0±1.1 34.6±1.5 60.8±1.6 70.0±0.9 46.8±1.2 34.7±1.2 61.4±1.4 70.2±1.1 46.9±1.1
Wav2Vec [wav2vec] NCE 32.6±0.7 60.4±0.8 70.3±1.6 45.3±0.8 30.9±1.0 59.6±0.9 69.8±1.1 43.9±0.8 31.8±0.7 60.0±0.8 70.0±1.3 44.6±0.8
SHN 27.8±1.0 54.2±1.7 64.9±1.8 40.1±1.0 28.4±0.7 53.7±1.6 64.2±1.7 40.4±0.8 28.1±0.8 53.9±1.6 64.6±1.7 40.2±0.9
MMS 33.6±0.6 60.5±1.2 71.4±1.1 46.1±0.7 33.4±1.0 60.5±1.7 70.3±1.1 45.7±0.8 33.5±0.6 60.5±1.4 70.8±1.1 45.9±0.7
AMM 35.0±0.4 61.7±0.9 71.0±0.9 47.1±0.6 34.7±1.5 61.1±0.9 70.2±0.9 46.8±1.2 34.8±0.9 61.4±0.9 70.6±0.8 47.0±0.9
ResNet-34 NCE 32.2±1.3 59.7±1.4 70.3±1.3 44.8±1.1 32.8±1.8 58.8±1.3 69.2±1.9 45.1±1.4 32.5±1.4 59.2±1.3 69.7±1.5 45.0±1.2
SHN 32.7±1.1 60.3±1.3 71.0±1.1 45.5±1.0 33.1±1.0 60.1±1.5 70.1±1.3 45.6±0.9 32.9±1.0 60.2±1.4 70.6±1.2 45.6±0.9
MMS 35.3±1.0 62.5±1.2 72.8±1.8 47.7±0.6 36.7±0.9 62.2±0.8 72.1±1.6 48.6±0.9 36.0±0.7 62.3±1.0 72.5±1.6 48.2±0.7
AMM 36.3±0.5 63.9±1.7 73.7±1.6 48.9±0.8 37.5±1.7 63.5±1.9 73.7±1.6 49.6±1.5 36.9±1.1 63.7±1.7 73.7±1.5 49.2±1.2
ResNet-50 NCE 32.7±0.6 60.8±1.9 70.6±1.6 45.6±0.8 33.1±1.0 59.4±1.5 69.6±1.4 45.5±0.9 32.9±0.5 60.1±1.7 70.1±1.4 45.5±0.8
SHN 33.9±0.6 60.1±1.4 70.9±1.3 45.8±0.7 34.0±1.2 60.6±1.8 70.1±1.4 46.0±1.1 34.0±0.8 60.3±1.5 70.5±1.3 45.9±0.8
MMS 37.2±0.9 65.4±0.6 75.1±1.3 50.0±0.7 37.8±1.3 64.6±1.1 74.2±0.9 50.1±1.1 37.5±1.0 65.0±0.8 74.7±1.1 50.0±0.9
AMM 39.5±1.3 65.7±1.5 75.5±1.3 51.6±1.1 40.1±0.7 66.3±1.1 74.5±1.2 52.0±0.7 39.8±0.9 66.0±1.2 75.0±1.1 51.8±0.8
Table 3: Spoken Caption Model Comparison: Models trained with different spoken caption architectures and different loss functions are compared for video/caption retrieval on the S-MiT test set. The proposed AMM loss function consistently achieves the highest performance while ResNet-50 is found to be significantly stronger than the other architectures.


Trained On Evaluated On
Vatex ActivityNet MSR-VTT S-MiT Mean
R@1 R@5 R@10 mAP R@1 R@5 R@10 mAP R@1 R@5 R@10 mAP R@1 R@5 R@10 mAP R@1 R@5 R@10 mAP


Vatex 45.9 79.7 87.5 60.7 15.6 39.4 51.7 27.1 22.6 49.8 63.2 35.6 13.1 33.0 45.8 23.5 24.3 50.5 62.1 36.7
ActivityNet 25.0 56.0 68.4 39.1 19.1 48.1 61.2 32.5 15.1 37.1 50.4 26.4 9.8 28.7 40.6 19.7 17.3 42.5 55.2 29.4
MSR-VTT 21.0 51.3 64.8 35.1 9.9 28.3 39.7 19.6 29.1 64.2 77.9 44.8 14.6 39.3 53.4 26.9 18.7 45.8 59.0 31.6
S-MiT 42.7 75.4 84.2 57.1 17.6 41.6 53.8 29.2 33.1 64.8 77.4 47.6 38.4 68.5 78.7 52.1 33.0 62.6 73.5 46.5


Table 4: Cross Dataset Evaluation on Video/Caption Retrieval: Here we compare the generalization performance of models trained on four different datasets for video/caption retrieval. Each model is trained on a single dataset and we average the evaluation over five 1k video-caption samples from the test set of each other dataset. We additionally show the mean performance across datasets. The model trained on S-MiT generalizes very strongly to the other datasets, even beating the MSR-VTT model on its own test set.

4.3 Adaptive Mean Margin

We train our model using the contrastive loss with a similar setting to MMS (Equation 2). The only difference is that we replace the fixed margin with an adaptive margin based on the difference between the similarity of the positive pair and the similarities of the negative pairs in each batch.

The challenge in using the MMS margin for mini-batch sampled contrastive learning is that the initial margin and growth schedule are difficult to tune for a specific dataset and similarity metric. Additionally, depending on the sampled pairs in a mini-batch, the margin calculated may be too weak if the positive pair is much more similar than the sampled negative pairs and too strong if it is very similar to the negative pairs. The approach to monotonically increase the margin during training is meant to address this as the positive and negative pairs will share similar alignment early in training and begin to diverge closer to convergence. However, variable rates of convergence of different models on different datasets make this growth rate difficult to tune and this approach does not account for differences in the negative samples that appear in different mini-batches. To address this, we propose an adaptive margin based on relative batch-wise similarity scores.

Class labels have been proposed for generating adaptive margins based on the class similarity between positive and negative pairs [adaptivemargin2020cvpr, Liu_2019_CVPR]. Likewise, prior work explored a non-class-dependent approach to an adaptive similarity-based margin for human pose estimation [Li_2015_ICCV], where the mean joint error between a positive pose and a hard sampled negative pose was used as a margin with the triplet loss. This adaptively increases the margin when the sampled negative pair is dissimilar to the positive pair in order to maximize the learning signal on less aligned negative samples. We follow a similar intuition and simply replace the fixed margin in Equation 2 with

M_i = α ( S(c_i, v_i) − (1/(B−1)) Σ_{j≠i} S(c_i, v_j) )        (3)

where α is a dampening parameter to weight the strength of the margin. When the margin in Equation 3 is applied to Equation 2 with α = 1, the margin removes the positive pair similarity from the optimization. Ablation studies on different alpha values can be found in the supplementary material. In practice we use α = 0.5 in our experiments.

This has the effect of increasing the margin as the difference between the true pair similarity and the similarity of the negative pairs increases. As the training progresses, and the learning approaches convergence, the margin generally increases with the increased separation between positive and negative pair-wise similarities. This also removes the need to tune the margin and growth rate which may have different optimal values for different similarity metrics, batch sizes and datasets.
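The resulting objective can be sketched in numpy as follows. This is our own illustrative reconstruction rather than the authors' implementation: we assume, as in MMS, a batch similarity matrix S whose diagonal holds the positive pairs, with the margin subtracted from the positive logit, and we sum the two retrieval directions; variable names are ours.

```python
import numpy as np

def _one_direction(S, alpha):
    """Margin-softmax loss over rows of a (B, B) similarity matrix."""
    B = S.shape[0]
    eye = np.eye(B, dtype=bool)
    pos = S[eye]                                   # positive-pair similarities
    neg_mean = S[~eye].reshape(B, B - 1).mean(1)   # mean negative similarity per anchor
    margin = alpha * (pos - neg_mean)              # adaptive mean margin
    logits = S.copy()
    logits[eye] = pos - margin                     # apply the margin to positives only
    # cross-entropy with the positive on the diagonal (log-softmax per row)
    mx = logits.max(axis=1, keepdims=True)
    log_z = mx[:, 0] + np.log(np.exp(logits - mx).sum(axis=1))
    return -(logits[eye] - log_z).mean()

def amm_loss(S, alpha=0.5):
    """Contrastive loss with an Adaptive Mean Margin, both retrieval directions."""
    return _one_direction(S, alpha) + _one_direction(S.T, alpha)
```

Note that with alpha = 1 the positive logit collapses to the mean negative similarity, so the positive-pair similarity drops out of the optimization entirely, matching the discussion above.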

We refer to this as an Adaptive Mean Margin (AMM) for contrastive learning and show in Section 5 the effect of applying this adaptive margin.

5 Results

5.1 Video/Caption Retrieval

In Tables 1, 2, 3 and 4 we show R@k recall scores (for k = 1, 5, 10) and mean average precision (mAP) on both caption to video and video to caption retrieval. Results are averaged over five random sets of 1k video-caption pairs from the test set. Each model in Tables 1 and 2 uses the output of a pretrained ASR model, the Google Cloud ASR engine, as input into a trained language model to generate a feature representation for each caption. Alternatively, the spoken caption models align visual representations directly from the audio signal without pretrained modules.
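These retrieval metrics can all be computed from one batch similarity matrix; since each query has exactly one matching item, average precision reduces to the reciprocal rank of the true match. A minimal sketch (our own helper, not the authors' evaluation code):

```python
import numpy as np

def retrieval_metrics(S):
    """R@k and mAP for caption-to-video retrieval.

    S: (N, N) similarity matrix where caption i matches video i.
    """
    N = S.shape[0]
    order = np.argsort(-S, axis=1)                       # best match first, per query
    ranks = np.argmax(order == np.arange(N)[:, None], axis=1) + 1
    metrics = {f"R@{k}": float((ranks <= k).mean()) for k in (1, 5, 10)}
    metrics["mAP"] = float((1.0 / ranks).mean())         # AP = 1/rank with one relevant item
    return metrics
```

Video-to-caption retrieval is the same call on `S.T`; the tables report both directions plus their mean, averaged over five random 1k test subsets.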

Table 1 shows the result of using different language models to generate our caption representations from ASR text transcriptions. Each of these models was trained using the proposed AMM loss function described in Section 4.3. We evaluate the AMM loss in Table 2, where we compare the NCE, SHN, MMS and AMM loss functions described in Sections 2.3.1 and 4.3 on four different datasets (the proposed Spoken Moments in Time dataset (S-MiT) as well as Vatex-en [Wang_2019_ICCV], MSR-VTT [xu2016msr-vtt] and ActivityNet Captions [krishna2017dense], where we used the ground-truth timestamps to extract the corresponding video clips). The proposed AMM loss function consistently achieves the best results across each dataset in Table 2 and the BART language model provides the strongest representations for the retrieval task in Table 1.

Table 2 shows a comparison of our AMM approach to other methods for cross-modal contrastive learning. We use the BART language model [lewis2019bart] to generate representations of words transcribed from the audio captions via a pretrained ASR model. Replacing the monotonically increasing margin used in MMS [ilharco2019largescale] with an adaptive margin that scales with the samples in a batch achieves the strongest results. We observed that as training continues and the margin in MMS continues to grow the training performance begins to degrade. This is likely due to the margin becoming too large for stable training as described in prior work [Wang_2018_CVPR].

In Table 3, we show a comparison of different spoken caption models with different loss functions. The proposed AMM approach consistently outperforms the other loss functions.

5.2 Cross Dataset Evaluation

To further examine the strength of our proposed Spoken Moments in Time (S-MiT) dataset, we compare the generalization performance of models trained on four different datasets (S-MiT as well as Vatex-en [Wang_2019_ICCV], MSR-VTT [xu2016msr-vtt] and ActivityNet Captions [krishna2017dense]) for video/caption retrieval (see Table 2 (right) for comparisons of these datasets). We train each model on a single dataset using the approach described in Section 4.3 and evaluate on the test set of each other dataset. For example, a model trained on Vatex is evaluated on, in addition to its own, the test sets of ActivityNet Captions, MSR-VTT and S-MiT. We sample five sets of 1k video-caption pairs from each test set. This allows us to fairly compare results across test sets of different sizes (see supplementary material for full test set results). Each model in this evaluation was trained using the BART [lewis2019bart] language model and the proposed AMM loss function, which were found to give the best results (see Tables 1 and 2). We evaluate the models using the mean of the video-to-caption and caption-to-video retrieval results. We are not able to compare the spoken caption models from Table 3 here as the other datasets only include text captions.

In Table 4, we can see that the S-MiT model generalizes better than the other models in spite of the additional noise introduced by the ASR model. Additionally, the restriction to 3-second videos in S-MiT does not hinder its ability to generalize to the much longer videos of the other datasets.

6 Conclusions

In this paper, we have introduced the Spoken Moments in Time dataset which includes 500k pairs of video clips and corresponding spoken descriptions. This new dataset represents the largest video caption dataset available and will serve as a new benchmark for the community. We compared various benchmark models for learning joint representations between captions and videos, and evaluated our approaches on multiple datasets to highlight the strength of the models as well as the ability of models trained on our proposed dataset to generalize to tasks in other datasets. With these results we are confident that the presented Spoken Moments dataset will have a positive impact on the fields of video understanding and cross-modal learning.

7 Acknowledgment

This work was supported by the MIT-IBM Watson AI Lab as well as the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/ Interior Business Center (DOI/IBC) contract number D17PC00341.

8 Disclaimer

The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/IBC, or the U.S. Government.


A Annotation

Figure 4: Spoken Caption Collection: Target videos for which descriptions are collected are shown on the left, while a video with an example text description is always visible on the right.

We follow the approach used to collect the Places Audio Caption dataset [david2016nips, david2020ijcv] and collect audio descriptions of each video in the dataset using Amazon Mechanical Turk (AMT). In order to ensure that we have a large and diverse dataset, we collect an audio description using AMT for each video in a set of 500k randomly selected videos from the training set and at least two unique descriptions for each video in the 10k videos used for both the validation and test sets.

Each AMT worker is presented with a task of recording themselves describing 10 different videos. Each video is shown on the left of the screen while a video with an example text description is shown on the right. This example helps to show the workers the types of descriptions we are looking for and the amount of detail we expect from them. This example stays on the right side of the screen throughout the task while the target videos on the left cycle as the worker completes each description. Figure 4 shows an example of this interface with an example video and caption on the right and a target video on the left. Below each target description is a button that allows the worker to start recording their voice as they describe the video. Once they press this button, the video is removed from the screen and the recording is started. We block the worker from seeing the video while recording the description to ensure that the recordings are concise and pertain only to the important events highlighted in their memory.

We use the Google Cloud ASR engine to verify the quality of each recorded description and flag AMT workers for poor performance. This is done by checking that the generated text has at least five words, is unique (some bots repeat pre-recorded audio to trick the system) and that the audio is at least three seconds long. If any of these checks fail we don’t let the worker continue to the next video until they record a new description that passes our checks.
Once the descriptions are recorded, we periodically sample videos to check the quality of the audio paired with the ASR to ensure they match the videos and have an appropriate level of detail. If these checks fail, we flag the workers that recorded the descriptions, don’t allow them to record in the future and recheck all of their recorded data. This process allows us to ensure a strong level of quality in our collected spoken captions. Examples of some of the videos and corresponding text transcriptions of the descriptions we collected can be seen in Figure 1.
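The automatic checks described above amount to a simple validation gate on each recording. A sketch of that logic (our own re-implementation of the stated rules, not the actual collection code):

```python
def passes_quality_checks(asr_text, audio_seconds, seen_transcripts):
    """Gate a recorded description using its ASR transcript.

    Rules from the collection pipeline: at least five transcribed words,
    a transcript not seen before (some bots replay pre-recorded audio),
    and at least three seconds of audio.
    """
    if len(asr_text.split()) < 5:        # too short a description
        return False
    if asr_text in seen_transcripts:     # duplicate transcript, likely a bot
        return False
    if audio_seconds < 3.0:              # recording too short
        return False
    return True
```

A worker whose recording fails this gate is asked to re-record before moving on to the next video.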

B Implementation Details

We train each model on a server with 8 24GB Titan RTX cards using a mini-batch size of 2048 for 100 epochs. We examine the effect of the mini-batch size on learning in the next section. We take the best parameters as evaluated on the validation set of the training dataset after each epoch. We repeat this process for two phases of training. First we freeze the visual backbone models and train only the projection heads (including the full caption model for the spoken models) and then, in a second round, allow the full visual model to train as well. We keep the language and ASR components frozen for the language caption models and reserve fine-tuning these components for future work. For model training, we use the Adam optimizer [Kingma2015AdamAM] with fixed learning rates set separately for the first and second rounds of training.

C Ablation Studies

Dataset Pretrained TSM Caption to Video Video to Caption Mean
Dataset R@1 R@5 R@10 mAP R@1 R@5 R@10 mAP R@1 R@5 R@10 mAP
Vatex [Wang_2019_ICCV] Kinetics 39.6±1.0 77.5±1.5 87.2±1.0 55.9±0.8 46.4±0.6 82.1±1.0 90.2±1.2 61.9±0.6 43.0±0.7 79.8±1.1 88.7±1.0 58.9±0.7
M-MiT 47.4±1.1 81.5±0.7 89.0±1.1 62.3±0.6 43.1±0.9 78.3±0.6 86.2±0.3 58.5±0.5 45.3±0.8 79.9±0.4 87.6±0.6 60.4±0.5
ActivityNet [krishna2017dense] Kinetics 18.7±1.0 45.6±0.9 57.2±1.4 31.0±0.7 20.8±0.8 50.1±1.4 61.8±1.3 34.1±0.4 19.8±0.8 47.8±0.9 59.5±1.0 32.5±0.4
M-MiT 16.1±1.7 44.0±1.0 57.5±1.7 29.3±1.0 19.0±1.3 48.2±0.9 61.0±1.1 32.5±0.8 17.6±1.5 46.1±0.7 59.2±1.4 30.9±0.8
MSR-VTT [xu2016msr-vtt] Kinetics 17.6±1.3 48.9±1.8 65.6±1.2 31.6±1.3 25.5±0.7 59.7±1.8 74.1±1.6 40.6±0.9 21.6±0.8 54.3±1.4 69.8±1.4 36.1±0.9
M-MiT 20.7±0.5 54.2±0.9 70.6±1.0 30.5±0.4 31.3±1.1 61.0±1.0 75.0±0.9 40.9±0.8 24.0±0.6 57.6±0.6 72.8±0.8 37.7±0.4
S-MiT Kinetics 27.6±1.4 57.5±2.4 70.4±1.9 41.3±1.7 37.2±2.3 65.0±1.7 75.2±1.5 50.0±1.7 32.4±1.8 61.3±2.0 72.8±1.6 45.7±1.7
M-MiT 29.8±2.5 60.6±2.4 72.2±1.9 44.0±2.2 39.4±2.1 68.0±2.0 77.5±1.8 52.3±2.0 34.6±2.1 64.3±2.2 74.9±1.8 48.2±2.0
Table 5: Comparison of different pretrained TSM models on multiple datasets using AMM and BART
Visual Base Model Caption to Video Video to Caption Mean
R@1 R@5 R@10 mAP R@1 R@5 R@10 mAP R@1 R@5 R@10 mAP
TSM Kinetics 20.2±1.1 47.9±2.3 61.0±0.8 33.2±1.1 28.2±1.5 54.9±1.5 67.1±1.6 40.8±1.6 24.2±1.1 51.4±1.9 64.0±1.0 37.0±1.3
TSM M-MiT 19.7±1.1 48.6±2.0 61.9±1.6 33.5±1.3 28.4±1.4 58.0±2.5 69.2±1.9 41.9±1.4 24.1±1.2 53.3±2.1 65.6±1.7 37.7±1.4
ResNet-152 ImageNet (2D) 24.2±2.4 53.6±1.8 66.5±2.1 37.9±2.0 32.9±2.1 61.7±1.6 71.6±1.0 45.9±1.8 28.5±2.2 57.7±1.7 69.1±1.5 41.9±1.9
TSM Kinetics + 2D 27.6±1.4 57.5±2.4 70.4±1.9 41.3±1.7 37.2±2.3 65.0±1.7 75.2±1.5 50.0±1.7 32.4±1.8 61.3±2.0 72.8±1.6 45.7±1.7
TSM M-MiT + 2D 29.8±2.5 60.6±2.4 72.2±1.9 44.0±2.2 39.4±2.1 68.0±2.0 77.5±1.8 52.3±2.0 34.6±2.1 64.3±2.2 74.9±1.8 48.2±2.0
Table 6: Comparison of different visual base model combinations on S-MiT using AMM and BART
Batch Size Caption to Video Video to Caption Mean
R@1 R@5 R@10 mAP R@1 R@5 R@10 mAP R@1 R@5 R@10 mAP
512 27.2±1.6 57.4±1.3 69.4±1.0 41.0±1.5 35.5±2.5 64.0±1.4 74.4±1.1 48.4±2.1 31.3±1.9 60.7±1.3 71.9±1.0 44.7±1.7
1024 27.8±2.0 57.7±1.4 69.8±1.2 41.5±1.9 36.5±2.8 65.6±1.4 75.2±1.7 49.7±2.0 32.2±2.3 61.7±1.4 72.5±1.3 45.6±1.9
2048 29.8±2.5 60.6±2.4 72.2±1.9 44.0±2.2 39.4±2.1 68.0±2.0 77.5±1.8 52.3±2.0 34.6±2.1 64.3±2.2 74.9±1.8 48.2±2.0
4096 29.2±2.7 58.4±1.6 70.8±1.9 42.8±2.3 39.4±2.3 66.6±1.8 75.7±1.4 51.8±1.9 34.3±2.3 62.5±1.6 73.3±1.6 47.3±2.0
Table 7: Comparison of different batch sizes on S-MiT using AMM and BART
Projection Size Caption to Video Video to Caption Mean
R@1 R@5 R@10 mAP R@1 R@5 R@10 mAP R@1 R@5 R@10 mAP
1024 27.4±1.8 56.6±1.6 69.5±0.9 41.1±1.5 38.6±1.6 66.6±1.1 76.3±1.3 51.3±1.2 33.0±1.6 61.6±1.3 72.9±1.0 46.2±1.3
2048 27.8±1.8 57.4±2.0 69.2±1.5 41.5±1.8 38.4±2.1 65.9±1.4 75.6±1.5 51.1±1.6 33.1±1.9 61.6±1.6 72.4±1.4 46.3±1.7
4096 29.8±2.5 60.6±2.4 72.2±1.9 44.0±2.2 39.4±2.1 68.0±2.0 77.5±1.8 52.3±2.0 34.6±2.1 64.3±2.2 74.9±1.8 48.2±2.0
8192 29.4±2.0 58.0±2.3 70.3±1.2 42.6±1.8 38.5±2.4 66.1±2.1 76.1±1.5 51.2±2.1 33.9±2.2 62.0±2.2 73.2±1.3 46.9±1.9
Table 8: Comparison of different projection sizes on S-MiT using AMM and BART
Sampling Caption to Video Video to Caption Mean
R@1 R@5 R@10 mAP R@1 R@5 R@10 mAP R@1 R@5 R@10 mAP
N 28.1±1.1 57.5±2.0 69.8±1.4 41.8±1.3 39.1±1.3 66.5±2.0 76.3±1.8 51.5±1.4 33.6±1.1 62.0±1.9 73.0±1.5 46.7±1.3
Y 29.8±2.5 60.6±2.4 72.2±1.9 44.0±2.2 39.4±2.1 68.0±2.0 77.5±1.8 52.3±2.0 34.6±2.1 64.3±2.2 74.9±1.8 48.2±2.0
Table 9: Comparison of sampling approach on S-MiT using AMM and BART
α Caption to Video Video to Caption Mean
R@1 R@5 R@10 mAP R@1 R@5 R@10 mAP R@1 R@5 R@10 mAP
0.1 29.3±1.4 60.0±1.2 72.7±1.4 43.4±1.2 39.2±1.8 66.2±1.5 77.0±1.3 51.7±1.6 34.2±1.5 63.1±1.3 74.8±1.1 47.5±1.4
0.2 28.4±1.2 58.1±1.9 70.9±1.5 42.3±1.4 39.3±1.6 67.4±1.5 77.0±1.8 52.0±1.4 33.9±1.4 62.7±1.7 74.0±1.5 47.2±1.4
0.3 27.1±2.5 58.9±2.9 71.5±2.2 41.6±2.3 38.5±2.4 67.1±1.0 76.6±1.9 51.5±1.9 32.8±2.3 63.0±1.9 74.0±1.9 46.5±2.1
0.4 28.1±1.1 58.1±2.1 69.8±2.3 41.9±1.5 38.8±2.4 66.9±1.2 75.8±1.5 51.5±1.8 33.5±1.8 62.5±1.6 72.8±1.7 46.7±1.6
0.5 29.8±2.5 60.6±2.4 72.2±1.9 44.0±2.2 39.4±2.1 68.0±2.0 77.5±1.8 52.3±2.0 34.6±2.1 64.3±2.2 74.9±1.8 48.2±2.0
0.6 28.1±2.1 59.1±2.3 71.3±2.1 42.3±1.9 38.3±1.9 67.1±1.6 76.6±1.7 51.4±1.6 33.2±1.9 63.1±1.8 73.9±1.8 46.9±1.7
0.7 28.9±1.5 59.2±1.3 70.8±1.3 42.8±1.4 38.9±1.7 66.3±1.4 76.0±1.5 51.3±1.5 33.9±1.6 62.7±1.4 73.4±1.3 47.1±1.4
0.8 29.0±1.9 59.2±2.4 70.7±1.4 42.8±1.9 38.3±2.1 66.3±1.6 75.9±1.5 51.1±1.7 33.6±1.9 62.8±1.9 73.3±1.3 46.9±1.7
0.9 27.7±2.1 57.0±2.5 68.2±2.0 41.1±2.2 37.5±2.6 64.6±2.4 74.1±1.5 49.8±2.2 32.6±2.4 60.8±2.4 71.2±1.7 45.5±2.2
Table 10: Comparison of different dampening multipliers, α, in AMM on S-MiT using BART


Trained On Evaluated On
Vatex ActivityNet MSR-VTT S-MiT Mean
R@1 R@5 R@10 mAP R@1 R@5 R@10 mAP R@1 R@5 R@10 mAP R@1 R@5 R@10 mAP R@1 R@5 R@10 mAP


Vatex 19.8 48.4 63.7 33.4 1.5 5.2 8.6 4.2 10.3 28.7 39.3 19.8 7.1 20.2 28.6 14.4 9.7 25.6 35.1 18.0
ActivityNet 12.1 33.3 46.8 23.0 2.0 7.3 12.0 5.6 7.5 22.1 31.2 15.4 4.9 15.6 24.1 11.4 6.6 19.6 28.5 13.9
MSR-VTT 6.5 19.2 28.8 13.8 1.3 4.6 7.8 3.7 11.8 33.9 48.2 23.2 8.0 23.6 34.3 16.4 6.9 20.3 29.8 14.3
S-MiT 19.4 44.6 57.7 31.7 2.7 8.4 13.6 6.5 17.3 39.8 51.8 28.4 25.8 52.8 64.7 38.5 16.3 36.4 47.0 26.3


Table 11: Cross Dataset Evaluation on Video/Caption Retrieval on Full Test Set

In Tables 5–10, we show several ablation studies. Unless otherwise listed in the table we use the proposed AMM loss function with the BART [lewis2019bart] language model as part of the language caption model described in Section 4.2.1 for each experiment. Results are averaged over five rounds with a single random batch of 1k caption-video pairs from the test set. Due to the increased computational demand of these studies we freeze the base models and train the projection heads for alignment. We use the best model settings found in this analysis to train the full models with results reported in Section 5.

Table 5 shows the effect of using two different pretrained temporal shift [Lin_2019_ICCV] video models on four different datasets in order to choose the most appropriate base models (pretrained on either Multi-Moments in Time (M-MiT) [monfort2019multimoments] or Kinetics [kay2017kinetics]). Here we use the BART language model and the proposed AMM loss function as described in Section 4 as this combination gave us the best results on each dataset.

Table 6 compares the effect of the video model (TSM) trained for action recognition and the 2D model trained for object recognition. Most captions reference both objects and actions in a video, with an average of 4.37 nouns used per caption compared to 1.58 verbs. The strength of the 2D object model makes sense when we consider this prevalence of nouns in the captions. The combination of the TSM model trained on M-MiT [monfort2019multimoments] and the 2D model trained on ImageNet [krizhevsky2012imagenet] provided the best performance when used with the model described in Section 4.

In Tables 7 and 8 we compare the effect of the batch size and projection size on the performance of the S-MiT model described in Section 4 in order to validate our choice of a 2048 batch size and a 4096 projection size. Similarly, Table 9 shows the effect of using the caption sampling approach for the transcription model as described in Section 4.2.1. In Table 10, we explore different dampening parameters, α.

D Cross Dataset Generalization

In Table 11, we expand on Table 4 and compare the generalization performance of models trained on four different datasets (S-MiT as well as Vatex-en [Wang_2019_ICCV], MSR-VTT [xu2016msr-vtt] and ActivityNet Captions [krishna2017dense]) for video/caption retrieval on their full test sets. In Table 4 we ran the comparison on five samples of 1k video-caption pairs to be consistent when evaluating across test sets of different sizes. Here we evaluate on the full test set of each dataset to provide a baseline for each test set. The strength of the model trained on S-MiT is even more evident here as it achieves higher results on the test sets of both ActivityNet and MSR-VTT than the models trained on those datasets. It even comes very close to the performance of the Vatex model on the Vatex test set. This shows that the scale and diversity of the S-MiT dataset are highly beneficial for training robust models.

E Qualitative Results

In Tables 12 and 13, we show the top five retrieval results for some examples from the Spoken Moments dataset. For this analysis, we use the language caption model described in Section 4.2.1 with the BART [lewis2019bart] language model and the proposed AMM loss function. Table 12 shows the top five retrieved videos given a query caption, while Table 13 shows the top five retrieved captions given a query video. Blue boxes indicate the ground-truth results.

Our model retrieves results by recognizing key objects or environments in the videos. For example, in Table 12 (c), lettuce is distinguished from the other vegetables. In Table 13 (f), the model not only recognizes the planets in space but also understands that they are crashing into each other. Some of the examples show that the top retrieval result is not the ground-truth. However, as we can see, the top predictions are typically still a strong match for the queries, as in (e), (i) in Table 12 and (a), (b) in Table 13.

For this demonstration, we use transcribed words from the audio captions using a pretrained ASR model. Noise in these transcriptions may contribute to some errors. In the future, we plan to investigate jointly training a pre-trained ASR model, and language model, with the video model to improve our performance.

Query Retrieval Results
R@1 R@2 R@3 R@4 R@5
Table 12: Spoken Moments Examples of Caption to Video Retrieval Results: Given a query caption, we show the five top retrieved videos, where words transcribed from the audio captions using a pretrained ASR model are used as the caption. We use a BART model trained with the AMM loss function on the S-MiT dataset. Blue indicates the ground-truth results.
Query Retrieval Results
R@1 R@2 R@3 R@4 R@5
Table 13: Spoken Moments Examples of Video to Caption Retrieval Results: Given a query video, we show the five top retrieved captions, where words transcribed from the audio captions using a pretrained ASR model are used as the caption. We use a BART model trained with the AMM loss function on the S-MiT dataset. Blue indicates the ground-truth results.

F Captions in the Spoken Moments Dataset

Table 14 shows some captions in the Spoken Moments dataset that capture motion and sequential events which would be difficult to represent with a single image.

Caption Frames
(a) a boy and a red white and blue shirt is sitting on a couch he is holding an infant life vest and picks it up to blow through the two
(b) there’s a gauge or a lock thing turns from rides and then being turned to the left
(c) a picture of a man drinking coffee and play with a cell phone in fast motion
(d) in slow motion we see a collie jump into the air and catch a white frisbee in flight
(e) these are track and field runners and it’s a relay race and they take off when they are handed the batons
(f) there is water dripping off the edge of something all you can hear is the water dripping
Table 14: Spoken Moments Captions: We show some examples of captions, and associated video frames, from the Spoken Moments dataset, where the captions describe a sequence of actions or motion.