Code for the AVLnet (Interspeech 2021) and Cascaded Multilingual (Interspeech 2021) papers.
Current methods for learning visually grounded language from videos often rely on time-consuming and expensive data collection, such as human-annotated textual summaries or machine-generated automatic speech recognition transcripts. In this work, we introduce Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs. We circumvent the need for annotation and instead learn audio-visual language representations directly from randomly segmented video clips and their raw audio waveforms. We train AVLnet on publicly available instructional videos and evaluate our model on video clip and language retrieval tasks on three video datasets. Our proposed model outperforms several state-of-the-art text-video baselines by up to 11.8% on a video clip retrieval task, despite operating on the raw audio instead of manually annotated text captions. Further, we show AVLnet is capable of integrating textual information, increasing its modularity and improving performance by up to 20.3% on the video clip retrieval task. Finally, we perform analysis of AVLnet's learned representations, showing our model has learned to relate visual objects with salient words and natural sounds.
Humans learn to understand language, recognize objects, and identify the correspondences between the two by recognizing patterns in what they see and what they hear, often with very weak supervision. In this paper, we develop machine learning methods for this kind of audio-visual learning. Researchers have already developed models capable of learning language concepts from paired images and spoken audio captions describing the images (Harwath et al., 2016, 2018). However, these approaches require a supervised data collection procedure where annotators are paid to describe images. Recent work on learning language concepts using text instead leverages instructional videos that are freely available on the internet (Miech et al., 2019, 2020), but these videos require expensive and time-consuming annotation such as human-generated textual summaries or Automatic Speech Recognition (ASR) transcripts.
In this paper, we circumvent the need for text annotation by learning from naturally occurring audio-visual correspondences in instructional videos. We introduce the Audio-Video Language Network (AVLnet) architecture — a self-supervised model that learns a shared audio-visual embedding space directly from raw video. We leverage the HowTo100M dataset (Miech et al., 2019) to train models on publicly available instructional videos. In contrast to prior work using annotated data and supervised techniques to define video clips, AVLnet uses randomly sampled video clips to learn audio-visual representations from raw video. AVLnet can be further extended to integrate text as a third modality (AVLnet-Text), demonstrating that it can learn language representations from both raw audio and text.
Our AVLnet model achieves state-of-the-art performance on the YouCook2 (Zhou et al., 2018b) video clip and language retrieval tasks. Further, we demonstrate the transferability of our models to non-instructional video datasets: MSR-VTT (Xu et al., 2016) and LSMDC (Rohrbach et al., 2017). Integrating text captions in the AVLnet-Text model outperforms the prior state-of-the-art results on all three datasets. Finally, we show our models are able to semantically relate the audio and visual modalities to learn static concepts such as “flour”, action words like “chop”, and salient natural sounds such as sizzling.
Most closely related to this paper is the work combining paired image and speech information in an unsupervised setting (Chrupała et al., 2017; Harwath and Glass, 2015, 2017; Harwath et al., 2016, 2018; Leidal et al., 2017; Synnaeve et al., 2014; Boggust et al., 2019; Merkx et al., 2019; Scharenborg et al., 2018; Kamper et al., 2018; Ilharco et al., 2019). These models attempt to leverage the correlations between visual objects in images and spoken words as a grounding signal for learning visual semantics directly from speech. Our work builds upon recent results that demonstrate an ability to uncover concepts from images paired with spoken descriptions (Harwath et al., 2018, 2016) or video frames paired with raw audio (Boggust et al., 2019) by learning a joint audio-visual latent space that reflects the underlying semantics of both modalities. While the aforementioned work relies on still image inputs, our proposed architecture learns from entire video clips. Further, we eliminate the need for human-generated captions by applying our models to publicly available instructional videos.
Recently there has been an influx of instructional video datasets including How2 (Sanabria et al., 2018), Inria Instructional Videos (Alayrac et al., 2016), CrossTask (Zhukov et al., 2019), YouCook2 (Zhou et al., 2018b), and HowTo100M (Miech et al., 2019). A variety of tasks, focused primarily on text-video modelling, have been applied to these datasets including: task segmentation (Sener and Yao, 2018; Alayrac et al., 2016; Zhukov et al., 2019), reference resolution (Huang et al., 2017, 2018), action segmentation (Zhou et al., 2018b), video clip ordering (Fernando et al., 2017; Lee et al., 2017; Misra et al., 2016; Xu et al., 2019), and action recognition (Ghadiyaram et al., 2019; Sun et al., 2019b, a). More related to our task is text-video modelling focused on learning a joint multimodal embedding space (Miech et al., 2019, 2020, 2018; Mithun et al., 2018; Liu et al., 2019; Yu et al., 2018; Wray et al., 2019; Amrani et al., 2020). We build upon this work and remove the need for human generated textual summaries or ASR transcripts by learning from videos and their raw audio.
Much of the prior work on learning from video and audio has been focused on correlating objects in videos with the sounds they produce (e.g., sight and sound of musical instruments) as a signal for self-supervised learning of both audio and visual features (Arandjelovic and Zisserman, 2017; Aytar et al., 2016; Owens et al., 2016a, b; Korbar et al., 2018; Yang et al., 2020). This idea has been further developed for visually-guided audio source separation (Zhao et al., 2018; Gao et al., 2018; Owens and Efros, 2018; Rouditchenko et al., 2019; Zhao et al., 2019; Gao and Grauman, 2019b), sound generation from silent videos (Zhou et al., 2018c; Owens et al., 2016a), sound localization in video frames (Arandjelovic and Zisserman, 2018; Gan et al., 2019), face and voice association (Nagrani et al., 2018; Kim et al., 2018), and video-based audio localization (Morgado et al., 2018; Gao and Grauman, 2019a). While our work has a similar focus on audio and video, instead of learning the source of audio in the visual domain, we focus on learning the semantic correlations between objects and their spoken language descriptions.
Current approaches to learning language representations from videos rely on text annotations and do not primarily leverage the existing audio in videos (Miech et al., 2019, 2020; Sun et al., 2019b, a). Formally, these approaches start with a corpus of $n$ videos $\mathcal{X} = \{(A_i, V_i)\}_{i=1}^{n}$, where $A_i$ and $V_i$ denote the audio samples and visual sequence in the $i$-th video. Since videos can be several minutes long, they are further segmented into shorter clips using supervision such as human annotation or the silence boundaries from ASR transcripts. Once the clip boundaries are decided, this results in a corpus of clips $\mathcal{C}_{tv} = \{(T_i^j, V_i^j)\}$, where $T_i^j$ and $V_i^j$ denote the text caption and visual sequence in the $j$-th clip of the $i$-th video. The text caption is written by a human annotator or generated from ASR transcripts and replaces the audio in each clip.
In our work, we generate training samples from the video corpus $\mathcal{X}$ without supervision by randomly segmenting each video into clips of length $t$ (which may overlap) to obtain a corpus of audio-video clips $\mathcal{C}_{av} = \{(A_i^j, V_i^j)\}$. Unlike previous methods, we do not replace the raw audio with text. This procedure allows us to sample clips without supervised annotation and enables greater flexibility to vary the number and length of clips in the resulting dataset. Although unsupervised clip selection may result in silent or non-salient clips, our experimental results (Section 4.4) show our models perform comparably whether trained on randomly sampled clips or on clips determined by ASR boundaries.
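The random segmentation described above can be sketched as follows. This is a minimal illustration rather than the authors' sampler: the function name `sample_random_clips`, the uniform choice of start times, and the example durations are all assumptions made for the sketch.

```python
import random


def sample_random_clips(video_duration, clip_length, num_clips, seed=None):
    """Return (start, end) times in seconds for randomly placed clips.

    Start times are drawn independently and uniformly (an assumption),
    so the resulting clips may overlap.
    """
    if clip_length > video_duration:
        raise ValueError("clip_length must not exceed video_duration")
    rng = random.Random(seed)
    latest_start = video_duration - clip_length
    clips = []
    for _ in range(num_clips):
        start = rng.uniform(0.0, latest_start)
        clips.append((start, start + clip_length))
    return clips


# Hypothetical example: four 10-second clips from a 5-minute video.
clips = sample_random_clips(video_duration=300.0, clip_length=10.0, num_clips=4, seed=0)
```

Because no boundary annotation is consulted, the number of clips and their length are free parameters of the sampler.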
While our main contribution is developing self-supervised models that learn from the audio-video clips $\mathcal{C}_{av}$, we also show how to extend our models to use the textual summaries that exist in many video datasets. We use the text annotations from the text-video corpus $\mathcal{C}_{tv}$ to generate a corpus of clips $\mathcal{C}_{atv} = \{(A_i^j, T_i^j, V_i^j)\}$, where $A_i^j$, $T_i^j$, and $V_i^j$ denote the audio samples, text caption, and visual sequence in the $j$-th clip of the $i$-th video.
In this work, we introduce AVLnet, a self-supervised model architecture that learns the correlation between semantically related visual objects and audio, including speech, from video clips in the audio-video corpus $\mathcal{C}_{av}$. The AVLnet architecture consists of parallel audio and visual branches as shown in Figure 1. The audio branch consists of a convolutional model with residual layers as proposed in Harwath et al. (2018). It takes in spectrograms as input and first outputs a temporal feature map with dimensions $T \times d$, where $T$ is the downsampled temporal dimension and $d$ is the dimension of the joint audio-visual embedding space. The feature map is then mean-pooled over the time dimension to obtain a $d$-dimensional vector $\mathbf{a}$. The visual branch consists of a 2D and 3D CNN feature extraction pipeline as in Miech et al. (2019) (further described in Section 4.2) and outputs a $d$-dimensional vector $\mathbf{v}$.
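After the audio and visual features are extracted, we apply nonlinear gating (Miech et al., 2017) to both modalities:

$$f(\mathbf{a}) = (W_1^a \mathbf{a} + b_1^a) \circ \sigma\big(W_2^a (W_1^a \mathbf{a} + b_1^a) + b_2^a\big) \qquad (1)$$

$$g(\mathbf{v}) = (W_1^v \mathbf{v} + b_1^v) \circ \sigma\big(W_2^v (W_1^v \mathbf{v} + b_1^v) + b_2^v\big) \qquad (2)$$

where $f(\mathbf{a})$ and $g(\mathbf{v})$ represent the output language and visual embedding vectors respectively, matrices $W_1^a, W_2^a, W_1^v, W_2^v$ and vectors $b_1^a, b_2^a, b_1^v, b_2^v$ are learnable parameters, $\circ$ denotes element-wise multiplication, and $\sigma$ is an element-wise sigmoid activation.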
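The gating step can be illustrated with a small sketch. It assumes the gated-embedding form associated with Miech et al. (2017): a linear projection multiplied element-wise by a sigmoid gate computed from that same projection. The helper names and the toy parameters are hypothetical, and plain Python lists stand in for tensors.

```python
import math


def linear(weights, bias, x):
    """Compute W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_embedding(x, w1, b1, w2, b2):
    """Project x, then scale each projected value by a sigmoid gate (assumed form)."""
    h = linear(w1, b1, x)                           # h = W1 x + b1
    gate = [sigmoid(z) for z in linear(w2, b2, h)]  # sigma(W2 h + b2)
    return [hi * gi for hi, gi in zip(h, gate)]     # element-wise product


# Toy parameters: identity projection and an all-zero gate layer, so every gate is 0.5.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
embedding = gated_embedding([2.0, -4.0], identity, [0.0, 0.0], zeros, [0.0, 0.0])
```

With these toy parameters the output is the input scaled by one half, which makes the role of the gate easy to see: it rescales each projected dimension by a value between 0 and 1.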
The AVLnet architecture is able to learn visually grounded language without text captions. However, our model is also capable of incorporating the text captions in the audio-text-video corpus $\mathcal{C}_{atv}$, enabling it to utilize the textual information that exists in many instructional datasets. To incorporate text in AVLnet, we add a third branch that processes the text caption into a $d$-dimensional vector $\mathbf{t}$. Due to the complementary language information in the raw audio and text, we fuse the outputs of the audio and text branches before non-linear gating. Specifically, we modify Equation 1 as follows:

$$f(\mathbf{a}, \mathbf{t}) = (W_1^a \mathbf{a} + W_1^t \mathbf{t} + b_1^a) \circ \sigma\big(W_2^a (W_1^a \mathbf{a} + W_1^t \mathbf{t} + b_1^a) + b_2^a\big) \qquad (3)$$

where $f(\mathbf{a}, \mathbf{t})$ represents the output language embedding vector combining speech and text information, matrices $W_1^a, W_1^t, W_2^a$ and vectors $b_1^a, b_2^a$ are learnable parameters, $\circ$ denotes element-wise multiplication, and $\sigma$ is the element-wise sigmoid activation. The visual embedding vector (Equation 2) remains unchanged. We refer to this variant of AVLnet as AVLnet-Text. In Section 4.4, we show that our proposed fusion approach outperforms an alternative architecture in which audio and text are processed in independent branches.
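As a rough sketch of the fusion idea, the audio and text vectors can each be projected, summed, and then passed through a single shared gate. The exact parameterization is an assumption here; the function and argument names are hypothetical.

```python
import math


def matvec(w, x):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def fused_language_embedding(a, t, w_a, w_t, b1, w2, b2):
    """Sum the audio and text projections before gating (assumed form)."""
    proj_a = matvec(w_a, a)
    proj_t = matvec(w_t, t)
    h = [pa + pt + b for pa, pt, b in zip(proj_a, proj_t, b1)]
    gate = [1.0 / (1.0 + math.exp(-(z + b))) for z, b in zip(matvec(w2, h), b2)]
    return [hi * gi for hi, gi in zip(h, gate)]


# Toy parameters: identity projections and an all-zero gate layer (gate = 0.5).
eye = [[1.0, 0.0], [0.0, 1.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]
fused = fused_language_embedding([1.0, 2.0], [3.0, 4.0], eye, eye, [0.0, 0.0], zero, [0.0, 0.0])
```

Fusing before the gate means one language vector is compared against the visual embedding, in contrast to keeping separate audio and text embeddings.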
Due to the self-supervised nature of AVLnet and AVLnet-Text, we use a contrastive loss function that maximizes the similarity between audio and video from the same clip while minimizing the similarity of audio paired with imposter video from another clip or video paired with imposter audio from another clip. Here we define the similarity between audio and video as the dot product of their learned embedding vectors. In particular, we utilize the Masked Margin Softmax (MMS) loss function (Ilharco et al., 2019), and explore other loss functions in Section 4.4.
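Unlike the triplet loss function used in prior unsupervised audio-image modeling (Harwath et al., 2018) that samples imposter pairs randomly or using negative mining, the MMS loss enables comparisons of positives with a wider range of negative samples. During training, we use a batch size of $B$ videos and sample $M$ clips per video, resulting in $BM$ video clips per batch. The MMS loss trains the model to discriminate between the true audio-visual embedding pairs $(\mathbf{a}_i, \mathbf{v}_i)$, and all imposter pairs where either the audio is paired with a visual imposter $(\mathbf{a}_i, \mathbf{v}_k)$, or the visuals are paired with an audio imposter $(\mathbf{a}_j, \mathbf{v}_i)$. The indices ($i$, $j$, $k$) indicate the index of the video clip in the batch. The loss is defined as:

$$\mathcal{L}_{MMS} = -\frac{1}{BM}\sum_{i=1}^{BM}\left[\log\frac{e^{f(\mathbf{a}_i)\cdot g(\mathbf{v}_i)-\delta}}{e^{f(\mathbf{a}_i)\cdot g(\mathbf{v}_i)-\delta}+\sum_{k\neq i}e^{f(\mathbf{a}_i)\cdot g(\mathbf{v}_k)}}+\log\frac{e^{f(\mathbf{a}_i)\cdot g(\mathbf{v}_i)-\delta}}{e^{f(\mathbf{a}_i)\cdot g(\mathbf{v}_i)-\delta}+\sum_{j\neq i}e^{f(\mathbf{a}_j)\cdot g(\mathbf{v}_i)}}\right]$$

where $\delta$ is the margin hyperparameter. To train AVLnet-Text, $f(\mathbf{a}_i)$ is replaced with $f(\mathbf{a}_i, \mathbf{t}_i)$. In other words, the audio sample and text caption from each clip are treated as inseparable and are sampled together. In our experiments, the margin hyperparameter $\delta$, the number of videos per batch $B$, and the number of clips per video $M$ are held fixed.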
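A simplified sketch of a margin-softmax contrastive loss in this spirit is given below. It assumes dot-product similarity, a margin subtracted from the positive score only, and that every non-matching pair in the batch serves as an imposter; it omits the numerical stabilization a real implementation would need. The names `mms_loss` and `margin` are illustrative, not the authors' code.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def mms_loss(audio_emb, video_emb, margin):
    """Two-directional softmax loss with a margin on the positive pair.

    audio_emb[i] and video_emb[i] come from the same clip; all other
    pairings in the batch are treated as imposters (an assumption).
    """
    n = len(audio_emb)
    sim = [[dot(a, v) for v in video_emb] for a in audio_emb]
    total = 0.0
    for i in range(n):
        pos = math.exp(sim[i][i] - margin)
        # Audio i scored against imposter videos (row i of the similarity matrix).
        video_imposters = sum(math.exp(sim[i][k]) for k in range(n) if k != i)
        # Video i scored against imposter audio (column i of the similarity matrix).
        audio_imposters = sum(math.exp(sim[j][i]) for j in range(n) if j != i)
        total -= math.log(pos / (pos + video_imposters))
        total -= math.log(pos / (pos + audio_imposters))
    return total / n


# Toy batch: correctly paired embeddings versus the same embeddings with videos swapped.
audio = [[5.0, 0.0], [0.0, 5.0]]
aligned = mms_loss(audio, [[5.0, 0.0], [0.0, 5.0]], margin=0.001)
swapped = mms_loss(audio, [[0.0, 5.0], [5.0, 0.0]], margin=0.001)
```

In the toy batch the correctly paired embeddings give a loss near zero, while swapping the videos makes every positive pair score lower than its imposters and the loss grows accordingly.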
We train AVLnet and AVLnet-Text on the 1.2 million instructional YouTube videos from the HowTo100M (Miech et al., 2019) dataset. The HowTo100M dataset provides video clip segmentations according to time intervals of each video’s ASR transcript and captions each clip with the text from its transcript. Since AVLnet-Text requires textual input, we train it on the video, audio, and text captions corresponding to the given clips (denoted by $\mathcal{C}_{atv}$ in Section 3.1); however, to reduce the amount of supervision in our method, we train AVLnet on the video and audio from randomly segmented clips (denoted by $\mathcal{C}_{av}$ in Section 3.1).
After training on HowTo100M, we evaluate and fine-tune our models on three established video and language datasets: YouCook2 (Zhou et al., 2018b), MSR-VTT (Xu et al., 2016), and LSMDC (Rohrbach et al., 2017). Each dataset provides human-annotated video clip boundaries and text summaries of the clips (details in the Appendix). We evaluate our models on the video clip and language retrieval tasks, in which a language query (audio or audio and text) is used to retrieve video and vice versa. In contrast to prior models applied to video and text annotations, our AVLnet model operates on the raw audio available in the clips. For both retrieval tasks, we use standard recall metrics R@1, R@5, R@10, and the median rank (Md. R).
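For reference, the recall and median-rank metrics can be computed from a query-by-candidate similarity matrix as in the sketch below. It assumes the correct candidate for query i sits at index i and that ties are resolved in favor of the correct item; the function name is hypothetical.

```python
from statistics import median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """sim[i][j] is the similarity of query i to candidate j.

    The correct candidate for query i is assumed to be candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of candidates scored strictly higher than the correct one.
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["MdR"] = median(ranks)
    return metrics


# Toy similarity matrix with three queries; the second query ranks its match second.
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]
toy_metrics = retrieval_metrics(toy_sim)
```

Higher R@k is better, while a lower median rank is better; swapping the roles of rows and columns gives the metrics for the opposite retrieval direction.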
In the AVLnet audio branch, the audio input is represented as a log Mel filterbank spectrogram. We use a 16 kHz sampling rate, 25 ms Hamming window, 10 ms window stride, and 40 Mel filter bands. During training we use 10 seconds of audio per video clip for HowTo100M, 50 seconds for YouCook2, and 30 seconds for MSR-VTT and LSMDC due to variation in clip length per dataset.
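To make these settings concrete, the following sketch computes the spectrogram size implied by the stated sampling rate, window, and stride. The frame-count convention (no padding, only complete windows) is an assumption; toolkits that pad the signal will report slightly different counts.

```python
def spectrogram_shape(duration_s, sample_rate=16000, window_ms=25, stride_ms=10, n_mels=40):
    """Return (frames, mel_bands) for a clip, counting only complete windows."""
    window = sample_rate * window_ms // 1000   # 400 samples at 16 kHz
    stride = sample_rate * stride_ms // 1000   # 160 samples at 16 kHz
    n_samples = int(duration_s * sample_rate)
    n_frames = 1 + (n_samples - window) // stride if n_samples >= window else 0
    return n_frames, n_mels


# Clip lengths used for HowTo100M, MSR-VTT/LSMDC, and YouCook2 respectively.
shapes = {seconds: spectrogram_shape(seconds) for seconds in (10, 30, 50)}
```

Under this convention a 10-second clip yields roughly a thousand frames of 40 Mel bands each, and the longer clips scale linearly.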
In the AVLnet visual branch, the 2D features are extracted at 1 feature per second using a ResNet-152 model (He et al., 2016) pretrained on ImageNet (Deng et al., 2009). The 3D features are extracted at 1.5 features per second using a ResNeXt-101 model (Hara et al., 2018) pretrained on Kinetics (Carreira and Zisserman, 2017). For both architectures, we use the pretrained models from PyTorch (Paszke et al., 2019) and the feature extraction implementation provided by Miech et al. (2019). The output of each model is max-pooled over the time dimension and concatenated, resulting in a single 4096-dimensional visual embedding vector for each clip. When training AVLnet, we do not update the weights of the 2D and 3D feature extractors due to GPU memory limitations.
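The pooling and concatenation step can be sketched with plain lists. Toy feature sizes stand in for the real extractor outputs, and the function names are hypothetical.

```python
def max_pool_over_time(features):
    """Collapse a list of per-timestep vectors into one vector of per-dimension maxima."""
    return [max(column) for column in zip(*features)]


def clip_visual_vector(features_2d, features_3d):
    """Max-pool each feature stream over time, then concatenate the two results."""
    return max_pool_over_time(features_2d) + max_pool_over_time(features_3d)


# Toy clip: two timesteps of 2-dimensional "2D" features and two of 3-dimensional "3D" features.
visual = clip_visual_vector([[1, 5], [3, 2]], [[0, 9, 4], [7, 1, 4]])
```

Because pooling removes the time axis, clips of different lengths and the two different extraction rates all map to a vector of the same size.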
In the AVLnet-Text branch, we generate text features using a feature extraction pipeline (Miech et al., 2019) that generates word embeddings from a GoogleNews pretrained Word2vec model (Mikolov et al., 2013) and max-pools over the embeddings of the words in each clip’s text caption. Although this text model is shallower than our audio model, a study of deeper text models for learning a text-video embedding found little improvement over this simple text model (Miech et al., 2020).
As described in Section 4.1, we evaluate AVLnet and AVLnet-Text on the video clip retrieval and language retrieval tasks on three datasets: YouCook2 (Zhou et al., 2018b), MSR-VTT (Xu et al., 2016), and LSMDC (Rohrbach et al., 2017). For video clip retrieval, AVLnet retrieves video clips given input audio and AVLnet-Text retrieves video clips given input audio and text. We compare our models to state-of-the-art text-video models that retrieve video clips given text (Miech et al., 2019, 2020; Amrani et al., 2020) and text-video models that additionally leverage audio (Wray et al., 2019; Yu et al., 2018; Liu et al., 2019). In contrast to our work, these methods encode audio jointly with video frames instead of with text. Further, Liu et al. (2019) use a pre-trained audio branch and additional models such as ASR, while our audio branch is not pre-trained. For language retrieval, given a video query, AVLnet retrieves audio and AVLnet-Text retrieves paired audio and text. We compare our models to state-of-the-art text-video models that retrieve text given input video and audio (Wray et al., 2019), or given video, audio, and ASR transcripts (Liu et al., 2019). Since prior language retrieval results were only available on MSR-VTT, we also evaluated the text-video model provided by Miech et al. (2019) on the language retrieval task.
Our models’ video and language retrieval results are shown in Table 1. AVLnet outperforms all prior models on YouCook2 (zero-shot and fine-tuned), with an 11.8% and 27.7% absolute increase in performance at R@10 over the previous state-of-the-art for the video clip and language retrieval tasks respectively. Compared with state-of-the-art text-video results (Miech et al., 2019, 2020; Amrani et al., 2020), AVLnet achieves higher zero-shot and similar fine-tuned performance on MSR-VTT, and similar performance to the baselines on LSMDC. Overall, our results indicate that AVLnet has learned powerful language representations despite never seeing text and only using raw audio.
The AVLnet-Text model further improves performance by leveraging text captions along with the raw audio and establishes state-of-the-art performance on all datasets. In the video clip retrieval task, AVLnet-Text improves the previous state-of-the-art R@10 score by 20.3% on YouCook2, 4.2% on MSR-VTT, and 13.8% on LSMDC. AVLnet-Text further improves the R@10 score on the language retrieval task by 38.2% on YouCook2, 1.2% on MSR-VTT, and 21.7% on LSMDC.
To better understand the performance gains our models achieve over state-of-the-art, we analyze retrieval examples from our AVLnet model fine-tuned on YouCook2. We show video and language retrieval examples from the YouCook2 validation set in Figure 2 (additional examples are shown in the Appendix). We find the retrieved results display high semantic similarity to salient content in the query. For example, in the top row in Figure 2(b), the query video clip shows oil spread on bread and the retrieved audio contains the words ‘bread’ and ‘spread’. This semantic relationship persists even when the correct clip is not the top result or is not in the top five results. For instance, in the bottom row of Figure 2(a), the correct clip is not recalled in the top five results; however, the video and retrieved audio are related to chopping green onions. Further, we find our model has learned to relate natural sounds to salient video clips. The middle row of Figure 2(a) shows an example audio query containing only sizzling sounds; the ASR system fails as there was no speech, yet our model retrieved video clips of frying oil. These results suggest our model has learned the semantic relationships between speech, natural sounds, and visual clips.
To verify the effectiveness of our proposed methods, we report the results of several additional experiments on AVLnet and AVLnet-Text in Table 2. For AVLnet, we investigate the effect of adding text to the audio-visual model during evaluation and fine-tuning, changing the HowTo100M clip sampling method, and varying the loss function. For AVLnet-Text, we investigate the effect of using independent audio and text branches as compared to our language fusion technique. We report each model’s video clip retrieval R@10 on all three evaluative datasets in the zero-shot setting and after fine-tuning.
Adding text from the downstream datasets. We evaluate the performance of AVLnet trained on audio and video from HowTo100M and fine-tuned/evaluated on audio, video, and text from YouCook2, MSR-VTT, and LSMDC. This experiment represents the scenario where obtaining text annotations during training is expensive, but text exists or can be obtained for smaller evaluative datasets or real world applications. In Table 2(a), we observe that the fine-tuned performance is higher than audio-video AVLnet, but lower than AVLnet-Text, indicating that using ASR text captions during training on HowTo100M is beneficial. Further, this result suggests that AVLnet learns language representations from speech, not just natural sounds or voice characteristics.
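Table 2: Video clip retrieval R@10 for the additional experiments, in the zero-shot (ZS) setting and after fine-tuning (FT).

| Experiment | Configuration | YouCook2 ZS | YouCook2 FT | MSR-VTT ZS | MSR-VTT FT | LSMDC ZS | LSMDC FT |
|---|---|---|---|---|---|---|---|
| AVLnet Baseline | Random clips, no text, MMS loss | 54.3 | 63.0 | 40.3 | 51.0 | 10.1 | 26.1 |
| (a) Downstream Text | Text + audio for fine-tuning/evaluation | 49.3 | 66.3 | 37.0 | 59.7 | 10.4 | 44.4 |
| (b) Clip Sampling | HowTo100M ASR clips | 57.6 | 62.8 | 38.5 | 49.4 | 7.7 | 26.1 |
| (c) Loss Function | Max-Margin Ranking Loss | 27.4 | 39.1 | 29.8 | 39.3 | 6.7 | 24.2 |
| AVLnet-Text Baseline | Audio & Text Branch Fusion | 64.4 | 71.5 | 50.7 | 66.6 | 15.3 | 48.6 |
| (d) No Fusion | Independent Audio & Text Branches | 57.0 | 65.5 | 50.9 | 64.9 | 17.0 | 48.0 |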
HowTo100M clip selection. We compare our approach using randomly sampled HowTo100M video clips to train AVLnet to the approach of prior work (Miech et al., 2019, 2020; Amrani et al., 2020) that used video clips segmented at the ASR speech boundaries. Table 2(b) shows AVLnet performs similarly on downstream tasks regardless of sampling method, suggesting our approach reduces the need for supervised clip sampling.
Loss function. We compare the MMS loss with the Max-Margin Ranking loss and the Noise Contrastive Estimation (NCE) loss (Gutmann and Hyvärinen, 2010; Jozefowicz et al., 2016) to train AVLnet. We find the MMS loss and the NCE loss to outperform the Max-Margin Ranking loss (Table 2(c)), prompting us to use the MMS loss in our experiments.
Processing text with an independent branch. We study an alternative AVLnet-Text architecture that processes text in an independent branch instead of fusing the text and audio branches as in Equation 3. The MMS loss is applied over each of the modality pairs (audio-video, audio-text, and video-text), and the branches are jointly optimized through the sum of these three losses. During evaluation, we use the sum of the audio and text embedding vectors to retrieve video clips. Table 2(d) shows that this approach performs worse than AVLnet-Text.
To understand the audio-visual concepts learned by our models, we employ the unit visualization technique introduced by Zhou et al. (2015). In this procedure, we calculate the audio and visual purity of each dimension in AVLnet’s learned audio-visual embedding space. We pass each YouCook2 validation clip through the AVLnet model trained on HowTo100M and fine-tuned on YouCook2 to extract its video and frame-level audio features. To extract frame-level audio features, we remove the temporal pooling layer from the audio branch. Each audio frame is mapped to the window of audio surrounding it and to the words spoken during that window according to the ASR transcripts. Each video clip is mapped to its set of food object labels given by the YouCook2 dataset (Zhou et al., 2018a). Each dimension is given an audio label (the word that occurs in the largest number of the dimension’s top 50 maximally activating audio frames) and a visual label (the food label that occurs in the largest number of the dimension’s top 50 maximally activating video clips). We calculate audio purity and visual purity as the fraction of the dimension’s top 50 maximally activating audio frames or video clips, respectively, that contain the dimension’s label.
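The labeling and purity computation for one dimension can be sketched as follows, assuming each of the dimension's top activating items (audio frames or video clips) has already been mapped to a set of words or food labels. The function name and the toy labels are illustrative only.

```python
from collections import Counter


def label_and_purity(top_items):
    """Return the most frequent label and the fraction of items containing it.

    top_items holds one collection of labels per maximally activating item;
    at least one item must carry a label.
    """
    counts = Counter(label for labels in top_items for label in set(labels))
    label, hits = counts.most_common(1)[0]
    return label, hits / len(top_items)


# Toy example with four top items instead of fifty (labels are made up).
toy_top_items = [{"oil", "pan"}, {"oil"}, {"pan", "oil"}, {"salt"}]
toy_label, toy_purity = label_and_purity(toy_top_items)
```

Each label is counted at most once per item, so purity is the share of items that mention the winning label at all, not the share of label occurrences.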
To identify dimensions that have learned audio-visual concepts, we sort all dimensions by the geometric mean of their audio and visual purity scores. The top 8 dimensions are shown in Figure 3 (additional dimensions are shown in the Appendix). Although the maximally activating video clips are chosen independently of the maximally activating audio, we find correspondences between the audio and visual content. For example, dimension 201’s audio and visual labels are ‘oil’ and ‘pan’, and its maximally activating clips show pans of oil. Similarly, dimension 1761’s labels are ‘baking’ and ‘flour’ and its top audio and visual frames contain language and visuals related to flour mixtures, and dimension 2655’s audio label, visual label, and maximally activating clips are all related to bowls of sauce. These results suggest AVLnet has learned to align semantically related audio and visual features to particular dimensions of the embedding space.
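Ranking dimensions by the geometric mean of their two purity scores can be sketched as below. The dictionary layout, the function name, and the purity values are made up for illustration.

```python
import math


def rank_dimensions(audio_purity, visual_purity, top_n=8):
    """Sort dimension ids by sqrt(audio purity * visual purity), highest first."""
    scores = {d: math.sqrt(audio_purity[d] * visual_purity[d]) for d in audio_purity}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Made-up purity scores for three dimensions.
toy_audio = {0: 0.9, 1: 0.4, 2: 0.64}
toy_visual = {0: 0.4, 1: 0.4, 2: 1.0}
toy_ranking = rank_dimensions(toy_audio, toy_visual, top_n=2)
```

The geometric mean rewards dimensions that are pure in both modalities: in the toy data, dimension 0 has the highest audio purity but drops to second place because its visual purity is low.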
In this paper, we present a novel self-supervised approach for learning audio-visual language representations from instructional videos. We circumvent the need for expensive and time-consuming data annotation by introducing the AVLnet model that learns from audio naturally present in these videos. This work establishes audio-video benchmarks on the YouCook2, MSR-VTT, and LSMDC video and language retrieval tasks and outperforms several text-video baselines. Further, we extend the AVLnet model to learn from audio, video, and text, leading to state-of-the-art performance on all downstream tasks. Finally, we show that training on natural audio from video enables our models to learn salient words and natural sounds, such as chopping and sizzling. Future work may include training the AVLnet visual branch on video frames instead of extracted visual features as well as further developing tri-branch architectures to learn from audio, video, and text.
We have demonstrated a method to learn correspondences between video and speech using video content naturally generated by humans instead of using manually annotated data. This enables the possibility of learning correspondences in any language in the world with such video content. As less than 2% of the world’s languages have Automatic Speech Recognition (ASR) capability, this presents a significant opportunity. Our work could help scale the advancements in speech technologies developed for these languages, which would enable a greater number of people to interact more effectively with computers.
Grounded learning of audio and visual concepts is a fundamental problem in machine learning. We believe that developing grounded learning is promising for addressing problems such as bias, accountability, and robustness because it will allow systems to learn from a much broader variety of data modalities, in a way more analogous to the multi-faceted way in which people learn from their environment. As such we think this will mitigate many of the problems that exist in current AI systems that present inconsistent behavior patterns and give the appearance of malevolence, but actually reflect nothing more than inadequate learning mechanisms. Such systems will learn in more natural and explainable ways in contrast to current approaches which pick up on (often irrelevant) minute differences in data characteristics.
Our work here relies heavily on video datasets curated from YouTube (e.g., HowTo100M, YouCook2, MSR-VTT). To comply with YouTube’s terms of service, these video datasets are typically distributed via URL, and each research group must scrape the videos independently. Over time, as YouTube and YouTubers remove videos from the platform, the original datasets shrink, making it challenging to reproduce, expand upon, and compare to our results. Further, there are ethical considerations using these datasets since YouTubers did not opt in to having their videos included in the datasets and videos used in this research may no longer exist publicly.
The authors are grateful for the support from the MIT-IBM Watson AI Lab.
Fernando et al. (2017). Self-supervised video representation learning with odd-one-out networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3636–3645.
Gutmann and Hyvärinen (2010). Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 297–304.
Merkx et al. (2019). Language learning using speech to image retrieval. In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH).
Paszke et al. (2019). PyTorch: an imperative style, high-performance deep learning library. In Proceedings of Neural Information Processing Systems (NeurIPS), pp. 8024–8035.
We train AVLnet and AVLnet-Text on instructional videos from the HowTo100M dataset (Miech et al., 2019) and evaluate our models on the YouCook2 instructional cooking video dataset (Zhou et al., 2018b), the MSR-VTT video dataset (Xu et al., 2016), and the LSMDC movie dataset (Rohrbach et al., 2017). We downloaded the HowTo100M and MSR-VTT datasets from YouTube between Dec. 2019 and Mar. 2020; the numbers we report reflect the videos that were available at the time of download.
The HowTo100M dataset (Miech et al., 2019) contains instructional YouTube videos from domains such as home and garden, computers and electronics, and food and entertaining. At the time of download, 1,166,089 videos were available on YouTube.
The YouCook2 dataset (Zhou et al., 2018b) consists of 2,000 instructional cooking videos from YouTube. The videos were separated into a 67-23-10 training-validation-testing split and categorized by humans into one of 89 recipe types (e.g., spaghetti and meatballs). Videos were segmented by human annotators into clips representing recipe steps, and each clip was annotated with a text summary of the recipe step. As in prior work (Miech et al., 2019), we evaluate on the validation clips because the test set does not contain text annotations. Following Miech et al. (2019), we use 9,586 training clips and 3,350 validation clips.
The MSR-VTT (Xu et al., 2016) dataset consists of YouTube videos from categories such as music and sports that are not necessarily instructional. Videos were segmented into video clips by human annotators and annotated with 20 natural language sentences each. At the time of download, 5,722 videos were available, resulting in 7,751 video clips. We train our model on 6,783 training clips and evaluate on the 968 audio-containing clips among the 1,000 test clips used in prior work (Yu et al., 2018; Miech et al., 2019). For consistency, we count the 32 test clips without audio as mistakes in our retrieval calculations.
The LSMDC dataset (Rohrbach et al., 2017) consists of movies with audio description (AD), that is, audio descriptions of movie scenes for viewers with visual impairments. The movies were split into video clips corresponding to scenes with AD narration, and each clip is annotated with the text transcript of the AD narration. Following Miech et al. (2019), we use 101,079 training clips and 1,000 testing clips. We use the audio from the original movie clips; however, the audio is often silent because AD narration is inserted at breaks in dialogue. The recorded AD narrations were not available.
In Section 4.3, we analyze the video and language retrieval results of our model and show qualitative retrieval examples in Figure 2. We show additional video and language retrieval examples in Figures A1 and A2, respectively. These examples were generated using AVLnet trained on HowTo100M and fine-tuned on YouCook2. Consistent with our findings in Section 4.3, we find the recalled clips are often semantically related to the query clip.
In Section 4.5, we show AVLnet learns to relate semantically related audio and visual features to dimensions of the shared embedding space. In Figure A3, we show six additional dimensions that exhibit salient relationships between their maximally activating audio and visual segments. In particular, Figure A3(a) shows dimensions that activate on words such as ‘chicken’ and ‘egg’ and Figure A3(b) shows dimensions that activate on actions such as ‘cut’ and ‘stir’. In Figure A3(c) we show dimensions that activate on natural sounds (e.g., sizzling and chopping) as opposed to speech.