Joint understanding of video and language is an active research area with many applications. Prior work in this domain typically relies on learning text-video embeddings. One difficulty with this approach, however, is the lack of large-scale annotated video-caption datasets for training. To address this issue, we aim at learning text-video embeddings from heterogeneous data sources. To this end, we propose a Mixture-of-Embedding-Experts (MEE) model with ability to handle missing input modalities during training. As a result, our framework can learn improved text-video embeddings simultaneously from image and video datasets. We also show the generalization of MEE to other input modalities such as face descriptors. We evaluate our method on the task of video retrieval and report results for the MPII Movie Description and MSR-VTT datasets. The proposed MEE model demonstrates significant improvements and outperforms previously reported methods on both text-to-video and video-to-text retrieval tasks. Code is available at: https://github.com/antoine77340/Mixture-of-Embedding-ExpertsREAD FULL TEXT VIEW PDF
Learning text-video embeddings usually requires a dataset of video clips...
Our objective in this work is video-text retrieval - in particular a joi...
Video-text retrieval plays an essential role in multi-modal research and...
Most existing video-and-language (VidL) research focuses on a single dat...
The rapid growth of video on the internet has made searching for video
Employing large-scale pre-trained model CLIP to conduct video-text retri...
The canonical approach to video-and-language learning (e.g., video quest...
Automatic video understanding is an active research topic with a wide range of applications including activity capture and recognition, video search, editing and description, video summarization and surveillance. In particular, the joint understanding of video and natural language holds a promise to provide a convenient interface and to facilitate access to large amounts of video data. Towards this goal recent works study representations of vision and language addressing tasks such as visual question answering [1, 2], action learning and discovery [3, 4, 5], text-based event localization  as well as video captioning, retrieval and summarization [7, 8, 9, 10]. Notably, many of these works adopt and learn joint text-video representations where semantically similar video and text samples are mapped to close points in the joint embedding space. Such representations have been proven efficient for joint text-video modeling e.g., in [7, 8, 9, 4, 6].
Learning video representations is known to require large amounts of training data [11, 12]. While video data with label annotations is already scarce, obtaining a large number of videos with text descriptions is even more difficult. Currently available video datasets with ground truth captions include DiDeMo  (27K unique videos), MSR-VTT (10K unique videos) and the MPII Movie Description dataset  (120K unique videos). To compensate for the lack of video data, one possibility would be to pre-train visual representations on still image datasets 
with object labels or image captions such as ImageNet, COCO , Visual Genome  and Flickr30k . Pre-training, however, does not provide a principled way of learning from different data sources and suffers from the “forgetting effect” where the knowledge acquired from still images is removed during fine-tuning on video tasks. More generally, it would be beneficial to have methods that can learn embeddings simultaneously from heterogeneous and partially-available data sources such as appearance, motion and sound but also from other modalities such as facial expressions or human poses.
In this work we address the challenge of learning from heterogeneous data sources. Our method is designed to learn a joint text-video embedding and is able to handle missing video modalities during training. To enable this property, we propose a Mixture-of-Embedding-Experts (MEE) model that computes similarities between text and a varying number of video modalities. The model is learned end-to-end and generates expert weights determining individual contributions of each modality. During training we combine image-caption and video-caption datasets and treat images as a special case of videos without motion and sound. For example, our method can learn an embedding for “Eating banana” even if “banana” only appears in training images but never in training videos (see Fig. 1). We evaluate our method on the task of video retrieval and report results for the MPII Movie Description and MSR-VTT datasets. The proposed MEE model demonstrates significant improvements and outperforms all previously reported methods on both text-to-video and video-to-text retrieval tasks.
Our MEE model can be easily extended to other data sources beyond global appearance, motion and sound. In particular, faces in video contain valuable information including emotions, gender, age and identities of people. As not all videos contain people, faces constitute a typical case of a potentially missing data source for our model. To demonstrate the generalization of our model and to show the importance of faces for video retrieval, we compute facial descriptors for images and videos with faces. We then treat faces as an additional data source in the MEE model and aggregate facial descriptors within a video (see Fig. 2). The resulting MEE combining faces with appearance, motion and sound produces consistent improvements in our experiments.
This paper provides the following contributions: (i) First, we propose a new model for learning a joint text-video embedding called Mixture-of-Embedding-Experts (MEE). The model is designed to handle missing video modalities during training and enables simultaneous learning from heterogeneous data sources. (ii) We showcase two applications of our framework. First, we can data augment video-caption datasets with image-caption datasets during training. We can also leverage face descriptors in videos to improve the joint text-video embedding. In both cases, we show improvements in several video retrieval benchmarks. (iii) By using MEE and leveraging multiple sources of training data we outperform state-of-the-art on the standard text-to-video and video-to-text retrieval benchmarks defined by the LSMDC  challenge.
In this section we review prior work related to vision and language, video representations and learning from sources with missing data.
There is a large amount of work leveraging language in computer vision. Language is often used as a more powerful and subtle source of supervision than predefined classes. One way to leverage language in vision is to find a joint embedding space for both visual and textual modalities[19, 20, 7, 8, 9, 21, 22, 23]. In this common embedding space, visual and textual samples are close if and only if they are semantically similar. This common embedding space enables multiple applications such as text-to-image/video retrieval and image/video-to-text retrieval. The work of Aytar et al.  is going further by learning a cross-modal embedding space for visual, textual and aural samples. In vision, language is also used in captioning where the task is to generate a descriptive caption of an image or a video[25, 26, 27, 10]. Another related application is visual question answering [28, 29, 1, 2]. A useful application of learning jointly from video and text is the possibility of performing video summarization with natural language . Other works also tackle the problem of visual grounding of sentences: it can be applied to spatial grounding in images [25, 18, 30] or temporal grounding (i.e temporal localization) in videos [4, 6]. Our method improves text-video embeddings and has potential to improve any method relying on such representations.
separate videos into multiple stream of modalities. The appearance, which are features capturing visual cues, the motion, computed from optical flow estimation or dense trajectories, and the audio signal are the commonly used video modalities. Investigating on which video descriptors to combine and how to efficiently fuse them has been extensively studied. Most prior works [12, 31, 37, 33, 38] address the problem of appearance and motion fusion for video representation. Other more recent works [39, 34] explore appearance-audio two-stream architectures for video representation. This and other work has consistently demonstrated the benefits of combining different video modalities for tasks such as video classification and action recognition. Similar to previous work in video understanding, our model combines multiple modalities but can also handle missing modalities during training and testing.
Our work is also closely related to learning methods designed to handle missing data. Handling missing data in machine learning is far from being a solved problem, yet it is widespread in various fields. Data can be missing due to several reasons: it can be corrupted, it may have not been possible to record the data or, in some cases, the data may be intentionally missing (take an example of forms with answers to some fields being optional). Common practices in machine learning aim at imputing the missing values with a default value such as zero, the mean, the median or the most frequent value in the discrete case111http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html. In the matrix completion theory, a low rank approximation of the matrix  can be performed to fill the missing values. In computer vision, one main application of learning with missing data is the inpainting task. Several approaches such as: Low rank matrix factorization , Generative Adversarial Network  or more recently  have successfully addressed the problem. The UberNet network  is a universal multi-task model aiming at solving multiple problems such as: object detection, object segmentation or surface normal estimation. To do so, the model is trained on a mix of different annotated datasets, each one having its own task-oriented set of annotation. Their work is also related to ours as we also combine diverse types of datasets. However in our case, we have to address the problem of missing video modalities instead of missing task annotation.
Handling missing modalities can be seen as a specific case of learning from missing data. In image recognition the recent work  has tackled the task of learning with missing modalities to treat the problem of missing sensor information. In this work, we address the problem of missing video modalities. As explained above, videos can be divided into multiple relevant modalities such as appearance, audio and motion. Being able to train and infer models without all modalities makes it possible to mix different type of data such as illustrated in Figure 1.
In this section we introduce the proposed mixture of embedding experts (MEE) model and explain how this model handles heterogeneous input sources with incomplete sets of data streams during both training and inference.
Our goal is to learn a common embedding space for video and text. More formally, if is a sentence and a video, we would like to learn embedding functions and such that similarity is high if and only if and are semantically similar. We assume that each input video is composed of different streams of descriptors, that represent, for example, motion, appearance, audio, or facial appearance of people. Note that as we assume the videos come from diverse data sources a particular video may contain only a subset of these descriptor types. For example, some videos may not have audio, or will not have face descriptors when they don’t depict people. As we will show later, the same model will be able to represent still images as (very) simple videos composed of a single frame without motion. To address the issue that not all videos will have all descriptors, we design a model inspired by the mixture of experts , where we learn a separate “expert” embedding model for each descriptor type. The expert embeddings are combined in an end-to-end trainable fashion using weights that depend on the input caption. As a result, the model can learn to increase the relative weight of motion descriptors for input captions concerning human actions, or increase the relative weight of face descriptors for input captions that require detailed face understanding.
The overview of the model is shown in Figure 2. Descriptors of each input stream are first aggregated over time using the temporal aggregation module and the resulting aggregated descriptor is embedded using a gated embedding module (see 3.4). Similarly, the individual word embeddings from the input caption are first aggregated using a text aggregation module into a single descriptor, which is then embedded using gated embedding modules , one for each input source . The resulting expert embeddings for each input source are then weighted using normalized weights estimated by the weight estimation module from caption to obtain the final similarity score . Details of the individual components are given next.
The textual input is a sequence of word embeddings for each input sentence. These individual word embedding vectors are then aggregated into a single vector representing the entire sentence using a NetVLAD aggregation module, denoted . This is motivated by the recent results 
demonstrating superior performance of NetVLAD aggregation over other common aggregation architectures such as long short-term memory (LSTM)
or gated recurrent units (GRU).
The gated embedding module takes a -dimensional feature as input and embeds (transforms) it into a new feature in -dimensional output space. This is achieved using the following sequence of operations:
where are learnable parameters, is an element-wise sigmoid activation and is the element-wise multiplication (Hadamard product). Note that the first layer, given by (1), describes a projection of the input feature to the embedding space . The second layer, given by (2), performs context gating , where individual dimensions of are reweighted using learnt gating weights with values between 0 and 1, where and are learnt parameters. The motivation for such gating is two-fold: (i) we wish to introduce non-linear interactions among dimensions of and (ii) we wish to recalibrate the strengths of different activations of through a self-gating mechanism. Finally, the last layer, given by (3), performs L2 normalization to obtain the final output . We experimentally demonstrate the benefits of using gated embedding functions in section 4.
In this section we explain how to compute the final similarity score between the input text sentence and video . Recall, that each video is represented by several input streams of descriptors. Our proposed model learns separate (expert) embedding between the input text and each of the input video streams. These expert embeddings are then combined together to obtain the final similarity score. More formally, we first compute a similarity score between the input sentence and input video stream
where is the text embedding composed of aggregation module and gated embedding module ; is the embedding of the input video stream composed of descriptor aggregation module and gated embedding module ; and denotes a scalar product. Please note that we learn a separate text embedding for each input video stream . In other words, we learn different embedding parameters to match the same input sentence to different video descriptors. For example, such embedding can learn to emphasize words related to facial expressions when computing similarity score between the input sentence and the input face descriptors, or to emphasize action words when computing the similarity between the input text and input motion descriptors.
The goal is to combine the similarity scores between the input sentence and different streams of input descriptors into the final similarity score. To achieve that we employ the mixture of experts approach . In detail, the final similarity score between the input sentence and video is computed as
where is the weight of similarity score predicted from the input sentence , is the aggregated sentence representation and , the learnt parameters. Please note again that the weights of experts are predicted from sentence . In other words, the input sentence provides a prior on which of the embedding experts to put more weight to compute the final global similarity score. The estimation of the weight of the different input streams can be seen as an attention mechanism that uses the input text sentence. For instance, we may expect to have high weight on the motion stream for input captions such as: “The man is practicing karate”, facial descriptors for captions such as “Barack Obama is giving a talk”, or on audio descriptors for input captions such as “The woman is laughing out loud”.
Please note that equation (5) can be viewed as a single text-video embedding , where:
is the vector concatenating individual text embedding vectors weighted by estimated expert weights , and is the concatenation of the individual video embedding vectors . This is important for retrieval applications in large-scale datasets, where individual embedding vectors for text and video can be pre-computed offline and indexed for efficient search using techniques such as product quantization .
The formulation of the similarity score as a mixture of experts provides a proper way to handle situations where the input set of video streams is incomplete. For instance, when audio descriptors are missing for silent videos or when face descriptors are missing in shots without people. In detail, in such situations we estimate the similarity score using the remaining available experts by renormalizing the remaining mixture weights to sum to one as
where indexes the subset of available input streams for the particular input video
. When training the model, the gradient thus only backpropagates to the available branches of both text and video.
To train the model, we use the bi-directional max-margin ranking loss [51, 21, 52, 22] as we would like to learn an embedding that works for both text-to-video and video-to-text retrieval tasks. More formally, at training time, we sample a batch of sentence-video pairs where is the batch size. We wish to enforce that, for any given , the similarity score between video and its ground truth caption is greater than every possible pair of scores and , where of non-matching videos and captions. This is implemented by using the following loss for each batch of sentence-video pairs
where is the similarity score of sentence and video , and is the margin. We set in practice.
In this section, we report experiments with our mixture of embedding experts (MEE) model on different text-video retrieval tasks. We perform a thorough ablation study to highlight the benefits of our approach and compare the proposed model with current state-of-the-art methods.
In the following, we describe the used datasets and details of data pre-processing and training procedures.
We perform experiments on the following three datasets:
1 - MPII movie description/LSMDC dataset. We report results on the MPII movie description dataset . This dataset contains 118,081 short video clips extracted from 202 movies. Each video has a caption, either extracted from the movie script or from transcribed audio description. The dataset is used in the Large Scale Movie Description Challenge (LSMDC). We report experiments on two LSMDC challenge tasks: movie retrieval and movie annotation. The first task evaluates text-to-video retrieval: given a sentence query, retrieve the corresponding video from 1,000 test videos. The performance is measured using recall@k (higher is better) for different values of k, or median rank (lower is better). The second, movie annotation task evaluates video-to-text retrieval: we are provided with 10,053 short clips, where each clip comes with five captions, with only one being correct. The goal is to find the correct one. The performance is measured using the accuracy. For both tasks we follow the same evaluation protocol as described on the LSMDC website222https://sites.google.com/site/describingmovies/lsmdc-2017.
2 - MSR-VTT dataset. We also report several experiments on the MSR-VTT dataset . This dataset contains 10,000 unique Youtube video clips. Each of them is annotated with 20 different text captions, which results in a total of 200,000 unique video-caption pairs. Because we are only provided with URLs for each video, some of the video are, unfortunately, not available for download anymore. In total, we have successfully downloaded 7,656 videos (out of the original 10k videos). Similar to the LSMDC challenge and , we evaluate on the MSR-VTT dataset the text-to-video retrieval task on randomly sampled 1,000 video-caption pairs from the test set.
3 - COCO 2014 Image-Caption dataset.
We also report results on the text to still image retrieval task on the 2014 version of the COCO image-caption dataset. Again, we emulate the LSMDC challenge and evaluate text-to-image retrieval on randomly sampled 1000 image-caption pairs from the COCO 2014 validation set.
|Evaluation task||Text-to-Video retrieval||Video-to-Text retrieval|
|Zero-padding + Gated Embd|
|Zero-padding + Gated Embd + Face|
|Zero-padding + Gated Embd + COCO|
|Zero-padding + Gated Embd + COCO + Face|
|MEE + Gated Embd|
|MEE + Gated Embd + Face|
|MEE + Gated Embd + COCO|
|MEE + Gated Embd + COCO + Face||12.7||28.9||39.6||21||76.0|
For text pre-processing, we use the Google News333GoogleNews-vectors-negative300 trained word2vec word embeddings . For sentence representation, we use NetVLAD  with 32 clusters. For videos, we extract frames at 25 frames per seconds and resize each frame to have a consistent height of 300 pixels. We consider up to four different descriptors representing the visual appearance, motion, audio and facial appearance. We pre-extract the descriptors for each input video resulting in up to four input streams of descriptors. The appearance features are extracted using the Imagenet pre-trained ResNet-152  CNN. We extract 2048-dimensional features from the last global average pooling layer. The motion features are computed using a Kinetics pre-trained I3D flow network . We extract the 1024-dimensional features from the last global average pooling layer. The audio features are extracted using the audio CNN . Finally, for the face descriptors, we use the dlib framework444http://dlib.net/
to detect and align faces. Facial features are then computed on the aligned faces using the same framework, which implements a ResNet CNN trained for face recognition. For each detected face, we extract 128-dimensional representation. We use max-pooling operation to aggregate appearance, motion and face descriptors over the entire video. To aggregate the audio features, we follow and use a NetVLAD module with 16 clusters.
Our work was implemented using the PyTorch555http://pytorch.org/ framework. We train our models using the ADAM optimizer . On the MPII dataset, we use a learning rate of with a batch size of 512. On the MSR-VTT dataset, we use a learning rate of with a batch size of 64. Each training is performed using a single GPU and takes only several minutes to finish.
To assess benefits of the proposed mixture of embeddings model, we introduce a standard embedding baseline that learns a single embedding function for both text and video without re-weighting the input streams. In detail, we concatenate the aggregated video descriptors into a single vector and pad unavailable descriptors with zeros. Then we learn a single text to video embedding using the same data and ranking loss.
The proposed embedding model is designed for learning from diverse and incomplete inputs. We demonstrate this ability on two examples. First, we show how a text-video embedding model can be learnt by augmenting captioned video data with captioned still images. For this we use the Microsoft COCO dataset  that contains captions provided by humans. Methods augmenting training data with still images from the COCO dataset are denoted (+COCO). Second, we show how our embedding model can incorporate an incomplete input stream of facial descriptors, where face descriptors are present in videos containing people but are absent in videos without people. Methods that incorporate face descriptors are denoted (+Face). We also evaluate benefits of the proposed gated embedding unit (+Gated Embd) described in section 3.4 and we compare results to the baseline embedding with zero-padding denoted (Zero-padding) described in section 4.2.
Table 1 shows a detailed ablation study on the LSMDC Text-to-Video and Video-to-Text retrieval tasks on the MPII movie dataset. The results clearly demonstrate that our model (MEE) is effective in incorporating captioned still images and face descriptors at training time clearly outperforming the zero-padding baseline.
|Evaluation set||COCO images||MPII videos|
|Zero-padding + Gated Embd + Face|
|Zero-padding + Gated Embd + Face + COCO||29.5||64.6||80.1||3||10.5||26.1||37.1||26||75.1|
|MEE + Gated Embd + Face|
|MEE + Gated Embd + Face + COCO||31.4||64.5||79.3||3||12.7||28.9||39.6||21||76.0|
|Evaluation set||COCO images||MSR-VTT videos|
|Zero-padding + Gated Embd + Face||14.3||38.3||52.8||10|
|Zero-padding + Gated Embd + Face + COCO||8.6||27.0||42.8||14||10|
|MEE + Gated Embd + Face|
|MEE + Gated Embd + Face + COCO||20.7||54.5||72.0||5||16.8||41.0||54.4||9|
Next, we evaluate in detail the benefits of augmenting captioned video datasets (MSR-VTT and MPII movie) with captioned still images from the Microsoft COCO dataset. Table 2 shows the effect of adding the still image data during training. For all models, we report results on both the COCO image dataset and the MPII videos. For both our mixture of embeddings (MEE) model and the zero-padding baseline adding COCO images to the video training set improves performance on both COCO images but also MPII videos, showing that a single model trained from the two different data sources can improve performance on both datasets. This is an interesting result as the two datasets are quite different in terms of depicted scenes and textual captions. MS COCO dataset contains mostly Internet images of scenes containing multiple objects. MPII dataset contains video clips from movies often depicting people interacting with each other or objects.
We also evaluate the impact of augmenting MSR-VTT video caption dataset with the captioned still images from the MS COCO dataset. As the MSR-VTT is much smaller than the COCO dataset, it becomes crucial to carefully sample COCO image-caption samples when augmenting MSR-VTT during training. In detail, for each epoch, we randomly inject image-caption samples such that the ratio of image-caption samples to video-caption samples is set to a fixed sampling rate:. Note that means that no data augmentation is performed and means that exactly the same amount of COCO image-caption and MSR-VTT video-caption samples are used at each training epoch. Table 3 shows the effect of still image augmentation on the MSR-VTT dataset for the text-to-video retrieval task when the proportion of image-caption samples is half of the MSR-VTT video caption samples, i.e. . Figure 3 then illustrates the effect of varying the sampling rate . As opposed to the MPII dataset, the zero-padding baseline does not leverage the still image data augmentation as increasing decreases the retrieval performance. On the other hand, our proposed MEE model fully leverages the additional still images. Indeed, we observe significant gains in video retrieval performances for all metrics. Figure 4 shows qualitative results of our model highlighting some of the best relative improvement in retrieval ranking using the still image data augmentation. Note that many of the improved queries involve objects frequently appearing in the COCO dataset including elephant, umbrella, baseball or train.
The MPII movie dataset contains a significant amount of captions involving facial appearance including emotions (smiling, crying or sad), gender, facial features (blue eye, mustache or large nose), age (young, adult or an old person) or actions that involve faces (gazing, frowning or mouth opening). Here we evaluate the effect of adding face descriptors for videos that contain faces on the LSMDC retrieval tasks.
Table 1 shows benefits of including face descriptors (+face). We observe a significant increase in retrieval performance for our MEE model on all metrics while performance of the zero-padding baseline achieves only moderate improvements. Figure 5 shows qualitative results of some of the best relative improvements in retrieval ranking when adding face descriptors to our MEE model.
In summary, we observe that in comparison with the zero-padding embedding baseline the proposed MEE model obtains significant improvements in video retrieval performance when trained from heterogeneous (still images and videos) and incomplete (missing face descriptors in some videos) data sources.
|Evaluation task||Text-to-Video retrieval||Video-to-Text retrieval|
|SNUVL  (LSMDC16 Winner)|
|Miech et al. |
|CCA (FV HGLMM)  (same features)|
|JSFusion  (LSMDC17 Winner)|
|MEE + Gated Embd + COCO + Face (Ours)||12.7||28.9||39.6||21||76.0|
Table 4 compares our best approach to the state-of-the-art results on the LSMDC challenge test sets. Note that our approach significantly outperforms all other available results including JSFusion 666This method is yet unpublished, only slides from the LSMDC17 workshop are available., which is the winning method of the LSMDC 2017 Text-to-Video and Video-to-Text retrieval challenge. We also reimplemented the normalized CCA approach from Klein et al. . To make the comparison fair, we used our video features and word embeddings. Finally, we also significantly outperform the C+LSTM+SA+FC7  baseline that augments the MPII movie dataset with COCO image caption data.
We have described a new model, called mixture of embedding experts (MEE), that learns text-video embeddings from heterogeneous data sources and is able to deal with missing video input modalities during training. We have shown that our model can be trained from image-caption and video-caption datasets treating images as a special case of videos without motion and sound. In addition, we have demonstrated that our model can optionally incorporate at training, input stream of facial descriptors, where faces are present in videos containing people but missing in videos without people. We have evaluated our model on the task of video retrieval. Our approach outperforms all reported results on the MPII Movie Description. Our work opens-up the possibility of learning text-video embedding models from large-scale weakly-supervised image and video datasets such as the Flickr 100M .
This work has been partly supported by ERC grants ACTIVIA (no. 307574) and LEAP (no. 336845), CIFAR Learning in Machines Brains
program, European Regional Development Fund under the project IMPACT
(reg. no. CZ.02.1.01/0.0/0.0/15 003/0000468) and a Google Research Award.
Localizing moments in video with natural language.ICCV (2017)
Video paragraph captioning using hierarchical recurrent neural networks.In: CVPR. (2016) 4584–4593
Densecap: Fully convolutional localization networks for dense captioning.In: CVPR. (2016)
Ask your neurons: A neural-based approach to answering questions about images.In: ICCV. (2015)
In: Proceedings of the forty-fifth annual ACM symposium on Theory of computing, ACM (2013) 665–674
Semantic image inpainting with deep generative models.In: CVPR. (2017)
Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory.In: CVPR. (2017)
Missing modalities imputation via cascaded residual autoencoder.In: CVPR. (2017)