In this paper, we present a framework that jointly retrieves and spatiotemporally highlights actions in videos by enhancing current deep cross-modal retrieval methods. Our work takes on the novel task of action highlighting, which visualizes where and when actions occur in an untrimmed video setting. Action highlighting is a fine-grained task compared to conventional action recognition tasks, which focus on classification or window-based localization. Leveraging weak supervision from annotated captions, our framework acquires spatiotemporal relevance maps and generates local embeddings that relate to the nouns and verbs in captions. Through experiments, we show that our model generates varied maps conditioned on different actions, whereas conventional visual reasoning methods only go as far as showing a single deterministic saliency map. Our model also improves retrieval recall over our baseline without alignment by 2-3%.
Given the exponential growth of media in the digital age, precise and fine-grained retrieval has become a crucial issue. Content-based search, specifically embedding learning, is used for media retrieval within large amounts of data [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. Unlike simple classification with predefined labels, these methods learn instance-specific embeddings, allowing finer search granularity. Using embedding-based methods, retrieval is accomplished efficiently even in the cross-modal setting by carrying out a nearest neighbor search in the embedding space.
Embedding learning and retrieval of videos from text is especially challenging, due to the fact that videos contain substantial low-level information compared to semantically abstract captions. That is, video captions include high-level, global information often describing either only a salient part of a video or what the whole video is about. Conventional approaches to this problem in embedding learning between video and text leverage various rich features from videos which are not limited to visual phenomena, such as optical flow, sound, and detected objects [9, 12]. This aids the extraction of high-level information crucial for making caption-like embeddings. However, model interpretability degrades as more features are generated, and the reason behind the retrieval result is ambiguous.
To alleviate the gap in abstractness and maintain interpretability when retrieving videos from captions, we propose the novel task of action highlighting. Action highlighting is the challenging task of generating spatiotemporal voxels which score the relevance of each local region to a certain action class or word. Our method generates these voxels in a weakly-supervised manner, given only pairs of videos and captions. By aligning local regions, extracted using features from a spatiotemporal 3D-CNN, to the nouns and verbs in captions, our model generates robust embeddings for both videos and associated captions, which can then be used for retrieval.
We hypothesize that the information in captions that is crucial to retrieval concerns “what” is done “when” and “where” in the video. Therefore, as shown in Figure 1, we learn three different embeddings for the video-caption retrieval problem, namely the motion, visual, and joint spaces. These three spaces model the embeddings for verbs, nouns and whole captions respectively, to extract specific features from videos which relate to the above mentioned crucial information of the caption. This aids the model in obtaining important motion and visual features from the video, alleviating the redundancy existent in video features compared to caption features.
The contributions of our work are two-fold:
We address the novel setting of action highlighting, which requires the generation of spatiotemporal voxels conditioned on action words, indicating where and when actions occur. This enables reasoning and model interpretability in video-text retrieval.
We show through experiments that our spatiotemporal alignment loss generates local embeddings associated to verbs or nouns in captions, and improves video-text retrieval performance by 2-3% compared to our baseline model.
To extract motion and visual features from videos, we use the SlowFast network. This network uses two different branches: the slow branch extracts richer visual information using a higher spatial resolution, and the fast branch focuses on high-framerate motion information. The feature from the slow branch, $f^s \in \mathbb{R}^{C_s \times T_s \times H \times W}$, will be projected to the visual space, and that of the fast branch, $f^f \in \mathbb{R}^{C_f \times T_f \times H \times W}$, to the motion space. Here, $C_s$ and $C_f$ are the channel dimensions, and $T_s$ and $T_f$ are the temporal dimensions of the features extracted by the slow and fast paths respectively.
For textual features, we first tag the captions and extract the tokens whose part-of-speech tags mark them as verbs and nouns. These verb and noun tokens are converted to distributed representations with a learnable embedding matrix. Then, we encode these nouns and verbs with a token encoder, a GRU cell, and project the resulting features to each of the motion and visual spaces:
$$w^m = \sigma(W^m h + b^m), \qquad w^v = \sigma(W^v h + b^v),$$
where $h$ is the output of the GRU cell, $\sigma$ denotes the sigmoid function, and $W^m, W^v, b^m, b^v$ are weights and biases for affine transformation. We also use the sets of all verbs and nouns in the whole training dataset to sample negative verbs and nouns that are not seen in the same caption sets, and use the same model to generate their encodings $\bar{w}^m$ and $\bar{w}^v$.
We also extract global textual features using a caption encoder, which is a simple GRU. We denote this feature as $c$. Negative caption features, sampled randomly from captions outside the same caption set, are denoted $\bar{c}$. In our implementation, all textual embeddings generated by the caption and token encoders are $\ell_2$-normalized.
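The token-encoding pipeline above can be sketched as follows. This is a minimal NumPy illustration, assuming a single-token GRU cell with a zero initial state (in which case the cell reduces to a gated tanh); all dimensions and weight names are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def encode_token(token_id, E, Wz, Wh, W_m, b_m, W_v, b_v):
    """Encode one verb/noun token into the motion and visual spaces.

    With a zero initial hidden state, a GRU cell reduces to a gated
    tanh, h = gate * tanh(x @ Wh), which we use here for brevity.
    """
    x = E[token_id]                              # distributed representation
    h = sigmoid(x @ Wz) * np.tanh(x @ Wh)        # GRU cell, zero initial state
    w_m = l2_normalize(sigmoid(h @ W_m + b_m))   # motion-space encoding
    w_v = l2_normalize(sigmoid(h @ W_v + b_v))   # visual-space encoding
    return w_m, w_v

V, D, K = 100, 32, 16   # vocab size, embedding dim, space dim (assumed)
E  = rng.normal(0, 0.1, (V, D))                  # learnable embedding matrix
Wz = rng.normal(0, 0.1, (D, D)); Wh  = rng.normal(0, 0.1, (D, D))
W_m = rng.normal(0, 0.1, (D, K)); b_m = np.zeros(K)
W_v = rng.normal(0, 0.1, (D, K)); b_v = np.zeros(K)

w_m, w_v = encode_token(7, E, Wz, Wh, W_m, b_m, W_v, b_v)
```

Note that the $\ell_2$ normalization at the end makes the later cosine similarities reduce to dot products for the text side.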
Here, we describe our novel method on how to generate the spatiotemporal relevance map for action highlighting and align nouns and verbs to local regions in the spatiotemporal embeddings.
First, we pass the SlowFast features through a single trainable 3D convolution of kernel size 1, projecting them to feature volumes $v^v \in \mathbb{R}^{d \times T_s \times H \times W}$ and $v^m \in \mathbb{R}^{d \times T_f \times H \times W}$ respectively. Note that the spatiotemporal dimension sizes are arbitrary. Figure 2 shows our feature space alignment for both the motion and visual spaces. Given the $i$-th position of a video feature volume $v$, we use the positive token encodings $w^m$ and $w^v$ to calculate relevance maps in an attention-like procedure. The relevance map's $i$-th element is calculated by
$$a_i = \frac{\exp(\mathrm{sim}(v_i, w) / \tau)}{\sum_j \exp(\mathrm{sim}(v_j, w) / \tau)},$$
where $\mathrm{sim}(\cdot,\cdot)$ denotes the similarity function, implemented as cosine similarity. Note that the softmax temperature $\tau$ determines how "sharp" the relevance map is, and can be modified during inference for visualization. Then, we apply a weighted triplet loss function over the spatiotemporal positions of the video feature volume as per the following equations,
$$\mathcal{L}_{\mathrm{align}} = \sum_i a_i \max(0,\, \delta - s_i + \bar{s}_i).$$
Here, $s_i$ is equivalent to $\mathrm{sim}(v_i, w)$, $\bar{s}_i$ to $\mathrm{sim}(v_i, \bar{w})$, and $\delta$ is the margin hyperparameter. In this triplet loss function, we only use negatives on the caption side ($\bar{w}$), rendering it unidirectional. Weighting by the relevance map ensures that the alignment loss is only computed locally, where textual features describing motion or visual information are close to the visual features.
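The relevance map and the weighted unidirectional triplet loss described above can be sketched compactly in NumPy, assuming the local features have been flattened over spatiotemporal positions; all dimensions and the temperature value are illustrative.

```python
import numpy as np

def cosine(v, w):
    """Cosine similarity of each row of v (N, d) against a vector w (d,)."""
    return (v @ w) / (np.linalg.norm(v, axis=-1) * np.linalg.norm(w) + 1e-8)

def relevance_map(v, w, tau=0.1):
    """Softmax over all spatiotemporal positions of sim(v_i, w) / tau."""
    s = cosine(v, w) / tau
    s = s - s.max()                     # numerical stability
    e = np.exp(s)
    return e / e.sum()

def weighted_triplet_loss(v, w_pos, w_neg, delta=0.2, tau=0.1):
    """Hinge loss at each position, weighted by the relevance map;
    negatives are taken only on the caption (token) side."""
    a = relevance_map(v, w_pos, tau)
    s_pos = cosine(v, w_pos)
    s_neg = cosine(v, w_neg)
    return float((a * np.maximum(0.0, delta - s_pos + s_neg)).sum())

rng = np.random.default_rng(0)
v = rng.normal(size=(8, 16))            # 8 local positions, 16-d embeddings
w_pos = rng.normal(size=16)             # positive token encoding
w_neg = rng.normal(size=16)             # sampled negative token encoding

a = relevance_map(v, w_pos)
loss = weighted_triplet_loss(v, w_pos, w_neg)
```

Raising the temperature during inference flattens `a`, while lowering it sharpens the highlighted region, matching the visualization knob described in the text.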
We calculate this loss between the video features from the fast branch $v^m$ and the textual features of the verbs $w^m$, and between the video features from the slow branch $v^v$ and the textual features of the nouns $w^v$, resulting in two relevance maps $a^m$ and $a^v$ as well as two loss terms $\mathcal{L}^m_{\mathrm{align}}$ and $\mathcal{L}^v_{\mathrm{align}}$.
Using the video features $v^v$ and $v^m$, we construct a global feature for the video. We simply concatenate the two features and project them, using a single 3D convolution with sigmoid activation, into a joint video feature
$$v^g = \sigma(W^g * [v^v ; v^m] + b^g),$$
where $[\cdot\,;\cdot]$ denotes channel-wise concatenation and $W^g, b^g$ are the convolution weights and biases.
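The joint-feature construction can be sketched as below. As assumptions for this illustration, the two feature volumes are taken to share a common spatiotemporal size so they can be concatenated, and the kernel-size-1 3D convolution is written as a per-position channel matmul; all dimensions are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def joint_feature(v_vis, v_mot, W, b):
    """Concatenate visual and motion feature volumes along channels,
    then apply a kernel-size-1 3D convolution (a per-position linear
    map over channels) followed by a sigmoid.

    v_vis, v_mot: (C, T, H, W_sp) volumes, assumed resampled to a
    common spatiotemporal size for this sketch.
    """
    x = np.concatenate([v_vis, v_mot], axis=0)          # (C_v + C_m, T, H, W)
    # kernel-size-1 conv == matrix multiply over the channel axis
    y = np.einsum('oc,cthw->othw', W, x) + b[:, None, None, None]
    return sigmoid(y)

rng = np.random.default_rng(0)
v_vis = rng.normal(size=(4, 2, 3, 3))
v_mot = rng.normal(size=(4, 2, 3, 3))
W = rng.normal(size=(8, 8))     # (out_channels, C_v + C_m), assumed dims
b = np.zeros(8)

v_joint = joint_feature(v_vis, v_mot, W, b)
```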
The joint embedding loss $\mathcal{L}_{\mathrm{joint}}$ is calculated as
$$\mathcal{L}_{\mathrm{joint}} = \sum_{B} \Big[ \max\big(0,\, \delta - \mathrm{sim}(v^{g+}, c^+) + \mathrm{sim}(v^{g+}, c^-)\big) + \max\big(0,\, \delta - \mathrm{sim}(c^+, v^{g+}) + \mathrm{sim}(c^+, v^{g-})\big) \Big],$$
where $B$ denotes a batch and the $+/-$ superscripts on vectors denote whether the sample is a positive or a negative. This objective is known as the triplet loss and, specifically, we use all of the negatives in a single batch for good gradient signals during training, following prior work.
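The batch objective with all in-batch negatives can be sketched as follows; the use of a similarity matrix with the positives on its diagonal is an implementation assumption for this NumPy illustration.

```python
import numpy as np

def batch_triplet_loss(V, C, delta=0.2):
    """Bidirectional triplet loss over a batch, using every non-matching
    sample in the batch as a negative.

    V: (B, d) video embeddings, C: (B, d) caption embeddings;
    row i of each forms a positive pair.
    """
    Vn = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-8)
    Cn = C / (np.linalg.norm(C, axis=1, keepdims=True) + 1e-8)
    S = Vn @ Cn.T                         # S[i, j] = sim(video_i, caption_j)
    pos = np.diag(S)                      # similarities of positive pairs
    off = ~np.eye(len(S), dtype=bool)     # mask selecting the negatives
    # video -> caption: captions j != i act as negatives for video i
    l_v2c = np.maximum(0.0, delta - pos[:, None] + S)[off].sum()
    # caption -> video: videos i != j act as negatives for caption j
    l_c2v = np.maximum(0.0, delta - pos[None, :] + S)[off].sum()
    return float(l_v2c + l_c2v)

# Perfectly aligned orthogonal pairs incur zero loss at margin 0.2.
loss_zero = batch_triplet_loss(np.eye(3), np.eye(3), delta=0.2)

rng = np.random.default_rng(0)
loss_rand = batch_triplet_loss(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
```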
Our overall loss function for training embeddings is
$$\mathcal{L} = \mathcal{L}_{\mathrm{joint}} + \mathcal{L}^m_{\mathrm{align}} + \mathcal{L}^v_{\mathrm{align}}.$$
For cross-modal retrieval, we evaluate $\mathrm{sim}(v^g_i, c_j)$ for each $i$-th video and $j$-th caption to get similarity rankings for all combinations of embeddings.
Additionally, we use the motion and visual spaces to conduct reranking and further refine our retrieval. Tagging the tokens of the captions, we get the feature vectors of the verbs and nouns, $w^m$ and $w^v$. Then, we pool the feature volumes $v^m$ and $v^v$ over their spatiotemporal positions to get global motion and visual video representations $\hat{v}^m$ and $\hat{v}^v$. Evaluating $\mathrm{sim}(\hat{v}^m, w^m)$ and $\mathrm{sim}(\hat{v}^v, w^v)$, we conduct reranking by summing these similarities across the embedding spaces before building the rankings.
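The reranking step reduces to summing per-space similarity matrices before sorting; the sketch below uses toy values for illustration, with mean pooling standing in for the spatiotemporal pooling.

```python
import numpy as np

def pool_volume(v):
    """Mean-pool a (C, T, H, W) feature volume to a global (C,) vector
    (mean pooling is an assumption for this sketch)."""
    return v.mean(axis=(1, 2, 3))

def rerank(S_joint, S_motion, S_visual):
    """Sum the similarity matrices from the joint, motion, and visual
    spaces, then rank captions for each video. All inputs (N_vid, N_cap)."""
    S = S_joint + S_motion + S_visual
    # ranking[i] lists caption indices for video i, best match first
    return np.argsort(-S, axis=1)

S_joint  = np.array([[0.9, 0.2], [0.1, 0.8]])   # toy similarities
S_motion = np.array([[0.5, 0.4], [0.3, 0.6]])
S_visual = np.array([[0.7, 0.1], [0.2, 0.9]])
ranking = rerank(S_joint, S_motion, S_visual)
```

Here video 0 retrieves caption 0 first and video 1 retrieves caption 1 first, since the summed similarities dominate in those cells.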
We use the MSR-VTT dataset for our retrieval experiments, which consists of 10k short but untrimmed videos. Additionally, we use part of the Kinetics-700 dataset for action highlighting, which consists of 700 action classes with 650K trimmed videos in total.
In Figure 3, we compare our action highlighting results on the MSR-VTT dataset to Grad-CAM saliency visualization. For the gradient signals used in Grad-CAM visualization, we train a simple video captioning model using a 3D-ResNet encoder and an LSTM decoder. However, Grad-CAM shows which pixels contributed to the output of the downstream task, video captioning in this case, and is therefore insufficient for action highlighting. For our results, we show highlighting performed on the same video conditioned on two words, "smile" and "woman", for the motion and visual spaces respectively.
Note that, during inference, the regions visualized by Grad-CAM are deterministic. In contrast, ours notably generates different maps for different tokens even within a single video, and produces verb-conditioned action highlighting. Moreover, general search queries use an open vocabulary, which our model covers.
See Figure 4 for results of action highlighting on the Kinetics-700 dataset. Of the 700 action class labels introduced in the dataset, we use a small subset and extract the nouns and verbs in the action class names, using them to show relevance maps for both the motion and visual spaces. Because of the fully convolutional nature of our model, we were able to extract relevance maps with varying resolution and aspect ratio. This is crucial for action highlighting in natural videos, as the spatiotemporal resolution is not distorted by resizing the frames.
In the first row of Figure 4, the local features responsive to "pour" include the poured milk as well as the target cup, showing an understanding of which local regions represent the action well. The second row shows two blobs of local embeddings similar to the representation of the token "dance". We point out that strong responses do not span the whole silhouette of the people dancing, but only moving body parts such as the hand or head.
The above results show that our model grounds high-level features into local regions that represent the action, rather than superficial features that depend on the object. In contrast, the third row shows a failure case for the token "ride". In this scene, our model is expected to highlight the child "riding" the bicycle, but instead embeds the surrounding regions close to "ride". Since videos of people "riding" often show similar surroundings, for example a street lined with trees, we believe that our model, in attempting to encode high-level information, misjudges the semantics of such verbs.
[Table 1: R@1 and median rank (Med r) for both retrieval directions, for each combination of model components (SF: SlowFast features).]
In Table 1, we evaluate the effect of our model components.
The first and second rows show retrieval recall and median ranking when training with 3D-ResNet features and SlowFast features respectively. Note that simply using the slow and fast features for training improves retrieval performance substantially, suggesting that SlowFast features are richer and provide better signals for cross-modal matching.
From the second and third rows, we can see that the motion space alignment by itself yields a marginal improvement in score. Motion space alignment is more difficult than visual space alignment, since local regions in the video do not correlate strongly with verbs in the caption. In contrast, the second and fourth rows show a noticeable difference in recall for the video-to-caption task. By aligning the visual space with nouns, i.e. tokens that refer to objects or people in the video, the model learns to generate object-aware local embeddings. Finally, the fifth row shows that combining verb and noun alignment improves Recall@1 by 2-3% for both retrieval directions. Therefore, motion and visual alignments, especially when applied together, provide effective signals for generating cross-modal embeddings.
In this work, we proposed the novel task of action highlighting, which requires the generation of spatiotemporal maps indicating where and when an action occurs. Action highlighting is a more fine-grained task for retrieving actions from videos compared to conventional action recognition tasks.
Using pairs of videos and captions, our proposed model aligns spatiotemporal local embeddings to the nouns and verbs in captions, indicating which part of a video represents the desired action. By visualizing where these maps highlight a video, our method brings interpretability to the video-text retrieval task. Empirical results show that our generated local embeddings go beyond simple object representations and encode high-level information, which enhances cross-modal retrieval performance as well.
IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 5814–5824.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9346–9355.
“Facenet: A unified embedding for face recognition and clustering,”in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815–823.
“Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?,”in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6546–6555.