Cross-Modal Graph with Meta Concepts for Video Captioning

08/14/2021 ∙ by Hao Wang, et al. ∙ Singapore Management University

Video captioning targets interpreting the complex visual contents as text descriptions, which requires the model to fully understand video scenes including objects and their interactions. Prevailing methods adopt off-the-shelf object detection networks to give object proposals and use the attention mechanism to model the relations between objects. They often miss some undefined semantic concepts of the pretrained model and fail to identify exact predicate relationships between objects. In this paper, we investigate an open research task of generating text descriptions for the given videos, and propose Cross-Modal Graph (CMG) with meta concepts for video captioning. Specifically, to cover the useful semantic concepts in video captions, we weakly learn the corresponding visual regions for text descriptions, where the associated visual regions and textual words are named cross-modal meta concepts. We further build meta concept graphs dynamically with the learned cross-modal meta concepts. We also construct holistic video-level and local frame-level video graphs with the predicted predicates to model video sequence structures. We validate the efficacy of our proposed techniques with extensive experiments and achieve state-of-the-art results on two public datasets.




1 Introduction

Video captioning aims to give precise descriptions for input videos, which benefits many relevant applications such as human-computer interaction and video retrieval [1]. Although this task is trivial for humans, it can be very challenging for machines to achieve satisfying results. The main reasons are that (i) videos contain complex spatial and temporal information within changing scenes, and (ii) captions have underlying syntax, including objects and their relationships. The captioning model is therefore required to interpret input videos by identifying multiple objects and predicting their interactions.

Fig. 1: Two video frames extracted from the MSR-VTT dataset, together with the caption. We show the differences between the outputs of an object detection model and our weakly-learned cross-modal meta concepts. In the top row, object proposals are produced with the YOLOv3 model [2], which fails to detect festival, a semantic concept that is not defined in the pretrained model. In the bottom row, we show the visual regions predicted by our proposed model for the given caption words, where the green and red regions refer to guy and festival in the caption respectively.

Prior works [3, 4, 5] adopt 3D convolutional models to incorporate motion and temporal feature representations, which consider global visual information for videos but lack object-level representations. Recently, some works [6, 7, 8] focus on exploiting more fine-grained video representations by building graphs based on object proposals given by a pretrained model. However, two main limitations of these works affect model performance: (i) the pretrained object detector may fail to detect undefined semantic concepts (Figure 1) in video frames and (ii) the predicate relationships between objects are not explicitly predicted.

Regarding the first limitation, as shown in Figure 1, a pretrained detection model used to predict the objects in a video has not been trained on the video captioning datasets, so it misses semantic concepts that were not defined during pretraining. For instance, in the top row of Figure 1, the YOLOv3 model [2] fails to detect festival. Besides, the video captioning datasets [9, 10] contain many animation clips, on which the pretrained detection model may also fail. As a result, the constructed graphs can hardly provide enough fine-grained semantic information for the caption generation process. Regarding the second limitation, previous works [6, 7] only construct soft connections between detected objects, realized by an attention mechanism or similarity computation. The computed relationships between objects are therefore implicit; in other words, the predicate information remains unclear. However, predicates play an important role in caption generation [11, 8], as they can make the model aware of the syntax and the exact interactions between objects.

To address the aforementioned limitations, we propose the Cross-Modal Graph (CMG) with meta concepts for video captioning. Specifically, to cover missed semantic concepts by the pretrained detection model, we propose to learn the cross-modal meta concepts, consisting of the visual and semantic meta concepts, which are defined as the visual regions and their corresponding semantic words in captions. Since we do not have explicit pixel labeling, we adopt a weakly-supervised approach to uncover attended visual regions for words, which are learned through training the decoder to generate video captions. We further use the learned meta concepts as pseudo masks to train a localization model, which is incorporated into our captioning model and predicts the underlying meta concepts when processing video key frames. After obtaining multiple predicted meta concepts, we construct meta concept graphs dynamically to output representations. In the bottom row of Figure 1, we show the learned cross-modal meta concepts, which can find the attended regions with their corresponding semantic classes. To give explicit predicate relationships between objects in videos, we also use a scene graph detection model [12] to predict object pairs along with their predicates. Based on the detected scene graphs, we then build holistic video-level and local frame-level graphs to give multi-scaled video structure representations.

Our contributions can be summarized as:

  • We propose to use a weakly-supervised learning method to discover the corresponding visual regions of the given words of target captions, i.e. cross-modal meta concepts, which are used as the pseudo masks to train a localization model. This localization model is applied to predict meta concepts for the caption generation model.

  • We build three types of cross-modal graph representations, i.e. the dynamically constructed meta concept graphs, holistic video-level and local frame-level video graphs from the detected scene graphs.

  • We conduct extensive experiments to verify the usefulness of the various modules of our model, which achieves state-of-the-art results on the MSR-VTT [9] and MSVD [10] datasets.

2 Related Work

2.1 Video Captioning

Most video captioning works [13, 3, 14, 15, 4] are built with an encoder-decoder architecture, where a CNN is adopted to extract video features and an RNN is used to recurrently generate the descriptions. Earlier works improve the input video representations in different ways. Xu et al. [14] incorporate multimodal attention over an LSTM to obtain video representations. Wang et al. [16] exploit a cycle-consistency idea to reproduce the visual contents after caption generation. In [17], Chen et al. introduce a frame picking module to select video key frames. Recent video captioning works [11, 8, 6, 7] focus on improving the correspondence between videos and captions. Both Zhang et al. [18] and Zheng et al. [8] propose to use Part-of-Speech (POS) information to guide the video captioning process, where they generate the objects first and then output the actions of the captions. Modeling interactions between objects in videos [6, 7] is an emerging direction for video captioning. Specifically, Pan et al. [6] propose to construct a spatial-temporal graph and use knowledge distillation to integrate the object features with video visual features. Zhang et al. [7] use an external large-scale language model to boost the learning of the original language decoder. RCG [19] trains an additional retrieval model for the video captioning task: Zhang et al. [19] first train an individual video-to-text retrieval model, with which the corresponding descriptive sentences can be retrieved for the given videos. During the caption generation phase, RCG learns a weighted sum of the retrieved-word distribution and the vocabulary distribution output by the captioning decoder, from which the final captions are generated.

In this paper, instead of using object proposals detected by pretrained models as the graph nodes [6, 7], we adopt a weakly-supervised learning approach to localize the visual regions that are aligned with the corresponding semantic concepts in the captions, and use them as the nodes of our proposed meta concept graph. In this way, our method is not limited by the pretrained detector and can predict a wider range of meta concepts.

Fig. 2: CMG: Cross-Modal Graph with meta concepts. Our proposed framework for video captioning consists of three modules: the meta concept localization model, the meta concept graph encoder and the video scene graph encoder. In the first module, we use a weakly-supervised method to learn cross-modal meta concepts for the given datasets, which are then used as pseudo masks to train a localization model that outputs the visual regions and semantic information correlated with the target captions. In the meta concept graph encoder, to enable adaptive information flow across the video, we encode the predicted meta concepts with dynamic graph construction and produce the meta concept representation. In the video scene graph encoder, we take a pretrained model to detect scene graphs for video frames, and then build local frame-level and holistic video-level graphs to produce the video object structural representation. Finally, we concatenate the video context features, the meta concept representation and the video object structural representation as decoder inputs to generate video captions.

2.2 Concept Prediction

Learning semantic concepts from visual input has been shown to be useful in the captioning task [20, 21, 22], where a multi-label classification is mainly used to predict the hidden high-level concepts. Specifically, You et al. [20] predict the semantic attributes from the given images before the captioning process, and fuse them with recurrent neural networks and the attention mechanism. Gan et al. [21] propose a semantic compositional network for the image captioning task, where they detect the semantic concepts in the input images and use the probability of each concept to compose the parameters of the caption generator. Wu et al. [22] use a multi-label classification model to produce the predicted attributes, which improves the performance of both the image captioning and VQA tasks on several benchmark datasets.

Recently, Zhou et al. [23] propose to use a grounding approach to find the attended visual regions for words in the given captions, which improves on previous methods by introducing localized visual representations. However, grounding methods [23, 24] are based on the object proposals produced by pretrained object detectors, which means these models can only find pre-defined object classes. In contrast, our method is not constrained by the pretrained detector and can predict a wider range of semantic concepts and their corresponding visual regions.

2.3 Graph Models

To model non-Euclidean structures such as graphs and trees, Graph Convolution Networks (GCN) [25] are proposed to give graph structure representations. Later, graph attention networks (GAT) [26] introduce attention mechanism when encoding node features. In [27], Li et al. further explore the effect of stacking deeper layers for GCN. However, both GCN and GAT require the pre-defined edge information as the input to compute the embedded graph features. To alleviate this limitation, Wang et al. [28] compute the node relationships and aggregate neighbourhood features at each iteration, so that we can compute the graph representations without the pre-defined edges and build the relationships between nodes dynamically.

There have been some efforts to apply graph models to the captioning task [29, 30, 31]. Specifically, Yang et al. [29] first use a pretrained detector to predict scene graphs of the images, and then adopt a GCN to embed the scene graphs. In [31], a semantic graph is built with directional edges on the detected object regions, and a GCN is also used to embed it and produce the contextual features. In video related tasks, spatial and temporal information is of great significance. Xu et al. [32] perform graph convolution on segmented video snippets to leverage both spatial and temporal context for action localization. Liu et al. [33] treat videos as 3D point clouds in the spatial-temporal space. Wang et al. [34] build video graphs from the computed similarity and use a GCN model to give representations for video action recognition. In this paper, we construct various types of graphs with different graph embeddings: we use dynamic graph embedding to model the cross-modal meta concepts across the whole video, and add sequence information to the scene graphs detected in video frames. We also give a detailed analysis of their effects on the video captioning task.

Fig. 3: The demonstration of our proposed weakly-supervised meta concept learning process. Step A: We use an LSTM model to localize the visual meta concepts, where the input is the concatenated feature maps from ResNet-101. In each training step, we compute an attention map over the feature maps, then use the attended features to generate each word, which is supervised by a cross-entropy loss. We also align visual and sentence features across modalities with a triplet loss to improve the correspondence between vision and language. Step B: The attention maps learned for each caption word in step A are used as pseudo masks to train a localization model. The localization model takes the video frames as inputs, and predicts the visual regions (visual meta concepts) along with the corresponding words (semantic meta concepts). The predicted cross-modal meta concepts are fed into the meta concept graph encoder to give the graph features.

3 Method

The proposed Cross-Modal Graph (CMG) with meta concepts for video captioning is presented in Figure 2. Here we define cross-modal meta concepts as the visual regions and the corresponding caption words, which cover both visual and semantic information. Note that for computational efficiency, we first compute the absolute difference between frames, then sort the computed differences and pick the frames with the largest differences as key frames, which are the inputs to our proposed model. Our goal is twofold: (i) learn informative cross-modal meta concepts that bridge the gap between videos and their descriptions and (ii) capture the complex relationships between various objects in the video. To this end, our model consists of three modules: the meta concept localization model, the meta concept graph encoder and the video scene graph encoder.
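The key frame selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the difference is taken between consecutive grayscale frames and summed over pixels, and the names `frame_difference`, `pick_key_frames` and `top_n` are hypothetical.

```python
def frame_difference(frame_a, frame_b):
    """Sum of absolute pixel differences between two equally sized grayscale frames."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(frame_a, frame_b)
        for a, b in zip(row_a, row_b)
    )


def pick_key_frames(frames, top_n):
    """Return the indices of the top_n frames that differ most from the previous frame."""
    scored = [
        (frame_difference(frames[i - 1], frames[i]), i)
        for i in range(1, len(frames))
    ]
    # Largest difference first; ties are broken by the earlier frame index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    # Restore temporal order for the selected frames.
    return sorted(index for _, index in scored[:top_n])


# Toy 2x2 "frames" with two abrupt changes (at indices 2 and 4).
frames = [
    [[0, 0], [0, 0]],
    [[0, 0], [0, 1]],
    [[9, 9], [9, 9]],
    [[9, 9], [9, 8]],
    [[0, 0], [0, 0]],
]
key_frames = pick_key_frames(frames, top_n=2)
```

In this toy input the two frames that follow an abrupt change are selected, and they are returned in temporal order so that later sequence models see them in the original order.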

In the meta concept localization model, we discover the cross-modal meta concepts in a weakly-supervised manner with caption guidance, as shown in Figure 3. Since there may be multiple text descriptions for one video, we use a sentence scene graph parsing tool to extract object tokens from all captions. We then group the tokens based on synonym rules, sort the groups by frequency, and take the most frequent groups of synonyms as the semantic classes. We adopt the attended visual regions for these classes as pseudo masks to train a semantic segmentation network, which is used to localize objects of these classes during caption generation. Note that the localization model is trained separately to ease the training difficulty; otherwise we would need to train the whole model jointly, which slows down training.
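The grouping and ranking of object tokens can be illustrated with a short sketch. The synonym table, the token list and the function name `top_semantic_classes` are hypothetical; the paper does not specify its synonym rules or parsing tool output format.

```python
from collections import Counter


def top_semantic_classes(object_tokens, synonym_to_group, top_k):
    """Map tokens to synonym groups, count them, and keep the top_k most frequent groups."""
    # Tokens without a synonym rule form their own group.
    counts = Counter(synonym_to_group.get(token, token) for token in object_tokens)
    # Sort by descending frequency, then alphabetically for a deterministic order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [group for group, _ in ranked[:top_k]]


# Hypothetical object tokens extracted from all captions of a dataset.
tokens = ["guy", "man", "festival", "man", "car", "person", "festival"]
synonyms = {"guy": "man", "person": "man"}
classes = top_semantic_classes(tokens, synonyms, top_k=2)
```

Here "guy" and "person" are merged into the "man" group, so the two most frequent groups are "man" and "festival".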

With the trained localization model, we are able to localize caption words in video frames and get visual representations for the attended regions. We encode the one-hot vectors of the predicted words as semantic information. The predicted meta concepts form a set that spans all video key frames, and the cross-modal representation of each meta concept is the sum of its visual and semantic representations. Since the meta concepts interact implicitly with one another, we propose to use dynamic graph construction to produce the output representation of the meta concept graph encoder.

In the video scene graph encoder module, we build the object graph in two ways, i.e. local frame-level graphs and holistic video-level graphs, as shown in Figure 4. For the frame-level graphs, we first encode the scene graphs through GAT [26] and obtain a graph feature for each frame. We then adopt a transformer [35] to embed the frame scene graph features and model the sequence dependency between frames. We follow [7, 6] to construct the video-level graph, which is based on the cosine similarity and intersection over union (IoU) between objects in adjacent frames. We also use GAT to encode the video-level graph. We concatenate the frame-level and video-level outputs to get the video scene graph representation.

Apart from the aforementioned modules, we also follow [36, 8] to do preprocessing and extract video context representations from the input videos. We concatenate all obtained features and feed them to the decoder, a one-layer plain LSTM, which recurrently generates captions. We train our model in two ways: minimizing the cross-entropy loss or maximizing a reinforcement learning (RL) [37] based reward.

Fig. 4: The demonstration of our constructed frame-level and video-level graphs. We use the pretrained scene graph detector to generate the detected scene graphs for the given video key frames. For each scene graph triplet, we construct two edges between the objects and the predicates. There are multiple frame-level scene graphs in videos, while we only plot two of them for simplicity. The video-level graph is built by grouping frame-level graphs, where we connect nodes from the adjacent frames with high similarity. The black blocks denote connections between nodes.

3.1 Weakly-Supervised Meta Concept Learning

The learning process is shown in Figure 3. We first learn the cross-modal meta concepts in a weakly-supervised approach, which are adopted as pseudo masks to train the localization model. Then we use the trained localization model to predict the meta concepts, which are embedded with meta concept graph encoder and further used to generate video captions. In other words, the meta concept model is trained to cover semantic concepts in captions.

Different from using the detected bounding boxes of the pretrained model, we are able to customize the classes of cross-modal meta concepts for various datasets. To this end, we adapt the model of [38] to video captioning datasets. Specifically, we randomly sample frames from picked key frames for each video and use ResNet-101 [39] to embed sampled frames to get the output feature maps from the last convolutional layer. We then concatenate extracted feature maps to be the video representation for model training. We use a LSTM to weakly learn the corresponding regions for words in captions, where we impose word-level constraints and cross-modal alignment. In the word-level training, we feed in the initialized hidden states and previous word for LSTM model:


In this formulation, the inputs are combined by concatenation and mapped by separate learnable embedding matrices, and attention weights over the visual feature map are computed for each word. That is, we use the attention-weighted frame features as the context vector for the LSTM. A cross-entropy loss over the predicted vocabulary distribution supervises each word of the caption sequence during word-level training.
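The attention-weighted context vector and the word-level cross-entropy can be sketched as below. This is only an illustration under stated assumptions: the attention scores are taken as given (how the paper computes them from the LSTM state is not recoverable here), the weights are a softmax over spatial regions, and the names `attention_context` and `word_cross_entropy` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(region_features, scores):
    """Return attention weights and the attention-weighted sum of region features."""
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context


def word_cross_entropy(vocab_probs, target_index):
    """Negative log-likelihood of the target word under the predicted distribution."""
    return -math.log(vocab_probs[target_index])


# Two regions with equal scores receive equal attention.
regions = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attention_context(regions, scores=[0.0, 0.0])
loss = word_cross_entropy([0.25, 0.5, 0.25], target_index=1)
```

The per-word attention weights are what later serve as pseudo masks, so a sharper score distribution yields a more localized mask for that word.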

To improve the matching between the visual representations of video frames and the text representations of captions, we propose to use cross-modal alignment. Technically, we take the final LSTM output as the sentence embedding, and apply linear transforms to the sentence embedding and the video representation to map them to the same dimension. We aim to impose alignment constraints on the two mapped embeddings; hence, in each mini-batch, we sample video-caption pairs from different identities. Each video matches only its own caption, while all other captions in the batch are treated as mismatched. We adopt a triplet loss to achieve the alignment.


In this loss, distances are measured with the Euclidean distance, the superscripts mark anchor, positive and negative instances respectively, and the margin is a hyperparameter.
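A triplet loss with Euclidean distance and a margin can be sketched as follows. The in-batch negative scheme (every other caption in the batch is a negative for a given video) is an assumption made for illustration, and `batch_alignment_loss` is a hypothetical name; the paper's exact sampling strategy may differ.

```python
import math


def triplet_loss(anchor, positive, negative, margin):
    """Hinge on the gap between anchor-positive and anchor-negative Euclidean distances."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def batch_alignment_loss(video_embs, sentence_embs, margin):
    """Average triplet loss with each video as anchor and its own caption as positive.

    Assumption: all other captions in the batch act as negatives.
    """
    losses = [
        triplet_loss(video, sentence_embs[i], sentence, margin)
        for i, video in enumerate(video_embs)
        for j, sentence in enumerate(sentence_embs)
        if i != j
    ]
    return sum(losses) / len(losses) if losses else 0.0


# A near negative violates the margin; a far negative does not.
near = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.2], margin=0.5)
far = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0], margin=0.5)
batch = batch_alignment_loss(
    [[0.0, 0.0], [10.0, 10.0]], [[0.0, 1.0], [10.0, 11.0]], margin=0.5
)
```

The loss is zero once every mismatched caption is farther from the video than the matched caption by at least the margin, which is the case for the well-separated toy batch above.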

The whole training objective for weakly-supervised meta concept learning is given as:


where is the trade-off parameter.
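For orientation, the three losses described in this subsection can be written compactly as below. The notation is assumed for illustration and is not taken from the paper: $y_t$ is the $t$-th target word, $p_t$ the predicted vocabulary distribution at step $t$, $v$ and $s$ the projected video and sentence embeddings, $d$ the Euclidean distance, $\alpha$ the margin and $\lambda$ the trade-off parameter. Which term $\lambda$ scales is also an assumption.

```latex
\begin{align}
\mathcal{L}_{\mathrm{xe}}  &= -\sum_{t=1}^{T} \log p_t(y_t) \\
\mathcal{L}_{\mathrm{tri}} &= \max\bigl(0,\; d(v^{a}, s^{p}) - d(v^{a}, s^{n}) + \alpha\bigr) \\
\mathcal{L}                &= \mathcal{L}_{\mathrm{xe}} + \lambda\, \mathcal{L}_{\mathrm{tri}}
\end{align}
```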

To localize the corresponding meta concepts, we take the learned attention maps as pseudo masks to train another semantic segmentation model [40], which infers meta concepts when generating video captions. We first extract all semantic concepts from the given captions and group them based on synonym rules; the most frequent classes of cross-modal meta concepts are then taken as the pseudo labels for segmentation model training. Since each video may contain various meta concepts, we train the segmentation model with a multi-label loss, i.e. the probability of each class is computed separately and optimized with a binary cross-entropy loss.
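The multi-label objective can be illustrated for a single location with independent per-class probabilities. This is a minimal sketch: averaging over classes and the clamping constant `eps` are assumptions, and a real segmentation model would apply the loss over all pixels.

```python
import math


def multilabel_bce(probs, targets, eps=1e-12):
    """Binary cross-entropy averaged over classes; each class is treated independently."""
    losses = [
        -(t * math.log(max(p, eps)) + (1 - t) * math.log(max(1.0 - p, eps)))
        for p, t in zip(probs, targets)
    ]
    return sum(losses) / len(losses)


# Several classes can be active at once, unlike a softmax over classes.
uncertain = multilabel_bce([0.5, 0.5], [1, 0])
confident = multilabel_bce([0.9, 0.1], [1, 0])
```

Because each class has its own sigmoid-style probability, a frame that contains both a person and a festival can be labeled with both classes.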

3.2 Meta Concept Graph Encoder

With the trained localization model, we can obtain a set of cross-modal meta concepts for a given video, each with visual and semantic features taken from the localization model output. We use the sum of the visual and semantic features as the representation of each meta concept. Note that this set is not restricted to a single frame, but covers all detected meta concepts from the picked key frames. We propose to construct a dynamic graph to integrate the meta concept features, where the meta concepts are regarded as the node set and the connections between them form the edge set.

Our target of constructing this dynamic graph is to capture interactions between semantically similar meta concepts, which can be defined by feature distance [32]. Hence we use the k-nearest neighbour algorithm to build edges between nodes as follows:


Each node is thus connected to its nearest neighbours in feature space. After building the edges, we obtain the adjacency matrix of the constructed graph. Since we connect nodes dynamically during the training phase, the model keeps updating node features and aggregates both intra- and inter-frame information.
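Building the edges with a k-nearest-neighbour rule can be sketched as below. Euclidean distance as the feature distance and directed edges from each node to its neighbours are assumptions made for illustration; `knn_adjacency` is a hypothetical name.

```python
import math


def knn_adjacency(nodes, k):
    """Connect every node to its k nearest neighbours by Euclidean feature distance."""
    n = len(nodes)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        # Distances to all other nodes; ties are broken by node index.
        distances = sorted(
            (math.dist(nodes[i], nodes[j]), j) for j in range(n) if j != i
        )
        for _, j in distances[:k]:
            adjacency[i][j] = 1
    return adjacency


# Two tight clusters: each node links to the other node of its own cluster.
node_features = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
adjacency = knn_adjacency(node_features, k=1)
```

Recomputing this adjacency from the current node features at each iteration is what makes the graph dynamic: as features are updated, a node's neighbours can change.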

Based on the obtained adjacency matrix and node features , we perform graph convolution as follows:


Here the features are combined by concatenation and passed through a learnable adaptation layer.

We apply a max-pooling operation on the graph convolution output and obtain the cross-modal meta concept graph representation.
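One plausible form of such a graph convolution, in the style of the edge convolution of dynamic graph CNNs, is sketched below. The exact operator in the paper is not recoverable from this text, so the choice of concatenating each node with the difference to its neighbour, the max aggregation over neighbours, and the names `edge_conv` and `max_pool` are all assumptions.

```python
def linear(weight, vector):
    """Apply a bias-free linear layer given as a list of weight rows."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]


def edge_conv(nodes, adjacency, weight):
    """For each node, transform [x_i, x_j - x_i] for every neighbour j and take the max."""
    outputs = []
    for i, x_i in enumerate(nodes):
        messages = []
        for j, connected in enumerate(adjacency[i]):
            if connected:
                difference = [b - a for a, b in zip(x_i, nodes[j])]
                messages.append(linear(weight, x_i + difference))
        if not messages:
            # Isolated node: fall back to a zero neighbour difference.
            messages.append(linear(weight, x_i + [0.0] * len(x_i)))
        outputs.append([max(column) for column in zip(*messages)])
    return outputs


def max_pool(node_features):
    """Channel-wise max over all nodes, giving one graph-level vector."""
    return [max(column) for column in zip(*node_features)]


nodes = [[1.0], [3.0]]
adjacency = [[0, 1], [1, 0]]
weight = [[1.0, 1.0]]  # maps the 2-d concatenation to a 1-d output
node_outputs = edge_conv(nodes, adjacency, weight)
graph_feature = max_pool(node_outputs)
```

The final max-pooling collapses a variable number of meta concept nodes into a fixed-size vector, which is what allows the result to be concatenated with the other decoder inputs.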

3.3 Video Scene Graph Encoder

In this module, we use an off-the-shelf scene graph model [12] to produce a frame-level scene graph for each key frame. Each frame-level graph is a set of predicted relationship triplets, whose objects and predicates are drawn from fixed vocabularies of object classes and predicate relationship types.

For each triplet, we build two edges between the objects and the predicate, which gives the adjacency matrix of each frame-level graph. We further extract node features by taking the one-hot vector of each node and encoding it with a linear layer. The adjacency matrix and the node features are fed into GAT [26] to obtain frame-level graph features. We then apply a transformer [35] to these features to capture the temporal dependency between frame-level graphs and produce the final frame-level graph representation.
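Turning relationship triplets into a graph with two edges per triplet can be sketched as follows. The edge direction (subject to predicate, predicate to object), the use of one predicate node per triplet, and the identification of objects by label are assumptions for illustration; a real detector would distinguish object instances by their boxes.

```python
def triplets_to_graph(triplets):
    """Build node labels and a directed adjacency matrix from (subject, predicate, object) triplets."""
    labels = []
    index = {}

    def node_id(key, label):
        # Create the node on first use and reuse it afterwards.
        if key not in index:
            index[key] = len(labels)
            labels.append(label)
        return index[key]

    edges = []
    for position, (subject, predicate, obj) in enumerate(triplets):
        s = node_id(("object", subject), subject)
        p = node_id(("predicate", position), predicate)  # one predicate node per triplet
        o = node_id(("object", obj), obj)
        edges.extend([(s, p), (p, o)])

    size = len(labels)
    adjacency = [[0] * size for _ in range(size)]
    for source, target in edges:
        adjacency[source][target] = 1
    return labels, adjacency


labels, adjacency = triplets_to_graph(
    [("man", "riding", "horse"), ("man", "wearing", "hat")]
)
```

Keeping predicates as nodes rather than edge labels lets a standard graph attention layer treat objects and relationships uniformly.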

We also construct a holistic video-level graph to capture fine-grained temporal node connections, which builds edges between nodes not only within a single frame but also across adjacent frames. Specifically, we compute the cosine similarity and Intersection over Union (IoU) between node pairs from adjacent frames. If both the computed similarity and the IoU are greater than pre-defined thresholds, we connect the two nodes. In this way, we group all frame-level graphs together into the video-level graph. We also adopt GAT [26] to encode the video-level graph. The output representation of this module is the concatenation of the frame-level and video-level graph representations.
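The cross-frame linking rule can be sketched as below. Boxes are assumed to be `(x1, y1, x2, y2)` tuples and the threshold values are hypothetical placeholders, since the thresholds are not given in this text.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cross_frame_edges(frame_a, frame_b, sim_threshold, iou_threshold):
    """Link objects in adjacent frames when both similarity and IoU exceed their thresholds."""
    return [
        (i, j)
        for i, (feat_a, box_a) in enumerate(frame_a)
        for j, (feat_b, box_b) in enumerate(frame_b)
        if cosine_similarity(feat_a, feat_b) > sim_threshold
        and iou(box_a, box_b) > iou_threshold
    ]


# Each object is a (feature vector, box) pair.
frame_a = [([1.0, 0.0], (0, 0, 2, 2))]
frame_b = [
    ([1.0, 0.1], (0, 0, 2, 2)),  # similar appearance, same place: linked
    ([0.0, 1.0], (0, 0, 2, 2)),  # same place, different appearance
    ([1.0, 0.0], (5, 5, 6, 6)),  # same appearance, different place
]
edges = cross_frame_edges(frame_a, frame_b, sim_threshold=0.8, iou_threshold=0.5)
```

Requiring both conditions means an edge is added only when two detections look alike and overlap spatially, which approximates tracking the same object across adjacent frames.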

Model Backbone BLEU@1 BLEU@2 BLEU@3 BLEU@4 Meteor Rouge-L CIDEr Training
SA [13] V+C3D 72.2 58.9 46.8 35.9 24.9 - - XE
M3 [3] V+C3D 73.6 59.3 48.26 38.1 26.6 - - XE
MA-LSTM [14] G+C3D+A - - - 36.5 26.5 59.8 41.0 XE
VideoLab [5] R-152+C3D+A+Ca - - - 39.1 27.7 60.6 44.1 XE
v2t_navigator [4] C3D+A+Ca - - - 42.6 28.8 61.7 46.7 XE
RecNet [15] InceptionV4 - - - 39.1 26.6 59.3 42.7 XE
OA-BTG [18] R-200+RoI - - - 41.4 28.2 - 46.9 XE
MARN [41] R-101+C3D+Ca - - - 40.4 28.1 60.7 47.1 XE
MGSA [42] IRV2+C3D - - - 42.4 27.6 - 47.5 XE
STG [6] IRV2+I3D+RoI - - - 40.5 28.3 60.9 47.1 XE
ORG-TRL [7] IRV2+C3D+RoI+BERT - - - 43.6 28.8 62.1 50.9 XE
POS-CG [11] IRV2+I3D+Ca 79.1 66.0 53.3 42.0 28.1 61.1 49.0 XE
SAAT [8] IRV2+C3D+Ca+RoI 80.2 66.2 52.6 40.5 28.2 60.9 49.1 XE
RCG [19] IRV2+C3D+h-LSTMs - - - 43.1 29.0 61.9 52.3 XE
CMG (ours) IRV2+C3D 79.2 65.7 52.6 40.9 28.7 61.3 49.2 XE
CMG (ours) IRV2+C3D+Ca 81.6 67.0 54.4 43.1 29.2 61.8 51.5 XE
CMG (ours) IRV2+C3D+A+Ca 83.5 70.7 57.4 44.9 29.6 62.9 53.0 XE
PickNet [17] R-152+Ca - - - 41.3 27.7 59.8 44.1 RL
SAAT [8] IRV2+C3D+Ca+RoI 79.6 66.2 52.1 39.9 27.7 61.2 51.0 RL
POS-CG [11] IRV2+I3D+Ca 81.2 67.9 53.8 41.3 28.7 62.1 53.4 RL
CMG (ours) IRV2+C3D+A+Ca 83.4 70.1 56.3 43.7 29.4 62.8 55.9 RL
TABLE I: Main results. Performance compared against various baseline models on the MSR-VTT dataset, evaluated with BLEU@1-4, METEOR, ROUGE-L and CIDEr scores (%). We also state the video context features used by the listed methods, where V, G, R-N, A, Ca, IRV2 and RoI denote VGG19, GoogleNet, N-layer ResNet, Audio, Category, InceptionResnetV2 and Region of Interest (RoI) features respectively. BERT and h-LSTMs denote the BERT pretrained model and hierarchical LSTMs respectively. XE and RL denote training with cross-entropy loss and reinforcement learning respectively.

Model B@4 M R C
MA-LSTM [14] 52.3 33.6 - 70.4
MGSA [42] 53.4 35.0 - 86.7
OA-BTG [18] 56.9 36.2 - 90.6
POS-CG [11] 52.5 34.1 71.3 88.7
SAAT [8] 46.5 33.5 69.4 81.0
STG [6] 52.2 36.9 73.9 93.0
ORG-TRL [7] 54.3 36.4 73.9 95.2
CMG (ours) 59.5 38.8 76.2 107.3
TABLE II: Performance comparisons with different baseline methods on the testing set of the MSVD dataset. The results are evaluated with BLEU@4, METEOR, ROUGE-L and CIDEr scores (%).

4 Experiments

We evaluate the efficacy of our proposed framework on two public datasets: the MSR-Video To Text (MSR-VTT) dataset [9] and the Microsoft Video Description (MSVD) dataset [10]. The results are reported with four captioning metrics: BLEU, METEOR, ROUGE-L and CIDEr. In the tables, “-” means the number is not available. The reported results are evaluated with the Microsoft COCO evaluation server [43]. We compare our results with previous state-of-the-art models and report extensive ablation studies to show the effectiveness of each module of our model.

4.1 Datasets

MSR-VTT [9]. MSR-VTT is a commonly used benchmark dataset for the video captioning task. It is composed of 10,000 video clips, where each video clip is annotated with 20 English sentences. These video clips are categorized into 20 classes, such as music and cooking. We follow the standard splits [11, 7, 8, 6], i.e. 6,513, 497 and 2,990 clips for training, validation and testing respectively.

MSVD [10]. MSVD is a relatively small-scale dataset compared with MSR-VTT, as it contains 1,970 video clips in total. MSVD has multilingual captions, but we only consider the English annotations. There are roughly 40 English sentences for each video clip. Similar to prior work [11, 7, 8, 6], the dataset is separated into 1,200 training clips, 100 validation clips and 670 test clips.

4.2 Implementation Details

4.2.1 Feature Extraction

We follow [44, 8] to extract video context features for the MSR-VTT dataset, and use four types of features. Specifically, we extract 2D features from the last avg-pooling layer of a pretrained InceptionResnetV2 (IRV2) [45]. We adopt a C3D [46] model pretrained on the Sports-1M [47] dataset to capture short-term motion features. Audio features are extracted with MFCC [48] from audio segments spanning a fixed number of frame steps. Since MSR-VTT provides category information, we also use GloVe [49] to encode the semantic labels of each video.

For MSVD, we take 2D and 3D visual features as the video context features, following previous practice [8, 50]. Since this dataset has a limited number of training video clips, we only take two features to avoid over-fitting: a ResNeXt [51] pretrained on the ImageNet dataset is adopted to extract visual features, and an ECO [52] pretrained on the Kinetics400 dataset is used to give video temporal features. Specifically, we use evenly sampled video frames as the input, which are fed into ResNeXt and ECO respectively. We take the averaged ResNeXt conv5/block3 output as the 2D visual features and the global pooling results of ECO as the 3D features.

4.2.2 Model Setting

We pick key frames to be the input based on difference between frames, and we take top synonym categories out of to be semantic classes of cross-modal meta concepts. In the weakly-supervised meta concept learning, we train the model with batch size of and learning rate of , and set the parameter and as and respectively. We train semantic segmentation model PSPNet [40] to localize the learned meta concepts. The segmentation model is trained with batch size of and learning rate of . In the dynamic meta concept graph construction, we allow each node to connect its nearest neighbour and output -dimensional features. The adopted scene graph detector [12] is pretrained on Visual Genome (VG) dataset [53] with Faster R-CNN [54] backbone. We use a two-layer graph attention networks (GAT) [26] to encode the constructed video graphs, where we set the hidden dimension as , head number as and output dimension as . Then we use a one-layer transformer [35] with heads to give temporal representations of frame-level graphs.

In the caption decoder, we use word embedding layer to give word representations, whose dimension is . We also map all the used visual context features onto the space of -dimensional space and then concatenate them together to be decoder input. We take a one-layer plain LSTM as the decoder. We train the decoder with batch size of and learning rate of . Our implemented reinforcement learning strategy is based on SCST [37]. We use beam search for evaluation, and set the beam size as .
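The self-critical idea behind SCST can be sketched in a few lines: a sampled caption is rewarded relative to the greedily decoded caption of the same model. This is a minimal illustration with made-up reward values, and `scst_loss` is a hypothetical name; a real implementation computes the rewards with a caption metric such as CIDEr and backpropagates through the log-probabilities.

```python
def scst_loss(sample_log_probs, sample_reward, greedy_reward):
    """Self-critical loss: the greedy caption's reward is the baseline for the sampled caption."""
    advantage = sample_reward - greedy_reward
    # Minimizing this raises the likelihood of samples that beat the greedy baseline.
    return -advantage * sum(sample_log_probs)


# Hypothetical per-word log-probabilities of a sampled caption and two rewards.
log_probs = [-0.5, -1.0]
better_sample = scst_loss(log_probs, sample_reward=0.9, greedy_reward=0.6)
worse_sample = scst_loss(log_probs, sample_reward=0.4, greedy_reward=0.6)
```

A sampled caption that scores below the greedy one gets a negative advantage, so its likelihood is pushed down rather than up.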

4.3 Experimental Results

4.3.1 Performance Comparison

In Table I we compare our results against earlier models under different training strategies, i.e. cross-entropy loss and reinforcement learning, on the MSR-VTT dataset [9]. In Table II, we show model performance on the MSVD dataset [10]. Our results improve on previous methods across the reported metrics on both the MSR-VTT and MSVD datasets. We also conduct experiments with different video backbone features to show the importance of multi-sourced information.

The STG [6] and ORG-TRL [7] models use similar methodologies, constructing graphs based on object proposals. STG uses a transformer [35] as the decoder, while ORG-TRL uses an extra pretrained BERT [55] model and obtains higher results than STG. Another reason for the relatively low performance of STG is that MSR-VTT contains a large portion of animations, on which pretrained object detection models often fail. When we use video context features similar to those of STG (IRV2+C3D), our model outperforms STG by about 2 points in CIDEr score (49.2 vs. 47.1 in Table I), indicating that our proposed cross-modal meta concepts can be adapted to different datasets and help alleviate such issues even without an external language model.

To keep the generation model aware of syntax information, Zheng et al. [8] adopt predicate and object information to guide the language decoder during generation. In SAAT [8], instead of constructing graph representations, they directly use the attention mechanism to encode the predicted predicates and objects as the input vectors of the decoder. In contrast, we build holistic and local video graphs for the predicted syntax; when we use context features similar to those of SAAT (IRV2+C3D+Ca), our model outperforms SAAT (e.g. 51.5 vs. 49.1 CIDEr in Table I), indicating the effectiveness of our model. RCG [19] incorporates retrieval learning into the captioning process, using IRV2+C3D as the video context features and hierarchical LSTMs to generate captions. Here we only use a plain LSTM for caption generation, in order to make fair comparisons with most of the previous works [8, 11, 42, 41]. The higher scores of RCG with the same context features may indicate that a better decoder can give better generation results.

When we shift to the RL training strategy, which directly optimizes our model with CIDEr scores, we achieve the highest CIDEr score. Overall, our proposed model shows consistent improvements across all metrics.
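The exact RL objective is not spelled out in this section. Purely as an illustration, the sketch below assumes a self-critical policy-gradient form, in which the CIDEr score of a sampled caption is compared against that of a greedy-decoded caption. The function name `self_critical_loss` and all numbers are hypothetical.

```python
def self_critical_loss(sample_logprobs, sample_reward, greedy_reward):
    """REINFORCE-style loss with a greedy-decoding baseline (assumed form).

    sample_logprobs: log-probabilities of the words in the sampled caption.
    sample_reward:   sentence-level reward (e.g. CIDEr) of the sampled caption.
    greedy_reward:   reward of the greedy caption, used as the baseline.
    """
    advantage = sample_reward - greedy_reward
    # Minimizing this loss raises the log-probability of captions that
    # score above the baseline and lowers it for captions that score below.
    return -advantage * sum(sample_logprobs)


# Hypothetical values: the sampled caption scores higher than the greedy one.
loss = self_critical_loss([-0.5, -1.0, -0.25], sample_reward=0.9, greedy_reward=0.5)
```

When the sampled and greedy captions receive the same reward, the advantage is zero and the sample contributes no gradient.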

Method                   B@4    M      R      C
Baseline (BL)            43.0   28.1   61.5   50.2
BL + MC ()               44.0   29.5   62.5   51.9
BL + MC ()               44.1   29.5   62.6   52.0
BL + MC ()               44.7   29.4   62.9   52.2
   - attention-LSTM      43.6   29.1   62.4   51.3
   - Semantic Only       43.1   28.9   62.0   51.0
   - Visual Only         43.7   29.1   62.4   51.6
BL + FG                  43.8   29.1   62.1   51.4
BL + VG (no rel)         43.3   29.0   61.9   51.2
BL + VG                  44.0   29.1   62.2   51.7
BL + FG + VG             44.3   29.1   62.4   51.8
All                      44.9   29.6   62.9   53.0
TABLE III: Ablation studies evaluating the benefit of each module of the proposed model. MC, FG and VG denote meta concept graphs, frame-level graphs and video-level graphs respectively, and the three BL + MC rows differ in the number of connected neighbours. We also report MC results without the dynamic graph encoding (- attention-LSTM) and with only semantic or only visual features. Results are BLEU@4 (B@4), METEOR (M), ROUGE-L (R) and CIDEr (C) scores (%) on the MSR-VTT dataset.

4.3.2 Ablation Studies

We conduct extensive ablation studies as shown in Table III.
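To make the per-module gains easier to read, the short sketch below computes the difference between selected rows of Table III and the baseline. The scores are copied from the table; the helper name `gains_over_baseline` is only illustrative.

```python
# Scores copied from Table III: (BLEU@4, METEOR, ROUGE-L, CIDEr)
METRICS = ("B@4", "M", "R", "C")
table3 = {
    "Baseline (BL)": (43.0, 28.1, 61.5, 50.2),
    "BL + FG": (43.8, 29.1, 62.1, 51.4),
    "BL + VG": (44.0, 29.1, 62.2, 51.7),
    "BL + FG + VG": (44.3, 29.1, 62.4, 51.8),
    "All": (44.9, 29.6, 62.9, 53.0),
}


def gains_over_baseline(rows, baseline="Baseline (BL)"):
    """Return the per-metric difference of every row against the baseline row."""
    base = rows[baseline]
    return {
        name: tuple(round(score - ref, 1) for score, ref in zip(scores, base))
        for name, scores in rows.items()
        if name != baseline
    }


gains = gains_over_baseline(table3)
for name, delta in gains.items():
    print(name, dict(zip(METRICS, delta)))
```

With these numbers, the full model improves on the baseline by 1.9 BLEU@4 and 2.8 CIDEr points.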

Effectiveness of the meta concept graph. To observe the impact of the number of neighbours in the dynamic meta concept (MC) graph embedding process, we vary this number. The results show only minor differences between the settings. One possible reason is that, with adaptive updating of node embeddings and edge connections, nodes can aggregate around semantically similar node features. To validate the efficacy of our dynamic graph construction method, we follow [23] to build an attention-LSTM that encodes the learned cross-modal meta concepts, denoted as - attention-LSTM in Table III, for comparison. The dynamic graph encoding method of our framework performs better than the attention-LSTM embedding method, indicating that the graph embedding gives better feature aggregation. We also evaluate the usefulness of our learned visual and semantic concepts separately; the results suggest that visual features give better performance than pure semantic features.
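The idea of adaptively updating node embeddings and edge connections can be sketched in the style of a dynamic k-nearest-neighbour graph, as in [28]. This is a simplified illustration and not the paper's actual layer: the mean aggregation, the function names and the toy values are all assumptions.

```python
import math


def knn_edges(features, k):
    """Connect every node to its k nearest neighbours in feature space."""
    edges = {}
    for i, fi in enumerate(features):
        dists = sorted(
            (math.dist(fi, fj), j) for j, fj in enumerate(features) if j != i
        )
        edges[i] = [j for _, j in dists[:k]]
    return edges


def aggregate(features, edges):
    """Update each node as the mean of itself and its current neighbours."""
    updated = []
    for i, fi in enumerate(features):
        group = [fi] + [features[j] for j in edges[i]]
        updated.append([sum(col) / len(group) for col in zip(*group)])
    return updated


# Four toy node embeddings forming two well-separated pairs (hypothetical values).
nodes = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
edges = knn_edges(nodes, k=1)    # graph built from the input features
nodes = aggregate(nodes, edges)  # one round of neighbourhood aggregation
edges = knn_edges(nodes, k=1)    # graph rebuilt from the updated features
```

Because the edges are recomputed from the updated features, similar nodes are drawn together regardless of the initial neighbourhood size, which is consistent with the small differences observed between settings.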

Effectiveness of scene graph. We compare the performance of video-level graphs (VG) without and with predicate information, which correspond to VG (no rel) and VG respectively. In VG (no rel), which is similar to the setting in [6, 7], we only connect object nodes together without using predicate relationships. We observe that our model gains better results with predicates. We also add frame-level graphs (FG) and VG to the baseline (BL) separately, and both types of graphs boost baseline performance. VG gives better BLEU@4 and CIDEr scores, showing that graphs with more informative connections can output better representations. On the whole, each proposed module has a positive effect on captioning performance, and the modules work together to produce further improved results.
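The difference between the two VG variants can be illustrated with a toy graph builder. This is a hedged sketch only: representing each predicate as its own node between subject and object is an assumption, and the function names and triplets are hypothetical.

```python
from itertools import combinations


def graph_without_predicates(objects):
    """VG (no rel)-style graph: every pair of object nodes is linked directly."""
    return {frozenset(pair) for pair in combinations(objects, 2)}


def graph_with_predicates(triplets):
    """VG-style graph: each (subject, predicate, object) triplet inserts a
    predicate node between its subject and object (assumed representation)."""
    edges = set()
    for idx, (subj, pred, obj) in enumerate(triplets):
        pred_node = f"{pred}#{idx}"  # one predicate node per triplet
        edges.add((subj, pred_node))
        edges.add((pred_node, obj))
    return edges


# Hypothetical detections for one video.
objects = ["man", "horse", "grass"]
triplets = [("man", "riding", "horse"), ("horse", "on", "grass")]

no_rel = graph_without_predicates(objects)
with_rel = graph_with_predicates(triplets)
```

In the first graph, "man" and "grass" are linked just like any other pair; in the second, only related objects are connected and each connection carries its predicate.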

Dataset    Setting       B@4    Top-5 Acc
MSR-VTT    without CA    15.7   61.5
MSR-VTT    with CA       16.3   63.3
MSVD       without CA    24.8   68.6
MSVD       with CA       26.1   70.7
TABLE IV: Caption generation performance of weakly-supervised meta concept learning, evaluated with BLEU@4 and Top-5 accuracy (%) on the MSR-VTT and MSVD datasets, where CA denotes cross-modal alignment.
Fig. 5: Qualitative analysis of the regions segmented by our proposed weakly-supervised learning method. We present 4 groups of video frames and the corresponding segmented visual regions of the given semantic meta concepts. The first row shows the selected video key frames and the second row shows the learned visual meta concepts, i.e. the segmented visual areas. The segmented areas are obtained through the weakly-supervised learning method and are used as the meta concepts in our proposed CMG video captioning framework.

Evaluation of weakly-supervised meta concept learning. In Table IV, we show the efficacy of our proposed cross-modal alignment (CA) for weakly-supervised meta concept learning. It is hard to evaluate the quality of the learned cross-modal meta concepts directly, as we do not have any ground truth. Hence we evaluate the caption generation performance of the model instead: since the sequence is produced from the localized meta concepts, the generation results reflect the quality of the learned cross-modal meta concepts to some extent. CA improves captioning performance on both the MSR-VTT and MSVD datasets.
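For reference, the Top-5 accuracy reported in Table IV can be read in the usual way. The sketch below assumes the standard definition, under which a prediction counts as correct when the target class is among the five highest-scoring classes; the scores and targets are made-up toy values.

```python
def top_k_accuracy(scores, targets, k=5):
    """Percentage of samples whose target class is among the k top-scoring classes."""
    hits = 0
    for row, target in zip(scores, targets):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += target in ranked[:k]
    return 100.0 * hits / len(targets)


# Two toy samples over six classes (hypothetical scores).
scores = [
    [0.10, 0.30, 0.20, 0.05, 0.25, 0.10],  # class 3 has the lowest score
    [0.40, 0.10, 0.10, 0.10, 0.10, 0.20],  # class 0 has the highest score
]
targets = [3, 0]
accuracy = top_k_accuracy(scores, targets, k=5)
```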

4.3.3 Qualitative Results

The visualization of the weakly-learned visual meta concepts. In Figure 5, we present the visual meta concepts learned with the weakly-supervised learning method. Since the only supervision for the attended semantic visual regions is the given video captions, we do not expect precise pixel labelling for the regions. However, the proposed framework still outputs mostly reasonable results. Specifically, we show the most activated regions across the video frames for the given semantic meta concepts. For example, in the third sample, the model gives the coarse region of the semantic meta concept dish, which is not defined by the prevailing object detectors. Based on the given captions, the model can also localize the visual regions of jar. These predicted visual meta concepts give more fine-grained information than traditional object detectors and allow the model to learn different classes of meta concepts for different datasets. This property enables our proposed framework to produce semantic information that is missed by existing object detection methods.
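One simple way to obtain a "most activated region" for visualization is to select the frame with the strongest response and threshold its activation map. The sketch below is only an assumed post-processing step, not the paper's procedure: the function name, the relative threshold `ratio` and the toy maps are hypothetical.

```python
def most_activated_region(maps, ratio=0.5):
    """Pick the frame with the strongest response and threshold its map.

    maps:  one 2D activation map (list of rows) per frame, for one concept.
    ratio: fraction of the peak value used as the binarization threshold.
    """
    def peak_of(frame):
        return max(max(row) for row in frame)

    best = max(range(len(maps)), key=lambda t: peak_of(maps[t]))
    peak = peak_of(maps[best])
    mask = [[int(v >= ratio * peak) for v in row] for row in maps[best]]
    return best, mask


# Two toy 2x3 activation maps for a single concept (hypothetical values).
frame0 = [[0.1, 0.2, 0.1], [0.0, 0.3, 0.1]]
frame1 = [[0.1, 0.8, 0.6], [0.0, 0.2, 0.1]]
best_frame, region = most_activated_region([frame0, frame1])
```

The resulting mask is coarse by construction, which matches the expectation that caption-level supervision does not yield precise pixel labelling.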

Fig. 6: Qualitative analysis of our generated video captions. We present 4 groups of video frames with their ground truth (GT) captions and the captions generated by the baseline and by our model. The baseline model does not use the proposed cross-modal graph framework and produces relatively short captions, which may lose some semantic attributes.
Fig. 7: Qualitative analysis of the learned cross-modal meta concepts and our generated captions. We present 4 groups of video frames with their ground truth (GT) captions and the captions generated by the baseline and by our model, where the first row shows the selected key frames and the second row shows the learned visual meta concepts. We mark the cross-modal meta concepts with the same color in the GT captions, and underlined generated words denote correspondence with the localized meta concepts.

The visualization of the generated captions from the videos. In Figure 6, we present examples of our generated video captions. For each video, we show the selected key frames, the ground truth (GT) captions, and the captions generated by the baseline model and by our proposed CMG model. In the first example, our generated caption also describes the person's clothes, namely the black shirt, while the baseline result loses such fine-grained attributes. In the last example, our generated caption describes more of the video context: a fashion show. Our generated video captions generally contain more useful semantic information than the baseline results. The proposed cross-modal graph with the learned meta concepts guides the video captioning model to focus on the fine-grained semantic attributes of the video frames, which allows the model to produce more informative textual descriptions than the baseline model.

The visualization of the correspondence between the generated visual and semantic meta concepts. In Figure 7, we show a qualitative analysis of our learned visual and semantic meta concepts, visualizing video frames from different scenarios. In the second example, which is an animation clip, our learned cross-modal meta concepts can localize the visual regions of cartoon characters, where some pretrained object detection models may fail [6]. The learned meta concepts also keep the generation model aware of the visual context, such as tournament in the first video and race in the third one, which leads to precise caption words about the video context. The model can hardly give reasonable attended visual regions for items that are not shown in the video frames, for instance camera in the last video caption. Generally, the learned cross-modal meta concepts show promising results: the CMG model gives more useful textual descriptions than the baseline model and thus improves captioning performance.

5 Conclusion

In this paper, we propose CMG with meta concepts for video captioning. Specifically, we use a weakly-supervised learning approach to localize the attended visual regions and their semantic classes for objects that appear in captions, in an attempt to cover classes that pretrained models do not define. We then use dynamic graph embeddings to aggregate semantically similar nodes and produce meta concept representations. To include predicate relationships between objects, we adopt detected scene graphs in frames to build video- and frame-level graphs and produce structure representations. We conduct extensive experiments and ablation studies, and achieve state-of-the-art results on the MSR-VTT and MSVD datasets for video captioning.


This research is supported, in part, by the National Research Foundation (NRF), Singapore under its AI Singapore Programme (AISG Award No: AISG-GC-2019-003) and under its NRF Investigatorship Programme (NRFI Award No. NRF-NRFI05-2019-0002). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of National Research Foundation, Singapore. This research is also supported, in part, by the Singapore Ministry of Health under its National Innovation Challenge on Active and Confident Ageing (NIC Project No. MOH/NIC/COG04/2017 and MOH/NIC/HAIG03/2017), and the MOE Tier-1 research grants: RG28/18 (S) and RG22/19 (S).


  • [1] Y. Yu, H. Ko, J. Choi, and G. Kim, “End-to-end concept word detection for video captioning, retrieval, and question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3165–3173.
  • [2] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
  • [3] J. Wang, W. Wang, Y. Huang, L. Wang, and T. Tan, “M3: Multimodal memory modelling for video captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7512–7520.
  • [4] Q. Jin, J. Chen, S. Chen, Y. Xiong, and A. Hauptmann, “Describing videos using multi-modal fusion,” in Proceedings of the 24th ACM international conference on Multimedia, 2016, pp. 1087–1091.
  • [5] V. Ramanishka, A. Das, D. H. Park, S. Venugopalan, L. A. Hendricks, M. Rohrbach, and K. Saenko, “Multimodal video description,” in Proceedings of the 24th ACM international conference on Multimedia, 2016, pp. 1092–1096.
  • [6] B. Pan, H. Cai, D.-A. Huang, K.-H. Lee, A. Gaidon, E. Adeli, and J. C. Niebles, “Spatio-temporal graph for video captioning with knowledge distillation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 870–10 879.
  • [7] Z. Zhang, Y. Shi, C. Yuan, B. Li, P. Wang, W. Hu, and Z.-J. Zha, “Object relational graph with teacher-recommended learning for video captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13 278–13 288.
  • [8] Q. Zheng, C. Wang, and D. Tao, “Syntax-aware action targeting for video captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13 096–13 105.
  • [9] J. Xu, T. Mei, T. Yao, and Y. Rui, “Msr-vtt: A large video description dataset for bridging video and language,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5288–5296.
  • [10] D. Chen and W. B. Dolan, “Collecting highly parallel data for paraphrase evaluation,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 190–200.
  • [11] B. Wang, L. Ma, W. Zhang, W. Jiang, J. Wang, and W. Liu, “Controllable video captioning with pos sequence guidance based on gated fusion network,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2641–2650.
  • [12] K. Tang, “A scene graph generation codebase in pytorch,” 2020.
  • [13] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville, “Describing videos by exploiting temporal structure,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 4507–4515.
  • [14] J. Xu, T. Yao, Y. Zhang, and T. Mei, “Learning multimodal attention lstm networks for video captioning,” in Proceedings of the 25th ACM international conference on Multimedia, 2017, pp. 537–545.
  • [15] B. Wang, L. Ma, W. Zhang, and W. Liu, “Reconstruction network for video captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7622–7631.
  • [16] H. Wang, D. Sahoo, C. Liu, E.-p. Lim, and S. C. Hoi, “Learning cross-modal embeddings with adversarial networks for cooking recipes and food images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11 572–11 581.
  • [17] Y. Chen, S. Wang, W. Zhang, and Q. Huang, “Less is more: Picking informative frames for video captioning,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 358–373.
  • [18] J. Zhang and Y. Peng, “Object-aware aggregation with bidirectional temporal graph for video captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8327–8336.
  • [19] Z. Zhang, Z. Qi, C. Yuan, Y. Shan, B. Li, Y. Deng, and W. Hu, “Open-book video captioning with retrieve-copy-generate network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9837–9846.
  • [20] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, “Image captioning with semantic attention,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4651–4659.
  • [21] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng, “Semantic compositional networks for visual captioning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5630–5639.
  • [22] Q. Wu, C. Shen, L. Liu, A. Dick, and A. Van Den Hengel, “What value do explicit high level concepts have in vision to language problems?” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 203–212.
  • [23] L. Zhou, Y. Kalantidis, X. Chen, J. J. Corso, and M. Rohrbach, “Grounded video description,” in CVPR, 2019.
  • [24] C.-Y. Ma, Y. Kalantidis, G. AlRegib, P. Vajda, M. Rohrbach, and Z. Kira, “Learning to generate grounded visual captions without localization supervision,” in Proceedings of the European Conference on Computer Vision (ECCV), vol. 2.   Springer, 2020.
  • [25] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in International Conference on Learning Representations (ICLR), 2017.
  • [26] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in International Conference on Learning Representations (ICLR), 2018.
  • [27] G. Li, M. Muller, A. Thabet, and B. Ghanem, “Deepgcns: Can gcns go as deep as cnns?” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 9267–9276.
  • [28] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph cnn for learning on point clouds,” Acm Transactions On Graphics (tog), vol. 38, no. 5, pp. 1–12, 2019.
  • [29] X. Yang, K. Tang, H. Zhang, and J. Cai, “Auto-encoding scene graphs for image captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10 685–10 694.
  • [30] X. Li and S. Jiang, “Know more say less: Image captioning based on scene graphs,” IEEE Transactions on Multimedia, vol. 21, no. 8, pp. 2117–2130, 2019.
  • [31] T. Yao, Y. Pan, Y. Li, and T. Mei, “Exploring visual relationship for image captioning,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 684–699.
  • [32] M. Xu, C. Zhao, D. S. Rojas, A. Thabet, and B. Ghanem, “G-tad: Sub-graph localization for temporal action detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 156–10 165.
  • [33] X. Liu, J.-Y. Lee, and H. Jin, “Learning video representations from correspondence proposals,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4273–4281.
  • [34] X. Wang and A. Gupta, “Videos as space-time region graphs,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 399–417.
  • [35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
  • [36] S. Phan, G. E. Henter, Y. Miyao, and S. Satoh, “Consensus-based sequence training for video captioning,” ArXiv e-prints, 2017.
  • [37] R. Luo, B. Price, S. Cohen, and G. Shakhnarovich, “Discriminability objective for training descriptive captions,” arXiv preprint arXiv:1803.04376, 2018.
  • [38] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International conference on machine learning, 2015, pp. 2048–2057.
  • [39] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [40] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–2890.
  • [41] W. Pei, J. Zhang, X. Wang, L. Ke, X. Shen, and Y.-W. Tai, “Memory-attended recurrent network for video captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8347–8356.
  • [42] S. Chen and Y.-G. Jiang, “Motion guided spatial attention for video captioning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 8191–8198.
  • [43] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick, “Microsoft coco captions: Data collection and evaluation server,” arXiv preprint arXiv:1504.00325, 2015.
  • [44] S. Phan, G. E. Henter, Y. Miyao, and S. Satoh, “Consensus-based sequence training for video captioning,” arXiv preprint arXiv:1712.09532, 2017.
  • [45] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” arXiv preprint arXiv:1602.07261, 2016.
  • [46] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 4489–4497.
  • [47] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in CVPR, 2014.
  • [48] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE transactions on acoustics, speech, and signal processing, vol. 28, no. 4, pp. 357–366, 1980.
  • [49] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.
  • [50] H. Chen, K. Lin, A. Maye, J. Li, and X. Hu, “A semantics-assisted video captioning model trained with scheduled sampling,” arXiv preprint arXiv:1909.00121, 2019.
  • [51] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1492–1500.
  • [52] M. Zolfaghari, K. Singh, and T. Brox, “Eco: Efficient convolutional network for online video understanding,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 695–712.
  • [53] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” International journal of computer vision, vol. 123, no. 1, pp. 32–73, 2017.
  • [54] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” arXiv preprint arXiv:1506.01497, 2015.
  • [55] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.