
SBAT: Video Captioning with Sparse Boundary-Aware Transformer

In this paper, we focus on the problem of applying the transformer structure to video captioning effectively. The vanilla transformer was proposed for uni-modal language generation tasks such as machine translation. However, video captioning is a multimodal learning problem, and the video features have much redundancy between different time steps. Based on these concerns, we propose a novel method called sparse boundary-aware transformer (SBAT) to reduce the redundancy in video representation. SBAT employs a boundary-aware pooling operation on the scores from multihead attention and selects diverse features from different scenarios. SBAT also includes a local correlation scheme to compensate for the local information loss brought by the sparse operation. Based on SBAT, we further propose an aligned cross-modal encoding scheme to boost the multimodal interaction. Experimental results on two benchmark datasets show that SBAT outperforms the state-of-the-art methods under most of the metrics.



1 Introduction

Recently, the combination of vision and language has attracted increasing attention [23, 14, 2, 12]. Video captioning is a valuable but challenging task in this area, where the goal is to generate text descriptions for video data directly. The difficulties of video captioning mainly lie in the modeling of temporal dynamics and the fusion of multiple modalities.

Encoder-decoder structures are widely used in video captioning [16, 1, 15, 20, 7]. In general, the encoder learns multiple types of features from raw video data, and the decoder utilizes these features to generate words. Most encoder-decoder structures are built upon the long short-term memory (LSTM) unit; however, LSTM has two main drawbacks. First, an LSTM-based decoder does not allow parallel prediction of words at different time steps, since its hidden state is computed from the previous one. Second, an LSTM-based encoder has insufficient capacity to capture long-range temporal correlations.

To tackle these issues, [5] and [24] proposed to replace LSTM with the transformer for video understanding. Specifically, [5] used multiple transformer-based encoders to encode video features and a transformer-based decoder to generate descriptions. Similarly, [24] utilized the transformer for dense video captioning: a transformer-based encoder detects action proposals, and a transformer-based decoder describes them simultaneously. Different from LSTM, the self-attention mechanism in the transformer correlates the features at any two time steps, enabling the global association of features. However, the vanilla transformer is limited in processing video features with much temporal redundancy, like the example in Fig. 1. In addition, the cross-modal interaction between different modalities is ignored in the existing transformer-based methods.

Figure 1: An example of redundancy between video frames.

Motivated by the above observations, we propose a novel method named sparse boundary-aware transformer (SBAT) to improve the transformer-based encoder and decoder architectures for video understanding. In the encoder, we employ a sparse attention mechanism to better capture the global and local dynamic information by reducing the redundancy between consecutive video frames. Specifically, to capture the global temporal dynamics, we divide all the time steps into chunks according to the gradient values of the attention logits and select the time steps with the top gradient values. To capture the local temporal dynamics, we implement self-attention between adjacent time steps. In the decoder, we also employ the boundary-aware strategy for encoder-decoder multihead attention. In addition, we implement cross-modal sparse attention following the self-attention layer to align multimodal features along the temporal dimension. We conduct extensive empirical studies on two benchmark video captioning datasets. The quantitative, qualitative, and ablation experimental results comprehensively reveal the effectiveness of our proposed methods.

The main contributions of this paper are threefold:

(1) We propose the sparse boundary-aware transformer (SBAT) to improve the vanilla transformer. We use boundary-aware pooling operation following the preliminary scores of multihead attention and select the features of different scenarios to reduce the redundancy.

(2) We develop a local correlation scheme to compensate for the local information loss brought by sparse operation. The scheme can be implemented synchronously with the boundary-aware strategy.

(3) We further propose a cross-modal encoding scheme to align the multimodal features along the temporal dimension.

2 Related Work

As a popular variant of RNN, LSTM is widely used in existing video captioning methods. [19] utilized LSTM to encode video features and decode words. [22] integrated the attention mechanism into video captioning, where the encoded features are given different attention weights according to the queries of decoder. [8] further proposed a two-level attention mechanism for video captioning. The first level focuses on different time steps, and the second level focuses on different modalities. [13] and [10] detected local attributes and used them as supplementary information. [9] introduced cross-modal correlation into attention mechanism. Recently, [5] proposed to replace LSTM with transformer in video captioning models. However, directly using transformer for video captioning has several drawbacks, i.e., the redundancy of video features and the lack of multimodal interaction modeling. In this paper, we propose a novel approach called sparse boundary-aware transformer (SBAT) to address these problems.

Figure 2: (a) is the overall framework of Vanilla Transformer. It consists of the multihead attention mechanism and the feed-forward neural network. The features of different modalities are processed separately, and the queries of the decoder associate these features to generate words.

denotes the number of stacked blocks. (b) is the architecture of SBAT. It introduces sparse boundary-aware strategy (Sp) into all the multihead attention blocks in encoder and decoder. In addition, we learn cross-modal interaction after the first feed-forward layer in an encoder block.

3 Transformer-based Video Captioning

Transformer [18] was originally proposed for machine translation. Due to its effectiveness and scalability, the transformer is employed in many other tasks, including video captioning. A simple illustration of a transformer-based video captioning model is shown in Fig. 2(a). The encoder and decoder both consist of multihead attention blocks and feed-forward neural networks.

3.1 Encoder

Different from the uni-modal inputs of machine translation, the inputs of video captioning are typically multimodal. As shown in Fig. 2(a), two separate encoders process image and motion features, respectively. We use and to denote the image and motion features, respectively. Here we take the process of image encoding as an example. The self-attention layer is formulated as


where "Cat" denotes the concatenation operation and the output projection matrix is a trainable variable. Multihead attention is a special variant of attention, where each head is calculated as


where the projection matrices for the query, key, and value of each head are also trainable variables, and "Attention" denotes scaled dot-product attention:


where the scaling factor is the square root of the dimension of the query and key. We adopt a residual connection and layer normalization after the self-attention layer:


Every self-attention layer is followed by a feed-forward layer (FFN) that employs non-linear transformation:


where the two weight matrices and the two bias vectors are trainable variables. The encoded image features are the output of an encoder block. The encoded motion features are calculated in the same way.
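For reference, the standard formulation of [18] that this subsection follows can be sketched as below. The symbol names ($X$ for the input features of one modality, $h$ for the number of heads, $d_k$, $W^{O}$, $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, $W_1$, $W_2$, $b_1$, $b_2$) are assumptions borrowed from [18] and may differ from the notation and equation numbering of this paper; the ReLU form of the feed-forward layer is likewise the one in [18].

```latex
\begin{align*}
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Cat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}, \\
\mathrm{head}_i &= \mathrm{Attention}(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}), \\
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \\
X' &= \mathrm{LayerNorm}\big(X + \mathrm{MultiHead}(X, X, X)\big), \\
\mathrm{FFN}(X') &= \max(0,\; X' W_1 + b_1)\, W_2 + b_2 .
\end{align*}
```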

3.2 Decoder

The decoder block consists of a self-attention layer, an enc-dec attention layer, and a feed-forward layer. In the self-attention layer, the word embeddings of different time steps associate with each other, and we take the output features as queries. In the enc-dec multihead attention layer, the query first associates image and motion features to get two context vectors respectively, then generates the words. The feed-forward layer in the decoder is the same as Eqns. 6 and 7. We also adopt residual connection and layer normalization after all the layers of the decoder.

Specifically, we use to denote the embeddings of target words. To predict the word at time step , the self-attention layer is formulated as


where denotes the word embeddings of time steps less than . The enc-dec attention layer is:


Following [8], we employ a hierarchical attention layer for and :


denotes the output of feed-forward layer. We calculate the probability distributions of words as:


where and

denote the encoded video features. The optimization goal is to minimize the cross-entropy loss function defined as accumulative loss from all the time steps:


where denotes the ground-truth word at time step .
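The accumulated cross-entropy objective described above can be illustrated with a minimal sketch. It assumes the decoder has already produced a probability distribution over the vocabulary at every time step; the function name, the toy vocabulary, and the probabilities are hypothetical.

```python
import math

def caption_cross_entropy(step_probs, target_ids):
    """Sum of -log p(ground-truth word) over all time steps of one caption."""
    loss = 0.0
    for probs, target in zip(step_probs, target_ids):
        loss -= math.log(probs[target])
    return loss

# Hypothetical 3-word vocabulary and a 2-step caption.
step_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
target_ids = [0, 1]
loss = caption_cross_entropy(step_probs, target_ids)
```

The loss shrinks toward zero as the probability assigned to each ground-truth word approaches one.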

4 Sparse Boundary-Aware Attention

Considering the redundancy of video features, it is not appropriate to compute attention weights using vanilla multihead attention. To solve the problem, we introduce a novel sparse boundary-aware strategy into the multihead attention. In Section 4.1, we introduce the sparse boundary-aware strategy. In Section 4.2, we provide the analysis of sparse boundary-aware strategy. In Section 4.3, we introduce the local correlation attention which compensates for the local information loss. In Section 4.4, we introduce an aligned cross-modal encoding scheme based on SBAT.

4.1 Sparse Boundary-Aware Pooling

We employ sparse boundary-aware strategy following the scaled dot-product attention logits. Specifically, the original logits are calculated as follows:


where and denote the query and key, respectively; represents the dimension of and . We utilize to represent the associated result of and . The discrete first derivative of in the second dimension is obtained as follows:


For time step of the query, we choose top- values in , since the boundary of two scenarios always has high gradient value.


where is the -th largest value of . We implement function for the processed .

Furthermore, to keep the time steps with large original logits, we define to replace :


is a special variant of when .
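A minimal sketch of the boundary-aware pooling idea for the logits of a single query is given below. Several details are assumptions rather than the exact definition used here: the gradient is taken as the absolute first difference with zero padding on the left, softmax is used as the normalizing function, and the names `k` and `boundary_aware_weights` are hypothetical. The variant that also keeps time steps with large original logits is not covered.

```python
import math

def discrete_gradient(row):
    """Absolute first difference along the key axis.

    Assumption: the row is zero-padded on the left, so the first
    time step gets gradient |row[0]|.
    """
    padded = [0.0] + list(row)
    return [abs(padded[j + 1] - padded[j]) for j in range(len(row))]

def keep_top_k(scores, k):
    """Keep entries >= the k-th largest score; mask the rest with -inf."""
    threshold = sorted(scores, reverse=True)[k - 1]
    return [s if s >= threshold else -math.inf for s in scores]

def softmax(row):
    """Numerically stable softmax; masked (-inf) entries get weight 0."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def boundary_aware_weights(logit_row, k):
    """Attention weights of one query after boundary-aware pooling."""
    return softmax(keep_top_k(discrete_gradient(logit_row), k))

# Hypothetical logits of one query: four identical frames of one
# scenario followed by two identical frames of another.
logit_row = [2.0, 2.0, 2.0, 2.0, 0.5, 0.5]
weights = boundary_aware_weights(logit_row, k=2)
selected = [j for j, w in enumerate(weights) if w > 0.0]
```

In this toy row only the first frame of each scenario keeps a non-zero weight; ties at the threshold would keep more than `k` entries.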

4.2 Theoretical Analysis of Boundary-Aware Pooling

Suppose we randomly choose one time step of as query , the query associates at all the time steps. The logits of scaled dot-product attention are . We calculate the attention weight of each time step as:


To the best of our knowledge, there are about scenarios on average in a ten-second video clip at a coarse granularity, like the example in Fig. 1. One-second clip usually contains frames. Therefore, most frames in the same scenario are redundant. Existing methods sample the video to a fixed number of frames or directly reduce the frame rate. Although such methods are effective to some extent, there is still much redundancy in the scenarios that have a large number of time steps. The total attention weights of the scenarios with fewer time steps may be influenced. Specifically, we divide time steps into two groups. The scenario one occupies time steps, the remaining time steps belong to scenario two. Suppose that the features of different time steps in the same scenario are the same, we obtain the total weights of two scenarios as follows:


where denotes the total weight of scenario , denotes the associated logit. Suppose the query is related to scenario two () and , the ratio of to may influence the total attention weights ( and ) of two scenarios.

More concretely, we assume that is and is . and are calculated as:


if we apply sparse boundary-aware pooling strategy () for the logits and sample one time step in each scenario. Both and are transformed and the weight of scenario two obviously increases.


However, when the query is related to scenario one (), it is not appropriate to reduce the proportion of scenario one. Therefore, we define to replace and select not only the boundaries of scenarios, but also the time steps with large original logits. Specifically, the number of selected steps is , we sample two boundaries in the two scenarios and the remaining time steps belong to scenario one. is obtained as:


for the increase from in Eqn. 21 to in Eqn. 25, we just need to ensure that .

When the video clip has more than two scenarios, we also divide them into two groups. One has the scenarios with larger logits, the other has the remaining scenarios. The above analysis of two scenarios is approximately applicable in this situation.
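The two-scenario argument can be checked numerically with a small sketch. Under the stated assumption that all time steps within a scenario share the same logit, the total softmax weight of a scenario is proportional to its number of steps times the exponential of its logit. The step counts and logit values below are hypothetical.

```python
import math

def scenario_weights(n1, n2, p1, p2):
    """Total softmax weight of two scenarios with n1 and n2 identical steps."""
    e1 = n1 * math.exp(p1)
    e2 = n2 * math.exp(p2)
    return e1 / (e1 + e2), e2 / (e1 + e2)

# Hypothetical setting: scenario one spans 8 steps, scenario two spans 2,
# and the query is more related to scenario two (p2 > p1).
dense = scenario_weights(8, 2, p1=1.0, p2=2.0)
# One sampled step per scenario, as after boundary-aware pooling.
sparse = scenario_weights(1, 1, p1=1.0, p2=2.0)
```

With dense attention the longer scenario dominates despite its smaller logit, while with one step per scenario the weight of scenario two rises above one half.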

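| Method | MSVD B | MSVD R | MSVD M | MSVD C | MSR-VTT B | MSR-VTT R | MSR-VTT M | MSR-VTT C |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla Transformer | 51.4 | 69.7 | 34.6 | 86.4 | 40.9 | 60.4 | 28.5 | 48.9 |
| SBAT (w/o CM) | 52.4 | 71.2 | 35.0 | 87.0 | 42.0 | 60.8 | 28.5 | 50.1 |
| SBAT (w/o Local) | 53.5 | 72.3 | 35.2 | 88.9 | 41.9 | 61.0 | 28.4 | 50.5 |
| SBAT (Sample) | 51.3 | 71.9 | 35.2 | 88.6 | 42.3 | 61.0 | 28.7 | 51.0 |
| SBAT | 53.1 | 72.3 | 35.3 | 89.5 | 42.9 | 61.5 | 28.9 | 51.6 |

Table 1: Evaluation results of our proposed methods on MSVD and MSR-VTT, where B, R, M, C denote BLEU4, ROUGE, METEOR, CIDEr, respectively. Note that we reproduce the results of Vanilla Transformer (TVT [5]). Due to a different learning rate strategy, our implementation achieves better performance than the original TVT on MSR-VTT.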

4.3 Local Correlation

Since we employ sparse boundary-aware strategy for the attention logits, the local information between consecutive frames is ignored. We develop a local correlation scheme based on the multihead attention to compensate for the information loss. Formally, the original logits are obtained following Eqn. 16. The correlation scheme is


where denotes the maximum distance of two frames and the correlation size is . In practice, the local correlation and boundary-aware correlation are utilized simultaneously.
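One way to picture the local correlation scheme is as a band mask on the logit matrix. The sketch below keeps a logit only when the query and key steps are within a maximum distance of each other; the name `max_dist` and the use of `-inf` masking before normalization are assumptions.

```python
import math

def local_mask(logits, max_dist):
    """Keep logits[i][j] only when |i - j| <= max_dist; mask the rest."""
    return [
        [p if abs(i - j) <= max_dist else -math.inf for j, p in enumerate(row)]
        for i, row in enumerate(logits)
    ]

# Hypothetical 5x5 logit matrix (all zeros) and a maximum distance of 1.
logits = [[0.0] * 5 for _ in range(5)]
masked = local_mask(logits, max_dist=1)
kept_per_row = [sum(p != -math.inf for p in row) for row in masked]
```

In this sketch interior rows keep `2 * max_dist + 1` entries, and rows at the start or end of the sequence keep fewer.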

4.4 Cross-Modal Scheme

Existing methods deal with different modalities separately in the encoder and ignore the interaction between different modalities. Here, we propose an aligned cross-modal scheme based on sparse boundary-aware attention. We divide the video into a fixed number of video chunks and then extract image and motion features from these chunks at the same intervals. Therefore, the feature vectors at the same step are extracted from the same video chunk. We directly apply our sparse boundary-aware attention to the aligned features. When the query is image modality, the key is motion modality, vice versa. Taking the former situation as an example, we compute the results of vanilla and boundary-aware cross-modal attentions as follows:


where CM denotes cross-modal.
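The vanilla cross-modal case, where image steps act as queries over temporally aligned motion steps, can be sketched as follows. This is a simplified single-head version without trainable projections, and using the motion features as both keys and values is an assumption; the feature values are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(image_feats, motion_feats):
    """Each image step attends over all motion steps (vanilla variant)."""
    dim = len(image_feats[0])
    outputs = []
    for q in image_feats:
        # Scaled dot product between one image query and every motion key.
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in motion_feats]
        weights = softmax(logits)
        # Weighted sum of the motion features, used here as values.
        outputs.append([sum(w * v[d] for w, v in zip(weights, motion_feats))
                        for d in range(dim)])
    return outputs

# Hypothetical aligned features: 3 time steps, 2 dimensions per modality.
image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
motion_feats = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
fused = cross_modal_attention(image_feats, motion_feats)
```

The boundary-aware variant would sparsify the logits of each query before the softmax instead of normalizing all of them.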

5 Video Captioning with SBAT

We introduce the encoder-decoder structure combined with our sparse boundary-aware attention for video captioning. As shown in Fig. 2(b), we replace all the vanilla multihead attention blocks with boundary-aware attention blocks, except for the self-attention block for target word embeddings. Different from the original structure, an additional cross-modal attention layer is adopted following the self-attention layer in the encoder. In the decoder, we also introduce the boundary-aware attention into the enc-dec attention layer, but we set to in Eqn. 18 and do not use local correlation.

6 Experimental Methodology

6.1 Datasets and Metrics

We evaluate SBAT on two benchmark video captioning datasets, MSVD [4] and MSR-VTT [21]. Both datasets are provided by Microsoft Research, and a series of state-of-the-art methods have been proposed based on them in recent years. MSVD contains 1,970 video clips; each clip is about 10 to 25 seconds long and annotated with about 40 English sentences. MSR-VTT is larger than MSVD, with 10,000 YouTube video clips in total, each annotated with 20 English sentences. We follow the commonly used protocol in previous work and evaluate methods under four standard metrics: BLEU, ROUGE, METEOR, and CIDEr.

6.2 Data Preprocessing

We extract image features and motion features of video data. For image features, we sample video data to frames and use the pre-trained Inception-ResNet-v2 [17] model to obtain the activations from the penultimate layer. For motion features, we divide the raw video data into video chunks centered on the sampled frames and use the pre-trained I3D [3] model to obtain the activations from the last convolutional layer. We implement a mean-pooling operation along the temporal dimension to get the motion features. On MSR-VTT, we also employ GloVe embeddings of the auxiliary video category labels to facilitate feature encoding.
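The sampling and pooling steps above can be sketched as follows. The index formula for equidistant sampling is one reasonable choice rather than the procedure used here, and the frame counts and activation values are hypothetical.

```python
def sample_frame_indices(num_frames, num_samples):
    """Equidistant indices of the sampled frames (centres of the chunks)."""
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

def mean_pool(chunk_features):
    """Average a chunk's per-step feature vectors along the temporal axis."""
    steps = len(chunk_features)
    return [sum(col) / steps for col in zip(*chunk_features)]

# Hypothetical clip of 100 frames sampled to 4 frames.
indices = sample_frame_indices(100, 4)
# Hypothetical chunk with 3 temporal steps of 2-dimensional activations.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Because each motion chunk is centred on a sampled frame, the image and motion feature sequences have the same length and are aligned step by step.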

6.3 Experimental Details

The hidden size is set to for all the multihead attention mechanisms. The numbers of heads and attention blocks are and , respectively. The value of is set to in the encoder and in the decoder. In the training phase, we use Adam [11] algorithm to optimize the loss function. The learning rate is initially set to . If the CIDEr on validation set does not improve over epochs, we change the learning rate to . The batch size is set to . In the testing phase, we use the beam-search method with a beam-width of to generate words. We use the pre-trained word2vec embeddings to initialize the word vectors. Each word is represented as a -dimension vector.
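The learning rate rule described above (lower the rate once validation CIDEr stops improving for a number of epochs) can be sketched as a small scheduler. The class name and all numeric values below are placeholders for illustration, not the settings of these experiments.

```python
class PlateauSchedule:
    """Lower the learning rate once validation CIDEr stops improving."""

    def __init__(self, initial_lr, reduced_lr, patience):
        self.lr = initial_lr
        self.reduced_lr = reduced_lr
        self.patience = patience
        self.best = float("-inf")
        self.stale_epochs = 0

    def step(self, val_cider):
        """Update with this epoch's validation CIDEr; return the current lr."""
        if val_cider > self.best:
            self.best = val_cider
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr = self.reduced_lr
        return self.lr

# Placeholder values only; they are not the hyperparameters of the paper.
schedule = PlateauSchedule(initial_lr=1e-3, reduced_lr=1e-4, patience=2)
history = [schedule.step(c) for c in [40.0, 45.0, 44.0, 44.5, 46.0]]
```

In this run the rate drops after two consecutive epochs without a new best CIDEr and stays at the reduced value afterwards.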

Figure 3: Effect of on MSR-VTT. We show the relative results on METEOR and CIDEr. Specifically, we set as the baseline.
Figure 4: Some qualitative results of the video clips on the test sets of MSR-VTT and MSVD. We provide the ground-truth description and the generated descriptions of Vanilla Transformer and SBAT for each video clip.

7 Experimental Results

7.1 Impact of Sparse Boundary-Aware Attention

We first evaluate the effectiveness of different variants of SBAT, as shown in Table 1. Vanilla Transformer and SBAT denote the models in Fig. 2(a) and (b). SBAT (w/o CM) denotes the model without aligned cross-modal attention. SBAT (w/o Local) denotes the model without local correlation in the encoder. SBAT (Sample) denotes the model with equidistant sampling for all the time steps, rather than our boundary-aware operation.

In Table 1, Vanilla Transformer achieves relatively weak results on both datasets. When we adopt the boundary-aware or equidistant sampling strategy in the multihead attention, the performance improves on most metrics. SBAT with boundary-aware attention, local correlation, and aligned cross-modal interaction achieves the best or near-best results under all the metrics. The comparison between SBAT (w/o CM) and SBAT shows that the cross-modal interaction provides useful cues for generating words. The comparison between SBAT (w/o Local) and SBAT shows that the local correlation can make up for the loss of local information. Comparing SBAT and SBAT (Sample), although equidistant sampling reduces the feature redundancy to some extent, it does not consider the ratio between different scenarios, while SBAT solves this problem effectively.

7.2 Comparison of and

To evaluate the impact of and find an appropriate ratio between and , we adjust the value of in Eqn. 18 based on SBAT. The experimental results are shown in Fig. 3. Note that we only adjust the value of in the encoder, and the value of in the decoder is always . We observe that with achieves the best performances on both METEOR and CIDEr. In addition, only using original logits () shows the worst performances, indicating that our proposed boundary-aware strategy is a significant boost for the transformer-based video captioning model.

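| Dataset | Method | B | R | M | C |
| --- | --- | --- | --- | --- | --- |
| MSR-VTT | TVT | 40.1 | 61.1 | 28.2 | 47.7 |
| MSR-VTT | MGSA | 42.4 | - | 27.6 | 47.5 |
| MSR-VTT | Dense Cap | 41.4 | 61.1 | 28.3 | 48.9 |
| MSR-VTT | MARN | 40.4 | 60.7 | 28.1 | 47.1 |
| MSR-VTT | GRU-EVE | 38.3 | 60.7 | 28.4 | 48.1 |
| MSR-VTT | POS-CG | 42.0 | 61.1 | 28.1 | 49.0 |
| MSR-VTT | SBAT | 42.9 | 61.5 | 28.9 | 51.6 |
| MSVD | TVT | 53.2 | - | 35.2 | 86.8 |
| MSVD | SCN | 51.1 | - | 33.5 | 77.7 |
| MSVD | MARN | 48.6 | 71.9 | 35.1 | 92.2 |
| MSVD | GRU-EVE | 47.9 | 71.5 | 35.0 | 78.1 |
| MSVD | POS-CG | 52.5 | 71.3 | 34.1 | 88.7 |
| MSVD | SBAT | 53.1 | 72.3 | 35.3 | 89.5 |

Table 2: Evaluation results of video captioning, where B, R, M, C denote BLEU4, ROUGE, METEOR, CIDEr, respectively.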

7.3 Comparison with State-of-the-art

Table 2 shows the results of different methods on MSVD and MSR-VTT. For a fair comparison, we compare SBAT with the methods which also use image features and motion features. The comparison methods include TVT [5], MGSA [6], Dense Cap [16], MARN [15], GRU-EVE [1], POS-CG [20], and SCN [7]. In Table 2, SBAT shows better or competitive performance compared with the state-of-the-art methods. On MSR-VTT, SBAT outperforms TVT, MGSA, Dense Cap, MARN, GRU-EVE, and POS-CG on all the metrics. In particular, SBAT achieves 51.6 on CIDEr, an improvement of 2.6 points over POS-CG. On MSVD, SBAT outperforms SCN, GRU-EVE, and POS-CG on all the metrics and has a better overall performance than TVT and MARN.
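The MSR-VTT comparison can be verified directly from the scores in Table 2. The short sketch below computes the per-metric gain of SBAT over each baseline; the variable and function names are hypothetical, and unreported scores are stored as `None`.

```python
# MSR-VTT rows of Table 2: (BLEU4, ROUGE, METEOR, CIDEr); None marks a
# score that is not reported.
msrvtt = {
    "TVT": (40.1, 61.1, 28.2, 47.7),
    "MGSA": (42.4, None, 27.6, 47.5),
    "Dense Cap": (41.4, 61.1, 28.3, 48.9),
    "MARN": (40.4, 60.7, 28.1, 47.1),
    "GRU-EVE": (38.3, 60.7, 28.4, 48.1),
    "POS-CG": (42.0, 61.1, 28.1, 49.0),
}
sbat = (42.9, 61.5, 28.9, 51.6)

def gains(baseline):
    """Per-metric gain of SBAT over one baseline (None where unreported)."""
    return tuple(None if b is None else round(s - b, 1)
                 for s, b in zip(sbat, baseline))

cider_gain_over_pos_cg = gains(msrvtt["POS-CG"])[3]
sbat_best_everywhere = all(
    g is None or g > 0 for row in msrvtt.values() for g in gains(row)
)
```

Every reported gain on MSR-VTT is positive, and the CIDEr gain over POS-CG is the 2.6 points quoted in the text.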

Figure 5: Visualization of the attention mechanism. (a) and (b) denote Vanilla Transformer and SBAT, respectively. Both axes denote continuous video frames. The descriptions generated by the two methods are both “a man is shooting a gun”.

7.4 Visualization of Attention Mechanism

To further illustrate the effectiveness of SBAT, we conduct a case study and visualize the attention distributions of SBAT and Vanilla Transformer. In Fig. 5, we take a video clip as an example. Note that we only visualize the weights of the image modality for convenience, and we do not show the local attention weights. Fig. 5(a) shows that the attention weights of Vanilla Transformer are dispersed and that Vanilla Transformer has a poor ability to detect the boundaries of different scenarios. In contrast, Fig. 5(b) shows that (1) SBAT has sparser attention weights than Vanilla Transformer; (2) SBAT accurately detects the scenario boundaries.

7.5 Qualitative Results

Fig. 4 shows several qualitative examples. We compare the descriptions generated by Vanilla Transformer, SBAT, and ground truth (GT). With the help of redundancy reduction and a better usage of global and local information, SBAT generates more accurate descriptions that are close to GT.

8 Conclusion

In this paper, we have proposed a new method called sparse boundary-aware transformer (SBAT) for video captioning. Specifically, we have proposed sparse boundary-aware strategy for improving the attention logits in vanilla transformer. Combined with local correlation and cross-modal encoding, SBAT can effectively reduce the feature redundancy and capture the global-local video information. The quantitative, qualitative, and ablation experiments on two benchmark datasets have demonstrated the advantage of SBAT.


This work is supported in part by the Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (No. 2018AAA0100904), the National Key R&D Program of China (No. 2018YFB1403600), NSFC (No. 61672456, 61702448, U19B2043), the Artificial Intelligence Research Foundation of Baidu Inc., funding from HIKVision and Horizon Robotics, and the ZJU Converging Media Computing Lab.


  • [1] N. Aafaq, N. Akhtar, W. Liu, S. Z. Gilani, and A. Mian (2019) Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning. In CVPR, Cited by: §1, §7.3.
  • [2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) Vqa: visual question answering. In ICCV, Cited by: §1.
  • [3] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, Cited by: §6.2.
  • [4] D. L. Chen and W. B. Dolan (2011) Collecting highly parallel data for paraphrase evaluation. In ACL, Cited by: §6.1.
  • [5] M. Chen, Y. Li, Z. Zhang, and S. Huang (2018) TVT: two-view transformer network for video captioning. In ACML, Cited by: §1, §2, §7.3.
  • [6] Cited by: §7.3.
  • [7] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng (2017) Semantic compositional networks for visual captioning. In CVPR, Cited by: §1, §7.3.
  • [8] C. Hori, T. Hori, T. Lee, Z. Zhang, B. Harsham, J. R. Hershey, T. K. Marks, and K. Sumi (2017) Attention-based multimodal fusion for video description. In ICCV, Cited by: §2, §3.2.
  • [9] T. Jin, S. Huang, Y. Li, and Z. Zhang (2019) Low-rank hoca: efficient high-order cross-modal attention for video captioning. arXiv preprint arXiv:1911.00212. Cited by: §2.
  • [10] T. Jin, Y. Li, and Z. Zhang (2019) Recurrent convolutional video captioning with global and local attention. Neurocomputing. Cited by: §2.
  • [11] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §6.3.
  • [12] X. Li, J. Song, L. Gao, X. Liu, W. Huang, X. He, and C. Gan (2019) Beyond rnns: positional self-attention with co-attention for video question answering. In AAAI, Cited by: §1.
  • [13] X. Long, C. Gan, and G. de Melo (2018) Video captioning with multi-faceted attention. TACL. Cited by: §2.
  • [14] Y. Pan, T. Yao, H. Li, and T. Mei (2017) Video captioning with transferred semantic attributes. In CVPR, Cited by: §1.
  • [15] W. Pei, J. Zhang, X. Wang, L. Ke, X. Shen, and Y. Tai (2019) Memory-attended recurrent network for video captioning. In CVPR, Cited by: §1, §7.3.
  • [16] Z. Shen, J. Li, Z. Su, M. Li, Y. Chen, Y. Jiang, and X. Xue (2017) Weakly supervised dense video captioning. In CVPR, Cited by: §1, §7.3.
  • [17] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi (2017) Inception-v4, inception-resnet and the impact of residual connections on learning.. In AAAI, Cited by: §6.2.
  • [18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, Cited by: §3.
  • [19] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko (2015) Sequence to sequence-video to text. In ICCV, Cited by: §2.
  • [20] B. Wang, L. Ma, W. Zhang, W. Jiang, J. Wang, and W. Liu (2019) Controllable video captioning with pos sequence guidance based on gated fusion network. In ICCV, Cited by: §1, §7.3.
  • [21] J. Xu, T. Mei, T. Yao, and Y. Rui (2016) Msr-vtt: a large video description dataset for bridging video and language. In CVPR, Cited by: §6.1.
  • [22] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville (2015) Describing videos by exploiting temporal structure. In ICCV, Cited by: §2.
  • [23] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo (2016) Image captioning with semantic attention. In CVPR, Cited by: §1.
  • [24] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong (2018) End-to-end dense video captioning with masked transformer. In CVPR, Cited by: §1.