1 Introduction
Recently, the combination of vision and language has attracted increasing attention [23, 14, 2, 12]. Video captioning is a valuable but challenging task in this area, whose goal is to generate textual descriptions directly from video data. The difficulties of video captioning mainly lie in modeling temporal dynamics and fusing multiple modalities.
Encoder-decoder structures are widely used in video captioning [16, 1, 15, 20, 7]. In general, the encoder learns multiple types of features from raw video data, and the decoder uses these features to generate words. Most encoder-decoder structures are built upon the long short-term memory (LSTM) unit; however, LSTM has two main drawbacks. First, an LSTM-based decoder cannot predict words at different time steps in parallel, since each hidden state is computed from the previous one. Second, an LSTM-based encoder has insufficient capacity to capture long-range temporal correlations.
To tackle these issues, [5] and [24] proposed to replace LSTM with the transformer for video understanding. Specifically, [5] used multiple transformer-based encoders to encode video features and a transformer-based decoder to generate descriptions. Similarly, for dense video captioning, [24] utilized a transformer-based encoder to detect action proposals and described them simultaneously with a transformer-based decoder. Different from LSTM, the self-attention mechanism in the transformer correlates the features at any two time steps, enabling global association of features. However, the vanilla transformer is limited in processing video features with heavy temporal redundancy, such as the example in Fig. 1. In addition, the cross-modal interaction between different modalities is ignored in existing transformer-based methods.
Motivated by the above observations, we propose a novel method named sparse boundary-aware transformer (SBAT) to improve the transformer-based encoder and decoder architectures for video understanding. In the encoder, we employ a sparse attention mechanism to better capture global and local dynamic information by removing the redundancy between consecutive video frames. Specifically, to capture the global temporal dynamics, we divide all the time steps into chunks according to the gradient values of the attention logits and select the time steps with the top-$k$ gradient values. To capture the local temporal dynamics, we implement self-attention between adjacent time steps. In the decoder, we also employ the boundary-aware strategy for the encoder-decoder multihead attention. In addition, we implement cross-modal sparse attention following the self-attention layer to align multimodal features along the temporal dimension. We conduct extensive empirical studies on two benchmark video captioning datasets. The quantitative, qualitative, and ablation results comprehensively demonstrate the effectiveness of the proposed method.

The main contributions of this paper are threefold:
(1) We propose the sparse boundary-aware transformer (SBAT) to improve the vanilla transformer. We apply a boundary-aware pooling operation to the preliminary multihead attention scores and select the features of different scenarios to reduce redundancy.
(2) We develop a local correlation scheme to compensate for the local information loss caused by the sparse operation. The scheme can be implemented synchronously with the boundary-aware strategy.
(3) We further propose a cross-modal encoding scheme to align the multimodal features along the temporal dimension.
2 Related Work
As a popular variant of RNN, LSTM is widely used in existing video captioning methods. [19] utilized LSTM to encode video features and decode words. [22] integrated the attention mechanism into video captioning, where the encoded features are weighted according to the decoder queries. [8] further proposed a two-level attention mechanism for video captioning: the first level attends over time steps, and the second over modalities. [13] and [10] detected local attributes and used them as supplementary information. [9] introduced cross-modal correlation into the attention mechanism. Recently, [5] proposed to replace LSTM with the transformer in video captioning models. However, directly using the transformer for video captioning has several drawbacks, i.e., the redundancy of video features and the lack of multimodal interaction modeling. In this paper, we propose a novel approach called sparse boundary-aware transformer (SBAT) to address these problems.

Figure 2: (a) The overall framework of the vanilla transformer. It consists of multihead attention mechanisms and feed-forward neural networks. The features of different modalities are processed separately, and the decoder queries associate these features to generate words. $N$ denotes the number of stacked blocks. (b) The architecture of SBAT. It introduces the sparse boundary-aware strategy (Sp) into all the multihead attention blocks in the encoder and decoder. In addition, we learn cross-modal interaction after the first feed-forward layer in an encoder block.

3 Transformer-based Video Captioning
Transformer [18] was originally proposed for machine translation. Due to its effectiveness and scalability, the transformer has been employed in many other tasks, including video captioning. A simple illustration of a transformer-based video captioning model is shown in Fig. 2(a). The encoder and decoder both consist of multihead attention blocks and feed-forward neural networks.
3.1 Encoder
Different from the uni-modal inputs of machine translation, the inputs of video captioning are typically multimodal. As shown in Fig. 2(a), two separate encoders process image and motion features, respectively. We use $F^{img}$ and $F^{mot}$ to denote the image and motion features, respectively. Here we take image encoding as an example. The self-attention layer is formulated as
$$A^{img} = \mathrm{MultiHead}(F^{img}, F^{img}, F^{img}), \qquad (1)$$

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Cat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}, \qquad (2)$$

where "Cat" denotes the concatenation operation and $W^{O}$ is a trainable variable. Multihead attention is a special variant of attention, where each head is calculated as
$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}), \qquad (3)$$

where $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are also trainable variables, and "Attention" denotes scaled dot-product attention:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \qquad (4)$$

where $d_k$ is the dimension of $Q$ and $K$. We adopt a residual connection and layer normalization after the self-attention layer:

$$\bar{A}^{img} = \mathrm{LayerNorm}(F^{img} + A^{img}). \qquad (5)$$
Every self-attention layer is followed by a feed-forward layer (FFN) that employs a non-linear transformation:

$$\mathrm{FFN}(x) = \max(0,\, x W_1 + b_1)\, W_2 + b_2, \qquad (6)$$

$$E^{img} = \mathrm{LayerNorm}\!\left(\bar{A}^{img} + \mathrm{FFN}(\bar{A}^{img})\right), \qquad (7)$$

where $W_1$, $b_1$, $W_2$, and $b_2$ are trainable variables. The encoded image features $E^{img}$ are the output of an encoder block. The encoded motion features $E^{mot}$ are calculated in the same way.
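For concreteness, here is a minimal NumPy sketch of one encoder block following Eqns. 1-7; the shapes, head count, and random initialization are illustrative assumptions rather than the paper's actual configuration.

```python
# Minimal NumPy sketch of one encoder block (Eqns. 1-7); shapes, head count,
# and weight initialization are illustrative assumptions, not the paper's code.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    mu, sigma = x.mean(-1, keepdims=True), x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def attention(Q, K, V):
    # Scaled dot-product attention (Eqn. 4).
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multihead(Q, K, V, W_q, W_k, W_v, W_o):
    # Eqns. 2-3: project, attend per head, concatenate, project back.
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o

def encoder_block(F, p):
    # Eqns. 1 and 5: self-attention, residual connection, layer normalization.
    A = multihead(F, F, F, p["W_q"], p["W_k"], p["W_v"], p["W_o"])
    A = layer_norm(F + A)
    # Eqns. 6-7: position-wise feed-forward network with residual + layer norm.
    H = np.maximum(0.0, A @ p["W_1"] + p["b_1"]) @ p["W_2"] + p["b_2"]
    return layer_norm(A + H)

# Toy usage: 20 time steps of 64-d image features, 4 heads of size 16.
rng = np.random.default_rng(0)
d, h, n = 64, 4, 20
params = {
    "W_q": [rng.normal(0, 0.02, (d, d // h)) for _ in range(h)],
    "W_k": [rng.normal(0, 0.02, (d, d // h)) for _ in range(h)],
    "W_v": [rng.normal(0, 0.02, (d, d // h)) for _ in range(h)],
    "W_o": rng.normal(0, 0.02, (d, d)),
    "W_1": rng.normal(0, 0.02, (d, 4 * d)), "b_1": np.zeros(4 * d),
    "W_2": rng.normal(0, 0.02, (4 * d, d)), "b_2": np.zeros(d),
}
E_img = encoder_block(rng.normal(size=(n, d)), params)  # shape (20, 64)
```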
3.2 Decoder
The decoder block consists of a self-attention layer, an enc-dec attention layer, and a feed-forward layer. In the self-attention layer, the word embeddings of different time steps associate with each other, and we take the output features as queries. In the enc-dec multihead attention layer, the query first associates the image and motion features to obtain two context vectors, which are then used to generate the words. The feed-forward layer in the decoder is the same as Eqns. 6 and 7. We also adopt residual connections and layer normalization after all the layers of the decoder.

Specifically, we use $W = \{w_1, \ldots, w_T\}$ to denote the embeddings of the target words. To predict the word at time step $t$, the self-attention layer is formulated as
$$h_t = \mathrm{MultiHead}(w_t, W_{<t}, W_{<t}), \qquad (8)$$

where $W_{<t}$ denotes the word embeddings of the time steps less than $t$. The enc-dec attention layer is:
$$c_t^{img} = \mathrm{MultiHead}(h_t, E^{img}, E^{img}), \qquad (9)$$

$$c_t^{mot} = \mathrm{MultiHead}(h_t, E^{mot}, E^{mot}), \qquad (10)$$

where $c_t^{img}$ and $c_t^{mot}$ are the two context vectors. They are fused and fed to the feed-forward layer, and $z_t$ denotes the output of the feed-forward layer. We calculate the probability distribution over words as:
$$P(y_t \mid y_{<t}, E^{img}, E^{mot}) = \mathrm{softmax}(z_t W_p + b_p), \qquad (13)$$

where $E^{img}$ and $E^{mot}$ denote the encoded video features, and $W_p$ and $b_p$ are trainable variables. The optimization goal is to minimize the cross-entropy loss, defined as the accumulated loss over all time steps:
$$\mathcal{L} = -\sum_{t=1}^{T} \log P\!\left(y_t^{*} \mid y_{<t}^{*}, E^{img}, E^{mot}\right), \qquad (14)$$

where $y_t^{*}$ denotes the ground-truth word at time step $t$.
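A corresponding sketch of one decoding step and the training loss (Eqns. 8-14) is given below. It reuses `softmax`, `layer_norm`, and `multihead` from the encoder sketch, shares one parameter dictionary purely for brevity, and the concatenation-based fusion of the two context vectors (with hypothetical weights `W_fuse`, `W_vocab`, `b_vocab`) is an assumption rather than the paper's exact design.

```python
# Hedged sketch of a single decoding step (Eqns. 8-14), reusing softmax(),
# layer_norm(), multihead(), rng, d, params, and E_img from the encoder sketch.
def decoder_step(w_le_t, E_img, E_mot, p):
    # Eqn. 8: self-attention over the word embeddings up to time step t;
    # the last row corresponds to the current query.
    H = layer_norm(w_le_t + multihead(w_le_t, w_le_t, w_le_t,
                                      p["W_q"], p["W_k"], p["W_v"], p["W_o"]))
    q = H[-1:]
    # Eqns. 9-10: enc-dec attention against encoded image / motion features.
    c_img = multihead(q, E_img, E_img, p["W_q"], p["W_k"], p["W_v"], p["W_o"])
    c_mot = multihead(q, E_mot, E_mot, p["W_q"], p["W_k"], p["W_v"], p["W_o"])
    # Assumed fusion: concatenate the two context vectors and project back.
    z = layer_norm(q + np.concatenate([c_img, c_mot], -1) @ p["W_fuse"])
    # Eqn. 13: probability distribution over the vocabulary.
    return softmax(z @ p["W_vocab"] + p["b_vocab"]).ravel()

def xent_loss(step_probs, gt_ids):
    # Eqn. 14: accumulated cross-entropy over all time steps.
    return -sum(np.log(probs[y] + 1e-12) for probs, y in zip(step_probs, gt_ids))

# Hypothetical extra parameters for fusion and the output projection.
params["W_fuse"] = rng.normal(0, 0.02, (2 * d, d))
params["W_vocab"] = rng.normal(0, 0.02, (d, 1000)); params["b_vocab"] = np.zeros(1000)
probs = decoder_step(rng.normal(size=(5, d)), E_img, E_img, params)  # 1000-way distribution
```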
4 Sparse Boundary-Aware Attention
Considering the redundancy of video features, it is not appropriate to compute attention weights with vanilla multihead attention. To solve this problem, we introduce a novel sparse boundary-aware strategy into multihead attention. In Section 4.1, we introduce the sparse boundary-aware strategy. In Section 4.2, we provide an analysis of the strategy. In Section 4.3, we introduce the local correlation attention, which compensates for the local information loss. In Section 4.4, we introduce an aligned cross-modal encoding scheme based on SBAT.
4.1 Sparse Boundary-Aware Pooling
We apply the sparse boundary-aware strategy to the scaled dot-product attention logits. Specifically, the original logits are calculated as

$$S = \frac{Q K^{\top}}{\sqrt{d_k}}, \qquad (15)$$

where $Q$ and $K$ denote the query and key, respectively, and $d_k$ represents the dimension of $Q$ and $K$. We utilize $S$ to represent the associated result of $Q$ and $K$. The discrete first derivative of $S$ along the second (key) dimension is obtained as

$$G_{i,j} = \left| S_{i,j+1} - S_{i,j} \right|. \qquad (16)$$
For time step $i$ of the query, we keep the top-$k$ values in $G_{i}$, since the boundary between two scenarios always has a high gradient value:

$$\tilde{S}_{i,j} = \begin{cases} S_{i,j}, & G_{i,j} \ge \theta_i \\ -\infty, & \text{otherwise,} \end{cases} \qquad (17)$$

where $\theta_i$ is the $k$-th largest value of $G_{i}$. We then apply the softmax function to the processed logits $\tilde{S}$.
Furthermore, to keep the time steps with large original logits, we define $G'$ to replace $G$:

$$G'_{i,j} = \lambda\, G_{i,j} + (1 - \lambda)\, S_{i,j}, \qquad (18)$$

where $\lambda$ balances the boundary gradients against the original logits. $G$ is the special case of $G'$ with $\lambda = 1$.
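A minimal sketch of the pooling procedure (Eqns. 15-18) follows, assuming the linear mixing form of Eqn. 18 and a $-\infty$ mask before the softmax; the values of `k` and `lam` are illustrative, not the paper's settings.

```python
# Sketch of sparse boundary-aware pooling (Eqns. 15-18); the mixing weight
# `lam` and the -inf masking convention follow the reconstruction in the text.
import numpy as np

def sparse_boundary_aware_logits(Q, K, k=8, lam=0.5):
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)                         # Eqn. 15: original logits
    G = np.abs(np.diff(S, axis=1, append=S[:, -1:]))   # Eqn. 16: discrete first derivative
    G_prime = lam * G + (1.0 - lam) * S                # Eqn. 18: keep large original logits too
    # Eqn. 17: for each query step, keep the top-k positions of G', mask the rest.
    thresh = np.sort(G_prime, axis=1)[:, -k][:, None]
    masked = np.where(G_prime >= thresh, S, -np.inf)
    e = np.exp(masked - masked.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)            # sparse attention weights

# Toy usage: 30 key steps, 64-d queries/keys, keep 8 steps per query.
rng = np.random.default_rng(1)
W = sparse_boundary_aware_logits(rng.normal(size=(30, 64)), rng.normal(size=(30, 64)))
assert np.allclose((W > 0).sum(axis=1), 8)  # each query attends to k positions
```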
4.2 Theoretical Analysis of Boundary-Aware Pooling
Suppose we randomly choose one time step of the features as the query $q$; the query associates with the keys at all $n$ time steps. The logits of the scaled dot-product attention are $s_1, \ldots, s_n$. We calculate the attention weight of each time step as

$$\alpha_j = \frac{\exp(s_j)}{\sum_{l=1}^{n} \exp(s_l)}. \qquad (19)$$
Empirically, there are only a few scenarios in a ten-second video clip at a coarse granularity, like the example in Fig. 1, whereas a one-second clip usually contains dozens of frames (24 to 30 at common frame rates). Therefore, most frames within the same scenario are redundant. Existing methods sample the video to a fixed number of frames or directly reduce the frame rate. Although such methods are effective to some extent, much redundancy remains in the scenarios that span many time steps, and the total attention weight of the scenarios with fewer time steps may be suppressed. Specifically, we divide the $n$ time steps into two groups: scenario one occupies $m$ time steps, and the remaining $n-m$ time steps belong to scenario two. Supposing that the features of different time steps in the same scenario are identical, we obtain the total weights of the two scenarios as follows:
$$\alpha^{(i)} = \sum_{j \in \mathcal{T}_i} \alpha_j, \quad i \in \{1, 2\}, \qquad (20)$$

where $\alpha^{(i)}$ denotes the total weight of scenario $i$, $\mathcal{T}_i$ denotes its set of time steps, and $s^{(i)}$ denotes its associated logit. Suppose the query is related to scenario two ($s^{(2)} > s^{(1)}$) and $m > n-m$; then the ratio of $m$ to $n-m$ may distort the total attention weights $\alpha^{(1)}$ and $\alpha^{(2)}$ of the two scenarios.
More concretely, under the identical-feature assumption, $\alpha^{(1)}$ and $\alpha^{(2)}$ are calculated as

$$\alpha^{(1)} = \frac{m\, e^{s^{(1)}}}{m\, e^{s^{(1)}} + (n-m)\, e^{s^{(2)}}}, \qquad (21)$$

$$\alpha^{(2)} = \frac{(n-m)\, e^{s^{(2)}}}{m\, e^{s^{(1)}} + (n-m)\, e^{s^{(2)}}}, \qquad (22)$$

so the redundant scenario one can dominate even though $s^{(2)} > s^{(1)}$.
If we apply the sparse boundary-aware pooling strategy (Eqn. 17) to the logits and sample one time step in each scenario, both $\alpha^{(1)}$ and $\alpha^{(2)}$ are transformed and the weight of scenario two obviously increases:

$$\hat{\alpha}^{(1)} = \frac{e^{s^{(1)}}}{e^{s^{(1)}} + e^{s^{(2)}}}, \qquad (23)$$

$$\hat{\alpha}^{(2)} = \frac{e^{s^{(2)}}}{e^{s^{(1)}} + e^{s^{(2)}}}. \qquad (24)$$
However, when the query is related to scenario one ($s^{(1)} > s^{(2)}$), it is not appropriate to reduce the proportion of scenario one. Therefore, we define $G'$ to replace $G$ and select not only the boundaries of the scenarios but also the time steps with large original logits. Specifically, if the number of selected steps is $k$ and we sample the two boundary steps of the two scenarios, the remaining $k-2$ selected time steps belong to scenario one, and $\hat{\alpha}^{(1)}$ is obtained as

$$\hat{\alpha}^{(1)} = \frac{(k-1)\, e^{s^{(1)}}}{(k-1)\, e^{s^{(1)}} + e^{s^{(2)}}}. \qquad (25)$$
When the video clip has more than two scenarios, we also divide them into two groups. One has the scenarios with larger logits, the other has the remaining scenarios. The above analysis of two scenarios is approximately applicable in this situation.
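The following toy computation (with illustrative numbers, not values from the paper) makes the analysis concrete: a redundant scenario one with 24 time steps and a scenario two with 6 time steps.

```python
# Illustrative numbers (not from the paper): n = 30 time steps, scenario one
# occupies m = 24 redundant steps, scenario two the remaining 6.
import numpy as np

def weights(c1, s1, c2, s2):
    """Total softmax weight of each scenario given step counts and shared logits."""
    w1, w2 = c1 * np.exp(s1), c2 * np.exp(s2)
    return w1 / (w1 + w2), w2 / (w1 + w2)

m, r = 24, 6
# Case 1: query related to scenario two (s2 > s1). Vanilla attention lets the
# redundant scenario dominate; sampling one step per scenario fixes this.
print(weights(m, 1.0, r, 2.0))   # Eqns. 21-22: (0.60, 0.40)
print(weights(1, 1.0, 1, 2.0))   # Eqns. 23-24: (0.27, 0.73)

# Case 2: query related to scenario one (s1 > s2). Pure boundary pooling would
# shrink scenario one too much; G' keeps k-1 high-logit steps of scenario one.
k = 8
print(weights(m, 2.0, r, 1.0))       # vanilla: (0.92, 0.08)
print(weights(1, 2.0, 1, 1.0))       # boundary only: (0.73, 0.27)
print(weights(k - 1, 2.0, 1, 1.0))   # Eqn. 25: (0.95, 0.05)
```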
Table 1: Ablation study of SBAT variants on MSVD and MSR-VTT.

| Method | MSVD BLEU4 | MSVD ROUGE | MSVD METEOR | MSVD CIDEr | MSR-VTT BLEU4 | MSR-VTT ROUGE | MSR-VTT METEOR | MSR-VTT CIDEr |
|---|---|---|---|---|---|---|---|---|
| Vanilla Transformer | 51.4 | 69.7 | 34.6 | 86.4 | 40.9 | 60.4 | 28.5 | 48.9 |
| SBAT (w/o CM) | 52.4 | 71.2 | 35.0 | 87.0 | 42.0 | 60.8 | 28.5 | 50.1 |
| SBAT (w/o Local) | 53.5 | 72.3 | 35.2 | 88.9 | 41.9 | 61.0 | 28.4 | 50.5 |
| SBAT (Sample) | 51.3 | 71.9 | 35.2 | 88.6 | 42.3 | 61.0 | 28.7 | 51.0 |
| SBAT | 53.1 | 72.3 | 35.3 | 89.5 | 42.9 | 61.5 | 28.9 | 51.6 |
4.3 Local Correlation
Since we employ the sparse boundary-aware strategy on the attention logits, the local information between consecutive frames is ignored. We develop a local correlation scheme based on multihead attention to compensate for this information loss. Formally, the original logits $S$ are obtained following Eqn. 15. The correlation scheme is

$$\hat{S}_{i,j} = \begin{cases} S_{i,j}, & |i - j| \le r \\ -\infty, & \text{otherwise,} \end{cases} \qquad (26)$$

where $r$ denotes the maximum distance between two frames and the correlation size is $2r+1$. In practice, the local correlation and the boundary-aware correlation are utilized simultaneously.
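A small sketch of the local band mask in Eqn. 26 is shown below, under the same $-\infty$ masking convention as before; the half-width `r` and the way the two sets of logits are combined are assumptions.

```python
# Sketch of the local correlation mask (Eqn. 26): each query step may only
# attend to key steps within a band of half-width r around it.
import numpy as np

def local_correlation_logits(S, r=2):
    n = S.shape[0]
    idx = np.arange(n)
    band = np.abs(idx[:, None] - idx[None, :]) <= r   # correlation size 2r + 1
    return np.where(band, S, -np.inf)

# In practice these banded logits are used together with the boundary-aware
# logits, e.g. by softmax-normalizing each and summing the two attention outputs
# (the exact combination rule is an assumption, not stated in the text above).
```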
4.4 Cross-Modal Scheme
Existing methods deal with different modalities separately in the encoder and ignore the interaction between them. Here, we propose an aligned cross-modal scheme based on sparse boundary-aware attention. We divide the video into a fixed number of chunks and then extract image and motion features from these chunks at the same intervals, so the feature vectors at the same time step are extracted from the same video chunk. We directly apply our sparse boundary-aware attention to the aligned features: when the query is the image modality, the key is the motion modality, and vice versa. Taking the former case as an example, we compute the results of the vanilla and boundary-aware cross-modal attentions as follows:

$$A^{CM} = \mathrm{MultiHead}(E^{img}, E^{mot}, E^{mot}), \qquad (27)$$

$$A^{CM}_{sp} = \mathrm{SpMultiHead}(E^{img}, E^{mot}, E^{mot}), \qquad (28)$$

where CM denotes cross-modal and SpMultiHead denotes multihead attention equipped with the sparse boundary-aware strategy of Eqns. 15-18.
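A sketch of this cross-modal branch is given below, reusing `softmax` and `sparse_boundary_aware_logits` from the earlier sketches; the single-head formulation is a simplification of the multihead layers actually used.

```python
# Sketch of the aligned cross-modal attention (Eqns. 27-28): image features act
# as queries, motion features as keys/values (the symmetric direction is analogous).
def cross_modal(E_img, E_mot, k=8, lam=0.5, sparse=True):
    if sparse:
        # Boundary-aware cross-modal attention: sparse weights over motion steps.
        W = sparse_boundary_aware_logits(E_img, E_mot, k=k, lam=lam)   # Eqn. 28
    else:
        W = softmax(E_img @ E_mot.T / np.sqrt(E_img.shape[-1]))        # Eqn. 27
    return W @ E_mot
```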
5 Video Captioning with SBAT
We introduce the encoder-decoder structure combined with our sparse boundary-aware attention for video captioning. As shown in Fig. 2(b), we replace all the vanilla multihead attention blocks with boundary-aware attention blocks, except for the self-attention block over the target word embeddings. Different from the original structure, an additional cross-modal attention layer is adopted after the self-attention layer in the encoder. In the decoder, we also introduce boundary-aware attention into the enc-dec attention layer, but we fix $\lambda$ in Eqn. 18 and do not use local correlation.
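Putting the pieces together, a hedged single-head sketch of one SBAT encoder block for the image branch might look as follows (reusing `layer_norm`, `sparse_boundary_aware_logits`, and `cross_modal` from the earlier sketches); the placement of residual connections and the second feed-forward layer are assumptions beyond what Fig. 2(b) specifies.

```python
# Hedged sketch of one SBAT encoder block for the image branch (Fig. 2(b)):
# sparse self-attention, feed-forward, aligned cross-modal attention, feed-forward.
def sbat_encoder_block(F_img, F_mot, p, k=8, lam=0.5):
    # Sparse boundary-aware self-attention (single-head for brevity).
    A = layer_norm(F_img + sparse_boundary_aware_logits(F_img, F_img, k, lam) @ F_img)
    A = layer_norm(A + np.maximum(0.0, A @ p["W_1"] + p["b_1"]) @ p["W_2"] + p["b_2"])
    # Cross-modal interaction after the first feed-forward layer:
    # image queries attend to motion keys/values.
    C = layer_norm(A + cross_modal(A, F_mot, k=k, lam=lam, sparse=True))
    return layer_norm(C + np.maximum(0.0, C @ p["W_1"] + p["b_1"]) @ p["W_2"] + p["b_2"])
```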
6 Experimental Methodology
6.1 Datasets and Metrics
We evaluate SBAT on two benchmark video captioning datasets, MSVD [4] and MSR-VTT [21]. Both datasets are provided by Microsoft Research, and a series of state-of-the-art methods have been evaluated on them in recent years. MSVD contains 1,970 video clips; each clip lasts about 10 seconds on average and is annotated with roughly 40 English sentences. MSR-VTT is larger than MSVD, with 10,000 YouTube video clips in total, and each clip is annotated with 20 English sentences. We follow the commonly used protocol of previous work and evaluate all methods under four standard metrics: BLEU, ROUGE, METEOR, and CIDEr.
6.2 Data Preprocessing
We extract image and motion features from the video data. For image features, we sample each video to a fixed number of frames and use the pre-trained Inception-ResNet-v2 [17] model to obtain the activations of the penultimate layer. For motion features, we divide the raw video into chunks centered on the sampled frames and use the pre-trained I3D [3] model to obtain the activations of the last convolutional layer, followed by a mean-pooling operation along the temporal dimension. On MSR-VTT, we also employ GloVe embeddings of the auxiliary video category labels to facilitate feature encoding.
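As a shape-level illustration of this preprocessing (the dimensionalities below are the commonly reported sizes for these backbones and may differ from the authors' exact setup):

```python
# Illustrative shapes only: per-chunk I3D activations are mean-pooled along the
# temporal axis so each sampled frame is paired with one motion feature vector.
import numpy as np

i3d_activations = np.random.rand(30, 4, 1024)    # 30 chunks, 4 temporal slices, 1024-d
motion_features = i3d_activations.mean(axis=1)   # (30, 1024): one vector per chunk
image_features = np.random.rand(30, 1536)        # Inception-ResNet-v2 penultimate layer (1536-d)
```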
6.3 Experimental Details
The hidden size is set to for all the multihead attention mechanisms. The numbers of heads and attention blocks are and , respectively. The value of is set to in the encoder and in the decoder. In the training phase, we use Adam [11] algorithm to optimize the loss function. The learning rate is initially set to . If the CIDEr on validation set does not improve over epochs, we change the learning rate to . The batch size is set to . In the testing phase, we use the beam-search method with a beam-width of to generate words. We use the pre-trained word2vec embeddings to initialize the word vectors. Each word is represented as a -dimension vector.


7 Experimental Results
7.1 Impact of Sparse Boundary-Aware Attention
We first evaluate the effectiveness of different variants of SBAT, as shown in Table 1. Vanilla Transformer and SBAT denote the models in Fig. 2(a) and (b), respectively. SBAT (w/o CM) denotes the model without aligned cross-modal attention. SBAT (w/o Local) denotes the model without local correlation in the encoder. SBAT (Sample) denotes the model with equidistant sampling over all the time steps instead of our boundary-aware operation.
In Table 1, Vanilla Transformer achieves relatively poor results on both datasets, whereas adopting the boundary-aware or equidistant sampling strategy in the multihead attentions clearly improves performance. SBAT with boundary-aware attention, local correlation, and aligned cross-modal interaction achieves promising results under all the metrics. The comparison between SBAT (w/o CM) and SBAT shows that the cross-modal interaction provides useful cues for generating words. The comparison between SBAT (w/o Local) and SBAT shows that the local correlation compensates for the loss of local information. Comparing SBAT and SBAT (Sample), equidistant sampling reduces the feature redundancy to some extent, but it ignores the ratio between different scenarios, which SBAT handles effectively.
7.2 Comparison of $G$ and $S$
To evaluate the impact of $\lambda$ and find an appropriate ratio between the boundary gradients $G$ and the original logits $S$, we adjust the value of $\lambda$ in Eqn. 18 based on SBAT. The experimental results are shown in Fig. 3. Note that we only adjust $\lambda$ in the encoder; the value of $\lambda$ in the decoder is kept fixed. We observe that an intermediate value of $\lambda$ achieves the best performance on both METEOR and CIDEr. In addition, using only the original logits ($\lambda = 0$) gives the worst performance, indicating that the proposed boundary-aware strategy is a significant boost for the transformer-based video captioning model.
Table 2: Comparison with state-of-the-art methods on MSR-VTT and MSVD (B: BLEU4, R: ROUGE, M: METEOR, C: CIDEr).

| Dataset | Method | B | R | M | C |
|---|---|---|---|---|---|
| MSR-VTT | TVT | 40.1 | 61.1 | 28.2 | 47.7 |
| MSR-VTT | MGSA | 42.4 | - | 27.6 | 47.5 |
| MSR-VTT | Dense Cap | 41.4 | 61.1 | 28.3 | 48.9 |
| MSR-VTT | MARN | 40.4 | 60.7 | 28.1 | 47.1 |
| MSR-VTT | GRU-EVE | 38.3 | 60.7 | 28.4 | 48.1 |
| MSR-VTT | POS-CG | 42.0 | 61.1 | 28.1 | 49.0 |
| MSR-VTT | SBAT | 42.9 | 61.5 | 28.9 | 51.6 |
| MSVD | TVT | 53.2 | - | 35.2 | 86.8 |
| MSVD | SCN | 51.1 | - | 33.5 | 77.7 |
| MSVD | MARN | 48.6 | 71.9 | 35.1 | 92.2 |
| MSVD | GRU-EVE | 47.9 | 71.5 | 35.0 | 78.1 |
| MSVD | POS-CG | 52.5 | 71.3 | 34.1 | 88.7 |
| MSVD | SBAT | 53.1 | 72.3 | 35.3 | 89.5 |
7.3 Comparison with State-of-the-art
Table 2 shows the results of different methods on MSVD and MSR-VTT. For a fair comparison, we compare SBAT with methods that also use image and motion features, including TVT [5], MGSA [6], Dense Cap [16], MARN [15], GRU-EVE [1], POS-CG [20], and SCN [7]. SBAT shows better or competitive performance compared with the state-of-the-art methods. On MSR-VTT, SBAT outperforms TVT, MGSA, Dense Cap, MARN, GRU-EVE, and POS-CG on all the metrics; in particular, SBAT achieves 51.6% on CIDEr, an improvement of 2.6% over POS-CG. On MSVD, SBAT outperforms SCN, GRU-EVE, and POS-CG on all the metrics and has better overall performance than TVT and MARN.

7.4 Visualization of Attention Mechanism
To further illustrate the effectiveness of SBAT, we conduct a case study and visualize the attention distributions of SBAT and Vanilla Transformer. Fig. 5 takes one video clip as an example; for clarity, we only visualize the weights of the image modality and omit the local attention weights. Fig. 5(a) shows that the attention weights of Vanilla Transformer are dispersed and that it poorly detects the boundaries between scenarios. In contrast, Fig. 5(b) shows that (1) SBAT produces sparser attention weights than Vanilla Transformer, and (2) SBAT accurately detects the scenario boundaries.
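Heat-maps like those in Fig. 5 can be produced with a few lines of matplotlib, assuming the decoder exposes a (words x frames) matrix of enc-dec attention weights; the matrix below is a random placeholder, not actual model output.

```python
# Minimal attention-weight heat-map, assuming a (num_words, num_frames) matrix
# of enc-dec attention weights is available from the decoder.
import numpy as np
import matplotlib.pyplot as plt

attn = np.random.rand(8, 30)                 # placeholder weights: 8 words x 30 frames
attn /= attn.sum(axis=1, keepdims=True)      # rows are softmax-normalized
plt.imshow(attn, aspect="auto", cmap="viridis")
plt.xlabel("video time step"); plt.ylabel("generated word")
plt.colorbar(label="attention weight")
plt.show()
```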
7.5 Qualitative Results
Fig. 4 shows several qualitative examples, comparing the descriptions generated by Vanilla Transformer and SBAT with the ground truth (GT). With the help of redundancy reduction and better use of global and local information, SBAT generates more accurate descriptions that are closer to the GT.
8 Conclusion
In this paper, we have proposed a new method called sparse boundary-aware transformer (SBAT) for video captioning. Specifically, we have proposed a sparse boundary-aware strategy to improve the attention logits of the vanilla transformer. Combined with local correlation and cross-modal encoding, SBAT effectively reduces feature redundancy and captures global-local video information. Quantitative, qualitative, and ablation experiments on two benchmark datasets have demonstrated the advantages of SBAT.
Acknowledgments
This work is supported in part by Science and Technology Innovation 2030 –“New Generation Artificial Intelligence” Major Project No.(2018AAA0100904), National Key R&D Program of China (No. 2018YFB1403600), NSFC (No. 61672456, 61702448, U19B2043), Artificial Intelligence Research Foundation of Baidu Inc., the funding from HIKVision and Horizon Robotics, and ZJU Converging Media Computing Lab.
References
- [1] (2019) Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning. In CVPR, Cited by: §1, §7.3.
- [2] (2015) Vqa: visual question answering. In ICCV, Cited by: §1.
- [3] (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, Cited by: §6.2.
- [4] (2011) Collecting highly parallel data for paraphrase evaluation. In ACL, Cited by: §6.1.
- [5] (2018) TVT: two-view transformer network for video captioning. In ACML, Cited by: §1, §2, §7.3.
- [6] Cited by: §7.3.
- [7] (2017) Semantic compositional networks for visual captioning. In CVPR, Cited by: §1, §7.3.
- [8] (2017) Attention-based multimodal fusion for video description. In ICCV, Cited by: §2, §3.2.
- [9] (2019) Low-rank hoca: efficient high-order cross-modal attention for video captioning. arXiv preprint arXiv:1911.00212. Cited by: §2.
- [10] (2019) Recurrent convolutional video captioning with global and local attention. Neurocomputing. Cited by: §2.
- [11] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §6.3.
- [12] (2019) Beyond rnns: positional self-attention with co-attention for video question answering. In AAAI, Cited by: §1.
- [13] (2018) Video captioning with multi-faceted attention. TACL. Cited by: §2.
- [14] (2017) Video captioning with transferred semantic attributes. In CVPR, Cited by: §1.
- [15] (2019) Memory-attended recurrent network for video captioning. In CVPR, Cited by: §1, §7.3.
- [16] (2017) Weakly supervised dense video captioning. In CVPR, Cited by: §1, §7.3.
- [17] (2017) Inception-v4, inception-resnet and the impact of residual connections on learning.. In AAAI, Cited by: §6.2.
- [18] (2017) Attention is all you need. In NIPS, Cited by: §3.
- [19] (2015) Sequence to sequence-video to text. In ICCV, Cited by: §2.
- [20] (2019) Controllable video captioning with pos sequence guidance based on gated fusion network. In ICCV, Cited by: §1, §7.3.
- [21] (2016) Msr-vtt: a large video description dataset for bridging video and language. In CVPR, Cited by: §6.1.
- [22] (2015) Describing videos by exploiting temporal structure. In ICCV, Cited by: §2.
- [23] (2016) Image captioning with semantic attention. In CVPR, Cited by: §1.
- [24] (2018) End-to-end dense video captioning with masked transformer. In CVPR, Cited by: §1.