At any time instant, countless events that happen in the real world are captured by cameras and stored as massive video data resources. To effectively retrieve such recordings, whether in offline or online settings, video captioning is an essential technology thanks to its ability to understand scenes and describe events in natural language.
Since it was first proposed [1, 2], video captioning has been actively researched in the field of computer vision [3, 4, 5, 6, 7] using sequence-to-sequence models in an end-to-end manner [8]. Its goal is to generate a video description (caption) about objects and events in a video clip. To further leverage audio features to identify events, Hori et al. [9] proposed a multimodal attention approach that fuses audio and visual features such as VGGish [10] and I3D [11] to generate video captions. Such video clip captioning technologies have been extended to offline video stream captioning technologies such as dense video captioning [12] and the progressive video description generator [13], where all salient events in a video stream are temporally localized, and event-triggered captions are generated in a multi-thread manner. While video captioning technologies had so far been based on LSTMs, Iashin and Rahtu [14] successfully applied the Transformer [15, 16, 17] together with the audio-visual attention framework [9]. In that work, the audio-visual Transformer was tested using the ActivityNet Captions dataset [12] within an offline video captioning system and achieved the best performance for the dense video captioning task. However, such offline video captioning technologies are not practical in real-time monitoring or surveillance systems, in which it is essential not only to describe events accurately but also to produce captions as soon as possible so that events can be found and reported quickly. Low-latency captioning is required to realize such functionality, but this research area has not been pursued yet.
This paper proposes a novel approach that optimizes the output timing for each caption based on a trade-off between latency and caption quality. We train a low-latency audio-visual Transformer composed of (1) a Transformer-based caption generator which tries to generate ground-truth captions after only seeing a small portion of all video frames, and also to mimic the outputs of a similar pre-trained caption generator that is allowed to see the entire video, and (2) a CNN-based timing detector that can find the best timing to output a caption, such that the captions ultimately generated by the above two Transformers become sufficiently close to each other.
The proposed jointly-trained caption generator and timing detector can generate captions in an early stage of a video clip, as soon as an event happens. Additionally, this framework has the potential to forecast future events in online captions. Furthermore, by combining multimodal sensing information, an event can be recognized earlier, triggered by the earliest cue in one of the modalities, without waiting for cues in the other modalities. For example, the proposed approach has the potential to generate a caption at the timing of an audio cue, before the corresponding visual cue appears. Such low-latency online video captioning using multimodal sensing information will contribute not only to retrieving events quickly but also to answering questions about scenes earlier [18, 19].
2 Related work
There are some works on low-latency end-to-end sentence generation for machine translation (MT) and automatic speech recognition (ASR). To realize real-time interpretation systems, simultaneous translation using greedy decoding was proposed and opened up the issue of streaming for neural MT (NMT) [20, 21, 22, 23, 24], where the emission point at which a phrase has been fully translated into the target language is determined incrementally. Another approach iteratively retranslates by concatenating subsequent words and updating the output [25, 26]. The goal there is to generate a partial translation before the full source sentence has been translated. In contrast, our goal is to generate a full caption as soon as the system believes enough cues have been captured, before seeing the entire video. Real-time ASR technology is also essential for applications such as closed captions. Some end-to-end systems regularize or penalize the emission delay using endpoint detection, and penalty terms that constrain alignments have been proposed [27, 28, 29, 30]. There, the target is to generate a transcription slightly earlier than the end of an utterance. In contrast, our target is to generate video captions as early as possible before the end of events.
In the field of computer vision, PickNet was proposed to find salient visual frames sufficient to generate video captions, where the target number of frames was given, and the captioning capability using only the selected frames was evaluated [31]. The paper mentions that it may be possible to apply PickNet to online captioning and shows a sample use case, but no quantitative evaluation was performed. Another work relevant to online video captioning attempts to anticipate captions for future frames [32]. This approach exploits the current event features as contextual features and inputs them into a captioning module to generate future captions, using the temporal dependency between events in a sequence.
3 Online multi-modal captioning Transformer
We describe the proposed low-latency video captioning model. Figure 1 illustrates the model architecture, which consists of an audio-visual encoder, an end detector, and a caption decoder, where the encoder is shared by the detector and the decoder. Our model is based on the Transformer architecture [15] and its multimodal extension [14], but it receives video and audio features in a streaming manner, and the end detector decides when to generate a caption for the feature sequence the model has received until that moment.
Given a video stream, the audio-visual encoder extracts VGGish and I3D features from the audio and video tracks, respectively, where the frame rate may be different on each track. The sequences of audio and visual features from a starting point to the current time are fed to the encoder and converted to hidden vector sequences through self-attention, bi-modal attention, and feed-forward layers. Typically, this encoder block is repeated $N$ times. The final encoded representation is obtained via the $N$-th encoder block.
Let $X_A$ and $X_V$ be audio and visual signals. First, the feature extraction module is applied to the input signals as $A_0 = \mathrm{VGGish}(X_A)$ and $V_0 = \mathrm{I3D}(X_V)$ to obtain feature vector sequences corresponding to the VGGish and I3D features, respectively. Each encoder block then computes hidden vector sequences $A_n$ and $V_n$ from $A_{n-1}$ and $V_{n-1}$ using multi-head attention $\mathrm{MHA}(\cdot)$ and feed-forward network $\mathrm{FFN}(\cdot)$ layers. Layer normalization [33] is applied before every $\mathrm{MHA}(\cdot)$ and $\mathrm{FFN}(\cdot)$ layer, but it is omitted from the equations for simplicity. $\mathrm{MHA}(\cdot)$ takes three arguments: query, key, and value vector sequences [15]. The self-attention layer extracts temporal dependency within each modality, where the arguments for $\mathrm{MHA}(\cdot)$ are all the same, i.e., $A_{n-1}$ or $V_{n-1}$, as in (2) and (3). The bi-modal attention layers further extract cross-modal dependency between audio and visual features, taking the keys and values from the other modality as in (4) and (5). After that, the feed-forward layers are applied in a point-wise manner. The encoded representations for audio and visual features are obtained as $A_N$ and $V_N$.
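For concreteness, one encoder block consistent with the description above can be written as follows, with $A_n$ and $V_n$ denoting the audio and visual hidden sequences after the $n$-th block. This is a sketch rather than a definitive statement of the model: the residual connections and the bar/tilde notation for intermediate sequences are assumptions, and layer normalization is omitted as in the text.

```latex
\begin{align}
\bar{A}_n   &= A_{n-1} + \mathrm{MHA}(A_{n-1}, A_{n-1}, A_{n-1}), \tag{2}\\
\bar{V}_n   &= V_{n-1} + \mathrm{MHA}(V_{n-1}, V_{n-1}, V_{n-1}), \tag{3}\\
\tilde{A}_n &= \bar{A}_n + \mathrm{MHA}(\bar{A}_n, \bar{V}_n, \bar{V}_n), \tag{4}\\
\tilde{V}_n &= \bar{V}_n + \mathrm{MHA}(\bar{V}_n, \bar{A}_n, \bar{A}_n), \tag{5}\\
A_n         &= \tilde{A}_n + \mathrm{FFN}(\tilde{A}_n), \tag{6}\\
V_n         &= \tilde{V}_n + \mathrm{FFN}(\tilde{V}_n). \tag{7}
\end{align}
```

In (2) and (3) the query, key, and value are the same sequence, whereas in (4) and (5) the query comes from one modality and the keys and values from the other.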
The end detector receives the encoded representation $(A_N, V_N)$ based on the audio-visual information available at the moment. Its role is to decide whether or not the system should generate a caption for the given encoded features. The detector first processes the encoded vector sequence from each modality with stacked 1D-convolution layers. The time-convolved sequences are then summarized into a single vector $h$ through pooling and concatenation operations. A feed-forward layer $\mathrm{FFN}(\cdot)$ and a sigmoid function $\sigma(\cdot)$ convert the summary vector to the probability $P(d = 1 \mid A_N, V_N) = \sigma(\mathrm{FFN}(h))$, where $d \in \{0, 1\}$ indicates whether a relevant caption can be generated or not. Once the end detector provides a probability higher than a threshold $F$, the decoder generates a caption based on the encoded representation $(A_N, V_N)$.
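The end detector described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: zero-padded convolutions, max-pooling over time, a single linear output layer, and the `params` layout are all hypothetical choices, since the text only specifies stacked 1D convolutions, pooling, concatenation, a feed-forward layer, and a sigmoid.

```python
import math

def conv1d(seq, kernels, bias):
    """Zero-padded 1D convolution over time.
    seq: T frames of C_in floats; kernels: [C_out][K][C_in]; bias: [C_out]."""
    k = len(kernels[0])
    left = (k - 1) // 2
    zero = [0.0] * len(seq[0])
    padded = [zero] * left + list(seq) + [zero] * (k - 1 - left)
    out = []
    for t in range(len(seq)):
        window = padded[t:t + k]
        out.append([
            b + sum(wi * xi for wj, frame in zip(w, window) for wi, xi in zip(wj, frame))
            for w, b in zip(kernels, bias)
        ])
    return out

def relu(seq):
    return [[max(0.0, x) for x in frame] for frame in seq]

def max_pool(seq):
    """Max over the time axis: one summary vector per sequence (assumed pooling)."""
    return [max(column) for column in zip(*seq)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def end_probability(audio, visual, params):
    """Probability that a relevant caption can already be generated."""
    summary = []
    for seq, (k1, b1, k2, b2) in ((audio, params["audio"]), (visual, params["visual"])):
        hidden = conv1d(relu(conv1d(seq, k1, b1)), k2, b2)  # two stacked layers
        summary += max_pool(hidden)                          # concatenate modalities
    weights, offset = params["out"]
    return sigmoid(sum(w * h for w, h in zip(weights, summary)) + offset)

def should_emit(audio, visual, params, threshold):
    return end_probability(audio, visual, params) > threshold
```

The detector is re-evaluated each time new audio or visual frames arrive, so the pooled summary always reflects only the frames received so far.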
The decoder iteratively predicts the next word from a starting token (<sos>). At each iteration step, it receives a partial caption that has already been generated, and predicts the next word by applying decoder blocks and a prediction network, where each word is assumed to be converted to a word embedding vector.
Let $Y_i$ be the partial caption after $i$ iterations. Each decoder block has self-attention, bi-modal source attention, and feed-forward layers. The self-attention layer converts the word vectors to high-level representations considering their temporal dependency in (11). The bi-modal source attention layers update the word representations based on their relevance to the encoded multi-modal representations in (12) and (13). A feed-forward layer is then applied to the outputs of the bi-modal attention layers in (14) and (15). Finally, a linear transform and a softmax operation are applied to the output of the last decoder block to obtain the probability distribution of the next word, $P(y_{i+1} \mid Y_i, A_N, V_N)$ for $y_{i+1} \in \mathcal{V}$, where $\mathcal{V}$ denotes the vocabulary.
After picking the one-best word $\hat{y}_{i+1}$, the partial caption is extended by adding the selected word to the previous partial caption as $Y_{i+1} = Y_i, \hat{y}_{i+1}$. This is a greedy search process that ends if $\hat{y}_{i+1} =$ <eos>, which represents an end token. It is also possible to pick multiple words with the highest probabilities and consider multiple caption candidates according to the beam search technique.
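The greedy search described above can be sketched as follows. The decoder is abstracted as a callable `next_word_probs` that maps a partial caption to a word distribution. That callable, the `max_len` cap, and the toy bigram table are hypothetical stand-ins for the actual decoder blocks.

```python
def greedy_decode(next_word_probs, max_len=20, sos="<sos>", eos="<eos>"):
    """Extend the partial caption with the one-best word until <eos> is picked."""
    caption = [sos]
    for _ in range(max_len):               # max_len is an assumed safety cap
        probs = next_word_probs(caption)   # dict: word -> probability
        word = max(probs, key=probs.get)   # one-best word
        if word == eos:
            break
        caption.append(word)
    return caption[1:]                     # drop the starting token

# Toy next-word model that only looks at the last word (illustration only).
TOY_TABLE = {
    "<sos>": {"a": 0.6, "the": 0.3, "<eos>": 0.1},
    "a": {"man": 0.7, "dog": 0.2, "<eos>": 0.1},
    "man": {"runs": 0.5, "<eos>": 0.4, "a": 0.1},
    "runs": {"<eos>": 0.9, "a": 0.1},
}

def toy_next_word_probs(partial_caption):
    return TOY_TABLE[partial_caption[-1]]

print(greedy_decode(toy_next_word_probs))  # ['a', 'man', 'runs']
```

Beam search would keep several partial captions per step instead of only the one-best extension.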
Similar architectures have been used for dense video captioning tasks [34, 14], where an event localization network is placed on top of the encoder similarly to our end detector. A difference with those models is that the localization network is assumed to access all frames of the video and chooses a set of regions, which potentially includes specific events, while our end detector can access only partial frames from the beginning or a certain point to the current frame and detect a timing at which the system should emit the caption. Thus, our model is designed and trained for online captioning.
We learn the multi-modal encoder, the end detector, and the caption decoder jointly, so that the model achieves a caption quality comparable to that for a complete video, even if the given video has been shortened by truncating its later part.
Two types of loss functions are combined: a captioning loss to improve the caption quality and an end detection loss to detect the right timing to emit a caption. Figure 2 shows an example of a video stream, where an event starts at time $T_s$ and ends at time $T_e$, and is associated with a ground-truth caption $Y$. If time $T_o$ is picked as the emission timing, the captioning decoder generates a caption based on the multi-modal input signal $X_{T_s:T_o}$.
The captioning loss is based on a standard cross-entropy loss for the ground-truth caption $Y$ and a Kullback–Leibler (KL) divergence loss between the predictions of a pre-trained model allowed to process the complete video and those of the target model, which can only process incomplete videos. This is a student-teacher learning approach to exploit another model's superior description power [35], where the teacher model predicts a caption using the entire video clip $X_{T_s:T_e}$ and the student model tries to mimic the teacher's predictions using only the truncated video clip $X_{T_s:T_o}$. This makes the training more stable and achieves better performance.
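As a rough illustration of the captioning loss, the sketch below combines a cross-entropy term on the ground-truth words with a KL term between teacher and student word distributions. The per-step distributions are plain dictionaries, and the weights `w_ce` and `w_kl` are hypothetical parameters introduced only for this example.

```python
import math

def cross_entropy(student_probs, target_words):
    """Negative log-likelihood of the ground-truth words under the student.
    student_probs[i] is a dict word -> probability for step i."""
    return -sum(math.log(p[w]) for p, w in zip(student_probs, target_words))

def kl_divergence(teacher_probs, student_probs):
    """Sum over steps of KL(teacher || student) between word distributions."""
    total = 0.0
    for p_teacher, p_student in zip(teacher_probs, student_probs):
        for word, t in p_teacher.items():
            if t > 0.0:  # terms with zero teacher probability contribute nothing
                total += t * math.log(t / p_student[word])
    return total

def captioning_loss(teacher_probs, student_probs, target_words, w_ce=1.0, w_kl=1.0):
    """Weighted sum of both terms; the weights are assumptions of this sketch."""
    return (w_ce * cross_entropy(student_probs, target_words)
            + w_kl * kl_divergence(teacher_probs, student_probs))
```

Here the teacher distributions would come from the model that sees the entire clip, and the student distributions from the model that sees only the truncated clip.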
The end detection loss is based on a binary cross-entropy for appropriate timings. In general, however, such timing information does not exist in the training data set. In this work, we decide the right timing based on whether or not the captioning decoder can generate a relevant caption, that is, a caption sufficiently close to the ground-truth $Y$ or to the caption $Y'$ generated for the entire video clip using the pre-trained model. The detection loss is computed with a binary label $d$, which is determined using $\mathrm{Sim}(\cdot, \cdot)$, a similarity measure between two word sequences. In this work, we use word accuracy computed in a teacher-forcing manner. $S$ is a pre-determined threshold which judges whether or not the online caption is sufficiently close to the references $Y$ and $Y'$.
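The labeling rule and the binary cross-entropy can be sketched as below. Two points are assumptions of this sketch rather than statements from the text: the two similarities are combined with `max` (reading "close to the ground truth or the teacher caption" as a logical or), and the comparison with the threshold uses `>=`.

```python
import math

def word_accuracy(reference, predicted):
    """Fraction of positions where the teacher-forced prediction matches the reference."""
    if not reference:
        return 0.0
    hits = sum(r == p for r, p in zip(reference, predicted))
    return hits / len(reference)

def detection_label(sim_to_ground_truth, sim_to_teacher, threshold):
    """1 if the online caption is close enough to either reference, else 0."""
    return 1 if max(sim_to_ground_truth, sim_to_teacher) >= threshold else 0

def detection_loss(prob, label, eps=1e-12):
    """Binary cross-entropy between the detector output and the label."""
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1.0 - prob + eps))

sim_gt = word_accuracy(["a", "man", "runs"], ["a", "dog", "runs"])        # 2 of 3 match
sim_teacher = word_accuracy(["a", "man", "runs"], ["a", "man", "runs"])   # all match
label = detection_label(sim_gt, sim_teacher, threshold=0.9)
```

With teacher forcing, the prediction at each position is conditioned on the reference prefix, so the predicted and reference sequences have the same length and can be compared position by position.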
The training process for model $\theta$ repeats the following steps:
1. Compute the loss $\mathcal{L}$,
2. Update $\theta$ using the gradient of $\mathcal{L}$.
The inference is performed in two steps:
1. Find the time $T_o$ that first satisfies $P(d = 1 \mid X_{T_s:T_o}) > F$,
2. Generate a caption based on $X_{T_s:T_o}$,
where $F$ is a pre-determined threshold to control the sensitivity of end detection. Note that we assume that $T_s$ is already determined.
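The two inference steps can be sketched as a simple streaming loop. The callables `end_probability` and `generate_caption` are hypothetical stand-ins for the end detector and the caption decoder, and falling back to the full clip when the detector never fires is an assumption of this sketch.

```python
def online_caption(frames, end_probability, generate_caption, threshold):
    """Return (emission index, caption) for the first prefix that triggers the detector."""
    for t in range(1, len(frames) + 1):
        prefix = frames[:t]                       # features received so far
        if end_probability(prefix) > threshold:   # step 1: first time the detector fires
            return t, generate_caption(prefix)    # step 2: caption from the prefix
    # Assumed fallback: caption the whole clip if the threshold is never exceeded.
    return len(frames), generate_caption(frames)

# Toy stand-ins: each "frame" is a scalar evidence score (illustration only).
toy_frames = [0.1, 0.2, 0.7, 0.9]
toy_detector = max
toy_decoder = lambda prefix: f"caption from {len(prefix)} frames"

print(online_caption(toy_frames, toy_detector, toy_decoder, threshold=0.5))
```

Raising the threshold delays emission, and lowering it emits earlier at the risk of a less accurate caption.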
We evaluate our low-latency caption generation method using the ActivityNet Captions dataset [12], which consists of 100k caption sentences associated with temporal localization information based on 20k YouTube videos. Although the conventional video description datasets MSVD (YouTube2Text) [36] and MSR-VTT [37] have 41 and 20 ground-truth captions for each video clip, respectively, ActivityNet Captions has only one for each event. The dataset is split into 50%, 25%, and 25% for training, validation, and testing. However, since the ground-truth captions for the test set are not available, we split the validation set into two subsets on which we report the performance, as done in a prior study [14]. The average duration of a video clip is 35.5, 37.7, and 40.2 seconds for the training set and the validation subsets 1 and 2, respectively. We used the VGGish and I3D features provided by the authors of [14]. The VGGish features were configured to form a 128-dimensional vector sequence for the audio track of each video, where each audio frame corresponds to a 0.96 s segment without overlap. The I3D features were configured to form a 1024-dimensional vector sequence for the video track, where each visual frame corresponds to a 2.56 s segment without overlap.
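Because the two feature streams use different segment lengths, the number of frames available at a given moment differs per modality. The small sketch below counts them, assuming that a feature frame becomes available only once its whole segment has elapsed. Times are kept in integer milliseconds to avoid floating-point rounding, and the function name is hypothetical.

```python
AUDIO_SEGMENT_MS = 960    # one VGGish frame per 0.96 s, no overlap
VISUAL_SEGMENT_MS = 2560  # one I3D frame per 2.56 s, no overlap

def frames_available(elapsed_ms):
    """Number of complete (audio, visual) feature frames after elapsed_ms."""
    return elapsed_ms // AUDIO_SEGMENT_MS, elapsed_ms // VISUAL_SEGMENT_MS

print(frames_available(10_600))  # (11, 4)
print(frames_available(37_700))  # (39, 14)
```

For example, after 10.6 s the encoder would have 11 audio frames but only 4 visual frames, which illustrates why an audio cue can become available before a visual one.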
A multi-modal Transformer was first trained with entire video clips and their ground-truth captions. This model was used as a baseline and teacher model. The encoder and decoder each consisted of stacked blocks, and the number of attention heads was 4. The vocabulary size was 10,172, and the dimension of word embedding vectors was 300.
The proposed model for online captioning was trained with incomplete video clips according to the steps in Section 3.2. The architecture was the same as the baseline/teacher model except for the addition of the end detector. In the training process, the settings of the loss function were kept fixed. The dimensions of hidden activations in audio and visual attention layers were 128 and 1024, respectively. The dropout rate was set to 0.1, and a label smoothing technique was also applied. The end detector had 2 stacked 1D-convolution layers, with a ReLU non-linearity in between. The performance was measured by BLEU3, BLEU4, and METEOR scores.
Figure 3 shows the latency ratio (left) and METEOR scores (right) on validation subset 1 when training with different similarity thresholds $S$ in (21). The latency ratio indicates the ratio of the video duration used for captioning to the duration of the original video clip. With the baseline model, the latency ratio is always 1, which means all frames are used to generate captions. With our proposed method, the latency ratio and METEOR scores change depending on the value of $S$, where a larger $S$ gives a stricter condition on the caption accuracy, resulting in later detection, while a smaller $S$ results in earlier detection. As learning proceeds, the latency ratio gradually decreases, but the METEOR score tends to stay close to the baseline. This result demonstrates that the learning process works to reduce the latency while maintaining caption quality.
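The latency ratio and the relative caption quality are simple ratios, sketched below. The function names are hypothetical. The example inputs are the average durations quoted in this paper (10.6 s of a 37.7 s clip) and the visual-only METEOR scores from Table 1.

```python
def latency_ratio(used_seconds, clip_seconds):
    """Share of the clip that was observed before the caption was emitted."""
    return used_seconds / clip_seconds

def relative_score(score, baseline_score):
    """Caption quality as a fraction of the full-video baseline."""
    return score / baseline_score

# 10.6 s observed out of an average 37.7 s clip (validation subset 1).
print(round(latency_ratio(10.6, 37.7), 2))    # 0.28

# Visual-only rows of Table 1: METEOR 10.05 and 9.71 against the 10.21 baseline.
print(round(relative_score(10.05, 10.21), 3))  # 0.984
print(round(relative_score(9.71, 10.21), 3))   # 0.951
```

A latency ratio of 1 corresponds to the baseline, which always waits for the end of the clip.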
Table 1 compares captioning methods in BLEU and METEOR scores on validation subset 1. The model selected for evaluation had the best METEOR score on validation subset 2. We controlled the latency with the detection threshold $F$. As shown in the table, our proposed method at a 55% latency achieves a METEOR score of 10.45 with only a small degradation, which corresponds to 98% of the baseline score. At a 28% latency, it still achieves a METEOR score corresponding to 94% of the baseline. We also evaluated a naive method which takes video frames from the beginning with a fixed ratio to the original video length and runs the baseline captioning on the truncated video clip. The results show that the proposed approach clearly outperforms the naive method at an equivalent latency.
The table also includes the results for a unimodal Transformer that receives only the visual feature. The results show that the proposed method works for the visual feature only, but the performance is degraded due to the lack of the audio feature. This result indicates that the audio feature is essential even in the proposed low-latency method.
| Method | Latency | BLEU3 | BLEU4 | METEOR |
|---|---|---|---|---|
| Proposed (w/o ST) | 55% | 4.22 | 1.77 | 10.38 |
| Proposed (w/o ST) | 29% | 3.75 | 1.52 | 9.93 |
| Baseline (visual only) | 100% | 4.08 | 1.80 | 10.21 |
| Proposed (visual only) | 54% | 3.82 | 1.61 | 10.05 |
| Proposed (visual only) | 30% | 3.45 | 1.42 | 9.71 |
In this paper, we proposed a low-latency audio-visual captioning method, which describes events accurately and quickly without waiting for the end of video clips. The proposed method optimizes each caption’s output timing based on a trade-off between latency and caption quality. We have demonstrated that the proposed system can generate captions in early stages of event-triggered video clips, achieving 94% of the caption quality of the upper bound given by a Transformer processing the entire video clips, using only 28% of frames (10.6 seconds) on average from the beginning.
-  S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, “Sequence to sequence – Video to text,” in Proc. ICCV, Dec. 2015, pp. 4534–4542.
-  S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko, “Translating videos to natural language using deep recurrent neural networks,” in Proc. NAACL HLT, May 2015.
-  L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville, “Describing videos by exploiting temporal structure,” in Proc. ICCV, Dec. 2015, pp. 4507–4515.
-  A. Rohrbach, M. Rohrbach, and B. Schiele, “The long-short story of movie description,” in Proc. GCPR, Oct. 2015, pp. 209–221.
-  Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui, “Jointly modeling embedding and translation to bridge video and language,” in Proc. CVPR, Jun. 2016, pp. 4594–4602.
-  H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu, “Video paragraph captioning using hierarchical recurrent neural networks,” in Proc. CVPR, Jun. 2016, pp. 4584–4593.
-  M. Otani, Y. Nakashima, E. Rahtu, J. Heikkilä, and N. Yokoya, “Learning joint representations of videos and sentences with web image search,” in Proc. ECCV, Oct. 2016, pp. 651–667.
-  D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. ICLR, May 2015.
-  C. Hori, T. Hori, T.-Y. Lee, Z. Zhang, B. Harsham, J. R. Hershey, T. K. Marks, and K. Sumi, “Attention-based multimodal fusion for video description,” in Proc. ICCV, Oct. 2017.
-  S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson, “CNN architectures for large-scale audio classification,” in Proc. ICASSP, Mar. 2017.
-  J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in Proc. CVPR, Jul. 2017.
-  R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles, “Dense-captioning events in videos,” in Proc. ICCV, Oct. 2017, pp. 706–715.
-  Y. Xiong, B. Dai, and D. Lin, “Move forward and tell: A progressive generator of video descriptions,” in Proc. ECCV, Sep. 2018, pp. 468–483.
-  V. Iashin and E. Rahtu, “A better use of audio-visual cues: Dense video captioning with bi-modal transformer,” in Proc. BMVC, 2020.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. NeurIPS, Dec. 2017, pp. 5998–6008.
-  S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang, S. Watanabe, T. Yoshimura, and W. Zhang, “A comparative study on transformer vs RNN in speech applications,” in Proc. ASRU, Dec. 2019.
-  L. Dong, S. Xu, and B. Xu, “Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition,” in Proc. ICASSP, Apr. 2018, pp. 5884–5888.
-  C. Hori, H. Alamri, J. Wang, G. Wichern, T. Hori, A. Cherian, T. K. Marks, V. Cartillier, R. G. Lopes, A. Das et al., “End-to-end audio visual scene-aware dialog using multimodal attention-based video features,” in Proc. ICASSP, May 2019, pp. 2352–2356.
-  H. Alamri, V. Cartillier, A. Das, J. Wang, A. Cherian, I. Essa, D. Batra, T. K. Marks, C. Hori, P. Anderson, S. Lee, and D. Parikh, “Audio visual scene-aware dialog,” in Proc. CVPR, Jun. 2019.
-  K. Cho and M. Esipova, “Can neural machine translation do simultaneous translation?” arXiv preprint arXiv:1606.02012, 2016.
-  J. Gu, G. Neubig, K. Cho, and V. O. Li, “Learning to translate in real-time with neural machine translation,” in Proc. EACL, Apr. 2017.
-  M. Ma, L. Huang, H. Xiong, R. Zheng, K. Liu, B. Zheng, C. Zhang, Z. He, H. Liu, X. Li, H. Wu, and H. Wang, “STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework,” in Proc. ACL, Jul. 2019.
-  F. Dalvi, N. Durrani, H. Sajjad, and S. Vogel, “Incremental decoding and training methods for simultaneous translation in neural machine translation,” in Proc. NAACL HLT, Jun. 2018.
-  N. Arivazhagan, C. Cherry, W. Macherey, C.-C. Chiu, S. Yavuz, R. Pang, W. Li, and C. Raffel, “Monotonic infinite lookback attention for simultaneous machine translation,” in Proc. ACL, Jul. 2019.
-  J. Niehues, N.-Q. Pham, T.-L. Ha, M. Sperber, and A. Waibel, “Low-latency neural speech translation,” in Proc. Interspeech, Sep. 2018, pp. 1293–1297.
-  N. Arivazhagan, C. Cherry, I. Te, W. Macherey, P. Baljekar, and G. Foster, “Re-translation strategies for long form, simultaneous, spoken language translation,” in Proc. ICASSP, May 2020.
-  B. Li, S.-y. Chang, T. N. Sainath, R. Pang, Y. He, T. Strohman, and Y. Wu, “Towards fast and accurate streaming end-to-end ASR,” in Proc. ICASSP, May 2020, pp. 6069–6073.
-  H. Sak, A. Senior, K. Rao, and F. Beaufays, “Fast and accurate recurrent neural network acoustic models for speech recognition,” Proc. Interspeech, pp. 1468–1472, Sep. 2015.
-  T. N. Sainath, R. Pang, D. Rybach, B. Garcıa, and T. Strohman, “Emitting word timings with end-to-end models,” in Proc. Interspeech, Oct. 2020, pp. 3615–3619.
-  J. Yu, C.-C. Chiu, B. Li, S.-y. Chang, T. N. Sainath, Y. He, A. Narayanan, W. Han, A. Gulati, Y. Wu et al., “FastEmit: Low-latency streaming ASR with sequence-level emission regularization,” arXiv preprint arXiv:2010.11148, 2020.
-  Y. Chen, S. Wang, W. Zhang, and Q. Huang, “Less is more: Picking informative frames for video captioning,” in Proc. ECCV, Sep. 2018, pp. 358–373.
-  M. Hosseinzadeh and Y. Wang, “Video captioning of future frames,” in Proc. WACV, Jan. 2021, pp. 980–989.
-  J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” in Proc. NIPS Deep Learning Symposium, 2016.
-  L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong, “End-to-end dense video captioning with masked transformer,” in Proc. CVPR, Jun. 2018, pp. 8739–8748.
-  C. Hori, A. Cherian, T. K. Marks, and T. Hori, “Joint Student-Teacher Learning for Audio-Visual Scene-Aware Dialog,” in Proc. Interspeech, Sep. 2019, pp. 1886–1890.
-  S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko, “Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition,” in Proc. ICCV, Dec. 2013.
-  J. Xu, T. Mei, T. Yao, and Y. Rui, “MSR-VTT: A large video description dataset for bridging video and language,” in Proc. CVPR, Jun. 2016.