Crowd Video Captioning

11/13/2019 · by Liqi Yan, et al.

Describing a video automatically with natural language is a challenging task in the area of computer vision. In most cases, the on-site situation of great events is reported in the news, but the situation of the off-site spectators at the entrances and exits is neglected, although it also arouses people's interest. Since deploying reporters at the entrances and exits costs a lot of manpower, how to automatically describe the behavior of a crowd of off-site spectators is significant and remains an open problem. To tackle this problem, we propose a new task called crowd video captioning (CVC), which aims to describe the crowd of spectators. We also provide baseline methods for this task and evaluate them on the WorldExpo'10 dataset. Our experimental results show that captioning models achieve a fairly deep understanding of the crowd in the video and perform satisfactorily in the CVC task.







I Introduction

With the rapid development of deep neural networks, computers can describe the content of a video in a reasonably deep way. Video captioning has important practicality and wide potential application; a typical one is news broadcasting.

In most great events, there are professional commentators in the stadium to broadcast the situation in real time, and some studies on automatic sports video commentary [1, 2] have been carried out. Outside the venue, the entry and exit of spectators are also important. Reports such as “The line of people snaked into the theater with joy on their faces.” often appear in the news, but deliberately assigning reporters to wait for spectators wastes manpower. Therefore, to report the situation of off-site spectators in real time, we use surveillance cameras to analyze the spectator crowd and generate captions with deep learning methods.

Recently, there have been some works on crowd counting [3, 4, 5] and classification [6, 7, 8, 9], but their outputs are the number of pedestrians, or the state of mobility and abnormal behaviors. None of these works can produce a descriptive sentence about the crowd for a news report.

A caption of a crowd video needs to describe various attributes of the crowd, such as the number of people, the situation of movement, the direction of flow, etc. Therefore, we use a captioning framework to describe the crowd in videos. In major events, a surveillance camera running our system can be directly connected to the news broadcasting system to broadcast the real-time off-site situation to the news media.

Our framework uses convolutional neural networks to extract crowd features, then feeds them into a classifier or a language model to produce the summary. All attributes and situations of the crowd should be included in the descriptive words output by the language model.

To validate our system, we create a crowd video captioning dataset, which is based on the crowd counting dataset: WorldExpo’10. We select some of the videos in this dataset and make captions for them. Several experiments using the proposed models have been carried out to evaluate the performances of those methods.

The main contribution of our work is the proposal of a new task called crowd video captioning (CVC) which aims to generate captions for the crowd video. We provide baselines and a system framework for this task, and the results of the experiments prove the feasibility of our system.

II Related Work

II-A Crowd Counting

In recent years, many models and datasets have been proposed for crowd counting. For example, Chan et al. [3] collected the UCSD dataset at the University of California, San Diego, one of the earliest datasets for crowd counting, and used dynamic texture and Gaussian process methods to count the crowd in videos. After that, many new models have been proposed for this task, such as the CNN of [4] and ACSVP [5] (a GAN-based, U-net-structured model).

WorldExpo’10, proposed in [4], is another large-scale dataset for crowd counting. It includes more than a thousand labeled videos captured by over one hundred surveillance cameras, all at the Shanghai World Expo in 2010. We use it in this research.

II-B Crowd Behaviors Analysis

In addition to counting, research on crowd behavior analysis is also underway. MED [6] was introduced as a crowd emotion dataset, but it has only 31 videos, and the people in the crowd are just walking around or making a few specific movements, such as fighting or hugging.

The newest dataset, Crowd-11, proposed in [8], was provided to classify fine-grained crowd behaviors. It categorizes the flow mainly by the direction of each individual in the crowd. Models including LSTM [7], C3D, VGG [8], ConvLSTM [9] and so on have been used to analyze abnormal crowd behaviors on those datasets. While these systems work efficiently, they have significant disadvantages: the accuracy of fine-grained flow classification is generally low, and they are more suitable for abnormal behavior monitoring.

II-C Fine-grained Video Captioning

There are also some works on fine-grained video captioning, including broadcasting for tennis videos [1] and the Fine-grained Sports Narrative dataset [2]. Models like LSTM-YT [10] and S2VT [11] have been used to complete those tasks, but all of these works target professional sports broadcasting.

III Methods

A conventional video captioning pipeline can be divided into two stages: feature extraction and caption generation.

In the first stage, we can use a model for image classification as a frame feature extractor, or a model for video classification as a video feature extractor. The features extracted by these two methods are different, so a suitable model should be selected for each task.

In the second stage, a sequence-to-sequence model is fed with the extracted features and generates sentences. A language model such as a recurrent neural network (RNN) can be used to construct this decoder.

III-A Frame Feature Extraction

A convolutional neural network (CNN) for image classification can extract the high-level 2D features of the frames: the frames of the video are fed in one after another, and the feature of each frame is taken from the last layer of the model.

Inception V3 [12] is one of those networks, evolved from GoogLeNet [13]. It decomposes a 2D convolution kernel of size n×n into two 1D kernels of sizes n×1 and 1×n respectively. This not only accelerates the computation but also increases the depth of the network.
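The saving from this factorization can be made concrete by counting parameters. The sketch below is illustrative (the channel counts are arbitrary assumptions, not taken from Inception V3 itself): replacing one n×n convolution with an n×1 plus a 1×n convolution roughly halves the weights for n = 3.

```python
# Parameter-count comparison for the n x n -> (n x 1, 1 x n) kernel
# factorization used by Inception V3. Channel counts are illustrative.
def conv_params(kernel_h, kernel_w, in_ch, out_ch):
    # Weights of a single convolution layer (biases ignored).
    return kernel_h * kernel_w * in_ch * out_ch

n, cin, cout = 3, 64, 64
full = conv_params(n, n, cin, cout)                              # one 3x3 conv
factored = conv_params(n, 1, cin, cout) + conv_params(1, n, cout, cout)
print(full, factored)  # 36864 24576
```

For n = 3 the factored pair needs 2n/n² = 2/3 of the parameters; the ratio improves further for larger n.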

In addition to Inception V3, ResNet [14] and Inception V4 [12] are also available for frame feature extraction. Their inputs have a direct influence on the output through skip connections.

III-B Video Feature Extraction

C3D [15] is a network for extracting features from videos. Unlike a 2D convolution applied to a 3D volume, which cannot slide in the temporal domain, a 3D convolution can extract features of the same region across different time periods.

Fig. 1: The output of the C3D model. (Best viewed in color.)

The C3D network has eight convolution layers and five pooling layers. Using Principal Component Analysis (PCA), the output of the last layer (fc7) in C3D is visualized in Fig. 1, where every point represents the feature of its corresponding video, and the 8 different colors represent 8 different classes. Before training (step = 0), points from different categories are mingled together; after training (step = 89), points of the same category cluster together. There are 70 points in the figure, each a feature vector of length 4096.
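The projection used for this visualization can be sketched as follows. The random matrix below is only a stand-in for real fc7 features (an assumption for illustration); the SVD-based PCA reduces each 4096-dimensional video feature to 2-D coordinates for plotting.

```python
import numpy as np

# Illustrative PCA projection of 4096-d features (random stand-ins for
# C3D fc7 outputs) down to 2-D, as used for the visualization in Fig. 1.
rng = np.random.default_rng(0)
feats = rng.normal(size=(70, 4096))        # 70 videos, 4096-d feature each
feats = feats - feats.mean(axis=0)         # center before PCA
# SVD-based PCA: rows of vt are the principal axes.
_, _, vt = np.linalg.svd(feats, full_matrices=False)
proj = feats @ vt[:2].T                    # 2-D coordinates per video
print(proj.shape)  # (70, 2)
```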

III-C Caption Generation

Fig. 2: Our proposed framework for CVC


S2VT [11] is a typical sequence-to-sequence network for generating captions for videos. It is made up of a two-layer recurrent neural network (RNN), with long short-term memory (LSTM) [16] units as the cells of this RNN.

As Fig. 2 shows, S2VT takes the features extracted from the video frames as the input sequence. These feature embeddings are fed into the LSTM cells of the first layer one by one; then, after several iterations, the words of the caption are generated one after another by the second layer.

S2VT generates each word conditioned on the words already generated. It uses <BOS> to indicate the beginning of a sentence and <EOS> as the end-of-sentence tag, and <pad> is used when there is no input at a time step. These tags are all utilized as references to produce the next word.
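This decoding loop can be sketched in a few lines. The `next_word_dist` function below is a hypothetical stand-in for the trained decoder (the real model computes these probabilities from the LSTM hidden state); the loop itself shows the role of the <BOS> and <EOS> tags.

```python
# Minimal sketch of S2VT-style greedy decoding. `next_word_dist` is a
# toy stand-in for the trained decoder: it maps the words generated so
# far to a probability distribution over the vocabulary.
BOS, EOS = "<BOS>", "<EOS>"
VOCAB = ["many", "people", "walk", "in", EOS]

def next_word_dist(prefix):
    # Toy decoder that deterministically emits "many people walk in".
    target = ["many", "people", "walk", "in", EOS]
    step = len(prefix) - 1            # prefix starts with <BOS>
    return {w: (1.0 if w == target[step] else 0.0) for w in VOCAB}

def greedy_decode(max_len=10):
    words = [BOS]
    while len(words) < max_len:
        dist = next_word_dist(words)
        best = max(dist, key=dist.get)   # highest-probability next word
        if best == EOS:                  # stop at the end-of-sentence tag
            break
        words.append(best)
    return words[1:]                     # drop <BOS>

print(greedy_decode())  # ['many', 'people', 'walk', 'in']
```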

IV Details of Models

In this section, we first introduce the definition of our crowd video captioning task and the overview of our frameworks. Then, we describe two alternative models to caption the crowd video.

IV-A Task Definition and Overview

Crowd video captioning aims to describe the attributes of a crowd in natural language, such as the number of people in the crowd, the situation of movement, the direction of flow, etc. Our model can be divided into two parts: an encoder and a decoder. First, we need to identify the crowd in the videos. It is easy to detect the changing areas between frames because the pixels representing pedestrians tend to move together as a whole. Then, frames are randomly selected from a video, and the attributes and situations of the crowd can be extracted from the following features: crowd extent, density, individual movements, pedestrian situations and so on. Our framework employs convolutional neural networks as the feature extractor to obtain these feature embeddings.

The features extracted from the videos include the attributes of the crowd. Therefore, the crowd's features can either be extracted frame by frame by the frame feature extractor and then joined into one sequence, or encoded directly into a vector sequence by the video feature extractor. Finally, this sequence is fed to a classifier or a language model, acting as the caption generator, to produce the description of the crowd.

IV-B Classification Model

If there are n words in the dataset, the number of m-word sentences that can theoretically be formed is n^m. But very few of those randomly generated sentences are grammatical, so we use only the grammatical sentences as the tags that the classifier needs to recognize. In this model, after the video features are extracted by the encoder, a classifier is used as the decoder.

Take a p-category classifier based on C3D as an example: a linear layer appended to the rear of C3D is used as the classifier, as shown in Fig. 3. The linear layer converts the 4096-dimensional feature vectors to p-dimensional outputs z = (z_1, ..., z_p); the probability p_i of outputting label i is then calculated by softmax as follows:

p_i = exp(z_i) / Σ_{j=1}^{p} exp(z_j),

where i = 1, ..., p, and p is the total number of categories. After that, the output value corresponding to the category with the maximal probability is set to 1.
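A minimal sketch of this classification head (the logit values below are made up for illustration): softmax over the p-dimensional linear-layer output, then a one-hot vector marking the maximal-probability category.

```python
import math

# Softmax over the p-dimensional linear-layer output, followed by
# setting the maximal-probability category to 1, as described above.
def softmax(z):
    m = max(z)                                 # subtract max for stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.1, -1.0, 0.5, 0.0, -0.5, 1.5]  # p = 8 categories
probs = softmax(logits)
one_hot = [1 if q == max(probs) else 0 for q in probs]
print(one_hot)  # [1, 0, 0, 0, 0, 0, 0, 0]
```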

Fig. 3: The C3D model and the linear layer following it.

IV-C Captioning Model

Since the classification model can only generate captions within the label set, we adopt the captioning model in our CVC system framework. To caption the video, the captioning model needs to generate natural language sequences. Given a frame feature sequence X = (x_1, ..., x_n), the goal of the captioning model is to generate the proper words Y = (y_1, ..., y_m). The model estimates the probability

P(Y | X) = ∏_{t=1}^{m} P(y_t | y_1, ..., y_{t-1}, X),

where y_1, ..., y_{t-1} represents all the words generated before the t-th word. When the captioning model outputs each word, it chooses the word with the highest probability at that position on the basis of all the words that have been output before.

The overall design of our CVC system framework of captioning model is shown in Fig. 2.

First, the video-frame features or the whole-video feature of the crowd are obtained from the encoder, such as C3D, ResNet or Inception.

After that, we take a language model, such as the S2VT shown in Fig. 2, as the decoder for captioning. In addition to the LSTM, we apply the gated recurrent unit (GRU) [17] as the cell of S2VT, which has fewer parameters than the LSTM and can help avoid overfitting. Research and practical experience show that these two cells each have their own advantages and disadvantages.

The following steps show how the model produces the captions.

First of all, a vocabulary containing all the words in the dataset is built, and every word is encoded into a vector.

Next, the features are input to the model, and the word y_t selected from the vocabulary at each step depends on the hidden state h_t and the network parameters θ:

y_t = argmax_y P(y | h_t; θ).

Finally, given all the hidden states of the network from the beginning of the input sequence to the end of the output sentence, the model optimizes the parameters by maximizing the sum of the log-likelihoods of the generated words as follows:

θ* = argmax_θ Σ_{t=1}^{m} log P(y_t | h_t; θ).
IV-D Loss Function

During training, we use the cross-entropy loss as the loss function, which is defined as follows:

L = − Σ_i y_i log(ŷ_i),

where y is the ideal word embedding created by humans (the ground truth) and ŷ is the computer-generated word distribution. The greater the difference between the predicted results and the ground truth, the higher the gradient of the loss function, and the faster the convergence.
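The effect of this loss can be checked numerically. In the sketch below (the probability values are made up for illustration), a more confident wrong prediction yields a larger loss than a prediction close to the one-hot ground truth.

```python
import math

# Cross-entropy between a one-hot ground-truth word and a predicted
# distribution, as defined above: L = -sum_i y_i * log(y_hat_i).
def cross_entropy(truth, pred):
    return -sum(t * math.log(q) for t, q in zip(truth, pred) if t > 0)

truth = [0.0, 1.0, 0.0]                   # one-hot ground-truth word
good = cross_entropy(truth, [0.1, 0.8, 0.1])  # mostly correct prediction
bad = cross_entropy(truth, [0.7, 0.2, 0.1])   # confident wrong prediction
print(good < bad)  # True
```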

V Experiments

To caption the crowd videos, we select WorldExpo’10 from a series of datasets for crowd analysis. To evaluate our baseline and captioning methods on our dataset, C3D and other feature extractors are chosen as the encoder, and experiments on S2VT with LSTM and GRU cells are carried out.

V-A Dataset

Crowds in most videos of the WorldExpo’10 dataset are messy because this dataset is mainly used for crowd counting. In order to simplify the task, we select 98 videos of it and caption them based on the crowd. These videos are captured by 7 surveillance cameras.

Fig. 4: The composition of our dataset. (Best viewed in color.)

The keywords of the captions in our dataset are shown in Fig. 4; they describe the number of people in the crowd, the situation of movement and the direction of flow respectively. Because “running” accounts for a small percentage of the WorldExpo’10 dataset, it accounts for only 21% of our dataset. We define the direction approaching the camera as “in”, and the direction moving away from the camera as “out”.

The size of the vocabulary is 6, and the number of attribute pairs that make up the descriptive sentences is 3, so 2^3 = 8 captions are formed in our dataset.
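This combinatorial structure can be enumerated directly. The exact pair wordings below are assumptions for illustration (the source only fixes the three attribute pairs and the example caption “Many people walk in”):

```python
from itertools import product

# The 6-word vocabulary forms 3 binary attribute pairs (crowd size,
# movement, flow direction), so 2**3 = 8 captions can be composed.
# The exact wordings of the pairs are assumed here for illustration.
pairs = [("Many", "Few"), ("people walk", "people run"), ("in", "out")]
captions = [" ".join(choice) for choice in product(*pairs)]
print(len(captions))  # 8
```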

V-B Baseline Based on C3D

The first baseline uses C3D directly as an 8-category classifier. Since the words in our dataset can make up 8 sentences, we divide all the videos into eight categories. The label of each category is one of these sentences, such as “Many people walk in”.

The dimension of the last layer in C3D is 4096; we add a linear fully connected layer after it as the classifier. The dimension of the output of this linear layer is the total number of categories, which is 8 in our experiment.

The model is first pre-trained on UCF101 [18]. We then fine-tune the network on our dataset. The split for training, validation and testing is 70:19:9. To fit the C3D model, frames are resized and then randomly cropped to the model’s input size.

The accuracy and loss curves for the training, validation and testing epochs are shown in Fig. 5, where the number of frames input to the model is 16 and the learning rate is divided by 2 every 10 epochs. The curves in Fig. 5 are smoothed with a factor of 0.8, while the original values are shown as faint polylines.

The curves show that during training the loss almost converges to zero, and the accuracy converges quickly as well. The accuracy reaches 0.9714 and 0.6842 on the training and validation sets, but only 0.4444 on the test set. The loss can be reduced to approximately zero on the training set, but not on the validation or test sets.

(a) Accuracy curve for training epochs
(b) Accuracy curve for validation epochs
(c) Accuracy curve for testing epochs
(d) Loss curve for training epochs
(e) Loss curve for validation epochs
(f) Loss curve for testing epochs
Fig. 5: Accuracy and loss curves for different epochs. (Best viewed in color.)

V-C Evaluation Metrics for Captioning

In this section, we introduce several frequently used evaluation metrics for video captioning.

  • BLEU. It is based on modified n-gram precision. To begin with, the modified precision p_n is defined as the candidate n-gram counts clipped by their corresponding reference maximum counts, summed, and divided by the total number of candidate n-grams. Next, supposing c is the length of the candidate translation and r is the effective reference corpus length, we compute the brevity penalty:

    BP = 1 if c > r, and BP = e^(1 − r/c) if c ≤ r.

    Ultimately, we use n-grams from 1-gram up to length N to calculate the BLEU score with weights w_n on the log precisions:

    BLEU = BP · exp(Σ_{n=1}^{N} w_n log p_n).
  • CIDEr. It measures the similarity of a sentence to the majority, or consensus, of how most people describe the image. For instance, a sentence such as “Mike has a baseball and Jenny has a basketball” is more representative of the consensus descriptions than the sentence “Jenny brought a bigger ball than Mike”. CIDEr is proposed to reward sentences with broader consensus.

  • METEOR. It is based on the weighted precision P and recall R of the matched content and function words in hypothesis and reference. A fragmentation penalty is defined to account for differences in word order:

    Pen = γ (ch / m)^β,

    where the chunks ch are series of matches that are contiguous and identically ordered in both sentences, and m is the average number of matched words over hypothesis and reference. After the parameterized harmonic mean

    F_mean = P R / (α P + (1 − α) R)

    of P and R is calculated, the METEOR score is computed as follows:

    METEOR = (1 − Pen) · F_mean.

    It is always smaller than 1 even if the predicted sentences are identical to the references, because Pen never equals zero.

  • ROUGE. It counts the number of overlapping units such as n-gram, word sequences, and word pairs between the predicted captions and the referential summaries.
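The BLEU and METEOR formulas above can be sketched compactly. The METEOR parameter values (α = 0.9, β = 3, γ = 0.5) are the commonly used defaults and an assumption here, as is the uniform BLEU weighting w_n = 1/N.

```python
import math
from collections import Counter

# Sketches of the BLEU and METEOR scores as defined above.

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(cand, ref, max_n=2):
    c, r = len(cand), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)       # brevity penalty
    logs = []
    for n in range(1, max_n + 1):
        c_counts = Counter(ngrams(cand, n))
        r_counts = Counter(ngrams(ref, n))
        clipped = sum(min(k, r_counts[g]) for g, k in c_counts.items())
        p_n = clipped / max(1, sum(c_counts.values()))  # modified precision
        if p_n == 0:
            return 0.0
        logs.append(math.log(p_n) / max_n)              # uniform weights w_n
    return bp * math.exp(sum(logs))

def meteor(precision, recall, chunks, matches,
           alpha=0.9, beta=3.0, gamma=0.5):
    penalty = gamma * (chunks / matches) ** beta        # fragmentation penalty
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    return (1 - penalty) * f_mean

ref = "many people walk in".split()
print(bleu(ref, ref))                         # 1.0 for an exact match
print(meteor(1.0, 1.0, chunks=1, matches=4))  # below 1 even for a perfect match
```

Note that METEOR stays below 1 even for an identical hypothesis, since the fragmentation penalty is never exactly zero.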

V-D Details and Results of the Model Based on S2VT

We use C3D, ResNet-152 and Inception V3 (V4) pre-trained on UCF101 [18] or ImageNet [19] as the feature extractor for every frame of our dataset, then train the S2VT model directly on those features. LSTM and GRU are used as the RNN cell of S2VT. For the best performance, the split for training and testing is adjusted to 45:4. We follow the default experimental settings, set the dimension of the video-frame features to 2048, and set the size of the hidden layers to 512.

Model                BLEU-1  BLEU-2  BLEU-3  BLEU-4  CIDEr  METEOR  ROUGE_L  Accuracy
C3D and LSTM          82.76   76.89   74.24   75.64  63.46   50.30    82.83     0.625
C3D and GRU           92.86   88.84   86.97   90.06  72.53   56.97    91.67     0.75
ResNet152 and LSTM    78.52   69.13   59.37   50.92  47.02   43.24    80.57     0.625
ResNet152 and GRU     92.86   88.84   83.97   81.63  70.44   56.97    92.71     0.75
Incept.v3 and LSTM    86.21   83.54   81.27   80.95  72.07   58.38    86.99     0.75
Incept.v3 and GRU     96.43   95.71   94.34   95.73  81.15   67.81    95.83     0.875
Incept.v4 and LSTM    82.09   81.62   80.68   84.35  68.65   57.90    83.33     0.75
Incept.v4 and GRU     92.86   88.84   83.97   81.63  70.44   56.97    92.71     0.75
TABLE I: Performance of the S2VT model with different RNN cells and features (BLEU, CIDEr, METEOR and ROUGE_L in %; accuracy as a fraction).

We report the BLEU, CIDEr, METEOR, and ROUGE_L captioning scores for this method; the results on the test set are provided in Table I. The learning rate is decayed by a factor of 0.8 every 200 epochs.

Results obtained from S2VT with different RNN cells and features are shown in Table I. For our task, the most appropriate feature extractor is Inception V3, followed by C3D. GRU has fewer parameters than LSTM, so it converges more easily when the dataset is small; this is why GRU works better than LSTM on our dataset.

Fig. 6: The correct and incorrect results generated by captioning models.

The accuracy, defined as the proportion of completely correct sentences among all generated sentences, is calculated for comparison with the results of the C3D method; it is higher than that of the C3D classifier. The two incorrect hypotheses generated by the LSTM from Inception V3 features are mistaken only in the verb and the direction respectively, as shown in Fig. 6. The second failed case is due to several people who interfere with the judgment. Although the sentence structure is not set manually, the output is completely consistent with grammar, such as the singular-plural rule.

V-E Experimental Conclusion

Compared with the baseline, the method based on S2VT outperforms the C3D classifier because it comprehends each word of the sentence separately. This verifies that it is not necessary to know what the specific features represent when extracting them, and that the analysis of crowd features can be handed over directly to the language model.

Moreover, it proves that image classification models such as Inception and ResNet can extract the feature of the crowd in each frame, and the sequence composed of these features can be used for the crowd captioning.

This reflects the power of the video captioning model, especially if more information needs to be described in the summary later. Video captioning models can interpret the temporal information in the features extracted by the convolutional neural network. They can also understand which information each word represents, and assemble the words into a sentence without grammatical mistakes.

VI Conclusion and Future Work

In this paper, we propose a new video captioning task, called crowd video captioning (CVC), that describes crowds of off-site audiences or visitors. In our encoder-decoder system for this task, we use a deep convolutional neural network to extract the features of the crowd video and feed them to a language model for crowd description generation. We create a dataset based on WorldExpo’10. On this dataset, our experiments prove that our CVC system accomplishes this task well and achieves high accuracy. This shows that the language model can deeply comprehend information about the crowd that the feature extractors alone do not capture.

In our approach, S2VT with features extracted by Inception V3 works better than the other methods in the CVC task because our dataset is small and the captions are simple. For future work, these models should be adjusted to fit this fine-grained captioning task for the crowd, and the number of videos and the complexity of the captions in the dataset need to be increased as well.


  • [1] M. Sukhwani and C. V. Jawahar, “TennisVid2Text: Fine-grained descriptions for domain specific videos,” in BMVC, 2015.
  • [2] H. Y. Yu, S. Cheng, B. B. Ni, M. S. Wang, J. Zhang, and X. K. Yang, “Fine-grained video captioning for sports narrative,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6006-6015.

  • [3] A. B. Chan, Z.-S. J. Liang, and N. Vasconcelos, “Privacy preserving crowd monitoring: Counting people without people models or tracking,” in IEEE Conference on Computer Vision Pattern Recognition, 2008, pp. 1-7.
  • [4] C. Zhang, H. Li, X. Wang, and X. Yang, “Cross-scene crowd counting via deep convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 833-841.
  • [5] Z. Shen, Y. Xu, B. Ni, M. Wang, J. Hu, and X. Yang, “Crowd counting via adversarial cross-scale consistency pursuit,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5245-5254.
  • [6] H. R. Rabiee, J. Haddadnia, H. Mousavi, M. Nabi, V. Murino, and N. Sebe, “Emotion-based crowd representation for abnormality detection,” CoRR, vol. abs/1607.07646, 2016.
  • [7] H. Su, Y. Dong, J. Zhu, H. Ling, and B. Zhang, “Crowd scene understanding with coherent recurrent neural networks,” in IJCAI, 2016, vol. 1, p. 2.

  • [8] C. Dupont, L. Tobias, and B. Luvison, “Crowd-11: A dataset for fine grained crowd behaviour analysis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 9-16.
  • [9] Y. Li, “A deep spatiotemporal perspective for understanding crowd behavior,” IEEE Transactions on Multimedia, vol. 20, no. 12, pp. 3289-3297, 2018.
  • [10] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. J. Mooney, and K. Saenko, “Translating videos to natural language using deep recurrent neural networks,” in HLT-NAACL, 2015.
  • [11] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, “Sequence to sequence-video to text,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4534-4542.
  • [12] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818-2826.
  • [13] C. Szegedy et al., “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1-9.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
  • [15] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489-4497.
  • [16] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
  • [17] K. Cho et al., “Learning phrase representations using RNN encoder–decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724-1734.

  • [18] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” CoRR, vol. abs/1212.0402, 2012.
  • [19] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248-255.